An Improved Ensemble Learning Approach for Heart Disease Prediction Using Boosting Algorithms

Shahid Mohammad Ganie¹, Pijush Kanti Dutta Pramanik², Majid Bashir Malik³, Anand Nayyar⁴ and Kyung Sup Kwak⁵,*
¹ School of Business, Woxsen University, Hyderabad, Telangana, 502345, India
² School of Computing Science & Engineering, Galgotias University, Greater Noida, UP 203201, India
³ Department of Computer Sciences, Baba Ghulam Shah Badshah University, Rajouri, 185234, India
⁴ Graduate School, Faculty of Information Technology, Duy Tan University, Da Nang, 50000, Vietnam
⁵ Department of Information and Communication Engineering, Inha University, 22212, Korea
*Corresponding Author: Kyung Sup Kwak. Email: kskwak@inha.ac.kr
Received: 13 August 2022; Accepted: 03 November 2022
Abstract: Cardiovascular disease is among the top five fatal diseases that affect lives worldwide. Therefore, its early prediction and detection are crucial, allowing one to take proper and necessary measures at earlier stages. Machine learning (ML) techniques are used to assist healthcare providers in better diagnosing heart disease. This study employed three boosting algorithms, namely, gradient boost, XGBoost, and AdaBoost, to predict heart disease. The dataset contained heart disease-related clinical features and was sourced from the publicly available UCI ML repository. Exploratory data analysis is performed to find the characteristics of data samples about descriptive and inferential statistics. Specifically, it was carried out to identify and replace outliers using the interquartile range and detect and replace the missing values using the imputation method. Results were recorded before and after the data preprocessing techniques were applied. Out of all the algorithms, gradient boosting achieved the highest accuracy rate of 92.20% for the proposed model. The proposed model yielded better results with gradient boosting in terms of precision, recall, and f1-score. It attained better prediction performance than the existing works and can be used for other diseases that share common features using transfer learning.
Keywords: Heart disease prediction; machine learning classifiers; ensemble approach; XGBoost; AdaBoost; gradient boost
Tech Science Press, CSSE, 2023, vol.46, no.3
DOI: 10.32604/csse.2023.035244
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1 Introduction
Heart disease is considered one of the major hazards to human life globally. According to statistical reports from international healthcare organizations, 17.9 million people (32% of all global deaths) died in 2019 because of cardiovascular diseases, and this number is estimated to increase to 23 million by 2030 [1]. Of all cardiovascular disease deaths, 85% are due to heart disease and stroke. Research studies have estimated that heart disease accounts for 80% of lives lost in economically less developed countries and causes 85% of disabilities [1]. Detecting and predicting heart disease at earlier stages is necessary to significantly reduce premature deaths in the future. The risk and progression of heart-related diseases depend on factors such as age, changes in lifestyle, food habits, and rapidly growing socio-economic causes, such as admission to healthcare centers [2,3]. Other risk factors for heart-related problems include high blood pressure, raised glucose levels, raised blood lipids, obesity, and being overweight.
Computational intelligence techniques need to be explored for better prediction of heart-related diseases so that they can be prevented and cautionary measures can be taken in advance. Furthermore, machine learning (ML) techniques can be extensively explored to support healthcare resources and governance for better patient health services. This will directly benefit hospital management, telemedicine systems, practitioners, healthcare providers, and patients. In this study, we intend to develop a model for better heart disease prediction using Ensemble Learning (EL) techniques. Specifically, considering the criticality of the application, we intend to improve the accuracy and other measures of the model for heart disease prediction. The novel contributions of this work are as follows:
• Preprocessing the data to improve the characteristic assessment of the dataset
• Comparing the results before and after applying preprocessing techniques
• Applying a feature engineering process to identify the contribution of attributes
• Applying boosting algorithms using an ensemble learning approach to increase prediction accuracy
• Comparing the performance of the proposed model with similar research works
The rest of the article is organized as follows. Section 2 mentions the related work. Section 3 presents the
details of the proposed methodology and dataset. Section 4 presents and analyzes the experimental details
and results. Finally, Section 5 describes the conclusion and some future directions.
2 Related Work
Machine/ensemble learning techniques, with their potential to deliver consistent, reliable, and valid results, are used in almost every sphere of life to solve real-life problems [4,5]. Copious work has been done on disease prediction using ML and EL techniques [6]. Researchers have explored different datasets, algorithms, and methodologies for diagnosing cardiovascular disease [7,8]. Some of the important literature is discussed as follows.
Latha et al. [9] experimented with different ensemble techniques, such as bagging, boosting, stacking, and majority voting, using traditional classification algorithms to improve the efficacy of predicting disease risk. They achieved the highest accuracy with majority voting. Theerthagiri et al. [10] explored a gradient boosting algorithm based on recursive feature elimination to predict heart disease from medical parameters such as the patient's age, systolic and diastolic blood pressure, height, weight, smoking, glucose/blood sugar, cholesterol, alcohol intake, and physical workout. Sultan Bin Habib et al. [11] tried different ensemble techniques, such as adaptive boosting (AdaBoost), gradient boosting machine (GBM), light GBM (LGBM), extreme gradient boosting (XGBoost), and category boosting (CatBoost), to predict coronary disease, considering attributes such as gender, age, education, smoking habits, blood pressure, hypertension, diabetes, cholesterol level, Quetelet index, heart rate, glucose level, and chronic heart disease history. They achieved the highest accuracy with XGBoost. Budholiya et al. [12] used an enhanced XGBoost classifier to predict heart disease effectively. One-hot encoding was used to handle categorical features, and Bayesian optimization was used to tune the hyper-parameters to achieve better results. Pan et al. [13] conducted an extensive study of EL techniques for disease prediction on a dataset containing a good mixture of numerical and categorical attributes. The authors observed that combining the support vector machine and AdaBoost with categorical attributes provides better results in predicting heart disease. Pouriyeh et al. [14] developed a framework for the prediction/detection of heart disease by comparing conventional ML techniques with EL methods. The dataset used for this work was taken from the online UCI ML repository. The authors used 10-fold cross-validation to validate the results. The results showed that the support vector machine, in combination with the boosting method, provided the best results, with the highest accuracy rate of 89.12%. Moreover, bagging and stacking techniques, combined with different traditional classifiers, improved the efficacy of the overall results. Deshmukh [15] used an ensemble learning approach for heart disease prediction. The results were compared between majority voting classifiers and the rest of the classifiers. An extra tree classifier was used for the feature selection process. Bagged classifiers with majority voting outperformed other classifiers with the highest accuracy of 87.78%. The author suggested that this work can be extended using optimization procedures and new feature extraction methods. Mary et al. [16] developed a model for heart disease prediction using ten machine learning algorithms. Among all the considered classifiers, the support vector machine yielded the best results, with an accuracy rate of 83.49% on the UCI dataset. The simple card algorithm increased the accuracy and reduced the prediction error rate for other measurements. Different metrics were evaluated to validate the proposed framework. The authors suggested that a hybrid approach can be used to extend the existing work for better prediction. Alqahtani et al. [17] proposed a framework for cardiovascular disease prediction using ensemble learning and deep learning techniques. In the experiment, the random forest algorithm achieved the highest accuracy (88.65%), precision (90.03), recall (88.03), f1-score (88.02), and ROC-AUC value (92). Furthermore, feature importance was calculated to measure the risk of being involved in this disease in the future. Kondababu et al. [18] built a model by comparing different machine learning techniques for heart disease prediction. Seven machine learning classifiers were considered for comparative and performance analyses. Of all the classifiers, the hybrid random forest with linear model (HRFLM) produced the best results, with an accuracy rate of 88.4%. No data preprocessing technique was used to improve the output of the proposed model. The authors suggested that this work can be extended in the future using large datasets and a diverse mixture of machine learning techniques.
Most of the work mentioned above did not sufficiently exploit data preprocessing before developing the ensemble learning models, which resulted in inadequate outputs. Therefore, we felt the need to utilize exploratory data analysis to improve the data quality required for the prediction model. Furthermore, data normalization and standardization were missing in most of the existing literature, although these approaches play crucial roles in achieving higher prediction performance.
3 Research Methodology
Fig. 1 depicts the methodology adopted for this experimental study. It presents the procedural steps executed for the early prediction of disease using various ensemble learning techniques. A publicly available heart disease dataset was imported into the web-based Jupyter Notebook (an open-source platform) for the experimental process. The required library packages, such as Sklearn, were installed and used with the Python programming language. Initially, the boosting classifiers were applied without data preprocessing to predict the disease. After exploratory data analysis, we found that data preprocessing can play an important role in attaining better results. In the preprocessing phase, missing values are identified and replaced using the data imputation method. The interquartile range method is used to detect and replace outliers present in the dataset. Additional library routines are also run to check for corrupted data, if any, in the dataset. The dataset is split in a 70:30 ratio, where 70% is used to train the models and 30% to test them. To validate the results, k-fold cross-validation (k = 10) is applied. Finally, the three considered boosting algorithms are applied after data preprocessing to obtain the desired results.
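The split-and-validate procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the data are synthetic (generated with make_classification), the hyper-parameters and random seeds are arbitrary, and only the two boosting classifiers that ship with scikit-learn are shown. XGBoost is omitted because it lives in the separate xgboost package.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the heart disease data (13 predictors, binary target).
X, y = make_classification(n_samples=400, n_features=13, n_informative=6,
                           random_state=0)

# 70:30 train/test split; stratify keeps the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

# Hyper-parameters are illustrative assumptions, not the paper's settings.
models = {
    "ADB": AdaBoostClassifier(n_estimators=50, random_state=0),
    "GB": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    # 10-fold cross-validation on the training part only.
    cv_scores = cross_val_score(model, X_train, y_train, cv=10,
                                scoring="accuracy")
    model.fit(X_train, y_train)
    results[name] = (cv_scores.mean(), model.score(X_test, y_test))
    print(f"{name}: CV accuracy={results[name][0]:.3f}, "
          f"test accuracy={results[name][1]:.3f}")
```

Running the cross-validation only on the training portion keeps the 30% test set untouched until the final evaluation, which is one common way to combine a hold-out split with k-fold validation.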
3.1 Techniques Used
Ensemble learning techniques are explored in almost every field to solve real-life problems [19]. These models have made significant progress in the prediction, detection, diagnosis, and prognosis of different diseases. In this study, for heart disease prediction, we considered the following three ensemble-learning-based boosting algorithms [6]:
• Gradient boosting: The weak learners are trained sequentially, and estimators are added to the ensemble one at a time. Each new estimator is fitted to the residual errors of the previous estimators, so the algorithm attempts to minimize the difference between the predicted and actual values.
• AdaBoost: AdaBoost works by adjusting the weights without prior knowledge of the weak learners. The base learner's weakness is measured by the estimator's error rate while training the models. Decision tree stumps are widely used with the AdaBoost algorithm to solve classification and regression problems.
• XGBoost: XGBoost works by combining different kinds of decision trees (weak learners) to calculate the similarity scores independently. It helps to overcome the problem of overfitting during the training phase by adapting the gradient descent and regularization process.
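The residual-fitting idea behind gradient boosting can be illustrated with a short hand-rolled sketch. It assumes a squared-error loss on a synthetic regression problem, decision stumps as weak learners, and an arbitrary learning rate of 0.1. It is meant only to show the mechanism, not to reproduce any library implementation or the configuration used in this study.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-dimensional regression problem (illustrative data).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1                       # assumed shrinkage factor
prediction = np.full_like(y, y.mean())    # start from a constant model
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(50):
    # For squared loss, the negative gradient is the residual
    # (up to a constant factor).
    residuals = y - prediction
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
    # Add the new weak learner's shrunken output to the ensemble.
    prediction += learning_rate * stump.predict(X)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"MSE before boosting: {mse_history[0]:.4f}, "
      f"after 50 stumps: {mse_history[-1]:.4f}")
```

Each stump predicts the mean residual in its two leaves, so adding a shrunken copy of it never increases the training error. This is why the recorded error falls steadily over the rounds.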
3.2 Dataset Selection
For the experiment, we used the popular dataset on heart disease, openly available in the machine learning repository¹ at the University of California Irvine (UCI). The dataset is rich in clinical features related to heart disease, covering a wide demography. Thus, it has been one of the most popular choices for researchers.
3.3 Attribute Information
The dataset contains 1329 instances and 14 attributes, where the first 13 attributes are predictor/independent variables, and the last one is the dependent/target variable. The attributes are described in Table 1. The table presents the considered attributes, their descriptions, their measurements, and their value ranges.
Figure 1: Proposed methodology for research work
¹ https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset.
3.4 Dataset Description
Descriptive statistics play a vital role in identifying the data characteristics. They summarize the data so that it becomes easier for humans to interpret. Table 2 describes the statistical measurements of the clinical attributes, such as the count of records, minimum (min) value, maximum (max) value, mean, and standard deviation (Std). For example, the age attribute has a mean of 54.41 and a standard deviation of 9.07, and the minimum and maximum ages are 29 and 77 years, respectively. These statistical measurements are also calculated for the rest of the attributes.
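A summary of this kind can be produced with pandas. The sketch below is illustrative only: the five-row sample is hypothetical, and only the column names are borrowed from the attribute list.

```python
import pandas as pd

# Hypothetical mini-sample; column names mirror the paper's attributes.
df = pd.DataFrame({
    "Age": [29, 45, 54, 63, 77],
    "RestingBP": [94, 120, 130, 145, 200],
    "Cholesterol": [126, 210, 246, 300, 564],
})

# describe() also reports quartiles; keep only the measures used in Table 2.
summary = df.describe().loc[["count", "mean", "std", "min", "max"]].T
print(summary)
```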
3.5 Class Balance
Machine/ensemble learning models provide poor results if the dataset used is not balanced for the
problem statement. In some situations, if the target class is not equally distributed, then some sampling
techniques can be used to make a balanced dataset. The dataset for this experiment contains a good
mixture of classes, where class 1 is heart disease (691 instances) and class 0 is no heart disease
(637 instances), as shown in Fig. 2.
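The balance of the two classes can be checked with a few lines of pandas. The sketch below rebuilds the outcome column from the class counts reported above, which is an illustrative shortcut rather than a load of the actual dataset.

```python
import pandas as pd

# Outcome column reconstructed from the reported class counts.
outcome = pd.Series([1] * 691 + [0] * 637, name="Outcome")

counts = outcome.value_counts()                 # instances per class
shares = outcome.value_counts(normalize=True)   # class proportions
imbalance_ratio = counts.max() / counts.min()   # majority / minority

print(counts.to_dict())
print(f"share of class 1: {shares[1]:.4f}, imbalance ratio: {imbalance_ratio:.3f}")
```

The share of class 1 is 691/1328, or about 0.52, which agrees with the mean of the Outcome attribute in Table 2. The ratio between the two classes stays close to 1, so no resampling is needed.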
Table 1: Attributes information of the dataset
Attribute | Description | Measurement | Value range
Age | Age of an individual | Years | 29 to 77
Sex | Gender of an individual | 1 = male, 0 = female | 0 or 1
Cpericarditis | Degree of chest pain | Low, moderate, high, extremely high | 0 to 3
RestingBP | Blood pressure of an individual while at rest (inactive) | mm Hg | 94 to 200
Cholesterol | Level of serum cholesterol | mg/dl | 126 to 256
FastingBP | Glucose level in an empty stomach (fasting) | Greater than 120 mg/dl (1 = true, 0 = false) | 0 or 1
RestingECG | Resting electrocardiographic results of an individual while inactive | 0 = normal, 1 = having ST | 0 to 2
MaximumHR | Highest heart rate recorded | Beats per minute | 71 to 202
ExerciseIA | Doing exercise with angina disease | 1 = yes, 0 = no | 0 or 1
Oldpeak | ST depression, caused by doing exercise in comparison to being inactive | Numeric value | Relative
Slope | Slope of the old peak value in the ST segment while an individual is doing an exercise | 0 = downsloping, 1 = flat, 2 = upsloping | 0 to 2
Ca | No. of major vessels colored by fluoroscopy | Numeric | 0 to 3
Thal | Whether an individual has thalassemia or not | 3 = normal, 6 = fixed defect, 7 = reversible defect | 3 to 7
Outcome | Class attribute | 0 = no heart disease, 1 = heart disease | 0 or 1
3.6 Histogram of Dataset
A histogram is used to visualize and interpret the distribution of data samples. A histogram can be uniform, normal, left-skewed, or right-skewed. Fig. 3 depicts the normally distributed histograms, which group the values of each attribute within its value range. The x-axis represents the values of the attribute, and the y-axis represents the number of samples falling in each bin.
3.7 Boxplot of Dataset
Fig. 4 shows the boxplots of each attribute in the dataset. The interquartile range method, using the probability density function, was used to handle the outliers in the dataset. For example, in fasting blood pressure, a single outlier was detected, whereas in resting blood pressure, multiple outliers were detected; these were replaced using the z-score method.
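The interquartile range rule for flagging outliers can be sketched as below. The fence multiplier k = 1.5 is the conventional boxplot value, and the replacement rule (the median of the non-outlying values) is an assumption made for illustration. The helper name replace_outliers_iqr and the sample readings are hypothetical.

```python
import numpy as np

def replace_outliers_iqr(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR] and replace them."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    is_outlier = (values < lower) | (values > upper)
    cleaned = values.copy()
    # Assumed replacement rule: median of the values that are not outliers.
    cleaned[is_outlier] = np.median(values[~is_outlier])
    return cleaned, is_outlier

# Hypothetical resting blood pressure readings with one extreme value.
resting_bp = [110, 120, 125, 130, 135, 140, 200]
cleaned, mask = replace_outliers_iqr(resting_bp)
print(cleaned, mask)
```

For this sample, Q1 = 122.5 and Q3 = 137.5, so the fences are 100 and 160. Only the reading of 200 falls outside them, and it is replaced by 127.5.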
Table 2: Dataset description
Attributes | Count | Mean | Std | Min | Max
Age | 1328 | 54.41 | 9.07 | 29 | 77
Sex | 1328 | 0.69 | 0.46 | 0 | 1
Cpericarditis | 1328 | 0.94 | 1.02 | 0 | 3
RestingBP | 1328 | 131.61 | 17.51 | 94 | 200
Cholesterol | 1328 | 246.06 | 51.62 | 126 | 564
FastingBP | 1328 | 0.14 | 0.35 | 0 | 1
RestingECG | 1328 | 0.52 | 0.52 | 0 | 2
MaximumHR | 1328 | 149.23 | 22.97 | 71 | 202
ExerciseIA | 1328 | 0.33 | 0.47 | 0 | 1
Oldpeak | 1328 | 1.06 | 1.17 | 0 | 1
Slope | 1328 | 1.38 | 0.61 | 0 | 2
Ca | 1328 | 0.74 | 1.02 | 0 | 4
Thal | 1328 | 2.32 | 0.61 | 0 | 3
Outcome | 1328 | 0.52 | 0.49 | 0 | 1
Figure 2: Instances of the outcome variable (0: no heart disease, 637 instances; 1: heart disease, 691 instances)
3.8 Correlation Coefficient Analysis
The correlation coefficient analysis (CCA) method is used to identify and plot the relationships among the dataset's attributes [20]. A dataset is considered good if a strong association exists between the independent and dependent attributes. Fig. 5 presents the CCA of all attributes used to predict disease; the coefficients range from +1 to -1, with the attributes listed along both the x-axis and the y-axis. The cell value indicates the degree of relationship between the intersecting attributes. For example, the relationship value between resting blood pressure and age is 0.12.
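A correlation matrix of this kind can be computed with pandas. The sketch below assumes Pearson correlation and uses synthetic columns whose names are borrowed from the attribute list. The generated relationships are invented for illustration and do not reproduce the values in Fig. 5.

```python
import numpy as np
import pandas as pd

# Synthetic attributes with a built-in dependence on age (assumed values).
rng = np.random.default_rng(0)
age = rng.normal(54, 9, size=500)
df = pd.DataFrame({
    "Age": age,
    "RestingBP": 110 + 0.4 * age + rng.normal(0, 17, size=500),
    "MaximumHR": 200 - 0.9 * age + rng.normal(0, 20, size=500),
})

# Pairwise Pearson coefficients; every entry lies in [-1, +1].
corr = df.corr(method="pearson")
print(corr.round(2))
```

The matrix is symmetric with ones on the diagonal, so each off-diagonal cell can be read as the strength and direction of the linear relationship between the two intersecting attributes.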
4 Experiment, Results, and Discussion
This section discusses the experimental details and the results achieved using boosting algorithms for heart disease prediction. All results obtained after implementing the proposed framework are shown and analyzed systematically. The results are presented in two parts: before preprocessing and after preprocessing. The evaluation is discussed in terms of performance metrics such as precision, recall, f1-score, the receiver operating characteristic (ROC) curve, and the execution time of the considered boosting algorithms.
Figure 3: Histogram of attributes
Figure 4: Boxplot of attributes
Figure 5: Correlation coefficient matrix
4.1 Data Preprocessing
Data preprocessing is vital for developing a robust and reliable system before applying ML methods to the model [21]. In this work, missing values were identified and replaced by the data imputation method. Initially, we used the isnull() method to detect all the missing values and then applied mean and mode imputation with SimpleImputer() to fill in these missing values. This process replaces each missing value with the column's mean, median, or mode. Outliers were detected and replaced using the interquartile range method, and the Z-score technique was used to shift the distribution of all the data samples so that the mean becomes 0.
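The imputation and standardization steps can be sketched with pandas and scikit-learn as follows. The four-row frame is hypothetical. The choice of mean imputation for the numeric column and mode imputation for the coded categorical column is an assumption about how the two strategies were divided. StandardScaler is used here as one way to obtain Z-scores.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical sample with one missing value per column.
df = pd.DataFrame({
    "Cholesterol": [210.0, np.nan, 246.0, 300.0],   # numeric attribute
    "Sex": [1.0, 0.0, np.nan, 1.0],                 # coded categorical attribute
})
print(df.isnull().sum())    # count of missing values per column

# Mean imputation for the numeric column, mode for the categorical one.
mean_imputer = SimpleImputer(strategy="mean")
mode_imputer = SimpleImputer(strategy="most_frequent")
df["Cholesterol"] = mean_imputer.fit_transform(df[["Cholesterol"]]).ravel()
df["Sex"] = mode_imputer.fit_transform(df[["Sex"]]).ravel()

# Z-score: subtract the column mean and divide by its standard deviation.
z = StandardScaler().fit_transform(df[["Cholesterol"]])
print(df)
print(z.ravel().round(3))
```

The missing cholesterol value becomes 252, the mean of 210, 246, and 300. The missing sex code becomes 1, the most frequent value. After scaling, the cholesterol column has mean 0 and unit standard deviation.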
4.2 Hardware/Software Specification and Computational Time
An HP Z60 workstation was used to carry out this research work. The hardware specification of the system is as follows: Intel Xeon 2.4 GHz CPU (12 cores), 4 GB RAM, 1 TB hard disk, and Windows 10 Pro 64-bit. On this machine, the ADB, XGB, and GB algorithms took 4.23, 3.57, and 4.51 seconds, respectively, to execute. On the software side, the implementation used the graphical user interface-based Anaconda Navigator, the web-based computing platform Jupyter Notebook, and Python as the programming language.
4.3 Accuracy of Classifiers
The testing accuracy of the boosting algorithms is shown in Fig. 6. The algorithms employed in this work are XGBoost (XGB), AdaBoost (ADB), and gradient boosting (GB). Without preprocessing, the accuracies of XGB, ADB, and GB are 87.50%, 81.50%, and 86%, respectively. After applying preprocessing techniques, GB outperformed the other boosting algorithms by obtaining the highest accuracy rate of 92.20%, followed by XGB and ADB, both at 89.61%.
4.4 Other Measurements
The precision, recall, and f1-score of the three considered classifiers were calculated before and after data preprocessing. The values were calculated for both classes (0: no heart disease, 1: heart disease), as shown in Figs. 7 and 8. Without preprocessing, XGB performed best on all the measurements, whereas ADB performed the worst. With preprocessing, GB performed best on most measurements, whereas XGB and ADB achieved more or less the same results.
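How these per-class measurements follow from a confusion matrix can be shown with a small worked example. The ten labels and predictions below are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Hypothetical ground truth and predictions (0: no disease, 1: disease).
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_1 = tp / (tp + fp)     # how many predicted positives are correct
recall_1 = tp / (tp + fn)        # how many actual positives are found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# The same per-class values computed by scikit-learn for classes 0 and 1.
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1]
)
print(accuracy, precision_1, recall_1, round(f1_1, 4))
print(precision, recall, f1)
```

Here TP = 4, FP = 1, FN = 2, and TN = 3, so class 1 has precision 4/5 and recall 4/6, while class 0 has precision 3/5 and recall 3/4.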
Classifier | Accuracy before preprocessing | Accuracy after preprocessing
XGB | 87.50% | 89.61%
ADB | 81.50% | 89.61%
GB | 86.00% | 92.20%
Figure 6: Classification accuracy
4.5 Feature Importance
Feature importance is a process to calculate the score of the input features (independent predictor variables) based on their contribution to predicting the output feature (dependent/target variable) [22]. It plays an important role in developing machine/ensemble learning models to improve prediction results. In this work, the feature importance score (F-score) represents the number of times an attribute is used for splitting in the training process. A higher F-score of a feature (e.g., cholesterol) indicates that it is an important attribute. Fig. 9 shows the contribution of all the attributes toward prediction in descending order of their F-score. For example, cholesterol has the highest significance in the prediction, whereas fasting blood pressure has the lowest.
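A split-count score of the kind described above can be sketched by walking the fitted trees of a scikit-learn gradient boosting model and counting how often each feature is used at an internal node. The data are synthetic, the feature names f0 to f4 are placeholders, and the model settings are arbitrary assumptions. This illustrates the counting idea and is not the procedure behind Fig. 9.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data: with shuffle=False the informative features come first.
X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
feature_names = ["f0", "f1", "f2", "f3", "f4"]   # placeholder names

model = GradientBoostingClassifier(n_estimators=30, max_depth=2,
                                   random_state=0).fit(X, y)

# Count how many internal nodes split on each feature across all trees.
split_counts = np.zeros(X.shape[1], dtype=np.int64)
for tree in model.estimators_.ravel():
    node_features = tree.tree_.feature        # negative values mark leaves
    split_counts += np.bincount(node_features[node_features >= 0],
                                minlength=X.shape[1])

# Rank the features in descending order of their split count.
ranking = sorted(zip(feature_names, split_counts.tolist()),
                 key=lambda pair: pair[1], reverse=True)
print(ranking)
```

Split counts differ from the impurity-based feature_importances_ attribute that scikit-learn reports, so the two rankings need not agree.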
Classifier | Class 0 (negative): Precision | Recall | F1-score | Class 1 (positive): Precision | Recall | F1-score
XGB | 0.82 | 0.94 | 0.87 | 0.94 | 0.82 | 0.88
ADB | 0.82 | 0.83 | 0.82 | 0.81 | 0.80 | 0.80
GB | 0.82 | 0.91 | 0.86 | 0.90 | 0.82 | 0.86
Figure 7: Other measurements before preprocessing
Classifier | Class 0 (negative): Precision | Recall | F1-score | Class 1 (positive): Precision | Recall | F1-score
XGB | 0.89 | 0.96 | 0.92 | 0.91 | 0.78 | 0.84
ADB | 0.90 | 0.95 | 0.92 | 0.89 | 0.79 | 0.84
GB | 0.93 | 0.95 | 0.94 | 0.89 | 0.86 | 0.88
Figure 8: Other measurements after preprocessing
Figure 9: Feature importance for prediction
4.6 ROC Curve
The receiver operating characteristic (ROC) curve is used to show the prediction capability of the considered boosting algorithms at different thresholds. It plots the false-positive rate on the x-axis against the true-positive rate on the y-axis. Using the ROC curve, we analyzed how well our models distinguish between the classes (0: no heart disease, 1: heart disease). A higher ROC curve means that the model separates the 0s from the 1s better. An area under the curve (AUC) near 1 indicates good separability, whereas an AUC near 0 indicates the worst separability, meaning that the model assigns the classes the wrong way round. When the AUC is 0.5, the model cannot separate the classes at all. The ROC curves for ADB, XGB, and GB are shown in Figs. 10–12, respectively. From the figures, we conclude that XGB performs best, followed by GB and ADB.
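The three AUC regimes described above can be reproduced on a toy example with scikit-learn. The four labels and scores are hypothetical, chosen so that the result can be verified by hand.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical labels and predicted scores for the positive class.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Points of the ROC curve: one (FPR, TPR) pair per decision threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

# Reversing the scores inverts the ranking; constant scores carry no signal.
auc_inverted = roc_auc_score(y_true, [1 - s for s in y_score])
auc_constant = roc_auc_score(y_true, [0.5, 0.5, 0.5, 0.5])

print(list(zip(fpr, tpr)))
print(auc, auc_inverted, auc_constant)
```

Three of the four positive/negative pairs are ranked correctly, which gives an AUC of 0.75. Inverting the scores gives 0.25, and constant scores give 0.5.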
Figure 10: ROC curve for ADB
Figure 11: ROC curve for XGB
4.7 Comparative Analysis
The performance of the proposed framework has been compared with several relevant studies in terms of techniques used, dataset, and accuracy, as shown in Table 3. The proposed framework yielded good results across the evaluation metrics, particularly accuracy in predicting heart disease. Techniques such as data imputation for handling missing values and the detection and replacement of outliers using the boxplot method were used to achieve better results than other related works.
Figure 12: ROC curve for GB
Table 3: Comparison of the proposed work with existing similar works
Research work | Ensemble techniques adopted | Dataset used | Highest accuracy
[11] | XGB, ADB, GBM, LGBM, and CatBoost | Framingham heart disease dataset (publicly available) | 87.62% with XGB
[9] | Boosting, bagging, stacking, and majority vote | Cleveland heart disease dataset (publicly available) | 85.48% with majority vote
[10] | Recursive feature elimination and GB | Do | 89.78%
[12] | XGB with Bayesian optimization | Do | 91.80%
[14] | CatBoost, GB, XGB, and ADB | Do | 83.60% with ADB
[17] | DNN, KDNN, XGB, KNN, decision tree, and random forest | Do | 88.65% with random forest
[18] | Naïve Bayes, linear model, logistic regression, decision tree, random forest, SVM, and HRFLM | Do | 88.40% with HRFLM
Our method | XGB, ADB, and GB | Do | 92.20% with GB
5 Conclusion
This study applied boosting algorithms to predict heart disease effectively. Different preprocessing methods, such as imputation, Z-score, and cleaning methods, were employed to improve the dataset's quality and the prediction results. Three boosting algorithms, namely, XGBoost, AdaBoost, and gradient boosting, were executed before and after applying the preprocessing techniques. The experimental results were assessed using different statistical/ML measurements and revealed that gradient boosting achieved the highest accuracy rate of 92.20%. The gradient boosting algorithm also achieved better results for other metrics, such as precision, recall, and f1-score. Finally, the feature importance process was employed to calculate the contribution of the independent features toward the final outcome.
Other ensemble learning techniques, such as bagging and stacking, can be used to improve the efficacy of this work. The proposed method can be applied to other healthcare datasets that share common features to extend the scope of this research work. Deep learning techniques can also be explored to detect and predict cardiovascular diseases better.
Funding Statement: This work was supported by National Research Foundation of Korea-Grant funded by
the Korean Government (MSIT)-NRF-2020R1A2B5B02002478.
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
References
[1] WHO, Cardiovascular diseases (CVDs),11
th
June, 2021. [Online]. Available: https://www.who.int/news-room/
fact-sheets/detail/cardiovascular-diseases-(cvds) (accessed on 6 July 2022).
[2] Y. Ruan, Y. Guo, Y. Zheng, Z. Huang, S. Sun et al., Cardiovascular disease (CVD) and associated risk factors
among older adults in six low-and middle-income countries: Results from SAGE Wave 1,BMC Public Health,
vol. 18, no. 1, pp. 113, 2018.
[3] M. -H. Biglu, M. Ghavami and S. Biglu, Cardiovascular diseases in the mirror of science,Journal of
Cardiovascular and Thoracic Research, vol. 8, no. 4, pp. 158163, 2016.
[4] S. M. Ganie, M. B. Malik and T. Arif, Early prediction of diabetes mellitus using various articial intelligence
techniques: A technological review,International Journal of Business Intelligence and Systems Engineering,
vol. 1, no. 4, pp. 122, 2021.
[5] J. Alzubi, A. Nayyar and A. Kumar, Machine learning from theory to algorithms: An overview,Journal of
Physics: Conference Series, vol. 1142, no. 1, pp. 012012, 2018.
[6] S. M. Ganie, M. B. Malik and T. Arif, Performance analysis and prediction of type 2 diabetes mellitus based on
lifestyle data using machine learning approaches,Journal of Diabetes & Metabolic Disorders, vol. 21, no. 1, pp.
339352, 2022.
[7] N. Nissa, S. Jamwal and S. Mohammad, Early detection of cardiovascular disease using machine learning
techniques an experimental study,International Journal of Recent Technology and Engineering, vol. 9, no. 3,
pp. 635641, 2020.
[8] S. Jamwal and S. M. Najmu Nissa, Heart disease prediction using machine learning,Lecture Notes in Networks
and Systems, vol. 203, no. 67, pp. 653665, 2021.
[9] C. B. C. Latha and S. C. Jeeva, Improving the accuracy of prediction of heart disease risk based on ensemble
classication techniques,Informatics in Medicine Unlocked, vol. 16, no. November 2018, pp. 100203, 2019.
[10] P. Theerthagiri and J. Vidya, Cardiovascular disease prediction using recursive feature elimination and gradient
boosting classication techniques,CoRR, vol. abs/2106.0, 2021. [Online]. Available: https://arxiv.org/abs/2106.
08889
CSSE, 2023, vol.46, no.3 4005
[11] A. Z. Sultan Bin Habib, T. Tasnim and M. M. Billah, “A study on coronary disease prediction using boosting-based ensemble machine learning approaches,” in Proc. ICIET 2019, Dhaka, Bangladesh, pp. 23–24, 2019.
[12] K. Budholiya, S. K. Shrivastava and V. Sharma, “An optimized XGBoost based diagnostic system for effective prediction of heart disease,” Journal of King Saud University-Computer and Information Sciences, vol. 34, no. 7, pp. 4514–4523, 2022.
[13] C. Pan, A. Poddar, R. Mukherjee and A. K. Ray, “Impact of categorical and numerical features in ensemble machine learning frameworks for heart disease prediction,” Biomedical Signal Processing and Control, vol. 76, no. April, pp. 103666, 2022.
[14] S. Pouriyeh, S. Vahid, G. Sannino, G. De Pietro, H. Arabnia et al., “A comprehensive investigation and comparison of machine learning techniques in the domain of heart disease,” in Proc. IEEE Symp. Computers and Communications, Heraklion, Greece, pp. 204–207, 2017.
[15] V. M. Deshmukh, “Heart disease prediction using ensemble methods,” International Journal of Recent Technology and Engineering, vol. 8, no. 3, pp. 8521–8526, 2019.
[16] N. Mary, B. Khan, A. A. Asiri, F. Muhammad, S. Alqhtani et al., “Investigating of classification algorithms for heart disease risk prediction,” Journal of Intelligent Medicine and Healthcare, vol. 1, no. 1, pp. 11–31, 2022.
[17] A. Alqahtani, S. Alsubai, M. Sha, L. Vilcekova and T. Javed, “Cardiovascular disease detection using ensemble learning,” Computational Intelligence and Neuroscience, vol. 2022, no. 3, pp. 1–9, 2022.
[18] A. Kondababu, V. Siddhartha, B. H. K. Bhagath Kumar and B. Penumutchi, “A comparative study on machine learning based heart disease prediction,” Materials Today: Proceedings, [in press], 2021.
[19] S. M. Ganie and M. B. Malik, “An ensemble machine learning approach for predicting type-II diabetes mellitus based on lifestyle indicators,” Healthcare Analytics, vol. 2, no. 1, pp. 100092, 2022.
[20] A. Hussain and S. Naaz, “Prediction of diabetes mellitus: Comparative study of various machine learning models,” in Int. Conf. on Innovative Computing and Communications. Advances in Intelligent Systems and Computing, vol. 1166. Singapore: Springer, pp. 103–115, 2021.
[21] A. Jazayeri, O. S. Liang and C. C. Yang, “Imputation of missing data in electronic health records based on patients’ similarities,” Journal of Healthcare Informatics Research, vol. 4, no. 3, pp. 295–307, 2020.
[22] D. Dutta, D. Paul and P. Ghosh, “Analysing feature importances for diabetes prediction using machine learning,” in Proc. IEMCON 2018, Vancouver, Canada, pp. 924–928, 2019.