
An Improved Ensemble Learning Approach for Heart Disease Prediction Using Boosting Algorithms

April 2023
Computer Systems Science and Engineering 46(3):3993-4006


DOI:10.32604/csse.2023.035244

Authors:

Shahid Mohammad Ganie

Woxsen University

Pijush Kanti Dutta Pramanik

Galgotias University

Majid Bashir Malik

Baba Ghulam Shah Badshah University, Rajouri, J & K

Anand Nayyar

Duy Tan University



Available via license: CC BY 4.0


An Improved Ensemble Learning Approach for Heart Disease Prediction Using Boosting Algorithms

Shahid Mohammad Ganie, Pijush Kanti Dutta Pramanik, Majid Bashir Malik, Anand Nayyar and Kyung Sup Kwak

School of Business, Woxsen University, Hyderabad, Telangana, 502345, India

School of Computing Science & Engineering, Galgotias University, Greater Noida, UP 203201, India

Department of Computer Sciences, Baba Ghulam Shah Badshah University, Rajouri, 185234, India

Graduate School, Faculty of Information Technology, Duy Tan University, Da Nang, 50000, Vietnam

Department of Information and Communication Engineering, Inha University, 22212, Korea

*Corresponding Author: Kyung Sup Kwak. Email: kskwak@inha.ac.kr

Received: 13 August 2022; Accepted: 03 November 2022

Abstract: Cardiovascular disease is among the top five fatal diseases that affect lives worldwide. Therefore, its early prediction and detection are crucial, allowing one to take proper and necessary measures at earlier stages. Machine learning (ML) techniques are used to assist healthcare providers in better diagnosing heart disease. This study employed three boosting algorithms, namely, gradient boost, XGBoost, and AdaBoost, to predict heart disease. The dataset contained heart disease-related clinical features and was sourced from the publicly available UCI ML repository. Exploratory data analysis was performed to characterize the data samples in terms of descriptive and inferential statistics. Specifically, it was carried out to identify and replace outliers using the interquartile range and to detect and replace the missing values using the imputation method. Results were recorded before and after the data preprocessing techniques were applied. Out of all the algorithms, gradient boosting achieved the highest accuracy rate of 92.20% for the proposed model. The proposed model also yielded better results with gradient boosting in terms of precision, recall, and f1-score. It attained better prediction performance than the existing works and can be used for other diseases that share common features using transfer learning.

Keywords: Heart disease prediction; machine learning classifiers; ensemble approach; XGBoost; AdaBoost; gradient boost

This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1 Introduction

Heart disease is considered one of the major hazards to human life globally. According to the statistical reports of different international healthcare organizations, 17.9 million people (32% of all global deaths) died of cardiovascular diseases in 2019, and this figure is estimated to rise to 23 million by 2030 [1]. Of all cardiovascular disease deaths, 85% are due to heart disease and stroke. Research studies have estimated that heart disease accounts for 80% of lives lost in less economically developed countries and causes 85% of disabilities [1]. Detecting and predicting heart disease at earlier stages is necessary to significantly reduce premature deaths in the future. The risk and progression of heart-related diseases depend on factors such as age, changes in lifestyle, food habits, and rapidly growing socio-economic causes, such as admission to healthcare centers [2,3]. Other risk factors for heart-related problems include high blood pressure, raised glucose levels, raised blood lipids, obesity, and being overweight.

Computational intelligence techniques need to be explored for better prediction of heart-related diseases so that they can be prevented and cautionary measures can be taken in advance. Furthermore, machine learning (ML) techniques can be extensively explored to support healthcare resources and governance for better patient health services. This will directly benefit hospital management, telemedicine systems, practitioners, healthcare providers, and patients. In this study, we intend to develop a model for better heart disease prediction using Ensemble Learning (EL) techniques. Specifically, considering the criticality of the application, we intend to improve the accuracy and other measures of the model for heart disease prediction. The novel contributions of this work are as follows:

Preprocessing the data to improve the characteristic assessment of the dataset

Comparing the results before and after applying preprocessing techniques

Exercising the feature engineering process to identify the contribution of attributes

Applying boosting algorithms using an ensemble learning approach to increase prediction accuracy

Comparing the performance of the proposed model with similar research works

The rest of the article is organized as follows. Section 2 reviews the related work. Section 3 presents the details of the proposed methodology and dataset. Section 4 presents and analyzes the experimental details and results. Finally, Section 5 concludes the article and outlines some future directions.

2 Related Work

Machine/ensemble learning techniques, with their potential to deliver consistent, reliable, and valid results, are used in almost every sphere of life to solve real-life problems [4,5]. Copious work has been done on disease prediction using ML and EL techniques [6]. Researchers have explored different datasets, algorithms, and methodologies for diagnosing cardiovascular disease [7,8]. Some of the important works are discussed below.

Latha et al. [9] experimented with different ensemble techniques, such as bagging, boosting, stacking, and majority voting, using traditional classification algorithms to improve the efficacy of predicting disease risk. They achieved the highest accuracy with majority voting. Theerthagiri et al. [10] explored a gradient boosting algorithm based on recursive feature elimination to predict heart disease from medical parameters such as the patient's age, systolic and diastolic blood pressure, height, weight, smoking, glucose/blood sugar, cholesterol, alcohol intake, and physical workout. Sultan Bin Habib et al. [11] tried different ensemble techniques, such as adaptive boosting (AdaBoost), gradient boosting machine (GBM), light GBM (LGBM), extreme gradient boosting (XGBoost), and category boosting (CatBoost), to predict coronary disease, considering several attributes such as gender, age, education, smoking habits, blood pressure, hypertension, diabetes, cholesterol level, Quetelet index, heart rate, glucose level, and chronic heart disease history. They achieved the highest accuracy with XGBoost. Budholiya et al. [12] used an enhanced XGBoost classifier to predict heart disease effectively. One-hot encoding was used to handle categorical features, and Bayesian optimization was used to tune the hyper-parameters to achieve better results. Pan et al. [13] conducted an extensive study of EL techniques for disease prediction using a dataset containing a good mixture of numerical and categorical attributes. The authors observed that combining the support vector machine and AdaBoost with categorical attributes provides better results in predicting heart disease.

Pouriyeh et al. [14] developed a framework for the prediction/detection of heart disease by comparing conventional ML techniques with EL methods. The dataset used for this work was taken from the online UCI ML repository. The authors used a 10-fold cross-validation technique to validate the results. The results showed that the support vector machine, in combination with the boosting method, provides better results, with the highest accuracy rate of 89.12%. Moreover, bagging and stacking techniques, combined with different traditional classifiers, improve the efficacy of the overall results. Deshmukh [15] used an ensemble learning approach for heart disease prediction. The results were compared between majority voting classifiers and the rest of the classifiers. An extra tree classifier was used for the feature selection process. Bagged classifiers with majority voting outperformed other classifiers, with the highest accuracy of 87.78%. The authors suggested that this work can be extended using optimization procedures and new feature extraction methods. Mary et al. [16] developed a model for heart disease prediction using ten machine learning algorithms. Among all the considered classifiers, the support vector machine yielded better results, with an accuracy rate of 83.49% on the UCI dataset. The simple card algorithm increased the accuracy and reduced the prediction error rate for other measurements. Different metrics were evaluated to validate the proposed framework. The authors suggested that a hybrid approach can be used to extend the existing work for better prediction.

Alqahtani et al. [17] proposed a framework for cardiovascular disease prediction using ensemble learning and deep learning techniques. In the experiment, the random forest algorithm achieved the highest accuracy (88.65%), precision (90.03), recall (88.03), f1-score (88.02), and ROC-AUC value (92). Furthermore, feature importance was calculated to measure the risk of developing this disease in the future. Kondababu et al. [18] built a model by comparing different machine learning techniques for heart disease prediction. Seven machine learning classifiers were considered for comparative and performance analyses. Out of all classifiers, the hybrid random forest with linear model produced better results, with an accuracy rate of 88.4%. No data preprocessing technique was used to improve the output of the proposed model. The authors suggested that this work can be extended using large datasets and a diverse mixture of machine learning techniques.

Most of the work mentioned above did not sufficiently exploit data preprocessing before developing the ensemble learning models, which resulted in inadequate outputs. Therefore, we felt the need to utilize exploratory data analysis to improve the data quality required for the prediction model. Furthermore, data normalization and standardization were missing in most of the existing literature, although these approaches play crucial roles in achieving higher prediction performance.

3 Research Methodology

Fig. 1 depicts the methodology adopted for this experimental study. It presents the procedural steps that must be executed for the early prediction of disease using various ensemble learning techniques. A publicly available heart disease dataset has been imported into the web-based Jupyter Notebook (an open-source platform) for the experimental process. The required library packages are imported from Sklearn (scikit-learn) using the Python programming language. Initially, the boosting classifiers are applied without data preprocessing to predict the disease. After exploratory data analysis, we found that data preprocessing can play an important role in attaining better results. In the preprocessing phase, missing values are identified and replaced using the data imputation method. The interquartile range method is used to detect and replace outliers present in the dataset. Some other libraries are also used to check for corrupted data, if any, in the dataset. The dataset is split in a 70:30 ratio, where 70% is used to train the models and 30% to test them. To validate the results, k-fold cross-validation (k = 10) is applied. Finally, the three considered boosting algorithms are applied after data preprocessing to obtain the desired results.
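The workflow described above can be sketched with scikit-learn as follows. This is a minimal sketch under stated assumptions, not the code used in the study: the data are synthetic (generated to mimic 1328 rows and 13 predictors), the hyper-parameters are arbitrary, cross-validation is run on the training portion only, and XGBoost is omitted because it lives in the separate xgboost package (it would be added to the `models` dictionary in the same way).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the heart disease data (assumption: 1328 rows, 13 predictors).
X, y = make_classification(
    n_samples=1328, n_features=13, n_informative=6, random_state=0
)

# 70:30 train/test split, stratified on the target class.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

# Two of the three boosting algorithms; hyper-parameters are illustrative only.
models = {
    "ADB": AdaBoostClassifier(n_estimators=50, random_state=0),
    "GB": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    # 10-fold cross-validation on the training portion.
    cv_scores = cross_val_score(model, X_train, y_train, cv=10, scoring="accuracy")
    model.fit(X_train, y_train)
    results[name] = {
        "cv_mean": cv_scores.mean(),
        "test_accuracy": model.score(X_test, y_test),
    }

for name, r in results.items():
    print(f"{name}: CV accuracy = {r['cv_mean']:.3f}, test accuracy = {r['test_accuracy']:.3f}")
```

Running the same loop once on the raw data and once on the preprocessed data would give the before/after comparison reported in Section 4.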


3.1 Techniques Used

Ensemble learning techniques are explored in almost every field to solve real-life problems [19]. These models have made significant progress in the prediction, detection, diagnosis, and prognosis of different diseases. In this study, for heart disease prediction, we considered the following three ensemble-learning-based boosting algorithms [6]:

Gradient boosting: The weak learners are trained sequentially, and the estimators are added to the ensemble one at a time. Each new estimator is fitted to the residual errors of the previous estimators, so that the ensemble gradually minimizes the difference between the predicted and actual values.

AdaBoost: AdaBoost works by adaptively adjusting the weights of the training samples without prior knowledge of the weak learners. The weakness of each base learner is measured by its error rate while training the models, and samples it misclassifies receive larger weights in the next round. Decision tree stumps are widely used with the AdaBoost algorithm to solve classification and regression problems.

XGBoost: XGBoost works by combining decision trees (weak learners) that are built using similarity scores to evaluate candidate splits. It helps to overcome the problem of overfitting during the training phase by adding regularization to the gradient-descent-based boosting process.
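To make the reweighting idea behind AdaBoost concrete, the following sketch performs a single round of the classical discrete AdaBoost update on five hypothetical samples. The function name `adaboost_round` and the toy labels are illustrative assumptions, and the sketch assumes the weak learner's weighted error lies strictly between 0 and 1.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One discrete-AdaBoost round; labels and predictions are +1 or -1."""
    # Weighted error rate of the weak learner.
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # Learner weight: the lower the error, the larger its say in the final vote.
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Increase weights of misclassified samples, decrease the others.
    updated = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

# Hypothetical toy data: the weak learner misclassifies only the second sample.
y_true = [1, 1, -1, -1, 1]
y_pred = [1, -1, -1, -1, 1]
weights = [1 / len(y_true)] * len(y_true)

alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(f"alpha = {alpha:.4f}")
print("updated weights:", [round(w, 3) for w in new_weights])
```

After the update, the single misclassified sample carries half of the total weight, so the next weak learner is pushed to classify it correctly.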

3.2 Dataset Selection

For the experiment, we used the popular dataset on heart disease, openly available in the machine learning repository at the University of California Irvine (UCI). The dataset is rich in clinical features related to heart disease and covers a wide demography. Thus, it has been one of the most popular choices for researchers.

3.3 Attribute Information

The dataset contains 1328 instances and 14 attributes, where the first 13 attributes are predictor/independent variables, and the last one is the dependent/target variable. The attributes are described in Table 1. The table presents the considered attributes, their descriptions, their measurements, and their value ranges.

Figure 1: Proposed methodology for research work

https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset.


3.4 Dataset Description

Descriptive statistics play a vital role in identifying the characteristics of the data. They summarize the data so that it becomes easier for humans to interpret. Table 2 describes the statistical measurements of the clinical attributes, namely the count of records, minimum (min) value, maximum (max) value, mean, and standard deviation (Std). For example, the age attribute has a mean of 54.41 and a standard deviation of 9.07, and the maximum and minimum ages are 77 and 29 years, respectively. These statistical measurements are also calculated for the rest of the attributes.
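As an illustration of how such a summary is obtained, the sketch below computes the same five measurements for one attribute with Python's standard library. The ages are hypothetical values, not rows of the dataset, and the sample standard deviation (n − 1 denominator) is an assumption about how Std was computed.

```python
import statistics

# Hypothetical ages for illustration; not taken from the dataset.
ages = [29, 45, 54, 61, 77]

summary = {
    "count": len(ages),
    "mean": statistics.mean(ages),
    "std": statistics.stdev(ages),  # sample standard deviation (n - 1 denominator)
    "min": min(ages),
    "max": max(ages),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```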

3.5 Class Balance

Machine/ensemble learning models provide poor results if the dataset used is not balanced for the problem statement. In some situations, if the target class is not equally distributed, sampling techniques can be used to balance the dataset. The dataset for this experiment contains a good mixture of classes, where class 1 is heart disease (691 instances) and class 0 is no heart disease (637 instances), as shown in Fig. 2.
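The class counts above translate into class shares as in the short check below. The counts are those reported in the text; the variable names are illustrative.

```python
# Class counts as reported for the outcome variable.
counts = {"no heart disease (0)": 637, "heart disease (1)": 691}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, share in shares.items():
    print(f"{label}: {share:.2%}")
print(f"majority/minority ratio: {imbalance_ratio:.2f}")
```

A positive-class share of about 0.52 agrees with the mean of the Outcome attribute in Table 2.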

Table 1: Attributes information of the dataset

Attribute | Description | Measurement | Value range
Age | Age of an individual | Years | 29 to 77
Sex | Gender of an individual | 1 = male, 0 = female | 0 or 1
Cpericarditis | Degree of chest pain | Low, moderate, high, extremely high | 0 to 3
RestingBP | Blood pressure of an individual while at rest (inactive) | mm Hg | 94 to 200
Cholesterol | Level of serum cholesterol | mg/dl | 126 to 564
FastingBP | Glucose level in an empty stomach (fasting) | Greater than 120 mg/dl (1 = true, 0 = false) | 0 or 1
RestingECG | Resting electrocardiographic results of an individual while inactive | 0 = normal, 1 = having ST | 0 to 2
MaximumHR | Highest heart rate recorded | Beats per minute | 71 to 202
ExerciseIA | Doing exercise with angina disease | 1 = yes, 0 = no | 0 or 1
Oldpeak | ST depression, caused by doing exercise in comparison to being inactive | Numeric value | Relative
Slope | Slope of the old peak value in the ST segment while an individual is doing an exercise | 0 = downsloping, 1 = flat, 2 = upsloping | 0 to 2
Ca | No. of major vessels colored by fluoroscopy | Numeric | 0 to 3
Thal | Whether an individual has thalassemia or not | 3 = normal, 6 = fixed defect, 7 = reversible defect | 3 to 7
Outcome | Class attribute | 0 = no heart disease, 1 = heart disease | 0 or 1


3.6 Histogram of Dataset

A histogram is used to visualize and interpret the distribution of data samples. The shape of a histogram can be uniform, normal, left-skewed, or right-skewed. Fig. 3 shows the histogram of each attribute over its value range. The x-axis represents the values of the attribute, and the y-axis represents the number of samples that fall within each bin.

3.7 Boxplot of Dataset

Fig. 4 shows the boxplot of each attribute present in the dataset. The interquartile range method has been used to identify the outliers shown in the boxplots and to handle them in the dataset. For example, a single outlier was detected in fasting blood pressure, whereas multiple outliers were detected in resting blood pressure and were replaced with the z-score method.
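A common way to apply the interquartile range rule is sketched below: values beyond 1.5 × IQR from the quartiles are flagged as outliers. The readings are hypothetical, the helper name `iqr_fences` is illustrative, and capping flagged values at the fences is only one possible replacement strategy, assumed here for illustration rather than taken from the study.

```python
import statistics

def iqr_fences(values, k=1.5):
    """Return the (lower, upper) outlier fences of the interquartile range rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical resting blood pressure readings (mm Hg).
resting_bp = [94, 110, 120, 120, 130, 130, 140, 150, 200]

lower, upper = iqr_fences(resting_bp)
outliers = [v for v in resting_bp if v < lower or v > upper]

# Assumed replacement strategy: cap flagged values at the nearest fence.
capped = [min(max(v, lower), upper) for v in resting_bp]

print(f"fences: [{lower}, {upper}]")
print("outliers:", outliers)
print("capped:", capped)
```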

Table 2: Dataset description

Attributes | Count | Mean | Std | Min | Max
Age | 1328 | 54.41 | 9.07 | 29 | 77
Sex | 1328 | 0.69 | 0.46 | 0 | 1
Cpericarditis | 1328 | 0.94 | 1.02 | 0 | 3
RestingBP | 1328 | 131.61 | 17.51 | 94 | 200
Cholesterol | 1328 | 246.06 | 51.62 | 126 | 564
FastingBP | 1328 | 0.14 | 0.35 | 0 | 1
RestingECG | 1328 | 0.52 | 0.52 | 0 | 2
MaximumHR | 1328 | 149.23 | 22.97 | 71 | 202
ExerciseIA | 1328 | 0.33 | 0.47 | 0 | 1
Oldpeak | 1328 | 1.06 | 1.17 | 0 | 1
Slope | 1328 | 1.38 | 0.61 | 0 | 2
Ca | 1328 | 0.74 | 1.02 | 0 | 4
Thal | 1328 | 2.32 | 0.61 | 0 | 3
Outcome | 1328 | 0.52 | 0.49 | 0 | 1

Figure 2: Instances of the outcome variable (0: no heart disease, 637 instances; 1: heart disease, 691 instances)


3.8 Correlation Coefficient Analysis

The correlation coefficient analysis (CCA) method is used to identify and plot the relationships among the dataset's attributes [20]. A dataset is considered good if a strong association exists between the independent and dependent attributes. Fig. 5 presents the CCA of all attributes used to predict disease, with the attributes listed along both the x-axis and the y-axis and correlation values ranging from −1 to +1. Each cell value indicates the degree of relationship between the two intersecting attributes. For example, the correlation between resting blood pressure and age is 0.12.
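Each cell of such a matrix is typically a Pearson correlation coefficient; the text does not name the coefficient, so Pearson is an assumption here. The sketch below computes it for two hypothetical attribute columns, showing the two extremes of the range.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical columns: pressure rising with age, then falling with age.
age = [30, 40, 50, 60, 70]
bp_rising = [110, 120, 130, 140, 150]
bp_falling = bp_rising[::-1]

print(pearson(age, bp_rising))   # perfectly positive linear relationship
print(pearson(age, bp_falling))  # perfectly negative linear relationship
```

A value such as 0.12 therefore indicates a weak positive linear relationship.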

4 Experiment, Results, and Discussion

This section presents the experimental details and the results achieved using boosting algorithms for heart disease prediction. All results obtained after implementing the proposed framework are shown and analyzed systematically. The results are presented in two modules: before preprocessing and after preprocessing for disease prediction. The evaluation is discussed in terms of performance evaluation metrics such as precision, recall, f1-score, the receiver operating characteristic curve, and the execution time of the considered boosting algorithms.

Figure 3: Histogram of attributes

Figure 4: Boxplot of attributes

Figure 5: Correlation coefficient matrix

4.1 Data Preprocessing

Data preprocessing is vital for developing a robust and reliable system before applying ML methods to the model [21]. In this work, missing values have been identified and replaced by the data imputation method. Initially, we used the isnull() method to detect all the missing values and then applied the SimpleImputer() method to fill them in. This process replaces each missing value with a statistic of its column, such as the mean, median, or mode. Outliers have been detected and replaced using the interquartile range method, and the Z-score technique was used to shift the distribution of the data samples so that the mean becomes 0.
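The imputation and standardization steps can be sketched with scikit-learn as follows. This is an illustrative sketch on a hypothetical 4 × 3 array, not the study's code: `np.isnan` stands in for the pandas `isnull()` call, and assigning the mean strategy to numeric columns and the most-frequent (mode) strategy to the binary column is an assumption.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: age, resting blood pressure, binary flag; NaN marks a missing value.
X = np.array([
    [63.0, 145.0, 1.0],
    [np.nan, 130.0, 0.0],
    [41.0, np.nan, 0.0],
    [56.0, 120.0, np.nan],
])

# Count missing values per column (the role played by isnull() in the text).
missing_per_column = np.isnan(X).sum(axis=0)

# Mean imputation for numeric columns, mode imputation for the binary column.
numeric = SimpleImputer(strategy="mean").fit_transform(X[:, :2])
binary = SimpleImputer(strategy="most_frequent").fit_transform(X[:, 2:])
X_imputed = np.hstack([numeric, binary])

# Z-score standardization of the numeric columns: mean 0, unit variance.
X_scaled = StandardScaler().fit_transform(X_imputed[:, :2])

print("missing per column:", missing_per_column)
print(X_imputed)
print("column means after scaling:", X_scaled.mean(axis=0).round(6))
```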

4.2 Hardware/Software Specification and Computational Time

An HP Z60 workstation was used to carry out this research work. The hardware specification of the system is as follows: Intel XEON 2.4 GHz CPU (12 cores), 4 GB RAM, 1 TB hard disk, and Windows 10 Pro (64-bit). On this machine, the ADB, XGB, and GB algorithms took 4.23, 3.57, and 4.51 seconds, respectively, to execute. The software used for the implementation comprises the graphical user interface-based Anaconda Navigator, the web-based computing platform Jupyter Notebook, and Python as the programming language.

4.3 Accuracy of Classifiers

The testing accuracy of the boosting algorithms is shown in Fig. 6. The algorithms employed in this work are XGBoost (XGB), AdaBoost (ADB), and gradient boosting (GB). Without preprocessing, the accuracies of XGB, ADB, and GB are 87.50%, 81.50%, and 86%, respectively. After applying preprocessing techniques, GB outperformed the other boosting algorithms by obtaining the highest accuracy rate of 92.20%, followed by XGB and ADB, both at 89.61%.

4.4 Other Measurements

The precision, recall, and f1-score of the three considered classifiers were calculated before and after data preprocessing. The values were calculated (in percentage) for both classes (0: no heart disease, 1: heart disease), as shown in Figs. 7 and 8. XGB performed best for all the measurements without preprocessing, whereas ADB performed the worst. With preprocessing, GB performed the best in most measurements, whereas XGB and ADB achieved more or less the same results.
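For reference, the per-class measures follow from the counts of true positives, false positives, and false negatives, as the sketch below shows. The labels and predictions are hypothetical, the function name `class_metrics` is illustrative, and the sketch assumes each class is both present and predicted at least once (otherwise a division by zero occurs).

```python
def class_metrics(y_true, y_pred, label):
    """Precision, recall, and F1-score when `label` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical ground truth and predictions (1: heart disease, 0: no heart disease).
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]

for label in (0, 1):
    p, r, f1 = class_metrics(y_true, y_pred, label)
    print(f"class {label}: precision={p:.2f}, recall={r:.2f}, f1={f1:.2f}")
```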

Figure 6: Classification accuracy before and after preprocessing (XGB: 87.50% and 89.61%; ADB: 81.50% and 89.61%; GB: 86.00% and 92.20%)


4.5 Feature Importance

Feature importance is a process that scores the input features (independent/predictor variables) based on their contribution to predicting the output feature (dependent/target variable) [22]. It plays an important role in developing machine/ensemble learning models to improve prediction results. In this work, the feature importance score (F-score) represents the number of times an attribute is used for splitting in the training process. A higher F-score of a feature (e.g., cholesterol) indicates that it is an important attribute. Fig. 9 shows the contribution of all the attributes toward prediction in descending order of their F-score. For example, cholesterol has the highest significance in the prediction, whereas fasting blood pressure has the lowest.
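A split-count score of this kind can be reproduced for any tree ensemble by counting how often each feature appears as a split node. The sketch below does so for scikit-learn's GradientBoostingClassifier on synthetic data; the feature names and hyper-parameters are hypothetical, and this is not the study's code. In the xgboost package, the corresponding quantity is the "weight" importance type.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data with hypothetical feature names.
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
feature_names = ["f0", "f1", "f2", "f3", "f4"]

model = GradientBoostingClassifier(n_estimators=50, max_depth=2, random_state=0)
model.fit(X, y)

# Count how many times each feature is used as a split across all trees.
split_counts = np.zeros(X.shape[1], dtype=np.int64)
for tree in model.estimators_.ravel():
    used = tree.tree_.feature  # negative entries mark leaf nodes
    split_counts += np.bincount(used[used >= 0], minlength=X.shape[1])

# Rank features in descending order of split count.
ranking = sorted(zip(feature_names, split_counts.tolist()), key=lambda pair: pair[1], reverse=True)
for name, count in ranking:
    print(f"{name}: {count}")
```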

Figure 7: Other measurements before preprocessing (precision, recall, and F1-score for classes 0: negative and 1: positive)

Figure 8: Other measurements after preprocessing (precision, recall, and F1-score for classes 0: negative and 1: positive)

Figure 9: Feature importance for prediction


4.6 ROC Curve

The receiver operating characteristic (ROC) curve has been used to show the prediction capability of the considered boosting algorithms at different thresholds. It plots the false-positive rate along the x-axis against the true-positive rate along the y-axis. Using the ROC curve, we analyzed how well our models distinguish between the classes (0: no heart disease and 1: heart disease). A higher ROC curve means that the model separates the 0's and 1's better. An area under the curve (AUC) near 1 indicates a good measure of separability; an AUC of 0.5 means that the model cannot separate the classes any better than random guessing; and an AUC near 0 indicates the worst case, in which the model systematically reverses the classes. The ROC curves for ADB, XGB, and GB are shown in Figs. 10–12, respectively. From the figures, we conclude that XGB performs best, followed by GB and ADB.
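The interpretation of AUC can be illustrated with the rank-based definition: AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below uses hypothetical scores and illustrative function names; it also computes a single point of the ROC curve at one threshold.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def roc_point(y_true, scores, threshold):
    """(false-positive rate, true-positive rate) at one decision threshold."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    return fp / y_true.count(0), tp / y_true.count(1)

# Hypothetical labels and predicted probabilities of class 1.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
reversed_scores = [1 - s for s in scores]

print("AUC:", roc_auc(y_true, scores))
print("AUC with reversed scores:", roc_auc(y_true, reversed_scores))
print("ROC point at threshold 0.5:", roc_point(y_true, scores, 0.5))
```

Reversing the scores turns an AUC of a into 1 − a, which is why an AUC near 0 corresponds to a model that systematically swaps the two classes.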

Figure 10: ROC curve for ADB

Figure 11: ROC curve for XGB


4.7 Comparative Analysis

The performance of our proposed framework has been compared with several relevant studies in terms of techniques used, dataset, and accuracy, as shown in Table 3. The proposed framework yielded good results in terms of different evaluation metrics, particularly accuracy, in predicting heart disease. Techniques such as data imputation for handling missing values and the detection and replacement of outliers using the boxplot method have been used to achieve better results than other related works.

Figure 12: ROC curve for GB

Table 3: Comparison of the proposed work with existing similar works

Research work | Ensemble techniques adopted | Dataset used | Highest accuracy
[11] | XGB, ADB, GBM, LGBM, and CatBoost | Framingham heart disease dataset (publicly available) | 87.62% with XGB
[9] | Boosting, bagging, stacking, and majority vote | Cleveland heart disease dataset (publicly available) | 85.48% with majority vote
[10] | Recursive feature elimination and GB | Same as above | 89.78%
[12] | XGB with Bayesian optimization | Same as above | 91.80%
[14] | CatBoost, GB, XGB, and ADB | Same as above | 83.60% with ADB
[17] | DNN, KDNN, XGB, KNN, decision tree, and random forest | Same as above | 88.65% with random forest
[18] | Naïve Bayes, linear model, logistic regression, decision tree, random forest, SVM, and HRFLM | Same as above | 88.40% with HRFLM
Our method | XGB, ADB, and GB | Same as above | 92.20% with GB


5 Conclusion

This study applied boosting algorithms to predict heart disease effectively. Different preprocessing methods, such as imputation, Z-score, and cleaning methods, were employed to improve the quality of the dataset and the prediction results. Three boosting algorithms, namely, XGBoost, AdaBoost, and gradient boosting, were executed before and after applying the preprocessing techniques. The experimental results were assessed using different statistical/ML measurements and revealed that gradient boosting achieved the highest accuracy rate of 92.20%. The gradient boosting algorithm also achieved better results for other metrics, such as precision, recall, and f1-score. Finally, the feature importance process was employed to calculate the contribution of independent features toward the final outcome.

Other ensemble learning techniques, such as bagging and stacking, can be used to improve the efficacy of this work. To extend the scope of this research, the proposed method can be applied to other healthcare datasets that share common features. Deep learning techniques can also be explored to better detect and predict cardiovascular diseases.

Funding Statement: This work was supported by National Research Foundation of Korea-Grant funded by the Korean Government (MSIT)-NRF-2020R1A2B5B02002478.

Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.

References

[1] WHO, “Cardiovascular diseases (CVDs),” 11 June, 2021. [Online]. Available: https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds) (accessed on 6 July 2022).

[2] Y. Ruan, Y. Guo, Y. Zheng, Z. Huang, S. Sun et al., “Cardiovascular disease (CVD) and associated risk factors among older adults in six low-and middle-income countries: Results from SAGE Wave 1,” BMC Public Health, vol. 18, no. 1, pp. 1–13, 2018.

[3] M.-H. Biglu, M. Ghavami and S. Biglu, “Cardiovascular diseases in the mirror of science,” Journal of Cardiovascular and Thoracic Research, vol. 8, no. 4, pp. 158–163, 2016.

[4] S. M. Ganie, M. B. Malik and T. Arif, “Early prediction of diabetes mellitus using various artificial intelligence techniques: A technological review,” International Journal of Business Intelligence and Systems Engineering, vol. 1, no. 4, pp. 1–22, 2021.

[5] J. Alzubi, A. Nayyar and A. Kumar, “Machine learning from theory to algorithms: An overview,” Journal of Physics: Conference Series, vol. 1142, no. 1, pp. 012012, 2018.

[6] S. M. Ganie, M. B. Malik and T. Arif, “Performance analysis and prediction of type 2 diabetes mellitus based on lifestyle data using machine learning approaches,” Journal of Diabetes & Metabolic Disorders, vol. 21, no. 1, pp. 339–352, 2022.

[7] N. Nissa, S. Jamwal and S. Mohammad, “Early detection of cardiovascular disease using machine learning techniques an experimental study,” International Journal of Recent Technology and Engineering, vol. 9, no. 3, pp. 635–641, 2020.

[8] S. Jamwal and S. M. Najmu Nissa, “Heart disease prediction using machine learning,” Lecture Notes in Networks and Systems, vol. 203, no. 67, pp. 653–665, 2021.

[9] C. B. C. Latha and S. C. Jeeva, “Improving the accuracy of prediction of heart disease risk based on ensemble classification techniques,” Informatics in Medicine Unlocked, vol. 16, no. November 2018, pp. 100203, 2019.

[10] P. Theerthagiri and J. Vidya, “Cardiovascular disease prediction using recursive feature elimination and gradient boosting classification techniques,” CoRR, vol. abs/2106.0, 2021. [Online]. Available: https://arxiv.org/abs/2106.08889

[11] A. Z. Sultan Bin Habib, T. Tasnim and M. M. Billah, “A study on coronary disease prediction using boosting-based ensemble machine learning approaches,” in Proc. ICIET 2019, Dhaka, Bangladesh, pp. 23–24, 2019.

[12] K. Budholiya, S. K. Shrivastava and V. Sharma, “An optimized XGBoost based diagnostic system for effective prediction of heart disease,” Journal of King Saud University-Computer and Information Sciences, vol. 34, no. 7, pp. 4514–4523, 2022.

[13] C. Pan, A. Poddar, R. Mukherjee and A. K. Ray, “Impact of categorical and numerical features in ensemble machine learning frameworks for heart disease prediction,” Biomedical Signal Processing and Control, vol. 76, no. April, pp. 103666, 2022.

[14] S. Pouriyeh, S. Vahid, G. Sannino, G. De Pietro, H. Arabnia et al., “A comprehensive investigation and comparison of machine learning techniques in the domain of heart disease,” in Proc. IEEE Symp. Computers and Communications, Heraklion, Greece, pp. 204–207, 2017.

[15] V. M. Deshmukh, “Heart disease prediction using ensemble methods,” International Journal of Recent Technology and Engineering, vol. 8, no. 3, pp. 8521–8526, 2019.

[16] N. Mary, B. Khan, A. A. Asiri, F. Muhammad, S. Alqhtani et al., “Investigating of classification algorithms for heart disease risk prediction,” Journal of Intelligent Medicine and Healthcare, vol. 1, no. 1, pp. 11–31, 2022.

[17] A. Alqahtani, S. Alsubai, M. Sha, L. Vilcekova and T. Javed, “Cardiovascular disease detection using ensemble learning,” Computational Intelligence and Neuroscience, vol. 2022, no. 3, pp. 1–9, 2022.

[18] A. Kondababu, V. Siddhartha, B. H. K. Bhagath Kumar and B. Penumutchi, “A comparative study on machine learning based heart disease prediction,” Materials Today: Proceedings, [in press], 2021.

[19] S. M. Ganie and M. B. Malik, “An ensemble machine learning approach for predicting type-II diabetes mellitus based on lifestyle indicators,” Healthcare Analytics, vol. 2, no. 1, pp. 100092, 2022.

[20] A. Hussain and S. Naaz, “Prediction of diabetes mellitus: Comparative study of various machine learning models,” in Int. Conf. on Innovative Computing and Communications. Advances in Intelligent Systems and Computing, vol. 1166. Singapore: Springer, pp. 103–115, 2021.

[21] A. Jazayeri, O. S. Liang and C. C. Yang, “Imputation of missing data in electronic health records based on patients’ similarities,” Journal of Healthcare Informatics Research, vol. 4, no. 3, pp. 295–307, 2020.

[22] D. Dutta, D. Paul and P. Ghosh, “Analysing feature importances for diabetes prediction using machine learning,”

in Proc. IEMCON 2018, Vancouver, Canada, pp. 924–928, 2019.

4006 CSSE, 2023, vol.46, no.3

Heart Disease Detection Using an Ensemble Solution with Target Engineering and Pearson Correlation Feature Selection

Conference Paper

May 2024

Heart disease stands as one of the most intricate health conditions, affecting a substantial population across the globe. The aim of this research is to add to the expanding knowledge base in the early detection of heart disease. The well-known heart disease dataset sourced from the UCI Machine Learning Repository was utilized in this study. The dataset encompasses 15 distinct attributes, not counting the target variable, collected from 920 different patients. Pearson correlation was employed as the feature selection approach with a collinearity threshold of 85%. A binning method was used as the target engineering strategy. Before target engineering, various well-known classification algorithms were employed, such as Random forest, Extra gradient, CatBoost, Bagging, Logistic regression, LightGBM, SVM, Ada boost, XGBoost, Decision tree, Gradient boosting, Naive bayes, and KNN. After target engineering, another diverse set of classifier models was deployed, encompassing Random forest, KNN, Logistic regression, SVM, Naive bayes, XGBoost, and Decision tree. Both soft and hard voting ensembles were employed on the five best models, selected according to the accuracy metric. An artificial neural network (ANN) model was also constructed and applied after the target engineering method. The ANN model was trained for 300 epochs, because after that point the loss curves exhibited a nearly flat trend. At that number of epochs, the ANN model achieved an accuracy score of 99.59%, which was the highest among all the suggested models. The soft voting ensemble on the top five models achieved accuracy, precision, recall, f1-score, MAE, MSE, RMSE, RAE, and RRSE scores of 90.12%, 90.08%, 90.12%, 90.08%, 9.88%, 9.88%, 34.32%, 28.40%, and 77.34%, respectively, which was the second highest.

An Improved Ensemble-Based Cardiovascular Disease Detection System with Chi-Square Feature Selection

Article

May 2024

Cardiovascular disease (CVD) is a leading cause of death globally; therefore, early detection of CVD is crucial. Many intelligent technologies, including deep learning and machine learning (ML), are being integrated into healthcare systems for disease prediction. This paper uses a voting ensemble ML with chi-square feature selection to detect CVD early. Our approach involved applying multiple ML classifiers, including naïve Bayes, random forest, logistic regression (LR), and k-nearest neighbor. These classifiers were evaluated through metrics including accuracy, specificity, sensitivity, F1-score, confusion matrix, and area under the curve (AUC). We created an ensemble model by combining predictions from the different ML classifiers through a voting mechanism, whose performance was then measured against individual classifiers. Furthermore, we applied the chi-square feature selection method to the 303 records across 13 clinical features in the Cleveland cardiac disease dataset to identify the 5 most important features. This approach improved the overall accuracy of our ensemble model and reduced the computational load by more than 50%. Demonstrating superior effectiveness, our voting ensemble model achieved a remarkable accuracy of 92.11%, representing an average improvement of 2.95% over the single highest classifier (LR). These results indicate that the ensemble method is a viable and practical approach to improve the accuracy of CVD prediction.

A comparative analysis of boosting algorithms for chronic liver disease prediction

Article

Feb 2024

Chronic liver disease (CLD) is a major health concern for millions of people all over the globe. Early prediction and identification are critical for taking appropriate action at the earliest stages of the disease. Implementing machine learning methods in predicting CLD can greatly improve medical outcomes, reduce the burden of the condition, and promote proactive and preventive healthcare practices for those at risk. However, traditional machine learning has some limitations which can be mitigated through ensemble learning. Boosting is the most advantageous ensemble learning approach. This study aims to improve the performance of the available boosting techniques for CLD prediction. Seven popular boosting algorithms, namely Gradient Boosting (GB), AdaBoost, LogitBoost, SGBoost, XGBoost, LightGBM, and CatBoost, and two popular publicly available CLD datasets of dissimilar size and demography, the Liver disease patient dataset (LDPD) and the Indian liver disease patient dataset (ILPD), are considered in this study. The features of the datasets are ascertained by exploratory data analysis. Additionally, hyperparameter tuning, normalisation, and upsampling are used for predictive analytics. The proportional importance of every feature contributing to CLD for every algorithm is assessed. Each algorithm's performance on both datasets is assessed using k-fold cross-validation, twelve metrics, and runtime. Among the five boosting algorithms, GB emerged as the best overall performer for both datasets. It attained 98.80% and 98.29% accuracy rates for LDPD and ILPD, respectively. GB also outperformed other boosting algorithms regarding other performance metrics except runtime.

Pushing Boundaries: The Landscape of AI-Driven Drug Discovery and Development with Insights Into Regulatory Aspects

Preprint

Jan 2024

Artificial intelligence (AI) is revolutionizing the pharmaceutical and healthcare sectors. The current understanding of drug development (DVPT) and discovery can be expanded with the aid of AI to benefit the health of society. Current opportunities where AI can be helpful include support in preclinical and clinical studies, target identification, hit identification, lead optimization, and clinical decision-making. The challenges that AI should overcome include ethical aspects, intellectual issues, and regulatory aspects in drug DVPT and its clinical establishment. AI has various applications in drug discovery (DDS) and DVPT. The overall goal is to accentuate the importance of drug discovery and DVPT as well as the potential for AI to streamline the procedures and improve health outcomes. Furthermore, the recent updates of DVPTs in AI-related issues in regulatory affairs are highlighted along with key moves that the pharmaceutical industry shall follow as it catches the benefits of AI-based applications.

Empowering Clinical Decision Making: An In‐Depth Systematic Review of AI‐Driven Scoring Approaches for Liver Transplantation Prediction

Chapter

Jun 2024

Effective clinical decision-making is critical in liver transplantation, as timely and precise assessments substantially influence patient outcomes. The chapter examines the possible benefits of artificial intelligence (AI) tools in improving clinical decision-making in liver transplantation. As part of this research, 44 relevant research papers that satisfied the inclusion requirements are analyzed. Various AI methodologies in liver transplantation, such as machine learning, deep learning, and predictive modeling, are examined. This study aimed to assess whether AI-based scoring algorithms can improve the accuracy and efficiency of clinical judgments and predict outcomes such as graft failure, patient survival, and rejection. The findings suggest that AI-based models can improve clinical decision-making by providing accurate forecasts of critical outcomes and expediting evaluations, resulting in timely interventions. However, successfully integrating AI into clinical practice requires further research and validation. These insights benefit doctors, researchers, and policymakers interested in leveraging AI to enhance decision-making efficiency in liver transplantation.

Pushing Boundaries: The Landscape of AI‐Driven Drug Discovery and Development with Insights Into Regulatory Aspects

Chapter

Jun 2024

Improved liver disease prediction from clinical data through an evaluation of ensemble learning approaches

Article

Jun 2024
BMC MED INFORM DECIS

Purpose: Liver disease causes two million deaths annually, accounting for 4% of all deaths globally. Prediction or early detection of the disease via machine learning algorithms on large clinical data has become promising and potentially powerful, but such methods often have some limitations due to the complexity of the data. In this regard, ensemble learning has shown promising results. There is an urgent need to evaluate different algorithms and then suggest a robust ensemble algorithm for liver disease prediction. Method: Three ensemble approaches with nine algorithms are evaluated on a large dataset of liver patients comprising 30,691 samples with 11 features. Various preprocessing procedures are utilized to feed the proposed model with better quality data, in addition to the appropriate tuning of hyperparameters and selection of features. Results: The models’ performances with each algorithm are extensively evaluated with several positive and negative performance metrics along with runtime. Gradient boosting is found to have the overall best performance, with 98.80% accuracy and 98.50% each for precision, recall and F1-score. Conclusions: The proposed model with gradient boosting performed better in most metrics than several recent similar works, suggesting its efficacy in predicting liver disease. It can be further applied to predict other diseases with the commonality of predicate indicators.

Predicting Chronic Liver Disease Using Boosting Technique

Conference Paper

Dec 2023

Liver disease has become a major health crisis globally. Machine learning methodologies are increasingly being applied to predict and diagnose various diseases. This paper uses five boosting algorithms (XGBoost, CatBoost, LightGBM, AdaBoost, and gradient boosting) to predict liver disease. Several preprocessing procedures are utilised to enhance the prediction performance, in addition to the appropriate tuning of hyperparameters and selection of features. The model's performance is assessed using various metrics, including accuracy, precision, recall, f1-score, misclassification rate, AUC-ROC, and runtime. Among the five methods evaluated, gradient boosting emerged as the best performer, attaining the highest scores in nearly all performance metrics. It achieved an AUC-ROC of 86%, an accuracy of 87.43%, a precision of 86%, and a recall of 88.5%.

OptiANN-LR: Augmenting Diabetes Prediction Accuracy through Hyper Learning Rate Tuning in Optimized Artificial Neural Networks

Conference Paper

Feb 2024

Precision Livestock Farming (PLF) and Sustainable Agriculture

Article

Oct 2023

Precision Livestock Farming (PLF) is revolutionizing the agricultural landscape, offering a data-driven and technology-infused approach to enhance the sustainability of livestock production. This abstract explores the convergence of PLF and Sustainable Agriculture, highlighting the principles, advantages, challenges, and future directions of this dynamic partnership. Sustainable Agriculture is guided by three core principles: environmental stewardship, economic viability, and social responsibility. PLF seamlessly aligns with these principles by optimizing resource efficiency, reducing environmental footprints, improving animal health and welfare, enhancing economic resilience, and bolstering the reputation of the livestock industry. Challenges, including initial investment, data management, technology accessibility, and regulatory compliance, are significant but surmountable hurdles in the path to PLF adoption. Equitable access, data privacy, and responsible regulations are crucial considerations. The future of PLF and Sustainable Agriculture lies in the integration of PLF with other sustainable practices, global adoption, technological advancements, and consumer education. The synergy of PLF with organic farming and agroecology, its potential to address global food security, and the promise of more advanced technologies and informed consumers pave the way for a sustainable and efficient agricultural future. As we navigate the evolving agricultural landscape, the partnership between Precision Livestock Farming and Sustainable Agriculture holds the promise of harmonizing technology and sustainability, ensuring that we can meet the world's food needs while preserving the health of our planet and the welfare of its inhabitants.

Heart Disease Prediction using Ensemble Methods

Article

Sep 2019

Vaishali M Deshmukh

Nowadays, people suffer from many health issues, and heart disease is one of the most widespread among the global population. It is caused by an imbalanced lifestyle and unhealthy food consumption. The data generated by hospitals, which stores patients' medical and demographic information, is huge and complex by nature. Accurate and prompt diagnosis of heart disease is becoming a more challenging task in the medical domain due to this complex data. Therefore, computer-aided systems are useful for storing this complex and multivariate data and generating useful decisions. Machine learning techniques are used to classify and predict diseases. In this study, both the majority voting classifier and the Bagging ensemble method have been evaluated. These ensemble methods combined five base classifiers: DT (Decision Tree), LR (Logistic Regression), ANN (Artificial Neural Network), NB (Naïve Bayesian), and KNN (K-Nearest Neighbour). The Bagging ensemble approach is used to combine the prediction abilities of the multiple classifiers for better performance. Experimental work is performed on the Cleveland dataset using 14 attributes, which is available online in the UCI Repository. The results showed that the Bagging ensemble method performed better, achieving a higher accuracy of 87.78%.

Cardiovascular Disease Detection using Ensemble Learning

Article

Aug 2022
Comput Intell Neurosci

One of the most challenging tasks for clinicians is detecting symptoms of cardiovascular disease as early as possible. Many individuals worldwide die each year from cardiovascular disease. Since heart disease is a major concern, it must be dealt with in a timely manner. Multiple variables affecting health, such as excessive blood pressure, elevated cholesterol, an irregular pulse rate, and many more, make it challenging to diagnose cardiac disease. Thus, artificial intelligence can be useful in identifying and treating diseases early on. This paper proposes an ensemble-based approach that uses machine learning (ML) and deep learning (DL) models to predict a person’s likelihood of developing cardiovascular disease. We employ six classification algorithms to predict cardiovascular disease. Models are trained using a publicly available dataset of cardiovascular disease cases. We use random forest (RF) to extract important cardiovascular disease features. The experiment results demonstrate that the ML ensemble model achieves the best disease prediction accuracy of 88.70%.

An ensemble Machine Learning approach for predicting Type-II diabetes mellitus based on lifestyle indicators

Article

Nov 2022

Machine Learning (ML) is a branch of artificial intelligence that allows computers to learn without being explicitly programmed. ML has been widely used in healthcare to predict various chronic diseases. Prediction of diabetes at earlier stages is crucial for better clinical pathways to reduce the complications and delay the occurrence of diabetes. In this study, a new ensemble learning-based framework is proposed for the early prediction of Type-II diabetes mellitus using lifestyle indicators. Different ensemble learning techniques such as Bagging, Boosting, and Voting are employed. Exploratory data analysis is used to improve the quality assessment of the dataset. The synthetic minority oversampling technique is used for class balancing, and the K-fold cross-validation technique is employed to validate the results. A feature engineering process is applied to calculate the contribution of lifestyle parameters. Among all the classification techniques, the bagged decision tree achieved the highest accuracy rate (99.41%), precision (99.13%), recall (95.83%), specificity (99.11%), F1-score (99.15%), misclassification rate (MCR) (0.86%), and receiver operating characteristic (ROC) curve (99.07%). The proposed framework can be used in the healthcare industry for the early prediction of diabetes. Also, it can be used for other datasets which share a commonality of data with diabetes.

Cardiovascular disease prediction using recursive feature elimination and gradient boosting classification techniques

Article

Jun 2022
EXPERT SYST

Cardiovascular diseases are among the most common chronic illnesses that affect people's health. Early detection of cardiovascular diseases can reduce mortality rates by preventing or reducing the severity of the disease. Machine learning algorithms are a promising method for identifying risk factors. This article proposes a recursive feature elimination‐based gradient boosting algorithm in order to obtain accurate heart disease prediction. The patients' health records with important cardiovascular disease features have been analysed for the evaluation of the results. Several other machine learning methods were also used to build the prediction model, and the results were compared with the proposed model. The results of this proposed model infer that the combined recursive feature elimination and gradient boosting algorithm achieves the highest accuracy (89.7%). Further, with an area under the curve of 0.84, the proposed algorithm was found superior and had obtained a substantial gain over other techniques. Thus, the proposed gradient boosting algorithm will serve as a prominent cardiovascular disease estimation and treatment model.

Early Prediction of Diabetes Mellitus using various Artificial Intelligence Techniques: A Technological Review

Article

Jun 2021

Millions of people around the globe are suffering from diabetes. Diabetes mellitus is a chronic and fatal disease that causes a growing number of deaths every year. Most patients (diabetic or potentially diabetic) are not familiar with their health issues and the risk factors they face before the diagnosis of diabetes. Early prediction of diabetes mellitus is a key factor in dealing with the disease. Technological advancement can be used for this purpose to save human lives. The paper reviews substantial work related to diabetes mellitus based on different classification techniques. Furthermore, the paper tries to identify some shortcomings of each methodology. In this paper, a generic smart framework for realistic health management of diabetes mellitus is presented and implemented using the publicly available Pima Indian diabetes dataset sourced from the UCI machine learning repository. Different classification algorithms were employed, namely Decision Tree (DT), Random Forest (RF), eXtreme Gradient Boosting (XGB), AdaBoost (AB) and Gradient Boosting Classifier (GBC). Pre-processing techniques have been employed to improve the data quality assessment. Among all the classifiers, GB outperformed other models with an accuracy rate of 92.20%, followed by RF, XGB, ADB and DT with 91.55%, 89.61%, 89.61% and 88.96%, respectively. Also, the performance measures of these classifiers were calculated in terms of precision, recall, f1-score, etc.

Investigating of Classification Algorithms for Heart Disease Risk Prediction

Article

Jan 2022

Heart Disease Risk Prediction Expending of Classification Algorithms

Article

Jun 2022
CMC-COMPUT MATER CON

Heart disease prognosis (HDP) is a difficult undertaking that requires knowledge and expertise to predict early on. Heart failure is on the rise as a result of today's lifestyle. The healthcare business generates a vast volume of patient records, which are challenging to manage manually. When it comes to data mining and machine learning, having a huge volume of data is crucial for getting meaningful information. Several methods for predicting HD have been used by researchers over the last few decades, but the fundamental concern remains the uncertainty factor in the output data, as well as the need to decrease the error rate and enhance the accuracy of HDP assessment measures. However, in order to discover the optimal HDP solution, this study compares multiple classification algorithms utilizing two separate heart disease datasets from the Kaggle repository and the University of California, Irvine (UCI) machine learning repository. In a comparative analysis, Mean Absolute Error (MAE), Relative Absolute Error (RAE), precision, recall, f-measure, and accuracy are used to evaluate Linear Regression (LR), Decision Tree (J48), Naive Bayes (NB), Artificial Neural Network (ANN), Simple Cart (SC), Bagging, Decision Stump (DS), AdaBoost, Rep Tree (REPT), and Support Vector Machine (SVM). Overall, the SVM classifier surpasses other classifiers in terms of increasing accuracy and decreasing error rate, with RAE of 33.2631 and MAE of 0.165, the precision of 0.841, recall of 0.835, f-measure of 0.833, and accuracy of 83.49 percent for the dataset gathered from UCI. The SC improves accuracy and reduces the error rate for the Kaggle dataset, which is 3.30 % for RAE, 0.016 percent for MAE, 0.984 % for precision, 0.984 percent for recall, 0.984 percent for f-measure, and 98.44 % for accuracy.

Impact of categorical and numerical features in ensemble machine learning frameworks for heart disease prediction

Article

Jul 2022
BIOMED SIGNAL PROCES

Cardiovascular disease (CVD) or heart disease is one of the most fatal diseases of the world that has been observed throughout the last decade. The prediction of CVD in the majority of cases depends on a combination of clinical and pathological data represented by either numerical or categorical variables. The categorical medical data may inherently embed prior medical information during its process of categorisation, whereas the numerical data are flexible for accurate measurements and reading. Hence it is necessary to assess the impact of categorical and numerical features for CVD prediction. In this work, an exhaustive analysis of numerical features, categorical features, and the combination of both types has been done in the context of state-of-the-art machine learning algorithms. The work has compared boosting algorithms such as Gradient Boosting, Extreme Gradient Boosting (XGBoost), AdaBoost, and CatBoost, and additionally artificial neural networks, random forest, support vector machines (SVM), decision tree and logistic regression. A soft voting ensemble mechanism with learning algorithms has also been implemented to predict CVD. The current work has used a publicly available and widely used benchmark dataset: the Cleveland heart disease dataset (UCI repository). It uses ten different performance metrics, which consistently demonstrate that the categorical features outperform the numerical and combined features. It is further observed that the ensemble learning of SVM + AdaBoost classifiers with categorical features produces optimum performance of CVD prediction.

Performance Analysis and Prediction of Type 2 Diabetes Mellitus based on lifestyle data using Machine Learning Approaches

Article

Mar 2022

Objective: Diabetes is a chronic fatal disease that has affected millions of people all over the globe. Type 2 Diabetes Mellitus (T2DM) accounts for 90% of the affected population among all types of diabetes. Millions of T2DM patients remain undiagnosed due to lack of awareness and under-resourced healthcare systems. So, there is a dire need for a diagnostic and prognostic tool that helps healthcare providers, clinicians and practitioners with early prediction, so that they can recommend the lifestyle changes required to stop the progression of diabetes. The main objective of this research is to develop a framework based on machine learning techniques using only lifestyle indicators for prediction of T2DM. Moreover, the prediction model can be used without visits to clinical labs or hospital readmissions. Method: A framework based on machine learning paradigms using lifestyle indicators is proposed and implemented for better prediction of T2DM. The research involved different experts such as diabetologists, endocrinologists, dieticians, and nutritionists in selecting the contributing lifestyle biological features (1552 instances and 11 attributes) to promote health and manage complications of T2DM. The dataset was collected through surveys and Google Forms from different geographical regions. Results: Seven machine learning classifiers were employed, namely K-Nearest Neighbour (KNN), Linear Regression (LR), Support Vector Machine (SVM), Naive Bayes (NB), Decision Tree (DT), Random Forest (RF) and Gradient Boosting (GB). The Gradient Boosting model performed best, with an accuracy rate of 97.24% for training and 96.90% for testing, followed by RF, DT, NB, SVM, LR, and KNN with 95.36%, 92.52%, 90.72%, 90.20%, 90.20% and 77.06%, respectively. However, in terms of precision, RF achieved the highest performance (0.980%) and KNN the lowest (0.793%). As far as recall is concerned, GB achieved the highest rate of 0.975% and KNN showed the worst rate of 0.774%. Also, ELM performed best in terms of f1-score. According to the ROCs, GB and NB had a better area under the curve compared to the others. Conclusion: The research developed a realistic health management system for T2DM based on machine learning techniques using only lifestyle data for prediction of T2DM. To extend the current study, these models shall be used for different, large and real-time datasets which share the commonality of data with T2DM to establish the efficacy of the proposed system.
