978-1-6654-9299-7/22/$31.00 ©2022 IEEE
MLHeartDis: Can Machine Learning Techniques
Enable to Predict Heart Diseases?
Muntasir Mamun
Department of Computer Science
The University of South Dakota, Vermillion, SD, USA, 57069
Muntasir.Mamun@coyotes.usd.edu

Md. Milon Uddin
Department of Electrical Engineering
The University of Texas at Tyler, Tyler, TX, USA 75799
muddin3@patriots.uttyler.edu

Vivek Kumar Tiwari
Department of Electronic Engineering
The University of Texas at Tyler, USA, 75701
vtiwari@patriots.uttyler.edu

Asm Mohaimenul Islam
Department of Computer Science
University of South Dakota, SD, USA 57069
asm.islam@coyotes.usd.edu

Ahmed Ullah Ferdous
Department of Electronic and Telecommunication
University of Liberal Arts Bangladesh, Dhaka, Bangladesh
ahmdferdous@gmail.com
Abstract— Heart disease is one of the leading causes of
death in the contemporary world. The three major risk
factors for heart disease are smoking, high blood pressure,
and high cholesterol, and 47% of all US citizens have at least
one of these risk factors. In the field of clinical data analysis,
predicting cardiovascular disease is a major difficulty. Here,
machine learning (ML) can be important for making decisions
and predictions about heart disease based on personal key
indicators (e.g., blood pressure, cholesterol level, smoking,
diabetic status, obesity, stroke, alcohol drinking). In this
paper, we evaluate six machine learning models using survey
data of over 400k US residents from the year 2020. The six
models (XGBoost, AdaBoost, Random Forest, Decision Tree,
Logistic Regression, and Naïve Bayes) are compared in
detail. With the logistic regression model, we achieved an
accuracy of 91.57% for the prediction of heart disease.
Keywords— Machine learning, heart disease prediction, centers
for disease control and prevention (CDC), classification
algorithms, cardiovascular disease (CVD), regression model
I. INTRODUCTION
The heart is an essential organ of the human body; life
depends on its proper functioning. According to the World
Health Organization, heart disease will cause over 23.6
million deaths worldwide by 2030 [1].
Numerous heart problems are categorized as cardiovascular
diseases (CVDs). Coronary heart disease, the most frequent
of them, may cause heart attacks, which claim the lives of
more than 370,000 people annually. Heart failure is one of
the first CVD presentations and another cause of morbidity
and mortality. The World Heart Federation recently listed
risk factors that raise the incidence of heart failure, including
arterial hypertension, diabetes, smoking, damaged heart
muscle, malfunctioning heart valves, and obesity [2].
Machine learning (ML) is a recent technology for evaluating
clinical data and generating predictions for early disease
detection. It is used for computer-aided detection tasks such
as cancer diagnosis [3], heart failure diagnosis [4],
Alzheimer's detection using brain MRI images [5], and
Parkinson's disease detection [6], as well as in other fields
such as virtual reality [7] and 360-degree video caching [8].
In this research, we use a dataset of personal health
indicators to determine the performance of six machine
learning algorithms (MLAs). Machine learning is a
fast-expanding trend in the healthcare field as a result of
improvements in wearable technology and sensors that use
data to evaluate a patient's health in real time. By using
machine learning to anticipate heart disease, an accurate
diagnosis can be made at a lower cost than with conventional
methods [9].
Our study's primary contribution in this regard is the
identification of the top features from the raw dataset, with a
primary focus on the prediction of heart disease using
machine learning techniques. This would enable quick and
correct treatment of the identified risk factors during any
necessary preventive diagnosis of cardiovascular illnesses.
We have also compared our results with previous works.
The paper is structured as follows: Section 2 covers the
literature on using machine learning to detect heart problems.
Section 3 presents the methodology. Section 4 describes the
dataset. Section 5 presents and discusses the results. Finally,
Section 6 summarizes our findings and our recommendations
for future work.
II. LITERATURE REVIEW
Using Logistic Regression, Random Forest, Support
Vector Machine, Gaussian Naïve Bayes, Gradient Boosting,
K-nearest neighbors, Multinomial Naïve Bayes, and Decision
Trees, Padmaja et al. [10] developed a machine learning
2022 IEEE 13th Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON) | 978-1-6654-9299-7/22/$31.00 ©2022 IEEE | DOI: 10.1109/UEMCON54665.2022.9965714
model to identify cardiac illnesses. To validate the process,
the authors used the UCI repository-based Cleveland data for
classification, which has 303 data samples with 14 attributes.
The authors found that Random Forest achieved the highest
accuracy, 93.44%, and outperformed the other machine
learning models. In addition, the performance of the classifier
was enhanced by shortening the execution time and selecting
important features from the input data set using the chi-square
feature selection approach.
Mary et al. [11] applied several machine learning
algorithms to the Cleveland dataset to predict cardiac
disease. The dataset has 303 instances and 14 attributes.
The authors used KNN, SVM, Logistic Regression,
Neural Network, RF, Naïve Bayes, DT, and GDBT-Bagging
Tree machine learning models for prediction. They
achieved the highest accuracy of 90% using the Artificial
Neural Network (ANN) algorithm with producing SAE.
For predicting heart illnesses, Singh et al. [12] employed
four different machine learning models. The authors used the
UCI repository-based Cleveland dataset, which has 303
instances and 14 attributes. They found that k-nearest
neighbor achieved the highest accuracy, 87%, for predicting
heart disease.
To predict heart disease, Shah et al. [13] employed
supervised machine learning methods such as Naïve Bayes,
decision trees, K-nearest neighbors, and random forests. The
authors used the UCI repository-based Cleveland dataset,
which has 303 instances and 14 attributes, and pre-processed
it with WEKA tools. They implemented the four models for
prediction and found that k-nearest neighbor achieved the
highest accuracy, 90.789%.
Ghosh et al. [14] used a combined UCI-based dataset
(Cleveland, Long Beach VA, Switzerland, Hungarian, and
Statlog) with more than 1190 instances and 14 attributes for
heart disease prediction. Across all the models evaluated, the
authors achieved the highest accuracy, 99.05%, using the
Random Forest Bagging Method (RFBM) model.
Based on patients' available health indicators, the
authors in [15] developed a framework that is effective at
predicting risk factors. The study's goal is to shed light on
the best safeguards that medical specialists can employ in the
case of heart disease risk. The algorithms employed include
C4.5, SVM, CMAR, and Bayesian classifiers. The system is
trained and tested using a 10-fold method, which is a
drawback of this approach.
Machine learning techniques were employed in [16] to
analyze the raw data and deliver the patient's disease
prediction and health status. The hybrid approach employed
by the authors combines the strengths of fuzzy logic and the
k-nearest neighbor algorithm, with a reported accuracy of 94
percent.
The study in [17] examines ensemble classification,
which combines multiple classifiers to improve on weak
individual predictions. The prediction model is presented
with different feature combinations and several well-known
classification techniques. The authors reported an improved
performance level of 88.7 percent.
The authors in [18] evaluated the performance of 10
algorithms in classifications with two and four attributes.
Their findings show that the most important CVD risk factors
in the majority of the datasets are age, heart rate, and blood
pressure, followed by weight, cholesterol, smoking, serum
creatinine, ejection fraction, type of chest pain, number of
arteries, platelet count, and obesity. All of these
characteristics stood out in the prediction performance study
and therefore affect CVD detection.
The convolutional neural network was employed by
the authors in [19] to forecast heart diseases. They used some
key variables that can help predict a person's susceptibility to
heart disease, including age, sex, cholesterol, and ECG slope.
There are four levels in use. The highest accuracy (87.09
percent) is obtained with the exponential linear unit (ELU)
activation. Because the data set used in that study was very
limited, expanding it could further increase accuracy.

Table 1. Performance of previous work

| Authors (year) | Dataset collection (samples) | Applied models | Performance (proposed model) |
|---|---|---|---|
| Mary et al. (2020) [11] | UCI Cleveland dataset (303) | Support Vector Machine, Logistic Regression, Artificial Neural Network (proposed), K-Nearest Neighbor, Random Forest, Naïve Bayes, Decision Tree, and GDBT-Bagging Tree | Accuracy: 90% |
| Singh et al. (2020) [12] | UCI Cleveland dataset (303) | K-Nearest Neighbor (proposed), Decision Tree, Linear Regression, and Support Vector Machine | Accuracy: 87% |
| Shah et al. (2020) [13] | UCI Cleveland dataset (303) | Naïve Bayes, Decision Tree, K-Nearest Neighbor (proposed), and Random Forest | Accuracy: 90.789% |
| Padmaja et al. (2021) [10] | UCI Cleveland dataset (303) | Logistic Regression, Random Forest (proposed), Support Vector Machine, Gaussian Naïve Bayes, Gradient Boosting, K-Nearest Neighbors, Multinomial Naïve Bayes, and Decision Trees | Accuracy: 93.44% |
| Ghosh et al. (2021) [14] | UCI (Cleveland + Long Beach VA + Switzerland + Hungarian + Statlog) (1190+) | Decision Tree Bagging Method (DTBM), Random Forest Bagging Method (RFBM) (proposed), K-Nearest Neighbors Bagging Method (KNNBM), AdaBoost Boosting Method (ABBM), and Gradient Boosting Boosting Method (GBBM) | Accuracy: 99.05% |
| Ours (2022) | Kaggle dataset (319795) | XGBoost, AdaBoost, Random Forest, Decision Tree, Naïve Bayes, and Logistic Regression (proposed) | Accuracy: 91.57% |
III. METHODOLOGY
In the proposed methodology, data collection is followed by
pre-processing. The chosen classifiers (XGBoost, AdaBoost,
Random Forest, Decision Tree, Naïve Bayes, and Logistic
Regression) are then trained and tested on the heart disease
dataset using the common hold-out validation method. The
results are computed and examined to identify the best
approach for predicting heart disease. An outline of the
proposed strategy is shown in Fig. 1.
A. Dataset Collection
We used a dataset titled "Key Indicators of Heart Disease"
in this paper that was obtained from the Kaggle online
platform [20]. The dataset has 319,795 instances and 18
attributes, of which 1 is the class attribute and 17 are
predictive. Accurate heart disease prediction depends on
using these attributes appropriately, since the attributes
describe the symptoms. The predictive attributes include
gender, age, race, obesity (high BMI), diabetic condition,
lack of physical activity, physical health, general health,
mental health, heavy alcohol drinking, smoking, stroke
status, difficulty walking, asthma, kidney disease, and skin
cancer; the class attribute is heart disease.
B. Dataset Pre-process:
The original dataset of approximately 300 variables was
reduced to about 18 variables (9 booleans, 5 strings, and
4 decimals) for predicting heart disease. Pre-processing
consisted of feature extraction, data cleaning, missing-value
handling, and transformation of categorical variables.
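The pre-processing steps above can be illustrated with a minimal sketch. The paper does not publish its code, so everything here is an assumption made for illustration: the column names, the ordering of the health categories, and the choice to drop (rather than impute) records with missing values.

```python
# Hypothetical column names and category ordering, for illustration only.
BINARY_COLUMNS = ["HeartDisease", "Smoking", "Stroke"]
ORDINAL_LEVELS = {"GenHealth": ["Poor", "Fair", "Good", "Very good", "Excellent"]}


def encode_row(row):
    """Return a numeric copy of one survey record, or None if a value is missing."""
    encoded = {}
    for name, value in row.items():
        if value is None or value == "":
            return None  # assumed cleaning rule: drop records with missing values
        if name in BINARY_COLUMNS:
            encoded[name] = 1 if value == "Yes" else 0  # boolean Yes/No answers
        elif name in ORDINAL_LEVELS:
            encoded[name] = ORDINAL_LEVELS[name].index(value)  # ordered strings
        else:
            encoded[name] = float(value)  # decimal measurements such as BMI
    return encoded


def preprocess(rows):
    """Encode every record and discard the ones with missing values."""
    encoded_rows = (encode_row(row) for row in rows)
    return [row for row in encoded_rows if row is not None]


sample = [
    {"HeartDisease": "No", "Smoking": "Yes", "Stroke": "No",
     "GenHealth": "Good", "BMI": "26.5"},
    {"HeartDisease": "Yes", "Smoking": "No", "Stroke": "Yes",
     "GenHealth": "Poor", "BMI": ""},  # missing BMI, so this record is dropped
]
print(preprocess(sample))
```

Nominal variables with no natural order (race, for example) would more typically be one-hot encoded; the ordinal mapping shown here only suits ordered categories.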
C. Validation Process:
It is essential to choose the right validation procedure for a
specific dataset. Hold-out validation is most effective when
the dataset is large [21]. We applied hold-out validation,
training on 80% of the dataset and testing on the remaining
20%. Using this process, we calculated the accuracy,
sensitivity, specificity, precision, area under the curve, and
F1-score performance metrics for each machine learning
approach. The performance metrics and output graphs are
presented in detail in the results section. The overall
workflow of the research is summarized step by step in the
flowchart in Fig. 1.
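As a rough sketch of the 80/20 hold-out procedure described above, the split can be written in a few lines. The function name, the fixed seed, and the use of plain shuffling (rather than a stratified split) are assumptions; the paper does not state these details.

```python
import random


def hold_out_split(records, test_fraction=0.2, seed=42):
    """Shuffle the records once, then split them into training and test subsets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Integer stand-ins for the 319,795 survey instances.
records = list(range(319795))
train, test = hold_out_split(records)
print(len(train), len(test))
```

With 319,795 instances this yields 255,836 training records and 63,959 test records. Because heart disease is the minority class, a stratified split that preserves the class ratio in both subsets may be preferable in practice.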
Fig. 1. An overview of study
IV. DATASET
The dataset comes from the CDC and is a key component of
the Behavioral Risk Factor Surveillance System (BRFSS),
which conducts annual telephone surveys to gather data on
Americans' health conditions. The BRFSS collects data from
all 50 states, the District of Columbia, and three U.S.
territories. It is the largest continually running health survey
system in the world, conducting over 400,000 adult
interviews each year. The dataset used here has 319,795
instances and 18 attributes, of which 1 is the class attribute
and 17 are predictive. The great majority of the columns are
questions that ask respondents about their health, such as
"Do you have considerable trouble walking or climbing
stairs?" or "Do you have a lifetime cigarette smoking total of
at least 100? (5 packs equal 100 cigarettes)". We found
numerous variables (questions) in this dataset that either
directly or indirectly affect heart disease, so we chose the
most pertinent ones and cleaned the data so that it could be
used in machine learning projects. Only roughly 20 variables
remained from the original dataset's almost 300 variables.
Heart disease was treated as a binary variable, with "Yes"
denoting the presence of heart disease and "No" denoting its
absence.
V. RESULTS
Accuracy, F1 score, precision, recall, specificity, and area
under the curve were used to assess the performance of the
six machine learning techniques. Table 2 displays a
comparison of the various models.
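For reference, the threshold-based metrics listed above follow from the four confusion-matrix counts. The sketch below uses the standard definitions; the counts in the example are hypothetical and are not taken from the paper's experiments.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (sensitivity)."""
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp, fp, tn, fn):
    """Compute the evaluation metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # recall, or true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1_score(precision, sensitivity),
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
print(classification_metrics(tp=90, fp=60, tn=40, fn=10))

# Consistency check against the XGBoost row of Table 2
# (precision 99.10%, sensitivity 92.22%).
print(round(f1_score(99.10, 92.22), 2))
```

The second call gives about 95.54, in line with the 95.53 reported for XGBoost in Table 2. The hypothetical example also shows how a model can combine high sensitivity (0.90) with low specificity (0.40), which is why no single metric is sufficient on an imbalanced dataset.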
Table 2. Comparison of results
Accuracy alone, however, is not an adequate metric for
evaluating a model's performance. The AUC value is also an
important metric that assesses a model's capacity to
distinguish between classes. The ROC curve plots the True
Positive Rate against the False Positive Rate at various
probability thresholds, and the AUC is the area under that
curve. The higher the AUC, the better the model separates
the positive and negative classes. AUC values range from 0
to 1, where 0 denotes a perfectly inaccurate test and 1 denotes
a perfectly accurate test. An AUC of 0.5 generally means
there is no discrimination (i.e., no capacity to distinguish
between patients with and without the disease or condition),
0.7 to 0.8 is considered acceptable, 0.8 to 0.9 is regarded as
excellent, and higher than 0.9 is regarded as outstanding
performance [22].
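To make the interpretation of AUC concrete, the sketch below computes it with the pairwise (rank-based) formulation: the fraction of positive/negative pairs in which the positive example receives the higher score, with ties counted as one half. This is equivalent to the area under the ROC curve. The function name and the toy labels and scores are illustrative assumptions, not the paper's code or data.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0  # positive correctly ranked above negative
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # hypothetical predicted probabilities
print(roc_auc(labels, scores))  # 0.75
```

A model that gives every example the same score obtains 0.5 under this definition, which matches the "no discrimination" reading above. The double loop is quadratic in the number of examples, so a sort-based implementation would be preferable for a test set of this study's size.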
We report the AUC curves and average outcomes using a
hold-out validation procedure that trains on 80% of the
dataset and tests on the remaining 20%. The results of
XGBoost, AdaBoost, and Naïve Bayes are comparable.
When accuracy, specificity, and F1-score are taken into
account, logistic regression outperforms the other models
(shown in Table 2).
Fig. 2. AUC curve for AdaBoost
Fig. 3. AUC curve for Decision Tree
Fig. 4. AUC curve for Logistic Regression
Fig. 5. AUC curve for Naïve Bayes
Fig. 6. AUC curve for Random Forest
Fig. 7. AUC curve for XGBoost
The highest AUC score, 0.84, is obtained by AdaBoost,
Logistic Regression, and XGBoost. On the other hand, the
Naïve Bayes model scores only 0.65, and the Decision Tree
has the lowest AUC listed in Table 2, at 0.59.
VI. CONCLUSION
Predicting heart disease is one of the difficult tasks in
medicine. If the disease is recognized early, the death rate
can be significantly reduced, and machine learning
techniques can support this recognition. Six machine learning
algorithms were employed in this paper to predict heart
disease. In our findings, the logistic regression algorithm
showed the best accuracy, 91.57%, for the prediction of heart
disease. In future research, accuracy may be increased by
using a larger dataset and selecting features more efficiently.
REFERENCES
[1] V. Krishnaiah, G. Narsimha, N. Subhash Chandra, “Heart Disease
Prediction System using Data Mining Techniques and Intelligent Fuzzy
Approach: A Review”, International Journal of Computer Applications,
February 2016
[2] What is CVD?—World Heart Federation. Available online:
https://world-heart-federation.org/what-is-cvd/ (accessed on 15 July 2022)
[3] M. Mamun, A. Farjana, M. Al Mamun and M. S. Ahammed, "Lung
cancer prediction model using ensemble learning techniques and a
systematic review analysis," 2022 IEEE World AI IoT Congress (AIIoT),
2022, pp. 187-193, doi: 10.1109/AIIoT54504.2022.9817326.
[4] M. Mamun, A. Farjana, M. A. Mamun, M. S. Ahammed and M. M.
Rahman, "Heart failure survival prediction using machine learning
algorithm: am I safe from heart failure?," 2022 IEEE World AI IoT Congress
(AIIoT), 2022, pp. 194-200, doi: 10.1109/AIIoT54504.2022.9817303.
[5] M. Mamun, S. B. Shawkat, M. S. Ahammed, M. M. Uddin, M. I.
Mahmud, A. M. Islam, "Deep Learning Based Model for Alzheimer's
Disease Detection Using Brain MRI Images", 2022 IEEE 13th Annual
Ubiquitous Computing, Electronics & Mobile Communication Conference
(UEMCON), 2022, (Preprint)
[6] M. Mamun, M. I. Mahmud, M. I. Hossain, A. M. Islam, M. S. Ahammed,
M. M. Uddin, "Vocal Feature Guided Detection of Parkinson's Disease
Using Machine Learning Algorithms", 2022 IEEE 13th Annual Ubiquitous
Computing, Electronics & Mobile Communication Conference (UEMCON),
2022, (Preprint)
[7] M. M. Uddin and J. Park, "Machine learning model evaluation for 360°
video caching," 2022 IEEE World AI IoT Congress (AIIoT), 2022, pp. 238-
244, doi: 10.1109/AIIoT54504.2022.9817292.
[8] M. Milon Uddin and J. Park, "360 Degree Video Caching with LRU &
LFU," 2021 IEEE 12th Annual Ubiquitous Computing, Electronics &
Mobile Communication Conference (UEMCON), 2021, pp. 0045-0050, doi:
10.1109/UEMCON53757.2021.9666668.
[9] T. Karayılan, O. Kılıç, “Prediction of Heart Disease Using Neural
Network”, 2nd International Conference of Computer Science and
Engineering, IEEE, 2017.
[10] B Padmaja, E. (2022). Early and Accurate Prediction of Heart Disease
Using Machine Learning Model. Retrieved 13 July 2022, from
https://turcomat.org/index.php/turkbilmat/article/view/8438
| Model | Acc. (%) | Sen. (%) | Spec. (%) | Precision (%) | F1-score (%) | AUC |
|---|---|---|---|---|---|---|
| XGBoost | 91.50 | 92.22 | 50.28 | 99.10 | 95.53 | 0.84 |
| AdaBoost | 91.55 | 92.35 | 51.70 | 99 | 95.56 | 0.84 |
| Random Forest | 90.28 | 92.40 | 33.09 | 97.46 | 94.86 | 0.79 |
| Decision Tree | 86.44 | 92.96 | 22.65 | 92.16 | 92.56 | 0.59 |
| Logistic Regression | 91.57 | 92.32 | 52.61 | 99.07 | 95.58 | 0.84 |
| Naïve Bayes | 91.40 | 91.51 | 15.38 | 99.96 | 95.55 | 0.65 |
[11] Mary, M. (2020). Heart Disease Prediction using Machine Learning
Techniques: A Survey. International Journal For Research In Applied
Science And Engineering Technology, 8(10), 441-447.
[12] A. Singh and R. Kumar, "Heart Disease Prediction Using Machine
Learning Algorithms," 2020 International Conference on Electrical and
Electronics Engineering (ICE3), 2020, pp. 452-457
[13] Shah, D., Patel, S. & Bharti, S.K. Heart Disease Prediction using
Machine Learning Techniques. SN COMPUT. SCI. 1, 345 (2020).
[14] Ghosh, P., Azam, S., Jonkman, M., Karim, A., Shamrat, F., & Ignatious,
E. et al. (2021). Efficient Prediction of Cardiovascular Disease Using
Machine Learning Algorithms With Relief and LASSO Feature Selection
Techniques. IEEE Access, 9, 19304-19326.
[15] Purushottama. C, Kanak Saxenab, Richa Sharma (2016), “Efficient
Heart Disease Prediction System”, Elsevier, Procedia Computer Science,
No. 85, pp. 962 – 969
[16] Sharanyaa, S., S. Lavanya, M. R. Chandhini, R. Bharathi, and K.
Madhulekha. "Hybrid Machine Learning Techniques for Heart Disease
Prediction." International Journal of Advanced Engineering Research and
Science 7, no. 3 (2020), pp 44-8.
[17] B. Keerthi Samhitha, M. R. Sarika Priya., C. Sanjana., S. C. Mana and
J. Jose, "Improving the Accuracy in Prediction of Heart Disease using
Machine Learning Algorithms," 2020 International Conference on
Communication and Signal Processing (ICCSP), 2020, pp. 1326-1330.
[18] L. R. Guarneros-Nolasco, N. A. Cruz-Ramos, G. Alor-Hernández, L.
Rodríguez-Mazahua, and J. L. Sánchez-Cervantes, “Identifying the main risk
factors for cardiovascular diseases prediction using machine learning
algorithms,” Mathematics, vol. 9, no. 20, p. 2537, 2021.
[19] I. Dokare, A. Prithiani, H. Ochani, S. Kanjan, and D. Tarachandani,
“Prediction of having a heart disease using machine learning,” SSRN
Electronic Journal, 2020.
[20] Kamil Pytlak. (2022, February). Personal Key Indicators of Heart
Disease. Version 1. Retrieved July 3, 2022 from
https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease.
[21] Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., & Roth,
A. (2015). The reusable holdout: Preserving validity in adaptive data
analysis. Science, 349(6248), 636-638. doi: 10.1126/science.aaa9375
[22] Mandrekar, J. (2010). Receiver Operating Characteristic Curve in
Diagnostic Test Assessment. Journal Of Thoracic Oncology, 5(9),
1315-1316. doi: 10.1097/jto.0b013e3181ec173d