Cardiovascular Disease Forecast using Machine
Learning Paradigms
Saiful Islam
Senior Lecturer
Daffodil International University
E-mail: saiful.cse@diu.edu.bd
Nusrat Jahan
Senior Lecturer
Daffodil International University
E-mail: nusratjahan.cse@diu.edu.bd
Mst. Eshita Khatun
Lecturer
Daffodil International University
E-mail: eshita.cse@diu.edu.bd
Abstract— Cardiovascular disease (CVD) is an increasingly common
cause of death worldwide among the non-communicable diseases.
South Asian populations in particular face a high risk of
cardiovascular disease at an earlier age than other ethnic
groups. Predicting cardiovascular disease is often challenging
for medical practitioners because it requires considerable
experience and knowledge. The health industry holds enormous
amounts of data whose hidden information can support effective
conclusions. To obtain appropriate results and make effective
decisions from such data, data analysis techniques such as Naive
Bayes and Decision Tree are used. Attributes such as age, gender,
blood pressure, and stress can be used to predict the chance of
cardiovascular disease. In this study, we collected 301 samples
with 12 clinical attributes. Logistic regression, Decision tree,
SVM, and Naive Bayes classification algorithms were applied to
predict heart disease. Logistic regression provided the best
accuracy, 86.25%. We also compared our results with results
based on the UCI dataset.
Keywords— Classification Algorithm, Heart Diseases, Decision
Tree, SVM, Logistic Regression, Naive Bayes, UCI dataset.
I. INTRODUCTION
Cardiovascular disease is one of the leading causes of death
throughout the world. More people die from cardiovascular
diseases than from any other cause. According to the World
Health Organization, it accounts for nearly one in every three
deaths worldwide each year. If the heart stops functioning
normally, the body's other organs stop working as well.
Mortality from cardiovascular disease is increasing in many
countries, including Bangladesh. In 2017, the World Health
Organization estimated that deaths from heart disease in
Bangladesh reached 14.31 percent of total deaths, and that
cardiovascular disease kills 17.9 million people every year, 31
percent of global deaths [1].
In the medical domain, data mining can be used to extract
information from hidden patterns in a dataset. At present,
medical data often arrives in a distributed way, as paper
documents, and needs to be brought into a structured form. The
data must be pre-processed before machine learning algorithms
are applied in order to obtain more accurate results. With data
mining techniques, the extracted data helps to predict the
medical diagnosis. A prediction-based approach also helps
doctors take the right steps to treat the patient in time, using
previous patterns in the dataset. The data mining techniques and
the prediction model are responsible for a proper prognosis of
the disease [2].
The rate of cardiovascular disease increases gradually in
relation to lifestyle habits such as smoking, high fat intake,
and lack of physical activity. The heart is a pump that
circulates blood through the whole body. According to the
United States National Institutes of Health (NIH), heart rate
varies from person to person depending on different parameters
and averages 60-100 beats per minute [3].
This study aims to find an appropriate model to predict
cardiovascular disease. Heart disease is also one of the major
causes of death in our country. It is crucial to make people
aware of the risk factors of cardiovascular diseases.
In Section II we discuss a few related works. Section III
presents our proposed approach to find the risk factors as well
as a better model for heart disease prediction. Section IV
discusses the results, and Section V concludes with some notes
for future work.
II. RELATED WORK
This section presents the research demand on this topic and
some works that motivate our study. We found that a lot of
research has focused on cardiovascular disease. We were keen to
know the risk factors for predicting heart disease.
Nabaouia Louridi, Meryem Amar and Bouabid El Ouahidi used
different machine learning algorithms to identify cardiovascular
disease (CVD) and finally proposed an SVM with a linear kernel.
In this approach they used 13 features and found an accuracy of
86.8% [4]. In 2019, N. Satish Chandra Reddy, Song Shue Nee, Lim
Zhi Min and Chew Xin Ying stated that Random Forest can be used
as a classification algorithm to train a system for identifying
CVD with 90% to 95% accuracy. They used 14 features [5].
Using the dataset of the UCI library, in 2018 Aditi Gavhane,
Gouthami Kokkula, Isha Pandya and Prof. Kailas Devadkar
conducted a study to predict heart disease using 13 of 76
features. In this study they used an MLP and obtained an
average precision of 0.91 [6].
Proceedings of the Fourth International Conference on Computing Methodologies and Communication (ICCMC 2020)
IEEE Xplore Part Number:CFP20K25-ART; ISBN:978-1-7281-4889-2
978-1-7281-4889-2/20/$31.00 ©2020 IEEE 487
2020 Fourth International Conference on Computing Methodologies and Communication (ICCMC) 978-1-7281-4889-2/20/$31.00 ©2020 IEEE 10.1109/ICCMC48092.2020.ICCMC-00091
Sonakshi Harjai and Sunil Kumar Khatri used correlation-based
feature selection and a Multilayer Perceptron classifier to
propose a new model. 297 inputs were tested, and the proposed
model reached an accuracy of 89.2% [7].
In 2019, Senthilkumar Mohan, Chandrasegar Thirumalai and
Gautam Srivastava proposed a Hybrid Random Forest with a linear
model, with 88.7% accuracy, to predict heart disease. In this
study they used the UCI machine learning repository to collect
the dataset. They used multi-class variables and binary
classification for data pre-processing [8].
In 2018, Uma N Dulhare described a methodology to improve the
performance of a Bayesian classifier for predicting heart
disease. The author used the Statlog (Heart) dataset from UCI,
which has 14 attributes with 1 class label and 270 instances
with no missing values. The author obtained 87.91% accuracy for
Particle Swarm Optimization (PSO) with a Naive Bayes
classifier [9].
So, it is clear that a proper solution is needed to prevent
heart disease as early as possible. This is why researchers all
over the world are keen to work on cardiovascular disease
prediction, which helps to identify the risk factors of this
disease.
III. PROPOSED MODEL
In this study we address the problem of predicting heart
disease. We collected 301 samples with 13 attributes [10]. This
study proposes an approach that classifies the risk of having
cardiovascular disease. In the proposed system, we used
Logistic regression, Naive Bayes, SVM, and Decision Tree
classification algorithms to obtain better accuracy in
predicting heart disease. Finally, we analysed the obtained
results by comparing the models through their confusion matrices.
A total of 301 samples with 13 attributes were obtained, as
mentioned. The samples were then divided into two sets:
training data (80%) and testing data (20%). To avoid bias, the
samples for each set were selected randomly. Fig. 1 illustrates
the whole working process. We started our study with feature
selection, taking the UCI dataset (Heart) as a reference for
comparison. After feature selection we collected all sample
data. At the end of the study, we applied the four selected
classifier algorithms to predict the risk level of
cardiovascular disease. Among the four classifiers, logistic
regression gave the best result. Finally, our model is ready to
predict heart disease.
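The random 80/20 partition described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `split_train_test`, the fixed seed, and the use of integer indices in place of real patient records are all assumptions.

```python
import random

def split_train_test(samples, train_fraction=0.8, seed=42):
    """Shuffle a copy of the samples and cut it into train and test parts."""
    shuffled = list(samples)
    # A seeded generator makes the random selection repeatable (assumption).
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Stand-in for the 301 collected records (indices only, no real data).
records = list(range(301))
train, test = split_train_test(records)
print(len(train), len(test))
```

With 301 records this cut yields 240 training and 61 testing samples, which agrees with the total support of 61 reported for the logistic regression results in Table 4.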
Logistic regression is a machine learning algorithm borrowed
from the area of statistics and is known as the go-to technique
for classification problems. It is similar to linear regression
in that both have the goal of estimating the values of the
parameters, or coefficients, of the model. After training the
model on the training data, we evaluated it on the testing
data. We obtained 86.25% accuracy for logistic regression.
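To make the idea of estimating coefficients concrete, the sketch below fits a one-feature logistic regression by batch gradient descent. It is only an illustration under stated assumptions: the paper does not say which solver or library was used, and the toy feature values, learning rate, and function names here are hypothetical.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=500):
    """Estimate coefficients by batch gradient descent on the log-loss."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log-loss w.r.t. the score
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def predict(w, b, x):
    """Predict class 1 when the estimated probability is at least 0.5."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if sigmoid(score) >= 0.5 else 0

# Made-up, standardized values of a single feature; label 1 = at risk.
X = [[-2.0], [-1.5], [-1.0], [-0.5], [0.5], [1.0], [1.5], [2.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = fit_logistic(X, y)
print([predict(w, b, x) for x in X])
```

The learned weight `w` plays the role of the coefficient mentioned in the text, and the sigmoid turns the linear score into a class probability.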
Naive Bayes is another classification algorithm, which uses
Bayes' theorem with independence assumptions between features.
A one-dimensional Naive Bayes classifier computes the ratio of
the log probabilities of the features belonging to each of the
classes. Naive Bayes does not consider the correlation between
attributes. Naive Bayes is a very scalable classifier, but it
can create a bias towards one or more attributes. In our
approach we obtained 73.77% accuracy with Naive Bayes.
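The independence assumption means that a class score is just the log prior plus a sum of per-feature log-probabilities. The sketch below shows this for categorical features. It is an illustration only: the add-one (Laplace) smoothing, the function names, and the two toy attributes are assumptions that do not come from the paper.

```python
import math
from collections import Counter, defaultdict

def fit_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, label) -> value counts
    vocab = defaultdict(set)             # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(j, label)][value] += 1
            vocab[j].add(value)
    return class_counts, value_counts, vocab

def predict_naive_bayes(model, row):
    """Return the class with the highest sum of log-probabilities."""
    class_counts, value_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        for j, value in enumerate(row):
            # Add-one smoothing avoids log(0) for unseen values (assumption).
            num = value_counts[(j, label)][value] + 1
            den = count + len(vocab[j])
            score += math.log(num / den)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up records: (smoker, exercise level); label 1 = at risk.
rows = [("yes", "low"), ("yes", "low"), ("yes", "normal"), ("no", "low"),
        ("no", "high"), ("no", "normal"), ("no", "high"), ("yes", "high")]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
model = fit_naive_bayes(rows, labels)
print(predict_naive_bayes(model, ("yes", "low")),
      predict_naive_bayes(model, ("no", "high")))
```

Because each feature contributes its own term independently, correlated attributes are effectively counted more than once, which is one way the bias mentioned above can arise.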
In environments with large amounts of data, the Support Vector
Machine (SVM) provides a classification learning model. When
the data domain is divided linearly (e.g., by a straight line
or hyperplane), the model is called a linear SVM. When the data
domain is first transformed into another space, called the
feature space, in which the classes can be separated linearly,
the model is called a non-linear support vector machine [11].
In classifying our dataset we obtained 83.61% accuracy with SVM.
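The difference between the linear and non-linear case can be illustrated with a small feature-space example. The sketch below is not an SVM trainer: the feature map, the hand-picked weights `w` and `b`, and the toy data are all assumptions chosen only to show that data which no single threshold can separate becomes linearly separable after a transformation.

```python
def feature_map(x):
    """Lift a 1-D input into a 2-D feature space (x, x^2)."""
    return (x, x * x)

def linear_decision(w, b, features):
    """Classify by the side of the hyperplane w . f + b = 0."""
    score = sum(wi * fi for wi, fi in zip(w, features)) + b
    return 1 if score > 0 else -1

def separable_by_threshold(xs, ys):
    """Check whether any single threshold on x separates the two classes."""
    pts = sorted(xs)
    mids = [(lo + hi) / 2 for lo, hi in zip(pts, pts[1:])]
    candidates = [pts[0] - 1] + mids + [pts[-1] + 1]
    for t in candidates:
        for sign in (1, -1):
            preds = [1 if sign * (x - t) > 0 else -1 for x in xs]
            if preds == list(ys):
                return True
    return False

# Made-up 1-D data: the positive class lies on both sides of the negative one.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [1, 1, -1, -1, -1, 1, 1]

# Hand-picked hyperplane in feature space: x^2 - 2.5 = 0 (illustrative only).
w, b = (0.0, 1.0), -2.5
preds = [linear_decision(w, b, feature_map(x)) for x in xs]
print(separable_by_threshold(xs, ys), preds == ys)
```

A real SVM would additionally choose the hyperplane that maximizes the margin between the classes, and would typically use a kernel function rather than an explicit feature map.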
Finally, our last selected algorithm was the Decision tree.
The Decision tree is one of the most popular classification and
decision-making algorithms. Different types of decision tree
are available. Here, we used a classification decision tree to
classify the dataset. However, this algorithm did not give a
better result (75.41%). Arundhati Navada et al. in 2011
presented a paper describing the basic idea of the decision
tree, in which they discuss different types of decision tree
with equations and explanations [12].
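A classification tree is built by repeatedly choosing the split that makes the resulting groups as pure as possible. The sketch below shows that single step for one numeric feature. The Gini impurity criterion, the function names, and the toy ages are assumptions for illustration; the paper does not state which splitting criterion was used.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_threshold(values, labels):
    """Find the threshold on one numeric feature minimising weighted Gini."""
    best_t, best_score = None, float("inf")
    pts = sorted(set(values))
    for lo, hi in zip(pts, pts[1:]):
        t = (lo + hi) / 2  # candidate split halfway between neighbours
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Made-up ages and risk labels, for illustration only.
ages = [25, 32, 38, 45, 51, 58, 63, 70]
risk = [0, 0, 0, 0, 1, 1, 1, 1]
print(best_threshold(ages, risk))  # → (48.0, 0.0)
```

A full tree applies the same search over every attribute and then recurses into the left and right groups until a stopping rule is met.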
Fig. 1. Flowchart of our proposed Model
We now describe our dataset. As mentioned, we took the UCI
repository as a standard set of clinical attributes from which
to select features for our study. We then fixed our attribute
list, with a class label, and collected the data through a
Google form. Table 1 presents the attribute list of our
dataset, describing all 12 attributes with their measurement
values.
[Bar chart: Heart Disease Frequency According to Exercise. Categories: Low, Normal, High. Series: Risk, No Risk. Vertical axis: 0 to 100.]
TABLE 1. DESCRIPTION OF ATTRIBUTES
We now explore our dataset. The number of affected people
according to gender and the comparison of age against blood
cholesterol are illustrated in Fig. 2 and Fig. 3, respectively.
Fig. 2. Explore Gender Dataset
Fig. 3. Explore Blood Cholesterol Dataset
Fig. 4. Explore Class label for exercise.
Exercise or physical activity is one of the most effective
factors for predicting heart disease. In Fig. 4, we present the
class label according to daily exercise. The bar chart suggests
that exercise helps to reduce the risk of heart disease.
Fig. 5: Risk Factor of CVD
We show the risk factors of cardiovascular disease (CVD) in
Fig. 5. Here, we found that obesity is one of the crucial risk
factors for CVD.
IV. RESULT ANALYSIS
Our prediction approach was developed with 13 attributes with
class labels. Table 2 presents the results of our proposed
approach. Logistic regression clearly gave the best results
among the four selected models. In our study we applied all
these classifiers to our own collected data. Elsewhere, we
found that studies using the Heart dataset from the UCI
repository reported higher accuracy.
TABLE 2. COMPARISON OF THE ACCURACY OF VARIOUS ALGORITHMS
Classification Algorithm      Accuracy
Logistic Regression           86.25%
Support Vector Classifier     83.61%
Decision Trees                75.41%
Naïve Bayes                   73.77%
A. Comparative Study
In this study we also examined the UCI Heart Disease dataset
and found many research works based on it. Senthilkumar Mohan
et al. worked with the UCI dataset and also applied these four
classification algorithms. Table 3 presents the UCI dataset
based results for comparison with ours.
TABLE 3. UCI DATASET BASED RESULT COMPARISON
Here, we can see that they obtained the lowest accuracy for
Naive Bayes and the best result for SVM, whereas we obtained
the best result for logistic regression. Table 4 presents the
results of the logistic regression classifier.
TABLE 4. LOGISTIC REGRESSION RESULT ANALYSIS
V. CONCLUSION
Heart disease is one of the major causes of death worldwide.
We addressed this issue and proposed an approach to predict the
risk factors so that the disease can be prevented as early as
possible. In this study, logistic regression provided the best
results on 12 attributes. We also compared our results with the
UCI dataset and identified the factors that predict heart
disease. Many research works reported more accurate results,
though they used the UCI repository, whereas we used our own
collected data.
In future, we wish to predict this disease more accurately, and
for this we will have to collect as much data as possible. At
the same time, accurate clinical data is needed to predict any
type of disease. Everyone needs good health to live a beautiful
life; otherwise social development will stall.
REFERENCES
[1] "Cardiovascular diseases." Available: https://www.who.int/en/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds). [Accessed: 25-January-2020].
[2] Ching-seh Mike Wu, Mustafa Badshah, and Vishwa Bhagwat, "Heart Disease Prediction Using Data Mining Techniques," in Proceedings of the 2019 2nd International Conference on Data Science and Information Technology, pp. 7-11, 2019.
[3] "What should my heart rate be?" Available: https://www.medicalnewstoday.com/articles/235710.php#normal-resting-heart-rate. [Accessed: 21-January-2020].
[4] Nabaouia Louridi, Meryem Amar, and Bouabid El Ouahidi, "Identification of Cardiovascular Diseases Using Machine Learning," 7th Mediterranean Congress of Telecommunications (CMT), 2019, DOI: 10.1109/CMT.2019.8931411.
[5] N. Satish Chandra Reddy, Song Shue Nee, Lim Zhi Min, and Chew Xin Ying, "Classification and Feature Selection Approaches by Machine Learning Techniques: Heart Disease Prediction," International Journal of Innovative Computing, 2019, DOI: https://doi.org/10.11113/ijic.v9n1.210.
[6] Aditi Gavhane, Gouthami Kokkula, Isha Pandya, and Prof. Kailas Devadkar (PhD), "Prediction of Heart Disease Using Machine Learning," ICECA 2018, IEEE Xplore ISBN: 978-1-5386-0965-1.
[7] Sonakshi Harjai and Sunil Kumar Khatri, "An Intelligent Clinical Decision Support System Based on Artificial Neural Network for Early Diagnosis of Cardiovascular Diseases in Rural Areas," AICAI, 2019, DOI: 10.1109/AICAI.2019.8701237.
[8] Senthilkumar Mohan, Chandrasegar Thirumalai, and Gautam Srivastava, "Effective Heart Disease Prediction Using Hybrid Machine Learning Techniques," Special Section on Smart Caching, Communications, Computing and Cybersecurity for Information-Centric Internet of Things, IEEE Access, vol. 7, 2019, DOI: 10.1109/ACCESS.2019.2923707.
[9] M. Duraipandian, "Performance Evaluation of Routing Algorithm for MANET based on the Machine Learning Techniques," Journal of Trends in Computer Science and Smart Technology (TCSST), vol. 1, no. 01, pp. 25-38, 2019.
[10] "Dataset." Available: https://github.com/istyak/hd/blob/master/a4.csv. [Accessed: 12-Dec-2019].
[11] S. Suthaharan, "Support Vector Machine," in Machine Learning Models and Algorithms for Big Data Classification, Integrated Series in Information Systems, vol. 36. Springer, Boston, MA, 2016.
[12] Arundhati Navada, Aamir Nizam Ansari, Siddharth Patil, and Balwant A. Sonkamble, "Overview of Use of Decision Tree algorithms in Machine Learning," IEEE Control and System Graduate Research Colloquium, 2011.
Values for TABLE 3:
Classification Algorithm      Accuracy
Logistic Regression           82.9%
Support Vector Classifier     86.1%
Decision Trees                85%
Naïve Bayes                   75.8%
Values for TABLE 4:
Class Label   Precision   Recall   F1-score   Support
0             0.82        0.97     0.89       34
1             0.95        0.74     0.83       27
Avg           0.89        0.86     0.86       61
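The precision, recall and F1-score values of Table 4 all derive from a confusion matrix, and the sketch below shows the arithmetic. The counts are an assumption: they are inferred from the rounded recall and support values in Table 4 (33 of 34 class-0 samples and 20 of 27 class-1 samples correct) and are not reported in the paper.

```python
def per_class_metrics(matrix, k):
    """Precision, recall and F1 for class k from a square confusion matrix
    whose rows are true classes and whose columns are predicted classes."""
    tp = matrix[k][k]
    predicted = sum(row[k] for row in matrix)  # everything predicted as k
    actual = sum(matrix[k])                    # everything truly in class k
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts inferred from Table 4 (assumption, not taken from the paper).
confusion = [[33, 1],
             [7, 20]]

for k in (0, 1):
    p, r, f = per_class_metrics(confusion, k)
    print(k, f"{p:.3f} {r:.3f} {f:.3f}")

total = sum(sum(row) for row in confusion)
accuracy = (confusion[0][0] + confusion[1][1]) / total
print(f"accuracy {accuracy:.4f}")
```

Under these inferred counts the per-class values agree with Table 4 to two decimals, and the overall accuracy is 53/61, about 86.9%. This is close to, but not identical with, the 86.25% in Table 2, so the counts should be read only as a consistency check.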