MLHeartDis: Can Machine Learning Techniques Enable to Predict Heart Diseases?

978-1-6654-9299-7/22/$31.00 ©2022 IEEE
Muntasir Mamun
Department of Computer Science
The University of South Dakota, Vermillion, SD, USA, 57069
Muntasir.Mamun@coyotes.usd.edu

Md. Milon Uddin
Department of Electrical Engineering
The University of Texas at Tyler, Tyler, TX, USA 75799
muddin3@patriots.uttyler.edu

Vivek Kumar Tiwari
Department of Electronic Engineering
The University of Texas at Tyler, USA, 75701
vtiwari@patriots.uttyler.edu

Asm Mohaimenul Islam
Department of Computer Science
University of South Dakota, SD, USA 57069
asm.islam@coyotes.usd.edu

Ahmed Ullah Ferdous
Department of Electronic and Telecommunication
University of Liberal Arts Bangladesh, Dhaka, Bangladesh
ahmdferdous@gmail.com

Abstract: Heart disease is one of the leading causes of death
in the contemporary world. The three major risk factors for
heart disease are smoking, high blood pressure, and high
cholesterol, and 47% of all US citizens have at least one of
these risk factors. In the field of clinical data analysis,
predicting cardiovascular disease is a major challenge. Machine
learning (ML) can support decisions and predictions about heart
disease based on personal key indicators (e.g., blood pressure,
cholesterol level, smoking, diabetic status, obesity, stroke,
alcohol drinking). In this paper, we evaluate six machine
learning models using survey data of over 400k US residents
from the year 2020. The six models (XGBoost, AdaBoost, Random
Forest, Decision Tree, Logistic Regression, and Naïve Bayes)
are compared in detail. The logistic regression model achieved
the best performance, with an accuracy of 91.57% for the
prediction of heart disease.
Keywords: machine learning, heart disease prediction, Centers
for Disease Control and Prevention (CDC), classification
algorithms, cardiovascular disease (CVD), regression model
I. INTRODUCTION
The heart is an essential organ of the human body, and life
depends on its proper functioning. According to the World
Health Organization, heart disease will cause over 23.6 million
deaths worldwide by 2030 [1].
Numerous different heart problems are categorized
as cardiovascular diseases (CVDs). Heart attacks, which
claim the lives of more than 370,000 people annually, may be
caused by coronary heart disease, the most frequent of them
all. Heart failure is one of the first CVD presentations and
another cause of morbidity and mortality. The World
Heart Federation recently listed risk factors that raise
the incidence of heart failure, including arterial
hypertension, diabetes, smoking, injured heart
muscles, malfunctioning heart valves, and obesity [2].
Machine learning (ML), a technology for evaluating clinical
data and generating predictions for early disease detection,
is one of today's techniques for computer-aided detection. Its
applications include cancer diagnosis [3], heart failure
diagnosis [4], Alzheimer's detection using brain MRI images
[5], and Parkinson's disease detection [6], as well as other
fields such as virtual reality [7] and 360-degree video
caching [8].
In this research, we use a survey dataset of personal health
indicators to determine the performance of six machine
learning algorithms (MLAs). Machine learning is a
fast-expanding trend in the healthcare field as a result of
the improvement of wearable technology and sensors that
use data to evaluate a patient's health in real time. By using
machine learning to anticipate heart disease, an accurate
diagnosis can be made at a lower cost than with the
conventional method [9].
The primary contribution of our study is the identification
of the top features from the raw dataset, with a primary focus
on the prediction of heart diseases using machine learning
techniques. This would enable quick and correct treatment of
the identified risk factors during any necessary preventive
diagnosis of these cardiovascular illnesses. We have also
compared our results with previous works.
The paper is structured as follows: The literature on
using machine learning to detect heart problems is covered
in Section 2. The methodology is then presented in Section 3.
The dataset is described in Section 4. The results are
highlighted and discussed in Section 5. Finally, in Section 6,
we summarize our findings and offer recommendations for
future work.
II. LITERATURE REVIEW
Using Logistic Regression, Random Forest, Support
Vector Machine, Gaussian Naïve Bayes, Gradient Boosting,
K-Nearest Neighbors, Multinomial Naïve Bayes, and Decision
Trees, Padmaja et al. [10] developed a machine learning
2022 IEEE 13th Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON) | 978-1-6654-9299-7/22/$31.00 ©2022 IEEE | DOI: 10.1109/UEMCON54665.2022.9965714
Authorized licensed use limited to: CMU Libraries - library.cmich.edu. Downloaded on December 02,2022 at 16:36:35 UTC from IEEE Xplore. Restrictions apply.
model to identify cardiac illnesses. To validate the process,
the authors used the UCI repository-based Cleveland data for
classification, which has 303 data samples with 14 attributes.
The authors found that Random Forest achieved the highest
accuracy, 93.44%, and outperformed the other machine
learning models. In addition, the performance of the classifier
was enhanced by shortening the execution time and selecting
important features from the input data set using the chi-square
feature selection approach.
Mary et al. [11] applied several machine learning
algorithms to the Cleveland dataset to predict cardiac
disease. The dataset has 303 instances and 14
attributes. The authors used KNN, SVM, Logistic Regression,
Neural Network, RF, Naïve Bayes, DT, and GDBT-Bagging
Tree machine learning models for prediction. They
achieved the highest accuracy of 90% using the Artificial
Neural Network (ANN) algorithm with producing SAE.
For predicting heart illnesses, Singh et al. [12] employed four
different machine learning models. The authors used the
UCI repository-based Cleveland dataset, which has 303
instances and 14 attributes. They found that k-nearest
neighbor achieved the highest accuracy, 87%, for predicting
heart diseases.
For the purpose of predicting heart diseases, Shah et
al. [13] employed supervised machine learning methods such
as Naïve Bayes, decision trees, k-nearest neighbors, and
random forests. The authors used the UCI repository-based
Cleveland dataset, which has 303 instances and 14 attributes,
and pre-processed it with WEKA tools. Of the four models
implemented for the prediction, k-nearest neighbor achieved
the highest accuracy, 90.789%.
Ghosh et al. [14] used a combined UCI-based dataset
(Cleveland, Long Beach VA, Switzerland, Hungarian, and
Statlog), which has more than 1190 instances and 14
attributes, for heart disease prediction. After calculating
the results of all the models, the authors achieved
the highest accuracy, 99.05%, using the Random Forest
Bagging Method (RFBM) model.
Based on the patients' available health indicators,
the authors in [15] developed a framework that is
effective at predicting patients' risk factors. The study's
goal is to shed light on the best safeguards that medical
specialists can employ in the case of heart disease risk.
The algorithms employed include C4.5, SVM, CMAR, and
Bayesian classifiers. The system is trained and tested
using only the 10-fold method, which is a drawback
of this approach.
Machine learning techniques were employed in [16] to
analyze the raw data and deliver the patient's disease
prediction and health status. The hybrid approach employed
by the authors combines the strengths of fuzzy logic and the
k-nearest neighbor algorithm and is 94 percent accurate.
The study in [17] examines ensemble classification,
which combines various classifiers to increase the accuracy
of weak predictors. The prediction model is presented with
different feature combinations and several well-known
classification techniques. The authors report an
improved performance level with 88.7 percent accuracy.
The authors in [18] evaluated the performance of 10
algorithms in classifications with two and four attributes.
Their findings show that the most important CVD risk factors
in the majority of the datasets are age, heart rate, and blood
pressure, followed by weight, cholesterol, smoking, serum
creatinine, ejection fraction, type of chest pain, number of
arteries, platelet count, and obesity. All of these
characteristics stood out in the prediction performance
study, and they therefore affect CVD detection.
The convolutional neural network was employed by
the authors in [19] to forecast heart diseases. They used some
key variables that can help predict a person's susceptibility to
heart disease, including age, sex, cholesterol, and ECG slope.
The network uses four layers. The highest accuracy
(87.09 percent) is obtained using the exponential linear unit
(ELU). Since the research's data set was very limited,
expanding the dataset could further increase accuracy.

Table 1. Performance of previous work

Authors (year) | Dataset collection (samples) | Applied models | Performance (proposed model)
--- | --- | --- | ---
Mary et al. (2020) [11] | UCI Cleveland dataset (303) | Support Vector Machine, Logistic Regression, Artificial Neural Network (proposed), K-Nearest Neighbor, Random Forest, Naïve Bayes, Decision Tree, and GDBT-Bagging Tree | Accuracy: 90%
Singh et al. (2020) [12] | UCI Cleveland dataset (303) | K-Nearest Neighbor (proposed), Decision Tree, Linear Regression, and Support Vector Machine | Accuracy: 87%
Shah et al. (2020) [13] | UCI Cleveland dataset (303) | Naïve Bayes, Decision Tree, K-Nearest Neighbor (proposed), and Random Forest | Accuracy: 90.789%
Padmaja et al. (2021) [10] | UCI Cleveland dataset (303) | Logistic Regression, Random Forest (proposed), Support Vector Machine, Gaussian Naïve Bayes, Gradient Boosting, K-Nearest Neighbors, Multinomial Naïve Bayes, and Decision Trees | Accuracy: 93.44%
Ghosh et al. (2021) [14] | UCI (Cleveland + Long Beach VA + Switzerland + Hungarian and Statlog) (1190+) | Decision Tree Bagging Method (DTBM), Random Forest Bagging Method (RFBM) (proposed), K-Nearest Neighbors Bagging Method (KNNBM), AdaBoost Boosting Method (ABBM), and Gradient Boosting Boosting Method (GBBM) | Accuracy: 99.05%
Ours (2022) | Kaggle dataset (319,795) | XGBoost, AdaBoost, Random Forest, Decision Tree, Naïve Bayes, and Logistic Regression (proposed) | Accuracy: 91.57%

III. METHODOLOGY
In the proposed methodology, pre-processing comes after data
collection. The chosen classifiers (XGBoost, AdaBoost, Random
Forest, Decision Tree, Naïve Bayes, and Logistic Regression)
are then trained and tested on the heart disease dataset using
the common hold-out validation method. To discover the best
approach for predicting heart diseases, the results are
computed and examined. The outline of the proposed strategy
is shown in Fig. 1.
A. Dataset Collection
We used a dataset titled "Personal Key Indicators of Heart
Disease" in this paper that was obtained from the Kaggle online
platform [20]. This dataset has 319,795 instances and 18
attributes, of which 1 is the class attribute and 17 are
predictive. Proper heart disease prediction depends on
using the attributes appropriately, where the attributes
describe the symptoms. The predictive attributes include
gender, age, race, obesity (high BMI), diabetic condition,
physical activity, physical health, general health, mental
health, alcohol drinking, smoking, stroke status, difficulty
walking, asthma, kidney disease, and skin cancer,
and the class attribute is heart disease.
B. Dataset Pre-processing
The original dataset of approximately 300 variables was
reduced to about 18 variables (9 booleans, 5 strings, and
4 decimals) that are relevant to predicting cardiac
illnesses or heart diseases. Dataset pre-processing
consisted of feature extraction, data cleaning, missing
value handling, and categorical variable transformation.
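The missing value handling and categorical variable transformation named above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the column names (`HeartDisease`, `Smoking`, `GenHealth`, `BMI`), the category order for general health, and the choice to drop incomplete records are all hypothetical.

```python
# Hypothetical encodings; the paper names the steps but not the exact mapping.
YES_NO = {"Yes": 1, "No": 0}
GEN_HEALTH_ORDER = ["Poor", "Fair", "Good", "Very good", "Excellent"]


def encode_row(row):
    """Return a numeric copy of one survey record, or None if a value is missing."""
    if any(value is None or value == "" for value in row.values()):
        return None  # assumed cleaning rule: drop incomplete records
    return {
        "HeartDisease": YES_NO[row["HeartDisease"]],            # boolean -> 0/1
        "Smoking": YES_NO[row["Smoking"]],                      # boolean -> 0/1
        "GenHealth": GEN_HEALTH_ORDER.index(row["GenHealth"]),  # string -> ordinal
        "BMI": float(row["BMI"]),                               # decimal kept as float
    }


def preprocess(rows):
    """Encode every record and keep only the complete ones."""
    encoded = (encode_row(row) for row in rows)
    return [row for row in encoded if row is not None]


# Made-up records, for illustration only.
sample = [
    {"HeartDisease": "No", "Smoking": "Yes", "GenHealth": "Very good", "BMI": "16.6"},
    {"HeartDisease": "Yes", "Smoking": "No", "GenHealth": "Fair", "BMI": ""},
    {"HeartDisease": "Yes", "Smoking": "Yes", "GenHealth": "Poor", "BMI": "28.9"},
]
clean = preprocess(sample)
```

Here the second record is dropped because its BMI is empty. Imputing the missing value instead would be an equally plausible reading of "missing value handling".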
C. Validation Process
It is essential to choose the right validation procedure for a
specific dataset. Hold-out validation is most effective for
getting appropriate results when the dataset is large [21].
We applied a hold-out validation process, training on 80% of
the dataset and testing on the remaining 20%. Using this
validation process, we calculated the accuracy, sensitivity,
specificity, precision, area under the curve, and F1-score
performance metrics for each machine learning approach.
The performance metrics and visualization output graphs are
presented in detail in the results section. A step-by-step
overview of the research work is given in the flowchart of
Fig. 1.
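The 80/20 hold-out procedure can be sketched in a few lines. This is a minimal sketch, assuming a simple random shuffle with a fixed seed. The paper does not state whether the split was stratified or which seed was used, so `hold_out_split`, `test_fraction`, and `seed` are illustrative names and defaults.

```python
import random


def hold_out_split(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of the records and split it into (train, test)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(records)               # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in data: integers in place of survey records.
records = list(range(1000))
train, test = hold_out_split(records)
```

Each record lands in exactly one of the two sets, so the test set never influences training. For the 319,795 records described above, an exact 80/20 split would give 255,836 training and 63,959 test records.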
Fig. 1. An overview of the study
IV. DATASET
The dataset comes from the CDC and is a key component of the
Behavioral Risk Factor Surveillance System (BRFSS), which
conducts annual telephone surveys to gather data on
Americans' health conditions. The BRFSS
collects data from all 50 states, the District of Columbia, and
three U.S. territories. It is the largest continuously
running health survey system in the world, conducting over
400,000 adult interviews each year. This dataset has 319,795
instances and 18 attributes, of which 1 is the class attribute
and 17 are predictive. The great majority of the columns are
questions that ask respondents about their health, such as
"Do you have considerable trouble walking or climbing stairs?"
or "Have you smoked at least 100 cigarettes in your lifetime?
(5 packs equal 100 cigarettes)". We found
numerous variables (questions) in this dataset that either
directly or indirectly affect heart disease, so we chose the
most pertinent ones and cleaned the data so that it could be
used in machine learning projects. Only roughly 20 variables
remained from the original dataset's almost 300 variables.
Heart disease was treated as a binary variable, with "Yes"
denoting the presence of heart disease and "No" denoting its
absence.
V. RESULTS
Accuracy, F1 score, precision, recall, specificity, and area
under the curve (AUC) have been used to assess the performance
of the six machine learning techniques. Table 2
displays a comparison of the various models.
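The metrics listed here can all be computed from true labels, predicted labels, and predicted scores. The sketch below is illustrative only: the labels and scores are made up, the 0.5 decision threshold is an assumption, and the AUC is computed with the rank-based definition (the probability that a randomly chosen positive receives a higher score than a randomly chosen negative), which may differ from how the authors computed it.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity (recall), specificity, precision and F1 for 0/1 labels.

    Assumes both classes occur and at least one positive is predicted.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


def roc_auc(y_true, scores):
    """Rank-based AUC: share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and scores, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Accuracy, sensitivity, specificity, precision, and F1 depend on the chosen threshold, while AUC summarizes ranking quality across all thresholds. This is one reason the two kinds of metric can disagree for the same model.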
Table 2. Comparison of results

Model | Acc. (%) | Sen. (%) | Spec. (%) | Precision (%) | F1-score (%) | AUC
--- | --- | --- | --- | --- | --- | ---
XGBoost | 91.50 | 92.22 | 50.28 | 99.10 | 95.53 | 0.84
AdaBoost | 91.55 | 92.35 | 51.70 | 99 | 95.56 | 0.84
Random Forest | 90.28 | 92.40 | 33.09 | 97.46 | 94.86 | 0.79
Decision Tree | 86.44 | 92.96 | 22.65 | 92.16 | 92.56 | 0.59
Logistic Regression | 91.57 | 92.32 | 52.61 | 99.07 | 95.58 | 0.84
Naïve Bayes | 91.40 | 91.51 | 15.38 | 99.96 | 95.55 | 0.65

Accuracy alone, however, cannot serve as an
adequate metric for evaluating a model's performance. The AUC
value is also an important metric: it assesses a model's
capacity for class distinction. The ROC curve plots the True
Positive Rate against the False Positive Rate at various
thresholds, and the AUC summarizes the ability of a model to
differentiate between the positive and negative classes. The
higher the AUC, the better the outcome. AUC values range
between 0 and 1, where 0 denotes a perfectly inaccurate test
and 1 denotes a perfectly accurate test.
An AUC of 0.5 generally means there is no discrimination
(i.e., no capacity to distinguish between patients with and
without the condition), 0.7 to 0.8 is considered acceptable,
0.8 to 0.9 is regarded as excellent, and higher than 0.9
is regarded as outstanding [22].
We present the AUC curves and average outcomes
obtained with a hold-out validation procedure that trains on
80% of the dataset and tests on the remaining 20%. The results
of XGBoost, AdaBoost, and Naïve Bayes are comparable.
When accuracy, specificity, and F1-score are taken into
account, logistic regression outperforms the other models
(shown in Table 2).
Fig. 2. AUC curve for AdaBoost
Fig. 3. AUC curve for Decision Tree
Fig. 4. AUC curve for Logistic Regression
Fig. 5. AUC curve for Naïve Bayes
Fig. 6. AUC curve for Random Forest
Fig. 7. AUC curve for XGBoost
The highest AUC score, 0.84, is obtained by AdaBoost,
Logistic Regression, and XGBoost. On the other hand, the Naïve
Bayes model reaches an AUC of only 0.65, and Table 2 lists the
lowest AUC, 0.59, for Decision Tree.
VI. CONCLUSION
Predicting heart disease is one of the difficult tasks in
medicine. If the disease is recognized early with the help of
machine learning techniques, the death rate can be
significantly reduced. Six machine learning algorithms were
employed in this research paper to predict heart diseases. In
our findings, the logistic regression algorithm showed the
best accuracy, 91.57%, for the prediction of heart diseases.
In future research work, the accuracy can be increased by
using a bigger data set and selecting more features
efficiently.
REFERENCES
[1] V. Krishnaiah, G. Narsimha, N. Subhash Chandra, "Heart Disease
Prediction System using Data Mining Techniques and Intelligent Fuzzy
Approach: A Review", International Journal of Computer Applications,
February 2016.
[2] What is CVD? World Heart Federation. Available online:
https://world-heart-federation.org/what-is-cvd/ (accessed on 15 July 2022).
[3] M. Mamun, A. Farjana, M. Al Mamun and M. S. Ahammed, "Lung
cancer prediction model using ensemble learning techniques and a
systematic review analysis," 2022 IEEE World AI IoT Congress (AIIoT),
2022, pp. 187-193, doi: 10.1109/AIIoT54504.2022.9817326.
[4] M. Mamun, A. Farjana, M. A. Mamun, M. S. Ahammed and M. M.
Rahman, "Heart failure survival prediction using machine learning
algorithm: am I safe from heart failure?," 2022 IEEE World AI IoT Congress
(AIIoT), 2022, pp. 194-200, doi: 10.1109/AIIoT54504.2022.9817303.
[5] M. Mamun, S. B. Shawkat, M. S. Ahammed, M. M. Uddin, M. I.
Mahmud, A. M. Islam, "Deep Learning Based Model for Alzheimer's
Disease Detection Using Brain MRI Images", 2022 IEEE 13th Annual
Ubiquitous Computing, Electronics & Mobile Communication Conference
(UEMCON), 2022, (Preprint).
[6] M. Mamun, M. I. Mahmud, M. I. Hossain, A. M. Islam, M. S. Ahammed,
M. M. Uddin, "Vocal Feature Guided Detection of Parkinson's Disease
Using Machine Learning Algorithms", 2022 IEEE 13th Annual Ubiquitous
Computing, Electronics & Mobile Communication Conference (UEMCON),
2022, (Preprint).
[7] M. M. Uddin and J. Park, "Machine learning model evaluation for 360°
video caching," 2022 IEEE World AI IoT Congress (AIIoT), 2022,
pp. 238-244, doi: 10.1109/AIIoT54504.2022.9817292.
[8] M. Milon Uddin and J. Park, "360 Degree Video Caching with LRU &
LFU," 2021 IEEE 12th Annual Ubiquitous Computing, Electronics &
Mobile Communication Conference (UEMCON), 2021, pp. 0045-0050,
doi: 10.1109/UEMCON53757.2021.9666668.
[9] T. Karayılan, O. Kılıç, "Prediction of Heart Disease Using Neural
Network", 2nd International Conference of Computer Science and
Engineering, IEEE, 2017.
[10] B Padmaja, E. (2022). Early and Accurate Prediction of Heart Disease
Using Machine Learning Model. Retrieved 13 July 2022, from
https://turcomat.org/index.php/turkbilmat/article/view/8438
[11] Mary, M. (2020). Heart Disease Prediction using Machine Learning
Techniques: A Survey. International Journal For Research In Applied
Science And Engineering Technology, 8(10), 441-447.
[12] A. Singh and R. Kumar, "Heart Disease Prediction Using Machine
Learning Algorithms," 2020 International Conference on Electrical and
Electronics Engineering (ICE3), 2020, pp. 452-457.
[13] Shah, D., Patel, S. & Bharti, S.K. Heart Disease Prediction using
Machine Learning Techniques. SN COMPUT. SCI. 1, 345 (2020).
[14] Ghosh, P., Azam, S., Jonkman, M., Karim, A., Shamrat, F., & Ignatious,
E. et al. (2021). Efficient Prediction of Cardiovascular Disease Using
Machine Learning Algorithms With Relief and LASSO Feature Selection
Techniques. IEEE Access, 9, 19304-19326.
[15] Purushottama. C, Kanak Saxenab, Richa Sharma (2016), "Efficient
Heart Disease Prediction System", Elsevier, Procedia Computer Science,
No. 85, pp. 962-969.
[16] Sharanyaa, S., S. Lavanya, M. R. Chandhini, R. Bharathi, and K.
Madhulekha. "Hybrid Machine Learning Techniques for Heart Disease
Prediction." International Journal of Advanced Engineering Research and
Science 7, no. 3 (2020), pp. 44-8.
[17] B. Keerthi Samhitha, M. R. Sarika Priya, C. Sanjana, S. C. Mana and
J. Jose, "Improving the Accuracy in Prediction of Heart Disease using
Machine Learning Algorithms," 2020 International Conference on
Communication and Signal Processing (ICCSP), 2020, pp. 1326-1330.
[18] L. R. Guarneros-Nolasco, N. A. Cruz-Ramos, G. Alor-Hernández, L.
Rodríguez-Mazahua, and J. L. Sánchez-Cervantes, "Identifying the main risk
factors for cardiovascular diseases prediction using machine learning
algorithms," Mathematics, vol. 9, no. 20, p. 2537, 2021.
[19] I. Dokare, A. Prithiani, H. Ochani, S. Kanjan, and D. Tarachandani,
"Prediction of having a heart disease using machine learning," SSRN
Electronic Journal, 2020.
[20] Kamil Pytlak. (2022, February). Personal Key Indicators of Heart
Disease. Version 1. Retrieved July 3, 2022 from
https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease.
[21] Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., & Roth,
A. (2015). The reusable holdout: Preserving validity in adaptive data
analysis. Science, 349(6248), 636-638. doi: 10.1126/science.aaa9375
[22] Mandrekar, J. (2010). Receiver Operating Characteristic Curve in
Diagnostic Test Assessment. Journal Of Thoracic Oncology, 5(9),
1315-1316. doi: 10.1097/jto.0b013e3181ec173d
0565
Authorized licensed use limited to: CMU Libraries - library.cmich.edu. Downloaded on December 02,2022 at 16:36:35 UTC from IEEE Xplore. Restrictions apply.
... [4] Mohammed Ali Shaik, Radhandi Sreeja, Safa Zainab (2023 IEEE) "Improving Accuracy of Heart Disease Prediction through Machine Learning Algorithms" incorporated four diverse components, each addressing distinct factors pertinent to heart diseases and results better prediction with Random Forest (RF) and K-Nearest Neighbour (KNN). [3] ...
... The existing system adopts a Hybrid algorithm approach, which amalgamates the strengths of multiple algorithms to enhance overall performance. This technique integrates various methodologies such as ensemble methods, metalearning, or blending diverse model types like neural networks and decision trees [3]. By leveraging the unique advantages of each constituent algorithm, the goal is to achieve superior results compared to any individual algorithm in isolation. ...
... This approach reduces overfitting peril and enhances the model's resilience to noisy data. Furthermore, the method effectively uses categorical and numerical data, and it can manage missing data [3]. By analysing a dataset containing various features related to a person's health, such as age, blood pressure, cholesterol levels and more, can learn patterns and make predictions about the likelihood of someone having heart disease. ...
Article
A recent study by the World Health Organization sheds light on the alarming increase in cardiovascular diseases, contributing to approximately 17.9 million deaths annually. This study delves into the effectiveness of employing the Random Forest algorithm, a robust machine learning approach, to forecast the likelihood of heart disease based on diverse risk factors. By leveraging a dataset encompassing demographic, clinical, and lifestyle attributes, the Random Forest model underwent training to categorize individuals into two groups: those with or without heart disease. Through meticulous feature selection and ensemble learning, the algorithm adeptly captures intricate relationships among predictors, thereby augmenting prediction accuracy. Evaluation metrics including accuracy and AUC-ROC curve were employed in order to determine model's effectiveness. Impressively, our model achieves a prediction accuracy of 97%. Moreover, a comparative analysis with other prominent machine learning models such as Naive Bayes, Support Vector Machine (SVM), Logistic Regression (LR), XGBoost, Decision Tree revealed that the Random Forest approach outperforms others in terms of accuracy and efficiency in prediction tasks. Keywords: Random Forest (RF), Machine Learning (ML), Accuracy, Classification.
... Critical risk factors of heart disease have been possible to analyze through these data [4] [5]. However, these datasets need a high amount of preprocessing to produce useful results [1] [6]. To solve this problem, enhanced computational approaches are needed. ...
... Many recognized studies utilizing artificial intelligence (AI) models showed high accuracy, nevertheless without additional information to confirm the dependability [2] [11]. Class imbalances are common among medical survey data [1] [6]. It is challenging to achieve high accuracy for the minority class while remaining well-balanced [1] [6] [12]. ...
... For this sort of study, finding the key variables that impact the prediction of certain target classes is essential for efficient model interpretation and decision-making [13]. Despite significant breakthroughs in the field of heart disease detection using machine learning and deep learning, there are still these identified challenges to be overcome [1] [6] [11] [14]. These issues underline the importance of this study in adding to the present body of knowledge and providing solutions to the difficulties outlined in the prior studies [1] [6] [10] [11] [14]. ...
Preprint
Full-text available
The traditional approaches in heart disease prediction across a vast amount of data encountered a huge amount of class imbalances. Applying the conventional approaches that are available to resolve the class imbalances provides a low recall for the minority class or results in imbalance outcomes. A lightweight GrowNet-based architecture has been proposed that can obtain higher recall for the minority class using the Behavioral Risk Factor Surveillance System (BRFSS) 2022 dataset. A Synthetic Refinement Pipeline using Adaptive-TomekLinks has been employed to resolve the class imbalances. The proposed model has been tested in different versions of BRFSS datasets including BRFSS 2022, BRFSS 2021, and BRFSS 2020. The proposed model has obtained the highest specificity and sensitivity of 0.74 and 0.81 respectively across the BRFSS 2022 dataset. The proposed approach achieved an Area Under the Curve (AUC) of 0.8709. Additionally, applying explainable AI (XAI) to the proposed model has revealed the impacts of transitioning from smoking to e-cigarettes and chewing tobacco on heart disease.
... Heart diseases detection using machine learning has been proposed in [21] where author proposed six machine learning methods-Xgboost, Adaboost, Random Forest, Decision Tree, Logistic Regression, and Naïve Bayes have been compared in detail. Through the prediction model for heart disease, they have achieved an impressive accuracy which is 91.57% ...
... Using the most appropriate methods during this stage is essential to ensure the reliability and usefulness of the final results. For example, we can use hold-out validation method as this method gives better results when we are using large dataset [21]. To sue hold-out validation technique we have to split out dataset into training and test dataset. ...
Conference Paper
Full-text available
Hearth disease is one of the leading causes of death globally and a common disease in the middle and old ages. Among all heart diseases, heart attack and strokes are the most common cardiac illness that is the responsible majority of heart disease death. To identify heart diseases, for instance, Angiography is costly and has significant side effects. Therefore, machine learning can play an important role in identifying and predicting the potential risk factor of cardiac disease based on clinical and patient data, which is affordable and reliable. This study proposed and evaluated six machine learning models using survey data of 400k US residents to predict heart disease. This study also compared the evaluated six machine learning models, which are Xgboost, Bagging, Random Forest, Decision Tree, K-Nearest Neighbor, and Naïve Bayes. The accuracy, sensitivity, F1-score, and AUC of six machine learning …
... Deep learning is widely used for analysis, classification, and detection in the healthcare business [17,18,19]. In 1980 [20,21,22], CNN was initially used. ...
Article
Full-text available
Currently, the radiologist can more accurately identify brain tumours through the development of Computer-Assisted Diagnosis (CAD), Machine Learning and Deep Learning. Recently, Deep Learning (DL) strategies have gained traction as a means to rapidly and accurately construct automated systems for diagnosing and segmenting the image. The standard approach to this issue is to create a custom feature for classification. Most neurological diseases originate from abnormal growth of brain cells, which can compromise brain architecture and even lead to malignant brain tumours. Brain tumour detection and classification algorithms that are both quick and accurate have been the subject of extensive study. This facilitates the straight forward diagnosis of brain tumours using Magnetic Resonance Image (MRI) images. Through Deep Learning (DL) model the diagnosis of brain malignancies in MRI images using Convolutional Neural Network (CNN) is possible by training the data. So, in this paper the brain tumouris predicted byproposing a Hybridfeature extraction technique i.e., tuned CNN model with ResNet150 and U-net.
... Heart disease, a leading global cause of death, is strongly linked to risk factors like smoking, high blood pressure, and cholesterol, affecting nearly half of the US population. Machine learning plays a pivotal role in predicting cardiovascular diseases based on personal indicators, with this paper presenting six models, including Xgboost, Adaboost, Random Forest, Decision Tree, Logistic Regression, and Naïve Bayes, achieving an impressive 91.57% accuracy using the logistic regression model [14]. In recent years, cardiovascular diseases have become a leading global cause of death, driven by lifestyle changes, dietary habits, and work culture. ...
Article
Full-text available
INTRODUCTION: This study explores machine learning algorithms (SVM, Adaboost, Logistic Regression, Naive Bayes, and Random Forest) for heart disease prediction, utilizing comprehensive cardiovascular and clinical data. Our research enables early detection, aiding timely interventions and preventive measures. Hyperparameter tuning via GridSearchCV enhances model accuracy, reducing heart disease's burdens. Methodology includes preprocessing, feature engineering, model training, and cross-validation. Results favor Random Forest for heart disease prediction, promising clinical applications. This work advances predictive healthcare analytics, highlighting machine learning's pivotal role. Our findings have implications for healthcare and policy, advocating efficient predictive models for early heart disease management. Advanced analytics can save lives, cut costs, and elevate care quality. OBJECTIVES: Evaluate the models to enable early detection, timely interventions, and preventive measures. METHODS: Utilize GridSearchCV for hyperparameter tuning to enhance model accuracy. Employ preprocessing, feature engineering, model training, and cross-validation methodologies. Evaluate the performance of SVM, Adaboost, Logistic Regression, Naive Bayes, and Random Forest algorithms. RESULTS: The study reveals Random Forest as the favored algorithm for heart disease prediction, showing promise for clinical applications. Advanced analytics and hyperparameter tuning contribute to improved model accuracy, reducing the burden of heart disease. CONCLUSION: The research underscores machine learning's pivotal role in predictive healthcare analytics, advocating efficient models for early heart disease management.
... Study dataset, and the Cardiovascular Disease dataset. They then applied four different ML models to each of them and found that KNN and logistic regression both performed well across all four datasets. They achieved the best accuracy of 93.55% using logistic regression [8]. Another study on predicting heart disease used six different models and found that logistic regression had the highest accuracy (91.1%) on the Key Indicators of Heart Disease dataset available on Kaggle [9]. ...
Conference Paper
Full-text available
Cardiovascular diseases (CVD) continue to pose significant health risks globally, accentuating the need for early and precise detection mechanisms. With the evolution of computational methods in healthcare, machine learning offers transformative solutions for diagnostic accuracy. This research aims to identify an algorithm with consistent performance across multiple datasets for potential integration into a cardiac disease prediction platform. We examined nine prominent machine learning algorithms, namely Support Vector Machine (SVM), Gradient Boosting (GB), Random Forest (RF), Logistic Regression (LR), Decision Tree (DT), K-Nearest Neighbor (KNN), Naive Bayes (NB), Extreme Gradient Boosting (XGBoost), and Multilayer Perceptron (MLP), and evaluated their predictive performance across two heterogeneous datasets. Both datasets encompass 14 attributes but differ in instance sizes: 303 and 1025, respectively. Through a meticulous methodological framework, the data underwent preprocessing, splitting, and model training, followed by validation using metrics such as Precision, Recall, F1 score, and Accuracy, coupled with a confusion matrix for detailed class-based evaluation. Our findings revealed that the Random Forest and MLP algorithms exhibited superior consistency and robustness in disease prediction across both datasets, achieving a peak accuracy of 95.14%. While XGBoost performed proficiently on one dataset, its performance wavered in a cross-dataset scenario. Based on these findings, either the Random Forest or Multilayer Perceptron models are recommended for developing a robust heart disease prediction system. This research not only affirms the potential of machine learning in revolutionizing CVD diagnostics but also underscores the importance of algorithm selection based on dataset characteristics.
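The validation metrics named in the abstract above (Precision, Recall, F1 score, and Accuracy, read off a confusion matrix) can be illustrated with a minimal sketch. The function names `confusion_counts` and `binary_metrics` and the toy labels are assumptions made here for illustration; they are not taken from the paper, and the numbers are not its results.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, TN for a binary task (hypothetical helper)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fp, fn, tn


def binary_metrics(tp, fp, fn, tn):
    """Derive accuracy, precision, recall and F1 from confusion-matrix counts.

    Each ratio falls back to 0.0 when its denominator is zero.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels, invented for illustration only (1 = disease, 0 = no disease).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = binary_metrics(*confusion_counts(y_true, y_pred))
print(metrics)
```

Reporting recall alongside accuracy matters for disease prediction, because a model can reach high accuracy on an imbalanced dataset while missing most positive cases.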
Article
Using Magnetic Resonance Imaging (MRI) images to detect brain tumors by medical practitioners is mundane and prone to errors. Misdiagnosis of brain tumors can be life-threatening, so to lessen misdiagnosis, computational techniques can be used in concert with medical professionals. Deep learning approaches have been gaining popularity in modeling and developing systems for medical image processing that can detect abnormalities quickly. The methods proposed herein are based on Convolutional Neural Networks (CNN) trained on the 'BR35H::Brain Tumor Detection 2020' dataset. A custom CNN architecture was designed, followed by transfer learning with four pre-trained models: InceptionV3, ResNet101, VGG19, and DenseNet169. A comparative analysis of these architectures is presented in this paper. The experimental results show that the DenseNet169 model outperformed the other models with a training accuracy of 99.83%, test accuracy of 99.66%, precision of 99.67%, and recall of 99.67%. Additionally, ResNet101 has a test accuracy of 95.92%, VGG19 has a test accuracy of 97.83%, the custom architecture has a test accuracy of 98.16%, and InceptionV3 has the lowest test accuracy of 91.66%. It has been concluded that DenseNet169 provides better results for the classification of brain tumors than the other models.
Preprint
Full-text available
While type 2 diabetes is predominantly found in the elderly population, recent publications indicate an increasing prevalence in the young adult population. Failing to predict it in the minority younger age group could have significant adverse effects on their health. Previous work acknowledges the bias of machine learning models towards different gender and race groups and proposes various approaches to mitigate it. However, prior work has not proposed any effective methodologies to predict diabetes in the young population, which is the minority group in the diabetic population. In this paper, we identify this deficiency in traditional machine learning models and implement double prioritization (DP) bias correction techniques to mitigate the bias towards the young population when predicting diabetes. Deviating from the traditional concept of one-model-fits-all, we train customized machine-learning models for each age group. The DP model consistently improves recall of the diabetes class by 26% to 40% in the young age group (30-44). Moreover, the DP technique outperforms 7 commonly used whole-group sampling techniques, such as random oversampling, SMOTE, and ADASYN, by at least 36% in terms of diabetes recall in the young age group. We also analyze the feature importance to investigate the source of bias in the original model. Data and Code Availability: We use a publicly available dataset, the Behavioral Risk Factor Surveillance System (BRFSS) from the CDC (2021). To reproduce the result, the anonymised code has been attached as supplementary files. The code will be uploaded to a public repository upon publication. Institutional Review Board (IRB): Our research does not require IRB approval.
Article
Full-text available
Keywords: Machine learning, logistic regression, Framingham dataset, heart diseases. Abstract: Myocardial infarction and brain attacks are responsible for the fatalities of individuals with cardiovascular diseases (CVDs), with deaths occurring especially before age 70. An estimated 17.9 million people die from CVDs annually. Accurate monitoring of each patient individually is not always possible, and clinicians cannot consult with patients around the clock because of the additional time and knowledge required. Using patients' various cardiac characteristics and the machine learning approach of logistic regression on a publicly accessible dataset from Kaggle, we developed and examined models for predicting heart disease in this research. The main objective is to ascertain the 10-year risk of acquiring coronary heart disease (CHD). The dataset contains more than 4,000 patient records with 15 attributes. Logistic regression predicts a dependent variable from one or more independent variables and can be used for both binary and multi-class classification. This study aims to establish the most significant heart disease risk factors and estimate the overall risk using logistic regression.
Conference Paper
Full-text available
360-degree virtual reality videos enhance the viewing experience by giving a more immersive and interactive environment compared to traditional videos. These videos require large bandwidth to transmit. Typically, viewers observe only a part of the entire 360-degree video, called the field of view (FoV), when watching 360-degree videos. Edge caching can be a good solution to optimize bandwidth utilization as well as improve user quality of experience (QoE). In this research, three machine learning models utilizing random forest, linear regression, and Bayesian regression have been proposed to develop a 360-degree video caching algorithm. Tile frequency, the user's view prediction probability, and tile resolution have been used as features. The purpose of the developed machine learning models is to determine the caching strategy of 360-degree video tiles. The models are capable of predicting the viewing frequency of 360-degree video tiles (subsets of a full video). We have compared the results of the three developed models, and the results show that the random forest regression model outperforms the other proposed models with a predictive R² value of 0.79.
Conference Paper
Full-text available
360-degree videos, which provide a means to enjoy virtual reality, have gained in popularity among people around the world. They allow users to view video scenes at any angle while watching videos. 360-degree video caching at the edge server can be a good solution to minimize the bandwidth cost and to deliver the video with less latency. Popular video contents can be divided into tiles which are cached at the edge server in a potential 360-degree video streaming system. In this research, a system architecture for 360-degree video caching has been proposed, and video caching has been performed using the Least Recently Used (LRU) and Least Frequently Used (LFU) algorithms. Recency and frequency are used for cache eviction. In the experiment, head movement data from 48 users is utilized in sequential and randomized order for two 360-degree videos, and caching is compared between the LRU cache and the LFU cache by varying the cache size. The results show that the average cache hit rate is greater when using LFU caching than when using LRU caching across the varying cache sizes.
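The difference between recency-based and frequency-based eviction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the system from the paper: the class names `LRUCache` and `LFUCache`, the `hit_rate` helper, the tie-breaking rule in LFU (evict the least recently used among the least frequent), and the toy request trace are all hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Evicts the tile that was requested least recently."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.items = OrderedDict()  # insertion order tracks recency

    def access(self, key):
        """Return True on a hit; on a miss, cache the key and return False."""
        if key in self.items:
            self.items.move_to_end(key)  # mark as most recently used
            return True
        if len(self.items) >= self.capacity:
            self.items.popitem(last=False)  # drop the least recently used
        self.items[key] = True
        return False


class LFUCache:
    """Evicts the tile with the fewest requests while cached."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.counts = {}     # key -> number of accesses while cached
        self.last_used = {}  # key -> logical time of last access (tie-break)
        self.tick = 0

    def access(self, key):
        """Return True on a hit; on a miss, cache the key and return False."""
        self.tick += 1
        if key in self.counts:
            self.counts[key] += 1
            self.last_used[key] = self.tick
            return True
        if len(self.counts) >= self.capacity:
            # Assumed tie-break: among the least frequent, evict the oldest.
            victim = min(self.counts, key=lambda k: (self.counts[k], self.last_used[k]))
            del self.counts[victim]
            del self.last_used[victim]
        self.counts[key] = 1
        self.last_used[key] = self.tick
        return False


def hit_rate(cache, requests):
    """Fraction of requests served from the cache."""
    if not requests:
        return 0.0
    hits = sum(1 for r in requests if cache.access(r))
    return hits / len(requests)


# Toy tile-request trace, invented for illustration only.
trace = ["tile_a", "tile_a", "tile_b", "tile_c", "tile_a"]
lru_rate = hit_rate(LRUCache(2), trace)
lfu_rate = hit_rate(LFUCache(2), trace)
print(lru_rate, lfu_rate)
```

On this toy trace the frequently requested tile survives in the LFU cache but is evicted from the LRU cache. That is one way frequency-based eviction can come out ahead, although which policy wins in general depends on the request pattern and cache size.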
Article
Full-text available
Cardiovascular Diseases (CVDs) are a leading cause of death globally. In CVDs, the heart is unable to deliver enough blood to other body regions. As an effective and accurate diagnosis of CVDs is essential for CVD prevention and treatment, machine learning (ML) techniques can be effectively and reliably used to discern patients suffering from a CVD from those who do not suffer from any heart condition. Namely, machine learning algorithms (MLAs) play a key role in the diagnosis of CVDs through predictive models that allow us to identify the main risk factors influencing CVD development. In this study, we analyze the performance of ten MLAs on two datasets for CVD prediction and two for CVD diagnosis. Algorithm performance is analyzed on the top-two and top-four dataset attributes/features with respect to five performance metrics (accuracy, precision, recall, F1-score, and ROC-AUC) using the train-test split technique and k-fold cross-validation. Our study identifies the top-two and top-four attributes from CVD datasets by analyzing the accuracy metrics to determine that they are the best for predicting and diagnosing CVD. As our main findings, the ten ML classifiers exhibited appropriate classification and predictive performance in terms of accuracy with the top-two attributes, identifying three main attributes for diagnosis and prediction of a CVD such as arrhythmia and tachycardia; hence, they can be successfully implemented for improving current CVD diagnosis efforts and help patients around the world, especially in regions where medical staff is lacking.
Article
Full-text available
Cardiovascular diseases are among the most common serious illnesses affecting human health. CVDs may be prevented or mitigated by early diagnosis, and this may reduce mortality rates. Identifying risk factors using machine learning models is a promising approach. We would like to propose a model that incorporates different methods to achieve effective prediction of heart disease. For our proposed model to be successful, we have used efficient Data Collection, Data Pre-processing and Data Transformation methods to create accurate information for the training model. We have used a combined dataset (Cleveland, Long Beach VA, Switzerland, Hungarian and Statlog). Suitable features are selected by using the Relief and Least Absolute Shrinkage and Selection Operator (LASSO) techniques. New hybrid classifiers like Decision Tree Bagging Method (DTBM), Random Forest Bagging Method (RFBM), K-Nearest Neighbors Bagging Method (KNNBM), AdaBoost Boosting Method (ABBM), and Gradient Boosting Boosting Method (GBBM) are developed by integrating the traditional classifiers with bagging and boosting methods, which are used in the training process. We have also instrumented some machine learning algorithms to calculate the Accuracy (ACC), Sensitivity (SEN), Error Rate, Precision (PRE) and F1 Score (F1) of our model, along with the Negative Predictive Value (NPR), False Positive Rate (FPR), and False Negative Rate (FNR). The results are shown separately to provide comparisons. Based on the result analysis, we can conclude that our proposed model produced the highest accuracy (99.05%) when using RFBM with the Relief feature selection method.