Juni Khyat ISSN: 2278-4632
(UGC Care Group I Listed Journal) Vol-11 Issue-12 No.01 December 2021
Page | 115 Copyright @ 2021 Author
ENSEMBLE HEART DISEASE PREDICTION USING ENHANCED FEATURE
SELECTION BY GENETIC ALGORITHM AND BEST ACCURACY OBTAINED USING
NAIVE BAYES, SVM AND KNN
R. Suresh, Research Scholar, Department of Computer Science & Engineering, Osmania University,
Hyderabad
Dr. Nagaratna P. Hegde, Professor, Department of Computer Science and Engineering, Vasavi
College of Engineering
Abstract:
A clinical diagnosis for any ailment relies on a doctor's competence and experience in the relevant
field. Lack of experience has at times resulted in incorrect diagnosis and treatment. Patients must
undergo a number of tests in order to be diagnosed with heart disease, yet not every routinely
performed test is essential for an accurate diagnosis. The goal of this study is to determine the
presence or absence of heart disease from a reduced number of attributes. A genetic algorithm is
used to select features from the 76 features of a heart disease dataset gathered from multiple
hospitals, as well as from the Cleveland heart disease dataset retrieved from the UCI repository.
The genetic algorithm selects the features that contribute most to accurate prediction, which
minimizes the number of tests required and lowers treatment costs. Genetic feature reduction
reduces the seventy-six attributes to sixteen. Three classifiers (SVM, Naive Bayes, and KNN) are
then used to predict heart disease, with approximately similar accuracy when the number of
attributes is reduced. After the attributes were reduced with the genetic approach, Naive Bayes
performed best, consistently scoring 99.68 percent. Without feature reduction, classification
accuracy was poor and treatment costs were higher.
Keywords: Support Vector Machine, K Nearest Neighbor algorithm, Genetic algorithm.
1. Introduction
A large number of hospitals and health care centers have sprung up as a result of increased
health-care awareness and technological advancements. However, in underdeveloped nations, providing
high-quality health care at an affordable cost remains a challenge. Although many countries have
taken concrete steps to provide healthcare, the availability of diagnostic facilities for urban
populations and villages does not meet requirements. Not all doctors have the same level of
experience or expertise. Disease-specific information models, decision support systems for
assessing the severity of a disease, and X-ray and computed tomography processing systems are
available in corporate hospitals, but not all hospitals have them, and their use is limited.
Decision support systems for diagnosis would serve as a guideline for clinical decision making for
both new and experienced doctors.
According to the World Health Organization, cardiovascular disease (CVD) accounted for 29.2 percent
of all global fatalities in 2003. CVD is anticipated to overtake cancer as the major cause of
mortality in emerging countries by the end of this year, owing to changes in lifestyle, work culture,
and eating habits. As a result, more thorough and effective ways of diagnosing heart disorders, as
well as periodic examinations, are critical.
Machine learning (ML) is a subfield of AI that has grown in popularity as a component of data
science. ML models are capable of a wide range of tasks, including decision making, prediction
support, and classification. ML algorithms require training data to learn from. The learning step
produces a model, which is then tested and validated on previously unseen test datasets. Finally,
the model's predictions are compared with the actual values to measure how often the predicted
result is correct.
2. Related work
A great deal of effort has gone into developing effective medical diagnosis procedures for a variety
of disorders. The current study aims to use classification to predict a diagnosis from a smaller
number of indicators that contribute most to heart disease. Sellappan et al. and Asha et al. created
an intelligent model for heart disease prediction using Neural Network, Decision Tree, and Naive
Bayes classifiers. Naive Bayes performed well, with a prediction probability of 96.6 percent, and 13
attributes were considered in the prediction process. The number of attributes was later reduced to
six with the same results. Harleen et al. investigated the application of various data mining (DM)
classification techniques, such as ANN, decision trees, and rule induction, to the diagnosis of
diabetic patients. Carlos compared association rules with decision trees to build an effective
search for heart disease diagnosis. The current work continues this search for an effective
diagnosis.
Yu-Xuan Wang et al. highlighted several advantages that underline the importance of machine
learning approaches in diverse fields. They offered a novel approach to creating a functioning
framework that made use of several machine learning techniques. The data was assembled and then
reviewed after the correct output had been obtained from data mining professionals. Based on the
results of numerous tests, the recommended technique appears to be effective.
Zhiqiang Ge et al. (2017) surveyed data mining and analytics algorithms for various applications.
These methods are employed in industry for a variety of purposes. The article covered about eight
unsupervised and about ten supervised learning techniques, and also discussed semi-supervised
learning algorithms. According to their review of industrial practice, around 90 to 95 percent of
applications used both unsupervised and supervised machine learning techniques. ML approaches were
thus shown to be critical in the design of many applications in sectors such as medical analytics
and the service industry.
3. Methods and Techniques used
To create the heart disease prediction model, four popular machine learning techniques were chosen.
The following are the specifics of these techniques:
3.1. SVM
Support Vector Machine (SVM) is a machine learning tool for analyzing data and discovering patterns
in classification and regression analysis. SVM is usually considered when the data poses a two-class
problem. The technique determines the optimal hyperplane that separates the data points of one class
from those of the other. The greater the separation (margin) between the two classes, the better the
model is considered to be. Support vectors are the data points located at the edge of the margin.
The technique relies on mathematical optimization to model complex real-world problems. Because the
dataset collected from various hospitals has two classes to predict from a number of attributes,
SVM was selected for this paper. The SVM is trained using the scikit-learn (sklearn) library.
The most difficult part of developing an SVM model is avoiding overfitting and underfitting by
choosing the right kernel and technique. Because our dataset has a large number of instances, the
final SVM model must be tested and validated against real-world data.
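The geometry described in this subsection can be illustrated with a small sketch. It does not train
an SVM and it is not the sklearn routine used in this paper; it only assumes that a linear
hyperplane (w, b) has already been found, and the toy points and the hyperplane x1 + x2 - 3 = 0 are
hypothetical values chosen for illustration. The sketch classifies points by the sign of w.x + b,
computes the margin width 2/||w||, and lists the points that lie on the margin edge.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def margin_width(w):
    """Width of the margin between the two class boundaries, 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

def support_vectors(w, b, points, tol=1e-9):
    """Points lying on the margin edge, where |w.x + b| equals 1."""
    return [x for x in points if abs(abs(decision_value(w, b, x)) - 1.0) <= tol]

# Hypothetical hyperplane x1 + x2 - 3 = 0 and toy two-class points
w, b = [1.0, 1.0], -3.0
points = [[1.0, 1.0], [0.0, 1.0], [2.0, 2.0], [3.0, 2.0]]

# Class 1 on the non-negative side of the hyperplane, class 0 otherwise
labels = [1 if decision_value(w, b, x) >= 0 else 0 for x in points]
edge_points = support_vectors(w, b, points)
width = margin_width(w)
```

A wider margin (a smaller ||w||) corresponds to the greater class separation that the text
associates with a better model.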
3.2. Random Forest Classification
Random Forest is an ensemble of unpruned decision trees. It performs well on a variety of
real-world problems because it is robust to noise in the dataset and its risk of overfitting is
low. It is faster than many other tree-based algorithms and improves accuracy on test and
validation data. A random forest aggregates the predictions of independent decision trees. When
constructing the individual trees, there are several options for tuning the random forest's
performance.
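The aggregation step can be sketched as follows. This is a minimal illustration that assumes each
tree is trained on a bootstrap sample and that the forest combines the trees by majority vote; the
tree predictions shown are hypothetical values, and no decision tree is actually trained here.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) records with replacement, as one tree's training set."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Combine the class labels predicted by the individual trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
sample = bootstrap_sample(list(range(10)), rng)

# Hypothetical labels predicted by five trees for one patient record
tree_predictions = [1, 0, 1, 1, 0]
forest_prediction = majority_vote(tree_predictions)
```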
3.3. Naive Bayes
Naive Bayes is a supervised ML approach built on Bayes' theorem, with the additional assumption
that the attributes are independent of each other given the class. The Naive Bayes classifier is
often used when the input data is high-dimensional. The Naive Bayes approach is also valuable in
computer vision. It has proven to be a good classifier in practice.
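A minimal sketch of a Naive Bayes classifier for categorical attributes is given here for
illustration. It is written from scratch rather than taken from the implementation used in this
paper, the add-one (Laplace) smoothing is an assumption, and the two attributes (cholesterol level,
chest pain) and their values are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(X, y):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(y)
    value_counts = defaultdict(Counter)   # (feature index, class) -> value counts
    feature_values = defaultdict(set)     # feature index -> distinct values seen
    for row, label in zip(X, y):
        for j, value in enumerate(row):
            value_counts[(j, label)][value] += 1
            feature_values[j].add(value)
    return class_counts, value_counts, feature_values

def predict_nb(model, row):
    """Return the class with the highest log-posterior for `row`."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior of the class
        for j, value in enumerate(row):
            # Add-one smoothing avoids zero probabilities for unseen values
            numerator = value_counts[(j, label)][value] + 1
            denominator = count + len(feature_values[j])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical categorical records: [cholesterol level, chest pain]
X = [["high", "yes"], ["high", "no"], ["normal", "no"],
     ["normal", "no"], ["high", "yes"], ["normal", "yes"]]
y = [1, 1, 0, 0, 1, 0]

model = train_nb(X, y)
pred_sick = predict_nb(model, ["high", "yes"])
pred_healthy = predict_nb(model, ["normal", "no"])
```

Working in log-probabilities keeps the product of many small likelihoods from underflowing when
the number of attributes is large.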
3.4. KNN
The following steps in the algorithm describe the working of KNN:
1. Decide the number of neighbors (K).
2. Compute the Euclidean distance between the new data point and each training data point.
3. Use the computed Euclidean distances to find the K nearest neighbors.
4. Count the number of data points in each category among the K neighbors.
5. Assign the new data point to the category with the greatest number of neighbors.
6. Repeat the steps, adjusting K if necessary, until an appropriate result is found.
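The steps above can be sketched as follows. This is a minimal from-scratch illustration, not the
implementation used in this paper; the two attributes (age, resting blood pressure) and the record
values are hypothetical.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Distance from the query to every training record, closest first
    distances = sorted(
        (euclidean(row, query), label) for row, label in zip(train_X, train_y)
    )
    # Count the class labels among the k closest records
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical records: [age, resting blood pressure]
train_X = [[40, 120], [42, 125], [65, 160], [70, 170], [68, 165]]
train_y = [0, 0, 1, 1, 1]

pred_a = knn_predict(train_X, train_y, [66, 162], k=3)
pred_b = knn_predict(train_X, train_y, [41, 122], k=3)
```

In practice the attributes would usually be scaled first, since an attribute with a large numeric
range otherwise dominates the Euclidean distance.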
3.5. Data Set collection
The data set consists of 1040 records with 76 attributes used by medical experts, obtained from
various hospitals and the UCI repository. For simplicity, categorical attributes were used in all
models. The genetic algorithm reduces the number of attributes to sixteen, and the classification
models are trained on the reduced data set. K-fold cross-validation is used as the test mode. The
target class takes the value "0" for no heart disease and "1" for the presence of heart disease.
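Genetic feature selection of this kind can be sketched as below. The paper does not state the
population size, the crossover and mutation operators, or the fitness function, so everything in
this sketch is an assumption: chromosomes are binary masks over the attributes, selection takes
parents from the fitter half, crossover is single-point, mutation flips bits, and the two best
masks are carried forward unchanged. The toy fitness function and the INFORMATIVE indices are
hypothetical stand-ins; in a real study the fitness would more plausibly be the cross-validated
accuracy of a classifier trained on the selected attributes.

```python
import random

def genetic_feature_selection(n_features, fitness, pop_size=20,
                              generations=30, mutation_rate=0.05, seed=0):
    """Search for a binary feature mask that maximizes `fitness(mask)`."""
    rng = random.Random(seed)
    # Each chromosome is a list of 0/1 flags, one per attribute
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        next_gen = ranked[:2]  # elitism: keep the two best masks unchanged
        while len(next_gen) < pop_size:
            # Pick two parents from the fitter half of the population
            p1, p2 = rng.sample(ranked[:pop_size // 2], 2)
            cut = rng.randrange(1, n_features)  # single-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation with a small probability per gene
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            next_gen.append(child)
        population = next_gen
    return max(population, key=fitness)

# Hypothetical toy fitness: reward three "informative" attributes,
# and penalize every selected attribute to favor small subsets.
INFORMATIVE = {0, 3, 7}

def toy_fitness(mask):
    hits = sum(mask[i] for i in INFORMATIVE)
    return hits - 0.1 * sum(mask)

best_mask = genetic_feature_selection(12, toy_fitness)
selected = [i for i, g in enumerate(best_mask) if g]
```

The penalty term plays the role of the cost of additional medical tests: a mask that reaches the
same predictive score with fewer attributes is preferred.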
3.6. Preprocessing of Dataset used
Several features in the dataset have missing values, which lead to erroneous results and degrade
the model's accuracy. To address this, missing values are replaced with the mean of the
corresponding column. In general, this kind of imputation substitutes a zero, the average of
neighboring values, or the column mean. The dataset's attributes are then converted from numerical
to nominal values to make them compatible with the machine learning techniques used.
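Column-mean imputation can be sketched as follows. This is a minimal illustration that assumes
missing entries are represented by None and that a column with no observed values falls back to
zero; the records (age, cholesterol) are hypothetical.

```python
def impute_column_means(rows):
    """Replace None entries with the mean of the observed values in that column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # Fall back to 0.0 when a column has no observed values at all
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]

# Hypothetical records: [age, cholesterol], with two missing entries
records = [[63, 233.0], [41, None], [None, 250.0], [58, 210.0]]
cleaned = impute_column_means(records)
```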
3.7. Building Model
The model was built using Google Colab, a machine learning platform. Dataset collection, data
preprocessing, regression, data visualization, and feature selection can all be handled within the
software. It provides a simple environment for loading data from files, URLs, or databases, and it
supports CSV, ARFF, C4.5, and various other file formats. It also computes and graphically presents
the confusion matrix and performance measures such as precision, recall, and ROC.
The following four performance measures were used to compare the four models:
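• Precision
Precision is the fraction of predicted positive cases that are actually positive: the number of
true positives (TP) divided by the sum of true positives and false positives (FP).
Precision = TP / (TP + FP)
• Recall
Recall is the fraction of actual positive cases that the classifier correctly identifies, where FN
is the number of false negatives.
Recall = TP / (TP + FN)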
• Accuracy
Accuracy is the number of correct predictions, that is, true positives (TP) plus true negatives
(TN), divided by the total number of instances classified.
Accuracy = (TP + TN) / (Total instances)
• ROC Area
The ROC curve plots a classification model's true positive rate (TPR) against its false positive
rate (FPR); the ROC area is the area under this curve.
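The first three measures can be computed directly from the confusion-matrix counts, as in the
sketch below. The label vectors are hypothetical, and class 1 is assumed to be the positive class
(heart disease present).

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for a binary classification result."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, tn, fp, fn

def precision(tp, fp):
    """TP / (TP + FP), or 0.0 when nothing was predicted positive."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """TP / (TP + FN), or 0.0 when there are no actual positives."""
    return tp / (tp + fn) if tp + fn else 0.0

def accuracy(tp, tn, fp, fn):
    """(TP + TN) divided by the total number of instances."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical actual and predicted labels for eight patients
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
prec = precision(tp, fp)
rec = recall(tp, fn)
acc = accuracy(tp, tn, fp, fn)
```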
10-fold cross-validation is used to evaluate the model: the data is divided into ten folds, and in
each round one fold serves as the test dataset while the remaining nine form the training dataset.
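The fold construction can be sketched as follows. This is a minimal illustration that splits record
indices into contiguous folds without shuffling; shuffling the records beforehand is common but is
omitted here, and whether the study shuffled or stratified its folds is not stated. The record count
of 1040 is taken from the data set description.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder so that fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(1040, k=10))
```

Each record appears in exactly one test fold, so every record is used for testing once and for
training nine times; the reported score is typically the average over the ten rounds.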
3.8. Accuracy measurement of Model
The model is evaluated using performance measures such as accuracy, recall, precision, and ROC
area. When the ROC value is greater than 0.80, the classifier is rated "best", and when the ROC
value is 0.77, it is rated "good". The optimal model, with the highest accuracy, is one whose ROC
value is very close to 1.
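The ROC value can be computed from classifier scores as in the sketch below. It uses the rank
interpretation of the area under the ROC curve, namely the probability that a randomly chosen
positive case receives a higher score than a randomly chosen negative case, with ties counted as
one half. The labels and scores are hypothetical.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities of heart disease
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
```

A value of 1 means every positive case is scored above every negative case, while a value near 0.5
corresponds to random ranking.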
Table 1: Evaluation of Naive Bayes classifier by various parameters
Figure 1: Accuracy score of Naive Bayes, SVM and KNN with genetic algorithm feature selection
Figure 2: Accuracy score of Naive Bayes, SVM and KNN without genetic algorithm feature selection
Figure: ROC curve of Naive Bayes
4. Conclusion and Future work
The goal of this study was to predict the presence of heart disease with fewer attributes.
Initially, seventy-six attributes were used to predict heart disease. In the current study, a genetic
algorithm was used to identify the attributes that contribute most to the prediction of cardiac
illness, thereby reducing the number of tests that a patient must undergo.
Using genetic search, the seventy-six attributes were reduced to sixteen. Three classifiers (Naive
Bayes, SVM, and KNN) were then used to predict the diagnosis of patients, with approximately
similar accuracy when the number of attributes is reduced. The Naive Bayes algorithm performed
best. Before the model was built, inconsistencies and missing values were rectified. In future
work, we plan to extend this research by evaluating the effectiveness of fuzzy learning models for
predicting cardiac illness.
References
[1] Asha Rajkumar and Mrs. G. Sophia Reena (2010): Diagnosis of Heart Disease Using Datamining Algorithm,
GJCST, Vol. 10, Issue 10, Ver. 1.0, Sep 2010, pp. 38-43.
[2] Boleslaw Szymanski, et al. (2006): Using Efficient Supanova Kernel For Heart Disease Diagnosis, Proc. ANNIE 06,
Intelligent Engineering Systems Through Artificial Neural Networks, vol. 16, pp. 305-310.
[3] S. Pouriyeh, S. Vahid, G. Sannino, G. De Pietro, H. Arabnia and J. Gutierrez, "A comprehensive investigation and
comparison of Machine Learning Techniques in the domain of heart disease," 2017 IEEE Symposium on Computers
and Communications (ISCC), Heraklion, 2017, pp. 204-207, doi: 10.1109/ISCC.2017.8024530.
[4] S. Dhar, K. Roy, T. Dey, P. Datta and A. Biswas, "A Hybrid Machine Learning Approach for Prediction of Heart
Diseases," 2018 4th International Conference on Computing Communication and Automation (ICCCA), Greater
Noida, India, 2018, pp. 1-6, doi: 10.1109/CCAA.2018.8777531.
[5] C. Raju, E. Philipsy, S. Chacko, L. Padma Suresh and S. Deepa Rajan, "A Survey on Predicting Heart Disease using
Data Mining Techniques," 2018 Conference on Emerging Devices and Smart Systems (ICEDSS), Tiruchengode,
2018, pp. 253-255, doi: 10.1109/ICEDSS.2018.8544333.
[6] R. Detrano, A. Janosi, W. Steinbrunn, M. Pfisterer, J.-J. Schmid, S. Sandhu, K. H. Guppy, S. Lee, and V. Froelicher,
"International application of a new probability algorithm for the diagnosis of coronary artery disease," The American
Journal of Cardiology, vol. 64, no. 5, pp. 304-310, 1989.
[7] B. Edmonds, "Using localised 'gossip' to structure distributed learning," 2005.
[8] Bayu Adhi Tama, Afriyan Firdaus, Rodiyatul FS, "Detection of Type 2 Diabetes Mellitus with Data Mining
Approach Using Support Vector Machine", Vol. 11, issue 3, pp. 12-23, 2008.
[9] Yu-Xuan Wang, QiHui Sun, Ting-Ying Chien, Po-Chun Huang, "Using Data Mining and Machine Learning Techniques
for System Design Space Exploration and Automatized Optimization", Proceedings of the 2017 IEEE International
Conference on Applied System Innovation, vol. 15, pp. 1079-1082, 2017.
[10] Zhiqiang Ge, Zhihuan Song, Steven X. Ding, Biao Huang, "Data Mining and Analytics in the Process Industry: The
Role of Machine Learning", IEEE Access, vol. 5, pp. 20590-20616, 2017.
[11] https://en.wikipedia.org/wiki/Bayes%27_theorem
[12] https://en.wikipedia.org/wiki/Naive_Bayes_classifier
[13] https://towardsdatascience.com/understanding-random-forest-58381e0602d2
[14] Sellappan Palaniappan and Rafiah Awang (2008): Intelligent Heart Disease Prediction System Using Data Mining
Techniques, 978-1-4244-1968-5/08/ IEEE.
[15] Shantakumar B. Patil and Y. S. Kumaraswamy (2009): Intelligent and Effective Heart Attack Prediction System Using
Data Mining and Artificial Neural Network, European Journal of Scientific Research, ISSN 1450-216X, Vol. 31, No. 4,
pp. 642-656.