ENSEMBLE HEART DISEASE PREDICTION USING ENHANCED FEATURE SELECTION BY GENETIC ALGORITHM AND BEST ACCURACY OBTAINED USING NAIVE BAYES, SVM AND KNN

December 2021

11(12):115-119

Authors:

ACE Engineering College



Juni Khyat ISSN: 2278-4632

(UGC Care Group I Listed Journal) Vol-11 Issue-12 No.01 December 2021

Page | 115 Copyright @ 2021 Author


R. Suresh, Research Scholar, Department of Computer Science & Engineering, Osmania University, Hyderabad

Dr. Nagaratna P. Hegde, Professor, Department of Computer Science and Engineering, Vasavi College of Engineering

Abstract:

A clinical diagnosis for any ailment relies on a doctor's competence and experience in the relevant field. Lack of experience in that field has resulted in incorrect diagnosis and treatment. Patients must undergo a number of tests in order to be diagnosed with heart disease, yet not every test that is routinely performed is essential for an accurate diagnosis. The goal of the study is to determine the presence or absence of heart disease from a reduced number of attributes. In this research, a genetic algorithm is used to select features from the 76 features in a heart disease dataset gathered from multiple hospitals, together with the Cleveland heart disease dataset retrieved from the UCI repository. The genetic algorithm selects the features that contribute most to accurate prediction. This minimizes the number of tests required, lowering treatment costs. Using genetic feature reduction, the seventy-six attributes are reduced to sixteen. Three classifiers, SVM, Naive Bayes, and KNN, are used to predict heart disease, and they achieve approximately similar accuracy once the number of attributes is reduced. After the attributes were reduced using the genetic approach, Naive Bayes performed best, consistently scoring 99.68 percent. Without feature reduction, classification accuracy was poor and treatment costs were higher.

Keywords: Support Vector Machine, K Nearest Neighbor algorithm, Genetic algorithm.

1. Introduction

A large number of hospitals and health care centers have sprung up as a result of increased health-care awareness and technological advancements. However, in underdeveloped nations, providing high-quality health care at an affordable cost remains a challenge. Although many countries have taken concrete steps to provide healthcare, the availability of diagnostic facilities to urban populations and villages does not meet requirements. Not all doctors have the same level of experience or expertise. Disease-specific information models, decision support systems for assessing the severity of a disease, and X-ray and computed tomography processing systems are available in corporate hospitals, but not all hospitals have them, and their use is limited. Decision support systems for diagnosis could serve as a guideline for clinical decision making for both new and experienced doctors.

According to the World Health Organization, cardiovascular disease (CVD) accounted for 29.2 percent of all global deaths in 2003. CVD is anticipated to overtake cancer as the major cause of mortality in emerging countries by the end of this year, owing to changes in lifestyle, work culture, and eating habits. As a result, more thorough and effective ways of diagnosing heart disorders, as well as periodic examinations, are critical.

ML is a subfield of AI that has grown in popularity as a component of data science. ML models are capable of a wide range of tasks, including decision making, prediction support, and classification. ML algorithms learn from training data, and the learning phase produces a model. The model is then tested and validated on previously unseen test data. Finally, the model's predictions are compared with the actual values to measure how accurate the model is overall.


2. Related work

A great deal of effort has gone into developing effective medical diagnosis procedures for a variety of disorders. The current study aims to use classification to predict a diagnosis from a smaller number of the indicators that contribute most to heart disease. Sellappan et al. and Asha et al. created an intelligent model for heart disease prediction using three classifiers: Neural Network, Decision Tree, and Naive Bayes. With a prediction probability of 96.6 percent, Naive Bayes worked well. Thirteen attributes were considered in the prediction process. The number of attributes was subsequently reduced to six with the same results. Harleen et al. investigated the application of various data mining classification techniques, such as ANN, decision trees, and rule induction, to the diagnosis of diabetic patients. Carlos compared association rules with decision trees to build an effective search for heart disease diagnosis. The current work continues this search for an effective diagnosis.

Yu-Xuan Wang et al. highlighted the importance of machine learning approaches in diverse fields and offered a novel approach to creating a functioning framework that made use of several machine learning techniques. The complete data was assembled and reviewed after receiving the right output from a data mining professional. Based on the results of numerous tests, the recommended technique appears to be effective.

Zhiqiang Ge et al. (2017) reviewed data mining and analytics algorithms for various applications. These methods were employed in industry for a variety of purposes. The article examined about eight unsupervised and about ten supervised learning techniques, and also discussed semi-supervised learning algorithms. According to their review, around 90 to 95 percent of industrial applications used both unsupervised and supervised machine learning techniques. This demonstrates that ML approaches are critical in the design of many applications in sectors such as the medical analytics service industry.

3. Methods and Techniques used

To create the heart disease prediction model, four popular machine learning techniques were chosen.

The following are the specifics of these techniques:

3.1. SVM

Support Vector Machine (SVM) is a machine learning tool for analyzing data and discovering patterns in classification and regression analysis. SVM is usually considered when the data poses a two-class problem. The technique separates the data by determining the optimal hyperplane that divides the data points of one class from those of the other. The greater the separation (margin) between the two classes, the more effective the model is considered to be. Support vectors are the data points that lie closest to the margin's edge.

The technique uses mathematical methods to model complex real-world situations. SVM was chosen for this paper because the dataset collected from various hospitals comprises two classes to be predicted from a number of attributes. SVM training is done with the scikit-learn (sklearn) library.

The most difficult part of developing a model with SVM is avoiding overfitting and underfitting by choosing the right kernel and technique. Because our dataset has a large number of instances, the final SVM model must be tested and validated against real-world data.
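A minimal sketch of two-class SVM training with scikit-learn is given here for illustration. The synthetic data, the RBF kernel, the value of `C`, and the 80/20 split are assumptions made for the example and are not taken from the paper.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the two-class data: 1040 records, 16 attributes.
X, y = make_classification(n_samples=1040, n_features=16, n_informative=8,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Feature scaling usually helps SVMs; kernel and C are illustrative choices.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
model.fit(X_train, y_train)

# Accuracy on held-out records, and the number of support vectors per class.
test_accuracy = model.score(X_test, y_test)
n_support = model.named_steps["svc"].n_support_
print(f"held-out accuracy: {test_accuracy:.3f}, support vectors: {n_support}")
```

Comparing training and held-out accuracy for several kernels is one simple way to spot the overfitting and underfitting mentioned above.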

3.2. Random Forest Classification

Random Forest is an ensemble of unpruned decision trees. It performs well on a variety of real-world problems because it is largely unaffected by noise in the dataset and its risk of overfitting is low. It is faster than many other tree-based algorithms and improves accuracy on test and validation data. A random forest aggregates the predictions of independent decision trees. When constructing the trees, there are several options for tuning the random forest's performance.

3.3. Naive Bayes

Naive Bayes is a supervised ML approach built on Bayes' theorem, with the added assumption that the attributes are independent of each other given the class. The Naive Bayes classifier is used when the input data is high-dimensional. In computer vision, the Naive Bayes approach is quite valuable. It has proven itself to be a good classifier.
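The following sketch shows how a Naive Bayes classifier for categorical attributes can be built from simple counts. The toy records, the attribute meanings, and the use of add-one (Laplace) smoothing are assumptions for illustration only and are not details given in the paper.

```python
import math
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Count class frequencies and per-attribute value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
    # Number of distinct values per attribute, needed for smoothing.
    n_values = [len({row[i] for row in rows}) for i in range(len(rows[0]))]
    return class_counts, value_counts, n_values

def predict_nb(model, row):
    """Return the class maximizing log P(c) + sum_i log P(x_i | c)."""
    class_counts, value_counts, n_values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)
        for i, value in enumerate(row):
            seen = value_counts[(i, label)][value]
            score += math.log((seen + 1) / (count + n_values[i]))  # add-one smoothing
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical records: (chest pain type, exercise-induced angina).
rows = [("typical", "yes"), ("typical", "yes"), ("atypical", "yes"),
        ("none", "no"), ("none", "no"), ("atypical", "no")]
labels = [1, 1, 1, 0, 0, 0]  # 1 = heart disease, 0 = no heart disease

model = train_nb(rows, labels)
prediction = predict_nb(model, ("typical", "yes"))
```

Working in log space avoids numerical underflow when many attribute probabilities are multiplied together.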

3.4. KNN

The following steps in the algorithm describe the working of KNN:

1. Decide the number of neighbors (K).

2. Compute the Euclidean distance between the new data point and each training data point.

3. Use the computed Euclidean distances to find the K nearest neighbors.

4. Count the number of data points in each category among the K neighbors.

5. Assign the new data point to the category with the greatest number of neighbors.

6. Repeat the steps, adjusting K if needed, until an appropriate result is found.
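The steps above can be sketched in a few lines of Python. The two-dimensional toy points and the choice of K = 3 are assumptions made for the example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, new_point, k=3):
    """Classify new_point by majority vote among its k nearest neighbors."""
    # Step 2: Euclidean distance from the new point to every training point.
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 3: keep the K nearest neighbors.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 4: count the neighbors in each category.
    votes = Counter(label for _, label in nearest)
    # Step 5: assign the category with the most neighbors.
    return votes.most_common(1)[0][0]

train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
                (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
train_labels = [0, 0, 0, 1, 1, 1]

prediction = knn_predict(train_points, train_labels, (1.1, 1.0), k=3)
```

An odd K avoids tied votes in a two-class problem such as this one.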

3.5. Data Set collection

The data set consists of 1040 records with 76 attributes used by medical experts, obtained from various hospitals and the UCI repository. Categorical attributes were used in all models for the sake of simplicity. With the genetic algorithm, the number of attributes is reduced to sixteen, and the classification models are fed the reduced data set. K-fold cross validation is used as the test mode. The target class takes the value "0" for no heart disease and "1" for heart disease.
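The paper does not describe the genetic operators it uses, so the following is only a sketch of how genetic feature selection over 76 attributes might look. Tournament selection, one-point crossover, bit-flip mutation, all parameter values, and the toy fitness function are assumptions. In practice the fitness of a feature mask would typically be a classifier's cross-validated accuracy on the selected attributes.

```python
import random

def tournament(population, fitness, rng):
    """Return the fitter of two randomly drawn individuals."""
    a, b = rng.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b

def genetic_feature_selection(n_features, fitness, pop_size=30,
                              generations=40, mutation_rate=0.02, seed=0):
    """Evolve bit masks (1 = keep the feature) toward higher fitness."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        children = []
        while len(children) < pop_size:
            mother = tournament(population, fitness, rng)
            father = tournament(population, fitness, rng)
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = mother[:cut] + father[cut:]
            # Bit-flip mutation: occasionally toggle a feature on or off.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = children
        best = max(population + [best], key=fitness)  # remember the best mask
    return best

# Toy fitness (hypothetical): pretend attributes 0..15 are the useful ones,
# reward selecting them and penalize every other selected attribute.
USEFUL = set(range(16))

def toy_fitness(mask):
    selected = {i for i, bit in enumerate(mask) if bit}
    return len(selected & USEFUL) - 0.5 * len(selected - USEFUL)

best_mask = genetic_feature_selection(76, toy_fitness)
selected_features = [i for i, bit in enumerate(best_mask) if bit]
```

The returned mask is the best one seen in any generation, so a weak final generation cannot lose an earlier good solution.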

3.6. Preprocessing of Dataset used

Several features in the dataset have missing values, which lead to erroneous results and degrade the model's accuracy. To solve this problem, missing values are replaced with the mean of the column. Alternatives to the column mean are substituting zero or the average of neighboring values. After that, the dataset's attributes are transformed from numerical to nominal values to make them compatible with the machine learning techniques used.
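As an illustration of these two preprocessing steps, the sketch below fills missing values with the column mean and then maps numbers to nominal categories. The cholesterol readings, cut points, and category names are hypothetical and are not taken from the paper.

```python
from statistics import mean

def impute_with_mean(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [value for value in column if value is not None]
    fill = mean(observed)
    return [fill if value is None else value for value in column]

def to_nominal(column, cut_points, names):
    """Map each number to the name of the interval it falls in."""
    nominal = []
    for value in column:
        # The number of cut points at or below the value picks the interval.
        index = sum(value >= cut for cut in cut_points)
        nominal.append(names[index])
    return nominal

# Hypothetical cholesterol readings with two missing values.
chol = [200, None, 260, 180, None, 240]
chol_filled = impute_with_mean(chol)

# Illustrative cut points: below 200, 200 to 239, and 240 or above.
chol_nominal = to_nominal(chol_filled, cut_points=[200, 240],
                          names=["low", "medium", "high"])
```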

3.7. Building Model

The model was created using Google Colab, a machine learning platform. The software handles dataset collection, data preprocessing, regression, data visualization, and feature selection with little effort. It provides a simple environment for loading data from files, URLs, or databases, and supports CSV, ARFF, C4.5, and various other file formats. It conveniently analyses and graphically presents the confusion matrix and performance parameters such as precision, recall, and ROC.

For the purposes of comparing the four models, the following four accuracy measures were used:

• Precision

It is the probability that a retrieved (predicted positive) outcome is relevant. Precision is the number of true positives (TP) divided by the number of TP plus false positives (FP): Precision = TP / (TP + FP).

• Recall

It is the fraction of actual positive cases that the classifier retrieves, where FN is the number of false negatives: Recall = TP / (TP + FN).

• Accuracy

A classifier's accuracy is the number of correct predictions divided by the total number of predictions, expressed as a percentage, where TN is the number of true negatives: Accuracy = (TP + TN) / (total number of instances).

• ROC Area

The ROC curve depicts the trade-off between a classification model's true positive rate (TPR) and false positive rate (FPR). The area under the curve summarizes the model's performance in a single value.
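The count-based measures above can be computed directly from the confusion matrix entries, as in the following sketch. The ten actual and predicted labels are made-up values for illustration.

```python
def confusion_counts(actual, predicted):
    """Return (TP, TN, FP, FN) for binary labels where 1 is the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == 1 and p == 1 for a, p in pairs)
    tn = sum(a == 0 and p == 0 for a, p in pairs)
    fp = sum(a == 0 and p == 1 for a, p in pairs)
    fn = sum(a == 1 and p == 0 for a, p in pairs)
    return tp, tn, fp, fn

actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(actual, predicted)
precision = tp / (tp + fp)           # correct among predicted positives
recall = tp / (tp + fn)              # retrieved among actual positives
accuracy = (tp + tn) / len(actual)   # correct among all predictions
```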

10-fold cross validation is used to evaluate the model. The data is divided into ten folds, and in each round nine folds form the training set while the remaining fold forms the test set.
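A minimal sketch of how the records can be partitioned for 10-fold cross validation follows. It uses contiguous folds for simplicity. Shuffling or stratifying the records first is common practice, and the paper does not state which variant was used.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# 1040 records, as in the data set described in this paper.
folds = list(k_fold_indices(1040, k=10))
for train, test in folds[:2]:
    print(len(train), len(test))
```

Each record appears in exactly one test fold, so every record is used for testing once and for training nine times.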

3.8. Accuracy measurement of Model

The model is evaluated using performance measures such as accuracy, recall, precision, and ROC. When the ROC value is greater than 0.80, the classifier is rated "best", and if the ROC value is around 0.77, it is rated "good". The optimal model with great accuracy is one with a ROC value very close to 1.
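The ROC value referred to here is the area under the ROC curve. The sketch below computes it from predicted scores with the trapezoidal rule. The six labels and scores are made-up values for illustration.

```python
def roc_points(actual, scores):
    """Return (FPR, TPR) points obtained by lowering the decision threshold."""
    positives = sum(actual)
    negatives = len(actual) - positives
    ranked = sorted(zip(scores, actual), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last record sharing this score.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(actual, scores)
auc = area_under_curve(points)
```

In this toy case eight of the nine positive-negative pairs are ranked correctly, so the area is 8/9, or about 0.89.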

Table 1: Evaluation of Naive Bayes classifier by various parameters

Figure 1: Accuracy score of Naive Bayes, SVM and KNN with genetic algorithm feature selection

Figure 2: Accuracy score of Naive Bayes, SVM and KNN without genetic algorithm feature selection


Figure: ROC curve of Naive Bayes

4. Conclusion and Future work

The goal of this study was to predict the presence of heart disease with a smaller number of attributes. Initially, seventy-six attributes were used to predict heart disease. In the current study, a genetic algorithm was used to identify the attributes that contribute most to the prediction of cardiac illness, thereby reducing the number of tests that a patient must undergo.

Using genetic search, the seventy-six attributes are reduced to sixteen. Three classifiers, Naive Bayes, SVM, and KNN, are then used to predict the diagnosis of patients, with approximately similar accuracy once the number of attributes is reduced. The Naive Bayes algorithm performed best. Before the model was built, inconsistencies and missing values were rectified. In future, we plan to continue our research by evaluating the effectiveness of fuzzy learning models for predicting cardiac illness.

References

[1] Asha Rajkumar and G. Sophia Reena (2010): Diagnosis of Heart Disease Using Datamining Algorithm, GJCST, Vol. 10, Issue 10, Ver. 1.0, Sep 2010, pp. 38-43.

[2] Boleslaw Szymanski, et al. (2006): Using Efficient Supanova Kernel for Heart Disease Diagnosis, Proc. ANNIE 06, Intelligent Engineering Systems through Artificial Neural Networks, vol. 16, pp. 305-310.

[3] S. Pouriyeh, S. Vahid, G. Sannino, G. De Pietro, H. Arabnia and J. Gutierrez, "A comprehensive investigation and comparison of Machine Learning Techniques in the domain of heart disease," 2017 IEEE Symposium on Computers and Communications (ISCC), Heraklion, 2017, pp. 204-207, doi: 10.1109/ISCC.2017.8024530.

[4] S. Dhar, K. Roy, T. Dey, P. Datta and A. Biswas, "A Hybrid Machine Learning Approach for Prediction of Heart

Diseases," 2018 4th International Conference on Computing Communication and Automation (ICCCA), Greater

Noida, India, 2018, pp. 1-6, doi: 10.1109/CCAA.2018.8777531.

[5] C. Raju, E. Philipsy, S. Chacko, L. Padma Suresh and S. Deepa Rajan, "A Survey on Predicting Heart Disease using

Data Mining Techniques," 2018 Conference on Emerging Devices and Smart Systems (ICEDSS), Tiruchengode,

2018, pp. 253-255, doi: 10.1109/ICEDSS.2018.8544333.

[6] R. Detrano, A. Janosi, W. Steinbrunn, M. Pfisterer, J.-J. Schmid,S. Sandhu, K. H. Guppy, S. Lee, and V. Froelicher,

“International application of a new probability algorithm for the diagnosis of coronary artery disease,” The American

journal of cardiology, vol. 64, no. 5, pp. 304–310, 1989.

[7] B. Edmonds, “Using localised ’gossip’ to structure distributed learning,” 2005.

[8] Bayu Adhi Tama, Afriyan Firdaus, Rodiyatul FS, “Detection of Type 2 Diabetes Mellitus with Data Mining Approach Using Support Vector Machine”, Vol. 11, issue 3, pp. 12-23, 2008.

[9] Yu-Xuan Wang, QiHui Sun, Ting-Ying Chien, Po-Chun Huang, “Using Data Mining and Machine Learning Techniques

for System Design Space Exploration and Automatized Optimization”, Proceedings of the 2017 IEEE International

Conference on Applied System Innovation, vol. 15, pp. 1079-1082, 2017.

[10] Zhiqiang Ge, Zhihuan Song, Steven X. Ding, Biao Huang, “Data Mining and Analytics in the Process Industry: The Role of Machine Learning”, IEEE Access, vol. 5, pp. 20590-20616, 2017.

[11] https://en.wikipedia.org/wiki/Bayes27_theorem

[12] https://en.wikipedia.org/wiki/Naive_Bayes_classifier

[13] https://towardsdatascience.com/understanding-random-forest-58381e0602d2

[14] Sellappan Palaniappan and Rafiah Awang (2008): Intelligent Heart Disease Prediction System Using Data Mining

Techniques, 978-1-4244-1968- 5/08/ IEEE.

[15] Shantakumar B. Patil and Y. S. Kumaraswamy (2009): Intelligent and Effective Heart Attack Prediction System Using Data Mining and Artificial Neural Network, European Journal of Scientific Research, ISSN 1450-216X, Vol. 31, No. 4, pp. 642-656.
