Research Article
Prediction of Heart Disease Using Feature Selection and
Random Forest Ensemble Method
DHYAN CHANDRA YADAV, SAURABH PAL
VBS Purvanchal University, Jaunpur, India
Email ID: dc9532105114@gmail.com, drsaurabhpal@yahoo.co.in
Received: 12.03.19, Revised: 28.05.20, Accepted: 02.06.20
ABSTRACT
The heart is a delicate and sensitive organ through which the body's blood circulation is maintained. Heart disease takes many forms, including disease of the pulmonary artery, angina, and congenital (birth) defects. Heart disease is mainly related to the constriction or blockage of blood vessels in the heart. The symptoms of heart disease depend on the type of disease, and heart disease occurs not only in adults but also in children. Infection of the tissue surrounding the heart is known as pericarditis, while infection of the lining of the heart muscle is known as myocarditis. The study of medical datasets is made far more tractable by machine learning algorithms, which provide techniques to identify dataset attributes and the relationships between them.
In this research work, we used heart disease data from the UCI repository. The dataset contains 1025 instances with 14 attributes and a target variable distinguishing sick from non-sick patients. In this paper, we proposed and analyzed classification accuracy, precision, and sensitivity for four tree-based classification algorithms: M5P, Random Tree, and Reduced Error Pruning, together with the Random Forest ensemble method. All prediction algorithms were applied after feature selection on the heart patients' dataset. We used three feature selection algorithms: Pearson Correlation, Recursive Feature Elimination, and Lasso regularization, and analyzed the data with each feature selection method to obtain better predictions. The analysis was carried out in three experiments: the first applied Pearson Correlation with M5P, Random Tree, Reduced Error Pruning, and the Random Forest ensemble method; the second applied Recursive Feature Elimination to the same four tree-based algorithms; the third applied Lasso regularization to the same algorithms. For each experiment we calculated classification accuracy, precision, and sensitivity.
From the results, we conclude that the feature selection methods Pearson Correlation and Lasso regularization combined with the Random Forest ensemble method provide the best results, with about 99% accuracy. We also find that the Random Forest ensemble method predicts better than the algorithms reported in previous years' work.
Keywords: Data Mining, Tree-Based Algorithms, Random Forest Ensemble Method, Feature Relevance Method, Feature Elimination Method, Lasso Regularization Method, Heart Disease.
INTRODUCTION
Large research institutions continue to investigate the factors related to heart disease. Smoking, age, high or low blood pressure, obesity, diabetes, and lack of exercise have been identified as main risk factors. According to researchers, identifying these factors helps in recognizing heart-related disease. Heart disease is also revealed by blockage of the blood vessels, which later raises the possibility of heart attack, chest pain, or stroke. The valves and heart muscle are mainly affected in heart disease, and mortality from heart disease among the world population is quite high.
Cardiovascular data are available in very large quantities in healthcare. Because of this volume, the data are difficult to study directly, but with the help of data mining, large collections are readily converted into information. This shows how the condition of heart disease has evolved in children and adults over past years, and its study also helps in estimating how to reduce the mortality caused by cardiovascular diseases in the future. Machine learning algorithms can improve the treatment of a person suffering from the disease by comparing its contributing factors.
Fig.1: Representation of blockage in the heart. Source: https://images.app.goo.gl/sSdy8qxDpni7fFTj6
Some of the symptoms of heart disease are as follows:
Tightness, pressure, and pain in the chest.
Pain in the chest, arms, neck, jaw, or back.
Signs of a heart attack include:
Dizziness or light-headedness.
Discoloration of the face.
Restlessness.
Trouble breathing, etc.
Some heart conditions are not easily recognized, for example:
Arrhythmia: the heartbeat becomes irregular.
Cardiogenic shock: the heart does not supply enough blood, and the person's blood pressure suddenly collapses.
Hypoxemia: severe difficulty in breathing due to a lack of oxygen in the blood.
Pulmonary edema: the accumulation of fluid in or around the lungs of a heart patient.
DVT (deep vein thrombosis): an excess of blood clots in the veins obstructs blood flow.
Myocardial rupture: damage to the wall of the heart, which indicates a major danger.
Ventricular aneurysm: a bulge in a heart chamber of the afflicted person, causing difficulty in breathing and impaired blood flow [1].
In this paper, we predict heart disease using a variety of feature selection algorithms applied to tree-based machine learning algorithms. Machine learning algorithms provide correlations between the related attributes.
RELATED WORKS
Cai et al. [2020] discussed heart arrhythmia and the 12-lead electrocardiogram. They used a one-dimensional deep densely connected neural network to detect atrial fibrillation and reported accuracy, sensitivity, and specificity of 99.35, 99.19, and 99.44 percent, respectively, on the test dataset [2].
Buettner et al. [2020] considered electroencephalography (EEG) recordings of patients. The authors analyzed five granular divisions of the EEG spectrum with machine learning classifiers. They used the Random Forest algorithm to discriminate between paranoid schizophrenic and non-schizophrenic persons with 96.77 percent classification accuracy [3].
Magesh and Swarnalatha [2020] analyzed cardiovascular ailments in rural areas. They found risk factors for coronary disease, such as smoking. The authors examined the target-level distribution of the samples and identified features through entropy. They used Random Forest for the prediction of heart disease and found 89.30 percent accuracy with cluster-based DT learning and 76.70 percent without cluster-based DT learning [4].
Shen et al. [2020] discussed atrial fibrillation arrhythmia. They used neural networks and manually extracted features for the prediction of atrial fibrillation. The authors used Decision Tree, Random Forest, GBDT, XGBoost, and LightGBM, and found 99.91 percent accuracy with a stacking model [5].
Kar et al. [2020] observed the condition of a patient's heart via the electrocardiogram signal. They analyzed the ECG signal with continuous and discrete wavelet transforms. The authors used time-interval and statistical features to classify irregular heartbeats. Using DT-CWT features with a K-NN classifier, they achieved 98.92 percent classification accuracy [6].
Harimoorthy and Thangavelu [2020] discussed hidden patterns in chronic kidney disease. They reduced some features of the chronic kidney disease data and improved the SVM radial basis kernel. The authors compared SVM-RBK with SVM-Linear, SVM-Polynomial, Random Forest, and Decision Tree, and reported accuracies of 98.3 percent, 98.7 percent, and 89.9 percent [7].
Miled et al. [2020] analyzed electronic medical records of diagnoses, prescriptions, and medical notes. They used machine learning algorithms to identify dementia and non-dementia cases and to predict outcomes. They applied Random Forest to three EMR datasets and found 77.43 percent accuracy [8].
METHODOLOGY
In this phase, we describe the heart patients' attributes and the applied algorithms. We visualized all the attributes, measured their distributions, and set out the applied algorithms together with the experimental setup.
Data Description:
In this paper, we used a dataset recorded on the UCI website. The dataset relates to heart patients and captures the distribution of heart disease patient attributes. The class distribution, box-and-whisker plots, and visualization of the dataset were produced with the Python language. The dataset contains 1025 instances and 14 attributes.
Class Distribution
target
0 499
1 526
dtype: int64
The class distribution of the dataset shows how many records fall into each target class (0 = no heart disease, 1 = heart disease).
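As a minimal sketch of this step (assuming the dataset is saved locally, for example as heart.csv, and that pandas is available), the class counts above can be reproduced as follows:
# Load the heart disease dataset and inspect its shape and class distribution.
import pandas as pd
df = pd.read_csv("heart.csv")            # assumed local copy of the 1025 x 14 UCI data
print(df.shape)                          # expected: (1025, 14)
print(df["target"].value_counts())       # expected counts: 1 -> 526, 0 -> 499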
Box and Whisker Plots
Each attribute and its numeric values help in disease prediction. Using box-and-whisker plots, we summarized the heart disease attributes and examined the distribution of each attribute [9].
Fig.2: Representation of Box and Whisker plotting of heart disease attributes
Histograms Representation
Histograms represent each attribute separately in a graph and show its distribution [10]. In this paper, we used the 14 heart disease attributes; each attribute contributes its own view of the whole dataset.
Fig.3: Representation of Histogram plotting of heart disease attributes
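A minimal plotting sketch for Fig.2 and Fig.3 (assuming the same heart.csv file and a pandas/matplotlib environment; the 4x4 panel layout is an assumption) could look like this:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("heart.csv")                       # assumed local copy of the UCI data
# Box-and-whisker plot per attribute (as in Fig.2)
df.plot(kind="box", subplots=True, layout=(4, 4), sharex=False, sharey=False, figsize=(12, 10))
plt.tight_layout()
plt.show()
# Histogram per attribute (as in Fig.3)
df.hist(figsize=(12, 10))
plt.tight_layout()
plt.show()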
Algorithms:
M5P algorithm
Fig.4: Representation of M5P algorithms for 40 instances and 14 attributes of heart disease
In this paper, we used this model for numeric prediction; the class values of the instances are found at the leaves. The algorithm works as an expert search over the nodes, each node contributing to the prediction [11]. As an example, we considered a subset of heart disease instances and their performance; a small illustrative sketch follows.
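M5P is a Weka model-tree learner with no direct scikit-learn equivalent; as a rough stand-in only (not the authors' exact setup), a regression tree on the numeric target behaves similarly, with predictions read off at the leaves:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
df = pd.read_csv("heart.csv")                       # assumed local file
X, y = df.drop(columns="target"), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
# Regression tree as an M5P-like numeric predictor (analogue only)
m5p_like = DecisionTreeRegressor(min_samples_leaf=4, random_state=1)
m5p_like.fit(X_train, y_train)
print("R^2 on the test split:", m5p_like.score(X_test, y_test))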
Random Tree Algorithm
Fig.5: Random Tree algorithms for 40 instances and 14 attributes of heart disease
The Random Tree algorithm randomly selects a subset of attributes at each decision node [12]. Its main task is to measure the performance of the class predictions together with their probabilities and to improve prediction performance at each node; a small sketch follows.
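Weka's RandomTree is not part of scikit-learn; as a hedged analogue, a decision tree that splits randomly over a random subset of candidate features captures the same idea:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
df = pd.read_csv("heart.csv")                       # assumed local file
X, y = df.drop(columns="target"), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
# Random splits over sqrt(n_features) candidate attributes at each node (analogue only)
random_tree_like = DecisionTreeClassifier(splitter="random", max_features="sqrt", random_state=1)
random_tree_like.fit(X_train, y_train)
print("Random-tree-like test accuracy:", random_tree_like.score(X_test, y_test))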
Reduced Error Pruning
Fig.6: Representation of REP algorithms for 40 instances and 14 attributes of heart disease
Reduced Error Pruning (REP) is based on the C4.5 algorithm [13]. In this experiment we used batch size = 100, max depth = -1, minimum variance proportion = 0.001, number of folds = 3, and seed = 1 for a fast learner at each node. The main objective of the algorithm is to reduce error by pruning nodes of the tree; a small sketch follows.
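Reduced-error pruning (Weka's REPTree) is also not available in scikit-learn; as a hedged analogue only, cost-complexity pruning of a decision tree gives a similarly pruned model (the ccp_alpha value below is illustrative, not a mapping of the Weka settings):
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
df = pd.read_csv("heart.csv")                       # assumed local file
X, y = df.drop(columns="target"), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
# Cost-complexity pruning as a stand-in for reduced-error pruning (analogue only)
rep_like = DecisionTreeClassifier(ccp_alpha=0.001, random_state=1)
rep_like.fit(X_train, y_train)
print("Pruned-tree test accuracy:", rep_like.score(X_test, y_test))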
Formula Representation:
Table 1: Computational Formulas for Prediction [14]

S.No.  Measure       Formula
1.     Accuracy      (TP + TN) / (TP + TN + FP + FN)
2.     Sensitivity   TP / (TP + FN)
3.     Specificity   TN / (TN + FP)

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.
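The formulas in Table 1 can be computed from a binary confusion matrix; a minimal helper (with placeholder labels, not the paper's actual predictions) is:
from sklearn.metrics import confusion_matrix

def report_metrics(y_true, y_pred):
    # Unpack the 2x2 confusion matrix as tn, fp, fn, tp
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)          # recall for the positive (sick) class
    specificity = tn / (tn + fp)
    return accuracy, sensitivity, specificity

# Toy example with placeholder labels
print(report_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 0]))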
Proposed Ensemble Method:
In this research paper, we used Random Forest as the ensemble method. Random Forest is a powerful ensemble of decision trees [15]; its main property is that each tree makes its decisions over randomly selected attributes. In this paper, we used the M5P tree, Random Tree, and Reduced Error Pruning tree alongside the Random Forest ensemble method. After feature selection, the tree algorithms and ensemble method were trained on 75% of the dataset and tested on the remaining 25%. The final prediction is obtained by average (majority) voting. In this experiment, we used bag size = 100%, batch size = 100, and seed = 1 for better prediction.
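A minimal sketch of this step (assuming heart.csv and mapping the 75/25 split and the 100-tree setting onto scikit-learn's RandomForestClassifier; the exact correspondence with the bag/batch settings is an assumption) is:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
df = pd.read_csv("heart.csv")                       # assumed local file
X, y = df.drop(columns="target"), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
# 100 bootstrapped trees; majority voting across trees gives the final class
rf = RandomForestClassifier(n_estimators=100, random_state=1)
rf.fit(X_train, y_train)
print("Random Forest test accuracy:", rf.score(X_test, y_test))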
Fig.7: Proposed model of the Random Forest algorithm as an ensemble model
RESULTS
In this paper, we used several feature selection methods and applied them with machine learning algorithms for better prediction.
Pearson correlation with the output variable gives high scores for several features, e.g. cp, exang, oldpeak, and target with 0.43, 0.43, 0.43, and 1.00, respectively; these are the highly correlated features.
The Pearson correlation feature selection method with the Random Forest ensemble method achieved 99.9% accuracy.
The Recursive Feature Elimination method gives an optimal number of features of 12, with a score of 0.54 for those 12 features.
Recursive Feature Elimination applied with Random Forest achieved 94.12% accuracy.
Lasso regularization via LassoCV() gave best alpha = 0.0048 and best score = 0.51.
The Lasso model dropped some features: fbs, chol, and age.
Lasso regularization with the Random Forest ensemble method achieved 99.9% accuracy.
DISCUSSION
In this section, we discuss the performance of each feature selection method with the machine learning algorithms:
Fig.8: Representation of Pearson Correlation for heart disease attributes
Attributes whose absolute correlation with the output variable exceeds 0.3 are treated as highly correlated [16]. We used the Pearson correlation matrix with the output variable and selected the highly correlated features listed in Table 2 (a short selection sketch follows the table):
Table 2: Valuable Score with Features of Pearson Correlation

Feature    Correlation
cp         0.434854
thalach    0.422895
exang      0.438029
oldpeak    0.438441
slope      0.345512
ca         0.382085
thal       0.337838
target     1.000000
(Name: target, dtype: float64)
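A minimal sketch of this selection (assuming heart.csv; the 0.3 threshold is taken from the text) is:
import pandas as pd
df = pd.read_csv("heart.csv")                       # assumed local file
cor = df.corr(method="pearson")["target"]           # Pearson correlation of each attribute with the target
selected = cor[abs(cor) > 0.3].index.drop("target") # keep attributes with |correlation| > 0.3
print(cor.sort_values(ascending=False))
print("Selected features:", list(selected))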
Table 3: Measured Prediction Performance With/Without PC by Tree Classifiers

              FSM (Without PC)                      FSM (With PC)
Algorithms    Specificity  Sensitivity  Accuracy    Specificity  Sensitivity  Accuracy
M5PT          41.2         94.5         89.3        83.1         91.2         93.4
RT            42.7         95.7         91.2        92.6         95.3         95.2
REPT          50.6         95.8         91.5        89.7         96.6         96.6
RFT           62.3         95.1         93.8        90.3         99.6         99.9
From Table 3, it is clear that Pearson Correlation (PC) feature selection with Random Forest (RFT) gives the highest accuracy and sensitivity.
We initialized the Recursive Feature Elimination model, fitted the data to the model, and obtained the following result:
[False True True False False False False False True True True True True]
[3 1 1 6 7 4 2 5 1 1 1 1 1]
RFE recursively removes heart attributes and rebuilds the model on the attributes that remain. All entries marked True are the most relevant features in the dataset, and entries marked False are irrelevant features [17].
We then computed the number of features and stored the optimum feature set:
Optimum number of features: 12; Score with 12 features: 0.541462
Transforming the data using RFE and fitting it to the model gives the selected columns:
Index(['age', 'sex', 'cp', 'trestbps', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal'], dtype='object')
A minimal RFE sketch is given below.
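The support_/ranking_ output above can be reproduced with scikit-learn's RFE; the estimator choice (logistic regression) and the file name are assumptions:
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
df = pd.read_csv("heart.csv")                       # assumed local file
X, y = df.drop(columns="target"), df["target"]
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=12)
rfe.fit(X, y)
print(rfe.support_)                                 # True = retained feature
print(rfe.ranking_)                                 # 1 = selected; larger values were eliminated earlier
print(X.columns[rfe.support_])                      # names of the 12 selected attributes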
Table 4: Measured Prediction Performance With/Without RFE by Tree Classifiers

              FSM (Without RFE)                     FSM (With RFE)
Algorithms    Specificity  Sensitivity  Accuracy    Specificity  Sensitivity  Accuracy
M5PT          52.4         83.7         74.8        92.3         82.5         80.6
RT            53.3         84.2         80.3        81.7         84.7         86.3
REPT          41.2         84.3         85.8        78.6         84.8         85.6
RFT           51.1         84.8         87.9        91.6         98.8         98.2
From Table 4, it is clear that Recursive Feature Elimination (RFE) with Random Forest (RFT) gives the highest accuracy and sensitivity.
In the Lasso regularization model, we used the cross-validation-based function for feature importance [18]:
reg = LassoCV()
Best alpha using built-in LassoCV: 0.004860
Best score using built-in LassoCV: 0.513496
The resulting plot, 'Feature importance using Lasso Model' (Fig.9), shows that some less important features are reduced: fbs, chol, and age.
Fig.9: Lasso regularization model for features selection
Lasso penalizes irrelevant features; as a result, the features fbs, chol, and age are penalized (shrunk toward zero). The remaining top and bottom features are strongly related to the target. A minimal LassoCV sketch is given below.
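A minimal sketch of this step (assuming heart.csv; the number of cross-validation folds and the absence of feature scaling are assumptions) is:
import pandas as pd
from sklearn.linear_model import LassoCV
df = pd.read_csv("heart.csv")                       # assumed local file
X, y = df.drop(columns="target"), df["target"]
reg = LassoCV(cv=5, random_state=1).fit(X, y)
print("Best alpha:", reg.alpha_)                    # ~0.0049 reported in the paper
print("Best score:", reg.score(X, y))               # ~0.51 reported in the paper
coef = pd.Series(reg.coef_, index=X.columns)
print("Features shrunk to zero:", list(coef[coef == 0].index))   # fbs, chol, age in the paper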
Table 5: Measured Prediction Performance With/Without Lasso Regularization by Tree Classifiers

              Without LRM                           With LRM
Algorithms    Specificity  Sensitivity  Accuracy    Specificity  Sensitivity  Accuracy
M5PT          63.5         72.2         83.8        75.2         73.2         89.8
RT            31.7         78.3         83.7        79.5         79.8         91.9
REPT          61.5         79.4         78.7        91.5         63.2         79.4
RFT           73.5         76.8         88.9        91.3         97.1         99.9
From Table 5, it is clear that the Lasso regularization model (LRM) with Random Forest (RFT) gives the highest accuracy and sensitivity.
Fig.10: Representation of accuracy by tree-based algorithms for heart disease
Tables 3, 4, and 5 compare the other algorithms, M5P, RT, and REPT, and Fig.10 summarizes all of the experiments above; the Random Forest ensemble method achieves the highest accuracy and sensitivity.
Table 6: Accuracy Scores Reported in Previous Years' Papers

Authors                        Instances   Algorithms          Accuracy
Hui et al. [2012] [19]         9,800       SRBC, ICA & RR      98.35
Martis et al. [2013] [20]      110,094     ICA, DWT & PNN      99.28
Ince et al. [2015] [21]        100,389     CNN & BP            98.90
Naomin et al. [2016] [22]      110,094     NN, SVM & PCA       98.90
Hua et al. [2017] [23]         90,808      SVM & Weighted RR   98.46
Oh et al. [2017] [24]          109,949     CNN                 94.47
Yildirim et al. [2018] [25]    7,376       DBLSTM              99.39
Yildirim et al. [2019] [26]    100,022     CAE-LSTM            99.23
Haotien et al. [2020] [27]     100,630     CNN                 99.06
We studied work from roughly 2012 to 2020 and found the highest reported accuracy to be about 99%. Those works compared different algorithms individually but did not reach 100% accuracy. In this research work, we tested different feature selection methods applied to different machine learning tree classifiers and found that the Random Forest ensemble method provides a better result, with 99.9% accuracy.
CONCLUSION
In this research paper, we used Pearson Correlation, Recursive Feature Elimination, and Lasso regularization as feature selection methods and applied them with machine learning tree-based classifier algorithms: M5P, Random Tree, and Reduced Error Pruning with the Random Forest ensemble method. In this analysis, we evaluated classification accuracy, precision, sensitivity, and ROC. We used a UCI Repository dataset with 1025 instances and 14 attributes. The task was to identify whether or not a person is suffering from a heart problem; machine learning algorithms provide various ways to work with such medical datasets. The important features were identified by Pearson correlation, Recursive Feature Elimination, and Lasso regularization, and with these selected features we examined the improved Random Forest, Random Tree, Reduced Error Pruning, and M5P classifier algorithms on heart disease. From the results, we find that the improved Random Forest ensemble method with batch size 100 and seed 1 provides better accuracy than the others. Since this work is based on recorded data from the UCI repository, in future work we will train and test on larger medical datasets with more than one ensemble method and try to improve their performance.
Conflict of Interest
Authors have no conflict of Interest.
Funding
This study was not funded.
Acknowledgements
The author is grateful to Veer Bahadur Singh Purvanchal University, Jaunpur, Uttar Pradesh, for providing financial support in the form of a Post-Doctoral Research Fellowship.
REFERENCES
1. Lui, C. K., Kerr, W. C., Li, L., Mulia, N., Ye, Y.,
Williams, E., ... & Lown, E. A. (2020). Lifecourse
Drinking Patterns, Hypertension, and Heart
Problems Among US Adults. American Journal of
Preventive Medicine.
2. Cai, W., Chen, Y., Guo, J., Han, B., Shi, Y., Ji, L.,
... & Luo, J. (2020). Accurate detection of atrial
fibrillation from 12-lead ECG using deep neural
network. Computers in biology and medicine, 116,
103378.
3. Buettner, R., Beil, D., Scholtz, S., & Djemai, A.
(2020, January). Development of a machine
learning based algorithm to accurately detect
schizophrenia based on one-minute EEG
recordings. In Proceedings of the 53rd Hawaii
International Conference on System Sciences.
4. Magesh, G., & Swarnalatha, P. (2020). Optimal
feature selection through a cluster-based DT
learning (CDTL) in heart disease
prediction. Evolutionary Intelligence, 1-11.
5. Shen, M., Zhang, L., Luo, X., & Xu, J. (2020,
January). Atrial Fibrillation Detection Algorithm
Based on Manual Extraction Features and
Automatic Extraction Features. In IOP Conference
Series: Earth and Environmental Science (Vol. 428,
No. 1, p. 012050). IOP Publishing.
6. Kar, N., Sahu, B., Sabut, S., & Sahoo, S. (2020).
Effective ECG Beat Classification and Decision
Support System Using Dual-Tree Complex
Wavelet Transform. In Advances in Intelligent
Computing and Communication (pp. 366-374).
Springer, Singapore.
7. Harimoorthy, K., & Thangavelu, M. (2020). Multi-
disease prediction model using improved SVM-
radial bias technique in healthcare monitoring
system. Journal of Ambient Intelligence and
Humanized Computing, 1-9.
8. Miled, Z. B., Haas, K., Black, C. M., Khandker, R.
K., Chandrasekaran, V., Lipton, R., & Boustani, M.
A. (2020). Predicting dementia with routine care
EMR data. Artificial Intelligence in Medicine, 102,
101771.
9. Norris, D. J. (2020). Introduction to machine
learning (ML) with the Raspberry Pi (RasPi).
In Machine Learning with the Raspberry Pi (pp. 1-
47). Apress, Berkeley, CA.
10. del Rio, A. A. H., Cuevas, E., & Zaldivar, D.
(2020). Multi-level Image Thresholding
Segmentation Using 2D Histogram Non-local
Means and Metaheuristics Algorithms.
In Applications of Hybrid Metaheuristic Algorithms
for Image Processing (pp. 121-149). Springer,
Cham.
11. Mudali, P., Roopa, J., Raju, M. G., & Yadav, A.
(2020). Analysis of Parallel M5P and Random
Forest Regression for Visualization of Traffic
Behavior. In Computational Intelligence in Pattern
Recognition (pp. 231-241). Springer, Singapore.
12. Nachmias, A. (2020). Uniform Spanning Trees of
Planar Graphs. In Planar Maps, Random Walks and
Circle Packing (pp. 89-103). Springer, Cham.
13. Thomas, T., Vijayaraghavan, A. P., & Emmanuel,
S. (2020). Applications of Decision Trees.
In Machine Learning Approaches in Cyber Security
Analytics (pp. 157-184). Springer, Singapore.
14. Shehab, M. (2019). Artificial Intelligence in Diffusion
MRI: Enhanced Cuckoo Search Algorithm with
Metaheuristic Components for Extracting the
Maxima of the Orientation Distribution
Function (Vol. 877). Springer Nature.
15. Sniatala, P., Amini, M. H., & Boroojeni, K. G. (2020). A Novel Fault Tolerant Random Forest Model Using Brooks-Iyengar Fusion. In Fundamentals of Brooks-Iyengar Distributed Sensing Algorithm (pp. 159-165). Springer, Cham.
16. Jain, G., Mahara, T., & Tripathi, K. N. (2020). A
Survey of Similarity Measures for Collaborative
Filtering-Based Recommender System. In Soft
Computing: Theories and Applications (pp. 343-
352). Springer, Singapore.
17. Kumari, P., & Haider, M. T. U. (2020). Sentiment Analysis on Aadhaar for Twitter Data: A Hybrid Classification Approach. In Proceedings of the International Conference on Computational Science and Applications (pp. 309-318). Springer, Singapore.
18. Chen, Q., & Huang, L. (2020). Research on
Prediction Model of Gas Emission Based on
Lasso Penalty Regression Algorithm. In Artificial
Intelligence in China (pp. 165-172). Springer,
Singapore.
19. Huang, H. F., Hu, G. S., & Zhu, L. (2012). Sparse
representation-based heartbeat classification
using independent component analysis. Journal of
medical systems, 36(3), 1235-1247.
20. Martis, R. J., Acharya, U. R., & Min, L. C. (2013).
ECG beat classification using PCA, LDA, ICA and
discrete wavelet transform. Biomedical Signal
Processing and Control, 8(5), 437-448.
21. Kiranyaz, S., Ince, T., & Gabbouj, M. (2015). Real-
time patient-specific ECG classification by 1-D
convolutional neural networks. IEEE Transactions
on Biomedical Engineering, 63(3), 664-675.
22. Elhaj, F. A., Salim, N., Harris, A. R., Swee, T. T., &
Ahmed, T. (2016). Arrhythmia recognition and
classification using combined linear and nonlinear
features of ECG signals. Computer methods and
programs in biomedicine, 127, 52-63.
23. Chen, S., Hua, W., Li, Z., Li, J., & Gao, X. (2017).
Heartbeat classification using projected and
dynamic features of ECG signal. Biomedical Signal
Processing and Control, 31, 165-173.
24. Acharya, U. R., Oh, S. L., Hagiwara, Y., Tan, J. H.,
Adam, M., Gertych, A., & San Tan, R. (2017). A
deep convolutional neural network model to
classify heartbeats. Computers in biology and
medicine, 89, 389-396.
25. Yildirim, Ö. (2018). A novel wavelet sequence
based on deep bidirectional LSTM network
model for ECG signal classification. Computers in
biology and medicine, 96, 189-202.
26. Yildirim, O., Baloglu, U. B., Tan, R. S., Ciaccio, E.
J., & Acharya, U. R. (2019). A new approach for
arrhythmia classification using deep coded
features and LSTM networks. Computer methods
and programs in biomedicine, 176, 121-133.
27. Wang, H., Shi, H., Chen, X., Zhao, L., Huang, Y.,
& Liu, C. (2020). An Improved Convolutional
Neural Network Based Approach for Automated
Heartbeat Classification. Journal of Medical
Systems, 44(2), 35.