Research Article
Prediction of Heart Disease Using Feature Selection and
Random Forest Ensemble Method
DHYAN CHANDRA YADAV, SAURABH PAL
VBS Purvanchal University, Jaunpur, India
Email ID: dc9532105114@gmail.com, drsaurabhpal@yahoo.co.in
Received: 12.03.19, Revised: 28.05.20, Accepted: 02.06.20
ABSTRACT
The heart is a delicate and sensitive organ through which the body's blood circulation is maintained. Heart disease takes many forms, including disease of the pulmonary artery, angina, and congenital (birth) defects. Heart disease is mainly related to the constriction or blockage of blood vessels in the heart. The symptoms of heart disease depend on the type of disease, and heart disease occurs not only in adults but also in children. Infection of the tissue surrounding the heart is known as pericarditis, while infection of the lining of the heart muscle is known as myocarditis. The study of medical datasets is made far more tractable by machine learning algorithms, which provide techniques to identify dataset attributes and the relationships between them.
In this research work, we used heart disease data from the UCI repository. The dataset contains 1025 instances with 14 attributes and a target variable distinguishing sick from non-sick patients. In this paper, we proposed and analyzed classification accuracy, precision, and sensitivity for four tree-based classification algorithms: M5P, Random Tree, and Reduced Error Pruning, together with the Random Forest ensemble method. All prediction algorithms were applied after feature selection on the heart patients' dataset. We used three feature selection algorithms: Pearson Correlation, Recursive Feature Elimination, and Lasso regularization, and analyzed the data with each feature selection method to obtain better predictions. The analysis was carried out in three experiments: the first applied Pearson Correlation with M5P, Random Tree, Reduced Error Pruning, and the Random Forest ensemble method; the second applied Recursive Feature Elimination to the same four tree-based algorithms; the third applied Lasso regularization to the same algorithms. For each experiment we calculated classification accuracy, precision, and sensitivity.
From the results, we conclude that the feature selection methods Pearson Correlation and Lasso regularization combined with the Random Forest ensemble method provide the best results, with about 99% accuracy. We also find that the Random Forest ensemble method predicts better than the algorithms reported in previous years' work.
Keywords: Data Mining, Tree-Based Algorithms, Random Forest Ensemble Method, Feature Relevance Method, Feature Elimination Method, Lasso Regularization Method, Heart Disease.
INTRODUCTION
Large research institutions continue to investigate the factors related to heart disease. Smoking, age, high or low blood pressure, obesity, diabetes, and lack of exercise have been identified as main risk factors. According to researchers, identifying these factors helps in recognizing heart-related disease. Heart disease is also revealed by blockage of the blood vessels, which later raises the possibility of heart attack, chest pain, or stroke. The valves and heart muscle are mainly affected in heart disease, and mortality from heart disease among the world population is quite high.
Cardiovascular data are available in very large quantities in healthcare. Because of this volume, the data are difficult to study directly, but with the help of data mining, large collections are readily converted into information. This shows how the condition of heart disease has evolved in children and adults over past years, and its study also helps in estimating how to reduce the mortality caused by cardiovascular diseases in the future. Machine learning algorithms can improve the treatment of a person suffering from the disease by comparing its contributing factors.
Fig.1: Representation of blockage in the heart. Source: https://images.app.goo.gl/sSdy8qxDpni7fFTj6
Some of the symptoms of heart disease are as follows:
Tightness, pressure, and pain in the chest.
Pain in the chest, arms, neck, jaw, or back.
Signs of a heart attack include:
Dizziness or light-headedness.
Discoloration of the face.
Restlessness.
Trouble breathing, etc.
Some heart conditions are not easily recognized, for example:
Arrhythmia: the heartbeat becomes irregular.
Cardiogenic shock: the heart does not supply enough blood, and the person's blood pressure suddenly collapses.
Hypoxemia: severe difficulty in breathing due to a lack of oxygen in the blood.
Pulmonary edema: the accumulation of fluid in or around the lungs of a heart patient.
DVT (deep vein thrombosis): an excess of blood clots in the veins obstructs blood flow.
Myocardial rupture: damage to the wall of the heart, which indicates a major danger.
Ventricular aneurysm: a bulge in a heart chamber of the afflicted person, causing difficulty in breathing and impaired blood flow [1].
In this paper, we predict heart disease using a variety of feature selection algorithms applied to tree-based machine learning algorithms. Machine learning algorithms provide correlations between the related attributes.
RELATED WORKS
Cai et al. [2020] discussed heart arrhythmia and the 12-lead electrocardiogram. They used a one-dimensional deep densely connected neural network to detect atrial fibrillation and reported accuracy, sensitivity, and specificity of 99.35, 99.19, and 99.44 percent, respectively, on the test dataset [2].
Buettner et al. [2020] considered electroencephalography (EEG) recordings of patients. The authors analyzed five granular divisions of the EEG spectrum with machine learning classifiers. They used the Random Forest algorithm to discriminate between paranoid schizophrenic and non-schizophrenic persons with 96.77 percent classification accuracy [3].
Magesh and Swarnalatha [2020] analyzed cardiovascular ailments in rural areas. They found risk factors for coronary disease, such as smoking. The authors examined the target-level distribution of the samples and identified features through entropy. They used Random Forest for the prediction of heart disease and found 89.30 percent accuracy with cluster-based DT learning and 76.70 percent without cluster-based DT learning [4].
Shen et al. [2020] discussed atrial fibrillation arrhythmia. They used neural networks and manually extracted features for the prediction of atrial fibrillation. The authors used Decision Tree, Random Forest, GBDT, XGBoost, and LightGBM, and found 99.91 percent accuracy with a stacking model [5].
Kar et al. [2020] observed the condition of a patient's heart via the electrocardiogram signal. They analyzed the ECG signal with continuous and discrete wavelet transforms. The authors used time-interval and statistical features to classify irregular heartbeats. Using DT-CWT features with a K-NN classifier, they achieved 98.92 percent classification accuracy [6].
Harimoorthy and Thangavelu [2020] discussed hidden patterns in chronic kidney disease. They reduced some features of the chronic kidney disease data and improved the SVM radial basis kernel. The authors compared SVM-RBK with SVM-Linear, SVM-Polynomial, Random Forest, and Decision Tree, and reported accuracies of 98.3 percent, 98.7 percent, and 89.9 percent [7].
Miled et al. [2020] analyzed electronic medical records of diagnoses, prescriptions, and medical notes. They used machine learning algorithms to identify dementia and non-dementia cases and to predict outcomes. They applied Random Forest to three EMR datasets and found 77.43 percent accuracy [8].
METHODOLOGY
In this phase, we describe the heart patients' attributes and the applied algorithms. We visualized all the attributes, measured their distributions, and set out the applied algorithms together with the experimental setup.
Data Description:
In this paper, we used a dataset recorded on the UCI website. The dataset relates to heart patients and captures the distribution of heart disease patient attributes. The class distribution, box-and-whisker plots, and visualization of the dataset were produced with the Python language. The dataset contains 1025 instances and 14 attributes.
Class Distribution
target
0 499
1 526
dtype: int64
The class distribution of the dataset shows how many records fall into each target class (0 = no heart disease, 1 = heart disease).
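As a minimal sketch of this step (assuming the dataset is saved locally, for example as heart.csv, and that pandas is available), the class counts above can be reproduced as follows:
# Load the heart disease dataset and inspect its shape and class distribution.
import pandas as pd
df = pd.read_csv("heart.csv")            # assumed local copy of the 1025 x 14 UCI data
print(df.shape)                          # expected: (1025, 14)
print(df["target"].value_counts())       # expected counts: 1 -> 526, 0 -> 499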
Box and Whisker Plots
Each attribute and its numeric values help in disease prediction. Using box-and-whisker plots, we summarized the heart disease attributes and examined the distribution of each attribute [9].
Fig.2: Representation of Box and Whisker plotting of heart disease attributes
Histograms Representation
Histograms represent each attribute separately in a graph and show its distribution [10]. In this paper, we used the 14 heart disease attributes; each attribute contributes its own view of the whole dataset.
Fig.3: Representation of Histogram plotting of heart disease attributes
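A minimal plotting sketch for Fig.2 and Fig.3 (assuming the same heart.csv file and a pandas/matplotlib environment; the 4x4 panel layout is an assumption) could look like this:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("heart.csv")                       # assumed local copy of the UCI data
# Box-and-whisker plot per attribute (as in Fig.2)
df.plot(kind="box", subplots=True, layout=(4, 4), sharex=False, sharey=False, figsize=(12, 10))
plt.tight_layout()
plt.show()
# Histogram per attribute (as in Fig.3)
df.hist(figsize=(12, 10))
plt.tight_layout()
plt.show()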
Algorithms:
M5P algorithm
Fig.4: Representation of M5P algorithms for 40 instances and 14 attributes of heart disease
In this paper, we used this model for numeric prediction; the class values of the instances are found at the leaves. The algorithm works as an expert search over the nodes, each node contributing to the prediction [11]. As an example, we considered a subset of heart disease instances and their performance; a small illustrative sketch follows.
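M5P is a Weka model-tree learner with no direct scikit-learn equivalent; as a rough stand-in only (not the authors' exact setup), a regression tree on the numeric target behaves similarly, with predictions read off at the leaves:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
df = pd.read_csv("heart.csv")                       # assumed local file
X, y = df.drop(columns="target"), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
# Regression tree as an M5P-like numeric predictor (analogue only)
m5p_like = DecisionTreeRegressor(min_samples_leaf=4, random_state=1)
m5p_like.fit(X_train, y_train)
print("R^2 on the test split:", m5p_like.score(X_test, y_test))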
Random Tree Algorithm
Fig.5: Random Tree algorithms for 40 instances and 14 attributes of heart disease
The Random Tree algorithm randomly selects a subset of attributes at each decision node [12]. Its main task is to measure the performance of the class predictions together with their probabilities and to improve prediction performance at each node; a small sketch follows.
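Weka's RandomTree is not part of scikit-learn; as a hedged analogue, a decision tree that splits randomly over a random subset of candidate features captures the same idea:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
df = pd.read_csv("heart.csv")                       # assumed local file
X, y = df.drop(columns="target"), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
# Random splits over sqrt(n_features) candidate attributes at each node (analogue only)
random_tree_like = DecisionTreeClassifier(splitter="random", max_features="sqrt", random_state=1)
random_tree_like.fit(X_train, y_train)
print("Random-tree-like test accuracy:", random_tree_like.score(X_test, y_test))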
Reduced Error Pruning
Fig.6: Representation of REP algorithms for 40 instances and 14 attributes of heart disease
Reduced Error Pruning (REP) is based on the C4.5 algorithm [13]. In this experiment we used batch size = 100, max depth = -1, minimum variance proportion = 0.001, number of folds = 3, and seed = 1 for a fast learner at each node. The main objective of the algorithm is to reduce error by pruning nodes of the tree; a small sketch follows.
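Reduced-error pruning (Weka's REPTree) is also not available in scikit-learn; as a hedged analogue only, cost-complexity pruning of a decision tree gives a similarly pruned model (the ccp_alpha value below is illustrative, not a mapping of the Weka settings):
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
df = pd.read_csv("heart.csv")                       # assumed local file
X, y = df.drop(columns="target"), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
# Cost-complexity pruning as a stand-in for reduced-error pruning (analogue only)
rep_like = DecisionTreeClassifier(ccp_alpha=0.001, random_state=1)
rep_like.fit(X_train, y_train)
print("Pruned-tree test accuracy:", rep_like.score(X_test, y_test))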
Formula Representation:
Table 1: Computational Formulas for Prediction [14]

S.No.  Measure       Formula
1.     Accuracy      (TP + TN) / (TP + TN + FP + FN)
2.     Sensitivity   TP / (TP + FN)
3.     Specificity   TN / (TN + FP)

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.
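The formulas in Table 1 can be computed from a binary confusion matrix; a minimal helper (with placeholder labels, not the paper's actual predictions) is:
from sklearn.metrics import confusion_matrix

def report_metrics(y_true, y_pred):
    # Unpack the 2x2 confusion matrix as tn, fp, fn, tp
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)          # recall for the positive (sick) class
    specificity = tn / (tn + fp)
    return accuracy, sensitivity, specificity

# Toy example with placeholder labels
print(report_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 0]))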
Proposed Ensemble Method:
In this research paper, we used Random Forest as the ensemble method. Random Forest is a powerful ensemble of decision trees [15]; its main property is that each tree makes its decisions over randomly selected attributes. In this paper, we used the M5P tree, Random Tree, and Reduced Error Pruning tree alongside the Random Forest ensemble method. After feature selection, the tree algorithms and ensemble method were trained on 75% of the dataset and tested on the remaining 25%. The final prediction is obtained by average (majority) voting. In this experiment, we used bag size = 100%, batch size = 100, and seed = 1 for better prediction.
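A minimal sketch of this step (assuming heart.csv and mapping the 75/25 split and the 100-tree setting onto scikit-learn's RandomForestClassifier; the exact correspondence with the bag/batch settings is an assumption) is:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
df = pd.read_csv("heart.csv")                       # assumed local file
X, y = df.drop(columns="target"), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
# 100 bootstrapped trees; majority voting across trees gives the final class
rf = RandomForestClassifier(n_estimators=100, random_state=1)
rf.fit(X_train, y_train)
print("Random Forest test accuracy:", rf.score(X_test, y_test))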
Fig.7: Proposed model of the Random Forest algorithm as an ensemble model
RESULTS
In this paper, we used several feature selection methods and applied them with machine learning algorithms for better prediction.
Pearson correlation with the output variable gives high scores for several features, e.g. cp, exang, oldpeak, and target with 0.43, 0.43, 0.43, and 1.00, respectively; these are the highly correlated features.
The Pearson correlation feature selection method with the Random Forest ensemble method achieved 99.9% accuracy.
The Recursive Feature Elimination method gives an optimal number of features of 12, with a score of 0.54 for those 12 features.
Recursive Feature Elimination applied with Random Forest achieved 94.12% accuracy.
Lasso regularization via LassoCV() gave best alpha = 0.0048 and best score = 0.51.
The Lasso model dropped some features: fbs, chol, and age.
Lasso regularization with the Random Forest ensemble method achieved 99.9% accuracy.
DISCUSSION
In this section, we discuss the performance of each feature selection method with the machine learning algorithms:
Fig.8: Representation of Pearson Correlation for heart disease attributes
Attributes whose absolute correlation with the output variable exceeds 0.3 are treated as highly correlated [16]. We used the Pearson correlation matrix with the output variable and selected the highly correlated features listed in Table 2 (a short selection sketch follows the table):
Table 2: Valuable Score with Features of Pearson Correlation

Feature    Correlation
cp         0.434854
thalach    0.422895
exang      0.438029
oldpeak    0.438441
slope      0.345512
ca         0.382085
thal       0.337838
target     1.000000
(Name: target, dtype: float64)
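A minimal sketch of this selection (assuming heart.csv; the 0.3 threshold is taken from the text) is:
import pandas as pd
df = pd.read_csv("heart.csv")                       # assumed local file
cor = df.corr(method="pearson")["target"]           # Pearson correlation of each attribute with the target
selected = cor[abs(cor) > 0.3].index.drop("target") # keep attributes with |correlation| > 0.3
print(cor.sort_values(ascending=False))
print("Selected features:", list(selected))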
Table 3: Measured Prediction Performance With/Without PC by Tree Classifiers

              FSM (Without PC)                      FSM (With PC)
Algorithms    Specificity  Sensitivity  Accuracy    Specificity  Sensitivity  Accuracy
M5PT          41.2         94.5         89.3        83.1         91.2         93.4
RT            42.7         95.7         91.2        92.6         95.3         95.2
REPT          50.6         95.8         91.5        89.7         96.6         96.6
RFT           62.3         95.1         93.8        90.3         99.6         99.9
From Table 3, it is clear that Pearson Correlation (PC) feature selection with Random Forest (RFT) gives the highest accuracy and sensitivity.
We initialized the Recursive Feature Elimination model, fitted the data to the model, and obtained the following result:
[False True True False False False False False True True True True True]
[3 1 1 6 7 4 2 5 1 1 1 1 1]
RFE recursively removes heart attributes and rebuilds the model on the attributes that remain. All entries marked True are the most relevant features in the dataset, and entries marked False are irrelevant features [17].
We then computed the number of features and stored the optimum feature set:
Optimum number of features: 12; Score with 12 features: 0.541462
Transforming the data using RFE and fitting it to the model gives the selected columns:
Index(['age', 'sex', 'cp', 'trestbps', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal'], dtype='object')
A minimal RFE sketch is given below.
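The support_/ranking_ output above can be reproduced with scikit-learn's RFE; the estimator choice (logistic regression) and the file name are assumptions:
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
df = pd.read_csv("heart.csv")                       # assumed local file
X, y = df.drop(columns="target"), df["target"]
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=12)
rfe.fit(X, y)
print(rfe.support_)                                 # True = retained feature
print(rfe.ranking_)                                 # 1 = selected; larger values were eliminated earlier
print(X.columns[rfe.support_])                      # names of the 12 selected attributes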
Table 4: Measured Prediction Performance With/Without RFE by Tree Classifiers

              FSM (Without RFE)                     FSM (With RFE)
Algorithms    Specificity  Sensitivity  Accuracy    Specificity  Sensitivity  Accuracy
M5PT          52.4         83.7         74.8        92.3         82.5         80.6
RT            53.3         84.2         80.3        81.7         84.7         86.3
REPT          41.2         84.3         85.8        78.6         84.8         85.6
RFT           51.1         84.8         87.9        91.6         98.8         98.2
From Table 4, it is clear that Recursive Feature Elimination (RFE) with Random Forest (RFT) gives the highest accuracy and sensitivity.
In the Lasso regularization model, we used the cross-validation-based function for feature importance [18]:
reg = LassoCV()
Best alpha using built-in LassoCV: 0.004860
Best score using built-in LassoCV: 0.513496
The resulting plot, 'Feature importance using Lasso Model' (Fig.9), shows that some less important features are reduced: fbs, chol, and age.
Fig.9: Lasso regularization model for features selection
Lasso penalizes irrelevant features; as a result, the features fbs, chol, and age are penalized (shrunk toward zero). The remaining top and bottom features are strongly related to the target. A minimal LassoCV sketch is given below.
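A minimal sketch of this step (assuming heart.csv; the number of cross-validation folds and the absence of feature scaling are assumptions) is:
import pandas as pd
from sklearn.linear_model import LassoCV
df = pd.read_csv("heart.csv")                       # assumed local file
X, y = df.drop(columns="target"), df["target"]
reg = LassoCV(cv=5, random_state=1).fit(X, y)
print("Best alpha:", reg.alpha_)                    # ~0.0049 reported in the paper
print("Best score:", reg.score(X, y))               # ~0.51 reported in the paper
coef = pd.Series(reg.coef_, index=X.columns)
print("Features shrunk to zero:", list(coef[coef == 0].index))   # fbs, chol, age in the paper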
Table 5: Measured Prediction Performance With/Without Lasso Regularization by Tree Classifiers

              Without LRM                           With LRM
Algorithms    Specificity  Sensitivity  Accuracy    Specificity  Sensitivity  Accuracy
M5PT          63.5         72.2         83.8        75.2         73.2         89.8
RT            31.7         78.3         83.7        79.5         79.8         91.9
REPT          61.5         79.4         78.7        91.5         63.2         79.4
RFT           73.5         76.8         88.9        91.3         97.1         99.9
From Table 5, it is clear that the Lasso regularization model (LRM) with Random Forest (RFT) gives the highest accuracy and sensitivity.
Fig.10: Representation of accuracy by tree-based algorithms for heart disease
Tables 3, 4, and 5 compare the other algorithms, M5P, RT, and REPT, and Fig.10 summarizes all of the experiments above; the Random Forest ensemble method achieves the highest accuracy and sensitivity.
Table 6: Accuracy Scores Reported in Previous Years' Papers

Authors                        Instances   Algorithms          Accuracy
Hui et al. [2012] [19]         9,800       SRBC, ICA & RR      98.35
Martis et al. [2013] [20]      110,094     ICA, DWT & PNN      99.28
Ince et al. [2015] [21]        100,389     CNN & BP            98.90
Naomin et al. [2016] [22]      110,094     NN, SVM & PCA       98.90
Hua et al. [2017] [23]         90,808      SVM & Weighted RR   98.46
Oh et al. [2017] [24]          109,949     CNN                 94.47
Yildirim et al. [2018] [25]    7,376       DBLSTM              99.39
Yildirim et al. [2019] [26]    100,022     CAE-LSTM            99.23
Haotien et al. [2020] [27]     100,630     CNN                 99.06
We studied work from roughly 2012 to 2020 and found the highest reported accuracy to be about 99%. Those works compared different algorithms individually but did not reach 100% accuracy. In this research work, we tested different feature selection methods applied to different machine learning tree classifiers and found that the Random Forest ensemble method provides a better result, with 99.9% accuracy.
CONCLUSION
In this research paper, we used Pearson Correlation, Recursive Feature Elimination, and Lasso regularization as feature selection methods and applied them with machine learning tree-based classifier algorithms: M5P, Random Tree, and Reduced Error Pruning with the Random Forest ensemble method. In this analysis, we evaluated classification accuracy, precision, sensitivity, and ROC. We used a UCI Repository dataset with 1025 instances and 14 attributes. The task was to identify whether or not a person is suffering from a heart problem; machine learning algorithms provide various ways to work with such medical datasets. The important features were identified by Pearson correlation, Recursive Feature Elimination, and Lasso regularization, and with these selected features we examined the improved Random Forest, Random Tree, Reduced Error Pruning, and M5P classifier algorithms on heart disease. From the results, we find that the improved Random Forest ensemble method with batch size 100 and seed 1 provides better accuracy than the others. Since this work is based on recorded data from the UCI repository, in future work we will train and test on larger medical datasets with more than one ensemble method and try to improve their performance.
Conflict of Interest
Authors have no conflict of Interest.
Funding
This study was not funded.
Acknowledgements
The author is grateful to Veer Bahadur Singh Purvanchal University, Jaunpur, Uttar Pradesh, for providing financial support in the form of a Post-Doctoral Research Fellowship.
REFERENCES
1. Lui, C. K., Kerr, W. C., Li, L., Mulia, N., Ye, Y.,
Williams, E., ... & Lown, E. A. (2020). Lifecourse
Drinking Patterns, Hypertension, and Heart
Problems Among US Adults. American Journal of
Preventive Medicine.
2. Cai, W., Chen, Y., Guo, J., Han, B., Shi, Y., Ji, L.,
... & Luo, J. (2020). Accurate detection of atrial
fibrillation from 12-lead ECG using deep neural
network. Computers in biology and medicine, 116,
103378.
3. Buettner, R., Beil, D., Scholtz, S., & Djemai, A.
(2020, January). Development of a machine
learning based algorithm to accurately detect
schizophrenia based on one-minute EEG
recordings. In Proceedings of the 53rd Hawaii
International Conference on System Sciences.
4. Magesh, G., & Swarnalatha, P. (2020). Optimal
feature selection through a cluster-based DT
learning (CDTL) in heart disease
prediction. Evolutionary Intelligence, 1-11.
5. Shen, M., Zhang, L., Luo, X., & Xu, J. (2020,
January). Atrial Fibrillation Detection Algorithm
Based on Manual Extraction Features and
Automatic Extraction Features. In IOP Conference
Series: Earth and Environmental Science (Vol. 428,
No. 1, p. 012050). IOP Publishing.
6. Kar, N., Sahu, B., Sabut, S., & Sahoo, S. (2020).
Effective ECG Beat Classification and Decision
Support System Using Dual-Tree Complex
Wavelet Transform. In Advances in Intelligent
Computing and Communication (pp. 366-374).
Springer, Singapore.
7. Harimoorthy, K., & Thangavelu, M. (2020). Multi-
disease prediction model using improved SVM-
radial bias technique in healthcare monitoring
system. Journal of Ambient Intelligence and
Humanized Computing, 1-9.
8. Miled, Z. B., Haas, K., Black, C. M., Khandker, R.
K., Chandrasekaran, V., Lipton, R., & Boustani, M.
A. (2020). Predicting dementia with routine care
EMR data. Artificial Intelligence in Medicine, 102,
101771.
9. Norris, D. J. (2020). Introduction to machine
learning (ML) with the Raspberry Pi (RasPi).
In Machine Learning with the Raspberry Pi (pp. 1-
47). Apress, Berkeley, CA.
10. del Rio, A. A. H., Cuevas, E., & Zaldivar, D.
(2020). Multi-level Image Thresholding
Segmentation Using 2D Histogram Non-local
Means and Metaheuristics Algorithms.
In Applications of Hybrid Metaheuristic Algorithms
for Image Processing (pp. 121-149). Springer,
Cham.
11. Mudali, P., Roopa, J., Raju, M. G., & Yadav, A.
(2020). Analysis of Parallel M5P and Random
Forest Regression for Visualization of Traffic
Behavior. In Computational Intelligence in Pattern
Recognition (pp. 231-241). Springer, Singapore.
12. Nachmias, A. (2020). Uniform Spanning Trees of
Planar Graphs. In Planar Maps, Random Walks and
Circle Packing (pp. 89-103). Springer, Cham.
13. Thomas, T., Vijayaraghavan, A. P., & Emmanuel,
S. (2020). Applications of Decision Trees.
In Machine Learning Approaches in Cyber Security
Analytics (pp. 157-184). Springer, Singapore.
14. Shehab, M. (2019). Artificial Intelligence in Diffusion
MRI: Enhanced Cuckoo Search Algorithm with
Metaheuristic Components for Extracting the
Maxima of the Orientation Distribution
Function (Vol. 877). Springer Nature.
15. Sniatala, P., Amini, M. H., & Boroojeni, K. G. (2020). A Novel Fault Tolerant Random Forest Model Using Brooks-Iyengar Fusion. In Fundamentals of Brooks-Iyengar Distributed Sensing Algorithm (pp. 159-165). Springer, Cham.
16. Jain, G., Mahara, T., & Tripathi, K. N. (2020). A
Survey of Similarity Measures for Collaborative
Filtering-Based Recommender System. In Soft
Computing: Theories and Applications (pp. 343-
352). Springer, Singapore.
17. Kumari, P., & Haider, M. T. U. (2020). Sentiment Analysis on Aadhaar for Twitter Data: A Hybrid Classification Approach. In Proceedings of the International Conference on Computational Science and Applications (pp. 309-318). Springer, Singapore.
18. Chen, Q., & Huang, L. (2020). Research on
Prediction Model of Gas Emission Based on
Lasso Penalty Regression Algorithm. In Artificial
Intelligence in China (pp. 165-172). Springer,
Singapore.
19. Huang, H. F., Hu, G. S., & Zhu, L. (2012). Sparse
representation-based heartbeat classification
using independent component analysis. Journal of
medical systems, 36(3), 1235-1247.
20. Martis, R. J., Acharya, U. R., & Min, L. C. (2013).
ECG beat classification using PCA, LDA, ICA and
discrete wavelet transform. Biomedical Signal
Processing and Control, 8(5), 437-448.
21. Kiranyaz, S., Ince, T., & Gabbouj, M. (2015). Real-
time patient-specific ECG classification by 1-D
convolutional neural networks. IEEE Transactions
on Biomedical Engineering, 63(3), 664-675.
22. Elhaj, F. A., Salim, N., Harris, A. R., Swee, T. T., &
Ahmed, T. (2016). Arrhythmia recognition and
classification using combined linear and nonlinear
features of ECG signals. Computer methods and
programs in biomedicine, 127, 52-63.
23. Chen, S., Hua, W., Li, Z., Li, J., & Gao, X. (2017).
Heartbeat classification using projected and
dynamic features of ECG signal. Biomedical Signal
Processing and Control, 31, 165-173.
24. Acharya, U. R., Oh, S. L., Hagiwara, Y., Tan, J. H.,
Adam, M., Gertych, A., & San Tan, R. (2017). A
deep convolutional neural network model to
classify heartbeats. Computers in biology and
medicine, 89, 389-396.
25. Yildirim, Ö. (2018). A novel wavelet sequence
based on deep bidirectional LSTM network
model for ECG signal classification. Computers in
biology and medicine, 96, 189-202.
26. Yildirim, O., Baloglu, U. B., Tan, R. S., Ciaccio, E.
J., & Acharya, U. R. (2019). A new approach for
arrhythmia classification using deep coded
features and LSTM networks. Computer methods
and programs in biomedicine, 176, 121-133.
27. Wang, H., Shi, H., Chen, X., Zhao, L., Huang, Y.,
& Liu, C. (2020). An Improved Convolutional
Neural Network Based Approach for Automated
Heartbeat Classification. Journal of Medical
Systems, 44(2), 35.