ISSN (ONLINE):2395-695X
ISSN (PRINT):2395-695X
International Journal of Advanced Research in Basic Engineering Sciences and Technology (IJARBEST)
Vol.3, Issue.3, March 2017
Early Prediction of Heart Disease Using
Decision Tree Algorithm
A. Sankari Karthiga1, M. Safish Mary2, M. Yogasini3
M.Phil. Scholar1, Assistant Professor2, Assistant Professor3,
Mother Teresa Women’s University, Kodaikanal1
St. Xavier’s College, Tirunelveli2
Sadakathdulla Appa College, Tirunelveli3
sankarikarthiga.sk@gmail.com
Abstract- Numerous techniques are used to process large amounts of data, and data mining is
among the most widely used. Data mining combines traditional data analysis with
sophisticated algorithms. Medical data mining is an important research area because of its
applications in healthcare, where the classification and prediction of medical datasets
remain challenging. Heart disease is the leading cause of death worldwide. Predicting a
heart attack is difficult for medical practitioners because it is a complex task that
requires experience and knowledge. Health-sector data contain hidden information that can
support decision making. In this research, the decision tree and Naïve Bayes data mining
algorithms are applied to predict heart attacks. The results show a prediction accuracy of
up to 99%. Data mining enables the health sector to discover predictive patterns in its
datasets.
Index Terms- Decision Tree Algorithm, Naïve Bayes Algorithm.
I. INTRODUCTION
1.1. DATA MINING
Data Mining is about explaining the past and predicting the future by means of data
analysis. Data mining is a multi-disciplinary field that combines statistics, machine learning,
artificial intelligence and database technology. The value of data mining applications is often
estimated to be very high. Many businesses have stored large amounts of data over years of
operation, and data mining is able to extract very valuable knowledge from this data. The
businesses are then able to leverage the extracted knowledge into more clients, more sales,
and greater profits. This is also true in the engineering and medical fields.
1.1.1. Statistics
Statistics is the science of collecting, classifying, summarizing, organizing, analyzing, and
interpreting data.
1.1.2. Artificial Intelligence
Artificial intelligence is the study of computer algorithms that simulate intelligent behaviour
in order to perform activities that are normally thought to require intelligence.
A. Sankari Karthiga, M. Safish Mary, M. Yogasini ©IJARBEST PUBLICATIONS
This work by IJARBEST is licensed under Creative Commons Attribution 4.0 International License. Available at https://www.ijarbest.com/Archive
1.1.3. Machine Learning
Machine learning is the study of computer algorithms that improve automatically through
experience.
1.1.4. Database
The science and technology of collecting, storing and managing data so users can retrieve,
add, update or remove such data.
1.1.5. Data warehousing
The science and technology of collecting, storing and managing data with advanced multi-
dimensional reporting services in support of the decision-making processes.
1.1.6. Explaining the Past
Data mining explains the past through data exploration.
1.1.7. Predicting the Future
Data mining predicts the future by means of modeling.
1.1.8. Data Exploration
Data Exploration is about describing the data by means of statistical and visualization
techniques. We explore data in order to bring important aspects of that data into focus for
further analysis.
Data mining is “a non-trivial extraction of implicit, previously unknown and potentially
useful information about data” [1]. In short, it is the process of analyzing data from
different perspectives and gathering knowledge from it. The discovered knowledge can be used
in many applications, for example in the healthcare industry. Today the healthcare industry
generates large amounts of data about patients, disease diagnoses and so on, and data mining
provides a set of techniques for discovering hidden patterns in these data. A major challenge
facing the healthcare industry is quality of service, which implies diagnosing disease
correctly and providing effective treatment to patients. Poor diagnosis can lead to disastrous
and unacceptable consequences.
According to a WHO survey, 17 million deaths worldwide are due to heart attacks and
strokes. In many countries, deaths from heart disease are attributed to work overload,
mental stress and many other problems. Overall, heart disease is the primary cause of death
in adults. Diagnosis is a complicated and important task that must be carried out accurately
and efficiently. It is often based on the doctor’s experience and knowledge, which can lead to
unwanted results and excessive treatment costs for patients. Therefore, an automatic medical
diagnosis system is designed that takes advantage of the collected database and a decision
support system. Such a system can help to diagnose disease with fewer medical tests and more
effective treatment.
1.2. MEDICAL DATA MINING
Medical data mining has great potential for exploring the hidden patterns in the data
sets of the medical domain. These patterns can be utilized for clinical diagnosis. However,
the available raw medical data are widely distributed, heterogeneous in nature, and
voluminous. These data need to be collected in an organized form and can then be integrated
to form a hospital information system. Data mining technology provides a user-oriented
approach to discovering novel and hidden patterns in the data.
The World Health Organization has estimated that 12 million deaths occur worldwide
every year due to heart disease. Half of the deaths in the United States and other developed
countries are due to cardiovascular diseases, which are also the chief cause of death in
numerous developing countries. On the whole, heart disease is regarded as the primary cause
of death in adults. The term heart disease encompasses the diverse diseases that affect the
heart. Heart disease is a major cause of death in many countries, including India, and kills
one person every 34 seconds in the United States. Coronary heart disease, cardiomyopathy and
cardiovascular disease are some categories of heart disease. The term “cardiovascular
disease” covers a wide range of conditions that affect the heart and the blood vessels and
the manner in which blood is pumped and circulated through the body. Cardiovascular disease
(CVD) results in severe illness, disability, and death. The diagnosis of diseases is a vital
and intricate job in medicine.
Medical diagnosis is regarded as an important yet complicated task that needs to be
executed accurately and efficiently, and automating it would be extremely advantageous.
Regrettably, not all doctors possess expertise in every subspecialty, and there is a shortage
of resource persons in certain places. An automatic medical diagnosis system would therefore
be beneficial by bringing this expertise together. Appropriate computer-based information
and/or decision support systems can help to carry out clinical tests at a reduced cost.
Efficient and accurate implementation of an automated system requires a comparative study of
the available techniques. This paper aims to analyze the different predictive and descriptive
data mining techniques proposed in recent years for the diagnosis of heart disease.
Clinical decisions are often made on the basis of the doctor’s intuition and experience
rather than on the knowledge-rich data hidden in the database. This practice leads to
unwanted biases, errors and excessive medical costs, which affect the quality of service
provided to patients. Data mining has the potential to generate a knowledge-rich environment
that can significantly improve the quality of clinical decisions.
Decision Tree is a popular classifier which is simple and easy to implement. It
requires no domain knowledge or parameter setting and can handle high-dimensional data.
The results obtained from decision trees are easy to read and interpret. The drill-through
feature for accessing detailed patients’ profiles is available only in decision trees.
Naïve Bayes is a statistical classifier which assumes no dependency between
attributes. It attempts to maximize the posterior probability in determining the class. An
advantage of the Naive Bayes model is that one can work with it without using any Bayesian
methods. Naive Bayes classifiers have worked well in many complex real-world situations.
1.3. HEART DISEASE
The heart is an important organ of the human body. It is essentially a pump that
circulates blood through the body. If the circulation of blood is inefficient, organs such as
the brain suffer, and if the heart stops working altogether, death occurs within minutes.
Life is completely dependent on the efficient working of the heart. The term heart disease
refers to disease of the heart and the blood vessel system within it.
A number of factors have been shown to increase the risk of heart disease:
Family history
Smoking
Poor diet
High blood pressure
High blood cholesterol
Obesity
Physical inactivity
Hypertension
Factors like these are used to analyze heart disease. In many cases, diagnosis is
based on the patient’s current test results and the doctor’s experience. Diagnosis is thus a
complex task that requires much experience and high skill.
Heart disease is a broad term that includes all types of diseases affecting different
components of the heart. 'Cardio' means heart; therefore, all heart diseases belong to the
category of cardiovascular diseases. Some types of heart disease are:
1. Coronary heart disease: also known as coronary artery disease (CAD), it is
the most common type of heart disease across the world. It is a condition in
which plaque deposits block the coronary blood vessels, leading to a reduced
supply of blood and oxygen to the heart.
2. Angina pectoris: a medical term for chest pain that occurs due to an
insufficient supply of blood to the heart. Also known as angina, it is a warning
signal for heart attack. The chest pain occurs at intervals lasting from a few
seconds to a few minutes.
3. Congestive heart failure: a condition in which the heart cannot pump enough
blood to the rest of the body. It is commonly known as heart failure.
4. Cardiomyopathy: a weakening of the heart muscle or a change in the
structure of the muscle, associated with inadequate heart pumping. Some of the
common causes of cardiomyopathy are hypertension, alcohol consumption, viral
infections, and genetic defects.
5. Congenital heart disease: also known as congenital heart defect, it refers to
the formation of an abnormal heart due to a defect in the structure of the heart
or its functioning. It is a type of congenital disease that children are born
with.
6. Arrhythmias: disorders in the rhythmic movement of the heartbeat. The
heartbeat can be slow, fast, or irregular. These abnormal heartbeats are caused
by a short circuit in the heart's electrical system.
7. Myocarditis: an inflammation of the heart muscle, usually caused by viral,
fungal, or bacterial infections affecting the heart. It is an uncommon disease
with few symptoms, such as joint pain, leg swelling or fever, that cannot be
directly related to the heart.
1.4. DECISION TREES
The decision tree approach is a powerful technique for classification problems. It
involves two steps: building a tree and applying the tree to the dataset. Popular decision
tree algorithms include CART, ID3, C4.5, CHAID, and J48. Of these, the J48 algorithm is used
for this system. J48 uses a pruning method when building a tree. Pruning is a technique that
reduces the size of the tree by removing branches that overfit the data, since overfitting
leads to poor accuracy in predictions. The J48 algorithm recursively classifies data until it
has been categorized as perfectly as possible, which gives maximum accuracy on the training
data. The overall concept is to build a tree that balances flexibility and accuracy.
1.5. NAIVE BAYES
The Naive Bayes classifier is based on Bayes' theorem. It relies on conditional
independence: it assumes that the value of an attribute, given the class, is independent of
the values of the other attributes.
1.6. ORGANIZATION OF THE THESIS
This chapter is organized as follows: first, we outline the basics of patient physiology
and fetus response to different stages of oxygen deficiency - hypoxemia, hypoxia, and
asphyxia. Next, we describe an interaction between mother and fetus during gestation with
emphasis on the antepartum and intrapartum period. Finally, we introduce methods for the
patient hypoxia diagnostics with focus on electronic patient monitoring that involves
observation of CTG or FECG changes. We stress the significance of signal interpretation and
describe advantages and disadvantages of respective methods.
II. CLASSIFICATION USING DECISION TREE ALGORITHM
2.1. INTRODUCTION
Decision tree builds classification or regression models in the form of a tree structure.
It breaks down a dataset into smaller and smaller subsets while at the same time an associated
decision tree is incrementally developed. The final result is a tree with decision nodes and
leaf nodes. A decision node (e.g., Outlook) has two or more branches (e.g., Sunny, Overcast
and Rainy). A leaf node (e.g., Play) represents a classification or decision. The topmost
decision node in a tree, which corresponds to the best predictor, is called the root node.
Decision trees can handle both categorical and numerical data.
2.2. ALGORITHM
The core algorithm for building decision trees is C4.5 by J. R. Quinlan, which
employs a top-down, greedy search through the space of possible branches with no
backtracking. C4.5 uses entropy and information gain to construct a decision tree.
2.3. ENTROPY
A decision tree is built top-down from a root node and involves partitioning the data
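into subsets that contain instances with similar values (homogeneous). The ID3 algorithm uses
entropy to calculate the homogeneity of a sample. If the sample is completely homogeneous
the entropy is zero, and if the sample is equally divided between two classes the entropy is
one. To build a decision tree, we need to calculate two types of entropy using frequency
tables as follows:
a) Entropy using the frequency table of one attribute, where p_i is the proportion of
records that belong to class i:
$$E(S) = \sum_{i=1}^{c} -p_i \log_2 p_i$$
b) Entropy using the frequency table of two attributes, where T is the target, X is the
attribute used for the split, and T_v is the part of T for which X takes the value v:
$$E(T, X) = \sum_{v \in X} P(v)\, E(T_v)$$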
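The two entropy calculations can be illustrated with a minimal Python sketch. The function names `entropy` and `split_entropy` and the tiny label lists are assumptions made for illustration; they are not taken from the experiments reported in this paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Base-2 entropy of a list of class labels (frequency table of one attribute)."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def split_entropy(attribute_values, labels):
    """Weighted entropy of the target after splitting on one attribute
    (frequency table of two attributes)."""
    total = len(labels)
    branches = {}
    for value, label in zip(attribute_values, labels):
        branches.setdefault(value, []).append(label)
    return sum(len(b) / total * entropy(b) for b in branches.values())


# Hypothetical toy data: an equally divided target and a perfectly separating attribute.
target = ["yes", "yes", "no", "no"]
attribute = ["high", "high", "low", "low"]

print(entropy(target))                   # equally divided sample
print(split_entropy(attribute, target))  # every branch is homogeneous
```

An equally divided two-class sample gives an entropy of one, and a split whose branches are all homogeneous gives a weighted entropy of zero, which matches the two boundary cases described above.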
2.4. INFORMATION GAIN
The information gain is based on the decrease in entropy after a dataset is split on an
attribute. Constructing a decision tree is all about finding the attribute that returns the
highest information gain (i.e., the most homogeneous branches).
Step 1: Calculate entropy of the target.
Step 2: The dataset is then split on the different attributes. The entropy for each branch is
calculated. Then it is added proportionally, to get total entropy for the split. The resulting
entropy is subtracted from the entropy before the split. The result is the Information Gain, or
decrease in entropy.
Step 3: Choose attribute with the largest information gain as the decision node.
Step 4(a): A branch with entropy of 0 is a leaf node.
Step 4(b): A branch with entropy more than 0 needs further splitting.
Step 5: The ID3 algorithm is run recursively on the non-leaf branches, until all data is
classified.
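The steps above can be sketched in Python as follows. This is a minimal illustration of recursive splitting on the attribute with the largest information gain, not the implementation used for the experiments; the attribute names (`chest_pain`, `smoker`), the target name `disease` and the four records are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Base-2 entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Steps 1 and 2: entropy before the split minus weighted entropy after it."""
    before = entropy([r[target] for r in rows])
    branches = defaultdict(list)
    for r in rows:
        branches[r[attribute]].append(r[target])
    after = sum(len(b) / len(rows) * entropy(b) for b in branches.values())
    return before - after


def build_tree(rows, attributes, target):
    """Steps 3 to 5: pick the best attribute, then recurse on impure branches."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:          # Step 4(a): entropy 0, so this is a leaf
        return labels[0]
    if not attributes:                 # nothing left to split on: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    for value in sorted({r[best] for r in rows}):
        subset = [r for r in rows if r[best] == value]
        tree[best][value] = build_tree(subset, remaining, target)
    return tree


# Hypothetical records, for illustration only.
rows = [
    {"chest_pain": "yes", "smoker": "yes", "disease": "yes"},
    {"chest_pain": "yes", "smoker": "no", "disease": "yes"},
    {"chest_pain": "no", "smoker": "yes", "disease": "no"},
    {"chest_pain": "no", "smoker": "no", "disease": "no"},
]
tree = build_tree(rows, ["chest_pain", "smoker"], "disease")
print(tree)
```

In this toy dataset `chest_pain` separates the classes completely, so it has the largest gain and becomes the root, and both of its branches are leaves.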
2.5. DECISION TREE TO DECISION RULES
A decision tree can easily be transformed to a set of rules by mapping from the root
node to the leaf nodes one by one.
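As a minimal sketch of this mapping, assume the tree is stored as nested Python dictionaries of the form {attribute: {value: subtree or leaf label}}. That representation, the function names and the small Outlook tree below are assumptions made for illustration only.

```python
def tree_to_rules(tree, conditions=()):
    """Yield one (conditions, label) pair for every root-to-leaf path."""
    if not isinstance(tree, dict):      # reached a leaf: emit the accumulated path
        yield list(conditions), tree
        return
    for attribute, branches in tree.items():
        for value, subtree in branches.items():
            yield from tree_to_rules(subtree, conditions + ((attribute, value),))


def format_rule(conditions, label):
    """Render a path as an IF ... THEN ... rule."""
    if not conditions:
        return f"THEN {label}"
    clause = " AND ".join(f"{attribute} = {value}" for attribute, value in conditions)
    return f"IF {clause} THEN {label}"


# Hypothetical tree, loosely following the Outlook/Play example.
tree = {
    "Outlook": {
        "Overcast": "Play",
        "Sunny": {"Windy": {"False": "Play", "True": "Don't play"}},
        "Rainy": "Don't play",
    }
}

rules = [format_rule(c, label) for c, label in tree_to_rules(tree)]
for rule in rules:
    print(rule)
```

Each leaf produces exactly one rule, so the number of rules equals the number of leaf nodes in the tree.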
III. CLASSIFICATION USING NAIVE BAYES CLASSIFIER
A. INTRODUCTION
The Naive Bayesian classifier is based on Bayes’ theorem with independence
assumptions between predictors. A Naive Bayesian model is easy to build, with no
complicated iterative parameter estimation which makes it particularly useful for very large
datasets. Despite its simplicity, the Naive Bayesian classifier often does surprisingly well and
is widely used because it often outperforms more sophisticated classification methods.
B. ALGORITHM
Bayes' theorem provides a way of calculating the posterior probability, P(c|x), from P(c),
P(x), and P(x|c). The Naive Bayes classifier assumes that the effect of the value of a
predictor (x) on a given class (c) is independent of the values of other predictors. This
assumption is called class conditional independence.
P(c|x) is the posterior probability of class (target) given predictor (attribute).
P(c) is the prior probability of class.
P(x|c) is the likelihood, which is the probability of predictor given class.
P(x) is the prior probability of predictor.
Thus, we can write:
$$P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)}$$
$$P(c \mid X) \propto P(x_1 \mid c) \times P(x_2 \mid c) \times \cdots \times P(x_n \mid c) \times P(c)$$
As an illustration, consider a set of objects that are each labelled GREEN or RED. Since
there is a total of 60 objects, 40 of which are GREEN and 20 RED, our prior probabilities
for class membership are:
Prior probability of GREEN = 40/60
Prior probability of RED = 20/60
Having formulated our prior probability, we are now ready to classify a new object X
(WHITE circle). Since the objects are well clustered, it is reasonable to assume that the more
GREEN (or RED) objects there are in the vicinity of X, the more likely it is that the new case
belongs to that particular color. To measure this likelihood, we draw a circle around X which
encompasses a number (to be chosen a priori) of points irrespective of their class labels. Then
we count the number of points in the circle belonging to each class label. From this we
calculate the likelihood:
Likelihood of X given GREEN = (number of GREEN in the vicinity of X) / (total number of GREEN)
Likelihood of X given RED = (number of RED in the vicinity of X) / (total number of RED)
It is clear that the likelihood of X given GREEN is smaller than the likelihood of X given
RED, since the circle encompasses 1 GREEN object and 3 RED ones. Thus:
Likelihood of X given GREEN = 1/40
Likelihood of X given RED = 3/20
Although the prior probabilities indicate that X may belong to GREEN (given that there are
twice as many GREEN objects as RED), the likelihood indicates otherwise: that the class
membership of X is RED (given that there are more RED objects in the vicinity of X than
GREEN). In Bayesian analysis, the final classification is produced by combining both
sources of information, i.e., the prior and the likelihood, to form a posterior probability
using the so-called Bayes' rule (named after Rev. Thomas Bayes, 1702-1761):
Posterior probability of X being GREEN ∝ 4/6 × 1/40 = 1/60
Posterior probability of X being RED ∝ 2/6 × 3/20 = 1/20
Finally, we classify X as RED since its class membership achieves the largest posterior
probability.
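The arithmetic of this example can be checked with a short Python sketch using exact fractions. The counts (40 GREEN and 20 RED objects, with 1 GREEN and 3 RED inside the circle) come from the text; the variable names are chosen here for illustration.

```python
from fractions import Fraction

counts = {"GREEN": 40, "RED": 20}     # objects per class
in_circle = {"GREEN": 1, "RED": 3}    # objects per class in the vicinity of X
total = sum(counts.values())

# Prior: share of each class among all objects.
prior = {c: Fraction(n, total) for c, n in counts.items()}
# Likelihood: share of each class that falls inside the circle around X.
likelihood = {c: Fraction(in_circle[c], counts[c]) for c in counts}
# Unnormalised posterior: prior times likelihood.
unnormalised = {c: prior[c] * likelihood[c] for c in counts}
# Dividing by the sum turns the two scores into probabilities that add up to 1.
evidence = sum(unnormalised.values())
posterior = {c: unnormalised[c] / evidence for c in counts}

predicted = max(posterior, key=posterior.get)
print(unnormalised)  # GREEN: 1/60, RED: 1/20
print(posterior)     # GREEN: 1/4, RED: 3/4
print(predicted)     # RED
```

Normalising does not change which class wins; it only shows that, under these counts, RED is three times as probable as GREEN for X.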
Naive Bayes can be modelled in several different ways including normal, lognormal, gamma
and Poisson density functions:
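For the normal case, each numeric predictor is modelled within each class by a normal density with the class mean and variance. The following is a minimal from-scratch sketch under that assumption; the function names, the two hypothetical predictors (age and cholesterol) and the six records are illustrative and are not the data or the code used in this study.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate the prior, per-feature means and variances for every class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small floor on the variance avoids division by zero.
        variances = [max(sum((v - m) ** 2 for v in col) / n, 1e-9)
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model


def log_normal_pdf(x, mean, var):
    """Logarithm of the normal density with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def predict(model, row):
    """Return the class with the largest log prior plus summed log likelihoods."""
    scores = {}
    for label, (prior, means, variances) in model.items():
        scores[label] = math.log(prior) + sum(
            log_normal_pdf(x, m, v) for x, m, v in zip(row, means, variances))
    return max(scores, key=scores.get)


# Hypothetical records: [age, cholesterol], for illustration only.
X = [[45, 180], [50, 190], [48, 185], [62, 260], [65, 270], [60, 250]]
y = ["NO", "NO", "NO", "YES", "YES", "YES"]

model = fit_gaussian_nb(X, y)
print(predict(model, [63, 255]))
print(predict(model, [47, 182]))
```

Working with sums of logarithms rather than products of densities is a common design choice, because products of many small probabilities can underflow to zero.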
C. PERFORMANCE ANALYSES
(i) DECISION TREE CLASSIFIER-CROSS VALIDATION (EXPERIMENTAL RESULTS)
(ii) DECISION TREE PERFORMANCE METRICS
METHOD             DECISION TREE
Accuracy (%)       98.2753
Sensitivity (%)    95.452
Specificity (%)    97.7919
(iii) SUMMARY
Decision tree construction techniques are generally computationally inexpensive,
making it possible to construct models quickly even when the training set is very large.
Furthermore, once a decision tree has been built, classifying a test record is extremely
fast.
D. NAÏVE BAYES
(i) EXPERIMENTAL RESULTS
(ii) NAÏVE BAYES PERFORMANCE METRICS
METHOD             NAIVE BAYES
Accuracy (%)       89.9028
Sensitivity (%)    70.9042
Specificity (%)    85.5353
[Figure: bar chart of DECISION TREE accuracy, sensitivity and specificity]
(iii) SUMMARY
Poisson variables are regarded here as continuous since they are ordinal rather than truly
categorical. For categorical variables, a discrete probability is used with values of the
categorical level being proportional to their conditional frequency in the training data.
IV. RESULT ANALYSIS
The heart disease database consists of 573 records in total. The records are divided
into two datasets: one of 303 records used for training and another of 270 records used for
testing. The data mining tool MATLAB is used for the experiments.
Initially the dataset contained some fields in which values were missing in some
records. These were identified and replaced with the most appropriate values using the
ReplaceMissingValues filter from MATLAB. The ReplaceMissingValues filter scans all records
and replaces missing values using the mean-mode method. This process is known as data
pre-processing. After pre-processing, data mining classification techniques such as
Neural Networks, Decision Trees, and Naive Bayes were applied.
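The mean-mode replacement described above can be sketched in Python as follows. This is an illustrative re-implementation of the idea, not the filter used in the experiments: missing values are assumed to be stored as None, numeric columns are filled with the column mean, and all other columns with the most frequent value. The field names and records are hypothetical.

```python
from collections import Counter
from statistics import mean


def impute_mean_mode(records):
    """Return a copy of the records with None replaced by the column mean
    (numeric columns) or the column mode (all other columns).

    Assumes every column has at least one non-missing value."""
    fill = {}
    for column in records[0]:
        present = [r[column] for r in records if r[column] is not None]
        if all(isinstance(v, (int, float)) for v in present):
            fill[column] = mean(present)
        else:
            fill[column] = Counter(present).most_common(1)[0][0]
    return [{c: (fill[c] if v is None else v) for c, v in r.items()}
            for r in records]


# Hypothetical records with missing values, for illustration only.
records = [
    {"age": 60, "chest_pain": "typical"},
    {"age": None, "chest_pain": "typical"},
    {"age": 50, "chest_pain": None},
    {"age": 40, "chest_pain": "atypical"},
]

cleaned = impute_mean_mode(records)
print(cleaned[1]["age"])         # 50, the mean of 60, 50 and 40
print(cleaned[2]["chest_pain"])  # typical, the most frequent value
```

The original records are left unchanged, so the raw and the pre-processed data can be compared afterwards.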
A confusion matrix is obtained to calculate the accuracy of classification. A confusion
matrix shows how many instances have been assigned to each class. In our experiment we
have two classes, and therefore we have a 2x2 confusion matrix.
Class a = YES (has heart disease)
Class b = NO (no heart disease)
[Figure: bar chart of NAIVE BAYES accuracy, sensitivity and specificity]
V. CONFUSION MATRIX
TP (True Positive): It denotes the number of records classified as true while they were
actually true.
FN (False Negative): It denotes the number of records classified as false while they were
actually true.
FP (False Positive): It denotes the number of records classified as true while they were
actually false.
TN (True Negative): It denotes the number of records classified as false while they were
actually false.
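The four counts defined above can be obtained from the actual and predicted class labels as in the following minimal sketch. The function name and the eight example labels are hypothetical; YES is treated as the positive class (has heart disease), as in the class definitions of this paper.

```python
def confusion_counts(actual, predicted, positive="YES"):
    """Count TP, FN, FP and TN for a two-class problem."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1      # classified as true, actually true
        elif a == positive:
            fn += 1      # classified as false, actually true
        elif p == positive:
            fp += 1      # classified as true, actually false
        else:
            tn += 1      # classified as false, actually false
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}


# Hypothetical labels, for illustration only.
actual    = ["YES", "YES", "YES", "NO", "NO", "NO", "NO", "YES"]
predicted = ["YES", "NO", "YES", "NO", "YES", "NO", "NO", "YES"]

counts = confusion_counts(actual, predicted)
print(counts)  # {'TP': 3, 'FN': 1, 'FP': 1, 'TN': 3}
```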
Confusion matrices obtained for the two classification methods with 13 attributes
CONFUSION MATRIX FOR NAIVE BAYES
CONFUSION MATRIX FOR DECISION TREES
The classification task is to generalize well on unseen/independent data. A classifier is
learned on training/learning data and then tested on data that has not been used for learning
(unseen test data). There exist many measures to assess performance of a classifier and a lot
of techniques to create training and test data in order to estimate generalization ability of a
classifier on test (unseen) data.
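One common way to create such training and test data is k-fold cross validation, which the results tables in this paper refer to. The following minimal sketch only shows how record indices can be divided into folds; the function name, the fixed shuffling seed and the use of 573 records with k = 10 are assumptions made for illustration and do not reproduce the exact partition used in the experiments.

```python
import random


def k_fold_indices(n_records, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds.

    Every record appears in exactly one test fold."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split repeatable
    # Spread the remainder over the first folds so sizes differ by at most one.
    sizes = [n_records // k + (1 if i < n_records % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(573, k=10))
for fold_number, (train, test) in enumerate(folds, start=1):
    print(fold_number, len(train), len(test))
```

A classifier would be trained on `train` and evaluated on `test` in each round, and the reported figure would be the average over the folds.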
Heart disease dataset: UCI Machine Learning Repository.
CHARACTERISTICS OF A DATA SET
Data Set Characteristics     Multivariate
Attribute Characteristics    Real
Associated Tasks             Classification
Number of Instances          573
Number of Attributes         13
CLASS INFORMATION:
The PHR patterns are classified into the following categories:
Category I (Normal)
Category II (Disease)
VI. PERFORMANCE EVALUATION
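The following measures are used to calculate the performance:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{Sensitivity} = \frac{TP}{TP + FN}$$
$$\text{Specificity} = \frac{TN}{TN + FP}$$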
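These three measures can be computed from the confusion matrix counts as in the minimal sketch below, using the standard definitions and expressing the results as percentages. The function name and the example counts are hypothetical and do not correspond to the confusion matrices of this study.

```python
def performance_metrics(tp, fn, fp, tn):
    """Return accuracy, sensitivity and specificity as percentages."""
    total = tp + fn + fp + tn
    return {
        "accuracy": 100.0 * (tp + tn) / total,     # all correct / all records
        "sensitivity": 100.0 * tp / (tp + fn),     # correct among actual positives
        "specificity": 100.0 * tn / (tn + fp),     # correct among actual negatives
    }


# Hypothetical counts, for illustration only.
metrics = performance_metrics(tp=90, fn=10, fp=5, tn=95)
print(metrics)  # accuracy 92.5, sensitivity 90.0, specificity 95.0
```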
PERFORMANCE METRICS OF DT AND NB
METHOD             DECISION TREE    NAIVE BAYES
Accuracy (%)       98.2753          89.9028
Sensitivity (%)    95.452           70.9042
Specificity (%)    97.7919          85.5353
[Figure: ACCURACY, bar chart comparing DECISION TREE (98.2753) and NAIVE BAYES (89.9028)]
[Figure: SENSITIVITY, bar chart comparing DECISION TREE and NAIVE BAYES]
SUMMARY OF THE CLASSIFICATION ACCURACY: DECISION TREE CLASSIFIER - CROSS VALIDATION
                   NORMAL     DISEASE
Accuracy (%)       97.6482    99.3885
Sensitivity (%)    98.3686    93.7500
Specificity (%)    95.1168    99.8974
SUMMARY OF THE CLASSIFICATION ACCURACY: NAIVE BAYES CLASSIFIER - CROSS VALIDATION
                   NORMAL     DISEASE
Accuracy (%)       87.3001    93.3208
Sensitivity (%)    93.7160    82.3864
Specificity (%)    64.7558    94.3077
[Figure: SPECIFICITY, bar chart comparing DECISION TREE (97.7919) and NAIVE BAYES (85.5353)]
[Figure: DECISION TREE CLASSIFICATION USING CROSS VALIDATION, Percentage of Accuracy against Diagnosis Result for NORMAL and DISEASE]
PERFORMANCE ANALYSIS FOR TWO CLASSIFIERS - 10 FOLD CROSS VALIDATION
METHOD             DECISION TREE    NAIVE BAYES
Accuracy (%)       98.2753          89.9028
Sensitivity (%)    95.4520          70.9042
Specificity (%)    97.7919          85.5353
VII. CONCLUSION
The overall objective of our work is to predict the presence of heart disease more
accurately. In this paper, a UCI repository dataset is used. Two data mining classification
techniques were applied, namely decision trees and Naive Bayes. The results show that
decision trees provide more accurate results than Naive Bayes. This system can be further
expanded, for example by using a larger number of inputs. Other data mining techniques, e.g.
clustering, time series and association rules, can also be used for prediction. Text mining
can be used to mine the huge amount of unstructured data available in healthcare industry
databases.
[Figure: NAIVE BAYES CLASSIFICATION USING CROSS VALIDATION, Percentage of Accuracy against Diagnosis Result (Accuracy, Sensitivity, Specificity) for NORMAL and DISEASE]
[Figure: PERFORMANCE METRICS, COMPARISON OF SENSITIVITY FOR TWO CLASSIFIERS, Percentage against Performance for DECISION TREE and NAIVE BAYES]
REFERENCES
[1] Frawley and G. Piatetsky-Shapiro, Knowledge Discovery in Databases: An Overview. Published
by the AAAI Press/The MIT Press, Menlo Park, CA, 1996.
[2] Yanwei, X.; Wang, J.; Zhao, Z.; Gao, Y., “Combination data mining models with new medical data to
predict outcome of coronary heart disease”, Proceedings International Conference on Convergence
Information Technology 2007, pp. 868-872.
[3] Sellappan Palaniappan, Rafiah Awang, “Intelligent Heart Disease Prediction System Using Data Mining
Techniques”, IJCSNS International Journal of Computer Science and Network Security, Vol. 8, No. 8,
August 2008.
[4] Niti Guru, Anil Dahiya, Navin Rajpal, “Decision Support System for Heart Disease Diagnosis Using
Neural Network”, Delhi Business Review, Vol. 8, No. 1 (January - June 2007).
[5] Heon Gyu Lee, Ki Yong Noh, Keun Ho Ryu, “Mining Biosignal Data: Coronary Artery Disease
Diagnosis using Linear and Nonlinear Features of HRV”, LNAI 4819: Emerging Technologies in
Knowledge Discovery and Data Mining, pp. 56-66, May 2007.
[6] Shantakumar B. Patil, Y. S. Kumaraswamy, “Intelligent and Effective Heart Attack Prediction System
Using Data Mining and Artificial Neural Network”, ISSN 1450-216X, Vol. 31, No. 4 (2009), pp. 642-656.
[7] Carlos Ordonez, “Improving Heart Disease Prediction Using Constrained Association Rules”,
Seminar Presentation at University of Tokyo, 2004.
[8] Kiyong Noh, Heon Gyu Lee, Ho-Sun Shon, Bum Ju Lee, and Keun Ho Ryu, “Associative Classification
Approach for Diagnosing Cardiovascular Disease”, Springer, Vol. 345, pp. 721-727, 2006.
[9] Franck Le Duff, Cristian Muntean, Marc Cuggia, Philippe Mabo, “Predicting Survival Causes After
Out of Hospital Cardiac Arrest using Data Mining Method”, Studies in Health Technology and
Informatics, Vol. 107, No. Pt 2, pp. 1256-9, 2004.
[10] Latha Parthiban and R. Subramanian, “Intelligent Heart Disease Prediction System using CANFIS and
Genetic Algorithm”, International Journal of Biological, Biomedical and Medical Sciences, Vol. 3,
No. 3, 2008.
[11] Niranjana Krupa, Mohd Ali MA, Edmond Zahedi, Shuhaila Ahmed and Fauziah M Hassan, “Antepartum
patient heart rate feature extraction and classification using empirical mode decomposition
and support vector machine”, January 2011.
[12] Nidhi Singh, Divakar Singh, “Performance Evaluation of K-Means and Heirarichal Clustering in Terms
of Accuracy and Running Time”, Department of Computer Science & Engg., BUIT, BU, Bhopal, India
(M.P.), 2012.
[13] Ersen Yılmaz and Çağlar Kılıkçıer, “Determination of Patient State from Cardiotocogram Using
LS-SVM with Particle Swarm Optimization and Binary Decision Tree”, Electrical-Electronic
Engineering Department, Uludag University, 16059 Gorukle, Bursa, Turkey. Received 26 June 2013;
Accepted 6 September 2013.
Article
Full-text available
Objective: Many modifiable risk factors affect the onset of coronary artery disease (CAD), a condition that is extremely common throughout the globe. Predictive models created using machine learning (ML) algorithms may help physicians identify CAD earlier and may lead to better results. The goal of this project was to use ML algorithms to predict CAD in patients. Methods: The gathered dataset of UCI heart disease was used in this study to evaluate a variety of machine learning methods to predict CAD. Just the most crucial aspects of the hypothesis testing method were kept. Support vector machines (SVM) were used in a comparative analysis employing a variety of assessment measures. Results: All machine learning methods achieved accuracy levels of at least 80%, with the SVM algorithm obtaining accuracy levels of at least 90%. Predictive ML models had high diagnostic relevance in CAD, as seen by the SVM model's high recall (0.9), which is was the highest of all the models. Conclusion: The findings of the current study demonstrated that, independent of the measures used to evaluate machine learning models, feature selection has a significant impact on performance. Finding the most useful features is thus crucial. SVM was chosen as the top model based on the features we considered.
... where S represents the dataset, C represents the classes in set S and p(C) is the proportion of data points that belong to class C to the number of total data points in set S [36]. ...
Article
Full-text available
As described in Chapter two, this thesis focuses on creating a manufacturing testbed in the Future Factories (FF) lab at the University of South Carolina and utilizing Semantic Web technologies on this testbed to realize an autonomous manufacturing use case. This section will narrow down the use case adopted in this thesis. Following that, various aspects of the developed testbed are introduced that showcase its capabilities and how they fit in with the overall use case. Finally, this section will cover the implementation plan of the application developed in this thesis
... Data is split up in a branch-like structure called a decision tree. The root node is divided into sub-branches or further branches according to rules and each attribute's maximal acquisition of information [17] [26]. ...
Article
Full-text available
Emotion is an interdisciplinary research field investigated by many research areas such as psychology, philosophy, computing, and others. Emotions influence how we make decisions, plan, reason, and deal with various aspects. Automated human emotion recognition (AHER) is a critical research topic in Computer Science. It can be applied in many applications such as marketing, human–robot interaction, electronic games, E-learning, and many more. It is essential for any application requiring to know the emotional state of the person and act accordingly. The automated methods for recognizing emotions use many modalities such as facial expressions, written text, speech, and various biosignals such as the electroencephalograph, blood volume pulse, electrocardiogram, and others to recognize emotions. The signals can be used individually(uni-modal) or as a combination of more than one modality (multi-modal). Most of the work presented is in laboratory experiments and personalized models. Recent research is concerned about in the wild experiments and creating generic models. This study presents a comprehensive review and an evaluation of the state-of-the-art methods for AHER employing machine learning from a computer science perspective and directions for future research work.
Article
Full-text available
p>The limitations of medical personnel, especially heart disease, cause difficulties in diagnosing heart disorders, so diagnosing heart disorders is not easy, it takes the ability and experience of a cardiologist who has the expertise and experience to be able to accurately diagnose heart disorders. Several studies in the field of computing have been carried out in diagnosing cardiac abnormalities in patients. This study was conducted to accurately test the results of the classification of heart disorders using electrocardiogram medical record data with a C.45 decision tree approach. The results showed that the classification of heart defects obtained a mean squared error (MSE) value of 0.24, a root mean squared error (RMSE) value of 0.49, and an accuracy value of 75.33% with the C4.5 algorithm.</p
Article
Full-text available
Data mining is defined as analyzing very large amount of data for getting some useful information. Data mining techniques like association rule mining, classification and clustering is implemented to analyze the different types of disease. Classification is an important problem in Data mining. Given a database contains a collection of records, each with a single class label, a classifier performs a brief and clear definition for each class that can be used to classify successive records. Data mining plays an important role in medical systems. It is used to discover the knowledge out of data and presenting it in the form that human can easily understand. It is a cooperative effort of humans and computers. There are two primary goals of data mining - Prediction and Description. Prediction involves some variables or fields in the data set to predict unknown or future values of other variables of interest. Description focuses on finding patterns describing the data that can be interpreted by humans. It is very useful for predicting diseases such as Heart disease, Lung disease. Lung cancer is one of the most dangerous diseases in the world. The early detection of lung cancer can cure the disease completely. Data mining plays an effective role by using Naïve Bayes and Artificial Neural Network to massive volume of healthcare of data. The health care industry collects huge amounts of data which unfortunately are not mined to find the hidden data. The Naïve Bayes aims at delivering robust classifications also when dealing with small or incomplete data sets. The aim of the paper is to detect and diagnose the lung diesases as early as possible which will help the doctor to save the patient’s life. This paper describes how lung cancer was predicted and controlled, using data mining techniques. © 2017, Institute of Advanced Scientific Research, Inc. All rights reserved.
Article
Full-text available
Heart disease (HD) is a major cause of morbidity and mortality in the modern society. Medical diagnosis is an important but complicated task that should be performed accurately and efficiently and its automation would be very useful. All doctors are unfortunately not equally skilled in every sub specialty and they are in many places a scarce resource. A system for automated medical diagnosis would enhance medical care and reduce costs. In this paper, a new approach based on coactive neuro-fuzzy inference system (CANFIS) was presented for prediction of heart disease. The proposed CANFIS model combined the neural network adaptive capabilities and the fuzzy logic qualitative approach which is then integrated with genetic algorithm to diagnose the presence of the disease. The performances of the CANFIS model were evaluated in terms of training performances and classification accuracies and the results showed that the proposed CANFIS model has great potential in predicting the heart disease. Keywords—CANFIS, Genetic Algorithms (GA), Heart disease, Membership Function (MF).
Article
The diagnosis of diseases is a vital and intricate job in medicine. The recognition of heart disease from diverse features or signs is a multi-layered problem that is not free from false assumptions and is frequently accompanied by impulsive effects. Thus the attempt to exploit knowledge and experience of several specialists and clinical screening data of patients composed in databases to assist the diagnosis procedure is regarded as a valuable option. This research work is the extension of our previous research with intelligent and effective heart attack prediction system using neural network. A proficient methodology for the extraction of significant patterns from the heart disease warehouses for heart attack prediction has been presented. Initially, the data warehouse is pre-processed in order to make it suitable for the mining process. Once the preprocessing gets over, the heart disease warehouse is clustered with the aid of the K-means clustering algorithm, which will extract the data appropriate to heart attack from the warehouse. Consequently the frequent patterns applicable to heart disease are mined with the aid of the MAFIA algorithm from the data extracted. In addition, the patterns vital to heart attack prediction are selected on basis of the computed significant weightage. The neural network is trained with the selected significant patterns for the effective prediction of heart attack. We have employed the Multi-layer Perceptron Neural Network with Back-propagation as the training algorithm. The results thus obtained have illustrated that the designed prediction system is capable of predicting the heart attack effectively.
Article
The prediction of survival of Coronary Heart Disease (CHD) has been a challenging research problem for medical society. The goal of this paper is to develop data mining algorithms for predicting survival of CHD patients based on 1000 cases .We carry out a clinical observation and a 6-month follow up to include 1000 CHD cases. The survival information of each case is obtained via follow up. Based on the data, we employed three popular data mining algorithms to develop the prediction models using the 502 cases. We also used 10-fold cross-validation methods to measure the unbiased estimate of the three prediction models for performance comparison purposes. The results indicated that the SVM is the best predictor with 92.1 accuracy on the holdout sample artificial neural networks came out to be the second with 91.0 accuracy and the decision tress models came out to be the worst of the three with 89.6% accuracy. The comparative study of multiple prediction models for survival of CHD patients along with a 10-fold cross- validation provided us with an insight into the relative prediction ability of different data.
Chapter
ECG is a test that measures a heart’s electrical activity, which provides valuable clinical information about the heart’s status. In this paper, we propose a classification method for extracting multi-parametric features by analyzing HRV from ECG, data preprocessing and heart disease pattern. The proposed method is an associative classifier based on the efficient FP-growth method. Since the volume of patterns produced can be large, we offer a rule cohesion measure that allows a strong push of pruning patterns in the pattern-generating process. We conduct an experiment for the associative classifier, which utilizes multiple rules and pruning, and biased confidence (or cohesion measure) and dataset consisting of 670 participants distributed into two groups, namely normal people and patients with coronary artery disease.
Conference Paper
The main purpose of our study is to propose a novel methodology to develop the multi-parametric feature including linear and nonlinear features of HRV (Heart Rate Variability) diagnosing cardiovascular disease. To develop the multi-parametric feature of HRV, we used the statistical and classification techniques. This study analyzes the linear and the non-linear properties of HRV for three recumbent positions, namely the supine, left lateral and right lateral position. Interaction effect between recumbent positions and groups (normal and patients) was observed based on the HRV indices and the extracted HRV indices used to classify the CAD (Coronary Artery Disease) group from the normal people. We have carried out various experiments on linear and non-linear features of HRV indices to evaluate several classifiers, e.g., Bayesian classifiers, CMAR, C4.5 and SVM. In our experiments, SVM outperformed the other classifiers.
Article
this article. 0738-4602/92/$4.00 1992 AAAI 58 AI MAGAZINE for the 1990s (Silberschatz, Stonebraker, and Ullman 1990)