Anemia types prediction based on data mining classification algorithms

Communication, Management and Information Technology – Sampaio de Alencar (Ed.)
© 2017 Taylor & Francis Group, London, ISBN 978-1-138-02972-9
Manal Abdullah
Department of Computer Science, Faculty of Computing and Information Technology,
King Abdul-Aziz University, Jeddah, Saudi Arabia
Salma Al-Asmari
Department of Computer Science, Faculty of Computing, King Khalid University, Abha, Saudi Arabia
ABSTRACT: Medical data mining is concerned with predictive knowledge as a means of extracting desired outcomes from data for specific purposes. Anemia is one of the most common hematological diseases, and this study concentrates on its five most common types. The paper determines the anemia type of anemic patients through a predictive model built with several data mining classification algorithms. The dataset is constructed from real Complete Blood Count (CBC) test results of patients. The data are filtered to eliminate undesirable variables and then used with classification algorithms, namely Naïve Bayes, Multilayer Perceptron, J48 and SMO, in the WEKA data mining tool. Several experiments show that the J48 decision tree algorithm gives the best classification of anemia types. The WEKA Experimenter shows that the J48 decision tree algorithm has the best performance in terms of accuracy, precision, recall, true positive rate, false positive rate and F-measure.
Keywords: Anemia, Medical Data mining, classification algorithms, naïve Bayes, J48 decision tree, Support vector machine, SMO
1 INTRODUCTION
Data mining sorts through data to identify patterns and find relationships within the data. Its techniques are appropriate for simple or structured datasets such as relational and transactional databases. Different data mining approaches have been proposed to address the challenges of storing and processing all types of data (Kaur et al., 2015; Kishore et al., 2015).
Data mining has three basic mechanisms: Clustering (Classification), Decision Rules and Analysis. Classification analyzes a set of data and produces a set of decision rules, which are used to classify the data. In artificial intelligence, machine learning and database systems, the data mining process starts by extracting information from a dataset and then converting it into a meaningful structure. This means that it determines the patterns in datasets and the methods that capture them. Data mining comprises many classes of task, the most common of which is classification, used to predict relationships between data. In healthcare, it is important to invest in developments in computer technology, such as data mining classification algorithms and tools, to improve the processing of medical data. This paper uses the WEKA tool for data mining (Shouval et al., 2014). The name WEKA is derived from Waikato Environment for Knowledge Analysis. It is an open-source data mining tool that provides an efficient framework for implementing several classification algorithms. The tool supports preprocessing datasets and filtering out irrelevant (not useful) data, and the dataset can be divided into test and training sets. It supports running classification algorithms and transforming the dataset into a form appropriate for machine learning. WEKA can also load different file formats such as ARFF, CSV and C4.5, as well as different databases (Garner, 1995).
There is growing research interest in using data mining in the medical domain. This new approach, called medical data mining, is concerned with developing systems that discover and predict knowledge from data generated in medical environments. Data in the medical domain, specifically hospital databases, are huge in amount, complex in content, heterogeneous in type, hierarchical and varying in quality. In recent years, laboratory information has kept improving and developing. Specific patterns in this information can be predicted using data mining methodologies, which supports research and the evaluation of reports. Data mining classification depends on similarities existing in the data. Classification algorithms are used to show that the results are acceptable to doctors or end users. Medical data mining uses many algorithms, such as decision trees, neural networks, naïve Bayes and others.
This paper identifies the set of attributes in a patient's CBC test result that determine the anemia type, and aims to improve the quality of prediction for anemic patients so that doctors can be supported promptly in their work. It investigates the accuracy of several classification algorithms in predicting some anemia types, using the WEKA tool for classification, decision rules and analysis of the results. The evaluation takes a set of classified data as the training set and uses it to train the algorithms; the test data are then classified based on the decision rules extracted from the training set in order to predict anemia diseases. The WEKA Experimenter is used to determine which classification algorithm gives the best performance in terms of accuracy, precision, recall, true positive rate, false positive rate and F-measure. The main objectives of this work are to use predictive attributes to produce the data and to apply data mining algorithms to obtain the best prediction of anemia types from patients' Complete Blood Count (CBC) results.
2 RELATED WORKS
Many works have used data mining algorithms to classify various diseases, including specific types of anemia (Elshami & Alhalees, 2012), and many other researchers have tried to find their own methods. A person with anemia may be unaware of the problem because symptoms may not appear; millions of people may have anemia, with their health exposed to risk. Because the disease is significant, several studies in this domain are reported in the literature (Yilmaz et al., 2013).
Sanap et al. (2011) developed a system using two classification techniques in WEKA, the C4.5 decision tree algorithm and the SMO support vector machine, and carried out a number of experiments with them. Anemia classification with the decision tree gave clear results based on CBC reports. Amin et al. (2015) compared the naïve Bayes, J48 and neural network classification algorithms in WEKA on hematological data to determine the most appropriate algorithm. The proposed model can predict hematological data, and the results showed that J48 was the best algorithm, with high accuracy, and that naïve Bayes had the lowest average error. The studies of Sanap et al. (2011) and Amin et al. (2015) showed that the C4.5 algorithm (J48 in WEKA) gives higher accuracy than the other classifiers.
Dogan & Turkoglu (2008) designed a decision tree system based on biochemical blood parameters to help physicians diagnose anemia. The system used hematology characteristics to classify results as positive or negative for anemia, and its results agreed with physicians' decisions. Siadaty & Knaus (2006) selected decision trees as a common and simple classifier with low computational complexity. The problem was that the time needed to build a decision tree for a large dataset becomes intractable. They solved it by developing a parallel model of the ID3 algorithm, a thread-level parallel decision tree that performs the computations independently; the experiment was done on an anemic patients' dataset. Kishore et al. (2015) presented a set of basic classification algorithms, grouping the essential types of classification methods such as decision trees, Bayesian networks, k-nearest neighbor and support vector machine classifiers. The study gives a comprehensive review of diverse classification algorithms in data mining.
This research investigates five types of anemia using the naïve Bayes, multilayer perceptron, J48 decision tree and support vector machine data mining algorithms on CBC data. Which classification algorithm is best depends on the specific problem domain (Kesavaraj & Sukumaran, 2013).
3 ANEMIA CLASSIFICATION
3.1 What is anemia?
Anemia is a medical condition in which the hemoglobin or red cell concentration in human blood is reduced. A Complete Blood Count (CBC) test is conducted for patients in the laboratory. Anemia types are identified from this information: age, gender, hemoglobin, hematocrit and other attribute values that fall below the normal range (Green, 2012). The classification of anemia types according to CBC test values is illustrated in Fig. 1 (Sanap et al., 2011).
3.2 The anemia classification
Anemia is categorized into different types based on the CBC test values. In this model, the nomenclature of anemia types is given in Table 1. Anemia is classified by MCV (mean corpuscular volume) into three essential kinds: microcytic (MCV < 80 fL), normocytic (MCV = 80–100 fL) and macrocytic (MCV > 100 fL). It is classified by MCHC (mean corpuscular hemoglobin concentration) into normochromic (MCHC = 32–36 g/dl) and hypochromic (MCHC < 32 g/dl) anemia. RDW (red cell distribution width) is also used to assess anemia; it is high if RDW > 14.6 and normal if RDW = 11.6–14.6 (Green, 2012; Sanap et al., 2011).
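The cut-offs above can be written as three small lookup functions. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, the handling of boundary values (80, 100, 32, 36, 11.6 and 14.6 treated as inclusive) is an assumption, and values outside the ranges given in the text return None rather than a guessed category.

```python
from typing import Optional


def mcv_category(mcv_fl: float) -> str:
    """Cell-size class from MCV in femtoliters (boundaries assumed inclusive)."""
    if mcv_fl < 80:
        return "microcytic"
    if mcv_fl <= 100:
        return "normocytic"
    return "macrocytic"


def mchc_category(mchc_g_dl: float) -> Optional[str]:
    """Color class from MCHC in g/dl."""
    if mchc_g_dl < 32:
        return "hypochromic"
    if mchc_g_dl <= 36:
        return "normochromic"
    return None  # above 36 g/dl is not covered by the text


def rdw_category(rdw: float) -> Optional[str]:
    """RDW level; values below 11.6 are not covered by the text."""
    if rdw > 14.6:
        return "high"
    if rdw >= 11.6:
        return "normal"
    return None
```

Returning None for uncovered ranges keeps the sketch from inventing categories that the paper does not define.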
4 THE PROPOSED METHOD
4.1 Experimental setup
To classify anemia types, a number of attributes are considered for predicting the type of anemia of an anemic patient. These influencing attributes are treated as inputs. The data are taken from Complete Blood Count (CBC) test results, obtained by collecting blood samples from 41 anemic patients (41 instances), and form the ANEMIA dataset. The dataset consists of 7 attributes, defined in Table 2 along with their values.
The data are then transformed into a standard file format, CSV, which is supported by the WEKA tool, to construct the ANEMIA dataset, and irrelevant data are filtered out using specific techniques. The CBC data contain 34 irrelevant attributes, which are removed. The relevant attributes are shown in Table 2. The attributes vary between nominal and numeric values, and each has its own defined categories.
The classification algorithms are applied to predict and classify the five most common anemia types based on the rules shown in Table 3. The analysis for identifying anemia types is conducted using the WEKA tool (Siadaty et al., 2006; Sanap et al., 2011; Shashidhara, 2012).
The implementation of the proposed method starts by collecting CBC results and building our own dataset. The data are then preprocessed to extract and filter the attributes of importance. The data are converted to CSV format so that they can be used by the WEKA classifier software; CSV is selected because it allows data to be saved in a table-structured (spreadsheet) format. After classification, the generated results are evaluated using the WEKA Experimenter and the Knowledge Flow model.
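The conversion step could look roughly like the sketch below, which serializes records with the seven attributes of Table 2 to CSV text. The column order, the extra "Type" column for the class label, the function name and the single example row are all assumptions made for illustration; none of them is taken from the actual ANEMIA dataset.

```python
import csv
import io

# Seven attributes from Table 2 plus an assumed class-label column.
COLUMNS = ["Age", "Gender", "MCV", "HCT", "HGB", "MCHC", "RDW", "Type"]


def to_csv(rows):
    """Serialize a list of attribute dictionaries to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical record, invented only to show the layout.
example_rows = [
    {"Age": "Adult", "Gender": "F", "MCV": "Microcytic", "HCT": "Low",
     "HGB": "Moderate", "MCHC": "hypochromic", "RDW": "High", "Type": "ACD"},
]
csv_text = to_csv(example_rows)
```

A header row naming each attribute is what lets a spreadsheet-style loader treat every column as one attribute.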
4.2 The proposed algorithms for classification
In this method, various data mining algorithms are used to predict the anemia type of patients. The dataset is tested and then analyzed with four candidate classification algorithms: Naïve Bayes, neural network (multilayer perceptron), decision tree (J48) and support vector machine (SMO).
The Naïve Bayes algorithm applies the principle of conditional probability: it computes a probability from the frequencies of values and combinations of values in the data, determining the probability of an event given that another event has already happened. The Naïve Bayes algorithm can use kernel density estimators, which improve performance when the normality assumption is clearly incorrect; it can also deal with numeric attributes using supervised discretization (Vijayarani & Muthulakshmi, 2013).
The second algorithm is a neural network, named multilayer perceptron in WEKA. It is a feed-forward, multilayer neural network model that maps a set of input data onto a set of suitable outputs. Except for the input nodes, each node is a neuron with a nonlinear activation function. The multilayer perceptron consists of one or more hidden layers of nodes (hidden neurons) arranged in a directed graph, with each layer fully connected to the next layer (Prakash et al., 2015).
The J48 decision tree algorithm is also used for automatic processing and can choose the relevant aspects from the training data. It can discard unhelpful alternatives to keep the process effective, especially when dealing with continuous attributes: it splits the values by thresholding, separating values above the threshold from values less than or equal to it. The J48 algorithm is also capable of dealing with training data in which some attribute values are missing (Ahmad et al., 2011).
The support vector machine, named SMO in WEKA, is a supervised learning method that analyzes data and recognizes patterns. It is a non-probabilistic classifier that takes a set of input data, assigns each input to one of two possible classes and gives the output. The SVM algorithm has the same functional form as neural networks and radial basis functions (Kesavaraj & Sukumaran, 2013). It is generally applied to two-class classification problems, where it finds the plane that gives the greatest separation between the two classes. The SVM algorithm finds the optimal plane, the one with maximum distance to the nearest points of the two classes. The instances closest to the optimal plane are the support vectors, and they define the margins of each class (Shouval et al., 2014). The proposed methodology is illustrated as a flowchart in Fig. 2.
Figure 1. Anemia types classification.
Table 1. ANEMIA types nomenclature.
ACD   Anemia of chronic disease
IDA   Iron deficiency anemia
ARD   Anemia of renal disease
THAL  Thalassemia
APA   Aplastic anemia
Table 2. Attributes of ANEMIA dataset.
Attribute  Attribute value  Attribute category
Age        0–12             Child
           >12              Adult
Gender     Female           F
           Male             M
MCV        <80              Microcytic
           80–100           Normocytic
           >100             Macrocytic
HCT        <37              Low
           37.0–50.0        Normal
HGB        <10              Severe
           10–12            Moderate
MCHC       <32              hypochromic
           32–36            normochromic
RDW        >14.6            High
           11.6–14.6        Normal
Figure 2. Flowchart of proposed method.
5 RESULTS AND DISCUSSION
The evaluation uses the 41 instances in the dataset with Naïve Bayes, the neural network in WEKA (multilayer perceptron), the J48 decision tree and the support vector machine in WEKA (SMO), under the percentage split test option with several splits (20%, 40%, 60%) of the dataset. Table 4 gives the results of evaluating the ANEMIA dataset in WEKA for the 20%, 40% and 60% percentage splits. The table reports accuracy (correctly classified instances), mean absolute error, weighted average ROC and F-measure. Fig. 3 shows the SMO algorithm results using the 60% training set.
Table 3. Anemia classification rules.
The rule                                                                              Decision*
IF (MCV = microcytic AND HGB = 10–12) then                                            ACD, moderate
Else if (MCV = microcytic AND HGB = <10) then                                         ACD, severe
Else if (MCV = normocytic AND MCHC <32 AND RDW = 11.6–14.6 AND HGB = 10–12) then      THAL, moderate
Else if (MCV = normocytic AND MCHC <32 AND RDW = 11.6–14.6 AND HGB = <10) then        THAL, severe
Else if (MCV = normocytic AND MCHC <32 AND RDW = 11.6–14.6 AND HGB = 10–12) then      IDA, moderate
Else if (MCV = normocytic AND MCHC <32 AND RDW = 11.6–14.6 AND HGB = <10) then        IDA, severe
Else if (MCV = normocytic AND MCHC = 32–36 AND HGB = 10–12) then                      ARD, moderate
Else if (MCV = normocytic AND MCHC = 32–36 AND HGB = <10) then                        ARD, severe
Else if (MCV = macrocytic AND HGB = 10–12) then                                       APA, moderate
Else if (MCV = macrocytic AND HGB = <10) then                                         APA, severe
*The decision includes (Anemia type and severity grade).
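The rule chain in Table 3 can be sketched as a first-match function over numeric CBC values. The function name and the inclusive boundary handling are assumptions. As printed, the IDA conditions repeat the THAL conditions, so in a first-match chain the IDA branch is never reached; the sketch reproduces the table as printed rather than guessing at an intended distinction, and returns None when no rule applies.

```python
def classify_anemia(mcv, mchc, rdw, hgb):
    """Return (anemia type, severity grade) following Table 3, or None if no rule matches."""
    # Severity grade from hemoglobin: <10 severe, 10-12 moderate.
    if hgb < 10:
        grade = "severe"
    elif hgb <= 12:
        grade = "moderate"
    else:
        return None  # HGB above 12 is not covered by the rules

    if mcv < 80:  # microcytic
        kind = "ACD"
    elif mcv <= 100:  # normocytic
        if mchc < 32 and 11.6 <= rdw <= 14.6:
            # The IDA rules in the table share this exact condition,
            # so the earlier THAL rules always match first.
            kind = "THAL"
        elif 32 <= mchc <= 36:
            kind = "ARD"
        else:
            return None  # no rule for this combination
    else:  # macrocytic
        kind = "APA"
    return kind, grade
```

For example, classify_anemia(75, 30, 13, 11) follows the first rule of the table and yields ("ACD", "moderate").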
The percentage split test assigns a specific percentage of the data to training and the rest to testing. In this experiment the percentage splits are 20%, 40% and 60%, and the partitions are made randomly. With a 20% split, 20% of the data is used as the training set and the remaining 80% as the test set. The same process is followed for the 40% and 60% splits.
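As a worked check of the split arithmetic for the 41 instances, the sketch below computes the training and test set sizes for each percentage. It assumes the training-set size is the product of the instance count and the percentage rounded to the nearest integer; the text does not state how fractional sizes are handled, and the function name is hypothetical.

```python
def split_sizes(n_instances, train_percent):
    """Return (training size, test size), rounding the training share to the nearest integer."""
    n_train = round(n_instances * train_percent / 100)
    return n_train, n_instances - n_train


# Sizes for the three splits used in the experiment.
sizes = {percent: split_sizes(41, percent) for percent in (20, 40, 60)}
```

Under this assumption the 60% split leaves 16 test instances, so 15 correct classifications correspond to 15/16 = 93.75%, which agrees with the best accuracy reported for that split.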
The accuracy (correctly classified instances) increased with the split percentage for naïve Bayes, J48, multilayer perceptron and SMO; that is, accuracy increased with the size of the training set. The results allow the accuracy of all the algorithms to be compared, and they show that the J48 decision tree and SMO algorithms give the best results, with an accuracy of 93.75% at the 60% percentage split. The accuracy of all the algorithms using the 60% training set is illustrated in Fig. 4.
Table 4. Simulation result of algorithms using 20%, 40%, 60% training set data.
Algorithm              Training set  Accuracy* (%)  Mean absolute error%  Weighted av. ROC  F-Measure
Naïve Bayes            20%           30.303         0.458                 0.507             0.257
                       40%           60             0.3372                0.708             0.587
                       60%           68.75          0.2645                0.825             0.68
Multilayer Perceptron  20%           39.3939        0.3744                0.775             0.383
                       40%           72             0.2198                0.852             0.716
                       60%           87.5           0.1372                0.921             0.859
J48 Decision tree      20%           27.2727        0.3207                0.855             0.218
                       40%           88             0.1689                0.868             0.878
                       60%           93.75          0.1743                0.97              0.935
SMO                    20%           39.3939        0.4108                0.677             0.396
                       40%           84             0.2578                0.902             0.83
                       60%           93.75          0.2361                0.96              0.912
*Correctly Classified Instances.
Figure 3. Support vector machine (SMO) algorithm output using 60% training set data.
Figure 4. Comparing algorithms accuracy using the percentage split 60%.
The results in Table 5 show the performance of naïve Bayes, neural network (multilayer perceptron), J48 decision tree and SMO using the WEKA Experimenter. The measures in the table give a more useful and precise evaluation of each algorithm's performance on the dataset: recall (sensitivity), precision, F-measure, true positive rate and false positive rate, which are computed as follows:
Recall (sensitivity) = TP / (TP + FN).
Precision = TP / (TP + FP).
F-measure = (2 × recall × precision) / (recall + precision).
Here TP (true positives) is the number of positive instances classified correctly, FN (false negatives) is the number of positive instances (records) classified as negative, and FP (false positives) is the number of negative instances classified as positive (Huang et al., 2012).
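The three measures can be computed directly from the counts of true positives, false positives and false negatives. The sketch below is illustrative only: the function names are hypothetical, the zero-denominator guard is an added assumption, and the counts in the example are invented and do not come from the experiments.

```python
def recall(tp, fn):
    """Share of actual positives that were classified as positive."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    """Share of predicted positives that are actually positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts, chosen only to exercise the formulas.
tp, fp, fn = 14, 1, 2
p = precision(tp, fp)
r = recall(tp, fn)
f = f_measure(p, r)
```

With these invented counts, recall is 14/16 and precision is 14/15; the F-measure lies between the two, closer to the smaller value.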
Snapshots of the WEKA Experimenter comparison are shown in Fig. 5 for the F-measure, Fig. 6 for precision and Fig. 7 for the TP rate. These experiments show that the J48 decision tree performs best among the four algorithms, with an F-measure of 93%, sensitivity of 93%, true positive rate of 93% and precision of 97%, and the lowest false positive rate, 0.05.
Table 5. Comparison of classification algorithms.
Algorithm              TP Rate  FP Rate  Precision  F-Measure  Recall
Naïve Bayes            0.92     0.10     0.93       0.91       0.92
Multilayer Perceptron  0.92     0.10     0.95       0.91       0.92
J48 Decision tree      0.93     0.05     0.97       0.93       0.93
SMO                    0.90     0.40     0.85       0.84       0.90
Figure 5. Comparing algorithms with the WEKA experimenter using F-measure.
Figure 6. Comparing algorithms with the WEKA experimenter using precision.
Figure 7. Comparing algorithms with the WEKA experimenter using true positive rate.
Figure 8. Knowledge flow model using WEKA.
The comparative performance of the four algorithms in terms of accuracy was also examined with the knowledge flow model shown in Fig. 8, which shows the membership tree structure using a 10-fold validation test.
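A 10-fold validation test partitions the instances into ten folds and uses each fold once for testing while the other nine are used for training. The sketch below shows one way 41 instances could be partitioned; the contiguous, unshuffled and unstratified folds and the function names are assumptions for illustration, and a data mining tool may shuffle or stratify the data first.

```python
def k_fold_indices(n_instances, k):
    """Partition indices 0..n_instances-1 into k contiguous folds of near-equal size."""
    base, extra = divmod(n_instances, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # the first `extra` folds get one more
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def train_test_for_fold(folds, test_fold):
    """Use one fold as the test set and the remaining folds as the training set."""
    test = folds[test_fold]
    train = [i for j, fold in enumerate(folds) if j != test_fold for i in fold]
    return train, test


folds = k_fold_indices(41, 10)
train, test = train_test_for_fold(folds, 0)
```

With 41 instances, one fold holds five instances and the other nine hold four, so each round trains on 36 or 37 instances.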
The performance chart of the knowledge flow model was produced for the algorithms in the experiment: Naïve Bayes, Multilayer Perceptron, J48 and SMO. It is another important performance measure in WEKA. The performance is represented by the Receiver Operating Characteristic (ROC) curve for each algorithm, based on a 10-fold validation test. Fig. 9 clearly shows that the J48 decision tree has the highest weighted average ROC, 0.97.
Figure 9. Performance chart of (ROC) curve.
6 CONCLUSION AND FUTURE WORK
This paper used several classification algorithms to obtain the best prediction of anemia types based on a dataset of 41 patients. The proposed model is designed around the five most common anemia types, and classifies and analyzes the anemia type for a dataset of anemic patients.
The dataset was constructed from the results of the Complete Blood Count (CBC) test. The experiment was conducted with four data mining classification algorithms, of which the J48 decision tree and SMO performed best, with 93.75% accuracy at the 60% percentage split. The comparison of the selected algorithms with the WEKA Experimenter showed that the J48 decision tree algorithm gives the best performance in F-measure, sensitivity, true positive rate and precision, and the lowest false positive rate. J48 therefore proved to be potentially the most effective and efficient classification algorithm. Likewise, for the anemia model, the performance chart based on the ROC curve showed the highest weighted value for the J48 decision tree.
In future work, more data mining algorithms will be used to classify all types of anemia on different datasets, to assess accuracy and prediction results.
REFERENCES
Ahmad, A., Mustapha, A., Zahadi, E. D., Masah, N., & Yahaya, N. Y. (2011). Comparison between Neural Networks against Decision Tree in Improving Prediction Accuracy for Diabetes Mellitus. Digital Information Processing and Communications (pp. 537–545): Springer.
Amin, M. N., & Habib, M. A. Comparison of Different Classification Techniques Using WEKA for Hematological Data.
Dogan, S., & Turkoglu, I. (2008). Iron-deficiency anemia detection from hematology parameters by using decision trees. International Journal of Science & Technology, 3(1), 85–92.
Elshami, E. H., & Alhalees, A. M. (2012). Automated Diagnosis of Thalassemia Based on Data Mining Classifiers. Paper presented at the International Conference on Informatics and Applications (ICIA2012).
Garner, S. R. (1995). Weka: The Waikato environment for knowledge analysis. Paper presented at the Proceedings of the New Zealand computer science research students conference.
Green, R. (2012). Anemias beyond B12 and iron deficiency: the buzz about other B's, elementary, and nonelementary problems. ASH Education Program Book, 2012(1), 492–498.
Huang, F., Wang, S., & Chan, C.-C. (2012). Predicting disease by using data mining based on healthcare information system. Paper presented at the Granular Computing (GrC), 2012 IEEE International Conference on.
Kaur, P., Singh, M., & Josan, G. S. (2015). Classification and Prediction Based Data Mining Algorithms to Predict Slow Learners in Education Sector. Procedia Computer Science, 57, 500–508.
Kesavaraj, G., & Sukumaran, S. (2013). A study on classification techniques in data mining. Paper presented at the Computing, Communications and Networking Technologies (ICCCNT), 2013 Fourth International Conference on.
Kishore, C. R., Rao, K. P., & Murthy, G. Performance Evaluation of Entropy and Gini using Threaded and Non Threaded ID3 on Anaemia Dataset. Life, 6(10), 10–12.
Prakash, V. A., Ashoka, D., & Aradya, V. M. (2015). Application of Data Mining Techniques for Defect Detection and Classification. Paper presented at the Proceedings of the 3rd International Conference on Frontiers of Intelligent Computing: Theory and Applications (FICTA) 2014.
Sanap, S. A., Nagori, M., & Kshirsagar, V. (2011). Classification of anemia using data mining techniques. Swarm, Evolutionary, and Memetic Computing (pp. 113–121): Springer.
Shashidhara, M. Classification of Women Health Disease (Fibroid) Using Decision Tree algorithm.
Shouval, R., Bondi, O., Mishan, H., Shimoni, A., Unger, R., & Nagler, A. (2014). Application of machine learning algorithms for clinical predictive modeling: a data-mining approach in SCT. Bone Marrow Transplantation, 49(3), 332–337.
Siadaty, M. S., & Knaus, W. A. (2006). Locating previously unknown patterns in data-mining results: a dual data- and knowledge-mining method. BMC Medical Informatics and Decision Making, 6(1), 13.
Vijayarani, S., & Muthulakshmi, M. (2013). Comparative Analysis of Bayes and Lazy Classification Algorithms. International Journal of Advanced Research in Computer and Communication Engineering, 2(8), 3118–3124.
Yilmaz, A., Dagli, M., & Allahverdi, N. (2013). A fuzzy expert system design for iron deficiency anemia. Paper presented at the Application of Information and Communication Technologies (AICT), 2013 7th International Conference on.
ICCMIT_Book.indb 621ICCMIT_Book.indb 621 10/3/2016 9:26:46 AM10/3/2016 9:26:46 AM
Downloaded by [Manal Abdullah] at 10:04 24 November 2016
Downloaded by [Manal Abdullah] at 10:04 24 November 2016
... En [9] tiene un enfoque clasificación nutricional por antropometría compatible con riesgo de desnutrición crónica. Otras investigaciones como [17] que diseña un modelo que prediga el estado nutricional de niños menores de cinco años utilizando técnicas de minería de datos, u otros estudios [19,20,22,23,24] haciendo comparaciones con diferentes modelos de clasificación relacionados al problema de la anemia. ...
... En la comparación de las herramientas utilizadas en los artículos que pertenecen a este grupo, se encuentra que varias investigaciones utilizaron el software llamado Weka (Waikato Environment for Knowledge Analysis) en sus diferentes versiones, en [8] con el objetivo de conocer si un paciente necesita un seguimiento por un especialista de nutrición, en [17] diseña un modelo que prediga el estado nutricional de niños menores de cinco años utilizando técnicas de minería de datos, en [19] explora la cantidad de alimentos sobre los que se requería información sobre la ingesta para predecir con precisión el cumplimiento, o no, de las recomendaciones dietéticas clave, en [21] estudia los hábitos dietéticos relacionados con el estado de obesidad de los niños, en [22] demostrar el análisis de la desnutrición en función de la ingesta de alimentos, el índice de riqueza, el grupo de edad, el nivel educativo, la ocupación, etc. y en [23] explora la cantidad de alimentos sobre los que se requería información sobre la ingesta para predecir con precisión el cumplimiento, o no, de las recomendaciones dietéticas clave. ...
... En donde mayor cantidad de muestra fue en un estudio donde se utilizó regresión lineal y otros algoritmos, siendo 9004 datos, los cuales se recopilaron utilizando el analizador de hematología automático Mindray BC-5300 [23]. ...
Article
Full-text available
One of the main public health problems is child malnutrition, since it negatively affects the individual throughout his life, limits the development of society and makes it difficult to eradicate poverty. The first objective of this research is to apply data mining techniques for preprocessing, cleaning, reduction and transformation to a data lake that has allowed analyzing anemia in children under 5 years of age, the second objective is to apply Machine Learning algorithms to obtain the best model to predict anemia in children under 5 years of age. The data set was extracted from the open data platform of the government of Peru that corresponds to South Lima, North Lima, East Lima, Central Lima and rural Lima, which collected a total of 138,369 instances and 36 variables of which 30 are categorical and 6 numeric, being an unbalanced data set. In order to obtain the best predictor variables, the Anova F-test and Chi Square filters were used, and it was possible to reduce them to 10 variables, cases were also carried out without considering one of the filters and both filters.To find the best prediction model, the algorithms have been tested: decision tree, logistic regression, K nearest neighbors, random forest and naive bayes. As a result, we show that the best algorithm to predict anemia in children under 5 years of age is the Naive Bayes algorithm with the highest recall of 74%, precision of 43% and accuracy of 70%.
... Thalassemia diagnosis depends on certain characteristics derived after performing a complete blood count (CBC) test. However, the reliability of the test can lead to the misdiagnosis of thalassemia as similar characteristics can also be observed in different blood disorders (Abdullah & Al-Asmari, 2016; Jatoi et al., 2018; Meena et al., 2019). Blood diseases can be of various types, such as anaemia, which is a common nutritional deficiency and blood disorder in childhood and infancy, and iron deficiency anemia (IRD) is mostly found in women and children, especially in developing countries (AlAgha et al., 2018; Jatoi et al., 2018). ...
... Blood diseases can be of various types, such as anaemia, which is a common nutritional deficiency and blood disorder in childhood and infancy, and iron deficiency anemia (IRD) is mostly found in women and children, especially in developing countries (AlAgha et al., 2018; Jatoi et al., 2018). However, the most crucial type of anaemia is thalassemia, an inherited disorder whose identification or differentiation from normal patients is challenging from the CBC test (Abdullah & Al-Asmari, 2016). Therefore, the problem identified in the healthcare sector is to design a model that can predict the risk of thalassemia in patients before their CBC test. ...
... Similarly, Saichanma et al. (2014) used the J48 decision tree algorithm to predict the abnormality of peripheral blood smear, focusing mainly on the attribute of RBC of the CBC test (Saichanma et al., 2014). The previous studies (Abdullah & Al-Asmari, 2016; Alaa & Shurrab, 2017; AlAgha et al., 2018; Jatoi et al., 2018; Meena et al., 2019) have classified the types of anaemia or thalassemia utilizing the techniques of data mining. However, the present study focused on determining thalassemia traits' existence based on the CBC test attributes (MCV, HGB, RDW, MCHC, and HCT) for predicting the risk of thalassemia. ...
Article
Medical data mining is concerned with prediction knowledge, which is a useful method for extracting hidden patterns from given data for specific purposes. Thalassemia is one of the most common inherited hematological disorders, and this paper adopted data mining classification techniques to generate results with high performance and accuracy for risk prediction of thalassemia. The dataset for this purpose was collected from NIBD (National Institute of Blood Diseases), a well-known institute and hospital for blood diseases in Karachi, Pakistan, which provided 301 records of CBC test reports containing positive and negative diagnoses of thalassemia traits. The reports contained many attributes, of which 6 were used for our research purpose, i.e. Gender, MCV, HGB, HCT, MCHC, and RDW. The dataset was divided into training and test data using the WEKA tool. Four data mining classification algorithms, namely J48 Decision Tree, Naïve Bayesian Network, SMO, and Multilayer Perceptron Neural Network, were used in WEKA to train the model and distinguish patients with thalassemia traits from normal persons. Results revealed that, of the four algorithms, Naïve Bayes provided the highest accuracy of 99%.
... In this study, ANN, SVM, and statistical model methods were applied in the diagnosis of iron deficiency. Some classification algorithms such as NB, MP, J48, and SMO were applied using the WEKA data mining tool [40]. As a result, it was observed that the J48 decision tree algorithm (JDTA) had the best performance. ...
Article
Data mining methods are important for the diagnosis and prediction of diseases. Early and accurate diagnosis of patients is vital for their treatment. Various methods have been used in the literature to classify anemia. However, due to the different characteristics of patient datasets, changes in dataset sizes, different parameter numbers and features, and different numbers of patient records, algorithm performances vary according to datasets. In this study, the Harris hawks algorithm (HHA) and the multivariate adaptive regression spline (MARS) were used to classify anemia based on blood data of 1732 patients from the Kaggle database of patients with and without anemia. Six different algorithms were proposed to determine the parameters of the linear anemia approximation, namely multilinear form HHA, multilinear quadratic form HHA, multilinear exponential form HHA, first-order MARS model, second-order MARS model, and the best performing MARS model. The performance of the six proposed algorithms has been analyzed and found to be better than the previous studies in the literature.
... The highest accuracy (85.6%) was obtained using Bagged Decision Trees. Abdullah and Al-Asmari (2016) classified five anaemia types with seven different blood parameters using blood records from 41 anaemic patients. Using classification algorithms such as NB, Multilayer Perceptron, J48, and Sequential Minimal Optimization (SMO), the highest success was achieved with J48. ...
Article
Anaemia occurs when the haemoglobin (Hgb) value falls below a certain reference range. It requires many blood tests, radiological images, and tests for diagnosis and treatment. By processing medical data from patients with artificial intelligence and machine learning methods, disease predictions can be made for newly ill individuals and decision‐support mechanisms can be created for physicians with these predictions. Thanks to these methods, which are very important in reducing the margin of error in the diagnoses made by doctors, the evaluation of data records in health institutions is also important for patients and hospitals. In this study, six hybrid models are proposed to classify non‐anaemia records, Hgb‐anaemia, folate deficiency anaemia (FDA), iron deficiency anaemia (IDA), and B12 deficiency anaemia by combining artificial intelligence and machine learning methods TreeBagger, Crow Search Algorithm (CSA), Chicken Swarm Optimization Algorithm (CSO) and JAYA methods. The proposed hybrid models are analysed with two different approaches, with/without applying the SMOTE technique to achieve high performance by better emphasizing the importance of parameters. To solve the multiclass anaemia classification problem, fuzzy logic‐based parameter optimization is applied to improve the class‐based accuracy as well as the overall accuracy in the dataset. The proposed methods are evaluated using ROC criteria to build a prediction model to determine the anaemia type of anaemic patients. As a result of the study on the dataset taken from the Kaggle database, it is observed that the six proposed hybrid methods outperformed other studies using the same dataset and similar studies in the literature.
Article
The Educational Data Mining field concentrates on prediction more often than on generating exact results for future purposes. To keep track of changes in curriculum patterns, regular analysis of educational databases is a must. This paper focuses on identifying slow learners among students and presenting them through a predictive data mining model using classification-based algorithms. A real-world data set from a high school is used, and the desired potential variables are filtered using WEKA, an open-source tool. The dataset of student academic records is tested on various classification algorithms, such as Multilayer Perceptron, Naïve Bayes, SMO, J48 and REPTree, using WEKA. As a result, statistics are generated for all classification algorithms, and the five classifiers are compared in order to assess prediction accuracy and find the best-performing classification algorithm. A knowledge flow model covering all five classifiers is also shown. This paper showcases the importance of prediction- and classification-based data mining algorithms in the field of education and also presents some promising future lines.
Conference Paper
In this study, an application of fuzzy expert system was introduced. This system was designed to determine the level of iron deficiency anemia and thus, expert physicians were provided with a system to assist them to determine an exact diagnosis prior to their treatment. While realizing the system design, the laboratory records obtained from real patients were examined, the appropriate parameters were specified, the input and output parameters were fuzzified and finally the rule base was built with the expert physician. This study used the centroid defuzzification method together with the Mamdani inference mechanism which is often used in the related literature. With the help of a visual programming language, the level of the disease was displayed in a perceptible way as the result of running the system.
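The combination of Mamdani inference and centroid defuzzification named above can be sketched as follows. This is a minimal illustration under stated assumptions, not the system described in the study: the single input (a hemoglobin value), the triangular membership functions, their breakpoints, the two rules and the 0 to 10 severity scale are all hypothetical and have no clinical meaning.

```python
def tri(x, a, b, c):
    """Triangular membership function with feet at a and c and peak at b (a < b < c)."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def mamdani_severity(hgb, steps=100):
    """Map a hemoglobin value to a crisp severity score on a 0..10 scale.

    Hypothetical rule base:
      IF hgb is low    THEN severity is severe
      IF hgb is normal THEN severity is mild
    Returns None when no rule fires.
    """
    # Fuzzification: firing strength of each rule antecedent.
    w_low = tri(hgb, 4.0, 8.0, 12.0)
    w_normal = tri(hgb, 10.0, 14.0, 18.0)

    # Discretize the output universe 0..10.
    ys = [10.0 * i / steps for i in range(steps + 1)]

    # Mamdani implication (min) clips each output set at its rule strength;
    # aggregation (max) merges the clipped sets into one fuzzy output.
    aggregated = [
        max(min(w_low, tri(y, 5.0, 7.5, 10.0)),    # "severe" output set
            min(w_normal, tri(y, 0.0, 2.5, 5.0)))  # "mild" output set
        for y in ys
    ]

    # Centroid defuzzification: membership-weighted mean of the output universe.
    area = sum(aggregated)
    if area == 0.0:
        return None
    return sum(y * m for y, m in zip(ys, aggregated)) / area


print(mamdani_severity(8.0))
print(mamdani_severity(11.0))
```

The centroid is approximated by a discrete sum over the output universe; a finer `steps` value trades speed for precision.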
Article
Data collected from hematopoietic SCT (HSCT) centers are becoming more abundant and complex owing to the formation of organized registries and incorporation of biological data. Typically, conventional statistical methods are used for the development of outcome prediction models and risk scores. However, these analyses carry inherent properties limiting their ability to cope with large data sets with multiple variables and samples. Machine learning (ML), a field stemming from artificial intelligence, is part of a wider approach for data analysis termed data mining (DM). It enables prediction in complex data scenarios, familiar to practitioners and researchers. Technological and commercial applications are all around us, gradually entering clinical research. In the following review, we would like to expose hematologists and stem cell transplanters to the concepts, clinical applications, strengths and limitations of such methods and discuss current research in HSCT. The aim of this review is to encourage utilization of the ML and DM techniques in the field of HSCT, including prediction of transplantation outcome and donor selection. Bone Marrow Transplantation advance online publication, 7 October 2013; doi:10.1038/bmt.2013.146.
Article
The term "unexplained anemia" appears frequently in a request for a hematology consultation. Although most anemia consultations are fairly routine, they occasionally represent challenging problems that require an amalgam of experience, insight, and a modicum of "out-of-the-box" thinking. Problem anemia cases and pitfalls in their recognition can arise for one of several reasons that are discussed in the cases presented herein. "Anemias beyond B12 and iron deficiency" covers a vast domain of everything that lies beyond the commonly encountered anemias caused by simple deficiencies of 2 currently major hematologically relevant micronutrients. However, even these deficiencies may be obscured when they coexist or are not considered because of misleading distractions. They may also be mistakenly identified when other less common nutrient deficiencies occur. I present herein case examples of such situations: a young patient with pancytopenia and schistocytes who was responsive to plasmapheresis, but in whom pernicious anemia was not suspected because of ethnicity and age; a bicytopenic patient with anemia and myelodysplastic features caused by copper deficiency after gastric reduction surgery; and a patient with BM hypoplasia and a dimorphic blood smear who was found to have paroxysmal nocturnal hemoglobinuria. These "pearls" represent but 3 examples of the many varieties of problems in anemia diagnosis and are used to illustrate potential pitfalls and how to avoid them.
Article
To overcome software development challenges such as delivering a project on time, developing quality software products and reducing development cost, the software industry commonly uses defect detection tools to manage quality in software products. Defects are detected and classified based on their severity; this can be automated in order to reduce development time and cost. Nowadays, engineers and researchers use data mining techniques to extract useful knowledge from large software repositories. In this paper, a software defect detection and classification method is proposed, and data mining techniques are integrated to identify and classify defects from a large software repository. Based on defect severity, the proposed method focuses on three layers: core, abstraction and application. The designed method is evaluated using precision and recall.
Conference Paper
Data mining is the process of inferring knowledge from huge amounts of data. It has three major components: Clustering or Classification, Association Rules and Sequence Analysis. By simple definition, classification/clustering analyzes a set of data and generates a set of grouping rules that can be used to classify future data. Data mining extracts information from a data set and transforms it into an understandable structure. It is the computational process of discovering patterns in large data sets, involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The actual data mining task is the automatic or semi-automatic analysis of large quantities of data to extract previously unknown, interesting patterns. Data mining involves six common classes of tasks: anomaly detection, association rule learning, clustering, classification, regression and summarization. Classification is a major technique in data mining and is widely used in various fields. It is a data mining (machine learning) technique used to predict group membership for data instances. In this paper, we present the basic classification techniques, covering several major kinds of classification method, including decision tree induction, Bayesian networks and the k-nearest neighbor classifier. The goal of this study is to provide a comprehensive review of different classification techniques in data mining.
Conference Paper
This paper applies the data mining process to predict hypertension from patient medical records with eight other diseases. A sample of 9862 cases has been studied. The sample was extracted from a real-world Healthcare Information System database containing 309383 medical records. We observed that the distribution of patient diseases in the medical database is imbalanced. An under-sampling technique has been applied to generate training data sets, and the data mining tool Weka has been used to generate the Naïve Bayesian and J-48 classifiers. In addition, an ensemble of five J-48 classifiers was created to try to improve the prediction performance, and rough set tools were used to reduce the ensemble based on the idea of second-order approximation. Experimental results showed a slight improvement of the ensemble approach over pure Naïve Bayesian and J-48 in accuracy, sensitivity, and F-measure.
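Two ideas mentioned above, under-sampling an imbalanced data set and combining several classifiers into an ensemble, can be sketched in a few lines. This is a minimal illustration that assumes random under-sampling and simple majority voting; the abstract does not state which variants were used, and the rough-set reduction step is not shown. The function names `undersample` and `majority_vote` and the sample records are hypothetical.

```python
import random
from collections import Counter, defaultdict


def undersample(rows, label_of, seed=0):
    """Randomly drop rows from larger classes until every class has as many
    rows as the smallest class. `label_of` extracts the class label of a row."""
    groups = defaultdict(list)
    for row in rows:
        groups[label_of(row)].append(row)

    smallest = min(len(group) for group in groups.values())
    rng = random.Random(seed)  # Seeded for reproducible sampling.

    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced


def majority_vote(predictions):
    """Return the label predicted by the most ensemble members."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical imbalanced records: (record_id, label).
records = [(i, "other") for i in range(90)] + [(i, "hypertension") for i in range(90, 100)]
balanced = undersample(records, label_of=lambda row: row[1])
print(Counter(label for _, label in balanced))

# Hypothetical outputs of five classifiers for one patient record.
print(majority_vote(["hypertension", "other", "hypertension", "hypertension", "other"]))
```

Drawing a different seed for each ensemble member is one way to give every classifier its own balanced training set.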