
Anemia types prediction based on data mining classification algorithms

November 2016


Authors:

Manal Abdullah

King Abdulaziz University

Salma Al-Asmari

King Khalid University

Medical data mining is concerned with predictive knowledge: extracting desired outcomes from medical data for specific purposes. Anemia is one of the most common hematological diseases, and this study concentrates on its five most common types. The paper determines the anemia type of anemic patients through a predictive model built with several data mining classification algorithms. The dataset was constructed from real Complete Blood Count (CBC) test results of patients. The data were filtered to eliminate undesirable variables and then processed with the Naïve Bayes, Multilayer Perceptron, J48 and SMO classification algorithms using the WEKA data mining tool. Several experiments showed that the J48 decision tree algorithm gives the best classification of anemia types. The WEKA Experimenter likewise shows that J48 has the best performance in terms of accuracy, precision, recall, true positive rate, false positive rate and F-measure.


Communication, Management and Information Technology – Sampaio de Alencar (Ed.)

Anemia types prediction based on data mining classification algorithms

Manal Abdullah

Department of Computer Science, Faculty of Computing and Information Technology,

King Abdul-Aziz University, Jeddah, Saudi Arabia

Salma Al-Asmari

Department of Computer Science, Faculty of Computing, King Khalid University, Abha, Saudi Arabia


Keywords: Anemia, Medical Data mining, classification algorithms, naïve Bayes, J48 decision tree, Support vector machine, SMO

1 INTRODUCTION

Data mining sorts through data to identify patterns and find relationships among them. Its techniques are appropriate for simple or structured datasets such as relational and transactional databases. Different data mining approaches have been proposed to address the challenges of storing and processing all types of data (Kaur et al., 2015; Kishore et al., 2015).

Data mining has three basic mechanisms: clustering (classification), decision rules and analysis. Classification analyzes a set of data and produces a set of decision rules, which are then used to classify the data. In artificial intelligence, machine learning and database systems, the data mining process starts by extracting information from a dataset and then converting it into a meaningful structure; that is, it determines the patterns present in the dataset. The most common data mining task is classification, which is used to predict relationships among data. In healthcare, it is important to exploit advances in computer technology, such as data mining classification algorithms and tools, to improve the processing of medical data. This paper uses the WEKA tool for data mining (Shouval et al., 2014). The name WEKA is derived from Waikato Environment for Knowledge Analysis. It is an open-source data mining tool that provides an efficient framework for implementing several classification algorithms. The tool supports preprocessing datasets, filtering out irrelevant data, and dividing a dataset into training and test sets. It runs classification algorithms after transforming the dataset into a form suitable for machine learning. WEKA can also load different file formats, such as ARFF, CSV and C4.5, as well as data from different databases (Garner, 1995).

There is growing research interest in using data mining in the medical domain. This new approach, called medical data mining, is concerned with developing systems that discover and predict knowledge from data generated in medical environments. Medical data, and hospital databases in particular, are huge in volume, complex in content, heterogeneous in type, hierarchical and of varying quality. In recent years, laboratory information has kept improving and growing. Data mining methodologies can predict specific patterns in this information, which supports research and the evaluation of reports. Data mining classification depends on similarities that exist in the data. Classification algorithms are used to show that the results are acceptable to doctors or end users. Medical data mining uses many algorithms, such as decision trees, neural networks, Naïve Bayes and others.

This paper identifies a set of attributes from a patient's CBC test result that determine the anemia type, and improves the quality of prediction for anemic patients so that doctors can be supported immediately in their work. It investigates the accuracy of several classification algorithms in predicting some anemia types, and uses the WEKA tool for classification, decision rules and analysis of the results. To evaluate the data, a set of already classified instances is taken as the training set and used to train the algorithms; the test data are then classified with the decision rules extracted from the training set in order to predict the anemia type. The WEKA Experimenter is used to determine which classification algorithm performs best in terms of accuracy, precision, recall, true positive rate, false positive rate and F-measure. The main objectives of this work are to use predictive attributes to produce the data and to apply data mining algorithms to obtain the best prediction of anemia types from patients' Complete Blood Count (CBC) results.

2 RELATED WORKS

Many works have used different data mining algorithms to classify several types of diseases, including specific types of anemia (Elshami & Alhalees, 2012). In addition, many other researchers have tried to develop their own methods. A person with anemia is probably unaware of the problem because symptoms may not appear; millions of people may have anemia, with their health exposed to risk. Because the disease is significant, several studies in this domain have been reported in the literature (Yilmaz et al., 2013).

Sanap et al. (2011) developed a system using two classification techniques in WEKA: the C4.5 decision tree algorithm and the SMO support vector machine. They ran a number of experiments with these algorithms, and the decision tree gave clear anemia classification results based on CBC reports. Amin et al. (2015) compared the naïve Bayes, J48 and neural network classification algorithms in WEKA on hematological data to determine the most appropriate algorithm. The proposed model can predict hematological data, and the results showed that the J48 classifier was the best algorithm, with high accuracy, while naïve Bayes had the lowest average error. The studies of Sanap et al. (2011) and Amin et al. (2015) showed that the C4.5 algorithm (J48 in WEKA) gives higher accuracy than the other classifiers. Dogan & Turkoglu (2008) designed a system, based on biochemical blood parameters, to help physicians diagnose anemia. The system applies a decision tree algorithm to hematology characteristics and classifies the results as positive or negative for anemia; its results agreed with the physicians' decisions. Siadaty & Knaus (2006) selected decision trees as a common and simple classifier with low computational complexity. The problem was that the time needed to build a decision tree for a large dataset becomes intractable. They solved it by developing a parallel version of the ID3 algorithm, a decision tree with thread-level parallelism that performs its computations independently. The experiment was carried out on an anemic patients' dataset. Kishore et al. (2015) presented a set of basic classification algorithms covering the essential types of classification methods, such as decision trees, Bayesian networks, k-nearest neighbor and the support vector machine classifier. Their study is a comprehensive review of diverse classification algorithms in data mining.

The present research investigates five types of anemia using the naïve Bayes, multilayer perceptron, J48 decision tree and support vector machine data mining algorithms on CBC data. Which classification algorithm is best depends on the specific problem domain (Kesavaraj & Sukumaran, 2013).

3 ANEMIA CLASSIFICATION

3.1 What is anemia?

Anemia is a medical condition in which the hemoglobin or red cell concentration in human blood is reduced. A Complete Blood Count (CBC) test is conducted for patients in the laboratory. Anemia types are identified from this information: age, gender, hemoglobin, hematocrit and other attribute values that fall below the normal range (Green, 2012). The classification of anemia types according to CBC test values is illustrated in Fig. 1 (Sanap et al., 2011).

3.2 The anemia classification

Anemia is categorized into different types based on the CBC test values. In this model, the anemia type nomenclature is given in Table 1. Anemia is classified by MCV (mean corpuscular volume) into three essential kinds: microcytic (MCV < 80 fL), normocytic (MCV = 80–100 fL) and macrocytic (MCV > 100 fL). It is classified by MCHC (mean corpuscular hemoglobin concentration) into normochromic (MCHC = 32–36 g/dl) and hypochromic (MCHC < 32 g/dl). RDW (red cell distribution width) is also used to assess anemia: it is high if RDW > 14.6 and normal if RDW = 11.6–14.6 (Green, 2012; Sanap et al., 2011).
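The thresholds above can be written as three small lookup functions. This is a minimal sketch in Python, not the authors' implementation: the function names are hypothetical, and values outside the ranges stated in the text (MCHC above 36, RDW below 11.6) are returned as `None` because the text does not assign them a category.

```python
def mcv_category(mcv):
    """Red cell size class from mean corpuscular volume."""
    if mcv < 80:
        return "microcytic"
    if mcv <= 100:
        return "normocytic"
    return "macrocytic"


def mchc_category(mchc):
    """Color class from mean corpuscular hemoglobin concentration (g/dl)."""
    if mchc < 32:
        return "hypochromic"
    if mchc <= 36:
        return "normochromic"
    return None  # above 36: no category is given in the text


def rdw_category(rdw):
    """Red cell distribution width class."""
    if rdw > 14.6:
        return "high"
    if rdw >= 11.6:
        return "normal"
    return None  # below 11.6: no category is given in the text


# Illustrative values only, not taken from the ANEMIA dataset.
print(mcv_category(72.5), mchc_category(33.0), rdw_category(15.2))
```

Boundary values (80, 100, 32, 36, 11.6, 14.6) are treated here as belonging to the closed ranges quoted in the text.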

4 THE PROPOSED METHOD

4.1 Experimental setup

To classify anemia types, a number of attributes are considered as inputs for predicting the type of anemia of an anemic patient. The data are taken from Complete Blood Count (CBC) test results obtained from blood samples of 41 anemic patients (41 instances), which form the ANEMIA dataset. The dataset consists of 7 attributes, defined in Table 2 along with their values.

The data are then transformed into a standard file format, CSV, which is supported by the WEKA tool, and irrelevant data are filtered out using specific techniques. The CBC data contain 34 irrelevant attributes, which are removed. The relevant attributes are shown in Table 2. The attributes vary between nominal and numeric values, and each has its own defined categories.

The classification algorithms are applied to predict and classify the five most common anemia types based on the rules shown in Table 3. The analysis that identifies the anemia types is conducted using the WEKA tool (Siadaty et al., 2006; Sanap et al., 2011; Shashidhara, 2012).
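The rule set of Table 3 can be read as a first-match decision list. The sketch below is an illustration in Python under stated assumptions, not the authors' code: the names `RULES`, `SEVERITY` and `classify` are hypothetical, and the inputs are the category labels of Table 2 (so "MCHC <32" is written as `hypochromic` and "RDW = 11.6–14.6" as `normal`). The rules are transcribed in printed order. As printed, the THAL and IDA rules carry identical conditions, so a first-match evaluation never reaches IDA; the sketch keeps the printed rules rather than guessing an intended distinction.

```python
# Rules transcribed from Table 3 in printed order: (conditions, anemia type).
RULES = [
    ({"mcv": "microcytic"}, "ACD"),
    ({"mcv": "normocytic", "mchc": "hypochromic", "rdw": "normal"}, "THAL"),
    # Same printed condition as THAL, so this entry is never reached.
    ({"mcv": "normocytic", "mchc": "hypochromic", "rdw": "normal"}, "IDA"),
    ({"mcv": "normocytic", "mchc": "normochromic"}, "ARD"),
    ({"mcv": "macrocytic"}, "APA"),
]

# Severity grade from the HGB range, as in Tables 2 and 3.
SEVERITY = {"10-12": "moderate", "<10": "severe"}


def classify(record):
    """Return (anemia type, severity) for the first matching rule, else None."""
    severity = SEVERITY.get(record.get("hgb"))
    if severity is None:
        return None
    for conditions, anemia_type in RULES:
        if all(record.get(key) == value for key, value in conditions.items()):
            return anemia_type, severity
    return None


print(classify({"mcv": "microcytic", "hgb": "<10"}))
print(classify({"mcv": "normocytic", "mchc": "normochromic", "hgb": "10-12"}))
```

Records that match no printed rule, for example a normocytic, hypochromic case with high RDW, return `None` in this sketch.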

The implementation of the proposed method starts by collecting CBC results and building our own dataset. The data are then preprocessed to extract and filter the important attributes, and converted to CSV format so that they can be used by the WEKA classifier software. The CSV format is selected because it allows data to be saved in a table-structured (spreadsheet) format. After classification, the generated results are evaluated using the WEKA Experimenter and the Knowledge Flow model.
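The filtering and CSV steps described above can be sketched with Python's standard `csv` module. This is an assumption-laden illustration rather than the procedure actually used (the paper performs filtering in WEKA): the helper names `keep_relevant` and `to_csv` are hypothetical, the sample row is invented, and `WBC` and `PLT` merely stand in for the removed attributes, which the paper does not list.

```python
import csv
import io

# The seven attributes of Table 2.
RELEVANT = ["Age", "Gender", "MCV", "HCT", "HGB", "MCHC", "RDW"]


def keep_relevant(csv_text, columns=RELEVANT):
    """Parse CSV text and keep only the listed columns, in that order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{name: row[name] for name in columns} for row in reader]


def to_csv(rows, columns=RELEVANT):
    """Serialize rows back to CSV text with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Hypothetical raw export; WBC and PLT stand in for removed attributes.
raw = (
    "Age,Gender,WBC,MCV,HCT,HGB,MCHC,RDW,PLT\n"
    "34,F,6.1,72.5,31.0,9.4,30.3,15.2,250\n"
)
rows = keep_relevant(raw)
print(to_csv(rows))
```

The resulting text has one header line and one line per patient, which is the table-structured layout the CSV format is chosen for.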

4.2 The proposed algorithms for classification

In this method, various data mining algorithms are used to predict the anemia type of patients. The dataset is tested and analyzed with four candidate classification algorithms: Naïve Bayes, neural network (multilayer perceptron), decision tree (J48) and support vector machine (SMO).

The Naïve Bayes algorithm applies the principle of conditional probability: it computes a probability by counting the frequency of values and combinations of values in the data, and determines the probability of an event given the probability of another event that has already happened. The Naïve Bayes algorithm can use kernel density estimators, which improve performance when the normality assumption is clearly incorrect; it can also deal with numeric attributes using supervised discretization (Vijayarani & Muthulakshmi, 2013).

The second algorithm is a neural network, named Multilayer Perceptron in WEKA. It is a feed-forward neural network model that maps a set of input data (each input is a neuron) onto a set of suitable outputs. Each node other than the input nodes is a neuron with a nonlinear activation function. The multilayer perceptron consists of one or more hidden layers of nodes (hidden neurons) in a directed graph, with each layer fully connected to the next (Prakash et al., 2015).

The J48 decision tree algorithm is also used; it processes data automatically and can select relevant features from the training data. It prunes away meaningless branches to keep the process effective, especially when dealing with continuous attributes. It splits continuous values at a threshold, separating values above the threshold from those less than or equal to it. J48 can also handle training data with missing attribute values (Ahmad et al., 2011).

The support vector machine, named SMO in WEKA, is a supervised learning method that analyzes data and recognizes patterns. It is a non-probabilistic classifier: it takes a set of input data, assigns each input to one of two possible classes and outputs the result. The SVM has the same functional form as neural networks and radial basis functions (Kesavaraj & Sukumaran, 2013). It is generally applied to two-class classification problems, where it finds the plane that gives the greatest separation between the two classes. The SVM algorithm finds the optimal plane with the maximum distance to the nearest points of the two classes. The instances closest to the optimal plane are the support vectors, and they define the margin of each class (Shouval et al., 2014). The proposed methodology is illustrated as a flowchart in Fig. 2.

Figure 1. Anemia types classification.

Table 1. ANEMIA types nomenclature.

| Abbreviation | Anemia type |
|---|---|
| ACD | Anemia of chronic disease |
| IDA | Iron deficiency anemia |
| ARD | Anemia of renal disease |
| THAL | Thalassemia |
| APA | Aplastic Anemia |

Table 2. Attributes of ANEMIA dataset.

| Attribute | Attribute value | Attribute category |
|---|---|---|
| Age | 0–12 | Child |
| Age | >12 | Adult |
| Gender | Female | F |
| Gender | Male | |
| MCV | <80 | Microcytic |
| MCV | 80–100 | Normocytic |
| MCV | >100 | Macrocytic |
| HCT | <37 | Low |
| HCT | 37.0–50.0 | Normal |
| HGB | <10 | Severe |
| HGB | 10–12 | Moderate |
| MCHC | <32 | Hypochromic |
| MCHC | 32–36 | Normochromic |
| RDW | >14.6 | High |
| RDW | 11.6–14.6 | Normal |

Table 3. Anemia classification rules.

| The rule | Decision* |
|---|---|
| IF (MCV = microcytic AND HGB = 10–12) then | ACD, moderate |
| Else if (MCV = microcytic AND HGB = <10) then | ACD, severe |
| Else if (MCV = normocytic AND MCHC <32 AND RDW = 11.6–14.6 AND HGB = 10–12) then | THAL, moderate |
| Else if (MCV = normocytic AND MCHC <32 AND RDW = 11.6–14.6 AND HGB = <10) then | THAL, severe |
| Else if (MCV = normocytic AND MCHC <32 AND RDW = 11.6–14.6 AND HGB = 10–12) then | IDA, moderate |
| Else if (MCV = normocytic AND MCHC <32 AND RDW = 11.6–14.6 AND HGB = <10) then | IDA, severe |
| Else if (MCV = normocytic AND MCHC = 32–36 AND HGB = 10–12) then | ARD, moderate |
| Else if (MCV = normocytic AND MCHC = 32–36 AND HGB = <10) then | ARD, severe |
| Else if (MCV = macrocytic AND HGB = 10–12) then | APA, moderate |
| Else if (MCV = macrocytic AND HGB = <10) | APA, severe |

*The decision includes the anemia type and severity grade.

Figure 2. Flowchart of proposed method.

5 RESULTS AND DISCUSSION

The evaluation uses the 41 instances of the dataset with Naïve Bayes, the neural network in WEKA (multilayer perceptron), the J48 decision tree and the support vector machine in WEKA (SMO), under the percentage split test option with several splits (20%, 40% and 60%) of the dataset (see Table 4). Table 4 gives the results of evaluating the ANEMIA dataset in WEKA in the experiments with 20%, 40% and 60% percentage splits. The table reports accuracy (correctly classified instances), mean absolute error, weighted average ROC and F-measure. Fig. 3 shows the SMO algorithm output using 60% of the data as the training set.

The percentage split test assigns a specified percentage of the data to training and the rest to testing. In this experiment the percentage splits are 20%, 40% and 60%, and the partitions are made randomly. With a 20% split, 20% of the data is used as the training set and the remaining 80% as the test set. The same process is followed for the 40% and 60% splits.

The accuracy (correctly classified instances) increased with the splitting percentage for naïve Bayes, J48, multilayer perceptron and SMO; that is, accuracy rises as the training share grows. The statistics allow the accuracy of all the algorithms to be compared, and they show that the J48 decision tree and SMO algorithms give the best results, with 93.75% accuracy at the 60% percentage split. The accuracy of all the algorithms with a 60% training set is illustrated in Fig. 4.
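The percentage split arithmetic for the 41-instance dataset can be worked through as follows. This is a sketch under one assumption, namely that the training-set size is the rounded share of the 41 instances; the function names are hypothetical. The accuracies used are the J48 rows of Table 4.

```python
N = 41  # instances in the ANEMIA dataset


def split_sizes(n, train_percent):
    """Training and test sizes, assuming the training share is rounded."""
    train = round(n * train_percent / 100)
    return train, n - train


def correct_count(accuracy_percent, test_size):
    """Number of correctly classified test instances implied by an accuracy."""
    return round(accuracy_percent / 100 * test_size)


# (training percentage, reported J48 accuracy in %) from Table 4.
for pct, acc in [(20, 27.2727), (40, 88.0), (60, 93.75)]:
    train, test = split_sizes(N, pct)
    k = correct_count(acc, test)
    print(pct, train, test, k, round(100 * k / test, 4))
```

Under this assumption the test sets hold 33, 25 and 16 instances, so the reported J48 accuracies correspond to 9, 22 and 15 correctly classified instances. With a 16-instance test set, each misclassified instance changes the accuracy by 6.25 percentage points.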

Table 4. Simulation result of algorithms using 20%, 40%, 60% training set data.

| Algorithm | Training set | Accuracy* (%) | Mean absolute error | Weighted av. ROC | F-Measure |
|---|---|---|---|---|---|
| Naïve Bayes | 20% | 30.303 | 0.458 | 0.507 | 0.257 |
| Naïve Bayes | 40% | 60 | 0.3372 | 0.708 | 0.587 |
| Naïve Bayes | 60% | 68.75 | 0.2645 | 0.825 | 0.68 |
| Multilayer Perceptron | 20% | 39.3939 | 0.3744 | 0.775 | 0.383 |
| Multilayer Perceptron | 40% | 72 | 0.2198 | 0.852 | 0.716 |
| Multilayer Perceptron | 60% | 87.5 | 0.1372 | 0.921 | 0.859 |
| J48 Decision tree | 20% | 27.2727 | 0.3207 | 0.855 | 0.218 |
| J48 Decision tree | 40% | 88 | 0.1689 | 0.868 | 0.878 |
| J48 Decision tree | 60% | 93.75 | 0.1743 | 0.97 | 0.935 |
| SMO | 20% | 39.3939 | 0.4108 | 0.677 | 0.396 |
| SMO | 40% | 84 | 0.2578 | 0.902 | 0.83 |
| SMO | 60% | 93.75 | 0.2361 | 0.96 | 0.912 |

*Correctly Classified Instances.

Figure 3. Support vector machine (SMO) algorithm output using 60% training set data.

Figure 4. Comparing algorithms accuracy using the percentage split 60%.

Table 5 shows the performance of naïve Bayes, the neural network (multilayer perceptron), the J48 decision tree and SMO as measured with the WEKA Experimenter. The measures in the table give a more useful and precise evaluation of an algorithm's performance: recall (sensitivity), precision, F-measure, true positive rate and false positive rate, computed as follows:

Recall (sensitivity) = TP/(TP + FN).

Precision = TP/(TP + FP).

F-measure = (2 × recall × precision)/(recall + precision).

Here TP (true positives) is the number of positive instances classified correctly, FN (false negatives) is the number of positive instances (records) classified as negative, and FP (false positives) is the number of negative instances classified as positive (Huang et al., 2012).
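The three measures defined above can be computed directly from instance counts. The following Python sketch uses hypothetical function names and made-up counts (9 true positives, 1 false positive, 3 false negatives) chosen only to illustrate the arithmetic; they are not results from the paper.

```python
def recall(tp, fn):
    """Share of actual positives that were classified as positive."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Share of predicted positives that are actually positive."""
    return tp / (tp + fp)


def f_measure(tp, fp, fn):
    """Harmonic mean of recall and precision."""
    r = recall(tp, fn)
    p = precision(tp, fp)
    return 2 * r * p / (r + p)


# Illustrative counts only.
tp, fp, fn = 9, 1, 3
print(round(recall(tp, fn), 3), round(precision(tp, fp), 3), round(f_measure(tp, fp, fn), 3))
```

For a problem with several classes, such as the anemia types here, these per-class values are typically combined into a weighted average over the classes.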

Snapshots from the WEKA Experimenter are shown for the F-measure in Fig. 5, for precision in Fig. 6 and for the TP rate in Fig. 7. These experiments show that the J48 decision tree performs best among the four algorithms, with an F-measure of 93%, sensitivity of 93%, true positive rate of 93% and precision of 97%, and the lowest false positive rate, 0.05.

Table 5. Comparison of classification algorithms.

| Algorithm | TP Rate | FP Rate | Precision | F-Measure | Recall |
|---|---|---|---|---|---|
| Naïve Bayes | 0.92 | 0.10 | 0.93 | 0.91 | 0.92 |
| Multilayer Perceptron | 0.92 | 0.10 | 0.95 | 0.91 | 0.92 |
| J48 Decision tree | 0.93 | 0.05 | 0.97 | 0.93 | 0.93 |
| SMO | 0.90 | 0.40 | 0.85 | 0.84 | 0.90 |

Figure 5. Comparing algorithms in the WEKA Experimenter using F-measure.

Figure 6. Comparing algorithms in the WEKA Experimenter using precision.

Figure 7. Comparing algorithms in the WEKA Experimenter using true positive rate.

The comparative performance of the four algorithms in terms of accuracy was also examined using the Knowledge Flow model shown in Fig. 8, which shows the membership tree structure using a 10-fold validation test.

Figure 8. Knowledge flow model using WEKA.
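As an illustration of what a 10-fold validation test does with 41 instances, the sketch below partitions the instance indices into ten near-equal folds, each used once as the test set. It is a simplified assumption, not WEKA's procedure: the function name is hypothetical, and the folds here are contiguous and unshuffled, without the randomization or class stratification a tool may apply.

```python
def k_fold_indices(n, k=10):
    """Yield (train, test) index lists for k contiguous, near-equal folds."""
    # The first n % k folds receive one extra instance.
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


folds = list(k_fold_indices(41))
print([len(test) for _, test in folds])
```

With 41 instances, one fold holds five test instances and the other nine hold four, so every instance is tested exactly once.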

A performance chart from the Knowledge Flow model was produced for the algorithms in the experiment: Naïve Bayes, Multilayer Perceptron, J48 and SMO. It is another important performance measure in WEKA. Performance is represented by the Receiver Operating Characteristic (ROC) curve for each algorithm, based on a 10-fold validation test. Fig. 9 clearly shows that the J48 decision tree has the highest weighted average ROC, 0.97.

Figure 9. Performance chart of (ROC) curve.

6 CONCLUSION AND FUTURE WORK

This paper used several classification algorithms to obtain the best prediction of anemia types based on a dataset of 41 patients. The proposed model is designed around the five most common anemia types and classifies and analyzes the anemia type in a dataset of anemic patients.

The dataset was constructed from Complete Blood Count (CBC) test results. The experiment used four data mining classification algorithms, of which the J48 decision tree and SMO performed best, with 93.75% accuracy at the 60% percentage split. Comparing the selected algorithms with the WEKA Experimenter showed that the J48 decision tree gives the best performance in F-measure, sensitivity, true positive rate and precision, and the lowest false positive rate. Therefore, J48 proved to be potentially the most effective and efficient classification algorithm. Likewise, for the anemia model, the ROC performance chart showed the highest weighted average ROC for the J48 decision tree.

In future work, more data mining algorithms will be used to classify all types of anemia on different datasets and to assess the accuracy of their predictions.

REFERENCES

Ahmad, A., Mustapha, A., Zahadi, E. D., Masah, N., & Yahaya, N. Y. (2011). Comparison between Neural Networks against Decision Tree in Improving Prediction Accuracy for Diabetes Mellitus. Digital Information Processing and Communications (pp. 537–545): Springer.

Amin, M. N., & Habib, M. A. Comparison of Different Classification Techniques Using WEKA for Hematological Data.

Dogan, S., & Turkoglu, I. (2008). Iron-deficiency anemia detection from hematology parameters by using decision trees. International Journal of Science & Technology, 3(1), 85–92.

Elshami, E. H., & Alhalees, A. M. (2012). Automated Diagnosis of Thalassemia Based on Data Mining Classifiers. Paper presented at the International Conference on Informatics and Applications (ICIA2012).

Garner, S. R. (1995). Weka: The Waikato environment for knowledge analysis. Paper presented at the Proceedings of the New Zealand computer science research students conference.

Green, R. (2012). Anemias beyond B12 and iron deficiency: the buzz about other B's, elementary, and nonelementary problems. ASH Education Program Book, 2012(1), 492–498.

Huang, F., Wang, S., & Chan, C.-C. (2012). Predicting disease by using data mining based on healthcare information system. Paper presented at the Granular Computing (GrC), 2012 IEEE International Conference on.

Kaur, P., Singh, M., & Josan, G. S. (2015). Classification and Prediction Based Data Mining Algorithms to Predict Slow Learners in Education Sector. Procedia Computer Science, 57, 500–508.

Kesavaraj, G., & Sukumaran, S. (2013). A study on classification techniques in data mining. Paper presented at the Computing, Communications and Networking Technologies (ICCCNT), 2013 Fourth International Conference on.

Kishore, C. R., Rao, K. P., & Murthy, G. Performance Evaluation of Entropy and Gini using Threaded and Non Threaded ID3 on Anaemia Dataset. Life, 6(10), 10–12.

Prakash, V. A., Ashoka, D., & Aradya, V. M. (2015). Application of Data Mining Techniques for Defect Detection and Classification. Paper presented at the Proceedings of the 3rd International Conference on Frontiers of Intelligent Computing: Theory and Applications (FICTA) 2014.

Sanap, S. A., Nagori, M., & Kshirsagar, V. (2011). Classification of anemia using data mining techniques. Swarm, Evolutionary, and Memetic Computing (pp. 113–121): Springer.

Shashidhara, M. Classification of Women Health Disease (Fibroid) Using Decision Tree algorithm.

Shouval, R., Bondi, O., Mishan, H., Shimoni, A., Unger, R., & Nagler, A. (2014). Application of machine learning algorithms for clinical predictive modeling: a data-mining approach in SCT. Bone Marrow Transplantation, 49(3), 332–337.

Siadaty, M. S., & Knaus, W. A. (2006). Locating previously unknown patterns in data-mining results: a dual data- and knowledge-mining method. BMC Medical Informatics and Decision Making, 6(1), 13.

Vijayarani, S., & Muthulakshmi, M. (2013). Comparative Analysis of Bayes and Lazy Classification Algorithms. International Journal of Advanced Research in Computer and Communication Engineering, 2(8), 3118–3124.

Yilmaz, A., Dagli, M., & Allahverdi, N. (2013). A fuzzy expert system design for iron deficiency anemia. Paper presented at the Application of Information and Communication Technologies (AICT), 2013 7th International Conference on.

Machine Learning for the Prediction of Anemia in Children Under 5 Years of Age by Analyzing their Nutritional Status Using Data Mining

Article

Full-text available

Sep 2023

One of the main public health problems is child malnutrition, since it negatively affects the individual throughout his life, limits the development of society and makes it difficult to eradicate poverty. The first objective of this research is to apply data mining techniques for preprocessing, cleaning, reduction and transformation to a data lake that has allowed analyzing anemia in children under 5 years of age, the second objective is to apply Machine Learning algorithms to obtain the best model to predict anemia in children under 5 years of age. The data set was extracted from the open data platform of the government of Peru that corresponds to South Lima, North Lima, East Lima, Central Lima and rural Lima, which collected a total of 138,369 instances and 36 variables of which 30 are categorical and 6 numeric, being an unbalanced data set. In order to obtain the best predictor variables, the Anova F-test and Chi Square filters were used, and it was possible to reduce them to 10 variables, cases were also carried out without considering one of the filters and both filters.To find the best prediction model, the algorithms have been tested: decision tree, logistic regression, K nearest neighbors, random forest and naive bayes. As a result, we show that the best algorithm to predict anemia in children under 5 years of age is the Naive Bayes algorithm with the highest recall of 74%, precision of 43% and accuracy of 70%.

Risk Prediction of Thalassemia Using Data Mining Classifiers

Article

Full-text available

Sep 2023

Medical data mining is concerned with prediction knowledge, which is a useful method for extracting hidden patterns from given data for specific purposes. Thalassemia is one of the most common inherited blood hematological disorders, and this paper adopted data mining classification techniques to generate results with high performance and accuracy for risk prediction of thalassemia. The dataset for this purpose was collected from NIBD (National Institute of Blood Diseases), a well-known institute and hospital for blood diseases in Karachi, Pakistan. They provided 301 records of CBC test reports containing positive and negative statuses of diagnosis of thalassemia traits. There were many instances in the report, of which 6 were used for our research purpose, i.e. Gender, MCV, HGB, HCT, MCHC, and RDW. The dataset was divided into training and test data using the WEKA tool. Four algorithms of data mining classification, namely J48 Decision Tree, Naïve Bayesian Network, SMO algorithm, and Multilayer Perceptron Neural Network were adopted to train the model and classify the patient having traits of thalassemia from normal persons with the use of the WEKA tool. Results revealed that out of all four algorithms, Naïve Bayes provided results with the highest accuracy of 99%.

Classification of anemia using Harris hawks optimization method and multivariate adaptive regression spline

Article

Full-text available

Jan 2024
NEURAL COMPUT APPL

Data mining methods are important for the diagnosis and prediction of diseases. Early and accurate diagnosis of patients is vital for their treatment. Various methods have been used in the literature to classify anemia. However, due to the different characteristics of patient datasets, changes in dataset sizes, different parameter numbers and features, and different numbers of patient records, algorithm performances vary according to datasets. In this study, the Harris hawks algorithm (HHA) and the multivariate adaptive regression spline (MARS) were used to classify anemia based on blood data of 1732 patients from the Kaggle database of patients with and without anemia. Six different algorithms were proposed to determine the parameters of the linear anemia approximation, namely multilinear form HHA, multilinear quadratic form HHA, multilinear exponential form HHA, first-order MARS model, second-order MARS model, and the best performing MARS model. The performance of the six proposed algorithms has been analyzed and found to be better than the previous studies in the literature.

A new computer‐aided diagnostic method for classifying anaemia disease: Hybrid use of Tree Bagger and metaheuristics

Article

Dec 2023
EXPERT SYST

Anaemia occurs when the haemoglobin (Hgb) value falls below a certain reference range. Its diagnosis and treatment require many blood tests, radiological images, and other tests. By processing patients' medical data with artificial intelligence and machine learning methods, disease predictions can be made for newly ill individuals, and decision‐support mechanisms can be created for physicians from these predictions. These methods are very important in reducing the margin of error in diagnoses made by doctors, and the evaluation of data records in health institutions is likewise important for patients and hospitals. In this study, six hybrid models are proposed to classify non‐anaemia records, Hgb‐anaemia, folate deficiency anaemia (FDA), iron deficiency anaemia (IDA), and B12 deficiency anaemia by combining the artificial intelligence and machine learning methods TreeBagger, Crow Search Algorithm (CSA), Chicken Swarm Optimization Algorithm (CSO), and JAYA. The proposed hybrid models are analysed with two different approaches, with and without the SMOTE technique, to achieve high performance by better emphasizing the importance of parameters. To solve the multiclass anaemia classification problem, fuzzy logic‐based parameter optimization is applied to improve the class‐based accuracy as well as the overall accuracy on the dataset. The proposed methods are evaluated using ROC criteria to build a prediction model that determines the anaemia type of anaemic patients. On the dataset taken from the Kaggle database, the six proposed hybrid methods outperformed other studies using the same dataset and similar studies in the literature.

Anemia classification in Gujarat using data mining

Conference Paper

Jan 2024

Anemia Disease Prediction using Machine Learning Techniques and Performance Analysis

Conference Paper

Feb 2024

XAIA: An Explainable AI Approach for Classification and Analysis of Blood Anemia

Conference Paper

Dec 2023

A survey on the use of machine learning approaches for analysis of anemia

Conference Paper

Jan 2023

A survey on prediction of anemia in pregnant women based on NFHS-4 dataset using ML approaches

Conference Paper

Jan 2023

A Comprehensive Study for Predicting Chronic Kidney Disease, Diabetes, Hypertension, and Anemia by Machine Learning and Feature Engineering Techniques

Conference Paper

Jul 2023

Comparison of Different Classification Techniques Using WEKA for Hematological Data

Article

Jan 2015

Comparative Analysis of Bayes and Lazy Classification Algorithms

Article

Aug 2013

Vijayarani Mohan

Classification and Prediction Based Data Mining Algorithms to Predict Slow Learners in Education Sector

Article

Dec 2015

The field of Educational Data Mining concentrates on prediction more often than on generating exact results for future use. To keep track of changes in curriculum patterns, educational databases must be analyzed regularly. This paper focuses on identifying slow learners among students and presenting them through a predictive data mining model using classification-based algorithms. A real-world dataset from a high school is taken, and the desired potential variables are filtered using WEKA, an open-source tool. The dataset of student academic records is tested on various classification algorithms, such as Multilayer Perceptron, Naïve Bayes, SMO, J48, and REPTree, using WEKA. Statistics are generated for all classification algorithms, and the five classifiers are compared in order to determine their accuracy and find the best-performing classification algorithm. A knowledge flow model of all five classifiers is also shown. This paper showcases the importance of prediction and classification-based data mining algorithms in the field of education and also presents some promising future lines.

A fuzzy expert system design for iron deficiency anemia

Conference Paper

Oct 2013

In this study, an application of fuzzy expert system was introduced. This system was designed to determine the level of iron deficiency anemia and thus, expert physicians were provided with a system to assist them to determine an exact diagnosis prior to their treatment. While realizing the system design, the laboratory records obtained from real patients were examined, the appropriate parameters were specified, the input and output parameters were fuzzified and finally the rule base was built with the expert physician. This study used the centroid defuzzification method together with the Mamdani inference mechanism which is often used in the related literature. With the help of a visual programming language, the level of the disease was displayed in a perceptible way as the result of running the system.

Application of machine learning algorithms for clinical predictive modeling: A data-mining approach in SCT

Article

Oct 2013

Data collected from hematopoietic SCT (HSCT) centers are becoming more abundant and complex owing to the formation of organized registries and incorporation of biological data. Typically, conventional statistical methods are used for the development of outcome prediction models and risk scores. However, these analyses carry inherent properties limiting their ability to cope with large data sets with multiple variables and samples. Machine learning (ML), a field stemming from artificial intelligence, is part of a wider approach for data analysis termed data mining (DM). It enables prediction in complex data scenarios, familiar to practitioners and researchers. Technological and commercial applications are all around us, gradually entering clinical research. In the following review, we would like to expose hematologists and stem cell transplanters to the concepts, clinical applications, strengths and limitations of such methods and discuss current research in HSCT. The aim of this review is to encourage utilization of the ML and DM techniques in the field of HSCT, including prediction of transplantation outcome and donor selection. Bone Marrow Transplantation advance online publication, 7 October 2013; doi:10.1038/bmt.2013.146.

Anemias beyond B12 and iron deficiency: The buzz about other B's, elementary, and nonelementary problems

Article

Dec 2012
Hematology

Ralph Green

The term "unexplained anemia" appears frequently in a request for a hematology consultation. Although most anemia consultations are fairly routine, they occasionally represent challenging problems that require an amalgam of experience, insight, and a modicum of "out-of-the-box" thinking. Problem anemia cases and pitfalls in their recognition can arise for one of several reasons that are discussed in the cases presented herein. "Anemias beyond B12 and iron deficiency" covers a vast domain of everything that lies beyond the commonly encountered anemias caused by simple deficiencies of 2 currently major hematologically relevant micronutrients. However, even these deficiencies may be obscured when they coexist or are not considered because of misleading distractions. They may also be mistakenly identified when other less common nutrient deficiencies occur. I present herein case examples of such situations: a young patient with pancytopenia and schistocytes who was responsive to plasmapheresis, but in whom pernicious anemia was not suspected because of ethnicity and age; a bicytopenic patient with anemia and myelodysplastic features caused by copper deficiency after gastric reduction surgery; and a patient with BM hypoplasia and a dimorphic blood smear who was found to have paroxysmal nocturnal hemoglobinuria. These "pearls" represent but 3 examples of the many varieties of problems in anemia diagnosis and are used to illustrate potential pitfalls and how to avoid them.

Application of Data Mining Techniques for Defect Detection and Classification

Article

Jan 2014

To overcome software development challenges such as delivering a project on time, developing quality software products, and reducing development cost, software companies commonly use defect detection tools to manage the quality of software products. Defects are detected and classified based on their severity, and this can be automated to reduce development time and cost. Engineers and researchers now use data mining techniques to extract useful knowledge from large software repositories. In this paper, a software defect detection and classification method is proposed, and data mining techniques are integrated to identify and classify the defects in a large software repository. Based on defect severity, the proposed method focuses on three layers: core, abstraction, and application. The designed method is evaluated using precision and recall.

A study on classification techniques in data mining

Conference Paper

Jul 2013

Data mining is the process of inferring knowledge from huge volumes of data. It has three major components: clustering or classification, association rules, and sequence analysis. Put simply, classification/clustering analyzes a set of data and generates a set of grouping rules that can be used to classify future data. Data mining extracts information from a dataset and transforms it into an understandable structure. It is the computational process of discovering patterns in large datasets, involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The actual data mining task is the automatic or semi-automatic analysis of large quantities of data to extract previously unknown interesting patterns. Data mining involves six common classes of tasks: anomaly detection, association rule learning, clustering, classification, regression, and summarization. Classification is a major technique in data mining and is widely used in various fields. It is a data mining (machine learning) technique used to predict group membership for data instances. In this paper, we present the basic classification techniques, covering several major kinds of classification methods, including decision tree induction, Bayesian networks, and the k-nearest neighbor classifier. The goal of this study is to provide a comprehensive review of different classification techniques in data mining.

Predicting disease by using data mining based on healthcare information system

Conference Paper

Aug 2012

This paper applies the data mining process to predict hypertension from patient medical records with eight other diseases. A sample of 9862 cases has been studied. The sample was extracted from a real-world Healthcare Information System database containing 309383 medical records. We observed that the distribution of patient diseases in the medical database is imbalanced. An under-sampling technique was applied to generate training datasets, and the data mining tool Weka was used to generate the Naïve Bayesian and J-48 classifiers. In addition, an ensemble of five J-48 classifiers was created to try to improve prediction performance, and rough set tools were used to reduce the ensemble based on the idea of second-order approximation. Experimental results showed a slight improvement of the ensemble approach over pure Naïve Bayesian and J-48 in accuracy, sensitivity, and F-measure.

Automated Diagnosis of Thalassemia Based on DataMining Classifiers

Article

Jan 2012

Iyad H M Alshami
