PREDICTION AND DIAGNOSIS OF DIABETES
DISEASES UTILIZING RANDOM FOREST
CLASSIFIER
Mohammad Saif Raza, M.Tech Scholar, Department of Computer Science & Engineering, Rameshwaram
Institute of Technology & Management, Lucknow, India.
Shubham Mishra, Assistant Professor, Department of Computer Science & Engineering, Rameshwaram
Institute of Technology & Management, Lucknow, India.
Abstract: Diabetes, usually referred to as diabetes mellitus (DM), is a potentially dangerous disorder that affects people all over the world and is marked by high blood sugar levels. Numerous risk factors, such as being overweight, having high blood glucose levels, and not exercising enough, can lead to its development. If diabetes is discovered early, there is a potential that it can be managed and that its effects may be diminished. Machine learning, a branch of artificial intelligence, concerns the creation of computer programs or systems that learn from past experience and improve themselves. The PIMA dataset is used throughout this inquiry; it contains 768 cases, each described by nine attributes. Although each machine learning strategy admits many implementations, we selected three well-known supervised learning techniques for this research: logistic regression, the decision tree, and the random forest. Every one of these algorithms underwent training and testing before being put into this model to make sure it was fit for use. By contrasting and analysing the algorithms' respective performance levels, we determine which of these approaches is the most successful. The performance measures examined include accuracy, precision, recall, and F-measure. The Logistic Regression model achieved the best accuracy score of 74%, together with the highest F-measure of 0.68 and the highest precision of 0.73. The Decision Tree method obtained the highest recall score of 0.61.
Index Terms: Data mining, Diabetes Mellitus, EM algorithm, Random Forest with Feature Selection, ML Algorithm.
I. INTRODUCTION
Diabetes Mellitus (DM) is a chronic condition that necessitates constant medical attention as well as instruction in self-management to lower the risk of negative long-term outcomes and the emergence of complications. By keeping the patient's blood sugar levels under control and treating diabetes with a combination of diet and medication, a wide variety of diabetes-related symptoms and effects can be minimised or eliminated. Two main forms of diabetes are distinguished. Type 1 diabetes, also known as juvenile-onset or insulin-dependent diabetes, develops when the body stops producing the hormone insulin and comes to rely on external sources of it. Because the body needs insulin to utilise the glucose from food, insufficient insulin leads to diabetes. This form occurs most frequently in younger people, especially children and teenagers, and accounts for five to ten percent of all instances of diabetes; patients diagnosed with this type typically require insulin injections to survive. Type 2 diabetes, also known as adult-onset diabetes or non-insulin-dependent diabetes, is the most common type and affects the vast majority of diabetics; it develops when the body cannot produce enough insulin or cannot use the insulin it produces properly. A person's risk of getting type 2 diabetes is increased by variables like being overweight, having a family history of the disease, and being over 40, and the disease is becoming more and more prevalent in adults due in part to poor eating habits [1].
A number of variables, including but not limited to high blood pressure, being overweight, high cholesterol, and a lack of physical activity, can contribute to diabetes, which in turn can lead to complications such as kidney failure and blindness (American Diabetes Association, 2004). The onset of diabetes appears to be influenced by both hereditary and environmental variables, like being overweight, belonging to a particular race or gender, reaching a specific age, and not exercising enough. The increase in the number of diabetic patients worldwide has piqued the interest of researchers in artificial intelligence and biomedical engineering who are working in
the field of diabetes research. This is because there are more diabetic people worldwide now than ever before (Ashwinkumar &
Anandakumar 2012).
According to an objectively conducted assessment, diabetes ranks seventh on the list of illnesses that can cause death. Worldwide, 51 million people have been diagnosed with diabetes, and type 2 diabetes is more than twice as common as type 1. As of November 2007, 20.8 million persons in the United States, including both adults and children, had been diagnosed with diabetes, around 7.0% of the country's population. According to a global survey conducted in 2013 by Boehringer Ingelheim and Eli Lilly and Company, there are 382 million people suffering from type 2 diabetes worldwide and 25.8 million people with type 1 diabetes in the United States. Type 2 diabetes is the most common form of the disease, thought to account for 90 to 95% of all cases, making it a serious issue in both developed and developing nations.
By 2035, the number of people worldwide who have diabetes is expected to rise to 592 million, according to forecasts made by the International Diabetes Federation (IDF). The World Diabetes Atlas estimates that there are currently 285 million people living with diabetes worldwide, a number that may rise to 438 million by the year 2030. Survey findings likewise predict that the number of people with type 2 diabetes will increase by 2030, prompting ominous forecasts for the future (Kenney & Munce 2003). It is further projected that 85% of the world's diabetic patients will reside in developing countries by the year 2030, on the expectation that diabetes prevalence will continue to rise. In India, the number of diabetics is projected to rise from 31.7 million in 2000 to 79.4 million by 2030 (Huy Nguyen et al. 2004). Obtaining an accurate diagnosis as early as possible is one of the most crucial elements of successful diabetes treatment (Mythili et al. 2003).
Diabetes already affects more than 62 million individuals in India, indicating that the disease is rapidly approaching epidemic proportions. According to research by Wild et al., the number of persons with diabetes worldwide is expected to more than double from 171 million in the year 2000 to 366 million in the year 2030, and the disease is expected to spread most quickly in India. India is expected to have up to 79.4 million diabetics by 2030, compared with 42.3 million in China and 30.3 million in the United States, both of which would also see considerable increases in the number of diabetics in their populations. India therefore faces an uncertain future in which diabetes could become a significant burden [2].
Diabetes is a group of illnesses in which the body produces insufficient amounts of insulin, fails to use the insulin that is produced properly, or exhibits a combination of both. When this happens, the amount of glucose in the blood rises because the body cannot move sugar from the blood into the cells. Glucose, a form of sugar found in the blood, is one of the main sources of energy that our bodies need. An accumulation of sugar in the blood, the hallmark of diabetes, can be caused by either insulin resistance or a deficiency in insulin synthesis, and it has a number of detrimental repercussions on one's health [5].
The three main types of diabetes are as follows:
Type 1 diabetes. Also known as insulin-dependent diabetes mellitus, type 1 diabetes is believed to involve an autoimmune process. It develops when the beta cells that produce insulin in the pancreas are mistakenly attacked and destroyed by the body's immune system, resulting in irreversible damage. Type 1 diabetes is the most serious form of the disease, and a genetic predisposition is the most significant factor in its development [5].
Type 2 diabetes. This form develops either when the body is unable to produce enough insulin or when it is unable to utilise the insulin that is produced properly. As a result, sugar does not function as an energy source and accumulates in the blood, which can have a negative impact on one's health. Type 2 diabetes is the most prevalent form of the disease, accounting for about 90% of diagnoses. Although adults are more likely to develop it than children, the condition increasingly affects children as well.
Gestational diabetes. Gestational diabetes is a temporary form of diabetes that can occur during pregnancy in women who have never been diagnosed with diabetes. It affects between two and four percent of all pregnancies and is linked to a higher risk of both the mother and the child developing diabetes later.
The activity of identifying correlations, trends, and anomalies in massive datasets stored in databases and other data repositories is known as "data mining"; pattern recognition and anomaly detection are two techniques used to accomplish this aim. Data warehouses and other data storage facilities frequently hold larger databases than those found in other types of systems. Knowledge discovery, the essential component of data mining, comprises the following procedures: data cleaning, data integration, data selection, data transformation, data mining, evaluation of the patterns discovered in the data, and presentation of the information derived from the data. Data cleaning is the process of removing undesirable components from a dataset, such as noise and missing information; it also includes modelling the noise and accounting for any adjustments made. The data integration phase combines data from a variety of different sources. Data selection then chooses the subset of the data needed to access the precise information required. Data transformation integrates a number of preparation techniques, such as normalization and aggregation, to make the data suitable for mining; once this stage is complete, the data can be mined.
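As a brief illustration of the transformation step, here is a minimal sketch of normalization and aggregation in Python; the file path and column names are assumptions used purely for illustration:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("patient_records.csv")        # hypothetical path
# Normalization: rescale a numeric attribute into [0, 1].
df["glucose_norm"] = MinMaxScaler().fit_transform(df[["glucose"]])
# Aggregation: summarise the normalized attribute per patient group.
summary = df.groupby("age_group")["glucose_norm"].mean()
print(summary)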
"Knowledge discovery" is the process of autonomously generating information in a manner that humans can understand [3].
Computers are able to successfully execute this task. The numerous processes that make up the KDD process are depicted in a
schematic in Figure 1. These actions are displayed one after the other.
Figure 1: Steps of the KDD Process
The phrase "data mining" covers a broad variety of operations, including classification, forecasting, time series analysis, association, clustering, and summarization of data. Each of these tasks relates to either the descriptive or the predictive side of data mining, and a data mining system may perform them separately or in various combinations as part of the data mining process.
Figure 2: Data Mining Tasks
II. LITERATURE REVIEW
In the sphere of healthcare, data mining can be a very useful tool, especially for finding instances of fraud and abuse. It is also practical for improving customer relationship management decisions, allowing hospital staff to deliver better and more reasonably priced medical care, and it enables medical professionals to determine which techniques provide the best quality of care, which is advantageous for therapy. Data mining techniques are frequently used in medical applications such as data modelling for healthcare, executive information systems for healthcare, and projecting treatment costs and resource demand. By examining data from a patient's past, together with information from public health informatics, e-governance frameworks in healthcare, and health insurance, forecasts can be made regarding a patient's future behaviour (Dey & Rautaray 2014).
The naive Bayes algorithm stands out as one of the most intriguing and potentially profitable solutions for mining meaningful information from medical datasets. Although this methodology has been used to analyse medical data, it has both advantages and disadvantages. It is a statistically straightforward classifier that operates under the assumption that attributes are independent of one another. Another crucial feature is its ability to maintain a high rate of classification accuracy even on very large datasets, and its accuracy tends to increase as more features are taken into account, which makes it well suited to medical data. On the other hand, it struggles when the independence assumption between attributes does not hold, and it is significantly affected by noise. Its performance is comparable to that of the decision tree method.
The decision tree algorithm is the best tool to use in circumstances where a medical professional wants to express their decision-making as rules; the categorization of the rules is one of the algorithm's most significant features (Kuo et al 2001). When a physician is seeking to quantify a patient's symptoms, regression analysis of the gathered information can be used to generate a prediction about a particular value, and the method performs well even when there is a very small difference between two groups. The decision tree technique can easily handle criteria such as accuracy, specificity, sensitivity, positive predictive value, and negative predictive value.
The decision tree classifier was employed to obtain the lowest possible error ratio. Methods such as feature selection, cross-validation, reduced-error pruning, and increasing model complexity have been researched and investigated. Feature selection contributes to dimensionality reduction, the process of compressing the attribute space of a feature set, by eliminating data attributes that are judged irrelevant or useless. Cross-validation allows the predictive value to be evaluated more accurately and has been shown to improve classification accuracy even as model complexity increases; it is a more reliable estimation technique. Reduced-error pruning was used to successfully address the overfitting issue that had been damaging the decision tree. The improvement over the prior system includes both an increase in accuracy and a decrease in the error rate, and the decision tree can be built in a significantly shorter amount of time [4].
The Support Vector Machine (SVM) method is a crucial classification technique that can be applied to medical datasets. SVM was developed to avoid overfitting the training data, and with the right kernel choice, such as the Gaussian kernel, the algorithm can concentrate on how similar instances of the classes are to one another rather than on other measures of similarity.
When the SVM is used to classify a new instance, its values are compared against the support vectors of the training samples most similar to the category being considered, which helps guarantee that the new instance is classified correctly. The degree to which the instance resembles each class then determines how it is categorised. SVM is relevant because it can function as a universal approximator for a large range of kernels and because its optimisation has no local minima. Its primary drawback, however, and one of its most serious shortcomings, is that it is difficult to identify the characteristics or data points that have the most influence on a forecast.
The K-Nearest Neighbour (KNN) algorithm is well suited to medical datasets and makes effective use of those databases thanks to a useful combination of characteristics; these characteristics make it suitable for other kinds of databases as well. Because it is so straightforward to use and so dependable, the KNN approach is the one most frequently used for pattern recognition. Nevertheless, there are some situations in which it fails to produce satisfactory results, and in a number of contexts the outcomes can be enhanced by fine-tuning the parameter k. According to Moreno et al. (2003), this parameter, which denotes the number of neighbours, determines how similar a particular value must be to its neighbours. One study investigated kNN with voting and evaluated the results on the prediction of cardiovascular disease. The results show that a kNN implementation can achieve a higher level of accuracy in heart disease prediction than neural networks, even though neural networks are often treated as the industry standard. Combining KNN with a genetic approach allowed heart disease in the dataset to be identified even more precisely.
In one study, whole-body skin temperature measurements and analysis of asymmetric dimethylarginine (ADMA) blood levels were performed on patients with type 2 diabetes mellitus, and both elements were taken into account during the diagnostic procedure. The population was split into two groups, those without complications and those with complications, with one group of individuals regarded as normal. A thermography camera was used to take thermograms of every part of each subject's body without direct contact. Thyroid hormones and other blood constituents were measured biochemically together with a number of other blood parameters, and a score reflecting the propensity to develop diabetes was established. Observation showed that in healthy individuals the lowest skin temperature readings were found on the sole and the back of the foot, while the highest readings were found at the ear. Diabetes patients showed lower mean skin temperatures overall than non-diabetic patients, with significantly lower readings in the nose and tibia areas [3].
The results of numerous studies indicate that a patient's diagnosis can change dramatically depending on which doctor examines them, or even when the same doctor examines them at different times. The use of computerized medical diagnostics enables doctors to diagnose their patients' illnesses more quickly and accurately. One such strategy makes use of the naive Bayesian theorem to make it easier to spot patterns in the data it gathers: the naive Bayesian algorithm estimates the likelihood of a wide range of skin disorders in addition to calculating the proportion of patients that have each dermatological issue.
III. DATA MINING STRATEGIES
The Expectation Maximization (EM) Algorithm
The EM technique can be broken down into two distinct components: the first stage computes an expectation, and the second maximizes that expectation, with the two stages repeated iteratively. After a model is chosen as the first part of the expectation step, any missing labels are estimated. During the maximization phase, labels are chosen and the pertinent models are mapped to those labels, which maximizes the outcome. The purpose of the process is to maximize the expected log-likelihood of the data. Three separate phases can be identified within the operational order [2].
Step 1: The expectation step determines the mean value, denoted by \(\mu\), and infers the values of x and y such that \(x = \frac{0.5}{0.5 + \mu} \, h\) and \(y = \frac{\mu}{0.5 + \mu} \, h\), with the conditions \(x / y = 0.5 / \mu\) and \(h = x + y\).
Step 2: The maximization step determines the fractions x and y and then computes the maximum-likelihood estimate of \(\mu\).
Step 3: Repeat steps 1 and 2 for the following cycle. The clusters were established using cross-validation of the mean and standard deviation for a total of seven different features. Each subject in the group was then tested for a positive or negative diabetes condition. Binary response variables are represented by the numbers 1 and 0 during the data analysis: a value of 1 indicates that diabetes is present (positive), and a value of 0 indicates that diabetes is absent (negative). The EM technique, however, is not very accurate when applied to data sets of higher dimensionality because of numerical imprecision [2].
Figure 3: EM Algorithm Steps
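To make the expectation and maximization loop concrete, here is a minimal sketch of EM-based clustering on a PIMA-style feature subset using scikit-learn's GaussianMixture, which runs EM internally; the file path and column names are assumptions for illustration:

import pandas as pd
from sklearn.mixture import GaussianMixture

df = pd.read_csv("pima_diabetes.csv")          # hypothetical path
X = df[["Glucose", "BMI", "Age"]].values        # assumed feature subset

# E and M steps alternate internally until the log-likelihood converges.
gm = GaussianMixture(n_components=2, max_iter=100, random_state=0)
labels = gm.fit_predict(X)                      # cluster assignment per record
print(gm.means_)                                # per-cluster means (the mu values)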
K Nearest Neighbour Algorithm
Due to its relative simplicity and high degree of accuracy, the K-Nearest Neighbour (KNN) method has been used in a wide range of data analysis applications, including machine learning, pattern recognition, data mining, and database management. According to recent rankings, it is one of the top ten algorithms in the field of data mining (Wu et al 2008). KNN is referred to as a "lazy learning" classification method. Even in its most basic version, the algorithm can be employed to predict the label of any kind of new instance [5].
Using KNN classification, samples are assigned classes based on how similar they are to one another. It is an example of "lazy learning", which approximates the target function locally and defers computation until classification time. K-Nearest Neighbours is best used for classification and clustering applications, and numerous researchers have found that it produces outcomes that meet or surpass expectations on a wide range of datasets. The Pima Indian diabetes dataset is genuinely challenging to work with because so many values are missing. In the KNN-based approach, missing values are filled in using Euclidean distance by looking at the immediately neighbouring columns of data; if the corresponding value in the closest neighbouring column is likewise missing, the value from the next nearest neighbouring column is used in its place. In contrast to other approaches, this method is not only straightforward but also provides a sizeable competitive edge. One disadvantage of KNN is that it does not use probabilistic semantics, which would enable the use of posterior prediction probabilities.
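The imputation scheme just described can be approximated with scikit-learn's KNNImputer; a minimal sketch, assuming the common convention that zeros in certain PIMA columns encode missing values (the file path and column names are assumptions):

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.read_csv("pima_diabetes.csv")  # hypothetical path
# Zero is physiologically impossible for these columns, so treat it as missing.
cols = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
df[cols] = df[cols].replace(0, np.nan)
imputer = KNNImputer(n_neighbors=5)    # distances are Euclidean by default
df[cols] = imputer.fit_transform(df[cols])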
Many authors have contributed to recent upgrades of KNN in an effort to make it more useful. The class-wise KNN (C-KNN) technique has been proposed and validated on the Pima Indian diabetes dataset: a class label is applied to the testing data using the shortest class-wise distance, and the C-KNN algorithm reached an accuracy level of 78.16%. To make it easier to classify the diabetes cases in the Pima Indian database, the K-means and KNN classification methods have also been integrated into a single model known as the amalgam KNN model. Eliminating the noise enhances the quality of the data while increasing the quantity of work that can be completed in the same period of time: the cases that were incorrectly grouped are excluded using the K-means algorithm, and the classification is then completed using the KNN algorithm.
The value of k in the KNN algorithm is determined by the data. A greater value of k can help reduce noise during categorization, and the cross-validation approach can be used to choose an appropriate value for k. By first determining the k value and then performing ten-fold cross-validation, a classification accuracy of 97.4% was attained [6]. A graphic representation of the basic idea underpinning the KNN algorithm is shown in Figure 4.
Figure 4: K nearest neighbor algorithm
The KNN algorithm:
Step 1: Each new instance is compared with the already available cases based on a distance measure and is then classified using the k value.
Step 2: The more similar two instances are to one another, the smaller the distance between them, and vice versa.
Step 3: Record the k value, the distances, and the instances; on the basis of these observations, new instances are assigned to the appropriate category.
Step 4: The k value serves as the foundation for the forecast, so the KNN classifier is k-dependent. Here k denotes the number of nearest neighbours, and depending on its value the results may differ [7].
Step 5: The classification accuracy on the Pima Indian Diabetes Dataset (PIDD) can be improved by tuning the value of the parameter k.
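A minimal sketch of these steps on the PIDD, with k chosen by ten-fold cross-validation as described above (the file path and column names are assumptions):

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv("pima_diabetes.csv")          # hypothetical path
X, y = df.drop(columns="Outcome"), df["Outcome"]

best_k, best_acc = 0, 0.0
for k in range(1, 26, 2):                      # odd k values avoid voting ties
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10).mean()
    if acc > best_acc:
        best_k, best_acc = k, acc
print("best k:", best_k, "accuracy:", round(best_acc, 3))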
K-Means Algorithm
Unsupervised algorithms are those that can perform effectively on unlabelled samples in the absence of direct supervision; the input is known, but the output cannot be predicted in advance. The K-means algorithm is one of many unsupervised learning techniques. It requires the n objects in the data collection, which are to be divided into k clusters, along with an input parameter giving the number of clusters. The algorithm begins by selecting k initial centres at random. Each object is then placed in the cluster whose centre it is closest to, with the Euclidean distance recommended for determining which centre is nearest. The new cluster centres are then identified by averaging the items contained within each of the k clusters. This process is repeated until the k cluster centres no longer change. The objective function that the K-means algorithm seeks to minimise is the sum of squared errors (SSE) [8], defined as follows:
\(E = \sum_{i=1}^{k} \sum_{p \in C_i} \lvert p - m_i \rvert^2\)    (1)
Here, E stands for the total squared error over all objects assigned to cluster means, p is an item assigned to cluster Ci, and mi is the mean of cluster Ci. The total number of records in the dataset is denoted by n, while k indicates the number of clusters.
Input: D, the input data set.
Output: k clusters.
Step 1: Pick k items at random from the collection D to serve as the initial cluster centres.
Step 2: Repeat the steps below until the cluster means no longer change and the minimum error E has been obtained.
Step 3: For each object in D, compare its value with the mean of each of the k clusters.
Step 4: Assign each object to the cluster whose mean is most similar to it.
Step 5: Recompute the mean of the objects in each of the k clusters.
Step 6: Update the cluster means based on the new assignments.
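A minimal sketch of this procedure with scikit-learn's KMeans, which also reports the SSE of equation (1) through its inertia_ attribute (the file path and feature choice are assumptions):

import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_csv("pima_diabetes.csv")      # hypothetical path
X = df[["Glucose", "BMI"]].values           # assumed feature pair

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)                  # the cluster means m_i
print(km.inertia_)                          # sum of squared errors E of equation (1)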
Random Forest Algorithm
Figure 5: Flow graph of Random Forest Algorithm
To get things going, let's look at the supervised classification method known as Random Forest. As the name indicates, the goal is to build a forest of decision trees and to make it random. A forest's predictive ability is correlated with the number of trees present: the more trees, the more precise the results will be. One thing to keep in mind, though, is that building the forest is not the same as building a single decision tree using information gain or the Gini index approach.
A decision tree is a helpful tool for aiding decision-making, using a tree-shaped graph to illustrate the potential outcomes. Given a training dataset that includes targets and features, the decision tree automatically generates a set of rules, and precise forecasts can be made by following these rules. As an example, suppose you are attempting to ascertain whether your daughter will enjoy watching an animated movie. You would compile a list of animated films she has liked in the past and use specific elements of those films as inputs for your forecast; the decision tree technique then generates the rules, and once you have entered the new movie's qualities, the output tells you whether or not your daughter is likely to enjoy it. Information gain and Gini index computations are used throughout the process of identifying these split nodes and creating the rules.
Random Forest was initially created by Leo Breiman. The Random Forest rule, a supervised classification rule, consists of two stages: the first is the creation of the random forest, and the second is making a prediction using the random forest classifier developed in the first stage [11]. The pseudo code for building the forest is as follows (see the sketch after this list):
Step 1: Randomly select R features out of the total m features, where R << m.
Step 2: Among the R features, find the node that provides the most optimal split point.
Step 3: Using the most effective split, divide the node into daughter nodes.
Step 4: Repeat steps 1 to 3 until the desired number of nodes has been reached.
Step 5: Construct the forest by performing steps 1 to 4 n times in order to produce n trees.
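A minimal sketch of the two-stage rule with scikit-learn's RandomForestClassifier, where max_features plays the role of R (the file path, split, and hyperparameters are assumptions):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("pima_diabetes.csv")          # hypothetical path
X, y = df.drop(columns="Outcome"), df["Outcome"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Stage 1: grow n trees, each split drawn from a random feature subset (R << m).
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_tr, y_tr)

# Stage 2: predict by aggregating the trees' votes.
print(forest.score(X_te, y_te))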
IV. DATA SET DESCRIPTION
The study of diabetes mellitus has been conducted among the Pima Indians of the Gila River Indian Community in Central Arizona since 1965, with examinations repeated every two years. These tests, which include an oral glucose tolerance test and various assessments of diabetes complications and other medical conditions, provide the majority of the information regarding the prevalence, incidence, risk factors, and pathogenesis of diabetes in the Pima Indian population (Leslie et al 2004). Many research findings appear to be specific to the Pima people. Metabolic characteristics of Pima Indians with type 2 diabetes include obesity, insulin resistance, impaired insulin secretion, and a higher rate of endogenous glucose production [10].
The Pima Indian diabetes dataset contains measurements for 768 people along with an indication of whether each person eventually developed diabetes. All of the patients were Pima Indian women who had reached the age of 21. Eight characteristics determine whether a record belongs to the group of those with diabetes (tested positive) or those without diabetes (tested negative). The dataset consists of 268 patients with diabetes (class = 1) and 500 patients without diabetes (class = 0).
Table 1: Characteristics of the PIMA Indian Dataset

Data Set             | No. of Examples | Input Attributes | Output Classes | Total Attributes
Pima Indian Diabetes | 768             | 8                | 2              | 9
The aim of this data set is to identify diabetic Pima Indians: based on personal details such as age and the number of pregnancies, and on the results of medical tests such as blood pressure, body mass index, and glucose tolerance, determine whether a Pima Indian individual has diabetes. The attributes are detailed below [16].
1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-hour serum insulin (mu U/ml)
6. Body mass index (BMI) (weight in kg / (height in m)^2)
7. Diabetes pedigree function
8. Age (years)
9. Class variable (0 or 1)
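For illustration, a minimal sketch of loading the data set with these nine attributes (the CSV path and column names are assumptions):

import pandas as pd

cols = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
        "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
df = pd.read_csv("pima_diabetes.csv", names=cols, header=0)  # hypothetical path
print(df["Outcome"].value_counts())   # expect 500 negative, 268 positive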
V. RESULTS AND DISCUSSION
The usefulness of the suggested technique is assessed in this section. To evaluate the viability of the suggested protocol, simulated implementations of the proposed algorithms are employed. TensorFlow and a number of other Python libraries are used for this purpose; the Python programming language is the foundation of our work.
Synthetic Minority Over-Sampling Technique (SMOTE)
SMOTE analysis is carried out in order to balance the number of samples in each class. The figures below show the item counts before and after the SMOTE analysis.
Figure 6: Item counts before SMOTE
Figure 7: Item counts after SMOTE
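A minimal sketch of the balancing step using the imbalanced-learn implementation of SMOTE (the file path and column names are assumptions):

from collections import Counter
import pandas as pd
from imblearn.over_sampling import SMOTE

df = pd.read_csv("pima_diabetes.csv")   # hypothetical path
X, y = df.drop(columns="Outcome"), df["Outcome"]
print(Counter(y))                       # counts before, e.g. Counter({0: 500, 1: 268})
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_bal))                   # counts after: both classes equal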
Ensemble Learning
The performance chart below shows that the accuracy of the ensemble learning model in identifying normal and abnormal diabetic cases is 0.74.
Figure 8 shows the confusion matrix of the ensemble learning model. The diagonal elements give the count of correctly classified items, and the off-diagonal elements give the count of misclassified items.
Figure 8: Confusion matrix
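The paper does not specify which learners form its ensemble, so purely as a hedged illustration, here is a minimal sketch of a soft-voting ensemble and its confusion matrix in scikit-learn (the constituent models, file path, and split are assumptions):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

df = pd.read_csv("pima_diabetes.csv")          # hypothetical path
X, y = df.drop(columns="Outcome"), df["Outcome"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

ens = VotingClassifier([("lr", LogisticRegression(max_iter=1000)),
                        ("rf", RandomForestClassifier(random_state=0)),
                        ("nb", GaussianNB())], voting="soft")
ens.fit(X_tr, y_tr)
print(confusion_matrix(y_te, ens.predict(X_te)))  # diagonal = correct counts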
Logistic Regression
The performance chart below shows that the accuracy of the Logistic Regression model in identifying normal and abnormal diabetic cases is 0.70.
Figure 9 shows the confusion matrix of the Logistic Regression model. The diagonal elements give the count of correctly classified items, and the off-diagonal elements give the count of misclassified items.
Figure 9: Confusion matrix
Random Forest
The performance chart below shows that the accuracy of the Random Forest model in identifying normal and abnormal diabetic cases is 0.76.
Figure 10 shows the confusion matrix of the Random Forest model. The diagonal elements give the count of correctly classified items, and the off-diagonal elements give the count of misclassified items.
Figure 10: Confusion matrix
Gaussian NB
The performance chart below shows that the accuracy of the Gaussian NB model in identifying normal and abnormal diabetic cases is 0.72.
Figure 11 shows the confusion matrix of the Gaussian NB model. The diagonal elements give the count of correctly classified items, and the off-diagonal elements give the count of misclassified items.
Figure 11: Confusion matrix
Artificial Neural Network (ANN)
The performance chart below shows that the accuracy of the ANN model in identifying normal and abnormal diabetic cases is 0.34.
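The paper does not describe its network, so the following is only a hedged sketch of a small ANN for this task in TensorFlow/Keras; the architecture, hyperparameters, and file path are assumptions:

import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split

df = pd.read_csv("pima_diabetes.csv")          # hypothetical path
X, y = df.drop(columns="Outcome").values, df["Outcome"].values
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # diabetic vs non-diabetic
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_tr, y_tr, epochs=50, batch_size=32, verbose=0)
print(model.evaluate(X_te, y_te, verbose=0))          # [loss, accuracy]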
Table 2: Comparison of Algorithms (all values are proportions)

Method              | Accuracy | Sensitivity | Specificity
Ensemble learning   | 0.74     | 0.75        | 0.71
Logistic Regression | 0.70     | 0.71        | 0.69
Random Forest       | 0.76     | 0.77        | 0.74
Gaussian NB         | 0.72     | 0.76        | 0.65
ANN                 | 0.34     | 0.65        | 0.67
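For clarity, a minimal sketch showing how the three table metrics follow from a binary confusion matrix (the labels below are toy values used purely for illustration):

from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]        # toy ground-truth labels
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]        # toy predictions
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()  # sklearn order: [[TN, FP], [FN, TP]]

accuracy    = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)               # recall on the diabetic class
specificity = tn / (tn + fp)               # recall on the non-diabetic class
print(accuracy, sensitivity, specificity)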
VI. CONCLUSION
The number of data mining tools is growing, and with them the number of machine intelligence algorithms. Data mining can be applied to patient medical records, and a substantial amount of such data has been acquired and organized in the healthcare domain, yet diabetic datasets remain among the least analysed. Throughout this study, data mining approaches are used to address and resolve the problem of diabetes prediction. Three distinct predictive models for diabetes have been shown to be beneficial, centred on the well-known Random Forest classification technique. The experiments carried out in Python on the Pima Indian diabetes data set make it abundantly clear that the performance of the suggested classification methods is greatly improved.
References
[1] C. Kalaiselvi and G. M. Nasira, "A new approach of diagnosis of diabetes and prediction of cancer using ANFIS," IEEE Computing and Communicating Technologies, pp. 188-190, 2014.
[2] W. L. Kenney and T. A. Munce, "Invited review: aging and human temperature regulation," Journal of Applied Physiology, vol. 95, no. 6, pp. 2598-2603, 2003.
[3] R. Kaundal, A. S. Kapoor, and G. P. Raghava, "Machine learning techniques in disease forecasting: a case study on rice blast prediction," BMC Bioinformatics, vol. 7, no. 1, p. 485, Nov 2006.
[4] S. W. Franklin and S. E. Rajan, "Diagnosis of diabetic retinopathy by employing image processing technique to detect exudates in retinal images," IET Image Processing, vol. 8, pp. 601-609, October 2014.
[5] L. Zhou, Y. Zhao, J. Yang, Q. Yu, and X. Xu, "Deep multiple instance learning for automatic detection of diabetic retinopathy in retinal images," IET Image Processing, vol. 12, pp. 563-571, April 2018.
[6] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, pp. 1097-1105, 2012.
[7] W. Sandham, E. Lehmann, D. Hamilton, and M. Sandilands, "Simulating and predicting blood glucose levels for improved diabetes healthcare," IET Conference Proceedings, pp. 121-121(1).
[8] L. C. Shofwatul Uyun, "Feature selection mammogram based on breast cancer mining," IJECE, vol. 8, pp. 60-69, Feb 2018.
[9] N. K. A. Hussein Attya Lafta and Zainab Falah Hasan, "The classification of medical datasets using back propagation neural network powered by genetic-based feature selector," IJECE, vol. 9, Apr 2019.
[10] V. D. Komal Kumar N and R. Lakshmi Tulasi, "An ensemble multi-model technique for predicting chronic kidney disease," IJECE, vol. 9, Apr 2019.
[11] F. J. Rini Widyaningrum and Sri Lestari, "Image analysis of periapical radiograph for bone mineral density prediction," IJECE, vol. 8, pp. 2083-2090, Aug 2018.
[12] UCI Machine Learning Repository, archive.ics.uci.edu/ml/datasets.html.
[13] K. K. Manjusha, K. Sankaranarayanan, and P. Seena, "Prediction of different dermatological conditions using naive Bayesian classification," International Journal of Advanced Research in Computer Science and Software Engineering, vol. 4, no. 1, pp. 864-868, 2014.
[14] H. O. Al-Sakran, "Framework architecture for improving healthcare information systems using agent technology," International Journal of Managing Information Technology, vol. 7, no. 1, pp. 17-31, 2015.
[15] T. Mythili, D. Mukherji, N. Padalia, and A. Naidu, "A heart disease prediction model using SVM-decision trees-logistic regression (SDL)," International Journal of Computer Applications, vol. 68, no. 16, pp. 11-15, 2013.
[16] D. S. Kumar, G. Sathyadevi, and S. Sivanesh, "Decision support system for medical diagnosis using data mining," International Journal of Computer Science Issues, vol. 8, no. 3, pp. 147-153, 2011.
[17] S. Palaniappan and R. Awang, "Intelligent heart disease prediction system using data mining techniques," Proceedings of the IEEE International Conference on Computer Systems and Applications, pp. 108-115, 2008.
[18] N. Suguna and K. Thanushkodi, "An improved K-nearest neighbor classification using genetic algorithm," International Journal of Computer Science, vol. 7, no. 2, pp. 18-21, 2010.
Article
Diabetic retinopathy (DR) is a microvascular complication of long-term diabetes and it is the major cause of visual impairment because of changes in blood vessels of the retina. Major vision loss because of DR is highly preventable with regular screening and timely intervention at the earlier stages. The presence of exudates is one of the primitive signs of DR and the detection of these exudates is the first step in automated screening for DR. Hence, exudates detection becomes a significant diagnostic task, in which digital retinal imaging plays a vital role. In this study, the authors propose an algorithm to detect the presence of exudates automatically and this helps the ophthalmologists in the diagnosis and follow-up of DR. Exudates are normally detected by their high grey-level variations and they have used an artificial neural network to perform this task by applying colour, size, shape and texture as the features. The performance of the authors algorithm has been prospectively tested by using DIARETDB1 database and evaluated by comparing the results with the ground-truth images annotated by expert ophthalmologists. They have obtained illustrative results of mean sensitivity 96.3%, mean specificity 99.8%, using lesion-based evaluation criterion and achieved a classification accuracy of 99.7%.