Machine Learning-Based Diabetes Prediction Using Missing Value Impotency

June 2022

DOI:10.1007/978-981-16-8739-6_51

Publisher: Springer

Authors:

Santi Kumari Behera

Veer Surendra Sai University of Technology

Prabira Kumar Sethy

Guru Ghasidas University

Dayal Kumar Behera

Silicon Institute of Technology

Chapter 51

Machine Learning-Based Diabetes Prediction Using Missing Value Impotency

Santi Kumari Behera, Julie Palei, Dayal Kumar Behera, Subhra Swetanisha, and Prabira Kumar Sethy

Abstract Diabetes is a chronic disease that has affected a growing number of people over the years and causes a large number of deaths each year. Because late diagnosis leads to severe health complications and many of these deaths, developing methods for early detection of this pathology is critical. Machine learning (ML) techniques aid in the early detection and prediction of diabetes. However, ML models do not perform well when the dataset contains missing values, and imputing those values improves the outcome. This article proposes an ensemble method with a strong emphasis on missing value imputation. Numerous ML models are used to validate the proposed framework. The experiments use the Pima Indian Diabetes Dataset, which contains information about people with and without diabetes. TPR, FNR, PPV, FDR, overall accuracy, training time, and AUC are used to evaluate the performance of the 24 ML methods. The results show that subspace KNN outperforms the other methods, with an accuracy of 85%. The results are further verified using receiver operating characteristic (ROC) curves. Using missing value imputation for data pre-processing and classification is shown to outperform state-of-the-art algorithms in diabetes detection.

S. K. Behera

Department of CSE, VSSUT Burla, Burla, Odisha, India

J. Palei · P. K. Sethy (✉)

Department of Electronics, Sambalpur University, Burla, Odisha 768019, India

J. Palei

e-mail: 19mscel05@suiit.ac.in

D. K. Behera

Department of CSE, Silicon Institute of Technology, Bhubaneswar, Odisha, India

S. Swetanisha

Department of CSE, Trident Academy of Technology, Bhubaneswar, Odisha, India

S. Dehuri et al. (eds.), Biologically Inspired Techniques in Many Criteria Decision Making, Smart Innovation, Systems and Technologies 271, https://doi.org/10.1007/978-981-16-8739-6_51

51.1 Introduction

“According to the World Health Organization (WHO), around 1.6 million people die each year from diabetes [1].” “Diabetes is a type of disease that arises when the human body’s blood glucose/blood sugar level is abnormally high. Type 1 diabetes, commonly known as insulin-dependent diabetes, is most frequently diagnosed in childhood [2].” In Type 1, the body’s immune system attacks the pancreas and destroys its insulin-producing cells, so the pancreas stops producing insulin. “Type 2 diabetes is often referred to as adult-onset diabetes or non-insulin-dependent diabetes [3].” Although it is milder than Type 1, it is nevertheless extremely damaging and can result in serious complications, particularly in the small blood vessels of the eyes, kidneys, and nerves [4]. A third type, gestational diabetes [5], develops from increased blood sugar levels in pregnant women whose diabetes was not recognized earlier. “Some authors have created and validated a risk score for primary cesarean delivery (CD) in women with gestational diabetes. In women with gestational diabetes mellitus (GDM), a risk score based on nulliparity, excessive gestational weight gain, and usage of insulin can be used to determine the likelihood of primary CD [6].” “Ghaderi et al. worked on the effect of smartphone education on the risk perception of Type 2 diabetes in a woman with GDM [2].”

Diabetes mellitus is associated with long-term consequences, and people with diabetes face an increased chance of developing a variety of health problems. Glucose levels in the human body typically range between 70 and 99 mg per deciliter [1]. “If the glucose level is more significant than 126 mg/dl, diabetes is present. Prediabetes is defined as a blood glucose level of 100–125 mg/dl [7].”
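The glucose ranges quoted above can be written as a small lookup. The following is only an illustrative sketch: the function name is hypothetical, and because the text leaves a reading of exactly 126 mg/dl and readings below 70 mg/dl unspecified, the handling of both is an assumption made here.

```python
def classify_fasting_glucose(mg_per_dl: float) -> str:
    """Map a blood glucose reading (mg/dl) to a coarse category.

    Ranges follow the text: 70-99 normal, 100-125 prediabetes.
    Assumptions (not stated in the text): a reading of exactly 126 counts
    as diabetes, and anything below 70 is reported as "below normal range".
    """
    if mg_per_dl < 70:
        return "below normal range"
    if mg_per_dl < 100:
        return "normal"
    if mg_per_dl < 126:
        return "prediabetes"
    return "diabetes"


if __name__ == "__main__":
    # Hypothetical readings, one per category of interest.
    for reading in (85, 110, 140):
        print(reading, classify_fasting_glucose(reading))
```

Such a rule is a description of the thresholds only; it is not a diagnostic procedure.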

Diabetes is influenced by height, weight, hereditary factors, and insulin [8], but the primary factor evaluated is blood sugar content. Early detection is the only way to avoid complications. Predictive analytics strives to improve disease diagnosis accuracy, patient care, resource optimization, and clinical outcomes. Numerous researchers are conducting experiments to diagnose disease using ML classification algorithms such as J48, SVM, Naïve Bayes, and decision tree. Researchers have demonstrated that machine learning algorithms [9, 10] perform well at diagnosing various diseases. In [8], Naïve Bayes, SVM, and decision tree classifiers are applied and assessed for predicting diabetes in a patient using the Pima Indian Diabetes Dataset (PIDD).

ML techniques can also be used to identify individuals at elevated risk of Type 2 diabetes or prediabetes in the absence of established impaired glucose regulation [11]. In that study, body mass index, waist-hip ratio, age, systolic and diastolic blood pressure, and diabetes heredity were the most impactful factors. High values of these features, as well as diabetes heredity, were associated with an increased risk of Type 2 diabetes.

ML techniques aid in the early detection and prediction of diabetes. However, ML models do not perform well with missing values in the dataset. Therefore, this work emphasizes missing value imputation. The objectives of this work are as follows:

• Study the impact of missing values in the PIMA diabetes dataset.

• Perform missing value imputation by replacing each missing value with the mean value of the group.

• Design an ensemble subspace KNN for classifying diabetes.

• Compare various traditional classifiers against the ensemble classifier.
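As a first step toward the objective of studying missing values, one can count how many entries of each attribute are recorded as 0. The sketch below is illustrative only: the column names are assumed to follow the header of the Kaggle release of the dataset [21], and the rows are made-up values, not records from the dataset.

```python
# Attributes for which a recorded 0 is treated as a missing value.
# Column names are an assumption based on the Kaggle CSV header.
ZERO_MEANS_MISSING = [
    "Glucose", "BloodPressure", "SkinThickness", "Insulin",
    "BMI", "DiabetesPedigreeFunction", "Age",
]


def count_zero_entries(rows, columns=ZERO_MEANS_MISSING):
    """Return {column: number of rows whose value in that column is 0}."""
    return {col: sum(1 for row in rows if row[col] == 0) for col in columns}


# Hypothetical rows for illustration; these are not real patient records.
TOY_ROWS = [
    {"Glucose": 120, "BloodPressure": 70, "SkinThickness": 0, "Insulin": 0,
     "BMI": 31.0, "DiabetesPedigreeFunction": 0.4, "Age": 33},
    {"Glucose": 0, "BloodPressure": 64, "SkinThickness": 22, "Insulin": 0,
     "BMI": 0, "DiabetesPedigreeFunction": 0.2, "Age": 25},
    {"Glucose": 95, "BloodPressure": 0, "SkinThickness": 30, "Insulin": 88,
     "BMI": 27.5, "DiabetesPedigreeFunction": 0.6, "Age": 41},
]

if __name__ == "__main__":
    for column, n_zero in count_zero_entries(TOY_ROWS).items():
        print(f"{column}: {n_zero} zero entries")
```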

51.2 Related Works

The article [12] describes the construction and validation of 10-year risk prediction models for Type 2 diabetes mellitus (T2DM). Data collected in 12 European nations (SHARE) are used to validate the model. The dataset included 53 variables covering behavioral, physical, and mental health aspects of participants aged 50 or older. To account for the highly imbalanced outcome variable, a logistic regression model was developed in which each instance was weighted according to the inverse percentage of its outcome label. The authors used a pooled sample of 16,363 people to develop and evaluate a global regularized logistic regression model with an area under the receiver operating characteristic curve of 0.702. Continuous glucose monitoring (CGM) devices continue to have a temporal delay, which can result in clinically significant differences between the CGM reading and the actual blood glucose level, particularly during rapid changes. In [13], the authors used an artificial neural network (NN) regression technique to forecast CGM results.

Diabetes can also be a risk factor for developing other diseases such as heart attack, renal impairment, and partial blindness. Kayal Vizhi and “Aman Dash worked on smart sensors and ML techniques such as random forest and extreme gradient boosting for predicting whether a person would get diabetes or not [14].” Mujumdar et al. [5] developed a diabetes prediction model that took into account external risk variables for diabetes in addition to standard risk factors such as glucose, BMI, age, and insulin. The new dataset improves classification accuracy when compared to the available PIMA dataset. The study [15] covers algorithms such as linear regression, decision trees, and random forests, along with their advantages for early identification and treatment of disease, and discusses their predictive accuracy. Mitushi Soni and Sunita Varma [16] forecasted diabetes using ML classification and ensemble approaches. The accuracy varies from model to model, and their findings indicate that random forest outperformed the other ML algorithms in terms of accuracy. “Jobeda Jamal Khanam and Simon Foo conducted research using the PIMA dataset. The collection comprises data on 768 patients and their nine unique characteristics. On the dataset, seven ML algorithms were applied to predict diabetes. They concluded that a model combining logistic regression (LR) and support vector machine (SVM) effectively predicts diabetes [1].”

“Varga used NCBI PubMed to conduct a systematic search. First, articles that had the words “diabetes” and “prediction” were chosen. Next, the authors searched for metrics relating to predictive statistics in all abstracts of original research articles published in the field of diabetes epidemiology. To illustrate the distinction between association and prediction, simulated data were constructed. It is demonstrated that biomarkers with large effect sizes and small P values might have low discriminative utility [17].” The article [18] attempts to synthesize the majority of the work on ML and data mining techniques used to forecast diabetes and its complications. Hyperglycemia is a symptom of diabetes caused by insufficient insulin secretion and/or use. For experimental purposes, Kalagotla et al. [19] designed a novel stacking method based on multi-layer perceptron, SVM, and LR. The stacking strategy combined the intelligent models and improved model performance. The proposed stacking strategy outperformed the other models when compared with AdaBoost. The authors of [20] developed a pipeline for predicting diabetic individuals using deep learning techniques. It incorporates data augmentation using a variational autoencoder (VAE), feature augmentation via a sparse autoencoder (SAE), and classification via a convolutional neural network.
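The instance weighting reported for [12], where each instance is weighted by the inverse share of its outcome label, can be illustrated as follows. This is a minimal sketch of one common reading of that scheme, not the implementation used in [12]; the function name and the toy labels are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each instance by 1 / (fraction of instances sharing its label).

    With these weights every class contributes the same total weight,
    which counteracts class imbalance during model fitting.
    """
    counts = Counter(labels)
    n_total = len(labels)
    return [n_total / counts[label] for label in labels]


if __name__ == "__main__":
    toy_labels = [0, 0, 0, 1]  # hypothetical, imbalanced 3:1
    print(inverse_frequency_weights(toy_labels))
```

In this toy case the single positive instance receives three times the weight of each negative instance, so both classes carry equal total weight.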

51.3 Material and Methods

The dataset and the adopted methodology are described in the following subsections.

51.3.1 Dataset

The Pima Indians Diabetes Database is used in this paper. “The National Institute of Diabetes and Digestive and Kidney Diseases first collected this dataset. The dataset’s purpose is to diagnostically predict if a patient has diabetes or not, using particular diagnostic metrics contained in the dataset. Therefore, numerous limits on the selection of these examples from a broader database were imposed. All patients at this facility are females who are at least 21 years old and of Pima Indian ancestry. The datasets contain a variety of medical predictor variables and one outcome variable. The number of pregnancies, their BMI, insulin level, and age are all predictor variables of the patient [21].”

51.3.2 Proposed Model

This research focuses on improving the outcomes and accuracy of diabetes detection. The proposed approach is depicted in Fig. 51.1. Numerous classical ML classifiers and related ensemble variants are used to classify the disease as positive or negative. Many attributes in the original dataset have entries of 0. Following expert advice, these zeros must be treated as missing values, because attributes such as glucose, blood pressure, skin thickness, insulin, BMI, diabetes pedigree function, and age cannot be zero. Hence, each missing value is imputed with the mean value of the group. After that, the class label of the dataset is encoded with two values: diabetic is set to 1 and non-diabetic to 0. Then, the dataset is divided into a validation set and a test set. The validation data are used to train the classifier in two scenarios: in the first, the classifier is trained on data with missing values, and in the second, on data after missing value imputation. The test data are not pre-processed with missing value imputation; they are passed directly to the model for prediction.

Fig. 51.1 Proposed framework for diabetes prediction
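The imputation step can be sketched in a few lines. The text only says that a missing value is replaced by "the mean value of the group"; the sketch below assumes that the group is the set of rows with the same class label and that the mean is taken over the non-zero entries. The function name, the column names, and the toy rows are hypothetical.

```python
from collections import defaultdict


def impute_zeros_with_group_mean(rows, columns, group_key="Outcome"):
    """Return a copy of `rows` in which each 0 in `columns` is replaced by
    the mean of the non-zero values of that column within the same group.

    Assumption: "group" means rows sharing the same class label.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        for col in columns:
            if row[col] != 0:
                sums[(row[group_key], col)] += row[col]
                counts[(row[group_key], col)] += 1

    imputed = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        for col in columns:
            key = (row[group_key], col)
            # Impute only when the group has at least one observed value.
            if row[col] == 0 and counts[key] > 0:
                new_row[col] = sums[key] / counts[key]
        imputed.append(new_row)
    return imputed


# Hypothetical rows for illustration; these are not real patient records.
TOY_ROWS = [
    {"Glucose": 100, "BMI": 30.0, "Outcome": 0},
    {"Glucose": 0, "BMI": 20.0, "Outcome": 0},
    {"Glucose": 120, "BMI": 0, "Outcome": 0},
    {"Glucose": 160, "BMI": 40.0, "Outcome": 1},
    {"Glucose": 0, "BMI": 36.0, "Outcome": 1},
]

if __name__ == "__main__":
    for row in impute_zeros_with_group_mean(TOY_ROWS, ["Glucose", "BMI"]):
        print(row)
```

Under this reading the group mean depends on the class label, which is not available for unseen samples at prediction time.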

51.4 Results and Discussion

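The classification performance is evaluated in terms of true positive rate (TPR), false negative rate (FNR), positive predictive value (PPV), false discovery rate (FDR), overall accuracy, training time, and area under the ROC curve (AUC), using robust ML classifiers such as KNN, SVM, Naïve Bayes, logistic regression, discriminant models, and ensemble models.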
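For reference, the rate-based measures listed above follow from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; the function name and the example counts are hypothetical and are not taken from the paper's confusion matrix. It assumes at least one actual positive and at least one predicted positive, so that no denominator is zero.

```python
def binary_rates(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Return TPR, FNR, PPV, FDR and accuracy as percentages.

    tp/fn/fp/tn are the counts of true positives, false negatives,
    false positives and true negatives.
    """
    tpr = 100.0 * tp / (tp + fn)   # share of actual positives that are found
    ppv = 100.0 * tp / (tp + fp)   # share of predicted positives that are right
    return {
        "TPR": tpr,
        "FNR": 100.0 - tpr,
        "PPV": ppv,
        "FDR": 100.0 - ppv,
        "accuracy": 100.0 * (tp + tn) / (tp + fn + fp + tn),
    }


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    for name, value in binary_rates(tp=8, fn=2, fp=1, tn=9).items():
        print(f"{name}: {value:.2f}")
```

By construction, TPR and FNR sum to 100, as do PPV and FDR.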

Table 51.1 reports the performance of the various models on the validation data without missing value imputation (MVI), whereas Table 51.2 reports the performance with missing value imputation. From the data, the AUC of many models improves with missing value imputation, most clearly for the tree-based models (for example, coarse tree from 0.74 to 0.93 and medium tree from 0.89 to 0.98), while a few models change little.
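The AUC values compared here can be understood through the rank interpretation of the ROC curve: the AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way. It illustrates the metric only, not how the reported values were produced, and the labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive example has the higher score; ties count as half a win."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Hypothetical labels and classifier scores.
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise loop is quadratic in the number of examples, which is acceptable for a dataset of this size.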

Table 51.3 reports the prediction performance on the test data. Subspace KNN again performs better than the other classifiers, with a test accuracy of 85%. The confusion matrix of the ensemble subspace KNN model is shown in Fig. 51.2, and the ROC curve with its AUC is shown in Fig. 51.3.
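The idea behind a subspace KNN ensemble can be sketched as follows: each learner is a KNN classifier that sees only a random subset of the features, and the ensemble predicts by majority vote. This is a minimal sketch of the random subspace technique in general, not the configuration used in the experiments; the class name, the default parameters (30 learners, 2 features per learner, k = 1), and the toy data are all assumptions.

```python
import random
from collections import Counter


def knn_predict(train_x, train_y, query, features, k):
    """Majority label among the k nearest training rows, using only `features`."""
    scored = []
    for row, label in zip(train_x, train_y):
        squared_dist = sum((row[f] - query[f]) ** 2 for f in features)
        scored.append((squared_dist, label))
    scored.sort(key=lambda pair: pair[0])  # nearest first
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


class SubspaceKNN:
    """Ensemble of KNN learners, each restricted to a random feature subset."""

    def __init__(self, n_learners=30, subspace_size=2, k=1, seed=0):
        self.n_learners = n_learners
        self.subspace_size = subspace_size
        self.k = k
        self.seed = seed

    def fit(self, train_x, train_y):
        rng = random.Random(self.seed)
        n_features = len(train_x[0])
        # One random subset of feature indices per learner.
        self.subspaces = [
            rng.sample(range(n_features), self.subspace_size)
            for _ in range(self.n_learners)
        ]
        self.train_x, self.train_y = train_x, train_y
        return self

    def predict(self, query):
        # Each learner votes; the most frequent label wins.
        votes = Counter(
            knn_predict(self.train_x, self.train_y, query, feats, self.k)
            for feats in self.subspaces
        )
        return votes.most_common(1)[0][0]


# Hypothetical, well-separated toy data with three features per row.
TOY_X = [[0, 1, 0], [1, 0, 1], [1, 1, 0], [9, 10, 9], [10, 9, 10], [10, 10, 9]]
TOY_Y = [0, 0, 0, 1, 1, 1]

if __name__ == "__main__":
    model = SubspaceKNN(n_learners=5, subspace_size=2, k=3).fit(TOY_X, TOY_Y)
    print(model.predict([1, 1, 1]), model.predict([9, 9, 9]))
```

Restricting each learner to a feature subset makes the individual learners differ from one another, which is what allows the vote to help.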

Table 51.1 Performance analysis on validation data without MVI

| Validation model | Accuracy (validation, %) | Training time (s) | TPR (%) | FNR (%) | PPV (%) | FDR (%) | AUC |
|---|---|---|---|---|---|---|---|
| Fine KNN | 100 | 0.28351 | 100 | 0 | 100 | 0 | 1 |
| Weighted KNN | 100 | 0.21176 | 100 | 0 | 100 | 0 | 1 |
| Subspace KNN (ensemble) | 100 | 0.37183 | 100 | 0 | 100 | 0 | 1 |
| Bagged trees (ensemble) | 99.7 | 0.51479 | 99.65 | 0.35 | 99.8 | 0.2 | 1 |
| Fine Gaussian SVM | 98.8 | 0.20768 | 98.3 | 1.7 | 99.1 | 0.9 | 1 |
| Fine tree | 93.2 | 1.7373 | 92.3 | 7.7 | 92.75 | 7.25 | 0.98 |
| Boosted trees (ensemble) | 85.5 | 0.62198 | 83.1 | 16.9 | 84.6 | 15.4 | 0.95 |
| Cubic SVM | 87 | 0.41443 | 84.7 | 15.3 | 86.2 | 13.8 | 0.93 |
| RUSBoosted trees (ensemble) | 84.9 | 0.45368 | 86.05 | 13.95 | 83.4 | 16.6 | 0.93 |
| Medium Gaussian SVM | 82.7 | 0.2608 | 78.3 | 21.7 | 82.65 | 17.35 | 0.9 |
| Medium tree | 83.3 | 0.35821 | 80.2 | 19.8 | 82.35 | 17.65 | 0.89 |
| Quadratic SVM | 79.8 | 0.26857 | 74.9 | 25.1 | 79.25 | 20.75 | 0.87 |
| Cosine KNN | 79.8 | 0.20666 | 75.25 | 24.75 | 78.95 | 21.05 | 0.87 |
| Medium KNN | 78.5 | 0.26587 | 72.85 | 27.15 | 78.15 | 21.85 | 0.87 |
| Cubic KNN | 78.4 | 0.25979 | 72.95 | 27.05 | 77.8 | 22.2 | 0.87 |
| Linear discriminant | 78.4 | 0.3483 | 73.7 | 26.3 | 77.1 | 22.9 | 0.84 |
| Coarse Gaussian SVM | 78.4 | 0.1325 | 72.75 | 27.25 | 77.95 | 22.05 | 0.84 |
| Logistic regression | 78.3 | 0.73955 | 73.6 | 26.4 | 76.9 | 23.1 | 0.84 |
| Linear SVM | 77.3 | 0.39083 | 72.55 | 27.45 | 75.8 | 24.2 | 0.84 |
| Quadratic discriminant | 76.4 | 0.34544 | 72.1 | 27.9 | 74.4 | 25.6 | 0.83 |
| Subspace discriminant (ensemble) | 76.3 | 1.4926 | 69.4 | 30.6 | 76.25 | 23.75 | 0.83 |
| Kernel Naive Bayes | 75.4 | 0.61343 | 68.05 | 31.95 | 75.45 | 24.55 | 0.83 |
| Gaussian Naive Bayes | 76.2 | 0.30505 | 72.7 | 27.3 | 73.85 | 26.15 | 0.82 |
| Coarse KNN | 75.4 | 0.22108 | 66.9 | 33.1 | 77.45 | 22.55 | 0.82 |
| Coarse tree | 77.2 | 0.22121 | 72.3 | 27.7 | 75.75 | 24.25 | 0.74 |

Table 51.2 Performance analysis on validation data with MVI

| Model (with MVI, validation) | Accuracy (validation, %) | Training time (s) | TPR (%) | FNR (%) | PPV (%) | FDR (%) | AUC |
|---|---|---|---|---|---|---|---|
| Fine KNN | 100 | 0.40763 | 100 | 0 | 100 | 0 | 1 |
| Weighted KNN | 100 | 0.22415 | 100 | 0 | 100 | 0 | 1 |
| Subspace KNN (ensemble) | 100 | 0.39983 | 100 | 0 | 100 | 0 | 1 |
| Bagged trees (ensemble) | 99.7 | 0.57935 | 99.7 | 0.3 | 99.7 | 0.3 | 1 |
| Fine Gaussian SVM | 97.8 | 0.2167 | 96.9 | 3.1 | 98.25 | 1.75 | 1 |
| Boosted trees (ensemble) | 96.5 | 0.75288 | 95.9 | 4.1 | 96.3 | 3.7 | 1 |
| Fine tree | 97.4 | 2.0716 | 96.6 | 3.4 | 97.65 | 2.35 | 0.99 |
| RUSBoosted trees (ensemble) | 94.3 | 0.47435 | 94.4 | 5.6 | 93.25 | 6.75 | 0.99 |
| Medium tree | 93.6 | 0.36612 | 91.9 | 8.1 | 94.05 | 5.95 | 0.98 |
| Cubic SVM | 87.4 | 0.49901 | 85.35 | 14.65 | 86.55 | 13.45 | 0.94 |
| Coarse tree | 88 | 0.25733 | 84.05 | 15.95 | 89.65 | 10.35 | 0.93 |
| Kernel Naive Bayes | 73.2 | 0.72157 | 62.1 | 37.9 | 81.55 | 18.45 | 0.93 |
| Quadratic SVM | 81.5 | 0.46121 | 77.5 | 22.5 | 80.7 | 19.3 | 0.89 |
| Medium Gaussian SVM | 81.4 | 0.2668 | 77.15 | 22.85 | 80.75 | 19.25 | 0.89 |
| Medium KNN | 80.5 | 0.26288 | 76 | 24 | 79.75 | 20.25 | 0.88 |
| Cubic KNN | 80.1 | 0.26259 | 75.35 | 24.65 | 79.4 | 20.6 | 0.88 |
| Cosine KNN | 78.4 | 0.25825 | 73.8 | 26.2 | 77.05 | 22.95 | 0.87 |
| Coarse Gaussian SVM | 78.6 | 0.15006 | 73.45 | 26.55 | 77.9 | 22.1 | 0.85 |
| Linear SVM | 78.4 | 0.51793 | 73.55 | 26.45 | 77.25 | 22.75 | 0.85 |
| Logistic regression | 78 | 1.0221 | 73.5 | 26.5 | 76.45 | 23.55 | 0.85 |
| Linear discriminant | 77.6 | 0.47924 | 73 | 27 | 76 | 24 | 0.85 |
| Coarse KNN | 76.6 | 0.22954 | 69.8 | 30.2 | 76.55 | 23.45 | 0.84 |
| Subspace discriminant (ensemble) | 76.4 | 0.44237 | 69.8 | 30.2 | 76.15 | 23.85 | 0.84 |
| Quadratic discriminant | 70.3 | 0.36838 | 59.3 | 40.7 | 72.1 | 27.9 | 0.82 |
| Gaussian Naive Bayes | 71.1 | 0.37027 | 61.2 | 38.8 | 71.35 | 28.65 | 0.81 |

Table 51.3 Performance analysis on test data without MVI

| Test model | Accuracy (test, %) | TPR (%) | FNR (%) | PPV (%) | FDR (%) | AUC |
|---|---|---|---|---|---|---|
| Subspace KNN (ensemble) | 85 | 88.45 | 11.55 | 85 | 15 | 1 |
| Fine KNN | 80 | 84.6 | 15.4 | 81.8 | 18.2 | 0.85 |
| Weighted KNN | 80 | 84.6 | 15.4 | 81.8 | 18.2 | 0.9 |
| Fine tree | 70 | 73.6 | 26.4 | 71.7 | 28.3 | 0.65 |
| Medium tree | 70 | 73.6 | 26.4 | 71.7 | 28.3 | 0.65 |
| Medium Gaussian SVM | 70 | 76.9 | 23.1 | 76.9 | 23.1 | 0.81 |
| Boosted trees (ensemble) | 70 | 63.75 | 36.25 | 66.65 | 33.35 | 0.75 |
| Bagged trees (ensemble) | 70 | 76.9 | 23.1 | 76.9 | 23.1 | 0.91 |
| RUSBoosted trees (ensemble) | 70 | 76.9 | 23.1 | 76.9 | 23.1 | 0.87 |
| Coarse Gaussian SVM | 65 | 69.75 | 30.25 | 68.75 | 31.25 | 0.73 |
| Subspace discriminant (ensemble) | 65 | 69.75 | 30.25 | 68.75 | 31.25 | 0.69 |
| Linear discriminant | 60 | 62.6 | 37.4 | 61.65 | 38.35 | 0.74 |
| Logistic regression | 60 | 62.6 | 37.4 | 61.65 | 38.35 | 0.71 |
| Linear SVM | 60 | 62.6 | 37.4 | 61.65 | 38.35 | 0.76 |
| Quadratic SVM | 60 | 69.25 | 30.75 | 73.35 | 26.65 | 0.76 |
| Cubic SVM | 60 | 69.25 | 30.75 | 73.35 | 26.65 | 0.63 |
| Fine Gaussian SVM | 60 | 69.25 | 30.75 | 73.35 | 26.65 | 0.96 |
| Coarse KNN | 60 | 65.95 | 34.05 | 65.95 | 34.05 | 0.75 |
| Cubic KNN | 60 | 65.95 | 34.05 | 65.95 | 34.05 | 0.72 |
| Coarse tree | 55 | 65.4 | 34.6 | 71.9 | 28.1 | 0.73 |
| Quadratic discriminant | 55 | 62.1 | 37.9 | 66.25 | 36.9 | 0.67 |
| Gaussian Naive Bayes | 55 | 62.1 | 37.9 | 63.1 | 36.9 | 0.68 |
| Kernel Naive Bayes | 55 | 62.1 | 37.9 | 63.1 | 36.9 | 0.71 |
| Medium KNN | 55 | 62.1 | 37.9 | 63.1 | 36.9 | 0.76 |
| Cosine KNN | 55 | 62.1 | 37.9 | 63.1 | 36.9 | 0.7 |

Fig. 51.2 Confusion matrix of subspace KNN on test data

Fig. 51.3 ROC of subspace KNN on test data

51.5 Conclusion

Early identification of diabetes is a major challenge in the health care sector. In this research, we developed a system for predicting diabetes. The purpose of this study is to propose an ensemble method based on ML techniques for diabetes prediction using the well-known Pima Indian Diabetes dataset. We used twenty-four different ML algorithms, including KNN, SVM, Naïve Bayes, logistic regression, discriminant models, and ensemble models, to predict diabetes, and evaluated their performance on measures such as TPR, FNR, PPV, FDR, overall accuracy, training time, and AUC. This work also emphasizes missing value imputation. On the validation data, overall accuracy improves considerably for several models with missing value imputation, as shown in Table 51.2. Among all the models evaluated, subspace KNN is considered the most efficient and promising for predicting diabetes, with an accuracy of 85% on the test data.

References

1. Khanam, J.J., Foo, S.Y.: A comparison of machine learning algorithms for diabetes prediction. ICT Express (2021). https://doi.org/10.1016/j.icte.2021.02.004

2. Ghaderi, M., Farahani, M.A., Hajiha, N., Ghaffari, F., Haghani, H.: The role of smartphone-based education on the risk perception of type 2 diabetes in women with gestational diabetes. Health Technol. (Berl) 9(5), 829–837 (2019). https://doi.org/10.1007/s12553-019-00342-3

3. Mandal, S.: New molecular biomarkers in precise diagnosis and therapy of Type 2 diabetes. Health Technol. (Berl) 10(3), 601–608 (2020). https://doi.org/10.1007/s12553-019-00385-6

4. Himsworth, H.P.: The syndrome of diabetes mellitus and its causes. Lancet 253(6551), 465–473 (1949). https://doi.org/10.1016/S0140-6736(49)90797-7

5. Mujumdar, A., Vaidehi, V.: Diabetes prediction using machine learning algorithms. Procedia Comput. Sci. 165, 292–299 (2019). https://doi.org/10.1016/j.procs.2020.01.047

6. Phaloprakarn, C., Tangjitgamol, S.: Risk score for predicting primary cesarean delivery in women with gestational diabetes mellitus. BMC Pregnancy Childbirth 20(1), 1–8 (2020). https://doi.org/10.1186/s12884-020-03306-y

7. https://www.mayoclinic.org/diseases-conditions/prediabetes/diagnosis-treatment/drc-20355284

8. Sisodia, D., Sisodia, D.S.: Prediction of diabetes using classification algorithms. Procedia Comput. Sci. 132(Iccids), 1578–1585 (2018). https://doi.org/10.1016/j.procs.2018.05.122

9. Kavakiotis, I., Tsave, O., Salifoglou, A., Maglaveras, N., Vlahavas, I., Chouvarda, I.: Machine learning and data mining methods in diabetes research. Comput. Struct. Biotechnol. J. 15, 104–116 (2017). https://doi.org/10.1016/j.csbj.2016.12.005

10. Okagbue, H.I., Adamu, P.I., Oguntunde, P.E., Obasi, E.C.M., Odetunmibi, O.A.: Machine learning prediction of breast cancer survival using age, sex, length of stay, mode of diagnosis and location of cancer. Health Technol. (Berl) 11, 887–893 (2021). https://doi.org/10.1007/s12553-021-00572-4

11. Lama, L. et al.: Machine learning for prediction of diabetes risk in middle-aged Swedish people. Heliyon 7, e07419 (2021). https://doi.org/10.1016/j.heliyon.2021.e07419

12. Gregor Stiglic, L.C., Wang, F., Sheikh, A.: Development and validation of the type 2 diabetes mellitus 10-year risk score prediction models from survey data. Prim. Care Diabetes 15(4), 699–705 (2021)

13. Lebech Cichosz, O.S., Hasselstrøm Jensen, M.: Short-term prediction of future continuous glucose monitoring readings in type 1 diabetes: development and validation of a neural network regression model. Int. J. Med. Inform. 151, 104472 (2021)

14. Vizhi, K., Dash, A.: Diabetes prediction using machine learning. Int. J. Adv. Sci. Technol. 29(6), 2842–2852 (2020). https://doi.org/10.32628/cseit2173107

15. Muhammad Daniyal Baig, M.F.N.: Diabetes prediction using machine learning algorithms. Lect. Notes Netw. Syst. (2020). https://doi.org/10.13140/RG.2.2.18158.64328

16. Soni, M., Varma, S.: Diabetes prediction using machine learning techniques. Int. J. Eng. Res. Technol. 9(09), 921–924 (2020). https://doi.org/10.1007/978-981-33-6081-5_34

17. Varga, T.V., Niss, K., Estampador, A.C., Collin, C.B., Moseley, P.L.: Association is not prediction: a landscape of confused reporting in diabetes – a systematic review. Diabetes Res. Clin. Pract. 170, 108497 (2020). https://doi.org/10.1016/j.diabres.2020.108497

18. Jaiswal, T.P.V., Negi, A., Pal, T.: A review on current advances in machine learning based diabetes prediction. Prim. Care Diabetes 15(3), 435–443 (2021)

19. Kalagotla, K., Satish Kumar, Gangashetty, S.V.: A novel stacking technique for prediction of diabetes. Comput. Biol. Med. 135, 104554 (2021)

20. García-Ordás, M.T., Benavides, C., Benítez-Andrades, J.A., Alaiz-Moretón, H., García-Rodríguez, I.: Diabetes detection using deep learning techniques with oversampling and feature augmentation. Comput. Methods Programs Biomed. 202, 105968 (2021). https://doi.org/10.1016/j.cmpb.2021.105968

21. Pima Indians Diabetes Database. Available at: https://www.kaggle.com/uciml/pima-indians-diabetes-database

ResearchGate has not been able to resolve any citations for this publication.

Machine Learning for Prediction of Diabetes Risk in Middle-aged Swedish people

Article

Full-text available

Jun 2021

Aims To study if machine learning methodology can be used to detect persons with increased type 2 diabetes or prediabetes risk among people without known abnormal glucose regulation. Methods Machine learning and interpretable machine learning models were applied on research data from Stockholm Diabetes Preventive Program, including more than 8000 people initially with normal glucose tolerance or prediabetes to determine high and low risk features for further impairment in glucose tolerance at follow-up 10 and 20 years later. Results The features with the highest importance on the outcome were body mass index, waist-hip ratio, age, systolic and diastolic blood pressure, and diabetes heredity. High values of these features as well as diabetes heredity conferred increased risk of type 2 diabetes. . The machine learning model was used to generate individual, comprehensible risk profiles, where the diabetes risk was obtained for each person in the data set. Features with the largest increasing or decreasing effects on the risk were determined. Conclusions The primary application of this machine learning model is to predict individual type 2 diabetes risk in people without diagnosed diabetes, and to which features the risk relates However, since most features affecting diabetes risk also play a role for metabolic control in diabetes, e.g. body mass index, diet composition, tobacco use, and stress, the tool can possibly also be used in diabetes care to develop more individualized, easily accessible health care plans to be utilized when encountering the patients.

A comparison of machine learning algorithms for diabetes prediction

Article

Full-text available

Feb 2021

Diabetes is a disease that has no permanent cure; hence early detection is required. Data mining, machine learning (ML) algorithms, and Neural Network (NN) methods are used in diabetes prediction in our research. We used the Pima Indian Diabetes (PID) dataset for our research, collected from the UCI Machine Learning Repository. The data set contains information about 768 patients and their corresponding nine unique attributes. We used seven ML algorithms on the dataset to predict diabetes. We found that the model with Logistic Regression (LR) and Support Vector Machine (SVM) works well on diabetes prediction. We built the NN model with a different hidden layer with various epochs and observed the NN with two hidden layers provided 88.6% accuracy.

Diabetes detection using deep learning techniques with oversampling and feature augmentation

Article

Full-text available

Feb 2021
COMPUT METH PROG BIO

Background and objective: Diabetes is a chronic pathology which is affecting more and more people over the years. It gives rise to a large number of deaths each year. Furthermore, many people living with the disease do not realize the seriousness of their health status early enough. Late diagnosis brings about numerous health problems and a large number of deaths each year so the development of methods for the early diagnosis of this pathology is essential. Methods: In this paper, a pipeline based on deep learning techniques is proposed to predict diabetic people. It includes data augmentation using a variational autoencoder (VAE), feature augmentation using an sparse autoencoder (SAE) and a convolutional neural network for classification. Pima Indians Diabetes Database, which takes into account information on the patients such as the number of pregnancies, glucose or insulin level, blood pressure or age, has been evaluated. Results: A 92.31% of accuracy was obtained when CNN classifier is trained jointly the SAE for featuring augmentation over a well balanced dataset. This means an increment of 3.17% of accuracy with respect the state-of-the-art. Conclusions: Using a full deep learning pipeline for data preprocessing and classification has demonstrate to be very promising in the diabetes detection field outperforming the state-of-the-art proposals.

Association is not prediction: A landscape of confused reporting in diabetes – A systematic review

Article

Full-text available

Dec 2020
DIABETES RES CLIN PR

Aims Appropriate analysis of big data is fundamental to precision medicine. While statistical analyses often uncover numerous associations, associations themselves do not convey predictive value. Confusion between association and prediction harms clinicians, scientists, and ultimately, the patients. We analyzed published papers in the field of diabetes that refer to “prediction” in their titles. We assessed whether these articles report metrics relevant to prediction. Methods A systematic search was undertaken using NCBI PubMed. Articles with the terms “diabetes” and “prediction” were selected. All abstracts of original research articles, within the field of diabetes epidemiology, were searched for metrics pertaining to predictive statistics. Simulated data was generated to visually convey the differences between association and prediction. Results The search-term yielded 2,182 results. After discarding non-relevant articles, 1,910 abstracts were evaluated. Of these, 39% (n = 745) reported metrics of predictive statistics, while 61% (n = 1,165) did not. The top reported metrics of prediction were ROC AUC, sensitivity and specificity. Using the simulated data, we demonstrated that biomarkers with large effect sizes and low P values can still offer poor discriminative utility. Conclusions We demonstrate a landscape of confused reporting within the field of diabetes epidemiology where the term “prediction” is often incorrectly used to refer to association statistics. We propose guidelines for future reporting, and two major routes forward in terms of main analytic procedures and research goals: the explanatory route, which contributes to precision medicine, and the prediction route which contributes to personalized medicine.

Risk score for predicting primary cesarean delivery in women with gestational diabetes mellitus

Article

Full-text available

Oct 2020

Background: Women with gestational diabetes mellitus (GDM) have a higher risk of cesarean delivery (CD) than glucose-tolerant women. The aim of this study was to develop and validate a risk score for predicting primary CD in women with GDM. Methods: A risk score for predicting primary CD was developed using significant clinical features of 385 women who had a diagnosis of GDM and delivered at our institution between January 2011 and December 2014. The score was then tested for validity in another cohort of 448 individuals with GDM who delivered between January 2015 and December 2018. Results: The risk score was developed using the features nulliparity, excess gestational weight gain, and insulin use. The scores that classified the pregnant women as low risk (0 points), intermediate risk (1-3 points), and high risk (≥ 4 points) were directly associated with the primary CD rates of the women in the development cohort: 14.7, 38.2 and 62.3%, respectively (P < 0.001). The model showed good calibration and acceptable discriminative power with a C statistic of 0.724 (95% confidence interval, 0.670-0.777). Similar results were observed in the validation cohort. Conclusion: A risk score using the features nulliparity, excess gestational weight gain, and insulin use can estimate the risk for primary CD in women with GDM.

Machine learning prediction of breast cancer survival using age, sex, length of stay, mode of diagnosis and location of cancer

Article

Jun 2021

Breast cancer is one of the leading causes of death in females, and survival depends on early diagnosis and treatment. This paper applied machine learning techniques to the prediction of breast cancer survival (dead or alive) using age, sex, length of stay, mode of diagnosis and location of cancer as predictors (independent variables). The data were obtained from the outpatient department of the University of Ilorin Teaching Hospital, Ilorin, Nigeria. The sample size of 300 consists of 175 females and 25 males who were admitted at the hospital and treated for breast cancer. The patients were later discharged or died. Adaptive boosting (AdaBoost) performed best among the data mining models used for classification in all three cases, where the target class was the average over classes, alive, or dead. AdaBoost achieved a classification accuracy of 98.3% and an area under the curve (AUC) of 99.9%. Furthermore, a probe of the AdaBoost prediction showed that the probability of death due to breast cancer was 0.47; length of stay contributed most to this high probability, location of the cancer and mode of diagnosis contributed minimally, and age and sex contributed insignificantly. The high probability of breast cancer mortality predicted in this paper is a cause for concern, as early detection of breast cancer, routine breast examination and breast cancer awareness are crucial in increasing the probability of survival. The results can be used to design a decision support system that can increase the chances of breast cancer survival.

A Novel Stacking Technique for Prediction of Diabetes

Article

Jun 2021
COMPUT BIOL MED

Background: Machine Learning (ML) represents a rapidly growing technology that supplies effective solutions for complex problems. The application of ML techniques in healthcare is gaining more attention because of ML-associated automatic pattern identification mechanisms. Diabetes is characterized by hyperglycemia resulting from improper insulin secretion and/or insulin utilization. Methods: The PIMA Indian diabetes dataset was obtained from the University of California, Irvine (UCI) machine learning repository for experimental purposes. The study was carried out in three stages: (1) a correlation technique was developed for feature selection; (2) the AdaBoost technique was implemented on selected features for classification; and (3) a novel stacking technique with multi-layer perceptron, support vector machine, and logistic regression (MLP, SVM, and LR, respectively) was designed and developed for the selected features. Results: The proposed stacking technique integrated the intelligent models and led to an improvement in model performance, thereby overcoming the issue of generating multiple decision stumps by AdaBoost. In terms of performance metrics, the proposed stacking technique outperformed AdaBoost and the other models compared. The proposed models were then implemented on other datasets, such as the Cleveland heart disease and Wisconsin breast cancer diagnostic datasets, to illustrate their broader applications. Conclusion: Stacking can outperform the other reported techniques that were implemented using the PIMA Indian diabetes dataset.

Short-term prediction of future continuous glucose monitoring readings in type 1 diabetes: Development and validation of a neural network regression model

Article

Jul 2021
INT J MED INFORM

Background and objective: CGM systems are still subject to a time delay, which, especially during rapid changes, causes a clinically significant difference between the CGM reading and the actual BG level. This study aimed to explore the potential of developing and validating a model for predicting future CGM measurements in order to overcome the time delay. Methods: An artificial neural network regression (NN) approach was used to predict CGM values with a lead time of 15 min. The NN was trained and internally validated on 23 million minutes of CGM and externally validated on 2 million minutes of CGM. The validation included data from 278 type 1 diabetes patients using three different CGM sensors. The NN performance was compared with three alternative methods: linear extrapolation, spline extrapolation, and last observation carried forward. Results: The internal validation yielded an RMSE of 9.1 mg/dL and a MARD of 4.2%, and 99.9% of predictions were in the A + B zone of the consensus error grid. The external validation yielded an RMSE of 5.9–11.3 mg/dL and a MARD of 3.2–5.4%, and 99.9–100% of predictions were in the A + B zone of the consensus error grid. The NN performed better on all parameters than the alternative methods. Conclusions: We proposed and validated an NN glucose prediction model that is potentially simple to use and implement. The model only needs input from a CGM system in order to facilitate glucose prediction with a lead time of 15 min. The approach yielded good results for both internal and external validation.
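Two of the baseline methods named in this abstract, last observation carried forward and linear extrapolation, are simple enough to sketch together with the RMSE and MARD metrics. The sketch below is illustrative and is not the study's code: the 5-minute sampling interval, the function names, and the short glucose series are assumptions, and the neural network and spline baseline are not reproduced.

```python
import math

SAMPLE_INTERVAL_MIN = 5   # assumed CGM sampling interval (not from the abstract)
LEAD_TIME_MIN = 15        # prediction lead time stated in the abstract
STEPS_AHEAD = LEAD_TIME_MIN // SAMPLE_INTERVAL_MIN


def locf(history):
    """Last observation carried forward: predict the most recent reading."""
    return history[-1]


def linear_extrapolation(history, steps_ahead=STEPS_AHEAD):
    """Extend the slope between the last two readings over the lead time."""
    slope = history[-1] - history[-2]
    return history[-1] + steps_ahead * slope


def rmse(predicted, reference):
    """Root mean squared error, in the units of the readings (mg/dL)."""
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


def mard(predicted, reference):
    """Mean absolute relative difference, in percent."""
    relative = [abs(p - r) / r for p, r in zip(predicted, reference)]
    return 100.0 * sum(relative) / len(relative)


# Made-up, steadily rising glucose readings in mg/dL, one per sampling interval.
series = [100, 105, 110, 115, 120, 125, 130]

targets, locf_preds, linear_preds = [], [], []
for t in range(1, len(series) - STEPS_AHEAD):
    history = series[: t + 1]
    targets.append(series[t + STEPS_AHEAD])
    locf_preds.append(locf(history))
    linear_preds.append(linear_extrapolation(history))

print("LOCF   RMSE:", rmse(locf_preds, targets), "MARD:", mard(locf_preds, targets))
print("Linear RMSE:", rmse(linear_preds, targets), "MARD:", mard(linear_preds, targets))
```

On this artificial, perfectly linear series the linear extrapolation is exact while the carried-forward value lags behind, which shows why a time delay matters most during rapid glucose changes. Real CGM traces are noisy and nonlinear, so neither baseline would behave this cleanly.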

Development and validation of the type 2 diabetes mellitus 10-year risk score prediction models from survey data

Article

Apr 2021

Aims: In this paper, we demonstrate the development and validation of 10-year type 2 diabetes mellitus (T2DM) risk prediction models based on large survey data. Methods: Data from the Survey of Health, Ageing and Retirement in Europe (SHARE), collected in 12 European countries and comprising 53 variables that represent behavioural as well as physical and mental health characteristics of participants aged 50 or older, were used to build and validate prediction models. To account for strongly unbalanced outcome variables, each instance was assigned a weight according to the inverse proportion of the outcome label when the regularized logistic regression model was built. Results: A pooled sample of 16,363 individuals was used to build and validate a global regularized logistic regression model that achieved an area under the receiver operating characteristic curve (AUROC) of 0.702 (95% CI: 0.698-0.706). Additionally, we measured the performance of local country-specific models, where AUROC ranged from 0.578 (0.565-0.592) to 0.768 (0.749-0.787). Conclusions: We have developed and validated a survey-based 10-year T2DM risk prediction model for use across 12 European countries. Our results demonstrate the importance of re-calibration of the models as well as the strengths of pooling the data from multiple countries to reduce the variance and consequently increase the precision of the results.
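The instance weighting mentioned in the Methods can be sketched as follows. The abstract does not give the exact normalisation, so the version here (total sample size divided by the count of the instance's label) is an assumption, and `inverse_proportion_weights` is a hypothetical name.

```python
from collections import Counter


def inverse_proportion_weights(labels):
    """Weight each instance by the inverse proportion of its outcome label.

    Assumed normalisation: weight = n / count(label), so every class
    contributes the same total weight to the fitted model.
    """
    counts = Counter(labels)
    n = len(labels)
    per_class = {label: n / count for label, count in counts.items()}
    return [per_class[label] for label in labels]


# Toy outcome vector: three negatives for every positive.
labels = [0, 0, 0, 1]
weights = inverse_proportion_weights(labels)
print(weights)
```

With this choice the rare class receives a proportionally larger weight per instance, so the summed weight of each class is equal. Many logistic regression implementations accept such a vector as per-sample weights during fitting.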

A review on current advances in machine learning based diabetes prediction

Article

Feb 2021

Tarun Pal

Diabetes is a metabolic disorder characterized by a high blood glucose level over a prolonged period, because the body is not capable of using glucose properly. The severe complications associated with diabetes include diabetic ketoacidosis, nonketotic hyperosmolar coma, cardiovascular disease, stroke, chronic renal failure, retinal damage and foot ulcers. The number of patients with diabetes is increasing sharply worldwide, and the disease is considered a major global health problem. Early diagnosis of diabetes is helpful for treatment and reduces the chance of the severe complications associated with it. Machine learning algorithms (such as ANN, SVM, Naive Bayes, PLS-DA and deep learning) and data mining techniques are used to detect interesting patterns for the diagnosis and treatment of disease. Current computational methods for diabetes diagnosis have some limitations and have not been tested on different datasets or on people from different countries, which limits the practical use of prediction methods. This paper is an effort to summarize the majority of the literature concerned with machine learning and data mining techniques applied to the prediction of diabetes, along with the associated challenges. This report should help improve disease prediction and the understanding of diabetes patterns. Consequently, it should support treatment and reduce the risk of other complications of diabetes.
