
Machine Learning-Based Diabetes Prediction Using Missing Value Impotency

Abstract

Diabetes is a chronic disease that affects a growing number of people each year and causes a large number of deaths. Because late diagnosis leads to severe health complications, methods for early detection of this pathology are critical. Machine learning (ML) techniques aid in the early detection and prediction of diabetes. However, ML models do not perform well when the dataset contains missing values, and imputing those values improves the outcome. This work proposes an ensemble method with a strong emphasis on missing value imputation. Numerous ML models are used to validate the proposed framework. The experiments use the Pima Indian Diabetes Dataset, which contains information about people with and without diabetes. TPR, FNR, PPV, FDR, overall accuracy, training time, and AUC are used to evaluate the performance of the 24 ML methods. The results show that subspace KNN outperforms the other methods, with an accuracy of 85%. The results are further examined using receiver operating characteristic (ROC) curves. Using missing value imputation for data pre-processing and classification is reported to outperform state-of-the-art algorithms for diabetes detection.
Chapter 51
Santi Kumari Behera, Julie Palei, Dayal Kumar Behera, Subhra Swetanisha,
and Prabira Kumar Sethy
S. K. Behera
Department of CSE, VSSUT Burla, Burla, Odisha, India
J. Palei · P. K. Sethy (B)
Department of Electronics, Sambalpur University, Burla, Odisha 768019, India
J. Palei
e-mail: 19mscel05@suiit.ac.in
D. K. Behera
Department of CSE, Silicon Institute of Technology, Bhubaneswar, Odisha, India
S. Swetanisha
Department of CSE, Trident Academy of Technology, Bhubaneswar, Odisha, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022
S. Dehuri et al. (eds.), Biologically Inspired Techniques in Many Criteria Decision Making,
Smart Innovation, Systems and Technologies 271,
https://doi.org/10.1007/978-981-16-8739-6_51
51.1 Introduction
“According to the World Health Organization (WHO), around 1.6 million people
die each year from diabetes [1].” “Diabetes is a type of disease that arises when the
human body’s blood glucose/blood sugar level is abnormally high. Type 1 diabetes,
commonly known as insulin-dependent diabetes, is most frequently diagnosed in
childhood [2].” In Type 1, the body’s immune system attacks the insulin-producing
cells of the pancreas, so the pancreas stops producing insulin. “Type 2
diabetes is often referred to as adult-onset diabetes or non-insulin-dependent diabetes
[3].” Although it is generally milder than Type 1, it is nevertheless extremely damaging
and can result in serious complications, particularly in the small blood vessels of the
eyes, kidneys, and nerves [4]. The third type, gestational diabetes [5], develops from
increased blood sugar levels in pregnant women whose diabetes was not recognized earlier. “Some
authors have created and validated a risk score for primary cesarean delivery (CD)
in women with gestational diabetes. In women with gestational diabetes mellitus
(GDM), a risk score based on nulliparity, excessive gestational weight gain, and
usage of insulin can be used to determine the likelihood of primary CD [6].” “Ghaderi
et al. worked on the effect of smartphone education on the risk perception of Type 2
diabetes in women with GDM [2].”
Diabetes mellitus is associated with long-term complications, and people with
diabetes face an increased risk of developing a variety of other health problems. Fasting
glucose levels in the human body typically range between 70 and 99 mg per deciliter (mg/dl) [1].
A glucose level of 126 mg/dl or higher indicates diabetes, and prediabetes
is defined as a blood glucose level of 100–125 mg/dl [7].
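The glucose ranges given above can be written as a small lookup. This is a minimal sketch, assuming the values refer to fasting glucose in mg/dl and that a reading of exactly 126 mg/dl counts as diabetic; the function name, the category labels, and the handling of readings below 70 mg/dl are illustrative choices, not taken from the cited sources.

```python
def classify_fasting_glucose(mg_per_dl: float) -> str:
    """Map a fasting glucose reading (mg/dl) to a coarse category."""
    if mg_per_dl < 0:
        raise ValueError("glucose level cannot be negative")
    if mg_per_dl < 70:
        return "below typical range"  # assumption: not covered by the text
    if mg_per_dl < 100:
        return "normal"               # 70-99 mg/dl
    if mg_per_dl < 126:
        return "prediabetes"          # 100-125 mg/dl
    return "diabetes"                 # 126 mg/dl and above (assumed inclusive)


for reading in (85, 110, 140):
    print(reading, classify_fasting_glucose(reading))
```

Such a rule uses a single measurement only; the ML models discussed in this chapter combine glucose with other attributes.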
Diabetes risk is influenced by height, weight, hereditary factors, and insulin [8], but
the primary factor evaluated is blood sugar level. Early detection is key to
avoiding complications. Predictive analytics strives to improve disease diagnosis
accuracy, patient care, resource optimization, and clinical outcomes. Numerous
researchers are conducting experiments to diagnose disease using various ML
classification algorithms such as J48, SVM, Naïve Bayes, and decision trees.
Researchers have demonstrated that machine learning algorithms [9, 10]
perform well at diagnosing various diseases. In [8], Naïve Bayes, SVM, and decision tree
classification algorithms are applied and assessed for predicting diabetes
in patients using the Pima Indians Diabetes Database (PIDD).
ML techniques can also be used to identify individuals at elevated risk of Type 2
diabetes or prediabetes in the absence of established impaired glucose regulation [11].
In that study, body mass index, waist-hip ratio, age, systolic and diastolic blood pressure,
and diabetes heredity were the most influential features, and high values of these
features, together with diabetes heredity, were associated with an increased risk of
Type 2 diabetes.
ML techniques aid in the early detection and prediction of diabetes. However,
ML models do not perform well when the dataset contains missing values. Therefore, this
work emphasizes missing value imputation. The objectives of this work are
as follows:
• Studying the impact of missing values in the PIMA diabetes dataset.
• Performing missing value imputation by replacing each missing value with the
mean value of the group.
• Designing an ensemble subspace KNN for classifying diabetes.
• Comparing various traditional classifiers against the ensemble
classifier.
51.2 Related Works
The article [12] describes the construction and validation of 10-year
risk prediction models for Type 2 diabetes mellitus (T2DM). Data collected
in 12 European nations (SHARE) are used to validate the models. The dataset
included 53 variables encompassing behavioral, physical, and mental health aspects
of participants aged 50 or older. To account for the highly imbalanced outcome variable,
each instance was weighted by the inverse proportion of its outcome label when the
logistic regression model was fitted. The authors used a pooled sample of 16,363
people to develop and evaluate a global regularized logistic regression model with an
area under the receiver operating characteristic curve of 0.702. Continuous glucose
monitoring (CGM) devices still have a time delay, which can result
in clinically significant differences between the CGM reading and the actual blood glucose
level, particularly during rapid changes. In [13], the authors used an artificial neural
network (NN) regression technique to forecast CGM readings. Diabetes can also be
a risk factor for other conditions such as heart attack, renal impairment,
and partial blindness. “Kayal Vizhi and Aman Dash worked on smart sensors and
ML techniques such as random forest and extreme gradient boosting for predicting
whether a person would get diabetes or not [14].” Mujumdar et al. [5] developed a
diabetes prediction model that took into account external risk variables for diabetes
in addition to standard risk factors such as glucose, BMI, age, and insulin. Their
new dataset improved classification accuracy compared with the available PIMA
dataset. The study [15] covers algorithms such as linear regression, decision trees,
and random forests, along with their advantages for early identification and treatment
of disease, and discusses the predictive accuracy of these algorithms.
Mitushi Soni and Sunita Varma [16] predicted diabetes using ML classification
and ensemble approaches. Accuracy varied across the models compared, and
their findings indicate that random forest outperformed the other ML
algorithms in terms of accuracy. “Jobeda Jamal Khanam and Simon Foo conducted
research using the PIMA dataset. The collection comprises data on 768 patients and
their nine unique characteristics. On the dataset, seven ML algorithms were applied
to predict diabetes. They concluded that a model combining logistic regression (LR)
and support vector machine (SVM) effectively predicts diabetes [1].” “Varga et al. used
NCBI PubMed to conduct a systematic search. First, articles that had the words “diabetes”
and “prediction” were chosen. Next, the authors searched for metrics relating
to predictive statistics in all abstracts of original research articles published in the field
of diabetes epidemiology. To illustrate the distinction between association and prediction,
simulated data were constructed. It is demonstrated that biomarkers with large
effect sizes and small P values might have low discriminative utility [17].” The article
[18] attempts to synthesize the majority of the work on ML and data mining techniques
used to forecast diabetes and its complications. Hyperglycemia is a symptom
of diabetes caused by insufficient insulin secretion and/or use. For experimental
purposes, Kalagotla et al. [19] designed a novel stacking method based on multi-layer
perceptron, SVM, and LR. The stacking strategy combined the intelligent models and
improved model performance. The proposed stacking strategy outperformed the other
models, including AdaBoost. The authors of [20] proposed a pipeline
for predicting diabetic individuals using deep learning techniques. It incorporates
data augmentation using a variational autoencoder (VAE), feature augmentation via
a sparse autoencoder (SAE), and classification via a convolutional neural network.
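The instance weighting described for the risk model in [12], where each instance is weighted by the inverse proportion of its outcome label, can be illustrated with a short sketch. The function name and the toy labels below are assumptions made for illustration; the exact weighting formula used in [12] is not given here, so this shows only the general inverse-frequency idea.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return one weight per instance: 1 / (share of that instance's class)."""
    counts = Counter(labels)
    n = len(labels)
    return [n / counts[y] for y in labels]


# Toy labels (not real data): 80% negative, 20% positive.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
weights = inverse_frequency_weights(labels)
print(weights[0], weights[-1])  # weight of a negative and of a positive instance
```

With these weights, both classes contribute the same total weight to the loss, so the minority class is not dominated by the majority class during training.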
51.3 Material and Methods
The dataset and the adopted methodology are described in the following
subsections.
51.3.1 Dataset
The Pima Indians Diabetes Database is used in this paper. The National Institute
of Diabetes and Digestive and Kidney Diseases originally collected this dataset. The
dataset’s purpose is to diagnostically predict whether a patient has diabetes, using
particular diagnostic measurements contained in the dataset. Several constraints were
imposed on the selection of these instances from a larger database: all patients
in the dataset are females who are at least 21 years old and of Pima Indian ancestry.
The dataset contains several medical predictor variables and one outcome variable.
The predictor variables include the patient’s number of pregnancies, BMI, insulin
level, and age [21].
51.3.2 Proposed Model
This research focuses on improving the accuracy of diabetes
detection. The proposed approach is depicted in Fig. 51.1. Numerous classical ML
classifiers and related ensemble variants are used to classify cases as
positive or negative. Numerous attributes in the original dataset have entries
of 0. According to expert advice, these entries must be treated as missing
values, because attributes such as glucose, blood pressure, skin thickness, insulin, BMI,
diabetes pedigree function, and age cannot be zero. Hence, each missing value is imputed
with the mean value of the group. After that, the class label is encoded
with two values: diabetic is set to 1 and non-diabetic to 0.
Then, the dataset is divided into a validation set and a test set. The validation data are
used to train the classifier in two scenarios: in the first, the classifier is trained
with the missing values left in place, and in the second, it is trained after missing
value imputation. The test data are not pre-processed with missing value imputation;
they are passed directly to the model for prediction.
Fig. 51.1 Proposed framework for diabetes prediction
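The imputation step described above can be sketched in a few lines. This is a minimal illustration, assuming that "group" means the outcome class (diabetic or non-diabetic) and that the group mean is computed over the non-zero entries only; the text does not spell out either point. The function name, the column names `Glucose` and `Outcome`, and the toy records are assumptions for illustration and are not taken from the real dataset.

```python
from statistics import mean


def impute_zeros_with_group_mean(rows, feature, group_key="Outcome"):
    """Replace zeros in `feature` with the mean of the non-zero values in the same group."""
    # Collect the observed (non-zero) values for each group.
    observed = {}
    for row in rows:
        if row[feature] != 0:
            observed.setdefault(row[group_key], []).append(row[feature])
    group_means = {group: mean(values) for group, values in observed.items()}

    # Build new rows so the caller's data is left unchanged.
    imputed = []
    for row in rows:
        new_row = dict(row)
        if new_row[feature] == 0 and new_row[group_key] in group_means:
            new_row[feature] = group_means[new_row[group_key]]
        imputed.append(new_row)
    return imputed


# Toy records for illustration only.
records = [
    {"Glucose": 100, "Outcome": 0},
    {"Glucose": 0, "Outcome": 0},
    {"Glucose": 120, "Outcome": 0},
    {"Glucose": 160, "Outcome": 1},
    {"Glucose": 0, "Outcome": 1},
    {"Glucose": 180, "Outcome": 1},
]
cleaned = impute_zeros_with_group_mean(records, "Glucose")
print([r["Glucose"] for r in cleaned])
```

Under this reading, the imputation needs the class label, so it can only be applied where labels are known. That would be consistent with the test data being passed to the model without imputation.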
51.4 Results and Discussion
Classification performance is evaluated in terms of TPR, FNR, PPV, FDR, overall accuracy,
training time, and AUC using robust ML classifiers such as KNN, SVM,
Naïve Bayes, logistic regression, discriminant models, and ensemble models.
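For reference, the rate-based measures listed above follow directly from the counts of a binary confusion matrix. The sketch below uses the standard definitions for the positive class; the function name and the example counts are hypothetical and are not the confusion matrix reported in this chapter.

```python
def binary_rates(tp, fn, fp, tn):
    """Return TPR, FNR, PPV, FDR and accuracy as percentages."""
    tpr = 100.0 * tp / (tp + fn)  # true positive rate (sensitivity)
    ppv = 100.0 * tp / (tp + fp)  # positive predictive value (precision)
    return {
        "TPR": tpr,
        "FNR": 100.0 - tpr,       # false negative rate
        "PPV": ppv,
        "FDR": 100.0 - ppv,       # false discovery rate
        "accuracy": 100.0 * (tp + tn) / (tp + fn + fp + tn),
    }


# Hypothetical counts for illustration only.
metrics = binary_rates(tp=40, fn=10, fp=5, tn=45)
print(metrics)
```

TPR and FNR always sum to 100%, as do PPV and FDR, which is a quick consistency check for rows in the result tables. The sketch does not guard against empty classes, where the denominators would be zero.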
Table 51.1 reports the performance of the models on the validation data
without missing value imputation (MVI), whereas Table 51.2 reports
their performance with MVI. The AUC improves markedly with MVI for several models,
most notably the tree-based ones (for example, from 0.74 to 0.93 for the coarse tree
and from 0.95 to 1 for boosted trees), while it changes little for the others.
Table 51.3 reports the prediction performance on the test data. Again,
subspace KNN performs better than the other classifiers, with a test accuracy of 85%.
The confusion matrix of the ensemble subspace KNN model is depicted
in Fig. 51.2, and its ROC curve is shown in Fig. 51.3.
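To make the idea behind a subspace KNN ensemble concrete, the following dependency-free sketch trains several KNN learners, each on a random subset of the features, and combines them by majority vote. The chapter does not state the software or hyperparameters used, so the class name, the parameters (`n_learners`, `subspace_dim`, `k`), and the toy data are all assumptions; this illustrates the general random subspace technique, not the authors' implementation.

```python
import random
from collections import Counter


def knn_predict(train_X, train_y, x, features, k):
    """Majority vote of the k nearest training points, using only `features`."""
    dists = sorted(
        (sum((row[f] - x[f]) ** 2 for f in features), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


class SubspaceKNN:
    """Ensemble of KNN learners, each restricted to a random feature subset."""

    def __init__(self, n_learners=30, subspace_dim=2, k=1, seed=0):
        self.n_learners = n_learners
        self.subspace_dim = subspace_dim
        self.k = k
        self.rng = random.Random(seed)

    def fit(self, X, y):
        n_features = len(X[0])
        self.X, self.y = X, y
        # Each learner gets its own random subset of feature indices.
        self.subspaces = [
            self.rng.sample(range(n_features), self.subspace_dim)
            for _ in range(self.n_learners)
        ]
        return self

    def predict(self, X):
        preds = []
        for x in X:
            # One vote per learner, then a majority vote across learners.
            votes = Counter(
                knn_predict(self.X, self.y, x, feats, self.k)
                for feats in self.subspaces
            )
            preds.append(votes.most_common(1)[0][0])
        return preds


# Toy data: two well-separated clusters with four features each.
X_train = [[1, 1, 1, 1], [2, 1, 2, 1], [1, 2, 1, 2],
           [8, 9, 8, 9], [9, 8, 9, 8], [9, 9, 9, 9]]
y_train = [0, 0, 0, 1, 1, 1]
model = SubspaceKNN(n_learners=10, subspace_dim=2, k=3, seed=42).fit(X_train, y_train)
print(model.predict([[1.5, 1.5, 1.5, 1.5], [8.5, 8.5, 8.5, 8.5]]))
```

Because each learner sees only part of the feature set, the learners make partly different errors, and the vote across learners can be more stable than a single KNN on all features.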
Table 51.1 Performance analysis on validation data without MVI
| Validation model | Accuracy (validation, %) | Training time (s) | TPR (%) | FNR (%) | PPV (%) | FDR (%) | AUC |
|---|---|---|---|---|---|---|---|
| Fine KNN | 100 | 0.28351 | 100 | 0 | 100 | 0 | 1 |
| Weighted KNN | 100 | 0.21176 | 100 | 0 | 100 | 0 | 1 |
| Subspace KNN (ensemble) | 100 | 0.37183 | 100 | 0 | 100 | 0 | 1 |
| Bagged trees (ensemble) | 99.7 | 0.51479 | 99.65 | 0.35 | 99.8 | 0.2 | 1 |
| Fine Gaussian SVM | 98.8 | 0.20768 | 98.3 | 1.7 | 99.1 | 0.9 | 1 |
| Fine tree | 93.2 | 1.7373 | 92.3 | 7.7 | 92.75 | 7.25 | 0.98 |
| Boosted trees (ensemble) | 85.5 | 0.62198 | 83.1 | 16.9 | 84.6 | 15.4 | 0.95 |
| Cubic SVM | 87 | 0.41443 | 84.7 | 15.3 | 86.2 | 13.8 | 0.93 |
| RUSBoosted trees (ensemble) | 84.9 | 0.45368 | 86.05 | 13.95 | 83.4 | 16.6 | 0.93 |
| Medium Gaussian SVM | 82.7 | 0.2608 | 78.3 | 21.7 | 82.65 | 17.35 | 0.9 |
| Medium tree | 83.3 | 0.35821 | 80.2 | 19.8 | 82.35 | 17.65 | 0.89 |
| Quadratic SVM | 79.8 | 0.26857 | 74.9 | 25.1 | 79.25 | 20.75 | 0.87 |
| Cosine KNN | 79.8 | 0.20666 | 75.25 | 24.75 | 78.95 | 21.05 | 0.87 |
| Medium KNN | 78.5 | 0.26587 | 72.85 | 27.15 | 78.15 | 21.85 | 0.87 |
| Cubic KNN | 78.4 | 0.25979 | 72.95 | 27.05 | 77.8 | 22.2 | 0.87 |
| Linear discriminant | 78.4 | 0.3483 | 73.7 | 26.3 | 77.1 | 22.9 | 0.84 |
| Coarse Gaussian SVM | 78.4 | 0.1325 | 72.75 | 27.25 | 77.95 | 22.05 | 0.84 |
| Logistic regression | 78.3 | 0.73955 | 73.6 | 26.4 | 76.9 | 23.1 | 0.84 |
| Linear SVM | 77.3 | 0.39083 | 72.55 | 27.45 | 75.8 | 24.2 | 0.84 |
| Quadratic discriminant | 76.4 | 0.34544 | 72.1 | 27.9 | 74.4 | 25.6 | 0.83 |
| Subspace discriminant (ensemble) | 76.3 | 1.4926 | 69.4 | 30.6 | 76.25 | 23.75 | 0.83 |
| Kernel Naive Bayes | 75.4 | 0.61343 | 68.05 | 31.95 | 75.45 | 24.55 | 0.83 |
| Gaussian Naive Bayes | 76.2 | 0.30505 | 72.7 | 27.3 | 73.85 | 26.15 | 0.82 |
| Coarse KNN | 75.4 | 0.22108 | 66.9 | 33.1 | 77.45 | 22.55 | 0.82 |
| Coarse tree | 77.2 | 0.22121 | 72.3 | 27.7 | 75.75 | 24.25 | 0.74 |
Table 51.2 Performance analysis on validation data with MVI
| Validation model (with MVI) | Accuracy (validation, %) | Training time (s) | TPR (%) | FNR (%) | PPV (%) | FDR (%) | AUC |
|---|---|---|---|---|---|---|---|
| Fine KNN | 100 | 0.40763 | 100 | 0 | 100 | 0 | 1 |
| Weighted KNN | 100 | 0.22415 | 100 | 0 | 100 | 0 | 1 |
| Subspace KNN (ensemble) | 100 | 0.39983 | 100 | 0 | 100 | 0 | 1 |
| Bagged trees (ensemble) | 99.7 | 0.57935 | 99.7 | 0.3 | 99.7 | 0.3 | 1 |
| Fine Gaussian SVM | 97.8 | 0.2167 | 96.9 | 3.1 | 98.25 | 1.75 | 1 |
| Boosted trees (ensemble) | 96.5 | 0.75288 | 95.9 | 4.1 | 96.3 | 3.7 | 1 |
| Fine tree | 97.4 | 2.0716 | 96.6 | 3.4 | 97.65 | 2.35 | 0.99 |
| RUSBoosted trees (ensemble) | 94.3 | 0.47435 | 94.4 | 5.6 | 93.25 | 6.75 | 0.99 |
| Medium tree | 93.6 | 0.36612 | 91.9 | 8.1 | 94.05 | 5.95 | 0.98 |
| Cubic SVM | 87.4 | 0.49901 | 85.35 | 14.65 | 86.55 | 13.45 | 0.94 |
| Coarse tree | 88 | 0.25733 | 84.05 | 15.95 | 89.65 | 10.35 | 0.93 |
| Kernel Naive Bayes | 73.2 | 0.72157 | 62.1 | 37.9 | 81.55 | 18.45 | 0.93 |
| Quadratic SVM | 81.5 | 0.46121 | 77.5 | 22.5 | 80.7 | 19.3 | 0.89 |
| Medium Gaussian SVM | 81.4 | 0.2668 | 77.15 | 22.85 | 80.75 | 19.25 | 0.89 |
| Medium KNN | 80.5 | 0.26288 | 76 | 24 | 79.75 | 20.25 | 0.88 |
| Cubic KNN | 80.1 | 0.26259 | 75.35 | 24.65 | 79.4 | 20.6 | 0.88 |
| Cosine KNN | 78.4 | 0.25825 | 73.8 | 26.2 | 77.05 | 22.95 | 0.87 |
| Coarse Gaussian SVM | 78.6 | 0.15006 | 73.45 | 26.55 | 77.9 | 22.1 | 0.85 |
| Linear SVM | 78.4 | 0.51793 | 73.55 | 26.45 | 77.25 | 22.75 | 0.85 |
| Logistic regression | 78 | 1.0221 | 73.5 | 26.5 | 76.45 | 23.55 | 0.85 |
| Linear discriminant | 77.6 | 0.47924 | 73 | 27 | 76 | 24 | 0.85 |
| Coarse KNN | 76.6 | 0.22954 | 69.8 | 30.2 | 76.55 | 23.45 | 0.84 |
| Subspace discriminant (ensemble) | 76.4 | 0.44237 | 69.8 | 30.2 | 76.15 | 23.85 | 0.84 |
| Quadratic discriminant | 70.3 | 0.36838 | 59.3 | 40.7 | 72.1 | 27.9 | 0.82 |
| Gaussian Naive Bayes | 71.1 | 0.37027 | 61.2 | 38.8 | 71.35 | 28.65 | 0.81 |
Table 51.3 Performance analysis on test data without MVI
| Test model | Accuracy (test, %) | TPR (%) | FNR (%) | PPV (%) | FDR (%) | AUC |
|---|---|---|---|---|---|---|
| Subspace KNN (ensemble) | 85 | 88.45 | 11.55 | 85 | 15 | 1 |
| Fine KNN | 80 | 84.6 | 15.4 | 81.8 | 18.2 | 0.85 |
| Weighted KNN | 80 | 84.6 | 15.4 | 81.8 | 18.2 | 0.9 |
| Fine tree | 70 | 73.6 | 26.4 | 71.7 | 28.3 | 0.65 |
| Medium tree | 70 | 73.6 | 26.4 | 71.7 | 28.3 | 0.65 |
| Medium Gaussian SVM | 70 | 76.9 | 23.1 | 76.9 | 23.1 | 0.81 |
| Boosted trees (ensemble) | 70 | 63.75 | 36.25 | 66.65 | 33.35 | 0.75 |
| Bagged trees (ensemble) | 70 | 76.9 | 23.1 | 76.9 | 23.1 | 0.91 |
| RUSBoosted trees (ensemble) | 70 | 76.9 | 23.1 | 76.9 | 23.1 | 0.87 |
| Coarse Gaussian SVM | 65 | 69.75 | 30.25 | 68.75 | 31.25 | 0.73 |
| Subspace discriminant (ensemble) | 65 | 69.75 | 30.25 | 68.75 | 31.25 | 0.69 |
| Linear discriminant | 60 | 62.6 | 37.4 | 61.65 | 38.35 | 0.74 |
| Logistic regression | 60 | 62.6 | 37.4 | 61.65 | 38.35 | 0.71 |
| Linear SVM | 60 | 62.6 | 37.4 | 61.65 | 38.35 | 0.76 |
| Quadratic SVM | 60 | 69.25 | 30.75 | 73.35 | 26.65 | 0.76 |
| Cubic SVM | 60 | 69.25 | 30.75 | 73.35 | 26.65 | 0.63 |
| Fine Gaussian SVM | 60 | 69.25 | 30.75 | 73.35 | 26.65 | 0.96 |
| Coarse KNN | 60 | 65.95 | 34.05 | 65.95 | 34.05 | 0.75 |
| Cubic KNN | 60 | 65.95 | 34.05 | 65.95 | 34.05 | 0.72 |
| Coarse tree | 55 | 65.4 | 34.6 | 71.9 | 28.1 | 0.73 |
| Quadratic discriminant | 55 | 62.1 | 37.9 | 66.25 | 36.9 | 0.67 |
| Gaussian Naive Bayes | 55 | 62.1 | 37.9 | 63.1 | 36.9 | 0.68 |
| Kernel Naive Bayes | 55 | 62.1 | 37.9 | 63.1 | 36.9 | 0.71 |
| Medium KNN | 55 | 62.1 | 37.9 | 63.1 | 36.9 | 0.76 |
| Cosine KNN | 55 | 62.1 | 37.9 | 63.1 | 36.9 | 0.7 |
Fig. 51.2 Confusion matrix of subspace KNN on test data
Fig. 51.3 ROC of subspace KNN on test data
51.5 Conclusion
Early identification of diabetes is a major challenge in health care. In this
work, we developed a system for predicting diabetes. The
purpose of this study is to propose an ensemble method based on ML
techniques for diabetes prediction using the well-known Pima Indian
Diabetes dataset. We used twenty-four different ML algorithms, including KNN, SVM,
Naïve Bayes, logistic regression, discriminant models, and ensemble models, to
predict diabetes and evaluated their performance on measures such as TPR, FNR,
PPV, FDR, overall accuracy, training time, and AUC. This work also emphasizes
missing value imputation. On the validation data, overall accuracy improves
considerably for several models with missing value imputation, as shown in Table 51.2.
Among all the evaluated models, subspace KNN is the most promising
for predicting diabetes, with an accuracy of 85% on the test data.
References
1. Khanam, J.J., Foo, S.Y.: A comparison of machine learning algorithms for diabetes prediction.
ICT Express (2021). https://doi.org/10.1016/j.icte.2021.02.004
2. Ghaderi, M., Farahani, M.A., Hajiha, N., Ghaffari, F., Haghani, H.: The role of smartphone-
based education on the risk perception of type 2 diabetes in women with gestational diabetes.
Health Technol. (Berl) 9(5), 829–837 (2019). https://doi.org/10.1007/s12553-019-00342-3
3. Mandal, S.: New molecular biomarkers in precise diagnosis and therapy of Type 2 diabetes.
Health Technol. (Berl) 10(3), 601–608 (2020). https://doi.org/10.1007/s12553-019-00385-6
4. Himsworth, H.P.: The syndrome of diabetes mellitus and its causes. Lancet 253(6551), 465–473
(1949). https://doi.org/10.1016/S0140-6736(49)90797-7
5. Mujumdar, A., Vaidehi, V.: Diabetes prediction using machine learning algorithms. Procedia
Comput. Sci. 165, 292–299 (2019). https://doi.org/10.1016/j.procs.2020.01.047
6. Phaloprakarn, C., Tangjitgamol, S.: Risk score for predicting primary cesarean delivery in
women with gestational diabetes mellitus. BMC Pregnancy Childbirth 20(1), 1–8 (2020).
https://doi.org/10.1186/s12884-020-03306-y
7. https://www.mayoclinic.org/diseases-conditions/prediabetes/diagnosis-treatment/drc-20355284
8. Sisodia, D., Sisodia, D.S.: Prediction of diabetes using classification algorithms. Procedia
Comput. Sci. 132(Iccids), 1578–1585 (2018). https://doi.org/10.1016/j.procs.2018.05.122
9. Kavakiotis, I., Tsave, O., Salifoglou, A., Maglaveras, N., Vlahavas, I., Chouvarda, I.: Machine
learning and data mining methods in diabetes research. Comput. Struct. Biotechnol. J. 15,
104–116 (2017). https://doi.org/10.1016/j.csbj.2016.12.005
10. Okagbue, H.I., Adamu, P.I., Oguntunde, P.E., Obasi, E.C.M., Odetunmibi, O.A.: Machine
learning prediction of breast cancer survival using age, sex, length of stay, mode of diagnosis
and location of cancer. Health Technol. (Berl). 11, 887–893 (2021). https://doi.org/10.1007/s12553-021-00572-4
11. Lama, L. et al.: Machine learning for prediction of diabetes risk in middle-aged Swedish people.
Heliyon 7, e07419 (2021). https://doi.org/10.1016/j.heliyon.2021.e07419
12. Gregor Stiglic, L.C., Wang, F., Sheikh, A.: Development and validation of the type 2 diabetes
mellitus 10-year risk score prediction models from survey data. Prim. Care Diabetes 15(4),
699–705 (2021)
13. Lebech Cichosz, O.S., Hasselstrøm Jensen, M.: Short-term prediction of future continuous
glucose monitoring readings in type 1 diabetes: development and validation of a neural network
regression model. Int. J. Med. Inform. 151, 104472 (2021)
14. Vizhi, K., Dash, A.: Diabetes prediction using machine learning. Int. J. Adv. Sci. Technol.
29(6), 2842–2852 (2020). https://doi.org/10.32628/cseit2173107
15. Muhammad Daniyal Baig, M.F.N.: Diabetes prediction using machine learning algorithms.
Lect. Notes Netw. Syst. (2020). https://doi.org/10.13140/RG.2.2.18158.64328
16. Soni, M., Varma, S.: Diabetes prediction using machine learning techniques. Int. J. Eng. Res.
Technol. 9(09), 921–924 (2020). https://doi.org/10.1007/978-981-33-6081-5_34
17. Varga, T.V., Niss, K., Estampador, A.C., Collin, C.B., Moseley, P.L.: Association is not predic-
tion: a landscape of confused reporting in diabetes—a systematic review. Diabetes Res. Clin.
Pract. 170, 108497 (2020). https://doi.org/10.1016/j.diabres.2020.108497
18. Jaiswal, T.P.V., Negi, A., Pal, T.: A review on current advances in machine learning based
diabetes prediction. Prim. Care Diabetes 15(3), 435–443 (2021)
19. Kalagotla, K., Satish Kumar, Gangashetty, S.V.: A novel stacking technique for prediction of
diabetes. Comput. Biol. Med. 135, 104554 (2021)
20. García-Ordás, M.T., Benavides, C., Benítez-Andrades, J.A., Alaiz-Moretón, H., García-
Rodríguez, I.: Diabetes detection using deep learning techniques with oversampling and feature
augmentation. Comput. Methods Programs Biomed. 202, 105968 (2021). https://doi.org/10.1016/j.cmpb.2021.105968
21. Pima Indians Diabetes Database. Available at: https://www.kaggle.com/uciml/pima-indians-diabetes-database
ResearchGate has not been able to resolve any citations for this publication.
Article
Full-text available
Aims To study if machine learning methodology can be used to detect persons with increased type 2 diabetes or prediabetes risk among people without known abnormal glucose regulation. Methods Machine learning and interpretable machine learning models were applied on research data from Stockholm Diabetes Preventive Program, including more than 8000 people initially with normal glucose tolerance or prediabetes to determine high and low risk features for further impairment in glucose tolerance at follow-up 10 and 20 years later. Results The features with the highest importance on the outcome were body mass index, waist-hip ratio, age, systolic and diastolic blood pressure, and diabetes heredity. High values of these features as well as diabetes heredity conferred increased risk of type 2 diabetes. . The machine learning model was used to generate individual, comprehensible risk profiles, where the diabetes risk was obtained for each person in the data set. Features with the largest increasing or decreasing effects on the risk were determined. Conclusions The primary application of this machine learning model is to predict individual type 2 diabetes risk in people without diagnosed diabetes, and to which features the risk relates However, since most features affecting diabetes risk also play a role for metabolic control in diabetes, e.g. body mass index, diet composition, tobacco use, and stress, the tool can possibly also be used in diabetes care to develop more individualized, easily accessible health care plans to be utilized when encountering the patients.
Article
Full-text available
Diabetes is a disease that has no permanent cure; hence early detection is required. Data mining, machine learning (ML) algorithms, and Neural Network (NN) methods are used in diabetes prediction in our research. We used the Pima Indian Diabetes (PID) dataset for our research, collected from the UCI Machine Learning Repository. The data set contains information about 768 patients and their corresponding nine unique attributes. We used seven ML algorithms on the dataset to predict diabetes. We found that the model with Logistic Regression (LR) and Support Vector Machine (SVM) works well on diabetes prediction. We built the NN model with a different hidden layer with various epochs and observed the NN with two hidden layers provided 88.6% accuracy.
Article
Full-text available
Background and objective: Diabetes is a chronic pathology which is affecting more and more people over the years. It gives rise to a large number of deaths each year. Furthermore, many people living with the disease do not realize the seriousness of their health status early enough. Late diagnosis brings about numerous health problems and a large number of deaths each year so the development of methods for the early diagnosis of this pathology is essential. Methods: In this paper, a pipeline based on deep learning techniques is proposed to predict diabetic people. It includes data augmentation using a variational autoencoder (VAE), feature augmentation using an sparse autoencoder (SAE) and a convolutional neural network for classification. Pima Indians Diabetes Database, which takes into account information on the patients such as the number of pregnancies, glucose or insulin level, blood pressure or age, has been evaluated. Results: A 92.31% of accuracy was obtained when CNN classifier is trained jointly the SAE for featuring augmentation over a well balanced dataset. This means an increment of 3.17% of accuracy with respect the state-of-the-art. Conclusions: Using a full deep learning pipeline for data preprocessing and classification has demonstrate to be very promising in the diabetes detection field outperforming the state-of-the-art proposals.
Article
Full-text available
Aims Appropriate analysis of big data is fundamental to precision medicine. While statistical analyses often uncover numerous associations, associations themselves do not convey predictive value. Confusion between association and prediction harms clinicians, scientists, and ultimately, the patients. We analyzed published papers in the field of diabetes that refer to “prediction” in their titles. We assessed whether these articles report metrics relevant to prediction. Methods A systematic search was undertaken using NCBI PubMed. Articles with the terms “diabetes” and “prediction” were selected. All abstracts of original research articles, within the field of diabetes epidemiology, were searched for metrics pertaining to predictive statistics. Simulated data was generated to visually convey the differences between association and prediction. Results The search-term yielded 2,182 results. After discarding non-relevant articles, 1,910 abstracts were evaluated. Of these, 39% (n = 745) reported metrics of predictive statistics, while 61% (n = 1,165) did not. The top reported metrics of prediction were ROC AUC, sensitivity and specificity. Using the simulated data, we demonstrated that biomarkers with large effect sizes and low P values can still offer poor discriminative utility. Conclusions We demonstrate a landscape of confused reporting within the field of diabetes epidemiology where the term “prediction” is often incorrectly used to refer to association statistics. We propose guidelines for future reporting, and two major routes forward in terms of main analytic procedures and research goals: the explanatory route, which contributes to precision medicine, and the prediction route which contributes to personalized medicine.
Article
Full-text available
Background: Women with gestational diabetes mellitus (GDM) have a higher risk of cesarean delivery (CD) than glucose-tolerant women. The aim of this study was to develop and validate a risk score for predicting primary CD in women with GDM. Methods: A risk score for predicting primary CD was developed using significant clinical features of 385 women who had a diagnosis of GDM and delivered at our institution between January 2011 and December 2014. The score was then tested for validity in another cohort of 448 individuals with GDM who delivered between January 2015 and December 2018. Results: The risk score was developed using the features nulliparity, excess gestational weight gain, and insulin use. The scores that classified the pregnant women as low risk (0 points), intermediate risk (1-3 points), and high risk (≥ 4 points) were directly associated with the primary CD rates of the women in the development cohort: 14.7, 38.2 and 62.3%, respectively (P < 0.001). The model showed good calibration and acceptable discriminative power with a C statistic of 0.724 (95% confidence interval, 0.670-0.777). Similar results were observed in the validation cohort. Conclusion: A risk score using the features nulliparity, excess gestational weight gain, and insulin use can estimate the risk for primary CD in women with GDM.
Article
Breast cancer is one of the leading causes of death in females and survival depends on early diagnosis and treatment. This paper applied machine learning techniques in prediction of breast cancer survival (dead or alive) using age, sex, length of stay, mode of diagnosis and location of cancer as predictors (independent variables). The data was obtained from the outpatient department of the University of Ilorin Teaching Hospital, Ilorin, Nigeria. The sample size of 300 consists of 175 females and 25 males who were admitted at the hospital and treated for breast cancer. The patients were later discharged or died. Adaptive boosting (AdaBoost) performed best out of the data mining models used in the classification in all the three cases where the target class is average over classes, alive or dead. The AdaBoost performed best with the classification accuracy and area under curve (AUC) of 98.3% and 99.9% respectively. Furthermore, a probe on the prediction by AdaBoost showed that the probability of dead due to breast cancer is 0.47, which the length of stay hugely contributed to the high probability, location of breast cancer and mode of diagnosis contributed minimally while age and sex contributed insignificantly. The high probability of breast cancer mortality predicted in this paper is a call for concern as early detection of breast cancer, routine breast examination and breast cancer awareness are crucial in increasing the probability of survival. The results can be used to design a decision support system that can increase the chances of breast cancer survival.
Article
Background: Machine Learning (ML) represents a rapidly growing technology that supplies effective solutions for solving complex problems. The application of ML techniques in healthcare is gaining more attention because of ML-associated automatic pattern identification mechanisms. Diabetes is characterized by hyperglycemia resulting from improper insulin secretion and/or insulin utilization. Methods: The PIMA Indian diabetes dataset was obtained from the University of California, Irvine (UCI) machine learning repository for experimental purposes. The study was carried out in three stages: (1) a correlation technique was developed for feature selection; (2) the AdaBoost technique was implemented on selected features for classification; and (3) a novel stacking technique with multi-layer perceptron, support vector machine, and logistic regression (MLP, SVM, and LR, respectively) was designed and developed for the selected features. Results: The proposed stacking technique integrated the intelligent models and led to an improvement in model performance, thereby overcoming the issue of generating multiple decision stumps by AdaBoost. The proposed novel stacking technique outperformed other models when compared with AdaBoost in terms of performance metrics. The proposed models were then implemented on other datasets, such as the Cleveland heart disease and Wisconsin breast cancer diagnostic datasets, to illustrate their broader applications. Conclusion: Stacking can outperform other models when compared with the other reported techniques that were implemented using the PIMA Indian diabetes dataset.
Article
Background and objective: CGM systems are still subject to a time delay, which, especially during rapid changes, causes a clinically significant difference between the CGM and the actual BG level. This study aimed to explore the potential of developing and validating a model for prediction of future CGM measurements in order to overcome the time delay. Methods: An artificial neural network regression (NN) approach was used to predict CGM values with a lead time of 15 min. The NN was trained and internally validated on 23 million minutes of CGM and externally validated on 2 million minutes of CGM. The validation included data from 278 type 1 diabetes patients using three different CGM sensors. The NN performance was compared with three alternative methods: linear extrapolation, spline extrapolation, and last observation carried forward. Results: The internal validation yielded an RMSE of 9.1 mg/dL and a MARD of 4.2%, and 99.9% of predictions were in the A + B zone of the consensus error grid. The external validation yielded an RMSE of 5.9–11.3 mg/dL and a MARD of 3.2–5.4%, and 99.9–100% of predictions were in the A + B zone of the consensus error grid. The NN performed better on all parameters compared to the alternative methods. Conclusions: We proposed and validated an NN glucose prediction model that is potentially simple to use and implement. The model only needs input from a CGM system in order to facilitate glucose prediction with a lead time of 15 min. The approach yielded good results for both internal and external validation.
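As a rough illustration of two of the baseline methods named in this abstract (last observation carried forward and linear extrapolation) and of the RMSE and MARD metrics, a minimal Python sketch follows. The 5-minute sampling interval, the toy glucose series, and all function names are assumptions made for illustration; they are not taken from the study, and the neural network itself is not reproduced here.

```python
import math

def locf(history, steps_ahead):
    """Last observation carried forward: predict the most recent value."""
    return history[-1]

def linear_extrapolation(history, steps_ahead):
    """Extend the slope of the last two samples steps_ahead samples forward."""
    slope = history[-1] - history[-2]
    return history[-1] + slope * steps_ahead

def rmse(pred, actual):
    """Root mean squared error, in the units of the series (e.g. mg/dL)."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual))

def mard(pred, actual):
    """Mean absolute relative difference, in percent of the reference value."""
    return 100 * sum(abs(p - a) / a for p, a in zip(pred, actual)) / len(actual)

def evaluate(series, predictor, steps_ahead):
    """Predict each point steps_ahead samples ahead from the data seen so far."""
    preds, actual = [], []
    # Start at index 1 so every history contains at least two samples.
    for t in range(1, len(series) - steps_ahead):
        preds.append(predictor(series[: t + 1], steps_ahead))
        actual.append(series[t + steps_ahead])
    return rmse(preds, actual), mard(preds, actual)

# Hypothetical CGM trace in mg/dL, assumed to be sampled every 5 minutes,
# so a 15-minute lead time corresponds to 3 samples ahead.
toy_series = [100, 105, 110, 115, 120, 125, 130]
STEPS_AHEAD = 3

rmse_locf, mard_locf = evaluate(toy_series, locf, STEPS_AHEAD)
rmse_linear, mard_linear = evaluate(toy_series, linear_extrapolation, STEPS_AHEAD)
```

On this perfectly linear toy trace, linear extrapolation is exact while last observation carried forward lags by a constant 15 mg/dL. Real CGM traces are not linear, which is the gap a learned model is meant to close.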
Article
Aims: In this paper, we demonstrate the development and validation of 10-year type 2 diabetes mellitus (T2DM) risk prediction models based on large survey data. Methods: Data from the Survey of Health, Ageing and Retirement in Europe (SHARE), collected in 12 European countries and comprising 53 variables that represent behavioural as well as physical and mental health characteristics of participants aged 50 or older, were used to build and validate prediction models. To account for the strongly unbalanced outcome variable, each instance was assigned a weight according to the inverse proportion of the outcome label when the regularized logistic regression model was built. Results: A pooled sample of 16,363 individuals was used to build and validate a global regularized logistic regression model that achieved an area under the receiver operating characteristic curve (AUROC) of 0.702 (95% CI: 0.698-0.706). Additionally, we measured the performance of local country-specific models, where AUROC ranged from 0.578 (0.565-0.592) to 0.768 (0.749-0.787). Conclusions: We have developed and validated a survey-based 10-year T2DM risk prediction model for use across 12 European countries. Our results demonstrate the importance of re-calibrating the models as well as the strengths of pooling data from multiple countries to reduce the variance and consequently increase the precision of the results.
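The instance weighting described in this abstract can be sketched as follows. This is a minimal Python illustration that assumes each instance's weight is the total sample count divided by the count of its class; the exact normalization used in the study is not stated, and the function name and toy labels are hypothetical.

```python
from collections import Counter

def inverse_proportion_weights(labels):
    """Weight each instance by the inverse of its class proportion.

    Assumption: weight = n_samples / n_samples_in_class, so that every
    class contributes the same total weight to a weighted loss.
    """
    counts = Counter(labels)
    n = len(labels)
    class_weight = {label: n / count for label, count in counts.items()}
    return [class_weight[label] for label in labels]

# Hypothetical unbalanced outcome: 8 negatives and 2 positives.
toy_labels = [0] * 8 + [1] * 2
weights = inverse_proportion_weights(toy_labels)

# Total weight per class, to confirm both classes are balanced.
total_negative = sum(w for w, y in zip(weights, toy_labels) if y == 0)
total_positive = sum(w for w, y in zip(weights, toy_labels) if y == 1)
```

Weights of this kind would typically be passed as per-sample weights when fitting the logistic regression, so that errors on the rare class are not swamped by the majority class.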
Article
Diabetes is a metabolic disorder characterized by a high blood glucose level over a prolonged period, because the body is not capable of using glucose properly. The severe complications associated with diabetes include diabetic ketoacidosis, nonketotic hyperosmolar coma, cardiovascular disease, stroke, chronic renal failure, retinal damage, and foot ulcers. There is a huge increase in the number of patients with diabetes globally, and it is considered a major health problem worldwide. Early diagnosis of diabetes is helpful for treatment and reduces the chance of severe complications associated with it. Machine learning algorithms (such as ANN, SVM, Naive Bayes, PLS-DA, and deep learning) and data mining techniques are used to detect interesting patterns for the diagnosis and treatment of disease. Current computational methods for diabetes diagnosis have some limitations and are not tested on different datasets or on people from different countries, which limits the practical use of prediction methods. This paper is an effort to summarize the majority of the literature concerned with machine learning and data mining techniques applied to the prediction of diabetes and the associated challenges. This report would be helpful for better prediction of the disease and for improving the understanding of the pattern of diabetes. Consequently, the report would be helpful for treatment and for reducing the risk of other complications of diabetes.