Diabetes Disease through Machine Learning: A comparative
study
Gonçalo Marques
Instituto de Telecomunicações,
Universidade da Beira Interior,
Covilhã, Portugal
Ivan Miguel Pires
Instituto de Telecomunicações,
Universidade da Beira Interior,
Covilhã, Portugal, Computer Science
Department, Polytechnic Institute of
Viseu, Viseu, Portugal, and UICISA:E
Research Centre, Polytechnic Institute
of Viseu, Viseu, Portugal
Nuno M. Garcia
Instituto de Telecomunicações,
Universidade da Beira Interior,
Covilhã, Portugal
ABSTRACT
Diabetes is a critical problem in developed and developing countries.
The early detection of this disease is crucial for efficient and
effective treatment. Moreover, the application of machine learning for
disease detection is a trending topic. There are numerous machine
learning methods available in the literature. The main contribution of
this paper is to present a preliminary study on the application of
machine learning methods to a public and widely used diabetes dataset.
The authors have applied eight different machine learning techniques
using the PIMA diabetes dataset. The data have been normalized, and
Neural Networks, SGD, Random Forest, kNN, Naïve Bayes, AdaBoost,
Decision Tree and SVM methods have been applied. First, the techniques
have been validated using stratified 10-fold cross-validation. Second,
the confusion matrix has been extracted for each method, and the
accuracy, recall, precision and F1-score have been calculated. The
three methods with the best accuracy are Neural Networks, SGD and kNN,
which report average accuracies between classes of 77.47%, 76.43% and
73.96%, respectively.
CCS CONCEPTS
• Computing methodologies; • Machine Learning; • Machine Learning Approaches;
KEYWORDS
Diabetes, Machine Learning, Health Informatics
ACM Reference Format:
Gonçalo Marques, Ivan Miguel Pires, and Nuno M. Garcia. 2020. Diabetes
Disease through Machine Learning: A comparative study. In 2020 4th
International Conference on Computer Science and Artificial Intelligence
(CSAI 2020), December 11–13, 2020, Zhuhai, China. ACM, New York, NY, USA,
6 pages. https://doi.org/10.1145/3445815.3445828
Permission to make digital or hard copies of part or all of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for third-party components of this work must be honored.
For all other uses, contact the owner/author(s).
CSAI 2020, December 11–13, 2020, Zhuhai, China
©2020 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-8843-6/20/12.
https://doi.org/10.1145/3445815.3445828
1 INTRODUCTION
Chronic diseases are a critical problem nowadays. In particular,
diabetes is one of the major challenges for healthcare researchers. The
World Health Organization (WHO) reports that 402 million people have
diabetes. These people are mainly located in low- and middle-income
countries. Diabetes is closely related to critical damage to the heart,
blood vessels and eyes. Furthermore, 1.6 million deaths are directly
connected with diabetes, according to WHO [11]. Additionally, WHO has
launched several activities to support patients with diabetes and is
expanding access to diabetes treatments [14].
The early diagnosis of diabetes is critical for its treatment [4].
Health professionals make the diagnosis according to their experience
and knowledge [9]. Currently, healthcare units support electronic
healthcare records (EHR), especially in developed countries. EHR hold a
massive number of records and organized patient information [13].
Artificial intelligence is widely used in multiple engineering
applications as well as in the healthcare domain [3, 6]. Numerous
research units are studying the implementation of machine learning for
disease recognition [3]. On the one hand, the successive advances in
computer science have enabled the design of cyber-physical systems that
can be used for real-time data collection [7]. On the other hand, the
evolution of the methods available for data consultation and storage
enables the development of intelligent systems [10]. The implementation
of machine learning can also be associated with the creation of enhanced
living environments [8]. Enhanced living environments and ambient
assisted living technologies aim to promote people's quality of life.
Machine learning can transform these data into automated systems that
can be used to support medical decisions [1]. These systems will
identify patterns and provide significant inputs for enhanced medical
decisions [2]. The high impact and healthcare costs associated with
diabetes can be attenuated using machine learning [5]. The application
of automated methods will transform data into knowledge and detect
patterns that are difficult for human beings to evaluate given the
amount of data available [12].
This study aims to evaluate the results of applying different machine
learning methods to the diagnosis of diabetes. The authors have applied
eight different machine learning techniques to the public PIMA diabetes
dataset: Neural Networks, SGD, Random Forest, kNN, Naïve Bayes,
AdaBoost, Decision Tree and SVM. The methods have been validated using
stratified 10-fold cross-validation. The confusion matrix for each
method has been extracted, and the performance evaluated.
Table 1: Dataset statistical analysis.
Feature       Range    Minimum  Maximum  Mean      Std. Error  Std. Deviation  Variance
                                                   (of Mean)
Pregnancies   17       0        17       3.85      0.122       3.370           11.354
Glucose       199      0        199      120.89    1.154       31.973          1022.248
BP            122      0        122      69.11     0.698       19.356          374.647
SFT           99       0        99       20.54     0.576       15.952          254.473
Insulin       846      0        846      79.80     4.159       115.244         13281.180
BMI           67.1     0.0      67.1     31.993    0.2845      7.8842          62.160
DPF           2.342    0.078    2.420    0.47188   0.011956    0.331329        0.110
Age           60       21       81       33.24     0.424       11.760          138.303
2 MATERIALS AND METHODS
2.1 Dataset
The dataset used is the Pima Indians Diabetes Dataset [11]. The data
were collected from the National Institute of Diabetes and Digestive
and Kidney Diseases. The original goal of this dataset was to test the
identification of the presence of diabetes. The data include only
female individuals at least 21 years old. The dataset consists of
several medical predictor variables and one target variable, Outcome.
The features included are the number of pregnancies, glucose level,
blood pressure (BP), triceps skinfold thickness (SFT), insulin, Body
Mass Index (BMI), diabetes pedigree function (DPF), and age. The
dataset used is publicly available. The target variable indicates the
presence or absence of diabetes. The dataset has 768 entries: 268 have
diabetes and 500 do not.
The statistical analysis of the dataset is presented in Table 1. The
analysis has been conducted in IBM SPSS version 26. The range, minimum,
maximum, mean, standard deviation and variance of each feature have
been extracted. The data have been normalized to the range [-1, 1]
before the application of the machine learning methods.
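The normalization step can be illustrated with a short sketch. It assumes a plain linear min-max rescaling of each feature to [-1, 1], which is one common reading of the description above; the function names are hypothetical and the sample values (other than the Glucose minimum and maximum from Table 1) are illustrative only.

```python
def scale_to_range(value, lo, hi, new_min=-1.0, new_max=1.0):
    """Linearly map value from [lo, hi] to [new_min, new_max]."""
    if hi == lo:
        raise ValueError("feature is constant; cannot rescale")
    return new_min + (value - lo) * (new_max - new_min) / (hi - lo)


def normalize_column(values):
    """Rescale a list of numbers to [-1, 1] using its own min and max."""
    lo, hi = min(values), max(values)
    return [scale_to_range(v, lo, hi) for v in values]


# Glucose spans 0..199 in Table 1; the two middle values are illustrative.
glucose = [0, 85, 120, 199]
print(normalize_column(glucose))
```

With this mapping the smallest observed value of a feature becomes -1 and the largest becomes 1, so features with very different ranges (for example Insulin and DPF) end up on a comparable scale.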
2.2 Machine Learning Methods
There are numerous machine learning methods available in the
literature. The authors have applied eight different machine learning
techniques: Neural Networks, SGD, Random Forest, kNN, Naïve Bayes,
AdaBoost, Decision Tree and SVM. The specification of the machine
learning methods is presented in this section. Presenting the
parameters used is essential for the reproduction of the results.
In the Decision Tree, pruning is defined as at least two instances in
leaves, at least five instances in internal nodes and a maximum depth
of 100. Splitting stops when the majority reaches 95%, and binary trees
are used.
The kNN method parameters are defined as follows. The number of
neighbours is 5, the metric used is Euclidean, and the weight is
Uniform.
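To make the kNN configuration concrete, the following is a minimal sketch of a 5-nearest-neighbour vote with Euclidean distance and uniform weights. It is not the implementation used in the study; the function name `knn_predict` and the toy training points are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=5):
    """Classify query by unweighted majority vote of its k nearest neighbours."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    ]
    distances.sort(key=lambda pair: pair[0])
    # Uniform weights: each of the k neighbours contributes one vote.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# HYPOTHETICAL two-feature points already scaled to [-1, 1].
train_x = [(-0.9, -0.8), (-0.7, -0.9), (-0.8, -0.6),
           (0.8, 0.7), (0.9, 0.9), (0.6, 0.8), (0.7, 0.6)]
train_y = [0, 0, 0, 1, 1, 1, 1]

print(knn_predict(train_x, train_y, (0.5, 0.5)))
print(knn_predict(train_x, train_y, (-0.8, -0.8)))
```

Because the distance is Euclidean, features with larger numeric ranges would dominate the vote if the data were not normalized first.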
The AdaBoost method uses a tree as the base estimator, the number of
estimators is 50, and the classification algorithm is SAMME.R.
The Random Forest was developed using 10 as the number of trees, the
maximal number of considered features is unlimited, replicable training
is not used, the maximal tree depth is unlimited, and it stops
splitting nodes with a maximum of 5 instances.
The SVM method uses $C = 1.0$ and $\epsilon = 0.1$, the kernel is RBF,
$\exp(-g\,|x-y|^{2})$ with $g$ set to auto, the numerical tolerance is
0.001 and the iteration limit is 100.
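The RBF kernel named in the SVM configuration can be written out as a small function. This is only a sketch: the paper does not state what value "auto" resolves to, so the choice of 1 / (number of features) below is an assumption, and the names `rbf_kernel` and `gamma_auto` are hypothetical.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel value exp(-gamma * ||x - y||^2) for two feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


n_features = 8  # the dataset has eight predictor variables
gamma_auto = 1.0 / n_features  # ASSUMPTION: "auto" means 1 / n_features

a = [0.0] * n_features
b = [1.0] * n_features
print(rbf_kernel(a, a, gamma_auto))  # identical points give the maximum, 1.0
print(rbf_kernel(a, b, gamma_auto))  # similarity decays with squared distance
```

The kernel equals 1 for identical inputs and approaches 0 as the points move apart, with the kernel coefficient controlling how quickly the similarity decays.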
The Neural Network consists of 500 neurons, the activation function is
Identity, the solver is ADAM, the alpha is 0.0009, the maximum number
of iterations is 250, and replicable training is used.
The SGD method uses the Hinge classification loss function, the Squared
Loss is implemented, the regularization is Ridge (L2), the
regularization strength ($\alpha$) is $1 \times 10^{-5}$, the learning
rate is Constant, the initial learning rate ($\eta_0$) is 0.01, and
shuffling the data after each iteration is set to True.
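The SGD settings above can be read as the following update rule. This is a minimal sketch of stochastic gradient descent on the hinge loss with an L2 penalty, a constant learning rate of 0.01, a regularization strength of 1e-5 and reshuffling on every pass. It assumes labels coded as +1 and -1; the function names, the number of epochs and the seed are hypothetical, and the tool used in the study may differ in detail.

```python
import random


def sgd_hinge_step(w, b, x, y, eta=0.01, alpha=1e-5):
    """One SGD update for a linear classifier with hinge loss and L2 penalty.

    y must be +1 or -1. Per-sample objective:
    max(0, 1 - y * (w.x + b)) + (alpha / 2) * ||w||^2
    """
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    # The L2 penalty shrinks every weight slightly on each step.
    new_w = [wi - eta * alpha * wi for wi in w]
    new_b = b
    if margin < 1:
        # Sample is misclassified or inside the margin: move towards it.
        new_w = [wi + eta * y * xi for wi, xi in zip(new_w, x)]
        new_b = b + eta * y
    return new_w, new_b


def train(samples, labels, epochs=100, seed=0):
    """Run SGD for a fixed number of passes, reshuffling before each pass."""
    rng = random.Random(seed)
    w, b = [0.0] * len(samples[0]), 0.0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            w, b = sgd_hinge_step(w, b, samples[i], labels[i])
    return w, b


# HYPOTHETICAL one-feature toy data, linearly separable.
w, b = train([(1.0,), (-1.0,)], [1, -1])
print(w, b)
```

With a constant learning rate the step size never decays, so the weights keep moving as long as any sample lies inside the margin.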
2.3 Validation and Study Design
The experiments have been carried out on a MacBook Pro (15-inch, 2018).
The machine incorporates a 2.6 GHz 6-Core Intel Core i7 CPU and 16 GB
of 2400 MHz DDR4 memory. The data have been normalized, and the machine
learning methods applied. The stratified 10-fold cross-validation
method has been used, and the confusion matrix has been extracted for
each method. Finally, the accuracy (1), precision (2), recall (3) and
F1-score (4) have been calculated.
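$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \tag{1}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{2}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{3}$$
$$\mathrm{F1\text{-}Score} = \frac{2 \cdot \mathrm{Recall} \cdot \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}} \tag{4}$$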
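Equations (1) to (4) can be evaluated directly from the four confusion-matrix counts, where TP, TN, FP and FN denote true positives, true negatives, false positives and false negatives. The sketch below uses hypothetical counts, not results from this study. It also shows one possible way of averaging over classes, a support-weighted average; this is an assumption, suggested by the fact that the recall reported for Neural Networks, SGD and kNN equals their accuracy, which is what a support-weighted recall always produces.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1-score from confusion-matrix counts.

    Assumes every denominator is non-zero.
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * recall * precision / (recall + precision)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# HYPOTHETICAL counts for a 768-sample dataset (268 positive, 500 negative);
# they are not taken from the paper.
tp, fn, tn, fp = 150, 118, 440, 60

positive = binary_metrics(tp=tp, fp=fp, fn=fn, tn=tn)
# Treat class 0 as the target by swapping the roles of the four counts.
negative = binary_metrics(tp=tn, fp=fn, fn=fp, tn=tp)

# Support-weighted average of the per-class recall values.
n_pos, n_neg = tp + fn, tn + fp
weighted_recall = (
    n_pos * positive["recall"] + n_neg * negative["recall"]
) / (n_pos + n_neg)

print(positive)
print(negative)
print(weighted_recall, positive["accuracy"])
```

The last line prints two equal values: weighting each class recall by its number of samples gives (TP + TN) / N, which is the accuracy in (1).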
The study design used in this work is presented in Figure 1. It is
composed of dataset, machine learning, validation, test and score, and
comparative study.
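The validation step of this design, stratified 10-fold cross-validation, can be sketched as follows. The sketch assumes a simple per-class round-robin assignment of shuffled indices to folds; the function name `stratified_folds` and the seed are hypothetical, and the tool used in the study may assign folds differently. The class counts (268 positive, 500 negative) are those of the dataset.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign each sample index to one of k folds, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices round-robin so every fold receives
        # nearly the same number of samples from this class.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


labels = [1] * 268 + [0] * 500
folds = stratified_folds(labels)

for test_fold in folds:
    # Each fold is the test set once; the other nine form the training set.
    train_indices = [i for fold in folds if fold is not test_fold for i in fold]
    positives = sum(labels[i] for i in test_fold)
    print(len(train_indices), len(test_fold), positives)
```

Stratification keeps the proportion of positive and negative cases in every fold close to that of the full dataset, which matters here because the two classes are imbalanced.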
3 RESULTS AND DISCUSSION
The average results between classes have been calculated, considering
the extracted confusion matrix for each method. The performance has
been validated using stratified 10-fold cross-validation.
Figure 2 presents the accuracy reported by the implemented methods. The
average values range from 66.54% to 77.47%. On the one hand, the lowest
accuracy is reported by SVM (66.54%). On the other hand, the three
methods with the highest accuracy are Neural Networks (77.47%), SGD
(76.43%) and kNN (73.96%).
Figure 1: Study Design.
Figure 2: Average accuracy values.
Figure 3: Average F1-Score values.
The F1-Score values have been calculated and are presented in Figure 3.
The best results are also presented by Neural Networks, corresponding
to 76.84%, followed by SGD with 75.67%, and the lowest performance is
reported by SVM (67.08%).
Figure 4 presents the average precision values of the different
implemented models. The values range from 76.96% reported by Neural
Networks to 68.14% presented by SVM.
As presented in Figure 5, the best recall value reported is 77.47%, for
Neural Networks. The lowest recall value is 68.54%, for the SVM method.
Moreover, the kNN method provides a recall value of 73.96%.
Figure 4: Average precision values.
Figure 5: Average recall values.
The experiments conducted allow the authors to suggest Neural Networks
for the development of automated decision support systems for diabetes.
This method reports 77.47%, 76.84%, 76.96% and 77.47% for accuracy,
F1-score, precision and recall, respectively. Furthermore, a more
detailed analysis can be conducted by testing different parameters such
as the number of neurons, the activation function and the maximum
number of iterations.
The three methods with the best accuracy are Neural Networks, SGD and
kNN. These methods report 77.47%, 76.43% and 73.96% average accuracy
between classes, respectively. The Receiver Operating Characteristic
(ROC) curve is an efficient method to evaluate the performance of a
classification model. The ROC curves of the Neural Networks, SGD and
kNN models for target class 1 and class 0 are represented in Figure 6
and Figure 7, respectively. The analysis of the Area Under the Curve
(AUC) across different classifiers is a relevant method to summarize
their performance. The AUC results averaged over classes for Neural
Networks, SGD and kNN are 82.84%, 71.60% and 77.02%, respectively.
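For readers unfamiliar with the AUC, the sketch below computes it for one target class using its rank interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name `auc` and the scores are hypothetical, and this is not necessarily how the values above were computed.

```python
def auc(scores, labels):
    """AUC as the chance that a random positive outranks a random negative."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# HYPOTHETICAL classifier scores for four samples (two per class).
labels = [1, 1, 0, 0]
print(auc([0.9, 0.8, 0.3, 0.1], labels))  # positives always ranked higher
print(auc([0.9, 0.4, 0.6, 0.1], labels))  # one positive ranked below a negative
```

Unlike accuracy, the AUC depends only on how the classifier ranks the samples, not on the decision threshold, which is why the ordering of methods by AUC can differ from their ordering by accuracy.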
Nevertheless, the present study has several limitations. On the
one hand, the number of instances in this dataset is limited. On the
other hand, this dataset only includes female individuals.
This study aims to support future research activities as a base point
for forthcoming analysis. This work should be considered a preliminary
study. Several parameters can be updated in the different methods
implemented. The impact of normalization, feature selection, data
imputation and augmentation is not addressed in this study and should
be examined in future work.
Figure 6: ROC Curve for target class 1.
Figure 7: ROC Curve for target class 0.
Currently, the application of machine learning methods in the design
and development of automated systems to support healthcare is a
trending and essential topic. Furthermore, it is crucial to promote
research in this field to decrease the cost and improve the quality of
healthcare. However, medical analysis and appreciation must always be
ensured. These methods should be effective systems to support medical
diagnostics, but they will never replace the crucial role of doctors.
4 CONCLUSIONS
This paper has presented the application of different machine learning
methods for the identification of diabetes using a public dataset. In
total, eight different approaches have been applied. After data
normalization, Neural Networks, SGD, Random Forest, kNN, Naïve Bayes,
AdaBoost, Decision Tree and SVM methods have been applied. The results
suggest the use of Neural Networks, SGD and kNN. Neural Networks
present an accuracy of 77.47%, an F1-score of 76.84%, a precision of
76.96% and a recall of 77.47%. SGD reports 76.43%, 75.67%, 75.84% and
76.43% for accuracy, F1-score, precision and recall, respectively.
Finally, the kNN method reports an average accuracy of 73.96%, an
F1-score of 73.54%, a precision of 73.38% and a recall of 73.96%.
Future work should focus on the analysis of the impact of feature
selection on the different methods. Moreover, other relevant studies
can be done with Neural Networks by testing different parameters such
as the activation function.
ACKNOWLEDGMENTS
This work is funded by FCT/MEC through national funds and
co-funded by FEDER – PT2020 partnership agreement under the
project UIDB/50008/2020.
This work is funded by National Funds through the FCT - Foun-
dation for Science and Technology, I.P., within the scope of the
project UIDB/00742/2020.
This article is based upon work from COST Action IC1303–
AAPELE–Architectures, Algorithms and Protocols for Enhanced
Living Environments and COST Action CA16226–SHELD-ON–
Indoor living space improvement: Smart Habitat for the Elderly,
supported by COST (European Cooperation in Science and Technology).
More information at www.cost.eu.
Furthermore, we would like to thank the Politécnico de Viseu
for their support.
REFERENCES
[1] Ahmed Abdelaziz, Ahmed S. Salama, A. M. Riad, and Alia N. Mahmoud. 2019. A
Machine Learning Model for Predicting of Chronic Kidney Disease Based Internet
of Things and Cloud Computing in Smart Cities. In Security in Smart Cities: Models,
Applications, and Challenges, Aboul Ella Hassanien, Mohamed Elhoseny, Syed
Hassan Ahmed and Amit Kumar Singh (eds.). Springer International Publishing,
Cham, 93–114. https://doi.org/10.1007/978-3-030-01560-2_5
[2] Flávio H.D. Araújo, André M. Santana, and Pedro de A. Santos Neto. 2016. Using
machine learning to support healthcare professionals in making preauthorisation
decisions. International Journal of Medical Informatics 94: 1–7.
https://doi.org/10.1016/j.ijmedinf.2016.06.007
[3] Igor Vyacheslavovich Buzaev, Vladimir Vyacheslavovich Plechev, Irina
Evgenievna Nikolaeva, and Rezida Maratovna Galimova. 2016. Artificial intelligence:
Neural network model as the multidisciplinary team member in clinical decision
support to avoid medical mistakes. Chronic Diseases and Translational Medicine
2, 3: 166–172. https://doi.org/10.1016/j.cdtm.2016.09.007
[4] Imane Chakour, Yousef El Mourabit, Cherki Daoui, and Mohamed Baslam. 2020.
Multi-Agent System Based on Machine Learning for Early Diagnosis of Diabetes.
In 2020 IEEE 6th International Conference on Optimization and Applications (ICOA),
1–6.
[5] Irene Dankwa-Mullan, Marc Rivo, Marisol Sepulveda, Yoonyoung Park, Jane
Snowdon, and Kyu Rhee. 2019. Transforming Diabetes Care Through Artificial
Intelligence: The Future Is Here. Population Health Management 22, 3: 229–242.
https://doi.org/10.1089/pop.2018.0129
[6] Thomas Davenport and Ravi Kalakota. 2019. The potential for artificial
intelligence in healthcare. Future Healthcare Journal 6, 2: 94–98.
https://doi.org/10.7861/futurehosp.6-2-94
[7] Nilanjan Dey, Amira S. Ashour, Fuqian Shi, Simon James Fong, and João Manuel
R. S. Tavares. 2018. Medical cyber-physical systems: A survey. Journal of Medical
Systems 42, 4: 74. https://doi.org/10.1007/s10916-018-0921-x
[8] Ivan Ganchev, Nuno M. Garcia, Ciprian Dobre, Constandinos X. Mavromoustakis,
and Rossitza Goleva (eds.). 2019. Enhanced Living Environments: Algorithms,
Architectures, Platforms, and Systems. Springer International Publishing, Cham.
https://doi.org/10.1007/978-3-030-10752-9
[9] Jigna J Hathaliya, Sudeep Tanwar, Sudhanshu Tyagi, and Neeraj Kumar. 2019.
Securing electronics healthcare records in Healthcare 4.0: A biometric-based
approach. Computers & Electrical Engineering 76: 398–410.
https://doi.org/10.1016/j.compeleceng.2019.04.017
[10] Gonçalo Marques, Rui Pitarma, Nuno M. Garcia, and Nuno Pombo. 2019. Internet
of Things Architectures, Technologies, Applications, Challenges, and Future
Directions for Enhanced Living Environments and Healthcare Systems: A Review.
Electronics 8, 10: 1081. https://doi.org/10.3390/electronics8101081
[11] Huma Naz and Sachin Ahuja. 2020. Deep learning approach for diabetes
prediction using PIMA Indian dataset. Journal of Diabetes & Metabolic Disorders 19, 1:
391–403. https://doi.org/10.1007/s40200-020-00520-5
[12] Andreas K Triantafyllidis and Athanasios Tsanas. 2019. Applications of Machine
Learning in Real-Life Digital Health Interventions: Review of the Literature.
Journal of Medical Internet Research 21, 4: e12286. https://doi.org/10.2196/12286
[13] Le Zheng, Oliver Wang, Shiying Hao, Chengyin Ye, Modi Liu, Minjie Xia, Alex
N. Sabo, Liliana Markovic, Frank Stearns, Laura Kanov, Karl G. Sylvester, Eric
Widen, Doff B. McElhinney, Wei Zhang, Jiayu Liao, and Xuefeng B. Ling. 2020.
Development of an early-warning system for high-risk patients for suicide
attempt using deep learning and electronic health records. Translational Psychiatry
10, 1: 72. https://doi.org/10.1038/s41398-020-0684-2
[14] Diabetes. Retrieved August 20, 2020 from
https://www.who.int/westernpacific/health-topics/diabetes