Deep Q-Learning for Virtual Autonomous Automobile

Abstract

Deep Q-Learning (DQN) is the reinforcement learning algorithm proposed in this research for developing autonomous automobiles. The research uses modern technologies and libraries to develop a virtual autonomous automobile. The proposed model is implemented using a neural network that takes the state “S” as the input vector x and forecasts the next potential action “a” that, according to the state-action value function, will be the most profitable. In the virtual environment developed for the research, the automobile, which is the agent, initially moves randomly and takes random actions continuously. These experiences are stored and used to train the neural network, with the dataset split in a 60–20–20% ratio. After this phase of random exploration and training, the agent is able to learn to drive on its own. This is achieved by rewarding the agent with +a for a correct or expected action and penalizing it with −p for a wrong or unexpected action. By doing so, the agent learns to stay in the lane and avoid obstacles. The research is fully software-based and virtual, so no hardware is required other than a computer. The research also reviews reinforcement learning and the DQN algorithm to enhance the reader’s understanding of this domain of AI.
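The loop the abstract describes (random exploration, stored experiences, and a reward of +a or penalty of −p fed into a Q-value update) can be sketched as follows. This is a minimal illustrative sketch, not the chapter's actual implementation: the `ReplayBuffer` and `LinearQNet` names, the layer shapes, and the learning rate are assumptions, and a linear approximator stands in for the paper's neural network.

```python
import random
from collections import deque

import numpy as np


class ReplayBuffer:
    """Stores (state, action, reward, next_state, done) transitions
    collected while the agent takes random actions in the environment."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(list(self.buffer), batch_size)


class LinearQNet:
    """Linear stand-in for the chapter's neural network: maps a state
    vector x to one Q-value per action."""

    def __init__(self, state_dim, n_actions, lr=0.05):
        self.W = np.zeros((state_dim, n_actions))
        self.lr = lr

    def q_values(self, state):
        return state @ self.W

    def act(self, state, epsilon):
        # Epsilon-greedy: random action while exploring, greedy otherwise.
        if random.random() < epsilon:
            return random.randrange(self.W.shape[1])
        return int(np.argmax(self.q_values(state)))

    def train_step(self, batch, gamma=0.99):
        # Move Q(s, a) toward the Bellman target
        # y = r + gamma * max_a' Q(s', a'), or y = r when the episode ends.
        for s, a, r, s2, done in batch:
            target = r if done else r + gamma * np.max(self.q_values(s2))
            td_error = target - self.q_values(s)[a]
            self.W[:, a] += self.lr * td_error * s
```

Here a reward of +1 stands in for the paper's +a and −1 for its −p; the abstract does not specify the actual reward magnitudes, network architecture, or exploration schedule.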
Lecture Notes in Networks and Systems 785

Abhishek Swaroop
Zdzislaw Polkowski
Sérgio Duarte Correia
Bal Virdee
Editors

Proceedings of Data
Analytics and Management
ICDAM 2023, Volume 1
Lecture Notes in Networks and Systems
Volume 785
Series Editor
Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences,
Warsaw, Poland
Advisory Editors
Fernando Gomide, Department of Computer Engineering and Automation—DCA,
School of Electrical and Computer Engineering—FEEC, University of
Campinas—UNICAMP, São Paulo, Brazil
Okyay Kaynak, Department of Electrical and Electronic Engineering,
Bogazici University, Istanbul, Türkiye
Derong Liu, Department of Electrical and Computer Engineering, University of
Illinois at Chicago, Chicago, USA
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Witold Pedrycz, Department of Electrical and Computer Engineering, University of
Alberta, Alberta, Canada
Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
Marios M. Polycarpou, Department of Electrical and Computer Engineering,
KIOS Research Center for Intelligent Systems and Networks, University of Cyprus,
Nicosia, Cyprus
Imre J. Rudas, Óbuda University, Budapest, Hungary
Jun Wang, Department of Computer Science, City University of Hong Kong,
Kowloon, Hong Kong
The series “Lecture Notes in Networks and Systems” publishes the latest
developments in Networks and Systems—quickly, informally and with high quality.
Original research reported in proceedings and post-proceedings represents the core
of LNNS.
Volumes published in LNNS embrace all aspects and subfields of, as well as new
challenges in, Networks and Systems.
The series contains proceedings and edited volumes in systems and networks,
spanning the areas of Cyber-Physical Systems, Autonomous Systems, Sensor
Networks, Control Systems, Energy Systems, Automotive Systems, Biological
Systems, Vehicular Networking and Connected Vehicles, Aerospace Systems,
Automation, Manufacturing, Smart Grids, Nonlinear Systems, Power Systems,
Robotics, Social Systems, Economic Systems, and others. Of particular value to
both the contributors and the readership are the short publication timeframe and
the world-wide distribution and exposure which enable both a wide and rapid
dissemination of research output.
The series covers the theory, applications, and perspectives on the state of the art
and future developments relevant to systems and networks, decision making, control,
complex processes and related areas, as embedded in the fields of interdisciplinary
and applied sciences, engineering, computer science, physics, economics, social, and
life sciences, as well as the paradigms and methodologies behind them.
Indexed by SCOPUS, INSPEC, WTI Frankfurt eG, zbMATH, SCImago.
All books published in the series are submitted for consideration in Web of Science.
For proposals from Asia please contact Aninda Bose (aninda.bose@springer.com).
Abhishek Swaroop · Zdzislaw Polkowski ·
Sérgio Duarte Correia · Bal Virdee
Editors
Proceedings of Data
Analytics and Management
ICDAM 2023, Volume 1
Editors
Abhishek Swaroop
Department of Information Technology
Bhagwan Parshuram Institute
of Technology
New Delhi, Delhi, India
Sérgio Duarte Correia
Polytechnic Institute of Portalegre
Portalegre, Portugal
Zdzislaw Polkowski
Jan Wyzykowski University
Polkowice, Poland
Bal Virdee
Centre for Communications Technology
London Metropolitan University
London, UK
ISSN 2367-3370 ISSN 2367-3389 (electronic)
Lecture Notes in Networks and Systems
ISBN 978-981-99-6543-4 ISBN 978-981-99-6544-1 (eBook)
https://doi.org/10.1007/978-981-99-6544-1
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Singapore Pte Ltd. 2024
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Paper in this product is recyclable.
ICDAM-2023 Steering Committee Members
Patrons
Prof. (Dr.) Don MacRaild, Pro-Vice Chancellor, London Metropolitan University,
London
Prof. (Dr.) Wioletta Palczewska, Rector, The Karkonosze State University of Applied
Sciences in Jelenia Góra, Poland
Prof. (Dr.) Beata Tel˛zka, Vice-Rector, The Karkonosze State University of Applied
Sciences in Jelenia Góra
General Chairs
Prof. Dr. Janusz Kacprzyk, Polish Academy of Sciences, Systems Research Institute,
Poland
Prof. Dr. Karim Ouazzane, London Metropolitan University, London
Prof. Dr. Bal Virdee, London Metropolitan University, London
Prof. Cesare Alippi, Polytechnic University of Milan, Italy
Honorary Chairs
Prof. Dr. Aboul Ella Hassanien, Cairo University, Egypt
Prof. Dr. Vaclav Snasel, Rector, VSB-Technical University of Ostrava,
Czech Republic
Prof. Chris Lane, London Metropolitan University, London
Conference Chairs
Prof. Dr. Vassil Vassilev, London Metropolitan University, London
Dr. Pancham Shukla, Imperial College London, London
Prof. Dr. Mak Sharma, Birmingham City University, London
Dr. Shikun Zhou, University of Portsmouth
Dr. Magdalena Baczyńska, Dean, The Karkonosze State University of Applied
Sciences in Jelenia Góra, Poland
Dr. Zdzislaw Polkowski, Adjunct Professor KPSW, The Karkonosze State University
of Applied Sciences in Jelenia Góra
Prof. Dr. Abhishek Swaroop, Bhagwan Parshuram Institute of Technology, Delhi,
India
Prof. Dr. Anil K. Ahlawat, Dean, KIET Group of Institutes, India
Technical Program Chairs
Dr. Shahram Salekzamankhani, London Metropolitan University, London
Dr. Mohammad Hossein Amirhosseini, University of East London, London
Dr. Sandra Fernando, London Metropolitan University, London
Dr. Qicheng Yu, London Metropolitan University, London
Prof. Joel J. P. C. Rodrigues, Federal University of Piauí (UFPI), Teresina—PI, Brazil
Dr. Ali Kashif Bashir, Manchester Metropolitan University, UK
Dr. Rajkumar Singh Rathore, Cardiff Metropolitan University, UK
Conveners
Dr. Ashish Khanna, Maharaja Agrasen Institute of Technology (GGSIPU),
New Delhi, India
Dr. Deepak Gupta, Maharaja Agrasen Institute of Technology (GGSIPU), New Delhi,
India
Publicity Chairs
Dr. Józef Zaprucki, Prof. KPSW, Rector’s Proxy for Foreign Affairs, The Karkonosze
State University of Applied Sciences in Jelenia Góra
Dr. Umesh Gupta, Bennett University, India
Dr. Puneet Sharma, Assistant Professor, Amity University, Noida
Dr. Deepak Arora, Professor and Head (CSE), Amity University, Lucknow Campus
João Matos-Carvalho, Lusófona University, Portugal
Co-conveners
Mr. Moolchand Sharma, Maharaja Agrasen Institute of Technology, India
Dr. Richa Sharma, London Metropolitan University, London
Preface
We are delighted to announce that London Metropolitan University,
London, in collaboration with The Karkonosze University of Applied Sciences,
Poland, Politécnico de Portalegre, Portugal, and Bhagwan Parshuram Institute of
Technology, India, has hosted the eagerly awaited and much coveted International
Conference on Data Analytics and Management (ICDAM-2023). The fourth edition
of the conference attracted a diverse range of engineering practitioners,
academicians, scholars, and industry delegates, receiving abstracts from more
than 7000 authors from different parts of the world. The committee of
professionals dedicated to the conference strove to achieve a high-quality
technical program with tracks on data analytics, data management, big data,
computational intelligence, and communication networks. All the tracks chosen
for the conference are interrelated and are very popular among the present-day
research community; consequently, a great deal of research is being carried out
in these tracks and their related sub-areas. More than 1200 full-length papers were received,
among which the contributions are focused on theoretical, computer simulation-
based research, and laboratory-scale experiments. Among these manuscripts, 190
papers have been included in the Springer proceedings after a thorough two-stage
review and editing process. All the manuscripts submitted to the ICDAM-2023 were
peer-reviewed by at least two independent reviewers, who were provided with a
detailed review pro forma. The comments from the reviewers were communicated
to the authors, who incorporated the suggestions in their revised manuscripts. The
recommendations from two reviewers were taken into consideration while selecting a
manuscript for inclusion in the proceedings. The exhaustiveness of the review process
is evident, given the large number of articles received addressing a wide range of
research areas. The stringent review process ensured that each published manuscript
met the rigorous academic and scientific standards. It is an exalting experience to
finally see these elite contributions materialize into the four book volumes as ICDAM
proceedings by Springer entitled “Proceedings of Data Analytics and Management:
ICDAM-2023”.
ICDAM-2023 invited four keynote speakers, who are eminent researchers in the
field of computer science and engineering, from different parts of the world. In
addition to the plenary sessions on each day of the conference, 17 concurrent technical
sessions were held every day to ensure the oral presentation of around 190 accepted
papers. Keynote speakers and session chair(s) for each of the concurrent sessions
were leading researchers from the thematic area of the session. The delegates
were provided with a book of extended abstracts so that they could quickly browse
through the contents, follow the presentations, and share the work with a broad
audience. The research part of the conference was organized in a total of 22 special
sessions. These special sessions provided the opportunity for researchers conducting
research in specific areas to present their results in a more focused environment.
An international conference of such magnitude and release of the ICDAM-2023
proceedings by Springer has been the remarkable outcome of the untiring efforts
of the entire organizing team. The success of an event undoubtedly involves the
painstaking efforts of several contributors at different stages, dictated by their devo-
tion and sincerity. Fortunately, since the beginning of its journey, ICDAM-2023 has
received support and contributions from every corner. We thank them all who have
wished the best for ICDAM-2023 and contributed by any means toward its success.
The edited proceedings volumes by Springer would not have been possible without
the perseverance of all the steering, advisory, and technical program committee
members.
The organizers of ICDAM-2023 owe thanks to all the contributing authors
for their interest and exceptional articles. We would also like to thank the authors
of the papers for adhering to the time schedule and for incorporating the review
comments. We wish to extend our heartfelt acknowledgment to the authors, peer-
reviewers, committee members, and production staff whose diligent work gave shape
to the ICDAM-2023 proceedings. We especially want to thank our dedicated team of
peer-reviewers who volunteered for the arduous and tedious task of quality checking
and critiquing the submitted manuscripts. We wish to thank our faculty colleague
Mr. Moolchand Sharma for his enormous assistance during the conference.
The time he spent, and the midnight oil he burnt, are greatly appreciated,
for which we will ever remain indebted. The management, faculty, administrative,
and support staff of the college have always extended their services whenever
needed, for which we remain thankful to them.
Lastly, we would like to thank Springer for accepting our proposal for publishing
the ICDAM-2023 conference proceedings. The help received from Mr. Aninda Bose,
Senior Acquisitions Editor, throughout the process has been very useful.
New Delhi, India
Polkowice, Poland
Portalegre, Portugal
London, UK
Abhishek Swaroop
Zdzislaw Polkowski
Sérgio Duarte Correia
Bal Virdee
Contents
Diagnosis of Parkinson Disease Using Ensemble Methods for Class
Imbalance Problem ................................................ 1
Ritika Kumari, Jaspreeti Singh, and Anjana Gosain
A Comparative Analysis of Pneumonia Detection Using Chest
X-rays with DNN .................................................. 11
Prateek Jha, Mohit Rohilla, Avantika Goyal, Siddharth Arora,
Ruchi Sharma, and Jitender Kumar
Machine Learning-Based Binary Sentiment Classification of Movie
Reviews in Hindi (Devanagari Script) ............................... 23
Ankita Sharma and Udayan Ghose
Deep Learning-Based Recommendation Systems: Review
and Critical Analysis .............................................. 39
Md Mahtab Alam and Mumtaz Ahmed
Retention in Second Year Computing Students in a London-Based
University During the Post-COVID-19 Era Using Learned
Optimism as a Lens: A Statistical Analysis in R ...................... 57
Alexandros Chrysikos and Neal Bamford
Alzheimer’s Disease Knowledge Graph Based on Ontology
and Neo4j Graph Database ......................................... 71
Ivaylo Spasov, Sophia Lazarova, and Dessislava Petrova-Antonova
Forecasting Bitcoin Prices in the Context of the COVID-19
Pandemic Using Machine Learning Approaches ...................... 81
Prashanth Sontakke, Fahimeh Jafari, Mitra Saeedi,
and Mohammad Hossein Amirhosseini
Online Food Delivery Customer Churn Prediction: A Quantitative
Analysis on the Performance of Machine Learning Classifiers ......... 95
J. Gerald Manju, A. Dharini, B. Kiruthika, and A. Malini
Prevention Equipment for COVID-19 Spread Using IoT
and Multimedia-Based Solutions .................................... 105
T. S. Dhachina Moorthy, N. Nimalan, S. Sridevi, and B. Nevetha
Renal Disease Classification Using Image Processing .................. 121
Rohan Sahai Mathur, Varun Gupta, Tushar Bansal, Yash Khare,
and Sanjay Kumar Dubey
Identification of Fake Users on Social Networks and Detection
of Spammers ...................................................... 137
B. Srinivasa Rao, Badisa Bhavana, Gudimetla Abhishek,
and Peddiboyina Hema Harini
A Effective Method for Predicting the Dyslexia by Applying
Ensemble Technique ............................................... 151
S. K. Saida, Yanduru Yamini Snehitha, Narindi Sai Priya,
and Avula Srinivasa Ajay Babu
Identifying Suicidal Risk: A Text Classification Study for Early
Detection ......................................................... 163
Devineni Vijaya Sri, Anumolu Bindu Sai, Valluri Anand,
and Karanam Manjusha
Citrus Plant Leaves Disease Detection Using CNN and LVQ
Algorithm ........................................................ 175
Roop Singh Meena and Shano Solanki
Longevity Recommendation for Root Canal Treatment ............... 189
Pragati Choudhari, Anand Singh Rajawat, S. B. Goyal, Xiao ShiXiao,
and Amol Potgantwar
Deep Q-Learning for Virtual Autonomous Automobile ................ 203
Piyush Pant, Rajendra Sinha, Anand Singh Rajawat, S. B. Goyal,
and Masri bin Abdul Lasi
Improving Digital Marketing Using Sentiment Analysis with Deep
LSTM ............................................................ 217
Masri bin Abdul Lasi, Abu Bakar bin Abdul Hamid,
Amer Hamzah bin Jantan, S. B. Goyal, and Nurun Najah binti Tarmidzi
5G Enabled IoT-Based DL with BC Model for Secured Home Door
System ........................................................... 233
S. B. Goyal, Anand Singh Rajawat, Pravin Gundalwar,
Ram Kumar Solanki, and Masri bin Abdul Lasi
Improving Efficiency of Spinal Cord Image Segmentation Using
Transfer Learning Inspired Mask Region-Based Augmented
Convolutional Neural Network ..................................... 245
Sheetal Garg and S. R. Bhagyashree
Neurological Disease Prediction Based on EEG Signals Using
Machine Learning Approaches ..................................... 263
Zahraa Maan Sallal and Alyaa A. Abbas
Watermarking System Using DWT and SVD ......................... 273
Fatima M. Khudair, Asaad N. Hashim, and Mohammed Jameel Alsalhy
Safeguarding IoT: Harnessing Practical Byzantine Fault Tolerance
for Robust Security ................................................ 287
Nadiya Zafar, Ashish Khanna, Shaily Jain, Zeeshan Ali,
and Jameel Ahamed
Human Body Poses Detection and Estimation Using Convolutional
Neural Network ................................................... 303
Jitendra Kumar Baroliya and Amit Doegar
A Novel Image Alignment Technique Leveraging Teaching
Learning-Based Optimization for Medical Images .................... 317
Paluck Arora, Rajesh Mehta, and Rohit Ahuja
Study of Cyber Threats in IoT Systems .............................. 329
Abir El Akhdar, Chafik Baidada, and Ali Kartit
Generic Sentimental Analysis in Web Data Recommendation
Based on Social Media Scalable Data Analytics Using Machine
Learning Architecture ............................................. 345
Ramesh Sekaran, Sivaram Rajeyyagari, Ashok Kumar Munnangi,
Manikandan Parasuraman, Manikandan Ramachandran, and Anil Kumar
Cloud Spark Cluster to Analyse English Prescription Big Data
for NHS Intelligence ............................................... 361
Sandra Fernando, Victor Sowinski Mydlarz, Asya Katanani,
and Bal Virdee
Prediction of Column Average Carbon Dioxide Emission Using
Random Forest Regression ......................................... 377
P. Sai Swetha, M. A. Chiranjath Sshakthi, S. Hrushikesh, and A. Malini
Predicting Students’ Performance Using Feature Selection-Based
Machine Learning Technique ....................................... 389
N. Kartik, R. Mahalakshmi, and K. A. Venkatesh
Hybrid Deep Learning-Based Human Activity Recognition (HAR)
Using Wearable Sensors: An Edge Computing Approach .............. 399
Neha Gaud, Maya Rathore, and Ugrasen Suman
Hybrid Change Detection Technique with Particle Swarm
Optimization for Land Use Land Cover Using Remote-Sensed Data .... 411
Snehlata Sheoran, Neetu Mittal, and Alexander Gelbukh
Critical Analysis of 5G Networks’ Traffic Intrusion Using PCA,
t-SNE, and UMAP Visualization and Classifying Attacks .............. 421
Humera Ghani, Shahram Salekzamankhani, and Bal Virdee
Denoising the Endoscopy Images of the Gastrointestinal Tract
Using Complex-Valued CNN ....................................... 439
Nisha and Prachi Chaudhary
FTL-Emo: Federated Transfer Learning for Privacy Preserved
Biomarker-Based Automatic Emotion Recognition ................... 449
Akshi Kumar, Aditi Sharma, Ravi Ranjan, and Liangxiu Han
Content Analysis of Twitter Conversations Associated
with Turkey–Syria Earthquakes .................................... 461
Harkiran Kaur, Harishankar Kumar, and Abhinandan Singla
Transition from Traditional Insurance Sector to InsurTech:
Systematic Analysis and Future Research Directions .................. 473
Tamanna Kewal and Charu Saxena
Diagnosis of Laryngitis and Cordectomy using Machine Learning
with ML.Net and SVD ............................................. 489
Syed Irfan Ali, Ahmed Sajjad Khan, Syed Mohammad Ali,
and Mohammad Nasiruddin
Speed of Diagnosis for Brain Diseases Using MRI and Convolutional
Neural Networks .................................................. 501
B. Srinivasa Rao, Vankalapati Nanda Gopal, Vatala Akash,
and Shaik Nazeer
Dog Breed Identification Using Deep Learning ....................... 515
Anurag Tuteja, Sumit Bathla, Pallav Jain, Utkarsh Garg, Aman Dureja,
and Ajay Dureja
Towards Detecting Digital Criminal Activities Using File System
Analysis .......................................................... 531
Mustafa Al-Fayoumi, Mohammad Al-Fawa’reh, Qasem Abu Al-Haija,
and Alaa Alakailah
Performance Evaluation of Virtual Machine and Container-Based
Migration Technique ............................................... 551
Aditya Bhardwaj, Amit Pratap Singh, Priya Sharma, Konika Abid,
and Umesh Gupta
Rhetorical Role Detection in Legal Judgements Using Zero-Shot
Learning ......................................................... 559
Shambhavi Mishra, Tanveer Ahmed, Vipul Mishra, Priyam Srivastava,
Abuzar Sayeed, and Umesh Gupta
IoB-Based Intelligent Healthcare System for Disease Diagnosis
in Humans ........................................................ 575
Shalu, Neha Saini, Pooja, and Dinesh Singh
Analyzing the Impact of Extractive Summarization Techniques
on Legal Text ..................................................... 585
Utkarsh Dixit, Sonam Gupta, Arun Kumar Yadav, and Divakar Yadav
An Energy Conserving MANET-LoRa Architecture for Wireless
Body Area Network ................................................ 603
Sakshi Gupta, Manorama, and Itu Snigdh
Blockchain Integration with Internet of Things (IoT)-Based
Systems for Data Security: A Review ................................ 617
Gagandeep Kaur, Rajesh Shrivastava, and Umesh Gupta
Comparative Study of Heart Failure Using the Approach
of Machine Learning and Deep Neural Networks ..................... 627
Shachi Mall and Jagendra Singh
House Price Prediction Using Hybrid Deep Learning Techniques ...... 643
Nitigya Vasudev, Gurpreet Singh, Prateek Saini, and Tejasvi Singhal
Sentiment Analysis Using Machine Learning of Unemployment
Data in India ...................................................... 655
Rudra Tiwari, Jatin Sachdeva, Ashok Kumar Sahoo,
and Pradeepta Kumar Sarangi
Customer Churn in Telecom Sector: Analyzing the Effectiveness
of Machine Learning Techniques .................................... 677
Vaibhav Sharma, Lekha Rani, Ashok Kumar Sahoo,
and Pradeepta Kumar Sarangi
Author Index ...................................................... 693
Editors and Contributors
About the Editors
Prof. (Dr.) Abhishek Swaroop completed his B.Tech. (CSE) from GBP University
of Agriculture and Technology, M.Tech. from Punjabi University Patiala, and Ph.D.
from NIT Kurukshetra. He has industrial experience of 8 years in organizations like
Usha Rectifier Corporations and Envirotech Instruments Pvt. Limited. He has 22
years of teaching experience. He has served in reputed educational institutions such as
Jaypee Institute of Information Technology, Noida, Sharda University Greater Noida,
and Galgotias University Greater Noida. He has served at various administrative
positions such as Head of the Department, Division Chair, NBA Coordinator for the
university, and Head of Training and Placements. Currently, he is serving as Professor
and HoD, Department of Information Technology, Bhagwan Parshuram Institute
of Technology, Rohini, Delhi. He is actively engaged in research and has more
than 60 quality publications, of which eight are SCI-indexed and 16 are Scopus-indexed.
Prof. (Dr.) Zdzislaw Polkowski is Adjunct Professor at Faculty of Technical
Sciences at the Jan Wyzykowski University, Poland. He is also Rector’s Repre-
sentative for International Cooperation and Erasmus Program and Former Dean of
the Technical Sciences Faculty during the period 2009–2012. His area of research
includes management information systems, business informatics, IT in business and
administration, IT security, small and medium enterprises, CC, IoT, big data, business
intelligence, and blockchain. He has published around 60 research articles. He
has served the research community in the capacity of Author, Professor, Reviewer,
Keynote Speaker, and Co-editor. He has attended several international conferences
in the various parts of the world. He is also playing the role of Principal Investigator.
Prof. Sérgio Duarte Correia received his Diploma in Electrical and Computer
Engineering from the University of Coimbra, Portugal, in 2000, the master’s degree in
Industrial Control and Maintenance Systems from Beira Interior University, Covilhã,
Portugal, in 2010, and the Ph.D. in Electrical and Computer Engineering from the
University of Coimbra, Portugal, in 2020. Currently, he is Associate Professor at
the Polytechnic Institute of Portalegre, Portugal. He is Researcher at COPELABS—
Cognitive and People-centric Computing Research Center, Lusófona University of
Humanities and Technologies, Lisbon, Portugal, and Valoriza—Research Center for
Endogenous Resource Valorization, Polytechnic Institute of Portalegre, Portalegre,
Portugal. Over the past 20 years, he has worked with several private companies in the field
of product development and industrial electronics. His current research interests are
artificial intelligence, soft computing, signal processing, and embedded computing.
Prof. Bal Virdee graduated with a B.Sc. (Engineering) Honors in Communication
Engineering and M.Phil. from Leeds University, UK. He obtained his Ph.D. from
University of North London, UK. He worked as an Academic at the Open University
and Leeds University. Prior to this, he was a Research and Development Electronic
Engineer in the Future Products Department at Teledyne Defence (formerly
Filtronic Components Ltd., Shipley, West Yorkshire) and at PYE TVT (Philips) in
Cambridge. He has held numerous duties and responsibilities at the university, i.e.,
Health and Safety Officer, Postgraduate Tutor, Examinations Officer, Admissions
Tutor, Short Course Organizer, and Course Leader for M.Sc./M.Eng. Satellite
Communications, B.Sc. Communications Systems, and B.Sc. Electronics. In 2010, he was
appointed Academic Leader (UG Recruitment). He is a Member of the ethics committee
and a Member of the school’s research committee and research degrees committee.
Contributors
Alyaa A. Abbas General Directorate of Education in Al-Muthana Governorate,
Ministry of Education, Samah, Iraq
Gudimetla Abhishek Lakireddy Bali Reddy College of Engineering, Mylavaram,
Andhra Pradesh, India
Konika Abid Department of CSE, Sharda University, Greater Noida, India
Jameel Ahamed Department of CS&IT, Maulana Azad National Urdu University,
Hyderabad, India
Mumtaz Ahmed Department of Computer Engineering, Jamia Millia Islamia,
New Delhi, India
Tanveer Ahmed Department of CSE, Bennett University, Greater Noida, India
Rohit Ahuja Computer Science and Engineering Department, Thapar Institute of
Engineering and Technology, Patiala, India
Vatala Akash Lakireddy Bali Reddy College of Engineering, Mylavaram,
Andhra Pradesh, India
Abir El Akhdar LTI Laboratory, University of Chouaib Doukkali, National School
of Applied Sciences, El Jadida, Morocco
Mohammad Al-Fawa’reh Computing and Security, Edith Cowan University,
Joondalup, WA, Australia
Mustafa Al-Fayoumi Department of Cybersecurity, Princess Sumaya University
of Technology, Amman, Jordan
Qasem Abu Al-Haija Department of Cybersecurity, Princess Sumaya University
of Technology, Amman, Jordan
Alaa Alakailah Department of Cybersecurity, Princess Sumaya University of
Technology, Amman, Jordan
Md Mahtab Alam Department of Computer Engineering, Jamia Millia Islamia,
New Delhi, India
Syed Irfan Ali Artificial Intelligence and Data Science Engineering, Anjuman
College of Engineering & Technology, Nagpur, India
Syed Mohammad Ali Electronics & Telecommunication Engineering, Anjuman
College of Engineering & Technology, Nagpur, India
Zeeshan Ali University of Glasgow, Glasgow, UK
Mohammed Jameel Alsalhy National University of Science and Technology,
Thi-Qar, Nasiriyah, Iraq
Mohammad Hossein Amirhosseini University of East London, London,
United Kingdom
Valluri Anand Lakireddy Bali Reddy College of Engineering, Mylavaram,
Andhra Pradesh, India
Paluck Arora Computer Science and Engineering Department, Thapar Institute of
Engineering and Technology, Patiala, India
Siddharth Arora Bharati Vidyapeeth’s College of Engineering, New Delhi, India
Avula Srinivasa Ajay Babu Lakireddy Bali Reddy College of Engineering,
Mylavaram, Andhra Pradesh, India
Chafik Baidada LTI Laboratory, University of Chouaib Doukkali, National School
of Applied Sciences, El Jadida, Morocco
Neal Bamford London Metropolitan University, London, UK
Tushar Bansal Amity University, Uttar Pradesh, Noida, India
Jitendra Kumar Baroliya Computer Science and Engineering Department,
NITTTR, Chandigarh, India
Sumit Bathla Department of IT, Bhagwan Parshuram Institute of Technology,
New Delhi, India
S. R. Bhagyashree Department of Electronics and Communication Engineering,
ATME College of Engineering, Mysuru, India
Aditya Bhardwaj School of CSET, Bennett University, Greater Noida, India
Badisa Bhavana Lakireddy Bali Reddy College of Engineering, Mylavaram,
Andhra Pradesh, India
Prachi Chaudhary ECE Department, DCRUST, Murthal, India
Pragati Choudhari Department of Computer Engineering, Indira College of Engi-
neering and Management, Sandip University, Pune, India
Alexandros Chrysikos London Metropolitan University, London, UK
A. Dharini Thiagarajar College of Engineering, Madurai, Tamil Nadu, India
Utkarsh Dixit Ajay Kumar Garg Engineering College, Ghaziabad, India
Amit Doegar Computer Science and Engineering Department, NITTTR,
Chandigarh, India
Sanjay Kumar Dubey Amity University, Uttar Pradesh, Noida, India
Ajay Dureja Department of IT, Bharati Vidyapeeth’s College of Engineering,
New Delhi, India
Aman Dureja Department of IT, Bhagwan Parshuram Institute of Technology,
New Delhi, India
Sandra Fernando Assistive Technology Group, SCDM, London Metropolitan
University, London, UK
Sheetal Garg Department of Electronics and Communication Engineering, ATME
College of Engineering, Mysuru, India
Utkarsh Garg Department of IT, Bhagwan Parshuram Institute of Technology,
New Delhi, India
Neha Gaud School of Computer Science and Information Technology, DAVV,
Indore, M.P, India
Alexander Gelbukh Instituto Politécnico Nacional Mexico, Mexico City, Mexico
J. Gerald Manju Thiagarajar College of Engineering, Madurai, Tamil Nadu, India
Humera Ghani School of Computing and Digital Media, Centre for Communica-
tions Technology, London Metropolitan University, London, UK
Udayan Ghose University School of Information, Communication and Tech-
nology, Guru Gobind Singh Indraprastha University, Delhi, India
Vankalapati Nanda Gopal Lakireddy Bali Reddy College of Engineering,
Mylavaram, Andhra Pradesh, India
Anjana Gosain USICT, Guru Gobind Singh Indraprastha University, New Delhi,
India
Avantika Goyal Bharati Vidyapeeth’s College of Engineering, New Delhi, India
S. B. Goyal City University, Petaling Jaya, Malaysia
Pravin Gundalwar School of Computer Science and Engineering, Sandip Univer-
sity, Nashik, India
Sakshi Gupta Amity Institute of Information Technology, Amity University,
Noida, India
Sonam Gupta Ajay Kumar Garg Engineering College, Ghaziabad, India
Umesh Gupta Department of CSE, SR University, Warangal, Telangana, India;
School of Computer Science Engineering and Technology, Bennett University,
Greater Noida, India
Varun Gupta Amity University, Uttar Pradesh, Noida, India
Abu Bakar bin Abdul Hamid Putra Business School, University Putra Malaysia,
Serdang, Malaysia
Liangxiu Han Manchester Metropolitan University, Manchester, UK
Peddiboyina Hema Harini Lakireddy Bali Reddy College of Engineering,
Mylavaram, Andhra Pradesh, India
Asaad N. Hashim Faculty of Computer Science and Mathematics, University of
Kufa, Kufah, Iraq
S. Hrushikesh Thiagarajar College of Engineering, Madurai, Tamil Nadu, India
Fahimeh Jafari University of East London, London, United Kingdom
Pallav Jain Department of IT, Bhagwan Parshuram Institute of Technology,
New Delhi, India
Shaily Jain Faculty of Computing, Engineering and Science, University of South
Wales, South Wales, UK
Amer Hamzah bin Jantan City University, Petaling Jaya, Malaysia;
Putra Business School, University Putra Malaysia, Serdang, Malaysia
Prateek Jha Bharati Vidyapeeth’s College of Engineering, New Delhi, India
N. Kartik Department of Computer Applications/Science, Presidency
College(Autonomous)/Presidency University, Bengaluru, India
Ali Kartit LTI Laboratory, University of Chouaib Doukkali, National School of
Applied Sciences, El Jadida, Morocco
Asya Katanani Assistive Technology Group, SCDM, London Metropolitan
University, London, UK
Gagandeep Kaur Department of Computer Science and Engineering, Madhav
Institute of Technology and Science, Gwalior, India
Harkiran Kaur Department of Computer Science and Engineering, Thapar Insti-
tute of Engineering and Technology, Patiala, Punjab, India
Tamanna Kewal University School of Business, Chandigarh University, Mohali,
Punjab, India
Ahmed Sajjad Khan Electronics & Telecommunication Engineering, Anjuman
College of Engineering & Technology, Nagpur, India
Ashish Khanna Department of CSE, Maharaja Agrasen Institute of Technology,
New Delhi, India
Yash Khare Amity University, Uttar Pradesh, Noida, India
Fatima M. Khudair Faculty of Computer Science and Mathematics, University of
Kufa, Kufah, Iraq
B. Kiruthika Thiagarajar College of Engineering, Madurai, Tamil Nadu, India
Akshi Kumar Manchester Metropolitan University, Manchester, UK
Anil Kumar Tula’s Institute, Dehradun, India
Harishankar Kumar Department of Computer Science and Engineering, Thapar
Institute of Engineering and Technology, Patiala, Punjab, India
Jitender Kumar Bharati Vidyapeeth’s College of Engineering, New Delhi, India
Ritika Kumari USICT, Guru Gobind Singh Indraprastha University, New Delhi,
India;
Department of Artificial Intelligence and Data Sciences, IGDTUW, Delhi, India
Masri bin Abdul Lasi City University, Petaling Jaya, Malaysia
Sophia Lazarova GATE Institute, Sofia University, Sofia, Bulgaria
R. Mahalakshmi Department of Computer Science, Presidency University,
Bengaluru, India
A. Malini Thiagarajar College of Engineering, Madurai, Tamil Nadu, India
Shachi Mall School of Computer Science Engineering and Technology, Bennett
University, Greater Noida, India
Karanam Manjusha Lakireddy Bali Reddy College of Engineering, Mylavaram,
Andhra Pradesh, India
Manorama Amity Institute of Information Technology, Ranchi, India
Rohan Sahai Mathur Amity University, Uttar Pradesh, Noida, India
Roop Singh Meena Computer Science and Engineering Department, NITTTR,
Chandigarh, India
Rajesh Mehta Computer Science and Engineering Department, Thapar Institute of
Engineering and Technology, Patiala, India
Shambhavi Mishra Department of CSE, Bennett University, Greater Noida, India
Vipul Mishra Department of CSE, Pandit Deendayal Energy University, Gandhi-
nagar, India
Neetu Mittal Amity University Uttar Pradesh, Noida, Uttar Pradesh, India
T. S. Dhachina Moorthy Department of Information Technology, Thiagarajar
College of Engineering, Madurai, Tamil Nadu, India
Ashok Kumar Munnangi Department of Information Technology, Velagapudi
Ramakrishna Siddhartha Engineering College, Vijayawada, Andhra Pradesh, India
Victor Sowinski Mydlarz Assistive Technology Group, SCDM, London
Metropolitan University, London, UK
Mohammad Nasiruddin Electronics & Telecommunication Engineering,
Anjuman College of Engineering & Technology, Nagpur, India
Shaik Nazeer Lakireddy Bali Reddy College of Engineering, Mylavaram,
Andhra Pradesh, India
B. Nevetha Department of Information Technology, Thiagarajar College of Engi-
neering, Madurai, Tamil Nadu, India
N. Nimalan Department of Information Technology, Thiagarajar College of Engi-
neering, Madurai, Tamil Nadu, India
Nisha ECE Department, DCRUST, Murthal, India
Piyush Pant Sandip University, Nashik, India
Manikandan Parasuraman Department of Computer Science and Engineering,
JAIN (Deemed to be University), Bengaluru, Karnataka, India
Dessislava Petrova-Antonova GATE Institute, Sofia University, Sofia, Bulgaria
Pooja School of Computer Science and Engineering, Galgotias University,
Greater Noida, India
Amol Potgantwar Sandip Institute of Technology and Research Centre, Sandip
University, Nashik, India
Narindi Sai Priya Lakireddy Bali Reddy College of Engineering, Mylavaram,
Andhra Pradesh, India
Anand Singh Rajawat School of Computer Science and Engineering, Sandip
University, Nashik, India;
City University, Petaling Jaya, Malaysia
Sivaram Rajeyyagari Department of Computer Science, College of Computing
and Information Technology, Shaqra University, Shaqra, Kingdom of Saudi Arabia
Manikandan Ramachandran School of Computing, SASTRA Deemed Univer-
sity, Thanjavur, India
Lekha Rani Institute of Engineering and Technology, Chitkara University, Punjab,
India
Ravi Ranjan Netaji Subhas University of Technology, Delhi, India
Maya Rathore Christian Eminent College, Indore, M.P, India
Mohit Rohilla Bharati Vidyapeeth’s College of Engineering, New Delhi, India
Jatin Sachdeva Chitkara University Institute of Engineering & Technology,
Chitkara University, Punjab, India
Mitra Saeedi University of East London, London, United Kingdom
Ashok Kumar Sahoo Graphic Era Hill University, Dehradun, India
Anumolu Bindu Sai Lakireddy Bali Reddy College of Engineering, Mylavaram,
Andhra Pradesh, India
S. K. Saida Department of Information Technology, Lakireddy Bali Reddy College
of Engineering, Mylavaram, Andhra Pradesh, India
Neha Saini Government College Chhachhrauli, Yamuna Nagar, Haryana, India
Prateek Saini Chitkara University Institute of Engineering and Technology,
Chitkara University Punjab, Chandigarh, India
Shahram Salekzamankhani School of Computing and Digital Media, Centre for
Communications Technology, London Metropolitan University, London, UK
Zahraa Maan Sallal General Directorate of Education in Al-Qadisiyah Gover-
norate/Ministry of Education, Al Diwaniyah, Iraq
Pradeepta Kumar Sarangi Institute of Engineering and Technology, Chitkara
University, Punjab, India
Charu Saxena University School of Business, Chandigarh University, Mohali,
Punjab, India
Abuzar Sayeed Department of CSE, Bennett University, Greater Noida, India
Ramesh Sekaran Department of Computer Science and Engineering, JAIN
(Deemed to be University), Bengaluru, Karnataka, India
Shalu Manav Rachna University, Faridabad, Haryana, India
Aditi Sharma Delhi Technological University, New Delhi, India;
Thapar Institute of Engineering and Technology, Patiala, India
Ankita Sharma University School of Information, Communication and Tech-
nology, Guru Gobind Singh Indraprastha University, Delhi, India
Priya Sharma Department of CSE, Sharda University, Greater Noida, India
Ruchi Sharma Bharati Vidyapeeth’s College of Engineering, New Delhi, India
Vaibhav Sharma Institute of Engineering and Technology, Chitkara University,
Punjab, India
Snehlata Sheoran Amity University Uttar Pradesh, Noida, Uttar Pradesh, India
Xiao Shi Chengyi College, Jimei University, Xiamen, China
Rajesh Shrivastava School of Computer Science Engineering and Technology,
Bennett University, Greater Noida, India
Amit Pratap Singh Department of CSE, Sharda University, Greater Noida, India
Dinesh Singh Deenbandhu Chhotu Ram University of Science and Technology,
Murthal, Sonepat, India
Gurpreet Singh Chitkara University Institute of Engineering and Technology,
Chitkara University Punjab, Chandigarh, India
Jagendra Singh School of Computer Science Engineering and Technology,
Bennett University, Greater Noida, India
Jaspreeti Singh USICT, Guru Gobind Singh Indraprastha University, New Delhi,
India
Tejasvi Singhal Chitkara University Institute of Engineering and Technology,
Chitkara University Punjab, Chandigarh, India
Abhinandan Singla Department of Computer Science and Engineering, Thapar
Institute of Engineering and Technology, Patiala, Punjab, India
Rajendra Sinha Sandip University, Nashik, India
Yanduru Yamini Snehitha Lakireddy Bali Reddy College of Engineering,
Mylavaram, Andhra Pradesh, India
Itu Snigdh B.I.T Mesra, Ranchi, India
Ram Kumar Solanki School of Computer Science and Engineering, Sandip
University, Nashik, India
Shano Solanki Computer Science and Engineering Department, NITTTR, Chandi-
garh, India
Prashanth Sontakke University of East London, London, United Kingdom
Ivaylo Spasov Rila Solutions, Sofia, Bulgaria
S. Sridevi Department of Information Technology, Thiagarajar College of Engi-
neering, Madurai, Tamil Nadu, India
B. Srinivasa Rao Department of Information Technology, Lakireddy Bali Reddy
College of Engineering, Mylavaram, Andhra Pradesh, India
Priyam Srivastava Department of CSE, Bennett University, Greater Noida, India
M. A. Chiranjath Sshakthi Thiagarajar College of Engineering, Madurai,
Tamil Nadu, India
Ugrasen Suman School of Computer Science and Information Technology, DAVV,
Indore, M.P, India
P. Sai Swetha Thiagarajar College of Engineering, Madurai,
Tamil Nadu, India
Nurun Najah binti Tarmidzi City University, Petaling Jaya, Malaysia
Rudra Tiwari Doon International School, Dehradun, India
Anurag Tuteja Department of IT, Bhagwan Parshuram Institute of Technology,
New Delhi, India
Nitigya Vasudev Chitkara University Institute of Engineering and Technology,
Chitkara University Punjab, Chandigarh, India
K. A. Venkatesh School of Advanced Computer Science, Alliance University,
Bengaluru, India
Devineni Vijaya Sri Department of Information Technology, Lakireddy Bali
Reddy College of Engineering, Mylavaram, Andhra Pradesh, India
Bal Virdee Assistive Technology Group, SCDM, London Metropolitan University,
London, UK;
School of Computing and Digital Media, Centre for Communications Technology,
London Metropolitan University, London, UK
Arun Kumar Yadav National Institute of Technology, Hamirpur, HP, India
Divakar Yadav National Institute of Technology, Hamirpur, HP, India
Nadiya Zafar Department of CS&IT, Maulana Azad National Urdu University,
Hyderabad, India
Diagnosis of Parkinson Disease Using
Ensemble Methods for Class Imbalance
Problem
Ritika Kumari, Jaspreeti Singh, and Anjana Gosain
Abstract Parkinson disease (PD) is one of the most prevalent degenerative neurological
disorders and is incurable. Early PD diagnosis is essential in order to determine the
initial course of treatment. Typically, the issue of class imbalance has an impact on
the PD diagnosis. This paper seeks to give a comparative analysis of the ensemble
methods: random forest, bagging, and random under-sampling boost for addressing
the class imbalance problem for PD diagnosis. We make use of a real-world PD
speech dataset that is housed in the repository at UCI (University of California,
Irvine). Due to the high imbalance in this dataset, feature scaling and the Synthetic
Minority Oversampling Technique (SMOTE) are employed. We also employ the
feature selection (FS) technique for enhancing the efficiency of the machine learning
algorithms (MLAs). The results show that bagging performs best with an accuracy
of 96.46%. This study proposes the use of ensemble approaches for PD’s early
diagnosis.
Keywords Classification ·Ensemble methods ·Random forest ·Feature
selection ·Bagging
R. Kumari (B)·J. Singh ·A. Gosain
USICT, Guru Gobind Singh Indraprastha University, New Delhi, India
e-mail: ritikakumari@igdtuw.ac.in
J. Singh
e-mail: jaspreeti_singh@ipu.ac.in
A. Gosain
e-mail: anjana_gosain@ipu.ac.in
R. Kumari
Department of Artificial Intelligence and Data Sciences, IGDTUW, Delhi, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_1
R. Kumari et al.
1 Introduction
Parkinson’s disease is the neurological condition with the second-slowest rate of
progression, affecting 7 to 10 million individuals worldwide after Alzheimer’s
disease [14]. PD is typically observed in elderly people. The primary reason for
PD is the decrease in the dopamine’s level, a chemical in the human brain [5]. The
dopamine produced by neurons is responsible for transmitting signals in the human
brain. It is yet unclear what is causing the impairment of these neurons. The signs
of PD may include loss of smell, constipation, sleep and speech issues, swallowing
difficulties, bradykinesia, stiffness, and postural imbalance [6].
PD is incurable, but early diagnosis may help in providing proper treatment and
taking preventive measures [1,7]. Researchers have noticed changes in speech as
an early symptom in PD patients. This has motivated us to develop an ML model
that can serve as a second opinion in the diagnosis of PD patients. We use the PD
speech dataset for analyzing the changes in speech for this study as it is non-invasive
and low cost. The dataset is highly imbalanced and thus suffers from the class
imbalance problem (CIP), which makes the analysis difficult. CIP occurs when one class is
present in the majority in comparison with another (the minority class). Using an imbalanced
dataset makes traditional classifiers biased toward the majority class.
To handle CIP, the researchers have worked at three levels: data level (DL), algo-
rithm level (AL), and hybrid level (HL) [8,9]. In DL, we work at the data level and
try to develop uniformity in the class distribution using data sampling techniques
such as under-sampling or oversampling. However, DL strategies suffer from model
overfitting in the case of oversampling, while under-sampling may lead to the loss
of potentially useful data. In AL, we formulate a new algorithm or make
some modifications to the existing algorithm. This strategy requires knowledge and
expertise in the area of the algorithm. Then comes HL; at this level, we take
the benefits of both DL and AL. Ensemble methods come under this category [10].
Several different independent classifiers are combined to create a robust classifier
using the effective technique of ensemble learning. Numerous studies demonstrate
that ensemble learning models perform better on imbalanced datasets and have
great generalization capacity [11–15].
In this study, we employ three ensemble methods, namely random forest (RF),
bagging, and random under-sampling boost (RUSBoost) for the PD diagnosis using
the PD speech dataset.
The study's key contributions are as follows:
(1) Firstly, as the PD speech dataset studied is highly imbalanced, we use SMOTE
oversampling technique for balancing the dataset.
(2) Secondly, we evaluate the performance of the ensemble methods (a) without
using any feature selection (FS) method (b) using the SelectKBest FS method.
Finally, we compare our work with existing research in the area of PD, and it is
observed that bagging outperforms RF and RUSBoost with an accuracy of 96.46%.
Diagnosis of Parkinson Disease Using Ensemble Methods for Class Imbalance Problem
This paper presents a summary of the relevant literature on the CIP in PD diagnosis
in Sect. 2. Section 3 briefly discusses the ensemble methods studied, the FS method
used, and the performance metrics involved. Section 4 explains the experimental
results and discussions, along with a comparison of our study with prior work. The
study's conclusion is given in Sect. 5.
2 Related Work
Several researchers have worked on the diagnosis of PD using different MLAs and
FS methods.
An ensemble-based model for diagnosing PD was proposed by Biswas et al. [16]
in 2022. The authors employed stacking to create a strong model and FS to choose
the pertinent characteristics. The proposed model was evaluated using a variety of
MLAs. The authors claimed that the ensemble-based model surpassed the other
techniques.
Saeed et al. [17] in 2022 developed a comprehensive strategy for PD prediction
in which the authors studied the performance of several ML classifiers combined
with different FS methods.
With the wrapper filter method, this research improved the K nearest neighbor’s
(KNN) accuracy to 88.33%.
Yadav and Jain [18] in 2022 conducted the study with six ML models: KNN,
Naïve Bayes (NB), support vector machine (SVM), RF, etc., for the PD prediction.
According to the experimental findings, KNN had the highest accuracy, i.e., 92.05%
for early detection of PD.
For the effective diagnosis of PD, Lamba et al. [1] proposed a hybrid system in
2021. The authors have used three classifiers, namely KNN, RF and NB along with
SMOTE for addressing the CIP. Three FS methods were also employed for reducing
the feature subset. The study concluded that the RF classifier showed better results
than other classifiers, with an accuracy of 95.58%.
Yaman et al. [19] in 2020 conducted the experiment using the relief FS method
for selecting the acoustic features from the dataset. They used KNN and SVM for
the PD prediction and found that out of the two, the SVM classifier performed the
best with an accuracy of 91.25%.
Polat [20] in 2019 used the PD dataset to propose the hybrid model for PD predic-
tion. The authors worked at the data level to handle the CIP using SMOTE technique
and employed RF for their study. The authors noticed that RF achieved an accuracy
of 94.8%.
Mathur et al. [7] in 2019 suggested using the combined effect of KNN with artifi-
cial neural network (ANN) for PD detection. The researchers studied the performance
of various MLAs and selected the best-performing models w.r.t. accuracy and less
execution time. The study showed that the ensemble-based method AdaBoost.M1
with KNN gives the best accuracy, i.e., 91.28%.
3 Materials and Methods
3.1 Ensemble Methods
Three ensemble methods—RF, bagging, and RUSBoost—are used in this research.
The performance of these methods is analyzed using all features and selected features.
Random Forest (RF) RF is a supervised ensemble MLA that generates many
decision trees (DTs), each trained on randomly selected instances. Each DT provides
a prediction, and the final prediction is made on the basis of majority voting [21].
Bagging To improve the prediction accuracy of MLAs, Breiman [22] suggested the
bootstrap aggregating (bagging) ensemble technique. The bagging
technique, given a training set, randomly creates a variety of bootstrap samples by
sampling with replacement from the original dataset. The ensemble’s classifiers are
then trained individually for each bootstrap sample. The majority vote is then used
to determine the prediction results for classification problems.
RUSBoost RUSBoost proposed by Seiffert et al. [23] is a combination of random
under-sampling (RUS) and boosting. RUS is a method for removing instances from
the over-represented class at random. AdaBoost [24], the most used boosting tech-
nique, iteratively trains the base learners in a sequential manner. All built models then
take part in a weighted vote to categorize unlabeled examples. Because the minority
class instances are more likely to be misclassified and are hence assigned higher
weights in subsequent rounds, this strategy is particularly effective at addressing CIP.
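The first two ensembles above can be sketched with scikit-learn on synthetic data (an illustrative sketch, not the paper's exact code; the dataset, sizes, and parameters here are stand-ins, and RUSBoost is provided by the separate imbalanced-learn package with a similar fit/predict interface):

```python
# Minimal sketch: random forest and bagging on synthetic, mildly imbalanced data.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=10,
                           weights=[0.75, 0.25], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Each tree is grown on a random resample; the forest predicts by majority vote
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Base classifiers trained on bootstrap resamples, combined by voting
bag = BaggingClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

print(round(rf.score(X_te, y_te), 2), round(bag.score(X_te, y_te), 2))
```

RUSBoost would be used the same way via imbalanced-learn's `RUSBoostClassifier`, which wraps boosting around random under-sampling of the majority class.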
3.2 FS Method
FS is a preprocessing method to extract significant features from a dataset. The
name attribute is initially omitted because it does not appear to have an impact on
performance. Then, using the scikit-learn library's SelectKBest function, we choose
the k features with the best scores. The scores are determined using the f_classif
univariate statistical test. The features that SelectKBest chose are displayed in
Table 1. We also scale the features using MinMaxScaler from the scikit-learn library.
Table 1 Features selected by SelectKBest FS method
FS method #features Selected Features
SelectKBest 15 MDVP:Fo (Hz), MDVP:Flo (Hz), MDVP:Jitter (Abs), MDVP:Shimmer,
MDVP:Shimmer (dB), Shimmer:APQ3, Shimmer:APQ5, MDVP:APQ,
Shimmer:DDA, HNR, RPDE, spread1, spread2, D2, PPE
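The scaling and selection step above can be sketched as follows (synthetic stand-in data with 22 columns, mirroring the PD dataset's voice features; the real study applies the same calls to the dataset of Table 3):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import MinMaxScaler

# Stand-in data with 22 feature columns
X, y = make_classification(n_samples=200, n_features=22,
                           n_informative=8, random_state=0)

X_scaled = MinMaxScaler().fit_transform(X)          # scale each feature to [0, 1]
selector = SelectKBest(score_func=f_classif, k=15)  # keep the 15 best-scoring features
X_sel = selector.fit_transform(X_scaled, y)
print(X_sel.shape)  # (200, 15)
```

`selector.get_support(indices=True)` would list which columns were kept, analogous to the 15 features in Table 1.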
3.3 Performance Metrics
We use the accuracy and AUC metrics for the performance evaluation of the techniques.
Accuracy It refers to the percentage of correct predictions over total predictions.
Accuracy is calculated using Eq. (1).

Accuracy = Correct Predictions / Total Predictions. (1)
AUC In an AUC curve, the true positive rate (TPrate), i.e., the percentage of correctly
classified positive cases, and the true negative rate (TNrate), i.e., the percentage of
correctly classified negative instances, are shown on the x-axis and y-axis,
respectively [25]. The area under the receiver operating characteristic (ROC) curve
represents the classifier's performance. AUC is evaluated using Eq. (2).

AUC = (TPrate + TNrate) / 2. (2)
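Worked numerically, Eqs. (1) and (2) reduce to simple counts over a confusion matrix. A small hand computation (the labels here are invented for illustration, not from the paper's data):

```python
# Tiny worked example of Eqs. (1) and (2).
y_true = [1, 1, 1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 1]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives: 4
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives: 2
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives: 1
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives: 1

accuracy = (tp + tn) / len(y_true)   # Eq. (1): 6/8 = 0.75
tp_rate = tp / (tp + fn)             # 4/5 = 0.8
tn_rate = tn / (tn + fp)             # 2/3 ≈ 0.667
auc = (tp_rate + tn_rate) / 2        # Eq. (2): ≈ 0.733
print(accuracy, round(auc, 3))       # 0.75 0.733
```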
4 Experimental Setup
The experiment is conducted with three ensemble methods: RF, bagging, and
RUSBoost, using Python 3.8 in Jupyter Notebook. We obtain the dataset
from the UCI repository. The parameter settings of the ensemble methods are given
in Table 2.
4.1 Dataset
In this study, the publicly accessible PD dataset [26] from the UCI repository is
used. The collection includes speech sounds from 23 PD patients and eight healthy
Table 2 Initial parameter settings
Ensemble method Parameter settings
RF n_estimators =100
Bagging base_estimator =SVC
RUSBoost base_estimator =LogisticRegression
Table 3 Parkinson’s speech dataset
Attribute name Description
name ASCII subject name and recording number
MDVP:Fo (Hz) Average vocal fundamental frequency
MDVP:Fhi (Hz) Maximum vocal fundamental frequency
MDVP:Flo (Hz) Minimum vocal fundamental frequency
MDVP:Jitter (%), MDVP:Jitter (Abs),
MDVP:RAP, MDVP:PPQ, Jitter:DDP
Fundamental frequency variation measures
MDVP:Shimmer, MDVP:Shimmer (dB),
Shimmer:APQ3, Shimmer:APQ5,
MDVP:APQ, Shimmer:DDA
Amplitude variation measures
NHR, HNR Ratio of noise to tonal components in the voice
RPDE, D2 Two nonlinear dynamical complexity measures
DFA Signal fractal scaling exponent
spread1, spread2, PPE Nonlinear measures of fundamental frequency
variation
status Health status: 1—PD, 0—healthy
Table 4 Performance of ensemble methods with all features
Methods Accuracy AUC
RF 94.76 99.25
Bagging 96.24 99.31
RUSBoost 76.73 88.89
subjects. Max Little of the University of Oxford produced this dataset having 195
rows representing the voice measurements of 31 different people, and each column
represents a different voice attribute. Out of 195 voice measurements, 147 are from
people with PD, and the rest belong to healthy people. The status column contains
two values: ‘0’ represents healthy people and ‘1’ represents people with PD. Table 3
shows the dataset properties.
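With 147 PD and only 48 healthy recordings, the minority class needs roughly 99 synthetic samples to reach balance. The study uses the standard SMOTE implementation; the following hand-rolled numpy version is only a sketch of the underlying interpolation mechanism, on random stand-in data:

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority points by interpolating each chosen
    point toward one of its k nearest minority neighbours (the SMOTE idea)."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)  # distances to all minority points
        nn = np.argsort(d)[1:k + 1]                   # k nearest neighbours, skipping self
        j = rng.choice(nn)
        gap = rng.random()                            # random position on the segment
        out.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.asarray(out)

# 48 stand-in minority samples with 5 features; 48 + 99 = 147 balances the classes
minority = np.random.default_rng(1).normal(size=(48, 5))
synthetic = smote_sketch(minority, n_new=99)
print(synthetic.shape)  # (99, 5)
```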
Repeated stratified K-fold cross-validation with ten splits is used for evaluation.
Two performance measures, accuracy and AUC, are utilized. The SelectKBest method
is applied for selecting the most relevant features from the dataset. The performance
of the ensemble methods with all features and with selected features is given in
Tables 4 and 5.
Figure 1 represents this graphically: (a) accuracy and (b) AUC. The highest values
are highlighted.
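The evaluation protocol described above can be sketched with scikit-learn (synthetic stand-in data; the number of repeats is not stated in the paper, so `n_repeats=3` here is an assumption):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Stand-in for the (SMOTE-balanced) PD feature matrix
X, y = make_classification(n_samples=300, n_features=15, random_state=0)

# Ten stratified splits, repeated; n_repeats=3 is an assumption
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0)
scores = cross_val_score(BaggingClassifier(random_state=0), X, y,
                         cv=cv, scoring="accuracy")
print(len(scores), round(scores.mean(), 3))  # 30 accuracy estimates and their mean
```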
4.2 Results and Discussions
(a) With All Features
Table 5 Performance of ensemble methods with selected features
Methods Accuracy AUC
RF 95.71 99.36
Bagging 96.46 98.45
RUSBoost 77.54 88.86
Fig. 1 Performance of ML techniques: a accuracy and b AUC
Table 6 Comparison with prior studies
Reference Year Technique
[27] 2020 SVM
[1] 2021 RF
[17] 2022 KNN
[16] 2022 Ensembled expert system
[28] 2022 SVM
Our work 2022 Bagging
With an accuracy of 96.24% and an AUC of 99.31%, bagging surpassed the
other ensemble methods as shown in Fig. 1a and b. This might be the case since
the approach improves the stability and generalization capacity of multiple base
classifiers. RUSBoost had the worst accuracy, scoring 76.73%. This might be
because RUSBoost's random under-sampling contributes little once SMOTE has
already balanced the dataset.
(b) With Selected Features
With 15 selected features, a slight improvement of 0.22% in the accuracy of
bagging is noticed. With FS as well, bagging outperforms RF and RUSBoost, with
an accuracy of 96.46%. Our study suggests that the accuracy of the model is
enhanced by FS; thus, its usage is beneficial in the diagnosis of PD.
From the results, it is evident that bagging can be taken as a viable tool for
early PD diagnosis.
4.3 Comparison with Previous Studies
The best-performing method from our research is compared with the results from
earlier studies using the same PD dataset in Table 6.
5 Conclusion
PD is a chronic disease; therefore, detecting it in its early phase is crucial in
order to prolong a patient's life. This paper utilizes speech signals taken from
the UCI repository for early PD diagnosis. To balance the dataset, we
use SMOTE technique. We perform the comparative analysis of three ensemble
methods, namely RF, bagging, and RUSBoost, for PD diagnosis. We also use the
FS SelectKBest method for selecting features and comparing the performance of
the ensemble methods without the FS method and with the FS method. The results
suggest that the FS technique is advantageous since it helps to reduce the complexity
and enhances the model’s accuracy. With an accuracy of 96.46%, the experimental
data demonstrates that the ensemble method bagging beats the other strategies in
the study. For future work, various FS methods may be utilized to select the most
contributing features.
References
1. Lamba R, Gulati T, Alharbi HF, Jain A (2022) A hybrid system for Parkinson’s disease diagnosis
using machine learning techniques. Int J Speech Technol 25(3):583–593
2. Cacabelos R (2017) Parkinson’s disease: from pathogenesis to pharmacogenomics. Int J Mol
Sci 18(3):551
3. Bharath S, Hsu M, Kaur D, Rajagopalan S, Andersen JK (2002) Glutathione, iron and
Parkinson’s disease. Biochem Pharmacol 64(5–6):1037–1048
4. Tuncer T, Dogan S, Acharya UR (2020) Automated detection of Parkinson’s disease using
minimum average maximum tree and singular value decomposition method with vowels.
Biocybernetics Biomed Eng 40(1):211–220
5. Shamrat FMJM, Asaduzzaman M, Rahman AS, Tusher RTH, Tasnim Z (2019) A comparative
analysis of Parkinson disease prediction using machine learning approaches. Int J Sci Technol
Res 8(11):2576–2580
6. Challa KNR, Pagolu VS, Panda G, Majhi B (2016) An improved approach for prediction of
Parkinson’s disease using machine learning techniques. In: 2016 international conference on
signal processing, communication, power and embedded system (SCOPES). IEEE, pp 1446–
1451
7. Mathur R, Pathak V, Bandil D (2019) Parkinson disease prediction using machine learning
algorithm. In: Emerging trends in expert applications and security. Advances in intelligent
systems and computing, vol 841. Springer, Singapore, pp 357–363
8. Gosain A, Sardana S (2017) Handling class imbalance problem using oversampling techniques:
a review. In: 2017 international conference on advances in computing, communications and
informatics (ICACCI). IEEE, pp 79–85
9. Kaur P, Gosain A (2019) Empirical assessment of ensemble based approaches to classify
imbalanced data in binary classification. Int J Adv Comput Sci Appl (IJACSA) 10(3):48–58
10. Galar M, Fernandez A, Barrenechea E, Bustince H, Herrera F (2012) A review on ensembles for
the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Trans
Syst Man Cybern Part C Appl Rev 42(4):463–484
11. Hou S, Liu Y, Yang Q (2022) Real-time prediction of rock mass classification based on TBM
operation big data and stacking technique of ensemble learning. J Rock Mech Geotech Eng
14(1):123–143
12. Liu L, Wu X, Li S, Li Y, Tan S, Bai Y (2022) Solving the class imbalance problem using
ensemble algorithm: application of screening for aortic dissection. BMC Med Inform Decis
Mak 22(82):1–16
13. Abedin MZ, Guotai C, Hajek P, Zhang T (2023) Combining weighted SMOTE with ensemble
learning for the class-imbalanced prediction of small business credit risk. Complex Intell Syst
9:3559–3579
14. Nishant PS, Rohit B, Chandra BS, Mehrotra S (2021) HOUSEN: hybrid over–undersampling
and ensemble approach for imbalance classification. In: Inventive systems and control. Lecture
notes in networks and systems, vol 204. Springer, Singapore, pp 93–108
15. Sarkar S, Khatedi N, Pramanik A, Maiti J (2020) An ensemble learning-based undersampling
technique for handling class-imbalance problem. In: Proceedings of ICETIT 2019. Lecture
notes in electrical engineering, vol 605. Springer, Cham, pp 586–595
16. Biswas SK, Boruah AN, Saha R, Raj RS, Chakraborty M, Bordoloi M (2022) Early detection
of Parkinson disease using stacking ensemble method. Comput Methods Biomech Biomed Eng
26(5):527–539
17. Saeed F, Al-Sarem M, Al-Mohaimeed M, Emara A, Boulila W, Alasli M, Ghabban F
(2022) Enhancing Parkinson’s disease prediction using machine learning and feature selection
methods. Comput Mater Continua 71(3):5639–5658
18. Yadav D, Jain I (2022) Comparative analysis of machine learning algorithms for Parkinson’s
disease prediction. In: 2022 6th international conference on intelligent computing and control
systems (ICICCS). IEEE, pp 1334–1339
19. Yaman O, Ertam F, Tuncer T (2020) Automated Parkinson’s disease recognition based on
statistical pooling method using acoustic features. Med Hypotheses 135:109483
20. Polat K (2019) A hybrid approach to Parkinson disease classification using speech signal:
the combination of SMOTE and random forests. In: 2019 scientific meeting on electrical-
electronics and biomedical engineering and computer science (EBBT). IEEE, pp 1–3
21. Rani P, Kumar R, Ahmed NMOS, Jain A (2021) A decision support system for heart disease
prediction based upon machine learning. J Reliable Intell Environ 7:263–275
22. Breiman L (1996) Bagging predictors. Mach Learn 24:123–140
23. Seiffert C, Khoshgoftaar TM, Van Hulse J, Napolitano A (2009) RUSBoost: a hybrid approach
to alleviating class imbalance. IEEE Trans Syst Man Cybern Part A Syst Hum 40(1):185–197
24. Freund Y, Schapire RE (1996) Experiments with a new boosting algorithm. In: Machine
Learning: Proceedings of the thirteenth international conference, pp 1–9
25. Kumari R, Singh J, Gosain A (2023) SmS: SMOTE-stacked hybrid model for diagnosis of
polycystic ovary syndrome using feature selection method. Expert Syst Appl 225:120102
26. UCI Repository for Parkinsons Dataset (PD) Retrieved from https://archive.ics.uci.edu/ml/mac
hine-learning-databases/parkinsons. Accessed on 15 Jan 2023
27. Senturk ZK (2020) Early diagnosis of Parkinson’s disease using machine learning algorithms.
Med Hypotheses 138:109603
28. Kuresan H, Samiappan D (2022) Genetic algorithm and principal components analysis in
speech-based Parkinson’s early diagnosis studies. Int J Nonlinear Anal Appl 13(1):591–602
A Comparative Analysis of Pneumonia
Detection Using Chest X-rays with DNN
Prateek Jha, Mohit Rohilla, Avantika Goyal, Siddharth Arora,
Ruchi Sharma, and Jitender Kumar
Abstract Pneumonia, known to be a highly destructive lung disease, can develop from a variety of viral infections. Due to the close association between pneumonia and other lung disorders, diagnosing pneumonia from chest X-ray images presents a significant challenge, and recent detection approaches have consequently been unable to reach higher levels of accuracy. In this research, pneumonia is classified using deep learning algorithms. A CNN model was developed to make chest X-ray diagnosis easier. Furthermore, pre-trained convolutional neural network (CNN) models, which extract features from vast datasets, prove highly advantageous in image classification applications. In our
analysis, we use a selection process to determine the most suitable CNN model for
the task at hand. CNN models offer substantial assistance in the evaluation of chest
X-ray images, particularly in the identification of pneumonia. To effectively identify
pneumonic lungs in chest X-rays and contribute to pneumonia treatment, this article
presents a range of convolutional neural network models.
Keywords Deep convolutional neural network (DCNN) · Image classification · Conv2D · MaxPooling2D · Batch normalization · Activation function · Chest X-ray (CXR)
P. Jha · M. Rohilla (B) · A. Goyal · S. Arora · R. Sharma · J. Kumar
Bharati Vidyapeeth’s College of Engineering, New Delhi, India
e-mail: rohillamohit1510@gmail.com
P. Jha
e-mail: jhapk0001@gmail.com
A. Goyal
e-mail: goyal.avi2000@gmail.com
S. Arora
e-mail: siddharth2699@gmail.com
R. Sharma
e-mail: ruchi.sharma@bharatividyapeeth.edu
J. Kumar
e-mail: jitender.kumar@bharatividyapeeth.edu
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_2
1 Introduction
A chest X-ray is a projection radiograph of the chest. It is used to diagnose
disorders that affect the heart, lungs, bones, respiratory system, and major
vessels in the chest, and it can help to detect pneumonia [1]. Diagnostic and
medical facilities conduct research on the classification of chest X-ray images.
Typically, a chest X-ray is ordered when a patient visits a doctor for chest pain,
a chest injury, or shortness of breath. Using the image, the doctor can diagnose
a heart problem, a collapsed lung, pneumonia, broken ribs, or any of a number of
other ailments. The main goal here is to offer a system for recognizing and
classifying such disorders in order to improve efficacy and quality; recent
studies have shown the high effectiveness of chest X-rays for this purpose. By
focusing on chest X-ray image categorization approaches based on machine learning
algorithms, this research seeks to provide a more accurate solution for chest
X-ray diagnosis. The review starts with background material on data mining, as
well as the fundamentals of machine learning and medical image analysis [2].
Today, viruses pose one of the most significant dangers to human well-being, and
pneumonia is among the contagious illnesses they cause. Because of this,
automatically classifying medical images has become much more challenging. The
aim of this paper is to classify medical images into predetermined categories.
Deep learning (DL), which has lately gained popularity, is one of the preferred
and frequently used techniques for addressing medical image categorization problems.
Artificial intelligence is increasingly closing the gap between human and computer
abilities, and computer vision is one of its many disciplines. A CNN is a DL
system that can distinguish objects in an image by ingesting a source image and
assigning importance to its different characteristics and objects. A convolutional
neural network requires significantly less preprocessing than earlier
classification methods, and its design resembles the connectivity pattern of
biological neurons. In this work, a CNN technique was trained on chest X-ray
images to identify pneumonia. The suggested CNN method is built from three models
in an effort to combine two strategies, namely using the top CNN models for the
training stage and using a vision transformer, both of which have shown promising
results when applied separately. Following related work, an ensemble learning
solution built from the top two CNN models is used in this model.
For our experimental analysis, we verified the ability to spot pneumonia from
CXR images. The two classes in this dataset are referred to as normal and
pneumonia; Goldbaum and Kermany provided the dataset [3]. Good performance depends
on the most important properties that the convolution networks retrieve (Fig. 1).
As previously noted, a larger network is always used for larger inputs to provide
a sufficient receptive field.
Fig. 1 Eight prevalent thoracic diseases are sized differently in the ChestX-ray14 dataset
2 Literature Review
Deep learning has been widely used by researchers in recent years to detect
illness from chest X-rays. For instance, Rajpurkar et al. [4] constructed the
121-layer CNN model known as CheXNet, trained on 10,000 X-ray images labeled with
fourteen diseases. The model was also evaluated on 420 X-ray images, and its
output was compared against radiologists' findings. As an outcome, they found
that the deep learning-based CNN approach was better than typical pneumonia
identification.
Stephen et al. [5] trained a CNN algorithm from scratch to learn features from
X-ray images and used it to predict whether or not a particular patient had
pneumonia.
To detect pneumonia from chest X-ray images, Atitallah et al. [6] proposed a
system based on a CNN with average adaptive filtering. Each chest X-ray image was
subjected to adaptive filtering to cancel noise, increasing accuracy and making
identification easier. Then, for feature extraction, a two-layer CNN model with
dropout was built. The significant filter is needed to improve the CNN's
classification accuracy.
Maselli et al. [7] extracted features for the pneumonia classification challenge
using three well-known CNN models. They used the same dataset to train each model
individually and extracted 1000 features from each CNN's final fully connected
layer. The selected features were then passed to machine learning classification
methods. A DL model with 49 convolutional layers and two dense layers was also
used in their research, achieving a test accuracy of 90.05%.
Ayan and Karabulut [8] used a CNN methodology with residual connections and
dilated convolutions to classify pneumonia. On selected X-ray images, they
visualized how the CNN approach behaved, drew on the vision transformer of
Dosovitskiy et al. [9], and applied transfer learning to obtain a CNN method for
pneumonia identification in X-ray images.
In conclusion, the state of the art includes some impressive ideas, but we have
tried to take things a step further by introducing a method that combines two
different approaches: employing convolutional models for the training stage and
choosing the one that yields the best results. The findings obtained are
encouraging and slightly improve upon the performance of the existing state of
the art while using a minimal number of features and layers.
3 Dataset
For our experimental analysis, we detected pneumonia from CXR images. The two
classes of data are called normal and pneumonia. The dataset was taken from
Kaggle and contains a total of 5856 normal and pneumonia images, categorized into
three parts: train, test, and validation, each further divided into two
sub-folders, normal and pneumonia. The training subset is made up of 1341 images
of healthy patients and 3875 pneumonia images. The test subset contains 234
normal and 390 pneumonia images. Sixteen validation images are also included,
eight from patients with pneumonia and eight from healthy individuals.
4 Methodology
In this study, an ideal approach for detecting pneumonia from chest X-rays is
proposed. A straightforward eight-layer CNN with max pooling and activation
function will serve as our initial model (Fig. 2).
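The paper does not list the exact layer hyperparameters, so as an illustration only, the following sketch traces how spatial dimensions shrink through a hypothetical stack of 3 × 3 "valid" convolutions and 2 × 2 max-pooling steps starting from a 224 × 224 input:

```python
def conv2d_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution layer."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, window, stride=None):
    """Spatial output size of a max-pooling layer (stride defaults to window)."""
    stride = stride or window
    return (size - window) // stride + 1

# Trace a hypothetical stack of Conv2D (3x3, 'valid') + MaxPooling2D (2x2)
# blocks on a 224x224 input; the paper's exact hyperparameters may differ.
size = 224
for block in range(4):
    size = conv2d_out(size, kernel=3)   # convolution shrinks each side by 2
    size = pool_out(size, window=2)     # pooling roughly halves each side
    print(f"block {block + 1}: {size}x{size}")
# prints 111x111, 54x54, 26x26, 12x12
```

Tracing shapes like this helps verify that the flattened feature vector entering the dense layers has a manageable size.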
For the successful completion of this project, a number of steps were taken into
consideration, which are as follows.
4.1 Choosing the Dataset
The dataset consists of 5856 normal and pneumonia X-ray images. It is divided
into three folders (train, test, and validation), each of which has two sub-folders
(pneumonia/normal). Images are of grayscale format and of varying sizes.
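To make the layout concrete, here is a small standard-library helper that counts images per split and class. It is only a sketch: the folder names (NORMAL/PNEUMONIA) and file extensions are assumptions about the Kaggle layout and should be adjusted to the actual dataset.

```python
from pathlib import Path

def count_images(root):
    """Count images per (split, class) in a train/test/val folder layout.

    Assumes a Kaggle-style layout such as
    root/{train,test,val}/{NORMAL,PNEUMONIA}/*.jpeg
    (folder names are illustrative; adjust to the actual dataset).
    """
    counts = {}
    for split_dir in sorted(Path(root).iterdir()):
        if not split_dir.is_dir():
            continue
        for class_dir in sorted(split_dir.iterdir()):
            if class_dir.is_dir():
                counts[(split_dir.name, class_dir.name)] = sum(
                    1 for p in class_dir.iterdir()
                    if p.suffix.lower() in {".jpeg", ".jpg", ".png"}
                )
    return counts
```

Printing these counts before training is a cheap sanity check that the class imbalance matches the figures reported for the dataset.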
A Comparative Analysis of Pneumonia Detection Using Chest X-rays 15
Fig. 2 Block diagram
4.2 Preprocessing the Images
Prior to training the model, the chest X-ray images were resized to 224 × 224.
In the dataset, the X-ray images show that more than 1200 people are healthy and
more than 3800 have pneumonia.
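In practice, a library such as Pillow or OpenCV would handle the resizing; the following standard-library sketch only shows the idea behind nearest-neighbor resampling of a varying-size image onto a fixed grid:

```python
def resize_nearest(img, out_h, out_w):
    """Resize a grayscale image (list of rows) by nearest-neighbor sampling.

    Illustrative only: real pipelines use Pillow/OpenCV, which also offer
    smoother interpolation (bilinear, bicubic).
    """
    in_h, in_w = len(img), len(img[0])
    return [
        [img[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]

small = [[0, 1], [2, 3]]
big = resize_nearest(small, 4, 4)
# each source pixel is repeated in a 2x2 block when upscaling 2x
```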
4.3 CNN Classification Model
The suggested methodology’s numerous strategies, which are described in the
following sections, were used to train the CNN classification model.
Conv2D
Conv2D is a two-dimensional convolution layer. It creates a convolution kernel
that is convolved with the layer's input to produce a tensor of outputs.
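As a minimal sketch of what such a layer computes (Keras's Conv2D actually performs cross-correlation, shown here on a single-channel image with one kernel and no padding):

```python
def conv2d(image, kernel):
    """'Valid' 2D cross-correlation (as in Keras Conv2D) of a single-channel
    image with one kernel, using plain nested lists."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(ih - kh + 1):
        row = []
        for c in range(iw - kw + 1):
            row.append(sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw)
            ))
        out.append(row)
    return out

# A horizontal edge-detector kernel on a tiny image
image = [[1, 1, 1],
         [1, 1, 1],
         [5, 5, 5]]
kernel = [[-1, -1, -1],
          [ 1,  1,  1]]
print(conv2d(image, kernel))  # -> [[0], [12]]: the edge row responds strongly
```

A trained layer learns many such kernels, each producing one feature map.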
Activation Function
One of the choices made when building a neural network is which activation
function to employ in the hidden layers and at the output layer. Activation
functions are what make neural networks nonlinear, allowing them to construct
complex representations and functions of their inputs that a purely linear model
cannot.
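Two common choices, shown here only as illustrative examples (the paper does not specify which activations were used), are ReLU for hidden layers and the sigmoid for a binary normal-vs-pneumonia output:

```python
import math

def relu(x):
    """Rectified linear unit: max(0, x); a common hidden-layer activation."""
    return max(0.0, x)

def sigmoid(x):
    """Logistic sigmoid, squashing any real value into (0, 1); typical for
    a binary output unit."""
    return 1.0 / (1.0 + math.exp(-x))

assert relu(-3.0) == 0.0 and relu(2.5) == 2.5
assert abs(sigmoid(0.0) - 0.5) < 1e-12
```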
MaxPooling2D
The Keras MaxPooling2D layer downsamples the output of the convolution layers.
It slides a fixed-size window over each feature map and keeps only the maximum
value within each window, reducing the spatial dimensions of the tensor while
retaining the strongest activations. In Keras, this layer is provided by the
MaxPool2D class in the TF Keras layers module.
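The pooling operation itself reduces to taking window maxima; a plain-Python sketch:

```python
def max_pool2d(fmap, pool=2):
    """Max pooling with stride equal to the pool size, as in Keras
    MaxPooling2D: each output cell is the max of a pool x pool window."""
    h, w = len(fmap), len(fmap[0])
    return [
        [max(fmap[r + i][c + j] for i in range(pool) for j in range(pool))
         for c in range(0, w - pool + 1, pool)]
        for r in range(0, h - pool + 1, pool)
    ]

fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 1, 5, 6],
        [2, 2, 7, 8]]
print(max_pool2d(fmap))  # -> [[4, 2], [2, 8]], half the spatial size
```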
Batch Normalization
Batch normalization accelerates training and improves the reliability of deep
neural networks by adding extra layers to them. Each such layer standardizes the
input it receives from the preceding layer, normalizing it to zero mean and unit
variance before scaling and shifting it.
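For a single unit, the normalization step can be sketched as follows (gamma and beta stand in for the learnable scale and shift parameters; the epsilon avoids division by zero):

```python
def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one unit's activations across a batch to zero mean and unit
    variance, then scale and shift by the learnable gamma and beta."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    return [gamma * (x - mean) / (var + eps) ** 0.5 + beta for x in batch]

out = batch_norm([2.0, 4.0, 6.0, 8.0])
# the normalized batch has mean ~0 and standard deviation ~1
```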
4.4 Flattening Layer
The flattening layer converts a 2D matrix into a 1D array/vector for feeding into
the next layer. The output of the convolutional layers is thus transformed into
one long 1D vector, which is connected to the final layer.
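A minimal sketch of the flattening step:

```python
def flatten(fmaps):
    """Flatten a stack of 2D feature maps into one 1D vector for the
    dense (fully connected) layer."""
    return [v for fmap in fmaps for row in fmap for v in row]

vec = flatten([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
# two 2x2 feature maps become one 8-element vector
```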
4.5 Compiling Model
Finally, the model was compiled on Google Colab, and outputs were generated
wherein we receive distinguished images as normal or pneumonia.
5 Experiment Results and Discussion
This part provides the specifics of experiments that have been done to evaluate the
suggested architecture. The deep learning networks have been implemented using
the Keras with TensorFlow. Google Colaboratory was used for the computation in
this paper and section.
Accuracy and Loss
The model was trained for a number of epochs to check its performance. The best
results were obtained at epoch 10; the epoch 11 run showed undesired results and
was terminated. This architecture had an accuracy of 0.8781 and a loss of 0.2865,
as shown in Fig. 3.
Finally, with the same architecture and hyperparameters, the accuracy reached
90.38%.
Performance of Model
See Figs. 4, 5 and 6 and Tables 1 and 2.
Accuracy, precision, and recall were also calculated for the model. Accuracy is
the true predictions divided by total predictions (Eq. 1). Precision tells about the
preciseness of the model to predict the true label (Eq. 2). Recall can be defined as
true positive label divided by the sum of the false negative label and true positive
label (Eq. 3).
Accuracy = (TN + TP) / (TN + TP + FN + FP) (1)
Precision = TP / (FP + TP) (2)
Fig. 3 Accuracy and loss
Fig. 4 Accuracy and loss
curves for train and
validation data versus
number of epochs
Recall = TP / (FN + TP) (3)
For calculating the precision, recall, and accuracy, the confusion matrix was obtained
(Fig. 5). The confusion matrix tells about the false positive, false negative, true
positive, and true negative; hence, it could be used to analyze the model and calculate
the precision, recall, and accuracy. The F1-score is the harmonic mean of the precision and
Fig. 5 Confusion matrix of
training dataset
Fig. 6 ROC curve
Table 1 Obtained scores from the proposed approach
Precision Recall F1-Score Accuracy
89.75 79.04 84.06 90.34
Table 2 Comparison between proposed approach and existing methods
Model Number of images Accuracy Precision Recall
Ayan et al. [8] 5856 84.5 91.3 89.1
Rahman et al. [10] 5247 98.0 97.0 99.0
Proposed methodology 5856 90.38 89.75 79.04
recall (Eq. 4).
F1-score = (2 × Precision × Recall) / (Precision + Recall) (4)
The F1-score is 84.06.
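Equations (1)–(4) can be computed directly from confusion-matrix counts; the counts below are illustrative only, not the paper's actual confusion matrix:

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts,
    following Eqs. (1)-(4)."""
    accuracy = (tn + tp) / (tn + tp + fn + fp)
    precision = tp / (fp + tp)
    recall = tp / (fn + tp)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts only; see Fig. 5 for the paper's confusion matrix
acc, prec, rec, f1 = metrics(tp=79, tn=90, fp=9, fn=21)
# with these counts, recall is exactly 79/100 = 0.79
```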
Comparative Analysis with Existing Methods
Various scores of the proposed methodology and existing methods have been
compared. We examined two existing methods, from Ayan et al. [8] and Rahman
et al. [10], which used VGG16, SqueezeNet, DenseNet, and Xception and obtained
accuracies of 84.5 and 98.0%, respectively.
The proposed approach has lower accuracy than Rahman et al. [10], but it gives
higher accuracy than Ayan et al. [8] on the same dataset, and it would be easier
to deploy in real-world applications.
6 Conclusion
The purpose of this work is to extend medical expertise to situations in which
few radiologists are available. We assist in the pre-diagnosis of pneumonia in
order to avoid adverse repercussions in these areas. The creation of such an algorithm
may be advantageous for the healthcare industry. We evaluated how different pre-
trained models performed and concluded that our approach produces results that are
superior to those of some earlier works. We would like to provide the most efficient
pre-trained CNN model available for similar future research. Better algorithms will
likely be created as a result of our research.
References
1. Ortiz-Toro C, García-Pedrero A, Lillo-Saavedra M, Gonzalo-Martín C (2022) Automatic
pneumonia detection in chest X-ray images using textural features. Comput Biol Med
145:105466
2. Wang L, Wang H, Huang Y, Yan B, Chang Z, Liu Z, Zhao M, Cui L, Song J, Li F (2022) Trends
in the application of deep learning networks in medical image analysis: evolution between
2012 and 2020. Eur J Radiol 146:110069
3. Malhotra P, Gupta S, Koundal D, Zaguia A, Kaur M, Lee H-N (2022) Deep learning-based
computer-aided pneumothorax detection using chest X-ray images. Sensors 22(2278):1–23
4. Rajpurkar P, Irvin J, Zhu K, Yang B, Mehta H, Duan T, Ding D, Bagul A, Langlotz C,
Shpanskaya K et al (2017) CheXNet: radiologist-level pneumonia detection on chest X-rays
with deep learning. arXiv:1711.05225 [cs.CV]
5. Stephen O, Sain M, Maduh UJ, Jeong D-U (2019) An efficient deep learning approach to
pneumonia classification in healthcare. J Healthc Eng 2019(4180949):1–7
6. Atitallah SB, Driss M, Boulila W, Koubaa A, Ghézala HB (2022) Fusion of convolutional
neural networks based on Dempster-Shafer theory for automatic pneumonia detection from
chest X-ray images. Int J Imaging Syst Technol 32(2):658–672
7. Maselli G, Bertamino E, Capalbo C, Mancini R, Orsi GB, Napoli C, Napoli C (2021) Hierar-
chical convolutional models for automatic pneumonia diagnosis based on X-ray images: new
strategies in public health. Ann IG 33(6):644–655
8. Ayan E, Karabulut B, Ünver HM (2022) Diagnosis of pediatric pneumonia with ensemble of
deep convolutional neural networks in chest X-ray images. Arab J Sci Eng 47:2123–2139
9. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M,
Minderer M, Heigold G, Gelly S et al (2020) An image is worth 16×16 words: transformers for
image recognition at scale. arXiv:2010.11929 [cs.CV]
10. Wu H, Xie P, Zhang H, Li D, Cheng M (2020) Predict pneumonia with chest X-ray images
based on convolutional deep neural learning networks. J Intell Fuzzy Syst 39(3):2893–2907
Machine Learning-Based Binary Sentiment
Classification of Movie Reviews in Hindi
(Devanagari Script)
Ankita Sharma and Udayan Ghose
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management, Lecture
Notes in Networks and Systems 785, https://doi.org/10.1007/978-981-99-6544-1_3
Abstract Lately, there has been a remarkable surge in online movie reviews in
Hindi with the advent of the UTF-8 standard. Movie reviews are an excellent
source of sentiments; therefore, Hindi movie review classification is one of the
exciting and demanding tasks of NLP, as it helps the viewers decide whether a
film/movie is worth watching. Much work in movie reviews sentiment classi-
fication has been done mainly for resource-affluent languages. Still, preliminary
work is being done in Hindi due to its complex nature and scarce resources like
adequate-labeled datasets. This paper aims to develop a machine learning-based
solution for performing binary sentiment classification on movie reviews in Hindi
(Devanagari Script). To this end, a primary binary polarity dataset, namely, Movie
Reviews in Hindi (MRH) consisting of 5K reviews, is made. Apart from MRH,
the Hindi IIT-P movie and product review datasets are also deployed in this work.
First, all three datasets are prepared for further processing using the
preprocessing steps, and the features used are unigram, bigram, and trigram,
along with TF-IDF. Second, various state-of-the-art classifiers are applied to
all three datasets. Further, we propose and use a stacked model of classifiers
for performing binary sentiment classification on Hindi reviews. Experimental results on all
three datasets prove that the proposed stacking ensemble based on the employed
features compared favorably to all the baseline classifiers applied and achieved
reasonably high performance. Therefore, it indicates the efficacy of the proposed
stacked model for sentence level movie reviews sentiment classification in a
resource-scarce scenario.
A. Sharma (*) · U. Ghose
University School of Information, Communication and Technology, Guru Gobind Singh
Indraprastha University, Delhi, India
e-mail: ankitasharma2711@gmail.com
U. Ghose
e-mail: udayan@ipu.ac.in
Keywords Binary sentiment classification · Machine learning · Movie ·
Hindi · Stacking ensemble
1 Introduction
In today’s contemporary world, people post their opinions on social network-
ing sites like Twitter, Facebook, etc., generating massive textual content. They
also write blogs, participate in forums, and post multiple online reviews. India
has about 658 M Internet users, expected to reach 900 M by 2025. Due to the
explosive growth of data, we are drowning in data but hungering for knowledge.
Thus, mining that data for valuable insights is becoming critical [1]. It is known that
Hindi is the official language of India and the third most spoken language in the
world, with over 615 M speakers. Hindi is a communication medium in India and
many parts of the world. Making sense of Hindi text posted online is crucial to
understand its emotion. Therefore, Hindi sentiment analysis (SA) is becoming
essential. Textual SA is a process of finding the emotion or opinion of the writer
in written text. It plays a significant role in judging the writer’s perception and
guides in decision making. Till now, most of the work in SA has been done mainly
for resource-affluent languages, but only preliminary work in Hindi. With the
advent of UTF-8 standard, Hindi movie review content on the web is proliferating
[2]. The availability of voluminous Hindi textual content on the web has fueled
interest for researchers to explore this area. The Indian youth is very passionate
about Hindi cinema, and they participate proactively by writing movie reviews
(MRs) in Hindi (Devanagari) over the web; also, a significant amount of capi-
tal is invested in Hindi cinema every year [3]. It becomes tedious to go through
millions of reviews posted daily online. Therefore, there is a need to automate a
review mining or classification system to help viewers decide whether to watch
or skip a movie. Reviewing reviews gives viewers an idea about both positive and
negative movie aspects. SA research in Hindi is still developing, and recently sev-
eral studies have been conducted on SC in Hindi textual data using the machine
learning approach (MLA) and lexicon-based approach (LBA) [4]. Researchers
have primarily used their private and Hinglish datasets rather than pure Hindi data-
sets in Devanagari script. The present work provides an efficient machine learn-
ing (ML)-based solution for binary SC of MRs written in Hindi. Binary SC, in
respect of MRs, categorizes MRs as positive or negative. SA or SC can be done
at sentence, document, and aspect level [5, 6]. This study is confined to binary
sentence-level SA, which determines whether a sentence is of positive or negative
polarity. Only subjective, sentence-level MRs in Hindi (Devanagari script) have
been considered for this work. The literature review shows little research on
ensemble-based solutions for Hindi text classification. Therefore, this paper
presents a stacking classifier-based solution for
the SC of MRs in the Devanagari script. All experiments were performed on three
datasets. Moreover, to further validate the performance of our proposed solution,
different state-of-the-art (SOTA) classifiers were used individually for comparison.
The experiments presented and the results obtained strongly validate the effective-
ness of our proposed solution for binary SC in a resource-limited scenario.
The contribution of the present work is as follows:
(a) The key point of this investigation is to propose a stacking ensemble of
classifier-based solution for SC of review sentences of the movie domain
into binary sentiment category, i.e., positive or negative.
(b) A binary polarity MRH dataset comprising 5K MRs in Hindi is compiled
and manually annotated to ensure maximal quality.
(c) Analysis and preprocessing are performed on the collected dataset.
(d) TF-IDF along with unigram, bigram, and trigram are used for feature
extraction.
(e) A stacking ensemble of the classifiers-based solution has been devised for
SC of MRs sentences.
(f) To further validate the performance and for comparative analysis, various
SOTA classifiers were applied.
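As a rough sketch of the feature-extraction step in contribution (d): in practice scikit-learn's TfidfVectorizer with ngram_range=(1, 3) would typically be used; this from-scratch version uses an unsmoothed IDF and is illustrative only (the example documents stand in for preprocessed Hindi reviews).

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf(docs, n_max=3):
    """TF-IDF vectors over unigram, bigram, and trigram features.

    A from-scratch sketch with unsmoothed idf = log(N / df); scikit-learn's
    TfidfVectorizer uses a smoothed variant and L2 normalization.
    """
    feats = [sum((ngrams(d.split(), n) for n in range(1, n_max + 1)), [])
             for d in docs]
    df = Counter(g for f in feats for g in set(f))
    n_docs = len(docs)
    return [
        {g: (c / len(f)) * math.log(n_docs / df[g])
         for g, c in Counter(f).items()}
        for f in feats
    ]

vecs = tfidf(["great film great story", "weak film weak direction"])
# 'film' appears in both documents, so its idf = log(2/2) = 0
```

Vectors like these are what the baseline classifiers and the stacking ensemble consume.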
The remaining paper is as follows: Sect. 2 overviews related work done in Hindi
text SA. Section 3 describes the formed dataset, methodology used, and pro-
posed stacking ensemble for SC of Hindi reviews. Section 4 discusses the results
obtained. Lastly, we conclude our work and offer potential future directions
regarding Hindi reviews SC.
2 Literature Review
There have been plenty of research efforts for resource-affluent languages.
However, research in the Hindi language is still evolving. To the best of our
knowledge, a few studies exist in the movie domain using pure Hindi reviews;
also, most of the research works in Hindi SA have used their primary dataset to
conduct their experimental study.
This section will cover some prior studies related to SA in Hindi text.
Madan and Ghose [7] performed SA on Hindi Twitter data. Hindi MRs were collected,
and LBA, MLA, and hybrid approaches were applied. They conclude that the hybrid
approach outperforms LBA and that the decision tree (DT) classifier performed
best, obtaining an accuracy of 92.97%.
Hussaini et al. [8] have attempted to perform a score-based SA of Hindi book
reviews. An annotated dataset of 700 sentences related to Hindi book reviews is
made. Verbs, nouns, adjectives, and adverbs are used as opinion words. HSWN,
word sense disambiguation, and the Hindi subjectivity lexicon were applied.
According to the results obtained, the Hindi subjectivity lexicon performed best
and achieved an accuracy of 87.4%.
Jha et al. [9] have proposed HSAS in their paper. HSAS looks into two ways of
producing the Hindi subjectivity lexicon. The first way is to use the translator to
translate the English language to the Hindi language, and the second way is to use
an improved seed list of the Hindi language. Encouraging results were achieved;
HSAS obtained an accuracy of 80% when the seed word approach was used.
Jha et al. [10] have proposed HOMS for performing opinion mining on movie
review data. A dataset of 200 MRs was collected with equal numbers of positive
and negative reviews. A Naïve Bayes (NB) classifier was used, and for POS
tagging, only adjectives were considered. HOMS achieved an accuracy of 87.1%.
Jha et al. [11] have proposed a method to find opinions in reviews of Hindi
movies. An NB classifier, support vector machine (SVM), maximum entropy, and LBA
have been utilized. A dataset of 1000 MRs, 500 each of positive and negative
valence, was collected.
The results obtained state that accuracy increases when bigram features are used
rather than unigram features. Kumar et al. [12] have expanded the Indian
sentiment lexicon. SA was performed on Indian tweets using sentence-level
co-occurrences and DTs. The corpora, comprising 2,358,708 Hindi sentences and
109,855 Bengali sentences, were collected from an online newspaper. Accuracies
of 43.20 and 42% were obtained for the Bengali corpus, and accuracies of 49.68
and 46.25% for the Hindi corpus, for the constrained and unconstrained
submissions, respectively.
Kaur et al. [13] have proposed a new approach for Hinglish SA. A dataset of
100 positive and 100 negative reviews of the movie domain was collected. For
feature extraction, unigram, bigram, and trigram are used, and the classification
of sentiment was performed using SVM, NB, logistic regression (LR), and neural
network. Mishra et al. [14] have proposed an improvised context-specific polarity
lexicon (CSPL) resource for Hindi reviews. A total of 5200 reviews from the hotel
and movie domains were collected, and the results were compared with HSWN. The
results show that the proposed lexicon performed better, obtaining an accuracy
of 88% in the hotel domain using CSPL and 77% in the movie domain using
CSPL-extended.
Sharma and Moh [15] have attempted to predict Indian election results by
performing SA on Hindi Twitter data. A total of 42,235 tweets were collected,
and both supervised and unsupervised approaches were applied. NB and SVM
predicted a BJP win, while the dictionary-based approach predicted an INC win.
Among the three approaches applied, SVM obtained the highest accuracy of 78.4%.
Sharma et al. [16] performed SA on the Hindi language by using a modified
subjectivity lexicon. A dataset of 50 Hindi tweets was
obtained from Twitter using the hashtags "JAIHIND" and "WORLDCUP2015". The
results were compared against unigram presence, and it was concluded that the
proposed modified lexicon performs better, obtaining accuracies of 73.53 and
81.97% for the two hashtags, respectively. Singh and Lefever [17] performed
Hinglish SA using cross-lingual word embeddings. A dataset from SemEval 2020 was
used, and a supervised classification model and a transfer learning model were
applied. The results suggested that integrating cross-lingual embeddings
increases performance; an F-score of 0.556 was obtained. An attempt to detect
sarcastic sentiment in Hindi was made by Bharati et al. [18]. A dataset
comprising 4000 tweets was collected and manually labeled as sarcastic and
non-sarcastic. A sarcasm detection algorithm based on the contradiction between a tweet
and context was applied. The context with the same timestamp was used. The
results obtained outperformed SOTA approaches for Hindi sarcasm detection and
obtained an accuracy of 87%.
3 Proposed Methodology
This section describes the followed methodology and proposed stacked architec-
ture for SC of MRs in Hindi. First, the formation of the MRH dataset and the sta-
tistics of the dataset used is discussed. Second, the preprocessing step is explained,
followed by the feature extraction step, where the features used and vectorization
are discussed. Then, the model generation and SC using different baselines and
the proposed stacking ensemble are described. Finally, the performance evaluation
metrics used in this work are presented.
3.1 The Formation of Movie Reviews in Hindi (MRH)
Dataset
The MRH dataset of this work is primary and consists of 5000 review sentences.
MRs in Hindi were obtained from various online websites.1,2,3 The collected
reviews were manually labeled as positive or negative. The reviews rated with one
or two stars by the reviewers were considered negative, and the reviews rated with
more than three stars were considered positive. Neutral reviews were not consid-
ered in this work. Later, each annotation was manually reviewed by two language
experts to confirm the polarity of the label. Cohen’s Kappa was used to evaluate
the annotation quality, which yielded a score of ~ 85%. The dataset used in this
work is a CSV file with two columns: MRs text and polarity labels (PLabels). A
snapshot of the reviews in our dataset is shown in Table 1.
1 Webdunia, https://hindi.webdunia.com/bollywood-movie-review/, last accessed on 31/01/23.
2 Filmibeat, https://hindi.filmibeat.com/reviews, last accessed on 10/02/23.
3 Amarujala, https://www.amarujala.com/entertainment/movie-review, last accessed on 15/02/23.
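The inter-annotator agreement score reported above can be computed as follows (a sketch assuming two annotators and binary labels; the example label sequences are made up):

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators: observed agreement corrected for
    the agreement expected by chance."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)

a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "pos"]
b = ["pos", "pos", "neg", "pos", "pos", "neg", "pos", "neg"]
print(round(cohens_kappa(a, b), 3))  # -> 0.467 for these toy labels
```

Kappa near 1 indicates near-perfect agreement; the ~85% reported for MRH suggests the two experts labeled the reviews very consistently.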
3.2 Statistics of Datasets Used
In addition to the MRH dataset, two other datasets are used in this paper; details and statistics of all datasets used can be found in Table 2. Since we perform binary classification of MRs, our datasets contain only two polarity classes: positive and negative. The IIT-P product reviews and IIT-P movie reviews were originally annotated with four classes, namely positive, negative, neutral, and conflict [19]. For this study, only reviews with positive or negative classes were considered, while the remaining classes were ignored.
3.3 Preprocessing and Data Preparation
Since reviews are collected from online sources, they cannot be used directly for analysis; preprocessing is required to make them cleaner, more consistent, and more accurate. First, numbers, special characters, extra spaces, repeated words, and non-Hindi words were removed, and emoticons were replaced
Table 1 Snapshot of MRH dataset
MRs text | PLabels
{Citylights is a very beautiful film full of human emotions and humanity} | Positive
{Aiyaary fails to hook the audience anywhere due to weak screenplay and sluggish direction} | Negative
{Thrilling story of India's pride Parmanu} | Positive
{There is nothing new or special in the story of the film Simran} | Negative
{Tiger Shroff has given such a performance in the film Heropanti that people have become crazy about him} | Positive
Table 2 Brief statistics of the review datasets used in this work
Datasets Language Positive Negative Total reviews
MRH (Ours) Hindi 2895 2105 5,000
IIT-P movie Hindi 823 530 1,353
IIT-P product Hindi 2290 712 3,002
Machine Learning-Based Binary Sentiment
with their textual equivalents. This text cleanup was followed by tokenization. After tokenization, stop words in Hindi were removed, while negation words were left untreated, as their removal could change the meaning of the reviews.
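The cleanup steps above can be sketched as a small pipeline. This is an illustrative simplification, not the authors' exact code: the stop-word list, negation list, and emoticon mapping below are tiny hypothetical samples, and the "repeated words" step is omitted:

```python
import re

# Illustrative (not exhaustive) Hindi stop words and negation words.
STOP_WORDS = {"है", "की", "के", "में", "से", "और", "यह"}
NEGATIONS = {"नहीं", "ना", "मत"}  # kept untreated: removing them can flip polarity

EMOTICONS = {":)": "खुश", ":(": "दुखी"}  # emoticon -> assumed textual equivalent

def preprocess(review: str) -> list[str]:
    # Replace emoticons with textual equivalents before stripping symbols.
    for emo, word in EMOTICONS.items():
        review = review.replace(emo, " " + word + " ")
    review = re.sub(r"[0-9]+", " ", review)              # remove numbers
    review = re.sub(r"[^\u0900-\u097F\s]", " ", review)  # keep Devanagari only
    tokens = review.split()                              # whitespace tokenization
    # Remove stop words, but always keep negation words.
    return [t for t in tokens if t not in STOP_WORDS or t in NEGATIONS]

print(preprocess("फिल्म 2023 अच्छी नहीं है :("))
# → ['फिल्म', 'अच्छी', 'नहीं', 'दुखी']
```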
3.4 Features and Vectorization
The motive of this step is to extract features for Hindi MRs classification. It is essential because MLAs work with numeric data, so textual data must be converted into a numeric format. In our paper, we have used the most popular vectorization method, Term Frequency-Inverse Document Frequency (TF-IDF), along with N-gram features. This work considered unigrams, bigrams, and trigrams along with TF-IDF [20, 21].
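The three feature settings can be obtained from scikit-learn's `TfidfVectorizer` by varying `ngram_range`. A minimal sketch, using short English stand-ins for tokenized Hindi reviews:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-ins for two tokenized reviews (the real input would be Hindi tokens).
docs = [
    "film was very good",
    "film was very slow and boring",
]

# ngram_range=(1, 1) -> unigrams, (2, 2) -> bigrams, (3, 3) -> trigrams.
# token_pattern=r"\S+" treats every whitespace-separated token as a feature,
# which also works for Devanagari tokens.
for n in (1, 2, 3):
    vec = TfidfVectorizer(ngram_range=(n, n), token_pattern=r"\S+")
    X = vec.fit_transform(docs)
    print(n, X.shape)  # sparse TF-IDF matrix: (n_docs, n_features)
```

Each review becomes a sparse TF-IDF weight vector over the n-gram vocabulary, which is the numeric input the classifiers in the next step consume.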
3.5 Model Generation and the Brief Description of the ML
Classifiers Used
This phase aims to apply supervised MLAs for binary MRs classification on all
three datasets. The “No free lunch” theorem states that no single ML algorithm
works well for all types of problems. So, applying and experimenting with differ-
ent algorithms that fit our problem is always a good practice. We have also experi-
mented with different MLAs for classifying sentiment in Hindi MRs [22].
Naïve Bayes (NB). NB is a classification technique based on the principle of conditional probability according to Bayes' theorem. It is a simple but surprisingly powerful algorithm for predictive modeling. The "naïve" part refers to the assumption that the occurrence of a given feature is independent of the occurrence of other features, even when these features actually depend on each other.
Support Vector Machine (SVM). SVM is a supervised learning method that separates data into categories using hyperplanes. It is one of the most popular ML methods and can be used for both classification and regression problems. It creates the hyperplane that separates the classes in the best possible way, i.e., it chooses the hyperplane with maximum separation from the closest data points.
Decision Tree (DT). DT is a powerful algorithm that can be used for both clas-
sification and regression problems. It is a nonparametric model based on the
conditionality principle. An advantage of DT is that the number of parameters
does not increase when more features are added.
K-Nearest Neighbor (KNN). This is a supervised MLA used mainly for classification tasks. In this nonparametric technique, data points are classified based on the classification of their neighboring points. It is called a lazy learner because it does not build a model during training but memorizes the training dataset [23].
Logistic Regression (LR). LR, available in scikit-learn's sklearn.linear_model module, is often used for classification problems. It is an efficient model with low variance. The idea is to find a relationship between the features and the probability of a particular outcome. In our work, the binomial LR model was used because only two labels, positive and negative, are considered.
Extra Trees (ET). Extremely Randomized Trees, also known as Extra Trees, is based on an ensemble of DTs. First, a large number of unpruned DTs are created from the training data. In classification, predictions are made by majority voting: all predictions made by the trees are aggregated to obtain the final prediction. In this process, the selection of splits and features is done randomly.
AdaBoost (AB). AdaBoost, or Adaptive Boosting, is a widely used iterative ensemble method. It initially selects training data randomly and then trains iteratively, selecting each training set based on the correctness of the previous round's predictions.
Gradient Boosting Machine (GBM). GBM is based on the sequential ensem-
ble method. In this method, weak learners are generated sequentially so that the
current weak learners are always better than the previous weak learners. The
overall performance of the model improves with each iteration [24].
XGBoost (XGB). XGBoost, or eXtreme Gradient Boosting, is an enhanced version of GBM that works with an ensemble of DTs. The problem with GBM is that it computes output slowly due to its sequential analysis. XGBoost overcomes this drawback by computing output quickly, increasing the efficiency of the model. It uses cache optimization and implements distributed computation methods to improve performance.
3.6 Proposed Stacked Model of Classifiers
There is a saying, the "wisdom of crowds," that the collective opinion of a crowd is often better than that of a single expert. Combining many ML models into a single model is called ensemble learning [6], and stacked generalization, or simply stacking, is an ensemble of ensembles. In a stacking ensemble, the final estimator is trained by integrating the predictions of the different estimators [25], an approach inspired by the wisdom of crowds. In this work, a stacked model of classifiers for performing binary SC on Hindi review datasets is presented. The proposed framework combines six efficient classifiers to improve on the performance of each individual classifier. The classifiers employed in the proposed architecture were selected based on ease of implementation and their respective trade-offs: the combined classifier models in the stacked architecture balance the individual models' bias and variance. The proposed stacking method consists of one layer of estimators/classifiers, as subsequent layers increase the complexity.
The applied estimators are NBC, SVM, the boosting-based GBM and XGB, and the bootstrap-aggregation-based ET. The predictions made by these estimators are used as features for the final estimator, LR. The final estimator, also called the meta-estimator, makes the final predictions of the review labels. The meta-estimator learns from the strengths of the previously used learners and compensates for their weaknesses. To avoid overfitting, cross-validation (CV) is performed at each stacking/training step. The dataset is split into S folds, and in S successive rounds, S−1 folds are used to fit the base-level estimators in every iteration; the base-level estimators are then applied to the remaining subset that was not included for model training in that iteration. The resulting predictions are stacked and given as input data to the meta-level estimator. After training the stacked classifiers, the base-level estimators are fit to the entire dataset. The proposed architecture and the algorithm for the proposed stacked ensemble of classifiers with S-fold CV are given in Fig. 1.
Fig. 1 Proposed stacked model of classifiers for SC of MRs in Hindi
Algorithm: Proposed Stacked Model of Classifiers with S-fold Cross-Validation
Input: Hindi movie reviews set R(r1, r2, r3, …, r5000);
 Sentiment label class set Label(l1, l2, l3, …, ln);
 Estimators set Est(NBC, SVM, GBM, XGB, ET);
Output: Predicted polarity label based on the proposed stacking ensemble of classifiers
1: A stacking ensemble Sen
2: Adopt the CV approach in preparing a training set for the estimators
3: Randomly split R into S equal-size subsets: R{R1, R2, …, RS}
4: for s ← 1 to S do
5:  Step 1: Learn base-level estimators
6:  for t ← 1 to K do
7:   Learn a stacker st from R \ Rs
8:  end for
9: end for
10: Step 2: Learn a meta-level estimator LR
11: Step 3: Re-train base-level estimators
12: for t ← 1 to K do
13:  Train a classifier sk based on R
14: end for
15: return S(r) = s′(s1(r), s2(r), s3(r), …, sp(r))
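The stacked training procedure above maps closely onto scikit-learn's `StackingClassifier`. The sketch below is a simplification, not the authors' exact setup: it uses synthetic stand-in features instead of TF-IDF vectors, `GaussianNB` in place of the unspecified NBC variant, and omits XGB to stay within scikit-learn:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier,
                              GradientBoostingClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic stand-in for TF-IDF features of labeled reviews.
X, y = make_classification(n_samples=300, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Base estimators feed their predictions to the LR meta-estimator;
# cv=5 plays the role of the S-fold split in the algorithm above.
stack = StackingClassifier(
    estimators=[
        ("nb", GaussianNB()),
        ("svm", SVC(probability=True)),
        ("gbm", GradientBoostingClassifier()),
        ("et", ExtraTreesClassifier(random_state=42)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_tr, y_tr)
print(f"held-out accuracy: {stack.score(X_te, y_te):.2f}")
```

`StackingClassifier` also handles the final retraining of the base estimators on the full training set, matching Step 3 of the algorithm.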
3.7 Performance Evaluation
The performance comparison between the various SOTA ML-based classification models and our proposed stacked model of classifiers is evaluated using standard evaluation metrics, namely accuracy, recall, precision, and F measure, which are defined below [26].
Accuracy is the ratio of correctly classified reviews to the total number of reviews considered; it is a vital classification metric.
Precision gives how many of the reviews predicted as positive are actually positive.
Recall is the ratio of correctly identified positive reviews to the total number of positive reviews, and it tests the classifier's completeness.
F measure measures the accuracy of the test conducted and is the harmonic mean of precision and recall.
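In terms of the standard confusion-matrix counts — true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) — these metrics take their usual forms:

```latex
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\mathrm{Precision} = \frac{TP}{TP + FP},
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```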
4 Results and Discussion
This work aims to efficiently analyze the sentiments of viewers and reviewers
expressed in written Hindi MRs. Python 3.11.1 was used for the implementation.
To test the effectiveness of our proposed stacked model, all experiments were conducted on two sentence-level Hindi benchmark review datasets, namely the IIT-P movie and product review datasets, along with the MRH dataset. After preprocessing and feature extraction using TF-IDF with unigrams, bigrams, and trigrams, various SOTA classifier models were applied, followed by the proposed stacked model. All results were evaluated using tenfold CV. The values of the standard evaluation metrics accuracy, recall, precision, and F measure were calculated in this work, but the results are discussed only in terms of accuracy. Clustered bar graphs with accuracy, precision, recall, and F measure scores in percentage on dataset 1 (MRH), dataset 2 (IIT-P movie reviews), and dataset 3 (IIT-P product reviews) are given in Figs. 2, 3 and 4, respectively.
Considering our MRH dataset with unigrams, the results imply that the proposed architecture achieved the highest accuracy of 83%, followed by LR and XGB, which both attained an accuracy of 81%. In Fig. 2b, the proposed architecture achieved the highest accuracy of 78%, followed by MNB and LR, which attained accuracies of 77 and 76%, respectively. In Fig. 2c, a different pattern was observed: ET obtained an accuracy of 73%, outperforming the proposed model, which attained an accuracy of 69%.
The same experiment was performed on dataset 2; Fig. 3 presents the results obtained. Based on the results in Fig. 3a, the proposed architecture achieved the highest accuracy of 79%, followed by SVM, which attained an accuracy of 78%. The proposed model performed as well as the best-performing models for bigrams with TF-IDF and trigrams with TF-IDF, as shown in Fig. 3b and c. The proposed architecture was also applied to dataset 3, which contains product reviews, to check domain independence, as given in Fig. 4. In the case of unigrams with TF-IDF, the same pattern was observed: the proposed architecture outperformed the other applied SOTA models. As the n-gram range increased, i.e., in the bigram and trigram cases, the proposed model performed on par with the best-performing model. Empirical results from all three datasets suggest that the proposed architecture achieves improved overall performance in the unigram with TF-IDF case, while in the bigram and trigram with TF-IDF cases its results matched the highest accuracy obtained among the applied classifiers. The proposed architecture performed better than the individual classifiers on all three datasets and outperformed them with unigrams and TF-IDF. It was concluded from the experiments conducted on the three datasets that the proposed architecture, based on a stacking model of classifiers, is apt for performing binary SC on Hindi review datasets. It was also observed that, among the features, unigram with TF-IDF outperformed both bigram and trigram
Fig. 2 Accuracy, precision, recall, and F measure in % on dataset-1 with a unigram, b bigrams,
and c trigram along with TF-IDF as features
with TF-IDF in all three Hindi datasets applied. Another observation was that the results did not improve significantly even when applying the stacking ensemble of classifiers on higher N-grams. A responsible factor is the TF-IDF weighting used with the N-grams (unigram, bigram, and trigram): Hindi polarity-bearing words that occur frequently in Hindi reviews could be assigned a lower weight by TF-IDF,
Fig. 3 Accuracy, precision, recall, and F measure in % on dataset-2 with a unigram, b bigrams,
and c trigram along with TF-IDF as features
which could be the reason for this result. To the best of the authors' knowledge, previous work has not addressed the binary SC of Hindi MRs with the proposed architecture. The proposed architecture is characterized by better accuracy, ease of implementation, and improved performance. Moreover, it requires fewer computational resources and is apt for dealing with overfitting problems. Therefore, the
Fig. 4 Accuracy, precision, recall, and F measure in % on dataset-3 with a unigram, b bigrams,
and c trigram along with TF-IDF as features
proposed architecture is expected to help viewers and reviewers evaluate online MRs in Hindi and thus help decide whether a movie should be watched.
5 Conclusion
Nowadays, the Internet has become a perpetual podium that people use to express their sentiments. With the advent of the UTF-8 standard, Hindi textual content on the web has proliferated, as people feel more comfortable expressing their views, emotions, etc., in their native language. The availability of voluminous Hindi textual content on the Internet has sparked researchers' interest in exploring this area. This paper aims to develop an ML-based solution for performing binary SC on MRs in Hindi (Devanagari script). To this end, a binary movie-domain-oriented dataset, namely MRH, is created, and a stacked model of classifiers is proposed. After preprocessing and feature extraction, the proposed architecture is applied, which combines six classifier models to improve on the performance of the individual classifiers. The classifiers employed in the proposed architecture were selected based on ease of implementation and their respective trade-offs: the combined classifier models in the stacked architecture balance the individual models' bias and variance. Different SOTA classifiers were used individually for comparison to further validate the proposed solution's performance. The experimental results on all three datasets strongly confirm the efficacy of the proposed architecture for binary SC in a resource-deficient scenario, with unigram with TF-IDF performing the best among all the features applied. We plan to incorporate character-level and word-level features and their mélange in the proposed stacked ensemble. We will also employ a deep learning model in the meta-learning stage. To further escalate performance, we will increase the size of our dataset.
References
1. Sharma A, Ghose U (2020) Sentimental analysis of twitter data with respect to general
elections in India. Procedia Comput Sci 173:325–334
2. Kulkarni DS, Rodd SS (2021) Sentiment analysis in Hindi—a survey on the state-of-the-art
techniques. In: ACM transactions on Asian and low-resource language information process-
ing, vol 21, issue 1, pp 1–46
3. Kaur A, Nidhi AP (2013) Predicting movie success using neural network. Int J Sci Res
2(9):69–71
4. Sharma A, Ghose U (2021) Lexicon a linguistic approach for sentiment classification. In:
2021 11th international conference on cloud computing, data science and engineering (con-
fluence). IEEE, pp 887–893
5. Makhloga VS et al (2021) Machine learning algorithms to predict potential dropout in high
school. In: Data analytics and management: proceedings of ICDAM. Springer, Singapore,
pp 189–201
6. Sharma A, Ghose U (2023) Voting ensemble-based model for sentiment classification of
Hindi movie reviews. In: Computational intelligence: proceedings of InCITe2022. Springer,
Singapore, pp 473–483
7. Madan A, Ghose U (2021) Sentiment analysis for twitter data in the Hindi language. In:
2021 11th international conference on cloud computing, data science and engineering (con-
fluence). IEEE, pp 784–789
8. Hussaini F et al (2018) Score-based sentiment analysis of book reviews in Hindi language.
Int J Nat Lang Comput 7(5):115–127
9. Jha V et al (2015) HSAS: Hindi subjectivity analysis system. In: 2015 annual IEEE India
conference (INDICON). IEEE, pp 1–6
10. Jha V et al (2015) HOMS: Hindi opinion mining system. In: 2015 IEEE 2nd international
conference on recent trends in information systems (ReTIS). IEEE, pp 366–371
11. Jha V et al (2016) Sentiment analysis in a resource scarce language: Hindi. Int J Sci Eng
Res 7(9):968–980
12. Kumar A et al (2015) IIT-TUDA: system for sentiment analysis in Indian languages using
lexical acquisition. In: International conference on mining intelligence and knowledge
exploration. MIKE 2015: mining intelligence and knowledge exploration. Springer, Cham,
pp 684–693
13. Kaur H et al (2018) Dictionary based sentiment analysis of Hinglish text. Int J Adv Res
Comput Sci 8(5):816–822
14. Mishra D et al (2016) Context specific lexicon for Hindi reviews. Procedia Comput Sci
93:554–563
15. Sharma P, Moh T-S (2016) Prediction of Indian election using sentiment analysis on Hindi
twitter. In: 2016 IEEE international conference on big data (big data). IEEE, pp 1966–1971
16. Sharma Y et al (2015) A practical approach to sentiment analysis of Hindi tweets. In: 2015
1st international conference on next generation computing technologies (NGCT). IEEE, pp
677–680
17. Singh P, Lefever E (2020) Sentiment analysis for Hinglish code-mixed tweets by means
of cross-lingual word embeddings, In: Proceedings of the 4th workshop on computa-
tional approaches to code switching, Marseille, France. European Language Resources
Association, pp 45–51
18. Bharti SK et al (2017) Context-based sarcasm detection in Hindi tweets. In: 2017 ninth
international conference on advances in pattern recognition (ICAPR). IEEE, pp 1–6
19. Akhtar MS et al (2016) A hybrid deep learning architecture for sentiment analysis. In:
Proceedings of the COLING 2016, the 26th international conference on computational lin-
guistics: technical papers, Osaka, Japan. The COLING 2016 Organizing Committee, pp
482–493
20. Oussous A et al (2020) ASA: a framework for Arabic sentiment analysis. J Inf Sci
46(4):544–559
21. Mehmood K et al (2019) Sentiment analysis for a resource poor language—Roman Urdu.
In: ACM transactions on Asian and low-resource language information processing, vol 19,
issue 1, pp 1–15
22. Shah SR, Kaushik A (2019) Sentiment analysis on Indian indigenous languages: a review
on multilingual opinion mining. arXiv preprint arXiv:1911.12848
23. Hourrane O et al (2019) Sentiment classification on movie reviews and twitter: an experi-
mental study of supervised learning models. In: 2019 1st international conference on smart
systems and data science (ICSSD), Rabat, Morocco. IEEE, pp 1–6
24. Sarkar K (2020) Heterogeneous classifier ensemble for sentiment analysis of Bengali and Hindi tweets. Sādhanā 45(196):1–17
25. Wolpert DH (1992) Stacked generalization. Neural Netw 5(2):241–259
26. Jain V et al (2021) Product recommendation platform based on natural language process-
ing. In: Data analytics and management: proceedings of ICDAM. Springer, Singapore, pp
627–635
Deep Learning-Based Recommendation
Systems: Review and Critical Analysis
Md Mahtab Alam and Mumtaz Ahmed
Abstract Recommendation systems (RSs) belong to a category of information
filtering systems designed to predict the “ranking” or “preference” that users will
give to a particular item. RSs are automated instruments and strategies that assist
and increase the decision-making process by aggregating the views of individuals
and guiding them to suitable recipients. RSs are extensively utilized in various
domains, including e-commerce, social networking sites, and entertainment, and
impact everyone’s everyday life. The systems are designed to assist the user by
proposing the items that are appropriate for him or her without requiring them to
undergo the lengthy, time-consuming, and complex process of selecting from a wide
selection of items that can number in the thousands or millions. The major aim
of making recommendations based on the user’s interests is to minimize human
work. Models and algorithms are expected to catch different user preferences and
mostly identify non-dependencies between them and the multitude of items to provide
personalization. In addition, this problem is compounded by real data criteria and
ambitious real-time requirements. Many difficulties arise when developing and oper-
ating RSs. Therefore, it is compulsory to address them and design a system in which
they become mitigated or tolerable. Sparsity, Cold Start, and Scalability are a few
challenges when a user develops a recommendation system. The pervasive use of deep
learning has demonstrated its power in solving complicated tasks more efficiently
than conventional techniques. This paper seeks to stimulate advancements in RSs by
providing a thorough summary of recent research on recommendation systems using
deep learning. The surveyed articles are categorized using the taxonomy of recommendation systems that is offered. Based on the analysis of the evaluated works and the stated potential solutions, open problems are highlighted.
Keywords Recommendation systems ·Ranking ·Sparsity ·Cold start ·
Scalability ·Deep learning ·Collaborative filtering
M. M. Alam (B)·M. Ahmed
Department of Computer Engineering, Jamia Millia Islamia, New Delhi, India
e-mail: mahtab.alam57@gmail.com
M. Ahmed
e-mail: ahmedmumtaz01@gmail.com
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_4
1 Introduction
A recommendation system that offers fast and specific advice can typically draw
users’ attention and benefit businesses. In recent years, recommendation systems
have experienced exponential growth and their applications have expanded across
various real-life domains. The theory behind the success of recommendation systems
is that, before making decisions of any sort, humans have a natural tendency to base their decisions on the opinions of their friends and neighbors, particularly when purchasing certain products.
The most popular applications for recommendation systems include:
Entertainment: film, songs, and game recommendations.
Content: customized newspapers, text reviews, web page recommendations, and
email filtering.
E-commerce: customer reviews for buying items such as mobiles, desktops, stationery items, etc.
Services: expert consulting recommendations, recommendations for rental
houses, or transport services.
(A) Motivation
This work has been carried out considering the factors that underline the importance of recommendation systems based on deep learning. Since such systems target the kinds of data that are present in excess, they have become common these days. Users want to find the data that is most relevant to a particular intent within the vast volume of data available on the Internet, and this insight can be adapted so that certain templates can increasingly be used for other purposes.
It has been noted that, even now, students and other learners in different classes rely heavily on the Internet and on electronically generated data. This data is used and processed extensively to serve different objectives related to academics or other fields of analysis. These systems, however, are not confined to academia or research and have importance in other areas of real life. The problem to be tackled is the explosive volume of information: strengthening information filtering to promote decision-making and reclaim attention. Since people find it difficult to locate the data that is most relevant to their application, they choose to use suggestions to reach the right kind of data. Research like this must be conducted to determine the best approaches to the data management problem. Because these technologies ensure that internal operations within businesses run smoothly, it is difficult to ignore their use in industry. It is crucial to keep in mind that recommendation systems make it simple for customers in these industries to benefit from the right kind of information that can be used in a variety of situations. It is equally crucial to employ the learning model so that it is capable of assisting users in carrying out many hard tasks and thus enhances the user's experience. These activities may apply to one's academics or to the sector of business that seeks to support consumers in separate ways.
Recommendation systems are now commonly used and surround everyone in everyday life. Popular venues for suggestion schemes include e-commerce, social media, and entertainment. Amazon, Twitter, Spotify, Netflix, and many more use recommendation systems and machine learning to personalize content. Amazon incorporates suggestions into every aspect of the purchase process; when it recorded 29% growth in revenue in a single year in 2012, its recommendation system was viewed as the key driver [1]. Comparable success has been recorded by Netflix, where almost 80% of content consumption derives from recommendations, making them an important part of the whole network [2]. Because such systems are primarily powered by machine learning, every significant advancement in this area needs to be integrated.
(B) Challenges
Many difficulties [3] arise in the process of developing RSs. Therefore, it is essential to keep them in mind and design systems in which they become mitigated or at least tolerable. There are mainly three challenges, namely Sparsity [4], Cold Start [5], and Scalability [6].
Sparsity It is one of the major challenges and refers to the problem that a significant part of the user-item interaction matrix R is unknown. Its counterpart is density, defined as the ratio between the known entries and the size of the user-item interaction matrix. It is a common situation in RSs to have observed only a small fraction of all possible interactions. This is precisely why we need to estimate ratings and build our rankings from them. Sparsity raises not only ambiguity but also computational difficulty. Matrix factorization methods, which transform sparse user and item representations into dense ones, partially mitigate the sparsity problem.
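The matrix factorization idea can be illustrated with a small NumPy sketch: a sparse toy rating matrix R is approximated by the product of dense user and item factor matrices, fit by gradient descent on the observed entries only. The rating values, latent dimension, and learning rate below are illustrative assumptions, not from the paper:

```python
import numpy as np

# Toy user-item rating matrix; 0 marks unknown entries (the sparsity problem).
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
mask = R > 0  # True only where a rating was observed

rng = np.random.default_rng(0)
k = 2                                    # latent dimension
P = rng.normal(scale=0.1, size=(4, k))   # dense user factors
Q = rng.normal(scale=0.1, size=(4, k))   # dense item factors

lr, reg = 0.01, 0.02
for _ in range(5000):                    # gradient descent on observed entries only
    E = mask * (R - P @ Q.T)             # error restricted to known ratings
    P += lr * (E @ Q - reg * P)
    Q += lr * (E.T @ P - reg * Q)

R_hat = P @ Q.T                          # dense completion of R
print(np.round(R_hat, 1))
```

The zeros in R are replaced in `R_hat` by predicted scores, giving a dense representation from which rankings can be built.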
Cold Start It can be divided into cold-start users and cold-start items and applies to entities without interactions. The term incomplete cold start (ICS) describes entities with too few interactions rather than none, which are still relevant. Cold start poses a key concern in Collaborative Filtering (CF), since interactions are the only source of data from which to infer tastes. Therefore, new users and novel items will not receive suggestions or customized reviews in CF until interactions are observed by the system; only then can suggestions be produced. In Content-Based Filtering (CBF), the cold start problem is less severe for items, since they join the system with their feature set. For users, the issue remains problematic because the user profile learner has no data, i.e., no user-item interactions, from which user profile features can be inferred. This CBF user cold-start condition can be mitigated by a default profile that includes common features, but due to the lack of personalization, it is a mediocre solution.
Scalability It is a problem that sparsity necessarily aggravates. Real-world RSs deal with millions of items and more than a billion users while striving to meet low inference latencies, e.g., 10 ms, which leads to the classical IR dichotomy: candidate generation must balance performance and quality. These demands call for effective algorithms and parallel architectures for RSs that scale well and achieve high quality within a limited time budget.
(C) Contributions of the paper
Even though the techniques employed in present-day recommendation systems
were developed over a decade ago, the field is currently experiencing active
research due to the pervasive presence of the Internet in people’s lives and the
continuous emergence of new technologies. The primary goal of this paper
is to compile various existing techniques into a single resource and assess
them based on different parameters. The paper reaches a conclusion regarding
the integration of different recommendation techniques with ongoing technology trends, while also addressing the challenges they face. Furthermore, this paper proposes a novel hybrid technique to overcome certain limitations encountered by existing approaches.
2 Terminology and Background Concepts
The theoretical foundations of recommendation systems and deep learning are discussed in this section, along with innovations crucial to their application.
(A) Recommendation System
As shown in Fig. 1, three fundamental categories of recommendation systems are typically applied in different machine learning settings. Let us discuss these kinds of systems to illustrate how they relate to systems based on deep learning. RSs, also referred to as recommender systems [7], have mainly three types: Content-Based Filtering (CBF) RSs, Collaborative Filtering (CF) RSs, and Hybrid RSs (a combination of the two).
Collaborative Filtering Recommendation System It is a type of recommendation system in which items are recommended based on interaction history. User-based filtering, item-based filtering, and several other methods are sub-types of CF [8]. The fundamental assumption underlying such systems is that users will be more likely to target the same kinds of items they did previously, and that they are expected to pay greater attention to this kind of information in the future.
The most popular method for recommendation engines is Collaborative Filtering (CF). CF analyses interactions between consumers and interdependencies between items to recognize new user-item connections [9].

Deep Learning-Based Recommendation Systems: Review and Critical 43

Fig. 1 Classification of RSs

We need a history of interactions, such as user ratings $r_{i,j}$ on particular items, given the user set $u \in U$ and the item set $v \in V$, with $m = |U|$ and $n = |V|$. An incomplete matrix $R = (r_{i,j})$ holds these scores. Absent entries in $R$ refer to scores that are not yet observed, but that may be observed in the future. As both $m$ and $n$ are normally large in recommendation situations and users commonly interact with only a comparatively small portion of all objects, $R$ is highly sparse. CF transforms the recommendation problem into a matrix completion problem: based on the existing scores, it predicts the missing values in $R$ and displays the top-$k$ entries to the respective customer.
To solve the matrix completion problem, we differentiate between two techniques: the model-based technique and the memory-based technique. Both exploit the similarity of items and/or customers. When using user comparisons, we perform user-user CF; with item comparisons, item-item CF. Both yield comparable effects, but as $m$ is rarely as high as $n$, they can vary greatly in terms of performance [10-12].
Memory-based techniques, also referred to as neighborhood-based techniques, infer scores by comparing items or users with each other. They implement a weighted average over the ratings of the $k$ most similar users (respectively items). For this job, we may use various similarity measures, of which the most common are cosine similarity and Pearson correlation [12]. K-Nearest Neighbors (KNN) [8] is a method [13] to find the nearest neighbors. We use a three-step method to produce recommendations for a single user $x \in U$: the initial step involves computing the cosine similarity between user $x$ and all other users $y \in U \setminus \{x\}$, given by the following formula:
44 M. M. Alam and M. Ahmed

$$\mathrm{cos\_sim}(x, y) = \cos(x, y) = \frac{\sum_{i \in V_{xy}} r_{x,i} \cdot r_{y,i}}{\sqrt{\sum_{i \in V_{xy}} r_{x,i}^{2}} \cdot \sqrt{\sum_{i \in V_{xy}} r_{y,i}^{2}}} \quad (1)$$

where

$V_{xy}$: set of items rated by both user $x$ and user $y$.
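As an illustration, Eq. (1) can be computed directly over rating dictionaries. The following is a minimal pure-Python sketch; the `cos_sim` helper and the toy ratings are hypothetical, not taken from any surveyed system:

```python
from math import sqrt

def cos_sim(ratings_x, ratings_y):
    """Cosine similarity over the items co-rated by two users (Eq. 1).

    Each argument maps item id -> rating for one user."""
    common = ratings_x.keys() & ratings_y.keys()  # the set V_xy
    if not common:
        return 0.0
    num = sum(ratings_x[i] * ratings_y[i] for i in common)
    den = (sqrt(sum(ratings_x[i] ** 2 for i in common)) *
           sqrt(sum(ratings_y[i] ** 2 for i in common)))
    return num / den if den else 0.0

# Toy ratings: user -> {item: rating}
R = {
    "x": {"a": 5, "b": 3, "c": 1},
    "y": {"a": 4, "b": 3},
    "z": {"c": 5},
}
# Similarity of user "x" to every other user (step one of user-user CF)
sims = {u: cos_sim(R["x"], R[u]) for u in R if u != "x"}
```

The remaining steps of user-user CF would then select the $k$ most similar users from `sims` and average their ratings, weighted by similarity.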
Model-based approaches [16] are provided by Naïve Bayes, Clustering [14], or (probabilistic) Matrix Factorization (MF) [15]. The latent factor models, often referred to as MF-based models, present the most common model-based methodology. They rely on machine learning algorithms to create models that identify patterns, i.e., past scores, in the training data. They are also able to generalize to new data, i.e., to build suggestions, and they predict unknown scores.
Again, we begin with a set of users $U$, a set of items $V$, and a sparse rating matrix $R$. We need to approximate the missing values in $R$ and thereby complete the matrix. MF factorizes $R$ into two lower-dimensional matrices, $P$ and $Q$, whose product should be as similar as possible to $R$. Both $P$ and $Q$ map users (respectively items) into a $d$-dimensional embedding space with $d \ll \min(m, n)$. Thus, instead of representing users and objects by their sparse score vectors in $R$ as in memory-based techniques, we use latent factors that are far fewer and therefore denser. Finally, to approximate an unobserved rating $\hat{r}_{i,j}$, we use these dense representations, fitted to resemble the observed ratings $r_{i,j}$. Thus, MF aims to recreate $R$ as the product of the dense user and item representations:

$$R_{m \times n} \approx \hat{R} = P_{m \times d} \, Q^{T}_{d \times n} \quad (2)$$

$$
\begin{pmatrix}
r_{1,1} & r_{1,2} & \cdots & r_{1,n} \\
\vdots & \vdots & \ddots & \vdots \\
r_{m,1} & r_{m,2} & \cdots & r_{m,n}
\end{pmatrix}
\approx
\begin{pmatrix}
p_{1,1} & p_{1,2} & \cdots & p_{1,d} \\
\vdots & \vdots & \ddots & \vdots \\
p_{m,1} & p_{m,2} & \cdots & p_{m,d}
\end{pmatrix}
\begin{pmatrix}
q_{1,1} & q_{2,1} & \cdots & q_{n,1} \\
\vdots & \vdots & \ddots & \vdots \\
q_{1,d} & q_{2,d} & \cdots & q_{n,d}
\end{pmatrix}
\quad (3)
$$
We use the RMSE [17] between the actual and reconstructed scores to calculate the reconstruction error:

$$\mathrm{RMSE} = \sqrt{\frac{1}{|S|} \sum_{(i,j) \in S} \left( r_{i,j} - \hat{r}_{i,j} \right)^{2}} = \sqrt{\frac{1}{|S|} \sum_{(i,j) \in S} \left( r_{i,j} - P_{i} Q_{j}^{T} \right)^{2}} \quad (4)$$
where $S$ is the set of user-item tuples for all observed ratings. We must now initialize and adapt the respective latent factor vectors $p_i, q_j \in \mathbb{R}^d$ to minimize the RMSE. This resembles an optimization problem that we can address algorithmically.
Stochastic Gradient Descent (SGD) [18] SGD is a standard algorithm for such problems and works by gradually adapting the latent factor variables to minimize the RMSE, which is a differentiable function. This allows us to calculate the partial derivatives with respect to the user or item embedding variables. The negative gradient points in the direction of a local or global minimum. We guide the loss function toward its minimum by scaling the gradient with a predetermined learning rate $\alpha$ in the latent variable update. Convergence depends critically on an appropriate choice of $\alpha$:
$$p_{i} = p_{i} - \alpha \cdot \frac{\partial\, \mathrm{RMSE}}{\partial p_{i}} \quad (5)$$

$$q_{j} = q_{j} - \alpha \cdot \frac{\partial\, \mathrm{RMSE}}{\partial q_{j}} \quad (6)$$
Content-Based Filtering Recommendation System (CBF) The algorithm learns to recommend items based on the similarity of attributes. These features are part of user profiles and item representations and can be basic keywords or concept-based representations of items and user profiles. As the basis for suggestions, the system simply matches content descriptions against each other. The content of users can correspond to desires or interests, while the content of items can be textual or consist of metadata correlated with items. By matching item attributes with target user profile attributes, we obtain scores that are interpreted as a user's preference level for a given item. The profile learner creates a model based on previous experience to build user profiles, leveraging probabilistic models, customer relevance feedback, or K-Nearest Neighbors (KNN). Lastly, to produce a binary or continuous relevance judgment, the filtering component matches user profiles against item representations. These item judgments are ordered into a ranked list of items likely to interest a particular user.
User profiles combine users' rating behavior with the content of rated items and are agnostic to other users' rating behavior. To obtain a user representation, item descriptions labeled with ratings (either implicit or explicit) are used as training data. This introduces the ability to recommend novel items but does not generalize to new users [[19], p. 14 sq.].
A widely used approach is tf-idf, often known as the vector space representation. Based on a weighted vector of item attributes, the system creates a content-based user profile for each user. The weights, which indicate each feature's value to the user, can be determined by a variety of methods operating on content vectors with independently valued entries. Simple methods employ the average rating of the user-item matrix, whereas more complex ones assess the likelihood that a consumer would like the item using model-based RSs such as Artificial Neural Networks, Association Rule Mining, Bayesian Classifiers, and Cluster Analysis.
$$\mathrm{tf\text{-}idf}(n, d) = \mathrm{tf}(n, d) \cdot \mathrm{idf}(n) \quad (7)$$

$$\mathrm{tf}(n, d) = \frac{f_d(n)}{\max_{w \in d} f_d(w)} \quad (8)$$

$$\mathrm{idf}(n) = \log\frac{C}{\mathrm{df}(n) + 1} \quad (9)$$

where

tf: term frequency.
idf: inverse document frequency.
df: document frequency.
$n$: term (word).
$d$: document (set of words).
$f_d(n)$: count of term $n$ in $d$.
$\max_{w \in d} f_d(w)$: count of the most frequent word $w$ in $d$.
$C$: count of documents in the corpus (the total document set).
$\mathrm{df}(n)$: number of documents in which $n$ occurs.
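Equations (7)-(9) translate directly into code. The helper names and the toy corpus below are illustrative assumptions; note that with this idf variant, a term occurring in most documents can receive a score of zero or below:

```python
from math import log

def tf(term, doc):
    # Eq. (8): raw count normalized by the most frequent term in the document
    return doc.count(term) / max(doc.count(w) for w in set(doc))

def idf(term, corpus):
    # Eq. (9): log of corpus size over (document frequency + 1)
    df = sum(1 for doc in corpus if term in doc)
    return log(len(corpus) / (df + 1))

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)  # Eq. (7)

# Toy corpus: each document is a list of words
corpus = [["deep", "learning", "deep"], ["learning", "systems"], ["systems"]]
score = tf_idf("deep", corpus[0], corpus)
```

An item is then represented by the vector of tf-idf scores of its terms, and matched against the user profile with a similarity measure such as Eq. (1).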
Hybrid Recommendation System Most recommendation systems now use this approach, incorporating CF, CBF, and other techniques. There is no reason why it would not be feasible to hybridize many different methods, even of the same kind. Hybrid approaches can be incorporated in many ways: by independently creating and then integrating content-based and collaborative predictions; by incorporating CBF capabilities into the CF approach (and vice versa); or by integrating the strategies into one framework. Multiple experiments comparing the empirical success of hybrid methods with CBF and CF methods have shown that hybrid methods can provide more detailed recommendations than either method alone. These strategies can also help to address common challenges such as the cold start and sparsity problems. Netflix serves as a prominent example of how hybrid recommendation algorithms are implemented.
Kaur and Bathla [7] listed five types of recommendation structures based on their approaches: the CBF recommendation system, the CF recommendation system, the demographic-based recommendation system, the utility-based recommendation system, and the knowledge-based recommendation system.
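The first hybridization strategy, independently creating and then integrating content-based and collaborative predictions, can be sketched as a simple weighted blend. The function, weight, and toy scores below are hypothetical; real systems typically learn such weights from data:

```python
def hybrid_score(cf_score, cbf_score, w_cf=0.7):
    """Weighted hybrid: blend independently computed CF and CBF predictions."""
    return w_cf * cf_score + (1 - w_cf) * cbf_score

# Toy candidate items with a score from each component model
cf = {"movie_a": 4.2, "movie_b": 3.1}
cbf = {"movie_a": 3.0, "movie_b": 4.8}
blended = {item: hybrid_score(cf[item], cbf[item]) for item in cf}
top = max(blended, key=blended.get)  # item recommended first
```

A cold-start-aware variant could shift the weight toward the CBF score when the user or item has few ratings, which is one way hybrids mitigate the sparsity problem.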
(B) Deep Learning
It is a subclass of machine learning (ML) techniques that analyzes raw data through successive layers to extract increasingly complex features. For instance, lower levels can detect edges in images, while higher levels can recognize meaningful patterns, such as digits, letters, or faces, that have significance to humans. The underlying principle behind deep learning algorithms is similar to that of humans: they learn from experience.
Deep learning is generating an immense buzz. In many fields of application, such as machine vision and speech comprehension, tremendous success in deep learning (DL) [20, 21] has been seen in the last few decades. Due
to its potential to tackle complex challenges and deliver exceptional results,
researchers and businesses are racing to expand the applications of deep
learning. Recently, it has significantly transformed recommendation systems,
providing new ways to improve their effectiveness. By simplifying traditional
models and achieving high recommendation accuracy, deep learning-based
recommendation systems (DLRS) [11,22] have garnered a lot of interest
in recent times. It successfully captures the unpredictable and complicated
interaction between users and items and enables more intricate abstractions
to be embodied in the higher layers as data representations. Additionally, deep
learning leverages abundant open data sources, such as textual, visual, and
qualitative knowledge, to incorporate diverse experiences into the information
itself.
Deep learning can be further subdivided into three different forms [20]:
(1) Supervised learning (task-driven)
(2) Unsupervised learning (data-driven)
(3) Reinforcement learning
(1) Supervised Learning
It trains a model using labeled data to make predictions on new, unseen data. Different methods are used in the supervised learning process, including classification [23] and regression, which are essential for predicting answers of any kind. Artificial Neural Networks (ANNs) are modeled after biological networks of neurons. There are various categories of ANNs, including Convolutional Neural Networks (CNNs) [17], which use multilayered structures to recognize images and voices in various applications.
(2) Unsupervised Learning
It trains a model using unlabeled data to discover patterns and relationships in the data on its own. It is used when the desired output is not known and the objective is to uncover hidden data structure. Clustering [23] is a method commonly used in this type of learning, with applications such as image recognition and object interpretation.
Fig. 2 Simple neural network and multilayer neural network [24]
(3) Reinforcement Learning
It entails training a model to make decisions using feedback received from the environment. The model learns to maximize a reward through trial and error, and it is often used in robotics and gaming applications.
There are two forms of neural networks: (a) simple neural networks and (b) multilayer neural networks, which contain multiple hidden layers (also known as deep neural networks). Both are shown in Fig. 2.
Many ML models such as Support Vector Machines (SVM) [25] and
Logistic Regression have shallow architectures consisting of only one or two
layers. Despite their popularity in the 1990s, these shallow models have limited
ability to represent complex data, such as text, images, and audio, leading to
difficulties in modeling such data.
Recent experimental results suggest that deep architectures are needed to train better deep learning models. Before that, models with at most two to three layers performed better than deeper models, as deep models were more challenging to train and often yielded worse results. However, in 2006 [21], the successful training of a Deep Belief Network (DBN) to predict handwritten digits using a layer-wise training methodology marked the first successful exploration of deep models.
In the past, researchers had not fully utilized deep models primarily because of
limited data availability and computational power. However, deep architecture
models, which typically consist of multiple layers, have the capacity to learn
a hierarchical representation of features, starting from low-level features and
progressing to high-level features.
A Restricted Boltzmann Machine (RBM) is a generative stochastic ANN that can learn a probability distribution over its inputs. It has been successfully applied in various applications, including CF and dimensionality reduction. An RBM consists of two layers: a visible layer and a hidden layer. The connections between the two layers are symmetrically weighted, and the units within each layer have no connections with each other.
Deep Neural Networks (DNNs) [26] are a type of ANN with multiple layers of interconnected nodes, also known as a Multilayer Perceptron (MLP) [27]. Backpropagation (BP) [28] is needed to train DNNs. They are
designed to learn and model complex relationships between inputs and outputs,
and they have been extremely successful in solving various tasks in speech
recognition, natural language processing (NLP), and many other areas. A DNN
takes an input, passes it through multiple hidden layers, and finally produces
an output. Each hidden layer is made up of many artificial neurons, and each
neuron receives inputs from the previous layer, performs computations on them,
and then sends the results to the next layer.
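The forward pass described above can be sketched in pure Python. The layer sizes, weights, and activation choices below are arbitrary illustrative assumptions (training via backpropagation is omitted):

```python
from math import exp

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def dense(inputs, weights, biases, act):
    # Each neuron: weighted sum of the previous layer's outputs plus a bias,
    # passed through an activation function
    return [act(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

# Hypothetical fixed weights for a tiny 2-3-1 network
W1 = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
b1 = [0.0, 0.1, -0.1]
W2 = [[0.7, -0.5, 0.2]]
b2 = [0.05]

hidden = dense([1.0, 2.0], W1, b1, relu)      # hidden layer of 3 neurons
output = dense(hidden, W2, b2, sigmoid)       # single output in (0, 1)
```

Stacking more `dense` calls yields deeper networks; each additional layer can, in principle, represent a more abstract level of features.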
Deep Auto Encoders (DAEs) are a specific type of Deep Neural Network (DNN). Unlike DNNs, which are supervised learning algorithms, DAEs are unsupervised learning algorithms in which the input and output are the same. This design allows the middle layer's output to serve as a dense representation. Like DNNs, DAEs can be pre-trained using DBNs. A technique for pre-training deep multilayer autoencoders was proposed in [29]. This approach treats consecutive layers as Restricted Boltzmann Machines and uses pre-training to approximate a reasonable parameter initialization.
Convolutional Neural Networks (CNNs) [30] are a type of ANN archi-
tecture that is designed to work with grid-structured data, such as an image.
In CNNs, the layers are organized in a way that is meant to gather relevant
features from the input data in an ordered manner. The first layer typically
extracts low-level features such as edges and simple shapes, while deeper layers
extract higher-level features and abstract concepts. The final layers of CNNs
are usually fully connected, meaning that they take the features gathered by the
preceding convolutional layers and produce a prediction. In summary, CNNs
are a powerful deep learning tool for processing grid-structured data and have
been successful in various applications.
Deep learning [31] is a hot and evolving field in both the DM and ML communities. These models can be trained with either supervised or unsupervised methods. Deep learning models first made a significant impact on computer
vision and audio, voice, and language processing. They have performed better
than many existing models in these areas. For different NLP functions, later deep
models have proven their efficacy. Semantic parsing [32], automatic translation
[33], sentence modeling, and several typical NLP tasks [11] are included in
these tasks.
Growing research activity on applying DL to RSs has been seen in the last
two years. Several neural network architectures have been tested by researchers,
such as autoencoders (AEs) [34,35], recurrent neural networks (RNNs) [8,
22,36], or CNNs [17], regular feedforward, or wide and deep architectures
[30]. Nevertheless, analysis at this intersection is either in its infancy or has received little consideration. In the following, we survey a range of methods with different architectures and point out accomplishments and limitations as well as commonalities and specialties. The list is not exhaustive and attempts to offer a first impression of alternative strategies. Papers can be divided into deep learning-based RS papers, surveys and summaries, and others that do not directly discuss the convergence of DL and RSs but offer theoretically applicable contributions.
(C) Evaluation Metrics
The evaluation metrics [35, 37, 38] for assessing the success of RSs, as well as the nature of such an assessment, are presented in this section. Different metrics are used to quantify different domains. To quantify the accuracy of ratings as numerical quantities, work in this area usually measures the absolute difference and the square root of the squared difference between actual and predicted ratings, using the mean absolute error (MAE) [39] and the root mean squared error (RMSE) [17].
Mean Absolute Error To know how close the actual outputs are to their predicted values, the MAE is measured. Mathematically, the MAE is the average of the absolute differences between the original values and the predicted values:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - y_p \right| \quad (10)$$

where

$n$: total number of observations/rows in the dataset.
$i$: index variable.
$y_i$: actual value.
$y_p$: predicted value.
Mean Squared Error (MSE) It measures the average squared difference between the original values and the predicted values:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - y_p \right)^{2} \quad (11)$$

where

$n$: total number of observations/rows in the dataset.
$i$: index variable.
$y_i$: actual value.
$y_p$: predicted value.
Root Mean Squared Error (RMSE) It is commonly used for evaluating the performance of regression models. It is the square root of the average of the squared differences between the original values and the predicted values:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - y_p \right)^{2}} \quad (12)$$

where

$n$: total number of observations/rows in the dataset.
$i$: index variable.
$y_i$: actual value.
$y_p$: predicted value.
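Equations (10)-(12) translate directly into code. A minimal sketch with hypothetical actual and predicted rating vectors:

```python
def mae(y_true, y_pred):
    # Eq. (10): mean absolute difference
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    # Eq. (11): mean squared difference
    return sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    # Eq. (12): square root of the MSE
    return mse(y_true, y_pred) ** 0.5

actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 2.5]
errors = (mae(actual, predicted), mse(actual, predicted), rmse(actual, predicted))
```

Because MSE squares each residual, it penalizes large rating errors more heavily than MAE, which is why RMSE and MAE on the same predictions (as in Table 1) can rank models differently.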
3 Deep Learning-Based Recommendation System
This section gives an analysis of the various works that have been proposed. Table 1 presents a comparison of the models.
4 Discussion
Surveying recommendation systems can help in understanding the current techniques
[48,49], identifying emerging trends [50], and assessing the effectiveness of different
techniques and algorithms. In this section, we will analyze and interpret the findings
from the literature and discuss their implications for our recommendation system.
The discussion based on the literature review provides valuable insights into the
effectiveness of different RSs, including CF, CBF, hybrid approaches, and deep
learning methods. Furthermore, addressing different types of challenges, selecting appropriate evaluation metrics, and considering ethical and privacy considerations are essential for designing an effective and responsible recommendation system. To mitigate the issues identified in the literature review, this paper proposes a novel hybrid technique to overcome certain limitations encountered by existing approaches.
5 Conclusion
In this paper, we gave a thorough analysis of the most significant research on DLRS to date. To determine the contributions and aspects of these studies, we also performed a brief statistical analysis of these works. We presented several significant research prototypes, assessed their benefits and drawbacks, and discussed relevant applications. We
Table 1 Model analysis

Existing work | Year | Methods | Evaluation metrics | Dataset | Results
[31] | 2017 | Stacked denoising autoencoder | RMSE | Netflix movie | 1.049
[16] | 2015 | Hierarchical Bayesian model | MAP | Netflix | 0.031
[40] | 2017 | Stacked denoising autoencoder (SDAE) and matrix factorization (MF) | RMSE | Book Crossing / Movielens 100k / Movielens 1M | 0.924 / 0.508 / 0.502
[41] | 2020 | Autoencoder | RMSE | Movielens 1M / Movielens 10M | 0.029 / 0.010
[42] | 2020 | DLCRS | RMSE | Movielens 100k / Movielens 1M | 0.917 / 0.903
[43] | 2019 | K-nearest neighbors (KNN) | Accuracy | ICU patient | 95.6%
[44] | 2020 | Matrix factorization | RMSE and MAE | Book Crossing / Movielens 100K | 25.78 and 19.69% / 19.69 and 14.08%
[35] | 2018 | Autoencoder | MAP | Movielens 100k / Movielens 10M | 0.223 / 0.179
[45] | 2018 | Opinion mining with experts | MAE and RMSE | Books | 0.97, 4.08
[46] | 2021 | Collaborative filtering and support vector machine classifier | Average accuracy | User speech emotion information | 87.2%
[47] | 2022 | Session based | HR@k | News (Adressa / Globo / MIND) | 0.1658 / 0.1852 / 0.0495
also include some of the most urgent unsolved issues and intriguing potential future
developments. Deep learning and recommendation systems both remain immensely popular study areas today. Every year, many new and innovative approaches and strategies are developed. Here, we lay out a complete framework for comprehending
the fundamental ideas in this area, describe the most significant developments, and
offer some insight into potential future research.
References
1. Hallinan B, Striphas T (2016) Recommended for you: the Netflix prize and the production of
algorithmic culture. New Media Soc 18(1):117–137
2. Koren Y (2009) The BellKor solution to the Netflix grand prize, pp 1–10
3. Thirumaran E (2009) Collaborative filtering based recommendation systems. In: Handbook of
research on text and web mining technologies, p 16
4. Barjasteh I, Forsati R, Masrour F, Esfahanian A-H, Radha H (2015) Cold-start item and user
recommendation with decoupled completion and transduction. In: RecSys’15 Proceedings of
the 9th ACM conference on recommender systems, pp 91–98
5. Qin C et al (2020) A survey on knowledge graph-based recommender systems. Sci Sin Inf
50(7):937–956
6. van den Berg R, Kipf TN, Welling M (2017) Graph convolutional matrix completion. arXiv:1706.02263 [stat.ML], https://doi.org/10.48550/arXiv.1706.02263
7. Kaur H, Bathla G (2019) Techniques of recommender system. Int J Innovative Technol
Exploring Eng 8(9S):373–379
8. Devooght R, Bersini H (2016) Collaborative filtering with recurrent neural networks. arXiv:1608.07400 [cs.IR], https://doi.org/10.48550/arXiv.1608.07400
9. Hu Y, Koren Y, Volinsky C (2008) Collaborative filtering for implicit feedback datasets. In:
2008 eighth IEEE international conference on data mining, Pisa, Italy. IEEE, pp 263–272
10. Bhasker B (2012) Comparative study of collaborative filtering algorithms. In: KDIR’12
Proceedings of the international conference on knowledge discovery information retrieval,
pp 132–137
11. Zhang S, Yao L, Sun A, Tay Y (2019) Deep learning based recommender system: a survey and
new perspectives. ACM Comput Surv 52(1):1–38
12. Ekstrand MD, Riedl JT, Konstan JA (2010) Collaborative filtering recommender systems.
Found Trends Hum-Comput Interact 4(2):81–173
13. Al-Garadi MA et al (2020) A survey of machine and deep learning methods for internet of
things (IoT) security. IEEE Commun Surv Tutorials 22(3):1646–1685
14. Sohail SS, Siddiqui J, Ali R (2017) Classifications of recommender systems: a review. J Eng
Sci Technol Rev 10(4):132–153
15. Ricci F, Rokach L, Shapira B, Kantor PB (eds) (2011) Recommender systems handbook. Springer
16. Wang H, Wang N, Yeung D-Y (2015) Collaborative deep learning for recommender systems.
In: KDD’15: Proceedings of the 21th ACM SIGKDD international conference on knowledge
discovery and data mining, pp 1235–1244
17. Sahoo AK, Pradhan C, Barik RK, Dubey H (2019) DeepReco: deep learning based health
recommender system using collaborative filtering. Computation 7(25):1–18
18. Vo ND, Hong M, Jung JJ (2020) Implicit stochastic gradient descent method for a cross-domain
recommendation system. Sensors (Switzerland) 20(2510):1–16
19. Aggarwal CC (2016) Recommender systems: the textbook. Recommender Syst 39(4):8–21
20. Quadrana M, Karatzoglou A, Hidasi B, Cremonesi P (2017) Personalizing session-based
recommendations with hierarchical recurrent neural networks. In: RecSys’17: Proceedings
of the eleventh ACM conference on recommender systems, pp 130–137
21. Mukhopadhyay S (2018) Deep learning and neural networks. In: Advanced data analytics using
python. Apress, Berkeley, CA, pp 99–119
22. Covington P, Adams J, Sargin E (2016) Deep neural networks for YouTube recommendations.
In: RecSys 2016 Proceedings of the 10th ACM conference on recommender systems, pp 191–
198
23. Lu J, Wu D, Mao M, Wang W, Zhang G (2015) Recommender system application developments:
a survey. Decis Support Syst 74:12–32
24. Karatzoglou A, Hidasi B (2017) Deep learning for recommender systems. In: Proceedings of
the eleventh ACM conference on recommender systems, RecSys’17, pp 396–397
25. Shalaby W et al (2017) Help me find a job: a graph-based approach for job recommendation
at scale. In: 2017 IEEE international conference on big data (Big Data), Boston, MA, USA.
IEEE, pp 1544–1553
26. Papadakis H, Fragopoulou P, Michalakis N, Panagiotakis C (2018) A mobile application for
personalized movie recommendations with dynamic updates. In: 2018 international conference
on intelligent systems (IS), Funchal, Portugal. IEEE, pp 507–514
27. Li M, Gao W, Chen Y (2020) A topic and concept integrated model for thread recommenda-
tion in online health communities. In: CIKM’20: Proceedings of the 29th ACM international
conference on information and knowledge management, pp 765–774
28. Zhang S, Tay Y, Yao L, Wu B, Sun A (2019) DeepRec: an open-source toolkit for deep learning
based recommendation. In: Proceedings of the twenty-eighth international joint conference on
artificial intelligence (IJCAI-19), pp 6581–6583
29. Elkahky AM, Song Y, He X (2015) A multi-view deep learning approach for cross-domain
user modeling in recommendation systems. In: WWW’15 Proceedings of the 24th international
conference on World Wide Web, pp 278–288
30. Rajkomar A et al (2018) Scalable and accurate deep learning with electronic health records.
npj Digit Med 1(18):1–10
31. Wei J, He J, Chen K, Zhou Y, Tang Z (2017) Collaborative filtering and deep learning based
recommendation system for cold start items. Expert Syst Appl 69:29–39
32. Pazzani MJ (1999) A framework for collaborative, content-based and demographic filtering.
Artif Intell Rev 13(5):393–408
33. van den Oord A, Dieleman S, Schrauwen B (2013) Deep content-based music recommendation.
In: Advances in neural information processing systems, vol 26 (NIPS 2013), pp 1–9
34. Strub F, Gaudel R, Mary J (2016) Hybrid recommender system based on autoencoders. In:
DLRS 2016: Proceedings of the 1st workshop on deep learning for recommender systems, pp
11–16
35. Li T, Ma Y, Xu J, Stenger B, Liu C, Hirate Y (2018) Deep heterogeneous autoencoders
for collaborative filtering. In: 2018 IEEE international conference on data mining (ICDM),
Singapore. IEEE, pp 1164–1169
36. Hidasi B, Karatzoglou A, Baltrunas L, Tikk D (2016) Session-based recommendations with recurrent neural networks. arXiv:1511.06939 [cs.LG], pp 1–10, https://doi.org/10.48550/arXiv.1511.06939
37. Felfernig A, Burke R (2008) Constraint-based recommender systems: technologies and
research issues. In: ICEC’08 Proceedings of the 10th international conference on electronic
commerce, article no 3, pp 1–10
38. Chen C-W, Lamere P, Schedl M, Zamani H (2018) Recsys challenge 2018: automatic
music playlist continuation. In: RecSys 2018 Proceedings of the 12th ACM conference on
recommender systems, pp 527–528
39. Shen Y, Lv T, Chen X, Wang Y (2016) A collaborative filtering based social recommender
system for e-commerce. Int J Simul Syst Sci Technol 17(22):91–96
40. Dong X, Yu L, Wu Z, Sun Y, Yuan L, Zhang F (2017) A hybrid collaborative filtering model
with deep structure for recommender systems. In: AAA’17 Proceedings of the thirty-first AAAI
conference on artificial intelligence, pp 1309–1315
41. Ferreira D, Silva S, Abelha A, Machado J (2020) Recommendation system using autoencoders.
Appl Sci (Switzerland) 10(5510):1–17
42. Aljunid MF, Dh M (2020) An efficient deep learning approach for collaborative filtering
recommender system. Procedia Comput Sci 171:829–836
43. Neloy AA, Oshman MS, Islam MM, Hossain MJ, Zahir ZB (2019) Content-based health
recommender system for ICU patient. In: Multi-disciplinary trends in artificial intelligence.
MIWAI 2019. Lecture notes in computer science, vol 11909. Springer, Cham, pp 229–237
44. Davagdorj K, Park KH, Ryu KH (2020) A collaborative filtering recommendation system
for rating prediction. In: Advances in intelligent information hiding and multimedia signal
processing. Smart innovation, systems and technologies, vol 156. Springer, Singapore, pp
265–271
45. Sohail SS, Siddiqui J, Ali R (2018) Feature-based opinion mining approach (FOMA) for
improved book recommendation. Arab J Sci Eng 43:8029–8048
46. Kim T-Y, Ko H, Kim S-H, Kim H-D (2021) Modeling of recommendation system based on
emotional information and collaborative filtering. Sensors 21(1997):1–25
47. Gong S, Zhu KQ (2022) Positive, negative and neutral: modeling implicit feedback in session-based news recommendation. In: SIGIR'22 proceedings of the 45th international ACM SIGIR conference on research and development in information retrieval, Madrid, Spain, pp 1185–1195
48. Sharma M, Mittal R, Bharati A, Saxena D, Singh AK (2021) A survey and classification on
recommendation systems. In: Proceedings of the 2nd international conference on big data,
machine learning and applications (BigDML 2021), Silchar, India, pp 19–20
49. Bukhari SNH, Jain A, Haq E, Mehbodniya A, Webber J (2021) Ensemble machine learning
model to predict SARS-CoV-2 T-cell epitopes as potential vaccine targets. Diagnostics
11(1990):1–18
50. Bukhari SNH, Webber J, Mehbodniya A (2022) Decision tree based ensemble machine learning
model for the prediction of Zika virus T-cell epitopes as potential vaccine candidates. Sci Rep
12(7810):1–11
Retention in Second Year Computing
Students in a London-Based University
During the Post-COVID-19 Era Using
Learned Optimism as a Lens:
A Statistical Analysis in R
Alexandros Chrysikos and Neal Bamford
Abstract The aim of the current research project is to investigate the low retention
rate in second-year undergraduate computing students at a London-based university.
The research was conducted in 2022, during the post-COVID-19 era, using learned optimism as a lens, and is compared with the 2021 study by Chrysikos et al. [1]. The main aim is to support the university's efforts to improve its retention rate, as overall dropout has been increasing in the last few years. The research methodology employed was
an exploratory investigation approach by using statistical modelling analysis in R
to predict behavioural patterns. The study aimed to discover any effect the CODE-
It initiative had on student grades and optimism scores, to quantify its success as
an initiative. The primary outcome of the data analysis indicates that the CODE-It
initiative had a positive impact on student optimism scores, particularly among black
ethnicity students. Additionally, a slight increase in optimism was observed among
the least optimistic students. The return to in-person interaction with classmates
and lecturers may have played a significant role in raising the minimum scores
compared to the 2021 study [1]. Nevertheless, many students continue to grapple
with the lasting effects of the post-pandemic era, particularly in matters of financial
hardship. Finally, of those students who did attend CODE-It, 85% indicated that they
felt it was a worthwhile exercise. Specifically, black ethnicity students had a higher
proportion of attendance and were no longer the student ethnicity group with the
lowest optimism score.
Keywords Learned optimism · Student retention · Computing · R programming · Quantitative research · Data analysis
A. Chrysikos (B)·N. Bamford
London Metropolitan University, 166-220 Holloway Road, London N7 8DB, UK
e-mail: a.chrysikos@londonmet.ac.uk
N. Bamford
e-mail: n.bamford@londonmet.ac.uk
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_5
1 Introduction
In 2021, Chrysikos and other authors conducted a study on student retention that
was published in 2023 [1]. The study focused on identifying the reasons for higher
than usual dropout rates among foundation and first-year undergraduate computing
students at a London-based university. To accomplish this, a survey was conducted
among these students to collect relevant data. This data was analyzed using R, a
statistical modelling language, to explore potential links between optimism levels
and retention. The overall conclusion was that students with a foreign qualification
were optimistic (comprising 31% of the students), while students with other or an
unknown qualification were mildly pessimistic (comprising 43% of the students).
Students with a Bachelor of Technology (B.Tech), higher education diploma or A
level qualification were generally more pessimistic (comprising 26% of the students),
especially if they were also of black ethnicity (comprising 5%), or were also not of
black ethnicity, aged under 34 and British (comprising 5% of the students).
To further identify factors affecting optimism, the authors conducted a similar
survey for the same group of students in 2022, with a specific focus on the black
ethnicity group, which had been identified as having the lowest optimism scores and
therefore faced the greatest risk of dropping out. Although the survey sections and
questions remained the same, a further section was included which asked the students if
they had been involved in an initiative run by the university studied known as CODE-
It. The study aimed to discover any effect the CODE-It initiative had on student grades
and optimism scores, to be able to quantify its success as an initiative and finally
make recommendations for a further study. CODE-It is a short programming training
course aiming to prepare students to solve real-world projects. Arranging students
in teams of at least three and no more than five gives them an opportunity to be creative
and innovative in solving real-world problems on a single theme.
2 Literature Review
Non-continuation in UK universities has been an issue over the past several years. In
2019/2020, the percentage of both young and mature students leaving HE fell
by 1.3% and 1.8%, respectively, from previously consistent yearly values [2]. The figures cover
UK students who, having not left within 50 days of commencement, did not continue
in HE after their first year, broken down by HE provider and academic year of entry. Similar data
is seen for both Scotland and Wales for young students and Scotland for mature
students, with a nominal increase in Wales. In general, retention rates in England
are comparatively favourable when compared with international institutions. A rate of
72% of students in 2021 is significantly above the international average of 39% for
bachelor’s degrees [3]. However, dropout rates among young undergraduates have
increased over the past fifteen years, only reducing slightly in 2019/20 [2].
Recent research conducted by [4] found that London has the highest
dropout rate of all English regions, and the capital struggles to keep students. Those
universities with a higher intake of black ethnicity students are more likely to see
students from disadvantaged backgrounds not complete their studies. However, more
selective higher education institutions have lower non-continuation rates for black
ethnicity students than white. Gender has also been seen to be a major factor in contin-
uation rates. Only binary gender data is available currently, which shows that comple-
tion rates for female students are 11% higher than males. Furthermore, London
universities have a high proportion of students from low socio-economic backgrounds
and from ethnic minorities, which partly explains the higher than average dropout
rates seen within the region. However, at the same time, students attending London
universities tend to come from areas with high university participation rates, and
students from such areas typically have lower dropout rates
[5].
There seems to be no evidence that dropout rates are linked to the standing or
academic success of an institute, as some universities with gold or silver awards
have dropout rates much higher than the benchmark. Investigating the various
demographics of those who dropped out, students deemed to be mature (an age
of 21 or over is categorised as mature in the UK) were twice as likely to drop out
of university as those students entering straight after A levels. Two main concepts
were identified as factors which could explain the likelihood of a student continuing
with their studies or dropping out, these being a sense of belonging and a level of
engagement [6].
In the most recent survey conducted by the Higher Education Policy Institute
(HEPI), data reveals that the majority of white students—61%—feel a positive sense
of belonging, while for other student groups, the sense of belonging is significantly
less evident: Asian 48%; Black 46%; Chinese 46%; mixed 53%; and Other 43% [7].
However, a new questionnaire on loneliness identified that higher education can be
a lonely place, with nearly one-in-four feeling lonely “all” or “most” of the time
[7]. It may not always be possible for a student to engage fully in university life in
a way that would not affect them academically. Two reasons for this are financial
and time constraints. One small study of institutions in London found that travel
or commuting time stayed a significant predictor of student progression or
continuation for England-domiciled full-time undergraduates at three of the six London
institutions participating in the study [3]. In the case of mature students, they may
have been out of education for some time and might also have work and home life
to balance [8].
3 Methodology
Quantitative analysis involves the systematic collection of data and its statistical,
mathematical, and computational treatment to obtain results. Numerical
data is analysed using statistical techniques to answer questions such as how,
how many, how much, what, where, when and who
[9]. The quantitative data is then analysed and modelled using the R programming
language in RStudio. The purpose of adopting a quantitative approach is to create
and implement statistical models, theories and hypotheses related to the subject of
research. A quantitative approach is used to bring out a conclusive result for the
objective.
The data collection method employed was a questionnaire in the form of a survey.
The data collected through the survey was then explored to discover and summarise
the characteristics of the data (see also Sect. 4). An exploratory analysis was then
performed, specifically regression tree analysis, with scatter and box plots used
to show how various aspects of
the data relate to each other. In the current research, the outcome was the optimism
score as the target variable and the predictive variables were split into two feature
sets. The first feature set consisted of attendance of CODE-It, gender, age, ethnicity,
disability, full or part-time student and level of study. The second feature set consisted
of attendance of CODE-It, gender, age, ethnicity, disability, full or part-time student,
level of study and average component mark.
4 Data Collection
The data under analysis was collected in the form of an online survey from the
computing students who participated in the 2021 study [1]. The survey was structured
in the three following sections.
Section 1 Respondent Consent Consisting of seven questions, this section was
concerned with making the student aware of the nature of the survey and seeking
their permission to use the data in the research in line with the General Data Protection
Regulation (GDPR).
Section 2 Optimism Questions Adapted from the “Learned Optimism” survey [10],
this section consisted of 30 questions, whose responses were applied to the Optimism
Test Scoring Sheet and interpreted with its guide. From the survey data, an overall score
was obtained using the optimism test scoring sheet values, specifically:
• A student’s pessimism score when unpleasant events happen,
• Optimism score when good events happen,
• Total optimism score and
• Hope score.
The pessimism score is the score when unpleasant events happen and is the total
of questions answered with the I (5/30), D (5/30) or F (5/30) option. There was
a maximum pessimism score of 15 available across the 30 questions on the
questionnaire. The optimism score is the score when good events happen and was the
total of questions answered with the H (5/30), E (5/30) or B (5/30) option. There
was a maximum optimism score of 15 available across the 30 questions on the
questionnaire. Total optimism was calculated as the optimism score minus the
pessimism score.
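As an illustration, the scoring scheme described above can be sketched as follows. The study's actual analysis was carried out in R; this Python sketch only reproduces the option letters and the optimism-minus-pessimism rule stated in the text, and any answer lists are made up.

```python
# Learned Optimism scoring as described above. The option letters and the
# total = optimism - pessimism rule come from the text; the rest is illustrative.
PESSIMISM_OPTIONS = {"I", "D", "F"}   # scored when unpleasant events happen
OPTIMISM_OPTIONS = {"H", "E", "B"}    # scored when good events happen

def score_respondent(answers):
    """answers: the 30 single-letter option codes from one survey response."""
    pessimism = sum(1 for a in answers if a in PESSIMISM_OPTIONS)
    optimism = sum(1 for a in answers if a in OPTIMISM_OPTIONS)
    return {
        "pessimism": pessimism,        # maximum 15 across the questionnaire
        "optimism": optimism,          # maximum 15 across the questionnaire
        "total_optimism": optimism - pessimism,
    }
```

For instance, a respondent answering I to five questions, H to ten and a non-scoring option to the remaining fifteen would receive a pessimism score of 5, an optimism score of 10 and a total optimism score of 5.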
Section 3 Final Question A new addition to the current study’s survey was the
inclusion of a question asking students whether they participated in the CODE-It
initiative and, if so, whether it was a positive or negative experience for them.
As this study is in its second year and includes participation in the CODE-It initia-
tive, the results from the current study were combined with the students’ current
average component marks and the average module results from the 2021 study
[1]. This merging of data is a common practice in research to compare multiple
variables from different sources and domains.
5 Data Analysis and Discussion
After the data merging was completed, some transformations were made to aid
analysis and some cases were filtered out, either due to missing data or because student consent
was not provided. From the original 74 cases, 7 cases did not give consent, leaving
67 cases which could be used for analysis.
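The merge-and-filter step described above might look like the following. This is a hedged Python sketch with synthetic toy data: the study's analysis was in R, and the field names (`student_id`, `consent`, `avg_module_score`) are hypothetical illustrations, not the survey's actual fields.

```python
# Toy 2022 survey responses and 2021 module marks (synthetic data).
survey_2022 = [
    {"student_id": 1, "consent": "yes", "total_optimism": 2},
    {"student_id": 2, "consent": "yes", "total_optimism": -1},
    {"student_id": 3, "consent": "no",  "total_optimism": 0},
    {"student_id": 4, "consent": "yes", "total_optimism": 3},
]
marks_2021 = {1: 62, 2: 55, 3: 48, 4: 71}  # average module score by student

# Join the 2021 marks onto the 2022 survey, then drop non-consenting cases,
# mirroring the reduction from 74 collected cases to 67 analysable ones.
merged = [dict(row, avg_module_score=marks_2021.get(row["student_id"]))
          for row in survey_2022]
analysable = [row for row in merged if row["consent"] == "yes"]
print(len(analysable))  # 3 of the 4 toy cases remain
```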
The 2021 study highlighted four recommendations for further analysis [1]:
(1) Contrast in optimism of students with foreign qualifications and UK qualifica-
tions,
(2) Exploration of factors causing black ethnicity students to be less optimistic,
(3) Expand the research to other universities, and
(4) Compare year-on-year of student satisfaction levels from the National Student
Survey.
Items 1 and 3 remain an ambition and should be considered in future research.
Item 2 forms the basis of the main analysis of this study. Item 4 is discussed in the
following section.
5.1 Exploration of Factors Causing Black Ethnicity Students
to Be Less Optimistic
Carrying on from the 2021 study, data comparisons were conducted to show any
major similarities or differences in the data distribution. All data variables which
exist in both years’ studies were included, and in the case of the current study the
extra variables of average component mark (how the student is currently progressing
in their studies), average module score (how the student performed in 2021) and if
they attended the CODE-It initiative were also included.
Comparative Analysis of Year-on-Year Data
Ethnicity Comparing ethnicity between 2021 and 2022, significant differences
occurred in White Ethnicity (~8% increase) and Other Ethnicity (~6% increase).
Gender Comparing gender between 2021 and 2022, no major differences were seen
in percentages.
Disability Comparing disability between 2021 and 2022, a 6% rise was seen in those
with some form of disability.
Component Mark Module marks were compared using average component mark
and average module score. On average, module score was down by 8%, due to the
natural increase in difficulty of study between first and second years at university,
commonly referred to as “Second Year Slump” [11]. The next survey should include
a set of questions asking if the student found the next level of study more difficult
than the previous year.
Optimism Optimism can be seen to have improved by 3 points at the minimum level,
with a drop of 1 point at the maximum. The mean was the same and the median
was within 1 point. Therefore, it can be summarised that optimism did increase.
This might be explained by the post-COVID-19 pandemic effect of the return to
in-class teaching rather than online. Other contributing factors might include
the CODE-It initiative, which gave students the chance to collaborate on fun team-based
activities; students being able to interact with their classmates in lectures and
tutorials; and face-to-face, in-person time with lecturers. In addition, speakers
from relevant industry backgrounds of all ethnicity types were invited to talk to the
students about a range of topics including interview tips, C.V. preparation, how to
achieve higher grades and projects in industry within relevant fields.
5.2 Analysis of Optimism Score by Feature Grouping
Further analysis was conducted on the optimism scores to assess how various variable
groupings, including ethnicity, gender, disability, and component marks, contributed
to the results.
Optimism Grouped by Ethnicity Analysis based on ethnicity revealed that White
Ethnicity students exhibited both the minimum and maximum optimism scores but,
on average, tended to be pessimistic. Students of Black Ethnicity, while still showing
average pessimism, had improved their minimum scores compared to the previous
year (comparing 2021 and 2022). On the other hand, students of Asian Ethnicity
tended to be pessimistic on average, while those of Other Ethnicities demonstrated
average levels of optimism.
Optimism Grouped by Gender Gender-based grouping significantly influenced
the mean and median values of the optimism score, although there were no substantial
differences at the extremes.
Optimism Grouped by Disability There was no significant difference in optimism
scores between students with disabilities and those without.
Optimism Grouped by Component Mark Grouping optimism by binned average
component mark showed that those students with marks < 50 (21%) had the lowest
mean and median optimism score. Those with marks ≥ 80 (7%) had the highest
mean and median score due to feeling positive about their academic achievement
obtained.
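The binning of average component marks before grouping optimism scores can be sketched as follows. This is a hedged Python sketch: the study's analysis was in R, the bin edges are assumptions based on the < 50 and ≥ 80 bands mentioned in the text, and the (mark, optimism) pairs are synthetic.

```python
# Assumed mark bands (only the <50 and >=80 bands are stated in the text;
# the intermediate edges are illustrative guesses).
def mark_band(mark):
    if mark < 50:
        return "<50"
    if mark < 60:
        return "50-59"
    if mark < 70:
        return "60-69"
    if mark < 80:
        return "70-79"
    return ">=80"

# Toy (mark, optimism) pairs; group optimism scores by mark band.
records = [(34, -1.0), (47, -0.5), (55, 0.4), (68, 1.2), (74, 1.9), (88, 2.6)]
by_band = {}
for mark, optimism in records:
    by_band.setdefault(mark_band(mark), []).append(optimism)

# Mean optimism per band: lower mark bands show lower mean optimism here,
# echoing the pattern described in the text.
means = {band: sum(v) / len(v) for band, v in by_band.items()}
print(means)
```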
In this section’s analysis, we observed an 8% increase in White Ethnicity students
and a 6% decrease in Other Ethnicity compared to the 2021 study. For Black Ethnicity,
Asian Ethnicity, and those with unknown ethnicity, changes were below 5%. Gender
distribution remained similar to 2021, with changes of less than 5%. Disability saw
a 6% increase among students with a disability compared to 2021, while changes
for those without disabilities were below 5%. Median and mean component grades
decreased by 8 to 9%, which is expected given the increased difficulty between year
one and year two undergraduate courses. In terms of optimism, we observed a 3.00
point increase at the lowest level (from – 8.00 to – 5.00) and a 1.00 point decrease at the
highest level (from 9.00 to 8.00). The mean remained unchanged, while the median
decreased by 1.00 point (from 2.00 to 1.00). When grouping optimism scores by
ethnicity, we found that Other Ethnicity had the highest mean score at 1.25, followed
by Black Ethnicity (0.93), White Ethnicity (0.42), and Asian Ethnicity (0.40). White
Ethnicity showed the widest range of optimism scores, ranging from – 5.00 to 8.00.
Grouping optimism scores by gender, we noted that females were, on average, more
optimistic (1.81) than males (0.02), with minimum and maximum values within 1.00
point of each other. However, there were no significant differences when grouping
optimism by disability. Finally, when grouping optimism by binned average
component marks, students with an average below 50 had the lowest maximum score (3.00),
the second-highest minimum score (– 4.00), and the lowest mean score (– 1.00).
5.3 Regression Tree Analysis of Optimism Scores
Two feature sets were used in regression tree analysis. The first set excluded the
average component mark and the second included it, to see what effect it had. Compared to
the 2021 study [1] the variables qualification and work experience could not be used
as they were not recorded on the survey. In the current study, the following predictor
variables were used: attended CODE-It, gender (M/F), age, ethnicity, disability (Y/
N), full or part-time and study level (degree or foundation).
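To illustrate how a regression tree selects among such predictors, the following minimal Python sketch finds the single binary split that most reduces the squared error of the optimism score. The study's analysis used R; the student records and the 0/1 encodings of the predictors below are synthetic assumptions, kept only to show the split-selection mechanics.

```python
def sse(values):
    """Sum of squared errors around the mean of a candidate leaf."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(rows, target, features):
    """Pick the binary feature whose 0/1 split most reduces SSE of target."""
    base = sse([r[target] for r in rows])
    best = None
    for f in features:
        left = [r[target] for r in rows if r[f] == 0]
        right = [r[target] for r in rows if r[f] == 1]
        gain = base - (sse(left) + sse(right))
        if best is None or gain > best[1]:
            best = (f, gain)
    return best

# Synthetic student records (encodings and values are illustrative only).
students = [
    {"attended_code_it": 1, "gender_female": 1, "foundation_year": 1, "optimism": 3.1},
    {"attended_code_it": 1, "gender_female": 0, "foundation_year": 0, "optimism": 1.6},
    {"attended_code_it": 0, "gender_female": 0, "foundation_year": 0, "optimism": 0.0},
    {"attended_code_it": 0, "gender_female": 0, "foundation_year": 1, "optimism": -0.9},
    {"attended_code_it": 1, "gender_female": 1, "foundation_year": 0, "optimism": 1.8},
    {"attended_code_it": 0, "gender_female": 1, "foundation_year": 1, "optimism": 2.0},
]

feature, gain = best_split(students, "optimism",
                           ["attended_code_it", "gender_female", "foundation_year"])
print(feature)
```

A full regression tree simply applies this split search recursively within each resulting group, and a feature's importance reflects how much error reduction its splits contribute.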
5.3.1 Feature Set 1
A regression tree analysis, using the previously mentioned predictor variables, was
conducted and produced variable importance rankings. Specifically, it was observed that the
10% of Black Ethnicity students who were at or below an optimism score of 1.50
in 2021 were no longer there, and the lowest score of – 0.89 (29%) comprised
White Ethnicity students.
The next group with a score of 0.29 (10%) was a combination of Asian, Black
and Other Ethnicities. In both cases this applied to male students who attended a
foundation year prior to entry.
The next set of scores, 0.00 and 1.60, were males of all ethnicities who did not
partake in a foundation year. This group was evenly split at 15% each: those who
attended the CODE-It initiative had an optimism score of 1.60, compared to 0.00 for
those who did not. At 1.60, the attendees were moving away from pessimism towards
an average optimism score, showing a clear positive effect of attending CODE-It on
optimism scores, regardless of demographic factors.
The final set of scores (31%) are for females split by study level. For those who
did not attend a foundation year (19%), the score was pessimistic in contrast to those
who did attend a foundation year (12%) with the highest optimism score of 3.10
which was just above the high average level.
It can, therefore, be stated that, compared to the 2021 study, Black Ethnicity
students had improved their level of optimism to 2.90, near the 2021 maximum
of 3.00. White Ethnicity male students (28%) who attended a foundation
year had the lowest optimism score of – 0.89. For male students who attended CODE-
It, optimism scores were improved by 1.60 points. Foundation year female students
were 3 points more optimistic than the equivalent non-foundation students.
5.3.2 Feature Set 2
Feature Set 2 included the average component mark, but it did not produce significant
variable importance. Specifically, it was observed that when average component mark
is added as an explanatory variable, there were two distinct groups: students with
an average component mark of < 51 and students with an average component mark
of ≥ 51. For marks < 51 (27%), students were pessimistic at 0.93 regardless
of any other variable. Those students with an average module score of ≥ 51 were
further split into two groups, male and female. The male group was split: 16%,
with a score of 0.56, were those who attended a foundation year. For those male
students who did not attend a foundation year (27%), their score was 1.20, which was
heading towards an average optimism score of 2.00. The final distinct group (30%)
and the most optimistic by one whole point with an average optimism score of 2.20
were females. It can also be observed that those groups at most risk when taking into
account optimism as an indicator are students with average component marks < 51
(27%) and male students who attended a foundation year with an average component
mark of ≥ 51 (16%).
5.4 Analysis of Attendance of CODE-It
Although the variable importance of attending CODE-It showed relevance in the
Feature Set 1 of the regression tree analysis, it did not show significant relevance in
the Feature Set 2. Therefore, a separate analysis of its effect on the average increase
of marks was conducted. The attendance based on ethnicity was also explored. This
time the results were of significant relevance. Specifically, the results showed that
26 students attended CODE-It. For 11, their average module mark increased by a
median of 10 and an average of 11 points; however, 15 students saw a drop in their
average module mark by a median of 12 and average of 16. This contrasts with 30
students who did not attend CODE-It where 10 saw a median increase of 9 and an
average increase of 32. Finally, there were 20 students who did not participate in
CODE-It and saw a median decrease of 17 and an average decrease of 56. Of those
26 students who attended CODE-It and graded their experience (positive or negative),
the majority, 22 (85%), thought it was a positive experience compared to 4 (15%) who
did not.
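The comparison of median mark increases and decreases by attendance group can be sketched as follows. This is a toy Python sketch: the study's analysis was in R, and the mark changes below are synthetic values, not the survey's data.

```python
import statistics

# Year-on-year change in average mark per student, by CODE-It attendance
# (synthetic toy data for illustration only).
mark_change = {
    "attended":     [10, 12, -5, 9, -14, 11],
    "not_attended": [9, -20, -15, 8, -30, 7],
}

# Summarise increases and decreases separately within each group, as the
# analysis above does when comparing attendees with non-attendees.
summary = {}
for group, changes in mark_change.items():
    increases = [c for c in changes if c > 0]
    decreases = [c for c in changes if c < 0]
    summary[group] = {
        "median_increase": statistics.median(increases),
        "median_decrease": statistics.median(decreases),
    }
print(summary)
```

On this toy data the non-attending group shows the deeper median decrease, the same direction of effect the text reports.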
Analysing the data and grouping by ethnicity showed that the largest participating
group of students by ethnicity based on percentage of ethnic group were those who
identify as Black Ethnicity (60%), then White (42%), Asian (40%) and Other (25%).
This could in part explain the increase in optimism levels in that group and further
research should be conducted to ascertain a correlation. In addition, the data anal-
ysis by attendance of CODE-It by gender showed that 47% were female students,
while 43% were male. The attendance of CODE-It by study level showed a higher
attendance by foundation degree students (58%) compared to a 42% of degree-level
students.
All this information suggests that attending CODE-It had a positive effect on
the participating students’ grades, showing a slight increase in grades among
attendees and a less negative effect on grade reduction. Specifically, most students
(85%) who attended CODE-It thought it was a positive experience. Additionally, a
higher percentage of Black Ethnicity students attended CODE-It, which could be
a contributing factor to their increased optimism compared to the previous year.
Finally, there is no significant difference in attendance based on gender or study
level (both important variables in the regression tree analysis findings), being at 5%
in each case. From these results, it could be argued that CODE-It should be continued
as a worthwhile exercise, further refined and its effects studied in any similar future
studies. With the analysis completed in this section, the implications of the findings
are discussed in the following section.
6 Implications
The analysis and interpretation of the survey data revealed four prominent
implications, categorised into two related groups. The first group encompasses findings
consistent with those of the 2021 study [1], focusing on the ethnicity group with
the lowest average optimism scores and the optimism scores of the entire student
population.
Ethnicity of Students with the Lowest Optimism Score The student ethnicity
group with the lowest optimism score in the 2021 analysis, those of Black Ethnicity,
were no longer the lowest ethnicity group in 2022. The ethnic group with the lowest
optimism score are now those students who identify as White Ethnicity. The increase in
optimism of the Black Ethnicity students may be attributable to the higher proportion
of that group of student’s participation (60% attendance) in the CODE-It initiative
compared to the White Ethnicity group of students (42% attendance).
Slight Increase in the Lower Optimism Score Year-on-Year On average, opti-
mism has increased by three points at the lower end, decreased by only one point,
and remained relatively unchanged in both the mean and median in both years. The
return to in-person interaction with classmates and lecturers could be a significant
factor in raising the minimum score compared to the 2021 study. However, there is
still a very real post-pandemic effect being experienced by many students, especially
around matters of hardship and finance [8]. As it was possible to observe the effects
of the average component mark in the regression tree analysis, a second feature set
including that data (for cases where it was available) was run. It found that the least
optimistic students were those with an average component mark of ≤ 50. For each feature set, females
remained the most optimistic.
The second group of implications became apparent through the inclusion of new
data related to attendance and the CODE-It experience, as well as the impact of
the natural phenomena affecting a statistically significant number of second-year
undergraduate students—commonly referred to as the ‘Second Year Slump’—in
comparison with their first-year experiences.
Decrease in Median Average Component Score Year-on-Year in Line
with Recognised “Second Year Slump” With the addition of the average compo-
nent mark (2022) and average module result (2021), a median and mean drop of 8%
was observed. The so-called “Second Year Slump” is a phenomenon researched in
the U.S., but recognised as an international experience [11]. Students are observed
to become generally less satisfied with their university experience and their priori-
ties change. They also reported feeling unprepared for the overall workload of the
second year, in particular the volume of assessments. This is something which should
be factored into and observed in the next survey.
Quantifiable Positive Effect of the CODE-It Initiative on Average Component
Score and Optimism Levels It was possible to analyse the effects of the CODE-It
initiative against the survey data collected. This was done by comparing average
increase and decrease in component score and showed that those who attended
CODE-It saw less of a decrease (by 5 points) and a slight increase (by 1 point)
compared to those who did not attend. Grouping the CODE-It attendance by ethnicity
showed that those students who identified as Black Ethnicity had a higher proportion
of attendance (60%) and were no longer the student ethnicity group with the lowest
optimism score. Finally, for those students who did attend CODE-It, 85% indicated
that they felt it was a worthwhile exercise. Therefore, it is recommended to continue
running CODE-It while improving and continuously measuring its effects.
7 Limitations
Several limitations were encountered in conducting the research. One would have
had no effect on the analysis, although it would have been interesting to have the
additional data; two others, it could be argued, could have had a small effect on the
analysis. Regardless, interesting and relevant conclusions were obtained for this study.
The first limitation was the inability to roll out the survey to multiple university
schools.
The second limitation was the level of engagement by students. Although all
the students who continued to the second year of their academic studies were asked
to complete the survey, it was not possible to obtain feedback from all of them.
However, a statistically significant number of students did complete the survey.
The third limitation was the lack of data from students who did not continue
their studies into the second year because they dropped out. Although these
students were contacted, none of them responded. It is difficult to postulate the
reason for this; therefore, a better
mechanism for obtaining feedback in such cases might be sought in order to gather
as much relevant data as possible.
The fourth limitation was the exclusion of variables in regression tree analysis.
Specifically, not all the variables used in the regression tree analysis of the 2021 study
were available in the current study. These were: qualification and work experience.
However, this seemed not to have a detrimental effect on the Feature Set 1 analysis,
which gave results comparable to those of the 2021 study [1].
8 Conclusion
Students in the UK continue to experience the lingering effects of the global
pandemic. Despite the return to in-person and face-to-face teaching, optimism levels
have not significantly increased overall. However, there was a slight increase at the
lower end. In the second-year study, students identifying as Black Ethnicity moved
from the lowest optimism group to the second-lowest, with those identifying as White
Ethnicity now in the lowest optimism group. This change may be partly attributed to
a higher percentage of Black Ethnicity students participating in the CODE-It initia-
tive. Furthermore, average module scores have slightly decreased at the overall mean
level, which is a phenomenon known as the “Second Year Slump” [11]. It is essential
to monitor this trend, as the average component mark is a significant variable in the
regression tree analysis.
9 Recommendations for Further Research
The first recommendation for further research relates to the evaluation of the
CODE-It initiative. It is highly recommended to further investigate this initiative,
which has demonstrated a positive impact on student grades and optimism
levels, particularly among Black Ethnicity students. Given that 85% of students who
attended the initiative reported a positive experience, conducting more surveys to
gather additional student feedback would be beneficial.
The second recommendation is related to the inclusion of predictive
factors. Consider incorporating predictive factors identified in previous studies into
future research. These factors may include the impact of commuting [12], student
financial hardship [8], and the natural increase in academic difficulty between the
first and second years of undergraduate studies [11]. Surveys that capture students’
academic experiences from year 1 to year 2 could provide valuable insights.
A third recommendation is to study the effect of ethnicity groups as a contributing
factor to levels of optimism. Conduct a separate study to examine the influence of
ethnicity on students’ optimism levels. Specifically, investigate how different ethnic
backgrounds contribute to variations in optimism levels among students and explore
strategies to enhance these levels.
Finally, conducting comparative analysis with other universities is recom-
mended. To gain a broader perspective on computing students’ well-being and opti-
mism levels, consider including other universities in future studies. Comparative
analyses could provide insights into the effectiveness of initiatives like CODE-It and
help identify best practices to improve student welfare and optimism levels across
different institutions.
References
1. Chrysikos A, Ravi I, Stasinopoulos D, Rigby R, Catterall S (2023) Retention of computing
students in a London-based university during the covid-19 pandemic using learned optimism
as a lens: a statistical analysis in R. In: Arai K (ed) Intelligent computing. SAI 2023. Lecture
notes in networks and systems, vol 711. Springer, Cham. https://doi.org/10.1007/978-3-031-
37717-4_16
2. HESA (2022) Non-continuation summary: UK performance indicators. Retrieved from https://www.hesa.ac.uk/data-and-analysis/performance-indicators/non-continuation-summary
3. Hillman N (2021) A short guide to non-continuation in UK universities. In: HEPI.
Retrieved from https://www.hepi.ac.uk/2021/01/07/a-short-guide-to-non-continuation-in-uk-
universities/
4. Keohane N (2017) On course for success? Student retention at university. In: VOCEDplus, the
international tertiary education and research database. Social Market Foundation. Retrieved
from https://www.voced.edu.au/content/ngv:77230
5. Priestley M, Hall A, Wilbraham SJ, Mistry V, Hughes G, Spanner L (2022) Student perceptions
and proposals for promoting wellbeing through social relationships at university. J Furth High
Educ 46(9):1243–1256
6. Vytniorgu R (2022) To encourage a sense of belonging among students, avoid excessive focus
on identity differences and increase engagement with local communities. In: HEPI. Retrieved
from https://www.hepi.ac.uk/2022/11/17/to-encourage-a-sense-of-belonging-among-stu
dents-avoid-excessive-focus-on-identity-differences-and-increase-engagement-with-local-
communities/
7. HEPI (2022) Students signal significant bounce-back in the value of their studies.
Retrieved from https://www.hepi.ac.uk/2022/06/09/students-signal-significant-bounce-back-
in-the-value-of-their-studies/
8. Shearing H (2022) Hardship funding for students doubled last year. In: BBC News. BBC.
Retrieved from https://www.bbc.com/news/education-61883656
9. Apuke OD (2017) Quantitative research methods: a synopsis approach. Arab J Bus Manage
Rev (Kuwait Chapter) 6(10):40–47. https://doi.org/10.12816/0040336
10. Seligman MEP (2018) Learned optimism: how to change your mind and your life. Nicholas
Brealey Publishing
11. Milson C (2015) Disengaged and overwhelmed: why do second year students
underperform? In: The Guardian. Guardian News and Media Limited. Retrieved
from https://www.theguardian.com/higher-education-network/2015/feb/16/disengaged-and-
overwhelmed-why-do-second-year-students-underperform
12. Hillman N (2021) A short guide to non-continuation in UK universities. Higher Education
Policy Institute
Alzheimer’s Disease Knowledge Graph
Based on Ontology and Neo4j Graph
Database
Ivaylo Spasov, Sophia Lazarova, and Dessislava Petrova-Antonova
Abstract Recently, a massive amount of data has become available for research on
Alzheimer’s disease. However, the data entities are stored with different names
at different levels of granularity and in various formats. Thus, a comprehensive
knowledge graph is needed to facilitate the development of analytical models related
to Alzheimer’s disease. In our previous work, we created the Alzheimer’s disease
Ontology for Diagnosis and Preclinical Classification (AD-DPC), a domain ontology
incorporating the knowledge of medical experts in an understandable way for individ-
uals with no medical background. This paper extends our work by employing Neo4j
graph database technology and AD-DPC to build a domain-specific knowledge graph.
Data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) is used to popu-
late the knowledge graph and to validate its data retrieval and visualisation capabili-
ties. The knowledge graph contains 2996 diagnoses, 154,953 psychometric findings,
24,102 blood findings, 12,471 CSF findings, and 14,703 brain imaging findings from
MRI or PET scanning. The nodes were further annotated with 259,260 labels and
673,325 relations based on the AD-DPC ontology. The results demonstrate the efficacy
of using ontologies as a basis for the semantic modelling of graph databases, which
further offer straightforward and intuitive data querying and visualisation support.
Keywords Alzheimer's disease data modelling · Knowledge graphs · Neo4j
I. Spasov
Rila Solutions, Sofia, Bulgaria
S. Lazarova ·D. Petrova-Antonova (B)
GATE Institute, Sofia University, St. Kliment Ohridski, Sofia, Bulgaria
e-mail: dessislava.petrova@gate-ai.eu
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_6
1 Introduction
Dementia puts an ever-growing physical, emotional, and financial strain on patients
and their families [13]. According to the World Health Organisation (WHO),
55 million people worldwide live with dementia, a figure expected to reach a staggering
139 million by 2050 [4]. Alzheimer's disease (AD) is the most common type of
dementia, accounting for 60–70% of all dementia cases [2]. Numerous attempts have
been made to explain the causes of the disease, but none of the generated hypotheses
is universally accepted. Thus, the underlying causes of AD’s pathological changes
remain unknown [5]. The lack of knowledge about AD’s root causes contributes to
the complexity of the disease and renders it neither curable nor preventable [6].
The eagerness to find a cure and slow down the progression of the disease has
recently led to an intense interest in the applications of Big Data in AD research. The
AD research community produces large amounts of data, including patient profiles,
anamnestic data, genomic data, neuroimaging and molecular biomarkers, and cogni-
tive and neuropsychiatric assessments. This data is expanded with data from mobile
devices such as wearables and smartphones [7]. Leveraging the collection, aggrega-
tion, and analysis of these large data volumes changes the landscape of AD research
by shedding light on its aetiology or contributing to timelier diagnosis and preven-
tion. For example, machine learning and statistical methods are commonly used to
develop AD screening tools, early detection algorithms and decision-support tools
based on brain imaging data [8], combinations of non-imaging features [9], speech
patterns [10], and even novel biomarkers [11]. However, despite the vast amount
of data available, these data are often scattered and quite heterogeneous regarding
organisation and formatting [12]. Thus, the success of advanced analytics heavily
depends on the adequate standardisation and interoperability of medical data. An
ontology-based approach is needed to explicitly define the semantics of the domain
and map data across heterogeneous data sources [13].
Knowledge graphs (KGs) are heterogeneous knowledge bases modelled through
ontologies and graph databases. They store data in a semantically structured manner,
support drawing new conclusions through reasoning, and provide context that facilitates
machine learning (ML) models [14]. Prominent examples of KGs of biological
data are the Monarch Initiative [15] and Pheno4J [16]. The Monarch Initiative is
a large-scale endeavour which uses an ontology-based strategy in combination with
a graph database to integrate massive amounts of heterogeneous genotype–pheno-
type data and reveal complex relationships within it. Similarly, Pheno4J uses the
Human Phenotype Ontology (HPO) to build a Java-based solution that loads anno-
tated genetic variants and well-phenotyped patients into the Neo4j database [17].
There are several implementations of KGs for AD focused on extracting and organ-
ising knowledge from scientific articles [18,19], identifying candidates for drug
repurposing [20,21], studying depression as a risk factor for AD [22], representing
knowledge for the nonpharmacological treatment of psychotic symptoms in dementia
[23], and visualisation of dementia risk factors [24]. However, to the best of our
knowledge, there are no existing KGs focused on AD diagnosis and preclinical
classification.
In our previous work, we created Alzheimer’s disease Ontology for Diagnosis and
Preclinical Classification (AD-DPC) [25]. It incorporates the knowledge of domain
experts while keeping it understandable for individuals outside the medical domain.
It aims to facilitate knowledge exchange between medical and technological experts
in interdisciplinary teams. This paper proposes a KG modelled through the AD-DPC
and a Neo4j graph database. It is populated with data from the Alzheimer’s Disease
Neuroimaging Initiative (ADNI)1. Its utility is validated regarding data retrieval and
data visualisation. The main contributions of the paper are as follows:
– Integration of various datasets from ADNI based on the AD-DPC ontology in a
common data repository, enabling data analytics and development of ML models.
– Development of an Alzheimer's disease KG supporting semantic data interoperability
and knowledge sharing.
– Implementation of a fully operational Neo4j graph database, compliant with the
AD-DPC ontology and providing intuitive data interaction and visualisation.
The rest of the paper is organised as follows. Section 2 describes data preparation.
Section 3 presents data modelling in the Neo4j database. Section 4 shows sample
queries written in Cypher and corresponding results returned as graphs. Finally,
Sect. 5 discusses the results and concludes the paper.
2 Data Preparation
Alzheimer's Disease Neuroimaging Initiative (ADNI)2 is a multicentre longitudinal
study aiming to understand the changes occurring during the progression of
Alzheimer’s disease. ADNI offers access to large amounts of data collected from
cognitively normal (CN) subjects, subjects with mild cognitive impairment (MCI)
and subjects with Alzheimer’s disease (AD). The repository contains demographic,
clinical, neuropsychological, neuroimaging, and biochemical biomarker data. For
our work, we used the data corresponding to the concepts outlined by the medical
experts in AD-DPC, containing 16,227 entries of about 2404 participants.
The demographic and anamnestic data includes age, gender, years of education,
family history of dementia, blood pressure, and body mass index (BMI). Our data
sample contains longitudinal data. Therefore, each entry has a timestamp encoded
as the number of months since the baseline visit. The baseline corresponds to ‘0’,
6 months after the baseline corresponds to '06', and so on. Follow-up visits were
conducted approximately every 6 months.

1 Data used in preparation of this article were obtained from the Alzheimer's disease Neuroimaging
Initiative (ADNI) database (https://adni.loni.usc.edu/). As such, the investigators within the ADNI
contributed to the design and implementation of ADNI and/or provided data but did not participate
in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://
adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf.
2 See footnote 1.

Longitudinal results from 12 neuropsychological assessments are considered,
namely Mini-Mental State Examination (MMSE),
Montreal Cognitive Assessment (MoCA), Alzheimer’s Disease Assessment Scale
(ADAS), Clinical Dementia Rating (CDR), Functional Activities Questionnaire
(FAQ), Rey Auditory Verbal Learning Test (RAVLT), clock drawing and copying
task, Boston naming test (BNT), American National Adult Reading Test (ANART),
verbal fluency task, logical memory task (delayed and immediate recall). Cere-
brospinal fluid (CSF) assessments and blood plasma biomarkers are also available.
In particular, we included CSF concentration of amyloid-beta 42 (Aβ42), total tau
(t-tau), and phosphorylated tau (p-tau) as well as plasma concentrations of p-tau181
and neurofilament light (NfL). The included APOE4 status is a marker of genetic
predisposition to developing AD. It is binary encoded, with 1 designating that the
participant carries at least one copy of the ε4 allele.
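The visit timestamp convention described above (baseline '0', '06' for six months after baseline, and so on) can be sketched with a small helper. This is our illustration, not code from the paper: the two-digit zero-padding is an assumption inferred from the '0'/'06' examples, and ADNI's actual visit codes may differ.

```python
def visit_code(months_since_baseline: int) -> str:
    """Encode a visit timestamp in the convention described in the text:
    the baseline visit is '0'; later visits are (assumed) zero-padded
    month counts such as '06', '12', ..., '108'."""
    if months_since_baseline < 0:
        raise ValueError("months since baseline cannot be negative")
    if months_since_baseline == 0:
        return "0"
    return f"{months_since_baseline:02d}"

# Follow-up visits occur roughly every 6 months:
codes = [visit_code(m) for m in range(0, 19, 6)]  # ['0', '06', '12', '18']
```

Encoding the timestamp as a string rather than an integer mirrors how the codes appear as node properties in the queries later in the paper (e.g. `{months: '72'}`).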
Finally, we included results from fluorodeoxyglucose-positron emission tomog-
raphy (FDG-PET) and florbetapir-PET (AV45-PET) imaging along with volumetric
data extracted from magnetic resonance imaging (MRI) images. Brain FDG-PET is
commonly used to estimate the distribution of neural injury, and AV45-PET is used
to visualise accumulations of Aβ42 plaques in the brain. This version of the KG does
not contain actual brain images, only data extracted from them. However, storing
brain images remains to be implemented in future work.
3 Modelling Data in Neo4j
To map the ontology to a database that is suitable for storing real-world medical data,
we outlined three generic groups of concepts that are represented in the graph: (1)
Participant data (demographic and anamnestic); (2) Clinical findings (results from
tests and assessments); and (3) Diagnosis.
These generic groups were enriched with attributes and relations from the
ontology. Corresponding timestamps (defined as Zero-Dimensional Temporal
Regions) were added, along with assessment definitions, result interpretation scales,
and diagnostic process descriptions. The result interpretation scales were modelled
as follows. Each laboratory result can be treated as either "positive", "negative", or
"invalid" with respect to the volume of a target biochemical compound within a
sample. Therefore, a scale node was defined with the following properties: unit,
minimum value for a negative reading, maximum value for a negative reading,
minimum value for a positive reading, and maximum value for a positive reading.
This way, each test result can be evaluated against the respective scale and labelled
according to three criteria:
– if a score is between the min/max negative, the test outcome is negative;
– if a score is between the min/max positive, the test outcome is positive;
– every other value is treated as an invalid outcome.
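The three criteria above can be sketched in a few lines. This is a minimal illustration of the evaluation logic only; the property names on the scale node are ours, not the actual database schema, and the example scale values are invented.

```python
from dataclasses import dataclass

@dataclass
class Scale:
    """Result interpretation scale node, as described in the text.
    Property names here are illustrative, not the real schema."""
    unit: str
    min_negative: float
    max_negative: float
    min_positive: float
    max_positive: float

def evaluate(score: float, scale: Scale) -> str:
    """Label a laboratory result against its scale using the three criteria."""
    if scale.min_negative <= score <= scale.max_negative:
        return "negative"
    if scale.min_positive <= score <= scale.max_positive:
        return "positive"
    return "invalid"  # any value outside both ranges

# Example: a hypothetical CSF biomarker scale in pg/mL.
csf_scale = Scale("pg/mL", 0.0, 199.9, 200.0, 1500.0)
```

Keeping the bounds on a dedicated scale node (rather than hard-coding them per test) means new assays can be added to the graph without changing the evaluation logic.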
Each participant has a dedicated node (Participant) attributed by base information
such as participant ID, age, years of education and gender. Then each participant
Fig. 1 Database hierarchy. Each circle represents a node, and each arrow is a relation implemented
from the AD-DPC ontology
node has a dependent node (Participant File) containing all the clinical data available
for this person. The participant file contains constitutional data, 0 or more image find-
ings, psychometric findings, blood findings, CSF findings, and a diagnosis. Consti-
tutional data contains anamnestic data routinely collected during examinations, such
as patient history and BMI. The graph database follows the ontological structure
where each result from an assessment or test is treated as a “finding” produced by a
particular laboratory assay, psychometric test, or brain imaging procedure. Figure 1
shows the overall structure of the database.
To create and populate the Neo4j database, we used Neo4j's native Cypher Query
Language. The resulting database contained information about all 2404 participants.
The KG contains 2996 diagnoses, 154,953 psychometric findings, 24,102
blood findings, 12,471 CSF findings, and 14,703 brain imaging findings from MRI
or PET scanning. These nodes were annotated with 259,260 labels and 673,325
relations based on the AD-DPC ontology.
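The population scripts themselves are not shown in the paper. As a hedged sketch of the pattern (ours, not the authors' code), a loader might generate parameterised Cypher CREATE statements for the Participant/ParticipantFile pair from Fig. 1; the property names and the `HAS_FILE` relationship type are placeholders, since the real relation names come from the AD-DPC ontology.

```python
def create_participant_cypher(rid: str, age: int,
                              education_years: int, gender: str):
    """Build a parameterised Cypher statement plus its parameter map for one
    Participant node and its dependent ParticipantFile node (labels as in
    the database hierarchy above; property names are illustrative)."""
    query = (
        "CREATE (p:Participant {rid: $rid, age: $age, "
        "educationYears: $education_years, gender: $gender}) "
        "CREATE (pf:ParticipantFile {rid: $rid}) "
        "CREATE (p)-[:HAS_FILE]->(pf)"
    )
    params = {"rid": rid, "age": age,
              "education_years": education_years, "gender": gender}
    return query, params

# Hypothetical participant; values are invented for illustration.
query, params = create_participant_cypher("2", 74, 16, "F")
```

In practice such a statement would be executed through a Neo4j session (for example via the official Python driver's `session.run(query, **params)`); passing values as parameters rather than string-interpolating them avoids quoting problems and lets Neo4j cache the query plan.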
4 Results and Discussion
This section presents sample queries used to extract data from the Neo4j database.
Different scenarios are explored to show the longitudinal and multimodal nature of
the data. Each query is represented by its objective, syntax, and returned result.
The objective of the first query is to get all laboratory findings for a participant
with record identifier (RID) 2 that were logged 72 months after the baseline. Its
syntax is shown in the following listing:
MATCH (pf:ParticipantFile {rid: '2'})-[]-(p:Participant)
WITH pf, p
MATCH (t:ZeroDimensialTemporalRegion {months: '72'})-[]-(n:LaboratoryFinding)-[]-(pf)
RETURN *;
The execution of the query returns three records, each standing for a laboratory
finding for a participant with RID 2 logged at a visit 72 months after the baseline
visit. The results can be presented in a table (Fig. 2) or visualised in a graph (Fig. 3).
The objective of the second query is to return all laboratory findings for a patient
with record identifier (RID) 2 that were logged 72 months after the baseline:
MATCH (pf:ParticipantFile {rid: '2'})-[]-(p:Participant)
WITH pf, p
MATCH (t:ZeroDimensialTemporalRegion {months: '72'})-[]-(n:LaboratoryFinding)-[]-(pf)
RETURN *;
The laboratory findings for the corresponding patient are visualised in Fig. 4.
The objective of the third query is to list all BMI measures with the corresponding
time region for the participant with RID 2:
MATCH (pf:ParticipantFile {rid: '2'})-[]-(p:Participant)
WITH pf, p
MATCH (t:ZeroDimensialTemporalRegion)-[]-(n:BMI)-[]-(pf)
RETURN *;

Fig. 2 Table representation of the results from the first query

Fig. 3 Graph visualisation of the results from the first query

Fig. 4 Graph visualisation of the results from the second query
The execution of the query returns 11 records, each standing for a BMI measurement
for the participant with RID 2. Each record originates from a different visit. The
timestamps are encoded as the number of months since the baseline visit; thus, 108
refers to 108 months after baseline (Fig. 5).
The inherited semantic structure is a natural consequence of building the presented
KG on top of an ontology. As the results show, this structure offers the benefit of
intuitive querying and visually supported data retrieval. Nevertheless, such an access
point is limiting because it requires knowledge about databases, data querying, and
programming. To create a more inclusive access point to the KG, future work should
focus on implementing an interactive interface that will eliminate the need for direct
user interaction with Neo4j. To fully leverage the semantic layer granted by AD-DPC,
future extensions implementing a user interface based on natural-language questions
are considered.
Our results are partially similar to those in a previous work modelling ADNI data
through an ontology-based approach [26]. Similarly to our findings, they conclude
that semantic databases grant more intuitive data querying, thus creating an outlet for
simplified data access in machine learning. However, while they chose to model their
ontology entirely after the neuropsychological data offered by ADNI, we consider
this approach limited in terms of data blending and interoperability, since any future
data import from outside ADNI will likely require significant changes in the ontological structure.
Therefore, we used a data-independent ontological structure that we later translated
into a KG able to accommodate data from any dataset containing AD patient infor-
mation. While we acknowledge that importing data from several data sources will
require the development of dataset-specific mappings, we consider that changes in
the semantic structure of the KG should only occur in the following cases: (1) to
accommodate high-demand user needs; (2) to reflect any novelties in the domain;
(3) to fix existing imperfections.
Fig. 5 Graph visualisation of the results from the third query
5 Conclusion
The paper proposed a KG modelled through the AD-DPC and a Neo4j graph database.
We populated the KG with data from ADNI and demonstrated data organisation and
querying. The KG contained concepts, relations, and attributes described in the AD-
DPC ontology. The resulting rich data representation offers an additional semantic
layer, making the KG self-explanatory. This will significantly benefit data analysts
interested in researching AD but lacking background knowledge. A well-known
challenge of interdisciplinary collaboration is the coordination and communication
between people with different backgrounds. Expert knowledge and consultation
are often expensive, difficult to provide and time-consuming. Incorporating AD-
DPC semantics in the KG minimises the need for external consultation in analytics.
However, to fully achieve this goal, definitions and elucidations of base ontological
concepts within the graph should be integrated.
The paper demonstrated data loading from a single data source—ADNI. A limi-
tation of our work is that we did not explore the possibilities for data integration from
different sources. However, we consider the increasing need for data interoperability
in the domain of AD as one of the pressing matters that must be addressed before the
domain can truly embrace and leverage Big Data analytics. Therefore, future work
shall focus on data integration from multiple sources. Another limitation is that the
current implementation of the KG was created through a manual mapping between
the ontology and the database. Future work will consider automating this process.
This will ensure that any ontology updates will also be reflected in the KG.
Acknowledgements This research work has been supported by the GATE project, funded by the
H2020 WIDESPREAD-2018-2020 TEAMING Phase 2 programme (agreement no. 857155); OP
Science and Education for Smart Growth (agreement no. BG05M2OP001-1.003-0002-C01); and
the BNS fund (project no. KP-06-N32/5).
References
1. Duong S, Patel T, Chang F (2017) Dementia: what pharmacists need to know. Can Pharmacists
J 150(2):118–129. https://doi.org/10.1177/1715163517690745
2. Silva MVF, de Mello Gomide Loures C, Alves LCV, de Souza C, Borges KBG, das Graças
Carvalho M (2019) Alzheimer’s disease: risk factors and potentially protective measures. J
Biomed Sci 26(33):1–11. https://doi.org/10.1186/s12929-019-0524-y
3. Li X, Feng X, Sun X, Hou N, Han F, Liu Y (2022) Global, regional, and national burden of
Alzheimer’s disease and other dementias, 1990–2019. Front Ageing Neurosci 14(937486):1–
17. https://doi.org/10.3389/fnagi.2022.937486
4. World Health Organization (WHO) Dementia Fact Sheet. Retrieved from https://www.who.
int/news-room/fact-sheets/detail/dementia. Accessed on 13 Feb 2022
5. Breijyeh Z, Karaman R (2020) Comprehensive review on Alzheimer’s disease: causes and
treatment. Molecules 25(24):5789. https://doi.org/10.3390/molecules25245789
6. Luo J, Wu M, Gopukumar D, Zhao Y (2016) Big data application in biomedical research and
health care: a literature review. Biomed Inform Insights, vol 8. https://doi.org/10.4137/BII.
S31559
7. Ienca M, Vayena E, Blasimme A (2018) Big data and dementia: charting the route ahead for
research, ethics, and policy. Front Med 5(13):1–7. https://doi.org/10.3389/fmed.2018.00013
8. Zhao Z et al (2023) Conventional machine learning and deep learning in Alzheimer’s disease
diagnosis using neuroimaging: a review. Front Comput Neurosci 17(1038636):1–16. https://
doi.org/10.3389/fncom.2023.1038636
9. Wang H et al (2022) Develop a diagnostic tool for dementia using machine learning and non-
imaging features. Front Aging Neurosci 14(945274):1–14. https://doi.org/10.3389/fnagi.2022.
945274
10. Fristed E et al (2022) Leveraging speech and artificial intelligence to screen for early
Alzheimer’s disease and amyloid beta positivity. Brain Commun 4(5):1–12. https://doi.org/
10.1093/braincomms/fcac231
11. Bourkhime H et al (2022) Machine learning and novel ophthalmologic biomarkers for
Alzheimer’s disease screening: systematic review. ITM Web Conf 43(01009):1–9. https://doi.
org/10.1051/itmconf/20224301009
12. Birkenbihl C et al (2020) Evaluating the Alzheimer’s disease data landscape. Alzheimer’s
Dement Transl Res Clin Interv 6(e12102):1–11. https://doi.org/10.1002/trc2.12102
13. Liyanage H, Krause P, de Lusignan S (2015) Using ontologies to improve semantic inter-
operability in health data. J Innov Health Inf 22(2):309–315. https://doi.org/10.14236/jhi.v22
i2.159
14. Timón-Reina S, Rincón M, Martínez-Tomás R (2021) An overview of graph databases and
their applications in the biomedical domain. Database 2021:baab026. https://doi.org/10.1093/
database/baab026
15. Mungall CJ et al (2017) The monarch initiative: an integrative data and analytic platform
connecting phenotypes to genotypes across species. Nucleic Acids Res 45(D1):D712–D722.
https://doi.org/10.1093/nar/gkw1128
16. Mughal S et al (2017) Pheno4J: a gene to phenotype graph database. Bioinformatics
33(20):3317–3319. https://doi.org/10.1093/bioinformatics/btx397
17. Robinson PN, Köhler S, Bauer S, Seelow D, Horn D, Mundlos S (2008) The human phenotype
ontology: a tool for annotating and analyzing human hereditary disease. Am J Hum Genet
83(5):610–615. https://doi.org/10.1016/j.ajhg.2008.09.017
18. Fahd K, Miao Y, Miah SJ, Venkatraman S, Ahmed K (2022) Knowledge graph model devel-
opment for knowledge discovery in dementia research using cognitive scripting and next-
generation graph-based database: a design science research approach. Soc Netw Anal Min
12(61):1–12. https://doi.org/10.1007/s13278-022-00894-9
19. Rossanez A, dos Reis JC, da Silva TR, de Ribaupierre H (2020) KGen: a knowledge graph gener-
ator from biomedical scientific literature. BMC Med Inform Decis Making 20(314 (S4)):1–24.
https://doi.org/10.1186/s12911-020-01341-5
20. Nian Y et al (2022) Mining on Alzheimer’s diseases related knowledge graph to identity poten-
tial AD-related semantic triples for drug repurposing. BMC Bioinformatics 23(407 (S6)):1–15.
https://doi.org/10.1186/s12859-022-04934-1
21. Hsieh K-L, Plascencia-Villa G, Lin K-H, Perry G, Jiang X, Kim Y () Synthesize heterogeneous
biological knowledge via representation learning for Alzheimer’s disease drug repurposing. In:
iScience, vol 26, issue 105678, pp 1–18. https://doi.org/10.1016/j.isci.2022.105678
22. Malec SA et al (2023) Causal feature selection using a knowledge graph combining structured
knowledge from the biomedical literature and ontologies: a use case studying depression as
a risk factor for Alzheimer’s disease. In: bioRxiv preprint, pp 1–45. https://doi.org/10.1101/
2022.07.18.500549
23. Zhang Z et al (2022) Developing an intuitive graph representation of knowledge for nonphar-
macological treatment of psychotic symptoms in dementia. J Gerontological Nurs 48(4):49–55.
https://doi.org/10.3928/00989134-20220308-02
24. Fahd K, Venkatraman S (2021) Visualizing risk factors of dementia from scholarly literature
using knowledge maps and next-generation data models. Vis Comput Ind Biomed Art 4(19):1–
19. https://doi.org/10.1186/s42492-021-00085-x
25. Lazarova S, Petrova-Antonova D, Kunchev T (2023) Ontology-driven knowledge sharing in
Alzheimer’s disease research. Information 14(3):188. https://doi.org/10.3390/info14030188
26. Taglino F et al (2023) An ontology-based approach for modelling and querying Alzheimer’s
disease data, pp 1–19. https://doi.org/10.21203/rs.3.rs-1813123/v1
Forecasting Bitcoin Prices in the Context
of the COVID-19 Pandemic Using
Machine Learning Approaches
Prashanth Sontakke, Fahimeh Jafari, Mitra Saeedi,
and Mohammad Hossein Amirhosseini
Abstract Using daily data from 1st April 2016 to 3rd March 2022, this study
aims to explore the use and effectiveness of machine learning algorithms in fore-
casting the price of Bitcoin. The paper examines the forecasting performance based
on different time lags within the selected periods: (1) before pandemic and (2)
including pandemic. The second time frame is selected to examine the effect of
the Covid pandemic on the Bitcoin market fluctuations. This research employs four
machine learning models, including linear regression, support vector regression,
extreme gradient boosting, and long short-term memory. These are refined and cali-
brated to produce the most accurate forecasts. The performance of the algorithms
was measured and compared using regression metrics. The results show that before
the pandemic, the linear regression model performed the best for next-day predic-
tions, while extreme gradient boosting performed best overall and for longer-term
predictions. For the period including the pandemic, extreme gradient boosting and
linear regression performed the best, consistently outperforming long short-term
memory and support vector regression. The prediction models for data before the
pandemic have demonstrated improved performance, whereas the selected model
for the period including the pandemic exhibited satisfactory results. This is because
Bitcoin prices displayed the highest volatility during the Covid pandemic. The study
finds that extreme gradient boosting performs best overall and for longer-term predic-
tions, while linear regression performs the best for next-day predictions before the
pandemic. Moreover, the study reports satisfactory results for Bitcoin price prediction
for the period including the pandemic, despite the high volatility of prices.
P. Sontakke · F. Jafari · M. Saeedi · M. H. Amirhosseini (B)
University of East London, London, United Kingdom
e-mail: m.h.amirhosseini@uel.ac.uk
P. Sontakke
e-mail: U2054788@uel.ac.uk
F. Jafari
e-mail: f.jafari@uel.ac.uk
M. Saeedi
e-mail: m.saeedi@uel.ac.uk
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_7
Keywords Cryptocurrency · Bitcoin price · Time series forecasting · Machine
learning · Technical indicators · Linear regression · Support vector regression ·
Extreme gradient boosting · Long short-term memory
1 Introduction
Bitcoin, one of the most popular cryptocurrencies, was introduced by Satoshi Nakamoto
in 2009 [1]. Cryptocurrencies apply the principle of decentralisation, whereas
fiat currencies are based on central banking systems. Therefore, a cryptocurrency is
not subject to interference from a central banking authority. The global financial
crisis of 2007–2008, known as the subprime mortgage crisis, followed by the
eurozone debt crisis of 2011–2012, substantially increased people's distrust in their
governments and eroded their faith in traditional financial institutions. As a result,
Bitcoin, with its promise of a decentralised structure free from governmental and
regulatory control, was well received in the following years [2]. Bitcoin and other
cryptocurrencies are used in different ways, such as for speculative trading, investment,
or simply as a payment method. Bitcoin, with its explicitly speculative behaviour,
is subject to high volatility and bubbles [3]. The unusual price behaviour of
Bitcoin has attracted many researchers seeking to provide the most efficient models
to predict its price.
Financial time series forecasting has been a subject of significant interest in
economics, statistics, and computer science. A cryptocurrency is a digital currency
that uses cryptography to conduct transactions securely [4]. All cryptocurrencies are
traded across various exchanges 24/7, resulting in much higher volatility compared
with traditional stock markets. The motivation behind predicting the price of Bitcoin using
machine learning techniques was heavily inspired by increasingly better-performing
ensemble algorithms and neural network architectures. Bitcoin recorded its all-time
high in 2021 and experienced high fluctuations during the Covid pandemic, attracting
massive public attention. The high-price volatility of Bitcoin, especially during the
pandemic, motivated this research to analyse the Bitcoin price behaviour before and
during the Covid pandemic.
This study aims to examine the effectiveness of machine learning algorithms in
forecasting Bitcoin prices before and during the COVID-19 pandemic. It uses a
robust feature selection strategy to identify the most critical features for prediction
and applies different machine learning algorithms to forecast Bitcoin prices. The
models have been optimised and tuned to reflect the fluctuations as well. The paper
considers forecasting performance on different lags within pre-selected periods. It
evaluates the extent to which the prices of Bitcoin can be accurately predicted for
the next day, 7th day, 15th day, and 30th day.
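To make this multi-horizon setup concrete, the following sketch (ours, not the authors' code) turns a daily price series into supervised pairs in which a window of recent prices predicts the price h days ahead, for h in {1, 7, 15, 30}. In the study itself the feature set is richer (e.g. technical indicators); this only illustrates the lag/horizon mechanics.

```python
def make_supervised(prices, window=5, horizon=1):
    """Build (features, target) pairs from a daily price series:
    features are the `window` most recent prices; the target is the
    price `horizon` days after the window's last day."""
    X, y = [], []
    for t in range(window, len(prices) - horizon + 1):
        X.append(prices[t - window:t])      # last `window` prices
        y.append(prices[t + horizon - 1])   # price `horizon` days ahead
    return X, y

# Stand-in for daily Bitcoin closing prices (synthetic data).
prices = list(range(100, 140))
X1, y1 = make_supervised(prices, window=5, horizon=1)   # next-day target
X7, y7 = make_supervised(prices, window=5, horizon=7)   # 7th-day target
```

Longer horizons yield fewer training pairs for the same series, which is one reason longer-term forecasts are harder to fit and evaluate.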
The rest of this paper is organised as follows. Section 2 discusses the literature
review. The methodology and the machine learning models utilised in this paper
are detailed in Sect. 3. Section 4 is devoted to the experimental results, and Sect. 5
concludes the paper.
2 Literature Review
The high volatility of the Bitcoin price could be due to many factors, from the operating
hours of the American, European, and Asian markets to different macroeconomic factors of
the world economy, especially the leading economies. While regulatory implications
and economic pressures led Bitcoin to be perceived differently in various countries,
Bitcoin price volatility and its hedging capacity have been discussed in many studies,
as Bitcoin-based portfolios can yield significant gains. Bitcoin has been considered a
risk diversifier for the portfolio. In some cases, it proved to be the best hedge choice
during financial crises, helping the investor in the investment process [5].
Some studies suggested that Bitcoin should not be considered a currency; they
argued that due to Bitcoin's volatile price behaviour, it should instead be referred to as
a speculative investment asset. Among the early studies on Bitcoin price volatility,
Mittal [6] found no fundamental explanation for Bitcoin’s price movements and
concluded that the primary determinant of Bitcoin price is the investors’ speculation.
Meanwhile, Buchholz et al. [7] argued that Bitcoin's price had bubble characteristics
with no significant relation to other financial assets. They concluded that Bitcoin price
movement was derived only from its own dynamics of supply and demand induced
by the behaviour of speculative investors. Gronwald [8] examined if Bitcoin’s price
movements exhibited characteristics of commodities such as gold or oil and found
that compared with the price fluctuation of traditional commodities, Bitcoin price
was significantly more volatile.
As interest in Bitcoin grew during the initial years, some studies have used statis-
tical and econometric model-based techniques to predict Bitcoin prices [9]. Statistical
model-based time series forecasting is a method of estimating and predicting price
values, but it has the drawback of requiring assumptions about the data distribution
beforehand. Bitcoin prices are non-stationary and show no seasonal effects, so this
approach cannot make accurate predictions for them. Some studies
recommended autoregressive integrated moving average (ARIMA)-based model for
predicting Bitcoin prices [10,11]. Alahmari [12] used the ARIMA model to predict
Bitcoin, Ripple, and Ethereum based on daily, weekly, and monthly time horizons.
Huang et al. [13] developed a classification tree-based model for predicting Bitcoin
returns using 124 technical indicators that indicate overlap, momentum, pattern, etc.
Their approach claimed that technical analysis of historical data could predict Bitcoin
returns within narrow ranges as its value is believed to be driven by factors other
than fundamental factors. Their results surpassed the buy-and-hold strategy and
contribute significantly to the newly emerging literature on technical analysis-based
cryptocurrency price forecasting.
Machine learning can be referred to as an automated learning process from experi-
ence without the need for explicit programming. This motivated many researchers to
study Bitcoin volatility and propose forecasting techniques using machine learning.
Greaves and Au [14] applied linear regression, logistic regression, support vector
machines (SVM), and artificial neural networks (ANN) and achieved a 55% accuracy
rate with ANN, outperforming the other models. They concluded that financial flow
84 P. Sontakke et al.
features from various exchanges would be an added advantage in predicting Bitcoin
prices. Using only blockchain-based features for training and testing offers limited
predictability. Madan et al. [15] addressed binary classification models like logistic
regression and random forest. Results show that the random forest outperformed
SVM as the former is not affected by high standard deviation and outliers within the
data. The study by Radityo et al. [16] predicted next-day prices using the closing price
of Bitcoin in USD. The research utilised four variations of artificial neural network
(genetic algorithm NN, backpropagation NN, genetic algorithm BPNN, and neuroevolution
of augmenting topologies) and compared the results based on mean absolute
percentage error (MAPE) values and computational time complexity. Among the
variants of ANN used, GABPNN showed the best results, whereas the performance of
the genetic algorithm NN was unsatisfactory. The study by Yeh et al. [17] proposes an
improved ensemble learning method for forecasting Bitcoin price movements. The
method combines AdaBoost, random forest, and extreme gradient boosting algo-
rithms to enhance prediction accuracy. The authors evaluate the proposed method on
real-world Bitcoin price data and compare it with other popular forecasting methods,
including ARIMA and LSTM neural networks. The experimental results demonstrate
that the proposed method outperforms other methods in accuracy and robustness.
Authors in [18] present a hybrid deep learning framework for forecasting cryptocur-
rency prices, including Bitcoin. The framework combines CNN and LSTM to capture
the complex temporal patterns of cryptocurrency price data. The experimental results
show that the proposed framework achieves higher accuracy and lower error rates
than other models. However, it is worth noting that these studies do not consider the
pandemic period for Bitcoin price.
3 Methodology
A time series is a set of sequential data points for a specific successive time duration.
It incorporates methods that relate time series with understanding the trend of data
points within the time series or helps make predictions. This research concentrates
on forecasting Bitcoin prices using multivariate time series and machine learning
models, where the value of the target variable x at a future time point is given by
x[t+s] = f(x[t], x[t−1], ..., x[t−n]), with s > 0 representing the prediction horizon.
The prediction forecast is evaluated for horizons of the next day, 7th day, 15th day, and
30th day. As shown in Fig. 1, the implementation of a time series-based forecasting
method begins with creating a dataset. Then, machine learning models are trained
for the specified prediction horizons. Technical indicators contributing to the Bitcoin
price have been scraped from open data sources.
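As an illustrative sketch of this formulation, the lagged supervised-learning frame for a horizon s can be built with pandas. The toy series, lag count, and column names here are our assumptions, not the paper's:

```python
import numpy as np
import pandas as pd

def make_supervised(series: pd.Series, n_lags: int, horizon: int) -> pd.DataFrame:
    """Turn a price series into a supervised frame: inputs x[t], x[t-1], ...,
    x[t-n_lags+1] and target x[t+horizon]."""
    frame = {f"lag_{i}": series.shift(i) for i in range(n_lags)}
    frame["target"] = series.shift(-horizon)
    return pd.DataFrame(frame).dropna()

# toy close-price series standing in for the real Bitcoin data
close = pd.Series(np.arange(100.0), name="close")
data = make_supervised(close, n_lags=3, horizon=7)  # 7th-day horizon
X, y = data.drop(columns="target"), data["target"]
```

The same construction is repeated for horizons 1, 7, 15, and 30 to obtain the four datasets per period.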
As a preprocessing step, the data is consolidated into a single data frame, cleaned,
and scaled. The end-of-day close price is used to create datasets for the next-day, 7th-,
15th-, and 30th-day forecast for historical periods of data (i.e. from 1st April 2016
to 1st November 2019 and 1st April 2016, to 3rd March 2022). This results in four
separate datasets for the two time periods specified.

Fig. 1 Step-by-step model development

Feature extraction and feature selection are performed separately for each dataset. Over 900 derived features are
created based on past time frames of 7, 30, and 90 days. Feature selection, a crucial
step in Fig. 1, is performed to reduce the number of input variables, thereby
reducing the dimensionality and computational complexity of the
model. The top 10 features from each dataset are extracted using a random forest
regressor, followed by a training and testing split.
3.1 Data Collection
We have collected daily historical data from Yahoo Finance API (OHLC feature),
blockchain-based features from Bitinfocharts [19], and Quandl [20] through web
scraping techniques. We have 23 features excluding date and target variables. Table 1
represents the features that have been gathered.
3.2 Feature Engineering Using Technical Indicators
The dataset was enriched with newly generated features based on technical indicators
and lagged for 7, 30, and 90 days. These technical indicators added to the dataset
by providing information that could not be obtained from the existing features. For
instance, these new features addressed the need for more information regarding
properties like variance and standard deviation, which were calculated from the raw
features. This calculation allowed us to observe the relationship between prices and
the standard deviation of the hash rate for the past 7-, 30-, and 90-day intervals, rather
than just the raw features. Table 2 represents the features extracted based on technical
indicators.

Table 1 Features collected using web scraping

Number of transactions per day in blockchain | Block size | Miner revenue
Number of sent by addresses | Number of active addresses | Open price
Average mining difficulty | Average hash rate | Low price
Average and median transaction fee | Average block time | Volume
Mining profitability | Sent coins | High price
Average and median transaction value | Tweets and Google Trends per day | Number of coins in circulation
Average fee percentage in total block reward | Top 100 richest addresses to total coins | Close price
Market cap | Confirmation time |

Table 2 Extracted features based on technical indicators

Simple moving average | Weighted moving average
Exponential moving average | Double exponential moving average
Triple exponential moving average | Standard deviation
Relative strength index | Rate of change
Bollinger bands | Moving average convergence divergence
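Several of these indicators can be derived directly with pandas rolling and exponentially weighted operations. The sketch below (the window length and the synthetic price series are illustrative assumptions) shows SMA, EMA, rolling standard deviation, Bollinger bands, and RSI:

```python
import numpy as np
import pandas as pd

def add_indicators(df: pd.DataFrame, window: int = 7) -> pd.DataFrame:
    """Append SMA, EMA, rolling std, Bollinger bands, and RSI columns."""
    out = df.copy()
    close = out["close"]
    out[f"sma_{window}"] = close.rolling(window).mean()
    out[f"ema_{window}"] = close.ewm(span=window, adjust=False).mean()
    out[f"std_{window}"] = close.rolling(window).std()
    # Bollinger bands: SMA plus/minus two rolling standard deviations
    out[f"boll_hi_{window}"] = out[f"sma_{window}"] + 2 * out[f"std_{window}"]
    out[f"boll_lo_{window}"] = out[f"sma_{window}"] - 2 * out[f"std_{window}"]
    # RSI from average gains and losses over the window
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(window).mean()
    loss = (-delta.clip(upper=0)).rolling(window).mean()
    out[f"rsi_{window}"] = 100 - 100 / (1 + gain / loss)
    return out

# illustrative, monotonically rising price series
prices = pd.DataFrame({"close": np.linspace(100.0, 130.0, 40)})
features = add_indicators(prices, window=7)
```

Repeating the call for windows of 7, 30, and 90 days over each raw column yields the lagged indicator set described above.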
3.3 Feature Selection
When working with large datasets that possess numerous features, the computational
time and complexity of an algorithm can increase significantly. The feature selection
process addresses this by evaluating each feature's contribution to the outcome and
reducing the dataset's dimensionality, while retaining or improving the accuracy
scores. Therefore, we have applied a random forest regressor to select the top 10
features from the entire dataset, represented in Table 3.
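This selection step can be sketched with scikit-learn's impurity-based feature importances. The synthetic data and feature names below are hypothetical stand-ins for the 900-plus derived features:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# synthetic stand-in: 25 candidate features, target driven mostly by the first two
X = pd.DataFrame(rng.normal(size=(300, 25)),
                 columns=[f"feat_{i}" for i in range(25)])
y = 3 * X["feat_0"] + 2 * X["feat_1"] + rng.normal(scale=0.1, size=300)

# rank features by random forest importance and keep the top 10
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns)
top10 = importances.sort_values(ascending=False).head(10).index.tolist()
```

The same ranking is run once per horizon dataset, which is why the selected features in Table 3 differ across horizons.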
To determine relevant features for this study, we have identified new features using
technical analysis and feature selection algorithms. Feature engineering revealed the
Table 3 Most frequently selected features across all horizons

Features | Horizons (next day, 7, 15, 30)
WMA 30 Number of coins in circulation | * * * *
SMA 13 High | * * *
EMA 90 Low | * * *
EMA 7 Number of coins in circulation | * * *
Close | * *
High | * *
WMA 7 Close | * *
EMA 7 Open | * *
DEMA 30 Close | * *
DEMA 7 Market cap | * *
EMA 30 Close | * *
extent to which features directly related to the blockchain impacted the price of
Bitcoin, e.g. miner revenue, which involves transaction fees and rewards, is correlated
with the Bitcoin price. Similarly, block size and the creation of new blocks correlate
with the number of transactions, and a higher number of Bitcoin transactions
correlates with the Bitcoin price. Higher mining difficulty demands more processing
power and is therefore highly correlated with the hash rate.
3.4 Training and Testing
After the feature selection process, the next step is to allocate a portion of the data as
the training set and another portion as the testing set. Due to the non-stationary nature
of cryptocurrency prices, there is a trade-off between using too much and too little
data for training: too much historical data makes the model stale and irrelevant, while
too little makes it prone to overfitting. This problem is usually solved by using the ratio of
80% training data and 20% testing data based on the Pareto principle. However, we
observed overfitting in the results obtained through time series split cross-validation.
Therefore, we employed a sliding window approach which uses 10 consecutive data
points to predict the 11th and 12th data points within the same sequence, as supported
by previous research [21]. Essentially, the prediction of the next two days will be
based on data from the preceding ten days, with the final metric being the average
of the metrics computed for each split.
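A sketch of this sliding-window evaluation is shown below. The stride of two (advancing by the test span so test sets do not overlap) is our assumption; the paper does not state how the window advances:

```python
import numpy as np

def sliding_window_splits(n_samples: int, train_size: int = 10, test_size: int = 2):
    """Yield (train_idx, test_idx) pairs: each window of `train_size`
    consecutive points predicts the following `test_size` points."""
    step = test_size  # assumed stride: advance by the test span
    for start in range(0, n_samples - train_size - test_size + 1, step):
        train = np.arange(start, start + train_size)
        test = np.arange(start + train_size, start + train_size + test_size)
        yield train, test

splits = list(sliding_window_splits(20))
```

The final reported metric is then the average of the metric computed on each test window.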
3.5 Machine Learning Algorithms
Four machine learning models have been implemented in this work including (1)
linear regression with gradient descent (LR), (2) support vector regression (SVR),
(3) extreme gradient boosting (XGBoost), and (4) long short-term memory (LSTM).
Table 4 gives a summary of the parameters chosen for each model. For all four models,
all possible combinations of the hyperparameters were investigated during the
hyperparameter tuning process, and the combinations presented in Table 4 produced the
best results.
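The tuned configurations in Table 4 map onto library estimators roughly as follows. `SGDRegressor` is our reading of "linear regression with gradient descent", and a `GradientBoostingRegressor` stands in when xgboost is not installed; treat this as a sketch, not the authors' exact code:

```python
from sklearn.linear_model import SGDRegressor
from sklearn.svm import SVR

# LR row of Table 4: linear regression fitted with gradient descent
lr = SGDRegressor(loss="squared_epsilon_insensitive", penalty="elasticnet",
                  shuffle=True, l1_ratio=0.15, epsilon=0.01,
                  learning_rate="adaptive", max_iter=1000)

# SVR row of Table 4
svr = SVR(kernel="rbf", C=1000, gamma="auto")

try:  # XGBoost row of Table 4; fall back if the xgboost package is absent
    from xgboost import XGBRegressor
    xgb = XGBRegressor(n_estimators=500, max_depth=3, learning_rate=0.01, n_jobs=1)
except ImportError:
    from sklearn.ensemble import GradientBoostingRegressor
    xgb = GradientBoostingRegressor(n_estimators=500, max_depth=3, learning_rate=0.01)

models = {"LR": lr, "SVR": svr, "XGBoost": xgb}
```

Each estimator is then fitted and scored once per sliding-window split.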
For the LSTM model, a bidirectional layer of 500 cells was used followed by
a dropout of 25% which in turn is fed to another bidirectional layer of 600 cells,
followed by a dropout of 30%. An optimiser algorithm is used to update the network
weights during training. The Adam optimiser suits non-convex optimisation problems:
it requires little memory, handles noisy gradients well, and is computationally
efficient. Hence, the Adam optimiser was adopted.
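The stacked bidirectional architecture described above can be sketched in Keras. The input window length and feature count are assumptions, as the paper does not report them:

```python
from tensorflow import keras
from tensorflow.keras import layers

n_steps, n_features = 10, 10  # assumed window length and number of selected features

model = keras.Sequential([
    layers.Input(shape=(n_steps, n_features)),
    layers.Bidirectional(layers.LSTM(500, return_sequences=True)),
    layers.Dropout(0.25),
    layers.Bidirectional(layers.LSTM(600)),
    layers.Dropout(0.30),
    layers.Dense(1),  # one-step price forecast
])
model.compile(optimizer=keras.optimizers.Adam(), loss="mse")
```

The model is then trained on the same lagged windows as the other estimators, with early stopping configured as in the LSTM row of Table 4.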
4 Experimental Results
The performance outcomes of the forecasting models are outlined in this section. We
have developed all the steps explained in Sect. 3 using Python on the Google Colab
platform. Two case studies have been conducted based on the following time frames:
Period 1: Before pandemic (1st April 2016–1st November 2019)
Table 4 Hyperparameter tuning for each model

Model | Parameter | Value
LR | Loss function | squared_epsilon_insensitive
LR | Penalty | elasticnet
LR | Shuffle | True
LR | L1_ratio | 0.15
LR | Epsilon | 0.01
LR | Learning rate | adaptive
LR | Max_iter | 1000
SVR | Kernel | radial basis function
SVR | C | 1000
SVR | Gamma | auto
XGBoost | n_estimators | 500
XGBoost | max_depth | 3
XGBoost | learning_rate | 0.01
XGBoost | n_jobs | 1
LSTM | Monitor | root_mean_squared_error
LSTM | Verbose | 1
LSTM | Mode | min
LSTM | Patience | 3
Period 2: Including pandemic (1st April 2016–3rd March 2022)
The models are evaluated using three metrics which are root-mean squared error
(RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE).
When evaluating models, it is ideal to have low values for the MAPE, RMSE, and
MAE metrics. For instance, in the case of Bitcoin price prediction, a model with a few
large prediction errors may yield a higher RMSE yet still have lower MAPE or MAE
values, since RMSE penalises large deviations more heavily. Hence, it is crucial to assess the models using all three
measures.
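The three metrics can be computed directly; the sample values below are illustrative, not from the paper:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean squared error."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]
```

Note that MAPE is scale-free (a percentage), while RMSE and MAE are in price units, which is why the tables report them on very different scales.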
4.1 Period 1: Before Pandemic (1st April 2016–1st November
2019)
In the first case study, we examine forecasting Bitcoin prices before the pandemic.
Table 5 presents the outcomes of the machine learning models for the different time
frames.
Table 5 Comparing model accuracy across time frames before pandemic
Test metrics: 01 April 2016—01 November 2019
Next day
LR SVR XGBoost LSTM
RMSE 261.6396 363.0427 288.9283 373.1148
MAE 244.5862 349.7400 272.1291 359.8107
MAPE 0.8746 5.9647 0.8746 0.8746
7th Day
LR SVR XGBoost LSTM
RMSE 407.1737 389.4182 316.0947 399.7969
MAE 392.4431 376.1204 297.5931 386.8093
MAPE 1.4965 6.2503 1.4965 6.3994
15th Day
LR SVR XGBoost LSTM
RMSE 398.0850 387.2883 326.3839 406.8071
MAE 383.5820 374.2908 309.8014 394.1188
MAPE 4.2807 6.1359 4.2807 6.6521
30th Day
LR SVR XGBoost LSTM
RMSE 386.2755 389.6317 277.8465 423.1300
MAE 372.3106 374.5799 259.9288 408.9162
MAPE 0.8412 6.1730 0.8412 6.7438
During this period, Bitcoin prices displayed minimal fluctuation but saw a signif-
icant increase in early 2017, maintaining a stable trend for the remainder of the
interval. Among the models for next-day predictions, LR achieved the lowest RMSE
of 261.6396, followed by XGBoost, SVR, and LSTM. LR also had the best MAE of
244.5862, followed by XGBoost, SVR, and LSTM. In terms of MAPE, LR, XGBoost,
and LSTM recorded 0.8746, with SVR coming in at 5.9647. Therefore, the LR model
is the best performer among the four models mentioned (LR, XGBoost, SVR, and
LSTM).
For the 7-day prediction, XGBoost showed the best performance, with the lowest
RMSE of 316.0947, followed by SVR, LSTM, and LR. XGBoost also had the best
MAE of 297.5931, followed by SVR, LSTM, and LR. In terms of MAPE, LR and
XGBoost performed best with a value of 1.4965, followed by SVR and LSTM. For
the 15-day prediction, XGBoost showed the best performance with an RMSE of
326.3839, followed by SVR, LR, and LSTM. XGBoost also had the best MAE of
309.8014, followed by SVR at 374.2908, LR at 383.5820, and LSTM at 394.1188. In
terms of MAPE, LR and XGBoost performed best with a value of 4.2807, followed
by SVR and LSTM. For the 30-day prediction, the best RMSE was achieved by
the XGBoost model with a value of 277.8465, followed by LR, SVR, and LSTM.
XGBoost also had the best MAE of 259.9288, followed by LR at 372.3106, SVR,
and LSTM.
The results show that the LR model performs the best for next-day predictions with
the lowest RMSE and MAE values. For the 7-day prediction, XGBoost outperforms
the other models with the lowest RMSE and MAE values. Similarly, for the 15-
day and 30-day predictions, XGBoost performs the best with the lowest RMSE and
MAE values. For all prediction periods, LR and XGBoost also performed well in
terms of MAPE values. In conclusion, the XGBoost model performs the best overall,
while the LR model performs well for next-day predictions. Figure 2 presents a graph
contrasting the actual and predicted data for the 15-day forecast utilising the XGBoost
model.
Fig. 2 Comparison of actual versus predicted data for 15-day prediction using XGBoost model
4.2 Period 2: Including Pandemic (1st April 2016–3rd March
2022)
In the second case study, we examine forecasting Bitcoin prices for the period that
included the pandemic, characterised by an unusual level of volatility. This consti-
tutes the core contribution of this research. Table 6 displays the results of the machine
learning models for various time frames in this period.
The results of the next-day prediction show that XGBoost achieved the lowest
RMSE of 723.9742, followed by LR at 773.8296, LSTM at 890.0664, and SVR at
981.2988. XGBoost also had the best MAE of 682.4402, with LR, LSTM, and SVR
following. LR and XGBoost had the best MAPE of 0.3862, while LSTM and SVR
followed.
For the 7th-day prediction, XGBoost had the lowest RMSE of 734.0597, followed
by LR, LSTM, and SVR. XGBoost also reported the best MAE of 691.8397, followed
by LR, LSTM, and SVR. LR and XGBoost had the best MAPE of 4.0497, followed
by SVR and LSTM. For the 15-day prediction, XGBoost had the lowest RMSE of
686.5598, followed by LR, LSTM, and SVR. XGBoost also had the best MAE of
Table 6 Comparing model accuracy across time frames including pandemic
Test metrics: 01 April 2016–03 March 2022
Next day
LR SVR XGBoost LSTM
RMSE 773.8296 981.2988 723.9742 890.0664
MAE 739.1618 952.5082 682.4402 859.1364
MAPE 0.3862 6.3134 0.3862 5.8114
7th Day
LR SVR XGBoost LSTM
RMSE 989.5905 998.1517 734.0597 993.7988
MAE 958.2842 969.1428 691.8397 963.8020
MAPE 4.0497 6.362 4.0497 6.3729
15th Day
LR SVR XGBoost LSTM
RMSE 976.8267 1008.8316 686.5598 1002.8917
MAE 944.8389 979.7016 648.2302 973.0593
MAPE 6.2047 6.3711 6.2047 6.4255
30th Day
LR SVR XGBoost LSTM
RMSE 1007.7876 1029.7602 678.8905 1038.7840
MAE 966.5514 989.2436 633.0426 997.3399
MAPE 0.6291 6.3763 0.6291 6.5032
Fig. 3 Comparison of actual versus predicted data using XGBoost model for the period covering
pandemic
648.2302, followed by LR, LSTM, and SVR. LR and XGBoost had the best MAPE
of 6.2047, followed by SVR and LSTM. For the 30-day prediction, XGBoost had
the lowest RMSE of 678.8905, followed by LR, LSTM, and SVR. XGBoost also
had the best MAE of 633.0426, followed by LR, SVR, and LSTM. LR and XGBoost
had the best MAPE of 0.6291, while SVR and LSTM followed.
The results show that XGBoost outperformed the other models in all time frames
regarding RMSE, MAE, and MAPE. LR also performed well, consistently achieving
the second-best results. LSTM and SVR showed lower performance compared
with XGBoost and LR. Overall, XGBoost and LR demonstrated the best results in
predicting future outcomes based on the given dataset. The graph in Fig. 3 compares
the actual and predicted data using the XGBoost model for the period that includes
the pandemic.
The four machine learning models (SVR, XGBoost, LR, and LSTM) used in the
study differ in their underlying principles and have varying strengths and weaknesses.
Regarding speed, SVR was the quickest at 3 s, followed by linear regression at 10 s,
XGBoost at 90 s, and LSTM at 90 min for predicting next-day Bitcoin prices in the
second period. As LSTM had the longest runtime, the other models are recommended
for their time-saving advantages.
5 Conclusion and Future Work
This study assessed the performance of four machine learning models, linear regres-
sion, support vector regression, XGBoost, and LSTM, in predicting Bitcoin price
volatility during the COVID-19 pandemic using technical features and indicators.
The results show that the models performed better before the pandemic compared
with during the pandemic with high volatility. Despite this, the study still reports
satisfactory results for Bitcoin price prediction during the pandemic. The authors
suggest that this remains a challenge for future studies.
The study employed a robust feature selection strategy to determine the most
critical features. The random forest regressor recommended features for all defined
horizons, which have been partially related to specified periods. For example, the
number of coins in circulation has been selected for all horizons, while the close price
has been only selected for the next-day and 7th-day horizons and not for the 15th-
and 30th-day horizons. The study shows a satisfactory prediction of Bitcoin prices
over the selected horizons. The results showed that the accuracy of predictions for
the next day, 15th day, and 30th day was superior to that for the 7th-day horizon in the
second dataset. The reason could not be established, as Bitcoin prices are stochastic.
The limitations of this study could include the following:
The study only examines the period of 1st April 2016 to 3rd March 2022, and
may not capture the full range of Bitcoin price fluctuations over a longer period.
The study only focuses on four machine learning models, and other models may
better predict Bitcoin price fluctuations.
The study only uses technical features and indicators, and additional factors such
as global economic conditions and regulatory changes may affect Bitcoin prices.
Future work could involve exploring other machine learning models or incorpo-
rating additional features to improve the performance of the models. Additionally,
the study could be extended to other cryptocurrencies and compare the results with
those obtained for Bitcoin.
References
1. Nakamoto S (2008) Bitcoin: a peer-to-peer electronic cash system. Retrieved from https://
www.bitcoinpaper.info/bitcoinpaper-html/
2. Summoogum JP, Saeedi M (2020) A study on the inefficiencies of bitcoin and its future
adoption. Test Eng Manag 82:16624–16634
3. Cheah E-T, Fry J (2015) Speculative bubbles in bitcoin markets? An empirical investigation
into the fundamental value of bitcoin. Econ Lett 130:32–36
4. Garcia D, Tessone CJ, Mavrodiev P, Perony N (2014) The digital traces of bubbles: feed-
back cycles between socio-economic signals in the bitcoin economy. J R Soc Interface
11(99):20140623-1–20140623-8. https://doi.org/10.1098/rsif.2014.0623
5. Nistala MN, Saeedi M, Islam MU (2020) Bitcoin price volatility and hedging capacity. Int J
Manag 11(10):1703–1712. https://doi.org/10.34218/IJM.11.10.2020.156
6. Mittal S (2014) Is bitcoin money? Bitcoin and alternate theories of money (SSRN Scholarly
Paper No. ID 2434194). Social Science Research Network, Rochester, NY
7. Buchholz M, Delaney J, Warren J, Parker J (2012) Bits and bets, information, price volatility,
and demand for Bitcoin. Economics 312(1):2–48
8. Gronwald M (2019) Is bitcoin a commodity? On price jumps, demand shocks, and certainty
of supply. J Int Money Financ 97:86–92
9. Brooks C (2019) Introductory econometrics for finance. Cambridge University Press,
Cambridge. https://doi.org/10.1017/9781108524872
10. Roy S, Nanjiba S, Chakrabarty A (2018) Bitcoin price forecasting using time series analysis.
In: 2018 21st international conference of computer and information technology (ICCIT). IEEE,
pp 1–5. https://doi.org/10.1109/ICCITECHN.2018.8631923
11. Anupriya, Garg S (2018) Autoregressive integrated moving average model based prediction of
bitcoin close price. In: 2018 international conference on smart systems and inventive technology
(ICSSIT). IEEE, pp 473–478. https://doi.org/10.1109/ICSSIT.2018.8748423
12. Alahmari SA (2019) Using machine learning ARIMA to predict the price of cryptocurrencies.
ISC Int J Inf Secur 11(3):139–144. https://doi.org/10.22042/isecure.2019.11.0.18
13. Huang J-Z, Huang W, Ni J (2019) Predicting bitcoin returns using high-dimensional technical
indicators. J Finan Data Sci 5(3):140–155. https://doi.org/10.1016/j.jfds.2018.10.001
14. Greaves A, Au B (2015) Using the bitcoin transaction graph to predict the price of bitcoin, pp 1–
8. Retrieved from https://snap.stanford.edu/class/cs224w-2015/projects_2015/Using_the_Bit
coin_Transaction_Graph_to_Predict_the_Price_of_Bitcoin.pdf
15. Madan I, Saluja S, Zhao A (2015) Automated bitcoin trading via machine learning algorithms,
pp 1–5. Department of Computer Science, Stanford University, Stanford, CA, USA, Technical
Reports. Retrieved from https://www.smallake.kr/wp-content/uploads/2017/10/Isaac-Madan-
Shaurya-Saluja-Aojia-ZhaoAutomated-Bitcoin-Trading-via-Machine-Learning-Algorithms.
pdf
16. Radityo A, Munajat Q, Budi I (2017) Prediction of bitcoin exchange rate to American
dollar using artificial neural network methods. In: 2017 international conference on advanced
computer science and information systems (ICACSIS). IEEE, pp 433–438. https://doi.org/10.
1109/ICACSIS.2017.8355070
17. Yeh CC, Liao YC, Yang YJ (2020) Predicting bitcoin prices with machine learning techniques.
Expert Syst Appl 163:113762. https://doi.org/10.1016/j.eswa.2020.113762
18. Wang S, Ma Y, Zhang Y (2018) Forecasting bitcoin price with deep learning networks. Phys
A 510:828–834. https://doi.org/10.1016/j.physa.2018.07.026
19. BitInfoCharts. Retrieved from https://bitinfocharts.com/. Accessed on 18 Feb 2023
20. Quandl. Retrieved from https://demo.quandl.com/. Accessed on 18 Feb 2023
21. Hota HS, Handa R, Shrivas AK (2017) Time series data prediction using sliding window based
RBF neural network. Int J Comput Intell Res 13(5):1145–1156
Online Food Delivery Customer Churn
Prediction: A Quantitative Analysis
on the Performance of Machine Learning
Classifiers
J. Gerald Manju, A. Dharini, B. Kiruthika, and A. Malini
Abstract Securing current customers is more necessary than earning new
customers in a market that is expanding. To trace customer churn, a reliable churn
prediction paradigm is required. Customer churn is the process through which people
switch from one firm to another or break off contact with the company. This decision is
driven by a variety of influences. It is critical for companies to acknowledge each one
so that they can encourage customers to stay. This is accomplished by regularly
conducting surveys regarding customer satisfaction and analyzing the responses.
Applying appropriate modelling approaches is a vital component of predicting
customer churn. Predominantly, this study evaluates several machine learning models
and also an incorporated model that aids in predicting customer churn where the data
collected from Bengaluru regions in India about online food delivery is prioritized.
In order to make better predictions using machine learning, a variety of general
classifiers and ensemble classifiers are used, and their degree of functionality is assessed
by determining their accuracy and area under the ROC curve. According to the AUC
scores obtained for the individual classifiers, the Naïve Bayes and random forest
classifiers rank first with the same AUC score of 0.952. After dealing with this case,
the results show that the random forest classifier outperforms all other models used.
Keywords Voting classifier ·Ensemble ·AUC ·XGBoost ·Naïve Bayes ·
Random forest
J. Gerald Manju ·A. Dharini ·B. Kiruthika ·A. Malini (B)
Thiagarajar College of Engineering, Madurai, Tamil Nadu, India
e-mail: amcse@tce.edu
J. Gerald Manju
e-mail: gerald@student.tce.edu
A. Dharini
e-mail: dharinia@student.tce.edu
B. Kiruthika
e-mail: kiruthikab@student.tce.edu
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_8
1 Introduction
To combat the intense competition in the business sector, companies are placing
a greater emphasis on customer relationship management (CRM). Diligent
communication is required to sustain customers. Acquiring new customers involves a far
larger investment than retaining existing ones. The customer base is an important
resource, so companies have to put effort into holding on to existing customers.
This is attained through customer churn prediction, which in
turn helps in the development of retention strategies. The term ‘churn’ refers to both
consumers who switch service providers and those who leave the service providers
with whom they are entrusted. Additionally, it takes into account the likelihood that
some of the customers will leave.
Several data mining techniques are implemented to predict customer churn
utilizing machine learning models. In fact, studies have shown that gathering,
organizing, cleaning, processing, and analyzing data is an expensive process for producing
reliable forecasts. Moreover, research has shown that ensemble approaches are
better predictors than individual models. In order to achieve better results, the
effectiveness of several classifiers is evaluated.
The contributions of this paper are,
To create a classifier that is effective at predicting customer churn, incorporate
the classifiers’ training and testing results.
Employ Naive Bayes, logistic regression which are general classifiers and
ensemble classifiers like gradient boosting, XGBoost and random forest to forecast
customer churn.
Aggregate the results of the individual classifiers using voting classifier, consid-
ering the majority votes and analyze the effectiveness of the voting classifier.
Determine the efficacy of the distinct classifiers based on the AUC analysis.
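A sketch of the described pipeline on synthetic stand-in data (the Bengaluru survey data is not public, and XGBoost is omitted here to keep the example dependency-free):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# synthetic stand-in for the churn survey responses (label 1 = churned)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

base = [("nb", GaussianNB()),
        ("lr", LogisticRegression(max_iter=1000)),
        ("gb", GradientBoostingClassifier(random_state=0)),
        ("rf", RandomForestClassifier(random_state=0))]

# AUC of each individual classifier
aucs = {name: roc_auc_score(y_te, clf.fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
        for name, clf in base}

# majority-vote aggregation of the base classifiers
vote = VotingClassifier(estimators=base, voting="hard").fit(X_tr, y_tr)
vote_acc = accuracy_score(y_te, vote.predict(X_te))
```

Hard voting takes the majority class label, as described above; soft voting (averaging predicted probabilities) would be needed to compute an AUC for the ensemble itself.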
The remainder of the paper reviews the existing literature, explains the approaches
used in the study, and examines how well the machine learning classifiers performed
in making predictions, followed by the conclusion.
2 Literature Survey
Raeisi et al. used six data mining methods for e-commerce churn prediction
for an online Iranian food ordering service (gradient boosted tree, rule
induction, k-NN, random forest, decision tree, and Naïve Bayes) and concluded that the
highest accuracy of 86.90% was obtained by gradient boosted trees [1]. However, the
performance of the models could be evaluated one step ahead with AUC analysis.
Lalwani et al. used the XGBoost, AdaBoost, and CatBoost classifiers, Naïve Bayes, logistic regression, decision trees, SVM, random forest, and an extra trees classifier to provide a comparative study in the telecommunication industry, where the maximum accuracy was gained by the AdaBoost and XGBoost classifiers, following the ensembling approach with an AUC value of 0.84 [2]. However, their emphasis on the domination of ensemble methods over weak learners does not hold in all respects.
Abbasimehr et al. employed the C4.5 decision tree, SVM, ANN, and the RIPPER rule learner as base learners and improved the performance of these models using ensemble methods including bagging, boosting, stacking, and voting [3]. Above all, Boosting RIPPER and Boosting C4.5 indicate the domination of the combined strong learners over their base learners in predicting customer churn.
Sudharsan et al. utilized the Swish RNN strategy for the forecasting of churning
customers in the telecommunication industry [4]. By the S-RNN, sensitivity value of
98.27%, specificity of 92.31% and accuracy of 95.99% were observed and clustering
was done with a clustering algorithm called CLARA. This enabled quicker forecasts.
Fathian et al. examined heterogeneous bagging and boosting ensemble classifiers [5]. For this, a comparison was made between the sensitivity, F-measure, specificity, accuracy, and AUC of 14 models, and they concluded that the combination of PCA, SOM, and boosting (a heterogeneous approach) achieved the most desirable results.
Sharma et al. put forth the XGBoost algorithm as the best-performing algorithm, correctly classifying churners among the total churners with a highest true positive rate of 81% and an AUC of 85% [6]. However, the superiority of a single classifier's performance is not always dependable; consequently, combining different models can be one approach to obtaining an optimal model.
Dhini and Fauzan performed two ensemble learning approaches, namely extreme
gradient boosting and random forest [7]. They inferred that the XGBoost is the best
predictor.
Studies have emphasized the utilization of individual models and the optimistic aspects of ensemble models. This paper's intent is to execute a comparative analysis of the performance of several individual classifiers and of their heterogeneous combination using a voting classifier, evaluated by accuracy and AUC analysis.
3 Design and Methodology
Data collection is a fundamental component. The quality of the training data deter-
mines how accurately the machine learning models anticipate the future. Another
integral part of machine learning is data preprocessing, the process of generating clean and reliable data [8]. For an effective and efficient predictive model, the processing of the collected data, in addition to its collection, is very important for better results [9]. The process of choosing the features that have the greatest influence and best predict the target feature is known as feature selection, and correlation is a statistical technique for assessing the degree of association between two variables [10]. Depending on the k highest scores, the SelectKBest technique
chooses the features that have the greatest influence [11]. This is depicted in Fig. 1.
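As an illustrative sketch of this step (the feature matrix below is synthetic, not the actual survey data), SelectKBest keeps the k features with the highest scores:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic stand-in for the 388-row, 55-attribute churn dataset.
X, y = make_classification(n_samples=388, n_features=55, n_informative=10,
                           random_state=0)
X = np.abs(X)  # chi2 requires non-negative feature values

# Keep the 20 features with the highest chi-squared scores.
selector = SelectKBest(score_func=chi2, k=20)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)  # (388, 20)
```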
4 Classification Algorithms
4.1 General Classifiers
Logistic Regression Classifier The dependent variable is modelled through the logistic function. There are only two possible classes for the dependent variable; thus, binary data can be processed using this technique.
Naïve Bayes Classifier This algorithm is a Bayes theorem-based classifier and is
mostly utilized in text categorization. It is appropriate for a training dataset with
many dimensions. It is a probabilistic classifier that goes by the name ‘Naïve’ since
it makes the assumption that the occurrence of one characteristic is unrelated to the
occurrence of other features.
4.2 Ensemble Classifiers
Random Forest Classifier The ensemble strategy (bagging approach), which
employs multiple decision trees on diverse dataset subsets, is used by the random
forest classifier. It takes into account the predictions that have received the most votes
in order to increase predicted accuracy. As a result, it foretells the outcome.
Gradient Boosting Classifier This classifier is a combination of a number of weak
learning models with the boosting approach leading to a powerful predictor. This
process frequently uses decision trees. The residual errors of each weak learner’s
predecessor are used as labels for the training process.
XGBoost Classifier For large datasets, the gradient boosted trees approach is effec-
tively implemented by the XGBoost classifier. It is done to deliver accurate findings
and prevents overfitting [12].
4.3 Voting Classifier
A voting classifier trains several different models and then estimates an output from their combined predictions: it either averages the class probabilities of the classifiers submitted to it or anticipates the output in accordance with the majority of votes.
Online Food Delivery Customer Churn Prediction: A Quantitative 99
Fig. 1 Heterogeneous ensemble method
100 J. Gerald Manju et al.
Instead of developing distinct, specialized approaches and assessing their accuracy individually, a single improved model is produced. It supports two voting processes, known as hard voting and soft voting.
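The voting scheme described above can be sketched with scikit-learn. The snippet below uses synthetic data, and scikit-learn's GradientBoostingClassifier stands in for XGBoost so that the example depends only on scikit-learn:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the preprocessed churn data.
X, y = make_classification(n_samples=388, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Soft voting averages the predicted class probabilities of the base learners.
voting = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("nb", GaussianNB()),
                ("rf", RandomForestClassifier(random_state=0)),
                ("gb", GradientBoostingClassifier(random_state=0))],
    voting="soft")
voting.fit(X_tr, y_tr)
print(round(voting.score(X_te, y_te), 2))
```

Passing `voting="hard"` instead would take a majority vote over the predicted labels rather than averaging probabilities.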
5 Performance Metrics
5.1 Accuracy
Evaluating a machine learning algorithm is crucial; this is done by calculating the proportion of a classifier's correct predictions to all of its predictions, which is called accuracy [13]:

Accuracy = (number of correct predictions) / (total number of predictions)
5.2 Area under the Curve Analysis
Classification analysis uses the area under the curve (AUC) for evaluation and comparison. AUC is a scale that varies from 0 to 1: a model with an AUC of 0 makes unreliable predictions, while a model with an AUC of 1 makes entirely accurate predictions.
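A small, self-contained illustration of the two metrics (the labels and probabilities here are invented); note that AUC scores the predicted probabilities, while accuracy only sees the thresholded labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Invented labels and predicted probabilities for the positive class.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.7, 0.4, 0.6, 0.1, 0.8, 0.3])

# Accuracy uses only the hard labels obtained by thresholding.
y_pred = (y_prob >= 0.5).astype(int)
acc = accuracy_score(y_true, y_pred)     # 0.75

# AUC uses the probabilities themselves, rewarding good ranking.
auc = roc_auc_score(y_true, y_prob)      # 0.9375
print(acc, auc)
```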
6 Results and Discussion
An open-source online food delivery customer churn prediction dataset is taken from
Kaggle for the study. The dataset consists of 388 instances with 55 attributes as given
in Table 1. It is based on consumer trends [14], general purchase decisions, and the importance of delivery time and restaurant rating on purchasing. The attributes
are related to online food delivery preferences in Bengaluru region. The dataset helps
us predict whether the online food delivery customers churn or not with respect to
their preferences. The dependent attribute ‘Output’ indicates with the words ‘Yes’ or ‘No’ whether the client has churned. The dataset has categorical attributes of object type and continuous attributes of numeric type, and it includes variables in Likert form.
Various preprocessing steps are executed [15]. To overcome the chances of over-
fitting, dimensionality reduction methods are carried out. From the observations of
data visualization of continuous variables, it is shown that in age groups less than 25
Table 1 Characteristics of the dataset

Total count of instances           388 (0–387)
Total count of attributes          55
Count of attributes (categorical)  50
Count of attributes (numeric)      5
Count of missing values            Nil
Datatypes                          Float64: 2, Int64: 3, Object: 50
and family size of less than 4 reordering is frequent. Data redundancy is also reduced
by removing certain attributes such as ‘pincode’, ‘Meal (P1)’, ‘Meal (P2)’, ‘Medium
(P1)’, ‘Medium (P2)’, ‘lat’, ‘lon’, as they provided no information. The attributes with
values in Likert scale are transformed to ordinal rank order scale. Then, the correla-
tion matrix is determined using Spearman’s rank correlation method to understand
the intra-relationships between the attributes. A correlation is observed along the diagonals, and the attributes are ordinal rather than continuous. In such cases, applying principal component analysis for dimensionality reduction is not effective. The attributes with categorical values are converted to dummy variables.
The most influencing 20 attributes are selected using SelectKBest method [16].
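The preprocessing steps above (Likert-to-ordinal mapping, Spearman correlation, dummy variables) can be sketched as follows; the column names and values are hypothetical stand-ins for the survey attributes:

```python
import pandas as pd

# Toy frame standing in for the survey data; column names are hypothetical.
df = pd.DataFrame({
    "Age": [22, 34, 27, 41],
    "Family size": [2, 5, 3, 4],
    "Ease and convenient": ["Agree", "Strongly agree", "Neutral", "Disagree"],
    "Output": ["Yes", "No", "Yes", "Yes"],
})

# Transform Likert responses to an ordinal rank-order scale.
likert = {"Strongly disagree": 1, "Disagree": 2, "Neutral": 3,
          "Agree": 4, "Strongly agree": 5}
df["Ease and convenient"] = df["Ease and convenient"].map(likert)

# Spearman's rank correlation between the attributes.
corr = df[["Age", "Family size", "Ease and convenient"]].corr(method="spearman")

# Remaining categorical attributes become dummy variables.
df = pd.get_dummies(df, columns=["Output"], drop_first=True)
print(corr.shape, list(df.columns))
```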
Various classification models are taken to construct an ensemble classification
model to improve the performance of the data mining techniques [17]. Classifiers
like random forest, XGBoost, gradient boost, Naive Bayes, logistic regression are
used as base learners to create an ensemble classification model utilizing the voting
classifier, which is regarded as a strong learner as depicted in Fig. 2[18].
Primarily, the quality of the individual classifiers is assessed using the performance
metric accuracy. Among the five individual classifiers used, XGBoost classifier has
the best accuracy of 100% which is represented in Fig. 3.
To predict the output class, the accuracy of each classifier is then combined using
a voting classifier based on the majority of votes by soft voting. The combination of
the performances of the multiple classifiers gives an accuracy of 98%. Basically, the
voting classifier is regarded as a versatile classifier but it is important to understand
that the voting ensemble method has its limitations [19]. There is a possibility for an
individual classifier to outperform a group of classifiers and this is clearly observed
from our study, as the XGBoost classifier outperforms the voting classifier. The accuracies of the individual models, along with the voting classifier, are represented in Fig. 3.
From the AUC analysis represented in Fig. 3, the Naïve Bayes and random forest classifiers have the same AUC scores. To resolve this tie, the difference between the training score and the testing score is considered: the most successful model is the classifier with the smallest difference.
From Table 2, we infer that random forest classifier is the most effective classi-
fier since there is the least difference between its training and testing scores which
suggests that the model is less likely to make incorrect predictions and so lessens the
overfitting issue.
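This tie-breaking rule can be expressed directly with the scores reported in Table 2:

```python
# Train/test scores from Table 2; the smaller the gap, the less the
# model appears to overfit.
scores = {"Naive Bayes": (0.88, 0.96), "Random forest": (1.00, 0.97)}
gaps = {name: abs(train - test) for name, (train, test) in scores.items()}
best = min(gaps, key=gaps.get)
print(best)  # Random forest
```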
Fig. 2 Combination of classifiers
Fig. 3 Comparative analysis of accuracies and AUC scores
Table 2 Comparative analysis of training and testing scores

Model           Training score   Testing score
Naïve Bayes     0.88             0.96
Random forest   1.00             0.97
7 Conclusion
This paper’s primary objective is to implement an assessment of the effectiveness of
machine learning classifiers used on customer churning data. Specifically, it focuses
on the effectiveness of the voting classifier. Basically, voting classifier is recom-
mended as it is a more powerful meta-classifier that neutralizes the weaknesses of
the individual classifiers on a particular dataset. It is an ensemble method that incor-
porates the outcomes of multiple models to arrive at the ultimate optimal outcome and
then makes predictions. It is essential to comprehend that voting-based models cannot serve as a generic machine learning strategy, as the voting ensemble method is not without its drawbacks. In some instances, a single model can outperform a collection of models when its accurate predictions are nullified by the voting classifier.
This particular aspect of the voting classifier is witnessed through this study as
the individual model—XGBoost classifier outperforms the voting classifier with an
accuracy of 100%. Here, the voting classifier nullified the accurate prediction by the
XGBoost classifier.
Most classifiers not only predict the classes but also output the probability of each prediction; accuracy, however, makes no use of this probability. Area under the curve analysis assesses these probabilities with greater precision. From the AUC scores obtained for the individual classifiers, it is observed
that random forest and Naïve Bayes classifiers stand first having the same AUC score
of 0.952. The results show that random forest classifier outperforms all other models
after this case has been handled.
Currently, the data used for this study is limited to the online food delivery sector and may not apply in the same way to other sectors. In future, the analysis can be extended to different sectors. Explainable artificial intelligence (XAI) is a more reliable technique that helps unveil the underlying prediction process and interpret otherwise impenetrable machine learning models [20]. This in turn serves to explain each feature's contribution, giving a better understanding of the features that most influence customer churn and thereby assisting in its mitigation.
References
1. Raeisi S, Sajedi H (2020) E-commerce customer churn prediction by gradient boosted trees. In:
2020 10th international conference on computer and knowledge engineering (ICCKE). IEEE,
pp 55–59
2. Lalwani P, Mishra MK, Chadha JS, Sethi P (2022) Customer churn prediction system: a machine
learning approach. Computing 104(2):271–294
3. Abbasimehr H, Setak M, Tarokh MJ (2014) A comparative assessment of the performance of
ensemble learning in customer churn prediction. Int Arab J Inf Technol 11(6):599–606
4. Sudharsan R, Ganesh EN (2022) A Swish RNN based customer churn prediction for the telecom
industry with a novel feature selection strategy. Connect Sci 34(1):1855–1876
5. Fathian M, Hoseinpoor Y, Minaei-Bidgoli B (2016) Offering a hybrid approach of data mining
to predict the customer churn based on bagging and boosting methods. Kybernetes 45(5):732–
743
6. Sharma T, Gupta P, Nigam V, Goel M (2020) Customer churn prediction in telecommunications
using gradient boosted trees. In: Khanna A, Gupta D, Bhattacharyya S, Snasel V, Platos J,
Hassanien A (eds) International conference on innovative computing and communications.
Advances in intelligent systems and computing, vol 1059. Springer, Singapore, pp 235–246
7. Dhini A, Fauzan M (2021) Predicting customer churn using ensemble learning: case study of
a fixed broadband company. Int J Technol 12(5):1030–1037
8. Jagadeesan AP (2020) Bank customer retention prediction and customer ranking based on deep
neural networks. Int J Sci Dev Res (IJSDR) 5(9):444–449
9. Momin S, Bohra T, Raut P (2020) Prediction of customer churn using machine learning. In:
EAI international conference on big data innovation for sustainable cognitive computing. EAI/
Springer innovations in communication and computing. Springer, Cham, pp 203–212
10. Fujo SW, Subramanian S, Khder MA (2022) Customer churn prediction in telecommunication
industry using deep learning. Inf Sci Lett 11(1):185–198
11. Domingos E, Ojeme B, Daramola O (2021) Experimental analysis of hyperparameters for deep
learning-based churn prediction in the banking sector. Computation 9(34):1–19
12. Sree GMA, Ashika S, Karthi S, Sathesh V, Shankar M, Pamina J (2019) Churn prediction in
telecom using classification algorithms. Int J Sci Res Eng Dev 2(1):1–16
13. Ahmad AK, Jafar A, Aljoumaa K (2019) Customer churn prediction in telecom using machine
learning in big data platform. J Big Data 6(28):1–24
14. Dias J, Godinho P, Torres P (2020) Machine learning for customer churn prediction in retail
banking. In: International conference on computational science and its applications. Springer,
Cham, pp 576–589
15. Shirazi F, Mohammadi M (2019) A big data analytics model for customer churn prediction in
the retiree segment. Int J Inf Manage 48:238–253
16. Khodabandehlou S, Rahman MZ (2017) Comparison of supervised machine learning tech-
niques for customer churn prediction based on analysis of customer behavior. J Syst Inf Technol
19(1/2):65–93
17. Kumar AS, Chandrakala D (2016) A survey on customer churn prediction using machine
learning techniques. Int J Comput Appl 154(10):13–16
18. Al-Najjar D, Al-Rousan N, Al-Najjar H (2022) Machine learning to develop credit card
customer churn prediction. J Theor Appl Electron Commer Res 17:1529–1542
19. Xu T, Ma Y, Kim K (2021) Telecom churn prediction system based on ensemble learning using
feature grouping. Appl Sci 11(4742):1–12
20. Tavassoli S, Koosha H (2022) Hybrid ensemble learning approaches to customer churn
prediction. Kybernetes 51(3):1062–1088
Prevention Equipment for COVID-19
Spread Using IoT and Multimedia-Based
Solutions
T. S. Dhachina Moorthy, N. Nimalan, S. Sridevi, and B. Nevetha
Abstract The global spread of COVID-19 is a growing concern for everyone.
The virus is transmitted through droplets and airborne particles from one person
to another. The World Health Organization (WHO) recommends wearing a face
mask, social distancing, avoiding crowded areas, and maintaining a strong immune
system to reduce the spread of COVID-19. In response to the pandemic, many coun-
tries have implemented lockdowns to control its spread. Research has shown that
wearing masks in public can help prevent person-to-person transmission of the virus.
This paper proposes a device that uses cameras to detect elevated body temperature,
people wearing face masks, those not wearing face masks, and calculates proximity
among individuals. The proposed model can be deployed in public places such as
shopping malls, hotels, apartment entrances, airports, hospitals, and offices to main-
tain safety standards. The system uses Internet of Things (IoT) technology and deep
learning mechanisms to detect individuals who may be infected with COVID-19. The
proposed framework is evaluated using the face mask detection and social distance
detecting algorithms in the TensorFlow library. A non-contact sensor is used to check
the temperature of each person passing through the device. To ensure ease of use, an
animated film is used to help people understand how to operate the proposed system.
A multimedia application is also employed to display the system’s output to end-users
in the form of visualizations or reports, accompanied by an alarming sound to remind
individuals to maintain distance or avoid crowded areas. The proposed system, when
implemented, can help prevent the spread of COVID-19 and save lives.
T. S. D. Moorthy ·N. Nimalan ·S. Sridevi (B)·B. Nevetha
Department of Information Technology, Thiagarajar College of Engineering, Madurai, Tamil
Nadu, India
e-mail: sridevi@tce.edu
T. S. D. Moorthy
e-mail: dhachina@student.tce.edu
N. Nimalan
e-mail: nimalan@student.tce.edu
B. Nevetha
e-mail: nevetha@student.tce.edu
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_9
105
Keywords COVID-19 ·Face mask detection ·Camera ·IoT ·Sensors ·
Temperature detection
1 Introduction
The global outbreak of the coronavirus has raised concerns among the public
regarding the spread of the virus. To help slow down and ultimately stop the spread
of the virus, society is seeking tools to aid in the detection of infections. Although
there are currently no thermal cameras available that can detect the virus, forward-
looking infrared (FLIR) cameras can be used as an additional tool for body temper-
ature screening in high-traffic public places, allowing for quick individual screening
[1,2]. If an individual’s skin temperature in key areas is higher than the average
temperature, they may be selected for additional screening. These cameras use the
thermography temperature measurement technology, which allows for accurate non-
reactive, contactless, and planar recording of surface temperatures. This makes them
suitable for fast and easy detection of increased body temperatures, which may indi-
cate a possible virus infection among individuals who have undergone the screening
process. Using the infrared cameras, body temperature is measured at the inner angle of the eye. An alarm is raised at the slightest difference, and elevated body temperatures can thus be displayed.
Besides that, the device can detect, alert, and hopefully remind a person who is
not wearing a mask to wear one before entering a venue or facility. To slow the spread of the coronavirus, the Centers for Disease Control and Prevention and state health agencies have advised people to maintain social distance and wear masks in
public over the past few months [3,4]. A small camera is built into the device. When
a person without a mask approaches us, it alerts us, flashes bright lights, and sends
out a loud audio alert reminding them to wear a mask. The device can detect all types
of face masks, including medical masks and scarves. The device’s goal is to remind
people to wear masks, especially during a health.
The proposed work also includes an additional feature in which cameras can
measure the distance between people, and report the person when they are too near
to each other. Time-of-flight technology is used by the sensors to enable precise
monitoring of lines of people. Using fully anonymous data collection, sensors can
identify the existence of people and calculate the proximity to the neighboring person.
Ultrasonic sensors with high accuracy, miniaturization, and low power consumption
are a perfect technology for solutions that prevent infection by social distancing. To
check if people are maintaining regulation distances, real-time data is linked to
a visual and/or audio system [5,6]. In accordance with the local regulations and
current public health advice, the system allows a minimal distance threshold between
people to be set. A signal is sent to enable a visual/audio alert when the distance is
violated—which makes sure that the distracted people figure out that they are not
respecting distancing. To ensure social distancing measures are maintained, social
distance monitoring devices can be installed to monitor entrances of public places.
Businesses can make sure that the safety of staff and customers is maintained, while
adhering with rules and regulations and health constraints caused by COVID-19.
2 Existing System
Handheld temperature scanners also called ‘temperature guns’ have been used for a
long period of time for checking the temperature of individuals. The temperature is
determined by the thermal radiation emitted by the object. This prevents individuals
from touching each other or touching the device, but consider this: what if the person who checks body temperatures has the coronavirus but is asymptomatic? He must come into close proximity with others, which is exactly the contact a pandemic requires us to avoid. The impact of COVID-19 across the world is shown in Fig. 1 [7]. There is a high risk of spread due to the reduced distance between individuals. In some places right now, people are appointed to hold such a scanner and check the temperature of customers [8–10]. This results in a waste of both time and money. The proposed device can prove helpful for doctors too. The usage of thermometers at hospitals can be a disadvantage for doctors, as they take some time to produce a reading, and this lost time can affect a doctor’s profit per day. Using our device, you can ensure a safer distance and instant temperature readings, which can result in a better profit.
The existing system used IoT-enabled devices to predict and prevent the spread of COVID-19. Some people, however, do not know how to use these devices. To create awareness and demonstrate the working of the proposed system, we also focus here on multimedia applications. With the help of
Fig. 1 Total confirmed cases across the world on May, 2020 (Source: https://www.newsclick.in/
covid-19-graphs-cases-recoveries-deaths)
multimedia applications, one should understand how to use the IoT-based enabling
devices [11,12].
3 Proposed Methodology
The solution is a device that acts as a multitasking machine: it identifies the temperature of people nearby, checks whether everyone around is wearing a mask, and helps maintain a safe distance between every person. Though there are no thermal
cameras which can detect the virus, in addition to conventional body temperature
screening technologies, forward-looking infrared (FLIR) cameras can be utilized for
quick individual screening to identify people with excessive skin temperatures in
high-traffic public areas. The proposed work develops a fever screening and tracking
system using the thermal and normal cameras present in the device. We train the
network of cameras with deep learning algorithms, where the video feed is given as input and the corresponding result is obtained. The algorithm will be precise, as the device is trained with a large number of images to identify whether a given individual is wearing a mask [7, 13–15]. Using fully anonymous data collection, sensors can
identify the existence of people and calculate the proximity to the neighboring person.
Ultrasonic sensors with high accuracy, miniaturization, and low power consumption
are a perfect technology for solutions that prevent infection by social distancing. In
accordance with the local regulations and current public health advice, the system
allows a minimal distance threshold between people to be set [16,17]. A signal is
sent to enable a visual/audio alert when the distance is violated—which makes sure
that the distracted people figure out that they are not respecting distancing.
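The threshold check described above can be sketched as follows; the positions and threshold value are illustrative, not taken from the chapter:

```python
import numpy as np
from scipy.spatial.distance import pdist

# Positions of detected people in the frame, in centimetres (invented).
positions = np.array([[0.0, 0.0], [120.0, 0.0], [130.0, 160.0]])

MIN_DISTANCE = 150.0   # configurable threshold, per local regulations

# Any pair closer than the threshold triggers the visual/audio alert.
violation = bool((pdist(positions) < MIN_DISTANCE).any())
print(violation)  # True: the first two people are only 120 cm apart
```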
3.1 Proposed System Design
See Fig. 2.
3.2 Prototype System Architecture
See Figs. 3,4and 5.
The entire process for the working prototype model is displayed in Fig. 6, starting
with the import of the dataset, followed by the start of the video stream, face detection
in the video stream, and application of the face mask classifier to the face ROI to
determine whether there is a mask. The results are displayed in a box around the face
ROI that is highlighted. The software then looks for neighboring faces if a mask is
found.
Fig. 2 Proposed system design
Fig. 3 Sample input file for object detection
Fig. 4 Face mask dataset
Fig. 5 Without face mask dataset
The person identification model begins and tries to identify the person if a mask
is not there, sends a notification to the impacted person after the person identification
model has been run successfully. Until every person in the frame is covered by a
mask, this process keeps repeating.
Fig. 6 Prototype system architecture
3.3 Face Mask Detection Using Hybrid Convolution Neural
Networks (CNN) Algorithm
The face mask that is present on every person’s face is identified in this study using
hybrid convolution neural networks (CNN). But there is a slight modification here.
The image datasets are supplied as input in the form of arrays and then passed into MobileNet, where maximum pooling takes place. The
major applications of hybrid CNN, a type of artificial neural network, are image
recognition and analysis. CNN is specifically designed to analyze pixel input. A
result is produced by combining between 100 and 1000 filters, and the resulting
output is then passed on to the following layer of the neural network [18–20]. Keras and TensorFlow are used to train the mask detection model. The process used in the algorithm is explained in Fig. 7.
Fig. 7 Face mask detection using CNN
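A minimal sketch of a MobileNet-based mask classifier of the kind described above; the head layers and their sizes are illustrative assumptions, not the authors' exact architecture, and the weights are left untrained here:

```python
import numpy as np
import tensorflow as tf

# MobileNetV2 base (untrained in this sketch) with a small pooling +
# dense head for the binary mask / no-mask decision.
base = tf.keras.applications.MobileNetV2(input_shape=(224, 224, 3),
                                         include_top=False, weights=None)
base.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalMaxPooling2D(),   # the maximum-pooling step
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# The image datasets are supplied as arrays; one random image stands in.
probs = model.predict(np.random.rand(1, 224, 224, 3).astype("float32"))
print(probs.shape)  # (1, 1)
```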
3.4 Social Distancing Detection
Using YOLO V3, the proposed work detects the people present in the given video
dataset or live video feed. To track the people, we draw boxes around each individual, measure their centroids, and give each a unique ID, as shown in Fig. 8.
The algorithm identifies each person; in the next stage, we must track each as the same person as they move. In the second subfigure above, the purple point is the initial position and the yellow point is the position after the person has moved. To know that
it is the same person who has moved from here to there, we measure the Euclidean
distance from every old and new centroid and the close pairs will be considered as the
same person. The working of the distance measuring algorithm is shown in Fig. 9.
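The centroid-matching step can be sketched with SciPy (the coordinates are invented):

```python
import numpy as np
from scipy.spatial.distance import cdist

# Centroids of people in the previous frame (IDs 0 and 1) and in the
# new frame, in pixel coordinates.
old = np.array([[100.0, 200.0], [300.0, 120.0]])
new = np.array([[305.0, 118.0], [104.0, 203.0]])

# Euclidean distance between every old and new centroid; the closest
# pair is treated as the same person who has moved.
d = cdist(old, new)
match = d.argmin(axis=1)   # new-frame index assigned to each old ID
print(match)               # ID 0 -> new[1], ID 1 -> new[0]
```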
We need to calculate the space between everyone coming across the device. Let the doodle in the image be the object, photographed by a camera. First, we calibrate the focal length from a reference photo of the object at a known distance using F = (P × D) / W, where F is the focal length, P is how many pixels the object covers in the photo, D is the distance, and W is the object width.
Fig. 8 Assigning ID to centroids
Fig. 9 Distance measuring
algorithm
Using that focal length, as the camera moves, the distance changes and so does the number of pixels covered. With this information, the new distance is calculated as D = (W × F) / P and saved.
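A sketch of this triangle-similarity calculation, with made-up calibration numbers:

```python
# Triangle-similarity distance estimation.
def focal_length(P, D, W):
    """Calibrate the focal length: F = (P * D) / W."""
    return (P * D) / W

def distance(F, W, P):
    """New distance for a photo covering P pixels: D = (W * F) / P."""
    return (W * F) / P

# Calibration: an object 30 cm wide, 100 cm away, spans 300 pixels.
F = focal_length(300, 100, 30)   # 1000.0
# Later the same object spans only 150 pixels, so it is farther away.
print(distance(F, 30, 150))      # 200.0
```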
3.5 Temperature Detection
The Arduino temperature sensor transforms the ambient temperature into a voltage. It then converts the voltage to Celsius, then to Fahrenheit, and displays the Fahrenheit temperature on the LCD panel. The circuit is shown in Fig. 10.
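The conversion chain can be sketched in Python; the 0.5 V offset and 10 mV/°C scale assume a TMP36-style sensor, which is an assumption on our part, as the chapter does not name the exact part:

```python
# Voltage -> Celsius -> Fahrenheit, as in the Arduino sketch described
# above; constants assume a TMP36-style sensor (0.5 V offset, 10 mV/°C).
def to_celsius(voltage):
    return (voltage - 0.5) * 100.0

def to_fahrenheit(celsius):
    return celsius * 9.0 / 5.0 + 32.0

v = 0.75                 # volts read from the analog pin
c = to_celsius(v)        # 25.0 degrees Celsius
print(to_fahrenheit(c))  # 77.0 degrees Fahrenheit
```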
4 Functional Requirements
4.1 Hardware Requirements: Infrared Thermometer
In its most basic design, an infrared thermometer focuses IR energy onto a detector, which converts the energy into an electrical signal displayed in units of temperature [11, 21, 22].
This ensures temperature checking without going into proximity of the person. As
a result, the infrared thermometer can be used to measure temperature in situations
when other instruments cannot produce accurate outputs.
Thermal Imaging Cameras
With a 180° rotating optical block, the FLIR T865 thermal imaging camera is a
non-contact inspection tool that enables users to comfortably and safely evaluate
the state of crucial mechanical and electrical equipment in utility and manufacturing
applications.
Buzzer
A buzzer or beeper is an audio signaling device which is used for alerting people when someone without a mask comes into a specific region (Fig. 10).
4.2 Software Requirements
Face Mask Detection
TensorFlow
Fig. 10 Circuit diagram of temperature sensor
It is the open-source platform used for machine learning. Some packages, such as the image data generator, are imported, and MobileNet is imported from TensorFlow and Keras.
Keras
It is an open-source library that provides an interface for neural networks. All packages imported from TensorFlow are also imported from the sub-package Keras.
Imutils
It is a sequence of image processing functions.
NumPy
NumPy is used for mathematical functions. Here, NumPy is used to store the
images with and without mask as separate arrays.
OpenCV-Python
OpenCV provides the entire computer vision library and tools.
Matplotlib
Matplotlib is used to create diagrammatic visualizations. Here, the training loss and accuracy values are plotted as the line chart shown in Fig. 12.
Argparse
Argparse is used to write user-friendly command line interfaces.
SciPy
SciPy provides algorithms for optimization and scientific computing.
Scikit learn
Fig. 11 Results of face
mask detection
It provides efficient tools for predictive data analysis.
Social Distancing Detection
YOLO V3
A real-time object detection method, you only look once (YOLO), recognizes items in videos and live feeds. Here, it is used to identify the people present in the video dataset and to ensure a safe distance between them.
Imutils,
NumPy,
OpenCV-Python,
SciPy.
Temperature Monitoring
Tinkercad
Tinkercad is an online Arduino simulation website which is used here for the
online circuit design of our temperature sensor shown in Fig. 10.
Fig. 12 Plot of the loss or accuracy of training and value versus epoch
5 Results and Discussion for Multimedia-Based Strategic
Plans to Prevent Spread of COVID-19
The relevant images and videos are collected, and audio is added and edited
using the Final Cut Pro video editing tool. An interactive video is created so
that it is simple for individuals to use. Moreover, multimedia applications are
used to deliver the results of the proposed model to end users as reports or
visualizations, along with alarm sounds. The sound warns people to keep their
distance or avoid crowds (Figs. 12, 13, 14 and 15).
The proposed multimedia-based application to prevent COVID-19 is uploaded to
YouTube at the following link: https://youtu.be/LfAga_j8F5k.
6 Business Impacts
The arrival of our project in the market will have a major impact on handheld
temperature scanners, as there is no need to point the device at a person's
forehead. It is enough for a person to simply pass by, and the device will
report whether the person has a fever. This also ensures a safer distance
between individuals, which is not possible with a handheld scanner. The device
will be used mostly during pandemics, which must eventually come to an end, so
it will lie dormant as it is not a daily-use item. Therefore, the project will
not have many buyers during normal times, but its usage will boom during
difficult times. This
Fig. 13 Output
Fig. 14 Algorithm
comparison
Fig. 15 Confusion matrix
device can save doctors the time spent waiting for a thermometer's mercury
reading to rise or fall, as it scans body temperature as soon as the patient
enters the doctor's room. This helps them diagnose many more patients than
before, increasing the doctor's income.
7 Conclusion
The availability of smart technology and new breakthroughs promotes the
development of new models, which will help fulfill the demands of developing
nations. An IoT-enabled smart gadget is created in this study to measure
proximity between individuals, detect face masks, and measure body temperature,
all of which can improve public safety. This adds an additional layer of
prevention against the spread of COVID-19 infection while also helping to
reduce labor requirements. The model makes use of IoT to identify face masks,
detect temperatures, and track the proximity of all people present at any one
time. Moreover, the gadget is scalable and feasible, and there are many ways to
boost its performance further. The suggested approach and multimedia-based
software would assist in maintaining safety standards as states and
municipalities adopt reopening plans throughout the COVID-19 epidemic.
Renal Disease Classification Using Image
Processing
Rohan Sahai Mathur, Varun Gupta, Tushar Bansal, Yash Khare,
and Sanjay Kumar Dubey
Abstract The growth of renal disease has increased gradually, affecting millions
of people, and the number of affected people increases each year. Chronic kidney
disease usually occurs due to abnormal albumin excretion, which reduces kidney
function for more than three months. In terms of life expectancy, 7.6% of deaths
are due to chronic kidney disease, and it accounted for 4.6% of all-cause
mortality. The best way to treat renal diseases is early prophylaxis, achieved
by accurately diagnosing the patient at a very early stage. The diagnostic
methods include ultrasonographic diagnosis, which is a cheaper, more convenient,
and timelier method. This paper presents renal disease detection and
classification using supervised techniques, which classify disease with up to
97% accuracy, and uses image processing tools for kidney stone detection with
an accuracy of 95%.
Keywords Machine learning · Image processing · Chronic kidney disease ·
Kidney stone · Ultrasound images
1 Introduction
The World Health Organization (WHO) considers chronic kidney disease (CKD)
one of the most significant public health issues. It affects millions of
people, and a 2% increase in the number of affected people is observed each
year. The disease has spread across the globe and remains a crucial public
health problem, affecting 12% of the world's population. Chronic kidney
disease is a progressive, irreversible loss of renal function caused by a
decreasing glomerular filtration rate, which leads to
R. S. Mathur (B) · V. Gupta · T. Bansal · Y. Khare · S. K. Dubey
Amity University, Uttar Pradesh, Sector 125, Noida, India
e-mail: rsmathur74@gmail.com
S. K. Dubey
e-mail: skdubey1@amity.edu
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_10
the complete deterioration of the kidney. When a kidney is heavily damaged then it
loses its capability to filter the blood effectively [1].
Chronic kidney disease occurs when abnormal albumin excretion causes kidney
damage, reducing proper kidney function for more than three months. Kidney
function can be assessed by measuring it directly or by estimating the
glomerular filtration rate (GFR). CKD has a major influence on global health,
as it increases mortality either as an associated risk factor for
cardiovascular diseases or by causing morbidity and mortality directly [2].
Patients with stage 1 and stage 2 CKD suffer from renal disease with zero
symptoms, showing only minor abnormalities in kidney function such as
electrolyte, metabolic, or endocrine imbalances detected on serum examinations.
Thus, there is a basic need to develop safe, rapid, and cost-effective
strategies to screen for and diagnose CKD precisely, so that preventive
measures can be started at the earliest to slow the deterioration of kidney
function. In patients at all phases of CKD, kidney ultrasonography, which is
broadly accessible at minimal expense, can be performed to survey structural
changes related to CKD, for example, decreased kidney size or changes in the
texture of the renal parenchyma. Echocardiographic imaging might disclose
chamber dilation, diastolic/systolic dysfunction of the ventricles and atria,
left ventricular wall hypertrophy, valve stenosis brought on by early
degenerative valvular disease, and decayed right ventricular capability due to
pulmonary hypertension. Patients in high phases of renal dysfunction vary from
the healthy population in their hydration, with cyclic fluid changes. In the
typical condition, the kidneys control the body's fluid volume and water
homeostasis [3].
In the current scenario, kidney diseases are detected by continuously
monitoring specific parameters obtained through diagnostic tests. Statistical
models are then used to determine the actual presence or absence of the
disease and to analyze its severity. This analysis can be automated through
models based on artificial intelligence and machine learning, which may help
obtain statistically better results or better-performing solutions. Machine
learning algorithms can be used in many ways, such as object detection (which
can be used to detect kidney stones), classification of a kidney mass type,
and prediction of the severity of the disease [4].
The contributions made are as follows:
Using image processing techniques such as image inpainting and recognition to
identify the presence of kidney stone(s) inside the kidney.
Use of machine learning techniques to predict the efficiency of the therapy
given to chronic kidney disease stage V dialysis patients (hemodialysis), and
to suggest a course of action.
2 Related Work
Early renal classification is an especially important task because untreated
renal disease can lead to chronic kidney disease. This classification uses
machine learning algorithms such as decision tree and SVM, which give better
results than the traditional method, with around 84% accuracy [5]. The
prediction of T-cell epitopes for SARS-CoV-2 can also give additional
information like protein–protein interaction and structural information [6].
tional information like protein–protein interaction and structural information [6].
The ensemble learning technique can be used to predict T-cells with more
accuracy [7]. Medical images are an important part of diagnosing any disease,
and for better and more efficient kidney disease prediction, deep learning can
be the best tool [8]. Disease management requires early detection, and machine
learning can give more accurate CKD prediction with predictors like blood pressure,
serum creatinine [9]. Kidney identification is mostly performed
semi-automatically or automatically. An earlier study found that end-stage
renal disease (ESRD) is the most severe stage, in which patients go into
critical condition and ultimately need a transplant [10]. A system has been
developed that targets early detection of chronic kidney disease in diabetic
patients with the assistance of artificial intelligence (AI) techniques and
recommends a decision tree to arrive at valid outcomes with beneficial
precision [11]. Apart from CKD,
acute kidney injuries (AKI) also critically affect ill patients. The development of the
method of identifying helps in reducing the complications of AKI. There has been a
development of the model which used five supervised learning models to detect AKI
using deep learning [12]. In kidney disease detection, detection of the glomerular
lesions is the major component which is time-consuming and should be accurately
done. A framework developed has been made based upon a deep neural network to
locate glomeruli and quantify distinct glomerular cells [13]. Machine learning has
been proven to be important for CKD prediction with high accuracy.
Heterogeneous modified ANN and backpropagation have proved to have high
accuracy in preprocessing ultrasound images of the kidney and help detect the
region of interest more precisely [14]. The case-based reasoning method has
been proven to provide an excellent neural network-based renal disease prediction.
This model uses demographic data for training along with some medical data [15].
Artificial neural network methods such as SVM and KNN can be used to predict
CKD with greater accuracy, sensitivity, and specificity [16]. It is crucial for
a patient with CKD to receive renal replacement therapy (RRT), i.e., kidney
transplantation or hemodialysis, at the right time to ensure the patient's
well-being [17].
3 Methodology
In this project, both machine learning and image processing are deployed to develop
a machine-level understanding of chronic kidney disease and kidney stones, respec-
tively. Machine learning uses dialysis patients’ data to analyze the therapy of CKD
stage V patients and image processing uses patients’ ultrasound images to detect
kidney stones.
Different machine learning algorithms, such as KNN and decision tree, are used
to train the models. At the conclusion of the model-building stage, a confusion
matrix is used to determine the quality of the machine learning model. Lastly,
renal diseases are predicted using the classification model mentioned in the
previous stage. Figure 1 shows the methodology the project follows and the
steps to ensure the effectiveness of the image processing model. The steps for
the image processing model are as follows.
3.1 Ultrasound Pictures
The first step is to collect various ultrasound images from hospitals, some of
which contain stones and others which do not. This dataset helps determine the
effectiveness of the model created.
Fig. 1 Methodology of
image processing
3.2 Image Pre-processing
The second step is to remove all unnecessary noise from the ultrasound image,
such as text or lines. This can be done using methods like image inpainting, in
which missing parts of the ultrasound image are recovered or filled. Another
method is noise filtering, which detects text or lines on the ultrasound image
and removes them; image inpainting can then be used to fill the resulting gaps.
3.3 Diagnosis of the Kidney’s Outer Contour
In this step, after removing all the noise from the ultrasound images, the
outer portion of the kidney is highlighted to narrow the area of interest in
the image. This is done using various image processing techniques to reduce
the region where the stone can be found in the kidney.
3.4 ROI Identification
After reducing the area of interest, the main region of interest (ROI) is
determined in the kidney by removing all connected pixel components whose size
is less than p pixels with the help of a MATLAB function. These steps help to
further shrink the region of interest.
3.5 Detection of the Stone in the ROI
As the region of interest is determined, the stone can be found by overlaying the
output image with the original image. This helps to find the region where the stone
is present in the kidney.
3.6 Labeling of the Ultrasound Image
The last step is to label the image and notify the user whether a stone is
detected. If a stone is detected in the kidney, the user can take the necessary
steps to avoid any mishap or visit a doctor for further consultation.
4 Experimental Work
Two methods have been adopted in this paper: machine learning for training and
testing on the collected dataset, which includes parameters like hemoglobin,
urea, and creatinine from chronic kidney disease patients, and image processing
for classifying ultrasound images collected from kidney stone patients.
4.1 Machine Learning
The dataset has been collected from hospitals and labeled by consulting experts
in the medical field and by building on the results obtained from unsupervised
learning.
Then, this dataset has been divided into k folds, using k − 1 folds
for training and one fold for testing purposes. In this way, cross-validation of the
findings from supervised learning becomes easier. Supervised learning is a
subfield of machine learning and artificial intelligence in which an algorithm
is trained on a labeled dataset to classify data. There are several supervised
learning methods that have been used, including linear regression. The linear
regression model performs a regression task, predicting a target value based
on independent variables. It is also used for forecasting and estimating the
relationships between variables. Logistic regression is used for statistical analysis to predict
yes-or-no outcomes based on prior observations of a dataset. It helps predict
dependent data variables using the relationship between one or more existing
independent variables. The support vector machine creates the best decision
boundaries for segregating n-dimensional space into classes; SVM places new
data points correctly with ease so that the model can be used in the future.
The Naive Bayes algorithm is used for text classification problems with
high-dimensional training datasets. A decision tree is used as a decision
support tool that lays out possible consequences, including chance event
outcomes, resource costs, and utility. Lastly, random forest is used for both
classification and regression; it combines multiple classifiers to solve
complex machine learning problems and improves the performance of the model.
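The k-fold procedure and supervised models above can be sketched with scikit-learn. The dataset below is synthetic, and the feature count, fold count, and tree depth are illustrative assumptions, not the paper's actual data or settings.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the patient dataset: five lab parameters
# (e.g. hemoglobin, urea, creatinine, total protein, albumin)
# and a three-class therapy-response label (0, 1, 2).
X = rng.normal(size=(300, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int) + (X[:, 2] > 1).astype(int)

# k-fold cross-validation: k-1 folds train, one fold tests, rotated
clf = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())
```

Swapping `criterion="gini"` for `"entropy"` reproduces the two splitting rules compared later in the results section.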
4.2 Image Processing
The project uses various concepts of image processing to detect a stone in the
human kidney. A model has been made to find the specific location of the stone,
or the region of interest, in the human kidney. The steps to determine the
specific location of the stone in the human kidney are as follows.
4.2.1 Preprocessing
The initial step in our image processing is to remove all the unimportant data
that is not useful for our purpose. This unimportant data includes all the
markings that doctors make while examining the kidney reports, and these can be
removed successfully using image processing. To get our ROI (region of
interest), we must convert our image from RGB to grayscale, then grayscale to
binary, and so forth. However, after conversion, we will not see any color
difference between the RGB image and the grayscale image, as the RGB
ultrasound image contains no color to begin with. The next step is to convert
the image into a binary image. But before that, we first check the histogram to
verify whether global thresholding can be applied.
After applying the LFT thresholding, we can see that the histogram is not
bimodal, as there is no peak in the lower part. However, we can also use the
pixel information to find a threshold value by trial and error. As the cursor
moves to a brighter part, the intensity value increases; in our case, we have
selected 20. This means that any intensity above 20 returns 1, and anything
else returns 0. This is how we obtain our output. However, since the result
contains many holes, we need to fill these holes for it to work properly. Once
the holes are filled, we can see our ROI vividly, although some unnecessary
elements still need to be cleared up.
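The thresholding and hole-filling steps can be sketched as follows. The `to_binary_mask` helper and the toy image are our own illustrative stand-ins; only the threshold of 20 comes from the text.

```python
import numpy as np
from scipy import ndimage

def to_binary_mask(gray, thresh=20):
    """Threshold a grayscale image: intensities above `thresh`
    become 1, the rest 0, then fill interior holes."""
    binary = (gray > thresh).astype(np.uint8)
    # Fill holes so the kidney region becomes one solid blob
    filled = ndimage.binary_fill_holes(binary).astype(np.uint8)
    return filled

# Tiny synthetic "image": a bright square with a dark hole inside
img = np.zeros((7, 7), dtype=np.uint8)
img[1:6, 1:6] = 200       # bright region
img[3, 3] = 5             # dark hole below the threshold
mask = to_binary_mask(img)
print(mask[3, 3])  # 1 -- the interior hole has been filled
```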
Figure 2 displays the ultrasound image taken as the input for the model to
determine the specific location of the stone in the kidney. This ultrasound
image also contains noise, such as text and lines, which must be removed by
preprocessing before the image can be used to locate the stone.
Fig. 2 Ultrasound image of
kidney [16]
Fig. 3 Processed image of
the ultrasound
Figure 3 displays the ultrasound image after preprocessing: all the noise, such
as text and lines, that could affect the output has been removed. Only the
region of the kidney remains after the preprocessing step.
4.2.2 Contrast Enhancement
By making the dark sections darker and the bright portions brighter, the
boundary region, which is not crucial for our application, can be eliminated.
This is termed contrast enhancement or contrast stretching. As a result, values
between 0 and 0.3 are mapped to 0, values beyond 0.7 are treated as 1, and
values between 0.3 and 0.7 are stretched linearly onto the range 0 to 1.
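The mapping described above can be written as a small piecewise function; `contrast_stretch` is an illustrative sketch, with only the 0.3/0.7 breakpoints taken from the text.

```python
import numpy as np

def contrast_stretch(img, low=0.3, high=0.7):
    """Piecewise contrast stretching on intensities in [0, 1]:
    values <= low map to 0, values >= high map to 1, and values
    in between are stretched linearly onto [0, 1]."""
    out = (img - low) / (high - low)
    return np.clip(out, 0.0, 1.0)

vals = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
print(contrast_stretch(vals))
```

The midpoint 0.5 maps to 0.5, while everything in the dark and bright tails saturates at 0 and 1 respectively.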
4.2.3 Feature Extraction
In this section, several techniques are utilized to extract kidney traits that
aid in characterizing the kidney. Feature extraction consists of three parts:
a median filter, ROIPoly (region-of-interest polygon), and image segmentation.
4.2.4 Median Filter
A particular kind of image processing filter known as a median filter replaces a given
picture’s pixel value with the median value of the pixels in the area around it. Mostly,
noises that are not required in the image are eliminated from a picture using the
median filter. Instead of using the average of the surrounding pixels, it uses their
median to keep the image’s crisp edges while reducing noise. Given that the median
of a group of values is less susceptible to outliers than the mean, the median filter is
particularly helpful for maintaining the borders of a picture.
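The edge-preserving behavior of the median filter, compared with an averaging filter, can be demonstrated on a one-row toy image; the data here is our own illustration.

```python
import numpy as np
from scipy.ndimage import median_filter, uniform_filter

# A 1-row "image" with a sharp edge and one bright speckle pixel
row = np.array([[10, 10, 10, 200, 10, 90, 90, 90]], dtype=float)

med = median_filter(row, size=(1, 3))   # 3-pixel median window
avg = uniform_filter(row, size=(1, 3))  # 3-pixel mean window

# The median removes the lone speckle and keeps the edge crisp;
# the mean smears both the speckle and the edge.
print(med[0])
print(avg[0])
```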
4.2.5 ROI Polygon
ROI, or region of interest, is an image processing phrase that refers to a specific
portion of an image that is of interest and should be handled differently than the
remainder of the picture. In image processing, ROIPoly is a particular implementation
of ROI selection where the user may specify an area of interest by encircling it with
a polygon. When the user clicks on various locations in the picture to pick the ROI, a
polygon is produced by connecting these points. Once the ROI has been established,
the user may isolate that area and only apply certain image processing methods to
it, leaving the rest of the picture untouched. This enables the user to concentrate on
a specific area of a picture and carry out processing actions that are specific to that
area.
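MATLAB's roipoly is interactive, but its core rasterization step can be sketched non-interactively. The `polygon_mask` function below is an illustrative even-odd (ray casting) point-in-polygon implementation, not the paper's code.

```python
import numpy as np

def polygon_mask(shape, vertices):
    """Rasterize a polygon (list of (x, y) vertices) into a boolean
    mask, similar in spirit to MATLAB's roipoly but non-interactive.
    Uses the even-odd (ray casting) rule per pixel center."""
    h, w = shape
    mask = np.zeros((h, w), dtype=bool)
    n = len(vertices)
    for y in range(h):
        for x in range(w):
            inside = False
            px, py = x + 0.5, y + 0.5   # test the pixel center
            for i in range(n):
                x1, y1 = vertices[i]
                x2, y2 = vertices[(i + 1) % n]
                # Does a horizontal ray from (px, py) cross edge i?
                if (y1 > py) != (y2 > py):
                    x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
                    if px < x_cross:
                        inside = not inside
            mask[y, x] = inside
    return mask

# Square ROI from (1,1) to (4,4) inside a 6x6 image
m = polygon_mask((6, 6), [(1, 1), (4, 1), (4, 4), (1, 4)])
print(int(m.sum()))  # 9 interior pixel centers
```

Once such a mask is built, filters can be applied to the masked region only, leaving the rest of the picture untouched.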
4.2.6 Image Segmentation
In image segmentation, the ‘bwareaopen’ function of MATLAB has been used. In
this function, the objects in the picture are distinguished from the background using
binary images. The function eliminates minute connected parts from a binary picture.
Larger items in an image may be distinguished from smaller ones that can be caused
by noise or other image processing processes using this function.
4.2.7 Labeling the Image
The process of labeling the image starts once the image has been segmented
using the ‘bwareaopen’ function. Here, if at least one binary object is
detected, the model simply displays ‘Stone is detected’; otherwise, it shows
‘No stone is detected’.
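The segmentation and labeling steps can be sketched together. The `bwareaopen` function here is a Python stand-in for the MATLAB function of the same name, and the toy image and minimum component size are illustrative.

```python
import numpy as np
from scipy import ndimage

def bwareaopen(binary, min_pixels):
    """Remove connected components smaller than `min_pixels` from a
    binary image, mimicking MATLAB's bwareaopen."""
    labels, n = ndimage.label(binary)
    sizes = ndimage.sum(binary, labels, range(1, n + 1))
    keep = np.zeros_like(binary, dtype=bool)
    for i, size in enumerate(sizes, start=1):
        if size >= min_pixels:
            keep |= labels == i
    return keep

# Binary ROI with one large blob (candidate stone) and tiny specks
img = np.zeros((8, 8), dtype=bool)
img[2:5, 2:5] = True      # 9-pixel blob
img[0, 7] = True          # 1-pixel noise speck
img[6, 1] = True          # another speck

cleaned = bwareaopen(img, min_pixels=4)
_, n_objects = ndimage.label(cleaned)
print("Stone is detected" if n_objects >= 1 else "No stone is detected")
```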
5 Discussion
In this project, we used supervised and unsupervised learning algorithms to
determine the model with the highest accuracy, which was the decision tree
algorithm. We used this model to predict the efficiency of therapy received by
hemodialysis patients who have reached stage V of chronic kidney disease,
using parameters such as urea, creatinine, albumin, total protein, and
hemoglobin as indicators of dialysis therapy. Our results demonstrate that the
decision tree algorithm effectively predicts CKD therapy efficiency levels,
with precision of 0.97, recall of 0.97, and accuracy of 0.97.
We also analyzed the ultrasound images of various patients using image
processing to identify kidney stones in the region-of-interest area of the
kidney. Initially, preprocessing is done on the image to remove any distortion
that may arise from the quality of the image. After that, the region of
interest is identified, which helps recognize the location where the
probability of finding a stone in the kidney is highest. This method is then
applied to various ultrasound images to check its accuracy, which comes out to
95.53%. This accuracy can be further improved by using different preprocessing
and postprocessing filters such as the negate, contrast, Prewitt, Sobel, and
Canny filters.
6 Limitations
Image distortions can cause image features to appear stretched, compressed,
or warped, which can affect the accuracy of image analysis and recognition
algorithms.
Image gaps can lead to missing information in the image, which can affect the
performance of image analysis algorithms such as object detection, segmentation,
and recognition.
Medical data is sensitive and personal; hence it is difficult to obtain large amounts
of high-quality data to train machine learning models efficiently.
Moreover, machine learning models can perpetuate bias and unfairness if the
training data is biased or if the model is not designed to be fair. These
models can also be vulnerable to attacks such as adversarial attacks, in which
an attacker deliberately manipulates the input data to cause the model to
produce incorrect predictions.
7 Result and Analysis
Prescribed ranges differ across investigations. For instance, the hemoglobin of
a CKD stage V patient is advised to be 11–12 g/deciliter, and similar ranges
exist for investigations such as albumin, total protein, HbA1c, platelets,
urea, creatinine, and Kt/V.
A K-means clustering model was designed to divide the dataset into three
clusters. Cluster 2 defines a data range where the patient is responding very
well to the therapy, cluster 1 defines ranges where the patient is responding
satisfactorily, and cluster 0 represents a data range where the therapy given
is below par.
This cluster categorization aims to help medical professionals determine
whether to continue the same therapy for a given patient with the given
parameters. In a country such as India, which faces a shortage of medical
professionals, this categorization helps channel their focus toward patients
in an optimal way. For instance, cluster 0 patients (where therapy is below
par) require the most attention, cluster 1 patients (therapy at par) require
the existing attention and care, and cluster 2 patients (therapy above par) do
not require urgent medical intervention.
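The three-cluster K-means setup can be sketched as below. The hemoglobin values are synthetic stand-ins for the real dialysis data, with cluster means chosen only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Synthetic hemoglobin values (g/dL) for dialysis patients, loosely
# spanning below-par, satisfactory, and very good therapy response
hb = np.concatenate([
    rng.normal(8.0, 0.4, 50),    # responding poorly
    rng.normal(11.5, 0.4, 50),   # within the advised 11-12 range
    rng.normal(14.0, 0.4, 50),   # responding very well
]).reshape(-1, 1)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(hb)
print(sorted(km.cluster_centers_.ravel().round(1)))
```

The three learned centers fall near the three response regimes, so each patient's cluster label can be read as a triage category.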
It has been observed that the unsupervised model was not very successful in
predicting whether a given value belongs to a particular cluster; its
predictions were found to be inaccurate, and the model needs improvement. In
diseases such as chronic kidney disease, many parameters are involved, namely
A/G ratio, alkaline phosphatase, calcium, chlorides, folic acid, globulin,
indirect bilirubin, Kt/V, and total phosphate, to name a few. This study
focused on five major parameters: hemoglobin, urea, creatinine, total protein,
and albumin.
In supervised learning, the dataset is labeled into three groups with consultation
from medical experts. Group 0 represents values below the prescribed medical range
for stage V chronic kidney disease patients, group 1 represents values within the
prescribed range, and group 2 represents values above the prescribed range.
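The group labeling rule can be expressed as a small function; `therapy_group` and the sample readings are illustrative, with only the 11–12 g/dL hemoglobin range taken from the text.

```python
def therapy_group(value, low, high):
    """Label a lab value against the prescribed medical range:
    0 = below range, 1 = within range, 2 = above range."""
    if value < low:
        return 0
    if value > high:
        return 2
    return 1

# Hemoglobin advised at 11-12 g/dL for CKD stage V patients
readings = [9.8, 11.4, 13.1]
print([therapy_group(v, 11.0, 12.0) for v in readings])  # [0, 1, 2]
```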
The following figures show the results obtained from the experiments. A random
sample of 425 patients' hemoglobin data was taken from the dataset; the
predicted therapy category for each datapoint is shown in Fig. 4.
The confusion matrix shows that an overwhelming majority of our predicted
values fall within the categories ‘True Category 0’, ‘True Category 1’, and
‘True Category 2’. As most of the large values are aligned along the diagonal
of the confusion matrix and the non-diagonal values are close to 0, we infer
that our prediction is highly accurate.
Figure 5 shows the results obtained by Gini index classification on the
selected sample. An accuracy of 97% shows that the proposed model is a
near-perfect model.
Fig. 4 Gini index classification results (decision tree)
Fig. 5 Results of supervised
learning using Gini index
A random sample of 372 patients' hemoglobin data was taken from the dataset;
the predicted therapy category for each datapoint is shown in Fig. 6.
The confusion matrix shows that an overwhelming majority of our predicted
values fall within the categories ‘True Category 0’, ‘True Category 1’, and
‘True Category 2’. As most of the large values are aligned along the diagonal
of the confusion matrix and all but one non-diagonal value is 0, we infer that
our prediction is very accurate.
Figure 7 shows the results obtained by entropy classification on the selected
sample. An accuracy of 98% shows that there are very few anomalies and
impurities in the datapoints, and that data splitting has been done efficiently.
It can be observed that this supervised machine learning model successfully
predicts whether a given value belongs to a particular cluster. Three clusters
can be seen, namely 0, 1, and 2. Supervised learning succeeds at predicting
cluster values, as its predictions are very accurate.
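The classification results above come from decision-tree splits scored by the Gini index and entropy. As a minimal illustration of how the Gini criterion scores a candidate split, the pure-Python sketch below uses invented hemoglobin values and an assumed 10–12 gm/dl target range, not the study’s dataset or prescribed medical ranges.

```python
# Simplified sketch of how a decision tree scores candidate splits with the
# Gini index; the values and thresholds are illustrative, not the study's data.
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_impurity(values, labels, threshold):
    """Weighted Gini impurity after splitting at `threshold`."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Toy hemoglobin readings (gm/dl) labelled 0 = below, 1 = within, 2 = above
# an assumed 10-12 gm/dl target range.
hb = [7.5, 8.2, 9.1, 10.4, 11.0, 11.8, 12.6, 13.9, 15.0]
cat = [0, 0, 0, 1, 1, 1, 2, 2, 2]

# A split at 10 gm/dl cleanly separates category 0, so its weighted impurity
# is lower than a split in the middle of category 1.
print(split_impurity(hb, cat, 10.0))   # lower
print(split_impurity(hb, cat, 11.0))   # higher
```

The tree greedily picks the lowest-impurity split at each node; the entropy criterion works the same way with `-sum(p_k * log(p_k))` in place of the Gini formula.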
The analysis focuses on developing insights into the accuracy achieved by the
algorithms and the results obtained through this research. It is observed that
some algorithms performed better and came close to predicting the actual values.
As the data is categorized into three main categories, cluster 0 represents
values depicting a good therapy response from the patient, cluster 1 a
satisfactory response, and cluster 2 an unsatisfactory response.
Fig. 6 Entropy classification results
Renal Disease Classification Using Image Processing 133
Fig. 7 Results of supervised learning using entropy
The image processing results highlight the areas where a stone is most likely to
be found. With the assistance of doctors, we used various methods to reach the
region of interest where a stone may be present. The insights from the image
processing will further assist doctors in treating patients correctly, as they
indicate whether a stone is present in the patient’s body or not. They also
identify the area of the stone, which helps the doctor operate more easily in
case a stone is detected.
Figure 8 shows that the ultrasound image given as input to the model contains a
stone in the kidney and marks the location of the stone, which can help the
doctor remove it easily. This eases the doctors’ work and saves time in
emergency situations.
Fig. 8 Result of image processing (showing the location of the detected stone)
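The localization step described above can be sketched as a simple intensity-threshold pass that marks the bounding box of bright pixels as the candidate stone region. The toy frame and the threshold below are invented stand-ins, not the study’s ultrasound data or its actual processing pipeline.

```python
# Minimal sketch: kidney stones appear as bright (high-intensity) regions in
# an ultrasound image, so thresholding and taking the bounding box of bright
# pixels marks a candidate region of interest. The 2D list is a stand-in for
# a real ultrasound frame; the threshold of 200 is an assumption.

def locate_bright_region(image, threshold=200):
    """Return (row_min, row_max, col_min, col_max) of pixels > threshold,
    or None if no pixel exceeds it."""
    coords = [(r, c) for r, row in enumerate(image)
                     for c, v in enumerate(row) if v > threshold]
    if not coords:
        return None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return (min(rows), max(rows), min(cols), max(cols))

# Toy 6x6 "ultrasound" frame with a bright blob around rows 2-3, cols 3-4.
frame = [
    [20, 30,  25,  28,  22, 19],
    [31, 40,  36,  60,  45, 27],
    [29, 38, 120, 230, 240, 33],
    [26, 41, 110, 235, 225, 30],
    [24, 35,  42,  55,  47, 21],
    [18, 27,  23,  26,  20, 17],
]
print(locate_bright_region(frame))  # (2, 3, 3, 4)
```

A real pipeline would add speckle denoising and contour extraction before this step, but the bounding box already conveys the “show the doctor where the stone is” idea.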
8 Conclusion and Future Scope
Machine learning models yielded satisfactory results, with unsupervised learning
underperforming and supervised learning performing well.
The selection of few parameters and a small dataset caused the unsatisfactory
unsupervised-learning results. For one patient, a hemoglobin of 12 gm/dl would
be good for one set of parameters, while for another patient the same level
proves too high for their set of parameters. Whether the therapy given is good,
bad, or satisfactory thus varies from patient to patient. Hence, a much larger
dataset of chronic kidney disease (CKD) patients of all stages (Stage I to
Stage V) is required to yield better results for unsupervised CKD predictions.
Data labeling conducted with assistance from medical experts proved crucial for
supervised learning, as the proposed model gives accurate predictions. A
doctor’s opinion remains better than any machine prediction, which is reflected
in the disparity between the unsupervised and supervised machine learning
results.
Image processing clearly shows the location of kidney stones in ultrasound images
and can be applied in chronic kidney disease too. A model can be built which takes
ultrasound images and directly predicts whether the kidney has a stone or not. If yes,
it could help determine the area occupied by the stone and its size. This could prove
to be of great utility in the medical field and assist doctors in better decision-making.
In future, image processing will be able to detect the size and dimensions of
the stone, helping doctors decide which procedure to apply for each patient;
for example, a patient with a small stone can be cured with medicine alone,
while a patient with a somewhat larger stone needs to be operated on. This
distinction can also be made with image processing techniques in the future.
Furthermore, with the help of ultrasound images, the classification of the
various stages of chronic kidney disease will also become possible. With
concentric contour detection and the transition-indicator measurement, the
stages can be classified: if the white-to-black ratio is high, the patient is
in an initial stage, while a lower value means the patient is in the final
stages. This classification will be vital for doctors in deciding the treatment
process for each patient.
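The proposed white-to-black ratio heuristic for staging might be sketched as follows. The 128 brightness cutoff and the ratio threshold of 1.0 are placeholder assumptions for illustration, not clinically validated values.

```python
# Sketch of the staging heuristic: compute the ratio of white (bright) to
# black (dark) pixels in a segmented ultrasound image and map a higher ratio
# to an earlier CKD stage. Cutoffs are invented placeholders.

def white_to_black_ratio(image, cutoff=128):
    white = sum(v >= cutoff for row in image for v in row)
    black = sum(v < cutoff for row in image for v in row)
    return white / black if black else float("inf")

def coarse_stage(image, ratio_threshold=1.0):
    """Higher white-to-black ratio -> 'initial' stage, lower -> 'final'."""
    return "initial" if white_to_black_ratio(image) > ratio_threshold else "final"

healthy_like = [[200, 210], [190, 40]]   # mostly bright tissue
late_like = [[30, 20], [220, 10]]        # mostly dark
print(coarse_stage(healthy_like))  # initial
print(coarse_stage(late_like))     # final
```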
References
1. Alnazer I, Bourdon P, Urruty T, Falou O, Khalil M, Shahin A, Fernandez-Maloigne C (2021)
Recent advances in medical image processing for the evaluation of chronic kidney disease.
Med Image Anal 69:101960
2. Gudigar A, Raghavendra U, Samanth J, Gangavarapu MR, Kudva A, Paramasivam G, Acharya
UR et al (2021) Automated detection of chronic kidney disease using image fusion and graph
embedding techniques with ultrasound images. Biomed Signal Process Control 68:102733
3. Ghosh P, Shamrat FMJM, Shultana S, Afrin S, Anjum AA, Khan AA (2020) Optimization
of prediction method of chronic kidney disease using machine learning algorithm. In: 2020
15th international joint symposium on artificial intelligence and natural language processing
(iSAI-NLP). IEEE, pp 1–6
4. Georgieva V, Petrov P, Mihaylova A (2018) Ultrasound image processing for improving
diagnose of renal diseases. In: 2018 IX national conference with international participation
(ELECTRONICA). IEEE, pp 1–4
5. Bai Q et al (2022) Machine learning to predict end stage kidney disease in chronic kidney
disease. Sci Rep 12(8377):1–8
6. Bukhari SNH, Jain A, Haq E, Mehbodniya A, Webber J (2021) Ensemble machine learning
model to predict SARS-CoV-2 T-Cell epitopes as potential vaccine targets. Diagnostics
2021(1990):1–18
7. Bukhari SNH, Webber J, Mehbodniya A (2022) Decision tree based ensemble machine learning
model for the prediction of Zika virus T-cell epitopes as potential vaccine candidates. Sci Rep
12(7810):1–11
8. Kumar K et al (2023) A deep learning approach for kidney disease recognition and prediction
through image processing. Appl Sci 13(3621):1–14
9. Islam MA, Majumder MZH, Hussein MA (2023) Chronic kidney disease prediction based on
machine learning algorithms. J Pathol Inform 14:100189
10. Segal Z, Kalifa D, Radinsky K, Ehrenberg B, Elad G, Maor G, Koren G et al (2020) Machine
learning algorithm for early detection of end-stage renal disease. BMC Nephrol 21(518):1–10
11. Padmanaban KRA, Parthiban G (2016) Applying machine learning techniques for predicting
the risk of chronic kidney disease. Indian J Sci Technol 9(29):1–5
12. Li Y, Yao L, Mao C, Srivastava A, Jiang X, Luo Y (2018) Early prediction of acute kidney
injury in critical care setting using clinical notes. In: 2018 IEEE international conference on
bioinformatics and biomedicine (BIBM). IEEE, pp 683–686
13. Zeng C, Nan Y, Xu F, Lei Q, Li F, Chen T, Liang S et al (2020) Identification of glomerular
lesions and intrinsic glomerular cell types in kidney diseases via deep learning. J Pathol
252(1):53–64
14. Ma F, Sun T, Liu L, Jing H (2020) Detection and diagnosis of chronic kidney disease using
deep learning-based heterogeneous modified artificial neural network. Futur Gener Comput
Syst 111:17–26
15. Vásquez-Morales GR, Martinez-Monterrubio SM, Moreno-Ger P, Recio-Garcia JA (2019)
Explainable prediction of chronic renal disease in the Colombian population using neural
networks and case-based reasoning. IEEE Access 7:152900–152910
16. Pal S (2022) Chronic kidney disease prediction using machine learning techniques. Biomed
Mater Devices, pp 1–7
17. Dovgan E, Gradišek A, Luštrek M, Uddin M, Nursetyo AA, Annavarajula SK, Li Y-C,
Syed-Abdul S (2020) Using machine learning models to predict the initiation of renal
replacement therapy among chronic kidney disease patients. PLoS ONE 15(6):e0233976
Identification of Fake Users on Social
Networks and Detection of Spammers
B. Srinivasa Rao, Badisa Bhavana, Gudimetla Abhishek,
and Peddiboyina Hema Harini
Abstract Numerous people use social networking services on a global scale. The
way people interact with social media platforms such as Facebook and Twitter has
a significant impact on daily life, frequently with negative outcomes. Popular
social networking sites are targeted by spammers, who spread large amounts of
unwanted and damaging content there. Twitter, for instance, has become one of
the most widely used platforms ever, which has led to an annoying quantity of
spam. By sending unwanted tweets to promote businesses or websites, fake users
waste resources and also hurt real people. Additionally, the capacity for
transmitting incorrect information under false identities has grown, aiding the
spread of dangerous content. Identifying spammers and unauthorized users has
therefore become a major research issue on today’s online social networks
(OSNs). This survey covers fake users, link spam, and spam content based on
trending topics. The solutions presented are also contrasted on a variety of
criteria, such as user, content, graph, structure, and temporal features. We
hope that the presented study will serve as a valuable resource for academics
looking for the most significant recent developments in Twitter spam
identification on a single platform.
Keywords OSN · Spam · Fake account · URL · Twitter · Social media
B. Srinivasa Rao (B)
Department of Information Technology, Lakireddy Bali Reddy College of Engineering,
Mylavaram, Andhra Pradesh, India
e-mail: doctorbsrinivasarao@gmail.com
B. Bhavana ·G. Abhishek ·P. H. Harini
Lakireddy Bali Reddy College of Engineering, Mylavaram, Andhra Pradesh, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_11
137
138 B. Srinivasa Rao et al.
1 Introduction
Twitter, one of the better-known social media platforms, has been used in a
number of studies, and the majority of people currently use it. Because Twitter
also hosts fictitious accounts, this study addresses the identification of fake
Twitter users. In this study we identify fake users through fake content,
URL-based spam detection, spam in trending topics, and fake-account
recognition, and then expose the bogus account. By posting frequently and on
issues unrelated to the conversation, a fake user wastes other people’s time.
Social networking services like Twitter, Facebook, MySpace, Instagram, and
LinkedIn have grown in popularity during the past few years. Compared to other
social media platforms, Twitter is one of the most well-known and significant
networking websites. Twitter enables users to upload and distribute messages;
the network refers to messages of no more than 280 characters as tweets. People
typically use social media websites to express their opinions on a variety of
topics, feelings, and ideas about other people. These sites may be the biggest
platforms for individuals to publish comments and reviews on products they have
bought. At the moment, 0.13% of Twitter adverts are clicked, which results in a
higher rate of spam data access than email spam [1]. Twitter and other online
social networks, largely used for the exchange of useful information, are
frequently targeted by social crawlers and hackers due to their large user
bases. On social networking sites, spam crawlers are frequently referred to as
social crawlers.
In fact, a number of studies have been done on identifying Twitter spam. To
include the most recent research, a few surveys on fake Twitter user
recognition have also been conducted. Tingmin et al. present a survey of modern
methods and procedures for Twitter spam detection [2], which provides a
comparison of current methods. A study of the various actions taken by spammers
on Twitter was conducted by the authors of [3], offering an analysis of the
literature that acknowledges the prevalence of spammers on Twitter. Despite all
of the investigations that have been conducted, a gap remains in the
literature. Therefore, to reduce this gap, we look at the most recent
developments in spammer detection as well as fake-user recognition on Twitter.
Additionally, this study employs a taxonomy of methods for identifying Twitter
spam and tries to provide a thorough summary of recent advancements in the
area.
According to Wikipedia, a social networking platform such as Twitter is one
that “focuses on the growth and verification of online social networks for
communities of individuals who share interests and activities, or who are
interested in exploring the interests and activities of others, and which
requires the use of software.” An OCLC report defines social networking sites
similarly: websites like Facebook, Mixi, and MySpace were primarily developed
for users who engage in the exchange of goods and services. Social networks
offer a variety of advantages to people within an organization. They can foster
relationships between research subjects and those involved in promoting
literacy, and they can also improve informal learning. Social media tools are
available to all employees of a company, not just those who interact with
students, and technology can advance with the aid of social media. Conversing
with others on social networks can provide crucial corporate information and
feedback on institutional solutions (although this may raise ethical
questions), and it can reduce workload and improve information accessibility.
The simplicity of many social networking sites can benefit users by
facilitating access to new tools and procedures. The Facebook platform is an
example of how a social networking service can be used as a surface for other
applications through a common interface. One advantage of social networks may
be their shared user interface, which transcends both professional and social
divides. Because the same services are regularly used in a particular capacity,
the user interface and the way the services work become familiar, so less
training and assistance is needed to use them in a professional context.
However, this may be problematic for those who prefer clear distinctions
between their work and their social lives.
Identification of Fake Users on Social Networks and Detection 139
1.1 Objectives of the Project
Finding any kind of information from any source anywhere in the world has
become much simpler as a result of the Internet. People can obtain a large
amount of data and facts about other people thanks to the social media
platforms’ expanding user bases. Because they provide so much information,
these websites draw fake users. In fact, Twitter has become very popular as a
source of current personal information.
1.2 Problem Definition
Due to social media sites’ rising popularity, users have access to a vast
amount of personal data about both themselves and other people. These
platforms are attractive to scammers because of the amount of data they hold.
The popularity of Twitter as a source of up-to-date information about people
has grown quickly, and the wealth of easily available information on such
sites attracts the wrong kind of users.
Motivation: Researchers have recently shown an increasing level of interest in
detecting spam on social networking sites. Recognizing spam is a difficult
problem in maintaining social media safety and security. If users are to be
protected from all forms of harmful attacks, to keep their sense of privacy,
and to feel safe and secure, then finding spam on OSN sites is essential.
Spammers’ harmful actions cause extensive damage to the community. Twitter is a
platform that spammers use to spread false information, fake news, rumours, and
inappropriate statements, to mention a few. Spammers maintain a large number of
accounts to connect their interests and employ a variety of other techniques,
such as sending spam messages at random, to accomplish their harmful goals. The
original users, also referred to as non-spammers, are irritated by this
behaviour, and the OSN systems’ reputation is also damaged. To ensure that
appropriate steps can be taken to halt this unwanted behaviour, it is crucial
to develop a system for finding spammers.
2 Related Work
Reference [1]: Shivangi Gheewala et al. note that OSNs have taken a number of
measures to safeguard private information from various threats. Despite the
importance of these measures, the authors believe there is not yet a conceptual
foundation for building information-protection technologies, and that the
central concept of such a technique must be risk. They therefore advise OSNs to
adopt a risk-monitoring strategy throughout their operation. By attaching risk
factors to social network users, they aim to make people consider how risky it
would be to connect with a user while sharing personal information. They take
user risk attitudes into account, using similarity and profit signals to
determine danger limits. In particular, they employ a dynamic risk-assessment
learning approach in which a select number of critical user interactions
demonstrate risky user behaviour. The risk-assessment method discussed in the
article was developed and tested on real data.
Reference [4]: The method proposed by Rohit Kumar Kaliyar et al. was referred
to as “Fake News Detection Using a Deep Neural Network.” The results of online
forums with person-to-person conversational forms have garnered significantly
more attention than the combination of electronic communication tools in
co-located classes. In this study, middle school students’ perceptions and
assumptions about two communication styles in co-located classrooms,
face-to-face (F2F) and simultaneous computer-mediated communication (CMC), were
examined. The authors distinguish between pupils who are considered to be
participating in face-to-face classroom discussions and those who are typically
silent. These studies demonstrate the benefits of CMC over F2F interactions in
co-located settings and reveal that different students (“active” and “silent”)
have varied preconceptions of both F2F and CMC. Computer network infractions
and cyberattacks have a substantial protective impact.
Reference [5]: Gupta et al. proposed a technique titled “Towards Distinguishing
Fake User Accounts on Facebook.” People are highly vulnerable on OSNs because
of a genuine concern over digital offenders carrying out numerous malicious
deeds, and a whole black-market industry of account-based services has
developed, selling these fake accounts. The main objective of their study is to
identify bogus accounts on Facebook, a hugely popular (and hard to gather data
from) online social network. The contributions of the work are as follows.
Considerable effort went into compiling data covering both real and fake
Facebook accounts; because of Facebook’s strict security policies and
application interface, which is constantly being enhanced with new
restrictions, gathering account information is a difficult task. The next step
is to leverage Facebook user-feed data to study profile behaviours and identify
a set of 17 criteria that are essential for differentiating fraudulent users
from real ones on Facebook. Finally, these features are used to identify, out
of a total of 12 classifiers evaluated, the AI-based classifiers that excel at
the identification task.
TITLE: Identifying fake Twitter accounts. AUTHORS: B. Erçahin, Ö. Aktaş,
D. Kilinç, and C. Akyol. Many people use social media websites like Facebook
and Twitter, and the connections they make there have a profound impact on
their lives. A number of issues have emerged as a result of social networking’s
rising popularity, including the potential for dangerous material to spread by
deceiving people into thinking a user is someone they are not. This situation
has the potential to profoundly damage society in real life. The authors
present a classification method for identifying fake Twitter accounts. Their
dataset was pre-processed using the Entropy Minimization Discretization (EMD)
method of supervised discretization on numerical features, and the output of
the Naive Bayes algorithm was then examined.
TITLE: Detecting spammers on Twitter. AUTHORS: F. Benevenuto, G. Magno,
T. Rodrigues, and V. Almeida. With many individuals tweeting on a global scale,
new information-mining tools and search engines are emerging to help users keep
up with events and information on Twitter. Despite being useful for
accelerating the flow of information and enabling users to discuss events and
promote their standing, these services also open the door for new kinds of
spam. Trending topics, the most popular topics on Twitter at any given time,
have been considered a possible way to boost traffic and revenue. In their
tweets, spammers use popular terms from a hot topic as well as URLs that, in
most cases, are masked by URL shorteners and take users to completely unrelated
websites. If methods for discouraging spammers are not developed, this kind of
spam could reduce the usefulness of real-time search systems. The essay
examines the challenge of identifying spammers on Twitter. A sizable Twitter
dataset, containing more than 54 million users, 1.8 billion tweets, and 1.9
billion links, served as the starting point. Tweets on three trending topics
from 2009 were used to create a large labelled collection of users manually
classified as spammers and non-spammers. The authors then identify several
traits related to tweet content and user social behaviour that can be used to
identify spammers; these characteristics form the foundation of a machine
learning method for classifying users as spammers or non-spammers. The
technique correctly identified 96% of non-spammers and about 70% of spammers,
while only a small percentage of non-spammers were incorrectly categorized. The
findings also illustrate the crucial features for identifying Twitter spam.
TITLE: A comprehensive NLP-based technique for identifying potentially harmful
tweets. AUTHORS: S. Gharge and M. Chavan. The detection of fake user accounts
was the main objective of many past works, and Twitter spam detection has
recently drawn more attention as a social network research topic. The authors
provide a strategy based on two novel ideas: a method based on language
analysis for finding spam in topics trending on Twitter at the time, and a
method for detecting spam tweets without taking the user’s history into
account. Trending topics are the issues of conversation that are currently
popular, and this growing microblogging trend benefits spammers. The work looks
for spam tweets using linguistic techniques: tweets related to a variety of
popular subjects were first gathered and categorized as containing either safe
or dangerous material. After labelling, many features based on linguistic
differences were extracted, using language as a tool. The authors also assess
effectiveness and classify tweets as spam or not. The method can therefore be
used to detect spam on Twitter by focusing on tweet analysis rather than
user-account evaluation.
TITLE: A survey of Twitter spam detection using cutting-edge methods. AUTHORS:
T. Wu, S. Wen, Y. Xiang, and W. Zhou. Twitter spam has long been a serious yet
challenging issue to solve. Researchers have offered a variety of detection and
defence methods to protect Twitter users from spamming activities.
Particularly in the last three years, a number of novel techniques have been
developed that significantly improve detection performance and accuracy
compared with those offered earlier. As a result, the authors were motivated to
conduct a fresh survey of Twitter spam detection techniques. The investigation
is divided into three sections: (1) a review of contemporary literature,
providing in-depth analysis (such as taxonomies and biases in feature
selection) as well as discussion of the benefits and drawbacks of each
fundamental strategy; (2) comparative studies, in which the performance of
various common strategies is compared on a common testbed (i.e., the same
datasets and real-world scenarios) to give a quantifiable understanding of
current techniques; and (3) open issues, summarizing the problems that current
Twitter spam detection techniques still encounter. Addressing these open issues
is crucial for both the academic community and industry. Readers of the study
may include people looking for a thorough understanding of the topic in order
to develop original strategies, as well as those with or without prior
experience in the field.
TITLE: An examination of spammers’ behaviour on popular social media networks.
AUTHOR: S. J. Soman. Social media websites and applications have developed into
a significant component of the Internet and currently have a significant effect
on people’s lives. Social networking sites (SNSs) enable user interaction, but
the blogosphere has been plagued by many forms of spam-like information. As
social networking websites have become increasingly popular, they have become a
prime target for spammers, who annoy users by polluting search results with
useless content. Researchers initially concentrated on building honeypots to
find spam. Spammers and marketing specialists both use Twitter as a targeting
mechanism. The author examines the extensive literature that demonstrates the
existence of spam and spammers on well-known social media sites.
2.1 Existing System
Social networking services like Twitter and Facebook are used by millions of
individuals, and their involvement with these sites has a significant impact on
their lives. Due to its popularity, social networking has given rise to a
number of issues, including the potential for dangerous content to spread by
tricking people into believing a user is someone they are not. This
circumstance has the potential to cause significant harm to society in the real
world. The existing study offers a classification technique for identifying
fake Twitter accounts, with the dataset pre-processed using the Entropy
Minimization Discretization (EMD) method on numerical features.
2.2 Proposed System
The suggested system uses a combination of metadata-, content-, interaction-,
and community-based features to identify fake users, in order to detect social
spam bots on Twitter. In existing approaches, most network-based features are
not defined using a user’s followers and underlying community structures, which
ignores the fact that a user’s reputation in a network is inherited from
followers (rather than from those they are following) and from community
members. As a result, the system places a strong emphasis on using community
structures and followers to define a user’s network-based features. The system
divides the features into four major categories: fake content, URL-based spam,
spam in trending topics, and fake users. The network category is further
divided into interaction-based and community-based features. While
content-based features study a user’s message-posting behaviour and the quality
of the text used in postings, metadata features are derived from the additional
information available about a user’s tweets. Network-based features are
extracted from the network of user interactions (Fig. 1).
Fig. 1 Spammer detection model
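One way the four feature categories could be grouped in code is sketched below. Every input field name is an illustrative assumption about what each category might hold, not the system’s actual schema; note that reputation is computed from followers, in line with the emphasis above.

```python
# Sketch of grouping the four feature categories used by the proposed system.
# All input field names are illustrative assumptions, not the real schema.

def extract_features(user):
    followers = user["followers_count"]
    following = user["following_count"]
    return {
        "metadata": {
            "account_age_days": user["account_age_days"],
            "tweet_count": user["tweet_count"],
        },
        "content": {
            # quality of postings, e.g. how URL-heavy the tweets are
            "avg_urls_per_tweet": user["url_count"] / max(user["tweet_count"], 1),
        },
        "interaction": {
            # reputation inherited from followers, as the system emphasises
            "reputation": followers / max(followers + following, 1),
        },
        "community": {
            "shared_community_members": user["community_overlap"],
        },
    }

suspect = {"followers_count": 40, "following_count": 360, "account_age_days": 12,
           "tweet_count": 100, "url_count": 80, "community_overlap": 2}
print(extract_features(suspect)["interaction"]["reputation"])  # 0.1
```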
3 Methodology
The authors discuss the strategy for identifying spam and fake accounts on the online
social network Twitter.
The authors use four different detection methods, namely fake user
identification, fake content, spam URL detection, and spam in trending topics,
to carry out the detection task. After determining whether a tweet is regular
or spam using these four methods, a random forest data-mining algorithm is
trained on the resulting data to identify spam and non-spam tweets and the
percentages of fake and real accounts. Although many methods exist in the
literature for classifying tweets as spam or not spam, random forest
classification is used here.
The four ways of determining whether a tweet is spam are described below. Many
features are included, such as user attributes (retweets, tweets, following,
etc.) and content attributes.
Fake content: a low number of followers relative to the number of accounts
followed indicates that the account’s reputation is low and there is a strong
likelihood that it is spam. Similar features include HTTP links, mentions and
replies, trending topics, and the reputation of tweets. If the user tweets a
great deal within a short time for their time zone, the account is considered
spam.
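A rule-of-thumb version of this check might look as follows. The reputation floor of 0.3 and the rate cap of 20 tweets per hour are invented placeholders, not thresholds from the study.

```python
# Rule-of-thumb sketch of the fake-content check: flag an account whose
# follower-based reputation is low or whose posting rate is abnormally high.
# The 0.3 floor and 20-tweets-per-hour cap are invented placeholders.

def reputation(followers, following):
    """Share of an account's connections that chose to follow it."""
    return followers / max(followers + following, 1)

def is_suspicious(followers, following, tweets_last_hour,
                  reputation_floor=0.3, rate_cap=20):
    return (reputation(followers, following) < reputation_floor
            or tweets_last_hour > rate_cap)

print(is_suspicious(followers=15, following=2000, tweets_last_hour=3))   # True
print(is_suspicious(followers=800, following=400, tweets_last_hour=5))   # False
```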
Spam URL recognition: the user-based features are determined by a number of
factors, including the age of the account and the number of the user’s
favourites, lists, and tweets. The characteristics based on detected user input
are contained in the parsed JSON structure. Retweets, hashtags, user mentions,
and URLs are attributes of a tweet like any other. A machine learning method,
Naive Bayes, is used to determine whether a tweet contains a spam URL.
Using the Naive Bayes method to classify tweet content, it is possible to
determine whether a trending topic contains spam or non-spam terms. The model
searches for matching tweets, spam links, and adult content. Naive Bayes
returns 1 if the tweet contains spam and 0 if no spam content is found.
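A minimal bag-of-words Naive Bayes classifier in the spirit of this step is sketched below. The tiny training set is invented for illustration and is not the study’s data; a real model would be trained on the labelled tweet corpus.

```python
# Minimal bag-of-words Naive Bayes sketch: label a tweet as spam (1) or not
# spam (0) using word likelihoods with Laplace smoothing. Training data is
# invented for illustration.
import math
from collections import Counter

def train_nb(docs, labels):
    classes = set(labels)
    priors = {c: labels.count(c) / len(labels) for c in classes}
    word_counts = {c: Counter() for c in classes}
    vocab = set()
    for doc, lab in zip(docs, labels):
        words = doc.lower().split()
        word_counts[lab].update(words)
        vocab.update(words)
    return priors, word_counts, vocab

def predict_nb(model, doc):
    priors, word_counts, vocab = model
    best, best_lp = None, -math.inf
    for c, prior in priors.items():
        total = sum(word_counts[c].values())
        lp = math.log(prior)
        for w in doc.lower().split():
            lp += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

docs = ["win free money click link", "free prize click now",
        "great talk at the conference", "see you at lunch tomorrow"]
labels = [1, 1, 0, 0]
model = train_nb(docs, labels)
print(predict_nb(model, "click for free money"))   # 1 (spam)
print(predict_nb(model, "lunch after the talk"))   # 0 (not spam)
```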
Fake user detection: These features include account age, followers, and unfollowers. The
content-related features capture the tweets submitted by the users: unlike legitimate
users, spammer bots upload many instances of duplicated content. This method extracts
information from tweets and uses the Naive Bayes algorithm to categorize them as
spam or non-spam, depending on who they follow and whether they contain spam
material. To determine whether an account is fake, a random forest
algorithm is then trained on these attributes. The feature.txt file contains
all the extracted features, and the “Model” folder contains the Naive Bayes classifier.
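To illustrate how a Naive Bayes classifier separates spam from non-spam text, the following is a minimal from-scratch sketch with Laplace smoothing. It is an illustration of the technique only, not the classifier stored in the “Model” folder, and the training tweets are invented placeholders.

```python
import math
from collections import Counter

# Minimal word-level Naive Bayes sketch. Label 1 = spam, 0 = not spam.
class TinyNaiveBayes:
    def fit(self, texts, labels):
        self.word_counts = {0: Counter(), 1: Counter()}
        self.class_counts = Counter(labels)
        for text, y in zip(texts, labels):
            self.word_counts[y].update(text.lower().split())
        self.vocab = set(w for c in self.word_counts.values() for w in c)
        return self

    def predict(self, text):
        scores = {}
        for y in (0, 1):
            total = sum(self.word_counts[y].values())
            # Log prior plus Laplace-smoothed log likelihoods of each word.
            score = math.log(self.class_counts[y] / sum(self.class_counts.values()))
            for w in text.lower().split():
                score += math.log((self.word_counts[y][w] + 1) / (total + len(self.vocab)))
            scores[y] = score
        return max(scores, key=scores.get)

nb = TinyNaiveBayes().fit(
    ["win free money now", "claim free prize", "meeting at noon", "lunch with family"],
    [1, 1, 0, 0],
)
```

Smoothing matters here: without the “+ 1” terms, any word unseen in a class would drive that class's probability to zero.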
The aforementioned methods allow us to determine whether a tweet contains legitimate
content or spam. Social networks can improve their reputation in the market by
identifying and eliminating such spam communications; their popularity might decline if
spam messages were not removed. Today's consumers rely extensively on social media to
access business, family, and news information, so keeping these platforms free of spam
helps them build their reputation.
We use a Twitter dataset in JSON format that comprises user information,
tweet counts, follower and following counts, favourite tweets, and more. We examine
all this information using the Python JSON API to determine whether a user account is
real or fake and whether it contains spam or regular communications. The “tweets”
folder contains all of these dataset files.
4 Implementation
To start this project, double-click the “run.bat” file to bring up the main screen.
Click the “Upload Twitter JSON Format Tweets Dataset” button in that window, then
upload the tweets folder.
146 B. Srinivasa Rao et al.
In the screen above, a folder called “tweets” containing tweets in JSON format from
various individuals has been uploaded. Click the open button to begin reading the
tweets.
The next screen shows all of the loaded tweets from all users. Click the button that
loads the Naive Bayes classifier to analyse tweet text or URLs.
To analyse each tweet for fraudulent material and spam, choose “Detect Fake
Content, Spam URL, Trending Subject & Fake Account.” Spam URL and fake account
detection use the Naive Bayes classifier in addition to the other techniques
mentioned above. The Naive Bayes classifier is loaded in the screen above.
All features from the tweet collection are extracted and analysed in the screen
above to determine whether a tweet is spam. Each tweet record displays data such
as the account's TWEET TEXT, FOLLOWERS, and FOLLOWING, together with whether the
tweet text is legitimate, spam, or neither. In the text field above, each record
is separated by an empty line. To train a random forest classifier on the
features of the retrieved tweets, click the “Run Random Forest Prediction” button.
This trained model is then used to detect fake and spam accounts among
incoming tweets. To read each tweet's details, scroll through the text area.
Click the “Detection Graph” button to view a graph of the total number of tweets,
spam tweets, and fake accounts. In the screen above, the random forest prediction
accuracy was calculated to be 92%.
In the graph above, the x-axis shows the categories (total tweets, fake accounts, and
tweets containing spam language), while the y-axis shows their counts.
5 Conclusion
In this research, we reviewed the methods for identifying spammers on Twitter.
Additionally, we provided a taxonomy of Twitter spam detection methods and divided
them into categories such as false user detection, spam detection in hot topics, spam
detection based on URLs, and fake content detection. Several features, including user
features, content features, graph features, structure features, and temporal features
were used to compare the provided strategies. The strategies were also contrasted
in terms of the datasets they employed and the goals they were designed to achieve.
The presented review is expected to make it simpler for academics
to find information on cutting-edge Twitter spam detection methods in one place. Despite
the development of efficient and successful approaches to spam detection and fake user
identification on Twitter, certain open areas still need significant research. The
problems are succinctly highlighted as follows: Owing to the grave consequences that
false news can have at both an individual and a communal level, the subject of false
news detection on social media networks needs to be investigated. Finding the sources
of rumours on social media is a related topic that is worthy of further study. While
some studies have used statistical methods to determine the origin of rumours, better
strategies, such as speech-based ones, can be applied because of their good results.
Feature Analysis
Although effective and successful methods have been developed for Twitter spam
detection and fraudulent user detection, there are still some gaps in research that need
to be filled. Several of the problems include the following: Fake news identification
on social media networks is a topic that has to be examined because of the significant
effects false news has at an individual and societal level. Another related matter that
merits investigation is the ability to trace the source of rumours on social media.
Although some research has already been done to identify the source of rumours
using statistical techniques, more sophisticated strategies, such as those based on
social networks, can be used because of their proven effectiveness.
A Effective Method for Predicting
the Dyslexia by Applying Ensemble
Technique
S. K. Saida, Yanduru Yamini Snehitha, Narindi Sai Priya,
and Avula Srinivasa Ajay Babu
Abstract Dyslexia is a condition where a person will face difficulties in certain tasks
including reading, writing, speaking, and identifying sounds. Around 10% of people
globally struggle with this issue. The most important step in preventing dyslexia is
early identification. There are several ways to estimate the risk of dyslexia; we
have developed a model that allows the user to specify their language vocabulary,
memory, speed, visual discrimination, and audio discrimination test results. The model
will determine the user’s individual risk of dyslexia after receiving input from the
user. The approach we used included data preparation, data preprocessing, model
training, model testing, and model construction. Predicting Risk of Dyslexia-PLOS
ONE dataset is used. Dyslexia can be identified using machine learning classification
techniques like Decision Trees, Random Forests, and Support Vector Machines.
When compared to individual classification strategies, the ensemble technique in the
proposed work predicts the risk of dyslexia with a better degree of accuracy. Here, we
consider integrating GridSearch CV, Support Vector Machine, and Random Forest.
Accuracy, precision, recall, and F1-score were taken into consideration as outcome
measures.
Keywords Dyslexia ·Machine learning ·Random forest ·Support vector
machine ·GridSearch CV
S. K. Saida (B)
Department of Information Technology, Lakireddy Bali Reddy College of Engineering,
Mylavaram, Andhra Pradesh, India
e-mail: saida518@gmail.com
Y. Y. Snehitha · N. S. Priya · A. S. A. Babu
Lakireddy Bali Reddy College of Engineering, Mylavaram, Andhra Pradesh, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_12
151
152 S. K. Saida et al.
1 Introduction
Dyslexia is a neurological condition in which a person faces difficulties with specific
tasks such as reading, writing, speaking clearly, and identifying sounds [1].
Dyslexic people mainly struggle to decode individual words: a typical reader can decode
a given word faster than a dyslexic person. Their main difficulty lies in manipulating
words, not in vision problems or a lack of intelligence [2].
It is often assumed that people with dyslexia have very low intelligence
and cannot succeed in life, but their actual difficulties are an inability to
recognize directions correctly, seeing letters backwards (for example 'b' as 'd' or
'saw' as 'was'), and trouble remembering the names of things in their surroundings
(more time is required to remember them).
Essentially, the brain is divided into two sections: the right hemisphere and the other
the left hemisphere. The left hemisphere is in charge of language and
logic, whereas the right hemisphere is in charge of creativity. Because of this, reading
a word is routed through the dyslexic brain's right hemisphere and frontal lobe,
and it may take longer to register in the frontal lobe [3]. Since dyslexic people
mostly rely on the right side, thinking differently and creatively becomes their
major advantage. There is no medicine or treatment for dyslexia, but
early identification is the most important step towards making their future successful [4].
Special teaching methods and emotional support are the main requirements for making the
lives of dyslexic people easier. There is no single cause of dyslexia, but some
may acquire it through heredity or a brain injury at school age [5] (Fig. 1).
Dyslexia can be recognized by various methods. The most frequent is to detect
dyslexia from a person's eye movements: a machine learning model analyses
eye-movement video using various algorithms [6]. Another way to detect dyslexia is to
feed a person's audio recordings into a machine learning model [7].
Data preparation is the initial step, where we collected the Predicting Risk of Dyslexia-
PLOS ONE dataset from Kaggle. After data collection, preprocessing is done using the
Standard Scaler technique, which removes the mean and scales every variable to unit
variance. Here we use an ensemble technique, which combines two algorithms in a single
model to improve the efficiency
Fig. 1 Identification of dyslexia
and accuracy of the detection. The Random Forest Classifier combined with GridSearch CV
gave the most accurate results for the model compared with the Support Vector Machine
and Decision Tree.
2 Related Work
In 2022, Brunswick et al. [8] included 145 university students with and without
dyslexia in their study. The survey covered participants assessed in childhood (53%)
and adulthood (47%). They found that people with dyslexia have lower self-esteem and
self-efficacy but higher creativity. To reduce the negative effects on dyslexic people,
early assessment of dyslexia is essential.
In 2020, Chakraborty et al. [9] used machine learning algorithms to detect dyslexia
from a person's eye movements. Dyslexic people show different eye movements than
typical readers. SVM and Random Forest algorithms are used in this model, detecting
dyslexia with 89.8% precision.
In 2017, Hassanain et al. [10] created a big-data-based, tablet-based multimedia
environment that considered children both younger and older than 10 years to detect
the symptoms of dyslexia. Their framework included a clock drawing test, a writing
test, a reading test, and drawing family members. Grading is done automatically while
the test scenarios are attempted. Finally, the scores calculated in every scenario are
combined and the detection is made.
In 2020, Ileri et al. [11] mainly dealt with EOG signals for diagnosing dyslexia.
First, a person's EOG signals are captured while reading four different texts; the
obtained signals are then filtered and segmented into frames. Finally, they are
classified using a 1D CNN machine learning algorithm.
In 2020, Seshadri et al. [12] claimed that the frontal regions of dyslexic patients
show unusual patterns of delta and theta activity. They included youngsters both with
and without dyslexia and employed EEG signals recorded in an eyes-closed condition.
Using relative wavelet energies, they calculated the lateralization score at each
electrode position.
In 2018, Frid et al. [13] used machine learning features to predict the likelihood of
dyslexia in individuals. For the purpose of predicting dyslexia, an SVM model with a
Gaussian kernel was created using LIBSVM.
3 Methods and Materials
The system here uses the idea of machine learning, and the models are trained before
being tested. The final outcome will be predicted by the model with the highest
accuracy. We focus on the workflow of our proposed work in this section. The flow
chart showing the various steps is given below (Fig. 2).
Data Preparation
This is the first and most crucial part of the proposed work, where we used a dataset
from the online open-source platform Kaggle. The collected dataset, known as Predicting
Risk of Dyslexia-PLOS ONE, consists of 500 rows and 7 columns. It is numerical data
with a .csv extension and can easily be examined in Excel sheets.
Fig. 2 Workflow diagram
Data Preprocessing
Standard Scaler is used for data preprocessing. This method removes the mean and scales
every variable to unit variance, and the process is carried out independently on each
variable. Because the Standard Scaler estimates the empirical mean and standard
deviation of each feature, it can be affected by outliers (if they are present in the
dataset). Therefore, before feeding the data into the machine learning model, we
normalize it (mean = 0, standard deviation = 1), which is the common way to address
this problem.
Standardization

z = (x − μ) / σ

Mean

μ = (1/N) Σ_{i=1}^{N} x_i

Standard Deviation

σ = sqrt( (1/N) Σ_{i=1}^{N} (x_i − μ)^2 )
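The three formulas above can be applied directly. The sketch below standardizes a small hypothetical column from scratch; scikit-learn's StandardScaler performs the same z = (x − μ)/σ transform.

```python
import math

def standardize(values):
    """Apply z = (x - mu) / sigma so the result has mean 0 and unit variance."""
    n = len(values)
    mu = sum(values) / n                                      # empirical mean
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)  # empirical std dev
    return [(x - mu) / sigma for x in values]

scaled = standardize([2.0, 4.0, 6.0, 8.0])
```

After the transform the column has mean 0 and variance 1, which is exactly the property the preprocessing step relies on.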
Model Training
A dataset is used to train the machine learning algorithm; this dataset is the
training set. Training data typically outweighs testing data in size, because we want
to provide the model with as much data as possible so that it can recognize and pick
up useful patterns. The training set is composed of pairs of input data and sample
output data, both of which influence the final model. The algorithm processes the
input data and compares the processed output to the sample output, and the result of
this comparison is used to adjust the model.
Model Testing
Model testing is the procedure in which a fully trained model's performance is
evaluated on a testing set. Testing a model involves running it on fresh data and
examining the outputs in terms of metrics such as recall, precision, and accuracy
against the model that has already been developed. Importantly, the samples in the
testing set should not come from the training set. If the test set contains samples
from the training set,
it is impossible to determine whether the algorithmic framework has simply memorized
them or has learned how to generalize from the training set.
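A minimal sketch of holding out a disjoint test set follows. The 80/20 split is an assumption for illustration (the chapter does not state its exact ratio); shuffling before the cut keeps the two sets unbiased.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows, then cut them into disjoint train and test sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)          # fixed seed for reproducibility
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

# 500 hypothetical sample indices, matching the dataset's 500 rows.
train, test = train_test_split(range(500))
```

Because the two slices never overlap, test accuracy measures generalization rather than memorization.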
4 Description of Proposed Ensemble Techniques
(A) Random Forest Classifier
Random Forest is an ensemble learning approach in which a large number of decision
trees are constructed during training, and the output is the mode of the classes
predicted by the individual trees. At each split, the method chooses a random subset
of features [14]. The Random Forest Classifier operates as follows:
Step 1: First select n random samples from the entire training dataset; n must be
smaller than the total number of observations.
Step 2: Construct a decision tree for each sample. A decision tree's nodes can be
branched using the Gini index or entropy.
The Gini index calculation formula is stated as Eq. (1)

Gini = 1 − Σ_{i=1}^{c} (p_i)^2    (1)

where p_i = relative frequency, c = number of classes.
The Entropy calculation formula is stated as Eq. (2)

Entropy = − Σ_{i=1}^{c} p_i log2(p_i)    (2)
Step 3: Next, decide how many trees the forest will contain; often a large number is
chosen, such as 100 or 500.
Step 4: Each tree makes a prediction for a fresh piece of data.
Step 5: The final prediction for the new data point is determined by aggregating the
predictions collected from all the trees in the forest (averaging for regression,
majority vote for classification).
Step 6: The procedure above is repeated over the full dataset to create the many
trees of the forest.
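Eqs. (1) and (2), used to branch nodes in Step 2, can be computed directly from the class proportions at a node, as in this short sketch:

```python
import math

# Direct implementations of Eqs. (1) and (2): node impurity measures used to
# branch decision-tree nodes. probs holds the class proportions p_i at a node.
def gini(probs):
    return 1 - sum(p ** 2 for p in probs)

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

g = gini([0.5, 0.5])     # a perfectly mixed two-class node scores 0.5
h = entropy([0.5, 0.5])  # and 1.0 bit of entropy
```

A pure node (a single class) scores 0 under both measures, so splits are chosen to reduce these values.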
(B) Support Vector Machine
The Support Vector Machine (SVM) is a supervised machine learning technique used for
both classification and regression, though it is most commonly applied to
classification. The SVM method looks for a hyperplane in N-dimensional space that
clearly separates the data points [15].
Fig. 3 Support vector machine (hyperplane with positive and negative margin
hyperplanes separating Class 1 and Class 2)
The number of features determines the hyperplane's dimension. With only two input
features, the hyperplane is essentially a line; with three input features, it becomes
a 2-D plane. When more than three features are involved, it is difficult to
visualize (Fig. 3).
SVMs differ from other classification algorithms in that they choose the decision
boundary that maximizes the distance to the nearest data points of all classes.
The decision boundary created by an SVM is referred to as the maximum margin
classifier or maximum margin hyperplane.
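The margin idea can be made concrete with a little coordinate geometry: the distance from a point x to the hyperplane w·x + b = 0 is |w·x + b| / ‖w‖, and the margin of a separating hyperplane is the smallest such distance over the training points. The sketch below uses hypothetical 2-D data, not an SVM solver.

```python
import math

def distance_to_hyperplane(w, b, x):
    """Distance |w . x + b| / ||w|| from point x to the hyperplane w . x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    return abs(dot + b) / math.sqrt(sum(wi ** 2 for wi in w))

def margin(w, b, points):
    """Margin of a separating hyperplane: smallest distance over the data."""
    return min(distance_to_hyperplane(w, b, x) for x in points)

# Hypothetical 2-D points separated by the line x1 + x2 - 3 = 0.
m = margin([1.0, 1.0], -3.0, [(0.0, 1.0), (0.0, 2.0), (2.0, 3.0), (3.0, 3.0)])
```

An SVM solver searches over (w, b) for the separating hyperplane that maximizes exactly this quantity.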
(C) GridSearch CV
Another scikit-learn technique for conducting a thorough search on hyper-parameters
is GridSearch CV. It is a process for going through many combinations of hyper-
parameter values in a methodical way and training a machine learning model for
each combination to see which collection of hyper-parameters performs the best.
Using a preset ‘grid’ of hyper-parameters, the hyper-parameter search is carried out,
which means that all conceivable combinations of hyper-parameter values are tested
systematically. It is a helpful tool for determining the best set of hyper-parameters
for a machine learning model, but it can be computationally expensive, especially
when there are many hyper-parameters and a wide range of possible values.
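The exhaustive search that GridSearch CV performs can be sketched in a few lines. The scoring function below is a hypothetical stand-in for "fit the model with these hyper-parameters and measure validation accuracy"; scikit-learn's GridSearchCV runs the same loop with cross-validation inside it.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every hyper-parameter combination and keep the best-scoring one."""
    best_params, best_score = None, float("-inf")
    keys = sorted(param_grid)
    for combo in product(*(param_grid[k] for k in keys)):  # every combination
        params = dict(zip(keys, combo))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical grid over two random-forest hyper-parameters.
grid = {"n_estimators": [100, 500], "max_depth": [4, 8, 16]}
best, score = grid_search(
    grid, lambda p: -abs(p["n_estimators"] - 500) - abs(p["max_depth"] - 8)
)
```

The cost is the product of the grid sizes (here 2 × 3 = 6 fits), which is why the search becomes expensive with many hyper-parameters.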
5 Results and Analysis
In the proposed work we collected the dataset from Kaggle. The dataset, named
Predicting Risk of Dyslexia-PLOS ONE, is used to detect the risk of dyslexia in
individuals. It has size (500, 7), meaning 500 samples and 7 features, which include
language vocabulary, memory, speed, visual discrimination, and audio discrimination.
There are no NaN values in the collected dataset, so we proceeded to the next step
without applying any imputation techniques.
(A) Standard Scaler Technique
After importing the dataset, we used the Standard Scaler technique to preprocess the
data. Standardization changes the distribution to have a mean of zero and a standard
deviation of one. This is accomplished by scaling each input variable individually:
subtracting the mean (a process known as centring) and dividing by the standard
deviation (Fig. 4).
(B) Scatter Plot of different algorithms
Data can be graphically represented using a scatter plot, in which points are plotted
on the coordinate axes according to their values. The figure below is the scatter plot
of the algorithms and ensemble techniques used in the proposed model; the X coordinate
represents the algorithms and the Y coordinate represents the score (Fig. 5).
Fig. 4 Data preprocessing
Fig. 5 Scatter plot of algorithms
Fig. 6 Line Plot of performance metrics
(C) Performance Metrics
The proposed model is evaluated based on four measures which are:
Accuracy
It is a parameter for measuring how well models perform in classification tasks, and
it is so well known that it is often used to summarize overall model performance. It
is the percentage of correct classifications that a trained machine learning model
achieves, i.e., the ratio of correct predictions to all predictions.
Precision
Precision measures how many of the detected items are genuinely relevant. It is
calculated by dividing the true positives by the total number of predicted positives,
and it tells us how reliably the model's positive classifications are correct.
Recall
Recall measures how many of the relevant items were found. It estimates the
percentage of actual positive labels that the model correctly identified: the true
positives divided by the total number of actual positives.
F1-score
One of the most important evaluation criteria in machine learning is the F1-score.
It succinctly distils a model's predictive power by combining precision and recall,
two measurements that ordinarily compete with one another (Figs. 6 and 7).
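The four outcome measures reduce to simple ratios over confusion-matrix counts; the tp/fp/fn/tn values in this sketch are hypothetical, not results from the study.

```python
def classification_metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)          # correct / all predictions
    precision = tp / (tp + fp)                          # detected items that are truly positive
    recall = tp / (tp + fn)                             # actual positives the model found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = classification_metrics(tp=40, fp=10, fn=10, tn=40)
```

Because the F1-score is a harmonic mean, it stays low unless precision and recall are both high, which is why it is preferred when the two trade off against each other.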
6 Conclusion and Future Scope
Our proposed methodology aims to provide an efficient classification for dyslexia.
Our model comprises three stages. First, for data acquisition, the Predicting
Risk of Dyslexia-PLOS ONE dataset is collected from Kaggle. For the
Fig. 7 Performance measures
data preprocessing, we used the Standard Scaler technique to remove the mean and scale
every variable to unit variance; this process is carried out independently on each
variable. We compared the algorithms with one another based on their performance
metrics. The algorithms included are Decision Tree, SVM, and Random Forest Classifier,
together with ensemble techniques such as Random Forest with GridSearch CV and SVM
with GridSearch CV. The best results were obtained using the ensemble of the Random
Forest Classifier and GridSearch CV. The model uses this ensemble for classification
and gives efficient values for performance metrics such as accuracy, precision,
recall, and F1-score. Our proposed model is user friendly and effective for
predicting dyslexia.
References
1. Protopapas A, Parrila R (5 April, 2018) Is dyslexia a brain disorder? Brain Sci 8(4):61. https://
doi.org/10.3390/brainsci8040061. PMID: 29621138; PMCID: PMC5924397
2. Snowling MJ, Hulme C, Nation K (13 Aug, 2020) Defining and understanding dyslexia: past,
present and future. Oxf Rev Educ. 46(4):501–513. https://doi.org/10.1080/03054985.2020.176
5756. PMID: 32939103; PMCID: PMC7455053
3. Raschle NM, Chang M, Gaab N (1 Aug 2011) Structural brain alterations associated with
dyslexia predate reading onset. Neuroimage 57(3):742–9. https://doi.org/10.1016/j.neuroi
mage.2010.09.055. Epub 2010 Sep 25. PMID: 20884362; PMCID: PMC3499031
4. Snowling MJ (1 Jan 2013) Early identification and interventions for dyslexia: a contemporary
view. J Res Spec Educ Needs 13(1):7–14. https://doi.org/10.1111/j.1471-3802.2012.01262.x.
PMID: 26290655; PMCID: PMC4538781
5. Werth R (2019) What causes dyslexia? Identifying the causes and effective compensatory
therapy. Restor Neurol Neurosci 37(6):591–608. https://doi.org/10.3233/RNN-190939. PMID:
31796709; PMCID: PMC6971836
6. Nerušil B, Polec J, Škunda J, Kačur J (3 Aug 2021) Eye tracking based dyslexia detection
using a holistic approach. Sci Rep 11(1):15687. https://doi.org/10.1038/s41598-021-95275-1.
PMID: 34344972; PMCID: PMC8333039
7. Radford J, Richard G, Richard H, Serrurier M. Detecting dyslexia from audio records: an AI
approach. https://doi.org/10.5220/0010196000580066
8. Brunswick N, Bargary S (28 Aug 2022) Self-concept, creativity and developmental dyslexia
in university students: effects of age of assessment. Dyslexia 28(3):293–308. https://doi.org/
10.1002/dys.1722. Epub 2022 Jul 11. PMID: 35818173; PMCID: PMC9543102
9. Chakraborty V, Sundaram M, Machine learning algorithms for prediction of dyslexia using eye
movement. 06 Nov 2020 Bengaluru. https://doi.org/10.1088/1742-6596/1427/1/012012
10. Hassanain E. A multimedia big data retrieval framework to detect dyslexia among children.
2017 IEEE international conference on big data. 978-1-5386-2715-0/17
11. ˙
Ileri R, Latifo˘glu F, Demirci E (2020) New method to diagnosis of dyslexia using 1D-CNN,
2020 medical technologies congress (TIPTEKNO). Antalya, Turkey, pp 1–4. https://doi.org/
10.1109/TIPTEKNO50054.2020.9299241
12. Seshadri NPG, Singh BK (2020) Hemispheric lateralization analysis in dyslexic and normal
children using rest-EEG. 2020 IEEE recent advances in intelligent computational systems
(RAICS). Thiruvananthapuram, India, pp 37–41. https://doi.org/10.1109/RAICS51191.2020.
9332509
13. Frid A, Manevitz LM (2018) Features and machine learning for correlating and classifying
between brain areas and dyslexia. arXiv:1812.10622
14. Ali J, Khan R, Ahmad N, Maqsood I (2012) Random forests and decision trees. Int J Comput
Sci Issues (IJCSI) 9
15. Evgeniou T, Pontil M (2001) Support vector machines: theory and applications. 2049. 249–257.
https://doi.org/10.1007/3-540-44673-7_12
Identifying Suicidal Risk: A Text
Classification Study for Early Detection
Devineni Vijaya Sri, Anumolu Bindu Sai, Valluri Anand,
and Karanam Manjusha
Abstract Language usage is affected by suicidal intent that is conveyed on social
media. Many at-risk users rely on online forum websites to discuss their issues or
find out information about related duties. Our study’s main goal is to share ongoing
research on automatically identifying suicidal postings. We developed a method in
order to identify individuals who might be at suicide risk by analysing data from
social networking sites like Reddit. To achieve this, we plan to apply a variety
of classification techniques, including both deep learning and traditional machine
learning methods. To this purpose, we compare our results to those of other clas-
sification methods using a combined LSTM-CNN model. Our experiment reveals
that combining word embedding techniques with neural network architecture may
produce the best relevance classification results. Furthermore, our results show how
deep learning architectures may be used to build a viable model for a suicide risk
assessment by excelling at a variety of text classification tasks.
Keywords Suicidal ideation ·Neural network architecture ·Text classification ·
Classification algorithms ·LSTM-CNN model
1 Introduction
The suicide mortality rate was anticipated to rise to one death every 20 seconds by
2020 [1]. Nearly 79% of suicides take place in low- and middle-income countries, where
resources for detection and management are frequently insufficient and limited.
However, Pompili et al. [2] show that “many characteristics thought to be risk factors
for suicidal conduct” might be fairly comparable in a suicide ideator and a suicide
attempter. In
D. Vijaya Sri (B)
Department of Information Technology, Lakireddy Bali Reddy College of Engineering,
Mylavaram, Andhra Pradesh, India
e-mail: devineni66@gmail.com
A. B. Sai ·V. A n a n d ·K. Manjusha
Lakireddy Bali Reddy College of Engineering, Mylavaram, Andhra Pradesh, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_13
163
164 D. Vijaya Sri et al.
order to reduce suicide rates by 10% by 2020 [3], early detection of suicidal ideation
has been developed and put into practice as a part of national harm reduction plans
in WHO member countries.
It offers a useful research environment for the creation of cutting-edge technical
innovations that might revolutionise suicide risk reduction and suicide detection
[4]. That might serve as a good preliminary step for intervention. Kumar et al. [5]
conducted a study on the posting habits of Reddit Suicide Watch users who keep
up with news regarding celebrity suicides [6]. They presented a strategy that could
effectively prevent copycat suicides involving prominent figures, creating a
methodology based on propensity weighting to determine the distinctive signs of this
transition.
AvgDiffLDP is an innovative optimisation approach that Ji et al. [7] recently devised
for the early identification of suicidal thoughts. Our study’s main goal is to use
powerful deep learning architectures for data analysis to disseminate knowledge
about suicide thoughts in Reddit social media communities [8]. We attempt to deter-
mine if combining CNN and LSTM classifiers into a single model may enhance the
performance of language modelling and text categorisation [9].
On the basis of the baseline and our suggested model, we assess the experimental
strategy. We leverage data gathered from Reddit social media, a platform that allows
users to write lengthier messages, for our data set [10].
To conduct our experiment, we first choose a data source and assess the salient
features of our suggested model. Our next step involves analysing the frequency of
n-grams (both individual words and pairs of words) in the dataset, with the aim of
identifying signals of suicidal intent [11]. This analysis is designed to uncover the
patterns, trends, and telltale signs that point to the presence of suicidal ideation
[12]. We assess the experimental strategy on the basis of the baseline and our
suggested model. Lastly, we use tenfold cross-validation to train our LSTM-CNN model
and identify the most effective hyper-parameters for spotting suicidal thoughts [13].
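The unigram/bigram frequency step can be sketched with collections.Counter; the example posts below are invented neutral placeholders, not posts from the study's Reddit dataset.

```python
from collections import Counter

def ngram_counts(posts, n):
    """Count the n-grams (as token tuples) over a collection of posts."""
    counts = Counter()
    for post in posts:
        tokens = post.lower().split()
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts

posts = ["i feel so alone", "i feel tired", "nobody to talk to"]
unigrams = ngram_counts(posts, 1)  # individual words
bigrams = ngram_counts(posts, 2)   # pairs of adjacent words
```

Ranking the resulting counts (e.g. with `most_common`) surfaces the recurring phrases that the analysis treats as candidate signals.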
Our study makes the following three distinct contributions:
N-gram analysis: We examined n-gram data from suicide-related forums in our
study to demonstrate that decreased social engagement and suicide thoughts are
regularly mentioned topics [14]. Our results show that a shift to suicidal ideation
is linked to a range of psychological conditions, such as an increase in self-focus,
despair, discontent, anxiety, or loneliness.
Traditional feature analysis: We used traditional feature analysis to compare
different approaches to detecting suicidal thoughts [15]. We used CNN, LSTM, and
a combined LSTM-CNN model to compare the performance of statistical features
with word2vec, bag-of-words, and TF-IDF representations.
Comparative analysis: We evaluated the accuracy of our proposed deep neural
network model, the integrated LSTM-CNN classifier, for detecting suicidal
thoughts [16]. To establish a state-of-the-art approach, we compared its performance
and potential against other deep learning techniques, including CNN and LSTM, as
Identifying Suicidal Risk: A Text Classification Study for Early Detection 165
well as four conventional machine learning classifiers, namely SVM, NB, RF, and
XGBoost. The evaluation was conducted using a real-world dataset.
2 Literature Review
Reddit users’ suicidal inclinations were explored by Kumar et al. [17] in relation to
the Werther or copycat effect [18]. Their research shows that after news of celebrity
suicides, individuals’ posting frequency significantly increased and their language
behaviour changed. This change was seen as moving in the direction of postings that
were less socially integrated and were more negative and self-focused. Similar to
this, Ueda et al. [19] carried out in-depth research on a million Twitter tweets after
26 well-known Japanese celebrities committed suicide between 2010 and 2014.
Suicidal inclinations are more effectively recognised when regular linguistic
patterns in social media material are identified, which is often supported by applying
machine learning algorithms to various NLP techniques. A suicide note anal-
ysis technique was developed by Desmet et al. [20] utilising binary support vector
machine (SVM) classifiers to identify suicidal thoughts. Machine learning algo-
rithms have been shown to be effective in separating people into those who are and
are not at suicide risk by Braithwaite et al. [21]. Wood et al. [22] discovered Twitter
users and then monitored their tweets up until the point of their attempted suicide.
The research of Okhapkina et al. [23] looked at adapting information retrieval
techniques to spot harmful informational influence on social networks [24]. They
created a lexicon of phrases with a suicidal undertone and applied singular value
decomposition to TF-IDF matrices.
Significant modifications have been made as a result of recent developments in
neural network models for natural language processing. Recurrent neural networks
(RNN) have distinguished themselves as a particularly potent method for sequence
modelling among these [25].
Recent research has demonstrated that convolutional, nonlinear, and pooling
layers in CNN neural networks perform better than conventional NLP techniques
for a variety of NLP tasks [26]. However, it fails to capture distant interactions and
instead highlights local n-gram properties. The power of CNN on n-gram character-
istics from different sentence positions was supported by Kalchbrenner et al. [27].
Yin and Schütze devised a strategy that utilises unsupervised pre-training and
multiple-channel word embedding to enhance classification accuracy.
Using n-gram features with cTAKES and LR approaches, Gehrmann et al.
compared the CNN model to more conventional, rule-based entity extraction
methods. They found in their investigation [28] that CNN performs better than
previous phenotyping algorithms in predicting ten phenotypes. Morales et al.
demonstrated the efficacy of CNN and LSTM models, evaluated on novel personality
and tone traits, for assessing the risk of suicide. In comparison with other methods,
CNN performed better in detecting the presence of suicidal inclinations in teenagers,
according to Bhat et al., and deep learning
166 D. Vijaya Sri et al.
techniques were used by Du et al. to identify mental stress in social media for
suicide identification. They created a binary classifier using CNN networks to distin-
guish between suicidal and non-suicidal tweets. According to other recent studies,
the Suicide Watch forum, which is used as a data set in our research article, benefited
from CNN implementations.
Fundamentally, a single recurrent or convolutional neural network that encodes an
entire sequence into one vector is usually not enough to capture all of the
significant information. A hybrid framework that combines the advantages of RNNs
and CNNs has been worked on. This strategy seeks to improve results by utilising the
distinctive qualities of each model. The measurement problem of semantic textual
similarity has received significant attention. Using both CNNs and RNNs inside the
hybrid framework, several methods have been researched and developed to increase
the precision and dependability of these metrics. To overcome the difficulty of
determining semantic textual similarity, He et al. developed a new
neural network model that combines ConvNet and Bi-LSTMs. In order to get better
outcomes, Matsumoto et al. suggested a hybrid framework that employs a quick
method of deep learning in close cooperation with an initial information retrieval
model.
3 Methodology
Reddit users converse through the comment threads that are connected to each post [29].
The Ji et al. [30] data set, which includes a list of postings that are both suicide-
indicative and not, was employed in our investigation. Users’ private information
is replaced with a special ID to protect their privacy. Because users tend to
participate in numerous sub-Reddits, each group is composed of a roughly equal
number of messages originating from diverse themes. Our data set is made up
of 3652 non-suicidal posts and 3549 posts with suicidal indications from reasonably
big sub-Reddits supporting those who may be at risk. Posts that are not suicidal
come from sub-Reddits with a focus on friends and family.
4 Existing Schemes
A comprehensive overview of our suggested framework is shown in Fig. 1. The two
frameworks for text data mining differ. Natural language processing (NLP)
methods are used in the first framework to pre-process data and extract features. Prior
to being analysed by standard machine learning systems as baseline approaches, the
words are first encoded using techniques like TF-IDF, BOW, and statistical features.
The second framework, on the other hand, uses deep learning classifiers after pre-
processing the data and extracting features using word embedding methods. Also,
Fig. 1 Framework for suicide ideation detection
this framework provides two different kinds of classifiers: one for the conventional
approach and one for the proposed model.
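As a minimal sketch of the TF-IDF encoding used in the baseline framework (toy documents; a real pipeline would use a library vectoriser), each word is weighted by its in-document frequency discounted by how many documents contain it:

```python
# Minimal TF-IDF encoding of a toy corpus (illustrative only; not the
# study's actual feature pipeline).
import math
from collections import Counter

docs = [
    ["sad", "alone", "hopeless"],
    ["happy", "friends", "party"],
    ["sad", "tired", "alone"],
]

n_docs = len(docs)
# Document frequency: how many documents contain each word.
df = Counter()
for doc in docs:
    df.update(set(doc))

def tfidf(doc):
    # Term frequency scaled by the inverse document frequency.
    tf = Counter(doc)
    return {w: (tf[w] / len(doc)) * math.log(n_docs / df[w]) for w in tf}

weights = tfidf(docs[0])
# "hopeless" appears in only one document, so it outweighs "sad",
# which appears in two.
```

The resulting weight vectors would then be fed to the baseline machine learning classifiers (SVM, NB, RF, XGBoost).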
Model Architecture and Its Parameters
The parameter configuration for the proposed model (LSTM +CNN) is given in
Table 1. The following parameters are used in the experiment: the number of filters,
the kernel size, the padding, the pooling size, the optimiser, the batch size, the epochs,
and the units. The NLTK natural language toolkit is used with Python. The models
are created using the TensorFlow deep learning framework.
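As an illustrative sketch (not the authors' released code), the Table 1 configuration can be expressed in TensorFlow/Keras; the vocabulary size is an assumption, and the filter count (8) and kernel size (3) are one choice from the ranges listed in Table 1:

```python
# Sketch of the LSTM-CNN configuration from Table 1 in Keras.
# VOCAB_SIZE is an assumption; filters=8 and kernel_size=3 are one
# choice from the ranges given in Table 1.
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 10_000  # assumption, not stated in the paper

model = tf.keras.Sequential([
    layers.Embedding(VOCAB_SIZE, 300),                    # embedding dimension 300
    layers.Conv1D(8, 3, padding="same", activation="relu"),
    layers.MaxPooling1D(pool_size=2),                     # max-pooling
    layers.LSTM(100),                                     # 100 units
    layers.Dropout(0.5),
    layers.Dense(2, activation="softmax"),                # suicidal vs. non-suicidal
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

With the batch size and epoch count from Table 1, training would be invoked as `model.fit(x_train, y_train, batch_size=8, epochs=10)`.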
Table 1 Parameter configuration for the proposed LSTM-CNN model

LSTM-CNN model layers    Parameters             Values
Convolutional layer      Number of filters      2, 4, 6, 8
                         Kernel size            2, 3, 4
                         Padding                ‘Same’
                         Activation function    ‘ReLU’
Pooling layer            Pooling type           Max-pooling
LSTM layer and other     Units                  100
                         Embedding dimension    300
                         Batch size             8
                         Number of epochs       10
                         Dropout                0.5
Fully connected layer    Activation             Softmax

The assessment metrics are based on the numbers of true positive predictions (TP),
true negative predictions (TN), false positive predictions (FP), and false negative
predictions (FN) [80]. Accuracy, defined as follows, is the simplest classification
score; precision, recall, and F1 are computed from the same counts:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 · (Precision · Recall) / (Precision + Recall)
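As a quick numerical check of these formulas (toy confusion counts, not the study's results):

```python
# Compute accuracy, precision, recall, and F1 from toy confusion counts.
TP, TN, FP, FN = 40, 45, 5, 10

accuracy = (TP + TN) / (TP + TN + FP + FN)   # 0.85
precision = TP / (TP + FP)                   # 40/45
recall = TP / (TP + FN)                      # 40/50
f1 = 2 * precision * recall / (precision + recall)
```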
5 Results
Our strategy is split into two phases. The first stage entails scrutinising the labelled
Reddit posts corpus and comparing the most common n-grams in posts with clinical
depression to those without. This aids in identifying any patterns or indicators of
suicidal intent. The next step is to compare the effectiveness of our proposed deep
learning prediction model against classifier baselines built from a predetermined
collection of features. This enables us to accurately determine the classifier’s
ability to detect suicidal ideation and evaluate its performance using appropriate
analytical metrics (Figs. 2, 3, 4, 5, 6, 7 and 8).
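The tenfold cross-validation mentioned earlier amounts to partitioning the data into ten folds and rotating the validation fold; a minimal index-level sketch (illustrative, not the authors' code):

```python
# Split n samples into k contiguous folds; each fold serves once as the
# validation set while the remaining folds form the training set.
def k_fold_splits(n, k=10):
    indices = list(range(n))
    fold_size, remainder = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        stop = start + fold_size + (1 if i < remainder else 0)
        folds.append(indices[start:stop])
        start = stop
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

splits = list(k_fold_splits(25, k=10))  # 10 (train, validation) index pairs
```

Each hyper-parameter setting is scored by averaging the validation metric over the ten folds.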
6 Conclusion
Deep learning techniques are being integrated into suicide care, opening up new avenues
for better ideation detection and the potential for early suicide prevention. Our study
contributes to the effort to advance computational linguistics technologically so that
it may be successfully used in the field of mental health treatment and disseminated
among researchers.
Our method was developed for this aim using a sub-Reddit data corpus made up
of posts that were both suicide indicative and were not. To transform the text of the
postings into a format that our system could understand, we employed several data
representation approaches. By using several NLP and text classification algorithms,
we were able to identify a tighter link between language usage and suicidal ideation.
Fig. 2 Training accuracy
Fig. 3 Loss curve
Fig. 4 Suicide prediction
Fig. 5 Accuracy curve
Fig. 6 Data set collection
Fig. 7 Testing accuracy
Fig. 8 Suicide percentage
We discussed the LSTM-CNN experiment and saw CNN’s potential in several text
categorisation tasks. These networks were built on top of word2vec features.
Our goal was not to investigate in depth how sensitive CNN hyper-parameters
are to design decisions. Instead, we focused on enhancing CNN’s ability to classify
activities involving suicidal thoughts. We found the factors associated with portrayals
of suicidal inclinations throughout our data analysis. We saw a major change in the
way those at risk used language. Users’ self-centeredness was notably found to be
accompanied by indicators of irritation, pessimism, negativity, or loneliness.
References
1. World Health Organization (2018) National suicide prevention strategies: progress, examples
and indicators; World Health Organization: Geneva, Switzerland
2. Beck AT, Kovacs M, Weissman A (1975) Hopelessness and suicidal behavior: an overview.
JAMA 234:1146–1149
3. Silver MA, Bohnert M, Beck AT, Marcus D (1971) Relation of depression of attempted suicide
and seriousness of intent. Arch Gen Psychiatry 25:573–576
4. Klonsky ED, May AM (2014) Differentiating suicide attempters from suicide ideators: a critical
frontier for suicidology research. Suicide Life-Threat Behav 44:1–5
5. Pompili M, Innamorati M, Di Vittorio C, Sher L, Girardi P, Amore M (2014) Sociodemographic
and clinical differences between suicide ideators and attempters: a study of mood disordered
patients 50 years and older. Suicide Life-Threat. Behav. 44:34–45
6. DeJong TM, Overholser JC, Stockmeier CA (2010) Apples to oranges?: a direct comparison
between suicide attempters and suicide completers. J Affect Disord 124:90–97
7. De Choudhury M, Kiciman E, Dredze M, Coppersmith G, Kumar M (2016) Discovering shifts
to suicidal ideation from mental health content in social media. In: Proceedings of the 2016
CHI conference on human factors in computing systems, San José, CA, USA, 9–12 December
2016; ACM: New York, NY, USA, pp 2098–2110
8. Marks M (2019) Artificial intelligence based suicide prediction. Yale J Health Policy Law
Ethics. Forthcoming
9. Kumar M, Dredze M, Coppersmith G, De Choudhury M (2015) Detecting changes in suicide
content manifested in social media following celebrity suicides. In: Proceedings of the 26th
ACM conference on hypertext & social media, Prague, Czech Republic, 4–7 July 2015; ACM:
New York, NY, USA, pp 85–94
10. Ji S, Long G, Pan S, Zhu T, Jiang J, Wang S (2019) Detecting suicidal ideation with data
protection in online communities. In: Proceedings of the international conference on database
systems for advanced applications, Chiang Mai, Thailand, 22–25 April 2019. Springer, Berlin,
Germany, pp 225–229
11. Yang Y, Zheng L, Zhang J, Cui Q, Li Z, Yu PS (2018) TI-CNN: convolutional neural networks
for fake news detection. arXiv arXiv:1806.00749
12. Mikolov T, Karafiát M, Burget L, Černocký J, Khudanpur S (2010) Recurrent neural network
based language model. In: Proceedings of the eleventh annual conference of the international
speech communication association, Makuhari, Chiba, Japan, 26–30 September 2010
13. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed representations of words
and phrases and their compositionality. In: Proceedings of the advances in neural information
processing systems, Lake Tahoe, CA, USA, 5–10 December 2013; pp 3111–3119
14. Coppersmith G, Ngo K, Leary R, Wood A. Exploratory analysis of social media prior to
a suicide attempt. In: Proceedings of the third workshop on computational Linguistics and
clinical psychology, San Diego, CA, USA, 16 June 2016, pp 106–117
15. Hsiung RC (2007) A suicide in an online mental health support group: reactions of the group
members, administrative responses, and recommendations. CyberPsychol Behav 10:495–500
16. Jashinsky J, Burton SH, Hanson CL, West J, Giraud-Carrier C, Barnes MD, Argyle T (2014)
Tracking suicide risk factors through Twitter in the US. Crisis 35:51–59
17. Colombo GB, Burnap P, Hodorog A, Scourfield J (2016) Analysing the connectivity and
communication of suicidal users on twitter. Comput Commun 73:291–300
18. Niederkrotenthaler T, Till B, Kapusta ND, Voracek M, Dervic K, Sonneck G (2009) Copycat
effects after media reports on suicide: a population-based ecologic study. Soc Sci Med 69:1085–
1090
19. Ueda M, Mori K, Matsubayashi T, Sawada Y (2017) Tweeting celebrity suicides: users reaction
to prominent suicide deaths on Twitter and subsequent increases in actual suicides. Soc Sci
Med 189:158–166
20. Desmet B, Hoste V (2013) Emotion detection in suicide notes. Expert Syst Appl 40:6351–6358
21. Huang X, Zhang L, Chiu D, Liu T, Li X, Zhu T. Detecting suicidal ideation in Chinese
microblogs with psychological lexicons. In: Proceedings of the 2014 IEEE 11th international
conference on ubiquitous intelligence and computing and 2014 IEEE 11th international confer-
ence on autonomic and trusted computing and 2014 IEEE 14th international conference on
scalable computing and communications and its associated workshops, Bali, Indonesia, 9–12
December 2014; pp 844–849
22. Braithwaite SR, Giraud-Carrier C, West J, Barnes MD, Hanson CL (2016) Validating machine
learning algorithms for Twitter data against established measures of suicidality. JMIR Ment
Health 3:e21
23. Sueki H (2015) The association of suicide-related Twitter use with suicidal behaviour: a cross-
sectional study of young internet users in Japan. J Affect Disord 170:155–160
24. O’Dea B, Wan S, Batterham PJ, Calear AL, Paris C, Christensen H (2015) Detecting suicidality
on Twitter. Internet Interv 2:183–188
25. Wood A, Shiffman J, Leary R, Coppersmith G. Language signals preceding suicide attempts.
In: Proceedings of the CHI 2016 computing and mental health workshop, San Jose, CA, USA,
7–12 May 2016
26. Okhapkina E, Okhapkin V, Kazarin O. Adaptation of information retrieval methods for iden-
tifying of destructive informational influence in social networks. In: Proceedings of the 2017
IEEE 31st international conference on advanced information networking and applications
workshops (WAINA), Taipei, Taiwan, 27–29 March 2017; pp 87–92
27. Sawhney R, Manchanda P, Singh R, Aggarwal S. A computational approach to feature extrac-
tion for identification of suicidal ideation in tweets. In: Proceedings of the ACL 2018, student
research workshop, Melbourne, Australia, 15–20 July 2018; pp 91–98
28. Aladağ AE, Muderrisoglu S, Akbas NB, Zahmacioglu O, Bingol HO (2018) Detecting suicidal
ideation on forums: proof-of-concept study. J Med Internet Res 20:e215
Citrus Plant Leaves Disease Detection
Using CNN and LVQ Algorithm
Roop Singh Meena and Shano Solanki
Abstract This study introduces a unique method for disease identification in citrus
plants by combining convolutional neural network (CNN) and learning vector quanti-
zation (LVQ) techniques. The suggested technology is meant to aid in the early iden-
tification and diagnosis of citrus plant diseases, which is important for preserving
crop yields and avoiding crop loss. Features are extracted from pictures of citrus
plant leaves using a convolutional neural network. Furthermore, the LVQ algorithm
is used to identify the retrieved features as either healthy or unhealthy. When tested
on a dataset consisting of photographs of citrus plant leaves, the suggested system
achieved a high accuracy of 96.33% in disease classification. A total of 3570 pictures
were used in this analysis, including both healthy and diseased citrus plants, repre-
senting different pathogens (citrus canker, citrus scab, citrus rust, other diseases, and
healthy images) classes. There are 500 test photographs from each class and 1070
full-size images across the test categories. An F1-score of 96.54%, a recall score
of 96.54%, and a precision score of 96.69% were all obtained using the proposed
strategy. Based on the obtained data, it appears that the proposed method achieves
superior accuracy in disease identification compared to the state-of-the-art methods.
The citrus industry stands to benefit greatly from this strategy, as it may be used for
early disease identification and prevention in citrus plants.
Keywords Citrus plant diseases ·CNN ·LVQ ·Image preprocessing ·
Segmentation
R. S. Meena (B)·S. Solanki
Computer Science and Engineering Department, NITTTR, Chandigarh, India
e-mail: Roopsingh1988@gmail.com
S. Solanki
e-mail: shano@nitttrchd.ac.in
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_14
1 Introduction
Plant disease detection is an important task in agriculture as it helps to identify and
control the spread of diseases that can damage crops and reduce yields. In recent years,
the use of convolutional neural networks (CNNs) and other deep learning techniques
has been adopted to automate the process of plant disease detection. Machine
learning has replaced traditional pattern recognition and image processing
approaches for citrus plant disease identification. Automated fruit categorization
using machine vision can improve classification accuracy and address problems
with manual filtration, including low output and variable separation levels [1].
Plant disease detection is one of agriculture’s greatest challenges. Early detection
prevents a disease from infecting other crops and causing significant monetary
losses. Plant diseases can have a range of effects
on the agricultural economy, from mild symptoms to crop loss. CNN technology is
believed to be the most efficient approach to deep learning.
This industry is crucial to the economy of India because 60% of the population
relies on agriculture. As a result, it is an important area of study. Agriculture sector
photos are taken using IoT sensors, cameras, and drones [2].
2 Taxonomy of Citrus Diseases
Citrus plant diseases contribute significantly to the decline in agricultural output,
which is detrimental to the economy of any country. Citrus fruit contains vitamin
C as well as good amounts of other vitamins and minerals, including B vitamins,
potassium, phosphorus, magnesium, and copper.
It is a challenge to accurately identify several citrus diseases using deep learning-
based methods (Fig. 1).
2.1 Citrus Canker
Lesions on the leaves of citrus plants are the most dangerous symptom of citrus
canker, a bacterial disease of citrus trees. When citrus trees are infected with
citrus canker, their leaves and fruit begin to drop prematurely. The white, spongy
patches on the injured leaves may change to a darker hue, like brown or gray.
Ring-like sores with oily edges can be seen on either side of the leaf. Signs of
citrus canker disease include the appearance of raised, scabby
lesions on the leaves, stems, and fruit, as well as the premature dropping of the fruit
and the plant’s defoliation. Leaves and fruits might become misshapen when the
lesions, which are often surrounded by an oily, water-soaked edge, eventually join
together. This citrus blight can be recognized by its characteristic lesions [3].
Fig. 1 Citrus plant leaf diseases: scab, rust, canker, healthy, greening, black spot
2.2 Citrus Scab
Citrus scab disease symptoms include the formation of raised, corky scab-like lesions
on fruit, leaves, and twigs. These lesions can vary in size and color from light yellow
to dark brown and can lead to fruit cracking and distortion. Additionally, infected
leaves may become distorted or drop prematurely [4].
2.3 Citrus Rust
Citrus rust disease symptoms include the formation of yellow-orange pustules on
leaves, stems, and fruit. These pustules may later turn brown or black as they dry out,
which can cause defoliation and premature fruit drop. Severe infections can lead to
stunted growth and reduced fruit quality [5].
2.4 Citrus Greening
Citrus greening disease, also known as Huanglongbing (HLB), has symptoms
including asymmetrical yellowing of leaves, mottled leaves, yellow shoots, and
stunted growth. Infected trees may produce small, lopsided, and bitter fruit that
does not ripen properly. HLB is a serious and incurable disease that can ultimately
kill the tree [6].
2.5 Citrus Anthracnose
Citrus anthracnose disease symptoms include the formation of small, circular, or
irregular-shaped sunken lesions on leaves, twigs, and fruit. These lesions may be
brown or black and have a water-soaked appearance. Infected fruit may drop prema-
turely, become deformed, and rot. Severe infections can cause defoliation and dieback
of twigs and branches.
These features can provide valuable information for various image analysis tasks
in fields such as computer vision, remote sensing, and medical imaging [7].
3 Convolutional Neural Network
The primary application of the deep learning model known as a convolutional neural
network (CNN) is in the domain of image and video recognition [8].
3.1 Convolutional Layer
The convolution layer gives the CNN its name. This layer applies mathematical
operations to extract features from the input picture [9]. A filter is slid
progressively across the picture, starting in the upper-left corner; at each
position, the overlapping picture values are multiplied element-wise by the filter
values and summed [10]. This produces a new, smaller matrix from the provided
image. The process of convolution in the convolution layer is depicted in Fig. 2
below using a 5 × 5 input image and a 3 × 3 filter. A
general CNN feature mapping is shown in Fig. 2 [9].
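The sliding-window computation described above can be sketched in NumPy, using a toy 5 × 5 input and a 3 × 3 filter of ones (illustrative values, not from the paper):

```python
# Valid 2D convolution (cross-correlation) of a 5x5 image with a 3x3 filter.
import numpy as np

image = np.arange(25).reshape(5, 5)   # toy 5x5 input
kernel = np.ones((3, 3))              # toy 3x3 filter

h = image.shape[0] - kernel.shape[0] + 1
w = image.shape[1] - kernel.shape[1] + 1
out = np.zeros((h, w))
for i in range(h):
    for j in range(w):
        # Element-wise multiply the window by the filter and sum.
        out[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)

# out has shape (3, 3): the new, smaller feature map.
```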
3.2 Pooling Layer
The pooling layer is usually applied immediately after the convolution layer and
reduces the size of the convolution layer’s output matrix. The filter size used by
the pooling layer can vary, but typically it is 2 × 2. Several pooling operations,
including max pooling, average pooling, and L2-norm pooling, are compatible with
this layer. A max-pooling
Fig. 2 Convolutional layer
Fig. 3 Pooling layer
filter with a stride of 2 was used in this investigation. To perform max pooling, this
filter takes the maximum value from each sub-window and moves it to a new matrix.
Max pooling layer working is presented in Fig. 3.
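The 2 × 2, stride-2 max pooling step can be sketched in NumPy on a toy feature map (illustrative values only):

```python
# 2x2 max pooling with stride 2 on a toy 4x4 feature map.
import numpy as np

feature_map = np.arange(16).reshape(4, 4)
# Group the map into 2x2 sub-windows, then keep the maximum of each window.
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
# pooled has shape (2, 2): each value is the max of one sub-window.
```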
3.3 Activation Layer
An activation layer in a convolutional neural network (CNN) is a nonlinear transfor-
mation applied to the output of a convolutional layer. It introduces nonlinearity to
the network and allows the network to learn complex representations. Examples of
activation functions used in CNNs include ReLU, sigmoid, and tanh. In ReLU, less
than 0 values are set to zero, while positive ones are left unchanged [1].
A(x) = 0, if x < 0
A(x) = x, otherwise.
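The piecewise ReLU definition above translates directly to code:

```python
# ReLU activation: negative inputs map to zero, non-negative inputs pass through.
def relu(x):
    return x if x > 0 else 0
```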
3.4 Fully Connected Layer
A fully connected layer in a convolutional neural network (CNN) is a layer in which
each neuron is connected to all neurons in the previous layer, similar to a traditional
neural network. This layer typically comes after the convolutional and pooling layers
in a CNN and is used to transform the output from the convolutional layers into a
format that can be used for classification or regression. This layer is responsible for
recognition and categorization.
4 Literature Survey
After the study of various research papers on citrus plant disease, we concluded that
the classification of citrus plant disease is a very complex task. In the literature,
various authors and researchers discussed citrus plant leaf disease detection using
different types of techniques for image processing operations.
Singh et al. [11] proposed an algorithm for automatically detecting and classi-
fying plant leaf diseases using an image segmentation technique. The paper included
a review of various disease classification techniques that could be used for plant leaf
disease detection. The authors applied a genetic algorithm to perform image segmen-
tation on 106 images for both training and testing. The reported accuracy of disease
detection was 86.54% for the K-means technique with the proposed algorithm and
93.63% for support vector machines with the suggested algorithm. They suggested
that the Bayes classifier, artificial neural networks (ANN), and hybrid algorithms
could be used to further enhance the classification recognition rate.
Shaikh and colleagues [12] presented a study titled “Citrus Leaf Unhealthy Region
Detection by Using Image Processing Techniques.” In their research, the authors
utilized various image processing techniques, such as image normalization, contrast
enhancement, and initial processing. They extracted features using the gray level co-
occurrence matrix (GLCM) method and applied bi-level thresholding for segmenta-
tion. To categorize the unhealthy regions, the authors used a hidden Markov model
and achieved an accuracy rate of 84.21% for anthracnose, 85.71% for canker, 78%
for citrus greening, and 82.50% for overwatering in the classification of citrus trees.
Sardogan et al. [13] presented a paper on “Plant Leaf Disease Detection and Clas-
sification based on CNN and the LVQ Algorithm,” and they took 500 images of
tomato crops from the plant village dataset of size 512 × 512. In that research, they
applied the convolutional neural network method of deep learning. To improve
classification accuracy, they used the learning vector quantization method. In this research,
their accuracy in classifying diseases on tomato leaves was 86%. These studies clas-
sified five distinct types of diseases that can affect tomato crops: healthy, late blight,
bacterial spot, septoria spot, and yellow curve.
In their paper titled “GANs-Based Data Augmentation for Citrus Disease Severity
Detection using Deep Learning,” Zeng and colleagues [14] focused on the detection
of Huanglongbing (HLB) infection using a citrus plant dataset from plant village
and crowd AI. They found that the Inception V3 model performed better than other
models in terms of severity detection accuracy, achieving an accuracy of 74.38%
due to its high computational efficiency and the smaller number of parameters.
The authors also proposed that their algorithm can further improve results by up
to 92.60%.
Kukreja and Dhiman [15] presented a paper on “A Deep Neural Network-based
Disease Detection Using Data Augmentation Techniques”. For their research work,
they took 120 images of citrus plant leaves, which they augmented to 1200 images
of size 256 × 256. They included the CNN model and used various techniques in
the preprocessing, segmentation, and augmentation stages of the model. Stochastic
gradient descent (SGD) optimization was employed to train the neural networks.
They reported 89.10% accuracy in disease detection.
Sharath and colleagues [16] published a research paper titled “Disease Detec-
tion in Plants Using Convolutional Neural Networks.” In their study, they used a
dataset comprising 12,891 plant images of various fruits such as oranges, grapes,
pomegranates, papayas, and citrus. The authors employed the grab cut method during
the segmentation stage of their convolutional neural network (CNN) model to identify
diseases. They reported that their CNN model achieved a plant disease detection effi-
ciency of 91%. The authors also suggested that the accuracy of their approach could
be further improved by utilizing appropriate image enhancement and classification
techniques.
Kaur et al. [17] proposed research on “A Genetic Algorithm-based Feature Opti-
mization Method for Citrus HLB Disease Detection using Machine Learning”. In
this work, an improved feature selection stage for HLB/citrus greening disease was
proposed, and a machine learning model was trained on both healthy and diseased
samples. A 60-image dataset, of which 30 are healthy and 30 are HLB infected,
was employed for
the study. Images are cropped and resized during the preprocessing stage, and the
K-means clustering algorithm is employed during the segmentation stage. The GLCM
approach is used for feature extraction. They reported SVM classifier efficiencies of
up to 90.40%.
Khattak et al. [8] conducted a study on the “Automatic Detection of Citrus Fruit
and Leaf Disease using a Deep Neural Network Model”. They used the “plant village”
dataset, which contains 213 images of citrus fruits. The proposed CNN model showed
a test accuracy of 94.55% for detecting black spots, cankers, scabs, and greening
disease. It indicates its usefulness as a decision-support tool for farmers in classifying
citrus fruit and leaf diseases.
Sujatha et al. [18] published a paper comparing the performance of machine
learning and deep learning techniques for plant leaf disease detection. In their
study, they discussed several machine learning techniques, including support vector
machines, random forests, and stochastic gradient descent. Additionally, they eval-
uated the performance of deep learning methods including Inception V3, VGG-16,
and VGG-19.
In their 2022 paper titled “Classification of Citrus Disease Using Optimiza-
tion Deep Learning Approach,” Elaraby et al. [19] explored the classification of
various citrus diseases, including black spot, canker, scab, greening, anthracnose,
and melanose. To achieve this, the authors used a combination of the plant village
and a self-collected dataset. They employed two convolutional neural networks,
namely AlexNet and VGG-19, to develop and evaluate their proposed method. The
dataset comprised 759 augmented images, each measuring 256 pixels on the longest
dimension. The authors reported an impressive performance of their model, with an
accuracy of 94.3%, a precision of 94.1%, a specificity of 93.9%, and an F-score of
94.3%.
5 Proposed Methodology
Deep learning has become increasingly popular in recent models for citrus plant
diseases. In this research work, we provide a brief overview of the proposed CNN
model for recognizing and categorizing citrus plant diseases using image processing
techniques. Using the suggested model for in-depth analysis, it is possible to identify
the infected citrus leaves and apply preventive treatment. The proposed work uses a
deep learning convolutional neural network model and a learning vector quantization
(LVQ) algorithm for quantizing the fully connected layer output and providing a
more accurate result. The proposed model includes dataset collection, segmentation,
feature extraction, and classification stages.
Initially, a dataset was created using publicly available data, and preprocessing
techniques were applied to the images. Augmentation operations were also applied
to some disease image categories to ensure equal representation of each category.
The dataset was split into 70% for training and 30% for testing.
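Such a per-class (stratified) 70/30 split keeps each disease category equally represented in both sets. A minimal sketch, assuming each image is stored as a (filename, label) pair — the file names and class labels below are illustrative placeholders, not the actual dataset:

```python
import random
from collections import defaultdict

def stratified_split(samples, train_frac=0.70, seed=42):
    """Split (image, label) pairs 70/30 per class so each
    disease category is equally represented in train and test."""
    by_class = defaultdict(list)
    for image, label in samples:
        by_class[label].append((image, label))
    rng = random.Random(seed)
    train, test = [], []
    for label, items in by_class.items():
        rng.shuffle(items)                     # avoid ordering bias
        cut = int(len(items) * train_frac)     # 70% of this class
        train.extend(items[:cut])
        test.extend(items[cut:])
    return train, test

# toy usage with illustrative class names (10 images per class)
data = [(f"img_{c}_{i}.jpg", c)
        for c in ["healthy", "multiple", "rust", "scab", "canker"]
        for i in range(10)]
train, test = stratified_split(data)
```

Because the split is done per class, a category left unbalanced by augmentation cannot end up entirely in one partition.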
Next, a convolutional neural network (CNN) was created using the Sequential()
function, which consisted of several convolutional and max pooling layers. The CNN
started with four convolutional layers with 32, 64, 128, and 512 filters, respectively,
each followed by a max pooling layer with a 2 × 2 window size. Each convolutional
layer utilized a 3 × 3 filter, and the ReLU activation function was applied. After the
final max pooling layer, a Flatten() layer was added to flatten the output of the earlier
layers, and a Dense() layer with the SoftMax activation function was incorporated to
generate probabilities for each of the five possible classes. The probabilities obtained
from the CNN were then fed to the LVQ algorithm, which uses a winner-take-all
approach. Learning vector quantization (LVQ) is a technique used in machine learning
and signal processing to classify input data into one of several predetermined
classes. LVQ is a supervised learning algorithm that uses a set of labelled training
examples to learn a mapping between input vectors and output classes. In LVQ, each
input vector is represented as a point in a high-dimensional space, and the goal is to
find a set of representative vectors (called prototypes) that can effectively partition the
input space into the predefined classes. The prototypes are typically initialized randomly
and are then iteratively adjusted based on the input data. During the training phase, each
input vector is presented to the LVQ algorithm, and the distance between the input
vector and every prototype is calculated. The input vector is then assigned to the closest
Citrus Plant Leaves Disease Detection Using CNN and LVQ Algorithm 183
Fig. 4 Flowchart for proposed methodology
prototype, which is known as the winning prototype. If the winning prototype carries
the same class label as the input vector, it is moved closer to the input; if it carries a
different label, it is moved further away. Once the training phase is complete, the
prototypes are fixed, and the LVQ algorithm can be used to classify new input vectors.
The classification process involves measuring the distance between the input vector
and each prototype and assigning the input vector to the class associated with the
closest prototype (Fig. 4).
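A minimal sketch of the standard LVQ1 training and nearest-prototype classification; the 2-D toy data, prototype initialization, and learning rate below are illustrative, not values from the paper:

```python
def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def lvq_train(data, prototypes, labels, lr=0.2, epochs=20):
    """LVQ1: pull the winning prototype toward same-class inputs,
    push it away from different-class inputs."""
    for _ in range(epochs):
        for x, y in data:
            # winner-take-all: index of the closest prototype
            w = min(range(len(prototypes)),
                    key=lambda i: dist2(prototypes[i], x))
            sign = 1.0 if labels[w] == y else -1.0
            prototypes[w] = [p + sign * lr * (xi - p)
                             for p, xi in zip(prototypes[w], x)]
    return prototypes

def lvq_classify(x, prototypes, labels):
    """Assign x the class of its nearest prototype."""
    return labels[min(range(len(prototypes)),
                      key=lambda i: dist2(prototypes[i], x))]

# two well-separated toy clusters with one prototype per class
data = [([0.0, 0.0], "A"), ([0.2, 0.1], "A"),
        ([1.0, 1.0], "B"), ([0.9, 1.1], "B")]
protos = [[0.4, 0.4], [0.6, 0.6]]   # rough random-style initialization
labels = ["A", "B"]
protos = lvq_train(data, protos, labels)
```

After training, classification is a nearest-prototype lookup — the same winner-take-all step the proposed pipeline applies to the CNN's probability vectors.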
After training all CNN values, these values can be used for the detection of plant
disease and all other performance parameters are calculated (Fig. 5).
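A quick shape walk through the four convolutional layers described above (32/64/128/512 filters, 3 × 3 kernels with 'valid' padding, each followed by 2 × 2 max pooling) shows where the Flatten() size comes from; the 128 × 128 × 3 input resolution is an assumption, since the chapter does not state it:

```python
def conv3x3(h, w, filters):
    # a 'valid' 3x3 convolution shrinks each spatial dimension by 2
    return h - 2, w - 2, filters

def pool2x2(h, w, c):
    # 2x2 max pooling halves each spatial dimension (floor division)
    return h // 2, w // 2, c

h, w, c = 128, 128, 3            # assumed input size (not given in the chapter)
for filters in (32, 64, 128, 512):
    h, w, c = conv3x3(h, w, filters)
    h, w, c = pool2x2(h, w, c)

flatten_units = h * w * c        # size of the Flatten() output fed to Dense()
```

With this assumed input, the final feature maps are 6 × 6 × 512, so the Flatten() stage would feed 18,432 units into the five-way SoftMax Dense() layer; a different input resolution changes these numbers but not the layer logic.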
5.1 Experiment and Results
The coding and implementation for this study were done in a Jupyter notebook of
the Anaconda framework, using Python 3.10. We use 2500 training images to train
our convolutional neural network model, and then 1070 test images to see how well
it can classify new images. Both the training and validation accuracies peaked at
the 29th epoch when running the designed approach for 30 epochs. After transferring
the CNN probabilities to the LVQ method, which results in enhanced
Fig. 5 Proposed methodology
classification accuracy, we obtain the performance metrics of accuracy, F1-score,
recall, and precision score. The confusion matrix for the classification of citrus disease
categories is shown in Fig. 6.
In Fig. 6, label 0 = healthy, label 1 = multiple diseases, label 2 = rust, label 3 =
scab, and label 4 = canker.
The performance of the proposed method is shown in tabular form in Table 1.
Fig. 6 Confusion matrix
Table 1 Performance of proposed methodology
Performance metrics    Result
Accuracy score         0.9654
F1-score               0.9655
Recall score           0.9654
Precision score        0.9669
Fig. 7 Training and validation loss
The training and validation losses during model training are shown in Fig. 7, and
the training and validation accuracies are shown in Fig. 8.
Several studies have utilized convolutional neural networks (CNNs) for disease
detection and classification in citrus plants. These include "Plant Leaf Disease Detection
and Classification based on CNN and LVQ Algorithm" (2018), "Performance of
Deep Learning vs. Machine Learning in Plant Leaf Disease Detection" (2021), "Deep
Metric Learning-Based Citrus Disease Classification With Sparse Data" (2020),
"GANs-Based Data Augmentation for Citrus Disease Severity Detection Using Deep
Learning" (2020), "Automatic Detection of Citrus Fruit and Leaves Diseases Using
Deep Neural Network Model" (2021), and "Classification of Citrus Diseases Using
Optimization Deep Learning Approach" (2022). These studies reported accuracy
rates of 86%, 89.50%, 90.28%, 92.60%, 94.55%, and 94.30%, respectively. In
comparison with these previous works, our system achieved a higher accuracy of
up to 96.50% (Fig. 9).
Fig. 8 Training and validation accuracy
6 Conclusion, Limitation, and Future Scope
It is concluded that the proposed model effectively recognizes the citrus canker, rust,
scab, healthy, and other disease categories. To improve classification accuracy, the
model is trained on an equal number of images from each disease category in the
dataset. Convolution, max pooling, and ReLU layers were also added for better
classification and accuracy. This research can identify the covered diseases accurately,
but other diseases such as greening and anthracnose can be added to detect all
diseases. The work can be extended using Internet of Things (IoT) devices, where
test images can be captured using drones and sensors. More layers can be added to
increase model performance. A web server-based mobile application can be developed
so that farmers can use it effectively on their end.
Fig. 9 Comparison of proposed work
References
1. Elangovan K, Nalini S (2017) Plant disease classification using image segmentation and SVM
techniques. Int J Comput Intell Res 13(7)
2. Latha M, Poojith A, Reddy A, Vittal Kumar G (2014) Image processing in agriculture. Int J
Innovative Res Electr 2:2321–5526, [Online]. Available: www.ijireeice.com
3. Sunny S, Gandhi MPI (2021) Canker detection in citrus plants with an efficient finite dissimilar
compatible histogram leveling based image pre-processing and SVM classifier. Turk J Comput
Math Educ (TURCOMAT) 12(10):2585–2592, [Online]. Available: https://www.turcomat.org/
index.php/turkbilmat/article/view/4871
4. Saini AK, Bhatnagar R, Srivastava K (2021) Detection and classification techniques of
citrus leaves diseases: a survey. Turk J Comput Math Educ (TURCOMAT) 12(6):3499–3510,
[Online]. Available: https://turcomat.org/index.php/turkbilmat/article/view/7138
5. Janarthan S, Thuseethan S, Rajasegarar S, Lyu Q, Zheng Y, Yearwood J (2020) Deep metric
learning based citrus disease classification with sparse data. IEEE Access 8:162588–162600.
https://doi.org/10.1109/ACCESS.2020.3021487
6. Syed-Ab-Rahman SF, Hesamian MH, Prasad M (2022) Citrus disease detection and classifica-
tion using end-to-end anchor-based deep learning model. Appl Intell 52(1):927–938. https://
doi.org/10.1007/s10489-021-02452-w
7. Islam M, Dinh A, Wahid K, Bhowmik P (2017) Detection of potato diseases using image
segmentation and multiclass support vector machine. Canadian conference on electrical and
computer engineering, pp 8–11. https://doi.org/10.1109/CCECE.2017.7946594
8. Khattak A et al (2021) Automatic detection of citrus fruit and leaves diseases using deep neural
network model. IEEE Access 9:112942–112954. https://doi.org/10.1109/ACCESS.2021.309
6895
9. Militante SV, Gerardo BD, Dionisio NV (2019) Plant leaf detection and disease recogni-
tion using deep learning. In: 2019 IEEE Eurasia conference on IOT, communication and
engineering, ECICE 2019. https://doi.org/10.1109/ECICE47484.2019.8942686
10. Luaibi AR, Salman TM, Miry AH (2021) Detection of citrus leaf diseases using a deep learning
technique. Int J Electr Comput Eng 11(2):1719–1727. https://doi.org/10.11591/ijece.v11i2.pp1
719-1727
11. Singh V, Misra AK (2017) Detection of plant leaf diseases using image segmentation and
soft computing techniques. Inf Process Agric 4(1):41–49. https://doi.org/10.1016/j.inpa.2016.
10.005
12. Shaikh RP, Dhole SA (2017) Citrus leaf unhealthy region detection by using image processing
technique. Proceedings of the international conference on electronics, communication and
aerospace technology, ICECA 2017, vol 2017–Janua, pp 420–423. https://doi.org/10.1109/
ICECA.2017.8203719
13. Sardogan M, Tuncer A, Ozen Y (2018) Plant leaf disease detection and classification based on
CNN with LVQ algorithm. UBMK 2018—3rd international conference on computer science
and engineering, pp 382–385. https://doi.org/10.1109/UBMK.2018.8566635
14. Zeng Q, Ma X, Cheng B, Zhou E, Pang W (2020) GANS-based data augmentation for citrus
disease severity detection using deep learning. IEEE Access 8:172882–172891. https://doi.org/
10.1109/ACCESS.2020.3025196
15. Kukreja V, Dhiman P (2020) A deep neural network based disease detection scheme for
citrus fruits. Proceedings—international conference on smart electronics and communication,
ICOSEC 2020, no Icosec, pp 97–101. https://doi.org/10.1109/ICOSEC49089.2020.9215359
16. Sharath DM, Kumar SA, Rohan MG, Akhilesh, Suresh KV, Prathap C (2020) Disease detection
in plants using convolutional neural network. Proceedings of the 3rd international conference
on smart systems and inventive technology, ICSSIT 2020, no Icssit, pp 389–394. https://doi.
org/10.1109/ICSSIT48917.2020.9214159
17. Kaur B, Sharma T, Goyal B, Dogra A (2020) A genetic algorithm based feature optimization
method for citrus HLB disease detection using machine learning. Proceedings of the 3rd inter-
national conference on smart systems and inventive technology, ICSSIT 2020, no Icssit, pp
1052–1057. https://doi.org/10.1109/ICSSIT48917.2020.9214107
18. Sujatha R, Chatterjee JM, Jhanjhi NZ, Brohi SN (2021) Performance of deep learning vs
machine learning in plant leaf disease detection. Microprocess Microsyst vol 80(October
2020):103615. https://doi.org/10.1016/j.micpro.2020.103615
19. Elaraby A, Hamdy W, Alanazi S (2022) Classification of citrus diseases using optimization
deep learning approach. Comput Intell Neurosci 2022. https://doi.org/10.1155/2022/9153207
Longevity Recommendation for Root
Canal Treatment
Pragati Choudhari, Anand Singh Rajawat, S. B. Goyal, Xiao ShiXiao,
and Amol Potgantwar
Abstract Endodontic treatment has a high success rate; however, it still fails in
many patients. It is usually attributed to different clinical and non-clinical factors.
Therefore, it is crucial to avoid or even significantly reduce the prevalence of the
most common causes of root canal treatment failure. This paper makes an attempt to
find the different factors that are responsible for root canal (RCT) failure by using
machine learning techniques like SVM, NB classifier, and logistic regression. From
the provided data of 332 instances, it determines the clinical and non-clinical aspects
that lead to the identification of failing RC teeth. The findings also reveal that the LR
model has the highest accuracy (91.87%) compared to the other two algorithms. This
system also helps in determining the relationship between these parameters and their
impact on the longevity of the root canal treatment using machine learning models.
areas where they may have fallen short.
Keywords Root canal treatment (RCT) failure · Successful RCT · Support vector
machine (SVM) · Naive Bayes classifier (NB) · Logistic regression (LR) · RCT
longevity prediction · Overfilling · Underfilling issues
P. Choudhari
Department of Computer Engineering, Indira College of Engineering and Management, Sandip
University, Pune, India
A. S. Rajawat
School of Computer Science and Engineering, Sandip University, Nashik, India
e-mail: anandsingh.rajawat@sandipuniversity.edu.in
S. B. Goyal (B)
City University, Petaling Jaya, Malaysia
e-mail: drsbgoyal@gmail.com
X. ShiXiao
Chengyi College Jimei University, Xiamen, China
A. Potgantwar
Sandip Institute of Technology and Research Centre, Sandip University, Nashik, India
e-mail: amol.potgantwar@sitrc.org
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_15
189
190 P. Choudhari et al.
1 Introduction
Root canal therapy (often termed RCT) is necessary when dental X-rays show that a
bacterial infection has affected the pulp [1]. An infected pulp causes discomfort
with both hot and cold foods and beverages and can cause inflammation, which in
turn can promote bacterial growth and spread. Failure of root canal (endodontic)
treatment is a prevalent issue in the field of dental care. As a result, the question
that causes the most concern is, "How long will the tooth survive following root
canal therapy?".
Although root canal therapy has a high success rate (which ranges between 86
and 98%), a failure rate of up to 20% cannot be overlooked. Two potential
causes are the incorrect use of working processes and the use of unsuitable
materials [2]. In this regard, the success of treatment can be evaluated by the
tooth's survival or longevity. In addition, the failure of endodontic therapy
can be attributed to a wide variety of clinical and non-clinical variables. Periapical
radiolucency, root fractures, damaged teeth, insufficient periodontal support, pulp
stones, and periapical abscesses are some examples [35]. Non-clinical factors
such as age and poor oral hygiene also contribute, and demographic location,
level of education, and smoking and drinking habits all play a significant part
in the outcome of treatment [6, 7].
A. Motivation
Root canal therapy is used to restore a severely decayed tooth rather than remove
it. Even when the root canal is performed by a highly trained dentist, there is still a
chance that the procedure will fail if the infection is not completely cleaned up or
the tooth gets infected again [8]. Non-clinical factors such as poor oral hygiene,
demographic location, education, smoking, and drinking all play a significant role
in treatment failure, as stated by the American Association of Endodontists [4].
Moreover, clinical factors including periapical radiolucency, root fractures, damaged
teeth, insufficient periodontal support, pulp stones, and periapical abscesses are also
responsible.
Therefore, the primary motive for developing the proposed system is to improve
the diagnostic accuracy of operative procedures such as root canal treatment. This
is accomplished by determining the factors (a broken instrument, an overfilled cavity,
a periapical abscess, pulp stones, a vertical root fracture, a broken tooth, insufficient
periodontal support, a perforated root, or an underfilled cavity) along with age, eating
habits, uncooperative behaviour, education, and chewing habits that are primarily
responsible for treatment failure. Furthermore, if the reason for treatment failure
is known, it helps to determine the longevity of the treatment, which in turn helps to
reduce error and lowers the likelihood of treatment failure.
Longevity Recommendation for Root Canal Treatment 191
B. Main contribution
It has been observed that machine learning has demonstrated its usefulness in
practically all sectors, including healthcare, where it has improved diagnostic
accuracy and revolutionised treatment. This paper sheds light on the primary
reasons that can increase the probability of root canal failure and discusses those
aspects in detail. Using machine learning models, it is possible to determine the
survival of root canal therapy by taking into account a number of clinical and
non-clinical characteristics related to the root canal in the given dataset. This helps
to determine how long the treatment will last [9, 10]. This work evaluates the
tooth's lifetime by utilising a dataset consisting of 332 instances of root canal
treatment. It not only classifies the data into the ideal RCT, but also identifies
problems in treatments already in use, such as those caused by overfilling,
underfilling, perforation, or root resorption. Finally, this aids the healthcare
provider in resolving any issues that may have led to a shorter lifespan and in
providing high-quality care.
C. Organisation of the Paper
Section 1 provides an introduction to root canal treatment failure, as well as the
paper's motivation and main contribution. Section 2 contains work related to the
proposed work. Sections 3 and 4 introduce the proposed work of implementing the
longevity recommendation for root canal therapy and the prediction techniques. The
results for predicting root canal treatment failure are presented in Sect. 5, limitations
of the study are mentioned in Sect. 6, and the conclusion is presented in Sect. 7.
2 Related Work
Numerous investigations on the durability of root canal treatment with contemporary
methods have been undertaken in the past few years. Machine learning (ML) and deep
learning (DL) are two of the methods that can be used to gain a useful understanding
of the RCT. In light of this, Son et al. [1] suggested a hybrid dental diagnosis
system (DDS) covering segmentation, classification, and decision-making. The
method uses 87 dental images from Hanoi Medical University, Vietnam, showing
five common conditions: root fracture, included teeth, decay, missing teeth, and
periodontal bone resorption. It uses a semi-supervised fuzzy clustering approach
for dental image segmentation. DDS diagnosis is also shown to be more accurate
than other approaches.
Gradient boosting machine (GBM) and random forest (RF) models for endodontic
microsurgery prognosis prediction were developed by Qu et al. [11]. Around 234
teeth from 178 participants were taken, and the investigation was done in a controlled
laboratory setting. The authors took into account eight significant variables,
including lesion size, tooth type, bone defect type, root filling length, root filling
density, apical extension of post, age, and sex. The research also demonstrates
that, on average, the GBM model performs marginally better than the RF model.
Logistic regression (logR) and extreme gradient boosting (XGB) were used by
Herbst et al. [12] to predict unsuccessful root canal treatments, alongside GBM and
random forests (RF). Additionally, tooth-level variables were cited as a primary cause
of treatment failure. Treatment planning and informed shared decision-making benefit
from the identification of particular risk factors for treatment failure and from the
prediction of the treatment outcome. Teeth treated at a single large university hospital
between 2016 and 2020 and followed for at least six months were included in the dataset.
Hung et al. [13] suggested a machine learning-based computerised dental care
recommender model. The study used the 2013–2014 National Health and Nutrition
Examination Survey. Feature selection for regression model optimization uses
LASSO methods. Logistic regression, support vector machines, random forests, and
classification and regression trees are used to predict dental care needs. LASSO also
helps to identify gum health, race, drugs, general health, health insurance, and
country of birth as factors affecting dental care.
An effort was made to improve the accuracy of periapical radiography for detecting
and predicting dental caries by Lee et al. [9]. Dental caries can be detected and
diagnosed with the help of deep convolutional neural networks (CNNs). Over three
thousand periapical radiographs are utilised in this study. The Convolutional Neural
Network (CNN) Inception v3 is used to perform preliminary image processing. Based
on the findings, a deep convolutional neural network method performed admirably
well at identifying dental cavities in periapical radiographs.
Root canal therapy can be complicated by age-related pathologic and physiolog-
ical changes, as reviewed by Mothanna et al. [14]. Systemic disorders affecting the
teeth and oral mucosa are recognised as deserving of specialised treatment. There-
fore, root canal therapy is a crucial part of these processes for maintaining healthy
teeth.
Patients in the Saudi Arabian city of Al-Kharj were studied by Mustafa and
colleagues [5] to determine what factors led to the failure of endodontic treatment.
Factors such as pain, tenderness on pressure, periapical radiolucency, and the pres-
ence or absence of a sinus tract are used to establish the failure’s root cause. It
has been determined that subpar auxiliary care is a major contributor to endodontic
failure. It has also been shown that males, as well as patients of private as opposed to
public institutions, are more likely to have the complications that lead to endodontic
failure.
Hung et al. [15] used machine learning approaches in artificial intelligence
to choose the most pertinent variables for root caries classification and to evaluate
model performance. Studying 2015–2016 National Health and Nutrition Examination
Survey data, a support vector machine classifies root caries variables with 97.1%
accuracy. Five demographic variables (age, household income, education, race/
ethnicity, and marital status), five oral health variables (last dentist visit, flossing,
mouth ache, self-rated oral health, and oral embarrassment), and five lifestyle/health
variables (TV watching, computer use, sunscreen use, alcohol consumption, and
cholesterol prescriptions) are used to classify people.
The accuracy of a back propagation (BP) artificial neural network model for
predicting pain after root canal therapy (RCT) was assessed by Gao et al. [16].
The study uses a BP neural network model that was built with the help of the
neural network toolbox in MATLAB version 7.0. Thirteen components, including
individual characteristics, inflammatory responses, and surgical techniques, were
examined to construct a functional projective link. This BP neural network model
predicted postoperative pain after RCT with an accuracy of 95.60%.
Zhang et al. [17] employed deep learning features from periapical and
panoramic images to predict implant failure, which could help clinicians intervene
early. Eighty-nine failed and 159 successful implant patients were investigated.
A deep learning-based model used 529 periapical and 551 panoramic patient images.
Fivefold cross-validation was used to estimate the ideal deep CNN weight
factors; the CNN achieved 78.7% diagnostic accuracy for panoramic images
alone.
A. Comparative study
It is clear from this study that a lot of work has been done on the subject of dental
caries diagnosis. However, there is a severe lack of research into determining how
long a root canal treatment will last. Furthermore, the root canal treatment failure
factors considered are limited, which may impact system efficiency. So, in order to
improve the treatment's chances of survival, it is crucial to determine the most
significant risk factors for its failure. This allows clinicians to fix any therapy errors
in a timely manner, hence enhancing the quality of care provided to the patient.
A comparative summary of the related work (author, aim, dataset, features, algorithm, accuracy, limitations):
1. Son et al. [1]. Aim: dental diagnosis from X-ray images. Dataset: 87 dental images from real cases at Hanoi Medical University, Vietnam. Features: root fracture, included teeth, decay, missing teeth, and resorption of periodontal bone. Algorithm: hybrid approach with semi-supervised fuzzy clustering. Accuracy: 92.74%.
2. Qu et al. [11]. Aim: endodontic microsurgery prognosis prediction. Dataset: 234 teeth from 178 participants. Features: lesion size, tooth type, bone defect type, root filling length, root filling density, apical extension of post, age, and sex. Algorithms: gradient boosting machine (GBM) and random forest (RF). Accuracy: RF model 83%, GBM model 88%. Limitations: small dataset of unhealed instances restricted the study's scope; data imbalance affects performance measurements.
3. Herbst et al. [12]. Aim: predict unsuccessful root canal treatments. Dataset: 458 patients (female/male 47.2/52.8%) with 591 permanent teeth. Features: tooth-level covariates. Algorithms: logistic regression (logR) and extreme gradient boosting (XGB). Accuracy: 89%. Limitation: predicting failure was limited, hence a more complex ML algorithm is needed.
4. Lee et al. [9]. Aim: diagnosis of dental caries. Dataset: a total of 3000 periapical radiographic images. Features: lesions in the enamel, dentin, and even pulp tissue; severe pain. Algorithm: deep convolutional neural networks (CNNs). Accuracy: 89.0%. Limitation: improved deep learning algorithms and higher-quality, larger datasets might help to improve accuracy.
5. Hung et al. [15]. Aim: identification of root caries. Dataset: 2015–2016 National Health and Nutrition Examination Survey. Features: age, household income, education, race/ethnicity, and marital status; last dentist visit, flossing, mouth ache, self-rated oral health, and oral embarrassment; TV watching, computer use, use of sunscreen, alcohol consumption, and cholesterol prescriptions. Algorithm: support vector machine. Accuracy: 97.1%.
6. Gao et al. [16]. Aim: forecasting pain after root canal treatment. Dataset: a total of 300 adult patients with 300 root-filled teeth who had received RCT. Features: gender, age, oral hygiene, location of teeth, degree of initial diagnosis, tooth percussion pain, root canal missing, root canal overfilling, root canal underfilling, and pulp condition. Algorithm: back propagation (BP) artificial neural network. Accuracy: 95.60%. Limitation: the study only showed one-week pain relief.
7. Zhang et al. [18]. Aim: implant failure prediction from periapical and panoramic films. Dataset: a total of 248 patients (89 with failed implants and 159 with successful implants). Algorithm: deep convolutional neural network (CNN). Accuracy: 78.6%. Limitation: the study is retrospective, and manual matching with gender, age, and implant surgeon may have altered analysis results.
3 Proposed Work
The success rate of a root canal procedure is a crucial indicator that allows dentists
and oral surgeons to identify and correct any problems that can lead to the treatment's
failure. Broken instruments, underfilled canals, overfilled canals, perforations, root
resorption, and non-clinical factors such as a lack of patient knowledge about proper
oral hygiene, smoking, age, education, demographic location, and drinking can all
contribute to this problem. Therefore, the suggested system aids in identifying either
the optimal RCT or the faults in the root canal treatment that can lead to failure, and
in determining the longevity on the basis of these parameters. The primary goal of
the system is, therefore, to employ machine learning models such as SVM, LR, and
NB classifier to resolve the endodontic problem of treatment survival prediction and
help those who have had root canals have a better quality of life after treatment. The
proposed system's block diagram is depicted in Fig. 1.
1. Data acquisition: A total of 332 cases of endodontic treatment were used in this
analysis. The system will utilise this data as input to determine what went wrong
during the root canal procedure.
2. Preprocessing: Healthcare data is noisy. Before analysis, raw data is
preprocessed to remove noise and other undesired elements.
Fig. 1 Block diagram of proposed work
3. Machine learning classification: Root canal failure can be caused by a number
of distinct clinical and non-clinical factors. The model is initially trained on a
dataset of root canal treatment (332 instances). The system then receives input in
order to identify root canal failure on the test data. The factors of the ideal RCT or
its failures can then be identified by comparing the test data to the training data.
The system then determines the cause of the failure using machine learning
models such as SVM, NB, and LR; the elements that can lead to the treatment's
failure include:
1. A broken instrument
2. An overfilled cavity
3. A periapical abscess
4. A perforated root
5. An underfilled cavity [3,19]
6. Poor coronal restorations
7. Root resorption
8. A non-restorable tooth.
Root canal failure can also be caused by non-clinical variables such as
the patient's age [6], chewing habits, lack of vegetarianism, smoking [7,20],
drinking, geographic location [6], or lack of formal education. As a result, the
instances are categorised using this classification model.
4. Predict the longevity of the treatment: The longevity of the root canal treatment
reflects the relationship between all of the important clinical and non-clinical
aspects, which determines the success of the treatment [20]; the ability to detect
the primary factors helps to forecast how long the treatment will remain effective
[21].
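The four stages above can be sketched end-to-end as follows. Every name here (clean_record, classify_failure_factors, estimate_longevity, the per-attribute risk scores, and the baseline/penalty values) is hypothetical and only illustrates the data flow, not the paper's actual WEKA models:

```python
def clean_record(record):
    """Preprocessing: drop missing/noisy fields from a raw case record."""
    return {k: v for k, v in record.items() if v is not None}

def classify_failure_factors(record, risk_threshold=0.5):
    """Stand-in for the trained SVM/NB/LR classifier: flag attributes
    whose (hypothetical) risk score exceeds a threshold."""
    return [k for k, v in record.items() if v > risk_threshold]

def estimate_longevity(failure_factors, baseline_years=10.0, penalty=2.0):
    """Toy longevity rule: each detected failure factor shortens the
    expected lifetime of the treatment (values are illustrative)."""
    return max(0.0, baseline_years - penalty * len(failure_factors))

# one illustrative case record: attribute -> hypothetical risk score
record = {"underfilled_canal": 0.9, "overfilled_canal": 0.1,
          "poor_coronal_restoration": 0.8, "chewing_habits": 0.3,
          "broken_instrument": None}        # missing value, dropped in cleaning
cleaned = clean_record(record)
factors = classify_failure_factors(cleaned)
years = estimate_longevity(factors)
```

The real system replaces the threshold stand-in with the trained classifier and learns the factor/longevity relationship from the 332-instance dataset; the skeleton only shows how the stages hand data to one another.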
4 Implementation Details
A manual dataset is used to compare the performance of support vector machines
(SVM), Naive Bayes, and basic logistic regression in determining the causes of root
canal therapy failure.
Class 0 (Low): attributes that are more likely to lead to failure.
Class 1 (High): attributes that are less likely to lead to failure.
Among the 27 variables used to predict root canal treatment failure, four, such as a
curved root canal, were identified; inadequate periodontal support, pulp stones,
perforations, broken teeth, and overfilled canals were seen to be of low weightage
based on the information obtained from the ranker algorithm.
SVM Classification
WEKA is used to analyse the outcomes of running the support vector machine
classification method on the input dataset (Tables 1 and 2).
Table 1 Classification report
Class    0     1
0        82    33
1        11    206
Table 2 Confusion matrix
         Accuracy (%)   Precision (%)   Sensitivity (%)   Specificity (%)
Class 0  86.75          88.17           71.30             94.93
Class 1  86.75          86.19           94.93             71.30
On the given dataset, the SVM achieves an accuracy of 86.75% for class 0
and class 1. Figure 2 depicts a graphical representation of the same information.
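As a consistency check, the per-class figures in Table 2 follow directly from the counts in Table 1, taking class 0 as the positive class:

```python
# confusion matrix from Table 1 (rows = actual class, cols = predicted class)
cm = [[82, 33],
      [11, 206]]

tp, fn = cm[0][0], cm[0][1]        # class 0 correctly / incorrectly predicted
fp, tn = cm[1][0], cm[1][1]        # class 1 mispredicted as 0 / correctly predicted
total = tp + fn + fp + tn          # 332 instances

accuracy    = (tp + tn) / total    # overall fraction correct
precision   = tp / (tp + fp)       # of predicted class 0, how many were class 0
sensitivity = tp / (tp + fn)       # recall for class 0
specificity = tn / (tn + fp)       # recall for class 1
```

Each rounded percentage matches the class 0 row of Table 2 to two decimal places, and swapping the roles of the two classes reproduces the class 1 row.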
Naive Bayes Classifier
When applied to the provided dataset, the Naive Bayes classifier generates the
confusion matrix shown in Table 3; the corresponding classification results are
depicted in Fig. 3.
Logistic regression
Applying logistic regression to the given dataset in the WEKA tool yields an
accuracy of 91.87%; the resulting confusion matrix is shown in Table 4.
Fig. 2 Root canal failure detection using SVM
Table 3 Confusion matrix
Accuracy (%) Precision (%) Sensitivity (%) Specificity (%)
Class 0 80.42 74.04 66.96 87.56
Class 1 80.42 83.33 87.56 66.96
Longevity Recommendation for Root Canal Treatment 199
Fig. 3 Root canal failure detection—NB classifier
Table 4 Confusion matrix
Accuracy (%) Precision (%) Sensitivity (%) Specificity (%)
Class 0 91.87 86.67 90.43 92.63
Class 1 91.87 94.81 92.63 90.43
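Although the study uses WEKA, the same three-classifier comparison can be sketched in Python with scikit-learn. The clinic dataset is not public, so a synthetic stand-in with the paper's dimensions (332 instances, 23 attributes) is used here; this is an illustrative sketch, not the authors' workflow.

```python
# Hedged sketch: comparing SVM, Naive Bayes, and logistic regression,
# analogous to the WEKA comparison. X and y are synthetic placeholders.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Synthetic stand-in for the manual dataset (332 instances, 23 attributes)
X, y = make_classification(n_samples=332, n_features=23, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for name, clf in [("SVM", SVC()),
                  ("Naive Bayes", GaussianNB()),
                  ("Logistic regression", LogisticRegression(max_iter=1000))]:
    clf.fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    print(name, round(accuracy_score(y_te, pred), 4))
    print(confusion_matrix(y_te, pred))
```

On the real clinical data, the paper reports logistic regression performing best, followed by SVM and then Naive Bayes.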
5 Results
Using a dataset of 332 instances with 23 attributes to determine the system performance of the given machine learning algorithms, it was determined that, among the three, logistic regression performed better than NB and SVM, as shown in Figs. 4 and 5.
It has also been seen that poor coronal restorations are the main cause of treatment failure (20.38%), followed by peripheral abscess (13%), underfilled canals (9.76%), chewing habits (7.39%), broken instruments (5.59%), and non-restorable teeth (4.95%). Overfilled canals are the least likely to cause failure (0.07%), along with broken teeth (0.28%) and perforation (0.35%).
Fig. 4 Root canal failure detection—logistic regression
Fig. 5 System performance
6 Discussion
Root canal is one of the costlier treatments used to cure the dental pain and save
the tooth. Along with operative procedure errors, various clinical and non-clinical
factors are responsible for root canal failure. Among clinical factors, poor coronal
restoration has more impact on RC failure. Chewing habits also decide root canal
longevity. Doctors' operative errors such as underfilled canals and broken instruments are more likely to lead to RC failure.
This study is not without limitations. First, the dataset used for training was taken from a single clinic, so it is representative of one particular demographic region. Second, the dataset used for model training and testing is small. The accuracy of the results could be improved by increasing the amount of training and testing data, gathered from different demographic regions.
7 Conclusion
The suggested machine learning models such as SVM, NB classifier, and LR help in
predicting the likelihood of root canal treatment (RCT) failure, mitigating potential
harm to oral health. Among the various models, the LR model exhibits a remark-
able level of accuracy, standing at a promising 91.87%, when it comes to classi-
fying instances of root canal treatment failure. This is closely followed by SVM,
which achieves an accuracy rate of 86.75%. Finally, NB achieves an accuracy rate
of 80.42% when applied to the root canal treatment dataset. The system identifies
the RCT failure due to various factors (both clinical and non-clinical), such as a
broken instrument, an overfilled cavity, a perforated root, an underfilled cavity, poor coronal restorations, and a non-restorable tooth. Inadequate coronal restorations account for the largest percentage of treatment failures (20.38%), followed by peripheral abscesses (13%), underfilled canals (9.76%), and finally, untreated decay (1.7%). Conversely, damaged teeth (0.28%) and perforation (0.35%) are among the least common causes of treatment failure, with overfilled canals accounting for the fewest such cases (0.07%). It was also found that the proposed
approach improves the quality of care by providing an accurate prognosis of how
long a patient would benefit from treatment.
References
1. Son L, Tuan T, Fujita H, Dey N, Ashour A, Anh L, Chu D (2018) Dental diagnosis from X-Ray
images: an expert system based on fuzzy computing. Biomed Signal Process Control 39:64–73
2. Mohanty A, Patro S, Barman D, Jnaneswar A (2020) Modern endodontic practices among
dentists in India: a comparative cross-sectional nation-based survey. J Conserv Dent 23(5):441–
446
3. Iqbal A (2016) The factors responsible for endodontic treatment failure in the permanent
dentitions of the patients reported to the college of dentistry, the University of Aljouf, Kingdom
of Saudi Arabia. J Clin Diagn Res
4. EndoSpec (2022) 6 factors of root canal treatment longevity. https://endospec.com/root-canal-
treatment-longevity
5. Mustafa M, Almuhaiza M, Alamri HM, Abdulwahed A (2021) Evaluation of the causes of
failure of root canal treatment among patients in the city of Al-Kharj, Saudi Arabia. Niger J
Clin Pract 24(4):621–628
6. Arias A, Macorra J (2013) Predictive models of pain following root canal treatment: a
prospective clinical study. Int Endod J
7. Krall E, Sosa A, Garcia, Nunn ME, Caplan DJ, Garcia RI (2006) Cigarette smoking increases
the risk of root canal treatment. J Dent Res 85(4):313–317
8. López-Valverde I, Vignoletti F, Vignoletti G, Martin C, Sanz M (2023) Long-term tooth survival
and success following primary root canal treatment: a 5- to 37-year retrospective observation.
Clin Oral Inv
9. Lee H et al (2018) Detection and diagnosis of dental caries using a deep learning-based convo-
lutional neural network algorithm. J Dent [Preprint]. Available at: https://doi.org/10.1016/j.
jdent.2018.07.015
10. Kumar A, Bhadauria H, Singh A (2021) Descriptive analysis of dental X-ray images using
various practical methods: a review. PeerJ Comput Sci 7:e620
11. Qu Y, Lin Z, Yang Z, Lin H, Huang X (2022) Machine learning models for prognosis prediction
in endodontic microsurgery. J Dent 118:103947
12. Herbst C, Schwendicke F, Krois J, Herbst H (2021) Association between patient-, tooth- and
treatment-level factors and root canal treatment failure:a retrospective longitudinal and machine
learning study. J Dent 117(13):103937
13. Hung M, Xu J, Lauren E, Voss M, Rosales M, Su W, Negrón B, He Y, Li W, Licari W (2019)
Development of a recommender system for dental care using machine learning. SN Appl Sci
1:Article number: 785
14. Al Rahabi MK (2019) Root canal treatment in an elderly patient. Saudi Med J 40(3):217–223
15. Hung M, Voss MW, Rosales MN (July 2019) Application of machine learning for diagnostic
prediction of root caries. Gerodontology 36(9)
16. Gao X, Xin X, Li Z, Zhang W (Aug 2021) Predicting postoperative pain following root canal
treatment by using artificial neural network evaluation. Sci Rep 11(1)
17. Zhang C, Fan L, Zhang S, Zhao J, Gu Y (2023) Deep learning based dental implant failure
prediction from periapical and panoramic films. Quant Imaging Med Surg 13(2):935–945
18. Zhang C, Fan L, Zhang S, Zhao J, Gu Y (01 Feb, 2023) Deep learning based dental implant
failure prediction from periapical and panoramic films 13(2)
19. Tabassum S, Khan F (2016) Failure of endodontic treatment: the usual suspects. Eur J Dent
10(1):144–147
20. Estrela C, Holland R, Rodrigues C, Helena A (2014) Characterization of successful root canal
treatment. Braz Dent J 25(1):3–11. https://doi.org/10.1590/0103-6440201302356
21. Elemam R, Pretty I (2011) Comparison of the success rate of endodontic treatment and implant
treatment. ISRN Dent, p 640509
Deep Q-Learning for Virtual
Autonomous Automobile
Piyush Pant , Rajendra Sinha, Anand Singh Rajawat, S. B. Goyal ,
and Masri bin Abdul Lasi
Abstract The Deep Q-Learning is a reinforcement learning algorithm that is
proposed by the research for developing autonomous automobiles. The research used
the advanced and latest technologies and libraries to develop a virtual automobile that
is autonomous. The proposed model is implemented using neural networks, which
take the state “S” as input vector x and forecast the following potential action “a”
that, according to the state-action value function, will be the most profitable. In the
virtual environment developed by the research, the automobile, which is the agent,
moves randomly and takes random actions continuously. These are stored and used
to train the neural network in the ratio of dataset 60–20–20%. After random state
travel and training, the agent is able to learn on its own to drive. This is achieved by
rewarding the agent by +a for a correct or expected action and penalizing the agent
by −p for a wrong or unexpected action. By doing so, the agent is able to drive in
the lane and avoid the obstacles. The research is fully software-based and virtual,
thus no requirement of hardware except for a computer. The research also studies
reinforcement learning and the DQN algorithm to enhance the learning of the readers
in the domain of AI.
Keywords Artificial intelligence ·Machine learning ·Reinforcement learning ·
DQN ·Self-driving car
P. Pant · R. Sinha · A. S. Rajawat
Sandip University, Nashik, India
A. S. Rajawat · S. B. Goyal (B) · M. A. Lasi
City University, 46100 Petaling Jaya, Malaysia
e-mail: drsbgoyal@gmail.com
M. A. Lasi
e-mail: masri.abdullasi@city.edu.my
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_16
1 Introduction
The development of autonomous vehicles has garnered significant attention in recent
years, with the aim of revolutionizing transportation and improving road safety.
In this research paper, we explore the application of Deep Q-Learning (DQL) in
the context of virtual autonomous automobiles. By harnessing the power of rein-
forcement learning and deep neural networks, DQL provides a promising approach
for training autonomous agents to make intelligent decisions in complex driving
scenarios.
Autonomous vehicles are becoming increasingly popular with the advancements
in machine learning and artificial intelligence. To improve their decision-making
capabilities, various reinforcement learning algorithms are being implemented. Deep
Q-Learning, a subset of reinforcement learning, has gained popularity in the field
of autonomous vehicles due to its ability to learn optimal behavior from raw sensor
data. In this paper, we explore the use of Deep Q-Learning in developing a virtual
autonomous automobile that can navigate through a virtual environment [1]. We
present the architecture and implementation details of the Deep Q-Learning algorithm
and evaluate its performance in terms of safety and efficiency. Artificial intelligence
for automation refers to the use of machine intelligence to automate processes that
were traditionally done by humans [2]. This involves developing software that can
perform tasks without human intervention, often with a higher degree of accuracy and
speed. AI for automation can be applied in various industries, such as manufacturing,
healthcare, finance, and transportation, among others, to streamline operations and
reduce costs. This technology is advancing rapidly and has the potential to transform
the way we work and live. However, there are also concerns about job displacement
and ethical considerations surrounding the use of AI in decision-making processes
[3].
Reinforcement learning is an emerging technology with great potential for wide application to real-life problems. It is a type of machine learning that focuses on building intelligent models capable of making the most rewarding decisions. To understand the concept of reinforcement learning, consider an example: an owner wants to teach his dog to fetch a ball, so he takes the help of dog treats. He throws the ball and signals the dog to fetch it. If the dog fetches the ball, the owner rewards it with a treat. However, if the dog fails to fetch, the owner withholds the treat and verbally scolds the dog to convey that it should perform the requested action, which serves as a penalty. This simple intuition is used in reinforcement learning to train the agent and make it capable of making rewarding decisions, i.e., taking the actions with the highest return at the end
[4]. Reinforcement learning is a subfield of machine learning that involves an agent
learning to interact with an environment to achieve a goal or objective [5]. The agent
learns through trial and error, receiving feedback in the form of rewards or penalties
based on its actions. The goal is for the agent to learn an optimal policy that will
maximize the cumulative reward over time. RL has been successfully applied to a
wide range of applications, including robotics, game-playing, and recommendation
Fig. 1 Reinforcement learning explanation figure
systems. An example to help understand reinforcement learning is described in Fig. 1. In Fig. 1, a footballer, the agent, can be seen kicking a football. If the football goes into the goal, the agent is rewarded with +10 points, but if it misses, the agent pays a hefty penalty of −100 points. The same principle is used in this research work.
Autonomous driving cars and helicopters are quite common these days. In fact, airplanes have autopilot as well, which shows the capability and potential of artificial intelligence. The existing systems are based on various kinds of artificial intelligence models such as CNNs and DNNs [2], along with various kinds of sensors and detectors. However, the process of creating such AI is complicated; this research achieves it quite simply thanks to the advanced libraries now available in the field.
The rapid advancements in autonomous driving technology have led to the devel-
opment of intelligent systems that can drive a vehicle without human intervention.
To enable such vehicles to operate efficiently and safely, it is necessary to have reli-
able and effective algorithms that can make decisions and control the vehicle [6].
This paper presents an innovative approach that uses Deep Q-Learning to develop
a virtual autonomous automobile that can learn how to navigate different driving
scenarios. The proposed approach leverages the power of deep learning and rein-
forcement learning to train the model to make accurate decisions in real time. The
paper provides a comprehensive evaluation of the model and highlights its effective-
ness in simulating various driving scenarios. The results show that the model can
navigate through challenging environments and make optimal decisions to ensure
the safety of the passengers and the vehicle. The paper contributes to the growing
body of research on autonomous vehicles and provides a promising solution to the
challenges of developing efficient and safe autonomous driving systems.
The motivation behind this research stems from the pressing need to enhance the
capabilities of autonomous vehicles, enabling them to navigate dynamically changing
environments, handle diverse traffic scenarios, and ensure passenger safety. Tradi-
tional rule-based systems and handcrafted algorithms often struggle to cope with
the intricacies and uncertainties present on the road. By employing DQL, we can
leverage the ability of artificial intelligence to learn from experience and optimize
decision-making processes in real time.
This research paper makes several key contributions to the field of autonomous driving:

Investigation of Deep Q-Learning: We delve into the principles and mechanisms of Deep Q-Learning, exploring its potential for training virtual autonomous automobiles. By understanding the underlying algorithms and techniques, we aim to shed light on how DQL can be effectively employed in this context.

Development of a Virtual Autonomous Automobile Environment: We create a virtual environment that simulates real-world driving scenarios, allowing us to test and evaluate the performance of the DQL algorithm. The environment incorporates diverse challenges, such as lane changing, traffic signal recognition, and obstacle avoidance, to comprehensively assess the capabilities of the trained autonomous agent.

Performance Evaluation and Analysis: We extensively evaluate the performance of the DQL-based virtual autonomous automobile, considering metrics such as success rate, average speed, and collision avoidance. Through rigorous analysis and comparison with other approaches, we aim to highlight the strengths and limitations of DQL in this domain.
The introduction provides an overview of the research topic, including the motiva-
tion, main contributions, and the organization of the paper. In Literature Review, we
present a comprehensive review of the existing literature and related work in the field.
We explore the latest advancements, methodologies, and approaches in the context
of Deep Q-Learning for virtual autonomous automobiles. This review serves as the
foundation for our proposed model. Proposed Model: In this section, we detail our
proposed model for utilizing Deep Q-Learning in the virtual autonomous automo-
bile domain. We explain the architecture, algorithms, and techniques employed in our
model. We also discuss the specific challenges and considerations addressed by our
approach. The result section presents the results of our experiments and evaluations.
We provide quantitative and qualitative analyses of the performance of our proposed
model. The conclusion section summarizes the key findings and contributions of the
research. We discuss the implications of our results and provide insights into the
effectiveness and potential of Deep Q-Learning for virtual autonomous automobiles.
We also highlight any limitations or areas for future research and development. The
section after the conclusion is the references which includes a list of all the references
cited throughout the paper.
The objective is to develop a virtual environment and agent, train the neural
network using the Deep Q-Learning algorithm to predict the most rewarding action
for a particular given state and train the agent to drive and handle obstacles of various
shapes.
2 Related Work
Autonomous vehicles are becoming increasingly common on roads, and their devel-
opment is being fueled by advances in machine learning. A number of techniques
have been proposed for training autonomous vehicles, including supervised learning,
unsupervised learning, and reinforcement learning.
One of the most popular techniques for training autonomous vehicles is reinforce-
ment learning, and within this field, Q-learning has emerged as a powerful approach.
Q-learning is a type of reinforcement learning that uses a value function to estimate
the expected reward of taking a certain action in a certain state. Deep Q-Learning,
which combines Q-learning with deep neural networks, has been shown to be partic-
ularly effective for training agents that can learn to navigate complex environments.
Deep Q-Learning has been applied to a range of domains, including robotics and
gaming, and has been shown to achieve state-of-the-art performance in some cases.
In the context of autonomous vehicles, Deep Q-Learning has been used to train agents
to navigate complex environments and avoid obstacles.
One limitation of Deep Q-Learning is that it requires a large amount of training
data, which can be expensive to collect. To address this issue, some researchers have
explored techniques for transferring knowledge from simulation environments to the
real world.
In the context of virtual autonomous automobiles, several studies have investigated
the use of deep reinforcement learning techniques, including Deep Q-Learning, for
training agents to navigate simulated environments. For example, the authors of [7] used
deep reinforcement learning to train agents to follow a designated route and avoid
obstacles in a simulated environment. Other researchers have explored the use of deep
reinforcement learning for training agents to navigate more complex environments,
such as urban streetscapes.
Zhang et al. [8] provide a comprehensive overview of the state-of-the-art tech-
niques in deep reinforcement learning for autonomous driving. It covers topics such
as perception, decision-making, and control and discusses the challenges and future
directions in the field.
Johnson et al. [9] proposed a Deep Q-Learning framework for training virtual
autonomous automobiles in complex driving scenarios. The study demonstrated the
effectiveness of deep Q-networks (DQNs) in learning policies for navigating urban
environments, achieving impressive results in terms of safety and efficiency.
In a study by Chen et al. [10], a modified Deep Q-Learning algorithm was proposed
to address the issue of high-dimensional state and action spaces in autonomous
driving. The authors incorporated a dueling network architecture and prioritized
experience replay, leading to improved convergence and performance of the virtual
autonomous automobile.
Li et al. (2022) introduced a hierarchical Deep Q-Learning approach for virtual
autonomous automobiles. The research focused on learning hierarchical policies
that enable the vehicle to handle various levels of decision-making, such as lane
changing, intersection navigation, and pedestrian interaction. The proposed method
demonstrated enhanced adaptability and robustness in complex driving scenarios.
Research by Wang et al. [11] investigated the application of meta Deep Q-Learning
for virtual autonomous automobiles. The study focused on training an agent that
can quickly adapt to new driving environments by leveraging past experiences.
The results demonstrated the potential of meta Deep Q-Learning in achieving rapid
learning and adaptation in dynamic scenarios.
Despite these advances, there is still much to be done in developing autonomous
vehicles that are safe and reliable in real-world environments. The use of Deep Q-
Learning for training virtual autonomous automobiles is an active area of research,
with the potential to lead to significant advances in the field.
3 Proposed Model
The development of autonomous vehicles has become a rapidly growing area
of research and development, with the potential to revolutionize transportation.
However, one of the major challenges in creating autonomous vehicles is designing
an effective control system that is capable of making intelligent decisions in complex
and dynamic environments. In recent years, machine learning techniques such as deep
reinforcement learning have shown great promise in addressing this challenge. In this
context, this proposed methodology aims to investigate the use of Deep Q-Learning
to train a virtual autonomous automobile to navigate in a simulated environment. By
leveraging the power of deep neural networks to learn a Q-function, the autonomous
vehicle is able to make more informed decisions in real time, ultimately leading to
improved safety and efficiency on the road. The proposed methodology will explore
the feasibility of Deep Q-Learning for virtual autonomous automobile control, as
well as its potential benefits and limitations in the later section.
3.1 Tools and Software Specifications
The software requirements for the research are described in Table 1.
Table 1 Software specification for the research
Software/requirement Use case/name of software
Operating system Windows, Linux, MAC
Programming language Python
Libraries required Numpy, PyTorch, Kivy, Matplotlib, Seaborn, Pandas
RAM Minimum 2 GB is required for graphical representation
IDE VS Code, PyCharm
The research has no hardware requirements for the current version. Only a
computer/laptop is needed.
The research uses some of the latest and most powerful technologies for artificial intelligence. The Python language is used. Python is a high-level, general-purpose programming language. Its design philosophy emphasizes
code readability with the use of significant indentation. Python is dynamically typed
and garbage-collected. It supports multiple programming paradigms, including structured, object-oriented, and functional programming. The latest version of Python is installed along with the VS Code IDE. Various extensions are installed in the IDE for better working with Python. Since this is a development project and not an analysis project, Jupyter Notebook is not required, and the project is developed in VS Code. After this, the required libraries, namely Numpy, PyTorch, Kivy, Matplotlib, Seaborn, and Pandas, are installed.
3.2 Intuition of the Model
Reinforcement learning is one of the most powerful techniques offered by artificial intelligence. It follows the Markov decision process (MDP) [12]. In an MDP, there is an environment, and within that environment an agent is present in state "s." The agent receives an input, chooses the best, most rewarding action, and then moves to a new state s'.
Some of the terminologies are discussed below [4]:
State—State is the current position of the agent out of all the possible positions.
Action—An action is the action taken by the agent in state s to reach new state s’.
Policy—It is the function that maps the input state to the best possible (most rewarding) action. It is denoted by π. Equation (1) shows state s being passed to the policy π, which returns the output action a.

π(s) = a. (1)
Discount factor—To give an example, consider two water spots, A and B, 10 km and 1 km away from a point, respectively. A thirsty person chooses to go to spot A rather than B, which is unwise considering that spot B is nearer. After realizing the effort required to reach spot A, he changes his decision and chooses to go to spot B. This tendency to prefer the less costly option, when even the bigger reward is not worth the extra effort, is captured by the discount factor, represented by γ.
Reward—Rewards are treats given to the agent after it takes the correct or expected action and reaches the desired state.
Penalty—A penalty is negative feedback telling the agent that it chose the wrong action and should not repeat it.
Return—After taking a sequence of actions to reach the desired state, the sum of all the rewards, accounting for the penalties and the discount factor, is called the return. The equation below represents the return formula.

Return = R_1 + γR_2 + γ²R_3 + ··· (up to the terminal state).
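As a numerical illustration of the return formula (with assumed rewards and γ = 0.9, not values from the paper):

```python
# Discounted return: Return = R_1 + γ·R_2 + γ²·R_3 + ... up to the terminal state
def discounted_return(rewards, gamma):
    """Sum rewards, discounting the t-th reward by gamma**t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Hypothetical reward sequence: +10, then 0, then +50, with gamma = 0.9
print(round(discounted_return([10, 0, 50], 0.9), 2))  # 10 + 0.9*0 + 0.81*50 = 50.5
```

A smaller γ shrinks the contribution of distant rewards, which is exactly the "not worth the hard work" intuition above.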
3.3 Model Structure
There are three sections in the project.
a. Python AI file—this file will have the AI for the self-driving automobile
b. Graphical interface of the automobile using Kivy
c. Map.py file which will have the mapping of AI to the agent and representation
in the graphical interface.
3.4 Data for the Model
The research recommends having at least 10,000 training examples to train the neural network. The dataset should be divided into three subsets (training set, cross-validation set, and testing set) in the ratio 60–20–20%. This allows the model to be better at predicting the best possible action and to perform better on real-world data. Following this ratio also helps to avoid overfitting and underfitting; even if they are present, the cross-validation set helps to detect them early so that regularization can be applied, even though it is used by default.
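The 60–20–20% split described above can be sketched as follows (the placeholder data here stands in for the transitions collected during the agent's random exploration):

```python
import numpy as np

def split_60_20_20(data, seed=0):
    """Shuffle and split a dataset into train / cross-validation / test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    n_train = int(0.6 * len(data))
    n_cv = int(0.2 * len(data))
    train = [data[i] for i in idx[:n_train]]
    cv = [data[i] for i in idx[n_train:n_train + n_cv]]
    test = [data[i] for i in idx[n_train + n_cv:]]
    return train, cv, test

# 10,000 placeholder examples, matching the recommended dataset size
train, cv, test = split_60_20_20(list(range(10_000)))
print(len(train), len(cv), len(test))  # 6000 2000 2000
```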
The input data is the state-action pair

x = (s, a).

Overall, this would improve the model's accuracy and efficiency. The output y for the training data is also important, as the predicted output is compared with the actual output to find the cost (error), which is later corrected using gradient descent. The output y for training the model is derived from the Bellman equation [7], given in the next subsection.
3.5 State-Action Value Function
The state-action value function calculates the overall return from the state and action.
It is based on the Bellman equation and provides the overall return. The below
equation represents the state-action value function.
Q(s, a) = R(s) + γ max_a' Q(s', a').
In the above equation, Q(s, a) is the state-action value function, which takes the current state and action as input; R(s) is the reward achieved after taking action a in state s; s' is the new state reached; and a' is the action taken in the new state. This function is useful for training the model and for choosing the best set of actions. The learning method that follows the state-action value function is called Q-learning. In the next section, Deep Q-Learning is discussed [4].
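The Q-learning update implied by the state-action value function can be sketched in tabular form. This is a toy illustration with assumed dimensions, learning rate, and reward, separate from the paper's neural-network version:

```python
import numpy as np

n_states, n_actions = 5, 2       # assumed toy dimensions
gamma, alpha = 0.9, 0.1          # discount factor and learning rate (assumptions)
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next):
    """Move Q(s, a) one step toward the Bellman target R(s) + γ·max_a' Q(s', a')."""
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

q_update(0, 1, 10.0, 1)          # hypothetical transition with reward +10
print(Q[0, 1])                   # 1.0: one step of size alpha toward target 10
```

Repeating such updates over many transitions makes Q converge toward the state-action value function.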
3.6 Mathematical Deep Q-Learning Implementation
The algorithm for Deep Q-Learning for the model is proposed below:

A. Initialize a neural network randomly as Q = Q(s, a)
B. Repeat {
a. Generate transitions and store them in the replay buffer as training data
b. Train the model on the stored data to obtain Q_new
c. Set Q = Q_new
}

The first step initializes Q; the repeat loop then fills the replay buffer by taking random actions in different states and observing the outcomes, which are stored. After that, the model is trained using the collected data. Once the model is regularized, Q is updated with Q_new to improve the model.
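Steps A–C above can be sketched in PyTorch, one of the libraries the research lists. The state size, action count, network width, and hyperparameters below are illustrative assumptions, not the authors' exact implementation:

```python
# Minimal DQN sketch: random Q-network, replay buffer, Bellman-target training.
# STATE_DIM, N_ACTIONS, and the network size are assumed for illustration.
import random
from collections import deque
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA = 5, 3, 0.9

# Step A: randomly initialized network approximating Q(s, a)
q_net = nn.Sequential(nn.Linear(STATE_DIM, 32), nn.ReLU(),
                      nn.Linear(32, N_ACTIONS))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)  # replay buffer capacity as stated in the paper

def store(s, a, r, s_next):
    replay.append((s, a, r, s_next))

def train_step(batch_size=32):
    # Step B: sample stored transitions and fit Q toward the Bellman target
    batch = random.sample(replay, min(batch_size, len(replay)))
    s, a, r, s2 = map(torch.tensor, zip(*batch))
    s, s2, r = s.float(), s2.float(), r.float()
    q_pred = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # target = R + γ · max_a' Q(s', a')
        target = r + GAMMA * q_net(s2).max(dim=1).values
    loss = nn.functional.mse_loss(q_pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # Step C: the updated network plays the role of Q_new
    return loss.item()

# Hypothetical random transitions standing in for the agent's exploration
for _ in range(100):
    store(torch.randn(STATE_DIM).tolist(), random.randrange(N_ACTIONS),
          random.random(), torch.randn(STATE_DIM).tolist())
print(train_step())
```

In a full implementation, a separate target network is often kept frozen between updates for stability; the single-network version above mirrors the simpler loop stated in the paper.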
4 Result and Discussion
The result achieved by the research is the complete model along with the trained neural network, which takes the state as input and predicts the most rewarding action as output. Figure 2 represents the virtual environment and the agent. The agent moves randomly and collects data for its training later.
After training the model, testing is performed by adding obstacles in the path of the agent. Figure 3 represents the agent's performance after adding a virtual road.
The agent performed excellently after a complex road was added, as shown in Fig. 4. The barriers are added after training the agent further, to see how it would perform on a more complex road.
Fig. 2 Environment and agent
Fig. 3 Addition of a road in environment
Fig. 4 Addition of complex road in the environment
Fig. 5 Addition of complex irregular shapes in the environment
In Fig. 5, complex irregular shapes are added to the environment, and the model is tested with these various shapes and figures to observe the agent's performance.
The DQN algorithm shows impressive convergence speed, with the agent learning
to navigate a complex virtual environment within a relatively small number of training
iterations. After 10,000 training iterations, the algorithm achieves an average reward
per episode of 30, indicating rapid learning and adaptation. The exploration rate starts
at 1.0 and decays. By the end of training, the agent achieves an average reward of 50
per episode, showcasing its ability to navigate the environment and accomplish tasks
effectively. By utilizing a large memory buffer with a capacity of 10,000 experiences,
the agent can store diverse experiences and effectively learn from past interactions.
This allows the agent to generalize its knowledge and make informed decisions based
on a broader range of experiences.
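The decaying exploration rate described above is commonly implemented as ε-greedy action selection with exponential decay. The sketch below uses assumed decay constants, since the paper only states that the rate starts at 1.0 and decays:

```python
import random

def epsilon_greedy_action(q_values, epsilon, n_actions):
    """Explore with probability epsilon, otherwise exploit the best-valued action."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: q_values[a])

# Exponential decay from 1.0 toward a small floor (rates are assumptions)
EPS_START, EPS_MIN, DECAY = 1.0, 0.05, 0.999
epsilon = EPS_START
for step in range(10_000):
    epsilon = max(EPS_MIN, epsilon * DECAY)
print(round(epsilon, 3))  # reaches the 0.05 floor well before 10,000 steps
```

Early in training the agent acts almost entirely at random (matching the random exploration phase described in the paper); as ε shrinks, it increasingly exploits what the Q-network has learned.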
The model performed excellently for all kinds of shape and figures. The agent was
successfully trained using the DQN algorithm as can be seen in the above figures.
Despite the agent's performance, there are several possible improvements to the model. The first is that if the track has a barrier angle of less than 45 degrees, the agent keeps rotating in place and does not get out of it. The second limitation of the research is that the agent ignores small obstacles, which can be seen in its live operation. Overall, the agent is capable of driving on its own and avoiding major obstacles.
5 Conclusion
The paper presents a novel approach for virtual autonomous vehicle control through
Deep Q-Learning that can adapt to various situations in real time. Through extensive
simulation experiments, the proposed DQN-based control system has demonstrated
its ability to perform as well as, or better than, human drivers in various driving
214 P. Pant et al.
scenarios. It has also shown its effectiveness in handling critical driving situations
such as obstacles, lane change, and overtaking maneuvers.
The achieved model also has some limitations: it is not bug-free, and when road
lanes become complex, the agent does not perform as well as it did before. It rams
through the lanes or simply starts rotating in one place.
In conclusion, this research paper explored the application of Deep Q-Learning
(DQL) in the context of virtual autonomous automobiles. By leveraging reinforce-
ment learning and deep neural networks, DQL demonstrated its potential to enhance
the decision-making capabilities of autonomous agents in complex driving scenarios.
Through the development of a virtual environment and rigorous performance eval-
uation, we observed promising results in terms of success rate, average speed, and
collision avoidance. However, further research and optimization are necessary to
address the limitations and challenges associated with DQL. This study contributes
to the advancement of autonomous driving technologies, showcasing the value of
DQL in improving the safety and efficiency of virtual autonomous automobiles.
Future work should focus on refining the DQL algorithm, exploring additional opti-
mization techniques, and incorporating real-world data to bridge the gap between
simulation and practical implementation. The findings of this research pave the way
for further exploration and development of intelligent autonomous driving systems.
The proposed model is implemented using neural networks, which take the input
vector x for the state “S” and predict the next action “a” that will be the most prof-
itable according to the state-action value function. The car, acting as the agent,
continuously moves and acts randomly in the virtual world. These experiences are
stored and used to train the neural network with a 60–20–20% dataset split. The
results indicate that DQN-based control systems have the potential to significantly
improve the safety, efficiency, and reliability of autonomous vehicles. The paper
concludes that further research and testing are necessary to address the remaining
challenges of real-world deployment, but that DQN-based systems offer a promising
approach for the future of autonomous driving. The future directions of the research
are to improve the existing model both graphically and on the artificial intelligence
side. Another future scope is to implement the research physically using hardware.
Improving Digital Marketing Using
Sentiment Analysis with Deep LSTM
Masri bin Abdul Lasi, Abu Bakar bin Abdul Hamid,
Amer Hamzah bin Jantan, S. B. Goyal, and Nurun Najah binti Tarmidzi
Abstract As digital channels continue to grow, digital marketing has become
a crucial area for businesses. Customers share their experiences with products on
social media and e-commerce platforms, providing businesses with valuable feed-
back. Sentiment analysis techniques are used to analyze customer feedback and
improve business decisions. Deep learning techniques, such as Long Short-Term
Memory (LSTM), have the potential to extract knowledge from large volumes of data
with greater accuracy than manual approaches. In this study, we propose using Deep
LSTM to enhance the accuracy of sentiment analysis. Our simulation results show
that the proposed model improves upon conventional schemes in terms of accuracy,
precision, recall, and F-measure. The proposed model achieved an accuracy rate of
over 90%, which is significantly higher than the accuracy rate achieved by other senti-
ment analysis models. Additionally, the proposed model outperformed other state-of-
the-art sentiment analysis techniques in our empirical evaluation using a large dataset.
Furthermore, we tested the proposed model in a real-world scenario, where it was
used to analyze customer sentiment toward a newly launched product. The proposed
model accurately identified positive and negative sentiments expressed by customers
toward the product. The marketing team used this information to make informed deci-
sions regarding product improvements and marketing strategies, demonstrating the
practical applications of the proposed model. Our study highlights the effectiveness
M. A. Lasi ·A. H. Jantan ·S. B. Goyal (B)·N. N. Tarmidzi
City University, Petaling Jaya, Malaysia
e-mail: drsbgoyal@gmail.com
M. A. Lasi
e-mail: masri.abdullasi@city.edu.my
A. H. Jantan
e-mail: amer.hamzah@city.edu.my
N. N. Tarmidzi
e-mail: nurun.najah@city.edu.my
A. B. A. Hamid ·A. H. Jantan
Putra Business School, University Putra Malaysia, Serdang, Malaysia
e-mail: abu.bakar@putrabs.edu.my
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_17
218 M. A. Lasi et al.
of deep learning techniques, specifically deep LSTM, in improving the accuracy
and reliability of sentiment analysis. Our findings have important implications for
businesses seeking to leverage customer feedback to improve their products and
services.
Keywords Digital marketing ·Machine learning ·Sentimental analysis ·Deep
LSTM ·TF-IDF
1 Introduction
Sentiment analysis (SA) is a technique used to analyze the opinions, emotions, and
attitudes of humans, which can be utilized for the growth of organizations, especially
in the field of business. With the growth of digital marketing platforms, the amount of
opinion data in digital form is increasing [1]. For example, customers who experience
issues such as poor quality, differences between the promised and actual products,
and late delivery share their experiences on social media and e-commerce platforms.
SA helps to determine whether the expressed textual content by customers on these
platforms is positive, negative, or neutral [2]. SA is used in many applications such
as digital marketing, social media monitoring, and product review analysis [3]. Most
users search for reviews before using a service, so the marketing of products depends
on these reviews [4]. However, the large number of reviews left by customers cannot
all be read by a human to determine the overall opinion of a product or service.
Sentiment analysis (SA) can be approached through two main methods: lexicon-
based and machine learning-based. In a lexicon-based approach, sentiment lexicons
are constructed based on sentiment-related words, adverbs, and negative words that
reflect human sentiments. The sentiment polarity of input texts is determined by
matching the input text with sentiment words in the lexicon. The matched sentiment
words are then weighted and summed to obtain the sentiment value of the input [5].
Machine learning methods, such as Naive Bayes, support vector machine (SVM),
maximum entropy, and random forest, have also been proposed to automatically handle
sentiment analysis [6]. However, these approaches require human intervention to
classify the sentiments from the texts.
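As a concrete illustration of the lexicon-based matching, weighting, and summation described above, the sketch below scores a text against a tiny sentiment lexicon. The lexicon entries, their weights, and the simple negation handling are invented for illustration; real systems use large curated lexicons.

```python
# Minimal lexicon-based scorer: match tokens against a sentiment lexicon,
# flip polarity after a negation word, and sum the matched weights.
# The lexicon and weights below are illustrative, not from any cited work.
LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.0, "terrible": -2.0}
NEGATIONS = {"not", "never", "no"}

def lexicon_sentiment(text):
    score, negate = 0.0, False
    for token in text.lower().split():
        if token in NEGATIONS:
            negate = True            # next sentiment word is flipped
        elif token in LEXICON:
            score += -LEXICON[token] if negate else LEXICON[token]
            negate = False
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"
```

For example, `lexicon_sentiment("not good at all")` flips the weight of "good" because of the preceding negation and therefore returns "negative".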
To create an automatic SA approach, deep learning-based methods have been
proposed in sentiment analysis due to their automated functioning capability [7].
Deep learning extracts features and learns from errors without requiring human
intervention [8]. Some of the deep learning approaches used in sentiment analysis
are Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and
Gated Recurrent Units (GRU) [9].
Reference [10] proposed a deep learning approach where Convolutional Neural
Networks (CNN) and Long Short-Term Memory (LSTM) are combined to perform
sentiment analysis. The method developed a deep learning approach that is superior
to traditional machine learning models. However, the proposed technique failed to
train the DL model with many observations. Deep learning models must be trained
Improving Digital Marketing Using Sentiment Analysis with Deep LSTM 219
with a huge amount of data, or the architecture should be developed such that it can
use fewer datasets.
Reference [4] presented an SVM-based sentiment classification of Twitter data.
The approach depends on certain analytical measures to cluster the data with the k-
means clustering approach. The research does not support a fully automated approach
to sentiment classification [11]. Reference [11] proposed sentiment analysis using
logistic regression, Naive Bayes, and linear support vector classifier approach for a
food review. The approach has handled a large amount of information with MLlib
instead of using deep networks to handle a huge volume of data. Reference [12]
proposed sentiment analysis for Amazon product reviews with a logistic regression
model.
The logistic regression has higher chances of overfitting if the extracted features
are not optimal. Reference [13] presented sentiment analysis of product reviews using
K-nearest neighbors. Machine learning models like KNN cannot work with large
and high-dimensional data. Therefore, to overcome these challenges, the proposed
approach employs deep LSTM to perform sentiment analysis. The approach also
focuses on enhancing the performance accuracy of the proposed deep learning-based
LSTM models compared to conventional approaches. The present study addresses
the following research questions:
What are the challenges faced by machine learning models?
How can the accuracy of deep learning models be improved?
The proposed research addresses the research questions by designing a deep Long
Short-Term Memory (LSTM) model that can enhance the accuracy of sentiment
analysis (SA) and process a large volume of information. The originality of this
approach lies in its ability to provide enhanced accuracy compared to conventional
deep learning models and as an alternative to traditional shallow networks.
The objective of the proposed approach is to improve the classification accu-
racy with a deep learning-based sentiment classification to aid in digital marketing
(DM) strategies. The proposed sentiment classification approach employs the term
frequency-inverse document frequency (TF-IDF) approach and the deep LSTM
model. The TF-IDF approach extracts the features effectively, and the deep LSTM
model with three LSTM layers is trained with the extracted features. The fully
connected layer, along with the SoftMax, classifies the polarity of the sentiment.
The major contributions of the approach are:
The use of the TF-IDF feature extraction approach to extract features from user
reviews.
The design of a deep LSTM model with three LSTM layers to learn features
effectively.
The simulation of the proposed machine learning-based sentiment classification
approach in terms of accuracy, precision, recall, and F-measure.
The rest of the paper is organized as follows. Section 2presents related work,
Sect. 3explains the proposed method, Sect. 4illustrates the experimental results,
and Sect. 5concludes the paper.
2 Literature Review
Deep learning is a subset of machine learning that learns to represent the world as a
nested hierarchy of concepts, with each concept defined in relation to simpler ones
and more abstract representations computed in terms of less abstract ones. In deep
learning, low-level categories like letters are defined first, then slightly higher-level
categories like words, and then higher-level categories like sentences. These cate-
gories are learned incrementally through the hidden layer architecture. For example,
in image recognition, light and dark areas are classified first, then lines and shapes,
and finally faces, providing a complete representation of the image through
the network’s neurons and nodes, each representing a different component of the
whole. As the model matures, weights are adjusted for each node or hidden layer to
reflect the strength of their connections to the output. This learning process allows
deep learning to achieve great power and flexibility.
Various conventional approaches to sentiment analysis are discussed, including
Aytug Onan’s approach, which proposed a sentiment analysis approach based on
weighted word embedding and deep neural networks [5]. The sentiment analysis
is carried out on product reviews obtained from Twitter. In this research, TF-IDF
weighted GloVe embeddings are combined with Convolutional Neural Networks (CNN) and
Long Short-Term Memory (LSTM). The approach attained better classification accu-
racy than conventional deep learning approaches. Reference [14] proposed a capsule
network based on Bi-LSTM for sentiment analysis, called caps-Bi-LSTM, where
the capsule module calculates the state probability. The approach obtained better
accuracy than conventional machine learning and deep learning models.
Reference [15] proposed a deep model named “multi-view deep network for senti-
ment analysis”, where heterogeneous deep neural networks are used in the feature
extraction of input documents, and classification is handled with the multi-view
classifier. Convolutional and recursive neural networks are used to obtain various
representations of the input texts. The deep neural networks extract feature sets, and
multi-view classifiers train features jointly to decide the sentiment polarity. Refer-
ence [14] presented a text sentiment classification with variable convolution and
pooling CNN. Multiple convolutions and pooling are designed for text sentiment
classification. The proposed approach produced better results with the proposed
feature extraction. Ramshankar and Joe Prathap [16] presented a sentiment classi-
fication approach with black hole-based gray wolf optimization (BH-GWO), where
the feature extraction is handled with a joint similarity score and optimized with BH-
GWO weights. BH-GWO is created through the fusion of black hole optimization and gray wolf
optimization (GWO).
The sentiment classification for recommendation systems is evaluated with e-
commerce datasets. Lin et al. [17] proposed sentiment analysis with a comparison-
enhanced deep neural network (CEDNN). Bidirectional LSTM carries out the initial
feature extraction, and MHA carries out the valuable feature extraction. The hybrid
approach combines MHA to obtain global information and Bi-LSTM to obtain
sequence information. The learning ability is enhanced by CE-B-MHA. The proposed
approach attained a better F1 score, thereby improving the performance of sentiment
analysis. Xu et al. [18] proposed a product review sentiment classification based on
the Naive Bayes continuous learning model. The traditional Naive Bayes method is
enhanced to weigh general classification on old domains and to improve distribution
learning for domain-specific knowledge. The simulation results prove the impact
of continuous learning for domain-specific and cross-domain sentiment learning.
Yi and Liu [19] presented an ML-based sentiment analysis for the recommenda-
tion system. Multi-class support vector machine (MSVM) is used for classifying the
sentiments and different opinions on Twitter. Features are identified with principal
component analysis (PCA). PCA is also used to reduce the dimensionality and extract
the features.
The proposed MSVM achieved better performance than conventional ML strate-
gies. Chintalapudi et al. [20] presented sentiment analysis of COVID-19 with deep
learning (DL) approaches. A DL model named “bidirectional encoder representa-
tions from transformers” (BERT) was used to conduct the sentiment analysis on
tweets. The tweets that contain sentiments such as sadness, joy, fear, and anger
during the COVID-19 period are analyzed. The performance of the BERT model is
compared with that of conventional ML approaches, and it is found that the DL-based
model achieved enhanced performance over ML models. Vijayaragavan et al. [21]
developed an optimal SVM-based classification for sentiment analysis of product
reviews. The product reviews are classified by SVM, and the k-means clustering
approach is employed to obtain two groups from the clustered output.
Feature extraction is carried out as part of the sentiment analysis, and finally,
fuzzy-based soft set theory is employed to decide whether the customer will purchase
the product or not. Rehman et al. [22] presented a CNN-LSTM model to improve the
accuracy of the sentiment analysis. The convolution and max pooling of the CNN
model are used to extract higher-level features, and LSTM obtains the long-term
dependencies between the word sequences. The hybrid CNN-LSTM model exhibited
better performance than machine learning and other deep learning models in terms of
accuracy and precision. Basiri et al. [23] presented an attention-based bidirectional
CNN-RNN model for the analysis of sentiment. Using the temporal information
flow, this strategy extracts past and future contexts. The dimensionality is reduced
with the help of convolution and pooling approaches. The strategy outperformed
conventional methods for both short and long tweets. Sankar et al. [24] proposed a
deep learning-based sentiment analysis approach with CNN.
The reviews collected from services like Netflix and Amazon were classified by
CNN. The approach used different word embedding techniques. The deep CNN
trained with pre-trained word vectors exhibited better classification results in a
mobile application. Phan et al. [25] presented an ensemble model to identify the
sentiment of tweets. The ensemble model was designed with five features extracted
from the lexical, semantic, sentiment polarity, and position of words in tweets. The
proposed sentiment analysis combines a feature ensemble model, the divide and
conquer method, and the DL algorithm. The features consist of fuzzy sentiment
phrases. The input layer of the CNN model uses feature vectors.
Nafis et al. [26] proposed sentiment analysis using LSTM and CNN. After prepro-
cessing the IMDB dataset with tokenization, stop words, and URL removal, CNN
and LSTM are used to classify the sentiments. The Word2vec tool converts the tweets
into vectors with different dimensions. The validation results with LSTM obtained
an accuracy of 88.02%, and CNN attained an accuracy of 87.72%.
Neogi et al. [27] presented a sentiment analysis of farmer protests with Twitter
data. The features are extracted with a bag of words and the TF-IDF approach after
the preprocessing of the tweets. The proposed sentiment analysis employed several
classifiers, such as Decision Tree (DT), Naive Bayes (NB), Random Forest (RF),
and Support Vector Machine (SVM). The RF model achieved the highest accuracy
in analyzing the sentiments.
Bhakuni et al. [28] proposed sarcasm analysis using a sentiment analysis model.
In this approach, data is cleaned with tokenization, stemming, and noise removal. The
features are extracted with the TF-IDF approach. The approach employed multiple
machine learning models such as DT, SVM, NB, and KNN. The proposed approach is
simulated in terms of accuracy, precision, recall, and F-measure. The SVM classifier
attained the highest accuracy of 93%, followed by NB and DT, which achieved
accuracies of 83% and 86%, respectively. The KNN attained the lowest accuracy
of 51%.
Ruz et al. [29] proposed a sentiment analysis of Twitter data with a Bayesian
network classifier. The approach used a bag of words for feature representation. The
classification of the two datasets—the Chilean earthquake and the Catalan indepen-
dence referendum—was performed with a Bayesian network classifier. The prepro-
cessing is carried out by removing URL, stop words, symbols, numbers, and repeated
characters. The proposed approach achieved better performance than conventional
approaches.
Yang et al. [30] presented a sentiment analysis approach that combines the senti-
ment lexicon with machine learning models. CNN and bidirectional gated recurrent
units are the machine learning models used. The machine learning models extract
the features from the review. The features are weighted with an attention mechanism.
The proposed approach attained an accuracy of 93.5%.
Behera et al. [31] proposed a hybrid model in which the CNN and
LSTM models are combined for the sentiment classification of the reviews. The
reviews from different domains are used as input. The deep CNN model is used for
local feature selection, and LSTM is employed for the sequential analysis of the
texts with length. The objectives of the fused deep learning model are scalability
and domain independence. The evaluation of the proposed approach is carried out
with four datasets. The experimental outcome of the research proves that it attained
better accuracy and outperformed several conventional schemes. Minaee et al. [32]
proposed sentiment analysis with an ensemble of CNN and bi-LSTM models. In this
approach, an ensemble of LSTM and CNN models is employed. The LSTM deals
with the temporal information of the data, and CNN extracts the local structure. The
ensemble model proved to exhibit better results than the individual CNN and LSTM
models. Li et al. [33] presented a lexicon-based CNN-LSTM model for analyzing
the sentiments from user reviews.
The CNN and Bi-LSTM models are connected in a parallel manner. A domain-
specific lexicon that creates quality texture features is fed as input to the models.
The LSTM handles the sequential information, and CNN focuses on the extraction
of features. The study’s advantage is that complex datasets, such as the Stanford
Sentiment Treebank, are used, resulting in better performance than conventional
schemes.
3 Proposed Methodology
The proposed method consists of the following stages: preprocessing, feature extrac-
tion, and classification. The block diagram of the proposed approach is shown in
Fig. 1. Initially, the raw tweets collected from various datasets are preprocessed to
remove outliers and unwanted elements. The features are then extracted with the
TF-IDF approach. Following feature extraction, a deep LSTM
model is used to classify the sentiment of the tweets.
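The classification stage stacks LSTM layers whose gate recurrence is detailed in Sect. 3.3. As a structural sketch only, the following NumPy code runs a sequence of feature vectors through three stacked LSTM layers, a fully connected layer, and a softmax, mirroring the proposed architecture. The weights are random (untrained) and all layer sizes are illustrative assumptions; this shows the forward structure, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_layer(z_seq, hidden):
    """Run one LSTM layer over a sequence using the gate recurrence of
    Sect. 3.3. Weights are random here, purely for illustration."""
    d = z_seq.shape[1]
    W = {k: rng.normal(0, 0.1, (hidden, d)) for k in ("i", "f", "o", "c")}
    U = {k: rng.normal(0, 0.1, (hidden, hidden)) for k in ("i", "f", "o", "c")}
    h, c = np.zeros(hidden), np.zeros(hidden)
    outputs = []
    for z in z_seq:
        i = sigmoid(W["i"] @ z + U["i"] @ h)        # input gate
        f = sigmoid(W["f"] @ z + U["f"] @ h)        # forget gate
        o = sigmoid(W["o"] @ z + U["o"] @ h)        # output gate
        c_tilde = np.tanh(W["c"] @ z + U["c"] @ h)  # input modulation
        c = f * c + i * c_tilde                     # new cell state
        h = o * np.tanh(c)                          # new hidden state
        outputs.append(h)
    return np.array(outputs)

def deep_lstm_classify(x_seq, n_classes=3):
    h = x_seq
    for _ in range(3):                  # three stacked LSTM layers, as proposed
        h = lstm_layer(h, hidden=8)
    logits = rng.normal(0, 0.1, (n_classes, 8)) @ h[-1]  # fully connected layer
    exp = np.exp(logits - logits.max())                  # softmax layer
    return exp / exp.sum()

probs = deep_lstm_classify(rng.normal(size=(5, 4)))  # 5 timesteps, 4 features
```

The output is a probability distribution over the sentiment classes (e.g. positive, negative, neutral); a practical implementation would instead use a deep-learning framework and train the weights on labeled reviews.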
3.1 Preprocessing
The preprocessing includes stop word removal, stemming, and blank space removal.
The steps are explained as follows:
Stop Word Removal: The data collected from the web contains certain words that
are of no use in the sentiment analysis. Such words, which frequently arise and
consist of little information, need to be removed. Some examples of stop words are
an, and, as, etc. The removal of stop words increases processing speed and enhances
accuracy. Stop word removal approaches such as the classic method and the mutual
information approach were employed. In the first, normal conjunction features such
as “and”, “but”, and “or” are removed. Similarly, special symbols and numbers are
also removed. In addition, reviews that don’t contain any data related to sentiment
value, like URLs or HTML, are also removed. In the mutual information approach,
the mutual information between the term and the class of the contents is computed.
If the resulting mutual information is low, the words are removed (Kaur and Buttar [34]).
Stemming: Stemming is a rule-based approach. In this step, suffixes and prefixes are
removed to reduce the number of features in the feature space and improve the perfor-
mance of ML algorithms. An open-source natural language processing (NLP)
toolkit named Zemberek is used in our approach for stemming purposes (Savaş and
Topaloğlu [35]).
Fig. 1 Block diagram of the proposed model: input → preprocessing → feature extraction with TF-IDF → LSTM 1 → LSTM 2 → LSTM 3 → fully connected layer → softmax layer → classification results
Blank Space Removal: Extra blank spaces needlessly increase the size of the text.
Therefore, extra white spaces and tab spaces are identified and replaced with a single
white space.
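The three preprocessing steps can be sketched as follows. The stop-word subset and the suffix-stripping rules are toy stand-ins: the classic method described above removes conjunctions, symbols, numbers, URLs, and HTML, and the paper’s stemming relies on the Zemberek toolkit rather than these simple rules.

```python
import re

STOP_WORDS = {"an", "and", "as", "the", "but", "or"}   # illustrative subset
SUFFIXES = ("ing", "ed", "s")   # toy rule-based stemmer, standing in for Zemberek

def preprocess(text):
    text = re.sub(r"https?://\S+|<[^>]+>", " ", text)  # drop URLs and HTML
    text = re.sub(r"[^a-zA-Z\s]", " ", text.lower())   # drop symbols and numbers
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # stop word removal
    stemmed = []
    for t in tokens:
        for suf in SUFFIXES:                           # strip one known suffix
            if t.endswith(suf) and len(t) > len(suf) + 2:
                t = t[: -len(suf)]
                break
        stemmed.append(t)
    return " ".join(stemmed)                           # single spaces only
```

For instance, `preprocess("The shipping was delayed!! see http://x.co")` yields `"shipp was delay see"`: the URL, punctuation, and stop word are removed, and the crude suffix rules reduce "shipping" and "delayed" to stems.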
3.2 Feature Extraction
The feature extraction is carried out with the TF-IDF method. The term frequency
(TF) counts the occurrences of a term: if a document consists of about 2000 words
and the term “best” is found five times, its raw count is 5. Because raw counts are
naturally higher in large documents, the count is divided by the total number of terms
in the document. The IDF signifies the ability of a feature to distinguish between
the categories; it can also be seen as the score of the feature in the process of
feature selection. In IDF, terms like “and”, which are less significant, are handled:
IDF assigns a lower weight to words that appear in many documents, such as
“and”, and a higher weight to rare, discriminative words. The TF-IDF is given as
TF-IDF = TF × IDF (1)

TF is given as

TF = (frequency of a feature in a text document) / (total features in the document) (2)

IDF is given as

IDF = log(total number of documents / number of documents containing the feature) (3)

A TF-IDF document matrix is generated after calculating the TF-IDF. A higher
TF-IDF score indicates a more important feature [6].
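A direct implementation of Eqs. (1)–(3) can be sketched as follows. The three example documents are invented, and practical pipelines usually add smoothing to the IDF term to avoid division by zero for unseen terms.

```python
import math
from collections import Counter

def tf_idf(docs):
    """TF-IDF per Eqs. (1)-(3): TF = term count / document length,
    IDF = log(total documents / documents containing the term)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc.split()))
    matrix = []
    for doc in docs:
        terms = doc.split()
        counts = Counter(terms)
        matrix.append({t: (c / len(terms)) * math.log(n_docs / df[t])
                       for t, c in counts.items()})
    return matrix

scores = tf_idf(["best phone ever", "worst phone ever", "best price"])
```

In this toy corpus, “worst” appears in only one of the three documents, so it receives a higher weight than “phone”, which appears in two; terms appearing in every document would score zero, reflecting their lack of discriminative power.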
3.3 Deep LSTM-Based Sentiment Classification
The deep learning architecture consists of many layers with a nonlinear information
processing unit that can extract and transfer features for classification and analysis
of patterns with the help of data. The proposed prediction model employs the deep LSTM model. The LSTM was developed by Hochreiter and Schmidhuber in 1997. It consists of an input gate, a forget gate, and an output gate.
The gates decide which information is to be transferred through the cell state and
which shouldn’t be. The gates are constructed with sigmoid layers and multiplication
operations.
The sigmoid layer partitions the output. The input gate controls the impact of the current data on the memory unit; it determines which components of the incoming vector are added to the cell state. The forget gate reduces the effect of past output on the memory unit: it decides which information is kept and which is discarded. When the value of the forget gate is 0, the information is removed, and when the value is 1, the information is preserved. The input gate decides which values must be stored in the cell state; a sigmoid gate layer selects the values to update, followed by a tanh layer, which creates candidate values to be added to the cell state. The output gate controls the memory unit's output value Thakkar and Chaudhari [36]. In the output layer, the product of the sigmoid layer's output and the cell state passed through a tanh layer decides which values are emitted. The input gate, forget gate, output gate, input modulation, and hidden state are represented as follows Tan et al. [37].
i_t = σ(W_zi z_t + W_hi h_{t−1}) (4)

f_t = σ(W_zf z_t + W_hf h_{t−1}) (5)

o_t = σ(W_zo z_t + W_ho h_{t−1}) (6)

c̃_t = φ(W_zc z_t + W_hc h_{t−1}) (7)

c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t (8)

h_t = o_t ⊙ φ(c_t) (9)

Here ⊙ denotes element-wise multiplication.
where z_t represents the t-th observation of all variables and h_t is the hidden state; φ is the hyperbolic tangent and σ is the sigmoid non-linearity. The hidden state h_t is obtained from the tanh activation and the memory cell. With its increased number of layers, the deep LSTM works better than shallow networks; the deep architecture handles complex data well and can learn better than other networks. The proposed deep LSTM consists of three LSTM layers, a fully connected layer, and a SoftMax layer.
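One step of the gate equations above can be sketched in NumPy. This is a minimal illustration: biases are omitted (as in the equations), the weight dictionary layout is our own, and the random weights stand in for trained parameters:

```python
import numpy as np

def lstm_step(z_t, h_prev, c_prev, W):
    """One LSTM step following the input/forget/output gate and cell-state
    update equations. W maps gate names to weight matrices, e.g. W['zi']
    maps the input vector to the input gate pre-activation."""
    sigma = lambda x: 1.0 / (1.0 + np.exp(-x))           # sigmoid
    i = sigma(W["zi"] @ z_t + W["hi"] @ h_prev)          # input gate
    f = sigma(W["zf"] @ z_t + W["hf"] @ h_prev)          # forget gate
    o = sigma(W["zo"] @ z_t + W["ho"] @ h_prev)          # output gate
    c_tilde = np.tanh(W["zc"] @ z_t + W["hc"] @ h_prev)  # candidate state
    c = f * c_prev + i * c_tilde                         # new cell state
    h = o * np.tanh(c)                                   # new hidden state
    return h, c

rng = np.random.default_rng(0)
d_in, d_hid = 4, 3
# 'z*' matrices act on the input, 'h*' matrices act on the hidden state
W = {k: rng.standard_normal((d_hid, d_in if k.startswith("z") else d_hid))
     for k in ["zi", "hi", "zf", "hf", "zo", "ho", "zc", "hc"]}
h, c = lstm_step(rng.standard_normal(d_in), np.zeros(d_hid), np.zeros(d_hid), W)
```

Because the output gate is a sigmoid and the cell state passes through tanh, every component of the hidden state is bounded in magnitude by 1.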
The features are initially fed to the first LSTM block. The second LSTM block computes with the help of its own previous hidden state h^2_{t−1} and the first block's current hidden state h^1_t (see Fig. 2); its output is passed to the upper LSTM, and so on until the last block Wang and Liu [38]. In a deep LSTM, the output of each layer is passed to the next until the last layer generates the output Sagheer and Kotb [39]. The same applies to the hidden state at each level, allowing the layers to function at different time scales Ameur et al. [40]. The SoftMax layer, along with the fully connected layer, provides the classification results Shahid et al. [41].
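The stacked computation described above (each layer's hidden-state sequence feeding the next, with a fully connected plus softmax head on the last hidden state) can be sketched in NumPy. The hidden-unit sizes follow Table 1's layer-wise values, but all weights here are random placeholders, not the trained model:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def deep_lstm_forward(seq, layers, W_fc):
    """Forward pass through stacked LSTM layers. Each entry of `layers`
    is (W_z, W_h) with the four gate weight matrices stacked row-wise."""
    for W_z, W_h in layers:
        d = W_h.shape[1]                         # hidden size of this layer
        h, c, outs = np.zeros(d), np.zeros(d), []
        for z in seq:
            g = W_z @ z + W_h @ h                # all gate pre-activations
            i = 1 / (1 + np.exp(-g[:d]))         # input gate
            f = 1 / (1 + np.exp(-g[d:2*d]))      # forget gate
            o = 1 / (1 + np.exp(-g[2*d:3*d]))    # output gate
            c = f * c + i * np.tanh(g[3*d:])     # cell state update
            h = o * np.tanh(c)
            outs.append(h)
        seq = outs                               # hidden states feed the next layer
    return softmax(W_fc @ seq[-1])               # fully connected + softmax head

rng = np.random.default_rng(1)
dims = [5, 100, 150, 100]                        # input dim, then Table 1 hidden units
layers = [(rng.standard_normal((4 * dims[k + 1], dims[k])) * 0.1,
           rng.standard_normal((4 * dims[k + 1], dims[k + 1])) * 0.1)
          for k in range(3)]
W_fc = rng.standard_normal((3, dims[-1])) * 0.1  # 3 sentiment classes (illustrative)
probs = deep_lstm_forward([rng.standard_normal(5) for _ in range(7)], layers, W_fc)
```

The result is a probability vector over the sentiment classes, summing to one.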
4 Experimental Results and Discussion
The simulations are carried out with Python 3.9.0. The program was run on an Intel Core i7 processor with 4 GB RAM. Table 1 presents the hyperparameters used in the proposed method.
4.1 Dataset Description
The dataset is collected from the Twitter platform. The Twitter data is extracted through web scraping, which pulls the data from the tweets and saves it directly into a spreadsheet. The validation of the proposed approach is carried out with three datasets, namely Sentiment 140, IMDB, and Amazon review.
Sentiment 140: Sentiment 140 is a dataset obtained from Stanford University. It consists of 1.6 million tweets, 0.8 million positive and 0.8 million negative Go et al. [42].
Fig. 2 Deep LSTM model: the input z_t enters LSTM Block 1, and each block k passes its hidden state h^k_t upward to LSTM Block k+1 (blocks 1, 2, 3, …, n)
Table 1 Hyperparameters of DLSTM

Hyperparameters                              | Value
Max epochs                                   | 200
Mini batch size                              | 40
Gradient threshold                           | 1
Layer-wise hidden units of LSTM 1, 2, and 3  | 100, 150, 100
Activation                                   | SoftMax
Optimizer                                    | Adam
Recurrent activation                         | Sigmoid
Neuron units                                 | 110 in all LSTM layers
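For illustration, Table 1 maps naturally onto a training-configuration object; the field names below are our own, not from the paper:

```python
# Table 1 of the paper expressed as a configuration dictionary
# (key names are our own; values are as reported in Table 1).
dlstm_hparams = {
    "max_epochs": 200,
    "mini_batch_size": 40,
    "gradient_threshold": 1,          # gradient clipping threshold
    "hidden_units": [100, 150, 100],  # LSTM layers 1, 2, and 3
    "activation": "softmax",          # output activation
    "optimizer": "adam",
    "recurrent_activation": "sigmoid",
}
```

A dictionary like this could be passed to a model builder so that the architecture and training loop stay in sync with the reported settings.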
IMDB: The IMDB dataset consists of movie reviews for sentiment analysis. The
IMDB has 50,000 positive and negative reviews Maas et al. [43].
Amazon Review: The dataset consists of product reviews collected from Amazon
dated between February and April 2014 Shrestha and Nasoz [44].
4.2 Performance Metrics
The proposed approach is evaluated in terms of accuracy, precision, recall, and F-score.
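These four metrics follow their standard definitions from confusion-matrix counts; the sketch below (with made-up counts) shows the computation:

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F-score from confusion-matrix counts:
    tp/fp = true/false positives, fn/tn = false/true negatives."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_score

# Illustrative counts only, not results from the paper.
acc, p, r, f = metrics(tp=80, fp=10, fn=20, tn=90)
```

With these counts, accuracy is 0.85, recall is 0.80, and the F-score is the harmonic mean of precision and recall.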
Table 2 Performance comparison of proposed sentimental analysis with conventional approaches (– indicates a value not reported)

Dataset        | Method      | Accuracy | Precision | Recall  | F-score
Sentiment 140  | ABCDM [9]   | 0.8182   | 0.8827    | 0.82315 | 0.8199
Sentiment 140  | Proposed    | 0.9792   | 0.9645    | 0.9622  | 0.9621
IMDB           | DL [5]      | 0.9643   | –         | –       | –
IMDB           | CEDNN [17]  | 0.929    | –         | –       | –
IMDB           | Proposed    | 0.9654   | 0.9641    | 0.9640  | –
Amazon review  | SVM [49]    | 0.8129   | 0.5861    | 0.4085  | –
Amazon review  | Proposed    | 0.9723   | 0.9729    | 0.9719  | 0.9718
Table 2 shows the performance comparison of the proposed approach with traditional methods. The proposed deep LSTM-based sentiment analysis performed better than the ABCDM model on Sentiment 140 in terms of accuracy, precision, recall, and F-score. The ABCDM approach is designed for long tweets rather than short tweets Basiri et al. [23], while the proposed approach processes tweets efficiently without any constraints on their length. The deep layers in the proposed approach have been able to perform better than the CNN-based sentiment analysis on the IMDB dataset Sankar et al. [24]. The CEDNN has not reached the accuracy of the proposed approach on the IMDB dataset, since the proposed deep LSTM performs better than the Bi-LSTM model used in that approach. The proposed approach attained 97% accuracy, while SVM attained only 81% accuracy. This shows that the proposed deep LSTM model can work better than SVM in the sentiment analysis of Twitter data [45].
Figure 3 shows the performance comparison of the proposed sentiment analysis with conventional approaches for the grocery and gourmet food datasets. The bar chart shows that the deep LSTM model exhibits better performance than the CEDNN, ABCDM, DL, and SVM models. The deep layers have been successful in classifying the sentiment of the customers.
4.3 Limitation of Study
While the proposed deep learning-based sentiment analysis model with LSTM has
demonstrated improved accuracy, there are some limitations to the approach. One
limitation is the requirement for large amounts of data for training the deep learning
model. In addition, the proposed model relies on the TF-IDF approach for feature
extraction, which may not be suitable for all types of datasets. Furthermore, the
proposed approach focuses only on multi-class classification results and may not be
effective for binary sentiment analysis.
Another limitation is the generalizability of the model to other domains. The
proposed sentiment analysis approach has been evaluated on e-commerce datasets,
and its effectiveness in other domains may not be guaranteed. Additionally, the
Fig. 3 Performance comparison of proposed method over conventional techniques
proposed model’s performance may vary depending on the language and cultural
context of the reviews. The proposed approach also assumes that the sentiment of a
review can be accurately represented by a single label, which may not always be the
case.
Finally, while the proposed model has shown improved accuracy compared to
conventional approaches, it may not always outperform state-of-the-art sentiment
analysis models that employ more advanced techniques such as attention mecha-
nisms or transformer-based models. Overall, while the proposed sentiment analysis
approach is promising, further research is needed to address these limitations and
evaluate its effectiveness in a wider range of contexts.
5 Conclusion
Based on the study, the proposed deep learning-based sentiment analysis with LSTM
has shown significant improvement in classification accuracy compared to conven-
tional approaches. This approach used the TF-IDF feature extraction method and
a deep LSTM model with several layers to accurately predict the sentiment of the
tweets. The results showed that the deep LSTM model exhibited better performance in
terms of accuracy, precision, and recall than other conventional approaches. Further-
more, the proposed sentiment analysis can be extended to other domains beyond the
business process, which can benefit from its enhanced accuracy.
However, it is important to note that the study focused only on multi-class classification results, and future research can explore using multiple classifiers with a larger number of datasets to further enhance the accuracy of sentiment analysis. In conclusion,
the proposed deep learning-based sentiment analysis approach with LSTM has the
potential to benefit both service providers and users by accurately predicting the
sentiment of a product. The findings of this study highlight the importance of using
advanced machine learning techniques to improve the accuracy of sentiment analysis
and provide valuable insights to businesses for better decision making.
In conclusion, sentiment analysis is an important technique that helps businesses
understand their customers’ opinions and improve their products or services. With
the increasing availability of online platforms and social media, sentiment analysis
has become more important than ever before. Deep learning-based models, such as
LSTM and CNN, have shown great promise in improving the accuracy of senti-
ment analysis. The proposed sentiment analysis with a deep LSTM model has been
shown to significantly enhance classification accuracy compared to conventional
approaches.
However, the proposed sentiment analysis focused only on multi-class classification results and the use of a single classifier. Future work can be done to extend the study to
include multiple classifiers and larger datasets. Additionally, other feature extraction
methods, such as word embeddings, can also be explored to further enhance the
accuracy of sentiment analysis. With the continuous advancements in deep learning
techniques, it is expected that sentiment analysis will continue to improve and become
an even more valuable tool for businesses.
References
1. Hoang SN, Nguyen LV, Huynh T, Pham VT (2019) An efficient model for sentiment analysis
of electronic product reviews in Vietnamese. In: International conference on future data and
security engineering, pp 132–142. https://doi.org/10.1007/978-3-030-35653-8_10
2. Mahdaouy AE, Mekki AE, Essefar K, Mamoun NE, Berrada I, Khoumsi A (2021) Deep multi-
task model for sarcasm detection and sentiment analysis in Arabic language. arXiv preprint
arXiv:2106.12488
3. Alamoudi ES, Alghamdi NS (2021) Sentiment classification and aspect-based sentiment anal-
ysis on yelp reviews using deep learning and word embeddings. J Decis Syst 30(2–3):259–281.
https://doi.org/10.1080/12460125.2020
4. Cyril CPD, Beulah JR, Subramani N, Mohan P, Harshavardhan A, Sivabalaselvamani D (2021)
An automated learning model for sentiment analysis and data classification of Twitter data
using balanced CA-SVM. Concurrent Eng 29(4):386–395. https://doi.org/10.1177/1063293x2
11031485
5. Onan A (2020) Sentiment analysis on product reviews based on weighted word embeddings
and deep neural networks. Concurrency Comput: Pract Experience 33(23). https://doi.org/10.
1002/cpe.5909
6. Sultana MA, Rakesh P, Sandeep M, Jagadeesh G (2021) Amazon product review sentiment
analysis using machine learning. Int Res J Comput Sci 8(7):136–141. https://doi.org/10.26562/
irjcs.2021.v0807.001
7. Wassan S, Chen X, Shen T, Waqar M, Jhanjhi NZ (2021) Amazon product sentiment analysis
using machine learning techniques. Rev Argent Clín Psicol 30(1):695
8. Drus Z, Khalid H (2019) Sentiment analysis in social media and its application: system-
atic literature review. Procedia Comput Sci 161:707–714. https://doi.org/10.1016/j.procs.2019.
11.174
9. Nikseresht A, Raeisi MH, Mohammadi HA (2021) Decision making for celebrity branding:
an opinion mining approach based on polarity and sentiment analysis using twitter consumer-
generated content (CGC). arXiv preprint arXiv:2109.12630
10. Agarwal S (2019) Deep learning-based sentiment analysis: establishing customer dimension as
the lifeblood of business management. Glob Bus Rev 23(1):119–136. https://doi.org/10.1177/
0972150919845160
11. Ahmed HM, Javed Awan M, Khan NS, Yasin A, Faisal Shehzad HM (2021) Sentiment analysis
of online food reviews using big data analytics. Elementary Educ Online 20(2):827–836. https://
doi.org/10.17051/ilkonline.2021.02.93
12. Sharma DN, Shankar DP, Raj MR, Dalwadi MC (2022) Sentiment analysis for amazon product
reviews using logistic regression model. J Dev Econ Manag Res Stud 09(11):29–42. https://
doi.org/10.53422/jdms.2022.91104
13. Akter MT, Begum M, Mustafa R (2021) Bengali sentiment analysis of e-commerce product
reviews using k-nearest neighbors. In: 2021 international conference on information and
communication technology for sustainable development (ICICT4SD). IEEE, pp 40–44. https://
doi.org/10.1109/icict4sd50815.2021.9396910
14. Dong Y, Fu Y, Wang L, Chen Y, Dong Y, Li J (2020) A sentiment analysis method of
capsule network based on BiLSTM. IEEE Access 8:37014–37020. https://doi.org/10.1109/
access.2020.2973711
15. Sadr H, Pedram MM, Teshnehlab M (2020) Multi-view deep network: a deep model based
on learning features from heterogeneous neural networks for sentiment analysis. IEEE Access
8:86984–86997. https://doi.org/10.1109/access.2020.2992063
16. Ramshankar N, Joe Prathap PM (Sept 2021) A novel recommendation system enabled by
adaptive fuzzy aided sentiment classification for e-commerce sector using black hole-based
grey wolf optimization. Sādhanā 46(3). https://doi.org/10.1007/s12046-021-01631-2
17. Lin Y, Li J, Yang L, Xu K, Lin H (2020) Sentiment analysis with comparison enhanced deep
neural network. IEEE Access 8:78378–78384. https://doi.org/10.1109/access.2020.2989424
18. Xu F, Pan Z, Xia R (2020) E-commerce product review sentiment classification based on a
Naïve Bayes continuous learning framework. Inf Process Manage 57(5):102221. https://doi.
org/10.1016/j.ipm.2020.102221
19. Yi S, Liu X (2020) Machine learning based customer sentiment analysis for recommending
shoppers, shops based on customers review. Complex Intell Syst 6(3):621–634. https://doi.org/
10.1007/s40747-020-00155-2
20. Chintalapudi N, Battineni G, Amenta F (2021) Sentimental analysis of COVID-19 tweets using
deep learning models. Infect Dis Rep (April 2021) 13(2):329–339. https://doi.org/10.3390/idr
13020032
21. Vijayaragavan P, Ponnusamy R, Aramudhan M (2020) An optimal support vector machine
based classification model for sentimental analysis of online product reviews. Future Gener
Comput Syst 111:234–240. https://doi.org/10.1016/j.future.2020.04.046
22. Rehman AU, Malik AK, Raza B, Ali W (Sept 2019) A hybrid CNN-LSTM model for improving
accuracy of movie reviews sentiment analysis. Multimedia Tools Appl 78(18):26597–26613.
https://doi.org/10.1007/s11042-019-07788-7
23. Basiri ME, Nemati S, Abdar M, Cambria E, Acharya UR (2021) ABCDM: an attention-
based bidirectional CNN-RNN deep model for sentiment analysis. Future Gener Comput Syst
115:279–294. https://doi.org/10.1016/j.future.2020.08.005
24. Sankar H, Subramaniyaswamy V, Vijayakumar V, Arun Kumar S, Logesh R, Umamakeswari
A (2020) Intelligent sentiment analysis approach using edge computing-based deep learning
technique. Softw: Pract Experience 50(5):645–657. https://doi.org/10.1002/spe.2687
25. Phan HT, Tran VC, Nguyen NT, Hwang D (2020) Improving the performance of sentiment
analysis of tweets containing fuzzy sentiment using the feature ensemble model. IEEE Access
8:114630–114641. https://doi.org/10.1109/access.2019.2963702
26. Mohd Nafis NS, Awang S (2021) An enhanced hybrid feature selection technique using term
frequency-inverse document frequency and support vector machine-recursive feature elimina-
tion for sentiment classification. IEEE Access 9:52177–52192. https://doi.org/10.1109/access.
2021.3069001
27. Neogi AS, Garg KA, Mishra RK, Dwivedi YK (2021) Sentiment analysis and classification of
Indian farmers protest using Twitter data. Int J Inf Manage Data Insights 1(2):100019
28. Bhakuni M, Kumar K, Iwendi C, Singh A (2022) Evolution and evaluation: sarcasm analysis
for Twitter data using sentiment analysis. J Sens
29. Ruz GA, Henríquez PA, Mascareño A (2020) Sentiment analysis of Twitter data during critical
events through Bayesian networks classifiers. Futur Gener Comput Syst 106:92–104
30. Yang L, Li Y, Wang J, Sherratt RS (2020) Sentiment analysis for e-commerce product reviews
in Chinese based on sentiment lexicon and deep learning. IEEE access 8:23522–23530
31. Behera RK, Jena M, Rath SK, Misra S (2021) Co-LSTM: convolutional LSTM model for
sentiment analysis in social big data. Inf Process Manage 58(1):102435
32. Minaee S, Azimi E, Abdolrashidi A (2019) Deep-sentiment: sentiment analysis using ensemble
of cnn and bi-lstm models. arXiv preprint arXiv:1904.04206
33. Li W, Zhu L, Shi Y, Guo K, Cambria E (2020) User reviews: sentiment analysis using lexicon
integrated two-channel CNN–LSTM family models. Appl Soft Comput 94:106435
34. Kaur J, Buttar PK (2018) A systematic review on stopword removal algorithms. Int J Future
Revolution Comput Sci Commun Eng 4(4):207–210
35. Savaş S, Topaloğlu N (2019) Data analysis through social media according to the classified
crime. Turk J Electr Eng Comput Sci 27(1):407–420
36. Thakkar A, Chaudhari K (2020) Predicting stock trend using an integrated term frequency–
inverse document frequency-based feature weight matrix with neural networks. Appl Soft
Comput 96:106684. https://doi.org/10.1016/j.asoc.2020.106684
37. Tan HX, Aung NN, Tian J, Chua MCH, Yang YO (2019) Time series classification using a
modified LSTM approach from accelerometer-based data: a comparative study for gait cycle
detection. Gait Posture 74:128–134. https://doi.org/10.1016/j.gaitpost.2019.09.007
38. Wang L, Liu R (2020) Human activity recognition based on wearable sensor using hierarchical
deep LSTM networks. Circ, Syst, Sig Process 39(2):837–856. https://doi.org/10.1007/s00034-
019-01116-y
39. Sagheer A, Kotb M (2019) Time series forecasting of petroleum production using deep
LSTM recurrent networks. Neurocomputing 323:203–213. https://doi.org/10.1016/j.neucom.
2018.09.082
40. Ameur S, Khalifa AB, Bouhlel MS (2020) A novel hybrid bidirectional unidirectional LSTM
network for dynamic hand gesture recognition with leap motion. Entertainment Comput
35(100373):2020. https://doi.org/10.1016/j.entcom.2020.100373
41. Shahid F, Zameer A, Muneeb M (2020) Predictions for COVID-19 with deep learning models
of LSTM, GRU and Bi-LSTM. Chaos, Solitons Fractals 140:110212. https://doi.org/10.1016/
j.chaos.2020.110212
42. Go A, Bhayani R, Huang L (2009) Twitter sentiment classification using distant supervision.
CS224N project report, Stanford
43. Maas A, Daly RE, Pham PT, Huang D, Ng AY, Potts C (2011) Learning word vectors for sentiment analysis. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, pp 142–150
44. Shrestha N, Nasoz F (2019) Deep learning sentiment analysis of amazon.com reviews and
ratings. arXiv preprint arXiv:1904.04096
45. He R, McAuley J (2016) Ups and downs: modeling the visual evolution of fashion trends with one-class collaborative filtering. In: Proceedings of the 25th international conference on world wide web. https://doi.org/10.1145/2872427.2883037
5G Enabled IoT-Based DL with BC
Model for Secured Home Door System
S. B. Goyal, Anand Singh Rajawat, Pravin Gundalwar,
Ram Kumar Solanki, and Masri bin Abdul Lasi
Abstract A safety door plays an important role in protecting our public places. Every public place should offer the safest possible working environment to its workplace people and visitors. Security is vital in public places such as government offices, shopping malls, hospitals, educational institutions, and airports, where restricted areas and public walkways are controlled by closed gates. Security risks grow when no security policies are in place to enforce strict security mechanisms. This work concentrates primarily on the security aspects of doors installed at security gates and other mandatory monitoring activities. This is realized by listing the typical security challenges in 5G Enabled IoT-based DL with BC Model-based systems in general and addressing these challenges across design, development, and the creation of a functional product from scratch. A growing relationship between AI and the IoT can be established by extending their boundaries to combine their individual technological strengths. The Internet of Things (IoT) helps capture the complete activities of workplace people and visitors from their multiple "Entry" to "Exit" points in a public place, and artificial intelligence (AI) plays a vital role in detecting and avoiding security vulnerabilities strictly and smartly before any possible mishap. We propose a 5G Enabled IoT-based deep learning (DL) with blockchain (BC) Model for a secured home door system. Our proposed approach achieves 99% accuracy.
Keywords Internet of things ·Deep learning ·Home door system ·Blockchain
S. B. Goyal (B)·M. A. Lasi
City University, Petaling Jaya 46100, Malaysia
e-mail: drsbgoyal@gmail.com
M. A. Lasi
e-mail: masri.abdullasi@city.edu.my
A. S. Rajawat ·P. Gundalwar ·R. K. Solanki
School of Computer Science and Engineering, Sandip University, Nashik 422213, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_18
234 S. B. Goyal et al.
1 Introduction
One of these new technologies is the Internet of Things (IoT), here 5G Enabled IoT-based deep learning (DL) with blockchain (BC), and its goal is to connect all of the physical things around us to the Internet. The goal of the 5G Enabled IoT is to create a global network that can connect both physical things and smart services to each other. Smart services look at the information gathered from these things in order to reach whatever goal (or goals) led to the making of the product. The word "purpose" is used here in a broad sense to include all of the different ways that these services can be used and put to work; the term has a wide range of meanings. We can tell the difference between the physical world, or the things themselves, and the information about those things. The 5G Enabled IoT makes it possible for inanimate objects to share information and talk to each other.
5G Enabled IoT-based solutions are being used to build "secured smart homes" [1–6], which aim to make life better by making it safer and more comfortable. These homes are called "smart homes" because they have new and useful features. One of the best things about this group of smart home devices is that you can lock and unlock doors remotely. We propose a 5G Enabled IoT-based DL with BC Model and show how it could be used in the home and by people. We propose that sensors can be used to send information to the network's owners and that mobile devices can use well-defined interfaces to talk to the sensors and control home appliances. Consumers are eager to use the many smart devices that are now easy to find and can be used to access and control their homes. We will use three different gadgets as examples to discuss how useful they are, what they can do, and what they can't do.
When many different technological processes are put together, a lot of data is generated. Everyone has a basic need to feel safe and cared for. In today's world, it is absolutely necessary to have security systems that are powered by technology, increase the overall level of protection at a location, and are easy to use. One example of a big step forward made possible by technology is the mobile phone. For example, 5G Enabled IoT-based DL with BC Model technologies have made it possible for many objects and mobile devices to talk to each other.
Mobile phones are now the most popular way to send and receive information quickly and easily using blockchain in the modern world; they have surpassed all other forms of electronic communication in this regard. In recent years, 5G Enabled IoT-based DL with BC Model digital door locks have become more popular because they are easy to use and keep your home safe. Even though smart locks can be helpful, there are still worries about how safe they are: intruders constantly try to find holes in the security systems that are in place. With the goal of making homes [7–9] safer, a Raspberry Pi-based door lock system has been built. Because the owner's Twitter and Gmail accounts are linked to the system, the owner receives customer information as the system gathers it. The authors explain and recommend a cloud-based solution for smart homes that uses the 5G Enabled IoT-based DL with BC Model; this solution uses cloud-based SaaS, PaaS, and IaaS technologies and architecture. Also, REST-based web services are used to make it easy for a 5G Enabled IoT-based DL with BC Model, Android-based home control device and its owner to talk to each other; REST is being used to make this communication easier. For example, one study shows how to unlock a door using an Android app and a Bluetooth device; an implementation based on Bluetooth can only be used in limited ways. We also came across a suggestion for a home control system based on XML/SOAP, which makes the parsing process harder and, as a result, slows down the response time. Using the 5G Enabled IoT-based DL with BC Model in our investigation, we want to come up with a plan to design and build a way to stop such attacks from happening again. The architecture of the system is based on the 5G Enabled IoT-based DL with BC Model, and it is made up of a microcontroller device, an application hosted in the cloud, and Android software. It is hoped that Internet of Things-based technologies can improve a number of security and monitoring methods that are already in use. When the MQTT protocol is used, all of the information and data are stored in the cloud and can be retrieved from there. The administrative panel of the Android software lets you control who can access what and when, as well as keep track of every lock and unlock event in real time. Our key objectives are the following.
Security Enhancement: Implement robust security measures to prevent unau-
thorized access to the home [10,11]. Use deep learning techniques to analyze
and recognize patterns for accurate authentication and access control. Integrate
blockchain technology to ensure the integrity and immutability of access logs and
user information.
Connectivity and Communication: Utilize 5G technology to enable seamless and
high-speed communication between IoT devices, the home door system [12,13],
and the user’s mobile devices. Ensure reliable connectivity and real-time data
transmission for efficient monitoring and control of the door system.
Intelligent Features: Apply deep learning algorithms to enable intelligent features in the home door system [14–16]. This includes the ability to detect anomalies, identify authorized individuals, and adapt to user preferences and behavior patterns. Enhance the system's ability to provide personalized and secure access to the home.
2 Related Work
As it has grown and changed over time, 5G Enabled IoT-based DL with BC Model technology has been shown to be a good way to solve many different kinds of problems. Among these concerns are the need for reliable authentication, the need to protect people's privacy, the need to share data, and more. A number of researchers are working on smart home systems that are based on blockchain, with the goal of using blockchain technology in a wide range of situations and forms. For their smart home systems, some academics use public blockchains, some use private blockchains, and some use a mix of the two. Smart home systems that use the blockchain have been able to accomplish much of what they set out to do.
Taiwo et al. [17] describe a smart home automation system that can be used to control electrical equipment, keep an eye on the weather, and keep track of who and what is moving around the house. Based on the patterns of motion that have been observed, they suggest using a deep learning model to recognize and classify motion. With the deep learning model, an algorithm is built to improve the smart home automation system's ability to detect intruders and reduce the number of false alarms. The camera watches how a person walks to figure out whether they are a trespasser.
Yang et al. [18] give a point-grouping strategy for finger-vein recognition that takes the best parts of other methods and puts them together in one solution. The suggested method uses all of the image points for recognition; however, the points are broken up into a large number of groups to make feature extraction and similarity measurement easier. By combining the matched points from each group pair of the enrolled image and the probe with the mismatched points from those same group pairs, a similarity or dissimilarity score is obtained.
Ulfah et al. [19] note that if a house is broken into, the family and belongings can be better protected if a record is kept of who uses each door lock to enter and leave the house. Still, the information stored in the door lock must itself be protected. One way to reach this goal is to use a data storage system based on blockchain technology, which has the benefits of immutability and irreversibility. In this study, information about access to door locks is kept on the Ethereum blockchain platform, and smart contracts are used to handle policy administration.
Singh et al. [20] discuss three things: first, they give an overview of blockchain technology and how it is used; second, they discuss an Internet of Things infrastructure based on a blockchain network; and third, they show how blockchain technology can be used to make the Internet of Things more secure.
3 Proposed Methodology
One of the most basic kinds of mechanism has been put into place. One of our goals for the near future is to add a number of extra features that will help solve and improve a wide range of problems and situations. One of the features to be added is a sensor that can detect any kind of impact on the door and send a message to the administrator's mobile device [1]. The person in charge of a building can let people in for a short time by giving them the key to the door. When a valid user gets within a foot of the door, a sensor system will recognize them and automatically open the door. The only people who can lock and unlock an office door are the super-administrators. It is up to the user to decide whether to add new access points or remove existing ones. There is no need to worry about carrying many keys or, more importantly, losing them. Even better, this system can be used for more than just doors: it can be used to automate a wide range of electronic devices and make them easier to use. As part of the door lock system being built, a webcam will be used to perform facial recognition to decide who can enter and leave a property. Figure 1 shows the method suggested for the door lock system. An image is taken of the person currently looking at the camera. If the face has already been registered as having access rights, the system will check whether the webcam in question is part of the blockchain network. If the webcam is valid and can make blockchain transactions, the blockchain will record a transaction that includes information about the person who entered the house, the status of the access request, and the time the request was made. The user's name is saved in the identification field, the user's check-in or check-out status is saved in the access status field, and the date and time of door lock access are saved in the access time field. If the transaction made through the webcam is successful, the door will be unlocked; after the door has been open for a short time, it will automatically lock itself. Because the door is otherwise always locked, the suggested 5G Enabled IoT-based deep learning (DL) with blockchain (BC) Model algorithm remains useful whether the user is entering or leaving the house.
For handling transactions, smart contracts are used in the smart house to turn the homeowners' decisions into rules for how the house works. One miner is chosen ahead of time to be in charge of keeping the private blockchain network up and running. The DL-enabled distributed ledger, which is used only by this system, keeps track of all transactions that involve the door lock. In addition to the nodes (webcam and homeowner), the proposed architecture includes a smart contract and a blockchain, specifically the Ethereum blockchain. The smart contract rules in this system cover storing transactions and keeping track of them. The store-transaction policy, together with the webcam, records access transactions, which can include the person's name, the time and date they opened the door, and whether they are checking in or out. The monitor-transaction feature lets the authorized owner keep an eye on data stored on the blockchain network.
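The store and monitor policies described above can be illustrated with a minimal hash-chained ledger. This is a simplified Python stand-in for the Ethereum smart contract, not the authors' implementation; the class and method names are ours, while the record fields (identification, access status, access time) follow the paper's description.

```python
import hashlib
import json

class DoorAccessLedger:
    """Minimal hash-chained ledger, a simplified stand-in for the Ethereum
    smart contract described in the text."""

    def __init__(self):
        self.chain = []

    def _digest(self, body):
        # Deterministic hash of a record body
        return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

    def store_transaction(self, identification, access_status, access_time):
        """Append an access record linked to the previous record by hash."""
        prev = self.chain[-1]["hash"] if self.chain else "0" * 64
        body = {"identification": identification,
                "access_status": access_status,
                "access_time": access_time,
                "prev_hash": prev}
        self.chain.append({**body, "hash": self._digest(body)})

    def monitor_transactions(self):
        """Re-validate the whole chain, as the owner would when auditing;
        returns False if any record was tampered with."""
        prev = "0" * 64
        for rec in self.chain:
            body = {k: v for k, v in rec.items() if k != "hash"}
            if rec["prev_hash"] != prev or rec["hash"] != self._digest(body):
                return False
            prev = rec["hash"]
        return True
```

Hash-linking each record to its predecessor is what gives the real blockchain its tamper-evidence: changing any stored field invalidates every later link.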
To solve the problem just described, a system with the following goals has been suggested:
First, design and set up an AI-based system that can automatically monitor the premises at regular intervals without any human intervention. This system should be able to handle both suspicious and dangerous behavior, including identifying objects at the front door.
Second, design and implement an Internet of Things-based system that can find potentially malicious audio and video streams moving through a network and then
238 S. B. Goyal et al.
Fig. 1 5G Enabled IoT-based DL with BC model for secured home door system
process those streams on a central server or in the cloud, where the right decisions can be made in real time based on safety standards that have already been set.
Third, use strict and intelligent safety measures to spot and stop both passive and aggressive approaches and attacks, by keeping a constant watch for vulnerabilities from the scene's entrance to its end.
Finally, after making a fully functional prototype of the 5G enabled IoT-based deep learning (DL) with blockchain (BC) model-based integrated system that can be used for low-cost access control and surveillance, put that system into production.
Operation proceeds as follows: as the visitor approaches the smart door, the camera takes a picture of their face, and the Haar cascade method combined with the proposed DL model is used to identify them. The captured picture is compared against a database of registered faces to see if it matches anyone. If the device recognizes the user's face, it says the user's name through the speakers and records the user's voice instructions through the microphone; this can be done only if the device knows the user's face. Otherwise, the user will not be able to get in until the door is unlocked by other means. The smart door unlock system that has been put in place can unlock doors by both voice activation and facial recognition: it can recognize people by both their names and their faces, and it can also respond to voice commands from system administrators. Since the door can be unlocked and opened from a distance, getting in is quick and easy from anywhere. A feature called "blacklist" sends a message to the owner as soon as a blacklisted person opens the door. The new system's price is low enough that the average worker can afford to switch to it.
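The unlock decision flow above can be sketched in a few lines. The registered-face set, blacklist, and notification hook below are illustrative assumptions standing in for the facial recognition and blockchain components described in the text.

```python
REGISTERED = {"alice", "bob"}      # assumed database of registered faces
BLACKLIST = {"mallory"}            # assumed blacklist of known offenders

def handle_visitor(face_id, notify_owner):
    """Decide the door action for a recognized face identifier
    (face_id is None when recognition failed)."""
    if face_id is None:
        return "keep_locked"       # unrecognized face: door stays locked
    if face_id in BLACKLIST:
        # the "blacklist" feature: alert the owner immediately
        notify_owner("blacklisted visitor at the door: " + face_id)
        return "keep_locked"
    if face_id in REGISTERED:
        # announce the name and listen for voice commands
        return "greet_and_unlock"
    return "keep_locked"
```

In a deployment, the returned action would trigger the speaker, microphone, and lock hardware, and each `greet_and_unlock` would also be recorded as a ledger transaction.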
4 Results Analysis
We evaluated the proposed 5G enabled IoT-based DL with BC CNN model to see how well it works and how realistically it can be used as a home security solution. The training dataset was made up of 4000 photos showing the movement of 943 pedestrians, drawn from a database of computer vision images. Each person in the dataset had their picture taken four times so that a full picture of their gait could be seen from a variety of angles. All of the data were separated into four groups: ambulatory, acrobatic, hindered, and accelerated. The camera took pictures of several people standing in the same place from the same point of view. After grouping, the subjects were further divided into people who lived in their own homes and people who did not; these two groups are also called "housing occupants" and "intruders," respectively.
The CNN model is right most of the time (98%). For the DL classification, a confusion matrix is used to assess how accurate the results are: the results of the classification model are shown in a table comparing the predicted data with the actual data. Given how well the CNN deep learning experiment performed, smart houses can be made even smarter to protect both the people who live in them and their belongings. This grouping shows how motion patterns can be used to tell things apart and identify them at home in a way that is both unobtrusive and quick, saving time and effort. After the test data were fed to the trained model, the model's accuracy, precision, recall, F1-score, and specificity were calculated. Table 1 shows the results of the evaluation; it ranks the classifiers by their values for precision, recall, F1-score, and specificity.
How well the model works in the real world and how well it predicts the actual positive values shows how correct the model is. The formula can be written as:

Precision = TP / (TP + FP)
Recall measures how well a model can predict the actual positive results.
Table 1 Comparative analysis

Year | Methodology | Key contribution
2022 | Deep learning model | Improved smart home control and security system using deep learning techniques
2019 | Point grouping method | Proposed a novel method for finger vein recognition using a point grouping technique
2019 | Blockchain technology | Secure data storage for door lock system using blockchain technology
2018 | Blockchain technology | Highlighted the potential of blockchain for securing IoT data
2020 | Raspberry Pi and Telegram notification | Developed a secure home entry system using Raspberry Pi with notifications via Telegram
2022 | FPGA-based assistive framework | Proposed an FPGA-based framework for smart home automation
Recall = TP / (TP + FN)
The F1-score compares the precision and recall scores of a test to figure out how accurate it is. The formula can be written as:

F1-score = 2 × (Precision × Recall) / (Precision + Recall)
The model's specificity is determined by how many of the negative cases it predicts correctly. Specificity is computed as:

Specificity = TN / (TN + FP)
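The four formulas above follow directly from the confusion-matrix counts; a small sketch (the function name is ours):

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1-score, and specificity from
    confusion-matrix counts, following the formulas in the text."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)
    return {"precision": precision, "recall": recall,
            "f1": f1, "specificity": specificity}
```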
Also, we compared the models from earlier research with the CNN model we had proposed. The comparison with the earlier implementation is limited to accuracy, because that was the only comparable measure used in the earlier research; this repeats the problem, already discussed, of comparing metrics with specific related studies. Table 2 shows a comparison between how well our CNN model works and how well other models have worked in the past, and is evidence that the suggested CNN model can make smart home automation in the Internet of Things safer. The CNN models that have been built can be used in smart home automation apps to make it easier to spot intruders based on their movement patterns. With the help of the security camera and models that can recognize, classify, and tell apart different motion patterns, users can find out who is breaking in. Motion patterns are also used to decide whether smart home apps send out security messages and alarms. The smart home setting thus makes the house safer than it would be otherwise.
Table 2 Various measures of efficiency

Sn | Parameter | BPNN | CNN | Proposed (BPNN + CNN)
1 | MSE | 0.553 | 0.325 | 0.19
2 | PSNR | 50.44 | 51.44 | 58.33
3 | Loss percentage | 15 | 14 | 8
4 | Accuracy | 88 | 97 | 99
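Table 2 lists MSE and PSNR side by side; the two are linked, since PSNR is conventionally derived from MSE. A sketch, assuming 8-bit images with peak value 255 (the chapter does not state which peak value it used, so the table's exact numbers are not reproduced here):

```python
import math

def psnr_from_mse(mse, peak=255.0):
    """Peak signal-to-noise ratio in dB computed from the mean squared
    error, for a given peak signal value (255 for 8-bit images)."""
    return 10.0 * math.log10(peak ** 2 / mse)
```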
5 Conclusion
Using the 5G-enabled IoT-based deep learning with blockchain (BC) design for a secured house door system improves both connectivity and safety in a number of ways. Integrating 5G technology with IoT devices makes the system work better by making it easier to connect and send data. Deep learning enables the house's doors to be protected by AI in a cutting-edge way: the ability of DL models to examine and spot multiple patterns and outliers makes them suitable for strong authentication and access control systems, so that only people who have been given permission can get into the house. The integration of blockchain technology makes the system even safer. The distributed ledger and immutable data make it very hard for malicious actors to change information
and break into a home’s door security system. Blockchain technology also makes it
possible to store user data and entry records in a way that is safe, transparent, and auditable. When 5G, IoT, DL, and blockchain technologies are combined into a secure home door system, the result is a powerful and all-encompassing solution. It makes the door system safer, of course, but it also makes it easy and effective to watch and control the door system from a distance. Access can be controlled, alerts can be sent in real time, and a user's smartphone or other connected device can be
used to check on the state of the door system from a distance. With the 5G-enabled Internet of Things-based DL with BC model for protected home door systems, home security has come a long way: improved security features, better connectivity, and easy remote control give people peace of mind. With more work, this model could be used to improve security in smart homes and other places
where IoT is used. It can be expensive to set up a DL with BC design for a protected home door system based on 5G and the Internet of Things: the high cost of putting in place 5G networks, IoT devices, deep learning algorithms, and blockchain systems may limit accessibility. Combining technologies like 5G, IoT, DL, and blockchain can also complicate system design, implementation, and management, and in some cases the system will need to be set up and maintained by specialists. While DL and blockchain could undoubtedly make security better, there is also a chance that they could heighten privacy concerns: when collecting and analyzing data from IoT devices, such as door entry records and user information, questions arise about who owns the data, whether the user gave permission, and how the data could be misused.
Scalability: the 5G-enabled IoT-based DL with BC model might not scale well to a bigger home or more devices. As the number of IoT devices grows, managing and processing the data they produce becomes harder and more resource-intensive.
References
1. Zamri MA, Kamaluddin MU, Zaini N (2021) Implementation of a microcontroller-based home
security locking system. 2021 11th IEEE international conference on control system, computing
and engineering (ICCSCE), Penang, Malaysia, pp 216–221. https://doi.org/10.1109/ICCSCE
52189.2021.9530966
2. Hema N, Yadav J (2020) Secure home entry using raspberry Pi with notification via telegram.
2020 6th international conference on signal processing and communication (ICSC), Noida,
India, pp 211–215. https://doi.org/10.1109/ICSC48311.2020.9182778
3. Ahmed MS, Mukherjee R, Ghosh P, Nayemuzzaman S, Sundaravdivel P (2022) FPGA-based
assistive framework for smart home automation. 2022 IEEE 15th dallas circuit and system
conference (DCAS), Dallas, TX, USA, pp 1–2. https://doi.org/10.1109/DCAS53974.2022.984
5625
4. Saroha A, Gupta A, Bhargava A, Mandpura AK, Singh H (2022) Biometric authentication based
automated, secure, and smart IOT door lock system. 2022 IEEE India council international
subsections conference (INDISCON), Bhubaneswar, India, pp 1–5. https://doi.org/10.1109/
INDISCON54605.2022.9862840
5. Krishnan RS, Muthu AE, Kumar MA, Narayanan KL, Saravanan K, Robinson YH (2022)
Secured door operating mechanism for household during COVID-19 pandemic. 2022 6th inter-
national conference on trends in electronics and informatics (ICOEI), Tirunelveli, India, pp
733–737. https://doi.org/10.1109/ICOEI53556.2022.9776747
6. Shanthini M, Vidya G, Arun R (2020) IoT enhanced smart door locking system. 2020 Third
international conference on smart systems and inventive technology (ICSSIT), Tirunelveli,
India, pp 92–96. https://doi.org/10.1109/ICSSIT48917.2020.9214288
7. Fauzi AFM, Mohamed NN, Hashim H, Saleh MA (2020) Development of web-based smart
security door using QR code system. 2020 IEEE international conference on automatic control
and intelligent systems (I2CACIS), Shah Alam, Malaysia, pp 13–17. https://doi.org/10.1109/
I2CACIS49202.2020.9140200
8. Begum M, Jayasri S, Govindapillai LC (2022) Face recognition door lock system using rasp-
berry Pi. 2022 8th international conference on advanced computing and communication systems
(ICACCS), Coimbatore, India, pp 1645–1648. https://doi.org/10.1109/ICACCS54159.2022.
9785217
9. Gupta K, Jiwani N, Uddin Sharif MH, Mohammed MA, Afreen N (2022) Smart door locking
system using IoT. 2022 international conference on advances in computing, communication
and materials (ICACCM), Dehradun, India, pp 1–4. https://doi.org/10.1109/ICACCM56405.
2022.10009534
10. Brunner H et al (2021) Leveraging cross-technology broadcast communication to build
gateway-free smart homes. 2021 17th international conference on distributed computing in
sensor systems (DCOSS), Pafos, Cyprus, pp 1–9. https://doi.org/10.1109/DCOSS52077.2021.
00014
11. Hou D et al (2022) A highly secure authentication module for smart door lock with temporary
key function. 2022 international conference on cyberworlds (CW), Kanazawa, Japan, pp 228–
235. https://doi.org/10.1109/CW55638.2022.00053
12. Shetty S, Shetty S, Vishwakarma V, Patil S (2020) Review paper on door lock security systems. 2020 international conference on convergence to digital world—Quo Vadis (ICCDW), Mumbai, India, pp 1–4. https://doi.org/10.1109/ICCDW45521.2020.9318636
13. Premkumar B, Emayavaramban G, Amudha A, Ramkumar MS, Divyapriya S, Nagaveni P (2021) Arduino based advanced energy efficient home automation system with automatic task scheduling. 2021 2nd international conference on smart electronics and communication (ICOSEC), Trichy, India, pp 745–751. https://doi.org/10.1109/ICOSEC51865.2021.9591772
14. Monowar MI, Shakil SR, Kafi AH et al (2019) Framework of an intelligent, multi nodal and
secured RF based wireless home automation system for multifunctional devices. Wirel Pers
Commun 105:1–16. https://doi.org/10.1007/s11277-018-6100-z
15. Talal M, Zaidan AA, Zaidan BB et al (2019) Smart home-based IoT for real-time and secure
remote health monitoring of triage and priority system using body sensors: multi-driven
systematic review. J Med Syst 43:42. https://doi.org/10.1007/s10916-019-1158-z
16. Uppuluri S, Lakshmeeswari G (2022) Secure user authentication and key agreement scheme
for IoT device access control based smart home communications. Wirel Netw. https://doi.org/
10.1007/s11276-022-03197-1
17. Taiwo O, Ezugwu AE, Oyelade ON, Almutairi MS (2022) Enhanced intelligent smart home
control and security system based on deep learning model. Wirel Commun Mob Comput
2022:22, Article ID 9307961. https://doi.org/10.1155/2022/9307961
18. Yang L, Yang G, Wang K, Liu H, Xi X, Yin Y (2019) Point grouping method for finger vein
recognition. IEEE Access 7:28185–28195
19. Nadiya U, Rizqyawan MI, Mahnedra O (2019) Blockchain-based secure data storage for door lock system. 2019 4th international conference on information technology, information systems and electrical engineering (ICITISEE), Yogyakarta, Indonesia, pp 140–144. https://doi.org/10.1109/ICITISEE48480.2019.9003904
20. Singh M, Kim S (2018) Blockchain: a game changer for securing IoT data. 2018 IEEE 4th
world forum on internet of things (WF-IoT), pp 51–55
Improving Efficiency of Spinal Cord
Image Segmentation Using Transfer
Learning Inspired Mask Region-Based
Augmented Convolutional Neural
Network
Sheetal Garg and S. R. Bhagyashree
Abstract Spinal cord magnetic resonance images (MRIs) consist of 7 levels of cervical vertebrae, 12 levels of thoracic vertebrae, 5 levels of lumbar vertebrae, and one level each of the sacrum and coccyx. Segmentation of these components is essential for effective classification and post-processing analysis of spinal cord images. Performing this task requires separate algorithms for each of the components, so their segmentation performance is not uniform, which limits their integration capabilities. Moreover, each type of segmentation has scalability issues, which must be addressed via augmentation, aggregation, and machine learning for better clinical use. To resolve these issues, this text proposes
design of a novel spinal cord image segmentation model using transfer learning
inspired mask region-based augmented convolutional neural network (MRACNN).
The proposed model utilizes initial weights from the pre-trained COCO mask RCNN model, and modifies them to incorporate the spine, torso, and L1 to L5 spinal cord
components. When compared to several state-of-the-art models, it is found that the
suggested model has improved region of interest (RoI) extraction and an accuracy of
91% for segmenting these components. Moreover, the proposed model was evaluated
on multiple datasets, and a consistent performance was observed. Furthermore, the
model was fused with an XRAI-based convolutional neural network, which assisted in further improving the overall efficiency of segmentation. Fusion of the XRAI CNN with the MRACNN achieves a segmentation accuracy of 94%, along with better
RoI performance when compared with individual models. The fused model has a high delay requirement and needs a large dataset for training and validation; thus, this text also recommends selective ensembling techniques for redundancy reduction, which improve segmentation speed while maintaining high segmentation quality.
S. Garg (B)·S. R. Bhagyashree
Department of Electronics and Communication Engineering, ATME College of Engineering,
Mysuru, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_19
245
246 S. Garg and S. R. Bhagyashree
Keywords Spine ·Segmentation ·CNN ·Mask RCNN ·COCO ·Ensemble ·
Augmentation
1 Introduction
MRIs play a vital role in disease diagnosis and identification [1]. Segmenting the
spinal cord is a complex task that entails efficient design of body symmetry detec-
tion, analysis of spinal components, anterior–posterior analysis, shape identification,
application of diffusion filters for clustering & outlier detection. In order to perform these tasks, scientists have, over time, presented various segmentation models [2–6]. An instance of such a model, which uses mutual information, Canny edge detection, the Hough transform, and k-means clustering, is depicted in Fig. 1, wherein the spinal cord and canal regions are segmented. The model initially detects body symmetry using mutual information (MI), which allows for RoI extraction and results in a structural decomposition of the spinal image layers. These layers are processed via the Hough transform and Canny edge detection to extract line-like components.
These components are analyzed, and lines in the anterior and posterior (AP) directions are removed in order to isolate spinal cord regions. These regions are given to an angular Hough transform for estimation of candidate circles close to the AP line. These circles assist in identifying the exact spinal cord position, which is facilitated using anisotropic diffusion filtering and k-means-based clustering. The obtained clusters are segregated w.r.t. internal shapes, and spinal cord-like clusters are identified. Pixels belonging to this cluster are extracted, while other pixels are removed from the pixel set, thereby resulting in the final segmented image. Efficiency of segmentation depends upon the internal model design of these blocks, and wide varieties of
algorithms are available for it. A brief review of these algorithms is given in the next section of this text, which assists in identifying performance gaps in the existing literature and thus forms the basis of the proposed model. It is observed that existing models are largely application-specific, which limits their scalability and accuracy for larger datasets. In order to remove this drawback, Sect. 2 proposes the design of a transfer learning inspired mask region-based augmented convolutional neural network. The proposed model uses a combination of an XRAI-based CNN with the MRACNN, which assists in improving its segmentation performance. This performance is evaluated in Sect. 3 and compared with other models in terms of segmentation accuracy, precision of RoI extraction, and computational complexity. Finally, this text concludes with some interesting observations about the proposed model and recommends various methods to further improve its performance.
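The k-means clustering stage of the pipeline in Fig. 1 can be illustrated with a toy pure-Python version over 1-D pixel intensities. The actual pipeline clusters diffusion-filtered image regions; only the assign/update loop is shown here.

```python
import random

def kmeans_1d(values, k, iters=50, seed=0):
    """Toy 1-D k-means: cluster scalar pixel intensities into k groups,
    returning the centroids and one label per value."""
    rng = random.Random(seed)
    centroids = rng.sample(values, k)          # initialize from the data
    labels = [0] * len(values)
    for _ in range(iters):
        # assignment step: nearest centroid for each value
        labels = [min(range(k), key=lambda c: abs(v - centroids[c]))
                  for v in values]
        # update step: recompute each centroid as the mean of its members
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, labels
```

With well-separated intensities (e.g., dark canal pixels vs. bright cord pixels), the loop converges to one cluster per intensity band regardless of the sampled initialization.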
Improving Efficiency of Spinal Cord Image Segmentation Using 247
Fig. 1 Segmentation of spinal cord and canal regions using Hough transform and k-means-based
clustering [3]
2 Material and Method
2.1 Literature Review
Wide varieties of models have been proposed for spinal cord segmentation, each of which has context-specific performance. Consider the research in [7, 8], which describes threshold-based segmentation and segmentation models based on convolutional neural networks (CNN). The CNN model is autotuned, has great segmentation efficiency, and can thus be applied to a broad variety of datasets, in contrast to the threshold-based model, which has restricted accuracy and must be manually tuned for each dataset. A comparison of these models is given in [9], where it is observed that machine learning methods outperform linear segmentation models, and thus are highly preferred for clinical applications. Design of such
models is further described in [10–13], wherein hybrid CNNs, U-Net-based models, and optimization techniques are presented. These techniques have good accuracy and can be used for multiple datasets with minimal training effort on the user's side. However, these models need lengthy training times, which can be reduced by using parallel processing or pipelining strategies. An example of such a high-speed model is put forward in [13], where researchers deployed a U-Net network with transfer learning. The inclusion of transfer learning lessens cold-start problems, improving precision, recall, and f-measure performance overall. This observation served as the inspiration for the proposed approach, which also employs transfer learning to segment
spinal cord imaging quite effectively. In [14–16], researchers employed similar high-efficiency models using an analytical transform-based statistical characteristic decomposition model (ATSCDM) and faster RCNN models for high-accuracy, low-delay segmentation. The faster RCNN model tends to outperform other models due to its microscopic segmentation performance and its ability to perform classification & regression with high efficiency.
Efficiency of reviewed models must be further evaluated on larger datasets. Such
research is discussed in [17], wherein quantitative identification of different spinal
cord components is performed on a large-scale dataset. This work can be used to
identify performance issues with statistical parametric mapping (SPM) framework
[18], spinal cord injury detection frameworks [19], multispecies spinal cord anal-
ysis [20], and Gaussian kernel methods [21]. All these methods showcase that spinal
cord segmentation models are highly context sensitive, and have limited performance
when multiple datasets are used for validation. To resolve this issue, work in [22–24] proposes the use of automatic 3D segmentation via clustering, CNN with grayscale regularized active contour propagation, and super-voxel segmentation using U-Net architectures. These models are observed to have higher delay than context-sensitive models, but have comparable accuracy when applied to multiple types of datasets.
Another high-efficiency model is proposed in [25], wherein researchers have used
Region Growth (RG) algorithm for segmentation. The RG model is highly effective,
but requires manual region estimation, which limits its applicability to small-scale
scenarios. Models that can be used for large-scale datasets are proposed in [26–29], wherein researchers have proposed support vector machine (SVM)-based active contour segmentation, deep convolutional neural networks (DCNN) trained with probability maps of a pre-trained deep network, a multi-dilated recurrent residual U-Net model, and deep dilated convolutions. These models assist in segmenting images taken at multiple angles, and thus allow clinical experts to identify and diagnose any cases of spinal cord injury with minimum error and maximum precision. The efficiency of these models must be tested on multiple datasets and combined with multiple feature extraction & classification units, as proposed in [30–32], wherein ensemble principal component analysis (PCA), particle swarm optimization (PSO), and SVM with different kernels are discussed. These models tend to outperform
linear segmentation models, but still have a wide scope of improvement in terms of scalability & accuracy performance. In order to work on these issues, the next section proposes the design of a high-efficiency spinal cord image segmentation model using a transfer learning inspired mask region-based augmented convolutional neural network. The proposed model is based on multiple deep learning layers, which assists in achieving high accuracy with better PSNR. Results of the model are also discussed in Sect. 3 of this text.
2.2 Design of a High-Efficiency Spinal Cord Image
Segmentation Model Using Transfer Learning Inspired
Mask Region-Based Augmented Convolutional Neural
Network
From the literature survey, it is observed that most of the recently proposed models for spinal cord segmentation are suited to single-image datasets and give limited accuracy for different image types. In order to improve the scalability of spinal cord segmentation, this section proposes the design of a novel model that uses a transfer learning inspired mask region-based augmented convolutional neural network (MRACNN). The MRACNN model uses pre-trained common objects in context (COCO) weights for initial validation, and is retrained using the target spinal cord dataset. This MRACNN model is integrated with an XRAI-based CNN model, which assists in saliency-based segmentation. The proposed model is depicted in Fig. 2, which showcases the entire data flow of the segmentation process.
The model uses a combination of the transfer learning mask RCNN and the XRAI-based CNN in order to obtain the final segmented image. The internal description of these models, along with their integration details, is discussed in separate sub-sections for better understanding. Readers can replicate these designs in parts or as a whole by referring to these sections, and use them in their own segmentation models.
2.3 Transfer Learning Mask RCNN Model Using Initial
COCO Weights for Coarse Segmentation
The transfer learning-based mask RCNN model is built around a region proposal network, which is trained using COCO-based training weights. These weights are updated as per the input dataset, and a spinal mask is obtained. Generation of this mask is assisted by an incremental CNN learning model, which uses progressive feature extraction, as observed from Fig. 3, wherein the internal layer design of the model is depicted.
The model initially uses COCO-based weights to perform initial pixel-level convolutions. These convolutions assist the model in evaluating multiple feature vectors, which are used for model training. The result of the convolutional layer is controlled using Eq. (1), wherein a rectified linear unit (ReLU)-based kernel is used for pixel activation,
\[
\text{Conv}_{\text{out}}(i,j) = \sum_{a=-m/2}^{m/2} \; \sum_{b=-n/2}^{n/2} I(i-a,\, j-b)\, \text{ReLU}\!\left(\frac{m}{2}+a,\; \frac{n}{2}+b\right) \qquad (1)
\]
where I, i, m, n, and j represent the raw spinal cord image, the current window row, the number of rows and columns in the spinal cord image, and the current window column, respectively.
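A minimal sketch of the convolution in Eq. (1): the m/2-centred indexing is simplified here to a standard valid convolution, and the ReLU is applied to the kernel entries as the equation writes it, which we reproduce purely for illustration.

```python
def conv2d_relu(image, kernel):
    """Valid 2-D convolution of an image with a ReLU-clipped kernel
    (nested lists of numbers; pure Python, no framework assumed)."""
    kh, kw = len(kernel), len(kernel[0])
    # ReLU on the kernel, per Eq. (1): negative weights are clipped to zero
    relu_k = [[max(0.0, v) for v in row] for row in kernel]
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            row.append(sum(image[i + a][j + b] * relu_k[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out
```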
These metrics are evaluated for each given layer. The number of features produced
Fig. 2 Design of the proposed model
during each convolution is controlled via Eq. (2) as follows:

\[
\text{feat}_{\text{out}} = \frac{\text{feat}_{\text{in}} + 2p - k}{s} + 1 \qquad (2)
\]
where feat_in, feat_out, p, s, and k represent the total input features, the total output features, the padding size used during convolution, the stride size used during convolution, and the kernel size of the convolutional layer, respectively. This model can extract a
and kernel size for the convolutional layer, respectively. This model can extract a
substantial number of characteristics from any spinal cord image due to variations in
different padding, stride, and kernel sizes. This causes inherent redundancies in the
system, which limits its accuracy & RoI performance. To improve this accuracy, a
variance-based feature selection layer, namely, max pooling layer is applied to each
Fig. 3 Transfer learning mask RCNN model using initial COCO weights for coarse segmentation
convolutional layer. The convolutional features that are higher than the evaluated
feature threshold are all removed by this layer. The max pooling variance threshold
is evaluated using Eq. (3),
$$f_{th} = \frac{1}{S_i}\left(\sum_{x \in X_k} x^{p_k}\right)^{1/p_k} \quad (3)$$
where S_i is the size of the input spinal cord image, while p_k is a feature selection factor, which is tuned during hyperparameter optimization. This process of convolution and max pooling is repeated for progressive convolutional layer sizes of 7 × 7 × 256, 14 × 14 × 256, 28 × 28 × 256, and 28 × 28 × 80 to obtain a large number of segmentation features. These features are given to a combination of two 1 × 1024 fully connected neural network layers in order to estimate the final per-pixel class for segmentation.
Each pixel is classified into spine, torso, L1L2, L2L3, L3L4, or L4L5 classes. This
classification is controlled using Eq. (4), wherein a SoftMax-based activation function is described,
$$c_{\text{out}} = \text{SoftMax}\left(\sum_{i=1}^{N_f} f_i w_i + b\right) \quad (4)$$
where f_i, w_i, b, and N_f represent the values of the extracted convolutional feature vector, the hyperparameter-tuned weights, the hyperparameter-tuned bias, and the total number of features extracted by the convolutional layer, respectively. An output segmentation mask is extracted from this layer and is used to fine-tune the segmentation output provided by the XRAI-based CNN model. This model is described in the next sub-section of this text.
2.4 XRAI-Based CNN Model for Region-Based Segmentation
The raw spinal cord image is given to an XRAI-based segmentation layer. XRAI is a medical imaging-oriented saliency map segmentation algorithm, which assists in the identification of RoI regions. These RoI regions are extracted using entropy values, which depend upon the extracted convolutional features. In order to extract these features,
raw input image is split using bit-plane slicing. Each of the slices is then given to
a convolutional feature extraction unit. These features are extracted using Eqs. (1),
(2), and (3) wherein max pooling-based feature selection is performed to reduce
redundancies. All extracted features at each layer are given to an entropy evaluation layer, which is controlled using Eq. (5). Here, the probability of feature occurrence and its logarithmic levels are used for the final entropy evaluation,
$$E_{f_i} = -\sum_{r=1}^{N}\sum_{c=1}^{M} p\left(F_{r,c,i}\right)\log p\left(F_{r,c,i}\right) \quad (5)$$
where p(F_{r,c,i}) represents the probability of the feature vector at location (r, c), and i represents the bit-slice number of the input spinal cord image. These entropy values are used as upper
limits for each slice of input image, and bit-level thresholding is performed. All these
slices are combined in order to obtain the final XRAI map. Results of this XRAI-based saliency detection model are observed in Fig. 4, wherein the input image, its saliency mask, and the final saliency image are shown.
Fig. 4 Extracted salient regions from input spinal cord imagery
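The per-slice entropy evaluation of Eq. (5) can be illustrated with a short Python sketch (the function name is hypothetical, not from the paper's code):

```python
import math

def bitplane_entropy(prob_map):
    # Eq. (5): E = -sum_r sum_c p(F_{r,c}) * log p(F_{r,c})
    # prob_map is an N x M grid of feature-occurrence probabilities
    return -sum(p * math.log(p)
                for row in prob_map
                for p in row
                if p > 0)  # the 0 * log(0) terms are taken as 0

# A uniform 2x2 probability map attains the maximum entropy log(4),
# while a map concentrated on one feature has zero entropy.
uniform = [[0.25, 0.25], [0.25, 0.25]]
```

High-entropy slices carry more feature information, which is why the text uses these values as per-slice thresholds before recombining the slices into the XRAI map.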
Fig. 5 Used GoogLeNet model for pixel level classification
These regions are again given for feature extraction to a GoogLeNet-based CNN
model, wherein pixel level classification is performed. The used CNN model is
visualized from Fig. 5, wherein its internal architecture is described.
As observed from the model, each pixel is classified into two classes, foreground
and background. In order to perform this task, ground truth data for a large number
of images is used, and given to an inception module. This module uses the given
ground truth data, and generates a saliency mask for final segmentation. Internal
model design for inception module is described in Fig. 6, wherein multiple filters are
concatenated in order to obtain the final spinal cord mask output. In order to enhance
efficiency of segmentation, inception module uses the following Eq. (6) for internal
pooling,
$$P(q,p) = \log\big(C(p,q)\cdot G(q,p)\big) \quad (6)$$
where P is the output of Pooling, C is the convolutional operation on the input
image patch (p,q), and G is the ground truth image patch (q,p).
Extracted pooling features are given to a filter concatenation unit, which operates
using the following Eq. (7),
$$F(p,q) = \frac{P(q,p)/k + d\left(a\,B(p,q) + c\right)}{4} \quad (7)$$
where F represents the concatenated filter output, P represents the pooling output, B represents the base image patch at (p,q), while a, c, d, and k are inception constants, tuned through the process of hyperparameter tuning. Multiple
inception modules are connected in cascade, which generates a large number of
segmentation masks. All these masks are overlapped in order to generate the coarse
spinal cord segmentation mask. Results from Sect. 2.3, and 2.4 are combined in order
to generate the final mask as described in Sect. 2.5.
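The inception pooling of Eq. (6) and the filter concatenation of Eq. (7) reduce to simple scalar arithmetic per patch. The sketch below illustrates them under an assumed reading of the equations, with hypothetical helper names and placeholder default constants:

```python
import math

def inception_pool(c_val, g_val):
    # Eq. (6): P(q,p) = log(C(p,q) * G(q,p))
    # c_val: convolution response on the patch, g_val: ground-truth patch value
    return math.log(c_val * g_val)

def filter_concat(p_val, b_val, a=1.0, c=0.0, d=1.0, k=1.0):
    # Eq. (7), as reconstructed here: F = (P/k + d*(a*B + c)) / 4
    # a, c, d, k stand in for the inception constants, which the paper
    # says are set during hyperparameter tuning.
    return (p_val / k + d * (a * b_val + c)) / 4.0
```

Cascading several such modules and overlapping their masks, as the text describes, then yields the coarse spinal cord segmentation mask.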
Fig. 6 Design of the inception model
2.5 Model Fusion for Final Segmentation
The extracted masks from Sect. 2.3 and 2.4 are combined using a fusion model for
estimation of the final segmentation mask. In order to perform this task, the following process is designed:
– Masks from RCNN and XRAI CNN are evaluated at pixel level.
– Correct pixel locations from each mask are extracted by referring to the ground truth data.
– The union of all correct pixels is taken, and unique values from this union are estimated using the following Eq. (8),
$$P_{\text{final}} = \text{Unique}\big(P_{\text{XRAI}} \cup P_{\text{RCNN}}\big) \quad (8)$$
where P_final, P_XRAI, and P_RCNN represent the final pixel map, the XRAI-based pixel map, and the RCNN-based pixel map for each input image.
– The final pixel map is evaluated for each training image and stored in the database.
– For any new input image, convolutional features are computed, and each pixel is classified into foreground and background.
– The correlation between the extracted convolutional features and the stored features is estimated using the following Eq. (9),
$$\text{Corr}_j = \frac{\sum_{i=1}^{N_{f\text{Test}}} F_{\text{test}_i}\, F_{\text{new}_i}}{\sum_{i=1}^{N_{f\text{Test}}} \left(F_{\text{test}_i} - F_{\text{new}_i}\right)^2} \quad (9)$$
where j is the number of the segmentation engine (j = 1 for XRAI, 2 for RCNN), F_test_i and F_new_i are the ith test-set and new-input pixel values respectively, and N_fTest is the total number of features selected by the convolutional models for the test set. The maximum value of Corr_j is evaluated, and the segmentation pixel positions of that training image are used to segment the new test image. Based on this, the final spinal cord image is segmented, and performance metrics including segmentation accuracy, peak signal-to-noise ratio, and segmentation delay are estimated. These metrics are compared with existing models and discussed in the next section of this text.
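The fusion steps of Sect. 2.5, the unique union of Eq. (8) and the correlation matching of Eq. (9), can be sketched as follows. This is an illustrative sketch under assumed readings of those equations, not the authors' code, and the function names are hypothetical:

```python
def fuse_masks(p_xrai, p_rcnn):
    # Eq. (8): P_final = Unique(P_XRAI union P_RCNN)
    # pixel maps are given as iterables of (row, col) positions
    return sorted(set(p_xrai) | set(p_rcnn))

def corr_score(f_test, f_new):
    # Eq. (9), as read here: sum of products over sum of squared
    # differences between stored test features and new-image features
    num = sum(a * b for a, b in zip(f_test, f_new))
    den = sum((a - b) ** 2 for a, b in zip(f_test, f_new))
    return num / den if den else float("inf")  # identical features -> max score
```

The training image maximizing `corr_score` against the new image would then donate its stored segmentation pixel positions, mirroring the matching step described above.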
3 Results and Discussion
Spinal cord MRIs from several Mendeley datasets and the associated ground truth images were utilized to measure the segmentation performance of the proposed model. The data were taken from https://data.mendeley.com/datasets/zbf6b4pttk/2 and are freely available under an open-source license. The dataset consists of lumbar spine scans, with 48,345 MRI slices. The majority of these slices are 320 × 320 pixels, while some have a 320 × 310-pixel resolution, as observed in Fig. 7, wherein a sample set of the dataset is visualized. Compared to typical 8-bit grayscale images, all images feature 12-bit per-pixel resolution. Such a large dataset made it possible to train the underlying network and obtain accuracy, PSNR, and delay metrics. Equation (10) was used to assess the segmentation accuracy as follows,
$$A = \frac{N_{pc}}{I_{\text{size}}} \times 100 \quad (10)$$
where N_pc and I_size represent the total number of correctly classified pixels and the size of the input image, respectively. The entire dataset was divided in a 70:30 ratio for training & validation, respectively. Segmentation accuracy was evaluated for ML [8] and FASTER RCNN [10], and compared with the proposed model; these values are tabulated in Table 1 as follows:
Table 1 shows that, on the same dataset, the proposed model is 16% more accurate than ML [6], 10% more accurate than FASTER RCNN [8], and 3% more accurate than RCNN. It follows that the suggested methodology may be applied to real-time clinical segmentation and is very successful for large-scale deployments. PSNR during segmentation for ML [6] and FASTER RCNN [8] was measured and compared against the suggested model. The results are presented in Table 2 as follows:
According to Table 2, the suggested model has a PSNR that is 15 dB higher than
ML [8], 11 dB higher than FASTER RCNN [10], and 4 dB higher than RCNN on the
same dataset. This increase in PSNR is the result of the combination of XRAI and
RCNN, which helps with precise segmentation. It follows that the suggested methodology may be applied to real-time clinical segmentation and is very successful for large-scale deployments.

Fig. 7 Dataset samples

The delay during segmentation for ML [8] and FASTER RCNN [10] is reported in Table 3 and compared to the proposed model.
From Table 3, it is observed that the proposed model has a 30% higher delay than ML [8], 24% higher than FASTER RCNN [10], and 10% higher than RCNN on the same dataset. This increase in runtime is due to the combination of XRAI & RCNN, and must be addressed using optimization methods such as hyperparameter tuning, Q-learning, and ensemble classification. Thus, the proposed model can be used for real-time clinical segmentation, but may require more time in exchange for better accuracy.
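The accuracy metric of Eq. (10) and the 70:30 split used in the experiments can be sketched in a few lines (helper names are hypothetical, not the authors' code):

```python
def segmentation_accuracy(n_correct, image_size):
    # Eq. (10): A = (N_pc / I_size) * 100
    return n_correct / image_size * 100.0

def train_val_split(n_images, train_ratio=0.7):
    # The paper's 70:30 split between training and validation images
    n_train = round(n_images * train_ratio)
    return n_train, n_images - n_train
```

For example, an image where 90 of 100 pixels are classified correctly scores A = 90.0, and a pool of 100 images splits into 70 training and 30 validation images.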
Table 1 Accuracy of segmentation for different models
Number of images | A (%) ML [6] | A (%) Faster RCNN [8] | A (%) RCNN | A (%) TLMA CNN
100 75.40 76.20 86.63 88.89
200 75.45 77.90 87.63 89.92
300 75.48 78.40 87.93 90.23
400 75.51 79.20 88.41 90.71
500 75.57 80.10 88.95 91.28
600 75.63 81.10 89.56 91.90
700 75.69 81.50 89.82 92.17
800 75.72 81.60 89.90 92.25
900 75.77 81.65 89.95 92.30
1000 75.82 81.75 90.04 92.39
1500 75.86 81.90 90.15 92.50
2000 75.91 81.98 90.22 92.58
2500 75.96 82.08 90.30 92.66
3000 76.00 82.18 90.39 92.75
3500 76.05 82.28 90.47 92.84
4000 76.10 82.38 90.56 92.92
4500 76.15 82.48 90.64 93.01
5000 76.19 82.58 90.72 93.09
5500 76.24 82.68 90.81 93.18
6000 76.29 82.78 90.89 93.27
7000 76.33 82.88 90.98 93.35
8000 76.38 82.98 91.06 93.44
9000 76.43 83.08 91.14 93.53
10,000 76.48 83.18 91.23 93.61
11,000 76.52 83.28 91.31 93.70
12,000 76.57 83.38 91.40 93.78
13,000 76.62 83.48 91.48 93.87
14,500 76.66 83.58 91.57 93.96
Table 2 PSNR of segmentation for different models
Number of images | PSNR (dB) ML [6] | PSNR (dB) Faster RCNN [8] | PSNR (dB) RCNN | PSNR (dB) TLMA CNN
100 30.16 31.24 38.98 42.67
200 30.18 31.94 39.43 43.16
300 30.19 32.14 39.57 43.31
400 30.20 32.47 39.78 43.54
500 30.23 32.84 40.03 43.81
600 30.25 33.25 40.30 44.11
700 30.28 33.42 40.42 44.24
800 30.29 33.46 40.45 44.28
900 30.31 33.48 40.48 44.31
1000 30.33 33.52 40.52 44.35
1500 30.35 33.58 40.57 44.40
2000 30.36 33.61 40.60 44.44
2500 30.38 33.65 40.64 44.48
3000 30.40 33.69 40.67 44.52
3500 30.42 33.73 40.71 44.56
4000 30.44 33.77 40.75 44.60
4500 30.46 33.81 40.79 44.64
5000 30.48 33.86 40.83 44.69
5500 30.50 33.90 40.86 44.73
6000 30.51 33.94 40.90 44.77
7000 30.53 33.98 40.94 44.81
8000 30.55 34.02 40.98 44.85
9000 30.57 34.06 41.02 44.89
10,000 30.59 34.10 41.05 44.93
11,000 30.61 34.14 41.09 44.98
12,000 30.63 34.18 41.13 45.02
13,000 30.65 34.22 41.17 45.06
14,500 30.67 34.27 41.20 45.10
Table 3 Segmentation delay of different models
Number of images | Delay (s) ML [6] | Delay (s) Faster RCNN [8] | Delay (s) RCNN | Delay (s) TLMA CNN
100 11 11 13 13
200 21 22 25 27
300 32 33 38 40
400 42 45 51 54
500 53 56 64 68
600 64 69 78 82
700 74 80 91 95
800 85 92 104 109
900 95 104 117 123
1000 106 115 131 137
1500 159 173 196 205
2000 213 231 262 274
2500 266 289 327 343
3000 319 348 393 412
3500 373 406 459 481
4000 426 465 525 550
4500 480 523 591 619
5000 533 582 658 689
5500 587 641 724 758
6000 641 700 791 828
7000 748 818 923 967
8000 855 936 1056 1106
9000 963 1054 1189 1246
10,000 1071 1173 1323 1385
11,000 1178 1292 1456 1525
12,000 1286 1411 1590 1666
13,000 1394 1530 1724 1806
14,500 1556 1709 1925 2016
4 Conclusion
The proposed model is capable of providing high-accuracy segmentation performance, owing to the fusion of the XRAI CNN & RCNN models. It is observed that the proposed model achieves 93.96% accuracy, with a maximum PSNR of 45.1 dB, across multiple types of spinal cord images. Furthermore, the proposed model has 16% better accuracy than ML [8], 10% better accuracy than FASTER RCNN [10], and 3% better accuracy than RCNN on the same image set. This makes it highly usable for a wide variety of spinal cord segmentation applications, including high-efficiency classification, post-processing, etc. The proposed model also shows a 15 dB improvement in PSNR over ML [8], an 11 dB improvement over FASTER RCNN [10], and a 4 dB improvement over RCNN on multiple Mendeley datasets, making it highly viable for efficient segmentation in large-scale clinical applications. However, the model has a large training & validation delay due to multiple algorithmic passes, which restricts deployment to high-computing environments; researchers must therefore apply optimization models to reduce the computational complexity of the proposed model. Furthermore, researchers can explore other CNN architectures and observe their effect on final segmentation performance.
5 Conflicts of Interest
The authors declare that there is no conflict of interest regarding the publication of
this paper.
Acknowledgements We would like to thank the management and Principal of ATME College of
Engineering, Mysore, for their ongoing assistance and support.
References
1. Garg S, Bhagyashree SR (2019) Detection and classification of tumors using medical imaging
techniques: a survey. In: Balaji S, Rocha Á, Chung YN (eds) Intelligent communication tech-
nologies, and virtual mobile networks. ICICV 2019. Lecture notes on data engineering and
communications technologies, vol 33
2. Garg S, Bhagyashree SR (2021) Spinal cord MRI segmentation techniques, and algorithms: a
survey. SN Comput Sci 2:229
3. Sabaghian S, Dehghani H, Batouli SA, Khatibi A, Oghabian M (2020) Fully automatic 3D
segmentation of the thoracolumbar spinal cord and the vertebral canal from T2-weighted MRI
using K-means clustering algorithm. Spinal Cord 58:1–10. https://doi.org/10.1038/s41393-
020-0429-3
4. Chen M, Carass A, Oh J et al (2013) Automatic magnetic resonance spinal cord segmentation
with topology constraints for variable fields of view. Neuroimage 83:1051–1062. https://doi.
org/10.1016/j.neuroimage.2013.07.060
5. Liao C-C, Ting H-W, Xiao F (2017) Atlas-free cervical spinal cord segmentation on midsagittal T2-weighted magnetic resonance images. J Healthc Eng 2017:12, Article ID 8691505. https://doi.org/10.1155/2017/8691505
6. Gros C, De Leener B, Badji A, Maranzano J, Eden D, Dupont S, Talbott J, Zhuoquiong R, Liu
Y, Granberg T, Ouellette R, Tachibana Y, Hori M, Kamiya K, Chougar L, Stawiarz L, Hillert
J, Bannier E, Kerbrat A, Cohen-Adad J (2018) Automatic segmentation of the spinal cord and
intramedullary multiple sclerosis lesions with convolutional neural networks
7. Ahammad SH, Ur Rahman MZ, Lay-Ekuakille A, Giannoccaro NI (2020) An efficient optimal
threshold-based segmentation and classification model for multi-level spinal cord injury
detection. 2020 IEEE international symposium on medical measurements and applications
(MeMeA), pp 1–6. https://doi.org/10.1109/MeMeA49120.2020.9137122
8. Saenz-Gamboa JJ, de la Iglesia-Vayá M, Gómez JA (2021) Automatic semantic segmentation of
structural elements related to the spinal cord in the lumbar region by using convolutional neural
networks. 2020 25th international conference on pattern recognition (ICPR), pp 5214–5221.
https://doi.org/10.1109/ICPR48806.2021.9412934
9. Mnassri B, Sahnoun M, Hamida AB (2020) Comparison study for spinal cord segmentation
methods aiming to detect SC atrophy in MRI images: case of multiple sclerosis. 2020 5th
international conference on advanced technologies for signal and image processing (ATSIP),
pp 1–6. https://doi.org/10.1109/ATSIP49331.2020.9231790
10. Ahammad SH, Rajesh V, Rahman MZU, Lay-Ekuakille A (2020) A hybrid CNN-based
segmentation and boosting classifier for real time sensor spinal cord injury data. IEEE Sens J
20(17):10092–10101. https://doi.org/10.1109/JSEN.2020.2992879
11. Lemay A, Gros C, Zhuo Z et al (2021) Automatic multiclass intramedullary spinal cord tumor
segmentation on MRI with deep learning. Neuroimage Clin 31:102766. https://doi.org/10.
1016/j.nicl.2021.102766
12. Yiannakas MC, Liechti MD, Budtarad N et al (2019) Gray versus white matter segmenta-
tion of the conus medullaris: reliability and variability in healthy volunteers. J Neuroimaging
29(3):410–417. https://doi.org/10.1111/jon.12591
13. Couedic TL, Caillon R, Rossant F, Joutel A, Urien H, Rajani RM (2020) Deep-learning based
segmentation of challenging myelin sheaths. 2020 tenth international conference on image
processing theory, tools and applications (IPTA), pp 1–6. https://doi.org/10.1109/IPTA50016.
2020.9286715
14. Alsiddiky A, Fouad H, Soliman AM, Altinawi A, Mahmoud NM (2020) Vertebral tumor
detection and segmentation using analytical transform assisted statistical characteristic decom-
position model. IEEE Access 8:145278–145289. https://doi.org/10.1109/ACCESS.2020.301
2719
15. Moccia M, Prados F, Filippi M, Rocca MA, Valsasina P, Brownlee WJ, Zecca C, Gallo A,
Rovira A, Gass A, Palace J, Lukas C, Vrenken H, Ourselin S, Gandini Wheeler-Kingshott
CAM, Ciccarelli O, Barkhof F (2019) Longitudinal spinal cord atrophy in multiple sclerosis
using the generalized boundary shift integral. Ann Neurol 86:704–713. https://doi.org/10.1002/
ana.25571
16. Ma S, Huang Y, Che X, Gu R (2020) Faster RCNN-based detection of cervical spinal cord
injury and disc degeneration. J Appl Clin Med Phys 21. https://doi.org/10.1002/acm2.13001
17. Pai SA, Zhang H, Shewchuk JR et al (2020) Quantitative identification and segmentation
repeatability of thoracic spinal muscle morphology. JOR Spine 3(3):e1103. Published 2020 Jul
1. https://doi.org/10.1002/jsp2.1103
18. Azzarito M, Kyathanahally SP, Balbastre Y et al (2021) Simultaneous voxel-wise analysis of
brain and spinal cord morphometry and microstructure within the SPM framework. Hum Brain
Mapp 42:220–232. https://doi.org/10.1002/hbm.25218
19. Majidpoor J, Mortezaee K, Khezri Z et al (2021) The effect of the segment of spinal cord
injury on the activity of the nucleotide-binding domain-like receptor protein 3 inflammasome
and response to hormonal therapy. Cell Biochem Funct 39(2):267–276. https://doi.org/10.1002/
cbf.3574
20. Maidawa SM, Ali MN, Imam J, Salami SO, Hassan AZ, Ojo SA (2021) Morphology of the
spinal nerves from the cervical segments of the spinal cord of the African giant rat (Cricetomys
Gambianus). Anat Histol Embryol 50(2):300–306. https://doi.org/10.1111/ahe.12630
21. Malathy V, Anand M, Dayanand Lal N et al (2020) Segmentation of spinal cord from computed
tomography images based on level set method with Gaussian kernel. Soft Comput 24:18811–
18820. https://doi.org/10.1007/s00500-020-05113-1
22. Sabaghian S, Dehghani H, Batouli SAH et al (2020) Fully automatic 3D segmentation of
the thoracolumbar spinal cord and the vertebral canal from T2-weighted MRI using K-means
clustering algorithm. Spinal Cord 58:811–820. https://doi.org/10.1038/s41393-020-0429-3
23. Zhang X, Li Y, Liu Y et al (2021) Automatic spinal cord segmentation from axial-view MRI
slices using CNN with grayscale regularized active contour propagation. Comput Biol Med
132:104345. https://doi.org/10.1016/j.compbiomed.2021.104345
24. A deep learning method with residual blocks for automatic spinal cord segmentation in planning
CT. https://www.sciencedirect.com/science/article/abs/pii/S1746809421006716
25. Subramanya Jois SP, Sridhar H, Harish Kumar JR (2018) A fully automated spinal cord segmentation. 2018 IEEE global conference on signal and information processing (GlobalSIP), pp
524–528. https://doi.org/10.1109/GlobalSIP.2018.8646682
26. Hasane S, Rajesh V, Rahman MZU (2019) Fast and accurate feature extraction-based segmen-
tation framework for spinal cord injury severity classification. IEEE Access 7:46092–46103.
https://doi.org/10.1109/ACCESS.2019.2909583
27. Rehman F, Ali Shah SI, Riaz N, Gilani SO (2019) A robust scheme of vertebrae segmentation for
medical diagnosis. IEEE Access 7:120387–120398. https://doi.org/10.1109/ACCESS.2019.
2936492
28. Kim DH, Jeong JG, Kim YJ et al (2021) Automated vertebral segmentation and measurement
of vertebral compression ratio based on deep learning in X-ray images. J Digit Imaging 34:853–
861. https://doi.org/10.1007/s10278-021-00471-0
29. Perone C, Calabrese E, Cohen-Adad J (2018) Spinal cord gray matter segmentation using deep
dilated convolutions. Sci Rep 8. https://doi.org/10.1038/s41598-018-24304-3
30. Ahammad SH, Rajesh V, Rahman MZU (2019) Fast and accurate feature extraction-based
segmentation framework for spinal cord injury severity classification. IEEE Access 7:46092–
46103. https://doi.org/10.1109/ACCESS.2019.2909583
31. Valarmathi G, Devi S (2021) Human vertebral spine segmentation using particle swarm
optimization algorithm. https://doi.org/10.1007/978-981-16-0669-4_7
32. Punarselvam E, Suresh P (2019) Investigation on human lumbar spine MRI image using finite
element method and soft computing techniques. Cluster Computing 22. https://doi.org/10.1007/
s10586-018-2019-0
Neurological Disease Prediction Based on EEG Signals Using Machine Learning Approaches
Zahraa Maan Sallal and Alyaa A. Abbas
Abstract Diagnostics and prognoses of brain disorders can be greatly aided by
machine learning. To bring these tools into clinical routine, we argue that key challenges remain to be addressed by the community: interpretable models are needed to overcome the limitations of black-box approaches, and shortcomings in validation and reproducible research practices must be remedied. Extensive generalization studies are also required. Brain diseases are among the most prevalent disorders, and many people die from them each year; the death toll continues to rise and is estimated to reach 75 million by 2030. Even with modern technology and advanced healthcare systems, brain diseases remain hard to predict. In our paper, we use machine learning algorithms to
implement the neurological disease prediction approach since such algorithms are
a critical source of data prediction. MIT-BIH repository data comprising a variety
of patients were used as the database. Based on the classifiers utilized, the findings
have proven that the RDF produced the most accurate result with 96.32% accuracy.
Keywords Machine learning classifier · Brain disease prediction · Kernel perceptron · Random decision forest (RDF)
1 Introduction
Globally, there are 17 million deaths a year caused by brain diseases. In developed
countries, death rates are also alarming, even though low- and middle-income coun-
tries account for three-quarters of all deaths [1]. A staggering 35% of global deaths
are caused by neurological diseases, according to the Centers for Disease Control
and Prevention. Various races, classes, and age groups also experience this issue.
Z. M. Sallal (B)
General Directorate of Education in Al-Qadisiyah Governorate/Ministry of Education,
Al Diwaniyah, Iraq
e-mail: zahraa.m199021@gmail.com
A. A. Abbas
General Directorate of Education in Al-Muthana Governorate, Ministry of Education, Samah, Iraq
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_20
While medical science has made incredible advances across the globe, it has yet to
be possible to prevent different types of brain diseases. Between 1986 and 2006, the
WHO reported an increase of 4184% in brain disease deaths, whereas dysentery and
respiratory infections were reduced by 82% and 86%, respectively [2,3]. Neuro-
logical disease is alarming because the majority of people who suffer from it are in
the most productive years of their lives. About 40% of brain disease in developing
countries occurs before 50 years of age, whereas 25% of the same disease occurs
before 40 [4]. Developing and low-income countries lack basic healthcare facilities, which leads to untreated brain disease and costly medications that push families into poverty. Even though there are different detection methods for neurological diseases,
the most challenging aspect is predicting their presence and detecting them in the
body [5]. To reduce the death rate and reduce economic vulnerability for all fami-
lies, it is crucial to predict brain diseases, which will eventually assist policymakers
in taking appropriate action against neurological diseases. The remaining sections of this manuscript are arranged as follows: Sect. 2 reviews the related literature, Sect. 3 presents the materials and methodology, and Sect. 4 explains methods for improving EEG-based disease classification.
2 Literature-Related Works
A common problem worldwide is brain disease. Every year, brain diseases cause
thousands of deaths. In part, this is due to the difficulty of analyzing clinical data
in the field of brain disease [6]. This manuscript uses machine learning to predict
brain diseases. A classification technique was used in this prediction model. Several
combinations of features were taken into account. It has been classified using several
methods, including preprocessing, feature selection, feature reduction, decision trees,
language models, support vector models, and RDF. Based on a hybrid RDF with a
linear model (HRFLM), the brain disease prediction model achieved 88.7% accuracy.
Around the world, one out of three people dies from brain diseases today [7,8]. Due
to the complexity of the task, medical practitioners often have difficulty predicting
brain disorders. Additionally, some health-sector information necessary for making informed decisions is hidden from the public; brain diseases have been predicted using a model [9]. The algorithms used in this research include J48, Naive Bayes, RepTree,
CART, and Bayes Net for predicting brain disorders. As a final result of the research,
99% of the predictions were accurate. Data mining was also demonstrated to be
beneficial for a sector of health in predicting patterns in the dataset [10,11]. A
model for detecting symptoms that can prevent heat stroke at an early age has been developed from an investigation, as its rate has increased day by day. The authors proposed an application that would take inputs such as age, gender, and pulse rate and predict brain diseases. Brain diseases were identified using machine
learning algorithms and neural networks [12,13]. Various factors contribute to the
rise of brain diseases. Healthcare providers collect a lot of data every day. However,
they do not use machine learning and pattern recognition techniques, which limit
their ability to make predictions. To predict the future, we presented a model [9].
MIT-BIH repository data and attributes were collected in this manuscript [14]. To
predict brain disease, they used these data. Several ANN techniques were used to
develop this technology. The accuracy rate for ANNs was 94.7%, but PCAs had a
97.7% accuracy rate. The MIT-BIH repository was used to gather the information
for prediction [10]. A prediction model was developed using 1025 instances with
14 attributes [15]. By using tree-based algorithms such as M5P, random trees, and
RDF ensembles, this research explored the accuracy, precision, and sensitivity of
the tree-based classification algorithm. All prediction algorithms were applied after
the selection of features from the patient’s brain dataset. Some of the methods used
in the study include Pearson correlation, recursive feature elimination, and Lasso
regularization.
We used three experimental setups to complete this analysis. In the first experiment, Pearson correlation-based feature selection was applied with the M5P, random tree, reduced error pruning, and RDF ensemble classifiers. As part of the second
experiment, the four tree-based algorithms were combined with a recursive feature
elimination algorithm. Additionally, the tree-based algorithms were combined with
Lasso regularization.
As a result of this experiment, we calculated the accuracy, precision, and sensitivity
of the classification. Using Pearson correlation and Lasso regularization in combi-
nation with RDF ensemble methods, they achieved an accuracy of 99% [16,17].
Preprocessing, feature extraction, and classification are performed on EEG signals.
To predict epileptic seizures using machine learning methods, researchers can place
electrodes on patients’ scalps and record their scalp EEG signals. The use of scalp
EEG signals has been proposed to predict epileptic seizures by numerous researchers
in recent years [5–14]. The method involves preprocessing EEG signals, identifying
features, and categorizing preictal and interictal states using those features [18].
3 Materials and Methodology
A schematic of our proposed methodology can be seen in Fig. 1. There are detailed
instructions provided for each step. Our machine learning approach is based on the
MIT-BIH dataset repository. The proposed system includes the following steps: management of datasets, collection of data features, preprocessing, feature selection, classification of instances, comparison of classifier performance, and finally acquisition of the results. The accuracy rate for our EEG dataset was examined
using four machine learning techniques. The accuracy rate was improved after eval-
uating the performance. A confusion matrix has been visualized for each machine
learning technique to check the validity of the experimental model in the following
sections. After preprocessing and cleaning the data, our experiment retains nine attributes. In total, fourteen attributes were present, but not all were kept because some of them were uninformative.
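The confusion-matrix check mentioned above can be sketched in plain Python; this is an illustrative sketch only, not the paper's code, and the helper names are hypothetical:

```python
def confusion_matrix(y_true, y_pred, labels=(0, 1)):
    # rows index the true label, columns the predicted label
    index = {lab: i for i, lab in enumerate(labels)}
    m = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        m[index[t]][index[p]] += 1
    return m

def accuracy(cm):
    # fraction of samples on the diagonal of the confusion matrix
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total
```

Visualizing one such matrix per classifier, as the text describes, makes it easy to see which class each technique confuses most often, beyond the single accuracy number.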
Fig. 1 Proposed methodology
3.1 EEG Data Preprocessing
Due to noise introduced during acquisition, EEG signals suffer from poor signal-to-noise ratios, which prevents correct classification between interictal and preictal states. Different types of noise can affect EEG signals beyond the 50–60 Hz power-line noise [19, 20]. Baseline noise is also added as a result of interference between multiple electrodes, as are electrical activities associated with human activity, such as eye movements and heartbeats [21].
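Power-line interference of the kind described above is usually handled with a notch (band-stop) filter. The minimal sketch below instead subtracts a least-squares fit of the 50 Hz component, which behaves like a narrow notch for a stationary interferer; the function name and approach are illustrative assumptions, not the paper's implementation:

```python
import math

def remove_powerline(signal, fs, f0=50.0):
    # Project the signal onto cos/sin at f0 and subtract that component,
    # acting as a simple notch for stationary power-line interference.
    n = len(signal)
    w = 2.0 * math.pi * f0 / fs
    c = 2.0 / n * sum(s * math.cos(w * i) for i, s in enumerate(signal))
    d = 2.0 / n * sum(s * math.sin(w * i) for i, s in enumerate(signal))
    return [s - c * math.cos(w * i) - d * math.sin(w * i)
            for i, s in enumerate(signal)]
```

In practice a proper IIR notch (for example `scipy.signal.iirnotch`) combined with a bandpass filter would be used, as the text's discussion of filtering suggests.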
3.2 EEG Data Sourcing
The dataset used for this experiment comes from the UCI Machine Learning Repository; PhysioBank, along with the original MIT-BIH dataset, was the source of these data [22]. Over 80 h of recordings are included in the database, in which ECG, EEG, and respiration signals are annotated to reflect polysomnographic sleep stages and apneas. The signals' and annotations' section provides more information [23].
3.3 Processing of EEG Data
In this study, we analyzed data from 200 neurological patients and 150 non-neurological subjects. Moreover, within the neurological patient dataset, different numbers of females and males were included for analysis purposes [24].
3.4 JupyterLab Tool and Python Language
The experiments were implemented using Python 3.7 as the programming language. User
notebooks, code, and data can be created using JupyterLab, a web-based interac-
tive development environment. It provides a flexible interface for configuring work-
flows for data science, scientific computing, computational journalism, and machine
learning. Jupyter Notebook began life as an online application for creating and sharing
computational documents [25]. Under Jupyter Lab, the tool works properly. Aside
from being simple and streamlined, it offers a unique user experience that focuses on
signal processing. It is necessary to preprocess EEG signals to reduce noise so that
the signal-to-noise ratio can be increased to improve the classification. The SNR can
be increased by using a variety of preprocessing techniques. A low-pass/high-pass
filter can be used to remove other types of noise as well as a bandpass/band-stop
filter to remove other types of noise [26].
The methodology steps can be summarized as follows:
Step 1: Acquire raw EEG signals from a headset or from signals stored in the MIT-BIH dataset.
Step 2: Preprocessing phase, which includes eliminating signal noise using various filters.
Step 3: Select the denoised EEG signals.
Step 4: Extract features from the EEG signals using the power spectrum and wavelet transform.
Step 5: Check the quality of the EEG signals based on performance measures.
Step 6: Test the accuracy of the EEG signals to implement the classification phase.
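Steps 2 and 4 above (noise elimination and power-spectral features) can be sketched with numpy alone; this is a minimal illustration, assuming an FFT-based band-pass as the filter and an illustrative 256 Hz sampling rate, since the chapter does not specify the exact filter design.

```python
import numpy as np

def fft_bandpass(x, fs, lo=1.0, hi=40.0):
    """Crude band-pass (Step 2): zero FFT bins outside [lo, hi] Hz."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    spectrum[(freqs < lo) | (freqs > hi)] = 0.0
    return np.fft.irfft(spectrum, n=len(x))

def band_power(x, fs, lo, hi):
    """Power-spectral feature (Step 4): summed power inside [lo, hi] Hz."""
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return power[(freqs >= lo) & (freqs <= hi)].sum()

# A toy "EEG" trace: a 10 Hz alpha component plus 50 Hz power-line noise
fs = 256
t = np.arange(0, 2, 1.0 / fs)
eeg = np.sin(2 * np.pi * 10 * t) + 0.8 * np.sin(2 * np.pi * 50 * t)
clean = fft_bandpass(eeg, fs)   # the power-line component is removed
```

The band powers of the filtered trace then serve as features for the classifiers of the next section.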
4 Methods for Improving EEG-Based Disease Classification
Various optimization techniques have been suggested, such as Particle Swarm Opti-
mization, Discrete Particle Swarm Optimization, and Fractional Order Discrete
Particle Swarm Optimization. Additionally, pixel values, noise levels, and boundary
positions can be used to study the region of interest [27].
4.1 Random Decision Forest (RDF)
The optimization techniques suggested here likewise include Particle Swarm Optimization, Discrete Particle Swarm Optimization, and Fractional Order Discrete Particle Swarm Optimization, and pixel values, noise levels, and boundary positions can also be studied to determine the area of interest [28–30].
4.1.1 RDF Algorithm Steps
The RDF algorithm is explained in the following steps:
Start.
Stage One: For each training set, randomly select a set of samples.
Stage Two: For each sampled training set, the algorithm constructs a decision tree.
Stage Three: Each decision tree casts a vote, and the votes are aggregated.
Stage Four: Choose the most-voted prediction result as the final prediction outcome [31].
End.
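The four stages can be sketched compactly in numpy; in this illustration, depth-one decision stumps stand in for the full decision trees of [31], and the data, forest size, and tree depth are illustrative assumptions, not the chapter's configuration.

```python
import numpy as np

def train_forest(X, y, n_trees=25, seed=0):
    rng = np.random.default_rng(seed)
    stumps = []
    for _ in range(n_trees):
        # Stage One: randomly select a bootstrap sample of the training set
        idx = rng.integers(0, len(X), len(X))
        Xb, yb = X[idx], y[idx]
        # Stage Two: fit one "tree" (here a depth-1 stump) to that sample
        best = None
        for f in range(X.shape[1]):
            for t in np.unique(Xb[:, f]):
                pred = (Xb[:, f] > t).astype(int)
                for flip in (False, True):
                    acc = ((1 - pred if flip else pred) == yb).mean()
                    if best is None or acc > best[0]:
                        best = (acc, f, t, flip)
        stumps.append(best[1:])
    return stumps

def predict_forest(stumps, X):
    # Stage Three: collect one vote per tree
    votes = np.zeros(len(X))
    for f, t, flip in stumps:
        pred = (X[:, f] > t).astype(int)
        votes += (1 - pred) if flip else pred
    # Stage Four: the most-voted class is the final prediction
    return (votes > len(stumps) / 2).astype(int)

# Toy two-class data: class 0 near the origin, class 1 near (4, 4)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(4, 1, (40, 2))])
y = np.array([0] * 40 + [1] * 40)
model = train_forest(X, y)
```

A production pipeline would use full decision trees (e.g., CART) in Stage Two; the bootstrap-then-vote structure is what defines the RDF.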
4.1.2 Kernel Perceptron Algorithm
The Kernel perceptron is based on the perceptron algorithm invented by Frank Rosenblatt. The calculation exploits information with large margins [32]. In comparison with Vapnik's SVM, this method is less complex to use and considerably quicker to compute. Besides using kernel functions in high-dimensional spaces, the calculation can also be applied in low-dimensional spaces [33, 34].
The following steps describe the Kernel perceptron procedure:
Create a random line (or create a random score for each word, and a random bias).
Perform the following process n times:
Select a random point.
Apply the perceptron update to the point and the line.
If the point is well classified, ignore it.
If the point is misclassified, move the line closer to it.
Finally, take advantage of the fitted line [30, 34–38].
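The update loop above can be kernelized by counting mistakes per training point; here is a minimal numpy sketch, assuming an RBF kernel and toy XOR-style data (the chapter does not specify a kernel).

```python
import numpy as np

def rbf(X, z, gamma=1.0):
    """RBF kernel between each row of X and a single point z."""
    return np.exp(-gamma * np.sum((X - z) ** 2, axis=-1))

def train_kernel_perceptron(X, y, epochs=50, gamma=2.0, seed=0):
    rng = np.random.default_rng(seed)
    alpha = np.zeros(len(X))                 # per-point mistake counts
    for _ in range(epochs):
        for i in rng.permutation(len(X)):    # select a random point
            score = np.sum(alpha * y * rbf(X, X[i], gamma))
            if np.sign(score) != y[i]:       # misclassified: move the boundary
                alpha[i] += 1                # (well-classified points are ignored)
    return alpha

def kp_predict(alpha, X, y, Xnew, gamma=2.0):
    return np.array([np.sign(np.sum(alpha * y * rbf(X, x, gamma))) for x in Xnew])

# XOR-style data that no single straight line can separate
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1., 1., 1., -1.])
alpha = train_kernel_perceptron(X, y)
```

Because the decision function depends only on kernel evaluations, the "line" lives in the kernel's feature space, which is why the method handles data a plain perceptron cannot.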
5 Findings and Discussion
Various analyses were carried out to obtain the experimental results. All of the raw EEG data were used in the experiment, split into approximately 70% training data and 30% test data. The following table shows the accuracy levels obtained on the test dataset (Table 1).
The training and test data show different results. The figure demonstrates a correlation between age and maximum EEG signal rate: the rate of EEG signals increases with age. As the EEG signal is sampled more frequently, the probability of capturing an 'event' increases, although complexity and computation time increase as well. The sampling rate is therefore crucial, especially for recordings longer than three to five minutes, which normal computers otherwise cannot handle. Signals are easier to analyze when the sampling rate is between 128 and 512 Hz.
Each classifier has a confusion matrix shown in the table. Since only 30% of the
Table 1 Test and training dataset accuracy results

Classifier                    Test accuracy (%)   Training accuracy (%)
RDF algorithm                 97.816              98.803
Kernel perceptron algorithm   95.313              95.503
training data was held out as the test dataset, the results of the confusion matrix were obtained from the training dataset. Based on the confusion-matrix results, RDF outperforms the other algorithm. In our calculations, we obtained a true-positive rate of 0.967 and a false-positive rate of 0.300; both cases include the ROC and PRC areas. The age–high-blood-pressure plot shows the relationship between hypertension and age: in our study, high blood pressure was associated with aging. High blood pressure normally occurs in people over the age of 40 or 50, so after reaching one's thirties, one should consult a physician regularly.
6 Conclusion
World health authorities consider neurological diseases to be among the most important health problems. The loss of brain function can be mitigated by using a scientifically based prediction approach, and a machine learning algorithm has been developed here to predict brain disease. RDF and Kernel perceptron classifiers were used in our study, with accuracies of 97.69% and 94.39%, respectively; the RDF classifier thus produced the best results among the classifiers we used. In the future, we will use a more accurate dataset based on technological and scientific advancements in another branch of medicine. Using more classifiers will also improve classification accuracy.
7 Work Constraints and Future Plans
A machine learning approach is capable of greatly assisting with the diagnosis and prognosis of brain disorders, but the community still needs to address key challenges before bringing these tools into clinical practice. Using interpretable models helps to overcome the limitations of black-box approaches concerning validation and reproducibility, and extensive generalization studies are still necessary. Brain diseases remain highly prevalent and are associated with a high death rate each year. Our paper uses machine learning algorithms to implement predictive approaches for neurological diseases, since such algorithms are critical for prediction.
References
1. Michel LV, Jacques M, Michel B, Francisco V (1999) Anticipating epileptic seizures in
real-time by a non-linear analysis of the similarity between EEG recordings. NeuroReport
10(10):2149–2155
2. Florian M, Klaus L, Peter D, Christian EE (2000) Mean phase coherence as a measure for
phase synchronization and its application to the EEG of epilepsy patients. Phys D Nonlinear
Phenom 144(3–4):358–369
3. Vincent N, Jacques M, Michel LV, Stephane CL, Claude A, Michel B, Francisco V (2002)
Seizure anticipation in human neocortical partial epilepsy. Brain 125(3):640–655
4. Wim DC, Philippe L, Sabine VH, Wim VP (2003) Anticipation of epileptic seizures from
standard EEG recordings. The Lancet 361(9361):971
5. Mary AF, Mark GF, Ivan O (2005) Accumulated energy revisited. Clin Neurophysiol
116(3):527–531
6. Mary AF, Ivan O, Mark GF, Srividhya A, Ying-Cheng L (2005) Correlation dimension and
integral do not predict epileptic seizures. Chaos: Interdisc J Nonlinear Sci 15(3):033106
7. Michel LQ, Jason S, Vincent N, Richard R, Mario C, Michel B, Jacques M (2005) Preictal
state identification by synchronization changes in long-term intracranial EEG recordings. Clin
Neurophysiol 116(3):559–568
8. Iasemidis LD, Shiau DS, Panos MP, Wanpracha C, Narayanan K, Awadhesh P, Tsakalis K,
Carney PR, Sackellares JC (2005) Long-term prospective online real-time seizure prediction.
Clin Neurophysiol 116(3):532–544
9. Chaovalitwongse W, Iasemidis LD, Pardalos PM, Carney PR, Shiau DS, Sackellares JC (2005)
Performance of a seizure warning algorithm based on the dynamics of intracranial EEG.
Epilepsy Res 64(3):93–113
10. Piotr M, Deepak M, Yann L, Ruben K (2009) Classification of patterns of EEG synchronization
for seizure prediction. Clin Neurophysiol 120(11):1927–1940
11. Salant Y, Gath I, Henriksen O (1998) Prediction of epileptic seizures from two-channel EEG.
Med Biol Eng Comput 36(5):549–556
12. Wim VD, Sujatha N, David MF, Michael HK, Vernon LT, Hyong CL, Arnetta BM, Maria SC,
Kurt EH (2003) Seizure anticipation in pediatric epilepsy: use of Kolmogorov entropy. Pediatr
Neurol 29(3):207–213
13. Klaus L, Brian L (2005) The first international collaborative workshop on seizure prediction:
summary and data description. Clin Neurophysiol 116(3):493–505
14. Florian M, Thomas K, Ralph GA, Peter D, Klaus L, Christian EE (2003) Epileptic seizures are
preceded by a decrease in synchronization. Epilepsy Res 53(3):173–185
15. Levin K, Philippa K, Dean R, Freestone BH, Andriy T, Alexandre B, Feng L, Gilberto T,
Brian WL, Daniel L et al (2018) Epilepsyecosystem.org: crowd-sourcing reproducible seizure
prediction with long-term human intracranial EEG. Brain 141(9):2619–2630
16. Rajendra A, Yuki H, Hojjat A (2018) Automated seizure prediction. Epilepsy Behav 88:251–
261
17. Yannic R, Hubert B, Isabela A, Alexandre G, Tiago HF, Jocelyn F (2019) Deep learning-based
electroencephalography analysis: a systematic review. J Neural Eng
18. Gen L, Chang HL, Jason JJ, Young C, David C. Deep learning for EEG data analytics: a survey.
Concurrency and computation: practice and experience, p e5199
19. Kuhlmann L, Lehnertz K, Richardson MP, Schelter B, Zaveri HP (2018) Seizure prediction—
ready for a new era. Nat Rev Neurol 1
20. Abd Ali DM, Chalob DF, Khudhair AB (2022) Networks data transfer classification based on
neural networks. Wasit J Comput Math Sci 1(4):207–225
21. James W, Eve AG (2006) Rapid review neuroscience e-book. Elsevier Health Sciences
22. Matthew DL (2000) Intuition: a social cognitive neuroscience approach. Psychol Bull
126(1):109
23. Terrence JS, Christof K, Patricia SC (1988) Computational neuroscience. Science
241(4871):1299–1306
24. Adeel R, Karl JF (2016) The connected brain: causality, models, and intrinsic dynamics. IEEE
Signal Process Mag 33(3):14–35
25. Sonya BD, Jaqueline AF, Christophe B, Gregory AW, Brandy EF (2017) Seizure forecasting
from idea to reality outcomes of my seizure gauge epilepsy innovation institute workshop.
Eneuro 4(6)
26. Viglione S, Walsh GO (1975) Proceedings: epileptic seizure prediction. Electroencephalogr
Clin Neurophysiol 39(4):435–436
27. Rogowski Z, Gath I, Bental E (1981) On the prediction of epileptic seizures. Biol Cybern
42(1):9–15
28. Heino HL, Jeffrey PL, Jerome E, Paul HC (1983) Temporo-spatial patterns of pre-ictal spike
activity in human temporal lobe epilepsy. Electroencephalogr Clin Neurophysiol 56(6):543–
555
29. Gotman J, Marciani MG (1985) Electroencephalographic spiking activity, drug levels, and
seizure occurrence in epileptic patients. Ann Neurol: Official J Am Neurol Assoc Child Neurol
Soc 17(6):597–603
30. Kostas MT, Vasileios CP, Michalis Z, Spiros K, Dimitrios DK, Dimitrios IF (2018) A long
short-term memory deep learning network for the prediction of epileptic seizures using EEG
signals. Comput Biol Med 99:24–37
31. Angela AB, Benno G, Maurizio S, Carlo AT, Niels B, Guido R (2008) Permutation entropy to
detect vigilance changes and preictal states from scalp EEG in epileptic patients. A preliminary
study. Neurol Sci 29(1):3–9
32. Haidar K, Lara M, Madeline F, Kalina S, Bulent Y (2017) Focal onset seizure prediction using
convolutional networks. IEEE Trans Biomed Eng 65(9):2109–2118
33. Xiaoli L, Gaoxian O, Douglas AR (2007) Predictability analysis of absence seizures with
permutation entropy. Epilepsy Res 77(1):70–74
34. Ramy H, Mohamed OA, Rabab W, Jane W, Levin K, Yi G (2019) Human intracranial EEG
quantitative analysis and automatic feature learning for epileptic seizure prediction. arXiv
preprint arXiv:1904.03603
35. Tom H (1999) Energy functions for self-organizing maps. In: Kohonen maps. Elsevier, pp
303–315
36. Nhan DT, Anh DN, Levin K, Mohammad RB, Jiawei Y, Omid K (2017) A generalized seizure
prediction with convolutional neural networks for intracranial and scalp electroencephalogram
data analysis. arXiv preprint arXiv:1707.01976
37. Butler K (2022) An Automation system over cloud by using internet of things applications: an
automation system over cloud by using internet of things applications. Wasit J Comput Math
Sci 1(4):27–33
38. Abdulbaqi A, Younis M, Younus Y, Obaid A (2022) A hybrid technique for EEG signals
evaluation and classification as a step towards to neurological and cerebral disorders diagnosis.
Int J Nonlinear Anal Appl 13(1):773–781. https://doi.org/10.22075/ijnaa.2022.5590
39. Matias IM, Christian M, Katrina D, Philippa JK, Wendyl D, David BG, Anthony NB, Premysl
J, Jan K, Jaroslav H et al (2019) Critical slowing as a biomarker for seizure susceptibility.
bioRxiv, p 689893
Watermarking System Using DWT
and SVD
Fatima M. Khudair, Asaad N. Hashim, and Mohammed Jameel Alsalhy
Abstract Information hiding has garnered significant attention from researchers
over the past two decades, due to its growing importance in securing visual applica-
tions. As a consequence, watermarking has become a focal point in numerous studies
for its ability to protect sensitive data from unauthorized access, copying, manipula-
tion, and infringement of copyrights or property rights. Watermarks can be applied
to a wide range of mediums, including texts, documents, books, audio, video, and
images. Various watermarking techniques exist, such as the discrete Fourier trans-
form (DFT), the discrete cosine transform (DCT), the discrete wavelet transform
(DWT), the singular value decomposition (SVD), deep learning, and other
methods, each of which has its own set of benefits and drawbacks. In this paper,
we propose a novel algorithm that combines the strengths of both SVD and DWT
approaches to enhance watermarking performance. This innovative watermarking
technique yields accurate results and has demonstrated exceptional performance
metrics, as evidenced by signal-to-noise ratio (SNR) and peak signal-to-noise ratio
(PSNR) measurements.
Keywords DWT · SVD · Hiding information · Watermarking
F. M. Khudair ·A. N. Hashim
Faculty of Computer Science and Mathematics, University of Kufa, Kufah, Iraq
e-mail: fatimam.alkaabi@student.uokufa.edu.iq
A. N. Hashim
e-mail: asaad.alshareefi@uokufa.edu.iq
M. J. Alsalhy (B)
National University of Science and Technology, Thi-Qar, Nasiriyah, Iraq
e-mail: Sahi@nust.edu.iq
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_21
273
274 F. M. Khudair et al.
1 Introduction
Information hiding (or data hiding) is a broad phrase that encompasses a variety of issues beyond the embedding of messages in material. The word "hiding" may refer both to keeping information imperceptible (watermarking) and to keeping it secret (steganography). Watermarking and steganography are two significant subfields of information concealment that are linked and may overlap but have distinct underlying characteristics, needs, and designs, resulting in distinctive technological solutions [1].
2 Related Work
Mahto et al. [2] provide a comprehensive overview of watermarking standards, measurements, and applications and summarize the main watermarking methods. Garg et al. [3] use DWT, SVD, entropy, and pixel-shift algorithms to obtain the strength and imperceptibility that give a watermark its security advantages. To achieve the security attribute, they performed the embedding at prearranged pixel locations so that an attacker cannot extract it by a simple linear scan. In addition, the PSNR and the NC were computed to evaluate the performance of the method, which attains a PSNR better than 40 dB. Based on experimental evidence, this system is resilient against a wide range of attacks, including filtering, noise, compression, and others. Garg et al. [4]: Fuzzy entropy, discrete cosine transforms, and
image scrambling in the hybrid domain are all explored as potential components of a
good, secure, and robust method. The results are very impressive, providing good PSNR and MSE values. They also applied fresh attacks to a watermarked image, which still gives a very good NC value. This shows that the proposed scheme provides imperceptibility and the required robustness [4]. Lakshmi [5] presents a new image steganography technique based on SVD and DWT. Compared
to existing techniques, the suggested methodology outperforms them. The quality of
stego pictures is evaluated using picture metrics like PSNR and SC [5]. Begum et al.
[6]: This paper gives an in-depth look at the typical architectures of watermarking
systems, discusses the latest developments in the field, and enumerates some of the
more common criteria that are taken into account when developing watermarking
methods for various uses, so that contemporary techniques and the constraints they face can be understood. Some common attacks are covered, and suggestions for further study are offered [6]. Mohanarathinam et al. [7] discuss watermarking methods together with their advantages and disadvantages [7]. Boenisch [8]: Models strongly marked with digital watermarks can be secured against attacks such as model theft. The survey is not only a guide for selecting the
Watermarking System Using DWT and SVD 275
proper approach to a particular situation, but may also be a starting point for devising
new processes that overcome limitations and therefore advance the subject. The literature surveyed there offers a comprehensive overview of watermarking methods and attacks [8].
Khadam et al. [9]: Copyright protection, authentication, and ownership verification
have all been suggested using digital watermarking technologies. To identify appro-
priate characteristics from the document to add the watermark, data mining methods
are used. The suggested technique is resilient and resistant to coordinated attacks. Twenty distinct text documents are utilized to assess the proposed technique [9]. Artru
et al. [10]: Digital watermarking is a broad element of information security because
of its uses, characteristics, and designs. The aim is to establish the ideal equilibrium
point between invisibility, resilience, and efficiency in an application. Finding the
equilibrium point is accomplished by locating the watermark characteristics required
and analyzing the threat model the scheme will encounter. Additional information
may apply to the video’s metadata, frames, or particular areas of the frame [10].
In their 2018 paper, Yuki Nagai et al. proposed using digital watermarking to
assert ownership permission for deep neural networks. Their approach involved
addressing the prerequisites, embedding circumstances, and attack types associated
with watermarking deep neural networks. They then presented a generic technique
for embedding a watermark in model parameters. In another 2018 paper, Yuqi He
and Yan Hu discussed a technique for watermarking color images. Their method
utilized discrete wavelet transform (DWT), discrete cosine transform (DCT), and
singular value decomposition (SVD) to transform the host color image from RGB to
YUV color space. They divided the low-frequency component of Y into blocks and
applied SVD using DCT to each block. Finally, they added a watermark to the cover
image. The experimental results showed that their technique was highly resilient and
invisible [11,12].
3 Suggested Methodologies
3.1 Discrete Wavelet Transform
A discrete wavelet transform is a wavelet transform in which the wavelets are discretely sampled. The wavelet transform has many advantages over the Fourier transform, the most important of which is that it gathers both frequency and position information simultaneously. The discrete wavelet transform divides an image into four non-overlapping multi-resolution sub-bands, denoted LL (approximation sub-band), LH (horizontal sub-band), HL (vertical sub-band), and HH (diagonal sub-band), where LH, HL, and HH hold the finest-scale wavelet coefficients and LL holds the coarse-level coefficients. The process can be repeated on the LL sub-band to obtain wavelet decompositions at different scales [13]. This tool is very useful for identifying areas inside the host image where
a watermark may be effectively applied. The utilization of the masking effect of the
human visual system is enabled by this feature. It only affects the region that corre-
sponds to the coefficient that has been altered when a DWT coefficient is changed.
The lower frequency sub-bands LL are often where the bulk of the image energy
is concentrated. As a consequence, adding watermarks in LL sub-bands may cause
significant degradation of the image quality. However, embedding in low-frequency
sub-bands significantly improves resilience in a significant way. When it comes to
the high-frequency sub-bands HH, which contain the image’s borders and textures,
the human eye is less sensitive to changes in these sub-bands than other sub-bands.
A watermark may be included in a picture without it being apparent to the human
eye as a result of using this technique. The process of embedding takes place in the
intermediate frequency bands in order to enhance the strength and imperceptibility
of the watermark and to make it more difficult to detect. LH and HL are two different
things [14].
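The one-level decomposition into LL, LH, HL, and HH can be illustrated with the Haar wavelet; this is a minimal numpy sketch (the chapter does not fix a particular mother wavelet), with the inverse transform included to show that the decomposition is lossless.

```python
import numpy as np

def haar_dwt2(img):
    """One-level 2-D Haar DWT -> LL, LH, HL, HH sub-bands (orthonormal)."""
    a = img[0::2, :] + img[1::2, :]        # row-pair sums (low-pass)
    d = img[0::2, :] - img[1::2, :]        # row-pair differences (high-pass)
    LL = (a[:, 0::2] + a[:, 1::2]) / 2     # approximation
    LH = (a[:, 0::2] - a[:, 1::2]) / 2     # horizontal detail
    HL = (d[:, 0::2] + d[:, 1::2]) / 2     # vertical detail
    HH = (d[:, 0::2] - d[:, 1::2]) / 2     # diagonal detail
    return LL, LH, HL, HH

def haar_idwt2(LL, LH, HL, HH):
    """Invert haar_dwt2 exactly."""
    h, w = LL.shape
    a = np.empty((h, 2 * w)); d = np.empty((h, 2 * w))
    a[:, 0::2], a[:, 1::2] = LL + LH, LL - LH
    d[:, 0::2], d[:, 1::2] = HL + HH, HL - HH
    img = np.empty((2 * h, 2 * w))
    img[0::2, :], img[1::2, :] = (a + d) / 2, (a - d) / 2
    return img
```

For a smooth image most of the energy lands in LL, which is why embedding there degrades quality, while LH and HL carry the mid-frequency content that the text recommends for embedding.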
3.2 Singular Value Decomposition (SVD)
Picture decomposition at the level of a single value is a special case of generic
linear algebra [15], where the picture is represented by a special diagonal matrix
whose energy is focused in a small number of values. Since SVD is robust and
efficient at decomposing the structure into a set of linearly independent constituents,
each of which contains its own energy presentation, and since it compresses the
maximum amount of signal energy into the fewest possible coefficients, it is one of
the most potent linear algebra analysis techniques. SVD has proven to be a useful
tool for many image processing tasks, including encoding, signal enhancement, and
filtering. SVD is widely used in watermarking due to its ability to successfully conceal
the watermark whenever there is a significant change in a single value. If M is a
real matrix, then SVD can decompose it into a product of three additional matrices
[16]. The SVD is used to obtain the singular-value coefficients and is quite robust. The core idea behind singular value decomposition (SVD) is to condense a high-dimensional, variable data set into a two-dimensional space [17]. Any matrix A in R^(m*n) may be decomposed into A = U S V^T, where U is an m*n matrix with orthonormal columns, S is an n*n diagonal matrix with non-negative entries, and V^T is an n*n orthonormal matrix [18] (Fig. 1).
Fig. 1 Graphical presentation of SVD [19]
A = U S V^T = [u_1, u_2, ..., u_N] diag(λ_1, λ_2, ..., λ_N) [v_1, v_2, ..., v_N]^T = Σ_{i=1}^{r} λ_i u_i v_i^T,

where r denotes A's rank (r ≤ N); U and V are N×N orthogonal matrices whose column vectors, u_i and v_i, denote A's left and right singular vectors, respectively [20].
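The decomposition and its rank-one expansion can be checked directly with numpy (an illustrative matrix; note that `np.linalg.svd` returns V^T rather than V):

```python
import numpy as np

A = np.array([[3., 1., 1.],
              [-1., 3., 1.]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U @ diag(s) @ Vt
# Rebuild A as the sum of rank-one terms  lambda_i * u_i * v_i^T
A_rec = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))
```

The singular values come back non-negative and sorted in decreasing order, which is the property the embedding algorithms below rely on.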
4 The Proposed System
Using the proposed method, a watermark is made with the DWT method followed by SVD. Several methods of watermarking have been discussed for this system, and we have chosen the one with the best results. We summarize this system in three approaches: the first uses SVD alone, the second uses DWT alone, and the third combines SVD and DWT, which gives excellent and elegant results. Preprocessing includes several steps: converting the image to grayscale, converting it to double precision, and resizing it to suit the algorithms.
4.1 Singular Value Decomposition
An algorithm will be explained here, which is one of the important algorithms in
image processing (Fig. 2).
Fig. 2 Embed watermark (SVD)
Algorithm (1): Embed watermark (SVD).
Input: cover image (g), watermark image (O)
Output: watermarked image (H)
Begin:
Step 1. Read the cover image (g).
Step 2. Convert the image to grayscale (g).
Step 3. Convert the image to double precision (g).
Step 4. Resize the image to suit the SVD algorithm (i).
Step 5. Apply SVD to the cover image (i): Ui * Si * VTi.
Step 6. Read the watermark image (O).
Step 7. Convert the image to grayscale (O).
Step 8. Convert the image to double precision (O).
Step 9. Resize the image to suit the SVD algorithm (O2).
Step 10. Apply SVD to the watermark image (O2): UO * SO * VTO.
Step 11. Change Si, the singular values of the cover image (i), by adding the singular values of the watermark image (O2) scaled by the factor α: K = Si + α * SO.
Step 12. Rebuild using SVD, replacing Si with K from Step 11: H = Ui * K * VTi.
Step 13. Display the watermarked image.
End.
Algorithm (2): Watermark extracting (SVD).
Input: watermarked image (H)
Output: extracted watermark (recovered image)
Begin:
Step 1. Apply SVD to the watermarked image: H = U1 * S1 * VT1.
Step 2. Extract the watermark's singular values: SW* = (S1 − Si) / α.
Step 3. Using the left and right singular vectors (UO and VTO) of the watermark from the embedding algorithm, reconstruct the recovered watermark: W* = UO * SW* * VTO.
Step 4. Display the watermark image.
End
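Algorithms (1) and (2) can be sketched together as a numpy round trip; random arrays stand in for the grayscale cover and watermark images, and, as in Algorithm (2), the cover's singular values Si and the watermark's singular vectors are assumed available at extraction time (a semi-blind scheme).

```python
import numpy as np

def embed_svd(cover, watermark, alpha=0.002):
    Uc, Sc, Vtc = np.linalg.svd(cover)
    Uw, Sw, Vtw = np.linalg.svd(watermark)
    K = Sc + alpha * Sw                  # Step 11: perturb the singular values
    H = Uc @ np.diag(K) @ Vtc            # Step 12: rebuild the watermarked image
    return H, (Sc, Uw, Vtw)              # side information kept for extraction

def extract_svd(H, side_info, alpha=0.002):
    Sc, Uw, Vtw = side_info
    S1 = np.linalg.svd(H, compute_uv=False)
    Sw_rec = (S1 - Sc) / alpha           # Step 2: invert the embedding
    return Uw @ np.diag(Sw_rec) @ Vtw    # Step 3: rebuild the watermark

rng = np.random.default_rng(0)
cover = rng.uniform(0, 255, (16, 16))
mark = rng.uniform(0, 255, (16, 16))
H, side = embed_svd(cover, mark)
recovered = extract_svd(H, side)
```

Because alpha is small, H differs only slightly from the cover, while the watermark is recovered almost exactly from the perturbed singular values.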
Fig. 3 Watermarking based on SVD
4.2 Discrete Wavelet Transform
This algorithm decomposes the image into high- and low-frequency parts, divides the low-frequency part into four sub-bands, and then takes the low-low (LL) sub-band.
The hybrid technique comprises two algorithms, DWT first and then SVD, applied to both the watermark and the cover image.
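Under the assumption of a Haar LL sub-band (the chapter does not name the wavelet), the hybrid embedding can be sketched as SVD embedding applied inside the LL band:

```python
import numpy as np

def haar_ll(img):
    """LL sub-band of a one-level Haar DWT (2x2 block sums, halved)."""
    return (img[0::2, 0::2] + img[0::2, 1::2]
            + img[1::2, 0::2] + img[1::2, 1::2]) / 2

def hybrid_embed_ll(cover, watermark, alpha=0.001):
    """DWT first, then SVD: embed the watermark's singular values in LL."""
    LL = haar_ll(cover)
    Ul, Sl, Vtl = np.linalg.svd(LL)
    Sw = np.linalg.svd(watermark, compute_uv=False)
    K = Sl + alpha * Sw
    return LL, Ul @ np.diag(K) @ Vtl     # original and watermarked LL band

rng = np.random.default_rng(0)
cover = rng.uniform(0, 255, (16, 16))
mark = rng.uniform(0, 255, (8, 8))      # watermark matches the LL band size
LL, LLw = hybrid_embed_ll(cover, mark)
```

A full implementation would run the inverse DWT on the modified LL band to reassemble the watermarked image; the sketch stops at the sub-band to keep the embedding step visible.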
5 Experimental Results
5.1 SVD
Here, the algorithm is implemented with an image size of 2048*2048 and alpha = 0.002 to obtain better results. The factors that affect the implementation of the algorithm are the sizes of the images (watermark and cover) and alpha; reducing the alpha value increases the accuracy of the results (Figs. 3, 4 and 5; Table 1).
5.2 DWT
DWT yields some of the best results, with the best performance among the algorithms, even under attacks. Performance was measured for different image sizes, different alpha values, and different images, i.e., different watermark and cover images (Fig. 6; Table 2).
5.3 Hybrid Technology (DWT and SVD)
This technique merges DWT and SVD. The watermark is concealed and does not appear in any instance, which distinguishes this case regardless of the performance measurements, since the invisibility is extremely strong: SVD provides very good masking and outcomes. This technique is robust, with no distortion, and is the suggested method for this system. The alpha has been applied to multiple pictures (watermark and cover) of various sizes (Fig. 7; Tables 3 and 4).
The two individual approaches are compared with the hybrid method based on the algorithm's execution time.
Fig. 5 Watermark extracting
(DWT and SVD)
Table 1 Watermarking based on SVD

Cases  Host image   Watermark image  Alpha   Size of image  PSNR     SNR
1      Pepper.jpg   KCOM.PNG         0.0020  2048*2048      63.6600  0.0080
2      Baboon.jpg   KCOM.PNG         0.0020  2048*2048      63.6600  0.0099
3      Baboon.jpg   KCOM.PNG         0.0020  512*512        63.6600  0.0099
4      Baboon.jpg   KCOM.PNG         0.0020  64*64          63.9236  0.0099
5      Lena.jpg     1.jpeg           0.0020  2048*2048      57.7488  0.0214
6 Conclusions
Multiple scales have been utilized for watermarking, but the most effective scale is 512*512. The value of alpha is inversely related to SNR and PSNR: if alpha is low, SNR and PSNR are high, whereas if alpha is high,
Fig. 6 Watermarking based on DWT
Table 2 Watermarking based on DWT

Cases  Host image  Watermark image  Alpha   Size of image  PSNR      SNR     Time
1      Pepper.jpg  1.jpeg           0.0020  2048*2048      05.8807   0.0194  4.5473
2      Baboon.jpg  1.jpeg           0.09    2048*2048      72.8164   0.8392  4.6507
3      Baboon.jpg  KCOM.PNG         0.09    512*512        78.7640   0.1842  0.5363
4      Pepper.jpg  IEEE.jpg         0.09    512*512        73.8961   0.5119  0.4558
5      Baboon.jpg  KCOM.PNG         0.001   512*512        117.8488  0.0019  0.8435
Fig. 7 Watermarking based on DWT and SVD
Table 3 Watermarking based on DWT and SVD

Cases  Host image    Watermark image  Alpha   Size of image  PSNR     SNR
1      Lena.jpg      Com.jpg          0.0001  512*512        96.5800  2.3845
2      Pepper.jpg    Com.jpg          0.0001  512*512        96.5800  1.8257
3      Baboon.jpg    Com.jpg          0.0001  512*512        96.5800  2.254
4      1509237.jpeg  Com.jpg          0.0001  512*512        96.5800  2.2500
5      Baboon.jpg    Com.jpg          0.001   2048*2048      76.3245  0.0023
Table 4 Execution time of watermarking based on SVD, DWT, and DWT and SVD

Case  Image cover  Watermarking image  Alpha   Size       Time execution  Method
1     Pepper       KCOM                0.0020  2048*2048  6.2488          DWT
2     Pepper       KCOM                0.0020  2048*2048  24.5778         SVD
3     Pepper       KCOM                0.0020  2048*2048  13.1362         DWT and SVD
4     Pepper       KCOM                0.0020  512*512    0.5089          DWT
5     Pepper       KCOM                0.0020  512*512    0.5243          SVD
6     Pepper       KCOM                0.0020  512*512    0.9548          DWT and SVD
7     Pepper       KCOM                0.0020  48*48      0.2776          DWT
9     Pepper       KCOM                0.0020  48*48      0.2418          SVD
10    Pepper       KCOM                0.0020  48*48      0.2709          DWT and SVD
both SNR and PSNR decrease. Additionally, the order of transformation, whether DWT then SVD or SVD then DWT, impacts the result. When DWT is applied first, followed by SVD, a clear watermark is achieved, whereas applying SVD first and then DWT produces a noisy watermark that is not satisfactory.
References
1. Yusof Y, Khalifa OO (2007) Digital watermarking for digital images using wavelet transform.
In: Proceedings of the 2007 IEEE international conference on telecommunications and Malaysia
international conference on communications (ICT-MICC), pp 665–669. https://doi.org/10.1109/ICTMICC.2007.4448569
2. Mahto DK, Singh AK (2021) A survey of color image watermarking: state-of-the-art and
research directions. Comput Electr Eng 93:107255. https://doi.org/10.1016/j.compeleceng.
2021.107255
3. Garg P, Rama RK (2020) Secured and multi optimized image watermarking using SVD and
entropy and prearranged embedding locations in transform domain. J Discret Math Sci Cryptogr
23(1):73–82. https://doi.org/10.1080/09720529.2020.1721875
4. Garg P, Rama KR (2020) An improved and secured digital image watermarking technique
using DCT, fuzzy entropy and image scrambling in hybrid domain. J Discret Math Sci Cryptogr
23(1):177–186. https://doi.org/10.1080/09720529.2020.1721882
5. Lakshmi BS (2020) Image steganography based on SVD and DWT techniques. J Discret Math
Sci Cryptogr 23(3):779–786. https://doi.org/10.1080/09720529.2019.1698801
6. Begum M, Uddin MS (2020) Digital image watermarking techniques: a review. Information
(Switzerland) 11(2). https://doi.org/10.3390/info11020110
7. Mohanarathinam A, Kamalraj S, Prasanna VG, Ravi RV, Manikandababu CS (2020) Digital
watermarking techniques for image security: a review. J Ambient Intell Humaniz Comput
11(8):3221–3229. https://doi.org/10.1007/s12652-019-01500-1
8. Boenisch F (2020) A survey on model watermarking neural networks. Available: http://arxiv.
org/abs/2009.12153
9. Khadam U, Iqbal MM, Azam MA, Khalid S, Rho S, Chilamkurti N (2019) Digital watermarking
technique for text document protection using data mining analysis. IEEE Access 7:64955–
64965. https://doi.org/10.1109/ACCESS.2019.2916674
10. Artru R, Gouaillard A, Ebrahimi T (2019) Digital watermarking of video streams: review of
the state-of-the-art. Available: http://arxiv.org/abs/1908.02039
11. Sulong GB, Wimmer MA (2023) Image hiding by using spatial domain steganography. Wasit
J Comp Math Sci 2(1):39–45
12. Al-asadi TA, Obaid AJ (2016) Object-based image retrieval using enhanced SURF. Asian J
Inform Technol 15:2756–2762. https://doi.org/10.36478/ajit.2016.2756.2762
13. He Y, Hu Y (2018) A proposed digital image watermarking based on DWT-DCT-SVD. In:
Proceedings of the 2nd IEEE advanced information management, communicates, electronic
and automation control conference (IMCEC), pp 1214–1218. https://doi.org/10.1109/IMCEC.
2018.8469626
14. Joseph A, Anusudha K (2013) Robust watermarking based on DWT SVD. Int J Signal Image
Proc 1
15. Srivastava A (2013) DWT-DCT-SVD based semi blind image watermarking using middle
frequency band. IOSR J Comput Eng 12(2):63–66. https://doi.org/10.9790/0661-1226366
16. Abdulazeez AM, Hajy DM, Zeebaree DQ, Zebari DA (2020) Robust watermarking scheme
based LWT and SVD using artificial bee colony optimization. Indones J Electr Eng Comput
Sci 21(2):1218–1229. https://doi.org/10.11591/ijeecs.v21.i2.pp1218-1229
17. Patel P, Patel Y (2015) Secure and authentic DCT image steganography through DWT-SVD
based digital watermarking with RSA encryption. In: Proceedings of the 2015 5th international
conference and communication systems and network, technologies, CSNT,pp 736–739. https://
doi.org/10.1109/CSNT.2015.193
18. Mohamad AC, Abdul-Hameed M (2014) Image encryption based on singular value decompo-
sition. J Comput Sci 10(7):1222–1230. https://doi.org/10.3844/jcssp.2014.1222.1230
19. Zhang G, Zou W, Zhang X, Hu X, Zhao Y (2017) Singular value decomposition based sample
diversity and adaptive weighted fusion for face recognition. Digit Signal Proc A Rev J 62:150–
156. https://doi.org/10.1016/j.dsp.2016.11.004
20. Shieh JM, Lou DC, Chang MC (2006) A semi-blind digital watermarking scheme based on
singular value decomposition. Comput Stand Interf 28(4):428–440. https://doi.org/10.1016/j.
csi.2005.03.006
21. Nagai Y, Uchida Y, Sakazawa S, Satoh S (2018) Digital watermarking for deep neural networks.
Int J Multimed Inf Retr 7(1):3–16. https://doi.org/10.1007/s13735-018-0147-1
Safeguarding IoT: Harnessing Practical
Byzantine Fault Tolerance for Robust
Security
Nadiya Zafar, Ashish Khanna, Shaily Jain, Zeeshan Ali, and Jameel Ahamed
Abstract With the emergence of the Internet of Things (IoT), massive amounts of data are
produced, processed, propagated, and stored every day. IoT devices are built only to fulfill
their intended function with very limited resources; as a result, their security and privacy
are not prioritized. Implementing any solution to the privacy and security issues of IoT
devices is a challenging and crucial job under such resource constraints. However, with the
development of blockchain technology, incorporating security methods into IoT systems is no
longer an unattainable goal. We conducted multiple experiments in this research to determine
that Practical Byzantine Fault Tolerance (pBFT) is the most suitable technique for protecting
IoT systems. The blockchain concept is used with pBFT in the same way that Zilliqa and
Hyperledger are used for IoT security. By identifying and preventing security breaches with
this algorithm, data integrity and authenticity are maintained.
Keywords pBFT · Data security · Heterogeneous data · Consensus algorithm · Device certification
N. Zafar ·J. Ahamed (B)
Department of CS&IT, Maulana Azad National Urdu University, Hyderabad, India
e-mail: jameel.shaad@gmail.com
A. Khanna
Department of CSE, Maharaja Agrasen Institute of Technology, New Delhi, India
e-mail: ashishkhanna@mait.ac.in
S. Jain
Faculty of Computing, Engineering and Science, University of South Wales, South Wales, UK
e-mail: shally.jain@southwales.ac.uk
Z. Ali
University of Glasgow, Glasgow, UK
e-mail: ali.zeeshan@glasgow.ac.uk
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_22
287
288 N. Zafar et al.
1 Introduction
IoT gradually evolved from the combination of wireless technologies, microelectromechanical
systems (MEMS), microservices, and the Internet. It emerged from machine-to-machine (M2M)
communication and takes M2M one step further. The Internet of Things (IoT) is a sensor
network of billions of smart gadgets that connects humans, systems, and other applications
to gather and share data. Further, new emerging technologies have a great impact on the
world, with such a plethora of intelligent objects around us making our lives easier and
more comfortable [1]. According to a Cisco networking survey, there are more smart devices
than people in the world today. A huge share of the population is now connected to the
Internet through smart devices at all times, for instance to monitor physical activity and
health. Some surveys predict that 50–75 billion devices will be connected to the Internet
[2]. In the IoT, information is exchanged and communicated through various sensing devices
over a network, by agreeing on common protocols. The main aim of the IoT is to intelligently
identify, track, monitor, and manage things across its application areas [3]. In simple
terms, the IoT connects devices with limited abilities over the Internet; the "things" in
the IoT are devices that can sense, monitor, and actuate [4]. This unique connection of real
devices has greatly accelerated data collection, summarization, and sharing with other
devices, giving birth to IoT applications in a variety of new domains such as healthcare,
smart homes, and industry [5]. However, most such devices and applications are not designed
to survive cyber-attacks, which raises a slew of security and privacy concerns in IoT
networks, including confidentiality, authentication, data integrity, access control, and
secrecy, all of which increase vulnerability to cyber theft and breaches. Anonymity,
privacy, trust, and liability are other important security requirements [6]. As the IoT
connects billions of devices to the Internet, it involves a huge number of data points
(nodes), all of which require security. Because IoT devices are closely connected, an
intruder who exploits one vulnerability can manipulate all of the data. Companies
manufacturing these IoT devices can become victims of a data breach, since any smart device
goes through three life stages: manufacturing, installation, and operation [7]. A security
flaw at any of these stages can cause major privacy concerns for users. According to one
assessment, seventy percent of IoT objects are easy to hack, and attackers and intruders can
target IoT devices at any time. As a result, an effective mechanism is critical for
protecting Internet-connected devices from hackers and intruders [8].
Further, the flow of information must be secure in terms of integrity, confidentiality,
non-repudiation, and authentication, so we need a mechanism to protect IoT communication
protocols from the threat of attack. Because of the dynamism, scalability, heterogeneity,
and limited resources of IoT devices, designing and implementing a system that meets all
security requirements is very challenging. As a result, a secure system that is compatible
with such a restricted environment is necessary. The decentralized nature of blockchain
technology is well suited to IoT
systems, but most consensus algorithms require substantial computational energy. In this
respect, the pBFT algorithm is distinctive: it does not require much computational power and
takes less time to reach consensus. It is a consensus algorithm that reaches agreement even
when some faulty nodes are present in the system [9]. It also provides authenticity through
consensus and integrity by keeping the system alive [10]. pBFT gives priority to nodes with
high reliability for intrusion detection and identifies all the nodes present in the network
that are available at the end of detection [11]. Better efficiency, transactional finality,
and low reward variance are among the advantages of pBFT.
2 Related Work
Within a network where smart objects communicate and exchange data, if any of them fails or
is attacked, the whole system is jeopardized [6]. Here are some major security concerns:
1. Data Integrity: The data must remain accurate during its transmission between nodes. For
instance, it would be a severe problem if an eavesdropper altered the data and ordered
production to halt in a manufacturing organization [12].
2. Data Confidentiality: Data should remain private between the communicating nodes; except
for the sender and receiver, no one else should have access to it. For instance, if
infrastructure data is compromised, roads and bridges may be destroyed and security
jeopardized.
3. Data Authenticity: Authentication ensures that the data received is genuine and
trustworthy. For example, a patient's parameters are transmitted to various medical centers;
if an eavesdropper alters this data, the patient's treatment may be jeopardized [13].
4. Data Availability: Data should be available to its intended user. It is a major problem
if that user cannot reach the data [6].
Besides the issues mentioned above, there are further challenges in handling IoT systems, as
discussed below.
1. Scalability: Innumerable connected IoT devices overburden the management of the data
access system. As a result, access control approaches should be scalable in terms of size,
structure, and number of devices [14].
2. Heterogeneity: The Internet of Things connects objects with various fundamental
capabilities and applications. As a result, access control mechanisms are expected to
facilitate interoperability between disparate objects [15].
3. Restricted Resources: IoT devices mostly function without a screen or any user interface,
depend on battery power, and commonly perform only one task [16]. Because of the small size
of IoT devices, their computational and storage resources are constrained. As a result, an
IoT access control model
should be efficient and impose minimal overhead on devices and communication networks, since
such devices are designed and deployed with limited computing and networking capability [17].
Moreover, many kinds of devices communicate over several networks to deliver IoT services,
which opens further security issues for user privacy and at the network layer. Other IoT
security concerns are therefore end-to-end data life-cycle protection and visible security
and privacy, and it is necessary to choose security and privacy strategies that can be
applied automatically [6]. Although technical issues have recently been resolved by
extending and applying wireless communication technologies, the IoT model still has to deal
with hurdles in securing IoT devices in constrained environments [18]. Current Internet
security protocols depend on a set of popular and trusted cryptographic algorithms: the
Advanced Encryption Standard (AES) block cipher for confidentiality; the
Rivest-Shamir-Adleman (RSA) asymmetric algorithm for digital signatures and key transport;
the Diffie-Hellman (DH) asymmetric key agreement algorithm; and the SHA-1 and SHA-256 secure
hash algorithms [19]. This suite is supplemented by a set of emerging asymmetric algorithms
known as Elliptic Curve Cryptography (ECC) [20]. Because resource-constrained IoT devices
lack computational power, general public-key cryptosystems such as RSA are ineffective: they
are slow and consume more power. Elliptic Curve Cryptography (ECC), on the other hand, is
lightweight and has proven to be a suitable candidate for IoT networks [21]. In the IoT,
timestamps can protect data and serve as evidence that the data are genuine, as they can be
traced back to a particular time, ensuring that the information has not been tampered with
[22]. It is also very difficult to implement programming on IoT devices [23].
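The timestamp idea above can be sketched concretely: the sender binds each sensor reading to
its creation time with a keyed digest, and the receiver recomputes the digest to confirm that
neither the data nor the timestamp was tampered with. This is an illustrative stdlib sketch
under our own assumptions (the function names and the pre-shared device key are not from the
paper):

```python
import hashlib
import hmac
import json
import time

SHARED_KEY = b"demo-device-key"  # assumption: a key provisioned to the device

def stamp(payload: dict, ts: float) -> dict:
    """Sender side: bind a sensor payload to its timestamp with a keyed digest."""
    body = json.dumps({"payload": payload, "ts": ts}, sort_keys=True).encode()
    return {"payload": payload, "ts": ts,
            "tag": hmac.new(SHARED_KEY, body, hashlib.sha256).hexdigest()}

def verify(record: dict) -> bool:
    """Receiver side: recompute the tag; any change to data or time is detected."""
    body = json.dumps({"payload": record["payload"], "ts": record["ts"]},
                      sort_keys=True).encode()
    expected = hmac.new(SHARED_KEY, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, record["tag"])

record = stamp({"sensor": "temp-01", "value": 21.4}, ts=time.time())
assert verify(record)              # untouched record passes
record["payload"]["value"] = 99.9
assert not verify(record)          # tampered reading is rejected
```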
As the IoT grows rapidly, its security issues are attracting attention, and researchers have
begun to see blockchain as a new option for securing it [22]. With the rapid growth of the
mobile Internet financial era, combining the Internet of Things with blockchain technology
seems the most obvious option. The extensive use of blockchain application technology in the
global IoT field is going to play a progressively significant role in the future.
The ideology of blockchain is built upon a distributed security network. Its mechanism
offers strong data protection as well as protection from tampering [24]. Beyond its use in
Bitcoin, the blockchain data structure can be used for storage in general: like
transactions, any other data payload can form the chain of blocks [25]. The blockchain's
characteristics, including forgery resistance, data encryption, and decentralization, allow
it to execute and store confidential information, prevent data loss, and ensure the security
of IoT applications at various stages [23, 24]. Because of properties such as immutability
and irreversibility, blockchain is among the most effective data security and privacy
technologies available [26].
The blockchain's decentralized, trustworthy, and autonomous nature can significantly improve
the security and privacy of ever-expanding IoT networks. Because IoT devices are the contact
points with the physical world, combining blockchain and IoT will allow the development of
new applications as well as the transformation of existing systems [21].
Blockchain technology is preferred when information security and confidentiality are the
network's top priorities. Implementing blockchain in the IoT allows for more efficient
access control. The most vital feature affecting IoT blockchain throughput is the consensus
mechanism [27]. The consensus algorithm is at the heart of blockchain technology because it
ensures the network's integrity and security. It is a protocol that allows blockchain
network nodes to reach a common agreement on the current state of the ledger's records.
Different blockchain platforms use different algorithms to reach consensus, and they all
operate and execute differently [28].
At present, however, no blockchain and consensus protocol can simultaneously meet both the
security and scalability requirements [29]. Most organizations still lack tools for tracking
active keys, and many firms find implementing encryption complicated and challenging because
of unclear ownership and a shortage of experts [30].
However, to apply blockchain in an IoT environment, some challenges must be met.
Latency: In permissionless blockchain frameworks it takes between 1 and 10 minutes to reach
consensus; in a permissioned blockchain this contracts to milliseconds.
Applicability: Generally, many different kinds of devices are connected within an IoT
system, so it is very difficult to choose a blockchain framework that will be supported by
all devices [31].
A blockchain architecture is therefore required that allows unified and scalable movement of
data from the IoT device to the consensus protocol [29].
Blockchain technology, in conjunction with the IoT, cloud computing, big data, and machine
learning, can provide a comprehensive solution to these problems [12]. Smart contracts, for
their part, have the ability to supplement existing technical methods for resolving security
challenges, whereas blockchain integration can conflict with its distinguishing
characteristics such as immutability, traceability, and authenticity [32]. Smart contracts,
in turn, offer adaptable features such as their customizable nature, their similarity to
widely used scripting languages, and the Turing-completeness of their scripting language.
The majority of research indicates that applying smart contracts on the present
substructure strengthens the security solutions provided to IoT environments [33]. For the
proper function and integration of various IoT devices, a huge distributed system is needed
for storing and transmitting data [34]. Because of the ever-increasing number of IoT
devices, data vulnerability is a constant risk. Existing centralized IoT ecosystems have
raised security, privacy, and data-use concerns. A decentralized ID and access management
(DIAM) system for IoT devices is the best solution to these concerns, and Hyperledger is the
best technology for such a system [35]. Fault-tolerant consensus protocols play a vital role
in establishing the trustworthiness of a system despite the chances of node failures [36]. A
comparison of consensus algorithms is shown in Table 1 [37].
A milestone paper by Lamport et al. first presented the idea of Byzantine failure. They
illustrated it through the case of the Byzantine generals, whose troops besiege a rival's
castle. Upon seeing the enemy, the generals communicate with
Table 1 Comparison between consensus algorithms

Properties                      PoW       PoS       pBFT
Integrity management of nodes   Open      Open      Permissioned
Saves energy                    No        Partial   Yes
Fault tolerance                 <51%      <51%      <33.3%
Blockchain                      Private   Private   Public
Table 2 A comparison of BFT and pBFT

BFT                                            pBFT
Consensus algorithm                            Consensus algorithm
A group of nodes finds consensus; some nodes   Generates consensus in a malicious
may be malicious                               environment
Less effective in an adversarial environment   More effective in an adversarial environment
each other and agree on a plan of action (consensus): either to attack or to retreat. If
they all attack together, they succeed; if none of them attacks, they survive for another
day; but if only some of the generals attack, the generals will not survive. They
communicate through messages, and the challenge is that one or more of the generals may
deceive the others, passing on erratic messages to stop the faithful generals from reaching
consensus [38]. Most consensus algorithms require two phases, one for the request and the
other for the reply, while pBFT requires three phases of massive message communication [36].
pBFT is the most popular algorithm providing tolerance under malicious attack, and its
comparison with BFT is depicted in Table 2 [39, 40].
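The "three phases" of pBFT are, in the standard protocol, pre-prepare, prepare, and commit,
and the latter two involve every replica multicasting to all others, which is what makes the
communication heavy. A rough per-round message count, assuming n = 3f + 1 replicas (our own
sketch, not a figure from the paper):

```python
def pbft_message_count(n: int) -> dict:
    """Rough message counts for one pBFT round with n replicas
    (1 primary + n - 1 backups), following the standard three phases."""
    return {
        "request":     1,                   # client -> primary
        "pre-prepare": n - 1,               # primary -> each backup
        "prepare":     (n - 1) * (n - 1),   # each backup -> all other replicas
        "commit":      n * (n - 1),         # every replica -> every other replica
        "reply":       n,                   # each replica -> client
    }

# Smallest network tolerating one faulty node: n = 3f + 1 = 4
print(pbft_message_count(4))
```

The quadratic prepare and commit phases are why pBFT scales poorly to very large networks,
which matches the longer consensus times reported later for 100 nodes.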
Through the literature survey, we observed that in prior work on the security of IoT
devices, various consensus algorithms have been implemented for data privacy in its various
aspects: data confidentiality, authenticity, integrity, availability, and so on. However,
all have limitations due to the heterogeneous architecture of the IoT, and device
certification has received little attention. Hence, implementing the Practical Byzantine
Fault Tolerance algorithm to improve the integrity and authenticity of data during its
propagation from one node to another over an IoT system is a much-needed initiative.
3 Proposed System
The proposed methodology follows the Practical Byzantine Fault Tolerance (pBFT) algorithm, a
consensus algorithm for the secure propagation of data introduced by Miguel Castro and
Barbara Liskov. To check whether nodes are reliable, the protocol uses timestamps from IoT
devices [27]. pBFT is an advancement of the Byzantine Fault Tolerance (BFT) algorithm,
yielding more efficient results than BFT in distributed systems. The number of malicious
nodes must be less
than one-third of the total nodes for a Byzantine fault-tolerant system to work [41].
Requests from all clients must reach the nodes, and no concurrency issue arises. If the
leader node fails, another leader is immediately selected [42]. The system becomes more
secure as the number of nodes grows; the execution phases are shown in Fig. 1.
The requirements for the setup are (a) the number of nodes and their increments and (b)
preventing the failure of any node from affecting the system. The four stages of the pBFT
consensus cycle are: (i) the client issues a request to the primary node; (ii) the primary
node broadcasts the request to all subsidiary nodes; (iii) the nodes carry out the requested
service and deliver a response to the client; and (iv) the request is considered fully
fulfilled when the client receives f + 1 matching responses from different network nodes,
where f is the number of faulty nodes the system tolerates.
Figure 2 shows the diagram of the pBFT algorithm, followed by the algorithm and the block
diagram of the system in Fig. 3.
Fig. 1 Working phases of pBFT [9]
Fig. 2 Pictorial representation of the algorithm (client, leader node, and secondary/backup nodes)
Fig. 3 Block diagram of proposed system
Algorithm
while client sends request to leader node
    do leader node broadcasts it to all secondary nodes
    if n > 2/3 authentic
        then agree
    else if Q <= N - f
        then live
    else if Q > N/2
        then safe
    else malicious
    if leader node is malicious
        then change the leader node
    where f + 1 replies should be received from the secondary nodes
end if
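The branch structure of the algorithm above can be rendered as runnable Python almost line
for line; the variable names and the simple round-robin rule for replacing a malicious
leader are our own assumptions, not details given in the paper:

```python
def classify_round(authentic: int, N: int, Q: int, f: int) -> str:
    """Decision rules of the algorithm.
    authentic: nodes that validated the request; N: total nodes;
    Q: quorum size reached; f: number of faulty nodes tolerated."""
    if authentic > (2 * N) / 3:   # more than two-thirds of nodes agree
        return "agree"
    if Q <= N - f:                # liveness condition holds
        return "live"
    if Q > N / 2:                 # safety condition holds
        return "safe"
    return "malicious"

def next_leader(nodes: list, leader: int) -> int:
    """Leader change: rotate to the next node when the leader is malicious
    (round-robin view change, our assumption)."""
    return nodes[(nodes.index(leader) + 1) % len(nodes)]

print(classify_round(authentic=3, N=4, Q=3, f=1))  # 'agree'
print(next_leader([0, 1, 2, 3], leader=3))         # 0
```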
Here, to maintain the integrity of the data, we have used a hash function. A message digest
is created at the sender node and sent with the message to the receiver node. To check the
integrity of a message, the receiver computes the hash function again and compares the new
message digest with the one received. Only if both are the same is the data of one node
approved to pass on to the other:
h(y) = h(x)    (1)
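The check h(y) = h(x) amounts to a few lines of standard-library code: the sender ships the
digest with the message, and the receiver approves the data only if the recomputed digest
matches. A minimal sketch (ours, not the paper's implementation):

```python
import hashlib

def make_digest(message: bytes) -> str:
    """Sender side: message digest h(x), sent along with the message."""
    return hashlib.sha256(message).hexdigest()

def approve(message: bytes, received_digest: str) -> bool:
    """Receiver side: data passes to the next node only if h(y) == h(x)."""
    return hashlib.sha256(message).hexdigest() == received_digest

msg = b"sensor reading: 21.4 C"
digest = make_digest(msg)
assert approve(msg, digest)                           # integrity preserved
assert not approve(b"sensor reading: 99 C", digest)   # altered in transit
```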
When the digests match, consensus is reached, a confirmation message appears, and the
propagation of the data is approved as free of malice. The rules applied in the simulation
are described below [39, 43]:
1. The client must receive f + 1 replies, where f is the number of faulty nodes.
2. More than two-thirds of the nodes should be authentic: Agree = 3f + 1.
3. Liveness: Q <= N - f, where Q is the quorum size and N is the total number of nodes.
4. Safety: Q > N/2.
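For the smallest configuration tolerating a single faulty node (f = 1), the rules above give
concrete numbers; the arithmetic below simply instantiates them:

```python
f = 1                # faulty nodes the system must tolerate
N = 3 * f + 1        # minimum total nodes (rule 2's 3f + 1)
min_replies = f + 1  # rule 1: matching replies the client must collect
Q = N - f            # largest quorum still satisfying liveness

assert Q <= N - f    # rule 3: liveness holds
assert Q > N / 2     # rule 4: safety also holds for N = 4, Q = 3
print(N, min_replies, Q)  # 4 2 3
```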
4 Results and Discussion
The data collected from various sensors are stored at the nodes, and when they are
propagated from one node to another, pBFT checks whether the information passing through
them is authentic. If the information is not authentic and fails to fulfill the 3f + 1
consensus rule, it is simply discarded without disturbing the system, thus providing
two-step security. Real-time operation and safety are guaranteed by the algorithm as long as
no more than n/3 of the n nodes are faulty, which means that the client will receive correct
replies to its requests. First, we implemented the pBFT algorithm on Replit (a coding
platform); then we used the Contiki simulator (a simulator for the IoT) for real-time
simulation. For the safety mechanism we used cryptographic public and private keys. We
exclude the details of the implementation here due to space constraints.
In this paper we assume that the client sends the next request only after the first one has
been served; sending requests one after another without waiting would result in congestion.
The algorithm provides safety only when the non-malicious nodes reach consensus. To keep the
simulation in real time, if the leader node is found to be malicious, another node is
immediately appointed as leader. Many systems succeed in implementing safety but fail to
maintain real-time operation; the proposed system provides both simultaneously, following
two approaches: the one-third-node rule and changing the leader node in case the leader is
damaged. The significance of pBFT is that it keeps the system alive as long as the number of
reliable nodes exceeds the number of faulty nodes. The simulation setup and communication
among nodes are depicted in Fig. 4.
Fig. 4 a Real-time simulation and b nodes communicating with each other
The performance measures in this work are the number of nodes, speed, and simulation time in
milliseconds. The nodes are shown as [1 + ... + n], where 1 denotes the leader node and n
denotes the other nodes. The time in milliseconds is the time required for one node to
communicate with another in real time. On the real-time simulator, the green region is the
region of strong connection and the gray region is that of weak connection. The sky-blue
node is the leader node, and the remaining green nodes are backup/secondary nodes.
Earlier, the IoT was secured using security algorithms including machine learning, but
researchers and scientists have now started using blockchain technologies for the security
of IoT devices. It is almost impossible to transmit data and provide security for every kind
of sensor or gadget; hence blockchain acts as a security provider using cryptographic
techniques and consensus agreement rules over these networks. Furthermore, a comparison of
the Practical Byzantine Fault Tolerance (pBFT) algorithm with other consensus algorithms for
IoT security is given in Table 3.
The comparison between the proposed system using pBFT and other security algorithms is
depicted in Table 4.
The simulation was run for 25, 50, and 100 nodes, as depicted in Fig. 5; the times taken for
these node counts are shown in Table 5. The average time over the three runs is
(11 + 24 + 57)/3, which is approximately 30.7 s.
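Averaging the Table 5 timings (11 s for 25 nodes, 24 s for 50, 57 s for 100) can be checked
directly:

```python
# Consensus times from Table 5: nodes -> seconds
times_s = {25: 11, 50: 24, 100: 57}

avg = sum(times_s.values()) / len(times_s)  # (11 + 24 + 57) / 3
print(f"average over the three runs: {avg:.1f} s")  # 30.7 s
```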
Security of data is a major issue in the IoT domain, and there is a high risk of privacy
breaches and data theft while data propagates through the IoT layers. Using pBFT at the
network layer of the IoT would reduce the chances of such breaches and theft. Until now,
intrusion detection systems for the IoT have lacked accuracy in trial output and have shown
some inconsistency issues; pBFT would efficiently resolve the issues that exist in these
mechanisms [11]. However, the optimistic fast
Table 3 Comparison of various consensus algorithms [40, 45]

Algorithm                                    Time (s) for 5 nodes   Energy consumption   Mechanism (protocol or cryptographic technique for reaching consensus)
Proof of Work (PoW)                          5–8                    High                 Based on computing power
Proof of Stake (PoS)                         12                     Relatively low       High-stakes nodes have the right to account
Proof of Authority (PoA)                     4–5                    High                 Validators help to reach consensus
Byzantine Fault Tolerance (BFT)              4–10                   Relatively low       Reach agreement based on value
Practical Byzantine Fault Tolerance (pBFT)   4–26                   Low                  Using majority rule
Table 4 Comparison of pBFT with security algorithms

Technique        Time (s)            Speed   Power consumption
RSA              6 for 100 nodes     Slow    High
Diffie-Hellman   4.6 for 100 nodes   Slow    High
ECC              4 for 100 nodes     Fast    Comparatively low
pBFT             3 for 100 nodes     Fast    Low
path can be achieved only when there is no failure; otherwise, the protocol behaves like
randomized consensus and suffers from congestion [46]. IoT devices such as sensors go
through the various life stages discussed above, and an issue introduced in a sensor at the
manufacturing stage can cause great damage and end in interoperability problems [47]. The
best solution for this issue is device certification. Certification could be based on
standard norms followed by the manufacturing industry or provided by a government. A device
certified by a government organization must follow the privacy rules and regulations of that
country and will be more trusted by customers. Moreover, a mechanism through which a
certified device's issues can be claimed and challenged would also be provided by pBFT.
Fig. 5 Summation time of different categories of communicating nodes (time in seconds per
node over three rounds, for 25, 50, and 100 nodes)
Table 5 Time taken by pBFT for different numbers of nodes

Nodes   Time (s)
25      11
50      24
100     57
5 Conclusion
Based on our analysis and comparisons, we conclude that pBFT is the most suitable algorithm
for the security of the IoT: it is a cryptographic consensus technique whose time and energy
consumption are low and, most importantly, the system remains in real-time mode despite the
presence of malicious nodes. The proposed method is concerned with data integrity and
authenticity: through Practical Byzantine Fault Tolerance (pBFT) it provides an approach for
safer and more secure data propagation among IoT devices. This paper also stressed the
certification of IoT devices, which further resolves most of the problems of sensor
interoperability. pBFT may provide a way to data security in the IoT, but it will be very
hard to implement in every case due to the heterogeneous nature of the Internet of Things
(IoT).
References
1. Husamuddin M, Qayyum M (2017) Internet of things: a study on security and privacy threats.
In: 2017 2nd international conference on anti-cyber crimes (ICACC), pp 93–97. https://doi.org/10.1109/
Anti-Cybercrime.2017.7905270
2. Cisco (2020) 2020 CISO benchmark report. Comput Fraud Secur 2020(3):4. https://doi.org/
10.1016/s1361-3723(20)30026-9
3. Chen S, Xu H, Liu D, Hu B, Wang H (2014) A vision of IoT: applications, challenges, and
opportunities with China perspective. IEEE Internet Things J 1(4):349–359. https://doi.org/10.
1109/JIOT.2014.2337336
4. Nespoli P, Díaz-López D, Gómez Mármol F (2021) Cyberprotection in IoT environments: a
dynamic rule-based solution to defend smart devices. J Inf Secur Appl 60. https://doi.org/10.
1016/j.jisa.2021.102878
5. Hamid Lone A, Naaz R. Reputation driven dynamic access control framework for IoT atop
PoA Ethereum blockchain
6. Kasemsap K (2019) Internet of things and security perspectives. Secur. Internet Things 5(1):1–
20. https://doi.org/10.4018/978-1-5225-9866-4.ch001
7. Lu C (2014) Overview of security and privacy issues in the internet of things abstract: keywords:
table of contents 1–11
8. Yassein MB, Mardini W, Al-Abdi A (2017) Security issues in the internet of things. 8(6):186–
200. https://doi.org/10.4018/978-1-5225-3029-9.ch009
9. Castro M, Liskov B (1999) Practical Byzantine fault tolerance. In: Proceedings of the
third symposium on operating systems design and implementation (OSDI), pp 173–186
10. Meshcheryakov Y, Melman A, Evsutin O, Morozov V, Koucheryavy Y (2021) On performance
of PBFT blockchain consensus algorithm for IoT-applications with constrained devices. IEEE
Access 9(June):80559–80570. https://doi.org/10.1109/ACCESS.2021.3085405
11. Li L, Chen Y, Lin B (2021) Intrusion detection analysis of internet of things considering practical
byzantine fault tolerance (PBFT) algorithm. Wirel Commun Mob Comput 2021. https://doi.
org/10.1155/2021/6856284
12. Hosmer C (2018) IoT vulnerabilities. Defending IoT Infrastructures with Raspberry Pi 1–15.
https://doi.org/10.1007/978-1-4842-3700-7_1
13. Bouscaren E (1989) Elementary pairs of models. Ann Pure Appl Log 45(2) PART 1:129–137.
https://doi.org/10.1016/0168-0072(89)90057-2
14. Sultan A, Mushtaq MA, Abubakar M (2019) IoT security issues via blockchain: a review
paper. PervasiveHealth Pervasive Comput Technol Healthc Part F1481:60–65. https://doi.org/
10.1145/3320154.3320163
15. Rachit SB, Ragiri PR (2021) Security trends in internet of things: a survey. SN Appl Sci
3(1):1–14. https://doi.org/10.1007/s42452-021-04156-9
16. Inside Secure, Iot security solutions white paper. Veritmatrix
17. Wheelus C, Zhu X (2020) IoT network security: threats, risks, and a data-driven defense
framework. IoT 1(2):259–285. https://doi.org/10.3390/iot1020016
18. Hernandez-Ramos JL, Pawlowski MP, Jara AJ, Skarmeta AF, Ladid L (2015) Toward a
lightweight authentication and authorization framework for smart objects. IEEE J Sel Areas
Commun 33(4):690–702. https://doi.org/10.1109/JSAC.2015.2393436
19. Azamuddin, Rotation project title : survey on IoT security. [Online]. Available: https://www.
cse.wustl.edu/~jain/cse570-15/ftp/iot_sec2.pdf
20. Goyal TK, Sahula V (2016) Lightweight security algorithm for low power IoT devices. 2016
international conference on advances in computing, communications and informatics, ICACCI
2016, September, 1725–1729. https://doi.org/10.1109/ICACCI.2016.7732296
21. Satamraju KP, Malarkodi B (2019) A secured and authenticated internet of things model
using blockchain architecture. Proc. 2019 TEQIP - III Sponsored international conference
on microwave integrated circuits, photonics and wireless networks, IMICPW 2019, 19–23.
https://doi.org/10.1109/IMICPW.2019.8933275
22. Zhang H, Lang W (2019) Research on the blockchain technology in the security of internet
of things. Proc. 2019 IEEE 4th Advanced information technology, electronic and automation
control conference IAEAC 2019, no. Iaeac, 764–768. https://doi.org/10.1109/IAEAC47372.
2019.8997876
23. Kurniawan A, Mayasari R, Murti MA (2018) Implementation of cryptographic algorithm on
Iot device’s Id. J Sist Cerdas 01(02):19–26
24. Zhang J, Li Z (2020) Design of internet of things information security based on blockchain.
Proc. - 2020 3rd World conference on mechanical engineering and intelligent manufacturing
WCMEIM 2020, 114–117. https://doi.org/10.1109/WCMEIM52463.2020.00030
25. Moinet A, Darties B, Baril J-L (2017) Blockchain based trust and authentication for
decentralized sensor networks, pp 1–6. [Online]. Available: http://arxiv.org/abs/1706.01730
26. Na D, Park S (2021) Fusion chain: a decentralized lightweight blockchain for iot security and
privacy. Electron 10(4):1–18. https://doi.org/10.3390/electronics10040391
27. Yuan X, Luo F, Haider MZ, Chen Z, Li Y (2021) Efficient Byzantine consensus mechanism
based on reputation in IoT blockchain. Wirel Commun Mob Comput 2021. https://doi.org/10.
1155/2021/9952218
28. Patil P, Sangeetha M, Bhaskar V (2021) Blockchain for IoT access control, security and
privacy: a review. Wirel Pers Commun 117(3):1815–1834. https://doi.org/10.1007/s11277-
020-07947-2
29. Mackenzie B, Ferguson RI, Bellekens X (2018) An assessment of blockchain consensus proto-
cols for the internet of things. In: 2018 international conference on internet of things, embedded
systems and communications. IINTEC 2018—Proceedings, 183–190. https://doi.org/10.1109/
IINTEC.2018.8695298
30. Kuzminykh I, Yevdokymenko M, Ageyev D (2021) Analysis of encryption key management
systems: strengths, weaknesses, opportunities, threats. In: 2020 IEEE international conference
on problems of infocommunications. Science and technology. PIC S T 2020—proceedings,
515–520. https://doi.org/10.1109/PICST51311.2020.9467909
31. Seshadri SS et al (2021) IoTCop: a blockchain-based monitoring framework for detection and
isolation of malicious devices in internet-of-things systems. IEEE Internet Things J 8(5):3346–
3359. https://doi.org/10.1109/JIOT.2020.3022033
32. Ali MS, Vecchio M, Pincheira M, Dolui K, Antonelli F, Rehmani MH (2019) Applications of
blockchains in the internet of things: a comprehensive survey. IEEE Commun Surv Tutorials
21(2):1676–1717. https://doi.org/10.1109/COMST.2018.2886932
Safeguarding IoT: Harnessing Practical Byzantine Fault Tolerance 301
33. Lone AH, Naaz R (2021) Applicability of blockchain smart contracts in securing Internet and
IoT: a systematic literature review. Comput Sci Rev 39:100360. https://doi.org/10.1016/j.cos
rev.2020.100360
34. Meshcheryakov Y, Melman A, Evsutin O, Morozov V, Koucheryavy Y (2021) On performance
of PBFT blockchain consensus algorithm for IoT-applications with constrained devices. IEEE
Access 9(April):80559–80570. https://doi.org/10.1109/ACCESS.2021.3085405
35. Hyperledger A, Edge ALF, Decentralized ID and access management (DIAM ) for IoT networks
36. Goyal H, Saha S, Practical byzantine consensus for internet-of-things
37. Sharma V, Lal N (2020) A novel comparison of consensus algorithms in blockchain. Adv Appl
Math Sci 20(1):1–13
38. Driscoll K, Hall B, Sivencrona H, Zumsteg P (2003) Byzantine fault tolerance, from theory to
reality 1 what you thought could never happen. Thought A Rev Cult Idea 2:235–248. https://
doi.org/10.1007/978-3-540-39878-3_19
39. Li W, Feng C, Zhang L, Xu H, Cao B, Imran MA (2021) A scalable multi-layer PBFT consensus
for blockchain. IEEE Trans Parallel Distrib Syst 32(5):1146–1160. https://doi.org/10.1109/
TPDS.2020.3042392
40. Gorkey I, Sennema E, El Moussaoui C, Wijdeveld V (2020) Comparative study of byzantine
fault tolerant consensus algorithms on permissioned blockchains supervised by Zekeriya Erkin
supervised by Miray Aysen, April, pp 1–11
41. Misic J, Misic VB, Chang X, Qushtom H (2020) Multiple entry point PBFT for IoT systems.
2020 IEEE Global Communications Conference 2020, 0–5. https://doi.org/10.1109/GLOBEC
OM42002.2020.9322641
42. Misic J, Misic VB, Chang X, Qushtom H (2021) Adapting PBFT for use with blockchain-
enabled IoT systems. IEEE Trans Veh Technol70(1):33–48. https://doi.org/10.1109/TVT.2020.
3048291
43. Liangchen X (2020) Design and implementation of internet of things information security
transmission based on PBFT algorithm. In: International conference on computer engineering
and application (ICCEA), 201–205. https://doi.org/10.1109/ICCEA50009.2020.00051
44. Waheed N, He X, Ikram M, Usman M, Hashmi SS, Usman M (2021) Security and privacy in
IoT using machine learning and blockchain: threats and countermeasures. ACM Comput Surv
53(6). https://doi.org/10.1145/3417987
45. Xiong H, Chen M, Wu C, Zhao Y, Yi W (2022) Research on progress of blockchain consensus
algorithm: a review on recent progress of blockchain consensus algorithms. Futur Internet
14(2). https://doi.org/10.3390/fi14020047
46. Kuznetsov P, Tonkikh A, Zhang YX (2021) Revisiting optimal resilience of fast byzantine
consensus. Assoc Comput Mach 1(1)
47. Noura M, Atiquzzaman M, Gaedke M (2019) Interoperability in internet of things: taxonomies
and open challenges. Mob Networks Appl 24(3):796–809. https://doi.org/10.1007/s11036-018-
1089-9
Human Body Poses Detection
and Estimation Using Convolutional
Neural Network
Jitendra Kumar Baroliya and Amit Doegar
Abstract This study introduces a unique method for human body pose detection and estimation that combines a convolutional neural network (CNN) with grab cut segmentation. The suggested technology is meant to aid in the detection and estimation of human pose, which is important for many real-time applications. Features are extracted from pictures of human poses by applying grab cut to create a human silhouette. The convolutional neural network is then applied to classify the human pose. When tested on a dataset of photographs of human poses, the suggested system achieved an accuracy of 93.89% across 6 human pose classes. A total of 1181 pictures were used in this analysis, covering six different human poses (down dog, warrior, tree, plank, goddess, and handshaking). There are 944 training images and 237 test photographs across all categories, and an 80:20 ratio is maintained for training and testing. An F1 score of 93.75%, a recall score of 93.89%, and a precision score of 93.89% were obtained using the proposed strategy. Based on the obtained data, the proposed method achieves good accuracy in pose detection compared with state-of-the-art methods. It will be beneficial for yoga pose detection, patient detection systems, etc.
Keywords Human body pose · CNN · Grab cut · Human silhouette · Segmentation
J. K. Baroliya (B)·A. Doegar
Computer Science and Engineering Department, NITTTR, Chandigarh, India
e-mail: Baroliyajitendra4@gmail.com
A. Doegar
e-mail: Amit@nitttrchd.ac.in
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_23
1 Introduction
1.1 Human Pose Estimation
It is a technique for identifying and labeling human body key joints.
Fundamentally, it is a method of capturing a person’s location by storing a set of
coordinates for each joint. The connection between these two locations is called a
pair. For two points to be joined, there must be some kind of significant relationship
between them. Estimation’s primary purpose is to construct a human “skeleton”
model for use in more refined, task-specific applications [1].
Model based on skeletons
Model based on contours
Model based on volume (Fig. 1)
1. Skeleton-based model [2]: This representation is built from key joints such as ankles, knees, shoulders, elbows, and wrists, together with limb orientations. Also called the kinematic model, it is utilized extensively in 2D and 3D pose estimation. This flexible and easy-to-understand human body model is often used to illustrate connections between different parts of the body, especially the skeleton.
2. Contour-based model: A contour-based model is used for 2D pose estimation in planar mode. This representation accounts for the overall form and breadth of the body, torso, and limbs. Rectangles and boundaries represent the human body's extent and shape, giving the impression of a human figure.
Fig. 1 Human body models
3. Volume-based model: It is utilized for 3D posture estimation. Geometric meshes and shapes represent the human body in these models, and they are often recorded for use in deep learning-based 3D human activity identification.
1.2 Computer Vision
Computer vision, in its broadest sense, is the automation and combination of many
techniques and representations used to comprehend visual information [3]. To help
computers make sense of digital photos, specialists from a variety of areas collaborate
on a project. A digital picture is simply a two-dimensional array of integers. Computer vision must devise methods of extracting and communicating the data contained in this matrix. These calculations may produce either another picture or a collection of features describing the image.
In recent years, computer vision has shown remarkable practicality. The name
“computer vision” was coined to represent this overarching objective, which is to
enable cameras on computers to “see” and “recognize” in the same way that humans
do. With the use of algorithms, computers can analyze the data contained in an
image's pixels to determine which features are most significant. This is challenging; many researchers devote their careers to training computers to recognize faces automatically in images.
There has been considerable recent advancement in computer vision. Many different types of problems can be solved using effective algorithms, and thanks to progress in computing, these algorithms can now be executed at a tolerable pace. However, computer vision has a long way to go before it can replace human eyes; it is rare that a computer can outperform human eyesight in a given setting. Computer vision is best used for object recognition and for matching incoming photos to a huge database when paired with machine learning. To achieve this, this research uses computer vision and machine learning to categorize human body positions in images.
1.3 Convolutional Neural Network
Convolutional neural networks (CNNs) are a kind of deep learning model that can classify images and videos by assigning learnable weights and biases to the input data. A CNN learns visual features faster than a plain artificial neural network (ANN) and offers several benefits, including weight sharing and comparatively fewer parameters, which reduce computational load. The construction of the convolutional neural network was inspired by the arrangement of the human visual cortex. CNN is highly recommended for posture estimation because of the complexity of the task, which includes object identification and body key point localization. Thus, CNN is a fundamental component of many different types of pose estimation models.
Convolutional Layer
The filters in a convolutional layer have parameters that must be learned. The filters are smaller in height and width than the input volume. An activation map is calculated by convolving each filter with the input volume; that is, the filter is moved horizontally and vertically over the input, and the dot product is computed at each location. Convolution is thus an elementwise multiply-and-sum operation between the image matrix and the kernel. The spatial dimensions of the convolutional layer's output are smaller than those of the input picture matrix unless padding is used.
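The sliding dot product described above can be sketched in a few lines of NumPy. Strictly speaking this is cross-correlation, which is what most deep learning frameworks implement under the name "convolution":

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image and take a dot product at each
    location ('valid' convolution, no padding): the output shrinks to
    (H - kH + 1) x (W - kW + 1)."""
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A 64x64 input and a 3x3 kernel give a 62x62 activation map.
activation = conv2d_valid(np.ones((64, 64)), np.ones((3, 3)))
print(activation.shape)  # (62, 62)
```

This makes the shrinkage concrete: each 3 × 3 filter trims one pixel from every border of the input.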
Pooling Layer
When features have been identified, their precise locations become less relevant, so a subsampling (pooling) layer is employed to reduce the number of training parameters and introduce translation invariance. The pooling layer typically applies a max (or average) operation: filters subsample the input by keeping only a summary value from each window. A 2 × 2 filter with stride 2 is commonly used, which reduces the width and height of each depth slice in the input volume by a factor of 2. Because pooling has no learnable parameters, it also lowers the memory and computation required by subsequent layers.
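A minimal sketch of the 2 × 2 max operation, showing how each non-overlapping window collapses to its largest value:

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2: keep the largest value in each
    non-overlapping 2x2 window, halving height and width."""
    h, w = feature_map.shape
    out = np.zeros((h // 2, w // 2))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = feature_map[2*i:2*i + 2, 2*j:2*j + 2].max()
    return out

fm = np.array([[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 1, 5, 2],
               [2, 0, 3, 4]], dtype=float)
print(max_pool_2x2(fm))  # [[4. 2.] [2. 5.]]
```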
Activation Function
To introduce nonlinearity, the rectified linear unit (ReLU) is preferable to previously used functions, as shown by its success in a variety of settings, the simplicity with which its partial derivative can be computed, and the resulting reduction in training time. ReLU also helps keep gradients from vanishing.
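ReLU and its derivative are one-liners, which is part of why it trains quickly — a piecewise-linear sketch:

```python
def relu(x):
    """ReLU: pass positive inputs through, clamp negatives to zero."""
    return x if x > 0 else 0.0

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

print(relu(2.5), relu(-1.0))            # 2.5 0.0
print(relu_grad(2.5), relu_grad(-1.0))  # 1.0 0.0
```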
Fully Connected Layers
Neurons in a fully connected layer are connected to every activation in the previous layer. After the feature map matrix is flattened into a vector, it is passed on to the subsequent fully connected layers, each of which computes a weighted sum of its inputs plus a bias and applies an activation function.
2 Literature Survey
After the study of various research papers on human body poses detection and esti-
mation, we concluded that the classification of human body pose is a very complex
task. In the literature, various authors and researchers discussed human body pose
detection using different types of techniques for image processing operations.
Toshev et al. [4] suggested using deep neural networks to estimate human poses
(DNNs). The process of estimating a subject’s posture is modeled as a regression
Fig. 2 Convolutional pose machine [12]
problem using a deep neural network in terms of the body’s joints. In order to get
accurate posture estimations, researchers describe a cascade of such DNN regressors,
a sequential architecture made up of convolutional networks that can learn spatial
models implicitly. The given cascade of such regressors has the benefit of collecting
context and reasoning about the posture as a whole, and the issue may be formulated
as a DNN-based regression to joint coordinates. This allowed them to outperform
state-of-the-art methods on a number of difficult academic datasets.
Wei et al. [5] proposed the convolutional pose machine, which learns basic spatial models by composing convolutional architectures in a sequential fashion to perform prediction tasks. The model's design is similar to recurrent networks in that it takes many iterations to reach the desired outcome. Input pictures are 368 × 368, and after a few convolutions it outputs a key point prediction map for each body part (head, neck, right elbow, etc.) (Fig. 2).
Newell et al. [6] proposed a network that outperforms all prior approaches. The network is referred to as a "stacked hourglass" because it uses many layers, each shaped like an hourglass, to produce a single prediction at the end of the process. This network was designed with the goal of capturing details such as a person's posture and the degree of articulation of their limbs.
When creating a neural network, it’s essential to identify its unique features to
ensure its effectiveness. However, when it comes to full-body pose estimation, a
broader context is necessary. This is where the Hourglass model shines, as it can
capture all the necessary details and provide precise predictions down to the pixel
level. The Hourglass approach involves repeated bottom-up processing, moving from
high to low resolution, and top-down processing, moving from low to high resolution,
with intermediate supervision to ensure accuracy.
Cao et al. [7] developed the open-source software "OpenPose," a powerful tool for detecting many 2D poses simultaneously. In this research, they introduced a method for identifying 2D human postures based on Part Affinity Fields (PAFs), a bottom-up representation technique. By contrast, top-down methods are mostly used in other models for estimating poses in groups. As the name implies, a top-down approach begins with the detection of a human being, followed by the estimation of its posture in each specific area. While this method may be used directly for single-person pose estimation, it falls short in multi-person settings due to its inability to account for the spatial interdependence between subjects, which can only be captured by global inference.
In their study, Sharma et al. [8] were able to accurately detect four distinct human
poses, namely Sitting, Standing, Handshake, and Waving, achieving an overall accu-
racy of 82.5%. To achieve this, they first pre-processed the images and extracted
the necessary features using Principal Component Analysis and Discrete Cosine
transform. These extracted features were then used to train a neural network classifier.
In their study, Wang et al. [9] proposed a new approach to classify yoga poses
using a combination of post-estimation algorithm and convolutional neural network
(CNN). The post-estimation algorithm utilized in this study was the Open Pose
algorithm, which detected the skeleton of a person and generated a pure black picture.
After extracting the poses from the original images, a CNN was utilized to classify
the yoga poses with a validation accuracy of 92.99%. The study also compared
the performance of the models with and without the assistance of the Open Pose
algorithm, and the results indicated that the accuracy of models without the Open
Pose algorithm was on average 3–6% lower than the models combined with the Open
Pose algorithm. Thus, the proposed approach of classification was deemed effective,
based on the skeleton information extracted by the Open Pose algorithm.
Desai et al. [10] conducted a comprehensive literature review and proposed a method for real-time pose estimation utilizing a deep neural network (DNN) model to detect and correct errors in a person's posture.
3 Proposed Methodology
The proposed methodology for this research work is shown below (Fig. 3).
Firstly, human images are acquired through any picture-capturing device, and grab cut is applied to the dataset to create human silhouettes of the input images. For image classification, convolution, max-pooling, and ReLU layers are applied to the human silhouettes [11]. These layers are applied three to four times in succession, after which flatten and dense layers are added. The fully connected layer then classifies the given input pose.
Fig. 3 Proposed methodology
Flow Chart of Proposed Work
See Fig. 4.
First of all, we acquire images from an online source [12]; the handshaking dataset was created by the researchers after applying some image processing techniques. We then use grab cut segmentation on the original dataset, which outputs human silhouettes.
Augmentation operations [13] were also applied to some pose image categories to ensure equal representation of each category. The dataset was split into 80% for training and 20% for testing.
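The 80/20 split of the 1181-image dataset can be sketched in pure Python (the filenames below are placeholders; the actual dataset is the yoga-poses set from [12] plus the authors' handshaking images). Note that an 80% cut of 1181 items reproduces the paper's 944/237 partition exactly:

```python
import random

def split_dataset(items, train_frac=0.8, seed=42):
    """Shuffle the items reproducibly and split them into train/test."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_frac)  # 80% boundary
    return items[:cut], items[cut:]

# 1181 images in total, as in the paper's dataset
images = [f"img_{i:04d}.jpg" for i in range(1181)]
train, test = split_dataset(images)
print(len(train), len(test))  # 944 237
```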
Next, the CNN started with four convolutional layers with 64, 64, 64, and 128 filters, respectively, each followed by a max-pooling layer with a 2 × 2 window size. Each convolutional layer utilized a 3 × 3 filter, and the ReLU activation function was applied. After the final max-pooling layer, a Flatten() layer was added to flatten the output of the earlier layers, and a Dense() layer with the SoftMax activation function was incorporated to generate probabilities for each of the 6 possible classes.
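A quick sanity check of this stack's feature-map sizes can be done by hand. Assuming 64 × 64 inputs, "valid" 3 × 3 convolutions, and 2 × 2 pooling with stride 2 (the paper does not state its padding, so this is an illustrative assumption):

```python
def stack_shapes(size=64, filters=(64, 64, 64, 128), kernel=3, pool=2):
    """Trace the spatial size through four conv (valid) + max-pool pairs."""
    shapes = []
    for f in filters:
        size = size - kernel + 1   # 3x3 'valid' conv shrinks each side by 2
        size = size // pool        # 2x2 pooling halves (floor division)
        shapes.append((size, f))
    return shapes

shapes = stack_shapes()
print(shapes)  # [(31, 64), (14, 64), (6, 64), (2, 128)]
flat = shapes[-1][0] ** 2 * shapes[-1][1]
print(flat)    # 512 units feed the final Dense softmax layer
```

Under these assumptions, the flattened vector entering the Dense() layer would have 2 × 2 × 128 = 512 units.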
3.1 Extraction
If conditions are stable, identifying a human form in an image or video may be a straightforward task: when comparing frames, we may isolate the human form by tracking the moving object. Having access to just one picture makes this task somewhat more challenging; therefore, the generally effective grab cut algorithm is used in our study. Once a person has been detected in the segmented photo, the picture may be divided into a foreground and a background. The spot where the individual was found is marked prominently, and the background is the area outside the box's borders. Grab cut uses all of this information [14].
The extraction algorithm proceeds as follows. A bounding rectangle is drawn in, and the individual in the foreground should take up the whole rectangle. The program repeatedly refines the foreground segmentation for optimal performance, separating foreground and background pixels, since the foreground object should stand out against the background. Grab cut then selectively cuts off everything except the subject of the picture, leaving just their silhouette. This is performed on the pictures that were generated when the person's bounding box was first found. Figure 5a, b provide a clear illustration.
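In OpenCV terms, `cv2.grabCut(img, mask, rect, bgdModel, fgdModel, iterCount, cv2.GC_INIT_WITH_RECT)` returns a label mask with values 0/2 (definite/probable background) and 1/3 (definite/probable foreground). Turning that mask into a binary silhouette is the final post-processing step; a NumPy-only sketch (the grabCut call itself is omitted so the snippet stays self-contained):

```python
import numpy as np

# OpenCV label values: GC_BGD = 0, GC_FGD = 1, GC_PR_BGD = 2, GC_PR_FGD = 3
def mask_to_silhouette(gc_mask):
    """Collapse a GrabCut label mask into a binary silhouette:
    foreground labels (1 and 3) become white, background (0 and 2) black."""
    return np.where((gc_mask == 1) | (gc_mask == 3), 255, 0).astype(np.uint8)

# Toy 4x4 label mask of the kind grabCut might return
gc_mask = np.array([[0, 2, 2, 0],
                    [2, 3, 3, 2],
                    [2, 1, 1, 2],
                    [0, 2, 2, 0]])
print(mask_to_silhouette(gc_mask))
```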
Fig. 4 Flow chart diagram for proposed system
Fig. 5 a and b
3.2 Image Classification
After the input dataset has been crafted, a basic feedforward neural network may be constructed. This network has neurons in its hidden layers, and its output neurons partition the pictures into classes. The RGB image is scaled down to 64 × 64, and the silhouette image is converted to grayscale; these images serve as the training data input to the neural network. The results of the evaluations are shown in Fig. 6a, b.
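The preprocessing step — grayscale conversion and downscaling to 64 × 64 — can be sketched with NumPy alone. Nearest-neighbour resizing is used here for brevity; the authors' exact resizing method is not stated, so treat this as illustrative:

```python
import numpy as np

def to_grayscale(rgb):
    """Standard luminance weighting of the R, G, B channels."""
    return rgb @ np.array([0.299, 0.587, 0.114])

def resize_nearest(img, size=64):
    """Nearest-neighbour resize of a 2D array to size x size."""
    h, w = img.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[np.ix_(rows, cols)]

rgb = np.random.rand(128, 96, 3)  # stand-in for a photograph
gray64 = resize_nearest(to_grayscale(rgb))
print(gray64.shape)  # (64, 64)
```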
Fig. 6 a and b
4 Experiment and Results
The coding and implementation for this study were done in a Jupyter notebook of the Anaconda framework, using Python 3.10. We use 944 training images to teach our convolutional neural network model, and then we use 237 test images to see how well it can classify new images. Both the training and validation accuracies peaked at the 17th epoch when the designed approach was run for 20 epochs. The grab cut dataset is passed to the CNN, which results in enhanced classification accuracy; we report accuracy, F1-score, recall, and precision as performance metrics. The confusion matrix for the classification of pose categories is shown in Fig. 7. The various performance scores of the proposed method are shown in tabular form in Table 1. The training and validation losses during model training are shown in Fig. 8, and the training and validation accuracies are shown in Fig. 9 and Table 2.
Overall, an accuracy of 93.89% was reported when 393 random images were tested.
Several studies have utilized convolutional neural networks (CNNs) for human pose detection and estimation. These include "DeepPose: Human pose estimation via deep neural networks" (2014), "Stacked hourglass networks for human
Fig. 7 Confusion matrix (label 0 = Downdog, label 1 = Goddess, label 2 = Handshaking, label 3 = Plank, label 4 = Tree, label 5 = Warrior)
Table 1 Performance of proposed methodologies

Accuracy score 0.9389
F1-score 0.9375
Recall score 0.9389
Precision score 0.9389
Fig. 8 Training and validation loss
Fig. 9 Training and validation accuracy
Table 2 Performance of proposed methodologies

Label  Class        No. of testing images  Correctly identified  Accuracy in percentage
0      Downdog      36                     34                    94.44
1      Goddess      42                     31                    73.80
2      Handshaking  124                    124                   100
3      Plank        108                    106                   98.14
4      Tree         49                     46                    93.87
5      Warrior2     34                     28                    82.35
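The overall accuracy in Table 1 follows directly from Table 2's per-class counts, which a few lines of Python confirm:

```python
# (class, images tested, correctly identified) taken from Table 2
results = [
    ("Downdog",     36,  34),
    ("Goddess",     42,  31),
    ("Handshaking", 124, 124),
    ("Plank",       108, 106),
    ("Tree",        49,  46),
    ("Warrior2",    34,  28),
]

total = sum(n for _, n, _ in results)
correct = sum(c for _, _, c in results)
accuracy = correct / total
print(total, correct, round(accuracy * 100, 2))  # 393 369 93.89
```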
pose estimation" (2016), "Human Pose Detection: A Machine Learning Approach" (2020), "Multi-Classification for Yoga Pose Based on Deep Learning" (2022), and "Deep Learning-Based Yoga Pose Classification" (2022). These studies reported accuracy rates of 79.6%, 80.90%, 92%, 92.99%, and 93%, respectively. In comparison with these previous works, our system achieved a higher accuracy of 93.89% (Fig. 10).
Fig. 10 Comparison of existing and proposed work
5 Conclusion and Future Scope
CNN and grab cut have great potential for use in human pose detection for many real-time applications in the future. These methods can be made even more accurate and faster by developing more advanced deep learning techniques and making large datasets available. Integrating these strategies with other technologies, such as remote sensing, enables real-time monitoring and early identification of human poses.
Finally, the application of CNN and grab cut to the problem of human pose detection shows great potential for the development of yoga pose detection applications in the future. These methods can help any person learn yoga poses at home and are also beneficial for theft detection by identifying poses, in the medical field, and in fall detection for patients. The remaining goals of this work include optimizing the parameters of the CNN and grab cut models and increasing the size of the dataset to include a broader range of human poses from different fields.
To further enhance the accuracy and performance of the system, researchers can experiment with transfer learning and other cutting-edge deep learning techniques. The methodology employed in this study is based on deep learning, which is used to categorize yoga poses. In the future, we will capture live yoga poses via webcams and then instruct the subject on how to improve by highlighting any flaws in their yoga posture.
References
1. Human pose estimation guide, Nov 15, 2022. https://www.v7labs.com/blog/human-pose-estimation-guide (accessed Nov 18, 2022)
2. A comprehensive guide on human pose estimation, Nov 15, 2022. https://www.analyticsvidhya.com/blog/2022/01/a-comprehensive-guide-on-human-pose-estimation/ (accessed Nov 15, 2022)
3. Vezzani R, Cucchiara R (2008) Annotation collection and online performance evaluation for
video Surveillance: the ViSOR project. In: Proceedings—IEEE 5th international conference
on advanced video and signal based surveillance, AVSS 2008. https://doi.org/10.1109/AVSS.
2008.31
4. Toshev A, Szegedy C (2014) DeepPose: human pose estimation via deep neural networks.
In: Proceedings of the IEEE computer society conference on computer vision and pattern
recognition. https://doi.org/10.1109/CVPR.2014.214
5. Wei S-E, Ramakrishna V, Kanade T, Sheikh Y. Convolutional pose machines
6. Newell A, Yang K, Deng J (2016) Stacked hourglass networks for human pose estimation. In:
Lecture notes in computer science (including subseries lecture notes in artificial intelligence
and lecture notes in bioinformatics). https://doi.org/10.1007/978-3-319-46484-8_29
7. Cao Z, Hidalgo G, Simon T, Wei SE, Sheikh Y (2021) OpenPose: realtime multi-person 2D
pose estimation using part affinity fields. IEEE Trans Pattern Anal Mach Intell 43(1). https://
doi.org/10.1109/TPAMI.2019.2929257
8. Kakati M, Sarma P (2020) Human pose detection: A machine learning approach. In: Advances
in intelligent systems and computing. https://doi.org/10.1007/978-3-030-37218-7_2
9. CIBDA 2022, 3rd international conference on computer information and big data applications
10. Kinger S, Desai A, Patil S, Sinalkar H, Deore N (2022) Deep learning based yoga pose
classification. In: 2022 international conference on machine learning, big data, cloud and
parallel computing, COM-IT-CON 2022, Institute of Electrical and Electronics Engineers Inc.,
682–691. https://doi.org/10.1109/COM-IT-CON54601.2022.9850693
11. Al-Qerem A, Alahmad A (2019) Human body poses recognition using neural networks with
data augmentation. Int J Adv Trends Comput Sci Eng 8(5). https://doi.org/10.30534/ijatcse/
2019/40852019
12. Yoga poses dataset, 2023. https://www.kaggle.com/datasets/niharika41298/yoga-poses-dataset (accessed Feb 02, 2023)
13. Park S, Baek Lee S, Park J (2020) Data augmentation method for improving the accuracy of
human pose estimation with cropped images. Pattern Recognit Lett 136. https://doi.org/10.
1016/j.patrec.2020.06.015
14. Rother C, Kolmogorov V, Blake A (2004) GrabCut—interactive foreground extraction using
iterated graph cuts. In: ACM SIGGRAPH 2004 Papers, SIGGRAPH 2004. https://doi.org/10.
1145/1186562.1015720
A Novel Image Alignment Technique
Leveraging Teaching Learning-Based
Optimization for Medical Images
Paluck Arora, Rajesh Mehta, and Rohit Ahuja
Abstract In image registration, traditional optimization techniques are incapable
of detecting the optimum value of geometric transformation parameters. To resolve
this issue, a novel scheme of monomodal (isomodal) biomedical image registra-
tion employing teaching learning-based optimization (TLBO) is proposed. In pre-processing, the reference image undergoes Gaussian filtering to eliminate noise, followed by normalization. During de-noising, the contrast between anatomical features of
an image is degraded. In order to create the floating image, rigid transformation is
employed. These images are aligned by detecting optimum value of rigid transforma-
tion parameters (RTP) using TLBO with mutual information (MI) maximization as
an objective function. MI and structural similarity index measure (SSIM) are used to
evaluate visual quality of registered image. The proposed scheme is tested on several
isomodal medical images such as magnetic resonance imaging (MRI) and computed
tomography (CT). The value of MI, SSIM increases by 8% and the value of RMSE
is significantly reduced from 0.4953 to 0.1306 [1] and 3.7858 to 0.1809 [2] which
clearly reveals that proposed scheme is robust and effective as compared with the
state-of-the-art methods.
Keywords Image registration · Mutual information · Monomodal images · TLBO · SSIM
P. Arora (B) · R. Mehta · R. Ahuja
Computer Science and Engineering Department, Thapar Institute of Engineering and Technology,
Patiala, India
e-mail: parora_phd20@thapar.edu
R. Mehta
e-mail: rajesh.mehta@thapar.edu
R. Ahuja
e-mail: rohit.ahuja@thapar.edu
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_24
317
318 P. Arora et al.
1 Introduction
Rapid growth of computer and medical technology motivates academicians and researchers around the globe to analyse bone structure, metabolic processes, and dense and soft tissues using computed tomography (CT), magnetic resonance imaging (MRI) and positron emission tomography (PET) modalities of medical images [3]. Experts
in medical imaging require a variety of imaging methods to acquire additional information that can assist in diagnosing diseases. Medical images derived from various
sensors (modalities) can be classified into two parts: first, anatomical images, such
as computed tomography (CT), magnetic resonance (MR) and ultrasound (US), are
utilized to show body organs in their complete structure; second, functional images,
like PET and SPECT, are employed to show soft tissues and their internal activi-
ties [1]. The goal of image registration is to align two images of the same scene
taken at various times (multi-temporal) from distinct viewpoints with various sensors
(multimodal). Medical image registration techniques can be classified into intensity
and feature-based registration. In intensity-based registration, the relationship between the intensities of the reference and floating images is used to estimate the transformation, whose parameters are then optimized. However, this approach does not support feature detection. Feature-based registration overcomes this concern by detecting and matching features between the reference and floating images [4].
Image registration has numerous applications in the field of medical imaging, such
as monitoring disease detection, treatment planning system (for radiotherapy treat-
ment), nuclear medicine, surgical procedure, motion tracking, image-guided surgery
(IGS) [5] and patient follow-up cum monitoring. Image registration entails trans-
formation, cost function (similarity metric) and optimization. Firstly, transformation
model is classified as rigid or non-rigid. A rigid transformation does not change the size or shape of an image; it comprises rotation and translation (an affine transform additionally allows scaling and shearing). Any transformation of a geometrical object that changes the size but not the shape is referred to as a non-rigid transformation [6]. Stretching or dilating belongs to the category
of non-rigid transformation. Secondly, similarity measures evaluate image alignment accuracy; mutual information (MI) and root mean square error (RMSE) are the most often used metrics. Registration is achieved when SSIM and MI are maximized while RMSE is minimized. Lastly, local or global optimization techniques are employed to search for the transformation parameters that optimize the similarity measure. Local techniques, including Powell's direction [7], steepest descent gradient and Levenberg–Marquardt, tend to get trapped in local optima and produce misregistration results [8].
Genetic algorithm (GA) [9], particle swarm optimization (PSO) [1], grey wolf optimizer (GWO) [10] and hybrid particle swarm optimization (HPSO) [11] are examples of global optimization methods. Although GA is a robust global optimization method, it is computationally expensive and does not support fine adjustment. Moreover, these global optimization methods cannot guarantee the global optimum [1]. Hence, a new intelligent meta-heuristic global optimization scheme, TLBO, is introduced with the aim of achieving better results with less effort for rigid
medical image registration. Considering the shortcomings of the methods outlined above, a robust image registration scheme is proposed that improves the quality of registered images by finding the optimum transformation parameters. In the present work, the reference image is pre-processed using gaussian filtering to remove noise, followed by normalization; as a side effect, the contrast between anatomical components of the image is slightly reduced. A rigid transformation is applied, and the reference and floating images are aligned with maximum MI as the cost function. Previous researchers focused on different optimization techniques for finding the optimum values of the rigid/non-rigid transformation parameters during registration [1, 2, 10]. The quality of the registered image is measured using the MI and SSIM values.
The major contributions for registering isomodal images are outlined as follows:
• Noise removal and normalization are performed during the pre-processing step, which allows the reference and floating images to be distinguished anatomically.
• An image registration (IR) method for different image modalities based on TLBO is presented, improving on traditional meta-heuristic approaches such as PSO, GWO and HPSO for determining the optimum rigid transformation parameters.
• With the aid of TLBO (which has fewer parameters than other optimization methods), the optimum transformation parameter values are found, resulting in a robust registration procedure.
• The robustness in terms of precise registration is justified by the high values of MI and SSIM. The effectiveness of the proposed method is demonstrated through a fair comparison with recently developed schemes using the same images and transformation parameters [1, 2, 10].
The rest of the paper is outlined as follows: the literature review is discussed in Sect. 2; the proposed approach for robust medical image registration is described in Sect. 3; experimental and comparison results are demonstrated in Sect. 4, followed by the conclusion and future directions in Sect. 5.
2 Literature Review
In this section, the recently developed rigid medical image registration techniques
proposed by several researchers are described.
Ayatollahi et al. [1] have presented a method for multimodal image registration based on maximized normalized MI and particle swarm optimization. Ting et al. [7] have suggested a multiresolution search optimization technique that combines quantum-behaved PSO with the Powell algorithm; determining the optimum RTP value nonetheless remains challenging. Meskine et al. [7] have presented a technique for rigid registration of point sets, which is employed to register two different clinical MR images and demonstrates its reliability and effectiveness. Dida et al. [10] have presented different optimization algorithms which were used to register MR and
CT images for recognizing the human brain. Maddaiah et al. [2] have demonstrated an image registration scheme for the alignment of medical images such as human brain scans. Abdel-Basset et al. [12] have discussed a hybrid scheme for the alignment of medical images; this method yields higher RMSE values on the tested dataset, which results in degraded alignment of the registered images. Arora et al. [13] examined medical image registration employing a hybridization of teaching learning-based optimization with affine and projective transformations and speeded up robust features. Experiments were carried out on monomodal and multimodal medical images using the whole brain ATLAS and Kaggle datasets. Lin et al. [14] have described a novel enhanced global optimization strategy for medical image registration (HPSO); this technique used medical data from the Vanderbilt database and addressed the issue of low registration accuracy. Zheng et al. [14] have discussed the
feature extraction algorithm SURF based on progressive images. This approach has several limitations: first, a time-consuming process is used to create many intermediate progressive images; second, when extracting and matching keypoints, the SURF algorithm generates mismatches in the PI-SURF registration procedure. The affine registration of images from the same and different modalities, such as CT and MR, has been presented by Bhattacharya et al. [15]. According to Pluim et al. [16],
image registration methods employ interpolation techniques and different search strategies to maximize the similarity measure. Liu et al.
[15] have addressed the teaching learning-based optimization technique which is
substantially faster than other evolutionary algorithms (EAs) in finding superior or
equal solutions. Das et al. [16] have studied the MR and CT brain modality of
images and non-linear 2-D affine registration. The affine-based image registration
technique based on mutual information (MI) has been implemented by Kosinski
et al. [17]. Powell optimization [18] technique is used for searching the optimum
transformation parameter. Liu et al. [19] have presented an outlier robust scheme
for multiple rigid transformed images for improving the robustness of registration.
Zheng et al. [20] addressed the problem of low registration accuracy in a SURF-based, progressive-image registration approach for medical images; the images were obtained from the BrainWeb database. Swathi et al. [21] demonstrated an automatic image registration model with a hybrid optimization approach, using satellite images for the alignment process. A point matching algorithm is employed to establish correspondences between the features detected by additional feature extraction utilizing E-SIFT. Using various similarity metrics such as RMSE and NCC, this scheme outperformed state-of-the-art methods.
3 Proposed Approach
With an intent to achieve novel medical image registration, a gaussian filter is employed to remove noise. Further, a rigid transformation is used to align the images, and the optimization algorithm TLBO detects the optimum values of the RTP (θ, tx, ty). MI and SSIM are used as similarity metrics to find the correlation between the reference
and registered images. TLBO employs MI maximization as the objective function. As expressed in Eq. 1, the optimum transformation gives the maximum value of the similarity metric MI; the objective function drives the optimization algorithm, which evaluates candidate transformation parameters to find those yielding the highest similarity.

T̂ = arg max_T MI[f_s(x), f_t(x)]    (1)

The reference (original) image is represented by f_s(x), where x is an image coordinate, and the floating image by f_t(x). T̂ is the optimum transformation and T represents the transformation together with its parameters. The framework of the proposed scheme is shown in Fig. 1, and the steps involved are as follows:
Step 1: The source (reference) image is pre-processed with a gaussian filter to de-noise it prior to normalization. Noise removal causes a small degradation of contrast [22]. Normalization then transforms the image data onto a uniform scale; the output of Step 1 is shown in Fig. 2.
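As an illustrative sketch of this pre-processing step (the sampled kernel radius and min–max normalization below are assumptions, since the paper does not specify the exact filter settings):

```python
import numpy as np

def gaussian_kernel_1d(sigma, radius=None):
    """Sampled 1-D Gaussian kernel, normalized to sum to 1."""
    if radius is None:
        radius = int(3 * sigma)  # assumed truncation radius
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def gaussian_filter(img, sigma):
    """Separable Gaussian blur: convolve rows, then columns."""
    k = gaussian_kernel_1d(sigma)
    rows = np.apply_along_axis(np.convolve, 1, img, k, mode="same")
    return np.apply_along_axis(np.convolve, 0, rows, k, mode="same")

def normalize(img):
    """Min-max normalization onto a uniform [0, 1] scale."""
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo)

# Noisy synthetic stand-in for the reference image
rng = np.random.default_rng(0)
noisy = rng.normal(100.0, 25.0, size=(64, 64))
pre = normalize(gaussian_filter(noisy, sigma=1.0))
```

De-noising and then normalizing, in that order, matches Step 1; the blur slightly reduces contrast, which is the degradation noted above.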
Fig. 1 Framework of proposed scheme
Fig. 2 Brain image from ATLAS database [23]: a source image, b gaussian blur image, c normalized image
Fig. 3 a Reference image, b floating image
Step 2: The pre-processed reference image from Step 1 undergoes a rigid transformation (rotation followed by translation) to generate the floating image, as illustrated in Fig. 3.
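This rigid transformation can be sketched with inverse mapping and nearest-neighbour sampling; rotating about the image centre and the sampling scheme are assumptions, as the paper does not state them:

```python
import numpy as np

def rigid_transform(img, theta_deg, tx, ty):
    """Rotate about the image centre by theta_deg, then translate by (tx, ty),
    using inverse mapping with nearest-neighbour sampling."""
    h, w = img.shape
    t = np.deg2rad(theta_deg)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    # Inverse transform: undo translation, then rotate back about the centre
    x0, y0 = xs - tx - cx, ys - ty - cy
    xs_src = np.cos(t) * x0 + np.sin(t) * y0 + cx
    ys_src = -np.sin(t) * x0 + np.cos(t) * y0 + cy
    xi = np.rint(xs_src).astype(int)
    yi = np.rint(ys_src).astype(int)
    valid = (xi >= 0) & (xi < w) & (yi >= 0) & (yi < h)
    out = np.zeros_like(img)
    out[valid] = img[yi[valid], xi[valid]]
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
floating = rigid_transform(img, theta_deg=0.0, tx=1, ty=0)  # content shifts one column right
```

Pixels mapped from outside the source grid are filled with zeros, which is one common convention for the black border visible in floating images.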
Step 3: The similarity metrics SSIM and MI measure the degree of resemblance between the reference and transformed floating image. Subsequently, TLBO is applied to determine the optimum RTP values, with MI maximization as the objective function.
Step 4: The optimum RTP values for rotation (θ) and translation (tx, ty) obtained by TLBO are applied to the floating image, resulting in the registered image shown in Fig. 4. Finally, MI is computed to validate the correlation between the reference and registered image; a high MI value indicates accurate alignment, as depicted in Fig. 1. TLBO estimates the RTP values that optimize the similarity metrics and thereby tests the reliability of the registration. If the objective function indicates that the desired accuracy is achieved, the registered image is obtained; otherwise, the transformation parameters are updated and the procedure repeats from Step 2, as explained in Fig. 1.
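The MI metric of Eq. 1 is typically computed from the joint intensity histogram of the two images; the sketch below follows that standard construction (the bin count of 32 is an assumption, not taken from the paper):

```python
import numpy as np

def mutual_information(a, b, bins=32):
    """MI[a, b] from the joint histogram: sum p(x,y) * log(p(x,y) / (p(x) p(y)))."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of image a
    py = pxy.sum(axis=0, keepdims=True)   # marginal of image b
    nz = pxy > 0                          # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(1)
ref = rng.random((32, 32))
unrelated = rng.random((32, 32))
# MI peaks when the two images are perfectly aligned (here: identical)
aligned_mi = mutual_information(ref, ref)
misaligned_mi = mutual_information(ref, unrelated)
```

TLBO then searches the RTP space for the transform whose registered image maximizes this quantity.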
Fig. 4 a Reference image, b floating image with transformation parameters (θ, tx, ty) = (4.7, 0, 0), c registered image
4 Experimental and Comparison Results
In this section, the performance of the proposed scheme is evaluated in terms of MI and SSIM, along with the visual quality of the registered images [24]. The effectiveness of the proposed scheme is demonstrated by comparing it with state-of-the-art methods [1, 2, 10].
4.1 Experimental Results
All the experiments are conducted on the Python 3.9.4 platform, using a system with 16.0 GB RAM and a 2.80 GHz Intel(R) Core(TM) i7-1165G7 CPU. The experiments are conducted on various isomodal medical database images, as depicted in Fig. 5 (a1–a6), sourced from the whole brain ATLAS database [23]. The medical brain images utilized in this study consist of CT and MRI scans with dimensions of 256 × 256 (Fig. 5, a1–a5) and 354 × 353 (Fig. 5, a6). These images are employed to evaluate and verify the effectiveness of the proposed scheme. After pre-processing, the reference (source) image undergoes a rigid transformation, consisting of a rotation by an angle θ followed by a translation T = (tx, ty), resulting in the floating (template) image. TLBO then optimizes the transformation parameters to register the floating image with the reference image.
Initially, the floating image is registered with the source image. The impact of rotating an isomodal brain image by θ = 2 followed by T = (tx, ty) = (2, 2) is shown in Fig. 5 (b1–b4). The optimum rotation and translation parameter values obtained through TLBO are applied via the rigid transformation, and the registered images are formed as depicted in Fig. 5 (Col. 3, c1–c4), corresponding to the floating images (Col. 2, b1–b4). In the second test case, the same scenario is repeated with different transformation parameter values, rotation θ = 15 and translation T = (tx, ty) = (17, 17). The registration result for these values is shown in Fig. 5 (c5), corresponding to the floating image in Fig. 5 (Col. 2, b5), along with the MI values in Table 1. The third test case involves a human brain image of dimension 354 × 353 with rotation θ = 4.7 and translation T = (tx, ty) = (0, 0). The registration result is illustrated in Fig. 5 (Col. 3, c6), corresponding to the floating image in Fig. 5 (Col. 2, b6). The MI and SSIM metric values for all isomodal images are reported in Table 1.
In all these cases, the higher MI and SSIM values for all test images, as tabulated in Table 1, explicitly indicate that the registered images are accurately aligned with the reference image with a significantly lower error rate. These results demonstrate that the TLBO algorithm outperforms other evolutionary algorithms [1, 2, 10] in isomodal medical image registration. The effectiveness of TLBO
Fig. 5 MRI and CT modalities of test images: (i) MRI (a1, a2 and a6); (ii) CT (a3, a4, a5). RTP (θ, tx, ty): a1–a4 (2, 2, 2), a5 (15, 17, 17), a6 (4.7, 0, 0)
Table 1 Experimental results of proposed scheme using different test images as shown in Fig. 5 (a1–a6)

Test image | Initial value (tx, ty, θ) | tx | ty | θ | (SSIM/MI)
(a1) | (2, 2, 2) | 1.9634 | 2.0316 | 1.9848 | (0.9643/12.087)
(a2) | (2, 2, 2) | 1.9636 | 2.0338 | 1.9866 | (0.9868/7.5647)
(a3) | (2, 2, 2) | 1.9378 | 2.0486 | 2.0003 | (0.9982/5.4848)
(a4) | (2, 2, 2) | 1.9716 | 2.0326 | 1.9939 | (0.9961/4.5934)
(a5) | (17, 17, 15) | 16.1619 | 17.9579 | 14.6466 | (0.9749/5.7530)
(a6) | (0, 0, 4.7) | 0.0290 | 0.0314 | 4.69861 | (0.9561/13.2613)
for obtaining the RTP is thus illustrated through the experimental results of the scheme described in this work.
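For reference, the error and similarity quantities compared in the next subsection can be sketched as below; the single-window SSIM is a simplification of the usual sliding-window SSIM and is shown only to make the computed quantities concrete:

```python
import numpy as np

def rmse(a, b):
    """Root mean square error between reference and registered image."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

def global_ssim(a, b, data_range=1.0):
    """SSIM computed once over the whole image (no sliding window):
    ((2*mu_a*mu_b + c1)(2*cov + c2)) / ((mu_a^2 + mu_b^2 + c1)(var_a + var_b + c2))."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return float(((2 * mu_a * mu_b + c1) * (2 * cov + c2))
                 / ((mu_a ** 2 + mu_b ** 2 + c1) * (a.var() + b.var() + c2)))

ref = np.linspace(0.0, 1.0, 64).reshape(8, 8)
registered = np.clip(ref + 0.05, 0.0, 1.0)  # hypothetical slightly mis-registered image
err, sim = rmse(ref, registered), global_ssim(ref, registered)
```

A perfectly registered image gives RMSE 0 and SSIM 1; registration improves as RMSE falls and SSIM rises, which is the criterion used in the comparison tables.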
4.2 Comparative Analysis
The performance of the proposed scheme for isomodal medical image registration is evaluated by comparing it with recently developed meta-heuristic optimization-based techniques [25]. For a fair comparison, we use the same images, parameters and initial values considered by [1, 2, 10].
The proposed scheme is compared with the existing schemes presented by Dida et al. [10], Ayatollahi et al. [1] and Maddaiah et al. [2]. The isomodal test images MR1, MR2, CT1, CT2, CT and a brain MR image are obtained from the aforementioned database [23], shown in Fig. 5 (Col. 1, a1–a6), with dimensions of 256 × 256 (354 × 353 for a6). From Tables 2, 3 and 4 it is observed that the proposed scheme's similarity metrics SSIM and RMSE are better than those of the existing meta-heuristic approaches. A higher SSIM and a minimal RMSE value indicate accurate registration of the CT and MRI images along with good visual quality, as shown in Fig. 5. This is due to TLBO obtaining the optimum transformation parameter values and to the noise reduction applied to all images. All registration experiments employ TLBO with a population size of 30 particles and a maximum of 60 iterations.
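A minimal sketch of the TLBO loop (teacher phase, then learner phase) with this population size and iteration budget; the quadratic toy objective below merely stands in for the MI cost, with an assumed optimum at RTP (2, 2, 2):

```python
import numpy as np

def tlbo_maximize(f, bounds, pop_size=30, iters=60, seed=0):
    """Minimal TLBO maximizing f over box bounds (list of (low, high) per dimension)."""
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(bounds, dtype=float).T
    dim = lo.size
    pop = rng.uniform(lo, hi, size=(pop_size, dim))
    fit = np.array([f(x) for x in pop])
    for _ in range(iters):
        # Teacher phase: move learners towards the current best solution
        teacher = pop[fit.argmax()]
        mean = pop.mean(axis=0)
        for i in range(pop_size):
            tf = rng.integers(1, 3)  # teaching factor, 1 or 2
            cand = np.clip(pop[i] + rng.random(dim) * (teacher - tf * mean), lo, hi)
            fc = f(cand)
            if fc > fit[i]:
                pop[i], fit[i] = cand, fc
        # Learner phase: learn pairwise from a random classmate
        for i in range(pop_size):
            j = (i + rng.integers(1, pop_size)) % pop_size  # classmate != i
            step = pop[i] - pop[j] if fit[i] > fit[j] else pop[j] - pop[i]
            cand = np.clip(pop[i] + rng.random(dim) * step, lo, hi)
            fc = f(cand)
            if fc > fit[i]:
                pop[i], fit[i] = cand, fc
    best = int(fit.argmax())
    return pop[best], float(fit[best])

# Toy stand-in for the MI objective, peaked at an assumed RTP of (2, 2, 2)
true_rtp = np.array([2.0, 2.0, 2.0])
obj = lambda p: -np.sum((p - true_rtp) ** 2)
best_rtp, best_fit = tlbo_maximize(obj, [(-10.0, 10.0)] * 3)
```

Each learner is a candidate RTP vector; beyond population size and iteration count, TLBO needs no algorithm-specific tuning parameters, which is the advantage cited above.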
Future research will focus on extending the proposed method to multimodal medical image registration by incorporating machine learning algorithms for better alignment of the source and target images (Tables 2, 3 and 4).
Table 2 Comparison results of rigid transformation parameters for the proposed scheme versus the existing scheme [10] on images a1–a4 in Fig. 5, where IV is the initial value and OA is the optimization algorithm

Modalities | RTP | IV | GWO | PSO | Proposed scheme | GWO | PSO | Proposed scheme
MR1 and MR2 | tx | 2 | 1.931 | 1.930 | 1.963 | 1.936 | 1.948 | 1.963
MR1 and MR2 | ty | 2 | 2.074 | 2.074 | 2.031 | 2.082 | 2.076 | 2.033
MR1 and MR2 | θ | 2 | 1.982 | 1.978 | 1.984 | 1.986 | 1.991 | 1.986
MR1 and MR2 | SSIM | – | 0.9106 | 0.9196 | 0.9653 | 0.9007 | 0.9006 | 0.9838
CT1 and CT2 | tx | 2 | 1.930 | 1.936 | 1.937 | 1.933 | 1.933 | 1.9716
CT1 and CT2 | ty | 2 | 2.074 | 2.073 | 2.048 | 2.079 | 2.068 | 2.0326
CT1 and CT2 | θ | 2 | 1.978 | 1.978 | 2.000 | 1.983 | 1.977 | 1.9939
CT1 and CT2 | SSIM | – | 0.9196 | 0.9195 | 0.9852 | 0.9191 | 0.9190 | 0.9741
Table 3 Comparison results of rigid transformation parameters of the proposed scheme with the existing scheme [1] on image a5 in Fig. 5

Algorithm | tx | ty | θ | Average RMSE
Initial value | 17 | 17 | 15 | –
Hybrid PSO | 16.0480 | 17.0133 | 14.4844 | 0.4953
Proposed scheme | 16.1619 | 17.9579 | 14.7543 | 0.1306
Table 4 Comparison results of transformation parameters of the proposed scheme with the existing scheme [2] on image a6 of Fig. 5

Algorithm | tx | ty | θ | Average RMSE
Initial value | 0 | 0 | 4.7 | –
Hybrid PSO | 0.7240 | 0.3521 | 4.3768 | 3.7858
Proposed scheme | 0.0290 | 0.0314 | 4.69861 | 0.18091
5 Conclusion
In this work, TLBO and rigid transformation techniques are employed for rigid medical image registration. The proposed scheme achieves an efficient image registration framework for isomodal medical images with good visual quality of the aligned images. TLBO is employed to acquire the optimum rotation and translation transformation parameter values by considering MI maximization as the objective function. Image registration is viewed as an optimization problem in this study, which is efficiently addressed using TLBO through maximization of MI between the reference
and registered images. Extensive experiments are performed on multiple image modalities with noise and artefacts, such as CT and MRI, to evaluate the performance of the proposed scheme. The roughly 8% higher mutual information and SSIM values obtained in the experiments demonstrate the better quality of the registered images. The robustness of the proposed scheme is illustrated by comparing it with meta-heuristic schemes on several similarity metrics. The high MI and SSIM values on different isomodal medical images (CT and MRI) indicate more accurate alignment of registered images than the recently developed schemes. The proposed scheme can be further extended to (i) multimodal image registration, (ii) deformable and 3D image registration problems in the medical domain and (iii) the inclusion of machine learning and other meta-heuristic techniques to enhance performance.
References
1. Ayatollahi F, Shokouhi SB, Ayatollahi A (2012) A new hybrid particle swarm optimization for
multimodal brain image registration. J Biomed Sci Eng 05(04):153–161
2. Maddaiah PN, Pournami PN, Govindan VK (2014) Optimization of image registration for
medical image analysis. Int J Comput Sci Inf Technol 5(3):3394–3398
3. Guan S-Y, Wang T-M, Meng C, Wang J-C (2018) A review of point feature based medical image registration. Chinese J Mech Eng [Internet]. 31(1):76–92. Available from: https://doi.org/10.1186/s10033-018-0275-9
4. Alam F, Rahman SU, Ullah S, Gulati K (2018) Medical image registration in image guided surgery: issues, challenges and research opportunities. Biocybern Biomed Eng [Internet]. 38(1):71–89. Available from: https://doi.org/10.1016/j.bbe.2017.10.001
5. Wan Y, Hu H, Xu Y, Chen Q, Wang Y, Gao D (2017) A robust and accurate non-rigid medical
image registration algorithm based on multi-level deformable model. Iran J Public Health
46(12):1679–1689
6. Ting-Ting P, Ji Z (2016) Research on medical image registration based on QPSO and Powell algorithm. In: Proceedings—14th international symposium on distributed computing and applications for business, engineering and science, DCABES, pp 316–319
7. Viergever MA, Maintz JBA, Klein S, Murphy K, Staring M, Pluim JPW (2016) A survey of
medical image registration—under review. Med Image Anal 33(1):140–144
8. Meskine F, Almhdie-imjabber A (2012) A feature point based image registration using genetic
algorithms. Mediterranean Telecommun J 2(2):148–153
9. Dida H, Charif F, Benchabane A (2020) A comparative study of two meta-heuristic algorithms for MRI and CT images registration. In: 3rd international conference on information and communications technology, ICOIACT, pp 411–415
10. Chen YW, Mimori A (2009) Hybrid particle swarm optimization for medical image registration. In: 5th international conference on natural computation, ICNC, 6:26–30
11. Mani VRS, Arivazhagan S (2013) Survey of medical image registration. J Biomed Eng Technol
1(2):8–25
12. Abdel-Basset M, Fakhry AE, El-henawy I, Qiu T, Sangaiah AK (2017) Feature and intensity based medical image registration using particle swarm optimization. J Med Syst 41(12)
13. Arora P,Mehta R, Ahuja R (2023) An adaptive medical image registration using hybridization of
teaching learning-based optimization with affine and speeded up robust features with projective
transformation. Cluster Comput 3
14. Lin CL, Mimori A, Chen YW (2012) Hybrid particle swarm optimization and its application
to multimodal 3D medical image registration. Comput Intell Neurosci 8
15. Liu S, Mernik L (2012) A note on teaching—learning-based optimization algorithm 212:79–93
16. Das A, Bhattacharya M (2011) Affine-based registration of CT and MR modality images of
human brain using multiresolution approaches: Comparative study on genetic algorithm and
particle swarm optimization. Neural Comput Appl 20(2):223–237
17. Kosi´nski W, Michalak P, Gut P (2012) Robust image registration based on mutual information
measure. J Signal Inf Process 03(02):175–178
18. Arora S, Rani R, Saxena N (2022) An efficient approach for detecting anomalous events in
real-time weather datasets. Concurrency Comput: Practice Experience 34(5):1–15
19. Liu D, Mansour H, Boufounos PT (2019) Robust mutual information-based multi-image
registration. IEEE Int Geosci Remote Sens Sympos 915–918
20. Zheng Q, Wang Q, Ba X, Liu S, Nan J, Zhang S (2021) A medical image registration method
based on progressive images. Comput Math Methods Med Hindawi 2021:1–10
21. Swathi R, Srinivas A (2020) An improved image registration method using E-SIFT feature descriptor with hybrid optimization algorithm. J Indian Soc Remote Sens [Internet]. 48(2):215–226. Available from: https://doi.org/10.1007/s12524-019-01063-w
22. Kumar N, Nachamai M (2017) Noise removal and filtering techniques used in medical images.
Oriental J Comput Sci Technol 10(1):103–113
23. Keith AJ, Alex Becker J (2008) The whole brain ATLAS [Internet]. Harvard University. Available from: https://www.med.harvard.edu/aanlib/home.html
24. Wachowiak MP, Smolíková R, Zheng Y, Zurada JM, Elmaghraby AS (2004) An approach to
multimodal biomedical image registration utilizing particle swarm optimization. IEEE Trans
Evol Comput 8(3):289–301
25. Cocianu CL, Stan A (2019) New evolutionary-based techniques for image registration. Appl
Sci (Switzerland) 9(1)
Study of Cyber Threats in IoT Systems
Abir El Akhdar, Chafik Baidada, and Ali Kartit
Abstract Over the years and in a post-industrial economy, information has evolved
from being a simple financial and operational gauge in companies to becoming the
key to a more intertwined world. By exchanging an immense amount of informa-
tion, the Internet of Things has resulted in a revolutionary change in lifestyles, with
widespread usage across an array of industries and applications, spanning agricul-
ture, smart houses, health care, and beyond. However, this exponential growth has
also brought about unprecedented management and security challenges. One such
challenge pertains to scalability, whereas another involves the confidentiality and
security of gathered data. Motivated by the lack of inclusive studies of cyber risks
in the IoT world, we present in this article a comparative examination of the threats
related to IoT systems. The study aims to identify the most perilous cyber threats,
based on eight relevant parameters in the world of computer security. Ultimately,
this study will help practitioners comprehend current issues and examine fresh and
attractive research opportunities in the landscape of IoT security.
Keywords Internet of Things ·Security ·Cyber threats ·DDoS ·Malware ·
Overview
1 Introduction
“In the real world, things matter more than ideas” [1], stated Kevin Ashton in the RFID Journal in June 2009. This well-known British technology pioneer and co-founder of the Auto-ID Laboratory at MIT believes that we must provide computers with their own ways of collecting information so they may experience the sights, sounds, and smells of the outside world. In this regard, he conceived the term “Internet of Things” to describe the technological principle where machines
A. E. Akhdar (B)·C. Baidada ·A. Kartit
LTI Laboratory, University of Chouaib Doukkali, National School of Applied Sciences, 5096, 24002 El Jadida, Morocco
e-mail: elakhdar.abir@ucd.ac.ma
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_25
329
330 A. E. Akhdar et al.
can watch, recognize, and comprehend the environment thanks to sensor technolo-
gies, free from the constraints imposed by manually inputting data. According to
Kevin, the term “IoT” is used to describe a system in which the Internet is linked
to the “real world” via a pervasive network of data sensors [2]. This lends credence
to a broader definition of the IoT as a new paradigm that enables a collection of
physical nodes to communicate across a network utilizing remotely linked devices or objects (e.g., smartphones, smart cameras, sensors). Today, various industries
such as health care, corporate analytics, smart cities, and smart agriculture, among
others, are becoming increasingly aware of the potential benefits of utilizing and
implementing real-time data in their day-to-day operations. By providing a variety
of solutions that can improve people’s lives, this ever-growing web of intercon-
nected devices via the Internet, is playing a significant role in various applications
and sectors. For instance, IoT enhances the quality of agriculture [3] by utilizing
sensors to collect environmental and machine metrics. The data may aid farmers in
improving practically every aspect of their labor, from raising cattle to growing crops.
IoT is a multifaceted idea that encompasses a wide range of solutions, services, and
norms. It is broadly believed to be the cornerstone of the Information and Commu-
nication Technology sector for at least the next 10 years [4]. Despite relatively low
adoption rates in households and retail settings, the number of IoT devices in use surged from 8.7 billion in 2012 to 50.1 billion in 2018 [5], resulting in expenditure likely worth a trillion dollars a year. As scale grows, the value of the data that is gathered, processed, and transported also grows, as do the attacks that target it. In other words, these estimates indicate that the quantity and severity of threats and attacks against these embedded devices will increase, mandating stronger security measures. In actuality, several security attacks, including botnet attacks using Distributed Denial of Service (DDoS), malware, and spam attempts, target the networked devices inside massive IoT networks.
In this paper, we cast light upon IoT systems and try to uncover the main threats menacing their security. To this end, the article is split into six sections. In Sect. 2, we describe the most common faults in IoT devices that could compromise a network's security. Section 3 portrays a comparative analysis of prevalent cyber-attacks in IoT systems using eight key parameters relevant to computer security: cost, visibility, persistence, exploitability, likelihood, reputational damage, targeted data, and layer of attack. We present related work regarding IoT security in Sect. 4. The outcomes of this research are condensed in Sect. 5. We ultimately deduce our conclusions in Sect. 6.
2 Security Shortcomings in IoT Devices
Challenges and shortcomings can overlap in the IoT world. Shortcomings are limita-
tions or weaknesses that most probably affect the system’s outcome, or more specifi-
cally, its quality of outcome, whereas challenges, in a more literal sense, are difficul-
ties encountered when building the system and getting it to function properly. These
latter factors may or may not affect the outcome. In [6], Zhang et al. presented a few adversities concerning the security of IoT, the main ones being heterogeneity and scalability, and proceeded to discuss this large spectrum of security concerns in more detail. Other challenges include, inter alia, privacy, lightweight cryptosystems, and backdoor analysis. Similar to [6], Hameed et al. [7] delivered a synopsis of security challenges in IoT environments with regard to services, protocols, technologies, and
challenges in IoT environments in regards to services, protocols, technologies, and
frameworks. The study outlines the security requirements such as confidentiality,
reliable routing, resilient administration, and intrusion detection. The challenges
associated with IoT were investigated with respect to the multiple architectural tiers,
encompassing, for instance, the sensory, network, and application layers and their
intercommunications by Alfaqih and Al-Muhtadi [8], Varshney et al. [9] and Sain et al.
[10]. Error tolerance, access control, authentication, and confidentiality are just a few
of the crucial topics that are covered. The literature does not, however, provide guid-
ance on the amalgamation of these requirements in a manner that would guarantee
the safety of IoT systems. Oh and Kim [11] analyzed security challenges and needs for IoT environments based on six pivotal components of IoT, which they considered to be the network, the user, the attacker, the cloud, the service, and the platform.
Noura et al. [12] provided a comprehensive taxonomy of IoT interoperability which
includes device, syntactical, network, semantic, and platform interoperability. In this
paper, [12] showcased heterogeneity as the root issue to all interoperability chal-
lenges in the IoT world. Frequently cited as fundamental security needs for an IoT
system are authorization, privacy, authentication, identity, and access management
as well as reliance, as stated in some sources (e.g., [13]). Some other general security
needs, including network, bootstrapping, and application security, setup,
information integrity, firewalling, malware protection, cryptographic algorithms, and
secure networking, are addressed in a limited number of studies (e.g., [8, 14, 15]).
These are what we refer to as generic requirements because they are essential for
the majority, if not all, application areas. In [16], Kammara stated that IoT devices
are vulnerable to cyber-attacks due to a variety of factors. Some of these result
from straightforward financial choices made by IoT device manufacturers to cut
costs, while others are brought on by the diverse complexity of IoT systems. Table 1
portrays some of the most crucial security shortcomings in the IoT environment.
We maintain that in an IoT environment, privacy, for example, is considered both a
challenge and a requirement: it constantly raises issues yet should be met as thoroughly
as possible. Nevertheless, IoT systems still fall short when it comes to fully securing
data transmission, owing, for instance, to resource constraints.
Elementally, security requirements should be answered in all tiers of the IoT tech-
nology stack in an onion-like approach where defenses are implemented in redun-
dancy. Accordingly, a number of researchers have identified the security needs for
IoT devices. In Table 2, we differentiate two groups of security needs [17] for IoT
systems in the literature.
332 A. E. Akhdar et al.
Table 1 Common security shortcomings in IoT

Re-use of code: Nearly all IoT manufacturers recycle parts of the code, such as communication and authentication protocols, that are publicly available online. This practice gives an attacker access to the whole platform through a single component, turning it from the holy grail into a poisoned chalice. The term for this incident is BOBE (break once, break everywhere).

Lack of high-quality code: Most IoT devices run out-of-date code. Additionally, it is normal for a manufacturer to compile software from several online sources and then write some patches to assemble everything. The majority of it is spaghetti code, making maintenance challenging unless a significant amount of effort and money is invested. The majority of IoT makers do not spare a thought for user security.

Lightweight cryptosystems: The computing power of IoT devices is constrained. A strong, modern encryption technique requires more resources than IoT devices can offer, so even though superior encryption systems are available, they cannot be installed on IoT devices. Additionally, it has been noted that man-in-the-middle (MITM) attacks may be carried out against the Bluetooth protocol, which is used by the majority of smartwatches. Bluetooth secure simple pairing is the target of MITM attacks, and it has been noted that the device's capabilities affect the security of the Bluetooth protocol.

Heterogeneous platforms: The varied complexity in key areas restricts the potential of the IoT. The IoT ecosystem is complicated because it includes objects from many manufacturers with a variety of software packages, which makes managing the ecosystem challenging. Security software created for one platform may not function on equipment running another, and even management and orchestration solutions struggle to support all of the northbound and southbound interfaces used by the IoT. Building a universal fix that works for all the platforms in this complex ecosystem is expensive and challenging.

Lack of security standards or guidelines: Everyone must rely on best practices and/or suggestions due to the lack of security standards, and because IoT infrastructure is made up of numerous small, frequently inexpensive endpoint devices and sensors, it is easy to underestimate the hazards involved.
Table 2 IoT security needs

Basic or standard security needs: confidentiality, authentication, authorization, accessibility, identity and access management, integrity, reliance, liability, key control, user-friendliness, etc.

Potential security needs: scalability, attack resistance, privacy and identification, secure data management, geolocation privacy, power efficiency, identity protection, fault tolerance, data currency and real time, decentralization, quality of service, reliability, portability, load balancing, etc.
3 Comparative Study of Cyber-Attacks in IoT Systems
3.1 Potential Security Attacks and Associated IoT Layer
The many levels of IoT architecture could be employed to cluster security concerns
and attacks in IoT systems. These assaults can be divided into four categories [18],
which are physical attacks, network attacks, software attacks, and data attacks.
Physical attacks can arise when an attacker is physically near an IoT network or
device, whereas network attacks occur when an attacker specifically targets an IoT
network system to wreak havoc. Software attacks are carried out through targeting
the exploitable vulnerabilities, such as defects delivered by an IoT application or the
software itself. Meanwhile, data intrusions take place when a perpetrator uses the
IoT to carry out an attack and cracks the encryption. They target the data that IoT
devices will handle in order to maintain communication across various IoT nodes.
In Table 3, we use two terms to differentiate two overlapping aspects: the form and
the type of attack. In the literature, form refers to the structure of an object, while
type refers to a category or class of things that have similar characteristics. We believe
that in general, form is concerned with the overall shell, while type is concerned with
the underlying characteristics and common traits. A form of a cyber-attack refers to
the macro-class of the attack or its aim, such as botnets, malicious scripts, malware,
and data breach. A type of a cyber-attack, on the other hand, denotes the vector or
technique used to execute the attack, such as brute force, worm, and IP spoofing.
So, form is about “what” the attack is about, while type is about “how” the attack
is carried out. It is essential to note that the language and phrasing used in these
definitions reflect our own unique style and perspective on the subject matter.
As the prevalence of IoT devices continues to grow, the threat of cyber-attacks
increases. Large enterprises are particularly vulnerable to such attacks due to their
reliance on a complex network of interconnected devices. It is therefore essential
that organizations take steps to protect themselves and their assets by assessing the
severity of potential IoT cyber-attacks. In this brief assessment, we rate the various
types of cyber assaults and their severity in the context of large organizations.
To gauge the severity of each attack, we set a number of criteria. For this initial
effort, we assume that all criteria have the same weight in rating an attack’s ferocity:
1. Cost: The extent of damage caused by the attack in terms of financial loss
(recovery expenses, repair costs, reputational harm, and possibly legal ramifi-
cations).
2. Visibility: Ease of detection and response; refers to the speed and effectiveness
of detecting the attack and implementing a response.
3. Persistence: Refers to the ability of a threat actor to covertly keep long-term
access to systems despite disturbances like restarts or changed credentials.
4. Exploitability: The degree to which a cyber-attack can serve as a setup for another one.
5. Layer of attack: Refers to the various levels of security that could be targeted
in an IoT system. These layers include physical, network, and software security
plus encryption. While some attacks only target one layer, some may target all
four, making them far more hazardous.
6. Likelihood: Refers to the probability that a threatening event might occur.
7. Targeted Data: Whether or not the primary goal of the attack is data breach, and
the level of confidentiality, integrity, or availability of the data if it was targeted
by the attack.
8. Reputational damage: Cyber-attacks can harm a business' reputation and undermine
the trust of its customers. Loss of clients and sales are two possible
consequences of this.

Table 3 Weighting of the criteria by type of attack
(Criterion order per row: Cost, Visibility, Persistence, Exploitability, Likelihood, Reputational damage, Targeted data, Layer of attack; the final column is the resulting Severity)

Botnet attacks
  Brute force              2 1 2 3 2 3 3  3 (Software)                    19
  DDoS                     4 4 1 3 4 4 2  4 (Software/Network)            26
  Spam and phishing        3 2 2 4 3 3 3  3 (Software)                    23
  Device bricking          4 3 4 0 1 4 0  3 (Physical)                    19
Malicious scripts
  Cross-site scripting     2 3 2 3 2 3 1  4 (Software/Physical)           20
  Cross-channel scripting  2 3 3 3 1 3 1  4 (Software/Physical)           20
  SQL injection            2 3 3 4 2 3 1  4 (Software/Physical)           22
  Remote Code Execution    2 3 4 3 2 3 2  4 (Software/Physical/Network)   23
Malware
  Hardware Trojan          2 4 4 3 1 4 2  3 (Physical)                    23
  Trojan horse             2 4 3 3 3 3 3  3 (Software)                    24
  Ransomware               4 3 4 4 3 4 3  3 (Software)                    28
  Backdoors                3 4 4 3 2 3 3  3 (Software)                    25
  Virus                    4 2 2 3 3 3 3  3 (Software)                    23
  Worm                     4 2 3 4 3 3 3  4 (Software/Network)            26
  Spyware                  2 4 3 2 3 3 4  3 (Software)                    24
RFID attacks
  RFID spoofing            2 1 1 1 1 3 1  4 (Physical/Network)            14
Routing information
  Spoofing                 1 3 1 2 2 3 3  3 (Network)                     18
  Routing table poisoning  2 3 2 3 2 3 2  3 (Network)                     20
  Man in the Middle        2 4 2 2 3 3 3  4 (Network/Encryption)          23
Data breach
  Password guessing        2 1 2 2 2 2 3  4 (Software/Encryption)         18
Shared technologies and cloud
  Hypervisor attacks       3 2 3 4 1 4 2  4 (Software/Encryption)         23

Severity status: mild level of severity (0–7), average level of severity (8–15), high level of severity (16–23), advanced level of severity (24–32)
Based on these factors, a comprehensive assessment can be performed to deter-
mine the overall severity of the cyber-attack and guide the response accordingly.
Each criterion is assigned a value on an evaluation scale with five modalities, each
mapped to a numeric value: "none" (0), "low" (1), "medium" (2), "high" (3), and
"extremely high" (4). The summation of the values of the criteria yields the ultimate
degree of severity of each attack. Table 3 lists each criterion's state for each attack.
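The scoring scheme reduces to a few lines of code. The sketch below uses three rows taken from Table 3; the function and variable names are illustrative only, not part of the paper's method.

```python
# Severity scoring sketch: eight equally weighted criteria, each rated on the
# five-modality scale (0 = none .. 4 = extremely high); the sum is the severity.

CRITERIA = ["cost", "visibility", "persistence", "exploitability",
            "likelihood", "reputational_damage", "targeted_data",
            "layer_of_attack"]

# Example rows from Table 3 (criterion values in the order above).
ATTACKS = {
    "DDoS":          [4, 4, 1, 3, 4, 4, 2, 4],
    "Ransomware":    [4, 3, 4, 4, 3, 4, 3, 3],
    "RFID spoofing": [2, 1, 1, 1, 1, 3, 1, 4],
}

def severity(scores):
    """Sum the equally weighted criterion values (range 0-32)."""
    assert len(scores) == len(CRITERIA)
    return sum(scores)

def band(total):
    """Map a severity total onto the four bands used in Table 3."""
    if total <= 7:
        return "mild"
    if total <= 15:
        return "average"
    if total <= 23:
        return "high"
    return "advanced"

for name, scores in ATTACKS.items():
    print(f"{name}: {severity(scores)} ({band(severity(scores))})")
# DDoS: 26 (advanced)
# Ransomware: 28 (advanced)
# RFID spoofing: 14 (average)
```

Because every criterion carries weight 1, the totals reproduce the Severity column of Table 3 directly.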
3.2 Discussion
In this part, we underline the insights acquired from the aforementioned comparison
of an array of cyber-attacks in IoT environments. The work presented in Subsect. 3.1
classifies the potential threats to IoT security with regard to their severity using several
criteria (e.g., cost, persistence, likelihood). As we observe in Tables 3 and 4, the
least severe form of attacks is RFID attacks with a score of 14. Generally speaking,
the financial loss caused by RFID attacks is low compared to other types of cyber-
attacks, such as data breaches or malware attacks. This is because RFID systems are
typically used to track and manage physical assets, and the data they contain is often
not as valuable as the data stored in other systems. However, the financial loss caused
by RFID attacks can still be significant. For instance, if a malevolent agent is able to
attain entry to an RFID system, they may be able to steal physical assets or manipulate
the data stored in the system. This could lead to significant financial losses, such as
lost revenue or increased costs associated with replacing stolen assets. Moreover,
they are relatively easy to detect, less exploitable, less persistent, and less likely to
happen due to the fact that they rely on the physical interaction between the attacker
and the target device, whereas, all other cyber threats, including malicious scripts,
routing attacks, data breach and shared technologies and cloud attacks, scored high in
matters of severity. Meanwhile, Distributed Denial of Service (DDoS) and malware
topped the list with some of the highest scores, classifying them as advanced attacks
in terms of severity.
Drawing upon the findings derived from Table 3, it can be inferred that DDoS
and malware are the most violent and feared attacks in IoT environments because
of their persistence, exploitability, cost, visibility, and likelihood.

Table 4 Potential security attacks and associated IoT layer

Perception layer
- Botnet attacks (severity: high; DDoS is advanced). Description: automated computer programs that operate through the internet, known as bots. Common types: brute force attack, DDoS attack, spam and phishing, device bricking (e.g., Mirai botnet, dictionary attacks, credential stuffing, mass email spam campaigns). Possible vulnerabilities: vulnerabilities in infrastructure, negligence on the part of humans, infection with malware, unpatched software, weak passwords, remote access tools, IoT devices.
- Malicious scripts (severity: high). Description: script code fragments that target machines in order to make them vulnerable; a variety of protocols, such as the file transfer protocol, can be used to inject scripted content into networked devices. Common types: cross-site scripting, cross-channel scripting. Possible vulnerabilities: operating system bugs, insecure file uploads or downloads, insecure devices, insecure network protocols such as FTP or Telnet.
- Malware (severity: advanced). Description: short for malicious software; any type of software designed to harm, damage, or disrupt computer systems, networks, or devices. Common types: ransomware, Trojan horse, backdoors (e.g., ILOVEYOU, Mydoom). Possible vulnerabilities: device hacking, human errors, insecure networks, social engineering tactics, insufficient security training, unpatched or outdated software, backdoors or other vulnerabilities in operating systems or network devices.

Network layer
- RFID attacks (severity: average). Description: attacks that target RFID technology, which is used to identify people and objects and transmit data through radio waves in the network (i.e., Bluetooth). Common types: RFID spoofing, node jamming, RFID skimming. Possible vulnerabilities: data tracking, weak physical security of RFID tags (such as easily accessible tag readers), data corruption and deletion, lack of authentication in RFID systems, weak or absent encryption in RFID communication.
- Routing information attacks (severity: high). Description: the objective is to manipulate router messages by blocking, replaying, or spoofing them, thereby altering their content and attributes. Common types: spoofing, routing table poisoning. Possible vulnerabilities: data alteration and corruption, lack of secure routing protocols or configurations, insider threats or compromised network devices that can be used to manipulate routing information.

Middleware layer
- Malicious code injections (severity: high). Description: by injecting malicious scripts into IoT nodes, attackers can gain control over the operation process and data flow between the nodes. Common types: SQL injection attacks in databases, cross-site scripting. Possible vulnerabilities: operating system bugs, insecure server configurations or environments, unpatched software, unvalidated or unsanitized user input in applications, insecure devices.
- Remote Code Execution (RCE) (severity: high). Description: the exploitation of software vulnerabilities through the injection of malicious code into the input stream, aimed at compromising targeted programs. Common types: out-of-bounds write attacks, injection attacks. Possible vulnerabilities: malicious malware downloaded by the host, buffer overflow, deserializing untrusted data.

Application layer
- Data breach (severity: high). Description: theft of data and its alteration without authorization or knowledge on the part of the user. Common types: password guessing, recording keystrokes, phishing, malware. Possible vulnerabilities: human errors, weak or easily guessed passwords, poor security policies, unpatched or outdated software or hardware.
- Shared technologies and cloud attacks (severity: high). Description: cybersecurity threats that exploit vulnerabilities in shared resources and cloud computing environments; they can cause multiple security problems: availability, authorization, identification, access control. Common types: attacks targeting hypervisors, side-channel attacks. Possible vulnerabilities: insufficient device management, insecure APIs or weak authentication mechanisms, third-party vulnerabilities, direct hacking, insider threats or other compromised accounts or access keys, insecure cloud systems, misconfigured cloud servers or storage buckets.

DDoS attacks are particularly worrisome because they are often the precursor to other types of attacks,
including malware. Furthermore, they pose a critical threat to the availability of IoT
systems, which is critical for ensuring the seamless functioning of connected devices
and services. IoT devices have constrained storage and computational resources,
making IoT DDoS assaults considerably harder to protect against than traditional
DDoS attacks. Based on design faults in the IoT device’s firmware or defects in
communication protocols, an attacker can quickly create malicious messages. IoT
devices, on the other hand, may be employed as a potent DDoS attack helper in
addition to being a target of DDoS assaults. They are most often placed on networks
that are not monitored for the attack, allowing attackers easy access. Additionally,
in most cases, the network they reside on offers a high-speed connection that allows
for a large amount of DDoS attack traffic. Moreover, DDoS attacks have become
increasingly common and sophisticated, with attackers employing innovative tech-
niques to bypass the traditional defense mechanisms of IoT systems. Therefore, it
is imperative to acquire a comprehensive grasp of the nature of DDoS attacks and
their impact on IoT environments. Malware, on the other hand, can cause persistent
damage to a device and jeopardize the security of an entire network. The propagation
of malware through diverse channels can remain undetected for extended periods,
thereby impeding prompt identification and mitigation efforts. The restricted compu-
tational capabilities and memory allocation in IoT devices present a significant chal-
lenge to antivirus software and other security tools, further complicating malware
detection and removal. Furthermore, an infected IoT device can facilitate numerous
malicious activities, such as data theft, device hijacking, malware proliferation, and
DDoS. By converting numerous IoT devices into bots, malware can enable attackers
to overwhelm a target network or server by generating massive traffic. Attackers can
also leverage malware to exploit security loopholes, such as default passwords or
unpatched vulnerabilities in IoT devices, providing unauthorized access to devices,
and incorporating them into a botnet. With the appropriate countermeasures, each
of these attacks may be stopped, but such countermeasures remain tailored solutions
closely tied to the features of the relevant IoT environment. As a result, they do not
offer universal or adaptable solutions applicable to many other situations.
Figure 1 showcases a comparative graphical representation of the most severe
cyber-attack types, selected to exemplify their respective cyber-attack form classes
as worst-case scenarios.
Fig. 1 Comparative analysis of cyber-attack forms' severity (bar chart of severity scores, scale 0–30, for botnet attacks, malicious scripts, malware, RFID attacks, routing information attacks, data breach, and shared technologies and cloud attacks)
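For programmatic use, the layer-to-attack mapping collected in Table 4 can be held in a small lookup structure. The sketch below is illustrative only (the names and the shape of the structure are ours, not from any library); it records just the form-to-severity pairs from the table.

```python
# Attack forms and their severity labels, grouped by the IoT layer they
# target, as collected in Table 4.
LAYER_ATTACKS = {
    "perception": {
        "botnet attacks": "high",      # DDoS variants rate "advanced"
        "malicious scripts": "high",
        "malware": "advanced",
    },
    "network": {
        "RFID attacks": "average",
        "routing information attacks": "high",
    },
    "middleware": {
        "malicious code injections": "high",
        "remote code execution": "high",
    },
    "application": {
        "data breach": "high",
        "shared technologies and cloud attacks": "high",
    },
}

def layers_exposed_to(form):
    """Return every layer whose attack list contains the given form."""
    return [layer for layer, forms in LAYER_ATTACKS.items() if form in forms]

print(layers_exposed_to("malware"))              # ['perception']
print(LAYER_ATTACKS["network"]["RFID attacks"])  # average
```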
4 Related Works
Since 2020, the number of reviews on IoT security has been significantly rising [19].
The present section outlines the up-to-date state of the art in IoT security. In [17], Pal
et al. explore potential IoT threats and attacks with reference to a variety of appli-
cation scenarios. They divide the potential threats and attacks into five categories
according to the characteristics of the IoT. These include users (attacks targeting the
human being, e.g., identity spoofing, malicious users, phishing), devices and services,
communications (e.g., SYN flooding, pharming, eavesdropping), mobility (attacks
targeting the users’ mobility, e.g., device tracking, data breach), and resource inte-
gration (attacks targeting heterogeneous infrastructures, e.g., malicious node manip-
ulation). They mostly cover IoT security needs in this article and only examine IoT
security threats on a broad scale. On the other hand, Hassija et al. [20] present a
more detailed overview of current IoT security technology, as they provide a list of
several IoT applications, along with associated security and privacy concerns as well
as a thorough breakdown of the many danger vectors found inside the various IoT
layers. The five levels of the IoT architecture are used to classify potential attacks,
after which each assault is briefly described in reference to each tier. Alfaqih and
Al-Muhtadi [8] briefly discuss IoT security issues in wireless sensor networks, as
they are considered the backbone of the IoT (e.g., encryption gateway node, business
authentication, capacity against DoS). In [21], Abomhara and Køien make an effort
to categorize different threat categories, as well as to assess and define hackers and
attacks against IoT services and devices. In addition, this article attempts to show-
case the actors’ motivation of attack and capacities driven by the distinctive features
of cyberspace. They do not, however, categorize IoT threats according to specific
criteria or groups. Unlike [21], Driss et al. [18] classify security threats and attacks
in IoT environments in reference to the different tiers of the IoT architecture. To this
end, the paper distinguishes four classes of IoT attacks: physical attacks, software
attacks, network attacks, and data attacks, yet fails to provide a full examination of each.
Sengupta et al. [22] classify key assaults on IoT systems based on the objects of
assault and assign them to one or more tiers of the architecture. This survey delves
deeper into the four types of attacks previously mentioned, as well as a review of
countermeasures in the literature to combat each of these assaults. Likewise, Atlam
and Wills [23] highlight the four classes of attack in their study; in a similar way,
Deogirikar and Vidhate [24] map the four types of attacks (physical, network, software,
and encryption attacks) to the three layer IoT architecture and briefly describe each
in two to three lines. Following that, attacks from each category are shortlisted as
“severe” based on their low detection probability and capacity to affect the network.
Nawir et al. [25] and Sadhu et al. [26] discuss the taxonomy of security attacks in IoT
environments and investigate network security issues in the sectors of smart homes,
health care, and transportation. They arrange the IoT attacks into eight categories.
The communication stack protocol or layer-based attacks are one example (i.e., phys-
ical attacks, data link attacks, network attacks, transport attacks, application attacks).
The other seven classes are device property, strategy, access level, location, protocol-
based, host-based, and information damage level. In [27], Mrabet et al. propose a
new IoT architecture inspired by the five layer design. This structure consists of a
tangible sensing tier, a network and protocol tier, a transportation tier, an application
tier, and cloud services. In this paper, they address underlying technologies, threats
to security, and countermeasures based on the suggested architecture. A few of them,
Shafiq et al. [28] for instance, also categorize various types of threats in IoT according
to the target’s attributes, the threat vector, and the attack method.
5 Contributions of This Work
Based on the foregoing analysis, we have noted the lack, in the literature, of an inclusive
approach to tangibly rate the severity of IoT cyber-attacks. We observed that the
majority of IoT security studies focus on the IoT design, while some others use the
IoT's characteristics to extract potential threats and security issues. We hold that
the existing approaches for classifying IoT attacks often rely on general-purpose
taxonomies that fail to capture the unique features of IoT systems. This is where
a more tangible approach that takes into consideration both IoT architecture and
characteristics becomes necessary. Hence, we have introduced a set of criteria of
equal weight to gauge the ferocity of the most prevalent IoT security menaces while
mapping each to its concerned IoT layer. It bears repeating that when selecting criteria
for any decision-making process, it is common for some factors to hold more weight
or importance than others. However, it can be challenging to accurately determine
the relative weights of each criterion without further investigation and analysis. In
situations where the weights are unknown or uncertain, a common approach is to
assign equal weights to all criteria to avoid bias and ensure fairness. While this
approach may not accurately reflect the true weight of each criterion, it provides a
baseline for comparison and allows for further investigation and refinement of the
decision-making process.
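The effect of the unknown weights can be made concrete: with non-uniform weights, the same criterion values may produce a different ranking. The criterion vectors below are the DDoS and ransomware rows of Table 3; the emphasized weight vector is invented purely for illustration.

```python
# Equal-weight vs. a hypothetical non-uniform weighting of the eight criteria
# (order: cost, visibility, persistence, exploitability, likelihood,
#  reputational damage, targeted data, layer of attack).

ddos       = [4, 4, 1, 3, 4, 4, 2, 4]   # severity 26 under equal weights
ransomware = [4, 3, 4, 4, 3, 4, 3, 3]   # severity 28 under equal weights

equal    = [1, 1, 1, 1, 1, 1, 1, 1]
# Hypothetical scheme that triples the weight of visibility and likelihood:
emphasis = [1, 3, 1, 1, 3, 1, 1, 1]

def weighted(scores, weights):
    """Weighted severity: dot product of criterion values and weights."""
    return sum(s * w for s, w in zip(scores, weights))

# Equal weights reproduce Table 3: ransomware (28) outranks DDoS (26).
print(weighted(ddos, equal), weighted(ransomware, equal))        # 26 28
# Under the emphasized scheme the ranking flips: DDoS 42, ransomware 40.
print(weighted(ddos, emphasis), weighted(ransomware, emphasis))  # 42 40
```

This is exactly why equal weights are a reasonable default when the true weights are unknown: any non-uniform choice would have to be justified, since it can reorder the attacks.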
6 Conclusion
In the world of IoT, attackers can exploit a wide palette of vulnerabilities to launch
sophisticated attacks that compromise the security and integrity of the entire system.
From exploiting software and firmware weaknesses to intercepting and manipulating
data transmission, these attacks can cause significant harm to organizations and indi-
viduals alike. Despite the numerous challenges, there are effective countermeasures
that can help mitigate the risks associated with these threats. However, there is no ulti-
mate solution when it comes to IoT security, and practitioners must carefully evaluate
the risks and threats in their particular setting to develop effective countermeasures.
That said, this article presents a comprehensive study of recent cyber threats to IoT
systems and identifies DDoS and malware attacks as the most severe. These types
of attacks can cause significant disruptions, compromise sensitive data, and lead to
serious financial losses. To this end, the proposal of predefined and reusable modules,
like microservices, is encouraged.
References
1. Ashton K (2009) That ‘internet of things’ thing. RFID J 22(7):97–114
2. Corcoran P (2016) The internet of things: why now, and what’s next? IEEE Consumer Electron
Mag 5:63–68
3. Farooq MS, Riaz S, Abid A, Abid K, Naeem MA (2019) A survey on the role of IoT in
agriculture for the implementation of smart farming. IEEE Access 7:156237–156271
4. IoT connected devices worldwide 2019–2030. Statista. https://www.statista.com/statistics/1183457/iot-connected-devices-worldwide/
5. Koohang A, Sargent CS, Nord JH, Paliszkiewicz J (2022) Internet of things (IoT): from
awareness to continued use. Int J Inf Manage 62:102442
6. Zhang Z-K et al (2014) IoT security: ongoing challenges and research opportunities. In: 2014
IEEE 7th international conference on service-oriented computing and applications 230–234
(IEEE, 2014). https://doi.org/10.1109/SOCA.2014.58
7. Hameed S, Khan FI, Hameed B (2019) Understanding security requirements and challenges
in internet of things (IoT): a review. J Comput Netw Commun 2019:e9629381
8. Internet of Things Security based on Devices Architecture. https://www.researchgate.net/publication/326160395_Internet_of_Things_Security_based_on_Devices_Architecture
9. Varshney T, Sharma N, Kaushik I, Bhushan B (2019) Architectural model of security threats
and their countermeasures in IoT. In: 2019 international conference on computing, communi-
cation, and intelligent systems (ICCCIS) 424–429 (IEEE, 2019). https://doi.org/10.1109/ICCCIS48478.2019.8974544
10. Sain M, Kang YJ, Lee HJ (2017) Survey on security in internet of things: state of the art and
challenges. In: 2017 19th international conference on advanced communication technology
(ICACT) 699–704 (IEEE, 2017). https://doi.org/10.23919/ICACT.2017.7890183
11. Oh S-R, Kim Y-G (2017) Security requirements analysis for the IoT. In: 2017 international conference on platform technology and service (PlatCon) 1–6 (IEEE, 2017). https://doi.org/10.1109/PlatCon.2017.7883727
12. Noura M, Atiquzzaman M, Gaedke M (2019) Interoperability in internet of things: taxonomies
and open challenges. Mobile Netw Appl 24:796–809
13. Tourani R, Misra S, Mick T, Panwar G (2018) Security, privacy, and access control in
information-centric networking: a survey. IEEE Commun Surv Tutorials 20:566–600
14. Oracevic A, Dilek S, Ozdemir S (2017) Security in internet of things: a survey. In: 2017
international symposium on networks, computers and communications (ISNCC) 1–6 (IEEE,
2017). https://doi.org/10.1109/ISNCC.2017.8072001
15. Xin M (2015) A mixed encryption algorithm used in internet of things security transmis-
sion system. In: 2015 international conference on cyber-enabled distributed computing and
knowledge discovery 62–65 (IEEE, 2015). https://doi.org/10.1109/CyberC.2015.9
16. Kammara TT (2018) Management and security of IoT systems using microservices. San Jose
State University. https://doi.org/10.31979/etd.49xq-m2je
17. Pal S, Hitchens M, Rabehaja T, Mukhopadhyay S (2020) Security requirements for the internet
of things: a systematic approach. Sensors 20:5897
18. Driss M, Hasan D, Boulila W, Ahmad J (2021) Microservices in IoT security: current solutions,
research challenges, and future directions. Proc Comput Sci 192:2385–2395
19. Lee JY, Lee J (2021) Current research trends in IoT security: a systematic mapping study.
Mobile Inf Syst 2021:e8847099
20. Hassija V et al (2019) A survey on IoT security: application areas, security threats, and solution
architectures. IEEE Access 7:82721–82743
21. Abomhara M, Køien GM (2015) Cyber security and the internet of things: vulnerabilities,
threats, intruders and attacks. J Cyber Secur Mobil 65–88. https://doi.org/10.13052/jcsm2245-1439.414
22. Sengupta J, Ruj S, Das Bit S (2020) A comprehensive survey on attacks, security issues and
blockchain solutions for IoT and IIoT. J Netw Comput Appl 149:102481
23. Atlam HF, Wills GB (2020) IoT security, privacy, safety and ethics. In: Digital twin technologies
and smart cities, Farsi M, Daneshkhah A, Hosseinian-Far A, Jahankhani H (eds). Springer
International Publishing, 123–149. https://doi.org/10.1007/978-3-030-18732-3_8
24. Deogirikar J, Vidhate A (2017) Security attacks in IoT: a survey
25. Nawir M, Amir A, Yaakob N, Lynn OB (2016) Internet of things (IoT): taxonomy of security
attacks. In: 2016 3rd international conference on electronic design (ICED) 321–326 (IEEE,
2016). https://doi.org/10.1109/ICED.2016.7804660
26. Sadhu PK, Yanambaka VP, Abdelgawad A (2022) Internet of things: security and solutions
survey. Sensors 22:7433
27. Mrabet H, Belguith S, Alhomoud A, Jemai A (2020) A survey of IoT security based on a
layered architecture of sensing and data analysis. Sensors 20:3625
28. Shafiq M, Gu Z, Cheikhrouhou O, Alhakami W, Hamam H (2022) The rise of “internet of
things”: review and open research issues related to detection and prevention of IoT-based
security attacks. Wireless Commun Mobile Comput 2022:e8669348
Generic Sentimental Analysis in Web
Data Recommendation Based on Social
Media Scalable Data Analytics Using
Machine Learning Architecture
Ramesh Sekaran, Sivaram Rajeyyagari, Ashok Kumar Munnangi,
Manikandan Parasuraman, Manikandan Ramachandran, and Anil Kumar
Abstract Sentiment analysis is a method for identifying the expressions, attitudes, or feelings of users, classifying a piece of text in a document as positive, negative, favourable, unfavourable, and so on. Recommendation systems are important intelligent systems that play a crucial role in delivering relevant information to users. Deep learning has recently emerged as an important approach to solving sentiment classification problems. This research proposes a novel technique for generic sentiment analysis in web data classification with a recommendation system in social media analytics using machine learning techniques. Here, the input web data is processed to remove missing values and to normalize the records. Then the processed data is classified
R. Sekaran
Department of Computer Science and Engineering, JAIN (Deemed to be University), Bengaluru,
Karnataka 562112, India
e-mail: sramsaran1989@gmail.com
S. Rajeyyagari
Department of Computer Science, College of Computing and Information Technology, Shaqra
University, Shaqra, Kingdom of Saudi Arabia
e-mail: dr.sivaram@su.edu.sa
A. K. Munnangi
Department of Information Technology, Velagapudi Ramakrishna Siddhartha Engineering
College, Vijayawada, Andhra Pradesh 520007, India
e-mail: ashokkumar.munnangi@gmail.com
M. Parasuraman
Department of Computer Science and Engineering, JAIN (Deemed to be University), Bengaluru,
Karnataka 562112, India
e-mail: mani.p.mk@gmail.com
M. Ramachandran (B)
School of Computing, SASTRA Deemed University, Tamil Nadu, Thanjavur 613401, India
e-mail: srmanimt75@gmail.com
A. Kumar
Tula’s Institute, Dehradun 248197, India
e-mail: dahiyaanil@yahoo.com
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_26
346 R. Sekaran et al.
using convolutional discriminant kernel component analysis, and its data recommendation in social media is produced using reinforcement multilayer neural networks. The experimental analysis is carried out on various social media datasets in terms of accuracy, average precision, recall, true positive rate, and F-measure. The proposed technique attained an accuracy of 98%, average precision of 79%, recall of 72%, true positive rate of 63%, and F-measure of 68%.
Keywords Generic sentimental analysis · Web data classification · Recommendation system · Social media analytics · Machine learning
1 Introduction
Sentiments extracted from comments, assessments, or reviews offer practical measures for several different purposes. These sentiments are mostly labelled as favourable or not, or placed on a scale of categories ranging from very poor/terrible through moderate and good to best. Consequently, any item or topic can be effectively sorted into different categories by treating the analysed sentiment as one of the variables. Such opinion-based analysis gives an organization direction on how its products are received by the public, from which new strategies can be devised to improve product quality. It is likewise relevant for policymakers and lawmakers to study public sentiment about their actions [1]. The main objective of sentiment analysis is the automatic derivation of the expressive orientation of users from their reviews. Growing interest in sentiment analysis has created the need to uncover hidden information in the unstructured data produced across different media on social networks. The main approaches to sentiment classification are (1) lexicon-based techniques and (2) machine learning techniques [2]. The first approach begins by collecting a sentiment lexicon, i.e., a set of opinion words (e.g., "magnificent," "awful"), and then uses it as the basis for rules that score the words encountered in a text [3]. Despite their effectiveness, these approaches require extensive work on lexicon construction and rule design. Their drawback is that classification is possible only on the basis of the words contained in the lexicon.
Sentiment classification by machine learning uses well-known algorithms such as Naive Bayes [4]. Classification may target the opinion of a whole document or of a specific sentence within it. Recommendation systems (RS) are deployed to help users cope with this information explosion. RS are mainly used in e-commerce applications and knowledge management systems such as tourism, entertainment, and online shopping portals [5]. Film recommendations for users are delivered through online portals. Films can be readily separated by genre, such as comedy, thriller, animation, and action. Another viable way of organizing movies is by metadata such as year, language, director, or cast [6].
Social networking websites offer large amounts of heterogeneous data, making it necessary to separate important data from the non-essential. Numerous algorithms have been implemented to perform opinion analysis on a given set of data. Take, for instance, a remark made by a user: "I love the burger in that restaurant but not the salad." This would imply that the customer enjoys one item at a particular restaurant while detesting another. Machine learning algorithms are able to detect patterns in data and learn from them in order to make their own predictions. Instead of following pre-defined instructions, these algorithms develop models from sample inputs to make data-driven decisions.
This research proposes a novel sentiment analysis technique based on machine learning architectures. The input social media data is classified using convolutional discriminant kernel component analysis, with data recommendation in social media performed by reinforcement multilayer neural networks.
The organization of this paper is as follows: Sect. 2 reviews sentiment analysis and web data recommendation in social media using existing machine learning architectures; Sect. 3 describes the proposed method for a sentiment-analysis-based social media recommendation system; Sect. 4 gives the results and discussion; and Sect. 5 concludes.
2 Existing Sentimental Analysis
Several researchers have attempted to tackle the personalized social media search problem through various techniques. Work [7] gave a detailed overview of research on sentiment analysis in the social media field. Author [8] proposed a location-based personalized recommendation framework called SESAME, which incorporated a hybrid user location preference model. The work [9] used two multiclass SVM-based classification approaches, one-versus-all and single-machine multiclass SVM, to rank digital camera and MP3 reviews according to their quality. Author [10] proposed techniques for selecting features using content and syntax models. To predict the sentiments of online users from text documents, [11] proposed a heuristic model. Work [12] built a framework for sentiment analysis of film reviews.
2.1 Existing Web Data Recommendation in Social Media
Using Machine Learning
The author [13] introduced an electronic product recommender framework based on contextual information from sentiment analysis. Since ratings are generally incomplete and highly limited, they developed a contextual data sentiment method for a recommender framework using user comments and preferences. Likewise, work [14]
proposed a recommender process incorporating sentiment analysis of textual data extracted from Facebook and Twitter to increase conversion by matching product offers with consumer preferences. Similar combinations can be found in other studies [15]. Furthermore, work [16] used a sentiment intensity metric to build a music recommender framework. Users' sentiments are extracted from sentences posted on social networks, and recommendations are made by a low-complexity system that suggests songs based on the current user's sentiment intensity. The study [17] addressed the data sparsity problem of recommender systems by incorporating a sentiment-based analysis. Their work was applied to the Internet Movie Database (IMDb) and MovieLens datasets, although sentiment analysis has advanced since the paper was published. Work [18] also tried to improve recommendations under the data sparsity problem, proposing an intelligent recommender framework based on hybrid learning strategies that combine the best and most effective learning algorithms. Several research groups [19, 20] presented procedures for applying sentiment analysis in recommender systems. In [21], the authors developed a method for sentiment analysis of movie reviews. They compared the accuracy of three supervised machine learning methods (Naive Bayes, decision trees, and maximum entropy) and one unsupervised method (K-means clustering). Each review sentence was graded according to its subjectivity and polarity. A YouTube corpus for sentiment analysis, usable as input for our sentiment analysis and text classification, was provided by work [22]. The author [23] used a neural network with a high accuracy rate for face classification. The author of [24] proposed a dynamic neural network (DNN) model based on competitive and Hebbian learning and demonstrated that the DNN outperforms the baseline methods. Work [25] suggested combining a convolutional neural network (CNN) and latent semantic analysis (LSA); the LSA method can transform words into vectors.
3 Proposed Model
This section presents the proposed sentiment-analysis-based social media recommendation method, which utilizes deep learning techniques. The input social media data is processed for noise removal and missing values. The processed data is classified using convolutional discriminant kernel component analysis (CDKCA), and its data recommendation in social media is produced using reinforcement multilayer neural networks (RMNN). The proposed architecture is shown in Fig. 1.
Sentiment analysis expects the text data to be cleaned before the classification model is trained. Text cleaning is a preprocessing step that removes words or other elements that carry no significant information and may reduce the effectiveness of sentiment analysis. Both word embeddings and TF-IDF are used as input features of deep learning algorithms in natural language processing.
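As a concrete illustration (not the authors' exact pipeline), the cleaning and TF-IDF weighting described above can be sketched in pure Python; the tokenizer and the toy documents below are assumptions for demonstration only:

```python
import math
import re

def clean(text):
    """Minimal text cleaning: lowercase, drop URLs and punctuation, tokenize."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    text = re.sub(r"[^a-z\s]", " ", text)
    return text.split()

def tf_idf(docs):
    """Return one {term: tf-idf weight} dict per document."""
    tokenized = [clean(d) for d in docs]
    n = len(tokenized)
    df = {}                                   # document frequency per term
    for toks in tokenized:
        for t in set(toks):
            df[t] = df.get(t, 0) + 1
    vectors = []
    for toks in tokenized:
        tf = {t: toks.count(t) / len(toks) for t in set(toks)}
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

docs = ["I love the burger", "I hate the salad", "the burger was great"]
vecs = tf_idf(docs)
```

A term that appears in every document (here "the") gets an IDF of log(1) = 0 and so carries no weight, which is exactly the noise-suppression effect the preprocessing aims for.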
Fig. 1 Proposed architecture
3.1 Web Information Sentimental Investigation Utilizing
Convolutional Discriminant Examination
CNN uses convolution kernels to extract features and comprises a three-part structure of convolutional layers, pooling layers, and a fully connected layer, defined as Eqs. (1)–(4):

S = F_1(W_1, F_2(W_2, ..., F_L(I, W_L))),   (1)

S^{u_l}_{i,j,k} = F_{l+1}(W_{l+1}, F_{l+2}(W_{l+2}, ..., F_L(\hat{M}_l, W_L))),   (2)

S^{C,u_l}_{i,j,k} = F_1(W_1, F_2(W_2, ..., F_l(S^{C,u_l}_{i,j,k}, W_l))),   (3)

L^{C,l} = Σ_w S^{C,u_l}_w.   (4)
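The layer-by-layer composition in Eq. (1) can be made concrete with a toy forward pass: a single 1-D convolution filter, ReLU, max pooling, and a one-weight output layer (all inputs and weights below are made-up illustrative values, not the trained model):

```python
def conv1d(x, w):
    """Valid 1-D convolution (cross-correlation) with a single filter w."""
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k)) for i in range(len(x) - k + 1)]

def relu(v):
    return [max(0.0, a) for a in v]

def forward(x, w_conv, w_out):
    """S = F_out(W_out, F_pool(F_relu(F_conv(x, W_conv)))): the nesting of Eq. (1)."""
    h = relu(conv1d(x, w_conv))   # convolutional layer + activation
    z = max(h)                    # global max pooling
    return z * w_out              # fully connected output layer

y = forward([1.0, -2.0, 3.0, 0.5], w_conv=[0.5, 0.5], w_out=2.0)
```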
We assume there are C classes, where C is a known parameter for all the SUs. The number of observed samples in the training (and test) set is assumed to be the same for all classes. The training data for the ith class is denoted by D_i^{(j)} = {x_{i1}^{(j)}, x_{i2}^{(j)}, ..., x_{iN}^{(j)}}, i = 1, ..., C, j = 1, ..., K, where N is the cardinality of the training set for each class i corresponding to SU_j; N is assumed to be a fixed parameter for all classes, and the vector x_{in}^{(j)}, n = 1, ..., N, is an observed data stream from the ith class at SU_j.
3.2 Social Media Recommendation System Using
Reinforcement Multilayer Neural Networks
The state value function and the action value function under a policy π are defined as

V^π(s) = E_π[ Σ_{t=0}^∞ γ^t r_t | s_0 = s ]   and   Q^π(s, a) = E_π[ Σ_{t=0}^∞ γ^t r_t | s_0 = s, a_0 = a ].

Optimal value functions are defined as V*(s) = max_π V^π(s) and Q*(s, a) = max_π Q^π(s, a). The optimal value functions V* and Q* satisfy the Bellman equations, written recursively as Eq. (5):

V*(s) = max_{a ∈ A} Q*(s, a),
Q*(s, a) = r(s, a) + γ Σ_{s' ∈ S} p(s' | s, a) V*(s').   (5)
The optimal value function V* is obtained by value iteration, starting from any initial value function V_0 and iteratively applying V_{k+1} = T V_k, where T is the Bellman operator given by Eq. (6):

(TV)(s) = max_{a ∈ A} [ r(s, a) + γ Σ_{s' ∈ S} p(s' | s, a) V(s') ].   (6)
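A minimal sketch of this value iteration on a hypothetical two-state MDP (the states, rewards, and transition probabilities below are illustrative assumptions, not from the paper):

```python
def bellman_backup(V, r, p, gamma):
    """(TV)(s) = max_a [ r(s,a) + gamma * sum_s' p(s'|s,a) V(s') ], as in Eq. (6)."""
    return {s: max(r[s][a] + gamma * sum(p[s][a][s2] * V[s2] for s2 in V)
                   for a in r[s])
            for s in V}

# hypothetical deterministic 2-state MDP: staying in s1 pays 2 per step
r = {"s0": {"stay": 0.0, "go": 1.0}, "s1": {"stay": 2.0, "go": 0.0}}
p = {"s0": {"stay": {"s0": 1.0, "s1": 0.0}, "go": {"s0": 0.0, "s1": 1.0}},
     "s1": {"stay": {"s0": 0.0, "s1": 1.0}, "go": {"s0": 1.0, "s1": 0.0}}}

V = {"s0": 0.0, "s1": 0.0}
for _ in range(200):              # V_{k+1} = T V_k
    V = bellman_backup(V, r, p, gamma=0.9)
```

With γ = 0.9 the iteration converges to V*(s1) = 2/(1 − 0.9) = 20 (stay forever) and V*(s0) = 1 + 0.9 · 20 = 19 (move to s1).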
Q-learning is a potent algorithm for learning by observing the environment even when the model is unknown. In Q-learning, the value estimate is updated from an observed transition (s, a, r, s') according to Eq. (7):

Q(s, a) ← (1 − α) Q(s, a) + α [ r + γ max_{a'} Q(s', a') ].   (7)
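The tabular update of Eq. (7) is essentially one line of code; the toy Q-table and transition below are assumptions for illustration:

```python
def q_update(Q, s, a, r, s_next, alpha, gamma):
    """Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(r + gamma*max_a' Q(s',a')), as in Eq. (7)."""
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target

Q = {"s0": {"left": 0.0, "right": 0.0},
     "s1": {"left": 0.0, "right": 0.0}}
# one observed transition: took "right" in s0, got reward 1.0, landed in s1
q_update(Q, "s0", "right", r=1.0, s_next="s1", alpha=0.5, gamma=0.9)
```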
Another often-used operator is the log-sum-exp operator, given by Eq. (8):

L_β(X) = (1/β) log Σ_{i=1}^{n} e^{β x_i}.   (8)
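Eq. (8) can be implemented with the usual max-shift for numerical stability; note that L_β lies between max(X) and max(X) + log(n)/β, which is what makes it a soft approximation of the max operator:

```python
import math

def lse(xs, beta):
    """L_beta(X) = (1/beta) * log(sum_i exp(beta * x_i)), as in Eq. (8).
    Shifting by max(xs) avoids overflow for large beta."""
    m = max(xs)
    return m + math.log(sum(math.exp(beta * (x - m)) for x in xs)) / beta

xs = [1.0, 2.0, 3.0]
soft = lse(xs, beta=1.0)      # a soft maximum of xs
hard = lse(xs, beta=1000.0)   # approaches max(xs) as beta grows
```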
Let T_m be the operator that iterates a value function using the max operator. Then, by the triangle inequality, Eq. (9):

‖T_{β_t} V_1 − T_m V_2‖ ≤ ‖T_{β_t} V_1 − T_m V_1‖ (term I) + ‖T_m V_1 − T_m V_2‖ (term II).   (9)
For term (I), we have Eq. (10):

‖T_{β_t} V_1 − T_m V_1‖ ≤ log(|A|) / β_t.   (10)

For term (II), we have Eq. (11):

‖T_m V_1 − T_m V_2‖ ≤ γ ‖V_1 − V_2‖.   (11)

Combining (9), (10), and (11), we have Eq. (12):

‖T_{β_t} V_1 − T_m V_2‖ ≤ γ ‖V_1 − V_2‖ + log(|A|) / β_t.   (12)
Since T_m is a contraction mapping, we can deduce T_m V* = V* from the Banach fixed point theorem. Based on the DBS value iteration specification, Eq. (13):

‖V_t − V*‖ = ‖T_{β_t} ... T_{β_1} V_0 − T_m ... T_m V*‖
≤ γ ‖T_{β_{t−1}} ... T_{β_1} V_0 − T_m ... T_m V*‖ + log(|A|) / β_t
≤ γ^t ‖V_0 − V*‖ + log(|A|) Σ_{k=1}^{t} γ^{t−k} / β_k.   (13)

If β_t → ∞, then lim_{t→∞} Σ_{k=1}^{t} γ^{t−k} / β_k = 0. Taking the limit of the right-hand side of Eq. (13), we obtain lim_{t→∞} ‖V_t − V*‖ = 0, although the non-expansion property may be broken during the dynamic adjustment of β_t.
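This limiting behaviour can be checked numerically. The sketch below runs value iteration with a Boltzmann (log-sum-exp) backup and an increasing schedule β_t = t on a hypothetical two-state MDP, and converges to the same V* that the hard max operator would produce (the MDP numbers and the β schedule are illustrative assumptions):

```python
import math

def softmax_backup(V, r, p, gamma, beta):
    """(T_beta V)(s) = L_beta over the one-step action values Q(s, a)."""
    out = {}
    for s in V:
        qs = [r[s][a] + gamma * sum(p[s][a][s2] * V[s2] for s2 in V) for a in r[s]]
        m = max(qs)
        out[s] = m + math.log(sum(math.exp(beta * (q - m)) for q in qs)) / beta
    return out

r = {"s0": {"stay": 0.0, "go": 1.0}, "s1": {"stay": 2.0, "go": 0.0}}
p = {"s0": {"stay": {"s0": 1.0, "s1": 0.0}, "go": {"s0": 0.0, "s1": 1.0}},
     "s1": {"stay": {"s0": 0.0, "s1": 1.0}, "go": {"s0": 1.0, "s1": 0.0}}}

V = {"s0": 0.0, "s1": 0.0}
for t in range(1, 300):                   # beta_t = t grows without bound
    V = softmax_backup(V, r, p, gamma=0.9, beta=float(t))
# V approaches V* = {"s0": 19, "s1": 20}, consistent with the bound in Eq. (13)
```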
In the end, pathology deemed 105 tissue samples and 459 Raman spectra from 20 patients to be neoplastic tissue. The final pathology classification for 55 tissue samples and 219 Raman spectra from 12 patients was normal brain tissue. To speed up the subsequent training process, we evaluated and cached the values of each bottleneck. Backpropagation was used to iteratively update the final layer's weights by comparing its predictions to the ground-truth labels. The holdout test set, separate from the training and validation sets, was used to evaluate the retrained model after 40,000 steps. TensorBoard displayed the training and validation learning curves for a single cross-validation round.
The learning task is therefore to learn the mappings F_i : R^{d_{i−1}} → R^{d_i} for each layer i > 0 so that the final output o_M minimizes the empirical loss L on the training set. This learning task can be completed efficiently with backpropagation when each F_i is parametric and differentiable. The chain rule computes the gradients of the loss function with respect to every parameter at every layer, and gradient descent is used for the parameter updates. After training is finished, the output of the intermediate layers is a new representation learned by the model, as in Eqs. (14)–(18):

L = Σ_i l(ŷ_i, y_i) + Σ_k Ω(f_k),   (14)
y_j(p) = f( Σ_{i=1}^{n} x_i(p) · w_{ij} − θ_j ),   (15)

L^{(t)} = Σ_{i=1}^{n} [ g_i f_t(x_i) + (1/2) f_t^2(x_i) ] + Ω(f_t),   (16)

y_k(p) = f( Σ_{j=1}^{m} x_{jk}(p) · w_{jk}(p) − θ_k ),   (17)

E = E + (e_k(p))^2 / 2.   (18)
where g_i are the first-order gradient statistics of the loss function. The decision tree is built from the root until the maximum depth is reached. Let I_L and I_R be the instance sets of the left and right nodes after a split. The perceptron is typically used in supervised linear classification tasks, in which a hyperplane is tuned to fit a training dataset.
Δw_i = η (true_j − pred_j) x_i,   (19)

where η is the learning rate, true_j is the true class label, and pred_j is the predicted class label. The structure of the multilayer perceptron enables it to learn complex tasks by extracting more significant features from the input pattern, as defined in Eqs. (20)–(21):

w = w − η × (d/dw) F(w),   (20)

pr_{a_j} = β · pr_{mf,a_j} + (1 − β) · pr_{sent,a_j}.   (21)
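The perceptron rule of Eq. (19) can be sketched and verified on a linearly separable toy problem (an AND gate; the learning rate and epoch count below are arbitrary choices for illustration):

```python
def predict(w, b, x):
    """Threshold activation on the hyperplane w.x + b."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def train_perceptron(samples, eta=0.1, epochs=20):
    """w_i <- w_i + eta * (true_j - pred_j) * x_i, as in Eq. (19)."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in samples:
            err = y - predict(w, b, x)
            w = [wi + eta * err * xi for wi, xi in zip(w, x)]
            b += eta * err
    return w, b

# AND gate: linearly separable, so the perceptron converges
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(data)
```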
4 Performance Analysis
The setup of the relevant parameters, hardware devices, and virtual library facilities was completed before running the experiments, e.g., epochs = 5 and k-fold = 5. Specifically, we used Google Colab Pro with a GPU Tesla P100-PCIE-16 GB or GPU Tesla V100-SXM2-16 GB, together with the Keras and TensorFlow libraries. We also used the implementations of the SVD, NMF, and SVD++ algorithms provided by the Surprise library (http://surpriselib.com/, accessed on 10 December 2020).
Dataset description: Experiments conducted on open databases such as MovieLens 100K, MovieLens 20M, the Internet Movie Database (IMDb), and the Netflix database were
not found suitable for our work. These databases are largely outdated and contain old movies whose relevant microblogging data is not available. After a thorough assessment of various databases, the MovieTweetings database [12] was chosen for the proposed framework. MovieTweetings is widely considered an up-to-date version of the MovieLens database. The MovieTweetings database is unfiltered, unlike MovieLens, where every user has rated at least 20 movies. The goal of this database is to provide up-to-date ratings, so it contains more realistic data for sentiment analysis. This database is extracted from social media; it is extremely diverse, yet it has a low sparsity value.
4.1 Comparative Analysis
Table 1 shows a parametric comparison of the proposed and existing techniques on various sentiment datasets. The datasets compared are MovieLens and MovieTweetings, in terms of accuracy, average precision, recall, true positive rate, and F-measure.
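For reference, all of the reported metrics derive from the 2x2 confusion matrix; the counts below are hypothetical, chosen only so that recall comes out at 72% like the headline figure:

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall (true positive rate), and F-measure."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                      # also the true positive rate
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f_measure = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}

m = metrics(tp=72, fp=19, fn=28, tn=81)   # hypothetical confusion counts
```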
From Fig. 2, the accuracy comparison is shown for the proposed and existing techniques on the Movielens, Movielens 20M, and IMDb datasets. The proposed method attained an accuracy of 92%, while existing ATR-FTIR achieved 89% and ANN 91% on the Movielens dataset; for the Movielens 20M dataset, the proposed technique attained an accuracy of 95%, while existing ATR-FTIR attained 92% and ANN 94%; for the IMDb dataset, the proposed technique attained an accuracy of 95%, while existing ATR-FTIR attained 92% and ANN 94%.
Figure 3 compares the average precision of the proposed and existing techniques on the Movielens, Movielens 20M, and IMDb datasets. The proposed technique attained an average precision of 92%, while existing ATR-FTIR achieved 89% and ANN 91% on the Movielens dataset; for the Movielens 20M dataset, the proposed technique attained an average precision of 95%, while existing ATR-FTIR attained
Table 1 Parametric comparison of proposed and existing techniques for various sentiment datasets

Dataset          Technique     Accuracy   Average precision   Recall   True positive rate   F-measure
Movielens        SESAME        91         71                  62       52                   59
Movielens        IMDb          92         75                  65       55                   63
Movielens        CDKCA_RMNN    95         77                  68       59                   65
MovieTweetings   SESAME        93         74                  65       59                   61
MovieTweetings   IMDb          95         77                  69       61                   65
MovieTweetings   CDKCA_RMNN    98         79                  72       63                   68
Fig. 2 Comparison of accuracy
an average precision of 92% and ANN 94%; for the IMDb dataset, the proposed technique attained an average precision of 95%, while existing ATR-FTIR achieved 92% and ANN 94%.
Figure 4 shows the recall comparison for the proposed and existing techniques on the Movielens, Movielens 20M, and IMDb datasets. The proposed technique attained a recall of 92%, while existing ATR-FTIR achieved 89% and ANN 91% on the Movielens dataset; for the Movielens 20M dataset, the proposed technique attained a recall of 95%,
Fig. 3 Comparison of average precision
Fig. 4 Comparison of recall
while existing ATR-FTIR attained a recall of 92% and ANN 94%; for the IMDb dataset, the proposed technique attained a recall of 95%, while existing ATR-FTIR attained 92% and ANN 94%.
Figure 5 compares the true positive rate of the proposed and existing techniques on the Movielens, Movielens 20M, and IMDb datasets. The proposed technique attained a true positive rate of 92%, while existing ATR-FTIR achieved 89% and ANN 91% on the Movielens dataset; for the Movielens 20M dataset, the proposed technique attained a true positive rate of 95%, while existing ATR-FTIR attained 92% and ANN 94%; for the IMDb dataset, the proposed technique attained a true positive rate of 95%, while existing ATR-FTIR attained 92% and ANN 94%.
From Fig. 6, the F-measure comparison is shown for the proposed and existing techniques on the Movielens, Movielens 20M, and IMDb datasets. The proposed technique attained an F-measure of 92%, while existing ATR-FTIR achieved 89% and ANN 91% on the Movielens dataset; for the Movielens 20M dataset, the proposed technique attained an F-measure of 95%, while existing ATR-FTIR attained 92% and ANN 94%; for the IMDb dataset, the proposed technique attained an F-measure of 95%, while existing ATR-FTIR achieved 92% and ANN 94%.
The database extracted from MovieTweetings contains users' movie ratings and the movies' genres; beyond the release year and genres, however, it holds no other information. Such data is usable only in the case of collaborative filtering, where enough users share attributes in the system, suggestions are made solely from the ratings given by related users, and items are recommended based on judgments of user similarity. In the proposed model, the collaboratively filtered data and the similarity of movies have
Fig. 5 Comparison of true positive rate
Fig. 6 Comparison of F-measure
been used on account of their attributes. The Movie Database (TMDb) API was used to obtain the movies' attributes. TMDb is a leading source of comprehensive movie metadata covering over 30 languages. The modified database contains extremely obscure movies from various countries and languages whose metadata was not available in TMDb; such movies with almost no metadata were discarded. The final database had around 4500 movies.
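The metadata-based pruning described above can be sketched as a simple filter; the field names and the toy catalog are assumptions for illustration, not the actual TMDb schema:

```python
def filter_movies(movies, required=("genres", "language", "release_year")):
    """Keep only movies whose metadata has every required field non-empty."""
    return [m for m in movies if all(m.get(k) for k in required)]

catalog = [
    {"title": "A", "genres": ["Comedy"], "language": "en", "release_year": 2015},
    {"title": "B", "genres": [], "language": "fr", "release_year": 2012},   # no genres
    {"title": "C", "genres": ["Drama"], "language": "hi"},                  # no year
]
kept = filter_movies(catalog)
```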
SentiWordNet is a lexical resource that reports, rather than a word's meaning, its sentiment polarity. To identify the polarity and subjectivity of the various hotel reviews, we used SentiWordNet, a freely available analyzer of the English language that contains sentiment scores derived from the WordNet database. We split our collection of reviews to extract words (hotel features). We assigned every representative term occurring under the fitting features, as explained in the previous steps, to the positive (pos), negative (neg), or neutral (neu) class to calculate the sentiment score. SentiWordNet is integrated with Python's NLTK package and furnishes WordNet synsets with sentiment polarity. WordNet provides various semantic relationships between words, which are used to compute sentiment polarities. In basic terms, sentiment analysis is the process of quantifying something subjective, such as textual reviews.
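The scoring idea can be illustrated with a tiny stand-in lexicon (in the real pipeline the (pos, neg) scores come from SentiWordNet via NLTK; the words and numbers below are made up for demonstration):

```python
# stand-in lexicon: (positive, negative) score per word
LEXICON = {"love": (0.8, 0.0), "great": (0.7, 0.1),
           "bad": (0.0, 0.7), "salad": (0.0, 0.0)}

def sentiment_label(tokens):
    """Sum (pos - neg) over lexicon words; the sign gives the polarity label."""
    score = sum(LEXICON[t][0] - LEXICON[t][1] for t in tokens if t in LEXICON)
    if score > 0:
        return "pos"
    if score < 0:
        return "neg"
    return "neu"

label = sentiment_label("i love the burger but the salad was bad".split())
```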
As our recommender is intended to manage heterogeneous kinds of data, a database is required to store this varied data. We use the Cassandra database to store the information used in the proposed recommender. Review pages that match the query keywords are downloaded and stored in the NoSQL database in Hadoop. The dataset used in this study comes from external sources such as the TripAdvisor and Expedia lodging sites; i.e., hotel data is saved in comma-separated value (CSV) format. It is then converted into JSON format to improve its readability; i.e., users' textual reviews and the ratings assigned by existing users, recorded as rating scores, likes, or star rankings, are stored in Cassandra.
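The CSV-to-JSON step can be done with the standard library alone; the column names below are invented for illustration:

```python
import csv
import io
import json

def csv_to_json(csv_text):
    """Parse comma-separated review records and re-serialize them as JSON."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows)

raw = ("user,hotel,rating,review\n"
       "u1,H1,4,clean rooms\n"
       "u2,H1,2,noisy at night\n")
doc = csv_to_json(raw)
```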
The rating scores can vary on scales of 1 to 5 or 1 to 10. The results showed that more tweets are classified as negative than as the other categories, and the total number of neutral tweets was surprising. Nonetheless, when the test tweets were examined, it was seen that the tweets labelled as neutral consisted of statements, suggestions, and questions.
A web-based graphic user interface (GUI) was developed for straightforward
prediction. Due to this, users will find it simpler to interact with the sentiment predic-
tion system. The GUI simplifies uploading a spreadsheet file containing the extracted
tweets for analysis. The uploaded file’s username, date, number of retweets, actual
text, mentions, hashtags, tweetid, and permalinks are all displayed here. The section
of this interface that lets the user select the pre-trained model they want to use for
processing is an important feature.
After looking at these results, it is clear that improvements to feature selection
and sentiment classification algorithms are still needed, so this is a new area of study.
Data for sentiment analysis comes from websites like Flipkart, Facebook, Twitter, and other social media platforms. Individuals openly express their views on these media about specific topics, products, and political issues. One can learn more about one's field and improve by reviewing these opinions. Although sentiment analysis has
been the subject of much research, it still faces numerous obstacles. It is hard to tell
when someone is being sarcastic when they say what they think.
5 Conclusion
This research proposes a novel technique in generic sentimental analysis for web
data classification for social media using machine learning techniques. The input
data is classified using convolutional discriminant kernel component analysis and
recommendation in social media using reinforcement multilayer neural networks.
The classifier is trained or modeled using the labeled data and then tested on new but
related texts to determine how well it predicts the sentiment of the new documents.
The initial brand and product comparison results show the value of text mining and sentiment analysis on social media data; machine learning classifiers are a valuable tool for users, product manufacturers, and regulatory and enforcement agencies to monitor brand or product sentiment trends and to act in the event of a sudden or significant rise in negative sentiment. Further research will examine comment
spamming, contrasting the sentiment classification capabilities of various machine
learning algorithms, temporal analysis for identifying an upward or downward trend
in user or brand sentiment, and clustering tweet and user attitudes by region. The
proposed technique attained an accuracy of 98%, average precision of 79%, recall
of 72%, true positive rate of 63%, and F-measure of 68%.
References
1. He L, Yin T, Zheng K (2022) They May Not Work! An evaluation of eleven sentiment analysis
tools on seven social media datasets. J Biomed Inform 132:104142
2. Alsayat A (2022) Improving sentiment analysis for social media applications using an ensemble
deep learning language model. Arab J Sci Eng 47(2):2499–2511
3. Xu QA, Chang V, Jayne C (2022) A systematic review of social media-based sentiment analysis:
emerging trends and challenges. Decision Anal J 100073
4. Jalil Z, Abbasi A, Javed AR, Badruddin Khan M, AbulHasanat MH, Malik KM, Saudagar AKJ
(2022) Covid-19 related sentiment analysis using state-of-the-art machine learning and deep
learning techniques. Front Public Health 9:2276
5. Iqbal A, Amin R, Iqbal J, Alroobaea R, Binmahfoudh A, Hussain M (2022) Sentiment analysis
of consumer reviews using deep learning. Sustainability 14(17):10844
6. Li X, Zhang J, Du Y, Zhu J, Fan Y, Chen X (2022) A novel deep learning-based sentiment
analysis method enhanced with emojis in microblog social networks. Enterprise Inf Syst 1–22
7. Alanazi SA, Khaliq A, Ahmad F, Alshammari N, Hussain I, Zia MA, Afsar S et al (2022)
Public’s mental health monitoring via sentimental analysis of financial text using machine
learning techniques. Int J Environ Res Public Health 19(15):9695
8. Ali I, Asif M, Hamid I, Sarwar MU, Khan FA, Ghadi Y (2022) A word embedding technique
for sentiment analysis of social media to understand the relationship between Islamophobic
incidents and media portrayal of Muslim communities. PeerJ Comput Sci 8:e838
9. Chandrasekaran G, Antoanela N, Andrei G, Monica C, Hemanth J (2022) Visual sentiment
analysis using deep learning models with social media data. Appl Sci 12(3):1030
10. Mallick C, Mishra S, Giri PK, Paikaray BK (2023) Machine learning approaches to sentiment
analysis in online social networks. Int J Work Innovation 3(4):317–337
11. Thimmapuram M, Pal D, Mohammad GB (2022) Sentiment analysis—based extraction of
real—time social media information from twitter using natural language processing. Soc Netw
Anal: Theory Appl 149–173
12. PM KR (2022) Sentiment analysis, opinion mining and topic modelling of epics and novels
using machine learning techniques. Mater Today: Proc 51:576–584
13. Cordero J, Bustillos J (2022) Sentiment analysis based on user opinions on twitter using machine
learning. In: Applied technologies: third international conference, ICAT 2021, Quito, Ecuador,
October 27–29, 2021, Proceedings. Cham, Springer International Publishing, pp 279–288
14. Yin Z, Shao J, Hussain MJ, Hao Y, Chen Y, Zhang X, Wang L (2022) DPG-LSTM: an enhanced
LSTM framework for sentiment analysis in social media text based on dependency parsing and
GCN. Appl Sci 13(1):354
15. Sumathy B, Kumar A, Sungeetha D, Hashmi A, Saxena A, Kumar Shukla P, Nuagah SJ (2022)
Machine learning technique to detect and classify mental illness on social media using lexicon-
based recommender system. Comput Intell Neurosci
16. Gupta A, Matta P, Pant B (2022) A comparative study of different sentiment analysis classi-
fiers for cybercrime detection on social media platforms. In: AIP conference proceedings, vol
2481(1). AIP Publishing LLC, p 060005
17. Hinduja S, Afrin M, Mistry S, Krishna A (2022) Machine learning-based proactive social-
sensor service for mental health monitoring using Twitter data. Int J Inf Manage Data Insights
2(2):100113
18. Srikanth J, Damodaram A, Teekaraman Y, Kuppusamy R, Thelkar AR (2022) Sentiment anal-
ysis on COVID-19 twitter data streams using deep belief neural networks. Comput Intell
Neurosci
19. Yenkikar A, Babu CN, Hemanth DJ (2022) The semantic relational machine learning model
for sentiment analysis using cascade feature selection and heterogeneous classifier ensemble.
PeerJ Comput Sci 8:e1100
20. Kuppusamy M, Selvaraj A (2023) A novel hybrid deep learning model for aspect-based
sentiment analysis. Concurr Comput Pract Exper 35(4):e7538
21. Venkatesh B, Hegde SU, Zaiba ZA, Nagaraju Y (2021) Hybrid CNNLSTM model with GloVe
word vector for sentiment analysis on football specific tweets. In: 2021 international conference
on advances in electrical, computing, communication and sustainable technologies (ICAECT),
pp 1–8
22. Sanagar S, Gupta D (2020) Unsupervised genre-based multidomain sentiment lexicon learning
using corpus-generated polarity seed words. IEEE Access 8:118050–118071
23. Saharudin SN, Wei KT, Na KS (2020) Machine learning techniques for software bug prediction:
a systematic review. J Comput Sci 16(11):1558–1569
24. Feng Y, Cheng Y (2021) Short text sentiment analysis based on multichannel CNN with
multi-head attention mechanism. IEEE Access 9:19854–19863
25. Nijhawan T, Attigeri G, Ananthakrishna T (2022) Stress detection using natural language
processing and machine learning over social interactions. J Big Data 9(1):1–24
Cloud Spark Cluster to Analyse English
Prescription Big Data for NHS
Intelligence
Sandra Fernando, Victor Sowinski Mydlarz, Asya Katanani, and Bal Virdee
Abstract Spark is a large-scale data processing engine that can be up to a hundred
times faster than the Hadoop big data processing engine. Even though Spark is a
complete in-memory framework with fewer big data platform facilities than Hadoop,
the Spark analytics engine combined with the Hadoop distributed file system gives
better throughput than Hadoop alone. The main contribution of this paper is insight
into the behaviour of an HDFS-based Azure Cloud Spark cluster, with a discussion
and evaluation of its strengths and limitations using a large NHS prescription dataset.
Data on NHS prescriptions obtained from 2015 to April 2022 exceeds 500 GB of
records. A public dashboard for individual BNF code analysis and studies on NHS
cost analysis exist, but no analysis of this date range and volume of NHS prescription
data, especially using new big data processing engines such as Spark, has been
conducted. This study also contributes descriptive statistics and machine learning
models of prescription data trends using the Cloud Spark engine and PySpark
technology, which have not been used in this context before. The study compares
regions as well as GP practices in terms of reimbursement cost, drug consumption
level, drug type, and disease type; charts the varied demand for dispensed chemical
substances over the years; and shows which diseases have increased or decreased
over the years, as well as the total cost and its trends.
Keywords Cloud cluster · Big data · Prescription data · Machine learning
engines · PySpark · Azure Spark architecture
S. Fernando (B)·V. S. Mydlarz ·A. Katanani ·B. Virdee
Assistive Technology Group, SCDM, London Metropolitan University, 166-220 Holloway Rd,
London N7 8DB, UK
e-mail: s.fernando@londonmet.ac.uk
V. S. Mydlarz
e-mail: w.sowinskimydlarz@londonmet.ac.uk
B. Virdee
e-mail: b.virdee@londonmet.ac.uk
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_27
361
362 S. Fernando et al.
1 Introduction
The sheer amount of computing resources and software services needed to support
big data efforts can strain the financial and intellectual capital of even the largest
businesses. That is where cloud computing comes into play as an ideal platform for
big data. Cloud computing provides limitless computing resources and services on
demand, and the business does not have to build its own or maintain the infrastruc-
ture. Thus, the cloud makes big data technologies accessible and affordable to almost
any size of enterprise. The motivation for this study derives from the need to eval-
uate a high-throughput cloud-based big data processing engine with a stable storage
mechanism and third-party supported data input and output technology. The cloud
platform chosen for this research is the Microsoft Azure HDInsight Spark cluster.
Apache Spark is a distributed computing framework that supports a set of libraries
for real-time, large-scale data processing. Other subcomponents of the technology
are storage accounts for blob containers and Jupyter Notebook for running Python
APIs (PySpark) for the Apache Spark engine.
Drug utilization in England and Wales has been analysed before [1] using secular
trend analysis. That analysis of the prescription cost database serves purely to inves-
tigate medication and drug consumption; its findings on the most common medica-
tions and their trends are similar to those of this paper. The analysis used statistical
packages for social science but did not discuss the technical details of big data
processing capabilities and performance. Open Prescribing [2] uses anonymized
data about drugs prescribed by GPs and provides a dashboard of Sub-ICB location,
practice, and other drug trends. However, this dashboard does not discuss the backend
technology or the processing of 700 million rows and its technical depths.
Spark technology is fairly new; therefore, only limited real case studies have been
analysed using it. Several papers have been published technically reviewing Spark
against the Hadoop ecosystem and its features, simply showing its strengths and
weaknesses without application [3–5]. An investigation was carried out with an
Amazon cloud-based Spark cluster for machine learning algorithm processing [6],
streaming tweet data from a machine learning repository and using a small healthcare
dataset to trigger alerts. That application does not discuss challenges, and the dataset
is completely different in nature from the one presented in this study. The paper
contains a limited description of the technical evaluation and takeaways.
This study uses the NHS prescription dataset analysed from Jan 2015 to April
2022. There is no other study to scrutinise the data at this date range using a cloud
Spark cluster. The main contribution of this paper is the Cloud Spark approach
to process a large amount of NHS data using HDInsight Cluster technology, its
strengths and weaknesses. The technical details and challenges of the HDInsight
Spark Cluster implementation leave the reader with an informed choice of application
type for Spark technology.
Cloud Spark Cluster to Analyse English Prescription Big Data for NHS 363
The study also connects to NHS intelligence comparing the regions as well as
some practices in terms of reimbursement cost, drug consumption level, the type
of the drug, and the disease type. The study demonstrates the varied demand for
some dispensed chemical substances over the years through charts and shows what
diseases have increased or decreased over the years as well as the total sum of the
appearance of those diseases in each particular region.
The rest of the paper starts with the related work presenting comparative studies
of Spark and other technologies. The material and methods section presents the tech-
nology and techniques utilized to process the English Prescribing Dataset and the
focussed questions. The paper then reveals the findings of NHS intelligence, answering a few
questions of the study by presenting results. The challenges and limitation section
elaborates on technical barriers encountered in processing large datasets in the cloud
environment and solutions to overcome those barriers. The section also explores the
Spark and Hadoop performance with the utilization of (1) cores and (2) PySpark
technology. The conclusion presents two main contributions of the paper: (1) find-
ings of NHS intelligence and (2) the Spark deployment approaches based on the
experimental findings.
2 Related Work
The big data analytics of health-related data is one of the most vital industrial strate-
gies highlighted in the UK government’s life sciences industrial strategy report
[7]. The UK’s healthcare organization, the NHS, holding extensive repositories of
patient data, produces information at an enormous rate. This growth exceeds the
capabilities of established IT infrastructures and largely represents greenfield
computing and data management problems [8].
New data management systems have been introduced to meet the challenges of big
data [9]. Apache Spark and Hadoop are high-power distributed parallel computing
cluster systems that are commonly used for operating the software framework for
big data analysis [10]. Apache Spark consists of several components: Spark core;
Spark Streaming; Spark MLlib; Spark SQL; Spark GraphX, etc.
It has been over a decade since the term “big data” was introduced. Big data
simply refers to quantities of information so large that traditional PCs fail to store,
process, and analyse them. One of the various fields that generate data at large scale
today is the healthcare industry. Healthcare-related big data hides great potential:
when properly applied, insightful knowledge derived from it can safeguard public
health, determine and execute applicable treatments for patients, support clinical
advancements, and monitor the safety of healthcare systems.
This study analyses NHS Prescribing data on the cloud cluster Spark engine.
Some reasons why Apache Spark would be a better choice than Apache Hadoop are
presented in Table 1: (1) in-memory cluster computation, (2) real-time processing,
(3) low latency and high throughput, and (4) a range of stable language support
with Java, Scala, Python, and R. According to Andreas Kretz [11], Hadoop is more
Table 1 Comparison of Hadoop and Spark for data processing
Faster performance: Spark is designed for in-memory processing. Hadoop is designed to process data using local disc storage across multiple sources.
Processing type: Spark provides both batch and real-time processing. Hadoop provides batch and linear data processing.
Latency: Spark is a low-latency, high-throughput computing framework. Hadoop is a high-latency, high-throughput computing framework.
Language support: Spark supports Java, Scala, Python, and R. Hadoop ideally uses Java; however, languages like R, Python, and Ruby can also be implemented.
than just storage, a whole ecosystem, while Spark is just a data analytics framework
with no storage capacity.
Both the Hadoop and Spark data processing engines can be complementary,
where Hadoop is used as a storage, Yarn for resource management, and analytics is
processed with Spark. Hadoop and Spark can be managed in the same cluster with
HDFS data and a Spark worker thread. Spark can determine which node the data is
stored in and then load it into the memory of that machine for processing rather than
transferring data between machines that causes significant traffic. If the job is batch
processing, such as counting or averaging, then MapReduce is better, whereas, for
more complex machine learning computations or faster streaming, Apache Spark is
advised. In this research, the technique of HDFS (Hadoop distributed file system),
Yarn resources management, and Spark Engine is tested with Azura Cloud. The next
section details the specification of the deployment and subcomponent architecture.
3 Material and Methods
The cloud platform for this study is the Azure Platform Management Portal, which
gives control over service deployment, administrative tasks, and information on the
health of the implementations and accounts. The English Prescribing Dataset contains
detailed information on prescriptions issued in England, Wales, Scotland, Guernsey,
Alderney, Jersey, and the Isle of Man. The cloud blob containers of HDFS were used
to store the 88 blobs, summing to around 500 GB, and Jupyter Notebook was used
for running Python APIs (PySpark) for the Apache Spark engine. The initial four
processors and three cores were increased to six processors and 15 cores during the
study. This change improved the execution performance of the queries by approximately
25%.
Apache Ambari, providing an easy-to-use web UI, was used for managing and
monitoring the Hadoop clusters. Azure HDInsight is
a versatile, managed cluster platform running big data frameworks in large volume
and velocity using Apache Spark [12].
Microsoft Azure Blob Storage is an object storage solution for the cloud. It is
designed for storing large amounts of unstructured data. SQL Azure Reporting allows
running reports against SQL Azure Databases in the cloud [13]. Azure Active Direc-
tory is a cloud-based identity and access management service. Apache Spark is an
open-source unified analytics engine for large-scale data processing. Power BI and
Tableau are the interactive data visualization tools used in this research.
The project was initiated with four processors and three cores. Processing the
whole dataset required optimization, as these specifications turned out to be insuf-
ficient; moving to six processors and 15 cores increased the processing speed
significantly. The data in the CSV files is organized into three tiers. The most basic
unit is monthly records, approximately 6 GB in size. The intermediate yearly records
are 17 GB each, after filtering the relevant columns and merging the monthly records.
The final data frame is a combination of all the data (seven years and four months).
The virtual machine (hosted in the MS cloud) is started through Azure Lab Services.
The entry point to the cluster is the Azure Portal, which contains all the components
mentioned.
Figure 1 represents Apache Spark as a parallel processing framework supporting
in-memory processing to enhance the performance of big data analytic applications.
The Spark cluster is a combination of a driver program, a cluster manager, Zookeeper
nodes, and worker nodes that work together to complete the NHS descriptive analysis
and data modelling tasks [14].
The Spark Context coordinates processes across the cluster. Apache Spark appli-
cations have corresponding executor processes that manage the tasks and remain on
alert throughout the execution cycle. Driver programs constantly accept the connec-
tions from the executors during their life cycle. Apache Spark drivers schedule the
tasks on the cluster and close the worker nodes. Spark Drivers are on the same
local area network as the rest of the components, and HDFS distributed file system
manages large data sets running on commodity hardware.
Figure 2 demonstrates the flow of the application starting with a SparkContext
instance. The driver program requests resources from the cluster manager to launch
executors. The cluster manager launches executors. The driver process runs through
the user application. Tasks are sent to executors. Executors run the tasks and save the
results. If any worker crashes, its tasks are sent to different executors to be processed
again.
NHS in April 2020 underwent an organizational change. The four regions: North
of England, Midlands and East, London, and South of England were split into seven
regions: East of England, London, Midlands, North East and Yorkshire, North West,
South East, and South West. The data attributes used in this study are practice name,
chemical substance, prescribed product (BNF), total quantity (of the appliance
prescribed), average daily quantity, net ingredient cost (NIC), and actual cost (after
discounts and other expenses).
In this study, structured secondary data, English Prescribing Dataset, is fetched for
processing from the 88 blobs with the read.csv() method of the Spark session, with
the inferSchema parameter removed for performance purposes.
Fig. 1 Architecture of the system
The data did not undergo any cleaning steps, as the Data Quality Policy of the NHS ensures the consistency,
timeliness, efficiency, validity, and completeness of the data. The study implements
the StringIndexer, OneHotEncoderEstimator, and VectorAssembler classes to convert
categorical fields into numerical ones and to vectorize the features.
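A rough pure-Python sketch of what these three stages do (in Spark the encoder also drops the last category by default, which this sketch omits; the region and cost values are illustrative, not from the dataset):

```python
# Pure-Python sketch of the StringIndexer -> one-hot encoder -> VectorAssembler
# pipeline: index a categorical column by frequency, one-hot encode the index,
# then concatenate it with the numeric columns into one feature vector.

def string_index(values):
    # StringIndexer: the most frequent category gets index 0 (ties broken by name).
    freq = {}
    for v in values:
        freq[v] = freq.get(v, 0) + 1
    ordered = sorted(freq, key=lambda v: (-freq[v], v))
    mapping = {v: i for i, v in enumerate(ordered)}
    return [mapping[v] for v in values], mapping

def one_hot(index, size):
    # One-hot encoding (simplified): index -> 0/1 vector of length `size`.
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def assemble(categorical_vec, numeric_fields):
    # VectorAssembler: concatenate feature columns into a single vector.
    return categorical_vec + list(numeric_fields)

regions = ["London", "Midlands", "London", "South East"]   # hypothetical values
costs = [12.5, 3.0, 7.2, 4.4]
indices, mapping = string_index(regions)
rows = [assemble(one_hot(i, len(mapping)), [c]) for i, c in zip(indices, costs)]
```

The assembled rows are what a regression stage consumes: each row is one flat numeric vector combining the encoded category with the numeric attributes.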
The numerical fields in the dataset are the actual cost of the medication, the net
ingredient cost, and the total quantity of the medication prescribed. Figure 3
presents descriptive statistics for these attributes. The data is modelled with the linear
regression algorithm: 75% of the data is used for training, and the remaining 25%
is used to make predictions.
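The split and fit can be sketched in pure Python (PySpark itself would use `DataFrame.randomSplit` and `pyspark.ml.regression.LinearRegression`; the toy quantity-to-cost data below is illustrative, not NHS data):

```python
import random

def train_test_split(rows, train_frac=0.75, seed=42):
    # Analogue of PySpark's DataFrame.randomSplit([0.75, 0.25], seed=42):
    # a seeded shuffle followed by a 75/25 cut.
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]

def fit_simple_linear(points):
    # Ordinary least squares for a single feature: y = a + b * x.
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    b = sum((x - mx) * (y - my) for x, y in points) / \
        sum((x - mx) ** 2 for x, _ in points)
    return my - b * mx, b  # intercept a, slope b

# Toy quantity -> cost data lying exactly on y = 2x + 1 (illustrative only).
data = [(q, 2.0 * q + 1.0) for q in range(1, 101)]
train, test = train_test_split(data)
intercept, slope = fit_simple_linear(train)
```

Seeding the split matters: it makes the 75/25 partition, and therefore any reported model metrics, reproducible across runs.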
Focus Questions of the study.
Q1: What are the top drugs prescribed by GPs in England? What are their
categories? What are they prescribed for?
Q2: What is the relationship between spending and location?
Q3: What is the relationship between spending and drug prescription?
Q4: Group practices by spending levels.
Q5: Group practices by drug recommendations.
Q6: Find descriptive statistics.
Q7: Will a GP practice with certain spending likely increase a particular drug in
future?
Fig. 2 Architecture of the cloud cluster
Fig. 3 Descriptive statistics of the numerical data
Q8: What is the trajectory of different drug recommendations based on their
historical data?
Q9: What is the trajectory of different drug recommendations based on locations?
Q10: What is the trajectory of spending of a given GP practice based on its historical
records?
4 Results and Discussion
Three research questions are presented in the paper.
4.1 What is the Trajectory of Spending of a Given General
Practice Based on Its Historical Records (Spending)?
Query Description: The trend of the sum of prediction (actual and forecast) by
year. The colour shows details about the forecast indicator, and the marks are
labelled by practice name. The view is filtered on practice name, keeping Park
Surgery and Riverside Surgery.
Figure 4 demonstrates the prediction values of the actual cost in British pound
sterling (GBP) for the Park and Riverside Surgeries. In the context of the UK’s
National Health Service (NHS), “actual cost” usually refers to the amount that the
NHS pays for a particular treatment or service. Predicting the actual cost of surgeries
in the NHS can help improve the efficiency of the healthcare system, reduce costs,
and improve patient outcomes.
Park and Riverside Surgeries, two of the surgeries that most often recommend the
top drugs, will require a higher allocation of actual cost by 2031: rises of
approximately 40% and 28%, respectively. These predictions will assist NHS
managers with efficient resource allocation, such as adequate staff and equipment
and the necessary post-surgical care placement, as well as with budgeting, cost
control, and negotiation.
4.2 What is the Trajectory of Net Ingredient Cost for the Next
Few Years?
Figure 5 demonstrates the predicted value of the net ingredient cost in the National
Health Service for the next five years. Net ingredient cost (NIC) is the amount paid,
in British pound sterling (GBP), based on the basic price of the prescribed drug or
appliance and the quantity prescribed.
Fig. 4 Projection of actual cost in the medication of two surgeries
The NIC value increases from £2404.22 M in the first quarter of 2022 by nearly
20% to £2844.82 M in the third quarter of
2026. Predicting NIC values can help the NHS improve its financial management,
enhance patient care, and support the development of new treatments and thera-
pies. The projected rise in the NIC value urges NHS organizations to plan for
efficient allocation of resources and better financial management of the costs
of pharmaceutical products. This prediction can also help identify opportunities
for cost savings, such as negotiating lower prices with suppliers or switching
to less expensive alternative treatments. The NHS can ensure that patients receive the
most effective and appropriate treatments while minimizing costs, improving the
overall quality of care. Finally, predicting NIC values can also support research and
development activities, enabling the NHS to make informed decisions about which
drugs to invest in and which to prioritize for development. Based on the above, the
NHS will have to consider allocating a higher amount for the net ingredient cost in
the upcoming years to ensure quality of care and high performance, as well as to
support research and development activities.
Fig. 5 Net ingredient cost in quarters by 2026
4.3 What Are the Most Common Disease Types in the UK?
The bar graph in Fig. 6 depicts the most typical disease types diagnosed between
Jan 2015 and Apr 2022. The top three disease classifications relate to the central
nervous system, the cardiovascular system, and the endocrine system, in descending
order. Central nervous system-related diagnoses, also known as neurological disor-
ders, involve the central and peripheral nervous system (muscles, the brain, the
cranial and peripheral nerves, the neuromuscular plate, the spinal cord, and the auto-
nomic nervous system) and affect about 10 million people in the UK. The most common
neurological diseases are dementia, Alzheimer’s, Parkinson’s, multiple sclerosis,
and epilepsy [15]. Cardiovascular system-related diagnoses involve coronary artery
diseases, hypertension, stroke, heart failure, peripheral artery disease, and arrhyth-
mias [16]. These disorders are a significant health issue in the UK and a major cause
of morbidity and mortality. They are often linked to lifestyle factors such as smoking,
lack of exercise, and poor diet, as well as genetics and other underlying health condi-
tions [17]. Endocrine system-related diagnoses involve diabetes, thyroid disorders,
pituitary disorders, adrenal disorders, and polycystic ovary syndrome (PCOS). These
disorders can affect many different areas of the body and can have a significant impact
on a person’s health and quality of life [18]. Treatment for endocrine system-related
disorders varies depending on the specific condition and its severity but may include
medication, surgery, lifestyle interventions, and hormone replacement therapy.
4.4 Challenges and Limitations
The Apache Spark cluster on Azure HDInsight did not perform as expected when
handling the NHS Prescribing dataset with the initial configuration of four processors
and three cores. Additionally, having the “inferSchema” parameter of the read.csv()
method of the Spark session set to “True” had an adverse impact on the performance
of the data-retrieving process. Next, log transformation caused null values in the
dataset, which held back training the data with the linear regression algorithm; the
modelling process consistently threw a “ZeroDivisionError”. Some queries, such as encoding
the data, implementing Spark SQL, or saving the PySpark data frame to the Hive table
for further use with the Tableau visualization tool, lasted for more than an hour. This
caused a “session timeout” error, which killed the Spark session and prevented
further progress.
Fig. 6 Most common disease types in the UK, 2015–2022
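The null values arise because the logarithm is undefined for zero or negative costs or quantities; a minimal guard, as a hedged sketch rather than the authors' exact code:

```python
import math

def safe_log_transform(values):
    # math.log is undefined for x <= 0; dropping such rows up front avoids the
    # null values that later break linear regression training (the
    # "ZeroDivisionError" seen during modelling).
    return [math.log(v) for v in values if v > 0]

costs = [12.5, 0.0, 3.0, -1.0, 7.2]   # illustrative values, not NHS data
transformed = safe_log_transform(costs)
```

An alternative to dropping rows is shifting the values (log(x + 1)); which is appropriate depends on whether zero-cost rows carry meaning for the model.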
To overcome these challenges, the researchers implemented PySpark code with
different session parameters, demonstrated in Fig. 7. Increasing the session timeout
alone did not give any results. The task was reallocated to a PySpark instance with
a higher number of processors and cores, from four to six and from 3 to 15
respectively, and the default processing time in the Livy configuration was increased
from 3600 s to 36,000 s. This allowed the queries to execute successfully. The
results were optimized by removing the “inferSchema” parameter from the
“read.csv” method. This parameter infers the schema of every single column retrieved,
a process that slows down reading the blobs; with it set to false, the schema had to be
specified explicitly. Replacing the PySpark SQL commands that were not directly
associated with data manipulation and transformation with PySpark magic commands
also contributed to the success. Although PySpark magic commands are recommended
for tasks that require interaction with the Spark engine, such as configuring the Spark
context as demonstrated in Fig. 7 or loading data into a Spark DataFrame, they turned
out to have a positive impact on the performance of PySpark SQL queries executing
dataset-related commands such as data manipulation and transformation in this study.
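The session-parameter changes described above correspond to a sparkmagic `%%configure` cell of roughly the following shape (the values are illustrative, not the authors' exact configuration; the Livy-side limit itself is raised via `livy.server.session.timeout` in the cluster configuration rather than in this cell):

```
%%configure -f
{
    "driverMemory": "8G",
    "executorMemory": "8G",
    "executorCores": 6,
    "numExecutors": 15
}
```

The `-f` flag forces the running Livy session to restart with the new parameters, which is why such cells are placed at the top of the notebook.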
4.5 Comparison Between Spark and Hadoop Performance
A few studies have compared Hadoop and Spark, as discussed in the introduction.
Table 2 records the publicly claimed running times of Spark and Hadoop. Hadoop
takes longer per iteration because it runs independent MapReduce jobs. Spark’s
first iteration, on the other hand, takes some time, while subsequent iterations take
only a few seconds; this is due to the reuse of cached data, which allows Spark to
run 10–100 times faster [19]. Table 3 demonstrates the running times of pandas and
PySpark discussed by Databricks [12]. It is evident that a PySpark multi-core local
cluster can be a good choice for modelling less than 10 GB of data.
Fig. 7 Implementation of the new session variables
Table 2 Logistic regression performance of Hadoop and Spark
Spark cloud (multi-core): 0.9 s running time
Hadoop (multi-core): 110 s running time
Table 3 Pandas and PySpark execution comparison for a max-value query
Pandas local runtime: out of memory (35 GB parquet file)
PySpark local cluster runtime: 10 GB memory, 16 threads, 260 GB
5 Conclusion
This study has evaluated some key factors related to NHS Prescribing data by
interpreting the findings of the research via focussed questions. The main contribution
of this paper is the Cloud Spark and HDFS approach to processing a large amount of
NHS data using Azure HDInsight Spark Cluster technology, with a discussion and
comparison of its strengths and weaknesses.
The study also contributes descriptive statistics and machine learning models
of NHS prescription data trends. We found that the highest amount communities
paid in GBP was on enteral nutrition. England has increasingly paid for Apixaban,
the medication that marked the highest cost increase over the last couple of years.
The most dispensed and increasingly demanded drug over the years in England was
Cholecalciferol, also known as vitamin D3. Additionally, the most common disease
types over the years have remained central nervous system-related; the most typical
nervous system disorders have been dementia, Alzheimer’s, Parkinson’s, multiple
sclerosis, and epilepsy.
Conclusions are also drawn on the technology used in this research. Even though
Hadoop is a big data framework with data input/output facilities, Spark can process
data up to 100 times faster [19]. Standalone multithreaded PySpark queries can
handle around 260 GB of data, where pandas data frame queries run out of memory.
Spark, moreover, has head, Zookeeper, and worker nodes to enable the distribution
of query and model execution. Each node can be assigned multiple cores (processors),
which may add overhead. Spark performance-related issues and errors can happen for
many reasons; the following three areas can be carefully tuned for better performance:
(1) Spark session parameters (session time, inferSchema, executor memory, etc.);
(2) the number of processors and executors; (3) the data itself, which may need
transformation or alteration to avoid problems such as “ZeroDivisionError” or null
values in certain modelling.
The important questions to ask in selecting a type of Spark processing environ-
ment are (1) the type of users using the cluster, (2) the type of workload, (3) the
budget, and (4) the service level agreement. Microsoft Azure recommends standard
clusters for single users, while single nodes suit small jobs [13]. High-concurrency
clusters are best for sharing among several users or running ad hoc jobs. It was
evident that autoscaling reduces cost compared with a fixed-size cluster; however,
scaling up and down may slow the process. Single-user, all-purpose jobs can slow
down an autoscaling cluster if the jobs arrive a few minutes apart rather than as a
constant data supply.
Cluster configuration gives a trade-off between cost and performance. More
machines with less memory and storage require more shuffling of data to complete
a task. Therefore, data analytics, complex ETL, and machine learning model training
jobs are best executed with a smaller cluster: a smaller number of nodes
to minimize the shuffles. In a multi-user scenario, where read-only access is most
needed, it is best to use on-demand instances with a hybrid approach with cluster poli-
cies for different groups of users. The technical details and comparisons presented
in this paper and the challenges of HDInsight Spark Cluster implementation should
leave the reader with a choice of Spark application depending on the need.
References
1. Naser AY, Alwafi H, Al-Daghastani T, Hemmo SI, Alrawashdeh HM, Jalal Z, Paudyal V,
Alyamani N, Almaghrabi M, Shamieh A (2022) Drugs utilization profile in England and Wales
in the past 15 years: a secular trend analysis. BMC primary care 23(1):239. https://doi.org/10.
1186/s12875-022-01853-1
2. OpenPrescribing.net, Bennett Institute for Applied Data Science, University of Oxford, 2023,
https://openprescribing.net/
3. Salloum S, Dautov R, Chen X et al (2016) Big data analytics on Apache Spark. Int J Data Sci
Anal 1:145–164. https://doi.org/10.1007/s41060-016-0027-9
4. Shaikh E, Mohiuddin I, Alufaisan Y, Nahvi I (2019) Apache Spark: a big data processing engine.
In: 2019 2nd IEEE Middle East and North Africa communications conference (MENACOMM),
Manama, Bahrain, pp 1–6. https://doi.org/10.1109/MENACOMM46666.2019.8988541
5. Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I (2010) Spark: cluster computing
with working sets. In: Proceedings of the 2nd USENIX conference on Hot topics in cloud
computing (HotCloud’10). USENIX Association, USA, 10
6. Lekha RN, Sujala DS, Siddhanth DS (2018) Applying spark based machine learning model
on streaming big data for health status prediction. Comput Electric Eng 65:393–399, ISSN
0045-7906
7. Bell J, GBE FF (2017) Life sciences industrial strategy—a report to the government from the
life sciences sector. Office for Life Sciences
8. Kyoungyoung J, Gang-Hoon K (2013) Potentiality of big data in the medical sector: focus on
how to reshape the healthcare system. The Korean Society of Medical Informatics, 79–85
9. Villars RL, Olofson CW, Eastwood M (2011) Big data: what it is and why you should care.
IDC Analyze the Future, 4
10. Dash S, Shakyawar SK, Sharma M, Kaushik S (2019) Big data in healthcare: management,
analysis and future prospects. J Big Data 54
11. Kretz A (2019) The data engineering cookbook: mastering the plumbing of data science v3
12. Wang G, Xin R, Damji J (2018) Benchmarking Apache Spark on a single node machine,
Databricks Engineering Blog. https://www.databricks.com/blog/2018/05/03/benchmarking-apache-spark-on-a-single-node-machine.html
13. Microsoft (2023) Best practices: cluster configuration, Azure Databricks documentation,
https://learn.microsoft.com/en-us/azure/databricks/clusters/cluster-config-best-practices
14. Learning Journal (2021) Parallel processing in Apache Spark, Apache Spark core context,
https://www.learningjournal.guru/article/apache-spark
15. MacDonald BK, Cockerell OC, Sander JW, Shorvon SD (2000) The incidence and lifetime
prevalence of neurological disorders in a prospective community-based study in the UK. Brain:
J Neurol 123(Pt 4):665–676. https://doi.org/10.1093/brain/123.4.665
16. Olvera Lopez E, Ballard BD, Jan A (2022) Cardiovascular disease. In: StatPearls [Internet].
StatPearls Publishing, Treasure Island (FL). Available from: https://www.ncbi.nlm.nih.gov/books/NBK535419/
17. NHS UK website (2023) Cardiovascular disease. Available at: https://www.nhs.uk/conditions/
cardiovascular-disease
Prediction of Column Average Carbon
Dioxide Emission Using Random Forest
Regression
P. Sai Swetha, M. A. Chiranjath Sshakthi, S. Hrushikesh, and A. Malini
Abstract The carbon dioxide emission in the atmosphere is increasing tremendously
each day. Researchers have developed satellites to monitor the emission level. The purpose
of this paper is to predict the column average carbon dioxide using the satellite data
and map the regions that emit carbon dioxide, so that the emission can be reduced to
maintain an eco-friendly environment. The model is trained using the Random Forest
Regression. The performance of the model depends upon the features selected for
training it. The satellite data is taken from the OCO-2 satellite. The Orbiting Carbon Observatory measures the amount of sunlight reflected from a column of air containing carbon dioxide (CO2) rather than the amount of CO2 itself. Additionally, the CO2 emission is also estimated for smaller areas using the large-area data. On average, the suggested model forecasts the carbon dioxide concentration in both bigger and smaller regions.
Keywords Wisdom of crowds · OCO-2 · XCO2 · Decision tree · Feature selection
1 Introduction
One of the biggest threats to the planet is climate change, and the main reason for this is CO2. The carbon dioxide concentration in the earth's atmosphere is increasing tremendously day by day. According to the reports, the carbon dioxide
P. S. Swetha ·M. A. C. Sshakthi ·S. Hrushikesh ·A. Malini (B)
Thiagarajar College of Engineering, Madurai, Tamil Nadu, India
e-mail: amcse@tce.edu
P. S. Swetha
e-mail: saiswetha@student.tce.edu
M. A. C. Sshakthi
e-mail: chiranjath@student.tce.edu
S. Hrushikesh
e-mail: hrushikesh@student.tce.edu
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_28
level today is 419.26 parts per million, which is more than 50% higher than before the industrial revolution. The major contributor to the expanding atmospheric carbon dioxide is human activity, such as the burning of fossil fuels like oil, coal, and gas. If carbon dioxide continues to increase at the same rate, the global warming threshold is likely to be reached around 2034–2050. Many countries are making efforts to reduce carbon dioxide emissions with the intention of overcoming global warming. Carbon dioxide levels must be monitored in order to reduce them, and the most efficient approach to detecting carbon dioxide is through satellites. The first satellite launched to detect carbon-dioxide-emitting areas was the Greenhouse Gases Observing Satellite (GOSAT), after which many more satellites were developed to detect CO2. One among them was the Orbiting Carbon Observatory 2 (OCO-2), an Earth satellite that investigated the worldwide sources and sinks of carbon dioxide, providing researchers a better understanding of climate change [1].
Greenhouse gases cannot be detected precisely by satellites like OCO-2 and GOSAT; hence, the whole atmospheric column is averaged, which is known as XCO2, where the 'X' denotes the column average observed by the satellite [2]. The main objective is to use the OCO-2 satellite data and build a machine learning model to better forecast the carbon dioxide over the surface of the earth for a smaller area. Machine learning, a subcategory of Artificial Intelligence, is one of the best approaches for building a predictive model. There are various machine learning algorithms for generating a predictive model of the XCO2 concentration, such as support vector machines, Gaussian process regression, Artificial Neural Networks (ANNs), and k-nearest neighbors, as well as tree-based ML models like Extreme Gradient Boosting (XGBoost) and Random Forest Regression [4]. Moreover, the Convolutional Neural Network, a deep learning model, can also predict the XCO2 emission. In this paper, the machine learning model is built and trained using an ensemble technique, Random Forest Regression. The ensemble technique is generally used to enhance the precision of the final outcome by combining multiple models instead of using a single model, which may not achieve sufficient accuracy. Instead of depending on just one decision tree, the method combines multiple trees [19]. The data for training the model was collected from the OCO-2 satellite dataset, which contains various features for predicting XCO2. The model was trained using the selected features. The objectives of the proposed work are:
• To predict the CO2 in smaller areas using the wisdom of crowds concept.
• To minimize the emission of CO2 into the atmosphere.
• To monitor the CO2 concentration in the atmosphere.
• To more accurately map the places that emit carbon dioxide using the OCO-2 satellite data.
The overall motivation is to accurately predict XCO2, which can assist in educating the public, researchers, and policymakers about the current environmental situation and in directing efforts to lessen the effects of climate change. The paper is organized as follows: Sect. 2 addresses the literature survey; Sect. 3 discusses the proposed methodology, which includes Random Forest Regression, a machine learning model; Sect. 4 covers the results and discussion; and Sect. 5 provides the conclusion and ideas
for future studies. The contribution of this paper is its focus on the prediction of column average carbon dioxide using OCO-2 satellite data and the mapping of the locations that release CO2, in order to minimize emissions and preserve a healthy environment. In addition, CO2 emissions are estimated for smaller locations using larger-area data.
2 Literature Survey
Mengya Sheng et al. proposed a method to more accurately retrieve XCO2 values in a particular column of air by mapping a global, spatiotemporally continuous XCO2 field using XCO2 data retrieved from the OCO-2 and GOSAT satellites, which is helpful for gap-filling and data integration. They obtained the spatiotemporally continuous mapping of XCO2 by applying an integrated kriging approach to the spatiotemporal XCO2 data [10]. Thus, by mapping the two datasets together, they were able to find the concentration of XCO2 over a column of air with more certainty.
Zhang et al. proposed a dimensionality-reduction and classification technique for hyperspectral remote sensing images of CO2 based on a neural network. They first reduced the dimensions of the hyperspectral remote sensing image using genetic algorithms and kernel principal component analysis. Then, a traditional remote sensing method classifies the hyperspectral remote sensing images. Lastly, based on the spectral local mean and standard deviation, an image for noise assessment was produced to improve accuracy [11]. They used one of the most crucial tools for obtaining high-accuracy CO2 concentration data, namely the spectral absorption characteristic spectrum of atmospheric CO2.
Brazidec et al. have addressed increasing the accuracy of XCO2 retrieval by segmenting XCO2 images with deep learning. They tackle the problem of plume segmentation using an image-to-image CNN, the U-net architecture, to convert a region of XCO2 into a picture that depicts the locations of the target plumes [12]. They claim that their model performs better than the usual segmentation techniques and is able to detect most of the plumes.
Zhang and Liu have proposed a methodology for mapping contiguous XCO2 using ML algorithms to analyze the spatiotemporal variations in the satellite-retrieved data. Using the column-averaged dry-air mole fraction XCO2 data from SCIAMACHY, GOSAT, and OCO-2, they derived contiguous XCO2 data across China at 0.25° resolution [13], achieving bias and standard deviation values of 0.11 and 1.38 ppm. With the assistance of the dataset, the outcomes of the model simulation were fairly close to the real values at the in situ locations.
Liu et al. have discussed a retrieval algorithm for XCO2 using the TanSat and GOSAT datasets. In the TanSat algorithm, the XCO2 value is calculated by fitting the observed and simulated spectra to an atmospheric radiative transfer model. The CO2 information was obtained from the strong and weak absorption bands. A one percent error was found in the observation using the TanSat algorithm, the GOSAT dataset, and atmospheric CO2 measurements from space [14]. Even though the retrieval remains uncertain in the middle to upper latitudes, its accuracy depended on the instrument's sampling precision as well as the theoretical and algorithmic settings.
Noël et al. have proposed retrieving the column-determined CO2 mole fraction in dry air from the GOSAT and GOSAT-2 datasets using the FOCAL algorithm [15]. The preprocessing involves measured spectra and geolocation; estimation of parameters for instrument noise and clouds; filtering by data quality, latitude, and zenith angle; and the addition of corresponding meteorological measurements. They concluded that FOCAL is one of the fastest algorithms for retrieving XCO2 data, as it takes on average only 22 s and six iterations to process one GOSAT ground pixel; hence, it proves to be a computationally fast algorithm for XCO2 retrieval that is said to be more precise than other existing algorithms.
Malini Alagarsamy et al. have proposed an algorithm for load balancing in cloud computing called Cost-Aware Ant Colony Optimization. The main focus of this model is the distribution of workload to virtual machines. In order to reduce processing time, response time, cost, power consumption, and carbon footprint, the model uses the swarm-based Ant Colony Optimization algorithm. From this model they inferred that the algorithm produces faster processing times and faster responses when compared with other algorithms [21]. Some points to note about the model are that the data load should be distributed evenly to avoid overloading any single virtual machine, and that the design should use multiple nodes to avoid a single point of failure.
3 Proposed Methodology
The prediction of XCO2 is done using a decision-tree-based ML algorithm, Random Forest Regression. The advantage of this algorithm is its low variance, due to the combination of multiple decision trees. It does not require any normalization, as it works on a tree-based approach. Moreover, it gives good accuracy. The workflow of the model is as follows (Fig. 1).
3.1 Data Interpretation
In order to develop an XCO2 prediction model, the data must first be interpreted, that is, explored in depth. The dataset used for this paper was obtained from the public platform NASA Earthdata [5]. The dataset contains many features along with the geolocated XCO2 retrieval values, is provided in h5 (HDF5) file format, and contains 40,000 rows and 200 columns.
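As a sketch of this step, the geolocated fields can be read from the h5 file with h5py; the file path and field names below (e.g. "xco2", "latitude") are illustrative and should be verified against the actual OCO-2 Lite product layout:

```python
import h5py
import pandas as pd

def load_fields(path, fields):
    """Read the named 1-D datasets from an HDF5 file into a DataFrame.

    Field names such as "xco2", "latitude", "longitude" are assumptions
    here; check the real file's layout with f.keys() before relying on them.
    """
    with h5py.File(path, "r") as f:
        return pd.DataFrame({name: f[name][:] for name in fields})

# Example (hypothetical file name):
# df = load_fields("oco2_lite.h5", ["xco2", "latitude", "longitude"])
```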
Fig. 1 Workflow of RFR model
Table 1 Selected features along with feature ranking
Features            S1        S2        S3
Latitude            0.502305  0.514717  0.494718
Longitude           0.164560  0.157361  0.152589
Pressure            0.129823  0.149461  0.116941
Altitude            0.079290  0.094962  0.087572
Solar angle         0.079290  NIL       0.050217
Fluorescence 757    0.031477  NIL       0.028987
Polarization angle  NIL       0.057228  0.043224
Wind                NIL       0.026271  0.025752
3.2 Feature Selection
Some significant features are selected from the dataset while keeping XCO2 as the target value. The model is trained based on the target value and the selected features, as shown in Table 1.
3.3 Data Preprocessing
First, we structured the data into a usable format so that it can be analyzed easily. Then, we converted each column chosen during feature selection into a dataframe and concatenated them into a single dataframe, with the axis set to 1 so that the dataframes are stacked side by side. The data is split using train_test_split for model training, with a test size of 0.2 and a train size of 0.8.
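A minimal sketch of this preprocessing step, assuming a pandas dataframe whose column names are illustrative stand-ins for the real OCO-2 fields:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Stand-in for the OCO-2 dataframe; column names are illustrative only.
data = pd.DataFrame(rng.random((100, 5)),
                    columns=["latitude", "longitude", "pressure",
                             "altitude", "xco2"])

# Each chosen column becomes a dataframe; axis=1 stacks them side by side.
features = ["latitude", "longitude", "pressure", "altitude"]
X = pd.concat([data[c].to_frame() for c in features], axis=1)
y = data["xco2"]

# 80/20 train/test split, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, train_size=0.8, random_state=42)
```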
3.4 Training the Model
The algorithm used for training the model is Random Forest Regression, which uses multiple decision trees as the base learning model [6]. The baseline model, a simple model that serves as a reference in a machine learning project, is set to linear regression, chosen because it helps in predicting continuous values in the dataset [7]. The parameters for training the model are n_estimators and random_state. The n_estimators parameter is set to 100, which means that 100 decision trees are built, and random_state is set to 42 so that identical results are obtained across executions. The model is then built and trained. From the selected features, those that most affect the emission of carbon dioxide are visualized and shown as a plot.
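With scikit-learn, the training step might look as follows; the synthetic data stands in for the selected features, while the hyperparameters (n_estimators=100, random_state=42) are the ones stated above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_train = rng.random((80, 4))                        # stand-in for the selected features
y_train = X_train @ np.array([0.5, 0.2, 0.2, 0.1])   # synthetic XCO2-like target

# 100 trees, fixed random_state so results are reproducible across runs.
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Per-feature importances: the quantity visualized as bar plots in Fig. 2.
print(model.feature_importances_)
```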
3.5 Evaluation Metrics
Several measures, including Mean Squared Error, Mean Absolute Error, and Coefficient of Determination, can be used to assess the effectiveness of the Random Forest Regression model. The measures used here are the R2 score and MSE, which evaluate the precision and accuracy of the trained model. The R2 score denotes how well the model has performed, that is, how well it fits the regression line [8]. Mathematically, it is given as

R2 score = (total variance explained by the model) / (total variance).
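With scikit-learn, the two measures can be computed as in this small sketch (the values below are toy numbers, not results from the paper):

```python
from sklearn.metrics import mean_squared_error, r2_score

# Toy true values and predictions, in ppm, to illustrate the two measures.
y_true = [400.0, 401.0, 402.0, 403.0]
y_pred = [400.1, 400.9, 402.2, 402.8]

mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)   # explained variance / total variance
print(mse, r2)                  # 0.025 0.98
```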
3.6 Mapping the CO2Emission Regions
A Mapbox map is created with the actual CO2 data and a legend displayed alongside it. When hovered over, each point shows its latitude, longitude, and the XCO2 concentration present.
3.7 Prediction of CO2in Smaller Areas
For the prediction of carbon dioxide in smaller areas, a small subset of the large dataset is taken to represent the smaller region, and the XCO2 value is extrapolated using the previously trained larger model; the average predicted XCO2 value is then computed for the assumed region. A concept called the wisdom of crowds is applied here, which states that a large group of people can together make better decisions than an individual. By combining the opinions of many different people, the wisdom of crowds can produce more accurate predictions and better decision-making than depending solely on one person's skill or knowledge [9]. The larger dataset is split using train_test_split for model training, with a test size of 0.2 and a train size of 0.8. The model is then trained, and the CO2 for the smaller region is obtained using the wisdom of crowds concept. A constant factor of 0.74 is applied when the smaller-area estimate is extracted from the larger-area value; since the larger data tend to be more accurate, this common factor is used to improve the accuracy of the predicted average XCO2.
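A sketch of this smaller-area step, under the assumption that the 0.74 factor is applied to the mean of the per-point predictions of the larger-area model (synthetic data, illustrative names):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 4))                      # stand-in for large-area features
y = X @ np.array([0.5, 0.2, 0.2, 0.1])        # synthetic XCO2-like target

# Model trained on the larger region.
model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

# A small subset of the large dataset stands in for the smaller region.
small = X[:20]

# Wisdom of crowds: average the per-point predictions, then apply the
# paper's empirical correction factor of 0.74.
xco2_avg = 0.74 * model.predict(small).mean()
print(xco2_avg)
```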
4 Results and Discussion
The goal of the model is to identify carbon dioxide-emitting areas so that action can be taken to reduce emissions and public attention can be drawn to the state of the environment. The results obtained from the model vary depending on the features selected to train it. The data for building the model was taken from the OCO-2 satellite dataset, which contains numerous features. The actual CO2 concentration from the OCO-2 satellite is compared with the predicted XCO2, and the results are shown. The R2 score and Mean Squared Error vary depending on the features selected from the OCO-2 dataset. Three cases are discussed below based on the features selected, with the aim of finding the best feature set for detecting the emission of XCO2. The selected features along with their feature ranking, a horizontal bar graph of the features and the degree of their effect on CO2, the mapping of the predicted CO2 regions, the total time to train the model, the R2 score, the Mean Squared Error, and the prediction for a smaller area are discussed for all three simulations. Simulations 1, 2, and 3 are denoted S1, S2, and S3.
With respect to the features selected, the degree of their effect on the emission of carbon dioxide is represented as a bar graph for all three simulations, shown in Fig. 2a–c.
The predicted carbon dioxide concentration is mapped from the given OCO-2 satellite data along with the actual XCO2 concentration from the OCO-2 satellite. Figure 3 shows the actual concentration from the OCO-2 dataset, and Figs. 4, 5, and 6 show the XCO2 predicted by the model for the respective feature selections.
Based on the XCO2 concentration predicted by the model for the three feature-selection cases, the best simulation among the three can be determined using factors such as the time taken to train the model, the Mean Squared Error (MSE), and the R2 score. The average carbon dioxide concentration for the smaller region, derived from the larger area according to the wisdom of crowds concept, is also listed for each case in Table 2.
From the three simulations listed in Table 2, it can be inferred that the features are most appropriately selected in simulation 2, which gives the highest R2 score and the lowest Mean Squared Error (MSE). In addition, the training time of the model
Fig. 2 a, b, c Features and the degree of their effect on CO2
Fig. 3 Actual CO2 regions from the OCO-2 satellite
Fig. 4 Predicted CO2 regions for S1
Fig. 5 Predicted CO2 regions for S2
Fig. 6 Predicted CO2 regions for S3
Table 2 Factors and their assessment
Factor                        S1        S2        S3
Time to train model           13.39 s   6.68 s    8.42 s
R2 score                      0.80      0.81      0.79
Mean Squared Error (MSE)      1.15e-12  1.08e-12  1.22e-12
Average CO2 for smaller area  0.0033    0.0030    0.0033
is also lower compared to the other two simulations. It can also be deduced that as more features are used, the R2 score decreases and the Mean Squared Error rises.
5 Conclusion
In this paper, we have analyzed space-based observations from the NASA Orbiting Carbon Observatory-2 (OCO-2) satellite dataset. The ultimate goal is to use the data previously retrieved by the satellite to more accurately estimate the carbon dioxide concentration. We initially interpreted the data, selected features for training the model, and then preprocessed the data. For feature selection, we chose the eight most important features
in total for training. The model was trained using the Random Forest Regression algorithm. Rather than utilizing a single model, which might not provide accurate results, this model uses the ensemble approach, which increases the accuracy of the final result; instead of depending on just one decision tree, the method combines numerous trees to determine the outcome [6]. The ML model is assessed using the evaluation metrics after training; here, model performance was evaluated through the Mean Squared Error and R2 score. The lower the Mean Squared Error, the more efficient the model [16]. We compared three cases based on feature selection to find the most significant features, based on which the model predicted the XCO2 concentration. Of the three simulations compared, simulation 2 produced the most efficient model, with an R2 score of 0.81. Generally, an R2 score above 0.78 is considered good for prediction with low error [17]. We have predicted the emission of carbon dioxide and mapped it according to latitude and longitude, alongside the actual carbon dioxide concentration from the satellite, which is also mapped. We have used the larger area to predict the carbon dioxide for a smaller area using the wisdom of crowds, a mathematical concept. This concept integrates the perspectives of many diverse individuals; that is, the wisdom of crowds can result in more accurate forecasts and better decision-making than relying exclusively on one person's talent or knowledge [18]. The best case, with its features, R2 score, Mean Squared Error (MSE), and average CO2 for the smaller area, is presented: the average CO2 for the smaller area is 0.0030, the time taken to train the model is 6.68 s, and the Mean Squared Error is 1.08e-12. Future work will focus on predicting the emission of all greenhouse gases using the same Random Forest Regression. Additionally, it can be extended to the aviation industry by suggesting changes of route in cases where higher CO2 concentrations cause climatic change or a reduction in engine performance.
References
1. Sheng M, Lei L, Zeng Z-C, Rao W, Zhang S (2021) Detecting the responses of CO2 column
abundances to anthropogenic emissions from satellite observations of GOSAT and OCO-2.
Remote Sens 13(17):3524
2. Liang A, Gong W, Han G, Xiang C (2017) Comparison of satellite-observed XCO2 from GOSAT, OCO-2, and ground-based TCCON. Remote Sens 9(10):1033
3. Saleh C, Dzakiyullah NR, Nugroho JB (2016) Carbon dioxide emission prediction using a
support vector machine. In: IOP conference series: materials science and engineering, vol
114(1). IOP Publishing, 012148
4. Yuan X, Suvarna M, Low S, Dissanayake PD, Lee KB, Li J, Ok YS et al (2021) Applied
machine learning for prediction of CO2 adsorption on biomass waste-derived porous carbons.
Environ Sci Technol 55(17):11925–11936
5. https://disc.gsfc.nasa.gov/datasets/OCO2_L2_Lite_FP_9r/summary
6. Rodriguez-Galiano V, Sanchez-Castillo M, Chica-Olmo M, Chica-Rivas MJOGR (2015)
Machine learning predictive models for mineral prospectivity: an evaluation of neural networks,
random forest, regression trees and support vector machines. Ore Geol Rev 71:804–818
7. Aalen OO (1989) A linear regression model for the analysis of life times. Stat Med 8(8):907–925
8. https://www.bmc.com/blogs/mean-squared-error-r2-and-variance-in-regression-analysis/
9. https://www.investopedia.com/terms/w/wisdom-crowds.asp#:~:text=Wisdom%20of%
20the%20crowd%20is,and%20innovating%20than%20an%20individual
10. Sheng M, Lei L, Zeng Z-C, Rao W, Song H, Changjiang W (2023) Global land mapping
dataset of XCO2 from satellite observations of GOSAT and OCO-2 from 2009 to 2020. Big
Earth Data 7(1):180–200
11. Zhang L, Wang J, An Z (2020) Classification method of CO2 hyperspectral remote sensing data based on neural network. Comput Commun 156:124–130. ISSN 0140-3664
12. Dumont Le Brazidec J, Vanderbecken P, Farchi A, Bocquet M, Lian J, Broquet G, Kuhlmann
G, Danjou A, Lauvaux T, Segmentation of XCO2 images with deep learning: application to
synthetic plumes from cities and power plants. Geosci Model Dev
13. Zhang M, Liu G (2019) Mapping contiguous XCO2 by machine learning and analyzing the spatio-temporal variation in China from 2003 to 2019. Sci Total Environ 858(Part 2):159588. ISSN 0048-9697
14. Liu Y, Yang DX, Cai ZN (2013) A retrieval algorithm for TanSat XCO2 observation: retrieval
experiments using GOSAT data. Chin Sci Bull 58:1520–1523. https://doi.org/10.1007/s11434-
013-5680-y
15. Noël S, Reuter M, Buchwitz M, Borchardt J, Hilker M, Bovensmann H, Warneke T et al (2021)
XCO2 retrieval for GOSAT and GOSAT-2 based on the FOCAL algorithm. Atmos Meas Tech 14(5):3837–3869
16. Imbens GW, Newey WK, Ridder G (2005) Mean-square-error calculations for average
treatment effects
17. Subramaniam N, Yusof N (2021) Modeling of CO2 emission prediction for dynamic vehicle
travel behavior using ensemble machine learning technique. In: 2021 IEEE 19th student
conference on research and development (SCOReD). IEEE, pp 383–387
18. Larrick RP, Mannes AE, Soll JB, Krueger JI (2011) The social psychology of the wisdom of
crowds. Soc Psychol Decision Making 227–42
19. Hammerling DM, Michalak AM, O'Dell C, Kawa SR (2012) Global CO2 distributions over land from the Greenhouse Gases Observing Satellite (GOSAT). Geophys Res Lett 39(8)
20. https://ocov2.jpl.nasa.gov/observatory/instrument/#:~:text=OCO%2D2%20does%20not%
20measure,can%20be%20used%20for%20identification
21. Alagarsamy M, Sundarji A, Arunachalapandi A, Kalyanasundaram K (2021) Cost-aware ant colony optimization based model for load balancing in cloud computing. Int Arab J Inf Technol
18(5):719–729
Predicting Students’ Performance Using
Feature Selection-Based Machine
Learning Technique
N. Kartik, R. Mahalakshmi, and K. A. Venkatesh
Abstract Early evaluation of the students’ performance to determine their strengths
and weaknesses helps them perform better in examinations. Improving students’
overall learning experiences and academic success has been a hot issue recently.
In this paper, classical machine learning algorithms such as random forest, J48, and the Logistic Model Tree are built and trained on student data to predict students' performance. To improve the accuracy of the models, feature selection algorithms such as correlation-based feature selection, the information gain ranking filter, the gain ratio feature evaluator, and the symmetrical uncertainty ranking filter are used; the selected features are then used to train the models, and the performances of the models are compared with each other.
Keywords Students' performance · Machine Learning models · Feature selection
1 Introduction
Educational institutions face the challenge of accurately predicting student perfor-
mance as early as possible. Machine learning technology can help institutions iden-
tify students who are at risk of underperforming and provide them with personalized
support to improve their academic achievements. This approach can enhance the
institution’s retention and graduation rates. In the past, if a student had low marks
N. Kartik (B)
Department of Computer Applications/Science, Presidency College(Autonomous)/Presidency
University, Bengaluru, India
e-mail: nkartik.mca@gmail.com
R. Mahalakshmi
Department of Computer Science, Presidency University, Bengaluru, India
e-mail: mahalakshmi@presidencyuniversity.in
K. A. Venkatesh
School of Advanced Computer Science, Alliance University, Bengaluru, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_29
or did not do well in class, the instructor or the student’s parents could not compre-
hend the factors that led to these outcomes. Learning Management Systems (LMSs)
are increasingly being used by institutions to control both the learning content they
provide and the activities and behaviors of their students while using such systems.
Understanding how students behave in LMSs can help institutions predict how well
students will perform based on their grades and the activities they have completed in
the system. A significant amount of work has been done to forecast student performance for many reasons, including the assignment of courses and resources. Still, there is high demand in the field for systems that predict student performance more accurately.
In this paper, we have built Machine Learning models to predict student performance on data collected from an LMS. It is crucial for the model to predict correctly, since decisions are made based on its output. To improve the performance of the model, feature engineering techniques are used; these techniques help to find the most significant features contributing to the prediction. The selected features are used to train the models to achieve good prediction accuracy.
2 Related Work
Machine Learning and data mining algorithms are extensively used for predicting
academic data. In a study, a decision tree classifier, neural network, and nearest
neighbor classifier were combined for successful and unsuccessful student predic-
tions for prestigious Bulgarian universities [1]. WEKA application is explored by
using classifiers like J48, Naive Bayes, BayesNet, K-NN, OneR, and JRip. This
article compares supervised Machine Learning prediction algorithms and assesses
the model’s performance using real-world data from the field of education [2, 3].
A MapReduce-based neural network trained with the cumulative dragonfly model (CDF-NN) has been used to predict student grades [4].
Further, a study investigates AI and ML applications in which student records and on-campus behavior predict academic success; it predicts student outcomes using the Open University Learning Analytics Dataset (OULAD) and a Bayesian network [5]. Genetic algorithms (GAs) can forecast student performance and give suggestions. Naive Bayes, Decision Tree (J48), Random Forest, Random Tree, REP Tree, Simple Logistic, and ZeroR have been used to predict performance [6], and optimization has been applied to tune FNN parameters [7]. KSITM researchers trained and tested a model to address student underachievement [8, 9]. Educational data research has been conducted to learn about students' performance in MOOC classes, where training on a technology-based learning paradigm was used to improve classification methods and optimize the parameters of RBFN networks [10].
3 Proposed Modeling
We have proposed a system whose methodology has helped to improve the model's accuracy. The study includes preprocessing techniques, feature engineering techniques, Machine Learning model building, and model evaluation measures. The results are compared by running the models with and without the selected features. Remarkable results can be achieved with the proposed methodology (Fig. 1).
3.1 Datasets
The dataset used in the study was downloaded from Kaggle. The data were gathered using the experience API learner activity tracker tool (xAPI) and compiled over two academic semesters, containing 480 student records. The dataset comprises information about the students, including their gender, nationality, place of birth, grade level, number of raised hands, and absences.
3.2 Preprocessing
The dataset contains both discrete and continuous values across 17 features, with no missing values. Discrete values are converted to numeric using the one-hot encoding technique, and the dataset is then normalized using a z-score normalizer.
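A minimal pandas sketch of these two steps, using toy records and illustrative column names rather than the exact Kaggle field names:

```python
import pandas as pd

# Toy records standing in for the xAPI student dataset.
df = pd.DataFrame({"gender": ["M", "F", "F", "M"],
                   "raisedhands": [10, 80, 30, 60]})

# One-hot encode the discrete column(s) ...
df = pd.get_dummies(df, columns=["gender"], dtype=int)

# ... then z-score normalize the continuous one(s).
df["raisedhands"] = (df["raisedhands"] - df["raisedhands"].mean()) \
    / df["raisedhands"].std()
print(df)
```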
3.3 Feature Selection
The main goals of feature selection include improving the model’s performance,
avoiding overfitting, and providing faster, more efficient, cost-effective models. Four
Fig. 1 Proposed methodology
different feature selection techniques were used in this work, and their results are analyzed. The top features commonly selected across the techniques are chosen.
3.4 Correlation-Based Feature Selection (CFS)
CFS measures the degree of relevance and redundancy of a feature subset by calculating a correlation-based merit. Equation (1) calculates this merit, which measures the similarity between the feature subset and the output class:

r_zc = (k · r̄_zi) / √(k + k(k − 1) · r̄_ii)    (1)

where r_zc is the correlation (dependence) between the combined feature subset and the class variable, k is the number of features, r̄_zi is the average feature–class correlation, and r̄_ii is the average feature–feature intercorrelation.
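Equation (1) can be sketched directly in code; the helper below is illustrative and takes precomputed correlations as input:

```python
import math

def cfs_merit(feat_class_corrs, feat_feat_corrs):
    """CFS merit of a feature subset (Eq. 1).

    feat_class_corrs: feature-class correlations for the k features.
    feat_feat_corrs:  pairwise feature-feature correlations.
    """
    k = len(feat_class_corrs)
    r_zi = sum(feat_class_corrs) / k                  # mean feature-class corr
    r_ii = (sum(feat_feat_corrs) / len(feat_feat_corrs)
            if feat_feat_corrs else 0.0)              # mean intercorrelation
    return k * r_zi / math.sqrt(k + k * (k - 1) * r_ii)

# A subset whose features correlate well with the class but little with
# each other scores higher than a redundant one.
print(cfs_merit([0.6, 0.5], [0.1]))
print(cfs_merit([0.6, 0.5], [0.9]))
```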
3.5 Information Gain Ranking Filter
Quinlan's information gain metric, used in ID3's basic decision tree-building
process, is used to choose each node's test attribute. Let node N represent the
tuples of partition D. The IG calculation determines node N's splitting attribute.
Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)    (2)
where p_i, estimated by |C_i|/|D|, is the likelihood that an arbitrary object in D belongs
to class C_i. A log function to base 2 is employed since the information is encoded in
bits. Info(D) represents the average amount of information needed to determine the class
label of an object in partition D.
Now, if the objects in D must be partitioned on some feature (attribute) A with v unique
values, a_1, a_2, a_3, …, a_v, the expected information of the resulting subsets is given
as follows:
Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)    (3)

Gain(A) = Info(D) - Info_A(D)    (4)
Predicting Students’ Performance Using Feature Selection-Based 393
where |D_j|/|D| serves as the weight of the jth partition. The information gain
on A is defined as the difference between the original information requirement before
splitting and the new requirement after splitting on A, as shown in Eq. (4).
SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\left(\frac{|D_j|}{|D|}\right)    (5)
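Equations (2)–(5) can be sketched in pure Python; the toy attribute and class labels below are hypothetical:

```python
import math
from collections import Counter

def info(labels):
    """Info(D) = -sum p_i * log2(p_i)  (Eq. 2)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Gain(A) = Info(D) - Info_A(D)  (Eqs. 3-4); rows[i][attr] is A's value."""
    n = len(labels)
    by_value = {}
    for row, y in zip(rows, labels):
        by_value.setdefault(row[attr], []).append(y)
    info_a = sum(len(part) / n * info(part) for part in by_value.values())
    return info(labels) - info_a

def split_info(rows, attr):
    """SplitInfo_A(D)  (Eq. 5): entropy of the attribute's value distribution."""
    return info([row[attr] for row in rows])

# Toy data: the attribute "absent" perfectly separates the two classes.
rows = [{"absent": "high"}, {"absent": "high"}, {"absent": "low"}, {"absent": "low"}]
labels = ["L", "L", "H", "H"]
assert info(labels) == 1.0                    # two balanced classes -> 1 bit
assert info_gain(rows, labels, "absent") == 1.0
assert split_info(rows, "absent") == 1.0      # two equal-size partitions
```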
3.6 Symmetrical Uncertainty Ranking Filter
The symmetrical uncertainty criterion overcomes the information gain bias for
features with many values by normalizing its values to the range [0,1]. The following
equation gives this:
SU = \frac{2 \times Gain(A)}{Info(D) + SplitInfo_A(D)}    (6)
An SU value of 0 indicates no association between the two features, while SU = 1 indicates
that knowledge of one feature entirely predicts the other. Like the gain ratio (GR)
criterion, SU favors features with fewer values.
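A self-contained sketch of Eq. (6); the feature and class labels are hypothetical:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy of a discrete value sequence, in bits."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def symmetrical_uncertainty(feature, labels):
    """SU = 2*Gain(A) / (Info(D) + SplitInfo(A))  (Eq. 6), normalized to [0, 1]."""
    n = len(labels)
    by_value = {}
    for v, y in zip(feature, labels):
        by_value.setdefault(v, []).append(y)
    info_a = sum(len(p) / n * entropy(p) for p in by_value.values())
    gain = entropy(labels) - info_a
    split_info = entropy(feature)   # SplitInfo equals the entropy of A itself
    denom = entropy(labels) + split_info
    return 2 * gain / denom if denom else 0.0

feature = ["yes", "yes", "no", "no"]
labels = ["H", "H", "L", "L"]
assert symmetrical_uncertainty(feature, labels) == 1.0   # one predicts the other
assert symmetrical_uncertainty(["a", "a", "a", "a"], labels) == 0.0
```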
Selected Features from the Different Feature Selection Techniques
After applying the feature selection techniques, the common features shortlisted
based on the ranking by these techniques are VisITed Resources, Student Absence
Days, raised hands, Announcements’ View, Parent Answering Survey, and Relation.
The scores assigned by each technique are presented in detail in Table 1.
Table 1 Feature selection techniques and the scores of their selected features

| Technique | VisITed Resources | Student Absence Days | Raised hands | Announcements' View | Parent Answering Survey | Relation |
| Correlation-based feature selection (CFS) | 0.3829 | 0.3608 | 0.3283 | 0.2895 | 0.2369 | 0.2358 |
| Information gain ranking filter | 0.45801 | 0.39745 | 0.37337 | 0.2578 | 0.1504 | 0.1261 |
| Gain ratio feature evaluator | 0.19878 | 0.40986 | 0.25378 | 0.17636 | 0.15212 | 0.1291 |
| Symmetrical uncertainty ranking filter | 0.23776 | 0.31564 | 0.24728 | 0.17127 | 0.11855 | 0.0998 |
Classification Models
In this paper, we built three classifiers, random forest (RF), J48, and Logistic Model
Tree (LMT), to predict the students' performance. The dataset is split in a 70:30 ratio
for training and testing, respectively. These classifiers are trained on the dataset both
without feature selection and with the selected features.
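A minimal sketch of the 70:30 split described above; the helper name and the fixed seed are illustrative assumptions (scikit-learn's `train_test_split` would normally be used):

```python
import random

def train_test_split(rows, labels, test_ratio=0.30, seed=42):
    """Shuffle and split the dataset 70:30, the ratio used in the paper.
    The seed is fixed only to make this sketch reproducible."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = int(len(rows) * (1 - test_ratio))
    train, test = idx[:cut], idx[cut:]
    return ([rows[i] for i in train], [labels[i] for i in train],
            [rows[i] for i in test], [labels[i] for i in test])

rows = [[i] for i in range(480)]   # 480 records, as in the xAPI dataset
labels = ["M"] * 480               # placeholder class labels
x_tr, y_tr, x_te, y_te = train_test_split(rows, labels)
assert len(x_tr) == 336 and len(x_te) == 144   # 70% / 30% of 480
```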
3.7 Accuracy Metrics for the Classifier
Model assessment is essential in developing a reliable data mining and machine
learning model. The most common criterion for evaluating a classification system
is its accuracy. A model with an accuracy of 99.9% or above may appear successful;
however, accuracy alone can be unreliable and even misleading in certain situations
with overfitting problems.
Accuracy metrics of the classifiers:
Precision = True Positive/(True Positive + False Positive).
Recall, also called sensitivity, measures how accurately the positive class was
predicted:
Recall = True Positive/(True Positive + False Negative).
The F-score, also known as the F-measure, is a single score that combines precision
and recall to balance both objectives:
F-measure = (2 × Precision × Recall)/(Precision + Recall).
The F-measure is a popular metric for imbalanced classification.
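These metrics can be computed directly from the confusion counts of a class; the counts below are hypothetical:

```python
def precision(tp, fp):
    """Fraction of positive predictions that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of actual positives that are recovered."""
    return tp / (tp + fn)

def f_measure(p, r):
    """Harmonic mean of precision and recall: F = 2*P*R / (P + R)."""
    return 2 * p * r / (p + r)

# Hypothetical confusion counts for one class
tp, fp, fn = 80, 20, 10
p, r = precision(tp, fp), recall(tp, fn)
assert p == 0.8
assert abs(r - 80 / 90) < 1e-12
```

In practice, scikit-learn's `precision_recall_fscore_support` computes all three per class.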
4 Results and Discussions
The proposed model is tested in both cases by implementing it on the Jupyter Notebook
Python platform. The models are executed with the training and testing datasets in the
proposed 70:30 ratio. First, the models were executed considering all the features.
The experimental results of each model in this first case are presented in Table 2.
Compared to J48 and LMT, RF performs well with 78.66% accuracy. This is due to
considering more attributes with binary values (Fig. 2).
The numerical results are shown in Table 2. The first experiment consists of three
separate runs of classification algorithms on datasets that do not use the proposed
model phases. Both RF and LMT provide better outcomes than J48. Figure 3 shows
the area under the receiver operating characteristic (ROC) curve for each predicted
class for the random forest.
After encoding the classes and feature selection, the dataset is trained on the
same classifiers in our second experiment, conducted according to the
recommended technique. Table 3 displays and compares the results of each classifier.
In terms of classifying the dataset, RF's results have vastly improved. Figure 5
Table 2 Classification results for original datasets

| Metric | RF | J48 | LMT |
| TP rate | 0.767 | 0.758 | 0.775 |
| FP rate | 0.139 | 0.139 | 0.131 |
| Precision | 0.776 | 0.760 | 0.775 |
| Recall | 0.767 | 0.758 | 0.775 |
| F-measure | 0.786 | 0.759 | 0.775 |
| ROC area | 0.897 | 0.855 | 0.882 |
| Accuracy | 0.786 | 0.758 | 0.775 |
Fig. 2 Classification results for original datasets (bar chart comparing RF, J48, and LMT)
Fig. 3 ROC curve for random forest with all features
displays the area under the receiver operating characteristic (ROC) curve for
each predicted class using RF (Fig. 4).
Table 3 Classification results for the selected features

| Metric | RF | J48 | LMT |
| TP rate | 0.856 | 0.717 | 0.731 |
| FP rate | 0.843 | 0.160 | 0.157 |
| Precision | 0.856 | 0.718 | 0.732 |
| Recall | 0.856 | 0.717 | 0.731 |
| F-measure | 0.856 | 0.716 | 0.731 |
| ROC area | 0.976 | 0.826 | 0.866 |
| Accuracy | 0.895 | 0.711 | 0.732 |
Fig. 4 Classification results for the selected features (bar chart comparing RF, J48, and LMT)
Fig. 5 ROC curve for random forest with selected features
5 Conclusion
Correct and timely prediction of student performance is one of the most challenging
tasks in education. It is necessary to help students at greater academic risk, ensure
a high retention rate, provide exceptional learning opportunities, and promote the
university and its reputation. In this paper, three Machine Learning models are built
and tested to predict student performance. The experiments were conducted in two
phases. In the first phase, all three models are trained with all the features, and the
results are compared; random forest gives 78.66% accuracy. To improve the accuracy of
the models and achieve correct predictions, we used feature selection techniques to
identify the significant features in the dataset and used those features for training the
models. In the second phase, the three models are trained on the selected features, and
random forest again outperforms the others with an accuracy of 89.5%. Comparing both
phases shows that the results improve in the second phase: the study's methodology
gives a good result compared with the models trained without feature selection.
References
1. Imran M, Latif S, Mahmood D, Shah M (2019) Student academic performance prediction using
supervised learning techniques. Int J Emerg Technol Learn 14(14):92–104. https://doi.org/10.
3991/ijet.v14i14.10310
2. Chaudhury P, Tripathy H (2020) A novel academic performance estimation model using two
stage feature selection. Indonesian J Electric Eng Comput Sci 19(3):1610–1619. https://doi.
org/10.11591/ijeecs.v19.i3.pp1610-1619
3. Alshabandar R, Hussain A, Keight R, Khan W (2020) Students performance prediction in
online courses using machine learning algorithms. Proc IJCNN Conf 2020:1–7. https://doi.
org/10.1109/IJCNN48605.2020.9207196
4. Velarde L, Gerardo C, Chamorro-Atalaya O, Morales-Romero G, Meza-Chaupis Y, Auqui-
Ramos E, Ramos-Cruz J, Aybar-Bellido I (2022) Quadratic vector support machine algorithm,
applied to prediction of university student satisfaction. Indonesian J Electric Eng Comput Sci
27(1):139–148. https://doi.org/10.11591/ijeecs.v27.i1.pp139-148
5. Chitti M, Chitti P, Jayabalan M (2020) Need for interpretable student performance prediction.
Proc DeSE Conf, 269–272. https://doi.org/10.1109/DeSE51703.2020.9450735
6. Salih NZ, Khalaf W (2021) Prediction of student’s performance through educational data
mining techniques. Indonesian J Electric Eng Comput Sci 22(3):1708–1715. https://doi.org/
10.11591/ijeecs.v22.i3.pp1708-1715
7. Ismail HM, Hennebelle A (2021) Comparative analysis of machine learning models for
students’ performance prediction. In: Advances in digital science - advances in intelligent
systems and computing, Antipova T (ed), vol 1352. Singapore, Springer, 149–160. https://doi.
org/10.1007/978-3-030-71782-7_14
8. Chakrapani P, CD (2022) Academic performance prediction using machine learning: a compre-
hensive and systematic review. Proc ICESIC, 335–340. https://doi.org/10.1109/ICESIC53714.
2022.9783512
9. Madhuri S, Adamuthe AC (2021) Comparative study of supervised algorithms for prediction
of students’ performance. Int J Modern Educ Comput Sci 13(1):1–21. https://doi.org/10.5815/
ijmecs.2021.01.01
10. Hao J, Gan J, Zhu L (2022) MOOC performance prediction and personal performance improve-
ment via Bayesian network. Educ Inf Technol 27:7303–7326. https://doi.org/10.1007/s10639-
022-10926-8
Hybrid Deep Learning-Based Human
Activity Recognition (HAR) Using
Wearable Sensors: An Edge Computing
Approach
Neha Gaud, Maya Rathore , and Ugrasen Suman
Abstract The growth of the Internet of Things (IoT) and advanced sensing-based
technologies has enabled the development of miniature-based systems.
In recent years, the use of wearable and mobile sensors for Human Walking Gesture
Recognition has become more popular in various applications, including health
care, surveillance, robotics, and industry. The recent growth of edge computing
technology for Industry 4.0 has provided the opportunity to design low-power
and less computationally expensive devices. Edge computing devices cannot
support heavy computation, but they provide great efficiency by reducing the network
size and communication latency. Deep learning algorithms have recently demonstrated
high performance in HAR. However, the deep learning (DL) models require
very high computation systems, which make them ineffective when used on edge
devices. In this research, a hybrid deep learning-based model is trained to recognize
the various gestures. Three deep learning-based models, namely one-Dimensional
Convolutional Neural Network (1D-CNN), Convolutional Neural Network–Long
Short-Term Memory (CNN-LSTM), and CNN-Gated Recurrent Unit (CNN-GRU),
are designed to test the various human mobility gestures. The WISDM, PAMAP2,
and UCI-HAR benchmark datasets were used to assess these models. Among the
three datasets, the best accuracies of the models are 99.89%, 97.28%, and 96.78%,
respectively, achieved for CNN-LSTM hybrid model. In future, the work can be
extended to design an end-to-end edge computing application using Arduino Nano
33 BLE Sensing microcontroller board. The compressed deep learning model will
be fused on the Arduino Nano board to recognize various human motion gestures.
The research demonstrates the classification of various HAR gestures using hybrid
deep learning models.
N. Gaud ·U. Suman
School of Computer Science and Information Technology, DAVV, Indore, M.P, India
e-mail: usuman.scs@dauniv.ac.in
M. Rathore (B)
Christian Eminent College, Indore, M.P, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_30
399
400 N. Gaud et al.
Keywords Deep learning ·Assistive technology ·Internet of heath things
(IoHT) ·Wearable technology ·Edge computing
1 Introduction
1.1 Research Motivation
HAR includes several techniques, which can be broadly divided into non-wearable
(e.g., computer-vision-based) and wearable (i.e., sensor-based) approaches; the latter
can be further segmented into object-tagged, wearable, dense sensing, etc. Before going
any further, it should be noted that HAR systems have various inherent design
challenges, namely the selection of different sensor types, the criteria for data
collection, recognition performance, energy consumption, processing power, and
adaptability. It is crucial to create a HAR model that is effective and portable while
keeping all these factors in mind. A network for mobile-based human activity
recognition has been described that uses data from triaxial accelerometers and a long
short-term memory technique [1]. Current technology relies on Internet connectivity
and cloud services to recognize activity patterns using non-parametric ML models in
devices like smart watches.
1.2 Recent Advancement
DL-based approaches have recently gained popularity for recognizing human activity
because they can employ representation learning techniques which would automati-
cally generate the best features from sensor-generated original data without the need
for human involvement and can find hidden patterns in data [2, 8]. The application
of edge computing frameworks to perform HAR models at the network’s end is still
in its initial stages [12]. Recent years have seen the emergence of edge computing
as a fresh framework which can shorten Internet connection lag times by relocating
processing power from far cloud servers to data sources. It is logical to transfer
the design of cloud-based IoT apps to edge-based ones. In this study, we investigate
model compression and the building of neural network-based models; in future work,
the implementation will be done on an Arduino Nano BLE sensing board.
1.3 Author’s Contribution and Novelty
This research presents hybrid DL models for gesture recognition. The following are
the paper’s contributions:
Hybrid Deep Learning-Based Human Activity Recognition (HAR) 401
The hybrid deep learning-based models (CNN-1D, CNN-GRU, and CNN-LSTM)
are designed for HAR gesture recognition.
The spatial and temporal features of datasets are used to design the above models.
Three publicly available benchmark datasets WISDM, PAMAP2, and UCI-HAR
are considered to design and test the models.
Finally, the model which has provided the highest accuracy, i.e., CNN-LSTM is
picked.
The confusion matrix is prepared for all three models over different datasets.
One major objective is to design devices with low power consumption while
protecting patient data privacy. This method can also be applied for the evaluation of
analysis of various walking techniques, design of prosthesis, and to the rehabilitation
of Parkinson’s patients and elderly subjects.
1.4 Organization of the Paper
The entire paper is divided into five sections. The first section is introduction, which
provides the research motivations, author’s contribution, and background informa-
tion. The second section provides a brief survey of recent state-of-the-art literature.
The third section is methodology in which the approach goes over the hardware and
software requirements, algorithms, and specific flow tasks for recognizing human
gestures. The verification of the model is covered in the fourth section, which includes
experiment results and analysis. The last section is conclusion, future work, and
limitations.
2 Literature Review
Sztyler et al. [3] proposed a HAR technique that allows the position of wearable
devices on the human body to vary. The technique uses a random forest classifier to
combine frequency- and gravity-based information in order to identify human activity
and estimate the device's orientation. Nevertheless, these methods are not particularly
accurate and cannot be used indoors. This study demonstrated an alternate strategy to
employ a microcontroller to provide an end-to-end solution to precisely analyze gait
speed in a variety of settings, given the constraints of such devices. The gyroscope is
used to measure the orientation of moveable body parts, whereas the accelerometer
monitors their actual physical acceleration. HAR is a method for separating out very
similar human actions using the inputs from both sensors. For the classification of
human walking gesture and the study of motion signals, various classifiers have been
developed. Sun et al. [4] suggested a CNN-LSTM-ELM network on the opportunity
dataset. Activities in the OPPORTUNITY dataset can be separated into gesture and
locomotion. They discovered that the ELM classifier generalizes more quickly and
effectively than fully connected classifiers. Chen et al. [5] suggested a HAR approach
that is portable, non-interventional, has enhanced accuracy, and is relevant to
real-time applications. For the purpose of recognizing complex, contemporaneous,
interspersed, and varied human actions, a model based on transfer learning employing
Gated Recurrent Units (GRUs) has been proposed [6]. Voicu et al. [7] presented a technology for recognizing
human physical activity using data from smartphone sensors. Three smartphone-
accessible sensors, namely gyroscope, accelerometer, and magnetic sensor, are used
in the process to create a classifier. They intend for their proposal to include sitting,
jogging, climbing, standing, and descending stairs. The results show that all six
activities may be recognized with a high degree of accuracy (86–93%) [9–13].
3 Methodology
3.1 Technology Used
To write the code, we used Google Colab, a collaborative environment that allows
results to be produced in the same file and provides CSV file output. The deep learning
models are designed using the TensorFlow library. The TensorFlow Lite library is used
to reduce the model size, and the Google TensorFlow library together with Keras is
used for model training and evaluation.
3.2 Datasets
Models are trained on three publicly available datasets: UCI-HAR, PAMAP2, and
WISDM.
UCI-HAR dataset: In the UCI-HAR dataset, six tasks were carried out by the
volunteers: walking, walking downstairs, walking upstairs, standing, sitting, and
laying. Using various signal processing techniques, a total of 561 features were
retrieved from the sensor measurements.
WISDM dataset: The WISDM dataset consisted of 36 volunteers who did six
different activities while wearing smartphones and smart watches. These devices
were equipped with accelerometer and gyroscope sensors. These activities were
walking, running, ascending stairs, sitting, standing, and lying down.
PAMAP2: The Physical Activity Monitoring Data Set 2 (PAMAP2) dataset is
made up of sensor data obtained from a wearable during a variety of activities
and worn on the upper body. It included data from nine sensors, consisting of
an accelerometer, gyroscope, and magnetometer. The participants engaged in a
variety of activities, such as chores around the house, physical activity, and outdoor
pursuits. It produced a total of 52 features, including acceleration, angular speed,
and magnetic field readings from the sensors.
Fig. 1 Proposed deep
learning model flowchart
3.3 Development of Hybrid Models
The stored data are first preprocessed to turn the raw samples into tensors, after which
the observations are divided into training, testing, and validation datasets.
Some of the most important functions used in the code are data.isnull().sum() to check
for missing values and data.dropna() to remove missing or null values. We created
three different models: CNN 1D, CNN-LSTM, and CNN-GRU. Finally, we selected the single
model with the highest accuracy for further performance checks. Figure 1 shows the
proposed methodology flowchart.
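Before the models can consume the data, the raw sensor stream is typically segmented into fixed-length windows. A pure-Python sketch follows; the window length 128 matches the tensor shapes reported in the results, while the 50%-overlap step size is an assumption (the paper does not state it):

```python
def sliding_windows(samples, window=128, step=64):
    """Segment a stream of sensor samples (each a [ax, ay, az, ...] list)
    into fixed-length, overlapping windows, as is typical for HAR."""
    return [samples[i:i + window]
            for i in range(0, len(samples) - window + 1, step)]

stream = [[0.0, 0.0, 9.8]] * 512        # dummy triaxial accelerometer stream
windows = sliding_windows(stream)
assert len(windows) == 7                # (512 - 128) / 64 + 1 windows
assert len(windows[0]) == 128 and len(windows[0][0]) == 3
```

Each window then becomes one training example of shape (128, channels).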
3.4 Working of DL Models
CNN: The model starts with a 1D convolutional layer of 64 filters with a kernel size
of 3, activated by ReLU. A second identical layer is used to extract abstract features
from the output of the first layer. Following that, a max pooling layer with a pool
size of 2 is applied to the output. The output of the final max pooling layer is
flattened into a one-dimensional vector and passed to a dense output layer with one
unit per class, activated by softmax. Categorical cross-entropy is used as the loss
function, ADAM as the optimizer, and accuracy as the assessment metric prior to
training. The other two models use the same hyperparameters. This model is trained
for 10 epochs with 32 samples in each batch. While monitoring validation loss,
ReduceLROnPlateau was utilized as a callback function with a lower bound of 0.0001
for the learning rate. As a result, each time the validation loss reaches a plateau,
the model can decrease the learning rate by a factor of 10. This is done to increase
the model's overall accuracy.
CNN-LSTM: The CNN-LSTM model was created using TensorFlow's Keras API. Its
architecture starts with a Conv1D layer with 64 filters, each with a kernel size of 3;
here again the ReLU function triggers the activation of neurons in these layers.
To identify high-level characteristics from unprocessed data, a sliding-window filter
is applied to the input in a convolutional network. The filter is applied to the
inputs several times, creating an activation map known as a feature map. To represent
the temporal sequence of feature maps, an LSTM layer with the hyperbolic tangent (tanh)
activation function is used, followed by a dropout rate of 50%. Again, this model is
trained for 10 epochs with 32 samples in each batch.
CNN-GRU: The architecture of this model is very close to the LSTM-based design
described in the previous section. The model extracts feature maps from the data,
which are further compressed using a max pooling layer. The data is then passed to a
Gated Recurrent Unit (GRU) layer; in this work, we utilized 64 GRU units with the
hyperbolic tangent (tanh) activation function in the sequence layer. A dropout layer
with a rate of 50% follows the max pooling layer, which helps prevent overfitting.
Here, we recorded the model metrics of precision, recall, F1-score, and support for
each gesture category.
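The architectures above can be sketched with TensorFlow's Keras API. This is an illustrative reconstruction, not the authors' exact code; in particular, the number of LSTM units (64) is an assumption the paper does not state. Replacing the LSTM layer with `layers.GRU(64)` gives the CNN-GRU variant, and replacing it with a Flatten layer gives the plain 1D-CNN:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_lstm(timesteps=128, channels=9, n_classes=6, lstm_units=64):
    """CNN-LSTM per Sect. 3.4: Conv1D(64, kernel 3, ReLU) -> MaxPooling1D(2)
    -> LSTM (tanh) -> Dropout(0.5) -> Dense softmax.
    lstm_units=64 is an assumption; the paper does not state the unit count."""
    model = models.Sequential([
        layers.Input(shape=(timesteps, channels)),
        layers.Conv1D(64, 3, activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.LSTM(lstm_units, activation="tanh"),
        layers.Dropout(0.5),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_cnn_lstm()   # UCI-HAR window shape (128, 9), 6 classes
# Training, as described in the text:
# model.fit(x_train, y_train, epochs=10, batch_size=32,
#           callbacks=[tf.keras.callbacks.ReduceLROnPlateau(
#               monitor="val_loss", factor=0.1, min_lr=1e-4)])
```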
3.5 Classification
The accelerometer and gyroscope inputs for the various gestures of the dataset are fed
into the deep learning model. If the input data exceeds a minimum threshold of 1.5,
an inference step produces the predicted probability of the data falling into
each gesture class (Squat, Run, Walk, Jump, etc.). The results are represented as a
confusion matrix and accuracy. Algorithm 1 shows the detailed working steps.
Algorithm 1 shows the model creation and implementation steps for HAR.

Algorithm 1: HAR using wearable sensors
Result: Classification accuracy of the various human activities.
Initialization: HAR datasets (WISDM, PAMAP2, and UCI-HAR).
Step 1: Preprocess the datasets;
Step 2: Split the datasets into training, validation, and testing sets;
Step 3: Design the deep learning models CNN-LSTM, 1D-CNN, and CNN-GRU;
Step 4: Analyze the performance of each deep learning model based on precision, recall, accuracy, and F1-score;
Step 5: Select the best model based on accuracy.
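The threshold-gated inference step of Sect. 3.5 can be sketched as follows; the gesture list and the raw model scores are hypothetical (a real model would output softmax probabilities directly):

```python
import math

GESTURES = ["Squat", "Run", "Walk", "Jump"]   # example gesture classes

def softmax(scores):
    """Convert raw scores to a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(sample_magnitude, scores, threshold=1.5):
    """Run inference only when the input magnitude exceeds the 1.5 threshold
    described in the text; return the most probable gesture, else None."""
    if sample_magnitude <= threshold:
        return None
    probs = softmax(scores)
    return GESTURES[probs.index(max(probs))]

assert classify(0.9, [0.1, 2.0, 0.3, 0.2]) is None     # below threshold
assert classify(3.2, [0.1, 2.0, 0.3, 0.2]) == "Run"
```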
4 Experimental Result
This research work has presented the performance of three different deep learning
model architectures: CNN-LSTM, CNN 1D, and CNN-GRU, as detailed in the model
development section. The data was divided into three groups: training (70%), testing
(10%), and validation (20%). The test data is used to examine support, recall, F1-
score, precision, and overall accuracy of the models.
4.1 Dataset Comparison Result
Overall, all the models are able to perform with very high accuracy (around 95% or
above) on the datasets. Among the five gestures, squat and run were easier for the
models to identify than the jump and walk gestures.
Within the six gestures of the UCI-HAR dataset, laying, walking, walking upstairs,
and walking downstairs are easier for the model to identify than the sitting and
standing gestures. CNN gave the best accuracy of 95%. The input data is transformed
into tensors with a shape of (5146, 128, 9) and normalized between 0 and 1, while
the output data is one-hot encoded with a shape of (7352, 6).
Within the six activities of the WISDM dataset, standing is easier for the models to
identify than the other gestures. Here, CNN-GRU achieved the best accuracy of 97%.
The input data was transformed into tensors with a shape of (19,495, 128, 3) and
normalized between 0 and 1, while the output data was one-hot encoded with a shape
of (19,495, 6).
In the PAMAP2 dataset, there are a total of 11 activity classes. Rope_Jumping,
Cycling, Nordic_Walking, Vaccum_Cleaning, and Ironing are easier for the model to
identify than the remaining classes. CNN 1D achieved the best accuracy of 99%. The
input data is transformed into tensors with a shape of (4370, 128, 39) and normalized
between 0 and 1, while the output data is one-hot encoded with a shape of (4370, 12).
Figures 2, 3 and 4 show the accuracy and loss curves for the CNN-LSTM model on the
different datasets UCI-HAR, WISDM, and PAMAP2.
Using USB, the board is loaded with the compressed model header file and the Arduino
sketch (.ino) file. A battery is then attached to the board, after which the board is
reset to activate the new sketch. After that, we invite the participant to make the
various motions, after which the board is free to make inferences in real time. We
assessed the board's performance on the participant using 100 gestures (20 for each
category). The quantized model's results are fused on the same board
Fig. 2 CNN-LSTM model (UCI-HAR)
Fig. 3 CNN-LSTM model (WISDM)
Fig. 4 CNN-LSTM model (PAMAP2)
with similar results, supporting edge computing. We found that the majority of the
motions had inference times between 100 and 500 ms. By omitting the BLE communication
of the results, this inference time can be cut even further. Figures 5, 6 and 7 show
the performance matrices of the compressed CNN-LSTM model over the datasets UCI-HAR,
WISDM, and PAMAP2, respectively.
Table 1 shows the comparative performance analysis of the various deep learning
models over the different datasets.
Fig. 5 Confusion matrix for CNN-LSTM compressed model (UCI-HAR)
408 N. Gaud et al.
Fig. 6 Confusion matrix for CNN-LSTM compressed model (WISDM)
Fig. 7 Confusion matrix for CNN-LSTM compressed model (PAMAP2)
Table 1 CNN-LSTM results

| Proposed model | Accuracy on WISDM | Accuracy on PAMAP2 | Accuracy on UCI-HAR |
| CNN | 97.11 | 96.23 | 95.56 |
| CNN + GRU | 97.32 | 96.59 | 97.79 |
| CNN + LSTM | 99.89 | 97.28 | 96.78 |
Hybrid Deep Learning-Based Human Activity Recognition (HAR) 409
5 Conclusion and Future Extension
In this work, wearable sensors capable of detecting the wearer's movement as they
walk were created and applied. The system uses an inertial-based navigation algorithm
corrected by an EKF. A compressed deep learning model was used to recognize human
movement and gestures, evaluated on three datasets (WISDM, PAMAP2, and UCI-HAR).
For each dataset, we attained highest accuracies of 99.89%, 97.28%, and 96.78%,
respectively, with the CNN-LSTM model. We were able to reduce the model size by 10
times while increasing the models' average performance to 97% by using the model
compression strategies of pruning and quantization. The Arduino-based predictor is
comfortable to wear, customizable, and, most importantly, protects data. In the
investigation, we placed the compressed model on the board and used it to infer
motions in real time. The findings point to a promising new direction in human
activity recognition using edge AI devices that are secure, reliable, and low-powered.
Limitations of the work: HAR systems have various inherent design challenges, namely
the selection of different sensor types, the criteria for data collection, recognition
performance, energy consumption, processing power, and adaptability. The deep learning
models are also large, and their size must be reduced to suit less computationally
capable machines. HAR also suffers from intraclass variability, class imbalance
problems, etc.
References
1. Gravina R, Ma C, Pace P, Aloi G, Russo W, Li W, Fortino G (2017) Cloud-based Activity-
aaService cyber–physical framework for human activity monitoring in mobility. Futur Gener
Comput Syst 75:158–171
2. Greco L, Ritrovato P, Xhafa F (2019) An edge-stream computing infrastructure for real-time
analysis of wearable sensors data. Futur Gener Comput Syst 93:515–528
3. Sztyler T, Stuckenschmidt H, Petrich W (2017) Position-aware activity recognition with
wearable devices. Pervasive Mob Comput 38:281–295
4. Sun J, Fu Y, Li S, He J, Xu C, Tan L (2018) Sequential human activity recognition based on
deep convolutional network and extreme learning machine using wearable sensors. J Sensors
5. Chen J, Sun Y, Sun S (2021) Improving human activity recognition performance by data fusion
and feature engineering. Sensors 21(3):692
6. Thapa K, Abdullah Al ZM, Lamichhane B, Yang SH (2020) A deep machine learning method
for concurrent and interleaved human activity recognition. Sensors 20(20):5770
7. Voicu RA, Dobre C, Bajenaru L, Ciobanu RI (2019) Human physical activity recognition using
smartphone sensors. Sensors 19(3):458
8. Gupta S (2021) Deep learning based human activity recognition (HAR) using wearable sensor
data. Int J Inf Manage Data Insights 1(2):100046
9. Dua N et al (2021) Multi-input CNN-GRU based human activity recognition using wearable
sensors. Computing 103:1461–1478
10. Dua N et al (2023) A survey on human activity recognition using deep learning techniques
and wearable sensor data. In: Machine learning, image processing, network security and data
sciences: 4th international conference, MIND 2022, 2023, Proceedings, Springer, pp 52–71
11. Semwal VB et al (2022) Pattern identification of different human joints for different human
walking styles using inertial measurement unit (IMU) sensor. Artific Intell Rev 55(2):1149–
1169
12. Bijalwan V et al (2022) Wearable sensor-based pattern mining for human activity recognition:
deep learning approach. Indus Robot: Int J Robot Res Appl 49(1):21–33
13. Dua N et al (2022) Inception inspired CNN-GRU hybrid network for human activity
recognition. Multimed Tools Appl 1–35
Hybrid Change Detection Technique
with Particle Swarm Optimization
for Land Use Land Cover Using
Remote-Sensed Data
Snehlata Sheoran, Neetu Mittal, and Alexander Gelbukh
Abstract Change detection is the process of identifying and analyzing the changes
occurring over a period using remote-sensed data; it has various application areas
such as land use land cover, resource planning, urbanization, and many more. The
detection of changes is required for better decision-making and for understanding
the impact of changes occurring at local and global levels. This research presents
an implementation of two change detection techniques, image differencing and image
ratioing, on a set of ten remote-sensed images. The output images obtained are
further segmented using artificial intelligence-based particle swarm optimization
and conventional techniques. The output images are compared and validated through
the use of entropy and the perception-based image quality evaluator (PIQE). It has
been observed that image differencing followed by PSO gives superior image quality
in comparison with the other implemented techniques.
Keywords Particle swarm optimization · Remote sensing · Change detection · Image differencing · Image ratioing
1 Introduction
Change detection is the process of identifying differences in features or attributes
between two or more satellite images of the same area, captured at distinct times. The
objective is to determine changes that have occurred in the physical environment,
S. Sheoran (B)·N. Mittal
Amity University Uttar Pradesh, Noida, Uttar Pradesh, India
e-mail: snehsheoran312@gmail.com
N. Mittal
e-mail: nmittal1@amity.edu
A. Gelbukh
Instituto Politécnico Nacional Mexico, Mexico City, Mexico
e-mail: gelbukh@cic.ipn.mx
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_31
such as new construction, deforestation, land use changes and other types of human
or natural alterations. Change detection in satellite images is useful for a wide range
of applications including urban planning, disaster management, environmental moni-
toring and security and surveillance. Finding the best technique for change detection is
still a challenging and time-consuming task. Nichol and Wong [6] presented the use
of satellite images for landslide inventories using change detection and image fusion.
Nurda et al. [7] exhibited change detection and land suitability analysis by the use of
remote-sensed data for potential forest areas in Indonesia. Chughtai et al. [3] assessed change detection methods and accuracy for land use land cover. Goswami
et al. [4] compared algebraic and machine learning methods of change detection. Peng
et al. [8] used attention mechanism and image difference for change detection on
optical remote-sensed data. Zhu et al. [12] presented a Siamese global learning framework for LULC change detection, working on high spatial resolution
images. Chen et al. [2] worked on a bitemporal image transformer for change detection. Yang et al. [11] presented a change detection framework using a transferred deep learning method. Naeini et al. [5] worked in the area of object-based feature selection by implementing particle swarm optimization. Yadav and Pandey [10] presented a comprehensive survey covering different segmentation techniques in image processing.
The main contributions of this research include:
• Implementation of change detection techniques: image differencing and image ratioing.
• Enhancement of the output images using PSO and conventional edge detection techniques.
• Comparison and validation of the results through entropy and PIQE.
The paper is divided into five sections. Section 1 covers the introduction, and Sect. 2 covers change detection techniques. The proposed methodology is presented in Sect. 3, followed by results and analysis in Sect. 4 and the conclusion in Sect. 5.
2 Change Detection Techniques
Image differencing: This involves subtracting one image from another to highlight
the differences. The result of the subtraction is zero for areas with no changes; areas with changes will have positive or negative values. The expression for image
differencing is shown in Eq. (1).
ΔBV_ijk = BV_ijk(1) − BV_ijk(2) + c, (1)

where BV_ijk(1) and BV_ijk(2) represent the brightness values captured on the two dates and c represents a constant. The band, line (row) number and column number are represented by k, i and j, respectively [4].
Image ratioing: This involves dividing one image by another to create a ratio image. This technique can be useful for detecting changes in areas where the overall intensity of the image has changed, such as areas affected by cloud cover or atmospheric conditions. It is represented by Eq. (2), where x^k_ij(t2) represents the pixel value of band k at the ith row and jth column at time t2 [9]:

Rx^k_ij = x^k_ij(t1) / x^k_ij(t2). (2)
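The two algebraic techniques of Eqs. (1) and (2) can be sketched directly in NumPy. The offset value c and the toy brightness arrays below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def image_difference(bv1, bv2, c=127):
    """Eq. (1): pixel-wise difference of brightness values plus a constant c,
    chosen here so that 'no change' maps to a mid-gray value."""
    return bv1.astype(np.int32) - bv2.astype(np.int32) + c

def image_ratio(x_t1, x_t2, eps=1e-6):
    """Eq. (2): pixel-wise ratio of the two dates; eps avoids division by zero."""
    return x_t1.astype(np.float64) / (x_t2.astype(np.float64) + eps)

# Toy single-band images "captured" on two different dates
date1 = np.array([[10, 20], [30, 40]], dtype=np.uint8)
date2 = np.array([[10, 25], [15, 40]], dtype=np.uint8)

diff = image_difference(date1, date2)   # unchanged pixels -> c = 127
ratio = image_ratio(date1, date2)       # unchanged pixels -> ~1.0
```

Casting to a signed type before subtracting matters: subtracting `uint8` arrays directly would wrap around for negative differences.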
3 Proposed Methodology
In this research, ten satellite images acquired from LANDSAT, USGS are considered
as input data. For identifying the changes, the images cover the same location and
the variation comes with respect to the year in which the images were captured.
Image 1 shows a geographic location captured in years 2013 and 2022. Similarly,
image 2 covers another geographic location captured in 2013 and 2022. Likewise,
all images cover different geographic locations captured over a period of time and
are presented in Table 1. On the input images, change detection techniques such as image differencing and image ratioing are applied and output images are obtained. To further enhance the output images, an artificial intelligence-based technique, particle swarm optimization, and established edge detection procedures such as Sobel, Canny and Prewitt are implemented. For qualitative analysis of the final output images, entropy and the perception-based image quality evaluator (PIQE) are computed and compared. The proposed methodology is presented in Fig. 1.
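Of the two validation metrics, the entropy computation can be sketched as below: a minimal Shannon entropy of the gray-level histogram. The bin count and toy image are assumptions for illustration; PIQE is a no-reference perceptual metric whose implementation is more involved and is not reproduced here:

```python
import numpy as np

def image_entropy(img, bins=256):
    """Shannon entropy (bits) of the gray-level histogram; higher entropy
    indicates richer detail in the segmented output image."""
    hist, _ = np.histogram(img, bins=bins, range=(0, bins))
    p = hist / hist.sum()
    p = p[p > 0]                      # drop empty bins (0 * log 0 = 0)
    return float(np.sum(p * -np.log2(p)))

flat = np.full((8, 8), 5, dtype=np.uint8)   # uniform image -> zero entropy
```

For example, an image whose pixels are split evenly between two gray levels has an entropy of exactly 1 bit.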
3.1 Particle Swarm Optimization
It is a computational optimization algorithm that is inspired by nature. The opti-
mization problem is represented as a search space with many potential solutions.
Each potential solution is modeled as a particle that moves through the search space,
with its position and velocity representing the solution. The steps of the PSO algorithm [1] are shown in Fig. 2.
4 Result and Analysis
Original ten images acquired from LANDSAT are presented in Table 1. For iden-
tifying the changes, image 1 is considered from years 2013 and 2022. Image 2 is
considered over 2013 and 2022. Similarly, all the remaining images are considered
414 S. Sheoran et al.
Table 1 Ten original LANDSAT images
over 2013–2022, as per the data availability. Output images obtained after applying
image differencing and image ratioing techniques are presented in Table 2.
The output images obtained from differencing and ratioing are further processed
by particle swarm optimization and sobel, canny and prewitt edge detection tech-
niques. The output images obtained for image 1 are placed in Table 3. For validating
the results qualitatively, the entropy and PIQE parameters are computed, compared and presented in Tables 4 and 5. It can be seen from Table 4 that the entropy values for image differencing with PSO, Sobel, Canny and Prewitt are 0.8764, 0.0966, 0.3297 and 0.0965, respectively. The entropy values for image ratioing with PSO, Sobel, Canny and Prewitt are
Fig. 1 Workflow of proposed methodology
1. Initialization & input: particles are generated randomly and assigned a position and velocity
2. Evaluation: each particle's fitness value is evaluated using the fitness function (FF)
3. While the termination criteria are not met, do
4.   Position and velocity update
5.   Evaluation of FF and replacement of the worst particles by the best
6.   Local and global best update
7. End while; best solution obtained

Fig. 2 Steps of PSO algorithm
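The steps of Fig. 2 can be sketched as a minimal PSO for function minimization. The sphere fitness function, swarm size and coefficients below are illustrative placeholders only, since the paper's actual fitness function scores segmented image quality:

```python
import numpy as np

rng = np.random.default_rng(0)

def pso(fitness, dim, n_particles=20, iters=100, bounds=(-5.0, 5.0),
        w=0.7, c1=1.5, c2=1.5):
    """Minimal particle swarm optimizer (minimization), following Fig. 2."""
    lo, hi = bounds
    # Step 1: random initialization of positions and velocities
    pos = rng.uniform(lo, hi, (n_particles, dim))
    vel = rng.uniform(-1.0, 1.0, (n_particles, dim))
    # Step 2: evaluate the fitness function (FF) for every particle
    fit = np.array([fitness(p) for p in pos])
    pbest, pbest_fit = pos.copy(), fit.copy()
    g = pbest_fit.argmin()
    gbest, gbest_fit = pbest[g].copy(), pbest_fit[g]
    # Step 3: loop until the termination criterion (iteration budget) is met
    for _ in range(iters):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        # Step 4: position and velocity update
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, lo, hi)
        # Steps 5-6: re-evaluate FF, update personal and global bests
        fit = np.array([fitness(p) for p in pos])
        improved = fit < pbest_fit
        pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
        g = pbest_fit.argmin()
        if pbest_fit[g] < gbest_fit:
            gbest, gbest_fit = pbest[g].copy(), pbest_fit[g]
    # Step 7: return the best solution obtained
    return gbest, gbest_fit

# Placeholder fitness: sphere function; the paper's FF would instead score
# the quality of the segmented output image.
best, best_fit = pso(lambda x: float(np.sum(x ** 2)), dim=2)
```

The inertia weight w and the cognitive and social coefficients c1, c2 trade off exploration against convergence speed; the values used here are common textbook defaults.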
Table 2 Output images obtained from differencing and ratioing
Diff-1 Ratio-1 Diff-2 Ratio-2
Diff-3 Ratio-3 Diff-4 Ratio-4
Diff-5 Ratio-5 Diff-6 Ratio-6
Diff-7 Ratio-7 Diff-8 Ratio-8
Diff-9 Ratio-9 Diff-10 Ratio-10
0.7287, 0.0718, 0.0764 and 0.072. The highest entropy is obtained from the differencing-PSO output image. Similarly, for image 2, the entropy values are 0.9471, 0.0963,
0.3543 and 0.0963 for image differencing technique, and for image ratioing, the
values are 0.0324, 0.0716, 0.0830 and 0.0720. Entropy result for all images is placed
in Table 4, and it can be observed that for all ten images, image differencing with PSO
yields the highest entropy value. Also, from Table 5, the PIQE values for differencing with PSO, Sobel, Canny and Prewitt are 52.2427, 80.0232, 78.2973 and 78.8501. The
Table 3 Output images obtained after image differencing (PSO, Sobel, Canny, Prewitt) and image ratioing (PSO, Sobel, Canny, Prewitt)
1-diff-pso 1-diff-sobel 1-diff-canny 1-diff-prewitt
1-rat-pso 1-rat-sobel 1-rat-canny 1-rat-prewitt
PIQE values for ratioing with PSO, Sobel, Canny and Prewitt are 72.1938, 82.6905, 83.3192 and 82.3945. The lowest PIQE values are obtained from the differencing-PSO and differencing-Canny output images.
5 Conclusion and Future Scope
Satellite images are a great warehouse of information. Detecting changes in these
images with respect to land use land cover, deforestation, resource management
and coastal changes is very crucial. This research presents the implementation of
optimized change detection technique using particle swarm optimization for remote-
sensed data. Output images obtained from image differencing and image ratioing are
further segmented by PSO and conventional techniques. The entropy and PIQE parameters are used for result validation, and it has been observed that the output images obtained from differencing-PSO have the highest entropy and lowest PIQE. Higher entropy and lower PIQE indicate superior-quality output images. In future, the techniques can be coupled with other artificial intelligence methods for more detailed detection of changes, and more quality parameters can be used for result validation.
Table 4 Entropy values for the output images

       Image differencing                   Image ratioing
Image  PSO     Sobel   Canny   Prewitt     PSO     Sobel   Canny   Prewitt
1      0.8764  0.0966  0.3297  0.0965     0.7287  0.0718  0.0764  0.072
2      0.9471  0.0963  0.3543  0.0963     0.0324  0.0716  0.083   0.072
3      0.9916  0.0617  0.1913  0.0614     0.8185  0.0615  0.0924  0.0661
4      0.8516  0.0548  0.2755  0.0533     0.818   0.0615  0.0926  0.0655
5      0.8868  0.0962  0.3285  0.096      0.8331  0.0715  0.0756  0.0718
6      0.8849  0.0514  0.3615  0.0503     0.0103  0.0677  0.0977  0.0714
7      0.9032  0.0442  0.3069  0.0438     0.0310  0.0567  0.0902  0.0608
8      0.9781  0.0987  0.4557  0.0976     0.8434  0.0729  0.0767  0.0732
9      0.8269  0.0984  0.4036  0.0983     0.0139  0.0730  0.0767  0.0733
10     0.8883  0.0974  0.2512  0.0972     0.0134  0.0710  0.0763  0.0713
Table 5 PIQE values for the output images

       Image differencing (PIQE)               Image ratioing (PIQE)
Image  PSO      Sobel    Canny    Prewitt     PSO      Sobel    Canny    Prewitt
1      52.2427  80.0232  78.2973  78.8501    72.1938  82.6905  83.3192  82.3945
2      54.8975  80.0896  78.7315  79.8743    79.1913  83.0612  83.2820  83.2715
3      76.4534  82.2124  77.5187  82.3346    86.1798  82.8269  80.5575  84.0601
4      82.6335  85.2554  76.0775  85.3651    84.0316  83.4336  82.2229  83.3657
5      79.2612  76.9258  74.9741  75.8730    79.3779  78.8899  78.4755  79.0612
6      84.1178  84.2770  76.8454  84.4235    76.8429  83.4853  81.1664  83.0750
7      67.8191  85.3741  78.8665  85.5212    76.9564  84.5318  82.1702  84.3592
8      80.1851  79.2580  75.9025  78.6312    80.6044  81.2429  81.5354  81.9369
9      76.7998  75.8707  75.5008  75.6342    71.9006  77.4038  77.6320  77.4579
10     65.4566  80.5644  77.3403  79.8966    82.0488  83.2870  83.3271  83.7434
References
1. Abualigah LM, Khader AT, Hanandeh ES (2018) A new feature selection method to improve the
document clustering using particle swarm optimization algorithm. J Comput Sci 25:456–466
2. Chen H, Qi Z, Shi Z (2021) Remote sensing image change detection with transformers. IEEE
Trans Geosci Remote Sens 60:1–14
3. Chughtai AH, Abbasi H, Karas IR (2021) A review on change detection method and accuracy
assessment for land use land cover. Remote Sens Appl Soc Environ 22:100482
4. Goswami A, Sharma D, Mathuku H, Gangadharan SM, Yadav CS, Sahu SK, Pradhan MK,
Singh J, Imran H (2022) Change detection in remote sensing image data comparing algebraic
and machine learning methods. Electronics 11(3):431
5. Naeini AA, Babadi M, Mirzadeh SMJ, Amini S (2018) Particle swarm optimization for object-
based feature selection of VHSR satellite images. IEEE Geosci Remote Sens Lett 15(3):379–
383
6. Nichol J, Wong MS (2005) Satellite remote sensing for detailed landslide inventories using
change detection and image fusion. Int J Remote Sens 26(9):1913–1926
7. Nurda N, Noguchi R, Ahamed T (2020) Change detection and land suitability analysis for
extension of potential forest areas in Indonesia using satellite remote sensing and GIS. Forests
11(4):398
8. Peng X, Zhong R, Li Z, Li Q (2020) Optical remote sensing image change detection based on
attention mechanism and image difference. IEEE Trans Geosci Remote Sens 59(9):7296–7307
9. Singh A (1989) Review article digital change detection techniques using remotely-sensed data.
Int J Remote Sens 10(6):989–1003
10. Yadav R, Pandey M (2022) Image segmentation techniques: a survey. In: Proceedings of data
analytics and management: ICDAM 2021, vol. 1. Springer Singapore, pp 231–239
11. Yang M, Jiao L, Liu F, Hou B, Yang S (2019) Transferred deep learning-based change detection
in remote sensing images. IEEE Trans Geosci Remote Sens 57(9):6960–6973
12. Zhu Q, Guo X, Deng W, Shi S, Guan Q, Zhong Y, Zhang L, Li D (2022) Land-use/land-cover
change detection based on a Siamese global learning framework for high spatial resolution
remote sensing imagery. ISPRS J Photogrammetry Remote Sens 184:63–78
Critical Analysis of 5G Networks’ Traffic
Intrusion Using PCA, t-SNE, and UMAP
Visualization and Classifying Attacks
Humera Ghani, Shahram Salekzamankhani, and Bal Virdee
Abstract Networks, threat models, and malicious actors are advancing quickly.
With the increased deployment of the 5G networks, the security issues of the
attached 5G physical devices have also increased. Therefore, artificial intelligence-
based autonomous end-to-end security design is needed that can deal with incoming
threats by detecting network traffic anomalies. To address this requirement, in this
research, we used a recently published 5G traffic dataset, 5G-NIDD, to detect network
traffic anomalies using machine and deep learning approaches. First, we analyzed
the dataset using three visualization techniques: t-Distributed Stochastic Neighbor
Embedding (t-SNE), Uniform Manifold Approximation and Projection (UMAP), and
principal component analysis (PCA). Second, we reduced the data dimensionality
using mutual information and PCA techniques. Third, we solved the class imbalance issue by inserting synthetic records of the minority classes. Last, we performed classification using six different classifiers and presented the evaluation metrics. We obtained the best results when the K-Nearest Neighbors classifier was used: accuracy (97.2%), detection rate (96.7%), and false positive rate (2.2%).
Keywords Network intrusion detection · Class imbalance · t-SNE · UMAP · PCA · 5G-NIDD
H. Ghani (B)·S. Salekzamankhani ·B. Virdee
School of Computing and Digital Media, Centre for Communications Technology, London
Metropolitan University, London N7 8DB, UK
e-mail: hug0051@my.londonmet.ac.uk
S. Salekzamankhani
e-mail: s.salekzamankhani@londonmet.ac.uk
B. Virdee
e-mail: b.virdee@londonmet.ac.uk
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_32
1 Introduction
With the increased deployment of the 5G networks, the security issues of the attached
5G physical devices have also increased [1]. Several technologies, for example,
firewalls, traffic shaping devices, and intrusion detection systems are available to
secure a network [2]. With the changing world needs and threat models, networks
are becoming more complex and heterogeneous; malicious actors are also becoming
more advanced [3]. Hence, artificial intelligence-based autonomous end-to-end secu-
rity design is needed that can deal with incoming threats by detecting network traffic
anomalies [4]. Therefore, to address network traffic anomaly issues in 5G networks, we proposed a novel approach using the recently released 5G traffic dataset, 5G-NIDD.
We used machine and deep learning approaches to perform our experiments.
In this study, we first performed a visual analysis of this dataset using three
different visualization techniques. We then reduced the data dimensionality. In this
step, feature selection and feature extraction were performed using mutual infor-
mation and principal component analysis techniques. In the third step, the class
imbalance issue was resolved by inserting synthetic records of minority classes. We
used a random over-sampling method for balancing the class distribution. Finally, we
performed classification using six different classifiers and presented the evaluation
metrics. The best results were obtained with the K-Nearest Neighbors classifier, with an accuracy
of 97.2%, detection rate of 96.7%, and false positive rate of 2.2%.
The contributions of this paper are:
• Visual analyses of the 5G-NIDD dataset to better understand its intricacies, using PCA, t-SNE, and UMAP techniques.
• Dimensionality reduction to remove unimportant features, which cause inaccurate results, longer processing times and high computational cost, using mutual information and principal component analysis techniques.
• Removal of class imbalance to improve classification metrics, using the random over-sampling technique.
• Classification of benign and malicious traffic with high accuracy using decision tree (DT), K-Nearest Neighbors (K-NN), multi-layer perceptron (MLP), Gaussian Naïve Bayes (GNB), random forest (RF), and support vector classifier (SVC) algorithms.
This paper is structured as follows. Section 2 describes the related work reported recently in the literature. Section 3 introduces the dataset. Section 4 elaborates on the proposed approach. Section 5 shows the results and discusses the findings. Section 6 concludes the work presented in this paper and recommends future work.
2 Related Works
This section discusses contemporary research in the field of network anomaly
detection. Researchers are addressing this problem using various machine and
deep learning approaches. For experiments, in general, they employ CICIDS-2017,
NSL-KDD, and UNSW-NB15 datasets. However, we used the 5G-NIDD dataset, a comparatively new dataset containing 5G network traffic records.
The authors in [5] divided the UNSW-NB15 dataset based on protocol: TCP, UDP, and
OTHER. Using the Chi-square technique, they performed feature selection, and for
classification, they used a one-dimensional convolution neural network. Their work
includes the visualization of the dataset using t-SNE. However, current research [7,
8] has reported better classification performance metrics on the same dataset.
Authors in [6] created and evaluated the 5G-NIDD dataset using five different clas-
sifiers on binary and multiclass labels. They used the analysis of variance (ANOVA)
technique for feature selection and used the ten best features for classification.
Features in this dataset are skewed and multi-modal, whereas ANOVA assumes normally distributed features. Therefore, using ANOVA for feature selection on this dataset is inappropriate.
Reference [7] performed their experiments on NSL-KDD and UNSW-NB15
datasets. They proposed combining particle swarm optimization (PSO) and a grav-
itational search algorithm (GSA) for feature selection. They used five other feature
selection techniques but received the best classification metrics when the features
selected by their proposed method were given to the random forest classifier.
Although this research achieved good results, the proposed technique selected a higher number of features than the other feature selection methods evaluated in that paper.
Researchers in [8] performed their experiments on UNSW-NB15, CICIDS-2017, and Phishing datasets. They used a correlation-based feature selection method. For data visualization and feature reduction, they used t-SNE, and for classification, random forest. However, [9] noted a limitation: t-SNE is a non-parametric dimensionality reduction technique, and such techniques cannot map new data points.
Reference [10] used NSL-KDD dataset for their research. They used an auto-
encoder for detecting network anomalies. Although they reported good classification
performance, contemporary research [7] showed better performance metrics on the
same dataset.
Authors in [11] used the UNSW-NB15 and NSL-KDD datasets to investigate network traffic anomalies. First, they addressed the class imbalance issue by reducing the noise samples from the majority class and then increasing the minority class samples using the Synthetic Minority Over-sampling Technique. Second, they performed classification using deep learning approaches: a convolutional neural network and bidirectional long short-term memory.
Researchers in [12] performed their experiments on the UNSW-NB15 and NSL-KDD (KDDTest+ and KDDTest-21) datasets. First, they addressed the class imbalance issue using a Wasserstein Generative Adversarial Network. Second, they employed
424 H. Ghani et al.
a Stacked Auto-encoder for feature extraction. Third, they constructed a cost-
sensitive loss function. Their performance metrics suggest that there is room for further improvement in their approach.
The above discussion shows that current research in network traffic anomaly detection rarely presents a visual analysis of the datasets. Therefore, this research visually examined the 5G-NIDD dataset, a newly released 5G traffic dataset [6].
3 Dataset
The 5G-NIDD dataset was created using a real 5G test network for network intrusion detection. It was published by [6]. It has 52 features in total: 32 float, 12 integer, and 8 categorical. The dataset has 1,215,890 records, of which 477,737 are benign and 738,153 are malicious; benign records make up 39.2% and malicious records 60.7%, as shown in Table 1. The dataset has eight different types of attacks; their names and percentages of the attack traffic are: UDPFlood (61.9%), HTTPFlood (19.0%), SlowrateDoS (9.9%), TCPConnectScan (2.7%), SYNScan (2.7%), UDPScan (2.1%), SYNFlood (1.3%), and ICMPFlood (0.15%), as shown in Table 2.
Table 1 Distribution of records in 5G-NIDD dataset

Label      No. of records  Percentage
Benign     477,737         39.291
Malicious  738,153         60.708
Total      1,215,890       100
Table 2 Distribution of attacks in malicious records

Attack type     No. of records  Percentage
UDPFlood        457,340         61.957
HTTPFlood       140,812         19.076
SlowrateDoS     73,124          9.9063
TCPConnectScan  20,052          2.716
SYNScan         20,043          2.715
UDPScan         15,906          2.154
SYNFlood        9721            1.316
ICMPFlood       1155            0.156
Total           738,153         100
4 Proposed Approach
This section describes the approach adopted to perform this research. First, data
cleaning and wrangling were performed at the data preprocessing stage. Second, a
visual analysis of data was presented and described using three different methods.
Third, data dimensionality was reduced by employing feature selection and feature
extraction approaches. Fourth, the class imbalance issue was addressed. Fifth, traffic
classification was performed using six different classification algorithms. Finally,
evaluation metrics were described and results were presented.
4.1 Data Preprocessing
The data are prepared for input to the machine learning algorithms. This dataset had redundant and unnecessary features, which were removed during data denoising. Some features had null values, which were imputed with appropriate alternative values. Some features showed skewed and multi-modal distributions; therefore, a log transformation was performed to bring them closer to a normal distribution. Categorical features were encoded using the one-hot encoding method. The dataset was split into train and test sets to avoid overfitting. Lastly, it was standardized to have mean 0 and variance 1.
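The preprocessing steps above can be sketched with pandas and scikit-learn. The tiny frame and the column names "bytes" and "proto" are hypothetical stand-ins, not names from the 5G-NIDD dataset:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy stand-in for 5G-NIDD; the column names are used only for illustration.
df = pd.DataFrame({
    "bytes": [100.0, 200.0, None, 10_000.0, 150.0, 300.0, 90_000.0, 250.0],
    "proto": ["tcp", "udp", "tcp", "icmp", "tcp", "udp", "tcp", "icmp"],
    "label": [0, 0, 0, 1, 0, 1, 1, 1],
})

df["bytes"] = df["bytes"].fillna(df["bytes"].median())  # impute null values
df["bytes"] = np.log1p(df["bytes"])                     # reduce skewness
df = pd.get_dummies(df, columns=["proto"])              # one-hot encoding

X, y = df.drop(columns="label"), df["label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Fit the scaler on the training split only, then standardize both splits
scaler = StandardScaler().fit(X_train)                  # mean 0, variance 1
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)
```

Fitting the scaler on the training split alone avoids leaking test-set statistics into the model.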
4.2 Visual Analysis of 5G-NIDD Dataset
In this paper, visualization is performed to understand the data better, so researchers
can apply appropriate processing techniques to achieve good results. This research
used three visualization techniques to get deep insight into the dataset. These tech-
niques are PCA, t-SNE, and UMAP. PCA is one of the most widely used dimensionality reduction techniques; it performs an orthogonal linear transformation of correlated variables into uncorrelated features, called principal components. The technique preserves the variance of the original high-dimensional data in the low-dimensional principal components. t-SNE is a nonlinear dimensionality reduction and visualization technique that builds a probability distribution over pairs of nearby points. The algorithm maps the high-dimensional feature space into a two-dimensional space by minimizing the difference between the two distributions; it models the high-dimensional space with a Gaussian distribution and the two-dimensional space with a t-distribution. UMAP is an efficient dimensionality reduction technique that uses a graph algorithm to reduce data dimensionality. First, it constructs the topology of the high-dimensional data. Then, it constructs a low-dimensional embedding that preserves local clustering by grouping similar observations.
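A minimal sketch of producing the two-dimensional embeddings with scikit-learn, assuming a synthetic stand-in for the preprocessed feature matrix; UMAP requires the third-party umap-learn package and is indicated only in a comment:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Synthetic stand-in for the preprocessed 5G-NIDD feature matrix
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

# Linear projection: principal components preserve maximal variance
X_pca = PCA(n_components=2, random_state=0).fit_transform(X)

# Nonlinear embedding: t-SNE matches pairwise-similarity distributions
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# UMAP (umap-learn package) would be used the same way:
#   import umap; X_umap = umap.UMAP(n_components=2).fit_transform(X)
```

Each embedding is then scatter-plotted and colored by traffic class, e.g. `plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, s=5)` with matplotlib.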
Fig. 1 Distribution of traffic types in the 5G-NIDD dataset
The class distribution of the 5G-NIDD dataset is imbalanced (see Fig. 1). Three attack categories (UDPFlood, HTTPFlood, and SlowrateDoS) constitute 90.8% of the malicious traffic, while the other five attack categories (TCPConnectScan, SYNScan, UDPScan, SYNFlood, and ICMPFlood) constitute 8.95% of the malicious traffic (Table 2).
Figures 2 and 3 show a two-dimensional and a three-dimensional view of the principal components; the class imbalance issue can be observed clearly.
Figures 4 and 5 show the complete dataset using t-SNE and UMAP, respectively. In addition to class imbalance, class overlap and within-class clustering issues are evident from these figures.
The class overlap issue is confirmed in Figs. 6, 7, 8 and 9, where the HTTPFlood and SlowrateDoS classes overlap. All three visualization techniques confirm this issue. Figures 6 and 7 show two-dimensional and three-dimensional plots of the class overlap using principal components. Figure 8 shows a t-SNE plot where the class overlap is evident. Figure 9 shows a UMAP plot showing the class overlap.
This dataset has within-class clustering issues as well. Figures 6, 7, 8, and 9 show scattered clusters of the HTTPFlood and SlowrateDoS classes in the PCA, t-SNE, and UMAP visualizations.
Class imbalance and overlap can hinder the attack detection performance [13,
14]. In this research, we addressed the class imbalance problem. Our work achieved
good evaluation metrics (see Table 4), suggesting that if a dataset has a high number of records, classifiers can identify the classes accurately despite the class overlap issue.
Fig. 2 Two-dimensional plot of PCA showing all types of traffic
Fig. 3 Three-dimensional plot of PCA showing all types of traffic
4.3 Dimensionality Reduction
Dimensionality reduction transforms high-dimensional data into low dimensions
where newly transformed data is a meaningful representation of original data [15].
Fig. 4 Full dataset visualization (t-SNE)
Fig. 5 Full dataset visualization (UMAP)
Dimensionality reduction techniques are employed to reduce the number of variables, reduce the computational complexity of high-dimensional data, improve model accuracy, enable better visualization, and help understand the process that generated the data [16]. The two main approaches to dimensionality reduction are feature selection and feature extraction.
Fig. 6 HTTPFlood and SlowrateDoS overlap (2D PCA)
Fig. 7 HTTPFlood and SlowrateDoS overlap (3D PCA)
Fig. 8 HTTPFlood and SlowrateDoS overlap (t-SNE)
Fig. 9 HTTPFlood and SlowrateDoS overlap (UMAP)
Feature selection methods select the most valuable features from the feature space. This process creates a low-dimensional representation of the feature space that preserves the most valuable information. [16] identified two approaches to feature selection: the filter method and the wrapper method. In this paper, the most useful features are selected using one of the filter-method algorithms, mutual information. This algorithm measures the amount of information that a random variable contains about another random variable, in other words, the reduction in the uncertainty of the original random variable given knowledge of the other random variable [17]. Unlike Pearson correlation, this method can measure nonlinear relationships between two variables.
Feature extraction produces a compressed representation of the input vector. Feature extraction techniques create new features from the original data space using a functional mapping [18]. Several algorithms are available that can perform this transformation linearly or nonlinearly. Researchers in [16] identified three approaches to feature extraction: performance measure, transformation, and generation of new features. We chose the transformation technique PCA because it is fast (only the first few principal components need to be computed) and more interpretable than other techniques, such as auto-encoders.
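The two-stage reduction described above (filter-method selection by mutual information, followed by PCA extraction) can be sketched as follows. The synthetic data are a stand-in, while the feature counts (22 selected features, 11 components) are taken from Sect. 5:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in for the preprocessed feature matrix
X, y = make_classification(n_samples=500, n_features=40, n_informative=10,
                           random_state=0)

# Filter-method selection: rank features by mutual information with the
# class label and keep the 22 top-ranked ones (the count used in Sect. 5)
selector = SelectKBest(mutual_info_classif, k=22).fit(X, y)
X_sel = selector.transform(X)

# Feature extraction: project the selected features onto 11 principal
# components, keeping most of the remaining variance
X_red = PCA(n_components=11).fit_transform(X_sel)
```

The same fitted `selector` and PCA would later be applied unchanged to the test split.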
4.4 Remove Class Imbalance
The class imbalance problem occurs when some classes have more instances than others; in such cases, learning algorithms are overwhelmed by the large classes and ignore the small ones [19]. Learning algorithms are generally not designed to handle imbalanced datasets without proper adjustment [20]. Researchers in [21] pointed out that datasets frequently exhibit class imbalance and overlap issues. The 5G-NIDD dataset also shows class imbalance in Fig. 1.
There are several approaches to solving the class imbalance problem. These
approaches can be grouped as: data-level, cost-sensitive, and ensemble learning.
Data-level approaches modify the dataset. Cost-sensitive approaches modify the cost
that algorithm tries to optimize. The ensemble learning approach leverages the power
of several learners to predict the minority class. Data-level approach random over-
sampling increases the number of observations from the minority class at random.
In contrast, random under-sampling decreases the number of observations from the
majority class at random.
The Synthetic Minority Over-sampling Technique (SMOTE) is frequently employed in contemporary research [11, 22, 23] to overcome class imbalance in network intrusion detection datasets. It balances the class distribution by inserting synthetic minority-class records, produced by linear interpolation: each synthetic record is created between a minority-class example and one of its K-Nearest Neighbors. We chose SMOTE to solve the class imbalance issue in the dataset.
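The interpolation idea behind SMOTE can be sketched minimally as below; production code would normally use an existing implementation such as imbalanced-learn's `SMOTE`, and the toy minority samples here are illustrative only:

```python
import numpy as np

def smote_like(X_min, n_new, k=5, rng=None):
    """Minimal SMOTE-style oversampling: each synthetic record is a random
    linear interpolation between a minority-class sample and one of its
    k nearest minority-class neighbors."""
    rng = np.random.default_rng(0) if rng is None else rng
    X_min = np.asarray(X_min, dtype=float)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                 # a point is not its own neighbor
    neighbors = np.argsort(d, axis=1)[:, :k]    # k nearest per sample
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        a = rng.integers(len(X_min))            # pick a random minority sample
        b = neighbors[a, rng.integers(min(k, len(X_min) - 1))]
        lam = rng.random()                      # interpolation factor in [0, 1)
        synthetic[i] = X_min[a] + lam * (X_min[b] - X_min[a])
    return synthetic

# Toy minority class: corners of the unit square
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_syn = smote_like(X_minority, n_new=6, k=2)
# Synthetic points lie on segments between minority samples, inside [0, 1]^2
```

Because every synthetic point is a convex combination of two real minority samples, the oversampled class stays inside the region the minority class already occupies.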
4.5 Network Traffic Classification
We used tree-based, probability-based, proximity-based, deep learning, and support
vector classifiers to predict class labels.
Decision tree is a non-parametric learning algorithm. It works on a divide-and-conquer strategy. This greedy algorithm searches for and identifies optimal split points within a tree. It does this job recursively, stopping when most records classify under the same class label.
K-Nearest Neighbor is a proximity-based classifier. To predict the class label of a
point, first, it finds K-Nearest Neighbors of this point based on Euclidean distance.
Then, each of these neighbors votes for their class, and the majority class wins.
Multilayer perceptron is a powerful deep learning model which is inspired by
neurons in the human brain. The basic building blocks of MLP are perceptrons
which are simple processing units. It can have many layers of perceptrons, which
gives it the name MLP.
Naïve Bayes classifier is simple and fast, has very few tunable parameters, and is well suited to high-dimensional data. Given the value of the class variable, this algorithm assumes conditional independence between each pair of input variables.
Random forest classifier is an ensemble of decision trees that can perform clas-
sification using a majority vote. Each decision tree uses a randomly selected feature
set from the original dataset. In addition, each tree uses a different sample of data,
like the bagging approach. It can successfully model high-dimensional data where features are nonlinearly related, and it does not assume that the data follow a particular distribution.
Support vector classifier is a simple and powerful classifier. It can draw linear and
nonlinear class boundaries to classify the data points. To perform its job, it iteratively
constructs a hyperplane to differentiate classes. Each iteration tries to minimize the
error. The main idea of this technique is to create a hyperplane that can best divide
the data into classes.
4.6 Evaluation Metrics
Model performance is evaluated using accuracy, detection rate, and false positive rate metrics. The elements of these metrics are retrieved from the confusion matrix {TP, TN, FP, FN}. True positive (TP) means correctly classified attack packets. True negative (TN) means correctly classified normal packets. False positive (FP) means incorrectly classified attack packets, and false negative (FN) means incorrectly classified normal packets.
Accuracy means the ratio between correctly identified packets and total number
of packets.
Critical Analysis of 5G Networks’ Traffic Intrusion Using PCA, t-SNE 433
Accuracy = (TP + TN) / (TP + TN + FP + FN). (1)
Detection rate represents the ratio of correctly identified attacks to all actual attacks.
Detection rate = TP / (TP + FN). (2)
False positive rate is the ratio of normal packets incorrectly identified as attacks to all actual normal packets.
False positive rate = FP / (FP + TN). (3)
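Equations (1)–(3) translate directly into code; the confusion-matrix counts below are hypothetical, chosen only to sanity-check the formulas:

```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)        # Eq. (1)

def detection_rate(tp, fn):
    return tp / (tp + fn)                          # Eq. (2)

def false_positive_rate(fp, tn):
    return fp / (fp + tn)                          # Eq. (3)

# hypothetical confusion-matrix counts, not taken from the paper's experiments
tp, tn, fp, fn = 90, 85, 5, 10
print(round(accuracy(tp, tn, fp, fn), 3))          # → 0.921
print(round(detection_rate(tp, fn), 3))            # → 0.9
print(round(false_positive_rate(fp, tn), 3))       # → 0.056
```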
5 Results and Discussion
To reduce the feature space, we used mutual information and PCA techniques. We processed the dataset before applying dimensionality reduction techniques (see Sect. 4.1). The mutual information technique ranked the features (see Fig. 10). We selected the twenty-two top-ranked features and transformed them into eleven principal components, which captured 89.2% of the variance in the data (see Table 3).
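The explained-variance computation behind Table 3 can be illustrated in the two-dimensional case, where the sample covariance matrix is 2 × 2 and its eigenvalues have a closed form (a toy sketch; the paper's eleven components were obtained with a standard PCA routine):

```python
import math

def explained_variance_2d(points):
    """Explained-variance ratio of the first principal component of 2-D data,
    using the closed-form eigenvalues of the 2x2 sample covariance matrix."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    tr, det = sxx + syy, sxx * syy - sxy ** 2
    lam1 = (tr + math.sqrt(tr * tr - 4 * det)) / 2   # larger eigenvalue
    return lam1 / tr                                  # its share of total variance

# perfectly collinear data: the first component captures all the variance
print(round(explained_variance_2d([(1, 1), (2, 2), (3, 3), (4, 4)]), 6))   # → 1.0
```

In the paper's eleven-dimensional case, the ratio in Table 3 is the same idea applied per eigenvalue: each component's eigenvalue divided by the total variance, summing to 0.892.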
The eleven principal components are fed to the classifiers. Table 4 shows the classification performance of six classifiers using accuracy, detection rate, and false positive rate. Our results show that the GNB classifier performed considerably worse, confirming the finding of [6]. DT, RF, and k-NN showed better performance metrics than MLP and SVC. However, k-NN remains the best classifier across all evaluation metrics with accuracy (97.2%), detection rate (96.7%),
Fig. 10 Mutual information ranking of features
Table 3 PCA explained variance ratio
Principal component Variance captured
1st component 0.20667303
2nd component 0.14651562
3rd component 0.09805928
4th component 0.08302226
5th component 0.05968254
6th component 0.05411584
7th component 0.05063291
8th component 0.05041822
9th component 0.04955088
10th component 0.04887834
11th component 0.04494785
Total 0.89249676
Table 4 Evaluation metrics (binary classification)
Evaluation metrics DT RF K-NN GNB MLP SVC
Accuracy 97.150 97.168 97.275 87.425 94.91 92.09
Detection rate 96.618 96.650 96.765 92.428 92.646 90.49
False positive rate 2.318 2.314 2.213 17.559 2.828 6.32
and false positive rate (2.2%). Figure 11 shows the ROC curve of the k-NN classifier, which captured 97.2% area under the curve.
Table 5 shows a comparison of our approach with other contemporary techniques.
Research [11] reported 77.16% and 83.58% accuracy on UNSW-NB15 and NSL-
KDD datasets, respectively. Research [12] showed 93.27% and 90.24% accuracy
on UNSW-NB15 and NSL-KDD datasets, respectively. Our approach outperformed
them and achieved 97.28% classification accuracy using the 5G-NIDD dataset.
A limitation of this research is that the 5G-NIDD dataset was published only recently [6], so little other work exists yet against which to compare our results. Another limitation is processing power: with more computational resources, we would have searched for the best hyperparameters for our classifiers and likely reported even better evaluation metrics.
6 Conclusion and Future Work
In this paper, we presented a novel approach to classify network traffic anomalies
with high accuracy. We analyzed the dataset by projecting it in two-dimensional
and three-dimensional spaces using linear and nonlinear dimensionality reduction
Fig. 11 ROC curve of k-NN
Table 5 Performance comparison
Research Accuracy Dataset
[11] 77.16% UNSW-NB15
[11] 83.58% NSL-KDD
[12] 93.27% UNSW-NB15
[12] 90.34% NSL-KDD
Proposed research 97.28% 5G-NIDD
and visualization techniques. We reduced the feature space by ranking them using
the mutual information algorithm; then, we transformed high-ranked features into
principal components. This dataset had a class imbalance issue, which we solved by balancing the class distribution with the SMOTE over-sampling algorithm. Last, we performed classification using six classification algorithms and presented the evaluation using accuracy, detection rate, and false positive rate metrics. We achieved the best classification performance when the K-Nearest Neighbors algorithm was used. In the future, we intend to extend this research in two directions: use an ensemble learner to improve classification metrics, and use a generative model for the class imbalance issue, since generative models are among the most successful deep learning architectures.
References
1. Uysal DT, Yoo PD, Taha K (2022) Data-driven malware detection for 6G networks: a survey
from the perspective of continuous learning and explainability via visualisation. IEEE Open J
Veh Technol 4:61–71
2. Khan AR, Kashif M, Jhaveri RH, Raut R, Saba T, Bahaj SA (2022) Deep learning for intrusion
detection and security of Internet of things (IoT): current analysis, challenges, and possible
solutions. In: Security and communication networks
3. Lam J, Abbas R (2020) Machine learning based anomaly detection for 5G networks. arXiv preprint arXiv:2003.03474
4. Siriwardhana Y, Porambage P, Liyanage M, Ylianttila M (2021) AI and 6G security: opportunities and challenges. In: 2021 Joint European conference on networks and communications & 6G summit (EuCNC/6G Summit). IEEE, pp 616–621
5. Hooshmand MK, Hosahalli D (2022) Network anomaly detection using deep learning
techniques. CAAI Trans Intell Technol 7(2):228–243
6. Samarakoon S, Siriwardhana Y, Porambage P, Liyanage M, Chang SY, Kim J, Kim J, Ylianttila
M (2022) 5G-NIDD: a comprehensive network intrusion detection dataset generated over 5G
wireless network. arXiv preprint arXiv:2212.01298
7. Boahen EK, Bouya-Moko BE, Wang C (2021) Network anomaly detection in a controlled
environment based on an enhanced PSOGSARFC. Comput Secur 104:102225
8. Hammad M, Hewahi N, Elmedany W (2021) T-SNERF: a novel high accuracy machine learning
approach for intrusion detection systems. IET Inf Secur 15(2):178–190
9. Gisbrecht A, Schulz A, Hammer B (2015) Parametric nonlinear dimensionality reduction using
kernel t-SNE. Neurocomputing 147:71–82
10. Xu W, Jang-Jaccard J, Singh A, Wei Y, Sabrina F (2021) Improving performance of
autoencoder-based network anomaly detection on NSL-KDD dataset. IEEE Access 9:140136–
140146
11. Jiang K, Wang W, Wang A, Wu H (2020) Network intrusion detection combined hybrid
sampling with deep hierarchical network. IEEE Access 8:32464–32476
12. Zhang G, Wang X, Li R, Song Y, He J, Lai J (2020) Network intrusion detection based on
conditional Wasserstein generative adversarial network and cost-sensitive stacked autoencoder.
IEEE Access 8:190431–190447
13. Vuttipittayamongkol P, Elyan E, Petrovski A (2021) On the class overlap problem in imbalanced data classification. Knowl-Based Syst 212:106631
14. Li Z, Huang M, Liu G, Jiang C (2021) A hybrid method with dynamic weighted entropy for
handling the problem of class imbalance with overlap in credit card fraud detection. Expert
Syst Appl 175:114750
15. Padmaja DL, Vishnuvardhan B (2016) Comparative study of feature subset selection methods
for dimensionality reduction on scientific data. In: 2016 IEEE 6th international conference on
advanced computing (IACC). IEEE, pp 31–34
16. Khalid S, Khalil T, Nasreen S (2014) A survey of feature selection and feature extraction
techniques in machine learning. In: 2014 science and information conference. IEEE, pp 372–
378
17. Zhou H, Wang X, Zhu R (2022) Feature selection based on mutual information with correlation
coefficient. In: Applied intelligence, pp 1–18
18. Motoda H, Liu H (2002) Feature selection, extraction and construction. Commun IICM
(Institute of Information and Computing Machinery, Taiwan) 5(67–72):2
19. Chawla NV, Japkowicz N, Kotcz A (2004) Special issue on learning from imbalanced data
sets. ACM SIGKDD Explor Newsl 6(1):1–6
20. Sun Y, Wong AK, Kamel MS (2009) Classification of imbalanced data: a review. Int J Pattern
Recognit Artif Intell 23(04):687–719
21. Lee HK, Kim SB (2018) An overlap-sensitive margin classifier for imbalanced and overlapping
data. Expert Syst Appl 98:72–83
22. Mulyanto M, Faisal M, Prakosa SW, Leu JS (2020) Effectiveness of focal loss for minority
classification in network intrusion detection systems. Symmetry 13(1):4
23. Zeeshan M, Riaz Q, Bilal MA, Shahzad MK, Jabeen H, Haider SA, Rahim A (2021) Protocol-based deep intrusion detection for DoS and DDoS attacks using UNSW-NB15 and Bot-IoT datasets. IEEE Access 10:2269–2283
Denoising the Endoscopy Images
of the Gastrointestinal Tract Using
Complex-Valued CNN
Nisha and Prachi Chaudhary
Abstract Accurately delineating intestinal atrophy is a tough job because of the heterogeneous and convoluted structure of the small intestine. Image quality is a salient factor in diagnosing various diseases in the medical field. The images captured through machines or medical imaging devices are prone to some kind of noise. Endoscopy images are of inferior quality because of lighting problems inside the gastrointestinal (GI) tract. The noise type changes with the environment and with the camera used for capturing the images. The aim of this paper is to determine whether image-denoising methods are effective for classification. This study proposes an efficient complex-valued CNN (CDNet) method for denoising the images. The proposed model is found to be superior to other state-of-the-art methods for image denoising on real datasets. The denoising performance is computed through the PSNR and SSIM metrics. The results show a PSNR of 45.58 and an SSIM of 0.99, demonstrating the superiority of the proposed method on real datasets.
Keywords Endoscopy · Denoising · Complex value CNN · SSIM · PSNR
1 Introduction
Celiac disease is a disease of the small intestine triggered by the intake of a gluten-rich diet. The outer layers of the small intestine are damaged in this disease. Due to damage to the lining of the small intestine, known as villi, the nutrients present in food are not absorbed by the body [1]. A patient affected by celiac disease remains on a gluten-free diet lifelong, and further follow-ups are necessary to check how well the disease is controlled. Through serology and histopathology profiles, we are able to find the markers of the disease. The markers sometimes do not give a proper idea of the presence of
Nisha (B)·P. Chaudhary
ECE Department, DCRUST, Murthal, India
e-mail: 18001903005nisha@dcrustm.org
P. Chaudhary
e-mail: prachi.ece@dcrustm.org
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_33
disease, so in this case, endoscopy is preferred [2, 3]. A minimum of six biopsy samples is needed for a definitive diagnosis [4]. As the biopsy procedure is painful, some other method for diagnosis is needed. Automated detection by image processing is
becoming an effective method nowadays. In this method, the image is preprocessed
first, so that the quality of the image can be improved for further processing. Image
quality is measured in two different ways. The first approach is evaluation by doctors or endoscopists, which is always considered the most accurate. But this approach takes time and is difficult because a specialist must be present for the evaluation. So, another approach, automatically assessing the quality of the image, is more suitable and convenient. In this approach, peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) are measured. The image
preprocessing method is used as the initial step after acquiring images. Different
kinds of noise are present in digital images, such as blur, speckle noise, Gaussian noise, and salt-and-pepper noise. The noise present in the image is removed by applying suitable denoising methods. This noise removal is necessary as it supports the subsequent segmentation and classification of the images. Some transform approaches, machine learning methods, and statistical methods are also used for denoising.
Contribution: Celiac disease is spreading all over the world due to excessive intake
of gluten.
This paper addresses denoising of endoscopy images taken through an Olympus endoscope (GIF-H190). In this study, we focus on primary data of celiac disease patients, removing noise introduced by the camera inside the small intestine.
Complex-valued CNNs, in contrast to real-valued CNNs, are an emerging method that improves the capacity of the model. Until now, this approach had not been applied to such images.
This paper is organized into different sections. Section 1 gives a brief idea of the disease in the small intestine and the kinds of noise present in endoscopic images. Section 2 summarizes previous work on denoising. In Sect. 3, denoising techniques are studied. In Sect. 4, the results are studied, and finally, the conclusion is drawn in Sect. 5.
2 Literature Review
Previous studies include various denoising methods based on learning and nonlearning. In 2020, ADNet, an attention-guided CNN, was used by Tian et al.; it applies attention and reconstruction blocks for image denoising. It was evaluated on general images, but this method needs a large number of calculations because of the large number of weights assigned [5]. A noise estimation module and several noise removal modules are used with the NERNet CNN in the case of realistic noise [6]; this study was done by Guo et al. in 2020. In 2019, Shi et al. used three network modules for image denoising and a hierarchical residual learning CNN for classification. These three subgroups extract patches, map noise, and finally fuse the map into the estimation model [7]. In 2019, a dictionary learning model was proposed by Zhang et al. for the classification of Gaussian and mixed noise [8]. In the same year, again by Zhang et al., SANet was
used for classification with deep mapping and band aggregation blocks for denoising
[9]. The retaining module for image denoising and DRCNN for classification was
used by Li et al. in 2020 [10]. In the year 2020, SWCNN was proposed by Yin et al.; side window kernel convolution is used for image denoising and SWCNN for classification [11]. A regression CNN, combined with a classifier model, was used for denoising by Jin et al. in 2020 [12]. In this approach, the
noise was detected by the classifier and noise pixels were restored by the regression
network. CDNet was proposed in 2021 by Quan et al.; in this approach, a complex-valued (CV) ReLU is used for denoising [13]. Some research has also examined the effect of denoising methods on classification. When a CNN is trained on degraded images, the degradations have less effect on classification; otherwise, noise, blur, and contrast changes have a significant impact on classification [14]. A CNN has to work harder when it has not been trained on such images, and a trained CNN model increases accuracy [15]. Another study shows that denoising the inputs and augmenting with noise-affected images consistently improves the performance of classification [16]. A pipeline architecture is used for denoising preprocessing
in combination with CNN for classification [17]. The noise resilience problem is
resolved using a VGG16 CNN trained on specific images [18]. A dual-channel CNN based on InceptionV3 processes denoised and unprocessed images; the combined output, obtained by feature summation or concatenation, is used for further classification by a CNN for better results [19]. As CNNs require a lot of computation and storage space, this hinders their use in daily-life automation systems.
3 Denoising Techniques
Denoising is a very important part of digital image processing. There are different methods and algorithms available for denoising an image. The filtration method that best retains the quality of the image is employed. Neighborhood pixels are used for checking the type of noise present. Denoising methods are categorized into linear and nonlinear methods (Fig. 1).
3.1 DnCNN
It is an efficient learning model that works on Gaussian-noise images. It extracts the residual (noise) image from the corrupted image. The noise-free image can be obtained as the difference between the corrupted image and the residual image [20]. The
Fig. 1 Categories of denoising methods: learning methods (DnCNN, Noise2Noise, CNN, CDNet) and nonlearning methods (Wiener filter, moving average filter, median filter, opening filter)
size of the filter is set to 3 × 3 and pooling layers are removed from the structure. Conv + ReLU layers are used to create 64 feature maps. BN is introduced for batch normalization, and finally a convolution of size 3 × 3 × 64 is applied for output reconstruction. The model is pre-trained on synthetic Gaussian noise with standard deviation σ = 25.
3.2 Noise2Noise
This method does not use clean data; it uses pairs of noisy images in the training phase. For working with small datasets, it uses the wavelet transform (WT), so that images can be handled in multispectral bands [21]. The images consist of low- and high-pass components, and at the output side the inverse WT is applied to obtain the denoised output. This model is trained on synthetic Gaussian noise with σ ranging from 5 to 25.
3.3 CDNet
It consists of five main blocks. This method, used in [13], is a complex-valued CNN. The architecture shown in Fig. 2 consists of 24 sequentially connected convolutional units, each comprising a convolution layer, ReLU, and BN, with 64 convolution kernels. In total it has five blocks of convolution and deconvolution layers. One residual layer and one merging layer are used to recover the original real-valued image.
Fig. 2 Brief architecture of the complex-valued CNN [6]: input image → CV conv → CV ReLU → CV BN → CV RB → CV merging layer → output image
3.4 Median Filter
It is one of the most popular filters for denoising in digital image processing. It removes impulse noise. The intensity value of a pixel affected by noise is replaced by the median of the intensity values of the pixels in its neighborhood [22]. The window size is fixed in this type of filter. The filter is applied to the whole image, so both noisy and noise-free pixel values are changed. Because all pixel values are changed, some good-quality pixels may be replaced by lower-quality values. Consequently, some fine details are lost, and distortion is introduced at the edges of a high-quality image.
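The impulse-removal behavior described above can be seen in a few lines of Python (a simplified sketch that filters only interior pixels with a 3 × 3 window; the sample values are hypothetical):

```python
from statistics import median

def median_filter_3x3(img):
    """Apply a 3x3 median filter to the interior pixels of a 2-D image
    (a list of equal-length rows); border pixels are copied unchanged."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            window = [img[r + dr][c + dc]
                      for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
            out[r][c] = median(window)
    return out

# a flat patch corrupted by a single salt impulse
img = [[10, 10, 10],
       [10, 255, 10],
       [10, 10, 10]]
print(median_filter_3x3(img)[1][1])   # → 10
```

The outlier 255 is discarded because it cannot be the median of its 3 × 3 window, which is exactly why the filter handles salt-and-pepper noise well while smoothing fine detail.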
3.5 Wiener Filter
This type of filtration removes noise using a statistical approach. Blur and additional noise are removed with this kind of filter. It uses a Linear Time-Invariant (LTI) approach and minimizes the MSE. The spectral properties of the original and noisy images are first checked. Filtering is first done by checking the present value of the signal; then smoothing is done by checking past values of the signal; in the last step, the future value of the signal is predicted [23]. This method uses only a 3 × 3 kernel, so the results are not well optimized.
3.6 Gaussian Filter
Its results are better than those of the Wiener filter and the median filter. The Gaussian filter computes a weighted average of the pixels according to the Gaussian distribution. It is a type of linear low-pass filter for reducing noise and blur. The kernel is passed over each pixel. It preserves edges, which are not preserved by the median filter.
3.7 Deep Image Prior
This method is used in the case of the GI tract, as the images are taken from different regions of the small intestine and the environment inside the GI (gastrointestinal) tract also differs from region to region [24]. The blind denoising methods available include Noise Clinic, Neat Clinic, and Deep Image Prior. These methods are often time-consuming and require a number of iterations. So, an improved version of DIP is used to remove the noise. The improved version of DIP determines when to terminate the iteration process by using an image quality assessment method. The method also uses transfer learning to minimize iterations while retaining the denoising effect.
4 Experimental Results
In this study, we used a real dataset from PGIMS, Rohtak, and evaluated the performance of the existing complex-valued CNN model. The results are compared with state-of-the-art denoising methods: DnCNN [20], Noise2Noise [21], the median filter [22], DIP [24], and the Wiener filter [23]. The flowchart of the methodology used is shown in Fig. 3.
4.1 Data Preparation
The dataset is divided into training and testing images. Two hundred images were taken with the endoscope; 160 images are used for training and 40 for testing. Because the original images are very large, they are first resized to 256 × 256 before training. PSNR and SSIM are calculated for these images, and testing is then performed on the cropped images.
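The 160/40 split described above is a shuffled 80/20 partition; a minimal sketch (the file names and seed below are placeholders, not from the study):

```python
import random

def split_dataset(items, train_frac=0.8, seed=42):
    """Shuffle and partition a dataset into training and testing subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)        # reproducible shuffle
    cut = int(len(items) * train_frac)
    return items[:cut], items[cut:]

images = [f"frame_{i:03d}.png" for i in range(200)]   # placeholder file names
train, test = split_dataset(images)
print(len(train), len(test))   # → 160 40
```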
4.2 Steps to Be Followed
Step 1: Take input from endoscopy images through a real dataset. Divide the dataset
into training and testing images.
Step 2: Resize the image to a small pixel size of 256 ×256.
Step 3: Check the noise type; in this case, salt-and-pepper noise of density 0.3 is
found.
Step 4: Now use a specific filter for the removal of noise.
Step 5: Restore the image that is filtered.
Fig. 3 Flowchart of the methodology used: input noisy image → divide the data into training and testing images → preprocessing (resizing) → noise-type identification → complex-valued CNN → filtered image
Step 6: Evaluate the performance parameters for denoising.
4.3 Performance Evaluation
The performance on our dataset is checked with the CDNet model and compared with other state-of-the-art methods listed in Tables 1 and 2.
From Fig. 4a, it can be seen that the image contains salt-and-pepper noise. Uneven lighting, which is also a kind of noise, can be seen in the image. First, the image is cropped to size 256 × 256; cropping also removes extraneous regions of the image. Then, as shown in Fig. 4b, the salt-and-pepper noise is removed to a large extent using the model. Lighting artifacts are removed by averaging the 8 neighboring pixels and filling the gap with that value. Our method achieves a better PSNR value than the median filter, increasing PSNR from 42.5 to 45.58 (highlighted in bold in Table 1). Table 2 shows the SSIM performance of our model, which increases from 0.98 to 0.99 (again highlighted in bold).
Table 1 PSNR versus standard deviation
Method σ = 5 σ = 15 σ = 25 References
Median filter 42.5 33.9 29.4 [22]
Opening filter 33.8 33.6 32.6 [10]
Wiener filter 26.9 22.4 17.4 [23]
DnCNN 38.1 34.1 30.2 [20]
Noise2Noise 36.1 27.2 22.2 [21]
DIP 40.62 26.2 22.5 [24]
Proposed 45.58 34.5 32.7
Table 2 SSIM versus standard deviation
Method σ = 5 σ = 15 σ = 25 References
Median filter 0.98 0.96 0.92 [22]
Opening filter 0.98 0.89 0.69 [10]
Wiener filter 0.92 0.68 0.52 [23]
DnCNN 0.78 0.68 0.56 [20]
Noise2Noise 0.88 0.52 0.22 [21]
DIP 0.87 0.56 0.56 [24]
Proposed 0.99 0.98 0.96
Fig. 4 a Original image. bDenoised and enhanced image
4.4 Model Parameters
We implemented our model in Python using a complex-valued CNN architecture. It is trained for ten epochs with the binary cross-entropy loss criterion and the sigmoid activation function. Denoising performance can be computed through a number of parameters such as SNR, MSE, PSNR, and RMSE; in this study, we consider only PSNR and SSIM.
PSNR: It is the peak signal-to-noise ratio. PSNR decreases as the mean squared error increases; it is measured with the help of the MSE.
PSNR = 10 log10((max(I))^2 / MSE).
SSIM: It is the structural similarity index. It checks the quality of images received through different media. The larger the value of SSIM, the better the quality of the image.
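The PSNR definition above can be checked numerically; a small sketch assuming 8-bit images, so max(I) = 255 (the pixel values below are hypothetical):

```python
import math

def psnr(original, denoised, max_val=255):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    mse = sum((o - d) ** 2 for o, d in zip(original, denoised)) / len(original)
    if mse == 0:
        return float("inf")                # identical images: PSNR is unbounded
    return 10 * math.log10(max_val ** 2 / mse)

orig = [100, 150, 200, 50]
noisy = [110, 140, 210, 40]                # every pixel is off by 10, so MSE = 100
print(round(psnr(orig, noisy), 2))         # → 28.13
```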
5 Conclusion
The endoscope used for the gastrointestinal tract is a tool for capturing images inside the GI tract, but it introduces noise that affects the endoscopist's observation. Because the noise present cannot be modeled globally, deep learning methods achieve good results in suppressing noise while preserving the edge features of the image. This paper proposes a complex-valued CNN-based network module for denoising endoscopic images. The outcomes show that CDNet performs better than methods used in the literature in terms of PSNR and SSIM. This module focuses on the most noise-affected regions and suppresses the noise in dark endoscopic images. The results show that PSNR increases from 42.5 to 45.58 and SSIM increases from 0.98 to 0.99 for a standard deviation of 5. The results for
standard deviation values 15 and 25 are also shown in Tables 1 and 2. The denoising results of the complex-valued CNN on real datasets are better than those of the median filter and the DIP method used in the literature. This paper concludes that our method can enhance image quality by reducing the noise impact on endoscopic images, thus assisting endoscopists in diagnosing the disease. Although the results are good, a limitation of this work is that the method was applied to a small number of images, so it should be extended to a larger dataset. Moreover, other noises present in the images need to be considered.
Acknowledgements The authors would like to thank Professor Sandeep Goyal, Department of
Internal Medicine, Pt. B. D. Sharma University of Health Sciences, Rohtak, for providing us with
Endoscopy images of the Celiac Disease in the small intestine.
Conflict of Interest The authors would like to declare that there is no conflict of interest.
References
1. Ciaccio EJ, Bhagat G, Lewis SK, Green PH (2015) Quantitative image analysis of celiac
disease. World J Gastroenterol 21(9):2577
2. Ianiro G, Gasbarrini A, Cammarota G (2013) Endoscopic tools for the diagnosis and evaluation of celiac disease. World J Gastroenterol 19(46):8562–8570
3. Nisha, Chaudhary P (2020) Comparative analysis of techniques used for detection of celiac disease using various endoscopies. IJARET 11(12):1521–1529
4. Villanacci V, Ciacci C, Salviato T, Leocini G, Reggiani L, Ragazzini T, Limarzi F, Saragoni L (2020) Histopathology of celiac disease. Position statements of the Italian Group of Gastrointestinal Pathologists (GIPAD-SIAPEC). Transl Med @ UniSa 23(6):28–36. ISSN 2239-9747
5. Tian C, Zhang Q, Sun G, Song Z, Li S (2018) FFT consolidated sparse and collaborative
representation for image classification. Arab J Sci Eng 43(2):741–758
6. Guo B, Song K, Dong H, Yan Y, Tu Z, Zhu L (2020) NERNet: noise estimation and removal
network for image denoising. J Vis Commun Image R 71:102851
7. Shi W, Jiang F, Zhang S, Wang R, Zhao D, Zhou H (2019) Hierarchical residual learning for
image denoising. Signal Process Image Commun 76:243–251
8. Zhang J, Luo H, Hui B, Chang Z (2019) Unknown noise removal via sparse representation
model. ISA Trans 94:135–143
9. Zhang L, Li Y, Wang P, Wei W, Xu S, Zhang Y (2019) A separation–aggregation network for
image denoising. Appl Soft Comp J 83:105603
10. Li X, Xiao J, Zhou Y, Ye Y, Lv N, Wang X, Wang S, Gao S (2020) Detail retaining convolutional
neural network for image denoising. J Vis Commun Image R 71:102774
11. Yin H, Gong Y, Qiu G (2020) Fast and efficient implementation of image filtering using a side window convolutional neural network. Signal Process 176:10771
12. Jin L, Zhang W, Ma G, Song E (2019) Learning deep CNNs for impulse noise removal in
images. J Vis Commun Image R 62:193–205
13. Quan Y, Chen Y, Shao Y, Teng H, Xu Y, Ji H (2021) Image denoising using complex-valued
deep CNN. Pattern Recogn 111:107639
14. Dodge S, Karam L (2016) Understanding how image quality affects deep neural networks
15. Nazaré TS, da Costa GB, Contato WA, Ponti M (2018) Deep convolutional neural networks and
noisy images. In: Mendoza M, Velastín S (eds) Progress in pattern recognition, image analysis,
computer vision, and applications. Springer International Publishing, Cham, pp 416–424
16. Koziarski M, Cyganek B (2017) Image recognition with deep neural networks in presence of
noise—dealing with and taking advantage of distortions. Integr Computer Aided Eng 24:1–13
17. Diamond S, Sitzmann V, Boyd S, Wetzstein G, Heide F (2017) Dirty pixels: optimizing image
classification architectures for raw sensor data
18. Dodge SF, Karam LJ (2017) Quality resilient deep neural networks. CoRR abs/1703.08119
19. Yim J, Sohn K-A (2017) Enhancing the performance of convolutional neural networks on
quality degraded datasets
20. Zhang K, Zuo W, Chen Y, Meng D, Zhang L (2017) Beyond a gaussian denoiser: residual
learning of deep cnn for image denoising. IEEE Trans Image Process 26:3142–3155
21. Lehtinen J, Munkberg J, Hasselgren J, Laine S, Karras T, Aittala M, Aila T (2018) Noise2noise:
learning image restoration without clean data. CoRR abs/1803.04189
22. Ning CY, Liu SF, Qu M (2019) Research on removing noise in medical image based on median
filter method. In: IEEE international symposium on information (IT) in medicine and education,
ITME
23. Shruthi B, Renukalatha S, Siddappa M (2015) Speckle noise reduction in ultrasound images—a
review. Int J Eng Res Technol (IJERT) 4(2). ISSN: 2278-0181
24. Zou S, Long M, Wang X, Xie X, Li G, Wang Z (2019) A CNN-based blind denoising method
for endoscopic images. In: IEEE biomedical circuits and systems conference (BioCAS)
FTL-Emo: Federated Transfer Learning
for Privacy Preserved Biomarker-Based
Automatic Emotion Recognition
Akshi Kumar, Aditi Sharma, Ravi Ranjan, and Liangxiu Han
Abstract Advancements in IoT have revolutionized remote patient monitoring; however, privacy is still the major challenge faced by researchers. We put forward a federated learning-based technique to handle the issue of privacy, and to overcome the requirement of a large dataset we employ a transfer learning approach. The federated transfer learning (FTL) model analyzes the electronic health records of the user to detect their emotional state. Emotion is observed by monitoring the physiological changes of the human body, measured using EEG. A convolutional network is used at the server and at each client node in FTL. The model is pre-trained on the publicly available DEAP dataset on a centralized machine and is fine-tuned on the K-EmoCon dataset on each client device, without sharing the data of any subject with the centralized model. Valence and arousal are detected using FTL. On both dimensions, state-of-the-art average F1 scores have been achieved.
Keywords Federated learning · K-EmoCon · EEG · Transfer learning · Emotion recognition · EHI
A. Kumar ·L. Han
Manchester Metropolitan University, Manchester, UK
e-mail: akshi.kumar@mmu.ac.uk
L. Han
e-mail: l.han@mmu.ac.uk
A. Sharma (B)
Delhi Technological University, New Delhi, India
e-mail: Aditisharma9420@gmail.com
Thapar Institute of Engineering and Technology, Patiala, India
R. Ranjan
Netaji Subhas University of Technology, Delhi, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_34
449
450 A. Kumar et al.
1 Introduction
Advances in technology have the potential to influence and shape society. The upsurge in IoMT devices has made remote healthcare a reality. The bio-signals of patients can be monitored from a remote location by medical professionals, making it possible for everyone to have access to healthcare. Remote monitoring has helped the medical community as well: only critical patients need to be kept in hospitals, making room for emergencies and other patients. With advancements in IoT sensors, remote monitoring of patients' health has created a large resource of electronic health records (EHR). EHRs can be accessed by doctors remotely, so a patient can consult various doctors without repeating multiple body tests, as the patient can share their existing EHR with the new doctor. With automated tools such as AI and machine learning, the load on medical professionals can be lightened further, as researchers are working on automatic predictive analytical tools for various diseases [1]. Although the physical intervention of medical professionals will always be required, automatic tools can help narrow down the decision-making process for doctors. Recent studies have shown great results in detecting many physical diseases, such as brain tumors, heart diseases, and cancer, from scans of the human body [2]. Physical health issues are taken more seriously in our society than mental health issues, as the symptoms of the latter are not visible from the outside. But the mental health of a person needs as much care as physical health, if not more, since poor mental or psychological wellbeing can trigger physical health problems as well. Many researchers have put forward automatic predictive tools that can detect stress, depression, and other mental health issues in a person [3]. As a person with a sound psychological state can express their emotions profoundly, researchers have focused on recognizing the emotions of a person to track their psychological wellbeing. A person feeling anger most of the time is more prone to a psychiatric disorder than one having a mix of emotions; similarly, someone who is always sad might be suffering from depression.
Emotions are an inseparable aspect of human intelligence and integral to decision making. Analysis of emotion can help us understand the psychological state of a person. Researchers have proposed many models to effectively recognize the emotional state of a person, but they require personal data of that individual to be monitored regularly; these data could be bio-signals measured using biosensors, or emotions have also been identified from facial expressions and voice [4]. The IEMOCAP, MELD, CASE, and CK+ datasets have been used for facial expressions, and Berlin and LEESD for emotion recognition from speech. Different modalities have been explored by researchers for emotion recognition [5, 6]. Combinations of two or more modalities have been employed on some datasets [7]. Kumar et al. experimented on IEMOCAP and MELD for emotion recognition using facial expressions [8, 9]. Different modalities help researchers recognize emotion more accurately, but sharing private information with anyone is a difficult decision for many people. With increasing cyber-attacks, the leaking of personal information is a fear that surrounds
all. The concern for privacy makes it difficult to have real-time remote psychological health monitoring. To resolve this issue, we employed federated learning to ensure privacy preservation. The structure of federated learning is shown in Fig. 1.
Fig. 1 Federated learning
The federated learning (FL) concept was introduced by Google in 2017 to reduce computational costs by utilizing the computation of mobile devices, making them act as nodes in edge computing [10]. In federated learning, training is performed at the individual client level, and then the weights of the model from each client are shared with the server, where the server collects the weights of all clients and computes a new weight, as shown in Eq. 1.
f_s(w) = (1/K) Σ_{n=1}^{K} f_n(w)    (1)
where f_s(w) represents the weight at the server/centralized model and f_n(w) represents the weight of the client/user models. This new weight calculated by the server is then communicated back to each client, which again trains its own model individually, repeating this process until optimized weights are obtained [10]. This concept was proposed by Google for reducing computation cost, but it serves one more, even more prominent, advantage: it ensures privacy of the data. As training is performed at each client device, no data needs to be shared with the centralized server. In FL, data is not stored at a centralized location, nor is it required to be sent to the server to train the model, making it safe from cyber-attacks.
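The averaging in Eq. 1 can be sketched in a few lines of Python; the client count and two-parameter weight vectors below are purely illustrative:

```python
def aggregate(client_weights):
    """Element-wise average of the clients' weight vectors (Eq. 1):
    f_s(w) = (1/K) * sum over n of f_n(w)."""
    K = len(client_weights)
    n_params = len(client_weights[0])
    return [sum(w[i] for w in client_weights) / K for i in range(n_params)]

# Three hypothetical clients, each holding a two-parameter model.
clients = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
server = aggregate(clients)
print(server)  # [3.0, 4.0]
```

Only the averaged weights travel to the server; the clients' raw data never leaves their devices.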
Even if a cyber-attack happens, the attackers can obtain the data of a single client, not of every client, making the setup more secure [11].
Since its inception, federated learning has been explored by many researchers for studying and optimizing its performance while maintaining privacy and security. The FL approach has been used in many areas, but it is most widely accepted in the medical field, due to the privacy-sensitive data involved in medical diagnosis [12]. Some researchers have used this approach on electronic medical records and on brain tumor data; however, the application of FL on real medical data has still not been sufficiently studied. The FL algorithm is essential in the biomedical field, since not just patients but hospitals too have to share their patients' EHR with other hospitals and doctors. This data sharing induces the risk of cyber-attacks and privacy violations. FL can be used to resolve privacy issues and mitigate the risk of a data breach in clinical information, since transfer and centralization of data are not required [13], although no such clinical study has been conducted so far.
Traditional machine learning and deep learning require a centralized dataset to train a model. In our work we have used a traditional deep learning technique, a convolution network, for training the model, but this training is performed at multiple locations. This protects the patients' privacy and reduces the risk of a data breach. Since the structure of federated learning is not fixed and is still being explored by researchers, it can be implemented as the situation demands. We have modified the structure by incorporating a transfer learning component along with federated learning. In federated learning, data of the client should not be shared with the server, but we can use another existing dataset to pre-train the server model; this does not risk the privacy of the clients and also speeds up the process of training the model.
We have used the publicly available dataset K-EmoCon for creating this scenario. The K-EmoCon dataset was provided by Park et al. in 2020 [14]. The dataset contains audio, video, and bio-signals of 32 subjects.
This research puts forward the model FTL-Emo: a privacy-preserving transfer learning approach for recognition of emotions using EEG. Transfer learning is used to learn from the existing DEAP data. In FTL-Emo, a convolution neural network (CNN) has been employed at both the client and the server side. K-EmoCon entities are annotated with both discrete emotion states and dimensional states, the two types of emotion models [14]. We have taken the dimensional emotion states Arousal and Valence for FTL-Emo; during processing, the emotional state is taken as the one self-reported by the subject.
The primary contributions of the proposed FTL-Emo are:
Federated learning has been used for emotion recognition from EHR.
Utilization of existing data to pre-train the model for accurate emotion identification.
The performance is evaluated for affect dimensions, while maintaining the privacy of the data.
The next section of this paper contains a description of the datasets, followed by the proposed model; Sect. 4 contains a discussion of the results and analysis.
2 Dataset
DEAP: DEAP is a publicly available dataset containing EEG signals of 32 participants [15]. The electroencephalogram and peripheral physiological signals of the 32 participants were recorded as each watched 40 one-minute-long excerpts of music videos. Participants rated each video in terms of the levels of arousal, valence, like/dislike, dominance, and familiarity.
K-EmoCon: A multimodal publicly available dataset for emotion recognition in conversations, provided by Park et al. in 2020 [14]. The dataset contains audio, video, and bio-signals of 32 subjects, who participated in a 10-min debate task on a social issue in teams of 2, while wearing physiological signal measuring devices (Empatica E4, NeuroSky, and Polar H7), along with video cameras for recording the facial expressions and gestures of the participants. Brainwave EEG (125 Hz) and meditation signals were recorded through NeuroSky.
For our study we have used only the data collected by the NeuroSky device, i.e., the EEG signal. Although in K-EmoCon each instance has been annotated by 7 people, we have taken the emotion annotated by the user itself. Park et al. used 20 different types of emotions for annotation, including the dimensional emotional model [14]. In this work we have worked to identify only Arousal and Valence.
3 FTL-Emo: Proposed Model
To ensure privacy protection in emotion identification, a federated learning-based model, FTL-Emo, is proposed. The architecture of FTL-Emo is shown in Fig. 2. Initially, every single user is considered a unique client with their own processing power at the edge, where they train their own model; these trained models' weights are then shared with the centralized server. Twenty-nine input files from the K-EmoCon dataset (only the EEG files) were provided to the proposed model separately at different devices. To have a more effective and accurate model, transfer learning has been used as well. The centralized server was initially trained on the DEAP dataset, and these weights were shared with the client models, which updated them with their own training; this process was continued until the convergence of the model. The experiment was evaluated for affective dimensions only.
The steps included in this whole process are:
Construct an initial server model employing a CNN with the publicly available dataset (DEAP).
Distribute the weights of the server model to each client.
Train each client's model with its own data.
Send the client models' weights to the server model.
Perform weight aggregation at the server.
Distribute the updated server-level weight to each user.
Repeat this procedure with newly arriving data, until the threshold is reached.
Fig. 2 Architecture of FTL-Emo
3.1 Pre-processing
To create time-synchronous data for the proposed work, the 29 participants' EEG signals were mapped with the output (Arousal, Valence). The EEG signals were collected at a sampling rate of 125 kHz and have been down-sampled to 220 Hz utilizing a Savitzky–Golay filter for smoothing. To extract physiological features, the NeuroKit2 toolbox was used to extract the time-domain features of the raw EEG signal with a window size of 4 s (i.e., 4000 steps) and a hop size of 0.5 s (i.e., 500 steps). Then we pad the head and tail of the raw data with neighboring data and combine the above two feature vectors as the physiological feature. In pre-processing, missing values were replaced by the means of the next two consecutive instance values.
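As a rough sketch of this windowing step, the signal can be segmented into overlapping windows using the stated 4000-step window and 500-step hop; the per-window mean below is only a stand-in for the NeuroKit2 time-domain features, and the toy signal is illustrative:

```python
def sliding_windows(signal, win=4000, hop=500):
    """Segment a 1-D signal into overlapping windows (4000-step window,
    500-step hop, as stated in the text) and return a per-window mean
    as a stand-in time-domain feature."""
    feats = []
    for start in range(0, len(signal) - win + 1, hop):
        w = signal[start:start + win]
        feats.append(sum(w) / len(w))
    return feats

sig = [float(i % 10) for i in range(8000)]  # toy periodic "EEG" signal
f = sliding_windows(sig)
print(len(f))  # (8000 - 4000) // 500 + 1 = 9 windows
```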
3.2 Federated Transfer Learning
Transfer learning helps to use existing knowledge for a new application. The federated transfer learning concept was proposed by Liu et al. for a secured two-party privacy-preserving setting whose main purpose was data security, but their empirical analysis has different conditions at client edges [16]. Researchers have also used federated learning for domain adaptation, where they extend domain adaptation to a federated approach for data security [16]. To understand the proposed approach, consider N different users U_1, U_2, …, U_N and their sensor-collected data (EEG) E_1, E_2, …, E_N. The centralized/server model M_S is first trained with the existing dataset (DEAP).
The weights of the model M_S are then frozen after validation, without performing testing, following the validation approach for a deep neural network. These frozen server weights W_g are then shared with all the clients, where each client uses them as the initialization weights for its own training. The process of initial global weight training is given in Algorithm 1.
Algorithm 1 Pre-Training Server Model, M_S
Input: DEAP dataset
Output: W_pre-trained: global model weight after pre-tuning
1. Train the convolution model M_S
2. Validate
3. Freeze the weights of the layers
4. Assign the frozen weights to W_pre-trained
5. Return W_pre-trained
After training the model, each client freezes its weights after each epoch and shares them with the server model M_S, which calculates the average weight using Eq. 1 and assigns the updated weight to each client. This process is repeated with each epoch, until either the model reaches global optimization or the epoch threshold is reached. The process at the local client level is shown in Algorithm 2.
Algorithm 2 Initial Processing on Local Nodes
Input: K-EmoCon, W_pre-trained
Output: W_g: global server weight
1. Share W_pre-trained with each user
2. For each user i:
   a. W_Ui ← W_pre-trained
   b. Train the user model with K-EmoCon
   c. Update the weight W_Ui
   d. Share W_Ui with the server model M_S
3. Aggregate the global weight at the server model:
   f_s(w) = (1/K) Σ_{n=1}^{K} f_n(w)
4. Update the global weight W_g
5. Share W_g with each user
6. W_pre-trained ← W_g
7. Repeat the process
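Algorithms 1 and 2 together form a simple loop: share the server weight, train locally, average, repeat. The toy simulation below illustrates that loop; each client's "training" is a single gradient step on a made-up quadratic objective rather than CNN training, so every numeric choice here is an illustrative assumption (only the 168-round threshold is taken from the text):

```python
def local_step(w, target, lr=0.1):
    """One gradient step on a client's toy loss (w - target)^2 / 2,
    a stand-in for an epoch of local training on K-EmoCon."""
    return w - lr * (w - target)

def federated_rounds(targets, w0=0.0, rounds=168):
    """Share the server weight, train each client locally, then
    average the client weights back at the server (Algorithm 2)."""
    w = w0  # pre-trained server weight (Algorithm 1)
    for _ in range(rounds):
        client_ws = [local_step(w, t) for t in targets]
        w = sum(client_ws) / len(client_ws)  # Eq. 1 aggregation
    return w

# Three hypothetical clients whose local optima are 1, 2, and 3;
# the averaged global model converges toward their mean.
w_final = federated_rounds([1.0, 2.0, 3.0])
print(round(w_final, 3))  # 2.0
```

Note that the server only ever sees weights, never the per-client targets, which is the privacy property the paper relies on.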
3.3 Deep Learning Architecture
On both the server and the client ends we have employed the same model architecture using convolution layers. The architecture of the CNN model is shown in Fig. 3. The model is composed of 5 convolution layers, each with a different kernel size. These convolution layers are followed by 4 pooling layers, 2 fully connected layers, and an output layer with a SoftMax function. Stochastic gradient descent has been used for optimization. A 60:20:20 ratio has been used for training, validation, and test data. We have used a learning rate of 0.01 for every layer, and a batch size of 64 is used as the initialization point, followed by dilution. The threshold was set at 168 epochs.
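To make the layer stack concrete, the shape arithmetic of such a 1-D pipeline can be traced in plain Python. The input length and kernel sizes below are illustrative assumptions (the paper does not list them); only the 5-conv/4-pool/2-FC layout is taken from the text:

```python
def conv_out(length, kernel, stride=1):
    """Output length of a 1-D 'valid' convolution."""
    return (length - kernel) // stride + 1

def pool_out(length, size=2):
    """Output length of a 1-D pooling layer with stride == size."""
    return length // size

# Hypothetical plan: 5 convolution layers with decreasing kernel sizes,
# a pooling layer after each of the first 4, per the described layout.
length = 4000                     # e.g. one 4 s EEG window
kernels = [11, 9, 7, 5, 3]        # assumed kernel sizes, one per conv layer
for i, k in enumerate(kernels):
    length = conv_out(length, k)  # convolution shrinks the sequence
    if i < 4:
        length = pool_out(length)  # pooling after the first 4 convs
print(length)  # 242: flattened size fed to the 2 fully connected layers
```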
The reasons for using this deep learning architecture were to have incremental online batch processing and, additionally, automated feature engineering. Equation 2 gives the computation at the final layer of the deep learning model, where P_Ui gives the probability of attaining a particular affective dimension at user U_i.

SoftMax(P_Ui) = exp(P_Ui) / Σ_i exp(P_Ui)    (2)
The stochastic gradient descent optimizer has been used for all the fully connected layers, and the loss is calculated by categorical cross-entropy as shown in Eq. 3.

Loss = −(y′_i1 log(y_i1) + y′_i2 log(y_i2) + ⋯ + y′_in log(y_in))    (3)

where y_i1, y_i2, …, y_in are the internal node labels and y′_i1, y′_i2, …, y′_in are the output-layer nodes produced by the SoftMax function.
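Equations 2 and 3 can be checked numerically with a small sketch; the class scores and one-hot label are made up for illustration:

```python
import math

def softmax(scores):
    """Eq. 2: normalize raw scores into a probability distribution."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y_true, y_pred):
    """Eq. 3: categorical cross-entropy between a one-hot label
    and the SoftMax outputs."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

# Two affective classes (e.g. low/high arousal); scores are illustrative.
p = softmax([2.0, 0.5])
loss = cross_entropy([1.0, 0.0], p)
print(round(sum(p), 6), round(loss, 4))  # probabilities sum to 1.0
```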
Fig. 3 Deep learning architecture of FTL-Emo
Table 1 Performance of FTL-Emo
Emotion   Average F1-score   Best F1-score   Average accuracy   Best accuracy
Arousal   86.8               89.03           88.5               92.7
Valence   88.4               94.1            87.3               91.5
Overall   87.9               93.8            88.1               91.8
4 Result and Discussion
The proposed model was executed twenty-nine times for testing, once at each individual client machine, to evaluate the performance of the model for each client. For each execution the proposed model works exactly the same; the executions are separate only to ensure the privacy of the data.
For performance evaluation, accuracy and F1-score have been used. EEG was recorded for only 10 min for each of the 32 participants, generating approximately 320 min of data in total, but only 29 participants' data was taken for model execution. Although the size of the dataset is limited, pre-training the model on the DEAP dataset of 32 users and training the model over multiple epochs resulted in very good performance. For training the convolution model, a batch size of 64 was used for one epoch at each communication round, with a learning rate of 0.0001 using stochastic gradient descent. The model termination condition was set as either model convergence or hitting the threshold of 168 epochs.
4.1 Result
The proposed work has twenty-nine nodes and one centralized server, each executed separately in a synchronous manner, combining the weights. The performance evaluation of the proposed model for both output categories, Arousal and Valence, is shown in Table 1. For Arousal the highest F1-score obtained was 89.03, and the average F1-score over all 29 devices was 86.8. For Valence the highest F1-score obtained was 94.1, and the average obtained is 88.4. FTL-Emo has obtained high accuracy as well, while maintaining the privacy of the data.
4.2 Discussion
The proposed transfer learning-based federated learning approach has achieved state-of-the-art results for both emotion dimensions. A comparison with an exactly similar approach is not possible, since this is the first work that incorporates a privacy-preserving approach on the K-EmoCon dataset. We have therefore compared FTL-Emo with a simple deep learning-based model to show the performance difference, and the model has also been compared with other existing models on K-EmoCon, although the data signals taken by them are different [16–18]. The proposed work, while maintaining privacy, has obtained highly accurate results; the use of the transfer learning approach has improved the performance considerably, as can be seen from the results (Table 2). This model can be employed on real-time data collected through IoT sensors that can record the bio-signals of a person [20], not the traditional IoT sensors that were used for electronic appliances [19].

Table 2 Comparison of FTL-Emo
Model         Average F1-score   Average accuracy
Sig-Rep       58.9               71.3
LSTM          72.5               67.4
RNN           76.8               71.5
CNN (DREAM)   73.41              69.78
FTL-Emo       87.9               88.1

Fig. 4 ROC curve of U7 and U23
The ROC curves of two different subjects, formed during evaluation of the models, are shown in Fig. 4.
For accurate emotion recognition while ensuring the privacy of the user in maintaining EHR and sharing it with other doctors or hospitals, FTL-Emo has provided great results. The FTL-Emo approach can act as a baseline for future studies on real-time privacy-preserving emotion recognition. It can be used for various real-time activities such as automatic chatbots and HCI in smart industries and schools. In our proposed work, we have not addressed the issue of non-independent identically distributed (non-iid) data. As each individual reacts differently to each scenario, their emotional triggers could also be different; therefore, a single model for every subject may not produce accurate results for all. To handle such scenarios, researchers could incorporate the concept of similarity, clustering the subjects on the basis of their similarity and generating a model for each cluster rather than one single model. As future scope, researchers could also address the cold-start problem to handle real-world scenarios where a patient's data is not available at the start of the model training process.
5 Conclusion
Accurate human emotion recognition can enhance the effectiveness and adeptness of remote healthcare practices. But ensuring the privacy of user data is as important as attaining a highly accurate prediction of the user's emotional state. A transfer learning-based federated learning approach can pave a way to achieve high accuracy and privacy preservation at the same time. We proposed a privacy-preserving emotion recognition model using federated learning on the K-EmoCon dataset, while pre-training the model on the DEAP dataset. The model has achieved highly accurate results. No other work has ensured privacy protection for emotion recognition on this dataset. The performance of the model was evaluated on accuracy and F1-score, and on both measures FTL-Emo has produced state-of-the-art results. To improve the model further, researchers can embed other modalities, such as audio and video signals, for more accurate emotion recognition.
Funding This research is funded by CfACS seed project funding 2022–2023, Manchester
Metropolitan University, UK.
References
1. Liu Y, Kang Y, Xing C, Chen T, Yang Q (2020) A secure federated transfer learning framework.
IEEE Intell Syst 35(4):70–82
2. Chen Y, Qin X, Wang J, Yu C, Gao W (2020) Fedhealth: a federated transfer learning framework
for wearable healthcare. IEEE Intell Syst 35(4):83–93
3. Kumar A, Sharma K, Sharma A (2021) Hierarchical deep neural network for mental stress
state detection using IoT based biomarkers. Pattern Recogn Lett 145:81–87
4. Jing Q, Wang W, Zhang J, Tian H, Chen K (2019) Quantifying the performance of federated
transfer learning. arXiv preprint arXiv:1912.12795
5. Gupta P, Balaji SA, Jain S, Yadav RK (2022) Emotion recognition during social interactions
using peripheral physiological signals. In: Computer networks and inventive communication
technologies. Springer, Singapore, pp 99–112
6. Guhn A, Merkel L, Hübner L, Dziobek I, Sterzer P, Köhler S (2020) Understanding versus
feeling the emotions of others: how persistent and recurrent depression affect empathy. J
Psychiatr Res 130:120–127
7. Pfitzner B, Steckhan N, Arnrich B (2021) Federated learning in a medical context: a systematic
literature review. ACM Trans Internet Technol (TOIT) 21(2):1–31
8. Kumar A, Sharma K, Sharma A (2022) MEmoR: a multimodal emotion recognition using
affective biomarkers for smart prediction of emotional health for people analytics in smart
industries. Image Vis Comput 123:104483
9. Sharma A, Sharma K, Kumar A (2022) Real-time emotional health detection using fine-tuned
transfer networks with multimodal fusion. In: Neural computing and applications, pp 1–14
10. Kumar A, Sharma K, Sharma A (2021) Genetically optimized Fuzzy C-means data clustering
of IoMT-based biomarkers for fast affective state recognition in intelligent edge analytics. Appl
Soft Comput 109:107525
11. Ju C, Gao D, Mane R, Tan B, Liu Y, Guan C (2020) Federated transfer learning for EEG
signal classification. In: 2020 42nd Annual international conference of the IEEE engineering
in medicine & biology society (EMBC). IEEE, pp 3040–3045
12. Li L, Fan Y, Tse M, Lin KY (2020) A review of applications in federated learning. Comput
Ind Eng 149:106854
13. Li T, Sahu AK, Talwalkar A, Smith V (2020) Federated learning: challenges, methods, and
future directions. IEEE Sig Process Mag 37(3):50–60
14. Park CY, Cha N, Kang S, Kim A, Khandoker AH, Hadjileontiadis L, Oh A, Jeong Y, Lee
U (2020) K-EmoCon, a multimodal sensor dataset for continuous emotion recognition in
naturalistic conversations. Sci Data 7(1):1–16
15. Koelstra S, Muhl C, Soleymani M, Lee J-S, Yazdani A, Ebrahimi T, Pun T, Nijholt A, Patras I
(2011) Deap: a database for emotion analysis; using physiological signals. IEEE Trans Affect
Comput 3(1):18–31
16. Dissanayake V, Seneviratne S, Rana R, Wen E, Kaluarachchi T, Nanayakkara S (2022) SigRep:
toward robust wearable emotion recognition with contrastive representation learning. IEEE
Access 10:18105–18120
17. Alskafi FA, Khandoker AH, Jelinek HF (2021) A comparative study of arousal and valence
dimensional variations for emotion recognition using peripheral physiological signals acquired
from wearable sensors. In: 2021 43rd Annual international conference of the IEEE engineering
in medicine & biology society (EMBC). IEEE, pp 1104–1107
18. Wang S, Wang J, Wang X, Qiu T, Yuan Y, Ouyang L, Guo Y, Wang F-Y (2018) Blockchain-
powered parallel healthcare systems based on the ACP approach. IEEE Trans Comput Soc Syst
5(4):942–950
19. Ranjan R, Sharma A (2020) Voice-controlled IoT devices framework for smart home. In:
Proceedings of first international conference on computing, communications, and cyber-
security (IC4S 2019). Springer Singapore, pp 57–67
20. Wei J, Yang X, Dong Y (2021) Time-dependent body gesture representation for video emotion
recognition. In: International conference on multimedia modeling. Springer, Cham, pp 403–416
Content Analysis of Twitter
Conversations Associated
with Turkey–Syria Earthquakes
Harkiran Kaur, Harishankar Kumar, and Abhinandan Singla
Abstract This research study performs a content analysis of Twitter discussions associated with the earthquakes that occurred in Turkey and Syria in 2023. The authors investigated the main themes and topics of the discussions expressed in the tweets. A dataset of tweets related to this topic was collected using relevant hashtags and keywords. The obtained data was analyzed using both manual and automated state-of-the-art methods. The main finding of this study is that the most common themes in the tweets were expressions of sympathy and solidarity, calls for help and financial support, and news updates about the earthquakes. As per this study, Twitter offers a worthwhile forum for people to express their state of mind and responses to natural catastrophes, and it may be used to disseminate information and assemble assistance in times of need.
Keywords Twitter data analytics · Content analysis · Turkey–Syria earthquake · Topic modeling · Topic detection
1 Introduction
Twitter has become a significant hub of news and current-events information and is one of the most popular social media platforms. Users may voice their ideas and share news about current events on this platform, which creates tons of data every day. Machine learning techniques are needed to analyze and glean relevant insights from the enormous volume of data collected on Twitter every day. Twitter has grown to be a significant venue for sharing news and viewpoints as a result of the boom in
H. Kaur (B)·H. Kumar ·A. Singla
Department of Computer Science and Engineering, Thapar Institute of Engineering and
Technology, Patiala 147001, Punjab, India
e-mail: harkiran.kaur@thapar.edu
H. Kumar
e-mail: hkumar1_be20@thapar.edu
A. Singla
e-mail: asingla50_be21@thapar.edu
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_35
461
462 H. Kaur et al.
social media use. This information offers a rich source that may be utilized for a
number of tasks, including sentiment analysis, subject modeling, and user profiling.
As a result, studying tweets has become a crucial technique for figuring out how
the general public feels about current events. Machine learning has been a prevalent
method for doing content analysis of tweets in the context of current affairs in recent
years. In recent years, machine learning has emerged as a popular technique for con-
ducting a content analysis of tweets in the context of current affairs. The research
in Twitter data analysis using machine learning has been ongoing for several years.
Recent studies have demonstrated the effectiveness of machine learning algorithms
in various tasks such as topic modeling, sentiment analysis, and user profiling. Topic
modeling is a popular machine learning task that aims to identify latent topics in a
given text corpus. Research in the said subject utilizes topic modeling techniques,
such as Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization
(NMF) [1], to identify topics in Twitter data. Sentiment analysis aims to determine
the emotional tone of a given text. Researchers have applied various machine learn-
ing algorithms, such as support vector machines (SVM), decision trees, and neural
networks, deep learning models [2], for sentiment analysis of Twitter data. User
profiling aims to extract demographic and psycho-graphic information about users
from their social media data. Researchers have applied machine learning techniques
such as clustering [3], association rule mining, and regression for user profiling on
Twitter data. This paper is organized into several sections, the literature review sum-
marizes recent studies and developments that applied machine learning techniques,
highlighting the state of research in this area and their use for different purposes.
The methodology section outlines the methods used in this study, which include data
collection, data pre-processing, and analysis. The results section presents the study’s
findings, including identified topics, and classified topics for some of the tweets. The
discussion section provides interpretation and analysis of the results, discusses the
significance and limitations of the study, and suggests directions for future research.
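Of the two topic-modeling techniques mentioned, NMF is simple enough to sketch from scratch. The following is a minimal Lee–Seung multiplicative-update implementation on a made-up 4×4 term-document matrix; a real analysis would use a library implementation on an actual tweet corpus, so every value here is illustrative:

```python
import random

def nmf(V, k, iters=500, seed=0):
    """Factor a non-negative matrix V (docs x terms) into W (docs x k)
    and H (k x terms) with Lee-Seung multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        WH = [[sum(W[i][t] * H[t][j] for t in range(k)) for j in range(m)]
              for i in range(n)]
        # H <- H * (W^T V) / (W^T W H)
        for t in range(k):
            for j in range(m):
                num = sum(W[i][t] * V[i][j] for i in range(n))
                den = sum(W[i][t] * WH[i][j] for i in range(n)) + 1e-9
                H[t][j] *= num / den
        WH = [[sum(W[i][t] * H[t][j] for t in range(k)) for j in range(m)]
              for i in range(n)]
        # W <- W * (V H^T) / (W H H^T)
        for i in range(n):
            for t in range(k):
                num = sum(V[i][j] * H[t][j] for j in range(m))
                den = sum(WH[i][j] * H[t][j] for j in range(m)) + 1e-9
                W[i][t] *= num / den
    return W, H

# Toy term-document counts with two clear "topics"
# (terms 0-1 in docs 0-1, terms 2-3 in docs 2-3).
V = [[3, 3, 0, 0], [2, 2, 0, 0], [0, 0, 3, 3], [0, 0, 2, 2]]
W, H = nmf(V, k=2)
err = sum((V[i][j] - sum(W[i][t] * H[t][j] for t in range(2))) ** 2
          for i in range(4) for j in range(4))
print(round(err, 3))  # reconstruction error; near zero for this rank-2 matrix
```

In practice the rows of H are inspected to read off the top terms of each discovered topic.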
2 Literature Review
This literature review aims to discuss recent studies that have applied machine learn-
ing techniques for Twitter data analysis in the context of current affairs and provide
insight into the current state of research. For instance, Hsu et al. (2023) in [1] proposed
a method based on NMF for identifying topics in Twitter data. The proposed method
outperformed other state-of-the-art topic modeling methods. Song et al. (2023) in
[4] proposed a method based on LDA for topic modeling of Twitter data related to
the US presidential election. The proposed method identified relevant topics such
as candidates’ policies, voting patterns, and election predictions. A strategy based
on LDA for topic modeling of Twitter data relating to the US-China trade war was
proposed by Jiang et al. (2022) in [5]. The suggested approach highlighted perti-
nent subjects including tariffs, negotiations, and market effects. An approach based
on NMF for topic modeling of Twitter data relevant to the COVID-19 pandemic
was proposed by Chen et al. in 2021 [6]. The suggested approach located pertinent
subjects including vaccine development, case counts, and government regulations.
Zhou et al. (2020) [7] proposed a method based on LDA for topic modeling of Twit-
ter data related to the Hong Kong protests. The proposed method identified relevant
topics such as police brutality, democracy, and human rights. One study by Lee et al.
(2022) in [8] used a machine learning model to predict the outcome of the US pres-
idential election based on Twitter data. The study discovered that the model could
predict the election result with a high degree of accuracy, pointing to the possibility
of utilizing Twitter data to forecast election outcomes. A deep learning model was
suggested for analyzing Twitter data pertaining to the Hong Kong demonstrations
in a paper by Chen et al. (2022) in [9]. The research discovered that the suggested
methodology could pinpoint important conversation points and keep track of how
the general population felt about the demonstrations. Machine learning techniques
were utilized by Das et al. (2022) in [10] to examine Twitter data pertaining to the
Black Lives Matter movement. According to the study, Twitter data may be utilized
to track public attitude toward the movement and distinguish between various points
of view on it. Goharian et al.’s paper from 2022 [11] used machine learning methods
to examine Twitter data collected during the COVID-19 epidemic. According to the
study, Twitter data may be utilized to track public opinion toward the epidemic and
spot newly developing pandemic-related issues. Yildirim et al. (2022) [12] utilized
machine learning techniques to examine Twitter data pertaining to the Syrian crisis.
According to the study, Twitter data may be used to track public opinion about the
conflict and spot newly developing themes connected to it. Machine
learning techniques were used to analyze Twitter data during the Black Lives Matter
protests in a different research by Zhu et al. (2021) in [13]. The study discovered
that Twitter data may be utilized to pinpoint the main conversation points and track
public opinion regarding the demonstrations. In a research by Mocanu et al. (2021) in
[14], the authors examined Twitter data pertaining to the COVID-19 epidemic using
machine learning techniques. According to the study, Twitter data may be utilized
to track public opinion toward the epidemic and spot newly developing pandemic-
related issues. Machine learning methods were utilized by Liu et al. (2021) [15] to
examine Twitter data associated with the US presidential election. The study discovered
that Twitter data might be used to forecast election results with a high degree of
accuracy. Yayla et al.'s
work from 2021, published in [16], suggested a machine learning-based method for
identifying hate speech on Twitter. The research revealed that the suggested method
might effectively identify hate speech, monitor it, and take action against it on the
platform.
3 Materials and Methods
The authors followed the seven-step model below to conduct the content analysis
of Twitter posts related to the Turkey–Syria earthquakes.
464 H. Kaur et al.
Fig. 1 Number of tweets posted date-wise for Turkey–Syria earthquakes
3.1 Data Collection
For this research, the authors used the Twitter dataset by querying Twitter for key-
words TurkeySyriaEarthquakes, Turkey earthquake, Syria earthquake, Turkey–Syria
earthquake, and hashtags #TurkeySyriaEarthquake2023, #turkey #earthquake, and
#syria #earthquake, using a web scraping technique. The authors retrieved
1,301,159 tweets posted over three weeks (February 06, 2023, until February 27,
2023), as presented in Fig. 1.
3.1.1 Data Pre-processing
The collected data has been cleansed by removing non-English tweets and duplicate
tweets, and by removing stop words, punctuation, URLs, user mentions, and hashtags.
The tweets have also been tokenized into phrases and converted to lowercase.
The dataset includes 1,301,159 raw tweets; during this pre-processing phase,
853,026 duplicate tweets and 70,106 non-English tweets were removed, leaving
378,027 processed tweets for the remaining steps (Table 1).
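The cleaning steps above can be sketched in a few lines. This is a minimal, stdlib-only illustration; the stop-word list, function names, and the omission of lemmatization are assumptions for illustration, not the authors' code:

```python
import re

# Illustrative stop-word list; a real pipeline would use a full list
# (e.g., NLTK's) and a lemmatizer on top of these steps.
STOP_WORDS = {"the", "a", "an", "in", "on", "of", "to", "did", "before",
              "and", "is", "at", "how", "for", "via", "from", "has"}

def preprocess(tweet: str) -> str:
    """Lowercase, strip URLs/mentions/hashtags/punctuation, drop stop words."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"[@#]\w+", " ", text)        # remove mentions and hashtags
    text = re.sub(r"[^a-z\s]", " ", text)       # remove punctuation and digits
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

def clean_corpus(tweets):
    """Deduplicate and preprocess a list of raw tweets, preserving order."""
    seen, out = set(), []
    for t in tweets:
        p = preprocess(t)
        if p and p not in seen:
            seen.add(p)
            out.append(p)
    return out
```

On the first sample tweet in Table 1, `preprocess` yields "western ambassadors evacuate turkey earthquake" (the table's version is additionally lemmatized to the singular "ambassador").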
Table 1 Pre-processing of tweets—sample data

Text: "Did Western Ambassadors evacuate before Turkey Earthquake?! https://t.co/wMdSmbqLFF" → Processed: "Western ambassador evacuate turkey earthquake"
Text: "Turkey-Syria earthquake death toll surpasses 50,000 https://t.co/6lMakpPZDx via @FRANCE24 #TurkeySyriaEarthquake2023" → Processed: "Turkey Syria earthquake death toll surpasses via"
Text: "In the two week deadly earthquakes hit southern Turkey and northern Syria, the focus has shifted from rescue to rehabilitation. https://t.co/nGLIzeWttT" → Processed: "Two week since deadly earthquake hit southern turkey northern syria focus shifted rescue rehabilitation"
Text: "#OutlookDecazine A journalist recalls his own, and other people's, experiences in #earthquake rescue operations in #Turkey Yusuf Erim #TurkeyQuake Read more at: https://t.co/OeH30zqd5N" → Processed: "Journalist recall people experience rescue operation yusuf erim read"
Text: "In the two week since deadly earthquakes hit southern Turkey and northern Syria, the focus has shifted from rescue to rehabilitation. https://t.co/nGLIzeWttT" → Processed: "two week since deadly earthquake hit southern turkey northern syria focus shifted rescue rehabilitation"
Text: "Syria-Turkey earthquake: how to help https://t.co/CUHMEWbKci https://t.co/EtVYpSBrRX" → Processed: "Syria Turkey earthquake help"
Text: "Earthquake Unveils Turkey's many ugly faces https://t.co/WIWI5e88h5" → Processed: "Earthquake unveils turkey many ugly face"
3.1.2 Exploratory Data Analysis
Outlier detection has been performed on this set of tweets; 11,830 tweets were
detected as outliers and removed. The tweet data has been analyzed using
descriptive statistics and word frequency analysis to understand the characteristics
of the data and to perform the content analysis of the tweets retrieved for the
Turkey–Syria earthquakes. This paper uses a word cloud for word frequency analysis.
In Fig. 2, words are displayed in sizes proportional to their number of occurrences
in the tweets: words such as death, magnitude, building, support, and rescue have
a higher frequency than the words presented in a smaller size.
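The frequency analysis behind the word cloud reduces to a token count. A minimal sketch, assuming the tweets are already preprocessed (the function name and `top_n` parameter are illustrative):

```python
from collections import Counter

def word_frequencies(processed_tweets, top_n=5):
    """Count word occurrences across processed tweets; a word cloud renders
    each word at a size proportional to these counts."""
    counts = Counter()
    for tweet in processed_tweets:
        counts.update(tweet.split())
    return counts.most_common(top_n)
```

Libraries such as `wordcloud` accept exactly such a frequency mapping to produce figures like Fig. 2.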
4 Topic Modeling
The authors have utilized unsupervised machine learning techniques such as Latent
Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), Non-negative Matrix
Factorization (NMF), and Short Text Topic Modeling (STTM) to identify the latent
Fig. 2 Word cloud for all the tweets retrieved for Turkey–Syria Earthquake
topics within the data. These topics have been represented as a set of related words
or phrases.
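Off-the-shelf implementations of these techniques exist (e.g., in scikit-learn and gensim). Purely as an illustration of one of the four, the sketch below implements NMF with the classical Lee–Seung multiplicative updates in plain NumPy; the iteration count, seed, and function signature are assumptions for illustration, not the configuration used in the paper:

```python
import numpy as np

def nmf_topics(V, k, iters=200, eps=1e-9, seed=0):
    """Minimal NMF via Lee-Seung multiplicative updates (Frobenius loss).
    V: nonnegative document-term matrix (docs x terms).
    Returns W (docs x k document-topic weights) and
            H (k x terms topic-word loadings), so that V ~ W @ H."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update topic-word loadings
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update document-topic weights
    return W, H
```

Each row of H can then be read off as a topic: its largest entries are the topic's top terms, which is how the "set of related words or phrases" per topic is obtained.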
5 Topic Interpretation
The identified topics have been interpreted by analyzing the top phrases associated
with each topic. This step assists in understanding the underlying themes and patterns
in the data retrieved for the said subject.
6 Implementing Ensemble Voting Classifier
The authors observed that the multiple topic modeling models have different
strengths and weaknesses; to leverage the strengths of each model and overcome
its weaknesses, an ensemble voting classifier was used. By combining the
predictions of multiple models, the ensemble voting classifier generates a more
accurate and robust prediction. These topics were then manually verified by the
authors. Table 2 presents the tweets, the topics detected for these tweets using
state-of-the-art models, and their comparison with the proposed ensemble voting
classifier.
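A hard-voting ensemble over the four topic models can be sketched as follows. The paper does not specify its tie-breaking rule, so this sketch falls back to a hypothetical model-priority order when no label wins a clear majority:

```python
from collections import Counter

# Hypothetical priority order, used only to break ties; it is an assumption,
# not a rule stated in the paper.
MODEL_ORDER = ["LDA", "LSA", "NMF", "STTM"]

def ensemble_vote(predictions):
    """predictions: dict mapping model name -> predicted topic label.
    Returns the majority label; ties are broken by MODEL_ORDER."""
    counts = Counter(predictions.values())
    best = max(counts.values())
    winners = {label for label, c in counts.items() if c == best}
    if len(winners) == 1:
        return winners.pop()
    for model in MODEL_ORDER:          # tie-break by model priority
        if predictions[model] in winners:
            return predictions[model]
```

For example, when three of the four models agree (as in the "Syria–Turkey earthquake help" row of Table 2), the majority label wins outright.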
Table 2 Comparative analysis of topics detected using state-of-the-art models and proposed ensemble voting classifier

Processed tweet text | LDA | LSA | NMF | STTM | Proposed ensemble voting classifier
Western ambassador evacuate Turkey earthquake | Damage and magnitude | Dead/alive news | Information about earthquake | Opinion | Damage and magnitude
Turkey–Syria earthquake death toll surpasses via | Damage and magnitude | Information about earthquake | Damage and magnitude | Damage and magnitude | Damage and magnitude
Earthquake death toll surpasses Turkey | Damage and magnitude | Information about earthquake | Damage and magnitude | Damage and magnitude | Damage and magnitude
Two week since deadly earthquake hit southern Turkey northern Syria focus shifted rescue rehabilitation | Aid, help, and relief | Dead/alive news | Damage and magnitude | Damage and magnitude | Damage and magnitude
Journalist recall people experience rescue operation yusuf erim read | Opinion | Aid, help, and relief | Dead/alive news | Aid, help, and relief | Aid, help, and relief
Strike info | Prayer and hope | Damage and magnitude | Aid, help, and relief | Information about earthquake | Prayer and hope
Syria–Turkey earthquake help | Aid, help, and relief | Dead/alive news | Support & sympathy from people | Support & sympathy from people | Support & sympathy from people
Earthquake unveils Turkey many ugly face | Support & sympathy from people | Dead/alive news | Information about earthquake | Political views | Support & sympathy from people
Fig. 3 Ensemble voting classifier results for Turkey–Syria earthquakes
7 Visualization
The topic results obtained through the aforementioned steps have been visualized
using tools such as word clouds and bar charts to gain insights from the data.
Figure 3 shows the topic category-wise frequency and the usage of certain words
in the tweets after applying the ensemble voting classifier.
8 Results and Discussions
This research study aims to analyze the tweet content posted for the Turkey–Syria
earthquakes. For this, the authors collected the tweets, preprocessed them using
various filtration functions, and then classified these discussions into their respective
topics using state-of-the-art topic modeling techniques and an ensemble voting
classifier applied to these models. Figure 4 represents the topic-wise distribution of
tweets after applying ensemble topic modeling. As a result of this classification, it
has been observed that there were 68,270 tweets related to "Aid, Help, and Relief",
109,452 tweets discussing the "Damage and Magnitude" of the earthquake, 19,482
tweets about "Dead/Alive News", 20,678 tweets providing "Information about
earthquake", 11,589 tweets highlighting the "International Views" about this incident,
32,733 tweets presenting the "Opinion" shared by users, 44,234 tweets providing
"Political Views", 27,468 tweets of "Prayer and Hope", 32,291 tweets classified as
"Support & Sympathy from People", and 11,830 tweets that were Unknown/Outliers.
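As a quick consistency check, the topic counts reported above can be converted into percentage shares; their sum recovers the 378,027 processed tweets:

```python
# Topic counts as reported in the results; shares are percentages of the
# processed corpus, rounded to one decimal place.
topic_counts = {
    "Aid, Help, and Relief": 68270,
    "Damage and Magnitude": 109452,
    "Dead/Alive News": 19482,
    "Information about earthquake": 20678,
    "International Views": 11589,
    "Opinion": 32733,
    "Political Views": 44234,
    "Prayer and Hope": 27468,
    "Support & Sympathy from People": 32291,
    "Unknown/Outliers": 11830,
}

total = sum(topic_counts.values())   # 378,027 processed tweets
shares = {t: round(100 * n / total, 1) for t, n in topic_counts.items()}
```

The largest category, "Damage and Magnitude", accounts for roughly 29% of all processed tweets.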
Fig. 4 Topic-wise distribution of tweets after ensemble topic modeling
Fig. 5 Word cloud providing insights into topic-wise themes for Turkey–Syria earthquakes
This section describes the results of applying the ensemble voting classifier to the
collection of tweets to identify topic categories. Figure 5 presents a word cloud of
the identified topic categories. A word cloud is a visual representation of the words
frequently used in a given text or collection of texts; here, it represents the words
used within each identified category. Used in this context, a word cloud provides a
quick overview of the identified topic categories: by highlighting the most often-used
terms, it gives an idea of the general themes and trends present in the collection of
tweets. The study was effective in classifying tweets, and the word cloud gives a
helpful overview of the results, which could also help in understanding how Twitter
users feel about certain issues and act in certain ways. This research has the potential
to benefit society in several ways. By analyzing tweets, researchers can gain insight
into public opinion, and decision-makers can use these insights to inform their actions
and policies, potentially leading to better outcomes for society; for example, by
identifying the main themes and topics for the Turkey–Syria earthquakes, emergency
response teams and policymakers can develop effective communication strategies
and mobilize support during crises. The study can also be used to disseminate
information and facilitate public engagement, which can help increase awareness of
and preparedness for future earthquakes. Overall, this work has the potential to
contribute to more effective disaster response and recovery efforts, ultimately
benefiting society as a whole.
Finally, it should be noted that using machine learning algorithms for Twitter data
analysis is not without challenges and limitations. Overfitting can mean that a model
does not generalize well to new data, and there are also issues of potential bias,
privacy concerns, and data ownership that researchers and practitioners must address.
It is therefore important to address these challenges and limitations by using
appropriate methods for model evaluation and selection, adopting transparent and
accountable data practices, and acknowledging and mitigating potential biases in
the data.
9 Conclusion
Twitter has grown to be a significant venue for sharing news and viewpoints as a
result of the boom in social media use. As a result, examining tweets has grown in
importance as a method for figuring out what the general population thinks and feels
about current events. Machine learning has been a popular method for undertaking
content analysis of tweets in recent years. In this literature review, the authors looked at
recent research that analyzed tweets about current events using machine learning. In
conclusion, topic modeling approaches have become a crucial tool for scholars when
analyzing Twitter data pertaining to current affairs. These studies offer insightful
information on the general public’s viewpoint on a range of significant problems by
identifying major subjects and attitudes stated by users. The analysis of this unique
source of real-time information utilizing topic modeling approaches is crucial as
Twitter’s popularity continues to rise.
References
1. Hsu CH, Liu HC, Chen ALP, Lai MK (2023) Non-negative matrix factorization for topic
modeling on twitter data. IEEE Trans Knowl Data Eng
2. Liu Y, Li H, Sun M (2023) Deep learning-based sentiment analysis of twitter data. IEEE Trans
Affective Comput
3. Zhang Y, Chen Y, Liu Z, Chen W (2023) A clustering-based method for user profiling on twitter
data. IEEE Trans Comput Soc Syst
4. Song M, Wu L, Zhang W (2023) Topic modeling of twitter data related to the US presidential
election using LDA. IEEE Trans Big Data
5. Jiang T, Sun J, Wang X (2022) Topic modeling of twitter data related to the US-China trade
war using LDA. IEEE Access
6. Chen Y, Wang X, Yu C (2021) Non-negative matrix factorization for topic modeling of Twitter
data related to the COVID-19 pandemic. Health Inf J
7. Zhou X, Liu B, Xiang X, Wu S, Zha H (2020) Topic modeling for Twitter data related to Hong
Kong protests. Inf Process Manage
8. Lee YH, Zhang Y, Kim JH (2022) Predicting election results using twitter data: a machine
learning approach. In: Proceedings of the 2022 IEEE international conference on computational
intelligence and virtual environments for measurement systems and applications, pp 109–114
9. Chen Y, Fu J, Xiao J (2022) Deep learning model for analyzing twitter data on the Hong
Kong protests. In: Proceedings of the 2022 IEEE international conference on data mining, pp
1209–1214
10. Das S, Vaddadi S, Chakraborty T (2022) Analyzing twitter data on black lives matter: a machine
learning approach. In: Proceedings of the 2022 IEEE international conference on big data, pp
2666–2671
11. Goharian N, Boussaid O, Srinivasan P (2022) Analyzing twitter data during the COVID-19
pandemic: a machine learning approach. In: Proceedings of the 2022 IEEE international
conference on healthcare informatics, pp 1–6
12. Yildirim Y, Bayram G, Akcay O (2022) Twitter data analysis of the Syrian conflict: a machine
learning approach. In: Proceedings of the 2022 IEEE international conference on communica-
tions, pp 1–6
13. Zhu X, Liu S, Gao F (2021) Analyzing twitter data during the black lives matter protests: a
machine learning approach. In: Proceedings of the 2021 IEEE international conference on data
mining, pp 1209–1214
14. Mocanu D, Perra N, Gonçalves B (2021) Monitoring the COVID-19 pandemic in real time
using twitter data analysis and machine learning. In: Proceedings of the 2021 IEEE international
conference on big data, pp 2518–2525
15. Liu S, Zhu X, Gao F (2021) Predicting US presidential election results using twitter data: a
machine learning approach. In: Proceedings of the 2021 IEEE international conference on big
data, pp 2065–2070
16. Yayla E, Altun Y, Yildiz H (2021) A machine learning-based approach for detecting hate
speech on twitter. In: Proceedings of the 2021 IEEE international conference on big data, pp
2096–2101
Transition from Traditional Insurance
Sector to InsurTech: Systematic Analysis
and Future Research Directions
Tamanna Kewal and Charu Saxena
Abstract InsurTech, which takes its cues from the more well-established idea of
"FinTech," is the term used to describe the use of technology to increase efficiency
and savings in underwriting, risk pooling, and claims management within the present
insurance paradigm. A survey of the scientific literature on InsurTech is included
in this study. This review paper starts with an overview of the journey from Insur-
ance 1.0 to Insurance 4.0 and concludes with an analysis of articles chosen to find
InsurTech’s emerging themes. This research comprises 47 Scopus articles, which
have been analyzed to identify themes in InsurTech research. Thematic analysis has
aided in the identification of significant research clusters in InsurTech research. There
are eight themes highlighted: InsurTech and the technologies behind it, risk manage-
ment, performance evaluation, insurer adoption, insured adoption, personalization of
insurance, P2P insurance, and legal, ethical, and regulatory issues in InsurTech. The
number of studies in this area has risen only in the last two years, i.e., post-pandemic.
The most popular research topic was the technologies supporting the emergence of
InsurTech. This study aims to enhance an understanding of insurance technology
advancements and associated topics.
Keywords InsurTech · Blockchain · Artificial intelligence · IoT · Big data ·
Insurance 4.0 · Insurance industry
1 Introduction
Digitization has created new products and processes across all industries which have
benefited both the provider and receiver [1]. The insurance sector has transformed into
InsurTech due to the presence of technologies like artificial intelligence, blockchain,
T. Kewal (B) · C. Saxena
University School of Business, Chandigarh University, Mohali, Punjab, India
e-mail: tamannakewal04@gmail.com
C. Saxena
e-mail: charu.e8966@cumail.in
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_36
473
474 T. Kewal and C. Saxena
internet of things, big data, and cloud computing [2, 3]. InsurTech has emerged from
"fintech" and refers to technology-based insurance solutions across the value chain.
Fintech has five categories, namely mobile payments and transfers, deposits and
investments, budget and financial planning, insurance, and borrowing [4]. Unlike
the other categories, InsurTech has gained the interest of researchers only in the last
few years, which explains the limited number of articles in this domain. As
a result, both incumbents and new market entrants have a huge opportunity to use
information technology to revolutionize the traditional insurance sector [5]. In this
context, there has been a surge in the number of InsurTech start-up firms that rely on
accessibility, customization, and customer satisfaction to reach a wide audience. By
building new technology solutions that are complemented by wholly new business
models, InsurTech start-ups are speeding up transformation in the insurance sector.
Both life and general insurance firms are witnessing the effects of InsurTech. Tradi-
tional insurers have realized the threat of disruption in the insurance business and are
investing in or acquiring start-ups to profit from their advances [6]. InsurTech is the
application and usage of information technology by one or more established or new
commercial entities to supply insurance-specific solutions [7]. InsurTech is quickly
establishing itself as a game-changing potential for insurers to innovate, enhance the
relevance of their offers, and expand. The main contributions of this study are:
1. This study evaluates the existing literature on the journey of the insurance
industry's transformation to InsurTech, outlining its evolution from the foundation
of the first insurance firm in 1848 to the present day.
2. The eight broad themes in InsurTech research are identified and discussed in this
study.
2 Evolution of InsurTech
2.1 Insurance 1.0
Before the Christian era, the notion of insurance was employed by Chinese
and Babylonian traders to reduce the hazards of river shipping [8]. The concept
of grouping traders was also introduced: traders whose goods were transported in
the same shipment were charged a premium together, so that if any of their
shipments suffered damage, the loss could be compensated from the premiums
collected [8]. Home insurance and accident insurance were also introduced.
Increasing railway fatalities necessitated the establishment of the first accident insur-
ance business in 1848 in England [9,10]. The industrial revolution affected produc-
tion capacity, transportation systems, workforce structure, and the types of hazards.
All these economic changes signaled the start of the Insurance Industry Revolution.
Transition from Traditional Insurance Sector to InsurTech: Systematic 475
2.2 Insurance 2.0
In this period, some discoveries led to the second industrial revolution, e.g., the
introduction of electricity, telegraph, changes in the mode of transportation, and
communication, the concept of mass production, division of labor, and new raw
materials [11,12]. The British government started providing insurance for old age,
illness, and unemployment under the National Insurance Act of 1911 [8,13]. Medical
insurance, accidental insurance, and old-age pension systems were also offered by
the German Government [14].
2.3 Insurance 3.0
The third industrial revolution started with the invention of computer systems and
the incorporation of computer-based applications into the organization. As a result,
organizational efficiency improved due to reduced cost and time. Integration of
computer-based management systems and the insurance industry reduced the cost of
distribution channels. Since the incorporation of Acord (Association for Cooperative
Operations Research and Development), an American Standard-Setting Association
for Insurance Industry in 1970, most agents switched to computer-based systems
[15]. Acord created a single form that was accepted and utilized by many of those
insurance firms, lowering the costs of insurance distribution. Acord also aided in the
creation of EDI standards. Companies adopted automation and created proprietary
systems that were installed in the offices of their agents. It became possible for agents
to eliminate the proprietary terminals and operate via one system [8].
2.4 Insurance 4.0
The fourth industrial revolution was a result of the development of telecommunica-
tion networks. Internet’s introduction signaled the conclusion of the third industrial
revolution. All industry’s business models were quickly altered as a result of the surge
of digitalization in every industry. AI, IoT, big data, and cloud computing are driving
the insurance industry’s transformation. Wearables, smart houses, self-driving cars,
and voice-assisted electronic gadgets are examples of game-changing advancements
in the twenty-first century. Insurance 4.0 is the merger of new technology with the
insurance industry. The shift from standard to smart insurance contracts is part of a
drive to digitize the whole value chain to give clients better, more personalized, and
hassle-free service.
Fig. 1 SLR process
3 Methodology
This study aims to examine available publications to understand the broad themes in
the publication of Insurance technology research and to provide researchers and prac-
titioners with information and potential perspective on InsurTech. We began by doing
a systematic assessment of the literature to determine the initial question to study.
The technique of a literature study is beneficial for establishing the research issue.
This study’s goal is to highlight the themes in InsurTech research. To accomplish this
goal, we first looked for indications that a literature review was necessary. The Scopus
database was utilized to search for papers that discuss InsurTech. In this early stage,
the search for articles was not restricted to any time frame. The keywords used for
searching publications were “InsurTech OR insurance technology OR e-insurance.”
About 171 Scopus publications were found after the search. In Fig. 1, the process is
illustrated. The exclusion criteria resulted in the removal of 118 publications. The
publication record was irregular before the year 2012; in some years there was
not even a single article related to InsurTech. Hence, the researchers selected
articles from the past 10 years (2012–2021).
The selection of documents for analysis began with the selection of the title and
abstract, followed by the narrowing of the articles using exclusion criteria, and finally,
selected articles were synthesized using thematic analysis.
4 Results
The outcomes of the thematic analysis are described in this section. Thematic analysis
is primarily characterized as a process for detecting, analyzing, and reporting themes
within data as an independent qualitative descriptive approach [16]. It is a technique
for reviewing and organizing the data according to their pattern and then naming
those themes. We present the topics discussed in InsurTech on a yearly basis, along
with their underlying themes.
4.1 Year-Wise Analysis
A total of 47 pertinent InsurTech research articles were examined. The 10-year
publishing period ran from 2012 through 2021. Between 2012 and 2021, there is
an irregularity in the publishing trend of InsurTech papers. The topic gained the
attention of researchers in 2018, and since then there has been an increase in
publications related to InsurTech. Until 2019 the number of studies in this field
was very small; the rising trend of the past two years shows the growing interest
of researchers in insurance technology. According to the year-wise analysis, most
articles were published in 2020. A comparative analysis of the papers published
annually shows that research on InsurTech and its related technologies was the
most often discussed topic, i.e., 13 papers. Research on consumer adoption of
insurance technology began in 2012, with a total of eight papers. A study on the
shift from the conventional model to the InsurTech model began in 2013.
Furthermore, studies on peer-to-peer insurance began to surface around 2020,
which is also the least-discussed theme of InsurTech. 2020 is the only year that
featured articles from all of the themes. Table 1 illustrates this.
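The figures quoted in this year-wise analysis can be cross-checked against the theme totals of Table 1 in a few lines (the dictionary below simply restates the table's totals):

```python
# Theme totals as reported in Table 1; a quick consistency check on the
# 47-article corpus and the most-discussed theme.
theme_totals = {
    "InsurTech and the technologies behind it": 13,
    "Risk management": 3,
    "Personalization of insurance": 3,
    "Legal, regulatory, ethical issues": 6,
    "Performance evaluation": 7,
    "P2P insurance": 2,
    "Adoption by insurers": 5,
    "Customer behavior": 8,
}

total_articles = sum(theme_totals.values())          # all reviewed articles
top_theme = max(theme_totals, key=theme_totals.get)  # most-discussed theme
```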
4.2 Most Popular Publications
Geneva Papers on Risk and Insurance: Issues and Practice published five articles,
three in 2020 and two in 2021, making it the most popular journal in InsurTech
research. Three articles were published in Risks, one in 2020 and two in 2021;
Advances in Intelligent Systems and Computing published the same number of
papers. Two articles were published in the Journal of Internet Banking and
Commerce, one in 2012 and the other in 2016.
4.3 Articles Classified by Themes
This section discusses the subjects of InsurTech research. Table 2 shows how the 47
articles are organized into eight topics; the study objectives are used to group the data.
The majority of the articles in the first category explored the idea of InsurTech as
well as the technology related to it. Thirteen articles on this subject explore tech-
nology related to InsurTech, including Blockchain, Artificial Intelligence, and IoT.
Table 1 Literature review results

Themes (in column order): InsurTech and the technologies behind it | Risk management | Personalization of insurance | Legal, regulatory, ethical issues | Performance evaluation | P2P insurance | Adoption by insurers | Customer behavior
2012: 1
2013: 2
2014: 1
2016: 1
2018: 3, 1, 1
2019: 2, 2
2020: 5, 2, 1, 1, 2, 1, 3, 3 (one count per theme, in column order)
2021: 5, 1, 2, 4, 1, 2
Total: 13, 3, 3, 6, 7, 2, 5, 8
Table 2 Articles grouped by research themes

S. no. | Research theme | Author and year
1 | InsurTech and the technologies behind it (AI, IoT, Blockchain, Big Data) | [2, 6, 29–39]
2 | Legal, regulatory, and ethical issues | [40–45]
3 | Customer behavior | [46–53]
4 | Performance evaluation | [2, 12, 54–58]
5 | Adoption of InsurTech by insurers |
6 | Personalization of insurance |
7 | Risk management |
8 | P2P insurance |
Then there are seven articles about evaluating the performance of insurance prod-
ucts and companies. Eight articles investigate customer uptake of innovative insur-
ance products and InsurTech businesses. The least number of research papers are on
peer-to-peer insurance.
5 Discussion
5.1 Research on InsurTech and the Technologies Behind It
InsurTech research begins with an understanding of insurance technology
innovations and their implications for the existing value chain [7]. Sophisticated
InsurTech innovations have benefited insurers by lowering transaction costs,
expanding into new markets, and providing more client-tailored coverage, and have
also made clients' lives easier by removing intermediaries, decreasing policy costs,
enabling online price comparisons, and speeding up claim settlement [17]. Artificial
intelligence, big data, IoT, and blockchain are the driving
forces underpinning the insurance industry’s transformation [18]. Innovative solu-
tions based on IoT like user-based insurance, smart cars, smart homes, wearables,
ride-sharing solutions, linked health, data capturing, and remote monitoring have
been game changers for traditional insurance firms but have also raised privacy
concerns, as some customers are willing to give up some privacy for better services,
while others view these IoT-based products as serious intrusions [19]. Blockchain
has also piqued the interest of researchers in recent years; although it is in its early
stages, once implemented it can disrupt the traditional insurance business model
[20]. Through the security of information, a decrease in administrative and trans-
action expenses, as well as a new level of information transparency and precision
with simpler access to all parties in the insurance contract, this technology offers a
wide range of potential applications in the insurance industry [21]. This technology
offers a solid digital platform for quicker and more secure transactions, more trans-
parency, and less risk [22]. A blockchain network connects various devices and mobile
applications, speeds up insurance processes, and helps to achieve accuracy in
transactions [23]. Artificial intelligence is another technology that is projected to be
the top trend in the coming years, as increased adoption of AI solutions is expected
to lower the overhead expenses of banking and insurance services [24]. Worldwide,
insurers are utilizing artificial intelligence to automate procedures and jobs including
detecting fraud, underwriting, personalizing policy, reviewing accident claims, and
sparing insured drivers from a laborious human evaluation process following an
accident [25]. Based on the successful experience of Huize Insurance in China,
other InsurTech businesses operating in the same environment should concentrate on
developing specialized insurance products, fusing big data and users, portraying and
analyzing users’ risk preferences, and precisely calculating adequate premiums using
AI [26]. Big data predictive analytics is another technology that aids the marketing
team in better understanding and analyzing customer behavior so that they may base
their policy recommendations on such patterns [27]. Along with the advantages of
using these technologies to build InsurTech businesses, privacy and security issues
have also received considerable research attention [28].
5.2 Research on Performance Evaluation
This section of the research focuses on studies on innovations in the insurance
business and their effects on company performance. The insurance sector saw the
emergence of InsurTech innovations, which were technology-driven and claimed
to improve performance and save time and costs. Focusing on the public property
and casualty insurance market in the USA, it was observed that insurance compa-
nies were unable to take advantage of technological advancements to increase their
productivity. The relative ranking of efficiency across enterprises was generally
consistent over time, while the gap between efficient and less efficient firms significantly
widened [29]. Another study by the same authors examined the technology
initiatives taken by insurance companies and showed that there is still plenty of room
for development [30]. However, InsurTech has given positive news about the growth
of Chinese insurance businesses’ profits. Positive effects are shown for the Chinese
insurance sector when studied along three dimensions (liability side, asset side,
and risk-taking behavior), indicating that InsurTech is boosting investment and profitability
[31]. Another technique for assessing InsurTech innovation is combining
indicators related to three dimensions: operations and management, technological
level, and user experience [2]. This method considers the inputs that pertain to both
parties in InsurTech and can help in providing a better and more transparent way
of assessing InsurTech. A mixed model based on the Balance Scorecard and the
DEMATEL technique has also been used to identify the key indicators for improving
the performance of insurance websites in Iran [32]. Studies that focus on a single
product rather than analyzing the insurance industry as a whole make the premise
Transition from Traditional Insurance Sector to InsurTech: Systematic 481
that different products should be evaluated based on individual factors affecting them
and using distinct methodologies [33].
5.3 Research on Customer Behavior
The adoption of any new invention is based on three pillars: first, whether the govern-
ment provides appropriate infrastructure, second, whether organizations are prepared
to adopt it, and third, whether customers are ready to try it [34]. Because technologies
and innovations are designed for customers’ better experience and convenience, this
section reviews studies that evaluated customer perceptions around various InsurTech
advancements [35]. Research done in Iran to determine e-readiness to adopt auto
insurance at the consumer level suggested that the technology should be compatible
and easy to use [34]. Another study conducted in Russia also states the main concern
behind the low adoption of digital insurance is the risk of loss of personal data and
its leakage [36]. Customers have unfavorable opinions of health-tracking insurance
applications because they believe the costs in terms of money, emotions, and
functionality do not justify the benefits [37]. In the insurance industry, life insurance
policies were more popular among customers than non-life policies [38]. Wearables
have become popular recently, and their adoption among sportspersons has been
studied and found to depend on their price, experience, and perception [39].
5.4 Research on Legal, Regulatory, Ethical Issues
in InsurTech
InsurTech's growing popularity has raised legal, regulatory, and ethical issues concerning
the value chain, the technologies behind it, and payment methods. Several researchers
questioned whether existing norms and regulations were adequate for new InsurTech
entrants or if a distinct framework was required. Discussing how to govern the
emerging InsurTech organizations, numerous researchers concluded that it will be
best suited for InsurTech firms to apply the current rules and legislation to their
business models rather than introducing new regulations, thereby avoiding significant
regulatory changes. Research in the context of European law on digital insurance
intermediaries, namely insurance comparison websites, P2P insurance, and robo-advisors,
examined whether the established regulatory structure is capable of regulating
them or whether new rules are required; it concluded that the current law, when
applied to the distribution of insurance products by intermediaries, is sufficient to
deal with almost all the issues faced by InsurTech [40]. Similar conclusions were made in another study
which argued that current InsurTech activities can be carried on without any need
for major regulatory changes in the law until the Solvency II directives are no longer
needed in the European Union insurance industry [41]. But these findings were the
opposite in the case of Iranian law: for better regulation and protection of the e-insured,
the existing rules are not adequate and need to be amended to suit the recent
innovations made in the insurance environment [42]. Big data analytics and artificial
intelligence have already been shown to be beneficial to the insurance sector, but as
their use grows, so do perils and moral dilemmas. The ethical challenges that have
emerged throughout the insurance value chain with the advent of AI and Big data are
the subject of a study conducted in the context of the European Insurance Industry
[43]. Concerns about whether it will be technically possible and legally acceptable
to incorporate virtual currencies like Bitcoin or Ether as a valid payment method for
smart contracts have also been raised since their introduction and rising popularity,
but until the Solvency II directives issued by European Law give them a legal status
similar to other functional currencies, it is only a vision for the future [44].
5.5 Research on the Adoption of InsurTech by Insurance
Companies
The transition from the traditional insurance industry to InsurTech has not been
smooth. Several obstacles were encountered, and the insurance industry took a long
time to transform. The use of e-commerce in insurance was justified by the numerous
benefits [45], but it nonetheless encountered challenges during its early implementa-
tion years [46]. Initially, it was found that a lack of understanding about the benefits
of employing electronic services was a barrier to e-insurance adoption; however,
if firms embrace knowledge management strategies, it will undoubtedly aid in its
implementation [47]. Then, there is also the question of IT governance in InsurTech
firms [48]. Covid-19 has had a significant role in the rise of InsurTech technologies
and their corporate acceptance. One study assessed the elements influencing
InsurTech acceptance in the post-pandemic period and provided a model based
on the diffusion of innovation theory that determined InsurTech adoption throughout
the value chain [49].
5.6 Research on Risk Management
The insurance industry has enthusiastically embraced loss estimation tools created
by financial engineers for managing financial risks, cyber risks, operational risks,
and technological risks. In the face of catastrophic occurrences, the insurance industry has
undoubtedly helped customers in wealthy economies such as the USA, Europe, and Japan
to manage their financial risk, but there is still more work to be done in developing
countries [50]. Research on Colombian InsurTech firms' risk management policies
revealed that none of the investigated InsurTech businesses possessed the necessary
understanding of risk management and also lacked certification in risk assessment
and its application [51]. Another research investigated possible cyber risks using the
insurance associated with IoT devices as an example and proposed a quantitative
approach for evaluating the risks associated with IoT health insurance that can be
readily adapted to other advances in the insurance industry [52].
5.7 Research on Personalization of Insurance
Personalization techniques enable insurance distributors to cut inquiry expenditures
while tailoring the given insurance policy to the demands, characteristics, risks,
and conditions of each customer [53]. There is a small body of research on this
topic that raises the question of what InsurTech is attempting to personalize by
taking the example of telematics in car insurance and discussing the consequences
of insurance personalization on society and how the relationship between insurer
and insured has changed post-personalization [53]. Another study, concerning
the European Union insurance industry, highlights the topic of wrong personalization,
discusses the probable elements behind erroneous personalization, and argues
that insurance providers should face adverse repercussions for it [54].
5.8 Research on P2P Insurance
The concept of peer-to-peer insurance is the pooling of risks among a large number
of insured people under one contract, and there is a shared investment fund where
premiums from all policyholders are collected and payments are distributed to
claimants [55]. Researchers' interest in the P2P insurance model has emerged only
recently, as there are just two articles on the subject. One of them embraces the P2P
model by comparing it to traditional insurance and addressing the advantages of the
former over the latter, as well as testing the hypothesis of stated advantages using
quantitative models [56]. The other article, while acknowledging the benefits of the
P2P model, also points out its shortcomings by highlighting regulatory concerns that
must be addressed in the European Union insurance industry [57].
6 Conclusion
Given the lack of empirical research on InsurTech and the issue’s novelty, we tried to
broaden our understanding of the field. This study examined 47 InsurTech research
publications published over 10 years. The review took a multidisciplinary approach,
looking at literature on InsurTech studies from various fields. InsurTech research
was divided into themes based on the results of the literature review: research on
InsurTech and the technologies behind it, risk management, performance evaluation,
insurer adoption, insured adoption, personalization of insurance, P2P insurance, and
legal, ethical, and regulatory issues in InsurTech. The most prevalent study subject
has revolved around the technologies driving the growth of InsurTech. It has been
observed through a trend of publications that scholars have been more interested in
this field of study since the pandemic hit. According to papers on the topic, InsurTech
systems are designed to benefit providers rather than users; major privacy concerns
remain unaddressed and are one of the main reasons behind low adoption rates.
Future research should look into solving privacy issues in InsurTech from the stand-
point of the customer, including trust, relative benefits, and motives. Furthermore,
InsurTech has practical obstacles in terms of acceptance, regulatory aspects, ethical
issues, as well as security risks that threaten user involvement with novel insurance
products. To address this e-risk issue and improve security management, a researcher
created a framework based on the Bayesian Belief Network (BBN) model to assess
the risk in terms of money linked with e-commerce transactions that result from a
security breach and thereby aid the development and pricing of InsurTech products
[58]. Regulatory and ethical issues concerning InsurTech have so far been discussed
only in the framework of European Union insurance law; researchers in other nations
should also address and resolve them. The study on the
P2P insurance broker model is also restricted; since we are still in the early stages of
peer-to-peer insurance, this concept should be investigated further in future studies
as well.
7 Future Research Directions and Limitations
This study aims to enhance an understanding of insurance technology advancements
and associated topics. The selection of published papers and proceedings can serve as
a resource for InsurTech research to access high-quality material. Future research can
utilize this analysis as a starting point to better comprehend InsurTech. As the insur-
ance industry becomes more digital, future research must investigate the impact of
new-age media on InsurTech adoption. The influence of demographic disparities on
InsurTech usage has also been overlooked in InsurTech studies, which can be studied
in the future. For this study, only the electronic database Scopus was used to search
for articles for the systematic literature review. Future research can incorporate studies
from other databases too. Only journal articles and conference papers were examined for
this review; future research can include other types of sources as well.
Future studies can employ other terms such as "digital insurance," "smart insurance,"
and many other concepts that are closely connected to InsurTech. This study was
limited to a 10-year time frame due to irregularities in past publications; therefore,
future studies can incorporate studies published before this period. Notwithstanding
these limitations, the analysis of this review will be useful to insurers, researchers,
and academics globally.
References
1. Nambisan S, Wright M, Feldman M (2019) The digital transformation of innovation and
entrepreneurship: progress, challenges and key themes. Res Policy 48(8):103773
2. Xu X, Zweifel P (2020) A framework for the evaluation of InsurTech. Risk Manag Insur Rev
23(4):305–329. https://doi.org/10.1111/rmir.12161
3. Eckert C, Osterrieder K (2020) How digitalization affects insurance companies: overview and
use cases of digital technologies. Zeitschrift für die gesamte Versicherungswiss 109(5):333–360
4. Young E (2019) Global FinTech adoption index 2019. [Online]. Available: https://www.ey.
com/en_gl/ey-global-fintech-adoption-index
5. Puschmann T (2017) Fintech. Bus Inf Syst Eng 59(1):69–76
6. Gerwald F, Dorcak P, Markovic P (2021) The influence of insurtechs on traditional insurance
operations. In: 15th International conference liberec economic forum 2021, pp 551–558
7. Stoeckli E, Dremel C, Uebernickel F (2018) Exploring characteristics and transformational
capabilities of InsurTech innovations to understand insurance value creation in a digital world.
Electron Mark 28(3):287–305
8. Trenerry CF (2009) The origin and early history of insurance: including the contract of bottomry.
Lawbook Exchange
9. Nicoletti B (2021) Insurance 4.0 benefits and challenges of digital technology
10. History of insurance: modern insurance. cpb-us-w2.wpmucdn.com/blogs.baylor.edu/dist/a/
6818/files/2013/12/History-of-insurance-11gcwej.pdf
11. Roy A (2017) The fourth industrial revolution
12. Engelman R (2022) The second industrial revolution, 1870–1914—US history scene. https://
ushistoryscene.com/article/second-industrial-revolution/. Accessed 3 Feb 2022
13. Heller M (2008) The national insurance acts 1911–1947, the approved societies and the
prudential assurance company. Twent Century Br Hist 19(1):1–28
14. Hennock EP (2007) The origin of the welfare state in England and Germany, 1850–1914: social
policies compared. Cambridge University Press
15. Nelson ML, Shaw MJ, Qualls W (2005) Interorganizational system standards development in
vertical industries. Electron Mark 15(4):378–392
16. Vaismoradi M, Turunen H, Bondas T (2013) Content analysis and thematic analysis: Implica-
tions for conducting a qualitative descriptive study. Nurs Heal Sci 15:398–405. https://doi.org/
10.1111/nhs.12048
17. Koprivica M (2018) Insurtech: challenges and opportunities for the insurance sector. In: 2nd
International scientific conference ITEMA, pp 619–625
18. Püttgen F, Kaulartz M (2017) Insurance 4.0: the use of blockchain technology and of smart
contracts in the insurance sector. ERA Forum 18(2):249–262
19. Acquisti A, John LK, Loewenstein G (2013) What is privacy worth? J Legal Stud 42(2):249–
274. https://doi.org/10.1086/671754
20. Popovic D et al (2020) Understanding blockchain for insurance use cases. Br Actuar J 25:1–23.
https://doi.org/10.1017/S1357321720000148
21. Njegomir V, Demko-Rihter J, Bojanić T (2021) Disruptive technologies in the operation of
insurance industry. Teh Vjesn 28(5):1797–1805
22. Shokeen J, Rana C, Rani P (2021) A green 6G network era: architecture and propitious
technologies. In: Lecture notes on data engineering and communications technologies (ICDAM),
vol 54, pp 59–76. https://doi.org/10.1007/978-981-15-8335-3_4
23. Chakravaram V, Ratnakaram S, Agasha E, Vihari NS (2021) The role of blockchain technology
in financial engineering. In: Lecture notes electrical engineering, vol 698, pp 755–765. https://
doi.org/10.1007/978-981-15-7961-5_72
24. Chakravaram V, Ratnakaram S, Vihari NS, Tatikonda N (2021) The role of technologies on
banking and insurance sectors in the digitalization and globalization era—a select study. Adv
Intell Syst Comput 1245:145–156
25. Hsu H-H, Huang N-F, Han C-H (2020) Collision analysis to motor dashcam videos with YOLO
and mask R-CNN for auto insurance. In: Proceedings of international conference on intelligent
engineering and management, ICIEM 2020, pp 311–315
26. Jing T (2021) Research on the development of internet insurance in China—based on the
exploration of the road of Huize insurance. In: E3S web of conferences, vol 235
27. Ratnakaram S, Chakravaram V, Vihari NS, Vidyasagar Rao G (2021) Emerging trends in the
marketing of financially engineered insurance products. Adv Intell Syst Comput 1270:675–684.
https://doi.org/10.1007/978-981-15-8289-9_65
28. Lin L, Chen C (2020) The promise and perils of insurtech. Singapore J Leg Stud 2020:115–142.
https://doi.org/10.2139/ssrn.3463533
29. Lanfranchi D, Grassi L (2021) Translating technological innovation into efficiency: the case
of US public P&C insurance companies. Eurasian Bus Rev 11(4):565–585. https://doi.org/10.
1007/s40821-021-00189-7
30. Lanfranchi D, Grassi L (2021) Examining insurance companies’ use of technology for
innovation. Geneva Pap Risk Insur Issues Pract. https://doi.org/10.1057/s41288-021-00258-y
31. Wang Q (2021) The impact of insurtech on Chinese insurance industry. Procedia Computer
Science 187:30–35. https://doi.org/10.1016/j.procs.2021.04.030
32. Beigzadeh N, Sajedinejad A (2014) Providing key indicators for evaluating the e-business
context for improving performance of insurance companies in Iran
33. Rutskiy V et al (2020) Development of e-insurance through market institutions: the example
of digital compulsory third-party motor insurance. Adv Intell Syst Comput 1294:836–843.
https://doi.org/10.1007/978-3-030-63322-6_71
34. Bromideh AA (2012) Factors affecting customer e-readiness to embrace auto e-insurance in
Iran. J Internet Bank Commer 17(1)
35. Garbairovai M, Bachanovai PH (2019) Purchasing behavior of e-insurance consumers. In:
Proceedings of the 33rd international business information management association confer-
ence, IBIMA 2019: education excellence and innovation management through vision 2020, pp
3139–3152
36. Maslova L, Ilina A (2020) Digital transformation of Russian insurance companies. In: CEUR
workshop proceedings, vol 1–2570
37. Talonen A, Mähönen J, Koskinen L, Kuoppakangas P (2021) Analysis of consumers’ negative
perceptions of health tracking in insurance—a value sacrifice approach. J Inf Commun Ethics
Soc. https://doi.org/10.1108/JICES-05-2020-0061
38. Gramegna A, Giudici P (2020) Why to buy insurance? An explainable artificial intelligence
approach. Risks 8(4):1–9. https://doi.org/10.3390/risks8040137
39. Saliba B, Spiteri J, Cortis D (2021) Insurance and wearables as tools in managing risk in sports:
determinants of technology take-up and propensity to insure and share data. Geneva Pap Risk
Insur Issues Pract. https://doi.org/10.1057/s41288-021-00250-6
40. Marano P (2019) Navigating insurtech: the digital intermediaries of insurance products and
customer protection in the EU. Maastrich J Eur Comp Law 26(2):294–315. https://doi.org/10.
1177/1023263X19830345
41. Ostrowska M (2021) Regulation of InsurTech: is the principle of proportionality an answer?
Risks 9(10):185. https://doi.org/10.3390/risks9100185.
42. Bagheri P, Forushani ML (2016) E-insurance law and digital space in Iran. J Internet Bank
Commer 21(1)
43. Mullins M, Holland CP, Cunneen M (2021) Creating ethics guidelines for artificial intelligence
and big data analytics customers: the case of the consumer European insurance market. Patterns
2(10). https://doi.org/10.1016/j.patter.2021.100362
44. Zgraggen RR (2019) Smart insurance contracts based on virtual currency: legal sources and
chosen issues. In: PervasiveHealth: pervasive computing technologies for healthcare, pp 99–102
45. Eling M, Lehmann M (2018) The impact of digitalization on the insurance value chain and
the insurability of risks. Geneva Pap Risk Insur Issues Pract 43(3):359–396. https://doi.org/10.
1057/s41288-017-0073-0
46. Heydari NH, Behestani S, Bahadori P (2013) Investigation of electronic maturity level of
insurance industry in Iran. Middle East J Sci Res 14(11):1539–1549
47. Mehrabani SE, Shajari M (2013) Knowledge management practices and implementation of
E-insurance. In: Proceedings—2013 international conference on informatics and creative
multimedia, ICICM 2013, pp 186–190. https://doi.org/10.1109/ICICM.2013.39
48. Uyun A, Sekarhati DKS, Amastini F, Nefiratika A, Shihab MR, Ranti B (2020) Implication
of InsurTech to implementation IT decision domain perspective: the case study of insurance
XYZ. https://doi.org/10.1109/ICCED51276.2020.9415783
49. Ching KH, Teoh AP, Amran A (2020) A conceptual model of technology factors to InsurTech
adoption by value chain activities. In: 2020 IEEE conference on e-learning, e-management and
e-services, pp 88–92
50. Shah HC, Dong W, Stojanovski P, Chen A (2018) Evolution of seismic risk management for
insurance over the past 30 years. Earthq Eng Eng Vib 17(1):11–18
51. Mogollón A, Rubiano A, Ramirez J (2020) Colombian companies of insurtech and their risk
management. J Phys Conf Ser 1646(1)
52. Leong Y-Y, Chen Y-C (2020) Cyber risk cost and management in IoT devices-linked health
insurance. Geneva Pap Risk Insur Issues Pract 45(4):737–759
53. McFall L, Moor L (2018) Who, or what, is insurtech personalizing?: persons, prices and the
historical classifications of risk. Distinktion J Soc Theory 19(2):193–213
54. Tereszkiewicz P, Południak-Gierz K (2021) Liability for incorrect client personalization in the
distribution of consumer insurance. Risks 9(5)
55. Levantesi S, Piscopo G (2022) Mutual peer-to-peer insurance: the allocation of risk. J Co-op
Organ Manag 10(1):100154
56. Abdikerimova S, Feng R (2021) Peer-to-peer multi-risk insurance and mutual aid. Eur J Oper
Res. https://doi.org/10.1016/j.ejor.2021.09.017
57. Clemente GP, Marano P (2020) The broker model for peer-to-peer insurance: an analysis of its
value. Geneva Pap Risk Insur Issues Pract 45(3):457–481
58. Mukhopadhyay A, Chatterjee S, Saha D, Mahanti A, Sadhukhan SK (2008) E-risk insurance
product design: a copula-based Bayesian belief network model. IGI Global
Diagnosis of Laryngitis
and Cordectomy using Machine
Learning with ML.Net and SVD
Syed Irfan Ali , Ahmed Sajjad Khan, Syed Mohammad Ali,
and Mohammad Nasiruddin
Abstract Nowadays, machine learning is playing an important role in providing
automated results to humanity. It is gaining researchers' attention day by day
and providing more accurate and faster results with every new study. In this
work, Microsoft's ML.NET, an open-source cross-platform machine learning
framework, is used in a .NET 5-based web application for the classification of
speech disorders such as cordectomy (Chordektomie) and laryngitis against
normal voices. Several experiments were performed with the ML model by training
it with different sets of features to identify the best set of features for accurate results.
Keywords Machine learning · Chordektomie · Laryngitis · Classification
1 Introduction
People face the risk of speech disorders, since 25% of the world population work in
professions that compel them to speak louder than the normal level. This
alters the curvature of the vocal tract, which affects the vocal folds during phonation
and results in irregular spectral qualities [1]. This variation in the properties of
vocal fold is produced due to several factors or their combinations such as presence
S. I. Ali (B)
Artificial Intelligence and Data Science Engineering, Anjuman College of Engineering &
Technology, Nagpur, India
e-mail: siali@anjumanengg.edu.in
A. S. Khan ·S. M. Ali ·M. Nasiruddin
Electronics & Telecommunication Engineering, Anjuman College of Engineering & Technology,
Nagpur, India
e-mail: askhan@anjumanengg.edu.in
S. M. Ali
e-mail: smali@anjumanengg.edu.in
M. Nasiruddin
e-mail: nasiruddin@anjumanengg.edu.in
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_37
489
490 S. I. Ali et al.
of mucus, tension, stiffness, larynx muscles, fold’s closing and opening. All these
factors are invariably affected by various pathologies in vocal tracts. This results
in different vibrations for different pathologies, which in turn produces different
frequencies for different pathologies. According to research, 17.9 million US
individuals aged 18 or older reported having a voice issue in the previous
year [2]. Teachers are more vulnerable to voice disorders than other professionals.
Both short-term and long-term signal analysis can be used to accomplish automatic
speech diagnostics. The acoustic analysis can provide parameters for long-term signal
analysis [3]. In contrast, parameters for short-term signal analysis can be collected
using LPC, LPCC, MFCC, and other algorithms. Numerous acoustic characteristics,
including shimmer, pitch, jitter, pitch and amplitude perturbation quotient, harmonic
to noise ratio, voice turbulence index, normalized noise energy, frequency amplitude
tremor, soft phonation index, glottal to noise excitation ratio, and others, can be used
to identify speech disorders [1]. From this list, some features show a clear separation
corresponding to the speech disorder, some show only minor differences, and some
show random variation. To classify effectively between different classes, it is necessary
to use features with sufficient variation, which supports the machine learning
classifier in diagnosing speech disorders. Thus, in this paper, we have used only the
prominent features: time- and frequency-domain features, energy, pitch, and zero-crossing
rate, together with MFCC coefficients. It is also observed that a model trained
with speech features from one part of the globe will test correctly only for that region. In
the model for classification. Any machine learning technology can be used for the
classification process. The two most often used languages in machine learning are
Python and C++, with Python enjoying more popularity. The ecosystem for special-
ized tools and libraries in Python is impressive. Models are typically developed using
the scikit-learn or PyTorch libraries for Python, and most neural networks are based
on TensorFlow [4]. Microsoft's machine learning tool, ML.NET, has been available
for the .NET platform since 2019. It is an open-source, cross-platform framework
made to host learning models in .NET Core web applications, .NET Framework
applications, and .NET Standard libraries. Scikit-learn and other tools have been
shown to be slower and less accurate than ML.NET [5]. The flowchart of the research is
shown in Fig. 1.
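As an illustration of the frame-level features listed above, short-time energy and zero-crossing rate can be computed with a few lines of NumPy. This is a generic sketch, not the feature extractor used in this study: the 25 ms frame and 10 ms hop at 16 kHz are illustrative values, and MFCCs would in practice be extracted with a dedicated library.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Slice a 1-D signal into overlapping frames (one frame per row)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def short_time_energy(x, frame_len=400, hop=160):
    """Sum of squared samples in each frame."""
    return np.sum(frame_signal(x, frame_len, hop) ** 2, axis=1)

def zero_crossing_rate(x, frame_len=400, hop=160):
    """Fraction of adjacent-sample sign changes in each frame."""
    signs = np.sign(frame_signal(x, frame_len, hop))
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

# Example: a pure 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
zcr = zero_crossing_rate(tone)   # roughly 2 * 440 / 16000 per sample
```

For a sustained vowel, energy and ZCR vary slowly; irregular phonation shows up as frame-to-frame fluctuation in exactly these quantities.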
2 Literature Review
Verde et al. employed the Youden analysis to identify the threshold value to distin-
guish between a pathological and a healthy voice, and then used a model tree regres-
sion approach to define the link between these parameters. They have assessed
the proposed index’s dependability in terms of accuracy, sensitivity, and specificity
according to the correct classification [6].
Fig. 1 Block diagram of the
process
Support vector machine (SVM) and deep neural network (DNN) classifiers have
been employed by Zhang et al. [7].
A feature-based representation with MFCCs and a Mel-spectrogram, two
commonly used input representations, was initially created from the audio data by
Guan & Lerch. Four different machine learning techniques have been used to conduct
the research: support vector machine (SVM), CNN, CNN followed by SVM, and
autoencoder (AE) followed by SVM. Fivefold cross-validation is used throughout
all studies [8].
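The fivefold cross-validation used in [8] follows the standard index-partitioning scheme, which can be sketched generically as follows (this is not the authors' code, just the usual shuffled k-fold split, equivalent in spirit to scikit-learn's `KFold`):

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    folds = np.array_split(order, k)          # k nearly equal parts
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]

# Across the five splits, every sample lands in exactly one test fold
splits = list(k_fold_indices(103, k=5))
```

Each classifier is then trained on the k − 1 training folds and scored on the held-out fold, and the k scores are averaged.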
The feature vector that Smitha et al. generated consists of 19 coefficients. They
separated their dataset into three categories: training data (75%), validation data (15%),
and test data (15%). The network is trained using the Scaled Conjugate Gradient (SCG)
backpropagation technique, and the results are evaluated using Mean Squared
Error (MSE) and Percent Error (%E). To categorize the retrieved features, they
employed an artificial neural network (ANN) with one hidden layer, claiming
that an ANN is one of the effective methods for differentiating between healthy
and diseased voices. Using MATLAB 2015a, they obtained the lowest MSE and
%E of 1.05e−03 and 1.28e−01, respectively [9].
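A three-way split of the kind described can be sketched as below; the 70/15/15 proportions are illustrative defaults, not necessarily those used in [9]:

```python
import numpy as np

def train_val_test_split(n_samples, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle indices and cut them into disjoint train/validation/test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(train_frac * n_samples)
    n_val = int(val_frac * n_samples)
    return (order[:n_train],
            order[n_train:n_train + n_val],
            order[n_train + n_val:])

train_idx, val_idx, test_idx = train_val_test_split(1000)
```

The validation partition steers early stopping and hyperparameter choices, while the test partition is touched only once, for the final MSE and %E figures.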
Dankovicová et al. chose a straightforward filter to obtain the top k characteristics,
and the results are on par with those of more sophisticated techniques. The score of
features was calculated using mutual information pertaining to discrete variables.
This function is reliant on the k-nearest neighbor distance-based entropy calcu-
lation. Positive (non-negative) mutual information between two random variables
indicates that the variables are dependent on one another. Higher values indicate
greater dependence; zero indicates independence. The authors' automated feature
selection is thus based on dependency. Support vector machine (SVM) with a
nonlinear kernel, K-nearest neighbors (KNN), and random forest classifiers (RFC)
have all been used by the authors [10].
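The mutual-information score used in [10] measures the dependence between a discrete feature and the class label. The sketch below uses a simple histogram (plug-in) estimator for illustration, not the k-nearest-neighbor entropy estimator the authors describe:

```python
import numpy as np

def mutual_information(x, y):
    """Mutual information (in nats) between two discrete variables, via histograms."""
    joint = np.zeros((int(x.max()) + 1, int(y.max()) + 1))
    for xi, yi in zip(x, y):
        joint[xi, yi] += 1                    # joint count table
    joint /= joint.sum()                      # joint probability p(x, y)
    px = joint.sum(axis=1, keepdims=True)     # marginal p(x)
    py = joint.sum(axis=0, keepdims=True)     # marginal p(y)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))

rng = np.random.default_rng(0)
a = rng.integers(0, 4, 5000)
b = rng.integers(0, 4, 5000)                  # independent of a
mi_dependent = mutual_information(a, a)       # near the entropy of a (~log 4)
mi_independent = mutual_information(a, b)     # near zero
```

Features are then ranked by their score against the label, and the top k are kept, exactly the "straightforward filter" described above.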
Alhussein and Muhammad have looked into the CaffeNet and VGG16 Net CNN models. The VGG16 Net is a fairly sophisticated CNN model. Since CaffeNet and VGG16 Net were both trained with a large number of pictures, they are reliable for a wide range of applications. Because the vocal pathology databases include only a very small number of samples, these models cannot be trained from scratch. Instead, the authors applied transfer learning and fine-tuning to benefit from these reliable models. The output of the final CNN layer is then given to an SVM for classification into two classes [11].
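The pattern of a frozen deep feature extractor feeding an SVM can be sketched as follows. To keep the sketch self-contained, a fixed random projection with a ReLU stands in for the truncated CaffeNet/VGG16, and the images and labels are hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Stand-in for a truncated pretrained CNN: the paper uses the output of the
# final CNN layer; here a frozen random projection + ReLU plays that role so
# the sketch stays self-contained (this is NOT CaffeNet or VGG16).
W = rng.normal(size=(64 * 64, 128))

def cnn_features(images):
    flat = images.reshape(len(images), -1)
    return np.maximum(flat @ W, 0.0)   # frozen "deep" features

# Hypothetical two-class data (e.g., normal vs. pathological spectrograms).
images = rng.random((40, 64, 64))
labels = np.array([0] * 20 + [1] * 20)

# Transfer-learning pattern: the extractor stays fixed, only the SVM head is
# trained on the extracted features.
feats = cnn_features(images)
svm = SVC(kernel="rbf").fit(feats, labels)
print(svm.predict(cnn_features(images[:2])))
```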
Hidden Markov models (HMM), neural networks, support vector machines (SVM), and lastly the Gaussian mixture model (GMM) have all been utilized as classifiers by Rabeh et al. [12].
A support vector machine (SVM) classifier was used by Al-Nasheri et al. to determine whether the provided samples were abnormal or normal. In addition, they carried out further tests to see whether there was a significant difference between the means of normal and diseased samples for each database individually, using U-tests and XLSTAT software. They used several terms to convey their findings: accuracy (ACC: the ratio of correctly detected samples to total samples), sensitivity (SN: the proportion of pathological samples positively identified), specificity (SP: the proportion of normal samples negatively identified), and area under the receiver operating characteristic (ROC) curve (AUC) [1].
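These four measures can be computed directly from a confusion matrix plus classifier scores; the labels and scores below are hypothetical, and scikit-learn supplies the AUC.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical ground truth (1 = pathological) and classifier scores.
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_score = np.array([0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.7, 0.1])
y_pred = (y_score >= 0.5).astype(int)

tp = np.sum((y_pred == 1) & (y_true == 1))
tn = np.sum((y_pred == 0) & (y_true == 0))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

acc = (tp + tn) / len(y_true)         # ACC: correctly detected / total
sn = tp / (tp + fn)                   # SN: pathological positively identified
sp = tn / (tn + fp)                   # SP: normal negatively identified
auc = roc_auc_score(y_true, y_score)  # AUC: area under the ROC curve
print(acc, sn, sp, auc)
```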
MLP, GFF & MODULAR, and SVM are the classifiers that Ali and Karule utilized. Tanh axon was utilized as the transfer function, with 25% of the samples used for testing and 75% for training. Five transfer functions (Tanh axon, L-Tanh axon, sigmoid axon, L-sigmoid axon, and SoftMax axon) have been utilized in the first three neural networks. In the SVM, they altered only the epochs [13].
Teixeira et al. created an approach for obtaining the parameters vector using the
Praat program [14].
Artificial neural networks and the Multilayer Perceptron (MLP) with the backpropagation learning algorithm were employed by Khemphila and Boonjing. They selected features according to the importance of each attribute, using the information gain of each attribute as its weight rather than learning weights through a general algorithm or another machine learning technique. They employed an ANN classification method that chooses attributes based on the idea of information gain (IG). The selection of feature sets is done using IG. Weka 3.6.6 was used to calculate the results of experiments [15].
Diagnosis of Laryngitis and Cordectomy using Machine Learning 493
In order to create a feature vector from speech samples for a Multilayer Neural
Network (MNN) classifier, Salhi et al. used wavelet analysis. A two-dimensional
pattern of wavelet coefficients is produced by wavelet analysis. The feature vector of
voice samples is created using the energy content of wavelet coefficients at various
scaling settings. In this case, a feature vector is employed as a diagnostic tool to
find pathological voice abnormalities. Here, classification is accomplished using a
three-layer feed-forward network with sigmoid activation. For network training, the
generalized backpropagation algorithm (BPA) is utilized. Additionally, they stated
that supervised learning is employed when a neural network is trained by providing
a target output to a certain input group. Additionally, it is claimed that a network
can be trained through self-guidance, in which case the network’s parameters adapt
to the input. In both scenarios, the network’s weights and biases change in response
to the collected data. The training can be done in batches (batch training), in which
case the parameters are not adjusted until all the instances have been fed, or it can
be done gradually (incremental training), in which case the weights and biases are
adjusted every time a new training example is supplied to the network. The neural network is implemented on the MATLAB 7.0 platform and has three layers: an input layer, a hidden layer, and an output layer [16].
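The energy-per-scale feature vector can be sketched with a hand-rolled Haar transform; the Haar wavelet is an assumption here, standing in for whichever wavelet family Salhi et al. actually used.

```python
import numpy as np

def haar_step(signal):
    """One level of the orthonormal Haar wavelet transform:
    returns approximation and detail coefficients."""
    a = (signal[0::2] + signal[1::2]) / np.sqrt(2)
    d = (signal[0::2] - signal[1::2]) / np.sqrt(2)
    return a, d

def wavelet_energy_features(signal, levels=3):
    """Energy of the detail coefficients at each scale, plus the residual
    approximation energy: one number per scale, forming the feature vector."""
    feats, a = [], signal
    for _ in range(levels):
        a, d = haar_step(a)
        feats.append(float(np.sum(d ** 2)))   # energy at this scale
    feats.append(float(np.sum(a ** 2)))       # residual approximation energy
    return np.array(feats)

# Hypothetical voice frame of 512 samples.
rng = np.random.default_rng(0)
frame = rng.standard_normal(512)
fv = wavelet_energy_features(frame)           # input vector for the 3-layer MLP
print(fv.shape)
```

Because the Haar transform is orthonormal, the feature vector's entries sum to the total signal energy, so no information about overall loudness is lost.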
Muhammad et al. fed the three photos into convolutional neural networks (CNNs) [17]. CNNs have had success with deep learning in many areas of image processing [18]. They employed transfer learning and a fine-tuning strategy because
training the CNN model requires a lot of data. CaffeNet is the CNN that they use in
the suggested system [18]. A reliable CNN model that works well in many image
processing applications is CaffeNet. Three components are shown in the input image
for this model (e.g., in a color image, the three components are red, green, and blue).
The model, according to them, contains three pooling layers and five convolution layers. A rectified linear unit (ReLU) follows each convolution layer. The training set,
the validation set, and the testing set were each given their own section of the utilized
database. The training set, validation set, and testing sets each comprised 5%, 7%,
and 25% of the database, respectively. It was made sure there was no speaker overlap
when the database was divided into its three sections.
In the suggested system, Hossain & Muhammad used three machine learning
algorithms, each of which has unique properties. They employed the SVM, the ELM,
and the GMM as classifiers [19,20].
According to Martinez and Rufiner, the ANN is a superb classification system that
excels at handling noisy, imperfect, overlapped, and otherwise degraded data. A moving window of 256 samples with a 128-sample overlap was utilized to extract patterns. Each segment had a Hamming window applied to it, after which the patterns were extracted using the first 16 cepstral coefficients. Each pattern was completed with ones and zeros as the activations of the required ANN outputs. They employed two distinct types of ANN, one trained to recognize the difference between
a diseased and normal voice (without caring about the pathology), and the other to
recognize the difference between a normal, harsh, and bicyclic voice [21].
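The pattern-extraction front end (256-sample moving window, 128-sample overlap, Hamming window, first 16 cepstral coefficients) can be sketched as follows; the real cepstrum is an assumption, since [21] does not specify the exact cepstrum variant.

```python
import numpy as np

def frames_cepstra(x, win=256, hop=128, n_ceps=16):
    """Slide a 256-sample window with 128-sample overlap, apply a Hamming
    window, and keep the first 16 cepstral coefficients of each frame."""
    w = np.hamming(win)
    out = []
    for start in range(0, len(x) - win + 1, hop):
        seg = x[start:start + win] * w
        spec = np.abs(np.fft.rfft(seg)) + 1e-12   # avoid log(0)
        ceps = np.fft.irfft(np.log(spec))         # real cepstrum
        out.append(ceps[:n_ceps])
    return np.array(out)

# Hypothetical 1024-sample voice excerpt.
rng = np.random.default_rng(1)
signal = rng.standard_normal(1024)
patterns = frames_cepstra(signal)                 # one 16-value pattern per frame
print(patterns.shape)
```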
Each speech sample in the MEEI database was used by Ali et al. to test their
technique. They employed a threefold cross-validation strategy for this. The MEEI
database is divided into three separate subsets using this method. One of the subsets is
utilized for system evaluation, and the other two are used for system training. Various metrics are considered to report the results of the suggested system: sensitivity, specificity, accuracy, and area under the receiver operating characteristic (ROC) curve. For the automatic detection and categorization of voice abnormalities, a GMM-based classifier is given the FCB feature.
To create the classifier, a cutting-edge clustering pattern recognition algorithm is
used. The GMM clustering method has been applied in a wide range of scientific
fields. They have used GMM to develop acoustic models employing FCB features
for various speech signals belonging to various classes [22].
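The threefold cross-validation with one GMM acoustic model per class can be sketched as follows; the synthetic vectors stand in for the FCB features, and two mixture components per class is an illustrative choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import StratifiedKFold

# Hypothetical stand-in for FCB feature vectors from two voice classes.
X, y = make_classification(n_samples=150, n_features=8, n_informative=4,
                           random_state=0)

accs = []
for tr, te in StratifiedKFold(n_splits=3).split(X, y):   # threefold CV as in [22]
    # One GMM acoustic model per class, trained on the two training subsets.
    gmms = {c: GaussianMixture(n_components=2, random_state=0)
               .fit(X[tr][y[tr] == c]) for c in (0, 1)}
    # Classify the held-out subset by the higher per-class log-likelihood.
    scores = np.column_stack([gmms[c].score_samples(X[te]) for c in (0, 1)])
    accs.append(float(np.mean(scores.argmax(axis=1) == y[te])))
print(round(float(np.mean(accs)), 2))
```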
3 Methodology
The Saarbruecken Voice Database (SVD) was used for model training and testing in
this study. The SVD database was recorded by the Institute of Phonetics at Saarland
University and is freely downloadable via the Internet [23]. The resolution of the
speech samples is 16 bits, with a sampling frequency of 50 kHz. The findings of this study depend on the development environment used. The samples are labeled and divided into normal and abnormal speech signals. Abnormal samples are labeled with ‘1,’ while normal samples are tagged with ‘0.’ For feature extraction, 96 normal speech samples, 84 Laryngitis samples, and 47 Chordektomie samples are downloaded.
4 Experiment and Result
ML.Net is utilized for generation of the machine learning model, as it is the best of the three available machine learning platforms, viz. ML.Net, scikit-learn, and H2O [24]. The NWaves library from the NuGet Package Manager is used for extracting multiple features of each sample; the library is available on GitHub [25].
4.1 Experiment with Normal Versus Laryngitis
Experiments were performed with each feature using the multiclass classifiers to find the best feature or set of features that can classify Normal versus Laryngitis with the highest accuracy.
Table 1 Normal versus laryngitis result
Feature set 1 Percentage acc. Feature set 2 Percentage acc.
Energy 53.92 MFCC0 49.48
RMS 54.9 MFCC1 54.95
ZCR 55.37 MFCC2 54.52
Entropy 57.1 MFCC3 54.01
Centroid 60.35 MFCC4 53.09
Spread 58.88 MFCC5 56.68
Flatness 56.88 MFCC6 58.04
Noiseness 52.17 MFCC7 57.52
RollOff 55.62 MFCC8 55.45
Crest 55.42 MFCC9 52.58
Entropy2 53.16 MFCC10 54.3
Decrease 56.22 MFCC11 56.92
C1 56.38 MFCC12 54.21
C2 53.77
C3 55.41
C4 51.7
C5 55.29
C6 55.46
The experiment is performed in three stages.
In the first stage, micro-accuracy, macro-accuracy, and time needed are recorded
for all multiclass classifiers in 16 iterations.
In the second stage, micro-accuracy, macro-accuracy, and time needed are
recorded for selected classifiers from available multiclass classifiers in five iterations.
In the final stage, the result of the best classifier is recorded.
For the ‘Centroid’ feature, ‘Lbfgs Logistic Regression Ova’ is best, with 60.35% micro-accuracy, 56% macro-accuracy, and a duration of 1 s.
For the ‘Spread’ feature, ‘Sdca Maximum Entropy Multi’ is best, with 58.88% micro-accuracy, 50% macro-accuracy, and a duration of 0.2 s.
For the ‘MFCC6’ feature, ‘Fast Tree Ova’ is best, with 58.04% micro-accuracy, 56.34% macro-accuracy, and a duration of 1.2 s (Table 1; Fig. 2).
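The staged sweep, one experiment per feature and classifier with micro- and macro-accuracy recorded, can be emulated outside ML.Net. This scikit-learn sketch uses synthetic data, and the two trainers are rough analogues of the named ML.Net ones, not the same implementations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix: each column plays the role of one acoustic
# feature (Centroid, Spread, MFCC6, ...); this is not the SVD data itself.
X, y = make_classification(n_samples=300, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

results = {}
for j in range(X.shape[1]):                      # one experiment per feature
    for name, clf in [("lbfgs-logreg", LogisticRegression()),
                      ("fast-tree-like", GradientBoostingClassifier(random_state=0))]:
        clf.fit(X_tr[:, [j]], y_tr)
        pred = clf.predict(X_te[:, [j]])
        results[(j, name)] = (accuracy_score(y_te, pred),
                              balanced_accuracy_score(y_te, pred))

best = max(results, key=lambda k: results[k][0])
print(best, results[best])
```

In ML.Net terms, micro-accuracy is overall accuracy, while macro-accuracy averages per-class accuracy; `balanced_accuracy_score` plays the latter role here.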
4.2 Experiment with Normal Versus Chordektomie
Experiments were performed with each feature using the multiclass classifiers to find the best feature or set of features that can classify Normal versus Chordektomie with the highest accuracy.
Fig. 2 Normal versus laryngitis
The experiment is performed in three stages.
In the first stage, micro-accuracy, macro-accuracy, and time needed are recorded
for all multiclass classifiers in 16 iterations.
In the second stage, micro-accuracy, macro-accuracy, and time needed are
recorded for selected classifiers from available multiclass classifiers in five iterations.
In the final stage, the result of the best classifier is recorded.
For the ‘MFCC11’ feature, ‘Lbfgs Logistic Regression Ova’ is best, with 70.34% micro-accuracy, 67.51% macro-accuracy, and a duration of 0.2 s.
For the ‘RMS’ feature, ‘Lbfgs Logistic Regression Ova’ is best, with 70.66% micro-accuracy, 65.2% macro-accuracy, and a duration of 0.3 s.
For the ‘Entropy’ feature, ‘Lbfgs Logistic Regression Ova’ is best, with 70.61% micro-accuracy, 68% macro-accuracy, and a duration of 0.1 s (Table 2; Fig. 3).
Table 2 Normal versus Chordektomie result
Feature set 1 Percentage acc. Feature set 2 Percentage acc.
Energy 68.23 MFCC0 69.81
RMS 70.66 MFCC1 67.29
ZCR 68.92 MFCC2 67.43
Entropy 70.61 MFCC3 68.24
Centroid 69.97 MFCC4 69.57
Spread 68.98 MFCC5 69.01
Flatness 69.98 MFCC6 69.21
Noiseness 67.58 MFCC7 68.57
RollOff 68.33 MFCC8 68.73
Crest 68.76 MFCC9 69.6
Entropy2 68.98 MFCC10 69.34
Decrease 69.22 MFCC11 70.34
C1 68.93 MFCC12 69.91
C2 68.57
C3 69.49
C4 69.05
C5 68.83
C6 68.44
Fig. 3 Normal versus Chordektomie
5 Conclusion
This research shows that the two selected disorders, Laryngitis and Chordektomie, are remarkably close to normal in terms of features. The best features for classifying Laryngitis versus normal are found to be Centroid, Spread, and MFCC coefficient 6, with accuracies of 60.35, 58.88, and 58.04 percent, respectively. For Chordektomie versus normal, accuracies of 70.34, 70.66, and 70.61 percent are found for the features MFCC11, RMS, and Entropy, respectively. When two or more features with good accuracy are combined, the average accuracy decreases in most cases, with few exceptions. The results for composite features will be published in future papers. To replicate or verify these results, researchers should use ML.Net and NWaves with a sampling rate of 44,100 Hz, a frame duration of 0.035 s, a hop duration of 0.015 s, a pre-emphasis filter of 0.97, and a rectangular window in MFCC.
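The stated front-end parameters translate into the following pre-emphasis and framing steps, sketched in numpy on a hypothetical one-second signal (the MFCC filterbank and DCT stages are omitted).

```python
import numpy as np

SR = 44_100          # sampling rate stated in the conclusion
FRAME_S = 0.035      # frame duration, s
HOP_S = 0.015        # hop duration, s
PRE_EMPH = 0.97      # pre-emphasis coefficient

frame_len = int(SR * FRAME_S)   # 1543 samples per frame
hop_len = int(SR * HOP_S)       # 661 samples per hop

rng = np.random.default_rng(0)
x = rng.standard_normal(SR)     # hypothetical 1-second voice sample

# Pre-emphasis filter: y[n] = x[n] - 0.97 * x[n-1]
y = np.append(x[0], x[1:] - PRE_EMPH * x[:-1])

# Rectangular-window framing (the window named in the conclusion: no taper).
n_frames = 1 + (len(y) - frame_len) // hop_len
frames = np.stack([y[i * hop_len:i * hop_len + frame_len]
                   for i in range(n_frames)])
print(frames.shape)
```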
References
1. Al-Nasheri A et al (2017) Voice pathology detection and classification using auto-correlation
and entropy features in different frequency regions. IEEE Access 6:6961–6974. https://doi.org/
10.1109/ACCESS.2017.2696056
2. National Institute on Deafness and Other Communication Disorders: Voice, Speech,
and Language: Quick Statistics (2016). http://www.nidcd.nih.gov/health/statistics/vsl/Pages/
stats.aspx. Accessed 10 Aug 2020
3. Boyanov B, Hadjitodorov S (1997) Acoustic analysis of pathological voices. A voice analysis
system for the screening of laryngeal diseases. IEEE Eng Med Biol Mag 6(4):74–82
4. Esposito D (2019) Hosting a machine learning model in ASP.NET Core 3.0. https://www.red-gate.com/simple-talk/sql/data-science-sql/hosting-a-machine-learning-model-in-asp-net-core-3-0/. Accessed 27 Aug 2020
5. Ahmed Z et al (2019) Machine Learning at Microsoft with ML.NET. In: KDD: knowledge
discovery and data mining, pp 2448–2458. https://doi.org/10.1145/3292500.3330667
6. Verde L, De Pietro G, Alrashoud M, Ghoneim A, Al-Mutib KN, Sannino G (2019) Dysphonia
detection index (DDI): a new multi-parametric marker to evaluate voice quality. IEEE Access
7:55689–55697. https://doi.org/10.1109/ACCESS.2019.2913444
7. Zhang T, Wu Y, Shao Y, Shi M, Geng Y, Liu G (2019) A pathological multi-vowels recognition algorithm based on LSP feature. IEEE Access 7:58866–58875. https://doi.org/10.1109/ACCESS.2019.2911314
8. Guan H, Lerch A (2019) Learning strategies for voice disorder detection. https://doi.org/10.
1109/ICOSC.2019.8665504
9. Smitha, Shetty S, Hegde S, Dodderi T (2018) Classification of healthy and pathological voices
using MFCC and ANN. In: Proceedings of 2018 2nd international conference on advances in
electronics, computers and communications, ICAECC 2018, pp 1–5. https://doi.org/10.1109/
ICAECC.2018.8479441
10. Dankovičová Z, Sovák D, Drotár P, Vokorokos L (2018) Machine learning approach to
dysphonia detection. Appl Sci 8(10):1–12. https://doi.org/10.3390/app8101927
11. Alhussein M, Muhammad G (2018) Voice pathology detection using deep learning on mobile
healthcare framework. IEEE Access 6:41034–41041. https://doi.org/10.1109/ACCESS.2018.
2856238
12. Rabeh H, Salah H, Adnane C (2018) Voice pathology recognition and classification using noise
related features. Int J Adv Comput Sci Appl 9(11):82–87, [Online]. Available: www.ijacsa.the
sai.org
13. Ali SM, Karule PT (2016) MFCC, LPCC, formants and pitch proven to be best features in
diagnosis of speech disorder using neural networks and SVM. Int J Appl Eng Res 11(2):897–903
[Online]. Available: http://www.ripublication.com
14. Teixeira JP, Oliveira C, Lopes C (2013) Vocal acoustic analysis—jitter, shimmer and
HNR parameters. Procedia Technol 9(May):1112–1122. https://doi.org/10.1016/j.protcy.2013.
12.124
15. Khemphila A, Boonjing V (2012) Parkinsons disease classification using neural network and
feature selection. Int J Math Comput Sci 6(4):377–380
16. Salhi L, Mourad T, Cherif A (2010) Voice disorders identification using multilayer neural
network
17. Muhammad G, Alhamid MF, Alsulaiman M, Gupta B (2018) Edge computing with cloud for
voice disorder assessment and treatment. IEEE Commun Mag 56(4):60–65. https://doi.org/10.
1109/MCOM.2018.1700790
18. Krizhevsky A (2012) ImageNet classification with deep convolutional neural networks. In: NIPS’12 proceedings of the 25th international conference on neural information processing systems, Lake Tahoe, Nevada, 03–06 Dec 2012, vol 1, pp 1097–1105. https://doi.org/10.1016/B978-008046518-0.00119-7
19. Hossain MS, Muhammad G (2016) Healthcare big data voice pathology assessment framework.
IEEE Access 4:7806–7815. https://doi.org/10.1109/ACCESS.2016.2626316
20. Huang GB, Zhou H, Ding X, Zhang R (2011) Extreme learning machine for regression and
multiclass classification. IEEE Trans Syst Man Cybern Part B (Cybernetics) 42(2):513–529.
https://doi.org/10.1109/tsmcb.2011.2168604
21. Martinez CE, Rufiner HL (2002) Acoustic analysis of speech for detection of laryngeal
pathologies, pp 2369–2372. https://doi.org/10.1109/iembs.2000.900621
22. Ali Z, Hossain MS, Muhammad G, Sangaiah AK (2018) An intelligent healthcare system for
detection and classification to discriminate vocal fold disorders. Futur Gener Comput Syst
85:19–28. https://doi.org/10.1016/j.future.2018.02.021
23. Barry WJ, Pützer M: Saarbruecken Voice Database. http://www.stimmdatenbank.coli.uni-saarland.de/. Accessed 10 Aug 2017
24. Microsoft, ML.NET-An open source and cross-platform machine learning framework. https://
dotnet.microsoft.com/apps/machinelearning-ai/ml-dotnet. Accessed 21 Aug 2020
25. ar1st0crat/NWaves: .NET DSP library with a lot of audio processing functions. https://github.
com/ar1st0crat/NWaves. Accessed 8 Feb 2022
Speed of Diagnosis for Brain Diseases
Using MRI and Convolutional Neural
Networks
B. Srinivasa Rao, Vankalapati Nanda Gopal, Vatala Akash,
and Shaik Nazeer
Abstract Accurately diagnosing brain diseases is crucial for effective treatment
and improved patient outcomes. Magnetic Resonance Imaging is a regularly used
technology in the investigation of brain illnesses including Alzheimer’s disease,
brain tumors, and multiple sclerosis. This study proposes a Convolutional Neural
Network-based automated brain illness classification method utilizing MRI images.
The proposed method leverages a dataset of MRI images covering four classes, namely Alzheimer’s disease, brain tumors, multiple sclerosis, and healthy brains.
We trained and compared different CNN architectures, including VGG16 and fine-
tuned ResNet. Our CNN model achieved remarkable accuracy on both the training
and testing sets. Specifically, we achieved an impressive training accuracy of 99.01%
and a testing accuracy of 95%, outperforming VGG16 and fine-tuned ResNet. We
derived many assessment measures, including accuracy, recall, and F1-score, to
further evaluate the effectiveness of our model. Our results demonstrate the potential
of CNN-based approaches in accurately and automatically classifying brain diseases
using MRI images. Our proposed approach has the potential to be a valuable tool for
healthcare professionals, improving patient outcomes and quality of life. The devel-
oped model is capable of classifying Alzheimer’s disease, brain tumors, multiple
sclerosis, and their respective stages. Automated classification of brain diseases using
CNNs could enable early detection and precise diagnosis of these diseases, leading
to improved treatment and patient care.
Keywords Magnetic Resonance Imaging ·Convolutional Neural Network ·Brain
diseases ·Alzheimer’s disease ·Brain tumors ·Multiple sclerosis ·VGG16 ·
ResNet ·Automated classification ·Evaluation metrics ·Precision ·Recall ·
F1-score
B. Srinivasa Rao (B)
Department of Information Technology, Lakireddy Bali Reddy College of Engineering,
Mylavaram, Andhra Pradesh, India
e-mail: doctorbsrinivasarao@gmail.com
V. N. Gopal · V. Akash · S. Nazeer
Lakireddy Bali Reddy College of Engineering, Mylavaram, Andhra Pradesh, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_38
502 B. Srinivasa Rao et al.
1 Introduction
Millions of people and their families are affected by brain illnesses, which are a
serious public health problem globally. Multiple sclerosis (MS), brain tumors, and
Alzheimer’s disease (AD) are some of the most prevalent and crippling brain disor-
ders. These illnesses have high rates of morbidity and death, and diagnosing and
treating them are extremely difficult for medical professionals and healthcare systems
across the world. These illnesses have severe effects on both the individual and their relatives, underscoring the essential need for early and precise identification and categorization. Traditional diagnostic methods for brain diseases rely heavily on clinical
assessments, such as cognitive and neurological tests, and medical imaging, such as
computed tomography (CT) and Magnetic Resonance Imaging (MRI). While these
methods have been useful in diagnosing and monitoring brain diseases, they have
limitations, including poor sensitivity and specificity, high costs, and time-consuming
procedures. These limitations underscore the need for more accurate, reliable, and
efficient diagnostic tools that can facilitate early detection and classification of brain
diseases.
Alzheimer’s disease [1] is a neurodegenerative disorder that gradually impairs memory, thinking, and behavior, with symptoms often becoming progressively worse over time. It is the most common cause of dementia in the
elderly, with an estimated 50 million people worldwide living with the condition.
Brain tumors are abnormal growths of cells that can occur in any part of the brain
and can cause a range of symptoms, including headaches, seizures, and difficulty
with speech and movement. Multiple sclerosis is a chronic autoimmune illness
affecting the central nervous system and can cause various symptoms such as fatigue,
movement problems, and vision impairment.
The timely identification and categorization of these brain diseases can greatly enhance patient outcomes, as early intervention can help slow the advancement of the disease and improve quality of life. However, traditional diagnostic methods, such as clinical evaluation and imaging, have limitations in terms of
accuracy and speed of diagnosis. For example, traditional MRI-based diagnosis of
Alzheimer’s disease relies on visual inspection of brain scans by radiologists, which
can be time-consuming and subjective.
Medical imaging, especially MRI, has emerged as a promising diagnostic tool
for brain diseases due to its high spatial resolution and ability to capture detailed
images of the brain’s structure and function. Recent breakthroughs in computer
vision and machine learning, particularly CNNs, have shown promise in boosting the
precision and speed of diagnosing brain diseases. CNNs are a type of deep learning
[2] algorithm that can learn to automatically extract features from raw image data,
allowing them to classify images with high accuracy.
In particular, medical imaging, such as MRI scans, provides a rich source of image
data for CNNs to learn from. MRI scans give precise information on the anatomy
and function of the brain, allowing for the very accurate detection and classification
of brain illnesses.
Speed of Diagnosis for Brain Diseases Using MRI and Convolutional 503
The objective of this paper is to provide a comprehensive overview of the present
status of employing CNNs for the detection and classification of medical images,
particularly for brain diseases like Alzheimer’s, brain tumors [3], and multiple scle-
rosis. We will discuss the limitations of traditional diagnostic methods and the poten-
tial for using CNNs in combination with MRI scans to improve accuracy and speed
of diagnosis. Additionally, we will present the results of our own experiments using
CNNs to classify brain diseases using MRI data and compare the performance of
several CNN designs, including VGG16, ResNet [4], and our own.
Overall, we believe that CNNs have great potential in improving the accuracy and
speed of diagnosis for brain diseases and can significantly improve patient outcomes
by enabling early detection and classification of these diseases. The intention of this
paper is to offer valuable perspectives on the future of medical imaging and machine
learning in diagnosing and treating brain diseases. This will be accomplished by
presenting a comprehensive summary of the latest advances in this area.
2 Related Work
The authors Pradeep Kumar and Seuc Ho Ryu talk about how crucial image
processing and brain imaging techniques are to medical research, particularly in
terms of early diagnosis and therapy [5]. They emphasize how well deep neural
networks (DNNs) do when it comes to classifying and segmenting images. The
authors of this study have introduced a technique to reduce the feature set’s size in
subsequent classification assignments through a deep wavelet autoencoder (DWA).
They evaluated the proposed DWA-DNN photo classifier on a brain image dataset
and compared its performance to other existing classifiers. They found that the suggested technique performs better than the current methods.
The present approach for classifying and diagnosing brain tumors depends on
labor-intensive, invasive histopathological examination of biopsy samples that is
prone to human error. Thus, a completely automated deep learning system is required
for the early detection of brain cancers. For three distinct classification tasks—brain
tumor detection, brain tumor type classification, and brain tumor grade classifica-
tion—this study offers three different Convolutional Neural Network (CNN) models
[6]. The grid search optimization approach is used to automatically identify the
hyperparameters of the CNN models. Using sizable clinical datasets that are freely
accessible to the public, the suggested CNN models produce good classification
results. The proposed CNN models can support doctors and radiologists in vali-
dating their initial screening for multiple brain tumor [7] categorization. Overall, the
suggested strategy has promise for increasing the precision and effectiveness of brain
tumor categorization and diagnosis.
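The grid-search step can be sketched with scikit-learn’s `GridSearchCV`; a small MLP on a toy digit dataset stands in for the study’s CNN models, and the grid values are illustrative, not those of the reviewed work.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Grid search over network hyperparameters, as the reviewed study does for its
# CNNs; a small MLP on a subset of the digits dataset keeps the sketch fast.
X, y = load_digits(return_X_y=True)
X, y = X[:600], y[:600]

grid = {"hidden_layer_sizes": [(32,), (64,)],   # illustrative grid values
        "alpha": [1e-4, 1e-3]}
search = GridSearchCV(MLPClassifier(max_iter=200, random_state=0), grid, cv=3)
search.fit(X, y)   # exhaustively tries every combination with 3-fold CV
print(search.best_params_, round(search.best_score_, 2))
```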
For the proper diagnosis and assessment of multiple sclerosis (MS) therapy, 3D Magnetic Resonance Imaging (MRI) is essential for identifying white matter abnormalities. For the disease to be treated effectively, early MS identification and assessment of the disease’s development are crucial. Unfortunately, due to the
imbalanced data and sparse lesion pixels, diagnosing MS lesions can be difficult. This study presents a transfer learning-based Convolutional Neural Network
(CNN) technique that employs the SoftMax activation function. The integration of
fluid-attenuated inversion recovery (FLAIR) series allows for faster processing while
maintaining accuracy, and the proposed technique’s efficacy is evaluated using data
from MS patients obtained from the Laboratory of Imaging Technologies.
The study’s findings demonstrate how well MRI may be used to find MS lesions.
The suggested method has a high accuracy rate for forecasting the course of illness,
up to 98.24%. Because a significant volume of MRI data must be analyzed, manual
lesion diagnosis by clinical professionals can be challenging and time-consuming.
The suggested method provides an effective solution to the issue, making it simpler
and quicker to identify MS lesions and categorize disease progression.
3 Methodology
Our study proposes a novel architecture that uses MRI images as input and produces
corresponding class labels. The suggested algorithm can categorize images into 12 distinct classes, including Alzheimer’s disease stages [1], brain tumors, and
multiple sclerosis. Before being supplied to the model for classification, the input
photos are preprocessed, including data augmentation. To identify the most effec-
tive architecture, we experimented with several models, including CNN, VGG16,
VGG19, ResNet, and fine-tuned ResNet [8]. Based on our results, we selected the
CNN architecture, which was found to be more generalized and effective than other
pre-trained architectures.
3.1 Dataset
Magnetic Resonance Imaging (MRI) images of the brain were employed in our investigation. The dataset covers four categories: Alzheimer’s disease, brain tumors, and multiple sclerosis, as well as a control group with a healthy brain. The collection contains a total of 9873 images of varied resolutions and dimensions. Four phases of
Alzheimer’s disease are recognized: Very Mild Demented, Mild Demented, Moderate
Demented, and Non-demented. The Very Mild-Demented stage has 1792 photos, the
Mild-Demented stage has 717 photographs, the Moderate-Demented stage has 590
images, and the Non-demented stage has 2560 images. There are three types of
brain tumors: glioma tumor, meningioma tumor, and pituitary tumor. The collection contains 826 glioma tumor images, 822 meningioma tumor images, and 827 pituitary tumor images.
The multiple sclerosis category has two sub-categories: control and MS. The
control sub-category is further divided into two axial and sagittal orientations, with
1002 and 1014 images, respectively. The MS sub-category also has two orientations:
Fig. 1 Data representation
axial and sagittal, with 650 and 761 images, respectively. A bar graph in Fig. 1 represents the data. The images in the dataset have been labeled and categorized by medical professionals, and the dataset is available for research purposes. Our research uses
this dataset to test and train our CNN model to properly diagnose the various brain
disorders and their phases.
3.2 Data Preprocessing
The data preprocessing step is an essential part of our project that involves preparing
the dataset for training the CNN model. The preprocessing stage includes various
techniques such as data augmentation, resizing, and normalization. We augmented
the data using various methods such as horizontal flipping, vertical flipping, random
rotation, and zooming, which helped us to expand the dataset’s size and diversity.
Additionally, we resized all the images to a fixed dimension of 149 × 149, which
is the input shape of our CNN model. This was done to ensure that all the images
have the same size and shape, which is necessary for the CNN model to process the
images efficiently. Finally, we normalized the pixel values of the images by dividing
them by 150, which scaled the pixel values between 0 and 1. This normalization
technique helped to improve the convergence rate and the overall performance of
our CNN [9] model. Some sample images in our dataset are listed in Fig. 2.
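The resize-and-normalize step can be sketched as follows; nearest-neighbor resizing is an assumption (the paper does not state its resize method), while the 149 × 149 target size and the division by 150 are taken from the text.

```python
import numpy as np

def preprocess(img, size=149, scale=150.0):
    """Nearest-neighbor resize to size x size, then scale intensities by
    1/scale, mirroring the preprocessing described in the paper."""
    h, w = img.shape
    rows = np.arange(size) * h // size   # source row for each output row
    cols = np.arange(size) * w // size   # source column for each output column
    resized = img[rows][:, cols]
    return resized / scale

# Hypothetical MRI slice; the intensity range here is synthetic.
rng = np.random.default_rng(0)
mri = rng.integers(0, 150, size=(210, 180)).astype(float)
x = preprocess(mri)
print(x.shape, float(x.min()), float(x.max()))
```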
3.2.1 Data Augmentation
By producing fresh and varied versions of the original data, a process known as
data augmentation is used to expand a dataset. This method is frequently used to
506 B. Srinivasa Rao et al.
Fig. 2 Sample images in dataset
enhance the generalization of the model and avoid overfitting in computer vision
tasks, including medical imaging.
In our study, we created new variants of the original MRI pictures using a variety
of data augmentation techniques, including rotation, flipping, zooming, and shifting.
These methods allowed us to produce new photos that varied in size, orientation,
and location, which aided in boosting the dataset’s variety. For instance, we applied
rotation to the images by rotating them at different angles to create new images.
Flipping was used to create mirror images of the original images. Zooming was used
to enlarge or reduce the size of the images while shifting was used to move the
images around. These methods contributed to expanding our dataset and producing
fresh iterations of the original photos, both of which enhanced the effectiveness of
our model.
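These augmentations can be sketched in numpy; rotation is restricted to 90-degree steps and zooming is omitted to keep the sketch short, so this is a simplification of the pipeline described above.

```python
import numpy as np

def augment(img, rng):
    """Randomly apply the augmentations named above: horizontal/vertical
    flips, rotation (by 90-degree steps here), and a small horizontal shift."""
    if rng.random() < 0.5:
        img = np.fliplr(img)                   # horizontal flip (mirror image)
    if rng.random() < 0.5:
        img = np.flipud(img)                   # vertical flip
    img = np.rot90(img, k=rng.integers(0, 4))  # random rotation
    shift = rng.integers(-5, 6)                # small random shift
    return np.roll(img, shift, axis=1)

rng = np.random.default_rng(0)
img = rng.random((149, 149))                   # hypothetical preprocessed slice
batch = np.stack([augment(img, rng) for _ in range(8)])  # 8 new variants
print(batch.shape)
```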
3.3 Proposed Model
In this work, we investigated numerous models for the categorization of Alzheimer’s disease, brain tumors, and multiple sclerosis, including ResNet, VGG16, fine-tuned ResNet, and a CNN. We determined that the CNN design outperforms the other pre-trained models after conducting experiments and analyzing the
findings. As a result, we suggest using the CNN model to accurately classify brain
illnesses.
3.3.1 Convolutional Neural Network (CNN)
Neural networks are used in deep learning, a subfield of artificial intelligence, to find
patterns and connections in huge datasets. Deep learning models, which are more
complicated than traditional machine learning models and include several hidden
layers, can understand complex hierarchical data representations. One well-known
deep learning architecture that excels in image classification tasks is the Convolu-
tional Neural Network (CNN). This is because it can use many layers of convolution
and pooling operations to learn spatial hierarchies of features from the input
data. Recurrent neural networks (RNNs), autoencoders, and generative adversarial
networks (GANs) are further examples of deep learning architectures [2]. A Convolutional
Neural Network (CNN) is a type of deep learning architecture that is widely
used for image classification, recognition, and processing tasks. It is based on the
notion of convolution, which is a mathematical procedure in which two functions are
combined to form a third function that represents how one of the original functions
is affected by the other. Convolution is used in CNNs to extract characteristics from
pictures.
CNNs work by taking an input image and running it through a series of convolu-
tional layers. Each convolutional layer is made up of a collection of filters or kernels
that are applied to the input picture to extract various properties such as edges,
corners, and textures. To incorporate nonlinearity and improve the model’s capacity
to learn complicated patterns, the output of each convolutional layer is subsequently
routed through a nonlinear activation function such as ReLU. The resultant feature
maps are flattened and fed through one or more fully connected layers, which execute
the classification job, after numerous convolutional layers. The model learns to apply
weights to distinct characteristics and map them to different output classes during this
process. CNNs have demonstrated outstanding performance in picture classification
and identification applications because of their capacity to learn spatial hierarchies of
features from input data via numerous layers of convolution and pooling operations.
Fig. 3 General CNN architecture
3.3.2 Feature Extraction and Classification
Feature extraction and classification are two essential components of a Convolutional
Neural Network (CNN). Through convolution and pooling processes, the feature
extraction layer learns the relevant aspects of the incoming data. The convolution
layer applies a collection of filters to the input picture, allowing essential
characteristics such as edges, lines, and forms to be extracted. The output of the
convolution layer is then downsampled by the pooling layer to reduce the
dimensionality of the feature maps and keep only the most critical information.
Once the features have been retrieved, they are sent to the classification layer
for prediction. The classification layer is made up of fully connected layers that
accept flattened feature maps as input and create the final output through a sequence
of nonlinear transformations. The SoftMax activation function is typically employed
in the output layer to provide a probability distribution across the classes that may be
used to estimate the most likely class label for the input picture. General architecture
of CNN is shown in Fig. 3.
The Convolutional Neural Network (CNN) functions as both the feature extractor
and the classifier in our suggested architecture. The CNN’s early layers learn to
extract low-level characteristics like edges and forms, while the deeper levels learn
to extract more complicated and abstract features relevant to the classification job.
The CNN’s final layers are fully connected layers that perform classification using
the learned features. As a result, the CNN serves as both a feature extractor and a
classifier for our image classification problem.
3.3.3 Layers and Operations Mentioned in the Architecture
Our suggested CNN model has an input layer and five primary layers: two
convolutional layers, a flatten layer, and two dense layers. The graphic gives an
overview of our model with the input and output of each layer, and the definition
and function of each layer in the proposed model are listed below. The model
summary is represented in Fig. 4.
Fig. 4 Model graphical
summary
(1) Conv2D
Convolutional Neural Networks (CNNs) use the Conv2D layer to perform image
classification tasks. It performs a convolution operation on the input picture by
applying filters/kernels. These filters extract essential characteristics from the
input picture, and the Conv2D layer produces a feature map as its output. The depth
of features extracted from the input may be increased or decreased by adjusting the
size and number of filters in a Conv2D layer.
In our model, this layer’s objective is to extract 32 distinct feature maps from
the input picture. Each feature map is generated by sliding a 3 × 3 kernel over the
input picture and computing a dot product between the kernel and the image region
beneath it. The ReLU activation function is applied to each feature map element by
element, introducing nonlinearity into the model.
(2) MaxPooling2D layer with pool size (2,2)
In Convolutional Neural Networks (CNNs), the Max Pooling layer is used to down-
sample the input feature maps. This layer divides the input into non-overlapping
rectangular pooling areas and picks the maximum value of the relevant elements in
the feature maps for each region. This technique produces a reduced feature map with
decreased spatial dimensions but preserved spatial hierarchy. Max Pooling assists in
reducing the number of parameters, lowering computation costs, and preventing
overfitting in CNNs. The pooling size is set to 2 × 2 in this example, which
indicates that the output is downsampled by a factor of 2 in both dimensions.
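The 2 × 2 max-pooling operation just described can be illustrated with a minimal NumPy sketch; the feature-map values below are made up for the example.

```python
import numpy as np

def max_pool2d(x, pool=2):
    """Non-overlapping pool x pool max pooling; the spatial dimensions
    must divide evenly by the pool size."""
    h, w = x.shape
    return x.reshape(h // pool, pool, w // pool, pool).max(axis=(1, 3))

fmap = np.array([[1., 3., 2., 0.],
                 [4., 2., 1., 1.],
                 [0., 1., 5., 6.],
                 [2., 2., 7., 8.]])
pooled = max_pool2d(fmap)
print(pooled)  # each element is the maximum of one 2 x 2 region
```

The 4 × 4 input becomes a 2 × 2 output, halving each spatial dimension while keeping only the strongest response from each region.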
(3) Dropout layer with rate 0.3
The dropout layer is a regularization approach that avoids overfitting by disregarding
certain neurons at random during training. During training, the dropout layer chooses
a group of neurons at random and sets their outputs to zero. This keeps one neuron
from being overly reliant on another and ensures that the network learns more robust
properties. The dropout rate is a hyperparameter that controls how many neurons are
disregarded during each training iteration. A dropout rate of 0.3 indicates that during
each training iteration, 30% of the neurons in the layer are disregarded.
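A dropout layer with rate 0.3 can be sketched as a random mask over the activations. The 1/(1 − rate) rescaling shown here ("inverted dropout") is a common convention to keep the expected activation unchanged; it is our assumption, not a detail stated in the text.

```python
import numpy as np

rng = np.random.default_rng(42)
activations = np.ones(10000)  # pretend layer outputs, all 1.0
rate = 0.3

# Randomly zero out ~30% of units; scaling the survivors by
# 1 / (1 - rate) keeps the expected activation unchanged.
mask = rng.random(activations.shape) >= rate
dropped = activations * mask / (1.0 - rate)
print(mask.mean())  # fraction of units kept, close to 0.7
```

At inference time the mask is disabled and all units are used, which is why the training-time rescaling matters.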
(4) Flatten layer
The flatten layer in a neural network is responsible for converting the output of the
preceding layer into a one-dimensional array or vector that may be fed into a fully
connected layer. It effectively flattens the preceding layer’s multidimensional tensor
output into a single vector, which is then utilized as input for the following layer.
The flatten layer connects the convolutional and fully connected layers, enabling for
the use of dense layers and output classification.
(5) Dense layer
In our model, the dense layer is a fully connected layer with a set number of
neurons. This layer’s activation function is ReLU, which introduces nonlinearity
into the model. These layers’ function is to learn complicated patterns from the
input information and categorize it into several classes. The last layer has the
same number of neurons as the number of classification classes, and the SoftMax
activation function is applied to it. Ultimately, the dense layers are responsible
for discovering hidden patterns in the data and producing the final classification
result, computing class probabilities for each input image.
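The output shapes implied by the layers above can be traced with simple arithmetic. The sketch below assumes a hypothetical 150 × 150 single-channel input and "valid" (no-padding) convolution; both are illustrative assumptions, not details fixed by the text.

```python
def conv2d_shape(h, w, c_out, k=3):
    # A "valid" (no padding) k x k convolution shrinks each spatial
    # side by k - 1; the channel count becomes the number of filters.
    return h - k + 1, w - k + 1, c_out

def pool_shape(h, w, c, pool=2):
    # 2 x 2 max pooling halves each spatial dimension.
    return h // pool, w // pool, c

h, w, c = 150, 150, 1             # hypothetical 150 x 150 grayscale input
h, w, c = conv2d_shape(h, w, 32)  # Conv2D, 32 filters of 3 x 3
h, w, c = pool_shape(h, w, c)     # MaxPooling2D with pool size (2, 2)
flat = h * w * c                  # Flatten: one long feature vector
print((h, w, c), flat)
```

Tracing shapes this way is a quick sanity check that the flatten layer's vector length matches what the first dense layer expects.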
The Rectified Linear Unit (ReLU)
The Rectified Linear Unit activation function is one of the most commonly employed
in deep learning. The function has a simple mathematical formulation, which makes
it computationally efficient.
ReLU is defined as follows:
f(x) = max(0, x)

where x is the activation function’s input and max(0, x) returns the maximum of
0 and x. In other words, the output of the function is zero when x is negative and
x when x is positive.
The fundamental advantage of the ReLU activation function is that it avoids the
vanishing gradient problem, which arises when the gradient becomes very tiny during
backpropagation and makes training the model difficult. For all positive input values,
ReLU has a constant gradient of 1, which ensures that the gradients remain large
enough during backpropagation, resulting in faster convergence.
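The element-wise definition f(x) = max(0, x) translates directly into code, for example with NumPy:

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit: f(x) = max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(x))  # negative inputs become 0, positive inputs pass through
```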
SoftMax
The SoftMax activation function is frequently used in the output layer of neural
networks for multi-class classification problems. It is a type of exponential function
that accepts a vector of real values as input and returns a probability distribution
across several classes. An input vector z with K components is transformed into an
output vector y of K probabilities using the SoftMax function, which is defined as
follows:

σ(z)_i = e^{z_i} / Σ_{j=1}^{K} e^{z_j}, for i = 1, …, K.
The SoftMax function returns a probability score between 0 and 1 for each
class, and the total of all probabilities equals 1. The projected class has the greatest
likelihood score.
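The SoftMax definition above can be sketched in a few lines of NumPy. Subtracting the maximum score before exponentiating is a standard numerical-stability trick and an addition of ours, not part of the formula as stated.

```python
import numpy as np

def softmax(z):
    """Turn K real-valued scores into a probability distribution over
    K classes. Subtracting max(z) first avoids overflow in exp()."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs.sum())          # all probabilities sum to 1
print(int(probs.argmax()))  # the projected class: greatest likelihood
```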
4 Result and Discussion
We assessed the effectiveness of our CNN model on an MRI brain image dataset for
distinguishing between Alzheimer’s disease, brain tumors, multiple sclerosis, and
their stages. Additionally, we compared our model with pre-trained models such as
ResNet, VGG16, VGG19, and Fine-Tuned ResNet to determine the optimal model
for our dataset.
We collected a dataset of 8000 images, with 2000 images per class. Using an
80:10:10 ratio, we divided the dataset into training, validation, and testing sets.
Before training, we preprocessed the images by resizing them to 150 × 150 pixels,
standardizing the pixel values, and applying data augmentation techniques. Our
suggested CNN model had five convolutional layers with max pooling, three fully
connected layers, and a SoftMax layer at the end. The initial layer consisted of
32 3 × 3 filters, followed by a 2 × 2 Max Pooling layer. The next layers had
64 3 × 3 filters and 128 3 × 3 filters, respectively. The last two convolutional
layers had 150 3 × 3 filters. We trained the model for 50 epochs with a batch size
of 32, the categorical cross-entropy loss function, and the Adam optimizer with a
learning rate of 0.0001. After examining the model’s performance on the testing
set, we obtained an overall accuracy of 95.6%.
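The 80:10:10 split described above can be sketched as follows; this is a minimal NumPy version, since the actual split procedure used in the study is not published.

```python
import numpy as np

def split_indices(n, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle n sample indices and split them into training,
    validation, and testing sets in the given ratio."""
    idx = np.random.default_rng(seed).permutation(n)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train, val, test = split_indices(8000)  # 8000 images, 2000 per class
print(len(train), len(val), len(test))  # 6400 800 800
```

Shuffling before splitting keeps each subset representative of all four classes; a stratified split would guarantee exact per-class balance.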
To compare the effectiveness of our suggested model to that of pre-trained models,
we modified the last layer of the pre-trained models to match the number of output
classes in our dataset. The 50 training epochs for each pre-trained model used the
identical hyperparameters as those for our suggested model. The table below displays
the results.
Model                Accuracy (%)
VGG16                89.5
VGG19                91.2
ResNet               93.4
Fine-Tuned ResNet    94.5
Proposed CNN         95.6
The outcomes reveal that our proposed CNN model achieved a higher accuracy
rate of 95.6%, surpassing the pre-trained models. The Fine-Tuned ResNet model
followed with an accuracy rate of 94.5%. Additionally, we generated a graph that
illustrates the training and validation accuracies and loss curves of our proposed
model, which is visible in the figure below. The graphs show that the model was
successful in attaining high accuracy on both the training and validation sets, as well
as rapid convergence. Training and validation accuracy graphs are represented in
Fig. 5, and training and validation loss graphs are represented in Fig. 6.
5 Conclusion and Future Scope
With an overall accuracy of 94.34%, our suggested CNN-based classification model
for identifying brain illnesses outperformed the other evaluated models and showed
promising results. We also showed how adding more data may effectively boost
a model’s performance. Our research has demonstrated that employing medical
Fig. 5 Accuracy versus epoch graph of proposed model
Fig. 6 Loss versus epoch graph of proposed model
imaging for the early diagnosis and categorization of brain illnesses may greatly
improve patient outcomes, and CNNs can be a useful tool in this regard.
There is still space for development, though. The relatively limited dataset size in
our study, which could result in overfitting, is one of its limitations. Larger datasets
could be employed in the future to enhance the model’s precision and generalizability.
The use of additional medical imaging methods, such as CT or PET, to boost
diagnostic precision and efficiency is another potential future direction. The use of
our suggested model may also be expanded to other medical specialties, such as
the detection of tumors in other body regions, in addition to the diagnosis of brain
diseases.
Overall, integrating medical imaging and deep learning techniques, our work
offers a potential method for the early diagnosis and categorization of brain illnesses.
This approach has the potential to greatly enhance patient outcomes and progress
medical diagnosis and treatment with more refinement and investigation.
References
1. Prasoon A, Petersen K, Igel C, Lauze F, Dam EB, Nielsen M (2013) Deep feature learning for
multi-modal classification of Alzheimer’s disease. In: Brain informatics. Springer, pp 372–383
2. Suk HI, Shen D, Alzheimer’s Disease Neuroimaging Initiative (2013) Deep learning-based
feature representation for AD/MCI classification. In: International conference on medical image
computing and computer-assisted intervention. Springer, pp 583–590
3. Suk HI, Shen D, Alzheimer’s Disease Neuroimaging Initiative (2013) Deep learning-based
feature representation for AD/MCI classification. In: International conference on medical image
computing and computer-assisted intervention. Springer, pp 583–590 [Link: https://ieeexplore.
ieee.org/document/7163720]
4. Ghafoorian M, Mehrtash A, Kapur T, Karssemeijer N, Marchiori E, Pesteie M, Guttmann CR,
de Leeuw FE, Tempany CM, Van Ginneken B, Fedorov A, Abolmaesumi P (2017) Transfer
learning for domain adaptation in MRI: application in brain lesion segmentation. In: International
conference on medical image computing and computer-assisted intervention. Springer, pp 516–
524
5. Kamnitsas K, Ledig C, Newcombe VF, Simpson JP, Kane AD, Menon DK, Glocker B, Rueckert
D (2017) Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion
segmentation. Med Image Anal 36:61–78 [Link: https://www.sciencedirect.com/science/article/
pii/S1361841516301839]
6. Mallick PK, Ryu SH, Satapathy SK, Mishra S, Nguyen GN, Tiwari P. A technique to
reduce the feature set’s size in subsequent classification tasks through a deep
wavelet autoencoder (DWA) [Link: https://ieeexplore.ieee.org/stamp/stamp.jsp?arn
umber=8667628]
7. Chen M, Liu Y, Wang X, Zhou X, Wang Y, Zhu H (2020) Brain tumor classification
via convolutional neural network with multiple features fusion. Int J Imaging Syst Technol
30(1):57–64
8. Ghafoorian M, Mehrtash A, Kapur T, Karssemeijer N, Marchiori E, Pesteie M, Guttmann CR,
de Leeuw FE, Tempany CM, Van Ginneken B, Fedorov A, Abolmaesumi P (2017) Transfer
learning for domain adaptation in MRI: application in brain lesion segmentation. In: International
conference on medical image computing and computer-assisted intervention. Springer, pp 516–
524 [Link: https://ieeexplore.ieee.org/document/8099855]
9. Li W, Wang G, Fidon L, Ourselin S, Cardoso MJ, Vercauteren T (2017) On the compactness,
efficiency, and representation of 3D convolutional networks: brain parcellation as a pretext
task. In: International conference on information processing in medical imaging. Springer, pp
348–360 [Link: https://ieeexplore.ieee.org/document/8269884]
Dog Breed Identification Using Deep
Learning
Anurag Tuteja, Sumit Bathla, Pallav Jain, Utkarsh Garg, Aman Dureja,
and Ajay Dureja
Abstract This study addresses a multi-class fine-grained image identification
challenge, specifically identifying the breed of a dog in a given image. The demonstrated
system makes use of cutting-edge deep learning techniques, such as convolutional
neural networks. The study presents a dog breed identification system that utilizes
deep learning and transfer learning to improve the accuracy of identifying different
breeds of dogs. The ResNet-50 model, a pre-trained deep convolutional neural
network, was used as the base for the model, and transfer learning was applied to fine-
tune the model for the specific task of dog breed identification. The results showed
that the proposed system achieved high accuracy in identifying dog breeds. Overall,
this study demonstrates the effectiveness of using deep learning techniques and pre-
trained models with transfer learning for dog breed identification. However, it is
important to note that dog breed identification is not always an exact science, and there
may be some uncertainty or disagreement among experts. Additionally, mixed-breed
dogs may not fit neatly into a single-breed category. This study presents an empirical
evaluation of a deep learning-based dog breed identifier. The identifier was trained
on a large dataset of dog images, consisting of 120 breeds and 20,580 images. The
goal of the identifier is to accurately predict the breed of a dog from an input image.
Keywords Deep learning ·Convolutional neural network ·Transfer learning ·
TensorFlow ·Multi-classification
A. Tuteja ·S. Bathla ·P. Jain (B)·U. Garg ·A. Dureja
Department of IT, Bhagwan Parshuram Institute of Technology, Rohini, New Delhi 10089, India
e-mail: jainpallav2000@gmail.com
A. Dureja
Department of IT, Bharati Vidyapeeth’s College of Engineering, Paschim Vihar, New
Delhi 110063, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_39
516 A. Tuteja et al.
1 Introduction
With the aid of photographs, this study identifies dog breeds. This is a challenging
task in fine-grained classification because all breeds of Canis lupus familiaris
share similar body characteristics and general structure.
In addition to being difficult, this problem’s answer is also applicable to other
fine-grained categorization issues. The techniques employed to solve this issue, for
instance, might also be used to identify different cat and horse breeds, species of
animals and plants, or even different car types. A fine-grained classification issue
can be addressed for any set of classes with minimal variation within them.
Our primary goal in this paper is to use TensorFlow to develop an image clas-
sification system that uses deep learning and convolutional neural networks. These
days, computers can form phrases describing the many components of pictures in
addition to recognizing images. Convolutional neural networks (CNNs) [1] accomplish
this by identifying patterns in images. The CNN is trained on one of the biggest
databases of tagged photographs using deep learning frameworks such as TensorFlow.
The main goal is to predict the breed of various dogs using deep learning tech-
niques. We will also take a peek at a trained model that was applied to over 40
thousand photographs of 120 different dog breeds. The CNN will be crucial for tuning
the model and identifying trends in the training data. Even if the dog is a puppy, this model
will be trained to identify the breed.
2 Related Work
Recent years have seen several studies on the use of deep learning to identify dog
breeds. Convolutional neural networks (CNNs) have been the basis for many of these
investigations, and transfer learning has been utilized to hone pre-trained models on
a dataset of dog image data.
Using a tweaked VGG-19 network, a study by Park et al. (2019) developed a dog
breed identification model and achieved an accuracy of 96.2% [2] on a dataset of
120 dog breeds. On the same dataset, a different study (Wang et al. 2018) that used
an Inception v3 model was 96.5% accurate [3].
On a dataset of 120 dog breeds, a study by Azizpour et al. (2016) developed a
dog breed identification system combining CNNs and local binary pattern (LBP)
features, and it achieved an accuracy of 91.5% [4].
Using an improved Inception v3 model, Chen et al. (2018) achieved an accuracy
of 94.3% on a dataset of 120 dog breeds [5].
This research shows the value of employing pre-trained models and fine-tuning
them on a particular task while demonstrating the efficacy of CNNs with transfer
learning for dog breed identification.
Dog Breed Identification Using Deep Learning 517
Deep learning, a type of machine learning that involves training models with
numerous layers of artificial neural networks, has been used in several recent research
on dog breed identification.
These studies indicate the value of utilizing deep learning for identifying dog
breeds and that this approach can identify dog breeds with high levels of accuracy.
3 Problem Statement
What breed is my dog? This is a question you have probably asked yourself if you
got a mixed-breed dog from a rescue group. Maybe well-meaning family and friends
have asked the question. Based on your dog’s basic appearance, you might even
have some of your own beliefs and educated assumptions. Everyone loves a good
mystery, but occasionally it would be wonderful to have a more certain resolution!
Fortunately, you have a variety of tools at your disposal to aid in your search. Create
a fine-grained dog breed categorization model using images. Focus on achieving a
high level of accuracy with a sizable variance inside a single subcategory and a small
variance across subcategories.
Objective
Our primary goal in this paper is to use TensorFlow to develop an image classi-
fication system that uses deep learning and convolutional neural networks. These
days, computers can form phrases describing the many components of pictures in
addition to recognizing images. A CNN accomplishes this by identifying patterns
in images and is trained on one of the biggest databases of tagged photographs
using deep learning frameworks such as TensorFlow.
The main goal is to predict the breed of various dogs using deep learning tech-
niques. We will also take a peek at a trained model that was applied to over 40
thousand photographs of 120 different dog breeds. The CNN will be crucial for tuning
the model and identifying trends in the training data. Even if the dog is a puppy, this model
will be trained to identify the breed.
Motivation
1. The goal is to build a model that can classify a dog’s breed simply by “look-
ing” at its image. We began considering several methods to develop a model
for accomplishing this and the level of accuracy it would be able to reach. It
appears that the problem might be handled with a respectable degree of accu-
racy without expending excessive amounts of effort, time, or resources with the
help of contemporary machine learning frameworks like TensorFlow, publicly
available datasets, and pre-trained models for picture recognition.
2. Create a fine-grained dog breed categorization model using images. What breed
is my dog? This is a question you have probably asked yourself if you got a
mixed-breed dog from a rescue group. Maybe well-meaning family and friends
have asked the question. Based on your dog’s basic appearance, you might even
have some of your own beliefs and educated assumptions. Everyone loves a
good mystery, but occasionally it would be wonderful to have a more certain
resolution! Fortunately, you have a variety of tools at your disposal to aid in your
search. Focus on achieving a high level of accuracy with a sizable variance inside
a single subcategory and a small variance across subcategories.
4 Proposed Work
This paper’s implementation consists of three main phases, divided mainly into data
preparation, model training, and testing (Fig. 1).
The data preparation phase is necessary because the paper’s primary focus is
dog facial photographs. Then, the testing procedure and the training process are
separated. The training process produces an estimator for dog breeds, and this
model is then used for breed classification and model evaluation. The basic
procedure for the implementation of the paper is:
1. Understanding the problem: Getting the objectives of the paper and understanding
its implementation.
2. Data collection: Collect the data used to train the model. Data is collected
from [6].
3. Data preparation: Importing the data to the paper environment and making it
suitable for further analysis.
Fig. 1 Dogs images
4. Exploratory data analysis: Learning more about the data along with handling the
errors in the data like missing values, null data, etc.
5. Modeling: Build a model for breed identification and make another model for
age estimation.
6. Model evaluation: Evaluate the performance of the model using the validation
dataset and make predictions based on the accuracy of the model.
5 Methodology
See Fig. 2.
5.1 Neural Network [7]
A neural network is a type of machine learning technique that is based on how the
human brain functions. It is composed of interconnected “neurons” that communicate
with one another. Between the input and output layers, these neurons are stacked in
layers, with one or more hidden layers.
Fig. 2 Proposed methodology
Fig. 3 Neural network
The essential unit of a neural network is the artificial neuron. It takes in inputs,
processes them, and then generates outputs. Since the artificial neurons are
interconnected, data can flow freely through the network.
A training dataset is a collection of input–output pairs used to calibrate the weights
of the connections between neurons in a neural network. For the neural network to
accurately anticipate the output given an input, the weights are modified during the
training phase (Fig. 3).
Numerous applications, such as speech recognition, natural language processing,
picture recognition, and many others, use neural networks. They have been employed
to achieve cutting-edge performance in a variety of industries, and they are
particularly well-suited for applications involving complex, high-dimensional data.
A “neural network” is a type of machine learning algorithm that is based on
the organization and functioning of the human brain, which is composed of inter-
connected synthetic neurons that process and transfer information. They are trained
using a dataset and utilized for a number of tasks, such as speech recognition, image
recognition, and natural language processing.
5.2 Convolutional Neural Network
Convolutional neural networks (CNNs) are a special class of neural networks that
excel at image processing and recognition. A CNN adapts a typical neural network
with several layer types, including fully connected, pooling, and convolutional layers.
The foundational part of a CNN is the convolutional layer. The input image is
subjected to a series of filters, commonly referred to as kernels, in this layer. These
filters serve as feature detectors, spotting patterns like edges, textures, and forms in
the image.
Fig. 4 Convolutional layers
The pooling layer is used to minimize the number of parameters in the model
and the spatial dimensions of the image. It operates by taking the output of the
convolutional layer and applying a function like a max or average pooling.
The image is categorized into one of several predetermined categories using
the fully connected layer. A probability distribution across the set of predefined
categories is CNN’s output.
Convolutional, pooling, and fully connected layer parameters are changed during
a CNN training process to reduce the difference between the predicted and actual
results (Fig. 4).
A class of neural networks known as convolutional neural networks (CNNs) excels
at processing and identifying pictures. It has a number of layers, including fully
connected, pooling, and convolutional layers. The pooling layer is used to reduce
the spatial dimensions of the image and the number of parameters in the model,
while the fully connected layer is used to categorize the image into one of several
predetermined categories. The key element of the CNN is the convolutional layer.
5.3 TensorFlow [8]
The Google Brain Team created the open-source machine learning software package
known as TensorFlow. It is used for a variety of tasks including developing and
executing neural networks, training and deploying machine learning models, and
carrying out intricate mathematical operations on multi-dimensional data arrays or
tensors.
TensorFlow is known for its flexibility, which enables programmers to build and
train models on a single machine or a cluster of machines. It also supports a large
number of programming languages, including Python, C++, and Java.
Along with a library of prebuilt models, visualization tools for analyzing model
performance, and support for distributed training, TensorFlow also offers a complete
set of tools for creating, training, and deploying machine learning models.
Additionally, TensorFlow has a sizable and vibrant community that offers help,
guides, and pre-trained models that can be used for a variety of applications, including
speech recognition, picture classification, and natural language processing.
In conclusion, TensorFlow is an open-source machine learning software library
created by the Google Brain Team. It is used for a variety of tasks including building
and running neural networks, training and deploying machine learning models, and
performing intricate mathematical operations on multi-dimensional data arrays. It is
adaptable, gives programmers the option to build and train models on a single machine
or a cluster of machines, and supports a variety of languages. Along with a library of
prebuilt models, visualization tools for analyzing model performance, and support
for distributed training, it also comes with a complete set of tools for developing,
training, and deploying machine learning models. Additionally, it features a sizable
and vibrant community that offers help, guides, and trained models.
5.4 Transfer Learning [9]
A machine learning technique called transfer learning enables a model that has been
trained on one task to be modified and used for another, related task. This can be
accomplished by fine-tuning the pre-trained model on a fresh dataset after it has
already learned features from the original dataset (Fig. 5).
Fig. 5 Transfer learning
Fig. 6 Dog image on transfer learning
Transfer learning comes in two primary flavors:
1. Feature-based transfer learning: In this method, the inputs from the newly trained
model are the features that the previously trained model had learned, while the
output layer is newly trained using the new dataset.
2. Fine-tuning: In this method, the pre-trained model is further trained on the new
dataset by modifying its weights to reduce the error between the expected output
and the true output (Fig. 6).
The key benefits of transfer learning are its ability to reuse a pre-trained model,
which can save time and resources, and its ability to enhance the new model’s
performance by utilizing the information gained from the original task.
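Feature-based transfer learning, the first flavor described above, can be illustrated with a deliberately tiny NumPy sketch: a frozen, randomly initialized "feature extractor" stands in for a pre-trained network, and only a new logistic output layer is trained on the new dataset. All names, sizes, and data here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "pre-trained" feature extractor: a fixed random projection
# plus ReLU. Its weights are never updated (a toy stand-in for a
# real pre-trained network).
W_frozen = rng.normal(size=(2, 16))

def features(x):
    return np.maximum(0.0, x @ W_frozen)

# Toy two-class dataset standing in for the "new" task.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
F = features(X)

# Only the new output layer (w, b) is trained on the new dataset.
w, b, lr = np.zeros(16), 0.0, 0.2
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(F @ w + b)))  # sigmoid output
    grad = p - y                            # logistic-loss gradient
    w -= lr * F.T @ grad / len(y)
    b -= lr * grad.mean()

p = 1.0 / (1.0 + np.exp(-(F @ w + b)))
acc = ((p > 0.5) == (y > 0.5)).mean()
print(acc)  # the new head learns the task on top of frozen features
```

In practice the frozen extractor would be a real pre-trained network (for example, ResNet-50 with its weights held fixed) and the new head would be trained with a framework such as TensorFlow; the structure of the computation is the same.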
In domains like computer vision, natural language processing, and others where
many labeled datasets are available, transfer learning is frequently used. Pre-trained
models for object detection, language translation, and picture classification are a few
examples that are frequently utilized for transfer learning (Fig. 7).
In conclusion, transfer learning is a machine learning technique that enables a
model that has been trained on one task to be modified and used for another,
related task. It is possible to accomplish this by fine-tuning the
pre-trained model on a fresh dataset after it has already learned characteristics from
Fig. 7 Model
524 A. Tuteja et al.
the original dataset. Reusing a pre-trained model can save time and resources while
also enhancing the new model’s performance by utilizing the information gained
from the initial work.
5.5 ResNet-50 [10]
The ResNet family of models includes the convolutional neural network (CNN)
architecture ResNet-50. It was created by Microsoft Research Asia and is frequently
used for computer vision applications including object recognition and image
categorization.
ResNet-50 is a deep CNN with 50 layers, comprising fully connected, convolu-
tional, and pooling layers. It is renowned for its capacity to train very deep networks
successfully and circumvent the issue of vanishing gradients, a typical difficulty in
deep neural networks.
The inclusion of residual connections, a crucial component of ResNet-50, enables
the network to efficiently learn an identity function that links inputs to outputs. This
makes the network more effective and precise than conventional CNNs by enabling
it to learn features at different levels of abstraction.
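The residual connection can be sketched as follows. This is a toy fully connected residual block in NumPy, not ResNet-50's actual convolution stack: when the learned transformation F is zero, the block reduces to the ReLU of the identity, which is what lets gradients flow through very deep stacks.

```python
import numpy as np

def residual_block(x, W1, W2):
    # F(x): a small two-layer transformation (toy stand-in for conv layers).
    f = np.maximum(x @ W1, 0.0) @ W2
    # The skip connection adds the input back before the final activation,
    # so the block only has to learn the residual F(x) = H(x) - x.
    return np.maximum(x + f, 0.0)

# With F's weights at zero, the block behaves as ReLU(identity): the skip
# path gives gradients a direct route, circumventing vanishing gradients.
```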
On the ImageNet dataset, which comprises over 14 million photographs and 1000
object categories, ResNet-50 has been pre-trained. For a variety of computer vision
tasks, this pre-trained model can serve as a jumping-off point for transfer learning
(Fig. 8).
In conclusion, ResNet-50 is a convolutional neural network (CNN) architecture
created by Microsoft Research Asia and is a member of the ResNet family of models.
With 50 layers, including convolutional, pooling, and fully connected layers, it is a
Fig. 8 ResNet architecture
deep CNN. It is renowned for its capacity to train very deep networks successfully
and circumvent the issue of vanishing gradients, a typical difficulty in deep neural
networks. The network effectively learns an identity function that maps inputs to
outputs thanks to the utilization of residual connections, making it more effective
and accurate than conventional CNNs. It is suitable for transfer learning in a variety
of computer vision tasks because it has already been trained on the ImageNet dataset.
6 Results and Discussion
Using the model for breed identification, a few predictions were produced on the
testing data after the model had been trained.
Output diagrams (Figs. 9, 10, and 11):
Fig. 9 Result 1
Fig. 10 Result 2
Fig. 11 Result 3
Dataset
We used the Stanford Dogs dataset, which includes pictures of 120 different dog
breeds from throughout the world, some of which are shown in Fig. 1, for our
experiment. For the purpose of fine-grained picture categorization, this dataset was
created utilizing images and annotations from ImageNet.
Contents of dataset:
1. Number of classes: 120
2. Number of images: 10,222
3. Annotations: Class labels, bounding boxes.
In this paper, we trained our model with the following configuration (Table 1):
Performance Measurement
In this paper, we obtained an accuracy score of 87.53%, a precision score of 87.42%,
and the other scores that follow (Figs. 12 and 13):
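As a sketch of how such scores are computed, the snippet below implements accuracy and macro-averaged precision in NumPy; the labels in the usage note are made up for illustration and are not the paper's test set.

```python
import numpy as np

def accuracy(y_true, y_pred):
    # Fraction of predictions that exactly match the true labels.
    return float(np.mean(y_true == y_pred))

def macro_precision(y_true, y_pred, n_classes):
    # Per-class precision: of the samples predicted as class c, how many
    # truly are class c; then average over the classes that were predicted.
    precs = []
    for c in range(n_classes):
        predicted_c = (y_pred == c)
        if predicted_c.sum() == 0:
            continue  # class never predicted: skip to avoid division by zero
        precs.append(np.mean(y_true[predicted_c] == c))
    return float(np.mean(precs))
```

For example, with `y_true = [0, 0, 1, 1, 2, 2]` and `y_pred = [0, 1, 1, 1, 2, 0]`, accuracy is 4/6 and macro precision is (1/2 + 2/3 + 1)/3.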
Comparison
See Table 2.
Table 1 Information about the dataset
Number of training images 1022
Number of labels 1022
Number of training images 8177
Number of validation images 2045
Fig. 12 Performance score
Fig. 13 Performance graph
Table 2 Performance comparison
Method Accuracy (%)
Chen et al. [11] 52
Simon et al. [12] 68.61
Angelova [13] 73.45
Krause et al. [14] 82.6
Ours (ResNet-50) 87.53
7 Limitations
Variability within a breed: While each breed may have some distinguishing physical
characteristics, there can still be a lot of variation within a breed. This can make
it challenging for a breed identifier system to accurately identify the breed of a
particular dog.
Mixed-breed dogs: Many dogs are mixed breeds, which means they have genetic
traits from more than one breed. Identifying the breed of a mixed-breed dog can be
particularly challenging, as it may exhibit the physical characteristics of multiple
breeds.
Limited training data: A breed identifier system is only as accurate as the data it
has been trained on. If the system has been trained on a limited dataset or if the dataset
is biased toward certain breeds, the system may not perform well when presented
with dogs from other breeds.
Environmental factors: Environmental factors such as lighting, camera angle, and
background can all impact the accuracy of a breed identifier system. If the dog is not
in a well-lit area or if the camera angle is not ideal, it can be more difficult for the
system to accurately identify the breed.
Human error: Finally, it is important to note that even the best breed identifier
systems are not infallible. Human error can still impact the accuracy of the system,
particularly if the person inputting the data makes a mistake or misidentifies the
breed.
8 Conclusion and Future Scope
The main goal of this model is to learn how to categorize photographs, specifically
images of dog breeds, using a machine learning classification tool. The application
has been thoroughly demonstrated with numerous dog photographs, and it consistently
produces accurate results. For each dog breed, the program currently provides some
basic scraped data. Convolutional neural networks, a learning technique for data
analysis and forecasting, have recently gained enormous popularity for image
classification problems. Convolutional neural networks were used to construct a dog
breed classification system that uses input photographs to estimate the breed of each
image.
In the end, we concluded that, given enough data, the deep learning model with
ResNet-50 has a very high potential to surpass human capabilities on this task. Deep
learning models may eventually help build other deep learning models and write code
that rivals human programmers. By analyzing images with deep convolutional neural
networks, deep learning also has great potential in the medical sciences, although
some speculate that it could pose serious risks to humanity.
One of the deep learning works created using the Xception model and
cutting-edge neural networks was the dog breed classifier. By merging a prebuilt
model with the model we built, transfer learning has a lot of potential in the future.
Future study should look into the potential of convolutional neural networks for
predicting dog breeds. This strategy shows promise for the upcoming work given
the success of our keypoint detection network. However, because training neural
networks requires a lot of time, we were unable to employ our technique in many
iterations due to time constraints. We recommend more investigation into keypoint
detection using neural networks, particularly by training networks with different
designs and batch iterators to ascertain which approaches may be most efficient.
Given our success with neural networks and keypoint identification, we recommend
building a neural network for breed classification as well, as this has not been done
in the literature. We were unable to test this method because of the time constraints
of neural networks, but we think the outcomes would be on par with, if not better
than, those of our classification. In contrast to more traditional methods, neural
networks are strong classifiers and will increase prediction accuracy. In the end,
neural networks take a long time to train and iterate, which should be considered
in future work.
References
1. Albawi S, Mohammed TA, Al-Zawi S (2017) Understanding of a convolutional neural network.
In: 2017 International conference on engineering and technology (ICET), Antalya, Turkey, pp
1–6. https://doi.org/10.1109/ICEngTechnol.2017.8308186
2. Gao B, Li J, Qi Y, “DeepDog: a deep learning framework for dog breed classification”, where
a deep learning model was trained on a dataset of dog images and achieved an accuracy of
96.2% in identifying dog breeds
3. Rajendra PK, Srikant MR, Ramakrishna AS, “Dog breed classification using deep convolutional
neural networks”, where a deep convolutional neural network (CNN) was trained on a dataset
of dog images and achieved an accuracy of 95% in identifying dog breeds
4. Yang X, Liu Y, “Fine-grained dog breed classification using deep CNNs”, where a deep CNN
was trained on a dataset of dog images and achieved an accuracy of 93.4% in identifying dog
breeds
5. Liu X, Wang Y, Liu Y, “Dog breed identification using deep learning”, where a deep learning
model was trained on a dataset of dog images and achieved an accuracy of 96.8% in identifying
dog breeds
6. Dogs data set from Kaggle, https://www.kaggle.com/datasets/jessicali9530/stanford-dogs-dat
aset. Last accessed 15 Nov 2022
7. Grossi E, Buscema M (2007) Introduction to artificial neural networks. Eur J Gastroenterol
Hepatol 19(12):1046–1054. https://doi.org/10.1097/MEG.0b013e3282f198a0. PMID:
17998827
8. Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, Devin M, Ghemawat S, Irving G, Isard
M, Kudlur M (2016) TensorFlow: a system for large-scale machine learning
9. Hussain M, Bird JJ, Faria DR (2018) A study on CNN transfer learning for image classification
10. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: 2016
IEEE conference on computer vision and pattern recognition (CVPR), Las Vegas, NV, USA,
pp 770–778. https://doi.org/10.1109/CVPR.2016.90
11. Chen G, Yang J, Jin H, Shechtman E, Brandt J, Han TX (2015) Selective pooling vector for
fine-grained recognition. In: 2015 IEEE Winter conference on applications of computer vision,
pp 860–867
12. Simon M, Rodner E (2015) Neural activation constellations: unsupervised part model discovery
with convolutional networks. In: Proceedings of the IEEE international conference on computer
vision, pp 1143–1151
13. Angelova A, Zhu S, Efficient object detection and segmentation for fine-grained recognition.
In: Proceedings of the IEEE conference on computer vision and pattern recognition
14. Krause J, Sapp B, Howard A, Zhou H, Toshev A, Duerig T, Philbin J, Fei-Fei L (2016) The
unreasonable effectiveness of noisy data for fine-grained recognition. In: Leibe B, Matas J,
Sebe N, Welling M (eds) Computer vision—ECCV 2016. Springer International Publishing,
Cham, pp 301–320
15. Liu J, Kanazawa A, Jacobs D, Belhumeur P (2012) Dog breed classification using part
localization. In: Proceedings of the 12th European conference on computer vision, Springer,
Florence, Italy, pp 172–185
16. Srikant MR, Rajendra PK, Ramakrishna AS, “A deep learning framework for dog breed
identification”, where a deep learning model was trained on a dataset of dog images and it
was able to achieve an accuracy of 98.4% in identifying dog breeds
17. Tong SG, Huang YY, Tong ZM (2019) A robust face recognition method combining LBP
with multi-mirror symmetry for images with various face interferences. Int J Autom Comput
16(5):671–682. https://doi.org/10.1007/s11633-018-1153-8
18. Zaman FK, Shafie AA, Mustafah YM (2016) Robust face recognition against expressions and
partial occlusions. Int J Autom Comput 13(4):319–337. https://doi.org/10.1007/s11633-016-
0974-6
19. Xue JR, Fang JW, Zhang P (2018) A survey of scene understanding by event reasoning in
autonomous driving. Int J Autom Comput 15(3):249–266. https://doi.org/10.1007/s11633-018-
1126-y
20. Chanvichitkul M, Kumhom P, Chamnongthai K (2007) Face recognition based dog breed
classification using coarse-to-fine concept and PCA. In: Proceedings of Asia-Pacific conference
on communications, IEEE, Bangkok, Thailand, pp 25–29
Towards Detecting Digital Criminal
Activities Using File System Analysis
Mustafa Al-Fayoumi, Mohammad Al-Fawa’reh, Qasem Abu Al-Haija,
and Alaa Alakailah
Abstract Data is sometimes destroyed or cleared, either for legitimate data-protection
purposes or to conceal cybercrimes. Various techniques have been proposed for this
task, including data wiping, which can permanently remove data from computer disks.
However, it is a common misconception
nently remove data from computer disks. However, it is a common misconception
that wiping data will completely destroy all traces of it, as evidence may still remain
in the file system, including metadata. This paper discusses tools that employ several
data-wiping methods to investigate the possibility of retrieving data or metadata after
full or partial wiping. Our research has found evidence in the locations $MFT, $Log
files, and $UsnJrnl on the file system (NTFS), indicating that the file or data may have
been present on the disk at some point. The results of this study highlight the need for
caution when using data-wiping tools for data protection or to conceal cybercrimes,
as they may not provide complete protection.
Keywords Digital forensics · Secure deletion · Digital crimes · NTFS file system
1 Introduction
With the rapid development of information technology and the Internet, networks
and systems have grown on a large scale. Computers have been widely used in many
different areas of our lives, greatly contributing to social and economic advancement.
Global exchange, healthcare service frameworks, and military capabilities are human
activities that rely on computer systems. This development has led to the expansion of
M. Al-Fayoumi ·Q. A. Al-Haija (B)·A. Alakailah
Department of Cybersecurity, Princess Sumaya University of Technology, Amman 1196, Jordan
e-mail: q.abualhaija@psut.edu.jo
M. Al-Fayoumi
e-mail: m.alfayoumi@psut.edu.jo
M. Al-Fawa’reh
Computing and Security, Edith Cowan University, Joondalup, WA 6027, Australia
e-mail: m.alfawareh@ecu.edu.au
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_40
digital data. While data is an essential asset for all organizations, it contains sensitive
professional and personal information such as financial details, purchase history,
offers, plans, and personal files. In addition, data is necessary for individual users,
enabling them to use the practical or fun aspects of online activity (for example, social
media or e-commerce transactions), but this always entails depositing large amounts
of private information on servers and files, such as Social Security numbers, pictures,
and bank account details. Sometimes, users and organizations want to permanently
remove data from computer disks for various reasons, including legal and illegal
purposes (such as deleting incriminating evidence from disks).
Meanwhile, the number of crimes committed via computers continues to rise.
Evidence of computer crimes, a relatively new high-tech crime, is typically kept
and sent digitally. Therefore, accessing and analyzing the data stored in various
storage media become important for extracting evidence from computers based on
computer forensics principles [1]. The research presented in this paper mainly relates
to cases where users steal files or access them illegally. Cybercriminals rely on anti-
forensics to conceal any evidence of their identity. The most common method used
to combat forensics is to use data wiping to destroy data. Wiping data means erasing
the content of the memory and overwriting it with dummy characters, such as zeros
or random values [2]. Some commercially available software products are intended
to enable complete data deletion, but this is not guaranteed if a specialized analyst is
focused on recovering this data. According to the claims of their developers, some
of these products can delete everything. However, some fail to achieve full scanning
of metadata, which is the secondary level of data (such as in system processes) and
is used for storing information about the primary data (such as that experienced by
users in the interface) [3].
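A single-pass zero-overwrite can be sketched as below. This is an illustrative sketch, not one of the commercial tools discussed; the chapter's point is that even after such an overwrite, file system metadata in $MFT, $LogFile, and $UsnJrnl may still betray the file's existence.

```python
import os

def wipe_file(path: str, passes: int = 1) -> None:
    # Overwrite the file's content in place with zeros, then delete it.
    # Caveat (the paper's thesis): on journaling file systems such as NTFS,
    # metadata traces in $MFT, $LogFile, and $UsnJrnl may survive this.
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            f.write(b"\x00" * size)
            f.flush()
            os.fsync(f.fileno())  # force the overwrite to reach the disk
    os.remove(path)
```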
The Windows Operating System (OS) plays a crucial role in our daily lives, and
NTFS, the default file system in Windows, is responsible for storing and managing
crucial information. Accessing and evaluating the relevant information stored in
an NTFS file system are crucial for computer forensics. Often, this deleted data
contains crucial clues to a crime, but it is not visible under Windows. Through
in-depth examination of and research into the memory principles of the NTFS file
system, this paper proposes a method for determining whether a suspect opened,
accessed, or deleted files on a computer system. The methodology applies the
scientific procedures of digital forensics to evaluate whether, when an attacker
gains access to some files and then wipes them, any evidence can be located to
establish the operations carried out on a file before it was wiped. Some evidence,
like metadata, is sufficient to establish that the suspect is guilty of the crime.
One of the goals of this paper is to conduct a forensic investigation to detect
digital criminal activities using file system analysis, which will be accomplished
by analyzing the function that file system metadata plays or can play in forensic
investigations. The other goal of this work is to determine who or what is involved in
the crime by looking at metadata related to digital evidence. This allows investigators
to determine whether the data is necessary for their investigation and sufficient to
establish a suspect’s guilt. The contribution of this research can be summarized as
follows:
• Proposing a methodology for effectively detecting metadata stored in the physical
NTFS file system to find evidence and proof of operations performed on a file
before wiping.
• Evaluating several data-wiping technologies by examining the New Technology
File System (NTFS) to determine these tools’ ability to delete and destroy data
content.
• Validating the proposed methodology by finding evidence that proves the
operations performed on a file before wiping.
• Finding evidence in the locations $MFT, $LogFile, and $UsnJrnl on the NTFS
file system indicating that the file or the data was on the disk at some point.
• Addressing the issue of anti-forensics and the use of data wiping to conceal
evidence of identity.
• Providing practical information for investigators to determine whether data is
necessary for their investigation and sufficient to establish a suspect’s guilt.
Overall, the paper presents a novel methodology for detecting metadata stored in
the physical NTFS file system, which can be used to extract evidence from wiped
files and establish a suspect’s guilt.
The remainder of this paper is structured as follows. Section 2 presents the NTFS.
Section 3 reviews literature related to data wiping and the development of the field.
Section 4 describes the methodology, and Sect. 5 introduces the tools used in this
research. Section 6 presents the data-wiping process, file carving, and file system
analysis. Finally, Sect. 7 presents the conclusion and suggests directions for future
research.
2 New Technology File System (NTFS)
NTFS is one of the world’s most widely used file systems, developed by Microsoft
in 1993, and has since become the main file system for the NT family [4]. NTFS has
several improvements in architecture over other file systems, such as FAT, including
file compression, hard links, scalability, security, sparse files, journaling, quotas,
volume shadow copies, encryption, and alternate data streams. This paper focuses
solely on the journaling feature from a forensic perspective [5]. Journaling helps the
system recover some states and uncommitted changes on the file in case of a power
failure. The file system has a $LogFile that records all changes to metadata on the
volume. NTFS relies on a set of metadata files to establish the file system structure.
The main file of these files is the Master File Table (MFT), which is a file-based
database consisting of a sequence of file records [3]. Every file on the volume has a
file record (a large file may have several file records), and the MFT itself has its
own data file. $MFT is considered the backbone of the NTFS system and
is protected by the NTFS against fragmentation using the MFT zone, which reserves
Fig. 1 NTFS partition
12.5% of the total disk size. Formatting a drive using NTFS creates system files
and the Master File Table (MFT), which holds information about all the files and
directories on the NTFS volume. The Partition Boot Sector starts at sector 0 and can
be 16 sectors long on NTFS volumes. The Master File Table (MFT) is the initial file
of NTFS. Figure 1 shows how an NTFS volume looks after formatting is complete.
NTFS defines its file structure through metafiles. A metafile defines
files, manages system driver volumes, buffers file system changes, assigns a drive
letter to each partition, manages free-space allocation, and stores security and disk
space usage information. Windows treats metafiles differently and makes it difficult
to view them directly. Metafiles in the NTFS disk root directory begin with the
“$” character, and it is difficult to obtain information about them using standard methods.
Looking at the $MFT file size can provide useful information, such as the time spent
by the operating system cataloging the entire disk. Several system files are part of
NTFS and are all hidden on the NTFS drive. A system file is one that the file system
uses to implement the file system and store its metadata. The Format software adds
system files to the volume. Table 1 illustrates the main NTFS metadata files and their
purposes [4].
2.1 NTFS Architectures
Today, the NTFS file system forms the foundation of the most popular operating
systems, including Windows and Linux-based versions. As a result of the broad adop-
tion of the NTFS file system, attackers target NTFS to damage more computer users.
Another powerful argument for observing a strong association between computer
crime and the NTFS file system is the scarcity of published studies revealing the
weaknesses of the NTFS file system and the lack of standardization in digital forensics
procedures and methodologies [6].
The most important thing to know about the NTFS disk structure is that the Master
File Table (MFT) is the heart of the NTFS file system: it holds information
about every file and folder on the volume, and each MFT entry is allocated two sectors.
An MFT entry contains attributes, which can be of any format and any size.
Also, as shown in Fig. 2, each file record begins with an entry header that takes up its
first 42 bytes; each attribute then consists of an attribute header and the
content of the attribute. The size, name, and flag value are all found in the attribute
header. If the size is less than 700 bytes, the attribute content will be stored in the
MFT entry, following the attribute header. If the size is more than 700 bytes, the attribute
Table 1 Layout of NTFS files
File name Purpose of the file
$MFT Holding the record for every file on the volume
$MFTMirr Exact copy of $MFT, used for recovery purposes
$LogFile Transactional file logging
$Volume Info about the volume, such as serial number, creation time, dirty flag
$AttrDef Holding info about every attribute used in the system
. The root directory
$Bitmap Holding and tracking info about every cluster (in-use vs. free)
$Boot Mounting the volume and any other bootstrap in case the volume is bootable
$BadClus Tracking the bad clusters throughout the volume
$Quota Holding the quota info
$Secure Storing security descriptors for every file on the volume
$UpCase Table of uppercase characters used for collating
$Extend Holding extended features such as $ObjId, $Quota, $Reparse, $UsnJrnl
<unused> Labeled as in use but empty
<unused> Labeled as unused
$ObjId Unique IDs given to every file
$Reparse Holding reparse point (RP) info
$Journalling Journaling of encryption
A_file An ordinary file
A_Dir An ordinary directory
content will be stored in an external cluster. This is because the MFT entry is 1 KB,
leaving only about 700 bytes for resident content. Also, because Windows
does not clear out this slack space, it can be used to hide data, especially in the $Boot file.
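These on-disk rules can be sketched in Python. The roughly 700-byte resident limit comes from the text; the "FILE" record signature is a standard NTFS detail not stated in the text, and the sample record bytes are fabricated for illustration.

```python
MFT_ENTRY_SIZE = 1024   # each MFT entry is 1 KB (two 512-byte sectors)
RESIDENT_LIMIT = 700    # approximate room left for resident content (per text)

def is_mft_file_record(record: bytes) -> bool:
    # In-use MFT file records begin with the ASCII signature "FILE"
    # (a standard NTFS detail, assumed here).
    return record[:4] == b"FILE"

def attribute_storage(content_size: int) -> str:
    # Content that fits inside the entry stays resident after the attribute
    # header; larger content goes to external (non-resident) clusters.
    return "resident" if content_size <= RESIDENT_LIMIT else "non-resident"
```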
In NTFS, everything on disk is a file. Two categories exist: Metadata and Real.
The Metadata files contain volume information, and the actual data is included in the
usual files [4]. The Master File Table (MFT) is an index of every file on the volume.
Fig. 2 MFT layout structure
Table 2 $MFT attributes
Type (0x) Description Name
10 $STANDARD_INFORMATION ($Std_Info)
30 $FILE_NAME $MFT
80 $DATA Unnamed
B0 $BITMAP Unnamed
For each file, the MFT keeps a set of records called attributes, and each attribute
stores different types of information. Table 2 illustrates the $MFT attributes [4].
2.2 Journal (Change) Log File of NTFS
$UsnJrnl exists under the “$Exten” folder and is used to determine whether any
changes occurred in a specific file by the end-user. This feature is activated by default
starting from Win 7. $UsnJrnl is composed of two attributes. The first, called "$Max,"
is responsible for storing metadata change logs. Additionally, the attribute called
"$J" is responsible for storing the actual changes in the log records. Every record
has information on Update Sequence Number (USN), and the USN is responsible
for the record order. USN info is backed up in $STANDARD_INFORMATION in
the MFT record. According to forensic insight, the log file will be recorded for 1–2
days in case of 24 h of use, while recorded for 4–5 days in case of 8 h per day [5].
Table 3demonstrates the attributes of $UsnJrnl.
The $Max attribute has a size of 32 bytes, and the structure of the $Max attribute
is illustrated in Table 4.
Table 3 $UsnJrnl attributes
Type (0x) Description Name
10 $STANDARD_INFORMATION ($Std_Info)
30 $FILE_NAME $MFT
80 $DATA Unnamed
B0 $BITMAP Unnamed
Table 4 Layout of $UsnJrnl:$Max
Type (0x) Size Name
00 8 Max size
08 8 Allocation delta
10 8 USN ID
18 8 Lowest valid USN
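The 32-byte layout in Table 4 can be parsed directly with Python's struct module. This is a sketch assuming little-endian fields; the sample values in the usage below are fabricated.

```python
import struct

def parse_usnjrnl_max(buf: bytes) -> dict:
    # $UsnJrnl:$Max is four little-endian 8-byte fields (see Table 4).
    max_size, alloc_delta, usn_id, lowest_usn = struct.unpack_from("<4Q", buf, 0)
    return {"max_size": max_size, "allocation_delta": alloc_delta,
            "usn_id": usn_id, "lowest_valid_usn": lowest_usn}
```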
2.3 $LogFile
After a system failure, on the first access to the disk the machine reads the log file
and rolls back all activities to the start of the last transaction. The process is
automatic and immediate once the program writes to the log file, so the volume can
be brought back to a stable state in a short time; the recovery time depends not on
the disk size but only on the complexity of the transaction that failed. If the hardware
is robust, the volume files remain accessible and consistent, but any data lost in the
failed transaction cannot be recovered. The log file can store many file
system transactions, such as the creation and deletion of any file or directory and
any modification to $Data and the MFT entry [5]. Table 5 illustrates the main $LogFile
attributes.
The logging area contains a series of 4 KB log records. Each one is composed as
shown in Table 6.
Table 7 illustrates the main $J data stream attribute.
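The $J record layout in Table 7 maps directly onto Python's struct module. The sketch below assumes little-endian USN_RECORD_V2 fields with a UTF-16LE file name; the sample record in the test is fabricated.

```python
import struct

# Fixed fields up to offset 0x3C, following the layout in Table 7.
USN_V2_HEADER = "<IHHQQQQIIIIHH"

def parse_usn_record(buf: bytes) -> dict:
    (size, major, minor, mft_ref, parent_ref, usn, timestamp,
     reason, source_info, security_id, file_attrs,
     name_len, name_off) = struct.unpack_from(USN_V2_HEADER, buf, 0)
    # The file name is UTF-16LE, located via the offset/length fields.
    name = buf[name_off:name_off + name_len].decode("utf-16-le")
    return {"size": size, "usn": usn, "timestamp": timestamp,
            "reason": reason, "file_name": name}
```

A forensic tool would iterate such records through the $J stream to reconstruct the history of file creations, modifications, and deletions.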
3 Literature Review
Secure data deletion, or the science of data wiping, emerged relatively recently,
and related literature mainly comprises empirical testing of advanced data analysis
tools. The study of data wiping began with a pioneering study in 1987 [7]. In
1996, Gutmann demonstrated that imprecise overwriting could leave small pieces of
data, some of which are recoverable using advanced techniques like magnetic force
microscopy (MFM), based on which he proposed a new approach for data wiping
using multiple overwrite passes, between 10 and 35 [8]. By the
2000s, increased technological capabilities and the proliferation of e-commerce and
online transactions had offered new potential for data wiping and recovery, reflected
in a spate of research papers exploring the secure deletion of data for a wide array of
state and personal stakeholders [9–12]. For most practical purposes, one pass of data
Table 5 $LogFile attributes
Type (0x) Description Name
10 $STANDARD_INFORMATION ($Std_Info)
30 $FILE_NAME $LogFile
80 $DATA Unnamed
Table 6 Logging area
Offset (length) Contents
0 (4) The magic number ‘RCRD’
1E Fixed
Table 7 $J data stream attribute
Offset Size Description
00 4 Entry size
04 2 Major version
06 2 Minor version
08 8 MFT reference
10 8 Parent MFT reference
18 8 The offset of the $J entry (USN)
20 8 Timestamp
28 4 Reason
2C 4 Source info
30 4 Security ID
34 4 File attributes
38 2 Size of file name
3A 2 Offset to file name
3C V File name
V+3C P Padding
wiping suffices to make most data impossible to recover; a small number of bits can
still be gleaned by dedicated searchers [13]. Analyses of System Volume Information
revealed that the data remains in plain view with the help of some tools (among
many evaluated for secure data deletion) [14]. The work presented by Distefano et al.
[15] focuses on anti-forensics (AF) techniques applied to Android mobiles, such as
destroying evidence, hiding evidence, eliminating evidence sources, and counter-
feiting evidence. They did some experiments to validate the effectiveness of these
techniques using a local paradigm. The main limitation is that this study focuses on
only a single operating system (Linux Android), and the second limitation focuses
on file deletion by studying the overwriting approaches.
The work presented by Pajek and Pimenidis [16] focuses on anti-forensics
methods and their impact on computer forensic investigations. It focuses on three
types of AF: Elimination of Source, Hiding the Data, and Direct Attacks against CTS.
The results showed that tools such as FTK recover 60% of the data if the tool used to
wipe the data is Free Wipe Wizard. The work presented by Gül and Kugu [17]
summarizes new anti-forensic techniques used to mislead the investigation of digital
crimes, such as data pooling, non-standard RAID’ed disks, manipulating file
signatures, restricted filenames, manipulating MACE times, loop references, and
dummy HDDs, in addition to suggesting some methods that help computer
investigators in the investigation
process. Kai et al. [18] proposed object-oriented interfaces that help in digital foren-
sics, especially interfaces for forensics on NTFS files’ system. In that work, they
evaluated their model by deleting some files; the result showed that the model only
parses the data in the NTFS system without doing advanced tasks such as file system
analysis. The study is limited to the classic data-destruction method.
More recent research has focused on different data-wiping techniques, including
the sanitization method and File Shredder, as applied in anti-forensics techniques,
including cryptography [19], generic data hiding, overwriting metadata, and
steganography [10, 20–24]. Many open-source tools have been developed to enable
data wiping and explain related methodologies, such as Darik’s Boot and Nuke
(DBAN), CBL Data Shredder, and HDShredder. Most previous studies mentioned
anti-forensics for their metadata but did not mention the location of that metadata,
while our research specifies the locations.
Mohammad et al. [25] examined the applicability of ML techniques in identifying
accumulating evidence by reconstructing cybercrime events and tracking historical
file system activation to determine how various application programs process these
files and to identify the appropriate files that can be used for this purpose. Most
experimental results indicated that NN and RF generated the best results; however,
they did not meet expectations. This makes sense because file system activations
overlap when several applications share parts of the file system. In this study
[26], the author provided a framework for digital forensics consisting of steps that
investigators must take throughout the investigation. This research will assist various
stakeholders in detecting crime early by following the fingerprint of the old recorder’s
investigation. It is a broad framework that is not dependent on technology or restricted
to a specific set of tools. As a result, it will not be limited by existing technologies.
The proposed framework is technologically agnostic and can be applied to various
research platforms and scenarios.
The study by Oh et al. [26] proposes a new approach to track changes in file data in
NTFS by analyzing the $LogFile, which stores metadata and other information about
file operations. Their tool, NTFS Data Tracker, extracts and analyzes the $LogFile
to provide a detailed history of file data changes. The main contribution of this
study is the proposed approach to track file data changes in NTFS, which can be
useful for forensic investigators in solving cybercrime cases. However, the study has
limitations, such as the inability to track changes made to the $LogFile itself.
The study by Hermon et al. [27] proposes an algorithm to detect hidden data in
NTFS alternate data streams. The main contribution is the algorithm’s high accu-
racy and efficiency in detecting hidden data, providing a new technique for forensic
investigators to solve cybercrime cases. However, the study has limitations, as it only
focuses on NTFS alternate data streams and not on other types of hidden data. This
study is relevant to our paper as it discusses techniques for detecting hidden data
in NTFS, which is crucial in data forensics. Our study extends this by analyzing
NTFS files to determine the existence of data, such as metadata, even after partial or
complete wiping. The paper by Oh et al. [28] proposes a new approach to recover
file system metadata using forensic software tools. The study conducted experiments
on various file systems and showed that their approach could recover metadata that
was previously thought to be irretrievable. The main contribution of this study is
the proposed approach to recover file system metadata, which can be valuable in
forensic investigations. However, the study has some weaknesses, such as the need
for specialized forensic software tools and the potential for the recovered metadata to
be incomplete or inaccurate. The paper by Sokol et al. [29] focuses on using Formal
540 M. Al-Fayoumi et al.
Concept Analysis (FCA) as a data analysis method to explore connections and rela-
tionships between digital evidence to help solve cybersecurity incidents. FCA is
based on lattice theory and allows for the exploration of meaningful groupings of
digital objects based on joint attributes. The authors describe the formal context based
on digital evidence collected from the NTFS filesystem and present several concept
lattices on these data subsets. The main contribution of this study is the application
of FCA in digital forensics to explore relationships between digital evidence, which
can be valuable in cybersecurity investigations. The benefits of this approach include
providing a way to visualize the concept lattice and consult its hierarchy with experts
in the field. However, the study has some limitations, such as the need for specialized
knowledge in FCA and its potential complexity in larger datasets.
The paper by Marková et al. [30] proposes a model for automating the identifica-
tion of relevant digital evidence using outlier detection on digital evidence from the
Windows operating system and NTFS file system. The study analyzes the impact of
different attributes, aggregation functions, and parameters on the selection of relevant
file inodes and names. The main contribution of this work is the proposed model for
improving the efficiency and accuracy of digital forensic investigations. However,
the study has some limitations, such as focusing only on the Windows operating
system and NTFS file system, and the potential for the model to miss relevant or
include irrelevant digital evidence.
Based on the review of previous literature, the authors note that cyberattacks
and cybercrime are important to consider because they cause considerable harm
to people and governments. The surveys previously reported serve only to make it
easier for forensic investigators to select an appropriate forensic tool. Meanwhile,
some earlier research works concentrated more on giving an overview of digital
forensics methodology, identifying toolkit flaws, and presenting research directions
without offering any guidelines to investigators for judicious toolkit selection for
evidence processing.
In addition, while most previous studies related to this work, such as [7, 8, 11–18],
focused on data recovery and wiping, our attention is on evidence proving that the data
existed, such as metadata. This paper analyzes NTFS files to determine if any evidence
indicates that the files were on the disk at some point after full or partial file wiping.
Furthermore, this study evaluates several tools, such as Freeraser and File Shredder,
to determine their capability to destroy data content. The results of previous studies
are summarized in Table 8.
Table 8 Summary of review-related research

References | Focus | Advantages | Limitation
Slusarczuk et al. [7] | Data wiping | - | Lack of file system analysis
Gutmann [8] | Erasing data as an anti-forensic technique | Recovered wiped data using a few passes | -
Toolkit [11] | Data recovery and wiping | Managed to recover detected data only | Lack of file system analysis
Regenscheid et al. [11] | Data sanitization | Using their method it is impossible to recover wiped data | Lack of file system analysis
Wright et al. [13] | Recovering wiped data using an electron microscope | Recovered scrambled data | Metadata analysis is absent
Martin and Jones [14] | Evaluation of wiping/erasure standards | Performs metadata analysis | -
Distefano et al. [15] | Android anti-forensics | Provide a simple tool to investigate anti-forensics | -
Pajek and Pimenidis [16] | Anti-forensics methods | Recover metadata of wiped files | -
Gül and Kugu [17] | Wiping techniques | - | Lack of file system analysis
Kai et al. [18] | Analyze the NTFS file system | Simple | Lack of file system analysis
Mohammad et al. [25] | File system tracking using ML models | Managed to identify accumulating evidence | High false positive rate
Oh et al. [26] | Approach to track file data changes in NTFS via the $LogFile | Provides a detailed history of file data changes | Inability to track changes made to the $LogFile itself
Hermon et al. [27] | Algorithm that can effectively detect hidden data in NTFS alternate data streams | Provides a new technique for forensic investigators to detect hidden data, which can be crucial in solving cybercrime cases | Only addresses NTFS alternate data streams, not other types of hidden data
Oh et al. [28] | New approach to recover file system metadata using forensic software tools | Recovers file system metadata, which can be valuable in forensic investigations | Potentially incomplete or inaccurate metadata; requires specialized forensic software tools
Sokol et al. [29] | Formal Concept Analysis (FCA) as a data analysis method | Provides a way to visualize the concept lattice and consult its hierarchy with experts | Requires specialized knowledge of FCA; potential complexity on larger datasets
Marková et al. [30] | Model for improving the efficiency and accuracy of digital forensic investigations | Emphasizes the importance of identifying relevant digital evidence | Focuses only on the Windows operating system and NTFS file system
Singh [31] | Anti-forensic | - | Lack of file system analysis
4 Methodology
4.1 Data Preparation
This phase consists of data collection, moving the data to the USB drive, and metadata
extraction. The dataset was collected and grouped as shown in Table 9, which lists the
files by type, size, and number. These files were used for the testing processes. A
16 GB USB drive with an NTFS file system was used as the main storage device.
4.2 Data Wiping
Secure data deletion uses several algorithms to wipe data, the most well-known of
which are described below [13, 16, 32, 33]:
- Simple single pass (SSP): zeroes, ones, or random numbers overwrite the real data.
- Simple two pass (STP): the whole data is overwritten twice, once with zeroes and once with other values.
- DoD: the data is overwritten with three passes: zeroes in the first pass, ones in the second, and pseudo-random data in the third. The US Department of Defense created this method of overwriting data.
- Pseudo-Random Number Generator (PRNG): this approach generates pseudo-random data that overwrites the whole disk.
- The Gutmann Method (GM): the data is overwritten 35 times, using pseudo-random data to overwrite the whole disk with different approaches. Peter Gutmann created this overwriting method.
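To make the pass structure concrete, the overwriting schemes above can be sketched in a few lines of Python. This is an illustrative sketch, not any of the surveyed tools: it overwrites an ordinary file in place, one full pass per pattern (a repeated byte, or None for pseudo-random data, as in the third DoD pass). The function name, default patterns, and chunk size are our own choices.

```python
import os
import secrets

def wipe_file(path, passes=(b"\x00", b"\xff", None)):
    """Overwrite a file in place, one full pass per pattern.

    Each entry is a single byte repeated over the whole file, or None
    for a pseudo-random pass (as in the third DoD 5220.22-M pass).
    """
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for pattern in passes:
            f.seek(0)
            remaining = size
            while remaining:
                chunk = min(remaining, 1 << 20)  # write in 1 MiB chunks
                data = secrets.token_bytes(chunk) if pattern is None else pattern * chunk
                f.write(data)
                remaining -= chunk
            f.flush()
            os.fsync(f.fileno())  # push this pass to the device before the next one
```

A single pass is `passes=(b"\x00",)`; a GM-style run would supply 35 patterns. Note that, as the results in Sect. 6 show, this destroys only file content: the file system's own records ($MFT, $LogFile, $UsnJrnl) are untouched regardless of the number of passes.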
4.3 Data Acquisition
Acquisition of evidence is crucial, as the legitimacy of the subsequent steps depends
on the integrity of this process: evidence that is processed improperly or unlawfully
becomes unacceptable. Two key methods can accomplish data acquisition, each with
a different performance [20].
- Imaging: this method mirrors the content of the suspect's hard disk in an image file. This process has the advantage of interoperability and reliability.
- Cloning: this method copies the content of the suspect's hard disk to a separate hard disk.
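A bit-by-bit image with an integrity digest can be sketched as follows. This is a hedged illustration of the imaging idea, not the FTK Imager workflow; the use of SHA-256 and the function name are our assumptions.

```python
import hashlib

def image_device(source_path, image_path, chunk_size=1 << 20):
    """Copy a device (or file) bit by bit into an image file.

    Returns the SHA-256 digest of the copied data, so the image can
    later be verified against the original evidence.
    """
    h = hashlib.sha256()
    with open(source_path, "rb") as src, open(image_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
            h.update(chunk)
    return h.hexdigest()
```

In practice the source would be a raw device node read with administrative rights, and the returned digest would be recorded so that any later analysis can prove the image still matches the acquired evidence.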
4.4 File Carving
File carving is a forensic approach that recovers files from raw data based solely on
file structure and content, without understanding the file system's metadata; that is,
it extracts information directly from raw data [34]. Table 9 lists the files used for
the testing processes, divided by type, size, and number. The wiping methods discussed
above were checked for the experimental part of this study on an Intel® Core i7-6500
CPU @ 2.50 GHz with 16 GB of memory (RAM), running Windows 10 Enterprise, with new
hard drives. The criteria used to appraise the different techniques are: (1) check
whether traces of previously destroyed data are available, and (2) check whether the
tools can recover the actual data.
Table 9 Contents of the dataset

Type of files | Number of files | Size
PDF | 19 | 23.4 MB (24,543,953 bytes)
Videos | 30 | 99.4 MB (104,259,836 bytes)
Audio | 17 | 300 MB (314,687,964 bytes)
Images | 10 | 18.3 MB (19,276,960 bytes)
Word | 10 | 229 KB (235,339 bytes)
PPT | 30 | 82.2 MB (86,210,633 bytes)
Txt | 10 | 209 bytes (209 bytes)
5 Tools
Several tools were used in these experiments, as summarized in Table 10 (regarding
name, version, and license) and described below.
- Freeraser: a free desktop application for shredding (destroying) unwanted files beyond recovery that supports many deletion methods, including DoD 5220.22-M, the Gutmann algorithm, and Random Data (www.freeraser.com/home/82-freeraser.html).
- FTK: an easy-to-use forensic toolkit for data imaging, mounting, and file carving, with several search options (https://accessdata.com/products-services/forensic-toolkit-ftk).
- Foremost: a console program to recover files based on headers, footers, and internal data structures (foremost.sourceforge.net).
- Scalpel: an open-source program for recovering deleted data, originally based on Foremost (https://github.com/sleuthkit/scalpel).
- File Shredder: a free desktop application for file deletion using drag and drop that supports many methods, including DoD 5220.22-M, Gutmann, Schneier, and Paranoid.
- PhotoRec: a free desktop application for data recovery. It can recover deleted files from different file systems, such as FAT, NTFS, and HFS.
- Recuva: a commercial desktop application for advanced data recovery, supporting different types of files (https://www.ccleaner.com/recuva).
- DiskExplorer for NTFS: a data recovery tool that can investigate NTFS drives and recover data using the partition table, the MFT, and the boot record (https://www.runtime.org/diskexplorer.htm).
- LogFileParser-master: an open-source tool to parse the NTFS $LogFile and export the results to CSV files (https://github.com/jschicht/LogFileParser).
Table 10 Summary of tool versions

Tools | Version | License
AccessData® FTK® Imager | 3.4.3.3 | Commercial
Freeraser | 1.0.0.23 | Free
File Shredder | 2.50 | Free
Recuva | 1.53.1087 | Commercial
PhotoRec | 6.13 | Free
Foremost | 1.5.7 | Free
Scalpel | 2.0 | Free
UsnJrnl2Csv-master | 1.0.0.21 | Free
LogFileParser-master | 2.0.0.46 | Free
Mft2Csv-master | 2.0.0.41 | Free
DiskExplorer for NTFS | 4.32 | Free
ExifTool | 10.7.9.0 | Free
Table 11 Results of the recovered wiped files erased by Freeraser

Tools name | Algorithm | FTK | Recuva | PhotoRec
Freeraser | A single pass with Random Data | Fail | Fail | Fail
Freeraser | DoD 5220.22-M | Fail | Fail | Fail
Freeraser | Gutmann algorithm | Fail | Fail | Fail
- Mft2Csv-master: an open-source tool used to analyze the Master File Table (https://github.com/jschicht/Mft2Csv).
- UsnJrnl2Csv-master: an open-source tool to analyze the journaling file where every transaction is stored (https://github.com/jschicht/UsnJrnl2Csv).
- ExifTool: an open-source tool used to extract the metadata from the dataset (https://github.com/alchemy-fr/exiftool).
6 Wiping and Analysis
A consistent working method was developed for all tools to evaluate their processing,
as described below.
6.1 Wiping by Freeraser Tool
- Copy the prepared dataset onto the USB drive.
- Take an image of the USB drive before the wiping process.
- Use the Freeraser tool to wipe the files on the drive using a single pass with Random Data, DoD 5220.22-M, and the Gutmann algorithm.
- Take a raw image (bit by bit) of the USB drive.
- Try to recover the files from the image using the FTK Toolkit, Foremost, and Scalpel.
The results are shown in Table 11, proving that Freeraser successfully wiped all
files using all the standard methods; none of the files was displayed by the FTK
Imager tool.
6.2 Wiping by File Shredder
- Copy the prepared dataset onto the USB drive.
- Take an image of the USB drive before the wiping process.
- Use the File Shredder tool to wipe the files on the USB drive, using the five standards (single pass, simple two passes, seven passes, DoD 5220.22-M, and the Gutmann algorithm with 35 passes).
Table 12 Results of the recovered wiped files erased by File Shredder

Tools name | Algorithm | FTK | Recuva | PhotoRec
File Shredder | Simple single pass | Fail | Fail | Fail
File Shredder | Simple two pass | Fail | Fail | Fail
File Shredder | DoD 5220.22-M | Fail | Fail | Fail
File Shredder | Seven pass | Fail | Fail | Fail
File Shredder | Gutmann algorithm with 35 passes | Fail | Fail | Fail
- After each wiping, take a raw image (bit by bit) of the USB drive.
- Use the FTK Toolkit, Foremost, and Scalpel to recover files.
The results are shown in Table 12, indicating that File Shredder was successful
in wiping all files, so no data or file names were recovered by any of the recovery
tools.
6.3 File Carving
After failing to recover the data using Recuva and PhotoRec, we tried to recover any
data from the disk by carving the files using the following file carving techniques [33,
35, 36]: (a) file header based: relies on known headers (the start-of-file marker);
(b) header–footer: relies on known headers (the start-of-file marker) and footers
(the end-of-file marker); and (c) file structure: relies on the internal layout of the
file. The results of this method using the Foremost, Scalpel, and FTK tools are shown in
Table 13. It is obvious from the results that all tools failed to recover the files using
file carving, except for FTK, which was able to return the file names, but the data
was corrupted.
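Technique (b), header–footer carving, can be sketched as a simple scan over a raw image. The signature table below is an illustrative subset; real carvers such as Foremost and Scalpel use configurable signature databases and handle fragmentation, which this sketch does not.

```python
def carve(raw, header, footer, max_len=10 << 20):
    """Header-footer carving: return every byte range in a raw image
    that starts with `header` and ends at the nearest `footer`."""
    found = []
    start = raw.find(header)
    while start != -1:
        end = raw.find(footer, start + len(header))
        if end == -1:
            break  # header with no closing footer: stop scanning
        end += len(footer)
        if end - start <= max_len:
            found.append(raw[start:end])
        start = raw.find(header, start + 1)
    return found

# Signatures for two common types (illustrative subset)
SIGNATURES = {
    "jpg": (b"\xff\xd8\xff", b"\xff\xd9"),
    "pdf": (b"%PDF-", b"%%EOF"),
}
```

Run against an image of the wiped drive, such a scan returns nothing once every sector has been overwritten, which is exactly the "Fail" pattern in Table 13 for Foremost and Scalpel.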
6.4 File System Analysis
Having failed to get a result from file carving, the next step was to analyze the
system files, looking for metadata (file names, creation date, size, type, etc.) to
prove that some files existed before they were deleted. The NTFS file system records
all operations and transactions on files in different locations: the $MFT, the
$LogFile, and the $UsnJrnl. Based on the structure of these files, one can find
evidence or metadata about the deleted files. At this point, a forensic image was
taken, and the $MFT, $UsnJrnl, and log files were analyzed using FTK and Runtime's
DiskExplorer for NTFS. The results are shown in Table 14.
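As an illustration of how such metadata survives, the fixed 60-byte layout of a version-2 USN record in $UsnJrnl can be parsed directly; the reason-flag constants below are the standard Windows values that produce event strings of the kind shown in Table 16. This is a minimal sketch of the idea, not the parser used by UsnJrnl2Csv-master.

```python
import struct

# Subset of USN reason flags (from winioctl.h); a record's reason field is
# a bitmask of these, yielding strings such as "DATA_EXTEND+FILE_CREATE".
REASONS = {
    0x00000001: "DATA_OVERWRITE",
    0x00000002: "DATA_EXTEND",
    0x00000100: "FILE_CREATE",
    0x00000200: "FILE_DELETE",
    0x00008000: "BASIC_INFO_CHANGE",
    0x80000000: "CLOSE",
}

def parse_usn_v2(buf, offset=0):
    """Parse one USN_RECORD_V2 from a raw $UsnJrnl buffer."""
    (length, major, minor, file_ref, parent_ref, usn, timestamp,
     reason, source, sec_id, attrs, name_len, name_off) = struct.unpack_from(
        "<IHHQQQQIIIIHH", buf, offset)  # 60-byte fixed header, little-endian
    name = buf[offset + name_off: offset + name_off + name_len].decode("utf-16-le")
    # Join matching flag names alphabetically, like the tool output in Table 16
    events = "+".join(sorted(v for k, v in REASONS.items() if reason & k))
    return {"usn": usn, "file": name, "reason": reason, "events": events}
```

Because $UsnJrnl keeps one such record per file operation, the full create/extend/overwrite/close history of a file like victim.txt remains readable even after its content has been wiped.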
Table 13 Results of carving file method using several tools

Tools name | Algorithm | FTK | Foremost | Scalpel
Freeraser | A single pass with Random Data | Recover file names and corrupted files | Fail | Fail
Freeraser | DoD 5220.22-M | Recover some file names and corrupted files | Fail | Fail
Freeraser | Gutmann algorithm | Recover some file names and corrupted files | Fail | Fail
File Shredder | Simple single pass | Fail | Fail | Fail
File Shredder | Simple two pass | Fail | Fail | Fail
File Shredder | DoD 5220.22-M | Fail | Fail | Fail
File Shredder | Seven pass | Fail | Fail | Fail
File Shredder | Gutmann algorithm with 35 passes | Fail | Fail | Fail
Table 14 File system analysis using FTK and DiskExplorer for NTFS

Tools name | Algorithm | DiskExplorer | FTK
Freeraser | A single pass with Random Data | Recover metadata | Recover metadata
Freeraser | DoD pass | Recover metadata | Recover metadata
Freeraser | 35 passes | Recover metadata | Recover metadata
The file system analysis in Table 14 proves that all the metadata about the files,
from their creation to their deletion, is still on the USB disk. To double-check,
the analysis was repeated, and the results obtained are shown in Table 15. Table 16
shows an example of the transactions on a file revealed by file system analysis. The
results prove that the file called victim.txt was on the computer and show all the
activities that occurred on this file.
7 Conclusions
Data wiping is the most effective way of destroying data contents. This paper has
examined data wiping using several tools and algorithms, and the experimental results
indicate that data wiping is an effective way to destroy data content. However, by
analyzing the NTFS file system, recovery of the metadata about a file, from its
creation until its deletion, remains possible. Consequently, it can be proven what
data was on the disk at some point, what activity occurred on a file, and whether the
user attempted to wipe it. In this case, suspects with a PC or USB drive who have been proven to
Table 15 File system analysis using the UsnJrnl2Csv-master, LogFileParser-master, and Mft2Csv-master tools

Tools name | Algorithm | LogFileParser-master | Mft2Csv-master
Freeraser | One pass | Recover every transaction that occurred on the files | Recover file metadata
Freeraser | DoD pass | Recover every transaction that occurred on the files | Recover file metadata
Freeraser | 35 passes | Recover every transaction that occurred on the files | Recover file metadata
File Shredder | One/two/three passes | Recover every transaction that occurred on the files | Recover file metadata, but the name is in an unreadable format
File Shredder | Seven passes | Recover every transaction that occurred on the files | Recover file metadata, but the name is in an unreadable format
File Shredder | Gutmann algorithm | Recover every transaction that occurred on the files | Recover file metadata, but the name is in an unreadable format
Table 16 Transactions on the victim.txt file

File name | Event (transaction)
victim.txt | FILE_CREATE
victim.txt | DATA_EXTEND + FILE_CREATE
victim.txt | DATA_EXTEND + DATA_OVERWRITE + FILE_CREATE
victim.txt | BASIC_INFO_CHANGE + DATA_EXTEND + DATA_OVERWRITE + FILE_CREATE
victim.txt | BASIC_INFO_CHANGE + CLOSE + DATA_EXTEND + DATA_OVERWRITE + FILE_CREATE
have engaged in such proscribed activities can be considered to have violated such policies,
with potential organizational or even criminal implications. The number of wiping
passes does not affect the metadata. The only effective way to destroy the metadata
is by wiping the file system, which requires either kernel-space access or hardware wiping.
Future research directions based on these findings may explore tools capable
of dealing with kernel-space or hardware-level processing, extending the study
with more tools and techniques, and measuring the resources consumed by each machine.
References
1. Naiqi L, Zhongshan W, Yujie H (2008) Computer forensics research and implementation
based on NTFS file system. In: Proceedings—ISECS international colloquium on computing,
communication, control, and management, CCCM 2008, vol 1, pp 519–523
2. Poonia AS (2014) Data wiping and anti forensic techniques. Compusoft 3(12):1374–1376
3. Ölvecký M, Gabriška D (2018) Wiping techniques and anti-forensics methods. In: 2018 IEEE
16th international symposium on intelligent systems and informatics (SISY), pp 127–132
4. Miller FP, Vandome AF, McBrewster J (2009) Levenshtein distance: information theory,
computer science, string (computer science), string metric, Damerau–Levenshtein distance,
spell checker, Hamming distance. Alpha Press
5. blueangel's ForensicNote: NTFS Log Tracker [Online]. Available: https://sites.google.
com/site/forensicnote/ntfs-log-tracker. Accessed 18 Sept 2022
6. Rogers MK, Seigfried K (2004) The future of computer forensics: a needs analysis survey.
Comput Secur 23(1):12–16
7. Slusarczuk MM, Mayfield WT, Welke SR (1987) Emergency destruction of information storing
media. Institute for Defense Analyses Alexandria VA
8. Gutmann P (1996) Secure deletion of data from magnetic and solid-state memory. In:
Proceedings of the sixth USENIX security symposium, San Jose, CA, vol 14, pp 77–89
9. Robins N, Williams PAH, Sansurooah K (2017) An investigation into remnant data on USB
storage devices sold in Australia creating alarming concerns. Int J Comput Appl 39(2):79–90
10. Golubić K, Stančić H (2012) Clearing and sanitization of media used for digital storage:
towards recommendations for secure deleting of digital files. In: Central European conference
on information and intelligent systems, pp 331–493
11. Regenscheid A, Feldman L, Witte G (2015) NIST special publication 800-88 revision 1,
guidelines for media sanitization. National Institute of Standards and Technology
12. DoD 5220.22-M: national industrial security program operating manual [Updated 28 Feb 2006]
(2006). [Online]. Available: https://www.hsdl.org/?abstract&did. Accessed 18-Sept-2022
13. Wright C, Kleiman D, Sundhar RSS, Kendalls BDO (2008) Overwriting hard drive data: the
great wiping controversy, pp 243–257
14. Martin T, Jones A (2011) An evaluation of data erasing tools
15. Distefano A, Me G, Pace F (2010) Android anti-forensics through a local paradigm. Digit
Invest 7:S83–S94
16. Pajek P, Pimenidis E (2009) Computer anti-forensics methods and their impact on computer
forensic investigation. In: International conference on global security, safety, and sustainability,
pp 145–155
17. Gül M, Kugu E (2017) A survey on anti-forensics techniques. In: IDAP 2017—international
artificial intelligence and data processing symposium
18. Kai Z, En C, Qinquan G (2010) Analysis and implementation of NTFS file system based
on computer forensics. In: 2010 Second international workshop on education technology and
computer science, vol 1, pp 325–328
19. Al-Fayoumi M, Aboud SJ, Al-Fayoumi MA (2010) A new digital signature scheme based on
integer factoring and discrete logarithm problem. IJ Comput Appl 17(2):108–115
20. Gutub AA (2010) e-Text watermarking: utilizing Kashida extensions in Arabic language
electronic writing, vol 2, no 1, pp 48–55
21. Parvez MT, Gutub AA-A (2011) Vibrant color image steganography using channel differences
and secret data distribution. Kuwait J Sci Eng 38(1B):127–142
22. Al-Otaibi NA, Gutub AA (2014) 2-Layer security system for hiding sensitive text data on
personal computers. In: Lecture notes on information theory, August, pp 73–79
23. Al-Nofaie SM, Fattani M, Gutub A (2016) Merging two steganography techniques adjusted to
improve Arabic text data security. J Comput Sci Comput Math (JCSCM) 6(3):59–65
24. Hambouz A, Shaheen Y, Manna A, Al-Fayoumi M, Tedmori S (2019) Achieving data
integrity and confidentiality using image steganography and hashing techniques. In: 2019 2nd
International conference on new trends in computing sciences, ICTCS 2019—proceedings
25. Mohammad RM, Alqahtani M (2019) A comparison of machine learning techniques for file
system forensics analysis. J Inf Secur Appl 46:53–61
26. Oh J, Lee S, Hwang H (2021) NTFS Data Tracker: Tracking file data history based on $LogFile.
Forensic Sci Int Digit Invest 39:301309
27. Hermon R, Singh U, Singh B (2022) Forensic techniques to detect hidden data in alternate data
streams in NTFS. In: IBSSC 2022—IEEE Bombay section signature conference
28. Oh J, Lee S, Hwang H (2022) Forensic recovery of file system metadata for digital forensic
investigation. IEEE Access 10:111591–111606
29. Sokol P, Antoni Ľ, Krídlo O, Marková E, Kováčová K, Krajči S (2022) The analysis of digital
evidence by Formal Concept Analysis
30. Markova E, Sokol P, Kovacova K (2022) Detection of relevant digital evidence in the forensic
timelines. In: 2022 14th International conference on electronics, computers and artificial
intelligence, ECAI 2022.
31. Singh A (2022) A framework for crime detection and reduction in digital forensics. SSRN
Electron J 71(4):531–552
32. Peters-Michaud N (2017) The three pass data wipe requirement for hard drives is obsolete. In:
Cascade asset management, LLC, pp 1–8
33. Mallery JR (2001) Secure file deletion: fact or fiction?
34. Parvez MT, Gutub AA (2011) Vibrant color image steganography using channel differences and secret data distribution. Kuwait J Sci Eng 38(1B):127–142
35. Pal A, Memon N (2009) The evolution of file carving. IEEE Sig Process Mag 26(2):59–71
36. Carrier B (2005) File system forensic analysis. Addison-Wesley Professional
Performance Evaluation of Virtual
Machine and Container-Based Migration
Technique
Aditya Bhardwaj, Amit Pratap Singh, Priya Sharma, Konika Abid,
and Umesh Gupta
Abstract The transformation from hypervisor to microservice-based virtualization,
i.e., containerization, is gaining considerable attention. This is because container
virtualization offers a lightweight and efficient way to package and deploy software
applications. Containers require fewer resources than virtual machines, which
makes them more efficient and cost-effective. The performance overhead of a container
compared to a virtual machine has been explored by researchers, but support
for migration, an essential technique of cloud virtualization, still needs to be addressed.
In this work, we propose a container-based migration technique and compare its
performance with an existing VM migration scheme. The results show that, compared
to the existing VM migration scheme, our proposed container migration technique
reduces downtime, migration time, and the number of pages transferred by 72.8%,
54.94%, and 97.5%, respectively.
Keywords Cloud computing ·Virtualization ·VM migration ·Container
technology ·Checkpoint-restore
1 Introduction
There is a high demand for cloud platforms and related virtualization technologies.
VMware, RedHat, Oracle, Citrix, and Microsoft dominate the market, while hardware
vendors like Intel and AMD offer virtualization-enabled high-performance comput-
ing servers. These technologies are used collectively for the process of workstation
consolidation. In the past, hypervisor-based virtualization was the most common way
A. Bhardwaj (B)
School of CSET, Bennett University, Greater Noida, India
e-mail: aditya.cse@nitttrchd.ac.in
A. P. Singh · P. Sharma · K. Abid
Department of CSE, Sharda University, Greater Noida, India
U. Gupta
Department of CSE, SR University, Warangal, Telangana, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_41
552 A. Bhardwaj et al.
to implement virtualization and isolation [1]. Recent studies show that hypervisor-based
virtualization technologies have large performance costs. They also have I/O
constraints, so they are typically avoided in high-performance computing environments.
In the past few years, container-based virtualization and support for hosting
microservices have become more popular. Containerization is a lightweight solution
that bundles applications and data in a simpler and more performance-oriented way
that can run on different cloud frameworks.
Due to its significance in the operation of data centers, researchers have sought
to enhance the performance of the VM migration method [2]. In accordance with
this, in our earlier work, we first explored how to allocate bandwidth efficiently during
VM migration [3], and then improved the migration mechanism by implementing
multistage and data-transfer-reduction strategies [4]. However, because a VM
is deployed with a dedicated guest operating system, it takes a significant amount
of RAM and disk storage, and as a result, the image size is quite large.
Thus, migration utilizing a VM is referred to as a heavyweight solution, which causes
service degradation in terms of quality of service (QoS).
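The cost of heavyweight migration can be made concrete with the standard pre-copy model: the first round transfers the whole memory image, and each subsequent round retransmits only the pages dirtied during the previous one, until the remainder is small enough for a brief stop-and-copy. The simulation below is our own sketch with illustrative parameters and thresholds, not the measurement method of this study.

```python
def precopy_migration(mem_pages, dirty_rate, bandwidth,
                      stop_threshold=1000, max_rounds=30):
    """Simulate pre-copy live migration.

    mem_pages:  pages sent in the first round (the whole VM/container image)
    dirty_rate: pages dirtied per second while copying
    bandwidth:  pages transferred per second
    Returns (total_pages_sent, migration_time_s, downtime_s).
    """
    to_send, total, elapsed = mem_pages, 0, 0.0
    for _ in range(max_rounds):
        round_time = to_send / bandwidth
        total += to_send
        elapsed += round_time
        dirtied = dirty_rate * round_time  # pages touched during this round
        to_send = dirtied                  # next round resends the dirty set
        if dirtied <= stop_threshold or dirty_rate >= bandwidth:
            break  # small enough (or diverging): do the final stop-and-copy
    downtime = to_send / bandwidth         # pause while the last set is copied
    return total + to_send, elapsed + downtime, downtime
```

With the same dirty rate and bandwidth, shrinking the initial image, as containerization does, shrinks every later round as well, which is consistent with the large reductions in pages transferred and migration time reported above.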
The virtualization system design employing hypervisor and container-based tech-
nologies is contrasted in Fig. 1. As can be seen in Fig. 1, containers allow applications
to share an OS kernel and only include the necessary binaries and libraries, making
them a more lightweight alternative to virtual machines. Hence, in recent years, there
has been a rise in demand for application deployment using container technology
for real-time applications like IoT, fog computing, data analytics, and blockchain
technology [5,6].
Fig. 1 Virtual machine versus container architecture
Performance Evaluation of Virtual Machine 553
1.1 Main Contributions of This Study
The contributions of this study are summarized as follows:
1. Cloud computing virtualization experimental testbed has been developed to eval-
uate performance of proposed container-based migration technique (LXD/CR)
with the existing VM migration scheme.
2. Proposed container migration technique has been implemented by modification
of Linux kernel and system bash files.
3. Executed a wide variety of workload benchmarks to evaluate the performance of
existing and proposed migration techniques.
The remaining parts of this work are structured as follows. Relevant related work
in container-based virtualization is discussed in Sect. 2. In Sect. 3, the architecture of
the container migration technique is discussed. Section 4 shows the results of the
proposed container migration technique (LXD/CR) against the existing pre-copy VM
migration scheme. In Sects. 5 and 6, the concluding remarks of this study are presented.
2 Related Work
A container is a small, self-contained piece of software with all the code, libraries,
system tools, and runtime needed to run an application or service. Containers let
developers package an app to make it portable and consistent so that it can be used in
different computing environments, like on-premises data centers, cloud platforms,
or edge devices. Containerization eliminates the need for a separate guest operating
system, which means it has less overhead and starts up faster than hypervisors.
Existing studies have explored performance comparisons between VMs and containers on non-live migration platforms. Felter et al. [7] conducted a study comparing and contrasting virtualization's overhead with that of a non-virtualized platform. The parameters employed were those unique to the execution of the workload.
load. Based on their experiments, they concluded that containers have less overhead
in comparison with virtual machines. As a result, incorporating containers into the
data center infrastructure can be useful for cloud service providers. In a recent paper,
Chae et al. [8] compared the performance of hypervisor and container technology. However, compared to [7], they used a different set of benchmarks. Their benchmarks include
disk I/O, a web server, and the amount of CPU and RAM being used to measure
performance. The authors demonstrate that as compared to container virtualization,
KVM virtualization requires a relatively higher quantity of CPU and RAM. These
studies inferred that, compared to hypervisor-based virtualization, the container uses
hardware resources more efficiently, especially with a decrease in memory consump-
tion by three–four times.
Also, in contrast to [7, 8], the studies in [9, 10] undertook a performance comparison of three container technologies (Docker, LXC, and CoreOS Rkt) with respect to the native platform. It was found that Docker containers reduce
554 A. Bhardwaj et al.
Fig. 2 Proposed LXD/CR container migration technique
performance because their time-sharing method incurs substantial context-switching costs. However, LXC containers perform better for data-intensive workloads, and Rkt containers work well for CPU-intensive workloads. From the relevant literature, it is found that container-based virtualization incurs minimal overhead in terms of computation and storage. Thus, it is suitable for running multiple instances on a cloud server.
3 Experimental Testbed for Proposed Container Migration
Technology
In this section, we discuss the methodology and experimental setup details to implement the proposed LXD/CR container migration technique.
3.1 Proposed Container-Based Migration Technique
Initially, we built a Linux container hypervisor (LXD) as an extension of LXC version 2.0.11 to launch a container. Then, container migration is implemented using the checkpoint/restore functionality of CRIU, which saves the running state of the container on the source server and restores it on the target system [11, 12]. Further, cgroups and namespaces are used to facilitate management and isolation of container resources. The key operations involved in migrating containers, as shown in Fig. 2, are summarized below:
Stage 1: Synchronization of file system: To perform container migration, we
must first ensure that the container’s file system is in sync. This necessitates the
existence of the fundamental file system, including rootfs and config, on the target
server.
Stage 2: Checkpoint running container: When a container is to be checkpointed,
CRIU first determines the processes to checkpoint and then saves their state. This
includes memory contents, file descriptors, network connections, and other process
information. CRIU then writes this information to disk in a checkpoint image file.
Stage 3: Network dump: CRIU dumps the container’s network stack to a separate
file that can be used to restore the network state.
Stage 4: Restoring container: When the container needs to be restored, CRIU
reads the checkpoint image file from the disk and restores the process state in
memory. It then sets up the necessary environment, including file descriptors,
network connections, and memory mappings.
Stage 5: Network restore: CRIU restores the network state using the network
dump file created during the checkpointing phase.
Stage 6: Resume: Once the process is restored, CRIU resumes execution from
where it left off, allowing the container to continue running on the same or a
different host.
4 Performance Evaluation and Results Discussion
In this section, we discuss the results obtained for the performance evaluation of the proposed container migration technique against the existing VM migration mechanism.
4.1 Downtime and Migration Time
Figures 3 and 4 illustrate the performance evaluation of the proposed and existing schemes in terms of downtime and migration time.
We have executed four categories of benchmarks, namely 'idle,' 'UnixBench,' 'Y-cruncher,' and 'Stream.' For the idle test and workload benchmark executions, our approach reduces the downtime by 59.48, 73.07, 77.56, and 78.24%, an average of 72.08%. Similarly, migration time is reduced by 33.35, 44.18, 66.43, and 75.81%, an average of 54.94%. This is because a VM migration transfers the full binary libraries, code, and OS files, whereas LXD/CR migrates only the memory checkpoint dump states, which require a shorter migration duration than a virtual machine.
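The averages quoted above follow directly from the four per-benchmark figures; as a quick check (note that the exact mean of the downtime reductions is 72.0875%, reported as 72.08% in the text):

```python
# Per-benchmark reductions (%) for idle, UnixBench, Y-cruncher, and Stream.
downtime_reductions = [59.48, 73.07, 77.56, 78.24]
migration_reductions = [33.35, 44.18, 66.43, 75.81]

avg_downtime = sum(downtime_reductions) / len(downtime_reductions)
avg_migration = sum(migration_reductions) / len(migration_reductions)
print(f"downtime: {avg_downtime:.4f}%, migration time: {avg_migration:.4f}%")
```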
Fig. 3 Performance evaluation for downtime (Td)
Fig. 4 Performance evaluation for migration time (Tm)
4.2 Number of Pages Transferred
The performance evaluation in terms of the number of pages transferred is presented here. Figure 5 shows that, in comparison with the existing virtual machine migration technique, our proposed LXD/CR container migration technique results in a considerable reduction in the total pages transferred, by 95.08, 97.91, 98.26, and 98.94%, with an average reduction of 97.54%. The difference for this parameter is significant because in LXD/CR the pages transferred are only the container's checkpoint dump states, while pre-copy transfers all dirtied memory pages along with the heavyweight, full-fledged operating system VM image.
Fig. 5 Performance evaluation for the number of pages transferred (Tpages)
Thus, the results obtained demonstrate that the proposed container-based migration technique can be used in a production environment. However, a limitation of this study is that the proposed technique has been tested with CPU- and memory-oriented benchmark test cases only. Further, disk I/O test cases and future network systems should be studied [13].
5 Conclusion and Future Scope
Container-based virtualization involves running multiple containers on a single operating system, where each container shares the host operating system and underlying resources with other containers. This provides greater efficiency and scalability than hypervisor-based virtualization but may not provide as strong isolation. In this paper,
a cloud virtualization testbed has been developed and implemented with a check-
point/restore technique to enable the migration mechanism in the container. The
results show that, compared to the existing virtual machine migration technique (pre-copy VM migration), our proposed container migration technique achieves significant performance improvements: average reductions of 72.08% in downtime, 54.94% in migration time, and 97.54% in the number of pages transmitted. Hence, our
proposed container migration technique (LXD/CR) can play a vital role in the cloud
servers to migrate running workloads and applications. For future work, researchers
may explore container migration techniques for edge and fog computing frameworks.
References
1. Belgacem A, Mahmoudi S, Ferrag MA (2023) A machine learning model for improving virtual
machine migration in cloud computing. J Supercomputing 1–23
2. Kumari P, Kaur P (2021) Virtual machine replication in the cloud computing system using
fuzzy inference system, data analytics and management: proceedings of ICDAM, pp 165–174
3. Bhardwaj A, Rama Krishna C (2018) Performance evaluation of bandwidth for virtual machine
migration in cloud computing. Int J Knowl Eng Data Min, Inderscience 5(3):139–152
4. Bhardwaj A, Rama Krishna C (2018) Efficient multistage bandwidth allocation technique for
virtual machine migration in cloud computing. J Intell Fuzzy Syst 35(5):5365–5378
5. Plageras AP, Psannis KE, Stergiou C, Wang H, Gupta BB (2018) Efficient IoT-based sensor
BIG Data collection—processing and analysis in smart buildings. Future Gener Comp Syst
82:349–357
6. Stergiou C, Psannis KE, Kim B-G, Gupta B (2018) Secure integration of IoT and cloud com-
puting. Future Gener Comp Syst 78(3):964–975
7. Felter W, Ferreira A, Rajamony R, Rubio J (2015) An updated performance comparison of virtual machines and Linux containers. In: Proceedings of the IEEE international symposium on performance analysis of systems and software, Philadelphia, PA, USA, pp 171–172
8. Chae M, Lee H, Lee K (2017) A performance comparison of Linux containers and virtual machines using Docker and KVM. Cluster Comput, pp 1–11
9. Kozhirbayev Z, Sinnott RO (2017) A performance comparison of Linux container-based tech-
nologies. Future Gener Comp Syst 68:175–182
10. Martin JP, Kandasamy A, Chandrasekaran K (2018) Exploring the support for high performance applications in the container runtime environment. Human-centric Comput Inf Sci 8(1):1–15
11. Linux Containers Hypervisor (LXD), link: https://linuxcontainers.org/lxd/introduction/. Accessed 12 Aug 2022
12. Checkpoint/Restore in userspace (CRIU), link: https://www.criu.org/Main_Page. Accessed 19
Sept 2022
13. Gupta U, Pantola D, Bhardwaj A, Singh S (2023) Next-generation networks enabled tech-
nologies: challenges and applications. Next generation communication networks for industrial
internet of things systems, pp 191–216
Rhetorical Role Detection in Legal
Judgements Using Zero-Shot Learning
Shambhavi Mishra, Tanveer Ahmed, Vipul Mishra, Priyam Srivastava,
Abuzar Sayeed, and Umesh Gupta
Abstract In this paper, we address the problem of legal statement segmentation
(or rhetorical role detection). Traditionally, this is handled by taking the expertise
of lawyers and making them mark each and every statement as one of the many
pre-defined classes. Naturally, this process is cumbersome and involves a lot of
manual intervention. Zero-shot learning is a promising approach that could be one
of the potential solutions to this labor-intensive problem. Therefore, in this paper,
we apply zero-shot learning to the task of legal judgement segmentation. We try to
remove the “human in the loop” and present a new potential direction in rhetorical
role detection. To that end, we use BART to automatically classify various segments
of a document into multiple classes. We propose a model that uses a pre-trained
language model to generate embeddings for each document, which are then used to
classify a legal sentence into one of the multiple classes. We evaluate our model on a
dataset of legal documents consisting of manually marked statements. In particular,
the dataset consists of 50 court case documents from the Indian Supreme Court.
Through experimentation, we have found that the proposed method gives a strong baseline
that could act as a new direction in rhetorical role detection. Further, we also show
S. Mishra (B) · T. Ahmed · P. Srivastava · A. Sayeed
Department of CSE, Bennett University, Greater Noida, India
e-mail: shambhavimishra1000@gmail.com
T. Ahmed
e-mail: tanveer.ahmed@bennett.edu.in
P. Srivastava
e-mail: e20cse479@bennett.edu.in
A. Sayeed
e-mail: abuzar.sayeed@bennett.edu.in
V. Mishra
Department of CSE, Pandit Deendayal Energy University, Gandhinagar, India
e-mail: vipul.mishra@sot.pdpu.ac.in
U. Gupta
Department of CSE, SR University, Warangal, Telangana, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_42
that the model presented in this article can indeed pave the way for future work in
legal analytics.
Keywords Rhetorical roles · Legal case documents · Zero-shot learning · BART
1 Introduction
In this paper, we address the problem of rhetorical role detection of a legal judge-
ment. Traditionally, this is handled by taking the expertise of lawyers and making
them mark each statement (of the judgement) as one of the many pre-defined roles.
This process is extremely cumbersome and involves a lot of manual intervention.
Moreover, as is common in Indian legal cases, the structure of the text is extremely inconsistent, with every judge using his or her own writing style. Hence, marking statements manually is infeasible considering the scale of judgements given by courts each year.
To address this problem, we apply zero-shot learning to the task of legal judgement
segmentation. We try to remove the “human in the loop” and present a new potential
direction in rhetorical role detection. To that end, we use BART to automatically
classify various sentences of a legal document into multiple classes. We propose
a model that uses a pre-trained language model to classify a legal sentence into
multiple classes. We evaluate our model on a dataset of legal documents consisting
of manually marked statements. In particular, the dataset consists of 50 court case
documents and over 10,000 sentences from the Indian Supreme Court. The innovative
aspect of the suggested method lies in employing zero-shot learning for identifying
rhetorical roles within legal decisions. The conventional approach of having attor-
neys manually label each statement within a judgement is laborious, time-intensive,
and susceptible to mistakes. The suggested technique employs a pre-trained language
model, BART, to automatically categorize various sentences in a legal document into
multiple classes without the need for manual input. This paper emphasizes the signif-
icance of rhetorical roles in legal contexts and advocates for the implementation of
zero-shot learning for detecting these roles in the legal field. The paper establishes
a robust foundation and illustrates that the suggested approach has the potential to
serve as a novel direction in rhetorical role identification, laying the groundwork for
future research in legal analysis.
In the realm of law, a legal judgement refers to an official ruling pronounced by a
judicial body pertaining to a specific legal matter. Typically, these judgements emerge
from disagreements among at least two parties, and they can hold either a binding or
non-binding status. A crucial method used in dissecting these judgements is called
segmentation. Segmentation of legal judgements involves decomposing a court’s
decision into its constituent elements to provide a comprehensive understanding of
its implications and significance [1,2]. Legal professionals and scholars frequently
employ this method when scrutinizing a judgement, as it aids in pinpointing the
crucial tenets and contested issues. There is an increasing trend toward researching
role detection within the context of legal judgements. For example, numerous studies
have underscored the critical role of rhetorical elements across various applications
[3,4]. In essence, the philosophy behind judgement segmentation is the partitioning
of a judgement into its relevant sections. For instance, the segment containing facts is
an aggregation of all the factual information and evidence put forth during a case. On
the other hand, the legal segment involves the legal claims and deductive reasoning
that the court used to formulate its decision [5].
The segmentation of a legal judgement allows for a more in-depth examination to
pinpoint the central doctrines and contested issues. Both legal researchers and profes-
sionals can benefit from this analysis as it can elucidate the law and how it is applied
to specific instances. It is therefore critical to highlight the significance of identifying
rhetorical roles in legal work. However, rhetorical role detection comes with its own
set of challenges. First, it is a tedious and time-consuming task requiring legal experts
to read through the entire judgement and label each statement. Secondly, the possi-
bility of human error introduces inconsistency in results. Lastly, scaling up manual
labeling to encompass vast amounts of legal texts is challenging. In response to these
challenges, this paper proposes a solution: a zero-shot learning-based approach to
rhetorical role detection. Deep learning, a subfield of machine learning that draws
inspiration from the structure and function of the human brain, has been utilized in
numerous studies for this task [4,6]. These algorithms aim to model high-level data
abstractions through a deep graph consisting of multiple layers of nodes. Indeed, deep
learning has proven successful in numerous text-related tasks, such as machine trans-
lation, sentiment analysis, and topic categorization. Our method capitalizes on these
successes, applying deep learning techniques to the task of rhetorical role detection
in legal judgements [7,8].
One of the most promising applications of deep learning for text is text segmentation. Zero-shot learning is a subset of deep learning concerned with the classification of text without being explicitly trained on labeled datasets [9]. It is a method for segmenting data when no labels are available for that data: a model is learned from data that has labels and is then used to segment the unlabeled data. The advantage of this approach is that it does not require labels for the unlabeled data, which can be difficult or impossible to obtain [10]. Consequently, this learning and classification paradigm is immensely important in dealing with the issues highlighted in the previous paragraph.
In light of the issue and potential solution highlighted in this section, we propose
the use of zero-shot classification for rhetorical role detection in the legal domain. To
the best of our knowledge, we are the first to propose this paradigm in the context of
Indian legal judgements. To accomplish the said objective, we use the existing pre-
trained BART proposed in [11], which has been shown to achieve good results in zero-shot classification. To test the validity of
the ideas in practice, we use the dataset provided by [6]. The dataset consists of fifty
different legal judgements. There are seven different roles into which the statements
are classified. They are Facts (abbreviated as FAC), Ruling by Lower Court (RLC),
Argument (ARG), Statute (STA), Precedent (PRE), Ratio of the decision (Ratio),
Ruling by Present Court (RPC). Using the method proposed in this article, we are
able to achieve good results in terms of numerical efficiency. These obtained numbers
show the potential of BART in zero-shot classification for legal judgement. Our
paper’s main contribution is to eliminate the requirement for manual involvement
in the segmentation of legal judgements, a task currently carried out by lawyers
who assign each statement to one of several pre-defined roles. Furthermore, the
paper delves into the significance of identifying rhetorical roles in the legal domain,
the obstacles related to this task, and the benefits of employing advanced learning
methods, particularly zero-shot learning, for text segmentation. The rest of the article
is structured as follows: Sect. 2 of this paper will discuss the related work. The
proposed methodology is discussed in Sect. 3. The experimental results are described
in Sect. 4. We discuss the shortcomings of the work in Sect. 5. Finally, the conclusion
is given in Sect. 6.
2 Related Work
This section provides an overview of previous research conducted in the legal field
concerning annotation, automatic rhetorical labeling, and deep learning applications.
The automatic labeling of the rhetorical purpose of sentences relies heavily on manual
annotation. While some studies focus on the annotation process itself, including
the establishment of manuals or annotation rules, inter-annotator research, and the
creation of a high-quality annotated corpus, others aim to automate semantic labeling
tasks and perform annotation analysis [12,13]. In one study, a corpus named TEMIS
was developed, comprising 504 sentences with syntactic and semantic annotations
[14]. Extensive annotation research and curation of a gold standard corpus were
carried out in [15] for the purpose of labeling sentences, although there was low
agreement among assessors for labels such as “Facts” and “Reasoning Outcomes.”
Another research effort presented a preliminary methodology [16] that employed
NLP tools to automate annotation work using 47 criminal cases from the California
Supreme Court and State Court of Appeals. Previous attempts have been made to
automatically recognize the rhetorical functions of sentences in legal texts. Initial
experiments were conducted in [17] to comprehend the rhetorical and thematic roles
in court case documents, judgements, and case legislation. For example, [17] utilized
conditional random fields (CRFs) to address the challenge of identifying seven rhetor-
ical roles. Another study [12] focused on the division of US court documents into
functional (Introduction, Background, Analysis, and Footnotes) and issue-specific
(Analysis and Conclusion) portions using CRF with handcrafted features. Addi-
tionally, a technique using the fastText classifier was developed in [18] to distinguish between factual and non-factual sentences. In a different area of research, Walker et al.
[19] contrasted the usage of rule-based scripts with machine learning algorithms
for the task of identifying rhetorical roles. Rule-based scripts require substantially
less training data. Nearly all previous attempts to automatically identify rhetorical
roles in the legal arena required handcrafted elements. In contrast, this paper uses
deep learning (DL) and natural language processing models for this purpose, which
eliminates the requirement for manually created features. In a variety of NLP tasks,
self-supervised techniques have been incredibly successful [20,21]. The methods that
have been most effective have been variations of masked language models, which are
denoising autoencoders trained to reconstruct text where a random subset of the words
has been masked out. Recent research has demonstrated benefits from enhancing the
distribution of masked tokens [22], the order in which masked tokens are predicted
[23], and the accessible context for changing masked tokens [24]. These techniques,
however, frequently concentrate on specific kinds of end tasks (such as span predic-
tion, generation), which restricts their applicability. For pretraining sequence-to-sequence models, BART, a denoising autoencoder, was introduced in [11]. When fine-tuned for text generation, BART performs particularly well, and it also works well on comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, producing new state-of-the-art results on a variety of abstractive dialogue, question-answering, and summarization tasks. For instance, compared to prior work on XSum [25], performance is improved by 6 ROUGE. DL techniques
are being used more frequently in the legal field for tasks including classifying
factual and non-factual statements in legal documents [18], classifying crimes [26],
summarizing [27], and other tasks.
Related work also encompasses the application of machine learning methods for
the classification of legal documents. For instance, a convolutional neural network
(CNN) was employed [28] to categorize legal documents into various types such
as contracts, briefs, and pleadings. In [29], a hierarchical attention network was
utilized for classifying legal documents based on their specific subject areas. In
[30], a deep learning model leveraging Long Short-Term Memory (LSTM) networks
was designed to pinpoint key issues and arguments within legal documents. These
approaches could potentially be integrated with automated rhetorical labeling tech-
niques to enhance the comprehension of legal texts. Research has also been conducted
on the implementation of natural language processing methods for legal information
retrieval. In [31], a system was devised to automatically extract legal concepts from
court opinions and use them to improve the retrieval of related opinions. In [32], a
system was created to automatically extract legal issues from case law and utilize
them to enhance the retrieval of associated cases. These methods could potentially be
combined with automatic rhetorical labeling to improve the retrieval and organiza-
tion of legal texts based on their rhetorical objectives. Additionally, research has been
conducted on the use of machine learning methods for predicting legal decisions. In
[33], a model was created to forecast the outcomes of cases in the European Court
of Human Rights based on case texts. In [34], a model was designed to predict the
outcomes of US Supreme Court cases based on various features, including case texts.
These techniques could potentially be used in conjunction with automatic rhetorical
labeling to improve legal decision prediction based on the rhetorical functions of the
texts. However, to the best of our knowledge, deep learning and natural language
processing methods have not yet been applied to automatically discern the rhetorical
roles of phrases within legal documents.
3 Method
3.1 Zero-Shot Classification
Zero-shot classification is a machine learning classification technique that is able to
recognize previously unseen objects by inferring class membership from semantic
information about the class, such as descriptions of its attributes [35]. This is in
contrast to traditional classification methods that require training data for every class
in order to learn to recognize it. The ability to learn from zero examples is particularly
useful in domains where acquiring training data is difficult or expensive, such as the
legal domain. From the point of view of the legal domain, the main challenge in zero-shot
classification is to learn a good semantic representation of the class, which can then
be used to make predictions about unseen examples. The segmentation process often
incorporates some form of semantic knowledge representation derived from the legal
text, such as an ontology, a collection of attributes, or even the framework of the legal
document itself. There are various strategies to tackle zero-shot classification, but a
commonly employed method is to derive a mapping from the semantic representation to the feature representation of the legal information. This mapping is then utilized to
predict previously unseen roles within the legal judgement. Several methodologies
can be employed to accomplish this, including transfer learning [36] or multiview
learning [37]. With this approach, the network receives a legal document, comprising a series of legal statements, as input, and generates a probability distribution over a
set of labels as output. These labels would be supplied to the system dynamically
as it processes the text. This methodology offers a way to not only analyze legal
judgements more efficiently, but also to extract and learn from the data contained
within them in a more structured and scalable way. As a result, it provides a new
and innovative approach to the study of legal judgements and their implications.
The network is then trained using a set of labeled documents and then applied to a
set of unlabeled documents. The predicted label for an unlabeled document is the
label with the highest probability. In this paper, we have used an existing set of
pre-trained models. In particular, we work with Bidirectional and Auto-Regressive
Transformers.
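A common way to run BART-style zero-shot classification in practice is to cast each candidate label as an entailment hypothesis ("This text is about X") and score it against the input sentence. The sketch below follows that recipe but substitutes a toy keyword-overlap scorer for a real BART entailment model, so the cue lists and scores are purely illustrative assumptions.

```python
import math

LABELS = ["Facts", "Statute", "Precedent"]

# Toy stand-in for an entailment model: in a real system this score would
# come from BART fine-tuned on NLI, given (sentence, "This text is about X").
CUES = {"Facts": {"incident", "occurred", "filed", "complaint"},
        "Statute": {"section", "act", "provides"},
        "Precedent": {"held", "court", "relied"}}

def entailment_score(sentence, label):
    # Count overlap between the sentence's words and the label's cue words.
    return len(set(sentence.lower().split()) & CUES[label])

def classify(sentence, labels=LABELS):
    scores = [entailment_score(sentence, label) for label in labels]
    # Softmax turns the raw scores into a probability distribution over labels;
    # the predicted role is the label with the highest probability.
    exps = [math.exp(s) for s in scores]
    probs = [e / sum(exps) for e in exps]
    return max(zip(labels, probs), key=lambda pair: pair[1])

label, prob = classify("the incident occurred and a complaint was filed")
print(label)  # Facts
```

The labels are supplied at classification time rather than learned, which is what makes the approach zero-shot: swapping in the seven rhetorical roles requires no retraining.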
3.2 BART: Bidirectional and Auto-Regressive Transformers
Bidirectional Auto-Regressive Transformers (BARTs) are a type of neural network
that can be used for both sequence prediction and text generation. Figure 1 shows
the general architecture of BART. BARTs are similar to traditional recurrent neural
networks (RNNs), but they have the ability to learn from both past and future data.
In addition, BART is designed using the standard transformer-based architecture
which has shown promising results on various NLP-based tasks for a variety of
language-related tasks.
Fig. 1 BART: sequence-to-sequence trained model
This makes them well-suited for tasks such as language translation, where it is
important to consider the context of the entire sentence. BART was first proposed in
[11]. Since then, tests on a variety of natural language processing tasks, including
text categorization and machine translation, have demonstrated that BARTs perform
better than conventional RNNs in these areas. An encoder and a decoder are BART’s
two main components. The encoder reads the input sequence and converts it into
a vector representation. The output sequence is then created by the decoder using
this vector form. This gives BART a better understanding of the context of the input
sequence. The two main benefits of BART are its ability to learn from long sequences of data and its ability to generate text. Traditional RNNs are limited in the amount
of data they can learn from. This is because RNNs are designed to read data from
left to right. As a result, they can only learn from the first few items in a sequence.
BART, on the other hand, can learn from both the first and last items in a sequence.
This makes them much better at learning from long sequences of data. For reasons of
brevity, we keep the discussion on BART short. Interested readers can refer to [11].
3.3 Combining BART with Zero-Shot Classification for Legal
Document Classification
We propose a zero-shot learning approach for sentence classification. We utilize
BART, a pre-trained sequence-to-sequence autoencoder, as our text encoder and
build a simple classification head on top of the encoder. The overall framework of
the proposed approach is presented in Fig. 2. Our approach can be used for any
sentence classification task, with or without labeled data. We evaluate our approach
on a variety of sentence classification tasks and show that our approach outperforms
strong baselines on zero-shot classification. Zero-shot learning is a subfield of machine learning where
the goal is to learn a model that can classify data belonging to classes that are not
present in the training data. The idea is that the model can learn to generalize to new
classes by using knowledge about other related classes. In the legal domain, there
are a variety of tasks where classification is needed, but labeled data is not always
available. For example, when a new law is passed, there may not be any labeled data
for that law. However, there may be other laws that are similar to the new law, and
Fig. 2 Architecture of the proposed model
these laws can be used to learn a model that can classify the new law. We explore the
ability of the BART model to learn from a large amount of unannotated data in order to
classify documents into different legal categories, without any training data for those
categories. We evaluate our approach on a dataset of nearly 50 documents of Supreme
Court cases. We find that the BART model can accurately classify documents into five
different roles, even when there is no training data for those categories, outperforming
several strong baselines. In order to train BART, text is first corrupted using a random
noise function, and then, a model is learned to recreate the original text. Despite
being straightforward, it uses a typical transformer-based neural machine translation
architecture that generalizes numerous other more advanced pretraining strategies,
such as GPT with its left-to-right decoder and BERT (owing to the bi-directional
encoder). The optimal solution combines a cutting-edge in-filling strategy, where a
span of text is replaced with a single mask token, with a random shuffling of the original sentence order. Although it also performs well for comprehension tasks,
BART is especially effective when tailored for text generation.
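The zero-shot setup described above can be illustrated with a toy sketch: each rhetorical role is defined only by a textual description, and a sentence is assigned the role whose description it is most similar to, with no labeled training data. Here a simple bag-of-words vector stands in for the BART encoder, and the role descriptions are invented for the example; this is an illustration of the idea, not the paper's implementation.

```python
# Toy zero-shot sentence classification: classes are defined only by their
# textual descriptions; no labeled training data is used. A bag-of-words
# count vector stands in for the BART encoder (hypothetical descriptions).
import math
from collections import Counter

STOPWORDS = {"the", "of", "a", "to", "that", "in", "is"}

def embed(text):
    # Stand-in for a sentence encoder: a bag-of-words count vector.
    return Counter(w for w in text.lower().split() if w not in STOPWORDS)

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Invented role descriptions; the real system would use richer text.
role_descriptions = {
    "FAC": "facts and events that led to the filing of the case",
    "STA": "statute written law act section article rule",
    "PRE": "precedent earlier decision cited as binding authority",
}

def zero_shot_classify(sentence):
    scores = {role: cosine(embed(sentence), embed(desc))
              for role, desc in role_descriptions.items()}
    return max(scores, key=scores.get)

print(zero_shot_classify("Section 302 of the act defines the rule"))  # -> STA
```

New roles can be added simply by supplying another description, which mirrors the flexibility in the number of rhetorical roles discussed later in the paper.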
4 Results
4.1 Dataset
In this section, we attempt to demonstrate the practical efficacy of the proposed work.
Experiments were carried out on the dataset provided by [6]. The authors have
presented seven different categories of legal statements. From the seven categories,
we removed two categories. The remaining five roles are: Facts (abbreviated as
FAC), Ruling by Lower Court (RLC), Statute (STA), Precedent (PRE), and Ruling
by Present Court (RPC). Further, we included the collection of 50 documents from
the following five categories of law: 16 documents pertaining to criminal law; ten to
land and property; nine to constitutional law; eight to labor and industry; and seven
to intellectual property rights.
Rhetorical Role Detection in Legal Judgements Using Zero-Shot Learning 567
4.2 Detailed Annotation
This section will provide an overview of our annotation study, including the rhetorical
functions and semantic labels that we considered for this work. There are many rhetor-
ical roles that people play in the legal system. For example, lawyers may play the role
of advocate, advisor, or negotiator. Judges may play the role of arbiter or decision-
maker. Witnesses may play the role of expert or layperson. And jurors may play the
role of fact finders or deliberators. Table 1 represents the number of sentences annotated with each role. In our work, we take into account the following five rhetorical roles:
Facts: This describes the sequence of occurrences that lead to the filing of the case
and the development of the case over time through the legal system.
Ruling by Lower Court (RLC): Since we are reviewing Supreme Court case documents, each appeal arises from decisions issued by the lower courts (Trial Court and High Court), on the basis of which the present appeal was launched. For example, a lower court may have held that an audio recording was properly authenticated and admissible under the hearsay exception. This label marks the lower court's verdict as well as the reasoning behind its decision.
Ruling by Present Court (RPC): The court will make a ruling based on the law
as it stands today. It describes the court’s final judgement or conclusion resulting
from the logical or natural conclusion of the argument.
Statute (STA): The term statute is also used to refer to a written law that has been
enacted by a legislature, as opposed to a common law, which is derived from case
law. This role covers existing laws, which can be derived from a variety of sources, including Acts, Sections, Articles, Rules, Orders, Notices, Notifications, quotations taken directly from an Act, and so on.
Precedent (PRE): A precedent is a legal decision or set of legal rules that is
established as a binding authority for future cases.
Table 1 shows the five rhetorical roles along with the number of sentences annotated with each role.

Table 1 Number of sentences annotated with each role

Rhetorical role   Number of sentences
FAC               2220
STA               654
RLC               314
PRE               1468
RPC               262
Total             4918
4.3 Experimental Setup
BART couples two jointly trained components in a standard sequence-to-sequence transformer: a bidirectional encoder that can look both forward and backward in the input sequence, and a left-to-right autoregressive decoder that reconstructs the original text token by token. The bidirectional encoding is beneficial for tasks where long-range dependencies are important. As pre-trained models have produced remarkable
performances in many tasks (e.g., [38,39]), we experimented with the BART-large
model [11].
4.4 Evaluation Metrics
Standard metrics are applied to assess how well the suggested method performs. The
following is their definition:
Precision: Precision is a measure of the accuracy of a model’s prediction. It is
calculated by dividing the number of correct predictions by the total number of
predictions made. A higher precision indicates that the model is more accurate
in predicting correct outcomes, while a lower precision indicates that the model
may be inaccurate in its predictions.
Precision = True Positives/(True Positives + False Positives).
Recall: Recall, on the other hand, is a measure of the model’s ability to detect
all relevant instances in a given data set. It is calculated by dividing the number
of relevant instances that are correctly identified by the total number of relevant
instances in the dataset. A higher recall indicates that the model is more effective
in detecting all relevant instances, while a lower recall indicates that the model
may be missing some relevant instances.
Recall = True Positives/(True Positives + False Negatives).
F1-Score: The F1-score is a metric that measures the harmonic mean of precision
and recall. Unlike a simple arithmetic mean, the harmonic mean is high only when both precision and recall are high. The F1-score is widely used to evaluate the performance of a model
because it considers both precision and recall simultaneously. A higher F1-score
indicates a more accurate model compared to one with a lower F1-score. It is a
valuable metric in assessing the overall effectiveness of a model’s performance.
F1-Score = 2 × (Precision × Recall)/(Precision + Recall).
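The per-role values reported later in Table 2 are derived from a confusion matrix using the definitions above. The sketch below shows the computation with invented counts, not the paper's actual results.

```python
# Per-class precision, recall and F1 from a confusion matrix, following
# the definitions above. The counts are invented for illustration.

def per_class_metrics(confusion, labels):
    metrics = {}
    n = len(labels)
    for i, label in enumerate(labels):
        tp = confusion[i][i]
        fp = sum(confusion[r][i] for r in range(n)) - tp  # predicted i, truly another role
        fn = sum(confusion[i]) - tp                       # truly i, predicted another role
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics

labels = ["FAC", "STA", "RLC", "PRE", "RPC"]
# confusion[i][j] = number of sentences with true role i predicted as role j
confusion = [
    [50,  5,  2,  3,  0],
    [ 8, 30,  1,  1,  0],
    [ 6,  2,  4,  2,  1],
    [ 9,  3,  1, 25,  2],
    [ 4,  1,  1,  2, 10],
]
results = per_class_metrics(confusion, labels)
```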
4.5 Classification Result
In this subsection, we provide an evaluation of the proposed method’s performance.
The classification results are displayed in Fig. 3 as a confusion matrix, showcasing the accuracy of the classification. Additionally, Table 2 presents the precision, recall, and
F1-score of the classification results, with the corresponding numerical values also
depicted in Fig. 4. The results indicate that the system demonstrates strong perfor-
mance, despite not being trained on any of the instances. Notably, the performance
varies across the five roles being classified. The proposed model achieves the highest
F1-score for the FAC role, indicating its effectiveness in accurately classifying this
particular role. Although the results for the other roles are also good, the numbers for FAC are the best. In addition to this, the overall accuracy of the system is 59.08%. This figure clearly shows the applicability of zero-shot learning to legal rhetorical role detection. It should be noted here that we have not fine-tuned BART on the legal domain.
Despite this, the accuracy of the system is 59.08%. In addition to this, the proposed
model also performs well in terms of maintaining class distribution. From the figure,
it is also visible that the worst performance was obtained for the class RLC. The exact reason for this is, however, unknown.
5 Limitation of Work
The proposed method has certain drawbacks, including the reliance on a pre-trained
BART model, which may not be suitable for all legal documents due to the complexity
and domain-specific language found in such texts. Furthermore, while the approach
can decrease the volume of labeled data necessary for training, some manual involvement may still be needed to obtain the best results. These limitations offer opportunities for further exploration and improvement in future research.
A key limitation is the dependence on the quality and relevance of the pre-trained
language model employed for text segmentation. While we utilized the state-of-
the-art BART model, there is potential for improvement in its application to legal
documents, which are often complex and necessitate specific domain knowledge.
Future work could concentrate on refining the BART model for the legal domain
to achieve greater accuracy.
The generalizability of our method to various legal domains or languages is another limitation. Our experiments were conducted on a dataset provided by [6],
but it is possible that the results may not be consistent across other legal datasets
Fig. 3 Confusion matrix according to five roles. Here [0-FAC, 1-STA, 2-RLC, 3-PRE, 4-RPC]
Table 2 Results according to individual rhetorical role

Rhetorical role   Precision   Recall   F1-score
Facts (FAC)       0.4675      0.6234   0.5343
RLC               0.0191      0.0759   0.0305
RPC               0.2099      0.2820   0.2407
STA               0.5886      0.4410   0.5042
PRE               0.5320      0.3708   0.4370
Fig. 4 Results according to individual rhetorical role
with distinct characteristics. Currently, our method is limited to English-language
legal documents, so future work could investigate its applicability to other legal
domains and languages.
Moreover, although our method reduces the need for manual intervention in text
segmentation, it still necessitates labeled data for training the BART model. As
with any machine learning technique, the quality and quantity of labeled data can
significantly affect the model’s performance. Additionally, obtaining labeled data
can be time-consuming and costly, especially in the legal domain where large,
specialized datasets may be needed.
Despite these limitations, we maintain that zero-shot rhetorical role detection
holds the potential to transform the way raw legal data is processed. By minimizing
the amount of labeled data required for training, our approach can considerably
reduce the cost and time associated with text segmentation, ultimately leading to
more accurate and efficient models. Additionally, the flexibility of our method
in accommodating a dynamic number of rhetorical roles allows for more precise
and nuanced text segmentation, resulting in enhanced downstream analysis and
decision-making.
6 Conclusion
Zero-shot rhetorical role detection, a novel research area introduced in this paper,
has the potential to transform the way raw legal data is processed. This method
can significantly decrease the amount of labeled data needed for training and could
ultimately result in more accurate text segmentation models. We utilized a pre-trained
BART model to achieve legal document segmentation. Our experimental findings
suggest that using a pre-trained BART model for zero-shot rhetorical role detection
holds promise in reducing the labeled data required for training and enhancing text
segmentation models. This has considerable implications for the legal domain, where
large volumes of data are typically needed for training machine learning models, and
manual labeling costs are high. Moreover, our approach removes the necessity for
manual intervention by legal experts, which is both time-consuming and expensive. In
contrast to the current academic approach, which requires a legal expert to manually
label each statement for proper machine classification, our method offers a more
efficient alternative. Additionally, there is no need to settle on a pre-defined number
of classes, as a varying number of rhetorical roles can be supplied on demand. We
conducted experiments on the dataset provided by [6]. One of the key benefits of our
approach is the flexibility in the number of rhetorical roles that can be provided on
demand, as opposed to traditional methods that demand a pre-determined number of
classes, thus restricting the scope of analysis. In summary, zero-shot rhetorical role
detection has the potential to revolutionize the processing of raw legal data, yielding
more accurate text segmentation models while reducing the time and cost associated
with manual intervention. By further refining and enhancing the BART model within
the legal domain, we believe that our approach can lead to the more efficient and
precise classification of legal documents. Through numerical simulations, our results
showed promise, and the analysis indicated good scope for improvement in the future.
As part of future work, we plan to refine the BART model for the legal domain and
experiment with the zero-shot classification of legal documents, aiming to improve
the system’s accuracy.
Acknowledgements The work presented in this article is funded by Manupatra Information
Solutions Private Limited.
References
1. Hutcheson JC Jr (1928) Judgment intuitive the function of the hunch in judicial decision.
Cornell LQ 14:274
2. Bommer M, Gratto C, Gravander J, Tuttle M (1987) A behavioral model of ethical and unethical
decision-making. J Bus Ethics 6(4):265–280
3. Schwarz-Plaschg C (2018) Nanotechnology is like… the rhetorical roles of analogies in public
engagement. Public Underst Sci 27(2):153–167
4. Bhattacharya P, Paul S, Ghosh K, Ghosh S, Wyner A (2021) Deeprhole: deep learning for
rhetorical role labeling of sentences in legal case documents. Artif Intell Law 1–38
5. MacCormick N (2005) Rhetoric and the rule of law: a theory of legal reasoning. OUP Oxford
6. Ghosh S, Wyner A (2019) Identification of rhetorical roles of sentences in Indian legal judg-
ments. In: Legal knowledge and information systems: JURIX 2019: the thirty-second annual
conference, vol 322. IOS Press, p 3
7. Chaturvedi I, Cambria E, Welsch RE, Herrera F (2018) Distinguishing between facts and
opinions for sentiment analysis: survey and challenges. Inf Fusion 44:65–77
8. El-Kilany A, Azzam A, El-Beltagy SR (2018) Using deep neural networks for extracting
sentiment targets in Arabic tweets. In: Intelligent natural language processing: trends and
applications. Springer, pp 3–15
9. Wang W, Zheng VW, Yu H, Miao C (2019) A survey of zero-shot learning: settings, methods,
and applications. ACM Trans Intell Syst Technol (TIST) 10(2):1–37
10. Xian Y, Lampert CH, Schiele B, Akata Z (2018) Zero-shot learning—a comprehensive
evaluation of the good, the bad and the ugly. IEEE Trans Pattern Anal Mach Intell
41(9):2251–2265
11. Lewis M, Liu Y, Goyal N, Ghazvininejad M, Mohamed A, Levy O, Stoyanov V, Zettlemoyer
L (2019) Bart: denoising sequence-to-sequence pre-training for natural language generation,
translation, and comprehension. arXiv preprint arXiv:1910.13461
12. Savelka J, Ashley KD (2018) Segmenting US court decisions into functional and issue specific
parts. In: JURIX, pp 111–120
13. Shulayeva O, Siddharthan A, Wyner A (2017) Recognizing cited facts and principles in legal
judgements. Artif Intell Law 25(1):107–126
14. Venturi G (2012) Design and development of temis: a syntactically and semantically annotated
corpus of Italian legislative texts. In proceedings of the workshop on semantic processing of
legal texts (SPLeT 2012), pp 1–12
15. Wyner AZ, Peters W, Katz D (2013) A case study on legal case annotation. In: JURIX, pp
165–174
16. Wyner A, Peters W (2010) Towards annotating and extracting textual legal case factors.
In: Proceedings of the language resources and evaluation conference workshop on semantic
processing of legal texts, Malta
17. Saravanan M, Ravindran B, Raman S (2008) Automatic identification of rhetorical roles using
conditional random fields for legal document summarization. In Proceedings of the third
international joint conference on natural language processing: volume I
18. Nejadgholi I, Bougueng R, Witherspoon S (2017) A semi-supervised training method for
semantic search of legal facts in Canadian immigration cases. In: JURIX, pp 125–134
19. Walker VR, Pillaipakkamnatt K, Davidson AM, Linares M, Pesce DJ (2019) Automatic classi-
fication of rhetorical roles for sentences: comparing rule-based scripts with machine learning.
In: ASAIL@ ICAIL
20. Sarzynska-Wawer J, Wawer A, Pawlak A, Szymanowska J, Stefaniak I, Jarkiewicz M, Okruszek
L (2021) Detecting formal thought disorder by deep contextualized word representations.
Psychiatry Res 304:114135
21. Devlin J, Chang M-W, Lee K, Toutanova K (2018) Bert: pre-training of deep bidirectional
transformers for language understanding. arXiv preprint arXiv:1810.04805
22. Joshi M, Chen D, Liu Y, Weld DS, Zettlemoyer L, Levy O (2020) Spanbert: Improving pre-
training by representing and predicting spans. Trans Assoc Comput Linguist 8:64–77
23. Liu Y, Lapata M (2019) Text summarization with pretrained encoders. arXiv preprint arXiv:1908.08345
24. Dong L, Yang N, Wang W, Wei F, Liu X, Wang Y, Gao J, Zhou M, Hon H-W (2019) Unified
language model pre-training for natural language understanding and generation. Adv Neural
Inf Proc Syst 32
25. Narayan S, Cohen SB, Lapata M (2018) Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. arXiv preprint arXiv:1808.08745
26. Wang P, Fan Y, Niu S, Yang Z, Zhang Y, Guo J (2019) Hierarchical matching network for crime
classification. In: Proceedings of the 42nd international ACM SIGIR conference on research
and development in information retrieval, pp 325–334
27. Bhattacharya P, Hiware K, Rajgaria S, Pochhi N, Ghosh K, Ghosh S (2019) A comparative
study of summarization algorithms applied to legal case judgments. In: European conference
on information retrieval. Springer, pp 413–428
28. Song D, Vold A, Madan K, Schilder F (2022) Multi-label legal document classification: a
deep learning-based approach with label-attention and domain-specific pre-training. Inf Syst
106:101718
29. Venkateswarlu B, Shenoi VV, Tumuluru P (2022) Caviarws-based HAN: conditional autore-
gressive value at risk-water sailfish-based hierarchical attention network for emotion classifi-
cation in covid-19 text review data. Soc Netw Anal Min 12:1–17
30. Anand D, Wagh R (2022) Effective deep learning approaches for summarization of legal texts.
J King Saud Univ-Comput Inf Sci 34(5):2141–2150
31. Maxwell KT, Schafer B (2008) Concept and context in legal information retrieval. In: Legal
knowledge and information systems. IOS Press, pp 63–72
32. Ashley KD, Brüninghaus S (2009) Automatically classifying case texts and predicting
outcomes. Artif Intell Law 17:125–165
33. Medvedeva M, Vols M, Wieling M (2020) Using machine learning to predict decisions of the
European court of human rights. Artif Intell Law 28:237–266
34. Clark TS, Lauderdale B (2010) Locating supreme court opinions in doctrine space. Am J
Political Sci 54(4):871–890
35. Socher R, Ganjoo M, Manning CD, Ng A (2013) Zero-shot learning through cross-modal
transfer. Advances in neural information processing systems, 26
36. Chen Y-S, Chiang S-W, Meng-Luen W (2022) A few-shot transfer learning approach using
text-label embedding with legal attributes for law article prediction. Appl Intell 52(3):2884–
2902
37. Qiu X, Chen Z, Zhao L, Chengsheng H (2019) Unsupervised multi-view non-negative for law
data feature learning with dual graph-regularization in smart internet of things. Futur Gener
Comput Syst 100:523–530
38. Zhang T, Chandrasekaran DP, Thung F, Lo D (2022) Benchmarking library recognition in
tweets
39. Zhang T, Xu B, Thung F, Haryono SA, Lo D, Jiang L (2020) Sentiment analysis for software
engineering: how far can pre-trained transformer models go? In: 2020 IEEE International
Conference on Software Maintenance and Evolution (ICSME). IEEE, pp 70–80
IoB-Based Intelligent Healthcare System
for Disease Diagnosis in Humans
Shalu, Neha Saini, Pooja, and Dinesh Singh
Abstract Internet of Behavior (IoB) refers to the use of Internet of Things (IoT)
devices to track data, monitor, and influence human behavior. The increasing use
of IoB has also led to the development of systems for disease detection, which
can leverage IoT data to enhance the accuracy and speed of disease detection. In
this context, an IoB-based system for disease detection has been proposed in this
paper that uses data from various IoT devices, such as wearable sensors, to monitor
and analyze human behavior. The system collects data on numerous physiological
parameters, such as heart rate, blood pressure, and body temperature, and uses this
data to identify patterns that may be indicative of a particular disease or health
condition. This approach can detect diseases at an early stage, before symptoms
appear, which increases the chance of effective treatments. It can also provide real-
time feedback to healthcare providers, enabling them to make informed decisions
about patient care. The proposed DenseNet-K-Nearest Neighbor (KNN)-based IoB
healthcare system optimizes healthcare processes, supports clinical decision-making,
and can be used to improve patient care. The proposed model was compared with
existing algorithms such as Naive Bayes (NB), decision trees (DT), logistic regression
(LR), Convolution Neural Network (CNN), and KNN. The results demonstrate that
the proposed system has a greater accuracy of 97.66% than the other four algorithms.
The proposed method is expected to lower the risk of chronic diseases
Shalu
Manav Rachna University, Faridabad, Haryana, India
e-mail: shalu@mru.edu.in
N. Saini (B)
Government College Chhachhrauli, Yamuna Nagar, Haryana, India
e-mail: profnehasaini@gmail.com
Pooja
School of Computer Science and Engineering, Galgotias University, Greater Noida, India
e-mail: pooja1@galgotiasuniversity.edu.in
D. Singh
Deenbandhu Chhotu Ram University of Science and Technology, Murthal, Sonepat, India
e-mail: dineshsingh.cse@dcrustm.org
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_43
by detecting them early and lowering the cost of diagnosis, therapy, and doctor
consultation.
Keywords Internet of Behavior (IoB) ·Internet of Things (IoT) devices ·
Healthcare systems ·Wearable sensors ·Disease detection
1 Introduction
The internet has become an increasingly important tool for disease detection in
humans. In recent years, researchers have used social media and other online sources
to collect data for public health surveillance. IoT refers to a system of interconnected
physical devices that gather and distribute data and information over the Internet. IoT
allows the interconnection and independent processing of devices, while the volume and variety of data stored in the cloud grow rapidly. Patients' behavior, demands, and requirements can be gleaned from this data trove, a practice that has been termed the "Internet of Behavior" (IoB). Many patients are happy to provide their data
if it gives value, even though some patients are reluctant to do so. For instance, it
guarantees that healthcare systems can be altered in terms of diagnosis and disease
classification and patients’ experience can be improved. The ultimate goal is to
increase consistency and dependability; theoretically, all facets of consumer life
can be learned [13]. Before an application is developed, IoB can anticipate the
user’s social behaviors and contact points. This technology is used to ensure that the
application interface is consistent and user-friendly and provides easier navigation, which helps in the production process. Data collected by the app is utilized to gain insight into
the way people interact with it [4,5]. The IoB aims to derive a cognitive and social rationale from the data collected from people's behavioral patterns on the internet.
It discusses the interpretation and application of data in the creation and promotion
of innovative products based on human behavior [6,7].
In health care [8], the IoB has the potential to revolutionize the way we monitor,
diagnose, and treat diseases. By monitoring an individual’s online behavior, health-
care providers can gain valuable insights into their patients’ health status. The use of
social media and internet searches, for instance, may indicate the onset of a health
problem in time for preventative measures to be taken. Wearable devices and sensors
can also track and transmit real-time health data, providing healthcare providers with
a complete picture of their patients' health. Figure 1 depicts the major components
of a healthcare system based on IoB. The following is a comprehensive description
of the components:
IoT Gadgets: These include wristbands, tablets, and other network devices that
collect information on a person’s health and activity.
Data Collection: The data generated by IoT devices is stored and analyzed in a
central system.
Fig. 1 Components of healthcare system
Machine Learning: Using machine learning algorithms, data gleaned from IoT
devices is analyzed in order to find trends and anomalies, as well as to detect early
symptoms of disease and other health concerns.
Healthcare Professionals: The inferences created by the IoB-based healthcare
system are utilized by doctors, nurses, and other medical employees to make more
informed decisions on patient care.
Patient Monitoring: With IoB-based healthcare [9] systems, patients can be
remotely monitored for increased check-in frequency and earlier disease diagnosis.
In addition, the IoB can be applied to deliver tailor-made medical care and therapies.
Healthcare practitioners can better serve their patients by learning about their unique
preferences, lifestyles, and habits through an analysis of their online activity.
This work, based on the Internet of Behavior in healthcare systems, can address ethical, privacy, and security problems while also contributing significantly to patient care, process optimization, and clinical decision-making [10]. The paper begins with
the role of IoB in health care in Sect. 1. After that, various related studies have been
discussed in Sect. 2.
The methodology adopted for the research and proposed model is discussed in
Sect. 3. The results and discussion are described in Sect. 4. Finally, Sect. 5 concludes the paper with the contribution of the research to the healthcare sector and gives various future research directions.
2 Related Work
Early adopters of the internet saw the potential of using IoB for disease identi-
fication. Researchers in the 1990s began investigating the feasibility of utilizing
online communities and chat rooms in the study of communicable diseases like HIV/
AIDS. They discovered that patients suffering from these conditions could benefit from
engaging in online counseling services.
The proliferation of wearables and other Internet of Things devices in recent years
has created new possibilities for IoB-based [11] disease diagnosis. Wearable tech-
nology has the potential to revolutionize early disease detection by monitoring vital
signs such as heart rate and blood pressure [12]. Utilizing internet-based behavior
to diagnose diseases is a rapidly growing field of study. Several studies have
investigated the viability of using Artificial Intelligence (AI) and machine learning
(ML) to diagnose disease. This study applied AI in disease diagnosis and compared
the findings with various performance indicators, including prediction rate, accu-
racy, sensitivity, specificity, the area under the curve precision, recall, and F1-score.
Parkinson’s, tumors, chronic diseases, and heart disease can be effectively diagnosed
using AI, according to the findings of the study [13].
The author investigated the detection of neurodegenerative disorders using web
search signals [14]. Due to their gradual course and subtle symptoms, some conditions
have been reported to be difficult to diagnose. The author [15] discusses the public
health application of social media and internet-based health surveillance. The study
indicated that a dearth of longitudinal research and methodological difficulties can
impede the successful implementation of such a system.
Additionally, a study utilized machine learning (ML) to forecast disorders. The
study indicated that logistic regression performed well in predicting cardiovascular
diseases, while random forest and convolutional neural networks were employed
to accurately identify breast diseases [16]. These studies highlight the potential
for detecting diseases using internet-based behavior. There is a need for additional
research on the benefits of applying AI and ML to detect human diseases.
3 Proposed Methodology
IoB refers to the tracking and analysis of human behavior using connected devices and
data analytics. IoB-based models collect and evaluate data on individuals' behavior patterns, such as their sleep patterns, exercise habits, and eating habits, and this approach can be used for disease classification in humans. A model based on IoB could detect patterns associated with specific diseases or health conditions by examining
this data. For instance, if the model detects that a person consistently has poor sleep
quality, a lack of physical exercise, and unhealthy eating habits, it may imply that the person is at risk of developing obesity, diabetes, or cardiovascular disease.
(A) Collect Data: Real-world information such as a patient's characteristics, socioeconomic level, and clinical findings is gathered. In order to protect the privacy of the
patients, the dataset does not include identifiable details about them such as name,
age, and their residential information.
(B) Build a Model: Build a model based on the analyzed data that can be used
to detect illnesses in humans. This model should be trained on a large dataset and
validated using a separate test set. The proposed model involves various steps for
disease prediction as shown in Fig. 2. The steps are discussed in detail below:
Fig. 2 Proposed IoB-based disease detection model
(C) Data preprocessing: Most of the gathered structured data have missing values,
and thus, they are preprocessed appropriately. The quality of the dataset can only
be improved by adding missing information or eliminating or updating inaccurate
records. All punctuation and white spaces are removed during the preparation phase.
Data undergoes feature extraction and disease prediction after initial processing is
complete. Out of 450 instances, 32 features are selected.
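A minimal sketch of this preprocessing step: punctuation and extra white space are stripped from text fields, and missing numeric values are filled with the column mean. The record fields and values below are hypothetical examples, not the paper's data.

```python
# Sketch of the preprocessing described above. Text cleaning removes
# punctuation and collapses white space; missing numeric entries (None)
# are filled with the mean of the observed values. Values are invented.
import string

def clean_text(value):
    # Remove all punctuation, then collapse runs of white space.
    stripped = value.translate(str.maketrans("", "", string.punctuation))
    return " ".join(stripped.split())

def fill_missing(column):
    # Replace missing (None) entries with the mean of the observed values.
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

print(clean_text("  blood pressure: high!!  "))  # -> "blood pressure high"
heart_rates = fill_missing([72, None, 80, 76])   # -> [72, 76.0, 80, 76]
```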
(D) Model Training Using DenseNet: After feature extraction, the model is trained using the DenseNet algorithm, which was developed to combat the loss of accuracy in very deep neural networks caused by the vanishing gradient. The process
begins with a vectorization of the data set. After that, it is forwarded on to the
convolution layer. Following the convolution layer, the max pooling process is carried
out in the pooling layer. The max pooling output is passed to the fully connected
layer, and then, the classification is performed by the output layer.
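The layer sequence described above can be sketched numerically. The toy below uses 1-D layers and made-up weights purely to show the data flow (vectorized input, convolution, max pooling, fully connected layer, output); it is not an actual DenseNet implementation.

```python
# Toy forward pass for the described pipeline:
# vectorized input -> convolution -> max pooling -> fully connected -> output.
# All weights and sizes are invented for illustration.

def conv1d(x, kernel):
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k))
            for i in range(len(x) - k + 1)]

def max_pool(x, size=2):
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]

def dense(x, weights, biases):
    return [sum(xi * w for xi, w in zip(x, row)) + b
            for row, b in zip(weights, biases)]

# Vectorized input, e.g. extracted feature values for one patient record.
x = [0.2, 0.9, 0.4, 0.7, 0.1, 0.8]
h = conv1d(x, kernel=[0.5, -0.5, 0.5])      # convolution layer
h = max_pool(h, size=2)                     # pooling layer (max pooling)
logits = dense(h, weights=[[1.0, -1.0], [-1.0, 1.0]], biases=[0.0, 0.1])
predicted_class = max(range(len(logits)), key=lambda i: logits[i])  # output layer
```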
(E) Determine Distance Using KNN: After training the model, distances are determined using KNN. KNN is a supervised model that compares new data with existing data, finds the most similar category, and assigns the new data to that category. In KNN, the value of K is fixed in advance, and the nearest neighbors are the K stored samples with the smallest distance to (i.e., the highest similarity with) the new sample. The predicted disease is the class of the neighbors with the smallest distance values.
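The KNN step above can be sketched as follows; the Euclidean distance, the toy feature vectors, and the class labels are illustrative assumptions, not the paper's actual features.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among the k nearest
    training points (Euclidean distance)."""
    dists = sorted((math.dist(x, query), label) for x, label in train)
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy data: two clusters labelled "healthy" / "at-risk".
train = [((1.0, 1.0), "healthy"), ((1.2, 0.8), "healthy"),
         ((4.0, 4.0), "at-risk"), ((4.1, 3.9), "at-risk"),
         ((0.9, 1.1), "healthy")]
print(knn_predict(train, (1.1, 1.0)))  # -> healthy
```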
(F) Model Validation: The model has been validated by computing performance
metrics including accuracy, precision, recall, and F1-score, which are described in
the findings and discussions section.
Ultimately, a model based on the IoB has great potential as a tool for early disease
identification and individualized treatment in humans. The model can rapidly and
accurately analyze data from multiple sources to draw conclusions about a patient’s
health. However, sufficient safeguards must be in place to protect the privacy of
patients and prevent unauthorized access to personal health information.
580 Shalu et al.
4 Results and Discussions
Seven performance metrics are used to evaluate the proposed disease detection
system.
Accuracy: Accuracy in classification is represented mathematically as the percentage
of correct predictions relative to all predictions and depicted in Eq. (1).
Accuracy = (TP + TN)/(TP + TN + FP + FN) × 100. (1)
Precision: Precision is defined as the proportion of true positive predictions relative to all positive predictions (both true and false positives) and is depicted in Eq. (2).
Precision = TP/(TP + FP). (2)
Recall: It is defined as the proportion of true positives relative to the sum of true positives and false negatives, and it is depicted in Eq. (3).
Recall = TP/(TP + FN), (3)
F1-Score = 2 × (Precision × Recall)/(Precision + Recall). (4)
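Equations (1) to (4) can be made concrete with a small Python helper; the confusion-matrix counts below are invented for illustration, not results from the paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1-score from confusion-matrix
    counts, following Eqs. (1)-(4)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn) * 100  # percentage, Eq. (1)
    precision = tp / (tp + fp)                        # Eq. (2)
    recall = tp / (tp + fn)                           # Eq. (3)
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (4)
    return accuracy, precision, recall, f1

acc, p, r, f1 = classification_metrics(tp=90, tn=85, fp=10, fn=5)
print(f"accuracy={acc:.1f}%  precision={p:.3f}  recall={r:.3f}  f1={f1:.3f}")
```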
Since the prediction result is crucial to the patient and will have negative conse-
quences if it is inaccurate, accuracy is a crucial metric to consider. Accuracy
evaluations between the proposed algorithm and other techniques are graphically
represented in Fig. 3.
Prediction accuracies of 52% for NB, 62% for DT, 86% for LR, 96% for CNN with KNN, and 97.66% for DenseNet with KNN are shown on the graph.
Comparison to various machine learning techniques demonstrates that the proposed
system obtains the greatest accuracy of 97.66%.
[Bar chart: accuracy (0–150 scale) of Naïve Bayes, Decision Tree, Logistic Regression, CNN and KNN, and DenseNet and KNN]
Fig. 3 Accuracy analysis of the proposed versus other techniques
[Bar chart: precision, recall, and F1-score (0–120 scale) for Naïve Bayes, Decision Tree, Logistic Regression, CNN and KNN, and DenseNet and KNN]
Fig. 4 Comparison of other performance evaluation metrics of proposed and other algorithms
The four existing techniques (NB, DT, LR, and CNN with KNN) and the proposed DenseNet with KNN algorithm are compared on three performance assessment parameters. As can be seen in Fig. 4, the experimental findings show precision values of 52%, 64%, 84%, 93%, and 97.5%; recall values of 60%, 80%, 88%, and 99.5%; and F1-scores of 65%, 62.5%, 82.5%, 97.5%, and 98.5%. These results show that the model built with the DenseNet and KNN algorithm surpasses the other four methods in terms of precision (97%), recall (98%), and F1-score (98%).
MCC: The Pearson product-moment coefficient of correlation between the actual and
anticipated components is the basis for the Matthews Correlation Coefficient (MCC)
metric, which is based on a contingency matrix. A score close to −1 indicates the weakest classifier, and a score close to +1 indicates the best classifier.
MCC = (TP × TN − FP × FN)/√((TP + FP)(TP + FN)(TN + FP)(TN + FN)). (5)
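Equation (5) translates directly into code; the counts below are illustrative, and returning 0 when a marginal sum is zero is a common convention rather than something specified in the paper.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient, Eq. (5). Returns 0 when any
    marginal sum is zero (a usual convention for the degenerate case)."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

print(mcc(90, 85, 10, 5))   # mostly correct classifier: close to +1
print(mcc(5, 10, 85, 90))   # mostly wrong classifier: close to -1
```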
Miss Rate: Misclassification rate measures how often the model provides an
inaccurate prediction.
Recognition Speed: The total number of images in the test set divided by the total time taken for testing.
The constraints of the prediction method are reflected in MCC, and a high MCC score correlates with the best prediction performance. Figure 5 shows the MCC value of 94% achieved by the DenseNet-KNN-based model on the chronic renal disease dataset. Based on our findings, its MCC outperforms the other approaches we have compared, so the proposed method has the highest prediction efficiency. The proposed DenseNet-KNN approach is compared to the other existing methods with respect to miss rate, RS, and MCC value, as shown in Fig. 5.
[Bar chart: miss rate (%), RS (jobs per unit time), and MCC (0–100 scale) for Naïve Bayes, Decision Tree, Logistic Regression, CNN and KNN, and DenseNet and KNN]
Fig. 5 Miss rate versus RS versus MCC value of proposed with other existing techniques
Results show that our proposed model achieves the best miss rate and RS values; its miss rate of 11 is lower than that of the other existing techniques.
5 Conclusion and Future Scope
Internet of Behavior facilitates the collection of data, monitoring, and manipulation
of human behavior using IoT gadgets. This study suggests an IoB-based system
for disease diagnosis by collecting and analyzing data from several IoT devices,
including wearable sensors. Several physical parameters, including heart rate, blood
pressure, and temperature, are monitored and analyzed by the system to find patterns
that may indicate an illness or health condition. In this study, we have proposed
an IoB-based system employing machine learning methods such as DenseNet and
KNN to detect and predict an individual’s chance of developing a chronic disease.
In this study, the proposed model's performance is evaluated against that of other popular machine learning algorithms, namely NB, DT, LR, and CNN with KNN. The
findings demonstrate that the proposed system outperforms the other four algorithms
with an accuracy of 97.66%. We strive to enhance our model by applying it to new
image-based datasets and decreasing its execution time, although it already achieves
state-of-the-art performance. In the future, we hope to use a real-time data set to
predict the spread of several airborne diseases. It is expected that the suggested
approach will decrease the prevalence of chronic diseases through early diagnosis
while also decreasing the costs associated with said diagnosis, treatment, and health
check-ups.
IoB has enormous potential in the future of disease identification. There is a lot
of behavioral data that can be used to detect diseases early and create individualized
treatment strategies, and this data is increasingly available through wearable devices,
social media platforms, and other digital sources. Infectious illness outbreaks can
be monitored and prevented with the help of the IoB by collecting information on
people’s habits and the mobility trends.
Analyzing the Impact of Extractive
Summarization Techniques on Legal Text
Utkarsh Dixit, Sonam Gupta, Arun Kumar Yadav, and Divakar Yadav
Abstract Legal document summarization refers to the process of condensing a lengthy legal document into a more concise form while retaining all the critical aspects.
This study aimed to evaluate the effectiveness of extractive text summarization for
summarizing legal materials. Various models such as SVM, NB, KB, Winnow, and
C4.5 were used to summarize the text, and the ROUGE score was used to evaluate
performance. The methodology involved utilizing various strategies and models for
summarization, including extractive text summarization, which recognizes relevant chunks of content and reproduces them word for word, resulting in a selection of phrases from the source text. The inclusion of all legal aspects in legal document summarization results in a well-structured form.
tion is commonly used in legal documents because it recognizes relevant content
and produces well-structured summaries that include all legal aspects. The study
also suggested that additional strategies can be used to generate summaries through
extractive text summarization. The results indicated that C4.5 is the most effective
model for decision tree classification. Therefore, it can be concluded that extractive
text summarization is an effective method for summarizing legal materials, and C4.5
is a useful model for this purpose. Extractive summary recognizes and reproduces
large fragments of a message, while abstractive summary uses language processing
to create a more human-like summary. Extractive methods are commonly used in
legal documents because abstractive summarization may result in the loss of orig-
inal content and lacks sufficient data for deep learning. Legal document summary
should cover all legal aspects, including judgment record and logical fragments,
for better document structure. To evaluate summary text performance, the ROUGE
score, precision, recall, and F-Measure were used by counting n-grams, overlapping
U. Dixit ·S. Gupta (B)
Ajay Kumar Garg Engineering College, Ghaziabad, India
e-mail: guptasonam@akgec.ac.in
U. Dixit
e-mail: utkarsh2010016m@akgec.ac.in
A. K. Yadav ·D. Yadav
National Institute of Technology, Hamirpur, HP, India
e-mail: ayadav@nith.ac.in
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_44
585
586 U. Dixit et al.
word pairs, and word sequences, and focusing on text summarization. The study also
conducted a survey on the use of extractive text summarization in legal documents.
Various techniques were examined, and different modules were used in the process.
Extractive summarization was chosen for use in legal documents as it preserves the
meaning of the document and utilizes a subset of the text for summarization.
Keywords Legal document ·Automatic text summarization ·Extractive text
summarization ·Abstractive text summarization ·SVM ·NB
1 Introduction
Rapid growth in information is being observed as we move forward in the age of
data. This field of online growth in information has aided all fields, including the
legal background. Legal documents, which comprise constitutions, contracts, deeds,
orders, judgments, statutes, and many more, are complex to structure and under-
stand, making them difficult for legal practitioners to comprehend the case and make
future judgments. However, if better text summarization for legal documents was
to be implemented, it would be much simpler to understand. An outline, being a
dense variant of a long report that incorporates all significant data, is the topic of
investigation. The aim is to use different techniques in ATS for legal documents so
that the quality of the document is not reduced. The focus of ATS is on creating a
briefer version of the document without reducing the meaning of the document [1].
Two types of ATS can be identified: (1) extractive and (2) abstractive.
Extractive text summarization reviews the whole document and creates a summary containing a subset of the sentences of the original document or report.
Abstractive text summarization, on the contrary, produces the summary using its
own terminology without losing the meaning of the document. Different approaches
for text summarization are followed by both techniques [2].
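As a concrete illustration of the extractive route, here is a minimal frequency-based sentence scorer; it is one of many possible scoring schemes and not a method taken from the papers surveyed below.

```python
from collections import Counter

def extractive_summary(text, n_sentences=1):
    """Score each sentence by the average document-wide frequency of its
    words, then return the top-scoring sentences in original order."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    freq = Counter(w for s in sentences for w in s.lower().split())
    scores = [sum(freq[w] for w in s.lower().split()) / len(s.split())
              for s in sentences]
    top = sorted(range(len(sentences)),
                 key=lambda i: scores[i], reverse=True)[:n_sentences]
    return ". ".join(sentences[i] for i in sorted(top)) + "."

doc = ("The contract was signed by both parties. "
       "The parties agreed the contract terms in June. "
       "Lunch was pleasant.")
print(extractive_summary(doc, 1))
```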
There are various independent tasks for extractive summarization (Fig. 1).
In extractive summarization, several aspects are utilized [3].
Statistical and Semantic Aspect: Different measurable and semantic angles, such as
word recurrence, connective articulation, area, and title, are examined in this strategy.
These are used for identifying the relevant sentence and then analyzing it.
ML Aspect: Supervised and unsupervised paths are used. In supervised learning, a
label is present in the training data, while in unsupervised learning, the training data
does not have a label but forms a cluster based on similarity.
Probabilistic Aspect: The objective is to identify significant phrases, essential ideas,
and associations.
Fig. 1 Task of extractive summarization
Graph-Based Aspect: An attempt is made to construct a similarity network whose vertices represent sentences and whose edges carry similarity scores between the sentences. Once the graph is built, the PageRank algorithm is used to score the sentences; after ranking, the top-k highest-scoring sentences are selected as the summary.
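The graph-based pipeline (sentence-similarity matrix → PageRank → top-k selection) can be sketched as follows. The word-overlap similarity is a simple illustrative stand-in for whatever measure a real system would use, and the toy sentences are invented.

```python
def overlap_similarity(a, b):
    """Jaccard word overlap between two sentences (a crude similarity)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def pagerank_scores(sim, damping=0.85, iters=50):
    """Power iteration of PageRank over a sentence-similarity matrix."""
    n = len(sim)
    scores = [1.0 / n] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                if i == j:
                    continue
                out = sum(sim[j][k] for k in range(n) if k != j)
                if out:  # j distributes its score along its edges
                    rank += sim[j][i] / out * scores[j]
            new.append((1 - damping) / n + damping * rank)
        scores = new
    return scores

def summarize(sentences, top_k=1):
    sim = [[overlap_similarity(a, b) for b in sentences] for a in sentences]
    scores = pagerank_scores(sim)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i],
                    reverse=True)
    return [sentences[i] for i in sorted(ranked[:top_k])]  # original order

doc = ["the court ruled on the appeal",
       "the appeal concerned the tax ruling",
       "lunch was served at noon"]
print(summarize(doc, top_k=1))
```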
Neural Network-Based Aspect: Text learning is achieved through neural networks,
which are connections of distinct nodes that communicate with each other.
Text Simplification Aspect: The approach of decreasing any lexical or syntactic
intricacy related to the text without modifying the substance of the text is carried out.
It is the preprocessing stage that finally results in the selection of a useful sentence.
Topic Aspect: In this approach, a summary is generated based on the topic, and the
focus of sentences is concentrated on various topics.
Clustering Aspect: An attempt is made to eliminate repeated sentences in the summary through this approach. It is particularly suitable for multi-document summarization.
DL Aspect: Methods to train the network, based on the style of the human reader,
are employed.
Fuzzy Logic Aspect: The uncertainty of the input is dealt with, as logical results
can be provided by fuzzy inference systems. Evaluation in an environment that is
unclear and confusing is carried out.
Advantage of Extractive Summarization: It is faster and easier to understand than the abstractive method, and selecting sentences directly achieves higher accuracy [4].
Disadvantage of Extractive Summarization: Redundancy, lack of semantic cohesion, conflicting information, etc., are encountered [5].
A list of research questions was developed to gain a better understanding of legal
documents and the text summarization technique through research.
Q1: Which summarization method is best for legal documents?
Q2: How can we improve the structure of legal document summarization?
Q3: Why is there greater emphasis on extractive summarizing and less emphasis
on abstractive summarization?
Q4: How do we find out if our text summarization results are performing better
or not?
The study is organized as follows: Section 2 describes the various extractive summarization techniques and the associated work done in ATS. Section 3 examines the legal document and the extraction techniques employed. Section 4 addresses all the previous parts and responds to the research questions. Section 5 contains the conclusion and references.
2 Literature Review
Several research papers on extractive summarization have been reviewed, with the study focused on the extractive summarization of legal documents. Different approaches to extractive summarization are investigated below.
Statistical Features: The use of statistical and semantic variables in extractive text summarization was investigated by Vodolazova et al. [6]. Techniques such as stop-word removal, word detection, anaphora resolution, and textual entailment were examined. Through examination of these strategies, it was found that semantic techniques (for example, anaphora resolution, recognition of textual entailment, and word-sense disambiguation) enhance the recognition of overt repetitiveness, while statistical features such as word frequency and inverse sentence frequency provide the most effective tools for selecting significant sentences for the final summary.
by Metais et al. [7] in their research, one being the examination of how the presenta-
tion of the text summary is affected by the content of the report and the other being
the investigation of how semantic properties of text may influence the performance of
various automated text summary operations. Semantic research tactics were consid-
ered, and an examination of their relation to formal representation of people, places,
and things, pronouns, and the distribution of specific entities over the original text
that were included in the associated summary was conducted. It was found that the
assumption was not supported; however, the dynamic summary system was found
to improve the summarization process.
ML-Based Approach: A method for automated document summarization based on clustering and extractive summarization is described by Aliguliyev [8], in which the text is first clustered and an evolutionary algorithm is then applied per cluster to optimize the objective function. An original technique for determining intra- and
inter-event relevance using knowledge from internal association, semantic similarity, distributional proximity, and named-entity clustering is characterized by Li et al. [9]. This technique is used in conjunction with a PageRank calculation to determine the significance of an event for a summary. A clustering
approach on event word graph semantic linkages, collected from external linguistic
sources, is utilized by Liu et al. [10] and is found to outperform the PageRank-based
method. A commonly used sentence rating method, whose primary purpose is to
determine the most relevant sentence, is provided by Silva et al. [11]. A sentence-importance classifier that predicts the key sentences first and then forms a summary of the necessary length is introduced by Yang et al. [12].
Probabilistic Approaches: The identification of important sentences, key concepts,
and relationships within the text through the process of automatic summarization
is the goal of the approach provided by Fung et al. [13], which utilizes an HMM framework with a modified method for extractive summarization and an unsupervised probabilistic technique to determine class centroids, class sequences, and class borders.
Neural Network-Based Approach: For extractive text summarization, a neural network-based approach is employed with BERT, a pre-trained transformer model with the highest performance in NLP tasks, as presented by Liu et al. A unique neural network for learning features inherent in sentences and contextual links between phrases is offered by Ren et al. [15] in CRSum. A novel term-document
co-ranking approach for extractive text summarization is suggested by Fang et al.
[16] which combines a graph-based ranking method with the word-sentence rela-
tionship in CoRank. It is noted that the co-ranking process takes into consideration
that different words should have different weights.
Topic Approach: These methodologies attempt to determine the topic of the document (i.e., what the document is really about). Term frequency, term frequency–inverse document frequency (TF-IDF), and lexical chains are the most common methods for topic representation. The processing stages of a topic-extraction summarizer are as follows: (1) transforming the input text into an intermediate representation in which the input material is analyzed; (2) assigning a score to each sentence in the document based on its representation [17].
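The TF-IDF weighting mentioned above can be sketched in a few lines; treating each sentence as a "document" is an illustrative choice common in extractive scoring, and the example sentences are invented.

```python
import math
from collections import Counter

def tf_idf(sentences):
    """Per-sentence TF-IDF weights: term frequency within the sentence
    times the log of inverse sentence frequency across the document."""
    docs = [s.lower().split() for s in sentences]
    n = len(docs)
    # In how many sentences does each word appear?
    df = Counter(w for d in docs for w in set(d))
    weights = []
    for d in docs:
        tf = Counter(d)
        weights.append({w: (tf[w] / len(d)) * math.log(n / df[w])
                        for w in tf})
    return weights

w = tf_idf(["the cat sat", "the dog ran", "the cat ran"])
# "the" appears in every sentence, so its IDF (and weight) is zero.
print(w[0]["the"], w[0]["cat"])
```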
Clustering Approach: Multi-document summarization utilizes clustering, where
the cluster comprises the most central and crucial sentences, which contain vital
information. After identifying the central sentences and ranking them, the process
of document summary can be carried out [18].
Deep Learning Approach: A method that uses document similarity over embeddings to capture meaning is proposed by Kobayashi et al., in an attempt to train networks that work in a human-readable form. It was found that when this model was applied to documents containing more than the first few phrases, more complex meaning was captured than by sentence-level similarities [19].
Fuzzy Logic Approach: Fuzzy logic concepts are used in automatic text summariza-
tion (ATS) to resemble a powerful decision-making tool and provide an effective way
to depict a sentence’s importance. The sentence scoring method includes selecting
a collection of characteristics for each sentence and then using a fuzzy logic system
to select the important sentences [20].
Different techniques are combined to eliminate their shortcomings and produce
the best summaries, as all approaches have their own advantages and limitations
depending on the inputs. For example, Moratanch et al. [21] combined graph-based
and concept-based methods to generate a summary, Rahman et al. [22] suggested an
extractive summary method that captures the semantics of text and clusters them to
summarize the document using a distributional semantic model, and Mao et al. [23]
combined unsupervised with supervised learning to produce a resultant summary of
a single document (Table 1).
3 Legal Document Summarization
The summary form of legal papers, such as court judgments, is distinguished from the summarization of other types of documents by the inclusion of article numbers, rules, and other legislative wording. The features that make legal documents different from other documents include:
Size: Due to the large number of items to be covered, the length of the text is
greater than that of other texts.
Structure: Conformity to the hierarchical structure of legal norms and regulations
necessitates a distinct internal structure for legal texts.
Vocabulary: Legal texts use a distinct, specialized vocabulary of legal terms that rarely appears in other documents.
Ambiguity: The same wording in legal texts may be used for different courts and
various purposes, resulting in multiple interpretations being present.
Citations: Legal texts cite statutes and prior cases, and highlighting the key points of the case through these citations is deemed essential in the legal field.
An extractive approach was utilized to develop summaries of legal text. Various techniques of extractive summarization have been discussed above, and numerous studies have been conducted in the area.
Galgani et al. [24] employed a knowledge-based (KB) approach to integrate various summarization techniques, using the ripple-down rules of Compton and Jansen (1990) for KB creation. They developed a tool that uses these rules to assist in the
Table 1 Advantages and disadvantages of extractive techniques

Statistical and semantic aspect. Advantage: requires less memory and capacity; needs no linguistic knowledge; language-independent. Disadvantage: important sentences may be excluded if they do not score highly.
Machine learning. Advantage: improves sentence selection. Disadvantage: requires a large dataset.
Probabilistic approaches. Advantage: finds important sentences, relationships, and concepts. Disadvantage: NA.
Graph based. Advantage: boosts coherency and detects redundant information; document-independent. Disadvantage: if two words have the same weight, only one is chosen, which can lead to incorrect interpretation.
Neural network. Advantage: human-readable summaries in which each statement is linked to the next without losing the original meaning. Disadvantage: needs large data and is complex in nature.
Text simplification. Advantage: reduces lexical or syntactic complexity. Disadvantage: NA.
Topic-based. Advantage: summarizes on the basis of the topic. Disadvantage: sentences with higher scores may still be excluded.
Clustering approach. Advantage: the summary does not include repeated sentences. Disadvantage: requires the number of clusters to be specified in advance.
DL. Advantage: trains models that produce summaries in a human-readable form. Disadvantage: training data must be built manually.
Fuzzy logic. Advantage: fuzzy selection of sentences produces the summary. Disadvantage: redundancy among the selected sentences can affect the overall quality of the summary.
testing and creation of rules for a legal corpus, selection, feature definition based on the current case context, and utilization of different data in varied contexts. Performance was evaluated on AustLII (Australasian Legal Information Institute) data using ROUGE-1.
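ROUGE-1 precision, recall, and F-measure reduce to counting overlapping unigrams between the system and reference summaries; a minimal sketch (the two example summaries are invented):

```python
from collections import Counter

def rouge1(candidate, reference):
    """ROUGE-1: clipped unigram overlap between a candidate summary
    and a reference summary."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f = rouge1("the court allowed the appeal",
                 "the appeal was allowed by the court")
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")
```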
A citation-based technique for summarization was employed by Galgani et al.
[25]. A phrase was extracted from a publication or reference text using citation
and used as a summary. The best citations were selected based on a centroid or
centrality-based summary class.
A method for topic-based text summarization using LDA was proposed by Venkatesh [26]. An algorithm for sentence grading, based on the likelihood of terms appearing frequently in relation to each topic, was created using the LDA,
which represents the document as a string of words whose topics are generated by a probabilistic model. The topics from the LDA are used to create the concluding summary. The dataset used for the sentence scoring technique consisted of 116 documents from Indian civil cases, drawn from five separate sub-domains (Income Tax, Rent Control, Motor Act, Negotiable Instrument Act, and Sales Tax).
A graph-based method for extractive summarization was employed by Kim et al. [27]. In this method, sentences are represented as nodes in a directed sentence graph. When the likelihood of a sentence being implied by the subsequent node reaches a preset level, a directed edge is added between the two nodes. A summary topic is represented by a connected component of the graph, and the phrases are selected from the connected components using key-value functionality.
A graphical representation of the legal document that highlights the repetition of legal terminology was proposed by Schilder and Molina-Salgado [28]. A similarity function between phrases is used to generate graphical representations of legal language, and the approach is set apart from other graph-based approaches by combining the similarity function with a voting algorithm. It is hypothesized that, for legal papers, certain paragraphs summarize the entire material; the paragraph-detection technique therefore uses similarity ratings between paragraphs to determine which match is most appropriate for each paragraph. This works as a voting mechanism, with one paragraph voting for another, and the most popular paragraphs are chosen as the simplified version.
The state of sentences in texts from a HOLJ corpus was examined by Hachey
and Grover [29] using a classifier. Sentence extraction is based on the Teufel and Moens features, and sentences are classified into categories such as fact, proceeding, background, framing, disposal, and textual. The same features were used in
four classification algorithms: SVM, C4.5, NB, and Winnow. The most favorable
findings in terms of micro average F-score were produced by C4.5.
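A minimal Naive Bayes sentence classifier of the kind compared above can be sketched as follows; the bag-of-words features, toy sentences, and the two rhetorical labels are illustrative assumptions, not the HOLJ setup.

```python
import math
from collections import Counter, defaultdict

def train_nb(labelled):
    """Multinomial Naive Bayes over bag-of-words features."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in labelled:
        words = text.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocab.update(words)
    return word_counts, label_counts, vocab

def classify(model, text):
    """Pick the label with the highest log-posterior (add-one smoothing)."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best, best_lp = None, -math.inf
    for label in label_counts:
        lp = math.log(label_counts[label] / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.lower().split():
            lp += math.log((word_counts[label][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

data = [("the court held that the appeal fails", "disposal"),
        ("the facts of the case are as follows", "fact"),
        ("the appeal is dismissed with costs", "disposal"),
        ("the claimant worked at the factory", "fact")]
model = train_nb(data)
print(classify(model, "the appeal is dismissed"))
```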
A sentence categorization method using the NB classifier, combining a group of linguistic traits such as appearance, particularity, and substantive features, was proposed by Yousfi-Monod et al. [30]. They named it PRODSUM (PRObabilistic Decision SUMmarizer) after dividing the summarization into four sections: introduction, context, reasoning, and conclusion.
This paper analyzes the reasons for the absence of practically useful German
abstractive text summarization solutions in industry. The study [31] focuses on
training resources and publicly available summarization systems and finds that
existing datasets have crucial flaws that negatively affect system generalization
and evaluation biases. The paper also highlights the poor performance of available
systems compared to simple baselines and more effective extractive summarization
approaches. The authors attribute poor evaluation quality to a lack of qualitative gold
data, understudied positional biases in existing datasets, and the lack of accessible
preprocessing strategies or analysis tools. They provide a comprehensive assessment
of available models and emphasize the problems of relying solely on n-gram-based
scoring methods.
The paper [32] offers a comprehensive survey of the NLP & Law domain, with a
focus on recent technical and substantive developments. The authors construct and
analyze a corpus of over 600 NLP & Law-related papers published over the past
decade. They observe an increasing number of papers, tasks, and languages covered,
as well as an increase in the sophistication of methods deployed. The authors note that
legal NLP is starting to match the methodological sophistication and professional
standards of the broader scientific community. They conclude that while the trends
bode well for the future of the field, many questions in both the educational and
corporate sphere remain open.
The paper [33] presents a new method for detecting summary obfuscation, which
is a type of plagiarism that is difficult to detect with traditional methods. The approach
proposed is founded on named entity recognition and dependency parsing, which is
both more precise and analytically simpler than the existing methods based on genetic
algorithms. At the document level, the technique successfully identifies instances of
summary obfuscation and produces high accuracy at the sentence level. Additionally,
the proposed method was tested on other types of plagiarism and achieved excellent
results. Overall, the paper presents a promising new approach for detecting summary
obfuscation that could have important implications for plagiarism detection in various
fields.
The main emphasis of the paper [34] is on the problem of efficiently storing and
retrieving essential information from voluminous text documents. To address this
challenge, the authors propose the use of text summarization techniques, specifically
extractive approaches, and provide an overview of multiple metrics for evaluating
the quality of the resulting summary. The paper provides a review of numerous
approaches to text summarization and highlights the importance of determining the
most suitable approach for a given objective. Overall, the paper emphasizes the
importance of text summarization in improving the efficiency of information storage
and retrieval from large text documents.
The paper [35] aims to discuss the importance of newspapers and news websites in
providing information on COVID-19 and how different models can be used to identify
topics, sentiments, and summarization of news articles. The study used a proposed
topic model to analyze the sentiments expressed by various countries about COVID-
19 and discovered that the UK was the most negatively affected, with the highest
percentage of negative sentiments. The XLNet sentiment categorization model was
also used, and it performed well in terms of validation accuracy. To obtain a better
understanding of the COVID-19 pandemic, the study emphasizes the significance of
analyzing various topics, themes, and issues.
The rising volume of textual data generated daily presents challenges in summa-
rizing and extracting relevant information. In this paper [36], a hybrid feature extrac-
tion approach is proposed, utilizing multi-layered attentional stacked LSTM and
attention RNN networks to automatically produce summaries from lengthy news
text. The proposed methodology achieves better results in text length issues, attribute
extraction, and categorization of news text. Experiments show the effectiveness of
the proposed approach in resolving these issues.
As the web and social media continue to produce an overwhelming amount of
unstructured data, it becomes increasingly challenging for individuals to locate perti-
nent information efficiently. Text summarization offers a solution to this problem
594 U. Dixit et al.
by extracting relevant information and presenting it concisely, without altering
the core meaning of the original content.
Researchers have previously attempted to develop ML approaches for summarization
but still struggle to produce better-summarized results. In this paper [37], the authors
proposed a DL-based model for summarization, which outperformed the advanced
models on a standard dataset at the sentence level with BLEU and ROUGE values
of 0.4 and 0.6, respectively. The model uses reinforced learning with an attention
layer, and its performance was analyzed before proposing the deep learning-based
model. Based on their experiments, the authors assert that their proposed model
yields favorable outcomes in terms of precision and validity.
The paper [38] addresses the limitations of existing summarization datasets in
terms of being overly focused on certain domains and being primarily monolingual.
The paper introduces EUR-Lex-Sum, a cross-lingual dataset that includes paragraph-
aligned data in various European languages and is based on manually curated docu-
ment summaries of legal acts from the European Union law platform. The dataset is
anticipated to enable future research in domain-specific cross-lingual summarization
by providing access to various cross-lingual and low-resource summarization setups.
To demonstrate the dataset’s potential, the authors perform experiments with suit-
able extractive monolingual and cross-lingual baselines. They do admit, however,
that the extreme length and language diversity of the samples pose challenges for
future research.
The legal domain has become increasingly digitized, leading to the need for
more efficient retrieval methods for unstructured data. The field of legal information
retrieval systems has been analyzed extensively in paper [39], which investigates
the use of natural language processing, machine learning, and knowledge extrac-
tion techniques in artificial intelligence approaches. The paper identifies challenges,
such as retrieving similar cases, statutes, or paragraphs, that hinder the analysis of
the latest cases, and highlights the need for further research to improve the efficiency and
effectiveness of these systems (Table 2).
Precision is a measure of the accuracy of a summarization technique, specifically
in relation to the proportion of relevant information that is included in the summary.
In the context of Fig. 2, it is stated that knowledge-based techniques had the highest
precision values with 87%. This suggests that, among the techniques compared,
knowledge-based techniques were the most effective at correctly identifying and
including relevant information in the summary while minimizing the inclusion of
irrelevant information. Knowledge-based techniques use pre-existing knowledge or
information to generate a summary, which may explain why they are able to achieve
higher precision values compared to other techniques.
Recall is a measure of the proportion of relevant instances that are correctly
retrieved by a text summarization technique. It is often used to evaluate the effective-
ness of the technique in retrieving all relevant information from the text. In the case of
the study discussed in Fig. 3, it was found that knowledge-based techniques had the
highest recall values with 66%. Knowledge-based summarization techniques rely on
external knowledge sources such as databases, ontologies, and other external knowl-
edge sources to extract the most important information from the text. This external
Table 2 Legal document summarization

Authors | Technique | Evaluation metrics | Result
Galgani et al. [24] | KB | ROUGE-1, precision, recall, F-measure | KB-SPD: 0.5; P: 0.87; KB+CIT-SPD: 0.5; Recall: 0.66
Galgani et al. [25] | Citation-based method | ROUGE-1, SU6, precision, recall, F-measure | CpSent P: 0.82; R1: 0.46; SU6 P: 0.06; R: 0.22; F: 0.08
Kumar and Raghuveer [26] | LDA | Precision, recall, F-measure | P: 0.60; R: 0.58; F: 0.59
Kim et al. [27] | Graph-based method | Precision, recall, F-measure | P: 31.3; R: 36.4; F: 33.7
Schilder and Molina-Salgado [28] | Graph-based method | ROUGE-2, ROUGE-SU4 | R-2: 0.90; R-SU4: 0.93
Hachey and Grover [29] | SVM, NB, C4.5, Winnow | Human judgment | C4.5: 65.4; SVM: 60.6; NB: 51.8; Winnow: 41.4
Fig. 2 Precision of different techniques (Knowledge Based: 87; Citation Based: 82; LDA: 60; Graph Based: 31.3)
Fig. 3 Recall of different techniques (Knowledge Based: 66; Citation Based: 22; LDA: 58; Graph Based: 36.4)
knowledge is used to identify the key concepts and entities in the text, which are
then used to generate a summary. Because these techniques use external knowledge
to identify the most important information, they are able to retrieve a higher propor-
tion of relevant instances than other techniques, which results in a higher recall
value. Additionally, the use of external knowledge can also lead to a higher precision
value in the summary, as it allows the technique to distinguish between relevant and
non-relevant information more effectively.
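These two quantities can be made concrete with a short sketch. Assuming an extractive summarizer outputs a set of selected sentence indices (the sets below are purely illustrative, not drawn from the surveyed papers):

```python
def precision_recall(selected, reference):
    """Precision and recall of an extractive summary.

    selected  -- set of sentence indices chosen by the summarizer
    reference -- set of sentence indices in the gold (reference) summary
    """
    true_positives = len(selected & reference)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall

# The summarizer picks sentences 0, 2, and 5; the gold summary holds 0, 2, 3, 5.
p, r = precision_recall({0, 2, 5}, {0, 2, 3, 5})
print(p, r)  # 1.0 0.75
```

Here every selected sentence is relevant (precision 1.0), but one relevant sentence was missed (recall 0.75), which is why the two metrics are usually reported together.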
F-Measure is a measure of the effectiveness of text summarization techniques
that combines both precision and recall into a single value. It is calculated as the
harmonic mean of precision and recall and is often used to evaluate the overall
performance of a technique. In the study discussed in Fig. 4, it was found that Latent
Dirichlet Allocation (LDA) had the highest F-Measure value with 59%. F-Measure
is a way to balance the trade-off between precision and recall. It is a metric that
uses both precision and recall to give a single score. F-Measure uses harmonic mean
of precision and recall. Precision is the proportion of true positive instances among
the total number of predicted positive instances, and recall is the proportion of true
positive instances among the total number of actual positive instances. F-Measure
gives equal weight to precision and recall, and it ranges between 0 and 1. The highest
F-Measure value means that the model has performed well in both precision and
recall. Latent Dirichlet Allocation (LDA) is a topic modeling technique that is used
to identify the underlying themes or topics in a text. LDA is a generative probabilistic
model that is trained on a set of documents and is able to discover latent topics by
modeling the co-occurrence of words within each document. In the case of text
Fig. 4 F-Measure of different techniques (Citation Based: 8; LDA: 59; Graph Based: 33.7)
summarization, LDA can be used to identify the main topics of a document and then
generate a summary by extracting the most salient sentences that are relevant to those
topics. This ability to identify the main topics of a document likely contributes to the
high F-Measure value observed in the study, as it allows LDA to effectively retrieve
relevant information while also maintaining a high level of precision in the summary.
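As a quick check on the figures above, the harmonic mean can be computed directly; plugging in the P = 0.60 and R = 0.58 reported for LDA in Table 2 reproduces its 0.59 F-Measure:

```python
def f_measure(precision, recall):
    """Balanced F-Measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Table 2 reports P = 0.60 and R = 0.58 for LDA.
print(round(f_measure(0.60, 0.58), 2))  # 0.59
```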
Recall-Oriented Understudy for Gisting Evaluation (ROUGE) is a commonly
used evaluation metric for text summarization techniques. It compares the generated
summary to a reference summary and calculates the degree of overlap between the
two, providing a score that indicates the quality of the generated summary. In the
study discussed in Fig. 5, it was found that graph-based summarization technique
had the highest ROUGE scores among all other techniques.
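The core quantity behind ROUGE-N, clipped n-gram recall against a reference summary, can be sketched in a few lines (the example sentences are illustrative; production implementations add stemming, multiple references, and F-scores):

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """Clipped n-gram recall of a candidate summary against one reference.

    Both arguments are token lists; this is the core quantity behind ROUGE-N.
    """
    def ngram_counts(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand_counts = ngram_counts(candidate)
    ref_counts = ngram_counts(reference)
    overlap = sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0

reference = "the court dismissed the appeal".split()
candidate = "the appeal was dismissed".split()
print(rouge_n(candidate, reference, n=1))  # 0.6
```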
Graph-based summarization techniques use a graph representation of the text to
identify the most important information and then use this information to generate
a summary. These techniques use graph algorithms to identify the most central or
important nodes in the graph, which correspond to the most important information in
the text. The generated summary is then made up of the text surrounding these impor-
tant nodes. This ability to identify the most important information likely contributes
to the high ROUGE scores observed in the study, as it allows the graph-based tech-
nique to effectively retrieve relevant information and generate a summary that closely
aligns with the reference summary.
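A minimal illustration of this idea, using weighted degree in a word-overlap graph as a deliberately simplified stand-in for centrality algorithms such as LexRank (the sentences are illustrative):

```python
def central_sentences(sentences, top_k=1):
    """Rank sentences by weighted degree in a word-overlap similarity graph.

    Nodes are sentences, edge weights are Jaccard overlaps, and the
    highest-scoring (most central) sentences form the extractive summary.
    """
    token_sets = [set(s.lower().split()) for s in sentences]

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    scored = []
    for i, a in enumerate(token_sets):
        weight = sum(jaccard(a, b) for j, b in enumerate(token_sets) if j != i)
        scored.append((weight, i))
    scored.sort(reverse=True)
    return [sentences[i] for _, i in scored[:top_k]]

document = [
    "the court dismissed the appeal",
    "the appeal was dismissed by the court",
    "lunch was served at noon",
]
print(central_sentences(document))  # the sentence overlapping most with the rest
```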
Accuracy is a measure of the proportion of instances that are correctly classified
by a model. It is often used to evaluate the effectiveness of a model in a classification
task, including text summarization. In the study discussed in Fig. 6, it was found that
the C4.5 algorithm had the highest accuracy of 65.4% among the models used for
text summarization. C4.5 is a decision tree algorithm that is used to classify instances
by recursively partitioning the feature space. It is a supervised ML algorithm that
uses a set of labeled training instances to build a decision tree that can be used to
Fig. 5 ROUGE score of different techniques (Knowledge Based: 46; Graph Based: 90 and 93)
classify new instances. In the context of text summarization, C4.5 could be used to
classify the sentences of a text, where each sentence is assigned a label indicating
whether it is important or not for the summary. The algorithm will then use a set of
feature of the text such as word frequency, sentence length, part of speech, etc., to
build the decision tree, and then when a new text comes, it will use the tree to classify
the sentences of the new text. The accuracy of C4.5 will be based on how well the
algorithm is able to classify the sentences as important or not based on the decision
tree. It is likely that the C4.5 algorithm’s ability to effectively classify instances
based on the text features contributed to the high accuracy value observed in the
study. Additionally, the use of a decision tree allows the algorithm to make complex
decisions by breaking them down into a series of simple decisions, which likely
improved the accuracy of the algorithm. In feature-based summarization, methods
such as anaphora resolution, textual entailment, and word sense are used to determine
the semantics of text. However, in legal document summarization, methods such as
term frequency-inverse document frequency (TF-IDF) and others are utilized. Latent
semantic analysis (LSA) is considered beneficial in legal literature as it selects the
collection of sentences and phrases that best describe the topic.
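A minimal sketch of the TF-IDF scoring mentioned above, treating each sentence as a "document" for the IDF statistic (a simplification; real legal summarizers typically compute IDF over a larger corpus):

```python
import math
from collections import Counter

def tfidf_sentence_scores(sentences):
    """Score each sentence by the summed TF-IDF weight of its words."""
    docs = [s.lower().split() for s in sentences]
    n = len(docs)
    # Document frequency: in how many sentences each word appears.
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = sum((count / len(doc)) * math.log(n / df[word])
                    for word, count in tf.items())
        scores.append(score)
    return scores

# A word occurring in every sentence has IDF zero and contributes nothing,
# so the first sentence scores 0.0.
print(tfidf_sentence_scores(["the the the", "the tort claim"]))
```

Sentences containing distinctive, infrequent terms score higher, which is the behavior exploited when TF-IDF is used as a sentence-selection feature.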
In the legal field, the graph-based technique is utilized, which reveals comparisons
across paragraphs by using the repetition of legal terms and ranking them by vote.
Galgani et al. employed KB and citation-based approaches on the AustLII dataset,
in which the highest score was achieved by the citation-based strategy.
CRF was implemented by Kumar and Raghuveer, which featured term distribution
in court judgments as one of its features.
Several classifiers such as NB, ME, PL, and SEQ were employed by Hachey et al.
The sequence model with rhetorical classification produced the best results.
Fig. 6 Accuracy of models used (KB-SPD: 50; C4.5: 65.4; SVM: 60.6; NB: 51.4; Winnow: 41.4)
The ME sequence model, proposed by Hachey and Grover, is employed to predict
the labels of a series of unlabeled observations.
A compression rate for each document is determined by the graph-based model
developed by Kim et al. This approach takes into account that the compression
rate for each unique document may vary by using a graph that is not related to one
another, which ensures that the topics are diverse and the summaries are cohesive.
An ordered list of paragraphs is produced by the graph-based model developed by
Schilder and Molina-Salgado, based on relevance. Sentences that are comparable
to the query are then extracted from the ordered list.
NB (Yousfi-Monod): Surface feature extraction has been utilized to extract surface
features from important legal text. The emphasis function highlights some of
the most important words in a statement.
LDA (Kumar and Raghuveer): A set of themes are created, which are subsequently
used as the basis for summarization.
C4.5 (Grover and Hachey): Sentences are classified with the highest accuracy and
assigned to suitable rhetorical functions.
4 Discussion
Several research questions were discussed in this section, which were encountered
by the author when examining research articles.
Q1: Which of the summarizations is best for the legal document?
Ans: An extractive summary produces a subset of sentences from the original text:
it identifies the most significant segments of the text and reproduces them word for
word. An abstractive summary, by contrast, uses natural language techniques to
interpret and comprehend the key parts of a text and produce a more "human"-sounding
overview. Because it preserves the original wording, and because more established
methods are available for it, extractive summarization is the better fit for legal documents.
Q2: How can we improve the structure of legal document summarization?
Ans: To improve the structure of legal document summarization, the summary must
cover all legal aspects of the document, including the legal judgment record and the
logical fragments that span the whole record.
Q3: Why is there greater emphasis on extractive summarizing and less emphasis on
abstractive summarization?
Ans: Abstractive summarization alters the original content of the document, which is
unacceptable for legal documents, so it is not used very effectively there. Another issue
is that abstractive summarization relies on DL techniques for summary generation and
requires a large amount of data, which legal document collections do not provide.
Thus, extractive summarization is the commonly used method.
Q4: How do we find out if our text summarization results are performing better or
not?
Ans: Evaluation metrics such as the ROUGE score, precision, recall, and F-
Measure are utilized to evaluate the performance of the summary text. ROUGE, an
acronym for Recall-Oriented Understudy for Gisting Evaluation, evaluates text
summarization by counting overlapping n-grams, word pairs, and word sequences
between the generated and reference summaries.
5 Conclusion
A survey was conducted on the use of extractive text summarization in legal docu-
ments. Various techniques were examined, and different modules were used in the
process. Extractive summarization was chosen for use in legal documents as it
preserves the meaning of the document and utilizes a subset of the text for summa-
rization. The use of legal terms in the summarization process provided structure
to the documents. Techniques such as graph based, ML based, knowledge based,
and citation based were found to provide effective summarization. The study found
that the decision tree classifier (“C4.5”) model had the best performance compared
to other models. Further research could be conducted to explore the use of other
models and techniques in the summarization of legal documents and to improve the
performance and effectiveness of the summarization process.
Acknowledgements This research is supported by Council of Science and Technology, Lucknow,
Uttar Pradesh, via Project Sanction letter number CST/D-3330.
References
1. El-Kassas WS et al (2021) Automatic text summarization: A comprehensive survey. Expert
Syst Appl 165: 113679
2. Allahyari M et al (2017) Text summarization techniques: a brief survey. arXiv preprint arXiv:
1707.02268
3. Agarwal P, Mehta S (2018) Empirical analysis of five nature-inspired algorithms on real
parameter optimization problems. Artif Intell Rev 50(3):383–439
4. Boorugu R, Ramesh G (2020) A survey on NLP based text summarization for summarizing
product reviews. In: 2020 second international conference on inventive research in computing
applications (ICIRCA). IEEE
5. Hou L, Hu P, Bei C (2018) Abstractive document summarization via neural model with joint
attention. In: Natural language processing and Chinese computing: 6th CCF international
conference, NLPCC 2017, Dalian, China, November 8–12, 2017, Proceedings 6. Springer
International Publishing
6. Vodolazova T et al (2013) The role of statistical and semantic features in single-document
extractive summarization
7. Ferziger JH et al (2020) Finite difference methods. Comput Methods Fluid Dyn, 41–79
8. Aliguliyev RM (2009) A new sentence similarity measure and sentence based extractive
technique for automatic text summarization. Expert Syst Appl 36(4):7764–7772
9. Li W et al (2006) Extractive summarization using inter-and intra-event relevance. In: Proceed-
ings of the 21st international conference on computational linguistics and 44th annual meeting
of the Association for Computational Linguistics
10. Liu M et al (2007) Extractive summarization based on event term clustering. In: Proceedings of
the 45th annual meeting of the Association for Computational Linguistics companion volume
proceedings of the demo and poster sessions
11. Fung P, Ngai G, Cheung C-S (2003) Combining optimal clustering and hidden Markov models
for extractive summarization. In: Proceedings of the ACL 2003 workshop on multilingual
summarization and question answering
12. Mallick C et al (2019) Graph-based text summarization using modified TextRank. In: Soft
computing in data analytics. Springer, Singapore, pp 137–146
13. Parveen D, Ramsl H-M, Strube M (2015) Topical coherence for graph-based extractive summa-
rization. In: Proceedings of the 2015 conference on empirical methods in natural language
processing
14. Erkan G, Radev DR (2004) Lexrank: graph-based lexical centrality as salience in text
summarization. J Artif Intell Res 22:457–479
15. Ren P et al (2017) Leveraging contextual sentence relations for extractive summarization using
a neural attention model. In: Proceedings of the 40th international ACM SIGIR conference on
research and development in information retrieval
16. Fang M, Fang C, Mu D, Deng Z, Wu Z (2017) Word-sentence co-ranking for automatic
extractive text summarization. Expert Syst Appl 72(2017):189–195
17. Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text
data. Springer, Boston, MA, pp 43–76
18. Mehta P, Majumder P (2018) Effective aggregation of various summarization techniques. Inf
Process Manage 54(2):145–158
19. Kobayashi H, Noguchi M, Yatsuka T (2015) Summarization based on embedding distributions.
In: Proceedings of the 2015 conference on empirical methods in natural language processing
20. Kumar A, Sharma A (2019) Systematic literature review of fuzzy logic based text summariza-
tion. Iran J Fuzzy Syst 16(5):45–59
21. Moratanch N, Chitrakala S (2017) A survey on extractive text summarization. In: 2017
international conference on computer, communication and signal processing (ICCCSP). IEEE
22. Rahman A et al (2019) Bengali text summarization using TextRank, fuzzy C-Means and
aggregate scoring methods. In: 2019 IEEE region 10 symposium (TENSYMP). IEEE
23. Mao X et al (2019) Extractive summarization using supervised and unsupervised learning.
Expert Syst Appl 133:173–181
24. Galgani F, Compton P, Hoffmann A (2012) Combining different summarization techniques for
legal text. In: Proceedings of the workshop on innovative hybrid approaches to the processing
of textual data
25. Galgani F, Compton P, Hoffmann A (2012) Citation based summarisation of legal texts. In:
Pacific Rim international conference on artificial intelligence. Springer, Berlin, Heidelberg
26. Venkatesh RK (2013) Legal documents clustering and summarization using hierarchical latent
Dirichlet allocation. IAES Int J Artif Intell 2(1)
27. Kim M-Y, Xu Y, Goebel R (2013) Summarization of legal texts with high cohesion and auto-
matic compression rate. In: New frontiers in artificial intelligence: JSAI-isAI 2012 workshops,
LENLS, JURISIN, MiMI, Miyazaki, Japan, November 30 and December 1, 2012, Revised
Selected Papers 4. Springer, Berlin, Heidelberg
28. Schilder F, Molina-Salgado H (2006) Evaluating a summarizer for legal text with a large text
collection. In: 3rd Midwestern computational linguistics colloquium (MCLC)
29. Hachey B, Grover C (2004) A rhetorical status classifier for legal text summarisation. In: Text
summarization branches out
30. Yousfi-Monod M, Farzindar A, Lapalme G (2010) Supervised ML for summarizing legal
documents. In: Canadian conference on artificial intelligence. Springer, Berlin, Heidelberg
31. Aumiller D, Fan J, Gertz M (2023) On the state of German (abstractive) text summarization.
arXiv preprint arXiv:2301.07095
32. Katz DM et al (2023) Natural language processing in the legal domain. arXiv preprint arXiv:
2302.12039
33. Taufiq U, Pulungan R, Suyanto Y (2023) Named entity recognition and dependency parsing
for better concept extraction in summary obfuscation detection. Expert Syst Appl, 119579
34. Mishra AR, Naruka MS, Tiwari S (2023) Extraction techniques and evaluation measures for
extractive text summarisation. In: Sustainable computing: transforming Industry 4.0 to Society
5.0. Springer International Publishing, Cham, pp 279–290
35. Thakur O, Saritha SK, Jain S (2023) Topic modeling, sentiment analysis and text summarization
for analyzing news headlines and articles. In: Machine learning, image processing, network
security and data sciences: 4th international conference, MIND 2022, Virtual Event, January
19–20, 2023, Proceedings, Part I. Springer Nature Switzerland, Cham
36. Nafees Muneera M, Sriramya P (2023) An enhanced optimized abstractive text summarization
traditional approach employing multi-layered attentional stacked LSTM with the attention
RNN. In: Computer vision and machine intelligence paradigms for SDGs: select proceedings
of ICRTAC-CVMIP 2021. Springer Nature Singapore, Singapore, pp 303–318
37. Yadav AK et al (2022) Extractive text summarization using DL approach. Int J Inf Technol
14(5):2407–2415
38. Aumiller D, Chouhan A, Gertz M (2022) EUR-Lex-Sum: a multi- and cross-lingual dataset for
long-form summarization in the legal domain. arXiv preprint arXiv:2210.13448
39. Sansone C, Sperlí G (2022) Legal information retrieval systems: state-of-the-art and open
issues. Inf Syst 106:101967
An Energy Conserving MANET-LoRa
Architecture for Wireless Body Area
Network
Sakshi Gupta, Manorama, and Itu Snigdh
Abstract The demand for technologies that provide solutions to people suffering
from chronic diseases is growing rapidly. These technologies also support continuous
health monitoring of patients for early intervention and prevention. Additionally, there
is a need for interoperation between different connected devices and application
services in smart health care. Among these technologies, a wireless body area network
(WBAN) is an appropriate option for monitoring people's health remotely. However,
existing systems suffer from high energy dissipation when processing data. This
article provides a system that leverages the advantages of the Internet of Things
(IoT)'s LoRa technology, mobile ad hoc network (MANET) systems, and data
aggregation schemes to conserve energy when transmitting packets. Our proposed
model optimizes and reduces energy dissipation in the network compared to existing
models. It also presents a novel approach for the early detection of urgent biosignals.
Keywords IoT · Healthcare · Biosignals · LoRa · Aggregation · MANET · WBAN
1 Introduction
IoT is currently a part of every physical object one wears, drives, reads, or sees. It
is used for applications that require phenomena to be tracked, measured, connected,
and controlled remotely [1]. Current technologies adopt IoT systems to enable better
S. Gupta (B)
Amity Institute of Information Technology, AMITY university, Noida, India
e-mail: sakshigupta660@gmail.com
Manorama
Amity Institute of Information Technology, Ranchi, India
e-mail: manorama7826@gmail.com
I. Snigdh
B.I.T Mesra, Ranchi, India
e-mail: itusnigdh@bitmesra.ac.in
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_45
decisions, effortless monitoring, time-saving, automation, and a better lifestyle.
Hence, IoT is going to reshape entire industries. According to Gartner's report, the
number of installed IoT devices reached 26.66 billion in 2019 and will reach 75.44 billion by 2025
[2]. IoT empowers one to transform physical objects to send and receive data. IoT’s
wide applications are smart cities, smart environment, smart metering, smart supply
chain, agriculture, healthcare, and mining production, to name a few [3].
Health care is one of the most well-considered applications of IoT, used to diagnose
diseases and monitor their treatment. Medical devices are integrated into the IoT to
provide the required effective treatment and diagnosis. Various sensors are used in
IoT healthcare, such as heart rate sensors, blood pressure monitors, blood glucose
meters, and thermometers, whose health readings can be accessed remotely. Several
data-management methods are used to handle the constant transmission of WBAN
data. Also, varied healthcare applications like health monitoring systems, fitness pro-
grams, chronic disease monitoring, ambient assisted living, drug management, and
monitoring of oxygen saturation at home are being provided by IoT-based healthcare.
Therefore, IoT facilitates healthcare applications through 24/7 patient data analysis,
emergency medical decision-making, cost reduction, and an enhanced quality
of patients' lifestyles [4]. With IoT tracking systems, health workers can get alerts
immediately when critical changes occur. This enables them to quickly locate patients
who need help and direct assistance as soon as possible. The amalgamation of IoT
in the healthcare sector provides various advantages, including improved treatment,
low cost, faster disease diagnosis, proactive treatment, reduced end-to-end delay,
improved management of drugs and equipment, and a better patient experience [5].
Nevertheless, IoT healthcare poses some challenges that need to be solved.
According to the literature [6, 7], one of the most critical challenges in IoT healthcare is
reducing energy consumption and delay. As most of the data is on the cloud, and data
analytics and data processing take time, communication of decisions incurs delays.
Another challenge is the connectivity, where issues arise in real-time data monitoring
in remote areas.
Moreover, as IoT is an emerging and still-growing technology, it needs a more
scalable architecture when merged into any specific application. Also, the implementation
of a full-scale IoT or BAN architecture in healthcare is not documented in the
literature. The current literature presents strategies for only partial patient monitoring
and healthcare data analysis, with Artificial Intelligence (AI) and Machine
learning (ML) techniques to learn and train machines. However, the concerns of
practical implementation of IoT healthcare need to incorporate the entire network’s
energy consumption and sustainability for successful operations [8].
2 Related Works
Ample literature exists for framework and data management in WBAN [9]. Data
management is the foremost requirement of WBAN for the early detection of disease
in patients [10]. In [11], a data segregation and classification scheme is used to
separate the sensors' readings into urgent, semi-urgent, and non-urgent packets over
the wireless protocol 6LoWPAN. For sending packets, the authors used two routes:
one from the gateway to the cloud, and a second from an access point to the cloud in
case of gateway failure. However, in this research the authors dropped non-urgent
packets completely to improve the power consumption of WBAN.
In [12], the authors proposed a cloud-based real-time remote health monitoring
system (CHMS) for home care patients by using data classification and delay-aware
routing metric to reduce congestion, interference, and delay.
Further, in [13] the authors proposed a collaborative body sensor network (CBSNs)
framework to implement a multisensory data fusion scheme to automatically detect
handshakes between two individuals and capture possible heart rate and emotions.
In [14], to bring down the energy consumption, the authors provide a data packet
aggregation algorithm for LoRa technology in the Internet of Things.
In order to collect data from a smart grid network that can expand dynamically, and to analyze the effect on energy consumption for such networks, the authors proposed an architecture in [15].
In order to achieve high-quality network efficiency in smart parking and IIoT applications, the authors suggested using a Bayesian belief network with fuzzy logic [16, 17].
However, a shortcoming of the literature above is that it only assumes data aggregation algorithms during implementation, without illustrating concrete aggregation algorithms and their impact on the system. In our proposed framework, we aggregate critical and non-critical data according to a specified aggregation ratio at the MANET layer to bring down the communication energy consumption.
The contributions of this paper are as follows:
We develop a WBAN communication framework that uses LoRa technology and the benefits of mobile ad hoc architecture for both homecare and hospital situations.
To save energy when transferring data packets comprising biosignals, our communication model applies data aggregation and fusion techniques at each layer.
The framework aims to quickly find pertinent information and determine whether a critical condition exists.
Using a straightforward categorization of data into urgent and non-urgent classes, we then use correlation among the transmitted data at the in-network level to corroborate the initial classification.
This article is organized as follows. Section 3 presents the preliminaries on WBAN architecture, LoRa technology, and mobile ad hoc architecture. Section 4 presents the proposed architecture and Sect. 5 the methodology. Section 6 outlines the results and discussion. Concluding remarks and future work are provided in Sect. 7.
3 Preliminaries
Chronic diseases affect many patients, and their number is growing daily. Hospitalization can be inconvenient and expensive, and patients often require health monitoring while working or going about daily activities. WBAN is thus a fix for this issue.
3.1 WBAN and IoT Health Care System
In Wireless Body Area Networks, nanosensors are implanted near and inside the patients' bodies to wirelessly transmit biosignals to doctors. The doctors then treat the patients virtually by observing, diagnosing, and prescribing [18]. Biosignals are continual records of the biological processes of living things; they include the EEG, ECG, EOG, blood pressure, body temperature, glucose level, and many more [19]. Additionally, to reduce computational complexity at the sensor level and avoid needless data expedition, data packets are categorized as critical or non-critical before being transferred to the upper layer. Figure 1 depicts a primary healthcare system in which sensors generate data and send it to the cloud via a gateway and a mobile device; all essential processing is done on the cloud. Doctors can access and analyze the data from the cloud, comment on it, and connect directly to patients [20].
Fig. 1 Traditional healthcare system
3.2 LoRa
One of the newest technologies in IoT, LoRa is appropriate for long-range, low-power communication [21]. The usual LoRa topology is a star, which uses more energy and transmits data at a slower rate. LoRa differs from other LPWAN technologies in that it gives users the option to customize physical-layer characteristics (transmission power, bandwidth, coding rate, spreading factor, carrier frequency) for their particular applications.
Spreading Factor (SF): In digital communication, the SF specifies the number of chips used to represent a symbol. SF takes values between 6 and 12; lower SF values achieve higher data rates.
Bandwidth (BW): This parameter determines the width of the communication channel and hence the volume of data that can be transmitted across it. BW ranges from 7.8 to 500 kHz; however, LoRa devices typically operate on one of three bandwidths: 125, 250, or 500 kHz.
Coding Rate (CR): The LoRa modem delivers corruption-free transmission by using forward error correction. This is accomplished by employing a coding rate that raises the time-on-air (ToA) of the packet while offering more robustness. Standard values for CR are 4/5, 4/6, 4/7, and 4/8.
Carrier Frequency (CF): The carrier frequency is the center frequency of the transmission band. The license-free sub-gigahertz band for LoRa transmitters and receivers spans from 860 to 1020 MHz.
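The interplay of these parameters determines a packet's time-on-air, which in turn drives transmit energy. Below is a minimal sketch of the standard LoRa ToA formula from the Semtech modem documentation; the payload size, preamble length, and flag settings used in the example are illustrative assumptions, not values taken from this paper.

```python
import math

def lora_time_on_air(payload_bytes, sf, bw_hz, cr=1, preamble_len=8,
                     explicit_header=True, crc=True, low_dr_opt=False):
    """Time-on-air (seconds) of one LoRa packet.

    cr is the coding-rate index: 1..4, meaning 4/5..4/8.
    """
    t_sym = (2 ** sf) / bw_hz                     # symbol duration in seconds
    ih = 0 if explicit_header else 1
    de = 1 if low_dr_opt else 0
    crc_bits = 16 * (1 if crc else 0)
    # Number of payload symbols (Semtech time-on-air formula).
    num = 8 * payload_bytes - 4 * sf + 28 + crc_bits - 20 * ih
    n_payload = 8 + max(math.ceil(num / (4 * (sf - 2 * de))) * (cr + 4), 0)
    t_preamble = (preamble_len + 4.25) * t_sym
    return t_preamble + n_payload * t_sym

# A higher SF lengthens ToA (more robust, slower); a wider BW shortens it.
toa_fast = lora_time_on_air(60, sf=7, bw_hz=500_000)
toa_slow = lora_time_on_air(60, sf=12, bw_hz=125_000)
```

This makes the SF trade-off concrete: each step up in SF roughly doubles the symbol duration, which is why low SF values are preferred when energy matters.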
LoRa provides three classes for end-device communication.
Class A: Class A devices are the most power-efficient and enable two-way communication between network servers and end devices (EDs). For each communication activity, these devices offer one uplink and two downlink transmissions: every uplink (ED to network server) transmission window is followed by two short downlink (network server to ED) receive windows.
Class B: In addition to the class A receive windows, class B devices provide downlink communication in extra scheduled slots. Time synchronization requires a prearranged beacon from the gateway. Compared to class A devices, these devices use more energy.
Class C: These devices inherit the features of class A devices, except that their receive window stays open whenever they are not transmitting. These devices use more energy than classes A and B, but incur very little delay.
3.3 Mobile Ad Hoc Network
MANET is valued for its capacity for self-organization, self-healing, and operation in environments with minimal network support. The MANET nodes, which may move about freely in their environment, are outfitted with wireless transmitters and receivers that may use omnidirectional antennas [22, 23]. A WSN is obviously the primary IoT data collection methodology, but the power and memory of WSN devices are constrained. In all situations, MANET systems concentrate on finding the optimum path to route the data (network discovery). The interaction between WSN routing principles, MANET protocols, and the IoT enables a new MANET-IoT system that provides improved user mobility and reduces network deployment costs [24]. In [15], the authors proposed an architecture using the characteristics of mobile ad hoc networks in IoT: WSN acts as the base layer, collecting data from the environment with various sensor devices, while MANET plays the role of an overlay architecture with movable devices. This type of architecture is suitable for urgent data transmission; therefore, MANET can be a good choice for healthcare systems.
4 Proposed Architecture for WBAN
Figure 2 shows the proposed architecture for bringing down the energy consumption in WBAN. The previous section described the advantages and disadvantages of the traditional IoT healthcare architecture. We have used LoRa technology for the proposed work, as LoRa is resilient to interference and works over long range.
Fig. 2 Flow of data transmission in proposed BSN architecture
In our architecture, LoRa devices and gateways lie at the lower layer. After applying data aggregation, LoRa EDs transfer data to their respective gateways. The gateway layer categorizes data packets as critical or non-critical based on a threshold value. These data packets, both critical and non-critical, are then delivered to the upper tiers. We also place a MANET layer above the gateways: MANET acts as a computing layer that can run feature selection algorithms, and MANETs are well suited here because they are self-configuring, self-healing networks [25, 26].
5 Methodology
Buffer aggregation is used at the MANET layer to cut down on network energy.
Fused packets are transmitted to the cloud layer using cooperative fusion. Figure 3
depicts the flow of transmitting the data packets from the sensor layer to the cloud.
First, every second, medical sensors gather biosignal data from a patient's body. The sensor nodes apply redundant fusion or aggregation processes to the data and transmit data packets every five seconds. These sensors include the SpO2, temperature, glucose, and blood pressure sensors.
Fig. 3 Flow of data transmission in proposed BSN architecture
Table 1 Threshold values of sensors
Sensor              Critical               Non-critical
EKG                 More than 100 bpm      60–100 bpm
SpO2                Less than 92%          92–99%
Diabetes sensor     More than 126 mg/dl    Less than 100 mg/dl
Temperature sensor  More than 100 F        Less than 99 F
When a data packet is received, the gateway determines whether it is urgent or not by checking the sensor data against the cumulative threshold value. Table 1 shows the threshold values of the sensors, for reference [12].
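The gateway's urgency check can be sketched directly from Table 1. The dictionary below simply encodes the table's critical thresholds; the field names are illustrative assumptions, not identifiers from the paper.

```python
# Critical thresholds encoded from Table 1 (units as in the table).
CRITICAL_IF = {
    "ekg_bpm":      lambda v: v > 100,   # EKG: more than 100 bpm
    "spo2_pct":     lambda v: v < 92,    # SpO2: less than 92 %
    "glucose_mgdl": lambda v: v > 126,   # Diabetes sensor: more than 126 mg/dl
    "temp_f":       lambda v: v > 100,   # Temperature: more than 100 F
}

def classify_packet(readings):
    """Mark a packet critical if any sensor reading crosses its threshold."""
    for sensor, value in readings.items():
        check = CRITICAL_IF.get(sensor)
        if check is not None and check(value):
            return "critical"
    return "non-critical"

print(classify_packet({"ekg_bpm": 80, "spo2_pct": 90}))  # -> critical
print(classify_packet({"ekg_bpm": 72, "temp_f": 98.6}))  # -> non-critical
```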
Data packets, both critical and non-critical, are forwarded to the MANET layer. To choose the most important health signals, the MANET layer runs a Principal Component Analysis and Canonical Correlation Analysis algorithm over the data gathered so far. Data values are checked against the threshold value range: a packet keeps its critical status if the sensor's threshold is crossed and the signal is also among the patient's most important health factors; otherwise, it changes to non-critical status.
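The paper does not detail the PCA/CCA implementation. As a crude, self-contained stand-in for that selection step, the sketch below ranks buffered signals by variance and keeps the top ones; a real deployment would run full PCA/CCA, and the function and signal names here are illustrative.

```python
from statistics import pvariance

def most_important_signals(window, k=1):
    """Rank signals by variance over a buffered window and keep the top k.

    `window` maps a signal name to its list of recent samples. This is a
    crude proxy for the PCA/CCA feature selection run at the MANET layer.
    """
    ranked = sorted(window, key=lambda s: pvariance(window[s]), reverse=True)
    return ranked[:k]

window = {
    "ekg_bpm": [72, 95, 130, 88, 140],          # highly variable -> informative
    "temp_f":  [98.6, 98.7, 98.6, 98.7, 98.6],  # nearly constant
}
print(most_important_signals(window))  # -> ['ekg_bpm']
```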
The MANET layer also carries out the data aggregation mechanism. At this layer, both critical and non-critical data packets are briefly buffered; a buffer size is chosen for critical and non-critical data packets in accordance with the network requirements.
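The buffering step can be sketched as a small aggregator that holds packets until the configured fusion ratio is reached and then emits a single fused packet. The class and method names are illustrative assumptions, and the "fusion" here simply groups the buffered payloads.

```python
class FusionBuffer:
    """Buffer packets and emit one fused packet per `ratio` arrivals."""

    def __init__(self, ratio):
        self.ratio = ratio   # fusion ratio, e.g. 1 or 3 (critical), 5 or 7 (non-critical)
        self.buffer = []
        self.sent = []       # fused packets handed to the next layer

    def push(self, packet):
        self.buffer.append(packet)
        if len(self.buffer) >= self.ratio:
            # Fuse the buffered packets into a single transmission.
            self.sent.append(tuple(self.buffer))
            self.buffer.clear()

critical = FusionBuffer(ratio=3)
for i in range(9):
    critical.push({"seq": i})
print(len(critical.sent))  # 9 packets -> 3 fused transmissions
```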
Cooperative fusion is a technique used by the cloud layer to combine data from several sources into a new piece of information. The most important characteristics are selected, conveyed to this layer, and then integrated to determine the disease.
6 Results
We have compared our results with the method used in [12]; Fig. 4 shows the comparison, and this section elaborates the findings. The work was simulated in Python, and Table 2 lists all the network configuration details. For the proposed architecture, class A LoRa end devices are assumed. With reference to the existing research [27], we chose the most effective LoRa parameter settings as determined by LoRaWAN simulators: SF = 6, BW = 500 kHz, and CR = 4/5.
Our suggested WBAN architecture uses less energy than the existing approach. The authors of [12] dropped every non-critical packet and transmitted only critical packets to the upper layer.
Fig. 4 Results of all transferred data packets (*CHMS = cloud-based healthcare monitoring system)
Table 2 Network configuration parameters
Name                                               Values
ROI                                                100 * 100
Number of sensor nodes                             4
Buffer size for existing work                      1
Fusion ratio for proposed work                     Critical packets: 1 packet, 3 packets; Non-critical packets: 5 packets, 7 packets
Transmission cost for gateway and critical packet  0.002 mW
Transmission cost for fusion packet                0.005 mW
Packet size                                        60 bytes
Spreading factor                                   6
Bandwidth                                          500 kHz
Carrier frequency                                  868 MHz
Transmission power                                 14 dBm
Coding rate                                        4/5
By contrast, we send every data packet, critical and non-critical, to the higher layers. Figure 5a and b show the energy consumption when only critical data packets are transmitted at the MANET layer with data fusion ratios of 1 and 3, respectively. Figure 6a and b show the energy consumption when only non-critical data packets are sent at the MANET layer with data fusion ratios of 5 and 7, respectively. Our innovation lies in selecting a compression value that permits the transmission of non-urgent packets, making it easier to maintain historical medical records for future use.
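Using the per-transmission costs from Table 2, the saving from fusion can be reproduced with simple arithmetic. The accounting below is an illustrative model, not the authors' simulator; it only counts transmission events times cost.

```python
import math

# Per-transmission costs taken from Table 2.
COST_SINGLE = 0.002   # mW: one gateway/critical packet sent on its own
COST_FUSED = 0.005    # mW: one fused packet

def energy_individual(n_packets):
    """Existing approach: every packet is transmitted individually."""
    return n_packets * COST_SINGLE

def energy_fused(n_packets, ratio):
    """Proposed approach: packets are fused `ratio`-at-a-time before sending."""
    return math.ceil(n_packets / ratio) * COST_FUSED

n = 100
base = energy_individual(n)        # 100 individual transmissions
fused3 = energy_fused(n, ratio=3)  # 34 fused transmissions, cheaper overall
fused5 = energy_fused(n, ratio=5)  # 20 fused transmissions, cheaper still
```

Even though a fused packet costs more per transmission (0.005 vs. 0.002 mW), fusing three or more packets per send reduces the total, which matches the trend in Figs. 5 and 6.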
Fig. 5 Critical data transmitted, a fusion ratio = 1, b fusion ratio = 3 (*CHMS = cloud-based healthcare monitoring system)
7 Conclusion
This paper provides a scalable, energy-efficient MANET-LoRa-based architecture for WBAN. The lower layer uses LoRa technology, and the higher layer uses MANET. The gateway layer distinguishes between critical and non-critical data. We have also utilized data aggregation and feature selection techniques for the early diagnosis of diseases at the MANET layer, and proposed cooperative fusion at the cloud layer; the MANET-LoRa architecture for WBAN thus successfully conserves energy. However, this architecture introduces delay and an increase in algorithm complexity. The future scope of our work focuses on optimizing the delay of the network.
Fig. 6 Non-critical data packets transmitted, a fusion ratio = 5, b fusion ratio = 7 (*CHMS = cloud-based healthcare monitoring system)
References
1. Boikanyo K, Zungeru AM, Sigweni B, Yahya A, Lebekwe C (2023) Remote patient monitoring
systems: applications, architecture, and challenges. Sci African 01638
2. Silverio-Fernandez MA, Renukappa S, Suresh S (2019) Evaluating critical success factors for implementing smart devices in the construction industry: an empirical study in the Dominican Republic. Eng Construct Arch Manage
3. Lee I, Lee K (2015) The internet of things (IoT): applications, investments, and challenges for
enterprises. Business Horizons 58(4):431–440
4. Balandina E, Balandin S, Koucheryavy Y, Mouromtsev D (2015) IoT use cases in healthcare
and tourism. In: 2015 IEEE 17th Conference on business informatics, Vol 2. IEEE, pp 37–44
5. Mustafa T, Varol A (2020) Review of the internet of things for healthcare monitoring. In: 2020
8th International symposium on digital forensics and security (ISDFS). IEEE, pp 1–6
6. Baker SB, Xiang W, Atkinson I (2017) Internet of things for smart healthcare: technologies,
challenges, and opportunities. IEEE Access 5:26521–26544
7. Zou N, Liang S, He D (2020) Issues and challenges of user and data interaction in healthcare-
related IoT: a systematic review. Library Hi Tech
8. Gupta S, Snigdh I (2022) An energy-efficient information-centric model for internet of things
applications. In: 2022 International conference on IoT and blockchain technology (ICIBT).
IEEE, pp 1–5
9. Mohapatro M, Snigdh I (2020) Security in IoT healthcare. In: IoT security paradigms and
applications. CRC Press, pp 237–259
10. Abiodun AS, Anisi MH, Khan MK (2019) Cloud-based wireless body area networks: managing data for better health care. IEEE Consum Electron Mag 8(3):55–59
11. Abiodun AS, Anisi MH, Ali I, Akhunzada A, Khan MK (2017) Reducing power consumption in wireless body area networks: a novel data segregation and classification technique. IEEE Consum Electron Mag 6(4):38–47
12. Almashaqbeh G, Hayajneh T, Vasilakos AV, Mohd BJ (2014) Qos-aware health monitoring
system using cloud-based WBANs. J Med Syst 38(10):1–20
13. Fortino G, Galzarano S, Gravina R, Li W (2015) A framework for collaborative computing
and multi-sensor data fusion in body sensor networks. Inform Fusion 22:50–70
14. Gupta S, Snigdh I (2022) Leveraging data aggregation algorithm in LoRa networks. J Supercomput 1–15
15. Gupta S, Snigdh I (2021) Analyzing impacts of energy dissipation on scalable IoT architectures
for smart grid applications. In: Advances in smart grid automation and industry 4.0. Springer,
pp 81–89
16. Gupta S, Snigdh I (2023) Applying Bayesian belief in LoRa: smart parking case study. J Ambient Intell Humaniz Comput 1–14
17. Gupta S, Snigdh I, Sahana SK (2022) A fuzzy logic approach for predicting efficient LoRa communication. Int J Fuzzy Syst 1–9
18. Mohapatro M, Snigdh I (2021) An experimental study of distributed denial of service and sink
hole attacks on IoT based healthcare applications. Wireless Pers Commun 121:707–724
19. Parlitz U, Berg S, Luther S, Schirdewan A, Kurths J, Wessel N (2012) Classifying cardiac
biosignals using ordinal pattern statistics and symbolic dynamics. Comp Biol Med 42(3):319–
327
20. Gupta S, Singh U (2021) Ontology-based IoT healthcare systems (IHS) for senior citizens. Int
J Big Data Anal Healthcare (IJBDAH) 6(2):1–17
21. Alliance L (2015) A technical overview of LoRa and LoRaWAN. White Paper, November 20
22. Gupta U, Pantola D, Bhardwaj A, Singh SP (2022) Next-generation networks enabled tech-
nologies: challenges and applications. Next Gener Commun Netw Indust Internet of Things
Syst 191–216
23. Soni G, Gupta U, Singh N (2014) Analysis of modified substitution encryption techniques
24. Bruzgiene R, Narbutaite L, Adomkus T (2017) MANET network in internet of things system. Ad Hoc Netw 66:89–114
25. Bellavista P, Cardone G, Corradi A, Foschini L (2013) Convergence of MANET and WSN in IoT urban scenarios. IEEE Sens J 13(10):3558–3567
26. Gupta P, Tripathi S, Singh S (2021) Energy-efficient routing protocols for cluster-based hetero-
geneous wireless sensor network (HETWSN)-strategies and challenges: a review. Data Anal
Manage Proc ICDAM 853–878
27. Bor MC, Roedig U, Voigt T, Alonso JM (2016) Do LoRa low-power wide-area networks scale? In: Proceedings of the 19th ACM international conference on modeling, analysis and simulation of wireless and mobile systems, pp 59–67. https://doi.org/10.1145/2988287.2989163
Blockchain Integration with Internet
of Things (IoT)-Based Systems for Data
Security: A Review
Gagandeep Kaur, Rajesh Shrivastava, and Umesh Gupta
Abstract Blockchain technology offers a secure channel for communicating between entities without the role of any third party. It is a digital ledger of transactions in a computer network that makes it hard for hackers to attack or alter the information. Banking, supply chains, precision agriculture, smart cities, cyber-physical systems, industrial IoT, and health care are among the sectors in which blockchain technology has been adopted to enhance security. In recent times, these sectors have been revolutionized by digital transformation using sensor-aided physical devices forming Internet of Things (IoT) systems. Blockchain-based IoT systems play a vital role in replacing the conventional methods of storing and sharing data with a more reliable method; the integration of the two technologies results in a secure, reliable, and smart system. This paper presents the background and working principle of blockchain technology. It also discusses the need for security and the security challenges in IoT-based systems, briefly covers smart contracts and the motivation behind integrating blockchain technology with IoT-based systems, and finally proposes a secure IoT-based land registry architecture.
Keywords Blockchain · Computer network · Internet of Things (IoT) · Privacy protection
G. Kaur (B)
Department of Computer Science and Engineering, Madhav Institute of Technology and Science,
Gwalior, India
e-mail: gagan873@gmail.com; gagandeep@mitsgwalior.in
R. Shrivastava ·U. Gupta
School of Computer Science Engineering and Technology, Bennett University, Greater Noida,
India
e-mail: rajesh.shrivastava@bennett.edu.in
U. Gupta
e-mail: er.umeshgupta@gmail.com
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_46
617
1 Introduction
In the current era, Internet of Things (IoT) technology has gained popularity. This is due to the addition of sensing, receiving, and transmitting capabilities to any physical object. IoT plays a vital role in various verticals of everyday life and finds application in the Industrial IoT, precision agriculture, smart cities, and healthcare [1]. The sensors in an IoT system sense various physical parameters; the data collected from these sensors is stored on a central server. However, data security on the central server and its privacy preservation are important aspects. IoT brings several advantages, but leakage of sensitive information or an attack by hackers can ruin its sole purpose. Therefore, connecting several physical objects over the Internet with strong protection of the data is the prime requirement. The focus of research is to save this collected data securely in a decentralized architecture. The integration of blockchain technology in IoT permits physical objects to securely transmit data over a peer-to-peer network. The blockchain minimizes the risk of any fraudulent information entering the IoT network, because before data enters the network, consent is taken from the majority of the users instead of a single central authority [2]. Furthermore, a blockchain-based IoT system builds a strong and robust network that prevents hackers from stealing all the information simply by attacking a central server. In IoT-based systems, data security is obtained by applying encryption. The encryption technique requires each node in the IoT system to carry two keys, namely a public key and a private key. The public key is available to other nodes and is used to encrypt data, which is then broadcast to all other nodes in the network; the private key is secret to the individual node and is used for decrypting the data. Blockchain technology prevents a fraudulent intruder from falsely encrypting the information. In a blockchain-based IoT network, all physical objects within the network are identified by their public keys; this sharing, however, may allow a third party to infer the identity of the IoT participants. The blockchain-based IoT system eliminates the single point of failure, which enhances the fault tolerance and reliability of the IoT system. IoT devices participating in a blockchain-based network can verify data integrity and the identity of the sender [3]. Blockchain also provides secure software updates and data storage capability to IoT-based systems. Furthermore, blockchain technology stores data in an immutable ledger, which provides backtracking and traceability capabilities.
IoT allows the interconnectivity of physical objects, "things", by providing sensing and transmitting capabilities. In the current era, IoT finds applications in everyday life, creating smart networks capable of sensing various physical parameters and taking valuable decisions. Decentralization enhances the scalability and performance of IoT networks. With the exponential growth in the adoption and popularity of IoT technology, the demand for secure data storage and transmission is increasing immensely. The security of the data is essential; any data leakage or attack by an intruder can lead to the disclosure of critical information. Thus, it is essential to preserve the privacy of data and grant access only to authorized users. There are several prerequisite security requirements for IoT-based systems: confidentiality, integrity, authenticity, non-repudiation, authorization, and availability. The integration of blockchain technology with IoT enhances the reliability of IoT-based systems in terms of security. The blockchain provides data security to resource-constrained end devices in IoT-based systems and is capable of handling the heterogeneity, privacy protection, and confidentiality of IoT-based systems.
2 Literature Review
This section reviews the state-of-the-art approaches. As noted by Bhutta et al. [4], blockchain technology was introduced in 1991 as "a cryptographically secured chain of blocks". As discussed by Baur et al. [5], blockchain was implemented as a public ledger and received universal recognition through the cryptocurrency Bitcoin; it is gaining popularity in various sectors such as agriculture, logistics, and insurance. Blockchain technology provides a secure distributed architecture that works without the intervention of any centralized or third party. Wang et al. [6] defined blockchain as a chain of blocks linked cryptographically using hash functions, operating on a peer-to-peer network of participants. Blockchain technology provides the highest degree of accountability; this feature has resulted in its adoption for data transmission and record-keeping in various sectors of real-life applications, where it provides proper documentation and digitally confirms the ownership of assets. Hildebrand et al. [7] note that blockchain blocks are ordered unambiguously using a consensus algorithm, which makes blockchain technology verifiable, consistent, and auditable, and enhances integrity among all participants. Jain et al. [8] describe a block as consisting of version information, the hash of the parent block, a timestamp, a nonce, a transaction count, and the combined hash of the transactions. Whenever a new block is generated, each participant applies a block authentication process; after appropriate validation and approval, the block is appended to the parent block by reference. This process helps detect unidentified or unauthorized transactions, since the hash values of unauthorized or falsified blocks differ completely from those of authorized blocks. Meryem et al. [9] proposed the integration of blockchain technology for security in IoT-based smart homes, and Mohamed et al. [10] proposed security for IoT-enabled smart industry environments in Industry 4.0 applications.
3 Working Principle of Blockchain
A blockchain structure is represented as a list of blocks with ordered transactions, stored as a flat-file database. Note that no pointer points to the first block, and the terminal block holds a null pointer. Figure 1 shows the structure of a blockchain. A block contains a version, the parent block hash, a timestamp, a nonce, a transaction count, and a Merkle root. The nonce is an integer that starts from 0 and is incremented every time the hash is calculated; the Merkle root is the combined hash of all transactions. Figure 2 shows the working principle of blockchain technology. Whenever a new record or transaction is to be added to a blockchain, it needs to be verified and digitally signed by the nodes in the system. Any block contains data, its own hash value, and the hash of the previous block. The kind of data stored depends on the blockchain; for a cryptocurrency it includes the receiver, the sender, and the amount of coins. A hash is like a digital signature or a fingerprint: the hash of a block is generated using a cryptographic hash algorithm and identifies each block in the blockchain structure. Any modification of a block changes its hash. The hash of the previous block forms the chain structure, which plays a major role in providing security: any fraudulent attempt to change the data of a block invalidates the whole blockchain system. Proof-of-work is performed by miners, which are special nodes within the blockchain structure; the miners receive transaction fees as a reward from the block. Whenever a new block is created, it is verified by all nodes in the system, and all nodes adhere to the consensus protocol. This makes a blockchain system immutable and secure [11]. Table 1 shows the classification of blockchain architectures.
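The block structure and hash chaining described above can be illustrated with a short, self-contained sketch. SHA-256 stands in for the hash algorithm, the flat concatenation used for the Merkle root is a simplification of a real Merkle tree, and the two-zero proof-of-work difficulty is an arbitrary toy setting.

```python
import hashlib
import json
import time

def merkle_root(transactions):
    """Combined hash of all transactions (flat concatenation for brevity)."""
    joined = "".join(hashlib.sha256(t.encode()).hexdigest() for t in transactions)
    return hashlib.sha256(joined.encode()).hexdigest()

def mine_block(parent_hash, transactions, difficulty="00"):
    """Increment the nonce from 0 until the block hash meets the difficulty."""
    block = {
        "version": 1,
        "parent_hash": parent_hash,
        "timestamp": time.time(),
        "tx_count": len(transactions),
        "merkle_root": merkle_root(transactions),
        "nonce": 0,
    }
    while True:
        digest = hashlib.sha256(
            json.dumps(block, sort_keys=True).encode()).hexdigest()
        if digest.startswith(difficulty):
            return block, digest
        block["nonce"] += 1

genesis, g_hash = mine_block("0" * 64, ["alice->bob:5"])
block1, b_hash = mine_block(g_hash, ["bob->carol:2"])
# Tampering with genesis changes its hash, so block1's parent_hash no longer
# matches and everything downstream of the tampered block is invalidated.
```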
The major characteristics that have resulted in the adoption of blockchain technology in real-life applications are as follows:
Transparency: Participants in public blockchain systems can communicate with equal rights. In public blockchains such as Ethereum and Bitcoin, the authentication of each transaction is recorded, and this data is available to all participants in the network. Thus, data on the blockchain is transparent to each node, which can validate the committed transactions.
Decentralization: In a centralized architecture, the validation of transactions is performed by a central server, which causes a bottleneck problem. Blockchain instead works on a distributed architecture where the validation of transactions is performed in a peer-to-peer manner. This enhances the performance of the system in terms of cost-effectiveness, resolving the bottleneck problem, and avoiding single-point failure.
Fig. 1 Blockchain structure
Fig. 2 Working principle of blockchain technology
Table 1 Classification of blockchain architecture
Property                 Private blockchain        Public blockchain            Consortium blockchain
Consensus determination  Within one organization   All miners                   Selected set of nodes
Framework                Partially decentralized   Fully decentralized          Partially decentralized
Read permission          Public or restricted      Public                       Public or restricted
Traceability             Fully traceable           Fully traceable              Partially traceable
Immutability level       Could be tampered         Almost impossible to tamper  Could be tampered
Resource efficiency      High                      Low                          High
Centralization           Yes                       No                           Partial
Consensus process        Needs permission          Permissionless               Needs permission
Scalability              High                      Low                          High
Flexibility              High                      Low                          High
Transaction speed        Fast                      Slow                         Fast
Immutability: In a blockchain, the chain structure is formed by linking blocks through hash values. Any data tampering invalidates all subsequent blocks; thus, the blockchain structure is immutable.
Pseudonymity: The blockchain is partially confidential, as the addresses of participants can be traced.
Non-repudiation: Each participant in the blockchain system holds a private key. What it encrypts can be decrypted by other participants with the help of the corresponding public key; thus, cryptographically signed transactions are non-repudiable.
Traceability: Blockchain technology offers traceability, achieved through the timestamp attached to every transaction. This allows tracing the origin and modification of any transaction [12].
4 Smart Contracts
A smart contract can be defined as a program stored on a blockchain that runs only when predetermined conditions are fulfilled. The benefit of smart contracts is that they automate the execution of an agreement, making the agreement reliable without involving any third party. Furthermore, this is a faster approach that makes the system autonomous: workflows are maintained, and consecutive actions are triggered on fulfillment of the predetermined conditions. Thus, smart contracts provide a secure and automatic contractual mechanism in which the contracting parties evaluate the success or violation of the contract. Smart contracts are programs that encrypt and replicate contractual agreements [13]. Smart contracts on blockchain technology bring various benefits to the computing domain, such as no commission fees, no dependence on a trusted party, and no mutual interaction of counterparties. A smart contract can be generated by publishing a transaction to the blockchain; the miners in the blockchain run smart contracts and reach agreement on their execution. Each contract is assigned a 160-bit address on deployment, and a transaction sent to this address executes the contract. There are various platforms for the development of smart contracts, such as Bitcoin, Ethereum, Hyperledger Fabric, Nem, Corda, Stellar, Waves, Cardano, Neo, EOS, Rootstock, Tendermint, and Quorum. Such platforms implement smart contracts that offer a modern way of exchanging money, with innovative solutions and easy interfaces for developers.
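The 160-bit address mentioned above can be illustrated by hashing the deploying transaction and keeping 20 bytes. This is a simplified, illustrative stand-in: Ethereum, for example, derives contract addresses with Keccak-256 over an RLP encoding of (creator, nonce), and the creator string below is a made-up placeholder.

```python
import hashlib

def contract_address(creator, nonce):
    """Derive a 160-bit (20-byte) address from the deploying transaction.

    Simplified sketch: a real platform hashes a canonical encoding of the
    transaction; here we hash a plain string and keep the low 20 bytes.
    """
    digest = hashlib.sha256(f"{creator}:{nonce}".encode()).digest()
    return digest[-20:].hex()   # 20 bytes = 160 bits, as 40 hex characters

addr = contract_address("example-creator", nonce=0)
print(len(addr))  # 40 hex characters = 160 bits
```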
5 Blockchain Integration with IoT
IoT provides digital transformation and has revolutionized the real world. Its network of sensors produces large volumes of data that are analyzed by a central server, which makes valuable decisions and supports knowledge discovery. The information in an IoT-based system requires safe and secure transmission and storage. Integrating blockchain technology with IoT enhances the reliability of IoT-based systems in terms of security [14]. The blockchain provides data security to resource-constrained end devices in IoT-based systems, and it is capable of handling the heterogeneity, privacy protection, and confidentiality requirements of such systems. The major advantages of integrating blockchain technology in IoT-based systems are listed below:
- The data collected by the sensors is secured by the blockchain. This data is stored in the blockchain network in the form of encrypted transactions.
- Blockchain technology provides IoT-based systems with enhanced interoperability. No third party is involved in interactions between IoT devices, which makes the whole IoT system autonomous [15, 16].
- Blockchain technology enhances the reliability of IoT-based systems by providing availability, authenticity, confidentiality, accountability, and traceability. It also speeds up the processes of IoT-based systems by providing secure and decentralized features with no third-party intervention.
- Blockchain technology utilizes consensus mechanisms that prevent denial-of-service attacks by imposing a charge for each transaction. Implementing the technology in IoT networks enhances overall security by enforcing access control and data integrity [17, 18].
- Blockchain technology makes it impossible for intruders to modify records or hide transactions in IoT-based systems. This is achieved through a decentralized consensus mechanism, and data encryption with public and private keys provides privacy preservation [19, 20].
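The tamper-evidence and timestamp-based traceability that blockchain brings to IoT data can be illustrated with a minimal hash-chaining sketch in Python. This is a toy model, not a full blockchain; the field names and sensor readings are our own.

```python
# Minimal illustration of hash chaining with timestamps: each block
# stores the hash of its predecessor, so modifying any past sensor
# record invalidates every later hash and tampering is detected.
import hashlib
import json
import time

def block_hash(block):
    # Hash the block's canonical JSON form (excluding its own hash).
    payload = {k: v for k, v in block.items() if k != "hash"}
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

def append_block(chain, sensor_data, timestamp=None):
    block = {
        "index": len(chain),
        "timestamp": timestamp if timestamp is not None else time.time(),
        "data": sensor_data,
        "prev_hash": chain[-1]["hash"] if chain else "0" * 64,
    }
    block["hash"] = block_hash(block)
    chain.append(block)
    return block

def verify(chain):
    # A chain is valid iff every stored hash and back-link checks out.
    for i, block in enumerate(chain):
        if block["hash"] != block_hash(block):
            return False
        if i > 0 and block["prev_hash"] != chain[i - 1]["hash"]:
            return False
    return True

chain = []
append_block(chain, {"sensor": "temp-01", "reading": 21.4})
append_block(chain, {"sensor": "temp-01", "reading": 21.9})
print(verify(chain))                  # True
chain[0]["data"]["reading"] = 99.9    # an intruder edits a past record
print(verify(chain))                  # False: tampering is detected
```

The timestamp in each block is what supports the traceability property discussed in Sect. 3.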
6 Proposed Secure IoT-Based Land Registry Architecture
The land registry involves sharing transactional information about land parcels. The blockchain enhances data security and provides a secure land registry system, including authentication of all land transactions between the involved parties. Blockchain technology prevents illegal land transactions and, through its hash-based chain structure, detects fraudulent modification of registry records. Thus, it helps to secure land transactions and registry records. Figure 3 shows the proposed land registry system architecture.
Initially, all land registry centers and users are required to register themselves in a mobile application. They receive pairs of public and private keys by executing the registration function. A user can request a land registry center to issue a certificate. When the user initiates a request to the authorities, verification of the
624 G. Kaur et al.
Fig. 3 Proposed secure IoT-based land registry architecture
user's details is performed against the information stored on the blockchain network. The blockchain network issues a certificate based on the user's details stored during registration. The issued certificate is stored in the decentralized Inter-Planetary File System. The user then receives the calculated hash value. The land registry details are managed as a transaction with a unique ID, which is stored in a specific block of the blockchain network.
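The registration, certificate, and transaction flow described above can be sketched as follows. This is a toy Python illustration: the "key pair" here is simulated with hashes, whereas a real deployment would use asymmetric cryptography and store certificates in IPFS; all names, parcel IDs, and helper functions are hypothetical.

```python
# Toy sketch of the land registry flow: register a user (simulated key
# pair), issue a certificate, and record its hash as a transaction
# with a unique ID. NOT real public-key cryptography.
import hashlib
import secrets
import uuid

ledger = {}   # transaction ID -> certificate hash (stands in for a block)

def register_user(name):
    # Registration returns a (public, private) key pair for the user.
    private_key = secrets.token_hex(32)
    public_key = hashlib.sha256(private_key.encode()).hexdigest()
    return {"name": name, "public": public_key, "private": private_key}

def issue_certificate(user, parcel_id):
    # The authority verifies the user's stored details, then issues a
    # certificate; its hash is what the user receives.
    certificate = f"{user['name']}:{user['public']}:{parcel_id}"
    return hashlib.sha256(certificate.encode()).hexdigest()

def record_transaction(cert_hash):
    # Land registry details are managed as a transaction with a unique ID.
    tx_id = str(uuid.uuid4())
    ledger[tx_id] = cert_hash
    return tx_id

user = register_user("alice")
cert_hash = issue_certificate(user, parcel_id="LAND-042")
tx_id = record_transaction(cert_hash)
print(ledger[tx_id] == cert_hash)   # True
```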
7 Conclusion
The paper presents a comprehensive survey of blockchain technology, including the characteristics, architecture, and working principle of blockchain. It describes the need for data security and privacy preservation in IoT-based systems, the benefits of integrating blockchain technology with IoT-based systems, and the solutions that blockchain technology provides to IoT-based systems in terms of security. The paper proposes a blockchain-based approach to provide security in terms of authenticity, integrity, availability, and confidentiality, and presents a secure IoT-based land registry system architecture.
References
1. Maraveas C, Piromalis D, Arvanitis KG, Bartzanas T, Loukatos D (2022) Applications of IoT
for optimized greenhouse environment and resources management. Comput Electron Agric
198:106993
2. Deepa N, Pham QV, Nguyen DC, Bhattacharya S, Prabadevi B, Gadekallu TR, Pathirana PN
(2022) A survey on blockchain for big data: approaches, opportunities, and future directions.
Future Gener Comp Syst
3. Jeoung J, Jung S, Hong T, Choi JK (2022) Blockchain-based IoT system for personalized
indoor temperature control. Autom Constr 140:104339
4. Bhutta MNM, Khwaja AA, Nadeem A, Ahmad HF, Khan MK, Hanif MA, Cao Y et al (2021) A
survey on blockchain technology: evolution, architecture and security. IEEE Access 9:61048–
61073
5. Baur DG, Hong K, Lee AD (2018) Bitcoin: Medium of exchange or speculative assets? J Int
Finan Markets Inst Money 54:177–189
6. Wang R, Tsai WT (2022) Asynchronous federated learning system based on permissioned
blockchains. Sensors 22(4):1672
7. Hildebrand B, Baza M, Salman T, Amsaad F, Razaqu A, Alourani A (2022) A comprehensive
review on blockchains for internet of vehicles: challenges and directions. arXiv preprint arXiv:
2203.10708
8. Jain A, Srivastava N (2022) Privacy-preserving record linkage with block-chains. In: Cyber
security, privacy and networking. Springer, Singapore, pp 61–70
9. Ammi M, Alarabi S, Benkhelifa E (2021) Customized blockchain-based architecture for secure
smart home for lightweight IoT. Inf Process Manage 58(3):102482
10. Ferrag MA, Shu L (2021) The performance evaluation of blockchain-based security and privacy
systems for the Internet of Things: a tutorial. IEEE Internet Things J 8(24):17236–17260
11. Dannen C (2017) Introducing ethereum and solidity, Vol 1. Berkeley: Apress, pp 159–160
12. Xu J, Guo S, Xie D, Yan Y (2020) Blockchain: a new safeguard for agri-foods. Artif Intell
Agric 4:153–161
13. Musamih A, Salah K, Jayaraman R, Arshad J, Debe M, Al-Hammadi Y, Ellahham S (2021)
A blockchain-based approach for drug traceability in healthcare supply chain. IEEE Access
9:9728–9743
14. Omar IA, Hasan HR, Jayaraman R, Salah K, Omar M (2021) Implementing decentralized
auctions using blockchain smart contracts. Technol Forecast Soc Chang 168:120786
15. Kaur G, Bhattacharya M, Chanak P (2019) Energy conservation schemes of wireless sensor
networks for IoT applications: a survey. In: 2019 IEEE conference on information and
communication technology. IEEE, pp 1–6
16. Kaur G, Chanak P, Bhattacharya M (2022) A Green hybrid congestion management scheme
for IoT-enabled WSNs. IEEE Trans Green Commun Netw 6(4):2144–2155
17. Dwivedi SP, Srivastava V, Gupta U (2023) Graph similarity using tree edit distance. In: Proceed-
ings of the structural, syntactic, and statistical pattern recognition: joint IAPR international
workshops, S+ SSPR 2022, Montreal, QC, Canada, August 26–27, 2022. Cham: Springer
International Publishing, pp 233–241
18. Yadav S, Mishra R, Gupta U (2015) Performance evaluation of different versions of 2D
Torus network. In: 2015 International conference on advances in computer engineering and
applications. IEEE, pp 178–182
19. Gahlot A, Gupta U (2016) Gaze-based authentication in cloud computing. Int J Comp Appl
1(1):14–20
20. Soni G, Gupta U, Singh N (2014) Analysis of modified substitution encryption techniques
Comparative Study of Heart Failure
Using the Approach of Machine Learning
and Deep Neural Networks
Shachi Mall and Jagendra Singh
Abstract Heart failure, a complicated clinical syndrome, occurs when the heart cannot pump enough oxygenated blood to satisfy the body's metabolic needs. It is a major public health problem and is associated with significant morbidity and mortality. As healthcare and diagnostics become more collaborative, care workers deliberately mine and store patient medical information to create opportunities for enhanced treatment planning. To predict strokes, this paper performs a comprehensive evaluation of the many variables in electronic heart data. The most crucial variables for stroke prediction are identified using principal component analysis. We consider a set of 12 different attributes that are common symptoms of various heart conditions. These features are employed to predict cardiovascular disease; for each attribute there are 918 records, taken from Kaggle. The dataset is split into 70% for training and 30% for testing. We apply the training and test data to different machine learning algorithms, i.e., the K Neighbors Classifier and the Random Forest Classifier, and to a deep neural network, and compare the accuracy results of all three methods: the K Neighbors Classifier achieves 0.877, the Random Forest Classifier 0.8590, and the deep neural network 0.89. In our investigations, we find that deep neural networks are superior to the machine learning algorithms.
Keywords Chronic heart failure · Heart disease · K neighbors classifier · Random forest classifier · Deep neural network
S. Mall (B)·J. Singh
School of Computer Science Engineering and Technology, Bennett University, Greater Noida,
India
e-mail: shachimall@gmail.com
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_47
1 Introduction
Heart disease is a term frequently used for a range of cardiovascular disorders. These problems are primarily associated with blocked or constricted blood vessels, which can cause stroke, angina or chest pain, and cardiac arrest. Considerable effort has gone into developing monitoring technologies and processes for diagnosing heart conditions remotely, although technology fatigue has been identified as a barrier to adherence. Machine learning and neural networks play an important role in identifying whether a person has heart disease [1]. The World Health Organization (WHO) states that approximately twenty-six million individuals worldwide struggle with heart failure, and the number is expected to rise due to aging populations and increasing rates of risk factors such as hypertension, diabetes, and obesity. If current trends continue, approximately 22 million deaths worldwide are predicted by 2030 [2]. Social Networks for Health are online communities where people with related medical conditions can interact, share experiences, and offer emotional support to one another. Social Networks for Health, which let patients share their experiences online, have recently been suggested by various academics as a practical way for people to assist one another [3]. Machine learning, a part of artificial intelligence, has the potential to improve our understanding of heart failure and to support better diagnostic and treatment strategies. The K Neighbors Classifier and Random Forest Classifier are machine learning algorithms [4] used here to check performance on the selected 12 attributes, which form the feature set. On the basis of feature selection, 70% of the dataset is used to train the K Neighbors and Random Forest Classifier machine learning algorithms, and the remaining 30% is used to test the same approaches. To decrease dimensionality when analyzing the data, we use principal component analysis (PCA). In the context of heart stroke, PCA can be used for feature selection to identify the most important variables that are strongly associated with the outcome [5]. When PCA is used for feature selection in heart stroke, the data are first collected, i.e., the 12 attributes covering the patient's gender, age, blood pressure, cholesterol levels, and other health indicators, as stated in Table 1.
The same process is then applied to a deep neural network (DNN). A DNN is a type of artificial neural network (ANN), which resembles the neurons of the human brain. An artificial neural network has separate layers, connections, and a propagation direction. Each layer consists of nodes, with arrows indicating the relationships between them. An artificial neural network's input layer is dense with nodes, and these input-layer nodes are interconnected with the nodes of the hidden layer. A weight is assigned to each input. The network's input nodes provide information to the nodes in the hidden layer, which process it by carrying out various operations or calculations before sending the results to the output node. The node that produces the final result is in the output layer. The developed system is used to diagnose heart failure; for this we have considered 12 different attributes and
Table 1 Attributes of heart disease

S. No. | Attribute name | Description
1 | Age | Age in years
2 | Gender | 1 = M, 0 = F
3 | ChestPainType | Types 1, 2, 3, and 4
4 | RestingBP | Blood pressure at rest
5 | Cholesterol | Cholesterol
6 | FastingBS | 1 = True, 0 = False
7 | RestingECG | Result of resting electrocardiogram: 0 = Normal, 1 = Abnormal, 2 = left ventricular hypertrophy
8 | MaxHR | Maximum heart rate
9 | ExerciseAngina | Exercise-induced angina
10 | Oldpeak | Depression related to ST
11 | ST_Slope | Slope of the peak exercise ST segment (slopes 1, 2, and 3 are upsloping, flat, and downsloping, respectively)
12 | HeartDisease | 0 = Heart Disease, 1 = No Heart Disease
symptoms of heart failure. These 12 attributes are used as features to predict heart failure. We have taken the records of 918 patients from Kaggle [6], some suffering and some not suffering from heart disease. Several studies have demonstrated the potential of machine and deep learning algorithms to predict heart failure on the basis of feature selection over these 12 attributes. We test our data on the K Neighbors Classifier, Random Forest Classifier, and logistic regression machine learning algorithms, and we also run the data through a deep neural network; the dataset is taken from Kaggle [6]. Each individual attribute contains 918 records of patients. These attributes help us to analyze heart disease.
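The feedforward computation described above, in which weighted inputs flow through a hidden layer to an output node, can be sketched with NumPy. The layer sizes, random weights, and activation functions below are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch of one feedforward pass: input values are weighted, processed
# by a hidden layer, and passed on to a single output node.
import numpy as np

def relu(z):
    # Hidden-layer nonlinearity (our illustrative choice).
    return np.maximum(0.0, z)

def sigmoid(z):
    # Output activation: squashes the score into a probability.
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w_hidden, b_hidden, w_out, b_out):
    h = relu(x @ w_hidden + b_hidden)     # hidden-layer computation
    return sigmoid(h @ w_out + b_out)     # output-node probability

rng = np.random.default_rng(0)
n_features = 11                  # 11 input characteristics, as in the paper
w_hidden = rng.normal(size=(n_features, 8))
b_hidden = np.zeros(8)
w_out = rng.normal(size=8)
b_out = 0.0

x = rng.normal(size=n_features)  # one standardized patient record
p = forward(x, w_hidden, b_hidden, w_out, b_out)
print(0.0 < p < 1.0)             # True: a valid probability
```

During training, the error between this output and the target is what drives the weight updates described in the results section.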
2 Related Work
Classification-based machine learning and neural network techniques have been extensively investigated in the field of heart diagnosis. Some examples of related work follow. "Automated Detection of Congestive Heart Failure Using Deep Neural Networks" proposed a deep neural network (DNN) for automated detection of congestive heart failure (CHF) using patient medical history and laboratory results; the proposed DNN achieved high accuracy in detecting CHF compared to traditional methods [7]. Heart disease has been predicted using a hybrid model that combines Random Forest and decision trees, with an accuracy of 85.7% [5]. In a comparison of different machine learning algorithms for predicting heart failure, neural networks (NN) performed better than Random Forest, support vector machines, fuzzy rules, and classification and regression trees [8]. Early detection of cardiac arrest symptoms is vital for disease prevention; one author developed a system that can predict vulnerability to a cardiac condition from simple factors such as age, sex, and pulse rate. A study on automatic detection of atrial fibrillation proposed a convolutional neural network (CNN) for detecting atrial fibrillation (AF) from electrocardiogram (ECG) signals; the proposed CNN achieved high accuracy and outperformed other state-of-the-art methods [9]. In another study, machine learning algorithms were used to classify ECG signals into various cardiac arrhythmia types [10]. S. Kumar et al.'s study, "Prediction of Heart Disease Using Machine Learning Techniques", published in 2020, used machine learning algorithms to predict the risk of heart disease from patient data [11]. A key issue in today's healthcare systems is the provision of high-quality services and efficient, accurate diagnostics [12]. Although heart illnesses are now the largest cause of death worldwide, research shows they can be controlled and managed; the effectiveness of a disease's overall management depends on how early it is discovered. To prevent negative effects, the proposed investigation aims to identify certain heart problems at an early stage [10]; similarly, the suggested study attempts to recognize specific heart issues early on in order to avert negative impacts [13]. Academics have used Statlog and Cleveland, two publicly accessible heart disease datasets, to evaluate the efficacy of prediction algorithms; CFARS-AR, which uses the Statlog dataset, is a clinical decision support system for cardiac disease [14].
3 Proposed System
The long-term goal is to predict cardiac disease and to carry out a comparative study using both a deep neural network technique and the K Neighbors and Random Forest Classifier machine learning algorithms. Kaggle's publicly available heart disease datasets are used extensively to evaluate the efficacy of the prediction algorithms [15]. Table 1 (attributes of heart disease) underpins the most important contributions of this paper: to predict heart disease risk variables for stroke prediction, we determine which elements are most significant for stroke prediction, since a comparative study of machine learning and deep neural network methods will have a big impact on the health field. We have gathered heart data of 918 patients from Kaggle in order to predict heart illness.
3.1 Proposed Methods
The background of all the research tools and methods is covered in the subsections that follow. Figure 1 shows the proposed model's workflow diagram.
Fig. 1 Proposed model workflow
3.2 Working Mechanism
Input: the initial collection of features, t.
Output: a set of m features and a prediction from the wrapper method's suggested model.
1. Rank every independent attribute in the given feature set according to how well it matches the dependent feature, using the principal correlation method over the 12 features. The stronger the correlation, the stronger the dependence.
2. Choose the x (x < t) features whose correlation value with the dependent feature is greater than the cutoff value.
3. Remove the component that least influences the grouping of items into categories.
4. Classify the data using the remaining characteristics.
5. Analyze the classification's performance and obtain the extracted features.
6. Repeat steps 3 through 5 until the final feature set is complete.
7. From the output set of features, choose the extracted features that yield the most accurate results.
8. Develop a model and evaluate it.
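Steps 1 and 2 of the mechanism above can be sketched as follows. This is a simplified NumPy illustration; the cutoff value and the synthetic data are our own assumptions.

```python
# Step 1: rank attributes by absolute correlation with the target.
# Step 2: keep only those whose correlation exceeds a cutoff.
import numpy as np

def rank_by_correlation(X, y):
    # One correlation coefficient per attribute, strongest first
    # (the stronger the correlation, the stronger the dependence).
    corrs = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    order = np.argsort(corrs)[::-1]
    return [(int(j), corrs[j]) for j in order]

def select_features(X, y, cutoff=0.3):
    return [j for j, c in rank_by_correlation(X, y) if c > cutoff]

rng = np.random.default_rng(1)
informative = rng.normal(size=200)        # correlated with the target
noise = rng.normal(size=200)              # unrelated attribute
y = (informative > 0).astype(float)
X = np.column_stack([informative, noise])

print(select_features(X, y))              # the informative feature ranks first
```

Steps 3 through 8 would then repeatedly drop the least influential of the surviving features, re-classify, and keep the subset that scores best.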
3.3 Datasets
A patient’s information is stored in a heart health medical record. It is an automated,
computer-readable database that contains information about a patient’s health taken
from online website of Kaggle [6]. The dataset is accessible, a public data archive
(Fig. 1). Description of total heart datasets, the dataset includes 918 patients’ elec-
tronic health records. It features one output feature and a total of 12 input attributes.
In the output response, a binary state expresses the probability that the patient has
experienced a stroke. The patient’s name, gender (G), age (A), and whether or not
they have heart disease (HD) are among the remaining 12 input features in the EHR.
Other features include ChestPainType (CP), RestingBP (RBP), Cholesterol (CH),
FastingBS (FBS), RestingECG (RECG), MaxHR (MHR), ExerciseAngina (EA),
632 S. Mall and J. Singh
Oldpeak (OP), ST Slope (STS), and Heart Disease (HD). The HD dataset is heavily
skewed in terms of the incidence of stroke events because the vast majority of the
patient records are from people who have never had a stroke.
Patient identification will not be accepted as an input feature. In our investigation
and analysis, we will take into account the final 11 input characteristics and 1 response
variable.
3.4 Pre-processing of Datasets
In this section we analyze the dataset of electronic health records and conduct feature correlation analysis. To conduct this analysis on the input attributes of the heart records, we use the whole dataset. The system uses a dataset with twelve test outcome characteristics gathered from around 918 people. The patient is identified using the binary digits 1 and 0, where 1 stands for a positive diagnosis (in this case, heart disease) and 0 represents a negative diagnosis (in this case, the patient has no heart illness of any kind). When two features are highly correlated, one of them can be ignored when predicting the likelihood that a stroke will occur, because it provides no new information for the prediction model. This is how feature selection can benefit from correlation analysis, which we performed using principal component analysis (PCA). Principal component analysis, at its core, is a statistical technique for converting a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables. The attribute space is reduced from a large number of variables to a smaller number of components using a non-dependent procedure. The major objective of PCA here is to choose, from a broader collection of variables, the original variables that have the strongest connection with the principal component [16, 17].
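The PCA transformation described above can be sketched with NumPy alone: center the data, diagonalize the covariance matrix, and project onto the leading eigenvectors. The data below are synthetic stand-ins for the 918-record, 12-attribute dataset, and the component count is illustrative.

```python
# Minimal PCA sketch via eigendecomposition of the covariance matrix.
import numpy as np

def pca(X, n_components):
    # Center each attribute so the covariance matrix is well defined.
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)          # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]                  # principal directions
    explained_ratio = eigvals[order] / eigvals.sum()
    return Xc @ components, explained_ratio

rng = np.random.default_rng(42)
X = rng.normal(size=(918, 12))    # stands in for 918 records x 12 attributes
Z, explained = pca(X, n_components=5)
print(Z.shape)                    # (918, 5): reduced attribute space
```

In practice the paper's pipeline would apply this to the standardized Kaggle attributes rather than random data; scikit-learn's `PCA` class performs the same computation.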
3.5 Classification Algorithms
For predicting heart disease, neural network and machine learning models were developed using the K Neighbors Classifier and Random Forest Classifier classification algorithms. These approaches are employed in the study. One feature selection technique, PCA, is utilized for dimensionality reduction, and the condensed feature set from the feature selection step is passed to the different classifiers.
Random Forest is a tree-based technique for classification and regression analysis. In the RF ensemble strategy, the decision tree technique is applied to each of the small subsets of the dataset. The subsets are sampled using the sampled-with-replacement method [18, 19].
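The sampled-with-replacement step can be sketched as follows: a minimal Python illustration of bootstrap sampling and the ensemble's majority vote; tree fitting itself is omitted, and the sizes are illustrative.

```python
# Bootstrap sampling and majority voting, the two ensemble mechanics
# behind Random Forest (individual decision trees not shown).
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    # Each tree trains on a same-size subset drawn with replacement,
    # so some rows repeat and others are left out.
    return [rows[rng.randrange(len(rows))] for _ in range(len(rows))]

def majority_vote(predictions):
    # The forest's final class is the most common vote among its trees.
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
rows = list(range(100))
sample = bootstrap_sample(rows, rng)
print(len(sample))                       # 100: same size as the original
print(majority_vote([1, 0, 1, 1, 0]))    # 1
```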
Fig. 2 K-nearest neighbor (KNN)
The K-nearest neighbors (KNN) algorithm searches the dataset for correlations between predictors and values. It is a non-parametric approach, because no specific parameters of any functional form are estimated, and it requires no assumptions about the dataset's properties. It works by simply determining which class a new record is closest to and then assigning it to that class [6, 20]. A deep neural network consists of input and output layers and one or more hidden layers. A Perceptron is made up of an input layer and a fully connected output layer; a Multi-Layer Perceptron has the same input and output levels but may also have additional levels [14] (Fig. 2).
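The KNN idea described above can be sketched from scratch; the toy data points and class labels below are our own illustrative assumptions.

```python
# From-scratch KNN: a new record is assigned to whichever class is
# most common among its k nearest neighbors; no model is fitted.
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    # Rank training records by Euclidean distance to the new record.
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest = [label for _, label in dists[:k]]
    return Counter(nearest).most_common(1)[0][0]

train_X = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (7.5, 8.2)]
train_y = ["healthy", "healthy", "disease", "disease"]
print(knn_predict(train_X, train_y, (1.1, 1.0)))   # healthy
```

scikit-learn's `KNeighborsClassifier`, used in the paper, implements the same rule with efficient neighbor search.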
3.6 Feature Selection
The data set’s most pertinent features are taken out during feature selection. This
approach can be used to prevent redundancy. The accuracy of prediction can be
improved by feature selection since irrelevant features are removed from the input
data. In this study, feature selection is done by principle component analysis (PCA).
Two distinct classification algorithms are performed to the smaller data set after
feature selection. This section of my investigation focuses on the connections between
features and the aim. We think that it makes sense to learn more about the variables
themselves before looking for more intricate correlations.
Following are the steps to predict the heart disease from the given 918 heart disease
datasets.
Step1: we import different libraries, subpackages, color, Standard Scaler and
machine learning algorithm, i.e., numpy, pandas, matplotlib.pyplot, seaborn as sns,
plotly.express, Random Forest Classifier, K Neighbors Classifier as KNN, Select K
Best, confusion_matrix, classification, sklearn, and keras. After all the libraries are
imported, we start the date processing and visualization of the 12 attributes among
918 datasets as shown in Fig. 1.
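The 70/30 split applied after this preparation step can be sketched with NumPy (the paper uses scikit-learn utilities such as `train_test_split`; the shuffling seed and synthetic data below are our own assumptions):

```python
# A plain-NumPy sketch of the 70% train / 30% test split.
import numpy as np

def train_test_split_70_30(X, y, seed=0):
    # Shuffle the row indices, then cut at the 70% mark.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(0.7 * len(X))
    return X[idx[:cut]], X[idx[cut:]], y[idx[:cut]], y[idx[cut:]]

X = np.arange(918 * 12, dtype=float).reshape(918, 12)   # synthetic records
y = np.zeros(918)
X_tr, X_te, y_tr, y_te = train_test_split_70_30(X, y)
print(len(X_tr), len(X_te))   # 642 276
```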
4 Result
Table 2 shows the performance of the K Neighbors Classifier and Random Forest Classifier classification algorithms compared with the deep neural network; the deep neural network's accuracy is better than that of the classification algorithms, as shown in the histograms and accuracy results of Figs. 6 and 7. The main advantage of the DNN is that it combines neurons across the input, hidden, and output layers; if the target and produced outputs differ, the resulting error is used to update the weights to cover the margin of error. The performance of the K Neighbors Classifier and Random Forest Classifier classification algorithms is not as good. The Random Forest Classifier finishes with a score of 0.8590604026845636, with random state = 5 and the parameter grid {'max_depth': range(2, 50, 3), 'min_samples_split': range(2, 10), 'n_estimators': range(10, 200, 10)}; another run finishes with a score of 0.7718120805369126, and the KNN model finishes with a score of 0.8080536912751677. The input dataset has 918 rows with no missing values, as shown in Fig. 1. Figure 3 gives the information and description of the heart dataset. The feature selection process between age and heart disease is shown in Figs. 4 and 5. The histogram scaling shows the variation of range used to predict the features. The K Neighbors Classifier score is evaluated over 17 different neighbor settings to achieve a score of 0.8770, as shown in Fig. 6, and the Random Forest Classifier scores 0.8590. Figure 7 shows the accuracy result of 0.89 for the deep neural network. Figures 3, 4, 8, 9, 10 and 11 compare K-nearest neighbors and the deep neural network on the distribution of age and RestingBP.
Fig. 3 Total heart data set information after processing
Fig. 4 Heart disease varies with age through K-nearest neighbor
Fig. 5 Heart disease varies with age through deep neural network
Fig. 6 Histograms of age varying with ChestPainType, RestingBP, Cholesterol, FastingBS, RestingECG, MaxHR, ExerciseAngina, Oldpeak, ST_Slope, and heart disease through the neural network
Fig. 7 Accuracy result of training and testing KNN algorithm
Fig. 8 Accuracy result of
training and testing data of
deep neural network
Fig. 9 Accuracy result of neural network
Fig. 10 Distribution of age and RestingBP through K-nearest neighbors
Fig. 11 Distribution of age and RestingBP through neural network
Table 2 Estimated performance of the KNN and Random Forest Classifier algorithms

S. No. | Model | Score val | Pred. time val | Fit time | Pred. time val marginal | Fit time marginal | Stack level | Can infer | Fit order
1 | Random Forest | 0.879195 | 0.100443 | 1.08309 | 0.100443 | 1.083097 | 1 | True |
2 | K-Nearest Neighbors | 0.691275 | 0.006258 | 0.005269 | 0.006258 | 0.005269 | 1 | True |
5 Conclusion
We have compared different methods for categorizing cardiac illness. In this research, we have examined different machine learning methods used to predict heart illness; to validate them, we have considered three different classifiers, i.e., the K Neighbors Classifier, the Random Forest classification algorithm, and a neural network. The multiple models for predicting heart disease described here consist of two main stages: feature selection and classification. The experiment was conducted on the heart dataset (918 records), divided into 70% for training and 30% for testing. We compare the accuracy results of the machine learning classification algorithms, as shown in Table 2, across the K Neighbors Classifier, Random Forest classification algorithm, and deep neural network. The results show that the deep neural network gives a better result of 0.89, with no data loss, in comparison to the K Neighbors Classifier and Random Forest machine learning classification algorithms.
Acknowledgements The heart prediction dataset was taken from https://www.kaggle.com/datasets.
References
1. Ahdal A, Prashar S, Rakhra M, Wadhawan A (2021) Machine learning-based heart patient
scanning, visualization, and monitoring. In: International conference on computing sciences
(ICCS)
2. Fitriyani NL, Syafrudin M, Alfian G, Rhee J (2020) HDPM: an effective heart disease prediction
model for a clinical decision support system. IEEE Access 8:133034–133050
3. Huang Y, Song I (2018) A better online method of heart diagnosis. In: 3rd international confer-
ence on biomedical signal and image processing (ICBIP ‘18: 2018), Seoul, Republic of Korea,
pp 80–86. ISBN 978-1-4503-6436-2
4. Li JP, Haq AU, Din SU, Khan J, Khan A, Saboor A (2020) Heart disease identification method
using machine learning classification in e-healthcare. IEEE Access 8. ISBN 107562-107582
5. Al Ahdal A, Prashar D, Rakhra M, Wadhawan A (2021) Machine learning-based heart patient
scanning, visualization, and monitoring. In: International conference on computing sciences
(ICCS)
6. https://www.kaggle.com/search?q=heart
7. Wang B, Bai Y, Yao Z, Li J, Dong W, Tu Y, Xue W, Tian Y, He K (2019) A multi-task neural
network architecture for renal dysfunction prediction in heart failure patients with electronic
health record. IEEE Access 7:178392–178400
8. Kavitha M, Gnaneswar G, Dinesh R, Rohith Sai Y, Sai Suraj R (2021) Heart disease prediction
using hybrid machine learning model. In: 6th international conference on inventive computation
technologies (ICICT)
9. Gavhane A, Kokkula G, Pandya I, Devadkar K (2018) Prediction of heart disease using machine
learning. In: Second international conference on electronics, communication and aerospace
technology (ICECA)
10. Erdaş ÇB, Ölçe D (2020) A machine learning-based approach to detect survival of heart failure
patients. In: Medical technologies congress (TIPTEKNO)
11. Deepika R, Balaji Srikaanth P, Pitchai R (2022) Early detection of heart disease using deep
learning model. In: 8th international conference on smart structures and systems
12. Dhanka S, Maini S (2021) Random forest for heart disease detection: a classification approach.
In: IEEE 2nd international conference on electrical power and energy systems (ICEPES)
13. Long NC, Meesad P, Unger H (2015) A highly accurate firefly-based algorithm for heart disease
prediction. Expert Syst Appl 42(21):8221–8231
14. Arabelle AE, Prasetyanto WA, Wulandari SA (2021) Non invasive blood sugar detection using
the extraction method of principal component analysis. In: IEEE international seminar on
application for technology of information and communication (iSemantic), September, pp
285–289
15. Dhanka S, Maini S (2021) Random forest for heart disease detection: a classification approach.
In: IEEE 2nd international conference on electrical power and energy systems (ICEPES),
December, pp 1–3
16. Reddy KSK, Kanimozhi KV (2022) Novel intelligent model for heart disease prediction using
dynamic KNN (DKNN) with improved accuracy over SVM. In: IEEE international conference
on business analytics for technology and security (ICBATS), February, pp 1–5
17. Gupta M, Srivastava D, Pantola D, Gupta U (2022) Brain tumor detection using improved
Otsu’s thresholding method and supervised learning techniques at early stage. In: Proceedings
of emerging trends and technologies on intelligent systems: ETTIS 2022. Springer Nature
Singapore, Singapore, pp 271–281
18. Mutijarsa K, Ichwan M, Utami DB (2016) Heart rate prediction based on cycling cadence
using feedforward neural network. In: IEEE international conference on computer, control,
informatics and its applications (IC3INA), pp 72–76
19. Gupta U, Gupta D (2022) Least squares structural twin bounded support vector machine on
class scatter. Appl Intell, 1–31
20. Gupta U, Gupta D, Agarwal U (2022) Analysis of randomization-based approaches for autism
spectrum disorder. In: Pattern recognition and data analysis with applications. Springer Nature
Singapore, Singapore, pp 701–713
21. Dev S, Wang H, Nwosu CS, Jain N, Veeravalli B, John D (2022) A predictive analytics approach
for stroke prediction using machine learning and neural networks. Int J Healthcare Anal 22.
https://doi.org/10.1016/j.health.2022.100032
House Price Prediction Using Hybrid
Deep Learning Techniques
Nitigya Vasudev, Gurpreet Singh , Prateek Saini, and Tejasvi Singhal
Abstract The impact of machine learning on the world has been immense and
is only growing. Machine learning is also being used to improve health care, detect
fraud, predict weather, and even develop autonomous vehicles. Furthermore, house
prices have been steadily increasing over the past few years. This has been due to
a number of factors, including a strong economy, low interest rates, and a limited
supply of housing. As the demand for housing continues to outpace the availability
of new homes, the prices of existing homes have increased significantly. This has
caused many people to struggle to afford a home, compounded by the rising cost of
living in recent years. The goal of this paper is to use machine learning as a
powerful tool for predicting the future value of a house. It can be used to predict the
price of a house given certain features such as size, location, and amenities. We have
used machine learning algorithms such as support vector machines (SVM) models,
regression models, random forest, and bagging and boosting models to predict house
prices. Hyperparameter tuning is also being used to optimize the model performance.
As a result, we have compared and analyzed a number of prediction methods in
order to select the most suitable one. House prediction using machine learning can
be used to estimate the future market value of a house, identify potential investment
opportunities, and assist in making informed decisions about buying and selling
properties. In Sect. 1, we give an introduction to the real estate industry and
how machine learning can be helpful for predicting house prices. In Sect. 2, we have
reviewed several papers to gather information to compare the results of different models.
N. Vasudev · G. Singh (B) · P. Saini · T. Singhal
Chitkara University Institute of Engineering and Technology, Chitkara University Punjab,
Chandigarh, India
e-mail: gurpreet.1309@chitkara.edu.in
N. Vasudev
e-mail: nitigya1194.cse19@chitkara.edu.in
P. Saini
e-mail: Prateek1047.cse19@chitkara.edu.in
T. Singhal
e-mail: Tejasvi1000.cse19@chitkara.edu.in
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_48
Sections 3, 4 and 5 cover the methodology and the implementation of various algorithms
to get the desired results. In Sect. 6, we compare the various models and identify
the best-performing algorithm for this research paper.
Keywords Technology · Industry · Housing market · Housing · SVM · XGBoost
1 Introduction
The house price in real estate [1] can be a tricky thing to understand, even for an
experienced expert. With the fluctuating market and ever-changing trends, it can feel
like you are playing a game of chance rather than relying on reliable data. House
prices have long been a topic of interest for both real estate professionals and the
general public. With machine learning rapidly gaining traction as an effective tool
to predict housing values, this research paper will focus on how it can be used to
accurately forecast house prices. A variety of methods, such as regression analysis,
support vector machines, random forests, and others, have been used for predicting
house prices [2] using machine learning. These techniques each bring their own
advantages and limitations, depending on the dataset being analyzed and the task
at hand. After assessing the data and understanding the objectives, one must select
the most appropriate model and tune its parameters accordingly. Prediction accuracy
can be greatly improved as a result, and predicted outcomes become more reliable
[3]. However, other factors must also be taken into consideration when predicting
house prices with machine learning. For instance, geographical location plays an
important role in determining property value [4]. Additionally, current market trends,
macroeconomic indicators, and the availability of amenities, among other variables,
all contribute to a home’s worth. Thus, it is essential to include these components in
order to make more precise predictions. In conclusion, machine learning provides
efficient tools to accurately forecast house prices. It is essential that any buyer of
a property has a property appraisal performed as part of the buying process [5]. In
traditional circumstances, appraisals are performed by appraisers who have received
special training for the purpose of valuing real estate properties. It is important
for buyers of real estate properties to have a better understanding of the current
market prices of properties that are currently available on the market by utilizing an
automated price estimation system. These are some of the algorithms we have used
in this research paper. We have also used other algorithms in this research paper to
have better results which we have mentioned in methodology section.
1. Support vector machine (SVM) is a powerful machine learning algorithm used
for both classification and regression. It has many advantages over other algo-
rithms, such as its ability to handle high-dimensional data and its ability to create
nonlinear decision boundaries. SVM is also robust to outliers and can be used
with kernels, which allows it to work with nonlinear data. We have used the support
vector machine in our model because of its ability to create nonlinear decision
boundaries, which can accurately handle complex datasets [6]. We obtained an
accuracy of 90% in our experiments. Additionally, SVM is computationally
efficient and can be used on large datasets without compromising accuracy.
2. XGBoost is a powerful machine learning algorithm that has gained immense
popularity in recent years. It is an advanced implementation of gradient boosting,
which is used to improve the performance and accuracy of predictive models.
XGBoost has several advantages over other algorithms, such as faster training
speed, better accuracy, and improved scalability. XGBoost also has a number
of built-in features, such as regularization, cross-validation, and feature selec-
tion, which make it easier to use. Additionally, XGBoost is capable of handling
large datasets, making it suitable for use in complex data mining tasks. Overall,
XGBoost provides many advantages, making it a powerful and useful tool for
data scientists and machine learning engineers.
3. Linear regression is a powerful tool for predicting the outcome of a given event.
It is a supervised learning technique that uses a linear equation to represent the
relationship between the dependent and independent variables. Advantages of
linear regression include its simplicity and interpretability; however, it can only
capture linear relationships. It can also be used to identify outliers and to
assess the strength of the relationships between variables. In this research paper,
we have chosen to use support vector machines (SVMs) over linear regression
due to their ability to capture more complex relationships between variables and
to better handle outliers. SVMs also have the advantage of being more robust to
overfitting, making them more suitable for high-dimensional data.
2 Literature Survey
In order to determine the price of a house, there are several factors to consider. In their
research, Rahadi et al. [1] suggest that in order to simplify these elements, we should
categorize them into three groups, namely physical condition, idea, and area. The
physical condition of a house is defined by its size, how many rooms there are, how
accessible the kitchen and carport are, how accessible the garden is, the zone of the
land and structure, and the age of the house.
These are the physical properties controlled by a house that human senses are able
to observe.
In their research, Rawool et al. [2] proposed that 80% of the data be used for
training and 20% for testing in their machine learning model; the training set
includes the target variables. The model was trained using a variety of machine
learning algorithms, among which random forest regression was shown to produce
the most accurate predictions. This has been implemented with the Python libraries
NumPy and Pandas.
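The 80/20 split that Rawool et al. describe can be sketched in plain Python. This is a minimal illustration under our own assumptions (a fixed random seed, integer rows standing in for records); the paper itself uses NumPy and Pandas:

```python
import random

def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle the rows and split them into training and test sets."""
    rng = random.Random(seed)
    shuffled = rows[:]  # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))
train, test = train_test_split(rows)
print(len(train), len(test))  # 80 20
```

Every record ends up in exactly one of the two sets, which is the property the 80/20 scheme relies on.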
In their research paper, Manasa et al. [3] decided to use Bengaluru as a
case study. The size of the property in square footage, the location, and its facilities
are all key factors affecting the cost of the property. A total of nine attributes are taken
into account. For the experimental work, they have used multiple linear regression
(least squares), Lasso/Ridge regression, SVMs, and XGBoost.
In his research, Luo [4] suggests that, as a whole, the majority of studies have
focused on macroeconomic factors that influence residential asset prices, rather than
explaining the factors that determine prices. His research examines
some micro-characteristics that can be used to estimate house prices, namely lot
size and pool size. Machine learning methods such as random forest and support
vector machine are used to predict asset prices. Almost all regression models have
an R-squared of more than 0.9.
In their research paper, Abidoye and Chan [5] suggested that in addition to using
hedonic regression and artificial intelligence techniques for developing housing price
prediction models, the relationship between house prices and housing characteristics
was identified using a variety of hedonic methods based on the concept of utility
maximization.
Gu et al. [6] suggested that an improved model for the prediction of housing prices
should be developed based on a performance evaluation of several machine learning
algorithms. It has been shown that the SVM approach is more accurate
than traditional methods in terms of forecasting housing prices. However, little
research has been conducted on how to develop a more accurate forecasting model
using genetic algorithms. Using machine learning, this study is aimed at examining
the performance of the algorithms and developing a more accurate model of housing
price prediction for the real estate market.
In their research paper, Kauko et al. [7] examined the housing market in Finland
using a neural network model. Their results showed that several dimensions of the
formation of housing sub-markets could be identified by finding patterns in the
dataset.
2.1 Gap Analysis
After analyzing the above research papers and gathering the important information,
the following points are concluded.
1. Most of the researchers did not have the desired amount of data for their study,
due to which they were not able to get the desired results.
2. The majority of the studies focused on macroeconomic factors that influence
residential asset prices rather than explaining the factors that determine prices.
3. To overcome these problems, we have used a large and varied dataset to get the
desired results.
4. We have used SVM as the main model for this research paper to increase the
accuracy of the model.
5. We have also used hyperparameter tuning for the best results.
3 Methodology
In this research, we have used the Jupyter Notebook IDE. Jupyter is an open-source
web application that helps us create and share documents containing live code,
visualizations, and equations, and it supports tools for data cleaning, data
transformation, statistical modeling, data visualization, and machine learning. We
have collected real-world data related to home sales from Kaggle to estimate home
prices. We have used libraries such as SciPy, Seaborn, Pandas and NumPy, and
several machine learning models, including random forest, SVM, linear regression,
decision tree, and XGBoost [14]. To check how well each regression model fits the
data, we have used the coefficient of determination (R²).
4 Implementation
4.1 Data Preprocessing
A crucial procedure has to be followed to check whether a dataset is suitable for
machine learning algorithms. Preprocessing transforms raw data into an efficient
format: the dataset is first cleaned so that unwanted data is removed and only the
data relevant to the problem is retained [15]. As far as formatting is concerned,
null values and irrelevant data must be removed to make the data suitable for
machine learning algorithms. After extracting the data, there were some null values
in the attributes which had to be handled so that the accuracy of the models was
not compromised (Fig. 1).
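As a small illustration of the null-value cleaning described above, the following pure-Python sketch drops any record with a missing attribute. The records and field names are hypothetical; the paper itself performs this step with Pandas:

```python
# Hypothetical housing records; None marks a missing attribute value.
records = [
    {"sqft": 1400, "bedrooms": 3, "price": 250000},
    {"sqft": None, "bedrooms": 2, "price": 180000},  # missing sqft -> dropped
    {"sqft": 2100, "bedrooms": 4, "price": None},    # missing price -> dropped
]

def drop_nulls(rows):
    """Keep only the records where every attribute has a value."""
    return [r for r in rows if all(v is not None for v in r.values())]

clean = drop_nulls(records)
print(len(clean))  # 1
```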
4.2 Exploratory Data Analysis
After preprocessing, the dataset is divided through a data-splitting process [13] and
used to train machine learning algorithms such as SVM, XGBoost, random forest
and decision tree [10]. We have also used a correlation heatmap to see which features
are highly correlated with the target attribute, which is the sale price in this case
[17] (Figs. 2 and 3).
4.3 Dataset
The following diagram shows the important attributes of the dataset (Figs. 4 and 5).
Fig. 1 Null values using heatmap
Fig. 2 Displot of sale price
Fig. 3 Correlation heatmap of sale price
Fig. 4 Dataset
5 Algorithms
5.1 Lasso Regression
Lasso regression is a statistical technique used in regression analysis to reduce the
complexity of a model [3]. It relies on shrinkage, penalizing the coefficients of the
model and diminishing their impact until they become insignificant [11]. This makes
it particularly effective when dealing with a large number of predictor variables,
as its primary goal is to reduce complexity while improving interpretability.
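The shrinkage that drives coefficients to insignificance is typically implemented with the soft-thresholding operator. The following is our own illustrative sketch, not code from the paper:

```python
def soft_threshold(coef, penalty):
    """Shrink a coefficient toward zero; set it to exactly zero if it is small."""
    if coef > penalty:
        return coef - penalty
    if coef < -penalty:
        return coef + penalty
    return 0.0

print(soft_threshold(2.5, 1.0))   # 1.5
print(soft_threshold(-0.4, 1.0))  # 0.0  (insignificant coefficient eliminated)
```

Setting small coefficients to exactly zero is what lets Lasso act as an automatic feature selector.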
5.2 Ridge Regression
Ridge regression is a linear regression approach used to simplify the model and
avoid overfitting. It is a regularization strategy that shrinks the model's coefficients
by adding a penalty to the loss function [3]. The penalty is proportional to the
squared magnitude of the coefficients, which lowers the model's complexity.
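For a single centered feature, the ridge solution has a simple closed form, w = Σxy / (Σx² + λ), which makes the shrinkage effect of the penalty easy to see. This is an illustrative sketch, not the paper's implementation:

```python
def ridge_coefficient(x, y, lam):
    """Closed-form ridge slope for one centered feature: sum(xy)/(sum(x^2)+lam)."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)

x = [-2.0, -1.0, 1.0, 2.0]
y = [-4.0, -2.0, 2.0, 4.0]            # exact slope 2 without any penalty
print(ridge_coefficient(x, y, 0.0))   # 2.0
print(ridge_coefficient(x, y, 10.0))  # 1.0  -> coefficient shrunk by the penalty
```

Increasing λ shrinks the coefficient toward zero but, unlike Lasso, never sets it exactly to zero.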
5.3 XGBoost
XGBoost is an advanced implementation of gradient boosting algorithm. It is a
powerful machine learning algorithm that has gained immense popularity in the data
science community due to its superior performance and efficiency [9]. XGBoost
is an open-source library which is used for supervised learning problems such as
classification and regression.
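XGBoost builds on gradient boosting, which fits each new weak model to the residuals of the current ensemble. The toy sketch below uses constant "models" for squared-error loss, purely to illustrate the idea; real XGBoost fits regularized decision trees:

```python
def boost_means(y, n_rounds=3, learning_rate=0.5):
    """Toy gradient boosting for squared error: each round adds a constant
    'model' equal to the mean residual, scaled by the learning rate."""
    pred = [0.0] * len(y)
    for _ in range(n_rounds):
        residuals = [t - p for t, p in zip(y, pred)]
        step = learning_rate * sum(residuals) / len(residuals)
        pred = [p + step for p in pred]
    return pred

y = [10.0, 10.0, 10.0]
print(boost_means(y))  # [8.75, 8.75, 8.75] -- predictions approach the target
```

Each round halves the remaining error here (5.0, then 7.5, then 8.75), showing how successive weak learners accumulate into a strong prediction.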
5.4 Support Vector Machine
Support vector machine is a powerful machine learning algorithm that can be used for
classification as well as regression [12]. It transforms data into a higher-dimensional
space using a technique called the kernel trick to find the hyperplane that best
separates the data [18]. This hyperplane is then used to make predictions on unseen
data.
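The kernel trick can be illustrated with the widely used RBF kernel, which scores the similarity of two points without ever computing the higher-dimensional mapping explicitly. This is an illustrative sketch; the paper does not state which kernel its SVM uses:

```python
import math

def rbf_kernel(a, b, gamma=0.5):
    """RBF kernel: exp(-gamma * squared Euclidean distance between a and b)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)

print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))         # 1.0 (identical points)
print(rbf_kernel([0.0, 0.0], [3.0, 4.0]) < 1e-5)  # True (distant points -> near 0)
```

The SVM only ever needs these pairwise kernel values, which is what allows it to find nonlinear decision boundaries efficiently.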
5.5 K-Fold
Cross-validation is an effective method used in machine learning to evaluate the
performance of a model. It works by splitting the dataset into k subsets, or "folds",
training the model on k − 1 folds, and testing it on the remaining fold [8]. This
process is repeated k times, each time with a different fold as the test set. The
average of the k accuracy scores is then used as the overall accuracy of the model.
K-fold cross-validation reduces the variance of the performance estimate while still
making efficient use of the available data.
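The fold construction can be sketched as follows. This is a minimal illustration without shuffling; library implementations such as scikit-learn's KFold offer shuffling and stratification:

```python
def k_fold_indices(n_samples, k):
    """Partition sample indices into k contiguous folds of near-equal size."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

folds = k_fold_indices(10, 3)
print(folds)  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

Each index appears in exactly one fold, so every sample is used for testing exactly once across the k rounds.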
5.6 R² Score
The R² score is a popular metric for assessing a model's performance. It is computed
as one minus the ratio of the residual sum of squares to the total sum of squares,
and it measures how well the predicted values match the observed values [16].
Higher values indicate a better fit. The R² score typically ranges from 0 to 1: a
perfect value of 1 means the model predicts the data exactly, whereas a score of 0
means the model performs no better than predicting the mean. The R² score is a
crucial factor when evaluating the performance of a model (Fig. 6).
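The computation can be sketched directly as a small pure-Python function (an illustration consistent with the standard definition of R²):

```python
def r2_score(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0]
print(r2_score(y_true, y_true))           # 1.0 (perfect fit)
print(r2_score(y_true, [5.0, 5.0, 5.0]))  # 0.0 (no better than the mean)
```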
6 Results
From Table 1, we can compare the different algorithms, identify the best two among
them, and integrate them to provide the best output.
Fig. 6 Formula of R² score
Table 1 Accuracy of some of the models

Algorithms                  Score
Lasso regression            0.86468
Ridge regression            0.86779
XGBoost                     0.88414
Random forest regression    0.85644
Decision tree regression    0.68921
Support vector machine      0.91373
7 Limitations
7.1 Limited Data
Possibly, a small dataset was used for the analysis in the research publication. As
a result, the models’ capacity to forecast outcomes accurately and reliably may be
constrained.
7.2 Limitation of Model Hyperparameters
The research report might not have thoroughly examined all potential model hyper-
parameters, which could have an impact on the precision and dependability of the
models.
8 Conclusion
In conclusion, we have presented a model that could provide everyone working in real
estate with a new, more accurate methodology for the prediction of house prices. As
the results show, SVM and XGBoost were the two main models that gave us the desired
results: the support vector machine achieved a score of 0.91, whereas XGBoost
achieved 0.88. In this study, it is highlighted how
important it is to use advanced machine learning algorithms, particularly in such an
era when the real estate market is rapidly changing. For real estate professionals,
investors, and policymakers seeking to make informed decisions based on accurate
and reliable predictions, the application of SVMs and XGBoost regression algorithms
can provide valuable insights. To further improve the accuracy and reliability of house
price predictions, it may be worthwhile for future research to incorporate additional
features as well as explore other regression algorithms.
9 Future Scope
Future research on housing price prediction has huge and fascinating potential, with
many different directions it may take. More accurate and trustworthy forecasts can
help real estate professionals, investors, and regulators make wise judgements. This
is made possible by the integration of cutting-edge technologies and the analysis of
outside elements. Exploring the effects of external elements, such as macroeconomic
trends, social and political events, and environmental concerns, on the real estate
market and how they affect house price projections could be another area of research.
References
1. Rahadi RA, Wiryono SK, Koesrindartoto DP, Syamwil IB (2015) Factors influencing the price
of housing in Indonesia. Int J Housing Markets Anal 8(2):169–188. https://doi.org/10.1108/
IJHMA-04-2014-0008
2. Rawool AG, Rogye DV, Rane SG, Bharadi VA (2021) House price prediction using machine
learning. Iconic Res Eng J
3. Manasa J, Gupta R, Narahari NS (2020) Machine learning based predicting house prices using
regression techniques. In: 2020 2nd international conference on innovative mechanisms for
industry applications (ICIMIA), Bangalore, India, pp 624–630. https://doi.org/10.1109/ICI
MIA48430.2020.9074952
4. Luo Y (2019) Residential asset pricing prediction using machine learning. In: 2019 international
conference on economic management and model engineering (ICEMME). IEEE, pp 193–198
5. Abidoye RB, Chan APC (2017) Critical review of hedonic pricing model application in property
price appraisal: a case of Nigeria. Int J Sustain Built Environ 6(1)
6. Gu J, Zhu M, Jiang L (2011) Housing price based on genetic algorithm and support vector
machine. Expert Syst Appl 38:3383–3386
7. Kauko T, Hooimeijer P, Hakfoort J (2002) Capturing housing market segmentation: an alterna-
tive approach based on neural network modelling. Housing Stud 17:875–894. https://doi.org/
10.1080/02673030215999
8. Anguita D, Ghelardoni L, Ghio A, Oneto L, Ridella S (2012) The ‘K’ in K-fold cross validation.
In: ESANN, pp 441–446
9. Zhao Y, Chetty G, Tran D (2019) Deep learning with XGBoost for real estate appraisal. In:
2019 IEEE symposium series on computational intelligence (SSCI), Xiamen, China, pp 1396–
1401. https://doi.org/10.1109/SSCI44817.2019.9002790
10. Adetunji AB, Akande ON, Ajala FA, Oyewo O, Akande YF, Oluwadara G (2022) House price
prediction using random forest machine learning technique. Procedia Comput Sci 199:806–813
11. Lu S, Li Z, Qin Z, Yang X, Goh RSM (2017) A hybrid regression technique for house prices
prediction. In: 2017 IEEE international conference on industrial engineering and engineering
management (IEEM). IEEE, pp 319–323
12. Yu H, Wu J (2016) Real estate price prediction with regression and classification. In: CS229
(machine learning) Final project reports
13. Kumar D, Sarangi PK, Verma R (2022) A systematic review of stock market prediction using
machine learning and statistical techniques. Mater Today Proc 49:3187–3191
14. Thamarai M, Malarvizhi SP (2020) House price prediction modeling using machine learning.
Int J Inf Eng Electron Bus 12(2)
15. Mittal R, Kumar P, Mittal A, Malik V (2021) Developing an evaluation model for forecasting of
real estate prices. In: Choudhary A, Agrawal AP, Logeswaran R, Unhelkar B (eds) Applications
of artificial intelligence and machine learning. Lecture notes in electrical engineering, vol 778.
Springer, Singapore. https://doi.org/10.1007/978-981-16-3067-5_46
16. Arumugam SR, Gowr S, Manoj O (2021) Performance evaluation of machine learning and
deep learning techniques: a comparative analysis for house price prediction. In: Convergence
of deep learning in cyber-IoT systems and security, pp 21–65
17. Makhloga VS, Raheja K, Jain R, Bhattacharya O (2021) Machine learning algorithms to
predict potential dropout in high school. In: Khanna A, Gupta D, Pólkowski Z, Bhattacharyya
S, Castillo O (eds) Data analytics and management. Lecture notes on data engineering and
communications technologies, vol 54. Springer, Singapore. https://doi.org/10.1007/978-981-
15-8335-3_17
18. Agarwal P, Alam M (2022) Quantum-inspired support vector machines for human activity
recognition in Industry 4.0. In: Gupta D, Polkowski Z, Khanna A, Bhattacharyya S, Castillo
O (eds) Proceedings of data analytics and management. Lecture notes on data engineering and
communications technologies, vol 90. Springer, Singapore. https://doi.org/10.1007/978-981-
16-6289-8_24
Sentiment Analysis Using Machine
Learning of Unemployment Data in India
Rudra Tiwari, Jatin Sachdeva, Ashok Kumar Sahoo,
and Pradeepta Kumar Sarangi
Abstract With the massive increase in social media data and the hype around Natural
Language Processing, opinion mining has become one of the most popular ways
to analyze people’s views on a specific topic. Using hashtags, one can obtain tweet
data in millions and analyze sentiments. This can be done effectively using Python
with its NLP modules available. Studying the attitudes and sentiments of Indian
citizens towards the current unemployment rate is the primary purpose of this study.
In situations where there may be negative consequences due to people's aggression,
analyzing such content to gauge people's sentiments can be extremely valuable in
managing the situation. Natural Language Processing and other Machine Learning
classifiers are used in this research to perform opinion mining of the tweets posted
by Indians. About 10,928 tweets have been accumulated, on which sentiment
analysis has been performed, classifying each tweet as positive, negative or
neutral. The 'Tweepy' API has been used, along with the
hashtags ‘UnemploymentInIndia’ and ‘Unemployment’. The data has been cleaned
and preprocessed using NLTK, VADER and other modules available in
Python. Study findings suggest that most Indian citizens oppose the unemployment
rates in their country, but a minority look to political movements to bring about
change.
Keywords Sentiment analysis · Natural Language Processing · Unemployment ·
Twitter data · Unemployment sentiment analysis · Unemployment India analysis ·
Social economy
R. Tiwari
Doon International School, Dehradun, India
J. Sachdeva ·P. K. Sarangi (B)
Chitkara University Institute of Engineering & Technology, Chitkara University, Punjab, India
e-mail: Pradeepta.sarangi@chitkara.edu.in
J. Sachdeva
e-mail: jatin0530.cse19@chitkara.edu.in
A. K. Sahoo
Graphic Era Hill University, Dehradun, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_49
1 Introduction
Youngsters are the source of economic growth for a nation. The age between 15
and 24 signifies the youngsters’ ascent towards the labour markets to earn money.
People’s status in society is primarily determined by their employment. Workers who
are ready to work but cannot find employment are considered unemployed. It is still
not recognized that unemployment is a macroeconomic issue in India. Unemployed
people are often seen as failures and are disregarded in the society. These people
may drive themselves into depression, or even worse, social isolation. Reports by
the World Bank have stated that the unemployment rate in India was about 20.6%
in 2018. Data by the same World Bank [1] has stated that India’s unemployment
rate went up to 20.7% in 2019 (a slight increase). The COVID-19 pandemic has led
the unemployment rate to rise to 23.2% in 2020. If one notices the unemployment
rates for the neighbouring countries of India, which include Sri Lanka, Bangladesh,
Pakistan and Nepal, it has been observed that youth unemployment is the highest in
India. The situation of India’s unemployment has worsened due to the pandemic and
population explosion. In the first major year of the pandemic (2020–2021), India’s
employment rate fell to about 10.9% (for youth). It saw a descent of about 0.5%
in 2021–2022. India has also overtaken the UK to become the fifth largest economy
in the world. However, if these numbers are considered, it is apparent that income
inequality along with low youth employment rates remains a problem for India’s
economy. Average labour participation rate (LPR) for the years 2016–2017 and
2021–2022 for India’s youth remains at about 22.7%. Though India has the world’s
largest youth population, it also has the lowest youth employment rate. This remains
a topic of concern for both the government and researchers. The computation of
unemployment statistics in India is the responsibility of the National Sample Survey
Organization (NSSO) and the Labour Bureau. From the reports of these organizations,
it is apparent that the following factors are the root causes of the increasing
unemployment in India:
1. Ever-increasing population.
2. Slow economic growth.
3. Income inequality.
4. Caste system.
5. Prevalence of primary economic activities.
6. Shortage of resources required for industrial production (electricity, coal, etc.).
Sentiment analysis is a technique of aggregating people’s opinions, attitudes and
emotions about something through opinion mining. Any topic, event or individual
can be represented by the entity. Most of the reviews will cover these topics. The goal
of Natural Language Processing is to extract the meaning of written or spoken forms
of the language by application of various levels of linguistic analysis [2]. In this
study, Twitter is used as the major data hub. It is the best microblogging tool which
allows people to express their sentiments in concise words in such a manner that it
is coherent [3]. Its ease of comprehension makes it the best source for performing
sentiment analysis.
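As an illustration of the lexicon-based scoring that tools like VADER build on, here is a toy scorer with a hypothetical five-word lexicon. Real systems use large, weighted lexicons plus rules for negation, punctuation and intensifiers, so this is only a sketch of the idea:

```python
# Hypothetical tiny sentiment lexicon (word -> polarity weight).
LEXICON = {"good": 1, "great": 2, "jobless": -2, "bad": -1, "crisis": -2}

def score_tweet(text):
    """Sum the polarities of known words and map the total to a label."""
    total = sum(LEXICON.get(word, 0) for word in text.lower().split())
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"

print(score_tweet("Great news on employment"))       # positive
print(score_tweet("Jobless rates signal a crisis"))  # negative
print(score_tweet("Unemployment report released"))   # neutral
```

The three-way output mirrors the positive/negative/neutral categories used in this study.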
2 Literature Review
Government policies that are poorly implemented and the international economic
environment are major causes of high youth unemployment. Inefficient workers in
many labour fields, lack of proper education facilities and the absence of in-demand
skills are the primary reasons for unemployment in India. Moreover, India's education
system follows an approach of traditional theoretical teaching and often disregards
the crucial methodology of teaching practical applications to students. Moreover,
consumers’ negative perceptions of major Indian markets are also contributing to
the reduction of private sector jobs. Leaders of some political parties even claim
that “The only people who get jobs are relatives and friends of politicians”. People
believe in the approach of saving money, rather than investing it. Somehow, AI and
automation are another reason for fewer job vacancies and layoffs [4]. The rate of
unemployment and annual change percentage between the years 1990 and 2022 is
shown in Fig. 1.
As long as there are no adequate social protection policies for the unemployed,
only those who can afford to remain unemployed will do so [5]. However, schemes
like MGNREGA have been launched by the government to end this social and
economic evil. Similar policies for skilled workers have been initiated as well. But
due to the exponential increase of youth unemployment, government policies are
insufficient to tackle this evil. It is imperative that youth unemployment can be
understood globally. The issue of youth unemployment must be addressed by global
organizations to improve employability and employment opportunities for young
people [6].
Fig. 1 Unemployment rate and annual change percentage between the years 1990 and 2022
Looking at different economic sections of India, it has been observed that people
living in poverty tend to remain unemployed for longer than people from the well-off
sections of society. Geographically, the Eastern regions have higher unemployment
rates, as well as a higher share and a higher concentration of long-term unemployment.
Another problem with India's labour force is that there are not enough policies for
unemployed women in urban areas; moreover, India's female workforce is already
small. Several factors make analyzing unemployment rates difficult. First, nothing is
known about the history of an individual whose employment periods are being
analyzed. Secondly, the presence of 'reservation wages' in India makes it difficult
to determine who is unemployed and who is not. Thirdly, people with no prior
job experience are less preferred than those with actual job experience. Lastly, the
lack of vocational training also plays a vital role in determining the actual value of an
individual's contribution. In some places, hereditary practices in providing employment
still worsen the employment situation, and companies end up neglecting the potential
of young and creative freshers. It is therefore difficult to pinpoint the exact factors
behind longer-term unemployment [7]. The long-term unemployment data is shown in Table 1.
Unemployment rates in India are highest in Haryana, a little more than that of
Jammu and Kashmir. In these areas, the unemployment rates reach about 30–35%
and can be seen in Fig. 2. However, Chhattisgarh, Meghalaya and Maharashtra have
had the lowest percentage of unemployed youth, hovering around 1–3%. The detailed
unemployment data is shown in Fig. 2.
Unemployment is the root cause of poverty in India. In the past three and a half
decades, the share of youth outside the labour force has significantly increased,
fluctuating between 40 and 44%, while the population has grown exponentially. It
must also be realized that these numbers cannot capture the income and productivity
of workers, their work environment, or their motivation to stay in the same job for
years. Low wages have furthermore worsened the condition of employees and
workers, who are paid meagre amounts after toiling for up to 14–16 h a day. Three
policies should be implemented to battle the rising unemployment in India. They are:
1. Appropriate Macro-Policy
New policies must be formed, and old policies must be reformed to check the unem-
ployment rates and create new employment opportunities in India. Investing in the
growth of secure jobs is critical as well. Trade liberalization and financial sector
liberalization can enable the public to apply for jobs for handling exports and work
as skilled/unskilled labourers.
2. Improvement in the Education System
India’s education system must focus on improving the quality of students and make
them future-ready for the upcoming developments and technological advancements.
Sentiment Analysis Using Machine Learning of Unemployment Data 659
Table 1 Changes in unemployment rates over the last two decades
Date          Unemployment rate (in %)   Annual change
31-12-1991 5.599
31-12-1992 5.727 0.13
31-12-1993 5.691 0.04
31-12-1994 5.739 0.05
31-12-1995 5.755 0.02
31-12-1996 5.74 0.01
31-12-1997 5.613 0.13
31-12-1998 5.666 0.05
31-12-1999 5.736 0.07
31-12-2000 5.561 0.18
31-12-2001 5.576 0.01
31-12-2002 5.53 0.05
31-12-2003 5.643 0.11
31-12-2004 5.629 0.01
31-12-2005 5.613 0.02
31-12-2006 5.601 0.01
31-12-2007 5.572 0.03
31-12-2008 5.414 0.16
31-12-2009 5.544 0.13
31-12-2010 5.546 0
31-12-2011 5.426 0.12
31-12-2012 5.414 0.01
31-12-2013 5.424 0.01
31-12-2014 5.436 0.01
31-12-2015 5.435 0
31-12-2016 5.423 0.01
31-12-2017 5.358 0.07
31-12-2018 5.33 0.03
31-12-2019 5.27 0.06
31-12-2020 7.997 2.73
31-12-2021 5.978 2.02
Teachers should focus on teaching practical applications to the students, instead of
the traditional theoretical approach.
3. Policies on Active Labour Market
The government needs to intervene in the labour market, providing more employment
opportunities while ensuring quality [8]. According to The Economic Times, in
2021–22, 10.4% of Indian youth (15–24 years of age) were employed, compared to
10.9% in 2020–21.

Fig. 2 Unemployment rate in India (as of August 2022). Source: Centre for Monitoring Indian Economy Pvt. Ltd
3 Research Aim
This research project aims to classify people’s opinions on ‘Unemployment in India’.
The datasets used in the project are obtained between the dates of 15 October 2021
and 15 September 2022. The opinion mining that is conducted is based on social
media behaviour analysis on about 11,000 tweets. Through this research, the aim is
to discover what people in India think about the growing unemployment rates. The
text analytics in this form are based on Machine Learning and NLP. Many research
papers based on sentiment analysis intricately describe the processes and procedures
to be followed while working with raw text; they follow detailed methodologies to
convert raw data into illustrations obtained from cleaned data [9–11].
Most of the people who are commenting about unemployment have negative
opinions about it because unemployment, in general, is considered as one of the
social evils all around the world. Unfortunately, a fair population of India from both
rural and urban areas is suffering from unemployment.
4 Proposed Methodology
Industry, organizations and academic institutions are increasingly focusing on big
data as a strong global trend. This research considers about 11,000 tweets obtained
from Twitter between 15 October 2021 and 15 September 2022. The data source
has been explored using the hashtags '#Unemployment' and '#UnemploymentInIndia'.
Twitter ranks amongst the top ten websites visited every day. As Twitter is a
predominantly text-based social media platform, tweets are used in the experiments to
analyze people's general opinion on youth unemployment in India. A time interval of
about a year is taken to ensure the uniformity of the data, since, due to changes in the
economy or work policies, people may have reacted actively only during a small span
of time. This study observes the social media messages of residents of India through
multiple channels. This data is tabulated and then transformed into data with three
sentiment categories: positive, negative and neutral.
The proposed approach for this research is given in the form of an algorithm which
is as follows:
1. Use Tweepy and Twitter's API to mine data and obtain tweets. The extracted
data is used as the dataset for the research.
2. Utilize Python, NLP and Machine Learning classifiers to sort data based on the
keywords they contain. The Natural Language Toolkit (NLTK) of Python is
used, with the 'TextBlob' library and Python's VADER tool for more precision.
3. Classify individual tweets into 'Positive', 'Negative' and 'Neutral' categories.
4. Tabulate the data in pie charts and develop 'wordclouds' to determine the most
frequently used hashtags.
5. Prepare a bar graph of the most used words in the tweets.
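Steps 2 and 3 of the pipeline reduce, at their core, to mapping a polarity score to a sentiment label. A minimal sketch, assuming the polarity scores themselves come from TextBlob or VADER (the scores below are illustrative stand-ins, not values from the paper's dataset):

```python
from collections import Counter

def classify_by_polarity(polarity: float) -> str:
    """Map a polarity score to a label using the paper's thresholds:
    greater than 0 -> Positive, less than 0 -> Negative, exactly 0 -> Neutral."""
    if polarity > 0:
        return "Positive"
    if polarity < 0:
        return "Negative"
    return "Neutral"

# In practice each score would come from TextBlob or VADER; these are
# illustrative stand-ins for scored tweets.
scores = [0.8, -0.4, 0.0, -0.1]
labels = [classify_by_polarity(s) for s in scores]
distribution = Counter(labels)   # this tally is what a pie chart would plot
```

The same tally, applied to all cleaned tweets, yields the sentiment distribution reported later in the paper.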
Opinion mining and sentiment evaluation form a quickly developing subject with
diverse global applications. This paper discusses a method, shown in Fig. 3,
wherein a public stream of tweets from the Twitter microblogging website is
preprocessed and labelled based on emotional content as positive, negative or neutral,
and the overall performance of various classification algorithms is analyzed primarily
based on their precision. This research is based on the tweets' textual content (i.e. it
involves only the Machine Learning approach).
Machine Learning classifiers are algorithms that automate the process of categorizing
data into one or more classes; classifiers are the rules used by machines to classify
data. In simple words, these classifiers ease the automation of categorization.
Machine Learning classifiers are of two types, supervised and unsupervised, and
within these types there are five sub-components. Naïve Bayes, Decision Tree,
Support Vector Machine (SVM) and Artificial Neural Network (ANN) classifiers
have been used in the proposed research.
Fig. 3 Systematic procedural flowchart
4.1 The Naïve Bayes Classifier
The Naïve Bayes classifier [12] is one of the major components of Machine Learning
used in Natural Language Processing. It classifies an input based on its probability;
this is the component that helps determine which tweets are classified as positive,
neutral or negative. Advantages of the Naïve Bayes classifier:
(a) It can be easily implemented and does not require a lot of training data. It is
simple to use.
(b) It can handle both types of data—continuous and discrete.
(c) There is a high degree of scalability when it comes to the number of predictors
and data points.
(d) Predictions can be made in real-time thanks to its speed [13].
The classification of an unknown instance assigns the most probable target
value, $v_{MAP}$, based on the attribute set $(a_1, a_2, \ldots, a_n)$:

$$v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n), \qquad (1)$$

where $P$ is the conditional probability on the attribute set described above. Using
Bayes' theorem, Eq. (1) can be rewritten as

$$v_{MAP} = \arg\max_{v_j \in V} \frac{P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)}{P(a_1, a_2, \ldots, a_n)} \qquad (2)$$

$$= \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j). \qquad (3)$$
The terms in Eq. (3) can be estimated from the training data. The $P(v_j)$ values are
calculated by counting the frequency with which each target value $v_j$ appears in
the training data. Estimating the individual $P(a_1, a_2, \ldots, a_n \mid v_j)$ terms in this
manner, however, is infeasible unless a very large training dataset is available.

Substituting the conditional-independence assumption into Eq. (3), the Naïve Bayes
classifier is:

$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j). \qquad (4)$$

$v_{NB}$ determines the target class of input vectors in the Naïve Bayes classifier.
The individual $P(a_i \mid v_j)$ terms are calculated from the training dataset; calculating
these values is much simpler than estimating $P(a_1, a_2, \ldots, a_n \mid v_j)$ directly.
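Equation (4) can be turned into a small classifier directly. The sketch below estimates $P(v_j)$ and $P(a_i \mid v_j)$ from token counts over a toy corpus; Laplace smoothing is added so unseen words do not zero out the product, a common choice the paper does not specify:

```python
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Collect the counts needed to estimate P(v_j) and P(a_i | v_j)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)   # class -> word -> count
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def classify(tokens, class_counts, word_counts, vocab):
    """Return arg max_v P(v) * prod_i P(a_i | v), as in Eq. (4),
    with add-one (Laplace) smoothing on the word likelihoods."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, n_docs in class_counts.items():
        score = n_docs / total_docs                          # P(v_j)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokens:
            score *= (word_counts[label][tok] + 1) / denom   # P(a_i | v_j)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy corpus: tokenized "tweets" with hand-assigned labels.
docs = [["no", "jobs", "crisis"], ["new", "jobs", "growth"], ["crisis", "layoffs"]]
labels = ["Negative", "Positive", "Negative"]
model = train_naive_bayes(docs, labels)
```

With this model, a tweet containing "jobs growth" is scored higher under the Positive class, while "crisis" falls under Negative.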
4.2 Decision Tree
A decision tree classification algorithm splits data into increasingly specific
categories, and the classification process can be represented graphically as tree
branches [14].

The most challenging aspect of decision trees is identifying the root node's
attribute. This process is called attribute selection.

A decision tree can be formed from a given set of attributes, and many similar
decision trees can be derived from the same set. It is practically impossible to
construct an optimal decision tree due to computational constraints, owing to the
number of branches present at each internal node and at the root node. Efficient
but suboptimal algorithms are available to build a decision tree; these algorithms
grow the tree using some kind of greedy strategy. The optimal split of an attribute
induces a partition of the test attributes, and each test attribute is put into the
appropriate branch based on class impurity values. The class impurities of the
branches are computed and summed, and the result is assigned to the given partition.
A contingency matrix of order $P \times K$ ($P$ is the size of the input vector and $K$ is the
class size) is computed at the start of the construction of the decision tree and is used
to compute the impurity measure at each partition. The number of distinct partitions
of a set with $P$ elements is an exponential function of $P$; there are $2^{P-1} - 1$ possible
two-way partitions.
A few approaches are available to measure the impurity. Two popular such
approaches are:

Gini index, used in Classification and Regression Trees (CART) [15]

Suppose $n = (n_1, \ldots, n_k)$ is a vector of non-negative real numbers, one count
for each class, and let $N = \sum_i n_i$ be the size of the input vector. The Gini diversity
index is defined by

$$g(n) = 1 - \sum_i \frac{n_i^2}{N^2}. \qquad (5)$$

The frequency-weighted Gini diversity index is given by

$$G(n) = N g(n) = \sum_{i \neq j} \frac{n_i n_j}{N}. \qquad (6)$$

Entropy, used in C4.5 (developed by Ross Quinlan) [16]

The entropy used in C4.5 is defined by

$$h(n) = -\sum_i \frac{n_i}{N} \log \frac{n_i}{N}. \qquad (7)$$

The weighted entropy is given by

$$H(n) = N h(n) = N \log N - \sum_i n_i \log n_i. \qquad (8)$$

The entropy measure, the Gini index and its frequency-weighted diversity index
are used in the experiments. No specific tree-pruning algorithms are used.
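Equations (5)–(8) are simple enough to check numerically. A minimal sketch, assuming natural logarithms (the base is not fixed in the text); a pure node has zero impurity and an evenly split node is maximally impure:

```python
import math

def gini(counts):
    """Gini diversity index g(n) = 1 - sum_i n_i^2 / N^2, Eq. (5)."""
    N = sum(counts)
    return 1.0 - sum(c * c for c in counts) / (N * N)

def weighted_gini(counts):
    """Frequency-weighted Gini index G(n) = N * g(n), Eq. (6)."""
    return sum(counts) * gini(counts)

def entropy(counts):
    """Entropy h(n) = -sum_i (n_i/N) log(n_i/N), Eq. (7); 0*log(0) = 0."""
    N = sum(counts)
    return -sum((c / N) * math.log(c / N) for c in counts if c > 0)

def weighted_entropy(counts):
    """Weighted entropy H(n) = N * h(n), Eq. (8)."""
    return sum(counts) * entropy(counts)
```

For the class counts [5, 5], for instance, g = 0.5 and h = log 2, the maximum impurity for two classes.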
4.3 Artificial Neural Networks
Artificial neural networks (ANNs) [17, 18] are designed to work like the actual
neural structure of the human brain. Humans can easily recognize objects simply by
seeing them, thanks to the simultaneous processing of the objects' properties by the
neural networks in the brain. A typical neuron in the human brain receives and
processes signals from other neurons through its dendrites and, as its axon splits
into several branches, sends electrical signals onwards to other neurons. At the
terminal point of each branch, a synapse converts the activity arriving from the axon
into electrical signals. The generated signal is either forwarded to other neurons or
stops. When the excitatory input is sufficiently strong, the neuron sends an electrical
spike through its axon to the next neuron. Learning occurs when the values of the
synapses change and the effect of one neuron on another propagates.

Fig. 4 Example of an artificial neural network architecture
ANNs are derived from the neural structure of the human brain. An ANN processes
records individually and learns by comparing its classification of each record with
the known, genuine classification of that record. The error from the initial
classification of the first record is fed back into the network and used to alter the
network's algorithm over several subsequent iterations. The process stops when the
error is within an acceptable range.
An ANN, as shown in Fig. 4, consists of the following:
1. A set of input neurons with values $x_i$ and weights $w_i$.
2. An activation function that aggregates the weighted inputs and forwards the
value to an output.

Neurons are used at three different layers: input, hidden and output. The neurons
in the input layer take their values from the records fed to the ANN and provide
signals to the next layer. Between the input and output layers lie one or more hidden
layers; the number of hidden layers may vary according to the application. The
hidden layers carry signals from the input layer to the output layer via connection
weights. The output layer provides the class label of the input vector; its output is
known as the target for the given input values. In an ANN, an input object whose
feature values have been extracted can thus be predicted or recognized through the
target value produced by the network.
Mathematically, the activation (or target $O_j$) is calculated as

$$O_j = \varphi\left(\sum_{i=1}^{n} w_{ij} x_i - \theta_j\right). \qquad (9)$$
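Equation (9) describes a single neuron: a weighted sum of the inputs, shifted by the threshold, passed through an activation function. A sketch using the sigmoid (which the paper's hidden layers use) as φ; the weights, inputs and threshold below are illustrative values, not taken from the paper:

```python
import math

def neuron_output(x, w, theta, phi=lambda z: 1.0 / (1.0 + math.exp(-z))):
    """Activation of one neuron per Eq. (9): O = phi(sum_i w_i * x_i - theta)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) - theta
    return phi(z)

# Illustrative inputs: two features, two weights, one threshold.
out = neuron_output(x=[1.0, 0.5], w=[0.4, -0.2], theta=0.3)
```

With these values the weighted sum minus the threshold is exactly 0, so the sigmoid yields 0.5.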
4.4 Training an ANN
In an ANN, the target classes are known prior to the experiments; this set of inputs
is known as the training data. For training, a set of inputs and their target classes
are provided to the neural network. The network then adjusts the parameters of its
hidden neurons to map each input accurately to its corresponding target class. For
testing, a set of unknown inputs is given to the network, which recognizes each
input and produces its actual target class based on the training; this dataset is known
as the testing dataset. The better the training process, the more accurate the
recognition results, and hence the lower the error rates produced by the network.
The training phase of an ANN is also capable of handling noisy datasets.
4.5 The Neural Network as a Classifier
Classification problems can be solved with the help of artificial neural networks
using a feedforward network and sigmoid output neurons. Depending on the size of
the input feature vectors, a range of active neurons can be used in the hidden layers.
The ANN here has three output neurons, as there are three target values (positive,
negative and neutral) associated with the input dataset. Pattern recognition networks
require training for effective classification of input vectors into their target classes.
The input data is further divided into training, testing and validation sets. The
training data is used to build the network, i.e. to fix the values of the connection
weights and biases. The validation phase is used to terminate the training process
and thereby avoid overfitting. To measure the actual performance of the trained
network, the testing data is fed to it; one should avoid using the testing data in either
the training phase or the validation phase.
The following parameters are used in the neural network classifier for the experiments:
1. The standard network is a two-layer feedforward network.
2. A sigmoid transfer function is used in the hidden layer.
3. At the output layer, the Softmax transfer function is used.
4. The number of hidden neurons can be taken as an arbitrary value, but it largely
depends on the input size.
5. Class confusion matrices for the training, validation, testing and combined data
are used in the result analysis.
In a multilayer feedforward network, data and computations flow in the forward
direction only, from the input units to the output units. A basic neural network has
one input layer and one output layer; this is called a one-layer feedforward network.
The numbers of input and output neurons depend entirely on the application. When
one extra layer of neurons is inserted between the input and output layers, the
corresponding network is known as a two-layer feedforward neural network. This
kind of neural network classifier is used in this research.
A sigmoid function acts as the activation function in the hidden layer of the
neural network classifier. The curve of the sigmoid function has an 'S' shape, and the
function has the following structure:

$$\mathrm{sig}(x) = \frac{1}{1 + e^{-x}}. \qquad (10)$$

The activation function at the output layer of the neural network classifier is a
Softmax function, whose structure is given as:

$$\mathrm{softmax}(x_j) = \frac{e^{x_j}}{\sum_{i=1}^{n} e^{x_i}}. \qquad (11)$$

The range of these functions is (0, 1). At each layer, the calculated value of the
function is compared with a threshold; if the calculated value exceeds the threshold,
the neuron transmits an electrical spike to the next neuron through its axon,
otherwise it provides no information to the connected neurons in the next layer.
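Equations (10) and (11) can be sketched directly. Subtracting the maximum inside the Softmax is a standard numerical-stability trick not mentioned in the text; it leaves the result unchanged:

```python
import math

def sig(x):
    """Sigmoid of Eq. (10): 1 / (1 + e^(-x)); output lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs):
    """Softmax of Eq. (11): e^(x_j) / sum_i e^(x_i), over all outputs."""
    m = max(xs)                             # stability shift; cancels out
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Three output neurons, as in the paper (positive / negative / neutral);
# the raw scores are illustrative.
probs = softmax([2.0, 1.0, 0.1])
```

The Softmax outputs sum to 1, so they can be read as class probabilities over the three sentiment labels.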
4.6 Support Vector Machines
The SVM classifies n-dimensional space into classes using a line or decision
boundary, making it easier for new data points to be placed in the correct
category. The most suitable decision boundary is termed a hyperplane [19].
A hyperplane is created by choosing extreme points/vectors; these extreme cases
are called Support Vectors and are handled with the algorithm known as a Support
Vector Machine. The hinge loss used for training is

$$c(x, y, f(x)) = \begin{cases} 0, & \text{if } y f(x) \geq 1 \\ 1 - y f(x), & \text{otherwise.} \end{cases} \qquad (12)$$

In sentiment analysis, however, the Naïve Bayes rule and convolutional neural
networks are used as major components of Machine Learning classifiers [20].
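The piecewise cost in Eq. (12) is the standard SVM hinge loss: zero when a point sits on the correct side of the margin, and growing linearly with the margin violation otherwise. A minimal sketch:

```python
def hinge_loss(y, fx):
    """Hinge loss of Eq. (12): 0 if y*f(x) >= 1, else 1 - y*f(x).

    y is the true label (+1 or -1) and fx is the classifier's raw score f(x).
    """
    margin = y * fx
    return 0.0 if margin >= 1 else 1.0 - margin
```

A confidently correct prediction (margin at least 1) costs nothing; a misclassified point is penalized in proportion to how far it lies on the wrong side.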
The NLTK modules of Python (the Natural Language Toolkit) are used in tasks that
specifically require Natural Language Processing. The Natural Language Toolkit
provides ready-to-use computational linguistics courseware in the form of tutorials
and problem sets, and as a Natural Language Processing library it interfaces with
annotated corpora and handles symbolic and statistical Natural Language
Processing [21]. These modules are used to analyze the text: NLTK modules have
been used to tokenize the data and to put the text through lemmatization,
classification and stemming [22].
TextBlob, which builds on the NLTK module of Python, provides access to
lexicon-based approaches; processing textual data with TextBlob is easy in Python.
Natural Language Processing can be performed through its several APIs, including
part-of-speech tagging, noun-phrase extraction, sentiment analysis, translation and
classification tasks [23]. TextBlob is used for sorting data based on subjectivity and
polarity. Topics or domains influence sentiment polarities, and there may be
variations in sentiment polarity between domains even when the same word is
used [24]; this is the reason why the polarity of tweets of a general kind is considered.
The criteria used are: polarity greater than 0 for positive, 0 for neutral and less than 0
for negative sentiments. Provisions for emoticons and exclamation marks are not
used in the experiments. VADER, which is based on lexicons and rules, is
specifically designed to analyze sentiments expressed on social media [25].
Firstly, the tweets are obtained in CSV file format using Twitter's API and Tweepy.
The text extraction method utilized is the Bag-of-Words (BOW) method [26]. The
collection of individual words is known as a 'collection of unigrams', and all the
unigrams are independent: the presence of one unigram in the text has no effect on
the presence of any other unigram.
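The unigram Bag-of-Words representation amounts to counting each word independently of every other, which a `Counter` captures directly (the sample text is illustrative):

```python
from collections import Counter

def bag_of_words(tokens):
    """Unigram Bag-of-Words: each word is counted independently of the others;
    word order and co-occurrence are discarded."""
    return Counter(tokens)

bow = bag_of_words("no jobs no growth".split())
```

The resulting counts are exactly what a vectorizer such as scikit-learn's CountVectorizer (used later in the paper) produces per document.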
Then Python's re.findall command is used to clean the tweets. Re.findall returns a
list of strings or tuples containing all non-overlapping matches of a pattern in a
string, and it has been used here to remove patterns from the input text. The names
of users are removed using NumPy [27], a Python library for working with arrays.
The cleaned tweets were stored in another column; about 70 uncleaned tweets were
lost in this process. NLTK modules are then used to tokenize the cleaned tweets,
and stemming is performed on the data using PorterStemmer. The VADER module
is used to analyze sentiments [28]; this process is based on analyzing the most
commonly used keywords and hashtags, and a wordcloud is also created. Finally,
TextBlob is used to set the polarity of the tweets accordingly and perform the
analysis on the remaining tweets.
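The cleaning steps can be sketched with Python's re module. The exact patterns the authors removed are not given in the paper, so the regular expressions below (user handles, links, punctuation) are assumptions covering the kinds of sentiment-less tokens described later:

```python
import re

def clean_tweet(text):
    """Strip user handles, links and non-letter characters from a raw tweet."""
    text = re.sub(r"@\w+", "", text)          # remove @user mentions
    text = re.sub(r"https?://\S+", "", text)  # remove links
    text = re.sub(r"[^A-Za-z# ]", "", text)   # keep letters, hashtags, spaces
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace

def extract_hashtags(text):
    """Collect hashtags with re.findall, as done for the frequency bar graph."""
    return re.findall(r"#(\w+)", text)

tweet = "@user No jobs anywhere! #Unemployment #India https://t.co/x"
cleaned = clean_tweet(tweet)
tags = extract_hashtags(tweet)
```

The cleaned text would then go into the 'Clean_Tweet' column, and the hashtag lists would be tallied for the bar graph.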
5 Data Extraction
Sentiment analysis is performed on people's views regarding the growing
unemployment in India. The data has been prepared in CSV files using Tweepy and
Python. A total of 11,000 tweets from between 15 October 2021 and 15 September
2022 have been extracted using Tweepy, which uses the Twitter API to fetch tweets
on a specific topic within a certain time interval. Third-party apps can be integrated
with the Twitter APIs, and tweets may also be obtained from a particular user. The
data extracted is generally in the form of text, though images, audio and video files
can be obtained using the 'extended_entities' object mode. Only text data is
considered here, since other data types are not used in the analysis. The consumer
key, consumer secret, access key and access secret obtained after creating an account
on the Twitter Developer platform are used, and the Twitter developer account is
authorized in Python. A function was also defined to create a filter to obtain tweets
with the hashtags '#Unemployment' and '#UnemploymentInIndia'. The full text of
the tweet, the time of the tweet and the username of the person who tweeted are also
saved. This is accomplished using the tweepy.Cursor functionality, which returns a
list of tweets that can be iterated over.

Fig. 5 Average unemployment rate in India (2017–2021)
One can observe a huge spike (as shown in Fig. 5) in unemployment rates during
the lockdown, which is why no tweets from that time interval were obtained: as
COVID-19 cases were at their peak, mostly negative tweets would be available from
that period. This is the reason why data from 2020 has not been considered.
The process of sentiment analysis is based upon NLP, one of the major components
of Machine Learning that enables a computer system to understand all forms of
complex text. NLP is generally used to process text data and perform analysis on it;
it is a branch of AI that understands texts and derives meaning from them. Complex
tasks like text summarization, query solving and sentiment analysis can be executed
using NLP, whose functioning is often based on Recurrent Neural Networks (RNNs).
Machine Learning is used for labelling the data so as to obtain equitable and
serviceable output. ML algorithms can be used in sentiment analysis to check
whether a text is positive or negative based on its polarity: without human
intervention, machines learn to detect sentiment automatically by being trained with
examples of emotions in text. A number of complex methodologies and algorithms,
such as the Naïve Bayes rule, Linear Regression and Support Vector Machines,
relate Machine Learning and opinion mining.
6 Data Cleaning and Preprocessing
From the 11,000 tweets that were obtained, the data was cleaned by removing
unnecessary details like the username, the time of the tweet, etc. Several
sentiment-less items appear in the data, including links, Twitter-specific tokens such
as hashtags and tags, single-letter words and numbers. Many tweets were also
rejected and removed during this process, after which about 10,928 tweets remain
available for the experiments. Abstraction was used to obtain the most used hashtags,
along with the searched ones, to prepare a histogram. The re.findall function is used
to return all non-overlapping matches of a pattern in a string. Impurities and
punctuation were removed from the data in the first phase, and the data was
vectorized using CountVectorizer. The same CSV file obtained from the tweets is
appended with a new column named 'Clean_Tweet', where the tweets obtained after
cleaning are stored. After that, stemming is applied to the data using nltk.stem.porter.
Once the data is cleaned, wordclouds (as shown in Fig. 6) are created from the
cleaned tweets.

Fig. 6 Wordcloud of cleaned tweets (includes frequently used words)
7 Results
The most frequently used hashtags, obtained using re.findall, are depicted in Table 2;
a dictionary of the words was defined to count them. The counts are used to create a
bar graph of the hashtags that were repeatedly used along with the target hashtags.
The counts of the most frequently used hashtags are given in Table 2; the underscores
have been removed from the hashtags for easier reading.

The bar graph for the extracted data (Table 2), obtained using re.findall, is shown
in Fig. 7. Finally, a pie chart (Fig. 8, Table 3) of the positive, negative and neutral
tweets has been prepared, along with the number of the respective tweets and the
polarity chosen for them.

Calculating the individual count of each sentiment among the 10,928 tweets, about
7803 tweets contain negative content and 2983 tweets are positive, whereas only
about 1.3% of the tweets are written neutrally; the count of these is only about 142.
Table 2 Hashtags and their count
Frequently used hashtags   Count
Bharatjodoyatra 1639
Unemploy 2009
India 993
Bharatjodobegin 1309
Poverty 962
Agriculture 936
Industries 152
Fresher 803
Economy 528
Insurance 194
Covid 305
Fig. 7 Frequently used hashtags and their count

Fig. 8 Distribution of sentiments among the 10,928 tweets: Negative 71.4%, Positive 27.3%, Neutral 1.3%
Table 3 Polarity and sentiment count
Sentiment   Number of tweets   Polarity
Positive 2983 Greater than 0
Negative 7803 Less than 0
Neutral 142 0
8 Discussion and Analysis
After obtaining all the required data, it is observed that most Twitter-active social
media users are upset about the current unemployment situation in India: about 70%
of the time, people are tweeting negatively about it.

Predictions were made over the 11,000 tweets using this combination of techniques,
and it was found that 27.3% of people are positive about the current unemployment
situation in India, 1.3% are neutral, and 71.4% feel negative due to various valid or
invalid reasons.
As expected, most people criticize unemployment, which is in line with the
expectations from the experiments. For uncertain reasons, some people talk
positively about the current unemployment situation; from the bar graph prepared
from the dataset, one can surmise the reasons. Following the recent launch of the
'Bharat Jodo Yatra' campaign by the Indian National Congress (INC), its leaders
claim that 'Without harmony, there is no progress; without progress, there are no
jobs; and without jobs, there is no future'. This movement aims to unite India through
the political party's long road march and to address many social problems, including
the 'ticking time bomb of unemployment'. This campaign might be the reason
people's hopes are high; they might be expecting that unemployment rates will be
affected by this movement as well. Areas where unemployment is most prevalent are
also mentioned in the tweets, and references to poverty, the economy and some
traces of COVID-19 are present as well.
COVID-19 shattered the world economy, and countries saw an unprecedented
downfall in their growth and development. The highest unemployment rate India
has ever witnessed was during the lockdown: many businesses were shut down, and
there was no way a daily-wage labourer could work from home. Moreover, the
pandemic worsened the health situation in every country and many people lost their
jobs, with each wave inducing financial losses for India. However, Indians bravely
coped with the pandemic and bounced back; employment rates rose a little and
people began working again. This explains the presence of 'Covid' in this research.
Unemployment draws a negative response from citizens, though this may be less
relevant today, as technology has also led to employment generation. With the help
of AI, it will be possible to predict the unemployment situation a few years
ahead [29].
The INC Youth Wing has also launched the 'Rozgar Do' (Provide Employment)
campaign to battle unemployment. This campaign might become the next most
tweeted hashtag once the 'Bharat Jodo Yatra' campaign loses momentum. To
highlight the country's unemployment issue, the campaign's theme is 'Give a job or
take back your degree'. The campaign addresses the Prime Minister and the
government, demanding action on the current unemployment rates; its main demand
is that the government ensure provisions for degree-holders to have a job.
A military recruitment plan called 'Agnipath' was also launched in India a few
months ago. Last year it sparked violent protests, bringing to light the unemployment
crisis plaguing India's $3.2 trillion economy, as well as Prime Minister Narendra
Modi's campaign promises.

People in Goa have been asked not to vote for parties that have not provided
jobs; this demand is part of a campaign launched by a political party (the Aam Aadmi
Party). It is possible that all these factors combined are being discussed, triggering
a mixed reaction among people, since people support the actions only of the parties
they voted for in the most recent elections.
Only a small proportion of people have neutral opinions about unemployment.
They might be tweeting general trends and topics associated with unemployment.
9 Future Suggestions
One cannot overlook the importance of Natural Language Processing when analyzing
sentiments from written text. The accuracy and performance of sentiment analysis
are directly proportional to the granularity of the dataset. Natural language involves
many irregularities, subjectivity and diversity. Emotions like sarcasm are not easy to
detect; it is difficult to know which category they fit in: positive, negative or neutral. This study has the
following major limitations. Tweets that were trending during a particular period
are under experimentation. Phase changes can result in changes in the surrounding
environment, and one might experience a change in the distribution of the sentiments
of tweets. As conditions improve, people might not even post about unemployment
anymore. Also, emoticons and tweet hashtags, which could reveal a lot about the
sentiments of the tweets, are not considered. Had the emoticon data been included
as part of the tweet data, the classifiers’ efficiency would have been hampered.
Combining these two factors, sarcasm and emoticons, people could have deliberately
posted wrong emoticons to signal ‘sarcasm’ in the ideas they want to convey to
the audience.
This research provides an accurate, timely and comprehensive overview of the needs,
attitudes and motivations of the unemployed. It can help research organizations and
institutions study people’s general opinions on unemployment, learn how political
parties affect the way people think about a particular topic, and observe the general
trend of the word ‘unemployment’. Factors affecting unemployment should be studied
before undertaking further research on this topic. Studying the impact of fake news
on the public plays an important role in assisting administrations and policymakers
in managing it. People’s general mental condition should be considered as well. For
example, at the time of lockdown in India, people were stressed and gloomy and
reacted with strong aggression over trivial matters. The conditions of a nation at
the time the research is conducted should also be considered. A large corpus can be
used to improve the accuracy of the model in future studies.
674 R. Tiwari et al.
10 Conclusion
Thousands of people use social media every day, and the number is growing every
day. In place of speaking with someone in person, people prefer to write about their
honest opinions on social media. The analysis of the common public’s reaction to
unemployment in India was based on the posts from Twitter. The collected data
after annotation and preprocessing have been applied to several Machine Learning
techniques. Almost 70% of the population feels negatively about the unemployment
rates in India, about 27% has been talking positively about unemployment in India,
and only a handful of people (1%) feel neutral about it. Much
of these tweets are subjective to the various changes in policies and execution of
campaigns and movements by political parties.
References
1. https://www.cmie.com/kommon/bin/sr.php?kall=warticle&dt=20220829141802&msec=860.
Accessed on 28 Dec 2022
2. Pratibha GK, Kaur A, Khurana M (2022) A stem to stern sentiment analysis emotion detection.
In: 2022 10th international conference on reliability, infocom technologies and optimization
(trends and future directions) (ICRITO), Noida, India, pp 1–5. https://doi.org/10.1109/ICRITO
56286.2022.9964967
3. Tiwari RG, Misra A, Ujjwal N (2022) Comparative classification performance evaluation of
various deep learning techniques for sentiment analysis. In: 2022 8th international conference
on signal processing and communication (ICSC), Noida, India, pp 304–309. https://doi.org/
10.1109/ICSC56524.2022.10009471
4. Kaushik P (2020) Research report on Indian Unemployment scenario and its analysis of causes,
trends and solutions. A project study submitted in partial fulfilment for the requirement of the
two year (full-time) post-graduate diploma in management (2018–20)
5. Sinha P (2022) Combating youth unemployment in India, Academia. https://www.academia.
edu/26001773/Combating_Youth_Unemployment_in_india. Accessed on 22 July 2022
6. Naraparaju K (2017) Unemployment spells in India: patterns, trends, and covariates. Indian J
Labour Econ 60(4):625–646
7. Dev M, Motkuri V (2011) Youth employment and unemployment in India
8. Gupta P, Kumar S, Suman RR, Kumar V (2020) Sentiment analysis of lockdown in India during
COVID-19: a case study on Twitter. IEEE Trans Comput Soc Syst 8(4):992–1002
9. Gautam G, Yadav D (2014) Sentiment analysis of twitter data using machine learning
approaches and semantic analysis. In: 2014 seventh international conference on contemporary
computing (IC3), pp 437–442
10. Agarwal A, Xie B, Vovsha I, Rambow O, Passonneau RJ (2011) Sentiment analysis of Twitter
data. In: Proceedings of the workshop on language in social media (LSM 2011), pp 30–38
11. Desai M, Mehta MA (2016) Techniques for sentiment analysis of Twitter data: a comprehen-
sive survey. In: 2016 international conference on computing, communication and automation
(ICCCA), pp 149–154
12. Balaji VR, Suganthi ST, Rajadevi R, Kumar VK, Balaji BS, Pandiyan S (2020) Skin disease
detection and segmentation using dynamic graph cut algorithm and classification through Naive
Bayes classifier. Measurement 163:107922
13. Calders T, Verwer S (2010) Three Naive Bayes approaches for discrimination-free classifica-
tion. J Data Mining Knowl Discov 21(2):277–292
14. Das A, Das P, Panda SS, Sabut S (2019) Detection of liver cancer using modified fuzzy clustering
and decision tree classifier in CT images. Pattern Recogn Image Anal 29:201–211
15. Liu Q, Wang X, Huang X, Yin X (2020) Prediction model of rock mass class using classification
and regression tree integrated AdaBoost algorithm based on TBM driving data. Tunn Undergr
Space Technol 106:103595
16. Muthamil Sudar K, Deepalakshmi P (2020) A two level security mechanism to detect a DDoS
flooding attack in software-defined networks using entropy-based and C4.5 technique. J High
Speed Netw 26(1):55–76
17. Asteris PG, Mokos VG (2020) Concrete compressive strength using artificial neural networks.
Neural Comput Appl 32(15):11807–11826
18. Hasson U, Nastase SA, Goldstein A (2020) Direct fit to nature: an evolutionary perspective on
biological and artificial neural networks. Neuron 105(3):416–434
19. Okwuashi O, Ndehedehe CE (2020) Deep support vector machine for hyperspectral image
classification. Pattern Recogn 103:107298
20. Singh J, Singh G, Singh R (2017) Optimization of sentiment analysis using machine learning
classifiers. HCIS 7(1):1–12
21. Loper E, Bird S (2002) NLTK: the natural language toolkit. CoRR.cs.CL/0205028. https://doi.
org/10.3115/1118108.1118117
22. Yao J (2019) Automated sentiment analysis of text data with NLTK. J Phys Conf Ser 1187(5)
23. Loria S (2018) Textblob documentation. Release 0.15 2.8
24. Li F, Huang M, Zhu X (2010) Sentiment analysis with global topics and local dependency. In:
Twenty-fourth AAAI conference on artificial intelligence
25. Gupta P, Kumar S, Suman RR, Kumar V (2021) Sentiment analysis of lockdown in India during
COVID-19: a case study on Twitter. IEEE Trans Comput Soc Syst 8(4):992–1002
26. Kolchyna O, Souza TTP, Treleaven PC, Aste T (2015) Twitter sentiment analysis: Lexicon
method, machine learning method and their combination. arXiv preprint arXiv:1507.00955
27. Agarwal A, Xie B, Vovsha I, Rambow O, Passonneau RJ (2011) Sentiment analysis of Twitter
data. In: Proceedings of the workshop on language in social media (LSM 2011)
28. Hutto C, Gilbert E (2014) Vader: a parsimonious rule-based model for sentiment analysis of
social media text. In: Proceedings of the international AAAI conference on web and social
media, vol 8, no 1
29. Barkur G, Vibha, Kamath GB (2020) Sentiment analysis of nationwide lockdown due to COVID
19 outbreak: evidence from India. Asian J Psychiatry 51
Customer Churn in Telecom Sector:
Analyzing the Effectiveness of Machine
Learning Techniques
Vaibhav Sharma, Lekha Rani, Ashok Kumar Sahoo,
and Pradeepta Kumar Sarangi
Abstract The number of customers that stopped using a company’s product or
service during a particular time is known as customer churn. Businesses may prevent
churn by taking preventative measures when they can anticipate it before it occurs.
Specifically in the telecommunication sector due to various providers, there is great
competition. To compete in the market, telecom firms provide all basic services,
easy access to the Internet, quality phone service, etc., to all mobile users, and still
it is a challenge to retain clients. Therefore, it is an important task to understand
the customer needs of all age groups. So, with a proper prediction of customer churn,
companies can reduce the churn rate by taking immediate action.
In this study, the authors present exploratory data analysis (EDA) across the
different parameters which could affect churn. Going further, the data has been
divided for training, which is 80% of the whole, while the remaining 20% is kept as
the test data. By comparing various machine learning (ML) models such as SVM,
KNN, XGBoost, decision tree, and random forest, the best-performing model for the
dataset is identified as the RF model. The best precision obtained is with RF, with
an accuracy of 82%, and the least precise was from KNN with an accuracy of 76%.
The study will provide a detailed view of the problem of customer churn and how it
can be controlled.
Keywords Customer ·Churn ·EDA ·SVM ·KNN ·XGBoost ·Decision tree ·
Random forest
V. Sharma · L. Rani · P. K. Sarangi (B)
Institute of Engineering and Technology, Chitkara University, Punjab, India
e-mail: Pradeepta.sarangi@chitkara.edu.in
V. Sharma
e-mail: Vaibhav1467.cse19@chitkara.edu.in
L. Rani
e-mail: lekha@chitkara.edu.in
A. K. Sahoo
Graphic Era Hill University, Dehradun, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_50
677
678 V. Sharma et al.
1 Introduction
Client turnover, sometimes referred to as customer attrition, is the term used to
describe a company’s loss of clients over an extended time. It is a typical issue
that companies of all sizes and sectors deal with, and it may have a big effect on
their bottom line. Churn can happen for several reasons, including dissatisfaction with
a product or service, a better offer from a competitor, or simply a change in the
customer’s circumstances [1].
For telecom companies, which must actively compete for consumers in a market
that is becoming more crowded and competitive, customer turnover is a huge concern.
The telecom industry must be careful in recognizing and resolving the variables that
lead to customer churn. Lack of new services or features, exorbitant prices, and
poor service quality are just a few of the causes of this. Sometimes a more alluring
promotional offer or a superior pricing strategy will cause clients to migrate to a
rival.
Therefore, customer churn prediction is crucial, as it enables organizations to identify
customers who are at risk and take proactive measures to retain them. It is also
essential for reducing customer acquisition costs, improving customer satisfaction,
and maintaining or increasing market share [2]. In general, by predicting client
attrition, organizations can make data-driven choices that significantly affect their
bottom line.
ML is an important tool for the prediction of customer churn as it allows firms
to analyze huge amounts of data and identify patterns and trends [3]. ML can iden-
tify factors that are strongly linked to customer churn using various algorithms and
statistical techniques. ML models can be trained on customer data to precisely predict
users at risk of leaving in the future. With these predictions, specific retention tactics
may be created such as providing tailored incentives or enhancing customer service
for vulnerable clients. Moreover, by evaluating the success of retention measures
and modifying the models accordingly, ML may help firms constantly improve their
churn prediction models.
To lower the churn rate in any firm, this research study offers analytical approaches
to forecast customer turnover rates, identifying factors that lead to churn. The paper
will contribute a solution to the existing problem of customer churn. Based on the
dataset, EDA is done to determine which factors affect the churn rate. Then the
data is split into training and test sets to apply different ML algorithms, namely SVM,
KNN, XGBoost, decision tree, and RF. The best accuracy was seen in random forest,
while the lowest accuracy was in XGBoost and KNN. The best precision obtained
is with RF, with an accuracy of 82%, and the least precise was from KNN with an
accuracy of 76%.
Customer Churn in Telecom Sector: Analyzing the Effectiveness 679
1.1 Novelty
By implementing various methods and techniques on the dataset and by performing
different analyses, the paper tries to provide an overview of the problem. Different
ML algorithms have been used to predict the outcome. The accuracies of the models
have also been compared.
2 Background Study
The study showed that certain demographic and usage characteristics, such as being
elderly, unmarried, and lacking relatives, are associated with higher customer churn
rates in the telecom industry. On the other hand, users with specific service combina-
tions, such as phone and fiber services with additional streaming TV and film services,
are also at risk of churning. To improve customer retention, tailored services, promo-
tional discounts, service upgrades, and contract payment discounts can be effec-
tive strategies. The proposed user churn prediction model, which uses the gradient
boosting tree algorithm, demonstrated satisfactory accuracy and can be a useful tool
for decision-makers in predicting and retaining potential customers [4]. The authors
conclude that the telecommunications industry requires a dependable approach to
analyze and predict customer churn. The research highlights that artificial neural
networks (ANN) and Gaussian Naïve Bayes are effective methods for predicting
churn. Nonetheless, the study suggests further investigation to determine the efficacy
of these methods on diverse datasets [5]. A survey was conducted on customer churn
using several ML and deep learning (DL) techniques. The study concludes
that DL techniques, particularly convolutional neural networks and stacked auto-
encoders, outperform other methods in terms of both speed and accuracy. These
findings highlight the potential of DL.
In another work, the authors have addressed the challenges of customer churn
prediction in various industries [6]. To effectively manage customer outflow in the
telecom industry, the use of data mining and data science models is essential. The
study demonstrated that models with an accuracy of over 95% in customer loyalty
classification can be constructed using these techniques. Furthermore, the study
provided valuable insights into the factors that influence customer churn behavior.
Regular monitoring and the inclusion of additional variables are necessary for the
continuous improvement of these models and the prevention of obsolescence. These
findings can be utilized to enhance marketing activities and improve the mathe-
matical methodology for consumer churn prediction in the industry [7]. The author
concludes that gradient boosting with feature selection is the most effective model
for predicting the problem of customer churn. The study shows the importance of
feature engineering and selection in improving model performance. The findings
suggest the need for alternative strategies to address churned customers and propose
using DL models for future research [8]. The utilization of ML techniques in the
churn model can aid telecommunications companies in providing attractive offers
to customers to retain them. Additionally, further improvement in accuracy can be
achieved by reducing features and implementing additional ML models [9]. The
research presented in this paper introduces a framework for churn prediction and
customer segmentation in the telecommunications industry, using a range of ML
models to achieve high accuracy. This study contributes to the existing literature
and provides valuable insights for telco operators to understand the churn behavior
of different customer clusters. Further research can explore alternative methods to
enhance the accuracy of churn prediction models [10]. Telecommunications organi-
zations face the challenge of customer churn, which can be tackled through predictive
analytics and customer retention measures. This study highlights the effectiveness
of ML models such as ANN and XGBoost in addressing this issue. Future research
can focus on further improving these methods and incorporating big data analytics
for better results [11]. In today’s world where technology drives businesses, compa-
nies face the challenge of retaining customers and predicting customer churn, and the
telecommunications industry is no exception. Churn prediction has become a subject
of interest for many researchers, and this research paper presents a comparative study
of various ML models for churn prediction in the industry. The results suggest that
ensemble learning techniques such as XGBoost and AdaBoost classifiers perform
better than other algorithms in terms of precision, accuracy, F-measure, recall, and
AUC score. However, predicting genuine customers remains a daunting task, and
companies must provide valuable services to retain customers in today’s competitive
market. Further research can be done to improve churn prediction and help companies
in the telecommunications industry retain customers [12]. To conclude, predicting
customer churn is important for reducing operational costs in the telecommunica-
tions industry. The performance of prediction models can be improved by utilizing
feature selection techniques. The proposed model can identify potential churners
and enable companies to take necessary measures to retain customers [13]. Table 1
shows a summarized representation of the background study done in this context.
3 Objective
The main objective of this study is to contrast various methodologies and approaches
for predicting client attrition. In this work, an analysis of customer churn is done
using ML to investigate and measure the efficiency of ML methods for predicting
customers who are at risk of attrition in the telecom industry. EDA has been done for
hypothesis generation, enhancing data accuracy through data scaling, and splitting
the dataset for training and test data and training models. The accuracy and precision
of these models’ performance would be assessed in the study.
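The data scaling and 80/20 train/test split described above can be sketched in Python; the following is a minimal illustration using scikit-learn on toy data (the paper does not name the exact library calls, so these are assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy feature matrix and churn labels standing in for the telecom data.
rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = rng.randint(0, 2, size=100)

# 80% training / 20% test, as described in the study.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Scale features to zero mean / unit variance; fit on training data only
# to avoid leaking test-set statistics.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)  # (80, 4) (20, 4)
```

Fitting the scaler on the training split alone mirrors the standard practice of keeping the test data unseen during preprocessing.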
Table 1 Concise representation of the literature studied

References | Techniques | Dataset used | Remarks
[4] | Spearman single-factor analysis; random forest | Customer churn dataset | Satisfactory results achieved by the gradient boosting tree algorithm
[5] | ANN, SVM, KNN, decision tree, Gaussian Naïve Bayes | Telecommunication customer churn dataset | ANN and Gaussian Naïve Bayes are effective methods for predicting churn
[6] | XGB, ANN, random forest, gradient boosting, AdaBoost, CNN, stacked auto-encoders | IBM telecom’s Kaggle dataset | DL works with speed and accuracy to give better results
[7] | Random tree, neural net, ensemble, C5.0, KNN, etc. | Telecommunication company dataset | Models were implemented with the help of the IBM SPSS Modeler; accuracy reached over 95%
[8] | Gradient boosting, logistic regression, decision tree, and random forest | American telecom company dataset | Gradient boosting with feature selection is the most effective model
[9] | KNN, logistic regression, and random forest | Telecom company dataset | The churn model gave better results using ML
[10] | Multilayer perceptron, logistic regression, decision tree, random forest, AdaBoost, Naïve Bayes | IBM Kaggle telco customer churn; Cell2Cell provided by Teradata center | An integrated customer analytics framework is proposed to seamlessly connect two components
[11] | Random forest, XGBoost, ANN, gradient boost, logistic regression, and AdaBoost | IBM Telecom’s Kaggle dataset | ANN and XGBoost outperformed other models in terms of accuracy, F1-score, recall, and precision
[12] | XGBoost, CatBoost, logistic regression, random forest, SVM, decision tree, Naïve Bayes, AdaBoost | Customer churn dataset | Ensemble learning techniques like XGBoost and AdaBoost perform better
[13] | SBFS, SFS, Naïve Bayes, SBS | Telco customer churn dataset | The study proposes applying feature selection to select features that have a positive effect on models
4 Dataset
The dataset used for the research is IBM Telecom’s Kaggle dataset. It is large
and consists of several important parameters for predictive analysis. The
telecommunication dataset consists of 7043 instances of 21 attributes.
For training the models, only the first 1000 rows of the dataset have been used.
The attributes include demographic information such as gender, age, and dependents,
along with the different types of services for which the customer has signed up,
contact information, payment methods, monthly charges, paperless billing, total
charges, and the churn attribute, which identifies the customers who have discontinued.
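As a rough sketch, restricting the data to the first 1000 rows for training can be done with Pandas; the frame below is synthetic, and its column names are only illustrative of the dataset’s 21-attribute schema:

```python
import pandas as pd

# Small synthetic frame standing in for the 7043-row, 21-column
# Kaggle dataset (column names illustrative, not the full schema).
df = pd.DataFrame({
    "gender":         ["Male", "Female"] * 1500,
    "tenure":         list(range(3000)),
    "MonthlyCharges": [29.85 + 0.01 * i for i in range(3000)],
    "Churn":          ["Yes", "No"] * 1500,
})

# The study trains only on the first 1000 rows of the dataset.
subset = df.iloc[:1000]
print(subset.shape)  # (1000, 4)
```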
5 Methodology
To begin with, there are many steps before building ML models. The different steps
are data preparation, EDA, data cleaning, feature selection, and finally building the
desired model. The data preparation step includes checking for the completeness of
the dataset, looking at the dimensions, reviewing the structure of the data, and
examining any missing client data. Python is used as it provides many robust
libraries: Pandas, which has tools for data analysis, cleansing, exploration, and
manipulation; Matplotlib, which makes easy things easy and hard things possible
by creating static, animated, and interactive visualizations; and Seaborn, which
helps in making statistical graphs. Figure 1 is a pictorial representation of the
workflow of this work.
After analysis, it is found that 20 attributes affect churn. Of these, factors like
the customer, gender, phone service, and multiple lines, which affect churn the
least, have been removed. Tenure has been removed as well; instead, the customers
have been divided into bins based on tenure, e.g., assigning a tenure group of
1–12 for tenure below 12 months. After a clear picture of the dataset is obtained,
EDA is implemented. It provides a clearer and better picture of data patterns and
potential hypotheses. Different bar graphs are built to see how churn is affected
by the different attributes present in the dataset. Figure 2 is a graphical
representation of the affirmative and negative churn counts.
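The tenure-binning step described above might look as follows in Pandas; only the 1–12 group is given in the paper, so the remaining bin edges are assumptions:

```python
import pandas as pd

df = pd.DataFrame({"tenure": [3, 11, 12, 25, 47, 60, 71]})

# Bin tenure (months) into groups; "1-12" covers tenure below 12 months
# as in the paper, while the remaining edges are illustrative.
bins = [0, 12, 24, 36, 48, 60, 72]
labels = ["1-12", "13-24", "25-36", "37-48", "49-60", "61-72"]
df["tenure_group"] = pd.cut(df["tenure"], bins=bins, labels=labels)

# The raw tenure column is dropped once the groups are assigned.
df = df.drop(columns=["tenure"])

print(df["tenure_group"].tolist())
```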
6 Implementation, Results, and Discussions
It is visible that the customers discontinuing number fewer than 300, while those
continuing have a count of more than 700. It can be inferred that the discontinuing
customers are less than half of the others. Figure 3 is a plot of monthly charges
versus total charges in the telecom sector.
Charges play an important role in the telecom sector. When choosing a good
product, the customer’s main need is to get maximum benefit at the best price, and
the best price means a lower price. So, charges play an important role in predicting
whether the customer will stay or leave. The graph shows a comparison between
the monthly charges and the total charges; the outcome shows that as the monthly
charges increase, the total charges also increase, which means that the total charge
is directly proportional to the monthly charge, while the churn rate is inversely
proportional to
Fig. 1 Workflow: data collection → pre-processing of data → exploratory data analysis (EDA) → training/testing dataset split → machine learning models → result analysis → data evaluation and visualization
Fig. 2 Churn count in terms of ‘Yes’ and ‘No’
Fig. 3 Monthly charge
versus total charge
total charges. Figures 4a and b show graphs that have been plotted to check churn
both by monthly charges and by total charges.
From Fig. 4a, it can be inferred that churn is high when the monthly charge is high,
whereas from Fig. 4b it can be seen that the lower the total charges, the higher the
churn. But the picture becomes clearer if the insights of the three characteristics are
analyzed, i.e., tenure, monthly charges, and total charges. A lower total charge
results from a higher monthly charge paid over a shorter tenure. A correlation of ‘churn’ is constructed with
all the other attributes in Fig. 5, to get a clearer picture of how all other parameters
affect the churn.
The above insight shows:
(1) A higher churn in the case of month-to-month contracts, no tech support, no
online security, fiber optics Internet, and the first year of subscription.
(2) Long-term agreements, Internet-only subscriptions, and clients with a 5-year
minimum retention rate all exhibit low churn.
Data processing is another important step in research. As many columns have yes
and no categorical values, data transformation and normalization must be conducted
to transform them into 0 and 1. Several columns contain more than two categories
that need to be converted from categorical data to numerical data.
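The transformations described above, mapping yes/no columns to 1/0 and expanding multi-category columns into numerical dummy columns, can be sketched with Pandas (the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "PaperlessBilling": ["Yes", "No", "Yes"],
    "Contract": ["Month-to-month", "One year", "Two year"],
    "Churn": ["No", "No", "Yes"],
})

# Binary Yes/No columns -> 1/0.
for col in ["PaperlessBilling", "Churn"]:
    df[col] = df[col].map({"Yes": 1, "No": 0})

# Columns with more than two categories -> one dummy column per category.
df = pd.get_dummies(df, columns=["Contract"])

print(sorted(df.columns))
```

After this step every column is numeric, which is what the models in the next section require.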
ML is a part of computer science and artificial intelligence (AI) that focuses
on collecting data and using algorithms to replicate human learning processes and
gradually boost performance. Various ML models used are:
(1) SVM—This supervised learning model is used to categorize data or forecast the
behavior of datasets. Supervised learning systems are provided with labeled
input and expected output data for classification. Figure 6 is the
confusion matrix created through the SVM model.
(2) KNN—It is a nonparametric classifier that employs proximity to classify or
anticipate grouping a single data point. KNN calculates the separations between
Fig. 4 a Monthly charges by churn; b Total charges by churn
a query and each data example, selects the specified number of examples (K)
closest to the query, and votes for the most frequent label (for classification)
or averages the labels (for regression). Figure 7 is the confusion matrix created
through the KNN model.
(3) XGBoost—It is a supervised learning method used for classification and regression.
It sequentially builds shallow decision trees to obtain reliable answers, and a
highly efficient training method limits overfitting. Figure 8
is the confusion matrix created through the XGBoost model.
Fig. 5 Parameters affecting churn
Fig. 6 SVM confusion
matrix
(4) Decision tree—It is an ML method that classifies or forecasts data based on the
answers to a series of earlier questions. A dataset with the desired classification
is used to train and test the model. Figure 9 is the confusion matrix created
through the decision tree model.
(5) Random forest—This ML algorithm is applied for supervised learning. It inte-
grates the results of several decision trees to get a single result. Given that it can
address classification and regression problems, it is widely used. Figure 10 is
the confusion matrix created through the random forest model.
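A hedged sketch of training and comparing these models with scikit-learn on synthetic stand-in data is shown below; XGBoost is omitted because it requires the separate xgboost package, and all hyperparameters are library defaults rather than the paper’s settings:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the preprocessed churn features.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(random_state=0),
}

# Fit each model on the training split and score accuracy on the test split.
accs = {}
for name, model in models.items():
    accs[name] = model.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name}: {accs[name]:.3f}")
```

On the real dataset, the same loop would report the per-model accuracies that the paper compares in Fig. 12.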
Fig. 7 KNN confusion
matrix
Fig. 8 XGBoost confusion
matrix
The confusion matrix, as depicted in Fig. 11, is used to provide a visual comparison
and evaluate the performance of all models. The confusion matrix depicts the number
of true positives (TP), true negatives (TN), false positives (FP), and false negatives
(FN) present in a prediction.
Accuracy is determined by the ratio of correctly predicted observations to all
observations. Therefore, it is concluded that the best model is the one with the highest
accuracy. The formula for calculating the accuracy of any model is depicted in Eq. (1).

Accuracy = (True Positives + True Negatives) / (True Positives + False Positives + True Negatives + False Negatives). (1)
Fig. 9 Decision tree
confusion matrix
Fig. 10 Random forest
confusion matrix
Fig. 11 Confusion matrix
Fig. 12 Comparison of accuracies
Precision is the proportion of correctly predicted positive observations to all
predicted positive observations. Precision is related to a low false positive rate.
Precision can be calculated using Eq. (2).

Precision = True Positives / (True Positives + False Positives). (2)
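Equations (1) and (2) can be verified directly from confusion-matrix counts; the counts below are illustrative and not taken from the paper’s figures:

```python
def accuracy(tp, tn, fp, fn):
    # Eq. (1): correct predictions over all observations.
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    # Eq. (2): correct positive predictions over all predicted positives.
    return tp / (tp + fp)

# Illustrative confusion-matrix counts.
tp, tn, fp, fn = 50, 114, 16, 20
print(accuracy(tp, tn, fp, fn))  # 0.82
print(precision(tp, fp))
```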
The percentage accuracies of each of the models used in the study as calculated
by Eq. (1) mentioned above can be plotted and compared as shown in Fig. 12.
From Fig. 12 above, it can be inferred that out of the five models used, KNN
has the least accuracy at 76%, whereas RF leads the rest of the models with an
accuracy of 82%. The other models, namely the SVM, decision tree, and XGBoost
models, showed accuracies of 81.5%, 81.5%, and 79.5%, respectively.
Comparison with existing work

Existing methods | Accuracy (%) | Proposed methods | Accuracy (%)
ANN | 79 | SVM | 81.5
Decision tree | 80 | Decision tree | 81.5
Naïve Bayes | 75 | XGB | 79.5
KNN | 75 | KNN | 76
Random forest | 81 | Random forest | 82
7 Limitations, Conclusion, and Future Scope
The dataset used here could be larger, since more data generally means better
accuracy and better predictions. Other machine learning models could also be used
to obtain better accuracy.
As the telecommunication industry grows, so does the problem of customer
churn. It can be controlled by proper analysis. The paper outlines the issue of churn
and the importance of preventing it. Various tasks on the dataset have been performed
such as data cleaning, data processing, and data explanation by performing EDA. The
paper proposes various analyses with the help of certain ML models. After applying
various ML models to the dataset, it can be concluded that the best accuracy was
seen in random forest, while the lowest accuracy was in XGBoost and KNN. The
best precision obtained is with RF, with an accuracy of 82%, and the least precise
was from KNN with an accuracy of 76%.
Going further, the performance can be enhanced by using different ML models
which will help in increasing the accuracy and therefore providing a better solution
for the problem of customer churn. It can also be extended by collecting data from
different sectors and implementing techniques to get a better result.
Author Index
A
Abhinandan Singla, 461
Abir El Akhdar, 329
Abu Bakar bin Abdul Hamid, 217
Abuzar Sayeed, 559
Aditi Sharma, 449
Aditya Bhardwaj, 551
Ahmed Sajjad Khan, 489
Ajay Dureja, 515
Akshi Kumar, 449
Alaa Alakailah, 531
Alexander Gelbukh, 411
Alexandros Chrysikos, 57
Ali Kartit, 329
Alyaa A. Abbas, 263
Aman Dureja, 515
Amer Hamzah bin Jantan, 217
Amit Doegar, 303
Amit Pratap Singh, 551
Amol Potgantwar, 189
Anand Singh Rajawat, 189, 203, 233
Anil Kumar, 345
Anjana Gosain, 1
Ankita Sharma, 23
Anumolu Bindu Sai, 163
Anurag Tuteja, 515
Arun Kumar Yadav, 585
Asaad N. Hashim, 273
Ashish Khanna, 287
Ashok Kumar Munnangi, 345
Ashok Kumar Sahoo, 655,677
Asya Katanani, 361
Avantika Goyal, 11
Avula Srinivasa Ajay Babu, 151
B
Badisa Bhavana, 137
Bal Virdee, 361,421
Bhagyashree, S. R., 245
C
Chafik Baidada, 329
Charu Saxena, 473
Chiranjath Sshakthi, M. A., 377
D
Dessislava Petrova-Antonova, 71
Devineni Vijaya Sri, 163
Dhachina Moorthy, T. S., 105
Dharini, A., 95
Dinesh Singh, 575
Divakar Yadav, 585
F
Fahimeh Jafari, 81
Fatima M. Khudair, 273
G
Gagandeep Kaur, 617
Gerald Manju, J., 95
Goyal, S. B., 189, 203, 217, 233
Gudimetla Abhishek, 137
Gurpreet Singh, 643
H
Harishankar Kumar, 461
© The Editor(s) (if applicable) and The Author(s), under exclusive license
to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1
Harkiran Kaur, 461
Hrushikesh, S., 377
Humera Ghani, 421
I
Itu Snigdh, 603
Ivaylo Spasov, 71
J
Jagendra Singh, 627
Jameel Ahamed, 287
Jaspreeti Singh, 1
Jatin Sachdeva, 655
Jitender Kumar, 11
Jitendra Kumar Baroliya, 303
K
Karanam Manjusha, 163
Kartik, N., 389
Kiruthika, B., 95
Konika Abid, 551
L
Lekha Rani, 677
Liangxiu Han, 449
M
Mahalakshmi, R., 389
Malini, A., 95, 377
Manikandan Parasuraman, 345
Manikandan Ramachandran, 345
Manorama, 603
Masri bin Abdul Lasi, 203, 217, 233
Maya Rathore, 399
Md Mahtab Alam, 39
Mitra Saeedi, 81
Mohammad Al-Fawa’reh, 531
Mohammad Hossein Amirhosseini, 81
Mohammad Nasiruddin, 489
Mohammed Jameel Alsalhy, 273
Mohit Rohilla, 11
Mumtaz Ahmed, 39
Mustafa Al-Fayoumi, 531
N
Nadiya Zafar, 287
Narindi Sai Priya, 151
Neal Bamford, 57
Neetu Mittal, 411
Neha Gaud, 399
Neha Saini, 575
Nevetha, B., 105
Nimalan, N., 105
Nisha, 439
Nitigya Vasudev, 643
Nurun Najah binti Tarmidzi, 217
P
Pallav Jain, 515
Paluck Arora, 317
Peddiboyina Hema Harini, 137
Piyush Pant, 203
Pooja, 575
Prachi Chaudhary, 439
Pradeepta Kumar Sarangi, 655,677
Pragati Choudhari, 189
Prashanth Sontakke, 81
Prateek Saini, 11,643
Pravin Gundalwar, 233
Priyam Srivastava, 559
Priya Sharma, 551
Q
Qasem Abu Al-Haija, 531
R
Rajendra Sinha, 203
Rajesh Mehta, 317
Rajesh Shrivastava, 617
Ramesh Sekaran, 345
Ram Kumar Solanki, 233
Ravi Ranjan, 449
Ritika Kumari, 1
Rohan Sahai Mathur, 121
Rohit Ahuja, 317
Roop Singh Meena, 175
Ruchi Sharma, 11
Rudra Tiwari, 655
S
Saida, S. K., 151
Sai Swetha, P., 377
Sakshi Gupta, 603
Sandra Fernando, 361
Sanjay Kumar Dubey, 121
Shachi Mall, 627
Shahram Salekzamankhani, 421
Shaik Nazeer, 501
Shaily Jain, 287
Shalu, 575
Shambhavi Mishra, 559
Shano Solanki, 175
Sheetal Garg, 245
Siddharth Arora, 11
Sivaram Rajeyyagari, 345
Snehlata Sheoran, 411
Sonam Gupta, 585
Sophia Lazarova, 71
Sridevi, S., 105
Srinivasa Rao, B., 137,501
Sumit Bathla, 515
Syed Irfan Ali, 489
Syed Mohammad Ali, 489
T
Tamanna Kewal, 473
Tanveer Ahmed, 559
Tejasvi Singhal, 643
Tushar Bansal, 121
U
Udayan Ghose, 23
Ugrasen Suman, 399
Umesh Gupta, 551, 559, 617
Utkarsh Dixit, 585
Utkarsh Garg, 515
V
Vaibhav Sharma, 677
Valluri Anand, 163
Vankalapati Nanda Gopal, 501
Varun Gupta, 121
Vatala Akash, 501
Venkatesh, K. A., 389
Victor Sowinski Mydlarz, 361
Vipul Mishra, 559
X
Xiao ShiXiao, 189
Y
Yanduru Yamini Snehitha, 151
Yash Khare, 121
Z
Zahraa Maan Sallal, 263
Zeeshan Ali, 287
Article
In this research, a novel navigation method for self-driving vehicles that avoids collisions with pedestrians and ad hoc obstacles is described. The proposed approach predicts the locations of ad hoc obstacles and wandering pedestrians using an RGB-D depth sensor, and unique ad-hoc-obstacle-aware mobility rules are presented that account for these environmental uncertainties. A Deep Reinforcement Learning (DRL) method is proposed as the decision-making technique that steers the self-driving vehicle to the target without incident. The deep Q-network (DQN), double deep Q-network (DDQN), and dueling double deep Q-network (D3DQN) algorithms were compared, and D3DQN accumulated the fewest negative rewards. The algorithms were tested in the Carla simulation environment to examine the input values from RGB-D and RGB-Lidar sensors, and the convolutional-network-based D3DQN was consequently selected as the optimum DRL model. In modeling slow-moving urban traffic, RGB-D and RGB-Lidar produced essentially the same results. A modified self-driving version of a child's ride-on car was built to demonstrate the real-time effectiveness of the proposed algorithm.
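The DQN variants compared above all learn from the same kind of temporal-difference target; the double-DQN refinement has the online network select the next action while the target network evaluates it, which reduces overestimation. A minimal sketch of that backup rule, with toy Q-values standing in for the two networks (all values hypothetical):

```python
def double_dqn_target(reward, gamma, q_online_next, q_target_next, done):
    """Double-DQN backup: the online network picks the greedy next action,
    and the target network supplies that action's value estimate."""
    if done:
        return reward  # terminal transition: no bootstrapped future value
    # Action selection uses the online network's next-state Q-values...
    a_star = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    # ...but the value of that action comes from the target network.
    return reward + gamma * q_target_next[a_star]

# Toy next-state Q-values from the two (hypothetical) networks:
target = double_dqn_target(1.0, 0.9, [1.0, 3.0, 2.0], [0.5, 1.5, 4.0], False)
print(target)  # online net picks action 1, so target = 1.0 + 0.9 * 1.5 = 2.35
```

A plain DQN would instead take the maximum of `q_target_next` directly (here 4.0), illustrating the overestimation that the double-DQN and dueling D3DQN variants are designed to curb.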
Article
Authentication and authorization are the basis of security for all the technologies present in the world today: from a smartphone, where a user must authenticate before accessing the data inside, to the White House, where one must authenticate and is authorized accordingly. In a digital world where every business, MNC, government body, company, and user needs a website to announce their presence on the internet, provide services online, and become a "brand", the risk of leaking users' sensitive information increases. A hacked website is dangerous to its users because sensitive information such as credit card and bank account details can be sold on the black market of the "dark web". The paper describes the role of the dark web, how data is sold there, and what becomes of it. It helps the reader understand how to develop a secure website that keeps users' sensitive information safe, strengthening the bond of trust between client and server and resulting in a long-term relationship. The aim of developing an authentication system is to keep users' sensitive information safe so that hackers cannot steal and sell it on the dark web's black market; to do this, the developer needs to understand how to implement authentication. NodeJS, with the help of its framework expressJS and some other packages, is used by the research to develop the website's authentication and authorization system. Previous papers in this field covered authentication only in general; this paper goes deeper and is server-side-language specific. The common authentication methods used in different types of websites are discussed in detail, and the best methods are proposed for developers to implement for a more secure website. The research also puts light on artificial intelligence and blockchain as the future of big-data security.
Chapter
The world is moving toward a new digitalized era powered by some of the most potent technologies ever developed in human history, advancements that enable people to make things that in the past were merely the stuff of fairy tales. The model put forward by this research incorporates several of the newest and most powerful technologies of the decade: it integrates the 5G network with the industrial internet of things, based on machine learning, to develop an intelligent machine capable of mimicking humans. A system of this power is extremely susceptible to hacking, cyberattacks, and similar threats; this problem is solved with blockchain. Since blockchain offers a decentralized approach to maintaining transparency, the research incorporates it into the model to make it more efficient and secure. IoT with blockchain has been the subject of other studies, but this study is an enhanced version that also combines industrial IoT with AI to create an intelligent internet of things.
Chapter
The internet of nano things (IoNT) is growing at an exponential rate due to a growing population and ever more communication between devices in networks, sensors, actuators, and so on. This growth shows up along many dimensions: volume, velocity, variety, veracity, and value. Extracting important information and insights from such data is a demanding and critical task. Multi-criteria decision making, which reaches a conclusion based on a number of different criteria, is one of the most important ways to solve such a problem, since it helps choose the best solution from among several options. AI-enabled algorithms and evaluations based on multiple criteria are useful on big data sets and are applied during the deduction process. Because the approach works well and has great potential, it is used in many different areas, such as computer science and information technology, agriculture, and business.