Deep Q-Learning for Virtual Autonomous Automobile

Abstract

Deep Q-Learning (DQN) is the reinforcement learning algorithm proposed in this research for developing autonomous automobiles. The research uses modern technologies and libraries to develop a virtual autonomous automobile. The proposed model is implemented using a neural network that takes the state “S” as the input vector x and forecasts the next potential action “a” that, according to the state-action value function, will be the most profitable. In the virtual environment developed for the research, the automobile, which is the agent, initially moves randomly and takes random actions continuously. These experiences are stored and used to train the neural network, with the dataset split in a 60–20–20% ratio. After this phase of random exploration and training, the agent is able to learn to drive on its own. This is achieved by rewarding the agent with +a for a correct or expected action and penalizing it with −p for a wrong or unexpected action. By doing so, the agent learns to stay in the lane and avoid obstacles. The research is fully software-based and virtual, so no hardware is required other than a computer. The research also reviews reinforcement learning and the DQN algorithm to enhance the reader’s understanding of this domain of AI.
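The loop the abstract describes (random exploration, stored experiences, and a reward of +a or penalty of −p fed into a Q-value update) can be sketched as follows. This is a minimal illustrative sketch, not the chapter's actual implementation: the `ReplayBuffer` and `LinearQNet` names, the layer shapes, and the learning rate are assumptions, and a linear approximator stands in for the paper's neural network.

```python
import random
from collections import deque

import numpy as np


class ReplayBuffer:
    """Stores (state, action, reward, next_state, done) transitions
    collected while the agent takes random actions in the environment."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(list(self.buffer), batch_size)


class LinearQNet:
    """Linear stand-in for the chapter's neural network: maps a state
    vector x to one Q-value per action."""

    def __init__(self, state_dim, n_actions, lr=0.05):
        self.W = np.zeros((state_dim, n_actions))
        self.lr = lr

    def q_values(self, state):
        return state @ self.W

    def act(self, state, epsilon):
        # Epsilon-greedy: random action while exploring, greedy otherwise.
        if random.random() < epsilon:
            return random.randrange(self.W.shape[1])
        return int(np.argmax(self.q_values(state)))

    def train_step(self, batch, gamma=0.99):
        # Move Q(s, a) toward the Bellman target
        # y = r + gamma * max_a' Q(s', a'), or y = r when the episode ends.
        for s, a, r, s2, done in batch:
            target = r if done else r + gamma * np.max(self.q_values(s2))
            td_error = target - self.q_values(s)[a]
            self.W[:, a] += self.lr * td_error * s
```

Here a reward of +1 stands in for the paper's +a and −1 for its −p; the abstract does not specify the actual reward magnitudes, network architecture, or exploration schedule.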
Lecture Notes in Networks and Systems 785

Abhishek Swaroop
Zdzislaw Polkowski
Sérgio Duarte Correia
Bal Virdee
Editors

Proceedings of Data
Analytics and Management
ICDAM 2023, Volume 1
Lecture Notes in Networks and Systems
Volume 785
Series Editor
Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences,
Warsaw, Poland
Advisory Editors
Fernando Gomide, Department of Computer Engineering and Automation—DCA,
School of Electrical and Computer Engineering—FEEC, University of
Campinas—UNICAMP, São Paulo, Brazil
Okyay Kaynak, Department of Electrical and Electronic Engineering,
Bogazici University, Istanbul, Türkiye
Derong Liu, Department of Electrical and Computer Engineering, University of
Illinois at Chicago, Chicago, USA
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Witold Pedrycz, Department of Electrical and Computer Engineering, University of
Alberta, Alberta, Canada
Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
Marios M. Polycarpou, Department of Electrical and Computer Engineering,
KIOS Research Center for Intelligent Systems and Networks, University of Cyprus,
Nicosia, Cyprus
Imre J. Rudas, Óbuda University, Budapest, Hungary
Jun Wang, Department of Computer Science, City University of Hong Kong,
Kowloon, Hong Kong
The series “Lecture Notes in Networks and Systems” publishes the latest
developments in Networks and Systems—quickly, informally and with high quality.
Original research reported in proceedings and post-proceedings represents the core
of LNNS.
Volumes published in LNNS embrace all aspects and subfields of, as well as new
challenges in, Networks and Systems.
The series contains proceedings and edited volumes in systems and networks,
spanning the areas of Cyber-Physical Systems, Autonomous Systems, Sensor
Networks, Control Systems, Energy Systems, Automotive Systems, Biological
Systems, Vehicular Networking and Connected Vehicles, Aerospace Systems,
Automation, Manufacturing, Smart Grids, Nonlinear Systems, Power Systems,
Robotics, Social Systems, Economic Systems, and others. Of particular value to
both the contributors and the readership are the short publication timeframe and
the world-wide distribution and exposure which enable both a wide and rapid
dissemination of research output.
The series covers the theory, applications, and perspectives on the state of the art
and future developments relevant to systems and networks, decision making, control,
complex processes and related areas, as embedded in the fields of interdisciplinary
and applied sciences, engineering, computer science, physics, economics, social, and
life sciences, as well as the paradigms and methodologies behind them.
Indexed by SCOPUS, INSPEC, WTI Frankfurt eG, zbMATH, SCImago.
All books published in the series are submitted for consideration in Web of Science.
For proposals from Asia please contact Aninda Bose (aninda.bose@springer.com).
Abhishek Swaroop · Zdzislaw Polkowski ·
Sérgio Duarte Correia · Bal Virdee
Editors
Proceedings of Data
Analytics and Management
ICDAM 2023, Volume 1
Editors
Abhishek Swaroop
Department of Information Technology
Bhagwan Parshuram Institute
of Technology
New Delhi, Delhi, India
Sérgio Duarte Correia
Polytechnic Institute of Portalegre
Portalegre, Portugal
Zdzislaw Polkowski
Jan Wyzykowski University
Polkowice, Poland
Bal Virdee
Centre for Communications Technology
London Metropolitan University
London, UK
ISSN 2367-3370 ISSN 2367-3389 (electronic)
Lecture Notes in Networks and Systems
ISBN 978-981-99-6543-4 ISBN 978-981-99-6544-1 (eBook)
https://doi.org/10.1007/978-981-99-6544-1
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Singapore Pte Ltd. 2024
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Paper in this product is recyclable.
ICDAM-2023 Steering Committee Members
Patrons
Prof. (Dr.) Don MacRaild, Pro-Vice Chancellor, London Metropolitan University,
London
Prof. (Dr.) Wioletta Palczewska, Rector, The Karkonosze State University of Applied
Sciences in Jelenia Góra, Poland
Prof. (Dr.) Beata Tel˛zka, Vice-Rector, The Karkonosze State University of Applied
Sciences in Jelenia Góra
General Chairs
Prof. Dr. Janusz Kacprzyk, Polish Academy of Sciences, Systems Research Institute,
Poland
Prof. Dr. Karim Ouazzane, London Metropolitan University, London
Prof. Dr. Bal Virdee, London Metropolitan University, London
Prof. Cesare Alippi, Polytechnic University of Milan, Italy
Honorary Chairs
Prof. Dr. Aboul Ella Hassanien, Cairo University, Egypt
Prof. Dr. Vaclav Snasel, Rector, VSB-Technical University of Ostrava,
Czech Republic
Prof. Chris Lane, London Metropolitan University, London
Conference Chairs
Prof. Dr. Vassil Vassilev, London Metropolitan University, London
Dr. Pancham Shukla, Imperial College London, London
Prof. Dr. Mak Sharma, Birmingham City University, London
Dr. Shikun Zhou, University of Portsmouth
Dr. Magdalena Baczyńska, Dean, The Karkonosze State University of Applied
Sciences in Jelenia Góra, Poland
Dr. Zdzislaw Polkowski, Adjunct Professor KPSW, The Karkonosze State University
of Applied Sciences in Jelenia Góra
Prof. Dr. Abhishek Swaroop, Bhagwan Parshuram Institute of Technology, Delhi,
India
Prof. Dr. Anil K. Ahlawat, Dean, KIET Group of Institutes, India
Technical Program Chairs
Dr. Shahram Salekzamankhani, London Metropolitan University, London
Dr. Mohammad Hossein Amirhosseini, University of East London, London
Dr. Sandra Fernando, London Metropolitan University, London
Dr. Qicheng Yu, London Metropolitan University, London
Prof. Joel J. P. C. Rodrigues, Federal University of Piauí (UFPI), Teresina—PI, Brazil
Dr. Ali Kashif Bashir, Manchester Metropolitan University, UK
Dr. Rajkumar Singh Rathore, Cardiff Metropolitan University, UK
Conveners
Dr. Ashish Khanna, Maharaja Agrasen Institute of Technology (GGSIPU),
New Delhi, India
Dr. Deepak Gupta, Maharaja Agrasen Institute of Technology (GGSIPU), New Delhi,
India
Publicity Chairs
Dr. Józef Zaprucki, Prof. KPSW, Rector’s Proxy for Foreign Affairs, The Karkonosze
State University of Applied Sciences in Jelenia Góra
Dr. Umesh Gupta, Bennett University, India
Dr. Puneet Sharma, Assistant Professor, Amity University, Noida
Dr. Deepak Arora, Professor and Head (CSE), Amity University, Lucknow Campus
João Matos-Carvalho, Lusófona University, Portugal
Co-conveners
Mr. Moolchand Sharma, Maharaja Agrasen Institute of Technology, India
Dr. Richa Sharma, London Metropolitan University, London
Preface
We are delighted to announce that London Metropolitan University,
London, in collaboration with The Karkonosze University of Applied Sciences,
Poland, Politécnico de Portalegre, Portugal, and Bhagwan Parshuram Institute of
Technology, India, has hosted the eagerly awaited and much coveted International
Conference on Data Analytics and Management (ICDAM-2023). The fourth edition
of the conference attracted a diverse range of engineering practitioners,
academicians, scholars, and industry delegates, receiving abstracts from more
than 7000 authors from different parts of the world. The committee of
professionals dedicated to the conference strove to achieve a high-quality
technical program with tracks on data analytics, data management, big data,
computational intelligence, and communication networks. All the tracks chosen
for the conference are interrelated and are very popular among the present-day
research community; consequently, a great deal of research is being carried out
in these tracks and their related sub-areas. More than 1200 full-length papers were received,
among which the contributions are focused on theoretical, computer simulation-
based research, and laboratory-scale experiments. Among these manuscripts, 190
papers have been included in the Springer proceedings after a thorough two-stage
review and editing process. All the manuscripts submitted to the ICDAM-2023 were
peer-reviewed by at least two independent reviewers, who were provided with a
detailed review pro forma. The comments from the reviewers were communicated
to the authors, who incorporated the suggestions in their revised manuscripts. The
recommendations from two reviewers were taken into consideration while selecting a
manuscript for inclusion in the proceedings. The exhaustiveness of the review process
is evident, given the large number of articles received addressing a wide range of
research areas. The stringent review process ensured that each published manuscript
met the rigorous academic and scientific standards. It is an exalting experience to
finally see these elite contributions materialize into the four book volumes as ICDAM
proceedings by Springer entitled “Proceedings of Data Analytics and Management:
ICDAM-2023”.
ICDAM-2023 invited four keynote speakers, who are eminent researchers in the
field of computer science and engineering, from different parts of the world. In
addition to the plenary sessions on each day of the conference, 17 concurrent technical
sessions were held every day to ensure the oral presentation of around 190 accepted
papers. Keynote speakers and session chair(s) for each of the concurrent sessions
were leading researchers from the thematic area of the session. The delegates
were provided with a book of extended abstracts so that they could quickly browse
through the contents, follow the presentations, and share the work with a broad
audience. The research part of the conference was organized in a total of 22 special
sessions. These special sessions provided the opportunity for researchers conducting
research in specific areas to present their results in a more focused environment.
An international conference of such magnitude and release of the ICDAM-2023
proceedings by Springer has been the remarkable outcome of the untiring efforts
of the entire organizing team. The success of an event undoubtedly involves the
painstaking efforts of several contributors at different stages, dictated by their devo-
tion and sincerity. Fortunately, since the beginning of its journey, ICDAM-2023 has
received support and contributions from every corner. We thank them all who have
wished the best for ICDAM-2023 and contributed by any means toward its success.
The edited proceedings volumes by Springer would not have been possible without
the perseverance of all the steering, advisory, and technical program committee
members.
The organizers of ICDAM-2023 owe thanks to all the contributing authors
for their interest and exceptional articles. We would also like to thank the authors
of the papers for adhering to the time schedule and for incorporating the review
comments. We wish to extend our heartfelt acknowledgment to the authors, peer-
reviewers, committee members, and production staff whose diligent work gave shape
to the ICDAM-2023 proceedings. We especially want to thank our dedicated team of
peer-reviewers who volunteered for the arduous and tedious task of quality checking
and critiquing the submitted manuscripts. We wish to thank our faculty colleague
Mr. Moolchand Sharma for his enormous assistance during the conference.
The time he spent, and the midnight oil he burnt, are greatly appreciated,
for which we will ever remain indebted. The management, faculty, administrative,
and support staff of the college have always extended their services whenever
needed, for which we remain thankful to them.
Lastly, we would like to thank Springer for accepting our proposal for publishing
the ICDAM-2023 conference proceedings. The help received from Mr. Aninda Bose,
Senior Acquisitions Editor, throughout the process has been very useful.
New Delhi, India
Polkowice, Poland
Portalegre, Portugal
London, UK
Abhishek Swaroop
Zdzislaw Polkowski
Sérgio Duarte Correia
Bal Virdee
Contents
Diagnosis of Parkinson Disease Using Ensemble Methods for Class
Imbalance Problem ................................................ 1
Ritika Kumari, Jaspreeti Singh, and Anjana Gosain
A Comparative Analysis of Pneumonia Detection Using Chest
X-rays with DNN .................................................. 11
Prateek Jha, Mohit Rohilla, Avantika Goyal, Siddharth Arora,
Ruchi Sharma, and Jitender Kumar
Machine Learning-Based Binary Sentiment Classification of Movie
Reviews in Hindi (Devanagari Script) ............................... 23
Ankita Sharma and Udayan Ghose
Deep Learning-Based Recommendation Systems: Review
and Critical Analysis .............................................. 39
Md Mahtab Alam and Mumtaz Ahmed
Retention in Second Year Computing Students in a London-Based
University During the Post-COVID-19 Era Using Learned
Optimism as a Lens: A Statistical Analysis in R ...................... 57
Alexandros Chrysikos and Neal Bamford
Alzheimer’s Disease Knowledge Graph Based on Ontology
and Neo4j Graph Database ......................................... 71
Ivaylo Spasov, Sophia Lazarova, and Dessislava Petrova-Antonova
Forecasting Bitcoin Prices in the Context of the COVID-19
Pandemic Using Machine Learning Approaches ...................... 81
Prashanth Sontakke, Fahimeh Jafari, Mitra Saeedi,
and Mohammad Hossein Amirhosseini
Online Food Delivery Customer Churn Prediction: A Quantitative
Analysis on the Performance of Machine Learning Classifiers ......... 95
J. Gerald Manju, A. Dharini, B. Kiruthika, and A. Malini
Prevention Equipment for COVID-19 Spread Using IoT
and Multimedia-Based Solutions .................................... 105
T. S. Dhachina Moorthy, N. Nimalan, S. Sridevi, and B. Nevetha
Renal Disease Classification Using Image Processing .................. 121
Rohan Sahai Mathur, Varun Gupta, Tushar Bansal, Yash Khare,
and Sanjay Kumar Dubey
Identification of Fake Users on Social Networks and Detection
of Spammers ...................................................... 137
B. Srinivasa Rao, Badisa Bhavana, Gudimetla Abhishek,
and Peddiboyina Hema Harini
A Effective Method for Predicting the Dyslexia by Applying
Ensemble Technique ............................................... 151
S. K. Saida, Yanduru Yamini Snehitha, Narindi Sai Priya,
and Avula Srinivasa Ajay Babu
Identifying Suicidal Risk: A Text Classification Study for Early
Detection ......................................................... 163
Devineni Vijaya Sri, Anumolu Bindu Sai, Valluri Anand,
and Karanam Manjusha
Citrus Plant Leaves Disease Detection Using CNN and LVQ
Algorithm ........................................................ 175
Roop Singh Meena and Shano Solanki
Longevity Recommendation for Root Canal Treatment ............... 189
Pragati Choudhari, Anand Singh Rajawat, S. B. Goyal, Xiao ShiXiao,
and Amol Potgantwar
Deep Q-Learning for Virtual Autonomous Automobile ................ 203
Piyush Pant, Rajendra Sinha, Anand Singh Rajawat, S. B. Goyal,
and Masri bin Abdul Lasi
Improving Digital Marketing Using Sentiment Analysis with Deep
LSTM ............................................................ 217
Masri bin Abdul Lasi, Abu Bakar bin Abdul Hamid,
Amer Hamzah bin Jantan, S. B. Goyal, and Nurun Najah binti Tarmidzi
5G Enabled IoT-Based DL with BC Model for Secured Home Door
System ........................................................... 233
S. B. Goyal, Anand Singh Rajawat, Pravin Gundalwar,
Ram Kumar Solanki, and Masri bin Abdul Lasi
Improving Efficiency of Spinal Cord Image Segmentation Using
Transfer Learning Inspired Mask Region-Based Augmented
Convolutional Neural Network ..................................... 245
Sheetal Garg and S. R. Bhagyashree
Neurological Disease Prediction Based on EEG Signals Using
Machine Learning Approaches ..................................... 263
Zahraa Maan Sallal and Alyaa A. Abbas
Watermarking System Using DWT and SVD ......................... 273
Fatima M. Khudair, Asaad N. Hashim, and Mohammed Jameel Alsalhy
Safeguarding IoT: Harnessing Practical Byzantine Fault Tolerance
for Robust Security ................................................ 287
Nadiya Zafar, Ashish Khanna, Shaily Jain, Zeeshan Ali,
and Jameel Ahamed
Human Body Poses Detection and Estimation Using Convolutional
Neural Network ................................................... 303
Jitendra Kumar Baroliya and Amit Doegar
A Novel Image Alignment Technique Leveraging Teaching
Learning-Based Optimization for Medical Images .................... 317
Paluck Arora, Rajesh Mehta, and Rohit Ahuja
Study of Cyber Threats in IoT Systems .............................. 329
Abir El Akhdar, Chafik Baidada, and Ali Kartit
Generic Sentimental Analysis in Web Data Recommendation
Based on Social Media Scalable Data Analytics Using Machine
Learning Architecture ............................................. 345
Ramesh Sekaran, Sivaram Rajeyyagari, Ashok Kumar Munnangi,
Manikandan Parasuraman, Manikandan Ramachandran, and Anil Kumar
Cloud Spark Cluster to Analyse English Prescription Big Data
for NHS Intelligence ............................................... 361
Sandra Fernando, Victor Sowinski Mydlarz, Asya Katanani,
and Bal Virdee
Prediction of Column Average Carbon Dioxide Emission Using
Random Forest Regression ......................................... 377
P. Sai Swetha, M. A. Chiranjath Sshakthi, S. Hrushikesh, and A. Malini
Predicting Students’ Performance Using Feature Selection-Based
Machine Learning Technique ....................................... 389
N. Kartik, R. Mahalakshmi, and K. A. Venkatesh
Hybrid Deep Learning-Based Human Activity Recognition (HAR)
Using Wearable Sensors: An Edge Computing Approach .............. 399
Neha Gaud, Maya Rathore, and Ugrasen Suman
Hybrid Change Detection Technique with Particle Swarm
Optimization for Land Use Land Cover Using Remote-Sensed Data .... 411
Snehlata Sheoran, Neetu Mittal, and Alexander Gelbukh
Critical Analysis of 5G Networks’ Traffic Intrusion Using PCA,
t-SNE, and UMAP Visualization and Classifying Attacks .............. 421
Humera Ghani, Shahram Salekzamankhani, and Bal Virdee
Denoising the Endoscopy Images of the Gastrointestinal Tract
Using Complex-Valued CNN ....................................... 439
Nisha and Prachi Chaudhary
FTL-Emo: Federated Transfer Learning for Privacy Preserved
Biomarker-Based Automatic Emotion Recognition ................... 449
Akshi Kumar, Aditi Sharma, Ravi Ranjan, and Liangxiu Han
Content Analysis of Twitter Conversations Associated
with Turkey–Syria Earthquakes .................................... 461
Harkiran Kaur, Harishankar Kumar, and Abhinandan Singla
Transition from Traditional Insurance Sector to InsurTech:
Systematic Analysis and Future Research Directions .................. 473
Tamanna Kewal and Charu Saxena
Diagnosis of Laryngitis and Cordectomy using Machine Learning
with ML.Net and SVD ............................................. 489
Syed Irfan Ali, Ahmed Sajjad Khan, Syed Mohammad Ali,
and Mohammad Nasiruddin
Speed of Diagnosis for Brain Diseases Using MRI and Convolutional
Neural Networks .................................................. 501
B. Srinivasa Rao, Vankalapati Nanda Gopal, Vatala Akash,
and Shaik Nazeer
Dog Breed Identification Using Deep Learning ....................... 515
Anurag Tuteja, Sumit Bathla, Pallav Jain, Utkarsh Garg, Aman Dureja,
and Ajay Dureja
Towards Detecting Digital Criminal Activities Using File System
Analysis .......................................................... 531
Mustafa Al-Fayoumi, Mohammad Al-Fawa’reh, Qasem Abu Al-Haija,
and Alaa Alakailah
Performance Evaluation of Virtual Machine and Container-Based
Migration Technique ............................................... 551
Aditya Bhardwaj, Amit Pratap Singh, Priya Sharma, Konika Abid,
and Umesh Gupta
Rhetorical Role Detection in Legal Judgements Using Zero-Shot
Learning ......................................................... 559
Shambhavi Mishra, Tanveer Ahmed, Vipul Mishra, Priyam Srivastava,
Abuzar Sayeed, and Umesh Gupta
IoB-Based Intelligent Healthcare System for Disease Diagnosis
in Humans ........................................................ 575
Shalu, Neha Saini, Pooja, and Dinesh Singh
Analyzing the Impact of Extractive Summarization Techniques
on Legal Text ..................................................... 585
Utkarsh Dixit, Sonam Gupta, Arun Kumar Yadav, and Divakar Yadav
An Energy Conserving MANET-LoRa Architecture for Wireless
Body Area Network ................................................ 603
Sakshi Gupta, Manorama, and Itu Snigdh
Blockchain Integration with Internet of Things (IoT)-Based
Systems for Data Security: A Review ................................ 617
Gagandeep Kaur, Rajesh Shrivastava, and Umesh Gupta
Comparative Study of Heart Failure Using the Approach
of Machine Learning and Deep Neural Networks ..................... 627
Shachi Mall and Jagendra Singh
House Price Prediction Using Hybrid Deep Learning Techniques ...... 643
Nitigya Vasudev, Gurpreet Singh, Prateek Saini, and Tejasvi Singhal
Sentiment Analysis Using Machine Learning of Unemployment
Data in India ...................................................... 655
Rudra Tiwari, Jatin Sachdeva, Ashok Kumar Sahoo,
and Pradeepta Kumar Sarangi
Customer Churn in Telecom Sector: Analyzing the Effectiveness
of Machine Learning Techniques .................................... 677
Vaibhav Sharma, Lekha Rani, Ashok Kumar Sahoo,
and Pradeepta Kumar Sarangi
Author Index ...................................................... 693
Editors and Contributors
About the Editors
Prof. (Dr.) Abhishek Swaroop completed his B.Tech. (CSE) from GBP University
of Agriculture and Technology, M.Tech. from Punjabi University Patiala, and Ph.D.
from NIT Kurukshetra. He has industrial experience of 8 years in organizations like
Usha Rectifier Corporations and Envirotech Instruments Pvt. Limited. He has 22
years of teaching experience. He has served in reputed educational institutions such as
Jaypee Institute of Information Technology, Noida, Sharda University Greater Noida,
and Galgotias University Greater Noida. He has served at various administrative
positions such as Head of the Department, Division Chair, NBA Coordinator for the
university, and Head of Training and Placements. Currently, he is serving as Professor
and HoD, Department of Information Technology, Bhagwan Parshuram Institute
of Technology, Rohini, Delhi. He is actively engaged in research and has more
than 60 quality publications, of which eight are SCI-indexed and 16 are Scopus-indexed.
Prof. (Dr.) Zdzislaw Polkowski is Adjunct Professor at Faculty of Technical
Sciences at the Jan Wyzykowski University, Poland. He is also Rector’s Repre-
sentative for International Cooperation and Erasmus Program and Former Dean of
the Technical Sciences Faculty during the period 2009–2012. His area of research
includes management information systems, business informatics, IT in business and
administration, IT security, small and medium enterprises, CC, IoT, big data, business
intelligence, and blockchain. He has published around 60 research articles. He
has served the research community in the capacity of Author, Professor, Reviewer,
Keynote Speaker, and Co-editor. He has attended several international conferences
in the various parts of the world. He is also playing the role of Principal Investigator.
Prof. Sérgio Duarte Correia received his Diploma in Electrical and Computer
Engineering from the University of Coimbra, Portugal, in 2000, the master’s degree in
Industrial Control and Maintenance Systems from Beira Interior University, Covilhã,
Portugal, in 2010, and the Ph.D. in Electrical and Computer Engineering from the
University of Coimbra, Portugal, in 2020. Currently, he is Associate Professor at
the Polytechnic Institute of Portalegre, Portugal. He is Researcher at COPELABS—
Cognitive and People-centric Computing Research Center, Lusófona University of
Humanities and Technologies, Lisbon, Portugal, and Valoriza—Research Center for
Endogenous Resource Valorization, Polytechnic Institute of Portalegre, Portalegre,
Portugal. Over the past 20 years, he has worked with several private companies in the field
of product development and industrial electronics. His current research interests are
artificial intelligence, soft computing, signal processing, and embedded computing.
Prof. Bal Virdee graduated with a B.Sc. (Engineering) Honors in Communication
Engineering and M.Phil. from Leeds University, UK. He obtained his Ph.D. from
University of North London, UK. He worked as an Academic at the Open University
and Leeds University. Prior to this, he was a Research and Development Electronic
Engineer in the Future Products Department at Teledyne Defence (formerly
Filtronic Components Ltd., Shipley, West Yorkshire) and at PYE TVT (Philips) in
Cambridge. He has held numerous duties and responsibilities at the university, i.e.,
Health and Safety Officer, Postgraduate Tutor, Examinations Officer, Admissions
Tutor, Short Course Organizer, and Course Leader for M.Sc./M.Eng. Satellite
Communications, B.Sc. Communications Systems, and B.Sc. Electronics. In 2010, he was
appointed Academic Leader (UG Recruitment). He is a Member of the ethics committee
and a Member of the school’s research committee and research degrees committee.
Contributors
Alyaa A. Abbas General Directorate of Education in Al-Muthana Governorate,
Ministry of Education, Samah, Iraq
Gudimetla Abhishek Lakireddy Bali Reddy College of Engineering, Mylavaram,
Andhra Pradesh, India
Konika Abid Department of CSE, Sharda University, Greater Noida, India
Jameel Ahamed Department of CS&IT, Maulana Azad National Urdu University,
Hyderabad, India
Mumtaz Ahmed Department of Computer Engineering, Jamia Millia Islamia,
New Delhi, India
Tanveer Ahmed Department of CSE, Bennett University, Greater Noida, India
Rohit Ahuja Computer Science and Engineering Department, Thapar Institute of
Engineering and Technology, Patiala, India
Vatala Akash Lakireddy Bali Reddy College of Engineering, Mylavaram,
Andhra Pradesh, India
Abir El Akhdar LTI Laboratory, University of Chouaib Doukkali, National School
of Applied Sciences, El Jadida, Morocco
Mohammad Al-Fawa’reh Computing and Security, Edith Cowan University,
Joondalup, WA, Australia
Mustafa Al-Fayoumi Department of Cybersecurity, Princess Sumaya University
of Technology, Amman, Jordan
Qasem Abu Al-Haija Department of Cybersecurity, Princess Sumaya University
of Technology, Amman, Jordan
Alaa Alakailah Department of Cybersecurity, Princess Sumaya University of
Technology, Amman, Jordan
Md Mahtab Alam Department of Computer Engineering, Jamia Millia Islamia,
New Delhi, India
Syed Irfan Ali Artificial Intelligence and Data Science Engineering, Anjuman
College of Engineering & Technology, Nagpur, India
Syed Mohammad Ali Electronics & Telecommunication Engineering, Anjuman
College of Engineering & Technology, Nagpur, India
Zeeshan Ali University of Glasgow, Glasgow, UK
Mohammed Jameel Alsalhy National University of Science and Technology,
Thi-Qar, Nasiriyah, Iraq
Mohammad Hossein Amirhosseini University of East London, London,
United Kingdom
Valluri Anand Lakireddy Bali Reddy College of Engineering, Mylavaram,
Andhra Pradesh, India
Paluck Arora Computer Science and Engineering Department, Thapar Institute of
Engineering and Technology, Patiala, India
Siddharth Arora Bharati Vidyapeeth’s College of Engineering, New Delhi, India
Avula Srinivasa Ajay Babu Lakireddy Bali Reddy College of Engineering,
Mylavaram, Andhra Pradesh, India
Chafik Baidada LTI Laboratory, University of Chouaib Doukkali, National School
of Applied Sciences, El Jadida, Morocco
Neal Bamford London Metropolitan University, London, UK
Tushar Bansal Amity University, Uttar Pradesh, Noida, India
Jitendra Kumar Baroliya Computer Science and Engineering Department,
NITTTR, Chandigarh, India
Sumit Bathla Department of IT, Bhagwan Parshuram Institute of Technology,
New Delhi, India
S. R. Bhagyashree Department of Electronics and Communication Engineering,
ATME College of Engineering, Mysuru, India
Aditya Bhardwaj School of CSET, Bennett University, Greater Noida, India
Badisa Bhavana Lakireddy Bali Reddy College of Engineering, Mylavaram,
Andhra Pradesh, India
Prachi Chaudhary ECE Department, DCRUST, Murthal, India
Pragati Choudhari Department of Computer Engineering, Indira College of Engi-
neering and Management, Sandip University, Pune, India
Alexandros Chrysikos London Metropolitan University, London, UK
A. Dharini Thiagarajar College of Engineering, Madurai, Tamil Nadu, India
Utkarsh Dixit Ajay Kumar Garg Engineering College, Ghaziabad, India
Amit Doegar Computer Science and Engineering Department, NITTTR,
Chandigarh, India
Sanjay Kumar Dubey Amity University, Uttar Pradesh, Noida, India
Ajay Dureja Department of IT, Bharati Vidyapeeth’s College of Engineering,
New Delhi, India
Aman Dureja Department of IT, Bhagwan Parshuram Institute of Technology,
New Delhi, India
Sandra Fernando Assistive Technology Group, SCDM, London Metropolitan
University, London, UK
Sheetal Garg Department of Electronics and Communication Engineering, ATME
College of Engineering, Mysuru, India
Utkarsh Garg Department of IT, Bhagwan Parshuram Institute of Technology,
New Delhi, India
Neha Gaud School of Computer Science and Information Technology, DAVV,
Indore, M.P, India
Alexander Gelbukh Instituto Politécnico Nacional Mexico, Mexico City, Mexico
J. Gerald Manju Thiagarajar College of Engineering, Madurai, Tamil Nadu, India
Humera Ghani School of Computing and Digital Media, Centre for Communica-
tions Technology, London Metropolitan University, London, UK
Udayan Ghose University School of Information, Communication and Tech-
nology, Guru Gobind Singh Indraprastha University, Delhi, India
Vankalapati Nanda Gopal Lakireddy Bali Reddy College of Engineering,
Mylavaram, Andhra Pradesh, India
Anjana Gosain USICT, Guru Gobind Singh Indraprastha University, New Delhi,
India
Avantika Goyal Bharati Vidyapeeth’s College of Engineering, New Delhi, India
S. B. Goyal City University, Petaling Jaya, Malaysia
Pravin Gundalwar School of Computer Science and Engineering, Sandip Univer-
sity, Nashik, India
Sakshi Gupta Amity Institute of Information Technology, Amity University,
Noida, India
Sonam Gupta Ajay Kumar Garg Engineering College, Ghaziabad, India
Umesh Gupta Department of CSE, SR University, Warangal, Telangana, India;
School of Computer Science Engineering and Technology, Bennett University,
Greater Noida, India
Varun Gupta Amity University, Uttar Pradesh, Noida, India
Abu Bakar bin Abdul Hamid Putra Business School, University Putra Malaysia,
Serdang, Malaysia
Liangxiu Han Manchester Metropolitan University, Manchester, UK
Peddiboyina Hema Harini Lakireddy Bali Reddy College of Engineering,
Mylavaram, Andhra Pradesh, India
Asaad N. Hashim Faculty of Computer Science and Mathematics, University of
Kufa, Kufah, Iraq
S. Hrushikesh Thiagarajar College of Engineering, Madurai, Tamil Nadu, India
Fahimeh Jafari University of East London, London, United Kingdom
Pallav Jain Department of IT, Bhagwan Parshuram Institute of Technology,
New Delhi, India
Shaily Jain Faculty of Computing, Engineering and Science, University of South
Wales, South Wales, UK
Amer Hamzah bin Jantan City University, Petaling Jaya, Malaysia;
Putra Business School, University Putra Malaysia, Serdang, Malaysia
Prateek Jha Bharati Vidyapeeth’s College of Engineering, New Delhi, India
N. Kartik Department of Computer Applications/Science, Presidency
College(Autonomous)/Presidency University, Bengaluru, India
Ali Kartit LTI Laboratory, University of Chouaib Doukkali, National School of
Applied Sciences, El Jadida, Morocco
Asya Katanani Assistive Technology Group, SCDM, London Metropolitan
University, London, UK
Gagandeep Kaur Department of Computer Science and Engineering, Madhav
Institute of Technology and Science, Gwalior, India
Harkiran Kaur Department of Computer Science and Engineering, Thapar Insti-
tute of Engineering and Technology, Patiala, Punjab, India
Tamanna Kewal University School of Business, Chandigarh University, Mohali,
Punjab, India
Ahmed Sajjad Khan Electronics & Telecommunication Engineering, Anjuman
College of Engineering & Technology, Nagpur, India
Ashish Khanna Department of CSE, Maharaja Agrasen Institute of Technology,
New Delhi, India
Yash Khare Amity University, Uttar Pradesh, Noida, India
Fatima M. Khudair Faculty of Computer Science and Mathematics, University of
Kufa, Kufah, Iraq
B. Kiruthika Thiagarajar College of Engineering, Madurai, Tamil Nadu, India
Akshi Kumar Manchester Metropolitan University, Manchester, UK
Anil Kumar Tula’s Institute, Dehradun, India
Harishankar Kumar Department of Computer Science and Engineering, Thapar
Institute of Engineering and Technology, Patiala, Punjab, India
Jitender Kumar Bharati Vidyapeeth’s College of Engineering, New Delhi, India
Ritika Kumari USICT, Guru Gobind Singh Indraprastha University, New Delhi,
India;
Department of Artificial Intelligence and Data Sciences, IGDTUW, Delhi, India
Masri bin Abdul Lasi City University, Petaling Jaya, Malaysia
Sophia Lazarova GATE Institute, Sofia University, Sofia, Bulgaria
R. Mahalakshmi Department of Computer Science, Presidency University,
Bengaluru, India
A. Malini Thiagarajar College of Engineering, Madurai, Tamil Nadu, India
Shachi Mall School of Computer Science Engineering and Technology, Bennett
University, Greater Noida, India
Karanam Manjusha Lakireddy Bali Reddy College of Engineering, Mylavaram,
Andhra Pradesh, India
Manorama Amity Institute of Information Technology, Ranchi, India
Rohan Sahai Mathur Amity University, Uttar Pradesh, Noida, India
Roop Singh Meena Computer Science and Engineering Department, NITTTR,
Chandigarh, India
Rajesh Mehta Computer Science and Engineering Department, Thapar Institute of
Engineering and Technology, Patiala, India
Shambhavi Mishra Department of CSE, Bennett University, Greater Noida, India
Vipul Mishra Department of CSE, Pandit Deendayal Energy University, Gandhi-
nagar, India
Neetu Mittal Amity University Uttar Pradesh, Noida, Uttar Pradesh, India
T. S. Dhachina Moorthy Department of Information Technology, Thiagarajar
College of Engineering, Madurai, Tamil Nadu, India
Ashok Kumar Munnangi Department of Information Technology, Velagapudi
Ramakrishna Siddhartha Engineering College, Vijayawada, Andhra Pradesh, India
Victor Sowinski Mydlarz Assistive Technology Group, SCDM, London
Metropolitan University, London, UK
Mohammad Nasiruddin Electronics & Telecommunication Engineering,
Anjuman College of Engineering & Technology, Nagpur, India
Shaik Nazeer Lakireddy Bali Reddy College of Engineering, Mylavaram,
Andhra Pradesh, India
B. Nevetha Department of Information Technology, Thiagarajar College of Engi-
neering, Madurai, Tamil Nadu, India
N. Nimalan Department of Information Technology, Thiagarajar College of Engi-
neering, Madurai, Tamil Nadu, India
Nisha ECE Department, DCRUST, Murthal, India
Piyush Pant Sandip University, Nashik, India
Manikandan Parasuraman Department of Computer Science and Engineering,
JAIN (Deemed to be University), Bengaluru, Karnataka, India
Dessislava Petrova-Antonova GATE Institute, Sofia University, Sofia, Bulgaria
Pooja School of Computer Science and Engineering, Galgotias University,
Greater Noida, India
Amol Potgantwar Sandip Institute of Technology and Research Centre, Sandip
University, Nashik, India
Narindi Sai Priya Lakireddy Bali Reddy College of Engineering, Mylavaram,
Andhra Pradesh, India
Anand Singh Rajawat School of Computer Science and Engineering, Sandip
University, Nashik, India;
City University, Petaling Jaya, Malaysia
Sivaram Rajeyyagari Department of Computer Science, College of Computing
and Information Technology, Shaqra University, Shaqra, Kingdom of Saudi Arabia
Manikandan Ramachandran School of Computing, SASTRA Deemed Univer-
sity, Thanjavur, India
Lekha Rani Institute of Engineering and Technology, Chitkara University, Punjab,
India
Ravi Ranjan Netaji Subhas University of Technology, Delhi, India
Maya Rathore Christian Eminent College, Indore, M.P, India
Mohit Rohilla Bharati Vidyapeeth’s College of Engineering, New Delhi, India
Jatin Sachdeva Chitkara University Institute of Engineering & Technology,
Chitkara University, Punjab, India
Mitra Saeedi University of East London, London, United Kingdom
Ashok Kumar Sahoo Graphic Era Hill University, Dehradun, India
Anumolu Bindu Sai Lakireddy Bali Reddy College of Engineering, Mylavaram,
Andhra Pradesh, India
S. K. Saida Department of Information Technology, Lakireddy Bali Reddy College
of Engineering, Mylavaram, Andhra Pradesh, India
Neha Saini Government College Chhachhrauli, Yamuna Nagar, Haryana, India
Prateek Saini Chitkara University Institute of Engineering and Technology,
Chitkara University Punjab, Chandigarh, India
Shahram Salekzamankhani School of Computing and Digital Media, Centre for
Communications Technology, London Metropolitan University, London, UK
Zahraa Maan Sallal General Directorate of Education in Al-Qadisiyah Gover-
norate/Ministry of Education, Al Diwaniyah, Iraq
Pradeepta Kumar Sarangi Institute of Engineering and Technology, Chitkara
University, Punjab, India
Charu Saxena University School of Business, Chandigarh University, Mohali,
Punjab, India
Abuzar Sayeed Department of CSE, Bennett University, Greater Noida, India
Ramesh Sekaran Department of Computer Science and Engineering, JAIN
(Deemed to be University), Bengaluru, Karnataka, India
Shalu Manav Rachna University, Faridabad, Haryana, India
Aditi Sharma Delhi Technological University, New Delhi, India;
Thapar Institute of Engineering and Technology, Patiala, India
Ankita Sharma University School of Information, Communication and Tech-
nology, Guru Gobind Singh Indraprastha University, Delhi, India
Priya Sharma Department of CSE, Sharda University, Greater Noida, India
Ruchi Sharma Bharati Vidyapeeth’s College of Engineering, New Delhi, India
Vaibhav Sharma Institute of Engineering and Technology, Chitkara University,
Punjab, India
Snehlata Sheoran Amity University Uttar Pradesh, Noida, Uttar Pradesh, India
Xiao Shi Chengyi College, Jimei University, Xiamen, China
Rajesh Shrivastava School of Computer Science Engineering and Technology,
Bennett University, Greater Noida, India
Amit Pratap Singh Department of CSE, Sharda University, Greater Noida, India
Dinesh Singh Deenbandhu Chhotu Ram University of Science and Technology,
Murthal, Sonepat, India
Gurpreet Singh Chitkara University Institute of Engineering and Technology,
Chitkara University Punjab, Chandigarh, India
Jagendra Singh School of Computer Science Engineering and Technology,
Bennett University, Greater Noida, India
Jaspreeti Singh USICT, Guru Gobind Singh Indraprastha University, New Delhi,
India
Tejasvi Singhal Chitkara University Institute of Engineering and Technology,
Chitkara University Punjab, Chandigarh, India
Abhinandan Singla Department of Computer Science and Engineering, Thapar
Institute of Engineering and Technology, Patiala, Punjab, India
Rajendra Sinha Sandip University, Nashik, India
Yanduru Yamini Snehitha Lakireddy Bali Reddy College of Engineering,
Mylavaram, Andhra Pradesh, India
Itu Snigdh B.I.T Mesra, Ranchi, India
Ram Kumar Solanki School of Computer Science and Engineering, Sandip
University, Nashik, India
Shano Solanki Computer Science and Engineering Department, NITTTR, Chandi-
garh, India
Prashanth Sontakke University of East London, London, United Kingdom
Ivaylo Spasov Rila Solutions, Sofia, Bulgaria
S. Sridevi Department of Information Technology, Thiagarajar College of Engi-
neering, Madurai, Tamil Nadu, India
B. Srinivasa Rao Department of Information Technology, Lakireddy Bali Reddy
College of Engineering, Mylavaram, Andhra Pradesh, India
Priyam Srivastava Department of CSE, Bennett University, Greater Noida, India
M. A. Chiranjath Sshakthi Thiagarajar College of Engineering, Madurai,
Tamil Nadu, India
Ugrasen Suman School of Computer Science and Information Technology, DAVV,
Indore, M.P, India
P. Sai Swetha Thiagarajar College of Engineering, Madurai,
Tamil Nadu, India
Nurun Najah binti Tarmidzi City University, Petaling Jaya, Malaysia
Rudra Tiwari Doon International School, Dehradun, India
Anurag Tuteja Department of IT, Bhagwan Parshuram Institute of Technology,
New Delhi, India
Nitigya Vasudev Chitkara University Institute of Engineering and Technology,
Chitkara University Punjab, Chandigarh, India
K. A. Venkatesh School of Advanced Computer Science, Alliance University,
Bengaluru, India
Devineni Vijaya Sri Department of Information Technology, Lakireddy Bali
Reddy College of Engineering, Mylavaram, Andhra Pradesh, India
Bal Virdee Assistive Technology Group, SCDM, London Metropolitan University,
London, UK;
School of Computing and Digital Media, Centre for Communications Technology,
London Metropolitan University, London, UK
Arun Kumar Yadav National Institute of Technology, Hamirpur, HP, India
Divakar Yadav National Institute of Technology, Hamirpur, HP, India
Nadiya Zafar Department of CS&IT, Maulana Azad National Urdu University,
Hyderabad, India
Diagnosis of Parkinson Disease Using
Ensemble Methods for Class Imbalance
Problem
Ritika Kumari, Jaspreeti Singh, and Anjana Gosain
Abstract Parkinson disease (PD) is one of the most prevalent degenerative neurological
disorders and is incurable. Early PD diagnosis is essential in order to determine the
initial course of treatment. Typically, the issue of class imbalance has an impact on
the PD diagnosis. This paper seeks to give a comparative analysis of the ensemble
methods: random forest, bagging, and random under-sampling boost for addressing
the class imbalance problem for PD diagnosis. We make use of a real-world PD
speech dataset that is housed in the repository at UCI (University of California,
Irvine). Due to the high imbalance in this dataset, feature scaling and the Synthetic
Minority Oversampling Technique (SMOTE) are employed. We also employ the
feature selection (FS) technique for enhancing the efficiency of the machine learning
algorithms (MLAs). The results show that bagging performs best with an accuracy
of 96.46%. This study proposes the use of ensemble approaches for PD’s early
diagnosis.
Keywords Classification ·Ensemble methods ·Random forest ·Feature
selection ·Bagging
R. Kumari (B)·J. Singh ·A. Gosain
USICT, Guru Gobind Singh Indraprastha University, New Delhi, India
e-mail: ritikakumari@igdtuw.ac.in
J. Singh
e-mail: jaspreeti_singh@ipu.ac.in
A. Gosain
e-mail: anjana_gosain@ipu.ac.in
R. Kumari
Department of Artificial Intelligence and Data Sciences, IGDTUW, Delhi, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_1
R. Kumari et al.
1 Introduction
Parkinson’s disease is the neurological condition with the second-slowest rate of
progression, affecting 7 to 10 million individuals worldwide after Alzheimer’s
disease [14]. PD is typically observed in elderly people. The primary reason for
PD is the decrease in the dopamine’s level, a chemical in the human brain [5]. The
dopamine produced by neurons is responsible for transmitting signals in the human
brain. It is yet unclear what is causing the impairment of these neurons. The signs
of PD may include loss of smell, constipation, sleep and speech issues, swallowing
difficulties, bradykinesia, stiffness, and postural imbalance [6].
PD is incurable, but early diagnosis may help in providing proper treatment and
taking preventive measures [1,7]. Researchers have noticed changes in speech as
an early symptom in PD patients. This has motivated us to develop an ML model
that can serve as a second opinion in the diagnosis of PD patients. We use the PD
speech dataset for analyzing the changes in speech for this study as it is non-invasive
and low cost. The dataset is highly imbalanced and thus suffers from the class
imbalance problem (CIP), which makes the analysis difficult. CIP occurs when one class is
present in the majority in comparison with another (the minority class). Using an imbalanced
dataset makes traditional classifiers biased toward the majority class.
To handle CIP, the researchers have worked at three levels: data level (DL), algo-
rithm level (AL), and hybrid level (HL) [8,9]. In DL, we work at the data level and
try to develop uniformity in the class distribution using data sampling techniques
such as under-sampling or oversampling. However, DL strategies suffer from model
overfitting in the case of oversampling, while under-sampling may lead to the loss
of potentially useful data. In AL, we formulate a new algorithm or make
some modifications to the existing algorithm. This strategy requires knowledge and
expertise in the area of the algorithm. Then comes HL; at this level, we take
the benefits of both DL and AL. Ensemble methods come under this category [10].
Several different independent classifiers are combined to create a robust classifier
using the effective technique of ensemble learning. Numerous studies demonstrate
that ensemble learning models perform better on imbalanced datasets and have
great generalization capacity [11–15].
In this study, we employ three ensemble methods, namely random forest (RF),
bagging, and random under-sampling boost (RUSBoost) for the PD diagnosis using
the PD speech dataset.
The study's key contributions are as follows:
(1) Firstly, as the PD speech dataset studied is highly imbalanced, we use SMOTE
oversampling technique for balancing the dataset.
(2) Secondly, we evaluate the performance of the ensemble methods (a) without
using any feature selection (FS) method (b) using the SelectKBest FS method.
Finally, we compare our work with existing research in the area of PD, and it is
observed that bagging outperforms RF and RUSBoost with an accuracy of 96.46%.
Diagnosis of Parkinson Disease Using Ensemble Methods for Class Imbalance Problem
This paper presents a summary of the relevant literature on the CIP in PD diagnosis
in Sect. 2. Section 3 briefly discusses the ensemble methods studied, the FS method
used, and the performance metrics involved. Section 4 explains the experimental
results and discussions, along with a comparison of our study with prior work. The
study's conclusion is given in Sect. 5.
2 Related Work
Several researchers have worked on the diagnosis of PD using different MLAs and
FS methods.
An ensemble-based model for diagnosing PD was proposed by Biswas et al. [16]
in 2022. The authors employed stacking to create a strong model and FS to choose
the pertinent characteristics. The proposed model was evaluated using a variety of
MLAs. The authors claimed that the ensemble-based model surpassed the other
techniques.
Saeed et al. [17] in 2022 developed a comprehensive strategy for PD prediction
in which the authors studied the performance of several ML classifiers combined
with different FS methods.
With the wrapper filter method, this research improved the K nearest neighbor’s
(KNN) accuracy to 88.33%.
Yadav and Jain [18] in 2022 conducted the study with six ML models: KNN,
Naïve Bayes (NB), support vector machine (SVM), RF, etc., for the PD prediction.
According to the experimental findings, KNN had the highest accuracy, i.e., 92.05%
for early detection of PD.
For the effective diagnosis of PD, Lamba et al. [1] proposed a hybrid system in
2021. The authors have used three classifiers, namely KNN, RF and NB along with
SMOTE for addressing the CIP. Three FS methods were also employed for reducing
the feature subset. The study concluded that the RF classifier showed better results
than other classifiers, with an accuracy of 95.58%.
Yaman et al. [19] in 2020 conducted the experiment using the relief FS method
for selecting the acoustic features from the dataset. They used KNN and SVM for
the PD prediction and found that out of the two, the SVM classifier performed the
best with an accuracy of 91.25%.
Polat [20] in 2019 used the PD dataset to propose the hybrid model for PD predic-
tion. The authors worked at the data level to handle the CIP using SMOTE technique
and employed RF for their study. The authors noticed that RF achieved an accuracy
of 94.8%.
Mathur et al. [7] in 2019 suggested using the combined effect of KNN with artifi-
cial neural network (ANN) for PD detection. The researchers studied the performance
of various MLAs and selected the best-performing models w.r.t. accuracy and less
execution time. The study showed that the ensemble-based method AdaBoost.M1
with KNN gives the best accuracy, i.e., 91.28%.
3 Materials and Methods
3.1 Ensemble Methods
Three ensemble methods—RF, bagging, and RUSBoost—are used in this research.
The performance of these methods is analyzed using all features and selected features.
Random Forest (RF) RF is a supervised ensemble MLA that generates many
decision trees (DTs), each trained on randomly selected instances. Each DT provides
a prediction, and the final prediction is made on the basis of majority voting [21].
Bagging To improve the prediction accuracy of MLAs, Breiman [22] suggested the
bootstrap aggregating (bagging) ensemble technique. The bagging
technique, given a training set, randomly creates a variety of bootstrap samples by
sampling with replacement from the original dataset. The ensemble’s classifiers are
then trained individually for each bootstrap sample. The majority vote is then used
to determine the prediction results for classification problems.
RUSBoost RUSBoost proposed by Seiffert et al. [23] is a combination of random
under-sampling (RUS) and boosting. RUS is a method for removing instances from
the over-represented class at random. AdaBoost [24], the most used boosting tech-
nique, iteratively trains the base learners in a sequential manner. All built models then
take part in a weighted vote to categorize unlabeled examples. Because the minority
class instances are more likely to be misclassified and are hence assigned higher
weights in subsequent rounds, this strategy is particularly effective at addressing CIP.
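The first two ensembles above can be sketched with scikit-learn on synthetic data (an illustrative sketch, not the paper's exact code; the dataset, sizes, and parameters here are stand-ins, and RUSBoost is provided by the separate imbalanced-learn package with a similar fit/predict interface):

```python
# Minimal sketch: random forest and bagging on synthetic, mildly imbalanced data.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=10,
                           weights=[0.75, 0.25], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Each tree is grown on a random resample; the forest predicts by majority vote
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Base classifiers trained on bootstrap resamples, combined by voting
bag = BaggingClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

print(round(rf.score(X_te, y_te), 2), round(bag.score(X_te, y_te), 2))
```

RUSBoost would be used the same way via imbalanced-learn's `RUSBoostClassifier`, which wraps boosting around random under-sampling of the majority class.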
3.2 FS Method
FS is a preprocessing method to extract significant features from a dataset. The
name attribute is initially omitted because it does not appear to have an impact on
performance. Then, using the scikit-learn library's SelectKBest function, we choose
the k features with the best scores. The scores are determined using the f_classif
univariate statistical test. The features that SelectKBest chose are displayed in
Table 1. We also scale the features using MinMaxScaler from the scikit-learn library.
Table 1 Features selected by SelectKBest FS method
FS method #features Selected Features
SelectKBest 15 MDVP:Fo (Hz), MDVP:Flo (Hz), MDVP:Jitter (Abs), MDVP:Shimmer,
MDVP:Shimmer (dB), Shimmer:APQ3, Shimmer:APQ5, MDVP:APQ,
Shimmer:DDA, HNR, RPDE, spread1, spread2, D2, PPE
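The scaling and selection step above can be sketched as follows (synthetic stand-in data with 22 columns, mirroring the PD dataset's voice features; the real study applies the same calls to the dataset of Table 3):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import MinMaxScaler

# Stand-in data with 22 feature columns
X, y = make_classification(n_samples=200, n_features=22,
                           n_informative=8, random_state=0)

X_scaled = MinMaxScaler().fit_transform(X)          # scale each feature to [0, 1]
selector = SelectKBest(score_func=f_classif, k=15)  # keep the 15 best-scoring features
X_sel = selector.fit_transform(X_scaled, y)
print(X_sel.shape)  # (200, 15)
```

`selector.get_support(indices=True)` would list which columns were kept, analogous to the 15 features in Table 1.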
3.3 Performance Metrics
We use the accuracy and AUC metrics for the performance evaluation of the techniques.
Accuracy It refers to the percentage of correct predictions over total predictions.
Accuracy is calculated using Eq. (1).

Accuracy = Correct Predictions / Total Predictions. (1)
AUC In an AUC curve, the true positive rate (TPrate), i.e., the percentage of correctly
classified positive cases, and the true negative rate (TNrate), i.e., the percentage of
correctly classified negative instances, are shown on the x-axis and y-axis,
respectively [25]. The area under the receiver operating characteristic (ROC) curve
represents the classifier's performance. AUC is evaluated using Eq. (2).

AUC = (TPrate + TNrate) / 2. (2)
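Worked numerically, Eqs. (1) and (2) reduce to simple counts over a confusion matrix. A small hand computation (the labels here are invented for illustration, not from the paper's data):

```python
# Tiny worked example of Eqs. (1) and (2).
y_true = [1, 1, 1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 1]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives: 4
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives: 2
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives: 1
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives: 1

accuracy = (tp + tn) / len(y_true)   # Eq. (1): 6/8 = 0.75
tp_rate = tp / (tp + fn)             # 4/5 = 0.8
tn_rate = tn / (tn + fp)             # 2/3 ≈ 0.667
auc = (tp_rate + tn_rate) / 2        # Eq. (2): ≈ 0.733
print(accuracy, round(auc, 3))       # 0.75 0.733
```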
4 Experimental Setup
The experiment is conducted with three ensemble methods: RF, bagging, and
RUSBoost, using Python 3.8 in Jupyter Notebook. We obtain the dataset
from the UCI repository. The parameter settings of the ensemble methods are given
in Table 2.
4.1 Dataset
In this study, the publicly accessible PD dataset [26] from the UCI repository is
used. The collection includes speech sounds from 23 PD patients and eight healthy
Table 2 Initial parameter settings
Ensemble method Parameter settings
RF n_estimators =100
Bagging base_estimator =SVC
RUSBoost base_estimator =LogisticRegression
Table 3 Parkinson’s speech dataset
Attribute name Description
name ASCII subject name and recording number
MDVP:Fo (Hz) Average vocal fundamental frequency
MDVP:Fhi (Hz) Maximum vocal fundamental frequency
MDVP:Flo (Hz) Minimum vocal fundamental frequency
MDVP:Jitter (%), MDVP:Jitter (Abs),
MDVP:RAP, MDVP:PPQ, Jitter:DDP
Fundamental frequency variation measures
MDVP:Shimmer, MDVP:Shimmer (dB),
Shimmer:APQ3, Shimmer:APQ5,
MDVP:APQ, Shimmer:DDA
Amplitude variation measures
NHR, HNR Ratio of noise to tonal components in the voice
RPDE, D2 Two nonlinear dynamical complexity measures
DFA Signal fractal scaling exponent
spread1, spread2, PPE Nonlinear measures of fundamental frequency
variation
status Health status: 1—PD, 0—healthy
Table 4 Performance of ensemble methods with all features
Methods Accuracy AUC
RF 94.76 99.25
Bagging 96.24 99.31
RUSBoost 76.73 88.89
subjects. Max Little of the University of Oxford produced this dataset having 195
rows representing the voice measurements of 31 different people, and each column
represents a different voice attribute. Out of 195 voice measurements, 147 are from
people with PD, and the rest belong to healthy people. The status column contains
two values: ‘0’ represents healthy people and ‘1’ represents people with PD. Table 3
shows the dataset properties.
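With 147 PD and only 48 healthy recordings, the minority class needs roughly 99 synthetic samples to reach balance. The study uses the standard SMOTE implementation; the following hand-rolled numpy version is only a sketch of the underlying interpolation mechanism, on random stand-in data:

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority points by interpolating each chosen
    point toward one of its k nearest minority neighbours (the SMOTE idea)."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)  # distances to all minority points
        nn = np.argsort(d)[1:k + 1]                   # k nearest neighbours, skipping self
        j = rng.choice(nn)
        gap = rng.random()                            # random position on the segment
        out.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.asarray(out)

# 48 stand-in minority samples with 5 features; 48 + 99 = 147 balances the classes
minority = np.random.default_rng(1).normal(size=(48, 5))
synthetic = smote_sketch(minority, n_new=99)
print(synthetic.shape)  # (99, 5)
```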
Repeated stratified K-fold cross-validation with ten splits is used for evaluation.
Two performance measures, accuracy and AUC, are utilized. The SelectKBest method
is applied for selecting the most relevant features from the dataset. The performance
of the ensemble methods with all features and with selected features is given in
Tables 4 and 5.
Figure 1 represents this graphically: (a) accuracy and (b) AUC. The highest values
are highlighted.
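The evaluation protocol described above can be sketched with scikit-learn (synthetic stand-in data; the number of repeats is not stated in the paper, so `n_repeats=3` here is an assumption):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Stand-in for the (SMOTE-balanced) PD feature matrix
X, y = make_classification(n_samples=300, n_features=15, random_state=0)

# Ten stratified splits, repeated; n_repeats=3 is an assumption
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0)
scores = cross_val_score(BaggingClassifier(random_state=0), X, y,
                         cv=cv, scoring="accuracy")
print(len(scores), round(scores.mean(), 3))  # 30 accuracy estimates and their mean
```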
4.2 Results and Discussions
(a) With All Features
Table 5 Performance of ensemble methods with selected features
Methods Accuracy AUC
RF 95.71 99.36
Bagging 96.46 98.45
RUSBoost 77.54 88.86
Fig. 1 Performance of ML techniques: a accuracy and b AUC
Table 6 Comparison with prior studies
Reference Year Technique
[27] 2020 SVM
[1] 2021 RF
[17] 2022 KNN
[16] 2022 Ensembled expert system
[28] 2022 SVM
Our work 2022 Bagging
With an accuracy of 96.24% and an AUC of 99.31%, bagging surpassed the
other ensemble methods as shown in Fig. 1a and b. This might be the case since
the approach improves the stability and generalization capacity of multiple base
classifiers. RUSBoost had the worst accuracy, scoring 76.73%. This might be
because RUSBoost's random under-sampling contributes little once SMOTE has
already balanced the dataset.
(b) With Selected Features
With 15 selected features, a slight improvement of 0.22% in the accuracy of
bagging is noticed. With FS as well, bagging outperforms RF and RUSBoost, with
an accuracy of 96.46%. Our study suggests that the accuracy of the model is
enhanced by FS; thus, its usage is beneficial in the diagnosis of PD.
From the results, it is evident that bagging can be taken as a viable tool for
early PD diagnosis.
4.3 Comparison with Previous Studies
The best-performing method from our research is compared with the results from
earlier studies using the same PD dataset in Table 6.
5 Conclusion
PD is a chronic disease; therefore, detecting it in its early phase is crucial in
order to prolong a patient's life. This paper utilizes speech signals taken from
the UCI repository for early PD diagnosis. To balance the dataset, we
use SMOTE technique. We perform the comparative analysis of three ensemble
methods, namely RF, bagging, and RUSBoost, for PD diagnosis. We also use the
FS SelectKBest method for selecting features and comparing the performance of
the ensemble methods without the FS method and with the FS method. The results
suggest that the FS technique is advantageous since it helps to reduce the complexity
and enhances the model’s accuracy. With an accuracy of 96.46%, the experimental
data demonstrates that the ensemble method bagging beats the other strategies in
the study. For future work, various FS methods may be utilized to select the most
contributing features.
References
1. Lamba R, Gulati T, Alharbi HF, Jain A (2022) A hybrid system for Parkinson’s disease diagnosis
using machine learning techniques. Int J Speech Technol 25(3):583–593
2. Cacabelos R (2017) Parkinson’s disease: from pathogenesis to pharmacogenomics. Int J Mol
Sci 18(3):551
3. Bharath S, Hsu M, Kaur D, Rajagopalan S, Andersen JK (2002) Glutathione, iron and
Parkinson’s disease. Biochem Pharmacol 64(5–6):1037–1048
4. Tuncer T, Dogan S, Acharya UR (2020) Automated detection of Parkinson’s disease using
minimum average maximum tree and singular value decomposition method with vowels.
Biocybernetics Biomed Eng 40(1):211–220
5. Shamrat FMJM, Asaduzzaman M, Rahman AS, Tusher RTH, Tasnim Z (2019) A comparative
analysis of Parkinson disease prediction using machine learning approaches. Int J Sci Technol
Res 8(11):2576–2580
6. Challa KNR, Pagolu VS, Panda G, Majhi B (2016) An improved approach for prediction of
Parkinson’s disease using machine learning techniques. In: 2016 international conference on
signal processing, communication, power and embedded system (SCOPES). IEEE, pp 1446–
1451
7. Mathur R, Pathak V, Bandil D (2019) Parkinson disease prediction using machine learning
algorithm. In: Emerging trends in expert applications and security. Advances in intelligent
systems and computing, vol 841. Springer, Singapore, pp 357–363
8. Gosain A, Sardana S (2017) Handling class imbalance problem using oversampling techniques:
a review. In: 2017 international conference on advances in computing, communications and
informatics (ICACCI). IEEE, pp 79–85
9. Kaur P, Gosain A (2019) Empirical assessment of ensemble based approaches to classify
imbalanced data in binary classification. Int J Adv Comput Sci Appl (IJACSA) 10(3):48–58
10. Galar M, Fernandez A, Barrenechea E, Bustince H, Herrera F (2012) A review on ensembles for
the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Trans
Syst Man Cybern Part C Appl Rev 42(4):463–484
11. Hou S, Liu Y, Yang Q (2022) Real-time prediction of rock mass classification based on TBM
operation big data and stacking technique of ensemble learning. J Rock Mech Geotech Eng
14(1):123–143
12. Liu L, Wu X, Li S, Li Y, Tan S, Bai Y (2022) Solving the class imbalance problem using
ensemble algorithm: application of screening for aortic dissection. BMC Med Inform Decis
Mak 22(82):1–16
13. Abedin MZ, Guotai C, Hajek P, Zhang T (2023) Combining weighted SMOTE with ensemble
learning for the class-imbalanced prediction of small business credit risk. Complex Intell Syst
9:3559–3579
14. Nishant PS, Rohit B, Chandra BS, Mehrotra S (2021) HOUSEN: hybrid over–undersampling
and ensemble approach for imbalance classification. In: Inventive systems and control. Lecture
notes in networks and systems, vol 204. Springer, Singapore, pp 93–108
15. Sarkar S, Khatedi N, Pramanik A, Maiti J (2020) An ensemble learning-based undersampling
technique for handling class-imbalance problem. In: Proceedings of ICETIT 2019. Lecture
notes in electrical engineering, vol 605. Springer, Cham, pp 586–595
16. Biswas SK, Boruah AN, Saha R, Raj RS, Chakraborty M, Bordoloi M (2022) Early detection
of Parkinson disease using stacking ensemble method. Comput Methods Biomech Biomed Eng
26(5):527–539
17. Saeed F, Al-Sarem M, Al-Mohaimeed M, Emara A, Boulila W, Alasli M, Ghabban F
(2022) Enhancing Parkinson’s disease prediction using machine learning and feature selection
methods. Comput Mater Continua 71(3):5639–5658
18. Yadav D, Jain I (2022) Comparative analysis of machine learning algorithms for Parkinson’s
disease prediction. In: 2022 6th international conference on intelligent computing and control
systems (ICICCS). IEEE, pp 1334–1339
19. Yaman O, Ertam F, Tuncer T (2020) Automated Parkinson’s disease recognition based on
statistical pooling method using acoustic features. Med Hypotheses 135:109483
20. Polat K (2019) A hybrid approach to Parkinson disease classification using speech signal:
the combination of SMOTE and random forests. In: 2019 scientific meeting on electrical-
electronics and biomedical engineering and computer science (EBBT). IEEE, pp 1–3
21. Rani P, Kumar R, Ahmed NMOS, Jain A (2021) A decision support system for heart disease
prediction based upon machine learning. J Reliable Intell Environ 7:263–275
22. Breiman L (1996) Bagging predictors. Mach Learn 24:123–140
23. Seiffert C, Khoshgoftaar TM, Van Hulse J, Napolitano A (2009) RUSBoost: a hybrid approach
to alleviating class imbalance. IEEE Trans Syst Man Cybern Part A Syst Hum 40(1):185–197
24. Freund Y, Schapire RE (1996) Experiments with a new boosting algorithm. In: Machine
Learning: Proceedings of the thirteenth international conference, pp 1–9
25. Kumari R, Singh J, Gosain A (2023) SmS: SMOTE-stacked hybrid model for diagnosis of
polycystic ovary syndrome using feature selection method. Expert Syst Appl 225:120102
26. UCI Repository for Parkinsons Dataset (PD) Retrieved from https://archive.ics.uci.edu/ml/mac
hine-learning-databases/parkinsons. Accessed on 15 Jan 2023
27. Senturk ZK (2020) Early diagnosis of Parkinson’s disease using machine learning algorithms.
Med Hypotheses 138:109603
28. Kuresan H, Samiappan D (2022) Genetic algorithm and principal components analysis in
speech-based Parkinson’s early diagnosis studies. Int J Nonlinear Anal Appl 13(1):591–602
A Comparative Analysis of Pneumonia
Detection Using Chest X-rays with DNN
Prateek Jha, Mohit Rohilla, Avantika Goyal, Siddharth Arora,
Ruchi Sharma, and Jitender Kumar
Abstract Pneumonia, known to be a highly destructive lung disease, can develop from a variety of viral infections. Due to the close association between pneumonia and other lung disorders, diagnosing pneumonia from chest X-ray images presents a significant challenge, and recent detection approaches have consequently been unable to reach higher levels of accuracy. In this research, pneumonia is classified using deep learning algorithms. A CNN model was developed to make chest X-ray diagnosis easier. Furthermore, pre-trained convolutional neural network (CNN) models, which extract features from vast datasets, prove highly advantageous in image classification applications. In our
analysis, we use a selection process to determine the most suitable CNN model for
the task at hand. CNN models offer substantial assistance in the evaluation of chest
X-ray images, particularly in the identification of pneumonia. To effectively identify
pneumonic lungs in chest X-rays and contribute to pneumonia treatment, this article
presents a range of convolutional neural network models.
Keywords Deep convolutional neural network (DCNN) · Image classification · Conv2D · MaxPooling2D · Batch normalization · Activation function · Chest X-ray (CXR)
P. Jha · M. Rohilla (B) · A. Goyal · S. Arora · R. Sharma · J. Kumar
Bharati Vidyapeeth’s College of Engineering, New Delhi, India
e-mail: rohillamohit1510@gmail.com
P. Jha
e-mail: jhapk0001@gmail.com
A. Goyal
e-mail: goyal.avi2000@gmail.com
S. Arora
e-mail: siddharth2699@gmail.com
R. Sharma
e-mail: ruchi.sharma@bharatividyapeeth.edu
J. Kumar
e-mail: jitender.kumar@bharatividyapeeth.edu
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_2
1 Introduction
A chest X-ray is a projection radiograph of the chest. It is used to diagnose
disorders that affect the heart, lungs, bones, respiratory system, and major
vessels in the chest, and it can help to detect pneumonia [1]. Diagnostic and
medical facilities conduct research on the classification of chest X-ray images.
Typically, a chest X-ray is ordered when a patient visits a doctor for chest pain,
a chest injury, or shortness of breath. Using the image, the doctor can diagnose
a heart problem, a collapsed lung, pneumonia, broken ribs, or any of a number of
other ailments. The main goal here is to offer a system for recognizing and
classifying such disorders in order to improve efficacy and quality; recent
studies have shown the high effectiveness of chest X-rays for this purpose. By
focusing on chest X-ray image categorization approaches based on machine learning
algorithms, this research seeks to provide a more accurate solution for chest
X-ray diagnosis. The review starts with background material on data mining, as
well as the fundamentals of machine learning and medical image analysis [2].
Today, viruses pose one of the most significant dangers to human well-being, and
pneumonia is among the contagious illnesses they cause. Because of this,
automatically classifying medical images has become much more challenging. The
aim of this paper is to classify medical images into predetermined categories.
Deep learning (DL), which has lately gained popularity, is one of the preferred
and frequently used techniques for addressing medical image categorization problems.
Artificial intelligence is increasingly closing the gap between human and computer
abilities, and computer vision is one of its many disciplines. A CNN is a DL
system that can distinguish objects in an image by ingesting a source image and
assigning importance to its different characteristics and objects. A convolutional
neural network requires significantly less preprocessing than earlier
classification methods, and its design resembles the connectivity pattern of
biological neurons. In this work, a CNN technique was trained on chest X-ray
images to identify pneumonia. The suggested CNN method is built from three models
in an effort to combine two strategies, namely using the top CNN models for the
training stage and using a vision transformer, both of which have shown promising
results when applied separately. Following related work, an ensemble learning
solution built from the top two CNN models is used in this model.
For our experimental analysis, we verified the ability to spot pneumonia from
CXR images. The two classes in this dataset are referred to as normal and
pneumonia; Goldbaum and Kermany provided the dataset [3]. Good performance depends
on the most important properties that the convolution networks retrieve (Fig. 1).
As previously noted, a larger network is always used for larger inputs to provide
a sufficient receptive field.
Fig. 1 Eight prevalent thoracic diseases are sized differently in the ChestX-ray14 dataset
2 Literature Review
Deep learning has been widely used by researchers in recent years to detect
illness from chest X-rays. For instance, Rajpurkar et al. [4] constructed the
121-layer CNN model known as CheXNet, trained on 10,000 X-ray images labeled with
fourteen diseases. The model was also evaluated on 420 X-ray images, and its
output was compared against radiologists' findings. As an outcome, they found
that the deep learning-based CNN approach was better than typical pneumonia
identification.
Stephen et al. [5] trained a CNN algorithm from scratch to learn features from
X-ray images and used it to predict whether or not a particular patient had
pneumonia.
To detect pneumonia from chest X-ray images, Atitallah et al. [6] proposed a
system based on a CNN with average adaptive filtering. Each chest X-ray image was
subjected to adaptive filtering to cancel noise, increasing accuracy and making
identification easier. Then, for feature extraction, a two-layer CNN model with
dropout was built. The significant filter is needed to improve the CNN's
classification accuracy.
Maselli et al. [7] extracted features for the pneumonia classification challenge
using three well-known CNN models. They used the same dataset to train each model
individually and extracted 1000 features from each CNN's final fully connected
layer. The selected features were then passed to machine learning classification
methods. A DL model with 49 convolutional layers and two dense layers was also
used in their research, achieving a test accuracy of 90.05%.
Ayan and Karabulut [8] used a CNN methodology with residual connections and
dilated convolutions to classify pneumonia. On selected X-ray images, they
visualized how the CNN approach behaved, drew on the vision transformer of
Dosovitskiy et al. [9], and applied transfer learning to obtain a CNN method for
pneumonia identification in X-ray images.
In conclusion, the state of the art includes some impressive ideas, but we have
tried to take things a step further by introducing a method that combines two
different approaches: employing convolutional models for the training stage and
choosing the one that yields the best results. The findings obtained are
encouraging and slightly improve upon the performance of the existing state of
the art while using a minimal number of features and layers.
3 Dataset
For our experimental analysis, we detected pneumonia from CXR images. The two
classes of data are called normal and pneumonia. The dataset was taken from
Kaggle and contains a total of 5856 normal and pneumonia images, categorized into
three parts: train, test, and validation, each further divided into two
sub-folders, normal and pneumonia. The training subset is made up of 1341 images
of healthy patients and 3875 pneumonia images. The test subset contains 234
normal and 390 pneumonia images. Sixteen validation images are also included,
eight from patients with pneumonia and eight from healthy individuals.
4 Methodology
In this study, an ideal approach for detecting pneumonia from chest X-rays is
proposed. A straightforward eight-layer CNN with max pooling and activation
function will serve as our initial model (Fig. 2).
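The paper does not list the exact layer hyperparameters, so as an illustration only, the following sketch traces how spatial dimensions shrink through a hypothetical stack of 3 × 3 "valid" convolutions and 2 × 2 max-pooling steps starting from a 224 × 224 input:

```python
def conv2d_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution layer."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, window, stride=None):
    """Spatial output size of a max-pooling layer (stride defaults to window)."""
    stride = stride or window
    return (size - window) // stride + 1

# Trace a hypothetical stack of Conv2D (3x3, 'valid') + MaxPooling2D (2x2)
# blocks on a 224x224 input; the paper's exact hyperparameters may differ.
size = 224
for block in range(4):
    size = conv2d_out(size, kernel=3)   # convolution shrinks each side by 2
    size = pool_out(size, window=2)     # pooling roughly halves each side
    print(f"block {block + 1}: {size}x{size}")
# prints 111x111, 54x54, 26x26, 12x12
```

Tracing shapes like this helps verify that the flattened feature vector entering the dense layers has a manageable size.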
For the successful completion of this project, a number of steps were taken into
consideration, which are as follows.
4.1 Choosing the Dataset
The dataset consists of 5856 normal and pneumonia X-ray images. It is divided
into three folders (train, test, and validation), each of which has two sub-folders
(pneumonia/normal). Images are of grayscale format and of varying sizes.
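To make the layout concrete, here is a small standard-library helper that counts images per split and class. It is only a sketch: the folder names (NORMAL/PNEUMONIA) and file extensions are assumptions about the Kaggle layout and should be adjusted to the actual dataset.

```python
from pathlib import Path

def count_images(root):
    """Count images per (split, class) in a train/test/val folder layout.

    Assumes a Kaggle-style layout such as
    root/{train,test,val}/{NORMAL,PNEUMONIA}/*.jpeg
    (folder names are illustrative; adjust to the actual dataset).
    """
    counts = {}
    for split_dir in sorted(Path(root).iterdir()):
        if not split_dir.is_dir():
            continue
        for class_dir in sorted(split_dir.iterdir()):
            if class_dir.is_dir():
                counts[(split_dir.name, class_dir.name)] = sum(
                    1 for p in class_dir.iterdir()
                    if p.suffix.lower() in {".jpeg", ".jpg", ".png"}
                )
    return counts
```

Printing these counts before training is a cheap sanity check that the class imbalance matches the figures reported for the dataset.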
A Comparative Analysis of Pneumonia Detection Using Chest X-rays 15
Fig. 2 Block diagram
4.2 Preprocessing the Images
Prior to training the model, the chest X-ray images were resized to 224 × 224.
In the dataset, the X-ray images show that more than 1200 people are healthy and
more than 3800 have pneumonia.
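In practice, a library such as Pillow or OpenCV would handle the resizing; the following standard-library sketch only shows the idea behind nearest-neighbor resampling of a varying-size image onto a fixed grid:

```python
def resize_nearest(img, out_h, out_w):
    """Resize a grayscale image (list of rows) by nearest-neighbor sampling.

    Illustrative only: real pipelines use Pillow/OpenCV, which also offer
    smoother interpolation (bilinear, bicubic).
    """
    in_h, in_w = len(img), len(img[0])
    return [
        [img[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]

small = [[0, 1], [2, 3]]
big = resize_nearest(small, 4, 4)
# each source pixel is repeated in a 2x2 block when upscaling 2x
```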
4.3 CNN Classification Model
The suggested methodology’s numerous strategies, which are described in the
following sections, were used to train the CNN classification model.
Conv2D
Conv2D is a two-dimensional convolution layer. It creates a convolution kernel
that is convolved with the layer's input to produce a tensor of outputs.
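As a minimal sketch of what such a layer computes (Keras's Conv2D actually performs cross-correlation, shown here on a single-channel image with one kernel and no padding):

```python
def conv2d(image, kernel):
    """'Valid' 2D cross-correlation (as in Keras Conv2D) of a single-channel
    image with one kernel, using plain nested lists."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(ih - kh + 1):
        row = []
        for c in range(iw - kw + 1):
            row.append(sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw)
            ))
        out.append(row)
    return out

# A horizontal edge-detector kernel on a tiny image
image = [[1, 1, 1],
         [1, 1, 1],
         [5, 5, 5]]
kernel = [[-1, -1, -1],
          [ 1,  1,  1]]
print(conv2d(image, kernel))  # -> [[0], [12]]: the edge row responds strongly
```

A trained layer learns many such kernels, each producing one feature map.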
Activation Function
One of the choices made when building a neural network is which activation
function to employ in the hidden layers and at the output layer. Activation
functions are what make neural networks nonlinear, allowing them to construct
complex representations and functions of their inputs that a purely linear model
cannot.
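Two common choices, shown here only as illustrative examples (the paper does not specify which activations were used), are ReLU for hidden layers and the sigmoid for a binary normal-vs-pneumonia output:

```python
import math

def relu(x):
    """Rectified linear unit: max(0, x); a common hidden-layer activation."""
    return max(0.0, x)

def sigmoid(x):
    """Logistic sigmoid, squashing any real value into (0, 1); typical for
    a binary output unit."""
    return 1.0 / (1.0 + math.exp(-x))

assert relu(-3.0) == 0.0 and relu(2.5) == 2.5
assert abs(sigmoid(0.0) - 0.5) < 1e-12
```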
MaxPooling2D
The Keras MaxPooling2D layer downsamples the output of the convolution layers.
It slides a fixed-size window over each feature map and keeps only the maximum
value within each window, reducing the spatial dimensions of the tensor while
retaining the strongest activations. In Keras, this layer is provided by the
MaxPool2D class in the TF Keras layers module.
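The pooling operation itself reduces to taking window maxima; a plain-Python sketch:

```python
def max_pool2d(fmap, pool=2):
    """Max pooling with stride equal to the pool size, as in Keras
    MaxPooling2D: each output cell is the max of a pool x pool window."""
    h, w = len(fmap), len(fmap[0])
    return [
        [max(fmap[r + i][c + j] for i in range(pool) for j in range(pool))
         for c in range(0, w - pool + 1, pool)]
        for r in range(0, h - pool + 1, pool)
    ]

fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 1, 5, 6],
        [2, 2, 7, 8]]
print(max_pool2d(fmap))  # -> [[4, 2], [2, 8]], half the spatial size
```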
Batch Normalization
Batch normalization accelerates training and improves the reliability of deep
neural networks by adding extra layers to them. Each such layer standardizes the
input it receives from the preceding layer, normalizing it to zero mean and unit
variance before scaling and shifting it.
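For a single unit, the normalization step can be sketched as follows (gamma and beta stand in for the learnable scale and shift parameters; the epsilon avoids division by zero):

```python
def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one unit's activations across a batch to zero mean and unit
    variance, then scale and shift by the learnable gamma and beta."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    return [gamma * (x - mean) / (var + eps) ** 0.5 + beta for x in batch]

out = batch_norm([2.0, 4.0, 6.0, 8.0])
# the normalized batch has mean ~0 and standard deviation ~1
```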
4.4 Flattening Layer
The flattening layer converts a 2D matrix into a 1D array/vector for feeding into
the next layer. The output of the convolutional layers is thus transformed into
one long 1D vector, which is connected to the final layer.
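A minimal sketch of the flattening step:

```python
def flatten(fmaps):
    """Flatten a stack of 2D feature maps into one 1D vector for the
    dense (fully connected) layer."""
    return [v for fmap in fmaps for row in fmap for v in row]

vec = flatten([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
# two 2x2 feature maps become one 8-element vector
```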
4.5 Compiling Model
Finally, the model was compiled on Google Colab, and outputs were generated
wherein we receive distinguished images as normal or pneumonia.
5 Experiment Results and Discussion
This part provides the specifics of experiments that have been done to evaluate the
suggested architecture. The deep learning networks have been implemented using
the Keras with TensorFlow. Google Colaboratory was used for the computation in
this paper and section.
Accuracy and Loss
The model was trained for a number of epochs to check its performance. The best
results were obtained at epoch 10; the epoch 11 run showed undesired results and
was terminated. This architecture had an accuracy of 0.8781 and a loss of 0.2865,
as shown in Fig. 3.
Finally, with the same architecture and hyperparameters, the accuracy reached
90.38%.
Performance of Model
See Figs. 4, 5 and 6 and Tables 1 and 2.
Accuracy, precision, and recall were also calculated for the model. Accuracy is
the true predictions divided by total predictions (Eq. 1). Precision tells about the
preciseness of the model to predict the true label (Eq. 2). Recall can be defined as
true positive label divided by the sum of the false negative label and true positive
label (Eq. 3).
Accuracy = (TN + TP) / (TN + TP + FN + FP) (1)
Precision = TP / (FP + TP) (2)
Fig. 3 Accuracy and loss
Fig. 4 Accuracy and loss
curves for train and
validation data versus
number of epochs
Recall = TP / (FN + TP) (3)
For calculating the precision, recall, and accuracy, the confusion matrix was obtained
(Fig. 5). The confusion matrix tells about the false positive, false negative, true
positive, and true negative; hence, it could be used to analyze the model and calculate
the precision, recall, and accuracy. The F1-score is the harmonic mean of the precision and
Fig. 5 Confusion matrix of
training dataset
Fig. 6 ROC curve
Table 1 Obtained scores from the proposed approach
Precision Recall F1-Score Accuracy
89.75 79.04 84.06 90.34
Table 2 Comparison between proposed approach and existing methods
Model Number of images Accuracy Precision Recall
Ayan et al. [8] 5856 84.5 91.3 89.1
Rahman et al. [10] 5247 98.0 97.0 99.0
Proposed methodology 5856 90.38 89.75 79.04
recall (Eq. 4).
F1-score = (2 × Precision × Recall) / (Precision + Recall) (4)
The F1-score is 84.06.
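Equations (1)–(4) can be computed directly from confusion-matrix counts; the counts below are illustrative only, not the paper's actual confusion matrix:

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts,
    following Eqs. (1)-(4)."""
    accuracy = (tn + tp) / (tn + tp + fn + fp)
    precision = tp / (fp + tp)
    recall = tp / (fn + tp)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts only; see Fig. 5 for the paper's confusion matrix
acc, prec, rec, f1 = metrics(tp=79, tn=90, fp=9, fn=21)
# with these counts, recall is exactly 79/100 = 0.79
```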
Comparative Analysis with Existing Methods
Various scores of the proposed methodology and existing methods have been
compared. We examined two existing methods, from Ayan et al. [8] and Rahman
et al. [10], which used VGG16, SqueezeNet, DenseNet, and Xception and obtained
accuracies of 84.5 and 98.0%, respectively.
The proposed approach has lower accuracy than Rahman et al. [10], but it gives
higher accuracy than Ayan et al. [8] on the same dataset, and it would be easier
to deploy in real-world applications.
6 Conclusion
The purpose of this work is to extend medical expertise to situations in which
few radiologists are available. We assist in the pre-diagnosis of pneumonia in
order to avoid adverse repercussions in these areas. The creation of such an algorithm
may be advantageous for the healthcare industry. We evaluated how different pre-
trained models performed and concluded that our approach produces results that are
superior to those of some earlier works. We would like to provide the most efficient
pre-trained CNN model available for similar future research. Better algorithms will
likely be created as a result of our research.
References
1. Ortiz-Toro C, García-Pedrero A, Lillo-Saavedra M, Gonzalo-Martín C (2022) Automatic
pneumonia detection in chest X-ray images using textural features. Comput Biol Med
145:105466
2. Wang L, Wang H, Huang Y, Yan B, Chang Z, Liu Z, Zhao M, Cui L, Song J, Li F (2022) Trends
in the application of deep learning networks in medical image analysis: evolution between
2012 and 2020. Eur J Radiol 146:110069
3. Malhotra P, Gupta S, Koundal D, Zaguia A, Kaur M, Lee H-N (2022) Deep learning-based
computer-aided pneumothorax detection using chest X-ray images. Sensors 22(2278):1–23
4. Rajpurkar P, Irvin J, Zhu K, Yang B, Mehta H, Duan T, Ding D, Bagul A, Langlotz C,
Shpanskaya K et al (2017) CheXNet: radiologist-level pneumonia detection on chest X-rays
with deep learning. arXiv:1711.05225 [cs.CV]
5. Stephen O, Sain M, Maduh UJ, Jeong D-U (2019) An efficient deep learning approach to
pneumonia classification in healthcare. J Healthc Eng 2019(4180949):1–7
6. Atitallah SB, Driss M, Boulila W, Koubaa A, Ghézala HB (2022) Fusion of convolutional
neural networks based on Dempster-Shafer theory for automatic pneumonia detection from
chest X-ray images. Int J Imaging Syst Technol 32(2):658–672
7. Maselli G, Bertamino E, Capalbo C, Mancini R, Orsi GB, Napoli C, Napoli C (2021) Hierar-
chical convolutional models for automatic pneumonia diagnosis based on X-ray images: new
strategies in public health. Ann IG 33(6):644–655
8. Ayan E, Karabulut B, Ünver HM (2022) Diagnosis of pediatric pneumonia with ensemble of
deep convolutional neural networks in chest X-ray images. Arab J Sci Eng 47:2123–2139
9. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M,
Minderer M, Heigold G, Gelly S et al (2020) An image is worth 16×16 words: transformers for
image recognition at scale. arXiv:2010.11929 [cs.CV]
10. Wu H, Xie P, Zhang H, Li D, Cheng M (2020) Predict pneumonia with chest X-ray images
based on convolutional deep neural learning networks. J Intell Fuzzy Syst 39(3):2893–2907
Machine Learning-Based Binary Sentiment
Classification of Movie Reviews in Hindi
(Devanagari Script)
Ankita Sharma and Udayan Ghose
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management, Lecture
Notes in Networks and Systems 785, https://doi.org/10.1007/978-981-99-6544-1_3
Abstract Lately, there has been a remarkable surge in online movie reviews in
Hindi with the advent of the UTF-8 standard. Movie reviews are an excellent
source of sentiments; therefore, Hindi movie review classification is one of the
exciting and demanding tasks of NLP, as it helps the viewers decide whether a
film/movie is worth watching. Much work in movie reviews sentiment classi-
fication has been done mainly for resource-affluent languages. Still, preliminary
work is being done in Hindi due to its complex nature and scarce resources like
adequate-labeled datasets. This paper aims to develop a machine learning-based
solution for performing binary sentiment classification on movie reviews in Hindi
(Devanagari Script). To this end, a primary binary polarity dataset, namely, Movie
Reviews in Hindi (MRH) consisting of 5K reviews, is made. Apart from MRH,
the Hindi IIT-P movie and product review datasets are also deployed in this work.
First, all three datasets are prepared for further processing using the
preprocessing steps, and the features used are unigram, bigram, and trigram,
along with TF-IDF. Second, various state-of-the-art classifiers are applied to
all three datasets. Further, we propose and use a stacked model of classifiers
for performing binary sentiment classification on Hindi reviews. Experimental results on all
three datasets prove that the proposed stacking ensemble based on the employed
features compared favorably to all the baseline classifiers applied and achieved
reasonably high performance. Therefore, it indicates the efficacy of the proposed
stacked model for sentence level movie reviews sentiment classification in a
resource-scarce scenario.
A. Sharma (*) · U. Ghose
University School of Information, Communication and Technology, Guru Gobind Singh
Indraprastha University, Delhi, India
e-mail: ankitasharma2711@gmail.com
U. Ghose
e-mail: udayan@ipu.ac.in
Keywords Binary sentiment classification · Machine learning · Movie ·
Hindi · Stacking ensemble
1 Introduction
In today’s contemporary world, people post their opinions on social network-
ing sites like Twitter, Facebook, etc., generating massive textual content. They
also write blogs, participate in forums, and post multiple online reviews. India
has about 658 M Internet users, expected to reach 900 M by 2025. Due to the
explosive growth of data, we are drowning in data but hungering for knowledge.
Thus, mining that data for valuable insights is becoming critical [1]. It is known that
Hindi is the official language of India and the third most spoken language in the
world, with over 615 M speakers. Hindi is a communication medium in India and
many parts of the world. Making sense of Hindi text posted online is crucial to
understand its emotion. Therefore, Hindi sentiment analysis (SA) is becoming
essential. Textual SA is a process of finding the emotion or opinion of the writer
in written text. It plays a significant role in judging the writer’s perception and
guides in decision making. Till now, most of the work in SA has been done mainly
for resource-affluent languages, but only preliminary work in Hindi. With the
advent of UTF-8 standard, Hindi movie review content on the web is proliferating
[2]. The availability of voluminous Hindi textual content on the web has fueled
interest for researchers to explore this area. The Indian youth is very passionate
about Hindi cinema, and they participate proactively by writing movie reviews
(MRs) in Hindi (Devanagari) over the web; also, a significant amount of capi-
tal is invested in Hindi cinema every year [3]. It becomes tedious to go through
millions of reviews posted daily online. Therefore, there is a need to automate a
review mining or classification system to help viewers decide whether to watch
or skip a movie. Reviewing reviews gives viewers an idea about both positive and
negative movie aspects. SA research in Hindi is still developing, and recently sev-
eral studies have been conducted on SC in Hindi textual data using the machine
learning approach (MLA) and lexicon-based approach (LBA) [4]. Researchers
have primarily used their private and Hinglish datasets rather than pure Hindi data-
sets in Devanagari script. The present work provides an efficient machine learn-
ing (ML)-based solution for binary SC of MRs written in Hindi. Binary SC, in
respect of MRs, categorizes MRs as positive or negative. SA or SC can be done
at sentence, document, and aspect level [5, 6]. This study is confined to binary
sentence-level SA, which determines whether a sentence is of positive or negative
polarity. Only subjective, sentence-level MRs in Hindi (Devanagari script) have
been considered for this work. The literature review shows little research on
ensemble-based solutions for Hindi text classification. Therefore, this paper
presents a stacking classifier-based solution for
the SC of MRs in the Devanagari script. All experiments were performed on three
datasets. Moreover, to further validate the performance of our proposed solution,
different state-of-the-art (SOTA) classifiers were used individually for comparison.
The experiments presented and the results obtained strongly validate the effective-
ness of our proposed solution for binary SC in a resource-limited scenario.
The contribution of the present work is as follows:
(a) The key point of this investigation is to propose a stacking ensemble of
classifier-based solution for SC of review sentences of the movie domain
into binary sentiment category, i.e., positive or negative.
(b) A binary polarity MRH dataset comprising 5K MRs in Hindi is compiled
and manually annotated to ensure maximal quality.
(c) Analysis and preprocessing are performed on the collected dataset.
(d) TF-IDF along with unigram, bigram, and trigram are used for feature
extraction.
(e) A stacking ensemble of the classifiers-based solution has been devised for
SC of MRs sentences.
(f) To further validate the performance and for comparative analysis, various
SOTA classifiers were applied.
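As a rough sketch of the feature-extraction step in contribution (d): in practice scikit-learn's TfidfVectorizer with ngram_range=(1, 3) would typically be used; this from-scratch version uses an unsmoothed IDF and is illustrative only (the example documents stand in for preprocessed Hindi reviews).

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf(docs, n_max=3):
    """TF-IDF vectors over unigram, bigram, and trigram features.

    A from-scratch sketch with unsmoothed idf = log(N / df); scikit-learn's
    TfidfVectorizer uses a smoothed variant and L2 normalization.
    """
    feats = [sum((ngrams(d.split(), n) for n in range(1, n_max + 1)), [])
             for d in docs]
    df = Counter(g for f in feats for g in set(f))
    n_docs = len(docs)
    return [
        {g: (c / len(f)) * math.log(n_docs / df[g])
         for g, c in Counter(f).items()}
        for f in feats
    ]

vecs = tfidf(["great film great story", "weak film weak direction"])
# 'film' appears in both documents, so its idf = log(2/2) = 0
```

Vectors like these are what the baseline classifiers and the stacking ensemble consume.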
The remaining paper is as follows: Sect. 2 overviews related work done in Hindi
text SA. Section 3 describes the formed dataset, methodology used, and pro-
posed stacking ensemble for SC of Hindi reviews. Section 4 discusses the results
obtained. Lastly, we conclude our work and offer potential future directions
regarding Hindi reviews SC.
2 Literature Review
There have been plenty of research efforts for resource-affluent languages.
However, research in the Hindi language is still evolving. To the best of our
knowledge, a few studies exist in the movie domain using pure Hindi reviews;
also, most of the research works in Hindi SA have used their primary dataset to
conduct their experimental study.
This section will cover some prior studies related to SA in Hindi text.
Madan and Ghose [7] performed SA on Hindi Twitter data. Hindi MRs were collected,
and LBA, MLA, and hybrid approaches were applied. They conclude that the hybrid
approach outperforms LBA and that the decision tree (DT) classifier performed
best, obtaining an accuracy of 92.97%.
Hussaini et al. [8] have attempted to perform a score-based SA of Hindi book
reviews. An annotated dataset of 700 sentences related to Hindi book reviews is
made. Verbs, nouns, adjectives, and adverbs are used as opinion words. HSWN,
word sense disambiguation, and the Hindi subjectivity lexicon were applied.
According to the results obtained, the Hindi subjectivity lexicon performed best
and achieved an accuracy of 87.4%.
Jha et al. [9] have proposed HSAS in their paper. HSAS looks into two ways of
producing the Hindi subjectivity lexicon. The first way is to use the translator to
translate the English language to the Hindi language, and the second way is to use
an improved seed list of the Hindi language. Encouraging results were achieved;
HSAS obtained an accuracy of 80% when the seed word approach was used.
Jha et al. [10] have proposed HOMS for performing opinion mining on movie
review data. A dataset of 200 MRs was collected with equal numbers of positive
and negative reviews. A Naïve Bayes (NB) classifier was used, and for POS
tagging, only adjectives were considered. HOMS achieved an accuracy of 87.1%.
Jha et al. [11] have proposed a method to find opinions in reviews of Hindi
movies. An NB classifier, support vector machine (SVM), maximum entropy, and LBA
have been utilized. A dataset of 1000 MRs, 500 each of positive and negative
valence, was collected.
The results obtained state that accuracy increases when bigram features are used
rather than unigram features. Kumar et al. [12] have expanded the Indian
sentiment lexicon. SA was performed on Indian tweets using sentence-level
co-occurrences and DTs. The corpora, comprising 2,358,708 Hindi sentences and
109,855 Bengali sentences, were collected from an online newspaper. Accuracies
of 43.20 and 42% were obtained for the Bengali corpus, and accuracies of 49.68
and 46.25% for the Hindi corpus, for the constrained and unconstrained
submissions, respectively.
Kaur et al. [13] have proposed a new approach for Hinglish SA. A dataset of
100 positive and 100 negative reviews of the movie domain was collected. For
feature extraction, unigram, bigram, and trigram are used, and the classification
of sentiment was performed using SVM, NB, logistic regression (LR), and neural
network. Mishra et al. [14] have proposed an improvised context-specific polarity
lexicon (CSPL) resource for Hindi reviews. A total of 5200 reviews from the hotel
and movie domains were collected, and the results were compared with HSWN. The
results show that the proposed lexicon performed better, obtaining an accuracy
of 88% in the hotel domain using CSPL and 77% in the movie domain using
CSPL-extended.
Sharma and Moh [15] have attempted to predict Indian election results by
performing SA on Hindi Twitter data. A total of 42,235 tweets were collected,
and both supervised and unsupervised approaches were applied. NB and SVM
predicted a BJP win, while the dictionary-based approach predicted an INC win.
Among the three approaches applied, SVM obtained the highest accuracy of 78.4%.
Sharma et al. [16] performed SA on the Hindi language by using a modified
subjectivity lexicon. A dataset of 50 Hindi tweets was
obtained from Twitter using the hashtags "JAIHIND" and "WORLDCUP2015". The
results were compared against unigram presence, and it was concluded that the
proposed modified lexicon performs better, obtaining accuracies of 73.53 and
81.97% for the two hashtags, respectively. Singh and Lefever [17] performed
Hinglish SA using cross-lingual word embeddings. A dataset from SemEval 2020 was
used, and a supervised classification model and a transfer learning model were
applied. The results suggested that integrating cross-lingual embeddings
increases performance; an F-score of 0.556 was obtained. An attempt to detect
sarcastic sentiment in Hindi was made by Bharati et al. [18]. A dataset
comprising 4000 tweets was collected and manually labeled as sarcastic and
non-sarcastic. A sarcasm detection algorithm based on the contradiction between a tweet
and context was applied. The context with the same timestamp was used. The
results obtained outperformed SOTA approaches for Hindi sarcasm detection and
obtained an accuracy of 87%.
3 Proposed Methodology
This section describes the followed methodology and proposed stacked architec-
ture for SC of MRs in Hindi. First, the formation of the MRH dataset and the sta-
tistics of the dataset used is discussed. Second, the preprocessing step is explained,
followed by the feature extraction step, where the features used and vectorization
are discussed. Then, the model generation and SC using different baselines and
the proposed stacking ensemble are described. Finally, the performance evaluation
metrics used in this work are presented.
3.1 The Formation of Movie Reviews in Hindi (MRH)
Dataset
The MRH dataset of this work is primary and consists of 5000 review sentences.
MRs in Hindi were obtained from various online websites.1,2,3 The collected
reviews were manually labeled as positive or negative. The reviews rated with one
or two stars by the reviewers were considered negative, and the reviews rated with
more than three stars were considered positive. Neutral reviews were not consid-
ered in this work. Later, each annotation was manually reviewed by two language
experts to confirm the polarity of the label. Cohen’s Kappa was used to evaluate
the annotation quality, which yielded a score of ~ 85%. The dataset used in this
work is a CSV file with two columns: MRs text and polarity labels (PLabels). A
snapshot of the reviews in our dataset is shown in Table 1.
1 Webdunia, https://hindi.webdunia.com/bollywood-movie-review/, last accessed on 31/01/23.
2 Filmibeat, https://hindi.filmibeat.com/reviews, last accessed on 10/02/23.
3 Amarujala, https://www.amarujala.com/entertainment/movie-review, last accessed on 15/02/23.
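The inter-annotator agreement score reported above can be computed as follows (a sketch assuming two annotators and binary labels; the example label sequences are made up):

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators: observed agreement corrected for
    the agreement expected by chance."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)

a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "pos"]
b = ["pos", "pos", "neg", "pos", "pos", "neg", "pos", "neg"]
print(round(cohens_kappa(a, b), 3))  # -> 0.467 for these toy labels
```

Kappa near 1 indicates near-perfect agreement; the ~85% reported for MRH suggests the two experts labeled the reviews very consistently.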
3.2 Statistics of Datasets Used
In addition to the MRH dataset, two other datasets are used in this paper; details and statistics of all datasets used can be found in Table 2. Since we perform binary classification of MRs, our datasets contain only two polarity classes: positive and negative. The IIT-P product reviews and IIT-P movie reviews were originally annotated with four classes, namely positive, negative, neutral, and conflict [19]. For this study, only reviews with positive or negative classes were considered, while the remaining classes were ignored.
3.3 Preprocessing and Data Preparation
Since reviews are collected from online sources, they cannot be used directly for analysis; preprocessing is required to make them cleaner, more consistent, and more accurate. First, numbers, special characters, extra spaces, repeated words, and non-Hindi words were removed, and emoticons were replaced
Table 1 Snapshot of MRH dataset
MRs text | PLabels
{Citylights is a very beautiful film full of human emotions and humanity} | Positive
{Aiyaary fails to hook the audience anywhere due to weak screenplay and sluggish direction} | Negative
{Thrilling story of India's pride Parmanu} | Positive
{There is nothing new or special in the story of the film Simran} | Negative
{Tiger Shroff has given such a performance in the film Heropanti that people have become crazy about him} | Positive
Table 2 Brief statistics of the review datasets used in this work
Datasets Language Positive Negative Total reviews
MRH (Ours) Hindi 2895 2105 5,000
IIT-P movie Hindi 823 530 1,353
IIT-P product Hindi 2290 712 3,002
Machine Learning-Based Binary Sentiment
with their textual equivalents. This text cleanup was followed by tokenization. After tokenization, stop words in Hindi were removed, while negation words were left untreated, as their removal could change the meaning of the reviews.
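The cleanup steps above can be sketched as a small pipeline. This is an illustrative simplification, not the authors' exact code: the stop-word list, negation list, and emoticon mapping below are tiny hypothetical samples, and the "repeated words" step is omitted:

```python
import re

# Illustrative (not exhaustive) Hindi stop words and negation words.
STOP_WORDS = {"है", "की", "के", "में", "से", "और", "यह"}
NEGATIONS = {"नहीं", "ना", "मत"}  # kept untreated: removing them can flip polarity

EMOTICONS = {":)": "खुश", ":(": "दुखी"}  # emoticon -> assumed textual equivalent

def preprocess(review: str) -> list[str]:
    # Replace emoticons with textual equivalents before stripping symbols.
    for emo, word in EMOTICONS.items():
        review = review.replace(emo, " " + word + " ")
    review = re.sub(r"[0-9]+", " ", review)              # remove numbers
    review = re.sub(r"[^\u0900-\u097F\s]", " ", review)  # keep Devanagari only
    tokens = review.split()                              # whitespace tokenization
    # Remove stop words, but always keep negation words.
    return [t for t in tokens if t not in STOP_WORDS or t in NEGATIONS]

print(preprocess("फिल्म 2023 अच्छी नहीं है :("))
# → ['फिल्म', 'अच्छी', 'नहीं', 'दुखी']
```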
3.4 Features and Vectorization
The motive of this step is to extract features for Hindi MRs classification. It is essential because MLAs work with numeric data, so textual data must be converted into a numeric format. In our paper, we have used the most popular vectorization method, Term Frequency-Inverse Document Frequency (TF-IDF), along with N-gram features. This work considered unigrams, bigrams, and trigrams along with TF-IDF [20, 21].
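The three feature settings can be obtained from scikit-learn's `TfidfVectorizer` by varying `ngram_range`. A minimal sketch, using short English stand-ins for tokenized Hindi reviews:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-ins for two tokenized reviews (the real input would be Hindi tokens).
docs = [
    "film was very good",
    "film was very slow and boring",
]

# ngram_range=(1, 1) -> unigrams, (2, 2) -> bigrams, (3, 3) -> trigrams.
# token_pattern=r"\S+" treats every whitespace-separated token as a feature,
# which also works for Devanagari tokens.
for n in (1, 2, 3):
    vec = TfidfVectorizer(ngram_range=(n, n), token_pattern=r"\S+")
    X = vec.fit_transform(docs)
    print(n, X.shape)  # sparse TF-IDF matrix: (n_docs, n_features)
```

Each review becomes a sparse TF-IDF weight vector over the n-gram vocabulary, which is the numeric input the classifiers in the next step consume.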
3.5 Model Generation and the Brief Description of the ML
Classifiers Used
This phase aims to apply supervised MLAs for binary MRs classification on all
three datasets. The “No free lunch” theorem states that no single ML algorithm
works well for all types of problems. So, applying and experimenting with differ-
ent algorithms that fit our problem is always a good practice. We have also experi-
mented with different MLAs for classifying sentiment in Hindi MRs [22].
Naïve Bayes (NB). NB is a classification technique based on the principle of conditional probability according to Bayes' theorem. It is a simple but surprisingly powerful algorithm for predictive modeling. The "naïve" part refers to the assumption that the occurrence of a given feature is independent of the occurrence of other features, even when these features actually depend on each other.
Support Vector Machine (SVM). SVM is a supervised learning method that separates data into categories using hyperplanes. It is one of the most popular ML methods and can be used for both classification and regression problems. It creates the hyperplane that separates the classes in the best possible way, i.e., it chooses the hyperplane with maximum separation from the closest data points.
Decision Tree (DT). DT is a powerful algorithm that can be used for both clas-
sification and regression problems. It is a nonparametric model based on the
conditionality principle. An advantage of DT is that the number of parameters
does not increase when more features are added.
K-Nearest Neighbor (KNN). This is a supervised MLA used mainly for classification tasks. In this nonparametric technique, data points are classified based on the classification of their neighboring points. It is called a lazy learner because it does not build a model during training but memorizes the training dataset [23].
Logistic Regression (LR). LR, available in scikit-learn's sklearn.linear_model module, is often used for classification problems. It is an efficient model with low variance. The idea is to find a relationship between the features and the probability of a particular outcome. In our work, the binomial LR model was used because only two labels, positive and negative, are considered.
Extra Trees (ET). Extremely Randomized Trees, also known as Extra Trees, is based on an ensemble of DTs. First, a large number of unpruned DTs are created from the training data. In classification, predictions are made by majority voting: all predictions made by the trees are aggregated to obtain the final prediction. In this process, the selection of splits and features is done randomly.
AdaBoost (AB). AdaBoost, or Adaptive Boosting, is a widely used iterative ensemble method. It initially selects training data randomly and then trains iteratively, selecting each training set based on the correctness of the previous round's predictions.
Gradient Boosting Machine (GBM). GBM is based on the sequential ensem-
ble method. In this method, weak learners are generated sequentially so that the
current weak learners are always better than the previous weak learners. The
overall performance of the model improves with each iteration [24].
XGBoost (XGB). XGBoost, or eXtreme Gradient Boosting, is an enhanced version of GBM that works with an ensemble of DTs. The problem with GBM is that it computes output slowly due to its sequential analysis. XGBoost overcomes this drawback by computing output quickly, increasing the efficiency of the model. It uses cache optimization and implements distributed computation methods to improve performance.
3.6 Proposed Stacked Model of Classifiers
There is a saying, the "wisdom of crowds," that the collective opinion of a crowd is often better than that of a single expert. Combining many ML models into a single model is called ensemble learning [6], and stacked generalization, or simply stacking, is an ensemble of ensembles. In a stacking ensemble, the final estimator is trained by integrating the predictions of the different estimators [25], an approach inspired by the wisdom of crowds. In this work, a stacked model of classifiers for performing binary SC on Hindi review datasets is presented. The proposed framework combines six efficient classifiers to improve on the performance of each individual classifier. The classifiers employed in the proposed architecture were selected based on ease of implementation and their respective trade-offs: the combined classifier models in the stacked architecture balance the individual models' bias and variance. The proposed stacking method consists of one layer of estimators/classifiers, as subsequent layers increase the complexity.
The applied estimators are NBC, SVM, the boosting-based GBM and XGB, and the bootstrap-aggregation-based ET. The predictions made by these estimators are used as features for the final estimator, LR. The final estimator, also called the meta-estimator, makes the final predictions of the review labels. The meta-estimator learns from the strengths of the previously used learners and compensates for their weaknesses. To avoid overfitting, cross-validation (CV) is performed at each stacking/training step. The dataset is split into S folds, and in S successive rounds, S−1 folds are used to fit the base-level estimators in every iteration; the base-level estimators are then applied to the remaining subset that was not included for model training in that iteration. The resulting predictions are stacked and given as input data to the meta-level estimator. After training the stacked classifiers, the base-level estimators are fit to the entire dataset. The proposed architecture and the algorithm for the proposed stacked ensemble of classifiers with S-fold CV are given in Fig. 1.
Fig. 1 Proposed stacked model of classifiers for SC of MRs in Hindi
Algorithm: Proposed Stacked Model of Classifiers with S-fold Cross-Validation
Input: Hindi movie reviews set R(r1, r2, r3, …, r5000);
 Sentiment label class set Label(l1, l2, l3, …, ln);
 Estimators set Est(NBC, SVM, GBM, XGB, ET);
Output: Predicted polarity label based on the proposed stacking ensemble of classifiers
1: A stacking ensemble Sen
2: Adopt the CV approach in preparing a training set for the estimators
3: Randomly split R into S equal-size subsets: R{R1, R2, …, RS}
4: for s ← 1 to S do
5:  Step 1: Learn base-level estimators
6:  for t ← 1 to K do
7:   Learn a stacker st from R \ Rs
8:  end for
9: end for
10: Step 2: Learn a meta-level estimator LR
11: Step 3: Re-train base-level estimators
12: for t ← 1 to K do
13:  Train a classifier sk based on R
14: end for
15: return S(r) = s′(s1(r), s2(r), s3(r), …, sp(r))
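The stacked training procedure above maps closely onto scikit-learn's `StackingClassifier`. The sketch below is a simplification, not the authors' exact setup: it uses synthetic stand-in features instead of TF-IDF vectors, `GaussianNB` in place of the unspecified NBC variant, and omits XGB to stay within scikit-learn:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier,
                              GradientBoostingClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic stand-in for TF-IDF features of labeled reviews.
X, y = make_classification(n_samples=300, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Base estimators feed their predictions to the LR meta-estimator;
# cv=5 plays the role of the S-fold split in the algorithm above.
stack = StackingClassifier(
    estimators=[
        ("nb", GaussianNB()),
        ("svm", SVC(probability=True)),
        ("gbm", GradientBoostingClassifier()),
        ("et", ExtraTreesClassifier(random_state=42)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_tr, y_tr)
print(f"held-out accuracy: {stack.score(X_te, y_te):.2f}")
```

`StackingClassifier` also handles the final retraining of the base estimators on the full training set, matching Step 3 of the algorithm.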
3.7 Performance Evaluation
The performance comparison between the various SOTA ML-based classification models and our proposed stacked model of classifiers is evaluated using standard evaluation metrics, namely accuracy, recall, precision, and F measure, which are defined below [26].
Accuracy is the ratio of correctly classified reviews to the total number of reviews considered; it is a vital classification metric.
Precision gives how many of the reviews predicted as positive are actually positive.
Recall is the ratio of correctly identified positive reviews to the total number of positive reviews, and it tests the classifier's completeness.
F measure measures the accuracy of the test conducted and is the harmonic mean of precision and recall.
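In terms of the standard confusion-matrix counts — true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) — these metrics take their usual forms:

```latex
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\mathrm{Precision} = \frac{TP}{TP + FP},
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```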
4 Results and Discussion
This work aims to efficiently analyze the sentiments of viewers and reviewers
expressed in written Hindi MRs. Python 3.11.1 was used for the implementation.
To test the effectiveness of our proposed stacked model, all experiments were conducted on two sentence-level Hindi benchmark review datasets, namely the IIT-P movie and product review datasets, along with the MRH dataset. After preprocessing and feature extraction using TF-IDF with unigrams, bigrams, and trigrams, various SOTA classifier models were applied, followed by the proposed stacked model. All results were evaluated using tenfold CV. The values of the standard evaluation metrics accuracy, recall, precision, and F measure were calculated in this work, but the results are discussed only in terms of accuracy. Clustered bar graphs with accuracy, precision, recall, and F measure scores in percentage on dataset 1 (MRH), dataset 2 (IIT-P movie reviews), and dataset 3 (IIT-P product reviews) are given in Figs. 2, 3 and 4, respectively.
Considering our MRH dataset with unigrams, the results imply that the proposed architecture achieved the highest accuracy of 83%, followed by LR and XGB, which both attained an accuracy of 81%. In Fig. 2b, the proposed architecture achieved the highest accuracy of 78%, followed by MNB and LR, which attained accuracies of 77 and 76%, respectively. In Fig. 2c, a different pattern was observed: ET obtained an accuracy of 73%, outperforming the proposed model, which attained an accuracy of 69%.
The same experiment was performed on dataset 2; Fig. 3 presents the results obtained. Based on the results in Fig. 3a, the proposed architecture achieved the highest accuracy of 79%, followed by SVM, which attained an accuracy of 78%. The proposed model performed as well as the best-performing models for bigrams with TF-IDF and trigrams with TF-IDF, as shown in Fig. 3b and c. The proposed architecture was also applied to dataset 3, which contains product reviews, to check domain independence, as given in Fig. 4. In the case of unigrams with TF-IDF, the same pattern was observed: the proposed architecture outperformed the other applied SOTA models. As the n-gram range increased, i.e., in the bigram and trigram cases, the proposed model performed on par with the best-performing model. Empirical results from all three datasets suggest that the proposed architecture achieves improved overall performance in the unigram with TF-IDF case, while in the bigram and trigram with TF-IDF cases its results matched the highest accuracy obtained among the applied classifiers. The proposed architecture performed better than the individual classifiers on all three datasets and outperformed them with unigrams and TF-IDF. It was concluded from the experiments conducted on the three datasets that the proposed architecture, based on a stacking model of classifiers, is apt for performing binary SC on Hindi review datasets. It was also observed that, among the features, unigram with TF-IDF outperformed both bigram and trigram
Fig. 2 Accuracy, precision, recall, and F measure in % on dataset-1 with a unigram, b bigrams,
and c trigram along with TF-IDF as features
with TF-IDF in all three Hindi datasets applied. Another observation was that the results did not improve significantly even when applying the stacking ensemble of classifiers on higher N-grams. A responsible factor is the TF-IDF weighting used with the N-grams (unigram, bigram, and trigram): Hindi polarity-bearing words that occur frequently in Hindi reviews could be assigned a lower weight by TF-IDF,
Fig. 3 Accuracy, precision, recall, and F measure in % on dataset-2 with a unigram, b bigrams,
and c trigram along with TF-IDF as features
which could be the reason for this result. To the best of the authors' knowledge, previous work has not addressed the binary SC of Hindi MRs with the proposed architecture. The proposed architecture is characterized by better accuracy, ease of implementation, and improved performance. Moreover, it requires fewer computational resources and is apt for dealing with overfitting problems. Therefore, the
Fig. 4 Accuracy, precision, recall, and F measure in % on dataset-3 with a unigram, b bigrams,
and c trigram along with TF-IDF as features
proposed architecture is expected to help viewers and reviewers evaluate online MRs in Hindi and thus help decide whether a movie should be watched.
5 Conclusion
Nowadays, the Internet has become a perpetual podium that people use to express their sentiments. With the advent of the UTF-8 standard, Hindi textual content on the web has proliferated, as people feel more comfortable expressing their views, emotions, etc., in their native language. The availability of voluminous Hindi textual content on the Internet has sparked researchers' interest in exploring this area. This paper aims to develop an ML-based solution for performing binary SC on MRs in Hindi (Devanagari script). To this end, a binary movie-domain-oriented dataset, namely MRH, is created, and a stacked model of classifiers is proposed. After preprocessing and feature extraction, the proposed architecture is applied, which combines six classifier models to improve on the performance of the individual classifiers. The classifiers employed in the proposed architecture were selected based on ease of implementation and their respective trade-offs: the combined classifier models in the stacked architecture balance the individual models' bias and variance. Different SOTA classifiers were used individually for comparison to further validate the proposed solution's performance. The experimental results on all three datasets strongly confirm the efficacy of the proposed architecture for binary SC in a resource-deficient scenario, with unigram with TF-IDF performing the best among all the features applied. We plan to incorporate character-level and word-level features and their mélange in the proposed stacked ensemble. We will also employ a deep learning model in the meta-learning stage. To further escalate performance, we will increase the size of our dataset.
References
1. Sharma A, Ghose U (2020) Sentimental analysis of twitter data with respect to general
elections in India. Procedia Comput Sci 173:325–334
2. Kulkarni DS, Rodd SS (2021) Sentiment analysis in Hindi—a survey on the state-of-the-art
techniques. In: ACM transactions on Asian and low-resource language information process-
ing, vol 21, issue 1, pp 1–46
3. Kaur A, Nidhi AP (2013) Predicting movie success using neural network. Int J Sci Res
2(9):69–71
4. Sharma A, Ghose U (2021) Lexicon a linguistic approach for sentiment classification. In:
2021 11th international conference on cloud computing, data science and engineering (con-
fluence). IEEE, pp 887–893
5. Makhloga VS et al (2021) Machine learning algorithms to predict potential dropout in high
school. In: Data analytics and management: proceedings of ICDAM. Springer, Singapore,
pp 189–201
6. Sharma A, Ghose U (2023) Voting ensemble-based model for sentiment classification of
Hindi movie reviews. In: Computational intelligence: proceedings of InCITe2022. Springer,
Singapore, pp 473–483
7. Madan A, Ghose U (2021) Sentiment analysis for twitter data in the Hindi language. In:
2021 11th international conference on cloud computing, data science and engineering (con-
fluence). IEEE, pp 784–789
8. Hussaini F et al (2018) Score-based sentiment analysis of book reviews in Hindi language.
Int J Nat Lang Comput 7(5):115–127
9. Jha V et al (2015) HSAS: Hindi subjectivity analysis system. In: 2015 annual IEEE India
conference (INDICON). IEEE, pp 1–6
10. Jha V et al (2015) HOMS: Hindi opinion mining system. In: 2015 IEEE 2nd international
conference on recent trends in information systems (ReTIS). IEEE, pp 366–371
11. Jha V et al (2016) Sentiment analysis in a resource scarce language: Hindi. Int J Sci Eng
Res 7(9):968–980
12. Kumar A et al (2015) IIT-TUDA: system for sentiment analysis in Indian languages using
lexical acquisition. In: International conference on mining intelligence and knowledge
exploration. MIKE 2015: mining intelligence and knowledge exploration. Springer, Cham,
pp 684–693
13. Kaur H et al (2018) Dictionary based sentiment analysis of Hinglish text. Int J Adv Res
Comput Sci 8(5):816–822
14. Mishra D et al (2016) Context specific lexicon for Hindi reviews. Procedia Comput Sci
93:554–563
15. Sharma P, Moh T-S (2016) Prediction of Indian election using sentiment analysis on Hindi
twitter. In: 2016 IEEE international conference on big data (big data). IEEE, pp 1966–1971
16. Sharma Y et al (2015) A practical approach to sentiment analysis of Hindi tweets. In: 2015
1st international conference on next generation computing technologies (NGCT). IEEE, pp
677–680
17. Singh P, Lefever E (2020) Sentiment analysis for Hinglish code-mixed tweets by means
of cross-lingual word embeddings, In: Proceedings of the 4th workshop on computa-
tional approaches to code switching, Marseille, France. European Language Resources
Association, pp 45–51
18. Bharti SK et al (2017) Context-based sarcasm detection in Hindi tweets. In: 2017 ninth
international conference on advances in pattern recognition (ICAPR). IEEE, pp 1–6
19. Akhtar MS et al (2016) A hybrid deep learning architecture for sentiment analysis. In:
Proceedings of the COLING 2016, the 26th international conference on computational lin-
guistics: technical papers, Osaka, Japan. The COLING 2016 Organizing Committee, pp
482–493
20. Oussous A et al (2020) ASA: a framework for Arabic sentiment analysis. J Inf Sci
46(4):544–559
21. Mehmood K et al (2019) Sentiment analysis for a resource poor language—Roman Urdu.
In: ACM transactions on Asian and low-resource language information processing, vol 19,
issue 1, pp 1–15
22. Shah SR, Kaushik A (2019) Sentiment analysis on Indian indigenous languages: a review
on multilingual opinion mining. arXiv preprint arXiv:1911.12848
23. Hourrane O et al (2019) Sentiment classification on movie reviews and twitter: an experi-
mental study of supervised learning models. In: 2019 1st international conference on smart
systems and data science (ICSSD), Rabat, Morocco. IEEE, pp 1–6
24. Sarkar K (2020) Heterogeneous classifier ensemble for sentiment analysis of Bengali and Hindi tweets. Sādhanā 45(196):1–17
25. Wolpert DH (1992) Stacked generalization. Neural Netw 5(2):241–259
26. Jain V et al (2021) Product recommendation platform based on natural language process-
ing. In: Data analytics and management: proceedings of ICDAM. Springer, Singapore, pp
627–635
Deep Learning-Based Recommendation
Systems: Review and Critical Analysis
Md Mahtab Alam and Mumtaz Ahmed
Abstract Recommendation systems (RSs) belong to a category of information
filtering systems designed to predict the “ranking” or “preference” that users will
give to a particular item. RSs are automated instruments and strategies that assist
and increase the decision-making process by aggregating the views of individuals
and guiding them to suitable recipients. RSs are extensively utilized in various
domains, including e-commerce, social networking sites, and entertainment, and
impact everyone’s everyday life. The systems are designed to assist the user by
proposing the items that are appropriate for him or her without requiring them to
undergo the lengthy, time-consuming, and complex process of selecting from a wide
selection of items that can number in the thousands or millions. The major aim
of making recommendations based on the user’s interests is to minimize human
work. Models and algorithms are expected to catch different user preferences and
mostly identify non-dependencies between them and the multitude of items to provide
personalization. In addition, this problem is compounded by real data criteria and
ambitious real-time requirements. Many difficulties arise when developing and oper-
ating RSs. Therefore, it is compulsory to address them and design a system in which
they become mitigated or tolerable. Sparsity, Cold Start, and Scalability are a few
challenges when a user develops a recommendation system. The pervasive use of deep
learning has demonstrated its power in solving complicated tasks more efficiently
than conventional techniques. This paper seeks to stimulate advancements in RSs by
providing a thorough summary of recent research on recommendation systems using
deep learning. The surveyed articles are categorized using the taxonomy of recommendation systems that is offered. Based on the analysis of the evaluated works and the stated potential solutions, open problems are highlighted.
Keywords Recommendation systems ·Ranking ·Sparsity ·Cold start ·
Scalability ·Deep learning ·Collaborative filtering
M. M. Alam (B)·M. Ahmed
Department of Computer Engineering, Jamia Millia Islamia, New Delhi, India
e-mail: mahtab.alam57@gmail.com
M. Ahmed
e-mail: ahmedmumtaz01@gmail.com
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_4
1 Introduction
A recommendation system that offers fast and specific advice can typically draw
users’ attention and benefit businesses. In recent years, recommendation systems
have experienced exponential growth and their applications have expanded across
various real-life domains. The theory behind the success of recommendation systems
is that, before making decisions of any sort, humans have a natural tendency to base their decisions on the opinions of their friends and neighbors, particularly when purchasing certain products.
The most popular applications for recommendation systems include:
Entertainment: film, songs, and game recommendations.
Content: customized newspapers, text reviews, web page recommendations, and
email filtering.
E-commerce: customer reviews for buying items such as mobiles, desktops, stationery items, etc.
Services: expert consulting recommendations, recommendations for rental
houses, or transport services.
(A) Motivation
This work has been carried out considering the factors that underline the importance of recommendation systems based on deep learning. Since such systems target the kinds of data that are present in excess, they have become common these days. Users want to find the data that is most relevant to a particular intent within the vast volume of data available on the Internet, and this insight can be adapted so that certain templates can increasingly be used for other purposes.
It has been noted that, even now, students and other learners in different classes rely heavily on the Internet and on electronically generated data. This data is used and processed extensively to serve different objectives related to academics or other fields of analysis. These systems, however, are not confined to academia or research and have importance in other areas of real life. The problem to be tackled is the explosive volume of information: strengthening information filtering to promote decision-making and reclaim attention. Since people find it difficult to locate the data that is most relevant to their application, they choose to use suggestions to reach the right kind of data. Research like this must be conducted to determine the best approaches to the data management problem. Because these technologies ensure that internal operations within businesses run smoothly, it is difficult to ignore their use in industry. It is crucial to keep in mind that recommendation systems make it simple for customers in these industries to benefit from the right kind of information that can be used in a variety of situations. It is equally crucial to employ the learning model so that it is capable of assisting users in carrying out many hard tasks and thus enhances the user's experience. These activities may apply to one's academics or to the sector of business that seeks to support consumers in separate ways.
Recommendation systems are now commonly used and surround everyone in everyday life. Popular venues for suggestion schemes include e-commerce, social media, and entertainment. Amazon, Twitter, Spotify, Netflix, and many more use recommendation systems and machine learning to personalize content. Amazon incorporates suggestions into every aspect of the purchase process; when it recorded 29% growth in revenue in a single year in 2012, its recommendation system was viewed as the key driver [1]. Comparable success has been recorded by Netflix, where almost 80% of content consumption derives from recommendations, making them an important part of the whole network [2]. Because such systems are primarily powered by machine learning, every significant advancement in this area needs to be integrated.
(B) Challenges
Many difficulties [3] arise in the process of developing RSs. Therefore, it is essential to keep them in mind and design systems in which they become mitigated or at least tolerable. There are mainly three challenges, namely Sparsity [4], Cold Start [5], and Scalability [6].
Sparsity It is one of the major challenges and refers to the problem that a significant part of the user-item interaction matrix R is unknown. Its counterpart is density, defined as the ratio between the known entries and the size of the user-item interaction matrix. It is a common situation in RSs to have observed only a small fraction of all possible interactions. This is precisely why we need to estimate ratings and build our rankings from them. Sparsity raises not only ambiguity but also computational difficulty. Matrix factorization methods, which transform sparse user and item representations into dense ones, partially mitigate the sparsity problem.
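The matrix factorization idea can be illustrated with a small NumPy sketch: a sparse toy rating matrix R is approximated by the product of dense user and item factor matrices, fit by gradient descent on the observed entries only. The rating values, latent dimension, and learning rate below are illustrative assumptions, not from the paper:

```python
import numpy as np

# Toy user-item rating matrix; 0 marks unknown entries (the sparsity problem).
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
mask = R > 0  # True only where a rating was observed

rng = np.random.default_rng(0)
k = 2                                    # latent dimension
P = rng.normal(scale=0.1, size=(4, k))   # dense user factors
Q = rng.normal(scale=0.1, size=(4, k))   # dense item factors

lr, reg = 0.01, 0.02
for _ in range(5000):                    # gradient descent on observed entries only
    E = mask * (R - P @ Q.T)             # error restricted to known ratings
    P += lr * (E @ Q - reg * P)
    Q += lr * (E.T @ P - reg * Q)

R_hat = P @ Q.T                          # dense completion of R
print(np.round(R_hat, 1))
```

The zeros in R are replaced in `R_hat` by predicted scores, giving a dense representation from which rankings can be built.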
Cold Start It can be divided into cold-start users and cold-start items and applies to entities without interactions. The term incomplete cold start (ICS) describes entities with too few interactions rather than none, which are still relevant. Cold start poses a key concern in Collaborative Filtering (CF), since interactions are the only source of data from which to infer tastes. Therefore, new users and novel items will not receive suggestions or customized reviews in CF until interactions are observed by the system; only then can suggestions be produced. In Content-Based Filtering (CBF), the cold start problem is less severe for items, since they join the system with their feature set. For users, the issue remains problematic because the user profile learner has no data, i.e., no user-item interactions, from which user profile features can be inferred. This CBF user cold-start condition can be mitigated by a default profile that includes common features, but due to the lack of personalization, it is a mediocre solution.
Scalability It is a problem that sparsity necessarily aggravates. Real-world RSs deal with millions of items and more than a billion users while striving to meet low inference latencies, e.g., 10 ms, which leads to the classical IR dichotomy: candidate generation must balance performance and quality. These demands call for effective algorithms and parallel architectures for RSs that scale well and achieve high quality within a limited time budget.
(C) Contributions of the paper
Even though the techniques employed in present-day recommendation systems
were developed over a decade ago, the field is currently experiencing active
research due to the pervasive presence of the Internet in people’s lives and the
continuous emergence of new technologies. The primary goal of this paper
is to compile various existing techniques into a single resource and assess
them based on different parameters. The paper reaches a conclusion regarding
the integration of different recommendation techniques with ongoing technology trends, while also addressing the challenges they face. Furthermore, this paper proposes a novel hybrid technique to overcome certain limitations encountered by existing approaches.
2 Terminology and Background Concepts
The theoretical foundations of recommendation systems and deep learning are discussed in this section, along with innovations crucial to their application.
(A) Recommendation System
As shown in Fig. 1, three fundamental categories of recommendation systems are typically applied in different machine learning settings. Let us discuss these kinds of systems to illustrate how they relate to systems based on deep learning. RSs, also referred to as recommender systems [7], have mainly three types: Content-Based Filtering (CBF) RSs, Collaborative Filtering (CF) RSs, and Hybrid RSs (a combination of the two).
Collaborative Filtering Recommendation System It is a type of recommendation system in which items are recommended based on interaction history. User-based filtering, item-based filtering, and several other methods are sub-types of CF [8]. The fundamental assumption underlying such systems is that users will be more likely to target the same kinds of items they did previously, and that they are expected to pay greater attention to this kind of information in the future.
The most popular method for recommendation engines is Collaborative Filtering (CF). CF analyses interactions between consumers and interdependencies between items to recognize new user-item connections [9].

Deep Learning-Based Recommendation Systems: Review and Critical 43

Fig. 1 Classification of RSs

We need a history of interactions, such as user ratings $r_{i,j}$ on particular items, given the user set $u \in U$ and the item set $v \in V$, with $m = |U|$ and $n = |V|$. An incomplete matrix $R = (r_{i,j})$ holds these scores. Absent entries in $R$ refer to scores that are not yet observed, but that may be observed in the future. As both $m$ and $n$ are normally large in recommendation situations and users commonly interact with only a comparatively small portion of all objects, $R$ is highly sparse. CF transforms the recommendation problem into a matrix completion problem: based on the existing scores, it predicts the missing values in $R$ and displays the top-$k$ entries to the respective customer.
To solve the matrix completion problem, we differentiate between two techniques: the model-based technique and the memory-based technique. Both exploit the similarity of items and/or customers. When using user comparisons, we perform user-user CF; with item comparisons, item-item CF. Both yield comparable effects, but as $m$ is rarely as high as $n$, they can vary greatly in terms of performance [10-12].
Memory-based techniques, also referred to as neighborhood-based techniques, infer scores by comparing items or users with each other. They implement a weighted average over the ratings of the $k$ most similar users (respectively items). For this job, we may use various similarity measures, of which the most common are cosine similarity and Pearson correlation [12]. K-Nearest Neighbors (KNN) [8] is a method [13] to find the nearest neighbors. We use a three-step method to produce recommendations for a single user $x \in U$: the initial step involves computing the cosine similarity between user $x$ and all other users $y \in U \setminus \{x\}$, given by the following formula:
44 M. M. Alam and M. Ahmed

$$\mathrm{cos\_sim}(x, y) = \cos(x, y) = \frac{\sum_{i \in V_{xy}} r_{x,i} \cdot r_{y,i}}{\sqrt{\sum_{i \in V_{xy}} r_{x,i}^{2}} \cdot \sqrt{\sum_{i \in V_{xy}} r_{y,i}^{2}}} \quad (1)$$

where

$V_{xy}$: set of items rated by both user $x$ and user $y$.
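As an illustration, Eq. (1) can be computed directly over rating dictionaries. The following is a minimal pure-Python sketch; the `cos_sim` helper and the toy ratings are hypothetical, not taken from any surveyed system:

```python
from math import sqrt

def cos_sim(ratings_x, ratings_y):
    """Cosine similarity over the items co-rated by two users (Eq. 1).

    Each argument maps item id -> rating for one user."""
    common = ratings_x.keys() & ratings_y.keys()  # the set V_xy
    if not common:
        return 0.0
    num = sum(ratings_x[i] * ratings_y[i] for i in common)
    den = (sqrt(sum(ratings_x[i] ** 2 for i in common)) *
           sqrt(sum(ratings_y[i] ** 2 for i in common)))
    return num / den if den else 0.0

# Toy ratings: user -> {item: rating}
R = {
    "x": {"a": 5, "b": 3, "c": 1},
    "y": {"a": 4, "b": 3},
    "z": {"c": 5},
}
# Similarity of user "x" to every other user (step one of user-user CF)
sims = {u: cos_sim(R["x"], R[u]) for u in R if u != "x"}
```

The remaining steps of user-user CF would then select the $k$ most similar users from `sims` and average their ratings, weighted by similarity.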
Model-based approaches [16] are provided by Naïve Bayes, Clustering [14], or (probabilistic) Matrix Factorization (MF) [15]. The latent factor models, often referred to as MF-based models, present the most common model-based methodology. They rely on machine learning algorithms to create models that identify patterns, i.e., past scores, in the training data. They are also able to generalize to new data, i.e., to build suggestions, and they predict unknown scores.
Again, we begin with a set of users $U$, a set of items $V$, and a sparse rating matrix $R$. We need to approximate the missing values in $R$ and thereby complete the matrix. MF factorizes $R$ into two lower-dimensional matrices, $P$ and $Q$, whose product should be as similar as possible to $R$. Both $P$ and $Q$ map users (respectively items) into a $d$-dimensional embedding space with $d \ll \min(m, n)$. Thus, instead of representing users and objects by their sparse score vectors in $R$ as in memory-based techniques, we use latent factors that are far fewer and therefore denser. Finally, to approximate an unobserved rating $\hat{r}_{i,j}$, we use these dense representations, fitted to resemble the observed ratings $r_{i,j}$. Thus, MF aims to recreate $R$ as the product of the dense user and item representations:

$$R_{m \times n} \approx \hat{R} = P_{m \times d} \, Q^{T}_{d \times n} \quad (2)$$

$$
\begin{pmatrix}
r_{1,1} & r_{1,2} & \cdots & r_{1,n} \\
\vdots & \vdots & \ddots & \vdots \\
r_{m,1} & r_{m,2} & \cdots & r_{m,n}
\end{pmatrix}
\approx
\begin{pmatrix}
p_{1,1} & p_{1,2} & \cdots & p_{1,d} \\
\vdots & \vdots & \ddots & \vdots \\
p_{m,1} & p_{m,2} & \cdots & p_{m,d}
\end{pmatrix}
\begin{pmatrix}
q_{1,1} & q_{2,1} & \cdots & q_{n,1} \\
\vdots & \vdots & \ddots & \vdots \\
q_{1,d} & q_{2,d} & \cdots & q_{n,d}
\end{pmatrix}
\quad (3)
$$
We use the RMSE [17] between the actual and reconstructed scores to calculate the reconstruction error:

$$\mathrm{RMSE} = \sqrt{\frac{1}{|S|} \sum_{(i,j) \in S} \left( r_{i,j} - \hat{r}_{i,j} \right)^{2}} = \sqrt{\frac{1}{|S|} \sum_{(i,j) \in S} \left( r_{i,j} - P_{i} Q_{j}^{T} \right)^{2}} \quad (4)$$
where $S$ is the set of user-item tuples for all observed ratings. We must now initialize and adapt the respective latent factor vectors $p_i, q_j \in \mathbb{R}^d$ to minimize the RMSE. This resembles an optimization problem that we can address algorithmically.
Stochastic Gradient Descent (SGD) [18] SGD is a standard algorithm for such problems and works by gradually adapting the latent factor variables to minimize the RMSE, which is a differentiable function. This allows us to calculate the partial derivatives with respect to the user or item embedding variables. The negative gradient points in the direction of a local or global minimum. We guide the loss function toward its minimum by scaling the gradient with a predetermined learning rate $\alpha$ in the latent variable update. Convergence depends critically on an appropriate choice of $\alpha$:
$$p_{i} = p_{i} - \alpha \cdot \frac{\partial\, \mathrm{RMSE}}{\partial p_{i}} \quad (5)$$

$$q_{j} = q_{j} - \alpha \cdot \frac{\partial\, \mathrm{RMSE}}{\partial q_{j}} \quad (6)$$
Content-Based Filtering Recommendation System (CBF) The algorithm learns to recommend items based on the similarity of attributes. These features are part of user profiles and item representations and can be basic keywords or concept-based representations of items and user profiles. As the basis for suggestions, the system simply matches content descriptions against each other. The content of users can correspond to desires or interests, while the content of items can be textual or consist of metadata correlated with items. By matching item attributes with target user profile attributes, we obtain scores that are interpreted as a user's preference level for a given item. The profile learner creates a model based on previous experience to build user profiles, leveraging probabilistic models, customer relevance feedback, or K-Nearest Neighbors (KNN). Lastly, to produce a binary or continuous relevance judgment, the filtering component matches user profiles against item representations. These item judgments are ordered into a ranked list of items likely to interest a particular user.
User profiles combine users' rating behavior with the content of rated items and are agnostic to other users' rating behavior. To obtain a user representation, item descriptions labeled with ratings (either implicit or explicit) are used as training data. This introduces the ability to recommend novel items but does not generalize to new users [[19], p. 14 sq.].
A widely used approach is tf-idf, often known as the vector space representation. Based on a weighted vector of item attributes, the system creates a content-based user profile for each user. The weights, which indicate each feature's value to the user, can be determined by a variety of methods operating on content vectors with independently valued entries. Simple methods employ the average rating of the user-item matrix, whereas more complex ones assess the likelihood that a consumer would like the item using model-based RSs such as Artificial Neural Networks, Association Rule Mining, Bayesian Classifiers, and Cluster Analysis.
$$\mathrm{tf\text{-}idf}(n, d) = \mathrm{tf}(n, d) \cdot \mathrm{idf}(n) \quad (7)$$

$$\mathrm{tf}(n, d) = \frac{f_d(n)}{\max_{w \in d} f_d(w)} \quad (8)$$

$$\mathrm{idf}(n) = \log\frac{C}{\mathrm{df}(n) + 1} \quad (9)$$

where

tf: term frequency.
idf: inverse document frequency.
df: document frequency.
$n$: term (word).
$d$: document (set of words).
$f_d(n)$: count of term $n$ in $d$.
$\max_{w \in d} f_d(w)$: count of the most frequent word $w$ in $d$.
$C$: count of documents in the corpus (the total document set).
$\mathrm{df}(n)$: number of documents in which $n$ occurs.
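Equations (7)-(9) translate directly into code. The helper names and the toy corpus below are illustrative assumptions; note that with this idf variant, a term occurring in most documents can receive a score of zero or below:

```python
from math import log

def tf(term, doc):
    # Eq. (8): raw count normalized by the most frequent term in the document
    return doc.count(term) / max(doc.count(w) for w in set(doc))

def idf(term, corpus):
    # Eq. (9): log of corpus size over (document frequency + 1)
    df = sum(1 for doc in corpus if term in doc)
    return log(len(corpus) / (df + 1))

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)  # Eq. (7)

# Toy corpus: each document is a list of words
corpus = [["deep", "learning", "deep"], ["learning", "systems"], ["systems"]]
score = tf_idf("deep", corpus[0], corpus)
```

An item is then represented by the vector of tf-idf scores of its terms, and matched against the user profile with a similarity measure such as Eq. (1).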
Hybrid Recommendation System Most recommendation systems now use this approach, incorporating CF, CBF, and other techniques. There is no reason why it would not be feasible to hybridize many different methods, even of the same kind. Hybrid approaches can be incorporated in many ways: by independently creating and then integrating content-based and collaborative predictions; by incorporating CBF capabilities into the CF approach (and vice versa); or by integrating the strategies into one framework. Multiple experiments comparing the empirical success of hybrid methods with CBF and CF methods have shown that hybrid methods can provide more detailed recommendations than either method alone. These strategies can also help to address common challenges such as the cold start and sparsity problems. Netflix serves as a prominent example of how hybrid recommendation algorithms are implemented.
Kaur and Bathla [7] listed five types of recommendation structures based on their approaches: the CBF recommendation system, the CF recommendation system, the demographic-based recommendation system, the utility-based recommendation system, and the knowledge-based recommendation system.
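The first hybridization strategy, independently creating and then integrating content-based and collaborative predictions, can be sketched as a simple weighted blend. The function, weight, and toy scores below are hypothetical; real systems typically learn such weights from data:

```python
def hybrid_score(cf_score, cbf_score, w_cf=0.7):
    """Weighted hybrid: blend independently computed CF and CBF predictions."""
    return w_cf * cf_score + (1 - w_cf) * cbf_score

# Toy candidate items with a score from each component model
cf = {"movie_a": 4.2, "movie_b": 3.1}
cbf = {"movie_a": 3.0, "movie_b": 4.8}
blended = {item: hybrid_score(cf[item], cbf[item]) for item in cf}
top = max(blended, key=blended.get)  # item recommended first
```

A cold-start-aware variant could shift the weight toward the CBF score when the user or item has few ratings, which is one way hybrids mitigate the sparsity problem.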
(B) Deep Learning
It is a subclass of machine learning (ML) techniques that analyzes raw data through successive layers to extract increasingly complex features. For instance, lower levels can detect edges in images, while higher levels can recognize meaningful patterns, such as digits, letters, or faces, that have significance to humans. The underlying principle behind deep learning algorithms is similar to that of humans: they learn from experience.
Deep learning is generating an immense buzz. In many fields of application, such as machine vision and speech comprehension, tremendous success in deep learning (DL) [20, 21] has been seen in the last few decades. Due
to its potential to tackle complex challenges and deliver exceptional results,
researchers and businesses are racing to expand the applications of deep
learning. Recently, it has significantly transformed recommendation systems,
providing new ways to improve their effectiveness. By simplifying traditional
models and achieving high recommendation accuracy, deep learning-based
recommendation systems (DLRS) [11,22] have garnered a lot of interest
in recent times. It successfully captures the unpredictable and complicated
interaction between users and items and enables more intricate abstractions
to be embodied in the higher layers as data representations. Additionally, deep
learning leverages abundant open data sources, such as textual, visual, and
qualitative knowledge, to incorporate diverse experiences into the information
itself.
Deep learning can be further subdivided into three different forms [20]:
(1) Supervised learning (task-driven)
(2) Unsupervised learning (data-driven)
(3) Reinforcement learning
(1) Supervised Learning
It trains a model using labeled data to make predictions on new, unseen data. Different methods are used in the supervised learning process, including classification [23] and regression, which are essential for predicting answers of any kind. Artificial Neural Networks (ANNs) are modeled after biological networks of neurons. There are various categories of ANNs, including Convolutional Neural Networks (CNNs) [17], which use multilayered structures to recognize images and voices in various applications.
(2) Unsupervised Learning
It trains a model using unlabeled data to discover patterns and relationships in the data on its own. It is used when the desired output is not known and the objective is to uncover hidden data structure. Clustering [23] is a method commonly used in this type of learning, with applications such as image recognition and object interpretation.
Fig. 2 Simple neural network and multilayer neural network [24]
(3) Reinforcement Learning
It entails training a model to make decisions using feedback received from the environment. The model learns to maximize a reward through trial and error, and it is often used in robotics and gaming applications.
There are two forms of neural networks: (a) simple neural networks and (b) multilayer neural networks, which contain multiple hidden layers (also known as deep neural networks). Both are shown in Fig. 2.
Many ML models such as Support Vector Machines (SVM) [25] and
Logistic Regression have shallow architectures consisting of only one or two
layers. Despite their popularity in the 1990s, these shallow models have limited
ability to represent complex data, such as text, images, and audio, leading to
difficulties in modeling such data.
Recent experimental results suggest that deep architectures are needed to train better deep learning models. Before that, models with at most two to three layers performed better than deeper models, as deep models were more challenging to train and often yielded worse results. However, in 2006 [21], the successful training of a Deep Belief Network (DBN) to predict handwritten digits using a layer-wise training methodology marked the first successful exploration of deep models.
In the past, researchers had not fully utilized deep models primarily because of
limited data availability and computational power. However, deep architecture
models, which typically consist of multiple layers, have the capacity to learn
a hierarchical representation of features, starting from low-level features and
progressing to high-level features.
A Restricted Boltzmann Machine (RBM) is a generative stochastic ANN that can learn a probability distribution over its inputs. It has been successfully applied in various applications, including CF and dimensionality reduction. An RBM consists of two layers: a visible layer and a hidden layer. The connections between the two layers are symmetrically weighted, and the units within each layer have no connections with each other.
Deep Neural Networks (DNNs) [26] are a type of ANN with multiple layers of interconnected nodes, also known as a Multilayer Perceptron (MLP) [27]. Backpropagation (BP) [28] is needed to train DNNs. They are
designed to learn and model complex relationships between inputs and outputs,
and they have been extremely successful in solving various tasks in speech
recognition, natural language processing (NLP), and many other areas. A DNN
takes an input, passes it through multiple hidden layers, and finally produces
an output. Each hidden layer is made up of many artificial neurons, and each
neuron receives inputs from the previous layer, performs computations on them,
and then sends the results to the next layer.
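The forward pass described above can be sketched in pure Python. The layer sizes, weights, and activation choices below are arbitrary illustrative assumptions (training via backpropagation is omitted):

```python
from math import exp

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def dense(inputs, weights, biases, act):
    # Each neuron: weighted sum of the previous layer's outputs plus a bias,
    # passed through an activation function
    return [act(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

# Hypothetical fixed weights for a tiny 2-3-1 network
W1 = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
b1 = [0.0, 0.1, -0.1]
W2 = [[0.7, -0.5, 0.2]]
b2 = [0.05]

hidden = dense([1.0, 2.0], W1, b1, relu)      # hidden layer of 3 neurons
output = dense(hidden, W2, b2, sigmoid)       # single output in (0, 1)
```

Stacking more `dense` calls yields deeper networks; each additional layer can, in principle, represent a more abstract level of features.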
Deep Auto Encoders (DAEs) are a specific type of Deep Neural Network (DNN). Unlike DNNs, which are supervised learning algorithms, DAEs are unsupervised learning algorithms in which the input and output are the same. This design allows the middle layer's output to serve as a dense representation. Like DNNs, DAEs can be pre-trained using DBNs. A technique for pre-training deep multilayer autoencoders was proposed in [29]. This approach treats consecutive layers as Restricted Boltzmann Machines and uses pre-training to approximate a reasonable parameter initialization.
Convolutional Neural Networks (CNNs) [30] are a type of ANN archi-
tecture that is designed to work with grid-structured data, such as an image.
In CNNs, the layers are organized in a way that is meant to gather relevant
features from the input data in an ordered manner. The first layer typically
extracts low-level features such as edges and simple shapes, while deeper layers
extract higher-level features and abstract concepts. The final layers of CNNs
are usually fully connected, meaning that they take the features gathered by the
preceding convolutional layers and produce a prediction. In summary, CNNs
are a powerful deep learning tool for processing grid-structured data and have
been successful in various applications.
Deep learning [31] is a hot and evolving field in both the DM and ML communities. These models can be trained with either supervised or unsupervised methods. Deep learning models first made a significant impact on computer
vision and audio, voice, and language processing. They have performed better
than many existing models in these areas. For different NLP functions, later deep
models have proven their efficacy. Semantic parsing [32], automatic translation
[33], sentence modeling, and several typical NLP tasks [11] are included in
these tasks.
Growing research activity on applying DL to RSs has been seen in the last
two years. Several neural network architectures have been tested by researchers,
such as autoencoders (AEs) [34,35], recurrent neural networks (RNNs) [8,
22,36], or CNNs [17], regular feedforward, or wide and deep architectures
[30]. Nevertheless, analysis at this intersection is either in its infancy or has received little consideration. In the following, we survey a range of methods with different architectures and point out accomplishments and limitations as well as commonalities and specialties. The list is not exhaustive and attempts to offer a first impression of alternative strategies. Papers can be divided into deep learning-based RS papers, surveys and summaries, and others that do not directly discuss the convergence of DL and RSs but offer theoretically applicable contributions.
(C) Evaluation Metrics
The evaluation metrics [35, 37, 38] for assessing the success of RSs, as well as the nature of such an assessment, are presented in this section. Different metrics are used to quantify different domains. To quantify the accuracy of ratings as numerical quantities, work in this area usually measures the absolute difference and the square root of the squared difference between actual and predicted ratings, using the mean absolute error (MAE) [39] and the root mean squared error (RMSE) [17].
Mean Absolute Error To know how close the actual outputs are to their predicted values, the MAE is measured. Mathematically, the MAE is the average of the absolute differences between the original values and the predicted values:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - y_p \right| \quad (10)$$

where

$n$: total number of observations/rows in the dataset.
$i$: index variable.
$y_i$: actual value.
$y_p$: predicted value.
Mean Squared Error (MSE) It measures the average squared difference between the original values and the predicted values:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - y_p \right)^{2} \quad (11)$$

where

$n$: total number of observations/rows in the dataset.
$i$: index variable.
$y_i$: actual value.
$y_p$: predicted value.
Root Mean Squared Error (RMSE) It is commonly used for evaluating the performance of regression models. It is the square root of the average of the squared differences between the original values and the predicted values:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - y_p \right)^{2}} \quad (12)$$

where

$n$: total number of observations/rows in the dataset.
$i$: index variable.
$y_i$: actual value.
$y_p$: predicted value.
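Equations (10)-(12) translate directly into code. A minimal sketch with hypothetical actual and predicted rating vectors:

```python
def mae(y_true, y_pred):
    # Eq. (10): mean absolute difference
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    # Eq. (11): mean squared difference
    return sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    # Eq. (12): square root of the MSE
    return mse(y_true, y_pred) ** 0.5

actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 2.5]
errors = (mae(actual, predicted), mse(actual, predicted), rmse(actual, predicted))
```

Because MSE squares each residual, it penalizes large rating errors more heavily than MAE, which is why RMSE and MAE on the same predictions (as in Table 1) can rank models differently.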
3 Deep Learning-Based Recommendation System
This section gives an analysis of the various works that have been proposed. Table 1 presents a comparison of the models.
4 Discussion
Surveying recommendation systems can help in understanding the current techniques
[48,49], identifying emerging trends [50], and assessing the effectiveness of different
techniques and algorithms. In this section, we will analyze and interpret the findings
from the literature and discuss their implications for our recommendation system.
The discussion based on the literature review provides valuable insights into the
effectiveness of different RSs, including CF, CBF, hybrid approaches, and deep
learning methods. Furthermore, addressing different types of challenges, selecting appropriate evaluation metrics, and considering ethical and privacy considerations are essential for designing an effective and responsible recommendation system. To mitigate the issues identified in the literature review, this paper proposes a novel hybrid technique to overcome certain limitations encountered by existing approaches.
5 Conclusion
In this paper, we gave a thorough analysis of the most significant research on DLRS to date. To determine the contributions and aspects of these studies, we also performed a brief statistical analysis of these works. We presented several significant research prototypes, assessed their benefits and drawbacks, and discussed relevant applications. We
Table 1 Model analysis

Existing work | Year | Methods | Evaluation metrics | Dataset | Results
[31] | 2017 | Stacked denoising autoencoder | RMSE | Netflix movie | 1.049
[16] | 2015 | Hierarchical Bayesian model | MAP | Netflix | 0.031
[40] | 2017 | Stacked denoising autoencoder (SDAE) and matrix factorization (MF) | RMSE | Book Crossing / Movielens 100k / Movielens 1M | 0.924 / 0.508 / 0.502
[41] | 2020 | Autoencoder | RMSE | Movielens 1M / Movielens 10M | 0.029 / 0.010
[42] | 2020 | DLCRS | RMSE | Movielens 100k / Movielens 1M | 0.917 / 0.903
[43] | 2019 | K-nearest neighbors (KNN) | Accuracy | ICU patient | 95.6%
[44] | 2020 | Matrix factorization | RMSE and MAE | Book Crossing / Movielens 100K | 25.78 and 19.69% / 19.69 and 14.08%
[35] | 2018 | Autoencoder | MAP | Movielens 100k / Movielens 10M | 0.223 / 0.179
[45] | 2018 | Opinion mining with experts | MAE and RMSE | Books | 0.97, 4.08
[46] | 2021 | Collaborative filtering and support vector machine classifier | Average accuracy | User speech emotion information | 87.2%
[47] | 2022 | Session based | HR@k | News (Adressa / Globo / MIND) | 0.1658 / 0.1852 / 0.0495
also include some of the most urgent unsolved issues and intriguing potential future
developments. Deep learning and recommendation systems both remain immensely popular study areas today. Every year, many new and innovative approaches and strategies are developed. Here, we lay out a complete framework for comprehending
the fundamental ideas in this area, describe the most significant developments, and
offer some insight into potential future research.
References
1. Hallinan B, Striphas T (2016) Recommended for you: the Netflix prize and the production of
algorithmic culture. New Media Soc 18(1):117–137
2. Koren Y (2009) The BellKor solution to the Netflix grand prize, pp 1–10
3. Thirumaran E (2009) Collaborative filtering based recommendation systems. In: Handbook of
research on text and web mining technologies, p 16
4. Barjasteh I, Forsati R, Masrour F, Esfahanian A-H, Radha H (2015) Cold-start item and user
recommendation with decoupled completion and transduction. In: RecSys’15 Proceedings of
the 9th ACM conference on recommender systems, pp 91–98
5. Qin C et al (2020) A survey on knowledge graph-based recommender systems. Sci Sin Inf
50(7):937–956
6. van den Berg R, Kipf TN, Welling M (2017) Graph convolutional matrix completion. arXiv:1706.02263 [stat.ML], https://doi.org/10.48550/arXiv.1706.02263
7. Kaur H, Bathla G (2019) Techniques of recommender system. Int J Innovative Technol
Exploring Eng 8(9S):373–379
8. Devooght R, Bersini H (2016) Collaborative filtering with recurrent neural networks. arXiv:1608.07400 [cs.IR], https://doi.org/10.48550/arXiv.1608.07400
9. Hu Y, Koren Y, Volinsky C (2008) Collaborative filtering for implicit feedback datasets. In:
2008 eighth IEEE international conference on data mining, Pisa, Italy. IEEE, pp 263–272
10. Bhasker B (2012) Comparative study of collaborative filtering algorithms. In: KDIR’12
Proceedings of the international conference on knowledge discovery information retrieval,
pp 132–137
11. Zhang S, Yao L, Sun A, Tay Y (2019) Deep learning based recommender system: a survey and
new perspectives. ACM Comput Surv 52(1):1–38
12. Ekstrand MD, Riedl JT, Konstan JA (2010) Collaborative filtering recommender systems.
Found Trends Hum-Comput Interact 4(2):81–173
13. Al-Garadi MA et al (2020) A survey of machine and deep learning methods for internet of
things (IoT) security. IEEE Commun Surv Tutorials 22(3):1646–1685
14. Sohail SS, Siddiqui J, Ali R (2017) Classifications of recommender systems: a review. J Eng
Sci Technol Rev 10(4):132–153
15. Ricci F, Rokach L, Shapira B, Kantor PB (eds) (2011) Recommender systems handbook. Springer
16. Wang H, Wang N, Yeung D-Y (2015) Collaborative deep learning for recommender systems.
In: KDD’15: Proceedings of the 21th ACM SIGKDD international conference on knowledge
discovery and data mining, pp 1235–1244
17. Sahoo AK, Pradhan C, Barik RK, Dubey H (2019) DeepReco: deep learning based health
recommender system using collaborative filtering. Computation 7(25):1–18
18. Vo ND, Hong M, Jung JJ (2020) Implicit stochastic gradient descent method for a cross-domain
recommendation system. Sensors (Switzerland) 20(2510):1–16
19. Aggarwal CC (2016) Recommender systems: the textbook. Recommender Syst 39(4):8–21
20. Quadrana M, Karatzoglou A, Hidasi B, Cremonesi P (2017) Personalizing session-based
recommendations with hierarchical recurrent neural networks. In: RecSys’17: Proceedings
of the eleventh ACM conference on recommender systems, pp 130–137
21. Mukhopadhyay S (2018) Deep learning and neural networks. In: Advanced data analytics using
python. Apress, Berkeley, CA, pp 99–119
22. Covington P, Adams J, Sargin E (2016) Deep neural networks for YouTube recommendations.
In: RecSys 2016 Proceedings of the 10th ACM conference on recommender systems, pp 191–
198
23. Lu J, Wu D, Mao M, Wang W, Zhang G (2015) Recommender system application developments:
a survey. Decis Support Syst 74:12–32
24. Karatzoglou A, Hidasi B (2017) Deep learning for recommender systems. In: Proceedings of
the eleventh ACM conference on recommender systems, RecSys’17, pp 396–397
25. Shalaby W et al (2017) Help me find a job: a graph-based approach for job recommendation
at scale. In: 2017 IEEE international conference on big data (Big Data), Boston, MA, USA.
IEEE, pp 1544–1553
26. Papadakis H, Fragopoulou P, Michalakis N, Panagiotakis C (2018) A mobile application for
personalized movie recommendations with dynamic updates. In: 2018 international conference
on intelligent systems (IS), Funchal, Portugal. IEEE, pp 507–514
27. Li M, Gao W, Chen Y (2020) A topic and concept integrated model for thread recommenda-
tion in online health communities. In: CIKM’20: Proceedings of the 29th ACM international
conference on information and knowledge management, pp 765–774
28. Zhang S, Tay Y, Yao L, Wu B, Sun A (2019) DeepRec: an open-source toolkit for deep learning
based recommendation. In: Proceedings of the twenty-eighth international joint conference on
artificial intelligence (IJCAI-19), pp 6581–6583
29. Elkahky AM, Song Y, He X (2015) A multi-view deep learning approach for cross-domain
user modeling in recommendation systems. In: WWW’15 Proceedings of the 24th international
conference on World Wide Web, pp 278–288
30. Rajkomar A et al (2018) Scalable and accurate deep learning with electronic health records.
npj Digit Med 1(18):1–10
31. Wei J, He J, Chen K, Zhou Y, Tang Z (2017) Collaborative filtering and deep learning based
recommendation system for cold start items. Expert Syst Appl 69:29–39
32. Pazzani MJ (1999) A framework for collaborative, content-based and demographic filtering.
Artif Intell Rev 13(5):393–408
33. van den Oord A, Dieleman S, Schrauwen B (2013) Deep content-based music recommendation.
In: Advances in neural information processing systems, vol 26 (NIPS 2013), pp 1–9
34. Strub F, Gaudel R, Mary J (2016) Hybrid recommender system based on autoencoders. In:
DLRS 2016: Proceedings of the 1st workshop on deep learning for recommender systems, pp
11–16
35. Li T, Ma Y, Xu J, Stenger B, Liu C, Hirate Y (2018) Deep heterogeneous autoencoders
for collaborative filtering. In: 2018 IEEE international conference on data mining (ICDM),
Singapore. IEEE, pp 1164–1169
36. Hidasi B, Karatzoglou A, Baltrunas L, Tikk D (2016) Session-based recommendations with recurrent neural networks. arXiv:1511.06939 [cs.LG], pp 1–10, https://doi.org/10.48550/arXiv.1511.06939
37. Felfernig A, Burke R (2008) Constraint-based recommender systems: technologies and
research issues. In: ICEC’08 Proceedings of the 10th international conference on electronic
commerce, article no 3, pp 1–10
38. Chen C-W, Lamere P, Schedl M, Zamani H (2018) Recsys challenge 2018: automatic
music playlist continuation. In: RecSys 2018 Proceedings of the 12th ACM conference on
recommender systems, pp 527–528
39. Shen Y, Lv T, Chen X, Wang Y (2016) A collaborative filtering based social recommender
system for e-commerce. Int J Simul Syst Sci Technol 17(22):91–96
40. Dong X, Yu L, Wu Z, Sun Y, Yuan L, Zhang F (2017) A hybrid collaborative filtering model
with deep structure for recommender systems. In: AAA’17 Proceedings of the thirty-first AAAI
conference on artificial intelligence, pp 1309–1315
41. Ferreira D, Silva S, Abelha A, Machado J (2020) Recommendation system using autoencoders.
Appl Sci (Switzerland) 10(5510):1–17
42. Aljunid MF, Dh M (2020) An efficient deep learning approach for collaborative filtering
recommender system. Procedia Comput Sci 171:829–836
43. Neloy AA, Oshman MS, Islam MM, Hossain MJ, Zahir ZB (2019) Content-based health
recommender system for ICU patient. In: Multi-disciplinary trends in artificial intelligence.
MIWAI 2019. Lecture notes in computer science, vol 11909. Springer, Cham, pp 229–237
44. Davagdorj K, Park KH, Ryu KH (2020) A collaborative filtering recommendation system
for rating prediction. In: Advances in intelligent information hiding and multimedia signal
processing. Smart innovation, systems and technologies, vol 156. Springer, Singapore, pp
265–271
45. Sohail SS, Siddiqui J, Ali R (2018) Feature-based opinion mining approach (FOMA) for
improved book recommendation. Arab J Sci Eng 43:8029–8048
46. Kim T-Y, Ko H, Kim S-H, Kim H-D (2021) Modeling of recommendation system based on
emotional information and collaborative filtering. Sensors 21(1997):1–25
47. Gong S, Zhu KQ (2022) Positive, negative and neutral: modeling implicit feedback in session-based news recommendation. In: SIGIR'22 proceedings of the 45th international ACM SIGIR conference on research and development in information retrieval, Madrid, Spain, pp 1185–1195
48. Sharma M, Mittal R, Bharati A, Saxena D, Singh AK (2021) A survey and classification on
recommendation systems. In: Proceedings of the 2nd international conference on big data,
machine learning and applications (BigDML 2021), Silchar, India, pp 19–20
49. Bukhari SNH, Jain A, Haq E, Mehbodniya A, Webber J (2021) Ensemble machine learning
model to predict SARS-CoV-2 T-cell epitopes as potential vaccine targets. Diagnostics
11(1990):1–18
50. Bukhari SNH, Webber J, Mehbodniya A (2022) Decision tree based ensemble machine learning
model for the prediction of Zika virus T-cell epitopes as potential vaccine candidates. Sci Rep
12(7810):1–11
Retention in Second Year Computing
Students in a London-Based University
During the Post-COVID-19 Era Using
Learned Optimism as a Lens:
A Statistical Analysis in R
Alexandros Chrysikos and Neal Bamford
Abstract The aim of the current research project is to investigate the low retention
rate in second-year undergraduate computing students at a London-based university.
The research was conducted in 2022, during the post-COVID-19 era, using learned optimism as a lens, and is compared with the 2021 study by Chrysikos et al. [1]. The main aim is to support the university's efforts to improve its retention rate, as overall dropout has been increasing in the last few years. The research methodology employed was
an exploratory investigation approach by using statistical modelling analysis in R
to predict behavioural patterns. The study aimed to discover any effect the CODE-
It initiative had on student grades and optimism scores, to quantify its success as
an initiative. The primary outcome of the data analysis indicates that the CODE-It
initiative had a positive impact on student optimism scores, particularly among black
ethnicity students. Additionally, a slight increase in optimism was observed among
the least optimistic students. The return to in-person interaction with classmates
and lecturers may have played a significant role in raising the minimum scores
compared to the 2021 study [1]. Nevertheless, many students continue to grapple
with the lasting effects of the post-pandemic era, particularly in matters of financial
hardship. Finally, of those students who did attend CODE-It, 85% indicated that they
felt it was a worthwhile exercise. Specifically, black ethnicity students had a higher
proportion of attendance and were no longer the student ethnicity group with the
lowest optimism score.
Keywords Learned optimism · Student retention · Computing · R programming · Quantitative research · Data analysis
A. Chrysikos (B)·N. Bamford
London Metropolitan University, 166-220 Holloway Road, London N7 8DB, UK
e-mail: a.chrysikos@londonmet.ac.uk
N. Bamford
e-mail: n.bamford@londonmet.ac.uk
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_5
1 Introduction
In 2021, Chrysikos and other authors conducted a study on student retention that
was published in 2023 [1]. The study focused on identifying the reasons for higher
than usual dropout rates among foundation and first-year undergraduate computing
students at a London-based university. To accomplish this, a survey was conducted
among these students to collect relevant data. This data was analyzed using R, a
statistical modelling language, to explore potential links between optimism levels
and retention. The overall conclusion was that students with a foreign qualification
were optimistic (comprising 31% of the students), while students with other or an
unknown qualification were mildly pessimistic (comprising 43% of the students).
Students with a Bachelor of Technology (B.Tech), higher education diploma or A
level qualification were generally more pessimistic (comprising 26% of the students),
especially if they were also of black ethnicity (comprising 5%), or were also not of
black ethnicity, aged under 34 and British (comprising 5% of the students).
To further identify factors affecting optimism, the authors conducted a similar
survey for the same group of students in 2022, with a specific focus on the black
ethnicity group, which had been identified as having the lowest optimism scores and
therefore faced the greatest risk of dropping out. Although the survey sections and
questions remained the same, a further section was included which asked the students if
they had been involved in an initiative run by the university studied known as CODE-
It. The study aimed to discover any effect the CODE-It initiative had on student grades
and optimism scores, to be able to quantify its success as an initiative and finally
make recommendations for a further study. CODE-It is a short programming training
course aiming to prepare students to solve real-world projects. Arranging students
in teams of at least three and no more than five gives them an opportunity to be creative
and innovative in solving real-world problems on a single theme.
2 Literature Review
Non-continuation in UK universities has been an issue over the past several years. In
2019/2020, the percentage of both young and mature students leaving HE fell
by 1.3% and 1.8%, respectively, from previously consistent yearly values [2]. The figures cover
UK students who, having not left within 50 days of commencement, did not continue
in HE after their first year, broken down by HE provider and academic year of entry. Similar data
is seen for both Scotland and Wales for young students and Scotland for mature
students, with a nominal increase in Wales. In general, retention rates in England
are comparatively favourable when compared with international institutions. A rate of
72% of students in 2021 is significantly above the international average of 39% for
bachelor’s degrees [3]. However, dropout rates among young undergraduates have
increased over the past fifteen years, only reducing slightly in 2019/20 [2].
Recent research conducted by [4] found that London has the highest
dropout rate of all English regions, and the capital struggles to keep students. Those
universities with a higher intake of black ethnicity students are more likely to see
students from disadvantaged backgrounds not complete their studies. However, more
selective higher education institutions have lower non-continuation rates for black
ethnicity students than white. Gender has also been seen to be a major factor in contin-
uation rates. Only binary gender data is available currently, which shows that comple-
tion rates for female students are 11% higher than males. Furthermore, London
universities have a high proportion of students from low socio-economic backgrounds
and from ethnic minorities, which partly explains the higher than average dropout
rates seen within the region. However, at the same time, students attending London
universities tend to come from areas with high university participation rates, and
students from such areas typically have lower dropout rates
[5].
There seems to be no evidence that dropout rates are linked to the standing or
academic success of an institute, as some universities with gold or silver awards
have dropout rates much higher than the benchmark. Investigating the various
demographics of those who dropped out, students deemed to be mature (an age
of 21 or over is categorised as mature in the UK) were twice as likely to drop out
of university as those students entering straight after A levels. Two main concepts
were identified as factors which could explain the likelihood of a student continuing
with their studies or dropping out, these being a sense of belonging and a level of
engagement [6].
In the most recent survey conducted by the Higher Education Policy Institute
(HEPI), data reveals that the majority of white students—61%—feel a positive sense
of belonging, while for other student groups, the sense of belonging is significantly
less evident: Asian 48%; Black 46%; Chinese 46%; mixed 53%; and Other 43% [7].
However, a new questionnaire on loneliness identified that higher education can be
a lonely place, with nearly one-in-four feeling lonely “all” or “most” of the time
[7]. It may not always be possible for a student to engage fully in university life in
a way that would not affect them academically. Two reasons for this are financial
and time constraints. One small study of institutions in London found that travel
or commuting time stayed a significant predictor of student progression or
continuation for England-domiciled full-time undergraduates at three of the six London
institutions participating in the study [3]. In the case of mature students, they may
have been out of education for some time and might also have work and home life
to balance [8].
3 Methodology
Quantitative analysis involves the systematic collection of data and its statistical,
mathematical, and computational treatment to obtain results. Numerical
data is analysed using statistical techniques to answer questions such as how,
how many, how much, what, where, when and who
[9]. The quantitative data is then analysed and modelled using the R programming
language in RStudio. The purpose of adopting a quantitative approach is to create
and implement statistical models, theories and hypotheses related to the subject of
research. A quantitative approach is used to bring out a conclusive result for the
objective.
The data collection method employed was a questionnaire in the form of a survey.
The data collected through the survey was then explored to discover and summarise
the characteristics of the data (see also Sect. 4). An exploratory analysis was then
performed, specifically regression tree analysis, with scatter and box plots used
to show how various aspects of
the data relate to each other. In the current research, the outcome was the optimism
score as the target variable and the predictive variables were split into two feature
sets. The first feature set consisted of attendance of CODE-It, gender, age, ethnicity,
disability, full or part-time student and level of study. The second feature set consisted
of attendance of CODE-It, gender, age, ethnicity, disability, full or part-time student,
level of study and average component mark.
4 Data Collection
The data under analysis was collected in the form of an online survey from the
computing students who participated in the 2021 study [1]. The survey was structured
in the three following sections.
Section 1 Respondent Consent Consisting of seven questions, this section was
concerned with making the student aware of the nature of the survey and seeking
their permission to use the data in the research in line with the General Data Protection
Regulation (GDPR).
Section 2 Optimism Questions Adapted from the “Learned Optimism” survey [10],
this section consisted of 30 questions, whose responses were applied to the Optimism
Test Scoring Sheet and interpreted with its guide. From the survey data, an overall score
was obtained using the optimism test scoring sheet values, specifically:
• A student’s pessimism score when unpleasant events happen,
• Optimism score when good events happen,
• Total optimism score and
• Hope score.
The pessimism score is the score when unpleasant events happen and is the total
of questions answered with the I (5/30), D (5/30) or F (5/30) option. There was
a maximum pessimism score of 15 available across the 30 questions on the
questionnaire. The optimism score is the score when good events happen and was the
total of questions answered with the H (5/30), E (5/30) or B (5/30) option. There
was a maximum optimism score of 15 available across the 30 questions on the
questionnaire. Total optimism was calculated as the optimism score minus the
pessimism score.
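As an illustration, the scoring scheme described above can be sketched as follows. The study's actual analysis was carried out in R; this Python sketch only reproduces the option letters and the optimism-minus-pessimism rule stated in the text, and any answer lists are made up.

```python
# Learned Optimism scoring as described above. The option letters and the
# total = optimism - pessimism rule come from the text; the rest is illustrative.
PESSIMISM_OPTIONS = {"I", "D", "F"}   # scored when unpleasant events happen
OPTIMISM_OPTIONS = {"H", "E", "B"}    # scored when good events happen

def score_respondent(answers):
    """answers: the 30 single-letter option codes from one survey response."""
    pessimism = sum(1 for a in answers if a in PESSIMISM_OPTIONS)
    optimism = sum(1 for a in answers if a in OPTIMISM_OPTIONS)
    return {
        "pessimism": pessimism,        # maximum 15 across the questionnaire
        "optimism": optimism,          # maximum 15 across the questionnaire
        "total_optimism": optimism - pessimism,
    }
```

For instance, a respondent answering I to five questions, H to ten and a non-scoring option to the remaining fifteen would receive a pessimism score of 5, an optimism score of 10 and a total optimism score of 5.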
Section 3 Final Question A new addition to the current study’s survey was the
inclusion of a question asking students whether they participated in the CODE-It
initiative and, if so, whether it was a positive or negative experience for them.
As this study is in its second year and includes participation in the CODE-It initia-
tive, the results from the current study were combined with the students’ current
average component marks and the average module results from the 2021 study
[1]. This merging of data is a common practice in research to compare multiple
variables from different sources and domains.
5 Data Analysis and Discussion
After the data merging was completed, some transformations were made to aid
analysis and some cases were filtered out, either due to missing data or because student consent
was not provided. From the original 74 cases, 7 cases did not give consent, leaving
67 cases which could be used for analysis.
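The merge-and-filter step described above might look like the following. This is a hedged Python sketch with synthetic toy data: the study's analysis was in R, and the field names (`student_id`, `consent`, `avg_module_score`) are hypothetical illustrations, not the survey's actual fields.

```python
# Toy 2022 survey responses and 2021 module marks (synthetic data).
survey_2022 = [
    {"student_id": 1, "consent": "yes", "total_optimism": 2},
    {"student_id": 2, "consent": "yes", "total_optimism": -1},
    {"student_id": 3, "consent": "no",  "total_optimism": 0},
    {"student_id": 4, "consent": "yes", "total_optimism": 3},
]
marks_2021 = {1: 62, 2: 55, 3: 48, 4: 71}  # average module score by student

# Join the 2021 marks onto the 2022 survey, then drop non-consenting cases,
# mirroring the reduction from 74 collected cases to 67 analysable ones.
merged = [dict(row, avg_module_score=marks_2021.get(row["student_id"]))
          for row in survey_2022]
analysable = [row for row in merged if row["consent"] == "yes"]
print(len(analysable))  # 3 of the 4 toy cases remain
```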
The 2021 study highlighted four recommendations for further analysis [1]:
(1) Contrast in optimism of students with foreign qualifications and UK qualifica-
tions,
(2) Exploration of factors causing black ethnicity students to be less optimistic,
(3) Expand the research to other universities, and
(4) Compare year-on-year of student satisfaction levels from the National Student
Survey.
Items 1 and 3 remain an ambition and should be considered in future research.
Item 2 forms the basis of the main analysis of this study. Item 4 is discussed in the
following section.
5.1 Exploration of Factors Causing Black Ethnicity Students
to Be Less Optimistic
Carrying on from the 2021 study, data comparisons were conducted to show any
major similarities or differences in the data distribution. All data variables which
exist in both years’ studies were included, and in the case of the current study the
extra variables of average component mark (how the student is currently progressing
in their studies), average module score (how the student performed in 2021) and if
they attended the CODE-It initiative were also included.
Comparative Analysis of Year-on-Year Data
Ethnicity Comparing ethnicity between 2021 and 2022, significant differences
occurred in White Ethnicity (~8% increase) and Other Ethnicity (~6% increase).
Gender Comparing gender between 2021 and 2022, no major differences were seen
in percentages.
Disability Comparing disability between 2021 and 2022, a 6% rise was seen in those
with some form of disability.
Component Mark Module marks were compared using average component mark
and average module score. On average, module score was down by 8%, due to the
natural increase in difficulty of study between first and second years at university,
commonly referred to as “Second Year Slump” [11]. The next survey should include
a set of questions asking if the student found the next level of study more difficult
than the previous year.
Optimism Optimism can be seen to have improved by 3 points at the minimum level,
with a drop of 1 point at the maximum. The mean was the same and the median
was within 1 point. Therefore, it can be summarised that optimism did increase.
This might be explained by the post-COVID-19 pandemic effect of the return to
in-class teaching rather than online. Other contributing factors might include
the CODE-It initiative, which gave students the chance to collaborate on fun team-based
activities; students being able to interact with their classmates in lectures and
tutorials; and face-to-face, in-person time with lecturers. In addition, speakers
from relevant industry backgrounds of all ethnicity types were invited to talk to the
students about a range of topics including interview tips, C.V. preparation, how to
achieve higher grades and projects in industry within relevant fields.
5.2 Analysis of Optimism Score by Feature Grouping
Further analysis was conducted on the optimism scores to assess how various variable
groupings, including ethnicity, gender, disability, and component marks, contributed
to the results.
Optimism Grouped by Ethnicity Analysis based on ethnicity revealed that White
Ethnicity students exhibited both the minimum and maximum optimism scores but,
on average, tended to be pessimistic. Students of Black Ethnicity, while still showing
average pessimism, had improved their minimum scores compared to the previous
year (comparing 2021 and 2022). On the other hand, students of Asian Ethnicity
tended to be pessimistic on average, while those of Other Ethnicities demonstrated
average levels of optimism.
Optimism Grouped by Gender Gender-based grouping significantly influenced
the mean and median values of the optimism score, although there were no substantial
differences at the extremes.
Optimism Grouped by Disability There was no significant difference in optimism
scores between students with disabilities and those without.
Optimism Grouped by Component Mark Grouping optimism by binned average
component mark showed that those students with marks < 50 (21%) had the lowest
mean and median optimism score. Those with marks ≥ 80 (7%) had the highest
mean and median score due to feeling positive about their academic achievement
obtained.
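The binning of average component marks before grouping optimism scores can be sketched as follows. This is a hedged Python sketch: the study's analysis was in R, the bin edges are assumptions based on the < 50 and ≥ 80 bands mentioned in the text, and the (mark, optimism) pairs are synthetic.

```python
# Assumed mark bands (only the <50 and >=80 bands are stated in the text;
# the intermediate edges are illustrative guesses).
def mark_band(mark):
    if mark < 50:
        return "<50"
    if mark < 60:
        return "50-59"
    if mark < 70:
        return "60-69"
    if mark < 80:
        return "70-79"
    return ">=80"

# Toy (mark, optimism) pairs; group optimism scores by mark band.
records = [(34, -1.0), (47, -0.5), (55, 0.4), (68, 1.2), (74, 1.9), (88, 2.6)]
by_band = {}
for mark, optimism in records:
    by_band.setdefault(mark_band(mark), []).append(optimism)

# Mean optimism per band: lower mark bands show lower mean optimism here,
# echoing the pattern described in the text.
means = {band: sum(v) / len(v) for band, v in by_band.items()}
print(means)
```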
In this section’s analysis, we observed an 8% increase in White Ethnicity students
and a 6% decrease in Other Ethnicity compared to the 2021 study. For Black Ethnicity,
Asian Ethnicity, and those with unknown ethnicity, changes were below 5%. Gender
distribution remained similar to 2021, with changes of less than 5%. Disability saw
a 6% increase among students with a disability compared to 2021, while changes
for those without disabilities were below 5%. Median and mean component grades
decreased by 8 to 9%, which is expected given the increased difficulty between year
one and year two undergraduate courses. In terms of optimism, we observed a 3.00
point increase at the lowest level (from – 8.00 to – 5.00) and a 1.00 point decrease at the
highest level (from 9.00 to 8.00). The mean remained unchanged, while the median
decreased by 1.00 point (from 2.00 to 1.00). When grouping optimism scores by
ethnicity, we found that Other Ethnicity had the highest mean score at 1.25, followed
by Black Ethnicity (0.93), White Ethnicity (0.42), and Asian Ethnicity (0.40). White
Ethnicity showed the widest range of optimism scores, ranging from – 5.00 to 8.00.
Grouping optimism scores by gender, we noted that females were, on average, more
optimistic (1.81) than males (0.02), with minimum and maximum values within 1.00
point of each other. However, there were no significant differences when grouping
optimism by disability. Finally, when grouping optimism by binned average
component marks, students with an average below 50 had the lowest maximum score (3.00),
the second-highest minimum score (– 4.00), and the lowest mean score (– 1.00).
5.3 Regression Tree Analysis of Optimism Scores
Two feature sets were used in regression tree analysis. The first set excluded the
average component mark and the second included it, to see what effect it had. Compared to
the 2021 study [1] the variables qualification and work experience could not be used
as they were not recorded on the survey. In the current study, the following predictor
variables were used: attended CODE-It, gender (M/F), age, ethnicity, disability (Y/
N), full or part-time and study level (degree or foundation).
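To illustrate how a regression tree selects among such predictors, the following minimal Python sketch finds the single binary split that most reduces the squared error of the optimism score. The study's analysis used R; the student records and the 0/1 encodings of the predictors below are synthetic assumptions, kept only to show the split-selection mechanics.

```python
def sse(values):
    """Sum of squared errors around the mean of a candidate leaf."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(rows, target, features):
    """Pick the binary feature whose 0/1 split most reduces SSE of target."""
    base = sse([r[target] for r in rows])
    best = None
    for f in features:
        left = [r[target] for r in rows if r[f] == 0]
        right = [r[target] for r in rows if r[f] == 1]
        gain = base - (sse(left) + sse(right))
        if best is None or gain > best[1]:
            best = (f, gain)
    return best

# Synthetic student records (encodings and values are illustrative only).
students = [
    {"attended_code_it": 1, "gender_female": 1, "foundation_year": 1, "optimism": 3.1},
    {"attended_code_it": 1, "gender_female": 0, "foundation_year": 0, "optimism": 1.6},
    {"attended_code_it": 0, "gender_female": 0, "foundation_year": 0, "optimism": 0.0},
    {"attended_code_it": 0, "gender_female": 0, "foundation_year": 1, "optimism": -0.9},
    {"attended_code_it": 1, "gender_female": 1, "foundation_year": 0, "optimism": 1.8},
    {"attended_code_it": 0, "gender_female": 1, "foundation_year": 1, "optimism": 2.0},
]

feature, gain = best_split(students, "optimism",
                           ["attended_code_it", "gender_female", "foundation_year"])
print(feature)
```

A full regression tree simply applies this split search recursively within each resulting group, and a feature's importance reflects how much error reduction its splits contribute.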
5.3.1 Feature Set 1
A regression tree analysis, using the previously mentioned predictor variables, was
conducted and produced variable importance rankings. Specifically, it was observed that the
10% of Black Ethnicity students who were at or below an optimism score of 1.50
in 2021 were no longer there, and the lowest score of – 0.89 (29%) comprised
White Ethnicity students.
The next group with a score of 0.29 (10%) was a combination of Asian, Black
and Other Ethnicities. In both cases this applied to male students who attended a
foundation year prior to entry.
The next set of scores, 0.00 and 1.60, were males of all ethnicities who did not
partake in a foundation year. This group was evenly split at 15% each: those who
attended the CODE-It initiative had an optimism score of 1.60, compared to 0.00 for
those who did not. At 1.60, the attendees were moving away from pessimism towards
an average optimism score, showing a clear positive effect of attending CODE-It on
optimism scores, regardless of demographic factors.
The final set of scores (31%) are for females split by study level. For those who
did not attend a foundation year (19%), the score was pessimistic in contrast to those
who did attend a foundation year (12%) with the highest optimism score of 3.10
which was just above the high average level.
It can, therefore, be stated that, compared to the 2021 study, Black Ethnicity
students had improved their level of optimism to 2.90, near the 2021 maximum
of 3.00. White Ethnicity male students (28%) who attended a foundation
year had the lowest optimism score of – 0.89. For male students who attended CODE-
It, optimism scores were improved by 1.60 points. Foundation year female students
were 3 points more optimistic than the equivalent non-foundation students.
5.3.2 Feature Set 2
Feature Set 2 included the average component mark, but it did not produce significant
variable importance. Specifically, it was observed that when average component mark
is added as an explanatory variable, there were two distinct groups: students with
an average component mark of < 51 and students with an average component mark
of ≥ 51. For marks < 51 (27%), students were pessimistic at 0.93 regardless
of any other variable. Those students with an average module score of ≥ 51 were
further split into two groups, male and female. The male group was split: 16%,
with a score of 0.56, were those who attended a foundation year. For those male
students who did not attend a foundation year (27%), their score was 1.20, which was
heading towards an average optimism score of 2.00. The final distinct group (30%)
and the most optimistic by one whole point with an average optimism score of 2.20
were females. It can also be observed that those groups at most risk when taking into
account optimism as an indicator are students with average component marks < 51
(27%) and male students who attended a foundation year with an average component
mark of ≥ 51 (16%).
5.4 Analysis of Attendance of CODE-It
Although the variable importance of attending CODE-It showed relevance in the
Feature Set 1 of the regression tree analysis, it did not show significant relevance in
the Feature Set 2. Therefore, a separate analysis of its effect on the average increase
of marks was conducted. The attendance based on ethnicity was also explored. This
time the results were of significant relevance. Specifically, the results showed that
26 students attended CODE-It. For 11, their average module mark increased by a
median of 10 and an average of 11 points; however, 15 students saw a drop in their
average module mark by a median of 12 and average of 16. This contrasts with 30
students who did not attend CODE-It where 10 saw a median increase of 9 and an
average increase of 32. Finally, there were 20 students who did not participate in
CODE-It and saw a median decrease of 17 and an average decrease of 56. Of those
26 students who attended CODE-It and graded their experience (positive or negative),
the majority, 22 (85%), thought it was a positive experience compared to 4 (15%) who
did not.
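The comparison of median mark increases and decreases by attendance group can be sketched as follows. This is a toy Python sketch: the study's analysis was in R, and the mark changes below are synthetic values, not the survey's data.

```python
import statistics

# Year-on-year change in average mark per student, by CODE-It attendance
# (synthetic toy data for illustration only).
mark_change = {
    "attended":     [10, 12, -5, 9, -14, 11],
    "not_attended": [9, -20, -15, 8, -30, 7],
}

# Summarise increases and decreases separately within each group, as the
# analysis above does when comparing attendees with non-attendees.
summary = {}
for group, changes in mark_change.items():
    increases = [c for c in changes if c > 0]
    decreases = [c for c in changes if c < 0]
    summary[group] = {
        "median_increase": statistics.median(increases),
        "median_decrease": statistics.median(decreases),
    }
print(summary)
```

On this toy data the non-attending group shows the deeper median decrease, the same direction of effect the text reports.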
Analysing the data and grouping by ethnicity showed that the largest participating
group of students by ethnicity based on percentage of ethnic group were those who
identify as Black Ethnicity (60%), then White (42%), Asian (40%) and Other (25%).
This could in part explain the increase in optimism levels in that group and further
research should be conducted to ascertain a correlation. In addition, the data anal-
ysis by attendance of CODE-It by gender showed that 47% were female students,
while 43% were male. The attendance of CODE-It by study level showed a higher
attendance by foundation degree students (58%) compared to a 42% of degree-level
students.
All this information suggests that attending CODE-It had a positive effect on
the participating students’ grades, showing a slight increase in grades among
attendees and a less negative effect on grade reduction. Specifically, most students
(85%) who attended CODE-It thought it was a positive experience. Additionally, a
higher percentage of Black Ethnicity students attended CODE-It, which could be
a contributing factor to their increased optimism compared to the previous year.
Finally, there is no significant difference in attendance based on gender or study
level (both important variables in the regression tree analysis findings), being at 5%
in each case. From these results, it could be argued that CODE-It should be continued
as a worthwhile exercise, further refined and its effects studied in any similar future
studies. With the analysis completed in this section, the implications of the findings
are discussed in the following section.
6 Implications
The analysis and interpretation of the survey data revealed four prominent
implications, categorised into two related groups. The first group encompasses findings
consistent with those of the 2021 study [1], focusing on the ethnicity group with
the lowest average optimism scores and the optimism scores of the entire student
population.
Ethnicity of Students with the Lowest Optimism Score The student ethnicity
group with the lowest optimism score in the 2021 analysis, those of Black Ethnicity,
were no longer the lowest ethnicity group in 2022. The ethnic group with the lowest
optimism score are now those students who identify as White Ethnicity. The increase in
optimism of the Black Ethnicity students may be attributable to the higher proportion
of that group of student’s participation (60% attendance) in the CODE-It initiative
compared to the White Ethnicity group of students (42% attendance).
Slight Increase in the Lower Optimism Score Year-on-Year On average, opti-
mism has increased by three points at the lower end, decreased by only one point,
and remained relatively unchanged in both the mean and median in both years. The
return to in-person interaction with classmates and lecturers could be a significant
factor in raising the minimum score compared to the 2021 study. However, there is
still a very real post-pandemic effect being experienced by many students, especially
around matters of hardship and finance [8]. As it was possible to observe the effects
of the average component mark in the regression tree analysis, a second feature set
including that data (for cases where it was available) was run. It found that the least
optimistic students were those with an average component mark of ≤ 50. For each feature set, females
remained the most optimistic.
The second group of implications became apparent through the inclusion of new
data related to attendance and the CODE-It experience, as well as the impact of
the natural phenomena affecting a statistically significant number of second-year
undergraduate students—commonly referred to as the ‘Second Year Slump’—in
comparison with their first-year experiences.
Decrease in Median Average Component Score Year-on-Year in Line
with Recognised “Second Year Slump” With the addition of the average compo-
nent mark (2022) and average module result (2021), a median and mean drop of 8%
was observed. The so-called “Second Year Slump” is a phenomenon researched in
the U.S., but recognised as an international experience [11]. Students are observed
to become generally less satisfied with their university experience and their priori-
ties change. They also reported feeling unprepared for the overall workload of the
second year, in particular the volume of assessments. This is something which should
be factored into and observed in the next survey.
Quantifiable Positive Effect of the CODE-It Initiative on Average Component
Score and Optimism Levels It was possible to analyse the effects of the CODE-It
initiative against the survey data collected. This was done by comparing average
increase and decrease in component score and showed that those who attended
CODE-It saw less of a decrease (by 5 points) and a slight increase (by 1 point)
compared to those who did not attend. Grouping the CODE-It attendance by ethnicity
showed that those students who identified as Black Ethnicity had a higher proportion
of attendance (60%) and were no longer the student ethnicity group with the lowest
optimism score. Finally, for those students who did attend CODE-It, 85% indicated
that they felt it was a worthwhile exercise. Therefore, it is recommended to continue
running CODE-It while improving and continuously measuring its effects.
7 Limitations
Several limitations were encountered in conducting the research. One would have
had no effect on the analysis, although it would have been interesting to have the
additional data; two others, it could be argued, could have had a small effect on the
analysis. Regardless, interesting and relevant conclusions were obtained for this study.
The first limitation was the inability to roll out the survey to multiple university
schools.
The second limitation was the level of engagement by students. Although all
the students who continued to the second year of their academic studies were asked
to complete the survey, it was not possible to obtain feedback from all of them.
However, a statistically significant number of students did complete the survey.
The third limitation was the lack of data from students who did not continue
their studies into the second year because they dropped out. Although these
students were contacted, none of them responded. It is difficult to postulate the
reason for this; therefore, a better
mechanism for obtaining feedback in such cases might be sought in order to gather
as much relevant data as possible.
The fourth limitation was the exclusion of variables in regression tree analysis.
Specifically, not all the variables used in the regression tree analysis of the 2021 study
were available in the current study. These were: qualification and work experience.
However, this seemed not to have a detrimental effect on the Feature Set 1 analysis,
which gave results comparable to those of the 2021 study [1].
8 Conclusion
Students in the UK continue to experience the lingering effects of the global
pandemic. Despite the return to in-person and face-to-face teaching, optimism levels
have not significantly increased overall. However, there was a slight increase at the
lower end. In the second-year study, students identifying as Black Ethnicity moved
from the lowest optimism group to the second-lowest, with those identifying as White
Ethnicity now in the lowest optimism group. This change may be partly attributed to
a higher percentage of Black Ethnicity students participating in the CODE-It initia-
tive. Furthermore, average module scores have slightly decreased at the overall mean
level, which is a phenomenon known as the “Second Year Slump” [11]. It is essential
to monitor this trend, as the average component mark is a significant variable in the
regression tree analysis.
9 Recommendations for Further Research
The first recommendation for further research relates to the evaluation of the
CODE-It initiative. It is highly recommended to further investigate this initiative,
which has demonstrated a positive impact on student grades and optimism
levels, particularly among Black Ethnicity students. Given that 85% of students who
attended the initiative reported a positive experience, conducting more surveys to
gather additional student feedback would be beneficial.
The second recommendation is related to the inclusion of predictive
factors. Consider incorporating predictive factors identified in previous studies into
future research. These factors may include the impact of commuting [12], student
financial hardship [8], and the natural increase in academic difficulty between the
first and second years of undergraduate studies [11]. Surveys that capture students’
academic experiences from year 1 to year 2 could provide valuable insights.
A third recommendation is to study the effect of ethnicity groups as a contributing
factor to levels of optimism. Conduct a separate study to examine the influence of
ethnicity on students’ optimism levels. Specifically, investigate how different ethnic
backgrounds contribute to variations in optimism levels among students and explore
strategies to enhance these levels.
Finally, conducting comparative analysis with other universities is recom-
mended. To gain a broader perspective on computing students’ well-being and opti-
mism levels, consider including other universities in future studies. Comparative
analyses could provide insights into the effectiveness of initiatives like CODE-It and
help identify best practices to improve student welfare and optimism levels across
different institutions.
References
1. Chrysikos A, Ravi I, Stasinopoulos D, Rigby R, Catterall S (2023) Retention of computing
students in a London-based university during the covid-19 pandemic using learned optimism
as a lens: a statistical analysis in R. In: Arai K (ed) Intelligent computing. SAI 2023. Lecture
notes in networks and systems, vol 711. Springer, Cham. https://doi.org/10.1007/978-3-031-
37717-4_16
2. HESA (2022) Non-continuation summary: UK performance indicators. Retrieved from https://www.hesa.ac.uk/data-and-analysis/performance-indicators/non-continuation-summary
3. Hillman N (2021) A short guide to non-continuation in UK universities. In: HEPI.
Retrieved from https://www.hepi.ac.uk/2021/01/07/a-short-guide-to-non-continuation-in-uk-
universities/
4. Keohane N (2017) On course for success? Student retention at university. In: VOCEDplus, the
international tertiary education and research database. Social Market Foundation. Retrieved
from https://www.voced.edu.au/content/ngv:77230
5. Priestley M, Hall A, Wilbraham SJ, Mistry V, Hughes G, Spanner L (2022) Student perceptions
and proposals for promoting wellbeing through social relationships at university. J Furth High
Educ 46(9):1243–1256
6. Vytniorgu R (2022) To encourage a sense of belonging among students, avoid excessive focus
on identity differences and increase engagement with local communities. In: HEPI. Retrieved
from https://www.hepi.ac.uk/2022/11/17/to-encourage-a-sense-of-belonging-among-stu
dents-avoid-excessive-focus-on-identity-differences-and-increase-engagement-with-local-
communities/
7. HEPI (2022) Students signal significant bounce-back in the value of their studies.
Retrieved from https://www.hepi.ac.uk/2022/06/09/students-signal-significant-bounce-back-
in-the-value-of-their-studies/
8. Shearing H (2022) Hardship funding for students doubled last year. In: BBC News. BBC.
Retrieved from https://www.bbc.com/news/education-61883656
9. Apuke OD (2017) Quantitative research methods: a synopsis approach. Arab J Bus Manage
Rev (Kuwait Chapter) 6(10):40–47. https://doi.org/10.12816/0040336
10. Seligman MEP (2018) Learned optimism: how to change your mind and your life. Nicholas
Brealey Publishing
11. Milson C (2015) Disengaged and overwhelmed: why do second year students
underperform? In: The Guardian. Guardian News and Media Limited. Retrieved
from https://www.theguardian.com/higher-education-network/2015/feb/16/disengaged-and-
overwhelmed-why-do-second-year-students-underperform
12. Hillman N (2021) A short guide to non-continuation in UK universities. Higher Education
Policy Institute
Alzheimer’s Disease Knowledge Graph
Based on Ontology and Neo4j Graph
Database
Ivaylo Spasov, Sophia Lazarova, and Dessislava Petrova-Antonova
Abstract Recently, a massive amount of data has become available for research on
Alzheimer’s disease. However, the data entities are stored with different names
at different levels of granularity and in various formats. Thus, a comprehensive
knowledge graph is needed to facilitate the development of analytical models related
to Alzheimer’s disease. In our previous work, we created the Alzheimer’s disease
Ontology for Diagnosis and Preclinical Classification (AD-DPC), a domain ontology
incorporating the knowledge of medical experts in an understandable way for individ-
uals with no medical background. This paper extends our work by employing Neo4j
graph database technology and AD-DPC to build a domain-specific knowledge graph.
Data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) is used to popu-
late the knowledge graph and to validate its data retrieval and visualisation capabili-
ties. The knowledge graph contains 2996 diagnoses, 154,953 psychometric findings,
24,102 blood findings, 12,471 CSF findings, and 14,703 brain imaging findings from
MRI or PET scanning. The nodes were further annotated with 259,260 labels and
673,325 relations based on the AD-DPC ontology. The results demonstrate the efficacy
of using ontologies as a basis for the semantic modelling of graph databases, which
further offer straightforward and intuitive data querying and visualisation support.
Keywords Alzheimer's disease data modelling · Knowledge graphs · Neo4j
I. Spasov
Rila Solutions, Sofia, Bulgaria
S. Lazarova ·D. Petrova-Antonova (B)
GATE Institute, Sofia University, St. Kliment Ohridski, Sofia, Bulgaria
e-mail: dessislava.petrova@gate-ai.eu
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_6
1 Introduction
Dementia puts an ever-growing physical, emotional, and financial strain on patients
and their families [13]. According to the World Health Organisation (WHO),
55 million people worldwide live with dementia, a figure expected to reach a staggering
139 million by 2050 [4]. Alzheimer's disease (AD) is the most common type of
dementia, accounting for 60–70% of all dementia cases [2]. Numerous attempts have
been made to explain the causes of the disease, but none of the generated hypotheses
is universally accepted. Thus, the underlying causes of AD’s pathological changes
remain unknown [5]. The lack of knowledge about AD’s root causes contributes to
the complexity of the disease and renders it neither curable nor preventable [6].
The eagerness to find a cure and slow down the progression of the disease has
recently led to an intense interest in the applications of Big Data in AD research. The
AD research community produces large amounts of data, including patient profiles,
anamnestic data, genomic data, neuroimaging and molecular biomarkers, and cogni-
tive and neuropsychiatric assessments. This data is expanded with data from mobile
devices such as wearables and smartphones [7]. Leveraging the collection, aggrega-
tion, and analysis of these large data volumes changes the landscape of AD research
by shedding light on its aetiology or contributing to timelier diagnosis and preven-
tion. For example, machine learning and statistical methods are commonly used to
develop AD screening tools, early detection algorithms and decision-support tools
based on brain imaging data [8], combinations of non-imaging features [9], speech
patterns [10], and even novel biomarkers [11]. However, despite the vast amount
of data available, these data are often scattered and quite heterogeneous regarding
organisation and formatting [12]. Thus, the success of advanced analytics heavily
depends on the adequate standardisation and interoperability of medical data. An
ontology-based approach is needed to explicitly define the semantics of the domain
and map data across heterogeneous data sources [13].
Knowledge graphs (KGs) are heterogeneous knowledge bases modelled through
ontologies and graph databases. They store data in a semantically structured manner,
support drawing new conclusions through reasoning, and provide context that facilitates
machine learning (ML) models [14]. Prominent examples of KGs of biological
data are the Monarch Initiative [15] and Pheno4J [16]. The Monarch Initiative is
a large-scale endeavour which uses an ontology-based strategy in combination with
a graph database to integrate massive amounts of heterogeneous genotype–pheno-
type data and reveal complex relationships within it. Similarly, Pheno4J uses the
Human Phenotype Ontology (HPO) to build a Java-based solution that loads anno-
tated genetic variants and well-phenotyped patients into the Neo4j database [17].
There are several implementations of KGs for AD focused on extracting and organ-
ising knowledge from scientific articles [18,19], identifying candidates for drug
repurposing [20,21], studying depression as a risk factor for AD [22], representing
knowledge for the nonpharmacological treatment of psychotic symptoms in dementia
[23], and visualisation of dementia risk factors [24]. However, to the best of our
knowledge, there are no existing KGs focused on AD diagnosis and preclinical
classification.
In our previous work, we created Alzheimer’s disease Ontology for Diagnosis and
Preclinical Classification (AD-DPC) [25]. It incorporates the knowledge of domain
experts while keeping it understandable for individuals outside the medical domain.
It aims to facilitate knowledge exchange between medical and technological experts
in interdisciplinary teams. This paper proposes a KG modelled through the AD-DPC
and a Neo4j graph database. It is populated with data from the Alzheimer’s Disease
Neuroimaging Initiative (ADNI)1. Its utility is validated regarding data retrieval and
data visualisation. The main contributions of the paper are as follows:
– Integration of various datasets from ADNI based on the AD-DPC ontology in a
common data repository, enabling data analytics and development of ML models.
– Development of an Alzheimer's disease KG supporting semantic data interoperability
and knowledge sharing.
– Implementation of a fully operational Neo4j graph database, compliant with the
AD-DPC ontology and providing intuitive data interaction and visualisation.
The rest of the paper is organised as follows. Section 2 describes data preparation.
Section 3 presents data modelling in the Neo4j database. Section 4 shows sample
queries written in Cypher and corresponding results returned as graphs. Finally,
Sect. 5 discusses the results and concludes the paper.
2 Data Preparation
Alzheimer's Disease Neuroimaging Initiative (ADNI)2 is a multicentre longitudinal
study aiming to understand the changes occurring during the progression of
Alzheimer’s disease. ADNI offers access to large amounts of data collected from
cognitively normal (CN) subjects, subjects with mild cognitive impairment (MCI)
and subjects with Alzheimer’s disease (AD). The repository contains demographic,
clinical, neuropsychological, neuroimaging, and biochemical biomarker data. For
our work, we used the data corresponding to the concepts outlined by the medical
experts in AD-DPC, containing 16,227 entries of about 2404 participants.
The demographic and anamnestic data includes age, gender, years of education,
family history of dementia, blood pressure, and body mass index (BMI). Our data
sample contains longitudinal data. Therefore, each entry has a timestamp encoded
as the number of months since the baseline visit. The baseline corresponds to ‘0’,
6 months after the baseline corresponds to '06', and so on. Follow-up visits were
conducted approximately every 6 months.

1 Data used in preparation of this article were obtained from the Alzheimer's disease Neuroimaging
Initiative (ADNI) database (https://adni.loni.usc.edu/). As such, the investigators within the ADNI
contributed to the design and implementation of ADNI and/or provided data but did not participate
in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://
adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf.
2 See footnote 1.

Longitudinal results from 12 neuropsychological assessments are considered,
namely Mini-Mental State Examination (MMSE),
Montreal Cognitive Assessment (MoCA), Alzheimer’s Disease Assessment Scale
(ADAS), Clinical Dementia Rating (CDR), Functional Activities Questionnaire
(FAQ), Rey Auditory Verbal Learning Test (RAVLT), clock drawing and copying
task, Boston naming test (BNT), American National Adult Reading Test (ANART),
verbal fluency task, logical memory task (delayed and immediate recall). Cere-
brospinal fluid (CSF) assessments and blood plasma biomarkers are also available.
In particular, we included CSF concentration of amyloid-beta 42 (Aβ42), total tau
(t-tau), and phosphorylated tau (p-tau) as well as plasma concentrations of p-tau181
and neurofilament light (NfL). The included APOE4 status is a marker of genetic
predisposition to developing AD. It is binary encoded, with 1 designating that the
participant carries at least one copy of the ε4 allele.
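The visit timestamp convention described above (baseline '0', '06' for six months after baseline, and so on) can be sketched with a small helper. This is our illustration, not code from the paper: the two-digit zero-padding is an assumption inferred from the '0'/'06' examples, and ADNI's actual visit codes may differ.

```python
def visit_code(months_since_baseline: int) -> str:
    """Encode a visit timestamp in the convention described in the text:
    the baseline visit is '0'; later visits are (assumed) zero-padded
    month counts such as '06', '12', ..., '108'."""
    if months_since_baseline < 0:
        raise ValueError("months since baseline cannot be negative")
    if months_since_baseline == 0:
        return "0"
    return f"{months_since_baseline:02d}"

# Follow-up visits occur roughly every 6 months:
codes = [visit_code(m) for m in range(0, 19, 6)]  # ['0', '06', '12', '18']
```

Encoding the timestamp as a string rather than an integer mirrors how the codes appear as node properties in the queries later in the paper (e.g. `{months: '72'}`).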
Finally, we included results from fluorodeoxyglucose-positron emission tomog-
raphy (FDG-PET) and florbetapir-PET (AV45-PET) imaging along with volumetric
data extracted from magnetic resonance imaging (MRI) images. Brain FDG-PET is
commonly used to estimate the distribution of neural injury, and AV45-PET is used
to visualise accumulations of Aβ42 plaques in the brain. This version of the KG does
not contain actual brain images, only data extracted from them. However, storing
brain images remains to be implemented in future work.
3 Modelling Data in Neo4j
To map the ontology to a database that is suitable for storing real-world medical data,
we outlined three generic groups of concepts that are represented in the graph: (1)
Participant data (demographic and anamnestic); (2) Clinical findings (results from
tests and assessments); and (3) Diagnosis.
These generic groups were enriched with attributes and relations from the
ontology. Corresponding timestamps (defined as Zero-Dimensional Temporal
Regions) were added, along with assessment definitions, result interpretation scales,
and diagnostic process descriptions. The result interpretation scales were modelled
as follows. Each laboratory result can be treated as either "positive", "negative", or
"invalid" with respect to the volume of a target biochemical compound within a
sample. Therefore, a scale node was defined with the following properties: unit,
minimum value for a negative reading, maximum value for a negative reading,
minimum value for a positive reading, and maximum value for a positive reading.
This way, each test result can be evaluated against the respective scale and labelled
according to three criteria:
– if a score is between the min/max negative, the test outcome is negative;
– if a score is between the min/max positive, the test outcome is positive;
– every other value is treated as an invalid outcome.
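The three criteria above can be sketched in a few lines. This is a minimal illustration of the evaluation logic only; the property names on the scale node are ours, not the actual database schema, and the example scale values are invented.

```python
from dataclasses import dataclass

@dataclass
class Scale:
    """Result interpretation scale node, as described in the text.
    Property names here are illustrative, not the real schema."""
    unit: str
    min_negative: float
    max_negative: float
    min_positive: float
    max_positive: float

def evaluate(score: float, scale: Scale) -> str:
    """Label a laboratory result against its scale using the three criteria."""
    if scale.min_negative <= score <= scale.max_negative:
        return "negative"
    if scale.min_positive <= score <= scale.max_positive:
        return "positive"
    return "invalid"  # any value outside both ranges

# Example: a hypothetical CSF biomarker scale in pg/mL.
csf_scale = Scale("pg/mL", 0.0, 199.9, 200.0, 1500.0)
```

Keeping the bounds on a dedicated scale node (rather than hard-coding them per test) means new assays can be added to the graph without changing the evaluation logic.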
Each participant has a dedicated node (Participant) attributed by base information
such as participant ID, age, years of education and gender. Then each participant
Fig. 1 Database hierarchy. Each circle represents a node, and each arrow is a relation implemented
from the AD-DPC ontology
node has a dependent node (Participant File) containing all the clinical data available
for this person. The participant file contains constitutional data, 0 or more image find-
ings, psychometric findings, blood findings, CSF findings, and a diagnosis. Consti-
tutional data contains anamnestic data routinely collected during examinations, such
as patient history and BMI. The graph database follows the ontological structure
where each result from an assessment or test is treated as a “finding” produced by a
particular laboratory assay, psychometric test, or brain imaging procedure. Figure 1
shows the overall structure of the database.
To create and populate the Neo4j database, we used Neo4j's native Cypher Query
Language. The resulting database contained information about all 2404 participants.
The KG contains 2996 diagnoses, 154,953 psychometric findings, 24,102
blood findings, 12,471 CSF findings, and 14,703 brain imaging findings from MRI
or PET scanning. These nodes were annotated with 259,260 labels and 673,325
relations based on the AD-DPC ontology.
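The population scripts themselves are not shown in the paper. As a hedged sketch of the pattern (ours, not the authors' code), a loader might generate parameterised Cypher CREATE statements for the Participant/ParticipantFile pair from Fig. 1; the property names and the `HAS_FILE` relationship type are placeholders, since the real relation names come from the AD-DPC ontology.

```python
def create_participant_cypher(rid: str, age: int,
                              education_years: int, gender: str):
    """Build a parameterised Cypher statement plus its parameter map for one
    Participant node and its dependent ParticipantFile node (labels as in
    the database hierarchy above; property names are illustrative)."""
    query = (
        "CREATE (p:Participant {rid: $rid, age: $age, "
        "educationYears: $education_years, gender: $gender}) "
        "CREATE (pf:ParticipantFile {rid: $rid}) "
        "CREATE (p)-[:HAS_FILE]->(pf)"
    )
    params = {"rid": rid, "age": age,
              "education_years": education_years, "gender": gender}
    return query, params

# Hypothetical participant; values are invented for illustration.
query, params = create_participant_cypher("2", 74, 16, "F")
```

In practice such a statement would be executed through a Neo4j session (for example via the official Python driver's `session.run(query, **params)`); passing values as parameters rather than string-interpolating them avoids quoting problems and lets Neo4j cache the query plan.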
4 Results and Discussion
This section presents sample queries used to extract data from the Neo4j database.
Different scenarios are explored to show the longitudinal and multimodal nature of
the data. Each query is represented by its objective, syntax, and returned result.
The objective of the first query is to get all laboratory findings for a participant
with record identifier (RID) 2 that were logged 72 months after the baseline. Its
syntax is shown in the following listing:
MATCH (pf:ParticipantFile {rid: '2'})-[]-(p:Participant)
WITH pf, p
MATCH (t:ZeroDimensialTemporalRegion {months: '72'})-[]-(n:LaboratoryFinding)-[]-(pf)
RETURN *;
The execution of the query returns three records, each standing for a laboratory
finding for a participant with RID 2 logged at a visit 72 months after the baseline
visit. The results can be presented in a table (Fig. 2) or visualised in a graph (Fig. 3).
The objective of the second query is to return all laboratory findings for a patient
with record identifier (RID) 2 that were logged 72 months after the baseline:
MATCH (pf:ParticipantFile {rid: '2'})-[]-(p:Participant)
WITH pf, p
MATCH (t:ZeroDimensialTemporalRegion {months: '72'})-[]-(n:LaboratoryFinding)-[]-(pf)
RETURN *;
The laboratory findings for the corresponding patient are visualised in Fig. 4.
The objective of the third query is to list all BMI measures with the corresponding
time region for the participant with RID 2:
MATCH (pf:ParticipantFile {rid: '2'})-[]-(p:Participant)
WITH pf, p
MATCH (t:ZeroDimensialTemporalRegion)-[]-(n:BMI)-[]-(pf)
RETURN *;

Fig. 2 Table representation of the results from the first query

Fig. 3 Graph visualisation of the results from the first query

Fig. 4 Graph visualisation of the results from the second query
The execution of the query returns 11 records, each standing for a BMI measurement
for the participant with RID 2. Each record originates from a different visit. The
timestamps are encoded as the number of months since the baseline visit; thus, 108
refers to 108 months after baseline (Fig. 5).
The inherited semantic structure is a natural consequence of building the presented
KG on top of an ontology. As the results show, this structure offers the benefit of
intuitive querying and visually supported data retrieval. Nevertheless, such an access
point is limiting because it requires knowledge about databases, data querying, and
programming. To create a more inclusive access point to the KG, future work should
focus on implementing an interactive interface that will eliminate the need for direct
user interaction with Neo4j. To fully leverage the semantic layer granted by AD-DPC,
future extensions implementing a user interface based on natural-language questions
are considered.
Our results are partially similar to those in a previous work modelling ADNI data
through an ontology-based approach [26]. Similarly to our findings, they conclude
that semantic databases grant more intuitive data querying, thus creating an outlet for
simplified data access in machine learning. However, while they chose to model their
ontology entirely after the neuropsychological data offered by ADNI, we consider
this approach limited in terms of data blending and interoperability, since any future
data import from outside ADNI will likely require significant changes in the ontological structure.
Therefore, we used a data-independent ontological structure that we later translated
into a KG able to accommodate data from any dataset containing AD patient infor-
mation. While we acknowledge that importing data from several data sources will
require the development of dataset-specific mappings, we consider that changes in
the semantic structure of the KG should only occur in the following cases: (1) to
accommodate high-demand user needs; (2) to reflect any novelties in the domain;
(3) to fix existing imperfections.
Fig. 5 Graph visualisation of the results from the third query
5 Conclusion
The paper proposed a KG modelled through the AD-DPC and a Neo4j graph database.
We populated the KG with data from ADNI and demonstrated data organisation and
querying. The KG contained concepts, relations, and attributes described in the AD-
DPC ontology. The resulting rich data representation offers an additional semantic
layer, making the KG self-explanatory. This will significantly benefit data analysts
interested in researching AD but lacking background knowledge. A well-known
challenge of interdisciplinary collaboration is the coordination and communication
between people with different backgrounds. Expert knowledge and consultation
are often expensive, difficult to provide and time-consuming. Incorporating AD-
DPC semantics in the KG minimises the need for external consultation in analytics.
However, to fully achieve this goal, definitions and elucidations of base ontological
concepts within the graph should be integrated.
The paper demonstrated data loading from a single data source—ADNI. A limi-
tation of our work is that we did not explore the possibilities for data integration from
different sources. However, we consider the increasing need for data interoperability
in the domain of AD as one of the pressing matters that must be addressed before the
domain can truly embrace and leverage Big Data analytics. Therefore, future work
shall focus on data integration from multiple sources. Another limitation is that the
current implementation of the KG was created through a manual mapping between
the ontology and the database. Future work will consider automating this process.
This will ensure that any ontology updates will also be reflected in the KG.
Acknowledgements This research work has been supported by the GATE project, funded by the
H2020 WIDESPREAD-2018-2020 TEAMING Phase 2 programme (agreement no. 857155); OP
Science and Education for Smart Growth (agreement no. BG05M2OP001-1.003-0002-C01); and
the BNS fund (project no. KP-06-N32/5).
References
1. Duong S, Patel T, Chang F (2017) Dementia: what pharmacists need to know. Can Pharmacists
J 150(2):118–129. https://doi.org/10.1177/1715163517690745
2. Silva MVF, de Mello Gomide Loures C, Alves LCV, de Souza C, Borges KBG, das Graças
Carvalho M (2019) Alzheimer’s disease: risk factors and potentially protective measures. J
Biomed Sci 26(33):1–11. https://doi.org/10.1186/s12929-019-0524-y
3. Li X, Feng X, Sun X, Hou N, Han F, Liu Y (2022) Global, regional, and national burden of
Alzheimer’s disease and other dementias, 1990–2019. Front Ageing Neurosci 14(937486):1–
17. https://doi.org/10.3389/fnagi.2022.937486
4. World Health Organization (WHO) Dementia Fact Sheet. Retrieved from https://www.who.
int/news-room/fact-sheets/detail/dementia. Accessed on 13 Feb 2022
5. Breijyeh Z, Karaman R (2020) Comprehensive review on Alzheimer’s disease: causes and
treatment. Molecules 25(24):5789. https://doi.org/10.3390/molecules25245789
6. Luo J, Wu M, Gopukumar D, Zhao Y (2016) Big data application in biomedical research and
health care: a literature review. Biomed Inform Insights, vol 8. https://doi.org/10.4137/BII.
S31559
7. Ienca M, Vayena E, Blasimme A (2018) Big data and dementia: charting the route ahead for
research, ethics, and policy. Front Med 5(13):1–7. https://doi.org/10.3389/fmed.2018.00013
8. Zhao Z et al (2023) Conventional machine learning and deep learning in Alzheimer’s disease
diagnosis using neuroimaging: a review. Front Comput Neurosci 17(1038636):1–16. https://
doi.org/10.3389/fncom.2023.1038636
9. Wang H et al (2022) Develop a diagnostic tool for dementia using machine learning and non-
imaging features. Front Aging Neurosci 14(945274):1–14. https://doi.org/10.3389/fnagi.2022.
945274
10. Fristed E et al (2022) Leveraging speech and artificial intelligence to screen for early
Alzheimer’s disease and amyloid beta positivity. Brain Commun 4(5):1–12. https://doi.org/
10.1093/braincomms/fcac231
11. Bourkhime H et al (2022) Machine learning and novel ophthalmologic biomarkers for
Alzheimer’s disease screening: systematic review. ITM Web Conf 43(01009):1–9. https://doi.
org/10.1051/itmconf/20224301009
12. Birkenbihl C et al (2020) Evaluating the Alzheimer’s disease data landscape. Alzheimer’s
Dement Transl Res Clin Interv 6(e12102):1–11. https://doi.org/10.1002/trc2.12102
13. Liyanage H, Krause P, de Lusignan S (2015) Using ontologies to improve semantic inter-
operability in health data. J Innov Health Inf 22(2):309–315. https://doi.org/10.14236/jhi.v22
i2.159
14. Timón-Reina S, Rincón M, Martínez-Tomás R (2021) An overview of graph databases and
their applications in the biomedical domain. Database 2021:baab026. https://doi.org/10.1093/
database/baab026
15. Mungall CJ et al (2017) The monarch initiative: an integrative data and analytic platform
connecting phenotypes to genotypes across species. Nucleic Acids Res 45(D1):D712–D722.
https://doi.org/10.1093/nar/gkw1128
16. Mughal S et al (2017) Pheno4J: a gene to phenotype graph database. Bioinformatics
33(20):3317–3319. https://doi.org/10.1093/bioinformatics/btx397
17. Robinson PN, Köhler S, Bauer S, Seelow D, Horn D, Mundlos S (2008) The human phenotype
ontology: a tool for annotating and analyzing human hereditary disease. Am J Hum Genet
83(5):610–615. https://doi.org/10.1016/j.ajhg.2008.09.017
18. Fahd K, Miao Y, Miah SJ, Venkatraman S, Ahmed K (2022) Knowledge graph model devel-
opment for knowledge discovery in dementia research using cognitive scripting and next-
generation graph-based database: a design science research approach. Soc Netw Anal Min
12(61):1–12. https://doi.org/10.1007/s13278-022-00894-9
19. Rossanez A, dos Reis JC, da Silva TR, de Ribaupierre H (2020) KGen: a knowledge graph gener-
ator from biomedical scientific literature. BMC Med Inform Decis Making 20(314 (S4)):1–24.
https://doi.org/10.1186/s12911-020-01341-5
20. Nian Y et al (2022) Mining on Alzheimer’s diseases related knowledge graph to identity poten-
tial AD-related semantic triples for drug repurposing. BMC Bioinformatics 23(407 (S6)):1–15.
https://doi.org/10.1186/s12859-022-04934-1
21. Hsieh K-L, Plascencia-Villa G, Lin K-H, Perry G, Jiang X, Kim Y () Synthesize heterogeneous
biological knowledge via representation learning for Alzheimer’s disease drug repurposing. In:
iScience, vol 26, issue 105678, pp 1–18. https://doi.org/10.1016/j.isci.2022.105678
22. Malec SA et al (2023) Causal feature selection using a knowledge graph combining structured
knowledge from the biomedical literature and ontologies: a use case studying depression as
a risk factor for Alzheimer’s disease. In: bioRxiv preprint, pp 1–45. https://doi.org/10.1101/
2022.07.18.500549
23. Zhang Z et al (2022) Developing an intuitive graph representation of knowledge for nonphar-
macological treatment of psychotic symptoms in dementia. J Gerontological Nurs 48(4):49–55.
https://doi.org/10.3928/00989134-20220308-02
24. Fahd K, Venkatraman S (2021) Visualizing risk factors of dementia from scholarly literature
using knowledge maps and next-generation data models. Vis Comput Ind Biomed Art 4(19):1–
19. https://doi.org/10.1186/s42492-021-00085-x
25. Lazarova S, Petrova-Antonova D, Kunchev T (2023) Ontology-driven knowledge sharing in
Alzheimer’s disease research. Information 14(3):188. https://doi.org/10.3390/info14030188
26. Taglino F et al (2023) An ontology-based approach for modelling and querying Alzheimer’s
disease data, pp 1–19. https://doi.org/10.21203/rs.3.rs-1813123/v1
Forecasting Bitcoin Prices in the Context
of the COVID-19 Pandemic Using
Machine Learning Approaches
Prashanth Sontakke, Fahimeh Jafari, Mitra Saeedi,
and Mohammad Hossein Amirhosseini
Abstract Using daily data from 1st April 2016 to 3rd March 2022, this study
aims to explore the use and effectiveness of machine learning algorithms in fore-
casting the price of Bitcoin. The paper examines the forecasting performance based
on different time lags within the selected periods: (1) before pandemic and (2)
including pandemic. The second time frame is selected to examine the effect of
the Covid pandemic on the Bitcoin market fluctuations. This research employs four
machine learning models, including linear regression, support vector regression,
extreme gradient boosting, and long short-term memory. These are refined and cali-
brated to produce the most accurate forecasts. The performance of the algorithms
was measured and compared using regression metrics. The results show that before
the pandemic, the linear regression model performed the best for next-day predic-
tions, while extreme gradient boosting performed best overall and for longer-term
predictions. For the period including the pandemic, extreme gradient boosting and
linear regression performed the best, consistently outperforming long short-term
memory and support vector regression. The prediction models for data before the
pandemic have demonstrated improved performance, whereas the selected model
for the period including the pandemic exhibited satisfactory results. This is because
Bitcoin prices displayed the highest volatility during the Covid pandemic. The study
finds that extreme gradient boosting performs best overall and for longer-term predic-
tions, while linear regression performs the best for next-day predictions before the
pandemic. Moreover, the study reports satisfactory results for Bitcoin price prediction
for the period including the pandemic, despite the high volatility of prices.
P. Sontakke · F. Jafari · M. Saeedi · M. H. Amirhosseini (B)
University of East London, London, United Kingdom
e-mail: m.h.amirhosseini@uel.ac.uk
P. Sontakke
e-mail: U2054788@uel.ac.uk
F. Jafari
e-mail: f.jafari@uel.ac.uk
M. Saeedi
e-mail: m.saeedi@uel.ac.uk
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_7
Keywords Cryptocurrency · Bitcoin price · Time series forecasting · Machine
learning · Technical indicators · Linear regression · Support vector regression ·
Extreme gradient boosting · Long short-term memory
1 Introduction
Bitcoin, one of the most popular cryptocurrencies, was introduced by Satoshi Nakamoto
in 2009 [1]. Cryptocurrencies apply the principle of decentralisation, whereas
fiat currencies are based on central banking systems. Therefore, a cryptocurrency is
not subject to interference from a central banking authority. The global financial
crisis of 2007–2008, known as the subprime mortgage crisis, followed by the
eurozone debt crisis of 2011–2012, substantially increased people's distrust in their
governments and eroded their faith in traditional financial institutions. As a result,
Bitcoin, with its promise of a decentralised structure free from governmental and
regulatory control, was well received in the following years [2]. Bitcoin and other
cryptocurrencies are used in different ways, such as for speculative trading, investment,
or simply as a payment method. Bitcoin, with its explicitly speculative behaviour,
is subject to high volatility and bubbles [3]. The unusual price behaviour of
Bitcoin has attracted many researchers seeking to provide the most efficient models
to predict its price.
Financial time series forecasting has been a subject of significant interest in
economics, statistics, and computer science. A cryptocurrency is a digital currency
that uses cryptography to conduct transactions securely [4]. All cryptocurrencies are
traded across various exchanges 24/7, resulting in much higher volatility compared
with traditional stock markets. The motivation behind predicting the price of Bitcoin using
machine learning techniques was heavily inspired by increasingly better-performing
ensemble algorithms and neural network architectures. Bitcoin recorded its all-time
high in 2021 and experienced high fluctuations during the Covid pandemic, attracting
massive public attention. The high-price volatility of Bitcoin, especially during the
pandemic, motivated this research to analyse the Bitcoin price behaviour before and
during the Covid pandemic.
This study aims to examine the effectiveness of machine learning algorithms in
forecasting Bitcoin prices before and during the COVID-19 pandemic. It uses a
robust feature selection strategy to identify the most critical features for prediction
and applies different machine learning algorithms to forecast Bitcoin prices. The
models have been optimised and tuned to reflect the fluctuations as well. The paper
considers forecasting performance on different lags within pre-selected periods. It
evaluates the extent to which the prices of Bitcoin can be accurately predicted for
the next day, 7th day, 15th day, and 30th day.
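To make this multi-horizon setup concrete, the following sketch (ours, not the authors' code) turns a daily price series into supervised pairs in which a window of recent prices predicts the price h days ahead, for h in {1, 7, 15, 30}. In the study itself the feature set is richer (e.g. technical indicators); this only illustrates the lag/horizon mechanics.

```python
def make_supervised(prices, window=5, horizon=1):
    """Build (features, target) pairs from a daily price series:
    features are the `window` most recent prices; the target is the
    price `horizon` days after the window's last day."""
    X, y = [], []
    for t in range(window, len(prices) - horizon + 1):
        X.append(prices[t - window:t])      # last `window` prices
        y.append(prices[t + horizon - 1])   # price `horizon` days ahead
    return X, y

# Stand-in for daily Bitcoin closing prices (synthetic data).
prices = list(range(100, 140))
X1, y1 = make_supervised(prices, window=5, horizon=1)   # next-day target
X7, y7 = make_supervised(prices, window=5, horizon=7)   # 7th-day target
```

Longer horizons yield fewer training pairs for the same series, which is one reason longer-term forecasts are harder to fit and evaluate.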
The rest of this paper is organised as follows. Section 2 discusses the literature
review. The methodology and the machine learning models utilised in this paper
are detailed in Sect. 3. Section 4 is devoted to the experimental results, and Sect. 5
concludes the paper.
2 Literature Review
The high volatility of the Bitcoin price could be due to many factors, from the operating
hours of the American, European, and Asian markets to different macroeconomic factors of
the world economy, especially the leading economies. While regulatory implications
and economic pressures led Bitcoin to be perceived differently in various countries,
Bitcoin price volatility and its hedging capacity have been discussed in many studies,
as Bitcoin-based portfolios can yield significant gains. Bitcoin has been considered a
risk diversifier for the portfolio. In some cases, it proved to be the best hedge choice
during financial crises, helping the investor in the investment process [5].
Some studies suggested that Bitcoin should not be considered a currency; they
argued that due to Bitcoin's volatile price behaviour, it should instead be referred to as
a speculative investment asset. Among the early studies on Bitcoin price volatility,
Mittal [6] found no fundamental explanation for Bitcoin’s price movements and
concluded that the primary determinant of Bitcoin price is the investors’ speculation.
Meanwhile, Buchholz et al. [7] argued that Bitcoin's price had bubble characteristics
with no significant relation to other financial assets. They concluded that Bitcoin price
movement was derived only from its own dynamics of supply and demand induced
by the behaviour of speculative investors. Gronwald [8] examined if Bitcoin’s price
movements exhibited characteristics of commodities such as gold or oil and found
that compared with the price fluctuation of traditional commodities, Bitcoin price
was significantly more volatile.
As interest in Bitcoin grew during the initial years, some studies have used statis-
tical and econometric model-based techniques to predict Bitcoin prices [9]. Statistical
model-based time series forecasting is a method of estimating and predicting price
values, but it has the drawback of requiring assumptions about the data distribution
beforehand. Bitcoin prices are non-stationary and show no seasonal effects, so this
approach cannot make accurate predictions for them. Some studies
recommended autoregressive integrated moving average (ARIMA)-based model for
predicting Bitcoin prices [10,11]. Alahmari [12] used the ARIMA model to predict
Bitcoin, Ripple, and Ethereum based on daily, weekly, and monthly time horizons.
Huang et al. [13] developed a classification tree-based model for predicting Bitcoin
returns using 124 technical indicators that indicate overlap, momentum, pattern, etc.
Their approach claimed that technical analysis of historical data could predict Bitcoin
returns within narrow ranges as its value is believed to be driven by factors other
than fundamental factors. Their results surpassed the buy-and-hold strategy and
contribute significantly to the newly emerging literature on technical analysis-based
cryptocurrency price forecasting.
Machine learning can be referred to as an automated learning process from experi-
ence without the need for explicit programming. This motivated many researchers to
study Bitcoin volatility and propose forecasting techniques using machine learning.
Greaves and Au [14] applied linear regression, logistic regression, support vector
machines (SVM), and artificial neural networks (ANN) and achieved a 55% accuracy
rate with ANN, outperforming the other models. They concluded that financial flow
84 P. Sontakke et al.
features from various exchanges would be an added advantage in predicting Bitcoin
prices. Using only blockchain-based features for training and testing offers limited
predictability. Madan et al. [15] addressed binary classification models like logistic
regression and random forest. Results show that the random forest outperformed
SVM as the former is not affected by high standard deviation and outliers within the
data. The study by Radityo et al. [16] predicted next-day prices using the closing price
of Bitcoin in USD. The research utilised four variations of artificial neural network
(genetic algorithm NN, backpropagation NN, genetic algorithm BPNN, and neuroevolution
of augmenting topologies) and compared the results based on mean absolute
percentage error (MAPE) values and computational time complexity. Among the
variants of ANN used, GABPNN showed the best results, whereas the performance of
the genetic algorithm NN was unsatisfactory. The study by Yeh et al. [17] proposes an
improved ensemble learning method for forecasting Bitcoin price movements. The
method combines AdaBoost, random forest, and extreme gradient boosting algo-
rithms to enhance prediction accuracy. The authors evaluate the proposed method on
real-world Bitcoin price data and compare it with other popular forecasting methods,
including ARIMA and LSTM neural networks. The experimental results demonstrate
that the proposed method outperforms other methods in accuracy and robustness.
Authors in [18] present a hybrid deep learning framework for forecasting cryptocur-
rency prices, including Bitcoin. The framework combines CNN and LSTM to capture
the complex temporal patterns of cryptocurrency price data. The experimental results
show that the proposed framework achieves higher accuracy and lower error rates
than other models. However, it is worth noting that these studies do not consider the
pandemic period for Bitcoin price.
3 Methodology
A time series is a set of sequential data points for a specific successive time duration.
It incorporates methods that relate time series with understanding the trend of data
points within the time series or helps make predictions. This research concentrates
on forecasting Bitcoin prices using multivariate time series and machine learning
models, where the value of the target variable x at a future time point is given by
x[t+s] = f(x[t], x[t−1], ..., x[t−n]), with s > 0 representing the prediction horizon.
The prediction forecast is evaluated for horizons of the next day, 7th day, 15th day, and
30th day. As shown in Fig. 1, the implementation of a time series-based forecasting
method begins with creating a dataset. Then, machine learning models are trained
for the specified prediction horizons. Technical indicators contributing to the Bitcoin
price have been scraped from open data sources.
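As an illustrative sketch of this formulation, the lagged supervised-learning frame for a horizon s can be built with pandas. The toy series, lag count, and column names here are our assumptions, not the paper's:

```python
import numpy as np
import pandas as pd

def make_supervised(series: pd.Series, n_lags: int, horizon: int) -> pd.DataFrame:
    """Turn a price series into a supervised frame: inputs x[t], x[t-1], ...,
    x[t-n_lags+1] and target x[t+horizon]."""
    frame = {f"lag_{i}": series.shift(i) for i in range(n_lags)}
    frame["target"] = series.shift(-horizon)
    return pd.DataFrame(frame).dropna()

# toy close-price series standing in for the real Bitcoin data
close = pd.Series(np.arange(100.0), name="close")
data = make_supervised(close, n_lags=3, horizon=7)  # 7th-day horizon
X, y = data.drop(columns="target"), data["target"]
```

The same construction is repeated for horizons 1, 7, 15, and 30 to obtain the four datasets per period.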
As a preprocessing step, the data is consolidated into a single data frame, cleaned,
and scaled. The end-of-day close price is used to create datasets for the next-day, 7th-,
15th-, and 30th-day forecast for historical periods of data (i.e. from 1st April 2016
to 1st November 2019 and 1st April 2016, to 3rd March 2022). This results in four
separate datasets for the two time periods specified.

Fig. 1 Step-by-step model development

Feature extraction and feature selection are performed separately for each dataset. Over 900 derived features are
created based on past time frames of 7, 30, and 90 days. Feature selection, a crucial
step in Fig. 1, is performed to reduce the number of input variables, thereby
reducing the dimensionality and computational complexity of the
model. The top 10 features from each dataset are extracted using a random forest
regressor, followed by a training and testing split.
3.1 Data Collection
We have collected daily historical data from Yahoo Finance API (OHLC feature),
blockchain-based features from Bitinfocharts [19], and Quandl [20] through web
scraping techniques. We have 23 features excluding date and target variables. Table 1
represents the features that have been gathered.
3.2 Feature Engineering Using Technical Indicators
The dataset was enriched with newly generated features based on technical indicators
and lagged for 7, 30, and 90 days. These technical indicators added to the dataset
by providing information that could not be obtained from the existing features. For
instance, these new features addressed the need for more information regarding
properties like variance and standard deviation, which were calculated from the raw
features. This calculation allowed us to observe the relationship between prices and
the standard deviation of the hash rate for the past 7-, 30-, and 90-day intervals, rather
than just the raw features. Table 2 represents the features extracted based on technical
indicators.

Table 1 Features collected using web scraping

Number of transactions per day in blockchain | Block size | Miner revenue
Number of sent by addresses | Number of active addresses | Open price
Average mining difficulty | Average hash rate | Low price
Average and median transaction fee | Average block time | Volume
Mining profitability | Sent coins | High price
Average and median transaction value | Tweets and Google Trends per day | Number of coins in circulation
Average fee percentage in total block reward | Top 100 richest addresses to total coins | Close price
Market cap | Confirmation time |

Table 2 Extracted features based on technical indicators

Simple moving average | Weighted moving average
Exponential moving average | Double exponential moving average
Triple exponential moving average | Standard deviation
Relative strength index | Rate of change
Bollinger bands | Moving average convergence divergence
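Several of these indicators can be derived directly with pandas rolling and exponentially weighted operations. The sketch below (the window length and the synthetic price series are illustrative assumptions) shows SMA, EMA, rolling standard deviation, Bollinger bands, and RSI:

```python
import numpy as np
import pandas as pd

def add_indicators(df: pd.DataFrame, window: int = 7) -> pd.DataFrame:
    """Append SMA, EMA, rolling std, Bollinger bands, and RSI columns."""
    out = df.copy()
    close = out["close"]
    out[f"sma_{window}"] = close.rolling(window).mean()
    out[f"ema_{window}"] = close.ewm(span=window, adjust=False).mean()
    out[f"std_{window}"] = close.rolling(window).std()
    # Bollinger bands: SMA plus/minus two rolling standard deviations
    out[f"boll_hi_{window}"] = out[f"sma_{window}"] + 2 * out[f"std_{window}"]
    out[f"boll_lo_{window}"] = out[f"sma_{window}"] - 2 * out[f"std_{window}"]
    # RSI from average gains and losses over the window
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(window).mean()
    loss = (-delta.clip(upper=0)).rolling(window).mean()
    out[f"rsi_{window}"] = 100 - 100 / (1 + gain / loss)
    return out

# illustrative, monotonically rising price series
prices = pd.DataFrame({"close": np.linspace(100.0, 130.0, 40)})
features = add_indicators(prices, window=7)
```

Repeating the call for windows of 7, 30, and 90 days over each raw column yields the lagged indicator set described above.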
3.3 Feature Selection
When working with large datasets that possess numerous features, the computational
time and complexity of an algorithm can increase significantly. The feature selection
process addresses this by evaluating each feature's contribution to the outcome and
reducing the dataset's dimensionality, while retaining or improving the accuracy
scores. Therefore, we have applied a random forest regressor to select the top 10
features from the entire dataset, represented in Table 3.
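This selection step can be sketched with scikit-learn's impurity-based feature importances. The synthetic data and feature names below are hypothetical stand-ins for the 900-plus derived features:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# synthetic stand-in: 25 candidate features, target driven mostly by the first two
X = pd.DataFrame(rng.normal(size=(300, 25)),
                 columns=[f"feat_{i}" for i in range(25)])
y = 3 * X["feat_0"] + 2 * X["feat_1"] + rng.normal(scale=0.1, size=300)

# rank features by random forest importance and keep the top 10
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns)
top10 = importances.sort_values(ascending=False).head(10).index.tolist()
```

The same ranking is run once per horizon dataset, which is why the selected features in Table 3 differ across horizons.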
To determine relevant features for this study, we have identified new features using
technical analysis and feature selection algorithms. Feature engineering revealed the
Table 3 Most frequently selected features across all horizons

Features | Horizons (next day, 7, 15, 30)
WMA 30 Number of coins in circulation | * * * *
SMA 13 High | * * *
EMA 90 Low | * * *
EMA 7 Number of coins in circulation | * * *
Close | * *
High | * *
WMA 7 Close | * *
EMA 7 Open | * *
DEMA 30 Close | * *
DEMA 7 Market cap | * *
EMA 30 Close | * *
extent to which features directly related to the blockchain impacted the price of
Bitcoin, e.g. miner revenue, which involves transaction fees and rewards, is correlated
with the Bitcoin price. Similarly, block size and the creation of new blocks correlate
with the number of transactions, and a higher number of Bitcoin transactions
correlates with the Bitcoin price. Higher mining difficulty demands more processing
power and is therefore highly correlated with the hash rate.
3.4 Training and Testing
After the feature selection process, the next step is to allocate a portion of the data as
the training set and another portion as the testing set. Due to the non-stationary nature
of cryptocurrency prices, there is a trade-off between using too much and too little
data for training: too much historical data makes the model stale and irrelevant, while
too little makes it prone to overfitting. This problem is usually solved by using the ratio of
80% training data and 20% testing data based on the Pareto principle. However, we
observed overfitting in the results obtained through time series split cross-validation.
Therefore, we employed a sliding window approach which uses 10 consecutive data
points to predict the 11th and 12th data points within the same sequence, as supported
by previous research [21]. Essentially, the prediction of the next two days will be
based on data from the preceding ten days, with the final metric being the average
of the metrics computed for each split.
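A sketch of this sliding-window evaluation is shown below. The stride of two (advancing by the test span so test sets do not overlap) is our assumption; the paper does not state how the window advances:

```python
import numpy as np

def sliding_window_splits(n_samples: int, train_size: int = 10, test_size: int = 2):
    """Yield (train_idx, test_idx) pairs: each window of `train_size`
    consecutive points predicts the following `test_size` points."""
    step = test_size  # assumed stride: advance by the test span
    for start in range(0, n_samples - train_size - test_size + 1, step):
        train = np.arange(start, start + train_size)
        test = np.arange(start + train_size, start + train_size + test_size)
        yield train, test

splits = list(sliding_window_splits(20))
```

The final reported metric is then the average of the metric computed on each test window.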
3.5 Machine Learning Algorithms
Four machine learning models have been implemented in this work including (1)
linear regression with gradient descent (LR), (2) support vector regression (SVR),
(3) extreme gradient boosting (XGBoost), and (4) long short-term memory (LSTM).
Table 4 gives a summary of the parameters chosen for each model. For all four models,
all possible combinations of the hyperparameters were investigated during the
hyperparameter tuning process, and the combinations presented in Table 4 produced the
best results.
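The tuned configurations in Table 4 map onto library estimators roughly as follows. `SGDRegressor` is our reading of "linear regression with gradient descent", and a `GradientBoostingRegressor` stands in when xgboost is not installed; treat this as a sketch, not the authors' exact code:

```python
from sklearn.linear_model import SGDRegressor
from sklearn.svm import SVR

# LR row of Table 4: linear regression fitted with gradient descent
lr = SGDRegressor(loss="squared_epsilon_insensitive", penalty="elasticnet",
                  shuffle=True, l1_ratio=0.15, epsilon=0.01,
                  learning_rate="adaptive", max_iter=1000)

# SVR row of Table 4
svr = SVR(kernel="rbf", C=1000, gamma="auto")

try:  # XGBoost row of Table 4; fall back if the xgboost package is absent
    from xgboost import XGBRegressor
    xgb = XGBRegressor(n_estimators=500, max_depth=3, learning_rate=0.01, n_jobs=1)
except ImportError:
    from sklearn.ensemble import GradientBoostingRegressor
    xgb = GradientBoostingRegressor(n_estimators=500, max_depth=3, learning_rate=0.01)

models = {"LR": lr, "SVR": svr, "XGBoost": xgb}
```

Each estimator is then fitted and scored once per sliding-window split.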
For the LSTM model, a bidirectional layer of 500 cells was used followed by
a dropout of 25% which in turn is fed to another bidirectional layer of 600 cells,
followed by a dropout of 30%. An optimiser algorithm is used to update the network
weights during training. The Adam optimiser suits non-convex optimisation problems:
it requires little memory, handles noisy gradients well, and is computationally
efficient. Hence, the Adam optimiser was adopted.
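The stacked bidirectional architecture described above can be sketched in Keras. The input window length and feature count are assumptions, as the paper does not report them:

```python
from tensorflow import keras
from tensorflow.keras import layers

n_steps, n_features = 10, 10  # assumed window length and number of selected features

model = keras.Sequential([
    layers.Input(shape=(n_steps, n_features)),
    layers.Bidirectional(layers.LSTM(500, return_sequences=True)),
    layers.Dropout(0.25),
    layers.Bidirectional(layers.LSTM(600)),
    layers.Dropout(0.30),
    layers.Dense(1),  # one-step price forecast
])
model.compile(optimizer=keras.optimizers.Adam(), loss="mse")
```

The model is then trained on the same lagged windows as the other estimators, with early stopping configured as in the LSTM row of Table 4.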
4 Experimental Results
The performance outcomes of the forecasting models are outlined in this section. We
have developed all the steps explained in Sect. 3 using Python on the Google Colab
platform. Two case studies have been conducted based on the following time frames:
Period 1: Before pandemic (1st April 2016–1st November 2019)
Table 4 Hyperparameter tuning for each model

Model | Parameter | Value
LR | Loss function | squared_epsilon_insensitive
LR | Penalty | elasticnet
LR | Shuffle | True
LR | L1_ratio | 0.15
LR | Epsilon | 0.01
LR | Learning rate | adaptive
LR | Max_iter | 1000
SVR | Kernel | radial basis function
SVR | C | 1000
SVR | Gamma | auto
XGBoost | n_estimators | 500
XGBoost | max_depth | 3
XGBoost | learning_rate | 0.01
XGBoost | n_jobs | 1
LSTM | Monitor | root_mean_squared_error
LSTM | Verbose | 1
LSTM | Mode | min
LSTM | Patience | 3
Period 2: Including pandemic (1st April 2016–3rd March 2022)
The models are evaluated using three metrics which are root-mean squared error
(RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE).
When evaluating models, it is ideal to have low values for the MAPE, RMSE, and
MAE metrics. For instance, in the case of Bitcoin price prediction, a model with a few
large prediction errors may yield a higher RMSE yet still have lower MAPE or MAE
values, since RMSE penalises large deviations more heavily. Hence, it is crucial to assess the models using all three
measures.
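The three metrics can be computed directly; the sample values below are illustrative, not from the paper:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean squared error."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]
```

Note that MAPE is scale-free (a percentage), while RMSE and MAE are in price units, which is why the tables report them on very different scales.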
4.1 Period 1: Before Pandemic (1st April 2016–1st November
2019)
In the first case study, we examine forecasting Bitcoin prices before the pandemic.
Table 5 presents the outcomes of the machine learning models for the different time
frames.
Table 5 Comparing model accuracy across time frames before pandemic
Test metrics: 01 April 2016—01 November 2019
Next day
LR SVR XGBoost LSTM
RMSE 261.6396 363.0427 288.9283 373.1148
MAE 244.5862 349.7400 272.1291 359.8107
MAPE 0.8746 5.9647 0.8746 0.8746
7th Day
LR SVR XGBoost LSTM
RMSE 407.1737 389.4182 316.0947 399.7969
MAE 392.4431 376.1204 297.5931 386.8093
MAPE 1.4965 6.2503 1.4965 6.3994
15th Day
LR SVR XGBoost LSTM
RMSE 398.0850 387.2883 326.3839 406.8071
MAE 383.5820 374.2908 309.8014 394.1188
MAPE 4.2807 6.1359 4.2807 6.6521
30th Day
LR SVR XGBoost LSTM
RMSE 386.2755 389.6317 277.8465 423.1300
MAE 372.3106 374.5799 259.9288 408.9162
MAPE 0.8412 6.1730 0.8412 6.7438
During this period, Bitcoin prices displayed minimal fluctuation but saw a signif-
icant increase in early 2017, maintaining a stable trend for the remainder of the
interval. Among the models for next-day predictions, LR achieved the lowest RMSE
of 261.6396, followed by XGBoost, SVR, and LSTM. LR also had the best MAE of
244.5862, followed by XGBoost, SVR, and LSTM. In terms of MAPE, LR, XGBoost,
and LSTM recorded 0.8746, with SVR coming in at 5.9647. Therefore, the LR model
is the best performer among the four models mentioned (LR, XGBoost, SVR, and
LSTM).
For the 7-day prediction, XGBoost showed the best performance, with the lowest
RMSE of 316.0947, followed by SVR, LSTM, and LR. XGBoost also had the best
MAE of 297.5931, followed by SVR, LSTM, and LR. In terms of MAPE, LR and
XGBoost performed best with a value of 1.4965, followed by SVR and LSTM. For
the 15-day prediction, XGBoost showed the best performance with an RMSE of
326.3839, followed by SVR, LR, and LSTM. XGBoost also had the best MAE of
309.8014, followed by SVR at 374.2908, LR at 383.5820, and LSTM at 394.1188. In
terms of MAPE, LR and XGBoost performed best with a value of 4.2807, followed
by SVR and LSTM. For the 30-day prediction, the best RMSE was achieved by
the XGBoost model with a value of 277.8465, followed by LR, SVR, and LSTM.
XGBoost also had the best MAE of 259.9288, followed by LR at 372.3106, SVR,
and LSTM.
The results show that the LR model performs the best for next-day predictions with
the lowest RMSE and MAE values. For the 7-day prediction, XGBoost outperforms
the other models with the lowest RMSE and MAE values. Similarly, for the 15-
day and 30-day predictions, XGBoost performs the best with the lowest RMSE and
MAE values. For all prediction periods, LR and XGBoost also performed well in
terms of MAPE values. In conclusion, the XGBoost model performs the best overall,
while the LR model performs well for next-day predictions. Figure 2 presents a graph
contrasting the actual and predicted data for the 15-day forecast utilising the XGBoost
model.
Fig. 2 Comparison of actual versus predicted data for 15-day prediction using XGBoost model
4.2 Period 2: Including Pandemic (1st April 2016–3rd March
2022)
In the second case study, we examine forecasting Bitcoin prices for the period that
included the pandemic, characterised by an unusual level of volatility. This consti-
tutes the core contribution of this research. Table 6 displays the results of the machine
learning models for various time frames in this period.
The results of the next-day prediction show that XGBoost achieved the lowest
RMSE of 723.9742, followed by LR at 773.8296, LSTM at 890.0664, and SVR at
981.2988. XGBoost also had the best MAE of 682.4402, with LR, LSTM, and SVR
following. LR and XGBoost had the best MAPE of 0.3862, while LSTM and SVR
followed.
For the 7th-day prediction, XGBoost had the lowest RMSE of 734.0597, followed
by LR, LSTM, and SVR. XGBoost also reported the best MAE of 691.8397, followed
by LR, LSTM, and SVR. LR and XGBoost had the best MAPE of 4.0497, followed
by SVR and LSTM. For the 15-day prediction, XGBoost had the lowest RMSE of
686.5598, followed by LR, LSTM, and SVR. XGBoost also had the best MAE of
Table 6 Comparing model accuracy across time frames including pandemic
Test metrics: 01 April 2016–03 March 2022
Next day
LR SVR XGBoost LSTM
RMSE 773.8296 981.2988 723.9742 890.0664
MAE 739.1618 952.5082 682.4402 859.1364
MAPE 0.3862 6.3134 0.3862 5.8114
7th Day
LR SVR XGBoost LSTM
RMSE 989.5905 998.1517 734.0597 993.7988
MAE 958.2842 969.1428 691.8397 963.8020
MAPE 4.0497 6.362 4.0497 6.3729
15th Day
LR SVR XGBoost LSTM
RMSE 976.8267 1008.8316 686.5598 1002.8917
MAE 944.8389 979.7016 648.2302 973.0593
MAPE 6.2047 6.3711 6.2047 6.4255
30th Day
LR SVR XGBoost LSTM
RMSE 1007.7876 1029.7602 678.8905 1038.7840
MAE 966.5514 989.2436 633.0426 997.3399
MAPE 0.6291 6.3763 0.6291 6.5032
Fig. 3 Comparison of actual versus predicted data using XGBoost model for the period covering
pandemic
648.2302, followed by LR, LSTM, and SVR. LR and XGBoost had the best MAPE
of 6.2047, followed by SVR and LSTM. For the 30-day prediction, XGBoost had
the lowest RMSE of 678.8905, followed by LR, LSTM, and SVR. XGBoost also
had the best MAE of 633.0426, followed by LR, SVR, and LSTM. LR and XGBoost
had the best MAPE of 0.6291, while SVR and LSTM followed.
The results show that XGBoost outperformed the other models in all time frames
regarding RMSE, MAE, and MAPE. LR also performed well, consistently achieving
the second-best results. LSTM and SVR showed lower performance compared
with XGBoost and LR. Overall, XGBoost and LR demonstrated the best results in
predicting future outcomes based on the given dataset. The graph in Fig. 3 compares
the actual and predicted data using the XGBoost model for the period that includes
the pandemic.
The four machine learning models (SVR, XGBoost, LR, and LSTM) used in the
study differ in their underlying principles and have varying strengths and weaknesses.
Regarding speed, SVR was the quickest at 3 s, followed by linear regression at 10 s,
XGBoost at 90 s, and LSTM at 90 min for predicting next-day Bitcoin prices in the
second period. As LSTM had the longest runtime, the other models are recommended
for their time-saving advantages.
5 Conclusion and Future Work
This study assessed the performance of four machine learning models, linear regres-
sion, support vector regression, XGBoost, and LSTM, in predicting Bitcoin price
volatility during the COVID-19 pandemic using technical features and indicators.
The results show that the models performed better before the pandemic compared
with during the pandemic with high volatility. Despite this, the study still reports
satisfactory results for Bitcoin price prediction during the pandemic. The authors
suggest that this remains a challenge for future studies.
The study employed a robust feature selection strategy to determine the most
critical features. The random forest regressor recommended features for all defined
horizons, which have been partially related to specified periods. For example, the
number of coins in circulation has been selected for all horizons, while the close price
has been only selected for the next-day and 7th-day horizons and not for the 15th-
and 30th-day horizons. The study shows a satisfactory prediction of Bitcoin prices
over the selected horizons. The results showed that the accuracy of predictions for
the next day, 15th day, and 30th day was superior to that for the 7th-day horizon in the
second dataset. The reason could not be established, as Bitcoin prices are stochastic.
The limitations of this study could include the following:
The study only examines the period of 1st April 2016 to 3rd March 2022, and
may not capture the full range of Bitcoin price fluctuations over a longer period.
The study only focuses on four machine learning models, and other models may
better predict Bitcoin price fluctuations.
The study only uses technical features and indicators, and additional factors such
as global economic conditions and regulatory changes may affect Bitcoin prices.
Future work could involve exploring other machine learning models or incorpo-
rating additional features to improve the performance of the models. Additionally,
the study could be extended to other cryptocurrencies and compare the results with
those obtained for Bitcoin.
References
1. Nakamoto S (2008) Bitcoin: a peer-to-peer electronic cash system. Retrieved from https://
www.bitcoinpaper.info/bitcoinpaper-html/
2. Summoogum JP, Saeedi M (2020) A study on the inefficiencies of bitcoin and its future
adoption. Test Eng Manag 82:16624–16634
3. Cheah E-T, Fry J (2015) Speculative bubbles in bitcoin markets? An empirical investigation
into the fundamental value of bitcoin. Econ Lett 130:32–36
4. Garcia D, Tessone CJ, Mavrodiev P, Perony N (2014) The digital traces of bubbles: feed-
back cycles between socio-economic signals in the bitcoin economy. J R Soc Interface
11(99):20140623-1–20140623-8. https://doi.org/10.1098/rsif.2014.0623
5. Nistala MN, Saeedi M, Islam MU (2020) Bitcoin price volatility and hedging capacity. Int J
Manag 11(10):1703–1712. https://doi.org/10.34218/IJM.11.10.2020.156
6. Mittal S (2014) Is bitcoin money? Bitcoin and alternate theories of money (SSRN Scholarly
Paper No. ID 2434194). Social Science Research Network, Rochester, NY
7. Buchholz M, Delaney J, Warren J, Parker J (2012) Bits and bets, information, price volatility,
and demand for Bitcoin. Economics 312(1):2–48
8. Gronwald M (2019) Is bitcoin a commodity? On price jumps, demand shocks, and certainty
of supply. J Int Money Financ 97:86–92
9. Brooks C (2019) Introductory econometrics for finance. Cambridge University Press,
Cambridge. https://doi.org/10.1017/9781108524872
10. Roy S, Nanjiba S, Chakrabarty A (2018) Bitcoin price forecasting using time series analysis.
In: 2018 21st international conference of computer and information technology (ICCIT). IEEE,
pp 1–5. https://doi.org/10.1109/ICCITECHN.2018.8631923
11. Anupriya, Garg S (2018) Autoregressive integrated moving average model based prediction of
bitcoin close price. In: 2018 international conference on smart systems and inventive technology
(ICSSIT). IEEE, pp 473–478. https://doi.org/10.1109/ICSSIT.2018.8748423
12. Alahmari SA (2019) Using machine learning ARIMA to predict the price of cryptocurrencies.
ISC Int J Inf Secur 11(3):139–144. https://doi.org/10.22042/isecure.2019.11.0.18
13. Huang J-Z, Huang W, Ni J (2019) Predicting bitcoin returns using high-dimensional technical
indicators. J Finan Data Sci 5(3):140–155. https://doi.org/10.1016/j.jfds.2018.10.001
14. Greaves A, Au B (2015) Using the bitcoin transaction graph to predict the price of bitcoin, pp 1–
8. Retrieved from https://snap.stanford.edu/class/cs224w-2015/projects_2015/Using_the_Bit
coin_Transaction_Graph_to_Predict_the_Price_of_Bitcoin.pdf
15. Madan I, Saluja S, Zhao A (2015) Automated bitcoin trading via machine learning algorithms,
pp 1–5. Department of Computer Science, Stanford University, Stanford, CA, USA, Technical
Reports. Retrieved from https://www.smallake.kr/wp-content/uploads/2017/10/Isaac-Madan-
Shaurya-Saluja-Aojia-ZhaoAutomated-Bitcoin-Trading-via-Machine-Learning-Algorithms.
pdf
16. Radityo A, Munajat Q, Budi I (2017) Prediction of bitcoin exchange rate to American
dollar using artificial neural network methods. In: 2017 international conference on advanced
computer science and information systems (ICACSIS). IEEE, pp 433–438. https://doi.org/10.
1109/ICACSIS.2017.8355070
17. Yeh CC, Liao YC, Yang YJ (2020) Predicting bitcoin prices with machine learning techniques.
Expert Syst Appl 163:113762. https://doi.org/10.1016/j.eswa.2020.113762
18. Wang S, Ma Y, Zhang Y (2018) Forecasting bitcoin price with deep learning networks. Phys
A 510:828–834. https://doi.org/10.1016/j.physa.2018.07.026
19. BitInfoCharts. Retrieved from https://bitinfocharts.com/. Accessed on 18 Feb 2023
20. Quandl. Retrieved from https://demo.quandl.com/. Accessed on 18 Feb 2023
21. Hota HS, Handa R, Shrivas AK (2017) Time series data prediction using sliding window based
RBF neural network. Int J Comput Intell Res 13(5):1145–1156
Online Food Delivery Customer Churn
Prediction: A Quantitative Analysis
on the Performance of Machine Learning
Classifiers
J. Gerald Manju, A. Dharini, B. Kiruthika, and A. Malini
Abstract Securing current customers is more necessary than earning new
customers in a market that is expanding. To trace customer churn, a reliable churn
prediction paradigm is required. Customer churn is the process through which people
switch from one firm to another or break off contact with the company. This decision is
driven by a variety of influences. It is critical for companies to acknowledge each one
so that they can encourage customers to stay. This is accomplished by regularly
conducting surveys regarding customer satisfaction and analyzing the responses.
Applying appropriate modelling approaches is a vital component of predicting
customer churn. Predominantly, this study evaluates several machine learning models
and also an incorporated model that aids in predicting customer churn where the data
collected from Bengaluru regions in India about online food delivery is prioritized.
In order to make better predictions using machine learning, a variety of general
classifiers and ensemble classifiers are used, and their degree of functionality is assessed
by determining their accuracy and area under the ROC curve. According to the AUC
scores obtained for the individual classifiers, the Naïve Bayes and random forest
classifiers rank first with the same AUC score of 0.952. After dealing with this case,
the results show that the random forest classifier outperforms all other models used.
Keywords Voting classifier ·Ensemble ·AUC ·XGBoost ·Naïve Bayes ·
Random forest
J. Gerald Manju ·A. Dharini ·B. Kiruthika ·A. Malini (B)
Thiagarajar College of Engineering, Madurai, Tamil Nadu, India
e-mail: amcse@tce.edu
J. Gerald Manju
e-mail: gerald@student.tce.edu
A. Dharini
e-mail: dharinia@student.tce.edu
B. Kiruthika
e-mail: kiruthikab@student.tce.edu
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_8
1 Introduction
To combat the intense competition in the business sector, companies are placing
a greater emphasis on customer relationship management (CRM). Diligent
communication is required to sustain customers. Acquiring new customers involves a far
larger investment than retaining existing ones. The customer base is an important
resource, so companies have to put effort into holding on to existing customers.
This is attained through customer churn prediction, which in
turn helps in the development of retention strategies. The term ‘churn’ refers to both
consumers who switch service providers and those who leave the service providers
with whom they are entrusted. Additionally, it takes into account the likelihood that
some of the customers will leave.
Several data mining techniques are implemented to predict customer churn
utilizing machine learning models. In fact, studies have shown that gathering,
organizing, cleaning, processing, and analyzing data is an expensive process for producing
reliable forecasts. Moreover, research has shown that ensemble approaches are
better predictors than individual models. In order to achieve better results, the
effectiveness of several classifiers is evaluated.
The contributions of this paper are,
To create a classifier that is effective at predicting customer churn, incorporate
the classifiers’ training and testing results.
Employ Naive Bayes, logistic regression which are general classifiers and
ensemble classifiers like gradient boosting, XGBoost and random forest to forecast
customer churn.
Aggregate the results of the individual classifiers using voting classifier, consid-
ering the majority votes and analyze the effectiveness of the voting classifier.
Determine the efficacy of the distinct classifiers based on the AUC analysis.
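A sketch of the described pipeline on synthetic stand-in data (the Bengaluru survey data is not public, and XGBoost is omitted here to keep the example dependency-free):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# synthetic stand-in for the churn survey responses (label 1 = churned)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

base = [("nb", GaussianNB()),
        ("lr", LogisticRegression(max_iter=1000)),
        ("gb", GradientBoostingClassifier(random_state=0)),
        ("rf", RandomForestClassifier(random_state=0))]

# AUC of each individual classifier
aucs = {name: roc_auc_score(y_te, clf.fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
        for name, clf in base}

# majority-vote aggregation of the base classifiers
vote = VotingClassifier(estimators=base, voting="hard").fit(X_tr, y_tr)
vote_acc = accuracy_score(y_te, vote.predict(X_te))
```

Hard voting takes the majority class label, as described above; soft voting (averaging predicted probabilities) would be needed to compute an AUC for the ensemble itself.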
The remainder of the paper reviews the existing literature, explains the approaches
used in the study, and examines how well the machine learning classifiers performed
in making predictions, followed by the conclusion.
2 Literature Survey
Raeisi et al. used six data mining methods for e-commerce churn prediction
for an online Iranian food ordering service (gradient boosted tree, rule
induction, k-NN, random forest, decision tree, and Naïve Bayes) and concluded that the
highest accuracy of 86.90% was obtained by gradient boosted trees [1]. However, the
performance of the models could be evaluated one step ahead with AUC analysis.
Lalwani et al. used the XGBoost, AdaBoost, and CatBoost classifiers, Naïve Bayes, logistic regression, decision trees, SVM, random forest, and an extra trees classifier to provide a comparative study in the telecommunication industry, where the maximum accuracy was gained by the AdaBoost and XGBoost classifiers, following the ensembling approach with an AUC value of 0.84 [2]. However, their emphasis on the domination of ensemble methods over weak learners does not hold in all respects.
Abbasimehr et al. employed the C4.5 decision tree, SVM, ANN, and the RIPPER rule learner as base learners and improved the performance of these models using ensemble methods including bagging, boosting, stacking, and voting [3]. Above all, Boosting RIPPER and Boosting C4.5 indicate the domination of the combined strong learners over their base learners in predicting customer churn.
Sudharsan et al. utilized the Swish RNN strategy for the forecasting of churning
customers in the telecommunication industry [4]. By the S-RNN, sensitivity value of
98.27%, specificity of 92.31% and accuracy of 95.99% were observed and clustering
was done with a clustering algorithm called CLARA. This enabled quicker forecasts.
Fathian et al. examined heterogeneous bagging and boosting ensemble classifiers [5]. For this, a comparison was made between the sensitivity, F-measure, specificity, accuracy, and AUC of 14 models, and they concluded that the combination of PCA, SOM, and boosting (a heterogeneous approach) achieved the most desirable results.
Sharma et al. put forth the XGBoost algorithm as the best-performing algorithm, correctly classifying churners among the total churners with a highest true positive rate of 81% and an AUC of 85% [6]. However, the superiority of a single classifier's performance is not always dependable; consequently, combining different models can be one approach to obtaining an optimal model.
Dhini and Fauzan performed two ensemble learning approaches, namely extreme
gradient boosting and random forest [7]. They inferred that the XGBoost is the best
predictor.
Studies have emphasized the utilization of individual models and the optimistic aspects of ensemble models. This paper's intent is to execute a comparative analysis of the performance of several individual classifiers and of their heterogeneous combination using a voting classifier, evaluated by accuracy and AUC analysis.
3 Design and Methodology
Data collection is a fundamental component. The quality of the training data deter-
mines how accurately the machine learning models anticipate the future. Another
integral part of machine learning is data preprocessing, the process of generating clean and reliable data [8]. For an effective and efficient predictive model, the processing of the collected data, in addition to its collection, is very important for better results [9]. The process of choosing the features that have the greatest influence and best predict the target feature is known as feature selection, and correlation is a statistical technique for assessing the degree of association between two variables [10]. Depending on the k highest scores, the SelectKBest technique
chooses the features that have the greatest influence [11]. This is depicted in Fig. 1.
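As an illustrative sketch of this step (the feature matrix below is synthetic, not the actual survey data), SelectKBest keeps the k features with the highest scores:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic stand-in for the 388-row, 55-attribute churn dataset.
X, y = make_classification(n_samples=388, n_features=55, n_informative=10,
                           random_state=0)
X = np.abs(X)  # chi2 requires non-negative feature values

# Keep the 20 features with the highest chi-squared scores.
selector = SelectKBest(score_func=chi2, k=20)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)  # (388, 20)
```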
4 Classification Algorithms
4.1 General Classifiers
Logistic Regression Classifier The dependent variable is modelled through the logistic function. There are only two possible classes for the dependent variable; thus, binary data can be processed using this technique.
Naïve Bayes Classifier This algorithm is a Bayes theorem-based classifier and is
mostly utilized in text categorization. It is appropriate for a training dataset with
many dimensions. It is a probabilistic classifier that goes by the name ‘Naïve’ since
it makes the assumption that the occurrence of one characteristic is unrelated to the
occurrence of other features.
4.2 Ensemble Classifiers
Random Forest Classifier The ensemble strategy (bagging approach), which
employs multiple decision trees on diverse dataset subsets, is used by the random
forest classifier. It takes into account the predictions that have received the most votes
in order to increase predicted accuracy. As a result, it foretells the outcome.
Gradient Boosting Classifier This classifier is a combination of a number of weak
learning models with the boosting approach leading to a powerful predictor. This
process frequently uses decision trees. The residual errors of each weak learner’s
predecessor are used as labels for the training process.
XGBoost Classifier For large datasets, the gradient boosted trees approach is effec-
tively implemented by the XGBoost classifier. It is done to deliver accurate findings
and prevents overfitting [12].
4.3 Voting Classifier
A voting classifier trains several different models and then estimates an output from their combined predictions: it either averages the class probabilities of the classifiers submitted to it or anticipates the output in accordance with the majority of votes.
Online Food Delivery Customer Churn Prediction: A Quantitative 99
Fig. 1 Heterogeneous ensemble method
100 J. Gerald Manju et al.
Instead of developing distinct, specialized approaches and assessing their accuracy individually, a single improved model is produced. It supports two voting processes, known as hard voting and soft voting.
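The voting scheme described above can be sketched with scikit-learn. The snippet below uses synthetic data, and scikit-learn's GradientBoostingClassifier stands in for XGBoost so that the example depends only on scikit-learn:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the preprocessed churn data.
X, y = make_classification(n_samples=388, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Soft voting averages the predicted class probabilities of the base learners.
voting = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("nb", GaussianNB()),
                ("rf", RandomForestClassifier(random_state=0)),
                ("gb", GradientBoostingClassifier(random_state=0))],
    voting="soft")
voting.fit(X_tr, y_tr)
print(round(voting.score(X_te, y_te), 2))
```

Passing `voting="hard"` instead would take a majority vote over the predicted labels rather than averaging probabilities.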
5 Performance Metrics
5.1 Accuracy
Evaluating a machine learning algorithm is crucial; this is done by calculating the proportion of a classifier's correct predictions to all of its predictions, which is called accuracy [13]:

Accuracy = (number of correct predictions) / (total number of predictions)
5.2 Area under the Curve Analysis
Classification analysis uses the area under the curve (AUC) for evaluation and comparison. AUC is a scale that varies from 0 to 1: a model with an AUC of 0 makes unreliable predictions, while a model with an AUC of 1 makes entirely accurate predictions.
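A small, self-contained illustration of the two metrics (the labels and probabilities here are invented); note that AUC scores the predicted probabilities, while accuracy only sees the thresholded labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Invented labels and predicted probabilities for the positive class.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.7, 0.4, 0.6, 0.1, 0.8, 0.3])

# Accuracy uses only the hard labels obtained by thresholding.
y_pred = (y_prob >= 0.5).astype(int)
acc = accuracy_score(y_true, y_pred)     # 0.75

# AUC uses the probabilities themselves, rewarding good ranking.
auc = roc_auc_score(y_true, y_prob)      # 0.9375
print(acc, auc)
```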
6 Results and Discussion
An open-source online food delivery customer churn prediction dataset is taken from
Kaggle for the study. The dataset consists of 388 instances with 55 attributes as given
in Table 1. It is based on consumer trends [14], general purchase decisions, and the importance of delivery time and restaurant rating on purchasing. The attributes
are related to online food delivery preferences in Bengaluru region. The dataset helps
us predict whether the online food delivery customers churn or not with respect to
their preferences. The dependent attribute ‘Output’ indicates with the words ‘Yes’ or ‘No’ whether the client has churned. The dataset has categorical attributes of object type and continuous attributes of numeric type, and it includes variables in Likert form.
Various preprocessing steps are executed [15]. To overcome the chances of over-
fitting, dimensionality reduction methods are carried out. From the observations of
data visualization of continuous variables, it is shown that in age groups less than 25
Table 1 Characteristics of the dataset

Total count of instances           388 (0–387)
Total count of attributes          55
Count of attributes (categorical)  50
Count of attributes (numeric)      5
Count of missing values            Nil
Datatypes                          Float64: 2, Int64: 3, Object: 50
and family size of less than 4 reordering is frequent. Data redundancy is also reduced
by removing certain attributes such as ‘pincode’, ‘Meal (P1)’, ‘Meal (P2)’, ‘Medium
(P1)’, ‘Medium (P2)’, ‘lat’, ‘lon’, as they provided no information. The attributes with
values in Likert scale are transformed to ordinal rank order scale. Then, the correla-
tion matrix is determined using Spearman’s rank correlation method to understand
the intra-relationships between the attributes. A correlation is observed along the diagonals, and the attributes are ordinal rather than continuous. In such cases, applying principal component analysis for dimensionality reduction is not effective. The attributes with categorical values are converted to dummy variables.
The most influencing 20 attributes are selected using SelectKBest method [16].
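The preprocessing steps above (Likert-to-ordinal mapping, Spearman correlation, dummy variables) can be sketched as follows; the column names and values are hypothetical stand-ins for the survey attributes:

```python
import pandas as pd

# Toy frame standing in for the survey data; column names are hypothetical.
df = pd.DataFrame({
    "Age": [22, 34, 27, 41],
    "Family size": [2, 5, 3, 4],
    "Ease and convenient": ["Agree", "Strongly agree", "Neutral", "Disagree"],
    "Output": ["Yes", "No", "Yes", "Yes"],
})

# Transform Likert responses to an ordinal rank-order scale.
likert = {"Strongly disagree": 1, "Disagree": 2, "Neutral": 3,
          "Agree": 4, "Strongly agree": 5}
df["Ease and convenient"] = df["Ease and convenient"].map(likert)

# Spearman's rank correlation between the attributes.
corr = df[["Age", "Family size", "Ease and convenient"]].corr(method="spearman")

# Remaining categorical attributes become dummy variables.
df = pd.get_dummies(df, columns=["Output"], drop_first=True)
print(corr.shape, list(df.columns))
```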
Various classification models are taken to construct an ensemble classification
model to improve the performance of the data mining techniques [17]. Classifiers
like random forest, XGBoost, gradient boost, Naive Bayes, logistic regression are
used as base learners to create an ensemble classification model utilizing the voting
classifier, which is regarded as a strong learner as depicted in Fig. 2[18].
Primarily, the quality of the individual classifiers is assessed using the performance
metric accuracy. Among the five individual classifiers used, XGBoost classifier has
the best accuracy of 100% which is represented in Fig. 3.
To predict the output class, the accuracy of each classifier is then combined using
a voting classifier based on the majority of votes by soft voting. The combination of
the performances of the multiple classifiers gives an accuracy of 98%. Basically, the
voting classifier is regarded as a versatile classifier but it is important to understand
that the voting ensemble method has its limitations [19]. There is a possibility for an
individual classifier to outperform a group of classifiers and this is clearly observed
from our study, as the XGBoost classifier outperforms the voting classifier. The accuracies of the individual models, along with the voting classifier, are represented in Fig. 3.
From the AUC analysis represented in Fig. 3, the Naïve Bayes and random forest classifiers have the same AUC scores. To resolve this tie, the difference between the training score and the testing score is considered: the most successful model is the classifier with the smallest difference.
From Table 2, we infer that random forest classifier is the most effective classi-
fier since there is the least difference between its training and testing scores which
suggests that the model is less likely to make incorrect predictions and so lessens the
overfitting issue.
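This tie-breaking rule can be expressed directly with the scores reported in Table 2:

```python
# Train/test scores from Table 2; the smaller the gap, the less the
# model appears to overfit.
scores = {"Naive Bayes": (0.88, 0.96), "Random forest": (1.00, 0.97)}
gaps = {name: abs(train - test) for name, (train, test) in scores.items()}
best = min(gaps, key=gaps.get)
print(best)  # Random forest
```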
Fig. 2 Combination of classifiers
Fig. 3 Comparative analysis of accuracies and AUC scores
Table 2 Comparative analysis of training and testing scores

Model           Training score   Testing score
Naïve Bayes     0.88             0.96
Random forest   1.00             0.97
7 Conclusion
This paper’s primary objective is to implement an assessment of the effectiveness of
machine learning classifiers used on customer churning data. Specifically, it focuses
on the effectiveness of the voting classifier. Basically, voting classifier is recom-
mended as it is a more powerful meta-classifier that neutralizes the weaknesses of
the individual classifiers on a particular dataset. It is an ensemble method that incor-
porates the outcomes of multiple models to arrive at the ultimate optimal outcome and
then makes predictions. It is essential to comprehend that voting-based models cannot serve as a generic machine learning strategy, as the voting ensemble method is not without its drawbacks. In some instances, a single model can outperform a collection of models when its accurate predictions are nullified by the voting classifier.
This particular aspect of the voting classifier is witnessed through this study as
the individual model—XGBoost classifier outperforms the voting classifier with an
accuracy of 100%. Here, the voting classifier nullified the accurate prediction by the
XGBoost classifier.
Most classifiers not only predict the classes but also output the probability of each prediction; accuracy, however, makes no use of this probability. Area under the curve analysis assesses these probabilities with greater precision. From the AUC scores obtained for the individual classifiers, it is observed
that random forest and Naïve Bayes classifiers stand first having the same AUC score
of 0.952. The results show that random forest classifier outperforms all other models
after this case has been handled.
Currently, the data used for this study is limited to the online food delivery sector and may not apply in the same way to other sectors. In future, the analysis can be extended to different sectors. Explainable artificial intelligence (XAI) is a more reliable technique that helps unveil the underlying prediction process and interpret otherwise impenetrable machine learning models [20]. This in turn serves to explain each feature's contribution, giving a better understanding of the features that most influence customer churn and thereby assisting in its mitigation.
References
1. Raeisi S, Sajedi H (2020) E-commerce customer churn prediction by gradient boosted trees. In:
2020 10th international conference on computer and knowledge engineering (ICCKE). IEEE,
pp 55–59
2. Lalwani P, Mishra MK, Chadha JS, Sethi P (2022) Customer churn prediction system: a machine
learning approach. Computing 104(2):271–294
3. Abbasimehr H, Setak M, Tarokh MJ (2014) A comparative assessment of the performance of
ensemble learning in customer churn prediction. Int Arab J Inf Technol 11(6):599–606
4. Sudharsan R, Ganesh EN (2022) A Swish RNN based customer churn prediction for the telecom
industry with a novel feature selection strategy. Connect Sci 34(1):1855–1876
5. Fathian M, Hoseinpoor Y, Minaei-Bidgoli B (2016) Offering a hybrid approach of data mining
to predict the customer churn based on bagging and boosting methods. Kybernetes 45(5):732–
743
6. Sharma T, Gupta P, Nigam V, Goel M (2020) Customer churn prediction in telecommunications
using gradient boosted trees. In: Khanna A, Gupta D, Bhattacharyya S, Snasel V, Platos J,
Hassanien A (eds) International conference on innovative computing and communications.
Advances in intelligent systems and computing, vol 1059. Springer, Singapore, pp 235–246
7. Dhini A, Fauzan M (2021) Predicting customer churn using ensemble learning: case study of
a fixed broadband company. Int J Technol 12(5):1030–1037
8. Jagadeesan AP (2020) Bank customer retention prediction and customer ranking based on deep
neural networks. Int J Sci Dev Res (IJSDR) 5(9):444–449
9. Momin S, Bohra T, Raut P (2020) Prediction of customer churn using machine learning. In:
EAI international conference on big data innovation for sustainable cognitive computing. EAI/
Springer innovations in communication and computing. Springer, Cham, pp 203–212
10. Fujo SW, Subramanian S, Khder MA (2022) Customer churn prediction in telecommunication
industry using deep learning. Inf Sci Lett 11(1):185–198
11. Domingos E, Ojeme B, Daramola O (2021) Experimental analysis of hyperparameters for deep
learning-based churn prediction in the banking sector. Computation 9(34):1–19
12. Sree GMA, Ashika S, Karthi S, Sathesh V, Shankar M, Pamina J (2019) Churn prediction in
telecom using classification algorithms. Int J Sci Res Eng Dev 2(1):1–16
13. Ahmad AK, Jafar A, Aljoumaa K (2019) Customer churn prediction in telecom using machine
learning in big data platform. J Big Data 6(28):1–24
14. Dias J, Godinho P, Torres P (2020) Machine learning for customer churn prediction in retail
banking. In: International conference on computational science and its applications. Springer,
Cham, pp 576–589
15. Shirazi F, Mohammadi M (2019) A big data analytics model for customer churn prediction in
the retiree segment. Int J Inf Manage 48:238–253
16. Khodabandehlou S, Rahman MZ (2017) Comparison of supervised machine learning tech-
niques for customer churn prediction based on analysis of customer behavior. J Syst Inf Technol
19(1/2):65–93
17. Kumar AS, Chandrakala D (2016) A survey on customer churn prediction using machine
learning techniques. Int J Comput Appl 154(10):13–16
18. Al-Najjar D, Al-Rousan N, Al-Najjar H (2022) Machine learning to develop credit card
customer churn prediction. J Theor Appl Electron Commer Res 17:1529–1542
19. Xu T, Ma Y, Kim K (2021) Telecom churn prediction system based on ensemble learning using
feature grouping. Appl Sci 11(4742):1–12
20. Tavassoli S, Koosha H (2022) Hybrid ensemble learning approaches to customer churn
prediction. Kybernetes 51(3):1062–1088
Prevention Equipment for COVID-19
Spread Using IoT and Multimedia-Based
Solutions
T. S. Dhachina Moorthy, N. Nimalan, S. Sridevi, and B. Nevetha
Abstract The global spread of COVID-19 is a growing concern for everyone.
The virus is transmitted through droplets and airborne particles from one person
to another. The World Health Organization (WHO) recommends wearing a face
mask, social distancing, avoiding crowded areas, and maintaining a strong immune
system to reduce the spread of COVID-19. In response to the pandemic, many coun-
tries have implemented lockdowns to control its spread. Research has shown that
wearing masks in public can help prevent person-to-person transmission of the virus.
This paper proposes a device that uses cameras to detect elevated body temperature,
people wearing face masks, those not wearing face masks, and calculates proximity
among individuals. The proposed model can be deployed in public places such as
shopping malls, hotels, apartment entrances, airports, hospitals, and offices to main-
tain safety standards. The system uses Internet of Things (IoT) technology and deep
learning mechanisms to detect individuals who may be infected with COVID-19. The
proposed framework is evaluated using the face mask detection and social distance
detecting algorithms in the TensorFlow library. A non-contact sensor is used to check
the temperature of each person passing through the device. To ensure ease of use, an
animated film is used to help people understand how to operate the proposed system.
A multimedia application is also employed to display the system’s output to end-users
in the form of visualizations or reports, accompanied by an alarming sound to remind
individuals to maintain distance or avoid crowded areas. The proposed system, when
implemented, can help prevent the spread of COVID-19 and save lives.
T. S. D. Moorthy ·N. Nimalan ·S. Sridevi (B)·B. Nevetha
Department of Information Technology, Thiagarajar College of Engineering, Madurai, Tamil
Nadu, India
e-mail: sridevi@tce.edu
T. S. D. Moorthy
e-mail: dhachina@student.tce.edu
N. Nimalan
e-mail: nimalan@student.tce.edu
B. Nevetha
e-mail: nevetha@student.tce.edu
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_9
105
Keywords COVID-19 ·Face mask detection ·Camera ·IoT ·Sensors ·
Temperature detection
1 Introduction
The global outbreak of the coronavirus has raised concerns among the public
regarding the spread of the virus. To help slow down and ultimately stop the spread
of the virus, society is seeking tools to aid in the detection of infections. Although
there are currently no thermal cameras available that can detect the virus, forward-
looking infrared (FLIR) cameras can be used as an additional tool for body temper-
ature screening in high-traffic public places, allowing for quick individual screening
[1,2]. If an individual’s skin temperature in key areas is higher than the average
temperature, they may be selected for additional screening. These cameras use the
thermography temperature measurement technology, which allows for accurate non-
reactive, contactless, and planar recording of surface temperatures. This makes them
suitable for fast and easy detection of increased body temperatures, which may indi-
cate a possible virus infection among individuals who have undergone the screening
process. Using the infrared cameras, body temperature is measured at the inner angle of the eye. An alarm is raised at the slightest difference, and elevated body temperatures can thus be displayed.
Besides that, the device can detect, alert, and hopefully remind a person who is
not wearing a mask to wear one before entering a venue or facility. To slow the spread of the coronavirus, the Centers for Disease Control and Prevention and state health agencies have advised people to maintain social distance and wear masks in
public over the past few months [3,4]. A small camera is built into the device. When
a person without a mask approaches us, it alerts us, flashes bright lights, and sends
out a loud audio alert reminding them to wear a mask. The device can detect all types
of face masks, including medical masks and scarves. The device’s goal is to remind
people to wear masks, especially during a health.
The proposed work also includes an additional feature in which cameras can
measure the distance between people, and report the person when they are too near
to each other. Time-of-flight technology is used by the sensors to enable precise
monitoring of lines of people. Using fully anonymous data collection, sensors can
identify the existence of people and calculate the proximity to the neighboring person.
Ultrasonic sensors with high accuracy, miniaturization, and low power consumption
are a perfect technology for solutions that prevent infection by social distancing. To
check if people are maintaining regulation distances, real-time data is linked to
a visual and/or audio system [5,6]. In accordance with the local regulations and
current public health advice, the system allows a minimal distance threshold between
people to be set. A signal is sent to enable a visual/audio alert when the distance is
violated—which makes sure that the distracted people figure out that they are not
respecting distancing. To ensure social distancing measures are maintained, social
distance monitoring devices can be installed to monitor entrances of public places.
Businesses can make sure that the safety of staff and customers is maintained, while
adhering with rules and regulations and health constraints caused by COVID-19.
2 Existing System
Handheld temperature scanners also called ‘temperature guns’ have been used for a
long period of time for checking the temperature of individuals. The temperature is
determined by the thermal radiation emitted by the object. This prevents individuals
from touching each other or touching the device, but consider this: what if the person who checks body temperatures has the coronavirus but is asymptomatic? He must come into close proximity with others, which is exactly the contact a pandemic requires us to avoid. The impact of COVID-19 across the world is shown in Fig. 1 [7]. There is a high risk of spread due to the reduced distance between individuals. In some places right now, people are appointed to hold such a scanner and check the temperature of customers [8–10]. This results in a waste of both time and money. The proposed device can prove helpful for doctors too. The usage of thermometers at hospitals can be a disadvantage for doctors, as they take some time to produce a reading, and this lost time can affect a doctor’s profit per day. Using our device, you can ensure a safer distance and instant temperature readings, which can result in a better profit.
The existing system used IoT-enabled devices to predict and prevent the spread of COVID-19. Some people, however, do not know how to use these devices. To create awareness and demonstrate the working of the proposed system, we also focus here on multimedia applications. With the help of
Fig. 1 Total confirmed cases across the world on May, 2020 (Source: https://www.newsclick.in/
covid-19-graphs-cases-recoveries-deaths)
multimedia applications, one should understand how to use the IoT-based enabling
devices [11,12].
3 Proposed Methodology
The solution is a device that acts as a multitasking machine: it identifies the temperature of people nearby, checks whether everyone around is wearing a mask, and helps maintain a safe distance between every person. Though there are no thermal
cameras which can detect the virus, in addition to conventional body temperature
screening technologies, forward-looking infrared (FLIR) cameras can be utilized for
quick individual screening to identify people with excessive skin temperatures in
high-traffic public areas. The proposed work develops a fever screening and tracking
system using the thermal and normal cameras present in the device. We train the
network of cameras with deep learning algorithms, where the video feed is given as input and the corresponding result is obtained. The algorithm will be precise, as the device is trained with a large number of images to identify whether a given individual is wearing a mask [7, 13–15]. Using fully anonymous data collection, sensors can
identify the existence of people and calculate the proximity to the neighboring person.
Ultrasonic sensors with high accuracy, miniaturization, and low power consumption
are a perfect technology for solutions that prevent infection by social distancing. In
accordance with the local regulations and current public health advice, the system
allows a minimal distance threshold between people to be set [16,17]. A signal is
sent to enable a visual/audio alert when the distance is violated—which makes sure
that the distracted people figure out that they are not respecting distancing.
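The threshold check described above can be sketched as follows; the positions and threshold value are illustrative, not taken from the chapter:

```python
import numpy as np
from scipy.spatial.distance import pdist

# Positions of detected people in the frame, in centimetres (invented).
positions = np.array([[0.0, 0.0], [120.0, 0.0], [130.0, 160.0]])

MIN_DISTANCE = 150.0   # configurable threshold, per local regulations

# Any pair closer than the threshold triggers the visual/audio alert.
violation = bool((pdist(positions) < MIN_DISTANCE).any())
print(violation)  # True: the first two people are only 120 cm apart
```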
3.1 Proposed System Design
See Fig. 2.
3.2 Prototype System Architecture
See Figs. 3,4and 5.
The entire process for the working prototype model is displayed in Fig. 6, starting
with the import of the dataset, followed by the start of the video stream, face detection
in the video stream, and application of the face mask classifier to the face ROI to
determine whether there is a mask. The results are displayed in a box around the face
ROI that is highlighted. The software then looks for neighboring faces if a mask is
found.
Fig. 2 Proposed system design
Fig. 3 Sample input file for object detection
Fig. 4 Face mask dataset
Fig. 5 Without face mask dataset
The person identification model begins and tries to identify the person if a mask
is not there, sends a notification to the impacted person after the person identification
model has been run successfully. Until every person in the frame is covered by a
mask, this process keeps repeating.
Fig. 6 Prototype system architecture
3.3 Face Mask Detection Using Hybrid Convolution Neural
Networks (CNN) Algorithm
The face mask that is present on every person’s face is identified in this study using
hybrid convolution neural networks (CNN). But there is a slight modification here.
The image datasets are supplied as input in the form of arrays and then passed into MobileNet, where maximum pooling takes place. The
major applications of hybrid CNN, a type of artificial neural network, are image
recognition and analysis. CNN is specifically designed to analyze pixel input. A
result is produced by combining between 100 and 1000 filters, and the resulting
output is then passed on to the following layer of the neural network [18–20]. Keras and TensorFlow are used to train the mask detection model. The process used in the algorithm is explained in Fig. 7.
Fig. 7 Face mask detection using CNN
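A minimal sketch of a MobileNet-based mask classifier of the kind described above; the head layers and their sizes are illustrative assumptions, not the authors' exact architecture, and the weights are left untrained here:

```python
import numpy as np
import tensorflow as tf

# MobileNetV2 base (untrained in this sketch) with a small pooling +
# dense head for the binary mask / no-mask decision.
base = tf.keras.applications.MobileNetV2(input_shape=(224, 224, 3),
                                         include_top=False, weights=None)
base.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalMaxPooling2D(),   # the maximum-pooling step
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# The image datasets are supplied as arrays; one random image stands in.
probs = model.predict(np.random.rand(1, 224, 224, 3).astype("float32"))
print(probs.shape)  # (1, 1)
```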
3.4 Social Distancing Detection
Using YOLO V3, the proposed work detects the people present in the given video
dataset or live video feed. To track the people, we draw boxes around each individual, measure their centroids, and give each a unique ID, as shown in Fig. 8.
The algorithm identifies each person; in the next stage, we must track each as the same person as they move. In the second subfigure above, the purple point is the initial position and the yellow point is the position after the person has moved. To know that
it is the same person who has moved from here to there, we measure the Euclidean
distance from every old and new centroid and the close pairs will be considered as the
same person. The working of the distance measuring algorithm is shown in Fig. 9.
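The centroid-matching step can be sketched with SciPy (the coordinates are invented):

```python
import numpy as np
from scipy.spatial.distance import cdist

# Centroids of people in the previous frame (IDs 0 and 1) and in the
# new frame, in pixel coordinates.
old = np.array([[100.0, 200.0], [300.0, 120.0]])
new = np.array([[305.0, 118.0], [104.0, 203.0]])

# Euclidean distance between every old and new centroid; the closest
# pair is treated as the same person who has moved.
d = cdist(old, new)
match = d.argmin(axis=1)   # new-frame index assigned to each old ID
print(match)               # ID 0 -> new[1], ID 1 -> new[0]
```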
We need to calculate the space between everyone coming across the device. Let the doodle in the image be the object, photographed by a camera. First, we calibrate the focal length from a reference photo of the object at a known distance using F = (P × D) / W, where F is the focal length, P is how many pixels the object covers in the photo, D is the distance, and W is the object width.
Fig. 8 Assigning ID to centroids
Fig. 9 Distance measuring
algorithm
Using that focal length, as the camera moves, the distance changes and so does the number of pixels covered. With this information, the new distance is calculated as D = (W × F) / P and saved.
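A sketch of this triangle-similarity calculation, with made-up calibration numbers:

```python
# Triangle-similarity distance estimation.
def focal_length(P, D, W):
    """Calibrate the focal length: F = (P * D) / W."""
    return (P * D) / W

def distance(F, W, P):
    """New distance for a photo covering P pixels: D = (W * F) / P."""
    return (W * F) / P

# Calibration: an object 30 cm wide, 100 cm away, spans 300 pixels.
F = focal_length(300, 100, 30)   # 1000.0
# Later the same object spans only 150 pixels, so it is farther away.
print(distance(F, 30, 150))      # 200.0
```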
3.5 Temperature Detection
The Arduino temperature sensor transforms the ambient temperature into a voltage. It then converts the voltage to Celsius, then to Fahrenheit, and displays the Fahrenheit temperature on the LCD panel. The circuit is shown in Fig. 10.
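The conversion chain can be sketched in Python; the 0.5 V offset and 10 mV/°C scale assume a TMP36-style sensor, which is an assumption on our part, as the chapter does not name the exact part:

```python
# Voltage -> Celsius -> Fahrenheit, as in the Arduino sketch described
# above; constants assume a TMP36-style sensor (0.5 V offset, 10 mV/°C).
def to_celsius(voltage):
    return (voltage - 0.5) * 100.0

def to_fahrenheit(celsius):
    return celsius * 9.0 / 5.0 + 32.0

v = 0.75                 # volts read from the analog pin
c = to_celsius(v)        # 25.0 degrees Celsius
print(to_fahrenheit(c))  # 77.0 degrees Fahrenheit
```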
4 Functional Requirements
4.1 Hardware Requirements: Infrared Thermometer
In its most basic design, an infrared thermometer focuses IR energy onto a detector, which converts the energy into an electrical signal displayed in units of temperature [11, 21, 22].
This ensures temperature checking without going into proximity of the person. As
a result, the infrared thermometer can be used to measure temperature in situations
when other instruments cannot produce accurate outputs.
Thermal Imaging Cameras
With a 180° rotating optical block, the FLIR T865 thermal imaging camera is a
non-contact inspection tool that enables users to comfortably and safely evaluate
the state of crucial mechanical and electrical equipment in utility and manufacturing
applications.
Buzzer
A buzzer or beeper is an audio signaling device which is used for alerting people when someone without a mask comes into a specific region (Fig. 10).
4.2 Software Requirements
Face Mask Detection
TensorFlow
Fig. 10 Circuit diagram of temperature sensor
It is the open-source platform used for machine learning. Some packages, such as the image data generator, are imported, and MobileNet is imported from TensorFlow and Keras.
Keras
It is an open-source library that provides an interface for neural networks. All packages imported from TensorFlow are also imported from the sub-package Keras.
Imutils
It is a sequence of image processing functions.
NumPy
NumPy is used for mathematical functions. Here, NumPy is used to store the
images with and without mask as separate arrays.
OpenCV-Python
OpenCV provides the entire computer vision library and tools.
Matplotlib
Matplotlib is used to create diagrammatic visualizations. Here, the training loss and accuracy values are plotted as the line chart shown in Fig. 12.
Argparse
Argparse is used to write user-friendly command line interfaces.
SciPy
SciPy provides algorithms for optimization and scientific computing.
Scikit learn
Fig. 11 Results of face
mask detection
It provides efficient tools for predictive data analysis.
Social Distancing Detection
YOLO V3
A real-time object detection method, you only look once (YOLO), recognizes items in videos and live feeds. Here, it is used to identify the people present in the video dataset and to ensure a safe distance between them.
Imutils,
NumPy,
OpenCV-Python,
SciPy.
Temperature Monitoring
Tinkercad
Tinkercad is an online Arduino simulation website which is used here for the
online circuit design of our temperature sensor shown in Fig. 10.
Fig. 12 Plot of the loss or accuracy of training and value versus epoch
5 Results and Discussion for Multimedia-Based Strategic
Plans to Prevent Spread of COVID-19
The relevant images and videos are collected, and audio is added and edited
using the Final Cut Pro video editing tool. An interactive video is created so
that it is simple for individuals to use. Moreover, multimedia applications are
used to deliver the results of the proposed model to end users as reports or
visualizations, along with alarm sounds. The sound warns people to keep their
distance or avoid crowds (Figs. 12, 13, 14 and 15).
The proposed multimedia-based application to prevent COVID-19 is uploaded to
YouTube at the following link: https://youtu.be/LfAga_j8F5k.
6 Business Impacts
The arrival of our project in the market will have a major impact on handheld
temperature scanners, as there is no need to point the device at a person's
forehead. It is enough for a person to simply pass by, and the device will
report whether the person has a fever. This also ensures a safer distance
between individuals, which is not possible with a handheld scanner. The device
will be used mostly during pandemics, which must eventually come to an end, so
it will lie dormant as it is not a daily-use item. Therefore, the project will
not have many buyers during normal times, but its usage will boom during
difficult times. This
Fig. 13 Output
Fig. 14 Algorithm
comparison
Fig. 15 Confusion matrix
device can save doctors the time spent waiting for a thermometer's mercury
reading to rise or fall, as it scans body temperature as soon as the patient
enters the doctor's room. This helps them diagnose many more patients than
before, increasing the doctor's income.
7 Conclusion
The availability of smart technology and new breakthroughs promotes the
development of new models, which will help fulfill the demands of developing
nations. An IoT-enabled smart gadget is created in this study to measure
proximity between individuals, detect face masks, and measure body temperature,
all of which can improve public safety. This adds an additional layer of
prevention against the spread of COVID-19 infection while also helping to
reduce labor requirements. The model makes use of IoT to identify face masks,
detect temperatures, and track the proximity of all people present at any one
time. Moreover, the gadget is scalable and feasible, and there are many ways to
boost its performance further. The suggested approach and multimedia-based
software would assist in maintaining safety standards as states and
municipalities adopt reopening plans throughout the COVID-19 epidemic.
Renal Disease Classification Using Image
Processing
Rohan Sahai Mathur, Varun Gupta, Tushar Bansal, Yash Khare,
and Sanjay Kumar Dubey
Abstract The growth of renal disease has increased gradually, affecting millions
of people, and the number of affected people increases each year. Chronic kidney
disease usually occurs due to abnormal albumin excretion, which reduces kidney
function for more than three months. In terms of life expectancy, 7.6% of deaths
are due to chronic kidney disease, and it accounted for 4.6% of all-cause
mortality. The best way to treat renal diseases is early prophylaxis, achieved
by accurately diagnosing the patient at a very early stage. The diagnostic
methods include ultrasonographic diagnosis, which is a cheaper, more convenient,
and timelier method. This paper presents renal disease detection and
classification using supervised techniques, which classify disease with up to
97% accuracy, and uses image processing tools for kidney stone detection with
an accuracy of 95%.
Keywords Machine learning · Image processing · Chronic kidney disease ·
Kidney stone · Ultrasound images
1 Introduction
The World Health Organization (WHO) considers chronic kidney disease (CKD)
one of the most significant public health issues. It affects millions of
people, and a 2% increase in the number of affected people is observed each
year. The disease has spread across the globe and remains a crucial public
health problem, affecting 12% of the world's population. Chronic kidney
disease is a progressive, irreversible loss of renal function caused by a
decreasing glomerular filtration rate, which leads to
R. S. Mathur (B) · V. Gupta · T. Bansal · Y. Khare · S. K. Dubey
Amity University, Uttar Pradesh, Sector 125, Noida, India
e-mail: rsmathur74@gmail.com
S. K. Dubey
e-mail: skdubey1@amity.edu
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_10
the complete deterioration of the kidney. When a kidney is heavily damaged then it
loses its capability to filter the blood effectively [1].
Chronic kidney disease occurs when abnormal albumin excretion causes kidney
damage, reducing proper kidney function for more than three months. Kidney
function can be assessed by measuring it directly or by estimating the
glomerular filtration rate (GFR). CKD has a major influence on global health,
as it increases mortality either as an associated risk factor for
cardiovascular diseases or by causing morbidity and mortality directly [2].
Patients with stage 1 and stage 2 CKD suffer from renal disease with zero
symptoms, showing only minor abnormalities in kidney function such as
electrolyte, metabolic, or endocrine imbalances detected on serum examinations.
Thus, there is a basic need to develop safe, rapid, and cost-effective
strategies to screen for and diagnose CKD precisely, so that preventive
measures can be started at the earliest to slow the deterioration of kidney
function. In patients at all phases of CKD, kidney ultrasonography, which is
broadly accessible at minimal expense, can be performed to survey structural
changes related to CKD, for example, decreased kidney size or changes in the
texture of the renal parenchyma. Echocardiographic imaging might disclose
chamber dilation, diastolic/systolic dysfunction of the ventricles and atria,
left ventricular wall hypertrophy, valve stenosis brought on by early
degenerative valvular disease, and decayed right ventricular capability due to
pulmonary hypertension. Patients in high phases of renal dysfunction vary from
the healthy population in their hydration, with cyclic fluid changes. In the
typical condition, the kidneys control the body's fluid volume and water
homeostasis [3].
In the current scenario, kidney diseases are detected by continuously
monitoring specific parameters obtained through diagnostic tests. Statistical
models are then used to determine the actual presence or absence of the
disease and to analyze its severity. This analysis can be automated through
models based on artificial intelligence and machine learning, which may help
obtain statistically better results or better-performing solutions. Machine
learning algorithms can be used in many ways, such as object detection (which
can be used to detect kidney stones), classification of a kidney mass type,
and prediction of the severity of the disease [4].
The contributions made are as follows:
Using image processing techniques such as image inpainting and recognition to
identify the presence of kidney stone(s) inside the kidney.
Use of machine learning techniques to predict the efficiency of the therapy
given to chronic kidney disease stage V dialysis patients (hemodialysis), and
to suggest a course of action.
2 Related Work
Early renal classification is an especially important task because untreated
renal disease can lead to chronic kidney disease. This classification uses
machine learning algorithms such as decision tree and SVM, which give better
results than the traditional method, with around 84% accuracy [5]. The
prediction of T-cell epitopes for SARS-CoV-2 can also give additional
information like protein–protein interaction and structural information [6].
tional information like protein–protein interaction and structural information [6].
The ensemble learning technique can be used to predict T-cells with more
accuracy [7]. Medical images are an important part of diagnosing any disease,
and for better and more efficient kidney disease prediction, deep learning can
be the best tool [8]. Disease management requires early detection, and machine
learning can give more accurate CKD prediction with predictors like blood pressure,
serum creatinine [9]. Kidney identification is mostly performed
semi-automatically or automatically. An earlier study found that end-stage
renal disease (ESRD) is the most severe stage, in which patients go into
critical condition and ultimately need a transplant [10]. A system has been
developed that targets early detection of chronic kidney disease in diabetic
patients with the assistance of artificial intelligence (AI) techniques and
recommends a decision tree to arrive at valid outcomes with beneficial
precision [11]. Apart from CKD,
acute kidney injuries (AKI) also critically affect ill patients. The development of the
method of identifying helps in reducing the complications of AKI. There has been a
development of the model which used five supervised learning models to detect AKI
using deep learning [12]. In kidney disease detection, detection of the glomerular
lesions is the major component which is time-consuming and should be accurately
done. A framework developed has been made based upon a deep neural network to
locate glomeruli and quantify distinct glomerular cells [13]. Machine learning has
been proven to be important for CKD prediction with high accuracy.
Heterogeneous modified ANN and backpropagation have proved to have high
accuracy in preprocessing ultrasound images of the kidney and help detect the
region of interest more precisely [14]. The case-based reasoning method has
been proven to provide an excellent neural network-based renal disease prediction.
This model uses demographic data for training along with some medical data [15].
Artificial neural network methods such as SVM and KNN can be used to predict
CKD with greater accuracy, sensitivity, and specificity [16]. It is crucial for
a patient with CKD to receive renal replacement therapy (RRT), i.e., kidney
transplantation or hemodialysis, at the right time to ensure the patient's
well-being [17].
3 Methodology
In this project, both machine learning and image processing are deployed to develop
a machine-level understanding of chronic kidney disease and kidney stones, respec-
tively. Machine learning uses dialysis patients’ data to analyze the therapy of CKD
stage V patients and image processing uses patients’ ultrasound images to detect
kidney stones.
Different machine learning algorithms, such as KNN and decision tree, are used
to train the models. At the conclusion of the model-building stage, a confusion
matrix is used to determine the quality of the machine learning model. Lastly,
renal diseases are predicted using the classification model mentioned in the
previous stage. Figure 1 shows the methodology the project follows and the
steps to ensure the effectiveness of the image processing model. The steps for
the image processing model are as follows.
3.1 Ultrasound Pictures
The first step is to collect various ultrasound images from hospitals, some of
which contain stones and others which do not. This dataset helps determine the
effectiveness of the model created.
Fig. 1 Methodology of
image processing
3.2 Image Pre-processing
The second step is to remove all unnecessary noise from the ultrasound image,
such as text or lines. This can be done using methods like image inpainting, in
which missing parts of the ultrasound image are recovered or filled. Another
method is noise filtering, which detects text or lines on the ultrasound image
and removes them; image inpainting can then be used to fill the resulting gaps.
3.3 Diagnosis of the Kidney’s Outer Contour
In this step, after removing all the noise from the ultrasound images, the
outer portion of the kidney is highlighted to narrow the area of interest in
the image. This is done using various image processing techniques to reduce
the region where the stone can be found in the kidney.
3.4 ROI Identification
After reducing the area of interest, the main region of interest (ROI) is
determined in the kidney by removing all connected pixel components whose size
is less than p pixels with the help of a MATLAB function. These steps help to
further shrink the region of interest.
3.5 Detection of the Stone in the ROI
As the region of interest is determined, the stone can be found by overlaying the
output image with the original image. This helps to find the region where the stone
is present in the kidney.
3.6 Labeling of the Ultrasound Image
The last step is to label the image and notify the user whether a stone is
detected. If a stone is detected in the kidney, the user can take the necessary
steps to avoid any mishap or visit a doctor for further consultation.
4 Experimental Work
Two methods have been adopted in this paper: machine learning for training and
testing on the collected dataset, which includes parameters like hemoglobin,
urea, and creatinine from chronic kidney disease patients, and image processing
for classifying ultrasound images collected from kidney stone patients.
4.1 Machine Learning
The dataset has been collected from hospitals and labeled by consulting experts
in the medical field and by building on the results obtained from unsupervised
learning.
Then, this dataset has been divided into k folds, using k − 1 folds
for training and one fold for testing purposes. In this way, cross-validation of the
findings from supervised learning becomes easier. Supervised learning is a
subfield of machine learning and artificial intelligence in which an algorithm
is trained on a labeled dataset to classify data. There are several supervised
learning methods that have been used, including linear regression. The linear
regression model performs a regression task, predicting a target value based
on independent variables. It is also used for forecasting and estimating the
relationships between variables. Logistic regression is used for statistical analysis to predict
yes-or-no outcomes based on prior observations of a dataset. It helps predict
dependent data variables using the relationship between one or more existing
independent variables. The support vector machine creates the best decision
boundaries for segregating n-dimensional space into classes; SVM places new
data points correctly with ease so that the model can be used in the future.
The Naive Bayes algorithm is used for text classification problems with
high-dimensional training datasets. A decision tree is used as a decision
support tool that lays out possible consequences, including chance event
outcomes, resource costs, and utility. Lastly, random forest is used for both
classification and regression; it combines multiple classifiers to solve
complex machine learning problems and improves the performance of the model.
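The k-fold procedure and supervised models above can be sketched with scikit-learn. The dataset below is synthetic, and the feature count, fold count, and tree depth are illustrative assumptions, not the paper's actual data or settings.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the patient dataset: five lab parameters
# (e.g. hemoglobin, urea, creatinine, total protein, albumin)
# and a three-class therapy-response label (0, 1, 2).
X = rng.normal(size=(300, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int) + (X[:, 2] > 1).astype(int)

# k-fold cross-validation: k-1 folds train, one fold tests, rotated
clf = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())
```

Swapping `criterion="gini"` for `"entropy"` reproduces the two splitting rules compared later in the results section.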
4.2 Image Processing
The project uses various concepts of image processing to detect a stone in the
human kidney. A model has been made to find the specific location of the stone,
or the region of interest, in the human kidney. The steps to determine the
specific location of the stone in the human kidney are as follows.
4.2.1 Preprocessing
The initial step in our image processing is to remove all the unimportant data
that is not useful for our purpose. This unimportant data includes all the
markings that doctors make while examining the kidney reports, and these can be
removed successfully using image processing. To get our ROI (region of
interest), we must convert our image from RGB to grayscale, then grayscale to
binary, and so forth. However, after conversion, we will not see any color
difference between the RGB image and the grayscale image, as the RGB
ultrasound image contains no color to begin with. The next step is to convert
the image into a binary image. But before that, we first check the histogram to
verify whether global thresholding can be applied.
After applying the LFT thresholding, we can see that the histogram is not
bimodal, as there is no peak in the lower part. However, we can also use the
pixel information to find a threshold value by trial and error. As the cursor
moves to a brighter part, the intensity value increases; in our case, we have
selected 20. This means that any intensity above 20 returns 1, and anything
else returns 0. This is how we obtain our output. However, since the result
contains many holes, we need to fill these holes for it to work properly. Once
the holes are filled, we can see our ROI vividly, although some unnecessary
elements still need to be cleared up.
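The thresholding and hole-filling steps can be sketched as follows. The `to_binary_mask` helper and the toy image are our own illustrative stand-ins; only the threshold of 20 comes from the text.

```python
import numpy as np
from scipy import ndimage

def to_binary_mask(gray, thresh=20):
    """Threshold a grayscale image: intensities above `thresh`
    become 1, the rest 0, then fill interior holes."""
    binary = (gray > thresh).astype(np.uint8)
    # Fill holes so the kidney region becomes one solid blob
    filled = ndimage.binary_fill_holes(binary).astype(np.uint8)
    return filled

# Tiny synthetic "image": a bright square with a dark hole inside
img = np.zeros((7, 7), dtype=np.uint8)
img[1:6, 1:6] = 200       # bright region
img[3, 3] = 5             # dark hole below the threshold
mask = to_binary_mask(img)
print(mask[3, 3])  # 1 -- the interior hole has been filled
```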
Figure 2 displays the ultrasound image taken as the input for the model to
determine the specific location of the stone in the kidney. This ultrasound
image also contains noise, such as text and lines, which must be removed by
preprocessing before the image can be used to locate the stone.
Fig. 2 Ultrasound image of
kidney [16]
Fig. 3 Processed image of
the ultrasound
Figure 3 displays the ultrasound image after preprocessing: all the noise, such
as text and lines, that could affect the output has been removed. Only the
region of the kidney remains after the preprocessing step.
4.2.2 Contrast Enhancement
By making the dark sections darker and the bright portions brighter, the
boundary region, which is not crucial for our application, can be eliminated.
This is termed contrast enhancement or contrast stretching. As a result, values
between 0 and 0.3 are mapped to 0, values beyond 0.7 are treated as 1, and
values between 0.3 and 0.7 are stretched linearly onto the range 0 to 1.
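The mapping described above can be written as a small piecewise function; `contrast_stretch` is an illustrative sketch, with only the 0.3/0.7 breakpoints taken from the text.

```python
import numpy as np

def contrast_stretch(img, low=0.3, high=0.7):
    """Piecewise contrast stretching on intensities in [0, 1]:
    values <= low map to 0, values >= high map to 1, and values
    in between are stretched linearly onto [0, 1]."""
    out = (img - low) / (high - low)
    return np.clip(out, 0.0, 1.0)

vals = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
print(contrast_stretch(vals))
```

The midpoint 0.5 maps to 0.5, while everything in the dark and bright tails saturates at 0 and 1 respectively.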
4.2.3 Feature Extraction
In this section, several techniques are utilized to extract kidney traits that
aid in characterizing the kidney. Feature extraction consists of three parts:
a median filter, ROIPoly (region-of-interest polygon), and image segmentation.
4.2.4 Median Filter
A particular kind of image processing filter known as a median filter replaces a given
picture’s pixel value with the median value of the pixels in the area around it. Mostly,
noises that are not required in the image are eliminated from a picture using the
median filter. Instead of using the average of the surrounding pixels, it uses their
median to keep the image’s crisp edges while reducing noise. Given that the median
of a group of values is less susceptible to outliers than the mean, the median filter is
particularly helpful for maintaining the borders of a picture.
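The edge-preserving behavior of the median filter, compared with an averaging filter, can be demonstrated on a one-row toy image; the data here is our own illustration.

```python
import numpy as np
from scipy.ndimage import median_filter, uniform_filter

# A 1-row "image" with a sharp edge and one bright speckle pixel
row = np.array([[10, 10, 10, 200, 10, 90, 90, 90]], dtype=float)

med = median_filter(row, size=(1, 3))   # 3-pixel median window
avg = uniform_filter(row, size=(1, 3))  # 3-pixel mean window

# The median removes the lone speckle and keeps the edge crisp;
# the mean smears both the speckle and the edge.
print(med[0])
print(avg[0])
```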
4.2.5 ROI Polygon
ROI, or region of interest, is an image processing phrase that refers to a specific
portion of an image that is of interest and should be handled differently than the
remainder of the picture. In image processing, ROIPoly is a particular implementation
of ROI selection where the user may specify an area of interest by encircling it with
a polygon. When the user clicks on various locations in the picture to pick the ROI, a
polygon is produced by connecting these points. Once the ROI has been established,
the user may isolate that area and only apply certain image processing methods to
it, leaving the rest of the picture untouched. This enables the user to concentrate on
a specific area of a picture and carry out processing actions that are specific to that
area.
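MATLAB's roipoly is interactive, but its core rasterization step can be sketched non-interactively. The `polygon_mask` function below is an illustrative even-odd (ray casting) point-in-polygon implementation, not the paper's code.

```python
import numpy as np

def polygon_mask(shape, vertices):
    """Rasterize a polygon (list of (x, y) vertices) into a boolean
    mask, similar in spirit to MATLAB's roipoly but non-interactive.
    Uses the even-odd (ray casting) rule per pixel center."""
    h, w = shape
    mask = np.zeros((h, w), dtype=bool)
    n = len(vertices)
    for y in range(h):
        for x in range(w):
            inside = False
            px, py = x + 0.5, y + 0.5   # test the pixel center
            for i in range(n):
                x1, y1 = vertices[i]
                x2, y2 = vertices[(i + 1) % n]
                # Does a horizontal ray from (px, py) cross edge i?
                if (y1 > py) != (y2 > py):
                    x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
                    if px < x_cross:
                        inside = not inside
            mask[y, x] = inside
    return mask

# Square ROI from (1,1) to (4,4) inside a 6x6 image
m = polygon_mask((6, 6), [(1, 1), (4, 1), (4, 4), (1, 4)])
print(int(m.sum()))  # 9 interior pixel centers
```

Once such a mask is built, filters can be applied to the masked region only, leaving the rest of the picture untouched.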
4.2.6 Image Segmentation
In image segmentation, the ‘bwareaopen’ function of MATLAB has been used. In
this function, the objects in the picture are distinguished from the background using
binary images. The function eliminates minute connected parts from a binary picture.
Larger items in an image may be distinguished from smaller ones that can be caused
by noise or other image processing processes using this function.
4.2.7 Labeling the Image
The process of labeling the image starts once the image has been segmented
using the ‘bwareaopen’ function. Here, if at least one binary object is
detected, the model simply displays ‘Stone is detected’; otherwise, it shows
‘No stone is detected’.
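The segmentation and labeling steps can be sketched together. The `bwareaopen` function here is a Python stand-in for the MATLAB function of the same name, and the toy image and minimum component size are illustrative.

```python
import numpy as np
from scipy import ndimage

def bwareaopen(binary, min_pixels):
    """Remove connected components smaller than `min_pixels` from a
    binary image, mimicking MATLAB's bwareaopen."""
    labels, n = ndimage.label(binary)
    sizes = ndimage.sum(binary, labels, range(1, n + 1))
    keep = np.zeros_like(binary, dtype=bool)
    for i, size in enumerate(sizes, start=1):
        if size >= min_pixels:
            keep |= labels == i
    return keep

# Binary ROI with one large blob (candidate stone) and tiny specks
img = np.zeros((8, 8), dtype=bool)
img[2:5, 2:5] = True      # 9-pixel blob
img[0, 7] = True          # 1-pixel noise speck
img[6, 1] = True          # another speck

cleaned = bwareaopen(img, min_pixels=4)
_, n_objects = ndimage.label(cleaned)
print("Stone is detected" if n_objects >= 1 else "No stone is detected")
```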
5 Discussion
In this project, we used supervised and unsupervised learning algorithms to
determine the model with the highest accuracy, which was the decision tree
algorithm. We used this model to predict the efficiency of therapy received by
hemodialysis patients who have reached stage V of chronic kidney disease,
using parameters such as urea, creatinine, albumin, total protein, and
hemoglobin as indicators of dialysis therapy. Our results demonstrate that the
decision tree algorithm effectively predicts CKD therapy efficiency levels,
with precision of 0.97, recall of 0.97, and accuracy of 0.97.
We also analyzed the ultrasound images of various patients using image
processing to identify kidney stones in the region-of-interest area of the
kidney. Initially, preprocessing is done on the image to remove any distortion
that may arise from the quality of the image. After that, the region of
interest is identified, which helps recognize the location where the
probability of finding a stone in the kidney is highest. This method is then
applied to various ultrasound images to check its accuracy, which comes out to
95.53%. This accuracy can be further improved by using different preprocessing
and postprocessing filters such as the negate, contrast, Prewitt, Sobel, and
Canny filters.
6 Limitations
Image distortions can cause image features to appear stretched, compressed,
or warped, which can affect the accuracy of image analysis and recognition
algorithms.
Image gaps can lead to missing information in the image, which can affect the
performance of image analysis algorithms such as object detection, segmentation,
and recognition.
Medical data is sensitive and personal; hence it is difficult to obtain large amounts
of high-quality data to train machine learning models efficiently.
Moreover, machine learning models can perpetuate bias and unfairness if the
training data is biased or if the model is not designed to be fair. These
models can also be vulnerable to attacks such as adversarial attacks, in which
an attacker deliberately manipulates the input data to cause the model to
produce incorrect predictions.
7 Result and Analysis
Prescribed ranges differ across investigations. For instance, the hemoglobin of
a CKD stage V patient is advised to be 11–12 g/deciliter, and similar ranges
exist for investigations such as albumin, total protein, HbA1c, platelets,
urea, creatinine, and Kt/V.
A K-means clustering model was designed to divide the dataset into three
clusters. Cluster 2 defines a data range where the patient is responding very
well to the therapy, cluster 1 defines ranges where the patient is responding
satisfactorily, and cluster 0 represents a data range where the therapy given
is below par.
This cluster categorization aims to help medical professionals determine
whether to continue the same therapy for a given patient with the given
parameters. In a country such as India, which faces a shortage of medical
professionals, this categorization helps channel their focus toward patients
in an optimal way. For instance, cluster 0 patients (where therapy is below
par) require the most attention, cluster 1 patients (therapy at par) require
the existing attention and care, and cluster 2 patients (therapy above par) do
not require urgent medical intervention.
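The three-cluster K-means setup can be sketched as below. The hemoglobin values are synthetic stand-ins for the real dialysis data, with cluster means chosen only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Synthetic hemoglobin values (g/dL) for dialysis patients, loosely
# spanning below-par, satisfactory, and very good therapy response
hb = np.concatenate([
    rng.normal(8.0, 0.4, 50),    # responding poorly
    rng.normal(11.5, 0.4, 50),   # within the advised 11-12 range
    rng.normal(14.0, 0.4, 50),   # responding very well
]).reshape(-1, 1)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(hb)
print(sorted(km.cluster_centers_.ravel().round(1)))
```

The three learned centers fall near the three response regimes, so each patient's cluster label can be read as a triage category.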
It has been observed that the unsupervised model was not very successful in
predicting whether a given value belongs to a particular cluster; its
predictions were found to be inaccurate, and the model needs improvement. In
diseases such as chronic kidney disease, many parameters are involved, namely
A/G ratio, alkaline phosphatase, calcium, chlorides, folic acid, globulin,
indirect bilirubin, Kt/V, and total phosphate, to name a few. This study
focused on five major parameters: hemoglobin, urea, creatinine, total protein,
and albumin.
In supervised learning, the dataset is labeled into three groups with consultation
from medical experts. Group 0 represents values below the prescribed medical range
for stage V chronic kidney disease patients, group 1 represents values within the
prescribed range, and group 2 represents values above the prescribed range.
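The group labeling rule can be expressed as a small function; `therapy_group` and the sample readings are illustrative, with only the 11–12 g/dL hemoglobin range taken from the text.

```python
def therapy_group(value, low, high):
    """Label a lab value against the prescribed medical range:
    0 = below range, 1 = within range, 2 = above range."""
    if value < low:
        return 0
    if value > high:
        return 2
    return 1

# Hemoglobin advised at 11-12 g/dL for CKD stage V patients
readings = [9.8, 11.4, 13.1]
print([therapy_group(v, 11.0, 12.0) for v in readings])  # [0, 1, 2]
```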
The following figures show the results obtained from the experiments. A random
sample of 425 patients' hemoglobin data was taken from the dataset; the
predicted therapy category for each datapoint is shown in Fig. 4.
The confusion matrix shows that an overwhelming majority of our predicted
values fall within the categories ‘True Category 0’, ‘True Category 1’, and
‘True Category 2’. As most of the large values are aligned along the diagonal
of the confusion matrix and the non-diagonal values are close to 0, we infer
that our prediction is highly accurate.
Figure 5 shows the results obtained by Gini index classification on the
selected sample. An accuracy of 97% shows that the proposed model is a
near-perfect model.
Fig. 4 Gini index classification results (decision tree)
Fig. 5 Results of supervised
learning using Gini index
A random sample of 372 patients' hemoglobin data was taken from the dataset;
the predicted therapy category for each datapoint is shown in Fig. 6.
The confusion matrix shows that an overwhelming majority of our predicted
values fall within the categories ‘True Category 0’, ‘True Category 1’, and
‘True Category 2’. As most of the large values are aligned along the diagonal
of the confusion matrix and all but one non-diagonal value is 0, we infer that
our prediction is very accurate.
Figure 7 shows the results obtained by entropy classification on the selected
sample. An accuracy of 98% shows that there are very few anomalies and
impurities in the datapoints, and that data splitting has been done efficiently.
It can be observed that this supervised machine learning model successfully
predicts whether a given value belongs to a particular cluster. Three clusters
can be seen, namely 0, 1, and 2. Supervised learning succeeds at predicting
cluster values, as its predictions are very accurate.
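The classification results above come from decision-tree splits scored by the Gini index and entropy. As a minimal illustration of how the Gini criterion scores a candidate split, the pure-Python sketch below uses invented hemoglobin values and an assumed 10–12 gm/dl target range, not the study’s dataset or prescribed medical ranges.

```python
# Simplified sketch of how a decision tree scores candidate splits with the
# Gini index; the values and thresholds are illustrative, not the study's data.
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_impurity(values, labels, threshold):
    """Weighted Gini impurity after splitting at `threshold`."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Toy hemoglobin readings (gm/dl) labelled 0 = below, 1 = within, 2 = above
# an assumed 10-12 gm/dl target range.
hb = [7.5, 8.2, 9.1, 10.4, 11.0, 11.8, 12.6, 13.9, 15.0]
cat = [0, 0, 0, 1, 1, 1, 2, 2, 2]

# A split at 10 gm/dl cleanly separates category 0, so its weighted impurity
# is lower than a split in the middle of category 1.
print(split_impurity(hb, cat, 10.0))   # lower
print(split_impurity(hb, cat, 11.0))   # higher
```

The tree greedily picks the lowest-impurity split at each node; the entropy criterion works the same way with `-sum(p_k * log(p_k))` in place of the Gini formula.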
The analysis focuses on developing insights into the accuracy achieved by the
algorithms and the results obtained through this research. It is observed that
some algorithms performed better and came close to predicting the actual values.
As the data is categorized into three main categories, cluster 0 represents
values depicting a good therapy response from the patient, cluster 1 a
satisfactory response, and cluster 2 an unsatisfactory response.
Fig. 6 Entropy classification results
Renal Disease Classification Using Image Processing 133
Fig. 7 Results of supervised learning using entropy
The image processing results highlight the areas where a stone is most likely to
be found. With the assistance of doctors, we used various methods to reach the
region of interest where a stone may be present. The insights from the image
processing will further assist doctors in treating patients correctly, as they
indicate whether a stone is present in the patient’s body or not. They also
identify the area of the stone, which helps the doctor operate more easily in
case a stone is detected.
Figure 8 shows that the ultrasound image given as input to the model contains a
stone in the kidney and marks the location of the stone, which can help the
doctor remove it easily. This eases the doctors’ work and saves time in
emergency situations.
Fig. 8 Result of image processing (showing the location of the detected stone)
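The localization step described above can be sketched as a simple intensity-threshold pass that marks the bounding box of bright pixels as the candidate stone region. The toy frame and the threshold below are invented stand-ins, not the study’s ultrasound data or its actual processing pipeline.

```python
# Minimal sketch: kidney stones appear as bright (high-intensity) regions in
# an ultrasound image, so thresholding and taking the bounding box of bright
# pixels marks a candidate region of interest. The 2D list is a stand-in for
# a real ultrasound frame; the threshold of 200 is an assumption.

def locate_bright_region(image, threshold=200):
    """Return (row_min, row_max, col_min, col_max) of pixels > threshold,
    or None if no pixel exceeds it."""
    coords = [(r, c) for r, row in enumerate(image)
                     for c, v in enumerate(row) if v > threshold]
    if not coords:
        return None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return (min(rows), max(rows), min(cols), max(cols))

# Toy 6x6 "ultrasound" frame with a bright blob around rows 2-3, cols 3-4.
frame = [
    [20, 30,  25,  28,  22, 19],
    [31, 40,  36,  60,  45, 27],
    [29, 38, 120, 230, 240, 33],
    [26, 41, 110, 235, 225, 30],
    [24, 35,  42,  55,  47, 21],
    [18, 27,  23,  26,  20, 17],
]
print(locate_bright_region(frame))  # (2, 3, 3, 4)
```

A real pipeline would add speckle denoising and contour extraction before this step, but the bounding box already conveys the “show the doctor where the stone is” idea.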
8 Conclusion and Future Scope
Machine learning models yielded satisfactory results, with unsupervised learning
underperforming and supervised learning performing well.
The selection of few parameters and a small dataset caused the unsatisfactory
unsupervised-learning results. For one patient, a hemoglobin of 12 gm/dl would
be good for one set of parameters, while for another patient the same level
proves too high for their set of parameters. Whether the therapy given is good,
bad, or satisfactory thus varies from patient to patient. Hence, a much larger
dataset of chronic kidney disease (CKD) patients of all stages (Stage I to
Stage V) is required to yield better results for unsupervised CKD predictions.
Data labeling conducted with assistance from medical experts proved crucial for
supervised learning, as the proposed model gives accurate predictions. A
doctor’s opinion remains better than any machine prediction, which is reflected
in the disparity between the unsupervised and supervised machine learning
results.
Image processing clearly shows the location of kidney stones in ultrasound images
and can be applied in chronic kidney disease too. A model can be built which takes
ultrasound images and directly predicts whether the kidney has a stone or not. If yes,
it could help determine the area occupied by the stone and its size. This could prove
to be of great utility in the medical field and assist doctors in better decision-making.
In future, image processing will be able to detect the size and dimensions of
the stone, helping doctors decide which procedure to apply for each patient;
for example, a patient with a small stone can be cured with medicine alone,
while a patient with a somewhat larger stone needs to be operated on. This
distinction can also be made with image processing techniques in the future.
Furthermore, with the help of ultrasound images, the classification of the
various stages of chronic kidney disease will also become possible. With
concentric contour detection and the transition-indicator measurement, the
stages can be classified: if the white-to-black ratio is high, the patient is
in an initial stage, while a lower value means the patient is in the final
stages. This classification will be vital for doctors in deciding the treatment
process for each patient.
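The proposed white-to-black ratio heuristic for staging might be sketched as follows. The 128 brightness cutoff and the ratio threshold of 1.0 are placeholder assumptions for illustration, not clinically validated values.

```python
# Sketch of the staging heuristic: compute the ratio of white (bright) to
# black (dark) pixels in a segmented ultrasound image and map a higher ratio
# to an earlier CKD stage. Cutoffs are invented placeholders.

def white_to_black_ratio(image, cutoff=128):
    white = sum(v >= cutoff for row in image for v in row)
    black = sum(v < cutoff for row in image for v in row)
    return white / black if black else float("inf")

def coarse_stage(image, ratio_threshold=1.0):
    """Higher white-to-black ratio -> 'initial' stage, lower -> 'final'."""
    return "initial" if white_to_black_ratio(image) > ratio_threshold else "final"

healthy_like = [[200, 210], [190, 40]]   # mostly bright tissue
late_like = [[30, 20], [220, 10]]        # mostly dark
print(coarse_stage(healthy_like))  # initial
print(coarse_stage(late_like))     # final
```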
References
1. Alnazer I, Bourdon P, Urruty T, Falou O, Khalil M, Shahin A, Fernandez-Maloigne C (2021)
Recent advances in medical image processing for the evaluation of chronic kidney disease.
Med Image Anal 69:101960
2. Gudigar A, Raghavendra U, Samanth J, Gangavarapu MR, Kudva A, Paramasivam G, Acharya
UR et al (2021) Automated detection of chronic kidney disease using image fusion and graph
embedding techniques with ultrasound images. Biomed Signal Process Control 68:102733
3. Ghosh P, Shamrat FMJM, Shultana S, Afrin S, Anjum AA, Khan AA (2020) Optimization
of prediction method of chronic kidney disease using machine learning algorithm. In: 2020
15th international joint symposium on artificial intelligence and natural language processing
(iSAI-NLP). IEEE, pp 1–6
4. Georgieva V, Petrov P, Mihaylova A (2018) Ultrasound image processing for improving
diagnose of renal diseases. In: 2018 IX national conference with international participation
(ELECTRONICA). IEEE, pp 1–4
5. Bai Q et al (2022) Machine learning to predict end stage kidney disease in chronic kidney
disease. Sci Rep 12(8377):1–8
6. Bukhari SNH, Jain A, Haq E, Mehbodniya A, Webber J (2021) Ensemble machine learning
model to predict SARS-CoV-2 T-Cell epitopes as potential vaccine targets. Diagnostics
2021(1990):1–18
7. Bukhari SNH, Webber J, Mehbodniya A (2022) Decision tree based ensemble machine learning
model for the prediction of Zika virus T-cell epitopes as potential vaccine candidates. Sci Rep
12(7810):1–11
8. Kumar K et al (2023) A deep learning approach for kidney disease recognition and prediction
through image processing. Appl Sci 13(3621):1–14
9. Islam MA, Majumder MZH, Hussein MA (2023) Chronic kidney disease prediction based on
machine learning algorithms. J Pathol Inform 14:100189
10. Segal Z, Kalifa D, Radinsky K, Ehrenberg B, Elad G, Maor G, Koren G et al (2020) Machine
learning algorithm for early detection of end-stage renal disease. BMC Nephrol 21(518):1–10
11. Padmanaban KRA, Parthiban G (2016) Applying machine learning techniques for predicting
the risk of chronic kidney disease. Indian J Sci Technol 9(29):1–5
12. Li Y, Yao L, Mao C, Srivastava A, Jiang X, Luo Y (2018) Early prediction of acute kidney
injury in critical care setting using clinical notes. In: 2018 IEEE international conference on
bioinformatics and biomedicine (BIBM). IEEE, pp 683–686
13. Zeng C, Nan Y, Xu F, Lei Q, Li F, Chen T, Liang S et al (2020) Identification of glomerular
lesions and intrinsic glomerular cell types in kidney diseases via deep learning. J Pathol
252(1):53–64
14. Ma F, Sun T, Liu L, Jing H (2020) Detection and diagnosis of chronic kidney disease using
deep learning-based heterogeneous modified artificial neural network. Futur Gener Comput
Syst 111:17–26
15. Vásquez-Morales GR, Martinez-Monterrubio SM, Moreno-Ger P, Recio-Garcia JA (2019)
Explainable prediction of chronic renal disease in the Colombian population using neural
networks and case-based reasoning. IEEE Access 7:152900–152910
16. Pal S (2022) Chronic kidney disease prediction using machine learning techniques. Biomed
Mater Devices, pp 1–7
17. Dovgan E, Gradišek A, Luštrek M, Uddin M, Nursetyo AA, Annavarajula SK, Li Y-C,
Syed-Abdul S (2020) Using machine learning models to predict the initiation of renal
replacement therapy among chronic kidney disease patients. PLoS ONE 15(6):e0233976
Identification of Fake Users on Social
Networks and Detection of Spammers
B. Srinivasa Rao, Badisa Bhavana, Gudimetla Abhishek,
and Peddiboyina Hema Harini
Abstract Numerous people use social networking services on a global scale. The
way people interact with social media platforms such as Facebook and Twitter has
a significant impact on daily life, frequently with negative outcomes. Popular
social networking sites are targeted by spammers, who spread large amounts of
unwanted and damaging content there. Twitter, for instance, has become one of
the most widely used platforms ever, which has led to an annoying quantity of
spam. By sending unwanted tweets to promote businesses or websites, fake users
waste resources and also hurt real people. Additionally, the capacity for
transmitting incorrect information under false identities has grown, aiding the
spread of dangerous content. Identifying spammers and unauthorized users has
therefore become a major research issue on today’s online social networks
(OSNs). This survey covers fake users, link spam, and spam content based on
trending topics. The solutions presented are also contrasted on a variety of
criteria, such as user, content, graph, structure, and temporal features. We
hope that the presented study will serve as a valuable resource for academics
looking for the most significant recent developments in Twitter spam
identification on a single platform.
Keywords OSN · Spam · Fake account · URL · Twitter · Social media
B. Srinivasa Rao (B)
Department of Information Technology, Lakireddy Bali Reddy College of Engineering,
Mylavaram, Andhra Pradesh, India
e-mail: doctorbsrinivasarao@gmail.com
B. Bhavana ·G. Abhishek ·P. H. Harini
Lakireddy Bali Reddy College of Engineering, Mylavaram, Andhra Pradesh, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_11
137
138 B. Srinivasa Rao et al.
1 Introduction
Twitter, one of the better-known social media platforms, has been used in a
number of studies, and the majority of people currently use it. Because Twitter
also hosts fictitious accounts, this study addresses the identification of fake
Twitter users. In this study we identify fake users through fake content,
URL-based spam detection, spam in trending topics, and fake-account
recognition, and then expose the bogus account. By posting frequently and on
issues unrelated to the conversation, a fake user wastes other people’s time.
Social networking services like Twitter, Facebook, MySpace, Instagram, and
LinkedIn have grown in popularity during the past few years. Compared to other
social media platforms, Twitter is one of the most well-known and significant
networking websites. Twitter enables users to upload and distribute messages;
the network refers to messages of no more than 280 characters as tweets. People
typically use social media websites to express their opinions on a variety of
topics, feelings, and ideas about other people. These sites may be the biggest
platforms for individuals to publish comments and reviews on products they have
bought. At the moment, 0.13% of Twitter adverts are clicked, which results in a
higher rate of spam data access than email spam [1]. Twitter and other online
social networks, largely used for the exchange of useful information, are
frequently targeted by social crawlers and hackers due to their large user
bases. On social networking sites, spam crawlers are frequently referred to as
social crawlers.
In fact, a number of studies have been done on identifying Twitter spam. To
include the most recent research, a few surveys on fake Twitter user
recognition have also been conducted. Tingmin et al. present a survey of modern
methods and procedures for Twitter spam detection [2], which provides a
comparison of current methods. A study of the various actions taken by spammers
on Twitter was conducted by the authors of [3], offering an analysis of the
literature that acknowledges the prevalence of spammers on Twitter. Despite all
of the investigations that have been conducted, a gap remains in the
literature. Therefore, to reduce this gap, we look at the most recent
developments in spammer detection as well as fake-user recognition on Twitter.
Additionally, this study employs a taxonomy of methods for identifying Twitter
spam and tries to provide a thorough summary of recent advancements in the
area.
According to Wikipedia, a social networking platform such as Twitter is one
that “focuses on the growth and verification of online social networks for
communities of individuals who share interests and activities, or who are
interested in exploring the interests and activities of others, and which
requires the use of software.” An OCLC report defines social networking sites
similarly: websites like Facebook, Mixi, and MySpace were primarily developed
for users who engage in the exchange of goods and services. Social networks
offer a variety of advantages to people within an organization. They can foster
relationships between research subjects and those involved in promoting
literacy, and they can also improve informal learning. Social media tools are
available to all employees of a company, not just those who interact with
students, and technology can advance with the aid of social media. Conversing
with others on social networks can provide crucial corporate information and
feedback on institutional solutions (although this may raise ethical
questions), and it can reduce workload and improve information accessibility.
The simplicity of many social networking sites can benefit users by
facilitating access to new tools and procedures. The Facebook platform is an
example of how a social networking service can be used as a surface for other
applications through a common interface. One advantage of social networks may
be their shared user interface, which transcends both professional and social
divides. Because the same services are regularly used in a particular capacity,
the user interface and the way the services work become familiar, so less
training and assistance is needed to use them in a professional context.
However, this may be problematic for those who prefer clear distinctions
between their work and their social lives.
Identification of Fake Users on Social Networks and Detection 139
1.1 Objectives of the Project
Finding any kind of information from any source anywhere in the world has
become much simpler as a result of the Internet. People can obtain a large
amount of data and facts about other people thanks to the social media
platforms’ expanding user bases. Because they provide so much information,
these websites draw fake users. In fact, Twitter has become very popular as a
source of current personal information.
1.2 Problem Definition
Due to social media sites’ rising popularity, users have access to a vast
amount of personal data about both themselves and other people. These
platforms are attractive to scammers because of the amount of data they hold.
The popularity of Twitter as a source of up-to-date information about people
has grown quickly, and the wealth of easily available information on such
sites attracts the wrong kind of users.
Motivation: Researchers have recently shown an increasing level of interest in
detecting spam on social networking sites. Recognizing spam is a difficult
problem in maintaining social media safety and security. If users are to be
protected from all forms of harmful attacks, to keep their sense of privacy,
and to feel safe and secure, then finding spam on OSN sites is essential.
Spammers’ harmful actions cause extensive damage to the community. Twitter is a
platform that spammers use to spread false information, fake news, rumours, and
inappropriate statements, to mention a few. Spammers maintain a large number of
accounts to connect their interests and employ a variety of other techniques,
such as sending spam messages at random, to accomplish their harmful goals. The
original users, also referred to as non-spammers, are irritated by this
behaviour, and the OSN systems’ reputation is also damaged. To ensure that
appropriate steps can be taken to halt this unwanted behaviour, it is crucial
to develop a system for finding spammers.
2 Related Work
Reference [1]: Shivangi Gheewala et al. note that OSNs have taken a number of
measures to safeguard private information from various threats. Despite the
importance of these measures, the authors believe there is not yet a conceptual
foundation for building information-protection technologies, and that the
central concept of such a technique must be risk. They therefore advise OSNs to
adopt a risk-monitoring strategy throughout their operation. By attaching risk
factors to social network users, they aim to make people consider how risky it
would be to connect with a user while sharing personal information. They take
user risk attitudes into account, using similarity and profit signals to
determine danger limits. In particular, they employ a dynamic risk-assessment
learning approach in which a select number of critical user interactions
demonstrate risky user behaviour. The risk-assessment method discussed in the
article was developed and tested on real data.
Reference [4]: The method proposed by Rohit Kumar Kaliyar et al. was referred
to as “Fake News Detection Using a Deep Neural Network.” The results of online
forums with person-to-person conversational forms have garnered significantly
more attention than the combination of electronic communication tools in
co-located classes. In this study, middle school students’ perceptions and
assumptions about two communication styles in co-located classrooms,
face-to-face (F2F) and simultaneous computer-mediated communication (CMC), were
examined. The authors distinguish between pupils who are considered to be
participating in face-to-face classroom discussions and those who are typically
silent. These studies demonstrate the benefits of CMC over F2F interactions in
co-located settings and reveal that different students (“active” and “silent”)
have varied preconceptions of both F2F and CMC. Computer network infractions
and cyberattacks have a substantial protective impact.
Reference [5]: Gupta et al. proposed a technique titled “Towards Distinguishing
Fake User Accounts on Facebook.” People are highly vulnerable on OSNs because
of a genuine concern over digital offenders carrying out numerous malicious
deeds, and a whole black-market industry of account-based services has
developed, selling these fake accounts. The main objective of their study is to
identify bogus accounts on Facebook, a hugely popular (and hard to gather data
from) online social network. The contributions of the work are as follows.
Considerable effort went into compiling data covering both real and fake
Facebook accounts; because of Facebook’s strict security policies and
application interface, which is constantly being enhanced with new
restrictions, gathering account information is a difficult task. The next step
is to leverage Facebook user-feed data to study profile behaviours and identify
a set of 17 criteria that are essential for differentiating fraudulent users
from real ones on Facebook. Finally, these features are used to identify, out
of a total of 12 classifiers evaluated, the AI-based classifiers that excel at
the identification task.
TITLE: Identifying fake Twitter accounts. AUTHORS: B. Erçahin, Ö. Aktaş,
D. Kilinç, and C. Akyol. Many people use social media websites like Facebook
and Twitter, and the connections they make there have a profound impact on
their lives. A number of issues have emerged as a result of social networking’s
rising popularity, including the potential for dangerous material to spread by
deceiving people into thinking a user is someone they are not. This situation
has the potential to profoundly damage society in real life. The authors
present a classification method for identifying fake Twitter accounts. Their
dataset was pre-processed using the Entropy Minimization Discretization (EMD)
method of supervised discretization on numerical features, and the output of
the Naive Bayes algorithm was then examined.
TITLE: Detecting spammers on Twitter. AUTHORS: F. Benevenuto, G. Magno,
T. Rodrigues, and V. Almeida. With many individuals tweeting on a global scale,
new information-mining tools and search engines are emerging to help users keep
up with events and information on Twitter. Despite being useful for
accelerating the flow of information and enabling users to discuss events and
promote their standing, these services also open the door for new kinds of
spam. Trending topics, the most popular topics on Twitter at any given time,
have been considered a possible way to boost traffic and revenue. In their
tweets, spammers use popular terms from a hot topic as well as URLs that, in
most cases, are masked by URL shorteners and take users to completely unrelated
websites. If methods for discouraging spammers are not developed, this kind of
spam could reduce the usefulness of real-time search systems. The essay
examines the challenge of identifying spammers on Twitter. A sizable Twitter
dataset, containing more than 54 million users, 1.8 billion tweets, and 1.9
billion links, served as the starting point. Tweets on three trending topics
from 2009 were used to create a large labelled collection of users manually
classified as spammers and non-spammers. The authors then identify several
traits related to tweet content and user social behaviour that can be used to
identify spammers; these characteristics form the foundation of a machine
learning method for classifying users as spammers or non-spammers. The
technique correctly identified 96% of non-spammers and about 70% of spammers,
while only a small percentage of non-spammers were incorrectly categorized. The
findings also illustrate the crucial features for identifying Twitter spam.
TITLE: A comprehensive NLP-based technique for identifying potentially harmful
tweets. AUTHORS: S. Gharge and M. Chavan. The detection of fake user accounts
was the main objective of many past works, and Twitter spam detection has
recently drawn more attention as a social network research topic. The authors
provide a strategy based on two novel ideas: a method based on language
analysis for finding spam in topics trending on Twitter at the time, and a
method for detecting spam tweets without taking the user’s history into
account. Trending topics are the issues of conversation that are currently
popular, and this growing microblogging trend benefits spammers. The work looks
for spam tweets using linguistic techniques: tweets related to a variety of
popular subjects were first gathered and categorized as containing either safe
or dangerous material. After labelling, many features based on linguistic
differences were extracted, using language as a tool. The authors also assess
effectiveness and classify tweets as spam or not. The method can therefore be
used to detect spam on Twitter by focusing on tweet analysis rather than
user-account evaluation.
TITLE: A survey of Twitter spam detection using cutting-edge methods. AUTHORS:
T. Wu, S. Wen, Y. Xiang, and W. Zhou. Twitter spam has long been a serious yet
challenging issue to solve. Researchers have offered a variety of detection and
defence methods to protect Twitter users from spamming activities.
Particularly in the last three years, a number of novel techniques have been
developed that significantly improve detection performance and accuracy
compared with those offered earlier. As a result, the authors were motivated to
conduct a fresh survey of Twitter spam detection techniques. The investigation
is divided into three sections: (1) a review of contemporary literature,
providing in-depth analysis (such as taxonomies and biases in feature
selection) as well as discussion of the benefits and drawbacks of each
fundamental strategy; (2) comparative studies, in which the performance of
various common strategies is compared on a common testbed (i.e., the same
datasets and real-world scenarios) to give a quantifiable understanding of
current techniques; and (3) open issues, summarizing the problems that current
Twitter spam detection techniques still encounter. Addressing these open issues
is crucial for both the academic community and industry. Readers of the study
may include people looking for a thorough understanding of the topic in order
to develop original strategies, as well as those with or without prior
experience in the field.
TITLE: An examination of spammers’ behaviour on popular social media networks.
AUTHOR: S. J. Soman. Social media websites and applications have developed into
a significant component of the Internet and currently have a significant effect
on people’s lives. Social networking sites (SNSs) enable user interaction, but
the blogosphere has been plagued by many forms of spam-like information. As
social networking websites have become increasingly popular, they have become a
prime target for spammers, who annoy users by polluting search results with
useless content. Researchers initially concentrated on building honeypots to
find spam. Spammers and marketing specialists both use Twitter as a targeting
mechanism. The author examines the extensive literature that demonstrates the
existence of spam and spammers on well-known social media sites.
2.1 Existing System
Social networking services like Twitter and Facebook are used by millions of
individuals, and their involvement with these sites has a significant impact on
their lives. Due to its popularity, social networking has given rise to a
number of issues, including the potential for dangerous content to spread by
tricking people into believing a user is someone they are not. This
circumstance has the potential to cause significant harm to society in the real
world. The existing study offers a classification technique for identifying
fake Twitter accounts, with the dataset pre-processed using the Entropy
Minimization Discretization (EMD) method on numerical features.
2.2 Proposed System
The suggested system uses a combination of metadata-, content-, interaction-,
and community-based features to identify fake users, in order to detect social
spam bots on Twitter. In existing approaches, most network-based features are
not defined using a user’s followers and underlying community structures, which
ignores the fact that a user’s reputation in a network is inherited from
followers (rather than from those they are following) and from community
members. As a result, the system places a strong emphasis on using community
structures and followers to define a user’s network-based features. The system
divides the features into four major categories: fake content, URL-based spam,
spam in trending topics, and fake users. The network category is further
divided into interaction-based and community-based features. While
content-based features study a user’s message-posting behaviour and the quality
of the text used in postings, metadata features are derived from the additional
information available about a user’s tweets. Network-based features are
extracted from the network of user interactions (Fig. 1).
Fig. 1 Spammer detection model
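One way the four feature categories could be grouped in code is sketched below. Every input field name is an illustrative assumption about what each category might hold, not the system’s actual schema; note that reputation is computed from followers, in line with the emphasis above.

```python
# Sketch of grouping the four feature categories used by the proposed system.
# All input field names are illustrative assumptions, not the real schema.

def extract_features(user):
    followers = user["followers_count"]
    following = user["following_count"]
    return {
        "metadata": {
            "account_age_days": user["account_age_days"],
            "tweet_count": user["tweet_count"],
        },
        "content": {
            # quality of postings, e.g. how URL-heavy the tweets are
            "avg_urls_per_tweet": user["url_count"] / max(user["tweet_count"], 1),
        },
        "interaction": {
            # reputation inherited from followers, as the system emphasises
            "reputation": followers / max(followers + following, 1),
        },
        "community": {
            "shared_community_members": user["community_overlap"],
        },
    }

suspect = {"followers_count": 40, "following_count": 360, "account_age_days": 12,
           "tweet_count": 100, "url_count": 80, "community_overlap": 2}
print(extract_features(suspect)["interaction"]["reputation"])  # 0.1
```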
3 Methodology
The authors discuss the strategy for identifying spam and fake accounts on the online
social network Twitter.
The authors use four different detection methods, namely fake user
identification, fake content, spam URL detection, and spam in trending topics,
to carry out the detection task. After determining whether a tweet is regular
or spam using these four methods, a random forest data-mining algorithm is
trained on the resulting data to identify spam and non-spam tweets and the
percentages of fake and real accounts. Although many methods exist in the
literature for classifying tweets as spam or not spam, random forest
classification is used here.
The four ways of determining whether a tweet is spam are described below. Many
features are included, such as user attributes (retweets, tweets, following,
etc.) and content attributes.
Fake content: a low number of followers relative to the number of accounts
followed indicates that the account’s reputation is low and there is a strong
likelihood that it is spam. Similar features include HTTP links, mentions and
replies, trending topics, and the reputation of tweets. If the user tweets a
great deal within a short time for their time zone, the account is considered
spam.
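A rule-of-thumb version of this check might look as follows. The reputation floor of 0.3 and the rate cap of 20 tweets per hour are invented placeholders, not thresholds from the study.

```python
# Rule-of-thumb sketch of the fake-content check: flag an account whose
# follower-based reputation is low or whose posting rate is abnormally high.
# The 0.3 floor and 20-tweets-per-hour cap are invented placeholders.

def reputation(followers, following):
    """Share of an account's connections that chose to follow it."""
    return followers / max(followers + following, 1)

def is_suspicious(followers, following, tweets_last_hour,
                  reputation_floor=0.3, rate_cap=20):
    return (reputation(followers, following) < reputation_floor
            or tweets_last_hour > rate_cap)

print(is_suspicious(followers=15, following=2000, tweets_last_hour=3))   # True
print(is_suspicious(followers=800, following=400, tweets_last_hour=5))   # False
```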
Spam URL recognition: the user-based features are determined by a number of
factors, including the age of the account and the number of the user’s
favourites, lists, and tweets. The characteristics based on detected user input
are contained in the parsed JSON structure. Retweets, hashtags, user mentions,
and URLs are attributes of a tweet like any other. A machine learning method,
Naive Bayes, is used to determine whether a tweet contains a spam URL.
Using the Naive Bayes method to classify tweet content, it is possible to
determine whether a trending topic contains spam or non-spam terms. The model
searches for matching tweets, spam links, and adult content. Naive Bayes
returns 1 if the tweet contains spam and 0 if no spam content is found.
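A minimal bag-of-words Naive Bayes classifier in the spirit of this step is sketched below. The tiny training set is invented for illustration and is not the study’s data; a real model would be trained on the labelled tweet corpus.

```python
# Minimal bag-of-words Naive Bayes sketch: label a tweet as spam (1) or not
# spam (0) using word likelihoods with Laplace smoothing. Training data is
# invented for illustration.
import math
from collections import Counter

def train_nb(docs, labels):
    classes = set(labels)
    priors = {c: labels.count(c) / len(labels) for c in classes}
    word_counts = {c: Counter() for c in classes}
    vocab = set()
    for doc, lab in zip(docs, labels):
        words = doc.lower().split()
        word_counts[lab].update(words)
        vocab.update(words)
    return priors, word_counts, vocab

def predict_nb(model, doc):
    priors, word_counts, vocab = model
    best, best_lp = None, -math.inf
    for c, prior in priors.items():
        total = sum(word_counts[c].values())
        lp = math.log(prior)
        for w in doc.lower().split():
            lp += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

docs = ["win free money click link", "free prize click now",
        "great talk at the conference", "see you at lunch tomorrow"]
labels = [1, 1, 0, 0]
model = train_nb(docs, labels)
print(predict_nb(model, "click for free money"))   # 1 (spam)
print(predict_nb(model, "lunch after the talk"))   # 0 (not spam)
```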
Fake user detection: These features include account age, followers, and unfollowers. The
content-related features capture the tweets submitted by the users: unlike legitimate
users, spammer bots upload many instances of duplicated content. This method extracts
information from tweets and uses the Naive Bayes algorithm to categorize them as
spam or non-spam, depending on who they follow and whether they contain spam
material. To determine whether an account is fake, a random forest
algorithm is then trained on these attributes. The feature.txt file contains
all the extracted features, and the “Model” folder contains the Naive Bayes classifier.
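To illustrate how a Naive Bayes classifier separates spam from non-spam text, the following is a minimal from-scratch sketch with Laplace smoothing. It is an illustration of the technique only, not the classifier stored in the “Model” folder, and the training tweets are invented placeholders.

```python
import math
from collections import Counter

# Minimal word-level Naive Bayes sketch. Label 1 = spam, 0 = not spam.
class TinyNaiveBayes:
    def fit(self, texts, labels):
        self.word_counts = {0: Counter(), 1: Counter()}
        self.class_counts = Counter(labels)
        for text, y in zip(texts, labels):
            self.word_counts[y].update(text.lower().split())
        self.vocab = set(w for c in self.word_counts.values() for w in c)
        return self

    def predict(self, text):
        scores = {}
        for y in (0, 1):
            total = sum(self.word_counts[y].values())
            # Log prior plus Laplace-smoothed log likelihoods of each word.
            score = math.log(self.class_counts[y] / sum(self.class_counts.values()))
            for w in text.lower().split():
                score += math.log((self.word_counts[y][w] + 1) / (total + len(self.vocab)))
            scores[y] = score
        return max(scores, key=scores.get)

nb = TinyNaiveBayes().fit(
    ["win free money now", "claim free prize", "meeting at noon", "lunch with family"],
    [1, 1, 0, 0],
)
```

Smoothing matters here: without the “+ 1” terms, any word unseen in a class would drive that class's probability to zero.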
The aforementioned methods allow us to determine whether a tweet contains legitimate
content or spam. Social networks can improve their reputation in the market by
identifying and eliminating such spam communications; their popularity might decline if
spam messages were not removed. Today's consumers rely extensively on social media to
access business, family, and news information, so keeping these platforms free of spam
helps them build their reputation.
We use a Twitter dataset in JSON format that comprises user information,
tweet counts, follower and following counts, favourite tweets, and more. We examine
all this information using the Python JSON API to determine whether a user account is
real or fake and whether it contains spam or regular communications. The “tweets”
folder contains all of these dataset files.
4 Implementation
To start this project, double-click the “run.bat” file to bring up the main screen.
Click the “Upload Twitter JSON Format Tweets Dataset” button in that window, then
upload the tweets folder.
146 B. Srinivasa Rao et al.
In the screen above, a folder called “tweets” containing tweets in JSON format from
various individuals has been uploaded. Click the open button to begin reading the
tweets.
The next screen shows all of the loaded tweets from all users. Click the button that
loads the Naive Bayes classifier to analyse tweet text or URLs.
To analyse each tweet for fraudulent material and spam, choose “Detect Fake
Content, Spam URL, Trending Subject & Fake Account.” Spam URL and fake account
detection use the Naive Bayes classifier in addition to the other techniques
mentioned above. The Naive Bayes classifier is loaded in the screen above.
All features from the tweet collection are extracted and analysed in the screen
above to determine whether a tweet is spam. Each tweet record displays data such
as the account's TWEET TEXT, FOLLOWERS, and FOLLOWING, together with whether the
tweet text is legitimate, spam, or neither. In the text field above, each record
is separated by an empty line. To train a random forest classifier on the
features of the retrieved tweets, click the “Run Random Forest Prediction” button.
This trained model is then used to detect fake and spam accounts among
incoming tweets. To read each tweet's details, scroll through the text area.
Click the “Detection Graph” button to view a graph of the total number of tweets,
spam tweets, and fake accounts. In the screen above, the random forest prediction
accuracy was calculated to be 92%.
In the graph above, the x-axis shows the categories (total tweets, fake accounts, and
tweets containing spam language), while the y-axis shows their counts.
5 Conclusion
In this research, we reviewed the methods for identifying spammers on Twitter.
Additionally, we provided a taxonomy of Twitter spam detection methods and divided
them into categories such as false user detection, spam detection in hot topics, spam
detection based on URLs, and fake content detection. Several features, including user
features, content features, graph features, structure features, and temporal features
were used to compare the provided strategies. The strategies were also contrasted
in terms of the datasets they employed and the goals they were designed to achieve.
The presented review is expected to make it simpler for academics
to find information on cutting-edge Twitter spam detection methods in one place. Despite
the development of efficient and successful approaches to spam detection and fake user
identification on Twitter, certain open areas still need significant research. The
problems are succinctly highlighted as follows: Owing to the grave consequences that
false news can have at both an individual and a communal level, the subject of false
news detection on social media networks needs to be investigated. Finding the sources
of rumours on social media is a related topic that is worthy of further study. While
some studies have used statistical methods to determine the origin of rumours, better
strategies, such as speech-based ones, can be applied because of their good results.
Feature Analysis
Although effective and successful methods have been developed for Twitter spam
detection and fraudulent user detection, there are still some gaps in research that need
to be filled. Several of the problems include the following: Fake news identification
on social media networks is a topic that has to be examined because of the significant
effects false news has at an individual and societal level. Another related matter that
merits investigation is the ability to trace the source of rumours on social media.
Although some research has already been done to identify the source of rumours
using statistical techniques, more sophisticated strategies, such as those based on
social networks, can be used because of their proven effectiveness.
A Effective Method for Predicting
the Dyslexia by Applying Ensemble
Technique
S. K. Saida, Yanduru Yamini Snehitha, Narindi Sai Priya,
and Avula Srinivasa Ajay Babu
Abstract Dyslexia is a condition where a person will face difficulties in certain tasks
including reading, writing, speaking, and identifying sounds. Around 10% of people
globally struggle with this issue. The most important step in preventing dyslexia is
early identification. There are several ways to estimate the risk of dyslexia; we
have developed a model that allows the user to specify their language vocabulary,
memory, speed, visual discrimination, and audio discrimination test results. The model
will determine the user’s individual risk of dyslexia after receiving input from the
user. The approach we used included data preparation, data preprocessing, model
training, model testing, and model construction. Predicting Risk of Dyslexia-PLOS
ONE dataset is used. Dyslexia can be identified using machine learning classification
techniques like Decision Trees, Random Forests, and Support Vector Machines.
When compared to individual classification strategies, the ensemble technique in the
proposed work predicts the risk of dyslexia with a better degree of accuracy. Here, we
consider integrating GridSearch CV, Support Vector Machine, and Random Forest.
Accuracy, precision, recall, and F1-score were taken into consideration as outcome
measures.
Keywords Dyslexia ·Machine learning ·Random forest ·Support vector
machine ·GridSearch CV
S. K. Saida (B)
Department of Information Technology, Lakireddy Bali Reddy College of Engineering,
Mylavaram, Andhra Pradesh, India
e-mail: saida518@gmail.com
Y. Y. Snehitha · N. S. Priya · A. S. A. Babu
Lakireddy Bali Reddy College of Engineering, Mylavaram, Andhra Pradesh, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_12
151
152 S. K. Saida et al.
1 Introduction
Dyslexia is a neurological condition in which a person faces difficulties with specific
tasks such as reading, writing, speaking clearly, and identifying sounds [1].
Dyslexic people mainly struggle to decode individual words: a typical reader can decode
a given word faster than a dyslexic person. Their main difficulty lies in manipulating
words, not in vision problems or a lack of intelligence [2].
It is often assumed that people with dyslexia have very low intelligence
and cannot succeed in life, but their actual difficulties are an inability to
recognize directions correctly, seeing letters backwards (for example 'b' as 'd' or
'saw' as 'was'), and trouble remembering the names of things in their surroundings
(more time is required to remember them).
Essentially, the brain is divided into two sections: the right hemisphere and the other
the left hemisphere. The left hemisphere is in charge of language and
logic, whereas the right hemisphere is in charge of creativity. Because of this, reading
a word is routed through the dyslexic brain's right hemisphere and frontal lobe,
and it may take longer to register in the frontal lobe [3]. Since dyslexic people
mostly rely on the right side, thinking differently and creatively becomes their
major advantage. There is no medicine or treatment for dyslexia, but
early identification is the most important step towards making their future successful [4].
Special teaching methods and emotional support are the main requirements for making the
lives of dyslexic people easier. There is no single cause of dyslexia, but some
may acquire it through heredity or a brain injury at school age [5] (Fig. 1).
Dyslexia can be recognized by various methods. The most frequent is to detect
dyslexia from a person's eye movements: a machine learning model analyses
eye-movement video using various algorithms [6]. Another way to detect dyslexia is to
feed a person's audio recordings into a machine learning model [7].
Data preparation is the initial step, where we collected the Predicting Risk of Dyslexia-
PLOS ONE dataset from Kaggle. After data collection, preprocessing is done using the
Standard Scaler technique, which removes the mean and scales every variable to unit
variance. Here we use an ensemble technique, which combines two algorithms in a single
model to improve the efficiency
Fig. 1 Identification of dyslexia
and accuracy of the detection. The Random Forest Classifier combined with GridSearch CV
gave the most accurate results for the model compared with the Support Vector Machine
and Decision Tree.
2 Related Work
In 2022, Brunswick et al. [8] included 145 university students with and without
dyslexia in their study. The survey covered participants assessed in childhood (53%)
and adulthood (47%). They found that people with dyslexia have lower self-esteem and
self-efficacy but higher creativity. To reduce the negative effects on dyslexic people,
early assessment of dyslexia is essential.
In 2020, Chakraborty et al. [9] used machine learning algorithms to detect dyslexia
from a person's eye movements. Dyslexic people show different eye movements than
typical readers. SVM and Random Forest algorithms are used in this model, detecting
dyslexia with 89.8% precision.
In 2017, Hassanain et al. [10] created a big-data-based, tablet-based multimedia
environment that considered children both younger and older than 10 years to detect
the symptoms of dyslexia. Their framework included a clock drawing test, a writing
test, a reading test, and drawing family members. Grading is done automatically while
the test scenarios are attempted. Finally, the scores calculated in every scenario are
combined and the detection is made.
In 2020, Ileri et al. [11] mainly dealt with EOG signals for diagnosing dyslexia.
First, a person's EOG signals are captured while reading four different texts; the
obtained signals are then filtered and segmented into frames. Finally, they are
classified using a 1D CNN machine learning algorithm.
In 2020, Seshadri et al. [12] claimed that the frontal regions of dyslexic patients
show unusual patterns of delta and theta activity. They included youngsters both with
and without dyslexia and employed EEG signals recorded in an eyes-closed condition.
Using relative wavelet energies, they calculated the lateralization score at each
electrode position.
In 2018, Frid et al. [13] used machine learning features to predict the likelihood of
dyslexia in individuals. For the purpose of predicting dyslexia, an SVM model with a
Gaussian kernel was created using LIBSVM.
3 Methods and Materials
The system here uses the idea of machine learning, and the models are trained before
being tested. The final outcome will be predicted by the model with the highest
accuracy. We focus on the workflow of our proposed work in this section. The flow
chart showing the various steps is given below (Fig. 2).
Data Preparation
This is the first and most crucial part of the proposed work, where we used a dataset
from the online open-source platform Kaggle. The collected dataset, known as Predicting
Risk of Dyslexia-PLOS ONE, consists of 500 rows and 7 columns. It is numerical data
with a .csv extension and can easily be examined in Excel sheets.
Fig. 2 Workflow diagram
Data Preprocessing
Standard Scaler is used for data preprocessing. This method removes the mean and scales
every variable to unit variance, and the process is carried out independently on each
variable. Because the Standard Scaler estimates the empirical mean and standard
deviation of each feature, it can be affected by outliers (if they are present in the
dataset). Therefore, before feeding the data into the machine learning model, we
normalize it (mean = 0, standard deviation = 1), which is the common way to address
this problem.
Standardization

z = (x − μ) / σ

Mean

μ = (1/N) Σ_{i=1}^{N} x_i

Standard Deviation

σ = sqrt( (1/N) Σ_{i=1}^{N} (x_i − μ)^2 )
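The three formulas above can be applied directly. The sketch below standardizes a small hypothetical column from scratch; scikit-learn's StandardScaler performs the same z = (x − μ)/σ transform.

```python
import math

def standardize(values):
    """Apply z = (x - mu) / sigma so the result has mean 0 and unit variance."""
    n = len(values)
    mu = sum(values) / n                                      # empirical mean
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)  # empirical std dev
    return [(x - mu) / sigma for x in values]

scaled = standardize([2.0, 4.0, 6.0, 8.0])
```

After the transform the column has mean 0 and variance 1, which is exactly the property the preprocessing step relies on.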
Model Training
A dataset is used to train the machine learning algorithm; this dataset is the
training set. Training data typically outweighs testing data in size, because we want
to provide the model with as much data as possible so that it can recognize and pick
up useful patterns. The training set is composed of pairs of input data and sample
output data, both of which influence the final model. The algorithm processes the
input data and compares the processed output to the sample output, and the result of
this comparison is used to adjust the model.
Model Testing
Model testing is the procedure in which a fully trained model's performance is
evaluated on a testing set. Testing a model involves running it on fresh data and
examining the outputs in terms of metrics such as recall, precision, and accuracy
against the model that has already been developed. Importantly, the samples in the
testing set should not come from the training set. If the test set contains samples
from the training set,
it is impossible to determine whether the algorithmic framework has simply memorized
them or has learned how to generalize from the training set.
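A minimal sketch of holding out a disjoint test set follows. The 80/20 split is an assumption for illustration (the chapter does not state its exact ratio); shuffling before the cut keeps the two sets unbiased.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows, then cut them into disjoint train and test sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)          # fixed seed for reproducibility
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

# 500 hypothetical sample indices, matching the dataset's 500 rows.
train, test = train_test_split(range(500))
```

Because the two slices never overlap, test accuracy measures generalization rather than memorization.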
4 Description of Proposed Ensemble Techniques
(A) Random Forest Classifier
Random Forest is an ensemble learning approach in which a large number of decision
trees are constructed during training, and the output is the mode of the classes
predicted by the individual trees. At each split, the method chooses a random subset
of features [14]. The Random Forest Classifier operates as follows:
Step 1: First select n random samples from the entire training dataset; n must be
smaller than the total number of observations.
Step 2: Construct a decision tree for each sample. A decision tree's nodes can be
branched using the Gini index or entropy.
The Gini index calculation formula is stated as Eq. (1)

Gini = 1 − Σ_{i=1}^{c} (p_i)^2    (1)

where p_i = relative frequency, c = number of classes.
The Entropy calculation formula is stated as Eq. (2)

Entropy = − Σ_{i=1}^{c} p_i log2(p_i)    (2)
Step 3: Next, decide how many trees the forest will contain; often a large number is
chosen, such as 100 or 500.
Step 4: Each tree makes a prediction for a fresh piece of data.
Step 5: The final prediction for the new data point is determined by aggregating the
predictions collected from all the trees in the forest (averaging for regression,
majority vote for classification).
Step 6: The procedure above is repeated over the full dataset to create the many
trees of the forest.
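Eqs. (1) and (2), used to branch nodes in Step 2, can be computed directly from the class proportions at a node, as in this short sketch:

```python
import math

# Direct implementations of Eqs. (1) and (2): node impurity measures used to
# branch decision-tree nodes. probs holds the class proportions p_i at a node.
def gini(probs):
    return 1 - sum(p ** 2 for p in probs)

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

g = gini([0.5, 0.5])     # a perfectly mixed two-class node scores 0.5
h = entropy([0.5, 0.5])  # and 1.0 bit of entropy
```

A pure node (a single class) scores 0 under both measures, so splits are chosen to reduce these values.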
(B) Support Vector Machine
The Support Vector Machine (SVM) is a supervised machine learning technique used for
both classification and regression, though it is most commonly applied to
classification. The SVM method looks for a hyperplane in N-dimensional space that
clearly separates the data points [15].
Fig. 3 Support vector machine (hyperplane with positive and negative margin
hyperplanes separating Class 1 and Class 2)
The number of features determines the hyperplane's dimension. With only two input
features, the hyperplane is essentially a line; with three input features, it becomes
a 2-D plane. When more than three features are involved, it is difficult to
visualize (Fig. 3).
SVMs differ from other classification algorithms in that they choose the decision
boundary that maximizes the distance to the nearest data points of all classes.
The decision boundary created by an SVM is referred to as the maximum margin
classifier or maximum margin hyperplane.
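The margin idea can be made concrete with a little coordinate geometry: the distance from a point x to the hyperplane w·x + b = 0 is |w·x + b| / ‖w‖, and the margin of a separating hyperplane is the smallest such distance over the training points. The sketch below uses hypothetical 2-D data, not an SVM solver.

```python
import math

def distance_to_hyperplane(w, b, x):
    """Distance |w . x + b| / ||w|| from point x to the hyperplane w . x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    return abs(dot + b) / math.sqrt(sum(wi ** 2 for wi in w))

def margin(w, b, points):
    """Margin of a separating hyperplane: smallest distance over the data."""
    return min(distance_to_hyperplane(w, b, x) for x in points)

# Hypothetical 2-D points separated by the line x1 + x2 - 3 = 0.
m = margin([1.0, 1.0], -3.0, [(0.0, 1.0), (0.0, 2.0), (2.0, 3.0), (3.0, 3.0)])
```

An SVM solver searches over (w, b) for the separating hyperplane that maximizes exactly this quantity.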
(C) GridSearch CV
Another scikit-learn technique for conducting a thorough search on hyper-parameters
is GridSearch CV. It is a process for going through many combinations of hyper-
parameter values in a methodical way and training a machine learning model for
each combination to see which collection of hyper-parameters performs the best.
Using a preset ‘grid’ of hyper-parameters, the hyper-parameter search is carried out,
which means that all conceivable combinations of hyper-parameter values are tested
systematically. It is a helpful tool for determining the best set of hyper-parameters
for a machine learning model, but it can be computationally expensive, especially
when there are many hyper-parameters and a wide range of possible values.
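The exhaustive search that GridSearch CV performs can be sketched in a few lines. The scoring function below is a hypothetical stand-in for "fit the model with these hyper-parameters and measure validation accuracy"; scikit-learn's GridSearchCV runs the same loop with cross-validation inside it.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every hyper-parameter combination and keep the best-scoring one."""
    best_params, best_score = None, float("-inf")
    keys = sorted(param_grid)
    for combo in product(*(param_grid[k] for k in keys)):  # every combination
        params = dict(zip(keys, combo))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical grid over two random-forest hyper-parameters.
grid = {"n_estimators": [100, 500], "max_depth": [4, 8, 16]}
best, score = grid_search(
    grid, lambda p: -abs(p["n_estimators"] - 500) - abs(p["max_depth"] - 8)
)
```

The cost is the product of the grid sizes (here 2 × 3 = 6 fits), which is why the search becomes expensive with many hyper-parameters.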
5 Results and Analysis
In the proposed work we collected the dataset from Kaggle. The dataset, named
Predicting Risk of Dyslexia-PLOS ONE, is used to detect the risk of dyslexia in
individuals. It has size (500, 7), meaning 500 samples and 7 features, which include
language vocabulary, memory, speed, visual discrimination, and audio discrimination.
There are no NaN values in the collected dataset, so we proceeded to the next step
without applying any imputation techniques.
(A) Standard Scaler Technique
After importing the dataset, we used the Standard Scaler technique to preprocess the
data. Standardization changes the distribution to have a mean of zero and a standard
deviation of one. This is accomplished by scaling each input variable individually:
subtracting the mean (a process known as centring) and dividing by the standard
deviation (Fig. 4).
(B) Scatter Plot of different algorithms
Data can be graphically represented using a scatter plot, in which points are plotted
on the coordinate axes according to their values. The figure below is the scatter plot
of the algorithms and ensemble techniques used in the proposed model; the X coordinate
represents the algorithms and the Y coordinate represents the score (Fig. 5).
Fig. 4 Data preprocessing
Fig. 5 Scatter plot of algorithms
Fig. 6 Line Plot of performance metrics
(C) Performance Metrics
The proposed model is evaluated based on four measures which are:
Accuracy
It is a parameter for measuring how well models perform in classification tasks, and
it is so well known that it is often used to summarize overall model performance. It
is the percentage of correct classifications that a trained machine learning model
achieves, i.e., the ratio of correct predictions to all predictions.
Precision
Precision measures how many of the detected items are genuinely relevant. It is
calculated by dividing the true positives by the total number of predicted positives,
and it tells us how reliably the model's positive classifications are correct.
Recall
Recall measures how many of the relevant items were found. It estimates the
percentage of actual positive labels that the model correctly identified: the true
positives divided by the total number of actual positives.
F1-score
One of the most important evaluation criteria in machine learning is the F1-score.
It succinctly distils a model's predictive power by combining precision and recall,
two measurements that ordinarily compete with one another (Figs. 6 and 7).
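The four outcome measures reduce to simple ratios over confusion-matrix counts; the tp/fp/fn/tn values in this sketch are hypothetical, not results from the study.

```python
def classification_metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)          # correct / all predictions
    precision = tp / (tp + fp)                          # detected items that are truly positive
    recall = tp / (tp + fn)                             # actual positives the model found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = classification_metrics(tp=40, fp=10, fn=10, tn=40)
```

Because the F1-score is a harmonic mean, it stays low unless precision and recall are both high, which is why it is preferred when the two trade off against each other.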
6 Conclusion and Future Scope
Our proposed methodology aims to provide an efficient classification for dyslexia.
Our model comprises three stages. First, for data acquisition, the Predicting
Risk of Dyslexia-PLOS ONE dataset is collected from Kaggle. For the
Fig. 7 Performance measures
data preprocessing, we used the Standard Scaler technique to remove the mean and scale
every variable to unit variance; this process is carried out independently on each
variable. We compared the algorithms with one another based on their performance
metrics. The algorithms included are Decision Tree, SVM, and Random Forest Classifier,
together with ensemble techniques such as Random Forest with GridSearch CV and SVM
with GridSearch CV. The best results were obtained using the ensemble of the Random
Forest Classifier and GridSearch CV. The model uses this ensemble for classification
and gives efficient values for performance metrics such as accuracy, precision,
recall, and F1-score. Our proposed model is user friendly and effective for
predicting dyslexia.
References
1. Protopapas A, Parrila R (5 April, 2018) Is dyslexia a brain disorder? Brain Sci 8(4):61. https://
doi.org/10.3390/brainsci8040061. PMID: 29621138; PMCID: PMC5924397
2. Snowling MJ, Hulme C, Nation K (13 Aug, 2020) Defining and understanding dyslexia: past,
present and future. Oxf Rev Educ. 46(4):501–513. https://doi.org/10.1080/03054985.2020.176
5756. PMID: 32939103; PMCID: PMC7455053
3. Raschle NM, Chang M, Gaab N (1 Aug 2011) Structural brain alterations associated with
dyslexia predate reading onset. Neuroimage 57(3):742–9. https://doi.org/10.1016/j.neuroi
mage.2010.09.055. Epub 2010 Sep 25. PMID: 20884362; PMCID: PMC3499031
4. Snowling MJ (1 Jan 2013) Early identification and interventions for dyslexia: a contemporary
view. J Res Spec Educ Needs 13(1):7–14. https://doi.org/10.1111/j.1471-3802.2012.01262.x.
PMID: 26290655; PMCID: PMC4538781
5. Werth R (2019) What causes dyslexia? Identifying the causes and effective compensatory
therapy. Restor Neurol Neurosci 37(6):591–608. https://doi.org/10.3233/RNN-190939. PMID:
31796709; PMCID: PMC6971836
6. Nerušil B, Polec J, Škunda J, Kačur J (3 Aug 2021) Eye tracking based dyslexia detection
using a holistic approach. Sci Rep 11(1):15687. https://doi.org/10.1038/s41598-021-95275-1.
PMID: 34344972; PMCID: PMC8333039
7. Radford J, Richard G, Richard H, Serrurier M. Detecting dyslexia from audio records: an AI
approach. https://doi.org/10.5220/0010196000580066
8. Brunswick N, Bargary S (28 Aug 2022) Self-concept, creativity and developmental dyslexia
in university students: effects of age of assessment. Dyslexia 28(3):293–308. https://doi.org/
10.1002/dys.1722. Epub 2022 Jul 11. PMID: 35818173; PMCID: PMC9543102
9. Chakraborty V, Sundaram M, Machine learning algorithms for prediction of dyslexia using eye
movement. 06 Nov 2020 Bengaluru. https://doi.org/10.1088/1742-6596/1427/1/012012
10. Hassanain E. A multimedia big data retrieval framework to detect dyslexia among children.
2017 IEEE international conference on big data. 978-1-5386-2715-0/17
11. ˙
Ileri R, Latifo˘glu F, Demirci E (2020) New method to diagnosis of dyslexia using 1D-CNN,
2020 medical technologies congress (TIPTEKNO). Antalya, Turkey, pp 1–4. https://doi.org/
10.1109/TIPTEKNO50054.2020.9299241
12. Seshadri NPG, Singh BK (2020) Hemispheric lateralization analysis in dyslexic and normal
children using rest-EEG. 2020 IEEE recent advances in intelligent computational systems
(RAICS). Thiruvananthapuram, India, pp 37–41. https://doi.org/10.1109/RAICS51191.2020.
9332509
13. Frid A, Manevitz LM (2018) Features and machine learning for correlating and classifying
between brain areas and dyslexia. arXiv:1812.10622
14. Ali J, Khan R, Ahmad N, Maqsood I (2012) Random forests and decision trees. Int J Comput
Sci Issues (IJCSI) 9
15. Evgeniou T, Pontil M (2001) Support vector machines: theory and applications. 2049. 249–257.
https://doi.org/10.1007/3-540-44673-7_12
Identifying Suicidal Risk: A Text
Classification Study for Early Detection
Devineni Vijaya Sri, Anumolu Bindu Sai, Valluri Anand,
and Karanam Manjusha
Abstract Language usage is affected by suicidal intent that is conveyed on social
media. Many at-risk users rely on online forum websites to discuss their issues or
find out information about related duties. Our study’s main goal is to share ongoing
research on automatically identifying suicidal postings. We developed a method in
order to identify individuals who might be at suicide risk by analysing data from
social networking sites like Reddit. To achieve this, we plan to apply a variety
of classification techniques, including both deep learning and traditional machine
learning methods. To this purpose, we compare our results to those of other clas-
sification methods using a combined LSTM-CNN model. Our experiment reveals
that combining word embedding techniques with neural network architecture may
produce the best relevance classification results. Furthermore, our results show how
deep learning architectures may be used to build a viable model for a suicide risk
assessment by excelling at a variety of text classification tasks.
Keywords Suicidal ideation ·Neural network architecture ·Text classification ·
Classification algorithms ·LSTM-CNN model
1 Introduction
The suicide mortality rate was anticipated to rise to one death every 20 seconds by
2020 [1]. Nearly 79% of suicides take place in low- and middle-income countries, where
resources for detection and management are frequently insufficient and limited.
However, Pompili et al. [2] show that “many characteristics thought to be risk factors
for suicidal conduct” might be fairly comparable in a suicide ideator and a suicide
attempter. In
D. Vijaya Sri (B)
Department of Information Technology, Lakireddy Bali Reddy College of Engineering,
Mylavaram, Andhra Pradesh, India
e-mail: devineni66@gmail.com
A. B. Sai ·V. A n a n d ·K. Manjusha
Lakireddy Bali Reddy College of Engineering, Mylavaram, Andhra Pradesh, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_13
163
164 D. Vijaya Sri et al.
order to reduce suicide rates by 10% by 2020 [3], early detection of suicidal ideation
has been developed and put into practice as a part of national harm reduction plans
in WHO member countries.
It offers a useful research environment for the creation of cutting-edge technical
innovations that might revolutionise suicide risk reduction and suicide detection
[4]. That might serve as a good preliminary step for intervention. Kumar et al. [5]
conducted a study on the posting habits of Reddit Suicide Watch users who keep
up with news regarding celebrity suicides [6]. They presented a strategy that could
effectively prevent copycat suicides involving prominent figures, creating a
methodology based on propensity weighting to determine the distinctive signs of this
transition.
AvgDiffLDP is an innovative optimisation approach that Ji et al. [7] recently devised
for the early identification of suicidal thoughts. Our study’s main goal is to use
powerful deep learning architectures for data analysis to disseminate knowledge
about suicide thoughts in Reddit social media communities [8]. We attempt to deter-
mine if combining CNN and LSTM classifiers into a single model may enhance the
performance of language modelling and text categorisation [9].
On the basis of the baseline and our suggested model, we assess the experimental
strategy. We leverage data gathered from Reddit social media, a platform that allows
users to write lengthier messages, for our data set [10].
To conduct our experiment, we first choose a data source and assess the salient
features of our suggested model. Our next step involves analysing the frequency of
n-grams (both individual words and pairs of words) in the dataset, with the aim of
identifying signals of suicidal intent [11]. This analysis is designed to uncover the
patterns, trends, and telltale signs that point to the presence of suicidal ideation
[12]. We assess the experimental strategy on the basis of the baseline and our
suggested model. Lastly, we use tenfold cross-validation to train our LSTM-CNN model
and identify the most effective hyper-parameters for spotting suicidal thoughts [13].
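The unigram/bigram frequency step can be sketched with collections.Counter; the example posts below are invented neutral placeholders, not posts from the study's Reddit dataset.

```python
from collections import Counter

def ngram_counts(posts, n):
    """Count the n-grams (as token tuples) over a collection of posts."""
    counts = Counter()
    for post in posts:
        tokens = post.lower().split()
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts

posts = ["i feel so alone", "i feel tired", "nobody to talk to"]
unigrams = ngram_counts(posts, 1)  # individual words
bigrams = ngram_counts(posts, 2)   # pairs of adjacent words
```

Ranking the resulting counts (e.g. with `most_common`) surfaces the recurring phrases that the analysis treats as candidate signals.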
Our study makes the following three distinct contributions:
N-gram analysis: We examined n-gram data from suicide-related forums in our
study to demonstrate that decreased social engagement and suicide thoughts are
regularly mentioned topics [14]. Our results show that a shift to suicidal ideation
is linked to a range of psychological conditions, such as an increase in self-focus,
despair, discontent, anxiety, or loneliness.
Traditional feature analysis: We used traditional feature analysis to compare
different approaches to detecting suicidal thoughts [15]. We used CNN, LSTM, and
a combined LSTM-CNN model to compare the performance of statistical features
with word2vec, bag-of-words, and TF-IDF representations.
Comparative analysis: We evaluated the accuracy of our proposed deep neural
network model, the integrated LSTM-CNN classifier, for detecting suicidal
thoughts [16]. To establish a state-of-the-art approach, we compared its performance
and potential against other deep learning techniques, including CNN and LSTM, as
Identifying Suicidal Risk: A Text Classification Study for Early Detection 165
well as four conventional machine learning classifiers, namely SVM, NB, RF, and
XGBoost. The evaluation was conducted using a real-world dataset.
2 Literature Review
Reddit users’ suicidal inclinations were explored by Kumar et al. [17] in relation to
the Werther or copycat effect [18]. Their research shows that after news of celebrity
suicides, individuals’ posting frequency significantly increased and their language
behaviour changed. This change was seen as moving in the direction of postings that
were less socially integrated and were more negative and self-focused. Similar to
this, Ueda et al. [19] carried out in-depth research on a million Twitter tweets after
26 well-known Japanese celebrities committed suicide between 2010 and 2014.
Suicidal inclinations are more effectively recognised when regular linguistic
patterns in social media material are identified, which is often supported by applying
machine learning algorithms to various NLP techniques. A suicide note anal-
ysis technique was developed by Desmet et al. [20] utilising binary support vector
machine (SVM) classifiers to identify suicidal thoughts. Machine learning algo-
rithms have been shown to be effective in separating people into those who are and
are not at suicide risk by Braithwaite et al. [21]. Wood et al. [22] discovered Twitter
users and then monitored their tweets up until the point of their attempted suicide.
The research of Okhapkina et al. [23] looked at adapting information retrieval
techniques to spot harmful informational influence on social networks [24]. They
created a lexicon of phrases with a suicidal undertone and applied singular value
decomposition to TF-IDF matrices.
Significant modifications have been made as a result of recent developments in
neural network models for natural language processing. Recurrent neural networks
(RNN) have distinguished themselves as a particularly potent method for sequence
modelling among these [25].
Recent research has demonstrated that convolutional, nonlinear, and pooling
layers in CNN neural networks perform better than conventional NLP techniques
for a variety of NLP tasks [26]. However, it fails to capture distant interactions and
instead highlights local n-gram properties. The power of CNN on n-gram character-
istics from different sentence positions was supported by Kalchbrenner et al. [27].
Yin and Schütze devised a strategy that utilises unsupervised pre-training and
multiple-channel word embedding to enhance classification accuracy.
Using n-gram features with cTAKES and LR approaches, Gehrmann et al.
compared the CNN model to more conventional, rule-based entity extraction
methods. They found in their investigation [28] that CNN performs better than
previous phenotyping algorithms in predicting ten phenotypes. Morales et al.
demonstrated the efficacy of CNN and LSTM models, evaluated on novel personality
and tone traits, for assessing the risk of suicide. In comparison with other methods,
CNN performed better in detecting the presence of suicidal inclinations in teenagers,
according to Bhat et al., and deep learning
166 D. Vijaya Sri et al.
techniques were used by Du et al. to identify mental stress in social media for
suicide identification. They created a binary classifier using CNN networks to distin-
guish between suicidal and non-suicidal tweets. According to other recent studies,
the Suicide Watch forum, which is used as a data set in our research article, benefited
from CNN implementations.
Fundamentally, a single recurrent or convolutional neural network that encodes an
entire sequence into one vector is usually not enough to capture all of the
significant information. A hybrid framework that combines the advantages of RNNs
and CNNs has been worked on. This strategy seeks to improve results by utilising the
distinctive qualities of each model. The measurement problem of semantic textual
similarity has received significant attention. Using both CNNs and RNNs inside the
hybrid framework, several methods have been researched and developed to increase
the precision and dependability of these metrics. To overcome the difficulty of
determining semantic textual similarity, He et al. developed a new
neural network model that combines ConvNet and Bi-LSTMs. In order to get better
outcomes, Matsumoto et al. suggested a hybrid framework that employs a quick
method of deep learning in close cooperation with an initial information retrieval
model.
3 Methodology
Reddit users converse through the comment threads that are connected to each post [29].
The Ji et al. [30] data set, which includes a list of postings that are both suicide-
indicative and not, was employed in our investigation. Users’ private information
is replaced with a special ID to protect their privacy. Because users tend to
participate in numerous sub-Reddits, each group is composed of a roughly equal
number of messages originating from diverse themes. Our data set is made up
of 3652 non-suicidal posts and 3549 posts with suicidal indications from reasonably
big sub-Reddits supporting those who may be at risk. Posts that are not suicidal
come from sub-Reddits with a focus on friends and family.
4 Existing Schemes
A comprehensive overview of our suggested framework is shown in Fig. 1. The two
frameworks for text data mining differ. Natural language processing (NLP)
methods are used in the first framework to pre-process data and extract features. Prior
to being analysed by standard machine learning systems as baseline approaches, the
words are first encoded using techniques like TF-IDF, BOW, and statistical features.
The second framework, on the other hand, uses deep learning classifiers after pre-
processing the data and extracting features using word embedding methods. Also,
Fig. 1 Framework for suicide ideation detection
this framework provides two different kinds of classifiers: one for the conventional
approach and one for the proposed model.
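As a minimal sketch of the TF-IDF encoding used in the baseline framework (toy documents; a real pipeline would use a library vectoriser), each word is weighted by its in-document frequency discounted by how many documents contain it:

```python
# Minimal TF-IDF encoding of a toy corpus (illustrative only; not the
# study's actual feature pipeline).
import math
from collections import Counter

docs = [
    ["sad", "alone", "hopeless"],
    ["happy", "friends", "party"],
    ["sad", "tired", "alone"],
]

n_docs = len(docs)
# Document frequency: how many documents contain each word.
df = Counter()
for doc in docs:
    df.update(set(doc))

def tfidf(doc):
    # Term frequency scaled by the inverse document frequency.
    tf = Counter(doc)
    return {w: (tf[w] / len(doc)) * math.log(n_docs / df[w]) for w in tf}

weights = tfidf(docs[0])
# "hopeless" appears in only one document, so it outweighs "sad",
# which appears in two.
```

The resulting weight vectors would then be fed to the baseline machine learning classifiers (SVM, NB, RF, XGBoost).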
Model Architecture and Its Parameters
The parameter configuration for the proposed model (LSTM +CNN) is given in
Table 1. The following parameters are used in the experiment: the number of filters,
the kernel size, the padding, the pooling size, the optimiser, the batch size, the epochs,
and the units. The NLTK natural language toolkit is used with Python. The models
are created using the TensorFlow deep learning framework.
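As an illustrative sketch (not the authors' released code), the Table 1 configuration can be expressed in TensorFlow/Keras; the vocabulary size is an assumption, and the filter count (8) and kernel size (3) are one choice from the ranges listed in Table 1:

```python
# Sketch of the LSTM-CNN configuration from Table 1 in Keras.
# VOCAB_SIZE is an assumption; filters=8 and kernel_size=3 are one
# choice from the ranges given in Table 1.
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 10_000  # assumption, not stated in the paper

model = tf.keras.Sequential([
    layers.Embedding(VOCAB_SIZE, 300),                    # embedding dimension 300
    layers.Conv1D(8, 3, padding="same", activation="relu"),
    layers.MaxPooling1D(pool_size=2),                     # max-pooling
    layers.LSTM(100),                                     # 100 units
    layers.Dropout(0.5),
    layers.Dense(2, activation="softmax"),                # suicidal vs. non-suicidal
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

With the batch size and epoch count from Table 1, training would be invoked as `model.fit(x_train, y_train, batch_size=8, epochs=10)`.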
Table 1 Parameter configuration for the proposed LSTM-CNN model

LSTM-CNN model layers    Parameters             Values
Convolutional layer      Number of filters      2, 4, 6, 8
                         Kernel size            2, 3, 4
                         Padding                ‘Same’
                         Activation function    ‘ReLU’
Pooling layer            Pooling type           Max-pooling
LSTM layer and other     Units                  100
                         Embedding dimension    300
                         Batch size             8
                         Number of epochs       10
                         Dropout                0.5
Fully connected layer    Activation             Softmax

The assessment metrics are based on the numbers of true positive predictions (TP),
true negative predictions (TN), false positive predictions (FP), and false negative
predictions (FN) [80]. Accuracy, defined as follows, is the simplest classification
score; precision, recall, and F1 are computed from the same counts:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 · (Precision · Recall) / (Precision + Recall)
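As a quick numerical check of these formulas (toy confusion counts, not the study's results):

```python
# Compute accuracy, precision, recall, and F1 from toy confusion counts.
TP, TN, FP, FN = 40, 45, 5, 10

accuracy = (TP + TN) / (TP + TN + FP + FN)   # 0.85
precision = TP / (TP + FP)                   # 40/45
recall = TP / (TP + FN)                      # 40/50
f1 = 2 * precision * recall / (precision + recall)
```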
5 Results
Our strategy is split into two phases. The first stage entails scrutinising the labelled
Reddit posts corpus and comparing the most common n-grams in posts with clinical
depression to those without. This aids in identifying any patterns or indicators of
suicidal intent. The next step is to compare the effectiveness of our proposed deep
learning prediction model against classifier baselines built from a predetermined
collection of features. This enables us to accurately determine the classifier’s
ability to detect suicidal ideation and evaluate its performance using appropriate
analytical metrics (Figs. 2, 3, 4, 5, 6, 7 and 8).
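The tenfold cross-validation mentioned earlier amounts to partitioning the data into ten folds and rotating the validation fold; a minimal index-level sketch (illustrative, not the authors' code):

```python
# Split n samples into k contiguous folds; each fold serves once as the
# validation set while the remaining folds form the training set.
def k_fold_splits(n, k=10):
    indices = list(range(n))
    fold_size, remainder = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        stop = start + fold_size + (1 if i < remainder else 0)
        folds.append(indices[start:stop])
        start = stop
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

splits = list(k_fold_splits(25, k=10))  # 10 (train, validation) index pairs
```

Each hyper-parameter setting is scored by averaging the validation metric over the ten folds.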
6 Conclusion
Deep learning techniques are being integrated into suicide care, opening up new avenues
for better ideation detection and the potential for early suicide prevention. Our study
contributes to the effort to advance computational linguistics technologically so that
it may be successfully used in the field of mental health treatment and disseminated
among researchers.
Our method was developed for this aim using a sub-Reddit data corpus made up
of posts that were both suicide indicative and were not. To transform the text of the
postings into a format that our system could understand, we employed several data
representation approaches. By using several NLP and text classification algorithms,
we were able to identify a tighter link between language usage and suicidal ideation.
Fig. 2 Training accuracy
Fig. 3 Loss curve
Fig. 4 Suicide prediction
Fig. 5 Accuracy curve
Fig. 6 Data set collection
Fig. 7 Testing accuracy
Fig. 8 Suicide percentage
We discussed the LSTM-CNN experiment and saw CNN’s potential in several text
categorisation tasks. These networks were built on top of word2vec features.
Our goal was not to investigate in depth how sensitive CNN hyper-parameters
are to design decisions. Instead, we focused on enhancing CNN’s ability to classify
activities involving suicidal thoughts. We found the factors associated with portrayals
of suicidal inclinations throughout our data analysis. We saw a major change in the
way those at risk used language. Users’ self-centeredness was notably found to be
accompanied by indicators of irritation, pessimism, negativity, or loneliness.
References
1. World Health Organization (2018) National suicide prevention strategies: progress, examples
and indicators; World Health Organization: Geneva, Switzerland
2. Beck AT, Kovacs M, Weissman A (1975) Hopelessness and suicidal behavior: an overview.
JAMA 234:1146–1149
3. Silver MA, Bohnert M, Beck AT, Marcus D (1971) Relation of depression of attempted suicide
and seriousness of intent. Arch Gen Psychiatry 25:573–576
4. Klonsky ED, May AM (2014) Differentiating suicide attempters from suicide ideators: a critical
frontier for suicidology research. Suicide Life-Threat Behav 44:1–5
5. Pompili M, Innamorati M, Di Vittorio C, Sher L, Girardi P, Amore M (2014) Sociodemographic
and clinical differences between suicide ideators and attempters: a study of mood disordered
patients 50 years and older. Suicide Life-Threat. Behav. 44:34–45
6. DeJong TM, Overholser JC, Stockmeier CA (2010) Apples to oranges?: a direct comparison
between suicide attempters and suicide completers. J Affect Disord 124:90–97
7. De Choudhury M, Kiciman E, Dredze M, Coppersmith G, Kumar M (2016) Discovering shifts
to suicidal ideation from mental health content in social media. In: Proceedings of the 2016
CHI conference on human factors in computing systems, San José, CA, USA, 9–12 December
2016; ACM: New York, NY, USA, pp 2098–2110
8. Marks M (2019) Artificial intelligence based suicide prediction. Yale J Health Policy Law
Ethics. Forthcoming
9. Kumar M, Dredze M, Coppersmith G, De Choudhury M (2015) Detecting changes in suicide
content manifested in social media following celebrity suicides. In: Proceedings of the 26th
ACM conference on hypertext & social media, Prague, Czech Republic, 4–7 July 2015; ACM:
New York, NY, USA, pp 85–94
10. Ji S, Long G, Pan S, Zhu T, Jiang J, Wang S (2019) Detecting suicidal ideation with data
protection in online communities. In: Proceedings of the international conference on database
systems for advanced applications, Chiang Mai, Thailand, 22–25 April 2019. Springer, Berlin,
Germany, pp 225–229
11. Yang Y, Zheng L, Zhang J, Cui Q, Li Z, Yu PS (2018) TI-CNN: convolutional neural networks
for fake news detection. arXiv arXiv:1806.00749
12. Mikolov T, Karafiát M, Burget L, Černocký J, Khudanpur S (2010) Recurrent neural network
based language model. In: Proceedings of the eleventh annual conference of the international
speech communication association, Makuhari, Chiba, Japan, 26–30 September 2010
13. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed representations of words
and phrases and their compositionality. In: Proceedings of the advances in neural information
processing systems, Lake Tahoe, CA, USA, 5–10 December 2013; pp 3111–3119
14. Coppersmith G, Ngo K, Leary R, Wood A. Exploratory analysis of social media prior to
a suicide attempt. In: Proceedings of the third workshop on computational Linguistics and
clinical psychology, San Diego, CA, USA, 16 June 2016, pp 106–117
15. Hsiung RC (2007) A suicide in an online mental health support group: reactions of the group
members, administrative responses, and recommendations. CyberPsychol Behav 10:495–500
16. Jashinsky J, Burton SH, Hanson CL, West J, Giraud-Carrier C, Barnes MD, Argyle T (2014)
Tracking suicide risk factors through Twitter in the US. Crisis 35:51–59
17. Colombo GB, Burnap P, Hodorog A, Scourfield J (2016) Analysing the connectivity and
communication of suicidal users on twitter. Comput Commun 73:291–300
18. Niederkrotenthaler T, Till B, Kapusta ND, Voracek M, Dervic K, Sonneck G (2009) Copycat
effects after media reports on suicide: a population-based ecologic study. Soc Sci Med 69:1085–
1090
19. Ueda M, Mori K, Matsubayashi T, Sawada Y (2017) Tweeting celebrity suicides: users reaction
to prominent suicide deaths on Twitter and subsequent increases in actual suicides. Soc Sci
Med 189:158–166
20. Desmet B, Hoste V (2013) Emotion detection in suicide notes. Expert Syst Appl 40:6351–6358
21. Huang X, Zhang L, Chiu D, Liu T, Li X, Zhu T. Detecting suicidal ideation in Chinese
microblogs with psychological lexicons. In: Proceedings of the 2014 IEEE 11th international
conference on ubiquitous intelligence and computing and 2014 IEEE 11th international confer-
ence on autonomic and trusted computing and 2014 IEEE 14th international conference on
scalable computing and communications and its associated workshops, Bali, Indonesia, 9–12
December 2014; pp 844–849
22. Braithwaite SR, Giraud-Carrier C, West J, Barnes MD, Hanson CL (2016) Validating machine
learning algorithms for Twitter data against established measures of suicidality. JMIR Ment
Health 3:e21
23. Sueki H (2015) The association of suicide-related Twitter use with suicidal behaviour: a cross-
sectional study of young internet users in Japan. J Affect Disord 170:155–160
24. O’Dea B, Wan S, Batterham PJ, Calear AL, Paris C, Christensen H (2015) Detecting suicidality
on Twitter. Internet Interv 2:183–188
25. Wood A, Shiffman J, Leary R, Coppersmith G. Language signals preceding suicide attempts.
In: Proceedings of the CHI 2016 computing and mental health workshop, San Jose, CA, USA,
7–12 May 2016
26. Okhapkina E, Okhapkin V, Kazarin O. Adaptation of information retrieval methods for iden-
tifying of destructive informational influence in social networks. In: Proceedings of the 2017
IEEE 31st international conference on advanced information networking and applications
workshops (WAINA), Taipei, Taiwan, 27–29 March 2017; pp 87–92
27. Sawhney R, Manchanda P, Singh R, Aggarwal S. A computational approach to feature extrac-
tion for identification of suicidal ideation in tweets. In: Proceedings of the ACL 2018, student
research workshop, Melbourne, Australia, 15–20 July 2018; pp 91–98
28. Aladağ AE, Muderrisoglu S, Akbas NB, Zahmacioglu O, Bingol HO (2018) Detecting suicidal
ideation on forums: proof-of-concept study. J Med Internet Res 20:e215
Citrus Plant Leaves Disease Detection
Using CNN and LVQ Algorithm
Roop Singh Meena and Shano Solanki
Abstract This study introduces a unique method for disease identification in citrus
plants by combining convolutional neural network (CNN) and learning vector quanti-
zation (LVQ) techniques. The suggested technology is meant to aid in the early iden-
tification and diagnosis of citrus plant diseases, which is important for preserving
crop yields and avoiding crop loss. Features are extracted from pictures of citrus
plant leaves using a convolutional neural network. Furthermore, the LVQ algorithm
is used to identify the retrieved features as either healthy or unhealthy. When tested
on a dataset consisting of photographs of citrus plant leaves, the suggested system
achieved a high accuracy of 96.33% in disease classification. A total of 3570 pictures
were used in this analysis, including both healthy and diseased citrus plants, repre-
senting different pathogens (citrus canker, citrus scab, citrus rust, other diseases, and
healthy images) classes. There are 500 test photographs from each class and 1070
full-size images across the test categories. An F1-score of 96.54%, a recall score
of 96.54%, and a precision score of 96.69% were all obtained using the proposed
strategy. Based on the obtained data, it appears that the proposed method achieves
superior accuracy in disease identification compared to the state-of-the-art methods.
The citrus industry stands to benefit greatly from this strategy, as it may be used for
early disease identification and prevention in citrus plants.
Keywords Citrus plant diseases ·CNN ·LVQ ·Image preprocessing ·
Segmentation
R. S. Meena (B)·S. Solanki
Computer Science and Engineering Department, NITTTR, Chandigarh, India
e-mail: Roopsingh1988@gmail.com
S. Solanki
e-mail: shano@nitttrchd.ac.in
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_14
1 Introduction
Plant disease detection is an important task in agriculture as it helps to identify and
control the spread of diseases that can damage crops and reduce yields. In recent years,
the use of convolutional neural networks (CNNs) and other deep learning techniques
has been adopted to automate the process of plant disease detection. Machine
learning has replaced traditional pattern recognition and image processing
approaches for citrus plant disease identification. Automated fruit categorization
using machine vision can improve classification accuracy and address problems
with manual filtration, including low output and variable separation levels [1].
Plant disease detection is one of agriculture’s greatest challenges. Early detection
prevents a disease from infecting other crops and causing significant monetary
losses. Plant diseases can have a range of effects
on the agricultural economy, from mild symptoms to crop loss. CNN technology is
believed to be the most efficient approach to deep learning.
This industry is crucial to the economy of India because 60% of the population
relies on agriculture. As a result, it is an important area of study. Agriculture sector
photos are taken using IoT sensors, cameras, and drones [2].
2 Taxonomy of Citrus Diseases
Citrus plant diseases contribute significantly to the decline in agricultural output,
which is detrimental to the economy of any country. Citrus fruit contains vitamin
C as well as good amounts of other vitamins and minerals, including B vitamins,
potassium, phosphorus, magnesium, and copper.
It is a challenge to accurately identify several citrus diseases using deep learning-
based methods (Fig. 1).
2.1 Citrus Canker
Lesions on the leaves of citrus plants are the most dangerous symptom of citrus
canker, a bacterial disease of citrus trees. When citrus trees are infected with
citrus canker, their leaves and fruit begin to drop prematurely. The white, spongy
patches on the injured leaves may change to a darker hue, like brown or gray.
Ring-like sores with oily edges can be seen on either side of the leaf. Signs of
citrus canker disease include the appearance of raised, scabby
lesions on the leaves, stems, and fruit, as well as the premature dropping of the fruit
and the plant’s defoliation. Leaves and fruits might become misshapen when the
lesions, which are often surrounded by an oily, water-soaked edge, eventually join
together. This citrus blight can be recognized by its characteristic lesions [3].
Fig. 1 Citrus plant leaf diseases: scab, rust, canker, healthy, greening, black spot
2.2 Citrus Scab
Citrus scab disease symptoms include the formation of raised, corky scab-like lesions
on fruit, leaves, and twigs. These lesions can vary in size and color from light yellow
to dark brown and can lead to fruit cracking and distortion. Additionally, infected
leaves may become distorted or drop prematurely [4].
2.3 Citrus Rust
Citrus rust disease symptoms include the formation of yellow-orange pustules on
leaves, stems, and fruit. These pustules may later turn brown or black as they dry out,
which can cause defoliation and premature fruit drop. Severe infections can lead to
stunted growth and reduced fruit quality [5].
2.4 Citrus Greening
Citrus greening disease, also known as Huanglongbing (HLB), has symptoms
including asymmetrical yellowing of leaves, mottled leaves, yellow shoots, and
stunted growth. Infected trees may produce small, lopsided, and bitter fruit that
does not ripen properly. HLB is a serious and incurable disease that can ultimately
kill the tree [6].
2.5 Citrus Anthracnose
Citrus anthracnose disease symptoms include the formation of small, circular, or
irregular-shaped sunken lesions on leaves, twigs, and fruit. These lesions may be
brown or black and have a water-soaked appearance. Infected fruit may drop prema-
turely, become deformed, and rot. Severe infections can cause defoliation and dieback
of twigs and branches.
These features can provide valuable information for various image analysis tasks
in fields such as computer vision, remote sensing, and medical imaging [7].
3 Convolutional Neural Network
The primary application of the deep learning model known as a convolutional neural
network (CNN) is in the domain of image and video recognition [8].
3.1 Convolutional Layer
The convolution layer gives the CNN its name. This layer applies mathematical
operations to extract features from the input picture [9]. A filter is slid
progressively across the picture, starting in the upper-left corner; at each
position, the overlapping picture values are multiplied element-wise by the filter
values and summed [10]. This produces a new, smaller matrix from the provided
image. The process of convolution in the convolution layer is depicted in Fig. 2
below using a 5 × 5 input image and a 3 × 3 filter. A
general CNN feature mapping is shown in Fig. 2 [9].
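The sliding-window computation described above can be sketched in NumPy, using a toy 5 × 5 input and a 3 × 3 filter of ones (illustrative values, not from the paper):

```python
# Valid 2D convolution (cross-correlation) of a 5x5 image with a 3x3 filter.
import numpy as np

image = np.arange(25).reshape(5, 5)   # toy 5x5 input
kernel = np.ones((3, 3))              # toy 3x3 filter

h = image.shape[0] - kernel.shape[0] + 1
w = image.shape[1] - kernel.shape[1] + 1
out = np.zeros((h, w))
for i in range(h):
    for j in range(w):
        # Element-wise multiply the window by the filter and sum.
        out[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)

# out has shape (3, 3): the new, smaller feature map.
```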
3.2 Pooling Layer
The pooling layer is usually applied immediately after the convolution layer and
reduces the size of the convolution layer’s output matrix. The filter size used by
the pooling layer can vary, but typically it is 2 × 2. Several pooling operations,
including max pooling, average pooling, and L2-norm pooling, are compatible with
this layer. A max-pooling
Fig. 2 Convolutional layer
Fig. 3 Pooling layer
filter with a stride of 2 was used in this investigation. To perform max pooling, this
filter takes the maximum value from each sub-window and moves it to a new matrix.
Max pooling layer working is presented in Fig. 3.
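The 2 × 2, stride-2 max pooling step can be sketched in NumPy on a toy feature map (illustrative values only):

```python
# 2x2 max pooling with stride 2 on a toy 4x4 feature map.
import numpy as np

feature_map = np.arange(16).reshape(4, 4)
# Group the map into 2x2 sub-windows, then keep the maximum of each window.
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
# pooled has shape (2, 2): each value is the max of one sub-window.
```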
3.3 Activation Layer
An activation layer in a convolutional neural network (CNN) is a nonlinear transfor-
mation applied to the output of a convolutional layer. It introduces nonlinearity to
the network and allows the network to learn complex representations. Examples of
activation functions used in CNNs include ReLU, sigmoid, and tanh. In ReLU, less
than 0 values are set to zero, while positive ones are left unchanged [1].
A(x) = 0, if x < 0
A(x) = x, otherwise.
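The piecewise ReLU definition above translates directly to code:

```python
# ReLU activation: negative inputs map to zero, non-negative inputs pass through.
def relu(x):
    return x if x > 0 else 0
```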
3.4 Fully Connected Layer
A fully connected layer in a convolutional neural network (CNN) is a layer in which
each neuron is connected to all neurons in the previous layer, similar to a traditional
neural network. This layer typically comes after the convolutional and pooling layers
in a CNN and is used to transform the output from the convolutional layers into a
format that can be used for classification or regression. This layer is responsible for
recognition and categorization.
4 Literature Survey
After the study of various research papers on citrus plant disease, we concluded that
the classification of citrus plant disease is a very complex task. In the literature,
various authors and researchers discussed citrus plant leaf disease detection using
different types of techniques for image processing operations.
Singh et al. [11] proposed an algorithm for automatically detecting and classi-
fying plant leaf diseases using an image segmentation technique. The paper included
a review of various disease classification techniques that could be used for plant leaf
disease detection. The authors applied a genetic algorithm to perform image segmen-
tation on 106 images for both training and testing. The reported accuracy of disease
detection was 86.54% for the K-means technique with the proposed algorithm and
93.63% for support vector machines with the suggested algorithm. They suggested
that the Bayes classifier, artificial neural networks (ANN), and hybrid algorithms
could be used to further enhance the classification recognition rate.
Shaikh and colleagues [12] presented a study titled “Citrus Leaf Unhealthy Region
Detection by Using Image Processing Techniques.” In their research, the authors
utilized various image processing techniques, such as image normalization, contrast
enhancement, and initial processing. They extracted features using the gray level co-
occurrence matrix (GLCM) method and applied bi-level thresholding for segmenta-
tion. To categorize the unhealthy regions, the authors used a hidden Markov model
and achieved an accuracy rate of 84.21% for anthracnose, 85.71% for canker, 78%
for citrus greening, and 82.50% for overwatering in the classification of citrus trees.
Sardogan et al. [13] presented a paper on “Plant Leaf Disease Detection and Clas-
sification based on CNN and the LVQ Algorithm,” and they took 500 images of
tomato crops from the plant village dataset of size 512 × 512. In that research, they
applied the convolutional neural network method of deep learning. To improve
classification accuracy, they used the learning vector quantization method. In this research,
their accuracy in classifying diseases on tomato leaves was 86%. These studies clas-
sified five distinct types of diseases that can affect tomato crops: healthy, late blight,
bacterial spot, septoria spot, and yellow curve.
In their paper titled “GANs-Based Data Augmentation for Citrus Disease Severity
Detection using Deep Learning,” Zeng and colleagues [14] focused on the detection
of Huanglongbing (HLB) infection using a citrus plant dataset from plant village
and crowd AI. They found that the Inception V3 model performed better than other
models in terms of severity detection accuracy, achieving an accuracy of 74.38%
due to its high computational efficiency and the smaller number of parameters.
The authors also proposed that their algorithm can further improve results by up
to 92.60%.
Kukreja and Dhiman [15] presented a paper on “A Deep Neural Network-based
Disease Detection Using Data Augmentation Techniques”. For their research work,
they took 120 images of citrus plant leaves, which they augmented to 1200 images
of size 256 × 256. They included the CNN model and used various techniques in
the preprocessing, segmentation, and augmentation stages of the model. Stochastic
gradient descent (SGD) optimization was employed to train the neural networks.
They reported 89.10% accuracy in disease detection.
Sharath and colleagues [16] published a research paper titled “Disease Detec-
tion in Plants Using Convolutional Neural Networks.” In their study, they used a
dataset comprising 12,891 plant images of various fruits such as oranges, grapes,
pomegranates, papayas, and citrus. The authors employed the grab cut method during
the segmentation stage of their convolutional neural network (CNN) model to identify
diseases. They reported that their CNN model achieved a plant disease detection effi-
ciency of 91%. The authors also suggested that the accuracy of their approach could
be further improved by utilizing appropriate image enhancement and classification
techniques.
Kaur et al. [17] proposed research on “A Genetic Algorithm-based Feature Opti-
mization Method for Citrus HLB Disease Detection using Machine Learning”. In
this work, an improved feature selection stage for HLB/citrus greening disease was
proposed, and a machine learning model was trained on both healthy and diseased
samples. A 60-image dataset, of which 30 are healthy and 30 are HLB infected,
was employed for
the study. Images are cropped and resized during the preprocessing stage, and the
K-means clustering algorithm is employed during the segmentation stage. The GLCM
approach is used for feature extraction. They reported SVM classifier efficiencies of
up to 90.40%.
Khattak et al. [8] conducted a study on the “Automatic Detection of Citrus Fruit
and Leaf Disease using a Deep Neural Network Model”. They used the “plant village”
dataset, which contains 213 images of citrus fruits. The proposed CNN model showed
a test accuracy of 94.55% for detecting black spots, cankers, scabs, and greening
disease. It indicates its usefulness as a decision-support tool for farmers in classifying
citrus fruit and leaf diseases.
Sujatha et al. [18] published a paper comparing the performance of machine
learning and deep learning techniques for plant leaf disease detection. In their
study, they discussed several machine learning techniques, including support vector
machines, random forests, and stochastic gradient descent. Additionally, they eval-
uated the performance of deep learning methods including Inception V3, VGG-16,
and VGG-19.
In their 2022 paper titled “Classification of Citrus Disease Using Optimiza-
tion Deep Learning Approach,” Elaraby et al. [19] explored the classification of
various citrus diseases, including black spot, canker, scab, greening, anthracnose,
and melanose. To achieve this, the authors used a combination of the plant village
and a self-collected dataset. They employed two convolutional neural networks,
namely AlexNet and VGG-19, to develop and evaluate their proposed method. The
dataset comprised 759 augmented images, each measuring 256 pixels on the longest
dimension. The authors reported an impressive performance of their model, with an
accuracy of 94.3%, a precision of 94.1%, a specificity of 93.9%, and an F-score of
94.3%.
5 Proposed Methodology
Deep learning has become increasingly popular in recent models for citrus plant
diseases. In this research work, we provide a brief overview of the proposed CNN
model for recognizing and categorizing citrus plant diseases using image processing
techniques. Using the suggested model for in-depth analysis, it is possible to identify
the infected citrus leaves and apply preventive treatment. The proposed work uses a
deep learning convolutional neural network model and a learning vector quantization
(LVQ) algorithm for quantizing the fully connected layer output and providing a
more accurate result. The proposed model includes dataset collection, segmentation,
feature extraction, and classification stages.
Initially, a dataset was created using publicly available data, and preprocessing
techniques were applied to the images. Augmentation operations were also applied
to some disease image categories to ensure equal representation of each category.
The dataset was split into 70% for training and 30% for testing.
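Such a per-class (stratified) 70/30 split keeps each disease category equally represented in both sets. A minimal sketch, assuming each image is stored as a (filename, label) pair — the file names and class labels below are illustrative placeholders, not the actual dataset:

```python
import random
from collections import defaultdict

def stratified_split(samples, train_frac=0.70, seed=42):
    """Split (image, label) pairs 70/30 per class so each
    disease category is equally represented in train and test."""
    by_class = defaultdict(list)
    for image, label in samples:
        by_class[label].append((image, label))
    rng = random.Random(seed)
    train, test = [], []
    for label, items in by_class.items():
        rng.shuffle(items)                     # avoid ordering bias
        cut = int(len(items) * train_frac)     # 70% of this class
        train.extend(items[:cut])
        test.extend(items[cut:])
    return train, test

# toy usage with illustrative class names (10 images per class)
data = [(f"img_{c}_{i}.jpg", c)
        for c in ["healthy", "multiple", "rust", "scab", "canker"]
        for i in range(10)]
train, test = stratified_split(data)
```

Because the split is done per class, a category left unbalanced by augmentation cannot end up entirely in one partition.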
Next, a convolutional neural network (CNN) was created using the Sequential()
function, which consisted of several convolutional and max pooling layers. The CNN
started with four convolutional layers with 32, 64, 128, and 512 filters, respectively,
each followed by a max pooling layer with a 2 × 2 window size. Each convolutional
layer utilized a 3 × 3 filter, and the ReLU activation function was applied. After the
final max pooling layer, a Flatten() layer was added to flatten the output of the earlier
layers, and a Dense() layer with the SoftMax activation function was incorporated to
generate probabilities for each of the five possible classes. The probabilities obtained
from the CNN were then fed to the LVQ algorithm, which uses a winner-take-all
approach. Learning vector quantization (LVQ) is a technique used in machine learning
and signal processing to classify input data into one of several predetermined
classes. LVQ is a supervised learning algorithm that uses a set of labelled training
examples to learn a mapping between input vectors and output classes. In LVQ, each
input vector is represented as a point in a high-dimensional space, and the goal is to
find a set of representative vectors (called prototypes) that can effectively partition the
input space into the predefined classes. The prototypes are typically initialized randomly
and are then iteratively adjusted based on the input data. During the training phase, each
input vector is presented to the LVQ algorithm, and the distance between the input
vector and every prototype is calculated. The input vector is then assigned to the closest
Citrus Plant Leaves Disease Detection Using CNN and LVQ Algorithm 183
Fig. 4 Flowchart for proposed methodology
prototype, which is known as the winning prototype. If the winning prototype carries
the same class label as the input vector, it is moved closer to the input; if it carries a
different label, it is moved further away. Once the training phase is complete, the
prototypes are fixed, and the LVQ algorithm can be used to classify new input vectors.
The classification process involves measuring the distance between the input vector
and each prototype and assigning the input vector to the class associated with the
closest prototype (Fig. 4).
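A minimal sketch of the standard LVQ1 training and nearest-prototype classification; the 2-D toy data, prototype initialization, and learning rate below are illustrative, not values from the paper:

```python
def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def lvq_train(data, prototypes, labels, lr=0.2, epochs=20):
    """LVQ1: pull the winning prototype toward same-class inputs,
    push it away from different-class inputs."""
    for _ in range(epochs):
        for x, y in data:
            # winner-take-all: index of the closest prototype
            w = min(range(len(prototypes)),
                    key=lambda i: dist2(prototypes[i], x))
            sign = 1.0 if labels[w] == y else -1.0
            prototypes[w] = [p + sign * lr * (xi - p)
                             for p, xi in zip(prototypes[w], x)]
    return prototypes

def lvq_classify(x, prototypes, labels):
    """Assign x the class of its nearest prototype."""
    return labels[min(range(len(prototypes)),
                      key=lambda i: dist2(prototypes[i], x))]

# two well-separated toy clusters with one prototype per class
data = [([0.0, 0.0], "A"), ([0.2, 0.1], "A"),
        ([1.0, 1.0], "B"), ([0.9, 1.1], "B")]
protos = [[0.4, 0.4], [0.6, 0.6]]   # rough random-style initialization
labels = ["A", "B"]
protos = lvq_train(data, protos, labels)
```

After training, classification is a nearest-prototype lookup — the same winner-take-all step the proposed pipeline applies to the CNN's probability vectors.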
After training all CNN values, these values can be used for the detection of plant
disease and all other performance parameters are calculated (Fig. 5).
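A quick shape walk through the four convolutional layers described above (32/64/128/512 filters, 3 × 3 kernels with 'valid' padding, each followed by 2 × 2 max pooling) shows where the Flatten() size comes from; the 128 × 128 × 3 input resolution is an assumption, since the chapter does not state it:

```python
def conv3x3(h, w, filters):
    # a 'valid' 3x3 convolution shrinks each spatial dimension by 2
    return h - 2, w - 2, filters

def pool2x2(h, w, c):
    # 2x2 max pooling halves each spatial dimension (floor division)
    return h // 2, w // 2, c

h, w, c = 128, 128, 3            # assumed input size (not given in the chapter)
for filters in (32, 64, 128, 512):
    h, w, c = conv3x3(h, w, filters)
    h, w, c = pool2x2(h, w, c)

flatten_units = h * w * c        # size of the Flatten() output fed to Dense()
```

With this assumed input, the final feature maps are 6 × 6 × 512, so the Flatten() stage would feed 18,432 units into the five-way SoftMax Dense() layer; a different input resolution changes these numbers but not the layer logic.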
5.1 Experiment and Results
The coding and implementation for this study were done in a Jupyter notebook of
the Anaconda framework, using Python 3.10. We use 2500 training images to train
our convolutional neural network model, and then 1070 test images to see how well
it can classify new images. Both the training and validation accuracies peaked at
the 29th epoch when running the designed approach for 30 epochs. After transferring
the CNN probabilities to the LVQ method, which results in enhanced
Fig. 5 Proposed methodology
classification accuracy, we obtain the performance metrics of accuracy, F1-score,
recall, and precision score. The confusion matrix for the classification of citrus disease
categories is shown in Fig. 6.
In Fig. 6, label 0 = healthy, label 1 = multiple diseases, label 2 = rust, label 3 =
scab, and label 4 = canker.
The performance of the proposed method is shown in tabular form in Table 1.
Fig. 6 Confusion matrix
Table 1 Performance of proposed methodology
Performance metrics    Result
Accuracy score         0.9654
F1-score               0.9655
Recall score           0.9654
Precision score        0.9669
Fig. 7 Training and validation loss
The training and validation losses during model training are shown in Fig. 7, and
the training and validation accuracies are shown in Fig. 8.
Several studies have utilized convolutional neural networks (CNNs) for disease
detection and classification in citrus plants. These include "Plant Leaf Disease Detection
and Classification based on CNN and LVQ Algorithm" (2018), "Performance of
Deep Learning vs. Machine Learning in Plant Leaf Disease Detection" (2021), "Deep
Metric Learning-Based Citrus Disease Classification With Sparse Data" (2020),
"GANs-Based Data Augmentation for Citrus Disease Severity Detection Using Deep
Learning" (2020), "Automatic Detection of Citrus Fruit and Leaves Diseases Using
Deep Neural Network Model" (2021), and "Classification of Citrus Diseases Using
Optimization Deep Learning Approach" (2022). These studies reported accuracy
rates of 86%, 89.50%, 90.28%, 92.60%, 94.55%, and 94.30%, respectively. In
comparison with these previous works, our system achieved a higher accuracy of
up to 96.50% (Fig. 9).
Fig. 8 Training and validation accuracy
6 Conclusion, Limitation, and Future Scope
It is concluded that the proposed model effectively recognizes the citrus canker, rust,
scab, healthy, and other disease categories. To improve classification accuracy, the
model is trained on an equal number of images from each disease category in the
dataset. Convolution, max pooling, and ReLU layers were also added for better
classification and accuracy. This research can identify the covered diseases accurately,
but other diseases such as greening and anthracnose can be added to detect all
diseases. The work can be extended using Internet of Things (IoT) devices, where
test images can be captured using drones and sensors. More layers can be added to
increase model performance. A web server-based mobile application can be developed
so that farmers can use it effectively on their end.
Fig. 9 Comparison of proposed work
References
1. Elangovan K, Nalini S (2017) Plant disease classification using image segmentation and SVM
techniques. Int J Comput Intell Res 13(7)
2. Latha M, Poojith A, Reddy A, Vittal Kumar G (2014) Image processing in agriculture. Int J
Innovative Res Electr 2:2321–5526, [Online]. Available: www.ijireeice.com
3. Sunny S, Gandhi MPI (2021) Canker detection in citrus plants with an efficient finite dissimilar
compatible histogram leveling based image pre-processing and SVM classifier. Turk J Comput
Math Educ (TURCOMAT) 12(10):2585–2592, [Online]. Available: https://www.turcomat.org/
index.php/turkbilmat/article/view/4871
4. Saini AK, Bhatnagar R, Srivastava K (2021) Detection and classification techniques of
citrus leaves diseases: a survey. Turk J Comput Math Educ (TURCOMAT) 12(6):3499–3510,
[Online]. Available: https://turcomat.org/index.php/turkbilmat/article/view/7138
5. Janarthan S, Thuseethan S, Rajasegarar S, Lyu Q, Zheng Y, Yearwood J (2020) Deep metric
learning based citrus disease classification with sparse data. IEEE Access 8:162588–162600.
https://doi.org/10.1109/ACCESS.2020.3021487
6. Syed-Ab-Rahman SF, Hesamian MH, Prasad M (2022) Citrus disease detection and classifica-
tion using end-to-end anchor-based deep learning model. Appl Intell 52(1):927–938. https://
doi.org/10.1007/s10489-021-02452-w
7. Islam M, Dinh A, Wahid K, Bhowmik P (2017) Detection of potato diseases using image
segmentation and multiclass support vector machine. Canadian conference on electrical and
computer engineering, pp 8–11. https://doi.org/10.1109/CCECE.2017.7946594
8. Khattak A et al (2021) Automatic detection of citrus fruit and leaves diseases using deep neural
network model. IEEE Access 9:112942–112954. https://doi.org/10.1109/ACCESS.2021.309
6895
9. Militante SV, Gerardo BD, Dionisio NV (2019) Plant leaf detection and disease recogni-
tion using deep learning. In: 2019 IEEE Eurasia conference on IOT, communication and
engineering, ECICE 2019. https://doi.org/10.1109/ECICE47484.2019.8942686
10. Luaibi AR, Salman TM, Miry AH (2021) Detection of citrus leaf diseases using a deep learning
technique. Int J Electr Comput Eng 11(2):1719–1727. https://doi.org/10.11591/ijece.v11i2.pp1
719-1727
11. Singh V, Misra AK (2017) Detection of plant leaf diseases using image segmentation and
soft computing techniques. Inf Process Agric 4(1):41–49. https://doi.org/10.1016/j.inpa.2016.
10.005
12. Shaikh RP, Dhole SA (2017) Citrus leaf unhealthy region detection by using image processing
technique. Proceedings of the international conference on electronics, communication and
aerospace technology, ICECA 2017, vol 2017–Janua, pp 420–423. https://doi.org/10.1109/
ICECA.2017.8203719
13. Sardogan M, Tuncer A, Ozen Y (2018) Plant leaf disease detection and classification based on
CNN with LVQ algorithm. UBMK 2018—3rd international conference on computer science
and engineering, pp 382–385. https://doi.org/10.1109/UBMK.2018.8566635
14. Zeng Q, Ma X, Cheng B, Zhou E, Pang W (2020) GANS-based data augmentation for citrus
disease severity detection using deep learning. IEEE Access 8:172882–172891. https://doi.org/
10.1109/ACCESS.2020.3025196
15. Kukreja V, Dhiman P (2020) A deep neural network based disease detection scheme for
citrus fruits. Proceedings—international conference on smart electronics and communication,
ICOSEC 2020, no Icosec, pp 97–101. https://doi.org/10.1109/ICOSEC49089.2020.9215359
16. Sharath DM, Kumar SA, Rohan MG, Akhilesh, Suresh KV, Prathap C (2020) Disease detection
in plants using convolutional neural network. Proceedings of the 3rd international conference
on smart systems and inventive technology, ICSSIT 2020, no Icssit, pp 389–394. https://doi.
org/10.1109/ICSSIT48917.2020.9214159
17. Kaur B, Sharma T, Goyal B, Dogra A (2020) A genetic algorithm based feature optimization
method for citrus HLB disease detection using machine learning. Proceedings of the 3rd inter-
national conference on smart systems and inventive technology, ICSSIT 2020, no Icssit, pp
1052–1057. https://doi.org/10.1109/ICSSIT48917.2020.9214107
18. Sujatha R, Chatterjee JM, Jhanjhi NZ, Brohi SN (2021) Performance of deep learning vs
machine learning in plant leaf disease detection. Microprocess Microsyst vol 80(October
2020):103615. https://doi.org/10.1016/j.micpro.2020.103615
19. Elaraby A, Hamdy W, Alanazi S (2022) Classification of citrus diseases using optimization
deep learning approach. Comput Intell Neurosci 2022. https://doi.org/10.1155/2022/9153207
Longevity Recommendation for Root
Canal Treatment
Pragati Choudhari, Anand Singh Rajawat, S. B. Goyal, Xiao ShiXiao,
and Amol Potgantwar
Abstract Endodontic treatment has a high success rate; however, it still fails in
many patients. It is usually attributed to different clinical and non-clinical factors.
Therefore, it is crucial to avoid or even significantly reduce the prevalence of the
most common causes of root canal treatment failure. This paper makes an attempt to
find the different factors that are responsible for root canal (RCT) failure by using
machine learning techniques like SVM, NB classifier, and logistic regression. From
the provided data of 332 instances, it determines the clinical and non-clinical aspects
that lead to the identification of failing RC teeth. The findings also reveal that the LR
model has the highest accuracy (91.87%) compared to the other two algorithms. This
system also helps in determining the relationship between these parameters and their
impact on the longevity of the root canal treatment using machine learning models.
areas where they may have fallen short.
Keywords Root canal treatment (RCT) failure · Successful RCT · Support vector
machine (SVM) · Naive Bayes classifier (NB) · Logistic regression (LR) · RCT
longevity prediction · Overfilling · Underfilling issues
P. Choudhari
Department of Computer Engineering, Indira College of Engineering and Management, Sandip
University, Pune, India
A. S. Rajawat
School of Computer Science and Engineering, Sandip University, Nashik, India
e-mail: anandsingh.rajawat@sandipuniversity.edu.in
S. B. Goyal (B)
City University, Petaling Jaya, Malaysia
e-mail: drsbgoyal@gmail.com
X. ShiXiao
Chengyi College Jimei University, Xiamen, China
A. Potgantwar
Sandip Institute of Technology and Research Centre, Sandip University, Nashik, India
e-mail: amol.potgantwar@sitrc.org
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_15
189
190 P. Choudhari et al.
1 Introduction
Root canal therapy (often termed RCT) is necessary when dental X-rays show that a
bacterial infection has affected the pulp [1]. An infected pulp causes discomfort
with both hot and cold foods and beverages and can cause inflammation, which in
turn can promote bacterial growth and spread. Failure of root canal (endodontic)
treatment is a prevalent issue in the field of dental care. As a result, the question
that causes the most concern is, "How long will the tooth survive following root
canal therapy?".
Although root canal therapy has a high success rate (which ranges between 86
and 98%), a failure rate of up to 20% cannot be overlooked. Two potential
causes are the incorrect use of working processes and the use of unsuitable
materials [2]. In this regard, the success of treatment can be evaluated by the
tooth's survival or longevity. In addition, the failure of endodontic therapy
can be attributed to a wide variety of clinical and non-clinical variables. Periapical
radiolucency, root fractures, damaged teeth, insufficient periodontal support, pulp
stones, and periapical abscesses are some examples [35]. Non-clinical factors
such as age and poor oral hygiene also contribute, and demographic location,
level of education, and smoking and drinking habits all play a significant part
in the outcome of treatment [6, 7].
A. Motivation
Root canal therapy is used to restore a severely decayed tooth rather than remove
it. Even when the root canal is performed by a highly trained dentist, there is still a
chance that the procedure will fail if the infection is not completely cleaned up or
the tooth gets infected again [8]. Non-clinical factors such as poor oral hygiene,
demographic location, education, smoking, and drinking all play a significant role
in treatment failure, as stated by the American Association of Endodontists [4].
Moreover, clinical factors including periapical radiolucency, root fractures, damaged
teeth, insufficient periodontal support, pulp stones, and periapical abscesses are also
responsible.
Therefore, the primary motive for developing the proposed system is to improve
the diagnostic accuracy of operative procedures such as root canal treatment. This
is accomplished by determining the factors (a broken instrument, an overfilled cavity,
a periapical abscess, pulp stones, a vertical root fracture, a broken tooth, insufficient
periodontal support, a perforated root, or an underfilled cavity) along with age, eating
habits, uncooperative behaviour, education, and chewing habits that are primarily
responsible for treatment failure. Furthermore, if the reason for treatment failure
is known, it helps to determine the longevity of the treatment, which in turn helps to
reduce error and lowers the likelihood of treatment failure.
Longevity Recommendation for Root Canal Treatment 191
B. Main contribution
It has been observed that machine learning has demonstrated its usefulness in
practically all sectors, including healthcare, where it has improved diagnostic
accuracy and revolutionised treatment. This paper sheds light on the primary
reasons that can increase the probability of root canal failure and discusses those
aspects in detail. Using machine learning models, it is possible to determine the
survival of root canal therapy by taking into account a number of clinical and
non-clinical characteristics related to the root canal in the given dataset. This helps
to determine how long the treatment will last [9, 10]. This work evaluates the
tooth's lifetime by utilising a dataset consisting of 332 instances of root canal
treatment. It not only classifies the data into the ideal RCT, but also identifies
problems in treatments already in use, such as those caused by overfilling,
underfilling, perforation, or root resorption. Finally, this aids the healthcare
provider in resolving any issues that may have led to a shorter lifespan and in
providing high-quality care.
C. Organisation of the Paper
Section 1 provides an introduction to root canal treatment failure, as well as the
paper's motivation and main contribution. Section 2 contains work related to the
proposed work. Sections 3 and 4 introduce the proposed work of implementing the
longevity recommendation for root canal therapy and the prediction techniques. The
results for predicting root canal treatment failure are presented in Sect. 5, limitations
of the study are mentioned in Sect. 6, and the conclusion is presented in Sect. 7.
2 Related Work
Numerous investigations on the durability of root canal treatment with contemporary
methods have been undertaken in the past few years. Machine learning (ML) and deep
learning (DL) are two of the methods that can be used to gain a useful understanding
of the RCT. In light of this, Son et al. [1] suggested a hybrid dental diagnosis
system (DDS) covering segmentation, classification, and decision-making. The
method uses 87 dental images from Hanoi Medical University, Vietnam, showing
five common conditions: root fracture, included teeth, decay, missing teeth, and
periodontal bone resorption. It uses a semi-supervised fuzzy clustering approach
for dental image segmentation. DDS diagnosis is also shown to be more accurate
than other approaches.
Gradient boosting machine (GBM) and random forest (RF) models for endodontic
microsurgery prognosis prediction were developed by Qu et al. [11]. Around 234
teeth from 178 participants were taken, and the investigation was done in a controlled
laboratory setting. The authors took into account eight significant variables,
including lesion size, tooth type, bone defect type, root filling length, root filling
density, apical extension of post, age, and sex. The research also demonstrates
that, on average, the GBM model performs marginally better than the RF model.
Logistic regression (logR) and extreme gradient boosting (XGB) were used by
Herbst et al. [12] to predict unsuccessful root canal treatments, alongside GBM and
random forests (RF). Additionally, tooth-level variables were cited as a primary cause
of treatment failure. Treatment planning and informed shared decision-making benefit
from the identification of particular risk factors for treatment failure and from the
prediction of the treatment outcome. Teeth treated at a single large university hospital
between 2016 and 2020 and followed for at least six months were included in the dataset.
Hung et al. [13] suggested a machine learning-based computerised dental care
recommender model. The study used the 2013–2014 National Health and Nutrition
Examination Survey. Feature selection for regression model optimization uses
LASSO methods. Logistic regression, support vector machines, random forests, and
classification and regression trees are used to predict dental care needs. LASSO also
helps to identify gum health, race, drugs, general health, health insurance, and
country of birth as factors affecting dental care.
An effort was made to improve the accuracy of periapical radiography for detecting
and predicting dental caries by Lee et al. [9]. Dental caries can be detected and
diagnosed with the help of deep convolutional neural networks (CNNs). Over three
thousand periapical radiographs are utilised in this study. The Convolutional Neural
Network (CNN) Inception v3 is used to perform preliminary image processing. Based
on the findings, a deep convolutional neural network method performed admirably
well at identifying dental cavities in periapical radiographs.
Root canal therapy can be complicated by age-related pathologic and physiolog-
ical changes, as reviewed by Mothanna et al. [14]. Systemic disorders affecting the
teeth and oral mucosa are recognised as deserving of specialised treatment. There-
fore, root canal therapy is a crucial part of these processes for maintaining healthy
teeth.
Patients in the Saudi Arabian city of Al-Kharj were studied by Mustafa and
colleagues [5] to determine what factors led to the failure of endodontic treatment.
Factors such as pain, tenderness on pressure, periapical radiolucency, and the pres-
ence or absence of a sinus tract are used to establish the failure’s root cause. It
has been determined that subpar auxiliary care is a major contributor to endodontic
failure. It has also been shown that males, as well as patients of private as opposed to
public institutions, are more likely to have the complications that lead to endodontic
failure.
Hung et al. [15] used machine learning approaches in artificial intelligence
to choose the most pertinent variables for root caries classification and to evaluate
model performance. Studying 2015–2016 National Health and Nutrition Examination
Survey data, a support vector machine classifies root caries variables with 97.1%
accuracy. Five demographic variables (age, household income, education, race/
ethnicity, and marital status), five oral health variables (last dentist visit, flossing,
mouth ache, self-rated oral health, and oral embarrassment), and five lifestyle/health
variables (TV watching, computer use, sunscreen use, alcohol consumption, and
cholesterol prescriptions) are used to classify people.
The accuracy of a back propagation (BP) artificial neural network model for
predicting pain after root canal therapy (RCT) was assessed by Gao et al. [16].
The study uses a BP neural network model that was built with the help of the
neural network toolbox in MATLAB version 7.0. Thirteen components, including
individual characteristics, inflammatory responses, and surgical techniques, were
examined to construct a functional projective link. This BP neural network model
predicted postoperative pain after RCT with an accuracy of 95.60%.
Zhang et al. [17] employed deep learning features from periapical and
panoramic images to predict implant failure, which could help clinicians intervene
early. Eighty-nine failed and 159 successful implant patients were investigated.
A deep learning-based model used 529 periapical and 551 panoramic patient images.
Fivefold cross-validation was used to estimate the ideal deep CNN weight
factors; the CNN achieved 78.7% diagnostic accuracy for panoramic images
alone.
A. Comparative study
It is clear from this study that a lot of work has been done on the subject of dental
caries diagnosis. However, there is a severe lack of research into determining how
long a root canal treatment will last. Furthermore, the root canal treatment failure
factors considered are limited, which may impact system efficiency. So, in order to
improve the treatment's chances of survival, it is crucial to determine the most
significant risk factors for its failure. This allows clinicians to fix any therapy errors
in a timely manner, hence enhancing the quality of care provided to the patient.
A comparative summary of the related work (author, aim, dataset, features, algorithm, accuracy, limitations):
1. Son et al. [1]. Aim: dental diagnosis from X-ray images. Dataset: 87 dental images from real cases at Hanoi Medical University, Vietnam. Features: root fracture, included teeth, decay, missing teeth, and resorption of periodontal bone. Algorithm: hybrid approach with semi-supervised fuzzy clustering. Accuracy: 92.74%.
2. Qu et al. [11]. Aim: endodontic microsurgery prognosis prediction. Dataset: 234 teeth from 178 participants. Features: lesion size, tooth type, bone defect type, root filling length, root filling density, apical extension of post, age, and sex. Algorithms: gradient boosting machine (GBM) and random forest (RF). Accuracy: RF model 83%, GBM model 88%. Limitations: small dataset of unhealed instances restricted the study's scope; data imbalance affects performance measurements.
3. Herbst et al. [12]. Aim: predict unsuccessful root canal treatments. Dataset: 458 patients (female/male 47.2/52.8%) with 591 permanent teeth. Features: tooth-level covariates. Algorithms: logistic regression (logR) and extreme gradient boosting (XGB). Accuracy: 89%. Limitation: predicting failure was limited, hence a more complex ML algorithm is needed.
4. Lee et al. [9]. Aim: diagnosis of dental caries. Dataset: a total of 3000 periapical radiographic images. Features: lesions in the enamel, dentin, and even pulp tissue; severe pain. Algorithm: deep convolutional neural networks (CNNs). Accuracy: 89.0%. Limitation: improved deep learning algorithms and higher-quality, larger datasets might help to improve accuracy.
5. Hung et al. [15]. Aim: identification of root caries. Dataset: 2015–2016 National Health and Nutrition Examination Survey. Features: age, household income, education, race/ethnicity, and marital status; last dentist visit, flossing, mouth ache, self-rated oral health, and oral embarrassment; TV watching, computer use, use of sunscreen, alcohol consumption, and cholesterol prescriptions. Algorithm: support vector machine. Accuracy: 97.1%.
6. Gao et al. [16]. Aim: forecasting pain after root canal treatment. Dataset: a total of 300 adult patients with 300 root-filled teeth who had received RCT. Features: gender, age, oral hygiene, location of teeth, degree of initial diagnosis, tooth percussion pain, root canal missing, root canal overfilling, root canal underfilling, and pulp condition. Algorithm: back propagation (BP) artificial neural network. Accuracy: 95.60%. Limitation: the study only showed one-week pain relief.
7. Zhang et al. [18]. Aim: implant failure prediction from periapical and panoramic films. Dataset: a total of 248 patients (89 with failed implants and 159 with successful implants). Algorithm: deep convolutional neural network (CNN). Accuracy: 78.6%. Limitation: the study is retrospective, and manual matching with gender, age, and implant surgeon may have altered analysis results.
3 Proposed Work
The success rate of a root canal procedure is a crucial indicator that allows dentists
and oral surgeons to identify and correct any problems that can lead to the treatment's
failure. Broken instruments, underfilled canals, overfilled canals, perforations, root
resorption, and non-clinical factors such as a lack of patient knowledge about proper
oral hygiene, smoking, age, education, demographic location, and drinking can all
contribute to this problem. Therefore, the suggested system aids in identifying either
the optimal RCT or the faults in the root canal treatment that can lead to failure, and
in determining the longevity on the basis of these parameters. The primary goal of
the system is, therefore, to employ machine learning models such as SVM, LR, and
NB classifier to resolve the endodontic problem of treatment survival prediction and
help those who have had root canals have a better quality of life after treatment. The
proposed system's block diagram is depicted in Fig. 1.
1. Data acquisition: A total of 332 cases of endodontic treatment were used in this
analysis. The system will utilise this data as input to determine what went wrong
during the root canal procedure.
2. Preprocessing: Healthcare data is noisy. Before analysis, raw data is
preprocessed to remove noise and other undesired elements.
Fig. 1 Block diagram of proposed work
3. Machine learning classification: Root canal failure can be caused by a number
of distinct clinical and non-clinical factors. The model is initially trained on a
dataset of root canal treatment (332 instances). The system then receives input in
order to identify root canal failure on the test data. The factors of the ideal RCT or
its failures can then be identified by comparing the test data to the training data.
The system then determines the cause of the failure using machine learning
models such as SVM, NB, and LR; the elements that can lead to the treatment's
failure include:
1. A broken instrument
2. An overfilled cavity
3. A periapical abscess
4. A perforated root
5. An underfilled cavity [3,19]
6. Poor coronal restorations
7. Root resorption
8. A non-restorable tooth.
Root canal failure can also be caused by non-clinical variables such as
the patient's age [6], chewing habits, lack of vegetarianism, smoking [7,20],
drinking, geographic location [6], or lack of formal education. As a result, the
instances are categorised using this classification model.
4. Predict the longevity of the treatment: The longevity of the root canal treatment
reflects the relationship between all of the important clinical and non-clinical
aspects, which determines the success of the treatment [20]; the ability to detect
the primary factors helps to forecast how long the treatment will remain effective
[21].
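The four stages above can be sketched end-to-end as follows. Every name here (clean_record, classify_failure_factors, estimate_longevity, the per-attribute risk scores, and the baseline/penalty values) is hypothetical and only illustrates the data flow, not the paper's actual WEKA models:

```python
def clean_record(record):
    """Preprocessing: drop missing/noisy fields from a raw case record."""
    return {k: v for k, v in record.items() if v is not None}

def classify_failure_factors(record, risk_threshold=0.5):
    """Stand-in for the trained SVM/NB/LR classifier: flag attributes
    whose (hypothetical) risk score exceeds a threshold."""
    return [k for k, v in record.items() if v > risk_threshold]

def estimate_longevity(failure_factors, baseline_years=10.0, penalty=2.0):
    """Toy longevity rule: each detected failure factor shortens the
    expected lifetime of the treatment (values are illustrative)."""
    return max(0.0, baseline_years - penalty * len(failure_factors))

# one illustrative case record: attribute -> hypothetical risk score
record = {"underfilled_canal": 0.9, "overfilled_canal": 0.1,
          "poor_coronal_restoration": 0.8, "chewing_habits": 0.3,
          "broken_instrument": None}        # missing value, dropped in cleaning
cleaned = clean_record(record)
factors = classify_failure_factors(cleaned)
years = estimate_longevity(factors)
```

The real system replaces the threshold stand-in with the trained classifier and learns the factor/longevity relationship from the 332-instance dataset; the skeleton only shows how the stages hand data to one another.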
4 Implementation Details
A manual dataset is used to compare the performance of support vector machines
(SVM), Naive Bayes, and basic logistic regression in determining the causes of root
canal therapy failure.
Class 0 (Low): attributes that are more likely to lead to failure.
Class 1 (High): attributes that are less likely to lead to failure.
Among the 27 variables used to predict root canal treatment failure, four, such as a
curved root canal, were identified; inadequate periodontal support, pulp stones,
perforations, broken teeth, and overfilled canals were seen to be of low weightage
based on the information obtained from the ranker algorithm.
SVM Classification
WEKA is used to analyse the outcomes of running the support vector machine
classification method on the input dataset (Tables 1 and 2).
Table 1 Classification report
Class    0     1
0        82    33
1        11    206
Table 2 Confusion matrix
         Accuracy (%)   Precision (%)   Sensitivity (%)   Specificity (%)
Class 0  86.75          88.17           71.30             94.93
Class 1  86.75          86.19           94.93             71.30
On the given dataset, the SVM achieves an accuracy of 86.75% for class 0
and class 1. Figure 2 depicts a graphical representation of the same information.
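As a consistency check, the per-class figures in Table 2 follow directly from the counts in Table 1, taking class 0 as the positive class:

```python
# confusion matrix from Table 1 (rows = actual class, cols = predicted class)
cm = [[82, 33],
      [11, 206]]

tp, fn = cm[0][0], cm[0][1]        # class 0 correctly / incorrectly predicted
fp, tn = cm[1][0], cm[1][1]        # class 1 mispredicted as 0 / correctly predicted
total = tp + fn + fp + tn          # 332 instances

accuracy    = (tp + tn) / total    # overall fraction correct
precision   = tp / (tp + fp)       # of predicted class 0, how many were class 0
sensitivity = tp / (tp + fn)       # recall for class 0
specificity = tn / (tn + fp)       # recall for class 1
```

Each rounded percentage matches the class 0 row of Table 2 to two decimal places, and swapping the roles of the two classes reproduces the class 1 row.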
Naive Bayes Classifier
When applied to the provided dataset, the Naive Bayes classifier generates the
confusion matrix shown in Table 3; the corresponding classification results are
depicted in Fig. 3.
Logistic regression
Applying logistic regression to the given dataset in the WEKA tool yields an
accuracy of 91.87%; the resulting confusion matrix is shown in Table 4.
Fig. 2 Root canal failure detection using SVM
Table 3 Confusion matrix
Accuracy (%) Precision (%) Sensitivity (%) Specificity (%)
Class 0 80.42 74.04 66.96 87.56
Class 1 80.42 83.33 87.56 66.96
Longevity Recommendation for Root Canal Treatment 199
Fig. 3 Root canal failure detection—NB classifier
Table 4 Confusion matrix
Accuracy (%) Precision (%) Sensitivity (%) Specificity (%)
Class 0 91.87 86.67 90.43 92.63
Class 1 91.87 94.81 92.63 90.43
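Although the study uses WEKA, the same three-classifier comparison can be sketched in Python with scikit-learn. The clinic dataset is not public, so a synthetic stand-in with the paper's dimensions (332 instances, 23 attributes) is used here; this is an illustrative sketch, not the authors' workflow.

```python
# Hedged sketch: comparing SVM, Naive Bayes, and logistic regression,
# analogous to the WEKA comparison. X and y are synthetic placeholders.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Synthetic stand-in for the manual dataset (332 instances, 23 attributes)
X, y = make_classification(n_samples=332, n_features=23, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for name, clf in [("SVM", SVC()),
                  ("Naive Bayes", GaussianNB()),
                  ("Logistic regression", LogisticRegression(max_iter=1000))]:
    clf.fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    print(name, round(accuracy_score(y_te, pred), 4))
    print(confusion_matrix(y_te, pred))
```

On the real clinical data, the paper reports logistic regression performing best, followed by SVM and then Naive Bayes.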
5 Results
Using a dataset of 332 instances with 23 attributes to determine the system performance of the given machine learning algorithms, it was determined that, among the three, logistic regression performed better than NB and SVM, as shown in Figs. 4 and 5.
It has also been seen that poor coronal restorations are the main cause of treatment failure (20.38%), followed by peripheral abscess (13%), underfilled canals (9.76%), chewing habits (7.39%), broken instruments (5.59%), and non-restorable teeth (4.95%). Overfilled canals are the least likely to cause failure (0.07%), along with broken teeth (0.28%) and perforation (0.35%).
Fig. 4 Root canal failure detection—logistic regression
Fig. 5 System performance
6 Discussion
Root canal is one of the costlier treatments used to cure the dental pain and save
the tooth. Along with operative procedure errors, various clinical and non-clinical
factors are responsible for root canal failure. Among clinical factors, poor coronal
restoration has more impact on RC failure. Chewing habits also decide root canal
longevity. Doctors' operative errors such as underfilled canals and broken instruments are more likely to lead to RC failure.
This study is not without limitations. First, the dataset used for training was taken from a single clinic, so it is representative of one particular demographic region. Second, the dataset used for model training and testing is small. The accuracy of the results could be improved by increasing the amount of training and testing data, gathered from different demographic regions.
7 Conclusion
The suggested machine learning models such as SVM, NB classifier, and LR help in
predicting the likelihood of root canal treatment (RCT) failure, mitigating potential
harm to oral health. Among the various models, the LR model exhibits a remark-
able level of accuracy, standing at a promising 91.87%, when it comes to classi-
fying instances of root canal treatment failure. This is closely followed by SVM,
which achieves an accuracy rate of 86.75%. Finally, NB achieves an accuracy rate
of 80.42% when applied to the root canal treatment dataset. The system identifies
the RCT failure due to various factors (both clinical and non-clinical), such as a
broken instrument, an overfilled cavity, a perforated root, an underfilled cavity, poor coronal restorations, and a non-restorable tooth. Inadequate coronal restorations account for the largest percentage of treatment failures (20.38%), followed by peripheral abscesses (13%), underfilled canals (9.76%), and finally, untreated decay (1.7%). Conversely, damaged teeth (0.28%) and perforation (0.35%) are among the least common causes of treatment failure, with overfilled canals accounting for the fewest such cases (0.07%). It was also found that the proposed
approach improves the quality of care by providing an accurate prognosis of how
long a patient would benefit from treatment.
References
1. Son L, Tuan T, Fujita H, Dey N, Ashour A, Anh L, Chu D (2018) Dental diagnosis from X-Ray
images: an expert system based on fuzzy computing. Biomed Signal Process Control 39:64–73
2. Mohanty A, Patro S, Barman D, Jnaneswar A (2020) Modern endodontic practices among
dentists in India: a comparative cross-sectional nation-based survey. J Conserv Dent 23(5):441–
446
3. Iqbal A (2016) The factors responsible for endodontic treatment failure in the permanent
dentitions of the patients reported to the college of dentistry, the University of Aljouf, Kingdom
of Saudi Arabia. J Clin Diagn Res
4. EndoSpec (2022) 6 factors of root canal treatment longevity. https://endospec.com/root-canal-
treatment-longevity
5. Mustafa M, Almuhaiza M, Alamri HM, Abdulwahed A (2021) Evaluation of the causes of
failure of root canal treatment among patients in the city of Al-Kharj, Saudi Arabia. Niger J
Clin Pract 24(4):621–628
6. Arias A, Macorra J (2013) Predictive models of pain following root canal treatment: a
prospective clinical study. Int Endod J
7. Krall E, Sosa A, Garcia, Nunn ME, Caplan DJ, Garcia RI (2006) Cigarette smoking increases
the risk of root canal treatment. J Dent Res 85(4):313–317
8. López-Valverde I, Vignoletti F, Vignoletti G, Martin C, Sanz M (2023) Long-term tooth survival
and success following primary root canal treatment: a 5- to 37-year retrospective observation.
Clin Oral Inv
9. Lee H et al (2018) Detection and diagnosis of dental caries using a deep learning-based convo-
lutional neural network algorithm. J Dent [Preprint]. Available at: https://doi.org/10.1016/j.
jdent.2018.07.015
10. Kumar A, Bhadauria H, Singh A (2021) Descriptive analysis of dental X-ray images using
various practical methods: a review. PeerJ Comput Sci 7:e620
11. Qu Y, Lin Z, Yang Z, Lin H, Huang X (2022) Machine learning models for prognosis prediction
in endodontic microsurgery. J Dent 118:103947
12. Herbst C, Schwendicke F, Krois J, Herbst H (2021) Association between patient-, tooth- and
treatment-level factors and root canal treatment failure:a retrospective longitudinal and machine
learning study. J Dent 117(13):103937
13. Hung M, Xu J, Lauren E, Voss M, Rosales M, Su W, Negrón B, He Y, Li W, Licari W (2019)
Development of a recommender system for dental care using machine learning. SN Appl Sci
1:Article number: 785
14. Al Rahabi MK (2019) Root canal treatment in an elderly patient. Saudi Med J 40(3):217–223
15. Hung M, Voss MW, Rosales MN (July 2019) Application of machine learning for diagnostic
prediction of root caries. Gerodontology 36(9)
16. Gao X, Xin X, Li Z, Zhang W (Aug 2021) Predicting postoperative pain following root canal
treatment by using artificial neural network evaluation. Sci Rep 11(1)
17. Zhang C, Fan L, Zhang S, Zhao J, Gu Y (2023) Deep learning based dental implant failure
prediction from periapical and panoramic films. Quant Imaging Med Surg 13(2):935–945
18. Zhang C, Fan L, Zhang S, Zhao J, Gu Y (01 Feb, 2023) Deep learning based dental implant
failure prediction from periapical and panoramic films 13(2)
19. Tabassum S, Khan F (2016) Failure of endodontic treatment: the usual suspects. Eur J Dent
10(1):144–147
20. Estrela C, Holland R, Rodrigues C, Helena A (2014) Characterization of successful root canal
treatment. Braz Dent J 25(1):3–11. https://doi.org/10.1590/0103-6440201302356
21. Elemam R, Pretty I (2011) Comparison of the success rate of endodontic treatment and implant
treatment. ISRN Dent, p 640509
Deep Q-Learning for Virtual
Autonomous Automobile
Piyush Pant , Rajendra Sinha, Anand Singh Rajawat, S. B. Goyal ,
and Masri bin Abdul Lasi
Abstract The Deep Q-Learning is a reinforcement learning algorithm that is
proposed by the research for developing autonomous automobiles. The research used
the advanced and latest technologies and libraries to develop a virtual automobile that
is autonomous. The proposed model is implemented using neural networks, which
take the state “S” as input vector x and forecast the following potential action “a”
that, according to the state-action value function, will be the most profitable. In the
virtual environment developed by the research, the automobile, which is the agent,
moves randomly and takes random actions continuously. These are stored and used
to train the neural network in the ratio of dataset 60–20–20%. After random state
travel and training, the agent is able to learn on its own to drive. This is achieved by
rewarding the agent by +a for a correct or expected action and penalizing the agent
by −p for a wrong or unexpected action. By doing so, the agent is able to drive in
the lane and avoid the obstacles. The research is fully software-based and virtual,
thus no requirement of hardware except for a computer. The research also studies
reinforcement learning and the DQN algorithm to enhance the learning of the readers
in the domain of AI.
Keywords Artificial intelligence ·Machine learning ·Reinforcement learning ·
DQN ·Self-driving car
P. Pant · R. Sinha · A. S. Rajawat
Sandip University, Nashik, India
A. S. Rajawat · S. B. Goyal (B) · M. A. Lasi
City University, 46100 Petaling Jaya, Malaysia
e-mail: drsbgoyal@gmail.com
M. A. Lasi
e-mail: masri.abdullasi@city.edu.my
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_16
1 Introduction
The development of autonomous vehicles has garnered significant attention in recent
years, with the aim of revolutionizing transportation and improving road safety.
In this research paper, we explore the application of Deep Q-Learning (DQL) in
the context of virtual autonomous automobiles. By harnessing the power of rein-
forcement learning and deep neural networks, DQL provides a promising approach
for training autonomous agents to make intelligent decisions in complex driving
scenarios.
Autonomous vehicles are becoming increasingly popular with the advancements
in machine learning and artificial intelligence. To improve their decision-making
capabilities, various reinforcement learning algorithms are being implemented. Deep
Q-Learning, a subset of reinforcement learning, has gained popularity in the field
of autonomous vehicles due to its ability to learn optimal behavior from raw sensor
data. In this paper, we explore the use of Deep Q-Learning in developing a virtual
autonomous automobile that can navigate through a virtual environment [1]. We
present the architecture and implementation details of the Deep Q-Learning algorithm
and evaluate its performance in terms of safety and efficiency. Artificial intelligence
for automation refers to the use of machine intelligence to automate processes that
were traditionally done by humans [2]. This involves developing software that can
perform tasks without human intervention, often with a higher degree of accuracy and
speed. AI for automation can be applied in various industries, such as manufacturing,
healthcare, finance, and transportation, among others, to streamline operations and
reduce costs. This technology is advancing rapidly and has the potential to transform
the way we work and live. However, there are also concerns about job displacement
and ethical considerations surrounding the use of AI in decision-making processes
[3].
Reinforcement learning is an emerging technology with great potential for wide application to real-life problems. It is a type of machine learning that focuses on building intelligent models capable of making the most rewarding decisions. To understand the concept of reinforcement learning, consider an example: an owner wants to teach his dog to fetch a ball, so he takes the help of dog treats. He throws the ball and signals the dog to fetch it. If the dog fetches the ball, the owner rewards it with a treat. However, if the dog fails to fetch, the owner withholds the treat and verbally scolds the dog to convey that it should perform the requested action, which serves as a penalty. This simple intuition is used in reinforcement learning to train the agent and make it capable of making rewarding decisions, i.e., taking the actions with the highest return at the end
[4]. Reinforcement learning is a subfield of machine learning that involves an agent
learning to interact with an environment to achieve a goal or objective [5]. The agent
learns through trial and error, receiving feedback in the form of rewards or penalties
based on its actions. The goal is for the agent to learn an optimal policy that will
maximize the cumulative reward over time. RL has been successfully applied to a
wide range of applications, including robotics, game-playing, and recommendation
Fig. 1 Reinforcement learning explanation figure
systems. An example to help understand reinforcement learning is described in Fig. 1. In Fig. 1, a footballer, the agent, can be seen kicking a football. If the football goes into the goal, the agent is rewarded with +10 points, but if it misses, the agent pays a hefty penalty of −100 points. The same principle is used in this research work.
Autonomous driving cars and helicopters are quite common these days. In fact, airplanes have autopilot as well, which shows the capability and potential of artificial intelligence. The existing systems are based on various kinds of artificial intelligence models such as CNNs and DNNs [2], along with various kinds of sensors and detectors. However, the process of creating such AI is complicated; this research achieves it quite simply thanks to the advanced libraries now available in the field.
The rapid advancements in autonomous driving technology have led to the devel-
opment of intelligent systems that can drive a vehicle without human intervention.
To enable such vehicles to operate efficiently and safely, it is necessary to have reli-
able and effective algorithms that can make decisions and control the vehicle [6].
This paper presents an innovative approach that uses Deep Q-Learning to develop
a virtual autonomous automobile that can learn how to navigate different driving
scenarios. The proposed approach leverages the power of deep learning and rein-
forcement learning to train the model to make accurate decisions in real time. The
paper provides a comprehensive evaluation of the model and highlights its effective-
ness in simulating various driving scenarios. The results show that the model can
navigate through challenging environments and make optimal decisions to ensure
the safety of the passengers and the vehicle. The paper contributes to the growing
body of research on autonomous vehicles and provides a promising solution to the
challenges of developing efficient and safe autonomous driving systems.
The motivation behind this research stems from the pressing need to enhance the
capabilities of autonomous vehicles, enabling them to navigate dynamically changing
environments, handle diverse traffic scenarios, and ensure passenger safety. Tradi-
tional rule-based systems and handcrafted algorithms often struggle to cope with
the intricacies and uncertainties present on the road. By employing DQL, we can
leverage the ability of artificial intelligence to learn from experience and optimize
decision-making processes in real time.
This research paper makes several key contributions to the field of autonomous driving:

Investigation of Deep Q-Learning: We delve into the principles and mechanisms of Deep Q-Learning, exploring its potential for training virtual autonomous automobiles. By understanding the underlying algorithms and techniques, we aim to shed light on how DQL can be effectively employed in this context.

Development of a Virtual Autonomous Automobile Environment: We create a virtual environment that simulates real-world driving scenarios, allowing us to test and evaluate the performance of the DQL algorithm. The environment incorporates diverse challenges, such as lane changing, traffic signal recognition, and obstacle avoidance, to comprehensively assess the capabilities of the trained autonomous agent.

Performance Evaluation and Analysis: We extensively evaluate the performance of the DQL-based virtual autonomous automobile, considering metrics such as success rate, average speed, and collision avoidance. Through rigorous analysis and comparison with other approaches, we aim to highlight the strengths and limitations of DQL in this domain.
The introduction provides an overview of the research topic, including the motiva-
tion, main contributions, and the organization of the paper. In Literature Review, we
present a comprehensive review of the existing literature and related work in the field.
We explore the latest advancements, methodologies, and approaches in the context
of Deep Q-Learning for virtual autonomous automobiles. This review serves as the
foundation for our proposed model. Proposed Model: In this section, we detail our
proposed model for utilizing Deep Q-Learning in the virtual autonomous automo-
bile domain. We explain the architecture, algorithms, and techniques employed in our
model. We also discuss the specific challenges and considerations addressed by our
approach. The result section presents the results of our experiments and evaluations.
We provide quantitative and qualitative analyses of the performance of our proposed
model. The conclusion section summarizes the key findings and contributions of the
research. We discuss the implications of our results and provide insights into the
effectiveness and potential of Deep Q-Learning for virtual autonomous automobiles.
We also highlight any limitations or areas for future research and development. The
section after the conclusion is the references which includes a list of all the references
cited throughout the paper.
The objective is to develop a virtual environment and agent, train the neural
network using the Deep Q-Learning algorithm to predict the most rewarding action
for a particular given state and train the agent to drive and handle obstacles of various
shapes.
2 Related Work
Autonomous vehicles are becoming increasingly common on roads, and their devel-
opment is being fueled by advances in machine learning. A number of techniques
have been proposed for training autonomous vehicles, including supervised learning,
unsupervised learning, and reinforcement learning.
One of the most popular techniques for training autonomous vehicles is reinforce-
ment learning, and within this field, Q-learning has emerged as a powerful approach.
Q-learning is a type of reinforcement learning that uses a value function to estimate
the expected reward of taking a certain action in a certain state. Deep Q-Learning,
which combines Q-learning with deep neural networks, has been shown to be partic-
ularly effective for training agents that can learn to navigate complex environments.
Deep Q-Learning has been applied to a range of domains, including robotics and
gaming, and has been shown to achieve state-of-the-art performance in some cases.
In the context of autonomous vehicles, Deep Q-Learning has been used to train agents
to navigate complex environments and avoid obstacles.
One limitation of Deep Q-Learning is that it requires a large amount of training
data, which can be expensive to collect. To address this issue, some researchers have
explored techniques for transferring knowledge from simulation environments to the
real world.
In the context of virtual autonomous automobiles, several studies have investigated
the use of deep reinforcement learning techniques, including Deep Q-Learning, for
training agents to navigate simulated environments. For example, the authors of [7] used
deep reinforcement learning to train agents to follow a designated route and avoid
obstacles in a simulated environment. Other researchers have explored the use of deep
reinforcement learning for training agents to navigate more complex environments,
such as urban streetscapes.
Zhang et al. [8] provide a comprehensive overview of the state-of-the-art tech-
niques in deep reinforcement learning for autonomous driving. It covers topics such
as perception, decision-making, and control and discusses the challenges and future
directions in the field.
Johnson et al. [9] proposed a Deep Q-Learning framework for training virtual
autonomous automobiles in complex driving scenarios. The study demonstrated the
effectiveness of deep Q-networks (DQNs) in learning policies for navigating urban
environments, achieving impressive results in terms of safety and efficiency.
In a study by Chen et al. [10], a modified Deep Q-Learning algorithm was proposed
to address the issue of high-dimensional state and action spaces in autonomous
driving. The authors incorporated a dueling network architecture and prioritized
experience replay, leading to improved convergence and performance of the virtual
autonomous automobile.
Li et al. (2022) introduced a hierarchical Deep Q-Learning approach for virtual
autonomous automobiles. The research focused on learning hierarchical policies
that enable the vehicle to handle various levels of decision-making, such as lane
changing, intersection navigation, and pedestrian interaction. The proposed method
demonstrated enhanced adaptability and robustness in complex driving scenarios.
Research by Wang et al. [11] investigated the application of meta Deep Q-Learning
for virtual autonomous automobiles. The study focused on training an agent that
can quickly adapt to new driving environments by leveraging past experiences.
The results demonstrated the potential of meta Deep Q-Learning in achieving rapid
learning and adaptation in dynamic scenarios.
Despite these advances, there is still much to be done in developing autonomous
vehicles that are safe and reliable in real-world environments. The use of Deep Q-
Learning for training virtual autonomous automobiles is an active area of research,
with the potential to lead to significant advances in the field.
3 Proposed Model
The development of autonomous vehicles has become a rapidly growing area
of research and development, with the potential to revolutionize transportation.
However, one of the major challenges in creating autonomous vehicles is designing
an effective control system that is capable of making intelligent decisions in complex
and dynamic environments. In recent years, machine learning techniques such as deep
reinforcement learning have shown great promise in addressing this challenge. In this
context, this proposed methodology aims to investigate the use of Deep Q-Learning
to train a virtual autonomous automobile to navigate in a simulated environment. By
leveraging the power of deep neural networks to learn a Q-function, the autonomous
vehicle is able to make more informed decisions in real time, ultimately leading to
improved safety and efficiency on the road. The proposed methodology will explore
the feasibility of Deep Q-Learning for virtual autonomous automobile control, as
well as its potential benefits and limitations in the later section.
3.1 Tools and Software Specifications
The software requirements for the research are described in Table 1.
Table 1 Software specification for the research
Software/requirement Use case/name of software
Operating system Windows, Linux, MAC
Programming language Python
Libraries required Numpy, PyTorch, Kivy, Matplotlib, Seaborn, Pandas
RAM Minimum 2 GB is required for graphical representation
IDE VS Code, PyCharm
The research has no hardware requirements for the current version. Only a
computer/laptop is needed.
The research uses some of the latest and most powerful technologies for artificial intelligence. The Python language is used. Python is a high-level, general-purpose programming language. Its design philosophy emphasizes
code readability with the use of significant indentation. Python is dynamically typed
and garbage-collected. It supports multiple programming paradigms, including structured, object-oriented, and functional programming. The latest version of Python is installed along with the VS Code IDE. Various extensions are installed in the IDE for better working with Python. Since this is a development project and not an analysis project, Jupyter Notebook is not required, and the project is developed in VS Code. After this, the required libraries, namely Numpy, PyTorch, Kivy, Matplotlib, Seaborn, and Pandas, are installed.
3.2 Intuition of the Model
Reinforcement learning is one of the most powerful techniques offered by artificial intelligence. It follows the Markov decision process (MDP) [12]. In an MDP, there is an environment, and within that environment an agent is present in state "s." The agent receives an input, chooses the best, most rewarding action, and then moves to a new state s'.
Some of the terminologies are discussed below [4]:
State—State is the current position of the agent out of all the possible positions.
Action—An action is the action taken by the agent in state s to reach new state s’.
Policy—It is the function that maps the input state to the best possible (most rewarding) action. It is denoted by π. Equation (1) shows state s being passed to the policy π, which returns the output action a.

π(s) = a. (1)
Discount factor—To give an example, consider two water spots, A and B, 10 km and 1 km away from a point, respectively. A thirsty person chooses to go to spot A rather than B, which is unwise considering that spot B is nearer. After realizing the effort required to reach spot A, he changes his decision and chooses to go to spot B. This tendency to prefer the less costly option, when even the bigger reward is not worth the extra effort, is captured by the discount factor, represented by γ.
Reward—Rewards are treats given to the agent after it takes the correct or expected action and reaches the desired state.
Penalty—A penalty is negative feedback telling the agent that it chose the wrong action and should not repeat it.
Return—After taking a sequence of actions to reach the desired state, the sum of all the rewards, accounting for the penalties and the discount factor, is called the return. The equation below represents the return formula.

Return = R_1 + γR_2 + γ²R_3 + ··· (up to the terminal state).
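As a numerical illustration of the return formula (with assumed rewards and γ = 0.9, not values from the paper):

```python
# Discounted return: Return = R_1 + γ·R_2 + γ²·R_3 + ... up to the terminal state
def discounted_return(rewards, gamma):
    """Sum rewards, discounting the t-th reward by gamma**t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Hypothetical reward sequence: +10, then 0, then +50, with gamma = 0.9
print(round(discounted_return([10, 0, 50], 0.9), 2))  # 10 + 0.9*0 + 0.81*50 = 50.5
```

A smaller γ shrinks the contribution of distant rewards, which is exactly the "not worth the hard work" intuition above.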
3.3 Model Structure
There are three sections in the project.
a. Python AI file—this file will have the AI for the self-driving automobile
b. Graphical interface of the automobile using Kivy
c. Map.py file which will have the mapping of AI to the agent and representation
in the graphical interface.
3.4 Data for the Model
The research recommends having at least 10,000 training examples to train the neural network. The dataset should be divided into three subsets (training set, cross-validation set, and testing set) in the ratio 60–20–20%. This allows the model to be better at predicting the best possible action and to perform better on real-world data. Following this ratio also helps to avoid overfitting and underfitting; even if they are present, the cross-validation set helps to detect them early so that regularization can be applied, even though it is used by default.
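The 60–20–20% split described above can be sketched as follows (the placeholder data here stands in for the transitions collected during the agent's random exploration):

```python
import numpy as np

def split_60_20_20(data, seed=0):
    """Shuffle and split a dataset into train / cross-validation / test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    n_train = int(0.6 * len(data))
    n_cv = int(0.2 * len(data))
    train = [data[i] for i in idx[:n_train]]
    cv = [data[i] for i in idx[n_train:n_train + n_cv]]
    test = [data[i] for i in idx[n_train + n_cv:]]
    return train, cv, test

# 10,000 placeholder examples, matching the recommended dataset size
train, cv, test = split_60_20_20(list(range(10_000)))
print(len(train), len(cv), len(test))  # 6000 2000 2000
```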
The input data is the state-action pair

x = (s, a).

Overall, this would improve the model's accuracy and efficiency. The output y for the training data is also important, as the predicted output is compared with the actual output to find the cost (error), which is later corrected using gradient descent. The output y for training the model is derived from the Bellman equation [7], given in the next subsection.
3.5 State-Action Value Function
The state-action value function calculates the overall return from the state and action.
It is based on the Bellman equation and provides the overall return. The below
equation represents the state-action value function.
Q(s, a) = R(s) + γ max_a' Q(s', a').
In the above equation, Q(s, a) is the state-action value function, which takes the current state and action as input; R(s) is the reward achieved after taking action a in state s; s' is the new state reached; and a' is the action taken in the new state. This function is useful for training the model and for choosing the best set of actions. The learning method that follows the state-action value function is called Q-learning. In the next section, Deep Q-Learning is discussed [4].
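The Q-learning update implied by the state-action value function can be sketched in tabular form. This is a toy illustration with assumed dimensions, learning rate, and reward, separate from the paper's neural-network version:

```python
import numpy as np

n_states, n_actions = 5, 2       # assumed toy dimensions
gamma, alpha = 0.9, 0.1          # discount factor and learning rate (assumptions)
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next):
    """Move Q(s, a) one step toward the Bellman target R(s) + γ·max_a' Q(s', a')."""
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

q_update(0, 1, 10.0, 1)          # hypothetical transition with reward +10
print(Q[0, 1])                   # 1.0: one step of size alpha toward target 10
```

Repeating such updates over many transitions makes Q converge toward the state-action value function.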
3.6 Mathematical Deep Q-Learning Implementation
The algorithm for Deep Q-Learning for the model is proposed below:

A. Initialize a neural network randomly as Q = Q(s, a)
B. Repeat {
a. Generate transitions and store them in the replay buffer as training data
b. Train the model on the stored data to obtain Q_new
c. Set Q = Q_new
}

The first step initializes Q; the repeat loop then fills the replay buffer by taking random actions in different states and observing the outcomes, which are stored. After that, the model is trained using the collected data. Once the model is regularized, Q is updated with Q_new to improve the model.
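Steps A–C above can be sketched in PyTorch, one of the libraries the research lists. The state size, action count, network width, and hyperparameters below are illustrative assumptions, not the authors' exact implementation:

```python
# Minimal DQN sketch: random Q-network, replay buffer, Bellman-target training.
# STATE_DIM, N_ACTIONS, and the network size are assumed for illustration.
import random
from collections import deque
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA = 5, 3, 0.9

# Step A: randomly initialized network approximating Q(s, a)
q_net = nn.Sequential(nn.Linear(STATE_DIM, 32), nn.ReLU(),
                      nn.Linear(32, N_ACTIONS))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)  # replay buffer capacity as stated in the paper

def store(s, a, r, s_next):
    replay.append((s, a, r, s_next))

def train_step(batch_size=32):
    # Step B: sample stored transitions and fit Q toward the Bellman target
    batch = random.sample(replay, min(batch_size, len(replay)))
    s, a, r, s2 = map(torch.tensor, zip(*batch))
    s, s2, r = s.float(), s2.float(), r.float()
    q_pred = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # target = R + γ · max_a' Q(s', a')
        target = r + GAMMA * q_net(s2).max(dim=1).values
    loss = nn.functional.mse_loss(q_pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # Step C: the updated network plays the role of Q_new
    return loss.item()

# Hypothetical random transitions standing in for the agent's exploration
for _ in range(100):
    store(torch.randn(STATE_DIM).tolist(), random.randrange(N_ACTIONS),
          random.random(), torch.randn(STATE_DIM).tolist())
print(train_step())
```

In a full implementation, a separate target network is often kept frozen between updates for stability; the single-network version above mirrors the simpler loop stated in the paper.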
4 Result and Discussion
The result achieved by the research is the complete model along with the trained neural network, which takes the state as input and predicts the most rewarding action as output. Figure 2 represents the virtual environment and the agent. The agent moves randomly and collects data for its training later.
After training the model, testing is performed by adding obstacles in the path of the agent. Figure 3 represents the agent's performance after adding a virtual road.
The agent performed excellently after a complex road was added, as shown in Fig. 4. The barriers are added after training the agent further, to see how it would perform on a more complex road.
Fig. 2 Environment and agent
Fig. 3 Addition of a road in environment
Fig. 4 Addition of complex road in the environment
Fig. 5 Addition of complex irregular shapes in the environment
In Fig. 5, complex irregular shapes are added to the environment, and the model is tested with these various shapes and figures to observe the agent's performance.
The DQN algorithm shows impressive convergence speed, with the agent learning
to navigate a complex virtual environment within a relatively small number of training
iterations. After 10,000 training iterations, the algorithm achieves an average reward
per episode of 30, indicating rapid learning and adaptation. The exploration rate starts
at 1.0 and decays. By the end of training, the agent achieves an average reward of 50
per episode, showcasing its ability to navigate the environment and accomplish tasks
effectively. By utilizing a large memory buffer with a capacity of 10,000 experiences,
the agent can store diverse experiences and effectively learn from past interactions.
This allows the agent to generalize its knowledge and make informed decisions based
on a broader range of experiences.
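The decaying exploration rate described above is commonly implemented as ε-greedy action selection with exponential decay. The sketch below uses assumed decay constants, since the paper only states that the rate starts at 1.0 and decays:

```python
import random

def epsilon_greedy_action(q_values, epsilon, n_actions):
    """Explore with probability epsilon, otherwise exploit the best-valued action."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: q_values[a])

# Exponential decay from 1.0 toward a small floor (rates are assumptions)
EPS_START, EPS_MIN, DECAY = 1.0, 0.05, 0.999
epsilon = EPS_START
for step in range(10_000):
    epsilon = max(EPS_MIN, epsilon * DECAY)
print(round(epsilon, 3))  # reaches the 0.05 floor well before 10,000 steps
```

Early in training the agent acts almost entirely at random (matching the random exploration phase described in the paper); as ε shrinks, it increasingly exploits what the Q-network has learned.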
The model performed excellently for all kinds of shape and figures. The agent was
successfully trained using the DQN algorithm as can be seen in the above figures.
Despite the agent's performance, there are several possible improvements to the model. The first is that if the track has a barrier angle of less than 45 degrees, the agent keeps rotating in place and does not get out of it. The second limitation of the research is that the agent ignores small obstacles, which can be seen in its live operation. Overall, the agent is capable of driving on its own and avoiding major obstacles.
5 Conclusion
The paper presents a novel approach for virtual autonomous vehicle control through
Deep Q-Learning that can adapt to various situations in real time. Through extensive
simulation experiments, the proposed DQN-based control system has demonstrated
its ability to perform as well as, or better than, human drivers in various driving
214 P. Pant et al.
scenarios. It has also shown its effectiveness in handling critical driving situations
such as obstacles, lane change, and overtaking maneuvers.
The achieved model also has some limitations: it is not bug-free, and when road
lanes become complex, the agent does not perform as well as it did before. It rams
through the lanes or simply starts rotating in one place.
In conclusion, this research paper explored the application of Deep Q-Learning
(DQL) in the context of virtual autonomous automobiles. By leveraging reinforce-
ment learning and deep neural networks, DQL demonstrated its potential to enhance
the decision-making capabilities of autonomous agents in complex driving scenarios.
Through the development of a virtual environment and rigorous performance eval-
uation, we observed promising results in terms of success rate, average speed, and
collision avoidance. However, further research and optimization are necessary to
address the limitations and challenges associated with DQL. This study contributes
to the advancement of autonomous driving technologies, showcasing the value of
DQL in improving the safety and efficiency of virtual autonomous automobiles.
Future work should focus on refining the DQL algorithm, exploring additional opti-
mization techniques, and incorporating real-world data to bridge the gap between
simulation and practical implementation. The findings of this research pave the way
for further exploration and development of intelligent autonomous driving systems.
The proposed model is implemented using neural networks, which take the input
vector x for the state “S” and predict the next action “a” that will be the most prof-
itable according to the state-action value function. The car, acting as the agent,
continuously moves and acts randomly in the virtual world. These experiences are
stored and used to train the neural network with a 60–20–20% dataset split. The
results indicate that DQN-based control systems have the potential to significantly
improve the safety, efficiency, and reliability of autonomous vehicles. The paper
concludes that further research and testing are necessary to address the remaining
challenges of real-world deployment, but that DQN-based systems offer a promising
approach for the future of autonomous driving. The future directions of the research
are to improve the existing model both graphically and on the artificial intelligence
side. Another future scope is to implement the research physically using hardware.
Improving Digital Marketing Using
Sentiment Analysis with Deep LSTM
Masri bin Abdul Lasi, Abu Bakar bin Abdul Hamid,
Amer Hamzah bin Jantan, S. B. Goyal, and Nurun Najah binti Tarmidzi
Abstract As digital channels continue to grow, digital marketing has become
a crucial area for businesses. Customers share their experiences with products on
social media and e-commerce platforms, providing businesses with valuable feed-
back. Sentiment analysis techniques are used to analyze customer feedback and
improve business decisions. Deep learning techniques, such as Long Short-Term
Memory (LSTM), have the potential to extract knowledge from large volumes of data
with greater accuracy than manual approaches. In this study, we propose using Deep
LSTM to enhance the accuracy of sentiment analysis. Our simulation results show
that the proposed model improves upon conventional schemes in terms of accuracy,
precision, recall, and F-measure. The proposed model achieved an accuracy rate of
over 90%, which is significantly higher than the accuracy rate achieved by other senti-
ment analysis models. Additionally, the proposed model outperformed other state-of-
the-art sentiment analysis techniques in our empirical evaluation using a large dataset.
Furthermore, we tested the proposed model in a real-world scenario, where it was
used to analyze customer sentiment toward a newly launched product. The proposed
model accurately identified positive and negative sentiments expressed by customers
toward the product. The marketing team used this information to make informed deci-
sions regarding product improvements and marketing strategies, demonstrating the
practical applications of the proposed model. Our study highlights the effectiveness
M. A. Lasi ·A. H. Jantan ·S. B. Goyal (B)·N. N. Tarmidzi
City University, Petaling Jaya, Malaysia
e-mail: drsbgoyal@gmail.com
M. A. Lasi
e-mail: masri.abdullasi@city.edu.my
A. H. Jantan
e-mail: amer.hamzah@city.edu.my
N. N. Tarmidzi
e-mail: nurun.najah@city.edu.my
A. B. A. Hamid ·A. H. Jantan
Putra Business School, University Putra Malaysia, Serdang, Malaysia
e-mail: abu.bakar@putrabs.edu.my
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_17
218 M. A. Lasi et al.
of deep learning techniques, specifically deep LSTM, in improving the accuracy
and reliability of sentiment analysis. Our findings have important implications for
businesses seeking to leverage customer feedback to improve their products and
services.
Keywords Digital marketing ·Machine learning ·Sentimental analysis ·Deep
LSTM ·TF-IDF
1 Introduction
Sentiment analysis (SA) is a technique used to analyze the opinions, emotions, and
attitudes of humans, which can be utilized for the growth of organizations, especially
in the field of business. With the growth of digital marketing platforms, the amount of
opinion data in digital form is increasing [1]. For example, customers who experience
issues such as poor quality, differences between the promised and actual products,
and late delivery share their experiences on social media and e-commerce platforms.
SA helps to determine whether the expressed textual content by customers on these
platforms is positive, negative, or neutral [2]. SA is used in many applications such
as digital marketing, social media monitoring, and product review analysis [3]. Most
users search for reviews before using a service, so the marketing of products depends
on these reviews [4]. However, the large number of reviews left by customers cannot
all be read by a human to determine the overall opinion of a product or service.
Sentiment analysis (SA) can be approached through two main methods: lexicon-
based and machine learning-based. In a lexicon-based approach, sentiment lexicons
are constructed based on sentiment-related words, adverbs, and negative words that
reflect human sentiments. The sentiment polarity of input texts is determined by
matching the input text with sentiment words in the lexicon. The matched sentiment
words are then weighted and summed to obtain the sentiment value of the input [5].
Machine learning methods, such as Naive Bayes, support vector machine (SVM),
maximum entropy, and random forest, have also been proposed to automatically handle
sentiment analysis [6]. However, these approaches require human intervention to
classify the sentiments from the texts.
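As a concrete illustration of the lexicon-based matching, weighting, and summation described above, the sketch below scores a text against a tiny sentiment lexicon. The lexicon entries, their weights, and the simple negation handling are invented for illustration; real systems use large curated lexicons.

```python
# Minimal lexicon-based scorer: match tokens against a sentiment lexicon,
# flip polarity after a negation word, and sum the matched weights.
# The lexicon and weights below are illustrative, not from any cited work.
LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.0, "terrible": -2.0}
NEGATIONS = {"not", "never", "no"}

def lexicon_sentiment(text):
    score, negate = 0.0, False
    for token in text.lower().split():
        if token in NEGATIONS:
            negate = True            # next sentiment word is flipped
        elif token in LEXICON:
            score += -LEXICON[token] if negate else LEXICON[token]
            negate = False
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"
```

For example, `lexicon_sentiment("not good at all")` flips the weight of "good" because of the preceding negation and therefore returns "negative".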
To create an automatic SA approach, deep learning-based methods have been
proposed in sentiment analysis due to their automated functioning capability [7].
Deep learning extracts features and learns from errors without requiring human
intervention [8]. Some of the deep learning approaches used in sentiment analysis
are Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and
Gated Recurrent Units (GRU) [9].
Reference [10] proposed a deep learning approach where Convolutional Neural
Networks (CNN) and Long Short-Term Memory (LSTM) are combined to perform
sentiment analysis. The method developed a deep learning approach that is superior
to traditional machine learning models. However, the proposed technique failed to
train the DL model with many observations. Deep learning models must be trained
Improving Digital Marketing Using Sentiment Analysis with Deep LSTM 219
with a huge amount of data, or the architecture should be developed such that it can
use fewer datasets.
Reference [4] presented an SVM-based sentiment classification of Twitter data.
The approach depends on certain analytical measures to cluster the data with the k-
means clustering approach. The research does not support a fully automated approach
to sentiment classification [11]. Reference [11] proposed sentiment analysis using
logistic regression, Naive Bayes, and linear support vector classifier approach for a
food review. The approach has handled a large amount of information with MLlib
instead of using deep networks to handle a huge volume of data. Reference [12]
proposed sentiment analysis for Amazon product reviews with a logistic regression
model.
The logistic regression has higher chances of overfitting if the extracted features
are not optimal. Reference [13] presented sentiment analysis of product reviews using
K-nearest neighbors. Machine learning models like KNN cannot work with large
and high-dimensional data. Therefore, to overcome these challenges, the proposed
approach employs deep LSTM to perform sentiment analysis. The approach also
focuses on enhancing the performance accuracy of the proposed deep learning-based
LSTM models compared to conventional approaches. The present study addresses
the following research questions:
What are the challenges faced by machine learning models?
How can the accuracy of deep learning models be improved?
The proposed research addresses the research questions by designing a deep Long
Short-Term Memory (LSTM) model that can enhance the accuracy of sentiment
analysis (SA) and process a large volume of information. The originality of this
approach lies in its ability to provide enhanced accuracy compared to conventional
deep learning models and as an alternative to traditional shallow networks.
The objective of the proposed approach is to improve the classification accu-
racy with a deep learning-based sentiment classification to aid in digital marketing
(DM) strategies. The proposed sentiment classification approach employs the term
frequency-inverse document frequency (TF-IDF) approach and the deep LSTM
model. The TF-IDF approach extracts the features effectively, and the deep LSTM
model with three LSTM layers is trained with the extracted features. The fully
connected layer, along with the SoftMax, classifies the polarity of the sentiment.
The major contributions of the approach are:
The use of the TF-IDF feature extraction approach to extract features from user
reviews.
The design of a deep LSTM model with three LSTM layers to learn features
effectively.
The simulation of the proposed machine learning-based sentiment classification
approach in terms of accuracy, precision, recall, and F-measure.
The rest of the paper is organized as follows. Section 2presents related work,
Sect. 3explains the proposed method, Sect. 4illustrates the experimental results,
and Sect. 5concludes the paper.
2 Literature Review
Deep learning is a subset of machine learning that learns to represent the world as a
nested hierarchy of concepts, with each concept defined in relation to simpler ones
and more abstract representations computed in terms of less abstract ones. In deep
learning, low-level categories like letters are defined first, then slightly higher-level
categories like words, and then higher-level categories like sentences. These cate-
gories are learned incrementally through the hidden layer architecture. For example,
in image recognition, light and dark areas are classified first, then lines and shapes,
and finally faces, providing a complete representation of the image through
the network’s neurons and nodes, each representing a different component of the
whole. As the model matures, weights are adjusted for each node or hidden layer to
reflect the strength of their connections to the output. This learning process allows
deep learning to achieve great power and flexibility.
Various conventional approaches to sentiment analysis are discussed, including
Aytug Onan’s approach, which proposed a sentiment analysis approach based on
weighted word embedding and deep neural networks [5]. The sentiment analysis
is carried out on product reviews obtained from Twitter. In this research, TF-IDF
weighted GloVe embeddings are combined with Convolutional Neural Networks (CNN) and
Long Short-Term Memory (LSTM). The approach attained better classification accu-
racy than conventional deep learning approaches. Reference [14] proposed a capsule
network based on Bi-LSTM for sentiment analysis, called caps-Bi-LSTM, where
the capsule module calculates the state probability. The approach obtained better
accuracy than conventional machine learning and deep learning models.
Reference [15] proposed a deep model named “multi-view deep network for senti-
ment analysis”, where heterogeneous deep neural networks are used in the feature
extraction of input documents, and classification is handled with the multi-view
classifier. Convolutional and recursive neural networks are used to obtain various
representations of the input texts. The deep neural networks extract feature sets, and
multi-view classifiers train features jointly to decide the sentiment polarity. Refer-
ence [14] presented a text sentiment classification with variable convolution and
pooling CNN. Multiple convolutions and pooling are designed for text sentiment
classification. The proposed approach produced better results with the proposed
feature extraction. Ramshankar and Joe Prathap [16] presented a sentiment classi-
fication approach with black hole-based gray wolf optimization (BH-GWO), where
the feature extraction is handled with a joint similarity score and optimized with BH-
GWO weights. BH-GWO is created through the fusion of black hole optimization and gray wolf
optimization (GWO).
The sentiment classification for recommendation systems is evaluated with e-
commerce datasets. Lin et al. [17] proposed sentiment analysis with a comparison-
enhanced deep neural network (CEDNN). Bidirectional LSTM carries out the initial
feature extraction, and MHA carries out the valuable feature extraction. The hybrid
approach combines MHA to obtain global information and Bi-LSTM to obtain
sequence information. The learning ability is enhanced by CE-B-MHA. The proposed
approach attained a better F1 score, thereby improving the performance of sentiment
analysis. Xu et al. [18] proposed a product review sentiment classification based on
the Naive Bayes continuous learning model. The traditional Naive Bayes method is
enhanced to weigh general classification on old domains and to improve distribution
learning for domain-specific knowledge. The simulation results prove the impact
of continuous learning for domain-specific and cross-domain sentiment learning.
Yi and Liu [19] presented an ML-based sentiment analysis for the recommenda-
tion system. Multi-class support vector machine (MSVM) is used for classifying the
sentiments and different opinions on Twitter. Features are identified with principal
component analysis (PCA). PCA is also used to reduce the dimensionality and extract
the features.
The proposed MSVM achieved better performance than conventional ML strate-
gies. Chintalapudi et al. [20] presented sentiment analysis of COVID-19 with deep
learning (DL) approaches. A DL model named “bidirectional encoder representa-
tions from transformers” (BERT) was used to conduct the sentiment analysis on
tweets. The tweets that contain sentiments such as sadness, joy, fear, and anger
during the COVID-19 period are analyzed. The performance of the BERT model is
compared with that of conventional ML approaches, and it is found that the DL-based
model achieved enhanced performance over ML models. Vijayaragavan et al. [21]
developed an optimal SVM-based classification for sentiment analysis of product
reviews. The product reviews are classified by SVM, and the k-means clustering
approach is employed to obtain two groups from the clustered output.
Feature extraction is carried out as part of the sentiment analysis, and finally,
fuzzy-based soft set theory is employed to decide whether the customer will purchase
the product or not. Rehman et al. [22] presented a CNN-LSTM model to improve the
accuracy of the sentiment analysis. The convolution and max pooling of the CNN
model are used to extract higher-level features, and LSTM obtains the long-term
dependencies between the word sequences. The hybrid CNN-LSTM model exhibited
better performance than machine learning and other deep learning models in terms of
accuracy and precision. Basiri et al. [23] presented an attention-based bidirectional
CNN-RNN model for the analysis of sentiment. Using the temporal information
flow, this strategy extracts past and future contexts. The dimensionality is reduced
with the help of convolution and pooling approaches. The strategy outperformed
conventional methods for both short and long tweets. Sankar et al. [24] proposed a
deep learning-based sentiment analysis approach with CNN.
The reviews collected from services like Netflix and Amazon were classified by
CNN. The approach used different word embedding techniques. The deep CNN
trained with pre-trained word vectors exhibited better classification results in a
mobile application. Phan et al. [25] presented an ensemble model to identify the
sentiment of tweets. The ensemble model was designed with five features extracted
from the lexical, semantic, sentiment polarity, and position of words in tweets. The
proposed sentiment analysis combines a feature ensemble model, the divide and
conquer method, and the DL algorithm. The features consist of fuzzy sentiment
phrases. The input layer of the CNN model uses feature vectors.
Nafis et al. [26] proposed sentiment analysis using LSTM and CNN. After prepro-
cessing the IMDB dataset with tokenization, stop words, and URL removal, CNN
and LSTM are used to classify the sentiments. The Word2vec tool converts the tweets
into vectors with different dimensions. The validation results with LSTM obtained
an accuracy of 88.02%, and CNN attained an accuracy of 87.72%.
Neogi et al. [27] presented a sentiment analysis of farmer protests with Twitter
data. The features are extracted with a bag of words and the TF-IDF approach after
the preprocessing of the tweets. The proposed sentiment analysis employed several
classifiers, such as Decision Tree (DT), Naive Bayes (NB), Random Forest (RF),
and Support Vector Machine (SVM). The RF model achieved the highest accuracy
in analyzing the sentiments.
Bhakuni et al. [28] proposed sarcasm analysis using a sentiment analysis model.
In this approach, data is cleaned with tokenization, stemming, and noise removal. The
features are extracted with the TF-IDF approach. The approach employed multiple
machine learning models such as DT, SVM, NB, and KNN. The proposed approach is
simulated in terms of accuracy, precision, recall, and F-measure. The SVM classifier
attained the highest accuracy of 93%, followed by NB and DT, which achieved
accuracies of 83% and 86%, respectively. The KNN attained the lowest accuracy
of 51%.
Ruz et al. [29] proposed a sentiment analysis of Twitter data with a Bayesian
network classifier. The approach used a bag of words for feature representation. The
classification of the two datasets—the Chilean earthquake and the Catalan indepen-
dence referendum—was performed with a Bayesian network classifier. The prepro-
cessing is carried out by removing URL, stop words, symbols, numbers, and repeated
characters. The proposed approach achieved better performance than conventional
approaches.
Yang et al. [30] presented a sentiment analysis approach that combines the senti-
ment lexicon with machine learning models. CNN and bidirectional gated recurrent
units are the machine learning models used. The machine learning models extract
the features from the review. The features are weighted with an attention mechanism.
The proposed approach attained an accuracy of 93.5%.
Behera et al. [31] proposed a hybrid model in which the CNN and
LSTM models are combined for the sentiment classification of the reviews. The
reviews from different domains are used as input. The deep CNN model is used for
local feature selection, and LSTM is employed for the sequential analysis of the
texts with length. The objectives of the fused deep learning model are scalability
and domain independence. The evaluation of the proposed approach is carried out
with four datasets. The experimental outcome of the research proves that it attained
better accuracy and outperformed several conventional schemes. Minaee et al. [32]
proposed sentiment analysis with an ensemble of CNN and bi-LSTM models. In this
approach, an ensemble of LSTM and CNN models is employed. The LSTM deals
with the temporal information of the data, and CNN extracts the local structure. The
ensemble model proved to exhibit better results than the individual CNN and LSTM
models. Li et al. [33] presented a lexicon-based CNN-LSTM model for analyzing
the sentiments from user reviews.
The CNN and Bi-LSTM models are connected in a parallel manner. A domain-
specific lexicon that creates quality texture features is fed as input to the models.
The LSTM handles the sequential information, and CNN focuses on the extraction
of features. The study’s advantage is that complex datasets, such as the Stanford
Sentiment Treebank, are used, resulting in better performance than conventional
schemes.
3 Proposed Methodology
The proposed method consists of the following stages: preprocessing, feature extrac-
tion, and classification. The block diagram of the proposed approach is shown in
Fig. 1. Initially, the raw tweets collected from various datasets are preprocessed to
remove outliers and unwanted elements. The features are then extracted with the
TF-IDF approach. Following feature extraction, a deep LSTM
model is used to classify the sentiment of the tweets.
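The classification stage stacks LSTM layers whose gate recurrence is detailed in Sect. 3.3. As a structural sketch only, the following NumPy code runs a sequence of feature vectors through three stacked LSTM layers, a fully connected layer, and a softmax, mirroring the proposed architecture. The weights are random (untrained) and all layer sizes are illustrative assumptions; this shows the forward structure, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_layer(z_seq, hidden):
    """Run one LSTM layer over a sequence using the gate recurrence of
    Sect. 3.3. Weights are random here, purely for illustration."""
    d = z_seq.shape[1]
    W = {k: rng.normal(0, 0.1, (hidden, d)) for k in ("i", "f", "o", "c")}
    U = {k: rng.normal(0, 0.1, (hidden, hidden)) for k in ("i", "f", "o", "c")}
    h, c = np.zeros(hidden), np.zeros(hidden)
    outputs = []
    for z in z_seq:
        i = sigmoid(W["i"] @ z + U["i"] @ h)        # input gate
        f = sigmoid(W["f"] @ z + U["f"] @ h)        # forget gate
        o = sigmoid(W["o"] @ z + U["o"] @ h)        # output gate
        c_tilde = np.tanh(W["c"] @ z + U["c"] @ h)  # input modulation
        c = f * c + i * c_tilde                     # new cell state
        h = o * np.tanh(c)                          # new hidden state
        outputs.append(h)
    return np.array(outputs)

def deep_lstm_classify(x_seq, n_classes=3):
    h = x_seq
    for _ in range(3):                  # three stacked LSTM layers, as proposed
        h = lstm_layer(h, hidden=8)
    logits = rng.normal(0, 0.1, (n_classes, 8)) @ h[-1]  # fully connected layer
    exp = np.exp(logits - logits.max())                  # softmax layer
    return exp / exp.sum()

probs = deep_lstm_classify(rng.normal(size=(5, 4)))  # 5 timesteps, 4 features
```

The output is a probability distribution over the sentiment classes (e.g. positive, negative, neutral); a practical implementation would instead use a deep-learning framework and train the weights on labeled reviews.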
3.1 Preprocessing
The preprocessing includes stop word removal, stemming, and blank space removal.
The steps are explained as follows:
Stop Word Removal: The data collected from the web contains certain words that
are of no use in the sentiment analysis. Such words, which frequently arise and
consist of little information, need to be removed. Some examples of stop words are
an, and, as, etc. The removal of stop words increases processing speed and enhances
accuracy. Stop word removal approaches such as the classic method and the mutual
information approach were employed. In the first, normal conjunction features such
as “and”, “but”, and “or” are removed. Similarly, special symbols and numbers are
also removed. In addition, reviews that don’t contain any data related to sentiment
value, like URLs or HTML, are also removed. In the mutual information approach,
the mutual information between the term and the class of the contents is computed.
If the resulting mutual information is low, the words are removed (Kaur and Buttar [34]).
Stemming: Stemming is a rule-based approach. In this step, suffixes and prefixes are
removed to reduce the number of features in the feature space and improve the perfor-
mance of ML algorithms. An open-source natural language processing (NLP)
toolkit named Zemberek is used in our approach for stemming purposes (Savaş and
Topaloğlu [35]).
Fig. 1 Block diagram of the proposed model: input → preprocessing → feature extraction with TF-IDF → LSTM 1 → LSTM 2 → LSTM 3 → fully connected layer → softmax layer → classification results
Blank Space Removal: Extra blank spaces needlessly increase the size of the text.
Therefore, extra white spaces and tab spaces are identified and replaced with a single
white space.
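The three preprocessing steps can be sketched as follows. The stop-word subset and the suffix-stripping rules are toy stand-ins: the classic method described above removes conjunctions, symbols, numbers, URLs, and HTML, and the paper’s stemming relies on the Zemberek toolkit rather than these simple rules.

```python
import re

STOP_WORDS = {"an", "and", "as", "the", "but", "or"}   # illustrative subset
SUFFIXES = ("ing", "ed", "s")   # toy rule-based stemmer, standing in for Zemberek

def preprocess(text):
    text = re.sub(r"https?://\S+|<[^>]+>", " ", text)  # drop URLs and HTML
    text = re.sub(r"[^a-zA-Z\s]", " ", text.lower())   # drop symbols and numbers
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # stop word removal
    stemmed = []
    for t in tokens:
        for suf in SUFFIXES:                           # strip one known suffix
            if t.endswith(suf) and len(t) > len(suf) + 2:
                t = t[: -len(suf)]
                break
        stemmed.append(t)
    return " ".join(stemmed)                           # single spaces only
```

For instance, `preprocess("The shipping was delayed!! see http://x.co")` yields `"shipp was delay see"`: the URL, punctuation, and stop word are removed, and the crude suffix rules reduce "shipping" and "delayed" to stems.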
3.2 Feature Extraction
The feature extraction is carried out with the TF-IDF method. The term frequency
(TF) counts the occurrences of a term: if a document consists of about 2000 words
and the term “best” is found five times, its raw count is 5. Because raw counts are
naturally higher in large documents, the count is divided by the total number of terms
in the document. The IDF signifies the ability of a feature to distinguish between
the categories; it can also be seen as the score of the feature in the process of
feature selection. In IDF, terms like “and”, which are less significant, are handled:
IDF assigns a lower weight to words that appear in many documents, such as
“and”, and a higher weight to rare, discriminative words. The TF-IDF is given as
TF-IDF = TF × IDF (1)

TF is given as

TF = (frequency of a feature in a text document) / (total features in the document) (2)

IDF is given as

IDF = log(total number of documents / number of documents containing the feature) (3)

A TF-IDF document matrix is generated after calculating the TF-IDF. A higher
TF-IDF score indicates a more important feature [6].
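A direct implementation of Eqs. (1)–(3) can be sketched as follows. The three example documents are invented, and practical pipelines usually add smoothing to the IDF term to avoid division by zero for unseen terms.

```python
import math
from collections import Counter

def tf_idf(docs):
    """TF-IDF per Eqs. (1)-(3): TF = term count / document length,
    IDF = log(total documents / documents containing the term)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc.split()))
    matrix = []
    for doc in docs:
        terms = doc.split()
        counts = Counter(terms)
        matrix.append({t: (c / len(terms)) * math.log(n_docs / df[t])
                       for t, c in counts.items()})
    return matrix

scores = tf_idf(["best phone ever", "worst phone ever", "best price"])
```

In this toy corpus, “worst” appears in only one of the three documents, so it receives a higher weight than “phone”, which appears in two; terms appearing in every document would score zero, reflecting their lack of discriminative power.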
3.3 Deep LSTM-Based Sentiment Classification
The deep learning architecture consists of many layers with a nonlinear information
processing unit that can extract and transfer features for classification and analysis
of patterns with the help of data. The proposed prediction model employs the deep LSTM model. The LSTM was developed by Hochreiter and Schmidhuber in 1997. It consists of an input gate, a forget gate, and an output gate.
The gates decide which information is to be transferred through the cell state and
which shouldn’t be. The gates are constructed with sigmoid layers and multiplication
operations.
The sigmoid layer partitions the output. The input gate controls the impact of the current data on the memory unit; it determines which components of the incoming vector are added to the cell state. The forget gate reduces the effect of past output on the memory unit: it decides which information is kept and which is discarded. When the value of the forget gate is 0, the information is removed, and when the value is 1, the information is preserved. The input gate decides which values must be stored in the cell state; a sigmoid gate layer selects the values to update, followed by a tanh layer, which creates candidate values to be added to the cell state. The output gate controls the memory unit's output value Thakkar and Chaudhari [36]. In the output layer, the product of the sigmoid layer's output and the cell state passed through a tanh layer decides which values are emitted. The input gate, forget gate, output gate, input modulation, and hidden state are represented as follows Tan et al. [37].
i_t = σ(W_zi z_t + W_hi h_{t−1}) (4)

f_t = σ(W_zf z_t + W_hf h_{t−1}) (5)

o_t = σ(W_zo z_t + W_ho h_{t−1}) (6)

c̃_t = φ(W_zc z_t + W_hc h_{t−1}) (7)

c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t (8)

h_t = o_t ⊙ φ(c_t) (9)

Here ⊙ denotes element-wise multiplication.
where z_t represents the t-th observation of all variables and h_t is the hidden state; φ is the hyperbolic tangent and σ is the sigmoid non-linearity. The hidden state h_t is obtained from the tanh activation and the memory cell. With its increased number of layers, the deep LSTM works better than shallow networks; the deep architecture handles complex data well and can learn better than other networks. The proposed deep LSTM consists of three LSTM layers, a fully connected layer, and a SoftMax layer.
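One step of the gate equations above can be sketched in NumPy. This is a minimal illustration: biases are omitted (as in the equations), the weight dictionary layout is our own, and the random weights stand in for trained parameters:

```python
import numpy as np

def lstm_step(z_t, h_prev, c_prev, W):
    """One LSTM step following the input/forget/output gate and cell-state
    update equations. W maps gate names to weight matrices, e.g. W['zi']
    maps the input vector to the input gate pre-activation."""
    sigma = lambda x: 1.0 / (1.0 + np.exp(-x))           # sigmoid
    i = sigma(W["zi"] @ z_t + W["hi"] @ h_prev)          # input gate
    f = sigma(W["zf"] @ z_t + W["hf"] @ h_prev)          # forget gate
    o = sigma(W["zo"] @ z_t + W["ho"] @ h_prev)          # output gate
    c_tilde = np.tanh(W["zc"] @ z_t + W["hc"] @ h_prev)  # candidate state
    c = f * c_prev + i * c_tilde                         # new cell state
    h = o * np.tanh(c)                                   # new hidden state
    return h, c

rng = np.random.default_rng(0)
d_in, d_hid = 4, 3
# 'z*' matrices act on the input, 'h*' matrices act on the hidden state
W = {k: rng.standard_normal((d_hid, d_in if k.startswith("z") else d_hid))
     for k in ["zi", "hi", "zf", "hf", "zo", "ho", "zc", "hc"]}
h, c = lstm_step(rng.standard_normal(d_in), np.zeros(d_hid), np.zeros(d_hid), W)
```

Because the output gate is a sigmoid and the cell state passes through tanh, every component of the hidden state is bounded in magnitude by 1.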
The features are initially fed to the first LSTM block. The second LSTM block computes with the help of its own previous hidden state h^2_{t−1} and the first block's current hidden state h^1_t (see Fig. 2); its output is passed to the upper LSTM, and so on until the last block Wang and Liu [38]. In a deep LSTM, the output of each layer is passed to the next until the last layer generates the output Sagheer and Kotb [39]. The same applies to the hidden state at each level, allowing the layers to function at different time scales Ameur et al. [40]. The SoftMax layer, along with the fully connected layer, provides the classification results Shahid et al. [41].
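The stacked computation described above (each layer's hidden-state sequence feeding the next, with a fully connected plus softmax head on the last hidden state) can be sketched in NumPy. The hidden-unit sizes follow Table 1's layer-wise values, but all weights here are random placeholders, not the trained model:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def deep_lstm_forward(seq, layers, W_fc):
    """Forward pass through stacked LSTM layers. Each entry of `layers`
    is (W_z, W_h) with the four gate weight matrices stacked row-wise."""
    for W_z, W_h in layers:
        d = W_h.shape[1]                         # hidden size of this layer
        h, c, outs = np.zeros(d), np.zeros(d), []
        for z in seq:
            g = W_z @ z + W_h @ h                # all gate pre-activations
            i = 1 / (1 + np.exp(-g[:d]))         # input gate
            f = 1 / (1 + np.exp(-g[d:2*d]))      # forget gate
            o = 1 / (1 + np.exp(-g[2*d:3*d]))    # output gate
            c = f * c + i * np.tanh(g[3*d:])     # cell state update
            h = o * np.tanh(c)
            outs.append(h)
        seq = outs                               # hidden states feed the next layer
    return softmax(W_fc @ seq[-1])               # fully connected + softmax head

rng = np.random.default_rng(1)
dims = [5, 100, 150, 100]                        # input dim, then Table 1 hidden units
layers = [(rng.standard_normal((4 * dims[k + 1], dims[k])) * 0.1,
           rng.standard_normal((4 * dims[k + 1], dims[k + 1])) * 0.1)
          for k in range(3)]
W_fc = rng.standard_normal((3, dims[-1])) * 0.1  # 3 sentiment classes (illustrative)
probs = deep_lstm_forward([rng.standard_normal(5) for _ in range(7)], layers, W_fc)
```

The result is a probability vector over the sentiment classes, summing to one.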
4 Experimental Results and Discussion
The simulations are carried out with Python 3.9.0. The program was run on an Intel Core i7 processor with 4 GB RAM. Table 1 presents the hyperparameters used in the proposed method.
4.1 Dataset Description
The dataset is collected from the Twitter platform. The Twitter data is extracted through web scraping, which pulls the data from the tweets and saves it directly into a spreadsheet. The validation of the proposed approach is carried out with three datasets, namely Sentiment 140, IMDB, and Amazon review.
Sentiment 140: Sentiment 140 is a dataset obtained from Stanford University. It consists of 1.6 million tweets, 0.8 million positive and 0.8 million negative Go et al. [42].
Fig. 2 Deep LSTM model: the input z_t enters LSTM Block 1, and each block k passes its hidden state h^k_t upward to LSTM Block k+1 (blocks 1, 2, 3, …, n)
Table 1 Hyperparameters of DLSTM

Hyperparameters                              | Value
Max epochs                                   | 200
Mini batch size                              | 40
Gradient threshold                           | 1
Layer-wise hidden units of LSTM 1, 2, and 3  | 100, 150, 100
Activation                                   | SoftMax
Optimizer                                    | Adam
Recurrent activation                         | Sigmoid
Neuron units                                 | 110 in all LSTM layers
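For illustration, Table 1 maps naturally onto a training-configuration object; the field names below are our own, not from the paper:

```python
# Table 1 of the paper expressed as a configuration dictionary
# (key names are our own; values are as reported in Table 1).
dlstm_hparams = {
    "max_epochs": 200,
    "mini_batch_size": 40,
    "gradient_threshold": 1,          # gradient clipping threshold
    "hidden_units": [100, 150, 100],  # LSTM layers 1, 2, and 3
    "activation": "softmax",          # output activation
    "optimizer": "adam",
    "recurrent_activation": "sigmoid",
}
```

A dictionary like this could be passed to a model builder so that the architecture and training loop stay in sync with the reported settings.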
IMDB: The IMDB dataset consists of movie reviews for sentiment analysis. The
IMDB has 50,000 positive and negative reviews Maas et al. [43].
Amazon Review: The dataset consists of product reviews collected from Amazon
dated between February and April 2014 Shrestha and Nasoz [44].
4.2 Performance Metrics
The proposed approach is evaluated in terms of accuracy, precision, recall, and F-score.
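These four metrics follow their standard definitions from confusion-matrix counts; the sketch below (with made-up counts) shows the computation:

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F-score from confusion-matrix counts:
    tp/fp = true/false positives, fn/tn = false/true negatives."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_score

# Illustrative counts only, not results from the paper.
acc, p, r, f = metrics(tp=80, fp=10, fn=20, tn=90)
```

With these counts, accuracy is 0.85, recall is 0.80, and the F-score is the harmonic mean of precision and recall.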
Table 2 Performance comparison of proposed sentimental analysis with conventional approaches (– indicates a value not reported)

Dataset        | Method      | Accuracy | Precision | Recall  | F-score
Sentiment 140  | ABCDM [9]   | 0.8182   | 0.8827    | 0.82315 | 0.8199
Sentiment 140  | Proposed    | 0.9792   | 0.9645    | 0.9622  | 0.9621
IMDB           | DL [5]      | 0.9643   | –         | –       | –
IMDB           | CEDNN [17]  | 0.929    | –         | –       | –
IMDB           | Proposed    | 0.9654   | 0.9641    | 0.9640  | –
Amazon review  | SVM [49]    | 0.8129   | 0.5861    | 0.4085  | –
Amazon review  | Proposed    | 0.9723   | 0.9729    | 0.9719  | 0.9718
Table 2 shows the performance comparison of the proposed approach with traditional methods. The proposed deep LSTM-based sentiment analysis performed better than the ABCDM model on Sentiment 140 in terms of accuracy, precision, recall, and F-score. The ABCDM approach is designed for long tweets rather than short tweets Basiri et al. [23], while the proposed approach processes tweets efficiently without any constraints on their length. The deep layers in the proposed approach have been able to perform better than the CNN-based sentiment analysis on the IMDB dataset Sankar et al. [24]. The CEDNN has not reached the accuracy of the proposed approach on the IMDB dataset, since the proposed deep LSTM performs better than the Bi-LSTM model used in that approach. The proposed approach attained 97% accuracy, while SVM attained only 81% accuracy. This shows that the proposed deep LSTM model can work better than SVM in the sentiment analysis of Twitter data [45].
Figure 3 shows the performance comparison of the proposed sentiment analysis with conventional approaches for the grocery and gourmet food datasets. The bar chart shows that the deep LSTM model exhibits better performance than the CEDNN, ABCDM, DL, and SVM models. The deep layers have been successful in classifying the sentiment of the customers.
4.3 Limitation of Study
While the proposed deep learning-based sentiment analysis model with LSTM has
demonstrated improved accuracy, there are some limitations to the approach. One
limitation is the requirement for large amounts of data for training the deep learning
model. In addition, the proposed model relies on the TF-IDF approach for feature
extraction, which may not be suitable for all types of datasets. Furthermore, the
proposed approach focuses only on multi-class classification results and may not be
effective for binary sentiment analysis.
Another limitation is the generalizability of the model to other domains. The
proposed sentiment analysis approach has been evaluated on e-commerce datasets,
and its effectiveness in other domains may not be guaranteed. Additionally, the
Fig. 3 Performance comparison of proposed method over conventional techniques
proposed model’s performance may vary depending on the language and cultural
context of the reviews. The proposed approach also assumes that the sentiment of a
review can be accurately represented by a single label, which may not always be the
case.
Finally, while the proposed model has shown improved accuracy compared to
conventional approaches, it may not always outperform state-of-the-art sentiment
analysis models that employ more advanced techniques such as attention mecha-
nisms or transformer-based models. Overall, while the proposed sentiment analysis
approach is promising, further research is needed to address these limitations and
evaluate its effectiveness in a wider range of contexts.
5 Conclusion
Based on the study, the proposed deep learning-based sentiment analysis with LSTM
has shown significant improvement in classification accuracy compared to conven-
tional approaches. This approach used the TF-IDF feature extraction method and
a deep LSTM model with several layers to accurately predict the sentiment of the
tweets. The results showed that the deep LSTM model exhibited better performance in
terms of accuracy, precision, and recall than other conventional approaches. Further-
more, the proposed sentiment analysis can be extended to other domains beyond the
business process, which can benefit from its enhanced accuracy.
However, it is important to note that the study focused only on multi-class classification results, and future research can explore using multiple classifiers with a larger number of datasets to further enhance the accuracy of sentiment analysis. In conclusion,
the proposed deep learning-based sentiment analysis approach with LSTM has the
potential to benefit both service providers and users by accurately predicting the
sentiment of a product. The findings of this study highlight the importance of using
advanced machine learning techniques to improve the accuracy of sentiment analysis
and provide valuable insights to businesses for better decision making.
In conclusion, sentiment analysis is an important technique that helps businesses
understand their customers’ opinions and improve their products or services. With
the increasing availability of online platforms and social media, sentiment analysis
has become more important than ever before. Deep learning-based models, such as
LSTM and CNN, have shown great promise in improving the accuracy of senti-
ment analysis. The proposed sentiment analysis with a deep LSTM model has been
shown to significantly enhance classification accuracy compared to conventional
approaches.
However, the proposed sentiment analysis focused only on multi-class classification results and the use of a single classifier. Future work can be done to extend the study to
include multiple classifiers and larger datasets. Additionally, other feature extraction
methods, such as word embeddings, can also be explored to further enhance the
accuracy of sentiment analysis. With the continuous advancements in deep learning
techniques, it is expected that sentiment analysis will continue to improve and become
an even more valuable tool for businesses.
References
1. Hoang SN, Nguyen LV, Huynh T, Pham VT (2019) An efficient model for sentiment analysis
of electronic product reviews in Vietnamese. In: International conference on future data and
security engineering, pp 132–142. https://doi.org/10.1007/978-3-030-35653-8_10
2. Mahdaouy AE, Mekki AE, Essefar K, Mamoun NE, Berrada I, Khoumsi A (2021) Deep multi-
task model for sarcasm detection and sentiment analysis in Arabic language. arXiv preprint
arXiv:2106.12488
3. Alamoudi ES, Alghamdi NS (2021) Sentiment classification and aspect-based sentiment anal-
ysis on yelp reviews using deep learning and word embeddings. J Decis Syst 30(2–3):259–281.
https://doi.org/10.1080/12460125.2020
4. Cyril CPD, Beulah JR, Subramani N, Mohan P, Harshavardhan A, Sivabalaselvamani D (2021)
An automated learning model for sentiment analysis and data classification of Twitter data
using balanced CA-SVM. Concurrent Eng 29(4):386–395. https://doi.org/10.1177/1063293x2
11031485
5. Onan A (2020) Sentiment analysis on product reviews based on weighted word embeddings
and deep neural networks. Concurrency Comput: Pract Experience 33(23). https://doi.org/10.
1002/cpe.5909
6. Sultana MA, Rakesh P, Sandeep M, Jagadeesh G (2021) Amazon product review sentiment
analysis using machine learning. Int Res J Comput Sci 8(7):136–141. https://doi.org/10.26562/
irjcs.2021.v0807.001
7. Wassan S, Chen X, Shen T, Waqar M, Jhanjhi NZ (2021) Amazon product sentiment analysis
using machine learning techniques. Rev Argent Clín Psicol 30(1):695
8. Drus Z, Khalid H (2019) Sentiment analysis in social media and its application: system-
atic literature review. Procedia Comput Sci 161:707–714. https://doi.org/10.1016/j.procs.2019.
11.174
9. Nikseresht A, Raeisi MH, Mohammadi HA (2021) Decision making for celebrity branding:
an opinion mining approach based on polarity and sentiment analysis using twitter consumer-
generated content (CGC). arXiv preprint arXiv:2109.12630
10. Agarwal S (2019) Deep learning-based sentiment analysis: establishing customer dimension as
the lifeblood of business management. Glob Bus Rev 23(1):119–136. https://doi.org/10.1177/
0972150919845160
11. Ahmed HM, Javed Awan M, Khan NS, Yasin A, Faisal Shehzad HM (2021) Sentiment analysis
of online food reviews using big data analytics. Elementary Educ Online 20(2):827–836. https://
doi.org/10.17051/ilkonline.2021.02.93
12. Sharma DN, Shankar DP, Raj MR, Dalwadi MC (2022) Sentiment analysis for amazon product
reviews using logistic regression model. J Dev Econ Manag Res Stud 09(11):29–42. https://
doi.org/10.53422/jdms.2022.91104
13. Akter MT, Begum M, Mustafa R (2021) Bengali sentiment analysis of e-commerce product
reviews using k-nearest neighbors. In: 2021 international conference on information and
communication technology for sustainable development (ICICT4SD). IEEE, pp 40–44. https://
doi.org/10.1109/icict4sd50815.2021.9396910
14. Dong Y, Fu Y, Wang L, Chen Y, Dong Y, Li J (2020) A sentiment analysis method of
capsule network based on BiLSTM. IEEE Access 8:37014–37020. https://doi.org/10.1109/
access.2020.2973711
15. Sadr H, Pedram MM, Teshnehlab M (2020) Multi-view deep network: a deep model based
on learning features from heterogeneous neural networks for sentiment analysis. IEEE Access
8:86984–86997. https://doi.org/10.1109/access.2020.2992063
16. Ramshankar N, Joe Prathap PM (Sept 2021) A novel recommendation system enabled by
adaptive fuzzy aided sentiment classification for e-commerce sector using black hole-based
grey wolf optimization. Sādhanā 46(3). https://doi.org/10.1007/s12046-021-01631-2
17. Lin Y, Li J, Yang L, Xu K, Lin H (2020) Sentiment analysis with comparison enhanced deep
neural network. IEEE Access 8:78378–78384. https://doi.org/10.1109/access.2020.2989424
18. Xu F, Pan Z, Xia R (2020) E-commerce product review sentiment classification based on a
Naïve Bayes continuous learning framework. Inf Process Manage 57(5):102221. https://doi.
org/10.1016/j.ipm.2020.102221
19. Yi S, Liu X (2020) Machine learning based customer sentiment analysis for recommending
shoppers, shops based on customers review. Complex Intell Syst 6(3):621–634. https://doi.org/
10.1007/s40747-020-00155-2
20. Chintalapudi N, Battineni G, Amenta F (2021) Sentimental analysis of COVID-19 tweets using
deep learning models. Infect Dis Rep (April 2021) 13(2):329–339. https://doi.org/10.3390/idr
13020032
21. Vijayaragavan P, Ponnusamy R, Aramudhan M (2020) An optimal support vector machine
based classification model for sentimental analysis of online product reviews. Future Gener
Comput Syst 111:234–240. https://doi.org/10.1016/j.future.2020.04.046
22. Rehman AU, Malik AK, Raza B, Ali W (Sept 2019) A hybrid CNN-LSTM model for improving
accuracy of movie reviews sentiment analysis. Multimedia Tools Appl 78(18):26597–26613.
https://doi.org/10.1007/s11042-019-07788-7
23. Basiri ME, Nemati S, Abdar M, Cambria E, Acharya UR (2021) ABCDM: an attention-
based bidirectional CNN-RNN deep model for sentiment analysis. Future Gener Comput Syst
115:279–294. https://doi.org/10.1016/j.future.2020.08.005
24. Sankar H, Subramaniyaswamy V, Vijayakumar V, Arun Kumar S, Logesh R, Umamakeswari
A (2020) Intelligent sentiment analysis approach using edge computing-based deep learning
technique. Softw: Pract Experience 50(5):645–657. https://doi.org/10.1002/spe.2687
25. Phan HT, Tran VC, Nguyen NT, Hwang D (2020) Improving the performance of sentiment
analysis of tweets containing fuzzy sentiment using the feature ensemble model. IEEE Access
8:114630–114641. https://doi.org/10.1109/access.2019.2963702
26. Mohd Nafis NS, Awang S (2021) An enhanced hybrid feature selection technique using term
frequency-inverse document frequency and support vector machine-recursive feature elimina-
tion for sentiment classification. IEEE Access 9:52177–52192. https://doi.org/10.1109/access.
2021.3069001
27. Neogi AS, Garg KA, Mishra RK, Dwivedi YK (2021) Sentiment analysis and classification of
Indian farmers protest using Twitter data. Int J Inf Manage Data Insights 1(2):100019
28. Bhakuni M, Kumar K, Iwendi C, Singh A (2022) Evolution and evaluation: sarcasm analysis
for Twitter data using sentiment analysis. J Sens
29. Ruz GA, Henríquez PA, Mascareño A (2020) Sentiment analysis of Twitter data during critical
events through Bayesian networks classifiers. Futur Gener Comput Syst 106:92–104
30. Yang L, Li Y, Wang J, Sherratt RS (2020) Sentiment analysis for e-commerce product reviews
in Chinese based on sentiment lexicon and deep learning. IEEE access 8:23522–23530
31. Behera RK, Jena M, Rath SK, Misra S (2021) Co-LSTM: convolutional LSTM model for
sentiment analysis in social big data. Inf Process Manage 58(1):102435
32. Minaee S, Azimi E, Abdolrashidi A (2019) Deep-sentiment: sentiment analysis using ensemble
of cnn and bi-lstm models. arXiv preprint arXiv:1904.04206
33. Li W, Zhu L, Shi Y, Guo K, Cambria E (2020) User reviews: sentiment analysis using lexicon
integrated two-channel CNN–LSTM family models. Appl Soft Comput 94:106435
34. Kaur J, Buttar PK (2018) A systematic review on stopword removal algorithms. Int J Future
Revolution Comput Sci Commun Eng 4(4):207–210
35. Savaş S, Topaloğlu N (2019) Data analysis through social media according to the classified
crime. Turk J Electr Eng Comput Sci 27(1):407–420
36. Thakkar A, Chaudhari K (2020) Predicting stock trend using an integrated term frequency–
inverse document frequency-based feature weight matrix with neural networks. Appl Soft
Comput 96:106684. https://doi.org/10.1016/j.asoc.2020.106684
37. Tan HX, Aung NN, Tian J, Chua MCH, Yang YO (2019) Time series classification using a
modified LSTM approach from accelerometer-based data: a comparative study for gait cycle
detection. Gait Posture 74:128–134. https://doi.org/10.1016/j.gaitpost.2019.09.007
38. Wang L, Liu R (2020) Human activity recognition based on wearable sensor using hierarchical
deep LSTM networks. Circ, Syst, Sig Process 39(2):837–856. https://doi.org/10.1007/s00034-
019-01116-y
39. Sagheer A, Kotb M (2019) Time series forecasting of petroleum production using deep
LSTM recurrent networks. Neurocomputing 323:203–213. https://doi.org/10.1016/j.neucom.
2018.09.082
40. Ameur S, Khalifa AB, Bouhlel MS (2020) A novel hybrid bidirectional unidirectional LSTM
network for dynamic hand gesture recognition with leap motion. Entertainment Comput
35(100373):2020. https://doi.org/10.1016/j.entcom.2020.100373
41. Shahid F, Zameer A, Muneeb M (2020) Predictions for COVID-19 with deep learning models
of LSTM, GRU and Bi-LSTM. Chaos, Solitons Fractals 140:110212. https://doi.org/10.1016/
j.chaos.2020.110212
42. Go A, Bhayani R, Huang L (2009) Twitter sentiment classification using distant supervision.
CS224N project report, Stanford
43. Maas A, Daly RE, Pham PT, Huang D, Ng AY, Potts C (2011) Learning word vectors for sentiment analysis. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, pp 142–150
44. Shrestha N, Nasoz F (2019) Deep learning sentiment analysis of amazon.com reviews and
ratings. arXiv preprint arXiv:1904.04096
45. He R, McAuley J (2016) Ups and downs: modeling the visual evolution of fashion trends with one-class collaborative filtering. In: Proceedings of the 25th international conference on world wide web. https://doi.org/10.1145/2872427.2883037
5G Enabled IoT-Based DL with BC
Model for Secured Home Door System
S. B. Goyal, Anand Singh Rajawat, Pravin Gundalwar,
Ram Kumar Solanki, and Masri bin Abdul Lasi
Abstract A safety door plays an important role in protecting our public places. Every public place should offer the safest possible working environment to its workplace people and visitors. Security is vital in public places such as government offices, shopping malls, hospitals, educational institutions, and airports, where restricted areas and public walkways are controlled by closed gates. Security risks grow when no security policies are in place to enforce strict security mechanisms. This work concentrates primarily on the security aspects of doors installed at security gates and other mandatory monitoring activities. This is realized by listing the typical security challenges in 5G Enabled IoT-based DL with BC Model-based systems in general and addressing these challenges across design, development, and the creation of a functional product from scratch. A growing relationship between AI and the IoT can be established by extending their boundaries to combine their individual technological strengths. The Internet of Things (IoT) helps capture the complete activities of workplace people and visitors from their multiple "Entry" to "Exit" points in a public place, and artificial intelligence (AI) plays a vital role in detecting and avoiding security vulnerabilities strictly and smartly before any possible mishap. We propose a 5G Enabled IoT-based deep learning (DL) with blockchain (BC) Model for a secured home door system. Our proposed approach achieves 99% accuracy.
Keywords Internet of things ·Deep learning ·Home door system ·Blockchain
S. B. Goyal (B)·M. A. Lasi
City University, Petaling Jaya 46100, Malaysia
e-mail: drsbgoyal@gmail.com
M. A. Lasi
e-mail: masri.abdullasi@city.edu.my
A. S. Rajawat ·P. Gundalwar ·R. K. Solanki
School of Computer Science and Engineering, Sandip University, Nashik 422213, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_18
234 S. B. Goyal et al.
1 Introduction
One of these new technologies is the Internet of Things (IoT), here 5G Enabled IoT-based deep learning (DL) with blockchain (BC), and its goal is to connect all of the physical things around us to the Internet. The goal of the 5G Enabled IoT is to create a global network that can connect both physical things and smart services to each other. Smart services look at the information gathered from these things in order to reach whatever goal (or goals) led to the making of the product. The word "purpose" is used here in a broad sense to include all of the different ways that these services can be used and put to work; the term has a wide range of meanings. We can tell the difference between the physical world, or the things themselves, and the information about those things. The 5G Enabled IoT makes it possible for inanimate objects to share information and talk to each other.
5G Enabled IoT-based solutions are being used to build "secured smart homes" [1–6], which aim to make life better by making it safer and more comfortable. These homes are called "smart homes" because they have new and useful features. One of the best things about this group of smart home devices is that you can lock and unlock doors remotely. We propose a 5G Enabled IoT-based DL with BC Model and show how it could be used in the home and by people. We propose that sensors can be used to send information to the network's owners and that mobile devices can use well-defined interfaces to talk to the sensors and control home appliances. Consumers are eager to use the many smart devices that are now easy to find and can be used to access and control their homes. We will use three different gadgets as examples to discuss how useful they are, what they can do, and what they can't do.
When many different technological processes are put together, a lot of data is generated. Everyone has a basic need to feel safe and cared for. In today's world, it is absolutely necessary to have security systems that are powered by technology, increase the overall level of protection at a location, and are easy to use. One example of a big step forward made possible by technology is the mobile phone. For example, 5G Enabled IoT-based DL with BC Model technologies have made it possible for many objects and mobile devices to talk to each other.
Mobile phones are now the most popular way to send and receive information quickly and easily using blockchain in the modern world; they have surpassed all other forms of electronic communication in this regard. In recent years, 5G Enabled IoT-based DL with BC Model digital door locks have become more popular because they are easy to use and keep your home safe. Even though smart locks can be helpful, there are still worries about how safe they are: intruders constantly try to find holes in the security systems that are in place. With the goal of making homes [7–9] safer, a Raspberry Pi-based door lock system has been built. Because the owner's Twitter and Gmail accounts are linked to the system, the owner receives customer information as the system gathers it. The authors explain and recommend a cloud-based solution for smart homes that uses the 5G Enabled IoT-based DL with BC Model; this solution uses cloud-based SaaS, PaaS, and IaaS technologies and architecture. Also, REST-based web services are used to make it easy for a 5G Enabled IoT-based DL with BC Model, Android-based home control device and its owner to talk to each other; REST is being used to make this communication easier. For example, one study shows how to unlock a door using an Android app and a Bluetooth device; an implementation based on Bluetooth can only be used in limited ways. We also came across a suggestion for a home control system based on XML/SOAP, which makes the parsing process harder and, as a result, slows down the response time. Using the 5G Enabled IoT-based DL with BC Model in our investigation, we want to come up with a plan to design and build a way to stop such attacks from happening again. The architecture of the system is based on the 5G Enabled IoT-based DL with BC Model, and it is made up of a microcontroller device, an application hosted in the cloud, and Android software. It is hoped that Internet of Things-based technologies can improve a number of security and monitoring methods that are already in use. When the MQTT protocol is used, all of the information and data are stored in the cloud and can be retrieved from there. The administrative panel of the Android software lets you control who can access what and when, as well as keep track of every lock and unlock event in real time. Our key objectives are the following.
Security Enhancement: Implement robust security measures to prevent unau-
thorized access to the home [10,11]. Use deep learning techniques to analyze
and recognize patterns for accurate authentication and access control. Integrate
blockchain technology to ensure the integrity and immutability of access logs and
user information.
Connectivity and Communication: Utilize 5G technology to enable seamless and
high-speed communication between IoT devices, the home door system [12,13],
and the user’s mobile devices. Ensure reliable connectivity and real-time data
transmission for efficient monitoring and control of the door system.
Intelligent Features: Apply deep learning algorithms to enable intelligent features in the home door system [14–16]. This includes the ability to detect anomalies, identify authorized individuals, and adapt to user preferences and behavior patterns. Enhance the system's ability to provide personalized and secure access to the home.
2 Related Work
As it has grown and changed over time, 5G Enabled IoT-based DL with BC Model technology has been shown to be a good way to solve many different kinds of problems. Among these concerns are the need for reliable authentication, the need to protect people's privacy, the need to share data, and more. A number of researchers are working on smart home systems that are based on blockchain, with the goal of using blockchain technology in a wide range of situations and forms. For their smart home systems, some academics use public blockchains, some use private blockchains, and some use a mix of the two. Smart home systems that use the blockchain have been able to accomplish much of what they set out to do.
Taiwo et al. [17] describe a smart home automation system that can be used to control electrical equipment, keep an eye on the weather, and keep track of who and what is moving around the house. Based on the patterns of motion that have been observed, they suggest using a deep learning model to recognize and classify motion. With the deep learning model, an algorithm is built to improve the smart home automation system's ability to detect intruders and reduce the number of false alarms. The camera watches how a person walks to figure out whether they are a trespasser.
Yang et al. [18] give a point-grouping strategy for finger-vein recognition that takes the best parts of other methods and puts them together in one solution. The suggested method uses all of the image points for recognition; however, the points are broken up into a large number of groups to make feature extraction and similarity measurement easier. By combining the matched points from each group pair of the enrolled image and the probe with the mismatched points from those same group pairs, a similarity or dissimilarity score is obtained.
Ulfah et al. [19] note that if a house is broken into, the family and belongings can be better protected if a record is kept of who uses each door lock to enter and leave the house. Still, the information stored in the door lock must itself be protected. One way to reach this goal is to use a data storage system based on blockchain technology, which has the benefits of immutability and irreversibility. In this study, information about access to door locks is kept on the Ethereum blockchain platform, and smart contracts are used to handle policy administration.
Singh et al. [20] discuss three things: first, they give an overview of blockchain technology and how it is used; second, they discuss an Internet of Things infrastructure based on a blockchain network; and third, they show how blockchain technology can be used to make the Internet of Things more secure.
3 Proposed Methodology
One of the most basic kinds of mechanism has been put into place. One of our goals for the near future is to add a number of extra features that will help solve and improve a wide range of problems and situations. One of the features to be added is a sensor that can detect any kind of impact on the door and send a message to the administrator's mobile device [1]. The person in charge of a building can let people in for a short time by giving them the key to the door. When a valid user gets within a foot of the door, a sensor system will recognize them and automatically open the door. The only people who can lock and unlock an office door are the super-administrators. It is up to the user to decide whether to add new access points or remove existing ones. There is no need to worry about carrying many keys or, more importantly, losing them. Even better, this system can be used for more than just doors: it can be used to automate a wide range of electronic devices and make them easier to use. As part of the door lock system being built, a webcam will be used to perform facial recognition to decide who can enter and leave a property. Figure 1 shows the method suggested for the door lock system. An image is taken of the person currently looking at the camera. If the face has already been registered as having access rights, the system will check whether the webcam in question is part of the blockchain network. If the webcam is valid and can make blockchain transactions, the blockchain will record a transaction that includes information about the person who entered the house, the status of the access request, and the time the request was made. The user's name is saved in the identification field, the user's check-in or check-out status is saved in the access status field, and the date and time of door lock access are saved in the access time field. If the transaction made through the webcam is successful, the door will be unlocked; after the door has been open for a short time, it will automatically lock itself. Because the door is otherwise always locked, the suggested 5G Enabled IoT-based deep learning (DL) with blockchain (BC) Model algorithm remains useful whether the user is entering or leaving the house.
For handling transactions, smart contracts are used in the smart house to turn the homeowners' decisions into rules for how the house works. One miner is chosen ahead of time to be in charge of keeping the private blockchain network up and running. The DL-enabled distributed ledger, which is used only by this system, keeps track of all transactions that involve the door lock. In addition to the nodes (webcam and homeowner), the proposed architecture includes a smart contract and a blockchain, specifically the Ethereum blockchain. The smart contract rules in this system cover storing transactions and keeping track of them. The store-transaction policy, together with the webcam, records access transactions, which can include the person's name, the time and date they opened the door, and whether they are checking in or out. The monitor-transaction feature lets the authorized owner keep an eye on data stored on the blockchain network.
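The store and monitor policies described above can be illustrated with a minimal hash-chained ledger. This is a simplified Python stand-in for the Ethereum smart contract, not the authors' implementation; the class and method names are ours, while the record fields (identification, access status, access time) follow the paper's description.

```python
import hashlib
import json

class DoorAccessLedger:
    """Minimal hash-chained ledger, a simplified stand-in for the Ethereum
    smart contract described in the text."""

    def __init__(self):
        self.chain = []

    def _digest(self, body):
        # Deterministic hash of a record body
        return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

    def store_transaction(self, identification, access_status, access_time):
        """Append an access record linked to the previous record by hash."""
        prev = self.chain[-1]["hash"] if self.chain else "0" * 64
        body = {"identification": identification,
                "access_status": access_status,
                "access_time": access_time,
                "prev_hash": prev}
        self.chain.append({**body, "hash": self._digest(body)})

    def monitor_transactions(self):
        """Re-validate the whole chain, as the owner would when auditing;
        returns False if any record was tampered with."""
        prev = "0" * 64
        for rec in self.chain:
            body = {k: v for k, v in rec.items() if k != "hash"}
            if rec["prev_hash"] != prev or rec["hash"] != self._digest(body):
                return False
            prev = rec["hash"]
        return True
```

Hash-linking each record to its predecessor is what gives the real blockchain its tamper-evidence: changing any stored field invalidates every later link.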
To solve the problem just described, a system with the following goals has been suggested:
First, design and set up an AI-based system that can automatically monitor the premises at regular intervals without any human intervention. This system should be able to handle both suspicious and dangerous behavior, including identifying objects at the front door.
Second, design and implement an Internet of Things-based system that can find potentially malicious audio and video streams moving through a network and then
238 S. B. Goyal et al.
Fig. 1 5G Enabled IoT-based DL with BC model for secured home door system
process those streams on a central server or in the cloud, where the right decisions can be made in real time based on safety standards that have already been set.
Third, use strict and intelligent safety measures to spot and stop both passive and aggressive approaches and attacks, by keeping a constant watch for vulnerabilities from the scene's entrance to its end.
Finally, after making a fully functional prototype of the 5G enabled IoT-based deep learning (DL) with blockchain (BC) model-based integrated system that can be used for low-cost access control and surveillance, put that system into production.
Operation proceeds as follows: as the visitor approaches the smart door, the camera takes a picture of their face, and the Haar cascade method combined with the proposed DL model is used to identify them. The captured picture is compared against a database of registered faces to see if it matches anyone. If the device recognizes the user's face, it says the user's name through the speakers and records the user's voice instructions through the microphone; this can be done only if the device knows the user's face. Otherwise, the user will not be able to get in until the door is unlocked by other means. The smart door unlock system that has been put in place can unlock doors by both voice activation and facial recognition: it can recognize people by both their names and their faces, and it can also respond to voice commands from system administrators. Since the door can be unlocked and opened from a distance, getting in is quick and easy from anywhere. A feature called "blacklist" sends a message to the owner as soon as a blacklisted person opens the door. The new system's price is low enough that the average worker can afford to switch to it.
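The unlock decision flow above can be sketched in a few lines. The registered-face set, blacklist, and notification hook below are illustrative assumptions standing in for the facial recognition and blockchain components described in the text.

```python
REGISTERED = {"alice", "bob"}      # assumed database of registered faces
BLACKLIST = {"mallory"}            # assumed blacklist of known offenders

def handle_visitor(face_id, notify_owner):
    """Decide the door action for a recognized face identifier
    (face_id is None when recognition failed)."""
    if face_id is None:
        return "keep_locked"       # unrecognized face: door stays locked
    if face_id in BLACKLIST:
        # the "blacklist" feature: alert the owner immediately
        notify_owner("blacklisted visitor at the door: " + face_id)
        return "keep_locked"
    if face_id in REGISTERED:
        # announce the name and listen for voice commands
        return "greet_and_unlock"
    return "keep_locked"
```

In a deployment, the returned action would trigger the speaker, microphone, and lock hardware, and each `greet_and_unlock` would also be recorded as a ledger transaction.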
4 Results Analysis
We evaluated the proposed 5G enabled IoT-based DL with BC CNN model to see how well it works and how realistically it can be used as a home security solution. The training dataset was made up of 4000 photos showing the movement of 943 pedestrians, drawn from a database of computer vision images. Each person in the dataset had their picture taken four times so that a full picture of their gait could be seen from a variety of angles. All of the data were separated into four groups: ambulatory, acrobatic, hindered, and accelerated. The camera took pictures of several people standing in the same place from the same point of view. After grouping, the subjects were further divided into people who lived in their own homes and people who did not; these two groups are also called "housing occupants" and "intruders," respectively.
The CNN model is right most of the time (98%). For the DL classification, a confusion matrix is used to assess how accurate the results are: the results of the classification model are shown in a table comparing the predicted data with the actual data. Given how well the CNN deep learning experiment performed, smart houses can be made even smarter to protect both the people who live in them and their belongings. This grouping shows how motion patterns can be used to tell things apart and identify them at home in a way that is both unobtrusive and quick, saving time and effort. After the test data were fed to the trained model, the model's accuracy, precision, recall, F1-score, and specificity were calculated. Table 1 shows the results of the evaluation; it ranks the classifiers by their values for precision, recall, F1-score, and specificity.
How well the model works in the real world and how well it predicts the actual positive values shows how correct the model is. The formula can be written as:

Precision = TP / (TP + FP)
Recall measures how well a model can predict the actual positive results.
Table 1 Comparative analysis

Year | Methodology | Key contribution
2022 | Deep learning model | Improved smart home control and security system using deep learning techniques
2019 | Point grouping method | Proposed a novel method for finger vein recognition using a point grouping technique
2019 | Blockchain technology | Secure data storage for door lock system using blockchain technology
2018 | Blockchain technology | Highlighted the potential of blockchain for securing IoT data
2020 | Raspberry Pi and Telegram notification | Developed a secure home entry system using Raspberry Pi with notifications via Telegram
2022 | FPGA-based assistive framework | Proposed an FPGA-based framework for smart home automation
Recall = TP / (TP + FN)
The F1-score compares the precision and recall scores of a test to figure out how accurate it is. The formula can be written as:

F1-score = 2 × (Precision × Recall) / (Precision + Recall)
The model's specificity is determined by how many of the negative cases it predicts correctly. Specificity is computed as:

Specificity = TN / (TN + FP)
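The four formulas above follow directly from the confusion-matrix counts; a small sketch (the function name is ours):

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1-score, and specificity from
    confusion-matrix counts, following the formulas in the text."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)
    return {"precision": precision, "recall": recall,
            "f1": f1, "specificity": specificity}
```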
Also, we compared the models from earlier research with the CNN model we had proposed. The comparison with the earlier implementation is limited to accuracy, because that was the only comparable measure used in the earlier research; this repeats the problem, already discussed, of comparing metrics with specific related studies. Table 2 shows a comparison between how well our CNN model works and how well other models have worked in the past, and is evidence that the suggested CNN model can make smart home automation in the Internet of Things safer. The CNN models that have been built can be used in smart home automation apps to make it easier to spot intruders based on their movement patterns. With the help of the security camera and models that can recognize, classify, and tell apart different motion patterns, users can find out who is breaking in. Motion patterns are also used to decide whether smart home apps send out security messages and alarms. The smart home setting thus makes the house safer than it would be otherwise.
Table 2 Various measures of efficiency

Sn | Parameter | BPNN | CNN | Proposed (BPNN + CNN)
1 | MSE | 0.553 | 0.325 | 0.19
2 | PSNR | 50.44 | 51.44 | 58.33
3 | Loss percentage | 15 | 14 | 8
4 | Accuracy | 88 | 97 | 99
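Table 2 lists MSE and PSNR side by side; the two are linked, since PSNR is conventionally derived from MSE. A sketch, assuming 8-bit images with peak value 255 (the chapter does not state which peak value it used, so the table's exact numbers are not reproduced here):

```python
import math

def psnr_from_mse(mse, peak=255.0):
    """Peak signal-to-noise ratio in dB computed from the mean squared
    error, for a given peak signal value (255 for 8-bit images)."""
    return 10.0 * math.log10(peak ** 2 / mse)
```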
5 Conclusion
Using the 5G-enabled IoT-based deep learning with blockchain (BC) design for a secured house door system improves both connectivity and safety in a number of ways. Integrating 5G technology with IoT devices makes the system work better by making it easier to connect and send data. Deep learning enables the house's doors to be protected by AI in a cutting-edge way: the ability of DL models to examine and spot multiple patterns and outliers makes them suitable for strong authentication and access control systems, so that only people who have been given permission can get into the house. The integration of blockchain technology makes the system even safer. The distributed ledger and immutable data make it very hard for malicious actors to change information
and break into a home’s door security system. Blockchain technology also makes it
possible to store user data and entry records in a way that is safe, transparent, and auditable. When 5G, IoT, DL, and blockchain technologies are combined into a secure home door system, the result is a powerful and all-encompassing solution. It makes the door system safer, of course, but it also makes it easy and effective to watch and control the door system from a distance. Access can be controlled, alerts can be sent in real time, and a user's smartphone or other connected device can be
used to check on the state of the door system from a distance. With the 5G-enabled Internet of Things-based DL with BC model for protected home door systems, home security has come a long way: improved security features, better connectivity, and easy remote control give people peace of mind. With more work, this model could be used to improve security in smart homes and other places
where IoT is used. It can be expensive to set up a DL with BC design for a protected home door system based on 5G and the Internet of Things: the high cost of putting in place 5G networks, IoT devices, deep learning algorithms, and blockchain systems may limit accessibility. Combining technologies like 5G, IoT, DL, and blockchain can also complicate system design, implementation, and management, and in some cases the system will need to be set up and maintained by specialists. While DL and blockchain could undoubtedly make security better, there is also a chance that they could heighten privacy concerns: when collecting and analyzing data from IoT devices, such as door entry records and user information, questions arise about who owns the data, whether the user gave permission, and how the data could be misused.
Scalability: the 5G-enabled IoT-based DL with BC model might not scale well to a bigger home or more devices. As the number of IoT devices grows, managing and processing the data they produce becomes harder and more resource-intensive.
References
1. Zamri MA, Kamaluddin MU, Zaini N (2021) Implementation of a microcontroller-based home
security locking system. 2021 11th IEEE international conference on control system, computing
and engineering (ICCSCE), Penang, Malaysia, pp 216–221. https://doi.org/10.1109/ICCSCE
52189.2021.9530966
2. Hema N, Yadav J (2020) Secure home entry using raspberry Pi with notification via telegram.
2020 6th international conference on signal processing and communication (ICSC), Noida,
India, pp 211–215. https://doi.org/10.1109/ICSC48311.2020.9182778
3. Ahmed MS, Mukherjee R, Ghosh P, Nayemuzzaman S, Sundaravdivel P (2022) FPGA-based
assistive framework for smart home automation. 2022 IEEE 15th dallas circuit and system
conference (DCAS), Dallas, TX, USA, pp 1–2. https://doi.org/10.1109/DCAS53974.2022.984
5625
4. Saroha A, Gupta A, Bhargava A, Mandpura AK, Singh H (2022) Biometric authentication based
automated, secure, and smart IOT door lock system. 2022 IEEE India council international
subsections conference (INDISCON), Bhubaneswar, India, pp 1–5. https://doi.org/10.1109/
INDISCON54605.2022.9862840
5. Krishnan RS, Muthu AE, Kumar MA, Narayanan KL, Saravanan K, Robinson YH (2022)
Secured door operating mechanism for household during COVID-19 pandemic. 2022 6th inter-
national conference on trends in electronics and informatics (ICOEI), Tirunelveli, India, pp
733–737. https://doi.org/10.1109/ICOEI53556.2022.9776747
6. Shanthini M, Vidya G, Arun R (2020) IoT enhanced smart door locking system. 2020 Third
international conference on smart systems and inventive technology (ICSSIT), Tirunelveli,
India, pp 92–96. https://doi.org/10.1109/ICSSIT48917.2020.9214288
7. Fauzi AFM, Mohamed NN, Hashim H, Saleh MA (2020) Development of web-based smart
security door using QR code system. 2020 IEEE international conference on automatic control
and intelligent systems (I2CACIS), Shah Alam, Malaysia, pp 13–17. https://doi.org/10.1109/
I2CACIS49202.2020.9140200
8. Begum M, Jayasri S, Govindapillai LC (2022) Face recognition door lock system using rasp-
berry Pi. 2022 8th international conference on advanced computing and communication systems
(ICACCS), Coimbatore, India, pp 1645–1648. https://doi.org/10.1109/ICACCS54159.2022.
9785217
9. Gupta K, Jiwani N, Uddin Sharif MH, Mohammed MA, Afreen N (2022) Smart door locking
system using IoT. 2022 international conference on advances in computing, communication
and materials (ICACCM), Dehradun, India, pp 1–4. https://doi.org/10.1109/ICACCM56405.
2022.10009534
10. Brunner H et al (2021) Leveraging cross-technology broadcast communication to build
gateway-free smart homes. 2021 17th international conference on distributed computing in
sensor systems (DCOSS), Pafos, Cyprus, pp 1–9. https://doi.org/10.1109/DCOSS52077.2021.
00014
11. Hou D et al (2022) A highly secure authentication module for smart door lock with temporary
key function. 2022 international conference on cyberworlds (CW), Kanazawa, Japan, pp 228–
235. https://doi.org/10.1109/CW55638.2022.00053
12. Shetty S, Shetty S, Vishwakarma V, Patil S (2020) Review paper on door lock security systems. 2020 international conference on convergence to digital world—Quo Vadis (ICCDW), Mumbai, India, pp 1–4. https://doi.org/10.1109/ICCDW45521.2020.9318636
13. Premkumar B, Emayavaramban G, Amudha A, Ramkumar MS, Divyapriya S, Nagaveni P (2021) Arduino based advanced energy efficient home automation system with automatic task scheduling. 2021 2nd international conference on smart electronics and communication (ICOSEC), Trichy, India, pp 745–751. https://doi.org/10.1109/ICOSEC51865.2021.9591772
14. Monowar MI, Shakil SR, Kafi AH et al (2019) Framework of an intelligent, multi nodal and
secured RF based wireless home automation system for multifunctional devices. Wirel Pers
Commun 105:1–16. https://doi.org/10.1007/s11277-018-6100-z
15. Talal M, Zaidan AA, Zaidan BB et al (2019) Smart home-based IoT for real-time and secure
remote health monitoring of triage and priority system using body sensors: multi-driven
systematic review. J Med Syst 43:42. https://doi.org/10.1007/s10916-019-1158-z
16. Uppuluri S, Lakshmeeswari G (2022) Secure user authentication and key agreement scheme
for IoT device access control based smart home communications. Wirel Netw. https://doi.org/
10.1007/s11276-022-03197-1
17. Taiwo O, Ezugwu AE, Oyelade ON, Almutairi MS (2022) Enhanced intelligent smart home
control and security system based on deep learning model. Wirel Commun Mob Comput
2022:22, Article ID 9307961. https://doi.org/10.1155/2022/9307961
18. Yang L, Yang G, Wang K, Liu H, Xi X, Yin Y (2019) Point grouping method for finger vein
recognition. IEEE Access 7:28185–28195
19. Nadiya U, Rizqyawan MI, Mahnedra O (2019) Blockchain-based secure data storage for door lock system. 2019 4th international conference on information technology, information systems and electrical engineering (ICITISEE), Yogyakarta, Indonesia, pp 140–144. https://doi.org/10.1109/ICITISEE48480.2019.9003904
20. Singh M, Kim S (2018) Blockchain: a game changer for securing IoT data. 2018 IEEE 4th
world forum on internet of things (WF-IoT), pp 51–55
Improving Efficiency of Spinal Cord
Image Segmentation Using Transfer
Learning Inspired Mask Region-Based
Augmented Convolutional Neural
Network
Sheetal Garg and S. R. Bhagyashree
Abstract Spinal cord magnetic resonance images (MRIs) consist of 7 levels of cervical vertebrae, 12 levels of thoracic vertebrae, 5 levels of lumbar vertebrae, and one level each of the sacrum and coccyx. Segmentation of these components is essential for effective classification and post-processing analysis of spinal cord images. Performing this task requires separate algorithms for each of the components, so their segmentation performance is not uniform, which limits their integration capabilities. Moreover, each type of segmentation has scalability issues, which must be addressed via augmentation, aggregation, and machine learning for better clinical use. To resolve these issues, this text proposes
design of a novel spinal cord image segmentation model using transfer learning
inspired mask region-based augmented convolutional neural network (MRACNN).
The proposed model utilizes initial weights from the pre-trained COCO mask RCNN model, and modifies them to incorporate the spine, torso, and L1 to L5 spinal cord
components. When compared to several state-of-the-art models, it is found that the
suggested model has improved region of interest (RoI) extraction and an accuracy of
91% for segmenting these components. Moreover, the proposed model was evaluated
on multiple datasets, and a consistent performance was observed. Furthermore, the
model was fused with an XRAI-based convolutional neural network, which assisted in further improving the overall efficiency of segmentation. Fusion of the XRAI CNN with the MRACNN achieves a segmentation accuracy of 94%, along with better
RoI performance when compared with individual models. The fused model has a high delay requirement and needs a large dataset for training and validation; thus, this text also recommends selective ensembling techniques for redundancy reduction, which improve segmentation speed while maintaining high segmentation quality.
S. Garg (B)·S. R. Bhagyashree
Department of Electronics and Communication Engineering, ATME College of Engineering,
Mysuru, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_19
245
246 S. Garg and S. R. Bhagyashree
Keywords Spine ·Segmentation ·CNN ·Mask RCNN ·COCO ·Ensemble ·
Augmentation
1 Introduction
MRIs play a vital role in disease diagnosis and identification [1]. Segmenting the
spinal cord is a complex task that entails efficient design of body symmetry detec-
tion, analysis of spinal components, anterior–posterior analysis, shape identification,
application of diffusion filters for clustering & outlier detection. In order to perform these tasks, scientists have, over time, presented various segmentation models [2–6]. An instance of such a model, which uses mutual information, Canny edge detection, the Hough transform, and k-means clustering, is depicted in Fig. 1, wherein the spinal cord and canal regions are segmented. The model initially detects body symmetry using mutual information (MI), which allows for RoI extraction and results in a structural decomposition of the spinal image layers. These layers are processed via the Hough transform and Canny edge detection to extract line-like components.
These components are analyzed, and lines in the anterior and posterior (AP) directions are removed in order to isolate spinal cord regions. These regions are given to an angular Hough transform for estimation of candidate circles close to the AP line. These circles assist in identifying the exact spinal cord position, which is facilitated using anisotropic diffusion filtering and k-means-based clustering. The obtained clusters are segregated w.r.t. internal shapes, and spinal cord-like clusters are identified. Pixels belonging to this cluster are extracted, while other pixels are removed from the pixel set, thereby resulting in the final segmented image. Efficiency of segmentation depends upon the internal model design of these blocks, and wide varieties of
algorithms are available for it. A brief review of these algorithms is given in the next section of this text, which assists in identifying performance gaps in the existing literature and thus forms the basis of the proposed model. It is observed that existing models are largely application-specific, which limits their scalability and accuracy for larger datasets. In order to remove this drawback, Sect. 2 proposes the design of a transfer learning inspired mask region-based augmented convolutional neural network. The proposed model uses a combination of an XRAI-based CNN with the MRACNN, which assists in improving its segmentation performance. This performance is evaluated in Sect. 3 and compared with other models in terms of segmentation accuracy, precision of RoI extraction, and computational complexity. Finally, this text concludes with some interesting observations about the proposed model and recommends various methods to further improve its performance.
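The k-means clustering stage of the pipeline in Fig. 1 can be illustrated with a toy pure-Python version over 1-D pixel intensities. The actual pipeline clusters diffusion-filtered image regions; only the assign/update loop is shown here.

```python
import random

def kmeans_1d(values, k, iters=50, seed=0):
    """Toy 1-D k-means: cluster scalar pixel intensities into k groups,
    returning the centroids and one label per value."""
    rng = random.Random(seed)
    centroids = rng.sample(values, k)          # initialize from the data
    labels = [0] * len(values)
    for _ in range(iters):
        # assignment step: nearest centroid for each value
        labels = [min(range(k), key=lambda c: abs(v - centroids[c]))
                  for v in values]
        # update step: recompute each centroid as the mean of its members
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, labels
```

With well-separated intensities (e.g., dark canal pixels vs. bright cord pixels), the loop converges to one cluster per intensity band regardless of the sampled initialization.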
Improving Efficiency of Spinal Cord Image Segmentation Using 247
Fig. 1 Segmentation of spinal cord and canal regions using Hough transform and k-means-based
clustering [3]
2 Material and Method
2.1 Literature Review
Wide varieties of models have been proposed for spinal cord segmentation, each of which has context-specific performance. Consider the research in [7, 8], which describes threshold-based segmentation and segmentation models based on convolutional neural networks (CNN). The CNN model is autotuned, has great segmentation efficiency, and can thus be applied to a broad variety of datasets, in contrast to the threshold-based model, which has restricted accuracy and must be manually tuned for each dataset. A comparison of these models is given in [9], where it is observed that machine learning methods outperform linear segmentation models, and thus are highly preferred for clinical applications. Design of such
models is further described in [10–13], wherein hybrid CNNs, U-Net-based models, and optimization techniques are presented. These techniques have good accuracy and can be used for multiple datasets with minimal training effort on the user's side. However, these models need lengthy training times, which can be reduced by using parallel processing or pipelining strategies. An example of such a high-speed model is put forward in [13], where researchers deployed a U-Net network with transfer learning. The inclusion of transfer learning lessens cold-start problems, improving precision, recall, and f-measure performance overall. This observation served as the inspiration for the proposed approach, which also employs transfer learning to segment
spinal cord imaging quite effectively. In [14–16], researchers employed similar high-efficiency models using an analytical transform-based statistical characteristic decomposition model (ATSCDM) and faster RCNN models for high-accuracy, low-delay segmentation. The faster RCNN model tends to outperform other models due to its microscopic segmentation performance and its ability to perform classification & regression with high efficiency.
Efficiency of reviewed models must be further evaluated on larger datasets. Such
research is discussed in [17], wherein quantitative identification of different spinal
cord components is performed on a large-scale dataset. This work can be used to
identify performance issues with statistical parametric mapping (SPM) framework
[18], spinal cord injury detection frameworks [19], multispecies spinal cord anal-
ysis [20], and Gaussian kernel methods [21]. All these methods showcase that spinal
cord segmentation models are highly context sensitive, and have limited performance
when multiple datasets are used for validation. To resolve this issue, work in [22–24] proposes the use of automatic 3D segmentation via clustering, CNN with grayscale regularized active contour propagation, and super-voxel segmentation using U-Net architectures. These models are observed to have higher delay than context-sensitive models, but have comparable accuracy when applied to multiple types of datasets.
Another high-efficiency model is proposed in [25], wherein researchers have used
Region Growth (RG) algorithm for segmentation. The RG model is highly effective,
but requires manual region estimation, which limits its applicability to small-scale
scenarios. Models that can be used for large-scale datasets are proposed in [26–29], wherein researchers have proposed support vector machine (SVM)-based active contour segmentation, deep convolutional neural networks (DCNN) trained with probability maps of a pre-trained deep network, a multi-dilated recurrent residual U-Net model, and deep dilated convolutions. These models assist in segmenting images taken at multiple angles, and thus allow clinical experts to identify and diagnose any cases of spinal cord injury with minimum error and maximum precision. The efficiency of these models must be tested on multiple datasets and combined with multiple feature extraction & classification units, as proposed in [30–32], wherein ensemble principal component analysis (PCA), particle swarm optimization (PSO), and SVM with different kernels are discussed. These models tend to outperform
linear segmentation models, but still have a wide scope of improvement in terms of scalability & accuracy performance. In order to work on these issues, the next section proposes the design of a high-efficiency spinal cord image segmentation model using a transfer learning inspired mask region-based augmented convolutional neural network. The proposed model is based on multiple deep learning layers, which assists in achieving high accuracy with better PSNR. Results of the model are also discussed in Sect. 3 of this text.
2.2 Design of a High-Efficiency Spinal Cord Image
Segmentation Model Using Transfer Learning Inspired
Mask Region-Based Augmented Convolutional Neural
Network
From the literature survey, it is observed that most of the recently proposed models for spinal cord segmentation are suited to single-image datasets and give limited accuracy for different image types. In order to improve the scalability of spinal cord segmentation, this section proposes the design of a novel model that uses a transfer learning inspired mask region-based augmented convolutional neural network (MRACNN). The MRACNN model uses pre-trained common objects in context (COCO) weights for initial validation, and is retrained using the target spinal cord dataset. This MRACNN model is integrated with an XRAI-based CNN model, which assists in saliency-based segmentation. The proposed model is depicted in Fig. 2, which showcases the entire data flow of the segmentation process.
The model uses a combination of the transfer learning mask RCNN and the XRAI-based CNN in order to obtain the final segmented image. The internal description of these models, along with their integration details, is discussed in separate sub-sections for better understanding. Readers can replicate these designs in parts or as a whole by referring to these sections, and use them in their own segmentation models.
2.3 Transfer Learning Mask RCNN Model Using Initial
COCO Weights for Coarse Segmentation
The transfer learning-based mask RCNN model is built around a region proposal network, which is trained using COCO-based training weights. These weights are updated as per the input dataset, and a spinal mask is obtained. Generation of this mask is assisted by an incremental CNN learning model, which uses progressive feature extraction, as observed from Fig. 3, wherein the internal layer design of the model is depicted.
The model initially uses COCO-based weights to perform initial pixel-level convolutions. These convolutions assist the model in evaluating multiple feature vectors, which are used for model training. The result of the convolutional layer is controlled using Eq. (1), wherein a rectified linear unit (ReLU)-based kernel is used for pixel activation,
\[
\text{Conv}_{\text{out}}(i,j) = \sum_{a=-m/2}^{m/2} \; \sum_{b=-n/2}^{n/2} I(i-a,\, j-b)\, \text{ReLU}\!\left(\frac{m}{2}+a,\; \frac{n}{2}+b\right) \qquad (1)
\]
where I, i, m, n, and j represent the raw spinal cord image, the current window row, the number of rows and columns in the spinal cord image, and the current window column, respectively.
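A minimal sketch of the convolution in Eq. (1): the m/2-centred indexing is simplified here to a standard valid convolution, and the ReLU is applied to the kernel entries as the equation writes it, which we reproduce purely for illustration.

```python
def conv2d_relu(image, kernel):
    """Valid 2-D convolution of an image with a ReLU-clipped kernel
    (nested lists of numbers; pure Python, no framework assumed)."""
    kh, kw = len(kernel), len(kernel[0])
    # ReLU on the kernel, per Eq. (1): negative weights are clipped to zero
    relu_k = [[max(0.0, v) for v in row] for row in kernel]
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            row.append(sum(image[i + a][j + b] * relu_k[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out
```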
These metrics are evaluated for each given layer. The number of features produced
Fig. 2 Design of the proposed model
during each convolution is controlled via Eq. (2) as follows:

\[
\text{feat}_{\text{out}} = \frac{\text{feat}_{\text{in}} + 2p - k}{s} + 1 \qquad (2)
\]
where feat_in, feat_out, p, s, and k represent the total input features, the total output features, the padding size used during convolution, the stride size used during convolution, and the kernel size of the convolutional layer, respectively. This model can extract a
and kernel size for the convolutional layer, respectively. This model can extract a
substantial number of characteristics from any spinal cord image due to variations in
different padding, stride, and kernel sizes. This causes inherent redundancies in the
system, which limits its accuracy & RoI performance. To improve this accuracy, a
variance-based feature selection layer, namely, max pooling layer is applied to each
Fig. 3 Transfer learning mask RCNN model using initial COCO weights for coarse segmentation
convolutional layer. The convolutional features that are higher than the evaluated
feature threshold are all removed by this layer. The max pooling variance threshold
is evaluated using Eq. (3),
$$f_{th} = \frac{1}{S_i}\left(\sum_{x \in X_k} x^{p_k}\right)^{1/p_k} \quad (3)$$
where S_i is the size of the input spinal cord image, while p_k is a feature selection factor, which is tuned during hyperparameter optimization. This process of convolution and max pooling is repeated for progressive convolutional layer sizes of 7 × 7 × 256, 14 × 14 × 256, 28 × 28 × 256, and 28 × 28 × 80 to obtain a large number of segmentation features. These features are given to a combination of two 1 × 1024 fully connected neural network layers in order to estimate the final per-pixel class for segmentation.
Each pixel is classified into spine, torso, L1L2, L2L3, L3L4, or L4L5 classes. This
classification is controlled using Eq. (4), wherein a SoftMax-based activation function is described,
$$c_{\text{out}} = \text{SoftMax}\left(\sum_{i=1}^{N_f} f_i w_i + b\right) \quad (4)$$
where f_i, w_i, b, and N_f represent the values of the extracted convolutional feature vector, the hyperparameter-tuned weights, the hyperparameter-tuned bias, and the total number of features extracted by the convolutional layer, respectively. An output segmentation mask is extracted from this layer and is used to fine-tune the segmentation output provided by the XRAI-based CNN model. This model is described in the next sub-section of this text.
2.4 XRAI-Based CNN Model for Region-Based Segmentation
The raw spinal cord image is given to an XRAI-based segmentation layer. XRAI is a medical imaging-oriented saliency map segmentation algorithm, which assists in the identification of RoI regions. These RoI regions are extracted using entropy values, which depend upon the extracted convolutional features. In order to extract these features,
raw input image is split using bit-plane slicing. Each of the slices is then given to
a convolutional feature extraction unit. These features are extracted using Eqs. (1),
(2), and (3) wherein max pooling-based feature selection is performed to reduce
redundancies. All extracted features at each layer are given to an entropy evaluation layer, which is controlled using Eq. (5). Here, the probability of feature occurrence and its logarithmic levels are used for the final entropy evaluation,
$$E_{f_i} = -\sum_{r=1}^{N}\sum_{c=1}^{M} p\left(F_{r,c,i}\right)\log p\left(F_{r,c,i}\right) \quad (5)$$
where p(F_{r,c,i}) represents the probability of the feature vector at location (r, c), and i represents the bit-slice number of the input spinal cord image. These entropy values are used as upper
limits for each slice of input image, and bit-level thresholding is performed. All these
slices are combined in order to obtain the final XRAI map. Results of this XRAI-based saliency detection model are observed in Fig. 4, wherein the input image, its saliency mask, and the final saliency image are shown.
Fig. 4 Extracted salient regions from input spinal cord imagery
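The per-slice entropy evaluation of Eq. (5) can be illustrated with a short Python sketch (the function name is hypothetical, not from the paper's code):

```python
import math

def bitplane_entropy(prob_map):
    # Eq. (5): E = -sum_r sum_c p(F_{r,c}) * log p(F_{r,c})
    # prob_map is an N x M grid of feature-occurrence probabilities
    return -sum(p * math.log(p)
                for row in prob_map
                for p in row
                if p > 0)  # the 0 * log(0) terms are taken as 0

# A uniform 2x2 probability map attains the maximum entropy log(4),
# while a map concentrated on one feature has zero entropy.
uniform = [[0.25, 0.25], [0.25, 0.25]]
```

High-entropy slices carry more feature information, which is why the text uses these values as per-slice thresholds before recombining the slices into the XRAI map.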
Fig. 5 Used GoogLeNet model for pixel level classification
These regions are again given for feature extraction to a GoogLeNet-based CNN
model, wherein pixel level classification is performed. The used CNN model is
visualized from Fig. 5, wherein its internal architecture is described.
As observed from the model, each pixel is classified into two classes, foreground
and background. In order to perform this task, ground truth data for a large number
of images is used, and given to an inception module. This module uses the given
ground truth data, and generates a saliency mask for final segmentation. Internal
model design for inception module is described in Fig. 6, wherein multiple filters are
concatenated in order to obtain the final spinal cord mask output. In order to enhance
efficiency of segmentation, inception module uses the following Eq. (6) for internal
pooling,
$$P(q,p) = \log\big(C(p,q)\cdot G(q,p)\big) \quad (6)$$
where P is the output of Pooling, C is the convolutional operation on the input
image patch (p,q), and G is the ground truth image patch (q,p).
Extracted pooling features are given to a filter concatenation unit, which operates
using the following Eq. (7),
$$F(p,q) = \frac{P(q,p)/k + d\left(a\,B(p,q) + c\right)}{4} \quad (7)$$
where F represents the concatenated filter output, P represents the pooling output, B represents the base image patch at (p,q), while a, c, d, and k are inception constants, tuned through the process of hyperparameter tuning. Multiple
inception modules are connected in cascade, which generates a large number of
segmentation masks. All these masks are overlapped in order to generate the coarse
spinal cord segmentation mask. Results from Sect. 2.3, and 2.4 are combined in order
to generate the final mask as described in Sect. 2.5.
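The inception pooling of Eq. (6) and the filter concatenation of Eq. (7) reduce to simple scalar arithmetic per patch. The sketch below illustrates them under an assumed reading of the equations, with hypothetical helper names and placeholder default constants:

```python
import math

def inception_pool(c_val, g_val):
    # Eq. (6): P(q,p) = log(C(p,q) * G(q,p))
    # c_val: convolution response on the patch, g_val: ground-truth patch value
    return math.log(c_val * g_val)

def filter_concat(p_val, b_val, a=1.0, c=0.0, d=1.0, k=1.0):
    # Eq. (7), as reconstructed here: F = (P/k + d*(a*B + c)) / 4
    # a, c, d, k stand in for the inception constants, which the paper
    # says are set during hyperparameter tuning.
    return (p_val / k + d * (a * b_val + c)) / 4.0
```

Cascading several such modules and overlapping their masks, as the text describes, then yields the coarse spinal cord segmentation mask.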
Fig. 6 Design of the inception model
2.5 Model Fusion for Final Segmentation
The extracted masks from Sect. 2.3 and 2.4 are combined using a fusion model for
estimation of the final segmentation mask. In order to perform this task, the following process is designed:
– Masks from RCNN and XRAI CNN are evaluated at pixel level.
– Correct pixel locations from each mask are extracted by referring to the ground truth data.
– The union of all correct pixels is taken, and unique values from this union are estimated using the following Eq. (8),
$$P_{\text{final}} = \text{Unique}\big(P_{\text{XRAI}} \cup P_{\text{RCNN}}\big) \quad (8)$$
where P_final, P_XRAI, and P_RCNN represent the final pixel map, the XRAI-based pixel map, and the RCNN-based pixel map for each input image.
– The final pixel map is evaluated for each training image and stored in the database.
– For any new input image, convolutional features are computed, and each pixel is classified into foreground and background.
– The correlation between the extracted convolutional features and the stored features is estimated using the following Eq. (9),
$$\text{Corr}_j = \frac{\sum_{i=1}^{N_{f\text{Test}}} F_{\text{test}_i}\, F_{\text{new}_i}}{\sum_{i=1}^{N_{f\text{Test}}} \left(F_{\text{test}_i} - F_{\text{new}_i}\right)^2} \quad (9)$$
where j is the number of the segmentation engine (j = 1 for XRAI, 2 for RCNN), F_test_i and F_new_i are the ith test-set and new-input pixel values respectively, and N_fTest is the total number of features selected by the convolutional models for the test set. The maximum value of Corr_j is evaluated, and the segmentation pixel positions of that training image are used to segment the new test image. Based on this, the final spinal cord image is segmented, and performance metrics including segmentation accuracy, peak signal-to-noise ratio, and segmentation delay are estimated. These metrics are compared with existing models and discussed in the next section of this text.
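The fusion steps of Sect. 2.5, the unique union of Eq. (8) and the correlation matching of Eq. (9), can be sketched as follows. This is an illustrative sketch under assumed readings of those equations, not the authors' code, and the function names are hypothetical:

```python
def fuse_masks(p_xrai, p_rcnn):
    # Eq. (8): P_final = Unique(P_XRAI union P_RCNN)
    # pixel maps are given as iterables of (row, col) positions
    return sorted(set(p_xrai) | set(p_rcnn))

def corr_score(f_test, f_new):
    # Eq. (9), as read here: sum of products over sum of squared
    # differences between stored test features and new-image features
    num = sum(a * b for a, b in zip(f_test, f_new))
    den = sum((a - b) ** 2 for a, b in zip(f_test, f_new))
    return num / den if den else float("inf")  # identical features -> max score
```

The training image maximizing `corr_score` against the new image would then donate its stored segmentation pixel positions, mirroring the matching step described above.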
3 Results and Discussion
Spinal cord MRIs from several Mendeley datasets and the associated ground truth images were utilized to measure the segmentation performance of the proposed model. The data were taken from https://data.mendeley.com/datasets/zbf6b4pttk/2 and are freely available under an open-source license. The dataset consists of lumbar spine scans, with 48,345 MRI slices. The majority of these slices are 320 × 320 pixels, while some have a 320 × 310-pixel resolution, as observed in Fig. 7, wherein a sample set of the dataset is visualized. Compared to typical 8-bit grayscale images, all images feature 12-bit per-pixel resolution. Such a large dataset made it possible to train the underlying network and obtain accuracy, PSNR, and delay metrics. Equation (10) was used to assess the segmentation accuracy as follows,
$$A = \frac{N_{pc}}{I_{\text{size}}} \times 100 \quad (10)$$
where N_pc and I_size represent the total number of correctly classified pixels and the size of the input image, respectively. The entire dataset was divided in a 70:30 ratio for training & validation, respectively. Segmentation accuracy was evaluated for ML [8] and FASTER RCNN [10], and compared with the proposed model; these values are tabulated in Table 1 as follows:
Table 1 shows that, on the same dataset, the proposed model is 16% more accurate than ML [6], 10% more accurate than FASTER RCNN [8], and 3% more accurate than RCNN. It follows that the suggested methodology may be applied to real-time clinical segmentation and is very successful for large-scale deployments. PSNR during segmentation for ML [6] and FASTER RCNN [8] was measured and compared against the suggested model. The results are presented in Table 2 as follows:
According to Table 2, the suggested model has a PSNR that is 15 dB higher than
ML [8], 11 dB higher than FASTER RCNN [10], and 4 dB higher than RCNN on the
same dataset. This increase in PSNR is the result of the combination of XRAI and
RCNN, which helps with precise segmentation. It follows that the suggested methodology may be applied to real-time clinical segmentation and is very successful for large-scale deployments.

Fig. 7 Dataset samples

The delay during segmentation for ML [8] and FASTER RCNN [10] is reported in Table 3 and compared to the proposed model.
From Table 3, it is observed that the proposed model has a 30% higher delay than ML [8], 24% higher than FASTER RCNN [10], and 10% higher than RCNN on the same dataset. This increase in runtime is due to the combination of XRAI & RCNN, and must be addressed using optimization methods such as hyperparameter tuning, Q-learning, and ensemble classification. Thus, the proposed model can be used for real-time clinical segmentation, but may require more time in exchange for better accuracy.
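The accuracy metric of Eq. (10) and the 70:30 split used in the experiments can be sketched in a few lines (helper names are hypothetical, not the authors' code):

```python
def segmentation_accuracy(n_correct, image_size):
    # Eq. (10): A = (N_pc / I_size) * 100
    return n_correct / image_size * 100.0

def train_val_split(n_images, train_ratio=0.7):
    # The paper's 70:30 split between training and validation images
    n_train = round(n_images * train_ratio)
    return n_train, n_images - n_train
```

For example, an image where 90 of 100 pixels are classified correctly scores A = 90.0, and a pool of 100 images splits into 70 training and 30 validation images.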
Table 1 Accuracy of segmentation for different models
Number of images | A (%) ML [6] | A (%) Faster RCNN [8] | A (%) RCNN | A (%) TLMA CNN
100 75.40 76.20 86.63 88.89
200 75.45 77.90 87.63 89.92
300 75.48 78.40 87.93 90.23
400 75.51 79.20 88.41 90.71
500 75.57 80.10 88.95 91.28
600 75.63 81.10 89.56 91.90
700 75.69 81.50 89.82 92.17
800 75.72 81.60 89.90 92.25
900 75.77 81.65 89.95 92.30
1000 75.82 81.75 90.04 92.39
1500 75.86 81.90 90.15 92.50
2000 75.91 81.98 90.22 92.58
2500 75.96 82.08 90.30 92.66
3000 76.00 82.18 90.39 92.75
3500 76.05 82.28 90.47 92.84
4000 76.10 82.38 90.56 92.92
4500 76.15 82.48 90.64 93.01
5000 76.19 82.58 90.72 93.09
5500 76.24 82.68 90.81 93.18
6000 76.29 82.78 90.89 93.27
7000 76.33 82.88 90.98 93.35
8000 76.38 82.98 91.06 93.44
9000 76.43 83.08 91.14 93.53
10,000 76.48 83.18 91.23 93.61
11,000 76.52 83.28 91.31 93.70
12,000 76.57 83.38 91.40 93.78
13,000 76.62 83.48 91.48 93.87
14,500 76.66 83.58 91.57 93.96
Table 2 PSNR of segmentation for different models
Number of images | PSNR (dB) ML [6] | PSNR (dB) Faster RCNN [8] | PSNR (dB) RCNN | PSNR (dB) TLMA CNN
100 30.16 31.24 38.98 42.67
200 30.18 31.94 39.43 43.16
300 30.19 32.14 39.57 43.31
400 30.20 32.47 39.78 43.54
500 30.23 32.84 40.03 43.81
600 30.25 33.25 40.30 44.11
700 30.28 33.42 40.42 44.24
800 30.29 33.46 40.45 44.28
900 30.31 33.48 40.48 44.31
1000 30.33 33.52 40.52 44.35
1500 30.35 33.58 40.57 44.40
2000 30.36 33.61 40.60 44.44
2500 30.38 33.65 40.64 44.48
3000 30.40 33.69 40.67 44.52
3500 30.42 33.73 40.71 44.56
4000 30.44 33.77 40.75 44.60
4500 30.46 33.81 40.79 44.64
5000 30.48 33.86 40.83 44.69
5500 30.50 33.90 40.86 44.73
6000 30.51 33.94 40.90 44.77
7000 30.53 33.98 40.94 44.81
8000 30.55 34.02 40.98 44.85
9000 30.57 34.06 41.02 44.89
10,000 30.59 34.10 41.05 44.93
11,000 30.61 34.14 41.09 44.98
12,000 30.63 34.18 41.13 45.02
13,000 30.65 34.22 41.17 45.06
14,500 30.67 34.27 41.20 45.10
Table 3 Segmentation delay of different models
Number of images | Delay (s) ML [6] | Delay (s) Faster RCNN [8] | Delay (s) RCNN | Delay (s) TLMA CNN
100 11 11 13 13
200 21 22 25 27
300 32 33 38 40
400 42 45 51 54
500 53 56 64 68
600 64 69 78 82
700 74 80 91 95
800 85 92 104 109
900 95 104 117 123
1000 106 115 131 137
1500 159 173 196 205
2000 213 231 262 274
2500 266 289 327 343
3000 319 348 393 412
3500 373 406 459 481
4000 426 465 525 550
4500 480 523 591 619
5000 533 582 658 689
5500 587 641 724 758
6000 641 700 791 828
7000 748 818 923 967
8000 855 936 1056 1106
9000 963 1054 1189 1246
10,000 1071 1173 1323 1385
11,000 1178 1292 1456 1525
12,000 1286 1411 1590 1666
13,000 1394 1530 1724 1806
14,500 1556 1709 1925 2016
4 Conclusion
The proposed model is capable of providing high-accuracy segmentation performance, owing to the fusion of the XRAI CNN & RCNN models. It is observed that the proposed model achieves 93.96% accuracy, with a maximum PSNR of 45.1 dB, across multiple types of spinal cord images. Furthermore, the proposed model has 16% better accuracy than ML [8], 10% better accuracy than FASTER RCNN [10], and 3% better accuracy than RCNN on the same image set. This makes it highly usable for a wide variety of spinal cord segmentation applications, including high-efficiency classification, post-processing, etc. The proposed model also shows a 15 dB improvement in PSNR over ML [8], an 11 dB improvement over FASTER RCNN [10], and a 4 dB improvement over RCNN on multiple Mendeley datasets, making it highly viable for efficient segmentation in large-scale clinical applications. However, the model has a large training & validation delay due to multiple algorithmic passes, which restricts deployment to high-computing environments; researchers must therefore apply optimization models to reduce the computational complexity of the proposed model. Furthermore, researchers can explore other CNN architectures and observe their effect on final segmentation performance.
5 Conflicts of Interest
The authors declare that there is no conflict of interest regarding the publication of
this paper.
Acknowledgements We would like to thank the management and Principal of ATME College of
Engineering, Mysore, for their ongoing assistance and support.
References
1. Garg S, Bhagyashree SR (2019) Detection and classification of tumors using medical imaging
techniques: a survey. In: Balaji S, Rocha Á, Chung YN (eds) Intelligent communication tech-
nologies, and virtual mobile networks. ICICV 2019. Lecture notes on data engineering and
communications technologies, vol 33
2. Garg S, Bhagyashree SR (2021) Spinal cord MRI segmentation techniques, and algorithms: a
survey. SN Comput Sci 2:229
3. Sabaghian S, Dehghani H, Batouli SA, Khatibi A, Oghabian M (2020) Fully automatic 3D
segmentation of the thoracolumbar spinal cord and the vertebral canal from T2-weighted MRI
using K-means clustering algorithm. Spinal Cord 58:1–10. https://doi.org/10.1038/s41393-
020-0429-3
4. Chen M, Carass A, Oh J et al (2013) Automatic magnetic resonance spinal cord segmentation
with topology constraints for variable fields of view. Neuroimage 83:1051–1062. https://doi.
org/10.1016/j.neuroimage.2013.07.060
5. Liao C-C, Ting H-W, Xiao F (2017) Atlas-free cervical spinal cord segmentation on midsagittal T2-weighted magnetic resonance images. J Healthc Eng 2017:12, Article ID 8691505. https://doi.org/10.1155/2017/8691505
6. Gros C, De Leener B, Badji A, Maranzano J, Eden D, Dupont S, Talbott J, Zhuoquiong R, Liu
Y, Granberg T, Ouellette R, Tachibana Y, Hori M, Kamiya K, Chougar L, Stawiarz L, Hillert
J, Bannier E, Kerbrat A, Cohen-Adad J (2018) Automatic segmentation of the spinal cord and
intramedullary multiple sclerosis lesions with convolutional neural networks
7. Ahammad SH, Ur Rahman MZ, Lay-Ekuakille A, Giannoccaro NI (2020) An efficient optimal
threshold-based segmentation and classification model for multi-level spinal cord injury
detection. 2020 IEEE international symposium on medical measurements and applications
(MeMeA), pp 1–6. https://doi.org/10.1109/MeMeA49120.2020.9137122
8. Saenz-Gamboa JJ, de la Iglesia-Vayá M, Gómez JA (2021) Automatic semantic segmentation of
structural elements related to the spinal cord in the lumbar region by using convolutional neural
networks. 2020 25th international conference on pattern recognition (ICPR), pp 5214–5221.
https://doi.org/10.1109/ICPR48806.2021.9412934
9. Mnassri B, Sahnoun M, Hamida AB (2020) Comparison study for spinal cord segmentation
methods aiming to detect SC atrophy in MRI images: case of multiple sclerosis. 2020 5th
international conference on advanced technologies for signal and image processing (ATSIP),
pp 1–6. https://doi.org/10.1109/ATSIP49331.2020.9231790
10. Ahammad SH, Rajesh V, Rahman MZU, Lay-Ekuakille A (2020) A hybrid CNN-based
segmentation and boosting classifier for real time sensor spinal cord injury data. IEEE Sens J
20(17):10092–10101. https://doi.org/10.1109/JSEN.2020.2992879
11. Lemay A, Gros C, Zhuo Z et al (2021) Automatic multiclass intramedullary spinal cord tumor
segmentation on MRI with deep learning. Neuroimage Clin 31:102766. https://doi.org/10.
1016/j.nicl.2021.102766
12. Yiannakas MC, Liechti MD, Budtarad N et al (2019) Gray versus white matter segmenta-
tion of the conus medullaris: reliability and variability in healthy volunteers. J Neuroimaging
29(3):410–417. https://doi.org/10.1111/jon.12591
13. Couedic TL, Caillon R, Rossant F, Joutel A, Urien H, Rajani RM (2020) Deep-learning based
segmentation of challenging myelin sheaths. 2020 tenth international conference on image
processing theory, tools and applications (IPTA), pp 1–6. https://doi.org/10.1109/IPTA50016.
2020.9286715
14. Alsiddiky A, Fouad H, Soliman AM, Altinawi A, Mahmoud NM (2020) Vertebral tumor
detection and segmentation using analytical transform assisted statistical characteristic decom-
position model. IEEE Access 8:145278–145289. https://doi.org/10.1109/ACCESS.2020.301
2719
15. Moccia M, Prados F, Filippi M, Rocca MA, Valsasina P, Brownlee WJ, Zecca C, Gallo A,
Rovira A, Gass A, Palace J, Lukas C, Vrenken H, Ourselin S, Gandini Wheeler-Kingshott
CAM, Ciccarelli O, Barkhof F (2019) Longitudinal spinal cord atrophy in multiple sclerosis
using the generalized boundary shift integral. Ann Neurol 86:704–713. https://doi.org/10.1002/
ana.25571
16. Ma S, Huang Y, Che X, Gu R (2020) Faster RCNN-based detection of cervical spinal cord
injury and disc degeneration. J Appl Clin Med Phys 21. https://doi.org/10.1002/acm2.13001
17. Pai SA, Zhang H, Shewchuk JR et al (2020) Quantitative identification and segmentation
repeatability of thoracic spinal muscle morphology. JOR Spine 3(3):e1103. Published 2020 Jul
1. https://doi.org/10.1002/jsp2.1103
18. Azzarito M, Kyathanahally SP, Balbastre Y et al (2021) Simultaneous voxel-wise analysis of
brain and spinal cord morphometry and microstructure within the SPM framework. Hum Brain
Mapp 42:220–232. https://doi.org/10.1002/hbm.25218
19. Majidpoor J, Mortezaee K, Khezri Z et al (2021) The effect of the segment of spinal cord
injury on the activity of the nucleotide-binding domain-like receptor protein 3 inflammasome
and response to hormonal therapy. Cell Biochem Funct 39(2):267–276. https://doi.org/10.1002/
cbf.3574
20. Maidawa SM, Ali MN, Imam J, Salami SO, Hassan AZ, Ojo SA (2021) Morphology of the
spinal nerves from the cervical segments of the spinal cord of the African giant rat (Cricetomys
Gambianus). Anat Histol Embryol 50(2):300–306. https://doi.org/10.1111/ahe.12630
21. Malathy V, Anand M, Dayanand Lal N et al (2020) Segmentation of spinal cord from computed
tomography images based on level set method with Gaussian kernel. Soft Comput 24:18811–
18820. https://doi.org/10.1007/s00500-020-05113-1
22. Sabaghian S, Dehghani H, Batouli SAH et al (2020) Fully automatic 3D segmentation of
the thoracolumbar spinal cord and the vertebral canal from T2-weighted MRI using K-means
clustering algorithm. Spinal Cord 58:811–820. https://doi.org/10.1038/s41393-020-0429-3
23. Zhang X, Li Y, Liu Y et al (2021) Automatic spinal cord segmentation from axial-view MRI
slices using CNN with grayscale regularized active contour propagation. Comput Biol Med
132:104345. https://doi.org/10.1016/j.compbiomed.2021.104345
24. A deep learning method with residual blocks for automatic spinal cord segmentation in planning
CT. https://www.sciencedirect.com/science/article/abs/pii/S1746809421006716
25. Subramanya Jois SP, Sridhar H, Harish Kumar JR (2018) A fully automated spinal cord segmentation. 2018 IEEE global conference on signal and information processing (GlobalSIP), pp
524–528. https://doi.org/10.1109/GlobalSIP.2018.8646682
26. Hasane S, Rajesh V, Rahman MZU (2019) Fast and accurate feature extraction-based segmen-
tation framework for spinal cord injury severity classification. IEEE Access 7:46092–46103.
https://doi.org/10.1109/ACCESS.2019.2909583
27. Rehman F, Ali Shah SI, Riaz N, Gilani SO (2019) A robust scheme of vertebrae segmentation for
medical diagnosis. IEEE Access 7:120387–120398. https://doi.org/10.1109/ACCESS.2019.
2936492
28. Kim DH, Jeong JG, Kim YJ et al (2021) Automated vertebral segmentation and measurement
of vertebral compression ratio based on deep learning in X-ray images. J Digit Imaging 34:853–
861. https://doi.org/10.1007/s10278-021-00471-0
29. Perone C, Calabrese E, Cohen-Adad J (2018) Spinal cord gray matter segmentation using deep
dilated convolutions. Sci Rep 8. https://doi.org/10.1038/s41598-018-24304-3
30. Ahammad SH, Rajesh V, Rahman MZU (2019) Fast and accurate feature extraction-based
segmentation framework for spinal cord injury severity classification. IEEE Access 7:46092–
46103. https://doi.org/10.1109/ACCESS.2019.2909583
31. Valarmathi G, Devi S (2021) Human vertebral spine segmentation using particle swarm
optimization algorithm. https://doi.org/10.1007/978-981-16-0669-4_7
32. Punarselvam E, Suresh P (2019) Investigation on human lumbar spine MRI image using finite
element method and soft computing techniques. Cluster Computing 22. https://doi.org/10.1007/
s10586-018-2019-0
Neurological Disease Prediction Based on EEG Signals Using Machine Learning Approaches
Zahraa Maan Sallal and Alyaa A. Abbas
Abstract Diagnostics and prognoses of brain disorders can be greatly aided by
machine learning. To bring these tools into clinical routine, we argue that key challenges remain to be addressed by the community: interpretable models are needed to overcome the limitations of black-box approaches, and shortcomings in validation and reproducible research practices must be remedied. Extensive generalization studies are also required. Brain diseases are among the most prevalent disorders, and many people die from them each year; the death toll continues to rise and is estimated to reach 75 million by 2030. Even with modern technology and advanced healthcare systems, brain diseases remain hard to predict. In our paper, we use machine learning algorithms to
implement the neurological disease prediction approach since such algorithms are
a critical source of data prediction. MIT-BIH repository data comprising a variety
of patients were used as the database. Based on the classifiers utilized, the findings
have proven that the RDF produced the most accurate result with 96.32% accuracy.
Keywords Machine learning classifier · Brain disease prediction · Kernel perceptron · Random decision forest (RDF)
1 Introduction
Globally, there are 17 million deaths a year caused by brain diseases. In developed
countries, death rates are also alarming, even though low- and middle-income coun-
tries account for three-quarters of all deaths [1]. A staggering 35% of global deaths
are caused by neurological diseases, according to the Centers for Disease Control
and Prevention. Various races, classes, and age groups also experience this issue.
Z. M. Sallal (B)
General Directorate of Education in Al-Qadisiyah Governorate/Ministry of Education,
Al Diwaniyah, Iraq
e-mail: zahraa.m199021@gmail.com
A. A. Abbas
General Directorate of Education in Al-Muthana Governorate, Ministry of Education, Samah, Iraq
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_20
While medical science has made incredible advances across the globe, it has yet to
be possible to prevent different types of brain diseases. Between 1986 and 2006, the
WHO reported an increase of 4184% in brain disease deaths, whereas dysentery and
respiratory infections were reduced by 82% and 86%, respectively [2,3]. Neuro-
logical disease is alarming because the majority of people who suffer from it are in
the most productive years of their lives. About 40% of brain disease in developing
countries occurs before 50 years of age, whereas 25% of the same disease occurs
before 40 [4]. Developing and low-income countries lack basic healthcare facilities, which leads to untreated brain disease and costly medications that push families into poverty. Even though there are different detection methods for neurological diseases,
the most challenging aspect is predicting their presence and detecting them in the
body [5]. To reduce the death rate and reduce economic vulnerability for all fami-
lies, it is crucial to predict brain diseases, which will eventually assist policymakers
in taking appropriate action against neurological diseases. The remaining sections of this manuscript are arranged as follows: Sect. 2 reviews the related literature, Sect. 3 presents the materials and methodology, and Sect. 4 explains methods for improving EEG-based disease classification.
2 Literature-Related Works
A common problem worldwide is brain disease. Every year, brain diseases cause
thousands of deaths. In part, this is due to the difficulty of analyzing clinical data
in the field of brain disease [6]. This manuscript uses machine learning to predict
brain diseases. A classification technique was used in this prediction model. Several
combinations of features were taken into account. It has been classified using several
methods, including preprocessing, feature selection, feature reduction, decision trees,
language models, support vector models, and RDF. Based on a hybrid RDF with a
linear model (HRFLM), the brain disease prediction model achieved 88.7% accuracy.
Around the world, one out of three people dies from brain diseases today [7,8]. Due
to the complexity of the task, medical practitioners often have difficulty predicting
brain disorders. Additionally, some health-sector information necessary for making informed decisions is hidden from the public; brain diseases have been predicted using a model [9]. The algorithms used in this research include J48, Naive Bayes, RepTree,
CART, and Bayes Net for predicting brain disorders. As a final result of the research,
99% of the predictions were accurate. Data mining was also demonstrated to be
beneficial for a sector of health in predicting patterns in the dataset [10,11]. A
model for detecting symptoms that can prevent heat stroke at an early age has been developed from an investigation, as its rate has increased day by day. The authors proposed an application that would take inputs such as age, gender, and pulse rate and predict brain diseases. Brain diseases were identified using machine
learning algorithms and neural networks [12,13]. Various factors contribute to the
rise of brain diseases. Healthcare providers collect a lot of data every day. However,
they do not use machine learning and pattern recognition techniques, which limit
their ability to make predictions. To predict the future, we presented a model [9].
MIT-BIH repository data and attributes were collected in this manuscript [14]. To
predict brain disease, they used these data. Several ANN techniques were used to
develop this technology. The accuracy rate for ANNs was 94.7%, but PCAs had a
97.7% accuracy rate. The MIT-BIH repository was used to gather the information
for prediction [10]. A prediction model was developed using 1025 instances with
14 attributes [15]. By using tree-based algorithms such as M5P, random trees, and
RDF ensembles, this research explored the accuracy, precision, and sensitivity of
the tree-based classification algorithm. All prediction algorithms were applied after
the selection of features from the patient’s brain dataset. Some of the methods used
in the study include Pearson correlation, recursive feature elimination, and Lasso
regularization.
We used three experimental setups to complete this analysis. In the first experiment, Pearson correlation-based feature selection was applied with the M5P, random tree, reduced error pruning, and RDF ensemble classifiers. As part of the second
experiment, the four tree-based algorithms were combined with a recursive feature
elimination algorithm. Additionally, the tree-based algorithms were combined with
Lasso regularization.
As a result of this experiment, we calculated the accuracy, precision, and sensitivity
of the classification. Using Pearson correlation and Lasso regularization in combi-
nation with RDF ensemble methods, they achieved an accuracy of 99% [16,17].
Preprocessing, feature extraction, and classification are performed on EEG signals.
To predict epileptic seizures using machine learning methods, researchers can place
electrodes on patients’ scalps and record their scalp EEG signals. The use of scalp
EEG signals has been proposed to predict epileptic seizures by numerous researchers
in recent years [5–14]. The method involves preprocessing EEG signals, identifying
features, and categorizing preictal and interictal states using those features [18].
3 Materials and Methodology
A schematic of our proposed methodology can be seen in Fig. 1. There are detailed
instructions provided for each step. Our machine learning approach is based on the
MIT-BIH dataset repository. The proposed system includes the following steps: management of datasets, collection of data features, preprocessing, feature selection, classification of instances, comparison of classifier performance, and finally acquisition of the results. The accuracy rate for our EEG dataset was examined
using four machine learning techniques. The accuracy rate was improved after eval-
uating the performance. A confusion matrix has been visualized for each machine
learning technique to check the validity of the experimental model in the following
sections. After preprocessing and cleaning the data, our experiment retains nine attributes. In total, fourteen attributes were present, but not all were kept because some of them were uninformative.
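The confusion-matrix check mentioned above can be sketched in plain Python; this is an illustrative sketch only, not the paper's code, and the helper names are hypothetical:

```python
def confusion_matrix(y_true, y_pred, labels=(0, 1)):
    # rows index the true label, columns the predicted label
    index = {lab: i for i, lab in enumerate(labels)}
    m = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        m[index[t]][index[p]] += 1
    return m

def accuracy(cm):
    # fraction of samples on the diagonal of the confusion matrix
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total
```

Visualizing one such matrix per classifier, as the text describes, makes it easy to see which class each technique confuses most often, beyond the single accuracy number.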
Fig. 1 Proposed methodology
3.1 EEG Data Preprocessing
Due to noise introduced during acquisition, EEG signals suffer from poor signal-to-noise ratios, which prevents correct classification between interictal and preictal states. Different types of noise can affect EEG signals beyond the 50–60 Hz power-line noise [19, 20]. Baseline noise is also added as a result of interference between multiple electrodes, as are electrical activities associated with human activity, such as eye movements and heartbeats [21].
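Power-line interference of the kind described above is usually handled with a notch (band-stop) filter. The minimal sketch below instead subtracts a least-squares fit of the 50 Hz component, which behaves like a narrow notch for a stationary interferer; the function name and approach are illustrative assumptions, not the paper's implementation:

```python
import math

def remove_powerline(signal, fs, f0=50.0):
    # Project the signal onto cos/sin at f0 and subtract that component,
    # acting as a simple notch for stationary power-line interference.
    n = len(signal)
    w = 2.0 * math.pi * f0 / fs
    c = 2.0 / n * sum(s * math.cos(w * i) for i, s in enumerate(signal))
    d = 2.0 / n * sum(s * math.sin(w * i) for i, s in enumerate(signal))
    return [s - c * math.cos(w * i) - d * math.sin(w * i)
            for i, s in enumerate(signal)]
```

In practice a proper IIR notch (for example `scipy.signal.iirnotch`) combined with a bandpass filter would be used, as the text's discussion of filtering suggests.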
3.2 EEG Data Sourcing
The dataset used for this experiment comes from the UCI Machine Learning Repository; PhysioBank, along with the original MIT-BIH dataset, was the source of these data [22]. Over 80 h of recordings are included in the database, in which ECG, EEG, and respiration signals are annotated to reflect polysomnographic sleep stages and apneas. The signals' and annotations' section provides more information [23].
3.3 Processing of EEG Data
In this study, we analyzed data from 200 neurological patients and 150 non-neurological subjects. Moreover, within the neurological patient dataset, different numbers of females and males were included for analysis purposes [24].
3.4 JupyterLab Tool and Python Language
The experiments were implemented using Python 3.7 as the programming language. User
notebooks, code, and data can be created using JupyterLab, a web-based interac-
tive development environment. It provides a flexible interface for configuring work-
flows for data science, scientific computing, computational journalism, and machine
learning. Jupyter Notebook began life as an online application for creating and sharing
computational documents [25]. Under Jupyter Lab, the tool works properly. Aside
from being simple and streamlined, it offers a unique user experience that focuses on
signal processing. It is necessary to preprocess EEG signals to reduce noise so that
the signal-to-noise ratio can be increased to improve the classification. The SNR can
be increased by using a variety of preprocessing techniques. A low-pass/high-pass
filter can be used to remove other types of noise as well as a bandpass/band-stop
filter to remove other types of noise [26].
The methodology steps can be summarized as follows:
Step 1: Acquire raw EEG signals from a headset or from signals stored in the MIT-BIH dataset.
Step 2: Preprocessing phase, which includes eliminating signal noise using various filters.
Step 3: Select the denoised EEG signals.
Step 4: Extract features from the EEG signals using the power spectrum and wavelet transform.
Step 5: Check the quality of the EEG signals based on performance measures.
Step 6: Test the accuracy of the EEG signals to implement the classification phase.
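Steps 2 and 4 above (noise elimination and power-spectral features) can be sketched with numpy alone; this is a minimal illustration, assuming an FFT-based band-pass as the filter and an illustrative 256 Hz sampling rate, since the chapter does not specify the exact filter design.

```python
import numpy as np

def fft_bandpass(x, fs, lo=1.0, hi=40.0):
    """Crude band-pass (Step 2): zero FFT bins outside [lo, hi] Hz."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    spectrum[(freqs < lo) | (freqs > hi)] = 0.0
    return np.fft.irfft(spectrum, n=len(x))

def band_power(x, fs, lo, hi):
    """Power-spectral feature (Step 4): summed power inside [lo, hi] Hz."""
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return power[(freqs >= lo) & (freqs <= hi)].sum()

# A toy "EEG" trace: a 10 Hz alpha component plus 50 Hz power-line noise
fs = 256
t = np.arange(0, 2, 1.0 / fs)
eeg = np.sin(2 * np.pi * 10 * t) + 0.8 * np.sin(2 * np.pi * 50 * t)
clean = fft_bandpass(eeg, fs)   # the power-line component is removed
```

The band powers of the filtered trace then serve as features for the classifiers of the next section.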
4 Methods for Improving EEG-Based Disease Classification
Various optimization techniques have been suggested, such as Particle Swarm Opti-
mization, Discrete Particle Swarm Optimization, and Fractional Order Discrete
Particle Swarm Optimization. Additionally, pixel values, noise levels, and boundary
positions can be used to study the region of interest [27].
4.1 Random Decision Forest (RDF)
The optimization techniques suggested here likewise include Particle Swarm Optimization, Discrete Particle Swarm Optimization, and Fractional Order Discrete Particle Swarm Optimization, and pixel values, noise levels, and boundary positions can also be studied to determine the area of interest [28–30].
4.1.1 RDF Algorithm Steps
The RDF algorithm is explained in the following steps:
Start.
Stage One: For each training set, randomly select a set of samples.
Stage Two: For each sampled training set, the algorithm constructs a decision tree.
Stage Three: Each decision tree casts a vote, and the votes are aggregated.
Stage Four: Choose the most-voted prediction result as the final prediction outcome [31].
End.
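The four stages can be sketched compactly in numpy; in this illustration, depth-one decision stumps stand in for the full decision trees of [31], and the data, forest size, and tree depth are illustrative assumptions, not the chapter's configuration.

```python
import numpy as np

def train_forest(X, y, n_trees=25, seed=0):
    rng = np.random.default_rng(seed)
    stumps = []
    for _ in range(n_trees):
        # Stage One: randomly select a bootstrap sample of the training set
        idx = rng.integers(0, len(X), len(X))
        Xb, yb = X[idx], y[idx]
        # Stage Two: fit one "tree" (here a depth-1 stump) to that sample
        best = None
        for f in range(X.shape[1]):
            for t in np.unique(Xb[:, f]):
                pred = (Xb[:, f] > t).astype(int)
                for flip in (False, True):
                    acc = ((1 - pred if flip else pred) == yb).mean()
                    if best is None or acc > best[0]:
                        best = (acc, f, t, flip)
        stumps.append(best[1:])
    return stumps

def predict_forest(stumps, X):
    # Stage Three: collect one vote per tree
    votes = np.zeros(len(X))
    for f, t, flip in stumps:
        pred = (X[:, f] > t).astype(int)
        votes += (1 - pred) if flip else pred
    # Stage Four: the most-voted class is the final prediction
    return (votes > len(stumps) / 2).astype(int)

# Toy two-class data: class 0 near the origin, class 1 near (4, 4)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(4, 1, (40, 2))])
y = np.array([0] * 40 + [1] * 40)
model = train_forest(X, y)
```

A production pipeline would use full decision trees (e.g., CART) in Stage Two; the bootstrap-then-vote structure is what defines the RDF.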
4.1.2 Kernel Perceptron Algorithm
The Kernel perceptron is based on the perceptron algorithm invented by Frank Rosenblatt. The calculation exploits information with large margins [32]. In comparison with Vapnik's SVM, this method is less complex to use and considerably quicker to compute. Besides using kernel functions in high-dimensional spaces, the calculation can also be applied in low-dimensional spaces [33, 34].
The following steps describe the Kernel perceptron procedure:
Create a random line (or create a random score for each word, and a random bias).
Perform the following process n times:
Select a random point.
Apply the perceptron update to the point and the line.
If the point is well classified, ignore it.
If the point is misclassified, move the line closer to it.
Finally, take advantage of the fitted line [30, 34–38].
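The update loop above can be kernelized by counting mistakes per training point; here is a minimal numpy sketch, assuming an RBF kernel and toy XOR-style data (the chapter does not specify a kernel).

```python
import numpy as np

def rbf(X, z, gamma=1.0):
    """RBF kernel between each row of X and a single point z."""
    return np.exp(-gamma * np.sum((X - z) ** 2, axis=-1))

def train_kernel_perceptron(X, y, epochs=50, gamma=2.0, seed=0):
    rng = np.random.default_rng(seed)
    alpha = np.zeros(len(X))                 # per-point mistake counts
    for _ in range(epochs):
        for i in rng.permutation(len(X)):    # select a random point
            score = np.sum(alpha * y * rbf(X, X[i], gamma))
            if np.sign(score) != y[i]:       # misclassified: move the boundary
                alpha[i] += 1                # (well-classified points are ignored)
    return alpha

def kp_predict(alpha, X, y, Xnew, gamma=2.0):
    return np.array([np.sign(np.sum(alpha * y * rbf(X, x, gamma))) for x in Xnew])

# XOR-style data that no single straight line can separate
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1., 1., 1., -1.])
alpha = train_kernel_perceptron(X, y)
```

Because the decision function depends only on kernel evaluations, the "line" lives in the kernel's feature space, which is why the method handles data a plain perceptron cannot.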
5 Findings and Discussion
Various analyses were carried out to obtain the experimental results. All of the raw EEG data were used in the experiment, split into approximately 70% training data and 30% test data. The following table shows the accuracy levels obtained on the test dataset (Table 1).
The training and test data show different results. The figure demonstrates a correlation between age and maximum EEG signal rate: the rate of EEG signals increases with age. As the EEG signal is sampled more frequently, the probability of capturing an 'event' increases, although complexity and computation time increase as well. The sampling rate is therefore crucial, especially for recordings longer than three to five minutes, which normal computers otherwise cannot handle. Signals are easier to analyze when the sampling rate is between 128 and 512 Hz.
Each classifier has a confusion matrix shown in the table. Since only 30% of the
Table 1 Test and training dataset accuracy results

Classifier                    Test accuracy (%)   Training accuracy (%)
RDF algorithm                 97.816              98.803
Kernel perceptron algorithm   95.313              95.503
training data was held out as the test dataset, the results of the confusion matrix were obtained from the training dataset. Based on the confusion-matrix results, RDF outperforms the other algorithm. In our calculations, we obtained a true-positive rate of 0.967 and a false-positive rate of 0.300; both cases include the ROC and PRC areas. The age–high-blood-pressure plot shows the relationship between hypertension and age: in our study, high blood pressure was associated with aging. High blood pressure normally occurs in people over the age of 40 or 50, so after reaching one's thirties, one should consult a physician regularly.
6 Conclusion
World health authorities consider neurological diseases to be among the most important health problems. The loss of brain function can be mitigated by using a scientifically based prediction approach, and a machine learning algorithm has been developed here to predict brain disease. RDF and Kernel perceptron classifiers were used in our study, with accuracies of 97.69% and 94.39%, respectively; the RDF classifier thus produced the best results among the classifiers we used. In the future, we will use a more accurate dataset based on technological and scientific advancements in another branch of medicine. Using more classifiers will also improve classification accuracy.
7 Work Constraints and Future Plans
A machine learning approach is capable of greatly assisting with the diagnosis and prognosis of brain disorders, but the community still needs to address key challenges before bringing these tools into clinical practice. Using interpretable models helps to overcome the limitations of black-box approaches concerning validation and reproducibility, and extensive generalization studies are still necessary. Brain diseases remain highly prevalent and are associated with a high death rate each year. Our paper uses machine learning algorithms to implement predictive approaches for neurological diseases, since such algorithms are critical for prediction.
References
1. Michel LV, Jacques M, Michel B, Francisco V (1999) Anticipating epileptic seizures in
real-time by a non-linear analysis of the similarity between EEG recordings. NeuroReport
10(10):2149–2155
2. Florian M, Klaus L, Peter D, Christian EE (2000) Mean phase coherence as a measure for
phase synchronization and its application to the EEG of epilepsy patients. Phys D Nonlinear
Phenom 144(3–4):358–369
3. Vincent N, Jacques M, Michel LV, Stephane CL, Claude A, Michel B, Francisco V (2002)
Seizure anticipation in human neocortical partial epilepsy. Brain 125(3):640–655
4. Wim DC, Philippe L, Sabine VH, Wim VP (2003) Anticipation of epileptic seizures from
standard EEG recordings. The Lancet 361(9361):971
5. Mary AF, Mark GF, Ivan O (2005) Accumulated energy revisited. Clin Neurophysiol
116(3):527–531
6. Mary AF, Ivan O, Mark GF, Srividhya A, Ying-Cheng L (2005) Correlation dimension and
integral do not predict epileptic seizures. Chaos: Interdisc J Nonlinear Sci 15(3):033106
7. Michel LQ, Jason S, Vincent N, Richard R, Mario C, Michel B, Jacques M (2005) Preictal
state identification by synchronization changes in long-term intracranial EEG recordings. Clin
Neurophysiol 116(3):559–568
8. Iasemidis LD, Shiau DS, Panos MP, Wanpracha C, Narayanan K, Awadhesh P, Tsakalis K,
Carney PR, Sackellares JC (2005) Long-term prospective online real-time seizure prediction.
Clin Neurophysiol 116(3):532–544
9. Chaovalitwongse W, Iasemidis LD, Pardalos PM, Carney PR, Shiau DS, Sackellares JC (2005)
Performance of a seizure warning algorithm based on the dynamics of intracranial EEG.
Epilepsy Res 64(3):93–113
10. Piotr M, Deepak M, Yann L, Ruben K (2009) Classification of patterns of EEG synchronization
for seizure prediction. Clin Neurophysiol 120(11):1927–1940
11. Salant Y, Gath I, Henriksen O (1998) Prediction of epileptic seizures from two-channel EEG.
Med Biol Eng Comput 36(5):549–556
12. Wim VD, Sujatha N, David MF, Michael HK, Vernon LT, Hyong CL, Arnetta BM, Maria SC,
Kurt EH (2003) Seizure anticipation in pediatric epilepsy: use of Kolmogorov entropy. Pediatr
Neurol 29(3):207–213
13. Klaus L, Brian L (2005) The first international collaborative workshop on seizure prediction:
summary and data description. Clin Neurophysiol 116(3):493–505
14. Florian M, Thomas K, Ralph GA, Peter D, Klaus L, Christian EE (2003) Epileptic seizures are
preceded by a decrease in synchronization. Epilepsy Res 53(3):173–185
15. Levin K, Philippa K, Dean R, Freestone BH, Andriy T, Alexandre B, Feng L, Gilberto T,
Brian WL, Daniel L et al (2018) Epilepsyecosystem.org: crowd-sourcing reproducible seizure
prediction with long-term human intracranial EEG. Brain 141(9):2619–2630
16. Rajendra A, Yuki H, Hojjat A (2018) Automated seizure prediction. Epilepsy Behav 88:251–
261
17. Yannic R, Hubert B, Isabela A, Alexandre G, Tiago HF, Jocelyn F (2019) Deep learning-based
electroencephalography analysis: a systematic review. J Neural Eng
18. Gen L, Chang HL, Jason JJ, Young C, David C. Deep learning for EEG data analytics: a survey.
Concurrency and computation: practice and experience, p e5199
19. Kuhlmann L, Lehnertz K, Richardson MP, Schelter B, Zaveri HP (2018) Seizure prediction—
ready for a new era. Nat Rev Neurol 1
20. Abd Ali DM, Chalob DF, Khudhair AB (2022) Networks data transfer classification based on
neural networks. Wasit J Comput Math Sci 1(4):207–225
21. James W, Eve AG (2006) Rapid review neuroscience e-book. Elsevier Health Sciences
22. Matthew DL (2000) Intuition: a social cognitive neuroscience approach. Psychol Bull
126(1):109
23. Terrence JS, Christof K, Patricia SC (1988) Computational neuroscience. Science
241(4871):1299–1306
24. Adeel R, Karl JF (2016) The connected brain: causality, models, and intrinsic dynamics. IEEE
Signal Process Mag 33(3):14–35
25. Sonya BD, Jaqueline AF, Christophe B, Gregory AW, Brandy EF (2017) Seizure forecasting
from idea to reality outcomes of my seizure gauge epilepsy innovation institute workshop.
Eneuro 4(6)
26. Viglione S, Walsh GO (1975) Proceedings: epileptic seizure prediction. Electroencephalogr
Clin Neurophysiol 39(4):435–436
27. Rogowski Z, Gath I, Bental E (1981) On the prediction of epileptic seizures. Biol Cybern
42(1):9–15
28. Heino HL, Jeffrey PL, Jerome E, Paul HC (1983) Temporo-spatial patterns of pre-ictal spike
activity in human temporal lobe epilepsy. Electroencephalogr Clin Neurophysiol 56(6):543–
555
29. Gotman J, Marciani MG (1985) Electroencephalographic spiking activity, drug levels, and
seizure occurrence in epileptic patients. Ann Neurol: Official J Am Neurol Assoc Child Neurol
Soc 17(6):597–603
30. Kostas MT, Vasileios CP, Michalis Z, Spiros K, Dimitrios DK, Dimitrios IF (2018) A long
short-term memory deep learning network for the prediction of epileptic seizures using EEG
signals. Comput Biol Med 99:24–37
31. Angela AB, Benno G, Maurizio S, Carlo AT, Niels B, Guido R (2008) Permutation entropy to
detect vigilance changes and preictal states from scalp EEG in epileptic patients. A preliminary
study. Neurol Sci 29(1):3–9
32. Haidar K, Lara M, Madeline F, Kalina S, Bulent Y (2017) Focal onset seizure prediction using
convolutional networks. IEEE Trans Biomed Eng 65(9):2109–2118
33. Xiaoli L, Gaoxian O, Douglas AR (2007) Predictability analysis of absence seizures with
permutation entropy. Epilepsy Res 77(1):70–74
34. Ramy H, Mohamed OA, Rabab W, Jane W, Levin K, Yi G (2019) Human intracranial EEG
quantitative analysis and automatic feature learning for epileptic seizure prediction. arXiv
preprint arXiv:1904.03603
35. Tom H (1999) Energy functions for self-organizing maps. In: Kohonen maps. Elsevier, pp
303–315
36. Nhan DT, Anh DN, Levin K, Mohammad RB, Jiawei Y, Omid K (2017) A generalized seizure
prediction with convolutional neural networks for intracranial and scalp electroencephalogram
data analysis. arXiv preprint arXiv:1707.01976
37. Butler K (2022) An Automation system over cloud by using internet of things applications: an
automation system over cloud by using internet of things applications. Wasit J Comput Math
Sci 1(4):27–33
38. Abdulbaqi A, Younis M, Younus Y, Obaid A (2022) A hybrid technique for EEG signals
evaluation and classification as a step towards to neurological and cerebral disorders diagnosis.
Int J Nonlinear Anal Appl 13(1):773–781. https://doi.org/10.22075/ijnaa.2022.5590
39. Matias IM, Christian M, Katrina D, Philippa JK, Wendyl D, David BG, Anthony NB, Premysl
J, Jan K, Jaroslav H et al (2019) Critical slowing as a biomarker for seizure susceptibility.
bioRxiv, p 689893
Watermarking System Using DWT
and SVD
Fatima M. Khudair, Asaad N. Hashim, and Mohammed Jameel Alsalhy
Abstract Information hiding has garnered significant attention from researchers
over the past two decades, due to its growing importance in securing visual applica-
tions. As a consequence, watermarking has become a focal point in numerous studies
for its ability to protect sensitive data from unauthorized access, copying, manipula-
tion, and infringement of copyrights or property rights. Watermarks can be applied
to a wide range of mediums, including texts, documents, books, audio, video, and
images. Various watermarking techniques exist, such as the discrete Fourier trans-
form (DFT), the discrete cosine transform (DCT), the discrete wavelet transform
(DWT), the singular value decomposition (SVD), deep learning, and other
methods, each of which has its own set of benefits and drawbacks. In this paper,
we propose a novel algorithm that combines the strengths of both SVD and DWT
approaches to enhance watermarking performance. This innovative watermarking
technique yields accurate results and has demonstrated exceptional performance
metrics, as evidenced by signal-to-noise ratio (SNR) and peak signal-to-noise ratio
(PSNR) measurements.
Keywords DWT · SVD · Hiding information · Watermarking
F. M. Khudair ·A. N. Hashim
Faculty of Computer Science and Mathematics, University of Kufa, Kufah, Iraq
e-mail: fatimam.alkaabi@student.uokufa.edu.iq
A. N. Hashim
e-mail: asaad.alshareefi@uokufa.edu.iq
M. J. Alsalhy (B)
National University of Science and Technology, Thi-Qar, Nasiriyah, Iraq
e-mail: Sahi@nust.edu.iq
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_21
273
274 F. M. Khudair et al.
1 Introduction
Information hiding (or data hiding) is a broad phrase that encompasses a variety of issues beyond the embedding of messages in material. The word "hiding" may refer both to keeping information imperceptible (watermarking) and to keeping it secret (steganography). Watermarking and steganography are two significant subfields of information concealment that are linked and may overlap but have distinct underlying characteristics, needs, and designs, resulting in distinctive technological solutions [1].
2 Related Work
Mahto et al. [2] provide a comprehensive overview of watermarking standards, measurements, and applications and summarize the main watermarking methods. Garg et al. [3] use DWT, SVD, entropy, and pixel-shift algorithms to obtain the strength and imperceptibility that give a watermark its security advantages. To achieve the security attribute, they performed the embedding at prearranged pixel locations so that an attacker cannot extract it by a simple linear scan. In addition, the PSNR and the NC were computed to evaluate the performance of the method, which attains a PSNR better than 40 dB. Based on experimental evidence, this system is resilient against a wide range of attacks, including filtering, noise, compression, and others. Garg et al. [4]: Fuzzy entropy, discrete cosine transforms, and
image scrambling in the hybrid domain are all explored as potential components of a
good, secure, and robust method. The results are very impressive, providing good PSNR and MSE values. They also applied fresh attacks to a watermarked image, which still gives a very good NC value. This shows that the proposed scheme provides imperceptibility and the required robustness [4]. Lakshmi [5] presents a new image steganography technique based on SVD and DWT. Compared
to existing techniques, the suggested methodology outperforms them. The quality of
stego pictures is evaluated using picture metrics like PSNR and SC [5]. Begum et al.
[6]: This paper gives an in-depth look at the typical architectures of watermarking
systems, discusses the latest developments in the field, and enumerates some of the
more common criteria that are taken into account when developing watermarking
methods for various uses, so that contemporary techniques and the constraints they face can be understood. Some common attacks are covered, and suggestions for further study are offered [6]. Mohanarathinam et al. [7] discuss watermarking methods together with their advantages and disadvantages [7]. Boenisch [8]: Models strongly marked with digital watermarks can be secured against attacks such as model theft. The survey is not only a guide for selecting the
Watermarking System Using DWT and SVD 275
proper approach to a particular situation, but may also be a starting point for devising
new processes that overcome limitations and therefore advance the subject. The literature surveyed there offers a comprehensive overview of watermarking methods and attacks [8].
Khadam et al. [9]: Copyright protection, authentication, and ownership verification
have all been suggested using digital watermarking technologies. To identify appro-
priate characteristics from the document to add the watermark, data mining methods
are used. The suggested technique is resilient and resistant to coordinated attacks. Twenty distinct text documents are utilized to assess the proposed technique [9]. Artru
et al. [10]: Digital watermarking is a broad element of information security because
of its uses, characteristics, and designs. The aim is to establish the ideal equilibrium
point between invisibility, resilience, and efficiency in an application. Finding the
equilibrium point is accomplished by locating the watermark characteristics required
and analyzing the threat model the scheme will encounter. Additional information
may apply to the video’s metadata, frames, or particular areas of the frame [10].
In their 2018 paper, Yuki Nagai et al. proposed using digital watermarking to
assert ownership permission for deep neural networks. Their approach involved
addressing the prerequisites, embedding circumstances, and attack types associated
with watermarking deep neural networks. They then presented a generic technique
for embedding a watermark in model parameters. In another 2018 paper, Yuqi He
and Yan Hu discussed a technique for watermarking color images. Their method
utilized discrete wavelet transform (DWT), discrete cosine transform (DCT), and
singular value decomposition (SVD) to transform the host color image from RGB to
YUV color space. They divided the low-frequency component of Y into blocks and
applied SVD using DCT to each block. Finally, they added a watermark to the cover
image. The experimental results showed that their technique was highly resilient and
invisible [11,12].
3 Suggested Methodologies
3.1 Discrete Wavelet Transform
A discrete wavelet transform is a wavelet transform in which the wavelets are discretely sampled. The wavelet transform has many advantages over the Fourier transform, the most important of which is that it gathers both frequency and position information simultaneously. The discrete wavelet transform divides an image into four non-overlapping multi-resolution sub-bands, denoted LL (approximation sub-band), LH (horizontal sub-band), HL (vertical sub-band), and HH (diagonal sub-band), where LH, HL, and HH hold the finest-scale wavelet coefficients and LL holds the coarse-level coefficients. The process can be repeated on the LL sub-band to obtain wavelet decompositions at different scales [13]. This tool is very useful for identifying areas inside the host image where
a watermark may be effectively applied. The utilization of the masking effect of the
human visual system is enabled by this feature. It only affects the region that corre-
sponds to the coefficient that has been altered when a DWT coefficient is changed.
The lower frequency sub-bands LL are often where the bulk of the image energy
is concentrated. As a consequence, adding watermarks in LL sub-bands may cause
significant degradation of the image quality. However, embedding in low-frequency
sub-bands significantly improves resilience in a significant way. When it comes to
the high-frequency sub-bands HH, which contain the image’s borders and textures,
the human eye is less sensitive to changes in these sub-bands than other sub-bands.
A watermark may be included in a picture without it being apparent to the human
eye as a result of using this technique. The process of embedding takes place in the
intermediate frequency bands in order to enhance the strength and imperceptibility
of the watermark and to make it more difficult to detect. LH and HL are two different
things [14].
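The one-level decomposition into LL, LH, HL, and HH can be illustrated with the Haar wavelet; this is a minimal numpy sketch (the chapter does not fix a particular mother wavelet), with the inverse transform included to show that the decomposition is lossless.

```python
import numpy as np

def haar_dwt2(img):
    """One-level 2-D Haar DWT -> LL, LH, HL, HH sub-bands (orthonormal)."""
    a = img[0::2, :] + img[1::2, :]        # row-pair sums (low-pass)
    d = img[0::2, :] - img[1::2, :]        # row-pair differences (high-pass)
    LL = (a[:, 0::2] + a[:, 1::2]) / 2     # approximation
    LH = (a[:, 0::2] - a[:, 1::2]) / 2     # horizontal detail
    HL = (d[:, 0::2] + d[:, 1::2]) / 2     # vertical detail
    HH = (d[:, 0::2] - d[:, 1::2]) / 2     # diagonal detail
    return LL, LH, HL, HH

def haar_idwt2(LL, LH, HL, HH):
    """Invert haar_dwt2 exactly."""
    h, w = LL.shape
    a = np.empty((h, 2 * w)); d = np.empty((h, 2 * w))
    a[:, 0::2], a[:, 1::2] = LL + LH, LL - LH
    d[:, 0::2], d[:, 1::2] = HL + HH, HL - HH
    img = np.empty((2 * h, 2 * w))
    img[0::2, :], img[1::2, :] = (a + d) / 2, (a - d) / 2
    return img
```

For a smooth image most of the energy lands in LL, which is why embedding there degrades quality, while LH and HL carry the mid-frequency content that the text recommends for embedding.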
3.2 Singular Value Decomposition (SVD)
Picture decomposition at the level of a single value is a special case of generic
linear algebra [15], where the picture is represented by a special diagonal matrix
whose energy is focused in a small number of values. Since SVD is robust and
efficient at decomposing the structure into a set of linearly independent constituents,
each of which contains its own energy presentation, and since it compresses the
maximum amount of signal energy into the fewest possible coefficients, it is one of
the most potent linear algebra analysis techniques. SVD has proven to be a useful
tool for many image processing tasks, including encoding, signal enhancement, and
filtering. SVD is widely used in watermarking due to its ability to successfully conceal
the watermark whenever there is a significant change in a single value. If M is a
real matrix, then SVD can decompose it into a product of three additional matrices
[16]. The SVD is used to obtain the singular-value coefficients and is quite robust. The core idea behind singular value decomposition (SVD) is to condense a high-dimensional, variable data set into a two-dimensional space [17]. Any matrix A in R^(m*n) may be decomposed into A = U S V^T, where U is an m*n matrix with orthonormal columns, S is an n*n diagonal matrix with non-negative entries, and V^T is an n*n orthonormal matrix [18] (Fig. 1).
Fig. 1 Graphical presentation of SVD [19]
A = U S V^T = [u_1, u_2, ..., u_N] diag(λ_1, λ_2, ..., λ_N) [v_1, v_2, ..., v_N]^T = Σ_{i=1}^{r} λ_i u_i v_i^T,

where r denotes A's rank (r ≤ N); U and V are N×N orthogonal matrices whose column vectors, u_i and v_i, denote A's left and right singular vectors, respectively [20].
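The decomposition and its rank-one expansion can be checked directly with numpy (an illustrative matrix; note that `np.linalg.svd` returns V^T rather than V):

```python
import numpy as np

A = np.array([[3., 1., 1.],
              [-1., 3., 1.]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U @ diag(s) @ Vt
# Rebuild A as the sum of rank-one terms  lambda_i * u_i * v_i^T
A_rec = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))
```

The singular values come back non-negative and sorted in decreasing order, which is the property the embedding algorithms below rely on.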
4 The Proposed System
Using the proposed method, a watermark is made with the DWT method followed by SVD. Several methods of watermarking have been discussed for this system, and we have chosen the one with the best results. We summarize this system in three approaches: the first uses SVD alone, the second uses DWT alone, and the third combines SVD and DWT, which gives excellent and elegant results. Preprocessing includes several steps: converting the image to grayscale, converting it to double precision, and resizing it to suit the algorithms.
4.1 Singular Value Decomposition
An algorithm will be explained here, which is one of the important algorithms in
image processing (Fig. 2).
Fig. 2 Embed watermark (SVD)
Algorithm (1): Embed watermark (SVD).
Input: cover image (g), watermark image (O)
Output: watermarked image (H)
Begin:
Step 1. Read the cover image (g).
Step 2. Convert the image to grayscale (g).
Step 3. Convert the image to double precision (g).
Step 4. Resize the image to suit the SVD algorithm (i).
Step 5. Apply SVD to the cover image (i): Ui * Si * VTi.
Step 6. Read the watermark image (O).
Step 7. Convert the image to grayscale (O).
Step 8. Convert the image to double precision (O).
Step 9. Resize the image to suit the SVD algorithm (O2).
Step 10. Apply SVD to the watermark image (O2): UO * SO * VTO.
Step 11. Change Si, the singular values of the cover image (i), by adding the singular values of the watermark image (O2) scaled by the factor α: K = Si + α * SO.
Step 12. Rebuild using SVD, replacing Si with K from Step 11: H = Ui * K * VTi.
Step 13. Display the watermarked image.
End.
Algorithm (2): Watermark extracting (SVD).
Input: watermarked image (H)
Output: extracted watermark (recovered image)
Begin:
Step 1. Apply SVD to the watermarked image: H = U1 * S1 * VT1.
Step 2. Extract the watermark's singular values: SW* = (S1 − Si) / α.
Step 3. Using the left and right singular vectors (UO and VTO) of the watermark from the embedding algorithm, reconstruct the recovered watermark: W* = UO * SW* * VTO.
Step 4. Display the watermark image.
End
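Algorithms (1) and (2) can be sketched together as a numpy round trip; random arrays stand in for the grayscale cover and watermark images, and, as in Algorithm (2), the cover's singular values Si and the watermark's singular vectors are assumed available at extraction time (a semi-blind scheme).

```python
import numpy as np

def embed_svd(cover, watermark, alpha=0.002):
    Uc, Sc, Vtc = np.linalg.svd(cover)
    Uw, Sw, Vtw = np.linalg.svd(watermark)
    K = Sc + alpha * Sw                  # Step 11: perturb the singular values
    H = Uc @ np.diag(K) @ Vtc            # Step 12: rebuild the watermarked image
    return H, (Sc, Uw, Vtw)              # side information kept for extraction

def extract_svd(H, side_info, alpha=0.002):
    Sc, Uw, Vtw = side_info
    S1 = np.linalg.svd(H, compute_uv=False)
    Sw_rec = (S1 - Sc) / alpha           # Step 2: invert the embedding
    return Uw @ np.diag(Sw_rec) @ Vtw    # Step 3: rebuild the watermark

rng = np.random.default_rng(0)
cover = rng.uniform(0, 255, (16, 16))
mark = rng.uniform(0, 255, (16, 16))
H, side = embed_svd(cover, mark)
recovered = extract_svd(H, side)
```

Because alpha is small, H differs only slightly from the cover, while the watermark is recovered almost exactly from the perturbed singular values.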
Fig. 3 Watermarking based on SVD
4.2 Discrete Wavelet Transform
This algorithm decomposes the image into high- and low-frequency parts, divides the low-frequency part into four sub-bands, and then takes the low-low (LL) sub-band.
The hybrid technique comprises two algorithms, DWT first and then SVD, applied to both the watermark and the cover image.
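Under the assumption of a Haar LL sub-band (the chapter does not name the wavelet), the hybrid embedding can be sketched as SVD embedding applied inside the LL band:

```python
import numpy as np

def haar_ll(img):
    """LL sub-band of a one-level Haar DWT (2x2 block sums, halved)."""
    return (img[0::2, 0::2] + img[0::2, 1::2]
            + img[1::2, 0::2] + img[1::2, 1::2]) / 2

def hybrid_embed_ll(cover, watermark, alpha=0.001):
    """DWT first, then SVD: embed the watermark's singular values in LL."""
    LL = haar_ll(cover)
    Ul, Sl, Vtl = np.linalg.svd(LL)
    Sw = np.linalg.svd(watermark, compute_uv=False)
    K = Sl + alpha * Sw
    return LL, Ul @ np.diag(K) @ Vtl     # original and watermarked LL band

rng = np.random.default_rng(0)
cover = rng.uniform(0, 255, (16, 16))
mark = rng.uniform(0, 255, (8, 8))      # watermark matches the LL band size
LL, LLw = hybrid_embed_ll(cover, mark)
```

A full implementation would run the inverse DWT on the modified LL band to reassemble the watermarked image; the sketch stops at the sub-band to keep the embedding step visible.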
5 Experimental Results
5.1 SVD
Here, the algorithm is implemented with an image size of 2048*2048 and alpha = 0.002 to obtain better results. The factors that affect the implementation of the algorithm are the sizes of the images (watermark and cover) and alpha; reducing the alpha value increases the accuracy of the results (Figs. 3, 4 and 5; Table 1).
5.2 DWT
DWT yields some of the best results, with the best performance among the algorithms, even under attacks. Performance was measured for different image sizes, different alpha values, and different images, i.e., different watermark and cover images (Fig. 6; Table 2).
5.3 Hybrid Technology (DWT and SVD)
This technique merges DWT and SVD. The watermark is concealed and does not appear in any instance, which distinguishes this case regardless of the performance measurements, since the invisibility is extremely strong: SVD provides very good masking and outcomes. This technique is robust, with no distortion, and is the suggested method for this system. The alpha has been applied to multiple pictures (watermark and cover) of various sizes (Fig. 7; Tables 3 and 4).
The two individual approaches are compared with the hybrid method based on the algorithm's execution time.
Fig. 5 Watermark extracting
(DWT and SVD)
Table 1 Watermarking based on SVD

Cases  Host image   Watermark image  Alpha   Size of image  PSNR     SNR
1      Pepper.jpg   KCOM.PNG         0.0020  2048*2048      63.6600  0.0080
2      Baboon.jpg   KCOM.PNG         0.0020  2048*2048      63.6600  0.0099
3      Baboon.jpg   KCOM.PNG         0.0020  512*512        63.6600  0.0099
4      Baboon.jpg   KCOM.PNG         0.0020  64*64          63.9236  0.0099
5      Lena.jpg     1.jpeg           0.0020  2048*2048      57.7488  0.0214
6 Conclusions
Multiple scales have been utilized for watermarking, but the most effective scale is 512*512. The value of alpha is inversely related to SNR and PSNR: if alpha is low, SNR and PSNR are high, whereas if alpha is high,
Fig. 6 Watermarking based on DWT
Table 2 Watermarking based on DWT

Cases  Host image  Watermark image  Alpha   Size of image  PSNR      SNR     Time
1      Pepper.jpg  1.jpeg           0.0020  2048*2048      05.8807   0.0194  4.5473
2      Baboon.jpg  1.jpeg           0.09    2048*2048      72.8164   0.8392  4.6507
3      Baboon.jpg  KCOM.PNG         0.09    512*512        78.7640   0.1842  0.5363
4      Pepper.jpg  IEEE.jpg         0.09    512*512        73.8961   0.5119  0.4558
5      Baboon.jpg  KCOM.PNG         0.001   512*512        117.8488  0.0019  0.8435
Fig. 7 Watermarking based on DWT and SVD
Table 3 Watermarking based on DWT and SVD

Cases  Host image    Watermark image  Alpha   Size of image  PSNR     SNR
1      Lena.jpg      Com.jpg          0.0001  512*512        96.5800  2.3845
2      Pepper.jpg    Com.jpg          0.0001  512*512        96.5800  1.8257
3      Baboon.jpg    Com.jpg          0.0001  512*512        96.5800  2.254
4      1509237.jpeg  Com.jpg          0.0001  512*512        96.5800  2.2500
5      Baboon.jpg    Com.jpg          0.001   2048*2048      76.3245  0.0023
Table 4 Execution time of watermarking based on SVD, DWT, and DWT and SVD

Case  Image cover  Watermarking image  Alpha   Size       Time execution  Method
1     Pepper       KCOM                0.0020  2048*2048  6.2488          DWT
2     Pepper       KCOM                0.0020  2048*2048  24.5778         SVD
3     Pepper       KCOM                0.0020  2048*2048  13.1362         DWT and SVD
4     Pepper       KCOM                0.0020  512*512    0.5089          DWT
5     Pepper       KCOM                0.0020  512*512    0.5243          SVD
6     Pepper       KCOM                0.0020  512*512    0.9548          DWT and SVD
7     Pepper       KCOM                0.0020  48*48      0.2776          DWT
9     Pepper       KCOM                0.0020  48*48      0.2418          SVD
10    Pepper       KCOM                0.0020  48*48      0.2709          DWT and SVD
both SNR and PSNR decrease. Additionally, the order of transformation, whether DWT then SVD or SVD then DWT, impacts the result. When DWT is applied first, followed by SVD, a clear watermark is achieved, whereas applying SVD first and then DWT produces a noisy watermark that is not satisfactory.
References
1. Yusof Y, Khalifa OO (2007) Digital watermarking for digital images using wavelet transform.
In: Proceedings of the 2007 IEEE international conference on telecommunications and Malaysia
international conference on communications (ICT-MICC), pp 665–669. https://doi.org/10.1109/ICTMICC.2007.4448569
2. Mahto DK, Singh AK (2021) A survey of color image watermarking: state-of-the-art and
research directions. Comput Electr Eng 93:107255. https://doi.org/10.1016/j.compeleceng.
2021.107255
3. Garg P, Rama RK (2020) Secured and multi optimized image watermarking using SVD and
entropy and prearranged embedding locations in transform domain. J Discret Math Sci Cryptogr
23(1):73–82. https://doi.org/10.1080/09720529.2020.1721875
4. Garg P, Rama KR (2020) An improved and secured digital image watermarking technique
using DCT, fuzzy entropy and image scrambling in hybrid domain. J Discret Math Sci Cryptogr
23(1):177–186. https://doi.org/10.1080/09720529.2020.1721882
5. Lakshmi BS (2020) Image steganography based on SVD and DWT techniques. J Discret Math
Sci Cryptogr 23(3):779–786. https://doi.org/10.1080/09720529.2019.1698801
6. Begum M, Uddin MS (2020) Digital image watermarking techniques: a review. Information
(Switzerland) 11(2). https://doi.org/10.3390/info11020110
7. Mohanarathinam A, Kamalraj S, Prasanna VG, Ravi RV, Manikandababu CS (2020) Digital
watermarking techniques for image security: a review. J Ambient Intell Humaniz Comput
11(8):3221–3229. https://doi.org/10.1007/s12652-019-01500-1
8. Boenisch F (2020) A survey on model watermarking neural networks. Available: http://arxiv.
org/abs/2009.12153
9. Khadam U, Iqbal MM, Azam MA, Khalid S, Rho S, Chilamkurti N (2019) Digital watermarking
technique for text document protection using data mining analysis. IEEE Access 7:64955–
64965. https://doi.org/10.1109/ACCESS.2019.2916674
10. Artru R, Gouaillard A, Ebrahimi T (2019) Digital watermarking of video streams: review of
the state-of-the-art. Available: http://arxiv.org/abs/1908.02039
11. Sulong GB, Wimmer MA (2023) Image hiding by using spatial domain steganography. Wasit
J Comp Math Sci 2(1):39–45
12. Al-asadi TA, Obaid AJ (2016) Object-based image retrieval using enhanced SURF. Asian J
Inform Technol 15:2756–2762. https://doi.org/10.36478/ajit.2016.2756.2762
13. He Y, Hu Y (2018) A proposed digital image watermarking based on DWT-DCT-SVD. In:
Proceedings of the 2nd IEEE advanced information management, communicates, electronic
and automation control conference (IMCEC), pp 1214–1218. https://doi.org/10.1109/IMCEC.
2018.8469626
14. Joseph A, Anusudha K (2013) Robust watermarking based on DWT SVD. Int J Signal Image
Proc 1
15. Srivastava A (2013) DWT-DCT-SVD based semi blind image watermarking using middle
frequency band. IOSR J Comput Eng 12(2):63–66. https://doi.org/10.9790/0661-1226366
16. Abdulazeez AM, Hajy DM, Zeebaree DQ, Zebari DA (2020) Robust watermarking scheme
based LWT and SVD using artificial bee colony optimization. Indones J Electr Eng Comput
Sci 21(2):1218–1229. https://doi.org/10.11591/ijeecs.v21.i2.pp1218-1229
17. Patel P, Patel Y (2015) Secure and authentic DCT image steganography through DWT-SVD
based digital watermarking with RSA encryption. In: Proceedings of the 2015 5th international
conference and communication systems and network, technologies, CSNT,pp 736–739. https://
doi.org/10.1109/CSNT.2015.193
18. Mohamad AC, Abdul-Hameed M (2014) Image encryption based on singular value decompo-
sition. J Comput Sci 10(7):1222–1230. https://doi.org/10.3844/jcssp.2014.1222.1230
19. Zhang G, Zou W, Zhang X, Hu X, Zhao Y (2017) Singular value decomposition based sample
diversity and adaptive weighted fusion for face recognition. Digit Signal Proc A Rev J 62:150–
156. https://doi.org/10.1016/j.dsp.2016.11.004
20. Shieh JM, Lou DC, Chang MC (2006) A semi-blind digital watermarking scheme based on
singular value decomposition. Comput Stand Interf 28(4):428–440. https://doi.org/10.1016/j.
csi.2005.03.006
21. Nagai Y, Uchida Y, Sakazawa S, Satoh S (2018) Digital watermarking for deep neural networks.
Int J Multimed Inf Retr 7(1):3–16. https://doi.org/10.1007/s13735-018-0147-1
Safeguarding IoT: Harnessing Practical
Byzantine Fault Tolerance for Robust
Security
Nadiya Zafar, Ashish Khanna, Shaily Jain, Zeeshan Ali, and Jameel Ahamed
Abstract With the emergence of the Internet of Things (IoT), massive amounts of data are
produced, processed, propagated, and stored every day. IoT devices are built only to fulfill
their intended function with very limited resources; as a result, their security and privacy
are not prioritized. Implementing any solution to the privacy and security issues of IoT
devices is a challenging and crucial job under such resource constraints. However, with the
development of blockchain technology, incorporating security methods into IoT systems is no
longer an unattainable goal. We conducted multiple experiments in this research to determine
that Practical Byzantine Fault Tolerance (pBFT) is the most suitable technique for protecting
IoT systems. The blockchain concept is used with pBFT in the same way that Zilliqa and
Hyperledger are used for IoT security. By identifying and preventing security breaches with
this algorithm, data integrity and authenticity are maintained.
Keywords pBFT · Data security · Heterogeneous data · Consensus algorithm · Device certification
N. Zafar ·J. Ahamed (B)
Department of CS&IT, Maulana Azad National Urdu University, Hyderabad, India
e-mail: jameel.shaad@gmail.com
A. Khanna
Department of CSE, Maharaja Agrasen Institute of Technology, New Delhi, India
e-mail: ashishkhanna@mait.ac.in
S. Jain
Faculty of Computing, Engineering and Science, University of South Wales, South Wales, UK
e-mail: shally.jain@southwales.ac.uk
Z. Ali
University of Glasgow, Glasgow, UK
e-mail: ali.zeeshan@glasgow.ac.uk
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_22
287
288 N. Zafar et al.
1 Introduction
IoT gradually evolved from the combination of wireless technologies, microelectromechanical
systems (MEMS), microservices, and the Internet. It emerged from machine-to-machine (M2M)
communication and takes M2M one step further. The Internet of Things (IoT) is a sensor
network of billions of smart gadgets that connects humans, systems, and other applications
to gather and share data. Further, new emerging technologies have a great impact on the
world, with such a plethora of intelligent objects around us making our lives easier and
more comfortable [1]. According to a Cisco networking survey, there are more smart devices
than people in the world today. A huge share of the population is now connected to the
Internet through smart devices at all times, for instance to monitor physical activity and
health. Some surveys predict that 50–75 billion devices will be connected to the Internet
[2]. In the IoT, information is exchanged and communicated through various sensing devices
over a network, by agreeing on common protocols. The main aim of the IoT is to intelligently
identify, track, monitor, and manage things across its application areas [3]. In simple
terms, the IoT connects devices with limited abilities over the Internet; the "things" in
the IoT are devices that can sense, monitor, and actuate [4]. This unique connection of real
devices has greatly accelerated data collection, summarization, and sharing with other
devices, giving birth to IoT applications in a variety of new domains such as healthcare,
smart homes, and industry [5]. However, most such devices and applications are not designed
to survive cyber-attacks, which raises a slew of security and privacy concerns in IoT
networks, including confidentiality, authentication, data integrity, access control, and
secrecy, all of which increase vulnerability to cyber theft and breaches. Anonymity,
privacy, trust, and liability are other important security requirements [6]. As the IoT
connects billions of devices to the Internet, it involves a huge number of data points
(nodes), all of which require security. Because IoT devices are closely connected, an
intruder who exploits one vulnerability can manipulate all of the data. Companies
manufacturing these IoT devices can become victims of a data breach, since any smart device
goes through three life stages: manufacturing, installation, and operation [7]. A security
flaw at any of these stages can cause major privacy concerns for users. According to one
assessment, seventy percent of IoT objects are easy to hack, and attackers and intruders can
target IoT devices at any time. As a result, an effective mechanism is critical for
protecting Internet-connected devices from hackers and intruders [8].
Further, the flow of information must be secure in terms of integrity, confidentiality,
non-repudiation, and authentication, so we need a mechanism to protect IoT communication
protocols from the threat of attack. Because of the dynamism, scalability, heterogeneity,
and limited resources of IoT devices, designing and implementing a system that meets all
security requirements is very challenging. As a result, a secure system that is compatible
with such a restricted environment is necessary. The decentralized nature of blockchain
technology is well suited to IoT
systems, but most consensus algorithms require substantial computational energy. In this
respect, the pBFT algorithm is distinctive: it does not require much computational power and
takes less time to reach consensus. It is a consensus algorithm that reaches agreement even
when some faulty nodes are present in the system [9]. It also provides authenticity through
consensus and integrity by keeping the system alive [10]. pBFT gives priority to nodes with
high reliability for intrusion detection and identifies all the nodes present in the network
that are available at the end of detection [11]. Better efficiency, transactional finality,
and low reward variance are among the advantages of pBFT.
2 Related Work
Within a network where smart objects communicate and exchange data, if any of them fails or
is attacked, the whole system is jeopardized [6]. Here are some major security concerns:
1. Data Integrity: The data must remain accurate during its transmission between nodes. For
instance, it would be a severe problem if an eavesdropper altered the data and ordered
production to halt in a manufacturing organization [12].
2. Data Confidentiality: Data should remain private between the communicating nodes; except
for the sender and receiver, no one else should have access to it. For instance, if
infrastructure data is compromised, roads and bridges may be destroyed and security
jeopardized.
3. Data Authenticity: Authentication ensures that the data received is genuine and
trustworthy. For example, a patient's parameters are transmitted to various medical centers;
if an eavesdropper alters this data, the patient's treatment may be jeopardized [13].
4. Data Availability: Data should be available to its intended user. It is a major problem
if that user cannot reach the data [6].
Besides the issues mentioned above, there are further challenges in handling IoT systems, as
discussed below.
1. Scalability: Innumerable connected IoT devices overburden the management of the data
access system. As a result, access control approaches should be scalable in terms of size,
structure, and number of devices [14].
2. Heterogeneity: The Internet of Things connects objects with various fundamental
capabilities and applications. As a result, access control mechanisms are expected to
facilitate interoperability between disparate objects [15].
3. Restricted Resources: IoT devices mostly function without a screen or any user interface,
depend on battery power, and commonly perform only one task [16]. Because of the small size
of IoT devices, their computational and storage resources are constrained. As a result, an
IoT access control model
should be efficient and impose minimal overhead on devices and communication networks, since
such devices are designed and deployed with limited computing and networking capability [17].
Moreover, many kinds of devices communicate over several networks to deliver IoT services,
which opens further security issues for user privacy and at the network layer. Other IoT
security concerns are therefore end-to-end data life-cycle protection and visible security
and privacy, and it is necessary to choose security and privacy strategies that can be
applied automatically [6]. Although technical issues have recently been resolved by
extending and applying wireless communication technologies, the IoT model still has to deal
with hurdles in securing IoT devices in constrained environments [18]. Current Internet
security protocols depend on a set of popular and trusted cryptographic algorithms: the
Advanced Encryption Standard (AES) block cipher for confidentiality; the
Rivest-Shamir-Adleman (RSA) asymmetric algorithm for digital signatures and key transport;
the Diffie-Hellman (DH) asymmetric key agreement algorithm; and the SHA-1 and SHA-256 secure
hash algorithms [19]. This suite is supplemented by a set of emerging asymmetric algorithms
known as Elliptic Curve Cryptography (ECC) [20]. Because resource-constrained IoT devices
lack computational power, general public-key cryptosystems such as RSA are ineffective: they
are slow and consume more power. Elliptic Curve Cryptography (ECC), on the other hand, is
lightweight and has proven to be a suitable candidate for IoT networks [21]. In the IoT,
timestamps can protect data and serve as evidence that the data are genuine, as they can be
traced back to a particular time, ensuring that the information has not been tampered with
[22]. It is also very difficult to implement programming on IoT devices [23].
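The timestamp idea above can be sketched concretely: the sender binds each sensor reading to
its creation time with a keyed digest, and the receiver recomputes the digest to confirm that
neither the data nor the timestamp was tampered with. This is an illustrative stdlib sketch
under our own assumptions (the function names and the pre-shared device key are not from the
paper):

```python
import hashlib
import hmac
import json
import time

SHARED_KEY = b"demo-device-key"  # assumption: a key provisioned to the device

def stamp(payload: dict, ts: float) -> dict:
    """Sender side: bind a sensor payload to its timestamp with a keyed digest."""
    body = json.dumps({"payload": payload, "ts": ts}, sort_keys=True).encode()
    return {"payload": payload, "ts": ts,
            "tag": hmac.new(SHARED_KEY, body, hashlib.sha256).hexdigest()}

def verify(record: dict) -> bool:
    """Receiver side: recompute the tag; any change to data or time is detected."""
    body = json.dumps({"payload": record["payload"], "ts": record["ts"]},
                      sort_keys=True).encode()
    expected = hmac.new(SHARED_KEY, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, record["tag"])

record = stamp({"sensor": "temp-01", "value": 21.4}, ts=time.time())
assert verify(record)              # untouched record passes
record["payload"]["value"] = 99.9
assert not verify(record)          # tampered reading is rejected
```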
As the IoT grows rapidly, its security issues are attracting attention, and researchers have
begun to see blockchain as a new option for securing it [22]. With the rapid growth of the
mobile Internet financial era, combining the Internet of Things with blockchain technology
seems the most obvious option. The extensive use of blockchain application technology in the
global IoT field is going to play a progressively significant role in the future.
The ideology of blockchain is built upon a distributed security network. Its mechanism
offers strong data protection as well as protection from tampering [24]. Beyond its use in
Bitcoin, the blockchain data structure can be used for storage in general: like
transactions, any other data payload can form the chain of blocks [25]. The blockchain's
characteristics, including forgery resistance, data encryption, and decentralization, allow
it to execute and store confidential information, prevent data loss, and ensure the security
of IoT applications at various stages [23, 24]. Because of properties such as immutability
and irreversibility, blockchain is among the most effective data security and privacy
technologies available [26].
The blockchain's decentralized, trustworthy, and autonomous nature can significantly improve
the security and privacy of ever-expanding IoT networks. Because IoT devices are the contact
points with the physical world, combining blockchain and IoT will allow the development of
new applications as well as the transformation of existing systems [21].
Blockchain technology is preferred when information security and confidentiality are the
network's top priorities. Implementing blockchain in the IoT allows for more efficient
access control. The most vital feature affecting IoT blockchain throughput is the consensus
mechanism [27]. The consensus algorithm is at the heart of blockchain technology because it
ensures the network's integrity and security. It is a protocol that allows blockchain
network nodes to reach a common agreement on the current state of the ledger's records.
Different blockchain platforms use different algorithms to reach consensus, and they all
operate and execute differently [28].
At present, however, no blockchain and consensus protocol can simultaneously meet both the
security and scalability requirements [29]. Most organizations still lack tools for tracking
active keys, and many firms find implementing encryption complicated and challenging because
of unclear ownership and a shortage of experts [30].
However, to apply blockchain in an IoT environment, some challenges must be met.
Latency: In permissionless blockchain frameworks it takes between 1 and 10 minutes to reach
consensus; in a permissioned blockchain this contracts to milliseconds.
Applicability: Generally, many different kinds of devices are connected within an IoT
system, so it is very difficult to choose a blockchain framework that will be supported by
all devices [31].
A blockchain architecture is therefore required that allows unified and scalable movement of
data from the IoT device to the consensus protocol [29].
Blockchain technology, in conjunction with the IoT, cloud computing, big data, and machine
learning, can provide a comprehensive solution to these problems [12]. Smart contracts, for
their part, have the ability to supplement existing technical methods for resolving security
challenges, whereas blockchain integration can conflict with its distinguishing
characteristics such as immutability, traceability, and authenticity [32]. Smart contracts,
in turn, offer adaptable features such as their customizable nature, their similarity to
widely used scripting languages, and the Turing-completeness of their scripting language.
The majority of research indicates that applying smart contracts on the present
substructure strengthens the security solutions provided to IoT environments [33]. For the
proper function and integration of various IoT devices, a huge distributed system is needed
for storing and transmitting data [34]. Because of the ever-increasing number of IoT
devices, data vulnerability is a constant risk. Existing centralized IoT ecosystems have
raised security, privacy, and data-use concerns. A decentralized ID and access management
(DIAM) system for IoT devices is the best solution to these concerns, and Hyperledger is the
best technology for such a system [35]. Fault-tolerant consensus protocols play a vital role
in establishing the trustworthiness of a system despite the chances of node failures [36]. A
comparison of consensus algorithms is shown in Table 1 [37].
A milestone paper by Lamport et al. first presented the idea of Byzantine failure. They
illustrated it through the case of the Byzantine generals, whose troops besiege a rival's
castle. Upon seeing the enemy, the generals communicate with
Table 1 Comparison between consensus algorithms

Properties                      PoW       PoS       pBFT
Integrity management of nodes   Open      Open      Permissioned
Saves energy                    No        Partial   Yes
Fault tolerance                 <51%      <51%      <33.3%
Blockchain                      Private   Private   Public
Table 2 A comparison of BFT and pBFT

BFT                                            pBFT
Consensus algorithm                            Consensus algorithm
A group of nodes finds consensus; some nodes   Generates consensus in a malicious
may be malicious                               environment
Less effective in an adversarial environment   More effective in an adversarial environment
each other and agree on a plan of action (consensus): either to attack or to retreat. If
they all attack together, they succeed; if none of them attacks, they survive for another
day; but if only some of the generals attack, the generals will not survive. They
communicate through messages, and the challenge is that one or more of the generals may
deceive the others, passing on erratic messages to stop the faithful generals from reaching
consensus [38]. Most consensus algorithms require two phases, one for the request and the
other for the reply, while pBFT requires three phases of massive message communication [36].
pBFT is the most popular algorithm providing tolerance under malicious attack, and its
comparison with BFT is depicted in Table 2 [39, 40].
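The "three phases" of pBFT are, in the standard protocol, pre-prepare, prepare, and commit,
and the latter two involve every replica multicasting to all others, which is what makes the
communication heavy. A rough per-round message count, assuming n = 3f + 1 replicas (our own
sketch, not a figure from the paper):

```python
def pbft_message_count(n: int) -> dict:
    """Rough message counts for one pBFT round with n replicas
    (1 primary + n - 1 backups), following the standard three phases."""
    return {
        "request":     1,                   # client -> primary
        "pre-prepare": n - 1,               # primary -> each backup
        "prepare":     (n - 1) * (n - 1),   # each backup -> all other replicas
        "commit":      n * (n - 1),         # every replica -> every other replica
        "reply":       n,                   # each replica -> client
    }

# Smallest network tolerating one faulty node: n = 3f + 1 = 4
print(pbft_message_count(4))
```

The quadratic prepare and commit phases are why pBFT scales poorly to very large networks,
which matches the longer consensus times reported later for 100 nodes.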
Through the literature survey, we observed that in prior work on the security of IoT
devices, various consensus algorithms have been implemented for data privacy in its various
aspects: data confidentiality, authenticity, integrity, availability, and so on. However,
all have limitations due to the heterogeneous architecture of the IoT, and device
certification has received little attention. Hence, implementing the Practical Byzantine
Fault Tolerance algorithm to improve the integrity and authenticity of data during its
propagation from one node to another over an IoT system is a much-needed initiative.
3 Proposed System
The proposed methodology follows the Practical Byzantine Fault Tolerance (pBFT) algorithm, a
consensus algorithm for the secure propagation of data introduced by Miguel Castro and
Barbara Liskov. To check whether nodes are reliable, the protocol uses timestamps from IoT
devices [27]. pBFT is an advancement of the Byzantine Fault Tolerance (BFT) algorithm,
yielding more efficient results than BFT in distributed systems. The number of malicious
nodes must be less
than one-third of the total nodes for a Byzantine fault-tolerant system to work [41].
Requests from all clients must reach the nodes, and no concurrency issue arises. If the
leader node fails, another leader is immediately selected [42]. The system becomes more
secure as the number of nodes grows; the execution phases are shown in Fig. 1.
The requirements for the setup are (a) the number of nodes and their increments and (b)
preventing the failure of any node from affecting the system. The four stages of the pBFT
consensus cycle are: (i) the client issues a request to the primary node; (ii) the primary
node broadcasts the request to all subsidiary nodes; (iii) the nodes carry out the requested
service and deliver a response to the client; and (iv) the request is considered fully
fulfilled when the client receives f + 1 matching responses from different network nodes,
where f is the number of faulty nodes the system tolerates.
Figure 2 shows the diagram of the pBFT algorithm, followed by the algorithm and the block
diagram of the system in Fig. 3.
Fig. 1 Working phases of pBFT [9]
Fig. 2 Pictorial representation of the algorithm (client, leader node, and secondary/backup nodes)
Fig. 3 Block diagram of proposed system
Algorithm
while client sends request to leader node
    do leader node broadcasts it to all secondary nodes
    if n > 2/3 authentic
        then agree
    else if Q <= N - f
        then live
    else if Q > N/2
        then safe
    else malicious
    if leader node is malicious
        then change the leader node
    where f + 1 replies should be received from the secondary nodes
end if
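The branch structure of the algorithm above can be rendered as runnable Python almost line
for line; the variable names and the simple round-robin rule for replacing a malicious
leader are our own assumptions, not details given in the paper:

```python
def classify_round(authentic: int, N: int, Q: int, f: int) -> str:
    """Decision rules of the algorithm.
    authentic: nodes that validated the request; N: total nodes;
    Q: quorum size reached; f: number of faulty nodes tolerated."""
    if authentic > (2 * N) / 3:   # more than two-thirds of nodes agree
        return "agree"
    if Q <= N - f:                # liveness condition holds
        return "live"
    if Q > N / 2:                 # safety condition holds
        return "safe"
    return "malicious"

def next_leader(nodes: list, leader: int) -> int:
    """Leader change: rotate to the next node when the leader is malicious
    (round-robin view change, our assumption)."""
    return nodes[(nodes.index(leader) + 1) % len(nodes)]

print(classify_round(authentic=3, N=4, Q=3, f=1))  # 'agree'
print(next_leader([0, 1, 2, 3], leader=3))         # 0
```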
Here, to maintain the integrity of the data, we have used a hash function. A message digest
is created at the sender node and sent with the message to the receiver node. To check the
integrity of a message, the receiver computes the hash function again and compares the new
message digest with the one received. Only if both are the same is the data of one node
approved to pass on to the other:
h(y) = h(x)    (1)
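The check h(y) = h(x) amounts to a few lines of standard-library code: the sender ships the
digest with the message, and the receiver approves the data only if the recomputed digest
matches. A minimal sketch (ours, not the paper's implementation):

```python
import hashlib

def make_digest(message: bytes) -> str:
    """Sender side: message digest h(x), sent along with the message."""
    return hashlib.sha256(message).hexdigest()

def approve(message: bytes, received_digest: str) -> bool:
    """Receiver side: data passes to the next node only if h(y) == h(x)."""
    return hashlib.sha256(message).hexdigest() == received_digest

msg = b"sensor reading: 21.4 C"
digest = make_digest(msg)
assert approve(msg, digest)                           # integrity preserved
assert not approve(b"sensor reading: 99 C", digest)   # altered in transit
```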
When the digests match, consensus is reached, a confirmation message appears, and the
propagation of the data is approved as free of malice. The rules applied in the simulation
are described below [39, 43]:
1. The client must receive f + 1 replies, where f is the number of faulty nodes.
2. More than two-thirds of the nodes should be authentic: Agree = 3f + 1.
3. Liveness: Q <= N - f, where Q is the quorum size and N is the total number of nodes.
4. Safety: Q > N/2.
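For the smallest configuration tolerating a single faulty node (f = 1), the rules above give
concrete numbers; the arithmetic below simply instantiates them:

```python
f = 1                # faulty nodes the system must tolerate
N = 3 * f + 1        # minimum total nodes (rule 2's 3f + 1)
min_replies = f + 1  # rule 1: matching replies the client must collect
Q = N - f            # largest quorum still satisfying liveness

assert Q <= N - f    # rule 3: liveness holds
assert Q > N / 2     # rule 4: safety also holds for N = 4, Q = 3
print(N, min_replies, Q)  # 4 2 3
```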
4 Results and Discussion
The data collected from various sensors are stored at the nodes, and when they are
propagated from one node to another, pBFT checks whether the information passing through
them is authentic. If the information is not authentic and fails to fulfill the 3f + 1
consensus rule, it is simply discarded without disturbing the system, thus providing
two-step security. Real-time operation and safety are guaranteed by the algorithm as long as
no more than n/3 of the n nodes are faulty, which means that the client will receive correct
replies to its requests. First, we implemented the pBFT algorithm on Replit (a coding
platform); then we used the Contiki simulator (a simulator for the IoT) for real-time
simulation. For the safety mechanism we used cryptographic public and private keys. We
exclude the details of the implementation here due to space constraints.
In this paper we assume that the client sends the next request only after the first one has
been served; sending requests one after another without waiting would result in congestion.
The algorithm provides safety only when the non-malicious nodes reach consensus. To keep the
simulation in real time, if the leader node is found to be malicious, another node is
immediately appointed as leader. Many systems succeed in implementing safety but fail to
maintain real-time operation; the proposed system provides both simultaneously, following
two approaches: the one-third-node rule and changing the leader node in case the leader is
damaged. The significance of pBFT is that it keeps the system alive as long as the number of
reliable nodes exceeds the number of faulty nodes. The simulation setup and communication
among nodes are depicted in Fig. 4.
Fig. 4 a Real-time simulation and b nodes communicating with each other
The performance measures in this work are the number of nodes, speed, and simulation time in
milliseconds. The nodes are shown as [1 + ... + n], where 1 denotes the leader node and n
denotes the other nodes. The time in milliseconds is the time required for one node to
communicate with another in real time. On the real-time simulator, the green region is the
region of strong connection and the gray region is that of weak connection. The sky-blue
node is the leader node, and the remaining green nodes are backup/secondary nodes.
Earlier, the IoT was secured using security algorithms including machine learning, but
researchers and scientists have now started using blockchain technologies for the security
of IoT devices. It is almost impossible to transmit data and provide security for every kind
of sensor or gadget; hence blockchain acts as a security provider using cryptographic
techniques and consensus agreement rules over these networks. Furthermore, a comparison of
the Practical Byzantine Fault Tolerance (pBFT) algorithm with other consensus algorithms for
IoT security is given in Table 3.
The comparison between the proposed system using pBFT and other security algorithms is
depicted in Table 4.
The simulation was run for 25, 50, and 100 nodes, as depicted in Fig. 5; the times taken for
these node counts are shown in Table 5. The average time over the three runs is
(11 + 24 + 57)/3, which is approximately 30.7 s.
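Averaging the Table 5 timings (11 s for 25 nodes, 24 s for 50, 57 s for 100) can be checked
directly:

```python
# Consensus times from Table 5: nodes -> seconds
times_s = {25: 11, 50: 24, 100: 57}

avg = sum(times_s.values()) / len(times_s)  # (11 + 24 + 57) / 3
print(f"average over the three runs: {avg:.1f} s")  # 30.7 s
```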
Security of data is a major issue in the IoT domain, and there is a high risk of privacy
breaches and data theft while data propagates through the IoT layers. Using pBFT at the
network layer of the IoT would reduce the chances of such breaches and theft. Until now,
intrusion detection systems for the IoT have lacked accuracy in trial output and have shown
some inconsistency issues; pBFT would efficiently resolve the issues that exist in these
mechanisms [11]. However, the optimistic fast
Table 3 Comparison of various consensus algorithms [40, 45]

Algorithm                                    Time (s) for 5 nodes   Energy consumption   Mechanism (protocol or cryptographic technique for reaching consensus)
Proof of Work (PoW)                          5–8                    High                 Based on computing power
Proof of Stake (PoS)                         12                     Relatively low       High-stakes nodes have the right to account
Proof of Authority (PoA)                     4–5                    High                 Validators help to reach consensus
Byzantine Fault Tolerance (BFT)              4–10                   Relatively low       Reach agreement based on value
Practical Byzantine Fault Tolerance (pBFT)   4–26                   Low                  Using majority rule
Table 4 Comparison of pBFT with security algorithms

Technique        Time (s)            Speed   Power consumption
RSA              6 for 100 nodes     Slow    High
Diffie-Hellman   4.6 for 100 nodes   Slow    High
ECC              4 for 100 nodes     Fast    Comparatively low
pBFT             3 for 100 nodes     Fast    Low
path can be achieved only when there is no failure; otherwise, the protocol behaves like
randomized consensus and suffers from congestion [46]. IoT devices such as sensors go
through the various life stages discussed above, and an issue introduced in a sensor at the
manufacturing stage can cause great damage and end in interoperability problems [47]. The
best solution for this issue is device certification. Certification could be based on
standard norms followed by the manufacturing industry or provided by a government. A device
certified by a government organization must follow the privacy rules and regulations of that
country and will be more trusted by customers. Moreover, a mechanism through which a
certified device's issues can be claimed and challenged would also be provided by pBFT.
Fig. 5 Summation time of different categories of communicating nodes (time in seconds per
node over three rounds, for 25, 50, and 100 nodes)
Table 5 Time taken by pBFT for different numbers of nodes

Nodes   Time (s)
25      11
50      24
100     57
5 Conclusion
Based on our analysis and comparisons, we conclude that pBFT is the most suitable algorithm
for the security of the IoT: it is a cryptographic consensus technique whose time and energy
consumption are low and, most importantly, the system remains in real-time mode despite the
presence of malicious nodes. The proposed method is concerned with data integrity and
authenticity: through Practical Byzantine Fault Tolerance (pBFT) it provides an approach for
safer and more secure data propagation among IoT devices. This paper also stressed the
certification of IoT devices, which further resolves most of the problems of sensor
interoperability. pBFT may provide a way to data security in the IoT, but it will be very
hard to implement in every case due to the heterogeneous nature of the Internet of Things
(IoT).
References
1. Husamuddin M, Qayyum M (2017) Internet of things: a study on security and privacy threats.
In: 2017 2nd international conference on anti-cyber crimes (ICACC), pp 93–97. https://doi.org/10.1109/
Anti-Cybercrime.2017.7905270
2. Cisco (2020) 2020 CISO benchmark report. Comput Fraud Secur 2020(3):4. https://doi.org/
10.1016/s1361-3723(20)30026-9
3. Chen S, Xu H, Liu D, Hu B, Wang H (2014) A vision of IoT: applications, challenges, and
opportunities with China perspective. IEEE Internet Things J 1(4):349–359. https://doi.org/10.
1109/JIOT.2014.2337336
4. Nespoli P, Díaz-López D, Gómez Mármol F (2021) Cyberprotection in IoT environments: a
dynamic rule-based solution to defend smart devices. J Inf Secur Appl 60. https://doi.org/10.
1016/j.jisa.2021.102878
5. Hamid Lone A, Naaz R. Reputation driven dynamic access control framework for IoT atop
PoA Ethereum blockchain
6. Kasemsap K (2019) Internet of things and security perspectives. Secur. Internet Things 5(1):1–
20. https://doi.org/10.4018/978-1-5225-9866-4.ch001
7. Lu C (2014) Overview of security and privacy issues in the internet of things abstract: keywords:
table of contents 1–11
8. Yassein MB, Mardini W, Al-Abdi A (2017) Security issues in the internet of things. 8(6):186–
200. https://doi.org/10.4018/978-1-5225-3029-9.ch009
9. Castro M, Liskov B (1999) Practical Byzantine fault tolerance. In: Proceedings of the
third symposium on operating systems design and implementation (OSDI), pp 173–186
10. Meshcheryakov Y, Melman A, Evsutin O, Morozov V, Koucheryavy Y (2021) On performance
of PBFT blockchain consensus algorithm for IoT-applications with constrained devices. IEEE
Access 9(June):80559–80570. https://doi.org/10.1109/ACCESS.2021.3085405
11. Li L, Chen Y, Lin B (2021) Intrusion detection analysis of internet of things considering practical
byzantine fault tolerance (PBFT) algorithm. Wirel Commun Mob Comput 2021. https://doi.
org/10.1155/2021/6856284
12. Hosmer C (2018) IoT vulnerabilities. Defending IoT Infrastructures with Raspberry Pi 1–15.
https://doi.org/10.1007/978-1-4842-3700-7_1
13. Bouscaren E (1989) Elementary pairs of models. Ann Pure Appl Log 45(2) PART 1:129–137.
https://doi.org/10.1016/0168-0072(89)90057-2
14. Sultan A, Mushtaq MA, Abubakar M (2019) IoT security issues via blockchain: a review
paper. PervasiveHealth Pervasive Comput Technol Healthc Part F1481:60–65. https://doi.org/
10.1145/3320154.3320163
15. Rachit SB, Ragiri PR (2021) Security trends in internet of things: a survey. SN Appl Sci
3(1):1–14. https://doi.org/10.1007/s42452-021-04156-9
16. Inside Secure, Iot security solutions white paper. Veritmatrix
17. Wheelus C, Zhu X (2020) IoT network security: threats, risks, and a data-driven defense
framework. IoT 1(2):259–285. https://doi.org/10.3390/iot1020016
18. Hernandez-Ramos JL, Pawlowski MP, Jara AJ, Skarmeta AF, Ladid L (2015) Toward a
lightweight authentication and authorization framework for smart objects. IEEE J Sel Areas
Commun 33(4):690–702. https://doi.org/10.1109/JSAC.2015.2393436
19. Azamuddin, Rotation project title : survey on IoT security. [Online]. Available: https://www.
cse.wustl.edu/~jain/cse570-15/ftp/iot_sec2.pdf
20. Goyal TK, Sahula V (2016) Lightweight security algorithm for low power IoT devices. 2016
international conference on advances in computing, communications and informatics, ICACCI
2016, September, 1725–1729. https://doi.org/10.1109/ICACCI.2016.7732296
21. Satamraju KP, Malarkodi B (2019) A secured and authenticated internet of things model
using blockchain architecture. Proc. 2019 TEQIP - III Sponsored international conference
on microwave integrated circuits, photonics and wireless networks, IMICPW 2019, 19–23.
https://doi.org/10.1109/IMICPW.2019.8933275
22. Zhang H, Lang W (2019) Research on the blockchain technology in the security of internet
of things. Proc. 2019 IEEE 4th Advanced information technology, electronic and automation
control conference IAEAC 2019, no. Iaeac, 764–768. https://doi.org/10.1109/IAEAC47372.
2019.8997876
23. Kurniawan A, Mayasari R, Murti MA (2018) Implementation of cryptographic algorithm on
Iot device’s Id. J Sist Cerdas 01(02):19–26
24. Zhang J, Li Z (2020) Design of internet of things information security based on blockchain.
Proc. - 2020 3rd World conference on mechanical engineering and intelligent manufacturing
WCMEIM 2020, 114–117. https://doi.org/10.1109/WCMEIM52463.2020.00030
25. Moinet A, Darties B, Baril J-L (2017) Blockchain based trust and authentication for
decentralized sensor networks, pp 1–6. [Online]. Available: http://arxiv.org/abs/1706.01730
26. Na D, Park S (2021) Fusion chain: a decentralized lightweight blockchain for iot security and
privacy. Electron 10(4):1–18. https://doi.org/10.3390/electronics10040391
27. Yuan X, Luo F, Haider MZ, Chen Z, Li Y (2021) Efficient Byzantine consensus mechanism
based on reputation in IoT blockchain. Wirel Commun Mob Comput 2021. https://doi.org/10.
1155/2021/9952218
28. Patil P, Sangeetha M, Bhaskar V (2021) Blockchain for IoT access control, security and
privacy: a review. Wirel Pers Commun 117(3):1815–1834. https://doi.org/10.1007/s11277-
020-07947-2
29. Mackenzie B, Ferguson RI, Bellekens X (2018) An assessment of blockchain consensus proto-
cols for the internet of things. In: 2018 international conference on internet of things, embedded
systems and communications. IINTEC 2018—Proceedings, 183–190. https://doi.org/10.1109/
IINTEC.2018.8695298
30. Kuzminykh I, Yevdokymenko M, Ageyev D (2021) Analysis of encryption key management
systems: strengths, weaknesses, opportunities, threats. In: 2020 IEEE international conference
on problems of infocommunications. Science and technology. PIC S T 2020—proceedings,
515–520. https://doi.org/10.1109/PICST51311.2020.9467909
31. Seshadri SS et al (2021) IoTCop: a blockchain-based monitoring framework for detection and
isolation of malicious devices in internet-of-things systems. IEEE Internet Things J 8(5):3346–
3359. https://doi.org/10.1109/JIOT.2020.3022033
32. Ali MS, Vecchio M, Pincheira M, Dolui K, Antonelli F, Rehmani MH (2019) Applications of
blockchains in the internet of things: a comprehensive survey. IEEE Commun Surv Tutorials
21(2):1676–1717. https://doi.org/10.1109/COMST.2018.2886932
Safeguarding IoT: Harnessing Practical Byzantine Fault Tolerance 301
33. Lone AH, Naaz R (2021) Applicability of blockchain smart contracts in securing Internet and
IoT: a systematic literature review. Comput Sci Rev 39:100360. https://doi.org/10.1016/j.cos
rev.2020.100360
34. Meshcheryakov Y, Melman A, Evsutin O, Morozov V, Koucheryavy Y (2021) On performance
of PBFT blockchain consensus algorithm for IoT-applications with constrained devices. IEEE
Access 9(April):80559–80570. https://doi.org/10.1109/ACCESS.2021.3085405
35. Hyperledger A, Edge ALF, Decentralized ID and access management (DIAM ) for IoT networks
36. Goyal H, Saha S, Practical byzantine consensus for internet-of-things
37. Sharma V, Lal N (2020) A novel comparison of consensus algorithms in blockchain. Adv Appl
Math Sci 20(1):1–13
38. Driscoll K, Hall B, Sivencrona H, Zumsteg P (2003) Byzantine fault tolerance, from theory to
reality 1 what you thought could never happen. Thought A Rev Cult Idea 2:235–248. https://
doi.org/10.1007/978-3-540-39878-3_19
39. Li W, Feng C, Zhang L, Xu H, Cao B, Imran MA (2021) A scalable multi-layer PBFT consensus
for blockchain. IEEE Trans Parallel Distrib Syst 32(5):1146–1160. https://doi.org/10.1109/
TPDS.2020.3042392
40. Gorkey I, Sennema E, El Moussaoui C, Wijdeveld V (2020) Comparative study of byzantine
fault tolerant consensus algorithms on permissioned blockchains supervised by Zekeriya Erkin
supervised by Miray Aysen, April, pp 1–11
41. Misic J, Misic VB, Chang X, Qushtom H (2020) Multiple entry point PBFT for IoT systems.
2020 IEEE Global Communications Conference 2020, 0–5. https://doi.org/10.1109/GLOBEC
OM42002.2020.9322641
42. Misic J, Misic VB, Chang X, Qushtom H (2021) Adapting PBFT for use with blockchain-
enabled IoT systems. IEEE Trans Veh Technol70(1):33–48. https://doi.org/10.1109/TVT.2020.
3048291
43. Liangchen X (2020) Design and implementation of internet of things information security
transmission based on PBFT algorithm. In: International conference on computer engineering
and application (ICCEA), 201–205. https://doi.org/10.1109/ICCEA50009.2020.00051
44. Waheed N, He X, Ikram M, Usman M, Hashmi SS, Usman M (2021) Security and privacy in
IoT using machine learning and blockchain: threats and countermeasures. ACM Comput Surv
53(6). https://doi.org/10.1145/3417987
45. Xiong H, Chen M, Wu C, Zhao Y, Yi W (2022) Research on progress of blockchain consensus
algorithm: a review on recent progress of blockchain consensus algorithms. Futur Internet
14(2). https://doi.org/10.3390/fi14020047
46. Kuznetsov P, Tonkikh A, Zhang YX (2021) Revisiting optimal resilience of fast byzantine
consensus. Assoc Comput Mach 1(1)
47. Noura M, Atiquzzaman M, Gaedke M (2019) Interoperability in internet of things: taxonomies
and open challenges. Mob Networks Appl 24(3):796–809. https://doi.org/10.1007/s11036-018-
1089-9
Human Body Poses Detection
and Estimation Using Convolutional
Neural Network
Jitendra Kumar Baroliya and Amit Doegar
Abstract This study introduces a unique method for human body pose detection and estimation that combines a convolutional neural network (CNN) with grab cut segmentation. The suggested technology is meant to aid in the detection and estimation of human pose, which is important for many real-time applications. Features are extracted from pictures of human poses by applying grab cut to create a human silhouette. The convolutional neural network is then applied to classify the human pose. When tested on a dataset of photographs of human poses, the suggested system achieved an accuracy of 93.89% across 6 human pose classes. A total of 1181 pictures were used in this analysis, covering six different human poses (down dog, warrior, tree, plank, goddess, and handshaking). There are 944 training images and 237 test photographs across all categories, and an 80:20 ratio is maintained for training and testing. An F1 score of 93.75%, a recall score of 93.89%, and a precision score of 93.89% were obtained using the proposed strategy. Based on the obtained data, the proposed method achieves good accuracy in pose detection compared with state-of-the-art methods. It will be beneficial for yoga pose detection, patient detection systems, etc.
Keywords Human body pose · CNN · Grab cut · Human silhouette · Segmentation
J. K. Baroliya (B)·A. Doegar
Computer Science and Engineering Department, NITTTR, Chandigarh, India
e-mail: Baroliyajitendra4@gmail.com
A. Doegar
e-mail: Amit@nitttrchd.ac.in
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_23
1 Introduction
1.1 Human Pose Estimation
It is a technique for identifying and labeling human body key joints.
Fundamentally, it is a method of capturing a person’s location by storing a set of
coordinates for each joint. The connection between these two locations is called a
pair. For two points to be joined, there must be some kind of significant relationship
between them. Estimation’s primary purpose is to construct a human “skeleton”
model for use in more refined, task-specific applications [1].
Model based on skeletons
Model based on contours
Model based on volume (Fig. 1)
1. Skeleton-based model [2]: This representation is built from key joints such as ankles, knees, shoulders, elbows, and wrists, together with limb orientations. Also called the kinematic model, it is utilized extensively in 2D and 3D pose estimation. This flexible and easy-to-understand human body model is often used to illustrate connections between different parts of the body, especially the skeleton.
2. Contour-based model: A contour-based model is used for 2D pose estimation in planar mode. This representation accounts for the overall form and breadth of the body, torso, and limbs. Rectangles and boundaries represent the human body's extent and shape, giving the impression of a human figure.
Fig. 1 Human body models
3. Volume-based model: It is utilized for 3D posture estimation. Geometric meshes and shapes represent the human body in these models, and they are often recorded for use in deep learning-based 3D human activity identification.
1.2 Computer Vision
Computer vision, in its broadest sense, is the automation and combination of many
techniques and representations used to comprehend visual information [3]. To help
computers make sense of digital photos, specialists from a variety of areas collaborate
on a project. A digital picture is simply a two-dimensional array of integers. Computer vision must devise methods of extracting and communicating the data contained in this matrix. These calculations may produce either another picture or a collection of features describing the image.
In recent years, computer vision has shown remarkable practicality. The name
“computer vision” was coined to represent this overarching objective, which is to
enable cameras on computers to “see” and “recognize” in the same way that humans
do. With the use of algorithms, computers can analyze the data contained in an
image's pixels to determine which features are most significant. This is challenging; many researchers devote their careers to training computers to recognize faces automatically in images.
There has been considerable recent advancement in computer vision. Many different types of problems can be solved using effective algorithms, and thanks to progress in computing, these algorithms can now be executed at a tolerable pace. However, computer vision has a long way to go before it can replace human eyes; it is rare that a computer can outperform human eyesight in a given setting. Computer vision is best used for object recognition and for matching incoming photos to a huge database when paired with machine learning. To achieve this, this research uses computer vision and machine learning to categorize human body positions in images.
1.3 Convolutional Neural Network
Convolutional neural networks (CNNs) are a kind of deep learning model that can classify images and videos by assigning learnable weights and biases to the input data. A CNN learns visual features faster than a plain artificial neural network (ANN) and offers several benefits, including weight sharing and comparatively fewer parameters, which reduce computational load. The construction of the convolutional neural network was inspired by the arrangement of the human visual cortex. CNN is highly recommended for posture estimation because of the complexity of the task, which includes object identification and body key point localization. Thus, CNN is a fundamental component of many different types of pose estimation models.
Convolutional Layer
The filters in a convolutional layer have parameters that must be learned. The filters are smaller in height and width than the input volume. An activation map is calculated by convolving each filter with the input volume; that is, the filter is moved horizontally and vertically over the input, and the dot product is computed at each location. Convolution is thus an elementwise multiply-and-sum operation between the image matrix and the kernel. The spatial dimensions of the convolutional layer's output are smaller than those of the input picture matrix unless padding is used.
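The sliding dot product described above can be sketched in a few lines of NumPy. Strictly speaking this is cross-correlation, which is what most deep learning frameworks implement under the name "convolution":

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image and take a dot product at each
    location ('valid' convolution, no padding): the output shrinks to
    (H - kH + 1) x (W - kW + 1)."""
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A 64x64 input and a 3x3 kernel give a 62x62 activation map.
activation = conv2d_valid(np.ones((64, 64)), np.ones((3, 3)))
print(activation.shape)  # (62, 62)
```

This makes the shrinkage concrete: each 3 × 3 filter trims one pixel from every border of the input.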
Pooling Layer
When features have been identified, their precise locations become less relevant, so a subsampling (pooling) layer is employed to reduce the number of training parameters and introduce translation invariance. The pooling layer typically applies a max (or average) operation: filters subsample the input by keeping only a summary value from each window. A 2 × 2 filter with stride 2 is commonly used, which reduces the width and height of each depth slice in the input volume by a factor of 2. Because pooling has no learnable parameters, it also lowers the memory and computation required by subsequent layers.
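A minimal sketch of the 2 × 2 max operation, showing how each non-overlapping window collapses to its largest value:

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2: keep the largest value in each
    non-overlapping 2x2 window, halving height and width."""
    h, w = feature_map.shape
    out = np.zeros((h // 2, w // 2))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = feature_map[2*i:2*i + 2, 2*j:2*j + 2].max()
    return out

fm = np.array([[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 1, 5, 2],
               [2, 0, 3, 4]], dtype=float)
print(max_pool_2x2(fm))  # [[4. 2.] [2. 5.]]
```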
Activation Function
To introduce nonlinearity, the rectified linear unit (ReLU) is preferable to previously used functions, as shown by its success in a variety of settings, the simplicity with which its partial derivative can be computed, and the resulting reduction in training time. ReLU also helps keep gradients from vanishing.
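ReLU and its derivative are one-liners, which is part of why it trains quickly — a piecewise-linear sketch:

```python
def relu(x):
    """ReLU: pass positive inputs through, clamp negatives to zero."""
    return x if x > 0 else 0.0

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

print(relu(2.5), relu(-1.0))            # 2.5 0.0
print(relu_grad(2.5), relu_grad(-1.0))  # 1.0 0.0
```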
Fully Connected Layers
Neurons in a fully connected layer are connected to every activation in the previous layer. After the feature map matrix is flattened into a vector, it is passed on to the subsequent fully connected layers, each of which computes a weighted sum of its inputs plus a bias and applies an activation function.
2 Literature Survey
After the study of various research papers on human body poses detection and esti-
mation, we concluded that the classification of human body pose is a very complex
task. In the literature, various authors and researchers discussed human body pose
detection using different types of techniques for image processing operations.
Toshev et al. [4] suggested using deep neural networks to estimate human poses
(DNNs). The process of estimating a subject’s posture is modeled as a regression
Fig. 2 Convolutional pose machine [12]
problem using a deep neural network in terms of the body’s joints. In order to get
accurate posture estimations, researchers describe a cascade of such DNN regressors,
a sequential architecture made up of convolutional networks that can learn spatial
models implicitly. The given cascade of such regressors has the benefit of collecting
context and reasoning about the posture as a whole, and the issue may be formulated
as a DNN-based regression to joint coordinates. This allowed them to outperform
state-of-the-art methods on a number of difficult academic datasets.
Wei et al. [5] proposed the convolutional pose machine, which learns basic spatial models by composing convolutional architectures in a sequential fashion to perform prediction tasks. The model's design is similar to recurrent networks in that it takes many iterations to reach the desired outcome. Input pictures are 368 × 368, and after a few convolutions it outputs a key point prediction map for each body part (head, neck, right elbow, etc.) (Fig. 2).
Newell et al. [6] proposed a network that outperforms all prior approaches. The network is referred to as a "stacked hourglass" because it uses many layers, each shaped like an hourglass, to produce a single prediction at the end of the process. This network was designed with the goal of capturing details such as a person's posture and the degree of articulation of their limbs.
When creating a neural network, it’s essential to identify its unique features to
ensure its effectiveness. However, when it comes to full-body pose estimation, a
broader context is necessary. This is where the Hourglass model shines, as it can
capture all the necessary details and provide precise predictions down to the pixel
level. The Hourglass approach involves repeated bottom-up processing, moving from
high to low resolution, and top-down processing, moving from low to high resolution,
with intermediate supervision to ensure accuracy.
Cao et al. [7] developed the open-source software "OpenPose," a powerful tool for detecting many 2D poses simultaneously. In this research, they introduced a method for identifying 2D human postures based on Part Affinity Fields (PAFs), a bottom-up representation technique. By contrast, top-down methods are mostly used in other models for estimating poses in groups. As the name implies, a top-down approach begins with the detection of a human being, followed by the estimation of its posture in each specific area. While this method may be used directly for single-person pose estimation, it falls short in multi-person settings due to its inability to account for the spatial interdependence between subjects, which can only be captured by global inference.
In their study, Sharma et al. [8] were able to accurately detect four distinct human
poses, namely Sitting, Standing, Handshake, and Waving, achieving an overall accu-
racy of 82.5%. To achieve this, they first pre-processed the images and extracted
the necessary features using Principal Component Analysis and Discrete Cosine
transform. These extracted features were then used to train a neural network classifier.
In their study, Wang et al. [9] proposed a new approach to classify yoga poses
using a combination of post-estimation algorithm and convolutional neural network
(CNN). The post-estimation algorithm utilized in this study was the Open Pose
algorithm, which detected the skeleton of a person and generated a pure black picture.
After extracting the poses from the original images, a CNN was utilized to classify
the yoga poses with a validation accuracy of 92.99%. The study also compared
the performance of the models with and without the assistance of the Open Pose
algorithm, and the results indicated that the accuracy of models without the Open
Pose algorithm was on average 3–6% lower than the models combined with the Open
Pose algorithm. Thus, the proposed approach of classification was deemed effective,
based on the skeleton information extracted by the Open Pose algorithm.
Desai et al. [10] conducted a comprehensive literature review and proposed a method for real-time pose estimation utilizing a deep neural network (DNN) model to detect and correct errors in a person's posture.
3 Proposed Methodology
The proposed methodology for this research work is shown below (Fig. 3).
Firstly, human images are acquired through any picture-capturing device, and grab cut is applied to the dataset to create human silhouettes of the input images. For image classification, convolution, max-pooling, and ReLU layers are applied to the human silhouettes [11]. These layers are applied three to four times in succession, after which flatten and dense layers are added. The fully connected layer then classifies the given input pose.
Fig. 3 Proposed methodology
Flow Chart of Proposed Work
See Fig. 4.
First of all, we acquire images from an online source [12]; the handshaking dataset was created by the researchers after applying some image processing techniques. We then use grab cut segmentation on the original dataset, which outputs human silhouettes.
Augmentation operations [13] were also applied to some pose image categories to ensure equal representation of each category. The dataset was split into 80% for training and 20% for testing.
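The 80/20 split of the 1181-image dataset can be sketched in pure Python (the filenames below are placeholders; the actual dataset is the yoga-poses set from [12] plus the authors' handshaking images). Note that an 80% cut of 1181 items reproduces the paper's 944/237 partition exactly:

```python
import random

def split_dataset(items, train_frac=0.8, seed=42):
    """Shuffle the items reproducibly and split them into train/test."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_frac)  # 80% boundary
    return items[:cut], items[cut:]

# 1181 images in total, as in the paper's dataset
images = [f"img_{i:04d}.jpg" for i in range(1181)]
train, test = split_dataset(images)
print(len(train), len(test))  # 944 237
```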
Next, the CNN started with four convolutional layers with 64, 64, 64, and 128 filters, respectively, each followed by a max-pooling layer with a 2 × 2 window size. Each convolutional layer utilized a 3 × 3 filter, and the ReLU activation function was applied. After the final max-pooling layer, a Flatten() layer was added to flatten the output of the earlier layers, and a Dense() layer with the SoftMax activation function was incorporated to generate probabilities for each of the 6 possible classes.
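A quick sanity check of this stack's feature-map sizes can be done by hand. Assuming 64 × 64 inputs, "valid" 3 × 3 convolutions, and 2 × 2 pooling with stride 2 (the paper does not state its padding, so this is an illustrative assumption):

```python
def stack_shapes(size=64, filters=(64, 64, 64, 128), kernel=3, pool=2):
    """Trace the spatial size through four conv (valid) + max-pool pairs."""
    shapes = []
    for f in filters:
        size = size - kernel + 1   # 3x3 'valid' conv shrinks each side by 2
        size = size // pool        # 2x2 pooling halves (floor division)
        shapes.append((size, f))
    return shapes

shapes = stack_shapes()
print(shapes)  # [(31, 64), (14, 64), (6, 64), (2, 128)]
flat = shapes[-1][0] ** 2 * shapes[-1][1]
print(flat)    # 512 units feed the final Dense softmax layer
```

Under these assumptions, the flattened vector entering the Dense() layer would have 2 × 2 × 128 = 512 units.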
3.1 Extraction
If conditions are stable, identifying a human form in an image or video may be a straightforward task: when comparing frames, we may isolate the human form by tracking the moving object. Having access to just one picture makes this task somewhat more challenging; therefore, the generally effective grab cut algorithm is used in our study. Once a person has been detected in the segmented photo, the picture may be divided into a foreground and a background. The spot where the individual was found is marked prominently, and the background is the area outside the box's borders. Grab cut uses all of this information [14].
The extraction algorithm proceeds as follows. A bounding rectangle is drawn in, and the individual in the foreground should take up the whole rectangle. The program repeatedly refines the foreground segmentation for optimal performance, separating foreground and background pixels, since the foreground object should stand out against the background. Grab cut then selectively cuts off everything except the subject of the picture, leaving just their silhouette. This is performed on the pictures that were generated when the person's bounding box was first found. Figure 5a, b provide a clear illustration.
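In OpenCV terms, `cv2.grabCut(img, mask, rect, bgdModel, fgdModel, iterCount, cv2.GC_INIT_WITH_RECT)` returns a label mask with values 0/2 (definite/probable background) and 1/3 (definite/probable foreground). Turning that mask into a binary silhouette is the final post-processing step; a NumPy-only sketch (the grabCut call itself is omitted so the snippet stays self-contained):

```python
import numpy as np

# OpenCV label values: GC_BGD = 0, GC_FGD = 1, GC_PR_BGD = 2, GC_PR_FGD = 3
def mask_to_silhouette(gc_mask):
    """Collapse a GrabCut label mask into a binary silhouette:
    foreground labels (1 and 3) become white, background (0 and 2) black."""
    return np.where((gc_mask == 1) | (gc_mask == 3), 255, 0).astype(np.uint8)

# Toy 4x4 label mask of the kind grabCut might return
gc_mask = np.array([[0, 2, 2, 0],
                    [2, 3, 3, 2],
                    [2, 1, 1, 2],
                    [0, 2, 2, 0]])
print(mask_to_silhouette(gc_mask))
```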
Fig. 4 Flow chart diagram for proposed system
Fig. 5 a and b
3.2 Image Classification
After the input dataset has been crafted, a basic feedforward neural network may be constructed. This network has neurons in its hidden layers, and its output neurons partition the pictures into classes. The RGB image is scaled down to 64 × 64, and the silhouette image is converted to grayscale; these images serve as the training data input to the neural network. The results of the evaluations are shown in Fig. 6a, b.
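The preprocessing step — grayscale conversion and downscaling to 64 × 64 — can be sketched with NumPy alone. Nearest-neighbour resizing is used here for brevity; the authors' exact resizing method is not stated, so treat this as illustrative:

```python
import numpy as np

def to_grayscale(rgb):
    """Standard luminance weighting of the R, G, B channels."""
    return rgb @ np.array([0.299, 0.587, 0.114])

def resize_nearest(img, size=64):
    """Nearest-neighbour resize of a 2D array to size x size."""
    h, w = img.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[np.ix_(rows, cols)]

rgb = np.random.rand(128, 96, 3)  # stand-in for a photograph
gray64 = resize_nearest(to_grayscale(rgb))
print(gray64.shape)  # (64, 64)
```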
Fig. 6 a and b
4 Experiment and Results
The coding and implementation for this study were done in a Jupyter notebook of the Anaconda framework, using Python 3.10. We use 944 training images to teach our convolutional neural network model, and then we use 237 test images to see how well it can classify new images. Both the training and validation accuracies peaked at the 17th epoch when the designed approach was run for 20 epochs. The grab cut dataset is passed to the CNN, which results in enhanced classification accuracy; we report accuracy, F1-score, recall, and precision as performance metrics. The confusion matrix for the classification of pose categories is shown in Fig. 7. The various performance scores of the proposed method are shown in tabular form in Table 1. The training and validation losses during model training are shown in Fig. 8, and the training and validation accuracies are shown in Fig. 9 and Table 2.
Overall, an accuracy of 93.89% was reported when 393 random images were tested.
Several studies have utilized convolutional neural networks (CNNs) for human pose detection and estimation. These include "DeepPose: Human pose estimation via deep neural networks" (2014), "Stacked hourglass networks for human
Fig. 7 Confusion matrix (label 0 = Downdog, label 1 = Goddess, label 2 = Handshaking, label 3 = Plank, label 4 = Tree, label 5 = Warrior)
Table 1 Performance of proposed methodologies

Accuracy score 0.9389
F1-score 0.9375
Recall score 0.9389
Precision score 0.9389
Fig. 8 Training and validation loss
Fig. 9 Training and validation accuracy
Table 2 Performance of proposed methodologies

Label  Class        No. of testing images  Correctly identified  Accuracy in percentage
0      Downdog      36                     34                    94.44
1      Goddess      42                     31                    73.80
2      Handshaking  124                    124                   100
3      Plank        108                    106                   98.14
4      Tree         49                     46                    93.87
5      Warrior2     34                     28                    82.35
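The overall accuracy in Table 1 follows directly from Table 2's per-class counts, which a few lines of Python confirm:

```python
# (class, images tested, correctly identified) taken from Table 2
results = [
    ("Downdog",     36,  34),
    ("Goddess",     42,  31),
    ("Handshaking", 124, 124),
    ("Plank",       108, 106),
    ("Tree",        49,  46),
    ("Warrior2",    34,  28),
]

total = sum(n for _, n, _ in results)
correct = sum(c for _, _, c in results)
accuracy = correct / total
print(total, correct, round(accuracy * 100, 2))  # 393 369 93.89
```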
pose estimation" (2016), "Human Pose Detection: A Machine Learning Approach" (2020), "Multi-Classification for Yoga Pose Based on Deep Learning" (2022), and "Deep Learning-Based Yoga Pose Classification" (2022). These studies reported accuracy rates of 79.6%, 80.90%, 92%, 92.99%, and 93%, respectively. In comparison with these previous works, our system achieved a higher accuracy of 93.89% (Fig. 10).
Fig. 10 Comparison of existing and proposed work
5 Conclusion and Future Scope
CNN and grab cut have great potential for use in human pose detection for many real-time applications in the future. These methods can be made even more accurate and faster by developing more advanced deep learning techniques and making large datasets available. Integrating these strategies with other technologies, such as remote sensing, enables real-time monitoring and early identification of human poses.
Finally, the application of CNN and grab cut to the problem of human pose detection shows great potential for the development of yoga pose detection applications in the future. These methods can help any person learn yoga poses at home and are also beneficial for theft detection by identifying poses, in the medical field, and in fall detection for patients. The remaining goals of this work include optimizing the parameters of the CNN and grab cut models and increasing the size of the dataset to include a broader range of human poses from different fields.
To further enhance the accuracy and performance of the system, researchers can experiment with transfer learning and other cutting-edge deep learning techniques. The methodology employed in this study is based on deep learning, which is used to categorize yoga poses. In the future, we will capture live yoga poses via webcams and then instruct the subject on how to improve by highlighting any flaws in their yoga posture.
References
1. Human pose estimation guide, Nov 15, 2022. https://www.v7labs.com/blog/human-pose-estimation-guide (accessed Nov 18, 2022)
2. A comprehensive guide on human pose estimation, Nov 15, 2022. https://www.analyticsvidhya.com/blog/2022/01/a-comprehensive-guide-on-human-pose-estimation/ (accessed Nov 15, 2022)
3. Vezzani R, Cucchiara R (2008) Annotation collection and online performance evaluation for
video Surveillance: the ViSOR project. In: Proceedings—IEEE 5th international conference
on advanced video and signal based surveillance, AVSS 2008. https://doi.org/10.1109/AVSS.
2008.31
4. Toshev A, Szegedy C (2014) DeepPose: human pose estimation via deep neural networks.
In: Proceedings of the IEEE computer society conference on computer vision and pattern
recognition. https://doi.org/10.1109/CVPR.2014.214
5. Wei S-E, Ramakrishna V, Kanade T, Sheikh Y. Convolutional pose machines
6. Newell A, Yang K, Deng J (2016) Stacked hourglass networks for human pose estimation. In:
Lecture notes in computer science (including subseries lecture notes in artificial intelligence
and lecture notes in bioinformatics). https://doi.org/10.1007/978-3-319-46484-8_29
7. Cao Z, Hidalgo G, Simon T, Wei SE, Sheikh Y (2021) OpenPose: realtime multi-person 2D
pose estimation using part affinity fields. IEEE Trans Pattern Anal Mach Intell 43(1). https://
doi.org/10.1109/TPAMI.2019.2929257
8. Kakati M, Sarma P (2020) Human pose detection: A machine learning approach. In: Advances
in intelligent systems and computing. https://doi.org/10.1007/978-3-030-37218-7_2
9. CIBDA 2022, 3rd international conference on computer information and big data applications
10. Kinger S, Desai A, Patil S, Sinalkar H, Deore N (2022) Deep learning based yoga pose
classification. In: 2022 international conference on machine learning, big data, cloud and
parallel computing, COM-IT-CON 2022, Institute of Electrical and Electronics Engineers Inc.,
682–691. https://doi.org/10.1109/COM-IT-CON54601.2022.9850693
11. Al-Qerem A, Alahmad A (2019) Human body poses recognition using neural networks with
data augmentation. Int J Adv Trends Comput Sci Eng 8(5). https://doi.org/10.30534/ijatcse/
2019/40852019
12. Yoga poses dataset, 2023. https://www.kaggle.com/datasets/niharika41298/yoga-poses-dataset (accessed Feb 02, 2023)
13. Park S, Baek Lee S, Park J (2020) Data augmentation method for improving the accuracy of
human pose estimation with cropped images. Pattern Recognit Lett 136. https://doi.org/10.
1016/j.patrec.2020.06.015
14. Rother C, Kolmogorov V, Blake A (2004) GrabCut—interactive foreground extraction using
iterated graph cuts. In: ACM SIGGRAPH 2004 Papers, SIGGRAPH 2004. https://doi.org/10.
1145/1186562.1015720
A Novel Image Alignment Technique
Leveraging Teaching Learning-Based
Optimization for Medical Images
Paluck Arora, Rajesh Mehta, and Rohit Ahuja
Abstract In image registration, traditional optimization techniques are incapable
of detecting the optimum value of geometric transformation parameters. To resolve
this issue, a novel scheme of monomodal (isomodal) biomedical image registra-
tion employing teaching learning-based optimization (TLBO) is proposed. In pre-processing, the reference image undergoes Gaussian filtering to eliminate noise, followed by normalization. During de-noising, the contrast between anatomical features of
an image is degraded. In order to create the floating image, rigid transformation is
employed. These images are aligned by detecting optimum value of rigid transforma-
tion parameters (RTP) using TLBO with mutual information (MI) maximization as
an objective function. MI and structural similarity index measure (SSIM) are used to
evaluate visual quality of registered image. The proposed scheme is tested on several
isomodal medical images such as magnetic resonance imaging (MRI) and computed
tomography (CT). The value of MI, SSIM increases by 8% and the value of RMSE
is significantly reduced from 0.4953 to 0.1306 [1] and 3.7858 to 0.1809 [2] which
clearly reveals that proposed scheme is robust and effective as compared with the
state-of-the-art methods.
Keywords Image registration · Mutual information · Monomodal images · TLBO · SSIM
P. Arora (B) · R. Mehta · R. Ahuja
Computer Science and Engineering Department, Thapar Institute of Engineering and Technology,
Patiala, India
e-mail: parora_phd20@thapar.edu
R. Mehta
e-mail: rajesh.mehta@thapar.edu
R. Ahuja
e-mail: rohit.ahuja@thapar.edu
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_24
317
318 P. Arora et al.
1 Introduction
Rapid growth of computer and medical technology motivates academicians and researchers around the globe to analyse bone structure, metabolic processes, and dense and soft tissues using computed tomography (CT), magnetic resonance imaging (MRI) and positron emission tomography (PET) modalities of medical images [3]. Experts
in medical imaging require a variety of imaging methods to acquire additional information that can assist in diagnosing diseases. Medical images derived from various
sensors (modalities) can be classified into two parts: first, anatomical images, such
as computed tomography (CT), magnetic resonance (MR) and ultrasound (US), are
utilized to show body organs in their complete structure; second, functional images,
like PET and SPECT, are employed to show soft tissues and their internal activi-
ties [1]. The goal of image registration is to align two images of the same scene
taken at various times (multi-temporal) from distinct viewpoints with various sensors
(multimodal). Medical image registration techniques can be classified into intensity
and feature-based registration. In intensity-based registration, the relationship between the intensities of the reference and floating images is used to estimate the transformation, whose parameters are then optimized. However, this approach does not support feature detection. Feature-based registration overcomes this concern by detecting and matching features between the reference and floating images [4].
Image registration has numerous applications in the field of medical imaging, such
as monitoring disease detection, treatment planning system (for radiotherapy treat-
ment), nuclear medicine, surgical procedure, motion tracking, image-guided surgery
(IGS) [5] and patient follow-up cum monitoring. Image registration entails trans-
formation, cost function (similarity metric) and optimization. Firstly, transformation
model is classified as rigid or non-rigid. A rigid transformation does not change the size or shape of an image; it comprises rotation and translation (an affine transform additionally allows scaling and shearing). Any transformation of a geometrical object that changes the size but not the shape is referred to as a non-rigid transformation [6]. Stretching or dilating belongs to the category
of non-rigid transformation. Secondly, similarity measures evaluate image alignment accuracy; mutual information (MI) and root mean square error (RMSE) are the most often used metrics. Registration is achieved when SSIM and MI are maximized while RMSE is minimized. Lastly, local or global optimization techniques are employed to search for the transformation parameters that optimize the similarity measure. Local techniques, including Powell's direction [7], steepest descent gradient and Levenberg–Marquardt, tend to get trapped in local optima and produce misregistration results [8].
Genetic algorithm (GA) [9], particle swarm optimization (PSO) [1], grey wolf optimizer (GWO) [10] and hybrid particle swarm optimization (HPSO) [11] are examples of global optimization methods. Although GA is a robust global optimization method, it is computationally expensive and does not support fine adjustment. Moreover, these global optimization methods cannot guarantee the global optimum [1]. Hence, a new intelligent meta-heuristic global optimization scheme, TLBO, is introduced with the aim of achieving better results with less effort for rigid
medical image registration. Considering the shortcomings of the methods outlined above, a robust image registration scheme is proposed that improves the quality of registered images by finding the optimum transformation parameters. In the present work, the reference image is pre-processed using gaussian filtering to remove noise, followed by normalization; as a side effect, the contrast between anatomical components of the image is slightly reduced. A rigid transformation is applied, and the reference and floating images are aligned with maximum MI as the cost function. Previous researchers focused on different optimization techniques for finding the optimum values of the rigid/non-rigid transformation parameters during registration [1, 2, 10]. The quality of the registered image is measured using the MI and SSIM values.
The major contributions for registering isomodal images are outlined as follows:
• Noise removal and normalization are performed during the pre-processing step, which allows the reference and floating images to be distinguished anatomically.
• An image registration (IR) method for different image modalities based on TLBO is presented, improving on traditional meta-heuristic approaches such as PSO, GWO and HPSO for determining the optimum rigid transformation parameters.
• With the aid of TLBO (which has fewer parameters than other optimization methods), the optimum transformation parameter values are found, resulting in a robust registration procedure.
• The robustness in terms of precise registration is justified by the high values of MI and SSIM. The effectiveness of the proposed method is demonstrated through a fair comparison with recently developed schemes using the same images and transformation parameters [1, 2, 10].
The rest of the paper is outlined as follows: the literature review is discussed in Sect. 2; the proposed approach for robust medical image registration is described in Sect. 3; experimental and comparison results are demonstrated in Sect. 4, followed by the conclusion and future directions in Sect. 5.
2 Literature Review
In this section, the recently developed rigid medical image registration techniques
proposed by several researchers are described.
Ayatollahi et al. [1] have presented a method for multimodal image registration based on maximized normalized MI and particle swarm optimization. Ting et al. [7] have suggested a multiresolution search optimization technique that combines quantum-behaved PSO with the Powell algorithm; determining the optimum RTP value nonetheless remains challenging. Meskine et al. [7] have presented a technique for rigid registration of point sets, which is employed to register two different clinical MR images and demonstrates its reliability and effectiveness. Dida et al. [10] have presented different optimization algorithms which were used to register MR and
CT images for recognizing the human brain. Maddaiah et al. [2] have demonstrated an image registration scheme for the alignment of medical images such as human brain scans. Abdel-Basset et al. [12] have discussed a hybrid scheme for the alignment of medical images; this method yields higher RMSE values on the tested dataset, which results in degraded alignment of the registered images. Arora et al. [13] examined medical image registration employing a hybridization of teaching learning-based optimization with affine and projective transformations and speeded up robust features. Experiments were carried out on monomodal and multimodal medical images using the whole brain ATLAS and Kaggle datasets. Lin et al. [14] have described a novel enhanced global optimization strategy for medical image registration (HPSO); this technique used medical data from the Vanderbilt database and addressed the issue of low registration accuracy. Zheng et al. [14] have discussed the
feature extraction algorithm SURF based on progressive images. This approach has several limitations: first, a time-consuming process is used to create many intermediate progressive images; second, when extracting and matching keypoints, the SURF algorithm generates mismatches in the PI-SURF registration procedure. The affine registration of images from the same and different modalities, such as CT and MR, has been presented by Bhattacharya et al. [15]. According to Pluim et al. [16],
image registration methods employ interpolation techniques and different search strategies to maximize the similarity measure. Liu et al.
[15] have addressed the teaching learning-based optimization technique which is
substantially faster than other evolutionary algorithms (EAs) in finding superior or
equal solutions. Das et al. [16] have studied the MR and CT brain modality of
images and non-linear 2-D affine registration. The affine-based image registration
technique based on mutual information (MI) has been implemented by Kosinski
et al. [17]. Powell optimization [18] technique is used for searching the optimum
transformation parameter. Liu et al. [19] have presented an outlier robust scheme
for multiple rigid transformed images for improving the robustness of registration.
Zheng et al. [20] addressed the problem of low registration accuracy in a SURF-based, progressive-image registration approach for medical images; the images were obtained from the BrainWeb database. Swathi et al. [21] demonstrated an automatic image registration model with a hybrid optimization approach, using satellite images for the alignment process. A point matching algorithm is employed to establish correspondences between the features detected by additional feature extraction utilizing E-SIFT. Using various similarity metrics such as RMSE and NCC, this scheme outperformed state-of-the-art methods.
3 Proposed Approach
With an intent to achieve novel medical image registration, a gaussian filter is employed to remove noise. Further, a rigid transformation is used to align the images, and the optimization algorithm TLBO detects the optimum values of the RTP (θ, tx, ty). MI and SSIM are used as similarity metrics to find the correlation between the reference
and registered images. TLBO employs MI maximization as the objective function. As expressed in Eq. 1, the optimum transformation gives the maximum value of the similarity metric MI; the objective function drives the optimization algorithm, which evaluates candidate transformation parameters to find those yielding the highest similarity.

T̂ = arg max_T MI[f_s(x), f_t(x)]    (1)

The reference (original) image is represented by f_s(x), where x is an image coordinate, and the floating image by f_t(x). T̂ is the optimum transformation and T represents the transformation together with its parameters. The framework of the proposed scheme is shown in Fig. 1, and the steps involved are as follows:
Step 1: The source (reference) image is pre-processed with a gaussian filter to de-noise it prior to normalization. Noise removal causes a small degradation of contrast [22]. Normalization then transforms the image data onto a uniform scale; the output of Step 1 is shown in Fig. 2.
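As an illustrative sketch of this pre-processing step (the sampled kernel radius and min–max normalization below are assumptions, since the paper does not specify the exact filter settings):

```python
import numpy as np

def gaussian_kernel_1d(sigma, radius=None):
    """Sampled 1-D Gaussian kernel, normalized to sum to 1."""
    if radius is None:
        radius = int(3 * sigma)  # assumed truncation radius
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def gaussian_filter(img, sigma):
    """Separable Gaussian blur: convolve rows, then columns."""
    k = gaussian_kernel_1d(sigma)
    rows = np.apply_along_axis(np.convolve, 1, img, k, mode="same")
    return np.apply_along_axis(np.convolve, 0, rows, k, mode="same")

def normalize(img):
    """Min-max normalization onto a uniform [0, 1] scale."""
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo)

# Noisy synthetic stand-in for the reference image
rng = np.random.default_rng(0)
noisy = rng.normal(100.0, 25.0, size=(64, 64))
pre = normalize(gaussian_filter(noisy, sigma=1.0))
```

De-noising and then normalizing, in that order, matches Step 1; the blur slightly reduces contrast, which is the degradation noted above.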
Fig. 1 Framework of proposed scheme
Fig. 2 Brain image from ATLAS database [23]: a source image, b gaussian blur image, c normalized image
Fig. 3 a Reference image, b floating image
Step 2: The pre-processed reference image from Step 1 undergoes a rigid transformation (rotation followed by translation) to generate the floating image, as illustrated in Fig. 3.
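This rigid transformation can be sketched with inverse mapping and nearest-neighbour sampling; rotating about the image centre and the sampling scheme are assumptions, as the paper does not state them:

```python
import numpy as np

def rigid_transform(img, theta_deg, tx, ty):
    """Rotate about the image centre by theta_deg, then translate by (tx, ty),
    using inverse mapping with nearest-neighbour sampling."""
    h, w = img.shape
    t = np.deg2rad(theta_deg)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    # Inverse transform: undo translation, then rotate back about the centre
    x0, y0 = xs - tx - cx, ys - ty - cy
    xs_src = np.cos(t) * x0 + np.sin(t) * y0 + cx
    ys_src = -np.sin(t) * x0 + np.cos(t) * y0 + cy
    xi = np.rint(xs_src).astype(int)
    yi = np.rint(ys_src).astype(int)
    valid = (xi >= 0) & (xi < w) & (yi >= 0) & (yi < h)
    out = np.zeros_like(img)
    out[valid] = img[yi[valid], xi[valid]]
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
floating = rigid_transform(img, theta_deg=0.0, tx=1, ty=0)  # content shifts one column right
```

Pixels mapped from outside the source grid are filled with zeros, which is one common convention for the black border visible in floating images.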
Step 3: The similarity metrics SSIM and MI measure the degree of resemblance between the reference and transformed floating image. Subsequently, TLBO is applied to determine the optimum RTP values, with MI maximization as the objective function.
Step 4: The optimum RTP values for rotation (θ) and translation (tx, ty) obtained by TLBO are applied to the floating image, resulting in the registered image shown in Fig. 4. Finally, MI is computed to validate the correlation between the reference and registered image; a high MI value indicates accurate alignment, as depicted in Fig. 1. TLBO estimates the RTP values that optimize the similarity metrics and thereby tests the reliability of the registration. If the objective function indicates that the desired accuracy is achieved, the registered image is obtained; otherwise, the transformation parameters are updated and the procedure repeats from Step 2, as explained in Fig. 1.
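The MI metric of Eq. 1 is typically computed from the joint intensity histogram of the two images; the sketch below follows that standard construction (the bin count of 32 is an assumption, not taken from the paper):

```python
import numpy as np

def mutual_information(a, b, bins=32):
    """MI[a, b] from the joint histogram: sum p(x,y) * log(p(x,y) / (p(x) p(y)))."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of image a
    py = pxy.sum(axis=0, keepdims=True)   # marginal of image b
    nz = pxy > 0                          # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(1)
ref = rng.random((32, 32))
unrelated = rng.random((32, 32))
# MI peaks when the two images are perfectly aligned (here: identical)
aligned_mi = mutual_information(ref, ref)
misaligned_mi = mutual_information(ref, unrelated)
```

TLBO then searches the RTP space for the transform whose registered image maximizes this quantity.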
Fig. 4 a Reference image, b floating image with transformation parameters (θ, tx, ty) = (4.7, 0, 0), c registered image
4 Experimental and Comparison Results
In this section, the performance of the proposed scheme is evaluated in terms of MI and SSIM, along with the visual quality of the registered images [24]. The effectiveness of the proposed scheme is demonstrated by comparing it with state-of-the-art methods [1, 2, 10].
4.1 Experimental Results
All the experiments are conducted on the Python 3.9.4 platform, using a system with 16.0 GB RAM and a 2.80 GHz Intel(R) Core(TM) i7-1165G7 CPU. The experiments are conducted on various isomodal medical database images, as depicted in Fig. 5 (a1–a6), sourced from the whole brain ATLAS database [23]. The medical brain images utilized in this study consist of CT and MRI scans with dimensions of 256 × 256 (Fig. 5, a1–a5) and 354 × 353 (Fig. 5, a6). These images are employed to evaluate and verify the effectiveness of the proposed scheme. After pre-processing, the reference (source) image undergoes a rigid transformation, consisting of a rotation by an angle θ followed by a translation T = (tx, ty), resulting in the floating (template) image. TLBO then optimizes the transformation parameters to register the floating image with the reference image.
Initially, the floating image is registered with the source image. The impact of rotating an isomodal brain image by θ = 2 followed by T = (tx, ty) = (2, 2) is shown in Fig. 5 (b1–b4). The optimum rotation and translation parameter values obtained through TLBO are applied via the rigid transformation, and the registered images are formed as depicted in Fig. 5 (Col. 3, c1–c4), corresponding to the floating images (Col. 2, b1–b4). In the second test case, the same scenario is repeated with different transformation parameter values, rotation θ = 15 and translation T = (tx, ty) = (17, 17). The registration result for these values is shown in Fig. 5 (c5), corresponding to the floating image in Fig. 5 (Col. 2, b5), along with the MI values in Table 1. The third test case involves a human brain image of dimension 354 × 353 with rotation θ = 4.7 and translation T = (tx, ty) = (0, 0). The registration result is illustrated in Fig. 5 (Col. 3, c6), corresponding to the floating image in Fig. 5 (Col. 2, b6). The MI and SSIM metric values for all isomodal images are reported in Table 1.
In all these cases, the higher MI and SSIM values for all test images, as tabulated in Table 1, explicitly indicate that the registered images are accurately aligned with the reference image with a significantly lower error rate. These results demonstrate that the TLBO algorithm outperforms other evolutionary algorithms [1, 2, 10] in isomodal medical image registration. The effectiveness of TLBO
Fig. 5 MRI and CT modalities of test images: (i) MRI (a1, a2 and a6); (ii) CT (a3, a4, a5). RTP (θ, tx, ty): a1–a4 (2, 2, 2), a5 (15, 17, 17), a6 (4.7, 0, 0)
Table 1 Experimental results of proposed scheme using different test images as shown in Fig. 5 (a1–a6)

Test image | Initial value (tx, ty, θ) | tx | ty | θ | (SSIM/MI)
(a1) | (2, 2, 2) | 1.9634 | 2.0316 | 1.9848 | (0.9643/12.087)
(a2) | (2, 2, 2) | 1.9636 | 2.0338 | 1.9866 | (0.9868/7.5647)
(a3) | (2, 2, 2) | 1.9378 | 2.0486 | 2.0003 | (0.9982/5.4848)
(a4) | (2, 2, 2) | 1.9716 | 2.0326 | 1.9939 | (0.9961/4.5934)
(a5) | (17, 17, 15) | 16.1619 | 17.9579 | 14.6466 | (0.9749/5.7530)
(a6) | (0, 0, 4.7) | 0.0290 | 0.0314 | 4.69861 | (0.9561/13.2613)
for obtaining the RTP is thus illustrated through the experimental results of the scheme described in this work.
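For reference, the error and similarity quantities compared in the next subsection can be sketched as below; the single-window SSIM is a simplification of the usual sliding-window SSIM and is shown only to make the computed quantities concrete:

```python
import numpy as np

def rmse(a, b):
    """Root mean square error between reference and registered image."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

def global_ssim(a, b, data_range=1.0):
    """SSIM computed once over the whole image (no sliding window):
    ((2*mu_a*mu_b + c1)(2*cov + c2)) / ((mu_a^2 + mu_b^2 + c1)(var_a + var_b + c2))."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return float(((2 * mu_a * mu_b + c1) * (2 * cov + c2))
                 / ((mu_a ** 2 + mu_b ** 2 + c1) * (a.var() + b.var() + c2)))

ref = np.linspace(0.0, 1.0, 64).reshape(8, 8)
registered = np.clip(ref + 0.05, 0.0, 1.0)  # hypothetical slightly mis-registered image
err, sim = rmse(ref, registered), global_ssim(ref, registered)
```

A perfectly registered image gives RMSE 0 and SSIM 1; registration improves as RMSE falls and SSIM rises, which is the criterion used in the comparison tables.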
4.2 Comparative Analysis
The performance of the proposed scheme for isomodal medical image registration is evaluated by comparing it with recently developed meta-heuristic optimization-based techniques [25]. For a fair comparison, we use the same images, parameters and initial values considered by [1, 2, 10].
The proposed scheme is compared with the existing schemes presented by Dida et al. [10], Ayatollahi et al. [1] and Maddaiah et al. [2]. The isomodal test images MR1, MR2, CT1, CT2, CT and a brain MR image are obtained from the aforementioned database [23], shown in Fig. 5 (Col. 1, a1–a6), with dimensions of 256 × 256 (354 × 353 for a6). From Tables 2, 3 and 4 it is observed that the proposed scheme's similarity metrics SSIM and RMSE are better than those of the existing meta-heuristic approaches. A higher SSIM and a minimal RMSE value indicate accurate registration of the CT and MRI images along with good visual quality, as shown in Fig. 5. This is due to TLBO obtaining the optimum transformation parameter values and to the noise reduction applied to all images. All registration experiments employ TLBO with a population size of 30 particles and a maximum of 60 iterations.
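A minimal sketch of the TLBO loop (teacher phase, then learner phase) with this population size and iteration budget; the quadratic toy objective below merely stands in for the MI cost, with an assumed optimum at RTP (2, 2, 2):

```python
import numpy as np

def tlbo_maximize(f, bounds, pop_size=30, iters=60, seed=0):
    """Minimal TLBO maximizing f over box bounds (list of (low, high) per dimension)."""
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(bounds, dtype=float).T
    dim = lo.size
    pop = rng.uniform(lo, hi, size=(pop_size, dim))
    fit = np.array([f(x) for x in pop])
    for _ in range(iters):
        # Teacher phase: move learners towards the current best solution
        teacher = pop[fit.argmax()]
        mean = pop.mean(axis=0)
        for i in range(pop_size):
            tf = rng.integers(1, 3)  # teaching factor, 1 or 2
            cand = np.clip(pop[i] + rng.random(dim) * (teacher - tf * mean), lo, hi)
            fc = f(cand)
            if fc > fit[i]:
                pop[i], fit[i] = cand, fc
        # Learner phase: learn pairwise from a random classmate
        for i in range(pop_size):
            j = (i + rng.integers(1, pop_size)) % pop_size  # classmate != i
            step = pop[i] - pop[j] if fit[i] > fit[j] else pop[j] - pop[i]
            cand = np.clip(pop[i] + rng.random(dim) * step, lo, hi)
            fc = f(cand)
            if fc > fit[i]:
                pop[i], fit[i] = cand, fc
    best = int(fit.argmax())
    return pop[best], float(fit[best])

# Toy stand-in for the MI objective, peaked at an assumed RTP of (2, 2, 2)
true_rtp = np.array([2.0, 2.0, 2.0])
obj = lambda p: -np.sum((p - true_rtp) ** 2)
best_rtp, best_fit = tlbo_maximize(obj, [(-10.0, 10.0)] * 3)
```

Each learner is a candidate RTP vector; beyond population size and iteration count, TLBO needs no algorithm-specific tuning parameters, which is the advantage cited above.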
Future research will focus on extending the proposed method to multimodal medical image registration by incorporating machine learning algorithms for better alignment of the source and target images (Tables 2, 3 and 4).
Table 2 Comparison results of rigid transformation parameters for the proposed scheme versus the existing scheme [10] on images a1–a4 in Fig. 5, where IV is the initial value and OA is the optimization algorithm

Modalities | RTP | IV | GWO | PSO | Proposed scheme | GWO | PSO | Proposed scheme
MR1 and MR2 | tx | 2 | 1.931 | 1.930 | 1.963 | 1.936 | 1.948 | 1.963
MR1 and MR2 | ty | 2 | 2.074 | 2.074 | 2.031 | 2.082 | 2.076 | 2.033
MR1 and MR2 | θ | 2 | 1.982 | 1.978 | 1.984 | 1.986 | 1.991 | 1.986
MR1 and MR2 | SSIM | – | 0.9106 | 0.9196 | 0.9653 | 0.9007 | 0.9006 | 0.9838
CT1 and CT2 | tx | 2 | 1.930 | 1.936 | 1.937 | 1.933 | 1.933 | 1.9716
CT1 and CT2 | ty | 2 | 2.074 | 2.073 | 2.048 | 2.079 | 2.068 | 2.0326
CT1 and CT2 | θ | 2 | 1.978 | 1.978 | 2.000 | 1.983 | 1.977 | 1.9939
CT1 and CT2 | SSIM | – | 0.9196 | 0.9195 | 0.9852 | 0.9191 | 0.9190 | 0.9741
Table 3 Comparison results of rigid transformation parameters of the proposed scheme with the existing scheme [1] on image a5 in Fig. 5

Algorithm | tx | ty | θ | Average RMSE
Initial value | 17 | 17 | 15 | –
Hybrid PSO | 16.0480 | 17.0133 | 14.4844 | 0.4953
Proposed scheme | 16.1619 | 17.9579 | 14.7543 | 0.1306
Table 4 Comparison results of transformation parameters of the proposed scheme with the existing scheme [2] on image a6 of Fig. 5

Algorithm | tx | ty | θ | Average RMSE
Initial value | 0 | 0 | 4.7 | –
Hybrid PSO | 0.7240 | 0.3521 | 4.3768 | 3.7858
Proposed scheme | 0.0290 | 0.0314 | 4.69861 | 0.18091
5 Conclusion
In this work, TLBO and rigid transformation techniques are employed for rigid medical image registration. The proposed scheme achieves an efficient image registration framework for isomodal medical images with good visual quality of the aligned images. TLBO is employed to acquire the optimum rotation and translation transformation parameter values by considering MI maximization as the objective function. Image registration is viewed as an optimization problem in this study, which is efficiently addressed using TLBO through maximization of MI between the reference
and registered images. Extensive experiments are performed on multiple image modalities with noise and artefacts, such as CT and MRI, to evaluate the performance of the proposed scheme. The roughly 8% higher mutual information and SSIM values obtained in the experiments demonstrate the better quality of the registered images. The robustness of the proposed scheme is illustrated by comparing it with meta-heuristic schemes on several similarity metrics. The high MI and SSIM values on different isomodal medical images (CT and MRI) indicate more accurate alignment of registered images than the recently developed schemes. The proposed scheme can be further extended to (i) multimodal image registration, (ii) deformable and 3D image registration problems in the medical domain and (iii) the inclusion of machine learning and other meta-heuristic techniques to enhance performance.
References
1. Ayatollahi F, Shokouhi SB, Ayatollahi A (2012) A new hybrid particle swarm optimization for
multimodal brain image registration. J Biomed Sci Eng 05(04):153–161
2. Maddaiah PN, Pournami PN, Govindan VK (2014) Optimization of image registration for
medical image analysis. Int J Comput Sci Inf Technol 5(3):3394–3398
3. Guan S-Y, Wang T-M, Meng C, Wang J-C (2018) A review of point feature based medical image registration. Chinese J Mech Eng [Internet]. 31(1):76–92. Available from: https://doi.org/10.1186/s10033-018-0275-9
4. Alam F, Rahman SU, Ullah S, Gulati K (2018) Medical image registration in image guided surgery: issues, challenges and research opportunities. Biocybern Biomed Eng [Internet]. 38(1):71–89. Available from: https://doi.org/10.1016/j.bbe.2017.10.001
5. Wan Y, Hu H, Xu Y, Chen Q, Wang Y, Gao D (2017) A robust and accurate non-rigid medical
image registration algorithm based on multi-level deformable model. Iran J Public Health
46(12):1679–1689
6. Ting-Ting P, Ji Z (2016) Research on medical image registration based on QPSO and Powell algorithm. In: Proceedings—14th international symposium on distributed computing and applications for business, engineering and science, DCABES, pp 316–319
7. Viergever MA, Maintz JBA, Klein S, Murphy K, Staring M, Pluim JPW (2016) A survey of
medical image registration—under review. Med Image Anal 33(1):140–144
8. Meskine F, Almhdie-imjabber A (2012) A feature point based image registration using genetic
algorithms. Mediterranean Telecommun J 2(2):148–153
9. Dida H, Charif F, Benchabane A (2020) A comparative study of two meta-heuristic algorithms for MRI and CT images registration. In: 3rd international conference on information and communications technology, ICOIACT, pp 411–415
10. Chen YW, Mimori A (2009) Hybrid particle swarm optimization for medical image registration. In: 5th international conference on natural computation, ICNC, 6:26–30
11. Mani VRS, Arivazhagan S (2013) Survey of medical image registration. J Biomed Eng Technol
1(2):8–25
12. Abdel-Basset M, Fakhry AE, El-henawy I, Qiu T, Sangaiah AK (2017) Feature and intensity based medical image registration using particle swarm optimization. J Med Syst 41(12)
13. Arora P,Mehta R, Ahuja R (2023) An adaptive medical image registration using hybridization of
teaching learning-based optimization with affine and speeded up robust features with projective
transformation. Cluster Comput 3
14. Lin CL, Mimori A, Chen YW (2012) Hybrid particle swarm optimization and its application
to multimodal 3D medical image registration. Comput Intell Neurosci 8
15. Liu S, Mernik L (2012) A note on teaching—learning-based optimization algorithm 212:79–93
16. Das A, Bhattacharya M (2011) Affine-based registration of CT and MR modality images of
human brain using multiresolution approaches: Comparative study on genetic algorithm and
particle swarm optimization. Neural Comput Appl 20(2):223–237
17. Kosi´nski W, Michalak P, Gut P (2012) Robust image registration based on mutual information
measure. J Signal Inf Process 03(02):175–178
18. Arora S, Rani R, Saxena N (2022) An efficient approach for detecting anomalous events in
real-time weather datasets. Concurrency Comput: Practice Experience 34(5):1–15
19. Liu D, Mansour H, Boufounos PT (2019) Robust mutual information-based multi-image
registration. IEEE Int Geosci Remote Sens Sympos 915–918
20. Zheng Q, Wang Q, Ba X, Liu S, Nan J, Zhang S (2021) A medical image registration method
based on progressive images. Comput Math Methods Med Hindawi 2021:1–10
21. Swathi R, Srinivas A (2020) An improved image registration method using E-SIFT feature descriptor with hybrid optimization algorithm. J Indian Soc Remote Sens [Internet]. 48(2):215–226. Available from: https://doi.org/10.1007/s12524-019-01063-w
22. Kumar N, Nachamai M (2017) Noise removal and filtering techniques used in medical images.
Oriental J Comput Sci Technol 10(1):103–113
23. Keith AJ, Alex Becker J (2008) The whole brain ATLAS [Internet]. Harvard University. Available from: https://www.med.harvard.edu/aanlib/home.html
24. Wachowiak MP, Smolíková R, Zheng Y, Zurada JM, Elmaghraby AS (2004) An approach to
multimodal biomedical image registration utilizing particle swarm optimization. IEEE Trans
Evol Comput 8(3):289–301
25. Cocianu CL, Stan A (2019) New evolutionary-based techniques for image registration. Appl
Sci (Switzerland) 9(1)
Study of Cyber Threats in IoT Systems
Abir El Akhdar, Chafik Baidada, and Ali Kartit
Abstract Over the years and in a post-industrial economy, information has evolved
from being a simple financial and operational gauge in companies to becoming the
key to a more intertwined world. By exchanging an immense amount of informa-
tion, the Internet of Things has resulted in a revolutionary change in lifestyles, with
widespread usage across an array of industries and applications, spanning agricul-
ture, smart houses, health care, and beyond. However, this exponential growth has
also brought about unprecedented management and security challenges. One such
challenge pertains to scalability, whereas another involves the confidentiality and
security of gathered data. Motivated by the lack of inclusive studies of cyber risks
in the IoT world, we present in this article a comparative examination of the threats
related to IoT systems. The study aims to identify the most perilous cyber threats,
based on eight relevant parameters in the world of computer security. Ultimately,
this study will help practitioners comprehend current issues and examine fresh and
attractive research opportunities in the landscape of IoT security.
Keywords Internet of Things ·Security ·Cyber threats ·DDoS ·Malware ·
Overview
1 Introduction
“In the real world, things matter more than ideas” [1], stated Kevin Ashton in the RFID Journal in June 2009. This well-known British technology pioneer and co-founder of the Auto-ID Laboratory at MIT believes that we must provide computers with their own ways of collecting information so they may experience the sights, sounds, and smells of the outside world. In this regard, he conceived the term “Internet of Things” to describe the technological principle where machines
A. E. Akhdar (B)·C. Baidada ·A. Kartit
LTI Laboratory, University of Chouaib Doukkali, National School of Applied Sciences, 5096, 24002 El Jadida, Morocco
e-mail: elakhdar.abir@ucd.ac.ma
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_25
329
330 A. E. Akhdar et al.
can watch, recognize, and comprehend the environment thanks to sensor technolo-
gies, free from the constraints imposed by manually inputting data. According to
Kevin, the term “IoT” is used to describe a system in which the Internet is linked
to the “real world” via a pervasive network of data sensors [2]. This lends credence
to a broader definition of the IoT as a new paradigm that enables a collection of
physical nodes to communicate across a network utilizing remotely linked devices or objects (e.g., smartphones, smart cameras, sensors). Today, various industries
such as health care, corporate analytics, smart cities, and smart agriculture, among
others, are becoming increasingly aware of the potential benefits of utilizing and
implementing real-time data in their day-to-day operations. By providing a variety
of solutions that can improve people’s lives, this ever-growing web of intercon-
nected devices via the Internet, is playing a significant role in various applications
and sectors. For instance, IoT enhances the quality of agriculture [3] by utilizing
sensors to collect environmental and machine metrics. The data may aid farmers in
improving practically every aspect of their labor, from raising cattle to growing crops.
IoT is a multifaceted idea that encompasses a wide range of solutions, services, and
norms. It is broadly believed to be the cornerstone of the Information and Commu-
nication Technology sector for at least the next 10 years [4]. Despite relatively low
adoption rates in households and retail settings, the number of IoT devices in use surged from 8.7 billion in 2012 to 50.1 billion in 2018 [5], resulting in expenditure likely worth a trillion dollars a year. As scale grows, the value of the data that is gathered, processed, and transported also grows, as do the attacks that target it. In other words, these estimates indicate that the quantity and severity of threats and attacks against these embedded devices will increase, mandating stronger security measures. In actuality, several security attacks, including botnet attacks using Distributed Denial of Service (DDoS), malware, and spam attempts, target the networked devices inside massive IoT networks.
In this paper, we cast light upon IoT systems and try to uncover the main threats menacing their security. To this end, the article is split into six sections. In Sect. 2, we describe the most common faults in IoT devices that could compromise a network's security. Section 3 portrays a comparative analysis of prevalent cyber-attacks in IoT systems using eight key parameters relevant to computer security: cost, visibility, persistence, exploitability, likelihood, reputational damage, targeted data, and layer of attack. We present related work regarding IoT security in Sect. 4. The outcomes of this research are condensed in Sect. 5. We ultimately deduce our conclusions in Sect. 6.
2 Security Shortcomings in IoT Devices
Challenges and shortcomings can overlap in the IoT world. Shortcomings are limita-
tions or weaknesses that most probably affect the system’s outcome, or more specifi-
cally, its quality of outcome, whereas challenges, in a more literal sense, are difficul-
ties encountered when building the system and getting it to function properly. These
latter factors may or may not affect the outcome. In [6], Zhang et al. presented a few adversities concerning the security of IoT, the main ones being heterogeneity and scalability, and proceeded to discuss this large spectrum of security concerns in more detail. Other challenges include, inter alia, privacy, lightweight cryptosystems, and backdoor analysis. Similar to [6], Hameed et al. [7] delivered a synopsis of security challenges in IoT environments with regard to services, protocols, technologies, and
challenges in IoT environments in regards to services, protocols, technologies, and
frameworks. The study outlines the security requirements such as confidentiality,
reliable routing, resilient administration, and intrusion detection. The challenges
associated with IoT were investigated with respect to the multiple architectural tiers,
encompassing, for instance, the sensory, network, and application layers and their
intercommunications by Alfaqih and Al-Muhtadi [8], Varshney et al. [9] and Sain et al.
[10]. Error tolerance, access control, authentication, and confidentiality are just a few
of the crucial topics that are covered. The literature does not, however, provide guid-
ance on the amalgamation of these requirements in a manner that would guarantee
the safety of IoT systems. Oh and Kim [11] analyzed security challenges and needs for IoT environments based on six pivotal components of IoT, which they considered to be the network, the user, the attacker, the cloud, the service, and the platform.
Noura et al. [12] provided a comprehensive taxonomy of IoT interoperability which
includes device, syntactical, network, semantic, and platform interoperability. In this
paper, [12] showcased heterogeneity as the root issue to all interoperability chal-
lenges in the IoT world. Frequently cited as fundamental security needs for an IoT
system are authorization, privacy, authentication, identity, and access management
as well as reliance, as stated in some sources (e.g., [13]). Some other general security
needs, including network, bootstrapping, and application security, setup,
information integrity, firewalling, malware protection, cryptographic algorithms, and
secure networking, are addressed in a limited number of studies (e.g., [8, 14, 15]).
These are what we refer to as generic requirements because they are essential for
the majority, if not all, application areas. In [16], Kammara stated that IoT devices
are vulnerable to cyber-attacks due to a variety of factors. Some of these result
from straightforward financial choices made by IoT device manufacturers to cut
costs, while others are brought on by the diverse complexity of IoT systems. Table 1
portrays some of the most crucial security shortcomings in the IoT environment.
We maintain that in an IoT environment, privacy, for example, is considered both a
challenge and a requirement: it constantly raises issues yet should be met as thoroughly
as possible. Nevertheless, IoT systems still fall short when it comes to fully securing
data transmission, owing, for instance, to resource constraints.
Elementally, security requirements should be answered in all tiers of the IoT tech-
nology stack in an onion-like approach where defenses are implemented in redun-
dancy. Accordingly, a number of researchers have identified the security needs for
IoT devices. In Table 2, we differentiate two groups of security needs [17] for IoT
systems in the literature.
332 A. E. Akhdar et al.
Table 1 Common security shortcomings in IoT

Re-use of code: Nearly all IoT manufacturers recycle parts of the code, such as communication and authentication protocols, that are publicly available online. This practice gives an attacker access to the whole platform through a single component, turning it from the holy grail into a poisoned chalice. The term for this incident is BOBE (break once, break everywhere).

Lack of high-quality code: Most IoT devices run out-of-date code. Additionally, it is normal for a manufacturer to compile software from several online sources and then write some patches to assemble everything. The majority of it is spaghetti code, making maintenance challenging unless a significant amount of effort and money is invested. The majority of IoT makers do not spare a thought for user security.

Lightweight cryptosystems: The computing power of IoT devices is constrained. A strong, modern encryption technique requires more resources than IoT devices can offer, so even though superior encryption systems are available, they cannot be installed on IoT devices. Additionally, it has been noted that man-in-the-middle (MITM) attacks may be carried out against the Bluetooth protocol, which is used by the majority of smartwatches. Bluetooth secure simple pairing is the target of MITM attacks, and it has been noted that the device's capabilities affect the security of the Bluetooth protocol.

Heterogeneous platforms: The varied complexity in key areas restricts the potential of the IoT. The IoT ecosystem is complicated because it includes objects from many manufacturers with a variety of software packages, which makes managing the ecosystem challenging. Security software created for one platform may not function on equipment running another, and even management and orchestration solutions struggle to support all of the northbound and southbound interfaces used by the IoT. Building a universal fix that works for all the platforms in this complex ecosystem is expensive and challenging.

Lack of security standards or guidelines: Everyone must rely on best practices and/or suggestions due to the lack of security standards, and because IoT infrastructure is made up of numerous small, frequently inexpensive endpoint devices and sensors, it is easy to underestimate the hazards involved.
Table 2 IoT security needs

Basic or standard security needs: confidentiality, authentication, authorization, accessibility, identity and access management, integrity, reliance, liability, key control, user-friendliness, etc.

Potential security needs: scalability, attack resistance, privacy and identification, secure data management, geolocation privacy, power efficiency, identity protection, fault tolerance, data currency and real time, decentralization, quality of service, reliability, portability, load balancing, etc.
3 Comparative Study of Cyber-Attacks in IoT Systems
3.1 Potential Security Attacks and Associated IoT Layer
The many levels of IoT architecture could be employed to cluster security concerns
and attacks in IoT systems. These assaults can be divided into four categories [18],
which are physical attacks, network attacks, software attacks, and data attacks.
Physical attacks can arise when an attacker is physically near an IoT network or
device, whereas network attacks occur when an attacker specifically targets an IoT
network system to wreak havoc. Software attacks are carried out through targeting
the exploitable vulnerabilities, such as defects delivered by an IoT application or the
software itself. Meanwhile, data intrusions take place when a perpetrator uses the
IoT to carry out an attack and cracks the encryption. They target the data that IoT
devices will handle in order to maintain communication across various IoT nodes.
In Table 3, we use two terms to differentiate two overlapping aspects: the form and
the type of attack. In the literature, form refers to the structure of an object, while
type refers to a category or class of things that have similar characteristics. We believe
that in general, form is concerned with the overall shell, while type is concerned with
the underlying characteristics and common traits. A form of a cyber-attack refers to
the macro-class of the attack or its aim, such as botnets, malicious scripts, malware,
and data breach. A type of a cyber-attack, on the other hand, denotes the vector or
technique used to execute the attack, such as brute force, worm, and IP spoofing.
So, form is about “what” the attack is about, while type is about “how” the attack
is carried out. It is essential to note that the language and phrasing used in these
definitions reflect our own unique style and perspective on the subject matter.
As the prevalence of IoT devices continues to grow, the threat of cyber-attacks
increases. Large enterprises are particularly vulnerable to such attacks due to their
reliance on a complex network of interconnected devices. It is therefore essential
that organizations take steps to protect themselves and their assets by assessing the
severity of potential IoT cyber-attacks. In this brief assessment, we rate the various
types of cyber assaults and their severity in the context of large organizations.
To gauge the severity of each attack, we set a number of criteria. For this initial
effort, we assume that all criteria have the same weight in rating an attack’s ferocity:
1. Cost: The extent of damage caused by the attack in terms of financial loss
(recovery expenses, repair costs, reputational harm, and possibly legal ramifi-
cations).
2. Visibility: Ease of detection and response; refers to the speed and effectiveness
of detecting the attack and implementing a response.
3. Persistence: Refers to the ability of a threat actor to covertly keep long-term
access to systems despite disturbances like restarts or changed credentials.
4. Exploitability: The degree to which a cyber-attack can serve as a setup for another one.
5. Layer of attack: Refers to the various levels of security that could be targeted
in an IoT system. These layers include physical, network, and software security
plus encryption. While some attacks only target one layer, some may target all
four, making them far more hazardous.
6. Likelihood: Refers to the probability that a threatening event might occur.
7. Targeted Data: Whether or not the primary goal of the attack is data breach, and
the level of confidentiality, integrity, or availability of the data if it was targeted
by the attack.
8. Reputational damage: Cyber-attacks can harm a business' reputation and undermine
the trust of its customers. Loss of clients and sales are two possible
consequences of this.

Table 3 Weighting of the criteria by type of attack
(Criterion order per row: Cost, Visibility, Persistence, Exploitability, Likelihood, Reputational damage, Targeted data, Layer of attack; the final column is the resulting Severity)

Botnet attacks
  Brute force              2 1 2 3 2 3 3  3 (Software)                    19
  DDoS                     4 4 1 3 4 4 2  4 (Software/Network)            26
  Spam and phishing        3 2 2 4 3 3 3  3 (Software)                    23
  Device bricking          4 3 4 0 1 4 0  3 (Physical)                    19
Malicious scripts
  Cross-site scripting     2 3 2 3 2 3 1  4 (Software/Physical)           20
  Cross-channel scripting  2 3 3 3 1 3 1  4 (Software/Physical)           20
  SQL injection            2 3 3 4 2 3 1  4 (Software/Physical)           22
  Remote Code Execution    2 3 4 3 2 3 2  4 (Software/Physical/Network)   23
Malware
  Hardware Trojan          2 4 4 3 1 4 2  3 (Physical)                    23
  Trojan horse             2 4 3 3 3 3 3  3 (Software)                    24
  Ransomware               4 3 4 4 3 4 3  3 (Software)                    28
  Backdoors                3 4 4 3 2 3 3  3 (Software)                    25
  Virus                    4 2 2 3 3 3 3  3 (Software)                    23
  Worm                     4 2 3 4 3 3 3  4 (Software/Network)            26
  Spyware                  2 4 3 2 3 3 4  3 (Software)                    24
RFID attacks
  RFID spoofing            2 1 1 1 1 3 1  4 (Physical/Network)            14
Routing information
  Spoofing                 1 3 1 2 2 3 3  3 (Network)                     18
  Routing table poisoning  2 3 2 3 2 3 2  3 (Network)                     20
  Man in the Middle        2 4 2 2 3 3 3  4 (Network/Encryption)          23
Data breach
  Password guessing        2 1 2 2 2 2 3  4 (Software/Encryption)         18
Shared technologies and cloud
  Hypervisor attacks       3 2 3 4 1 4 2  4 (Software/Encryption)         23

Severity status: mild level of severity (0–7), average level of severity (8–15), high level of severity (16–23), advanced level of severity (24–32)
Based on these factors, a comprehensive assessment can be performed to deter-
mine the overall severity of the cyber-attack and guide the response accordingly.
Each criterion is assigned a value on an evaluation scale with five modalities, each
mapped to a numeric value: "none" (0), "low" (1), "medium" (2), "high" (3), and
"extremely high" (4). The summation of the values of the criteria yields the ultimate
degree of severity of each attack. Table 3 lists each criterion's state for each attack.
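The scoring scheme reduces to a few lines of code. The sketch below uses three rows taken from Table 3; the function and variable names are illustrative only, not part of the paper's method.

```python
# Severity scoring sketch: eight equally weighted criteria, each rated on the
# five-modality scale (0 = none .. 4 = extremely high); the sum is the severity.

CRITERIA = ["cost", "visibility", "persistence", "exploitability",
            "likelihood", "reputational_damage", "targeted_data",
            "layer_of_attack"]

# Example rows from Table 3 (criterion values in the order above).
ATTACKS = {
    "DDoS":          [4, 4, 1, 3, 4, 4, 2, 4],
    "Ransomware":    [4, 3, 4, 4, 3, 4, 3, 3],
    "RFID spoofing": [2, 1, 1, 1, 1, 3, 1, 4],
}

def severity(scores):
    """Sum the equally weighted criterion values (range 0-32)."""
    assert len(scores) == len(CRITERIA)
    return sum(scores)

def band(total):
    """Map a severity total onto the four bands used in Table 3."""
    if total <= 7:
        return "mild"
    if total <= 15:
        return "average"
    if total <= 23:
        return "high"
    return "advanced"

for name, scores in ATTACKS.items():
    print(f"{name}: {severity(scores)} ({band(severity(scores))})")
# DDoS: 26 (advanced)
# Ransomware: 28 (advanced)
# RFID spoofing: 14 (average)
```

Because every criterion carries weight 1, the totals reproduce the Severity column of Table 3 directly.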
3.2 Discussion
In this part, we underline the insights acquired from the aforementioned comparison
of an array of cyber-attacks in IoT environments. The work presented in Subsect. 3.1
classifies the potential threats to IoT security with regard to their severity using several
criteria (e.g., cost, persistence, likelihood). As we observe in Tables 3 and 4, the
least severe form of attacks is RFID attacks with a score of 14. Generally speaking,
the financial loss caused by RFID attacks is low compared to other types of cyber-
attacks, such as data breaches or malware attacks. This is because RFID systems are
typically used to track and manage physical assets, and the data they contain is often
not as valuable as the data stored in other systems. However, the financial loss caused
by RFID attacks can still be significant. For instance, if a malevolent agent is able to
attain entry to an RFID system, they may be able to steal physical assets or manipulate
the data stored in the system. This could lead to significant financial losses, such as
lost revenue or increased costs associated with replacing stolen assets. Moreover,
they are relatively easy to detect, less exploitable, less persistent, and less likely to
happen due to the fact that they rely on the physical interaction between the attacker
and the target device, whereas, all other cyber threats, including malicious scripts,
routing attacks, data breach and shared technologies and cloud attacks, scored high in
matters of severity. Meanwhile, Distributed Denial of Service (DDoS) and malware
topped the list with some of the highest scores, classifying them as advanced attacks
in terms of severity.
Drawing upon the findings derived from Table 3, it can be inferred that DDoS
and malware are the most violent and feared attacks in IoT environments because
of their persistence, exploitability, cost, visibility, and likelihood.

Table 4 Potential security attacks and associated IoT layer

Perception layer
- Botnet attacks (severity: high; DDoS is advanced). Description: automated computer programs that operate through the internet, known as bots. Common types: brute force attack, DDoS attack, spam and phishing, device bricking (e.g., Mirai botnet, dictionary attacks, credential stuffing, mass email spam campaigns). Possible vulnerabilities: vulnerabilities in infrastructure, negligence on the part of humans, infection with malware, unpatched software, weak passwords, remote access tools, IoT devices.
- Malicious scripts (severity: high). Description: script code fragments that target machines in order to make them vulnerable; a variety of protocols, such as the file transfer protocol, can be used to inject scripted content into networked devices. Common types: cross-site scripting, cross-channel scripting. Possible vulnerabilities: operating system bugs, insecure file uploads or downloads, insecure devices, insecure network protocols such as FTP or Telnet.
- Malware (severity: advanced). Description: short for malicious software; any type of software designed to harm, damage, or disrupt computer systems, networks, or devices. Common types: ransomware, Trojan horse, backdoors (e.g., ILOVEYOU, Mydoom). Possible vulnerabilities: device hacking, human errors, insecure networks, social engineering tactics, insufficient security training, unpatched or outdated software, backdoors or other vulnerabilities in operating systems or network devices.

Network layer
- RFID attacks (severity: average). Description: attacks that target RFID technology, which is used to identify people and objects and transmit data through radio waves in the network (i.e., Bluetooth). Common types: RFID spoofing, node jamming, RFID skimming. Possible vulnerabilities: data tracking, weak physical security of RFID tags (such as easily accessible tag readers), data corruption and deletion, lack of authentication in RFID systems, weak or absent encryption in RFID communication.
- Routing information attacks (severity: high). Description: the objective is to manipulate router messages by blocking, replaying, or spoofing them, thereby altering their content and attributes. Common types: spoofing, routing table poisoning. Possible vulnerabilities: data alteration and corruption, lack of secure routing protocols or configurations, insider threats or compromised network devices that can be used to manipulate routing information.

Middleware layer
- Malicious code injections (severity: high). Description: by injecting malicious scripts into IoT nodes, attackers can gain control over the operation process and data flow between the nodes. Common types: SQL injection attacks in databases, cross-site scripting. Possible vulnerabilities: operating system bugs, insecure server configurations or environments, unpatched software, unvalidated or unsanitized user input in applications, insecure devices.
- Remote Code Execution (RCE) (severity: high). Description: the exploitation of software vulnerabilities through the injection of malicious code into the input stream, aimed at compromising targeted programs. Common types: out-of-bounds write attacks, injection attacks. Possible vulnerabilities: malicious malware downloaded by the host, buffer overflow, deserializing untrusted data.

Application layer
- Data breach (severity: high). Description: theft of data and its alteration without authorization or knowledge on the part of the user. Common types: password guessing, recording keystrokes, phishing, malware. Possible vulnerabilities: human errors, weak or easily guessed passwords, poor security policies, unpatched or outdated software or hardware.
- Shared technologies and cloud attacks (severity: high). Description: cybersecurity threats that exploit vulnerabilities in shared resources and cloud computing environments; they can cause multiple security problems: availability, authorization, identification, access control. Common types: attacks targeting hypervisors, side-channel attacks. Possible vulnerabilities: insufficient device management, insecure APIs or weak authentication mechanisms, third-party vulnerabilities, direct hacking, insider threats or other compromised accounts or access keys, insecure cloud systems, misconfigured cloud servers or storage buckets.

DDoS attacks are particularly worrisome because they are often the precursor to other types of attacks,
including malware. Furthermore, they pose a critical threat to the availability of IoT
systems, which is critical for ensuring the seamless functioning of connected devices
and services. IoT devices have constrained storage and computational resources,
making IoT DDoS assaults considerably harder to protect against than traditional
DDoS attacks. Based on design faults in the IoT device’s firmware or defects in
communication protocols, an attacker can quickly create malicious messages. IoT
devices, on the other hand, may be employed as a potent DDoS attack helper in
addition to being a target of DDoS assaults. They are most often placed on networks
that are not monitored for the attack, allowing attackers easy access. Additionally,
in most cases, the network they reside on offers a high-speed connection that allows
for a large amount of DDoS attack traffic. Moreover, DDoS attacks have become
increasingly common and sophisticated, with attackers employing innovative tech-
niques to bypass the traditional defense mechanisms of IoT systems. Therefore, it
is imperative to acquire a comprehensive grasp of the nature of DDoS attacks and
their impact on IoT environments. Malware, on the other hand, can cause persistent
damage to a device and jeopardize the security of an entire network. The propagation
of malware through diverse channels can remain undetected for extended periods,
thereby impeding prompt identification and mitigation efforts. The restricted compu-
tational capabilities and memory allocation in IoT devices present a significant chal-
lenge to antivirus software and other security tools, further complicating malware
detection and removal. Furthermore, an infected IoT device can facilitate numerous
malicious activities, such as data theft, device hijacking, malware proliferation, and
DDoS. By converting numerous IoT devices into bots, malware can enable attackers
to overwhelm a target network or server by generating massive traffic. Attackers can
also leverage malware to exploit security loopholes, such as default passwords or
unpatched vulnerabilities in IoT devices, providing unauthorized access to devices,
and incorporating them into a botnet. With the appropriate countermeasures, each
of these attacks may be stopped, but such countermeasures remain tailored solutions
closely tied to the features of the relevant IoT environment. As a result, they do not
offer universal or adaptable solutions applicable to many other situations.
Figure 1 showcases a comparative graphical representation of the most severe
cyber-attack types, selected to exemplify their respective cyber-attack form classes
as worst-case scenarios.
Fig. 1 Comparative analysis of cyber-attack forms' severity (bar chart of severity scores, scale 0–30, for botnet attacks, malicious scripts, malware, RFID attacks, routing information attacks, data breach, and shared technologies and cloud attacks)
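For programmatic use, the layer-to-attack mapping collected in Table 4 can be held in a small lookup structure. The sketch below is illustrative only (the names and the shape of the structure are ours, not from any library); it records just the form-to-severity pairs from the table.

```python
# Attack forms and their severity labels, grouped by the IoT layer they
# target, as collected in Table 4.
LAYER_ATTACKS = {
    "perception": {
        "botnet attacks": "high",      # DDoS variants rate "advanced"
        "malicious scripts": "high",
        "malware": "advanced",
    },
    "network": {
        "RFID attacks": "average",
        "routing information attacks": "high",
    },
    "middleware": {
        "malicious code injections": "high",
        "remote code execution": "high",
    },
    "application": {
        "data breach": "high",
        "shared technologies and cloud attacks": "high",
    },
}

def layers_exposed_to(form):
    """Return every layer whose attack list contains the given form."""
    return [layer for layer, forms in LAYER_ATTACKS.items() if form in forms]

print(layers_exposed_to("malware"))              # ['perception']
print(LAYER_ATTACKS["network"]["RFID attacks"])  # average
```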
4 Related Works
Since 2020, the number of reviews on IoT security has been significantly rising [19].
The present section outlines the up-to-date state of the art in IoT security. In [17], Pal
et al. explore potential IoT threats and attacks with reference to a variety of appli-
cation scenarios. They divide the potential threats and attacks into five categories
according to the characteristics of the IoT. These include users (attacks targeting the
human being, e.g., identity spoofing, malicious users, phishing), devices and services,
communications (e.g., SYN flooding, pharming, eavesdropping), mobility (attacks
targeting the users’ mobility, e.g., device tracking, data breach), and resource inte-
gration (attacks targeting heterogeneous infrastructures, e.g., malicious node manip-
ulation). They mostly cover IoT security needs in this article and only examine IoT
security threats on a broad scale. On the other hand, Hassija et al. [20] present a
more detailed overview of current IoT security technology, as they provide a list of
several IoT applications, along with associated security and privacy concerns as well
as a thorough breakdown of the many danger vectors found inside the various IoT
layers. The five levels of the IoT architecture are used to classify potential attacks,
after which each assault is briefly described in reference to each tier. Alfaqih and
Al-Muhtadi [8] briefly discuss IoT security issues in wireless sensor networks, as
they are considered the backbone of the IoT (e.g., encryption gateway node, business
authentication, capacity against DoS). In [21], Abomhara and Køien make an effort
to categorize different threat categories, as well as to assess and define hackers and
attacks against IoT services and devices. In addition, this article attempts to show-
case the actors’ motivation of attack and capacities driven by the distinctive features
of cyberspace. They do not, however, categorize IoT threats according to specific
criteria or groups. Unlike [21], Driss et al. [18] classify security threats and attacks
in IoT environments in reference to the different tiers of the IoT architecture. To this
end, the paper distinguishes four classes of IoT attacks: physical attacks, software
attacks, network attacks, and data attacks, yet fails to provide a full examination of each.
Sengupta et al. [22] classify key assaults on IoT systems based on the objects of
assault and assign them to one or more tiers of the architecture. This survey delves
deeper into the four types of attacks previously mentioned, as well as a review of
countermeasures in the literature to combat each of these assaults. Likewise, Atlam
and Wills [23] highlight the four classes of attack in their study; in a similar way,
Deogirikar and Vidhate [24] map the four types of attacks (physical, network, software,
and encryption attacks) to the three layer IoT architecture and briefly describe each
in two to three lines. Following that, attacks from each category are shortlisted as
“severe” based on their low detection probability and capacity to affect the network.
Nawir et al. [25] and Sadhu et al. [26] discuss the taxonomy of security attacks in IoT
environments and investigate network security issues in the sectors of smart homes,
health care, and transportation. They arrange the IoT attacks into eight categories.
The communication stack protocol or layer-based attacks are one example (i.e., phys-
ical attacks, data link attacks, network attacks, transport attacks, application attacks).
The other seven classes are device property, strategy, access level, location, protocol-
based, host-based, and information damage level. In [27], Mrabet et al. propose a
new IoT architecture inspired by the five layer design. This structure consists of a
tangible sensing tier, a network and protocol tier, a transportation tier, an application
tier, and cloud services. In this paper, they address underlying technologies, threats
to security, and countermeasures based on the suggested architecture. A few of them,
Shafiq et al. [28] for instance, also categorize various types of threats in IoT according
to the target’s attributes, the threat vector, and the attack method.
5 Contributions of This Work
Based on the foregoing analysis, we have noted the lack, in the literature, of an inclusive
approach to tangibly rate the severity of IoT cyber-attacks. We observed that the
majority of IoT security studies focus on the IoT design, while some others use the
IoT's characteristics to extract potential threats and security issues. We hold that
the existing approaches for classifying IoT attacks often rely on general-purpose
taxonomies that fail to capture the unique features of IoT systems. This is where
a more tangible approach that takes into consideration both IoT architecture and
characteristics becomes necessary. Hence, we have introduced a set of criteria of
equal weight to gauge the ferocity of the most prevalent IoT security menaces while
mapping each to its concerned IoT layer. It bears repeating that when selecting criteria
for any decision-making process, it is common for some factors to hold more weight
or importance than others. However, it can be challenging to accurately determine
the relative weights of each criterion without further investigation and analysis. In
situations where the weights are unknown or uncertain, a common approach is to
assign equal weights to all criteria to avoid bias and ensure fairness. While this
approach may not accurately reflect the true weight of each criterion, it provides a
baseline for comparison and allows for further investigation and refinement of the
decision-making process.
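The effect of the unknown weights can be made concrete: with non-uniform weights, the same criterion values may produce a different ranking. The criterion vectors below are the DDoS and ransomware rows of Table 3; the emphasized weight vector is invented purely for illustration.

```python
# Equal-weight vs. a hypothetical non-uniform weighting of the eight criteria
# (order: cost, visibility, persistence, exploitability, likelihood,
#  reputational damage, targeted data, layer of attack).

ddos       = [4, 4, 1, 3, 4, 4, 2, 4]   # severity 26 under equal weights
ransomware = [4, 3, 4, 4, 3, 4, 3, 3]   # severity 28 under equal weights

equal    = [1, 1, 1, 1, 1, 1, 1, 1]
# Hypothetical scheme that triples the weight of visibility and likelihood:
emphasis = [1, 3, 1, 1, 3, 1, 1, 1]

def weighted(scores, weights):
    """Weighted severity: dot product of criterion values and weights."""
    return sum(s * w for s, w in zip(scores, weights))

# Equal weights reproduce Table 3: ransomware (28) outranks DDoS (26).
print(weighted(ddos, equal), weighted(ransomware, equal))        # 26 28
# Under the emphasized scheme the ranking flips: DDoS 42, ransomware 40.
print(weighted(ddos, emphasis), weighted(ransomware, emphasis))  # 42 40
```

This is exactly why equal weights are a reasonable default when the true weights are unknown: any non-uniform choice would have to be justified, since it can reorder the attacks.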
6 Conclusion
In the world of IoT, attackers can exploit a wide palette of vulnerabilities to launch
sophisticated attacks that compromise the security and integrity of the entire system.
From exploiting software and firmware weaknesses to intercepting and manipulating
data transmission, these attacks can cause significant harm to organizations and indi-
viduals alike. Despite the numerous challenges, there are effective countermeasures
that can help mitigate the risks associated with these threats. However, there is no ulti-
mate solution when it comes to IoT security, and practitioners must carefully evaluate
the risks and threats in their particular setting to develop effective countermeasures.
That said, this article presents a comprehensive study of recent cyber threats to IoT
systems and identifies DDoS and malware attacks as the most severe. These types
of attacks can cause significant disruptions, compromise sensitive data, and lead to
serious financial losses. To this end, the proposal of predefined and reusable modules,
like microservices, is encouraged.
References
1. Ashton K (2009) That ‘internet of things’ thing. RFID J 22(7):97–114
2. Corcoran P (2016) The internet of things: why now, and what’s next? IEEE Consumer Electron
Mag 5:63–68
3. Farooq MS, Riaz S, Abid A, Abid K, Naeem MA (2019) A survey on the role of IoT in
agriculture for the implementation of smart farming. IEEE Access 7:156237–156271
4. IoT connected devices worldwide 2019–2030. Statista. https://www.statista.com/statistics/1183457/iot-connected-devices-worldwide/
5. Koohang A, Sargent CS, Nord JH, Paliszkiewicz J (2022) Internet of things (IoT): from
awareness to continued use. Int J Inf Manage 62:102442
6. Zhang Z-K et al (2014) IoT security: ongoing challenges and research opportunities. In: 2014
IEEE 7th international conference on service-oriented computing and applications 230–234
(IEEE, 2014). https://doi.org/10.1109/SOCA.2014.58
7. Hameed S, Khan FI, Hameed B (2019) Understanding security requirements and challenges
in internet of things (IoT): a review. J Comput Netw Commun 2019:e9629381
8. Internet of Things Security based on Devices Architecture. https://www.researchgate.net/publication/326160395_Internet_of_Things_Security_based_on_Devices_Architecture
9. Varshney T, Sharma N, Kaushik I, Bhushan B (2019) Architectural model of security threats
and their countermeasures in IoT. In: 2019 international conference on computing, communi-
cation, and intelligent systems (ICCCIS) 424–429 (IEEE, 2019). https://doi.org/10.1109/ICCCIS48478.2019.8974544
10. Sain M, Kang YJ, Lee HJ (2017) Survey on security in internet of things: state of the art and
challenges. In: 2017 19th international conference on advanced communication technology
(ICACT) 699–704 (IEEE, 2017). https://doi.org/10.23919/ICACT.2017.7890183
11. Oh S-R, Kim Y-G (2017) Security requirements analysis for the IoT. In: 2017 international conference on platform technology and service (PlatCon) 1–6 (IEEE, 2017). https://doi.org/10.1109/PlatCon.2017.7883727
12. Noura M, Atiquzzaman M, Gaedke M (2019) Interoperability in internet of things: taxonomies
and open challenges. Mobile Netw Appl 24:796–809
13. Tourani R, Misra S, Mick T, Panwar G (2018) Security, privacy, and access control in
information-centric networking: a survey. IEEE Commun Surv Tutorials 20:566–600
14. Oracevic A, Dilek S, Ozdemir S (2017) Security in internet of things: a survey. In: 2017
international symposium on networks, computers and communications (ISNCC) 1–6 (IEEE,
2017). https://doi.org/10.1109/ISNCC.2017.8072001
15. Xin M (2015) A mixed encryption algorithm used in internet of things security transmis-
sion system. In: 2015 international conference on cyber-enabled distributed computing and
knowledge discovery 62–65 (IEEE, 2015). https://doi.org/10.1109/CyberC.2015.9
16. Kammara TT (2018) Management and security of IoT systems using microservices. San Jose
State University. https://doi.org/10.31979/etd.49xq-m2je
17. Pal S, Hitchens M, Rabehaja T, Mukhopadhyay S (2020) Security requirements for the internet
of things: a systematic approach. Sensors 20:5897
18. Driss M, Hasan D, Boulila W, Ahmad J (2021) Microservices in IoT security: current solutions,
research challenges, and future directions. Proc Comput Sci 192:2385–2395
19. Lee JY, Lee J (2021) Current research trends in IoT security: a systematic mapping study.
Mobile Inf Syst 2021:e8847099
20. Hassija V et al (2019) A survey on IoT security: application areas, security threats, and solution
architectures. IEEE Access 7:82721–82743
21. Abomhara M, Køien GM (2015) Cyber security and the internet of things: vulnerabilities,
threats, intruders and attacks. J Cyber Secur Mobil 65–88. https://doi.org/10.13052/jcsm2245-1439.414
22. Sengupta J, Ruj S, Das Bit S (2020) A comprehensive survey on attacks, security issues and
blockchain solutions for IoT and IIoT. J Netw Comput Appl 149:102481
23. Atlam HF, Wills GB (2020) IoT security, privacy, safety and ethics. In: Digital twin technologies
and smart cities, Farsi M, Daneshkhah A, Hosseinian-Far A, Jahankhani H (eds). Springer
International Publishing, 123–149. https://doi.org/10.1007/978-3-030-18732-3_8
24. Deogirikar J, Vidhate A (2017) Security attacks in IoT: a survey
25. Nawir M, Amir A, Yaakob N, Lynn OB (2016) Internet of things (IoT): taxonomy of security
attacks. In: 2016 3rd international conference on electronic design (ICED) 321–326 (IEEE,
2016). https://doi.org/10.1109/ICED.2016.7804660
26. Sadhu PK, Yanambaka VP, Abdelgawad A (2022) Internet of things: security and solutions
survey. Sensors 22:7433
27. Mrabet H, Belguith S, Alhomoud A, Jemai A (2020) A survey of IoT security based on a
layered architecture of sensing and data analysis. Sensors 20:3625
28. Shafiq M, Gu Z, Cheikhrouhou O, Alhakami W, Hamam H (2022) The rise of “internet of
things”: review and open research issues related to detection and prevention of IoT-based
security attacks. Wireless Commun Mobile Comput 2022:e8669348
Generic Sentimental Analysis in Web
Data Recommendation Based on Social
Media Scalable Data Analytics Using
Machine Learning Architecture
Ramesh Sekaran, Sivaram Rajeyyagari, Ashok Kumar Munnangi,
Manikandan Parasuraman, Manikandan Ramachandran, and Anil Kumar
Abstract Sentiment analysis is a method for identifying the expressions, attitudes, or feelings of users, classifying a piece of text in a document as positive, negative, favourable, unfavourable, and so on. Recommendation systems are important intelligent systems that play a crucial role in delivering relevant information to users. Deep learning has recently emerged as an important approach to solving sentiment classification problems. This research proposes a novel technique for generic sentiment analysis in web data classification with a recommendation system in social media analytics using machine learning techniques. Here, the input web data is processed to remove missing values and to normalize the records. Then the processed data is classified
R. Sekaran
Department of Computer Science and Engineering, JAIN (Deemed to be University), Bengaluru,
Karnataka 562112, India
e-mail: sramsaran1989@gmail.com
S. Rajeyyagari
Department of Computer Science, College of Computing and Information Technology, Shaqra
University, Shaqra, Kingdom of Saudi Arabia
e-mail: dr.sivaram@su.edu.sa
A. K. Munnangi
Department of Information Technology, Velagapudi Ramakrishna Siddhartha Engineering
College, Vijayawada, Andhra Pradesh 520007, India
e-mail: ashokkumar.munnangi@gmail.com
M. Parasuraman
Department of Computer Science and Engineering, JAIN (Deemed to be University), Bengaluru,
Karnataka 562112, India
e-mail: mani.p.mk@gmail.com
M. Ramachandran (B)
School of Computing, SASTRA Deemed University, Tamil Nadu, Thanjavur 613401, India
e-mail: srmanimt75@gmail.com
A. Kumar
Tula’s Institute, Dehradun 248197, India
e-mail: dahiyaanil@yahoo.com
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_26
346 R. Sekaran et al.
using convolutional discriminant kernel component analysis, and its data recommendation in social media is produced using reinforcement multilayer neural networks. The experimental analysis is carried out on various social media datasets in terms of accuracy, average precision, recall, true positive rate, and F-measure. The proposed technique attained an accuracy of 98%, average precision of 79%, recall of 72%, true positive rate of 63%, and F-measure of 68%.
Keywords Generic sentimental analysis · Web data classification · Recommendation system · Social media analytics · Machine learning
1 Introduction
Sentiments extracted from comments, assessments, or reviews offer practical measures for several different purposes. These sentiments are mostly labelled as favourable or not, or placed on a scale of categories ranging from very poor/terrible through moderate and good to best. Consequently, any item or topic can be effectively sorted into different categories by treating the analysed sentiment as one of the variables. Such opinion-based analysis gives an organization direction on how its products are received by the public, from which new strategies can be devised to improve product quality. It is likewise relevant for policymakers and lawmakers to study public sentiment about their actions [1]. The main objective of sentiment analysis is the automatic derivation of the expressive orientation of users from their reviews. Growing interest in sentiment analysis has created the need to uncover hidden information in the unstructured data produced across different media on social networks. The main approaches to sentiment classification are (1) lexicon-based techniques and (2) machine learning techniques [2]. The first approach begins by collecting a sentiment lexicon, i.e., a set of opinion words (e.g., "magnificent," "awful"), and then uses it as the basis for rules that score the words encountered in a text [3]. Despite their effectiveness, these approaches require extensive work on lexicon construction and rule design. Their drawback is that classification is possible only on the basis of the words contained in the lexicon.
Sentiment classification by machine learning uses well-known algorithms such as Naive Bayes [4]. Classification may target the opinion of a whole document or of a specific sentence within it. Recommendation systems (RS) are deployed to help users cope with this information explosion. RS are mainly used in e-commerce applications and knowledge management systems such as tourism, entertainment, and online shopping portals [5]. Film recommendations for users are delivered through online portals. Films can be readily separated by genre, such as comedy, thriller, animation, and action. Another viable way of organizing movies is by metadata such as year, language, director, or cast [6].
Social networking websites offer large amounts of heterogeneous data, making it necessary to separate important data from the non-essential. Numerous algorithms have been implemented to perform opinion analysis on a given set of data. Take, for instance, a remark made by a user: "I love the burger in that restaurant but not the salad." This would imply that the customer enjoys one item at a particular restaurant while detesting another. Machine learning algorithms are able to detect patterns in data and learn from them in order to make their own predictions. Instead of following pre-defined instructions, these algorithms develop models from sample inputs to make data-driven decisions.
This research proposes a novel sentiment analysis technique based on machine learning architectures. The input social media data is classified using convolutional discriminant kernel component analysis, with data recommendation in social media performed by reinforcement multilayer neural networks.
The organization of this paper is as follows: Sect. 2 reviews sentiment analysis and web data recommendation in social media using existing machine learning architectures; Sect. 3 describes the proposed method for a sentiment-analysis-based social media recommendation system; Sect. 4 gives the results and discussion; and Sect. 5 concludes.
2 Existing Sentimental Analysis
Several researchers have attempted to tackle the personalized social media search problem through various techniques. Work [7] gave a detailed overview of research on sentiment analysis in the social media field. Author [8] proposed a location-based personalized recommendation framework called SESAME, which incorporated a hybrid user location preference model. The work [9] used two multiclass SVM-based classification approaches, one-versus-all and single-machine multiclass SVM, to rank digital camera and MP3 reviews according to their quality. Author [10] proposed techniques for selecting features using content and syntax models. To predict the sentiments of online users from text documents, [11] proposed a heuristic model. Work [12] built a framework for sentiment analysis of film reviews.
2.1 Existing Web Data Recommendation in Social Media
Using Machine Learning
The author [13] introduced an electronic product recommender framework based on contextual information from sentiment analysis. Since ratings are generally incomplete and highly limited, they developed a contextual data sentiment method for a recommender framework using user comments and preferences. Likewise, work [14]
proposed a recommender process incorporating sentiment analysis of textual data extracted from Facebook and Twitter to increase conversion by matching product offers with consumer preferences. Similar combinations can be found in other studies [15]. Furthermore, work [16] used a sentiment intensity metric to build a music recommender framework. Users' sentiments are extracted from sentences posted on social networks, and recommendations are made by a low-complexity system that suggests songs based on the current user's sentiment intensity. The study [17] addressed the data sparsity problem of recommender systems by incorporating a sentiment-based analysis. Their work was applied to the Internet Movie Database (IMDb) and MovieLens datasets, although sentiment analysis has advanced since the paper was published. Work [18] also tried to improve recommendations under the data sparsity problem, proposing an intelligent recommender framework based on hybrid learning strategies that combine the best and most effective learning algorithms. Several research groups [19, 20] presented procedures for applying sentiment analysis in recommender systems. In [21], the authors developed a method for sentiment analysis of movie reviews. They compared the accuracy of three supervised machine learning methods (Naive Bayes, decision trees, and maximum entropy) and one unsupervised method (K-means clustering). Each review sentence was graded according to its subjectivity and polarity. A YouTube corpus for sentiment analysis, usable as input for our sentiment analysis and text classification, was provided by work [22]. The author [23] used a neural network with a high accuracy rate for face classification. The author of [24] proposed a dynamic neural network (DNN) model based on competitive and Hebbian learning and demonstrated that the DNN outperforms the baseline methods. Work [25] suggested combining a convolutional neural network (CNN) and latent semantic analysis (LSA); the LSA method can transform words into vectors.
3 Proposed Model
This section presents the proposed sentiment-analysis-based social media recommendation method, which utilizes deep learning techniques. The input social media data is processed for noise removal and missing values. The processed data is classified using convolutional discriminant kernel component analysis (CDKCA), and its data recommendation in social media is produced using reinforcement multilayer neural networks (RMNN). The proposed architecture is shown in Fig. 1.
Sentiment analysis expects the text data to be cleaned before the classification model is trained. Text cleaning is a preprocessing step that removes words or other elements that carry no significant information and may reduce the effectiveness of sentiment analysis. Both word embeddings and TF-IDF are used as input features of deep learning algorithms in natural language processing.
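As a concrete illustration (not the authors' exact pipeline), the cleaning and TF-IDF weighting described above can be sketched in pure Python; the tokenizer and the toy documents below are assumptions for demonstration only:

```python
import math
import re

def clean(text):
    """Minimal text cleaning: lowercase, drop URLs and punctuation, tokenize."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    text = re.sub(r"[^a-z\s]", " ", text)
    return text.split()

def tf_idf(docs):
    """Return one {term: tf-idf weight} dict per document."""
    tokenized = [clean(d) for d in docs]
    n = len(tokenized)
    df = {}                                   # document frequency per term
    for toks in tokenized:
        for t in set(toks):
            df[t] = df.get(t, 0) + 1
    vectors = []
    for toks in tokenized:
        tf = {t: toks.count(t) / len(toks) for t in set(toks)}
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

docs = ["I love the burger", "I hate the salad", "the burger was great"]
vecs = tf_idf(docs)
```

A term that appears in every document (here "the") gets an IDF of log(1) = 0 and so carries no weight, which is exactly the noise-suppression effect the preprocessing aims for.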
Fig. 1 Proposed architecture
3.1 Web Information Sentimental Investigation Utilizing
Convolutional Discriminant Examination
CNN uses convolution kernels to extract features and comprises a three-part structure of convolutional layers, pooling layers, and a fully connected layer, defined as Eqs. (1)–(4):

S = F_1(W_1, F_2(W_2, ..., F_L(I, W_L))),   (1)

S^{u_l}_{i,j,k} = F_{l+1}(W_{l+1}, F_{l+2}(W_{l+2}, ..., F_L(\hat{M}_l, W_L))),   (2)

S^{C,u_l}_{i,j,k} = F_1(W_1, F_2(W_2, ..., F_l(S^{C,u_l}_{i,j,k}, W_l))),   (3)

L^{C,l} = Σ_w S^{C,u_l}_w.   (4)
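The layer-by-layer composition in Eq. (1) can be made concrete with a toy forward pass: a single 1-D convolution filter, ReLU, max pooling, and a one-weight output layer (all inputs and weights below are made-up illustrative values, not the trained model):

```python
def conv1d(x, w):
    """Valid 1-D convolution (cross-correlation) with a single filter w."""
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k)) for i in range(len(x) - k + 1)]

def relu(v):
    return [max(0.0, a) for a in v]

def forward(x, w_conv, w_out):
    """S = F_out(W_out, F_pool(F_relu(F_conv(x, W_conv)))): the nesting of Eq. (1)."""
    h = relu(conv1d(x, w_conv))   # convolutional layer + activation
    z = max(h)                    # global max pooling
    return z * w_out              # fully connected output layer

y = forward([1.0, -2.0, 3.0, 0.5], w_conv=[0.5, 0.5], w_out=2.0)
```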
We assume there are C classes, where C is a known parameter for all the SUs. The number of observed samples in the training (and test) set is assumed to be the same for all classes. The training data for the ith class is denoted by D_i^{(j)} = {x_{i1}^{(j)}, x_{i2}^{(j)}, ..., x_{iN}^{(j)}}, i = 1, ..., C, j = 1, ..., K, where N is the cardinality of the training set for each class i corresponding to SU_j; N is assumed to be a fixed parameter for all classes, and the vector x_{in}^{(j)}, n = 1, ..., N, is an observed data stream from the ith class at SU_j.
3.2 Social Media Recommendation System Using
Reinforcement Multilayer Neural Networks
The state value function and the action value function under a policy π are defined as

V^π(s) = E_π[ Σ_{t=0}^∞ γ^t r_t | s_0 = s ]   and   Q^π(s, a) = E_π[ Σ_{t=0}^∞ γ^t r_t | s_0 = s, a_0 = a ].

Optimal value functions are defined as V*(s) = max_π V^π(s) and Q*(s, a) = max_π Q^π(s, a). The optimal value functions V* and Q* satisfy the Bellman equations, written recursively as Eq. (5):

V*(s) = max_{a ∈ A} Q*(s, a),
Q*(s, a) = r(s, a) + γ Σ_{s' ∈ S} p(s' | s, a) V*(s').   (5)
The optimal value function V* is obtained by value iteration, starting from any initial value function V_0 and iteratively applying V_{k+1} = T V_k, where T is the Bellman operator given by Eq. (6):

(TV)(s) = max_{a ∈ A} [ r(s, a) + γ Σ_{s' ∈ S} p(s' | s, a) V(s') ].   (6)
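A minimal sketch of this value iteration on a hypothetical two-state MDP (the states, rewards, and transition probabilities below are illustrative assumptions, not from the paper):

```python
def bellman_backup(V, r, p, gamma):
    """(TV)(s) = max_a [ r(s,a) + gamma * sum_s' p(s'|s,a) V(s') ], as in Eq. (6)."""
    return {s: max(r[s][a] + gamma * sum(p[s][a][s2] * V[s2] for s2 in V)
                   for a in r[s])
            for s in V}

# hypothetical deterministic 2-state MDP: staying in s1 pays 2 per step
r = {"s0": {"stay": 0.0, "go": 1.0}, "s1": {"stay": 2.0, "go": 0.0}}
p = {"s0": {"stay": {"s0": 1.0, "s1": 0.0}, "go": {"s0": 0.0, "s1": 1.0}},
     "s1": {"stay": {"s0": 0.0, "s1": 1.0}, "go": {"s0": 1.0, "s1": 0.0}}}

V = {"s0": 0.0, "s1": 0.0}
for _ in range(200):              # V_{k+1} = T V_k
    V = bellman_backup(V, r, p, gamma=0.9)
```

With γ = 0.9 the iteration converges to V*(s1) = 2/(1 − 0.9) = 20 (stay forever) and V*(s0) = 1 + 0.9 · 20 = 19 (move to s1).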
Q-learning is a potent algorithm for learning by observing the environment even when the model is unknown. In Q-learning, the value estimate is updated from an observed transition (s, a, r, s') according to Eq. (7):

Q(s, a) ← (1 − α) Q(s, a) + α [ r + γ max_{a'} Q(s', a') ].   (7)
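The tabular update of Eq. (7) is essentially one line of code; the toy Q-table and transition below are assumptions for illustration:

```python
def q_update(Q, s, a, r, s_next, alpha, gamma):
    """Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(r + gamma*max_a' Q(s',a')), as in Eq. (7)."""
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target

Q = {"s0": {"left": 0.0, "right": 0.0},
     "s1": {"left": 0.0, "right": 0.0}}
# one observed transition: took "right" in s0, got reward 1.0, landed in s1
q_update(Q, "s0", "right", r=1.0, s_next="s1", alpha=0.5, gamma=0.9)
```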
Another often-used operator is the log-sum-exp operator, given by Eq. (8):

L_β(X) = (1/β) log Σ_{i=1}^{n} e^{β x_i}.   (8)
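Eq. (8) can be implemented with the usual max-shift for numerical stability; note that L_β lies between max(X) and max(X) + log(n)/β, which is what makes it a soft approximation of the max operator:

```python
import math

def lse(xs, beta):
    """L_beta(X) = (1/beta) * log(sum_i exp(beta * x_i)), as in Eq. (8).
    Shifting by max(xs) avoids overflow for large beta."""
    m = max(xs)
    return m + math.log(sum(math.exp(beta * (x - m)) for x in xs)) / beta

xs = [1.0, 2.0, 3.0]
soft = lse(xs, beta=1.0)      # a soft maximum of xs
hard = lse(xs, beta=1000.0)   # approaches max(xs) as beta grows
```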
Let T_m be the operator that iterates a value function using the max operator. Then, by the triangle inequality, Eq. (9):

‖T_{β_t} V_1 − T_m V_2‖ ≤ ‖T_{β_t} V_1 − T_m V_1‖ (term I) + ‖T_m V_1 − T_m V_2‖ (term II).   (9)
For term (I), we have Eq. (10):

‖T_{β_t} V_1 − T_m V_1‖ ≤ log(|A|) / β_t.   (10)

For term (II), we have Eq. (11):

‖T_m V_1 − T_m V_2‖ ≤ γ ‖V_1 − V_2‖.   (11)

Combining (9), (10), and (11), we have Eq. (12):

‖T_{β_t} V_1 − T_m V_2‖ ≤ γ ‖V_1 − V_2‖ + log(|A|) / β_t.   (12)
Since T_m is a contraction mapping, we can deduce T_m V* = V* from the Banach fixed point theorem. Based on the DBS value iteration specification, Eq. (13):

‖V_t − V*‖ = ‖T_{β_t} ... T_{β_1} V_0 − T_m ... T_m V*‖
≤ γ ‖T_{β_{t−1}} ... T_{β_1} V_0 − T_m ... T_m V*‖ + log(|A|) / β_t
≤ γ^t ‖V_0 − V*‖ + log(|A|) Σ_{k=1}^{t} γ^{t−k} / β_k.   (13)

If β_t → ∞, then lim_{t→∞} Σ_{k=1}^{t} γ^{t−k} / β_k = 0. Taking the limit of the right-hand side of Eq. (13), we obtain lim_{t→∞} ‖V_t − V*‖ = 0, although the non-expansion property may be broken during the dynamic adjustment of β_t.
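This limiting behaviour can be checked numerically. The sketch below runs value iteration with a Boltzmann (log-sum-exp) backup and an increasing schedule β_t = t on a hypothetical two-state MDP, and converges to the same V* that the hard max operator would produce (the MDP numbers and the β schedule are illustrative assumptions):

```python
import math

def softmax_backup(V, r, p, gamma, beta):
    """(T_beta V)(s) = L_beta over the one-step action values Q(s, a)."""
    out = {}
    for s in V:
        qs = [r[s][a] + gamma * sum(p[s][a][s2] * V[s2] for s2 in V) for a in r[s]]
        m = max(qs)
        out[s] = m + math.log(sum(math.exp(beta * (q - m)) for q in qs)) / beta
    return out

r = {"s0": {"stay": 0.0, "go": 1.0}, "s1": {"stay": 2.0, "go": 0.0}}
p = {"s0": {"stay": {"s0": 1.0, "s1": 0.0}, "go": {"s0": 0.0, "s1": 1.0}},
     "s1": {"stay": {"s0": 0.0, "s1": 1.0}, "go": {"s0": 1.0, "s1": 0.0}}}

V = {"s0": 0.0, "s1": 0.0}
for t in range(1, 300):                   # beta_t = t grows without bound
    V = softmax_backup(V, r, p, gamma=0.9, beta=float(t))
# V approaches V* = {"s0": 19, "s1": 20}, consistent with the bound in Eq. (13)
```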
In the end, pathology deemed 105 tissue samples and 459 Raman spectra from 20 patients to be neoplastic tissue. The final pathology classification for 55 tissue samples and 219 Raman spectra from 12 patients was normal brain tissue. To speed up the subsequent training process, we evaluated and cached the values of each bottleneck. Backpropagation was used to iteratively update the final layer's weights by comparing its predictions to the ground-truth labels. The holdout test set, separate from the training and validation sets, was used to evaluate the retrained model after 40,000 steps. TensorBoard displayed the training and validation learning curves for a single cross-validation round.
The learning task is therefore to learn the mappings F_i : R^{d_{i−1}} → R^{d_i} for each layer i > 0 so that the final output o_M minimizes the empirical loss L on the training set. This learning task can be completed efficiently with backpropagation when each F_i is parametric and differentiable. The chain rule computes the gradients of the loss function with respect to every parameter at every layer, and gradient descent is used for the parameter updates. After training is finished, the output of the intermediate layers is a new representation learned by the model, as in Eqs. (14)–(18):

L = Σ_i l(ŷ_i, y_i) + Σ_k Ω(f_k),   (14)
y_j(p) = f( Σ_{i=1}^{n} x_i(p) · w_{ij} − θ_j ),   (15)

L^{(t)} = Σ_{i=1}^{n} [ g_i f_t(x_i) + (1/2) f_t^2(x_i) ] + Ω(f_t),   (16)

y_k(p) = f( Σ_{j=1}^{m} x_{jk}(p) · w_{jk}(p) − θ_k ),   (17)

E = E + (e_k(p))^2 / 2.   (18)
where g_i are the first-order gradient statistics of the loss function. The decision tree is built from the root until the maximum depth is reached. Let I_L and I_R be the instance sets of the left and right nodes after a split. The perceptron is typically used in supervised linear classification tasks, in which a hyperplane is tuned to fit a training dataset.
Δw_i = η (true_j − pred_j) x_i,   (19)

where η is the learning rate, true_j is the true class label, and pred_j is the predicted class label. The structure of the multilayer perceptron enables it to learn complex tasks by extracting more significant features from the input pattern, as defined in Eqs. (20)–(21):

w = w − η × (d/dw) F(w),   (20)

pr_{a_j} = β · pr_{mf,a_j} + (1 − β) · pr_{sent,a_j}.   (21)
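The perceptron rule of Eq. (19) can be sketched and verified on a linearly separable toy problem (an AND gate; the learning rate and epoch count below are arbitrary choices for illustration):

```python
def predict(w, b, x):
    """Threshold activation on the hyperplane w.x + b."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def train_perceptron(samples, eta=0.1, epochs=20):
    """w_i <- w_i + eta * (true_j - pred_j) * x_i, as in Eq. (19)."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in samples:
            err = y - predict(w, b, x)
            w = [wi + eta * err * xi for wi, xi in zip(w, x)]
            b += eta * err
    return w, b

# AND gate: linearly separable, so the perceptron converges
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(data)
```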
4 Performance Analysis
The setup of the relevant parameters, hardware devices, and virtual library facilities was completed before running the experiments, e.g., epochs = 5 and k-fold = 5. Specifically, we used Google Colab Pro with a GPU Tesla P100-PCIE-16 GB or GPU Tesla V100-SXM2-16 GB, together with the Keras and TensorFlow libraries. We also used the implementations of the SVD, NMF, and SVD++ algorithms provided by the Surprise library (http://surpriselib.com/, accessed on 10 December 2020).
Dataset description: Experiments conducted on open databases such as MovieLens 100K, MovieLens 20M, the Internet Movie Database (IMDb), and the Netflix database were
not found suitable for our work. These databases are largely outdated and contain old movies whose relevant microblogging data is not available. After a thorough assessment of various databases, the MovieTweetings database [12] was chosen for the proposed framework. MovieTweetings is widely considered an up-to-date version of the MovieLens database. The MovieTweetings database is unfiltered, unlike MovieLens, where every user has rated at least 20 movies. The goal of this database is to provide up-to-date ratings, so it contains more realistic data for sentiment analysis. This database is extracted from social media; it is extremely diverse, yet it has a low sparsity value.
4.1 Comparative Analysis
Table 1 shows a parametric comparison of the proposed and existing techniques on various sentiment datasets. The datasets compared are MovieLens and MovieTweetings, in terms of accuracy, average precision, recall, true positive rate, and F-measure.
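For reference, all of the reported metrics derive from the 2x2 confusion matrix; the counts below are hypothetical, chosen only so that recall comes out at 72% like the headline figure:

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall (true positive rate), and F-measure."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                      # also the true positive rate
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f_measure = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}

m = metrics(tp=72, fp=19, fn=28, tn=81)   # hypothetical confusion counts
```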
From Fig. 2, the accuracy comparison is shown for the proposed and existing techniques on the Movielens, Movielens 20M, and IMDb datasets. The proposed method attained an accuracy of 92%, while existing ATR-FTIR achieved 89% and ANN 91% on the Movielens dataset; for the Movielens 20M dataset, the proposed technique attained an accuracy of 95%, while existing ATR-FTIR attained 92% and ANN 94%; for the IMDb dataset, the proposed technique attained an accuracy of 95%, while existing ATR-FTIR attained 92% and ANN 94%.
Figure 3 compares the average precision of the proposed and existing techniques on the Movielens, Movielens 20M, and IMDb datasets. The proposed technique attained an average precision of 92%, while existing ATR-FTIR achieved 89% and ANN 91% on the Movielens dataset; for the Movielens 20M dataset, the proposed technique attained an average precision of 95%, while existing ATR-FTIR attained
Table 1 Parametric comparison of proposed and existing techniques for various sentiment datasets

Dataset          Technique     Accuracy   Average precision   Recall   True positive rate   F-measure
Movielens        SESAME        91         71                  62       52                   59
Movielens        IMDb          92         75                  65       55                   63
Movielens        CDKCA_RMNN    95         77                  68       59                   65
MovieTweetings   SESAME        93         74                  65       59                   61
MovieTweetings   IMDb          95         77                  69       61                   65
MovieTweetings   CDKCA_RMNN    98         79                  72       63                   68
Fig. 2 Comparison of accuracy
an average precision of 92% and ANN 94%; for the IMDb dataset, the proposed technique attained an average precision of 95%, while existing ATR-FTIR achieved 92% and ANN 94%.
Figure 4 shows the recall comparison for the proposed and existing techniques on the Movielens, Movielens 20M, and IMDb datasets. The proposed technique attained a recall of 92%, while existing ATR-FTIR achieved 89% and ANN 91% on the Movielens dataset; for the Movielens 20M dataset, the proposed technique attained a recall of 95%,
Fig. 3 Comparison of average precision
Fig. 4 Comparison of recall
while existing ATR-FTIR attained a recall of 92% and ANN 94%; for the IMDb dataset, the proposed technique attained a recall of 95%, while existing ATR-FTIR attained 92% and ANN 94%.
Figure 5 compares the true positive rate of the proposed and existing techniques on the Movielens, Movielens 20M, and IMDb datasets. The proposed technique attained a true positive rate of 92%, while existing ATR-FTIR achieved 89% and ANN 91% on the Movielens dataset; for the Movielens 20M dataset, the proposed technique attained a true positive rate of 95%, while existing ATR-FTIR attained 92% and ANN 94%; for the IMDb dataset, the proposed technique attained a true positive rate of 95%, while existing ATR-FTIR attained 92% and ANN 94%.
From Fig. 6, the F-measure comparison is shown for the proposed and existing techniques on the Movielens, Movielens 20M, and IMDb datasets. The proposed technique attained an F-measure of 92%, while existing ATR-FTIR achieved 89% and ANN 91% on the Movielens dataset; for the Movielens 20M dataset, the proposed technique attained an F-measure of 95%, while existing ATR-FTIR attained 92% and ANN 94%; for the IMDb dataset, the proposed technique attained an F-measure of 95%, while existing ATR-FTIR achieved 92% and ANN 94%.
The database extracted from MovieTweetings contains users' movie ratings and the movies' genres; beyond the release year and genres, however, it holds no other information. Such data is usable only in the case of collaborative filtering, where enough users share attributes in the system, suggestions are made solely from the ratings given by related users, and items are recommended based on judgments of user similarity. In the proposed model, the collaboratively filtered data and the similarity of movies have
Fig. 5 Comparison of true positive rate
Fig. 6 Comparison of F-measure
been used on account of their attributes. The Movie Database (TMDb) API was used to obtain the movies' attributes. TMDb is a leading source of comprehensive movie metadata covering over 30 languages. The modified database contains extremely obscure movies from various countries and languages whose metadata was not available in TMDb; such movies with almost no metadata were discarded. The final database had around 4500 movies.
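The metadata-based pruning described above can be sketched as a simple filter; the field names and the toy catalog are assumptions for illustration, not the actual TMDb schema:

```python
def filter_movies(movies, required=("genres", "language", "release_year")):
    """Keep only movies whose metadata has every required field non-empty."""
    return [m for m in movies if all(m.get(k) for k in required)]

catalog = [
    {"title": "A", "genres": ["Comedy"], "language": "en", "release_year": 2015},
    {"title": "B", "genres": [], "language": "fr", "release_year": 2012},   # no genres
    {"title": "C", "genres": ["Drama"], "language": "hi"},                  # no year
]
kept = filter_movies(catalog)
```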
SentiWordNet is a lexical resource that reports, rather than a word's meaning, its sentiment polarity. To identify the polarity and subjectivity of the various hotel reviews, we used SentiWordNet, a freely available analyzer of the English language that contains sentiment scores derived from the WordNet database. We split our collection of reviews to extract words (hotel features). We assigned every representative term occurring under the fitting features, as explained in the previous steps, to the positive (pos), negative (neg), or neutral (neu) class to calculate the sentiment score. SentiWordNet is integrated with Python's NLTK package and furnishes WordNet synsets with sentiment polarity. WordNet provides various semantic relationships between words, which are used to compute sentiment polarities. In basic terms, sentiment analysis is the process of quantifying something subjective, such as textual reviews.
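The scoring idea can be illustrated with a tiny stand-in lexicon (in the real pipeline the (pos, neg) scores come from SentiWordNet via NLTK; the words and numbers below are made up for demonstration):

```python
# stand-in lexicon: (positive, negative) score per word
LEXICON = {"love": (0.8, 0.0), "great": (0.7, 0.1),
           "bad": (0.0, 0.7), "salad": (0.0, 0.0)}

def sentiment_label(tokens):
    """Sum (pos - neg) over lexicon words; the sign gives the polarity label."""
    score = sum(LEXICON[t][0] - LEXICON[t][1] for t in tokens if t in LEXICON)
    if score > 0:
        return "pos"
    if score < 0:
        return "neg"
    return "neu"

label = sentiment_label("i love the burger but the salad was bad".split())
```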
As our recommender is intended to manage heterogeneous kinds of data, a database is required to store this varied data. We use the Cassandra database to store the information used in the proposed recommender. Review pages that match the query keywords are downloaded and stored in the NoSQL database in Hadoop. The dataset used in this study comes from external sources such as the TripAdvisor and Expedia lodging sites; i.e., hotel data is saved in comma-separated value (CSV) format. It is then converted into JSON format to improve its readability; i.e., users' textual reviews and the ratings assigned by existing users, recorded as rating scores, likes, or star rankings, are stored in Cassandra.
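The CSV-to-JSON step can be done with the standard library alone; the column names below are invented for illustration:

```python
import csv
import io
import json

def csv_to_json(csv_text):
    """Parse comma-separated review records and re-serialize them as JSON."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows)

raw = ("user,hotel,rating,review\n"
       "u1,H1,4,clean rooms\n"
       "u2,H1,2,noisy at night\n")
doc = csv_to_json(raw)
```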
The rating scores can vary on scales of 1 to 5 or 1 to 10. The results showed that more tweets are classified as negative than as the other categories, and the total number of neutral tweets was surprising. Nonetheless, when the test tweets were examined, it was seen that the tweets labelled as neutral consisted of statements, suggestions, and questions.
A web-based graphic user interface (GUI) was developed for straightforward
prediction. Due to this, users will find it simpler to interact with the sentiment predic-
tion system. The GUI simplifies uploading a spreadsheet file containing the extracted
tweets for analysis. The uploaded file’s username, date, number of retweets, actual
text, mentions, hashtags, tweetid, and permalinks are all displayed here. The section
of this interface that lets the user select the pre-trained model they want to use for
processing is an important feature.
After looking at these results, it is clear that improvements to feature selection
and sentiment classification algorithms are still needed, so this is a new area of study.
Data for sentiment analysis comes from websites like Flipkart, Facebook, Twitter, and other social media platforms. Individuals openly express their views on these media about specific topics, products, and political issues. One can learn more about one's field and improve by reviewing these opinions. Although sentiment analysis has
been the subject of much research, it still faces numerous obstacles. It is hard to tell
when someone is being sarcastic when they say what they think.
5 Conclusion
This research proposes a novel technique in generic sentimental analysis for web
data classification for social media using machine learning techniques. The input
data is classified using convolutional discriminant kernel component analysis and
recommendation in social media using reinforcement multilayer neural networks.
The classifier is trained or modeled using the labeled data and then tested on new but
related texts to determine how well it predicts the sentiment of the new documents.
The initial brand and product comparison results show the value of text mining and sentiment analysis on social media data; machine learning classifiers are a valuable tool for users, product manufacturers, and regulatory and enforcement agencies to monitor brand or product sentiment trends and to act in the event of a sudden or significant rise in negative sentiment. Further research will examine comment
spamming, contrasting the sentiment classification capabilities of various machine
learning algorithms, temporal analysis for identifying an upward or downward trend
in user or brand sentiment, and clustering tweet and user attitudes by region. The
proposed technique attained an accuracy of 98%, average precision of 79%, recall
of 72%, true positive rate of 63%, and F-measure of 68%.
References
1. He L, Yin T, Zheng K (2022) They May Not Work! An evaluation of eleven sentiment analysis
tools on seven social media datasets. J Biomed Inform 132:104142
2. Alsayat A (2022) Improving sentiment analysis for social media applications using an ensemble
deep learning language model. Arab J Sci Eng 47(2):2499–2511
3. Xu QA, Chang V, Jayne C (2022) A systematic review of social media-based sentiment analysis:
emerging trends and challenges. Decision Anal J 100073
4. Jalil Z, Abbasi A, Javed AR, Badruddin Khan M, AbulHasanat MH, Malik KM, Saudagar AKJ
(2022) Covid-19 related sentiment analysis using state-of-the-art machine learning and deep
learning techniques. Front Public Health 9:2276
5. Iqbal A, Amin R, Iqbal J, Alroobaea R, Binmahfoudh A, Hussain M (2022) Sentiment analysis
of consumer reviews using deep learning. Sustainability 14(17):10844
6. Li X, Zhang J, Du Y, Zhu J, Fan Y, Chen X (2022) A novel deep learning-based sentiment
analysis method enhanced with emojis in microblog social networks. Enterprise Inf Syst 1–22
7. Alanazi SA, Khaliq A, Ahmad F, Alshammari N, Hussain I, Zia MA, Afsar S et al (2022)
Public’s mental health monitoring via sentimental analysis of financial text using machine
learning techniques. Int J Environ Res Public Health 19(15):9695
8. Ali I, Asif M, Hamid I, Sarwar MU, Khan FA, Ghadi Y (2022) A word embedding technique
for sentiment analysis of social media to understand the relationship between Islamophobic
incidents and media portrayal of Muslim communities. PeerJ Comput Sci 8:e838
9. Chandrasekaran G, Antoanela N, Andrei G, Monica C, Hemanth J (2022) Visual sentiment
analysis using deep learning models with social media data. Appl Sci 12(3):1030
10. Mallick C, Mishra S, Giri PK, Paikaray BK (2023) Machine learning approaches to sentiment
analysis in online social networks. Int J Work Innovation 3(4):317–337
11. Thimmapuram M, Pal D, Mohammad GB (2022) Sentiment analysis—based extraction of
real—time social media information from twitter using natural language processing. Soc Netw
Anal: Theory Appl 149–173
12. PM KR (2022) Sentiment analysis, opinion mining and topic modelling of epics and novels
using machine learning techniques. Mater Today: Proc 51:576–584
13. Cordero J, Bustillos J (2022) Sentiment analysis based on user opinions on twitter using machine
learning. In: Applied technologies: third international conference, ICAT 2021, Quito, Ecuador,
October 27–29, 2021, Proceedings. Cham, Springer International Publishing, pp 279–288
14. Yin Z, Shao J, Hussain MJ, Hao Y, Chen Y, Zhang X, Wang L (2022) DPG-LSTM: an enhanced
LSTM framework for sentiment analysis in social media text based on dependency parsing and
GCN. Appl Sci 13(1):354
15. Sumathy B, Kumar A, Sungeetha D, Hashmi A, Saxena A, Kumar Shukla P, Nuagah SJ (2022)
Machine learning technique to detect and classify mental illness on social media using lexicon-
based recommender system. Comput Intell Neurosci
16. Gupta A, Matta P, Pant B (2022) A comparative study of different sentiment analysis classi-
fiers for cybercrime detection on social media platforms. In: AIP conference proceedings, vol
2481(1). AIP Publishing LLC, p 060005
17. Hinduja S, Afrin M, Mistry S, Krishna A (2022) Machine learning-based proactive social-
sensor service for mental health monitoring using Twitter data. Int J Inf Manage Data Insights
2(2):100113
18. Srikanth J, Damodaram A, Teekaraman Y, Kuppusamy R, Thelkar AR (2022) Sentiment anal-
ysis on COVID-19 twitter data streams using deep belief neural networks. Comput Intell
Neurosci
19. Yenkikar A, Babu CN, Hemanth DJ (2022) The semantic relational machine learning model
for sentiment analysis using cascade feature selection and heterogeneous classifier ensemble.
PeerJ Comput Sci 8:e1100
20. Kuppusamy M, Selvaraj A (2023) A novel hybrid deep learning model for aspect-based
sentiment analysis. Concurr Comput Pract Exper 35(4):e7538
21. Venkatesh B, Hegde SU, Zaiba ZA, Nagaraju Y (2021) Hybrid CNNLSTM model with GloVe
word vector for sentiment analysis on football specific tweets. In: 2021 international conference
on advances in electrical, computing, communication and sustainable technologies (ICAECT),
pp 1–8
22. Sanagar S, Gupta D (2020) Unsupervised genre-based multidomain sentiment lexicon learning
using corpus-generated polarity seed words. IEEE Access 8:118050–118071
23. Saharudin SN, Wei KT, Na KS (2020) Machine learning techniques for software bug prediction:
a systematic review. J Comput Sci 16(11):1558–1569
24. Feng Y, Cheng Y (2021) Short text sentiment analysis based on multichannel CNN with
multi-head attention mechanism. IEEE Access 9:19854–19863
25. Nijhawan T, Attigeri G, Ananthakrishna T (2022) Stress detection using natural language
processing and machine learning over social interactions. J Big Data 9(1):1–24
Cloud Spark Cluster to Analyse English
Prescription Big Data for NHS
Intelligence
Sandra Fernando, Victor Sowinski Mydlarz, Asya Katanani, and Bal Virdee
Abstract Spark is a large-scale data processing engine that can be up to a hundred
times faster than the Hadoop big data processing engine. Even though Spark is a
complete in-memory framework with fewer big data platform facilities than Hadoop,
the Spark analytics engine combined with the Hadoop distributed file system gives
better throughput than Hadoop alone. The main contribution of this paper is insight
into the behaviour of an HDFS-based Azure Cloud Spark cluster, with a discussion
and evaluation of its strengths and limitations using a large NHS prescription dataset.
Data on NHS prescriptions obtained from 2015 to April 2022 exceeds 500 GB of
records. A public dashboard for individual BNF code analysis and studies on NHS
cost analysis exist, but no analysis of this date range and volume of NHS prescription
data, especially using new big data processing engines such as Spark, has been
conducted. This study also contributes descriptive statistics and machine learning
models of prescription data trends using the Cloud Spark engine and PySpark
technology, which have not been used in this context before. The study compares
regions as well as GP practices in terms of reimbursement cost, drug consumption
level, drug type, and disease type; charts the varied demand for dispensed chemical
substances over the years; and shows which diseases have increased or decreased
over the years, as well as the total cost and its trends.
Keywords Cloud cluster · Big data · Prescription data · Machine learning
engines · PySpark · Azure Spark architecture
S. Fernando (B)·V. S. Mydlarz ·A. Katanani ·B. Virdee
Assistive Technology Group, SCDM, London Metropolitan University, 166-220 Holloway Rd,
London N7 8DB, UK
e-mail: s.fernando@londonmet.ac.uk
V. S. Mydlarz
e-mail: w.sowinskimydlarz@londonmet.ac.uk
B. Virdee
e-mail: b.virdee@londonmet.ac.uk
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_27
361
362 S. Fernando et al.
1 Introduction
The sheer amount of computing resources and software services needed to support
big data efforts can strain the financial and intellectual capital of even the largest
businesses. That is where cloud computing comes into play as an ideal platform for
big data. Cloud computing provides limitless computing resources and services on
demand, and the business does not have to build its own or maintain the infrastruc-
ture. Thus, the cloud makes big data technologies accessible and affordable to almost
any size of enterprise. The motivation for this study derives from the need to eval-
uate a high-throughput cloud-based big data processing engine with a stable storage
mechanism and third-party supported data input and output technology. The cloud
platform chosen for this research is the Microsoft Azure HDInsight Spark cluster.
Apache Spark is a distributed computing framework that supports a set of libraries
for real-time, large-scale data processing. Other subcomponents of the technology
are storage accounts for blob containers and Jupyter Notebook for running Python
APIs (PySpark) for the Apache Spark engine.
Drug utilization in England and Wales has been analysed before [1] using secular
trend analysis. That analysis of the prescription cost database serves purely to inves-
tigate medication and drug consumption; its findings on the most common medica-
tions and their trends are similar to those of this paper. The analysis used statistical
packages for social science but did not discuss the technical details of big data
processing capabilities and performance. Open Prescribing [2] uses anonymized
data about drugs prescribed by GPs and provides a dashboard of Sub-ICB location,
practice, and other drug trends. However, this dashboard does not discuss the backend
technology or the processing of 700 million rows and its technical depths.
Spark technology is fairly new; therefore, only limited real case studies have been
analysed using it. Several papers have been published technically reviewing Spark
against the Hadoop ecosystem and its features, simply showing its strengths and
weaknesses without application [3–5]. An investigation was carried out with an
Amazon cloud-based Spark cluster for machine learning algorithm processing [6],
streaming tweet data from a machine learning repository and using a small healthcare
dataset to trigger alerts. That application does not discuss challenges, and the dataset
is completely different in nature from the one presented in this study. The paper
contains a limited description of the technical evaluation and takeaways.
This study uses the NHS prescription dataset analysed from Jan 2015 to April
2022. There is no other study to scrutinise the data at this date range using a cloud
Spark cluster. The main contribution of this paper is the Cloud Spark approach
to process a large amount of NHS data using HDInsight Cluster technology, its
strengths and weaknesses. The technical details and challenges of the HDInsight
Spark Cluster implementation leave the reader with an informed choice of application
type for Spark technology.
Cloud Spark Cluster to Analyse English Prescription Big Data for NHS 363
The study also connects to NHS intelligence comparing the regions as well as
some practices in terms of reimbursement cost, drug consumption level, the type
of the drug, and the disease type. The study demonstrates the varied demand for
some dispensed chemical substances over the years through charts and shows what
diseases have increased or decreased over the years as well as the total sum of the
appearance of those diseases in each particular region.
The rest of the paper starts with the related work presenting comparative studies
of Spark and other technologies. The material and methods section presents the tech-
nology and techniques utilized to process the English Prescribing Dataset and the
focussed questions. The paper then reveals the findings of NHS intelligence, answering a few
questions of the study by presenting results. The challenges and limitation section
elaborates on technical barriers encountered in processing large datasets in the cloud
environment and solutions to overcome those barriers. The section also explores the
Spark and Hadoop performance with the utilization of (1) cores and (2) PySpark
technology. The conclusion presents two main contributions of the paper: (1) find-
ings of NHS intelligence and (2) the Spark deployment approaches based on the
experimental findings.
2 Related Work
The big data analytics of health-related data is one of the most vital industrial strate-
gies highlighted in the UK government’s life sciences industrial strategy report
[7]. The UK’s healthcare organization, the NHS, holding extensive repositories of
patient data, produces information at an enormous rate. This growth exceeds the
capabilities of established IT infrastructures and largely represents greenfield
computing and data management problems [8].
New data management systems have been introduced to meet the challenges of big
data [9]. Apache Spark and Hadoop are high-power distributed parallel computing
cluster systems that are commonly used for operating the software framework for
big data analysis [10]. Apache Spark consists of several components: Spark core;
Spark Streaming; Spark MLlib; Spark SQL; Spark GraphX, etc.
It has been over a decade since the term “big data” was introduced. Big data
simply refers to quantities of information so large that traditional PCs fail to store,
process, and analyse them. One of the various fields that generate data at large scale
today is the healthcare industry. Healthcare-related big data hides great potential:
when properly applied, insightful knowledge derived from it can safeguard public
health, determine and execute applicable treatments for patients, support clinical
advancements, and monitor the safety of healthcare systems.
This study analyses NHS Prescribing data on the cloud cluster Spark engine.
Some reasons why Apache Spark would be a better choice than Apache Hadoop are
presented in Table 1: (1) in-memory cluster computation, (2) real-time processing,
(3) low latency and high throughput, and (4) a range of stable language support
with Java, Scala, Python, and R. According to Andreas Kretz [11], Hadoop is more
Table 1 Comparison of Hadoop and Spark for data processing
Faster performance: Spark is designed for in-memory processing. Hadoop is designed to process data using local disc storage across multiple sources.
Processing type: Spark provides both batch and real-time processing. Hadoop provides batch and linear data processing.
Latency: Spark is a low-latency, high-throughput computing framework. Hadoop is a high-latency, high-throughput computing framework.
Language support: Spark supports Java, Scala, Python, and R. Hadoop ideally uses Java; however, languages like R, Python, and Ruby can also be implemented.
than just storage, a whole ecosystem, while Spark is just a data analytics framework
with no storage capacity.
Both the Hadoop and Spark data processing engines can be complementary,
where Hadoop is used as a storage, Yarn for resource management, and analytics is
processed with Spark. Hadoop and Spark can be managed in the same cluster with
HDFS data and a Spark worker thread. Spark can determine which node the data is
stored in and then load it into the memory of that machine for processing rather than
transferring data between machines that causes significant traffic. If the job is batch
processing, such as counting or averaging, then MapReduce is better, whereas, for
more complex machine learning computations or faster streaming, Apache Spark is
advised. In this research, the technique of HDFS (Hadoop distributed file system),
Yarn resources management, and Spark Engine is tested with Azura Cloud. The next
section details the specification of the deployment and subcomponent architecture.
3 Material and Methods
The cloud platform for this study is the Azure Platform Management Portal, which
gives control over service deployment, administrative tasks, and information on the
health of the implementations and accounts. The English Prescribing Dataset contains
detailed information on prescriptions issued in England, Wales, Scotland, Guernsey,
Alderney, Jersey, and the Isle of Man. The cloud blob containers of HDFS were used
to store the 88 blobs, summing to around 500 GB, and Jupyter Notebook was used
for running Python APIs (PySpark) for the Apache Spark engine. The initial four
processors and three cores were increased to six processors and 15 cores during the
study. This change improved the execution performance of the queries by approximately
25%.
Apache Ambari, providing an easy-to-use web UI, was used for managing and
monitoring the Hadoop clusters. Azure HDInsight is
a versatile, managed cluster platform running big data frameworks in large volume
and velocity using Apache Spark [12].
Microsoft Azure Blob Storage is an object storage solution for the cloud. It is
designed for storing large amounts of unstructured data. SQL Azure Reporting allows
running reports against SQL Azure Databases in the cloud [13]. Azure Active Direc-
tory is a cloud-based identity and access management service. Apache Spark is an
open-source unified analytics engine for large-scale data processing. Power BI and
Tableau are the interactive data visualization tools used in this research.
The project was initiated with four processors and three cores. Processing the
whole dataset required optimization, as these specifications turned out to be insuf-
ficient; moving to six processors and 15 cores increased the processing speed
significantly. The data in the CSV files is organized into three tiers. The most basic
unit is monthly records, approximately 6 GB in size. The intermediate yearly records
are 17 GB each, after filtering the relevant columns and merging the monthly records.
The final data frame is a combination of all the data (seven years and four months).
The virtual machine (hosted in the MS cloud) is started through Azure Lab Services.
The entry point to the cluster is the Azure Portal, which contains all the components
mentioned.
Figure 1 represents Apache Spark as a parallel processing framework supporting
in-memory processing to enhance the performance of big data analytic applications.
The Spark cluster is a combination of a driver program, a cluster manager, Zookeeper
nodes, and worker nodes that work together to complete the NHS descriptive analysis
and data modelling tasks [14].
The Spark Context coordinates processes across the cluster. Apache Spark appli-
cations have corresponding executor processes that manage the tasks and remain on
alert throughout the execution cycle. Driver programs constantly accept the connec-
tions from the executors during their life cycle. Apache Spark drivers schedule the
tasks on the cluster and close the worker nodes. Spark Drivers are on the same
local area network as the rest of the components, and HDFS distributed file system
manages large data sets running on commodity hardware.
Figure 2 demonstrates the flow of the application starting with a SparkContext
instance. The driver program requests resources from the cluster manager to launch
executors. The cluster manager launches executors. The driver process runs through
the user application. Tasks are sent to executors. Executors run the tasks and save the
results. If any worker crashes, its tasks are sent to different executors to be processed
again.
NHS in April 2020 underwent an organizational change. The four regions: North
of England, Midlands and East, London, and South of England were split into seven
regions: East of England, London, Midlands, North East and Yorkshire, North West,
South East, and South West. The data attributes used in this study are practice name,
chemical substance, prescribed product (BNF), total quantity (of the appliance
prescribed), average daily quantity, net ingredient cost (NIC), and actual cost (after
discounts and other expenses).
In this study, structured secondary data, English Prescribing Dataset, is fetched for
processing from the 88 blobs with the read.csv() method of the Spark session, with
the inferSchema parameter removed for performance purposes.
Fig. 1 Architecture of the system
The data did not undergo any cleaning steps, as the Data Quality Policy of the NHS ensures the consistency,
timeliness, efficiency, validity, and completeness of the data. The study implements
the StringIndexer, OneHotEncoderEstimator, and VectorAssembler classes to convert
categorical fields into numerical ones and to vectorize the features.
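A rough pure-Python sketch of what these three stages do (in Spark the encoder also drops the last category by default, which this sketch omits; the region and cost values are illustrative, not from the dataset):

```python
# Pure-Python sketch of the StringIndexer -> one-hot encoder -> VectorAssembler
# pipeline: index a categorical column by frequency, one-hot encode the index,
# then concatenate it with the numeric columns into one feature vector.

def string_index(values):
    # StringIndexer: the most frequent category gets index 0 (ties broken by name).
    freq = {}
    for v in values:
        freq[v] = freq.get(v, 0) + 1
    ordered = sorted(freq, key=lambda v: (-freq[v], v))
    mapping = {v: i for i, v in enumerate(ordered)}
    return [mapping[v] for v in values], mapping

def one_hot(index, size):
    # One-hot encoding (simplified): index -> 0/1 vector of length `size`.
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def assemble(categorical_vec, numeric_fields):
    # VectorAssembler: concatenate feature columns into a single vector.
    return categorical_vec + list(numeric_fields)

regions = ["London", "Midlands", "London", "South East"]   # hypothetical values
costs = [12.5, 3.0, 7.2, 4.4]
indices, mapping = string_index(regions)
rows = [assemble(one_hot(i, len(mapping)), [c]) for i, c in zip(indices, costs)]
```

The assembled rows are what a regression stage consumes: each row is one flat numeric vector combining the encoded category with the numeric attributes.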
The numerical fields in the dataset are the actual cost of the medication, the net
ingredient cost, and the total quantity of the medication prescribed. Figure 3
presents descriptive statistics for these attributes. The data is modelled with the linear
regression algorithm: 75% of the data is used for training, and the remaining 25%
is used to make predictions.
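The split and fit can be sketched in pure Python (PySpark itself would use `DataFrame.randomSplit` and `pyspark.ml.regression.LinearRegression`; the toy quantity-to-cost data below is illustrative, not NHS data):

```python
import random

def train_test_split(rows, train_frac=0.75, seed=42):
    # Analogue of PySpark's DataFrame.randomSplit([0.75, 0.25], seed=42):
    # a seeded shuffle followed by a 75/25 cut.
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]

def fit_simple_linear(points):
    # Ordinary least squares for a single feature: y = a + b * x.
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    b = sum((x - mx) * (y - my) for x, y in points) / \
        sum((x - mx) ** 2 for x, _ in points)
    return my - b * mx, b  # intercept a, slope b

# Toy quantity -> cost data lying exactly on y = 2x + 1 (illustrative only).
data = [(q, 2.0 * q + 1.0) for q in range(1, 101)]
train, test = train_test_split(data)
intercept, slope = fit_simple_linear(train)
```

Seeding the split matters: it makes the 75/25 partition, and therefore any reported model metrics, reproducible across runs.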
Focus Questions of the study.
Q1: What are the top drugs prescribed by GPs in England? What are their
categories? What are they prescribed for?
Q2: What is the relationship between spending and location?
Q3: What is the relationship between spending and drug prescription?
Q4: Group practices by spending levels.
Q5: Group practices by drug recommendations.
Q6: Find descriptive statistics.
Q7: Will a GP practice with certain spending likely increase a particular drug in
future?
Fig. 2 Architecture of the cloud cluster
Fig. 3 Descriptive statistics of the numerical data
Q8: What is the trajectory of different drug recommendations based on their
historical data?
Q9: What is the trajectory of different drug recommendations based on locations?
Q10: What is the trajectory of spending of a given GP practice based on its historical
records?
4 Results and Discussion
Three research questions are presented in the paper.
4.1 What is the Trajectory of Spending of a Given General
Practice Based on Its Historical Records (Spending)?
Query Description: The trend of the sum of prediction (actual and forecast) by
year. The colour shows details about the forecast indicator, and the marks are
labelled by practice name. The view is filtered on practice name, keeping Park
Surgery and Riverside Surgery.
Figure 4 demonstrates the prediction values of the actual cost in British pound
sterling (GBP) for the Park and Riverside Surgeries. In the context of the UK’s
National Health Service (NHS), “actual cost” usually refers to the amount that the
NHS pays for a particular treatment or service. Predicting the actual cost of surgeries
in the NHS can help improve the efficiency of the healthcare system, reduce costs,
and improve patient outcomes.
Park and Riverside Surgeries, two of the surgeries that most often recommend the
top drugs, will require a higher allocation of actual cost by 2031: rises of
approximately 40% and 28%, respectively. These predictions will assist NHS
managers with efficient resource allocation, such as adequate staff and equipment
and the necessary post-surgical care placement, as well as with budgeting, cost
control, and negotiation.
4.2 What is the Trajectory of Net Ingredient Cost for the Next
Few Years?
Figure 5 demonstrates the predicted value of the net ingredient cost in the National
Health Service for the next five years. Net ingredient cost (NIC) is the amount paid,
in British pound sterling (GBP), based on the basic price of the prescribed drug or
appliance and the quantity prescribed.
Fig. 4 Projection of actual cost in the medication of two surgeries
The NIC value increases from £2404.22 M in the first quarter of 2022 by nearly
20% to £2844.82 M in the third quarter of
2026. Predicting NIC values can help the NHS improve its financial management,
enhance patient care, and support the development of new treatments and thera-
pies. The projected rise in the NIC value urges NHS organizations to plan for
efficient allocation of resources and better financial management of the costs
of pharmaceutical products. This prediction can also help identify opportunities
for cost savings, such as negotiating lower prices with suppliers or switching
to less expensive alternative treatments. The NHS can ensure that patients receive the
most effective and appropriate treatments while minimizing costs, improving the
overall quality of care. Finally, predicting NIC values can also support research and
development activities, enabling the NHS to make informed decisions about which
drugs to invest in and which to prioritize for development. Based on the above, the
NHS will have to consider allocating a higher amount for the net ingredient cost in
the upcoming years to ensure quality of care and high performance, as well as to
support research and development activities.
Fig. 5 Net ingredient cost in quarters by 2026
4.3 What Are the Most Common Disease Types in the UK?
The bar graph in Fig. 6 depicts the most typical disease types diagnosed between
Jan 2015 and Apr 2022. The top three disease classifications relate to the central
nervous system, the cardiovascular system, and the endocrine system, in descending
order. Central nervous system-related diagnoses, also known as neurological disor-
ders, involve the central and peripheral nervous system (muscles, the brain, the
cranial and peripheral nerves, the neuromuscular plate, the spinal cord, and the auto-
nomic nervous system) and affect about 10 million people in the UK. The most common
neurological diseases are dementia, Alzheimer’s, Parkinson’s, multiple sclerosis,
and epilepsy [15]. Cardiovascular system-related diagnoses involve coronary artery
diseases, hypertension, stroke, heart failure, peripheral artery disease, and arrhyth-
mias [16]. These disorders are a significant health issue in the UK and a major cause
of morbidity and mortality. They are often linked to lifestyle factors such as smoking,
lack of exercise, and poor diet, as well as genetics and other underlying health condi-
tions [17]. Endocrine system-related diagnoses involve diabetes, thyroid disorders,
pituitary disorders, adrenal disorders, and polycystic ovary syndrome (PCOS). These
disorders can affect many different areas of the body and can have a significant impact
on a person’s health and quality of life [18]. Treatment for endocrine system-related
disorders varies depending on the specific condition and its severity but may include
medication, surgery, lifestyle interventions, and hormone replacement therapy.
4.4 Challenges and Limitations
The Apache Spark cluster on Azure HDInsight did not perform as expected when
handling the NHS Prescribing dataset with the initial configuration of four processors
and three cores. Additionally, having the “inferSchema” parameter of the read.csv()
method of the Spark session set to “True” had an adverse impact on the performance
of the data-retrieving process. Next, log transformation caused null values in the
dataset, which held back training the data with the linear regression algorithm; the
modelling process consistently threw a “ZeroDivisionError”. Some queries, such as encoding
the data, implementing Spark SQL, or saving the PySpark data frame to the Hive table
for further use with the Tableau visualization tool, lasted for more than an hour. This
caused a “session timeout” error, which killed the Spark session and prevented
further progress.
Fig. 6 Most common disease types in the UK, 2015–2022
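The null values arise because the logarithm is undefined for zero or negative costs or quantities; a minimal guard, as a hedged sketch rather than the authors' exact code:

```python
import math

def safe_log_transform(values):
    # math.log is undefined for x <= 0; dropping such rows up front avoids the
    # null values that later break linear regression training (the
    # "ZeroDivisionError" seen during modelling).
    return [math.log(v) for v in values if v > 0]

costs = [12.5, 0.0, 3.0, -1.0, 7.2]   # illustrative values, not NHS data
transformed = safe_log_transform(costs)
```

An alternative to dropping rows is shifting the values (log(x + 1)); which is appropriate depends on whether zero-cost rows carry meaning for the model.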
To overcome these challenges, the researchers implemented PySpark code with
different session parameters, demonstrated in Fig. 7. Increasing the session timeout
alone did not give any results. The task was reallocated to a PySpark instance with
a higher number of processors and cores, from four to six and from 3 to 15
respectively, and the default processing time in the Livy configuration was increased
from 3600 s to 36,000 s. This allowed the queries to execute successfully. The
results were optimized by removing the “inferSchema” parameter from the
“read.csv” method. This parameter infers the schema of every single column retrieved,
a process that slows down reading the blobs; with it set to false, the schema had to be
specified explicitly. Replacing the PySpark SQL commands that were not directly
associated with data manipulation and transformation with PySpark magic commands
also contributed to the success. Although PySpark magic commands are recommended
for tasks that require interaction with the Spark engine, such as configuring the Spark
context as demonstrated in Fig. 7 or loading data into a Spark DataFrame, they turned
out to have a positive impact on the performance of PySpark SQL queries executing
dataset-related commands such as data manipulation and transformation in this study.
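The session-parameter changes described above correspond to a sparkmagic `%%configure` cell of roughly the following shape (the values are illustrative, not the authors' exact configuration; the Livy-side limit itself is raised via `livy.server.session.timeout` in the cluster configuration rather than in this cell):

```
%%configure -f
{
    "driverMemory": "8G",
    "executorMemory": "8G",
    "executorCores": 6,
    "numExecutors": 15
}
```

The `-f` flag forces the running Livy session to restart with the new parameters, which is why such cells are placed at the top of the notebook.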
4.5 Comparison Between Spark and Hadoop Performance
A few studies have compared Hadoop and Spark, as discussed in the introduction.
Table 2 records the publicly claimed running times of Spark and Hadoop. Hadoop
takes longer per iteration because it runs independent MapReduce jobs. Spark’s
first iteration, on the other hand, takes some time, while subsequent iterations take
only a few seconds; this is due to the reuse of cached data, which allows Spark to
run 10–100 times faster [19]. Table 3 demonstrates the running times of pandas and
PySpark discussed by Databricks [12]. It is evident that a PySpark multi-core local
cluster can be a good choice for modelling less than 10 GB of data.
Fig. 7 Implementation of the new session variables
Table 2 Logistic regression performance of Hadoop and Spark
Spark cloud (multi-core): 0.9 s running time
Hadoop (multi-core): 110 s running time
Table 3 Pandas and PySpark execution comparison for a max-value query
Pandas local runtime: out of memory (35 GB parquet file)
PySpark local cluster runtime: 10 GB memory, 16 threads, 260 GB
5 Conclusion
This study has evaluated some key factors related to NHS Prescribing data by
interpreting the findings of the research via focussed questions. The main contribution
of this paper is the Cloud Spark and HDFS approach to processing a large amount of
NHS data using Azure HDInsight Spark Cluster technology, with a discussion and
comparison of its strengths and weaknesses.
The study also contributes descriptive statistics and machine learning models
of NHS prescription data trends. We found that the highest amount communities
paid in GBP was on enteral nutrition. England has increasingly paid for Apixaban,
the medication that marked the highest cost increase over the last couple of years.
The most dispensed and increasingly demanded drug over the years in England was
Cholecalciferol, also known as vitamin D3. Additionally, the most common disease
types over the years have remained central nervous system-related; the most typical
nervous system disorders have been dementia, Alzheimer’s, Parkinson’s, multiple
sclerosis, and epilepsy.
Conclusions are also drawn on the technology used in this research. Even though
Hadoop is a big data framework with data input/output facilities, Spark can process
data up to 100 times faster [19]. Standalone multithreaded PySpark queries can
handle around 260 GB of data, where pandas data frame queries run out of memory.
Spark, moreover, has head, Zookeeper, and worker nodes to enable the distribution
of query and model execution. Each node can be assigned multiple cores (processors),
which may add overhead. Spark performance-related issues and errors can happen for
many reasons; the following three areas can be carefully tuned for better performance:
(1) Spark session parameters (session time, inferSchema, executor memory, etc.);
(2) the number of processors and executors; (3) the data itself, which may need
transformation or alteration to avoid problems such as “ZeroDivisionError” or null
values in certain modelling.
The important questions to ask in selecting a type of Spark processing environ-
ment are (1) the type of users using the cluster, (2) the type of workload, (3) the
budget, and (4) the service level agreement. Microsoft Azure recommends standard
clusters for single users, while single nodes suit small jobs [13]. High-concurrency
clusters are best for sharing among several users or running ad hoc jobs. It was
evident that autoscaling reduces cost compared with a fixed-size cluster; however,
scaling up and down may slow the process. Single-user, all-purpose jobs can slow
down an autoscaling cluster if the jobs arrive a few minutes apart rather than as a
constant data supply.
Cluster configuration gives a trade-off between cost and performance. More
machines with less memory and storage require more shuffling of data to complete
a task. Therefore, data analytics, complex ETL, and machine learning model training
jobs are best executed with a smaller cluster: a smaller number of nodes
to minimize the shuffles. In a multi-user scenario, where read-only access is most
needed, it is best to use on-demand instances with a hybrid approach with cluster poli-
cies for different groups of users. The technical details and comparisons presented
in this paper and the challenges of HDInsight Spark Cluster implementation should
leave the reader with a choice of Spark application depending on the need.
References
1. Naser AY, Alwafi H, Al-Daghastani T, Hemmo SI, Alrawashdeh HM, Jalal Z, Paudyal V,
Alyamani N, Almaghrabi M, Shamieh A (2022) Drugs utilization profile in England and Wales
in the past 15 years: a secular trend analysis. BMC primary care 23(1):239. https://doi.org/10.
1186/s12875-022-01853-1
2. OpenPrescribing.net, Bennett Institute for Applied Data Science, University of Oxford, 2023,
https://openprescribing.net/
3. Salloum S, Dautov R, Chen X et al (2016) Big data analytics on Apache Spark. Int J Data Sci
Anal 1:145–164. https://doi.org/10.1007/s41060-016-0027-9
4. Shaikh E, Mohiuddin I, Alufaisan Y, Nahvi I (2019) Apache Spark: a big data processing engine.
In: 2019 2nd IEEE Middle East and North Africa communications conference (MENACOMM),
Manama, Bahrain, pp 1–6. https://doi.org/10.1109/MENACOMM46666.2019.8988541
5. Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I (2010) Spark: cluster computing
with working sets. In: Proceedings of the 2nd USENIX conference on Hot topics in cloud
computing (HotCloud’10). USENIX Association, USA, 10
6. Lekha RN, Sujala DS, Siddhanth DS (2018) Applying spark based machine learning model
on streaming big data for health status prediction. Comput Electric Eng 65:393–399, ISSN
0045-7906
7. Bell J, GBE FF (2017) Life sciences industrial strategy—a report to the government from the
life sciences sector. Office for Life Sciences
8. Kyoungyoung J, Gang-Hoon K (2013) Potentiality of big data in the medical sector: focus on
how to reshape the healthcare system. The Korean Society of Medical Informatics, 79–85
9. Villars RL, Olofson CW, Eastwood M (2011) Big data: what it is and why you should care.
IDC Analyze the Future, 4
10. Dash S, Shakyawar SK, Sharma M, Kaushik S (2019) Big data in healthcare: management,
analysis and future prospects. J Big Data 54
11. Kretz A (2019) The data engineering cookbook: mastering the plumbing of data science v3
12. Wang G, Xin R, Damji J (2018) Benchmarking Apache Spark on a single node machine,
Databricks Engineering Blog. https://www.databricks.com/blog/2018/05/03/benchmarking-apache-spark-on-a-single-node-machine.html
13. Microsoft (2023) Best practices: cluster configuration, Azure Databricks documentation,
https://learn.microsoft.com/en-us/azure/databricks/clusters/cluster-config-best-practices
14. Learning Journal (2021) Parallel processing in Apache Spark, Apache Spark core context,
https://www.learningjournal.guru/article/apache-spark
15. MacDonald BK, Cockerell OC, Sander JW, Shorvon SD (2000) The incidence and lifetime
prevalence of neurological disorders in a prospective community-based study in the UK. Brain:
J Neurol 123(Pt 4):665–676. https://doi.org/10.1093/brain/123.4.665
16. Olvera Lopez E, Ballard BD, Jan A (2022) Cardiovascular disease. In: StatPearls [Internet].
StatPearls Publishing, Treasure Island (FL). Available from: https://www.ncbi.nlm.nih.gov/books/NBK535419/
17. NHS UK website (2023) Cardiovascular disease. Available at: https://www.nhs.uk/conditions/
cardiovascular-disease
Prediction of Column Average Carbon
Dioxide Emission Using Random Forest
Regression
P. Sai Swetha, M. A. Chiranjath Sshakthi, S. Hrushikesh, and A. Malini
Abstract The carbon dioxide emission in the atmosphere is increasing tremendously
each day. Researchers have developed satellites to monitor the emission level. The purpose
of this paper is to predict the column average carbon dioxide using the satellite data
and map the regions that emit carbon dioxide, so that the emission can be reduced to
maintain an eco-friendly environment. The model is trained using the Random Forest
Regression. The performance of the model depends upon the features selected for
training it. The satellite data is taken from the OCO-2 satellite. The Orbiting Carbon Observatory measures the amount of sunlight reflected from a column of air containing carbon dioxide (CO2) rather than the amount of CO2 itself. Additionally, the CO2 emission is also estimated for smaller areas using the large-area data. On average, the suggested model forecasts the carbon dioxide concentration in both bigger and smaller regions.
Keywords Wisdom of crowds · OCO-2 · XCO2 · Decision tree · Feature selection
1 Introduction
One of the biggest threats to the planet is climate change, and the main reason for this is CO2. The carbon dioxide concentration in the earth's atmosphere is increasing tremendously day by day. According to the reports, the carbon dioxide
P. S. Swetha ·M. A. C. Sshakthi ·S. Hrushikesh ·A. Malini (B)
Thiagarajar College of Engineering, Madurai, Tamil Nadu, India
e-mail: amcse@tce.edu
P. S. Swetha
e-mail: saiswetha@student.tce.edu
M. A. C. Sshakthi
e-mail: chiranjath@student.tce.edu
S. Hrushikesh
e-mail: hrushikesh@student.tce.edu
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_28
level today is 419.26 parts per million, which is more than 50% higher than before the industrial revolution. The major contributor to the expanding atmospheric carbon dioxide is human activity, such as the burning of fossil fuels like oil, coal, and gas. If carbon dioxide continues to increase at the same rate, the global warming threshold is likely to be reached around 2034–2050. Many countries are making efforts to reduce carbon dioxide emissions with the intention of overcoming global warming. Carbon dioxide levels must be monitored in order to reduce them, and the most efficient approach to detecting carbon dioxide is through satellites. The first satellite launched to detect carbon-dioxide-emitting areas was the Greenhouse Gases Observing Satellite (GOSAT), after which many more satellites were developed to detect CO2. One among them was the Orbiting Carbon Observatory 2 (OCO-2), an Earth satellite that investigated the worldwide sources and sinks of carbon dioxide, providing researchers a better understanding of climate change [1].
Greenhouse gases cannot be detected precisely by satellites like OCO-2 and GOSAT; hence, the whole atmospheric column is averaged, which is known as XCO2, where the 'X' denotes the column average observed by the satellite [2]. The main objective is to use the OCO-2 satellite data and build a machine learning model to better forecast the carbon dioxide over the surface of the earth for a smaller area. Machine learning, a subcategory of Artificial Intelligence, is one of the best approaches for building a predictive model. There are various machine learning algorithms for generating a predictive model of the XCO2 concentration, such as support vector machines, Gaussian process regression, Artificial Neural Networks (ANNs), and k-nearest neighbors, as well as tree-based ML models like Extreme Gradient Boosting (XGBoost) and Random Forest Regression [4]. Moreover, the Convolutional Neural Network, a deep learning model, can also predict the XCO2 emission. In this paper, the machine learning model is built and trained using an ensemble technique, Random Forest Regression. The ensemble technique is generally used to enhance the precision of the final outcome by combining multiple models instead of using a single model, which may not achieve sufficient accuracy. Instead of depending on just one decision tree, the method combines multiple trees [19]. The data for training the model was collected from the OCO-2 satellite dataset, which contains various features for predicting XCO2. The model was trained using the selected features. The objectives of the proposed work are:
• To predict the CO2 in smaller areas using the wisdom of crowds concept.
• To minimize the emission of CO2 into the atmosphere.
• To monitor the CO2 concentration in the atmosphere.
• To more accurately map the places that emit carbon dioxide using the OCO-2 satellite data.
The overall motivation is to accurately predict XCO2, which can assist in educating the public, researchers, and policymakers about the current environmental situation and in directing efforts to lessen the effects of climate change. The paper is organized as follows: Sect. 2 addresses the literature survey; Sect. 3 discusses the proposed methodology, which includes Random Forest Regression, a machine learning model; Sect. 4 covers the results and discussion; and Sect. 5 provides the conclusion and ideas
for future studies. The contribution of this paper is its focus on the prediction of column average carbon dioxide using OCO-2 satellite data and the mapping of the locations that release CO2, in order to minimize emissions and preserve a healthy environment. In addition, CO2 emissions are estimated for smaller locations using larger-area data.
2 Literature Survey
Mengya Sheng et al. proposed a method to more accurately retrieve XCO2 values in a particular column of air by mapping a global, spatiotemporally continuous XCO2 field using XCO2 data retrieved from the OCO-2 and GOSAT satellites, which is helpful for gap-filling and data integration. They obtained the spatiotemporally continuous mapping of XCO2 by applying an integrated kriging approach to the spatiotemporal XCO2 data [10]. Thus, by mapping the two datasets together, they were able to find the concentration of XCO2 over a column of air with more certainty.
Zhang et al. proposed a dimensionality-reduction and classification technique for hyperspectral remote sensing images of CO2 based on a neural network. They first reduced the dimensions of the hyperspectral remote sensing image using genetic algorithms and kernel principal component analysis. Then, a traditional remote sensing method classifies the hyperspectral remote sensing images. Lastly, based on the spectral local mean and standard deviation, an image for noise assessment was produced to improve accuracy [11]. They used one of the most crucial tools for obtaining high-accuracy CO2 concentration data, namely the spectral absorption characteristic spectrum of atmospheric CO2.
Brazidec et al. have addressed increasing the accuracy of XCO2 retrieval by segmenting XCO2 images with deep learning. They tackle the problem of plume segmentation using an image-to-image CNN, the U-net architecture, to convert a region of XCO2 into a picture that depicts the locations of the target plumes [12]. They claim that their model performs better than the usual segmentation techniques and is able to detect most of the plumes.
Zhang and Liu have proposed a methodology for mapping contiguous XCO2 using ML algorithms to analyze the spatiotemporal variations in the satellite-retrieved data. Using the column-averaged dry-air mole fraction XCO2 data from SCIAMACHY, GOSAT, and OCO-2, they derived contiguous XCO2 data across China at 0.25° resolution [13], achieving bias and standard deviation values of 0.11 and 1.38 ppm. With the assistance of the dataset, the outcomes of the model simulation were fairly close to the real values at the in situ locations.
Liu et al. have discussed a retrieval algorithm for XCO2 using the TanSat and GOSAT datasets. In the TanSat algorithm, the XCO2 value is calculated by fitting the observed and simulated spectra to an atmospheric radiative transfer model. The CO2 information was obtained from the strong and weak absorption bands. A one percent error was found in the observation using the TanSat algorithm, the GOSAT dataset, and atmospheric CO2 measurements from space [14]. Even though the retrieval remains uncertain in the middle to upper latitudes, its accuracy depended on the instrument's sampling precision as well as the theoretical and algorithmic settings.
Noël et al. have proposed retrieving the column-determined CO2 mole fraction in dry air from the GOSAT and GOSAT-2 datasets using the FOCAL algorithm [15]. The preprocessing involves measured spectra and geolocation; estimation of parameters for instrument noise and clouds; filtering by data quality, latitude, and zenith angle; and the addition of corresponding meteorological measurements. They concluded that FOCAL is one of the fastest algorithms for retrieving XCO2 data, as it takes on average only 22 s and six iterations to process one GOSAT ground pixel; hence, it proves to be a computationally fast algorithm for XCO2 retrieval that is said to be more precise than other existing algorithms.
Malini Alagarsamy et al. have proposed an algorithm for load balancing in cloud computing called Cost-Aware Ant Colony Optimization. The main focus of this model is the distribution of workload to virtual machines. In order to reduce processing time, response time, cost, power consumption, and carbon footprint, the model uses the swarm-based Ant Colony Optimization algorithm. From this model they inferred that the algorithm produces faster processing times and faster responses when compared with other algorithms [21]. Some points to note about the model are that the data load should be distributed evenly to avoid overloading any single virtual machine, and that the design should use multiple nodes to avoid a single point of failure.
3 Proposed Methodology
The prediction of XCO2 is done using a decision-tree-based ML algorithm, Random Forest Regression. The advantage of this algorithm is its low variance, due to the combination of multiple decision trees. It does not require any normalization, as it works on a tree-based approach. Moreover, it gives good accuracy. The workflow of the model is as follows (Fig. 1).
3.1 Data Interpretation
In order to develop an XCO2 prediction model, the data must first be interpreted, that is, explored in depth. The dataset used for this paper was obtained from the public platform NASA Earthdata [5]. The dataset contains many features along with the geolocated XCO2 retrieval values, is provided in h5 (HDF5) file format, and contains 40,000 rows and 200 columns.
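As a sketch of this step, the geolocated fields can be read from the h5 file with h5py; the file path and field names below (e.g. "xco2", "latitude") are illustrative and should be verified against the actual OCO-2 Lite product layout:

```python
import h5py
import pandas as pd

def load_fields(path, fields):
    """Read the named 1-D datasets from an HDF5 file into a DataFrame.

    Field names such as "xco2", "latitude", "longitude" are assumptions
    here; check the real file's layout with f.keys() before relying on them.
    """
    with h5py.File(path, "r") as f:
        return pd.DataFrame({name: f[name][:] for name in fields})

# Example (hypothetical file name):
# df = load_fields("oco2_lite.h5", ["xco2", "latitude", "longitude"])
```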
Fig. 1 Workflow of RFR model
Table 1 Selected features along with feature ranking
Features            S1        S2        S3
Latitude            0.502305  0.514717  0.494718
Longitude           0.164560  0.157361  0.152589
Pressure            0.129823  0.149461  0.116941
Altitude            0.079290  0.094962  0.087572
Solar angle         0.079290  NIL       0.050217
Fluorescence 757    0.031477  NIL       0.028987
Polarization angle  NIL       0.057228  0.043224
Wind                NIL       0.026271  0.025752
3.2 Feature Selection
Some significant features are selected from the dataset while keeping XCO2 as the target value. The model is trained based on the target value and the selected features, as shown in Table 1.
3.3 Data Preprocessing
First, we structured the data into a usable format so that it can be analyzed easily. Then, we converted each column chosen during feature selection into a dataframe and concatenated them into a single dataframe, with the axis set to 1 so that the dataframes are stacked side by side. The data is split using train_test_split for model training, with a test size of 0.2 and a train size of 0.8.
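A minimal sketch of this preprocessing step, assuming a pandas dataframe whose column names are illustrative stand-ins for the real OCO-2 fields:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Stand-in for the OCO-2 dataframe; column names are illustrative only.
data = pd.DataFrame(rng.random((100, 5)),
                    columns=["latitude", "longitude", "pressure",
                             "altitude", "xco2"])

# Each chosen column becomes a dataframe; axis=1 stacks them side by side.
features = ["latitude", "longitude", "pressure", "altitude"]
X = pd.concat([data[c].to_frame() for c in features], axis=1)
y = data["xco2"]

# 80/20 train/test split, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, train_size=0.8, random_state=42)
```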
3.4 Training the Model
The algorithm used for training the model is Random Forest Regression, which uses multiple decision trees as the base learning model [6]. The baseline model, a simple model that serves as a reference in a machine learning project, is set to linear regression, chosen because it helps in predicting continuous values in the dataset [7]. The parameters for training the model are n_estimators and random_state. The n_estimators parameter is set to 100, which means that 100 decision trees are built, and random_state is set to 42 so that identical results are obtained across executions. The model is then built and trained. From the selected features, those that most affect the emission of carbon dioxide are visualized and shown as a plot.
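With scikit-learn, the training step might look as follows; the synthetic data stands in for the selected features, while the hyperparameters (n_estimators=100, random_state=42) are the ones stated above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_train = rng.random((80, 4))                        # stand-in for the selected features
y_train = X_train @ np.array([0.5, 0.2, 0.2, 0.1])   # synthetic XCO2-like target

# 100 trees, fixed random_state so results are reproducible across runs.
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Per-feature importances: the quantity visualized as bar plots in Fig. 2.
print(model.feature_importances_)
```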
3.5 Evaluation Metrics
Several measures, including Mean Squared Error, Mean Absolute Error, and Coefficient of Determination, can be used to assess the effectiveness of the Random Forest Regression model. The measures used here are the R2 score and MSE, which evaluate the precision and accuracy of the trained model. The R2 score denotes how well the model has performed, that is, how well it fits the regression line [8]. Mathematically, it is given as

R2 score = (total variance explained by the model) / (total variance).
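With scikit-learn, the two measures can be computed as in this small sketch (the values below are toy numbers, not results from the paper):

```python
from sklearn.metrics import mean_squared_error, r2_score

# Toy true values and predictions, in ppm, to illustrate the two measures.
y_true = [400.0, 401.0, 402.0, 403.0]
y_pred = [400.1, 400.9, 402.2, 402.8]

mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)   # explained variance / total variance
print(mse, r2)                  # 0.025 0.98
```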
3.6 Mapping the CO2Emission Regions
A Mapbox map is created with the actual CO2 data and a legend displayed alongside it. When hovered over, each point shows its latitude, longitude, and the XCO2 concentration present.
3.7 Prediction of CO2in Smaller Areas
For the prediction of carbon dioxide in smaller areas, a small subset of the large dataset is taken to represent the smaller region, and the XCO2 value is extrapolated using the previously trained larger model; the average predicted XCO2 value is then computed for the assumed region. A concept called the wisdom of crowds is applied here, which states that a large group of people can together make better decisions than an individual. By combining the opinions of many different people, the wisdom of crowds can produce more accurate predictions and better decision-making than depending solely on one person's skill or knowledge [9]. The larger dataset is split using train_test_split for model training, with a test size of 0.2 and a train size of 0.8. The model is then trained, and the CO2 for the smaller region is obtained using the wisdom of crowds concept. A constant factor of 0.74 is applied when the smaller-area estimate is extracted from the larger-area value; since the larger data tend to be more accurate, this common factor is used to improve the accuracy of the predicted average XCO2.
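A sketch of this smaller-area step, under the assumption that the 0.74 factor is applied to the mean of the per-point predictions of the larger-area model (synthetic data, illustrative names):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 4))                      # stand-in for large-area features
y = X @ np.array([0.5, 0.2, 0.2, 0.1])        # synthetic XCO2-like target

# Model trained on the larger region.
model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

# A small subset of the large dataset stands in for the smaller region.
small = X[:20]

# Wisdom of crowds: average the per-point predictions, then apply the
# paper's empirical correction factor of 0.74.
xco2_avg = 0.74 * model.predict(small).mean()
print(xco2_avg)
```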
4 Results and Discussion
The goal of the model is to identify carbon dioxide-emitting areas so that action can be taken to reduce emissions and public attention can be drawn to the state of the environment. The results obtained from the model vary depending on the features selected to train it. The data for building the model was taken from the OCO-2 satellite dataset, which contains numerous features. The actual CO2 concentration from the OCO-2 satellite is compared with the predicted XCO2, and the results are shown. The R2 score and Mean Squared Error vary depending on the features selected from the OCO-2 dataset. Three cases are discussed below based on the features selected, with the aim of finding the best feature set for detecting the emission of XCO2. The selected features along with their feature ranking, a horizontal bar graph of the features and the degree of their effect on CO2, the mapping of the predicted CO2 regions, the total time to train the model, the R2 score, the Mean Squared Error, and the prediction for a smaller area are discussed for all three simulations. Simulations 1, 2, and 3 are denoted S1, S2, and S3.
With respect to the features selected, the degree of their effect on the emission of carbon dioxide is represented as a bar graph for all three simulations, shown in Fig. 2a–c.
The predicted carbon dioxide concentration is mapped from the given OCO-2 satellite data along with the actual XCO2 concentration from the OCO-2 satellite. Figure 3 shows the actual concentration from the OCO-2 dataset, and Figs. 4, 5, and 6 show the XCO2 predicted by the model for the respective feature selections.
Based on the XCO2 concentration predicted by the model for the three feature-selection cases, the best simulation among the three can be determined using factors such as the time taken to train the model, the Mean Squared Error (MSE), and the R2 score. The average carbon dioxide concentration for the smaller region, derived from the larger area according to the wisdom of crowds concept, is also listed for each case in Table 2.
From the three simulations listed in Table 2, it can be inferred that the features are most appropriately selected in simulation 2, which gives the highest R2 score and the lowest Mean Squared Error (MSE). In addition, the training time of the model
Fig. 2 a, b, c Features and the degree of their effect on CO2
Fig. 3 Actual CO2 regions from the OCO-2 satellite
Fig. 4 Predicted CO2 regions for S1
Fig. 5 Predicted CO2 regions for S2
Fig. 6 Predicted CO2 regions for S3
Table 2 Factors and their assessment
Factor                        S1        S2        S3
Time to train model           13.39 s   6.68 s    8.42 s
R2 score                      0.80      0.81      0.79
Mean Squared Error (MSE)      1.15e-12  1.08e-12  1.22e-12
Average CO2 for smaller area  0.0033    0.0030    0.0033
is also lower compared to the other two simulations. It can also be deduced that as more features are used, the R2 score decreases and the Mean Squared Error rises.
5 Conclusion
In this paper, we have analyzed space-based observations from the NASA Orbiting Carbon Observatory-2 (OCO-2) satellite dataset. The ultimate goal is to use the data previously retrieved by the satellite to more accurately estimate the carbon dioxide concentration. We initially interpreted the data, selected features for training the model, and then preprocessed the data. For feature selection, we chose the eight most important features
in total for training. The model was trained using the Random Forest Regression algorithm. Rather than utilizing a single model, which might not provide accurate results, this model uses the ensemble approach, which increases the accuracy of the final result; instead of depending on just one decision tree, the method combines numerous trees to determine the outcome [6]. The ML model is assessed using the evaluation metrics after training; here, model performance was evaluated through the Mean Squared Error and R2 score. The lower the Mean Squared Error, the more efficient the model [16]. We compared three cases based on feature selection to find the most significant features, based on which the model predicted the XCO2 concentration. Of the three simulations compared, simulation 2 produced the most efficient model, with an R2 score of 0.81. Generally, an R2 score above 0.78 is considered good for prediction with low error [17]. We have predicted the emission of carbon dioxide and mapped it according to latitude and longitude, alongside the actual carbon dioxide concentration from the satellite, which is also mapped. We have used the larger area to predict the carbon dioxide for a smaller area using the wisdom of crowds, a mathematical concept. This concept integrates the perspectives of many diverse individuals; that is, the wisdom of crowds can result in more accurate forecasts and better decision-making than relying exclusively on one person's talent or knowledge [18]. The best case, with its features, R2 score, Mean Squared Error (MSE), and average CO2 for the smaller area, is presented: the average CO2 for the smaller area is 0.0030, the time taken to train the model is 6.68 s, and the Mean Squared Error is 1.08e-12. Future work will focus on predicting the emission of all greenhouse gases using the same Random Forest Regression. Additionally, it can be extended to the aviation industry by suggesting changes of route in cases where higher CO2 concentrations cause climatic change or a reduction in engine performance.
References
1. Sheng M, Lei L, Zeng Z-C, Rao W, Zhang S (2021) Detecting the responses of CO2 column
abundances to anthropogenic emissions from satellite observations of GOSAT and OCO-2.
Remote Sens 13(17):3524
2. Liang A, Gong W, Han G, Xiang C (2017) Comparison of satellite-observed XCO2 from GOSAT, OCO-2, and ground-based TCCON. Remote Sens 9(10):1033
3. Saleh C, Dzakiyullah NR, Nugroho JB (2016) Carbon dioxide emission prediction using a
support vector machine. In: IOP conference series: materials science and engineering, vol
114(1). IOP Publishing, 012148
4. Yuan X, Suvarna M, Low S, Dissanayake PD, Lee KB, Li J, Ok YS et al (2021) Applied
machine learning for prediction of CO2 adsorption on biomass waste-derived porous carbons.
Environ Sci Technol 55(17):11925–11936
5. https://disc.gsfc.nasa.gov/datasets/OCO2_L2_Lite_FP_9r/summary
6. Rodriguez-Galiano V, Sanchez-Castillo M, Chica-Olmo M, Chica-Rivas MJOGR (2015)
Machine learning predictive models for mineral prospectivity: an evaluation of neural networks,
random forest, regression trees and support vector machines. Ore Geol Rev 71:804–818
7. Aalen OO (1989) A linear regression model for the analysis of life times. Stat Med 8(8):907–925
8. https://www.bmc.com/blogs/mean-squared-error-r2-and-variance-in-regression-analysis/
9. https://www.investopedia.com/terms/w/wisdom-crowds.asp#:~:text=Wisdom%20of%
20the%20crowd%20is,and%20innovating%20than%20an%20individual
10. Sheng M, Lei L, Zeng Z-C, Rao W, Song H, Changjiang W (2023) Global land mapping
dataset of XCO2 from satellite observations of GOSAT and OCO-2 from 2009 to 2020. Big
Earth Data 7(1):180–200
11. Zhang L, Wang J, An Z (2020) Classification method of CO2 hyperspectral remote sensing data based on neural network. Comput Commun 156:124–130. ISSN 0140-3664
12. Dumont Le Brazidec J, Vanderbecken P, Farchi A, Bocquet M, Lian J, Broquet G, Kuhlmann
G, Danjou A, Lauvaux T, Segmentation of XCO2 images with deep learning: application to
synthetic plumes from cities and power plants. Geosci Model Dev
13. Zhang M, Liu G (2019) Mapping contiguous XCO2 by machine learning and analyzing the spatio-temporal variation in China from 2003 to 2019. Sci Total Environ 858(Part 2):159588. ISSN 0048-9697
14. Liu Y, Yang DX, Cai ZN (2013) A retrieval algorithm for TanSat XCO2 observation: retrieval
experiments using GOSAT data. Chin Sci Bull 58:1520–1523. https://doi.org/10.1007/s11434-
013-5680-y
15. Noël S, Reuter M, Buchwitz M, Borchardt J, Hilker M, Bovensmann H, Warneke T et al (2021)
XCO2 retrieval for GOSAT and GOSAT-2 based on the FOCAL algorithm. Atmos Meas Tech 14(5):3837–3869
16. Imbens GW, Newey WK, Ridder G (2005) Mean-square-error calculations for average
treatment effects
17. Subramaniam N, Yusof N (2021) Modeling of CO2 emission prediction for dynamic vehicle
travel behavior using ensemble machine learning technique. In: 2021 IEEE 19th student
conference on research and development (SCOReD). IEEE, pp 383–387
18. Larrick RP, Mannes AE, Soll JB, Krueger JI (2011) The social psychology of the wisdom of
crowds. Soc Psychol Decision Making 227–42
19. Hammerling DM, Michalak AM, O'Dell C, Kawa SR (2012) Global CO2 distributions over land from the Greenhouse Gases Observing Satellite (GOSAT). Geophys Res Lett 39(8)
20. https://ocov2.jpl.nasa.gov/observatory/instrument/#:~:text=OCO%2D2%20does%20not%
20measure,can%20be%20used%20for%20identification
21. Alagarsamy M, Sundarji A, Arunachalapandi A, Kalyanasundaram K (2021) Cost-aware ant colony optimization based model for load balancing in cloud computing. Int Arab J Inf Technol
18(5):719–729
Predicting Students’ Performance Using
Feature Selection-Based Machine
Learning Technique
N. Kartik, R. Mahalakshmi, and K. A. Venkatesh
Abstract Early evaluation of the students’ performance to determine their strengths
and weaknesses helps them perform better in examinations. Improving students’
overall learning experiences and academic success has been a hot issue recently.
In this paper, classical machine learning algorithms such as random forest, J48, and the Logistic Model Tree are built and trained on student data to predict students' performance. To improve the accuracy of the models, feature selection algorithms such as correlation-based feature selection, the information gain ranking filter, the gain ratio feature evaluator, and the symmetrical uncertainty ranking filter are used; the selected features are then used to train the models, and the performances of the models are compared with each other.
Keywords Students' performance · Machine Learning models · Feature selection
1 Introduction
Educational institutions face the challenge of accurately predicting student perfor-
mance as early as possible. Machine learning technology can help institutions iden-
tify students who are at risk of underperforming and provide them with personalized
support to improve their academic achievements. This approach can enhance the
institution’s retention and graduation rates. In the past, if a student had low marks
N. Kartik (B)
Department of Computer Applications/Science, Presidency College(Autonomous)/Presidency
University, Bengaluru, India
e-mail: nkartik.mca@gmail.com
R. Mahalakshmi
Department of Computer Science, Presidency University, Bengaluru, India
e-mail: mahalakshmi@presidencyuniversity.in
K. A. Venkatesh
School of Advanced Computer Science, Alliance University, Bengaluru, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_29
or did not do well in class, the instructor or the student’s parents could not compre-
hend the factors that led to these outcomes. Learning Management Systems (LMSs)
are increasingly being used by institutions to control both the learning content they
provide and the activities and behaviors of their students while using such systems.
Understanding how students behave in LMSs can help institutions predict how well
students will perform based on their grades and the activities they have completed in
the system. A significant amount of work has been done to forecast student performance for many reasons, including the assignment of courses and resources. Still, there is high demand in the field for systems that predict student performance more accurately.
In this paper, we have built Machine Learning models to predict student performance on data collected from an LMS. It is crucial for the model to predict correctly, since decisions are made based on its output. To improve the performance of the model, feature engineering techniques are used; these techniques help to find the most significant features contributing to the prediction. The selected features are used to train the models to achieve good prediction accuracy.
2 Related Work
Machine Learning and data mining algorithms are extensively used for predicting
academic data. In a study, a decision tree classifier, neural network, and nearest
neighbor classifier were combined for successful and unsuccessful student predic-
tions for prestigious Bulgarian universities [1]. WEKA application is explored by
using classifiers like J48, Naive Bayes, BayesNet, K-NN, OneR, and JRip. This
article compares supervised Machine Learning prediction algorithms and assesses
the model’s performance using real-world data from the field of education [2, 3].
A MapReduce-based neural network trained with the cumulative dragonfly model (CDF-NN) has been used to predict student grades [4].
Further, a study investigates AI and ML applications in which student records and on-campus behavior predict academic success; it predicts student outcomes using the Open University Learning Analytics Dataset (OULAD) and a Bayesian network [5]. Genetic algorithms (GAs) can forecast student performance and give suggestions. Naive Bayes, Decision Tree (J48), Random Forest, Random Tree, REP Tree, Simple Logistic, and ZeroR have been used to predict performance [6], and optimization has been applied to tune FNN parameters [7]. KSITM researchers trained and tested a model to address student underachievement [8, 9]. Educational data research has been conducted to learn about students' performance in MOOC classes, where training on a technology-based learning paradigm was used to improve classification methods and optimize the parameters of RBFN networks [10].
3 Proposed Modeling
We have proposed a system whose methodology has helped to improve the model's accuracy. The study includes preprocessing techniques, feature engineering techniques, Machine Learning model building, and model evaluation measures. The results are compared by running the models with and without the selected features. Remarkable results can be achieved with the proposed methodology (Fig. 1).
3.1 Datasets
The dataset used in the study was downloaded from Kaggle. The data were gathered using the experience API learner activity tracker tool (xAPI) and compiled over two academic semesters, containing 480 student records. The dataset comprises information about the students, including their gender, nationality, place of birth, grade level, number of raised hands, and absences.
3.2 Preprocessing
The dataset contains both discrete and continuous values across 17 features, with no missing values. Discrete values are converted to numeric using the one-hot encoding technique, and the dataset is then normalized using a z-score normalizer.
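A minimal pandas sketch of these two steps, using toy records and illustrative column names rather than the exact Kaggle field names:

```python
import pandas as pd

# Toy records standing in for the xAPI student dataset.
df = pd.DataFrame({"gender": ["M", "F", "F", "M"],
                   "raisedhands": [10, 80, 30, 60]})

# One-hot encode the discrete column(s) ...
df = pd.get_dummies(df, columns=["gender"], dtype=int)

# ... then z-score normalize the continuous one(s).
df["raisedhands"] = (df["raisedhands"] - df["raisedhands"].mean()) \
    / df["raisedhands"].std()
print(df)
```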
3.3 Feature Selection
The main goals of feature selection include improving the model’s performance,
avoiding overfitting, and providing faster, more efficient, cost-effective models. Four
Fig. 1 Proposed methodology
different feature selection techniques were used in this work, and their results are analyzed. The top features commonly selected across the techniques are chosen.
3.4 Correlation-Based Feature Selection (CFS)
CFS measures the degree of relevance and redundancy of a feature subset by calculating a correlation-based merit. Equation (1) calculates this merit, which measures the similarity between the feature subset and the output class:

r_zc = (k · r̄_zi) / √(k + k(k − 1) · r̄_ii)    (1)

where r_zc is the correlation (dependence) between the combined feature subset and the class variable, k is the number of features, r̄_zi is the average feature–class correlation, and r̄_ii is the average feature–feature intercorrelation.
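Equation (1) can be sketched directly in code; the helper below is illustrative and takes precomputed correlations as input:

```python
import math

def cfs_merit(feat_class_corrs, feat_feat_corrs):
    """CFS merit of a feature subset (Eq. 1).

    feat_class_corrs: feature-class correlations for the k features.
    feat_feat_corrs:  pairwise feature-feature correlations.
    """
    k = len(feat_class_corrs)
    r_zi = sum(feat_class_corrs) / k                  # mean feature-class corr
    r_ii = (sum(feat_feat_corrs) / len(feat_feat_corrs)
            if feat_feat_corrs else 0.0)              # mean intercorrelation
    return k * r_zi / math.sqrt(k + k * (k - 1) * r_ii)

# A subset whose features correlate well with the class but little with
# each other scores higher than a redundant one.
print(cfs_merit([0.6, 0.5], [0.1]))
print(cfs_merit([0.6, 0.5], [0.9]))
```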
3.5 Information Gain Ranking Filter
Quinlan's information gain metric, used in ID3's basic decision tree-building
process, is used to choose each node's test attribute. Let node N represent the
tuples of partition D. The IG calculation determines node N's splitting attribute.
Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)    (2)
where p_i, estimated by |C_i|/|D|, is the likelihood that an arbitrary object in D belongs
to class C_i. A log function to base 2 is employed since the information is encoded in
bits. Info(D) represents the average amount of information needed to determine the class
label of an object in partition D.
Now, if the objects in D must be partitioned on some feature (attribute) A with v unique
values, a_1, a_2, a_3, …, a_v, the expected information of the resulting subsets is given
as follows:
Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)    (3)

Gain(A) = Info(D) - Info_A(D)    (4)
Predicting Students’ Performance Using Feature Selection-Based 393
where |D_j|/|D| serves as the weight of the jth partition. The information gain
on A is defined as the difference between the original information requirement before
splitting and the new requirement after splitting on A, as shown in Eq. (4).
SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\left(\frac{|D_j|}{|D|}\right)    (5)
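Equations (2)–(5) can be sketched in pure Python; the toy attribute and class labels below are hypothetical:

```python
import math
from collections import Counter

def info(labels):
    """Info(D) = -sum p_i * log2(p_i)  (Eq. 2)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Gain(A) = Info(D) - Info_A(D)  (Eqs. 3-4); rows[i][attr] is A's value."""
    n = len(labels)
    by_value = {}
    for row, y in zip(rows, labels):
        by_value.setdefault(row[attr], []).append(y)
    info_a = sum(len(part) / n * info(part) for part in by_value.values())
    return info(labels) - info_a

def split_info(rows, attr):
    """SplitInfo_A(D)  (Eq. 5): entropy of the attribute's value distribution."""
    return info([row[attr] for row in rows])

# Toy data: the attribute "absent" perfectly separates the two classes.
rows = [{"absent": "high"}, {"absent": "high"}, {"absent": "low"}, {"absent": "low"}]
labels = ["L", "L", "H", "H"]
assert info(labels) == 1.0                    # two balanced classes -> 1 bit
assert info_gain(rows, labels, "absent") == 1.0
assert split_info(rows, "absent") == 1.0      # two equal-size partitions
```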
3.6 Symmetrical Uncertainty Ranking Filter
The symmetrical uncertainty criterion overcomes the information gain bias for
features with many values by normalizing its values to the range [0,1]. The following
equation gives this:
SU = \frac{2 \times Gain(A)}{Info(D) + SplitInfo_A(D)}    (6)
An SU value of 0 indicates no association between the two features, while SU = 1 indicates
that knowledge of one feature entirely predicts the other. Like the gain ratio (GR)
criterion, SU favors features with fewer values.
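A self-contained sketch of Eq. (6); the feature and class labels are hypothetical:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy of a discrete value sequence, in bits."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def symmetrical_uncertainty(feature, labels):
    """SU = 2*Gain(A) / (Info(D) + SplitInfo(A))  (Eq. 6), normalized to [0, 1]."""
    n = len(labels)
    by_value = {}
    for v, y in zip(feature, labels):
        by_value.setdefault(v, []).append(y)
    info_a = sum(len(p) / n * entropy(p) for p in by_value.values())
    gain = entropy(labels) - info_a
    split_info = entropy(feature)   # SplitInfo equals the entropy of A itself
    denom = entropy(labels) + split_info
    return 2 * gain / denom if denom else 0.0

feature = ["yes", "yes", "no", "no"]
labels = ["H", "H", "L", "L"]
assert symmetrical_uncertainty(feature, labels) == 1.0   # one predicts the other
assert symmetrical_uncertainty(["a", "a", "a", "a"], labels) == 0.0
```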
Selected Features from the Different Feature Selection Techniques
After applying the feature selection techniques, the common features shortlisted
based on the ranking by these techniques are VisITed Resources, Student Absence
Days, raised hands, Announcements’ View, Parent Answering Survey, and Relation.
The scores assigned by each technique are presented in detail in Table 1.
Table 1 Feature selection techniques and the scores of their selected features

| Technique | VisITed Resources | Student Absence Days | Raised hands | Announcements' View | Parent Answering Survey | Relation |
| Correlation-based feature selection (CFS) | 0.3829 | 0.3608 | 0.3283 | 0.2895 | 0.2369 | 0.2358 |
| Information gain ranking filter | 0.45801 | 0.39745 | 0.37337 | 0.2578 | 0.1504 | 0.1261 |
| Gain ratio feature evaluator | 0.19878 | 0.40986 | 0.25378 | 0.17636 | 0.15212 | 0.1291 |
| Symmetrical uncertainty ranking filter | 0.23776 | 0.31564 | 0.24728 | 0.17127 | 0.11855 | 0.0998 |
Classification Models
In this paper, we built three classifiers, random forest (RF), J48, and Logistic Model
Tree (LMT), to predict the students' performance. The dataset is split in a 70:30 ratio
for training and testing, respectively. These classifiers are trained on the dataset both
without feature selection and with the selected features.
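A minimal sketch of the 70:30 split described above; the helper name and the fixed seed are illustrative assumptions (scikit-learn's `train_test_split` would normally be used):

```python
import random

def train_test_split(rows, labels, test_ratio=0.30, seed=42):
    """Shuffle and split the dataset 70:30, the ratio used in the paper.
    The seed is fixed only to make this sketch reproducible."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = int(len(rows) * (1 - test_ratio))
    train, test = idx[:cut], idx[cut:]
    return ([rows[i] for i in train], [labels[i] for i in train],
            [rows[i] for i in test], [labels[i] for i in test])

rows = [[i] for i in range(480)]   # 480 records, as in the xAPI dataset
labels = ["M"] * 480               # placeholder class labels
x_tr, y_tr, x_te, y_te = train_test_split(rows, labels)
assert len(x_tr) == 336 and len(x_te) == 144   # 70% / 30% of 480
```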
3.7 Accuracy Metrics for the Classifier
Model assessment is essential in developing a reliable data mining and machine
learning model. The most common criterion for evaluating a classification system
is its accuracy. A model with an accuracy of 99.9% or above may appear successful;
however, accuracy alone can be unreliable and even misleading in certain situations
with overfitting problems.
Accuracy metrics of the classifiers:
Precision = True Positive/(True Positive + False Positive).
Recall, also called sensitivity, measures how accurately the positive class was
predicted:
Recall = True Positive/(True Positive + False Negative).
The F-score, also known as the F-measure, is a single score that combines precision
and recall to balance both objectives:
F-measure = (2 × Precision × Recall)/(Precision + Recall).
The F-measure is a popular metric for imbalanced classification.
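These metrics can be computed directly from the confusion counts of a class; the counts below are hypothetical:

```python
def precision(tp, fp):
    """Fraction of positive predictions that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of actual positives that are recovered."""
    return tp / (tp + fn)

def f_measure(p, r):
    """Harmonic mean of precision and recall: F = 2*P*R / (P + R)."""
    return 2 * p * r / (p + r)

# Hypothetical confusion counts for one class
tp, fp, fn = 80, 20, 10
p, r = precision(tp, fp), recall(tp, fn)
assert p == 0.8
assert abs(r - 80 / 90) < 1e-12
```

In practice, scikit-learn's `precision_recall_fscore_support` computes all three per class.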
4 Results and Discussions
The proposed model is tested in both cases by implementing it on the Jupyter Notebook
Python platform. The models are executed with the training and testing datasets in the
proposed 70:30 ratio. First, the models were executed considering all the features.
The experimental results of each model in this first case are presented in Table 2.
Compared to J48 and LMT, RF performs well with 78.66% accuracy. This is due to
considering more attributes with binary values (Fig. 2).
The numerical results are shown in Table 2. The first experiment consists of three
separate runs of classification algorithms on datasets that do not use the proposed
model phases. Both RF and LMT provide better outcomes than J48. Figure 3 shows
the area under the receiver operating characteristic (ROC) curve for each predicted
class for the random forest.
After encoding the classes and feature selection, the dataset is trained on the
same classifiers in our second experiment, conducted according to the
recommended technique. Table 3 displays and compares the results of each classifier.
In terms of classifying the dataset, RF's results have vastly improved. Figure 5
Table 2 Classification results for original datasets

| Metric | RF | J48 | LMT |
| TP rate | 0.767 | 0.758 | 0.775 |
| FP rate | 0.139 | 0.139 | 0.131 |
| Precision | 0.776 | 0.760 | 0.775 |
| Recall | 0.767 | 0.758 | 0.775 |
| F-measure | 0.786 | 0.759 | 0.775 |
| ROC area | 0.897 | 0.855 | 0.882 |
| Accuracy | 0.786 | 0.758 | 0.775 |
Fig. 2 Classification results for original datasets (bar chart comparing RF, J48, and LMT)
Fig. 3 ROC curve for random forest with all features
displays the area under the receiver operating characteristic (ROC) curve for
each predicted class using RF (Fig. 4).
Table 3 Classification results for the selected features

| Metric | RF | J48 | LMT |
| TP rate | 0.856 | 0.717 | 0.731 |
| FP rate | 0.843 | 0.160 | 0.157 |
| Precision | 0.856 | 0.718 | 0.732 |
| Recall | 0.856 | 0.717 | 0.731 |
| F-measure | 0.856 | 0.716 | 0.731 |
| ROC area | 0.976 | 0.826 | 0.866 |
| Accuracy | 0.895 | 0.711 | 0.732 |
Fig. 4 Classification results for the selected features (bar chart comparing RF, J48, and LMT)
Fig. 5 ROC curve for random forest with selected features
5 Conclusion
Correct and timely prediction of student performance is one of the most challenging
tasks in education. It is necessary to help students at greater academic risk, ensure
a high retention rate, provide exceptional learning opportunities, and promote the
university and its reputation. In this paper, three Machine Learning models are built
and tested to predict student performance. The experiments were conducted in two
phases. In the first phase, all three models are trained with all the features, and the
results are compared; random forest gives 78.66% accuracy. To improve the accuracy of
the models and achieve correct predictions, we used feature selection techniques to
identify the significant features in the dataset and used those features for training the
models. In the second phase, the three models are trained on the selected features, and
random forest again outperforms the others with an accuracy of 89.5%. Comparing both
phases shows that the results improve in the second phase: the study's methodology
gives a good result compared with the models trained without feature selection.
References
1. Imran M, Latif S, Mahmood D, Shah M (2019) Student academic performance prediction using
supervised learning techniques. Int J Emerg Technol Learn 14(14):92–104. https://doi.org/10.
3991/ijet.v14i14.10310
2. Chaudhury P, Tripathy H (2020) A novel academic performance estimation model using two
stage feature selection. Indonesian J Electric Eng Comput Sci 19(3):1610–1619. https://doi.
org/10.11591/ijeecs.v19.i3.pp1610-1619
3. Alshabandar R, Hussain A, Keight R, Khan W (2020) Students performance prediction in
online courses using machine learning algorithms. Proc IJCNN Conf 2020:1–7. https://doi.
org/10.1109/IJCNN48605.2020.9207196
4. Velarde L, Gerardo C, Chamorro-Atalaya O, Morales-Romero G, Meza-Chaupis Y, Auqui-
Ramos E, Ramos-Cruz J, Aybar-Bellido I (2022) Quadratic vector support machine algorithm,
applied to prediction of university student satisfaction. Indonesian J Electric Eng Comput Sci
27(1):139–148. https://doi.org/10.11591/ijeecs.v27.i1.pp139-148
5. Chitti M, Chitti P, Jayabalan M (2020) Need for interpretable student performance prediction.
Proc DeSE Conf, 269–272. https://doi.org/10.1109/DeSE51703.2020.9450735
6. Salih NZ, Khalaf W (2021) Prediction of student’s performance through educational data
mining techniques. Indonesian J Electric Eng Comput Sci 22(3):1708–1715. https://doi.org/
10.11591/ijeecs.v22.i3.pp1708-1715
7. Ismail HM, Hennebelle A (2021) Comparative analysis of machine learning models for
students’ performance prediction. In: Advances in digital science - advances in intelligent
systems and computing, Antipova T (ed), vol 1352. Singapore, Springer, 149–160. https://doi.
org/10.1007/978-3-030-71782-7_14
8. Chakrapani P, CD (2022) Academic performance prediction using machine learning: a compre-
hensive and systematic review. Proc ICESIC, 335–340. https://doi.org/10.1109/ICESIC53714.
2022.9783512
9. Madhuri S, Adamuthe AC (2021) Comparative study of supervised algorithms for prediction
of students’ performance. Int J Modern Educ Comput Sci 13(1):1–21. https://doi.org/10.5815/
ijmecs.2021.01.01
10. Hao J, Gan J, Zhu L (2022) MOOC performance prediction and personal performance improve-
ment via Bayesian network. Educ Inf Technol 27:7303–7326. https://doi.org/10.1007/s10639-
022-10926-8
Hybrid Deep Learning-Based Human
Activity Recognition (HAR) Using
Wearable Sensors: An Edge Computing
Approach
Neha Gaud, Maya Rathore , and Ugrasen Suman
Abstract The growth of the Internet of Things (IoT) and advanced sensing-based
technologies has enabled the development of miniature-based systems.
In recent years, the use of wearable and mobile sensors for Human Walking Gesture
Recognition has become more popular in various applications, including health
care, surveillance, robotics, and industry. The recent growth of edge computing
technology for Industry 4.0 has provided the opportunity to design low-power
and less computationally expensive devices. Edge computing devices cannot
support heavy computation, but they provide great efficiency by reducing the network
size and communication latency. Deep learning algorithms have recently demonstrated
high performance in HAR. However, the deep learning (DL) models require
very high computation systems, which make them ineffective when used on edge
devices. In this research, a hybrid deep learning-based model is trained to recognize
the various gestures. Three deep learning-based models, namely one-Dimensional
Convolutional Neural Network (1D-CNN), Convolutional Neural Network–Long
Short-Term Memory (CNN-LSTM), and CNN-Gated Recurrent Unit (CNN-GRU),
are designed to test the various human mobility gestures. The WISDM, PAMAP2,
and UCI-HAR benchmark datasets were used to assess these models. Among the
three datasets, the best accuracies of the models are 99.89%, 97.28%, and 96.78%,
respectively, achieved for CNN-LSTM hybrid model. In future, the work can be
extended to design an end-to-end edge computing application using Arduino Nano
33 BLE Sensing microcontroller board. The compressed deep learning model will
be fused on the Arduino Nano board to recognize various human motion gestures.
The research demonstrates the classification of various HAR gestures using hybrid
deep learning models.
N. Gaud ·U. Suman
School of Computer Science and Information Technology, DAVV, Indore, M.P, India
e-mail: usuman.scs@dauniv.ac.in
M. Rathore (B)
Christian Eminent College, Indore, M.P, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_30
399
400 N. Gaud et al.
Keywords Deep learning ·Assistive technology ·Internet of heath things
(IoHT) ·Wearable technology ·Edge computing
1 Introduction
1.1 Research Motivation
HAR includes several techniques, which can be broadly divided into non-wearable
(e.g., computer-vision-based) and wearable (i.e., sensor-based) approaches; the latter
can be further segmented into object-tagged, wearable, dense sensing, etc. Before going
any further, it should be noted that HAR systems have various inherent design
challenges, namely the selection of different sensor types, the criteria for data
collection, recognition performance, energy consumption, processing power, and
adaptability. It is crucial to create a HAR model that is effective and portable while
keeping all these factors in mind. A network for mobile-based human activity
recognition has been described that uses data from triaxial accelerometers and a long
short-term memory technique [1]. Current technology relies on Internet connectivity
and cloud services to recognize activity patterns using non-parametric ML models in
devices like smart watches.
1.2 Recent Advancement
DL-based approaches have recently gained popularity for recognizing human activity
because they can employ representation learning techniques which would automati-
cally generate the best features from sensor-generated original data without the need
for human involvement and can find hidden patterns in data [2, 8]. The application
of edge computing frameworks to perform HAR models at the network’s end is still
in its initial stages [12]. Recent years have seen the emergence of edge computing
as a fresh framework which can shorten Internet connection lag times by relocating
processing power from far cloud servers to data sources. It is logical to transfer
the design of cloud-based IoT apps to edge-based ones. In this study, we investigate
model compression and the building of neural network-based models; in future work,
the implementation will be done on an Arduino Nano BLE sensing board.
1.3 Author’s Contribution and Novelty
This research presents hybrid DL models for gesture recognition. The following are
the paper’s contributions:
Hybrid Deep Learning-Based Human Activity Recognition (HAR) 401
The hybrid deep learning-based models (CNN-1D, CNN-GRU, and CNN-LSTM)
are designed for HAR gesture recognition.
The spatial and temporal features of datasets are used to design the above models.
Three publicly available benchmark datasets WISDM, PAMAP2, and UCI-HAR
are considered to design and test the models.
Finally, the model which has provided the highest accuracy, i.e., CNN-LSTM is
picked.
The confusion matrix is prepared for all three models over different datasets.
One major objective is to design devices with low power consumption while
protecting patient data privacy. This method can also be applied for the evaluation of
analysis of various walking techniques, design of prosthesis, and to the rehabilitation
of Parkinson’s patients and elderly subjects.
1.4 Organization of the Paper
The entire paper is divided into five sections. The first section is introduction, which
provides the research motivations, author’s contribution, and background informa-
tion. The second section provides a brief survey of recent state-of-the-art literature.
The third section is methodology in which the approach goes over the hardware and
software requirements, algorithms, and specific flow tasks for recognizing human
gestures. The verification of the model is covered in the fourth section, which includes
experiment results and analysis. The last section is conclusion, future work, and
limitations.
2 Literature Review
Sztyler et al. [3] proposed a HAR technique that allows the position of wearable
devices on the human body to vary. The technique uses a random forest classifier to
combine frequency- and gravity-based information in order to identify human activity
and estimate the device's orientation. Nevertheless, these methods are not particularly
accurate and cannot be used indoors. This study demonstrated an alternate strategy to
employ a microcontroller to provide an end-to-end solution to precisely analyze gait
speed in a variety of settings, given the constraints of such devices. The gyroscope is
used to measure the orientation of moveable body parts, whereas the accelerometer
monitors their actual physical acceleration. HAR is a method for separating out very
similar human actions using the inputs from both sensors. For the classification of
human walking gesture and the study of motion signals, various classifiers have been
developed. Sun et al. [4] suggested a CNN-LSTM-ELM network on the opportunity
dataset. Activities in the OPPORTUNITY dataset can be separated into gesture and
locomotion. They discovered that the ELM classifier generalizes more quickly and
effectively than fully connected classifiers. Chen et al. [5] suggested a HAR approach
that is portable, non-interventional, has enhanced accuracy, and is relevant to
real-time applications. For the purpose of recognizing complex, contemporaneous,
interspersed, and varied human actions, a model based on transfer learning employing
Gated Recurrent Units (GRUs) has been proposed [6]. Voicu et al. [7] presented a technology for recognizing
human physical activity using data from smartphone sensors. Three smartphone-
accessible sensors, namely gyroscope, accelerometer, and magnetic sensor, are used
in the process to create a classifier. They intend for their proposal to include sitting,
jogging, climbing, standing, and descending stairs. The results show that all six
activities may be recognized with a high degree of accuracy (86–93%) [9–13].
3 Methodology
3.1 Technology Used
To write the code, we used Google Colab, a collaborative environment that allows
results to be produced in the same file and provides CSV file output. The deep learning
models are designed using the TensorFlow library. The TensorFlow Lite library is used
to reduce the model size, and the Google TensorFlow library together with Keras is
used for model training and evaluation.
3.2 Datasets
Models are trained on three publicly available datasets: UCI-HAR, PAMAP2, and
WISDM.
UCI-HAR dataset: In the UCI-HAR dataset, six tasks were carried out by the
volunteers: walking, walking downstairs, walking upstairs, standing, sitting, and
laying. Using various signal processing techniques, a total of 561 features were
retrieved from the sensor measurements.
WISDM dataset: The WISDM dataset consisted of 36 volunteers who did six
different activities while wearing smartphones and smart watches. These devices
were equipped with accelerometer and gyroscope sensors. These activities were
walking, running, ascending stairs, sitting, standing, and lying down.
PAMAP2: The Physical Activity Monitoring Data Set 2 (PAMAP2) dataset is
made up of sensor data obtained from a wearable during a variety of activities
and worn on the upper body. It included data from nine sensors, consisting of
an accelerometer, gyroscope, and magnetometer. The participants engaged in a
variety of activities, such as chores around the house, physical activity, and outdoor
pursuits. It produced a total of 52 features, including acceleration, angular speed,
and magnetic field readings from the sensors.
Fig. 1 Proposed deep
learning model flowchart
3.3 Development of Hybrid Models
The stored data are first preprocessed to turn the raw samples into tensors, after which
the observations are divided into training, testing, and validation datasets.
Some of the most important functions used in the code are data.isnull().sum() to check
for missing values and data.dropna() to remove missing or null values. We created
three different models: CNN 1D, CNN-LSTM, and CNN-GRU. Finally, we selected the single
model with the highest accuracy for further performance checks. Figure 1 shows the
proposed methodology flowchart.
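Before the models can consume the data, the raw sensor stream is typically segmented into fixed-length windows. A pure-Python sketch follows; the window length 128 matches the tensor shapes reported in the results, while the 50%-overlap step size is an assumption (the paper does not state it):

```python
def sliding_windows(samples, window=128, step=64):
    """Segment a stream of sensor samples (each a [ax, ay, az, ...] list)
    into fixed-length, overlapping windows, as is typical for HAR."""
    return [samples[i:i + window]
            for i in range(0, len(samples) - window + 1, step)]

stream = [[0.0, 0.0, 9.8]] * 512        # dummy triaxial accelerometer stream
windows = sliding_windows(stream)
assert len(windows) == 7                # (512 - 128) / 64 + 1 windows
assert len(windows[0]) == 128 and len(windows[0][0]) == 3
```

Each window then becomes one training example of shape (128, channels).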
3.4 Working of DL Models
CNN: The model starts with a 1D convolutional layer of 64 filters with a kernel size
of 3, activated by ReLU. A second identical layer is used to extract abstract features
from the output of the first layer. Following that, a max pooling layer with a pool
size of 2 is applied to the output. The output of the final max pooling layer is
flattened into a one-dimensional vector and passed to a dense output layer with one
unit per class, activated by softmax. Categorical cross-entropy is used as the loss
function, ADAM as the optimizer, and accuracy as the assessment metric prior to
training. The other two models use the same hyperparameters. This model is trained
for 10 epochs with 32 samples in each batch. While monitoring validation loss,
ReduceLROnPlateau was utilized as a callback function with a lower bound of 0.0001
for the learning rate. As a result, each time the validation loss reaches a plateau,
the model can decrease the learning rate by a factor of 10. This is done to increase
the model's overall accuracy.
CNN-LSTM: The CNN-LSTM model was created using TensorFlow's Keras API. Its
architecture starts with a Conv1D layer with 64 filters, each with a kernel size of 3;
here again the ReLU function triggers the activation of neurons in these layers.
To identify high-level characteristics from unprocessed data, a sliding-window filter
is applied to the input in a convolutional network. The filter is applied to the
inputs several times, creating an activation map known as a feature map. To represent
the temporal sequence of feature maps, an LSTM layer with the hyperbolic tangent (tanh)
activation function is used, followed by a dropout rate of 50%. Again, this model is
trained for 10 epochs with 32 samples in each batch.
CNN-GRU: The architecture of this model is very close to the LSTM-based design
described in the previous section. The model extracts feature maps from the data,
which are further compressed using a max pooling layer. The data is then passed to a
Gated Recurrent Unit (GRU) layer; in this work, we utilized 64 GRU units with the
hyperbolic tangent (tanh) activation function in the sequence layer. A dropout layer
with a rate of 50% follows the max pooling layer, which helps prevent overfitting.
Here, we recorded the model metrics of precision, recall, F1-score, and support for
each gesture category.
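The architectures above can be sketched with TensorFlow's Keras API. This is an illustrative reconstruction, not the authors' exact code; in particular, the number of LSTM units (64) is an assumption the paper does not state. Replacing the LSTM layer with `layers.GRU(64)` gives the CNN-GRU variant, and replacing it with a Flatten layer gives the plain 1D-CNN:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_lstm(timesteps=128, channels=9, n_classes=6, lstm_units=64):
    """CNN-LSTM per Sect. 3.4: Conv1D(64, kernel 3, ReLU) -> MaxPooling1D(2)
    -> LSTM (tanh) -> Dropout(0.5) -> Dense softmax.
    lstm_units=64 is an assumption; the paper does not state the unit count."""
    model = models.Sequential([
        layers.Input(shape=(timesteps, channels)),
        layers.Conv1D(64, 3, activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.LSTM(lstm_units, activation="tanh"),
        layers.Dropout(0.5),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_cnn_lstm()   # UCI-HAR window shape (128, 9), 6 classes
# Training, as described in the text:
# model.fit(x_train, y_train, epochs=10, batch_size=32,
#           callbacks=[tf.keras.callbacks.ReduceLROnPlateau(
#               monitor="val_loss", factor=0.1, min_lr=1e-4)])
```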
3.5 Classification
The accelerometer and gyroscope inputs for the various gestures of the dataset are fed
into the deep learning model. If the input data exceeds a minimum threshold of 1.5,
an inference step produces the predicted probability of the data falling into
each gesture class (Squat, Run, Walk, Jump, etc.). The results are represented as a
confusion matrix and accuracy. Algorithm 1 shows the detailed working steps.
Algorithm 1 shows the model creation and implementation steps for HAR.

Algorithm 1: HAR using wearable sensors
Result: Classification accuracy of the various human activities.
Initialization: HAR datasets (WISDM, PAMAP2, and UCI-HAR).
Step 1: Preprocess the datasets;
Step 2: Split the datasets into training, validation, and testing sets;
Step 3: Design the deep learning models CNN-LSTM, 1D-CNN, and CNN-GRU;
Step 4: Analyze the performance of each deep learning model based on precision, recall, accuracy, and F1-score;
Step 5: Select the best model based on accuracy.
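The threshold-gated inference step of Sect. 3.5 can be sketched as follows; the gesture list and the raw model scores are hypothetical (a real model would output softmax probabilities directly):

```python
import math

GESTURES = ["Squat", "Run", "Walk", "Jump"]   # example gesture classes

def softmax(scores):
    """Convert raw scores to a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(sample_magnitude, scores, threshold=1.5):
    """Run inference only when the input magnitude exceeds the 1.5 threshold
    described in the text; return the most probable gesture, else None."""
    if sample_magnitude <= threshold:
        return None
    probs = softmax(scores)
    return GESTURES[probs.index(max(probs))]

assert classify(0.9, [0.1, 2.0, 0.3, 0.2]) is None     # below threshold
assert classify(3.2, [0.1, 2.0, 0.3, 0.2]) == "Run"
```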
4 Experimental Result
This research work has presented the performance of three different deep learning
model architectures: CNN-LSTM, CNN 1D, and CNN-GRU, as detailed in the model
development section. The data was divided into three groups: training (70%), testing
(10%), and validation (20%). The test data is used to examine support, recall, F1-
score, precision, and overall accuracy of the models.
4.1 Dataset Comparison Result
Overall, all the models are able to perform with very high accuracy (around 95% or
above) on the datasets. Among the five gestures, squat and run were easier for the
models to identify than the jump and walk gestures.
Within the six gestures of the UCI-HAR dataset, laying, walking, walking upstairs,
and walking downstairs are easier for the model to identify than the sitting and
standing gestures. CNN gave the best accuracy of 95%. The input data is transformed
into tensors with a shape of (5146, 128, 9) and normalized between 0 and 1, while
the output data is one-hot encoded with a shape of (7352, 6).
Within the six activities of the WISDM dataset, standing is easier for the models to
identify than the other gestures. Here, CNN-GRU achieved the best accuracy of 97%.
The input data was transformed into tensors with a shape of (19,495, 128, 3) and
normalized between 0 and 1, while the output data was one-hot encoded with a shape
of (19,495, 6).
In the PAMAP2 dataset, there are a total of 11 activity classes. Rope_Jumping,
Cycling, Nordic_Walking, Vaccum_Cleaning, and Ironing are easier for the model to
identify than the remaining classes. CNN 1D achieved the best accuracy of 99%. The
input data is transformed into tensors with a shape of (4370, 128, 39) and normalized
between 0 and 1, while the output data is one-hot encoded with a shape of (4370, 12).
Figures 2, 3 and 4 show the accuracy and loss curves for the CNN-LSTM model on the
different datasets UCI-HAR, WISDM, and PAMAP2.
Using USB, the board is loaded with the compressed model header file and the Arduino
sketch (.ino) file. A battery is then attached to the board, after which the board is
reset to activate the new sketch. After that, we invite the participant to make the
various motions, after which the board is free to make inferences in real time. We
assessed the board's performance on the participant using 100 gestures (20 for each
category). The quantized model's results are fused on the same board
Fig. 2 CNN-LSTM model (UCI-HAR)
Fig. 3 CNN-LSTM model (WISDM)
Fig. 4 CNN-LSTM model (PAMAP2)
with similar results, supporting edge computing. We found that the majority of the
motions had inference times between 100 and 500 ms. By omitting the BLE communication
of the results, this inference time can be cut even further. Figures 5, 6 and 7 show
the performance matrices of the compressed CNN-LSTM model over the datasets UCI-HAR,
WISDM, and PAMAP2, respectively.
Table 1 shows the comparative performance analysis of the various deep learning
models over the different datasets.
Fig. 5 Confusion matrix for CNN-LSTM compressed model (UCI-HAR)
408 N. Gaud et al.
Fig. 6 Confusion matrix for CNN-LSTM compressed model (WISDM)
Fig. 7 Confusion matrix for CNN-LSTM compressed model (PAMAP2)
Table 1 CNN-LSTM results

| Proposed model | Accuracy on WISDM | Accuracy on PAMAP2 | Accuracy on UCI-HAR |
| CNN | 97.11 | 96.23 | 95.56 |
| CNN + GRU | 97.32 | 96.59 | 97.79 |
| CNN + LSTM | 99.89 | 97.28 | 96.78 |
Hybrid Deep Learning-Based Human Activity Recognition (HAR) 409
5 Conclusion and Future Extension
In this work, wearable sensors capable of detecting the wearer's movement as they
walk were created and applied. The system uses an inertial-based navigation algorithm
corrected by an EKF. A compressed deep learning model was used to recognize human
movement and gestures, evaluated on three datasets (WISDM, PAMAP2, and UCI-HAR).
For each dataset, we attained highest accuracies of 99.89%, 97.28%, and 96.78%,
respectively, with the CNN-LSTM model. We were able to reduce the model size by 10
times while increasing the models' average performance to 97% by using the model
compression strategies of pruning and quantization. The Arduino-based predictor is
comfortable to wear, customizable, and, most importantly, protects data. In the
investigation, we placed the compressed model on the board and used it to infer
motions in real time. The findings point to a promising new direction in human
activity recognition using edge AI devices that are secure, reliable, and low-powered.
Limitations of the work: HAR systems have various inherent design challenges, namely
the selection of different sensor types, the criteria for data collection, recognition
performance, energy consumption, processing power, and adaptability. The deep learning
models are also large, and their size must be reduced to suit less computationally
capable machines. HAR also suffers from intraclass variability, class imbalance
problems, etc.
References
1. Gravina R, Ma C, Pace P, Aloi G, Russo W, Li W, Fortino G (2017) Cloud-based Activity-
aaService cyber–physical framework for human activity monitoring in mobility. Futur Gener
Comput Syst 75:158–171
2. Greco L, Ritrovato P, Xhafa F (2019) An edge-stream computing infrastructure for real-time
analysis of wearable sensors data. Futur Gener Comput Syst 93:515–528
3. Sztyler T, Stuckenschmidt H, Petrich W (2017) Position-aware activity recognition with
wearable devices. Pervasive Mob Comput 38:281–295
4. Sun J, Fu Y, Li S, He J, Xu C, Tan L (2018) Sequential human activity recognition based on
deep convolutional network and extreme learning machine using wearable sensors. J Sensors
5. Chen J, Sun Y, Sun S (2021) Improving human activity recognition performance by data fusion
and feature engineering. Sensors 21(3):692
6. Thapa K, Abdullah Al ZM, Lamichhane B, Yang SH (2020) A deep machine learning method
for concurrent and interleaved human activity recognition. Sensors 20(20):5770
7. Voicu RA, Dobre C, Bajenaru L, Ciobanu RI (2019) Human physical activity recognition using
smartphone sensors. Sensors 19(3):458
8. Gupta S (2021) Deep learning based human activity recognition (HAR) using wearable sensor
data. Int J Inf Manage Data Insights 1(2):100046
9. Dua N et al (2021) Multi-input CNN-GRU based human activity recognition using wearable
sensors. Computing 103:1461–1478
10. Dua N et al (2023) A survey on human activity recognition using deep learning techniques
and wearable sensor data. In: Machine learning, image processing, network security and data
sciences: 4th international conference, MIND 2022, 2023, Proceedings, Springer, pp 52–71
11. Semwal VB et al (2022) Pattern identification of different human joints for different human
walking styles using inertial measurement unit (IMU) sensor. Artific Intell Rev 55(2):1149–
1169
12. Bijalwan V et al (2022) Wearable sensor-based pattern mining for human activity recognition:
deep learning approach. Indus Robot: Int J Robot Res Appl 49(1):21–33
13. Dua N et al (2022) Inception inspired CNN-GRU hybrid network for human activity
recognition. Multimed Tools Appl 1–35
Hybrid Change Detection Technique
with Particle Swarm Optimization
for Land Use Land Cover Using
Remote-Sensed Data
Snehlata Sheoran, Neetu Mittal, and Alexander Gelbukh
Abstract Change detection is the process of identifying and analyzing the changes
occurring over a period using remote-sensed data; it has various application areas
such as land use land cover, resource planning, urbanization, and many more. The
detection of changes is required for better decision-making and for understanding
the impact of changes occurring at local and global levels. This research presents
an implementation of two change detection techniques, image differencing and image
ratioing, on a set of ten remote-sensed images. The output images obtained are
further segmented using artificial intelligence-based particle swarm optimization
and conventional techniques. The output images are compared and validated through
the use of entropy and the perception-based image quality evaluator (PIQE). It has
been observed that image differencing followed by PSO gives superior image quality
in comparison with the other implemented techniques.
Keywords Particle swarm optimization · Remote sensing · Change detection · Image differencing · Image ratioing
1 Introduction
Change detection is the process of identifying differences in features or attributes
between two or more satellite images of the same area, captured at distinct times. The
objective is to determine changes that have occurred in the physical environment,
S. Sheoran (B)·N. Mittal
Amity University Uttar Pradesh, Noida, Uttar Pradesh, India
e-mail: snehsheoran312@gmail.com
N. Mittal
e-mail: nmittal1@amity.edu
A. Gelbukh
Instituto Politécnico Nacional Mexico, Mexico City, Mexico
e-mail: gelbukh@cic.ipn.mx
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_31
such as new construction, deforestation, land use changes and other types of human
or natural alterations. Change detection in satellite images is useful for a wide range
of applications including urban planning, disaster management, environmental moni-
toring and security and surveillance. Finding the best technique for change detection is
still a challenging and time-consuming task. Nichol and Wong [6] presented the use
of satellite images for landslide inventories using change detection and image fusion.
Nurda et al. [7] exhibited change detection and land suitability analysis by the use of
remote-sensed data for potential forest areas in Indonesia. Chughtai et al. [3] assessed change detection methods and accuracy for land use land cover. Goswami
et al. [4] compared algebraic and machine learning methods of change detection. Peng
et al. [8] used attention mechanism and image difference for change detection on
optical remote-sensed data. Zhu et al. [12] presented a Siamese global learning framework for LULC change detection, working on high spatial resolution
images. Chen et al. [2] worked on a bitemporal image transformer for change detection. Yang et al. [11] presented a change detection framework using a transferred deep learning method. Naeini et al. [5] worked in the area of object-based feature selection by implementing particle swarm optimization. Yadav and Pandey [10] presented a comprehensive survey covering different segmentation techniques in image processing.
The main contributions of this research include:
• Implementation of change detection techniques: image differencing and image ratioing.
• Enhancement of the output images using PSO and conventional edge detection techniques.
• Comparison and validation of the results through entropy and PIQE.
The paper is divided into five sections. Section 1 covers the introduction, and Sect. 2 covers change detection techniques. The proposed methodology is presented in Sect. 3, followed by results and analysis in Sect. 4 and the conclusion in Sect. 5.
2 Change Detection Techniques
Image differencing: This involves subtracting one image from another to highlight
the differences. The result of the subtraction is zero for areas with no changes; areas with changes will have positive or negative values. The expression for image
differencing is shown in Eq. (1).
ΔBV_ijk = BV_ijk(1) − BV_ijk(2) + c, (1)

where BV_ijk(1) and BV_ijk(2) represent the brightness values captured on the two dates and c represents a constant. The band, line (row) number and column number are represented by k, i and j, respectively [4].
Image ratioing: This involves dividing one image by another to create a ratio image. This technique can be useful for detecting changes in areas where the overall intensity of the image has changed, such as areas affected by cloud cover or atmospheric conditions. It is represented by Eq. (2), where x^k_ij(t2) represents the pixel value of band k at the ith row and jth column at time t2 [9]:

Rx^k_ij = x^k_ij(t1) / x^k_ij(t2). (2)
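The two algebraic techniques of Eqs. (1) and (2) can be sketched directly in NumPy. The offset value c and the toy brightness arrays below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def image_difference(bv1, bv2, c=127):
    """Eq. (1): pixel-wise difference of brightness values plus a constant c,
    chosen here so that 'no change' maps to a mid-gray value."""
    return bv1.astype(np.int32) - bv2.astype(np.int32) + c

def image_ratio(x_t1, x_t2, eps=1e-6):
    """Eq. (2): pixel-wise ratio of the two dates; eps avoids division by zero."""
    return x_t1.astype(np.float64) / (x_t2.astype(np.float64) + eps)

# Toy single-band images "captured" on two different dates
date1 = np.array([[10, 20], [30, 40]], dtype=np.uint8)
date2 = np.array([[10, 25], [15, 40]], dtype=np.uint8)

diff = image_difference(date1, date2)   # unchanged pixels -> c = 127
ratio = image_ratio(date1, date2)       # unchanged pixels -> ~1.0
```

Casting to a signed type before subtracting matters: subtracting `uint8` arrays directly would wrap around for negative differences.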
3 Proposed Methodology
In this research, ten satellite images acquired from LANDSAT, USGS are considered
as input data. For identifying the changes, the images cover the same location and
the variation comes with respect to the year in which the images were captured.
Image 1 shows a geographic location captured in years 2013 and 2022. Similarly,
image 2 covers another geographic location captured in 2013 and 2022. Likewise,
all images cover different geographic locations captured over a period of time and
are presented in Table 1. On the input images, change detection techniques such as image differencing and image ratioing are applied and output images are obtained. To further enhance the output images, an artificial intelligence-based technique, particle swarm optimization, and established edge detection procedures such as Sobel, Canny and Prewitt are implemented. For qualitative analysis of the final output images, entropy and the perception-based image quality evaluator (PIQE) are computed and compared. The proposed methodology is presented in Fig. 1.
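Of the two validation metrics, the entropy computation can be sketched as below: a minimal Shannon entropy of the gray-level histogram. The bin count and toy image are assumptions for illustration; PIQE is a no-reference perceptual metric whose implementation is more involved and is not reproduced here:

```python
import numpy as np

def image_entropy(img, bins=256):
    """Shannon entropy (bits) of the gray-level histogram; higher entropy
    indicates richer detail in the segmented output image."""
    hist, _ = np.histogram(img, bins=bins, range=(0, bins))
    p = hist / hist.sum()
    p = p[p > 0]                      # drop empty bins (0 * log 0 = 0)
    return float(np.sum(p * -np.log2(p)))

flat = np.full((8, 8), 5, dtype=np.uint8)   # uniform image -> zero entropy
```

For example, an image whose pixels are split evenly between two gray levels has an entropy of exactly 1 bit.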
3.1 Particle Swarm Optimization
It is a computational optimization algorithm that is inspired by nature. The opti-
mization problem is represented as a search space with many potential solutions.
Each potential solution is modeled as a particle that moves through the search space,
with its position and velocity representing the solution. The steps of the PSO algorithm [1] are shown in Fig. 2.
4 Result and Analysis
Original ten images acquired from LANDSAT are presented in Table 1. For iden-
tifying the changes, image 1 is considered from years 2013 and 2022. Image 2 is
considered over 2013 and 2022. Similarly, all the remaining images are considered
414 S. Sheoran et al.
Table 1 Ten original LANDSAT images
over 2013–2022, as per the data availability. Output images obtained after applying
image differencing and image ratioing techniques are presented in Table 2.
The output images obtained from differencing and ratioing are further processed
by particle swarm optimization and sobel, canny and prewitt edge detection tech-
niques. The output images obtained for image 1 are placed in Table 3. For validating
the results qualitatively, the entropy and PIQE parameters are computed, compared and presented in Tables 4 and 5. It can be seen from Table 4 that the entropy values for image differencing with PSO, Sobel, Canny and Prewitt are 0.8764, 0.0966, 0.3297 and 0.0965, respectively. The entropy values for image ratioing with PSO, Sobel, Canny and Prewitt are
Fig. 1 Workflow of proposed methodology
1. Initialization & input: particles are generated randomly and assigned a position and velocity
2. Evaluation: each particle's fitness value is evaluated using the fitness function (FF)
3. While the termination criteria are not met, do
4.   Position and velocity update
5.   Evaluation of FF and replacement of the worst particles by the best
6.   Local and global best update
7. End while; best solution obtained

Fig. 2 Steps of PSO algorithm
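The steps of Fig. 2 can be sketched as a minimal PSO for function minimization. The sphere fitness function, swarm size and coefficients below are illustrative placeholders only, since the paper's actual fitness function scores segmented image quality:

```python
import numpy as np

rng = np.random.default_rng(0)

def pso(fitness, dim, n_particles=20, iters=100, bounds=(-5.0, 5.0),
        w=0.7, c1=1.5, c2=1.5):
    """Minimal particle swarm optimizer (minimization), following Fig. 2."""
    lo, hi = bounds
    # Step 1: random initialization of positions and velocities
    pos = rng.uniform(lo, hi, (n_particles, dim))
    vel = rng.uniform(-1.0, 1.0, (n_particles, dim))
    # Step 2: evaluate the fitness function (FF) for every particle
    fit = np.array([fitness(p) for p in pos])
    pbest, pbest_fit = pos.copy(), fit.copy()
    g = pbest_fit.argmin()
    gbest, gbest_fit = pbest[g].copy(), pbest_fit[g]
    # Step 3: loop until the termination criterion (iteration budget) is met
    for _ in range(iters):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        # Step 4: position and velocity update
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, lo, hi)
        # Steps 5-6: re-evaluate FF, update personal and global bests
        fit = np.array([fitness(p) for p in pos])
        improved = fit < pbest_fit
        pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
        g = pbest_fit.argmin()
        if pbest_fit[g] < gbest_fit:
            gbest, gbest_fit = pbest[g].copy(), pbest_fit[g]
    # Step 7: return the best solution obtained
    return gbest, gbest_fit

# Placeholder fitness: sphere function; the paper's FF would instead score
# the quality of the segmented output image.
best, best_fit = pso(lambda x: float(np.sum(x ** 2)), dim=2)
```

The inertia weight w and the cognitive and social coefficients c1, c2 trade off exploration against convergence speed; the values used here are common textbook defaults.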
Table 2 Output images obtained from differencing and ratioing
Diff-1 Ratio-1 Diff-2 Ratio-2
Diff-3 Ratio-3 Diff-4 Ratio-4
Diff-5 Ratio-5 Diff-6 Ratio-6
Diff-7 Ratio-7 Diff-8 Ratio-8
Diff-9 Ratio-9 Diff-10 Ratio-10
0.7287, 0.0718, 0.0764 and 0.072. The highest entropy is obtained from the differencing-PSO output image. Similarly, for image 2, the entropy values are 0.9471, 0.0963,
0.3543 and 0.0963 for image differencing technique, and for image ratioing, the
values are 0.0324, 0.0716, 0.0830 and 0.0720. Entropy result for all images is placed
in Table 4, and it can be observed that for all ten images, image differencing with PSO
yields the highest entropy value. Also, from Table 5, the PIQE values for differencing with PSO, Sobel, Canny and Prewitt are 52.2427, 80.0232, 78.2973 and 78.8501. The
Table 3 Output images obtained after image differencing (PSO, Sobel, Canny, Prewitt) and image ratioing (PSO, Sobel, Canny, Prewitt)
1-diff-pso 1-diff-sobel 1-diff-canny 1-diff-prewitt
1-rat-pso 1-rat-sobel 1-rat-canny 1-rat-prewitt
PIQE values for ratioing with PSO, Sobel, Canny and Prewitt are 72.1938, 82.6905, 83.3192 and 82.3945. The lowest PIQE values are obtained from the differencing-PSO and differencing-Canny output images.
5 Conclusion and Future Scope
Satellite images are a great warehouse of information. Detecting changes in these
images with respect to land use land cover, deforestation, resource management
and coastal changes is very crucial. This research presents the implementation of
optimized change detection technique using particle swarm optimization for remote-
sensed data. Output images obtained from image differencing and image ratioing are
further segmented by PSO and conventional techniques. The entropy and PIQE parameters are used for result validation, and it has been observed that the output images obtained from differencing-PSO have the highest entropy and lowest PIQE. Higher entropy and lower PIQE indicate superior-quality output images. In future, the techniques can be coupled with other artificial intelligence methods for more detailed detection of changes, and more quality parameters can be used for result validation.
Table 4 Entropy values for the output images

       Image differencing                   Image ratioing
Image  PSO     Sobel   Canny   Prewitt     PSO     Sobel   Canny   Prewitt
1      0.8764  0.0966  0.3297  0.0965     0.7287  0.0718  0.0764  0.072
2      0.9471  0.0963  0.3543  0.0963     0.0324  0.0716  0.083   0.072
3      0.9916  0.0617  0.1913  0.0614     0.8185  0.0615  0.0924  0.0661
4      0.8516  0.0548  0.2755  0.0533     0.818   0.0615  0.0926  0.0655
5      0.8868  0.0962  0.3285  0.096      0.8331  0.0715  0.0756  0.0718
6      0.8849  0.0514  0.3615  0.0503     0.0103  0.0677  0.0977  0.0714
7      0.9032  0.0442  0.3069  0.0438     0.0310  0.0567  0.0902  0.0608
8      0.9781  0.0987  0.4557  0.0976     0.8434  0.0729  0.0767  0.0732
9      0.8269  0.0984  0.4036  0.0983     0.0139  0.0730  0.0767  0.0733
10     0.8883  0.0974  0.2512  0.0972     0.0134  0.0710  0.0763  0.0713
Table 5 PIQE values for the output images

       Image differencing (PIQE)               Image ratioing (PIQE)
Image  PSO      Sobel    Canny    Prewitt     PSO      Sobel    Canny    Prewitt
1      52.2427  80.0232  78.2973  78.8501    72.1938  82.6905  83.3192  82.3945
2      54.8975  80.0896  78.7315  79.8743    79.1913  83.0612  83.2820  83.2715
3      76.4534  82.2124  77.5187  82.3346    86.1798  82.8269  80.5575  84.0601
4      82.6335  85.2554  76.0775  85.3651    84.0316  83.4336  82.2229  83.3657
5      79.2612  76.9258  74.9741  75.8730    79.3779  78.8899  78.4755  79.0612
6      84.1178  84.2770  76.8454  84.4235    76.8429  83.4853  81.1664  83.0750
7      67.8191  85.3741  78.8665  85.5212    76.9564  84.5318  82.1702  84.3592
8      80.1851  79.2580  75.9025  78.6312    80.6044  81.2429  81.5354  81.9369
9      76.7998  75.8707  75.5008  75.6342    71.9006  77.4038  77.6320  77.4579
10     65.4566  80.5644  77.3403  79.8966    82.0488  83.2870  83.3271  83.7434
References
1. Abualigah LM, Khader AT, Hanandeh ES (2018) A new feature selection method to improve the
document clustering using particle swarm optimization algorithm. J Comput Sci 25:456–466
2. Chen H, Qi Z, Shi Z (2021) Remote sensing image change detection with transformers. IEEE
Trans Geosci Remote Sens 60:1–14
3. Chughtai AH, Abbasi H, Karas IR (2021) A review on change detection method and accuracy
assessment for land use land cover. Remote Sens Appl Soc Environ 22:100482
4. Goswami A, Sharma D, Mathuku H, Gangadharan SM, Yadav CS, Sahu SK, Pradhan MK,
Singh J, Imran H (2022) Change detection in remote sensing image data comparing algebraic
and machine learning methods. Electronics 11(3):431
5. Naeini AA, Babadi M, Mirzadeh SMJ, Amini S (2018) Particle swarm optimization for object-
based feature selection of VHSR satellite images. IEEE Geosci Remote Sens Lett 15(3):379–
383
6. Nichol J, Wong MS (2005) Satellite remote sensing for detailed landslide inventories using
change detection and image fusion. Int J Remote Sens 26(9):1913–1926
7. Nurda N, Noguchi R, Ahamed T (2020) Change detection and land suitability analysis for
extension of potential forest areas in Indonesia using satellite remote sensing and GIS. Forests
11(4):398
8. Peng X, Zhong R, Li Z, Li Q (2020) Optical remote sensing image change detection based on
attention mechanism and image difference. IEEE Trans Geosci Remote Sens 59(9):7296–7307
9. Singh A (1989) Review article digital change detection techniques using remotely-sensed data.
Int J Remote Sens 10(6):989–1003
10. Yadav R, Pandey M (2022) Image segmentation techniques: a survey. In: Proceedings of data
analytics and management: ICDAM 2021, vol. 1. Springer Singapore, pp 231–239
11. Yang M, Jiao L, Liu F, Hou B, Yang S (2019) Transferred deep learning-based change detection
in remote sensing images. IEEE Trans Geosci Remote Sens 57(9):6960–6973
12. Zhu Q, Guo X, Deng W, Shi S, Guan Q, Zhong Y, Zhang L, Li D (2022) Land-use/land-cover
change detection based on a Siamese global learning framework for high spatial resolution
remote sensing imagery. ISPRS J Photogrammetry Remote Sens 184:63–78
Critical Analysis of 5G Networks’ Traffic
Intrusion Using PCA, t-SNE, and UMAP
Visualization and Classifying Attacks
Humera Ghani, Shahram Salekzamankhani, and Bal Virdee
Abstract Networks, threat models, and malicious actors are advancing quickly.
With the increased deployment of the 5G networks, the security issues of the
attached 5G physical devices have also increased. Therefore, artificial intelligence-
based autonomous end-to-end security design is needed that can deal with incoming
threats by detecting network traffic anomalies. To address this requirement, in this
research, we used a recently published 5G traffic dataset, 5G-NIDD, to detect network
traffic anomalies using machine and deep learning approaches. First, we analyzed
the dataset using three visualization techniques: t-Distributed Stochastic Neighbor
Embedding (t-SNE), Uniform Manifold Approximation and Projection (UMAP), and
principal component analysis (PCA). Second, we reduced the data dimensionality
using mutual information and PCA techniques. Third, we solved the class imbalance issue by inserting synthetic records of the minority classes. Last, we performed classification using six different classifiers and presented the evaluation metrics. We obtained the best results when the K-Nearest Neighbors classifier was used: accuracy (97.2%), detection rate (96.7%), and false positive rate (2.2%).
Keywords Network intrusion detection · Class imbalance · t-SNE · UMAP · PCA · 5G-NIDD
H. Ghani (B)·S. Salekzamankhani ·B. Virdee
School of Computing and Digital Media, Centre for Communications Technology, London
Metropolitan University, London N7 8DB, UK
e-mail: hug0051@my.londonmet.ac.uk
S. Salekzamankhani
e-mail: s.salekzamankhani@londonmet.ac.uk
B. Virdee
e-mail: b.virdee@londonmet.ac.uk
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_32
1 Introduction
With the increased deployment of the 5G networks, the security issues of the attached
5G physical devices have also increased [1]. Several technologies, for example,
firewalls, traffic shaping devices, and intrusion detection systems are available to
secure a network [2]. With the changing world needs and threat models, networks
are becoming more complex and heterogeneous; malicious actors are also becoming
more advanced [3]. Hence, artificial intelligence-based autonomous end-to-end secu-
rity design is needed that can deal with incoming threats by detecting network traffic
anomalies [4]. Therefore, to address network traffic anomaly issues in 5G networks, we proposed a novel approach using the recently released 5G traffic dataset, 5G-NIDD.
We used machine and deep learning approaches to perform our experiments.
In this study, we first performed a visual analysis of this dataset using three
different visualization techniques. We then reduced the data dimensionality. In this
step, feature selection and feature extraction were performed using mutual infor-
mation and principal component analysis techniques. In the third step, the class
imbalance issue was resolved by inserting synthetic records of minority classes. We
used a random over-sampling method for balancing the class distribution. Finally, we
performed classification using six different classifiers and presented the evaluation
metrics. The best results were obtained with the K-Nearest Neighbors classifier, with an accuracy
of 97.2%, detection rate of 96.7%, and false positive rate of 2.2%.
The contributions of this paper are:
• Visual analyses of the 5G-NIDD dataset to better understand its intricacies, using PCA, t-SNE, and UMAP techniques.
• Dimensionality reduction to remove unimportant features, which cause inaccurate results, longer processing times and high computational cost, using mutual information and principal component analysis techniques.
• Removal of class imbalance to improve classification metrics, using the random over-sampling technique.
• Classification of benign and malicious traffic with high accuracy using decision tree (DT), K-Nearest Neighbors (K-NN), multi-layer perceptron (MLP), Gaussian Naïve Bayes (GNB), random forest (RF), and support vector classifier (SVC) algorithms.
This paper is structured as follows. Section 2 describes the related work reported recently in the literature. Section 3 introduces the dataset. Section 4 elaborates on the proposed approach. Section 5 shows the results and discusses the findings. Section 6 concludes the work presented in this paper and recommends future work.
2 Related Works
This section discusses contemporary research in the field of network anomaly
detection. Researchers are addressing this problem using various machine and
deep learning approaches. For experiments, in general, they employ CICIDS-2017,
NSL-KDD, and UNSW-NB15 datasets. However, we used the 5G-NIDD dataset, a comparatively new dataset containing 5G network traffic records.
The authors in [5] divided the UNSW-NB15 dataset based on protocol: TCP, UDP, and
OTHER. Using the Chi-square technique, they performed feature selection, and for
classification, they used a one-dimensional convolution neural network. Their work
includes the visualization of the dataset using t-SNE. However, current research [7,
8] has reported better classification performance metrics on the same dataset.
Authors in [6] created and evaluated the 5G-NIDD dataset using five different clas-
sifiers on binary and multiclass labels. They used the analysis of variance (ANOVA)
technique for feature selection and used the ten best features for classification.
Features in this dataset are skewed and multi-modal, whereas ANOVA assumes normally distributed features. Therefore, using ANOVA for feature selection on this dataset is inappropriate.
Reference [7] performed their experiments on NSL-KDD and UNSW-NB15
datasets. They proposed combining particle swarm optimization (PSO) and a grav-
itational search algorithm (GSA) for feature selection. They used five other feature
selection techniques but received the best classification metrics when the features
selected by their proposed method were given to the random forest classifier.
Although this research achieved good results, the proposed technique selected a higher number of features than the other feature selection methods evaluated in that paper.
Researchers in [8] performed their experiments on UNSW-NB15, CICIDS-2017, and Phishing datasets. They used a correlation-based feature selection method. For data visualization and feature reduction, they used t-SNE, and for classification, random forest. However, [9] noted a limitation: t-SNE is a non-parametric dimensionality reduction technique, and such techniques cannot map new data points.
Reference [10] used NSL-KDD dataset for their research. They used an auto-
encoder for detecting network anomalies. Although they reported good classification
performance, contemporary research [7] showed better performance metrics on the
same dataset.
Authors in [11] used the UNSW-NB15 and NSL-KDD datasets to investigate network traffic anomalies. First, they addressed the class imbalance issue by reducing the noise samples from the majority class and then increasing the minority class samples using the Synthetic Minority Over-sampling Technique. Second, they performed classification using deep learning approaches: a convolutional neural network and bidirectional long short-term memory.
Researchers in [12] performed their experiments on the UNSW-NB15 and NSL-KDD (KDDTest+ and KDDTest-21) datasets. First, they addressed the class imbalance issue using a Wasserstein Generative Adversarial Network. Second, they employed
424 H. Ghani et al.
a Stacked Auto-encoder for feature extraction. Third, they constructed a cost-
sensitive loss function. Their performance metrics suggest that there is room for further improvement in their approach.
The above discussion shows that current research in network traffic anomaly detection rarely presents a visual analysis of the datasets. Therefore, this research visually examined the 5G-NIDD dataset, a newly released 5G traffic dataset [6].
3 Dataset
The 5G-NIDD dataset was created using a real 5G test network for network intrusion detection. It was published by [6]. It has 52 features in total: 32 float, 12 integer, and 8 categorical. The dataset has 1,215,890 records, of which 477,737 are benign and 738,153 are malicious; benign records make up 39.2% and malicious records 60.7%, as shown in Table 1. The dataset has eight different types of attacks; their names and percentages of the attack traffic are: UDPFlood (61.9%), HTTPFlood (19.0%), SlowrateDoS (9.9%), TCPConnectScan (2.7%), SYNScan (2.7%), UDPScan (2.1%), SYNFlood (1.3%), and ICMPFlood (0.15%), as shown in Table 2.
Table 1 Distribution of records in 5G-NIDD dataset

Label      No. of records  Percentage
Benign     477,737         39.291
Malicious  738,153         60.708
Total      1,215,890       100
Table 2 Distribution of attacks in malicious records

Attack type     No. of records  Percentage
UDPFlood        457,340         61.957
HTTPFlood       140,812         19.076
SlowrateDoS     73,124          9.9063
TCPConnectScan  20,052          2.716
SYNScan         20,043          2.715
UDPScan         15,906          2.154
SYNFlood        9721            1.316
ICMPFlood       1155            0.156
Total           738,153         100
4 Proposed Approach
This section describes the approach adopted to perform this research. First, data
cleaning and wrangling were performed at the data preprocessing stage. Second, a
visual analysis of data was presented and described using three different methods.
Third, data dimensionality was reduced by employing feature selection and feature
extraction approaches. Fourth, the class imbalance issue was addressed. Fifth, traffic
classification was performed using six different classification algorithms. Finally,
evaluation metrics were described and results were presented.
4.1 Data Preprocessing
The data are prepared for input to the machine learning algorithms. This dataset had redundant and unnecessary features, which were removed during data denoising. Some features had null values, which were imputed with appropriate alternative values. Some features showed skewed and multi-modal distributions; therefore, a log transformation was performed to bring them closer to a normal distribution. Categorical features were encoded using the one-hot encoding method. The dataset was split into train and test sets to avoid overfitting. Lastly, it was standardized to have mean 0 and variance 1.
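The preprocessing steps above can be sketched with pandas and scikit-learn. The tiny frame and the column names "bytes" and "proto" are hypothetical stand-ins, not names from the 5G-NIDD dataset:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy stand-in for 5G-NIDD; the column names are used only for illustration.
df = pd.DataFrame({
    "bytes": [100.0, 200.0, None, 10_000.0, 150.0, 300.0, 90_000.0, 250.0],
    "proto": ["tcp", "udp", "tcp", "icmp", "tcp", "udp", "tcp", "icmp"],
    "label": [0, 0, 0, 1, 0, 1, 1, 1],
})

df["bytes"] = df["bytes"].fillna(df["bytes"].median())  # impute null values
df["bytes"] = np.log1p(df["bytes"])                     # reduce skewness
df = pd.get_dummies(df, columns=["proto"])              # one-hot encoding

X, y = df.drop(columns="label"), df["label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Fit the scaler on the training split only, then standardize both splits
scaler = StandardScaler().fit(X_train)                  # mean 0, variance 1
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)
```

Fitting the scaler on the training split alone avoids leaking test-set statistics into the model.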
4.2 Visual Analysis of 5G-NIDD Dataset
In this paper, visualization is performed to understand the data better, so researchers
can apply appropriate processing techniques to achieve good results. This research
used three visualization techniques to get deep insight into the dataset. These tech-
niques are PCA, t-SNE, and UMAP. PCA is one of the most widely used dimensionality reduction techniques; it performs an orthogonal linear transformation of correlated variables into uncorrelated features, called principal components. The technique preserves the variance of the original high-dimensional data in the low-dimensional principal components. t-SNE is a nonlinear dimensionality reduction and visualization technique that builds a probability distribution over pairs of nearby points. The algorithm maps the high-dimensional feature space into a two-dimensional space by minimizing the difference between the two distributions; it models the high-dimensional space with a Gaussian distribution and the two-dimensional space with a t-distribution. UMAP is an efficient dimensionality reduction technique that uses a graph algorithm to reduce data dimensionality. First, it constructs the topology of the high-dimensional data. Then, it constructs a low-dimensional embedding that preserves local clustering by grouping similar observations.
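A minimal sketch of producing the two-dimensional embeddings with scikit-learn, assuming a synthetic stand-in for the preprocessed feature matrix; UMAP requires the third-party umap-learn package and is indicated only in a comment:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Synthetic stand-in for the preprocessed 5G-NIDD feature matrix
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

# Linear projection: principal components preserve maximal variance
X_pca = PCA(n_components=2, random_state=0).fit_transform(X)

# Nonlinear embedding: t-SNE matches pairwise-similarity distributions
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# UMAP (umap-learn package) would be used the same way:
#   import umap; X_umap = umap.UMAP(n_components=2).fit_transform(X)
```

Each embedding is then scatter-plotted and colored by traffic class, e.g. `plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, s=5)` with matplotlib.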
Fig. 1 Distribution of traffic types in the 5G-NIDD dataset
The class distribution of the 5G-NIDD dataset is imbalanced (see Fig. 1). Three attack categories (UDPFlood, HTTPFlood, and SlowrateDoS) constitute 90.8% of the malicious traffic, while the other five attack categories (TCPConnectScan, SYNScan, UDPScan, SYNFlood, and ICMPFlood) constitute 8.95% of the malicious traffic (Table 2).
Figures 2 and 3 show a two-dimensional and a three-dimensional view of the principal components; the class imbalance issue can be observed clearly.
Figures 4 and 5 show the complete dataset using t-SNE and UMAP, respectively. In addition to class imbalance, class overlap and within-class clustering issues are evident from these figures.
The class overlap issue is confirmed in Figs. 6, 7, 8 and 9, where the HTTPFlood and SlowrateDoS classes overlap. All three visualization techniques confirm this issue. Figures 6 and 7 show two-dimensional and three-dimensional plots of the class overlap using principal components. Figure 8 shows a t-SNE plot where the class overlap is evident. Figure 9 shows a UMAP plot showing the class overlap.
This dataset has within-class clustering issues as well. Figures 6, 7, 8, and 9 show scattered clusters of the HTTPFlood and SlowrateDoS classes in the PCA, t-SNE, and UMAP visualizations.
Class imbalance and overlap can hinder the attack detection performance [13,
14]. In this research, we addressed the class imbalance problem. Our work achieved
good evaluation metrics (see Table 4), suggesting that if a dataset has a high number of records, classifiers can identify the classes accurately despite the class overlap issue.
Fig. 2 Two-dimensional plot of PCA showing all types of traffic
Fig. 3 Three-dimensional plot of PCA showing all types of traffic
4.3 Dimensionality Reduction
Dimensionality reduction transforms high-dimensional data into low dimensions
where newly transformed data is a meaningful representation of original data [15].
Fig. 4 Full dataset visualization (t-SNE)
Fig. 5 Full dataset visualization (UMAP)
Dimensionality reduction techniques are employed to reduce the number of variables, reduce the computational complexity of high-dimensional data, improve model accuracy, enable better visualization, and help understand the process that generated the data [16]. The two main approaches to dimensionality reduction are feature selection and feature extraction.
Fig. 6 HTTPFlood and SlowrateDoS overlap (2D PCA)
Fig. 7 HTTPFlood and SlowrateDoS overlap (3D PCA)
Fig. 8 HTTPFlood and SlowrateDoS overlap (t-SNE)
Fig. 9 HTTPFlood and SlowrateDoS overlap (UMAP)
Feature selection methods select the most valuable features from the feature space. This process creates a low-dimensional representation of the feature space that preserves the most valuable information. [16] identified two approaches to feature selection: the filter method and the wrapper method. In this paper, the most useful features are selected using one of the filter-method algorithms, mutual information. This algorithm measures the amount of information that a random variable contains about another random variable, in other words, the reduction in the uncertainty of the original random variable given knowledge of the other random variable [17]. Unlike Pearson correlation, this method can measure nonlinear relationships between two variables.
Feature extraction produces a compressed representation of the input vector. Feature extraction techniques create new features from the original data space using a functional mapping [18]. Several algorithms are available that can perform this transformation linearly or nonlinearly. Researchers in [16] identified three approaches to feature extraction: performance measure, transformation, and generation of new features. We chose the transformation technique PCA because it is fast (only the first few principal components need to be computed) and more interpretable than other techniques, such as auto-encoders.
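The two-stage reduction described above (filter-method selection by mutual information, followed by PCA extraction) can be sketched as follows. The synthetic data are a stand-in, while the feature counts (22 selected features, 11 components) are taken from Sect. 5:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in for the preprocessed feature matrix
X, y = make_classification(n_samples=500, n_features=40, n_informative=10,
                           random_state=0)

# Filter-method selection: rank features by mutual information with the
# class label and keep the 22 top-ranked ones (the count used in Sect. 5)
selector = SelectKBest(mutual_info_classif, k=22).fit(X, y)
X_sel = selector.transform(X)

# Feature extraction: project the selected features onto 11 principal
# components, keeping most of the remaining variance
X_red = PCA(n_components=11).fit_transform(X_sel)
```

The same fitted `selector` and PCA would later be applied unchanged to the test split.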
4.4 Remove Class Imbalance
The class imbalance problem occurs when some classes have more instances than others; in such cases, learning algorithms are overwhelmed by the large classes and ignore the small ones [19]. Learning algorithms are generally not designed to handle imbalanced datasets without proper adjustment [20]. Researchers in [21] pointed out that datasets frequently exhibit class imbalance and overlap issues. The 5G-NIDD dataset also shows class imbalance in Fig. 1.
There are several approaches to solving the class imbalance problem. These
approaches can be grouped as: data-level, cost-sensitive, and ensemble learning.
Data-level approaches modify the dataset. Cost-sensitive approaches modify the cost
that algorithm tries to optimize. The ensemble learning approach leverages the power
of several learners to predict the minority class. Data-level approach random over-
sampling increases the number of observations from the minority class at random.
In contrast, random under-sampling decreases the number of observations from the
majority class at random.
The Synthetic Minority Over-sampling Technique (SMOTE) is frequently employed in contemporary research [11, 22, 23] to overcome class imbalance in network intrusion detection datasets. It balances the class distribution by inserting synthetic minority-class records, produced by linear interpolation: each synthetic record is created between a minority-class example and one of its K-Nearest Neighbors. We chose SMOTE to solve the class imbalance issue in the dataset.
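The interpolation idea behind SMOTE can be sketched minimally as below; production code would normally use an existing implementation such as imbalanced-learn's `SMOTE`, and the toy minority samples here are illustrative only:

```python
import numpy as np

def smote_like(X_min, n_new, k=5, rng=None):
    """Minimal SMOTE-style oversampling: each synthetic record is a random
    linear interpolation between a minority-class sample and one of its
    k nearest minority-class neighbors."""
    rng = np.random.default_rng(0) if rng is None else rng
    X_min = np.asarray(X_min, dtype=float)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                 # a point is not its own neighbor
    neighbors = np.argsort(d, axis=1)[:, :k]    # k nearest per sample
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        a = rng.integers(len(X_min))            # pick a random minority sample
        b = neighbors[a, rng.integers(min(k, len(X_min) - 1))]
        lam = rng.random()                      # interpolation factor in [0, 1)
        synthetic[i] = X_min[a] + lam * (X_min[b] - X_min[a])
    return synthetic

# Toy minority class: corners of the unit square
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_syn = smote_like(X_minority, n_new=6, k=2)
# Synthetic points lie on segments between minority samples, inside [0, 1]^2
```

Because every synthetic point is a convex combination of two real minority samples, the oversampled class stays inside the region the minority class already occupies.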
4.5 Network Traffic Classification
We used tree-based, probability-based, proximity-based, deep learning, and support
vector classifiers to predict class labels.
Decision tree is a non-parametric learning algorithm. It works on a divide-and-conquer strategy. This greedy algorithm searches for and identifies optimal split points within a tree. It does this job recursively, stopping when most records classify under the same class label.
K-Nearest Neighbor is a proximity-based classifier. To predict the class label of a
point, first, it finds K-Nearest Neighbors of this point based on Euclidean distance.
Then, each of these neighbors votes for their class, and the majority class wins.
Multilayer perceptron is a powerful deep learning model which is inspired by
neurons in the human brain. The basic building blocks of MLP are perceptrons
which are simple processing units. It can have many layers of perceptrons, which
gives it the name MLP.
Naïve Bayes classifier is simple and fast, has very few tunable parameters, and is well suited to high-dimensional data. Given the value of the class variable, this algorithm assumes conditional independence between each pair of input variables.
Random forest classifier is an ensemble of decision trees that can perform clas-
sification using a majority vote. Each decision tree uses a randomly selected feature
set from the original dataset. In addition, each tree uses a different sample of data,
like the bagging approach. It can successfully model high-dimensional data where features are nonlinearly related, and it does not assume that the data follow a particular distribution.
Support vector classifier is a simple and powerful classifier. It can draw linear and
nonlinear class boundaries to classify the data points. To perform its job, it iteratively
constructs a hyperplane to differentiate classes. Each iteration tries to minimize the
error. The main idea of this technique is to create a hyperplane that can best divide
the data into classes.
4.6 Evaluation Metrics
Model performance is evaluated using accuracy, detection rate, and false positive rate metrics. The elements of these metrics are retrieved from the confusion matrix {TP, TN, FP, FN}. True positive (TP) means correctly classified attack packets. True negative (TN) means correctly classified normal packets. False positive (FP) means incorrectly classified attack packets, and false negative (FN) means incorrectly classified normal packets.
Accuracy means the ratio between correctly identified packets and total number
of packets.
Critical Analysis of 5G Networks’ Traffic Intrusion Using PCA, t-SNE 433
Accuracy = (TP + TN) / (TP + TN + FP + FN). (1)
Detection rate represents the ratio of correctly identified attacks to all actual attacks.
Detection rate = TP / (TP + FN). (2)
False positive rate is the ratio of normal packets incorrectly identified as attacks to all actual normal packets.
False positive rate = FP / (FP + TN). (3)
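Equations (1)–(3) translate directly into code; the confusion-matrix counts below are hypothetical, chosen only to sanity-check the formulas:

```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)        # Eq. (1)

def detection_rate(tp, fn):
    return tp / (tp + fn)                          # Eq. (2)

def false_positive_rate(fp, tn):
    return fp / (fp + tn)                          # Eq. (3)

# hypothetical confusion-matrix counts, not taken from the paper's experiments
tp, tn, fp, fn = 90, 85, 5, 10
print(round(accuracy(tp, tn, fp, fn), 3))          # → 0.921
print(round(detection_rate(tp, fn), 3))            # → 0.9
print(round(false_positive_rate(fp, tn), 3))       # → 0.056
```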
5 Results and Discussion
To reduce the feature space, we used mutual information and PCA techniques. We processed the dataset before applying dimensionality reduction techniques (see Sect. 4.1). The mutual information technique ranked the features (see Fig. 10). We selected the twenty-two top-ranked features and transformed them into eleven principal components, which captured 89.2% of the variance in the data (see Table 3).
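The explained-variance computation behind Table 3 can be illustrated in the two-dimensional case, where the sample covariance matrix is 2 × 2 and its eigenvalues have a closed form (a toy sketch; the paper's eleven components were obtained with a standard PCA routine):

```python
import math

def explained_variance_2d(points):
    """Explained-variance ratio of the first principal component of 2-D data,
    using the closed-form eigenvalues of the 2x2 sample covariance matrix."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    tr, det = sxx + syy, sxx * syy - sxy ** 2
    lam1 = (tr + math.sqrt(tr * tr - 4 * det)) / 2   # larger eigenvalue
    return lam1 / tr                                  # its share of total variance

# perfectly collinear data: the first component captures all the variance
print(round(explained_variance_2d([(1, 1), (2, 2), (3, 3), (4, 4)]), 6))   # → 1.0
```

In the paper's eleven-dimensional case, the ratio in Table 3 is the same idea applied per eigenvalue: each component's eigenvalue divided by the total variance, summing to 0.892.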
The eleven principal components are fed to the classifiers. Table 4 shows the classification performance of six classifiers using accuracy, detection rate, and false positive rate. Our results show that the GNB classifier performed considerably worse, confirming the finding of [6]. DT, RF, and k-NN showed better performance metrics than MLP and SVC. However, k-NN remains the best classifier across all evaluation metrics with accuracy (97.2%), detection rate (96.7%),
Fig. 10 Mutual information ranking of features
Table 3 PCA explained variance ratio
Principal component Variance captured
1st component 0.20667303
2nd component 0.14651562
3rd component 0.09805928
4th component 0.08302226
5th component 0.05968254
6th component 0.05411584
7th component 0.05063291
8th component 0.05041822
9th component 0.04955088
10th component 0.04887834
11th component 0.04494785
Total 0.89249676
Table 4 Evaluation metrics (binary classification)
Evaluation metrics DT RF K-NN GNB MLP SVC
Accuracy 97.150 97.168 97.275 87.425 94.91 92.09
Detection rate 96.618 96.650 96.765 92.428 92.646 90.49
False positive rate 2.318 2.314 2.213 17.559 2.828 6.32
and false positive rate (2.2%). Figure 11 shows the ROC curve of the k-NN classifier, which captured 97.2% area under the curve.
Table 5 shows a comparison of our approach with other contemporary techniques.
Research [11] reported 77.16% and 83.58% accuracy on UNSW-NB15 and NSL-
KDD datasets, respectively. Research [12] showed 93.27% and 90.24% accuracy
on UNSW-NB15 and NSL-KDD datasets, respectively. Our approach outperformed
them and achieved 97.28% classification accuracy using the 5G-NIDD dataset.
A limitation of this research is that the 5G-NIDD dataset was published only recently [6], so little other work exists yet against which to compare our results. Another limitation is processing power: with more computational resources, we would have searched for the best hyperparameters for our classifiers and likely reported even better evaluation metrics.
6 Conclusion and Future Work
In this paper, we presented a novel approach to classify network traffic anomalies
with high accuracy. We analyzed the dataset by projecting it in two-dimensional
and three-dimensional spaces using linear and nonlinear dimensionality reduction
Fig. 11 ROC curve of k-NN
Table 5 Performance comparison
Research Accuracy Dataset
[11] 77.16% UNSW-NB15
[11] 83.58% NSL-KDD
[12] 93.27% UNSW-NB15
[12] 90.34% NSL-KDD
Proposed research 97.28% 5G-NIDD
and visualization techniques. We reduced the feature space by ranking them using
the mutual information algorithm; then, we transformed high-ranked features into
principal components. This dataset had a class imbalance issue, which we solved by balancing the class distribution with the SMOTE over-sampling algorithm. Last, we performed classification using six classification algorithms and presented the evaluation using accuracy, detection rate, and false positive rate metrics. We achieved the best classification performance when the K-Nearest Neighbors algorithm was used. In the future, we intend to extend this research in two directions: use an ensemble learner to improve classification metrics, and use a generative model for the class imbalance issue, since generative models are among the most successful deep learning architectures.
References
1. Uysal DT, Yoo PD, Taha K (2022) Data-driven malware detection for 6G networks: a survey
from the perspective of continuous learning and explainability via visualisation. IEEE Open J
Veh Technol 4:61–71
2. Khan AR, Kashif M, Jhaveri RH, Raut R, Saba T, Bahaj SA (2022) Deep learning for intrusion
detection and security of Internet of things (IoT): current analysis, challenges, and possible
solutions. In: Security and communication networks
3. Lam J, Abbas R (2020) Machine learning based anomaly detection for 5G networks. arXiv preprint arXiv:2003.03474
4. Siriwardhana Y, Porambage P, Liyanage M, Ylianttila M (2021) AI and 6G security: opportunities and challenges. In: 2021 Joint European conference on networks and communications & 6G summit (EuCNC/6G Summit). IEEE, pp 616–621
5. Hooshmand MK, Hosahalli D (2022) Network anomaly detection using deep learning
techniques. CAAI Trans Intell Technol 7(2):228–243
6. Samarakoon S, Siriwardhana Y, Porambage P, Liyanage M, Chang SY, Kim J, Kim J, Ylianttila
M (2022) 5G-NIDD: a comprehensive network intrusion detection dataset generated over 5G
wireless network. arXiv preprint arXiv:2212.01298
7. Boahen EK, Bouya-Moko BE, Wang C (2021) Network anomaly detection in a controlled
environment based on an enhanced PSOGSARFC. Comput Secur 104:102225
8. Hammad M, Hewahi N, Elmedany W (2021) T-SNERF: a novel high accuracy machine learning
approach for intrusion detection systems. IET Inf Secur 15(2):178–190
9. Gisbrecht A, Schulz A, Hammer B (2015) Parametric nonlinear dimensionality reduction using
kernel t-SNE. Neurocomputing 147:71–82
10. Xu W, Jang-Jaccard J, Singh A, Wei Y, Sabrina F (2021) Improving performance of
autoencoder-based network anomaly detection on NSL-KDD dataset. IEEE Access 9:140136–
140146
11. Jiang K, Wang W, Wang A, Wu H (2020) Network intrusion detection combined hybrid
sampling with deep hierarchical network. IEEE Access 8:32464–32476
12. Zhang G, Wang X, Li R, Song Y, He J, Lai J (2020) Network intrusion detection based on
conditional Wasserstein generative adversarial network and cost-sensitive stacked autoencoder.
IEEE Access 8:190431–190447
13. Vuttipittayamongkol P, Elyan E, Petrovski A (2021) On the class overlap problem in imbalanced data classification. Knowl-Based Syst 212:106631
14. Li Z, Huang M, Liu G, Jiang C (2021) A hybrid method with dynamic weighted entropy for
handling the problem of class imbalance with overlap in credit card fraud detection. Expert
Syst Appl 175:114750
15. Padmaja DL, Vishnuvardhan B (2016) Comparative study of feature subset selection methods
for dimensionality reduction on scientific data. In: 2016 IEEE 6th international conference on
advanced computing (IACC). IEEE, pp 31–34
16. Khalid S, Khalil T, Nasreen S (2014) A survey of feature selection and feature extraction
techniques in machine learning. In: 2014 science and information conference. IEEE, pp 372–
378
17. Zhou H, Wang X, Zhu R (2022) Feature selection based on mutual information with correlation
coefficient. In: Applied intelligence, pp 1–18
18. Motoda H, Liu H (2002) Feature selection, extraction and construction. Commun IICM
(Institute of Information and Computing Machinery, Taiwan) 5(67–72):2
19. Chawla NV, Japkowicz N, Kotcz A (2004) Special issue on learning from imbalanced data
sets. ACM SIGKDD Explor Newsl 6(1):1–6
20. Sun Y, Wong AK, Kamel MS (2009) Classification of imbalanced data: a review. Int J Pattern
Recognit Artif Intell 23(04):687–719
21. Lee HK, Kim SB (2018) An overlap-sensitive margin classifier for imbalanced and overlapping
data. Expert Syst Appl 98:72–83
22. Mulyanto M, Faisal M, Prakosa SW, Leu JS (2020) Effectiveness of focal loss for minority
classification in network intrusion detection systems. Symmetry 13(1):4
23. Zeeshan M, Riaz Q, Bilal MA, Shahzad MK, Jabeen H, Haider SA, Rahim A (2021) Protocol-based deep intrusion detection for DoS and DDoS attacks using UNSW-NB15 and Bot-IoT datasets. IEEE Access 10:2269–2283
Denoising the Endoscopy Images
of the Gastrointestinal Tract Using
Complex-Valued CNN
Nisha and Prachi Chaudhary
Abstract Accurately delineating intestinal atrophy is a tough job because of the heterogeneous and convoluted structure of the small intestine. Image quality is a salient factor in diagnosing various diseases in the medical field. The images captured through machines or medical imaging devices are prone to some kind of noise. Endoscopy images are of inferior quality because of lighting problems inside the gastrointestinal (GI) tract. The noise type changes with the environment and with the camera used for capturing the images. The aim of this paper is to determine whether image-denoising methods are effective for classification. This study proposes an efficient complex-valued CNN (CDNet) method for denoising the images. The proposed model is found to be superior to other state-of-the-art methods for image denoising on real datasets. The denoising performance is computed through the PSNR and SSIM metrics. The results show a PSNR of 45.58 and an SSIM of 0.99, demonstrating the superiority of the proposed method on real datasets.
Keywords Endoscopy · Denoising · Complex value CNN · SSIM · PSNR
1 Introduction
Celiac disease is a disease of the small intestine triggered by the intake of a gluten-rich diet. The outer layers of the small intestine are damaged in this disease. Due to damage to the lining of the small intestine, known as villi, the nutrients present in food are not absorbed by the body [1]. A patient affected by celiac disease remains on a gluten-free diet lifelong, and further follow-ups are necessary to check how well the disease is controlled. Through serology and histopathology profiles, we are able to find the markers of the disease. The markers sometimes do not give a proper idea of the presence of
Nisha (B)·P. Chaudhary
ECE Department, DCRUST, Murthal, India
e-mail: 18001903005nisha@dcrustm.org
P. Chaudhary
e-mail: prachi.ece@dcrustm.org
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_33
disease, so in this case, endoscopy is preferred [2, 3]. A minimum of six biopsy samples is needed for a definitive diagnosis [4]. As the biopsy procedure is painful, some other method for diagnosis is needed. Automated detection by image processing is
becoming an effective method nowadays. In this method, the image is preprocessed
first, so that the quality of the image can be improved for further processing. Image
quality is measured in two different ways. The first approach is evaluation by doctors or endoscopists, which is always considered the most accurate. But this approach takes time and is difficult because a specialist must be present for the evaluation. So, another approach, automatically assessing the quality of the image, is more suitable and convenient. In this approach, peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) are measured. The image
preprocessing method is used as the initial step after acquiring images. Different
kinds of noise are present in digital images, such as blur, speckle noise, Gaussian noise, and salt-and-pepper noise. The noise present in the image is removed by applying suitable denoising methods. This noise removal is necessary as it supports the subsequent segmentation and classification of the images. Some transform approaches, machine learning methods, and statistical methods are also used for denoising.
Contribution: Celiac disease is spreading all over the world due to excessive intake
of gluten.
This paper addresses denoising of endoscopy images taken through an Olympus endoscope (GIF-H190). In this study, we focus on primary data of celiac disease patients, removing noise introduced by the camera inside the small intestine.
Complex-valued CNNs, in contrast to real-valued CNNs, are an emerging method that improves the capacity of the model. Until now, this approach had not been applied to such images.
This paper is organized into different sections. Section 1 gives a brief idea of the disease in the small intestine and the kinds of noise present in endoscopic images. Section 2 summarizes previous work on denoising. In Sect. 3, denoising techniques are studied. In Sect. 4, the results are studied, and finally, the conclusion is drawn in Sect. 5.
2 Literature Review
Previous studies include various denoising methods based on learning and nonlearning. In 2020, ADNet, an attention-guided CNN, was used by Tian et al.; it applies attention and reconstruction blocks for image denoising. It was evaluated on general images, but this method needs a large number of calculations because of the large number of weights assigned [5]. A noise estimation module and several noise removal modules are used with the NERNet CNN in the case of realistic noise [6]; this study was done by Guo et al. in 2020. In 2019, Shi et al. used three network modules for image denoising and a hierarchical residual learning CNN for classification. These three subgroups extract patches, map noise, and finally fuse the map into the estimation model [7]. In 2019, a dictionary learning model was proposed by Zhang et al. for the classification of Gaussian and mixed noise [8]. In the same year, again by Zhang et al., SANet was
used for classification with deep mapping and band aggregation blocks for denoising
[9]. The retaining module for image denoising and DRCNN for classification was
used by Li et al. in 2020 [10]. In the year 2020, SWCNN was proposed by Yin et al.; side window kernel convolution is used for image denoising and SWCNN for classification [11]. A regression CNN, combined with a classifier model, was used for denoising by Jin et al. in 2020 [12]. In this approach, the
noise was detected by the classifier and noise pixels were restored by the regression
network. CDNet was proposed in 2021 by Quan et al.; in this approach, a complex-valued (CV) ReLU is used for denoising [13]. Some research has also examined the effect of denoising methods on classification. When a CNN is trained on degraded images, the degradations have less effect on classification; otherwise, noise, blur, and contrast changes have a significant impact on classification [14]. A CNN has to work harder when it has not been trained on such images, and a trained CNN model increases accuracy [15]. Another study shows that denoising the inputs and augmenting with noise-affected images consistently improves the performance of classification [16]. A pipeline architecture is used for denoising preprocessing
in combination with CNN for classification [17]. The noise resilience problem is
resolved using a VGG16 CNN trained on specific images [18]. A dual-channel CNN based on InceptionV3 processes denoised and unprocessed images; the combined output, obtained by feature summation or concatenation, is used for further classification by a CNN for better results [19]. As CNNs require a lot of computation and storage space, this hinders their use in daily-life automation systems.
3 Denoising Techniques
Denoising is a very important part of digital image processing. There are different methods and algorithms available for denoising an image. The filtration method that best retains the quality of the image is employed. Neighborhood pixels are used for checking the type of noise present. Denoising methods are categorized into linear and nonlinear methods (Fig. 1).
3.1 DnCNN
It is an efficient learning model that works on Gaussian-noise images. It extracts the residual (noise) image from the corrupted image. The noise-free image can be obtained as the difference between the corrupted image and the residual image [20]. The
Fig. 1 Categories of denoising methods: learning methods (DnCNN, Noise2Noise, CNN, CDNet) and nonlearning methods (Wiener filter, moving average filter, median filter, opening filter)
size of the filter is set to 3 × 3 and pooling layers are removed from the structure. Conv + ReLU layers are used to create 64 feature maps. BN is introduced for batch normalization, and finally a convolution of size 3 × 3 × 64 is applied for output reconstruction. The model is pre-trained on synthetic Gaussian noise with standard deviation σ = 25.
3.2 Noise2Noise
This method does not use clean data; it uses pairs of noisy images in the training phase. For working with small datasets, it uses the wavelet transform (WT), so that images can be handled in multispectral bands [21]. The images consist of low- and high-pass components, and at the output side the inverse WT is applied to obtain the denoised output. This model is trained on synthetic Gaussian noise with σ ranging from 5 to 25.
3.3 CDNet
It consists of five main blocks. This method, used in [13], is a complex-valued CNN. The architecture shown in Fig. 2 consists of 24 sequentially connected convolutional units, each comprising a convolution layer, ReLU, and BN, with 64 convolution kernels. In total it has five blocks of convolution and deconvolution layers. One residual layer and one merging layer are used to recover the original real-valued image.
Fig. 2 Brief architecture of the complex-valued CNN [6]: input image → CV conv → CV ReLU → CV BN → CV RB → CV merging layer → output image
3.4 Median Filter
It is one of the most popular filters for denoising in digital image processing. It removes impulse noise. The intensity value of a pixel affected by noise is replaced by the median of the intensity values of the pixels in its neighborhood [22]. The window size is fixed in this type of filter. The filter is applied to the whole image, so both noisy and noise-free pixel values are changed. Because all pixel values are changed, some good-quality pixels may be replaced by lower-quality values. Consequently, some fine details are lost, and distortion is introduced at the edges of a high-quality image.
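The impulse-removal behavior described above can be seen in a few lines of Python (a simplified sketch that filters only interior pixels with a 3 × 3 window; the sample values are hypothetical):

```python
from statistics import median

def median_filter_3x3(img):
    """Apply a 3x3 median filter to the interior pixels of a 2-D image
    (a list of equal-length rows); border pixels are copied unchanged."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            window = [img[r + dr][c + dc]
                      for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
            out[r][c] = median(window)
    return out

# a flat patch corrupted by a single salt impulse
img = [[10, 10, 10],
       [10, 255, 10],
       [10, 10, 10]]
print(median_filter_3x3(img)[1][1])   # → 10
```

The outlier 255 is discarded because it cannot be the median of its 3 × 3 window, which is exactly why the filter handles salt-and-pepper noise well while smoothing fine detail.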
3.5 Wiener Filter
This type of filtration removes noise using a statistical approach. Blur and additional noise are removed with this kind of filter. It uses a Linear Time-Invariant (LTI) approach and minimizes the MSE. The spectral properties of the original and noisy images are first checked. Filtering is first done by checking the present value of the signal; then smoothing is done by checking past values of the signal; in the last step, the future value of the signal is predicted [23]. This method uses only a 3 × 3 kernel, so the results are not well optimized.
3.6 Gaussian Filter
Its results are better than those of the Wiener filter and the median filter. The Gaussian filter computes a weighted average of the pixels according to the Gaussian distribution. It is a type of linear low-pass filter for reducing noise and blur. The kernel is passed over each pixel. It preserves edges, which are not preserved by the median filter.
3.7 Deep Image Prior
This method is used in the case of the GI tract, as the images are taken from different regions of the small intestine and the environment inside the GI (gastrointestinal) tract also differs from region to region [24]. The blind denoising methods available include Noise Clinic, Neat Clinic, and Deep Image Prior. These methods are often time-consuming and require a number of iterations. So, an improved version of DIP is used to remove the noise. The improved version of DIP determines when to terminate the iteration process by using an image quality assessment method. The method also uses transfer learning to minimize iterations while retaining the denoising effect.
4 Experimental Results
In this study, we used a real dataset from PGIMS, Rohtak, and evaluated the performance of the existing complex-valued CNN model. The results are compared with state-of-the-art denoising methods: DnCNN [20], Noise2Noise [21], the median filter [22], DIP [24], and the Wiener filter [23]. The flowchart of the methodology used is shown in Fig. 3.
4.1 Data Preparation
The dataset is divided into training and testing images. Two hundred images were taken with the endoscope; 160 images are used for training and 40 for testing. Because the original images are very large, they are first resized to 256 × 256 before training. PSNR and SSIM are calculated for these images, and testing is then performed on the cropped images.
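The 160/40 split described above is a shuffled 80/20 partition; a minimal sketch (the file names and seed below are placeholders, not from the study):

```python
import random

def split_dataset(items, train_frac=0.8, seed=42):
    """Shuffle and partition a dataset into training and testing subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)        # reproducible shuffle
    cut = int(len(items) * train_frac)
    return items[:cut], items[cut:]

images = [f"frame_{i:03d}.png" for i in range(200)]   # placeholder file names
train, test = split_dataset(images)
print(len(train), len(test))   # → 160 40
```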
4.2 Steps to Be Followed
Step 1: Take input from endoscopy images through a real dataset. Divide the dataset
into training and testing images.
Step 2: Resize the image to a small pixel size of 256 ×256.
Step 3: Check the noise type; in this case, salt-and-pepper noise of density 0.3 is
found.
Step 4: Now use a specific filter for the removal of noise.
Step 5: Restore the image that is filtered.
Fig. 3 Flowchart of the methodology used: input noisy image → divide the data into training and testing images → preprocessing (resizing) → noise-type identification → complex-valued CNN → filtered image
Step 6: Evaluate the performance parameters for denoising.
4.3 Performance Evaluation
The performance on our dataset is checked with the CDNet model and compared with other state-of-the-art methods listed in Tables 1 and 2.
From Fig. 4a, it can be seen that the image contains salt-and-pepper noise. Uneven lighting, which is also a kind of noise, can be seen in the image. First, the image is cropped to size 256 × 256; cropping also removes extraneous regions of the image. Then, as shown in Fig. 4b, the salt-and-pepper noise is removed to a large extent using the model. Lighting artifacts are removed by averaging the 8 neighboring pixels and filling the gap with that value. Our method achieves a better PSNR value than the median filter, increasing PSNR from 42.5 to 45.58 (highlighted in bold in Table 1). Table 2 shows the SSIM performance of our model, which increases from 0.98 to 0.99 (again highlighted in bold).
Table 1 PSNR versus standard deviation
Method σ = 5 σ = 15 σ = 25 References
Median filter 42.5 33.9 29.4 [22]
Opening filter 33.8 33.6 32.6 [10]
Wiener filter 26.9 22.4 17.4 [23]
DnCNN 38.1 34.1 30.2 [20]
Noise2Noise 36.1 27.2 22.2 [21]
DIP 40.62 26.2 22.5 [24]
Proposed 45.58 34.5 32.7
Table 2 SSIM versus standard deviation
Method σ = 5 σ = 15 σ = 25 References
Median filter 0.98 0.96 0.92 [22]
Opening filter 0.98 0.89 0.69 [10]
Wiener filter 0.92 0.68 0.52 [23]
DnCNN 0.78 0.68 0.56 [20]
Noise2Noise 0.88 0.52 0.22 [21]
DIP 0.87 0.56 0.56 [24]
Proposed 0.99 0.98 0.96
Fig. 4 a Original image. bDenoised and enhanced image
4.4 Model Parameters
We implemented our model in Python using a complex-valued CNN architecture. It is trained for ten epochs with the binary cross-entropy loss criterion and the sigmoid activation function. Denoising performance can be computed through a number of parameters such as SNR, MSE, PSNR, and RMSE; in this study, we consider only PSNR and SSIM.
PSNR: It is the peak signal-to-noise ratio. PSNR decreases as the mean squared error increases; it is measured with the help of the MSE.
PSNR = 10 log10((max(I))^2 / MSE).
SSIM: It is the structural similarity index. It checks the quality of images received through different media. The larger the value of SSIM, the better the quality of the image.
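The PSNR definition above can be checked numerically; a small sketch assuming 8-bit images, so max(I) = 255 (the pixel values below are hypothetical):

```python
import math

def psnr(original, denoised, max_val=255):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    mse = sum((o - d) ** 2 for o, d in zip(original, denoised)) / len(original)
    if mse == 0:
        return float("inf")                # identical images: PSNR is unbounded
    return 10 * math.log10(max_val ** 2 / mse)

orig = [100, 150, 200, 50]
noisy = [110, 140, 210, 40]                # every pixel is off by 10, so MSE = 100
print(round(psnr(orig, noisy), 2))         # → 28.13
```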
5 Conclusion
The endoscope used for the gastrointestinal tract is a tool for capturing images inside the GI tract, but it introduces noise that affects the endoscopist's observation. Because the noise present cannot be modeled globally, deep learning methods achieve good results in suppressing noise while preserving the edge features of the image. This paper proposes a complex-valued CNN-based network module for denoising endoscopic images. The outcomes show that CDNet performs better than methods used in the literature in terms of PSNR and SSIM. This module focuses on the most noise-affected regions and suppresses the noise in dark endoscopic images. The results show that PSNR increases from 42.5 to 45.58 and SSIM increases from 0.98 to 0.99 for a standard deviation of 5. The results for
standard deviation values 15 and 25 are also shown in Tables 1 and 2. The denoising results of the complex-valued CNN on real datasets are better than those of the median filter and the DIP method used in the literature. This paper concludes that our method can enhance image quality by reducing the noise impact on endoscopic images, thus assisting endoscopists in diagnosing the disease. Although the results are good, a limitation of this work is that the method was applied to a small number of images, so it should be extended to a larger dataset. Moreover, other noises present in the images need to be considered.
Acknowledgements The authors would like to thank Professor Sandeep Goyal, Department of
Internal Medicine, Pt. B. D. Sharma University of Health Sciences, Rohtak, for providing us with
Endoscopy images of the Celiac Disease in the small intestine.
Conflict of Interest The authors would like to declare that there is no conflict of interest.
References
1. Ciaccio EJ, Bhagat G, Lewis SK, Green PH (2015) Quantitative image analysis of celiac
disease. World J Gastroenterol 21(9):2577
2. Ianiro G, Gasbarrini A, Cammarota G (2013) Endoscopic tools for the diagnosis and evaluation of celiac disease. World J Gastroenterol 19(46):8562–8570
3. Nisha, Chaudhary P (2020) Comparative analysis of techniques used for detection of celiac disease using various endoscopies. IJARET 11(12):1521–1529
4. Villanacci V, Ciacci C, Salviato T, Leocini G, Reggiani L, Ragazzini T, Limarzi F, Saragoni L (2020) Histopathology of celiac disease. Position statements of the Italian Group of Gastrointestinal Pathologists (GIPAD-SIAPEC). Transl Med @ UniSa 23(6):28–36. ISSN 2239-9747
5. Tian C, Zhang Q, Sun G, Song Z, Li S (2018) FFT consolidated sparse and collaborative
representation for image classification. Arab J Sci Eng 43(2):741–758
6. Guo B, Song K, Dong H, Yan Y, Tu Z, Zhu L (2020) NERNet: noise estimation and removal
network for image denoising. J Vis Commun Image R 71:102851
7. Shi W, Jiang F, Zhang S, Wang R, Zhao D, Zhou H (2019) Hierarchical residual learning for
image denoising. Signal Process Image Commun 76:243–251
8. Zhang J, Luo H, Hui B, Chang Z (2019) Unknown noise removal via sparse representation
model. ISA Trans 94:135–143
9. Zhang L, Li Y, Wang P, Wei W, Xu S, Zhang Y (2019) A separation–aggregation network for
image denoising. Appl Soft Comp J 83:105603
10. Li X, Xiao J, Zhou Y, Ye Y, Lv N, Wang X, Wang S, Gao S (2020) Detail retaining convolutional
neural network for image denoising. J Vis Commun Image R 71:102774
11. Yin H, Gong Y, Qiu G (2020) Fast and efficient implementation of image filtering using a side window convolutional neural network. Signal Process 176:10771
12. Jin L, Zhang W, Ma G, Song E (2019) Learning deep CNNs for impulse noise removal in
images. J Vis Commun Image R 62:193–205
13. Quan Y, Chen Y, Shao Y, Teng H, Xu Y, Ji H (2021) Image denoising using complex-valued
deep CNN. Pattern Recogn 111:107639
14. Dodge S, Karam L (2016) Understanding how image quality affects deep neural networks
15. Nazaré TS, da Costa GB, Contato WA, Ponti M (2018) Deep convolutional neural networks and
noisy images. In: Mendoza M, Velastín S (eds) Progress in pattern recognition, image analysis,
computer vision, and applications. Springer International Publishing, Cham, pp 416–424
16. Koziarski M, Cyganek B (2017) Image recognition with deep neural networks in presence of
noise—dealing with and taking advantage of distortions. Integr Computer Aided Eng 24:1–13
17. Diamond S, Sitzmann V, Boyd S, Wetzstein G, Heide F (2017) Dirty pixels: optimizing image
classification architectures for raw sensor data
18. Dodge SF, Karam LJ (2017) Quality resilient deep neural networks. CoRR abs/1703.08119
19. Yim J, Sohn K-A (2017) Enhancing the performance of convolutional neural networks on
quality degraded datasets
20. Zhang K, Zuo W, Chen Y, Meng D, Zhang L (2017) Beyond a gaussian denoiser: residual
learning of deep cnn for image denoising. IEEE Trans Image Process 26:3142–3155
21. Lehtinen J, Munkberg J, Hasselgren J, Laine S, Karras T, Aittala M, Aila T (2018) Noise2noise:
learning image restoration without clean data. CoRR abs/1803.04189
22. Ning CY, Liu SF, Qu M (2019) Research on removing noise in medical image based on median
filter method. In: IEEE international symposium on information (IT) in medicine and education,
ITME
23. Shruthi B, Renukalatha S, Siddappa M (2015) Speckle noise reduction in ultrasound images—a
review. Int J Eng Res Technol (IJERT) 4(2). ISSN: 2278-0181
24. Zou S, Long M, Wang X, Xie X, Li G, Wang Z (2019) A CNN-based blind denoising method
for endoscopic images. In: IEEE biomedical circuits and systems conference (BioCAS)
FTL-Emo: Federated Transfer Learning
for Privacy Preserved Biomarker-Based
Automatic Emotion Recognition
Akshi Kumar, Aditi Sharma, Ravi Ranjan, and Liangxiu Han
Abstract Advancements in IoT have revolutionized remote patient monitoring; however, privacy is still the major challenge faced by researchers. We put forward a federated learning-based technique to handle the issue of privacy, and to overcome the requirement of a large dataset we employ a transfer learning approach. The federated transfer learning (FTL) model analyzes the electronic health records of the user to detect their emotional state. Emotion is observed by monitoring the physiological changes of the human body, measured using EEG. A convolutional network is used at the server and at each client node in FTL. The model is pre-trained on the publicly available DEAP dataset on a centralized machine and is fine-tuned on the K-EmoCon dataset on each client device, without sharing the data of any subject with the centralized model. Valence and arousal are detected using FTL. On both dimensions, state-of-the-art average F1 scores have been achieved.
Keywords Federated learning · K-EmoCon · EEG · Transfer learning · Emotion recognition · EHI
A. Kumar ·L. Han
Manchester Metropolitan University, Manchester, UK
e-mail: akshi.kumar@mmu.ac.uk
L. Han
e-mail: l.han@mmu.ac.uk
A. Sharma (B)
Delhi Technological University, New Delhi, India
e-mail: Aditisharma9420@gmail.com
Thapar Institute of Engineering and Technology, Patiala, India
R. Ranjan
Netaji Subhas University of Technology, Delhi, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_34
449
450 A. Kumar et al.
1 Introduction
Advances in technology have the potential to influence and shape society. The upsurge in IoMT devices has made remote healthcare a reality. The bio-signals of patients can be monitored from a remote location by medical professionals, making it possible for everyone to have access to healthcare. Remote monitoring has helped the medical community as well: only critical patients need to be kept in hospitals, making room for emergencies and other patients. With advancements in IoT sensors, remote monitoring of patients' health has created a large resource of electronic health records (EHR). EHRs can be accessed by doctors remotely, so a patient can consult various doctors without repeating multiple body tests, as the patient can share their existing EHR with the new doctor. With automated tools such as AI and machine learning, the load on medical professionals can be lightened further, as researchers are working on automatic predictive analytical tools for various diseases [1]. Although the physical intervention of medical professionals will always be required, automatic tools can help narrow down the decision-making process for doctors. Recent studies have shown great results in detecting many physical diseases, such as brain tumors, heart diseases, and cancer, from scans of the human body [2]. Physical health issues are taken more seriously in our society than mental health issues, as the symptoms of the latter are not visible from the outside. But the mental health of a person needs as much care as physical health, if not more, since poor mental or psychological wellbeing can trigger physical health problems as well. Many researchers have put forward automatic predictive tools that can detect stress, depression, and other mental health issues in a person [3]. As a person with a sound psychological state can express their emotions profoundly, researchers have focused on recognizing the emotions of a person to track their psychological wellbeing. A person feeling anger most of the time is more prone to a psychiatric disorder than one having a mix of emotions; similarly, someone who is always sad might be suffering from depression.
Emotions are an inseparable aspect of human intelligence and integral to decision making. Analysis of emotion can help us understand the psychological state of a person. Researchers have proposed many models to effectively recognize the emotional state of a person, but they require personal data of that individual to be monitored regularly; these data could be bio-signals measured using biosensors, or emotions have also been identified from facial expressions and voice [4]. The IEMOCAP, MELD, CASE, and CK+ datasets have been used for facial expressions, and Berlin and LEESD for emotion recognition from speech. Different modalities have been explored by researchers for emotion recognition [5, 6]. Combinations of two or more modalities have been employed on some datasets [7]. Kumar et al. experimented on IEMOCAP and MELD for emotion recognition using facial expressions [8, 9]. Different modalities help researchers recognize emotion more accurately, but sharing private information with anyone is a difficult decision for many people. With increasing cyber-attacks, the leaking of personal information is a fear that surrounds
all. The concern for privacy makes it difficult to have real-time remote psychological health monitoring. To resolve this issue, we employed federated learning to ensure privacy preservation. The structure of federated learning is shown in Fig. 1.
Fig. 1 Federated learning
The federated learning (FL) concept was introduced by Google in 2017 to reduce computational costs by utilizing the computation of mobile devices, making them act as nodes in edge computing [10]. In federated learning, training is performed at the individual client level, and then the weights of the model from each client are shared with the server, where the server collects the weights of all clients and computes a new weight, as shown in Eq. 1.
f_s(w) = (1/K) Σ_{n=1}^{K} f_n(w)    (1)
where f_s(w) represents the weight at the server/centralized model and f_n(w) represents the weight of the client/user models. This new weight calculated by the server is then communicated back to each client, which again trains its own model individually, repeating this process until optimized weights are obtained [10]. This concept was proposed by Google for reducing computation cost, but it serves one more, even more prominent, advantage: it ensures privacy of the data. As training is performed at each client device, no data needs to be shared with the centralized server. In FL, data is not stored at a centralized location, nor is it required to be sent to the server to train the model, making it safe from cyber-attacks.
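The averaging in Eq. 1 can be sketched in a few lines of Python; the client count and two-parameter weight vectors below are purely illustrative:

```python
def aggregate(client_weights):
    """Element-wise average of the clients' weight vectors (Eq. 1):
    f_s(w) = (1/K) * sum over n of f_n(w)."""
    K = len(client_weights)
    n_params = len(client_weights[0])
    return [sum(w[i] for w in client_weights) / K for i in range(n_params)]

# Three hypothetical clients, each holding a two-parameter model.
clients = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
server = aggregate(clients)
print(server)  # [3.0, 4.0]
```

Only the averaged weights travel to the server; the clients' raw data never leaves their devices.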
Even if a cyber-attack happens, the attackers can obtain the data of a single client, not of every client, making the setup more secure [11].
Since its inception, federated learning has been explored by many researchers for studying and optimizing its performance while maintaining privacy and security. The FL approach has been used in many areas, but it is most widely accepted in the medical field, due to the privacy-sensitive data involved in medical diagnosis [12]. Some researchers have used this approach on electronic medical records and on brain tumor data; however, the application of FL on real medical data has still not been sufficiently studied. The FL algorithm is essential in the biomedical field, since not just patients but hospitals too have to share their patients' EHR with other hospitals and doctors. This data sharing induces the risk of cyber-attacks and privacy violations. FL can be used to resolve privacy issues and mitigate the risk of a data breach in clinical information, since transfer and centralization of data are not required [13], although no such clinical study has been conducted so far.
Traditional machine learning and deep learning require a centralized dataset to train a model. In our work we have used a traditional deep learning technique, a convolution network, for training the model, but this training is performed at multiple locations. This protects the patients' privacy and reduces the risk of a data breach. Since the structure of federated learning is not fixed and is still being explored by researchers, it can be implemented as the situation demands. We have modified the structure by incorporating a transfer learning component along with federated learning. In federated learning, data of the client should not be shared with the server, but we can use another existing dataset to pre-train the server model; this does not risk the privacy of the clients and also speeds up the process of training the model.
We have used the publicly available dataset K-EmoCon for creating this scenario. The K-EmoCon dataset was provided by Park et al. in 2020 [14]. The dataset contains audio, video, and bio-signals of 32 subjects.
This research puts forward the model FTL-Emo: a privacy-preserving transfer learning approach for recognition of emotions using EEG. Transfer learning is used to learn from the existing DEAP data. In FTL-Emo, a convolution neural network (CNN) has been employed at both the client and the server side. K-EmoCon entities are annotated with both discrete emotion states and dimensional states, the two types of emotion models [14]. We have taken the dimensional emotion states Arousal and Valence for FTL-Emo; during processing, the emotional state is taken as the one self-reported by the subject.
The primary contributions of the proposed FTL-Emo are:
Federated learning has been used for emotion recognition from EHR.
Utilization of existing data to pre-train the model for accurate emotion identification.
The performance is evaluated for affect dimensions, while maintaining the privacy of the data.
The next section of this paper contains a description of the datasets, followed by the proposed model; Sect. 4 contains a discussion of the results and analysis.
2 Dataset
DEAP: DEAP is a publicly available dataset containing EEG signals of 32 participants [15]. The electroencephalogram and peripheral physiological signals of the 32 participants were recorded as each watched 40 one-minute-long excerpts of music videos. Participants rated each video in terms of the levels of arousal, valence, like/dislike, dominance, and familiarity.
K-EmoCon: A multimodal publicly available dataset for emotion recognition in conversations, provided by Park et al. in 2020 [14]. The dataset contains audio, video, and bio-signals of 32 subjects, who participated in a 10-min debate task on a social issue in teams of 2, while wearing physiological signal measuring devices (Empatica E4, NeuroSky, and Polar H7), along with video cameras for recording the facial expressions and gestures of the participants. Brainwave EEG (125 Hz) and meditation signals were recorded through NeuroSky.
For our study we have used only the data collected by the NeuroSky device, i.e., the EEG signal. Although in K-EmoCon each instance has been annotated by 7 people, we have taken the emotion annotated by the user itself. Park et al. used 20 different types of emotions for annotation, including the dimensional emotional model [14]. In this work we have worked to identify only Arousal and Valence.
3 FTL-Emo: Proposed Model
To ensure privacy protection in emotion identification, a federated learning-based model, FTL-Emo, is proposed. The architecture of FTL-Emo is shown in Fig. 2. Initially, every single user is considered a unique client with their own processing power at the edge, where they train their own model; these trained models' weights are then shared with the centralized server. Twenty-nine input files from the K-EmoCon dataset (only the EEG files) were provided to the proposed model separately at different devices. To have a more effective and accurate model, transfer learning has been used as well. The centralized server was initially trained on the DEAP dataset, and these weights were shared with the client models, which updated them with their own training; this process was continued until the convergence of the model. The experiment was evaluated for affective dimensions only.
The steps included in this whole process are:
Construct an initial server model employing a CNN with the publicly available dataset (DEAP).
Distribute the weights of the server model to each client.
Train each client's model with its own data.
Send the client models' weights to the server model.
Perform weight aggregation at the server.
Distribute the updated server-level weight to each user.
Repeat this procedure with newly arriving data, until the threshold is reached.
Fig. 2 Architecture of FTL-Emo
3.1 Pre-processing
To create time-synchronous data for the proposed work, the 29 participants' EEG signals were mapped with the output (Arousal, Valence). The EEG signals were collected at a sampling rate of 125 kHz and have been down-sampled to 220 Hz utilizing a Savitzky–Golay filter for smoothing. To extract physiological features, the NeuroKit2 toolbox was used to extract the time-domain features of the raw EEG signal with a window size of 4 s (i.e., 4000 steps) and a hop size of 0.5 s (i.e., 500 steps). Then we pad the head and tail of the raw data with neighboring data and combine the above two feature vectors as the physiological feature. In pre-processing, missing values were replaced by the means of the next two consecutive instance values.
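As a rough sketch of this windowing step, the signal can be segmented into overlapping windows using the stated 4000-step window and 500-step hop; the per-window mean below is only a stand-in for the NeuroKit2 time-domain features, and the toy signal is illustrative:

```python
def sliding_windows(signal, win=4000, hop=500):
    """Segment a 1-D signal into overlapping windows (4000-step window,
    500-step hop, as stated in the text) and return a per-window mean
    as a stand-in time-domain feature."""
    feats = []
    for start in range(0, len(signal) - win + 1, hop):
        w = signal[start:start + win]
        feats.append(sum(w) / len(w))
    return feats

sig = [float(i % 10) for i in range(8000)]  # toy periodic "EEG" signal
f = sliding_windows(sig)
print(len(f))  # (8000 - 4000) // 500 + 1 = 9 windows
```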
3.2 Federated Transfer Learning
Transfer learning helps to use existing knowledge for a new application. The federated transfer learning concept was proposed by Liu et al. for a secured two-party privacy-preserving setting whose main purpose was data security, but their empirical analysis has different conditions at client edges [16]. Researchers have also used federated learning for domain adaptation, where they extend domain adaptation to a federated approach for data security [16]. To understand the proposed approach, consider N different users U_1, U_2, …, U_N and their sensor-collected data (EEG) E_1, E_2, …, E_N. The centralized/server model M_S is first trained with the existing dataset (DEAP).
The weights of the model M_S are then frozen after validation, without performing testing, following the validation approach for a deep neural network. These frozen server weights W_g are then shared with all the clients, where each client uses them as the initialization weights for its own training. The process of initial global weight training is given in Algorithm 1.
Algorithm 1 Pre-Training Server Model, M_S
Input: DEAP dataset
Output: W_pre-trained: global model weight after pre-tuning
1. Train the convolution model M_S
2. Validate
3. Freeze the weights of the layers
4. Assign the frozen weights to W_pre-trained
5. Return W_pre-trained
After training the model, each client freezes its weights after each epoch and shares them with the server model M_S, which calculates the average weight using Eq. 1 and assigns the updated weight to each client. This process is repeated with each epoch, until either the model reaches global optimization or the epoch threshold is reached. The process at the local client level is shown in Algorithm 2.
Algorithm 2 Initial Processing on Local Nodes
Input: K-EmoCon, W_pre-trained
Output: W_g: global server weight
1. Share W_pre-trained with each user
2. For each user i:
   a. W_Ui ← W_pre-trained
   b. Train the user model with K-EmoCon
   c. Update the weight W_Ui
   d. Share W_Ui with the server model M_S
3. Aggregate the global weight at the server model:
   f_s(w) = (1/K) Σ_{n=1}^{K} f_n(w)
4. Update the global weight W_g
5. Share W_g with each user
6. W_pre-trained ← W_g
7. Repeat the process
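Algorithms 1 and 2 together form a simple loop: share the server weight, train locally, average, repeat. The toy simulation below illustrates that loop; each client's "training" is a single gradient step on a made-up quadratic objective rather than CNN training, so every numeric choice here is an illustrative assumption (only the 168-round threshold is taken from the text):

```python
def local_step(w, target, lr=0.1):
    """One gradient step on a client's toy loss (w - target)^2 / 2,
    a stand-in for an epoch of local training on K-EmoCon."""
    return w - lr * (w - target)

def federated_rounds(targets, w0=0.0, rounds=168):
    """Share the server weight, train each client locally, then
    average the client weights back at the server (Algorithm 2)."""
    w = w0  # pre-trained server weight (Algorithm 1)
    for _ in range(rounds):
        client_ws = [local_step(w, t) for t in targets]
        w = sum(client_ws) / len(client_ws)  # Eq. 1 aggregation
    return w

# Three hypothetical clients whose local optima are 1, 2, and 3;
# the averaged global model converges toward their mean.
w_final = federated_rounds([1.0, 2.0, 3.0])
print(round(w_final, 3))  # 2.0
```

Note that the server only ever sees weights, never the per-client targets, which is the privacy property the paper relies on.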
3.3 Deep Learning Architecture
On both the server and the client ends we have employed the same model architecture using convolution layers. The architecture of the CNN model is shown in Fig. 3. The model is composed of 5 convolution layers, each with a different kernel size. These convolution layers are followed by 4 pooling layers, 2 fully connected layers, and an output layer with a SoftMax function. Stochastic gradient descent has been used for optimization. A 60:20:20 ratio has been used for training, validation, and test data. We have used a learning rate of 0.01 for every layer, and a batch size of 64 is used as the initialization point, followed by dilution. The threshold was set at 168 epochs.
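To make the layer stack concrete, the shape arithmetic of such a 1-D pipeline can be traced in plain Python. The input length and kernel sizes below are illustrative assumptions (the paper does not list them); only the 5-conv/4-pool/2-FC layout is taken from the text:

```python
def conv_out(length, kernel, stride=1):
    """Output length of a 1-D 'valid' convolution."""
    return (length - kernel) // stride + 1

def pool_out(length, size=2):
    """Output length of a 1-D pooling layer with stride == size."""
    return length // size

# Hypothetical plan: 5 convolution layers with decreasing kernel sizes,
# a pooling layer after each of the first 4, per the described layout.
length = 4000                     # e.g. one 4 s EEG window
kernels = [11, 9, 7, 5, 3]        # assumed kernel sizes, one per conv layer
for i, k in enumerate(kernels):
    length = conv_out(length, k)  # convolution shrinks the sequence
    if i < 4:
        length = pool_out(length)  # pooling after the first 4 convs
print(length)  # 242: flattened size fed to the 2 fully connected layers
```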
The reasons for using this deep learning architecture were to have incremental online batch processing and, additionally, automated feature engineering. Equation 2 gives the computation at the final layer of the deep learning model, where P_Ui gives the probability of attaining a particular affective dimension at user U_i.

SoftMax(P_Ui) = exp(P_Ui) / Σ_i exp(P_Ui)    (2)
The stochastic gradient descent optimizer has been used for all the fully connected layers, and the loss is calculated by categorical cross-entropy as shown in Eq. 3.

Loss = −(y′_i1 log(y_i1) + y′_i2 log(y_i2) + ⋯ + y′_in log(y_in))    (3)

where y_i1, y_i2, …, y_in are the internal node labels and y′_i1, y′_i2, …, y′_in are the output-layer nodes produced by the SoftMax function.
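Equations 2 and 3 can be checked numerically with a small sketch; the class scores and one-hot label are made up for illustration:

```python
import math

def softmax(scores):
    """Eq. 2: normalize raw scores into a probability distribution."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y_true, y_pred):
    """Eq. 3: categorical cross-entropy between a one-hot label
    and the SoftMax outputs."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

# Two affective classes (e.g. low/high arousal); scores are illustrative.
p = softmax([2.0, 0.5])
loss = cross_entropy([1.0, 0.0], p)
print(round(sum(p), 6), round(loss, 4))  # probabilities sum to 1.0
```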
Fig. 3 Deep learning architecture of FTL-Emo
Table 1 Performance of FTL-Emo
Emotion   Average F1-score   Best F1-score   Average accuracy   Best accuracy
Arousal   86.8               89.03           88.5               92.7
Valence   88.4               94.1            87.3               91.5
Overall   87.9               93.8            88.1               91.8
4 Result and Discussion
The proposed model was executed twenty-nine times for testing, once at each individual client machine, to evaluate the performance of the model for each client. For each execution the proposed model works exactly the same; the executions are separate only to ensure the privacy of the data.
For performance evaluation, accuracy and F1-score have been used. EEG was recorded for only 10 min for each of the 32 participants, generating approximately 320 min of data in total, but only 29 participants' data was taken for model execution. Although the size of the dataset is limited, pre-training the model on the DEAP dataset of 32 users and training the model over multiple epochs resulted in very good performance. For training the convolution model, a batch size of 64 was used for one epoch at each communication round, with a learning rate of 0.0001 using stochastic gradient descent. The model termination condition was set as either model convergence or hitting the threshold of 168 epochs.
4.1 Result
The proposed work has twenty-nine nodes and one centralized server, each executed separately in a synchronous manner, combining the weights. The performance evaluation of the proposed model for both output categories, Arousal and Valence, is shown in Table 1. For Arousal the highest F1-score obtained was 89.03, and the average F1-score over all 29 devices was 86.8. For Valence the highest F1-score obtained was 94.1, and the average obtained is 88.4. FTL-Emo has obtained high accuracy as well, while maintaining the privacy of the data.
4.2 Discussion
The proposed transfer learning-based federated learning approach has achieved state-of-the-art results for both emotion dimensions. A comparison with an exactly similar approach is not possible, since this is the first work that incorporates a privacy-preserving approach on the K-EmoCon dataset. We have therefore compared FTL-Emo with a simple deep learning-based model to show the performance difference, and the model has also been compared with other existing models on K-EmoCon, although the data signals taken by them are different [16–18]. The proposed work, while maintaining privacy, has obtained highly accurate results; the use of the transfer learning approach has improved the performance considerably, as can be seen from the results (Table 2). This model can be employed on real-time data collected through IoT sensors that can record the bio-signals of a person [20], not the traditional IoT sensors that were used for electronic appliances [19].

Table 2 Comparison of FTL-Emo
Model         Average F1-score   Average accuracy
Sig-Rep       58.9               71.3
LSTM          72.5               67.4
RNN           76.8               71.5
CNN (DREAM)   73.41              69.78
FTL-Emo       87.9               88.1

Fig. 4 ROC curve of U7 and U23
The ROC curves of two different subjects, formed during evaluation of the models, are shown in Fig. 4.
For accurate emotion recognition while ensuring the privacy of the user in maintaining EHR and sharing it with other doctors or hospitals, FTL-Emo has provided great results. The FTL-Emo approach can act as a baseline for future studies on real-time privacy-preserving emotion recognition. It can be used for various real-time activities such as automatic chatbots and HCI in smart industries and schools. In our proposed work, we have not addressed the issue of non-independent identically distributed (non-iid) data. As each individual reacts differently to each scenario, their emotional triggers could also be different; therefore, a single model for every subject may not produce accurate results for all. To handle such scenarios, researchers could incorporate the concept of similarity, clustering the subjects on the basis of their similarity and generating a model for each cluster rather than one single model. As future scope, researchers could also address the cold-start problem to handle real-world scenarios where a patient's data is not available at the start of the model training process.
5 Conclusion
Accurate human emotion recognition can enhance the effectiveness and adeptness of remote healthcare practices. But ensuring the privacy of user data is as important as attaining a highly accurate prediction of the user's emotional state. A transfer learning-based federated learning approach can pave a way to achieve high accuracy and privacy preservation at the same time. We proposed a privacy-preserving emotion recognition model using federated learning on the K-EmoCon dataset, while pre-training the model on the DEAP dataset. The model has achieved highly accurate results. No other work has ensured privacy protection for emotion recognition on this dataset. The performance of the model was evaluated on accuracy and F1-score, and on both measures FTL-Emo has produced state-of-the-art results. To improve the model further, researchers can embed other modalities, such as audio and video signals, for more accurate emotion recognition.
Funding This research is funded by CfACS seed project funding 2022–2023, Manchester
Metropolitan University, UK.
References
1. Liu Y, Kang Y, Xing C, Chen T, Yang Q (2020) A secure federated transfer learning framework.
IEEE Intell Syst 35(4):70–82
2. Chen Y, Qin X, Wang J, Yu C, Gao W (2020) Fedhealth: a federated transfer learning framework
for wearable healthcare. IEEE Intell Syst 35(4):83–93
3. Kumar A, Sharma K, Sharma A (2021) Hierarchical deep neural network for mental stress
state detection using IoT based biomarkers. Pattern Recogn Lett 145:81–87
4. Jing Q, Wang W, Zhang J, Tian H, Chen K (2019) Quantifying the performance of federated
transfer learning. arXiv preprint arXiv:1912.12795
5. Gupta P, Balaji SA, Jain S, Yadav RK (2022) Emotion recognition during social interactions
using peripheral physiological signals. In: Computer networks and inventive communication
technologies. Springer, Singapore, pp 99–112
6. Guhn A, Merkel L, Hübner L, Dziobek I, Sterzer P, Köhler S (2020) Understanding versus
feeling the emotions of others: how persistent and recurrent depression affect empathy. J
Psychiatr Res 130:120–127
7. Pfitzner B, Steckhan N, Arnrich B (2021) Federated learning in a medical context: a systematic
literature review. ACM Trans Internet Technol (TOIT) 21(2):1–31
8. Kumar A, Sharma K, Sharma A (2022) MEmoR: a multimodal emotion recognition using
affective biomarkers for smart prediction of emotional health for people analytics in smart
industries. Image Vis Comput 123:104483
9. Sharma A, Sharma K, Kumar A (2022) Real-time emotional health detection using fine-tuned
transfer networks with multimodal fusion. In: Neural computing and applications, pp 1–14
10. Kumar A, Sharma K, Sharma A (2021) Genetically optimized Fuzzy C-means data clustering
of IoMT-based biomarkers for fast affective state recognition in intelligent edge analytics. Appl
Soft Comput 109:107525
11. Ju C, Gao D, Mane R, Tan B, Liu Y, Guan C (2020) Federated transfer learning for EEG
signal classification. In: 2020 42nd Annual international conference of the IEEE engineering
in medicine & biology society (EMBC). IEEE, pp 3040–3045
12. Li L, Fan Y, Tse M, Lin KY (2020) A review of applications in federated learning. Comput
Ind Eng 149:106854
13. Li T, Sahu AK, Talwalkar A, Smith V (2020) Federated learning: challenges, methods, and
future directions. IEEE Sig Process Mag 37(3):50–60
14. Park CY, Cha N, Kang S, Kim A, Khandoker AH, Hadjileontiadis L, Oh A, Jeong Y, Lee
U (2020) K-EmoCon, a multimodal sensor dataset for continuous emotion recognition in
naturalistic conversations. Sci Data 7(1):1–16
15. Koelstra S, Muhl C, Soleymani M, Lee J-S, Yazdani A, Ebrahimi T, Pun T, Nijholt A, Patras I
(2011) Deap: a database for emotion analysis; using physiological signals. IEEE Trans Affect
Comput 3(1):18–31
16. Dissanayake V, Seneviratne S, Rana R, Wen E, Kaluarachchi T, Nanayakkara S (2022) SigRep:
toward robust wearable emotion recognition with contrastive representation learning. IEEE
Access 10:18105–18120
17. Alskafi FA, Khandoker AH, Jelinek HF (2021) A comparative study of arousal and valence
dimensional variations for emotion recognition using peripheral physiological signals acquired
from wearable sensors. In: 2021 43rd Annual international conference of the IEEE engineering
in medicine & biology society (EMBC). IEEE, pp 1104–1107
18. Wang S, Wang J, Wang X, Qiu T, Yuan Y, Ouyang L, Guo Y, Wang F-Y (2018) Blockchain-
powered parallel healthcare systems based on the ACP approach. IEEE Trans Comput Soc Syst
5(4):942–950
19. Ranjan R, Sharma A (2020) Voice-controlled IoT devices framework for smart home. In:
Proceedings of first international conference on computing, communications, and cyber-
security (IC4S 2019). Springer Singapore, pp 57–67
20. Wei J, Yang X, Dong Y (2021) Time-dependent body gesture representation for video emotion
recognition. In: International conference on multimedia modeling. Springer, Cham, pp 403–416
Content Analysis of Twitter
Conversations Associated
with Turkey–Syria Earthquakes
Harkiran Kaur, Harishankar Kumar, and Abhinandan Singla
Abstract This research study performs a content analysis of Twitter discussions associated with the earthquakes that occurred in Turkey and Syria in 2023. The authors investigated the main themes and topics of the discussions expressed in the tweets. A dataset of tweets related to this topic was collected using relevant hashtags and keywords. The obtained data was analyzed using both manual and automated state-of-the-art methods. The main finding of this study is that the most common themes in the tweets were expressions of sympathy and solidarity, calls for help and financial support, and news updates about the earthquakes. As per this study, Twitter offers a worthwhile forum for people to express their state of mind and responses to natural catastrophes, and it may be used to disseminate information and assemble assistance in times of need.
Keywords Twitter data analytics · Content analysis · Turkey–Syria earthquake · Topic modeling · Topic detection
1 Introduction
Twitter has become a significant hub of news and current-events information and is one of the most popular social media platforms. Users may voice their ideas and share news about current events on this platform, which creates tons of data every day. Machine learning techniques are needed to analyze and glean relevant insights from the enormous volume of data collected on Twitter every day. Twitter has grown to be a significant venue for sharing news and viewpoints as a result of the boom in
H. Kaur (B)·H. Kumar ·A. Singla
Department of Computer Science and Engineering, Thapar Institute of Engineering and
Technology, Patiala 147001, Punjab, India
e-mail: harkiran.kaur@thapar.edu
H. Kumar
e-mail: hkumar1_be20@thapar.edu
A. Singla
e-mail: asingla50_be21@thapar.edu
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_35
461
462 H. Kaur et al.
social media use. This information offers a rich source that may be utilized for a
number of tasks, including sentiment analysis, subject modeling, and user profiling.
As a result, studying tweets has become a crucial technique for figuring out how
the general public feels about current events. Machine learning has been a prevalent
method for doing content analysis of tweets in the context of current affairs in recent
years. In recent years, machine learning has emerged as a popular technique for con-
ducting a content analysis of tweets in the context of current affairs. The research
in Twitter data analysis using machine learning has been ongoing for several years.
Recent studies have demonstrated the effectiveness of machine learning algorithms
in various tasks such as topic modeling, sentiment analysis, and user profiling. Topic
modeling is a popular machine learning task that aims to identify latent topics in a
given text corpus. Research in the said subject utilizes topic modeling techniques,
such as Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization
(NMF) [1], to identify topics in Twitter data. Sentiment analysis aims to determine
the emotional tone of a given text. Researchers have applied various machine learn-
ing algorithms, such as support vector machines (SVM), decision trees, and neural
networks, deep learning models [2], for sentiment analysis of Twitter data. User
profiling aims to extract demographic and psycho-graphic information about users
from their social media data. Researchers have applied machine learning techniques
such as clustering [3], association rule mining, and regression for user profiling on
Twitter data. This paper is organized into several sections, the literature review sum-
marizes recent studies and developments that applied machine learning techniques,
highlighting the state of research in this area and their use for different purposes.
The methodology section outlines the methods used in this study, which include data
collection, data pre-processing, and analysis. The results section presents the study’s
findings, including identified topics, and classified topics for some of the tweets. The
discussion section provides interpretation and analysis of the results, discusses the
significance and limitations of the study, and suggests directions for future research.
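Of the two topic-modeling techniques mentioned, NMF is simple enough to sketch from scratch. The following is a minimal Lee–Seung multiplicative-update implementation on a made-up 4×4 term-document matrix; a real analysis would use a library implementation on an actual tweet corpus, so every value here is illustrative:

```python
import random

def nmf(V, k, iters=500, seed=0):
    """Factor a non-negative matrix V (docs x terms) into W (docs x k)
    and H (k x terms) with Lee-Seung multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        WH = [[sum(W[i][t] * H[t][j] for t in range(k)) for j in range(m)]
              for i in range(n)]
        # H <- H * (W^T V) / (W^T W H)
        for t in range(k):
            for j in range(m):
                num = sum(W[i][t] * V[i][j] for i in range(n))
                den = sum(W[i][t] * WH[i][j] for i in range(n)) + 1e-9
                H[t][j] *= num / den
        WH = [[sum(W[i][t] * H[t][j] for t in range(k)) for j in range(m)]
              for i in range(n)]
        # W <- W * (V H^T) / (W H H^T)
        for i in range(n):
            for t in range(k):
                num = sum(V[i][j] * H[t][j] for j in range(m))
                den = sum(WH[i][j] * H[t][j] for j in range(m)) + 1e-9
                W[i][t] *= num / den
    return W, H

# Toy term-document counts with two clear "topics"
# (terms 0-1 in docs 0-1, terms 2-3 in docs 2-3).
V = [[3, 3, 0, 0], [2, 2, 0, 0], [0, 0, 3, 3], [0, 0, 2, 2]]
W, H = nmf(V, k=2)
err = sum((V[i][j] - sum(W[i][t] * H[t][j] for t in range(2))) ** 2
          for i in range(4) for j in range(4))
print(round(err, 3))  # reconstruction error; near zero for this rank-2 matrix
```

In practice the rows of H are inspected to read off the top terms of each discovered topic.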
2 Literature Review
This literature review aims to discuss recent studies that have applied machine learn-
ing techniques for Twitter data analysis in the context of current affairs and provide
insight into the current state of research. For instance, Hsu et al. (2023) in [1] proposed
a method based on NMF for identifying topics in Twitter data. The proposed method
outperformed other state-of-the-art topic modeling methods. Song et al. (2023) in
[4] proposed a method based on LDA for topic modeling of Twitter data related to
the US presidential election. The proposed method identified relevant topics such
as candidates’ policies, voting patterns, and election predictions. A strategy based
on LDA for topic modeling of Twitter data relating to the US-China trade war was
proposed by Jiang et al. (2022) in [5]. The suggested approach highlighted perti-
nent subjects including tariffs, negotiations, and market effects. An approach based
on NMF for topic modeling of Twitter data relevant to the COVID-19 pandemic
was proposed by Chen et al. in 2021 [6]. The suggested approach located pertinent
subjects including vaccine development, case counts, and government regulations.
Zhou et al. (2020) [7] proposed a method based on LDA for topic modeling of Twit-
ter data related to the Hong Kong protests. The proposed method identified relevant
topics such as police brutality, democracy, and human rights. One study by Lee et al.
(2022) in [8] used a machine learning model to predict the outcome of the US pres-
idential election based on Twitter data. The study discovered that the model could
predict the election result with a high degree of accuracy, pointing to the possibility
of utilizing Twitter data to forecast election outcomes. A deep learning model was
suggested for analyzing Twitter data pertaining to the Hong Kong demonstrations
in a paper by Chen et al. (2022) in [9]. The research discovered that the suggested
methodology could pinpoint important conversation points and keep track of how
the general population felt about the demonstrations. Machine learning techniques
were utilized by Das et al. (2022) in [10] to examine Twitter data pertaining to the
Black Lives Matter movement. According to the study, Twitter data may be utilized
to track public attitude toward the movement and distinguish between various points
of view on it. Goharian et al.’s paper from 2022 [11] used machine learning methods
to examine Twitter data collected during the COVID-19 epidemic. According to the
study, Twitter data may be utilized to track public opinion toward the epidemic and
spot newly developing pandemic-related issues. Yildirim et al. (2022) [12] utilized
machine learning techniques to examine Twitter data pertaining to the Syrian crisis.
According to the study, Twitter data may be used to track public opinion about the
conflict and spot newly developing themes connected to it. Machine
learning techniques were used to analyze Twitter data during the Black Lives Matter
protests in a different research by Zhu et al. (2021) in [13]. The study discovered
that Twitter data may be utilized to pinpoint the main conversation points and track
public opinion regarding the demonstrations. In a research by Mocanu et al. (2021) in
[14], the authors examined Twitter data pertaining to the COVID-19 epidemic using
machine learning techniques. According to the study, Twitter data may be utilized
to track public opinion toward the epidemic and spot newly developing pandemic-
related issues. Machine learning methods were utilized by Liu et al. (2021) [15] to
examine Twitter data associated with the US presidential election. The study discovered
that Twitter data might be used to forecast election results with a high degree of
accuracy. Yayla et al.'s
work from 2021, published in [16], suggested a machine learning-based method for
identifying hate speech on Twitter. The research revealed that the suggested method
might effectively identify hate speech, monitor it, and take action against it on the
platform.
3 Materials and Methods
The authors followed the seven-step model below to conduct the content analysis
of Twitter posts related to the Turkey–Syria earthquakes.
464 H. Kaur et al.
Fig. 1 Number of tweets posted date-wise for Turkey–Syria earthquakes
3.1 Data Collection
For this research, the authors used the Twitter dataset by querying Twitter for key-
words TurkeySyriaEarthquakes, Turkey earthquake, Syria earthquake, Turkey–Syria
earthquake, and hashtags #TurkeySyriaEarthquake2023, #turkey #earthquake, and
#syria #earthquake, using a web scraping technique. The authors retrieved
1,301,159 tweets posted over three weeks (February 06, 2023, until February 27,
2023), as presented in Fig. 1.
3.1.1 Data Pre-processing
The collected data has been cleansed by removing non-English tweets and duplicate
tweets, and by removing stop words, punctuation, URLs, user mentions, and hashtags.
The tweets have also been tokenized into phrases and converted to lowercase.
The dataset includes 1,301,159 raw tweets; during this pre-processing phase,
853,026 duplicate tweets and 70,106 non-English tweets were removed, leaving
378,027 processed tweets for the remaining steps (Table 1).
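The cleaning steps above can be sketched in a few lines. This is a minimal, stdlib-only illustration; the stop-word list, function names, and the omission of lemmatization are assumptions for illustration, not the authors' code:

```python
import re

# Illustrative stop-word list; a real pipeline would use a full list
# (e.g., NLTK's) and a lemmatizer on top of these steps.
STOP_WORDS = {"the", "a", "an", "in", "on", "of", "to", "did", "before",
              "and", "is", "at", "how", "for", "via", "from", "has"}

def preprocess(tweet: str) -> str:
    """Lowercase, strip URLs/mentions/hashtags/punctuation, drop stop words."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"[@#]\w+", " ", text)        # remove mentions and hashtags
    text = re.sub(r"[^a-z\s]", " ", text)       # remove punctuation and digits
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

def clean_corpus(tweets):
    """Deduplicate and preprocess a list of raw tweets, preserving order."""
    seen, out = set(), []
    for t in tweets:
        p = preprocess(t)
        if p and p not in seen:
            seen.add(p)
            out.append(p)
    return out
```

On the first sample tweet in Table 1, `preprocess` yields "western ambassadors evacuate turkey earthquake" (the table's version is additionally lemmatized to the singular "ambassador").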
Table 1 Pre-processing of tweets—sample data

Text: "Did Western Ambassadors evacuate before Turkey Earthquake?! https://t.co/wMdSmbqLFF" → Processed: "Western ambassador evacuate turkey earthquake"
Text: "Turkey-Syria earthquake death toll surpasses 50,000 https://t.co/6lMakpPZDx via @FRANCE24 #TurkeySyriaEarthquake2023" → Processed: "Turkey Syria earthquake death toll surpasses via"
Text: "In the two week deadly earthquakes hit southern Turkey and northern Syria, the focus has shifted from rescue to rehabilitation. https://t.co/nGLIzeWttT" → Processed: "Two week since deadly earthquake hit southern turkey northern syria focus shifted rescue rehabilitation"
Text: "#OutlookDecazine A journalist recalls his own, and other people's, experiences in #earthquake rescue operations in #Turkey Yusuf Erim #TurkeyQuake Read more at: https://t.co/OeH30zqd5N" → Processed: "Journalist recall people experience rescue operation yusuf erim read"
Text: "In the two week since deadly earthquakes hit southern Turkey and northern Syria, the focus has shifted from rescue to rehabilitation. https://t.co/nGLIzeWttT" → Processed: "two week since deadly earthquake hit southern turkey northern syria focus shifted rescue rehabilitation"
Text: "Syria-Turkey earthquake: how to help https://t.co/CUHMEWbKci https://t.co/EtVYpSBrRX" → Processed: "Syria Turkey earthquake help"
Text: "Earthquake Unveils Turkey's many ugly faces https://t.co/WIWI5e88h5" → Processed: "Earthquake unveils turkey many ugly face"
3.1.2 Exploratory Data Analysis
Outlier detection has been performed on this set of tweets; 11,830 tweets were
detected as outliers and removed. The tweet data has been analyzed using
descriptive statistics and word frequency analysis to understand the characteristics
of the data and to perform the content analysis of the tweets retrieved for the
Turkey–Syria earthquakes. This paper uses a word cloud for word frequency analysis.
In Fig. 2, words are displayed in sizes proportional to their number of occurrences
in the tweets: words such as death, magnitude, building, support, and rescue have
a higher frequency than the words presented in a smaller size.
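The frequency analysis behind the word cloud reduces to a token count. A minimal sketch, assuming the tweets are already preprocessed (the function name and `top_n` parameter are illustrative):

```python
from collections import Counter

def word_frequencies(processed_tweets, top_n=5):
    """Count word occurrences across processed tweets; a word cloud renders
    each word at a size proportional to these counts."""
    counts = Counter()
    for tweet in processed_tweets:
        counts.update(tweet.split())
    return counts.most_common(top_n)
```

Libraries such as `wordcloud` accept exactly such a frequency mapping to produce figures like Fig. 2.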
4 Topic Modeling
The authors have utilized unsupervised machine learning techniques such as Latent
Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), Non-negative Matrix
Factorization (NMF), and Short Text Topic Modeling (STTM) to identify the latent
Fig. 2 Word cloud for all the tweets retrieved for Turkey–Syria Earthquake
topics within the data. These topics have been represented as a set of related words
or phrases.
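Off-the-shelf implementations of these techniques exist (e.g., in scikit-learn and gensim). Purely as an illustration of one of the four, the sketch below implements NMF with the classical Lee–Seung multiplicative updates in plain NumPy; the iteration count, seed, and function signature are assumptions for illustration, not the configuration used in the paper:

```python
import numpy as np

def nmf_topics(V, k, iters=200, eps=1e-9, seed=0):
    """Minimal NMF via Lee-Seung multiplicative updates (Frobenius loss).
    V: nonnegative document-term matrix (docs x terms).
    Returns W (docs x k document-topic weights) and
            H (k x terms topic-word loadings), so that V ~ W @ H."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update topic-word loadings
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update document-topic weights
    return W, H
```

Each row of H can then be read off as a topic: its largest entries are the topic's top terms, which is how the "set of related words or phrases" per topic is obtained.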
5 Topic Interpretation
The identified topics have been interpreted by analyzing the top phrases associated
with each topic. This step assists in understanding the underlying themes and patterns
in the data retrieved for the said subject.
6 Implementing Ensemble Voting Classifier
The authors observed that the multiple topic modeling models have different
strengths and weaknesses; to leverage the strengths of each model and overcome
its weaknesses, an ensemble voting classifier was used. By combining the
predictions of multiple models, the ensemble voting classifier generates a more
accurate and robust prediction. These topics were then manually verified by the
authors. Table 2 presents the tweets, the topics detected for these tweets using
state-of-the-art models, and their comparison with the proposed ensemble voting
classifier.
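A hard-voting ensemble over the four topic models can be sketched as follows. The paper does not specify its tie-breaking rule, so this sketch falls back to a hypothetical model-priority order when no label wins a clear majority:

```python
from collections import Counter

# Hypothetical priority order, used only to break ties; it is an assumption,
# not a rule stated in the paper.
MODEL_ORDER = ["LDA", "LSA", "NMF", "STTM"]

def ensemble_vote(predictions):
    """predictions: dict mapping model name -> predicted topic label.
    Returns the majority label; ties are broken by MODEL_ORDER."""
    counts = Counter(predictions.values())
    best = max(counts.values())
    winners = {label for label, c in counts.items() if c == best}
    if len(winners) == 1:
        return winners.pop()
    for model in MODEL_ORDER:          # tie-break by model priority
        if predictions[model] in winners:
            return predictions[model]
```

For example, when three of the four models agree (as in the "Syria–Turkey earthquake help" row of Table 2), the majority label wins outright.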
Table 2 Comparative analysis of topics detected using state-of-the-art models and proposed ensemble voting classifier

Processed tweet text | LDA | LSA | NMF | STTM | Proposed ensemble voting classifier
Western ambassador evacuate Turkey earthquake | Damage and magnitude | Dead/alive news | Information about earthquake | Opinion | Damage and magnitude
Turkey–Syria earthquake death toll surpasses via | Damage and magnitude | Information about earthquake | Damage and magnitude | Damage and magnitude | Damage and magnitude
Earthquake death toll surpasses Turkey | Damage and magnitude | Information about earthquake | Damage and magnitude | Damage and magnitude | Damage and magnitude
Two week since deadly earthquake hit southern Turkey northern Syria focus shifted rescue rehabilitation | Aid, help, and relief | Dead/alive news | Damage and magnitude | Damage and magnitude | Damage and magnitude
Journalist recall people experience rescue operation yusuf erim read | Opinion | Aid, help, and relief | Dead/alive news | Aid, help, and relief | Aid, help, and relief
Strike info | Prayer and hope | Damage and magnitude | Aid, help, and relief | Information about earthquake | Prayer and hope
Syria–Turkey earthquake help | Aid, help, and relief | Dead/alive news | Support & sympathy from people | Support & sympathy from people | Support & sympathy from people
Earthquake unveils Turkey many ugly face | Support & sympathy from people | Dead/alive news | Information about earthquake | Political views | Support & sympathy from people
Fig. 3 Ensemble voting classifier results for Turkey–Syria earthquakes
7 Visualization
The topic results obtained through the aforementioned steps have been visualized
using tools such as word clouds and bar charts to gain insights from the data.
Figure 3 shows the topic category-wise frequency and the usage of certain words
in the tweets after applying the ensemble voting classifier.
8 Results and Discussions
This research study aims to analyze the tweet content posted for the Turkey–Syria
earthquakes. For this, the authors collected the tweets, preprocessed them using
various filtration functions, and then classified these discussions into their respective
topics using state-of-the-art topic modeling techniques and an ensemble voting
classifier applied to these models. Figure 4 represents the topic-wise distribution of
tweets after applying ensemble topic modeling. As a result of this classification, it
has been observed that there were 68,270 tweets related to "Aid, Help, and Relief",
109,452 tweets discussing the "Damage and Magnitude" of the earthquake, 19,482
tweets about "Dead/Alive News", 20,678 tweets providing "Information about
earthquake", 11,589 tweets highlighting the "International Views" about this incident,
32,733 tweets presenting the "Opinion" shared by users, 44,234 tweets providing
"Political Views", 27,468 tweets of "Prayer and Hope", 32,291 tweets classified as
"Support & Sympathy from People", and 11,830 tweets that were Unknown/Outliers.
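As a quick consistency check, the topic counts reported above can be converted into percentage shares; their sum recovers the 378,027 processed tweets:

```python
# Topic counts as reported in the results; shares are percentages of the
# processed corpus, rounded to one decimal place.
topic_counts = {
    "Aid, Help, and Relief": 68270,
    "Damage and Magnitude": 109452,
    "Dead/Alive News": 19482,
    "Information about earthquake": 20678,
    "International Views": 11589,
    "Opinion": 32733,
    "Political Views": 44234,
    "Prayer and Hope": 27468,
    "Support & Sympathy from People": 32291,
    "Unknown/Outliers": 11830,
}

total = sum(topic_counts.values())   # 378,027 processed tweets
shares = {t: round(100 * n / total, 1) for t, n in topic_counts.items()}
```

The largest category, "Damage and Magnitude", accounts for roughly 29% of all processed tweets.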
Fig. 4 Topic-wise distribution of tweets after ensemble topic modeling
Fig. 5 Word cloud providing insights into topic-wise themes for Turkey–Syria earthquakes
This section describes the results of applying the ensemble voting classifier to the
collection of tweets to identify topic categories. Figure 5 presents a word cloud of
the identified topic categories. A word cloud is a visual representation of the words
frequently used in a given text or collection of texts; here, it represents the words
used within each identified category. Used in this context, a word cloud provides a
quick overview of the identified topic categories: by highlighting the most often-used
terms, it gives an idea of the general themes and trends present in the collection of
tweets. The study was effective in classifying tweets, and the word cloud gives a
helpful overview of the results, which could also help in understanding how Twitter
users feel about certain issues and act in certain ways. This research has the potential
to benefit society in several ways. By analyzing tweets, researchers can gain insight
into public opinion, and decision-makers can use these insights to inform their actions
and policies, potentially leading to better outcomes for society; for example, by
identifying the main themes and topics for the Turkey–Syria earthquakes, emergency
response teams and policymakers can develop effective communication strategies
and mobilize support during crises. The study can also be used to disseminate
information and facilitate public engagement, which can help increase awareness of
and preparedness for future earthquakes. Overall, this work has the potential to
contribute to more effective disaster response and recovery efforts, ultimately
benefiting society as a whole.
Finally, it should be noted that using machine learning algorithms for Twitter data
analysis is not without challenges and limitations. Overfitting can mean that a model
does not generalize well to new data, and there are also issues of potential bias,
privacy concerns, and data ownership that researchers and practitioners must address.
It is therefore important to address these challenges and limitations by using
appropriate methods for model evaluation and selection, adopting transparent and
accountable data practices, and acknowledging and mitigating potential biases in
the data.
9 Conclusion
Twitter has grown to be a significant venue for sharing news and viewpoints as a
result of the boom in social media use. As a result, examining tweets has grown in
importance as a method for figuring out what the general population thinks and feels
about current events. Machine learning has been a popular method for undertaking
content analysis of tweets in recent years. In this literature review, the authors looked at
recent research that analyzed tweets about current events using machine learning. In
conclusion, topic modeling approaches have become a crucial tool for scholars when
analyzing Twitter data pertaining to current affairs. These studies offer insightful
information on the general public’s viewpoint on a range of significant problems by
identifying major subjects and attitudes stated by users. The analysis of this unique
source of real-time information utilizing topic modeling approaches is crucial as
Twitter’s popularity continues to rise.
References
1. Hsu CH, Liu HC, Chen ALP, Lai MK (2023) Non-negative matrix factorization for topic
modeling on twitter data. IEEE Trans Knowl Data Eng
2. Liu Y, Li H, Sun M (2023) Deep learning-based sentiment analysis of twitter data. IEEE Trans
Affective Comput
3. Zhang Y, Chen Y, Liu Z, Chen W (2023) A clustering-based method for user profiling on twitter
data. IEEE Trans Comput Soc Syst
4. Song M, Wu L, Zhang W (2023) Topic modeling of twitter data related to the US presidential
election using LDA. IEEE Trans Big Data
5. Jiang T, Sun J, Wang X (2022) Topic modeling of twitter data related to the US-China trade
war using LDA. IEEE Access
6. Chen Y, Wang X, Yu C (2021) Non-negative matrix factorization for topic modeling of Twitter
data related to the COVID-19 pandemic. Health Inf J
7. Zhou X, Liu B, Xiang X, Wu S, Zha H (2020) Topic modeling for Twitter data related to Hong
Kong protests. Inf Process Manage
8. Lee YH, Zhang Y, Kim JH (2022) Predicting election results using twitter data: a machine
learning approach. In: Proceedings of the 2022 IEEE international conference on computational
intelligence and virtual environments for measurement systems and applications, pp 109–114
9. Chen Y, Fu J, Xiao J (2022) Deep learning model for analyzing twitter data on the Hong
Kong protests. In: Proceedings of the 2022 IEEE international conference on data mining, pp
1209–1214
10. Das S, Vaddadi S, Chakraborty T (2022) Analyzing twitter data on black lives matter: a machine
learning approach. In: Proceedings of the 2022 IEEE international conference on big data, pp
2666–2671
11. Goharian N, Boussaid O, Srinivasan P (2022) Analyzing twitter data during the COVID-19
pandemic: a machine learning approach. In: Proceedings of the 2022 IEEE international
conference on healthcare informatics, pp 1–6
12. Yildirim Y, Bayram G, Akcay O (2022) Twitter data analysis of the Syrian conflict: a machine
learning approach. In: Proceedings of the 2022 IEEE international conference on communica-
tions, pp 1–6
13. Zhu X, Liu S, Gao F (2021) Analyzing twitter data during the black lives matter protests: a
machine learning approach. In: Proceedings of the 2021 IEEE international conference on data
mining, pp 1209–1214
14. Mocanu D, Perra N, Gonçalves B (2021) Monitoring the COVID-19 pandemic in real time
using twitter data analysis and machine learning. In: Proceedings of the 2021 IEEE international
conference on big data, pp 2518–2525
15. Liu S, Zhu X, Gao F (2021) Predicting US presidential election results using twitter data: a
machine learning approach. In: Proceedings of the 2021 IEEE international conference on big
data, pp 2065–2070
16. Yayla E, Altun Y, Yildiz H (2021) A machine learning-based approach for detecting hate
speech on twitter. In: Proceedings of the 2021 IEEE international conference on big data, pp
2096–2101
Transition from Traditional Insurance
Sector to InsurTech: Systematic Analysis
and Future Research Directions
Tamanna Kewal and Charu Saxena
Abstract InsurTech, which takes its cues from the more well-established idea of
"FinTech," is the term used to describe the use of technology to increase efficiency
and savings in underwriting, risk pooling, and claims management within the present
insurance paradigm. A survey of the scientific literature on InsurTech is included
in this study. This review paper starts with an overview of the journey from Insur-
ance 1.0 to Insurance 4.0 and concludes with an analysis of articles chosen to find
InsurTech’s emerging themes. This research comprises 47 Scopus articles, which
have been analyzed to identify themes in InsurTech research. Thematic analysis has
aided in the identification of significant research clusters in InsurTech research. There
are eight themes highlighted: InsurTech and the technologies behind it, risk manage-
ment, performance evaluation, insurer adoption, insured adoption, personalization of
insurance, P2P insurance, and legal, ethical, and regulatory issues in InsurTech. The
number of studies in this area has risen only in the last two years, i.e., post-pandemic.
The most popular research topic was the technologies supporting the emergence of
InsurTech. This study aims to enhance an understanding of insurance technology
advancements and associated topics.
Keywords InsurTech · Blockchain · Artificial intelligence · IoT · Big data ·
Insurance 4.0 · Insurance industry
1 Introduction
Digitization has created new products and processes across all industries which have
benefited both the provider and receiver [1]. The insurance sector has transformed into
InsurTech due to the presence of technologies like artificial intelligence, blockchain,
T. Kewal (B) · C. Saxena
University School of Business, Chandigarh University, Mohali, Punjab, India
e-mail: tamannakewal04@gmail.com
C. Saxena
e-mail: charu.e8966@cumail.in
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_36
473
474 T. Kewal and C. Saxena
internet of things, big data, and cloud computing [2, 3]. InsurTech has emerged from
"fintech" and refers to technology-based insurance solutions across the value chain.
Fintech has five categories, namely mobile payments and transfers, deposits and
investments, budget and financial planning, insurance, and borrowing [4]. Unlike
the other categories, InsurTech has gained the interest of researchers only in the last
few years, which explains the limited number of articles in this domain. As
a result, both incumbents and new market entrants have a huge opportunity to use
information technology to revolutionize the traditional insurance sector [5]. In this
context, there has been a surge in the number of InsurTech start-up firms that rely on
accessibility, customization, and customer satisfaction to reach a wide audience. By
building new technology solutions that are complemented by wholly new business
models, InsurTech start-ups are speeding up transformation in the insurance sector.
Both life and general insurance firms are witnessing the effects of InsurTech. Tradi-
tional insurers have realized the threat of disruption in the insurance business and are
investing in or acquiring start-ups to profit from their advances [6]. InsurTech is the
application and usage of information technology by one or more established or new
commercial entities to supply insurance-specific solutions [7]. InsurTech is quickly
establishing itself as a game-changing potential for insurers to innovate, enhance the
relevance of their offers, and expand. The main contributions of this study are:
1. This study evaluates the existing literature on the journey of the insurance
industry's transformation to InsurTech, outlining its evolution from the foundation
of the first insurance firm in 1848 to the present day.
2. The eight broad themes in InsurTech research are identified and discussed in this
study.
2 Evolution of InsurTech
2.1 Insurance 1.0
Before the Christian era, the notion of insurance was employed by Chinese
and Babylonian traders to reduce the hazards of river shipping [8]. The concept
of grouping traders was also introduced: traders whose goods were transported in
the same shipment were charged a premium together, so that if any of their
shipments suffered damage, the loss could be compensated from the premiums
collected [8]. Home insurance and accident insurance were also introduced.
Increasing railway fatalities necessitated the establishment of the first accident insur-
ance business in 1848 in England [9,10]. The industrial revolution affected produc-
tion capacity, transportation systems, workforce structure, and the types of hazards.
All these economic changes signaled the start of the Insurance Industry Revolution.
Transition from Traditional Insurance Sector to InsurTech: Systematic 475
2.2 Insurance 2.0
In this period, some discoveries led to the second industrial revolution, e.g., the
introduction of electricity, telegraph, changes in the mode of transportation, and
communication, the concept of mass production, division of labor, and new raw
materials [11,12]. The British government started providing insurance for old age,
illness, and unemployment under the National Insurance Act of 1911 [8,13]. Medical
insurance, accidental insurance, and old-age pension systems were also offered by
the German Government [14].
2.3 Insurance 3.0
The third industrial revolution started with the invention of computer systems and
the incorporation of computer-based applications into the organization. As a result,
organizational efficiency improved due to reduced cost and time. Integration of
computer-based management systems and the insurance industry reduced the cost of
distribution channels. Since the incorporation of Acord (Association for Cooperative
Operations Research and Development), an American Standard-Setting Association
for Insurance Industry in 1970, most agents switched to computer-based systems
[15]. Acord created a single form that was accepted and utilized by many of those
insurance firms, lowering the costs of insurance distribution. Acord also aided in the
creation of EDI standards. Companies adopted automation and created proprietary
systems that were installed in the offices of their agents. It became possible for agents
to eliminate the proprietary terminals and operate via one system [8].
2.4 Insurance 4.0
The fourth industrial revolution was a result of the development of telecommunica-
tion networks. Internet’s introduction signaled the conclusion of the third industrial
revolution. All industry’s business models were quickly altered as a result of the surge
of digitalization in every industry. AI, IoT, big data, and cloud computing are driving
the insurance industry’s transformation. Wearables, smart houses, self-driving cars,
and voice-assisted electronic gadgets are examples of game-changing advancements
in the twenty-first century. Insurance 4.0 is the merger of new technology with the
insurance industry. The shift from standard to smart insurance contracts is part of a
drive to digitize the whole value chain to give clients better, more personalized, and
hassle-free service.
Fig. 1 SLR process
3 Methodology
This study aims to examine available publications to understand the broad themes in
the publication of Insurance technology research and to provide researchers and prac-
titioners with information and potential perspective on InsurTech. We began by doing
a systematic assessment of the literature to determine the initial question to study.
The technique of a literature study is beneficial for establishing the research issue.
This study’s goal is to highlight the themes in InsurTech research. To accomplish this
goal, we first looked for indications that a literature review was necessary. The Scopus
database was utilized to search for papers that discuss InsurTech. In this early stage,
the search for articles was not restricted to any time frame. The keywords used for
searching publications were “InsurTech OR insurance technology OR e-insurance.”
About 171 Scopus publications were found after the search. In Fig. 1, the process is
illustrated. The exclusion criteria resulted in the removal of 118 publications. The
publication record was irregular before the year 2012; in some years there was
not even a single article related to InsurTech. Hence, the researchers selected
articles from the past 10 years (2012–2021).
The selection of documents for analysis began with the selection of the title and
abstract, followed by the narrowing of the articles using exclusion criteria, and finally,
selected articles were synthesized using thematic analysis.
4 Results
The outcomes of the thematic analysis are described in this section. Thematic analysis
is primarily characterized as a process for detecting, analyzing, and reporting themes
within data as an independent qualitative descriptive approach [16]. It is a technique
for reviewing and organizing the data according to their pattern and then naming
those themes. We present the topics discussed in InsurTech on a yearly basis, along
with their underlying themes.
4.1 Year-Wise Analysis
A total of 47 pertinent InsurTech research articles were examined. The 10-year
publishing period ran from 2012 through 2021. Between 2012 and 2021, there is
an irregularity in the publishing trend of InsurTech papers. The topic gained the
attention of researchers in 2018, and since then there has been an increase in
publications related to InsurTech. Until 2019 the number of studies in this field
was very small; the rising trend of the past two years shows the growing interest
of researchers in insurance technology. According to the year-wise analysis, most
articles were published in 2020. A comparative analysis of the papers published
annually shows that research on InsurTech and its related technologies was the
most often discussed topic, i.e., 13 papers. Research on consumer adoption of
insurance technology began in 2012, with a total of eight papers. A study on the
shift from the conventional model to the InsurTech model began in 2013.
Furthermore, studies on peer-to-peer insurance began to surface around 2020,
which is also the least-discussed theme of InsurTech. 2020 is the only year that
featured articles from all of the themes. Table 1 illustrates this.
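The figures quoted in this year-wise analysis can be cross-checked against the theme totals of Table 1 in a few lines (the dictionary below simply restates the table's totals):

```python
# Theme totals as reported in Table 1; a quick consistency check on the
# 47-article corpus and the most-discussed theme.
theme_totals = {
    "InsurTech and the technologies behind it": 13,
    "Risk management": 3,
    "Personalization of insurance": 3,
    "Legal, regulatory, ethical issues": 6,
    "Performance evaluation": 7,
    "P2P insurance": 2,
    "Adoption by insurers": 5,
    "Customer behavior": 8,
}

total_articles = sum(theme_totals.values())          # all reviewed articles
top_theme = max(theme_totals, key=theme_totals.get)  # most-discussed theme
```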
4.2 Most Popular Publications
Geneva Papers on Risk and Insurance: Issues and Practice published five articles,
three in 2020 and two in 2021, making it the most popular journal in InsurTech
research. Three articles were published in Risks, one in 2020 and two in 2021;
Advances in Intelligent Systems and Computing published the same number of
papers. Two articles were published in the Journal of Internet Banking and
Commerce, one in 2012 and the other in 2016.
4.3 Articles Classified by Themes
This section discusses the subjects of InsurTech research. Table 2 shows how the 47
articles are organized into eight topics; the study objectives are used to group the data.
The majority of the articles in the first category explored the idea of InsurTech as
well as the technology related to it. Thirteen articles on this subject explore tech-
nology related to InsurTech, including Blockchain, Artificial Intelligence, and IoT.
Table 1 Literature review results

Themes (in column order): InsurTech and the technologies behind it | Risk management | Personalization of insurance | Legal, regulatory, ethical issues | Performance evaluation | P2P insurance | Adoption by insurers | Customer behavior
2012: 1
2013: 2
2014: 1
2016: 1
2018: 3, 1, 1
2019: 2, 2
2020: 5, 2, 1, 1, 2, 1, 3, 3 (one count per theme, in column order)
2021: 5, 1, 2, 4, 1, 2
Total: 13, 3, 3, 6, 7, 2, 5, 8
Table 2 Articles grouped by research themes

S. no. | Research theme | Author and year
1 | InsurTech and the technologies behind it (AI, IoT, Blockchain, Big Data) | [2, 6, 29–39]
2 | Legal, regulatory, and ethical issues | [40–45]
3 | Customer behavior | [46–53]
4 | Performance evaluation | [2, 12, 54–58]
5 | Adoption of InsurTech by insurers |
6 | Personalization of insurance |
7 | Risk management |
8 | P2P insurance |
Then there are seven articles about evaluating the performance of insurance prod-
ucts and companies. Eight articles investigate customer uptake of innovative insur-
ance products and InsurTech businesses. The least number of research papers are on
peer-to-peer insurance.
5 Discussion
5.1 Research on InsurTech and the Technologies Behind It
InsurTech research begins with an understanding of insurance technology
innovations and their implications for the existing value chain [7]. Sophisticated
InsurTech innovations have benefited insurers by lowering transaction costs,
expanding into new markets, and providing more client-tailored coverage, and have
also made clients' lives easier by removing intermediaries, decreasing policy costs,
enabling online price comparisons, and speeding up claim settlement [17]. Artificial
intelligence, big data, IoT, and blockchain are the driving
forces underpinning the insurance industry’s transformation [18]. Innovative solu-
tions based on IoT like user-based insurance, smart cars, smart homes, wearables,
ride-sharing solutions, linked health, data capturing, and remote monitoring have
been game changers for traditional insurance firms but have also raised privacy
concerns, as some customers are willing to give up some privacy for better services,
while others view these IoT-based products as serious intrusions [19]. Blockchain
has also piqued the interest of researchers in recent years; although it is in its early
stages, once implemented it can disrupt the traditional insurance business model
[20]. Through the security of information, a decrease in administrative and trans-
action expenses, as well as a new level of information transparency and precision
with simpler access to all parties in the insurance contract, this technology offers a
wide range of potential applications in the insurance industry [21]. This technology
offers a solid digital platform for quicker and more secure transactions, more trans-
parency, and less risk [22]. A blockchain network connects various devices and mobile
applications, speeds up insurance processes, and helps to achieve accuracy in
transactions [23]. Artificial intelligence is another technology that is projected to be
the top trend in the coming years, as increased adoption of AI solutions is expected
to lower the overhead expenses of banking and insurance services [24]. Worldwide,
insurers are utilizing artificial intelligence to automate procedures and jobs including
detecting fraud, underwriting, personalizing policy, reviewing accident claims, and
sparing insured drivers from a laborious human evaluation process following an
accident [25]. Based on the successful experience of Huize Insurance in China,
other InsurTech businesses operating in the same environment should concentrate on
developing specialized insurance products, fusing big data and users, portraying and
analyzing users’ risk preferences, and precisely calculating adequate premiums using
AI [26]. Big data predictive analytics is another technology that aids the marketing
team in better understanding and analyzing customer behavior so that they may base
their policy recommendations on such patterns [27]. Along with the advantages of
using these technologies to build InsurTech businesses, privacy and security issues
have also received considerable research attention [28].
5.2 Research on Performance Evaluation
This section of the research focuses on studies on innovations in the insurance
business and their effects on company performance. The insurance sector saw the
emergence of InsurTech innovations, which were technology-driven and claimed
to improve performance and save time and costs. Focusing on the public property
and casualty insurance market in the USA, it was observed that insurance compa-
nies were unable to take advantage of technological advancements to increase their
productivity. The relative ranking of efficiency across enterprises was generally
consistent over time, while the gap between efficient and less efficient firms significantly
widened [29]. Another study by the same authors examined the technology
initiatives taken by insurance companies and showed that there is still plenty of room
for development [30]. However, InsurTech has given positive news about the growth
of Chinese insurance businesses’ profits. Positive effects are shown for the Chinese
insurance sector when studied along three dimensions (liability side, asset side,
and risk-taking behavior), indicating that InsurTech is boosting investment and profitability
[31]. Another technique for assessing InsurTech innovation is combining
indicators related to three dimensions: operations and management, technological
level, and user experience [2]. This method considers the inputs that pertain to both
parties in InsurTech and can help in providing a better and more transparent way
of assessing InsurTech. A mixed model based on the Balance Scorecard and the
DEMATEL technique has also been used to identify the key indicators for improving
the performance of insurance websites in Iran [32]. Studies that focus on a single
product rather than analyzing the insurance industry as a whole make the premise
Transition from Traditional Insurance Sector to InsurTech: Systematic 481
that different products should be evaluated based on individual factors affecting them
and using distinct methodologies [33].
5.3 Research on Customer Behavior
The adoption of any new invention is based on three pillars: first, whether the govern-
ment provides appropriate infrastructure, second, whether organizations are prepared
to adopt it, and third, whether customers are ready to try it [34]. Because technologies
and innovations are designed for customers’ better experience and convenience, this
section reviews studies that evaluated customer perceptions around various InsurTech
advancements [35]. Research done in Iran to determine e-readiness to adopt auto
insurance at the consumer level suggested that the technology should be compatible
and easy to use [34]. Another study conducted in Russia also states the main concern
behind the low adoption of digital insurance is the risk of loss of personal data and
its leakage [36]. Customers have unfavorable opinions of health-tracking insurance
applications because they believe the costs in terms of money, emotions, and
functionality do not justify the benefits [37]. In the insurance industry, life insurance
policies were more popular among customers than non-life policies [38]. Wearables
have become popular recently, and their adoption among sportspersons has been
studied and found to depend on their price, experience, and perception [39].
5.4 Research on Legal, Regulatory, Ethical Issues
in InsurTech
InsurTech's growing popularity has raised legal, regulatory, and ethical issues concerning
the value chain, the technologies behind it, and payment methods. Several researchers
questioned whether existing norms and regulations were adequate for new InsurTech
entrants or if a distinct framework was required. Discussing how to govern the
emerging InsurTech organizations, numerous researchers concluded that it will be
best suited for InsurTech firms to apply the current rules and legislation to their
business models rather than introducing new regulations, thereby avoiding significant
regulatory changes. Research in the context of European law on digital insurance
intermediaries, namely insurance comparison websites, P2P insurance, and robo-advisors,
examined whether the established regulatory structure is capable of regulating
them or whether new rules are required; it concluded that the current law, when
applied to the distribution of insurance products by intermediaries, is sufficient to
deal with almost all the issues faced by InsurTech [40]. Similar conclusions were made in another study
which argued that current InsurTech activities can be carried on without any need
for major regulatory changes in the law until the Solvency II directives are no longer
needed in the European Union insurance industry [41]. But these findings were the
opposite in the case of Iranian law: for better regulation and protection of the e-insured,
the existing rules are not adequate and need to be amended to suit the recent
innovations made in the insurance environment [42]. Big data analytics and artificial
intelligence have already been shown to be beneficial to the insurance sector, but as
their use grows, so do perils and moral dilemmas. The ethical challenges that have
emerged throughout the insurance value chain with the advent of AI and Big data are
the subject of a study conducted in the context of the European Insurance Industry
[43]. Concerns about whether it will be technically possible and legally acceptable
to incorporate virtual currencies like Bitcoin or Ether as a valid payment method for
smart contracts have also been raised since their introduction and rising popularity,
but until the Solvency II directives issued by European Law give them a legal status
similar to other functional currencies, it is only a vision for the future [44].
5.5 Research on the Adoption of InsurTech by Insurance
Companies
The transition from the traditional insurance industry to InsurTech has not been
smooth. Several obstacles were encountered, and the insurance industry took a long
time to transform. The use of e-commerce in insurance was justified by the numerous
benefits [45], but it nonetheless encountered challenges during its early implementa-
tion years [46]. Initially, it was found that a lack of understanding about the benefits
of employing electronic services was a barrier to e-insurance adoption; however,
if firms embrace knowledge management strategies, it will undoubtedly aid in its
implementation [47]. Then, there is also the question of IT governance in InsurTech
firms [48]. Covid-19 has had a significant role in the rise of InsurTech technologies
and their corporate acceptance. One study assessed the elements influencing
InsurTech acceptance in the post-pandemic period and provided a model based
on the diffusion of innovation theory that determined InsurTech adoption throughout
the value chain [49].
5.6 Research on Risk Management
The insurance industry has enthusiastically embraced loss estimation tools created
by financial engineers for managing financial risks, cyber risks, operational risks,
and technological risks. In the face of catastrophic occurrences, the insurance industry has
undoubtedly helped customers in wealthy economies such as the USA, Europe, and Japan
to manage their financial risk, but there is still more work to be done in developing
countries [50]. Research on Colombian InsurTech firms' risk management policies
revealed that none of the investigated InsurTech businesses possessed the necessary
understanding of risk management and also lacked certification in risk assessment
and its application [51]. Another research investigated possible cyber risks using the
insurance associated with IoT devices as an example and proposed a quantitative
approach for evaluating the risks associated with IoT health insurance that can be
readily adapted to other advances in the insurance industry [52].
5.7 Research on Personalization of Insurance
Personalization techniques enable insurance distributors to cut inquiry expenditures
while tailoring the given insurance policy to the demands, characteristics, risks,
and conditions of each customer [53]. There is a small body of research on this
topic that raises the question of what InsurTech is attempting to personalize by
taking the example of telematics in car insurance and discussing the consequences
of insurance personalization on society and how the relationship between insurer
and insured has changed post-personalization [53]. Another study, concerning
the European Union insurance industry, highlights the topic of wrong personalization,
discusses the probable elements behind erroneous personalization, and argues
that insurance providers should face adverse repercussions for it [54].
5.8 Research on P2P Insurance
The concept of peer-to-peer insurance is the pooling of risks among a large number
of insured people under one contract, and there is a shared investment fund where
premiums from all policyholders are collected and payments are distributed to
claimants [55]. Researchers' interest in the P2P insurance model has emerged only
recently, as there are just two articles on the subject. One of them embraces the P2P
model by comparing it to traditional insurance and addressing the advantages of the
former over the latter, as well as testing the hypothesis of stated advantages using
quantitative models [56]. The other article, while acknowledging the benefits of the
P2P model, also points out its shortcomings by highlighting regulatory concerns that
must be addressed in the European Union insurance industry [57].
6 Conclusion
Given the lack of empirical research on InsurTech and the issue’s novelty, we tried to
broaden our understanding of the field. This study examined 47 InsurTech research
publications published over 10 years. The review took a multidisciplinary approach,
looking at literature on InsurTech studies from various fields. InsurTech research
was divided into themes based on the results of the literature review: research on
InsurTech and the technologies behind it, risk management, performance evaluation,
insurer adoption, insured adoption, personalization of insurance, P2P insurance, and
legal, ethical, and regulatory issues in InsurTech. The most prevalent study subject
has revolved around the technologies driving the growth of InsurTech. It has been
observed through a trend of publications that scholars have been more interested in
this field of study since the pandemic hit. According to papers on the topic, InsurTech
systems are designed to benefit providers rather than users; major privacy concerns
remain unaddressed and are one of the main reasons behind low adoption rates.
Future research should look into solving privacy issues in InsurTech from the stand-
point of the customer, including trust, relative benefits, and motives. Furthermore,
InsurTech has practical obstacles in terms of acceptance, regulatory aspects, ethical
issues, as well as security risks that threaten user involvement with novel insurance
products. To address this e-risk issue and improve security management, a researcher
created a framework based on the Bayesian Belief Network (BBN) model to assess
the risk in terms of money linked with e-commerce transactions that result from a
security breach and thereby aid the development and pricing of InsurTech products
[58]. Regulatory and ethical issues concerning InsurTech have so far been discussed
only in the framework of European Union insurance law; researchers in other nations
should also address and resolve them. The study on the
P2P insurance broker model is also restricted; since we are still in the early stages of
peer-to-peer insurance, this concept should be investigated further in future studies
as well.
7 Future Research Directions and Limitations
This study aims to enhance an understanding of insurance technology advancements
and associated topics. The selection of published papers and proceedings can serve as
a resource for InsurTech research to access high-quality material. Future research can
utilize this analysis as a starting point to better comprehend InsurTech. As the insur-
ance industry becomes more digital, future research must investigate the impact of
new-age media on InsurTech adoption. The influence of demographic disparities on
InsurTech usage has also been overlooked in InsurTech studies, which can be studied
in the future. For this study, only the electronic database Scopus was used to search
for articles for the systematic literature review. Future research can incorporate studies
from other databases too. Only journal articles and conference papers were examined for
this review; future research can include other types of sources as well.
Future studies can employ other terms such as "digital insurance," "smart insurance,"
and many other concepts that are closely connected to InsurTech. This study was
limited to a 10-year time frame due to irregularities in past publications; therefore,
future studies can incorporate studies published before this period. Notwithstanding
these limitations, the analysis of this review will be useful to insurers, researchers,
and academics globally.
References
1. Nambisan S, Wright M, Feldman M (2019) The digital transformation of innovation and
entrepreneurship: progress, challenges and key themes. Res Policy 48(8):103773
2. Xu X, Zweifel P (2020) A framework for the evaluation of InsurTech. Risk Manag Insur Rev
23(4):305–329. https://doi.org/10.1111/rmir.12161
3. Eckert C, Osterrieder K (2020) How digitalization affects insurance companies: overview and
use cases of digital technologies. Zeitschrift für die gesamte Versicherungswiss 109(5):333–360
4. Young E (2019) Global FinTech adoption index 2019. [Online]. Available: https://www.ey.
com/en_gl/ey-global-fintech-adoption-index
5. Puschmann T (2017) Fintech. Bus Inf Syst Eng 59(1):69–76
6. Gerwald F, Dorcak P, Markovic P (2021) The influence of insurtechs on traditional insurance
operations. In: 15th International conference liberec economic forum 2021, pp 551–558
7. Stoeckli E, Dremel C, Uebernickel F (2018) Exploring characteristics and transformational
capabilities of InsurTech innovations to understand insurance value creation in a digital world.
Electron Mark 28(3):287–305
8. Trenerry CF (2009) The origin and early history of insurance: including the contract of bottomry.
Lawbook Exchange
9. Nicoletti B (2021) Insurance 4.0 benefits and challenges of digital technology
10. History of insurance: modern insurance. cpb-us-w2.wpmucdn.com/blogs.baylor.edu/dist/a/
6818/files/2013/12/History-of-insurance-11gcwej.pdf
11. Roy A (2017) The fourth industrial revolution
12. Engelman R (2022) The second industrial revolution, 1870–1914—US history scene. https://
ushistoryscene.com/article/second-industrial-revolution/. Accessed 3 Feb 2022
13. Heller M (2008) The national insurance acts 1911–1947, the approved societies and the
prudential assurance company. Twent Century Br Hist 19(1):1–28
14. Hennock EP (2007) The origin of the welfare state in England and Germany, 1850–1914: social
policies compared. Cambridge University Press
15. Nelson ML, Shaw MJ, Qualls W (2005) Interorganizational system standards development in
vertical industries. Electron Mark 15(4):378–392
16. Vaismoradi M, Turunen H, Bondas T (2013) Content analysis and thematic analysis: Implica-
tions for conducting a qualitative descriptive study. Nurs Heal Sci 15:398–405. https://doi.org/
10.1111/nhs.12048
17. Koprivica M (2018) Insurtech: challenges and opportunities for the insurance sector. In: 2nd
International scientific conference ITEMA, pp 619–625
18. Püttgen F, Kaulartz M (2017) Insurance 4.0: the use of blockchain technology and of smart
contracts in the insurance sector. ERA Forum 18(2):249–262
19. Acquisti A, John LK, Loewenstein G (2013) What is privacy worth? J Legal Stud 42(2):249–
274. https://doi.org/10.1086/671754
20. Popovic D et al (2020) Understanding blockchain for insurance use cases. Br Actuar J 25:1–23.
https://doi.org/10.1017/S1357321720000148
21. Njegomir V, Demko-Rihter J, Bojanić T (2021) Disruptive technologies in the operation of
insurance industry. Teh Vjesn 28(5):1797–1805
22. Shokeen J, Rana C, Rani P (2021) A green 6G network era: architecture and propitious
technologies. In: Lecture notes on data engineering and communications technologies (ICDAM),
vol 54, pp 59–76. https://doi.org/10.1007/978-981-15-8335-3_4
23. Chakravaram V, Ratnakaram S, Agasha E, Vihari NS (2021) The role of blockchain technology
in financial engineering. In: Lecture notes electrical engineering, vol 698, pp 755–765. https://
doi.org/10.1007/978-981-15-7961-5_72
24. Chakravaram V, Ratnakaram S, Vihari NS, Tatikonda N (2021) The role of technologies on
banking and insurance sectors in the digitalization and globalization era—a select study. Adv
Intell Syst Comput 1245:145–156
25. Hsu H-H, Huang N-F, Han C-H (2020) Collision analysis to motor dashcam videos with YOLO
and mask R-CNN for auto insurance. In: Proceedings of international conference on intelligent
engineering and management, ICIEM 2020, pp 311–315
26. Jing T (2021) Research on the development of internet insurance in China—based on the
exploration of the road of Huize insurance. In: E3S web of conferences, vol 235
27. Ratnakaram S, Chakravaram V, Vihari NS, Vidyasagar Rao G (2021) Emerging trends in the
marketing of financially engineered insurance products. Adv Intell Syst Comput 1270:675–684.
https://doi.org/10.1007/978-981-15-8289-9_65
28. Lin L, Chen C (2020) The promise and perils of insurtech. Singapore J Leg Stud 2020:115–142.
https://doi.org/10.2139/ssrn.3463533
29. Lanfranchi D, Grassi L (2021) Translating technological innovation into efficiency: the case
of US public P&C insurance companies. Eurasian Bus Rev 11(4):565–585. https://doi.org/10.
1007/s40821-021-00189-7
30. Lanfranchi D, Grassi L (2021) Examining insurance companies’ use of technology for
innovation. Geneva Pap Risk Insur Issues Pract. https://doi.org/10.1057/s41288-021-00258-y
31. Wang Q (2021) The impact of insurtech on Chinese insurance industry. Procedia Computer
Science 187:30–35. https://doi.org/10.1016/j.procs.2021.04.030
32. Beigzadeh N, Sajedinejad A (2014) Providing key indicators for evaluating the e-business
context for improving performance of insurance companies in Iran
33. Rutskiy V et al (2020) Development of e-insurance through market institutions: the example
of digital compulsory third-party motor insurance. Adv Intell Syst Comput 1294:836–843.
https://doi.org/10.1007/978-3-030-63322-6_71
34. Bromideh AA (2012) Factors affecting customer e-readiness to embrace auto e-insurance in
Iran. J Internet Bank Commer 17(1)
35. Garbairovai M, Bachanovai PH (2019) Purchasing behavior of e-insurance consumers. In:
Proceedings of the 33rd international business information management association confer-
ence, IBIMA 2019: education excellence and innovation management through vision 2020, pp
3139–3152
36. Maslova L, Ilina A (2020) Digital transformation of Russian insurance companies. In: CEUR
workshop proceedings, vol 1–2570
37. Talonen A, Mähönen J, Koskinen L, Kuoppakangas P (2021) Analysis of consumers’ negative
perceptions of health tracking in insurance—a value sacrifice approach. J Inf Commun Ethics
Soc. https://doi.org/10.1108/JICES-05-2020-0061
38. Gramegna A, Giudici P (2020) Why to buy insurance? An explainable artificial intelligence
approach. Risks 8(4):1–9. https://doi.org/10.3390/risks8040137
39. Saliba B, Spiteri J, Cortis D (2021) Insurance and wearables as tools in managing risk in sports:
determinants of technology take-up and propensity to insure and share data. Geneva Pap Risk
Insur Issues Pract. https://doi.org/10.1057/s41288-021-00250-6
40. Marano P (2019) Navigating insurtech: the digital intermediaries of insurance products and
customer protection in the EU. Maastrich J Eur Comp Law 26(2):294–315. https://doi.org/10.
1177/1023263X19830345
41. Ostrowska M (2021) Regulation of InsurTech: is the principle of proportionality an answer?
Risks 9(10):185. https://doi.org/10.3390/risks9100185.
42. Bagheri P, Forushani ML (2016) E-insurance law and digital space in Iran. J Internet Bank
Commer 21(1)
43. Mullins M, Holland CP, Cunneen M (2021) Creating ethics guidelines for artificial intelligence
and big data analytics customers: the case of the consumer European insurance market. Patterns
2(10). https://doi.org/10.1016/j.patter.2021.100362
44. Zgraggen RR (2019) Smart insurance contracts based on virtual currency: legal sources and
chosen issues. In: PervasiveHealth: pervasive computing technologies for healthcare, pp 99–102
45. Eling M, Lehmann M (2018) The impact of digitalization on the insurance value chain and
the insurability of risks. Geneva Pap Risk Insur Issues Pract 43(3):359–396. https://doi.org/10.
1057/s41288-017-0073-0
46. Heydari NH, Behestani S, Bahadori P (2013) Investigation of electronic maturity level of
insurance industry in Iran. Middle East J Sci Res 14(11):1539–1549
47. Mehrabani SE, Shajari M (2013) Knowledge management practices and implementation of
E-insurance. In: Proceedings—2013 international conference on informatics and creative
multimedia, ICICM 2013, pp 186–190. https://doi.org/10.1109/ICICM.2013.39
48. Uyun A, Sekarhati DKS, Amastini F, Nefiratika A, Shihab MR, Ranti B (2020) Implication
of InsurTech to implementation IT decision domain perspective: the case study of insurance
XYZ. https://doi.org/10.1109/ICCED51276.2020.9415783
49. Ching KH, Teoh AP, Amran A (2020) A conceptual model of technology factors to InsurTech
adoption by value chain activities. In: 2020 IEEE conference on e-learning, e-management and
e-services, pp 88–92
50. Shah HC, Dong W, Stojanovski P, Chen A (2018) Evolution of seismic risk management for
insurance over the past 30 years. Earthq Eng Eng Vib 17(1):11–18
51. Mogollón A, Rubiano A, Ramirez J (2020) Colombian companies of insurtech and their risk
management. J Phys Conf Ser 1646(1)
52. Leong Y-Y, Chen Y-C (2020) Cyber risk cost and management in IoT devices-linked health
insurance. Geneva Pap Risk Insur Issues Pract 45(4):737–759
53. McFall L, Moor L (2018) Who, or what, is insurtech personalizing?: persons, prices and the
historical classifications of risk. Distinktion J Soc Theory 19(2):193–213
54. Tereszkiewicz P, Południak-Gierz K (2021) Liability for incorrect client personalization in the
distribution of consumer insurance. Risks 9(5)
55. Levantesi S, Piscopo G (2022) Mutual peer-to-peer insurance: the allocation of risk. J Co-op
Organ Manag 10(1):100154
56. Abdikerimova S, Feng R (2021) Peer-to-peer multi-risk insurance and mutual aid. Eur J Oper
Res. https://doi.org/10.1016/j.ejor.2021.09.017
57. Clemente GP, Marano P (2020) The broker model for peer-to-peer insurance: an analysis of its
value. Geneva Pap Risk Insur Issues Pract 45(3):457–481
58. Mukhopadhyay A, Chatterjee S, Saha D, Mahanti A, Sadhukhan SK (2008) E-risk insurance
product design: a copula-based Bayesian belief network model. IGI Global
Diagnosis of Laryngitis
and Cordectomy using Machine
Learning with ML.Net and SVD
Syed Irfan Ali , Ahmed Sajjad Khan, Syed Mohammad Ali,
and Mohammad Nasiruddin
Abstract Nowadays, machine learning is playing an important role in providing
automated results to humanity. It is gaining researchers' attention day by day
and providing more accurate and faster results with every new study. In this
work, Microsoft's ML.NET, an open-source cross-platform machine learning
framework, is used in a .NET 5-based web application for the classification of
speech disorders such as cordectomy (Chordektomie) and laryngitis against
normal voices. Several experiments were performed with the ML model by training
it with different sets of features to identify the best set of features for accurate results.
Keywords Machine learning · Chordektomie · Laryngitis · Classification
1 Introduction
People face the risk of speech disorders, since 25% of the world population work in
professions that compel them to speak louder than the normal level. This
alters the curvature of the vocal tract, which affects the vocal folds during phonation
and results in irregular spectral qualities [1]. This variation in the properties of
vocal fold is produced due to several factors or their combinations such as presence
S. I. Ali (B)
Artificial Intelligence and Data Science Engineering, Anjuman College of Engineering &
Technology, Nagpur, India
e-mail: siali@anjumanengg.edu.in
A. S. Khan ·S. M. Ali ·M. Nasiruddin
Electronics & Telecommunication Engineering, Anjuman College of Engineering & Technology,
Nagpur, India
e-mail: askhan@anjumanengg.edu.in
S. M. Ali
e-mail: smali@anjumanengg.edu.in
M. Nasiruddin
e-mail: nasiruddin@anjumanengg.edu.in
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_37
489
490 S. I. Ali et al.
of mucus, tension, stiffness, larynx muscles, fold’s closing and opening. All these
factors are invariably affected by various pathologies in vocal tracts. This results
in different vibrations for different pathologies, which in turn produces different
frequencies for different pathologies. According to research, 17.9 million US
individuals aged 18 or older reported having a voice issue in the previous
year [2]. Teachers are more vulnerable to voice disorders than other professionals.
Both short-term and long-term signal analysis can be used to accomplish automatic
speech diagnostics. The acoustic analysis can provide parameters for long-term signal
analysis [3]. In contrast, parameters for short-term signal analysis can be collected
using LPC, LPCC, MFCC, and other algorithms. Numerous acoustic characteristics,
including shimmer, pitch, jitter, pitch and amplitude perturbation quotient, harmonic
to noise ratio, voice turbulence index, normalized noise energy, frequency amplitude
tremor, soft phonation index, glottal to noise excitation ratio, and others, can be used
to identify speech disorders [1]. From this list, some features show a clear separation
corresponding to the speech disorder, some show only minor differences, and some
show random variation. To classify effectively between different classes, it is necessary
to use features with sufficient variation, which supports the machine learning
classifier in diagnosing speech disorders. Thus, in this paper, we have used only the
prominent features: time- and frequency-domain features, energy, pitch, and zero-crossing
rate, together with MFCC coefficients. It is also observed that a model trained
with speech features from one part of the globe will test correctly only for that region. In
the model for classification. Any machine learning technology can be used for the
classification process. The two most often used languages in machine learning are
Python and C++, with Python enjoying more popularity. The ecosystem for special-
ized tools and libraries in Python is impressive. Models are typically developed using
the scikit-learn or PyTorch libraries for Python, and most neural networks are based
on TensorFlow [4]. Microsoft's machine learning tool, ML.NET, has been available
for the .NET platform since 2019. It is an open-source, cross-platform framework
made to host learning models in .NET Core web applications, .NET Framework
applications, and .NET Standard libraries. Scikit-learn and other tools have been
shown to be slower and less accurate than ML.NET [5]. The flowchart of the research is
shown in Fig. 1.
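As an illustration of the frame-level features listed above, short-time energy and zero-crossing rate can be computed with a few lines of NumPy. This is a generic sketch, not the feature extractor used in this study: the 25 ms frame and 10 ms hop at 16 kHz are illustrative values, and MFCCs would in practice be extracted with a dedicated library.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Slice a 1-D signal into overlapping frames (one frame per row)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def short_time_energy(x, frame_len=400, hop=160):
    """Sum of squared samples in each frame."""
    return np.sum(frame_signal(x, frame_len, hop) ** 2, axis=1)

def zero_crossing_rate(x, frame_len=400, hop=160):
    """Fraction of adjacent-sample sign changes in each frame."""
    signs = np.sign(frame_signal(x, frame_len, hop))
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

# Example: a pure 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
zcr = zero_crossing_rate(tone)   # roughly 2 * 440 / 16000 per sample
```

For a sustained vowel, energy and ZCR vary slowly; irregular phonation shows up as frame-to-frame fluctuation in exactly these quantities.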
2 Literature Review
Verde et al. employed the Youden analysis to identify the threshold value to distin-
guish between a pathological and a healthy voice, and then used a model tree regres-
sion approach to define the link between these parameters. They have assessed
the proposed index’s dependability in terms of accuracy, sensitivity, and specificity
according to the correct classification [6].
Fig. 1 Block diagram of the
process
Support vector machine (SVM) and deep neural network (DNN) classifiers have
been employed by Zhang et al. [7].
A feature-based representation with MFCCs and a Mel-spectrogram, two
commonly used input representations, was initially created from the audio data by
Guan & Lerch. Four different machine learning techniques have been used to conduct
the research: support vector machine (SVM), CNN, CNN followed by SVM, and
autoencoder (AE) followed by SVM. Fivefold cross-validation is used throughout
all studies [8].
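The fivefold cross-validation used in [8] follows the standard index-partitioning scheme, which can be sketched generically as follows (this is not the authors' code, just the usual shuffled k-fold split, equivalent in spirit to scikit-learn's `KFold`):

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    folds = np.array_split(order, k)          # k nearly equal parts
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]

# Across the five splits, every sample lands in exactly one test fold
splits = list(k_fold_indices(103, k=5))
```

Each classifier is then trained on the k − 1 training folds and scored on the held-out fold, and the k scores are averaged.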
The feature vector that Smitha et al. generated consists of 19 coefficients. They
separated their dataset into three categories: training data (75%), validation data (15%),
and test data (15%). The network is trained using the Scaled Conjugate Gradient (SCG)
backpropagation technique, and the results are evaluated using Mean Squared
Error (MSE) and Percent Error (%E). To categorize the retrieved features, they
employed an artificial neural network (ANN) with one hidden layer, claiming
that an ANN is one of the effective methods for differentiating between healthy
and diseased voices. Using MATLAB 2015a, they obtained the lowest MSE and
%E of 1.05e−03 and 1.28e−01, respectively [9].
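A three-way split of the kind described can be sketched as below; the 70/15/15 proportions are illustrative defaults, not necessarily those used in [9]:

```python
import numpy as np

def train_val_test_split(n_samples, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle indices and cut them into disjoint train/validation/test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(train_frac * n_samples)
    n_val = int(val_frac * n_samples)
    return (order[:n_train],
            order[n_train:n_train + n_val],
            order[n_train + n_val:])

train_idx, val_idx, test_idx = train_val_test_split(1000)
```

The validation partition steers early stopping and hyperparameter choices, while the test partition is touched only once, for the final MSE and %E figures.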
Dankovicová et al. chose a straightforward filter to obtain the top k characteristics,
and the results are on par with those of more sophisticated techniques. The score of
features was calculated using mutual information pertaining to discrete variables.
This function is reliant on the k-nearest neighbor distance-based entropy calcu-
lation. Positive (non-negative) mutual information between two random variables
indicates that the variables are dependent on one another. Higher values indicate
greater dependence; zero indicates independence. The authors' automated feature
selection is thus based on dependency. Support vector machine (SVM) with a
nonlinear kernel, K-nearest neighbors (KNN), and random forest classifiers (RFC)
have all been used by the authors [10].
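The mutual-information score used in [10] measures the dependence between a discrete feature and the class label. The sketch below uses a simple histogram (plug-in) estimator for illustration, not the k-nearest-neighbor entropy estimator the authors describe:

```python
import numpy as np

def mutual_information(x, y):
    """Mutual information (in nats) between two discrete variables, via histograms."""
    joint = np.zeros((int(x.max()) + 1, int(y.max()) + 1))
    for xi, yi in zip(x, y):
        joint[xi, yi] += 1                    # joint count table
    joint /= joint.sum()                      # joint probability p(x, y)
    px = joint.sum(axis=1, keepdims=True)     # marginal p(x)
    py = joint.sum(axis=0, keepdims=True)     # marginal p(y)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))

rng = np.random.default_rng(0)
a = rng.integers(0, 4, 5000)
b = rng.integers(0, 4, 5000)                  # independent of a
mi_dependent = mutual_information(a, a)       # near the entropy of a (~log 4)
mi_independent = mutual_information(a, b)     # near zero
```

Features are then ranked by their score against the label, and the top k are kept, exactly the "straightforward filter" described above.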
Alhussein and Muhammad have looked into the CaffeNet and VGG16 Net CNN models. The VGG16 Net is a fairly sophisticated CNN model. Since CaffeNet and VGG16 Net were both trained with a large number of pictures, they are reliable for a wide range of applications. Because the vocal pathology databases include only a very small number of samples, these models cannot be trained from scratch. Instead, the authors applied transfer learning and fine-tuning to benefit from these reliable models. The output of the final CNN layer is then given to an SVM for classification into two classes [11].
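The pattern of a frozen deep feature extractor feeding an SVM can be sketched as follows. To keep the sketch self-contained, a fixed random projection with a ReLU stands in for the truncated CaffeNet/VGG16, and the images and labels are hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Stand-in for a truncated pretrained CNN: the paper uses the output of the
# final CNN layer; here a frozen random projection + ReLU plays that role so
# the sketch stays self-contained (this is NOT CaffeNet or VGG16).
W = rng.normal(size=(64 * 64, 128))

def cnn_features(images):
    flat = images.reshape(len(images), -1)
    return np.maximum(flat @ W, 0.0)   # frozen "deep" features

# Hypothetical two-class data (e.g., normal vs. pathological spectrograms).
images = rng.random((40, 64, 64))
labels = np.array([0] * 20 + [1] * 20)

# Transfer-learning pattern: the extractor stays fixed, only the SVM head is
# trained on the extracted features.
feats = cnn_features(images)
svm = SVC(kernel="rbf").fit(feats, labels)
print(svm.predict(cnn_features(images[:2])))
```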
Hidden Markov models (HMM), neural networks, support vector machines (SVM), and lastly the Gaussian mixture model (GMM) have all been utilized as classifiers by Rabeh et al. [12].
A support vector machine (SVM) classifier was used by Al-Nasheri et al. to determine whether the provided samples were abnormal or normal. In addition, they carried out further tests to see whether there was a significant difference between the means of normal and diseased samples for each database individually, using U-tests and XLSTAT software. They used several terms to convey their findings: accuracy (ACC: the ratio of correctly detected samples to total samples), sensitivity (SN: the proportion of pathological samples positively identified), specificity (SP: the proportion of normal samples negatively identified), and area under the receiver operating characteristic (ROC) curve (AUC) [1].
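These four measures can be computed directly from a confusion matrix plus classifier scores; the labels and scores below are hypothetical, and scikit-learn supplies the AUC.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical ground truth (1 = pathological) and classifier scores.
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_score = np.array([0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.7, 0.1])
y_pred = (y_score >= 0.5).astype(int)

tp = np.sum((y_pred == 1) & (y_true == 1))
tn = np.sum((y_pred == 0) & (y_true == 0))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

acc = (tp + tn) / len(y_true)         # ACC: correctly detected / total
sn = tp / (tp + fn)                   # SN: pathological positively identified
sp = tn / (tn + fp)                   # SP: normal negatively identified
auc = roc_auc_score(y_true, y_score)  # AUC: area under the ROC curve
print(acc, sn, sp, auc)
```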
MLP, GFF & MODULAR, and SVM are the classifiers that Ali and Karule utilized. Tanh axon was utilized as the transfer function, with 25% of the samples used for testing and 75% for training. Five transfer functions (Tanh axon, L-Tanh axon, sigmoid axon, L-sigmoid axon, and SoftMax axon) have been utilized in the first three neural networks. In the SVM, they altered only the epochs [13].
Teixeira et al. created an approach for obtaining the parameters vector using the
Praat program [14].
Artificial neural networks and the Multilayer Perceptron (MLP) with the backpropagation learning algorithm were employed by Khemphila and Boonjing. They selected features according to the importance of each attribute, using the information gain of each attribute as its weight rather than learning weights through a general algorithm or another machine learning technique. They employed an ANN classification method that chooses attributes based on the idea of information gain (IG). The selection of feature sets is done using IG. Weka 3.6.6 was used to calculate the results of experiments [15].
Diagnosis of Laryngitis and Cordectomy using Machine Learning 493
In order to create a feature vector from speech samples for a Multilayer Neural
Network (MNN) classifier, Salhi et al. used wavelet analysis. A two-dimensional
pattern of wavelet coefficients is produced by wavelet analysis. The feature vector of
voice samples is created using the energy content of wavelet coefficients at various
scaling settings. In this case, a feature vector is employed as a diagnostic tool to
find pathological voice abnormalities. Here, classification is accomplished using a
three-layer feed-forward network with sigmoid activation. For network training, the
generalized backpropagation algorithm (BPA) is utilized. Additionally, they stated
that supervised learning is employed when a neural network is trained by providing
a target output to a certain input group. Additionally, it is claimed that a network
can be trained through self-guidance, in which case the network’s parameters adapt
to the input. In both scenarios, the network’s weights and biases change in response
to the collected data. The training can be done in batches (batch training), in which
case the parameters are not adjusted until all the instances have been fed, or it can
be done gradually (incremental training), in which case the weights and biases are
adjusted every time a new training example is supplied to the network. The neural network is implemented on the MATLAB 7.0 platform and has three layers: an input layer, a hidden layer, and an output layer [16].
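The energy-per-scale feature vector can be sketched with a hand-rolled Haar transform; the Haar wavelet is an assumption here, standing in for whichever wavelet family Salhi et al. actually used.

```python
import numpy as np

def haar_step(signal):
    """One level of the orthonormal Haar wavelet transform:
    returns approximation and detail coefficients."""
    a = (signal[0::2] + signal[1::2]) / np.sqrt(2)
    d = (signal[0::2] - signal[1::2]) / np.sqrt(2)
    return a, d

def wavelet_energy_features(signal, levels=3):
    """Energy of the detail coefficients at each scale, plus the residual
    approximation energy: one number per scale, forming the feature vector."""
    feats, a = [], signal
    for _ in range(levels):
        a, d = haar_step(a)
        feats.append(float(np.sum(d ** 2)))   # energy at this scale
    feats.append(float(np.sum(a ** 2)))       # residual approximation energy
    return np.array(feats)

# Hypothetical voice frame of 512 samples.
rng = np.random.default_rng(0)
frame = rng.standard_normal(512)
fv = wavelet_energy_features(frame)           # input vector for the 3-layer MLP
print(fv.shape)
```

Because the Haar transform is orthonormal, the feature vector's entries sum to the total signal energy, so no information about overall loudness is lost.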
Muhammad et al. fed the three photos into convolutional neural networks (CNNs) [17]. CNNs have had success with deep learning in many areas of image processing [18]. They employed transfer learning and a fine-tuning strategy because
training the CNN model requires a lot of data. CaffeNet is the CNN that they use in
the suggested system [18]. A reliable CNN model that works well in many image
processing applications is CaffeNet. Three components are shown in the input image
for this model (e.g., in a color image, the three components are red, green, and blue).
The model, according to them, contains three pooling layers and five convolution layers. A rectified linear unit (ReLU) follows each convolution layer. The training set,
the validation set, and the testing set were each given their own section of the utilized
database. The training set, validation set, and testing sets each comprised 5%, 7%,
and 25% of the database, respectively. It was made sure there was no speaker overlap
when the database was divided into its three sections.
In the suggested system, Hossain & Muhammad used three machine learning
algorithms, each of which has unique properties. They employed the SVM, the ELM,
and the GMM as classifiers [19,20].
According to Martinez and Rufiner, the ANN is a superb classification system that
excels at handling noisy, imperfect, overlapped, and otherwise degraded data. A moving window of 256 samples with a 128-sample overlap was utilized to extract patterns. Each segment had a Hamming window applied to it, after which the patterns were extracted using the first 16 cepstral coefficients. Each pattern was completed with ones and zeros as the activations of the required ANN outputs. They employed two distinct types of ANN, one trained to recognize the difference between
a diseased and normal voice (without caring about the pathology), and the other to
recognize the difference between a normal, harsh, and bicyclic voice [21].
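The pattern-extraction front end (256-sample moving window, 128-sample overlap, Hamming window, first 16 cepstral coefficients) can be sketched as follows; the real cepstrum is an assumption, since [21] does not specify the exact cepstrum variant.

```python
import numpy as np

def frames_cepstra(x, win=256, hop=128, n_ceps=16):
    """Slide a 256-sample window with 128-sample overlap, apply a Hamming
    window, and keep the first 16 cepstral coefficients of each frame."""
    w = np.hamming(win)
    out = []
    for start in range(0, len(x) - win + 1, hop):
        seg = x[start:start + win] * w
        spec = np.abs(np.fft.rfft(seg)) + 1e-12   # avoid log(0)
        ceps = np.fft.irfft(np.log(spec))         # real cepstrum
        out.append(ceps[:n_ceps])
    return np.array(out)

# Hypothetical 1024-sample voice excerpt.
rng = np.random.default_rng(1)
signal = rng.standard_normal(1024)
patterns = frames_cepstra(signal)                 # one 16-value pattern per frame
print(patterns.shape)
```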
Each speech sample in the MEEI database was used by Ali et al. to test their
technique. They employed a threefold cross-validation strategy for this. The MEEI
database is divided into three separate subsets using this method. One of the subsets is
utilized for system evaluation, and the other two are used for system training. Various metrics are considered to report the results of the suggested system: sensitivity, specificity, accuracy, and area under the receiver operating characteristic (ROC) curve. For the automatic detection and categorization of voice abnormalities, a GMM-based classifier is given the FCB feature.
To create the classifier, a cutting-edge clustering pattern recognition algorithm is
used. The GMM clustering method has been applied in a wide range of scientific
fields. They have used GMM to develop acoustic models employing FCB features
for various speech signals belonging to various classes [22].
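The threefold cross-validation with one GMM acoustic model per class can be sketched as follows; the synthetic vectors stand in for the FCB features, and two mixture components per class is an illustrative choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import StratifiedKFold

# Hypothetical stand-in for FCB feature vectors from two voice classes.
X, y = make_classification(n_samples=150, n_features=8, n_informative=4,
                           random_state=0)

accs = []
for tr, te in StratifiedKFold(n_splits=3).split(X, y):   # threefold CV as in [22]
    # One GMM acoustic model per class, trained on the two training subsets.
    gmms = {c: GaussianMixture(n_components=2, random_state=0)
               .fit(X[tr][y[tr] == c]) for c in (0, 1)}
    # Classify the held-out subset by the higher per-class log-likelihood.
    scores = np.column_stack([gmms[c].score_samples(X[te]) for c in (0, 1)])
    accs.append(float(np.mean(scores.argmax(axis=1) == y[te])))
print(round(float(np.mean(accs)), 2))
```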
3 Methodology
The Saarbruecken Voice Database (SVD) was used for model training and testing in
this study. The SVD database was recorded by the Institute of Phonetics at Saarland
University and is freely downloadable via the Internet [23]. The resolution of the
speech samples is 16 bits, with a sampling frequency of 50 kHz. The findings of this study depend on the development environment used. The samples are labeled and divided into normal and abnormal speech signals. Abnormal samples are labeled with ‘1,’ while normal samples are tagged with ‘0.’ For feature extraction, 96 normal speech samples, 84 Laryngitis samples, and 47 Chordektomie samples are downloaded.
4 Experiment and Result
ML.Net is utilized for generation of the machine learning model, as it is the best of the three available machine learning platforms, viz. ML.Net, scikit-learn, and H2O [24]. The NWaves library from the NuGet Package Manager is used for extracting multiple features of each sample; the library is available on GitHub [25].
4.1 Experiment with Normal Versus Laryngitis
Experiments were performed with each feature using the multiclass classifiers to find the best feature or set of features that can classify Normal versus Laryngitis with the highest accuracy.
Table 1 Normal versus laryngitis result
Feature set 1 Percentage acc. Feature set 2 Percentage acc.
Energy 53.92 MFCC0 49.48
RMS 54.9 MFCC1 54.95
ZCR 55.37 MFCC2 54.52
Entropy 57.1 MFCC3 54.01
Centroid 60.35 MFCC4 53.09
Spread 58.88 MFCC5 56.68
Flatness 56.88 MFCC6 58.04
Noiseness 52.17 MFCC7 57.52
RollOff 55.62 MFCC8 55.45
Crest 55.42 MFCC9 52.58
Entropy2 53.16 MFCC10 54.3
Decrease 56.22 MFCC11 56.92
C1 56.38 MFCC12 54.21
C2 53.77
C3 55.41
C4 51.7
C5 55.29
C6 55.46
The experiment is performed in three stages.
In the first stage, micro-accuracy, macro-accuracy, and time needed are recorded
for all multiclass classifiers in 16 iterations.
In the second stage, micro-accuracy, macro-accuracy, and time needed are
recorded for selected classifiers from available multiclass classifiers in five iterations.
In the final stage, the result of the best classifier is recorded.
For the ‘Centroid’ feature, ‘Lbfgs Logistic Regression Ova’ is best, with 60.35% micro-accuracy, 56% macro-accuracy, and a duration of 1 s.
For the ‘Spread’ feature, ‘Sdca Maximum Entropy Multi’ is best, with 58.88% micro-accuracy, 50% macro-accuracy, and a duration of 0.2 s.
For the ‘MFCC6’ feature, ‘Fast Tree Ova’ is best, with 58.04% micro-accuracy, 56.34% macro-accuracy, and a duration of 1.2 s (Table 1; Fig. 2).
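The staged sweep, one experiment per feature and classifier with micro- and macro-accuracy recorded, can be emulated outside ML.Net. This scikit-learn sketch uses synthetic data, and the two trainers are rough analogues of the named ML.Net ones, not the same implementations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix: each column plays the role of one acoustic
# feature (Centroid, Spread, MFCC6, ...); this is not the SVD data itself.
X, y = make_classification(n_samples=300, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

results = {}
for j in range(X.shape[1]):                      # one experiment per feature
    for name, clf in [("lbfgs-logreg", LogisticRegression()),
                      ("fast-tree-like", GradientBoostingClassifier(random_state=0))]:
        clf.fit(X_tr[:, [j]], y_tr)
        pred = clf.predict(X_te[:, [j]])
        results[(j, name)] = (accuracy_score(y_te, pred),
                              balanced_accuracy_score(y_te, pred))

best = max(results, key=lambda k: results[k][0])
print(best, results[best])
```

In ML.Net terms, micro-accuracy is overall accuracy, while macro-accuracy averages per-class accuracy; `balanced_accuracy_score` plays the latter role here.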
4.2 Experiment with Normal Versus Chordektomie
Experiments were performed with each feature using the multiclass classifiers to find the best feature or set of features that can classify Normal versus Chordektomie with the highest accuracy.
Fig. 2 Normal versus laryngitis
The experiment is performed in three stages.
In the first stage, micro-accuracy, macro-accuracy, and time needed are recorded
for all multiclass classifiers in 16 iterations.
In the second stage, micro-accuracy, macro-accuracy, and time needed are
recorded for selected classifiers from available multiclass classifiers in five iterations.
In the final stage, the result of the best classifier is recorded.
For the ‘MFCC11’ feature, ‘Lbfgs Logistic Regression Ova’ is best, with 70.34% micro-accuracy, 67.51% macro-accuracy, and a duration of 0.2 s.
For the ‘RMS’ feature, ‘Lbfgs Logistic Regression Ova’ is best, with 70.66% micro-accuracy, 65.2% macro-accuracy, and a duration of 0.3 s.
For the ‘Entropy’ feature, ‘Lbfgs Logistic Regression Ova’ is best, with 70.61% micro-accuracy, 68% macro-accuracy, and a duration of 0.1 s (Table 2; Fig. 3).
Table 2 Normal versus Chordektomie result
Feature set 1 Percentage acc. Feature set 2 Percentage acc.
Energy 68.23 MFCC0 69.81
RMS 70.66 MFCC1 67.29
ZCR 68.92 MFCC2 67.43
Entropy 70.61 MFCC3 68.24
Centroid 69.97 MFCC4 69.57
Spread 68.98 MFCC5 69.01
Flatness 69.98 MFCC6 69.21
Noiseness 67.58 MFCC7 68.57
RollOff 68.33 MFCC8 68.73
Crest 68.76 MFCC9 69.6
Entropy2 68.98 MFCC10 69.34
Decrease 69.22 MFCC11 70.34
C1 68.93 MFCC12 69.91
C2 68.57
C3 69.49
C4 69.05
C5 68.83
C6 68.44
Fig. 3 Normal versus Chordektomie
5 Conclusion
This research shows that the two selected disorders, Laryngitis and Chordektomie, are remarkably close to normal in terms of features. The best features for classifying Laryngitis versus normal are found to be Centroid, Spread, and MFCC coefficient 6, with accuracies of 60.35, 58.88, and 58.04 percent, respectively. For Chordektomie versus normal, accuracies of 70.34, 70.66, and 70.61 percent are found for the features MFCC11, RMS, and Entropy, respectively. When two or more features with good accuracy are combined, the average accuracy decreases in most cases, with few exceptions. The results for composite features will be published in future papers. To replicate or verify these results, researchers should use ML.Net and NWaves with a sampling rate of 44,100 Hz, a frame duration of 0.035 s, a hop duration of 0.015 s, a pre-emphasis filter of 0.97, and a rectangular window in MFCC.
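The stated front-end parameters translate into the following pre-emphasis and framing steps, sketched in numpy on a hypothetical one-second signal (the MFCC filterbank and DCT stages are omitted).

```python
import numpy as np

SR = 44_100          # sampling rate stated in the conclusion
FRAME_S = 0.035      # frame duration, s
HOP_S = 0.015        # hop duration, s
PRE_EMPH = 0.97      # pre-emphasis coefficient

frame_len = int(SR * FRAME_S)   # 1543 samples per frame
hop_len = int(SR * HOP_S)       # 661 samples per hop

rng = np.random.default_rng(0)
x = rng.standard_normal(SR)     # hypothetical 1-second voice sample

# Pre-emphasis filter: y[n] = x[n] - 0.97 * x[n-1]
y = np.append(x[0], x[1:] - PRE_EMPH * x[:-1])

# Rectangular-window framing (the window named in the conclusion: no taper).
n_frames = 1 + (len(y) - frame_len) // hop_len
frames = np.stack([y[i * hop_len:i * hop_len + frame_len]
                   for i in range(n_frames)])
print(frames.shape)
```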
References
1. Al-Nasheri A et al (2017) Voice pathology detection and classification using auto-correlation
and entropy features in different frequency regions. IEEE Access 6:6961–6974. https://doi.org/
10.1109/ACCESS.2017.2696056
2. National Institute on Deafness and Other Communication Disorders: Voice, Speech,
and Language: Quick Statistics (2016). http://www.nidcd.nih.gov/health/statistics/vsl/Pages/
stats.aspx. Accessed 10 Aug 2020
3. Boyanov B, Hadjitodorov S (1997) Acoustic analysis of pathological voices. A voice analysis
system for the screening of laryngeal diseases. IEEE Eng Med Biol Mag 6(4):74–82
4. Esposito D (2019) Hosting a machine learning model in ASP.NET Core 3.0. https://www.red-gate.com/simple-talk/sql/data-science-sql/hosting-a-machine-learning-model-in-asp-net-core-3-0/. Accessed 27 Aug 2020
5. Ahmed Z et al (2019) Machine Learning at Microsoft with ML.NET. In: KDD: knowledge
discovery and data mining, pp 2448–2458. https://doi.org/10.1145/3292500.3330667
6. Verde L, De Pietro G, Alrashoud M, Ghoneim A, Al-Mutib KN, Sannino G (2019) Dysphonia
detection index (DDI): a new multi-parametric marker to evaluate voice quality. IEEE Access
7:55689–55697. https://doi.org/10.1109/ACCESS.2019.2913444
7. Zhang T, Wu Y, Shao Y, Shi M, Geng Y, Liu G (2019) A pathological multi-vowels recognition algorithm based on LSP feature. IEEE Access 7:58866–58875. https://doi.org/10.1109/ACCESS.2019.2911314
8. Guan H, Lerch A (2019) Learning strategies for voice disorder detection. https://doi.org/10.
1109/ICOSC.2019.8665504
9. Smitha, Shetty S, Hegde S, Dodderi T (2018) Classification of healthy and pathological voices
using MFCC and ANN. In: Proceedings of 2018 2nd international conference on advances in
electronics, computers and communications, ICAECC 2018, pp 1–5. https://doi.org/10.1109/
ICAECC.2018.8479441
10. Dankovičová Z, Sovák D, Drotár P, Vokorokos L (2018) Machine learning approach to
dysphonia detection. Appl Sci 8(10):1–12. https://doi.org/10.3390/app8101927
11. Alhussein M, Muhammad G (2018) Voice pathology detection using deep learning on mobile
healthcare framework. IEEE Access 6:41034–41041. https://doi.org/10.1109/ACCESS.2018.
2856238
12. Rabeh H, Salah H, Adnane C (2018) Voice pathology recognition and classification using noise
related features. Int J Adv Comput Sci Appl 9(11):82–87, [Online]. Available: www.ijacsa.the
sai.org
13. Ali SM, Karule PT (2016) MFCC, LPCC, formants and pitch proven to be best features in
diagnosis of speech disorder using neural networks and SVM. Int J Appl Eng Res 11(2):897–903
[Online]. Available: http://www.ripublication.com
14. Teixeira JP, Oliveira C, Lopes C (2013) Vocal acoustic analysis—jitter, shimmer and
HNR parameters. Procedia Technol 9(May):1112–1122. https://doi.org/10.1016/j.protcy.2013.
12.124
15. Khemphila A, Boonjing V (2012) Parkinsons disease classification using neural network and
feature selection. Int J Math Comput Sci 6(4):377–380
16. Salhi L, Mourad T, Cherif A (2010) Voice disorders identification using multilayer neural
network
17. Muhammad G, Alhamid MF, Alsulaiman M, Gupta B (2018) Edge computing with cloud for
voice disorder assessment and treatment. IEEE Commun Mag 56(4):60–65. https://doi.org/10.
1109/MCOM.2018.1700790
18. Krizhevsky A (2012) ImageNet classification with deep convolutional neural networks. In: NIPS’12 proceedings of the 25th international conference on neural information processing systems, Lake Tahoe, Nevada, 03–06 Dec 2012, vol 1, pp 1097–1105. https://doi.org/10.1016/B978-008046518-0.00119-7
19. Hossain MS, Muhammad G (2016) Healthcare big data voice pathology assessment framework.
IEEE Access 4:7806–7815. https://doi.org/10.1109/ACCESS.2016.2626316
20. Huang GB, Zhou H, Ding X, Zhang R (2011) Extreme learning machine for regression and
multiclass classification. IEEE Trans Syst Man Cybern Part B (Cybernetics) 42(2):513–529.
https://doi.org/10.1109/tsmcb.2011.2168604
21. Martinez CE, Rufiner HL (2002) Acoustic analysis of speech for detection of laryngeal
pathologies, pp 2369–2372. https://doi.org/10.1109/iembs.2000.900621
22. Ali Z, Hossain MS, Muhammad G, Sangaiah AK (2018) An intelligent healthcare system for
detection and classification to discriminate vocal fold disorders. Futur Gener Comput Syst
85:19–28. https://doi.org/10.1016/j.future.2018.02.021
23. Barry WJ, Pützer M: Saarbruecken Voice Database. http://www.stimmdatenbank.coli.uni-saarland.de/. Accessed 10 Aug 2017
24. Microsoft, ML.NET-An open source and cross-platform machine learning framework. https://
dotnet.microsoft.com/apps/machinelearning-ai/ml-dotnet. Accessed 21 Aug 2020
25. ar1st0crat/NWaves: .NET DSP library with a lot of audio processing functions. https://github.
com/ar1st0crat/NWaves. Accessed 8 Feb 2022
Speed of Diagnosis for Brain Diseases
Using MRI and Convolutional Neural
Networks
B. Srinivasa Rao, Vankalapati Nanda Gopal, Vatala Akash,
and Shaik Nazeer
Abstract Accurately diagnosing brain diseases is crucial for effective treatment
and improved patient outcomes. Magnetic Resonance Imaging is a regularly used
technology in the investigation of brain illnesses including Alzheimer’s disease,
brain tumors, and multiple sclerosis. This study proposes a Convolutional Neural
Network-based automated brain illness classification method utilizing MRI images.
The proposed method leverages a dataset of MRI images covering four classes, namely Alzheimer’s disease, brain tumors, multiple sclerosis, and healthy brains.
We trained and compared different CNN architectures, including VGG16 and fine-
tuned ResNet. Our CNN model achieved remarkable accuracy on both the training
and testing sets. Specifically, we achieved an impressive training accuracy of 99.01%
and a testing accuracy of 95%, outperforming VGG16 and fine-tuned ResNet. We
derived many assessment measures, including accuracy, recall, and F1-score, to
further evaluate the effectiveness of our model. Our results demonstrate the potential
of CNN-based approaches in accurately and automatically classifying brain diseases
using MRI images. Our proposed approach has the potential to be a valuable tool for
healthcare professionals, improving patient outcomes and quality of life. The devel-
oped model is capable of classifying Alzheimer’s disease, brain tumors, multiple
sclerosis, and their respective stages. Automated classification of brain diseases using
CNNs could enable early detection and precise diagnosis of these diseases, leading
to improved treatment and patient care.
Keywords Magnetic Resonance Imaging ·Convolutional Neural Network ·Brain
diseases ·Alzheimer’s disease ·Brain tumors ·Multiple sclerosis ·VGG16 ·
ResNet ·Automated classification ·Evaluation metrics ·Precision ·Recall ·
F1-score
B. Srinivasa Rao (B)
Department of Information Technology, Lakireddy Bali Reddy College of Engineering,
Mylavaram, Andhra Pradesh, India
e-mail: doctorbsrinivasarao@gmail.com
V. N. Gopal · V. Akash · S. Nazeer
Lakireddy Bali Reddy College of Engineering, Mylavaram, Andhra Pradesh, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_38
502 B. Srinivasa Rao et al.
1 Introduction
Millions of people and their families are affected by brain illnesses, which are a
serious public health problem globally. Multiple sclerosis (MS), brain tumors, and
Alzheimer’s disease (AD) are some of the most prevalent and crippling brain disor-
ders. These illnesses have high rates of morbidity and death, and diagnosing and
treating them are extremely difficult for medical professionals and healthcare systems
across the world. These illnesses have severe effects on both the individual and their relatives, underscoring the essential need for early and precise identification and categorization. Traditional diagnostic methods for brain diseases rely heavily on clinical
assessments, such as cognitive and neurological tests, and medical imaging, such as
computed tomography (CT) and Magnetic Resonance Imaging (MRI). While these
methods have been useful in diagnosing and monitoring brain diseases, they have
limitations, including poor sensitivity and specificity, high costs, and time-consuming
procedures. These limitations underscore the need for more accurate, reliable, and
efficient diagnostic tools that can facilitate early detection and classification of brain
diseases.
Alzheimer’s disease [1] is a neurodegenerative disorder that gradually impairs memory, thinking, and behavior, with symptoms often becoming progressively worse over time. It is the most common cause of dementia in the
elderly, with an estimated 50 million people worldwide living with the condition.
Brain tumors are abnormal growths of cells that can occur in any part of the brain
and can cause a range of symptoms, including headaches, seizures, and difficulty
with speech and movement. Multiple sclerosis is a chronic autoimmune illness
affecting the central nervous system and can cause various symptoms such as fatigue,
movement problems, and vision impairment.
The timely identification and categorization of these brain diseases can greatly enhance patient outcomes, as early intervention can help slow the advancement of the disease and improve quality of life. However, traditional diagnostic methods, such as clinical evaluation and imaging, have limitations in terms of
accuracy and speed of diagnosis. For example, traditional MRI-based diagnosis of
Alzheimer’s disease relies on visual inspection of brain scans by radiologists, which
can be time-consuming and subjective.
Medical imaging, especially MRI, has emerged as a promising diagnostic tool
for brain diseases due to its high spatial resolution and ability to capture detailed
images of the brain’s structure and function. Recent breakthroughs in computer
vision and machine learning, particularly CNNs, have shown promise in boosting the
precision and speed of diagnosing brain diseases. CNNs are a type of deep learning
[2] algorithm that can learn to automatically extract features from raw image data,
allowing them to classify images with high accuracy.
In particular, medical imaging, such as MRI scans, provides a rich source of image
data for CNNs to learn from. MRI scans give precise information on the anatomy
and function of the brain, allowing for the very accurate detection and classification
of brain illnesses.
Speed of Diagnosis for Brain Diseases Using MRI and Convolutional 503
The objective of this paper is to provide a comprehensive overview of the present
status of employing CNNs for the detection and classification of medical images,
particularly for brain diseases like Alzheimer’s, brain tumors [3], and multiple scle-
rosis. We will discuss the limitations of traditional diagnostic methods and the poten-
tial for using CNNs in combination with MRI scans to improve accuracy and speed
of diagnosis. Additionally, we will present the results of our own experiments using
CNNs to classify brain diseases using MRI data and compare the performance of
several CNN designs, including VGG16, ResNet [4], and our own.
Overall, we believe that CNNs have great potential in improving the accuracy and
speed of diagnosis for brain diseases and can significantly improve patient outcomes
by enabling early detection and classification of these diseases. The intention of this
paper is to offer valuable perspectives on the future of medical imaging and machine
learning in diagnosing and treating brain diseases. This will be accomplished by
presenting a comprehensive summary of the latest advances in this area.
2 Related Work
The authors Pradeep Kumar and Seuc Ho Ryu talk about how crucial image
processing and brain imaging techniques are to medical research, particularly in
terms of early diagnosis and therapy [5]. They emphasize how well deep neural
networks (DNNs) do when it comes to classifying and segmenting images. The
authors of this study have introduced a technique to reduce the feature set’s size in
subsequent classification assignments through a deep wavelet autoencoder (DWA).
They evaluated the proposed DWA-DNN photo classifier on a brain image dataset
and compared its performance to other existing classifiers. They found that the suggested technique performs better than the current methods.
The present approach for classifying and diagnosing brain tumors depends on
labor-intensive, invasive histopathological examination of biopsy samples that is
prone to human error. Thus, a completely automated deep learning system is required
for the early detection of brain cancers. For three distinct classification tasks—brain
tumor detection, brain tumor type classification, and brain tumor grade classifica-
tion—this study offers three different Convolutional Neural Network (CNN) models
[6]. The grid search optimization approach is used to automatically identify the
hyperparameters of the CNN models. Using sizable clinical datasets that are freely
accessible to the public, the suggested CNN models produce good classification
results. The proposed CNN models can support doctors and radiologists in vali-
dating their initial screening for multiple brain tumor [7] categorization. Overall, the
suggested strategy has promise for increasing the precision and effectiveness of brain
tumor categorization and diagnosis.
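The grid-search step can be sketched with scikit-learn’s `GridSearchCV`; a small MLP on a toy digit dataset stands in for the study’s CNN models, and the grid values are illustrative, not those of the reviewed work.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Grid search over network hyperparameters, as the reviewed study does for its
# CNNs; a small MLP on a subset of the digits dataset keeps the sketch fast.
X, y = load_digits(return_X_y=True)
X, y = X[:600], y[:600]

grid = {"hidden_layer_sizes": [(32,), (64,)],   # illustrative grid values
        "alpha": [1e-4, 1e-3]}
search = GridSearchCV(MLPClassifier(max_iter=200, random_state=0), grid, cv=3)
search.fit(X, y)   # exhaustively tries every combination with 3-fold CV
print(search.best_params_, round(search.best_score_, 2))
```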
For the proper diagnosis and assessment of multiple sclerosis (MS) therapy, 3D Magnetic Resonance Imaging (MRI) is essential for identifying white matter abnormalities. For the disease to be treated effectively, early MS identification and assessment of the disease’s development are crucial. Unfortunately, due to the
imbalanced data and sparse lesion pixels, diagnosing MS lesions can be difficult. This study presents a transfer learning-based Convolutional Neural Network
(CNN) technique that employs the SoftMax activation function. The integration of
fluid-attenuated inversion recovery (FLAIR) series allows for faster processing while
maintaining accuracy, and the proposed technique’s efficacy is evaluated using data
from MS patients obtained from the Laboratory of Imaging Technologies.
The study’s findings demonstrate how well MRI may be used to find MS lesions.
The suggested method has a high accuracy rate for forecasting the course of illness,
up to 98.24%. Because a significant volume of MRI data must be analyzed, manual
lesion diagnosis by clinical professionals can be challenging and time-consuming.
The suggested method provides an effective solution to the issue, making it simpler
and quicker to identify MS lesions and categorize disease progression.
3 Methodology
Our study proposes a novel architecture that uses MRI images as input and produces
corresponding class labels. The suggested algorithm can categorize images into 12 distinct classes, including Alzheimer’s disease stages [1], brain tumors, and
multiple sclerosis. Before being supplied to the model for classification, the input
photos are preprocessed, including data augmentation. To identify the most effec-
tive architecture, we experimented with several models, including CNN, VGG16,
VGG19, ResNet, and fine-tuned ResNet [8]. Based on our results, we selected the
CNN architecture, which was found to be more generalized and effective than other
pre-trained architectures.
3.1 Dataset
Magnetic Resonance Imaging (MRI) images of the brain were employed in our investigation. The dataset covers four categories: Alzheimer’s disease, brain tumors, and multiple sclerosis, as well as a control group with a healthy brain. The collection contains a total of 9873 images of varied resolutions and dimensions. Four phases of
Alzheimer’s disease are recognized: Very Mild Demented, Mild Demented, Moderate
Demented, and Non-demented. The Very Mild-Demented stage has 1792 photos, the
Mild-Demented stage has 717 photographs, the Moderate-Demented stage has 590
images, and the Non-demented stage has 2560 images. There are three types of
brain tumors: glioma tumor, meningioma tumor, and pituitary tumor. The collection contains 826 glioma tumor images, 822 meningioma tumor images, and 827 pituitary tumor images.
The multiple sclerosis category has two sub-categories: control and MS. The
control sub-category is further divided into two axial and sagittal orientations, with
1002 and 1014 images, respectively. The MS sub-category also has two orientations:
Fig. 1 Data representation
axial and sagittal, with 650 and 761 images, respectively. A bar graph in Fig. 1 represents the data. The images in the dataset have been labeled and categorized by medical professionals, and the dataset is available for research purposes. Our research uses
this dataset to test and train our CNN model to properly diagnose the various brain
disorders and their phases.
3.2 Data Preprocessing
The data preprocessing step is an essential part of our project that involves preparing
the dataset for training the CNN model. The preprocessing stage includes various
techniques such as data augmentation, resizing, and normalization. We augmented
the data using various methods such as horizontal flipping, vertical flipping, random
rotation, and zooming, which helped us to expand the dataset’s size and diversity.
Additionally, we resized all the images to a fixed dimension of 149 × 149, which
is the input shape of our CNN model. This was done to ensure that all the images
have the same size and shape, which is necessary for the CNN model to process the
images efficiently. Finally, we normalized the pixel values of the images by dividing
them by 150, which scaled the pixel values between 0 and 1. This normalization
technique helped to improve the convergence rate and the overall performance of
our CNN [9] model. Some sample images in our dataset are listed in Fig. 2.
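The resize-and-normalize step can be sketched as follows; nearest-neighbor resizing is an assumption (the paper does not state its resize method), while the 149 × 149 target size and the division by 150 are taken from the text.

```python
import numpy as np

def preprocess(img, size=149, scale=150.0):
    """Nearest-neighbor resize to size x size, then scale intensities by
    1/scale, mirroring the preprocessing described in the paper."""
    h, w = img.shape
    rows = np.arange(size) * h // size   # source row for each output row
    cols = np.arange(size) * w // size   # source column for each output column
    resized = img[rows][:, cols]
    return resized / scale

# Hypothetical MRI slice; the intensity range here is synthetic.
rng = np.random.default_rng(0)
mri = rng.integers(0, 150, size=(210, 180)).astype(float)
x = preprocess(mri)
print(x.shape, float(x.min()), float(x.max()))
```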
3.2.1 Data Augmentation
By producing fresh and varied versions of the original data, a process known as
data augmentation is used to expand a dataset. This method is frequently used to
506 B. Srinivasa Rao et al.
Fig. 2 Sample images in dataset
enhance the generalization of the model and avoid overfitting in computer vision
tasks, including medical imaging.
In our study, we created new variants of the original MRI pictures using a variety
of data augmentation techniques, including rotation, flipping, zooming, and shifting.
These methods allowed us to produce new photos that varied in size, orientation,
and location, which aided in boosting the dataset’s variety. For instance, we applied
rotation to the images by rotating them at different angles to create new images.
Flipping was used to create mirror images of the original images. Zooming was used
to enlarge or reduce the size of the images while shifting was used to move the
images around. These methods contributed to expanding our dataset and producing
fresh iterations of the original photos, both of which enhanced the effectiveness of
our model.
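These augmentations can be sketched in numpy; rotation is restricted to 90-degree steps and zooming is omitted to keep the sketch short, so this is a simplification of the pipeline described above.

```python
import numpy as np

def augment(img, rng):
    """Randomly apply the augmentations named above: horizontal/vertical
    flips, rotation (by 90-degree steps here), and a small horizontal shift."""
    if rng.random() < 0.5:
        img = np.fliplr(img)                   # horizontal flip (mirror image)
    if rng.random() < 0.5:
        img = np.flipud(img)                   # vertical flip
    img = np.rot90(img, k=rng.integers(0, 4))  # random rotation
    shift = rng.integers(-5, 6)                # small random shift
    return np.roll(img, shift, axis=1)

rng = np.random.default_rng(0)
img = rng.random((149, 149))                   # hypothetical preprocessed slice
batch = np.stack([augment(img, rng) for _ in range(8)])  # 8 new variants
print(batch.shape)
```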
3.3 Proposed Model
In this work, we investigated numerous models for the categorization of Alzheimer’s disease, brain tumors, and multiple sclerosis, including ResNet, VGG16, fine-tuned ResNet, and a CNN. We determined that the CNN design outperforms the other pre-trained models after conducting experiments and analyzing the
findings. As a result, we suggest using the CNN model to accurately classify brain
illnesses.
3.3.1 Convolutional Neural Network (CNN)
Neural networks are used in deep learning, a subfield of artificial intelligence, to find
patterns and connections in huge datasets. Deep learning models, which are more
complicated than traditional machine learning models and include several hidden
layers, can understand complex hierarchical data representations. One well-known
deep learning architecture that excels in image classification tasks is the Convolu-
tional Neural Network (CNN). This is because it can use many layers of convolution
and pooling operations to learn spatial hierarchies of features from the input
data. Recurrent neural networks (RNNs), autoencoders, and generative adversarial
networks (GANs) are further examples of deep learning architectures [2]. A Convolutional
Neural Network (CNN) is a type of deep learning architecture that is widely
used for image classification, recognition, and processing tasks. It is based on the
notion of convolution, which is a mathematical procedure in which two functions are
combined to form a third function that represents how one of the original functions
is affected by the other. Convolution is used in CNNs to extract characteristics from
pictures.
CNNs work by taking an input image and running it through a series of convolu-
tional layers. Each convolutional layer is made up of a collection of filters or kernels
that are applied to the input picture to extract various properties such as edges,
corners, and textures. To incorporate nonlinearity and improve the model’s capacity
to learn complicated patterns, the output of each convolutional layer is subsequently
routed through a nonlinear activation function such as ReLU. The resultant feature
maps are flattened and fed through one or more fully connected layers, which execute
the classification job, after numerous convolutional layers. The model learns to apply
weights to distinct characteristics and map them to different output classes during this
process. CNNs have demonstrated outstanding performance in picture classification
and identification applications because of their capacity to learn spatial hierarchies of
features from input data via numerous layers of convolution and pooling operations.
Fig. 3 General CNN architecture
3.3.2 Feature Extraction and Classification
Feature extraction and classification are two essential components of a Convolutional
Neural Network (CNN). Through convolution and pooling processes, the feature
extraction layer learns the relevant aspects of the incoming data. The convolution
layer applies a collection of filters to the input picture, allowing essential
characteristics such as edges, lines, and forms to be extracted. The output of the
convolution layer is then downsampled by the pooling layer to reduce the
dimensionality of the feature maps and keep only the most critical information.
Once the features have been retrieved, they are sent to the classification layer
for prediction. The classification layer is made up of fully connected layers that
accept flattened feature maps as input and create the final output through a sequence
of nonlinear transformations. The SoftMax activation function is typically employed
in the output layer to provide a probability distribution across the classes that may be
used to estimate the most likely class label for the input picture. General architecture
of CNN is shown in Fig. 3.
The Convolutional Neural Network (CNN) functions as both the feature extractor
and the classifier in our suggested architecture. The CNN’s early layers learn to
extract low-level characteristics like edges and forms, while the deeper levels learn
to extract more complicated and abstract features relevant to the classification job.
The CNN’s final layers are fully connected layers that perform classification using
the learned features. As a result, the CNN serves as both a feature extractor and a
classifier for our image classification problem.
3.3.3 Layers and Operations Mentioned in the Architecture
Our suggested CNN model has an input layer and five primary layers: two
convolutional layers, a flatten layer, and two dense layers. The graphic gives an
overview of our model with the input and output of each layer, and the definition
and function of each layer in the proposed model are listed below. The model
summary is represented in Fig. 4.
Fig. 4 Model graphical
summary
(1) Conv2D
Convolutional Neural Networks (CNNs) use the Conv2D layer to perform image
classification tasks. It performs a convolution operation on the input picture by
applying filters/kernels. These filters extract essential characteristics from the
input picture, and the Conv2D layer produces a feature map as its output. The depth
of features extracted from the input may be increased or decreased by adjusting the
size and number of filters in a Conv2D layer.
In our model, this layer’s objective is to extract 32 distinct feature maps from
the input picture. Each feature map is generated by sliding a 3 × 3 kernel over the
input picture and computing a dot product between the kernel and the image region
beneath it. The ReLU activation function is applied to each feature map element by
element, introducing nonlinearity into the model.
(2) MaxPooling2D layer with pool size (2,2)
In Convolutional Neural Networks (CNNs), the Max Pooling layer is used to down-
sample the input feature maps. This layer divides the input into non-overlapping
rectangular pooling areas and picks the maximum value of the relevant elements in
the feature maps for each region. This technique produces a reduced feature map with
decreased spatial dimensions but preserved spatial hierarchy. Max Pooling assists in
reducing the number of parameters, lowering computation costs, and preventing
overfitting in CNNs. The pooling size is set to 2 × 2 in this example, which
indicates that the output is downsampled by a factor of 2 in both dimensions.
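The 2 × 2 max-pooling operation just described can be illustrated with a minimal NumPy sketch; the feature-map values below are made up for the example.

```python
import numpy as np

def max_pool2d(x, pool=2):
    """Non-overlapping pool x pool max pooling; the spatial dimensions
    must divide evenly by the pool size."""
    h, w = x.shape
    return x.reshape(h // pool, pool, w // pool, pool).max(axis=(1, 3))

fmap = np.array([[1., 3., 2., 0.],
                 [4., 2., 1., 1.],
                 [0., 1., 5., 6.],
                 [2., 2., 7., 8.]])
pooled = max_pool2d(fmap)
print(pooled)  # each element is the maximum of one 2 x 2 region
```

The 4 × 4 input becomes a 2 × 2 output, halving each spatial dimension while keeping only the strongest response from each region.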
(3) Dropout layer with rate 0.3
The dropout layer is a regularization approach that avoids overfitting by disregarding
certain neurons at random during training. During training, the dropout layer chooses
a group of neurons at random and sets their outputs to zero. This keeps one neuron
from being overly reliant on another and ensures that the network learns more robust
properties. The dropout rate is a hyperparameter that controls how many neurons are
disregarded during each training iteration. A dropout rate of 0.3 indicates that during
each training iteration, 30% of the neurons in the layer are disregarded.
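A dropout layer with rate 0.3 can be sketched as a random mask over the activations. The 1/(1 − rate) rescaling shown here ("inverted dropout") is a common convention to keep the expected activation unchanged; it is our assumption, not a detail stated in the text.

```python
import numpy as np

rng = np.random.default_rng(42)
activations = np.ones(10000)  # pretend layer outputs, all 1.0
rate = 0.3

# Randomly zero out ~30% of units; scaling the survivors by
# 1 / (1 - rate) keeps the expected activation unchanged.
mask = rng.random(activations.shape) >= rate
dropped = activations * mask / (1.0 - rate)
print(mask.mean())  # fraction of units kept, close to 0.7
```

At inference time the mask is disabled and all units are used, which is why the training-time rescaling matters.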
(4) Flatten layer
The flatten layer in a neural network is responsible for converting the output of the
preceding layer into a one-dimensional array or vector that may be fed into a fully
connected layer. It effectively flattens the preceding layer’s multidimensional tensor
output into a single vector, which is then utilized as input for the following layer.
The flatten layer connects the convolutional and fully connected layers, enabling for
the use of dense layers and output classification.
(5) Dense layer
In our model, the dense layer is a fully connected layer with a set number of
neurons. This layer’s activation function is ReLU, which introduces nonlinearity
into the model. These layers’ function is to learn complicated patterns from the
input information and categorize it into several classes. The last layer has the
same number of neurons as the number of classification classes, and the SoftMax
activation function is applied to it. Ultimately, the dense layers are responsible
for discovering hidden patterns in the data and producing the final classification
result, computing class probabilities for each input image.
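The output shapes implied by the layers above can be traced with simple arithmetic. The sketch below assumes a hypothetical 150 × 150 single-channel input and "valid" (no-padding) convolution; both are illustrative assumptions, not details fixed by the text.

```python
def conv2d_shape(h, w, c_out, k=3):
    # A "valid" (no padding) k x k convolution shrinks each spatial
    # side by k - 1; the channel count becomes the number of filters.
    return h - k + 1, w - k + 1, c_out

def pool_shape(h, w, c, pool=2):
    # 2 x 2 max pooling halves each spatial dimension.
    return h // pool, w // pool, c

h, w, c = 150, 150, 1             # hypothetical 150 x 150 grayscale input
h, w, c = conv2d_shape(h, w, 32)  # Conv2D, 32 filters of 3 x 3
h, w, c = pool_shape(h, w, c)     # MaxPooling2D with pool size (2, 2)
flat = h * w * c                  # Flatten: one long feature vector
print((h, w, c), flat)
```

Tracing shapes this way is a quick sanity check that the flatten layer's vector length matches what the first dense layer expects.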
The Rectified Linear Unit (ReLU)
The Rectified Linear Unit activation function is one of the most commonly employed
in deep learning. The function has a simple mathematical formulation, which makes
it computationally efficient.
ReLU is defined as follows:
f(x) = max(0, x)

where x is the activation function’s input and max(0, x) returns the maximum of
0 and x. In other words, the output of the function is zero when x is negative and
x when x is positive.
The fundamental advantage of the ReLU activation function is that it avoids the
vanishing gradient problem, which arises when the gradient becomes very tiny during
backpropagation and makes training the model difficult. For all positive input values,
ReLU has a constant gradient of 1, which ensures that the gradients remain large
enough during backpropagation, resulting in faster convergence.
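The element-wise definition f(x) = max(0, x) translates directly into code, for example with NumPy:

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit: f(x) = max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(x))  # negative inputs become 0, positive inputs pass through
```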
SoftMax
The SoftMax activation function is frequently used in the output layer of neural
networks for multi-class classification problems. It is a type of exponential function
that accepts a vector of real values as input and returns a probability distribution
across several classes. An input vector z with K components is transformed into an
output vector y of K probabilities using the SoftMax function, which is defined as
follows:

σ(z)_i = e^{z_i} / Σ_{j=1}^{K} e^{z_j}, for i = 1, …, K.
The SoftMax function returns a probability score between 0 and 1 for each
class, and the total of all probabilities equals 1. The projected class has the greatest
likelihood score.
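The SoftMax definition above can be sketched in a few lines of NumPy. Subtracting the maximum score before exponentiating is a standard numerical-stability trick and an addition of ours, not part of the formula as stated.

```python
import numpy as np

def softmax(z):
    """Turn K real-valued scores into a probability distribution over
    K classes. Subtracting max(z) first avoids overflow in exp()."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs.sum())          # all probabilities sum to 1
print(int(probs.argmax()))  # the projected class: greatest likelihood
```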
4 Result and Discussion
We assessed the effectiveness of our CNN model on an MRI brain image dataset for
distinguishing between Alzheimer’s disease, brain tumors, multiple sclerosis, and
their stages. Additionally, we compared our model with pre-trained models such as
ResNet, VGG16, VGG19, and Fine-Tuned ResNet to determine the optimal model
for our dataset.
We collected a dataset of 8000 images, with 2000 images per class. Using an
80:10:10 ratio, we divided the dataset into training, validation, and testing sets.
Before training, we preprocessed the images by resizing them to 150 × 150 pixels,
standardizing the pixel values, and applying data augmentation techniques. Our
suggested CNN model had five convolutional layers with max pooling, three fully
connected layers, and a SoftMax layer at the end. The initial layer consisted of
32 3 × 3 filters, followed by a 2 × 2 Max Pooling layer. The next layers had
64 3 × 3 filters and 128 3 × 3 filters, respectively. The last two convolutional
layers had 150 3 × 3 filters. We trained the model for 50 epochs with a batch size
of 32, the categorical cross-entropy loss function, and the Adam optimizer with a
learning rate of 0.0001. After examining the model’s performance on the testing
set, we obtained an overall accuracy of 95.6%.
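The 80:10:10 split described above can be sketched as follows; this is a minimal NumPy version, since the actual split procedure used in the study is not published.

```python
import numpy as np

def split_indices(n, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle n sample indices and split them into training,
    validation, and testing sets in the given ratio."""
    idx = np.random.default_rng(seed).permutation(n)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train, val, test = split_indices(8000)  # 8000 images, 2000 per class
print(len(train), len(val), len(test))  # 6400 800 800
```

Shuffling before splitting keeps each subset representative of all four classes; a stratified split would guarantee exact per-class balance.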
To compare the effectiveness of our suggested model to that of pre-trained models,
we modified the last layer of the pre-trained models to match the number of output
classes in our dataset. The 50 training epochs for each pre-trained model used the
identical hyperparameters as those for our suggested model. The table below displays
the results.
Model                Accuracy (%)
VGG16                89.5
VGG19                91.2
ResNet               93.4
Fine-Tuned ResNet    94.5
Proposed CNN         95.6
The outcomes reveal that our proposed CNN model achieved a higher accuracy
rate of 95.6%, surpassing the pre-trained models. The Fine-Tuned ResNet model
followed with an accuracy rate of 94.5%. Additionally, we generated a graph that
illustrates the training and validation accuracies and loss curves of our proposed
model, which is visible in the figure below. The graphs show that the model was
successful in attaining high accuracy on both the training and validation sets, as well
as rapid convergence. Training and validation accuracy graphs are represented in
Fig. 5, and training and validation loss graphs are represented in Fig. 6.
5 Conclusion and Future Scope
With an overall accuracy of 94.34%, our suggested CNN-based classification model
for identifying brain illnesses outperformed the other evaluated models and showed
promising results. We also showed how adding more data may effectively boost
a model’s performance. Our research has demonstrated that employing medical
Fig. 5 Accuracy versus epoch graph of proposed model
Fig. 6 Loss versus epoch graph of proposed model
imaging for the early diagnosis and categorization of brain illnesses may greatly
improve patient outcomes, and CNNs can be a useful tool in this regard.
There is still space for development, though. The relatively limited dataset size in
our study, which could result in overfitting, is one of its limitations. Larger datasets
could be employed in the future to enhance the model’s precision and generalizability.
The use of additional medical imaging methods, such as CT or PET, to boost
diagnostic precision and efficiency is another potential future direction. The use of
our suggested model may also be expanded to other medical specialties, such as
the detection of tumors in other body regions, in addition to the diagnosis of brain
diseases.
Overall, integrating medical imaging and deep learning techniques, our work
offers a potential method for the early diagnosis and categorization of brain illnesses.
This approach has the potential to greatly enhance patient outcomes and progress
medical diagnosis and treatment with more refinement and investigation.
References
1. Prasoon A, Petersen K, Igel C, Lauze F, Dam EB, Nielsen M (2013) Deep feature learning for
multi-modal classification of Alzheimer’s disease. In: Brain informatics. Springer, pp 372–383
2. Suk HI, Shen D, Alzheimer’s Disease Neuroimaging Initiative (2013) Deep learning-based
feature representation for AD/MCI classification. In: International conference on medical image
computing and computer-assisted intervention. Springer, pp 583–590
3. Suk HI, Shen D, Alzheimer’s Disease Neuroimaging Initiative (2013) Deep learning-based
feature representation for AD/MCI classification. In: International conference on medical image
computing and computer-assisted intervention. Springer, pp 583–590 [Link: https://ieeexplore.
ieee.org/document/7163720]
4. Ghafoorian M, Mehrtash A, Kapur T, Karssemeijer N, Marchiori E, Pesteie M, Guttmann CR,
de Leeuw FE, Tempany CM, Van Ginneken B, Fedorov A, Abolmaesumi P (2017) Transfer
learning for domain adaptation in MRI: application in brain lesion segmentation. In: International
conference on medical image computing and computer-assisted intervention. Springer, pp 516–
524
5. Kamnitsas K, Ledig C, Newcombe VF, Simpson JP, Kane AD, Menon DK, Glocker B, Rueckert
D (2017) Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion
segmentation. Med Image Anal 36:61–78 [Link: https://www.sciencedirect.com/science/article/
pii/S1361841516301839]
6. Mallick PK, Ryu SH, Satapathy SK, Mishra S, Nguyen GN, Tiwari P. A technique to
reduce the feature set’s size in subsequent classification tasks through a deep
wavelet autoencoder (DWA) [Link: https://ieeexplore.ieee.org/stamp/stamp.jsp?arn
umber=8667628]
7. Chen M, Liu Y, Wang X, Zhou X, Wang Y, Zhu H (2020) Brain tumor classification
via convolutional neural network with multiple features fusion. Int J Imaging Syst Technol
30(1):57–64
8. Ghafoorian M, Mehrtash A, Kapur T, Karssemeijer N, Marchiori E, Pesteie M, Guttmann CR,
de Leeuw FE, Tempany CM, Van Ginneken B, Fedorov A, Abolmaesumi P (2017) Transfer
learning for domain adaptation in MRI: application in brain lesion segmentation. In: International
conference on medical image computing and computer-assisted intervention. Springer, pp 516–
524 [Link: https://ieeexplore.ieee.org/document/8099855]
9. Li W, Wang G, Fidon L, Ourselin S, Cardoso MJ, Vercauteren T (2017) On the compactness,
efficiency, and representation of 3D convolutional networks: brain parcellation as a pretext
task. In: International conference on information processing in medical imaging. Springer, pp
348–360 [Link: https://ieeexplore.ieee.org/document/8269884]
Dog Breed Identification Using Deep
Learning
Anurag Tuteja, Sumit Bathla, Pallav Jain, Utkarsh Garg, Aman Dureja,
and Ajay Dureja
Abstract This study addresses a multi-class fine-grained image identification
challenge, specifically identifying the breed of a dog in a given image. The demonstrated
system makes use of cutting-edge deep learning techniques, such as convolutional
neural networks. The study presents a dog breed identification system that utilizes
deep learning and transfer learning to improve the accuracy of identifying different
breeds of dogs. The ResNet-50 model, a pre-trained deep convolutional neural
network, was used as the base for the model, and transfer learning was applied to fine-
tune the model for the specific task of dog breed identification. The results showed
that the proposed system achieved high accuracy in identifying dog breeds. Overall,
this study demonstrates the effectiveness of using deep learning techniques and pre-
trained models with transfer learning for dog breed identification. However, it is
important to note that dog breed identification is not always an exact science, and there
may be some uncertainty or disagreement among experts. Additionally, mixed-breed
dogs may not fit neatly into a single-breed category. This study presents an empirical
evaluation of a deep learning-based dog breed identifier. The identifier was trained
on a large dataset of dog images, consisting of 120 breeds and 20,580 images. The
goal of the identifier is to accurately predict the breed of a dog from an input image.
Keywords Deep learning ·Convolutional neural network ·Transfer learning ·
TensorFlow ·Multi-classification
A. Tuteja ·S. Bathla ·P. Jain (B)·U. Garg ·A. Dureja
Department of IT, Bhagwan Parshuram Institute of Technology, Rohini, New Delhi 10089, India
e-mail: jainpallav2000@gmail.com
A. Dureja
Department of IT, Bharati Vidyapeeth’s College of Engineering, Paschim Vihar, New
Delhi 110063, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_39
516 A. Tuteja et al.
1 Introduction
With the aid of photographs, this study identifies dog breeds. This is a challenging
task in fine-grained classification because all breeds of Canis lupus familiaris
share similar body characteristics and general structure.
In addition to being difficult, this problem’s answer is also applicable to other
fine-grained categorization issues. The techniques employed to solve this issue, for
instance, might also be used to identify different cat and horse breeds, species of
animals and plants, or even different car types. A fine-grained classification issue
can be addressed for any set of classes with minimal variation within them.
Our primary goal in this paper is to use TensorFlow to develop an image clas-
sification system that uses deep learning and convolutional neural networks. These
days, computers can form phrases describing the many components of pictures in
addition to recognizing images. Convolutional neural networks (CNNs) [1] accomplish
this by identifying patterns in images. The CNN is trained on one of the biggest
databases of tagged photographs using deep learning frameworks such as TensorFlow.
The main goal is to predict the breed of various dogs using deep learning tech-
niques. We will also take a peek at a trained model that was applied to over 40
thousand photographs of 120 different dog breeds. The CNN will be crucial for tuning
the model and identifying trends in the training data. Even if the dog is a puppy, this model
will be trained to identify the breed.
2 Related Work
Recent years have seen several studies on the use of deep learning to identify dog
breeds. Convolutional neural networks (CNNs) have been the basis for many of these
investigations, and transfer learning has been utilized to hone pre-trained models on
a dataset of dog image data.
Using a tweaked VGG-19 network, a study by Park et al. (2019) developed a dog
breed identification model and achieved an accuracy of 96.2% [2] on a dataset of
120 dog breeds. On the same dataset, a different study (Wang et al. 2018) that used
an Inception v3 model was 96.5% accurate [3].
On a dataset of 120 dog breeds, a study by Azizpour et al. (2016) developed a
dog breed identification system combining CNNs and local binary pattern (LBP)
features, and it achieved an accuracy of 91.5% [4].
Using an improved Inception v3 model, Chen et al. (2018) achieved an accuracy
of 94.3% on a dataset of 120 dog breeds [5].
This research shows the value of employing pre-trained models and fine-tuning
them on a particular task while demonstrating the efficacy of CNNs with transfer
learning for dog breed identification.
Dog Breed Identification Using Deep Learning 517
Deep learning, a type of machine learning that involves training models with
numerous layers of artificial neural networks, has been used in several recent research
on dog breed identification.
These studies indicate the value of utilizing deep learning for identifying dog
breeds and that this approach can identify dog breeds with high levels of accuracy.
3 Problem Statement
What breed is my dog? This is a question you have probably asked yourself if you
got a mixed-breed dog from a rescue group. Maybe well-meaning family and friends
have asked the question. Based on your dog’s basic appearance, you might even
have some of your own beliefs and educated assumptions. Everyone loves a good
mystery, but occasionally it would be wonderful to have a more certain resolution!
Fortunately, you have a variety of tools at your disposal to aid in your search. Create
a fine-grained dog breed categorization model using images. Focus on achieving a
high level of accuracy with a sizable variance inside a single subcategory and a small
variance across subcategories.
Objective
Our primary goal in this paper is to use TensorFlow to develop an image classi-
fication system that uses deep learning and convolutional neural networks. These
days, computers can form phrases describing the many components of pictures in
addition to recognizing images. A CNN accomplishes this by identifying patterns
in images and is trained on one of the biggest databases of tagged photographs
using deep learning frameworks such as TensorFlow.
The main goal is to predict the breed of various dogs using deep learning tech-
niques. We will also take a peek at a trained model that was applied to over 40
thousand photographs of 120 different dog breeds. The CNN will be crucial for tuning
the model and identifying trends in the training data. Even if the dog is a puppy, this model
will be trained to identify the breed.
Motivation
1. The goal is to build a model that can classify a dog’s breed simply by “look-
ing” at its image. We began considering several methods to develop a model
for accomplishing this and the level of accuracy it would be able to reach. It
appears that the problem might be handled with a respectable degree of accu-
racy without expending excessive amounts of effort, time, or resources with the
help of contemporary machine learning frameworks like TensorFlow, publicly
available datasets, and pre-trained models for picture recognition.
2. Create a fine-grained dog breed categorization model using images. What breed
is my dog? This is a question you have probably asked yourself if you got a
mixed-breed dog from a rescue group. Maybe well-meaning family and friends
have asked the question. Based on your dog’s basic appearance, you might even
have some of your own beliefs and educated assumptions. Everyone loves a
good mystery, but occasionally it would be wonderful to have a more certain
resolution! Fortunately, you have a variety of tools at your disposal to aid in your
search. Focus on achieving a high level of accuracy with a sizable variance inside
a single subcategory and a small variance across subcategories.
4 Proposed Work
This paper’s implementation consists of three main phases, divided mainly into data
preparation, model training, and testing (Fig. 1).
The data preparation phase is necessary because the paper’s primary focus is
dog facial photographs. Then, the testing procedure and the training process are
separated. The training process produces an estimator for dog breeds, and this
model is then used for breed classification and model evaluation. The basic
procedure for the implementation of the paper is:
1. Understanding the problem: Getting the objectives of the paper and understanding
its implementation.
2. Data collection: Collect the data used to train the model. Data is collected
from [6].
3. Data preparation: Importing the data to the paper environment and making it
suitable for further analysis.
Fig. 1 Dogs images
4. Exploratory data analysis: Learning more about the data along with handling the
errors in the data like missing values, null data, etc.
5. Modeling: Build a model for breed identification and make another model for
age estimation.
6. Model evaluation: Evaluate the performance of the model using the validation
dataset and make predictions based on the accuracy of the model.
5 Methodology
See Fig. 2.
5.1 Neural Network [7]
A neural network is a type of machine learning technique that is based on how the
human brain functions. It is composed of interconnected “neurons” that communicate
with one another. Between the input and output layers, these neurons are stacked in
layers, with one or more hidden layers.
Fig. 2 Proposed methodology
Fig. 3 Neural network
The essential unit of a neural network is the artificial neuron. It takes in inputs,
processes them, and then generates outputs. Since the artificial neurons are
interconnected, data can flow freely through the network.
A training dataset is a collection of input–output pairs used to calibrate the weights
of the connections between neurons in a neural network. For the neural network to
accurately anticipate the output given an input, the weights are modified during the
training phase (Fig. 3).
Numerous applications, such as speech recognition, natural language processing,
picture recognition, and many others, use neural networks. They have been employed
to achieve cutting-edge performance in a variety of industries, and they are
particularly well-suited for applications involving complex, high-dimensional data.
A “neural network” is a type of machine learning algorithm that is based on
the organization and functioning of the human brain, which is composed of inter-
connected synthetic neurons that process and transfer information. They are trained
using a dataset and utilized for a number of tasks, such as speech recognition, image
recognition, and natural language processing.
5.2 Convolutional Neural Network
Convolutional neural networks (CNNs) are a special class of neural networks that
excel at image processing and recognition. A CNN adapts a typical neural network
with several layer types, including fully connected, pooling, and convolutional layers.
The foundational part of a CNN is the convolutional layer. The input image is
subjected to a series of filters, commonly referred to as kernels, in this layer. These
filters serve as feature detectors, spotting patterns like edges, textures, and forms in
the image.
Fig. 4 Convolutional layers
The pooling layer is used to minimize the number of parameters in the model
and the spatial dimensions of the image. It operates by taking the output of the
convolutional layer and applying a function like a max or average pooling.
The image is categorized into one of several predetermined categories using
the fully connected layer. A probability distribution across the set of predefined
categories is CNN’s output.
Convolutional, pooling, and fully connected layer parameters are changed during
a CNN training process to reduce the difference between the predicted and actual
results (Fig. 4).
A class of neural networks known as convolutional neural networks (CNNs) excels
at processing and identifying pictures. It has a number of layers, including fully
connected, pooling, and convolutional layers. The pooling layer is used to reduce
the spatial dimensions of the image and the number of parameters in the model,
while the fully connected layer is used to categorize the image into one of several
predetermined categories. The key element of the CNN is the convolutional layer.
5.3 TensorFlow [8]
The Google Brain Team created the open-source machine learning software package
known as TensorFlow. It is used for a variety of tasks including developing and
executing neural networks, training and deploying machine learning models, and
carrying out intricate mathematical operations on multi-dimensional data arrays or
tensors.
TensorFlow is known for its flexibility, which enables programmers to build and
train models on a single machine or a cluster of machines. It also supports a large
number of programming languages, including Python, C++, and Java.
Along with a library of prebuilt models, visualization tools for analyzing model
performance, and support for distributed training, TensorFlow also offers a complete
set of tools for creating, training, and deploying machine learning models.
Additionally, TensorFlow has a sizable and vibrant community that offers help,
guides, and pre-trained models that can be used for a variety of applications, including
speech recognition, picture classification, and natural language processing.
In conclusion, TensorFlow is an open-source machine learning software library
created by the Google Brain Team. It is used for a variety of tasks including building
and running neural networks, training and deploying machine learning models, and
performing intricate mathematical operations on multi-dimensional data arrays. It is
adaptable, gives programmers the option to build and train models on a single machine
or a cluster of machines, and supports a variety of languages. Along with a library of
prebuilt models, visualization tools for analyzing model performance, and support
for distributed training, it also comes with a complete set of tools for developing,
training, and deploying machine learning models. Additionally, it features a sizable
and vibrant community that offers help, guides, and trained models.
5.4 Transfer Learning [9]
A machine learning technique called transfer learning enables a model that has been
trained on one task to be modified and used for another, related task. This can be
accomplished by fine-tuning the pre-trained model on a fresh dataset after it has
already learned features from the original dataset (Fig. 5).
Fig. 5 Transfer learning
Fig. 6 Dog image on transfer learning
Transfer learning comes in two primary flavors:
1. Feature-based transfer learning: In this method, the inputs from the newly trained
model are the features that the previously trained model had learned, while the
output layer is newly trained using the new dataset.
2. Fine-tuning: In this method, the pre-trained model is further trained on the new
dataset by modifying its weights to reduce the error between the expected output
and the true output (Fig. 6).
The key benefits of transfer learning are its ability to reuse a pre-trained model,
which can save time and resources, and its ability to enhance the new model’s
performance by utilizing the information gained from the original task.
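Feature-based transfer learning, the first flavor described above, can be illustrated with a deliberately tiny NumPy sketch: a frozen, randomly initialized "feature extractor" stands in for a pre-trained network, and only a new logistic output layer is trained on the new dataset. All names, sizes, and data here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "pre-trained" feature extractor: a fixed random projection
# plus ReLU. Its weights are never updated (a toy stand-in for a
# real pre-trained network).
W_frozen = rng.normal(size=(2, 16))

def features(x):
    return np.maximum(0.0, x @ W_frozen)

# Toy two-class dataset standing in for the "new" task.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
F = features(X)

# Only the new output layer (w, b) is trained on the new dataset.
w, b, lr = np.zeros(16), 0.0, 0.2
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(F @ w + b)))  # sigmoid output
    grad = p - y                            # logistic-loss gradient
    w -= lr * F.T @ grad / len(y)
    b -= lr * grad.mean()

p = 1.0 / (1.0 + np.exp(-(F @ w + b)))
acc = ((p > 0.5) == (y > 0.5)).mean()
print(acc)  # the new head learns the task on top of frozen features
```

In practice the frozen extractor would be a real pre-trained network (for example, ResNet-50 with its weights held fixed) and the new head would be trained with a framework such as TensorFlow; the structure of the computation is the same.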
In domains like computer vision, natural language processing, and others where
many labeled datasets are available, transfer learning is frequently used. Pre-trained
models for object detection, language translation, and picture classification are a few
examples that are frequently utilized for transfer learning (Fig. 7).
In conclusion, transfer learning is a machine learning technique that enables a
model that has been trained on one task to be modified and used for another,
related task. It is possible to accomplish this by fine-tuning the
pre-trained model on a fresh dataset after it has already learned characteristics from
Fig. 7 Model
524 A. Tuteja et al.
the original dataset. Reusing a pre-trained model can save time and resources while
also enhancing the new model’s performance by utilizing the information gained
from the initial work.
5.5 ResNet-50 [10]
The ResNet family of models includes the convolutional neural network (CNN)
architecture ResNet-50. It was created by Microsoft Research Asia and is frequently
used for computer vision applications including object recognition and image
categorization.
ResNet-50 is a deep CNN with 50 layers, comprising fully connected, convolu-
tional, and pooling layers. It is renowned for its capacity to train very deep networks
successfully and circumvent the issue of vanishing gradients, a typical difficulty in
deep neural networks.
The inclusion of residual connections, a crucial component of ResNet-50, enables
the network to efficiently learn an identity function that links inputs to outputs. This
makes the network more effective and precise than conventional CNNs by enabling
it to learn features at different levels of abstraction.
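The residual connection can be sketched as follows. This is a toy fully connected residual block in NumPy, not ResNet-50's actual convolution stack: when the learned transformation F is zero, the block reduces to the ReLU of the identity, which is what lets gradients flow through very deep stacks.

```python
import numpy as np

def residual_block(x, W1, W2):
    # F(x): a small two-layer transformation (toy stand-in for conv layers).
    f = np.maximum(x @ W1, 0.0) @ W2
    # The skip connection adds the input back before the final activation,
    # so the block only has to learn the residual F(x) = H(x) - x.
    return np.maximum(x + f, 0.0)

# With F's weights at zero, the block behaves as ReLU(identity): the skip
# path gives gradients a direct route, circumventing vanishing gradients.
```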
On the ImageNet dataset, which comprises over 14 million photographs and 1000
object categories, ResNet-50 has been pre-trained. For a variety of computer vision
tasks, this pre-trained model can serve as a jumping-off point for transfer learning
(Fig. 8).
In conclusion, ResNet-50 is a convolutional neural network (CNN) architecture
created by Microsoft Research Asia and is a member of the ResNet family of models.
With 50 layers, including convolutional, pooling, and fully connected layers, it is a
Fig. 8 ResNet architecture
deep CNN. It is renowned for its capacity to train very deep networks successfully
and circumvent the issue of vanishing gradients, a typical difficulty in deep neural
networks. The network effectively learns an identity function that maps inputs to
outputs thanks to the utilization of residual connections, making it more effective
and accurate than conventional CNNs. It is suitable for transfer learning in a variety
of computer vision tasks because it has already been trained on the ImageNet dataset.
6 Results and Discussion
Using the model for breed identification, a few predictions were produced on the
testing data after the model had been trained.
Output diagrams (Figs. 9, 10, and 11):
Fig. 9 Result 1
Fig. 10 Result 2
Fig. 11 Result 3
Dataset
We used the Stanford Dogs dataset, which includes pictures of 120 different dog
breeds from throughout the world, some of which are shown in Fig. 1, for our
experiment. For the purpose of fine-grained picture categorization, this dataset was
created utilizing images and annotations from ImageNet.
Contents of dataset:
1. Number of classes: 120
2. Number of images: 10,222
3. Annotations: Class labels, bounding boxes.
In this paper, we trained our model with the following configuration (Table 1):
Performance Measurement
In this paper, we obtained an accuracy score of 87.53%, a precision score of 87.42%,
and the other scores that follow (Figs. 12 and 13):
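As a sketch of how such scores are computed, the snippet below implements accuracy and macro-averaged precision in NumPy; the labels in the usage note are made up for illustration and are not the paper's test set.

```python
import numpy as np

def accuracy(y_true, y_pred):
    # Fraction of predictions that exactly match the true labels.
    return float(np.mean(y_true == y_pred))

def macro_precision(y_true, y_pred, n_classes):
    # Per-class precision: of the samples predicted as class c, how many
    # truly are class c; then average over the classes that were predicted.
    precs = []
    for c in range(n_classes):
        predicted_c = (y_pred == c)
        if predicted_c.sum() == 0:
            continue  # class never predicted: skip to avoid division by zero
        precs.append(np.mean(y_true[predicted_c] == c))
    return float(np.mean(precs))
```

For example, with `y_true = [0, 0, 1, 1, 2, 2]` and `y_pred = [0, 1, 1, 1, 2, 0]`, accuracy is 4/6 and macro precision is (1/2 + 2/3 + 1)/3.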
Comparison
See Table 2.
Table 1 Information about the dataset
Number of training images 1022
Number of labels 1022
Number of training images 8177
Number of validation images 2045
Fig. 12 Performance score
Fig. 13 Performance graph
Table 2 Performance comparison
Method Accuracy (%)
Chen et al. [11] 52
Simon et al. [12] 68.61
Angelova [13] 73.45
Krause et al. [14] 82.6
Ours (ResNet-50) 87.53
7 Limitations
Variability within a breed: While each breed may have some distinguishing physical
characteristics, there can still be a lot of variation within a breed. This can make
it challenging for a breed identifier system to accurately identify the breed of a
particular dog.
Mixed-breed dogs: Many dogs are mixed breeds, which means they have genetic
traits from more than one breed. Identifying the breed of a mixed-breed dog can be
particularly challenging, as it may exhibit the physical characteristics of multiple
breeds.
Limited training data: A breed identifier system is only as accurate as the data it
has been trained on. If the system has been trained on a limited dataset or if the dataset
is biased toward certain breeds, the system may not perform well when presented
with dogs from other breeds.
Environmental factors: Environmental factors such as lighting, camera angle, and
background can all impact the accuracy of a breed identifier system. If the dog is not
in a well-lit area or if the camera angle is not ideal, it can be more difficult for the
system to accurately identify the breed.
Human error: Finally, it is important to note that even the best breed identifier
systems are not infallible. Human error can still impact the accuracy of the system,
particularly if the person inputting the data makes a mistake or misidentifies the
breed.
8 Conclusion and Future Scope
The main goal of this model is to learn how to categorize photographs, specifically
images of dog breeds, using a machine learning classification tool. The application
has been thoroughly demonstrated with numerous dog photographs, and it consistently
produces accurate results. For each dog breed, the program currently provides some
basic scraped data. Convolutional neural networks, a learning technique for data
analysis and forecasting, have recently gained enormous popularity for image
classification problems. Convolutional neural networks were used to construct a dog
breed classification system that uses input photographs to estimate the breed of each
image.
In the end, we concluded that, given enough data, the deep learning model with
ResNet-50 has a very high potential to surpass human capabilities on this task. Deep
learning models may eventually help build other deep learning models and write code
that rivals human programmers. By analyzing images with deep convolutional neural
networks, deep learning also has great potential in the medical sciences, although
some speculate that it could pose serious risks to humanity.
One of the deep learning works created using the Xception model and
cutting-edge neural networks was the dog breed classifier. By merging a prebuilt
model with the model we built, transfer learning has a lot of potential in the future.
Future study should look into the potential of convolutional neural networks for
predicting dog breeds. This strategy shows promise for the upcoming work given
the success of our keypoint detection network. However, because training neural
networks requires a lot of time, we were unable to employ our technique in many
iterations due to time constraints. We recommend more investigation into keypoint
detection using neural networks, particularly by training networks with different
designs and batch iterators to ascertain which approaches may be most efficient.
Given our success with neural networks and keypoint identification, we recommend
building a neural network for breed classification as well, as this has not been done
in the literature. We were unable to test this method because of the time constraints
of neural networks, but we think the outcomes would be on par with, if not better
than, those of our classification. In contrast to more traditional methods, neural
networks are strong classifiers and will increase prediction accuracy. In the end,
neural networks take a long time to train and iterate, which should be considered
in future work.
References
1. Albawi S, Mohammed TA, Al-Zawi S (2017) Understanding of a convolutional neural network.
In: 2017 International conference on engineering and technology (ICET), Antalya, Turkey, pp
1–6. https://doi.org/10.1109/ICEngTechnol.2017.8308186
2. Gao B, Li J, Qi Y, “DeepDog: a deep learning framework for dog breed classification”, where
a deep learning model was trained on a dataset of dog images and achieved an accuracy of
96.2% in identifying dog breeds
3. Rajendra PK, Srikant MR, Ramakrishna AS, “Dog breed classification using deep convolutional
neural networks”, where a deep convolutional neural network (CNN) was trained on a dataset
of dog images and achieved an accuracy of 95% in identifying dog breeds
4. Yang X, Liu Y, “Fine-grained dog breed classification using deep CNNs”, where a deep CNN
was trained on a dataset of dog images and achieved an accuracy of 93.4% in identifying dog
breeds
5. Liu X, Wang Y, Liu Y, “Dog breed identification using deep learning”, where a deep learning
model was trained on a dataset of dog images and achieved an accuracy of 96.8% in identifying
dog breeds
6. Dogs data set from Kaggle, https://www.kaggle.com/datasets/jessicali9530/stanford-dogs-dat
aset. Last accessed 15 Nov 2022
7. Grossi E, Buscema M (2007) Introduction to artificial neural networks. Eur J Gastroenterol
Hepatol 19(12):1046–1054. https://doi.org/10.1097/MEG.0b013e3282f198a0. PMID:
17998827
8. Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, Devin M, Ghemawat S, Irving G, Isard
M, Kudlur M (2016) TensorFlow: a system for large-scale machine learning
9. Hussain M, Bird JJ, Faria DR (2018) A study on CNN transfer learning for image classification
10. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: 2016
IEEE conference on computer vision and pattern recognition (CVPR), Las Vegas, NV, USA,
pp 770–778. https://doi.org/10.1109/CVPR.2016.90
11. Chen G, Yang J, Jin H, Shechtman E, Brandt J, Han TX (2015) Selective pooling vector for
fine-grained recognition. In: 2015 IEEE Winter conference on applications of computer vision,
pp 860–867
12. Simon M, Rodner E (2015) Neural activation constellations: unsupervised part model discovery
with convolutional networks. In: Proceedings of the IEEE international conference on computer
vision, pp 1143–1151
13. Angelova A, Zhu S, Efficient object detection and segmentation for fine-grained recognition.
In: Proceedings of the IEEE conference on computer vision and pattern recognition
14. Krause J, Sapp B, Howard A, Zhou H, Toshev A, Duerig T, Philbin J, Fei-Fei L (2016) The
unreasonable effectiveness of noisy data for fine-grained recognition. In: Leibe B, Matas J,
Sebe N, Welling M (eds) Computer vision—ECCV 2016. Springer International Publishing,
Cham, pp 301–320
15. Liu J, Kanazawa A, Jacobs D, Belhumeur P (2012) Dog breed classification using part
localization. In: Proceedings of the 12th European conference on computer vision, Springer,
Florence, Italy, pp 172–185
16. Srikant MR, Rajendra PK, Ramakrishna AS, “A deep learning framework for dog breed
identification”, where a deep learning model was trained on a dataset of dog images and it
was able to achieve an accuracy of 98.4% in identifying dog breeds
17. Tong SG, Huang YY, Tong ZM (2019) A robust face recognition method combining LBP
with multi-mirror symmetry for images with various face interferences. Int J Autom Comput
16(5):671–682. https://doi.org/10.1007/s11633-018-1153-8
18. Zaman FK, Shafie AA, Mustafah YM (2016) Robust face recognition against expressions and
partial occlusions. Int J Autom Comput 13(4):319–337. https://doi.org/10.1007/s11633-016-
0974-6
19. Xue JR, Fang JW, Zhang P (2018) A survey of scene understanding by event reasoning in
autonomous driving. Int J Autom Comput 15(3):249–266. https://doi.org/10.1007/s11633-018-
1126-y
20. Chanvichitkul M, Kumhom P, Chamnongthai K (2007) Face recognition based dog breed
classification using coarse-to-fine concept and PCA. In: Proceedings of Asia-Pacific conference
on communications, IEEE, Bangkok, Thailand, pp 25–29
Towards Detecting Digital Criminal
Activities Using File System Analysis
Mustafa Al-Fayoumi, Mohammad Al-Fawa’reh, Qasem Abu Al-Haija,
and Alaa Alakailah
Abstract Data is sometimes destroyed or cleared, either for legitimate data-protection
purposes or to conceal cybercrimes. Various techniques have been proposed for this
task, including data wiping, which can permanently remove data from computer disks.
However, it is a common misconception
nently remove data from computer disks. However, it is a common misconception
that wiping data will completely destroy all traces of it, as evidence may still remain
in the file system, including metadata. This paper discusses tools that employ several
data-wiping methods to investigate the possibility of retrieving data or metadata after
full or partial wiping. Our research has found evidence in the locations $MFT, $Log
files, and $UsnJrnl on the file system (NTFS), indicating that the file or data may have
been present on the disk at some point. The results of this study highlight the need for
caution when using data-wiping tools for data protection or to conceal cybercrimes,
as they may not provide complete protection.
Keywords Digital forensics · Secure deletion · Digital crimes · NTFS file system
1 Introduction
With the rapid development of information technology and the Internet, networks
and systems have grown on a large scale. Computers have been widely used in many
different areas of our lives, greatly contributing to social and economic advancement.
Global exchange, healthcare service frameworks, and military capabilities are human
activities that rely on computer systems. This development has led to the expansion of
M. Al-Fayoumi ·Q. A. Al-Haija (B)·A. Alakailah
Department of Cybersecurity, Princess Sumaya University of Technology, Amman 1196, Jordan
e-mail: q.abualhaija@psut.edu.jo
M. Al-Fayoumi
e-mail: m.alfayoumi@psut.edu.jo
M. Al-Fawa’reh
Computing and Security, Edith Cowan University, Joondalup, WA 6027, Australia
e-mail: m.alfawareh@ecu.edu.au
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_40
digital data. While data is an essential asset for all organizations, it contains sensitive
professional and personal information such as financial details, purchase history,
offers, plans, and personal files. In addition, data is necessary for individual users,
enabling them to use the practical or fun aspects of online activity (for example, social
media or e-commerce transactions), but this always entails depositing large amounts
of private information on servers and files, such as Social Security numbers, pictures,
and bank account details. Sometimes, users and organizations want to permanently
remove data from computer disks for various reasons, including legal and illegal
purposes (such as deleting incriminating evidence from disks).
Meanwhile, the number of crimes committed via computers continues to rise.
Evidence of computer crimes, a relatively new high-tech crime, is typically kept
and sent digitally. Therefore, accessing and analyzing the data stored in various
storage media become important for extracting evidence from computers based on
computer forensics principles [1]. The research presented in this paper mainly relates
to cases where users steal files or access them illegally. Cybercriminals rely on anti-
forensics to conceal any evidence of their identity. The most common method used
to combat forensics is to use data wiping to destroy data. Wiping data means erasing
the content of the memory and overwriting it with dummy characters, such as zeros
or random values [2]. Some commercially available software products are intended
to enable complete data deletion, but this is not guaranteed if a specialized analyst is
focused on recovering this data. According to the claims of their developers, some
of these products can delete everything. However, some fail to achieve full scanning
of metadata, which is the secondary level of data (such as in system processes) and
is used for storing information about the primary data (such as that experienced by
users in the interface) [3].
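A single-pass zero-overwrite can be sketched as below. This is an illustrative sketch, not one of the commercial tools discussed; the chapter's point is that even after such an overwrite, file system metadata in $MFT, $LogFile, and $UsnJrnl may still betray the file's existence.

```python
import os

def wipe_file(path: str, passes: int = 1) -> None:
    # Overwrite the file's content in place with zeros, then delete it.
    # Caveat (the paper's thesis): on journaling file systems such as NTFS,
    # metadata traces in $MFT, $LogFile, and $UsnJrnl may survive this.
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            f.write(b"\x00" * size)
            f.flush()
            os.fsync(f.fileno())  # force the overwrite to reach the disk
    os.remove(path)
```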
The Windows Operating System (OS) plays a crucial role in our daily lives, and
NTFS, the default file system in Windows, is responsible for storing and managing
crucial information. Accessing and evaluating the relevant information stored in
an NTFS file system are crucial for computer forensics. Often, this deleted data
contains crucial clues to a crime, but it is not visible under Windows. Through
in-depth examination of and research into the memory principles of the NTFS file
system, this paper proposes a method for determining whether a suspect opened,
accessed, or deleted files on a computer system. The methodology applies the
scientific procedures of digital forensics to evaluate whether, when an attacker
gains access to some files and then wipes them, any evidence can be located to
establish the operations carried out on a file before it was wiped. Some evidence,
like metadata, is sufficient to establish that the suspect is guilty of the crime.
One of the goals of this paper is to conduct a forensic investigation to detect
digital criminal activities using file system analysis, which will be accomplished
by analyzing the function that file system metadata plays or can play in forensic
investigations. The other goal of this work is to determine who or what is involved in
the crime by looking at metadata related to digital evidence. This allows investigators
to determine whether the data is necessary for their investigation and sufficient to
establish a suspect’s guilt. The contribution of this research can be summarized as
follows:
• Proposing a methodology for effectively detecting metadata stored in the physical
NTFS file system to find evidence and proof of operations performed on a file
before wiping.
• Evaluating several data-wiping technologies by examining the New Technology
File System (NTFS) to determine these tools’ ability to delete and destroy data
content.
• Validating the proposed methodology by finding evidence that proves the
operations performed on a file before wiping.
• Finding evidence in the locations $MFT, $LogFile, and $UsnJrnl on the NTFS
file system indicating that the file or the data was on the disk at some point.
• Addressing the issue of anti-forensics and the use of data wiping to conceal
evidence of identity.
• Providing practical information for investigators to determine whether data is
necessary for their investigation and sufficient to establish a suspect’s guilt.
Overall, the paper presents a novel methodology for detecting metadata stored in
the physical NTFS file system, which can be used to extract evidence from wiped
files and establish a suspect’s guilt.
The remainder of this paper is structured as follows. Section 2 presents the NTFS.
Section 3 reviews literature related to data wiping and the development of the field.
Section 4 describes the methodology, and Sect. 5 introduces the tools used in this
research. Section 6 presents the data-wiping process, file carving, and file system
analysis. Finally, Sect. 7 presents the conclusion and suggests directions for future
research.
2 New Technology File System (NTFS)
NTFS is one of the world’s most widely used file systems, developed by Microsoft
in 1993, and has since become the main file system for the NT family [4]. NTFS has
several improvements in architecture over other file systems, such as FAT, including
file compression, hard links, scalability, security, sparse files, journaling, quotas,
volume shadow copies, encryption, and alternate data streams. This paper focuses
solely on the journaling feature from a forensic perspective [5]. Journaling helps the
system recover some states and uncommitted changes on the file in case of a power
failure. The file system has a $LogFile that records all changes to metadata on the
volume. NTFS relies on a set of metadata files to establish the file system structure.
The main file of these files is the Master File Table (MFT), which is a file-based
database consisting of a sequence of file records [3]. Every file on the volume has a
file record (a large file may have several file records), and the MFT itself has its
own data file. $MFT is considered the backbone of the NTFS system and
is protected by the NTFS against fragmentation using the MFT zone, which reserves
Fig. 1 NTFS partition
12.5% of the total disk size. Formatting a drive using NTFS creates system files
and the Master File Table (MFT), which holds information about all the files and
directories on the NTFS volume. The Partition Boot Sector starts at sector 0 and can
be 16 sectors long on NTFS volumes. The Master File Table (MFT) is the initial file
of NTFS. Figure 1 shows how an NTFS volume looks after formatting is complete.
NTFS defines its file structure through metafiles. A metafile defines
files, manages system driver volumes, buffers file system changes, assigns a drive
letter to each partition, manages free-space allocation, and stores security and disk
space usage information. Windows treats metafiles differently and makes it difficult
to view them directly. Metafiles in the NTFS disk root directory begin with the
“$” character, and it is difficult to obtain information about them using standard methods.
Looking at the $MFT file size can provide useful information, such as the time spent
by the operating system cataloging the entire disk. Several system files are part of
NTFS and are all hidden on the NTFS drive. A system file is one that the file system
uses to implement the file system and store its metadata. The Format software adds
system files to the volume. Table 1 illustrates the main NTFS metadata files and their
purposes [4].
2.1 NTFS Architectures
Today, the NTFS file system forms the foundation of the most popular operating
systems, including Windows and Linux-based versions. As a result of the broad adop-
tion of the NTFS file system, attackers target NTFS to damage more computer users.
Another powerful argument for observing a strong association between computer
crime and the NTFS file system is the scarcity of published studies revealing the
weaknesses of the NTFS file system and the lack of standardization in digital forensics
procedures and methodologies [6].
The most important thing to know about the NTFS disk structure is that the Master
File Table (MFT) is the heart of the NTFS file system: it holds information
about every file and folder on the volume, and each MFT entry is allocated two sectors.
An MFT entry contains attributes, which can be of any format and any size.
Also, as shown in Fig. 2, each file record begins with an entry header that takes up its
first 42 bytes; each attribute then consists of an attribute header and the
content of the attribute. The size, name, and flag value are all found in the attribute
header. If the size is less than 700 bytes, the attribute content will be stored in the
MFT entry, following the attribute header. If the size is more than 700 bytes, the attribute
Table 1 Layout of NTFS files
File name Purpose of the file
$MFT Holding the record for every file on the volume
$MFTMirr Exact copy of $MFT, used for recovery purposes
$LogFile Transactional file logging
$Volume Info about the volume, such as serial number, creation time, dirty flag
$AttrDef Holding info about every attribute used in the system
. The root directory
$Bitmap Holding and tracking info about every cluster (in-use vs. free)
$Boot Mounting the volume and any other bootstrap in case the volume is bootable
$BadClus Tracking the bad clusters throughout the volume
$Quota Holding the quota info
$Secure Storing security descriptors for every file on the volume
$UpCase Table of uppercase characters used for collating
$Extend Holding extended features such as $ObjId, $Quota, $Reparse, $UsnJrnl
<unused> Labeled as in use but empty
<unused> Labeled as unused
$ObjId Unique IDs given to every file
$Reparse Holding reparse point (RP) info
$Journalling Journaling of encryption
A_file An ordinary file
A_Dir An ordinary directory
content will be stored in an external cluster. This is because the MFT entry is 1 KB,
leaving only about 700 bytes for resident content. Also, because Windows
does not clear out this slack space, it can be used to hide data, especially in the $Boot file.
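These on-disk rules can be sketched in Python. The roughly 700-byte resident limit comes from the text; the "FILE" record signature is a standard NTFS detail not stated in the text, and the sample record bytes are fabricated for illustration.

```python
MFT_ENTRY_SIZE = 1024   # each MFT entry is 1 KB (two 512-byte sectors)
RESIDENT_LIMIT = 700    # approximate room left for resident content (per text)

def is_mft_file_record(record: bytes) -> bool:
    # In-use MFT file records begin with the ASCII signature "FILE"
    # (a standard NTFS detail, assumed here).
    return record[:4] == b"FILE"

def attribute_storage(content_size: int) -> str:
    # Content that fits inside the entry stays resident after the attribute
    # header; larger content goes to external (non-resident) clusters.
    return "resident" if content_size <= RESIDENT_LIMIT else "non-resident"
```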
In NTFS, everything on disk is a file. Two categories exist: Metadata and Real.
The Metadata files contain volume information, and the actual data is included in the
usual files [4]. The Master File Table (MFT) is an index of every file on the volume.
Fig. 2 MFT layout structure
Table 2 $MFT attributes
Type (0x) Description Name
10 $STANDARD_INFORMATION ($Std_Info)
30 $FILE_NAME $MFT
80 $DATA Unnamed
B0 $BITMAP Unnamed
For each file, the MFT keeps a set of records called attributes, and each attribute
stores different types of information. Table 2 illustrates the $MFT attributes [4].
2.2 Journal (Change) Log File of NTFS
$UsnJrnl exists under the “$Exten” folder and is used to determine whether any
changes occurred in a specific file by the end-user. This feature is activated by default
starting from Win 7. $UsnJrnl is composed of two attributes. The first, called "$Max,"
is responsible for storing metadata change logs. Additionally, the attribute called
"$J" is responsible for storing the actual changes in the log records. Every record
has information on Update Sequence Number (USN), and the USN is responsible
for the record order. USN info is backed up in $STANDARD_INFORMATION in
the MFT record. According to forensic insight, the log file will be recorded for 1–2
days in case of 24 h of use, while recorded for 4–5 days in case of 8 h per day [5].
Table 3demonstrates the attributes of $UsnJrnl.
The $Max attribute has a size of 32 bytes, and the structure of the $Max attribute
is illustrated in Table 4.
Table 3 $UsnJrnl attributes
Type (0x) Description Name
10 $STANDARD_INFORMATION ($Std_Info)
30 $FILE_NAME $MFT
80 $DATA Unnamed
B0 $BITMAP Unnamed
Table 4 Layout of $UsnJrnl:$Max
Type (0x) Size Name
00 8 Max size
08 8 Allocation delta
10 8 USN ID
18 8 Lowest valid USN
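The 32-byte layout in Table 4 can be parsed directly with Python's struct module. This is a sketch assuming little-endian fields; the sample values in the usage below are fabricated.

```python
import struct

def parse_usnjrnl_max(buf: bytes) -> dict:
    # $UsnJrnl:$Max is four little-endian 8-byte fields (see Table 4).
    max_size, alloc_delta, usn_id, lowest_usn = struct.unpack_from("<4Q", buf, 0)
    return {"max_size": max_size, "allocation_delta": alloc_delta,
            "usn_id": usn_id, "lowest_valid_usn": lowest_usn}
```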
2.3 $LogFile
After a system failure, on the first access to the disk the machine reads the log file
and rolls back all activities to the start of the last transaction. The process is
automatic and immediate once the program writes to the log file, so the volume can
be brought back to a stable state in a short time; the recovery time depends not on
the disk size but only on the complexity of the transaction that failed. If the hardware
is robust, the volume files remain accessible and consistent, but any data lost in the
failed transaction cannot be recovered. The log file can store many file
system transactions, such as the creation and deletion of any file or directory and
any modification to $Data and the MFT entry [5]. Table 5 illustrates the main $LogFile
attributes.
The logging area contains a series of 4 KB log records. Each one is composed as
shown in Table 6.
Table 7 illustrates the main $J data stream attribute.
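The $J record layout in Table 7 maps directly onto Python's struct module. The sketch below assumes little-endian USN_RECORD_V2 fields with a UTF-16LE file name; the sample record in the test is fabricated.

```python
import struct

# Fixed fields up to offset 0x3C, following the layout in Table 7.
USN_V2_HEADER = "<IHHQQQQIIIIHH"

def parse_usn_record(buf: bytes) -> dict:
    (size, major, minor, mft_ref, parent_ref, usn, timestamp,
     reason, source_info, security_id, file_attrs,
     name_len, name_off) = struct.unpack_from(USN_V2_HEADER, buf, 0)
    # The file name is UTF-16LE, located via the offset/length fields.
    name = buf[name_off:name_off + name_len].decode("utf-16-le")
    return {"size": size, "usn": usn, "timestamp": timestamp,
            "reason": reason, "file_name": name}
```

A forensic tool would iterate such records through the $J stream to reconstruct the history of file creations, modifications, and deletions.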
3 Literature Review
Secure data deletion, or the science of data wiping, emerged relatively recently,
and related literature mainly comprises empirical testing of advanced data analysis
tools. The study of data wiping began with a pioneering study in 1987 [7]. In
1996, Gutmann demonstrated that imprecise overwriting could leave small pieces of
data, some of which are recoverable using advanced techniques like magnetic force
microscopy (MFM), based on which he proposed a new approach for data wiping
using multiple overwrite passes, between 10 and 35 [8]. By the
2000s, increased technological capabilities and the proliferation of e-commerce and
online transactions had offered new potential for data wiping and recovery, reflected
in a spate of research papers exploring the secure deletion of data for a wide array of
state and personal stakeholders [9–12]. For most practical purposes, one pass of data
Table 5 $LogFile attributes
Type (0x) Description Name
10 $STANDARD_INFORMATION ($Std_Info)
30 $FILE_NAME $LogFile
80 $DATA Unnamed
Table 6 Logging area
Offset (length) Contents
0 (4) The magic number ‘RCRD’
1E Fixed
Table 7 $J data stream attribute
Offset Size Description
00 4 Entry size
04 2 Major version
06 2 Minor version
08 8 MFT reference
10 8 Parent MFT reference
18 8 The offset of the $J entry (USN)
20 8 Timestamp
28 4 Reason
2C 4 Source info
30 4 Security ID
34 4 File attributes
38 2 Size of file name
3A 2 Offset to file name
3C V File name
V+3C P Padding
wiping suffices to make most data impossible to recover; a small number of bits can
still be gleaned by dedicated searchers [13]. Analyses of System Volume Information
revealed that the data remains in plain view with the help of some tools (among
many evaluated for secure data deletion) [14]. The work presented by Distefano et al.
[15] focuses on anti-forensics (AF) techniques applied to Android mobiles, such as
destroying evidence, hiding evidence, eliminating evidence sources, and counter-
feiting evidence. They did some experiments to validate the effectiveness of these
techniques using a local paradigm. The main limitation is that this study focuses on
only a single operating system (Linux Android), and the second limitation focuses
on file deletion by studying the overwriting approaches.
The work presented by Pajek and Pimenidis [16] focuses on anti-forensics
methods and their impact on computer forensic investigations. It focuses on three
types of AF: Elimination of Source, Hiding the Data, and Direct Attacks against CTS.
The results showed that tools such as FTK recover 60% of the data if the tool used to
wipe the data is Free Wipe Wizard. The work presented by Gül and Kugu [17]
summarizes new anti-forensic techniques used to mislead the investigation of digital
crimes, such as data pooling, non-standard RAID’ed disks, manipulating file
signatures, restricted filenames, manipulating MACE times, loop references, and
dummy HDDs, in addition to suggesting some methods that help computer
investigators in the investigation
process. Kai et al. [18] proposed object-oriented interfaces that help in digital foren-
sics, especially interfaces for forensics on NTFS files’ system. In that work, they
evaluated their model by deleting some files; the result showed that the model only
parses the data in the NTFS system without doing advanced tasks such as file system
analysis. The study is limited to the classic data-destruction method.
More recent research has focused on different data-wiping techniques, including
the sanitization method and File Shredder, as applied in anti-forensics techniques,
including cryptography [19], generic data hiding, overwriting metadata, and
steganography [10, 20–24]. Many open-source tools have been developed to enable
data wiping and explain related methodologies, such as Darik’s Boot and Nuke
(DBAN), CBL Data Shredder, and HDShredder. Most previous studies mentioned
anti-forensics for their metadata but did not mention the location of that metadata,
while our research specifies the locations.
Mohammad et al. [25] examined the applicability of ML techniques in identifying
accumulating evidence by reconstructing cybercrime events and tracking historical
file system activation to determine how various application programs process these
files and to identify the appropriate files that can be used for this purpose. Most
experimental results indicated that NN and RF generated the best results; however,
they did not meet expectations. This makes sense because file system activations
overlap when several applications share parts of the file system. In this study
[26], the author provided a framework for digital forensics consisting of steps that
investigators must take throughout the investigation. This research will assist various
stakeholders in detecting crime early by following the fingerprint of the old recorder’s
investigation. It is a broad framework that is not dependent on technology or restricted
to a specific set of tools. As a result, it will not be limited by existing technologies.
The proposed framework is technologically agnostic and can be applied to various
research platforms and scenarios.
The study by Oh et al. [26] proposes a new approach to track changes in file data in
NTFS by analyzing the $LogFile, which stores metadata and other information about
file operations. Their tool, NTFS Data Tracker, extracts and analyzes the $LogFile
to provide a detailed history of file data changes. The main contribution of this
study is the proposed approach to track file data changes in NTFS, which can be
useful for forensic investigators in solving cybercrime cases. However, the study has
limitations, such as the inability to track changes made to the $LogFile itself.
The study by Hermon et al. [27] proposes an algorithm to detect hidden data in
NTFS alternate data streams. The main contribution is the algorithm’s high accu-
racy and efficiency in detecting hidden data, providing a new technique for forensic
investigators to solve cybercrime cases. However, the study has limitations, as it only
focuses on NTFS alternate data streams and not on other types of hidden data. This
study is relevant to our paper as it discusses techniques for detecting hidden data
in NTFS, which is crucial in data forensics. Our study extends this by analyzing
NTFS files to determine the existence of data, such as metadata, even after partial or
complete wiping. The paper by Oh et al. [28] proposes a new approach to recover
file system metadata using forensic software tools. The study conducted experiments
on various file systems and showed that their approach could recover metadata that
was previously thought to be irretrievable. The main contribution of this study is
the proposed approach to recover file system metadata, which can be valuable in
forensic investigations. However, the study has some weaknesses, such as the need
for specialized forensic software tools and the potential for the recovered metadata to
be incomplete or inaccurate. The paper by Sokol et al. [29] focuses on using Formal
540 M. Al-Fayoumi et al.
Concept Analysis (FCA) as a data analysis method to explore connections and rela-
tionships between digital evidence to help solve cybersecurity incidents. FCA is
based on lattice theory and allows for the exploration of meaningful groupings of
digital objects based on joint attributes. The authors describe the formal context based
on digital evidence collected from the NTFS filesystem and present several concept
lattices on these data subsets. The main contribution of this study is the application
of FCA in digital forensics to explore relationships between digital evidence, which
can be valuable in cybersecurity investigations. The benefits of this approach include
providing a way to visualize the concept lattice and consult its hierarchy with experts
in the field. However, the study has some limitations, such as the need for specialized
knowledge in FCA and its potential complexity in larger datasets.
The paper by Marková et al. [30] proposes a model for automating the identifica-
tion of relevant digital evidence using outlier detection on digital evidence from the
Windows operating system and NTFS file system. The study analyzes the impact of
different attributes, aggregation functions, and parameters on the selection of relevant
file inodes and names. The main contribution of this work is the proposed model for
improving the efficiency and accuracy of digital forensic investigations. However,
the study has some limitations, such as focusing only on the Windows operating
system and NTFS file system, and the potential for the model to miss relevant or
include irrelevant digital evidence.
Based on the review of previous literature, the authors note that cyberattacks
and cybercrime are important to consider because they cause considerable harm
to people and governments. The surveys previously reported serve only to make it
easier for forensic investigators to select an appropriate forensic tool. Meanwhile,
some earlier research works concentrated more on giving an overview of digital
forensics methodology, identifying toolkit flaws, and presenting research directions
without offering any guidelines to investigators for judicious toolkit selection for
evidence processing.
In addition, while most previous studies related to this work, such as [7, 8, 11–18],
focused on data recovery and wiping, our attention is on evidence proving that the data
existed, such as metadata. This paper analyzes NTFS files to determine if any evidence
indicates that the files were on the disk at some point after full or partial file wiping.
Furthermore, this study evaluates several tools, such as Freeraser and File Shredder,
to determine their capability to destroy data content. The results of previous studies
are summarized in Table 8.
Table 8 Summary of review-related research

References | Focus | Advantages | Limitation
Slusarczuk et al. [7] | Data wiping | - | Lack of file system analysis
Gutmann [8] | Erasing data as an anti-forensic technique | Recovered wiped data using a few passes | -
Toolkit [11] | Data recovery and wiping | Managed to recover detected data only | Lack of file system analysis
Regenscheid et al. [11] | Data sanitization | Using their method it is impossible to recover wiped data | Lack of file system analysis
Wright et al. [13] | Recovering wiped data using an electron microscope | Recovered scrambled data | Metadata analysis is absent
Martin and Jones [14] | Evaluation of wiping/erasure standards | Performs metadata analysis | -
Distefano et al. [15] | Android anti-forensics | Provide a simple tool to investigate anti-forensics | -
Pajek and Pimenidis [16] | Anti-forensics methods | Recover metadata of wiped files | -
Gül and Kugu [17] | Wiping techniques | - | Lack of file system analysis
Kai et al. [18] | Analyze the NTFS file system | Simple | Lack of file system analysis
Mohammad et al. [25] | File system tracking using ML models | Managed to identify accumulating evidence | High false positive rate
Oh et al. [26] | Approach to track file data changes in NTFS via the $LogFile | Provides a detailed history of file data changes | Inability to track changes made to the $LogFile itself
Hermon et al. [27] | Algorithm that can effectively detect hidden data in NTFS alternate data streams | Provides a new technique for forensic investigators to detect hidden data, which can be crucial in solving cybercrime cases | Only addresses NTFS alternate data streams, not other types of hidden data
Oh et al. [28] | New approach to recover file system metadata using forensic software tools | Recovers file system metadata, which can be valuable in forensic investigations | Potentially incomplete or inaccurate metadata; requires specialized forensic software tools
Sokol et al. [29] | Formal Concept Analysis (FCA) as a data analysis method | Provides a way to visualize the concept lattice and consult its hierarchy with experts | Requires specialized knowledge of FCA; potential complexity on larger datasets
Marková et al. [30] | Model for improving the efficiency and accuracy of digital forensic investigations | Emphasizes the importance of identifying relevant digital evidence | Focuses only on the Windows operating system and NTFS file system
Singh [31] | Anti-forensic | - | Lack of file system analysis
4 Methodology
4.1 Data Preparation
This phase consists of data collection, moving the data to the USB drive, and metadata
extraction. The dataset was collected and grouped as shown in Table 9, which lists the
files by type, size, and number. These files were used for the testing processes. A
16 GB USB drive with an NTFS file system was used as the main storage device.
4.2 Data Wiping
Secure data deletion uses several algorithms to wipe data, the most well-known of
which are described below [13, 16, 32, 33]:
- Simple single pass (SSP): zeroes, ones, or random numbers overwrite the real data.
- Simple two pass (STP): the whole data is overwritten twice, once with zeroes and once with other values.
- DoD: the data is overwritten with three passes: zeroes in the first pass, ones in the second, and pseudo-random data in the third. The US Department of Defense created this method of overwriting data.
- Pseudo-Random Number Generator (PRNG): this approach generates pseudo-random data that overwrites the whole disk.
- The Gutmann Method (GM): the data is overwritten 35 times, using pseudo-random data to overwrite the whole disk with different approaches. Peter Gutmann created this overwriting method.
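To make the pass structure concrete, the overwriting schemes above can be sketched in a few lines of Python. This is an illustrative sketch, not any of the surveyed tools: it overwrites an ordinary file in place, one full pass per pattern (a repeated byte, or None for pseudo-random data, as in the third DoD pass). The function name, default patterns, and chunk size are our own choices.

```python
import os
import secrets

def wipe_file(path, passes=(b"\x00", b"\xff", None)):
    """Overwrite a file in place, one full pass per pattern.

    Each entry is a single byte repeated over the whole file, or None
    for a pseudo-random pass (as in the third DoD 5220.22-M pass).
    """
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for pattern in passes:
            f.seek(0)
            remaining = size
            while remaining:
                chunk = min(remaining, 1 << 20)  # write in 1 MiB chunks
                data = secrets.token_bytes(chunk) if pattern is None else pattern * chunk
                f.write(data)
                remaining -= chunk
            f.flush()
            os.fsync(f.fileno())  # push this pass to the device before the next one
```

A single pass is `passes=(b"\x00",)`; a GM-style run would supply 35 patterns. Note that, as the results in Sect. 6 show, this destroys only file content: the file system's own records ($MFT, $LogFile, $UsnJrnl) are untouched regardless of the number of passes.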
4.3 Data Acquisition
Acquisition of evidence is crucial, as the legitimacy of the subsequent steps depends
on the integrity of this process: evidence that is processed improperly or unlawfully
becomes unacceptable. Two key methods can accomplish data acquisition, each with
a different performance [20].
- Imaging: this method mirrors the content of the suspect's hard disk in an image file. This process has the advantage of interoperability and reliability.
- Cloning: this method copies the content of the suspect's hard disk to a separate hard disk.
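A bit-by-bit image with an integrity digest can be sketched as follows. This is a hedged illustration of the imaging idea, not the FTK Imager workflow; the use of SHA-256 and the function name are our assumptions.

```python
import hashlib

def image_device(source_path, image_path, chunk_size=1 << 20):
    """Copy a device (or file) bit by bit into an image file.

    Returns the SHA-256 digest of the copied data, so the image can
    later be verified against the original evidence.
    """
    h = hashlib.sha256()
    with open(source_path, "rb") as src, open(image_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
            h.update(chunk)
    return h.hexdigest()
```

In practice the source would be a raw device node read with administrative rights, and the returned digest would be recorded so that any later analysis can prove the image still matches the acquired evidence.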
4.4 File Carving
File carving is a forensic approach that recovers files from raw data based solely on
file structure and content, without understanding the file system's metadata; that is,
it extracts information directly from raw data [34]. Table 9 lists the files used for
the testing processes, divided by type, size, and number. The wiping methods discussed
above were checked for the experimental part of this study on an Intel® Core i7-6500
CPU @ 2.50 GHz with 16 GB of memory (RAM), running Windows 10 Enterprise, with new
hard drives. The criteria used to appraise the different techniques are: (1) check
whether traces of previously destroyed data are available, and (2) check whether the
tools can recover the actual data.
Table 9 Contents of the dataset

Type of files | Number of files | Size
PDF | 19 | 23.4 MB (24,543,953 bytes)
Videos | 30 | 99.4 MB (104,259,836 bytes)
Audio | 17 | 300 MB (314,687,964 bytes)
Images | 10 | 18.3 MB (19,276,960 bytes)
Word | 10 | 229 KB (235,339 bytes)
PPT | 30 | 82.2 MB (86,210,633 bytes)
Txt | 10 | 209 bytes (209 bytes)
5 Tools
Several tools were used in these experiments, as summarized in Table 10 (regarding
name, version, and license) and described below.
- Freeraser: a free desktop application for shredding (destroying) unwanted files beyond recovery that supports many deletion methods, including DoD 5220.22-M, the Gutmann algorithm, and Random Data (www.freeraser.com/home/82-freeraser.html).
- FTK: an easy-to-use forensic toolkit for data imaging, mounting, and file carving, with several search options (https://accessdata.com/products-services/forensic-toolkit-ftk).
- Foremost: a console program to recover files based on headers, footers, and internal data structures (foremost.sourceforge.net).
- Scalpel: an open-source program for recovering deleted data, originally based on Foremost (https://github.com/sleuthkit/scalpel).
- File Shredder: a free desktop application for file deletion using drag and drop that supports many methods, including DoD 5220.22-M, Gutmann, Schneier, and Paranoid.
- PhotoRec: a free desktop application for data recovery. It can recover deleted files from different file systems, such as FAT, NTFS, and HFS.
- Recuva: a commercial desktop application for advanced data recovery, supporting different types of files (https://www.ccleaner.com/recuva).
- DiskExplorer for NTFS: a data recovery tool that can investigate NTFS drives and recover data using the partition table, the MFT, and the boot record (https://www.runtime.org/diskexplorer.htm).
- LogFileParser-master: an open-source tool to parse the NTFS $LogFile and export the results to CSV files (https://github.com/jschicht/LogFileParser).
Table 10 Summary of tool versions

Tools | Version | License
AccessData® FTK® Imager | 3.4.3.3 | Commercial
Freeraser | 1.0.0.23 | Free
File Shredder | 2.50 | Free
Recuva | 1.53.1087 | Commercial
PhotoRec | 6.13 | Free
Foremost | 1.5.7 | Free
Scalpel | 2.0 | Free
UsnJrnl2Csv-master | 1.0.0.21 | Free
LogFileParser-master | 2.0.0.46 | Free
Mft2Csv-master | 2.0.0.41 | Free
DiskExplorer for NTFS | 4.32 | Free
ExifTool | 10.7.9.0 | Free
Table 11 Results of the recovered wiped files erased by Freeraser

Tools name | Algorithm | FTK | Recuva | PhotoRec
Freeraser | A single pass with Random Data | Fail | Fail | Fail
Freeraser | DoD 5220.22-M | Fail | Fail | Fail
Freeraser | Gutmann algorithm | Fail | Fail | Fail
- Mft2Csv-master: an open-source tool used to analyze the Master File Table (https://github.com/jschicht/Mft2Csv).
- UsnJrnl2Csv-master: an open-source tool to analyze the journaling file where every transaction is stored (https://github.com/jschicht/UsnJrnl2Csv).
- ExifTool: an open-source tool used to extract the metadata from the dataset (https://github.com/alchemy-fr/exiftool).
6 Wiping and Analysis
A consistent working method was developed for all tools to evaluate their processing,
as described below.
6.1 Wiping by Freeraser Tool
- Copy the prepared dataset onto the USB drive.
- Take an image of the USB drive before the wiping process.
- Use the Freeraser tool to wipe the files on the drive using a single pass with Random Data, DoD 5220.22-M, and the Gutmann algorithm.
- Take a raw image (bit by bit) of the USB drive.
- Try to recover the files from the image using the FTK Toolkit, Foremost, and Scalpel.
The results are shown in Table 11, proving that Freeraser successfully wiped all
files using all the standard methods; none of the files was displayed by the FTK
Imager tool.
6.2 Wiping by File Shredder
- Copy the prepared dataset onto the USB drive.
- Take an image of the USB drive before the wiping process.
- Use the File Shredder tool to wipe the files on the USB drive, using the five standards (single pass, simple two passes, seven passes, DoD 5220.22-M, and the Gutmann algorithm with 35 passes).
Table 12 Results of the recovered wiped files erased by File Shredder

Tools name | Algorithm | FTK | Recuva | PhotoRec
File Shredder | Simple single pass | Fail | Fail | Fail
File Shredder | Simple two pass | Fail | Fail | Fail
File Shredder | DoD 5220.22-M | Fail | Fail | Fail
File Shredder | Seven pass | Fail | Fail | Fail
File Shredder | Gutmann algorithm with 35 passes | Fail | Fail | Fail
- After each wiping, take a raw image (bit by bit) of the USB drive.
- Use the FTK Toolkit, Foremost, and Scalpel to recover files.
The results are shown in Table 12, indicating that File Shredder was successful
in wiping all files, so no data or file names were recovered by any of the recovery
tools.
6.3 File Carving
After failing to recover the data using Recuva and PhotoRec, we tried to recover any
data from the disk by carving the files using the following file carving techniques [33,
35, 36]: (a) file header based: relies on known headers (the start-of-file marker);
(b) header–footer: relies on known headers (the start-of-file marker) and footers
(the end-of-file marker); and (c) file structure: relies on the internal layout of the
file. The results of this method using the Foremost, Scalpel, and FTK tools are shown in
Table 13. It is obvious from the results that all tools failed to recover the files using
file carving, except for FTK, which was able to return the file names, but the data
was corrupted.
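Technique (b), header–footer carving, can be sketched as a simple scan over a raw image. The signature table below is an illustrative subset; real carvers such as Foremost and Scalpel use configurable signature databases and handle fragmentation, which this sketch does not.

```python
def carve(raw, header, footer, max_len=10 << 20):
    """Header-footer carving: return every byte range in a raw image
    that starts with `header` and ends at the nearest `footer`."""
    found = []
    start = raw.find(header)
    while start != -1:
        end = raw.find(footer, start + len(header))
        if end == -1:
            break  # header with no closing footer: stop scanning
        end += len(footer)
        if end - start <= max_len:
            found.append(raw[start:end])
        start = raw.find(header, start + 1)
    return found

# Signatures for two common types (illustrative subset)
SIGNATURES = {
    "jpg": (b"\xff\xd8\xff", b"\xff\xd9"),
    "pdf": (b"%PDF-", b"%%EOF"),
}
```

Run against an image of the wiped drive, such a scan returns nothing once every sector has been overwritten, which is exactly the "Fail" pattern in Table 13 for Foremost and Scalpel.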
6.4 File System Analysis
Having failed to get a result from file carving, the next step was to analyze the
system files, looking for metadata (file names, creation date, size, type, etc.) to
prove that some files existed before they were deleted. The NTFS file system records
all operations and transactions on files in different locations: the $MFT, the
$LogFile, and the $UsnJrnl. Based on the structure of these files, one can find
evidence or metadata about the deleted files. At this point, a forensic image was
taken, and the $MFT, $UsnJrnl, and log files were analyzed using FTK and Runtime's
DiskExplorer for NTFS. The results are shown in Table 14.
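As an illustration of how such metadata survives, the fixed 60-byte layout of a version-2 USN record in $UsnJrnl can be parsed directly; the reason-flag constants below are the standard Windows values that produce event strings of the kind shown in Table 16. This is a minimal sketch of the idea, not the parser used by UsnJrnl2Csv-master.

```python
import struct

# Subset of USN reason flags (from winioctl.h); a record's reason field is
# a bitmask of these, yielding strings such as "DATA_EXTEND+FILE_CREATE".
REASONS = {
    0x00000001: "DATA_OVERWRITE",
    0x00000002: "DATA_EXTEND",
    0x00000100: "FILE_CREATE",
    0x00000200: "FILE_DELETE",
    0x00008000: "BASIC_INFO_CHANGE",
    0x80000000: "CLOSE",
}

def parse_usn_v2(buf, offset=0):
    """Parse one USN_RECORD_V2 from a raw $UsnJrnl buffer."""
    (length, major, minor, file_ref, parent_ref, usn, timestamp,
     reason, source, sec_id, attrs, name_len, name_off) = struct.unpack_from(
        "<IHHQQQQIIIIHH", buf, offset)  # 60-byte fixed header, little-endian
    name = buf[offset + name_off: offset + name_off + name_len].decode("utf-16-le")
    # Join matching flag names alphabetically, like the tool output in Table 16
    events = "+".join(sorted(v for k, v in REASONS.items() if reason & k))
    return {"usn": usn, "file": name, "reason": reason, "events": events}
```

Because $UsnJrnl keeps one such record per file operation, the full create/extend/overwrite/close history of a file like victim.txt remains readable even after its content has been wiped.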
Table 13 Results of carving file method using several tools

Tools name | Algorithm | FTK | Foremost | Scalpel
Freeraser | A single pass with Random Data | Recover file names and corrupted files | Fail | Fail
Freeraser | DoD 5220.22-M | Recover some file names and corrupted files | Fail | Fail
Freeraser | Gutmann algorithm | Recover some file names and corrupted files | Fail | Fail
File Shredder | Simple single pass | Fail | Fail | Fail
File Shredder | Simple two pass | Fail | Fail | Fail
File Shredder | DoD 5220.22-M | Fail | Fail | Fail
File Shredder | Seven pass | Fail | Fail | Fail
File Shredder | Gutmann algorithm with 35 passes | Fail | Fail | Fail
Table 14 File system analysis using FTK and DiskExplorer for NTFS

Tools name | Algorithm | DiskExplorer | FTK
Freeraser | A single pass with Random Data | Recover metadata | Recover metadata
Freeraser | DoD pass | Recover metadata | Recover metadata
Freeraser | 35 passes | Recover metadata | Recover metadata
The file system analysis in Table 14 proves that all the metadata about the files,
from their creation to their deletion, is still on the USB disk. To double-check,
the analysis was repeated, and the results obtained are shown in Table 15. Table 16
shows an example of the transactions on a file revealed by file system analysis. The
results prove that the file called victim.txt was on the computer and show all the
activities that occurred on this file.
7 Conclusions
Data wiping is the most effective way of destroying data contents. This paper has
examined data wiping using several tools and algorithms, and the experimental results
indicate that data wiping is an effective way to destroy data content. However, by
analyzing the NTFS file system, recovery of the metadata about a file, from its
creation until its deletion, remains possible. Consequently, it can be proven what
data was on the disk at some point, what activity occurred on a file, and whether the
user attempted to wipe it. In this case, suspects with a PC or USB drive who have been proven to
Table 15 File system analysis using the UsnJrnl2Csv-master, LogFileParser-master, and Mft2Csv-master tools

Tools name | Algorithm | LogFileParser-master | Mft2Csv-master
Freeraser | One pass | Recover every transaction that occurred on the files | Recover file metadata
Freeraser | DoD pass | Recover every transaction that occurred on the files | Recover file metadata
Freeraser | 35 passes | Recover every transaction that occurred on the files | Recover file metadata
File Shredder | One/two/three passes | Recover every transaction that occurred on the files | Recover file metadata, but the name is in an unreadable format
File Shredder | Seven passes | Recover every transaction that occurred on the files | Recover file metadata, but the name is in an unreadable format
File Shredder | Gutmann algorithm | Recover every transaction that occurred on the files | Recover file metadata, but the name is in an unreadable format
Table 16 Transactions on the victim.txt file

File name | Event (transaction)
victim.txt | FILE_CREATE
victim.txt | DATA_EXTEND + FILE_CREATE
victim.txt | DATA_EXTEND + DATA_OVERWRITE + FILE_CREATE
victim.txt | BASIC_INFO_CHANGE + DATA_EXTEND + DATA_OVERWRITE + FILE_CREATE
victim.txt | BASIC_INFO_CHANGE + CLOSE + DATA_EXTEND + DATA_OVERWRITE + FILE_CREATE
have engaged in such proscribed activities can be considered to have violated such policies,
with potential organizational or even criminal implications. The number of wiping
passes does not affect the metadata. The only effective way to destroy the metadata
is by wiping the file system, which requires either kernel-space access or hardware wiping.
Future research directions based on these findings may explore tools capable
of dealing with kernel-space or hardware-level processing, extending the study
with more tools and techniques, and measuring the resources consumed by each machine.
References
1. Naiqi L, Zhongshan W, Yujie H (2008) Computer forensics research and implementation
based on NTFS file system. In: Proceedings—ISECS international colloquium on computing,
communication, control, and management, CCCM 2008, vol 1, pp 519–523
2. Poonia AS (2014) Data wiping and anti forensic techniques. Compusoft 3(12):1374–1376
3. Ölvecký M, Gabriška D (2018) Wiping techniques and anti-forensics methods. In: 2018 IEEE
16th international symposium on intelligent systems and informatics (SISY), pp 127–132
4. Miller FP, Vandome AF, McBrewster J (2009) Levenshtein distance: information theory,
computer science, string (computer science), string metric, Damerau–Levenshtein distance,
spell checker, Hamming distance. Alpha Press
5. blueangel's ForensicNote: NTFS Log Tracker [Online]. Available: https://sites.google.
com/site/forensicnote/ntfs-log-tracker. Accessed 18 Sept 2022
6. Rogers MK, Seigfried K (2004) The future of computer forensics: a needs analysis survey.
Comput Secur 23(1):12–16
7. Slusarczuk MM, Mayfield WT, Welke SR (1987) Emergency destruction of information storing
media. Institute for Defense Analyses Alexandria VA
8. Gutmann P (1996) Secure deletion of data from magnetic and solid-state memory. In:
Proceedings of the sixth USENIX security symposium, San Jose, CA, vol 14, pp 77–89
9. Robins N, Williams PAH, Sansurooah K (2017) An investigation into remnant data on USB
storage devices sold in Australia creating alarming concerns. Int J Comput Appl 39(2):79–90
10. Golubić K, Stančić H (2012) Clearing and sanitization of media used for digital storage:
towards recommendations for secure deleting of digital files. In: Central European conference
on information and intelligent systems, pp 331–493
11. Regenscheid A, Feldman L, Witte G (2015) NIST special publication 800-88 revision 1,
guidelines for media sanitization. National Institute of Standards and Technology
12. DoD 5220.22-M: national industrial security program operating manual [Updated 28 Feb 2006]
(2006). [Online]. Available: https://www.hsdl.org/?abstract&did. Accessed 18-Sept-2022
13. Wright C, Kleiman D, Sundhar RSS, Kendalls BDO (2008) Overwriting hard drive data: the
great wiping controversy, pp 243–257
14. Martin T, Jones A (2011) An evaluation of data erasing tools
15. Distefano A, Me G, Pace F (2010) Android anti-forensics through a local paradigm. Digit
Invest 7:S83–S94
16. Pajek P, Pimenidis E (2009) Computer anti-forensics methods and their impact on computer
forensic investigation. In: International conference on global security, safety, and sustainability,
pp 145–155
17. Gül M, Kugu E (2017) A survey on anti-forensics techniques. In: IDAP 2017—international
artificial intelligence and data processing symposium
18. Kai Z, En C, Qinquan G (2010) Analysis and implementation of NTFS file system based
on computer forensics. In: 2010 Second international workshop on education technology and
computer science, vol 1, pp 325–328
19. Al-Fayoumi M, Aboud SJ, Al-Fayoumi MA (2010) A new digital signature scheme based on
integer factoring and discrete logarithm problem. IJ Comput Appl 17(2):108–115
20. Gutub AA (2010) e-Text watermarking: utilizing Kashida extensions in Arabic language
electronic writing, vol 2, no 1, pp 48–55
21. Parvez MT, Gutub AA-A (2011) Vibrant color image steganography using channel differences
and secret data distribution. Kuwait J Sci Eng 38(1B):127–142
22. Al-Otaibi NA, Gutub AA (2014) 2-Layer security system for hiding sensitive text data on
personal computers. In: Lecture notes on information theory, August, pp 73–79
23. Al-Nofaie SM, Fattani M, Gutub A (2016) Merging two steganography techniques adjusted to
improve Arabic text data security. J Comput Sci Comput Math (JCSCM) 6(3):59–65
24. Hambouz A, Shaheen Y, Manna A, Al-Fayoumi M, Tedmori S (2019) Achieving data
integrity and confidentiality using image steganography and hashing techniques. In: 2019 2nd
International conference on new trends in computing sciences, ICTCS 2019—proceedings
25. Mohammad RM, Alqahtani M (2019) A comparison of machine learning techniques for file
system forensics analysis. J Inf Secur Appl 46:53–61
26. Oh J, Lee S, Hwang H (2021) NTFS Data Tracker: Tracking file data history based on $LogFile.
Forensic Sci Int Digit Invest 39:301309
27. Hermon R, Singh U, Singh B (2022) Forensic techniques to detect hidden data in alternate data
streams in NTFS. In: IBSSC 2022—IEEE Bombay section signature conference
28. Oh J, Lee S, Hwang H (2022) Forensic recovery of file system metadata for digital forensic
investigation. IEEE Access 10:111591–111606
29. Sokol P, Antoni Ľ, Krídlo O, Marková E, Kováčová K, Krajči S (2022) The analysis of digital
evidence by Formal Concept Analysis
30. Markova E, Sokol P, Kovacova K (2022) Detection of relevant digital evidence in the forensic
timelines. In: 2022 14th International conference on electronics, computers and artificial
intelligence, ECAI 2022.
31. Singh A (2022) A framework for crime detection and reduction in digital forensics. SSRN
Electron J 71(4):531–552
32. Peters-Michaud N (2017) The three pass data wipe requirement for hard drives is obsolete. In:
Cascade asset management, LLC, pp 1–8
33. Mallery JR (2001) Secure file deletion: fact or fiction?
34. Parvez MT, Gutub AA (2011) Vibrant color image steganography using channel differences and secret data distribution. Kuwait J Sci Eng 38(1B):127–142
35. Pal A, Memon N (2009) The evolution of file carving. IEEE Sig Process Mag 26(2):59–71
36. Carrier B (2005) File system forensic analysis. Addison-Wesley Professional
Performance Evaluation of Virtual
Machine and Container-Based Migration
Technique
Aditya Bhardwaj, Amit Pratap Singh, Priya Sharma, Konika Abid,
and Umesh Gupta
Abstract The transformation from hypervisor to microservice-based virtualization,
i.e., containerization, is gaining considerable attention. This is because container
virtualization offers a lightweight and efficient way to package and deploy software
applications. Containers require fewer resources than virtual machines, which
makes them more efficient and cost-effective. The performance overhead of a container
compared to a virtual machine has been explored by researchers, but support
for migration, an essential technique of cloud virtualization, still needs to be addressed.
In this work, we propose a container-based migration technique and compare its
performance with an existing VM migration scheme. The results show that, compared
to the existing VM migration scheme, our proposed container migration technique
reduces downtime, migration time, and the number of pages transferred by 72.8%,
54.94%, and 97.5%, respectively.
Keywords Cloud computing ·Virtualization ·VM migration ·Container
technology ·Checkpoint-restore
1 Introduction
There is a high demand for cloud platforms and related virtualization technologies.
VMware, RedHat, Oracle, Citrix, and Microsoft dominate the market, while hardware
vendors like Intel and AMD offer virtualization-enabled high-performance comput-
ing servers. These technologies are used collectively for the process of workstation
consolidation. In the past, hypervisor-based virtualization was the most common way
A. Bhardwaj (B)
School of CSET, Bennett University, Greater Noida, India
e-mail: aditya.cse@nitttrchd.ac.in
A. P. Singh · P. Sharma · K. Abid
Department of CSE, Sharda University, Greater Noida, India
U. Gupta
Department of CSE, SR University, Warangal, Telangana, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_41
552 A. Bhardwaj et al.
to implement virtualization and isolation [1]. Recent studies show that hypervisor-based
virtualization technologies have large performance costs. They also have I/O
constraints, so they are typically avoided in high-performance computing environments.
In the past few years, container-based virtualization and support for hosting
microservices have become more popular. Containerization is a lightweight solution
that bundles applications and data in a simpler and more performance-oriented way
that can run on different cloud frameworks.
Due to its significance in the operation of data centers, researchers have sought
to enhance the performance of the VM migration method [2]. In accordance with
this, in our earlier work, we first explored how to allocate bandwidth efficiently during
VM migration [3], and then improved the migration mechanism by implementing
multistage and data-transfer-reduction strategies [4]. However, because a VM
is deployed with a dedicated guest operating system, it takes a significant amount
of RAM and disk storage, and as a result, the image size is quite large.
Thus, migration utilizing a VM is referred to as a heavyweight solution, which causes
service degradation in terms of quality of service (QoS).
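The cost of heavyweight migration can be made concrete with the standard pre-copy model: the first round transfers the whole memory image, and each subsequent round retransmits only the pages dirtied during the previous one, until the remainder is small enough for a brief stop-and-copy. The simulation below is our own sketch with illustrative parameters and thresholds, not the measurement method of this study.

```python
def precopy_migration(mem_pages, dirty_rate, bandwidth,
                      stop_threshold=1000, max_rounds=30):
    """Simulate pre-copy live migration.

    mem_pages:  pages sent in the first round (the whole VM/container image)
    dirty_rate: pages dirtied per second while copying
    bandwidth:  pages transferred per second
    Returns (total_pages_sent, migration_time_s, downtime_s).
    """
    to_send, total, elapsed = mem_pages, 0, 0.0
    for _ in range(max_rounds):
        round_time = to_send / bandwidth
        total += to_send
        elapsed += round_time
        dirtied = dirty_rate * round_time  # pages touched during this round
        to_send = dirtied                  # next round resends the dirty set
        if dirtied <= stop_threshold or dirty_rate >= bandwidth:
            break  # small enough (or diverging): do the final stop-and-copy
    downtime = to_send / bandwidth         # pause while the last set is copied
    return total + to_send, elapsed + downtime, downtime
```

With the same dirty rate and bandwidth, shrinking the initial image, as containerization does, shrinks every later round as well, which is consistent with the large reductions in pages transferred and migration time reported above.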
The virtualization system design employing hypervisor and container-based tech-
nologies is contrasted in Fig. 1. As can be seen in Fig. 1, containers allow applications
to share an OS kernel and only include the necessary binaries and libraries, making
them a more lightweight alternative to virtual machines. Hence, in recent years, there
has been a rise in demand for application deployment using container technology
for real-time applications like IoT, fog computing, data analytics, and blockchain
technology [5,6].
Fig. 1 Virtual machine versus container architecture
Performance Evaluation of Virtual Machine 553
1.1 Main Contributions of This Study
The contributions of this study are summarized as follows:
1. Cloud computing virtualization experimental testbed has been developed to eval-
uate performance of proposed container-based migration technique (LXD/CR)
with the existing VM migration scheme.
2. Proposed container migration technique has been implemented by modification
of Linux kernel and system bash files.
3. Executed a wide variety of workload benchmarks to evaluate the performance of
existing and proposed migration techniques.
The remaining parts of this work are structured as follows. Relevant related work
in container-based virtualization is discussed in Sect. 2. In Sect. 3, the architecture of
the container migration technique is discussed. Section 4 shows the results of the
proposed container migration technique (LXD/CR) against the existing pre-copy VM
migration scheme. In Sects. 5 and 6, the concluding remarks of this study are presented.
2 Related Work
A container is a small, self-contained piece of software with all the code, libraries,
system tools, and runtime needed to run an application or service. Containers let
developers package an app to make it portable and consistent so that it can be used in
different computing environments, like on-premises data centers, cloud platforms,
or edge devices. Containerization eliminates the need for a separate guest operating
system, which means it has less overhead and starts up faster than hypervisors.
Existing studies have explored performance comparisons between VMs and containers on non-live migration platforms. Felter et al. [7] conducted a study comparing and contrasting virtualization's overhead with that of a non-virtualized platform. The parameters employed were those unique to the execution of the workload.
load. Based on their experiments, they concluded that containers have less overhead
in comparison with virtual machines. As a result, incorporating containers into the
data center infrastructure can be useful for cloud service providers. In a recent paper,
Chae et al. [8] compared the performance of hypervisor and container technology. However, compared to [7], they used a different set of benchmarks. Their benchmarks include
disk I/O, a web server, and the amount of CPU and RAM being used to measure
performance. The authors demonstrate that as compared to container virtualization,
KVM virtualization requires a relatively higher quantity of CPU and RAM. These
studies inferred that, compared to hypervisor-based virtualization, the container uses
hardware resources more efficiently, especially with a decrease in memory consump-
tion by three–four times.
Also, in contrast to [7, 8], the studies in [9, 10] undertook a performance comparison of three container technologies (Docker, LXC, and CoreOS Rkt) with respect to the native platform. It was found that Docker containers reduce
554 A. Bhardwaj et al.
Fig. 2 Proposed LXD/CR container migration technique
performance because their time-sharing method incurs substantial context-switching costs. However, LXC containers perform better for data-intensive workloads, and Rkt containers work well for CPU-intensive workloads. From the relevant literature, it is found that container-based virtualization incurs minimal overhead in terms of computation and storage. Thus, it is suitable for running multiple instances on a cloud server.
3 Experimental Testbed for Proposed Container Migration
Technology
In this section, we discuss the methodology and experimental setup details to implement the proposed LXD/CR container migration technique.
3.1 Proposed Container-Based Migration Technique
Initially, we built a Linux container hypervisor (LXD) as an extension of LXC version 2.0.11 to launch a container. Then, container migration is implemented using the checkpoint/restore functionality of CRIU, which saves the running state of the container on the source server and restores it on the target system [11, 12]. Further, cgroups and namespaces are used to facilitate management and isolation of container resources. The key operations involved in migrating containers, as shown in Fig. 2, are summarized below:
Stage 1: Synchronization of file system: To perform container migration, we
must first ensure that the container’s file system is in sync. This necessitates the
existence of the fundamental file system, including rootfs and config, on the target
server.
Stage 2: Checkpoint running container: When a container is to be checkpointed,
CRIU first determines the processes to checkpoint and then saves their state. This
includes memory contents, file descriptors, network connections, and other process
information. CRIU then writes this information to disk in a checkpoint image file.
Stage 3: Network dump: CRIU dumps the container’s network stack to a separate
file that can be used to restore the network state.
Stage 4: Restoring container: When the container needs to be restored, CRIU
reads the checkpoint image file from the disk and restores the process state in
memory. It then sets up the necessary environment, including file descriptors,
network connections, and memory mappings.
Stage 5: Network restore: CRIU restores the network state using the network
dump file created during the checkpointing phase.
Stage 6: Resume: Once the process is restored, CRIU resumes execution from
where it left off, allowing the container to continue running on the same or a
different host.
4 Performance Evaluation and Results Discussion
In this section, we discuss the results obtained for the performance evaluation of the proposed container migration technique against the existing VM migration mechanism.
4.1 Downtime and Migration Time
Figures 3 and 4 illustrate the performance evaluation of the proposed and existing schemes in terms of downtime and migration time.
We have executed four categories of benchmarks, namely 'idle,' 'UnixBench,' 'Y-cruncher,' and 'Stream.' For the idle test and workload benchmark executions, our approach reduces the downtime by 59.48, 73.07, 77.56, and 78.24%, an average of 72.08%. Similarly, migration time is reduced by 33.35, 44.18, 66.43, and 75.81%, an average of 54.94%. This is because a VM migration transfers the full binary libraries, code, and OS files, whereas LXD/CR migrates only the memory checkpoint dump states, which require a shorter migration duration than a virtual machine.
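The averages quoted above follow directly from the four per-benchmark figures; as a quick check (note that the exact mean of the downtime reductions is 72.0875%, reported as 72.08% in the text):

```python
# Per-benchmark reductions (%) for idle, UnixBench, Y-cruncher, and Stream.
downtime_reductions = [59.48, 73.07, 77.56, 78.24]
migration_reductions = [33.35, 44.18, 66.43, 75.81]

avg_downtime = sum(downtime_reductions) / len(downtime_reductions)
avg_migration = sum(migration_reductions) / len(migration_reductions)
print(f"downtime: {avg_downtime:.4f}%, migration time: {avg_migration:.4f}%")
```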
Fig. 3 Performance evaluation for downtime (Td)
Fig. 4 Performance evaluation for migration time (Tm)
4.2 Number of Pages Transferred
The performance evaluation in terms of the number of pages transferred is presented here. Figure 5 shows that, in comparison with the existing virtual machine migration technique, our proposed LXD/CR container migration technique results in a considerable reduction in the total pages transferred, by 95.08, 97.91, 98.26, and 98.94%, with an average reduction of 97.54%. The difference for this parameter is significant because in LXD/CR the pages transferred are only the container's checkpoint dump states, while pre-copy transfers all dirtied memory pages along with the heavyweight, full-fledged operating system VM image.
Fig. 5 Performance evaluation for the number of pages transferred (Tpages)
Thus, the results obtained demonstrate that the proposed container-based migration technique can be used in a production environment. However, a limitation of this study is that the proposed technique has been tested with CPU- and memory-oriented benchmark test cases only. Further, disk I/O test cases and future network systems should be studied [13].
5 Conclusion and Future Scope
Container-based virtualization involves running multiple containers on a single operating system, where each container shares the host operating system and underlying resources with other containers. This provides greater efficiency and scalability than hypervisor-based virtualization but may not provide as strong isolation. In this paper,
a cloud virtualization testbed has been developed and implemented with a check-
point/restore technique to enable the migration mechanism in the container. The
results show that, compared to the existing virtual machine migration technique (pre-copy VM migration), our proposed container migration technique achieves significant performance improvements: average reductions of 72.08% in downtime, 54.94% in migration time, and 97.54% in the number of pages transmitted. Hence, our
proposed container migration technique (LXD/CR) can play a vital role in the cloud
servers to migrate running workloads and applications. For future work, researchers
may explore container migration techniques for edge and fog computing frameworks.
References
1. Belgacem A, Mahmoudi S, Ferrag MA (2023) A machine learning model for improving virtual
machine migration in cloud computing. J Supercomputing 1–23
2. Kumari P, Kaur P (2021) Virtual machine replication in the cloud computing system using
fuzzy inference system, data analytics and management: proceedings of ICDAM, pp 165–174
3. Bhardwaj A, Rama Krishna C (2018) Performance evaluation of bandwidth for virtual machine
migration in cloud computing. Int J Knowl Eng Data Min, Inderscience 5(3):139–152
4. Bhardwaj A, Rama Krishna C (2018) Efficient multistage bandwidth allocation technique for
virtual machine migration in cloud computing. J Intell Fuzzy Syst 35(5):5365–5378
5. Plageras AP, Psannis KE, Stergiou C, Wang H, Gupta BB (2018) Efficient IoT-based sensor
BIG Data collection—processing and analysis in smart buildings. Future Gener Comp Syst
82:349–357
6. Stergiou C, Psannis KE, Kim B-G, Gupta B (2018) Secure integration of IoT and cloud com-
puting. Future Gener Comp Syst 78(3):964–975
7. Felter W, Ferreira A, Rajamony R, Rubio J (2015) An updated performance comparison of virtual machines and Linux containers. In: Proceedings of the IEEE international symposium on performance analysis of systems and software, Philadelphia, PA, USA, pp 171–172
8. Chae M, Lee H, Lee K (2017) A performance comparison of Linux containers and virtual machines using Docker and KVM. Cluster Comput, pp 1–11
9. Kozhirbayev Z, Sinnott RO (2017) A performance comparison of Linux container-based tech-
nologies. Future Gener Comp Syst 68:175–182
10. Martin JP, Kandasamy A, Chandrasekaran K (2018) Exploring the support for high performance applications in the container runtime environment. Human-centric Comput Inf Sci 8(1):1–15
11. Linux Containers Hypervisor (LXD), link: https://linuxcontainers.org/lxd/introduction/. Accessed 12 Aug 2022
12. Checkpoint/Restore in userspace (CRIU), link: https://www.criu.org/Main_Page. Accessed 19
Sept 2022
13. Gupta U, Pantola D, Bhardwaj A, Singh S (2023) Next-generation networks enabled tech-
nologies: challenges and applications. Next generation communication networks for industrial
internet of things systems, pp 191–216
Rhetorical Role Detection in Legal
Judgements Using Zero-Shot Learning
Shambhavi Mishra, Tanveer Ahmed, Vipul Mishra, Priyam Srivastava,
Abuzar Sayeed, and Umesh Gupta
Abstract In this paper, we address the problem of legal statement segmentation
(or rhetorical role detection). Traditionally, this is handled by taking the expertise
of lawyers and making them mark each and every statement as one of the many
pre-defined classes. Naturally, this process is cumbersome and involves a lot of
manual intervention. Zero-shot learning is a promising approach that could be one
of the potential solutions to this labor-intensive problem. Therefore, in this paper,
we apply zero-shot learning to the task of legal judgement segmentation. We try to
remove the “human in the loop” and present a new potential direction in rhetorical
role detection. To that end, we use BART to automatically classify various segments
of a document into multiple classes. We propose a model that uses a pre-trained
language model to generate embeddings for each document, which are then used to
classify a legal sentence into one of the multiple classes. We evaluate our model on a
dataset of legal documents consisting of manually marked statements. In particular,
the dataset consists of 50 court case documents from the Indian Supreme Court.
Through experimentation, we have found that the proposed method gives a strong baseline
that could act as a new direction in rhetorical role detection. Further, we also show
S. Mishra (B) · T. Ahmed · P. Srivastava · A. Sayeed
Department of CSE, Bennett University, Greater Noida, India
e-mail: shambhavimishra1000@gmail.com
T. Ahmed
e-mail: tanveer.ahmed@bennett.edu.in
P. Srivastava
e-mail: e20cse479@bennett.edu.in
A. Sayeed
e-mail: abuzar.sayeed@bennett.edu.in
V. Mishra
Department of CSE, Pandit Deendayal Energy University, Gandhinagar, India
e-mail: vipul.mishra@sot.pdpu.ac.in
U. Gupta
Department of CSE, SR University, Warangal, Telangana, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_42
that the model presented in this article can indeed pave the way for future work in
legal analytics.
Keywords Rhetorical roles · Legal case documents · Zero-shot learning · BART
1 Introduction
In this paper, we address the problem of rhetorical role detection of a legal judge-
ment. Traditionally, this is handled by taking the expertise of lawyers and making
them mark each statement (of the judgement) as one of the many pre-defined roles.
This process is extremely cumbersome and involves a lot of manual intervention.
Moreover, as is common in Indian legal cases, the structure of the text is extremely inconsistent, with every judge using his or her own writing style. Hence, marking statements manually is infeasible considering the scale of judgements given by courts each year.
To address this problem, we apply zero-shot learning to the task of legal judgement
segmentation. We try to remove the “human in the loop” and present a new potential
direction in rhetorical role detection. To that end, we use BART to automatically
classify various sentences of a legal document into multiple classes. We propose
a model that uses a pre-trained language model to classify a legal sentence into
multiple classes. We evaluate our model on a dataset of legal documents consisting
of manually marked statements. In particular, the dataset consists of 50 court case
documents and over 10,000 sentences from the Indian Supreme Court. The innovative
aspect of the suggested method lies in employing zero-shot learning for identifying
rhetorical roles within legal decisions. The conventional approach of having attor-
neys manually label each statement within a judgement is laborious, time-intensive,
and susceptible to mistakes. The suggested technique employs a pre-trained language
model, BART, to automatically categorize various sentences in a legal document into
multiple classes without the need for manual input. This paper emphasizes the signif-
icance of rhetorical roles in legal contexts and advocates for the implementation of
zero-shot learning for detecting these roles in the legal field. The paper establishes
a robust foundation and illustrates that the suggested approach has the potential to
serve as a novel direction in rhetorical role identification, laying the groundwork for
future research in legal analysis.
In the realm of law, a legal judgement refers to an official ruling pronounced by a
judicial body pertaining to a specific legal matter. Typically, these judgements emerge
from disagreements among at least two parties, and they can hold either a binding or
non-binding status. A crucial method used in dissecting these judgements is called
segmentation. Segmentation of legal judgements involves decomposing a court’s
decision into its constituent elements to provide a comprehensive understanding of
its implications and significance [1,2]. Legal professionals and scholars frequently
employ this method when scrutinizing a judgement, as it aids in pinpointing the
crucial tenets and contested issues. There is an increasing trend toward researching
role detection within the context of legal judgements. For example, numerous studies
have underscored the critical role of rhetorical elements across various applications
[3,4]. In essence, the philosophy behind judgement segmentation is the partitioning
of a judgement into its relevant sections. For instance, the segment containing facts is
an aggregation of all the factual information and evidence put forth during a case. On
the other hand, the legal segment involves the legal claims and deductive reasoning
that the court used to formulate its decision [5].
The segmentation of a legal judgement allows for a more in-depth examination to
pinpoint the central doctrines and contested issues. Both legal researchers and profes-
sionals can benefit from this analysis as it can elucidate the law and how it is applied
to specific instances. It is therefore critical to highlight the significance of identifying
rhetorical roles in legal work. However, rhetorical role detection comes with its own
set of challenges. First, it is a tedious and time-consuming task requiring legal experts
to read through the entire judgement and label each statement. Secondly, the possi-
bility of human error introduces inconsistency in results. Lastly, scaling up manual
labeling to encompass vast amounts of legal texts is challenging. In response to these
challenges, this paper proposes a solution: a zero-shot learning-based approach to
rhetorical role detection. Deep learning, a subfield of machine learning that draws
inspiration from the structure and function of the human brain, has been utilized in
numerous studies for this task [4,6]. These algorithms aim to model high-level data
abstractions through a deep graph consisting of multiple layers of nodes. Indeed, deep
learning has proven successful in numerous text-related tasks, such as machine trans-
lation, sentiment analysis, and topic categorization. Our method capitalizes on these
successes, applying deep learning techniques to the task of rhetorical role detection
in legal judgements [7,8].
One of the most promising applications of deep learning for text is text segmentation. Zero-shot learning is a subset of deep learning concerned with the classification of text without being explicitly trained on labeled datasets [9]. It is a method for segmenting data when no labels are available for that data: a model is learned from data that has labels and is then used to segment the unlabeled data. The advantage of this approach is that it does not require labels for the unlabeled data, which can be difficult or impossible to obtain [10]. Consequently, this learning and classification paradigm is immensely important in dealing with the issues highlighted in the previous paragraph.
In light of the issue and potential solution highlighted in this section, we propose
the use of zero-shot classification for rhetorical role detection in the legal domain. To
the best of our knowledge, we are the first to propose this paradigm in the context of
Indian legal judgements. To accomplish the said objective, we use the existing pre-
trained BART proposed in [11], which has been shown to achieve good results in zero-shot classification. To test the validity of
the ideas in practice, we use the dataset provided by [6]. The dataset consists of fifty
different legal judgements. There are seven different roles into which the statements
are classified. They are Facts (abbreviated as FAC), Ruling by Lower Court (RLC),
Argument (ARG), Statute (STA), Precedent (PRE), Ratio of the decision (Ratio),
Ruling by Present Court (RPC). Using the method proposed in this article, we are
able to achieve good results in terms of numerical efficiency. These obtained numbers
show the potential of BART in zero-shot classification for legal judgement. Our
paper’s main contribution is to eliminate the requirement for manual involvement
in the segmentation of legal judgements, a task currently carried out by lawyers
who assign each statement to one of several pre-defined roles. Furthermore, the
paper delves into the significance of identifying rhetorical roles in the legal domain,
the obstacles related to this task, and the benefits of employing advanced learning
methods, particularly zero-shot learning, for text segmentation. The rest of the article
is structured as follows: Sect. 2 of this paper will discuss the related work. The
proposed methodology is discussed in Sect. 3. The experimental results are described
in Sect. 4. We discuss the shortcomings of the work in Sect. 5. Finally, the conclusion
is given in Sect. 6.
2 Related Work
This section provides an overview of previous research conducted in the legal field
concerning annotation, automatic rhetorical labeling, and deep learning applications.
The automatic labeling of the rhetorical purpose of sentences relies heavily on manual
annotation. While some studies focus on the annotation process itself, including
the establishment of manuals or annotation rules, inter-annotator research, and the
creation of a high-quality annotated corpus, others aim to automate semantic labeling
tasks and perform annotation analysis [12,13]. In one study, a corpus named TEMIS
was developed, comprising 504 sentences with syntactic and semantic annotations
[14]. Extensive annotation research and curation of a gold standard corpus were
carried out in [15] for the purpose of labeling sentences, although there was low
agreement among assessors for labels such as “Facts” and “Reasoning Outcomes.”
Another research effort presented a preliminary methodology [16] that employed
NLP tools to automate annotation work using 47 criminal cases from the California
Supreme Court and State Court of Appeals. Previous attempts have been made to
automatically recognize the rhetorical functions of sentences in legal texts. Initial
experiments were conducted in [17] to comprehend the rhetorical and thematic roles
in court case documents, judgements, and case legislation. For example, [17] utilized
conditional random fields (CRFs) to address the challenge of identifying seven rhetor-
ical roles. Another study [12] focused on the division of US court documents into
functional (Introduction, Background, Analysis, and Footnotes) and issue-specific
(Analysis and Conclusion) portions using CRF with handcrafted features. Addi-
tionally, a technique using the fastText classifier was developed in [18] to distinguish between factual and non-factual sentences. In a different area of research, Walker et al.
[19] contrasted the usage of rule-based scripts with machine learning algorithms
for the task of identifying rhetorical roles. Rule-based scripts require substantially
less training data. Nearly all previous attempts to automatically identify rhetorical
roles in the legal arena required handcrafted elements. In contrast, this paper uses
deep learning (DL) and natural language processing models for this purpose, which
eliminates the requirement for manually created features. In a variety of NLP tasks,
self-supervised techniques have been incredibly successful [20,21]. The methods that
have been most effective have been variations of masked language models, which are
denoising autoencoders trained to reconstruct text where a random subset of the words
has been masked out. Recent research has demonstrated benefits from enhancing the
distribution of masked tokens [22], the order in which masked tokens are predicted
[23], and the accessible context for changing masked tokens [24]. These techniques,
however, frequently concentrate on specific kinds of end tasks (such as span predic-
tion, generation), which restricts their applicability. For pretraining sequence-to-sequence models, BART, a denoising autoencoder, was introduced in [11]. When fine-tuned for text generation, BART performs particularly well, and it also works well on comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, producing new state-of-the-art results on a variety of abstractive dialogue, question-answering, and summarization tasks. For instance, compared to prior work on XSum [25], performance is improved by 6 ROUGE. DL techniques
are being used more frequently in the legal field for tasks including classifying
factual and non-factual statements in legal documents [18], classifying crimes [26],
summarizing [27], and other tasks.
Related work also encompasses the application of machine learning methods for
the classification of legal documents. For instance, a convolutional neural network
(CNN) was employed [28] to categorize legal documents into various types such
as contracts, briefs, and pleadings. In [29], a hierarchical attention network was
utilized for classifying legal documents based on their specific subject areas. In
[30], a deep learning model leveraging Long Short-Term Memory (LSTM) networks
was designed to pinpoint key issues and arguments within legal documents. These
approaches could potentially be integrated with automated rhetorical labeling tech-
niques to enhance the comprehension of legal texts. Research has also been conducted
on the implementation of natural language processing methods for legal information
retrieval. In [31], a system was devised to automatically extract legal concepts from
court opinions and use them to improve the retrieval of related opinions. In [32], a
system was created to automatically extract legal issues from case law and utilize
them to enhance the retrieval of associated cases. These methods could potentially be
combined with automatic rhetorical labeling to improve the retrieval and organiza-
tion of legal texts based on their rhetorical objectives. Additionally, research has been
conducted on the use of machine learning methods for predicting legal decisions. In
[33], a model was created to forecast the outcomes of cases in the European Court
of Human Rights based on case texts. In [34], a model was designed to predict the
outcomes of US Supreme Court cases based on various features, including case texts.
These techniques could potentially be used in conjunction with automatic rhetorical
labeling to improve legal decision prediction based on the rhetorical functions of the
texts. However, to the best of our knowledge, deep learning and natural language
processing methods have not yet been applied to automatically discern the rhetorical
roles of phrases within legal documents.
3 Method
3.1 Zero-Shot Classification
Zero-shot classification is a machine learning classification technique that is able to
recognize previously unseen objects by inferring class membership from semantic
information about the class, such as descriptions of its attributes [35]. This is in
contrast to traditional classification methods that require training data for every class
in order to learn to recognize it. The ability to learn from zero examples is particularly
useful in domains where acquiring training data is difficult or expensive, such as the
legal domain. From the point of view of the legal domain, the main challenge in zero-shot
classification is to learn a good semantic representation of the class, which can then
be used to make predictions about unseen examples. The segmentation process often
incorporates some form of semantic knowledge representation derived from the legal
text, such as an ontology, a collection of attributes, or even the framework of the legal
document itself. There are various strategies to tackle zero-shot classification, but a
commonly employed method is to derive a mapping from the semantic representation to the feature representation of the legal information. This mapping is then utilized to
predict previously unseen roles within the legal judgement. Several methodologies
can be employed to accomplish this, including transfer learning [36] or multiview
learning [37]. With this approach, the network receives a legal document, comprising a series of legal statements, as input, and generates a probability distribution over a
set of labels as output. These labels would be supplied to the system dynamically
as it processes the text. This methodology offers a way to not only analyze legal
judgements more efficiently, but also to extract and learn from the data contained
within them in a more structured and scalable way. As a result, it provides a new
and innovative approach to the study of legal judgements and their implications.
The network is then trained using a set of labeled documents and then applied to a
set of unlabeled documents. The predicted label for an unlabeled document is the
label with the highest probability. In this paper, we have used an existing set of
pre-trained models. In particular, we work with Bidirectional and Auto-Regressive
Transformers.
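A common way to run BART-style zero-shot classification in practice is to cast each candidate label as an entailment hypothesis ("This text is about X") and score it against the input sentence. The sketch below follows that recipe but substitutes a toy keyword-overlap scorer for a real BART entailment model, so the cue lists and scores are purely illustrative assumptions.

```python
import math

LABELS = ["Facts", "Statute", "Precedent"]

# Toy stand-in for an entailment model: in a real system this score would
# come from BART fine-tuned on NLI, given (sentence, "This text is about X").
CUES = {"Facts": {"incident", "occurred", "filed", "complaint"},
        "Statute": {"section", "act", "provides"},
        "Precedent": {"held", "court", "relied"}}

def entailment_score(sentence, label):
    # Count overlap between the sentence's words and the label's cue words.
    return len(set(sentence.lower().split()) & CUES[label])

def classify(sentence, labels=LABELS):
    scores = [entailment_score(sentence, label) for label in labels]
    # Softmax turns the raw scores into a probability distribution over labels;
    # the predicted role is the label with the highest probability.
    exps = [math.exp(s) for s in scores]
    probs = [e / sum(exps) for e in exps]
    return max(zip(labels, probs), key=lambda pair: pair[1])

label, prob = classify("the incident occurred and a complaint was filed")
print(label)  # Facts
```

The labels are supplied at classification time rather than learned, which is what makes the approach zero-shot: swapping in the seven rhetorical roles requires no retraining.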
3.2 BART: Bidirectional and Auto-Regressive Transformers
Bidirectional Auto-Regressive Transformers (BARTs) are a type of neural network
that can be used for both sequence prediction and text generation. Figure 1 shows
the general architecture of BART. BARTs are similar to traditional recurrent neural
networks (RNNs), but they have the ability to learn from both past and future data.
In addition, BART is designed using the standard transformer-based architecture
which has shown promising results on various NLP-based tasks for a variety of
language-related tasks.
Fig. 1 BART: sequence-to-sequence trained model
This makes them well-suited for tasks such as language translation, where it is
important to consider the context of the entire sentence. BART was first proposed in
[11]. Since then, tests on a variety of natural language processing tasks, including
text categorization and machine translation, have demonstrated that BARTs perform
better than conventional RNNs in these areas. An encoder and a decoder are BART’s
two main components. The encoder reads the input sequence and converts it into
a vector representation. The output sequence is then created by the decoder using
this vector form. This gives BART a better understanding of the context of the input
sequence. The two main benefits of BART are its ability to learn from long sequences of data and its ability to generate text. Traditional RNNs are limited in the amount
of data they can learn from. This is because RNNs are designed to read data from
left to right. As a result, they can only learn from the first few items in a sequence.
BART, on the other hand, can learn from both the first and last items in a sequence.
This makes them much better at learning from long sequences of data. For reasons of
brevity, we keep the discussion on BART short. Interested readers can refer to [11].
3.3 Combining BART with Zero-Shot Classification for Legal
Document Classification
We propose a zero-shot learning approach for sentence classification. We utilize
BART, a pre-trained sequence-to-sequence autoencoder, as our text encoder and
build a simple classification head on top of the encoder. The overall framework of
the proposed approach is presented in Fig. 2. Our approach can be used for any
sentence classification task, with or without labeled data. We evaluate our approach
on a variety of sentence classification tasks and show that our approach outperforms
strong baselines on zero-shot classification. Zero-shot learning is a subfield of machine learning where
the goal is to learn a model that can classify data belonging to classes that are not
present in the training data. The idea is that the model can learn to generalize to new
classes by using knowledge about other related classes. In the legal domain, there
are a variety of tasks where classification is needed, but labeled data is not always
available. For example, when a new law is passed, there may not be any labeled data
for that law. However, there may be other laws that are similar to the new law, and
Fig. 2 Architecture of the proposed model
these laws can be used to learn a model that can classify the new law. We explore the
ability of the BART model to learn from a large amount of unannotated data in order to
classify documents into different legal categories, without any training data for those
categories. We evaluate our approach on a dataset of nearly 50 documents of Supreme
Court cases. We find that the BART model can accurately classify documents into five
different roles, even when there is no training data for those categories, outperforming
several strong baselines. In order to train BART, text is first corrupted using a random
noise function, and then, a model is learned to recreate the original text. Despite
being straightforward, it uses a typical transformer-based neural machine translation
architecture that generalizes numerous other more advanced pretraining strategies,
such as GPT with its left-to-right decoder and BERT (owing to the bi-directional
encoder). The optimal solution combines a cutting-edge in-filling strategy, where a
span of text is replaced with a single mask token, with a random shuffling of the original sentence order. Although it also performs well for comprehension tasks,
BART is especially effective when tailored for text generation.
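The zero-shot setup described above can be illustrated with a toy sketch: each rhetorical role is defined only by a textual description, and a sentence is assigned the role whose description it is most similar to, with no labeled training data. Here a simple bag-of-words vector stands in for the BART encoder, and the role descriptions are invented for the example; this is an illustration of the idea, not the paper's implementation.

```python
# Toy zero-shot sentence classification: classes are defined only by their
# textual descriptions; no labeled training data is used. A bag-of-words
# count vector stands in for the BART encoder (hypothetical descriptions).
import math
from collections import Counter

STOPWORDS = {"the", "of", "a", "to", "that", "in", "is"}

def embed(text):
    # Stand-in for a sentence encoder: a bag-of-words count vector.
    return Counter(w for w in text.lower().split() if w not in STOPWORDS)

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Invented role descriptions; the real system would use richer text.
role_descriptions = {
    "FAC": "facts and events that led to the filing of the case",
    "STA": "statute written law act section article rule",
    "PRE": "precedent earlier decision cited as binding authority",
}

def zero_shot_classify(sentence):
    scores = {role: cosine(embed(sentence), embed(desc))
              for role, desc in role_descriptions.items()}
    return max(scores, key=scores.get)

print(zero_shot_classify("Section 302 of the act defines the rule"))  # -> STA
```

New roles can be added simply by supplying another description, which mirrors the flexibility in the number of rhetorical roles discussed later in the paper.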
4 Results
4.1 Dataset
In this section, we attempt to demonstrate the practical efficacy of the proposed work.
Experiments were carried out on the dataset provided by [6]. The authors have
presented seven different categories of legal statements. From the seven categories,
we removed two categories. The remaining five roles are: Facts (abbreviated as
FAC), Ruling by Lower Court (RLC), Statute (STA), Precedent (PRE), and Ruling
by Present Court (RPC). Further, we included the collection of 50 documents from
the following five categories of law: 16 documents pertaining to criminal law; ten to
land and property; nine to constitutional law; eight to labor and industry; and seven
to intellectual property rights.
Rhetorical Role Detection in Legal Judgements Using Zero-Shot Learning 567
4.2 Detailed Annotation
This section will provide an overview of our annotation study, including the rhetorical
functions and semantic labels that we considered for this work. There are many rhetor-
ical roles that people play in the legal system. For example, lawyers may play the role
of advocate, advisor, or negotiator. Judges may play the role of arbiter or decision-
maker. Witnesses may play the role of expert or layperson. And jurors may play the
role of fact finders or deliberators. Table 1 represents the number of sentences annotated with each role. In our work, we take into account the following five rhetorical roles:
Facts: This describes the sequence of occurrences that lead to the filing of the case
and the development of the case over time through the legal system.
Ruling by Lower Court (RLC): Since we are reviewing Supreme Court case documents, each appeal arises from decisions issued by the lower courts (Trial Court and High Court), on the basis of which the present appeal was launched. For example, a lower court may have held that an audio recording was properly authenticated and admissible under the hearsay exception. This label marks the lower court's verdict as well as the reasoning behind its decision.
Ruling by Present Court (RPC): The court will make a ruling based on the law
as it stands today. It describes the court’s final judgement or conclusion resulting
from the logical or natural conclusion of the argument.
Statute (STA): The term statute is also used to refer to a written law that has been
enacted by a legislature, as opposed to a common law, which is derived from case
law. This role covers existing laws, which can be derived from a variety of sources, including Acts, Sections, Articles, Rules, Orders, Notices, Notifications, quotations taken directly from an Act, and so on.
Precedent (PRE): A precedent is a legal decision or set of legal rules that is
established as a binding authority for future cases.
Table 1 shows the five rhetorical roles along with the number of sentences annotated with each role.

Table 1 Number of sentences annotated with each role

Rhetorical role   Number of sentences
FAC               2220
STA               654
RLC               314
PRE               1468
RPC               262
Total             4918
4.3 Experimental Setup
BART couples two jointly trained components in a standard sequence-to-sequence transformer: a bidirectional encoder that can look both forward and backward in the input sequence, and a left-to-right autoregressive decoder that reconstructs the original text token by token. The bidirectional encoding is beneficial for tasks where long-range dependencies are important. As pre-trained models have produced remarkable
performances in many tasks (e.g., [38,39]), we experimented with the BART-large
model [11].
4.4 Evaluation Metrics
Standard metrics are applied to assess how well the suggested method performs. The
following is their definition:
Precision: Precision is a measure of the accuracy of a model’s prediction. It is
calculated by dividing the number of correct predictions by the total number of
predictions made. A higher precision indicates that the model is more accurate
in predicting correct outcomes, while a lower precision indicates that the model
may be inaccurate in its predictions.
Precision = True Positives/(True Positives + False Positives).
Recall: Recall, on the other hand, is a measure of the model’s ability to detect
all relevant instances in a given data set. It is calculated by dividing the number
of relevant instances that are correctly identified by the total number of relevant
instances in the dataset. A higher recall indicates that the model is more effective
in detecting all relevant instances, while a lower recall indicates that the model
may be missing some relevant instances.
Recall = True Positives/(True Positives + False Negatives).
F1-Score: The F1-score is a metric that measures the harmonic mean of precision
and recall. Unlike a simple arithmetic mean, the harmonic mean is high only when both precision and recall are high. The F1-score is widely used to evaluate the performance of a model
because it considers both precision and recall simultaneously. A higher F1-score
indicates a more accurate model compared to one with a lower F1-score. It is a
valuable metric in assessing the overall effectiveness of a model’s performance.
F1-Score = 2 × (Precision × Recall)/(Precision + Recall).
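The per-role values reported later in Table 2 are derived from a confusion matrix using the definitions above. The sketch below shows the computation with invented counts, not the paper's actual results.

```python
# Per-class precision, recall and F1 from a confusion matrix, following
# the definitions above. The counts are invented for illustration.

def per_class_metrics(confusion, labels):
    metrics = {}
    n = len(labels)
    for i, label in enumerate(labels):
        tp = confusion[i][i]
        fp = sum(confusion[r][i] for r in range(n)) - tp  # predicted i, truly another role
        fn = sum(confusion[i]) - tp                       # truly i, predicted another role
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics

labels = ["FAC", "STA", "RLC", "PRE", "RPC"]
# confusion[i][j] = number of sentences with true role i predicted as role j
confusion = [
    [50,  5,  2,  3,  0],
    [ 8, 30,  1,  1,  0],
    [ 6,  2,  4,  2,  1],
    [ 9,  3,  1, 25,  2],
    [ 4,  1,  1,  2, 10],
]
results = per_class_metrics(confusion, labels)
```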
4.5 Classification Result
In this subsection, we provide an evaluation of the proposed method’s performance.
The classification results are displayed in Fig. 3 as a confusion matrix, showcasing the accuracy of the classification. Additionally, Table 2 presents the precision, recall, and
F1-score of the classification results, with the corresponding numerical values also
depicted in Fig. 4. The results indicate that the system demonstrates strong perfor-
mance, despite not being trained on any of the instances. Notably, the performance
varies across the five roles being classified. The proposed model achieves the highest
F1-score for the FAC role, indicating its effectiveness in accurately classifying this
particular role. Although the results for the other roles are also good, the numbers for FAC are the best. In addition to this, the overall accuracy of the system is 59.08%. This figure clearly shows the applicability of zero-shot learning to legal rhetorical role detection. It should be noted here that we have not fine-tuned BART on the legal domain.
Despite this, the accuracy of the system is 59.08%. In addition to this, the proposed
model also performs well in terms of maintaining class distribution. From the figure,
it is also visible that the worst performance was obtained for the class RLC. The exact reason for this is, however, unknown.
5 Limitation of Work
The proposed method has certain drawbacks, including the reliance on a pre-trained
BART model, which may not be suitable for all legal documents due to the complexity
and domain-specific language found in such texts. Furthermore, while the approach
can decrease the volume of labeled data necessary for training, some manual involvement may still be needed to obtain the best results. These limitations offer opportunities for further exploration and improvement in future research.
A key limitation is the dependence on the quality and relevance of the pre-trained
language model employed for text segmentation. While we utilized the state-of-
the-art BART model, there is potential for improvement in its application to legal
documents, which are often complex and necessitate specific domain knowledge.
Future work could concentrate on refining the BART model for the legal domain
to achieve greater accuracy.
The generalizability of our method to various legal domains or languages is another limitation. Our experiments were conducted on a dataset provided by [6],
but it is possible that the results may not be consistent across other legal datasets
Fig. 3 Confusion matrix according to five roles. Here [0-FAC, 1-STA, 2-RLC, 3-PRE, 4-RPC]
Table 2 Results according to individual rhetorical role

Rhetorical role   Precision   Recall   F1-score
Facts (FAC)       0.4675      0.6234   0.5343
RLC               0.0191      0.0759   0.0305
RPC               0.2099      0.2820   0.2407
STA               0.5886      0.4410   0.5042
PRE               0.5320      0.3708   0.4370
Fig. 4 Results according to individual rhetorical role
with distinct characteristics. Currently, our method is limited to English-language
legal documents, so future work could investigate its applicability to other legal
domains and languages.
Moreover, although our method reduces the need for manual intervention in text
segmentation, it still necessitates labeled data for training the BART model. As
with any machine learning technique, the quality and quantity of labeled data can
significantly affect the model’s performance. Additionally, obtaining labeled data
can be time-consuming and costly, especially in the legal domain where large,
specialized datasets may be needed.
Despite these limitations, we maintain that zero-shot rhetorical role detection
holds the potential to transform the way raw legal data is processed. By minimizing
the amount of labeled data required for training, our approach can considerably
reduce the cost and time associated with text segmentation, ultimately leading to
more accurate and efficient models. Additionally, the flexibility of our method
in accommodating a dynamic number of rhetorical roles allows for more precise
and nuanced text segmentation, resulting in enhanced downstream analysis and
decision-making.
6 Conclusion
Zero-shot rhetorical role detection, a novel research area introduced in this paper,
has the potential to transform the way raw legal data is processed. This method
can significantly decrease the amount of labeled data needed for training and could
ultimately result in more accurate text segmentation models. We utilized a pre-trained
BART model to achieve legal document segmentation. Our experimental findings
suggest that using a pre-trained BART model for zero-shot rhetorical role detection
holds promise in reducing the labeled data required for training and enhancing text
segmentation models. This has considerable implications for the legal domain, where
large volumes of data are typically needed for training machine learning models, and
manual labeling costs are high. Moreover, our approach removes the necessity for
manual intervention by legal experts, which is both time-consuming and expensive. In
contrast to the current academic approach, which requires a legal expert to manually
label each statement for proper machine classification, our method offers a more
efficient alternative. Additionally, there is no need to settle on a pre-defined number
of classes, as a varying number of rhetorical roles can be supplied on demand. We
conducted experiments on the dataset provided by [6]. One of the key benefits of our
approach is the flexibility in the number of rhetorical roles that can be provided on
demand, as opposed to traditional methods that demand a pre-determined number of
classes, thus restricting the scope of analysis. In summary, zero-shot rhetorical role
detection has the potential to revolutionize the processing of raw legal data, yielding
more accurate text segmentation models while reducing the time and cost associated
with manual intervention. By further refining and enhancing the BART model within
the legal domain, we believe that our approach can lead to the more efficient and
precise classification of legal documents. Through numerical simulations, our results
showed promise, and the analysis indicated good scope for improvement in the future.
As part of future work, we plan to refine the BART model for the legal domain and
experiment with the zero-shot classification of legal documents, aiming to improve
the system’s accuracy.
Acknowledgements The work presented in this article is funded by Manupatra Information
Solutions Private Limited.
References
1. Hutcheson JC Jr (1928) Judgment intuitive the function of the hunch in judicial decision.
Cornell LQ 14:274
2. Bommer M, Gratto C, Gravander J, Tuttle M (1987) A behavioral model of ethical and unethical
decision-making. J Bus Ethics 6(4):265–280
3. Schwarz-Plaschg C (2018) Nanotechnology is like… the rhetorical roles of analogies in public
engagement. Public Underst Sci 27(2):153–167
4. Bhattacharya P, Paul S, Ghosh K, Ghosh S, Wyner A (2021) Deeprhole: deep learning for
rhetorical role labeling of sentences in legal case documents. Artif Intell Law 1–38
5. MacCormick N (2005) Rhetoric and the rule of law: a theory of legal reasoning. OUP Oxford
6. Ghosh S, Wyner A (2019) Identification of rhetorical roles of sentences in Indian legal judg-
ments. In: Legal knowledge and information systems: JURIX 2019: the thirty-second annual
conference, vol 322. IOS Press, p 3
7. Chaturvedi I, Cambria E, Welsch RE, Herrera F (2018) Distinguishing between facts and
opinions for sentiment analysis: survey and challenges. Inf Fusion 44:65–77
8. El-Kilany A, Azzam A, El-Beltagy SR (2018) Using deep neural networks for extracting
sentiment targets in Arabic tweets. In: Intelligent natural language processing: trends and
applications. Springer, pp 3–15
9. Wang W, Zheng VW, Yu H, Miao C (2019) A survey of zero-shot learning: settings, methods,
and applications. ACM Trans Intell Syst Technol (TIST) 10(2):1–37
10. Xian Y, Lampert CH, Schiele B, Akata Z (2018) Zero-shot learning—a comprehensive
evaluation of the good, the bad and the ugly. IEEE Trans Pattern Anal Mach Intell
41(9):2251–2265
11. Lewis M, Liu Y, Goyal N, Ghazvininejad M, Mohamed A, Levy O, Stoyanov V, Zettlemoyer
L (2019) Bart: denoising sequence-to-sequence pre-training for natural language generation,
translation, and comprehension. arXiv preprint arXiv:1910.13461
12. Savelka J, Ashley KD (2018) Segmenting US court decisions into functional and issue specific
parts. In: JURIX, pp 111–120
13. Shulayeva O, Siddharthan A, Wyner A (2017) Recognizing cited facts and principles in legal
judgements. Artif Intell Law 25(1):107–126
14. Venturi G (2012) Design and development of temis: a syntactically and semantically annotated
corpus of Italian legislative texts. In proceedings of the workshop on semantic processing of
legal texts (SPLeT 2012), pp 1–12
15. Wyner AZ, Peters W, Katz D (2013) A case study on legal case annotation. In: JURIX, pp
165–174
16. Wyner A, Peters W (2010) Towards annotating and extracting textual legal case factors.
In: Proceedings of the language resources and evaluation conference workshop on semantic
processing of legal texts, Malta
17. Saravanan M, Ravindran B, Raman S (2008) Automatic identification of rhetorical roles using
conditional random fields for legal document summarization. In Proceedings of the third
international joint conference on natural language processing: volume I
18. Nejadgholi I, Bougueng R, Witherspoon S (2017) A semi-supervised training method for
semantic search of legal facts in Canadian immigration cases. In: JURIX, pp 125–134
19. Walker VR, Pillaipakkamnatt K, Davidson AM, Linares M, Pesce DJ (2019) Automatic classi-
fication of rhetorical roles for sentences: comparing rule-based scripts with machine learning.
In: ASAIL@ ICAIL
20. Sarzynska-Wawer J, Wawer A, Pawlak A, Szymanowska J, Stefaniak I, Jarkiewicz M, Okruszek
L (2021) Detecting formal thought disorder by deep contextualized word representations.
Psychiatry Res 304:114135
21. Devlin J, Chang M-W, Lee K, Toutanova K (2018) Bert: pre-training of deep bidirectional
transformers for language understanding. arXiv preprint arXiv:1810.04805
22. Joshi M, Chen D, Liu Y, Weld DS, Zettlemoyer L, Levy O (2020) Spanbert: Improving pre-
training by representing and predicting spans. Trans Assoc Comput Linguist 8:64–77
23. Liu Y, Lapata M (2019) Text summarization with pretrained encoders. arXiv preprint arXiv:1908.08345
24. Dong L, Yang N, Wang W, Wei F, Liu X, Wang Y, Gao J, Zhou M, Hon H-W (2019) Unified
language model pre-training for natural language understanding and generation. Adv Neural
Inf Proc Syst 32
25. Narayan S, Cohen SB, Lapata M (2018) Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. arXiv preprint arXiv:1808.08745
26. Wang P, Fan Y, Niu S, Yang Z, Zhang Y, Guo J (2019) Hierarchical matching network for crime
classification. In: Proceedings of the 42nd international ACM SIGIR conference on research
and development in information retrieval, pp 325–334
27. Bhattacharya P, Hiware K, Rajgaria S, Pochhi N, Ghosh K, Ghosh S (2019) A comparative
study of summarization algorithms applied to legal case judgments. In: European conference
on information retrieval. Springer, pp 413–428
28. Song D, Vold A, Madan K, Schilder F (2022) Multi-label legal document classification: a
deep learning-based approach with label-attention and domain-specific pre-training. Inf Syst
106:101718
29. Venkateswarlu B, Shenoi VV, Tumuluru P (2022) Caviarws-based HAN: conditional autore-
gressive value at risk-water sailfish-based hierarchical attention network for emotion classifi-
cation in covid-19 text review data. Soc Netw Anal Min 12:1–17
30. Anand D, Wagh R (2022) Effective deep learning approaches for summarization of legal texts.
J King Saud Univ-Comput Inf Sci 34(5):2141–2150
31. Maxwell KT, Schafer B (2008) Concept and context in legal information retrieval. In: Legal
knowledge and information systems. IOS Press, pp 63–72
32. Ashley KD, Brüninghaus S (2009) Automatically classifying case texts and predicting
outcomes. Artif Intell Law 17:125–165
33. Medvedeva M, Vols M, Wieling M (2020) Using machine learning to predict decisions of the
European court of human rights. Artif Intell Law 28:237–266
34. Clark TS, Lauderdale B (2010) Locating supreme court opinions in doctrine space. Am J
Political Sci 54(4):871–890
35. Socher R, Ganjoo M, Manning CD, Ng A (2013) Zero-shot learning through cross-modal
transfer. Advances in neural information processing systems, 26
36. Chen Y-S, Chiang S-W, Meng-Luen W (2022) A few-shot transfer learning approach using
text-label embedding with legal attributes for law article prediction. Appl Intell 52(3):2884–
2902
37. Qiu X, Chen Z, Zhao L, Chengsheng H (2019) Unsupervised multi-view non-negative for law
data feature learning with dual graph-regularization in smart internet of things. Futur Gener
Comput Syst 100:523–530
38. Zhang T, Chandrasekaran DP, Thung F, Lo D (2022) Benchmarking library recognition in
tweets
39. Zhang T, Xu B, Thung F, Haryono SA, Lo D, Jiang L (2020) Sentiment analysis for software
engineering: how far can pre-trained transformer models go? In: 2020 IEEE International
Conference on Software Maintenance and Evolution (ICSME). IEEE, pp 70–80
IoB-Based Intelligent Healthcare System
for Disease Diagnosis in Humans
Shalu, Neha Saini, Pooja, and Dinesh Singh
Abstract Internet of Behavior (IoB) refers to the use of Internet of Things (IoT)
devices to track data, monitor, and influence human behavior. The increasing use
of IoB has also led to the development of systems for disease detection, which
can leverage IoT data to enhance the accuracy and speed of disease detection. In
this context, an IoB-based system for disease detection has been proposed in this
paper that uses data from various IoT devices, such as wearable sensors, to monitor
and analyze human behavior. The system collects data on numerous physiological
parameters, such as heart rate, blood pressure, and body temperature, and uses this
data to identify patterns that may be indicative of a particular disease or health
condition. This approach can detect diseases at an early stage, before symptoms
appear, which increases the chance of effective treatments. It can also provide real-
time feedback to healthcare providers, enabling them to make informed decisions
about patient care. The proposed DenseNet-K-Nearest Neighbor (KNN)-based IoB
healthcare system optimizes healthcare processes, supports clinical decision-making,
and can be used to improve patient care. The proposed model was compared with
existing algorithms such as Naive Bayes (NB), decision trees (DT), logistic regression
(LR), Convolution Neural Network (CNN), and KNN. The results demonstrate that
the proposed system has a greater accuracy of 97.66% than the other four algorithms.
The proposed method is expected to lower the risk of chronic diseases
Shalu
Manav Rachna University, Faridabad, Haryana, India
e-mail: shalu@mru.edu.in
N. Saini (B)
Government College Chhachhrauli, Yamuna Nagar, Haryana, India
e-mail: profnehasaini@gmail.com
Pooja
School of Computer Science and Engineering, Galgotias University, Greater Noida, India
e-mail: pooja1@galgotiasuniversity.edu.in
D. Singh
Deenbandhu Chhotu Ram University of Science and Technology, Murthal, Sonepat, India
e-mail: dineshsingh.cse@dcrustm.org
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_43
by detecting them early and lowering the cost of diagnosis, therapy, and doctor
consultation.
Keywords Internet of Behavior (IoB) ·Internet of Things (IoT) devices ·
Healthcare systems ·Wearable sensors ·Disease detection
1 Introduction
The internet has become an increasingly important tool for disease detection in
humans. In recent years, researchers have used social media and other online sources
to collect data for public health surveillance. IoT refers to a system of interconnected
physical devices that gather and distribute data and information over the Internet. IoT
allows the interconnection and independent processing of devices, while the volume and variety of data stored in the cloud grow rapidly. Patients' behavior, demands, and requirements can be gleaned from this data trove, a practice that has been termed the "Internet of Behavior" (IoB). Many patients are happy to provide their data
if it gives value, even though some patients are reluctant to do so. For instance, it
guarantees that healthcare systems can be altered in terms of diagnosis and disease
classification and patients’ experience can be improved. The ultimate goal is to
increase consistency and dependability; theoretically, all facets of consumer life
can be learned [13]. Before an application is developed, IoB can anticipate the
user’s social behaviors and contact points. This technology is used to ensure that the
application interface is consistent and user-friendly and provides easier navigation, which helps in the production process. Data collected by the app is utilized to gain insight into
the way people interact with it [4,5]. The IoB aims to derive a cognitive and social rationale from the data collected from people's behavioral patterns on the internet.
It discusses the interpretation and application of data in the creation and promotion
of innovative products based on human behavior [6,7].
In health care [8], the IoB has the potential to revolutionize the way we monitor,
diagnose, and treat diseases. By monitoring an individual’s online behavior, health-
care providers can gain valuable insights into their patients’ health status. The use of
social media and internet searches, for instance, may indicate the onset of a health
problem in time for preventative measures to be taken. Wearable devices and sensors
can also track and transmit real-time health data, providing healthcare providers with
a complete picture of their patients' health. Figure 1 depicts the major components
of a healthcare system based on IoB. The following is a comprehensive description
of the components:
IoT Gadgets: These include wristbands, tablets, and other network devices that
collect information on a person’s health and activity.
Data Collection: The data generated by IoT devices is stored and analyzed in a
central system.
Fig. 1 Components of healthcare system
Machine Learning: Using machine learning algorithms, data gleaned from IoT
devices is analyzed in order to find trends and anomalies, as well as to detect early
symptoms of disease and other health concerns.
Healthcare Professionals: The inferences created by the IoB-based healthcare
system are utilized by doctors, nurses, and other medical employees to make more
informed decisions on patient care.
Patient Monitoring: With IoB-based healthcare [9] systems, patients can be
remotely monitored for increased check-in frequency and earlier disease diagnosis.
In addition, the IoB can be applied to deliver tailor-made medical care and therapies.
Healthcare practitioners can better serve their patients by learning about their unique
preferences, lifestyles, and habits through an analysis of their online activity.
This work, based on the Internet of Behavior in healthcare systems, can address ethical, privacy, and security problems while also contributing significantly to patient care, process optimization, and clinical decision-making [10]. The paper begins with
the role of IoB in health care in Sect. 1. After that, various related studies have been
discussed in Sect. 2.
The methodology adopted for the research and proposed model is discussed in
Sect. 3. The results and discussion are described in Sect. 4. Finally, Sect. 5 concludes the paper with the contribution of the research to the healthcare sector and gives various future research directions.
2 Related Work
Early adopters of the internet saw the potential of using IoB for disease identi-
fication. Researchers in the 1990s began investigating the feasibility of utilizing
online communities and chat rooms in the study of communicable diseases like HIV/
AIDS. They discovered that patients suffering from these conditions could benefit from
engaging in online counseling services.
The proliferation of wearables and other Internet of Things devices in recent years
has created new possibilities for IoB-based [11] disease diagnosis. Wearable tech-
nology has the potential to revolutionize early disease detection by monitoring vital
signs such as heart rate and blood pressure [12]. Utilizing internet-based behavior
to diagnose diseases is a rapidly growing field of study. Several studies have
investigated the viability of using Artificial Intelligence (AI) and machine learning
(ML) to diagnose disease. This study applied AI in disease diagnosis and compared
the findings with various performance indicators, including prediction rate, accu-
racy, sensitivity, specificity, the area under the curve precision, recall, and F1-score.
Parkinson’s, tumors, chronic diseases, and heart disease can be effectively diagnosed
using AI, according to the findings of the study [13].
The author investigated the detection of neurodegenerative disorders using web
search signals [14]. Due to their gradual course and subtle symptoms, some conditions
have been reported to be difficult to diagnose. The author [15] discusses the public
health application of social media and internet-based health surveillance. The study
indicated that a dearth of longitudinal research and methodological difficulties can
impede the successful implementation of such a system.
Additionally, a study utilized machine learning (ML) to forecast disorders. The
study indicated that logistic regression performed well in predicting cardiovascular
diseases, while random forest and convolutional neural networks were employed
to accurately identify breast diseases [16]. These studies highlight the potential
for detecting diseases using internet-based behavior. There is a need for additional
research on the benefits of applying AI and ML to detect human diseases.
3 Proposed Methodology
IoB refers to the tracking and analysis of human behavior using connected devices and
data analytics. IoB-based models collect and evaluate data on individuals' behavior patterns, such as their sleep patterns, exercise habits, and eating habits, and this approach can be used for disease classification in humans. A model based on IoB could detect patterns associated with specific diseases or health conditions by examining
this data. For instance, if the model detects that a person consistently has poor sleep
quality, a lack of physical exercise, and unhealthy eating habits, it may imply that the person is at risk of developing obesity, diabetes, or cardiovascular disease.
(A) Collect Data: Real-world information such as a patient's characteristics, socioeconomic level, and clinical findings is gathered. In order to protect the privacy of the
patients, the dataset does not include identifiable details about them such as name,
age, and their residential information.
(B) Build a Model: Build a model based on the analyzed data that can be used
to detect illnesses in humans. This model should be trained on a large dataset and
validated using a separate test set. The proposed model involves various steps for
disease prediction as shown in Fig. 2. The steps are discussed in detail below:
Fig. 2 Proposed IoB-based disease detection model
(C) Data preprocessing: Most of the gathered structured data have missing values,
and thus, they are preprocessed appropriately. The quality of the dataset can only
be improved by adding missing information or eliminating or updating inaccurate
records. All punctuation and white spaces are removed during the preparation phase.
Data undergoes feature extraction and disease prediction after initial processing is
complete. Out of 450 instances, 32 features are selected.
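A minimal sketch of this preprocessing step: punctuation and extra white space are stripped from text fields, and missing numeric values are filled with the column mean. The record fields and values below are hypothetical examples, not the paper's data.

```python
# Sketch of the preprocessing described above. Text cleaning removes
# punctuation and collapses white space; missing numeric entries (None)
# are filled with the mean of the observed values. Values are invented.
import string

def clean_text(value):
    # Remove all punctuation, then collapse runs of white space.
    stripped = value.translate(str.maketrans("", "", string.punctuation))
    return " ".join(stripped.split())

def fill_missing(column):
    # Replace missing (None) entries with the mean of the observed values.
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

print(clean_text("  blood pressure: high!!  "))  # -> "blood pressure high"
heart_rates = fill_missing([72, None, 80, 76])   # -> [72, 76.0, 80, 76]
```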
(D) Model Training Using DenseNet: After feature extraction, the model is trained using the DenseNet algorithm, which was developed to combat the loss of accuracy in very deep neural networks caused by the vanishing gradient. The process
begins with a vectorization of the data set. After that, it is forwarded on to the
convolution layer. Following the convolution layer, the max pooling process is carried
out in the pooling layer. The max pooling output is passed to the fully connected
layer, and then, the classification is performed by the output layer.
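The layer sequence described above can be sketched numerically. The toy below uses 1-D layers and made-up weights purely to show the data flow (vectorized input, convolution, max pooling, fully connected layer, output); it is not an actual DenseNet implementation.

```python
# Toy forward pass for the described pipeline:
# vectorized input -> convolution -> max pooling -> fully connected -> output.
# All weights and sizes are invented for illustration.

def conv1d(x, kernel):
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k))
            for i in range(len(x) - k + 1)]

def max_pool(x, size=2):
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]

def dense(x, weights, biases):
    return [sum(xi * w for xi, w in zip(x, row)) + b
            for row, b in zip(weights, biases)]

# Vectorized input, e.g. extracted feature values for one patient record.
x = [0.2, 0.9, 0.4, 0.7, 0.1, 0.8]
h = conv1d(x, kernel=[0.5, -0.5, 0.5])      # convolution layer
h = max_pool(h, size=2)                     # pooling layer (max pooling)
logits = dense(h, weights=[[1.0, -1.0], [-1.0, 1.0]], biases=[0.0, 0.1])
predicted_class = max(range(len(logits)), key=lambda i: logits[i])  # output layer
```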
(E) Determine Distance Using KNN: After training the model, distances are determined using KNN. KNN is a supervised model that compares new data with existing data, finds the most similar category, and assigns the new data to that category. In KNN, the value of K is fixed in advance, and the nearest neighbors are the K stored samples with the smallest distance to (i.e., the highest similarity with) the new sample. The predicted disease is the class of the neighbors with the smallest distance values.
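The KNN step above can be sketched as follows; the Euclidean distance, the toy feature vectors, and the class labels are illustrative assumptions, not the paper's actual features.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among the k nearest
    training points (Euclidean distance)."""
    dists = sorted((math.dist(x, query), label) for x, label in train)
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy data: two clusters labelled "healthy" / "at-risk".
train = [((1.0, 1.0), "healthy"), ((1.2, 0.8), "healthy"),
         ((4.0, 4.0), "at-risk"), ((4.1, 3.9), "at-risk"),
         ((0.9, 1.1), "healthy")]
print(knn_predict(train, (1.1, 1.0)))  # -> healthy
```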
(F) Model Validation: The model has been validated by computing performance
metrics including accuracy, precision, recall, and F1-score, which are described in
the findings and discussions section.
Ultimately, a model based on the IoB has great potential as a tool for early disease
identification and individualized treatment in humans. The model can rapidly and
accurately analyze data from multiple sources to draw conclusions about a patient’s
health. However, sufficient safeguards must be in place to protect the privacy of
patients and prevent unauthorized access to personal health information.
580 Shalu et al.
4 Results and Discussions
Seven performance metrics are used to evaluate the proposed disease detection
system.
Accuracy: Accuracy in classification is represented mathematically as the percentage
of correct predictions relative to all predictions and depicted in Eq. (1).
Accuracy = (TP + TN)/(TP + TN + FP + FN) × 100. (1)
Precision: Precision is defined as the proportion of true positive predictions relative to all positive predictions (both true and false positives) and is depicted in Eq. (2).
Precision = TP/(TP + FP). (2)
Recall: It is defined as the proportion of true positives relative to the sum of true positives and false negatives, and it is depicted in Eq. (3).
Recall = TP/(TP + FN), (3)
F1-Score = 2 × (Precision × Recall)/(Precision + Recall). (4)
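Equations (1) to (4) can be made concrete with a small Python helper; the confusion-matrix counts below are invented for illustration, not results from the paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1-score from confusion-matrix
    counts, following Eqs. (1)-(4)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn) * 100  # percentage, Eq. (1)
    precision = tp / (tp + fp)                        # Eq. (2)
    recall = tp / (tp + fn)                           # Eq. (3)
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (4)
    return accuracy, precision, recall, f1

acc, p, r, f1 = classification_metrics(tp=90, tn=85, fp=10, fn=5)
print(f"accuracy={acc:.1f}%  precision={p:.3f}  recall={r:.3f}  f1={f1:.3f}")
```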
Since the prediction result is crucial to the patient and will have negative conse-
quences if it is inaccurate, accuracy is a crucial metric to consider. Accuracy
evaluations between the proposed algorithm and other techniques are graphically
represented in Fig. 3.
Prediction accuracies of 52% for NB, 62% for DT, 86% for LR, 96% for CNN with KNN, and 97.66% for DenseNet with KNN are shown on the graph.
Comparison to various machine learning techniques demonstrates that the proposed
system obtains the greatest accuracy of 97.66%.
[Bar chart: accuracy (0–150 scale) of Naïve Bayes, Decision Tree, Logistic Regression, CNN and KNN, and DenseNet and KNN]
Fig. 3 Accuracy analysis of the proposed versus other techniques
[Bar chart: precision, recall, and F1-score (0–120 scale) for Naïve Bayes, Decision Tree, Logistic Regression, CNN and KNN, and DenseNet and KNN]
Fig. 4 Comparison of other performance evaluation metrics of proposed and other algorithms
The four existing techniques (NB, DT, LR, and CNN with KNN) and the proposed DenseNet with KNN algorithm are compared on three performance assessment parameters. As can be seen in Fig. 4, the experimental findings show precision values of 52%, 64%, 84%, 93%, and 97.5%; recall values of 60%, 80%, 88%, and 99.5%; and F1-scores of 65%, 62.5%, 82.5%, 97.5%, and 98.5%. These results show that the model built with the DenseNet and KNN algorithm surpasses the other four methods in terms of precision (97%), recall (98%), and F1-score (98%).
MCC: The Pearson product-moment coefficient of correlation between the actual and
anticipated components is the basis for the Matthews Correlation Coefficient (MCC)
metric, which is based on a contingency matrix. A score close to −1 indicates the weakest classifier, and a score close to +1 indicates the best classifier.
MCC = (TP × TN − FP × FN)/√((TP + FP)(TP + FN)(TN + FP)(TN + FN)). (5)
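Equation (5) translates directly into code; the counts below are illustrative, and returning 0 when a marginal sum is zero is a common convention rather than something specified in the paper.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient, Eq. (5). Returns 0 when any
    marginal sum is zero (a usual convention for the degenerate case)."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

print(mcc(90, 85, 10, 5))   # mostly correct classifier: close to +1
print(mcc(5, 10, 85, 90))   # mostly wrong classifier: close to -1
```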
Miss Rate: Misclassification rate measures how often the model provides an
inaccurate prediction.
Recognition Speed: The total number of images in the test set divided by the total time taken for testing.
The constraints of the prediction method are reflected in MCC, and a high MCC score correlates with the best prediction performance. Figure 5 shows the MCC value of 94% achieved by the DenseNet-KNN-based model on the chronic renal disease dataset. Based on our findings, its MCC outperforms the other approaches we have compared, so the proposed method has the highest prediction efficiency. The proposed DenseNet-KNN approach is compared to the other existing methods with respect to miss rate, RS, and MCC value, as shown in Fig. 5.
[Bar chart: miss rate (%), RS (jobs per unit time), and MCC (0–100 scale) for Naïve Bayes, Decision Tree, Logistic Regression, CNN and KNN, and DenseNet and KNN]
Fig. 5 Miss rate versus RS versus MCC value of proposed with other existing techniques
Results show that our proposed model achieves the best miss rate and RS values; its miss rate of 11 is lower than that of the other existing techniques.
5 Conclusion and Future Scope
Internet of Behavior facilitates the collection of data, monitoring, and manipulation
of human behavior using IoT gadgets. This study suggests an IoB-based system
for disease diagnosis by collecting and analyzing data from several IoT devices,
including wearable sensors. Several physical parameters, including heart rate, blood
pressure, and temperature, are monitored and analyzed by the system to find patterns
that may indicate an illness or health condition. In this study, we have proposed
an IoB-based system employing machine learning methods such as DenseNet and
KNN to detect and predict an individual’s chance of developing a chronic disease.
In this study, the proposed model's performance is evaluated against that of other popular machine learning algorithms, namely NB, DT, LR, and CNN with KNN. The
findings demonstrate that the proposed system outperforms the other four algorithms
with an accuracy of 97.66%. We strive to enhance our model by applying it to new
image-based datasets and decreasing its execution time, although it already achieves
state-of-the-art performance. In the future, we hope to use a real-time data set to
predict the spread of several airborne diseases. It is expected that the suggested
approach will decrease the prevalence of chronic diseases through early diagnosis
while also decreasing the costs associated with said diagnosis, treatment, and health
check-ups.
IoB has enormous potential in the future of disease identification. There is a lot
of behavioral data that can be used to detect diseases early and create individualized
treatment strategies, and this data is increasingly available through wearable devices,
social media platforms, and other digital sources. Infectious illness outbreaks can
be monitored and prevented with the help of the IoB by collecting information on
people’s habits and the mobility trends.
Analyzing the Impact of Extractive
Summarization Techniques on Legal Text
Utkarsh Dixit, Sonam Gupta, Arun Kumar Yadav, and Divakar Yadav
Abstract Legal document summarization refers to the process of condensing a lengthy legal document into a more concise form while retaining all the critical aspects.
This study aimed to evaluate the effectiveness of extractive text summarization for
summarizing legal materials. Various models such as SVM, NB, KB, Winnow, and
C4.5 were used to summarize the text, and the ROUGE score was used to evaluate
performance. The methodology involved utilizing various strategies and models for
summarization, including extractive text summarization, which recognizes relevant chunks of content and reproduces them word for word, resulting in a selection of phrases from the source text. The inclusion of all legal aspects in legal document summarization results in a well-structured form.
tion is commonly used in legal documents because it recognizes relevant content
and produces well-structured summaries that include all legal aspects. The study
also suggested that additional strategies can be used to generate summaries through
extractive text summarization. The results indicated that C4.5 is the most effective
model for decision tree classification. Therefore, it can be concluded that extractive
text summarization is an effective method for summarizing legal materials, and C4.5
is a useful model for this purpose. Extractive summary recognizes and reproduces
large fragments of a message, while abstractive summary uses language processing
to create a more human-like summary. Extractive methods are commonly used in
legal documents because abstractive summarization may result in the loss of orig-
inal content and lacks sufficient data for deep learning. Legal document summary
should cover all legal aspects, including judgment record and logical fragments,
for better document structure. To evaluate summary text performance, the ROUGE
score, precision, recall, and F-Measure were used by counting n-grams, overlapping
U. Dixit ·S. Gupta (B)
Ajay Kumar Garg Engineering College, Ghaziabad, India
e-mail: guptasonam@akgec.ac.in
U. Dixit
e-mail: utkarsh2010016m@akgec.ac.in
A. K. Yadav ·D. Yadav
National Institute of Technology, Hamirpur, HP, India
e-mail: ayadav@nith.ac.in
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_44
585
586 U. Dixit et al.
word pairs, and word sequences, and focusing on text summarization. The study also
conducted a survey on the use of extractive text summarization in legal documents.
Various techniques were examined, and different modules were used in the process.
Extractive summarization was chosen for use in legal documents as it preserves the
meaning of the document and utilizes a subset of the text for summarization.
Keywords Legal document ·Automatic text summarization ·Extractive text
summarization ·Abstractive text summarization ·SVM ·NB
1 Introduction
Rapid growth in information is being observed as we move forward in the age of
data. This field of online growth in information has aided all fields, including the
legal background. Legal documents, which comprise constitutions, contracts, deeds,
orders, judgments, statutes, and many more, are complex to structure and under-
stand, making them difficult for legal practitioners to comprehend the case and make
future judgments. However, if better text summarization for legal documents was
to be implemented, it would be much simpler to understand. An outline, being a
dense variant of a long report that incorporates all significant data, is the topic of
investigation. The aim is to use different techniques in ATS for legal documents so
that the quality of the document is not reduced. The focus of ATS is on creating a
briefer version of the document without reducing the meaning of the document [1].
Two types of ATS can be identified: (1) extractive and (2) abstractive.
Extractive text summarization reviews the whole document and creates a summary containing a subset of the sentences of the original document or report.
Abstractive text summarization, on the contrary, produces the summary using its
own terminology without losing the meaning of the document. Different approaches
for text summarization are followed by both techniques [2].
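As a concrete illustration of the extractive route, here is a minimal frequency-based sentence scorer; it is one of many possible scoring schemes and not a method taken from the papers surveyed below.

```python
from collections import Counter

def extractive_summary(text, n_sentences=1):
    """Score each sentence by the average document-wide frequency of its
    words, then return the top-scoring sentences in original order."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    freq = Counter(w for s in sentences for w in s.lower().split())
    scores = [sum(freq[w] for w in s.lower().split()) / len(s.split())
              for s in sentences]
    top = sorted(range(len(sentences)),
                 key=lambda i: scores[i], reverse=True)[:n_sentences]
    return ". ".join(sentences[i] for i in sorted(top)) + "."

doc = ("The contract was signed by both parties. "
       "The parties agreed the contract terms in June. "
       "Lunch was pleasant.")
print(extractive_summary(doc, 1))
```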
There are various independent tasks for extractive summarization (Fig. 1).
In extractive summarization, several aspects are utilized [3].
Statistical and Semantic Aspect: Different measurable and semantic angles, such as
word recurrence, connective articulation, area, and title, are examined in this strategy.
These are used for identifying the relevant sentence and then analyzing it.
ML Aspect: Supervised and unsupervised paths are used. In supervised learning, a
label is present in the training data, while in unsupervised learning, the training data
does not have a label but forms a cluster based on similarity.
Probabilistic Aspect: The objective is to identify significant phrases, essential ideas,
and associations.
Fig. 1 Task of extractive summarization
Graph-Based Aspect: An attempt is made to construct a similarity network whose vertices represent sentences and whose edges carry similarity scores between the sentences. Once the graph is built, the PageRank algorithm is used to score the sentences; after ranking, the top-k highest-scoring sentences are selected as the summary.
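The graph-based pipeline (sentence-similarity matrix → PageRank → top-k selection) can be sketched as follows. The word-overlap similarity is a simple illustrative stand-in for whatever measure a real system would use, and the toy sentences are invented.

```python
def overlap_similarity(a, b):
    """Jaccard word overlap between two sentences (a crude similarity)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def pagerank_scores(sim, damping=0.85, iters=50):
    """Power iteration of PageRank over a sentence-similarity matrix."""
    n = len(sim)
    scores = [1.0 / n] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                if i == j:
                    continue
                out = sum(sim[j][k] for k in range(n) if k != j)
                if out:  # j distributes its score along its edges
                    rank += sim[j][i] / out * scores[j]
            new.append((1 - damping) / n + damping * rank)
        scores = new
    return scores

def summarize(sentences, top_k=1):
    sim = [[overlap_similarity(a, b) for b in sentences] for a in sentences]
    scores = pagerank_scores(sim)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i],
                    reverse=True)
    return [sentences[i] for i in sorted(ranked[:top_k])]  # original order

doc = ["the court ruled on the appeal",
       "the appeal concerned the tax ruling",
       "lunch was served at noon"]
print(summarize(doc, top_k=1))
```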
Neural Network-Based Aspect: Text learning is achieved through neural networks,
which are connections of distinct nodes that communicate with each other.
Text Simplification Aspect: The approach of decreasing any lexical or syntactic
intricacy related to the text without modifying the substance of the text is carried out.
It is the preprocessing stage that finally results in the selection of a useful sentence.
Topic Aspect: In this approach, a summary is generated based on the topic, and the
focus of sentences is concentrated on various topics.
Clustering Aspect: An attempt is made to eliminate repeated sentences in the summary through this approach. It is particularly suitable for multi-document summarization.
DL Aspect: Methods to train the network, based on the style of the human reader,
are employed.
Fuzzy Logic Aspect: The uncertainty of the input is dealt with, as logical results
can be provided by fuzzy inference systems. Evaluation in an environment that is
unclear and confusing is carried out.
Advantage of Extractive Summarization: It is faster and easier to understand than the abstractive method, and selecting sentences directly achieves higher accuracy [4].
Disadvantage of Extractive Summarization: Redundancy, lack of semantic cohesion, conflicting information, etc., are encountered [5].
A list of research questions was developed to gain a better understanding of legal
documents and the text summarization technique through research.
Q1: Which summarization method is best for legal documents?
Q2: How can we improve the structure of legal document summarization?
Q3: Why is there greater emphasis on extractive summarizing and less emphasis
on abstractive summarization?
Q4: How do we find out if our text summarization results are performing better
or not?
The study is organized as follows: Section 2 describes the various extractive summarization techniques and the associated work done in ATS. Section 3 examines the legal document and the extraction techniques employed. Section 4 addresses all the previous parts and responds to the research questions. Section 5 contains the conclusion and references.
2 Literature Review
Several research papers on extractive summarization have been reviewed, with the study focused on the extractive summarization of legal documents. Different approaches to extractive summarization are investigated below.
Statistical Features: The use of statistical and semantic variables in extractive text summarization was investigated by Vodolazova et al. [6]. Techniques such as stop-word removal, word detection, anaphora resolution, and textual entailment were examined. Through examination of these strategies, it was found that semantic techniques (for example, anaphora resolution, recognition of textual entailment, and word-sense disambiguation) enhance the recognition of overt repetitiveness, while statistical features such as word frequency and inverse sentence frequency provide the most effective tools for selecting significant sentences for the final summary.
by Metais et al. [7] in their research, one being the examination of how the presenta-
tion of the text summary is affected by the content of the report and the other being
the investigation of how semantic properties of text may influence the performance of
various automated text summary operations. Semantic research tactics were consid-
ered, and an examination of their relation to formal representation of people, places,
and things, pronouns, and the distribution of specific entities over the original text
that were included in the associated summary was conducted. It was found that the
assumption was not supported; however, the dynamic summary system was found
to improve the summarization process.
ML-Based Approach: A method for automated document summarization based on clustering and extractive summarization is described by Aliguliyev [8], in which the text is first clustered and an evolutionary algorithm is then applied per cluster to optimize the objective function. An original technique for determining intra- and
inter-event relevance using knowledge from internal association, semantic similarity, distributional proximity, and named-entity clustering is characterized by Li et al. [9]. This technique is used in conjunction with a PageRank calculation to determine the significance of an event for a summary. A clustering
approach on event word graph semantic linkages, collected from external linguistic
sources, is utilized by Liu et al. [10] and is found to outperform the PageRank-based
method. A commonly used sentence rating method, whose primary purpose is to
determine the most relevant sentence, is provided by Silva et al. [11]. A sentence-importance classifier that predicts the key sentences first and then forms a summary of the necessary length is introduced by Yang et al. [12].
Probabilistic Approaches: The identification of important sentences, key concepts,
and relationships within the text through the process of automatic summarization
is the goal of the approach provided by Fung et al. [13], which utilizes an HMM framework with a modified method for extractive summarization and an unsupervised probabilistic technique to determine class centroids, class sequences, and class borders.
Neural Network-Based Approach: For extractive text summarization, a neural network-based approach is employed with BERT, a pre-trained transformer model with the highest performance in NLP tasks, as presented by Liu et al. A unique neural network for learning features inherent in sentences and contextual links between phrases is offered by Ren et al. [15] in CRSum. A novel term-document
co-ranking approach for extractive text summarization is suggested by Fang et al.
[16] which combines a graph-based ranking method with the word-sentence rela-
tionship in CoRank. It is noted that the co-ranking process takes into consideration
that different words should have different weights.
Topic Approach: These methodologies attempt to determine the topic of the document (i.e., what the document is really about). Term frequency, term frequency–inverse document frequency (TF-IDF), and lexical chains are the most common methods for topic representation. The processing stages of a topic-extraction summarizer are as follows: (1) transforming the input text into an intermediate representation in which the input material is analyzed; (2) assigning a score to each sentence in the document based on its representation [17].
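The TF-IDF weighting mentioned above can be sketched in a few lines; treating each sentence as a "document" is an illustrative choice common in extractive scoring, and the example sentences are invented.

```python
import math
from collections import Counter

def tf_idf(sentences):
    """Per-sentence TF-IDF weights: term frequency within the sentence
    times the log of inverse sentence frequency across the document."""
    docs = [s.lower().split() for s in sentences]
    n = len(docs)
    # In how many sentences does each word appear?
    df = Counter(w for d in docs for w in set(d))
    weights = []
    for d in docs:
        tf = Counter(d)
        weights.append({w: (tf[w] / len(d)) * math.log(n / df[w])
                        for w in tf})
    return weights

w = tf_idf(["the cat sat", "the dog ran", "the cat ran"])
# "the" appears in every sentence, so its IDF (and weight) is zero.
print(w[0]["the"], w[0]["cat"])
```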
Clustering Approach: Multi-document summarization utilizes clustering, where
the cluster comprises the most central and crucial sentences, which contain vital
information. After identifying the central sentences and ranking them, the process
of document summary can be carried out [18].
Deep Learning Approach: A method that uses document similarity over embeddings to capture meaning is proposed by Kobayashi et al., in an attempt to train networks that work in a human-readable form. It was found that when this model was applied to documents containing more than the first few phrases, more complex meaning was captured than by sentence-level similarities [19].
Fuzzy Logic Approach: Fuzzy logic concepts are used in automatic text summariza-
tion (ATS) to resemble a powerful decision-making tool and provide an effective way
to depict a sentence’s importance. The sentence scoring method includes selecting
a collection of characteristics for each sentence and then using a fuzzy logic system
to select the important sentences [20].
Different techniques are combined to eliminate their shortcomings and produce
the best summaries, as all approaches have their own advantages and limitations
depending on the inputs. For example, Moratanch et al. [21] combined graph-based
and concept-based methods to generate a summary, Rahman et al. [22] suggested an
extractive summary method that captures the semantics of text and clusters them to
summarize the document using a distributional semantic model, and Mao et al. [23]
combined unsupervised with supervised learning to produce a resultant summary of
a single document (Table 1).
3 Legal Document Summarization
The summary form of legal papers, such as court judgments, is distinguished from the summarization of other types of documents by the inclusion of article numbers, rules, and other legislative wording. The features that make legal documents different from other documents include:
Size: Due to the large number of items to be covered, the length of the text is
greater than that of other texts.
Structure: Conformity to the hierarchical structure of legal norms and regulations
necessitates a distinct internal structure for legal texts.
Vocabulary: Legal texts use a distinct, specialized vocabulary of legal terms that rarely appears in other documents.
Ambiguity: The same wording in legal texts may be used for different courts and
various purposes, resulting in multiple interpretations being present.
Citations: Legal texts cite statutes and prior cases, and highlighting the key points of the case through these citations is deemed essential in the legal field.
An extractive approach was utilized to develop summaries of legal text. Various techniques of extractive summarization have been discussed above, and numerous studies have been conducted in the area.
Galgani et al. [24] employed a knowledge-based (KB) approach to integrate various summarization techniques, using the ripple-down rules of Compton and Jansen (1990) for KB creation. They developed a tool that uses these rules to assist in the
Table 1 Advantages and disadvantages of extractive techniques

Statistical and semantic aspect. Advantage: requires less memory and capacity; needs no linguistic knowledge; language-independent. Disadvantage: important sentences may be excluded if they do not score highly.
Machine learning. Advantage: improves sentence selection. Disadvantage: requires a large dataset.
Probabilistic approaches. Advantage: finds important sentences, relationships, and concepts. Disadvantage: NA.
Graph based. Advantage: boosts coherency and detects redundant information; document-independent. Disadvantage: if two words have the same weight, only one is chosen, which can lead to incorrect interpretation.
Neural network. Advantage: human-readable summaries in which each statement is linked to the next without losing the original meaning. Disadvantage: needs large data and is complex in nature.
Text simplification. Advantage: reduces lexical or syntactic complexity. Disadvantage: NA.
Topic-based. Advantage: summarizes on the basis of the topic. Disadvantage: sentences with higher scores may still be excluded.
Clustering approach. Advantage: the summary does not include repeated sentences. Disadvantage: requires the number of clusters to be specified in advance.
DL. Advantage: trains models that produce summaries in a human-readable form. Disadvantage: training data must be built manually.
Fuzzy logic. Advantage: fuzzy selection of sentences produces the summary. Disadvantage: redundancy among the selected sentences can affect the overall quality of the summary.
testing and creation of rules for a legal corpus, selection, feature definition based on the current case context, and utilization of different data in varied contexts. Performance was evaluated on AustLII (Australasian Legal Information Institute) data using ROUGE-1.
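ROUGE-1 precision, recall, and F-measure reduce to counting overlapping unigrams between the system and reference summaries; a minimal sketch (the two example summaries are invented):

```python
from collections import Counter

def rouge1(candidate, reference):
    """ROUGE-1: clipped unigram overlap between a candidate summary
    and a reference summary."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f = rouge1("the court allowed the appeal",
                 "the appeal was allowed by the court")
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")
```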
A citation-based technique for summarization was employed by Galgani et al.
[25]. A phrase was extracted from a publication or reference text using citation
and used as a summary. The best citations were selected based on a centroid or
centrality-based summary class.
A method for topic-based text summarization using LDA was proposed by Venkatesh [26]. An algorithm for sentence grading, based on the likelihood of terms appearing frequently in relation to each topic, was created using the LDA,
which represents the document as a string of words whose topics are generated by a probabilistic model. The topics from the LDA are used to create the concluding summary. The dataset used for the sentence scoring technique consisted of 116 documents from Indian civil cases, drawn from five separate sub-domains (Income Tax, Rent Control, Motor Act, Negotiable Instrument Act, and Sales Tax).
A graph-based method for extractive summarization was employed by Kim et al. [27]. In this method, sentences are represented as nodes in a directed sentence graph. When the likelihood of a sentence being implied by the subsequent node reaches a preset level, a directed edge is added between the two nodes. A summary topic is represented by a connected component of the graph, and the phrases are selected from the connected components using key-value functionality.
A graphical representation of the legal document that highlights the repetition of legal terminology was proposed by Schilder and Molina-Salgado [28]. A similarity function between phrases is used to generate graphical representations of legal language, and the approach is set apart from other graph-based approaches by combining the similarity function with a voting algorithm. It is hypothesized that, for legal papers, certain paragraphs summarize the entire material; the paragraph-detection technique therefore uses similarity ratings between paragraphs to determine which match is most appropriate for each paragraph. This works as a voting mechanism, with one paragraph voting for another, and the most popular paragraphs are chosen as the simplified version.
The state of sentences in texts from a HOLJ corpus was examined by Hachey
and Grover [29] using a classifier. Sentence extraction is based on the Teufel and Moens features, and sentences are classified into categories such as fact, proceeding, background, framing, disposal, and textual. The same features were used in
four classification algorithms: SVM, C4.5, NB, and Winnow. The most favorable
findings in terms of micro average F-score were produced by C4.5.
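A minimal Naive Bayes sentence classifier of the kind compared above can be sketched as follows; the bag-of-words features, toy sentences, and the two rhetorical labels are illustrative assumptions, not the HOLJ setup.

```python
import math
from collections import Counter, defaultdict

def train_nb(labelled):
    """Multinomial Naive Bayes over bag-of-words features."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in labelled:
        words = text.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocab.update(words)
    return word_counts, label_counts, vocab

def classify(model, text):
    """Pick the label with the highest log-posterior (add-one smoothing)."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best, best_lp = None, -math.inf
    for label in label_counts:
        lp = math.log(label_counts[label] / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.lower().split():
            lp += math.log((word_counts[label][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

data = [("the court held that the appeal fails", "disposal"),
        ("the facts of the case are as follows", "fact"),
        ("the appeal is dismissed with costs", "disposal"),
        ("the claimant worked at the factory", "fact")]
model = train_nb(data)
print(classify(model, "the appeal is dismissed"))
```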
A sentence categorization method using the NB classifier, combining a group of linguistic traits such as appearance, particularity, and substantive features, was proposed by Yousfi-Monod et al. [30]. They named it PRODSUM (PRObabilistic Decision SUMmarizer) after dividing the summarization into four sections: introduction, context, reasoning, and conclusion.
This paper analyzes the reasons for the absence of practically useful German
abstractive text summarization solutions in industry. The study [31] focuses on
training resources and publicly available summarization systems and finds that
existing datasets have crucial flaws that negatively affect system generalization
and evaluation biases. The paper also highlights the poor performance of available
systems compared to simple baselines and more effective extractive summarization
approaches. The authors attribute poor evaluation quality to a lack of qualitative gold
data, understudied positional biases in existing datasets, and the lack of accessible
preprocessing strategies or analysis tools. They provide a comprehensive assessment
of available models and emphasize the problems of relying solely on n-gram-based
scoring methods.
The paper [32] offers a comprehensive survey of the NLP & Law domain, with a
focus on recent technical and substantive developments. The authors construct and
analyze a corpus of over 600 NLP & Law-related papers published over the past
decade. They observe an increasing number of papers, tasks, and languages covered,
as well as an increase in the sophistication of methods deployed. The authors note that
legal NLP is starting to match the methodological sophistication and professional
standards of the broader scientific community. They conclude that while the trends
bode well for the future of the field, many questions in both the educational and
corporate sphere remain open.
The paper [33] presents a new method for detecting summary obfuscation, which
is a type of plagiarism that is difficult to detect with traditional methods. The approach
proposed is founded on named entity recognition and dependency parsing, which is
both more precise and analytically simpler than the existing methods based on genetic
algorithms. At the document level, the technique successfully identifies instances of
summary obfuscation and produces high accuracy at the sentence level. Additionally,
the proposed method was tested on other types of plagiarism and achieved excellent
results. Overall, the paper presents a promising new approach for detecting summary
obfuscation that could have important implications for plagiarism detection in various
fields.
The main emphasis of the paper [34] is on the problem of efficiently storing and
retrieving essential information from voluminous text documents. To address this
challenge, the authors propose the use of text summarization techniques, specifically
extractive approaches, and provide an overview of multiple metrics for evaluating
the quality of the resulting summary. The paper provides a review of numerous
approaches to text summarization and highlights the importance of determining the
most suitable approach for a given objective. Overall, the paper emphasizes the
importance of text summarization in improving the efficiency of information storage
and retrieval from large text documents.
The paper [35] aims to discuss the importance of newspapers and news websites in
providing information on COVID-19 and how different models can be used to identify
topics, sentiments, and summarization of news articles. The study used a proposed
topic model to analyze the sentiments expressed by various countries about COVID-
19 and discovered that the UK was the most negatively affected, with the highest
percentage of negative sentiments. The XLNet sentiment categorization model was
also used, and it performed well in terms of validation accuracy. To obtain a better
understanding of the COVID-19 pandemic, the study emphasizes the significance of
analyzing various topics, themes, and issues.
The rising volume of textual data generated daily presents challenges in summa-
rizing and extracting relevant information. In this paper [36], a hybrid feature extrac-
tion approach is proposed, utilizing multi-layered attentional stacked LSTM and
attention RNN networks to automatically produce summaries from lengthy news
text. The proposed methodology achieves better results in text length issues, attribute
extraction, and categorization of news text. Experiments show the effectiveness of
the proposed approach in resolving these issues.
As the web and social media continue to produce an overwhelming amount of
unstructured data, it becomes increasingly challenging for individuals to locate perti-
nent information efficiently. Text summarization offers a solution to this problem
594 U. Dixit et al.
by extracting relevant information and presenting it concisely, without altering
the core meaning of the original content.
Researchers have previously attempted to develop ML approaches for summarization
but still struggle to produce better-summarized results. In this paper [37], the authors
proposed a DL-based model for summarization, which outperformed the advanced
models on a standard dataset at the sentence level with BLEU and ROUGE values
of 0.4 and 0.6, respectively. The model uses reinforced learning with an attention
layer, and its performance was analyzed before proposing the deep learning-based
model. Based on their experiments, the authors assert that their proposed model
yields favorable outcomes in terms of precision and validity.
The paper [38] addresses the limitations of existing summarization datasets in
terms of being overly focused on certain domains and being primarily monolingual.
The paper introduces EUR-Lex-Sum, a cross-lingual dataset that includes paragraph-
aligned data in various European languages and is based on manually curated docu-
ment summaries of legal acts from the European Union law platform. The dataset is
anticipated to enable future research in domain-specific cross-lingual summarization
by providing access to various cross-lingual and low-resource summarization setups.
To demonstrate the dataset’s potential, the authors perform experiments with suit-
able extractive monolingual and cross-lingual baselines. They do admit, however,
that the extreme length and language diversity of the samples pose challenges for
future research.
The legal domain has become increasingly digitized, leading to the need for
more efficient retrieval methods for unstructured data. The field of legal information
retrieval systems has been analyzed extensively in paper [39], which investigates
the use of natural language processing, machine learning, and knowledge extrac-
tion techniques in artificial intelligence approaches. The paper identifies challenges,
such as retrieving similar cases, statutes, or paragraphs, that hinder the analysis of
the latest cases, and highlights the need for further research to improve the efficiency and
effectiveness of these systems (Table 2).
Precision is a measure of the accuracy of a summarization technique, specifically
in relation to the proportion of relevant information that is included in the summary.
In the context of Fig. 2, it is stated that knowledge-based techniques had the highest
precision values with 87%. This suggests that, among the techniques compared,
knowledge-based techniques were the most effective at correctly identifying and
including relevant information in the summary while minimizing the inclusion of
irrelevant information. Knowledge-based techniques use pre-existing knowledge or
information to generate a summary, which may explain why they are able to achieve
higher precision values compared to other techniques.
Recall is a measure of the proportion of relevant instances that are correctly
retrieved by a text summarization technique. It is often used to evaluate the effective-
ness of the technique in retrieving all relevant information from the text. In the case of
the study discussed in Fig. 3, it was found that knowledge-based techniques had the
highest recall values with 66%. Knowledge-based summarization techniques rely on
external knowledge sources such as databases, ontologies, and other external knowl-
edge sources to extract the most important information from the text. This external
Table 2 Legal document summarization

Authors | Technique | Evaluation metrics | Result
Galgani et al. [24] | KB | ROUGE-1, precision, recall, F-measure | KB-SPD: 0.5; P: 0.87; KB+CIT-SPD: 0.5; Recall: 0.66
Galgani et al. [25] | Citation-based method | ROUGE-1, SU6, precision, recall, F-measure | CpSent P: 0.82; R1: 0.46; SU6 P: 0.06; R: 0.22; F: 0.08
Kumar and Raghuveer [26] | LDA | Precision, recall, F-measure | P: 0.60; R: 0.58; F: 0.59
Kim et al. [27] | Graph-based method | Precision, recall, F-measure | P: 31.3; R: 36.4; F: 33.7
Schilder and Molina-Salgado [28] | Graph-based method | ROUGE-2, ROUGE-SU4 | R-2: 0.90; R-SU4: 0.93
Hachey and Grover [29] | SVM, NB, C4.5, Winnow | Human judgment | C4.5: 65.4; SVM: 60.6; NB: 51.8; Winnow: 41.4
Fig. 2 Precision of different techniques (Knowledge Based: 87; Citation Based: 82; LDA: 60; Graph Based: 31.3)
Fig. 3 Recall of different techniques (Knowledge Based: 66; Citation Based: 22; LDA: 58; Graph Based: 36.4)
knowledge is used to identify the key concepts and entities in the text, which are
then used to generate a summary. Because these techniques use external knowledge
to identify the most important information, they are able to retrieve a higher propor-
tion of relevant instances than other techniques, which results in a higher recall
value. Additionally, the use of external knowledge can also lead to a higher precision
value in the summary, as it allows the technique to distinguish between relevant and
non-relevant information more effectively.
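These two quantities can be made concrete with a short sketch. Assuming an extractive summarizer outputs a set of selected sentence indices (the sets below are purely illustrative, not drawn from the surveyed papers):

```python
def precision_recall(selected, reference):
    """Precision and recall of an extractive summary.

    selected  -- set of sentence indices chosen by the summarizer
    reference -- set of sentence indices in the gold (reference) summary
    """
    true_positives = len(selected & reference)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall

# The summarizer picks sentences 0, 2, and 5; the gold summary holds 0, 2, 3, 5.
p, r = precision_recall({0, 2, 5}, {0, 2, 3, 5})
print(p, r)  # 1.0 0.75
```

Here every selected sentence is relevant (precision 1.0), but one relevant sentence was missed (recall 0.75), which is why the two metrics are usually reported together.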
F-Measure is a measure of the effectiveness of text summarization techniques
that combines both precision and recall into a single value. It is calculated as the
harmonic mean of precision and recall and is often used to evaluate the overall
performance of a technique. In the study discussed in Fig. 4, it was found that Latent
Dirichlet Allocation (LDA) had the highest F-Measure value with 59%. F-Measure
is a way to balance the trade-off between precision and recall. It is a metric that
uses both precision and recall to give a single score. F-Measure uses harmonic mean
of precision and recall. Precision is the proportion of true positive instances among
the total number of predicted positive instances, and recall is the proportion of true
positive instances among the total number of actual positive instances. F-Measure
gives equal weight to precision and recall, and it ranges between 0 and 1. The highest
F-Measure value means that the model has performed well in both precision and
recall. Latent Dirichlet Allocation (LDA) is a topic modeling technique that is used
to identify the underlying themes or topics in a text. LDA is a generative probabilistic
model that is trained on a set of documents and is able to discover latent topics by
modeling the co-occurrence of words within each document. In the case of text
Fig. 4 F-Measure of different techniques (Citation Based: 8; LDA: 59; Graph Based: 33.7)
summarization, LDA can be used to identify the main topics of a document and then
generate a summary by extracting the most salient sentences that are relevant to those
topics. This ability to identify the main topics of a document likely contributes to the
high F-Measure value observed in the study, as it allows LDA to effectively retrieve
relevant information while also maintaining a high level of precision in the summary.
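As a quick check on the figures above, the harmonic mean can be computed directly; plugging in the P = 0.60 and R = 0.58 reported for LDA in Table 2 reproduces its 0.59 F-Measure:

```python
def f_measure(precision, recall):
    """Balanced F-Measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Table 2 reports P = 0.60 and R = 0.58 for LDA.
print(round(f_measure(0.60, 0.58), 2))  # 0.59
```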
Recall-Oriented Understudy for Gisting Evaluation (ROUGE) is a commonly
used evaluation metric for text summarization techniques. It compares the generated
summary to a reference summary and calculates the degree of overlap between the
two, providing a score that indicates the quality of the generated summary. In the
study discussed in Fig. 5, it was found that graph-based summarization technique
had the highest ROUGE scores among all other techniques.
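The core quantity behind ROUGE-N, clipped n-gram recall against a reference summary, can be sketched in a few lines (the example sentences are illustrative; production implementations add stemming, multiple references, and F-scores):

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """Clipped n-gram recall of a candidate summary against one reference.

    Both arguments are token lists; this is the core quantity behind ROUGE-N.
    """
    def ngram_counts(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand_counts = ngram_counts(candidate)
    ref_counts = ngram_counts(reference)
    overlap = sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0

reference = "the court dismissed the appeal".split()
candidate = "the appeal was dismissed".split()
print(rouge_n(candidate, reference, n=1))  # 0.6
```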
Graph-based summarization techniques use a graph representation of the text to
identify the most important information and then use this information to generate
a summary. These techniques use graph algorithms to identify the most central or
important nodes in the graph, which correspond to the most important information in
the text. The generated summary is then made up of the text surrounding these impor-
tant nodes. This ability to identify the most important information likely contributes
to the high ROUGE scores observed in the study, as it allows the graph-based tech-
nique to effectively retrieve relevant information and generate a summary that closely
aligns with the reference summary.
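A minimal illustration of this idea, using weighted degree in a word-overlap graph as a deliberately simplified stand-in for centrality algorithms such as LexRank (the sentences are illustrative):

```python
def central_sentences(sentences, top_k=1):
    """Rank sentences by weighted degree in a word-overlap similarity graph.

    Nodes are sentences, edge weights are Jaccard overlaps, and the
    highest-scoring (most central) sentences form the extractive summary.
    """
    token_sets = [set(s.lower().split()) for s in sentences]

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    scored = []
    for i, a in enumerate(token_sets):
        weight = sum(jaccard(a, b) for j, b in enumerate(token_sets) if j != i)
        scored.append((weight, i))
    scored.sort(reverse=True)
    return [sentences[i] for _, i in scored[:top_k]]

document = [
    "the court dismissed the appeal",
    "the appeal was dismissed by the court",
    "lunch was served at noon",
]
print(central_sentences(document))  # the sentence overlapping most with the rest
```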
Accuracy is a measure of the proportion of instances that are correctly classified
by a model. It is often used to evaluate the effectiveness of a model in a classification
task, including text summarization. In the study discussed in Fig. 6, it was found that
the C4.5 algorithm had the highest accuracy of 65.4% among the models used for
text summarization. C4.5 is a decision tree algorithm that is used to classify instances
by recursively partitioning the feature space. It is a supervised ML algorithm that
uses a set of labeled training instances to build a decision tree that can be used to
Fig. 5 ROUGE score of different techniques (Knowledge Based: 46; Graph Based: 90 and 93)
classify new instances. In the context of text summarization, C4.5 could be used to
classify the sentences of a text, where each sentence is assigned a label indicating
whether it is important or not for the summary. The algorithm will then use a set of
feature of the text such as word frequency, sentence length, part of speech, etc., to
build the decision tree, and then when a new text comes, it will use the tree to classify
the sentences of the new text. The accuracy of C4.5 will be based on how well the
algorithm is able to classify the sentences as important or not based on the decision
tree. It is likely that the C4.5 algorithm’s ability to effectively classify instances
based on the text features contributed to the high accuracy value observed in the
study. Additionally, the use of a decision tree allows the algorithm to make complex
decisions by breaking them down into a series of simple decisions, which likely
improved the accuracy of the algorithm. In feature-based summarization, methods
such as anaphora resolution, textual entailment, and word sense are used to determine
the semantics of text. However, in legal document summarization, methods such as
term frequency-inverse document frequency (TF-IDF) and others are utilized. Latent
semantic analysis (LSA) is considered beneficial in legal literature as it selects the
collection of sentences and phrases that best describe the topic.
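A minimal sketch of the TF-IDF scoring mentioned above, treating each sentence as a "document" for the IDF statistic (a simplification; real legal summarizers typically compute IDF over a larger corpus):

```python
import math
from collections import Counter

def tfidf_sentence_scores(sentences):
    """Score each sentence by the summed TF-IDF weight of its words."""
    docs = [s.lower().split() for s in sentences]
    n = len(docs)
    # Document frequency: in how many sentences each word appears.
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = sum((count / len(doc)) * math.log(n / df[word])
                    for word, count in tf.items())
        scores.append(score)
    return scores

# A word occurring in every sentence has IDF zero and contributes nothing,
# so the first sentence scores 0.0.
print(tfidf_sentence_scores(["the the the", "the tort claim"]))
```

Sentences containing distinctive, infrequent terms score higher, which is the behavior exploited when TF-IDF is used as a sentence-selection feature.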
In the legal field, the graph-based technique is utilized, which reveals comparisons
across paragraphs by using the repetition of legal terms and ranking them by vote.
Galgani et al. employed KB and citation-based approaches on the AustLII dataset,
in which the highest score was achieved by the citation-based strategy.
CRF was implemented by Kumar and Raghuveer, which featured term distribution
in court judgments as one of its features.
Several classifiers such as NB, ME, PL, and SEQ were employed by Hachey et al.
The sequence model with rhetorical classification produced the best results.
Fig. 6 Accuracy of models used (KB-SPD: 50; C4.5: 65.4; SVM: 60.6; NB: 51.4; Winnow: 41.4)
The ME sequence model, proposed by Hachey and Grover, is employed to predict
the labels of a series of unlabeled observations.
A compression rate for each document is determined by the graph-based model
developed by Kim et al. This approach takes into account that the compression
rate for each unique document may vary by using a graph that is not related to one
another, which ensures that the topics are diverse and the summaries are cohesive.
An ordered list of paragraphs is produced by the graph-based model developed by
Schilder and Molina-Salgado, based on relevance. Sentences that are comparable
to the query are then extracted from the ordered list.
NB (Yousfi-Monod): Surface feature extraction has been utilized to extract surface
features from important legal text. The emphasis function highlights some of
the most important words in a statement.
LDA (Kumar and Raghuveer): A set of themes are created, which are subsequently
used as the basis for summarization.
C4.5 (Grover and Hachey): Sentences are classified with the highest accuracy and
assigned to suitable rhetorical functions.
4 Discussion
Several research questions were discussed in this section, which were encountered
by the author when examining research articles.
Q1: Which of the summarizations is best for the legal document?
Ans: An extractive summary produces a subset of sentences from the original text:
it identifies the most significant segments of the text and reproduces them word for
word. An abstractive summary, by contrast, uses natural language techniques to
interpret and comprehend the key parts of a text and produce a more "human"-sounding
overview. Because it preserves the original wording, and because more established
methods are available for it, extractive summarization is the better fit for legal documents.
Q2: How can we improve the structure of legal document summarization?
Ans: To improve the structure of legal document summarization, the summary must
cover all legal aspects of the document, including the legal judgment record and the
logical fragments that span the whole record.
Q3: Why is there greater emphasis on extractive summarizing and less emphasis on
abstractive summarization?
Ans: Abstractive summarization alters the original content of the document, which is
unacceptable for legal documents, so it is not used very effectively there. Another issue
is that abstractive summarization relies on DL techniques for summary generation and
requires a large amount of data, which legal document collections do not provide.
Thus, extractive summarization is the commonly used method.
Q4: How do we find out if our text summarization results are performing better or
not?
Ans: Evaluation metrics such as the ROUGE score, precision, recall, and F-
Measure are utilized to evaluate the performance of the summary text. ROUGE, an
acronym for Recall-Oriented Understudy for Gisting Evaluation, evaluates text
summarization by counting overlapping n-grams, word pairs, and word sequences
between the generated and reference summaries.
5 Conclusion
A survey was conducted on the use of extractive text summarization in legal docu-
ments. Various techniques were examined, and different modules were used in the
process. Extractive summarization was chosen for use in legal documents as it
preserves the meaning of the document and utilizes a subset of the text for summa-
rization. The use of legal terms in the summarization process provided structure
to the documents. Techniques such as graph based, ML based, knowledge based,
and citation based were found to provide effective summarization. The study found
that the decision tree classifier (“C4.5”) model had the best performance compared
to other models. Further research could be conducted to explore the use of other
models and techniques in the summarization of legal documents and to improve the
performance and effectiveness of the summarization process.
Acknowledgements This research is supported by Council of Science and Technology, Lucknow,
Uttar Pradesh, via Project Sanction letter number CST/D-3330.
References
1. El-Kassas WS et al (2021) Automatic text summarization: A comprehensive survey. Expert
Syst Appl 165: 113679
2. Allahyari M et al (2017) Text summarization techniques: a brief survey. arXiv preprint arXiv:
1707.02268
3. Agarwal P, Mehta S (2018) Empirical analysis of five nature-inspired algorithms on real
parameter optimization problems. Artif Intell Rev 50(3):383–439
4. Boorugu R, Ramesh G (2020) A survey on NLP based text summarization for summarizing
product reviews. In: 2020 second international conference on inventive research in computing
applications (ICIRCA). IEEE
5. Hou L, Hu P, Bei C (2018) Abstractive document summarization via neural model with joint
attention. In: Natural language processing and Chinese computing: 6th CCF international
conference, NLPCC 2017, Dalian, China, November 8–12, 2017, Proceedings 6. Springer
International Publishing
6. Vodolazova T et al (2013) The role of statistical and semantic features in single-document
extractive summarization
7. Ferziger JH et al (2020) Finite difference methods. Comput Methods Fluid Dyn, 41–79
8. Aliguliyev RM (2009) A new sentence similarity measure and sentence based extractive
technique for automatic text summarization. Expert Syst Appl 36(4):7764–7772
9. Li W et al (2006) Extractive summarization using inter-and intra-event relevance. In: Proceed-
ings of the 21st international conference on computational linguistics and 44th annual meeting
of the Association for Computational Linguistics
10. Liu M et al (2007) Extractive summarization based on event term clustering. In: Proceedings of
the 45th annual meeting of the Association for Computational Linguistics companion volume
proceedings of the demo and poster sessions
11. Fung P, Ngai G, Cheung C-S (2003) Combining optimal clustering and hidden Markov models
for extractive summarization. In: Proceedings of the ACL 2003 workshop on multilingual
summarization and question answering
12. Mallick C et al (2019) Graph-based text summarization using modified TextRank. In: Soft
computing in data analytics. Springer, Singapore, pp 137–146
13. Parveen D, Ramsl H-M, Strube M (2015) Topical coherence for graph-based extractive summa-
rization. In: Proceedings of the 2015 conference on empirical methods in natural language
processing
14. Erkan G, Radev DR (2004) Lexrank: graph-based lexical centrality as salience in text
summarization. J Artif Intell Res 22:457–479
15. Ren P et al (2017) Leveraging contextual sentence relations for extractive summarization using
a neural attention model. In: Proceedings of the 40th international ACM SIGIR conference on
research and development in information retrieval
16. Fang M, Fang C, Mu D, Deng Z, Wu Z (2017) Word-sentence co-ranking for automatic
extractive text summarization. Expert Syst Appl 72(2017):189–195
17. Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text
data. Springer, Boston, MA, pp 43–76
18. Mehta P, Majumder P (2018) Effective aggregation of various summarization techniques. Inf
Process Manage 54(2):145–158
19. Kobayashi H, Noguchi M, Yatsuka T (2015) Summarization based on embedding distributions.
In: Proceedings of the 2015 conference on empirical methods in natural language processing
20. Kumar A, Sharma A (2019) Systematic literature review of fuzzy logic based text summariza-
tion. Iran J Fuzzy Syst 16(5):45–59
21. Moratanch N, Chitrakala S (2017) A survey on extractive text summarization. In: 2017
international conference on computer, communication and signal processing (ICCCSP). IEEE
22. Rahman A et al (2019) Bengali text summarization using TextRank, fuzzy C-Means and
aggregate scoring methods. In: 2019 IEEE region 10 symposium (TENSYMP). IEEE
23. Mao X et al (2019) Extractive summarization using supervised and unsupervised learning.
Expert Syst Appl 133:173–181
24. Galgani F, Compton P, Hoffmann A (2012) Combining different summarization techniques for
legal text. In: Proceedings of the workshop on innovative hybrid approaches to the processing
of textual data
25. Galgani F, Compton P, Hoffmann A (2012) Citation based summarisation of legal texts. In:
Pacific Rim international conference on artificial intelligence. Springer, Berlin, Heidelberg
26. Venkatesh RK (2013) Legal documents clustering and summarization using hierarchical latent
Dirichlet allocation. IAES Int J Artif Intell 2(1)
27. Kim M-Y, Xu Y, Goebel R (2013) Summarization of legal texts with high cohesion and auto-
matic compression rate. In: New frontiers in artificial intelligence: JSAI-isAI 2012 workshops,
LENLS, JURISIN, MiMI, Miyazaki, Japan, November 30 and December 1, 2012, Revised
Selected Papers 4. Springer, Berlin, Heidelberg
28. Schilder F, Molina-Salgado H (2006) Evaluating a summarizer for legal text with a large text
collection. In: 3rd Midwestern computational linguistics colloquium (MCLC)
29. Hachey B, Grover C (2004) A rhetorical status classifier for legal text summarisation. In: Text
summarization branches out
30. Yousfi-Monod M, Farzindar A, Lapalme G (2010) Supervised ML for summarizing legal
documents. In: Canadian conference on artificial intelligence. Springer, Berlin, Heidelberg
31. Aumiller D, Fan J, Gertz M (2023) On the state of German (abstractive) text summarization.
arXiv preprint arXiv:2301.07095
32. Katz DM et al (2023) Natural language processing in the legal domain. arXiv preprint arXiv:
2302.12039
33. Taufiq U, Pulungan R, Suyanto Y (2023) Named entity recognition and dependency parsing
for better concept extraction in summary obfuscation detection. Expert Syst Appl, 119579
34. Mishra AR, Naruka MS, Tiwari S (2023) Extraction techniques and evaluation measures for
extractive text summarisation. In: Sustainable computing: transforming Industry 4.0 to Society
5.0. Springer International Publishing, Cham, pp 279–290
35. Thakur O, Saritha SK, Jain S (2023) Topic modeling, sentiment analysis and text summarization
for analyzing news headlines and articles. In: Machine learning, image processing, network
security and data sciences: 4th international conference, MIND 2022, Virtual Event, January
19–20, 2023, Proceedings, Part I. Springer Nature Switzerland, Cham
36. Nafees Muneera M, Sriramya P (2023) An enhanced optimized abstractive text summarization
traditional approach employing multi-layered attentional stacked LSTM with the attention
RNN. In: Computer vision and machine intelligence paradigms for SDGs: select proceedings
of ICRTAC-CVMIP 2021. Springer Nature Singapore, Singapore, pp 303–318
37. Yadav AK et al (2022) Extractive text summarization using DL approach. Int J Inf Technol
14(5):2407–2415
38. Aumiller D, Chouhan A, Gertz M (2022) EUR-Lex-Sum: a multi- and cross-lingual dataset for
long-form summarization in the legal domain. arXiv preprint arXiv:2210.13448
39. Sansone C, Sperlí G (2022) Legal information retrieval systems: state-of-the-art and open
issues. Inf Syst 106:101967
An Energy Conserving MANET-LoRa
Architecture for Wireless Body Area
Network
Sakshi Gupta, Manorama, and Itu Snigdh
Abstract The demand for technologies that provide solutions to people suffering
from chronic diseases is growing rapidly. These technologies also support continuous
health monitoring of patients for early intervention and prevention. Additionally, there
is a need for interoperation between different connected devices and application
services in smart health care. Among these technologies, a wireless body area network
(WBAN) is an appropriate option for monitoring people's health remotely. However,
existing systems suffer from high energy dissipation when processing data. This
article provides a system that leverages the advantages of the Internet of Things
(IoT)'s LoRa technology, mobile ad hoc network (MANET) systems, and data
aggregation schemes to conserve energy when transmitting packets. Our proposed
model optimizes and reduces energy dissipation in the network compared to existing
models. It also presents a novel approach for the early detection of urgent biosignals.
Keywords IoT · Healthcare · Biosignals · LoRa · Aggregation · MANET · WBAN
1 Introduction
IoT is currently a part of every physical object one wears, drives, reads, or sees. It
is used for applications that require phenomena to be tracked, measured, connected,
and controlled remotely [1]. Current technologies adopt IoT systems to enable better
S. Gupta (B)
Amity Institute of Information Technology, AMITY university, Noida, India
e-mail: sakshigupta660@gmail.com
Manorama
Amity Institute of Information Technology, Ranchi, India
e-mail: manorama7826@gmail.com
I. Snigdh
B.I.T Mesra, Ranchi, India
e-mail: itusnigdh@bitmesra.ac.in
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_45
decisions, effortless monitoring, time-saving, automation, and a better lifestyle.
Hence, IoT is going to reshape entire industries. According to Gartner's report, the
number of installed IoT devices reached 26.66 billion in 2019 and will reach 75.44 billion by 2025
[2]. IoT empowers one to transform physical objects to send and receive data. IoT’s
wide applications are smart cities, smart environment, smart metering, smart supply
chain, agriculture, healthcare, and mining production, to name a few [3].
Health care is one of the most well-considered applications of IoT, used to diagnose
diseases and monitor their treatment. Medical devices are integrated into the IoT to
provide the required effective treatment and diagnosis. Various sensors are used in
IoT healthcare, such as heart rate sensors, blood pressure monitors, blood glucose
meters, and thermometers, whose health readings can be accessed remotely. Several
data-management methods are used to handle the constant transmission of WBAN
data. Also, varied healthcare applications like health monitoring systems, fitness pro-
grams, chronic disease monitoring, ambient assisted living, drug management, and
monitoring of oxygen saturation at home are being provided by IoT-based healthcare.
Therefore, IoT facilitates healthcare applications through 24/7 patient data analysis,
emergency medical decision-making, cost reduction, and an enhanced quality
of patients' lifestyles [4]. With IoT tracking systems, health workers can get alerts
immediately when critical changes occur. This enables them to quickly locate patients
who need help and direct assistance as soon as possible. The amalgamation of IoT
in the healthcare sector provides various advantages, including improved treatment,
low cost, faster disease diagnosis, proactive treatment, reduced end-to-end delay,
improved management of drugs and equipment, and a better patient experience [5].
Nevertheless, IoT healthcare poses some challenges that need to be solved.
According to the literature [6, 7], one of the most critical challenges in IoT healthcare is
reducing energy consumption and delay. As most of the data is on the cloud, and data
analytics and data processing take time, communication of decisions incurs delays.
Another challenge is the connectivity, where issues arise in real-time data monitoring
in remote areas.
Moreover, as IoT is an emerging and still-growing technology, it needs a more
scalable architecture when merged into any specific application. Also, the implementation
of a full-scale IoT or BAN architecture in healthcare is not documented in the
literature. The current literature presents strategies for only partial patient monitoring
and healthcare data analysis, with Artificial Intelligence (AI) and Machine
learning (ML) techniques to learn and train machines. However, the concerns of
practical implementation of IoT healthcare need to incorporate the entire network’s
energy consumption and sustainability for successful operations [8].
2 Related Works
Ample literature exists for framework and data management in WBAN [9]. Data
management is the foremost requirement of WBAN for the early detection of disease
in patients [10]. In [11], a data segregation and classification scheme is used to
separate the sensors' readings into urgent, semi-urgent, and non-urgent packets over
the wireless protocol 6LoWPAN. For sending packets, the authors used two routes:
one from the gateway to the cloud, and a second from an access point to the cloud in
case of gateway failure. However, in this research the authors dropped non-urgent
packets completely to improve the power consumption of WBAN.
In [12], the authors proposed a cloud-based real-time remote health monitoring
system (CHMS) for home care patients by using data classification and delay-aware
routing metric to reduce congestion, interference, and delay.
Further, in [13] the authors proposed a collaborative body sensor network (CBSNs)
framework to implement a multisensory data fusion scheme to automatically detect
handshakes between two individuals and capture possible heart rate and emotions.
In [14], to bring down the energy consumption, the authors provide a data packet
aggregation algorithm for LoRa technology in the Internet of Things.
In order to collect data from a smart grid network that can expand dynamically, and to analyze the effect on energy consumption for such networks, the authors proposed an architecture in [15].
In order to achieve high-quality network efficiency in smart parking and IIoT applications, the authors suggested using a Bayesian belief network with fuzzy logic [16, 17].
However, a shortcoming of the literature above is that it only assumes data aggregation algorithms during implementation, without illustrating concrete aggregation algorithms and their impact on the system. In our proposed framework, we aggregate critical and non-critical data according to a specified aggregation ratio at the MANET layer to bring down the communication energy consumption.
The contributions of this paper are as follows:
We develop a WBAN communication framework that uses LoRa technology and the benefits of mobile ad hoc architecture for both homecare and hospital situations.
To save energy when transferring data packets comprising biosignals, our communication model applies data aggregation and fusion techniques at each layer.
The framework aims to quickly find pertinent information and determine whether a critical condition exists.
Using a straightforward categorization of data into urgent and non-urgent classes, we then use correlation among the transmitted data at the in-network level to corroborate the initial classification.
This article is organized as follows. Section 3 presents the preliminaries on WBAN architecture, LoRa technology, and mobile ad hoc architecture. Section 4 presents the proposed architecture and Sect. 5 the methodology. Section 6 outlines the results and discussion. Concluding remarks and future work are provided in Sect. 7.
3 Preliminaries
Chronic diseases affect many patients, and their number is growing daily. Hospitalization can be inconvenient and expensive, and patients often require health monitoring while working or going about daily activities. WBAN is thus a fix for this issue.
3.1 WBAN and IoT Health Care System
In Wireless Body Area Networks, nanosensors are implanted near and inside the patients' bodies to wirelessly transmit biosignals to doctors. The doctors then treat the patients virtually by observing, diagnosing, and prescribing [18]. Biosignals are continual records of the biological processes of living things; they include the EEG, ECG, EOG, blood pressure, body temperature, glucose level, and many more [19]. Additionally, to reduce computational complexity at the sensor level and avoid needless data expedition, data packets are categorized as critical or non-critical before being transferred to the upper layer. Figure 1 depicts a primary healthcare system in which sensors generate data and send it to the cloud via a gateway and a mobile device; all essential processing is done on the cloud. Doctors can access and analyze the data from the cloud, comment on it, and connect directly to patients [20].
Fig. 1 Traditional healthcare system
3.2 LoRa
One of the newest technologies in IoT, LoRa is appropriate for long-range, low-power communication [21]. The usual LoRa topology is a star, which uses more energy and transmits data at a slower rate. LoRa differs from other LPWAN technologies in that it gives users the option to customize physical-layer characteristics (transmission power, bandwidth, coding rate, spreading factor, carrier frequency) for their particular applications.
Spreading Factor (SF): In digital communication, the SF specifies the number of chips used to represent a symbol. SF takes values between 6 and 12; lower SF values achieve higher data rates.
Bandwidth (BW): This parameter determines the width of the communication channel and hence the volume of data that can be transmitted across it. BW ranges from 7.8 to 500 kHz; however, LoRa devices typically operate on one of three bandwidths: 125, 250, or 500 kHz.
Coding Rate (CR): The LoRa modem delivers corruption-free transmission by using forward error correction. This is accomplished by employing a coding rate that raises the time-on-air (ToA) of the packet while offering more robustness. Standard values for CR are 4/5, 4/6, 4/7, and 4/8.
Carrier Frequency (CF): The carrier frequency is the center frequency of the transmission band. The license-free sub-gigahertz band for LoRa transmitters and receivers spans from 860 to 1020 MHz.
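The interplay of these parameters determines a packet's time-on-air, which in turn drives transmit energy. Below is a minimal sketch of the standard LoRa ToA formula from the Semtech modem documentation; the payload size, preamble length, and flag settings used in the example are illustrative assumptions, not values taken from this paper.

```python
import math

def lora_time_on_air(payload_bytes, sf, bw_hz, cr=1, preamble_len=8,
                     explicit_header=True, crc=True, low_dr_opt=False):
    """Time-on-air (seconds) of one LoRa packet.

    cr is the coding-rate index: 1..4, meaning 4/5..4/8.
    """
    t_sym = (2 ** sf) / bw_hz                     # symbol duration in seconds
    ih = 0 if explicit_header else 1
    de = 1 if low_dr_opt else 0
    crc_bits = 16 * (1 if crc else 0)
    # Number of payload symbols (Semtech time-on-air formula).
    num = 8 * payload_bytes - 4 * sf + 28 + crc_bits - 20 * ih
    n_payload = 8 + max(math.ceil(num / (4 * (sf - 2 * de))) * (cr + 4), 0)
    t_preamble = (preamble_len + 4.25) * t_sym
    return t_preamble + n_payload * t_sym

# A higher SF lengthens ToA (more robust, slower); a wider BW shortens it.
toa_fast = lora_time_on_air(60, sf=7, bw_hz=500_000)
toa_slow = lora_time_on_air(60, sf=12, bw_hz=125_000)
```

This makes the SF trade-off concrete: each step up in SF roughly doubles the symbol duration, which is why low SF values are preferred when energy matters.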
LoRa provides three classes for end-device communication.
Class A: Class A devices are the most power-efficient and enable two-way communication between network servers and end devices (EDs). For each communication activity, these devices offer one uplink and two downlink transmissions: every uplink (ED to network server) transmission window is followed by two short downlink (network server to ED) receive windows.
Class B: In addition to the class A receive windows, class B devices provide downlink communication in extra scheduled slots. Time synchronization requires a prearranged beacon from the gateway. Compared to class A devices, these devices use more energy.
Class C: These devices inherit the features of class A devices, except that their receive window stays open whenever they are not transmitting. These devices use more energy than classes A and B, but incur very little delay.
3.3 Mobile Ad Hoc Network
MANET is valued for its capacity for self-organization, self-healing, and operation in environments with minimal network support. The MANET nodes, which may move about freely in their environment, are outfitted with wireless transmitters and receivers that may use omnidirectional antennas [22, 23]. A WSN is obviously the primary IoT data collection methodology, but the power and memory of WSN devices are constrained. In all situations, MANET systems concentrate on finding the optimum path to route the data (network discovery). The interaction between WSN routing principles, MANET protocols, and the IoT enables a new MANET-IoT system that provides improved user mobility and reduces network deployment costs [24]. In [15], the authors proposed an architecture using the characteristics of mobile ad hoc networks in IoT: WSN acts as the base layer, collecting data from the environment with various sensor devices, while MANET plays the role of an overlay architecture with movable devices. This type of architecture is suitable for urgent data transmission; therefore, MANET can be a good choice for healthcare systems.
4 Proposed Architecture for WBAN
Figure 2 shows the proposed architecture for bringing down the energy consumption in WBAN. The previous section described the advantages and disadvantages of the traditional IoT healthcare architecture. We have used LoRa technology for the proposed work, as LoRa is resilient to interference and works over long range.
Fig. 2 Flow of data transmission in proposed BSN architecture
In our architecture, LoRa devices and gateways lie at the lower layer. After applying data aggregation, LoRa EDs transfer data to their respective gateways. The gateway layer categorizes data packets as critical or non-critical based on a threshold value. These data packets, both critical and non-critical, are then delivered to the upper tiers. We also place a MANET layer above the gateways: MANET acts as a computing layer that can run feature selection algorithms, and MANETs are well suited here because they are self-configuring, self-healing networks [25, 26].
5 Methodology
Buffer aggregation is used at the MANET layer to cut down on network energy.
Fused packets are transmitted to the cloud layer using cooperative fusion. Figure 3
depicts the flow of transmitting the data packets from the sensor layer to the cloud.
First, every second, medical sensors gather biosignal data from a patient's body. The sensor nodes apply redundant fusion or aggregation processes to the data and transmit data packets every five seconds. These sensors include the SpO2, temperature, glucose, and blood pressure sensors.
Fig. 3 Flow of data transmission in proposed BSN architecture
Table 1 Threshold values of sensors
Sensor              Critical               Non-critical
EKG                 More than 100 bpm      60–100 bpm
SpO2                Less than 92%          92–99%
Diabetes sensor     More than 126 mg/dl    Less than 100 mg/dl
Temperature sensor  More than 100 F        Less than 99 F
When a data packet is received, the gateway determines whether it is urgent or not by checking the sensor data against the cumulative threshold value. Table 1 shows the threshold values of the sensors, for reference [12].
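The gateway's urgency check can be sketched directly from Table 1. The dictionary below simply encodes the table's critical thresholds; the field names are illustrative assumptions, not identifiers from the paper.

```python
# Critical thresholds encoded from Table 1 (units as in the table).
CRITICAL_IF = {
    "ekg_bpm":      lambda v: v > 100,   # EKG: more than 100 bpm
    "spo2_pct":     lambda v: v < 92,    # SpO2: less than 92 %
    "glucose_mgdl": lambda v: v > 126,   # Diabetes sensor: more than 126 mg/dl
    "temp_f":       lambda v: v > 100,   # Temperature: more than 100 F
}

def classify_packet(readings):
    """Mark a packet critical if any sensor reading crosses its threshold."""
    for sensor, value in readings.items():
        check = CRITICAL_IF.get(sensor)
        if check is not None and check(value):
            return "critical"
    return "non-critical"

print(classify_packet({"ekg_bpm": 80, "spo2_pct": 90}))  # -> critical
print(classify_packet({"ekg_bpm": 72, "temp_f": 98.6}))  # -> non-critical
```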
Data packets, both critical and non-critical, are forwarded to the MANET layer. To choose the most important health signals, the MANET layer runs a Principal Component Analysis and Canonical Correlation Analysis algorithm over the data gathered so far. Data values are checked against the threshold value range: a packet keeps its critical status if the sensor's threshold is crossed and the signal is also among the patient's most important health factors; otherwise, it changes to non-critical status.
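The paper does not detail the PCA/CCA implementation. As a crude, self-contained stand-in for that selection step, the sketch below ranks buffered signals by variance and keeps the top ones; a real deployment would run full PCA/CCA, and the function and signal names here are illustrative.

```python
from statistics import pvariance

def most_important_signals(window, k=1):
    """Rank signals by variance over a buffered window and keep the top k.

    `window` maps a signal name to its list of recent samples. This is a
    crude proxy for the PCA/CCA feature selection run at the MANET layer.
    """
    ranked = sorted(window, key=lambda s: pvariance(window[s]), reverse=True)
    return ranked[:k]

window = {
    "ekg_bpm": [72, 95, 130, 88, 140],          # highly variable -> informative
    "temp_f":  [98.6, 98.7, 98.6, 98.7, 98.6],  # nearly constant
}
print(most_important_signals(window))  # -> ['ekg_bpm']
```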
The MANET layer also carries out the data aggregation mechanism. At this layer, both critical and non-critical data packets are briefly buffered; a buffer size is chosen for critical and non-critical data packets in accordance with the network requirements.
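The buffering step can be sketched as a small aggregator that holds packets until the configured fusion ratio is reached and then emits a single fused packet. The class and method names are illustrative assumptions, and the "fusion" here simply groups the buffered payloads.

```python
class FusionBuffer:
    """Buffer packets and emit one fused packet per `ratio` arrivals."""

    def __init__(self, ratio):
        self.ratio = ratio   # fusion ratio, e.g. 1 or 3 (critical), 5 or 7 (non-critical)
        self.buffer = []
        self.sent = []       # fused packets handed to the next layer

    def push(self, packet):
        self.buffer.append(packet)
        if len(self.buffer) >= self.ratio:
            # Fuse the buffered packets into a single transmission.
            self.sent.append(tuple(self.buffer))
            self.buffer.clear()

critical = FusionBuffer(ratio=3)
for i in range(9):
    critical.push({"seq": i})
print(len(critical.sent))  # 9 packets -> 3 fused transmissions
```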
Cooperative fusion is a technique used by the cloud layer to combine data from several sources into a new piece of information. The most important characteristics are selected, conveyed to this layer, and then integrated to determine the disease.
6 Results
We have compared our results with the method used in [12]; Fig. 4 shows the comparison, and this section elaborates the findings. The work was simulated in Python, and Table 2 lists all the network configuration details. For the proposed architecture, class A LoRa end devices are assumed. With reference to the existing research [27], we chose the most effective LoRa parameter settings as determined by LoRaWAN simulators: SF = 6, BW = 500 kHz, and CR = 4/5.
Our suggested WBAN architecture uses less energy than the existing approach. The authors of [12] dropped every non-critical packet and transmitted only critical packets to the upper layer.
Fig. 4 Results of all transferred data packets (*CHMS = cloud-based healthcare monitoring system)
Table 2 Network configuration parameters
Name                                               Values
ROI                                                100 * 100
Number of sensor nodes                             4
Buffer size for existing work                      1
Fusion ratio for proposed work                     Critical packets: 1 packet, 3 packets; Non-critical packets: 5 packets, 7 packets
Transmission cost for gateway and critical packet  0.002 mW
Transmission cost for fusion packet                0.005 mW
Packet size                                        60 bytes
Spreading factor                                   6
Bandwidth                                          500 kHz
Carrier frequency                                  868 MHz
Transmission power                                 14 dBm
Coding rate                                        4/5
By contrast, we send every data packet, critical and non-critical, to the higher layers. Figure 5a and b show the energy consumption when only critical data packets are transmitted at the MANET layer with data fusion ratios of 1 and 3, respectively. Figure 6a and b show the energy consumption when only non-critical data packets are sent at the MANET layer with data fusion ratios of 5 and 7, respectively. Our innovation lies in selecting a compression value that permits the transmission of non-urgent packets, making it easier to maintain historical medical records for future use.
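Using the per-transmission costs from Table 2, the saving from fusion can be reproduced with simple arithmetic. The accounting below is an illustrative model, not the authors' simulator; it only counts transmission events times cost.

```python
import math

# Per-transmission costs taken from Table 2.
COST_SINGLE = 0.002   # mW: one gateway/critical packet sent on its own
COST_FUSED = 0.005    # mW: one fused packet

def energy_individual(n_packets):
    """Existing approach: every packet is transmitted individually."""
    return n_packets * COST_SINGLE

def energy_fused(n_packets, ratio):
    """Proposed approach: packets are fused `ratio`-at-a-time before sending."""
    return math.ceil(n_packets / ratio) * COST_FUSED

n = 100
base = energy_individual(n)        # 100 individual transmissions
fused3 = energy_fused(n, ratio=3)  # 34 fused transmissions, cheaper overall
fused5 = energy_fused(n, ratio=5)  # 20 fused transmissions, cheaper still
```

Even though a fused packet costs more per transmission (0.005 vs. 0.002 mW), fusing three or more packets per send reduces the total, which matches the trend in Figs. 5 and 6.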
Fig. 5 Critical data transmitted, a fusion ratio = 1, b fusion ratio = 3 (*CHMS = cloud-based healthcare monitoring system)
7 Conclusion
This paper provides a scalable, energy-efficient MANET-LoRa-based architecture for WBAN. The lower layer uses LoRa technology, and the higher layer uses MANET. The gateway layer distinguishes between critical and non-critical data. We have also utilized data aggregation and feature selection techniques for the early diagnosis of diseases at the MANET layer, and proposed cooperative fusion at the cloud layer; the MANET-LoRa architecture for WBAN thus successfully conserves energy. However, this architecture introduces delay and an increase in algorithm complexity. The future scope of our work focuses on optimizing the delay of the network.
Fig. 6 Non-critical data packets transmitted, a fusion ratio = 5, b fusion ratio = 7 (*CHMS = cloud-based healthcare monitoring system)
References
1. Boikanyo K, Zungeru AM, Sigweni B, Yahya A, Lebekwe C (2023) Remote patient monitoring
systems: applications, architecture, and challenges. Sci African 01638
2. Silverio-Fernandez MA, Renukappa S, Suresh S (2019) Evaluating critical success factors for implementing smart devices in the construction industry: an empirical study in the Dominican Republic. Eng Construct Arch Manage
3. Lee I, Lee K (2015) The internet of things (IoT): applications, investments, and challenges for
enterprises. Business Horizons 58(4):431–440
4. Balandina E, Balandin S, Koucheryavy Y, Mouromtsev D (2015) IoT use cases in healthcare
and tourism. In: 2015 IEEE 17th Conference on business informatics, Vol 2. IEEE, pp 37–44
5. Mustafa T, Varol A (2020) Review of the internet of things for healthcare monitoring. In: 2020
8th International symposium on digital forensics and security (ISDFS). IEEE, pp 1–6
6. Baker SB, Xiang W, Atkinson I (2017) Internet of things for smart healthcare: technologies,
challenges, and opportunities. IEEE Access 5:26521–26544
7. Zou N, Liang S, He D (2020) Issues and challenges of user and data interaction in healthcare-
related IoT: a systematic review. Library Hi Tech
8. Gupta S, Snigdh I (2022) An energy-efficient information-centric model for internet of things
applications. In: 2022 International conference on IoT and blockchain technology (ICIBT).
IEEE, pp 1–5
9. Mohapatro M, Snigdh I (2020) Security in IoT healthcare. In: IoT security paradigms and
applications. CRC Press, pp 237–259
10. Abiodun AS, Anisi MH, Khan MK (2019) Cloud-based wireless body area networks: managing data for better health care. IEEE Consum Electron Mag 8(3):55–59
11. Abiodun AS, Anisi MH, Ali I, Akhunzada A, Khan MK (2017) Reducing power consumption in wireless body area networks: a novel data segregation and classification technique. IEEE Consum Electron Mag 6(4):38–47
12. Almashaqbeh G, Hayajneh T, Vasilakos AV, Mohd BJ (2014) Qos-aware health monitoring
system using cloud-based WBANs. J Med Syst 38(10):1–20
13. Fortino G, Galzarano S, Gravina R, Li W (2015) A framework for collaborative computing
and multi-sensor data fusion in body sensor networks. Inform Fusion 22:50–70
14. Gupta S, Snigdh I (2022) Leveraging data aggregation algorithm in LoRa networks. J Supercomput 1–15
15. Gupta S, Snigdh I (2021) Analyzing impacts of energy dissipation on scalable IoT architectures
for smart grid applications. In: Advances in smart grid automation and industry 4.0. Springer,
pp 81–89
16. Gupta S, Snigdh I (2023) Applying Bayesian belief in LoRa: smart parking case study. J Ambient Intell Humaniz Comput 1–14
17. Gupta S, Snigdh I, Sahana SK (2022) A fuzzy logic approach for predicting efficient LoRa communication. Int J Fuzzy Syst 1–9
18. Mohapatro M, Snigdh I (2021) An experimental study of distributed denial of service and sink
hole attacks on IoT based healthcare applications. Wireless Pers Commun 121:707–724
19. Parlitz U, Berg S, Luther S, Schirdewan A, Kurths J, Wessel N (2012) Classifying cardiac
biosignals using ordinal pattern statistics and symbolic dynamics. Comp Biol Med 42(3):319–
327
20. Gupta S, Singh U (2021) Ontology-based IoT healthcare systems (IHS) for senior citizens. Int
J Big Data Anal Healthcare (IJBDAH) 6(2):1–17
21. Alliance L (2015) A technical overview of LoRa and LoRaWAN. White Paper, November 20
22. Gupta U, Pantola D, Bhardwaj A, Singh SP (2022) Next-generation networks enabled tech-
nologies: challenges and applications. Next Gener Commun Netw Indust Internet of Things
Syst 191–216
23. Soni G, Gupta U, Singh N (2014) Analysis of modified substitution encryption techniques
24. Bruzgiene R, Narbutaite L, Adomkus T (2017) MANET network in internet of things system. Ad Hoc Netw 66:89–114
25. Bellavista P, Cardone G, Corradi A, Foschini L (2013) Convergence of MANET and WSN in IoT urban scenarios. IEEE Sens J 13(10):3558–3567
26. Gupta P, Tripathi S, Singh S (2021) Energy-efficient routing protocols for cluster-based hetero-
geneous wireless sensor network (HETWSN)-strategies and challenges: a review. Data Anal
Manage Proc ICDAM 853–878
27. Bor MC, Roedig U, Voigt T, Alonso JM (2016) Do LoRa low-power wide-area networks scale? In: Proceedings of the 19th ACM international conference on modeling, analysis and simulation of wireless and mobile systems, pp 59–67. https://doi.org/10.1145/2988287.2989163
Blockchain Integration with Internet
of Things (IoT)-Based Systems for Data
Security: A Review
Gagandeep Kaur, Rajesh Shrivastava, and Umesh Gupta
Abstract Blockchain technology offers a secure channel for communicating between entities without the role of any third party. It is a digital ledger of transactions in a computer network that makes it hard for hackers to attack or alter the information. Banking, supply chains, precision agriculture, smart cities, cyber-physical systems, industrial IoT, and health care are among the sectors in which blockchain technology has been adopted to enhance security. In recent times, these sectors have been revolutionized by digital transformation using sensor-aided physical devices forming Internet of Things (IoT) systems. Blockchain-based IoT systems play a vital role in replacing the conventional methods of storing and sharing data with a more reliable method; the integration of the two technologies results in a secure, reliable, and smart system. This paper presents the background and working principle of blockchain technology. It also discusses the need for security and the security challenges in IoT-based systems, briefly covers smart contracts and the motivation behind integrating blockchain technology with IoT-based systems, and finally proposes a secure IoT-based land registry architecture.
Keywords Blockchain · Computer network · Internet of Things (IoT) · Privacy protection
G. Kaur (B)
Department of Computer Science and Engineering, Madhav Institute of Technology and Science,
Gwalior, India
e-mail: gagan873@gmail.com; gagandeep@mitsgwalior.in
R. Shrivastava ·U. Gupta
School of Computer Science Engineering and Technology, Bennett University, Greater Noida,
India
e-mail: rajesh.shrivastava@bennett.edu.in
U. Gupta
e-mail: er.umeshgupta@gmail.com
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_46
617
1 Introduction
In the current era, Internet of Things (IoT) technology has gained popularity. This is due to the addition of sensing, receiving, and transmitting capabilities to any physical object. IoT plays a vital role in various verticals of everyday life and finds application in the Industrial IoT, precision agriculture, smart cities, and healthcare [1]. The sensors in an IoT system sense various physical parameters; the data collected from these sensors is stored on a central server. However, data security on the central server and its privacy preservation are important aspects. IoT brings several advantages, but leakage of sensitive information or an attack by hackers can ruin its sole purpose. Therefore, connecting several physical objects over the Internet with strong protection of the data is the prime requirement. The focus of research is to save this collected data securely in a decentralized architecture. The integration of blockchain technology in IoT permits physical objects to securely transmit data over a peer-to-peer network. The blockchain minimizes the risk of any fraudulent information entering the IoT network, because before data enters the network, consent is taken from the majority of the users instead of a single central authority [2]. Furthermore, a blockchain-based IoT system builds a strong and robust network that prevents hackers from stealing all the information simply by attacking a central server. In IoT-based systems, data security is obtained by applying encryption. The encryption technique requires each node in the IoT system to carry two keys, namely a public key and a private key. The public key is available to other nodes and is used to encrypt data, which is then broadcast to all other nodes in the network; the private key is secret to the individual node and is used for decrypting the data. Blockchain technology prevents a fraudulent intruder from falsely encrypting the information. In a blockchain-based IoT network, all physical objects within the network are identified by their public keys; this sharing, however, may allow a third party to infer the identity of the IoT participants. The blockchain-based IoT system eliminates the single point of failure, which enhances the fault tolerance and reliability of the IoT system. IoT devices participating in a blockchain-based network can verify data integrity and the identity of the sender [3]. Blockchain also provides secure software updates and data storage capability to IoT-based systems. Furthermore, blockchain technology stores data in an immutable ledger, which provides backtracking and traceability capabilities.
IoT allows the interconnectivity of physical objects, "things", by providing sensing and transmitting capabilities. In the current era, IoT finds applications in everyday life, creating smart networks capable of sensing various physical parameters and taking valuable decisions. Decentralization enhances the scalability and performance of IoT networks. With the exponential growth in the adoption and popularity of IoT technology, the demand for secure data storage and transmission is increasing immensely. The security of the data is essential; any data leakage or attack by an intruder can lead to the disclosure of critical information. Thus, it is essential to preserve the privacy of data and grant access only to authorized users. There are several prerequisite security requirements for IoT-based systems: confidentiality, integrity, authenticity, non-repudiation, authorization, and availability. The integration of blockchain technology with IoT enhances the reliability of IoT-based systems in terms of security. The blockchain provides data security to resource-constrained end devices in IoT-based systems and is capable of handling the heterogeneity, privacy protection, and confidentiality of IoT-based systems.
2 Literature Review
This section reviews the state-of-the-art approaches. As noted by Bhutta et al. [4], blockchain technology was introduced in 1991 as "a cryptographically secured chain of blocks". As discussed by Baur et al. [5], blockchain was implemented as a public ledger and received universal recognition through the cryptocurrency Bitcoin; it is gaining popularity in various sectors such as agriculture, logistics, and insurance. Blockchain technology provides a secure distributed architecture that works without the intervention of any centralized or third party. Wang et al. [6] defined blockchain as a chain of blocks linked cryptographically using hash functions, operating on a peer-to-peer network of participants. Blockchain technology provides the highest degree of accountability; this feature has resulted in its adoption for data transmission and record-keeping in various sectors of real-life applications, where it provides proper documentation and digitally confirms the ownership of assets. Hildebrand et al. [7] note that blockchain blocks are ordered unambiguously using a consensus algorithm, which makes blockchain technology verifiable, consistent, and auditable, and enhances integrity among all participants. Jain et al. [8] describe a block as consisting of version information, the hash of the parent block, a timestamp, a nonce, a transaction count, and the combined hash of the transactions. Whenever a new block is generated, each participant applies a block authentication process; after appropriate validation and approval, the block is appended to the parent block by reference. This process helps detect unidentified or unauthorized transactions, since the hash values of unauthorized or falsified blocks differ completely from those of authorized blocks. Meryem et al. [9] proposed the integration of blockchain technology for security in IoT-based smart homes, and Mohamed et al. [10] proposed security for IoT-enabled smart industry environments in Industry 4.0 applications.
3 Working Principle of Blockchain
A blockchain structure is represented as a list of blocks with ordered transactions, stored as a flat-file database. Note that no pointer points to the first block, and the terminal block holds a null pointer. Figure 1 shows the structure of a blockchain. A block contains a version, the parent block hash, a timestamp, a nonce, a transaction count, and a Merkle root. The nonce is an integer that starts from 0 and is incremented every time the hash is calculated; the Merkle root is the combined hash of all transactions. Figure 2 shows the working principle of blockchain technology. Whenever a new record or transaction is to be added to a blockchain, it needs to be verified and digitally signed by the nodes in the system. Any block contains data, its own hash value, and the hash of the previous block. The kind of data stored depends on the blockchain; for a cryptocurrency it includes the receiver, the sender, and the amount of coins. A hash is like a digital signature or a fingerprint: the hash of a block is generated using a cryptographic hash algorithm and identifies each block in the blockchain structure. Any modification of a block changes its hash. The hash of the previous block forms the chain structure, which plays a major role in providing security: any fraudulent attempt to change the data of a block invalidates the whole blockchain system. Proof-of-work is performed by miners, which are special nodes within the blockchain structure; the miners receive transaction fees as a reward from the block. Whenever a new block is created, it is verified by all nodes in the system, and all nodes adhere to the consensus protocol. This makes a blockchain system immutable and secure [11]. Table 1 shows the classification of blockchain architectures.
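The block structure and hash chaining described above can be illustrated with a short, self-contained sketch. SHA-256 stands in for the hash algorithm, the flat concatenation used for the Merkle root is a simplification of a real Merkle tree, and the two-zero proof-of-work difficulty is an arbitrary toy setting.

```python
import hashlib
import json
import time

def merkle_root(transactions):
    """Combined hash of all transactions (flat concatenation for brevity)."""
    joined = "".join(hashlib.sha256(t.encode()).hexdigest() for t in transactions)
    return hashlib.sha256(joined.encode()).hexdigest()

def mine_block(parent_hash, transactions, difficulty="00"):
    """Increment the nonce from 0 until the block hash meets the difficulty."""
    block = {
        "version": 1,
        "parent_hash": parent_hash,
        "timestamp": time.time(),
        "tx_count": len(transactions),
        "merkle_root": merkle_root(transactions),
        "nonce": 0,
    }
    while True:
        digest = hashlib.sha256(
            json.dumps(block, sort_keys=True).encode()).hexdigest()
        if digest.startswith(difficulty):
            return block, digest
        block["nonce"] += 1

genesis, g_hash = mine_block("0" * 64, ["alice->bob:5"])
block1, b_hash = mine_block(g_hash, ["bob->carol:2"])
# Tampering with genesis changes its hash, so block1's parent_hash no longer
# matches and everything downstream of the tampered block is invalidated.
```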
The major characteristics that have resulted in the adoption of blockchain technology in real-life applications are as follows:
Transparency: Participants in public blockchain systems can communicate with equal rights. In public blockchains such as Ethereum and Bitcoin, the authentication of each transaction is recorded, and this data is available to all participants in the network. Thus, data on the blockchain is transparent to each node, which can validate the committed transactions.
Decentralization: In a centralized architecture, the validation of transactions is performed by a central server, which causes a bottleneck problem. Blockchain instead works on a distributed architecture where the validation of transactions is performed in a peer-to-peer manner. This enhances the performance of the system in terms of cost-effectiveness, resolving the bottleneck problem, and avoiding single-point failure.
Fig. 1 Blockchain structure
Fig. 2 Working principle of blockchain technology
Table 1 Classification of blockchain architecture
Property                 Private blockchain        Public blockchain            Consortium blockchain
Consensus determination  Within one organization   All miners                   Selected set of nodes
Framework                Partially decentralized   Fully decentralized          Partially decentralized
Read permission          Public or restricted      Public                       Public or restricted
Traceability             Fully traceable           Fully traceable              Partially traceable
Immutability level       Could be tampered         Almost impossible to tamper  Could be tampered
Resource efficiency      High                      Low                          High
Centralization           Yes                       No                           Partial
Consensus process        Needs permission          Permissionless               Needs permission
Scalability              High                      Low                          High
Flexibility              High                      Low                          High
Transaction speed        Fast                      Slow                         Fast
Immutability: In a blockchain, the chain structure is formed by linking blocks through hash values. Any data tampering invalidates all subsequent blocks; thus, the blockchain structure is immutable.
Pseudonymity: The blockchain is partially confidential, as the addresses of participants can be traced.
Non-repudiation: Each participant in the blockchain system holds a private key. What it encrypts can be decrypted by other participants with the help of the corresponding public key; thus, cryptographically signed transactions are non-repudiable.
Traceability: Blockchain technology offers traceability, achieved through the timestamp attached to every transaction. This allows tracing the origin and modification of any transaction [12].
4 Smart Contracts
A smart contract can be defined as a program stored on a blockchain that runs only when predetermined conditions are fulfilled. The benefit of smart contracts is that they automate the execution of an agreement, making the agreement reliable without involving any third party. Furthermore, this is a faster approach that makes the system autonomous: workflows are maintained, and consecutive actions are triggered on fulfillment of the predetermined conditions. Thus, smart contracts provide a secure and automatic contractual mechanism in which the contracting parties evaluate the success or violation of the contract. Smart contracts are programs that encrypt and replicate contractual agreements [13]. Smart contracts on blockchain technology bring various benefits to the computing domain, such as no commission fees, no dependence on a trusted party, and no mutual interaction of counterparties. A smart contract can be generated by publishing a transaction to the blockchain; the miners in the blockchain run smart contracts and reach agreement on their execution. Each contract is assigned a 160-bit address on deployment, and a transaction sent to this address executes the contract. There are various platforms for the development of smart contracts, such as Bitcoin, Ethereum, Hyperledger Fabric, Nem, Corda, Stellar, Waves, Cardano, Neo, EOS, Rootstock, Tendermint, and Quorum. Such platforms implement smart contracts that offer a modern way of exchanging money, with innovative solutions and easy interfaces for developers.
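The 160-bit address mentioned above can be illustrated by hashing the deploying transaction and keeping 20 bytes. This is a simplified, illustrative stand-in: Ethereum, for example, derives contract addresses with Keccak-256 over an RLP encoding of (creator, nonce), and the creator string below is a made-up placeholder.

```python
import hashlib

def contract_address(creator, nonce):
    """Derive a 160-bit (20-byte) address from the deploying transaction.

    Simplified sketch: a real platform hashes a canonical encoding of the
    transaction; here we hash a plain string and keep the low 20 bytes.
    """
    digest = hashlib.sha256(f"{creator}:{nonce}".encode()).digest()
    return digest[-20:].hex()   # 20 bytes = 160 bits, as 40 hex characters

addr = contract_address("example-creator", nonce=0)
print(len(addr))  # 40 hex characters = 160 bits
```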
5 Blockchain Integration with IoT
IoT provides digital transformation and has revolutionized the real world. Its network of sensors produces large volumes of data that are analyzed by a central server, which makes valuable decisions and supports knowledge discovery. The information in an IoT-based system requires safe and secure transmission and storage. Integrating blockchain technology with IoT enhances the reliability of IoT-based systems in terms of security [14]. The blockchain provides data security to resource-constrained end devices in IoT-based systems, and it is capable of handling the heterogeneity, privacy protection, and confidentiality requirements of such systems. The major advantages of integrating blockchain technology in IoT-based systems are listed below:
- The data collected by the sensors is secured by the blockchain. This data is stored in the blockchain network in the form of encrypted transactions.
- Blockchain technology provides IoT-based systems with enhanced interoperability. No third party is involved in interactions between IoT devices, which makes the whole IoT system autonomous [15, 16].
- Blockchain technology enhances the reliability of IoT-based systems by providing availability, authenticity, confidentiality, accountability, and traceability. It also speeds up the processes of IoT-based systems by providing secure and decentralized features with no third-party intervention.
- Blockchain technology utilizes consensus mechanisms that prevent denial-of-service attacks by imposing a charge for each transaction. Implementing the technology in IoT networks enhances overall security by enforcing access control and data integrity [17, 18].
- Blockchain technology makes it impossible for intruders to modify records or hide transactions in IoT-based systems. This is achieved through a decentralized consensus mechanism, and data encryption with public and private keys provides privacy preservation [19, 20].
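The tamper-evidence and timestamp-based traceability that blockchain brings to IoT data can be illustrated with a minimal hash-chaining sketch in Python. This is a toy model, not a full blockchain; the field names and sensor readings are our own.

```python
# Minimal illustration of hash chaining with timestamps: each block
# stores the hash of its predecessor, so modifying any past sensor
# record invalidates every later hash and tampering is detected.
import hashlib
import json
import time

def block_hash(block):
    # Hash the block's canonical JSON form (excluding its own hash).
    payload = {k: v for k, v in block.items() if k != "hash"}
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

def append_block(chain, sensor_data, timestamp=None):
    block = {
        "index": len(chain),
        "timestamp": timestamp if timestamp is not None else time.time(),
        "data": sensor_data,
        "prev_hash": chain[-1]["hash"] if chain else "0" * 64,
    }
    block["hash"] = block_hash(block)
    chain.append(block)
    return block

def verify(chain):
    # A chain is valid iff every stored hash and back-link checks out.
    for i, block in enumerate(chain):
        if block["hash"] != block_hash(block):
            return False
        if i > 0 and block["prev_hash"] != chain[i - 1]["hash"]:
            return False
    return True

chain = []
append_block(chain, {"sensor": "temp-01", "reading": 21.4})
append_block(chain, {"sensor": "temp-01", "reading": 21.9})
print(verify(chain))                  # True
chain[0]["data"]["reading"] = 99.9    # an intruder edits a past record
print(verify(chain))                  # False: tampering is detected
```

The timestamp in each block is what supports the traceability property discussed in Sect. 3.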
6 Proposed Secure IoT-Based Land Registry Architecture
The land registry involves sharing transactional information about land parcels. The blockchain enhances data security and provides a secure land registry system, including authentication of all land transactions between the involved parties. Blockchain technology prevents illegal land transactions and, through its hash-based chain structure, detects fraudulent modification of registry records. Thus, it helps to secure land transactions and registry records. Figure 3 shows the proposed land registry system architecture.
Initially, all land registry centers and users are required to register themselves in a mobile application. They receive pairs of public and private keys by executing the registration function. A user can request a land registry center to issue a certificate. When the user initiates a request to the authorities, verification of the
624 G. Kaur et al.
Fig. 3 Proposed secure IoT-based land registry architecture
user's details is performed against the information stored on the blockchain network. The blockchain network issues a certificate based on the user's details stored during registration. The issued certificate is stored in the decentralized Inter-Planetary File System. The user then receives the calculated hash value. The land registry details are managed as a transaction with a unique ID, which is stored in a specific block of the blockchain network.
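The registration, certificate, and transaction flow described above can be sketched as follows. This is a toy Python illustration: the "key pair" here is simulated with hashes, whereas a real deployment would use asymmetric cryptography and store certificates in IPFS; all names, parcel IDs, and helper functions are hypothetical.

```python
# Toy sketch of the land registry flow: register a user (simulated key
# pair), issue a certificate, and record its hash as a transaction
# with a unique ID. NOT real public-key cryptography.
import hashlib
import secrets
import uuid

ledger = {}   # transaction ID -> certificate hash (stands in for a block)

def register_user(name):
    # Registration returns a (public, private) key pair for the user.
    private_key = secrets.token_hex(32)
    public_key = hashlib.sha256(private_key.encode()).hexdigest()
    return {"name": name, "public": public_key, "private": private_key}

def issue_certificate(user, parcel_id):
    # The authority verifies the user's stored details, then issues a
    # certificate; its hash is what the user receives.
    certificate = f"{user['name']}:{user['public']}:{parcel_id}"
    return hashlib.sha256(certificate.encode()).hexdigest()

def record_transaction(cert_hash):
    # Land registry details are managed as a transaction with a unique ID.
    tx_id = str(uuid.uuid4())
    ledger[tx_id] = cert_hash
    return tx_id

user = register_user("alice")
cert_hash = issue_certificate(user, parcel_id="LAND-042")
tx_id = record_transaction(cert_hash)
print(ledger[tx_id] == cert_hash)   # True
```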
7 Conclusion
The paper presents a comprehensive survey of blockchain technology, including the characteristics, architecture, and working principle of blockchain. It describes the need for data security and privacy preservation in IoT-based systems, the benefits of integrating blockchain technology with IoT-based systems, and the solutions that blockchain technology provides to IoT-based systems in terms of security. The paper proposes a blockchain-based approach to provide security in terms of authenticity, integrity, availability, and confidentiality, and presents a secure IoT-based land registry system architecture.
References
1. Maraveas C, Piromalis D, Arvanitis KG, Bartzanas T, Loukatos D (2022) Applications of IoT
for optimized greenhouse environment and resources management. Comput Electron Agric
198:106993
2. Deepa N, Pham QV, Nguyen DC, Bhattacharya S, Prabadevi B, Gadekallu TR, Pathirana PN
(2022) A survey on blockchain for big data: approaches, opportunities, and future directions.
Future Gener Comp Syst
3. Jeoung J, Jung S, Hong T, Choi JK (2022) Blockchain-based IoT system for personalized
indoor temperature control. Autom Constr 140:104339
4. Bhutta MNM, Khwaja AA, Nadeem A, Ahmad HF, Khan MK, Hanif MA, Cao Y et al (2021) A
survey on blockchain technology: evolution, architecture and security. IEEE Access 9:61048–
61073
5. Baur DG, Hong K, Lee AD (2018) Bitcoin: Medium of exchange or speculative assets? J Int
Finan Markets Inst Money 54:177–189
6. Wang R, Tsai WT (2022) Asynchronous federated learning system based on permissioned
blockchains. Sensors 22(4):1672
7. Hildebrand B, Baza M, Salman T, Amsaad F, Razaqu A, Alourani A (2022) A comprehensive
review on blockchains for internet of vehicles: challenges and directions. arXiv preprint arXiv:
2203.10708
8. Jain A, Srivastava N (2022) Privacy-preserving record linkage with block-chains. In: Cyber
security, privacy and networking. Springer, Singapore, pp 61–70
9. Ammi M, Alarabi S, Benkhelifa E (2021) Customized blockchain-based architecture for secure
smart home for lightweight IoT. Inf Process Manage 58(3):102482
10. Ferrag MA, Shu L (2021) The performance evaluation of blockchain-based security and privacy
systems for the Internet of Things: a tutorial. IEEE Internet Things J 8(24):17236–17260
11. Dannen C (2017) Introducing ethereum and solidity, Vol 1. Berkeley: Apress, pp 159–160
12. Xu J, Guo S, Xie D, Yan Y (2020) Blockchain: a new safeguard for agri-foods. Artif Intell
Agric 4:153–161
13. Musamih A, Salah K, Jayaraman R, Arshad J, Debe M, Al-Hammadi Y, Ellahham S (2021)
A blockchain-based approach for drug traceability in healthcare supply chain. IEEE Access
9:9728–9743
14. Omar IA, Hasan HR, Jayaraman R, Salah K, Omar M (2021) Implementing decentralized
auctions using blockchain smart contracts. Technol Forecast Soc Chang 168:120786
15. Kaur G, Bhattacharya M, Chanak P (2019) Energy conservation schemes of wireless sensor
networks for IoT applications: a survey. In: 2019 IEEE conference on information and
communication technology. IEEE, pp 1–6
16. Kaur G, Chanak P, Bhattacharya M (2022) A Green hybrid congestion management scheme
for IoT-enabled WSNs. IEEE Trans Green Commun Netw 6(4):2144–2155
17. Dwivedi SP, Srivastava V, Gupta U (2023) Graph similarity using tree edit distance. In: Proceed-
ings of the structural, syntactic, and statistical pattern recognition: joint IAPR international
workshops, S+ SSPR 2022, Montreal, QC, Canada, August 26–27, 2022. Cham: Springer
International Publishing, pp 233–241
18. Yadav S, Mishra R, Gupta U (2015) Performance evaluation of different versions of 2D
Torus network. In: 2015 International conference on advances in computer engineering and
applications. IEEE, pp 178–182
19. Gahlot A, Gupta U (2016) Gaze-based authentication in cloud computing. Int J Comp Appl
1(1):14–20
20. Soni G, Gupta U, Singh N (2014) Analysis of modified substitution encryption techniques
Comparative Study of Heart Failure
Using the Approach of Machine Learning
and Deep Neural Networks
Shachi Mall and Jagendra Singh
Abstract Heart failure, a complicated clinical syndrome, occurs when the heart cannot pump enough oxygenated blood to satisfy the body's metabolic needs. It is a major public health problem and is associated with significant morbidity and mortality. As healthcare and diagnostics become more collaborative, care workers deliberately mine and store patient medical information to create opportunities for enhanced treatment planning. To predict strokes, this paper performs a comprehensive evaluation of the many variables in electronic heart data. The most crucial variables for stroke prediction are identified using principal component analysis. We consider a set of 12 different attributes that are common symptoms of various heart conditions. These features are employed to predict cardiovascular disease; for each attribute there are 918 records, taken from Kaggle. The dataset is split into 70% for training and 30% for testing. We apply the training and test data to different machine learning algorithms, i.e., the K Neighbors Classifier and the Random Forest Classifier, and to a deep neural network, and compare the accuracy results of all three methods: the K Neighbors Classifier achieves 0.877, the Random Forest Classifier 0.8590, and the deep neural network 0.89. In our investigations, we find that deep neural networks are superior to the machine learning algorithms.
Keywords Chronic heart failure · Heart disease · K neighbors classifier · Random forest classifier · Deep neural network
S. Mall (B)·J. Singh
School of Computer Science Engineering and Technology, Bennett University, Greater Noida,
India
e-mail: shachimall@gmail.com
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_47
1 Introduction
Heart disease is a term frequently used for a range of cardiovascular disorders. These problems are primarily associated with blocked or constricted blood vessels, which can cause stroke, angina or chest pain, and cardiac arrest. Considerable effort has gone into developing monitoring technologies and processes for diagnosing heart conditions remotely, although technology fatigue has been identified as a barrier to adherence. Machine learning and neural networks play an important role in identifying whether a person has heart disease [1]. The World Health Organization (WHO) states that approximately twenty-six million individuals worldwide struggle with heart failure, and the number is expected to rise due to aging populations and increasing rates of risk factors such as hypertension, diabetes, and obesity. If current trends continue, approximately 22 million deaths worldwide are predicted by 2030 [2]. Social Networks for Health are online communities where people with related medical conditions can interact, share experiences, and offer emotional support to one another. Social Networks for Health, which let patients share their experiences online, have recently been suggested by various academics as a practical way for people to assist one another [3]. Machine learning, a part of artificial intelligence, has the potential to improve our understanding of heart failure and to support better diagnostic and treatment strategies. The K Neighbors Classifier and Random Forest Classifier are machine learning algorithms [4] used here to check performance on the selected 12 attributes, which form the feature set. On the basis of feature selection, 70% of the dataset is used to train the K Neighbors and Random Forest Classifier machine learning algorithms, and the remaining 30% is used to test the same approaches. To decrease dimensionality when analyzing the data, we use principal component analysis (PCA). In the context of heart stroke, PCA can be used for feature selection to identify the most important variables that are strongly associated with the outcome [5]. When PCA is used for feature selection in heart stroke, the data are first collected, i.e., the 12 attributes covering the patient's gender, age, blood pressure, cholesterol levels, and other health indicators, as stated in Table 1.
The same process is then applied to a deep neural network (DNN). A DNN is a type of artificial neural network (ANN), which resembles the neurons of the human brain. An artificial neural network has separate layers, connections, and a propagation direction. Each layer consists of nodes, with arrows indicating the relationships between them. An artificial neural network's input layer is dense with nodes, and these input-layer nodes are interconnected with the nodes of the hidden layer. A weight is assigned to each input. The network's input nodes provide information to the nodes in the hidden layer, which process it by carrying out various operations or calculations before sending the results to the output node. The node that produces the final result is in the output layer. The developed system is used to diagnose heart failure; for this we have considered 12 different attributes and
Table 1 Attributes of heart disease

S. No. | Attribute name | Description
1 | Age | Age in years
2 | Gender | 1 = M, 0 = F
3 | ChestPainType | Types 1, 2, 3, and 4
4 | RestingBP | Blood pressure at rest
5 | Cholesterol | Cholesterol
6 | FastingBS | 1 = True, 0 = False
7 | RestingECG | Result of resting electrocardiogram: 0 = Normal, 1 = Abnormal, 2 = left ventricular hypertrophy
8 | MaxHR | Maximum heart rate
9 | ExerciseAngina | Exercise-induced angina
10 | Oldpeak | Depression related to ST
11 | ST_Slope | Slope of the peak exercise ST segment (slopes 1, 2, and 3 are upsloping, flat, and downsloping, respectively)
12 | HeartDisease | 0 = Heart Disease, 1 = No Heart Disease
symptoms of heart failure. These 12 attributes are used as features to predict heart failure. We have taken the records of 918 patients from Kaggle [6], some suffering and some not suffering from heart disease. Several studies have demonstrated the potential of machine and deep learning algorithms to predict heart failure on the basis of feature selection over these 12 attributes. We test our data on the K Neighbors Classifier, Random Forest Classifier, and logistic regression machine learning algorithms, and we also run the data through a deep neural network; the dataset is taken from Kaggle [6]. Each individual attribute contains 918 records of patients. These attributes help us to analyze heart disease.
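The feedforward computation described above, in which weighted inputs flow through a hidden layer to an output node, can be sketched with NumPy. The layer sizes, random weights, and activation functions below are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch of one feedforward pass: input values are weighted, processed
# by a hidden layer, and passed on to a single output node.
import numpy as np

def relu(z):
    # Hidden-layer nonlinearity (our illustrative choice).
    return np.maximum(0.0, z)

def sigmoid(z):
    # Output activation: squashes the score into a probability.
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w_hidden, b_hidden, w_out, b_out):
    h = relu(x @ w_hidden + b_hidden)     # hidden-layer computation
    return sigmoid(h @ w_out + b_out)     # output-node probability

rng = np.random.default_rng(0)
n_features = 11                  # 11 input characteristics, as in the paper
w_hidden = rng.normal(size=(n_features, 8))
b_hidden = np.zeros(8)
w_out = rng.normal(size=8)
b_out = 0.0

x = rng.normal(size=n_features)  # one standardized patient record
p = forward(x, w_hidden, b_hidden, w_out, b_out)
print(0.0 < p < 1.0)             # True: a valid probability
```

During training, the error between this output and the target is what drives the weight updates described in the results section.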
2 Related Work
Classification-based machine learning and neural network techniques have been extensively investigated in the field of heart diagnosis. Some examples of related work follow. "Automated Detection of Congestive Heart Failure Using Deep Neural Networks" proposed a deep neural network (DNN) for automated detection of congestive heart failure (CHF) using patient medical history and laboratory results; the proposed DNN achieved high accuracy in detecting CHF compared to traditional methods [7]. Heart disease has been predicted using a hybrid model that combines Random Forest and decision trees, with an accuracy of 85.7% [5]. In a comparison of different machine learning algorithms for predicting heart failure, neural networks (NN) performed better than Random Forest, support vector machines, fuzzy rules, and classification and regression trees [8]. Early detection of cardiac arrest symptoms is vital for disease prevention; one author developed a system that can predict vulnerability to a cardiac condition from simple factors such as age, sex, and pulse rate. A study on automatic detection of atrial fibrillation proposed a convolutional neural network (CNN) for detecting atrial fibrillation (AF) from electrocardiogram (ECG) signals; the proposed CNN achieved high accuracy and outperformed other state-of-the-art methods [9]. In another study, machine learning algorithms were used to classify ECG signals into various cardiac arrhythmia types [10]. S. Kumar et al.'s study, "Prediction of Heart Disease Using Machine Learning Techniques", published in 2020, used machine learning algorithms to predict the risk of heart disease from patient data [11]. A key issue in today's healthcare systems is the provision of high-quality services and efficient, accurate diagnostics [12]. Although heart illnesses are now the largest cause of death worldwide, research shows they can be controlled and managed; the effectiveness of a disease's overall management depends on how early it is discovered. To prevent negative effects, the proposed investigation aims to identify certain heart problems at an early stage [10]; similarly, the suggested study attempts to recognize specific heart issues early on in order to avert negative impacts [13]. Academics have used Statlog and Cleveland, two publicly accessible heart disease datasets, to evaluate the efficacy of prediction algorithms; CFARS-AR, which uses the Statlog dataset, is a clinical decision support system for cardiac disease [14].
3 Proposed System
The long-term goal is to predict cardiac disease and to carry out a comparative study using both a deep neural network technique and the K Neighbors and Random Forest Classifier machine learning algorithms. Kaggle's publicly available heart disease datasets are used extensively to evaluate the efficacy of the prediction algorithms [15]. Table 1 (attributes of heart disease) underpins the most important contributions of this paper: to predict heart disease risk variables for stroke prediction, we determine which elements are most significant for stroke prediction, since a comparative study of machine learning and deep neural network methods will have a big impact on the health field. We have gathered heart data of 918 patients from Kaggle in order to predict heart illness.
3.1 Proposed Methods
The background of all the research tools and methods is covered in the subsections that follow. Figure 1 shows the proposed model's workflow diagram.
Fig. 1 Proposed model workflow
3.2 Working Mechanism
Input: the initial collection of features, t.
Output: a set of m features and a prediction from the wrapper method's suggested model.
1. Rank every independent attribute in the given feature set according to how well it matches the dependent feature, using the principal correlation method over the 12 features. The stronger the correlation, the stronger the dependence.
2. Choose the x (x < t) features whose correlation value with the dependent feature is greater than the cutoff value.
3. Remove the component that least influences the grouping of items into categories.
4. Classify the data using the remaining characteristics.
5. Analyze the classification's performance and obtain the extracted features.
6. Repeat steps 3 through 5 until the final feature set is complete.
7. From the output set of features, choose the extracted features that yield the most accurate results.
8. Develop a model and evaluate it.
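Steps 1 and 2 of the mechanism above can be sketched as follows. This is a simplified NumPy illustration; the cutoff value and the synthetic data are our own assumptions.

```python
# Step 1: rank attributes by absolute correlation with the target.
# Step 2: keep only those whose correlation exceeds a cutoff.
import numpy as np

def rank_by_correlation(X, y):
    # One correlation coefficient per attribute, strongest first
    # (the stronger the correlation, the stronger the dependence).
    corrs = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    order = np.argsort(corrs)[::-1]
    return [(int(j), corrs[j]) for j in order]

def select_features(X, y, cutoff=0.3):
    return [j for j, c in rank_by_correlation(X, y) if c > cutoff]

rng = np.random.default_rng(1)
informative = rng.normal(size=200)        # correlated with the target
noise = rng.normal(size=200)              # unrelated attribute
y = (informative > 0).astype(float)
X = np.column_stack([informative, noise])

print(select_features(X, y))              # the informative feature ranks first
```

Steps 3 through 8 would then repeatedly drop the least influential of the surviving features, re-classify, and keep the subset that scores best.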
3.3 Datasets
A patient’s information is stored in a heart health medical record. It is an automated,
computer-readable database that contains information about a patient’s health taken
from online website of Kaggle [6]. The dataset is accessible, a public data archive
(Fig. 1). Description of total heart datasets, the dataset includes 918 patients’ elec-
tronic health records. It features one output feature and a total of 12 input attributes.
In the output response, a binary state expresses the probability that the patient has
experienced a stroke. The patient’s name, gender (G), age (A), and whether or not
they have heart disease (HD) are among the remaining 12 input features in the EHR.
Other features include ChestPainType (CP), RestingBP (RBP), Cholesterol (CH),
FastingBS (FBS), RestingECG (RECG), MaxHR (MHR), ExerciseAngina (EA),
632 S. Mall and J. Singh
Oldpeak (OP), ST Slope (STS), and Heart Disease (HD). The HD dataset is heavily
skewed in terms of the incidence of stroke events because the vast majority of the
patient records are from people who have never had a stroke.
Patient identification will not be accepted as an input feature. In our investigation
and analysis, we will take into account the final 11 input characteristics and 1 response
variable.
3.4 Pre-processing of Datasets
In this section we analyze the dataset of electronic health records and conduct feature correlation analysis. To conduct this analysis on the input attributes of the heart records, we use the whole dataset. The system uses a dataset with twelve test outcome characteristics gathered from around 918 people. The patient is identified using the binary digits 1 and 0, where 1 stands for a positive diagnosis (in this case, heart disease) and 0 represents a negative diagnosis (in this case, the patient has no heart illness of any kind). When two features are highly correlated, one of them can be ignored when predicting the likelihood that a stroke will occur, because it provides no new information for the prediction model. This is how feature selection can benefit from correlation analysis, which we performed using principal component analysis (PCA). Principal component analysis, at its core, is a statistical technique for converting a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables. The attribute space is reduced from a large number of variables to a smaller number of components using a non-dependent procedure. The major objective of PCA here is to choose, from a broader collection of variables, the original variables that have the strongest connection with the principal component [16, 17].
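The PCA transformation described above can be sketched with NumPy alone: center the data, diagonalize the covariance matrix, and project onto the leading eigenvectors. The data below are synthetic stand-ins for the 918-record, 12-attribute dataset, and the component count is illustrative.

```python
# Minimal PCA sketch via eigendecomposition of the covariance matrix.
import numpy as np

def pca(X, n_components):
    # Center each attribute so the covariance matrix is well defined.
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)          # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]                  # principal directions
    explained_ratio = eigvals[order] / eigvals.sum()
    return Xc @ components, explained_ratio

rng = np.random.default_rng(42)
X = rng.normal(size=(918, 12))    # stands in for 918 records x 12 attributes
Z, explained = pca(X, n_components=5)
print(Z.shape)                    # (918, 5): reduced attribute space
```

In practice the paper's pipeline would apply this to the standardized Kaggle attributes rather than random data; scikit-learn's `PCA` class performs the same computation.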
3.5 Classification Algorithms
For predicting heart disease, neural network and machine learning models were developed using the K Neighbors Classifier and Random Forest Classifier classification algorithms. These approaches are employed in the study. One feature selection technique, PCA, is utilized for dimensionality reduction, and the condensed feature set from the feature selection step is passed to the different classifiers.
Random Forest is a tree-based technique for classification and regression analysis. In the RF ensemble strategy, the decision tree technique is applied to each of the small subsets of the dataset. The subsets are sampled using the sampled-with-replacement method [18, 19].
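The sampled-with-replacement step can be sketched as follows: a minimal Python illustration of bootstrap sampling and the ensemble's majority vote; tree fitting itself is omitted, and the sizes are illustrative.

```python
# Bootstrap sampling and majority voting, the two ensemble mechanics
# behind Random Forest (individual decision trees not shown).
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    # Each tree trains on a same-size subset drawn with replacement,
    # so some rows repeat and others are left out.
    return [rows[rng.randrange(len(rows))] for _ in range(len(rows))]

def majority_vote(predictions):
    # The forest's final class is the most common vote among its trees.
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
rows = list(range(100))
sample = bootstrap_sample(rows, rng)
print(len(sample))                       # 100: same size as the original
print(majority_vote([1, 0, 1, 1, 0]))    # 1
```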
Fig. 2 K-nearest neighbor (KNN)
The K-nearest neighbors (KNN) algorithm searches the dataset for correlations between predictors and values. It is a non-parametric approach, because no specific parameters of any functional form are estimated, and it requires no assumptions about the dataset's properties. It works by simply determining which class a new record is closest to and then assigning it to that class [6, 20]. A deep neural network consists of input and output layers and one or more hidden layers. A Perceptron is made up of an input layer and a fully connected output layer; a Multi-Layer Perceptron has the same input and output levels but may also have additional levels [14] (Fig. 2).
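The KNN idea described above can be sketched from scratch; the toy data points and class labels below are our own illustrative assumptions.

```python
# From-scratch KNN: a new record is assigned to whichever class is
# most common among its k nearest neighbors; no model is fitted.
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    # Rank training records by Euclidean distance to the new record.
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest = [label for _, label in dists[:k]]
    return Counter(nearest).most_common(1)[0][0]

train_X = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (7.5, 8.2)]
train_y = ["healthy", "healthy", "disease", "disease"]
print(knn_predict(train_X, train_y, (1.1, 1.0)))   # healthy
```

scikit-learn's `KNeighborsClassifier`, used in the paper, implements the same rule with efficient neighbor search.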
3.6 Feature Selection
The data set’s most pertinent features are taken out during feature selection. This
approach can be used to prevent redundancy. The accuracy of prediction can be
improved by feature selection since irrelevant features are removed from the input
data. In this study, feature selection is done by principle component analysis (PCA).
Two distinct classification algorithms are performed to the smaller data set after
feature selection. This section of my investigation focuses on the connections between
features and the aim. We think that it makes sense to learn more about the variables
themselves before looking for more intricate correlations.
Following are the steps to predict the heart disease from the given 918 heart disease
datasets.
Step1: we import different libraries, subpackages, color, Standard Scaler and
machine learning algorithm, i.e., numpy, pandas, matplotlib.pyplot, seaborn as sns,
plotly.express, Random Forest Classifier, K Neighbors Classifier as KNN, Select K
Best, confusion_matrix, classification, sklearn, and keras. After all the libraries are
imported, we start the date processing and visualization of the 12 attributes among
918 datasets as shown in Fig. 1.
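The 70/30 split applied after this preparation step can be sketched with NumPy (the paper uses scikit-learn utilities such as `train_test_split`; the shuffling seed and synthetic data below are our own assumptions):

```python
# A plain-NumPy sketch of the 70% train / 30% test split.
import numpy as np

def train_test_split_70_30(X, y, seed=0):
    # Shuffle the row indices, then cut at the 70% mark.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(0.7 * len(X))
    return X[idx[:cut]], X[idx[cut:]], y[idx[:cut]], y[idx[cut:]]

X = np.arange(918 * 12, dtype=float).reshape(918, 12)   # synthetic records
y = np.zeros(918)
X_tr, X_te, y_tr, y_te = train_test_split_70_30(X, y)
print(len(X_tr), len(X_te))   # 642 276
```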
4 Result
Table 2 shows the performance of the K Neighbors Classifier and Random Forest Classifier classification algorithms compared with the deep neural network; the deep neural network's accuracy is better than that of the classification algorithms, as shown in the histograms and accuracy results of Figs. 6 and 7. The main advantage of the DNN is that it combines neurons across the input, hidden, and output layers; if the target and produced outputs differ, the resulting error is used to update the weights to cover the margin of error. The performance of the K Neighbors Classifier and Random Forest Classifier classification algorithms is not as good. The Random Forest Classifier finishes with a score of 0.8590604026845636, with random state = 5 and the parameter grid {'max_depth': range(2, 50, 3), 'min_samples_split': range(2, 10), 'n_estimators': range(10, 200, 10)}; another run finishes with a score of 0.7718120805369126, and the KNN model finishes with a score of 0.8080536912751677. The input dataset has 918 rows with no missing values, as shown in Fig. 1. Figure 3 gives the information and description of the heart dataset. The feature selection process between age and heart disease is shown in Figs. 4 and 5. The histogram scaling shows the variation of range used to predict the features. The K Neighbors Classifier score is evaluated over 17 different neighbor settings to achieve a score of 0.8770, as shown in Fig. 6, and the Random Forest Classifier scores 0.8590. Figure 7 shows the accuracy result of 0.89 for the deep neural network. Figures 3, 4, 8, 9, 10 and 11 compare K-nearest neighbors and the deep neural network on the distribution of age and RestingBP.
Fig. 3 Total heart data set information after processing
Fig. 4 Heart disease varies with age through K-nearest neighbor
Fig. 5 Heart disease varies with age through deep neural network
Fig. 6 Histograms of age varying with ChestPainType, RestingBP, Cholesterol, FastingBS, RestingECG, MaxHR, ExerciseAngina, Oldpeak, ST_Slope, and heart disease through the neural network
Fig. 7 Accuracy result of training and testing KNN algorithm
Fig. 8 Accuracy result of
training and testing data of
deep neural network
Fig. 9 Accuracy result of neural network
Fig. 10 Distribution of age and RestingBP through K-nearest neighbors
Fig. 11 Distribution of age and RestingBP through neural network
Table 2 Estimated performance of the KNN and Random Forest Classifier algorithms

S. No. | Model | Score val | Pred. time val | Fit time | Pred. time val marginal | Fit time marginal | Stack level | Can infer | Fit order
1 | Random Forest | 0.879195 | 0.100443 | 1.08309 | 0.100443 | 1.083097 | 1 | True |
2 | K-Nearest Neighbors | 0.691275 | 0.006258 | 0.005269 | 0.006258 | 0.005269 | 1 | True |
5 Conclusion
We have compared different methods for categorizing cardiac illness. In this research, we have examined different machine learning methods used to predict heart illness; to validate them, we have considered three different classifiers, i.e., the K Neighbors Classifier, the Random Forest classification algorithm, and a neural network. The multiple models for predicting heart disease described here consist of two main stages: feature selection and classification. The experiment was conducted on the heart dataset (918 records), divided into 70% for training and 30% for testing. We compare the accuracy results of the machine learning classification algorithms, as shown in Table 2, across the K Neighbors Classifier, Random Forest classification algorithm, and deep neural network. The results show that the deep neural network gives a better result of 0.89, with no data loss, in comparison to the K Neighbors Classifier and Random Forest machine learning classification algorithms.
Acknowledgements The heart prediction dataset was taken from https://www.kaggle.com/datasets.
References
1. Ahdal A, Prashar S, Rakhra M, Wadhawan A (2021) Machine learning-based heart patient
scanning, visualization, and monitoring. In: International conference on computing sciences
(ICCS)
2. Fitriyani NL, Syafrudin M, Alfian G, Rhee J (2020) HDPM: an effective heart disease prediction
model for a clinical decision support system. IEEE Access 8:133034–133050
3. Huang Y, Song I (2018) A better online method of heart diagnosis. In: 3rd international confer-
ence on biomedical signal and image processing (ICBIP ‘18: 2018), Seoul, Republic of Korea,
pp 80–86. ISBN 978-1-4503-6436-2
4. Li JP, Haq AU, Din SU, Khan J, Khan A, Saboor A (2020) Heart disease identification method
using machine learning classification in e-healthcare. IEEE Access 8. ISBN 107562-107582
5. Al Ahdal A, Prashar D, Rakhra M, Wadhawan A (2021) Machine learning-based heart patient
scanning, visualization, and monitoring. In: International conference on computing sciences
(ICCS)
6. https://www.kaggle.com/search?q=heart
7. Wang B, Bai Y, Yao Z, Li J, Dong W, Tu Y, Xue W, Tian Y, He K (2019) A multi-task neural
network architecture for renal dysfunction prediction in heart failure patients with electronic
health record. IEEE Access 7:178392–178400
8. Kavitha M, Gnaneswar G, Dinesh R, Rohith Sai Y, Sai Suraj R (2021) Heart disease prediction
using hybrid machine learning model. In: 6th international conference on inventive computation
technologies (ICICT)
9. Gavhane A, Kokkula G, Pandya I, Devadkar K (2018) Prediction of heart disease using machine
learning. In: Second international conference on electronics, communication and aerospace
technology (ICECA)
10. Erdaş ÇB, Ölçe D (2020) A machine learning-based approach to detect survival of heart failure
patients. In: Medical technologies congress (TIPTEKNO)
11. Deepika R, Balaji Srikaanth P, Pitchai R (2022) Early detection of heart disease using deep
learning model. In: 8th international conference on smart structures and systems
12. Dhanka S, Maini S (2021) Random forest for heart disease detection: a classification approach.
In: IEEE 2nd international conference on electrical power and energy systems (ICEPES)
13. Long NC, Meesad P, Unger H (2015) A highly accurate firefly-based algorithm for heart disease
prediction. Expert Syst Appl 42(21):8221–8231
14. Arabelle AE, Prasetyanto WA, Wulandari SA (2021) Non invasive blood sugar detection using
the extraction method of principal component analysis. In: IEEE international seminar on
application for technology of information and communication (iSemantic), September, pp
285–289
15. Dhanka S, Maini S (2021) Random forest for heart disease detection: a classification approach.
In: IEEE 2nd international conference on electrical power and energy systems (ICEPES),
December, pp 1–3
16. Reddy KSK, Kanimozhi KV (2022) Novel intelligent model for heart disease prediction using
dynamic KNN (DKNN) with improved accuracy over SVM. In: IEEE international conference
on business analytics for technology and security (ICBATS), February, pp 1–5
17. Gupta M, Srivastava D, Pantola D, Gupta U (2022) Brain tumor detection using improved
Otsu’s thresholding method and supervised learning techniques at early stage. In: Proceedings
of emerging trends and technologies on intelligent systems: ETTIS 2022. Springer Nature
Singapore, Singapore, pp 271–281
18. Mutijarsa K, Ichwan M, Utami DB (2016) Heart rate prediction based on cycling cadence
using feedforward neural network. In: IEEE international conference on computer, control,
informatics and its applications (IC3INA), pp 72–76
19. Gupta U, Gupta D (2022) Least squares structural twin bounded support vector machine on
class scatter. Appl Intell, 1–31
20. Gupta U, Gupta D, Agarwal U (2022) Analysis of randomization-based approaches for autism
spectrum disorder. In: Pattern recognition and data analysis with applications. Springer Nature
Singapore, Singapore, pp 701–713
21. Dev S, Wang H, Nwosu CS, Jain N, Veeravalli B, John D (2022) A predictive analytics approach
for stroke prediction using machine learning and neural networks. Int J Healthcare Anal 22.
https://doi.org/10.1016/j.health.2022.100032
House Price Prediction Using Hybrid
Deep Learning Techniques
Nitigya Vasudev, Gurpreet Singh , Prateek Saini, and Tejasvi Singhal
Abstract The impact of machine learning on the world has been immense and
is only growing. Machine learning is also being used to improve health care, detect
fraud, predict weather, and even develop autonomous vehicles. Furthermore, house
prices have been steadily increasing over the past few years. This has been due to
a number of factors, including a strong economy, low interest rates, and a limited
supply of housing. As the demand for housing continues to outpace the availability
of new homes, the prices of existing homes have increased significantly. This has
caused many people to struggle to afford a home, compounded by the rising cost of
living in recent years. The goal of this paper is to use machine learning as a
powerful tool for predicting the future value of a house. It can be used to predict the
price of a house given certain features such as size, location, and amenities. We have
used machine learning algorithms such as support vector machines (SVM) models,
regression models, random forest, and bagging and boosting models to predict house
prices. Hyperparameter tuning is also being used to optimize the model performance.
As a result, we have compared and analyzed a number of prediction methods in
order to select the most suitable one. House prediction using machine learning can
be used to estimate the future market value of a house, identify potential investment
opportunities, and assist in making informed decisions about buying and selling
properties. In Sect. 1, we give an introduction to the real estate industry and
how machine learning can be helpful for predicting house prices. In Sect. 2, we have
reviewed several papers to gather information to compare the results of different models.
N. Vasudev · G. Singh (B) · P. Saini · T. Singhal
Chitkara University Institute of Engineering and Technology, Chitkara University Punjab,
Chandigarh, India
e-mail: gurpreet.1309@chitkara.edu.in
N. Vasudev
e-mail: nitigya1194.cse19@chitkara.edu.in
P. Saini
e-mail: Prateek1047.cse19@chitkara.edu.in
T. Singhal
e-mail: Tejasvi1000.cse19@chitkara.edu.in
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_48
Sections 3, 4 and 5 cover the methodology and the implementation of various algorithms
to get the desired results. In Sect. 6, we compare the various models and identify
the best-performing algorithm for this research paper.
Keywords Technology · Industry · Housing market · Housing · SVM · XGBoost
1 Introduction
The house price in real estate [1] can be a tricky thing to understand, even for an
experienced expert. With the fluctuating market and ever-changing trends, it can feel
like you are playing a game of chance rather than relying on reliable data. House
prices have long been a topic of interest for both real estate professionals and the
general public. With machine learning rapidly gaining traction as an effective tool
to predict housing values, this research paper will focus on how it can be used to
accurately forecast house prices. A variety of methods, such as regression analysis,
support vector machines, random forests, and others, have been used for predicting
house prices [2] using machine learning. These techniques each bring their own
advantages and limitations, depending on the dataset being analyzed and the task
at hand. After assessing the data and understanding the objectives, one must select
the most appropriate model and tune its parameters accordingly. Prediction accuracy
can be greatly improved as a result, and predicted outcomes become more reliable
[3]. However, other factors must also be taken into consideration when predicting
house prices with machine learning. For instance, geographical location plays an
important role in determining property value [4]. Additionally, current market trends,
macroeconomic indicators, and the availability of amenities, among other variables,
all contribute to a home’s worth. Thus, it is essential to include these components in
order to make more precise predictions. In conclusion, machine learning provides
efficient tools to accurately forecast house prices. It is essential that any buyer of
a property has a property appraisal performed as part of the buying process [5]. In
traditional circumstances, appraisals are performed by appraisers who have received
special training for the purpose of valuing real estate properties. It is important
for buyers of real estate properties to have a better understanding of the current
market prices of properties that are currently available on the market by utilizing an
automated price estimation system. These are some of the algorithms we have used
in this research paper. We have also used other algorithms in this research paper to
have better results which we have mentioned in methodology section.
1. Support vector machine (SVM) is a powerful machine learning algorithm used
for both classification and regression. It has many advantages over other algo-
rithms, such as its ability to handle high-dimensional data and its ability to create
nonlinear decision boundaries. SVM is also robust to outliers and can be used
with kernels, which allows it to work with nonlinear data. We have used the support
vector machine in our model because of its ability to create nonlinear decision
boundaries, which can accurately handle complex datasets [6]. We obtained an
accuracy of 90% in our experiments. Additionally, SVM is computationally
efficient and can be used on large datasets without compromising accuracy.
2. XGBoost is a powerful machine learning algorithm that has gained immense
popularity in recent years. It is an advanced implementation of gradient boosting,
which is used to improve the performance and accuracy of predictive models.
XGBoost has several advantages over other algorithms, such as faster training
speed, better accuracy, and improved scalability. XGBoost also has a number
of built-in features, such as regularization, cross-validation, and feature selec-
tion, which make it easier to use. Additionally, XGBoost is capable of handling
large datasets, making it suitable for use in complex data mining tasks. Overall,
XGBoost provides many advantages, making it a powerful and useful tool for
data scientists and machine learning engineers.
3. Linear regression is a powerful tool for predicting the outcome of a given event.
It is a supervised learning technique that uses a linear equation to represent the
relationship between the dependent and independent variables. Advantages of
linear regression include its simplicity and interpretability; however, it can only
capture linear relationships. It can also be used to identify outliers and to
assess the strength of the relationships between variables. In this research paper,
we have chosen to use support vector machines (SVMs) over linear regression
due to their ability to capture more complex relationships between variables and
to better handle outliers. SVMs also have the advantage of being more robust to
overfitting, making them more suitable for high-dimensional data.
2 Literature Survey
In order to determine the price of a house, there are several factors to consider. In their
research, Rahadi et al. [1] suggest that in order to simplify these elements, we should
categorize them into three groups, namely physical condition, idea, and area. The
physical condition of a house is defined by its size, how many rooms there are, how
accessible the kitchen and carport are, how accessible the garden is, the zone of the
land and structure, and the age of the house.
These are the physical properties controlled by a house that human senses are able
to observe.
In their research, Rawool et al. [2] proposed that 80% of the data be used for
training and 20% for testing in their machine learning model; the training set
includes the target variables. The model was trained using a variety of machine
learning algorithms, among which random forest regression was shown to produce
the most accurate predictions. This has been implemented with the Python libraries
NumPy and Pandas.
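The 80/20 split that Rawool et al. describe can be sketched in plain Python. This is a minimal illustration under our own assumptions (a fixed random seed, integer rows standing in for records); the paper itself uses NumPy and Pandas:

```python
import random

def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle the rows and split them into training and test sets."""
    rng = random.Random(seed)
    shuffled = rows[:]  # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))
train, test = train_test_split(rows)
print(len(train), len(test))  # 80 20
```

Every record ends up in exactly one of the two sets, which is the property the 80/20 scheme relies on.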
In their research paper, Manasa et al. [3] decided to use Bengaluru as a
case study. The size of the property in square footage, the location, and its facilities
are all key factors affecting the cost of the property. A total of nine attributes are taken
into account. For the experimental work, they have used multiple linear regression
(least squares), Lasso/Ridge regression, SVMs, and XGBoost.
In his research, Luo [4] suggests that, as a whole, the majority of studies have
focused on macroeconomic factors that influence residential asset prices, rather than
explaining the factors that determine prices. His research examines
some micro-characteristics that can be used to estimate house prices, namely lot
size and pool size. Machine learning methods such as random forest and support
vector machine are used to predict asset prices. Almost all regression models have
an R-squared of more than 0.9.
In their research paper, Abidoye and Chan [5] suggested that in addition to using
hedonic regression and artificial intelligence techniques for developing housing price
prediction models, the relationship between house prices and housing characteristics
was identified using a variety of hedonic methods based on the concept of utility
maximization.
Gu et al. [6] suggested that an improved model for the prediction of housing prices
should be developed based on a performance evaluation of several machine learning
algorithms. It has been shown that the SVM approach is more accurate
than traditional methods in terms of forecasting housing prices. However, little
research has been conducted on how to develop a more accurate forecasting model
using genetic algorithms. Using machine learning, this study is aimed at examining
the performance of the algorithms and developing a more accurate model of housing
price prediction for the real estate market.
In their research paper, Kauko et al. [7] examined the housing market in Finland
using a neural network model. Their results showed that several dimensions of the
formation of housing sub-markets could be identified by finding patterns in the
dataset.
2.1 Gap Analysis
After analyzing the above research papers and gathering the important information,
the following points are concluded.
1. Most of the researchers did not have the desired amount of data for their study,
due to which they were not able to get the desired results.
2. The majority of the studies focused on macroeconomic factors that influence
residential asset prices rather than explaining the factors that determine prices.
3. To overcome these problems, we have used a large and varied dataset to get the
desired results.
4. We have used SVM as the main model for this research paper to increase the
accuracy of the model.
5. We have also used hyperparameter tuning for the best results.
3 Methodology
In this research, we have used the Jupyter Notebook IDE. Jupyter is an open-source
web application that helps us create and share documents containing live code,
visualizations, and equations, and it supports tools for data cleaning, data
transformation, statistical modeling, data visualization, and machine learning. We
have collected real-world data related to home sales from Kaggle to estimate home
prices. We have used libraries such as SciPy, Seaborn, Pandas and NumPy, and
several machine learning models, including random forest, SVM, linear regression,
decision tree, and XGBoost [14]. To check how well each regression model fits the
data, we have used the coefficient of determination (R²).
4 Implementation
4.1 Data Preprocessing
A crucial procedure has to be followed to check whether a dataset is suitable for
machine learning algorithms. Preprocessing transforms raw data into an efficient
format: the dataset is first cleaned so that unwanted data is removed and only the
data relevant to the problem is retained [15]. As far as formatting is concerned,
null values and irrelevant data must be removed to make the data suitable for
machine learning algorithms. After extracting the data, there were some null values
in the attributes which had to be handled so that the accuracy of the models was
not compromised (Fig. 1).
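As a small illustration of the null-value cleaning described above, the following pure-Python sketch drops any record with a missing attribute. The records and field names are hypothetical; the paper itself performs this step with Pandas:

```python
# Hypothetical housing records; None marks a missing attribute value.
records = [
    {"sqft": 1400, "bedrooms": 3, "price": 250000},
    {"sqft": None, "bedrooms": 2, "price": 180000},  # missing sqft -> dropped
    {"sqft": 2100, "bedrooms": 4, "price": None},    # missing price -> dropped
]

def drop_nulls(rows):
    """Keep only the records where every attribute has a value."""
    return [r for r in rows if all(v is not None for v in r.values())]

clean = drop_nulls(records)
print(len(clean))  # 1
```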
4.2 Exploratory Data Analysis
After preprocessing, the dataset is divided through a data-splitting process [13] and
used to train machine learning algorithms such as SVM, XGBoost, random forest
and decision tree [10]. We have also used a correlation heatmap to see which features
are highly correlated with the target attribute, which is the sale price in this case
[17] (Figs. 2 and 3).
4.3 Dataset
The following diagram shows the important attributes of the dataset (Figs. 4 and 5).
Fig. 1 Null values using heatmap
Fig. 2 Displot of sale price
Fig. 3 Correlation heatmap of sale price
Fig. 4 Dataset
5 Algorithms
5.1 Lasso Regression
Lasso regression is a statistical technique used in regression analysis to reduce the
complexity of a model [3]. It relies on shrinkage, penalizing the coefficients of the
model and diminishing their impact until they become insignificant [11]. This makes
it particularly effective when dealing with a large number of predictor variables,
as its primary goal is to reduce complexity while improving interpretability.
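The shrinkage that drives coefficients to insignificance is typically implemented with the soft-thresholding operator. The following is our own illustrative sketch, not code from the paper:

```python
def soft_threshold(coef, penalty):
    """Shrink a coefficient toward zero; set it to exactly zero if it is small."""
    if coef > penalty:
        return coef - penalty
    if coef < -penalty:
        return coef + penalty
    return 0.0

print(soft_threshold(2.5, 1.0))   # 1.5
print(soft_threshold(-0.4, 1.0))  # 0.0  (insignificant coefficient eliminated)
```

Setting small coefficients to exactly zero is what lets Lasso act as an automatic feature selector.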
5.2 Ridge Regression
Ridge regression is a linear regression approach used to simplify the model and
avoid overfitting. It is a regularization strategy that shrinks the model's coefficients
by adding a penalty to the loss function [3]. The penalty is proportional to the
squared magnitude of the coefficients, which lowers the model's complexity.
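For a single centered feature, the ridge solution has a simple closed form, w = Σxy / (Σx² + λ), which makes the shrinkage effect of the penalty easy to see. This is an illustrative sketch, not the paper's implementation:

```python
def ridge_coefficient(x, y, lam):
    """Closed-form ridge slope for one centered feature: sum(xy)/(sum(x^2)+lam)."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)

x = [-2.0, -1.0, 1.0, 2.0]
y = [-4.0, -2.0, 2.0, 4.0]            # exact slope 2 without any penalty
print(ridge_coefficient(x, y, 0.0))   # 2.0
print(ridge_coefficient(x, y, 10.0))  # 1.0  -> coefficient shrunk by the penalty
```

Increasing λ shrinks the coefficient toward zero but, unlike Lasso, never sets it exactly to zero.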
5.3 XGBoost
XGBoost is an advanced implementation of gradient boosting algorithm. It is a
powerful machine learning algorithm that has gained immense popularity in the data
science community due to its superior performance and efficiency [9]. XGBoost
is an open-source library which is used for supervised learning problems such as
classification and regression.
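XGBoost builds on gradient boosting, which fits each new weak model to the residuals of the current ensemble. The toy sketch below uses constant "models" for squared-error loss, purely to illustrate the idea; real XGBoost fits regularized decision trees:

```python
def boost_means(y, n_rounds=3, learning_rate=0.5):
    """Toy gradient boosting for squared error: each round adds a constant
    'model' equal to the mean residual, scaled by the learning rate."""
    pred = [0.0] * len(y)
    for _ in range(n_rounds):
        residuals = [t - p for t, p in zip(y, pred)]
        step = learning_rate * sum(residuals) / len(residuals)
        pred = [p + step for p in pred]
    return pred

y = [10.0, 10.0, 10.0]
print(boost_means(y))  # [8.75, 8.75, 8.75] -- predictions approach the target
```

Each round halves the remaining error here (5.0, then 7.5, then 8.75), showing how successive weak learners accumulate into a strong prediction.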
5.4 Support Vector Machine
Support vector machine is a powerful machine learning algorithm that can be used for
classification as well as regression [12]. It transforms data into a higher-dimensional
space using a technique called the kernel trick to find the hyperplane that best
separates the data [18]. This hyperplane is then used to make predictions on unseen
data.
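The kernel trick can be illustrated with the widely used RBF kernel, which scores the similarity of two points without ever computing the higher-dimensional mapping explicitly. This is an illustrative sketch; the paper does not state which kernel its SVM uses:

```python
import math

def rbf_kernel(a, b, gamma=0.5):
    """RBF kernel: exp(-gamma * squared Euclidean distance between a and b)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)

print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))         # 1.0 (identical points)
print(rbf_kernel([0.0, 0.0], [3.0, 4.0]) < 1e-5)  # True (distant points -> near 0)
```

The SVM only ever needs these pairwise kernel values, which is what allows it to find nonlinear decision boundaries efficiently.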
5.5 K-Fold
Cross-validation is an effective method used in machine learning to evaluate the
performance of a model. It works by splitting the dataset into k subsets, or "folds",
training the model on k − 1 folds, and testing it on the remaining fold [8]. This
process is repeated k times, each time with a different fold as the test set. The
average of the k accuracy scores is then used as the overall accuracy of the model.
K-fold cross-validation reduces the variance of the performance estimate while still
making efficient use of the available data.
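The fold construction can be sketched as follows. This is a minimal illustration without shuffling; library implementations such as scikit-learn's KFold offer shuffling and stratification:

```python
def k_fold_indices(n_samples, k):
    """Partition sample indices into k contiguous folds of near-equal size."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

folds = k_fold_indices(10, 3)
print(folds)  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

Each index appears in exactly one fold, so every sample is used for testing exactly once across the k rounds.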
5.6 R² Score
The R² score is a popular metric for assessing a model's performance. It is computed
as one minus the ratio of the residual sum of squares to the total sum of squares,
and it measures how well the predicted values match the observed values [16].
Higher values indicate a better fit. The R² score typically ranges from 0 to 1: a
perfect value of 1 means the model predicts the data exactly, whereas a score of 0
means the model performs no better than predicting the mean. The R² score is a
crucial factor when evaluating the performance of a model (Fig. 6).
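The computation can be sketched directly as a small pure-Python function (an illustration consistent with the standard definition of R²):

```python
def r2_score(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0]
print(r2_score(y_true, y_true))           # 1.0 (perfect fit)
print(r2_score(y_true, [5.0, 5.0, 5.0]))  # 0.0 (no better than the mean)
```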
6 Results
From Table 1, we can compare the different algorithms, identify the best two among
them, and integrate them to provide the best output.
Fig. 6 Formula of R² score
Table 1 Accuracy of some of the models

Algorithms                  Score
Lasso regression            0.86468
Ridge regression            0.86779
XGBoost                     0.88414
Random forest regression    0.85644
Decision tree regression    0.68921
Support vector machine      0.91373
7 Limitations
7.1 Limited Data
Possibly, a small dataset was used for the analysis in the research publication. As
a result, the models’ capacity to forecast outcomes accurately and reliably may be
constrained.
7.2 Limitation of Model Hyperparameters
The research report might not have thoroughly examined all potential model hyper-
parameters, which could have an impact on the precision and dependability of the
models.
8 Conclusion
In conclusion, we have presented a model that could provide everyone working in real
estate with a new, more accurate methodology for the prediction of house prices. As
the results show, SVM and XGBoost were the two main models that gave us the desired
results: the support vector machine achieved a score of 0.91, whereas XGBoost
achieved 0.88. In this study, it is highlighted how
important it is to use advanced machine learning algorithms, particularly in such an
era when the real estate market is rapidly changing. For real estate professionals,
investors, and policymakers seeking to make informed decisions based on accurate
and reliable predictions, the application of SVMs and XGBoost regression algorithms
can provide valuable insights. To further improve the accuracy and reliability of house
price predictions, it may be worthwhile for future research to incorporate additional
features as well as explore other regression algorithms.
9 Future Scope
Future research on housing price prediction has huge and fascinating potential, with
many different directions it may take. More accurate and trustworthy forecasts can
help real estate professionals, investors, and regulators make wise judgements. This
is made possible by the integration of cutting-edge technologies and the analysis of
outside elements. Exploring the effects of external elements, such as macroeconomic
trends, social and political events, and environmental concerns, on the real estate
market and how they affect house price projections could be another area of research.
References
1. Rahadi RA, Wiryono SK, Koesrindartoto DP, Syamwil IB (2015) Factors influencing the price
of housing in Indonesia. Int J Housing Markets Anal 8(2):169–188. https://doi.org/10.1108/
IJHMA-04-2014-0008
2. Rawool AG, Rogye DV, Rane SG, Bharadi VA (2021) House price prediction using machine
learning. Iconic Res Eng J
3. Manasa J, Gupta R, Narahari NS (2020) Machine learning based predicting house prices using
regression techniques. In: 2020 2nd international conference on innovative mechanisms for
industry applications (ICIMIA), Bangalore, India, pp 624–630. https://doi.org/10.1109/ICI
MIA48430.2020.9074952
4. Luo Y (2019) Residential asset pricing prediction using machine learning. In: 2019 international
conference on economic management and model engineering (ICEMME). IEEE, pp 193–198
5. Abidoye RB, Chan APC (2017) Critical review of hedonic pricing model application in property
price appraisal: a case of Nigeria. Int J Sustain Built Environ 6(1)
6. Gu J, Zhu M, Jiang L (2011) Housing price based on genetic algorithm and support vector
machine. Expert Syst Appl 38:3383–3386
7. Kauko T, Hooimeijer P, Hakfoort J (2002) Capturing housing market segmentation: an alterna-
tive approach based on neural network modelling. Housing Stud 17:875–894. https://doi.org/
10.1080/02673030215999
8. Anguita D, Ghelardoni L, Ghio A, Oneto L, Ridella S (2012) The ‘K’ in K-fold cross validation.
In: ESANN, pp 441–446
9. Zhao Y, Chetty G, Tran D (2019) Deep learning with XGBoost for real estate appraisal. In:
2019 IEEE symposium series on computational intelligence (SSCI), Xiamen, China, pp 1396–
1401. https://doi.org/10.1109/SSCI44817.2019.9002790
10. Adetunji AB, Akande ON, Ajala FA, Oyewo O, Akande YF, Oluwadara G (2022) House price
prediction using random forest machine learning technique. Procedia Comput Sci 199:806–813
11. Lu S, Li Z, Qin Z, Yang X, Goh RSM (2017) A hybrid regression technique for house prices
prediction. In: 2017 IEEE international conference on industrial engineering and engineering
management (IEEM). IEEE, pp 319–323
12. Yu H, Wu J (2016) Real estate price prediction with regression and classification. In: CS229
(machine learning) Final project reports
13. Kumar D, Sarangi PK, Verma R (2022) A systematic review of stock market prediction using
machine learning and statistical techniques. Mater Today Proc 49:3187–3191
14. Thamarai M, Malarvizhi SP (2020) House price prediction modeling using machine learning.
Int J Inf Eng Electron Bus 12(2)
15. Mittal R, Kumar P, Mittal A, Malik V (2021) Developing an evaluation model for forecasting of
real estate prices. In: Choudhary A, Agrawal AP, Logeswaran R, Unhelkar B (eds) Applications
of artificial intelligence and machine learning. Lecture notes in electrical engineering, vol 778.
Springer, Singapore. https://doi.org/10.1007/978-981-16-3067-5_46
16. Arumugam SR, Gowr S, Manoj O (2021) Performance evaluation of machine learning and
deep learning techniques: a comparative analysis for house price prediction. In: Convergence
of deep learning in cyber-IoT systems and security, pp 21–65
17. Makhloga VS, Raheja K, Jain R, Bhattacharya O (2021) Machine learning algorithms to
predict potential dropout in high school. In: Khanna A, Gupta D, Pólkowski Z, Bhattacharyya
S, Castillo O (eds) Data analytics and management. Lecture notes on data engineering and
communications technologies, vol 54. Springer, Singapore. https://doi.org/10.1007/978-981-
15-8335-3_17
18. Agarwal P, Alam M (2022) Quantum-inspired support vector machines for human activity
recognition in Industry 4.0. In: Gupta D, Polkowski Z, Khanna A, Bhattacharyya S, Castillo
O (eds) Proceedings of data analytics and management. Lecture notes on data engineering and
communications technologies, vol 90. Springer, Singapore. https://doi.org/10.1007/978-981-
16-6289-8_24
Sentiment Analysis Using Machine
Learning of Unemployment Data in India
Rudra Tiwari, Jatin Sachdeva, Ashok Kumar Sahoo,
and Pradeepta Kumar Sarangi
Abstract With the massive increase in social media data and the hype around Natural
Language Processing, opinion mining has become one of the most popular ways
to analyze people’s views on a specific topic. Using hashtags, one can obtain tweet
data in millions and analyze sentiments. This can be done effectively using Python
with its NLP modules available. Studying the attitudes and sentiments of Indian
citizens towards the current unemployment rate is the primary purpose of this study.
In situations where there may be negative consequences due to people's aggression,
analyzing such content to gauge people's sentiments can be extremely valuable in
managing the situation. Natural Language Processing and other Machine Learning
classifiers are used in this research to perform opinion mining of the tweets posted
by Indians. About 10,928 tweets have been accumulated, on which sentiment
analysis has been performed, classifying each tweet as positive, negative or
neutral. The 'Tweepy' API has been used, along with the
hashtags ‘UnemploymentInIndia’ and ‘Unemployment’. The data has been cleaned
and preprocessed using NLTK, VADER and other modules available in
Python. Study findings suggest that most Indian citizens oppose the unemployment
rates in their country, but a minority look to political movements to bring about
change.
Keywords Sentiment analysis · Natural Language Processing · Unemployment ·
Twitter data · Unemployment sentiment analysis · Unemployment India analysis ·
Social economy
R. Tiwari
Doon International School, Dehradun, India
J. Sachdeva ·P. K. Sarangi (B)
Chitkara University Institute of Engineering & Technology, Chitkara University, Punjab, India
e-mail: Pradeepta.sarangi@chitkara.edu.in
J. Sachdeva
e-mail: jatin0530.cse19@chitkara.edu.in
A. K. Sahoo
Graphic Era Hill University, Dehradun, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_49
1 Introduction
Youngsters are the source of economic growth for a nation. The age between 15
and 24 signifies the youngsters’ ascent towards the labour markets to earn money.
People’s status in society is primarily determined by their employment. Workers who
are ready to work but cannot find employment are considered unemployed. It is still
not recognized that unemployment is a macroeconomic issue in India. Unemployed
people are often seen as failures and are disregarded in the society. These people
may drive themselves into depression, or even worse, social isolation. Reports by
the World Bank have stated that the unemployment rate in India was about 20.6%
in 2018. Data by the same World Bank [1] has stated that India’s unemployment
rate went up to 20.7% in 2019 (a slight increase). The COVID-19 pandemic has led
the unemployment rate to rise to 23.2% in 2020. If one notices the unemployment
rates for the neighbouring countries of India, which include Sri Lanka, Bangladesh,
Pakistan and Nepal, it has been observed that youth unemployment is the highest in
India. The situation of India’s unemployment has worsened due to the pandemic and
population explosion. In the first major year of the pandemic (2020–2021), India’s
employment rate fell to about 10.9% (for youth). It saw a descent of about 0.5%
in 2021–2022. India has also overtaken the UK to become the fifth largest economy
in the world. However, if these numbers are considered, it is apparent that income
inequality along with low youth employment rates remains a problem for India’s
economy. Average labour participation rate (LPR) for the years 2016–2017 and
2021–2022 for India’s youth remains at about 22.7%. Though India has the world’s
largest youth population, it also has the lowest youth employment rate. This remains
a topic of concern for both the government and researchers. The computation of
unemployment statistics in India is the responsibility of the National Sample Survey
Organization (NSSO) and the Labour Bureau. From the reports of these organizations,
it is apparent that the following factors are the root causes of the increasing
unemployment in India:
1. Ever-increasing population.
2. Slow economic growth.
3. Income inequality.
4. Caste system.
5. Prevalence of primary economic activities.
6. Shortage of resources required for industrial production (electricity, coal, etc.).
Sentiment analysis is a technique of aggregating people’s opinions, attitudes and
emotions about something through opinion mining. Any topic, event or individual
can be represented by the entity. Most of the reviews will cover these topics. The goal
of Natural Language Processing is to extract the meaning of written or spoken forms
of the language by application of various levels of linguistic analysis [2]. In this
study, Twitter is used as the major data hub. It is the best microblogging tool which
allows people to express their sentiments in concise words in such a manner that it
is coherent [3]. Its ease of comprehension makes it the best source for performing
sentiment analysis.
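As an illustration of the lexicon-based scoring that tools like VADER build on, here is a toy scorer with a hypothetical five-word lexicon. Real systems use large, weighted lexicons plus rules for negation, punctuation and intensifiers, so this is only a sketch of the idea:

```python
# Hypothetical tiny sentiment lexicon (word -> polarity weight).
LEXICON = {"good": 1, "great": 2, "jobless": -2, "bad": -1, "crisis": -2}

def score_tweet(text):
    """Sum the polarities of known words and map the total to a label."""
    total = sum(LEXICON.get(word, 0) for word in text.lower().split())
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"

print(score_tweet("Great news on employment"))       # positive
print(score_tweet("Jobless rates signal a crisis"))  # negative
print(score_tweet("Unemployment report released"))   # neutral
```

The three-way output mirrors the positive/negative/neutral categories used in this study.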
2 Literature Review
Government policies that are poorly implemented and the international economic
environment are major causes of high youth unemployment. Inefficient workers in
many labour fields, lack of proper education facilities and the absence of in-demand
skills are the primary reasons for unemployment in India. Moreover, India's education
system follows an approach of traditional theoretical teaching and often disregards
the crucial methodology of teaching practical applications to students. Moreover,
consumers’ negative perceptions of major Indian markets are also contributing to
the reduction of private sector jobs. Leaders of some political parties even claim
that “The only people who get jobs are relatives and friends of politicians”. People
believe in the approach of saving money, rather than investing it. Somehow, AI and
automation are another reason for fewer job vacancies and layoffs [4]. The rate of
unemployment and annual change percentage between the years 1990 and 2022 is
shown in Fig. 1.
As long as there are no adequate social protection policies for the unemployed,
only those who can afford to remain unemployed will do so [5]. However, schemes
like MGNREGA have been launched by the government to end this social and
economic evil. Similar policies for skilled workers have been initiated as well. But
due to the exponential increase of youth unemployment, government policies are
insufficient to tackle this evil. It is imperative that youth unemployment can be
understood globally. The issue of youth unemployment must be addressed by global
organizations to improve employability and employment opportunities for young
people [6].
Fig. 1 Unemployment rate and annual change percentage between the years 1990 and 2022
Looking at different economic sections of India, it has been observed that people
living in poverty tend to remain unemployed for longer than people from the well-off
sections of society. Geographically, the Eastern regions have higher unemployment
rates, as well as a higher share and a higher concentration of long-term unemployment.
Another problem with India's labour force is that there are not enough policies for
unemployed women in urban areas; moreover, India's female workforce is already
small. Several factors make analyzing unemployment rates difficult. First, nothing is
known about the history of an individual whose employment periods are being
analyzed. Secondly, the presence of 'reservation wages' in India makes it difficult
to determine who is unemployed and who is not. Thirdly, people with no prior
job experience are less preferred than those with actual job experience. Lastly, the
lack of vocational training also plays a vital role in determining the actual value of an
individual's contribution. In some places, hereditary practices in providing employment
still worsen the employment situation, and companies end up neglecting the potential
of young and creative freshers. It is therefore difficult to pinpoint the exact factors
behind longer-term unemployment [7]. The long-term unemployment data is shown in Table 1.
Unemployment rates in India are highest in Haryana, a little more than that of
Jammu and Kashmir. In these areas, the unemployment rates reach about 30–35%
and can be seen in Fig. 2. However, Chhattisgarh, Meghalaya and Maharashtra have
had the lowest percentage of unemployed youth, hovering around 1–3%. The detailed
unemployment data is shown in Fig. 2.
Unemployment is the root cause of poverty in India. In the past three and a half
decades, the share of youth outside the labour force has significantly increased,
fluctuating between 40 and 44%, while the population has grown exponentially. It
must also be realized that these numbers cannot capture the income and productivity
of workers, their work environment, or their motivation to stay in the same job for
years. Low wages have furthermore worsened the condition of employees and
workers, who are paid meagre amounts after toiling for up to 14–16 h a day. Three
policies should be implemented to battle the rising unemployment in India. They are:
1. Appropriate Macro-Policy
New policies must be formed, and old policies must be reformed to check the unem-
ployment rates and create new employment opportunities in India. Investing in the
growth of secure jobs is critical as well. Trade liberalization and financial sector
liberalization can enable the public to apply for jobs for handling exports and work
as skilled/unskilled labourers.
2. Improvement in the Education System
India’s education system must focus on improving the quality of students and make
them future-ready for the upcoming developments and technological advancements.
Sentiment Analysis Using Machine Learning of Unemployment Data 659
Table 1 Changes in unemployment rates over the last two decades
Date          Unemployment rate (in %)   Annual change
31-12-1991 5.599
31-12-1992 5.727 0.13
31-12-1993 5.691 0.04
31-12-1994 5.739 0.05
31-12-1995 5.755 0.02
31-12-1996 5.74 0.01
31-12-1997 5.613 0.13
31-12-1998 5.666 0.05
31-12-1999 5.736 0.07
31-12-2000 5.561 0.18
31-12-2001 5.576 0.01
31-12-2002 5.53 0.05
31-12-2003 5.643 0.11
31-12-2004 5.629 0.01
31-12-2005 5.613 0.02
31-12-2006 5.601 0.01
31-12-2007 5.572 0.03
31-12-2008 5.414 0.16
31-12-2009 5.544 0.13
31-12-2010 5.546 0
31-12-2011 5.426 0.12
31-12-2012 5.414 0.01
31-12-2013 5.424 0.01
31-12-2014 5.436 0.01
31-12-2015 5.435 0
31-12-2016 5.423 0.01
31-12-2017 5.358 0.07
31-12-2018 5.33 0.03
31-12-2019 5.27 0.06
31-12-2020 7.997 2.73
31-12-2021 5.978 2.02
Teachers should focus on teaching practical applications to the students, instead of
the traditional theoretical approach.
3. Policies on Active Labour Market
The government needs to intervene in the labour market, providing more employment
opportunities while ensuring quality [8]. According to The Economic Times, in
2021–22, 10.4% of Indian youth (15–24 years of age) were employed, compared to
10.9% in 2020–21.

Fig. 2 Unemployment rate in India (as of August 2022). Source: Centre for Monitoring Indian Economy Pvt. Ltd
3 Research Aim
This research project aims to classify people’s opinions on ‘Unemployment in India’.
The datasets used in the project are obtained between the dates of 15 October 2021
and 15 September 2022. The opinion mining that is conducted is based on social
media behaviour analysis on about 11,000 tweets. Through this research, the aim is
to discover what people in India think about the growing unemployment rates. The
text analytics in this form are based on Machine Learning and NLP. Many research
papers based on sentiment analysis intricately describe the processes and procedures
to be followed while working with raw text; they follow detailed methodologies to
convert raw data into illustrations obtained from cleaned data [9–11].
Most of the people who are commenting about unemployment have negative
opinions about it because unemployment, in general, is considered as one of the
social evils all around the world. Unfortunately, a fair population of India from both
rural and urban areas is suffering from unemployment.
4 Proposed Methodology
Industry, organizations and academic institutions are increasingly focusing on big
data as a strong global trend. This research considers about 11,000 tweets obtained
from Twitter between 15 October 2021 and 15 September 2022. The data source
has been explored using the hashtags '#Unemployment' and '#UnemploymentInIndia'.
Twitter ranks amongst the top ten websites visited every day. As Twitter is a
predominantly text-based social media platform, tweets are used in the experiments to
analyze people's general opinion on youth unemployment in India. A time interval of
about a year is taken to ensure the uniformity of the data, since, due to changes in the
economy or work policies, people may have reacted actively only during a small span
of time. This study observes the social media messages of residents of India through
multiple channels. This data is tabulated and then transformed into data with three
sentiment categories: positive, negative and neutral.
The proposed approach for this research is given in the form of an algorithm which
is as follows:
1. Use Tweepy and Twitter's API to mine data and obtain tweets. The extracted
data is used as the dataset for the research.
2. Utilize Python, NLP and Machine Learning classifiers to sort data based on the
keywords they contain. The Natural Language Toolkit (NLTK) of Python is
used, with the 'TextBlob' library and Python's VADER tool for more precision.
3. Classify individual tweets into 'Positive', 'Negative' and 'Neutral' categories.
4. Tabulate the data in pie charts and develop 'wordclouds' to determine the most
frequently used hashtags.
5. Prepare a bar graph of the most used words in the tweets.
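Steps 2 and 3 of the pipeline reduce, at their core, to mapping a polarity score to a sentiment label. A minimal sketch, assuming the polarity scores themselves come from TextBlob or VADER (the scores below are illustrative stand-ins, not values from the paper's dataset):

```python
from collections import Counter

def classify_by_polarity(polarity: float) -> str:
    """Map a polarity score to a label using the paper's thresholds:
    greater than 0 -> Positive, less than 0 -> Negative, exactly 0 -> Neutral."""
    if polarity > 0:
        return "Positive"
    if polarity < 0:
        return "Negative"
    return "Neutral"

# In practice each score would come from TextBlob or VADER; these are
# illustrative stand-ins for scored tweets.
scores = [0.8, -0.4, 0.0, -0.1]
labels = [classify_by_polarity(s) for s in scores]
distribution = Counter(labels)   # this tally is what a pie chart would plot
```

The same tally, applied to all cleaned tweets, yields the sentiment distribution reported later in the paper.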
Opinion mining and sentiment evaluation form a quickly developing subject with
diverse global applications. This paper discusses a method, shown in Fig. 3,
wherein a public stream of tweets from the Twitter microblogging website is
preprocessed and labelled based on emotional content as positive, negative or neutral,
and the overall performance of various classification algorithms is analyzed primarily
based on their precision. This research is based on the tweets' textual content (i.e. it
involves only the Machine Learning approach).
Machine Learning classifiers are algorithms that automate the process of categorizing
data into one or more classes; classifiers are the rules used by machines to classify
data. In simple words, these classifiers ease the automation of categorization.
Machine Learning classifiers are of two types, supervised and unsupervised, and
within these types there are five sub-components. Naïve Bayes, Decision Tree,
Support Vector Machine (SVM) and Artificial Neural Network (ANN) classifiers
have been used in the proposed research.
Fig. 3 Systematic procedural flowchart
4.1 The Naïve Bayes Classifier
The Naïve Bayes classifier [12] is one of the major components of Machine Learning
used in Natural Language Processing. It classifies an input based on its probability;
this is the component that helps determine which tweets are classified as positive,
neutral or negative. Advantages of the Naïve Bayes classifier:
(a) It can be easily implemented and does not require a lot of training data. It is
simple to use.
(b) It can handle both types of data—continuous and discrete.
(c) There is a high degree of scalability when it comes to the number of predictors
and data points.
(d) Predictions can be made in real-time thanks to its speed [13].
The classification of an unknown instance assigns the most probable target
value, $v_{MAP}$, based on the attribute set $(a_1, a_2, \ldots, a_n)$:

$$v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n), \qquad (1)$$

where $P$ is the conditional probability on the attribute set described above. Using
Bayes' theorem, Eq. (1) can be rewritten as

$$v_{MAP} = \arg\max_{v_j \in V} \frac{P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)}{P(a_1, a_2, \ldots, a_n)} \qquad (2)$$

$$= \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j). \qquad (3)$$
The terms in Eq. (3) can be estimated from the training data. The $P(v_j)$ values are
calculated by counting the frequency with which each target value $v_j$ appears in
the training data. Estimating the individual $P(a_1, a_2, \ldots, a_n \mid v_j)$ terms in this
manner, however, is infeasible unless a very large training dataset is available.

Substituting the conditional-independence assumption into Eq. (3), the Naïve Bayes
classifier is:

$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j). \qquad (4)$$

$v_{NB}$ determines the target class of input vectors in the Naïve Bayes classifier.
The individual $P(a_i \mid v_j)$ terms are calculated from the training dataset; calculating
these values is much simpler than estimating $P(a_1, a_2, \ldots, a_n \mid v_j)$ directly.
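Equation (4) can be turned into a small classifier directly. The sketch below estimates $P(v_j)$ and $P(a_i \mid v_j)$ from token counts over a toy corpus; Laplace smoothing is added so unseen words do not zero out the product, a common choice the paper does not specify:

```python
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Collect the counts needed to estimate P(v_j) and P(a_i | v_j)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)   # class -> word -> count
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def classify(tokens, class_counts, word_counts, vocab):
    """Return arg max_v P(v) * prod_i P(a_i | v), as in Eq. (4),
    with add-one (Laplace) smoothing on the word likelihoods."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, n_docs in class_counts.items():
        score = n_docs / total_docs                          # P(v_j)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokens:
            score *= (word_counts[label][tok] + 1) / denom   # P(a_i | v_j)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy corpus: tokenized "tweets" with hand-assigned labels.
docs = [["no", "jobs", "crisis"], ["new", "jobs", "growth"], ["crisis", "layoffs"]]
labels = ["Negative", "Positive", "Negative"]
model = train_naive_bayes(docs, labels)
```

With this model, a tweet containing "jobs growth" is scored higher under the Positive class, while "crisis" falls under Negative.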
4.2 Decision Tree
A decision tree classification algorithm splits data into increasingly specific
categories, and the classification process can be represented graphically as tree
branches [14].

The most challenging aspect of decision trees is identifying the root node's
attribute. This process is called attribute selection.

A decision tree can be formed from a given set of attributes, and many similar
decision trees can be derived from the same set. It is practically impossible to
construct an optimal decision tree due to computational constraints, owing to the
number of branches present at each internal node and at the root node. Efficient
but suboptimal algorithms are available to build a decision tree; these algorithms
grow the tree using some kind of greedy strategy. The optimal split of an attribute
induces a partition of the test attributes, and each test attribute is put into the
appropriate branch based on class impurity values. The class impurities of the
branches are computed and summed, and the result is assigned to the given partition.
A contingency matrix of order $P \times K$ ($P$ is the size of the input vector and $K$ is the
class size) is computed at the start of the construction of the decision tree and is used
to compute the impurity measure at each partition. The number of distinct partitions
of a set with $P$ elements is an exponential function of $P$; there are $2^{P-1} - 1$ possible
two-way partitions.
A few approaches are available to measure the impurity. Two popular such
approaches are:

Gini index, used in Classification and Regression Trees (CART) [15]

Suppose $n = (n_1, \ldots, n_k)$ is a vector of non-negative real numbers, one count
for each class, and let $N = \sum_i n_i$ be the size of the input vector. The Gini diversity
index is defined by

$$g(n) = 1 - \sum_i \frac{n_i^2}{N^2}. \qquad (5)$$

The frequency-weighted Gini diversity index is given by

$$G(n) = N g(n) = \sum_{i \neq j} \frac{n_i n_j}{N}. \qquad (6)$$

Entropy, used in C4.5 (developed by Ross Quinlan) [16]

The entropy used in C4.5 is defined by

$$h(n) = -\sum_i \frac{n_i}{N} \log \frac{n_i}{N}. \qquad (7)$$

The weighted entropy is given by

$$H(n) = N h(n) = N \log N - \sum_i n_i \log n_i. \qquad (8)$$

The entropy measure, the Gini index and its frequency-weighted diversity index
are used in the experiments. No specific tree-pruning algorithms are used.
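Equations (5)–(8) are simple enough to check numerically. A minimal sketch, assuming natural logarithms (the base is not fixed in the text); a pure node has zero impurity and an evenly split node is maximally impure:

```python
import math

def gini(counts):
    """Gini diversity index g(n) = 1 - sum_i n_i^2 / N^2, Eq. (5)."""
    N = sum(counts)
    return 1.0 - sum(c * c for c in counts) / (N * N)

def weighted_gini(counts):
    """Frequency-weighted Gini index G(n) = N * g(n), Eq. (6)."""
    return sum(counts) * gini(counts)

def entropy(counts):
    """Entropy h(n) = -sum_i (n_i/N) log(n_i/N), Eq. (7); 0*log(0) = 0."""
    N = sum(counts)
    return -sum((c / N) * math.log(c / N) for c in counts if c > 0)

def weighted_entropy(counts):
    """Weighted entropy H(n) = N * h(n), Eq. (8)."""
    return sum(counts) * entropy(counts)
```

For the class counts [5, 5], for instance, g = 0.5 and h = log 2, the maximum impurity for two classes.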
4.3 Artificial Neural Networks
Artificial neural networks (ANNs) [17, 18] are designed to work like the actual
neural structure of the human brain. Humans can easily recognize objects simply by
seeing them, thanks to the simultaneous processing of the objects' properties by the
neural networks in the brain. A typical neuron in the human brain receives and
processes signals from other neurons through its dendrites and, as its axon splits
into several branches, sends electrical signals onwards to other neurons. At the
terminal point of each branch, a synapse converts the activity arriving from the axon
into electrical signals. The generated signal is either forwarded to other neurons or
stops. When the excitatory input is sufficiently strong, the neuron sends an electrical
spike through its axon to the next neuron. Learning occurs when the values of the
synapses change and the effect of one neuron on another propagates.

Fig. 4 Example of an artificial neural network architecture
ANNs are derived from the neural structure of the human brain. An ANN processes
records individually and learns by comparing its classification of each record with
the known, genuine classification of that record. The error from the initial
classification of the first record is fed back into the network and used to alter the
network's algorithm over several subsequent iterations. The process stops when the
error is within an acceptable range.
An ANN, as shown in Fig. 4, consists of the following:
1. A set of input neurons with values $x_i$ and weights $w_i$.
2. An activation function that aggregates the weighted inputs and forwards the
value to an output.

Neurons are used at three different layers: input, hidden and output. The neurons
in the input layer take their values from the records fed to the ANN and provide
signals to the next layer. Between the input and output layers lie one or more hidden
layers; the number of hidden layers may vary according to the application. The
hidden layers carry signals from the input layer to the output layer via connection
weights. The output layer provides the class label of the input vector; its output is
known as the target for the given input values. In an ANN, an input object whose
feature values have been extracted can thus be predicted or recognized through the
target value produced by the network.
Mathematically, the activation (or target $O_j$) is calculated as

$$O_j = \varphi\left(\sum_{i=1}^{n} w_{ij} x_i - \theta_j\right). \qquad (9)$$
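Equation (9) describes a single neuron: a weighted sum of the inputs, shifted by the threshold, passed through an activation function. A sketch using the sigmoid (which the paper's hidden layers use) as φ; the weights, inputs and threshold below are illustrative values, not taken from the paper:

```python
import math

def neuron_output(x, w, theta, phi=lambda z: 1.0 / (1.0 + math.exp(-z))):
    """Activation of one neuron per Eq. (9): O = phi(sum_i w_i * x_i - theta)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) - theta
    return phi(z)

# Illustrative inputs: two features, two weights, one threshold.
out = neuron_output(x=[1.0, 0.5], w=[0.4, -0.2], theta=0.3)
```

With these values the weighted sum minus the threshold is exactly 0, so the sigmoid yields 0.5.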
4.4 Training an ANN
In an ANN, the target classes are known prior to the experiments; this set of inputs
is known as the training data. For training, a set of inputs and their target classes
are provided to the neural network. The network then adjusts the parameters of its
hidden neurons to map each input accurately to its corresponding target class. For
testing, a set of unknown inputs is given to the network, which recognizes each
input and produces its actual target class based on the training; this dataset is known
as the testing dataset. The better the training process, the more accurate the
recognition results, and hence the lower the error rates produced by the network.
The training phase of an ANN is also capable of handling noisy datasets.
4.5 The Neural Network as a Classifier
Classification problems can be solved with the help of artificial neural networks
using a feedforward network and sigmoid output neurons. Depending on the size of
the input feature vectors, a range of active neurons can be used in the hidden layers.
The ANN here has three output neurons, as there are three target values (positive,
negative and neutral) associated with the input dataset. Pattern recognition networks
require training for effective classification of input vectors into their target classes.
The input data is further divided into training, testing and validation sets. The
training data is used to build the network, i.e. to fix the values of the connection
weights and biases. The validation phase is used to terminate the training process
and thereby avoid overfitting. To measure the actual performance of the trained
network, the testing data is fed to it; one should avoid using the testing data in either
the training phase or the validation phase.
The following parameters are used in the neural network classifier for the experiments:
1. The standard network is a two-layer feedforward network.
2. A sigmoid transfer function is used in the hidden layer.
3. At the output layer, the Softmax transfer function is used.
4. The number of hidden neurons can be taken as an arbitrary value, but it largely
depends on the input size.
5. Class confusion matrices for the training, validation, testing and combined data
are used in the result analysis.
In a multilayer feedforward network, data and computations flow in the forward
direction only, from the input units to the output units. A basic neural network has
one input layer and one output layer; this is called a one-layer feedforward network.
The numbers of input and output neurons depend entirely on the application. When
one extra layer of neurons is inserted between the input and output layers, the
corresponding network is known as a two-layer feedforward neural network. This
kind of neural network classifier is used in this research.
A sigmoid function acts as the activation function in the hidden layer of the
neural network classifier. The curve of the sigmoid function has an 'S' shape, and the
function has the following structure:

$$\mathrm{sig}(x) = \frac{1}{1 + e^{-x}}. \qquad (10)$$

The activation function at the output layer of the neural network classifier is a
Softmax function, whose structure is given as:

$$\mathrm{softmax}(x_j) = \frac{e^{x_j}}{\sum_{i=1}^{n} e^{x_i}}. \qquad (11)$$

The range of these functions is (0, 1). At each layer, the calculated value of the
function is compared with a threshold; if the calculated value exceeds the threshold,
the neuron transmits an electrical spike to the next neuron through its axon,
otherwise it provides no information to the connected neurons in the next layer.
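Equations (10) and (11) can be sketched directly. Subtracting the maximum inside the Softmax is a standard numerical-stability trick not mentioned in the text; it leaves the result unchanged:

```python
import math

def sig(x):
    """Sigmoid of Eq. (10): 1 / (1 + e^(-x)); output lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs):
    """Softmax of Eq. (11): e^(x_j) / sum_i e^(x_i), over all outputs."""
    m = max(xs)                             # stability shift; cancels out
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Three output neurons, as in the paper (positive / negative / neutral);
# the raw scores are illustrative.
probs = softmax([2.0, 1.0, 0.1])
```

The Softmax outputs sum to 1, so they can be read as class probabilities over the three sentiment labels.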
4.6 Support Vector Machines
The SVM classifies n-dimensional space into classes using a line or decision
boundary, making it easier for new data points to be placed in the correct
category. The most suitable decision boundary is termed a hyperplane [19].
A hyperplane is created by choosing extreme points/vectors; these extreme cases
are called Support Vectors and are handled with the algorithm known as a Support
Vector Machine. The hinge loss used for training is

$$c(x, y, f(x)) = \begin{cases} 0, & \text{if } y f(x) \geq 1 \\ 1 - y f(x), & \text{otherwise.} \end{cases} \qquad (12)$$

In sentiment analysis, however, the Naïve Bayes rule and convolutional neural
networks are used as major components of Machine Learning classifiers [20].
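The piecewise cost in Eq. (12) is the standard SVM hinge loss: zero when a point sits on the correct side of the margin, and growing linearly with the margin violation otherwise. A minimal sketch:

```python
def hinge_loss(y, fx):
    """Hinge loss of Eq. (12): 0 if y*f(x) >= 1, else 1 - y*f(x).

    y is the true label (+1 or -1) and fx is the classifier's raw score f(x).
    """
    margin = y * fx
    return 0.0 if margin >= 1 else 1.0 - margin
```

A confidently correct prediction (margin at least 1) costs nothing; a misclassified point is penalized in proportion to how far it lies on the wrong side.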
The NLTK modules of Python (the Natural Language Toolkit) are used in tasks that
specifically require Natural Language Processing. The Natural Language Toolkit
provides ready-to-use computational linguistics courseware in the form of tutorials
and problem sets, and as a Natural Language Processing library it interfaces with
annotated corpora and handles symbolic and statistical Natural Language
Processing [21]. These modules are used to analyze the text: NLTK modules have
been used to tokenize the data and to put the text through lemmatization,
classification and stemming [22].
TextBlob, which builds on the NLTK module of Python, provides access to
lexicon-based approaches; processing textual data with TextBlob is easy in Python.
Natural Language Processing can be performed through its several APIs, including
part-of-speech tagging, noun-phrase extraction, sentiment analysis, translation and
classification tasks [23]. TextBlob is used for sorting data based on subjectivity and
polarity. Topics or domains influence sentiment polarities, and there may be
variations in sentiment polarity between domains even when the same word is
used [24]; this is the reason why the polarity of tweets of a general kind is considered.
The criteria used are: polarity greater than 0 for positive, 0 for neutral and less than 0
for negative sentiments. Provisions for emoticons and exclamation marks are not
used in the experiments. VADER, which is based on lexicons and rules, is
specifically designed to analyze sentiments expressed on social media [25].
Firstly, the tweets are obtained in CSV file format using Twitter's API and Tweepy.
The text extraction method utilized is the Bag-of-Words (BOW) method [26]. The
collection of individual words is known as a 'collection of unigrams', and all the
unigrams are independent: the presence of one unigram in the text has no effect on
the presence of any other unigram.
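The unigram Bag-of-Words representation amounts to counting each word independently of every other, which a `Counter` captures directly (the sample text is illustrative):

```python
from collections import Counter

def bag_of_words(tokens):
    """Unigram Bag-of-Words: each word is counted independently of the others;
    word order and co-occurrence are discarded."""
    return Counter(tokens)

bow = bag_of_words("no jobs no growth".split())
```

The resulting counts are exactly what a vectorizer such as scikit-learn's CountVectorizer (used later in the paper) produces per document.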
Then Python's re.findall command is used to clean the tweets. Re.findall returns a
list of strings or tuples containing all non-overlapping matches of a pattern in a
string, and it has been used here to remove patterns from the input text. The names
of users are removed using NumPy [27], a Python library for working with arrays.
The cleaned tweets were stored in another column; about 70 uncleaned tweets were
lost in this process. NLTK modules are then used to tokenize the cleaned tweets,
and stemming is performed on the data using PorterStemmer. The VADER module
is used to analyze sentiments [28]; this process is based on analyzing the most
commonly used keywords and hashtags, and a wordcloud is also created. Finally,
TextBlob is used to set the polarity of the tweets accordingly and perform the
analysis on the remaining tweets.
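The cleaning steps can be sketched with Python's re module. The exact patterns the authors removed are not given in the paper, so the regular expressions below (user handles, links, punctuation) are assumptions covering the kinds of sentiment-less tokens described later:

```python
import re

def clean_tweet(text):
    """Strip user handles, links and non-letter characters from a raw tweet."""
    text = re.sub(r"@\w+", "", text)          # remove @user mentions
    text = re.sub(r"https?://\S+", "", text)  # remove links
    text = re.sub(r"[^A-Za-z# ]", "", text)   # keep letters, hashtags, spaces
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace

def extract_hashtags(text):
    """Collect hashtags with re.findall, as done for the frequency bar graph."""
    return re.findall(r"#(\w+)", text)

tweet = "@user No jobs anywhere! #Unemployment #India https://t.co/x"
cleaned = clean_tweet(tweet)
tags = extract_hashtags(tweet)
```

The cleaned text would then go into the 'Clean_Tweet' column, and the hashtag lists would be tallied for the bar graph.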
5 Data Extraction
Sentiment analysis is performed on people's views regarding the growing
unemployment in India. The data has been prepared in CSV files using Tweepy and
Python. A total of 11,000 tweets from between 15 October 2021 and 15 September
2022 have been extracted using Tweepy, which uses the Twitter API to fetch tweets
on a specific topic within a certain time interval. Third-party apps can be integrated
with the Twitter APIs, and tweets may also be obtained from a particular user. The
data extracted is generally in the form of text, though images, audio and video files
can be obtained using the 'extended_entities' object mode. Only text data is
considered here, since other data types are not used in the analysis. The consumer
key, consumer secret, access key and access secret obtained after creating an account
on the Twitter Developer platform are used, and the Twitter developer account is
authorized in Python. A function was also defined to create a filter to obtain tweets
with the hashtags '#Unemployment' and '#UnemploymentInIndia'. The full text of
the tweet, the time of the tweet and the username of the person who tweeted are also
saved. This is accomplished using the tweepy.Cursor functionality, which returns a
list of tweets that can be iterated over.

Fig. 5 Average unemployment rate in India (2017–2021)
One can observe a huge spike (as shown in Fig. 5) in unemployment rates during
the lockdown, which is why no tweets from that time interval were obtained: as
COVID-19 cases were at their peak, mostly negative tweets would be available from
that period. This is the reason why data from 2020 has not been considered.
The process of sentiment analysis is based upon NLP, one of the major components
of Machine Learning that enables a computer system to understand all forms of
complex text. NLP is generally used to process text data and perform analysis on it;
it is a branch of AI that understands texts and derives meaning from them. Complex
tasks like text summarization, query solving and sentiment analysis can be executed
using NLP, whose functioning is often based on Recurrent Neural Networks (RNNs).
Machine Learning is used for labelling the data so as to obtain equitable and
serviceable output. ML algorithms can be used in sentiment analysis to check
whether a text is positive or negative based on its polarity: without human
intervention, machines learn to detect sentiment automatically by being trained with
examples of emotions in text. A number of complex methodologies and algorithms,
such as the Naïve Bayes rule, Linear Regression and Support Vector Machines,
relate Machine Learning and opinion mining.
6 Data Cleaning and Preprocessing
From the 11,000 tweets that were obtained, the data was cleaned by removing
unnecessary details like the username, the time of the tweet, etc. Several
sentiment-less items appear in the data, including links, Twitter-specific tokens such
as hashtags and tags, single-letter words and numbers. Many tweets were also
rejected and removed during this process, after which about 10,928 tweets remain
available for the experiments. Abstraction was used to obtain the most used hashtags,
along with the searched ones, to prepare a histogram. The re.findall function is used
to return all non-overlapping matches of a pattern in a string. Impurities and
punctuation were removed from the data in the first phase, and the data was
vectorized using CountVectorizer. The same CSV file obtained from the tweets is
appended with a new column named 'Clean_Tweet', where the tweets obtained after
cleaning are stored. After that, stemming is applied to the data using nltk.stem.porter.
Once the data is cleaned, wordclouds (as shown in Fig. 6) are created from the
cleaned tweets.

Fig. 6 Wordcloud of cleaned tweets (includes frequently used words)
7 Results
The most frequently used hashtags, obtained using re.findall, are depicted in Table 2;
a dictionary of the words was defined to count them. The counts are used to create a
bar graph of the hashtags that were repeatedly used along with the target hashtags.
The counts of the most frequently used hashtags are given in Table 2; the underscores
have been removed from the hashtags for easier reading.

The bar graph for the extracted data (Table 2), obtained using re.findall, is shown
in Fig. 7. Finally, a pie chart (Fig. 8, Table 3) of the positive, negative and neutral
tweets has been prepared, along with the number of the respective tweets and the
polarity chosen for them.

Calculating the individual count of each sentiment among the 10,928 tweets, about
7803 tweets contain negative content and 2983 tweets are positive, whereas only
about 1.3% of the tweets are written neutrally; the count of these is only about 142.
Table 2 Hashtags and their count
Frequently used hashtags   Count
Bharatjodoyatra 1639
Unemploy 2009
India 993
Bharatjodobegin 1309
Poverty 962
Agriculture 936
Industries 152
Fresher 803
Economy 528
Insurance 194
Covid 305
Fig. 7 Frequently used hashtags and their count

Fig. 8 Distribution of sentiments among the 10,928 tweets: Negative 71.4%, Positive 27.3%, Neutral 1.3%
Table 3 Polarity and sentiment count
Sentiment   Number of tweets   Polarity
Positive 2983 Greater than 0
Negative 7803 Less than 0
Neutral 142 0
8 Discussion and Analysis
After obtaining all the required data, it is observed that most Twitter-active social
media users are upset about the current unemployment situation in India: about 70%
of the time, people are tweeting negatively about it.

Predictions were made over the 11,000 tweets using this combination of techniques,
and it was found that 27.3% of people are positive about the current unemployment
situation in India, 1.3% are neutral, and 71.4% feel negative due to various valid or
invalid reasons.
As expected, most people criticize unemployment, which is in line with the
expectations from the experiments. For uncertain reasons, some people talk
positively about the current unemployment situation; from the bar graph prepared
from the dataset, one can surmise the reasons. Following the recent launch of the
'Bharat Jodo Yatra' campaign by the Indian National Congress (INC), its leaders
claim that 'Without harmony, there is no progress; without progress, there are no
jobs; and without jobs, there is no future'. This movement aims to unite India through
the political party's long road march and to address many social problems, including
the 'ticking time bomb of unemployment'. This campaign might be the reason
people's hopes are high; they might be expecting that unemployment rates will be
affected by this movement as well. Areas where unemployment is most prevalent are
also mentioned in the tweets, and references to poverty, the economy and some
traces of COVID-19 are present as well.
COVID-19 shattered the world economy, and countries saw an unprecedented
downfall in their growth and development. The highest unemployment rate India
has ever witnessed was during the lockdown: many businesses were shut down, and
there was no way a daily-wage labourer could work from home. Moreover, the
pandemic worsened the health situation in every country and many people lost their
jobs, with each wave inducing financial losses for India. However, Indians bravely
coped with the pandemic and bounced back; employment rates rose a little and
people began working again. This explains the presence of 'Covid' in this research.
Unemployment draws a negative response from citizens, though this may be less
relevant today, as technology has also led to employment generation. With the help
of AI, it will be possible to predict the unemployment situation a few years
ahead [29].
The INC Youth Wing has also launched the 'Rozgar Do' (Provide Employment)
campaign to battle unemployment. This campaign might become the next most
tweeted hashtag once the 'Bharat Jodo Yatra' campaign loses momentum. To
highlight the country's unemployment issue, the campaign's theme is 'Give a job or
take back your degree'. The campaign addresses the Prime Minister and the
government, demanding action on the current unemployment rates; its main demand
is that the government ensure provisions for degree-holders to have a job.
A military recruitment plan called 'Agnipath' was also launched in India a few
months ago. Last year it sparked violent protests, bringing to light the unemployment
crisis plaguing India's $3.2 trillion economy, as well as Prime Minister Narendra
Modi's campaign promises.

People in Goa have been asked not to vote for parties that have not provided
jobs; this demand is part of a campaign launched by a political party (the Aam Aadmi
Party). It is possible that all these factors combined are being discussed, triggering
a mixed reaction among people, since people support the actions only of the parties
they voted for in the most recent elections.
Only a small proportion of people have neutral opinions about unemployment.
They might be tweeting general trends and topics associated with unemployment.
9 Future Suggestions
One cannot overlook the importance of Natural Language Processing when analyzing
sentiments from written text. The accuracy and performance of sentiment analysis
are directly proportional to the granularity of the dataset. Natural language involves
many irregularities, subjectivity and diversity. Emotions like sarcasm are not easy to
detect; it is difficult to know which category they fit in: positive, negative or neutral. This study has the
following major limitations. Tweets that were trending during a particular period
are under experimentation. Phase changes can result in changes in the surrounding
environment, and one might experience a change in the distribution of the sentiments
of tweets. As conditions improve, people might not even post about unemployment
anymore. Also, emoticons and tweet hashtags, which could reveal a lot about the
sentiments of the tweets, are not considered. Had the emoticon data been included
as part of the tweet data, the classifiers’ efficiency would have been hampered.
Combining these two factors, sarcasm and emoticons, people could have deliberately
posted wrong emoticons to signal ‘sarcasm’ in the ideas they want to convey to
the audience.
This research provides an accurate, timely and comprehensive overview of the needs,
attitudes and motivations of the unemployed. It can help research organizations and
institutions study people’s general opinions on unemployment, learn how political
parties affect the way people think about a particular topic, and observe the general
trend of the word ‘unemployment’. Factors affecting unemployment should be studied
before undertaking further research on this topic. Studying the impact of fake news
on the public plays an important role in assisting administrations and policymakers
in managing it. People’s general mental condition should be considered as well. For
example, at the time of lockdown in India, people were stressed and gloomy and
reacted with strong aggression over trivial matters. The conditions of a nation at
the time the research is conducted should also be considered. A large corpus can be
used to improve the accuracy of the model in future studies.
674 R. Tiwari et al.
10 Conclusion
Thousands of people use social media every day, and the number is growing every
day. In place of speaking with someone in person, people prefer to write about their
honest opinions on social media. The analysis of the common public’s reaction to
unemployment in India was based on the posts from Twitter. The collected data
after annotation and preprocessing have been applied to several Machine Learning
techniques. Almost 70% of the population feels negatively about the unemployment
rates in India, about 27% has been talking positively about unemployment in India,
and only a handful of people (1%) feel neutral about it. Much
of these tweets are subjective to the various changes in policies and execution of
campaigns and movements by political parties.
References
1. https://www.cmie.com/kommon/bin/sr.php?kall=warticle&dt=20220829141802&msec=860.
Accessed on 28 Dec 2022
2. Pratibha GK, Kaur A, Khurana M (2022) A stem to stern sentiment analysis emotion detection.
In: 2022 10th international conference on reliability, infocom technologies and optimization
(trends and future directions) (ICRITO), Noida, India, pp 1–5. https://doi.org/10.1109/ICRITO
56286.2022.9964967
3. Tiwari RG, Misra A, Ujjwal N (2022) Comparative classification performance evaluation of
various deep learning techniques for sentiment analysis. In: 2022 8th international conference
on signal processing and communication (ICSC), Noida, India, pp 304–309. https://doi.org/
10.1109/ICSC56524.2022.10009471
4. Kaushik P (2020) Research report on Indian Unemployment scenario and its analysis of causes,
trends and solutions. A project study submitted in partial fulfilment for the requirement of the
two year (full-time) post-graduate diploma in management (2018–20)
5. Sinha P (2022) Combating youth unemployment in India, Academia. https://www.academia.
edu/26001773/Combating_Youth_Unemployment_in_india. Accessed on 22 July 2022
6. Naraparaju K (2017) Unemployment spells in India: patterns, trends, and covariates. Indian J
Labour Econ 60(4):625–646
7. Dev M, Motkuri V (2011) Youth employment and unemployment in India
8. Gupta P, Kumar S, Suman RR, Kumar V (2020) Sentiment analysis of lockdown in India during
COVID-19: a case study on Twitter. IEEE Trans Comput Soc Syst 8(4):992–1002
9. Gautam G, Yadav D (2014) Sentiment analysis of twitter data using machine learning
approaches and semantic analysis. In: 2014 seventh international conference on contemporary
computing (IC3), pp 437–442
10. Agarwal A, Xie B, Vovsha I, Rambow O, Passonneau RJ (2011) Sentiment analysis of Twitter
data. In: Proceedings of the workshop on language in social media (LSM 2011), pp 30–38
11. Desai M, Mehta MA (2016) Techniques for sentiment analysis of Twitter data: a comprehen-
sive survey. In: 2016 international conference on computing, communication and automation
(ICCCA), pp 149–154
12. Balaji VR, Suganthi ST, Rajadevi R, Kumar VK, Balaji BS, Pandiyan S (2020) Skin disease
detection and segmentation using dynamic graph cut algorithm and classification through Naive
Bayes classifier. Measurement 163:107922
13. Calders T, Verwer S (2010) Three Naive Bayes approaches for discrimination-free classifica-
tion. J Data Mining Knowl Discov 21(2):277–292
14. Das A, Das P, Panda SS, Sabut S (2019) Detection of liver cancer using modified fuzzy clustering
and decision tree classifier in CT images. Pattern Recogn Image Anal 29:201–211
15. Liu Q, Wang X, Huang X, Yin X (2020) Prediction model of rock mass class using classification
and regression tree integrated AdaBoost algorithm based on TBM driving data. Tunn Undergr
Space Technol 106:103595
16. Muthamil Sudar K, Deepalakshmi P (2020) A two level security mechanism to detect a DDoS
flooding attack in software-defined networks using entropy-based and C4.5 technique. J High
Speed Netw 26(1):55–76
17. Asteris PG, Mokos VG (2020) Concrete compressive strength using artificial neural networks.
Neural Comput Appl 32(15):11807–11826
18. Hasson U, Nastase SA, Goldstein A (2020) Direct fit to nature: an evolutionary perspective on
biological and artificial neural networks. Neuron 105(3):416–434
19. Okwuashi O, Ndehedehe CE (2020) Deep support vector machine for hyperspectral image
classification. Pattern Recogn 103:107298
20. Singh J, Singh G, Singh R (2017) Optimization of sentiment analysis using machine learning
classifiers. HCIS 7(1):1–12
21. Loper E, Bird S (2002) NLTK: the natural language toolkit. CoRR.cs.CL/0205028. https://doi.
org/10.3115/1118108.1118117
22. Yao J (2019) Automated sentiment analysis of text data with NLTK. J Phys Conf Ser 1187(5)
23. Loria S (2018) Textblob documentation. Release 0.15 2.8
24. Li F, Huang M, Zhu X (2010) Sentiment analysis with global topics and local dependency. In:
Twenty-fourth AAAI conference on artificial intelligence
25. Gupta P, Kumar S, Suman RR, Kumar V (2021) Sentiment analysis of lockdown in India during
COVID-19: a case study on Twitter. IEEE Trans Comput Soc Syst 8(4):992–1002
26. Kolchyna O, Souza TTP, Treleaven PC, Aste T (2015) Twitter sentiment analysis: Lexicon
method, machine learning method and their combination. arXiv preprint arXiv:1507.00955
27. Agarwal A, Xie B, Vovsha I, Rambow O, Passonneau RJ (2011) Sentiment analysis of Twitter
data. In: Proceedings of the workshop on language in social media (LSM 2011)
28. Hutto C, Gilbert E (2014) Vader: a parsimonious rule-based model for sentiment analysis of
social media text. In: Proceedings of the international AAAI conference on web and social
media, vol 8, no 1
29. Barkur G, Vibha, Kamath GB (2020) Sentiment analysis of nationwide lockdown due to COVID
19 outbreak: evidence from India. Asian J Psychiatry 51
Customer Churn in Telecom Sector:
Analyzing the Effectiveness of Machine
Learning Techniques
Vaibhav Sharma, Lekha Rani, Ashok Kumar Sahoo,
and Pradeepta Kumar Sarangi
Abstract The number of customers that stopped using a company’s product or
service during a particular time is known as customer churn. Businesses may prevent
churn by taking preventative measures when they can anticipate it before it occurs.
Specifically in the telecommunication sector due to various providers, there is great
competition. To compete in the market, telecom firms provide all basic services,
easy access to the Internet, quality phone service, etc., to all mobile users, and still
it is a challenge to retain clients. Therefore, it is an important task to understand
the customer needs of all age groups. So, with a proper prediction of customer churn,
companies can reduce the churn rate by taking immediate action.
In this study, the authors present exploratory data analysis (EDA) across the
different parameters which could affect churn. Going further, the data has been
divided for training, which is 80% of the whole, while the remaining 20% is kept as
the test data. By comparing various machine learning (ML) models such as SVM,
KNN, XGBoost, decision tree, and random forest, the best-performing model for the
dataset is identified as the RF model. The best precision obtained is with RF, with
an accuracy of 82%, and the least precise was from KNN with an accuracy of 76%.
The study will provide a detailed view of the problem of customer churn and how it
can be controlled.
Keywords Customer ·Churn ·EDA ·SVM ·KNN ·XGBoost ·Decision tree ·
Random forest
V. Sharma · L. Rani · P. K. Sarangi (B)
Institute of Engineering and Technology, Chitkara University, Punjab, India
e-mail: Pradeepta.sarangi@chitkara.edu.in
V. Sharma
e-mail: Vaibhav1467.cse19@chitkara.edu.in
L. Rani
e-mail: lekha@chitkara.edu.in
A. K. Sahoo
Graphic Era Hill University, Dehradun, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1_50
677
678 V. Sharma et al.
1 Introduction
Client turnover, sometimes referred to as customer attrition, is the term used to
describe a company’s loss of clients over an extended time. It is a typical issue
that companies of all sizes and sectors deal with, and it may have a big effect on
their bottom line. Churn can happen for several reasons, including dissatisfaction with
a product or service, a better offer from a competitor, or simply a change in the
customer’s circumstances [1].
For telecom companies, which must actively compete for consumers in a market
that is becoming more crowded and competitive, customer turnover is a huge concern.
The telecom industry must be careful in recognizing and resolving the variables that
lead to customer churn. Lack of new services or features, exorbitant prices, and
poor service quality are just a few of the causes of this. Sometimes a more alluring
promotional offer or a superior pricing strategy will cause clients to migrate to a
rival.
Therefore, customer churn prediction is crucial, as it enables organizations to identify
customers who are at risk and take proactive measures to retain them. It is also
essential for reducing customer acquisition costs, improving customer satisfaction,
and maintaining or increasing market share [2]. In general, by predicting client
attrition, organizations can make data-driven choices that significantly affect their
bottom line.
ML is an important tool for the prediction of customer churn as it allows firms
to analyze huge amounts of data and identify patterns and trends [3]. ML can iden-
tify factors that are strongly linked to customer churn using various algorithms and
statistical techniques. ML models can be trained on customer data to precisely predict
users at risk of leaving in the future. With these predictions, specific retention tactics
may be created such as providing tailored incentives or enhancing customer service
for vulnerable clients. Moreover, by evaluating the success of retention measures
and modifying the models accordingly, ML may help firms constantly improve their
churn prediction models.
To lower the churn rate in any firm, this research study offers analytical approaches
to forecast customer turnover rates, identifying factors that lead to churn. The paper
will contribute a solution to the existing problem of customer churn. Based on the
dataset, EDA is done to determine which factors affect the churn rate. Then the
data is split into training and test sets to apply different ML algorithms, namely SVM,
KNN, XGBoost, decision tree, and RF. The best accuracy was seen in random forest,
while the lowest accuracy was in XGBoost and KNN. The best precision obtained
is with RF, with an accuracy of 82%, and the least precise was from KNN with an
accuracy of 76%.
Customer Churn in Telecom Sector: Analyzing the Effectiveness 679
1.1 Novelty
By implementing various methods and techniques on the dataset and by performing
different analyses, the paper tries to provide an overview of the problem. Different
ML algorithms have been used to predict the outcome. The accuracies of the models
have also been compared.
2 Background Study
The study showed that certain demographic and usage characteristics, such as being
elderly, unmarried, and lacking relatives, are associated with higher customer churn
rates in the telecom industry. On the other hand, users with specific service combina-
tions, such as phone and fiber services with additional streaming TV and film services,
are also at risk of churning. To improve customer retention, tailored services, promo-
tional discounts, service upgrades, and contract payment discounts can be effec-
tive strategies. The proposed user churn prediction model, which uses the gradient
boosting tree algorithm, demonstrated satisfactory accuracy and can be a useful tool
for decision-makers in predicting and retaining potential customers [4]. The authors
conclude that the telecommunications industry requires a dependable approach to
analyze and predict customer churn. The research highlights that artificial neural
networks (ANN) and Gaussian Naïve Bayes are effective methods for predicting
churn. Nonetheless, the study suggests further investigation to determine the efficacy
of these methods on diverse datasets [5]. A survey was conducted on customer churn
using several ML and deep learning (DL) techniques. The study concludes
that DL techniques, particularly convolutional neural networks and stacked auto-
encoders, outperform other methods in terms of both speed and accuracy. These
findings highlight the potential of DL.
In another work, the authors have addressed the challenges of customer churn
prediction in various industries [6]. To effectively manage customer outflow in the
telecom industry, the use of data mining and data science models is essential. The
study demonstrated that models with an accuracy of over 95% in customer loyalty
classification can be constructed using these techniques. Furthermore, the study
provided valuable insights into the factors that influence customer churn behavior.
Regular monitoring and the inclusion of additional variables are necessary for the
continuous improvement of these models and the prevention of obsolescence. These
findings can be utilized to enhance marketing activities and improve the mathe-
matical methodology for consumer churn prediction in the industry [7]. The author
concludes that gradient boosting with feature selection is the most effective model
for predicting the problem of customer churn. The study shows the importance of
feature engineering and selection in improving model performance. The findings
suggest the need for alternative strategies to address churned customers and propose
using DL models for future research [8]. The utilization of ML techniques in the
churn model can aid telecommunications companies in providing attractive offers
to customers to retain them. Additionally, further improvement in accuracy can be
achieved by reducing features and implementing additional ML models [9]. The
research presented in this paper introduces a framework for churn prediction and
customer segmentation in the telecommunications industry, using a range of ML
models to achieve high accuracy. This study contributes to the existing literature
and provides valuable insights for telco operators to understand the churn behavior
of different customer clusters. Further research can explore alternative methods to
enhance the accuracy of churn prediction models [10]. Telecommunications organi-
zations face the challenge of customer churn, which can be tackled through predictive
analytics and customer retention measures. This study highlights the effectiveness
of ML models such as ANN and XGBoost in addressing this issue. Future research
can focus on further improving these methods and incorporating big data analytics
for better results [11]. In today’s world where technology drives businesses, compa-
nies face the challenge of retaining customers and predicting customer churn, and the
telecommunications industry is no exception. Churn prediction has become a subject
of interest for many researchers, and this research paper presents a comparative study
of various ML models for churn prediction in the industry. The results suggest that
ensemble learning techniques such as XGBoost and AdaBoost classifiers perform
better than other algorithms in terms of precision, accuracy, F-measure, recall, and
AUC score. However, predicting genuine customers remains a daunting task, and
companies must provide valuable services to retain customers in today’s competitive
market. Further research can be done to improve churn prediction and help companies
in the telecommunications industry retain customers [12]. To conclude, predicting
customer churn is important for reducing operational costs in the telecommunica-
tions industry. The performance of prediction models can be improved by utilizing
feature selection techniques. The proposed model can identify potential churners
and enable companies to take necessary measures to retain customers [13]. Table 1
shows a summarized representation of the background study done in this context.
3 Objective
The main objective of this study is to contrast various methodologies and approaches
for predicting client attrition. In this work, an analysis of customer churn is done
using ML to investigate and measure the efficiency of ML methods for predicting
customers who are at risk of attrition in the telecom industry. EDA has been done for
hypothesis generation, enhancing data accuracy through data scaling, and splitting
the dataset for training and test data and training models. The accuracy and precision
of these models’ performance would be assessed in the study.
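The data scaling and 80/20 train/test split described above can be sketched in Python; the following is a minimal illustration using scikit-learn on toy data (the paper does not name the exact library calls, so these are assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy feature matrix and churn labels standing in for the telecom data.
rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = rng.randint(0, 2, size=100)

# 80% training / 20% test, as described in the study.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Scale features to zero mean / unit variance; fit on training data only
# to avoid leaking test-set statistics.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)  # (80, 4) (20, 4)
```

Fitting the scaler on the training split alone mirrors the standard practice of keeping the test data unseen during preprocessing.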
Table 1 Concise representation of the literature studied

References | Techniques | Dataset used | Remarks
[4] | Spearman single-factor analysis; random forest | Customer churn dataset | Satisfactory results achieved by the gradient boosting tree algorithm
[5] | ANN, SVM, KNN, decision tree, Gaussian Naïve Bayes | Telecommunication customer churn dataset | ANN and Gaussian Naïve Bayes are effective methods for predicting churn
[6] | XGB, ANN, random forest, gradient boosting, AdaBoost, CNN, stacked auto-encoders | IBM telecom’s Kaggle dataset | DL works with speed and accuracy to give better results
[7] | Random tree, neural net, ensemble, C5.0, KNN, etc. | Telecommunication company dataset | Models were implemented with the help of the IBM SPSS Modeler; accuracy reached over 95%
[8] | Gradient boosting, logistic regression, decision tree, and random forest | American telecom company dataset | Gradient boosting with feature selection is the most effective model
[9] | KNN, logistic regression, and random forest | Telecom company dataset | The churn model gave better results using ML
[10] | Multilayer perceptron, logistic regression, decision tree, random forest, AdaBoost, Naïve Bayes | IBM Kaggle telco customer churn; Cell2Cell provided by Teradata center | An integrated customer analytics framework is proposed to seamlessly connect two components
[11] | Random forest, XGBoost, ANN, gradient boost, logistic regression, and AdaBoost | IBM Telecom’s Kaggle dataset | ANN and XGBoost outperformed other models in terms of accuracy, F1-score, recall, and precision
[12] | XGBoost, CatBoost, logistic regression, random forest, SVM, decision tree, Naïve Bayes, AdaBoost | Customer churn dataset | Ensemble learning techniques like XGBoost and AdaBoost perform better
[13] | SBFS, SFS, Naïve Bayes, SBS | Telco customer churn dataset | The study proposes applying feature selection to select features that have a positive effect on models
4 Dataset
The dataset used for the research is IBM Telecom’s Kaggle dataset. It is large
and consists of several important parameters for predictive analysis. The
telecommunication dataset consists of 7043 instances of 21 attributes.
For training the models, only the first 1000 rows of the dataset have been used.
The attributes include demographic information such as gender, age, and dependents,
along with the different types of services for which the customer has signed up,
contact information, payment methods, monthly charges, paperless billing, total
charges, and the churn attribute, which identifies the customers who have discontinued.
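As a rough sketch, restricting the data to the first 1000 rows for training can be done with Pandas; the frame below is synthetic, and its column names are only illustrative of the dataset’s 21-attribute schema:

```python
import pandas as pd

# Small synthetic frame standing in for the 7043-row, 21-column
# Kaggle dataset (column names illustrative, not the full schema).
df = pd.DataFrame({
    "gender":         ["Male", "Female"] * 1500,
    "tenure":         list(range(3000)),
    "MonthlyCharges": [29.85 + 0.01 * i for i in range(3000)],
    "Churn":          ["Yes", "No"] * 1500,
})

# The study trains only on the first 1000 rows of the dataset.
subset = df.iloc[:1000]
print(subset.shape)  # (1000, 4)
```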
5 Methodology
To begin with, there are many steps before building ML models. The different steps
are data preparation, EDA, data cleaning, feature selection, and finally building the
desired model. The data preparation step includes checking for the completeness of
the dataset, looking at the dimensions, reviewing the structure of the data, and
examining any missing client data. Python is used as it provides many robust
libraries: Pandas, which has tools for data analysis, cleansing, exploration, and
manipulation; Matplotlib, which makes easy things easy and hard things possible
by creating static, animated, and interactive visualizations; and Seaborn, which
helps in making statistical graphs. Figure 1 is a pictorial representation of the
workflow of this work.
After analysis, it is found that 20 attributes affect churn. Of these, factors like
the customer, gender, phone service, and multiple lines, which affect churn the
least, have been removed. Tenure has been removed as well; instead, the customers
have been divided into bins based on tenure, e.g., assigning a tenure group of
1–12 for tenure below 12 months. After a clear picture of the dataset is obtained,
EDA is implemented. It provides a clearer and better picture of data patterns and
potential hypotheses. Different bar graphs are built to see how churn is affected
by the different attributes present in the dataset. Figure 2 is a graphical
representation of the affirmative and negative churn counts.
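The tenure-binning step described above might look as follows in Pandas; only the 1–12 group is given in the paper, so the remaining bin edges are assumptions:

```python
import pandas as pd

df = pd.DataFrame({"tenure": [3, 11, 12, 25, 47, 60, 71]})

# Bin tenure (months) into groups; "1-12" covers tenure below 12 months
# as in the paper, while the remaining edges are illustrative.
bins = [0, 12, 24, 36, 48, 60, 72]
labels = ["1-12", "13-24", "25-36", "37-48", "49-60", "61-72"]
df["tenure_group"] = pd.cut(df["tenure"], bins=bins, labels=labels)

# The raw tenure column is dropped once the groups are assigned.
df = df.drop(columns=["tenure"])

print(df["tenure_group"].tolist())
```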
6 Implementation, Results, and Discussions
It is visible that the customers discontinuing number fewer than 300, while those
continuing have a count of more than 700. It can be inferred that the discontinuing
customers are less than half of the others. Figure 3 is a plot of monthly charges
versus total charges in the telecom sector.
Charges play an important role in the telecom sector. When choosing a good
product, the customer’s main need is to get maximum benefit at the best price, and
the best price means a lower price. So, charges play an important role in predicting
whether the customer will stay or leave. The graph shows a comparison between
the monthly charges and the total charges; the outcome shows that as the monthly
charges increase, the total charges also increase, which means that the total charge
is directly proportional to the monthly charge, while the churn rate is inversely
proportional to
Fig. 1 Workflow: data collection → pre-processing of data → exploratory data analysis (EDA) → training/testing dataset split → machine learning models → result analysis → data evaluation and visualization
Fig. 2 Churn count in terms of ‘Yes’ and ‘No’
Fig. 3 Monthly charge
versus total charge
total charges. Figures 4a and b show graphs that have been plotted to check churn
both by monthly charges and by total charges.
From Fig. 4a, it can be inferred that churn is high when the monthly charge is high,
whereas from Fig. 4b it can be seen that the lower the total charges, the higher the
churn. But the picture becomes clearer if the insights of the three characteristics are
analyzed, i.e., tenure, monthly charges, and total charges. A lower total charge
results from a higher monthly charge paid over a shorter tenure. A correlation of ‘churn’ is constructed with
all the other attributes in Fig. 5, to get a clearer picture of how all other parameters
affect the churn.
The above insight shows:
(1) A higher churn in the case of month-to-month contracts, no tech support, no
online security, fiber optics Internet, and the first year of subscription.
(2) Long-term agreements, Internet-only subscriptions, and clients with a 5-year
minimum retention rate all exhibit low churn.
Data processing is another important step in research. As many columns have yes
and no categorical values, data transformation and normalization must be conducted
to transform them into 0 and 1. Several columns contain more than two categories
that need to be converted from categorical data to numerical data.
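The transformations described above, mapping yes/no columns to 1/0 and expanding multi-category columns into numerical dummy columns, can be sketched with Pandas (the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "PaperlessBilling": ["Yes", "No", "Yes"],
    "Contract": ["Month-to-month", "One year", "Two year"],
    "Churn": ["No", "No", "Yes"],
})

# Binary Yes/No columns -> 1/0.
for col in ["PaperlessBilling", "Churn"]:
    df[col] = df[col].map({"Yes": 1, "No": 0})

# Columns with more than two categories -> one dummy column per category.
df = pd.get_dummies(df, columns=["Contract"])

print(sorted(df.columns))
```

After this step every column is numeric, which is what the models in the next section require.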
ML is a part of computer science and artificial intelligence (AI) that focuses
on collecting data and using algorithms to replicate human learning processes and
gradually boost performance. Various ML models used are:
(1) SVM—This supervised learning model is used to categorize data or forecast the
behavior of datasets. Supervised learning systems are provided with labeled
input and expected output data for classification. Figure 6 is the
confusion matrix created through the SVM model.
(2) KNN—It is a nonparametric classifier that employs proximity to classify or
anticipate grouping a single data point. KNN calculates the separations between
Fig. 4 a Monthly charges by churn; b Total charges by churn
a query and each data example, selects the specified number of examples (K)
closest to the query, and votes for the most frequent label (for classification)
or averages the labels (for regression). Figure 7 is the confusion matrix created
through the KNN model.
(3) XGBoost—It is a supervised learning method used for classification and regression.
It sequentially builds shallow decision trees to obtain reliable answers, and a
highly efficient training method limits overfitting. Figure 8
is the confusion matrix created through the XGBoost model.
Fig. 5 Parameters affecting churn
Fig. 6 SVM confusion
matrix
(4) Decision tree—It is an ML method that classifies or forecasts data based on the
answers to a series of earlier questions. A dataset with the desired classification
is used to train and test the model. Figure 9 is the confusion matrix created
through the decision tree model.
(5) Random forest—This ML algorithm is applied for supervised learning. It inte-
grates the results of several decision trees to get a single result. Given that it can
address classification and regression problems, it is widely used. Figure 10 is
the confusion matrix created through the random forest model.
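A hedged sketch of training and comparing these models with scikit-learn on synthetic stand-in data is shown below; XGBoost is omitted because it requires the separate xgboost package, and all hyperparameters are library defaults rather than the paper’s settings:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the preprocessed churn features.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(random_state=0),
}

# Fit each model on the training split and score accuracy on the test split.
accs = {}
for name, model in models.items():
    accs[name] = model.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name}: {accs[name]:.3f}")
```

On the real dataset, the same loop would report the per-model accuracies that the paper compares in Fig. 12.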
Fig. 7 KNN confusion
matrix
Fig. 8 XGBoost confusion
matrix
The confusion matrix, as depicted in Fig. 11, is used to provide a visual comparison
and evaluate the performance of all models. The confusion matrix depicts the number
of true positives (TP), true negatives (TN), false positives (FP), and false negatives
(FN) present in a prediction.
Accuracy is determined by the ratio of correctly predicted observations to all
observations. Therefore, it is concluded that the best model is the one with the highest
accuracy. The formula for calculating the accuracy of any model is depicted in Eq. (1).

Accuracy = (True Positives + True Negatives) / (True Positives + False Positives + True Negatives + False Negatives). (1)
Fig. 9 Decision tree
confusion matrix
Fig. 10 Random forest
confusion matrix
Fig. 11 Confusion matrix
Fig. 12 Comparison of accuracies
Precision is the proportion of correctly predicted positive observations to all
predicted positive observations. Precision is related to a low false positive rate.
Precision can be calculated using Eq. (2).

Precision = True Positives / (True Positives + False Positives). (2)
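Equations (1) and (2) can be verified directly from confusion-matrix counts; the counts below are illustrative and not taken from the paper’s figures:

```python
def accuracy(tp, tn, fp, fn):
    # Eq. (1): correct predictions over all observations.
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    # Eq. (2): correct positive predictions over all predicted positives.
    return tp / (tp + fp)

# Illustrative confusion-matrix counts.
tp, tn, fp, fn = 50, 114, 16, 20
print(accuracy(tp, tn, fp, fn))  # 0.82
print(precision(tp, fp))
```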
The percentage accuracies of each of the models used in the study as calculated
by Eq. (1) mentioned above can be plotted and compared as shown in Fig. 12.
From Fig. 12 above, it can be inferred that out of the five models used, KNN
has the least accuracy at 76%, whereas RF leads the rest of the models with an
accuracy of 82%. The other models, namely the SVM, decision tree, and XGBoost
models, showed accuracies of 81.5%, 81.5%, and 79.5%, respectively.
Comparison with existing work

Existing methods | Accuracy (%) | Proposed methods | Accuracy (%)
ANN | 79 | SVM | 81.5
Decision tree | 80 | Decision tree | 81.5
Naïve Bayes | 75 | XGB | 79.5
KNN | 75 | KNN | 76
Random forest | 81 | Random forest | 82
7 Limitations, Conclusion, and Future Scope
The dataset used here could be larger, since more data generally means better
accuracy and better predictions. Other machine learning models could also be used
to obtain better accuracy.
As the telecommunication industry grows, so does the problem of customer
churn. It can be controlled by proper analysis. The paper outlines the issue of churn
and the importance of preventing it. Various tasks on the dataset have been performed
such as data cleaning, data processing, and data explanation by performing EDA. The
paper proposes various analyses with the help of certain ML models. After applying
various ML models to the dataset, it can be concluded that the best accuracy was
seen in random forest, while the lowest accuracy was in XGBoost and KNN. The
best precision obtained is with RF, with an accuracy of 82%, and the least precise
was from KNN with an accuracy of 76%.
Going further, the performance can be enhanced by using different ML models
which will help in increasing the accuracy and therefore providing a better solution
for the problem of customer churn. It can also be extended by collecting data from
different sectors and implementing techniques to get a better result.
Author Index
A
Abhinandan Singla, 461
Abir El Akhdar, 329
Abu Bakar bin Abdul Hamid, 217
Abuzar Sayeed, 559
Aditi Sharma, 449
Aditya Bhardwaj, 551
Ahmed Sajjad Khan, 489
Ajay Dureja, 515
Akshi Kumar, 449
Alaa Alakailah, 531
Alexander Gelbukh, 411
Alexandros Chrysikos, 57
Ali Kartit, 329
Alyaa A. Abbas, 263
Aman Dureja, 515
Amer Hamzah bin Jantan, 217
Amit Doegar, 303
Amit Pratap Singh, 551
Amol Potgantwar, 189
Anand Singh Rajawat, 189, 203, 233
Anil Kumar, 345
Anjana Gosain, 1
Ankita Sharma, 23
Anumolu Bindu Sai, 163
Anurag Tuteja, 515
Arun Kumar Yadav, 585
Asaad N. Hashim, 273
Ashish Khanna, 287
Ashok Kumar Munnangi, 345
Ashok Kumar Sahoo, 655,677
Asya Katanani, 361
Avantika Goyal, 11
Avula Srinivasa Ajay Babu, 151
B
Badisa Bhavana, 137
Bal Virdee, 361,421
Bhagyashree, S. R., 245
C
Chafik Baidada, 329
Charu Saxena, 473
Chiranjath Sshakthi, M. A., 377
D
Dessislava Petrova-Antonova, 71
Devineni Vijaya Sri, 163
Dhachina Moorthy, T. S., 105
Dharini, A., 95
Dinesh Singh, 575
Divakar Yadav, 585
F
Fahimeh Jafari, 81
Fatima M. Khudair, 273
G
Gagandeep Kaur, 617
Gerald Manju, J., 95
Goyal, S. B., 189, 203, 217, 233
Gudimetla Abhishek, 137
Gurpreet Singh, 643
H
Harishankar Kumar, 461
© The Editor(s) (if applicable) and The Author(s), under exclusive license
to Springer Nature Singapore Pte Ltd. 2024
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management,
Lecture Notes in Networks and Systems 785,
https://doi.org/10.1007/978-981-99-6544-1
Harkiran Kaur, 461
Hrushikesh, S., 377
Humera Ghani, 421
I
Itu Snigdh, 603
Ivaylo Spasov, 71
J
Jagendra Singh, 627
Jameel Ahamed, 287
Jaspreeti Singh, 1
Jatin Sachdeva, 655
Jitender Kumar, 11
Jitendra Kumar Baroliya, 303
K
Karanam Manjusha, 163
Kartik, N., 389
Kiruthika, B., 95
Konika Abid, 551
L
Lekha Rani, 677
Liangxiu Han, 449
M
Mahalakshmi, R., 389
Malini, A., 95, 377
Manikandan Parasuraman, 345
Manikandan Ramachandran, 345
Manorama, 603
Masri bin Abdul Lasi, 203, 217, 233
Maya Rathore, 399
Md Mahtab Alam, 39
Mitra Saeedi, 81
Mohammad Al-Fawa’reh, 531
Mohammad Hossein Amirhosseini, 81
Mohammad Nasiruddin, 489
Mohammed Jameel Alsalhy, 273
Mohit Rohilla, 11
Mumtaz Ahmed, 39
Mustafa Al-Fayoumi, 531
N
Nadiya Zafar, 287
Narindi Sai Priya, 151
Neal Bamford, 57
Neetu Mittal, 411
Neha Gaud, 399
Neha Saini, 575
Nevetha, B., 105
Nimalan, N., 105
Nisha, 439
Nitigya Vasudev, 643
Nurun Najah binti Tarmidzi, 217
P
Pallav Jain, 515
Paluck Arora, 317
Peddiboyina Hema Harini, 137
Piyush Pant, 203
Pooja, 575
Prachi Chaudhary, 439
Pradeepta Kumar Sarangi, 655,677
Pragati Choudhari, 189
Prashanth Sontakke, 81
Prateek Saini, 11,643
Pravin Gundalwar, 233
Priyam Srivastava, 559
Priya Sharma, 551
Q
Qasem Abu Al-Haija, 531
R
Rajendra Sinha, 203
Rajesh Mehta, 317
Rajesh Shrivastava, 617
Ramesh Sekaran, 345
Ram Kumar Solanki, 233
Ravi Ranjan, 449
Ritika Kumari, 1
Rohan Sahai Mathur, 121
Rohit Ahuja, 317
Roop Singh Meena, 175
Ruchi Sharma, 11
Rudra Tiwari, 655
S
Saida, S. K., 151
Sai Swetha, P., 377
Sakshi Gupta, 603
Sandra Fernando, 361
Sanjay Kumar Dubey, 121
Shachi Mall, 627
Shahram Salekzamankhani, 421
Shaik Nazeer, 501
Shaily Jain, 287
Shalu, 575
Shambhavi Mishra, 559
Shano Solanki, 175
Sheetal Garg, 245
Siddharth Arora, 11
Sivaram Rajeyyagari, 345
Snehlata Sheoran, 411
Sonam Gupta, 585
Sophia Lazarova, 71
Sridevi, S., 105
Srinivasa Rao, B., 137,501
Sumit Bathla, 515
Syed Irfan Ali, 489
Syed Mohammad Ali, 489
T
Tamanna Kewal, 473
Tanveer Ahmed, 559
Tejasvi Singhal, 643
Tushar Bansal, 121
U
Udayan Ghose, 23
Ugrasen Suman, 399
Umesh Gupta, 551, 559, 617
Utkarsh Dixit, 585
Utkarsh Garg, 515
V
Vaibhav Sharma, 677
Valluri Anand, 163
Vankalapati Nanda Gopal, 501
Varun Gupta, 121
Vatala Akash, 501
Venkatesh, K. A., 389
Victor Sowinski Mydlarz, 361
Vipul Mishra, 559
X
Xiao ShiXiao, 189
Y
Yanduru Yamini Snehitha, 151
Yash Khare, 121
Z
Zahraa Maan Sallal, 263
Zeeshan Ali, 287
Article
In this research, a novel navigation method for self-driving vehicles that avoids collisions with pedestrians and ad hoc obstacles is described. The proposed approach predicts the locations of ad hoc obstacles and wandering pedestrians using an RGB-D depth sensor, and unique ad-hoc-obstacle-aware mobility rules are presented that account for these environmental uncertainties. A Deep Reinforcement Learning (DRL) method is proposed as the decision-making technique that steers the self-driving vehicle to the target without incident. The deep Q-network (DQN), double deep Q-network (DDQN), and dueling double deep Q-network (D3DQN) algorithms were compared, and D3DQN accumulated the fewest negative rewards. The algorithms were tested in the Carla simulation environment to examine the input values from RGB-D and RGB-Lidar sensors, and the convolutional-network-based D3DQN was consequently selected as the optimum DRL model. In modeling slow-moving urban traffic, RGB-D and RGB-Lidar produced essentially the same results. A modified self-driving version of a child's ride-on car was built to demonstrate the real-time effectiveness of the proposed algorithm.
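The DQN variants compared above all learn from the same kind of temporal-difference target; the double-DQN refinement has the online network select the next action while the target network evaluates it, which reduces overestimation. A minimal sketch of that backup rule, with toy Q-values standing in for the two networks (all values hypothetical):

```python
def double_dqn_target(reward, gamma, q_online_next, q_target_next, done):
    """Double-DQN backup: the online network picks the greedy next action,
    and the target network supplies that action's value estimate."""
    if done:
        return reward  # terminal transition: no bootstrapped future value
    # Action selection uses the online network's next-state Q-values...
    a_star = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    # ...but the value of that action comes from the target network.
    return reward + gamma * q_target_next[a_star]

# Toy next-state Q-values from the two (hypothetical) networks:
target = double_dqn_target(1.0, 0.9, [1.0, 3.0, 2.0], [0.5, 1.5, 4.0], False)
print(target)  # online net picks action 1, so target = 1.0 + 0.9 * 1.5 = 2.35
```

A plain DQN would instead take the maximum of `q_target_next` directly (here 4.0), illustrating the overestimation that the double-DQN and dueling D3DQN variants are designed to curb.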
Article
Authentication and authorization are the basis of security for all the technologies present in the world today: from a smartphone, where a user must authenticate before accessing the data inside, to the White House, where one must authenticate and is authorized accordingly. In a digital world where every business, MNC, government body, company, and user needs a website to announce their presence on the internet, provide services online, and become a "brand", the risk of leaking users' sensitive information increases. A hacked website is dangerous to its users because sensitive information such as credit card and bank account details can be sold on the black market of the "dark web". The paper describes the role of the dark web, how data is sold there, and what becomes of it. It helps the reader understand how to develop a secure website that keeps users' sensitive information safe, strengthening the bond of trust between client and server and resulting in a long-term relationship. The aim of developing an authentication system is to keep users' sensitive information safe so that hackers cannot steal and sell it on the dark web's black market; to do this, the developer needs to understand how to implement authentication. NodeJS, with the help of its framework expressJS and some other packages, is used by the research to develop the website's authentication and authorization system. Previous papers in this field covered authentication only in general; this paper goes deeper and is server-side-language specific. The common authentication methods used in different types of websites are discussed in detail, and the best methods are proposed for developers to implement for a more secure website. The research also puts light on artificial intelligence and blockchain as the future of big-data security.
Chapter
The world is moving toward a new digitalized era powered by some of the most potent technologies ever developed in human history, advancements that enable people to make things that in the past were merely the stuff of fairy tales. The model put forward by this research incorporates several of the newest and most powerful technologies of the decade: it integrates the 5G network with the industrial internet of things, based on machine learning, to develop an intelligent machine capable of mimicking humans. A system of this power is extremely susceptible to hacking, cyberattacks, and similar threats; this problem is solved with blockchain. Since blockchain offers a decentralized approach to maintaining transparency, the research incorporates it into the model to make it more efficient and secure. IoT with blockchain has been the subject of other studies, but this study is an enhanced version that also combines industrial IoT with AI to create an intelligent internet of things.
Chapter
The internet of nano things (IoNT) is growing at an exponential rate due to a growing population and ever more communication between devices in networks, sensors, actuators, and so on. This growth shows up along many dimensions: volume, velocity, variety, veracity, and value. Extracting important information and insights from such data is a demanding and critical task. Multi-criteria decision making, which reaches a conclusion based on a number of different criteria, is one of the most important ways to solve such a problem, since it helps choose the best solution from among several options. AI-enabled algorithms and evaluations based on multiple criteria are useful on big data sets and are applied during the deduction process. Because the approach works well and has great potential, it is used in many different areas, such as computer science and information technology, agriculture, and business.