
Email Based Cyberstalking Detection On Textual Data Using Multi Model Soft Voting Technique Of Machine Learning Approach

January 2023
Journal of Computer Information Systems 63(2)


DOI:10.1080/08874417.2022.2155267

Authors:

Arvind Kumar Gautam

Indira Gandhi National Tribal University

Abhishek Bansal

Dr. Harisingh Gour Vishwavidhyalaya Sagar

Internet applications are used by large numbers of people for many purposes and have become a basic, habitual part of modern life. Like social media, e-mail is widely used by people of all categories for personal and official communication. The widespread use of e-mail has also given rise to various types of cybercrime, including cyberstalking. Cyberstalkers use e-mail to harass victims through several content-wise and intent-wise approaches: spamming, phishing, spoofing, malicious and defamatory e-mails, e-mail bombing, non-spam e-mails containing sexism, racism, or threats, and, finally, attempts to hack the victim’s account. This paper proposes the EBCD model for automatic cyberstalking detection on the textual data of e-mails using the multi-model soft voting technique of machine learning. First, all classifiers of three model sets were trained, tested, and validated on three labeled datasets: dataset D1 contains spam, fraudulent, and phishing e-mail subject lines, dataset D2 contains spam e-mail body text, and dataset D3 contains harassment-related data. The trained, tested, and validated classifiers of all model sets were then applied in combination to automatically classify unlabeled e-mails from the user’s mailbox using the multi-model soft voting technique. The proposed EBCD model classifies the e-mails from the user’s mailbox into cyberstalking e-mails, suspicious e-mails (spam and fraudulent), and normal e-mails. Each model set of the EBCD model uses several classifiers, namely support vector machine, random forest, naïve Bayes, logistic regression, and soft voting. The final decision in classifying the e-mails from the user’s mailbox is made by the soft voting technique of each model set. TF-IDF feature extraction is used with all model sets to obtain feature vectors from the data. Experimental results show that the soft voting technique improves e-mail classification performance and helps avoid misclassification. Overall, soft voting performed better than the other classifiers, although the support vector machine also performed notably well. On datasets D1, D2, and D3 respectively, soft voting obtained an accuracy of 97.7%, 97.7%, and 98.9%; precision of 97%, 98.3%, and 98.6%; recall of 98.3%, 96.5%, and 99.1%; F-score of 97.6%, 97.4%, and 98.9%; and AUC of 99.4%, 99.7%, and 99.9%. The average performance of soft voting across the model sets on the classified e-mails from the user’s mailbox was also notable, with an accuracy of 96.3%, precision of 98.1%, recall of 94%, F-score of 95.9%, and AUC of 96.8%.



To cite this article: Arvind Kumar Gautam & Abhishek Bansal (2023): Email-Based Cyberstalking Detection On Textual Data Using Multi-Model Soft Voting Technique Of Machine Learning Approach, Journal of Computer Information Systems, DOI: 10.1080/08874417.2022.2155267

To link to this article: https://doi.org/10.1080/08874417.2022.2155267

Published online: 17 Jan 2023.


Arvind Kumar Gautam and Abhishek Bansal

Indira Gandhi National Tribal University, Amarkantak, India


KEYWORDS

e-mail cyberstalking; cyberstalking detection; cyberbullying; machine learning; spam detection; soft voting; TF-IDF; support vector machine; naive Bayes; logistic regression; random forest

CONTACT: Arvind Kumar Gautam, analyst.igntu@gmail.com, Department of Computer Science, Indira Gandhi National Tribal University, Amarkantak, Distt. - Anuppur, MP 484886, India

Introduction

With the growth and popularity of internet technology, e-mail (electronic mail) has become an essential channel for person-to-person and person-to-group communication. E-mail platforms are not only used for communication but also provide storage, the use of which has grown exponentially over the years. Regular e-mail users generally store about half of their basic and critical information in e-mail storage.[1,2] E-mail is a leading application for sharing personal, official, business, and confidential information over the internet. Many organizations and individuals use e-mail to share general and essential information, for example to share documents, exchange messages, and send urgent news, updates, and notifications. Several e-mail service providers offer e-mail services for personal and business use, either free of charge or by subscription. Some of the best-known providers are Gmail, Microsoft Hotmail and Outlook, Yahoo, iCloud, AOL, GMX, ProtonMail, Yandex Mail, Tutanota, and Zoho Mail. According to data provided by Statista, more than 4.1 billion users worldwide use e-mail through different electronic devices and e-mail client software. E-mail is not limited to personal and official use; it is also widely used by cybercriminals to commit cybercrime. Cybercrimes such as phishing, spamming, hacking, spoofing, e-mail bombing, and cyberstalking are carried out using e-mail. E-mail is the second most used application and the third most common source of cyberstalking and other cyber harassment over the internet.[2,4]

Although authors define cyberstalking differently, it is generally understood as a form of online harassment that uses technology to target individuals or groups. Cyberstalking and cyberbullying are two challenging forms of online abuse that are close in content and intent and use the same internet-based technology to harass, bully, and undermine others in the online world. Cyberstalking consists of systematic, repeated cyber-attacks that may occur on multiple occasions.[5–8] Cyberstalking may be classified into e-mail stalking, internet stalking, computer stalking, phone stalking, and automated stalking.[8,9] Cyberstalking is a dangerous and convoluted cybercrime that targets and affects numerous people, communities, and organizations. Cybercriminals use several approaches to target victims, such as sending e-mails containing phishing links, viruses, or threatening, fraudulent, and harassing content; e-mail bombing; sharing the victim’s private information; and, finally, trying to hack the e-mail account. Cyberstalkers often use e-mail with predefined plans and agendas to insult, abuse, and harass the victim through repeated sexist, racist, offensive, abusive, or hateful messages and fake news sent from real or counterfeit accounts. Although such e-mail-based methods are mainly used for other types of e-mail-based crime, their use in cyberstalking incidents cannot be ignored. Some e-mail-based methods applied by cyberstalkers are presented in Figure 1.

Figure 1. Different methods of e-mail-based cyberstalking.

Spam is the criminal and fraudulent sending of unsolicited, unwanted, and harmful messages, such as phishing, false advertising, harassment, and illegal content, from an infected device or to multiple addresses at once. According to DataProt, as of March 2022 approximately 85% of e-mails worldwide were filtered as spam, including 36% sent for advertising purposes and 31.7% of all spam messages sent for adult-related and harassment purposes. A phishing e-mail is a scam, more dangerous than general spam, sent by cybercriminals with fraudulent and harassing intent. Cybercriminals identify the victim’s interests and send customized phishing e-mails that appear to come from a legitimate, reliable source to a specific person or group in order to steal personal and financial information. They use different types of phishing e-mails with predefined objectives, containing harmful hyperlinks, fake website links, malware, and cloned IDs and content, to steal private information, hack accounts, control the victim’s devices, and undermine and harass the victim.

Malicious e-mails are an approach used by cybercriminals, like phishing e-mails, to try to access victims’ private information. Malicious e-mails contain attachments such as documents, PDFs, hyperlinks, e-files, and voicemails that initiate an attack on the user’s devices. Such attachments can install malware that destroys data, steals information, takes control of the user’s computer, accesses the screen, captures keystrokes, and accesses other network systems. Cyberstalkers often use malicious e-mails to target known users.

E-mail spoofing is another harmful technique that cybercriminals use to send spam and phishing e-mails, tricking users into thinking a message came from a trustworthy and well-known person or organization. In e-mail spoofing, cybercriminals forge the header of a message that carries malicious links and malware attachments, so that the client application software shows the falsified sender address and the receiver believes it. Cybercriminals craft spoofed e-mails using display names, legitimate domains, and lookalike domains. E-mail spoofing is mainly used for phishing, identity theft, avoiding spam filters, anonymity, and harassment.
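The three spoofing routes named above (display names, legitimate domains, and lookalike domains) can be illustrated with a small header check. The sketch below is not part of the EBCD model, which works on textual content. The TRUSTED_DOMAINS set, the edit-distance threshold of 2, and all function names are assumptions introduced here for illustration only.

```python
from __future__ import annotations

from email.utils import parseaddr

# Hypothetical domains the mailbox owner trusts; not taken from the paper.
TRUSTED_DOMAINS = {"paypal.com", "google.com"}


def domain_of(address: str) -> str:
    """Return the lower-cased domain part of an e-mail address ('' if absent)."""
    return address.rpartition("@")[2].lower() if "@" in address else ""


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, used here to spot lookalike domains."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def spoofing_signals(from_header: str, reply_to: str = "") -> list[str]:
    """Collect simple header-level warning signs; an empty list means none found."""
    display_name, address = parseaddr(from_header)
    sender_domain = domain_of(address)
    signals = []

    # Display-name spoofing: the name mentions a trusted brand but the domain differs.
    for trusted in TRUSTED_DOMAINS:
        brand = trusted.split(".")[0]
        if brand in display_name.lower() and sender_domain != trusted:
            signals.append("display-name mismatch")
            break

    # Lookalike domain: one or two character edits away from a trusted domain.
    if any(0 < edit_distance(sender_domain, t) <= 2 for t in TRUSTED_DOMAINS):
        signals.append("lookalike domain")

    # Replies routed to a different domain than the visible sender.
    reply_domain = domain_of(parseaddr(reply_to)[1])
    if reply_domain and reply_domain != sender_domain:
        signals.append("reply-to mismatch")

    return signals
```

A header check of this kind only covers spoofing that leaves traces in the visible headers. A fake e-mail sent from a genuine-looking account with harmless content would pass it, which is one reason content-based classification is still needed.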

In e-mail bombing, cyberstalkers repeatedly send large, unnecessary, and meaningless e-mail messages to the victim’s e-mail address to consume large amounts of system and network resources (such as internet bandwidth and storage space) for harassment purposes. Composing e-mail bombing messages automatically with computer programs is another approach used by adversarial cyberstalkers. Sometimes the adversary sends a controversial or official-looking statement to a large audience with the victim’s e-mail address as the return address, so that recipients reply individually and the victim’s account is eventually flooded with replies. Another dangerous approach is to subscribe the victim’s e-mail address to many sexual sites and other mailing lists so that the victim regularly receives unwanted automatic e-mails. Defamatory e-mail is a technique of cyber defamation that cyberstalkers often use to send false information about a person or organization in order to destroy that person’s or organization’s reputation. Defamatory e-mails are sent to different recipients either accidentally or deliberately, so the resulting harm may be intentional or unintentional, which complicates the matter. Sometimes cyberstalkers also send defamatory e-mails containing false and sexual information about the victim to the victim’s relatives to damage the victim’s public image. Cases of defamatory e-mail are steadily increasing and are very difficult to detect.

E-mails that do not fall into any category of spam or fraudulent e-mail and look legitimate are called non-spam e-mails. Malicious non-spammer cyberstalkers use non-spam e-mails, including sexually abusive, fake, threatening, and other harassing e-mails, to target victims in a planned way. In the non-spam e-mail methods used by cyberstalkers, a set of temporary e-mail IDs is created on well-known or sometimes suspicious e-mail servers, and these IDs are then used to send stalking-related messages to the victim regularly. If the sender’s e-mail ID is blocked or a police complaint is filed, cyberstalkers switch to other temporary e-mail IDs. Threatening e-mail is used by cyberstalkers and scammers to blackmail the victim. In threatening e-mails, cyberstalkers repeatedly threaten to publish private information, or fake or factual sexual information, among the victim’s colleagues or relatives (friends and family) unless the victim fulfills their demands. Threatening e-mail is more common in the cyberstalking of women by ex-partners or friends, for financial fraud or reasons of personal animosity. Sometimes cyberstalkers send fake e-mails to victims or their relatives containing false information, a fake sender name (often the name of a person or organization well known to the victim), or counterfeit domains to harass the victim intensely. Such fake e-mails pass as genuine with respect to e-mail filtering policy, domain name, and sender name, and do not contain any harmful links, so it is very difficult to tell whether an e-mail is fake or legitimate. Cyberstalkers and other cybercriminals often try to hack the e-mail IDs of victims or their family members so that the victims can be harassed further with ease. Cybercriminals use general approaches such as phishing and spoofing to hack victims’ e-mail IDs. Keylogging (software and hardware keyloggers that capture every keystroke a user makes), pharming (a fake website that looks legitimate and collects usernames and passwords), automated script-based programs, suspicious mobile apps, gaming applications, hyperlinks to sexual sites, and password guessing and resetting are other powerful methods used by cybercriminals to hack e-mail IDs.

Researchers generally focus on classifying e-mail as spam or non-spam, but non-spam e-mail is not always safe and crime-free; it is also a vehicle for cyberstalking that cannot be ignored. Researchers have proposed various content-based and rule-based techniques for spam filtering and detection. Content-based methods mainly focus on content features, while rule-based methods use model-based approaches with predefined rules and blacklist and whitelist mechanisms for spam e-mail classification.[19,20] Reputed e-mail service providers (Gmail, Outlook, and Yahoo) generally filter e-mails with a primary focus on spam and other harmful e-mails but do not focus on filtering harassing e-mails. Cyberstalking is a critical cybercriminal activity, and relatively few technical solutions exist to combat and control cyberstalking incidents. Detection of cyberstalking, especially early and automated detection, is another major challenge. An intelligent cyberstalking detection model is required to automatically classify the e-mails in the user’s mailbox and handle distressing cyberstalking incidents on the e-mail platform. Sentiment analysis using machine learning techniques plays a vital role in text analysis and in scoring e-mail content to classify it as positive or negative text. Researchers mostly focus on cyberstalking detection on social media platforms, while e-mail-based cyberstalking detection has received less attention and exploration. There is still much scope for e-mail-based cyberstalking detection that can automatically filter cyberstalking e-mails from the user’s mailbox. The main research objective of this paper is to train and test different machine learning model sets on different datasets (spam and harassment) and finally to filter the e-mails in the user’s mailbox automatically into cyberstalking, suspicious, and normal e-mails. This study uses the multi-model soft voting technique of machine learning to design and develop an improved automated e-mail-based cyberstalking detection model on textual data. The significant contributions of this study are as follows.

● We designed and developed an automated, efficient model named EBCD that filters e-mails into cyberstalking e-mails, suspicious e-mails (spam and fraudulent), and normal e-mails, using the multi-model soft voting technique of machine learning to achieve the best performance in e-mail-based cyberstalking detection on textual data.

● The proposed EBCD model can classify and label e-mails automatically in real time with high accuracy and can gather useful information from the e-mails in the user’s mailbox for further training of machine learning models and for evidence purposes.

● The proposed EBCD model can be used with any e-mail mailbox that provides an e-mail fetching API or IMAP services.
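The last contribution depends on fetching messages over IMAP or a provider API and reducing each message to the text fields that the classifiers consume (subject line and body). A minimal sketch of that step using only the Python standard library is given below. It is an illustration under stated assumptions, not the authors' implementation: the function names, the choice of the UNSEEN search criterion, and the 20-message limit are hypothetical, and the host and credentials are placeholders.

```python
import imaplib
from email import message_from_bytes, policy


def extract_text_fields(raw_message: bytes) -> dict:
    """Return the sender, subject line, and plain-text body of one raw RFC 822 message."""
    message = message_from_bytes(raw_message, policy=policy.default)
    # Prefer the text/plain part; HTML-only messages yield an empty body in this sketch.
    body_part = message.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""
    return {
        "from": str(message.get("From", "")),
        "subject": str(message.get("Subject", "")),
        "body": body.strip(),
    }


def fetch_unseen(host: str, user: str, password: str, limit: int = 20) -> list:
    """Fetch up to `limit` unread messages over IMAP (host and credentials are placeholders)."""
    records = []
    with imaplib.IMAP4_SSL(host) as connection:
        connection.login(user, password)
        # Open read-only so that fetching does not mark messages as seen.
        connection.select("INBOX", readonly=True)
        _status, data = connection.search(None, "UNSEEN")
        for message_id in data[0].split()[:limit]:
            _status, parts = connection.fetch(message_id, "(RFC822)")
            records.append(extract_text_fields(parts[0][1]))
    return records
```

Keeping the parsing step separate from the network step means the same `extract_text_fields` function can be reused for messages obtained through a provider-specific API instead of IMAP.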

The rest of this paper is organized as follows. Section 2 reviews notable and recent contributions in the related field. Section 3 describes the materials and the proposed methodology. Section 4 presents the experimental setup, results, and a detailed discussion. Finally, Section 5 concludes the paper and outlines future work.

Review of literature

For the literature survey, related research papers were selected to examine past contributions to the automatic detection of cyberbullying, cyberstalking, and other cyber harassment. Researchers have suggested several techniques for designing and developing cyberstalking detection models on different online platforms. Burmester Henry et al. proposed a monitoring system framework for tracking cyberstalkers using a cryptographic approach. The authors claimed that the proposed framework would be able to record cyberstalking-related data on the victim’s computer. Aggarwal S. et al. developed the Predator and Prey Alert (PAPA) system to help law enforcement. The PAPA system records every screen event on the victim’s device during a session. It requires special software and hardware on the victim’s side and raises a privacy issue. The PAPA system also did not perform well at filtering and detecting cyberstalking e-mails and was unable to handle text-based cyberstalking. Onan et al. suggested a model for topic extraction in bibliometric data analysis using several improved word embeddings with a cluster analysis approach and developed sentiment analysis models using machine learning, ensemble learning, and deep learning methods on educational data mining. Gautam et al. explored and reviewed various cyberstalking and cybercrime detection techniques and found that machine learning techniques are widely used in single, ensemble, and hybrid approaches. Onan et al. proposed a model based on a three-layer stacked bidirectional long short-term memory architecture for detecting sarcastic text documents on social media and subsequently suggested a deep learning-based model using several word embedding models and another deep learning-based model using several weighted word embedding models for sentiment analysis of product reviews on Twitter. Another machine learning and deep learning-based model proposed by Onan et al. uses several unsupervised and supervised term-weighting models, namely TF-IDF, word2vec, FastText, and GloVe. Machine learning classifiers play a vital role in building cyberstalking detection models, whether single-model or multi-model (ensemble and hybrid). Gautam et al. analyzed the performance of several popular machine learning classifiers on datasets of different sizes for cyberstalking detection. In the literature, researchers mainly focus on building cyberstalking detection models for social media platforms. Zhang et al. suggested a machine learning-based automated cyberbullying detection model for detecting bullying tweets on Twitter. The authors experimented with various machine learning models and multiple textual features and reported a maximum accuracy of 90%. Liew et al. suggested an automated security alert model using supervised machine learning techniques to detect and control phishing tweets in real time on Twitter. Nimisha et al. presented another automated model for cyberstalking detection on social media using machine learning and natural language processing. The authors’ model mainly focuses on identifying the cyberstalker online and detecting cyberstalking incidents. Gautam et al. designed and developed another enhanced automated model for real-time cyberstalking detection on Twitter using a hybrid approach inspired by machine learning. The authors performed experiments on live tweets in real time using lexicon-based, machine learning-based (single-model), and hybrid (multi-model, inspired by machine learning) approaches and found that the hybrid approach performed better in cyberstalking detection.

E-mail-based cyberstalking detection has been explored less than social media-based cyberstalking detection, although researchers have recommended several notable detection approaches for e-mail-based crimes other than cyberstalking. Roy et al. performed a comparative analysis of SVM and deep neural networks in intrusion detection and proposed several models for detecting spam e-mails: a machine learning-based model using extreme learning machine (ELM) and support vector machine (SVM), a hybrid model of rough set and decorate ensemble, and a multi-model approach using Deep SVM, SVM, and artificial neural network models. Bassiouni et al. proposed a spam e-mail classification model using machine learning techniques. The authors performed experiments on the Spambase UCI dataset with several machine learning classifiers and found that Random Forest gave better results for classifying e-mails as spam or ham. Zhaoquan et al. proposed another machine learning-based detection model for spam filtering using marginal attack methods. Kontsewaya et al. proposed another machine learning-based detection model for spam e-mail classification. The authors performed experiments on a ready-made dataset containing 1368 spam and 4360 non-spam e-mails and found that logistic regression provided better results than the other classifiers. Aviad Cohen et al. proposed a model for detecting malicious e-mails using machine learning methods. The authors applied general descriptive features with machine learning algorithms to enhance performance. Experiments on a dataset of 33,142 e-mails (38.73% malicious and 61.27% benign) showed improved results. Chaitra Sai et al. proposed a model for detecting spoofed e-mails using stacking algorithms. The authors explored various approaches and compared machine learning stacking algorithms for detecting different types of spoofed e-mails in order to achieve better accuracy. Onan et al. proposed an ensemble-based machine learning model for text classification and suggested another machine learning-based model using a consensus clustering-based undersampling approach for text classification on imbalanced datasets. The authors presented a comparative analysis of different feature engineering approaches, base learners, ensemble learning methods, and consensus clustering-based undersampling. Onan et al. also proposed an ensemble pruning model using multiple classifiers based on swarm-optimized topic modeling, a machine learning-based hybrid ensemble pruning model using clustering and a randomized search approach, and a machine learning-based ensemble model for text classification using different extraction methods. Nisar et al. suggested a soft voting technique using several machine learning classifiers for spam e-mail classification. In their experiments, the authors found that the ensemble approach using soft voting enhances the performance of spam e-mail classification. Bountakas et al. proposed a machine learning-based hybrid ensemble approach using stacking and soft voting techniques for phishing detection. The authors performed experiments on a dataset containing 3,460 phishing and 32,051 benign e-mail samples and found better performance with soft voting ensemble learning. Onan et al. suggested a group-wise enhancement technique for text sentiment classification using a deep learning model and suggested a model with effective feature selection using an ensemble approach to enhance the performance of text sentiment classification. Cybercriminals have recently introduced image spam to render analysis of the e-mail body text ineffective. Image spam is unsolicited bulk e-mail that contains a message embedded in an image. Spammers use such images to avoid detection by text-based filters. Image spamming is a growing issue in cybercrime, although researchers have suggested some machine learning and deep learning-based approaches to image spam detection.[53,54]

In the area of e-mail-based cyberstalking detection, Ghasem et al. introduced an improved e-mail-based cyberstalking detection framework for automatically detecting and controlling cyberbullying and cyberstalking using machine learning techniques. The authors’ proposed ACES (Anti-Cyberstalking E-mail System) framework focused on automatic e-mail-based cyberstalking detection as well as evidence documentation to combat cybercriminals. Frommholz et al. proposed another e-mail-based cyberstalking detection model for textual analysis and cyberstalking detection using machine learning algorithms. The authors’ proposed framework, ACTS (Anti Cyberstalking Text-based System), mainly focused on author identification, text classification, personalization, and digital text forensics. X. Feng et al. proposed another framework for e-mail-based cyberstalking detection using machine learning approaches. The authors’ model was inspired by the ACES and ACTS frameworks, and the authors claimed that it would perform better for cyberstalking detection. Gautam proposed another e-mail-based cyberstalking detection framework that uses a machine learning approach to detect, filter, and collect cyberstalking evidence on the textual data of non-spam e-mails. The framework explores the cyberstalking risk from non-spam e-mail: it first classifies e-mail into spam and non-spam and then detects cyberstalking in the non-spam e-mail. Maryam et al. proposed another improved e-mail-based detection model using a deep learning approach. The authors’ model classified e-mails into Harassment E-mails, Fraudulent E-mails, Suspicious E-mails, and Normal E-mails. Asante et al. suggested another automated model for cyberstalking detection on social media using machine learning, data mining techniques, and digital forensics. The authors’ model contains identification, filtering, detection (content detection and offender profiling), and evidence modules.

The literature review shows that researchers have mainly focused on social media-based cyberstalking and other harassment detection. Researchers have also contributed to exploring and detecting e-mail-based cybercrimes. However, e-mail-based cyberstalking is still not much explored and requires more attention. Only a few authors [55–59] have contributed to detecting and combating e-mail-based cyberstalking. Automatic e-mail-based cyberstalking detection on textual data remains challenging, and automated cyberstalking detection approaches with impressive performance are still lacking. Inspired by these authors [55–59], this paper proposes an EBCD model for automatic cyberstalking detection on the textual data of e-mails, which classifies the e-mails from a user’s mailbox into cyberstalking e-mails, suspicious e-mails, and normal e-mails.

Material and methodology

This section describes the algorithms used to design the proposed model. The e-mail-based cyberstalking detection (EBCD) model on textual data has two main parts: making ML model sets and e-mail-based cyberstalking detection. In the first part, three ML model sets, each containing Support Vector Machine (SVM), Logistic Regression (LR), Naïve Bayes (NB), Random Forest (RF), and Soft Voting classifiers, were trained and tested on three separate datasets (a subject-line spam dataset, an e-mail spam dataset, and a cyberstalking dataset). In the second part, e-mails were fetched from the user’s mailbox and then filtered into cyberstalking e-mails, suspicious e-mails (spam and fraudulent), and normal e-mails by applying the trained and tested ML model sets with the soft voting technique. Algorithm 1 describes the stepwise procedure for making the ML model sets, while Algorithm 2 describes e-mail-based cyberstalking detection on textual data from a user’s mailbox. Figure 2 shows the basic functional layout of the proposed EBCD model on textual data. The overall methodology is presented stepwise and consists of the following main phases, which together cover both parts of the model:

(1) Making the Dataset.

(2) Data pre-processing module.

(3) Features extraction module.

(4) Making ML model sets.

(5) Fetching e-mails from the user’s mailbox.

(6) Apply trained ML model sets to e-mails and

combine the probabilities using soft voting

(7) Aggregator module and e-mail classification

(8) Saving classified e-mails as evidence

(9) Model Performance

Making datasets

Several datasets [60–67] were gathered, covering spam/phishing e-mail subjects, spam e-mail text, fraudulent e-mail, and harassment text (e-mails, tweets, and posts/comments from social media). From these collected datasets, three separate mixed labeled datasets were built to train, test, and cross-validate the three machine learning model sets. Dataset D1 contains e-mail subject lines (spam, phishing, and fraudulent) labeled as spam (1) and ham (0). Dataset D2 contains spam and fraudulent e-mail body text labeled as spam and ham. Dataset D3 contains harassment-related data (threatening, sexual abuse, hate messages, racism, etc.) from e-mails and social media tweets/posts/comments labeled as cyberstalking (1) and non-cyberstalking (0). Dataset D1 is used to train and test the classifiers of ML model set MS-1, dataset D2 those of ML model set MS-2, and dataset D3 those of ML model set MS-3. The distribution of data in each of the three datasets is shown in Figure 3.

Data pre-processing module

The data in the datasets and the fetched e-mails often contain raw text with unnecessary characters, blank spaces, blank lines, meaningless characters, HTML tags, and various symbols. The data must be cleaned properly before feature extraction and classification. The data pre-processing module cleans and normalizes all labeled training and testing datasets as well as the unlabeled e-mails fetched from the user’s mailbox. It is applied first to the labeled training and testing datasets and later to the unlabeled e-mails fetched from the user’s mailbox. The module performs several pre-processing tasks: stop word removal, noise removal, tokenization, normalization, and stemming. In the first step of pre-processing, all stop words were removed. Stop words are common words, such as articles, prepositions, and pronouns, that carry little meaning and are not useful for e-mail classification. Fetched e-mails and datasets gathered from different sources also contain noise that must be removed. In an e-mail, repeated words, symbols (such as HTML tags, @, #, etc.), blank lines, blank spaces, special characters, punctuation marks, and useless digits are considered noise. After the noise and stop words were removed, the text of the e-mail (subject and body) was split into individual words, which were added to a separate list. This process of splitting a sentence into words is called tokenization. The tokenized text is then converted to lowercase letters (normalization) for uniformity. After that, the tokenized words are reduced to a base form using lemmatization or stemming. Stemming strips affixes to obtain a word stem, whereas lemmatization uses morphological analysis to map the inflected forms of a word to its dictionary form (lemma), so lemmatization may be used instead of stemming when proper morphological analysis is needed. In this paper, the stemming method was used.

Figure 2. Basic layout of the proposed EBCD (e-mail-based cyberstalking detection) model on textual data.

Figure 3. Distribution of data in labeled datasets.
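The pre-processing tasks described in this section can be sketched in a few lines of standard-library Python. This is only an illustrative sketch, not the authors' implementation: the stop word list is a tiny hand-picked sample, `simple_stem` is a crude hypothetical suffix stripper standing in for a real stemmer, and the steps are ordered so that stop word lookup runs on lowercase tokens.

```python
import re

# Tiny illustrative stop word list (assumption); a real run would use a fuller list.
STOP_WORDS = {"a", "an", "the", "in", "on", "of", "to", "and", "or", "is", "are",
              "am", "i", "you", "he", "she", "it", "we", "they", "your", "my"}

def remove_noise(text):
    """Strip HTML tags, symbols, punctuation, digits and extra whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)        # HTML tags
    text = re.sub(r"[^A-Za-z\s]", " ", text)    # symbols, punctuation, digits
    return re.sub(r"\s+", " ", text).strip()    # blank lines and repeated spaces

def simple_stem(word):
    """Crude suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Return the cleaned, lowercased, stop-word-free, stemmed tokens of a text."""
    cleaned = remove_noise(text)
    tokens = cleaned.lower().split()                      # normalization + tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop word removal
    return [simple_stem(t) for t in tokens]               # stemming

tokens = preprocess("<p>You are WARNED!!! I am watching your house, 24/7 @home</p>")
print(tokens)
```

The same `preprocess` function would be applied both to the labeled datasets and to the subject and body text of fetched e-mails, so that training and classification see identically cleaned tokens.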

Feature extraction module

Feature extraction is essential in a machine learning pipeline before training, testing, and classifying e-mails, because machine learning algorithms work on numerical feature vectors and cannot process raw text. Feature extraction computes weights for the words of an e-mail and creates a numerical feature vector, and it plays a crucial role in improving the performance of classifiers. Several traditional, word embedding-based, and language model-based feature extraction methods are available at the word, sentence, and n-gram levels. TF-IDF, Word2Vec, BOW, BERT, FastText, GloVe, XLNet, ELECTRA, InferSent, GPT-2, and Universal Sentence Encoder are some widely used examples of feature extraction methods [72–75]. The proposed EBCD model of this study applied the TF-IDF method for feature extraction. TF-IDF is an efficient, calculation-based feature extraction method that measures the weight of a word of a document within a collection of documents. It assigns more weight to words that occur frequently within a document and scales that weight down for words that appear in many documents of the collection, because words that are frequent in one document but uncommon elsewhere are more useful for the classification. Equation (1) is used to calculate the feature vector in TF-IDF.

TF IDF T;Dð Þ ¼ PT in D

PW in D Log N

PT in Nð Þ þ 1

 

(1)

Where:

PT in D ¼Number of times

word T appears in

a document }D}

PW in D ¼Total number of

words in the

document }D}

;

!Represents the

Term Frequency

PT in N ¼

Total occurrence

of Word }T}in

total documents

Represents the

document

Frequency

N= Total Documents
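As a worked illustration of Equation (1), the sketch below computes the weight of a word for one tokenized document in a toy corpus. It assumes that the document frequency is the number of documents containing the word and that the "+ 1" sits in the denominator inside the logarithm; both are one reading of the equation, and the corpus itself is hypothetical.

```python
import math

def tf_idf(term, document, corpus):
    """Weight of `term` in `document` (a token list), following one reading of Equation (1)."""
    term_count = document.count(term)                 # times T appears in D
    total_words = len(document)                       # total words in D
    tf = term_count / total_words if total_words else 0.0
    # Assumed reading of "T in N": number of documents that contain the term.
    docs_with_term = sum(1 for doc in corpus if term in doc)
    idf = math.log(len(corpus) / (docs_with_term + 1))
    return tf * idf

# Hypothetical, already pre-processed corpus of four short e-mails.
corpus = [
    ["win", "free", "prize", "now"],
    ["meeting", "agenda", "attached"],
    ["free", "meeting", "room"],
    ["project", "report", "attached"],
]

weight_prize = tf_idf("prize", corpus[0], corpus)   # appears in one document
weight_free = tf_idf("free", corpus[0], corpus)     # appears in two documents
print(round(weight_prize, 4), round(weight_free, 4))
```

In this toy corpus "prize" receives a higher weight than "free" even though both occur once in the first e-mail, because "free" also appears in another document. Under this reading, a word that occurs in every document gets a negative weight, which is one reason library implementations usually apply a different smoothing.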

Making ML model sets

After the data were cleaned by the data pre-processing module and converted to feature vectors by the feature extraction module, the machine learning model sets were designed and developed. This study built three separate machine learning model sets: ML model set MS-1, ML model set MS-2, and ML model set MS-3. The algorithms of ML model set MS-1 were trained, tested, and validated on dataset D1, those of ML model set MS-2 on dataset D2, and those of ML model set MS-3 on dataset D3. In each model set, Support Vector Machine (SVM), Logistic Regression (LR), Naïve Bayes (NB), Random Forest (RF), and Soft Voting classifiers were trained, tested, and validated. The support vector machine is an efficient, versatile, and popular supervised machine learning algorithm that is widely used for text classification. SVM constructs a separating hyperplane, chosen to maximize the margin to the nearest training points (the support vectors), and classifies text according to its position relative to that hyperplane. SVM offers several kernels (linear, polynomial, sigmoid, and Radial Basis Function) with different mathematical functions. By design, SVM outputs a class decision rather than a probability; however, calibration methods such as Platt scaling and isotonic regression can convert its output into a probability for the target class. This paper used the probability calibration classifier method for SVM to calculate the prediction probability of an e-mail. Naïve Bayes (NB) is an efficient and straightforward supervised machine learning algorithm that is based on Bayes’ theorem and derived from conditional probability. In this paper, the multinomial NB model was used; other NB models are Gaussian NB and Bernoulli NB. Logistic regression is a statistical linear learning algorithm that uses the s-shaped sigmoid function to map any real-valued number to a value between 0 and 1, which is then used to obtain a dichotomous result. Logistic regression predicts an output value (y) by linearly combining the input features (x) using weights or coefficient values. Random Forest is a supervised ensemble algorithm that uses multiple decision trees with the bootstrap technique to get better prediction results. For a classification problem, each tree in a random forest takes the input and votes for a particular class, and the class with the maximum number of votes is predicted as the output. Equation (2) gives the mathematical expression for the prediction probability of an e-mail using SVM, and Equation (3) gives the formula for the prediction probability using NB. Equations (4) and (5) give the expressions for the LR and RF classifiers, respectively. Algorithm 1 describes the stepwise procedure for making the machine learning model sets.

PSVM yjemailð Þ ¼ 1

1þexp Af e mailð Þ þ Bð Þ (2)

Where “A” and “B” are scalar parameters learned by the

algorithm during the training, “y” is the target class (y = 1

for cyberstalking and y = 0 for non-cyberstalking) f(e-

mail) is a real-valued function.

PNB yjemailð Þ ¼ P yð ÞQn

i¼1PðxijyÞ

P x1ð Þ  P x2ð Þ  . . . :p xn

ð Þ(3)

Where “y” is the target class (y = 1 for cyberstalking

and y = 0 for non-cyberstalking). P(y|e-mail) repre-

sents the posterior probability of e-mail for target

class “y.” P(e-mail)=P(x1)P(x2) . . . .P(x

) is the pre-

ceding probability of the predictor e-mail. P(y) is

the preceding probability of the target class. P(x

|y)

is the likelihood conditional probability of predictor

e-mail for target class (y).

PLR yjemailð Þ ¼ eaþbemailð Þ

ð1þeaþbemailð ÞÞ(4)

Where y is the predicted probability output, a is the

intercept term, and b is the coefficient for the single

input e-mail value learned from the training data.

$$P_{RF}(y \mid \text{e-mail}) = \mathrm{MaxVote}\,\{P_n(\text{e-mail})\}_{1}^{N} \qquad (5)$$

Where N is the total number of trees in the random forest and P_n is the class prediction of the n-th tree.
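To make Equations (2), (4), and (5) concrete, the sketch below evaluates each of them on hand-picked numbers. The parameter values (A, B, a, b) and the tree predictions are hypothetical and chosen only for illustration; in practice they are learned from the training data.

```python
import math
from collections import Counter

def platt_probability(decision_value, A, B):
    """Equation (2): map an SVM decision value f(e-mail) to a probability."""
    return 1.0 / (1.0 + math.exp(A * decision_value + B))

def logistic_probability(x, a, b):
    """Equation (4): logistic regression with intercept a and coefficient b."""
    z = a + b * x
    return math.exp(z) / (1.0 + math.exp(z))

def forest_vote(tree_predictions):
    """Equation (5): the class predicted by the largest number of trees wins."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical parameters: a negative A makes larger decision values more probable.
p_svm = platt_probability(decision_value=2.0, A=-1.5, B=0.0)
p_lr = logistic_probability(x=0.0, a=0.0, b=1.0)
rf_class = forest_vote([1, 0, 1, 1, 0])   # five trees, three vote for class 1

print(round(p_svm, 3), p_lr, rf_class)
```

Both the Platt-scaled SVM output and the logistic regression output are sigmoid functions of a linear score, which is why they can be averaged directly with the other classifiers' probabilities in the soft voting step.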

Algorithm 1: Stepwise procedure for Making ML model sets on labeled datasets

Step:1. Begin

Step:2. Import labeled datasets D1, D2, and D3.

Step:3. Send datasets D1, D2, and D3 to the data pre-processing module for text cleaning and normalization.

Step:4. Split the datasets D1, D2, and D3 into training and testing sets. D1 = D1_Train + D1_Test, D2 = D2_Train + D2_Test, D3 = D3_Train + D3_Test, where D1_Train, D2_Train, D3_Train are the training corpus and D1_Test, D2_Test, D3_Test are the test corpus for datasets D1, D2, and D3, respectively.

Step:5. Apply the TF-IDF vectorizer on D1_Train, D2_Train, D3_Train, D1_Test, D2_Test, and D3_Test to get the feature vectors using the feature extraction module.

Step:6. Train and test the ML classifiers of ML model set MS-1 using the D1_Train and D1_Test corpus (training and testing feature sets of dataset D1).

Step:7. Train and test the ML classifiers of ML model set MS-2 using the D2_Train and D2_Test corpus (training and testing feature sets of dataset D2).

Step:8. Train and test the ML classifiers of ML model set MS-3 using the D3_Train and D3_Test corpus (training and testing feature sets of dataset D3).

Step:9. Apply K-Fold cross-validation for the ML classifiers of model sets MS-1, MS-2, and MS-3 on datasets D1, D2, and D3, respectively.

Step:10. Measure the performance of the ML classifiers of each ML model set.

Step:11. Save the ML model sets as pickle files so that the ML model sets can be used later during the classification of e-mails from the user’s mailbox.

Step:12. End
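Steps 4, 9, and 11 of Algorithm 1 can be sketched with the Python standard library alone. The split ratio, the number of folds, the random seed, and the placeholder model-set dictionary below are all assumptions made for illustration; a real run would pickle the fitted classifiers and vectorizer instead.

```python
import pickle
import random

def train_test_split(records, test_ratio=0.2, seed=42):
    """Shuffle labeled records and split them into training and testing lists."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - int(len(shuffled) * test_ratio)
    return shuffled[:cut], shuffled[cut:]

def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, validation
        start += size

# Hypothetical labeled records: (text, label) with 1 = spam and 0 = ham.
d1 = [(f"subject {i}", i % 2) for i in range(10)]
d1_train, d1_test = train_test_split(d1)            # Step 4
folds = list(k_fold_indices(len(d1_train), k=4))    # Step 9

# Step 11: persist a placeholder model set and load it back.
model_set = {"name": "MS-1", "classifiers": ["SVM", "LR", "NB", "RF"]}
restored = pickle.loads(pickle.dumps(model_set))

print(len(d1_train), len(d1_test), len(folds), restored["name"])
```

Each fold serves once as the validation set while the remaining folds are used for training, so every training record is validated exactly once.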

Fetching e-mails from the user’s mailbox

E-mail is private communication (person-to-person and person-to-group), so e-mails cannot be fetched from the user’s mailbox without the user ID, password, and user permission. Several approaches allow a third-party application to fetch e-mails automatically from the user’s mailbox; the two main methods are the IMAP service and the Gmail API (in the case of the Gmail service). To use the Gmail API, the user must log into the Google Cloud Console and enable the Gmail API service. An application must then be created or selected under the OAuth Consent Screen of the Google Cloud Console. Next, OAuth Client ID credentials are created for a desktop or web application, and the Client ID with its OAuth credentials is obtained as a text or JSON file. With these credentials, e-mails can be fetched from the user’s mailbox automatically through programs. On first use, the user is prompted with a message such as “This application wants to access your mailbox – Allow or deny,” and e-mails can be fetched once the user grants permission to access the mailbox. Fetching e-mails through the IMAP service requires only a user ID and password with some basic settings: after “Allow less secure apps: ON” and the IMAP service are enabled in the user’s mailbox, e-mails can be fetched automatically through programs.
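The IMAP route can be sketched with Python's standard `imaplib` and `email` modules. The function names `parse_message` and `fetch_latest`, the fetch limit, and the sample message are hypothetical; `fetch_latest` needs network access and valid credentials, so only the parsing step is exercised here on a hand-written raw message.

```python
import email
import imaplib
from email.header import decode_header, make_header

def parse_message(raw_bytes):
    """Split one raw e-mail into date, sender, subject and plain-text body."""
    msg = email.message_from_bytes(raw_bytes)
    body = ""
    if msg.is_multipart():
        for part in msg.walk():
            if part.get_content_type() == "text/plain":
                payload = part.get_payload(decode=True) or b""
                body = payload.decode(part.get_content_charset() or "utf-8", errors="replace")
                break
    else:
        payload = msg.get_payload(decode=True) or b""
        body = payload.decode(msg.get_content_charset() or "utf-8", errors="replace")
    return {
        "date": msg.get("Date", ""),
        "sender": msg.get("From", ""),
        "subject": str(make_header(decode_header(msg.get("Subject", "")))),
        "body": body.strip(),
    }

def fetch_latest(host, username, app_password, limit=10):
    """Fetch the newest `limit` inbox messages over IMAP (requires network access)."""
    mail = imaplib.IMAP4_SSL(host, 993)
    mail.login(username, app_password)
    mail.select("INBOX", readonly=True)        # read-only: do not mark as seen
    _, data = mail.search(None, "ALL")
    messages = []
    for message_id in data[0].split()[-limit:]:
        _, fetched = mail.fetch(message_id, "(RFC822)")
        messages.append(parse_message(fetched[0][1]))
    mail.logout()
    return messages

# Hand-written sample message used instead of a live mailbox.
raw = (b"From: sender@example.com\r\n"
       b"Date: Mon, 02 Jan 2023 10:00:00 +0000\r\n"
       b"Subject: I know where you live\r\n"
       b"\r\n"
       b"You cannot hide from me.\r\n")
parsed = parse_message(raw)
print(parsed["sender"], "|", parsed["subject"])
```

Opening the mailbox read-only keeps the fetch from altering message flags; a version that also moves messages into folders would have to select the mailbox in read-write mode.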

Apply trained ML model sets to e-mails and combine

the probabilities using soft voting

After an e-mail was fetched from the user’s mailbox, it was sent to the data pre-processing module and the feature extraction module to clean it and obtain the feature vectors for the e-mail subject and body text. The saved (trained and tested) ML model sets were then loaded so that the classifiers could be applied separately to the e-mail subject and body text. Using ML model set MS-1, prediction probabilities for the e-mail subject were obtained from all ML classifiers (SVM, NB, LR, and RF). The classifiers of ML model set MS-2 were applied to the e-mail body text to obtain prediction probabilities indicating whether the e-mail is spam or normal. All classifiers of ML model set MS-3 were applied to obtain prediction probabilities indicating whether the e-mail is a cyberstalking e-mail or a normal e-mail. The prediction probabilities given by the classifiers within a model set may vary, and basing the final decision on the prediction of a single classifier may harm the e-mail classification task. Therefore, an ensemble approach using the multi-model soft voting technique was applied to obtain the final prediction probability for a particular class (Spam or Normal, Cyberstalking or Normal).

In machine learning, voting techniques are classified as hard voting and soft voting. Hard voting uses a “mode”-based approach: it selects the majority vote among the predictions (votes) of all classifiers. For example, if classifier-1 predicts class “A,” classifier-2 predicts class “B,” and classifier-3 predicts class “A,” then hard voting gives class “A” as the final prediction because it has the majority of votes. Soft voting uses a “mean”-based approach: it computes the final prediction probability by averaging the probabilities (votes) predicted by all classifiers for each class. Each classifier gives a prediction probability for both classes (in the case of binary classification) through the predict_proba method; for example, p = svm.predict_proba() may give p = {0.7, 0.3}, meaning that 0.7 is the probability for class “A” and 0.3 is the probability for class “B.” If the predicted probabilities of classifier-1, classifier-2, and classifier-3 are {0.7, 0.3}, {0.4, 0.6}, and {0.8, 0.2}, respectively, then soft voting gives a final prediction probability of {0.633, 0.367}, which favors class “A.” This study uses the soft voting technique to combine the prediction probabilities. Equation (6) gives the mathematical representation of the soft voting technique, and Figure 4 describes how soft voting functions in this study. The final prediction probability is calculated by soft voting over the prediction probabilities provided by ML model set MS-1 (on the e-mail subject), model set MS-2 (on the e-mail body text), and model set MS-3 (on the e-mail body text). At the end of this phase, the three final prediction probabilities (from ML model sets MS-1, MS-2, and MS-3) for an e-mail (subject and body text) are sent to the aggregator module for e-mail classification.

Figure 4. Soft voting technique for combining the predicted probabilities and predicting the final result.

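$$P_{SoftVoting}(y \mid \text{e-mail}) = \arg\max_{j} \left( \frac{\sum_{k=1}^{N} P_k\left(C_k(\text{e-mail}) = j\right)}{N} \right) \qquad (6)$$

Where k indexes the classifiers, each of which returns a pair of class probabilities, N is the total number of classifiers, P_k is a probability, C_k is a classifier, and j ∈ Υ = {0, 1} is a class of the binary problem. The fraction is the average probability of the N classifiers for class j, and the argmax function returns the class “y” with the maximum average probability.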
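The difference between hard and soft voting can be checked numerically with the three-classifier example given in the text. The sketch below is a plain-Python illustration of the averaging idea, not the library routine used in the experiments; the function names are hypothetical.

```python
from collections import Counter

def hard_vote(predicted_labels):
    """Return the most common label (the mode) among the classifiers' predictions."""
    return Counter(predicted_labels).most_common(1)[0][0]

def soft_vote(probabilities):
    """Average per-class probabilities over classifiers; return (averages, winning index)."""
    n_classifiers = len(probabilities)
    n_classes = len(probabilities[0])
    averages = [sum(p[j] for p in probabilities) / n_classifiers for j in range(n_classes)]
    winner = max(range(n_classes), key=lambda j: averages[j])
    return averages, winner

labels = ["A", "B", "A"]                          # hard-voting example from the text
probs = [[0.7, 0.3], [0.4, 0.6], [0.8, 0.2]]      # soft-voting example from the text

averages, winner = soft_vote(probs)
print(hard_vote(labels))                          # A
print([round(a, 3) for a in averages], winner)    # [0.633, 0.367] 0
```

Index 0 corresponds to class “A” here, so both techniques agree on this example; they can disagree when a minority of classifiers is very confident and the majority is only marginally so.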

Aggregator module and e-mail classication

The aggregator module of the proposed EBCD model takes the combined (final) prediction probabilities produced by soft voting in ML model sets MS-1, MS-2, and MS-3 and classifies each e-mail from the user’s mailbox as a “Cyberstalking E-mail,” “Suspicious E-mail,” or “Normal E-mail.” The module uses three e-mail check posts. In the first check post, the e-mail is checked for cyberstalking: if the combined prediction probability for class “A” (Cyberstalking) provided by ML model set MS-3 is greater than 0.5, the e-mail is classified as a “Cyberstalking E-mail.” If the e-mail is not identified as cyberstalking, the second check post checks whether it is suspicious (spam or fraudulent) using the combined prediction probabilities for class “A” (Spam) given by ML model sets MS-1 and MS-2. If the probability given by MS-2 > 0.5 or (MS-1 > 0.5 AND MS-2 > 0.5), the e-mail is identified as a spam e-mail and must then be checked for repeated spam and e-mail bombing. In the last check post, identified spam e-mails are sent to ML model set MS-2 to check for spam repetition and e-mail bombing by the same sender. At least the ten latest e-mails sent by the same sender are checked, and if the majority of the e-mails sent by the identified sender (spammer) are spam or fraudulent, the spam e-mail identified in the second check post is classified as a “Cyberstalking E-mail” because the sender is intensively sending repeated spam or e-mail bombing. However, the user makes the final decision on whether such an e-mail is a cyberstalking e-mail or just a suspicious (spam/fraudulent) e-mail. If check post 3 does not classify the e-mail as cyberstalking, the spam e-mail identified in check post 2 is classified as a “Suspicious E-mail.” If an e-mail is identified neither as cyberstalking nor as suspicious in any of the three check posts, it is classified as a “Normal E-mail.” Figure 5 describes the functioning of the aggregator module for e-mail classification, while Algorithm 2 gives the overall stepwise procedure for e-mail classification from the user’s mailbox.

Figure 5. e-mail classification in aggregator module.
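The three check posts can be sketched as a single decision function. This is an interpretation rather than the authors' code: the function name, the argument names, and the rule that a sender's history is only consulted when at least ten earlier e-mails are available are assumptions. Note that the second condition as stated reduces to MS-2 > 0.5, since the MS-1 term only applies when MS-2 already exceeds the threshold.

```python
def classify_email(fpp1_ms1, fpp2_ms2, fpp3_ms3, sender_history_probs, threshold=0.5):
    """Return 'Cyberstalking', 'Suspicious' or 'Normal' for one e-mail.

    fpp1_ms1, fpp2_ms2, fpp3_ms3: soft-voted positive-class probabilities of MS-1, MS-2, MS-3.
    sender_history_probs: MS-2 spam probabilities of the sender's most recent e-mails.
    """
    # Check post 1: harassment content detected by MS-3.
    if fpp3_ms3 > threshold:
        return "Cyberstalking"
    # Check post 2: spam or fraud, written exactly as the condition is stated in the text.
    if fpp2_ms2 > threshold or (fpp1_ms1 > threshold and fpp2_ms2 > threshold):
        # Check post 3: repeated spam or e-mail bombing by the same sender
        # (assumption: only decided when at least ten earlier e-mails exist).
        if len(sender_history_probs) >= 10:
            spam_count = sum(1 for p in sender_history_probs if p > threshold)
            if spam_count > len(sender_history_probs) / 2:
                return "Cyberstalking"
        return "Suspicious"
    return "Normal"

print(classify_email(0.1, 0.2, 0.9, []))             # harassment content
print(classify_email(0.9, 0.8, 0.2, [0.9] * 10))     # repeated spam from one sender
print(classify_email(0.9, 0.8, 0.2, [0.1] * 10))     # isolated spam
print(classify_email(0.9, 0.2, 0.1, []))             # subject flagged, body not
```

The last call shows a consequence of the rule as written: an e-mail whose subject alone is flagged by MS-1 is still classified as normal, because the spam branch is only entered when MS-2 exceeds the threshold.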

Saving classied e-mails as evidence

After the e-mails of the user’s mailbox are classified, the available evidence must be stored in a file. The proposed EBCD model automatically reads the user’s mailbox, moves the cyberstalking e-mails to a cyberstalking folder and the suspicious e-mails to a suspicious folder, and stores the e-mail date, sender, subject, body text, sentiment label, etc. in a CSV file while the e-mails are fetched. Later, the CSV file containing the classified e-mails from the user’s mailbox can be used as evidence for legal action against cyberstalkers and also for training purposes. The user can also use the gathered evidence to decide whether to blacklist a sender and so avoid further cyberstalking from that sender.
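Storing the classified e-mails as evidence can be sketched with Python's `csv` module. The column names, the sample rows, and the helper `senders_to_review` are hypothetical; the sketch writes to an in-memory buffer so that it runs without touching the file system.

```python
import csv
import io

FIELDS = ["date", "sender", "subject", "body", "label"]

def write_evidence(rows, handle):
    """Write classified e-mails to an open text handle as CSV with a header row."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)

def senders_to_review(rows, label="Cyberstalking"):
    """Return the distinct senders whose e-mails carry the given label."""
    return sorted({row["sender"] for row in rows if row["label"] == label})

# Hypothetical classified e-mails.
rows = [
    {"date": "2023-01-02", "sender": "a@example.com", "subject": "Last warning",
     "body": "I am watching you", "label": "Cyberstalking"},
    {"date": "2023-01-03", "sender": "b@example.com", "subject": "You won",
     "body": "Claim your prize", "label": "Suspicious"},
]

buffer = io.StringIO()
write_evidence(rows, buffer)
print(buffer.getvalue())
print(senders_to_review(rows))
```

To write a real file, the same function can be given a handle opened with `open("evidence.csv", "w", newline="", encoding="utf-8")`; the list returned by `senders_to_review` is only a candidate list, since the user makes the final blocking decision.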

Model performance

The performance of the classifiers of each ML model set was measured separately on each dataset, both during training and testing and during the classification of e-mails from the user’s mailbox. Performance metrics are a set of parameters used to estimate the model performance during training and testing (on labeled datasets) and in real time (on unlabeled e-mail classification). Most of these parameters are calculated from the confusion matrix. For binary classification, the confusion matrix is a 2 × 2 table that contains the counts of True_Pos, True_Neg, False_Neg, and False_Pos. True_Pos (True Positive) is the number of correctly detected cyberstalking or spam e-mails, while True_Neg (True Negative) is the number of correctly detected normal e-mails. False_Pos (False Positive) is the number of normal e-mails incorrectly detected as cyberstalking or spam e-mails, while False_Neg (False Negative) is the number of cyberstalking or spam e-mails wrongly detected as normal e-mails. This study used widely adopted parameters, namely accuracy, precision, recall, f-score, and AUC (Area Under the Curve), to measure the performance of the EBCD model.

Algorithm 2: Stepwise procedure for e-mail Classification from User’s Mailbox

Step:1. Begin

Step:2. Load saved pre-trained and pre-tested ML model sets MS-1, MS-2, and MS-3.

Step:3. Enable IMAP service in the user’s mailbox (Gmail).

Step:4. Enable “Allow less secure apps: ON” or generate an App password for the user’s mailbox (Gmail).

Step:5. Import the required library and authenticate the login process using the User ID, Password, Host, and Port. [In the case of Python: import the imaplib and email libraries; mail = imaplib.IMAP4_SSL(host, port); mail.login(username, app_password); host for Gmail = imap.gmail.com, port = 993]

Step:6. Select “Inbox” and/or another mailbox folder to fetch the e-mails. [As mail.select(“Inbox”)]

Step:7. Create the labels/folders “Cyberstalking” and “Suspicious” in the user’s mailbox. [As mail.create(“Cyberstalking”) and mail.create(“Suspicious”)]

Step:8. Fetch an e-mail from the selected folder of the user’s mailbox. [Get e-mail date, sender, subject, e-mail text, and other required information]

Step:9. Split the e-mail into date, sender, subject, and e-mail body text, as e-mail_Subject and e-mail_BodyText.

Step:10. Send the e-mail subject and body text (e-mail_Subject and e-mail_BodyText) to the data pre-processing module for e-mail cleaning and normalization.

Step:11. Apply the TF-IDF vectorizer on e-mail_Subject and e-mail_BodyText to get the feature vectors using the feature extraction module.

Step:12. Apply all algorithms of ML model set MS-1 on e-mail_Subject and get prediction probabilities. [As PP_MS1_SVM, PP_MS1_LR, PP_MS1_NB, and PP_MS1_RF]

Step:13. Apply all algorithms of ML model set MS-2 on e-mail_BodyText and get prediction probabilities. [As PP_MS2_SVM, PP_MS2_LR, PP_MS2_NB, and PP_MS2_RF]

Step:14. Apply all algorithms of ML model set MS-3 on e-mail_Subject and get prediction probabilities. [As PP_MS3_SVM, PP_MS3_LR, PP_MS3_NB, and PP_MS3_RF]

Step:15. Combine the prediction probabilities of ML model sets MS-1, MS-2, and MS-3 and get the final probabilities in each model set using Equation (6) of the soft voting technique.
    As FPP1_MS1 = (PP_MS1_SVM + PP_MS1_LR + PP_MS1_NB + PP_MS1_RF)/4;
    FPP2_MS2 = (PP_MS2_SVM + PP_MS2_LR + PP_MS2_NB + PP_MS2_RF)/4;
    FPP3_MS3 = (PP_MS3_SVM + PP_MS3_LR + PP_MS3_NB + PP_MS3_RF)/4

Step:16. If (FPP3_MS3 > 0.5) then
    Classify the e-mail as “Cyberstalking e-mail.”
    Assign a label (value=1, Cyberstalking e-mail (negative e-mail)).
    Move the e-mail to the “Cyberstalking” folder of the user’s mailbox.

Step:17. ElseIf (FPP2_MS2 > 0.5) or (FPP2_MS2 > 0.5 AND FPP1_MS1 > 0.5) then
    Check for repeated spam and e-mail bombing by the same sender (check at least ten latest e-mails of the sender) and apply ML model set MS-2 for getting the final probabilities using the soft voting technique.
    [As RFPP4_MS2 = Call Get_Sentiment_e-mail(Sender, MS-2) (any user-defined function for getting the prediction probabilities for sender e-mails)]
    If (RFPP4_MS2 > 0.5) then
        Classify the e-mail as “Cyberstalking e-mail.”
        Assign a label (value=1, Cyberstalking e-mail (negative e-mail)).
        Move the e-mail to the “Cyberstalking” folder of the user’s mailbox.
    Else
        Classify the e-mail as “Suspicious e-mail.”
        Assign a label (value=2, Suspicious e-mail (negative e-mail containing spam/fraudulent)).
        Move the e-mail to the “Suspicious” folder of the user’s mailbox.

Step:18. Else [in case of FPP3_MS3 < 0.5, FPP2_MS2 < 0.5 and FPP1_MS1 < 0.5]
    Classify the e-mail as “Normal e-mail.”
    Assign a label (value=0, Normal e-mail (positive e-mail)).

Step:19. Save the fetched and classified e-mail to a CSV file (all e-mail-related information such as date, sender, subject, text, sentiment label (Cyberstalking/Suspicious/Normal), etc.)

Step:20. Repeat steps 8 to 19 until a sufficient number of e-mails has been fetched from the user’s mailbox (define a fetching limit).

Step:21. Measure the performance of the ML classifiers of each ML model set.

Step:22. End


Accuracy

Accuracy is the proportion of correct predictions among all predictions made by the classifier. Equation (7) shows how accuracy is calculated.

$$\text{Accuracy} = \frac{\text{True\_Pos} + \text{True\_Neg}}{\text{True\_Pos} + \text{False\_Pos} + \text{False\_Neg} + \text{True\_Neg}} \qquad (7)$$

Precision

Precision is the proportion of true positives among all e-mails that the classifier predicted as positive. Precision can be calculated using Equation (8).

$$\text{Precision} = \frac{\text{True\_Pos}}{\text{True\_Pos} + \text{False\_Pos}} \qquad (8)$$

Recall

Recall measures the sensitivity of the model as the ratio of true positive predictions to all actual positives. Recall can be calculated using Equation (9).

$$\text{Recall} = \frac{\text{True\_Pos}}{\text{True\_Pos} + \text{False\_Neg}} \qquad (9)$$

F-score

F-Score measures the test accuracy as the harmonic mean of precision and recall. F-score can be calculated using Equation (10).

$$\text{F-Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (10)$$

AUC (Area Under the Curve)

AUC estimates the ability of the classifier to separate the classes correctly. The ROC (Receiver Operating Characteristic) curve is a probability curve that plots the True Positive Rate (TPR) against the False Positive Rate (FPR). Equation (11) can be used to calculate the AUC.

$$\text{AUC} = \frac{1}{2}\left(\frac{\text{True\_Pos}}{\text{True\_Pos} + \text{False\_Neg}} + \frac{\text{True\_Neg}}{\text{True\_Neg} + \text{False\_Pos}}\right) \qquad (11)$$
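The confusion-matrix metrics of Equations (7) to (10) can be computed directly from the four counts. The counts used below are hypothetical and only illustrate the arithmetic; the helper returns 0.0 when a denominator is zero, which is a convention chosen for this sketch.

```python
def safe_divide(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def classification_metrics(true_pos, true_neg, false_pos, false_neg):
    """Accuracy, precision, recall and F-score from confusion-matrix counts."""
    total = true_pos + true_neg + false_pos + false_neg
    accuracy = safe_divide(true_pos + true_neg, total)                    # Equation (7)
    precision = safe_divide(true_pos, true_pos + false_pos)               # Equation (8)
    recall = safe_divide(true_pos, true_pos + false_neg)                  # Equation (9)
    f_score = safe_divide(2 * precision * recall, precision + recall)     # Equation (10)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_score": f_score}

# Hypothetical counts for 200 classified e-mails.
metrics = classification_metrics(true_pos=90, true_neg=85, false_pos=15, false_neg=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts the classifier is correct on 175 of 200 e-mails; precision is lower than recall because there are more false positives (15) than false negatives (10).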

Results and discussion

This section discusses the experimental setup and results for e-mail-based cyberstalking detection on textual data. The experiments used the Python language with scikit-learn, imaplib, email, BeautifulSoup, smtplib, NLTK, and other library packages to develop the proposed EBCD model. In the first stage of the experiment, the machine learning classifiers of model set MS-1 were trained, tested, and cross-validated (k-fold) on labeled dataset D1. Table 1 and Figure 6 show the performance of the classifiers of ML model set MS-1. According to the experimental results, the soft voting approach achieved the best accuracy (97.7%), precision (97%), f-score (97.7%), and AUC (99.4%), while logistic regression provided the highest recall (99%). The support vector machine took second position with an accuracy of 97.1%, recall of 98.7%, f-score of 97.1%, and AUC of 99.3%. The overall performance of all classifiers of model set MS-1 was up to the mark, and their AUC values were nearly identical.

Figure 6. Performances of classifiers of ML model set MS-1 on dataset D1.

Table 1. Performance of ML classifiers of ML set MS-1.
Dataset (D1): e-mails Subject
Total Unique Records: 23320
Model Set: Machine Learning Model Set MS-1

ML Classifiers        Accuracy   Precision   Recall     F-Score    AUC
Naive Bayes           0.925443   0.968613    0.878256   0.921194   0.986559
Logistic Regression   0.946884   0.911006    0.989749   0.948730   0.991968
Random Forest         0.969640   0.956672    0.984106   0.969396   0.992305
SVM                   0.971012   0.955813    0.987330   0.971283   0.993824
Soft Voting           0.976844   0.969979    0.983211   0.976550   0.994339

In the second stage of the experiment, the machine learning classifiers of model set MS-2 were trained, tested, and cross-validated (k-fold) on the labeled dataset D2. The performance of the classifiers of ML model set MS-2 is reported in Table 2 and Figure 7. According to the experimental results, the soft voting technique was again the best performer, with the best accuracy (97.7%), recall (96.5%), F-score (97.4%), and AUC (99.7%). The highest precision (98.6%) was provided by the random forest classifier, while SVM was again the second-best performer, with an accuracy of 97.0%, precision of 97.9%, recall of 95.3%, F-score of 96.6%, and AUC of 99.6%. The other classifiers of model set MS-2 also performed well, with scores close to those of the best performer.
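AUC, reported alongside the other metrics in this section, can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The following is a minimal pairwise sketch with hypothetical labels and scores; it is not the evaluation code used in the experiments, and the function name `roc_auc` is an illustrative choice:

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = spam, 0 = normal) and classifier scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.4f}")
```

Because AUC depends only on the ranking of scores, two classifiers with different accuracy at a fixed threshold can still have nearly identical AUC, as observed for the classifiers of each model set.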

Figure 7. Performances of classifiers of ML model set MS-2 on dataset D2.

Table 2. Performance of ML classifiers of ML set MS-2.

Dataset (D2): E-mail body text
Total unique records: 31715
Model set: Machine Learning Model Set MS-2

ML classifier        Accuracy  Precision  Recall    F-score   AUC
Naive Bayes          0.955520  0.958086   0.942925  0.950428  0.992443
Logistic Regression  0.959009  0.977275   0.931027  0.953581  0.993947
Random Forest        0.964391  0.986148   0.934373  0.960268  0.989182
SVM                  0.969646  0.979182   0.953151  0.965982  0.995702
Soft Voting          0.977299  0.983459   0.964977  0.974130  0.996806

In the third stage of the experiment, the machine learning classifiers of model set MS-3 were trained, tested, and cross-validated (k-fold) on the labeled dataset D3. The performance of the classifiers of ML model set MS-3 is reported in Table 3 and Figure 8. The experimental results show that the soft voting technique again provided the best AUC (99.9%), while SVM was the best performer in terms of accuracy (99.0%), precision (99.4%), and F-score (99.1%). Naïve Bayes achieved the highest recall (99.4%). All classifiers of model set MS-3 performed very well, with scores close to those of the best performer. Overall, the soft voting classifier was the best or near-best performer in every model set across all three datasets. It not only improves classification performance but also helps to make the right decision when classifying unlabeled data, because the final label reflects the combined votes of the individual classifiers. Very high scores can sometimes be a symptom of overfitting; the experiments in this paper use k-fold cross-validation to reduce that risk. The soft voting technique may further reduce overfitting and wrong decisions on unknown data, because no single classifier determines the outcome. For example, when classifying an e-mail from the user’s mailbox, one classifier may predict a normal e-mail while the other classifiers predict a cyberstalking e-mail. In this scenario, the soft voting technique combines the classifiers’ votes to decide the final class of the e-mail. Because of these advantages, the multi-model soft voting technique was used for the classification and labeling of the e-mails from the user’s mailbox (unlabeled e-mail).
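Soft voting is commonly implemented by averaging the class probabilities predicted by the individual classifiers and choosing the class with the highest average, whereas hard voting counts one vote per classifier. The sketch below contrasts the two on hypothetical probabilities; the helper names `soft_vote` and `hard_vote` and the example values are assumptions made for illustration, not the implementation used in the EBCD model:

```python
from collections import Counter


def soft_vote(probabilities):
    """Average per-class probabilities across classifiers and pick the top class."""
    classes = list(probabilities[0])
    count = len(probabilities)
    averaged = {c: sum(p[c] for p in probabilities) / count for c in classes}
    label = max(averaged, key=averaged.get)
    return label, averaged


def hard_vote(probabilities):
    """Let each classifier cast one vote for its most probable class."""
    votes = Counter(max(p, key=p.get) for p in probabilities)
    return votes.most_common(1)[0][0]


# Hypothetical outputs of three classifiers for one e-mail.
classifier_outputs = [
    {"normal": 0.55, "cyberstalking": 0.45},  # weakly "normal"
    {"normal": 0.52, "cyberstalking": 0.48},  # weakly "normal"
    {"normal": 0.10, "cyberstalking": 0.90},  # strongly "cyberstalking"
]

soft_label, averaged = soft_vote(classifier_outputs)
hard_label = hard_vote(classifier_outputs)
print("soft voting:", soft_label, averaged)
print("hard voting:", hard_label)
```

In this example two classifiers lean only weakly toward the normal class, so a simple vote count labels the e-mail as normal, while averaging the probabilities lets the one confident classifier tip the decision toward cyberstalking.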

At the end of the experiments, the trained, tested, and validated classifiers of each ML model set were saved as pickle files for later use in the automated cyberstalking detection and filtering of e-mails from the user’s mailbox. In the last experiment, the trained, tested, and validated classifiers were applied to classify the e-mails from the user’s mailbox (as discussed in Algorithm 2 of the methodology section). For experimental purposes, different types of e-mails (spam, fraudulent, cyberstalking, and normal) were sent to the author’s mailbox from different e-mail IDs of the authors using a Python program based on smtplib. Using the EBCD model, a total of 497 e-mails were fetched and classified as cyberstalking e-mails (37.8%), suspicious e-mails (26.4%), and normal e-mails (35.8%). The distribution of the fetched and classified e-mails is shown in Figure 9. The performance of the classifiers of ML model sets MS-2 and MS-3 was measured on the fetched and classified e-mails using a manual “OneVsRest” approach. The fetched and classified e-mails were divided into two datasets, set 1 and set 2. The classifiers of model set MS-2 were tested on set 1, which contains all e-mails of the suspicious and normal classes, while the classifiers of MS-3 were tested on set 2, which contains the cyberstalking and normal classes. The average performance of the classifiers of ML model sets MS-2 and MS-3 is reported in Table 4 and Figure 10. As Table 4 and Figure 10 show, the soft voting technique outperformed the other classifiers in terms of accuracy. The soft voting classifier achieved the highest accuracy of 96.3% and the highest F-score of 95.9%.
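Saving trained classifiers as pickle files and reloading them later can be sketched as follows. A plain dictionary stands in for a trained classifier here; its contents, the file name, and the helper names `save_model` and `load_model` are hypothetical and only illustrate the round trip:

```python
import os
import pickle
import tempfile

# Stand-in for a trained classifier (hypothetical parameters, not a real model).
model = {
    "model_set": "MS-3",
    "bias": -1.0,
    "weights": {"threat": 2.5, "meeting": -0.5},
}


def save_model(obj, path):
    """Serialize an object to a pickle file."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)


def load_model(path):
    """Load an object from a pickle file. Only use this on trusted files."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "ms3_soft_voting.pkl")  # hypothetical file name
    save_model(model, path)
    restored = load_model(path)

print(restored == model)
```

Unpickling can execute arbitrary code, so saved classifiers should only be loaded from files that the detection system itself has written.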

Table 3. Performance of ML classifiers of ML set MS-3.

Dataset (D3): Harassment text
Total unique records: 36804
Model set: Machine Learning Model Set MS-3

ML classifier        Accuracy  Precision  Recall    F-score   AUC
Naive Bayes          0.944608  0.909068   0.993706  0.949491  0.994240
Logistic Regression  0.982285  0.992456   0.973579  0.982923  0.998221
Random Forest        0.981524  0.977701   0.985060  0.982392  0.997325
SVM                  0.990327  0.994357   0.987135  0.990731  0.998615
Soft Voting          0.988697  0.986543   0.991547  0.989039  0.998727

Figure 8. Performances of classifiers of ML model set MS-3 on dataset D3.

Figure 9. Distribution of fetched and classified e-mails.


The highest AUC of 96.8% was obtained by both SVM and soft voting. The highest precision values of 98.5%, 98.1%, and 98.1% were obtained by random forest, soft voting, and the support vector machine, respectively. For recall, the support vector machine, naïve Bayes, and soft voting achieved the highest values of 94.8%, 94.4%, and 94.1%, respectively. The overall performance of all classifiers of the model sets was satisfactory. During the classification and labeling of e-mails from the user’s mailbox, the final decision was taken using the soft voting technique; after that, the performance of all classifiers was measured on the classified e-mails using the Stratified K-Folds cross-validator. Across all experiments, the performance of the support vector machine was notable, but the soft voting technique is the better choice for making the right decision.
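Stratified k-fold cross-validation splits the data so that each fold keeps roughly the same class proportions as the whole dataset. The following is a simplified sketch without shuffling; library implementations differ in detail, and the function name `stratified_kfold` and the label list are assumptions made for illustration:

```python
from collections import defaultdict


def stratified_kfold(labels, k):
    """Yield (train_indices, test_indices) pairs with class-balanced test folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    # Deal the indices of each class round-robin into the k folds.
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)

    for i in range(k):
        test_indices = sorted(folds[i])
        train_indices = sorted(
            index for j, fold in enumerate(folds) if j != i for index in fold
        )
        yield train_indices, test_indices


# Hypothetical labels: 6 cyberstalking and 4 normal e-mails.
labels = ["cyberstalking"] * 6 + ["normal"] * 4
splits = list(stratified_kfold(labels, k=2))
for train_indices, test_indices in splits:
    print("train:", train_indices, "test:", test_indices)
```

Each test fold in this example contains three cyberstalking and two normal e-mails, which mirrors the 60/40 class split of the full list.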

Conclusion and future work

E-mail-based cyberstalkers use e-mail technology for negative and fear-inducing communication. Spamming, e-mail bombing, and general cyberstalking messages are common forms of e-mail-based harassment. Apart from these, cyberstalkers also use several other approaches to target a victim or group over e-mail, which are difficult to detect automatically. This paper proposed an EBCD model that uses the multi-model soft voting technique of the machine learning approach for automatic cyberstalking detection on textual data from a user’s mailbox. Initially, three machine learning model sets containing random forest, support vector machine, naïve Bayes, logistic regression, and soft voting classifiers were trained, tested, and validated through k-fold cross-validation on three different datasets. The classifiers of model set MS-1 were trained, tested, and validated on dataset D1, which contains spam, phishing, and fraudulent e-mail subject lines, so that they can later classify an e-mail by its subject. The classifiers of model set MS-2 were trained, tested, and validated on dataset D2, which contains spam and fraudulent e-mail body text. Model set MS-2 can therefore classify e-mails from the user’s mailbox as spam and can also be used to check for the e-mail bombing and repeated spamming approaches of cyberstalkers. The classifiers of model set MS-3 were trained, tested, and validated on dataset D3, which contains harassment-related data, so that they can be used to check for cyberstalking e-mails in the user’s mailbox.

Table 4. Average performance of ML classifiers of ML model sets on fetched and classified e-mails from user’s mailbox.

Dataset D4: Set 1 (suspicious and normal e-mail) and Set 2 (cyberstalking and normal e-mail)
Fetched e-mail classified and labeled by: Soft Voting Technique of EBCD Model
Total unique e-mails: 497 (cyberstalking e-mail: 37.8%, suspicious e-mail: 26.4%, normal e-mail: 35.8%)

ML classifier        Accuracy  Precision  Recall    F-score   AUC
Naive Bayes          0.879176  0.979588   0.944156  0.898151  0.964610
Logistic Regression  0.910239  0.979637   0.927597  0.914087  0.964462
Random Forest        0.911804  0.984968   0.935390  0.940763  0.964478
SVM                  0.941714  0.981334   0.947727  0.946976  0.968466
Soft Voting          0.963057  0.981431   0.940584  0.959211  0.967925

Figure 10. Average performance of classifiers of ML model sets on fetched and classified e-mails from user’s mailbox.


The performance of the classifiers of each model set was measured using accuracy, precision, recall, F-score, and AUC. The experimental results show that the soft voting technique achieved the best accuracy (97.7%), F-score (97.7%), precision (97.0%), and AUC (99.4%) on dataset D1. The soft voting technique also performed well on dataset D2, with the best accuracy (97.7%), F-score (97.4%), recall (96.5%), and AUC (99.7%). On dataset D3, the soft voting technique achieved the best AUC (99.9%), while its accuracy, precision, and F-score were very close to those of the top-performing classifier (SVM).

Because of its better overall performance, the multi-model soft voting technique was applied for the automated classification and labeling of e-mails from the user’s mailbox. During this classification, the trained, tested, and validated classifiers of model sets MS-1, MS-2, and MS-3 were applied as a combined approach. Based on the final decision of the soft voting classifier of MS-1, MS-2, and MS-3 at each of the three e-mail check posts, e-mails from the user’s mailbox were classified as cyberstalking, suspicious, or normal e-mails. The performance of all classifiers of each model set was then measured on the classified e-mails from the user’s mailbox. The average performance shows that soft voting again performed well, with an accuracy of 96.3% and an F-score of 95.9%, while its precision, recall, and AUC were very close to those of the top-performing classifier. The overall experimental results show that the performance of the support vector machine was notable, but the soft voting technique is the better choice for classifying unlabeled e-mails. The soft voting technique not only improves classification performance for labeled and unlabeled e-mails but also helps to make the right decision about the actual class of an e-mail. The proposed EBCD model performed well and can automatically classify e-mails from the user’s mailbox and support evidence collection. The proposed EBCD model not only detects cyberstalking e-mails automatically but also classifies e-mails as suspicious (spam and fraudulent) based on the textual e-mail data.
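The combined use of the three model sets can be illustrated with a small sketch. The order of the checks, the 0.5 threshold, and the keyword-based stand-in scorers below are all assumptions made for illustration; the actual procedure is the one given in Algorithm 2 of the methodology section, and the real check posts use the trained soft voting classifiers rather than keyword matching:

```python
def keyword_scorer(keywords):
    """Build a toy scorer that returns 1.0 if any keyword occurs in the text."""
    def score(text):
        lowered = text.lower()
        return 1.0 if any(keyword in lowered for keyword in keywords) else 0.0
    return score


def classify_email(subject, body, ms1, ms2, ms3, threshold=0.5):
    """Label an e-mail using three check posts (assumed priority order).

    ms1 scores the subject, ms2 and ms3 score the body; each returns the
    probability of its positive class.
    """
    if ms3(body) >= threshold:
        return "cyberstalking"
    if ms1(subject) >= threshold or ms2(body) >= threshold:
        return "suspicious"
    return "normal"


# Hypothetical stand-ins for the soft voting classifiers of MS-1, MS-2, and MS-3.
ms1 = keyword_scorer(["free prize", "lottery"])
ms2 = keyword_scorer(["claim your reward", "wire transfer"])
ms3 = keyword_scorer(["hurt", "watching you"])

examples = [
    ("Win a free prize", "Click the link to claim your reward"),
    ("Hello", "I am watching you and I will hurt you"),
    ("Project meeting", "See the attached agenda"),
]
results = [classify_email(subject, body, ms1, ms2, ms3) for subject, body in examples]
print(results)
```

Checking the harassment model first in this sketch means that an e-mail that is both spam-like and threatening is reported as cyberstalking, which is one possible design choice among several.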

Further, the EBCD model also helps to detect basic intent-wise cyberstalking e-mails sent through repeated spamming and e-mail bombing. However, advanced intent-wise e-mail-based cyberstalking detection, including the image spam approach of cyberstalking, is more complex than content-wise e-mail-based cyberstalking detection. Future work includes the design and development of an enhanced EBCD model for detecting advanced intent-wise cyberstalking performed through phishing, malicious, defamatory, spoofed, and image-spam e-mails, so that advanced intent-wise cyberstalking involving fake e-mails, identity theft, and personal or financial loss can be detected automatically. Future work also includes the design and development of the EBCD model using deep learning techniques and a comparison of the currently proposed EBCD model with ANN, LogitBoost, XGBoost, LSTM, and GRU models.

Disclosure statement

No potential conflict of interest was reported by the authors.

ORCID

Arvind Kumar Gautam http://orcid.org/0000-0001-6057-1006

Abhishek Bansal http://orcid.org/0000-0001-5968-3625


20 A. K. GAUTAM AND A. BANSAL

Automatic Cyberstalking Detection on Twitter in Real-Time using Hybrid Approach

Article

Full-text available

Feb 2023

Many people are using Twitter for thought expression and information sharing in real-time. Twitter is one of the trendiest social media applications that cybercriminals also widely use to harass the victim in the form of cyberstalking. Cyberstalkers target the victim through sexism, racism, offensive language, hate language, trolling, and fake accounts on Twitter. This paper proposed a framework for automatic cyberstalking detection on Twitter in real-time using the hybrid approach. Initially, experimental works were performed on recent unlabeled tweets collected through Twitter API using three different methods: lexicon-based, machine learning, and hybrid approach. The TF-IDF feature extraction method was used with all the applied methods to obtain the feature vectors from the tweets. The lexicon-based process produced maximum accuracy of 91.1%, and the machine learning approach achieved maximum accuracy of 92.4%. In comparison, the hybrid approach achieved the highest accuracy of 95.8% for classifying unlabeled tweets fetched through Twitter API. The machine learning approach performed better than the lexicon-based, while the performance of the proposed hybrid approach was outstanding. The hybrid method with a different approach was again applied to classify and label the live tweets collected by Twitter Streaming in real-time. Once again, the hybrid approach provided the outstanding result as expected, with an accuracy of 94.2%, recall of 94.1%, the precision of 94.6%, f-score of 94.1%, and the best AUC of 98%. The performance of machine learning classifiers was measured in each dataset labeled by all three methods. Experimental results in this study show that the proposed hybrid approach performed better than other implemented approaches in both recent and live tweets classification. The performance of SVM was better than other machine learning algorithms with all applied approaches.

Evaluating Online Sexism Detection: A Comparative Study of Machine Learning Models using the EDOS Dataset

Conference Paper

Apr 2024

Recent Advancements in Machine Learning for Cybercrime Prediction

Article

Full-text available

Oct 2023

Cyberstalking: Consequences and Coping Strategies to Improve Mental Health

Chapter

Jun 2023

Cyberstalking is one of the most widespread threats on digital platforms. It has included many forms of direct threats via email, online distribution of intimate photographs, seeking information about victims, harassment, and catfishing. The consequences of cyberstalking may lead to psychological problems such as mental health, distress, victim experiencing feelings of isolation, guilt, adverse effects on life activity. These psychological problems may further lead to reports of serious health issues such as anger, fear, suicidal ideation, depression, and post-traumatic stress disorder (PTSD). However, there are many coping strategies such as avoidant coping, ignoring the perpetrator, confrontational coping, support seeking, and cognitive reframing. In spite of these methods, awareness of preventive measures of cyberstalking may further help to overcome mental stress. In this chapter, the authors have pointed out the various psychological issues due to cyberstalking and further discuss their solutions through preventing or automatic detection methods inspired by machine learning approaches.

Automatic Cyberstalking Detection on Twitter in Real-Time using Hybrid Approach

Article

Full-text available

Feb 2023

Hybrid Features by Combining Visual and Text Information to Improve Spam Filtering Performance

Article

Full-text available

Jun 2022

The development of information and communication technology has created many positive outcomes, including convenience for people; however, cases of unsolicited communication, such as spam, also occur frequently. Spam is the indiscriminate transmission of unwanted information by anonymous users, called spammers. Spam content is transmitted to users in various forms, such as SMS, e-mail, and social network service posts, causing negative experiences for users of the service while also creating costs, such as unnecessarily large amounts of network traffic. In addition, spam content includes phishing, hype or false advertising, and illegal content. Recently, spammers have also used images that contain stimulating content to attract users' curiosity and attention. Image spam contains more complex information than text, making it more difficult to analyze and to generalize its properties. Therefore, existing text-based spam detectors are vulnerable to spam image attacks, resulting in a decline in service quality. In this paper, a method that combines visual and text information into hybrid features is proposed to improve spam filtering performance and reduce the occurrence of misclassification. The proposed method employs three sub-models to extract features from spam images and a classifier model to output the results using the features. The sub-models extract topic-, word-, and image-embedding-based features from spam images, respectively. In addition, the sub-models use optical character recognition (OCR), latent Dirichlet allocation, and Word2Vec techniques to extract features from images. To evaluate spam image classification performance, the spam classifiers were trained using the extracted features and the results were measured using a confusion matrix. Our model achieved an accuracy of 0.9814 and a macro-F1 score of 0.9813. In addition, the application of OCR evasion techniques resulted in a decrease in recognition performance. Using the proposed model, a mean macro-F1 score of 0.9607 was obtained.
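
The abstract above reports accuracy and macro-F1 derived from a confusion matrix. As a minimal sketch of how those two figures are computed (the function names and the example matrix are illustrative assumptions, not taken from the paper):

```python
def accuracy(confusion):
    """Fraction of samples on the diagonal of the confusion matrix.

    confusion[i][j] = number of samples of true class i predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


def macro_f1(confusion):
    """Unweighted mean of the per-class F1 scores."""
    n = len(confusion)
    f1_scores = []
    for k in range(n):
        tp = confusion[k][k]
        fp = sum(confusion[i][k] for i in range(n)) - tp  # predicted k, true class differs
        fn = sum(confusion[k]) - tp                       # true k, predicted otherwise
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / n


# Hypothetical two-class matrix (rows: true ham, true spam), for illustration only.
example = [[8, 2],
           [1, 9]]
example_accuracy = accuracy(example)
example_macro_f1 = macro_f1(example)
```

Macro-F1 weights every class equally, so it is less flattering than accuracy when one class is rare and poorly detected.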

A Review on Cyberstalking Detection Using Machine Learning Techniques: Current Trends and Future Direction

Article

Full-text available

Mar 2022

Social media platforms and other web applications, for example, WhatsApp, Facebook, YouTube, Instagram, and Twitter, have become popular among individuals for data sharing, live events, news, and publicity, but also for cybercrimes. The use of online media platforms also raises major issues through cyberstalking, cyberbullying, and other kinds of digital harassment. The terms cyberstalking and cyberbullying are frequently used interchangeably and involve the use of the web to follow or target somebody in the online world. Cyberstalking is a critical worldwide issue that affects educational institutions, victims, and the whole of human society, and it should be identified, reported, and controlled appropriately for the security of users of online media. Machine learning is the most widely used method for building cyberstalking detection models. Researchers have recommended different detection procedures using machine learning to control and combat cyberstalking in online media. In this paper, the study covers some popular feature extraction methods and machine learning classifiers for text classification and explores the datasets used by the researchers. The study also focuses on determining the research gaps and the scope for improving cyberstalking detection. This paper reviews some cyberstalking detection techniques using machine learning, analyzes the performance of popular machine learning classifiers, and finally explores the issues, challenges, recent trends, and future direction for cyberstalking detection techniques.

PERFORMANCE ANALYSIS OF SUPERVISED MACHINE LEARNING TECHNIQUES FOR CYBERSTALKING DETECTION IN SOCIAL MEDIA

Article

Full-text available

Jan 2022

Today, people use many social media sites to share information with friends, relatives, and others for personal, business, and official purposes. The use of social media platforms is also raising serious issues in the form of cyberstalking. Cyberstalking has been identified as a growing antisocial problem that affects educational institutions, victims, and the whole of human society. An intelligent system is required to detect cyberstalking in social media. In this paper, we proposed a cyberstalking detection model and analyzed the performance of six popular supervised machine learning algorithms, namely Logistic Regression, Support Vector Machines (SVM), Random Forest, Decision Trees, K-Nearest Neighbor, and Naive Bayes. These machine learning algorithms were implemented with two feature extraction methods, Bag of Words (BOW) and TF-IDF, on two datasets of different sizes and distributions containing 35734 and 70019 comments and tweets, respectively. The performance of the algorithms was measured in terms of Accuracy, Precision, Recall, F-Score, training time, and prediction time. Our experimental results show that Logistic Regression and Support Vector Machine were the top-performing algorithms for both datasets with both feature extraction methods. Logistic Regression (92.6% with BOW and 92% with TF-IDF) and Support Vector Machine (92.5% with TF-IDF and 91.9% with BOW) achieved the highest accuracy on dataset-1. Logistic Regression and Support Vector Machine also achieved the highest Precision (96.4% and 96.6% respectively) and F-Score (94.3% and 93.8% respectively), while Naïve Bayes provided the best Recall (97.6% with TF-IDF on dataset-1) for both datasets.
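
The two feature extraction methods named in the abstract above, Bag of Words and TF-IDF, can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions: whitespace tokenization, term frequency normalized by document length, and an unsmoothed `log(N / df)` inverse document frequency, which is one common variant and not necessarily the one the authors used. The function names are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return one term-count mapping per document (lowercased, whitespace tokens)."""
    return [Counter(doc.lower().split()) for doc in docs]


def tf_idf(docs):
    """Return one {term: weight} mapping per document.

    weight = (count / document length) * log(N / document frequency)
    """
    counts = bag_of_words(docs)
    n_docs = len(docs)
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())  # each document counts a term at most once
    weights = []
    for c in counts:
        length = sum(c.values())
        weights.append({
            term: (freq / length) * math.log(n_docs / doc_freq[term])
            for term, freq in c.items()
        })
    return weights


# Hypothetical two-comment corpus, for illustration only.
corpus = ["you are great", "you are awful"]
bow = bag_of_words(corpus)
weights = tf_idf(corpus)
```

Terms that occur in every document ("you", "are") receive a TF-IDF weight of zero in this variant, while Bag of Words keeps their raw counts; that difference is what the classifiers see when the two methods are compared.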

A MACHINE LEARNING FRAMEWORK FOR DETECTION AND DOCUMENTATION OF CYBERSTALKING ON NON-SPAM EMAIL

Article

Full-text available

Mar 2021

Cyberstalking is growing as a social and international problem and creating a pandemic-like situation for users of internet applications. Because of the heavy use of Internet technology today, cyberstalking has become a major fear for users, society, and institutions. As with social media, cyberstalkers are using email technology to target victims. Email is a widely used internet application and is very popular for sharing information among people and organizations for personal, business, and official purposes. Generally, cybercriminals use fake email IDs, either from popular email service providers or from fake email service providers, to perform cybercrimes such as phishing, spamming, and cyberstalking. Victims were mostly targeted through spam email, but in recent trends non-spam email is also used by criminals for cyberstalking and cyberbullying. Victims can be easily targeted by cyberstalkers using non-spam email because cyberstalkers often use fake email IDs and messages that are difficult to block and filter as spam. Filtration, detection, and proper evidence documentation of non-spam email-based cyberstalking are challenging and interesting tasks for researchers. In this paper, we propose a Machine Learning framework to filter, detect, and collect cyberstalking evidence on textual data of non-spam emails.

SeFACED: Semantic-based Forensic Analysis and Classification of E-Mail Data using Deep Learning

Article

Full-text available

Mar 2022

Artificial Intelligence (AI), in combination with the Internet of Things (IoT), called AIoT, is an emerging trend in industrial applications that is capable of intelligent decision-making with self-driven analytics. With their extensive usage in diverse scenarios, IoT devices generate bulk data contrived by attackers to disrupt normal operations and services. Hence, there is a need for proactive data analysis to prevent cyber-attacks and crimes. To investigate crimes involving Electronic Mail (e-mail), analysis of both the header and the email body is required, since the semantics of communication help to identify the source of potential evidence. With the continued growth of data shared via emails, investigators now face the daunting challenge of extracting the required semantic information from bulks of emails, which delays the investigation process. This gives criminals an edge in erasing the footprints of their malicious acts. The existing keyword-based search techniques and filtration often result in extraneous, short-sequence emails, which skip meaningful information. To overcome the above limitation, we propose a novel efficient approach named SeFACED that uses Long Short-Term Memory (LSTM) based Gated Recurrent Neural Network (GRU) for multiclass email classification. SeFACED works not only on short sequences but also with long dependencies of 1000+ characters. SeFACED focuses on tuning LSTM based GRU parameters to attain the best performance and is assessed by comparing it with traditional machine learning, deep learning models, and state-of-the-art studies on the subject. Experimental results on self-extended benchmark datasets show that SeFACED effectively outperforms existing methods while keeping the classification process robust and reliable.

Cyber Crime and Its Classification

Article

Sep 2021

Osman Goni

HELPHED: Hybrid Ensemble Learning PHishing Email Detection

Article

Nov 2022

Phishing email attacks have been a dominant cyber-criminal strategy for decades. Despite its longevity, the attack evolved during the COVID-19 pandemic, indicating that adversaries exploit critical situations to lure victims. Plenty of detectors have been proposed over the years, which mainly focus on the contents or the textual information of emails; however, to cope with the evolution of phishing emails, more sophisticated approaches should be introduced that exploit all the emails' traits to enhance the detection capability of Machine Learning/Deep Learning classifiers. To tackle the limitations of existing works, this paper proposes a phishing email detection methodology, named HELPHED, that focuses on the detection of phishing emails by combining Ensemble Learning methods with hybrid features. The hybrid features provide an accurate representation of emails by fusing their content and textual traits. We propose two methods of HELPHED: the first employs the Stacking Ensemble Learning method, while the second utilizes Soft Voting Ensemble Learning. Both methods deploy two different Machine Learning algorithms to handle the hybrid features separately, yet in parallel, minimizing the features' complexity and improving the model's performance. A thorough evaluation analysis is carried out considering innovative guidelines that aim to prevent partial and misleading results. Experimental tests verified that the combination of hybrid features with Ensemble Learning, overall, accomplishes better detection performance than employing only content-based or text-based features. Numerical results on a rich imbalanced dataset (i.e., 32,051 benign and 3,460 phishing email samples) that considers the evolution of phishing emails show that Soft Voting Ensemble Learning outperforms other prominent Machine Learning/Deep Learning algorithms and existing works, yielding an F1-score equal to 0.9942.
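
Soft voting, as named in the abstract above, averages the class probabilities produced by several classifiers and picks the class with the highest average, rather than counting each classifier's hard label. The following is a generic sketch of that idea, not HELPHED's implementation; the function name, the optional weights, and the probability values are assumptions made for illustration.

```python
def soft_vote(probabilities, weights=None):
    """Combine per-classifier class probabilities by (weighted) averaging.

    probabilities: list of dicts, one per classifier, mapping label -> probability.
    weights: optional list of non-negative classifier weights (defaults to equal).
    Returns (winning_label, averaged_probabilities).
    """
    if weights is None:
        weights = [1.0] * len(probabilities)
    total_weight = sum(weights)
    labels = list(probabilities[0].keys())
    averaged = {
        label: sum(w * p[label] for w, p in zip(weights, probabilities)) / total_weight
        for label in labels
    }
    winner = max(averaged, key=averaged.get)
    return winner, averaged


# Hypothetical outputs of three classifiers for one email.
outputs = [
    {"phishing": 0.90, "benign": 0.10},  # one confident classifier
    {"phishing": 0.40, "benign": 0.60},
    {"phishing": 0.45, "benign": 0.55},
]
label, averaged = soft_vote(outputs)
```

In this example a majority (hard) vote would choose "benign" because two of the three classifiers lean that way, whereas soft voting lets the single confident classifier tip the averaged probability toward "phishing".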

Effect of Features Extraction Techniques on Cyberstalking Detection Using Machine Learning Framework

Article

Sep 2022

Abhishek Bansal

Various cybercriminals are active with predefined and preplanned agendas to carry out cybercrimes in the Internet world. Cyberstalking, cyberbullying, cyber terrorism, cyber hacking, data leakage, identity theft, phishing, and other types of cyber harassment continually occur in the virtual world. Cyberstalking and cyberbullying are close in content and intent, involving the same internet-based technology to harass, bully, and undermine others online. This paper implemented a cyberstalking detection model and analyzed the effect of various feature extraction techniques on different machine learning classifiers for cyberstalking detection. For feature extraction, the proposed model applied Word2Vec, BOW, TF-IDF, FastText, GloVe, ELMo, and BERT. Logistic Regression (LR), Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Random Forest (RF), Naive Bayes (NB), and Decision Tree (DT) were used for classification. The effect of each feature extraction method on the performance of the detection model was determined from the performance results of the applied classifiers with each feature extraction process. Experimental results show that BOW and TF-IDF outperformed advanced word embedding-based feature extraction methods. BOW (for LR) achieved the highest accuracy of 95.7%, highest precision of 97.9%, and highest F-Score of 97.3%. TF-IDF achieved the highest recall of 99.8% for NB. The SVM classifier achieved the second-highest accuracy of 95.2% with TF-IDF. The BERT model obtained maximum accuracy of 90.9% and 90.7% for LR and SVM, respectively. The ELMo model also performed well and produced maximum accuracy of 90.5% and 90.2% for LR and SVM, respectively. The SkipGram model of Word2Vec provided an accuracy of 85% for the LR classifier. GloVe provided 81.2% accuracy for the RF classifier. The SkipGram and CBOW models of FastText provided 85.7% and 82.2% accuracy, respectively, for the RF classifier.

Spoofing E-Mail Detection Using Stacking Algorithm

Conference Paper

Apr 2022

Recommended publications

Effect of Features Extraction Techniques on Cyberstalking Detection Using Machine Learning Framework

Impact analysis of feature selection techniques on cyberstalking detection

Automatic Cyberstalking Detection on Twitter in Real-Time using Hybrid Approach

A MACHINE LEARNING FRAMEWORK FOR DETECTION AND DOCUMENTATION OF CYBERSTALKING ON NON-SPAM EMAIL