Email-Based Cyberstalking Detection On Textual Data Using Multi-Model Soft Voting Technique Of Machine Learning Approach

ABSTRACT

In the virtual world, large numbers of people use internet applications for many purposes. These applications have become a basic need of modern life and a matter of habit across society. Like social media, e-mail is widely used by people of all categories for personal and official communication. The widespread use of e-mail has also given rise to various cybercrimes, including cyberstalking. Cyberstalkers use e-mail to harass victims through several content-wise and intent-wise approaches, such as spamming, phishing, spoofing, malicious and defamatory e-mails, e-mail bombing, and non-spam e-mails containing sexism, racism, and threats, and finally by trying to hack the victim’s e-mail account. This paper proposes an EBCD model for automatic cyberstalking detection on the textual data of e-mails using the multi-model soft voting technique of the machine learning approach. Initially, experiments were performed to train, test, and validate all classifiers of three model sets on three different labeled datasets. Dataset D1 contains spam, fraudulent, and phishing e-mail subject lines, dataset D2 contains spam e-mail body text, and dataset D3 contains harassment-related data. The trained, tested, and validated classifiers of all model sets were then applied in combination to automatically classify unlabeled e-mails from the user’s mailbox using the multi-model soft voting technique. The proposed EBCD model classifies the e-mails from the user’s mailbox into cyberstalking e-mails, suspicious e-mails (spam and fraudulent), and normal e-mails. Each model set of the EBCD model uses several classifiers, namely support vector machine, random forest, naïve Bayes, logistic regression, and soft voting. The final decision in classifying the e-mails from the user’s mailbox is taken by the soft voting technique of each model set. The TF-IDF feature extraction method was used with all of the applied machine learning model sets to obtain feature vectors from the data. Experimental results show that the soft voting technique not only enhances the performance of the e-mail classification task but also helps to avoid wrong classifications. The overall performance of the soft voting technique was better than that of the other classifiers, although the performance of the support vector machine was also notable. In the experiments, the soft voting technique obtained an accuracy of 97.7%, 97.7%, 98.9%, a precision of 97%, 98.3%, 98.6%, a recall of 98.3%, 96.5%, 99.1%, an f-score of 97.6%, 97.4%, 98.9%, and an AUC of 99.4%, 99.7%, 99.9% on datasets D1, D2, and D3, respectively. The average performance of soft voting across the model sets on the classified e-mails from the user’s mailbox was also notable, with an accuracy of 96.3%, precision of 98.1%, recall of 94%, f-score of 95.9%, and AUC of 96.8%.
Journal of Computer Information Systems
Journal homepage: https://www.tandfonline.com/loi/ucis20
Arvind Kumar Gautam & Abhishek Bansal
To cite this article: Arvind Kumar Gautam & Abhishek Bansal (2023): Email-Based Cyberstalking
Detection On Textual Data Using Multi-Model Soft Voting Technique Of Machine Learning
Approach, Journal of Computer Information Systems, DOI: 10.1080/08874417.2022.2155267
To link to this article: https://doi.org/10.1080/08874417.2022.2155267
Published online: 17 Jan 2023.
Indira Gandhi National Tribal University, Amarkantak, India
KEYWORDS
e-mail cyberstalking; cyberstalking detection; cyberbullying; machine learning; spam detection; soft voting; TF-IDF; support vector machine; naive bayes; logistic regression; random forest
CONTACT Arvind Kumar Gautam, analyst.igntu@gmail.com, Department of Computer Science, Indira Gandhi National Tribal University, Amarkantak, Distt. - Anuppur, MP 484886, India
© 2023 International Association for Computer Information Systems
Introduction
With the growth and popularity of internet technology, e-mail (electronic mail) has become an essential medium for person-to-person and person-to-group communication. E-mail platforms are used not only for communication but also for storage, which has grown exponentially over the years. Regular e-mail users generally keep about half of their basic and critical information in e-mail storage [1,2]. E-mail is among the most suitable applications for sharing personal, official, business, and confidential information over the internet. Many organizations and individuals use e-mail to share documents, exchange messages, and send urgent news, updates, and notifications. Several e-mail service providers offer e-mail services for personal and business purposes, either free or on a subscription basis. Some of the best-known providers are Gmail, Microsoft Hotmail and Outlook, Yahoo, iCloud, AOL, GMX, ProtonMail, Yandex Mail, Tutanota, and Zoho Mail. According to Statista [3], more than 4.1 billion users worldwide use e-mail through different electronic devices and e-mail client software. E-mail is not limited to personal and official use; cybercriminals also use it widely to commit cybercrimes such as phishing, spamming, hacking, spoofing, e-mail bombing, and cyberstalking [4]. E-mail is the second most used application and the third most common source of cyberstalking and other cyber harassment over the internet [2,4].
Although authors define cyberstalking differently, it is broadly a form of online harassment that uses technology to target individuals or groups. Cyberstalking and cyberbullying are two challenging forms of online abuse that are close in content and intent and use the same internet-based technology to harass, bully, and undermine others online. Cyberstalking consists of systematic, repeated cyber-attacks that may occur on multiple occasions [5–8]. It may be classified into e-mail stalking, internet stalking, computer stalking, phone stalking, and automated stalking [8,9]. Cyberstalking is a dangerous and convoluted cybercrime that targets and affects numerous people, communities, and organizations [10]. Cybercriminals apply several approaches to target victims, such as sending e-mails containing phishing links, viruses, and threatening, fraudulent, or harassing content, e-mail bombing, sharing the victim’s private information, and finally trying to hack the e-mail account. Cyberstalkers often use e-mail with predefined plans and agendas to insult, abuse, and harass the victim through repeated acts of sexism, racism, offense, abuse, hate, and fake news sent from real or counterfeit accounts. Although such e-mail-based methods are mainly used for other types of e-mail crime, their use in cyberstalking incidents cannot be ignored. Some e-mail-based methods applied by cyberstalkers are presented in Figure 1.
Figure 1. Different methods of e-mail-based cyberstalking.
Spam is the fraudulent, and often criminal, sending of unsolicited and unwanted messages, such as phishing, false advertising, harassment, and illegal content, from an infected device or to multiple addresses at once [11]. According to DataProt [12], as of March 2022 approximately 85% of e-mails worldwide were filtered as spam, including 36% sent for advertising, while 31.7% of all spam messages were sent for adult-related and harassment purposes. A phishing e-mail is a scam that is more dangerous than general spam and is sent by cybercriminals with fraudulent and harassing intent. Cybercriminals identify the victim’s interests and send customized phishing e-mails, made to look as if they come from a legitimate, reliable source, to a specific person or group in order to steal personal and financial information [13]. Cybercriminals use different types of phishing e-mails with predefined objectives, containing harmful hyperlinks, fake website links, malware, and cloned IDs and content, to steal private information, hack accounts, control the victim’s devices, and undermine and harass victims. Malicious e-mails are another approach that cybercriminals use, like phishing e-mails, to try to access victims’ private information. Malicious e-mails contain attachments such as documents, PDFs, hyperlinks, e-files, and voicemails that initiate an attack on the user’s devices [14]. Such attachments can install malware that destroys data, steals information, takes control of the user’s computer, accesses the screen, captures keystrokes, and accesses other network systems. Cyberstalkers often use malicious e-mails to target known users. E-mail spoofing is another harmful technique that cybercriminals use when sending spam and phishing e-mails to trick users into thinking a message came from a trusted, well-known person or organization. In e-mail spoofing, cybercriminals forge the message header and send malicious links and malware attachments so that the client application displays the falsified sender address and the receiver trusts the message [15]. Cybercriminals create spoofed e-mails using display names, legitimate domains, and lookalike domains. E-mail spoofing is mainly used for phishing, identity theft, evading spam filters, anonymity, and harassment.
In e-mail bombing, cyberstalkers repeatedly send large, unnecessary, and meaningless e-mail messages to the victim’s e-mail address to consume large amounts of system and network resources (such as internet bandwidth and storage space) for harassment purposes [16]. Adversaries also compose e-mail bombing messages automatically using computer programs. Sometimes the adversary sends a controversial or official statement to a large audience using the victim’s e-mail address as the return address, so that recipients read it and reply individually and the victim’s account is eventually flooded with replies. Another dangerous approach is to subscribe the victim’s e-mail address to many sexual sites and other mailing lists so that the victim regularly receives unwanted automatic e-mails. Defamatory e-mail is a cyber defamation technique that cyberstalkers often use to send false information about a person or organization in order to destroy that person’s or organization’s reputation. Defamatory e-mails are sent to different recipients either accidentally or deliberately, so the resulting harm may be unintentional or intentional, which complicates the matter [17]. Sometimes cyberstalkers also send defamatory e-mails containing false and sexual information about the victim to the victim’s relatives to damage the victim’s public image. Cases of defamatory e-mail are increasing steadily and are very difficult to detect.
E-mails that do not fall into any category of spam or fraudulent e-mail and look legitimate are called non-spam e-mails. Cyberstalkers use non-spam e-mails, including sexually abusive, fake, threatening, and other harassing e-mails, to target victims in a planned way. In the non-spam methods used by cyberstalkers, a set of temporary e-mail IDs is created on well-known or sometimes suspicious e-mail servers, and stalking-related messages are then sent to the victim regularly from these IDs. If the victim blocks the sender’s e-mail ID or files a police complaint, the cyberstalker switches to other temporary e-mail IDs. Threatening e-mail is used by cyberstalkers and scammers mainly to blackmail the victim. In threatening e-mails, cyberstalkers repeatedly threaten to publish private information, or fake or factual sexual information, among the victim’s colleagues or relatives (friends and family) unless the victim fulfills their demands. Threatening e-mail is more common in the cyberstalking of women by ex-partners or friends, for financial cheating or personal enmity. Sometimes cyberstalkers send fake e-mails to victims or their relatives containing false information, a fake sender name (often the name of a person or organization well known to the victim), or counterfeit domains to harass the victim intensely. Such fake e-mails pass as genuine with respect to e-mail filtration policy, domain name, and sender name, contain no harmful links, and are therefore very difficult to distinguish from legitimate e-mails. Cyberstalkers and other cybercriminals often try to hack the e-mail IDs of victims or their family members so that the victims can be harassed further. Cybercriminals use general approaches such as phishing and spoofing to hack victims’ e-mail IDs. Keylogging (software and hardware keyloggers that capture every keystroke a user makes), pharming (a fake website that looks legitimate and collects usernames and passwords), automated script-based programs, suspicious mobile apps, gaming applications, sexual site hyperlinks, and password guessing and resetting are other powerful methods used to hack e-mail IDs [18].
Researchers generally focus on classifying e-mail as spam or non-spam, but non-spam e-mail is not always safe and crime-free; it is also used for cyberstalking, which cannot be ignored. Researchers have proposed various content-based and rule-based techniques for spam filtering and detection. Content-based methods focus mainly on content features, while rule-based methods use model-based approaches with predefined rules and blacklist and whitelist mechanisms for spam e-mail classification [19,20]. Reputed e-mail service providers (Gmail, Outlook, and Yahoo) generally filter e-mails with spam and other harmful e-mails as the primary target but do not focus on filtering harassing e-mails. Cyberstalking is a critical cybercriminal activity, and relatively few technical solutions exist to combat and control it. Detection of cyberstalking, especially early and automated detection, is another major challenge. An intelligent cyberstalking detection model is required to automatically classify the e-mails in a user’s mailbox and so handle distressing cyberstalking incidents on the e-mail platform. Sentiment analysis using machine learning techniques plays a vital role in text analysis and in scoring e-mail content to classify it as positive or negative text [21]. Most researchers focus on cyberstalking detection on social media platforms, while e-mail-based cyberstalking detection has been less explored. There is still much scope for e-mail-based cyberstalking detection that can automatically filter cyberstalking e-mails from the user’s mailbox. The main research objective of this paper is to train and test different machine learning model sets on different datasets (spam and harassment) and then automatically filter the e-mails from the user’s mailbox into cyberstalking, suspicious, and normal e-mails. This study uses the multi-model soft voting technique of the machine learning approach to design and develop an improved automated e-mail-based cyberstalking detection model on textual data. The significant contributions of this study are as follows.
• We designed and developed an automated, efficient model named EBCD that filters e-mails into cyberstalking e-mails, suspicious e-mails (spam and fraudulent), and normal e-mails, using the multi-model soft voting technique of the machine learning approach to achieve the best performance in e-mail-based cyberstalking detection on textual data.
• The proposed EBCD model can classify and label e-mails automatically in real time with high accuracy and can gather useful information from the e-mails in the user’s mailbox, which can be used for further training of machine learning models and as evidence.
• The proposed EBCD model can be used with any mailbox that provides an e-mail fetching API or IMAP service.
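To make the IMAP-based fetching mentioned in the last contribution concrete, the following is a minimal sketch using only Python's standard library. It is not the authors' implementation: the function names, the host and credential parameters, and the sample message are assumptions chosen for illustration. The parsing step runs on raw message bytes, so it can be tried without a mail server.

```python
from __future__ import annotations

import imaplib
from email import message_from_bytes, policy


def extract_subject_and_body(raw: bytes) -> tuple[str, str]:
    """Return (subject, plain-text body) of one raw RFC 822 message."""
    msg = message_from_bytes(raw, policy=policy.default)
    subject = str(msg["Subject"] or "")
    # Prefer the text/plain part; HTML-only messages yield an empty body here.
    part = msg.get_body(preferencelist=("plain",))
    body = part.get_content() if part is not None else ""
    return subject.strip(), body.strip()


def fetch_raw_messages(host: str, user: str, password: str,
                       limit: int = 20) -> list[bytes]:
    """Fetch up to `limit` of the newest INBOX messages (hypothetical helper)."""
    with imaplib.IMAP4_SSL(host) as imap:
        imap.login(user, password)
        imap.select("INBOX", readonly=True)  # read-only: flags stay untouched
        _, data = imap.search(None, "ALL")
        newest_ids = data[0].split()[-limit:]
        raw_messages = []
        for num in newest_ids:
            _, parts = imap.fetch(num, "(RFC822)")
            raw_messages.append(parts[0][1])
        return raw_messages


# Made-up message used only to exercise the parser.
SAMPLE = (
    b"From: someone@example.com\r\n"
    b"To: user@example.com\r\n"
    b"Subject: Final warning\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"You will regret ignoring me.\r\n"
)

subject, body = extract_subject_and_body(SAMPLE)
print(subject, "|", body)
```

Under these assumptions, the subject string would go to a subject-line model set and the body string to the body-text model sets; how the paper routes them is described in its methodology, not here.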
The rest of the paper is structured as follows. Section 2 presents notable and recent contributions in the related field as a literature review. Section 3 describes the materials and the proposed methodology. Section 4 presents the experimental setup, results, and a detailed discussion. Finally, Section 5 gives the conclusion and future work.
Review of literature
For the literature survey, related research papers were selected to examine the contributions of past work on the automatic detection of cyberbullying, cyberstalking, and other cyber harassment. Researchers have suggested several techniques for designing and developing cyberstalking detection models on different online platforms. Burmester Henry et al. [22] proposed a monitoring framework for tracking cyberstalkers using a cryptographic approach. The authors claimed that the framework would be able to record cyberstalking-related data on the victim’s computer. Aggarwal S. et al. [23] developed the Predator and Prey Alert (PAPA) system to help law enforcement. The PAPA system records every screen event on a victim’s device during a session. It requires special software and hardware on the victim’s side and raises a secrecy issue. The PAPA system also did not perform well at filtering and detecting cyberstalking e-mails and could not handle text-based cyberstalking. Onan et al. [24] suggested a topic extraction model for bibliometric data analysis using several improved word embeddings with cluster analysis, and developed sentiment analysis models [25] using machine learning, ensemble learning, and deep learning methods for educational data mining. Gautam et al. [9] reviewed various cyberstalking and cybercrime detection techniques and found that machine learning techniques are widely used in single, ensemble, and hybrid approaches. Onan et al. [26] proposed a model based on a three-layer stacked bidirectional long short-term memory architecture for detecting sarcastic text documents on social media and subsequently suggested a deep learning-based model using several word embedding models [27] and another deep learning-based model using several weighted word embedding models [28] for sentiment analysis of product reviews on Twitter. Another machine learning and deep learning-based model proposed by Onan et al. [29] uses several unsupervised and supervised term-weighting models, namely TF-IDF, word2vec, FastText, and GloVe. Machine learning classifiers play a vital role in building cyberstalking detection models, used either singly or in multi-model ensemble and hybrid approaches. Gautam et al. [30] analyzed the performance of several popular machine learning classifiers on datasets of different sizes for cyberstalking detection. In the literature, researchers have mainly focused on building cyberstalking detection models for social media platforms. Zhang et al. [31] suggested a machine learning-based automated cyberbullying detection model for detecting bullying tweets on Twitter. The authors experimented with various machine learning models using multiple textual features and reported a maximum accuracy of 90%. Liew et al. [32] suggested an automated security alert model using supervised machine learning techniques to detect and control phishing tweets on Twitter in real time. Nimisha et al. [33] presented another automated model for cyberstalking detection on social media using machine learning and natural language processing. Their model mainly focuses on identifying the cyberstalker online and detecting cyberstalking incidents. Gautam et al. [34] designed and developed another enhanced model for automated real-time cyberstalking detection on Twitter using a hybrid approach inspired by machine learning. The authors experimented on live tweets in real time using lexicon-based, machine learning-based (single model), and hybrid (multi-model, inspired by machine learning) approaches and found that the hybrid approach performed better in cyberstalking detection.
E-mail-based cyberstalking detection has been explored less than social media-based cyberstalking detection, although researchers have recommended several notable detection approaches for e-mail-based crimes other than cyberstalking. Roy et al. [35] performed a comparative analysis of SVM and deep neural networks in intrusion detection and proposed several models for spam e-mail detection: a machine learning-based model [36] using extreme learning machine (ELM) and support vector machine (SVM), a hybrid model [37] of rough set and decorate ensemble, and a multi-model approach [38] using Deep SVM, SVM, and artificial neural network models. Bassiouni et al. [39] proposed a spam e-mail classification model using machine learning techniques. The authors experimented on the Spambase UCI dataset with several machine learning classifiers and found that Random Forest gave better results for classifying e-mails as spam or ham. Another machine learning-based detection model was proposed by Zhaoquan et al. [40] for spam filtering using the marginal attack method. Kontsewaya et al. [41] proposed another machine learning-based model for spam e-mail classification. The authors experimented on a ready-made dataset containing 1368 spam and 4360 non-spam e-mails and found that Logistic Regression gave better results than the other classifiers. Aviad Cohen et al. [42] proposed a model for detecting malicious e-mails using machine learning methods. The authors applied general descriptive features with machine learning algorithms to enhance performance. Their experiments used a dataset containing 33,142 e-mails (38.73% malicious and 61.27% benign), on which the authors reported better results. Chaitra Sai et al. [43] proposed a model for detecting spoofing e-mails using stacking algorithms. The authors explored various approaches and compared machine learning stacking algorithms for detecting different types of spoofing e-mails to obtain better accuracy. Onan et al. [44] proposed an ensemble-based machine learning model for text classification and suggested another machine learning-based model using a consensus clustering-based undersampling approach [45] for text classification on imbalanced datasets. The authors presented a comparative analysis of different feature engineering approaches, base learners, ensemble learning methods, and consensus clustering-based undersampling. Onan et al. [46] also proposed an ensemble pruning model using multiple classifiers based on swarm-optimized topic modeling, a machine learning-based hybrid ensemble pruning model [47] using clustering and a randomized search approach, and a machine learning-based ensemble model [48] for text classification using different extraction methods. Nisar et al. [49] suggested a soft voting technique using several machine learning classifiers for spam e-mail classification. In their experiments, the authors found that the ensemble approach using soft voting enhances the performance of spam e-mail classification. Bountakas et al. [50] proposed a machine learning-based hybrid ensemble approach using stacking and soft voting techniques for phishing detection. The authors experimented on a dataset containing 3,460 phishing and 32,051 benign e-mail samples and found better performance with soft voting ensemble learning. Onan et al. [51] suggested a group-wise enhancement technique for text sentiment classification using a deep learning model and suggested a model with effective feature selection using an ensemble approach [52] to enhance the performance of text sentiment classification. Cybercriminals have recently introduced image spam to render analysis of the e-mail body text ineffective. Image spam is unsolicited bulk e-mail that contains a message embedded in an image. Spammers use such images to avoid detection by text-based filters. Image spamming is a growing issue in cybercrime, although researchers have suggested some machine learning and deep learning-based image spam detection approaches [53,54].
In the area of e-mail-based cyberstalking detection, Ghasem et al. [55] introduced an improved e-mail-based framework for automatically detecting and controlling cyberbullying and cyberstalking using machine learning techniques. Their ACES (Anti-Cyberstalking E-mail System) framework focuses on automatic e-mail-based cyberstalking detection as well as evidence documentation to combat cybercriminals. Another e-mail-based cyberstalking detection model was proposed by Frommholz et al. [56] for textual analysis and cyberstalking detection using machine learning algorithms. Their framework, ACTS (Anti Cyberstalking Text-based System), mainly focuses on author identification, text classification, personalization, and digital text forensics. X. Feng et al. [57] proposed another framework for e-mail-based cyberstalking detection using machine learning approaches. Their model was inspired by the ACES and ACTS frameworks, and the authors claimed that it would perform better for cyberstalking detection. Another e-mail-based cyberstalking detection framework was proposed by Gautam [58] using a machine learning approach to detect, filter, and collect evidence of cyberstalking in the textual data of non-spam e-mails. The framework explores the cyberstalking risk from non-spam e-mail: it first classifies e-mails as spam or non-spam and then detects cyberstalking in the non-spam e-mails. Another improved e-mail-based detection model was proposed by Maryam et al. [2] using a deep learning approach. Their model classifies e-mails into Harassment E-mails, Fraudulent E-mails, Suspicious E-mails, and Normal E-mails. Asante et al. [59] suggested another automated model for cyberstalking detection on social media using machine learning, data mining techniques, and digital forensics. Their model contains identification, filtering, detection (content detection and offender profiling), and evidence modules.
The literature review shows that researchers have mainly focused on detecting cyberstalking and other harassment on social media. Researchers have also contributed to exploring and detecting e-mail-based cybercrimes. E-mail-based cyberstalking, however, is still little explored and requires more attention. Only a few authors [55–59] have contributed to detecting and combating e-mail-based cyberstalking. Automatic e-mail-based cyberstalking detection on textual data remains challenging, and automated approaches with strong performance are still lacking. Inspired by these authors [55–59], this paper proposes an EBCD model for automatic cyberstalking detection on the textual data of e-mails, which classifies the e-mails from a user’s mailbox into cyberstalking e-mails, suspicious e-mails, and normal e-mails.
Material and methodology
This section describes the detailed algorithms used for
designing the proposed model. The e-mail-based cyberstalking
detection (EBCD) model on textual data has two
main parts: making ML model sets and e-mail-based
cyberstalking detection. In the first part of the EBCD
model, three ML model sets containing Support Vector
Machine (SVM), Logistic Regression (LR), Naïve
Bayes (NB), Random Forest (RF), and Soft Voting
classifiers were trained and tested on three separate datasets
(subject line spam dataset, e-mail spam dataset, and
cyberstalking dataset). In the second part of the EBCD
model, e-mails from the user’s e-mail box were fetched
and later filtered as cyberstalking e-mails, suspicious
e-mails (spam and fraudulent), and normal e-mails by
applying the trained and tested ML model sets using soft
voting techniques. The stepwise procedure for making
ML model sets is described by algorithm-1, while algo-
rithm-2 describes the e-mail-based cyberstalking detec-
tion on textual data from a user’s mailbox. Figure 2
explains the basic functioning layout of the proposed
EBCD model on textual data. The overall methodology
for the proposed EBCD model is presented stepwise,
consisting of the following main phases to perform
both parts of the model for e-mail-based cyberstalking
detection on textual data.
(1) Making the Dataset.
(2) Data pre-processing module.
(3) Features extraction module.
(4) Making ML model sets.
(5) Fetching e-mails from the user’s mailbox.
(6) Apply trained ML model sets to e-mails and
combine the probabilities using soft voting
(7) Aggregator module and e-mail classification
(8) Saving classified e-mails as evidence
(9) Model Performance
Making datasets
This paper gathers several datasets
60–67
related to spam/
phishing e-mail subjects, spam e-mail text, fraudulent
e-mail, and harassment text (e-mail, tweets, and posts/
comments from social media). Three separate mixed
labeled datasets were made to train, test, and cross-
validate the three machine learning model sets based
on the collected datasets. Dataset D1 contains e-mail
subject line spam, phishing, and fraudulent data labeled
as spam (1) and ham (0). Dataset D2 contains spam and
fraudulent e-mail body text labeled as spam and ham
class. Dataset D3 contains harassment-related
6A. K. GAUTAM AND A. BANSAL
(threatening, sexual abuse, hate messages, racism, etc.)
data from e-mails and social media tweets/posts/com-
ments labeled as cyberstalking (1) and non-
cyberstalking (0). Dataset D1 will be used to train and
test the machine learning classifiers of ML model set
MS-1. Dataset D2 will be used to train and test the
machine learning classifiers of ML model set MS-2,
while dataset D3 will be used to train and test the
machine learning algorithms of ML model set MS-3.
The distribution of data in each of the three datasets is
explained in Figure 3.
Data pre-processing module
The data of datasets and fetched e-mails often contain
raw text with unnecessary characters, blank spaces,
blank lines, meaningless characters, html tags, and dif-
ferent symbols. Properly cleaning the data is highly
Figure 2. Basic layout of the proposed EBCD (e-mail-based cyberstalking detection) model on textual data.
Figure 3. Distribution of data in labeled datasets.
JOURNAL OF COMPUTER INFORMATION SYSTEMS 7
required before feature extraction and classification.
Data pre-processing module will be used to clean and
normalize the data of all training and testing labeled
datasets as well as unlabeled e-mails fetched from the
user’s mailbox. Initially, this module will be used for
performing several pre-processing tasks on labeled
training and testing datasets. Later, it will be utilized
on unlabeled e-mails fetched from the user’s mailbox.
Several pre-processing tasks, such as stop-word removal,
noise removal, tokenization, normalization, and
stemming, are performed in this module to clean the
data. In the first step of pre-processing, all stop words
were removed. Meaningless words such as articles, pre-
positions, and pronouns that are not useful for e-mail
classification are called stop words.
68
Fetched e-mails
from the user’s mailbox and datasets gathered from
different sources also contain noise data that is required
to be removed. In the e-mail, repeated words, symbols
(such as html tags, @, #, etc.), blank lines, blank space,
special characters, punctuation marks, and any useless
digits are called noise data. After removing the noise
data and stop words, the texts of the e-mail (subject and
body text) were divided into individual words and added
to a separate list. This process of splitting sentences
into words is called tokenization. Next, the tokenized
text is converted to lowercase letters
(normalization) to make it uniform. After
that, the tokenized words are reduced to
their root form using the lemmatization
69
or
stemming
69
methods. Lemmatization may be used
instead of stemming for proper morphological analysis
of the words. Lemmatization maps the inflected forms of a
word to a single dictionary form (lemma), so that the
variants of a word are counted as one term.
70
In
this paper, the stemming method was used.
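The cleaning steps described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the authors' implementation: the stop-word list is a tiny hypothetical sample, and `strip_suffix` is a deliberately naive stand-in for a real stemmer (for example, a Porter stemmer from NLTK). The steps are also applied in a slightly different order than in the text (noise removal and lowercasing first), which gives the same tokens for this simple case.

```python
import re
from html import unescape

# Hypothetical, deliberately tiny stop-word list (a real list is much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "in", "on",
              "and", "you", "i", "it", "this", "that", "for"}


def strip_suffix(word):
    """Naive stand-in for a stemmer: drop one common English suffix."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Clean raw e-mail text and return a list of stemmed tokens."""
    text = unescape(text)                   # decode entities such as &amp;
    text = re.sub(r"<[^>]+>", " ", text)    # noise removal: HTML tags
    text = text.lower()                     # normalization: lowercase
    text = re.sub(r"[^a-z\s]", " ", text)   # noise removal: digits, symbols
    tokens = text.split()                   # tokenization on whitespace
    return [strip_suffix(t) for t in tokens if t not in STOP_WORDS]


print(preprocess("<p>You are WINNING the prizes!!! 100%</p>"))
```

The same function would be applied both to the labeled datasets and to the subject and body text of each fetched e-mail, so that training and classification see identically cleaned text.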
Feature extraction module
Feature extraction is essential in a machine
learning-based process before training, testing, and classifying
e-mail because machine learning algorithms
work on feature vectors and cannot process data in
text form. Feature extraction computes the weights of
e-mail words and creates a feature vector in numerical
form. Feature extractions play a crucial role in improving
the performance of classifiers.
71
Several traditional-based,
word embedding-based and language model-based fea-
ture extraction methods are available for feature extrac-
tion in the word-level, sentence-level, and n-gram levels.
71
TF-IDF, Word2Vec, BOW, BERT, FastText, GloVe, XL-
NET, ELECTRA, InferSent, GPT-2, and Universal
Sentence Encoder are some widely used examples of
feature extraction methods.
72–75
The proposed EBCD
model of this study applied TF-IDF methods for feature
extractions. TF-IDF is an efficient calculation-based fea-
ture extraction method that measures the weight of any
word of documents in a collection of documents.
76
TF-IDF gives a higher weight to words that occur frequently
in a document but rarely in the rest of the collection, because
such words are more useful for the classification.
77
Equation (1) is
used to calculate the feature vector in the TF-IDF.
$$\mathrm{TF\text{-}IDF}(T, D) = \frac{P_{T\ \mathrm{in}\ D}}{P_{W\ \mathrm{in}\ D}} \times \log\left(\frac{N}{P_{T\ \mathrm{in}\ N}}\right) + 1 \tag{1}$$
Where:
$P_{T\ \mathrm{in}\ D}$ = number of times word "T" appears in a document "D"
$P_{W\ \mathrm{in}\ D}$ = total number of words in the document "D"
(the ratio of these two quantities represents the term frequency)
$P_{T\ \mathrm{in}\ N}$ = total occurrence of word "T" in total documents (represents the document frequency)
$N$ = total documents
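As a worked illustration of the TF-IDF weight, the following sketch computes it for one word of a tokenized document. Several choices here are assumptions rather than details given in the paper: the document frequency is taken as the number of documents that contain the word, the logarithm is natural, and the trailing "+ 1" is read as part of the IDF factor (the convention used by scikit-learn's unsmoothed IDF). The function name `tf_idf` and the toy corpus are hypothetical.

```python
import math


def tf_idf(term, document, corpus):
    """TF-IDF weight of `term` in `document` (a list of tokens).

    Assumed reading: tf * (ln(N / df) + 1), where df is the number of
    documents in `corpus` that contain the term.
    """
    tf = document.count(term) / len(document)          # term frequency
    df = sum(1 for doc in corpus if term in doc)       # document frequency
    if df == 0:
        return 0.0                                     # unseen term
    idf = math.log(len(corpus) / df) + 1.0             # inverse doc. freq.
    return tf * idf


corpus = [
    ["free", "prize", "now"],
    ["meeting", "now"],
    ["free", "free", "offer"],
]
print(tf_idf("free", corpus[2], corpus))
print(tf_idf("offer", corpus[2], corpus))
```

In the toy corpus, "offer" appears in only one document, so each occurrence of it carries a larger IDF factor than an occurrence of "free", which appears in two of the three documents.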
Making ML model sets
After cleaning the data through the data pre-
processing module and getting the feature vector
through the feature extraction module, machine
learning model sets were designed and developed.
In this study, three separate machine learning
model sets, ML Model Set MS-1, ML Model Set
MS-2, and ML Model Set MS-3, were designed and
developed. Machine learning algorithms of ML model
set MS-1 were trained, tested, and validated on data-
set D1. Dataset D2 was applied for the training,
testing, and validating of algorithms of the ML
model set MS-2, while the ML model set MS-3 uti-
lized dataset D3 for training, testing, and validating
the algorithms. In each model set, Support Vector
Machine (SVM), Logistic Regression (LR), Naïve
Bayes (NB), Random Forest (RF), and Soft Voting
classifiers were trained, tested, and validated. The support
vector machine is an efficient and versatile
supervised machine learning algorithm that is widely used to
classify text accurately.
30
SVM constructs a separating hyperplane
and classifies the text according to the distance between
the hyperplane and the support vectors. SVM
offers several kernels (linear, polynomial, sigmoid, and Radial
Basis Function) with
different mathematical functions.
78
By its nature, SVM outputs a class prediction and does not
provide probabilities directly; methods such as Platt scaling and
isotonic regression calibrate the SVM output into the
probability of a text belonging to the target class. This
paper used the probability calibration classifier
method for SVM to calculate the prediction probability
of e-mail. Naïve Bayes (NB) is an efficient and
straightforward supervised machine learning algo-
rithm. The functioning of NB is according to the
Bayes Theorem and derived from conditional
probability.
79
In this paper, the multinomial NB
model was used, while other models offered by NB
are Gaussian NB and Bernoulli NB. Logistic regression
is a statistical linear learning algorithm
that uses the s-shaped sigmoid function to map any real-valued
number to a value between 0 and 1, which is then
thresholded to give a dichotomous result.
Logistic regression predicts an output value (y) by
combining the input features (x) linearly using
weights or coefficient values.
80
Random Forest is
a supervised ensemble algorithm that uses multiple
decision trees with the bootstrap technique to get
better prediction results. For a classification problem,
each tree in a random forest takes input and provides
individual votes for a particular class, and finally,
a class that has got the maximum number of votes
is predicted as output.
81
The mathematical expression
for calculating the prediction probability of e-mail
using SVM is explained by Equation (2). Equation
(3) shows the mathematical formula to determine
prediction probability using NB. Equations (4) and
(5) show the mathematical expression of LR and RF
classifiers, respectively, for calculating the prediction
probability. Algorithm 1 describes the stepwise pro-
cedure for making the machine learning model sets.
$$P_{\mathrm{SVM}}(y \mid \mathrm{email}) = \frac{1}{1 + \exp\big(A\, f(\mathrm{email}) + B\big)} \tag{2}$$
Where “A” and “B” are scalar parameters learned by the
algorithm during the training, “y” is the target class (y = 1
for cyberstalking and y = 0 for non-cyberstalking), and
f(e-mail) is a real-valued function.
$$P_{\mathrm{NB}}(y \mid \mathrm{email}) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1)\, P(x_2) \cdots P(x_n)} \tag{3}$$
Where “y” is the target class (y = 1 for cyberstalking
and y = 0 for non-cyberstalking). P(y|e-mail) represents
the posterior probability of e-mail for target
class “y.” P(e-mail) = P(x_1)P(x_2) ... P(x_n) is the
prior probability of the predictor e-mail. P(y) is
the prior probability of the target class. P(x_i|y)
is the likelihood, that is, the conditional probability of
predictor e-mail for target class (y).
$$P_{\mathrm{LR}}(y \mid \mathrm{email}) = \frac{e^{(a + b\,\mathrm{email})}}{1 + e^{(a + b\,\mathrm{email})}} \tag{4}$$
Where y is the predicted probability output, a is the
intercept term, and b is the coefficient for the single
input e-mail value learned from the training data.
$$P_{\mathrm{RF}}(y \mid \mathrm{email}) = \mathrm{MaxVote}\,\{P_n(\mathrm{email})\}_{1}^{N} \tag{5}$$
Where N is the total number of trees in the random forest and P_n is
the class prediction of the n-th tree.
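The four per-classifier formulas, Equations (2) to (5), can be written as small Python functions to make the arithmetic concrete. This is an illustrative sketch with hypothetical function names and made-up input values, not the trained classifiers of the paper; in practice a library such as scikit-learn computes these quantities internally.

```python
import math
from collections import Counter


def platt_probability(decision_value, a, b):
    """Equation (2): map an SVM decision value f(e-mail) to a probability."""
    return 1.0 / (1.0 + math.exp(a * decision_value + b))


def naive_bayes_posterior(prior, likelihoods, evidences):
    """Equation (3): P(y) * prod P(x_i|y) / prod P(x_i)."""
    return prior * math.prod(likelihoods) / math.prod(evidences)


def logistic_probability(x, intercept, coefficient):
    """Equation (4): sigmoid of a + b * x for a single input value."""
    z = intercept + coefficient * x
    return math.exp(z) / (1.0 + math.exp(z))


def random_forest_vote(tree_predictions):
    """Equation (5): the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Made-up values for illustration only.
print(platt_probability(2.0, a=-1.0, b=0.0))               # above 0.5
print(naive_bayes_posterior(0.5, [0.8, 0.5], [0.5, 0.5]))
print(logistic_probability(0.0, intercept=0.0, coefficient=1.0))
print(random_forest_vote([1, 0, 1, 1, 0]))
```

With a negative `a`, as is typical for Platt scaling, larger decision values map to probabilities closer to 1.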
Algorithm 1: Stepwise procedure for making ML model sets on labeled datasets
Step:1. Begin
Step:2. Import labeled datasets D1, D2, and D3.
Step:3. Send datasets D1, D2, and D3 to the data pre-processing module for text cleaning and normalization.
Step:4. Split the datasets D1, D2, and D3 into training and testing sets.
        D1 = D1_Train + D1_Test, D2 = D2_Train + D2_Test, D3 = D3_Train + D3_Test, where D1_Train, D2_Train, D3_Train are the training corpora and D1_Test, D2_Test, D3_Test are the test corpora for datasets D1, D2, and D3, respectively.
Step:5. Apply TF-IDF vectorizer on D1_Train, D2_Train, D3_Train, D1_Test, D2_Test, and D3_Test to get the feature vectors using the feature extraction module.
Step:6. Train and test the ML classifiers of ML model set MS-1 using the D1_Train and D1_Test corpus (training and testing feature sets of dataset D1).
Step:7. Train and test the ML classifiers of ML model set MS-2 using the D2_Train and D2_Test corpus (training and testing feature sets of dataset D2).
Step:8. Train and test the ML classifiers of ML model set MS-3 using the D3_Train and D3_Test corpus (training and testing feature sets of dataset D3).
Step:9. Apply K-Fold cross-validation for ML classifiers of model sets MS-1, MS-2, and MS-3 on datasets D1, D2, and D3, respectively.
Step:10. Measure the performance of ML classifiers of each ML model set.
Step:11. Save the ML model sets as pickle files so that ML model sets can be used later during the classification of e-mails from the user’s mailbox.
Step:12. End
Fetching e-mails from the user’s mailbox
E-mail is private communication (person-to-person and
person-to-group), so e-mails from the user’s mailbox
can not be fetched without the user id, password, and
user permission. Several approaches may automatically
fetch the e-mails from the user’s mailbox through
a third-party application. IMAP service and Gmail API
(in the case of Gmail service) are the two main methods
for fetching e-mails automatically from the user’s mail-
box. In the case of Gmail API, a user must log into
Google Cloud Console and enable the Gmail API ser-
vice. After that, it is necessary to create/select an appli-
cation under the OAuth Consent Screen of Google
Cloud Console. After creating or selecting the existing
application, OAuth Client ID credentials are required to
create a desktop or web application for getting the Client
ID with OAuth credentials as a text or JSON file. After
getting the Client ID with OAuth credentials, e-mails
from the user’s mailbox can be fetched automatically
through programs. The first time, the user is automatically
prompted that “This application wants to
access your mailbox: Allow or deny,” and after the
user grants permission to access the mailbox, e-mails can
be fetched. Fetching e-mails using the IMAP service
requires only a user id and password with some basic
settings. After enabling “Allow less secure apps: ON”
and enabling the IMAP service in the user’s mailbox,
e-mails can be fetched automatically through programs.
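The IMAP route can be sketched with Python's standard `imaplib` and `email` modules. The function names, the `limit` parameter, and the read-only folder selection are assumptions made for this illustration; the host, user name, and password must be supplied by the user, and many providers now expect an app password rather than the account password.

```python
import email
import imaplib
from email.header import decode_header, make_header


def fetch_raw_messages(host, username, app_password, folder="INBOX", limit=10):
    """Return the raw bytes of the newest `limit` messages in `folder`."""
    mail = imaplib.IMAP4_SSL(host, 993)
    try:
        mail.login(username, app_password)
        mail.select(folder, readonly=True)       # do not alter the mailbox
        _, data = mail.search(None, "ALL")
        message_ids = data[0].split()[-limit:]
        raw_messages = []
        for message_id in message_ids:
            _, parts = mail.fetch(message_id, "(RFC822)")
            raw_messages.append(parts[0][1])
        return raw_messages
    finally:
        mail.logout()


def split_message(raw_bytes):
    """Split one raw message into date, sender, subject, and body text."""
    msg = email.message_from_bytes(raw_bytes)
    subject = str(make_header(decode_header(msg.get("Subject", ""))))
    body = ""
    for part in msg.walk():                      # first text/plain part only
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            body = payload.decode(charset, errors="replace")
            break
    return {"date": msg.get("Date", ""), "sender": msg.get("From", ""),
            "subject": subject, "body": body}


sample = (b"From: a@example.com\r\nSubject: Hello\r\n"
          b"Date: Mon, 1 Jan 2024 00:00:00 +0000\r\n\r\nBody text\r\n")
print(split_message(sample))
```

Only `split_message` is exercised above, on a hand-written sample message; `fetch_raw_messages` needs a reachable IMAP server and valid credentials.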
Apply trained ML model sets to e-mails and combine
the probabilities using soft voting
After fetching the e-mail from the user’s mailbox, the
e-mail was sent to the data pre-processing module
and feature extraction module to clean the e-mail
and get the feature vectors for the e-mail subject
and body text. Saved (trained and tested) ML model
sets were loaded to apply the classifiers separately on
the e-mail subject and body text. Using the ML
model set MS-1, prediction probabilities for e-mail
subjects were found through all ML classifiers (SVM,
NB, LR, and RF). Classifiers of ML Model set MS-2
were applied to the e-mail body text to determine the
prediction probabilities for checking whether the
e-mail is spam or normal. ML model set MS-3 with
all classifiers were applied to get the prediction prob-
abilities for checking whether the e-mail was cyber-
stalking E-mail or a normal E-mail. Prediction
probabilities given by each ML classifier in each ML
model set may vary. Taking the final decision based
on only the prediction of a single classifier may affect
the e-mail classification task. So an ensemble
approach using the multi-model soft voting techni-
que was applied to get the final prediction probability
for a particular class (Spam or Normal, Cyberstalking
or Normal).
In machine learning, the voting technique is classified
as hard voting and soft voting. In hard voting, the
“Mode” based approach is used to select the majority
vote among all the votes (predictions) predicted by all
classifiers. For example, if classifier-1 predicts for class
“A,” classifier-2 predicts for class “B,” and classifier-3
predicts for class “A,” then the hard voting technique
gives the final prediction for class “A” due to a majority
of votes. In soft voting, the “Mean” based approach is
used to find the final prediction probability from all the
predicted probabilities (votes) by all classifiers for both
classes. In soft voting, classifiers give the prediction
probability for both classes (in the case of binary classification)
using the predict_proba method. For example, with
p = svm.predict_proba(), p = {0.7, 0.3} shows that
0.7 is the probability for class “A” and 0.3 is the probability
for class “B.” If the predicted probability of classifier-1
is {0.7, 0.3}, the predicted probability of classifier-2
is {0.4, 0.6}, and the predicted probability of classifier-3
is {0.8, 0.2}, then the soft voting technique gives the final
prediction probability as {0.633, 0.367}, which shows the
prediction in favor of class “A.” This study uses the soft
voting technique to combine the prediction probabil-
ities. The mathematical representation of the soft voting
technique is explained by Equation (6), and the func-
tioning of soft voting in the author’s study is described
in Figure 4. The final prediction probability is calculated
using the soft voting technique based on the prediction
probabilities provided by the ML model set MS-1 (on
the e-mail subject), the model set MS-2 (on e-mail body
text), and model set MS-3 (on e-mail body text). At the
end of this phase, three final prediction probabilities
(from ML model sets MS-1, MS-2, and MS-3) for an
e-mail (subject and body text) are sent to the aggregator
module for e-mail classification.
Figure 4. Soft voting technique for combining the predicted probabilities and predicting the final result.
$$P_{\mathrm{SoftVoting}}(y \mid \mathrm{email}) = \arg\max_{j} \left( \frac{\sum_{k=1}^{N} P_k\big(C_k(\mathrm{email})\big)}{N} = j \right) \tag{6}$$
Where k indexes the classifiers, each of which gives a pair of
class probabilities [P_k0, P_k1], N is the total number of
classifiers, P_k is a probability, and C_k is a classifier,
j is the average probability of N classifiers for the binary
class (j ∈ Υ = {0, 1}), and the argmax function returns the final
maximum probability for the “y” class.
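The difference between hard and soft voting can be checked with a short sketch that uses the three example probability pairs from the text. The function names are hypothetical, and the classifiers are represented only by their already-computed probability pairs.

```python
from collections import Counter


def hard_vote(labels):
    """Hard voting: return the most frequent predicted label (the mode)."""
    return Counter(labels).most_common(1)[0][0]


def soft_vote(probabilities):
    """Soft voting: average the class probabilities over all classifiers.

    `probabilities` holds one [P_class0, P_class1, ...] list per classifier.
    Returns the averaged probabilities and the index of the winning class.
    """
    n_classifiers = len(probabilities)
    n_classes = len(probabilities[0])
    averaged = [sum(p[c] for p in probabilities) / n_classifiers
                for c in range(n_classes)]
    winner = max(range(n_classes), key=averaged.__getitem__)
    return averaged, winner


averaged, winner = soft_vote([[0.7, 0.3], [0.4, 0.6], [0.8, 0.2]])
print([round(p, 3) for p in averaged], winner)
print(hard_vote(["A", "B", "A"]))
```

In the EBCD setting, each model set would contribute four such probability pairs (SVM, LR, NB, and RF), and the averaged probability of the positive class is the value passed on to the aggregator module.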
Aggregator module and e-mail classification
The aggregator module of the proposed EBCD model
takes the combined (final) prediction probabilities
through soft voting from ML model set MS-1, MS-2,
and MS-3 and finally classifies an e-mail of the user’s
mailbox either as “Cyberstalking E-mail,” “Suspicious
E-mail,” or “Normal E-mail.” In the aggregator module,
three e-mail check posts were used to check the e-mails.
In the first e-mail check post of the aggregator module,
the e-mail of the user’s mailbox is checked for cyberstalk-
ing e-mail. If the value of combined prediction probability
for class “A” (Cyberstalking) provided by the ML model
set MS-3 > 0.5, then e-mail is classified as “Cyberstalking
E-mail.” If an e-mail is not identified as cyberstalking,
then a second e-mail check post will check the e-mail for
suspicious e-mails (spam and fraudulent). In the second
e-mail check post, combined prediction probabilities for
class “A” (Spam) given by ML model sets MS-1 and MS-2
are used. If the probability given by MS-2 > 0.5 or (MS-1
> 0.5 AND MS-2 > 0.5), then the e-mail is identified as
a spam e-mail and must be checked for repeated
spam and e-mail bombing. In the last e-mail
check post of the aggregator module, identified spam
e-mails are sent to ML model set MS-2 to check
for spam repetition and e-mail bombing by the
same sender. At least the ten latest e-mails sent by the same
sender are checked, and if the majority of e-mails sent by
the identified sender (spammer) are spam or fraudulent
e-mails, then the spam e-mail identified in the second check
post is classified as “Cyberstalking E-mail” because the sender
is persistently sending repeated spam or e-mail bombing
the user. However, the user finally decides whether the
e-mail is a cyberstalking e-mail or just a suspicious (spam/
fraudulent) e-mail. If, during
e-mail check post 3, the ML model set using soft voting
does not classify the sender’s e-mails as cyberstalking, then the
spam e-mail identified in check post 2 is classified as
“Suspicious E-mail.” If an e-mail from the user’s
mailbox is neither identified as cyberstalking nor
Figure 5. e-mail classification in aggregator module.
identified as suspicious e-mail while checking in all three
check posts, then the e-mail will be classified as “Normal
E-mail.” The functioning of the aggregator module for
e-mail classification is described in Figure 5, while the
overall stepwise procedure for E-mail Classification from
the User’s Mailbox is explained in algorithm 2.
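The three check posts amount to a short decision cascade, sketched below. The function and parameter names are hypothetical; in particular, `sender_spam_probability` stands for any user-supplied callable that applies model set MS-2 with soft voting to the sender's latest e-mails and returns one combined probability. Note that the second condition, as stated, is logically equivalent to MS-2 > 0.5 alone, so under this reading an e-mail flagged only by MS-1 is classified as normal.

```python
CYBERSTALKING = "Cyberstalking E-mail"
SUSPICIOUS = "Suspicious E-mail"
NORMAL = "Normal E-mail"


def classify_email(fpp1_ms1, fpp2_ms2, fpp3_ms3,
                   sender_spam_probability, threshold=0.5):
    """Classify one e-mail from its three soft-voted probabilities.

    fpp1_ms1: spam probability of the subject line (model set MS-1)
    fpp2_ms2: spam probability of the body text (model set MS-2)
    fpp3_ms3: cyberstalking probability of the body text (model set MS-3)
    sender_spam_probability: hypothetical callable, evaluated only when
        the e-mail is flagged as spam (check post 3).
    """
    # Check post 1: cyberstalking content.
    if fpp3_ms3 > threshold:
        return CYBERSTALKING
    # Check post 2: spam or fraudulent content (condition as in the text).
    if fpp2_ms2 > threshold or (fpp1_ms1 > threshold and fpp2_ms2 > threshold):
        # Check post 3: repeated spam or e-mail bombing by the same sender.
        if sender_spam_probability() > threshold:
            return CYBERSTALKING
        return SUSPICIOUS
    return NORMAL


print(classify_email(0.2, 0.7, 0.1, lambda: 0.9))  # repeated spam
print(classify_email(0.2, 0.7, 0.1, lambda: 0.3))  # isolated spam
```

Passing the sender check as a callable keeps the potentially expensive look-up of the sender's earlier e-mails out of the common case where the e-mail is not spam.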
Saving classified e-mails as evidence
After the e-mail classification of the user’s mailbox, the
available evidence is required to be stored in a file. The
proposed EBCD model will automatically read the user’s
mailbox, move the cyberstalking e-mails to
a cyberstalking folder, suspicious e-mails to
a suspicious folder and finally store the e-mail date,
sender, subject, body text, sentiment label, etc. in the
CSV file during the fetching of e-mail. Later, a CSV file
containing classified e-mails from the user’s mailbox as
evidence can also be used for training purposes and legal
action against cyberstalkers. The user can also use gath-
ered evidence to decide to block the sender as
a blacklisted sender to avoid cyberstalking from the
same sender.
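Storing the classified e-mails can be sketched with Python's standard `csv` module. The column names and the helper `write_evidence` are assumptions for this illustration; the paper only lists the kinds of information stored (date, sender, subject, body text, and label).

```python
import csv
import io

# Hypothetical column layout for the evidence file.
FIELDS = ["date", "sender", "subject", "body", "label"]


def write_evidence(csv_file, records):
    """Write classified e-mails (dicts) to an open text file as CSV."""
    writer = csv.DictWriter(csv_file, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for record in records:
        writer.writerow(record)


records = [{
    "date": "Mon, 1 Jan 2024 00:00:00 +0000",
    "sender": "a@example.com",
    "subject": "Last warning",
    "body": "You will regret this, soon",
    "label": "Cyberstalking",
}]

buffer = io.StringIO()          # stands in for open("evidence.csv", "w", newline="")
write_evidence(buffer, records)
print(buffer.getvalue())
```

The `csv` module quotes fields that contain commas, so body text can be stored without breaking the column layout.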
Model performance
The performance of classifiers of each ML model set
on each dataset (during the training and testing time
and during the e-mail classification from the user’s
mailbox) was measured separately. Performance
metrics are a set of several parameters to estimate
the model performance during training and testing
time (on labeled datasets) and real-time (on unla-
beled e-mail classification).
82
Several parameters in
the performance metrics are usually calculated by
using the confusion matrix. In the case of binary
classification, the confusion matrix is a 2 × 2 truth
table that contains the total value of True_Pos,
True_Neg, False_Neg, and False_Pos. True_Pos
(True Positive) is a successful hit showing the total
number of correctly detected cyberstalking e-mails or
spam e-mails, while True_Neg (True Negative)
explains the total number of correctly detected normal
e-mails. False_Pos (False Positive) is a false alarm,
which refers to the total number of normal e-mails incorrectly
detected as cyberstalking e-mails or spam e-mails,
while False_Neg (False Negative) is the miss count
that shows the total number of cyberstalking or spam
e-mails wrongly detected as normal e-mails. This study used
widely used parameters such as accuracy, precision, f-score, recall, and
AUC (Area Under the Curve) to measure the performance
of the EBCD model.
Algorithm 2: Stepwise procedure for e-mail classification from the user’s mailbox
Step:1. Begin
Step:2. Load saved pre-trained and pre-tested ML model sets MS-1, MS-2, and MS-3.
Step:3. Enable IMAP service in the user’s mailbox (Gmail).
Step:4. Enable “Allow less secure apps: ON” or generate an App password for the user’s mailbox (Gmail).
Step:5. Import the required library and authenticate the login process using the User ID, Password, Host, and Port. [In the case of Python, import the imaplib and email libraries, mail = imaplib.IMAP4_SSL(host, port), mail.login(username, app_password), host for Gmail = imap.gmail.com, port = 993]
Step:6. Select “Inbox” or/and another mailbox folder to fetch the e-mails. [As mail.select("Inbox")]
Step:7. Create label/folder “Cyberstalking” and “Suspicious” in the user’s mailbox. [As mail.create("Cyberstalking") and mail.create("Suspicious")]
Step:8. Fetch e-mail from a selected folder of the user’s mailbox. [Get e-mail date, sender, subject, e-mail text, and other required information]
Step:9. Split the e-mail into a date, sender, subject, and e-mail body text, as e-mail_Subject and e-mail_BodyText.
Step:10. Send e-mail subject and body text (e-mail_Subject and e-mail_BodyText) to the data pre-processing module for e-mail cleaning and normalization.
Step:11. Apply TF-IDF vectorizer on e-mail_Subject and e-mail_BodyText to get the feature vectors using the feature extraction module.
Step:12. Apply all algorithms of ML model set MS-1 on e-mail_Subject and get prediction probabilities. [As PP_MS1_SVM, PP_MS1_LR, PP_MS1_NB, and PP_MS1_RF]
Step:13. Apply all algorithms of ML model set MS-2 on e-mail_BodyText and get prediction probabilities. [As PP_MS2_SVM, PP_MS2_LR, PP_MS2_NB, and PP_MS2_RF]
Step:14. Apply all algorithms of ML model set MS-3 on e-mail_BodyText and get prediction probabilities. [As PP_MS3_SVM, PP_MS3_LR, PP_MS3_NB, and PP_MS3_RF]
Step:15. Combine the prediction probabilities on ML model sets MS-1, MS-2, and MS-3 and get the final probabilities in each model set using Equation (6) of the soft voting technique.
        [As FPP1_MS1 = (PP_MS1_SVM + PP_MS1_LR + PP_MS1_NB + PP_MS1_RF) / 4;
            FPP2_MS2 = (PP_MS2_SVM + PP_MS2_LR + PP_MS2_NB + PP_MS2_RF) / 4;
            FPP3_MS3 = (PP_MS3_SVM + PP_MS3_LR + PP_MS3_NB + PP_MS3_RF) / 4]
Step:16. If (FPP3_MS3 > 0.5) then
            Classify the e-mail as “Cyberstalking e-mail.”
            Assign a label (value=1, Cyberstalking e-mail (negative e-mail)).
            Move the e-mail to the “Cyberstalking” folder of the user’s mailbox.
Step:17. ElseIf (FPP2_MS2 > 0.5) or (FPP2_MS2 > 0.5 AND FPP1_MS1 > 0.5) then
            Check for repeated spam and e-mail bombing by the same sender (check at least ten latest e-mails of the sender) and apply ML model set MS-2 for getting the final probabilities using the soft voting technique.
            [As RFPP4_MS2 = Call Get_Sentiment_e-mail(Sender, MS-2) (any user-defined function for getting the prediction probabilities for sender e-mails)]
            If (RFPP4_MS2 > 0.5) then
                Classify the e-mail as “Cyberstalking e-mail.”
                Assign a label (value=1, Cyberstalking e-mail (negative e-mail)).
                Move the e-mail to the “Cyberstalking” folder of the user’s mailbox.
            Else
                Classify the e-mail as “Suspicious e-mail.”
                Assign a label (value=2, Suspicious e-mail (negative e-mail containing spam/fraudulent)).
                Move the e-mail to the “Suspicious” folder of the user’s mailbox.
Step:18. Else [in case of FPP3_MS3 < 0.5, FPP2_MS2 < 0.5 and FPP1_MS1 < 0.5]
            Classify the e-mail as “Normal e-mail.”
            Assign a label (value=0, Normal e-mail (positive e-mail)).
Step:19. Save the fetched and classified e-mail to a CSV file (all e-mail-related information as date, sender, subject, text, sentiment label (Cyberstalking/Suspicious/Normal), etc.)
Step:20. Repeat Step 8 to Step 19 until a sufficient number of e-mails has been fetched from the user’s mailbox (define a fetching limit).
Step:21. Measure the performance of ML classifiers of each ML model set.
Step:22. End
Accuracy
Accuracy shows the proportion of correct predictions
among all predictions made by the classifier. Equation (7)
shows the mathematical representation to calculate the
accuracy.
$$\mathrm{Accuracy} = \frac{\mathrm{True\_Pos} + \mathrm{True\_Neg}}{\mathrm{True\_Pos} + \mathrm{False\_Pos} + \mathrm{False\_Neg} + \mathrm{True\_Neg}} \tag{7}$$
Precision
Precision shows the ratio of the true positives
to all predicted positives (true positives plus false positives).
Precision can be calculated using Equation (8).
$$\mathrm{Precision} = \frac{\mathrm{True\_Pos}}{\mathrm{True\_Pos} + \mathrm{False\_Pos}} \tag{8}$$
Recall
Recall is used to determine the sensitivity of the model
and measures the ratio of true positive predictions to all
actual positives. Recall can be calculated by using Equation (9).
$$\mathrm{Recall} = \frac{\mathrm{True\_Pos}}{\mathrm{True\_Pos} + \mathrm{False\_Neg}} \tag{9}$$
F-score
F-Score measures the test accuracy as the
harmonic mean of precision and recall. F-score
can be calculated using Equation (10).
$$\mathrm{F\text{-}Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{10}$$
AUC (Area Under the Curve)
AUC estimates the ability of the classifier to separate
the classes correctly. The ROC (Receiver Operating
Characteristic) curve is a probability curve that plots the True
Positive Rate (TPR) against the False Positive Rate
(FPR). Equation (11) can be used to calculate the AUC.
$$\mathrm{AUC} = \frac{1}{2}\left(\frac{\mathrm{True\_Pos}}{\mathrm{True\_Pos} + \mathrm{False\_Neg}} + \frac{\mathrm{True\_Neg}}{\mathrm{True\_Neg} + \mathrm{False\_Pos}}\right) \tag{11}$$
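Equations (7) to (11) can be computed directly from the four confusion-matrix counts, as in the sketch below. The function name and the example counts are made up for illustration; the AUC line follows Equation (11) as given in the text (the mean of the true positive rate and the true negative rate), which is a single-threshold estimate rather than an integral over the whole ROC curve.

```python
def performance_metrics(true_pos, false_pos, false_neg, true_neg):
    """Compute the metrics of Equations (7)-(11) from confusion-matrix counts."""
    total = true_pos + false_pos + false_neg + true_neg
    accuracy = (true_pos + true_neg) / total                    # Eq. (7)
    precision = true_pos / (true_pos + false_pos)               # Eq. (8)
    recall = true_pos / (true_pos + false_neg)                  # Eq. (9)
    f_score = 2 * precision * recall / (precision + recall)     # Eq. (10)
    specificity = true_neg / (true_neg + false_pos)
    auc = 0.5 * (recall + specificity)                          # Eq. (11)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_score": f_score, "auc": auc}


# Made-up counts for illustration only.
metrics = performance_metrics(true_pos=90, false_pos=10, false_neg=5, true_neg=95)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The sketch assumes that every denominator is non-zero; a production version would need to handle a classifier that predicts no positives at all.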
Results and discussion
This section discusses the experimental setup and results
for e-mail-based cyberstalking detection on textual data.
The experiments used the Python language with scikit-learn,
imaplib, email, BeautifulSoup, smtplib, NLTK,
and other library packages to develop the proposed
EBCD model. In the first stage of the experiment,
machine learning classifiers of model set MS-1 were
trained, tested, and cross-validated (k-fold) on labeled
dataset D1. The performance of different classifiers of
Figure 6. Performances of classifiers of ML model set MS-1 on dataset D1.
Table 1. Performance of ML classifiers of ML set MS-1.
Dataset (D1): e-mails Subject
Total Unique Records: 23320
Model Set: Machine Learning Model Set MS-1
ML Classifiers       Accuracy  Precision  Recall    F-Score   AUC
Naive Bayes          0.925443  0.968613   0.878256  0.921194  0.986559
Logistic Regression  0.946884  0.911006   0.989749  0.948730  0.991968
Random Forest        0.969640  0.956672   0.984106  0.969396  0.992305
SVM                  0.971012  0.955813   0.987330  0.971283  0.993824
Soft Voting          0.976844  0.969979   0.983211  0.976550  0.994339
ML model set MS-1 is explained in Table 1 and Figure 6.
As per experimental results, the soft voting approach
achieved the best accuracy of 97.7%, best precision of
97%, and best f-score of 97.7% and best AUC of 99.4%.
Logistic regression provided the highest recall of 99%.
The support vector machine obtained the second posi-
tion with an accuracy of 97.1%, recall of 98.7%, f-score of
97.1%, and AUC of 99.3%. All
classifiers of the model set MS-1 performed well and
achieved similar AUC values (98.7% to 99.4%).
In the second stage of the experiment, machine learn-
ing classifiers of model sets MS-2 were trained, tested,
and cross-validated (kFold) on labeled datasets D2. The
performance of different classifiers of ML model set MS-
2 is explained in Table 2 and Figure 7. As per experi-
mental results, the soft voting technique was again the
best performer classifier with the best accuracy of 97.7%,
recall of 96.5%, f-score of 97.4%, and AUC of 99.7%.
Maximum precision of 98.6% was provided by the ran-
dom forest classifier, while SVM again got the position
of second best performer classifier with an accuracy of
97%, precision of 98%, recall of 95.3%, f-score of 96.6%,
and AUC of 99.6%. The other classifiers of the model set
MS-2 also performed well, close to the
best-performing classifier.
In the third stage of the experiment, the machine learning
classifiers of model set MS-3 were trained, tested, and
cross-validated (k-fold) on labeled dataset D3. The
performance of the classifiers of model set MS-3 is
reported in Table 3 and Figure 8. The results show that
the soft voting technique again provided the best AUC
(99.9%), while SVM was the best classifier in terms of
accuracy (99.0%), precision (99.4%), and f-score (99.0%).
Naïve Bayes achieved the highest recall (99.4%). All
classifiers of model set MS-3 performed very well, with
results close to those of the best classifier. Overall, the
soft voting classifier of each model set performed best or
close to best on all three datasets. It not only improves
classification performance but also helps to make the
right decision when classifying unlabeled data, because
the decision combines the votes of several classifiers.
Very high performance figures can sometimes be a
symptom of model overfitting, so the experiments in this
paper use k-fold cross-validation to guard against it. The
soft voting technique may also reduce overfitting, because
it combines several classifiers, and so helps to avoid a
wrong decision when classifying unknown data. For
example, when an e-mail from the user’s mailbox is
classified, one classifier may label it as a normal e-mail
while the other classifiers predict a cyberstalking e-mail.
In this scenario, the soft voting technique combines the
classifiers’ votes to make the right decision for
the actual classification of e-mail. Based on these
advantages, the multi-model soft voting technique was
used during the classification and labeling of the e-mails
from the user’s mailbox (unlabeled e-mail).

Table 2. Performance of ML classifiers of ML set MS-2.
Dataset (D2): e-mails Body Text
Total Unique Records: 31715
Model Set: Machine Learning Model Set MS-2

ML Classifiers        Accuracy   Precision  Recall     F-Score    AUC
Naive Bayes           0.955520   0.958086   0.942925   0.950428   0.992443
Logistic Regression   0.959009   0.977275   0.931027   0.953581   0.993947
Random Forest         0.964391   0.986148   0.934373   0.960268   0.989182
SVM                   0.969646   0.979182   0.953151   0.965982   0.995702
Soft Voting           0.977299   0.983459   0.964977   0.974130   0.996806

Figure 7. Performances of classifiers of ML model set MS-2 on dataset D2.
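As an illustration of the soft voting step, the following minimal sketch assumes the common formulation of soft voting, in which the class probabilities predicted by the individual classifiers are averaged and the class with the highest average is chosen (this is, for example, what scikit-learn's VotingClassifier does with voting="soft"). The function name soft_vote and all probability values are hypothetical and are not taken from the experiments.

```python
from typing import Dict, Tuple

ClassProbs = Dict[str, float]


def soft_vote(per_classifier: Dict[str, ClassProbs]) -> Tuple[str, ClassProbs]:
    """Average the class probabilities of all classifiers and pick the top class."""
    classes = sorted(next(iter(per_classifier.values())))
    n = len(per_classifier)
    averaged = {c: sum(p[c] for p in per_classifier.values()) / n for c in classes}
    label = max(averaged, key=averaged.get)
    return label, averaged


# Invented probabilities for one e-mail (illustration only, not from the paper):
# one classifier leans towards "normal", the other three towards "cyberstalking".
votes = {
    "naive_bayes": {"normal": 0.60, "cyberstalking": 0.40},
    "logistic_regression": {"normal": 0.30, "cyberstalking": 0.70},
    "random_forest": {"normal": 0.20, "cyberstalking": 0.80},
    "svm": {"normal": 0.25, "cyberstalking": 0.75},
}

label, averaged = soft_vote(votes)
print(label, averaged)
```

In this scenario the single dissenting classifier is outweighed, and the averaged probabilities also indicate how confident the combined decision is.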
At the end of the experiments, the trained, tested, and
validated classifiers of each ML model set were saved as
pickle files for later use in the automated detection and
filtering of cyberstalking e-mails from the user’s mailbox.
In the last experiment, the trained, tested, and validated
classifiers were applied to classify the e-mails from the
user’s mailbox (as discussed in Algorithm 2 of the
methodology section). For experimental purposes,
different types of e-mails (spam, fraudulent,
cyberstalking, and normal) were sent to the author’s
mailbox from different e-mail IDs of the authors using a
Python program built on the smtplib module. Using the
EBCD model, a total of 497 e-mails were fetched and
classified as cyberstalking e-mails (37.8%), suspicious
e-mails (26.4%), and normal e-mails (35.8%). The
distribution of the fetched and classified e-mails is shown
in Figure 9. The performance of the classifiers of model
sets MS-2 and MS-3 was measured on the fetched and
classified e-mails using a manual “OneVsRest” approach.
The fetched and classified e-mails were divided into two
datasets, set 1 and set 2. Classifiers of model set MS-2
were tested on set 1, which contains all e-mails of the
suspicious and normal classes, while classifiers of MS-3
were tested on set 2, which contains the cyberstalking
and normal classes. The average performance of the
classifiers of model sets MS-2 and MS-3 is reported in
Table 4 and Figure 10. As these results show, the soft
voting technique outperformed the other classifiers in
terms of accuracy and f-score, achieving the highest
accuracy of 96.3% and the highest f-score of 95.9%.
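The way test e-mails were sent to the mailbox with a Python program and smtplib can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' program: the addresses, subjects, body texts, and the helper names build_test_message and send_messages are all hypothetical, and the sending function is defined but deliberately not called here.

```python
import smtplib
from email.message import EmailMessage
from typing import Iterable, List


def build_test_message(sender: str, recipient: str, subject: str, body: str) -> EmailMessage:
    """Create one plain-text test e-mail."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_messages(messages: Iterable[EmailMessage], host: str, port: int,
                  user: str, password: str) -> None:
    """Send the messages over SMTP with STARTTLS (needs a real server and account)."""
    with smtplib.SMTP(host, port) as server:
        server.starttls()
        server.login(user, password)
        for msg in messages:
            server.send_message(msg)


# Hypothetical samples, one per intended class (placeholders, not the paper's data).
SAMPLES = [
    ("normal", "Meeting agenda", "Please find the agenda for Monday attached."),
    ("spam", "You won a prize", "Click the link to claim your reward now."),
]

messages: List[EmailMessage] = [
    build_test_message("sender@example.com", "mailbox@example.com", subject, body)
    for _, subject, body in SAMPLES
]
print(len(messages), "test messages prepared")
```

Keeping message construction separate from sending makes it possible to inspect the generated test set before anything leaves the machine.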
Table 3. Performance of ML classifiers of ML set MS-3.
Dataset (D3): Harassment Text
Total Unique Records: 36804
Model Set: Machine Learning Model Set MS-3

ML Classifiers        Accuracy   Precision  Recall     F-Score    AUC
Naive Bayes           0.944608   0.909068   0.993706   0.949491   0.994240
Logistic Regression   0.982285   0.992456   0.973579   0.982923   0.998221
Random Forest         0.981524   0.977701   0.985060   0.982392   0.997325
SVM                   0.990327   0.994357   0.987135   0.990731   0.998615
Soft Voting           0.988697   0.986543   0.991547   0.989039   0.998727

Figure 8. Performances of classifiers of ML model set MS-3 on dataset D3.
Figure 9. Distribution of fetched and classified e-mails.
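The class distribution in Figure 9 and the division of the fetched e-mails into set 1 and set 2 can be reproduced with a few lines of Python. The sketch below assumes per-class counts of 188, 131, and 178 e-mails; these counts are not stated in the paper and are inferred here from the reported percentages of the 497 e-mails. The helper name one_vs_rest_subset is hypothetical.

```python
from collections import Counter
from typing import List

# Assumed counts, inferred from 37.8%, 26.4% and 35.8% of 497 e-mails.
labels: List[str] = ["cyberstalking"] * 188 + ["suspicious"] * 131 + ["normal"] * 178

counts = Counter(labels)
total = len(labels)
distribution = {name: round(100 * n / total, 1) for name, n in counts.items()}


def one_vs_rest_subset(all_labels: List[str], positive: str,
                       negative: str = "normal") -> List[int]:
    """Keep only two classes and encode them as 1 (positive) and 0 (negative)."""
    return [1 if lab == positive else 0
            for lab in all_labels if lab in (positive, negative)]


set1 = one_vs_rest_subset(labels, "suspicious")     # evaluated with MS-2
set2 = one_vs_rest_subset(labels, "cyberstalking")  # evaluated with MS-3

print(distribution)
print(len(set1), len(set2))
```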
The highest AUC, 96.8%, was provided by both SVM and
soft voting. The highest precision values of 98.5%, 98.1%,
and 98.1% were provided by random forest, soft voting,
and the support vector machine, respectively. For recall,
the support vector machine, naïve Bayes, and soft voting
achieved the highest values of 94.8%, 94.4%, and 94.1%,
respectively (Table 4). Overall, all classifiers of the model
sets performed well. During the classification and labeling
of e-mails from the user’s mailbox, the final decision was
taken using the soft voting technique, and the
performance of all classifiers was then measured on the
classified e-mails using the Stratified K-Folds
cross-validator. Across all experiments, the performance
of the support vector machine was notable, but the soft
voting technique is a better choice for making the right
decision.
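The idea behind a stratified k-fold split, as used for the evaluation above, can be sketched in plain Python. This is a simplified illustration without shuffling, not the cross-validator used in the experiments (scikit-learn offers one as StratifiedKFold); the function name stratified_k_folds and the toy labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_k_folds(labels: Sequence[int], k: int) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, test_indices) pairs that preserve class ratios."""
    by_class: Dict[int, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds: List[List[int]] = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)  # deal each class out round-robin

    splits = []
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        splits.append((train, test))
    return splits


# Toy labels: 10 positive and 20 negative e-mails (illustration only).
labels = [1] * 10 + [0] * 20
splits = stratified_k_folds(labels, k=5)
for train, test in splits:
    print(len(train), len(test), sum(labels[i] for i in test))
```

Each test fold keeps the 1:2 class ratio of the full toy set, which is the property that distinguishes a stratified split from a plain k-fold split.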
Conclusion and future work
E-mail-based cyberstalkers use e-mail to send negative
and fear-inducing messages. Cyberstalking through
spamming, e-mail bombing, and general harassment is
common in e-mail-based harassment. Apart from these,
cyberstalkers also use several other approaches to target
a victim or a group over e-mail, which are difficult to
detect automatically. This paper proposed an EBCD
model that uses the multi-model soft voting technique of
the machine learning approach for automatic
cyberstalking detection on textual data from a user’s
mailbox. Initially, three machine learning model sets,
each containing random forest, support vector machine,
naïve Bayes, logistic regression, and soft voting
classifiers, were trained, tested, and validated through
k-fold cross-validation on three different datasets.
Classifiers of model set MS-1 were trained, tested, and
validated on dataset D1, which contains spam, phishing,
and fraudulent e-mail subject lines, so that they can later
classify an e-mail by its subject. Classifiers of model set
MS-2 were trained, tested, and validated on dataset D2,
which contains spam and fraud-related e-mail body text.
Model set MS-2 can then be used to classify e-mails from
the user’s mailbox as spam and to check for the e-mail
bombing and repeated spamming approaches of
cyberstalkers. Classifiers of model set MS-3 were trained,
tested, and validated on dataset D3, which contains
harassment-related data, so that they can be used to
check for cyberstalking e-mails in the user’s mailbox.
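Reusing trained classifiers later depends on persisting them; the results section notes that they were saved as pickle files. The sketch below shows the save and load round trip with Python's pickle module. A plain dictionary stands in for a trained classifier object, and the file name, dictionary contents, and helper names are hypothetical. Pickle files should only be loaded from trusted sources, since loading can execute arbitrary code.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any


def save_model(model: Any, path: Path) -> None:
    """Serialize a trained model object to a pickle file."""
    with path.open("wb") as handle:
        pickle.dump(model, handle)


def load_model(path: Path) -> Any:
    """Load a previously saved model object (trusted files only)."""
    with path.open("rb") as handle:
        return pickle.load(handle)


# Stand-in for a trained classifier (hypothetical contents, illustration only).
model = {"model_set": "MS-3", "classifier": "soft_voting", "classes": ["normal", "cyberstalking"]}

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "ms3_soft_voting.pkl"
    save_model(model, model_path)
    restored = load_model(model_path)

print(restored == model)
```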
Table 4. Average performance of ML classifiers of ML model sets on fetched and classified e-mails from user’s mailbox.
Dataset D4: Set 1 (suspicious and normal e-mail) and Set 2 (cyberstalking and normal e-mail)
Fetched e-mail classified and labeled by: Soft Voting Technique of EBCD Model
Total Unique e-mail: 497, cyberstalking e-mail: 37.8%, suspicious e-mail: 26.4%, and normal e-mail: 35.8%

ML Classifiers        Accuracy   Precision  Recall     F-Score    AUC
Naive Bayes           0.879176   0.979588   0.944156   0.898151   0.964610
Logistic Regression   0.910239   0.979637   0.927597   0.914087   0.964462
Random Forest         0.911804   0.984968   0.935390   0.940763   0.964478
SVM                   0.941714   0.981334   0.947727   0.946976   0.968466
Soft Voting           0.963057   0.981431   0.940584   0.959211   0.967925

Figure 10. Average performance of classifiers of ML model sets on fetched and classified e-mails from user’s mailbox.
The performance of the classifiers of each model set was
measured using accuracy, precision, recall, f-score, and
AUC. Experimental results show that the soft voting
technique achieved the best accuracy of 97.7%, best
f-score of 97.7%, best precision of 97%, and best AUC
of 99.4% on dataset D1. The soft voting technique also
performed well on dataset D2, with the best accuracy of
97.7%, best f-score of 97.4%, best recall of 96.5%, and
best AUC of 99.7%. On dataset D3, the soft voting
technique again achieved the best AUC of 99.9%, while
its accuracy, precision, and f-score were very close to
those of the top-performing classifier (SVM).
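For reference, the five measures named above can be computed for a binary task as in the following sketch. It uses the standard definitions (AUC is taken as the probability that a randomly chosen positive e-mail receives a higher score than a randomly chosen negative one, with ties counted as one half). The function names, the 0.5 decision threshold, and the toy labels and scores are hypothetical and unrelated to the reported results.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and f-score from binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC via pairwise comparison of positive and negative scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (illustration only).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```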
Because of its better overall performance, the
multi-model soft voting technique was applied to the
automated classification and labeling of e-mails from the
user’s mailbox. During this classification, the trained,
tested, and validated classifiers of model sets MS-1,
MS-2, and MS-3 were applied as a combined approach.
Based on the final decision of the soft voting classifier of
MS-1, MS-2, and MS-3 at each of the three e-mail check
posts, e-mails from the user’s mailbox were classified as
cyberstalking, suspicious, or normal e-mails. The
performance of all classifiers of each model set was then
measured on the classified e-mails from the user’s
mailbox. The average performance shows that soft voting
again performed well, with an accuracy of 96.3% and an
f-score of 95.9%, while its precision, recall, and AUC were
very close to those of the top-performing classifier.
Overall, the experimental results show that the
performance of the support vector machine was notable,
but the soft voting technique is a better choice for
classifying unlabeled e-mail. The soft voting technique
not only improves classification performance on labeled
and unlabeled e-mail but also helps to make the right
decision in the actual classification of e-mails. The
proposed EBCD model performed well and can
automatically classify e-mails from the user’s mailbox and
support evidence collection. It not only detects
cyberstalking e-mails automatically but also classifies
e-mails as suspicious (spam and fraudulent) based on the
textual e-mail data.

The EBCD model also helps to detect basic intent-wise
cyberstalking e-mails sent through repeated spamming
and e-mail bombing. However, detecting advanced
intent-wise e-mail-based cyberstalking, including the
image spam approach, is more complex than detecting
content-wise e-mail-based cyberstalking. Future work
includes the design and development of an enhanced
EBCD model for detecting advanced intent-wise
cyberstalking carried out through phishing, malicious,
defamatory, spoofed, and image-spam e-mails, so that
such cyberstalking can be detected automatically in cases
involving fake e-mails, identity theft, and attempts to
cause personal or financial loss. Future work also
includes the design and development of the EBCD model
using deep learning techniques and a comparison of the
currently proposed EBCD model with ANN, LogitBoost,
XGBoost, LSTM, and GRU models.
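The combined use of the three model sets at three check posts can be pictured with the sketch below. It is only an illustration under explicit assumptions: the order of the checks, the rule that a harassment hit takes priority over a spam or fraud hit, the 0.5 threshold, and the keyword-based stand-in scorers are all hypothetical, since Algorithm 2 is defined in the methodology section and is not reproduced here.

```python
from typing import Callable, Iterable

Scorer = Callable[[str], float]  # returns a probability for the "positive" class


def keyword_scorer(keywords: Iterable[str]) -> Scorer:
    """Toy stand-in for a trained soft voting classifier (hypothetical)."""
    lowered = [k.lower() for k in keywords]
    return lambda text: 0.9 if any(k in text.lower() for k in lowered) else 0.1


def classify_email(subject: str, body: str, ms1: Scorer, ms2: Scorer, ms3: Scorer,
                   threshold: float = 0.5) -> str:
    """Assumed decision rule combining the three check posts."""
    # Harassment check on the body text (MS-3), assumed to take priority.
    if ms3(body) >= threshold:
        return "cyberstalking"
    # Subject check (MS-1) and body-text spam/fraud check (MS-2).
    if ms1(subject) >= threshold or ms2(body) >= threshold:
        return "suspicious"
    return "normal"


ms1 = keyword_scorer(["prize", "winner"])
ms2 = keyword_scorer(["claim your reward"])
ms3 = keyword_scorer(["i am watching you"])

results = [
    classify_email("Meeting agenda", "See you on Monday.", ms1, ms2, ms3),
    classify_email("You are a winner", "Claim your reward today.", ms1, ms2, ms3),
    classify_email("Hello", "I am watching you every day.", ms1, ms2, ms3),
]
print(results)
```

In a real deployment each scorer would be the soft voting classifier of the corresponding model set, loaded from its saved pickle file.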
Disclosure statement
No potential conflict of interest was reported by the authors.
ORCID
Arvind Kumar Gautam http://orcid.org/0000-0001-6057-1006
Abhishek Bansal http://orcid.org/0000-0001-5968-3625
... Cyberstalking is a dangerous and convoluted cybercrime that affects and targets numerous people, communities, and organizations [3]. Cyberstalkers and gangs of cyberstalkers are active on Twitter with pre-defined plans and agendas to insults, profanity, harassing the victim through repeated activities of sexism, racism, offensive, abuse, hate, trolling, fake news, and fake accounts [4,5,6]. Impressive cyberstalking detection, controlling, and counteraction arrangements are required to handle this troublesome cyberstalking circumstance on Twitter. ...
Article
Full-text available
Many people are using Twitter for thought expression and information sharing in real-time. Twitter is one of the trendiest social media applications that cybercriminals also widely use to harass the victim in the form of cyberstalking. Cyberstalkers target the victim through sexism, racism, offensive language, hate language, trolling, and fake accounts on Twitter. This paper proposed a framework for automatic cyberstalking detection on Twitter in real-time using the hybrid approach. Initially, experimental works were performed on recent unlabeled tweets collected through Twitter API using three different methods: lexicon-based, machine learning, and hybrid approach. The TF-IDF feature extraction method was used with all the applied methods to obtain the feature vectors from the tweets. The lexicon-based process produced maximum accuracy of 91.1%, and the machine learning approach achieved maximum accuracy of 92.4%. In comparison, the hybrid approach achieved the highest accuracy of 95.8% for classifying unlabeled tweets fetched through Twitter API. The machine learning approach performed better than the lexicon-based, while the performance of the proposed hybrid approach was outstanding. The hybrid method with a different approach was again applied to classify and label the live tweets collected by Twitter Streaming in real-time. Once again, the hybrid approach provided the outstanding result as expected, with an accuracy of 94.2%, recall of 94.1%, the precision of 94.6%, f-score of 94.1%, and the best AUC of 98%. The performance of machine learning classifiers was measured in each dataset labeled by all three methods. Experimental results in this study show that the proposed hybrid approach performed better than other implemented approaches in both recent and live tweets classification. The performance of SVM was better than other machine learning algorithms with all applied approaches.
Chapter
Cyberstalking is one of the most widespread threats on digital platforms. It has included many forms of direct threats via email, online distribution of intimate photographs, seeking information about victims, harassment, and catfishing. The consequences of cyberstalking may lead to psychological problems such as mental health, distress, victim experiencing feelings of isolation, guilt, adverse effects on life activity. These psychological problems may further lead to reports of serious health issues such as anger, fear, suicidal ideation, depression, and post-traumatic stress disorder (PTSD). However, there are many coping strategies such as avoidant coping, ignoring the perpetrator, confrontational coping, support seeking, and cognitive reframing. In spite of these methods, awareness of preventive measures of cyberstalking may further help to overcome mental stress. In this chapter, the authors have pointed out the various psychological issues due to cyberstalking and further discuss their solutions through preventing or automatic detection methods inspired by machine learning approaches.
Article
Full-text available
Many people are using Twitter for thought expression and information sharing in real-time. Twitter is one of the trendiest social media applications that cybercriminals also widely use to harass the victim in the form of cyberstalking. Cyberstalkers target the victim through sexism, racism, offensive language, hate language, trolling, and fake accounts on Twitter. This paper proposed a framework for automatic cyberstalking detection on Twitter in real-time using the hybrid approach. Initially, experimental works were performed on recent unlabeled tweets collected through Twitter API using three different methods: lexicon-based, machine learning, and hybrid approach. The TF-IDF feature extraction method was used with all the applied methods to obtain the feature vectors from the tweets. The lexicon-based process produced maximum accuracy of 91.1%, and the machine learning approach achieved maximum accuracy of 92.4%. In comparison, the hybrid approach achieved the highest accuracy of 95.8% for classifying unlabeled tweets fetched through Twitter API. The machine learning approach performed better than the lexicon-based, while the performance of the proposed hybrid approach was outstanding. The hybrid method with a different approach was again applied to classify and label the live tweets collected by Twitter Streaming in real-time. Once again, the hybrid approach provided the outstanding result as expected, with an accuracy of 94.2%, recall of 94.1%, the precision of 94.6%, f-score of 94.1%, and the best AUC of 98%. The performance of machine learning classifiers was measured in each dataset labeled by all three methods. Experimental results in this study show that the proposed hybrid approach performed better than other implemented approaches in both recent and live tweets classification. The performance of SVM was better than other machine learning algorithms with all applied approaches.
Article
Full-text available
The development of information and communication technology has created many positive outcomes, including convenience for people; however, cases of unsolicited communication, such as spam, also occur frequently. Spam is the indiscriminate transmission of unwanted information by anonymous users, called spammers. Spam content is indiscriminately transmitted to users in various forms, such as SMS, e-mail, and social network service posts, causing negative experiences for users of the service, while also creating costs, such as unnecessarily large amounts of network traffic. In addition, spam content includes phishing, hype or false advertising, and illegal content. Recently, spammers have also used images that contain stimulating content to effectively attract users’ curiosity and attention. Image spam contains more complex information than text, making it more difficult to analyze and to generalize its properties compared to text. Therefore, existing text-based spam detectors are vulnerable to spam image attacks, resulting in a decline in service quality. In this paper, a “hybrid features by combining visual and text information to improve spam filtering performance” method is proposed to reduce the occurrence of misclassification. The proposed method employs three sub-models to extract features from spam images and a classifier model to output the results using the features. Each sub-model extracts topic-, word-, and image-embedding-based features from spam images. In addition, the sub-models use optical character recognition, latent Dirichlet allocation, and word2Vec techniques to extract features from images. To evaluate spam image classification performance, the spam classifiers were trained using the extracted features and the results were measured using a confusion matrix. Our model achieved an accuracy of 0.9814 and a macro-F1 score of 0.9813. In addition, the application of OCR evasion techniques resulted in a decrease in recognition performance. 
Using the proposed model, a mean macro-F1 score of 0.9607 was obtained.
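As a rough illustration of how a macro-F1 score can be read off a confusion matrix, the sketch below computes per-class F1 and averages it with equal class weight. The function name `macro_f1` and the example counts are hypothetical and are not taken from the paper.

```python
def macro_f1(confusion):
    """Macro-F1 from a square confusion matrix.

    Rows are true classes, columns are predicted classes.
    """
    n = len(confusion)
    f1_scores = []
    for k in range(n):
        tp = confusion[k][k]
        # Column sum minus the diagonal: predicted as k but actually another class.
        fp = sum(confusion[i][k] for i in range(n)) - tp
        # Row sum minus the diagonal: actually k but predicted as another class.
        fn = sum(confusion[k]) - tp
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    # Macro averaging: every class counts equally, regardless of its size.
    return sum(f1_scores) / n


# Hypothetical two-class counts (class 0 = normal image, class 1 = spam image).
example = [[90, 10],
           [5, 95]]
score = macro_f1(example)
```

Because each class contributes equally, macro-F1 penalizes a classifier that performs well only on the majority class.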
Article
Full-text available
Social media platforms and other web applications, such as WhatsApp, Facebook, YouTube, Instagram, and Twitter, have become popular for data sharing, live events, news, and publicity, and also as venues for cybercrime. The use of social media platforms also raises major issues in the form of cyberstalking, cyberbullying, and other kinds of cyber harassment. The terms cyberstalking and cyberbullying are often used interchangeably, and both involve using the internet to follow or target someone online. Cyberstalking is a critical global problem that affects educational institutions, victims, and society as a whole, and it needs to be identified, detected, reported, and controlled properly to keep social media users safe. Machine learning is the most widely used method for building cyberstalking detection models. Researchers have proposed various detection techniques using machine learning to control and combat cyberstalking on social media. This paper surveys popular feature extraction methods and machine learning classifiers for text classification and examines the datasets used by researchers. The study also aims to identify the research gaps and the scope for improving cyberstalking detection. It reviews cyberstalking detection techniques based on machine learning, analyzes the performance of popular machine learning classifiers, and explores the issues, challenges, recent trends, and future directions for cyberstalking detection.
Article
Full-text available
In the modern days of life, people use many social media sites for information sharing among friends, relatives, and others for personal, business, and official purposes. The use of social media platforms is also raising serious issues in the form of cyberstalking. Cyberstalking has been identified as a growing antisocial problem that affects educational institutions, victims, and entire human society. An intelligent system is required to detect cyberstalking in social media. In this paper, we proposed a cyberstalking detection model and analyzed the performance of six popular supervised machine learning algorithms, namely Logistic Regression, Support Vector Machines (SVM), Random Forest, Decision Trees, K-Nearest Neighbor, and Naive Bayes. These machine learning algorithms were implemented with two feature extraction methods, Bag of Words and TF-IDF, on two datasets of different sizes and distribution containing 35734 and 70019 comments and tweets, respectively. Performance of algorithms was measured in terms of Accuracy, Precision, Recall, f-score, training time, and prediction time. Our experimental results show that Logistic Regression and Support Vector Machine were top performer algorithms for both datasets with both feature extraction methods. Logistic Regression (92.6% with BOW and 92% with TF-IDF) and Support Vector Machine (92.5% with TF-IDF and 91.9% with BOW) achieved the highest accuracy on dataset-1. Logistic Regression and Support Vector Machine also achieved the highest Precision (96.4% and 96.6% respectively) and F-Score (94.3% and 93.8% respectively), while Naïve Bayes provides the best Recall (97.6% with TF-IDF on dataset-1) for both datasets.
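To make the difference between the two feature extraction methods concrete, here is a minimal sketch that builds Bag of Words count vectors and then reweights them with TF-IDF. It assumes whitespace tokenisation and the textbook weighting idf = ln(N / df); the function names and sample comments are hypothetical, and library implementations typically use smoothed variants of this formula.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count vectors) for whitespace-tokenised documents."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    vectors = []
    for toks in tokenised:
        counts = Counter(toks)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def tf_idf(docs):
    """Reweight raw counts by inverse document frequency (unsmoothed)."""
    vocab, counts = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears (always >= 1).
    df = [sum(1 for vec in counts if vec[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    weighted = [[c * w for c, w in zip(vec, idf)] for vec in counts]
    return vocab, weighted


# Hypothetical comments, for illustration only.
docs = ["you are awful", "you are kind"]
vocab, bow = bag_of_words(docs)
_, weighted = tf_idf(docs)
```

Terms that occur in every document ("you", "are") receive zero weight under this TF-IDF variant, while the discriminative terms ("awful", "kind") keep a positive weight. Bag of Words, by contrast, treats all four terms alike.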
Article
Full-text available
Cyberstalking is growing as a social and international problem and has reached pandemic-like proportions among users of internet applications. With the heavy use of Internet technology in modern life, cyberstalking has become a major fear for users, society, and institutions. As with social media, cyberstalkers use email technology to target victims. Email is a widely used internet application and is very popular for sharing information among people and organizations for personal, business, and official purposes. Generally, cybercriminals use fake email IDs, either from popular email service providers or from fake email service providers, to commit cybercrimes such as phishing, spamming, and cyberstalking. Victims have mostly been targeted through spam email, but in recent trends criminals also use non-spam email for cyberstalking and cyberbullying. Cyberstalkers can easily target victims with non-spam email because they often use fake email IDs and messages that are difficult to block or filter as spam. Filtering, detection, and proper evidence documentation of non-spam email-based cyberstalking are challenging and interesting tasks for researchers. In this paper, we propose a machine learning framework to filter, detect, and collect evidence of cyberstalking in the textual data of non-spam emails.
Article
Full-text available
Artificial Intelligence (AI) in combination with the Internet of Things (IoT), called AIoT, is an emerging trend in industrial applications and is capable of intelligent decision-making with self-driven analytics. With its extensive usage in diverse scenarios, IoT devices generate bulk data contrived by attackers to disrupt normal operations and services. Hence, there is a need for proactive data analysis to prevent cyber-attacks and crimes. To investigate crimes involving Electronic Mail (e-mail), analysis of both the header and the email body is required, since the semantics of communication help to identify the source of potential evidence. With the continued growth of data shared via emails, investigators now face the daunting challenge of extracting the required semantic information from the bulk of emails, which delays the investigation process. This gives criminals an edge in erasing the footprints of their malicious acts. The existing keyword-based search and filtering techniques often return extraneous, short-sequence emails and skip meaningful information. To overcome this limitation, we propose a novel, efficient approach named SeFACED that uses Long Short-Term Memory (LSTM) based Gated Recurrent Neural Network (GRU) for multiclass email classification. SeFACED works not only on short sequences but also with long dependencies of 1000+ characters. SeFACED focuses on tuning the LSTM based GRU parameters to attain the best performance and is assessed by comparing it with traditional machine learning models, deep learning models, and state-of-the-art studies on the subject. Experimental results on self-extended benchmark datasets show that SeFACED outperforms existing methods while keeping the classification process robust and reliable.
Article
Phishing email attacks have been a dominant cyber-criminal strategy for decades. Despite their longevity, they evolved during the COVID-19 pandemic, indicating that adversaries exploit critical situations to lure victims. Plenty of detectors have been proposed over the years, which mainly focus on the content or the textual information of emails; however, to cope with the evolution of phishing emails, more sophisticated approaches should be introduced that exploit all of the emails’ traits to enhance the detection capability of Machine Learning/Deep Learning classifiers. To tackle the limitations of existing works, this paper proposes a phishing email detection methodology, named HELPHED, that detects phishing emails by combining Ensemble Learning methods with hybrid features. The hybrid features provide an accurate representation of emails by fusing their content and textual traits. We propose two HELPHED methods: the first employs the Stacking Ensemble Learning method, while the second utilizes Soft Voting Ensemble Learning. Both methods deploy two different Machine Learning algorithms to handle the hybrid features separately, yet in parallel, minimizing the features’ complexity and improving the model’s performance. A thorough evaluation analysis is carried out considering innovative guidelines that aim to prevent partial and misleading results. Experimental tests verified that the combination of hybrid features with Ensemble Learning, overall, accomplishes better detection performance than when employing only content-based or text-based features. Numerical results on a rich imbalanced dataset (i.e., 32,051 benign and 3,460 phishing email samples) that considers the evolution of phishing emails show that Soft Voting Ensemble Learning outperforms other prominent Machine Learning/Deep Learning algorithms and existing works, yielding an F1-score of 0.9942.
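The general idea of soft voting can be sketched as follows: each classifier outputs a probability per class, the probabilities are averaged, and the class with the highest average wins. This is a generic illustration, not the HELPHED configuration; the function name `soft_vote`, the optional weights, and the example probabilities are all assumptions.

```python
def soft_vote(probabilities, weights=None):
    """Average per-class probabilities from several classifiers and pick the argmax.

    probabilities: one list per classifier, each with one probability per class.
    weights: optional per-classifier weights; equal weighting if omitted.
    """
    n_models = len(probabilities)
    if weights is None:
        weights = [1.0] * n_models
    total = sum(weights)
    n_classes = len(probabilities[0])
    averaged = [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]
    # The predicted class is the index of the largest averaged probability.
    winner = max(range(n_classes), key=lambda c: averaged[c])
    return winner, averaged


# Hypothetical outputs of three classifiers for one email: [P(benign), P(phishing)].
model_outputs = [
    [0.55, 0.45],
    [0.55, 0.45],
    [0.10, 0.90],
]
label, averaged = soft_vote(model_outputs)
```

In this example a majority (hard) vote would pick class 0, since two of the three classifiers lean slightly towards it, whereas soft voting picks class 1 because the third classifier is far more confident. That sensitivity to confidence is the main design difference between the two schemes.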
Article
Various cybercriminals are active with predefined and preplanned agendas to carry out cybercrimes in the Internet world. Cyberstalking, cyberbullying, cyber terrorism, cyber hacking, data leakage, identity theft, phishing, and other types of cyber harassment continually occur in the virtual world. Cyberstalking and cyberbullying are close in content and intent, involving the same internet-based technology to harass, bully, and undermine others online. This paper implemented a cyberstalking detection model and analyzed the effect of various feature extraction techniques on different machine learning classifiers for cyberstalking detection. For feature extraction, the proposed model applied Word2Vec, BOW, TF-IDF, FastText, GloVe, ELMo, and BERT. Logistic Regression (LR), Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Random Forest (RF), Naive Bayes (NB), and Decision Tree (DT) were used for classification. The effect of each feature extraction method on the performance of the detection model was determined from the results of the classifiers applied with that method. Experimental results show that BOW and TF-IDF outperformed advanced word embedding-based feature extraction methods. BOW (for LR) achieved the highest accuracy of 95.7%, highest precision of 97.9%, and highest F-Score of 97.3%. TF-IDF achieved the highest recall of 99.8% for NB. The SVM classifier achieved the second-highest accuracy of 95.2% with TF-IDF. The BERT model obtained maximum accuracies of 90.9% and 90.7% for LR and SVM, respectively. The ELMo model also performed well, with maximum accuracies of 90.5% and 90.2% for LR and SVM, respectively. The SkipGram model of Word2Vec provided an accuracy of 85% for the LR classifier. GloVe provided 81.2% accuracy for the RF classifier. The SkipGram and CBOW models of FastText provided 85.7% and 82.2% accuracy, respectively, for the RF classifier.