To cite this article: Arvind Kumar Gautam & Abhishek Bansal (2023): Email-Based Cyberstalking Detection On Textual Data Using Multi-Model Soft Voting Technique Of Machine Learning Approach, Journal of Computer Information Systems, DOI: 10.1080/08874417.2022.2155267
Published online: 17 Jan 2023.
Email-Based Cyberstalking Detection On Textual Data Using Multi-Model Soft
Voting Technique Of Machine Learning Approach
Arvind Kumar Gautam and Abhishek Bansal
Indira Gandhi National Tribal University, Amarkantak, India
ABSTRACT
In the virtual world, a mass of people use many internet applications for several purposes. Internet applications have become basic needs of modern life and are shaping a habitual society. Like social media, e-mail is prevalent among people of different categories for personal and official communication. The widespread use of e-mail-based communication is also raising various types of cybercrime, including cyberstalking. Cyberstalkers use e-mail to harass victims, employing several content-wise and intent-wise approaches such as spamming, phishing, spoofing, malicious, defamatory, and e-mail bombing, as well as non-spam e-mails containing sexism, racism, and threats, and finally attempting to hack the victim's account. This paper proposes an EBCD model for automatic cyberstalking detection on the textual data of e-mail using a multi-model soft voting technique of the machine learning approach. Initially, experimental work was performed to train, test, and validate all classifiers of three model sets on three different labeled datasets: dataset D1 contains spam, fraudulent, and phishing e-mail subjects; dataset D2 contains spam e-mail body text; and dataset D3 contains harassment-related data. After that, the trained, tested, and validated classifiers of all model sets were applied as a combined approach to automatically classify unlabeled e-mails from the user's mailbox using the multi-model soft voting technique. The proposed EBCD model successfully classifies e-mails from the user's mailbox into cyberstalking e-mails, suspicious e-mails (spam and fraudulent), and normal e-mails. Each model set of the EBCD model uses several classifiers, namely support vector machine, random forest, naïve Bayes, logistic regression, and soft voting. The final decision in classifying e-mails from the user's mailbox was taken by the soft voting technique of each model set. The TF-IDF feature extraction method was used with all applied machine learning model sets to obtain feature vectors from the data. Experimental results show that the soft voting technique not only enhances the performance of the e-mail classification task but also supports making the right decision to avoid wrong classification. The overall performance of the soft voting technique was better than that of the other classifiers, although the performance of the support vector machine was also notable. As per the experimental results, the soft voting technique obtained an accuracy of 97.7%, 97.7%, and 98.9%; a precision of 97%, 98.3%, and 98.6%; a recall of 98.3%, 96.5%, and 99.1%; an f-score of 97.6%, 97.4%, and 98.9%; and an AUC of 99.4%, 99.7%, and 99.9% on datasets D1, D2, and D3, respectively. The average performance of soft voting of each model set on classified e-mails from the user's mailbox was also notable, with an accuracy of 96.3%, precision of 98.1%, recall of 94%, f-score of 95.9%, and AUC of 96.8%.
KEYWORDS: e-mail cyberstalking; cyberstalking detection; cyberbullying; machine learning; spam detection; soft voting; TF-IDF; support vector machine; naive bayes; logistic regression; random forest
Introduction
With the growth and popularity of internet technology, e-mail (electronic mail) has become an essential medium everywhere for person-to-person and person-to-group communication. The e-mail platform is not only for communication purposes but also provides a storage facility, which has been growing exponentially over the years.
Generally, regular users of e-mail store half of their basic
and critical information in e-mail storage.[1,2] E-mail is the
best application for sharing personal, official, business,
and confidential information over the internet. Many
organizations and individuals utilize e-mail technology
to share their general and necessary information, such as
document sharing, message communication, and sending
urgent information about any news, updates, and notifi-
cations. Several e-mail service providers provide e-mail
service to users for personal and business purposes, either
free or on a subscription basis. Some of the most famous
and notable e-mail service providers are Gmail, Microsoft
Hotmail and Outlook, Yahoo, iCloud, AOL, GMX,
ProtonMail, Yandex Mail, Tutanota, and Zoho Mail. As per the data provided by Statista,[3] more than 4.1 billion users worldwide use the e-mail service through different electronic devices and e-mail client software.

CONTACT Arvind Kumar Gautam analyst.igntu@gmail.com Department of Computer Science, Indira Gandhi National Tribal University, Amarkantak, Distt. - Anuppur, MP 484886, India
© 2023 International Association for Computer Information Systems
The frequent use of e-mail technology is not just limited
to personal and official purposes but is also widely used by
cybercriminals for performing cybercrime incidents. Cybercrimes like phishing, spamming, hacking, spoofing, e-mail bombing, and cyberstalking are being executed using e-mail.[4] E-mail is the second most used application and the third most common source for cyberstalking and other cyber harassment over the internet.[2,4]
Although different authors define cyberstalking differently, cyberstalking is a form of online harassment involving the use of technology to target individuals or groups. Cyberstalking and cyberbullying are two challenging issues of online abuse and are close in content and intent, involving the same internet-based technology to harass, bully, and undermine others in the online world. Cyberstalking is systematic, comprises repeated and numerous cyber-attacks, and may occur on multiple occasions.[5-8] Cyberstalking may be classified into e-mail stalking, internet stalking, computer stalking, phone stalking, and automated stalking.[8,9] Cyberstalking is a dangerous and convoluted cybercrime that affects and targets numerous people, communities, and organizations.[10]
Cybercriminals apply several approaches to target victims, such as sending e-mails containing phishing, viruses, threatening, fraudulent, and harassing content, e-mail bombing, sharing the private information of victims, and finally trying to hack the e-mail account. Cyberstalkers often utilize e-mail-based technology with predefined plans and agendas to insult, abuse, and harass the victim through repeated activities of sexism, racism, offensiveness, hate, and fake news from real or counterfeit accounts. Although such e-mail-based methods are mainly utilized for several other types of e-mail-based crimes, their utilization in cyberstalking incidents cannot be ignored. Some e-mail-based methods applied by cyberstalkers are presented in Figure 1.
Spam is the criminal and fraudulent communication of unwanted and harmful messages containing unsolicited content such as phishing, false advertising, harassment, and illegal content, sent from an infected device or to multiple addresses at once.[11] According to DataProt,[12] as of March 2022, approximately 85% of e-mails worldwide were filtered as spam, including 36% for advertisement purposes and 31.7% for adult-related and harassment purposes. A phishing e-mail is a scam and more dangerous than general spam e-mails, sent by cybercriminals with fraud and harassment intentions. Cybercriminals find the victim's interests and send customized phishing e-mails that appear to come from a legitimate, reliable source to a specific person or group to steal and gather personal and financial information.[13]
Cybercriminals utilize different types of
phishing e-mails with predefined objectives, containing
harmful hyperlinks, fake website links, malware, and
clone id and contents to thieve private information,
hack the account, control the victim’s devices, and
undermine and harass the victims. Malicious e-mails
are an approach used by cybercriminals as phishing
e-mails to try to access private information from victims.
Malicious e-mails contain attachments such as documents, PDFs, hyperlinks, e-files, and voicemails to initiate an attack on a user's devices.[14] Cybercriminals use
such attachments with e-mails that can install malware
to destroy data, steal information, take control of the
user’s computer, access the screen, capture keystrokes,
and access other network systems. Cyberstalkers often
utilize malicious e-mails to target known users.

Figure 1. Different methods of e-mail-based cyberstalking.

E-mail spoofing is another harmful e-mail technique used by cybercriminals for sending spam and phishing e-mails
to trap users into thinking a message came from trustable and well-known persons or organizations. In spoofing e-mail techniques, cybercriminals create a fake header so that the client application software shows a falsified sender address and the receiver believes the message, which carries malicious links and malware attachments.[15] Cybercriminals craft spoofing e-mails using display names, legitimate domains, and lookalike domains. Spoofing e-mails are mainly used for phishing, identity theft, avoiding spam filters, anonymity, and harassment purposes.
In e-mail bombing, cyberstalkers repeatedly send unnecessary, large, and meaningless e-mail messages to a predefined e-mail address of the victim to consume large amounts of system and network resources (such as internet bandwidth, storage space, etc.) for harassment purposes.[16]
Composing e-mail bombing mes-
sages automatically using computer programs is
another approach used by adversary cyberstalkers.
Sometimes, the adversary also utilizes the controversial
or official statement to a large audience using the
victim’s return e-mail address so that users read and
reply individually, and eventually, the victim’s e-mail
account is flooded through a large number of replies.
Another dangerous approach used by adversary cyber-
stalkers is to subscribe the victim’s e-mail address to
many sexual sites and other mailing lists so that victim
can receive unnecessary automatic e-mails regularly.
Defamatory e-mail is a technique of cyber defamation
which is often used by cyberstalkers to send false
information related to any person or organization to
demolish the reputation of that person or organization. Defamatory e-mails are sent to different sources either accidentally or deliberately, creating a confounded matter from an unintentional or intentional result.[17]
Sometimes, cyberstalkers also send
defamatory e-mails to the victim’s relatives containing
false and sexual information related to the victim to
damage the victim’s public image. The cases of defa-
matory e-mails are regularly increasing and are very
complicated to detect.
Those e-mails that are not classified into any types
of spam or fraudulent e-mails and look legitimate
e-mails are called non-spam e-mails. Vicious non-
spammer cyberstalkers use non-spam e-mails, includ-
ing sexual abuse, fake e-mails, threatening e-mails,
and other harassment e-mails, to target the victims
with proper plans. In Non-spam e-mail methods used
by cyberstalkers, a bunch of temporary e-mail ids
from well-known e-mail servers or sometimes suspi-
cious servers are created, and then using these e-mail
ids, stalking-related messages are sent to the victims
regularly. If the sender's e-mail id is blocked or a police complaint is filed, cyberstalkers switch to other temporary e-mail ids. Threatening e-mail is basically used by cyberstalkers and scammers to blackmail the victim. In threatening e-mails, cyberstalkers regularly threaten to publish a piece of the victim's private information, or sometimes fake or factual sexual information, among the victim's colleagues or relatives (friends and family) unless the victim fulfills their demands. Threatening e-mail is more common in the cyberstalking of women victims by ex-partners or friends for financial cheating or personal adversary reasons. Sometimes, cyberstalkers send fake
e-mails to victims or victims’ relatives containing
false or fake information or fake sender name
(often using the name of the victim’s well-known
persons or organizations), or counterfeit domains to
harass the victims intensely. Such fake e-mails look like original e-mails in terms of e-mail filtration policy, domain name, and sender name, do not contain any harmful links, and are very difficult to identify as fake or legitimate. Cyberstalkers and other cybercriminals
often try to hack the e-mail ids of victims or their
family members so that further victims can be har-
assed easily. Cybercriminals use some general
approaches, such as phishing and spoofing to hack
the e-mail ids of victims. Keylogging (software and
hardware keylogger to capture all keystrokes which
a user performs), pharming (a fake website that looks
legitimate for collecting usernames and passwords),
automated script-based programs or suspicious
mobile apps, gaming applications, sexual site hyper-
links, and password guessing and resetting are some
other powerful methods used by cybercriminals for
hacking e-mail ids.[18]
Generally, researchers focus on classifying the
e-mail into spam e-mail or non-spam e-mail, but
non-spam e-mail is not always safe and crime-free
in e-mail technology and is also responsible for
cyberstalking that cannot be ignored. Researchers
have proposed various content-based and rule-based
techniques for spam filtration and detection. Content-based methods mainly focus on content features, while rule-based methods apply model-based approaches with predefined rules and blacklist and whitelist mechanisms for spam e-mail classification.[19,20] Generally, reputed e-mail service
providers (Gmail, Outlook, and Yahoo) filter e-mails primarily targeting spam and other harmful e-mails but do not focus on filtering harassing e-mails. Cyberstalking is a critical cybercriminal activity, and technical solutions to combat and control cyberstalking incidents are relatively scarce.
Detection of cyberstalking, especially early and auto-
mated detection, is another major challenge. An
intelligent cyberstalking detection model is required
to automatically classify the e-mails from the user’s
mailbox to handle upsetting cyberstalking incidents
on the e-mail platform. Sentiment analysis using machine learning techniques performs a vital task in text analysis, deciding the score of e-mail contents to classify text as positive or negative.[21] Mostly, researchers focus on cyberstalking detection on social media platforms, while e-mail-based cyberstalking detection has not been much highlighted or explored.
There is still much scope for e-mail-based cyberstalk-
ing detection that can automatically filter cyberstalk-
ing e-mails from the user’s mailbox. The main
research objective of this paper is to train and test
the different machine learning model sets on differ-
ent datasets (spam and harassment) and finally per-
form e-mail filtration from the user’s mailbox as
cyberstalking, suspicious, and normal e-mails auto-
matically. This research study utilizes the multi-
model soft voting technique of the machine learning
approach to design and develop an improved auto-
mated e-mail-based cyberstalking detection model on
textual data. The significant contributions from this
study are as follows.
●We designed and developed an automated, effi-
cient model named EBCD for e-mail filtration as
cyberstalking e-mails, suspicious e-mails (spam
and fraudulent), and normal e-mails by utilizing
the multi-model soft voting technique of the
machine learning approach to achieve the best
performance in the e-mail-based cyberstalking
detection on textual data.
●The proposed EBCD model can classify and label
e-mails automatically in real-time with high accu-
racy and can gather useful information from
e-mails in the user’s mailbox that can be utilized
for further training of machine learning models
and evidence purposes.
●The proposed EBCD model can be used in any
e-mail mailbox that provides e-mail fetching facil-
ity API or IMAP services.
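Because the EBCD model fetches e-mails from the user's mailbox over API or IMAP services, the retrieval step can be sketched as follows. This is an illustrative sketch only, not the authors' implementation; the host, credentials, and helper names are hypothetical, and it relies on Python's standard imaplib and email modules.

```python
# Hedged sketch: fetch recent messages over IMAP and extract (subject, body)
# pairs for later classification. Host/credentials are placeholders.
import email
import imaplib
from email.header import decode_header

def parse_message(raw: bytes):
    """Extract (subject, plain-text body) from a raw RFC 822 message."""
    msg = email.message_from_bytes(raw)
    subject, enc = decode_header(msg.get("Subject", ""))[0]
    if isinstance(subject, bytes):
        subject = subject.decode(enc or "utf-8", errors="replace")
    body = ""
    for part in msg.walk():  # take the first text/plain part found
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            if payload is not None:
                body = payload.decode(errors="replace")
                break
    return subject, body

def fetch_recent_emails(host: str, user: str, password: str, limit: int = 10):
    """Return (subject, body) pairs for the most recent INBOX messages."""
    out = []
    with imaplib.IMAP4_SSL(host) as conn:
        conn.login(user, password)
        conn.select("INBOX")
        _, data = conn.search(None, "ALL")
        for num in data[0].split()[-limit:]:
            _, msg_data = conn.fetch(num, "(RFC822)")
            out.append(parse_message(msg_data[0][1]))
    return out
```

A real deployment would additionally need provider-specific authentication (OAuth or app passwords) and handling of multipart and HTML-only messages.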
The next part of the research study is structured
section-wise. In section 2, the notable and recent
contribution of researchers in the related field is
presented in the form of a literature review.
Section 3 describes the applied materials and the
proposed methodology used in this paper. The
experimental setup, results, and detailed discussion
are mentioned in section 4. Finally, the conclusion
and future works are finalized in section 5.
Review of literature
In the literature survey, some related research papers were
chosen to observe the contributions of past work per-
formed by researchers to the automatic detection of
cyberbullying, cyberstalking, and other cyber harassment.
Researchers have suggested several techniques to design
and develop a cyberstalking detection model on different
virtual world platforms. Burmester Henry et al.[22] proposed a monitoring system framework for tracking cyberstalkers using a cryptography approach. The authors claimed that the proposed framework would be able to record cyberstalking-related data on the computers of cyberstalking victims. Aggarwal S. et al.[23] developed
the Predator and Prey Alert (PAPA) system to help law
enforcement. The PAPA system records every screen
event of a victim’s device during the session. The PAPA
system requires special software and hardware for victim
use and creates a secrecy issue. PAPA system was also not
performing properly to filter and detect cyberstalking
e-mails and was unable to handle the text-based cyber-
stalking. Onan et al.[24] suggested a model for topic extraction for bibliometric data analysis using several improved word embeddings with a cluster analysis approach and developed sentiment analysis models[25] using machine learning, ensemble learning, and deep learning methods on educational data mining. Gautam et al.[9] explored and reviewed various cyberstalking and cybercrime detection techniques and found that machine learning techniques are widely used as single, ensemble, and hybrid approaches. Onan et al.[26]
proposed a model based on a three-layer stacked bidirectional long short-term memory architecture for detecting sarcastic text documents on social media and, after that, also suggested a deep learning-based model utilizing several word embedding models[27] and another deep learning-based model utilizing several weighted word embedding models[28] for sentiment analysis of product reviews on Twitter. Another machine learning and deep learning-based model proposed by Onan et al.[29] utilizes several unsupervised and supervised term-weighted models, namely TF-IDF, word2vec, FastText, and GloVe. Machine learning classifiers play a vital role in making cyberstalking detection models, either as single models or as multi-model ensemble and hybrid approaches. Gautam et al.[30] analyzed the performance of several popular machine learning classifiers on different sizes of datasets for cyberstalking detection. In the literature, researchers mainly focus on making cyberstalking detection models for social media platforms. Zhang et al.[31] suggested a machine learning-based automated cyberbullying detection model for detecting
bully tweets on Twitter. The authors performed the
experimental work using various machine learning mod-
els using multiple textual features and found maximum
accuracy of 90%. Liew et al.[32] suggested an automated security alert model using supervised machine learning techniques to detect and control phishing tweets in real time on Twitter. Nimisha et al.[33] presented another automated model for cyberstalking detection on social media using machine learning and natural language processing. The authors' proposed model mainly focuses on identifying the cyberstalker online and detecting cyberstalking incidents. Another enhanced automated real-time cyberstalking detection model on Twitter was designed and developed by Gautam et al.[34] using a hybrid approach inspired by machine learning. The authors performed experimental work on live tweets in real time for cyberstalking detection using lexicon-based, machine learning-based (single approach), and hybrid (multi-model based, inspired by machine learning) approaches and found that the hybrid approach performed better in cyberstalking detection.
Researchers have explored e-mail-based cyberstalking detection less than social media-based cyberstalking detection, although they have recommended several notable detection approaches for e-mail-based crimes other than cyberstalking. Roy et al.[35] performed a comparative analysis between SVM and deep neural networks in intrusion detection and proposed several detection models for spam e-mails: a machine learning-based model[36] utilizing extreme learning machine (ELM) and support vector machine (SVM), a hybrid model[37] of rough set and decorate ensemble, and a multi-model approach[38] using Deep SVM, SVM, and artificial neural network models. Bassiouni et al.[39] proposed a spam e-mail classification model utilizing machine learning techniques. The authors performed experimental work on the Spambase UCI dataset using several machine learning classifiers and found that Random Forest gave better results for classifying e-mails as spam or ham. Another detection model using machine learning methods was proposed by Zhaoquan et al.[40] for spam filtering using marginal attack methods. Kontsewaya et al.[41] proposed another machine learning-based detection model for spam e-mail classification. The authors performed experimental work on a ready-made dataset containing 1368 spam and 4360 non-spam e-mails and found that Logistic Regression provides better results than the other classifiers.
Aviad Cohen et al.[42] proposed a model for the detection of malicious e-mails using machine learning methods. The authors applied general descriptive features with machine learning algorithms to enhance performance. Experimental work was performed on a dataset containing 33,142 e-mails (38.73% malicious and 61.27% benign) and found better results. Chaitra Sai et al.[43] proposed a model for the detection of spoofing e-mails using stacking algorithms. The authors explored various approaches and compared stacking algorithms of machine learning for detecting different types of spoofing e-mails to find better accuracy. Onan et al.[44] proposed an ensemble-based machine learning model for text classification and suggested another machine learning-based model utilizing a consensus clustering-based undersampling approach[45] for text classification on imbalanced datasets. The authors explored a comparative analysis of different feature engineering approaches, base learners, ensemble learning methods, and consensus clustering-based undersampling. Onan et al.[46] again proposed an ensemble pruning approach-based model utilizing multiple classifier techniques based on swarm-optimized topic modeling, a machine learning-based hybrid ensemble pruning model[47] utilizing clustering and randomized search, and a machine learning-based ensemble model[48] for text classification utilizing different extraction methods. Nisar et al.[49] suggested a soft voting technique using several machine learning classifiers for spam e-mail classification. During the experimental work, the authors found that the ensemble approach using the soft voting technique enhances the performance of spam e-mail classification. Bountakas et al.[50] proposed a machine learning-based hybrid ensemble approach using stacking and soft voting techniques for phishing detection. The authors performed experimental work on a dataset containing 3,460 phishing and 32,051 benign e-mail samples and found better performance with soft voting ensemble learning. Onan et al.[51] suggested a group-wise enhancement technique to perform text sentiment classification using a deep learning model and suggested a model with effective feature selection using an ensemble approach[52] to enhance the performance of text sentiment classification. Cybercriminals have recently introduced the image spam approach to render e-mail body text analysis ineffective. Image spam is unsolicited bulk e-mail that contains a message embedded in an image. Spammers use such images to avoid detection by text-based filters. Image spamming is a growing issue in executing cybercrimes, although some machine learning and deep learning-based image spam detection approaches have been suggested by researchers.[53,54]
In the area of e-mail-based cyberstalking detection, Ghasem et al.[55] introduced an improved e-mail-based cyberstalking detection framework for automatically detecting and controlling cyberbullying and cyberstalking using machine learning techniques. The authors' proposed ACES (Anti-Cyberstalking E-mail System) framework generally focused on automatic e-mail-based cyberstalking detection as well as evidence documentation to combat cybercriminals. Another e-mail-based cyberstalking detection model was proposed by Frommholz et al.[56] for textual analysis and cyberstalking detection using machine learning algorithms. The authors' proposed framework, ACTS (Anti-Cyberstalking Text-based System), mainly focused on author identification, text classification, personalization, and digital text forensics. X. Feng et al.[57] proposed another framework for e-mail-based cyberstalking detection using machine learning approaches. The authors' proposed model was inspired by the ACES and ACTS frameworks, and the authors claimed that it would perform better for cyberstalking detection. Another e-mail-based cyberstalking detection framework was proposed by Gautam[58] using a machine learning approach to detect, filter, and collect cyberstalking evidence on the textual data of non-spam e-mails. The authors' proposed framework explores the cyberstalking risk from non-spam e-mail: initially it classifies e-mail into spam and non-spam, and it further detects cyberstalking in non-spam e-mail. Another improved e-mail-based detection model was proposed by Maryam et al.[2] using a deep learning approach. The authors' proposed model classified e-mails into harassment, fraudulent, suspicious, and normal e-mails. Asante et al.[59] suggested another automated model for cyberstalking detection on social media using machine learning, data mining techniques, and digital forensics. The authors' proposed model contains identification, filtering, detection (content detection and offender profiling), and evidence modules.
Based on the literature review, it is found that researchers have mainly focused on social media-based cyberstalking and other harassment detection. Researchers have also contributed to exploring and detecting e-mail-based cybercrimes. E-mail-based cyberstalking is still not much explored, and more attention is required. A few authors[55-59] have contributed to detecting and combating e-mail-based cyberstalking. Automatic e-mail-based cyberstalking detection on textual data is still challenging, and there remains a lack of automated cyberstalking detection approaches with impressive performance. Inspired by these authors,[55-59] this paper proposes an EBCD model for automatic cyberstalking detection on the textual data of e-mails, classifying the e-mails from a user's mailbox into cyberstalking e-mails, suspicious e-mails, and normal e-mails.
Material and methodology
This section describes the detailed algorithms used for
designing the proposed model. E-mail-based cyberstalk-
ing detection (EBCD) model on textual data has two
main parts: Making ML Model Sets and E-mail-based
cyberstalking detection. In the first part of the EBCD
model, 3 ML model sets containing Support Vector
Machine (SVM), Logistic Regression (LR), Naïve
Bayes (NB), Random Forest (RF), and Soft Voting clas-
sifiers were trained and tested on three separate datasets
(subject line spam dataset, e-mail spam dataset, and cyberstalking dataset). In the second part of the EBCD model, e-mails from the user's mailbox were fetched
and later filtered as cyberstalking e-mails, suspicious
e-mails (spam and fraudulent), and normal e-mails by
applying the trained and tested ML model sets using soft
voting techniques. The stepwise procedure for making
ML model sets is described by algorithm-1, while algo-
rithm-2 describes the e-mail-based cyberstalking detec-
tion on textual data from a user’s mailbox. Figure 2
explains the basic functioning layout of the proposed
EBCD model on textual data. The overall methodology
for the proposed EBCD model is presented stepwise,
consisting of the following main phases to perform
both parts of the model for e-mail-based cyberstalking
detection on textual data.
(1) Making the datasets.
(2) Data pre-processing module.
(3) Feature extraction module.
(4) Making ML model sets.
(5) Fetching e-mails from the user's mailbox.
(6) Applying trained ML model sets to e-mails and combining the probabilities using soft voting.
(7) Aggregator module and e-mail classification.
(8) Saving classified e-mails as evidence.
(9) Model performance.
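The feature extraction, model-set, and soft-voting phases above can be sketched for a single model set. The paper does not specify its toolkit, so this is a hedged illustration using scikit-learn; the toy texts and labels below are invented placeholders, not the paper's datasets.

```python
# Illustrative sketch of one EBCD model set: TF-IDF features feeding a
# soft-voting ensemble of the four classifiers named in the paper.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Invented toy data: 1 = spam/harassment class, 0 = normal
texts = [
    "win a free prize now", "claim your lottery reward",
    "free money click here", "urgent verify your account",
    "last chance to claim cash",
    "meeting agenda attached", "project status update",
    "lunch tomorrow at noon", "minutes from the review call",
    "draft report for your comments",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

voter = VotingClassifier(
    estimators=[
        ("svm", SVC(kernel="linear", probability=True)),  # probability=True enables soft voting
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("nb", MultinomialNB()),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",  # average class probabilities, then take the argmax
)
model_set = make_pipeline(TfidfVectorizer(), voter)
model_set.fit(texts, labels)
probs = model_set.predict_proba(["claim a free prize now"])[0]
print(probs)  # averaged class probabilities for classes 0 and 1
```

With voting="soft", the ensemble averages the per-class probabilities of the four classifiers and picks the class with the highest mean probability, which mirrors the decision rule applied within each EBCD model set.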
Making datasets
This paper gathers several datasets[60-67] related to spam/
phishing e-mail subjects, spam e-mail text, fraudulent
e-mail, and harassment text (e-mail, tweets, and posts/
comments from social media). Three separate mixed
labeled datasets were made to train, test, and cross-
validate the three machine learning model sets based
on the collected datasets. Dataset D1 contains e-mail
subject line spam, phishing, and fraudulent data labeled
as spam (1) and ham (0). Dataset D2 contains spam and
fraudulent e-mail body text labeled as spam and ham
class. Dataset D3 contains harassment-related
(threatening, sexual abuse, hate messages, racism, etc.)
data from e-mails and social media tweets/posts/com-
ments labeled as cyberstalking (1) and non-
cyberstalking (0). Dataset D1 will be used to train and
test the machine learning classifiers of ML model set
MS-1. Dataset D2 will be used to train and test the
machine learning classifiers of ML model set MS-2,
while dataset D3 will be used to train and test the
machine learning algorithms of ML model set MS-3.
The distribution of data in every three datasets is
explained in Figure 3.
Data pre-processing module
Figure 2. Basic layout of the proposed EBCD (e-mail-based cyberstalking detection) model on textual data.
Figure 3. Distribution of data in labeled datasets.

The datasets and fetched e-mails often contain raw text with unnecessary characters, blank spaces, blank lines, meaningless characters, HTML tags, and various symbols. Properly cleaning the data is essential before feature extraction and classification. The data pre-processing module cleans and normalizes the data of all labeled training and testing datasets as well as the unlabeled e-mails fetched from the user's mailbox. Initially, this module is applied to the labeled training and testing datasets; later, it is applied to the unlabeled e-mails fetched from the user's mailbox. Several pre-processing tasks, such as stop-word removal, noise removal, tokenization, normalization, and stemming, are performed in this module to clean the data. In the first step of pre-processing, all stop words were removed. Stop words are meaningless words, such as articles, prepositions, and pronouns, that are not useful for e-mail classification [68]. Fetched e-mails from the user's mailbox and datasets gathered from different sources also contain noise data that must be removed. In an e-mail, repeated words, symbols (such as HTML tags, @, #, etc.), blank lines, blank spaces, special characters, punctuation marks, and useless digits constitute noise data. After removing the noise data and stop words, the text of the e-mail (subject and body) was split into individual words and added to a separate list; this process of splitting sentences into words is called tokenization. The tokenized text is then converted to lowercase through normalization to ensure uniformity. After that, tokenized words are reduced to their root form using the lemmatization [69] or stemming [69] methods. Lemmatization may be used instead of stemming for proper morphological analysis of the words; it combines synonymous words into a single word and removes the other related synonyms from the list [70]. In this paper, the stemming method was used.
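The pre-processing steps above can be sketched in a few lines of Python. This is a self-contained illustrative sketch only: a tiny hard-coded stop-word list and a crude suffix rule stand in for the full stop-word corpus and the PorterStemmer that a production module would use.

```python
import re

# Minimal stop-word list for illustration; the actual module would use a full corpus.
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "on", "of", "to", "and", "for", "it"}

def preprocess(text: str) -> list:
    """Clean raw e-mail text: noise removal, tokenization, normalization, stemming."""
    text = re.sub(r"<[^>]+>", " ", text)      # strip HTML tags
    text = re.sub(r"[^a-zA-Z\s]", " ", text)  # drop digits, symbols, punctuation
    tokens = text.lower().split()             # normalize case and tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # A crude suffix rule stands in for a real stemmer here.
    return [re.sub(r"(ing|ed|s)$", "", t) if len(t) > 4 else t for t in tokens]

print(preprocess("<p>You are WINNING exciting prizes!!! Click now</p>"))
```

Running the sketch on the sample string yields lowercase, stemmed tokens with the HTML tags, punctuation, and stop words removed.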
Feature extraction module
The feature extraction process is essential in machine learning before training, testing, and classifying e-mail because machine learning algorithms work on feature vectors and cannot understand data in text form. Feature extraction computes the weights of e-mail words and creates a feature vector in numerical form, and it plays a crucial role in improving the performance of classifiers [71]. Several traditional, word-embedding-based, and language-model-based feature extraction methods are available at the word, sentence, and n-gram levels [71]. TF-IDF, Word2Vec, BOW, BERT, FastText, GloVe, XLNet, ELECTRA, InferSent, GPT-2, and Universal Sentence Encoder are some widely used feature extraction methods [72-75]. The proposed EBCD model of this study applies the TF-IDF method for feature extraction. TF-IDF is an efficient calculation-based feature extraction method that measures the weight of a word within a document relative to a collection of documents [76]. TF-IDF finds the most frequently occurring words and assigns them more weight because regularly occurring words are more important for classification [77]. Equation (1) is used to calculate the feature vector in TF-IDF.
TF-IDF(T, D) = (PT_in_D / PW_in_D) x log(N / (PT_in_N + 1))    (1)

Where:
PT_in_D = number of times word T appears in a document D
PW_in_D = total number of words in the document D
(PT_in_D / PW_in_D together represent the term frequency)
PT_in_N = total occurrences of word T across all documents (represents the document frequency)
N = total number of documents
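As a concrete sketch, scikit-learn's TfidfVectorizer (the library named later in the experiments) builds exactly this kind of weighted feature matrix. The three-document corpus below is invented for illustration, and note that scikit-learn's IDF smoothing differs slightly from the form in Equation (1).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "win a free prize now",       # spam-like subject
    "meeting agenda for monday",  # normal subject
    "free entry win cash prize",  # spam-like subject
]

# Fit TF-IDF on the corpus and turn each document into a weighted feature vector.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

print(X.shape)                         # (documents, vocabulary size)
print(sorted(vectorizer.vocabulary_))  # learned vocabulary terms
```

The resulting sparse matrix X is what the classifiers of each ML model set are trained on.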
Making ML model sets
After cleaning the data through the data pre-processing module and obtaining the feature vectors through the feature extraction module, the machine learning model sets were designed and developed. In this study, three separate machine learning model sets were built: ML model set MS-1, ML model set MS-2, and ML model set MS-3. The machine learning algorithms of ML model set MS-1 were trained, tested, and validated on dataset D1; dataset D2 was used for training, testing, and validating the algorithms of ML model set MS-2; and ML model set MS-3 used dataset D3 for training, testing, and validating its algorithms. In each model set, Support Vector Machine (SVM), Logistic Regression (LR), Naïve Bayes (NB), Random Forest (RF), and soft voting classifiers were trained, tested, and validated.

The support vector machine is an efficient, versatile, and popular supervised machine learning algorithm broadly used to classify text with accurate results [30]. SVM creates hyperplanes and computes the distance between the separating hyperplane and the support vectors to classify the text. SVM offers several kernels (polynomial, sigmoid, Radial Basis Function, linear, and nonlinear kernels) with different mathematical functions [78]. By its native design, SVM makes hard predictions and does not support probability estimates directly; however, using Platt scaling or isotonic regression, SVM can determine the probability that a text belongs to the target class. This paper used the probability calibration classifier method for SVM to calculate the prediction probability of an e-mail. Naïve Bayes (NB) is an efficient and straightforward supervised machine learning algorithm that operates according to the Bayes theorem and is derived from conditional probability [79]. In this paper, the multinomial NB model was used; other NB variants are Gaussian NB and Bernoulli NB. Logistic regression is a statistical linear learning algorithm that maps any real-valued number onto an s-shaped curve using the sigmoid function to produce dichotomous results (a value between 0 and 1). Logistic regression predicts an output value (y) by combining the input features (x) linearly using weights or coefficient values [80]. Random forest is a supervised ensemble algorithm that uses multiple decision trees with the bootstrap technique to obtain better prediction results. For a classification problem, each tree in a random forest takes the input and casts an individual vote for a particular class, and the class that receives the maximum number of votes is predicted as the output [81].

The mathematical expression for calculating the prediction probability of an e-mail using SVM is given by Equation (2). Equation (3) shows the formula for determining the prediction probability using NB. Equations (4) and (5) show the expressions of the LR and RF classifiers, respectively, for calculating the prediction probability. Algorithm 1 describes the stepwise procedure for making the machine learning model sets.
P_SVM(y|e-mail) = 1 / (1 + exp(A·f(e-mail) + B))    (2)

Where A and B are scalar parameters learned by the algorithm during training, y is the target class (y = 1 for cyberstalking and y = 0 for non-cyberstalking), and f(e-mail) is a real-valued decision function.

P_NB(y|e-mail) = P(y) ∏_{i=1}^{n} P(x_i|y) / (P(x_1) P(x_2) ... P(x_n))    (3)

Where y is the target class (y = 1 for cyberstalking and y = 0 for non-cyberstalking), P(y|e-mail) is the posterior probability of the e-mail for target class y, P(e-mail) = P(x_1) P(x_2) ... P(x_n) is the prior probability of the predictor e-mail, P(y) is the prior probability of the target class, and P(x_i|y) is the likelihood (conditional probability) of the predictor e-mail for target class y.

P_LR(y|e-mail) = e^(a + b·e-mail) / (1 + e^(a + b·e-mail))    (4)

Where y is the predicted probability output, a is the intercept term, and b is the coefficient for the single input e-mail value learned from the training data.

P_RF(y|e-mail) = MaxVote{P_n(e-mail)}_{n=1}^{N}    (5)

Where N is the total number of trees in the random forest and P_n is the class prediction of the n-th tree.
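The Platt-scaling step of Equation (2) can be sketched with scikit-learn's CalibratedClassifierCV, which wraps a margin-only SVM so that predict_proba becomes available. The toy Gaussian clusters below are illustrative stand-ins for TF-IDF feature vectors, not the paper's data.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

# Toy two-cluster data standing in for TF-IDF features (illustrative only).
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (40, 5)), rng.normal(3, 1, (40, 5))])
y = np.array([0] * 40 + [1] * 40)

# Wrap a margin-based SVM in a probability calibrator; method="sigmoid" is
# Platt scaling, the A/B sigmoid fit of Equation (2).
svm = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=3)
svm.fit(X, y)

proba = svm.predict_proba([[0, 0, 0, 0, 0]])[0]  # [P(class 0), P(class 1)]
print(round(float(proba.sum()), 6))              # the two probabilities sum to 1
```

The calibrated wrapper is a drop-in replacement wherever the model sets call predict_proba on the other classifiers.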
Algorithm 1: Stepwise procedure for making ML model sets on labeled datasets

Step:1. Begin.
Step:2. Import labeled datasets D1, D2, and D3.
Step:3. Send datasets D1, D2, and D3 to the data pre-processing module for text cleaning and normalization.
Step:4. Split the datasets D1, D2, and D3 into training and testing sets: D1 = D1_Train + D1_Test, D2 = D2_Train + D2_Test, D3 = D3_Train + D3_Test, where D1_Train, D2_Train, D3_Train are the training corpora and D1_Test, D2_Test, D3_Test are the test corpora for datasets D1, D2, and D3, respectively.
Step:5. Apply the TF-IDF vectorizer on D1_Train, D2_Train, D3_Train, D1_Test, D2_Test, and D3_Test to get the feature vectors using the feature extraction module.
Step:6. Train and test the ML classifiers of ML model set MS-1 using the D1_Train and D1_Test corpora (training and testing feature sets of dataset D1).
Step:7. Train and test the ML classifiers of ML model set MS-2 using the D2_Train and D2_Test corpora (training and testing feature sets of dataset D2).
Step:8. Train and test the ML classifiers of ML model set MS-3 using the D3_Train and D3_Test corpora (training and testing feature sets of dataset D3).
Step:9. Apply K-fold cross-validation for the ML classifiers of model sets MS-1, MS-2, and MS-3 on datasets D1, D2, and D3, respectively.
Step:10. Measure the performance of the ML classifiers of each ML model set.
Step:11. Save the ML model sets as pickle files so that they can be used later during the classification of e-mails from the user's mailbox.
Step:12. End
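Algorithm 1 can be sketched with scikit-learn. The tiny corpus below is an invented stand-in for dataset D1; a Pipeline bundles the TF-IDF step with a soft-voting ensemble of the paper's four classifiers, and the fitted set is pickled as in Step 11.

```python
import pickle
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Tiny illustrative corpus standing in for dataset D1 (spam = 1, ham = 0).
texts = ["win free cash prize", "free prize claim now", "lunch meeting monday",
         "project report attached", "claim your free cash", "agenda for the call"] * 10
labels = [1, 1, 0, 0, 1, 0] * 10

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels)

# One "ML model set": TF-IDF features feeding a soft-voting ensemble of the
# four classifiers; SVC(probability=True) enables Platt-scaled probabilities.
model_set = make_pipeline(
    TfidfVectorizer(),
    VotingClassifier(
        estimators=[("svm", SVC(probability=True)),
                    ("nb", MultinomialNB()),
                    ("lr", LogisticRegression(max_iter=1000)),
                    ("rf", RandomForestClassifier(random_state=42))],
        voting="soft"))
model_set.fit(X_train, y_train)

with open("model_set_ms1.pkl", "wb") as f:  # Step 11: save for later reuse
    pickle.dump(model_set, f)
print(model_set.predict(["free cash prize now"])[0])
```

Loading the pickle later restores the whole pipeline, so unlabeled e-mail text can be passed to it directly without re-fitting the vectorizer.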
Fetching e-mails from the user’s mailbox
E-mail is private communication (person-to-person and person-to-group), so e-mails cannot be fetched from a user's mailbox without the user ID, password, and the user's permission. Several approaches can automatically fetch e-mails from the user's mailbox through a third-party application. The IMAP service and the Gmail API (in the case of the Gmail service) are the two main methods for fetching e-mails automatically. With the Gmail API, a user must log into the Google Cloud Console and enable the Gmail API service. It is then necessary to create or select an application under the OAuth Consent Screen of the Google Cloud Console and to create OAuth Client ID credentials for a desktop or web application, obtaining the Client ID with OAuth credentials as a text or JSON file. After getting the Client ID with OAuth credentials, e-mails from the user's mailbox can be fetched automatically through programs. The first time, the user is prompted that "This application wants to access your mailbox – Allow or deny," and once the user permits access to the mailbox, e-mails can be fetched. Fetching e-mails using the IMAP service requires only a user ID and password with some basic settings: after enabling "Allow less secure apps: ON" and enabling the IMAP service in the user's mailbox, e-mails can be fetched automatically through programs.
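A minimal IMAP sketch with Python's standard imaplib and email libraries follows. The generator is not executed here because it needs real credentials; the canned RFC 822 message only demonstrates the parsing step that follows fetching, and its addresses and content are invented.

```python
import email
import imaplib
from email.header import decode_header

def fetch_latest(user, app_password, limit=10):
    """Fetch the latest e-mails from a Gmail inbox over IMAP (requires the
    IMAP service enabled and an app password, as described above)."""
    mail = imaplib.IMAP4_SSL("imap.gmail.com", 993)
    mail.login(user, app_password)
    mail.select("Inbox")
    _, data = mail.search(None, "ALL")
    for num in data[0].split()[-limit:]:
        _, msg_data = mail.fetch(num, "(RFC822)")
        yield email.message_from_bytes(msg_data[0][1])

# Parsing works the same on any RFC 822 message; a canned one for illustration:
raw = b"From: stalker@example.com\r\nSubject: You cannot hide\r\n\r\nI am watching you."
msg = email.message_from_bytes(raw)
subject = decode_header(msg["Subject"])[0][0]
print(msg["From"], "|", subject, "|", msg.get_payload().strip())
```

Each parsed message exposes the date, sender, subject, and body that the later pre-processing and classification steps consume.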
Apply trained ML model sets to e-mails and combine
the probabilities using soft voting
After an e-mail is fetched from the user's mailbox, it is sent to the data pre-processing module and the feature extraction module to clean the text and obtain the feature vectors for the e-mail subject and body. The saved (trained and tested) ML model sets are then loaded to apply the classifiers separately to the e-mail subject and body text. Using ML model set MS-1, prediction probabilities for the e-mail subject were obtained from all ML classifiers (SVM, NB, LR, and RF). The classifiers of ML model set MS-2 were applied to the e-mail body text to determine the prediction probabilities for checking whether the e-mail is spam or normal, while all classifiers of ML model set MS-3 were applied to determine whether the e-mail is a cyberstalking e-mail or a normal e-mail. The prediction probabilities given by the individual ML classifiers in each ML model set may vary, and basing the final decision on the prediction of a single classifier could harm the e-mail classification task. Therefore, an ensemble approach using the multi-model soft voting technique was applied to get the final prediction probability for a particular class (spam or normal, cyberstalking or normal).
In machine learning, voting techniques are classified as hard voting and soft voting. In hard voting, a mode-based approach selects the majority vote among all the votes (predictions) made by the classifiers. For example, if classifier 1 predicts class "A," classifier 2 predicts class "B," and classifier 3 predicts class "A," then the hard voting technique gives the final prediction as class "A" due to the majority of votes. In soft voting, a mean-based approach derives the final prediction probability from the probabilities (votes) predicted by all classifiers for both classes. In soft voting, classifiers give the prediction probability for both classes (in the case of binary classification) using the predict_proba method; for example, p = svm.predict_proba() with p = {0.7, 0.3} shows that 0.7 is the probability of class "A" and 0.3 is the probability of class "B." If the predicted probability of classifier 1 is {0.7, 0.3}, that of classifier 2 is {0.4, 0.6}, and that of classifier 3 is {0.8, 0.2}, then the soft voting technique gives the final prediction probability as {0.633, 0.366}, which favors class "A." This study uses the soft voting technique to combine the prediction probabilities. The mathematical representation of the soft voting technique is given by Equation (6), and the functioning of soft voting in this study is described in Figure 4. The final prediction probability is calculated with the soft voting technique from the prediction probabilities provided by ML model set MS-1 (on the e-mail subject), model set MS-2 (on the e-mail body text), and model set MS-3 (on the e-mail body text).
Figure 4. Soft voting technique for combining the predicted probabilities and predicting the final result.
At the end of this phase, the three final prediction probabilities (from ML model sets MS-1, MS-2, and MS-3) for an e-mail (subject and body text) are sent to the aggregator module for e-mail classification.
P_SoftVoting(y_j|e-mail) = argmax_j ( (1/N) Σ_{k=1}^{N} P_k(C_k(e-mail)) )    (6)

Where k indexes a pair of class probabilities [P_k0, P_k1], N is the total number of classifiers, P_k is a probability, C_k is a classifier, j is the average probability of the N classifiers for the binary classes (j ∈ Y = {0, 1}), and the argmax function returns the final maximum probability for class y.
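For equally weighted classifiers, Equation (6) reduces to averaging the predict_proba outputs and taking the argmax. The worked example from the text can be reproduced in a few lines; the probability values are the illustrative ones above, not real classifier outputs.

```python
import numpy as np

# Per-classifier predict_proba outputs for one e-mail: [P(class A), P(class B)].
probs = np.array([[0.7, 0.3],   # classifier 1
                  [0.4, 0.6],   # classifier 2
                  [0.8, 0.2]])  # classifier 3

final = probs.mean(axis=0)      # Equation (6): mean over the N classifiers
winner = int(np.argmax(final))  # argmax picks the predicted class

print(np.round(final, 3), "-> class", "A" if winner == 0 else "B")
```

The averaged vector is approximately {0.633, 0.367}, so the ensemble favors class "A" even though one classifier voted the other way.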
Aggregator module and e-mail classification
The aggregator module of the proposed EBCD model takes the combined (final) prediction probabilities produced through soft voting from ML model sets MS-1, MS-2, and MS-3 and finally classifies an e-mail of the user's mailbox as a "Cyberstalking E-mail," a "Suspicious E-mail," or a "Normal E-mail." The aggregator module uses three e-mail check posts. In the first check post, the e-mail is checked for cyberstalking: if the combined prediction probability for class "A" (cyberstalking) provided by ML model set MS-3 is greater than 0.5, the e-mail is classified as a "Cyberstalking E-mail." If an e-mail is not identified as cyberstalking, the second check post checks it for suspicious content (spam and fraudulent) using the combined prediction probabilities for class "A" (spam) given by ML model sets MS-1 and MS-2. If the probability given by MS-2 > 0.5, or MS-1 > 0.5 and MS-2 > 0.5, the e-mail is identified as spam and must be checked for repeated spam and e-mail bombing. In the last check post, identified spam e-mails are sent to ML model set MS-2 to check for spam repetition and e-mail bombing by the same sender. At least the ten latest e-mails sent by the same sender are checked, and if the majority of them are spam or fraudulent, the spam e-mail identified in the second check post is classified as a "Cyberstalking E-mail" due to intensive repetition of spam or e-mail bombing. Ultimately, however, the user decides whether the e-mail is a cyberstalking e-mail or merely a suspicious (spam/fraudulent) e-mail. If, during check post 3, the ML model set using soft voting does not classify the e-mail as cyberstalking, the spam e-mail identified in check post 2 is classified as a "Suspicious E-mail." If the e-mail is identified as neither cyberstalking nor suspicious across all three check posts, it is classified as a "Normal E-mail."
Figure 5. E-mail classification in the aggregator module.
The functioning of the aggregator module for e-mail classification is described in Figure 5, while the overall stepwise procedure for e-mail classification from the user's mailbox is given in Algorithm 2.
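The three check posts can be sketched as a plain threshold function. The function name and the sender_repeat_prob argument are illustrative stand-ins for the soft-voted probabilities that the model sets would supply.

```python
def classify_email(p_ms3, p_ms2, p_ms1, sender_repeat_prob):
    """Aggregator check posts sketched as plain thresholds.
    p_ms3: soft-voted cyberstalking probability (body text, model set MS-3)
    p_ms2: soft-voted spam probability (body text, model set MS-2)
    p_ms1: soft-voted spam probability (subject, model set MS-1)
    sender_repeat_prob: MS-2 probability over the sender's latest ~10 e-mails."""
    if p_ms3 > 0.5:                                   # check post 1: cyberstalking content
        return "Cyberstalking E-mail"
    # Check post 2 mirrors the paper's stated rule, MS-2 > 0.5 or both > 0.5.
    if p_ms2 > 0.5 or (p_ms1 > 0.5 and p_ms2 > 0.5):
        # Check post 3: repeated spam / e-mail bombing by the same sender.
        if sender_repeat_prob > 0.5:
            return "Cyberstalking E-mail"
        return "Suspicious E-mail"
    return "Normal E-mail"

print(classify_email(0.9, 0.2, 0.1, 0.0))  # cyberstalking content
print(classify_email(0.2, 0.8, 0.9, 0.7))  # repeated spam -> e-mail bombing
print(classify_email(0.1, 0.2, 0.3, 0.0))  # nothing flagged
```

As in the paper, the final say remains with the user, who can override a "Cyberstalking" or "Suspicious" label after inspecting the evidence.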
Saving classified e-mails as evidence
After the classification of the user's mailbox, the available evidence must be stored in a file. The proposed EBCD model automatically reads the user's mailbox, moves cyberstalking e-mails to a cyberstalking folder and suspicious e-mails to a suspicious folder, and stores the e-mail date, sender, subject, body text, sentiment label, etc., in a CSV file during fetching. Later, the CSV file containing the classified e-mails from the user's mailbox can serve as evidence for training purposes and for legal action against cyberstalkers. The user can also use the gathered evidence to decide to blacklist the sender to avoid further cyberstalking from the same sender.
Model performance
The performance of the classifiers of each ML model set on each dataset (during training and testing, and during the e-mail classification from the user's mailbox) was measured separately. Performance metrics are a set of parameters that estimate model performance during training and testing (on labeled datasets) and at run time (on unlabeled e-mail classification) [82]. Several parameters in the performance metrics are usually calculated from the confusion matrix. In the case of binary classification, the confusion matrix is a 2 × 2 truth table containing the totals of True_Pos, True_Neg, False_Neg, and False_Pos. True_Pos (true positive) is a successful hit giving the total number of correctly detected cyberstalking or spam e-mails, while True_Neg (true negative) gives the total number of correctly detected normal e-mails. False_Pos (false positive) is a miss-hit counting normal e-mails incorrectly detected as cyberstalking or spam e-mails, while False_Neg (false negative) is the failure count of cyberstalking or spam e-mails wrongly detected as normal e-mails. This study used the widely used parameters accuracy, precision, recall, f-score, and AUC (Area Under the Curve) to measure the performance of the EBCD model.
Algorithm 2: Stepwise procedure for e-mail classification from the user's mailbox

Step:1. Begin.
Step:2. Load the saved pre-trained and pre-tested ML model sets MS-1, MS-2, and MS-3.
Step:3. Enable the IMAP service in the user's mailbox (Gmail).
Step:4. Enable "Allow less secure apps: ON" or generate an app password for the user's mailbox (Gmail).
Step:5. Import the required libraries and authenticate the login process using the user ID, password, host, and port. [In the case of Python, import the imaplib and email libraries; mail = imaplib.IMAP4_SSL(host, port); mail.login(username, app_password); host for Gmail = imap.gmail.com, port = 993]
Step:6. Select "Inbox" or/and another mailbox folder to fetch the e-mails. [As mail.select("Inbox")]
Step:7. Create the labels/folders "Cyberstalking" and "Suspicious" in the user's mailbox. [As mail.create("Cyberstalking") and mail.create("Suspicious")]
Step:8. Fetch an e-mail from the selected folder of the user's mailbox. [Get the e-mail date, sender, subject, e-mail text, and other required information]
Step:9. Split the e-mail into date, sender, subject, and body text, as e-mail_Subject and e-mail_BodyText.
Step:10. Send the e-mail subject and body text (e-mail_Subject and e-mail_BodyText) to the data pre-processing module for e-mail cleaning and normalization.
Step:11. Apply the TF-IDF vectorizer on e-mail_Subject and e-mail_BodyText to get the feature vectors using the feature extraction module.
Step:12. Apply all algorithms of ML model set MS-1 on e-mail_Subject and get the prediction probabilities. [As PP_MS1_SVM, PP_MS1_LR, PP_MS1_NB, and PP_MS1_RF]
Step:13. Apply all algorithms of ML model set MS-2 on e-mail_BodyText and get the prediction probabilities. [As PP_MS2_SVM, PP_MS2_LR, PP_MS2_NB, and PP_MS2_RF]
Step:14. Apply all algorithms of ML model set MS-3 on e-mail_BodyText and get the prediction probabilities. [As PP_MS3_SVM, PP_MS3_LR, PP_MS3_NB, and PP_MS3_RF]
Step:15. Combine the prediction probabilities of ML model sets MS-1, MS-2, and MS-3 and get the final probability in each model set using Equation (6) of the soft voting technique:
    FPP1_MS1 = (PP_MS1_SVM + PP_MS1_LR + PP_MS1_NB + PP_MS1_RF) / 4
    FPP2_MS2 = (PP_MS2_SVM + PP_MS2_LR + PP_MS2_NB + PP_MS2_RF) / 4
    FPP3_MS3 = (PP_MS3_SVM + PP_MS3_LR + PP_MS3_NB + PP_MS3_RF) / 4
Step:16. If (FPP3_MS3 > 0.5) then
    Classify the e-mail as a "Cyberstalking e-mail."
    Assign a label (value = 1, cyberstalking e-mail (negative e-mail)).
    Move the e-mail to the "Cyberstalking" folder of the user's mailbox.
Step:17. ElseIf (FPP2_MS2 > 0.5) or (FPP2_MS2 > 0.5 AND FPP1_MS1 > 0.5) then
    Check for repeated spam and e-mail bombing by the same sender (check at least the ten latest e-mails of the sender) and apply ML model set MS-2 to get the final probabilities using the soft voting technique.
    [As RFPP4_MS2 = Call Get_Sentiment_e-mail(Sender, MS-2) (any user-defined function for getting the prediction probabilities for the sender's e-mails)]
    If (RFPP4_MS2 > 0.5) then
        Classify the e-mail as a "Cyberstalking e-mail."
        Assign a label (value = 1, cyberstalking e-mail (negative e-mail)).
        Move the e-mail to the "Cyberstalking" folder of the user's mailbox.
    Else
        Classify the e-mail as a "Suspicious e-mail."
        Assign a label (value = 2, suspicious e-mail (negative e-mail containing spam/fraudulent content)).
        Move the e-mail to the "Suspicious" folder of the user's mailbox.
Step:18. Else [in the case of FPP3_MS3 < 0.5, FPP2_MS2 < 0.5, and FPP1_MS1 < 0.5]
    Classify the e-mail as a "Normal e-mail."
    Assign a label (value = 0, normal e-mail (positive e-mail)).
Step:19. Save the fetched and classified e-mail to a CSV file (all e-mail-related information: date, sender, subject, text, sentiment label (Cyberstalking/Suspicious/Normal), etc.).
Step:20. Repeat steps 8 to 19 until a sufficient number of e-mails have been fetched from the user's mailbox (define a fetching limit).
Step:21. Measure the performance of the ML classifiers of each ML model set.
Step:22. End
Accuracy
Accuracy shows the total number of right predictions made by the classifier. Equation (7) shows the mathematical representation for calculating the accuracy.

Accuracy = (True_Pos + True_Neg) / (True_Pos + False_Pos + False_Neg + True_Neg)    (7)
Precision
Precision shows the proportion of true positives among all predicted positives. Precision can be calculated using Equation (8).

Precision = True_Pos / (True_Pos + False_Pos)    (8)
Recall
Recall determines the sensitivity of the model and measures the ratio of true positive predictions to the total positives. Recall can be calculated using Equation (9).

Recall = True_Pos / (True_Pos + False_Neg)    (9)
F-score
The F-score measures the test accuracy as the harmonic mean of precision and recall. The F-score can be calculated using Equation (10).

F-Score = 2 × (Precision × Recall) / (Precision + Recall)    (10)
AUC (Area Under the Curve)
AUC estimates the ability of the classifier to separate the classes correctly. The ROC (Receiver Operating Characteristic) is a probability curve that plots the True Positive Rate (TPR) against the False Positive Rate (FPR). Equation (11) can be used to calculate the AUC.

AUC = (1/2) × (True_Pos / (True_Pos + False_Neg) + True_Neg / (True_Neg + False_Pos))    (11)
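All five parameters can be computed directly from the confusion matrix with scikit-learn. The labels below are toy values for illustration; note that scikit-learn's roc_auc_score applied to hard 0/1 predictions coincides with Equation (11).

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

# Toy ground truth vs predictions (1 = cyberstalking, 0 = normal), illustrative only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP/TN/FP/FN:", tp, tn, fp, fn)
print("accuracy :", accuracy_score(y_true, y_pred))   # Equation (7)
print("precision:", precision_score(y_true, y_pred))  # Equation (8)
print("recall   :", recall_score(y_true, y_pred))     # Equation (9)
print("f-score  :", f1_score(y_true, y_pred))         # Equation (10)
print("AUC      :", roc_auc_score(y_true, y_pred))    # Equation (11) for hard labels
```

With 4 true positives, 4 true negatives, and one error of each kind, all five metrics come out to 0.8 on this toy example.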
Results and discussion
This section discusses the experimental setup and results for e-mail-based cyberstalking detection on textual data. The experiments used the Python language with Scikit-learn, imaplib, email, BeautifulSoup, smtplib, NLTK, and other library packages to develop the proposed EBCD model. In the first stage of the experiment, the machine learning classifiers of model set MS-1 were trained, tested, and cross-validated (k-fold) on labeled dataset D1. The performance of the different classifiers of ML model set MS-1 is presented in Table 1 and Figure 6.
Figure 6. Performances of classifiers of ML model set MS-1 on dataset D1.

Table 1. Performance of ML classifiers of ML model set MS-1.
Dataset (D1): e-mail subjects; total unique records: 23320; model set: Machine Learning Model Set MS-1

ML Classifiers       Accuracy   Precision  Recall     F-Score    AUC
Naive Bayes          0.925443   0.968613   0.878256   0.921194   0.986559
Logistic Regression  0.946884   0.911006   0.989749   0.948730   0.991968
Random Forest        0.969640   0.956672   0.984106   0.969396   0.992305
SVM                  0.971012   0.955813   0.987330   0.971283   0.993824
Soft Voting          0.976844   0.969979   0.983211   0.976550   0.994339
As per the experimental results, the soft voting approach achieved the best accuracy of 97.7%, the best precision of 97%, the best f-score of 97.7%, and the best AUC of 99.4%. Logistic regression provided the highest recall of 99%. The support vector machine took second place with an accuracy of 97.1%, recall of 98.7%, f-score of 97.1%, and AUC of 99.3%. The overall performance of all classifiers of model set MS-1 was up to the mark, with near-identical performance in terms of AUC.
In the second stage of the experiment, the machine learning classifiers of model set MS-2 were trained, tested, and cross-validated (k-fold) on labeled dataset D2. The performance of the different classifiers of ML model set MS-2 is presented in Table 2 and Figure 7. As per the experimental results, the soft voting technique was again the best-performing classifier, with the best accuracy of 97.7%, recall of 96.5%, f-score of 97.4%, and AUC of 99.7%. The maximum precision of 98.6% was provided by the random forest classifier, while SVM was again the second-best performer with an accuracy of 97%, precision of 98%, recall of 95.3%, f-score of 96.6%, and AUC of 99.6%. The other classifiers of model set MS-2 also performed up to the mark, close to the best-performing classifier.
Table 2. Performance of ML classifiers of ML model set MS-2.
Dataset (D2): e-mail body text; total unique records: 31715; model set: Machine Learning Model Set MS-2

ML Classifiers       Accuracy   Precision  Recall     F-Score    AUC
Naive Bayes          0.955520   0.958086   0.942925   0.950428   0.992443
Logistic Regression  0.959009   0.977275   0.931027   0.953581   0.993947
Random Forest        0.964391   0.986148   0.934373   0.960268   0.989182
SVM                  0.969646   0.979182   0.953151   0.965982   0.995702
Soft Voting          0.977299   0.983459   0.964977   0.974130   0.996806

Figure 7. Performances of classifiers of ML model set MS-2 on dataset D2.

In the third stage of the experiment, the machine learning classifiers of model set MS-3 were trained, tested, and cross-validated (k-fold) on labeled dataset D3. The performance of the different classifiers of ML model set MS-3 is presented in Table 3 and Figure 8. The experimental results show that the soft voting technique again provided the best AUC of 99.9%, while SVM was the best-performing classifier with an accuracy of 99.0%, precision of 99.4%, and f-score of 99.0%. Naïve Bayes achieved the maximum recall of 99.4%; however, all classifiers of model set MS-3 performed outstandingly, with performance parameters close to those of the best-performing classifier. Overall, the soft voting classifier of each model set performed best on all three datasets; it not only enhances the performance of the classification task but also helps to make the right decision during the classification of unlabeled data based on the majority of votes. Sometimes classifiers report the highest performance parameters due to model overfitting, although the experiments in this paper used k-fold cross-validation to avoid overfitting. The soft voting technique may also counteract overfitting through majority votes and so avoid wrong decisions during the classification of unknown data. For example, in the classification of an e-mail from the user's mailbox, one classifier may indicate a normal e-mail while the other classifiers predict a cyberstalking e-mail; in this scenario, the soft voting technique uses the majority of votes to make the right decision for the actual classification of the e-mail. Based on these advantages, the multi-model soft voting technique was used during the classification and labeling of the (unlabeled) e-mails from the user's mailbox.
At the end of the experiments, each ML model set's trained, tested, and validated classifiers were saved as pickle files for further use during automated cyberstalking detection and filtration of e-mails from the user's mailbox. In the last experiment, the trained, tested, and validated classifiers were applied to classify the e-mails from the user's mailbox (as discussed in Algorithm 2 of the methodology section). For experimental purposes, different types of e-mails (spam, fraudulent, cyberstalking, and normal) were sent to the author's mailbox from different e-mail IDs of the authors using a Python program through smtplib tools. Using the EBCD model, a total of 497 e-mails were fetched and classified as cyberstalking e-mails (37.8%), suspicious e-mails (26.4%), and normal e-mails (35.8%). The distribution of fetched and classified e-mails is shown in Figure 9. The performance of the classifiers of ML model sets MS-2 and MS-3 was measured on the fetched, classified e-mails using a manual "OneVsRest" approach. The fetched, classified e-mails were divided into two datasets: set 1 and set 2. The classifiers of model set MS-2 were tested on set 1, containing all e-mails belonging to the suspicious and normal classes, while the classifiers of MS-3 were tested on set 2, containing the cyberstalking and normal classes. The average performance of the different classifiers of ML model sets MS-2 and MS-3 is presented in Table 4 and Figure 10. As the experimental results in Table 4 and Figure 10 show, the soft voting technique outperformed the other classifiers in terms of accuracy, achieving the highest accuracy of 96.3% and an f-score of 95.9%.
Table 3. Performance of ML classifiers of ML model set MS-3.
Dataset (D3): harassment text; total unique records: 36804; model set: Machine Learning Model Set MS-3

ML Classifiers       Accuracy   Precision  Recall     F-Score    AUC
Naive Bayes          0.944608   0.909068   0.993706   0.949491   0.994240
Logistic Regression  0.982285   0.992456   0.973579   0.982923   0.998221
Random Forest        0.981524   0.977701   0.985060   0.982392   0.997325
SVM                  0.990327   0.994357   0.987135   0.990731   0.998615
Soft Voting          0.988697   0.986543   0.991547   0.989039   0.998727

Figure 8. Performances of classifiers of ML model set MS-3 on dataset D3.
Figure 9. Distribution of fetched and classified e-mails.
The highest AUC of 96.8% was achieved by both SVM and soft voting. The highest precision values of 98.5%, 98.1%, and 98.1% were provided by random forest, soft voting, and the support vector machine, respectively. In the case of recall, the support vector machine, naïve Bayes, and soft voting achieved 94.8%, 94.4%, and 94.1%, respectively. The overall performance of all classifiers of the model sets was up to the mark. During the classification and labeling of e-mails from the user's mailbox, the final decision was taken using the soft voting technique, and afterward the performance of all classifiers was measured on the classified e-mails using the stratified k-fold cross-validator. Across the overall experimental work, the performance of the support vector machine was notable, but the soft voting technique proved the better choice for making the right decision.
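The soft voting decision itself is simple to state: average the class-probability vectors produced by the individual classifiers and pick the class with the highest mean. A minimal sketch, where the classifier outputs are made-up numbers for illustration:

```python
def soft_vote(probability_vectors):
    """Average per-class probabilities from several classifiers and
    return (winning class index, averaged probability vector)."""
    n_models = len(probability_vectors)
    n_classes = len(probability_vectors[0])
    avg = [sum(vec[c] for vec in probability_vectors) / n_models
           for c in range(n_classes)]
    return avg.index(max(avg)), avg
```

For example, with classes (normal, cyberstalking) and three classifiers predicting [0.40, 0.60], [0.55, 0.45], and [0.30, 0.70], the averaged vector is roughly [0.417, 0.583], so the ensemble labels the e-mail cyberstalking even though one classifier's hard vote would have said normal.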
Conclusion and future work
E-mail-based cyberstalkers carry out negative and fear-inducing communication over e-mail technology. Spamming, e-mail bombing, and the general approach of cyberstalking are common forms of e-mail-based harassment. Apart from these, cyberstalkers also utilize several other approaches to target a victim or group over e-mail, and these are complex to detect automatically. This paper proposed an EBCD model using the multi-model soft voting technique of the machine learning approach for automatic cyberstalking detection on textual data from a user's mailbox. Initially, three machine learning model sets containing random forest, support vector machine, naïve Bayes, logistic regression, and soft voting classifiers were trained, tested, and validated through k-fold cross-validation on three different datasets. Classifiers of model set MS-1 were trained, tested, and validated on dataset D1, containing spam, phishing, and fraudulent e-mail subject lines, so that they could later classify e-mails by subject. Classifiers of model set MS-2 were trained, tested, and validated on dataset D2, containing spam- and fraud-related e-mail body text; model set MS-2 can therefore classify an e-mail from the user's mailbox as spam and can also be used to check for the e-mail bombing and repeated spamming approaches of cyberstalkers. Classifiers of model set MS-3 were trained, tested, and validated on dataset D3, containing harassment-related data, so that they could subsequently check for cyberstalking e-mails in the user's mailbox.
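The combined use of the three model sets can be sketched as a chain of check posts. The function names, threshold, and ordering below are illustrative assumptions for exposition; the authoritative routing is Algorithm 2 in the methodology section:

```python
def classify_email(subject, body, ms1, ms2, ms3, threshold=0.5):
    """Label an e-mail via three check posts.

    ms1, ms2, and ms3 stand in for the soft-voting ensembles of model
    sets MS-1 (subject line), MS-2 (spam/fraud body text), and MS-3
    (harassment text); each is a callable returning the positive-class
    probability for its text. Stubs replace the trained models here.
    """
    if ms3(body) >= threshold:
        return "cyberstalking"
    if ms1(subject) >= threshold or ms2(body) >= threshold:
        return "suspicious"
    return "normal"
```

With stub models, a harassing body yields "cyberstalking", a spammy subject or body yields "suspicious", and everything else falls through to "normal", mirroring the three output labels of the EBCD model.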
Table 4. Average performance of ML classifiers of ML model sets on fetched and classified e-mails from the user's mailbox.
Dataset D4: Set 1 (suspicious and normal e-mails) and Set 2 (cyberstalking and normal e-mails).
Fetched e-mails classified and labeled by: soft voting technique of the EBCD model.
Total unique e-mails: 497 (cyberstalking: 37.8%, suspicious: 26.4%, normal: 35.8%).

| ML Classifiers      | Accuracy | Precision | Recall   | F-Score  | AUC      |
|---------------------|----------|-----------|----------|----------|----------|
| Naive Bayes         | 0.879176 | 0.979588  | 0.944156 | 0.898151 | 0.964610 |
| Logistic Regression | 0.910239 | 0.979637  | 0.927597 | 0.914087 | 0.964462 |
| Random Forest       | 0.911804 | 0.984968  | 0.935390 | 0.940763 | 0.964478 |
| SVM                 | 0.941714 | 0.981334  | 0.947727 | 0.946976 | 0.968466 |
| Soft Voting         | 0.963057 | 0.981431  | 0.940584 | 0.959211 | 0.967925 |
Figure 10. Average performance of classifiers of ML model sets on fetched and classified e-mails from user’s mailbox.
16 A. K. GAUTAM AND A. BANSAL
The performance of the classifiers of each model set was measured using accuracy, precision, recall, F-score, and AUC. Experimental results show that the soft voting technique achieved the best accuracy of 97.7%, the best F-score of 97.7%, the best precision of 97%, and the best AUC of 99.4% on dataset D1. The soft voting technique also performed well on dataset D2, with the best accuracy of 97.7%, the best F-score of 97.4%, the best recall of 96.5%, and the best AUC of 99.7%. In the case of dataset D3, the soft voting technique achieved the best AUC of 99.9%, while the accuracy, precision, and F-score provided by soft voting were very close to those of the top-performing classifier (SVM).
Due to its overall better performance, the multi-model soft voting technique was applied for the automated classification and labeling of e-mails from the user's mailbox. During the classification of e-mails from the user's mailbox, the trained, tested, and validated classifiers of model sets MS-1, MS-2, and MS-3 were applied as a combined approach. Based on the final decision of the soft voting classifiers of the MS-1, MS-2, and MS-3 models at each of the three e-mail check posts, e-mails from the user's mailbox were classified as cyberstalking, suspicious, or normal. The performance of all classifiers of each model set was measured on the classified e-mails from the user's mailbox. The average performance of the classifiers shows that soft voting again performed well, with an accuracy of 96.3% and an F-score of 95.9%, while its precision, recall, and AUC were very close to those of the top-performing classifier. The overall experimental results show that the performance of the support vector machine was notable, but the soft voting technique is the better choice for unlabeled e-mail classification. The soft voting technique not only enhances the performance of the classification task for labeled and unlabeled e-mails but also helps in making the right decision for the actual classification of e-mails. The proposed EBCD model performed well and could automatically classify e-mails from the user's mailbox and support evidence collection. The proposed EBCD model not only automatically detects cyberstalking e-mails but also classifies e-mails as suspicious (spam and fraudulent) based on their textual data. Further, the EBCD model also helps to detect basic intent-wise cyberstalking performed through repeated spamming and e-mail bombing. However, advanced intent-wise e-mail-based cyberstalking detection, including the image spam approach of cyberstalking, is more complex than content-wise e-mail-based cyberstalking detection. Future work includes the design and development of an enhanced EBCD model for the detection of advanced intent-wise cyberstalking performed through phishing, malicious, defamatory, e-mail spoofing, and image spam-based approaches, so that advanced intent-wise cyberstalking can be detected automatically in the fake e-mail, identity theft, and personal/financial loss approaches of cyberstalkers. Future work also includes the design and development of an EBCD model using deep learning techniques and a comparison of the currently proposed EBCD model with ANN, LogitBoost, XGBoost, LSTM, and GRU models.
Disclosure statement
No potential conflict of interest was reported by the authors.
ORCID
Arvind Kumar Gautam http://orcid.org/0000-0001-6057-1006
Abhishek Bansal http://orcid.org/0000-0001-5968-3625