Automatic Cyberstalking Detection on Twitter in Real-Time using Hybrid Approach

I.J. Modern Education and Computer Science, 2023, 1, 58-72
Published Online on February 8, 2023 by MECS Press (http://www.mecs-press.org/)
DOI: 10.5815/ijmecs.2023.01.05
This work is open access and licensed under the Creative Commons CC BY License. Volume 15 (2023), Issue 1
Arvind Kumar Gautam*
Department of Computer Science, Indira Gandhi National Tribal University, Amarkantak, MP, 484886, India
Email: analyst.igntu@gmail.com
ORCID ID: https://orcid.org/0000-0001-6057-1006
*Corresponding Author
Abhishek Bansal
Department of Computer Science, Indira Gandhi National Tribal University, Amarkantak, MP, 484886, India
Email: abhishek.bansal@igntu.ac.in
ORCID ID: https://orcid.org/0000-0001-5968-3625
Received: 03 January, 2022; Revised: 19 March, 2022; Accepted: 19 June, 2022; Published: 08 February, 2023
Abstract: Many people use Twitter to express their thoughts and share information in real time. Twitter is also one
of the most popular social media applications among cybercriminals, who use it to harass victims through
cyberstalking. Cyberstalkers target victims through sexism, racism, offensive language, hateful language, trolling, and
fake accounts on Twitter. This paper proposes a framework for automatic cyberstalking detection on Twitter in real
time using a hybrid approach. Initially, experiments were performed on recent unlabeled tweets collected
through the Twitter API using three different methods: lexicon-based, machine learning, and hybrid. The TF-IDF
feature extraction method was used with all three methods to obtain feature vectors from the tweets. The
lexicon-based approach produced a maximum accuracy of 91.1%, and the machine learning approach achieved a maximum
accuracy of 92.4%. In comparison, the hybrid approach achieved the highest accuracy, 95.8%, for classifying
unlabeled tweets fetched through the Twitter API. A variant of the hybrid method was then applied to classify and
label live tweets collected through Twitter Streaming in real time. It again gave the best result, with an accuracy
of 94.2%, recall of 94.1%, precision of 94.6%, F-score of 94.1%, and AUC of 98%. The performance of the machine
learning classifiers was measured on each dataset labeled by the three methods. The experimental results show that
the proposed hybrid approach performed better than the other implemented approaches in classifying both recent and
live tweets. SVM performed better than the other machine learning algorithms with all applied approaches.
Index Terms: Cyberstalking Detection, Cyberbullying, Machine Learning, Lexicon, TF-IDF, Support Vector Machine,
Naive Bayes, Sentiment Analysis, Feature Extraction, Twitter.
1. Introduction
Twitter is a real-time social media application that has gained global popularity. According to
statistics [1], more than 300 million users worldwide use Twitter, and more than 500 million tweets are posted
daily. Twitter is a convenient way to stay connected with family, friends, and colleagues and to share tweets for
personal, official, and business purposes [2]. However, the use of Twitter also raises challenging issues in the form of
cyberstalking, cyberbullying, and other cyber harassment. Cyberstalking is a dangerous and complex cybercrime that
targets numerous people, communities, and organizations [3]. Cyberstalkers and gangs of cyberstalkers are
active on Twitter with pre-defined plans and agendas to insult and harass victims through repeated
acts of sexism, racism, profanity, offensive language, abuse, hate, trolling, fake news, and fake accounts [4, 5, 6].
Effective arrangements for cyberstalking detection, control, and counteraction are required to handle this
problem on Twitter.
Researchers widely use lexicon-based and machine learning techniques, supported by sentiment analysis, for
cyberstalking detection [7-9]. Sentiment analysis plays an important role in text analysis by determining the score of
words classified as positive or negative comments [10]. The lexicon-based approach [11, 12] uses a pre-defined,
rule-based dictionary of good and bad words to determine the score of each word and assign it a positive or
negative sentiment. The main limitation of the lexicon-based approach is that sentiment polarity scores cannot be
assigned to words that are not in the dictionary. In contrast, machine learning techniques for sentiment
analysis do not depend on any pre-defined dictionary [13]. In the machine learning methodology, the detection
model is first trained on a labeled dataset and then predicts the probability that a text carries positive or
negative sentiment [14].
An improved detection model is required to enhance the performance of cyberstalking detection on Twitter in
real time. There is still much scope for comparative analysis of lexicon-based and machine learning-based
cyberstalking detection to determine and design a better approach. The main research objective of this
paper is to analyze and compare different methods of automated cyberstalking detection on Twitter and to propose a
better-performing approach. Initially, this paper applied the lexicon-based and machine learning
approaches separately. Then, to combine the benefits of both methods, this paper implemented an automated hybrid
approach to detect cyberstalking tweets on Twitter in real time. The main contributions of this study are as
follows.
• We performed a comparative analysis of lexicon-based, machine learning, and hybrid approaches for
cyberstalking detection in recent tweets collected directly through the Twitter API.
• We proposed a hybrid approach for automatic cyberstalking detection on live tweets fetched directly through
Twitter streaming in real time. The proposed hybrid method can classify and label live tweets in real time with
high accuracy.
• The proposed approach can also be used on other social media platforms that provide live comments through
an API.
The proposed hybrid approach was applied to recent tweets (collected through the Twitter API in real time) and live
tweets (collected through Twitter Streaming in real time). Initially, the proposed hybrid approach was trained with a
labeled dataset and then auto-trained with the classified tweets. With both recent and live tweets, the proposed hybrid
approach performed better than the traditional lexicon-based and machine learning techniques. The rest of the
paper is structured as follows. Section 2 presents a literature review of notable and recent contributions in the
related field. Section 3 describes the materials and the proposed methodology. Section 4 presents the experimental
setup, results, and a detailed discussion. Finally, section 5 gives the conclusion and future work.
2. Review of Literature
In the literature survey, related research papers were selected to review past work on the automatic detection of
cyberbullying, cyberstalking, and other cyber harassment.
Ghasem et al. [15] suggested a model for automatically detecting and controlling cyberbullying and cyberstalking using
machine learning techniques. Their approach focused on automatic email-based cyberstalking detection
and on evidence documentation to combat cybercriminals. Frommholz et al. [16] suggested a textual analysis-based
cyberstalking detection model using machine learning algorithms. Their method focused mainly
on author identification, text classification, personalization, and digital text forensics. Saravanaraj et al. [17]
implemented an automated model for detecting cyberbullying tweets on Twitter using supervised machine learning
techniques. The authors used Random Forest and Naïve Bayes algorithms to classify tweets and reported adequate
results. Another machine learning-based automated cyberbullying detection model was developed by
Zhang et al. [18] to detect bullying tweets on Twitter. The authors experimented with various
machine learning models and multiple textual features and reported a maximum accuracy of 90%. Liew et al. [19]
suggested an automated security alert model using supervised machine learning techniques to detect and control
phishing tweets in real time on Twitter. The authors implemented their model using random forest and reported
improved accuracy. Balakrishnan et al. [20] utilized users' psychological personalities, sentiments, and emotions to
design a cyberbullying detection model for Twitter. The authors used machine learning to filter and
categorize tweets into bully, aggressor, spammer, and regular tweets. Shah et al. [21] also
designed a machine learning-based framework for automatically detecting cyberbullying tweets on Twitter. The authors
implemented their approach using several machine learning algorithms and reported a maximum accuracy of
93% for logistic regression.
Kazim Raza et al. [22] applied a lexicon-based methodology to automatically detect cyberbullying tweets written
in the Roman-Urdu language on Twitter. The authors reported better results than previous
work. Geetha et al. [23] offered another model that combines text analysis features with a lexicon-based method for
automatic detection of offensive language on Twitter. The authors used LIWC, POS, and Twitter Tag Scores (TTS) for
lexicon-based text analysis and implemented the model with deep learning and machine learning. They achieved 91.72%
accuracy with the C-LSTM method and 90.8% accuracy with logistic regression and SVM. Another machine-learning-based
methodology was suggested by Bandi Yoshna et al. [24] to detect cyberbullying on Twitter. The authors tested
their model using random forest and SVM algorithms and obtained an accuracy of 71.2% for the support
vector machine. Real-time cyberbullying detection on Twitter for Hindi-English mixed tweets was suggested by Kumar
Akshi et al. [25] with the support of transfer learning and deep neural networks. Their model converted
tweets in Hindi and mixed language into English and classified them automatically. Yuvaraj et al. [26] applied
deep decision tree classification with multi-feature-based AI in their automatic cyberbullying detection
model for Twitter. In their experiments, the authors classified and labeled 30,384 tweets using the deep decision tree
classification method. Another detection model based on deep neural networks was implemented by Sadiq et al. [27] for
the automatic detection of aggressive tweets on Twitter. The authors experimented with a multilayer
perceptron, CNN-LSTM, and CNN-BiLSTM to classify aggressive tweets and reported an accuracy of 92%.
Sangwan et al. [28] designed a filter-wrapper-based hybrid model for automatic detection of cyberbullying on
Twitter and Instagram. The authors implemented the hybrid detection model using a lexicon-based method and a machine
learning model and reported improved results. Lepe-Faúndez et al. [29] proposed a model for automatic detection of
cyberbullying in the Spanish language on Twitter using a hybrid method. The authors evaluated their hybrid model
using lexicon-based and machine learning methods and reported a maximum accuracy of 89.2%. Madan et al. [30]
suggested a real-time sentiment analysis model using lexicon-based, machine learning-based, and hybrid methods for
tweets in the Hindi language on Twitter. Another hybrid model was proposed by Almutairi et al. [31] for the automatic
detection of cyberbullying in Arabic-language tweets. The authors implemented their approach using a
lexicon-based method with machine learning and obtained 82% accuracy. Arora et al. [32] proposed a novel methodology for
automatically detecting cyber harassment on Twitter using a mixed-methods approach. The authors performed their
experiments using a lexicon-based method and SVM to classify cyber harassment into spam, hateful, abusive, and neutral
tweets. Ayo et al. [33] implemented a clustering model for automatic hate speech detection on Twitter. The
authors applied a rule-based clustering method for automatically classifying tweets and fuzzy logic for hate speech
detection in real time. They achieved an AUC of 96.4% and an F-score of 94.5%.
In the literature, the authors of [22, 23] applied a lexicon-based approach, while the authors of [15-21, 24-27]
implemented detection models using machine learning techniques. The authors of [28-33] suggested
hybrid approaches that combine lexicon-based and machine learning techniques for automatic detection. The majority of
researchers applied machine learning techniques for automatic tweet classification. Automatic cyberstalking detection
on Twitter and other social media networks in real time is still a challenging task, and automated real-time
cyberstalking detection approaches with strong performance are still lacking.
3. Material and Methodology
This section describes the algorithms used to design the proposed model. Fig. 1 shows the basic
functional layout of the proposed automatic detection model. The proposed model consists of
the following main phases for real-time cyberstalking detection on Twitter.
1. Tweets Collection and Making the Dataset
2. Tweets Pre-processing
3. Feature Extraction
4. Classification and Labeling of the Tweets
5. Real-Time Cyberstalking Detection on Live Tweets
6. Measuring the Performance of the Model
The proposed methodology was implemented on both recent and live tweets. After the recent tweets were fetched using
the Twitter API, a lexicon-based approach was first applied for tweet classification. Machine learning classifiers were
trained on the pre-defined dataset, and the machine learning and hybrid approaches were then applied
separately to the same recent tweets. Finally, the hybrid approach was applied again to classify live
tweets in real time. The detailed procedure for each approach shown in Fig. 1 is explained below.
3.1 Tweets Collection and Making the Dataset
In the initial stage, this paper used the Twitter API to collect recent tweets and build the dataset, while live tweets
were fetched during real-time cyberstalking detection. Several hashtag keywords related to cyberstalking,
cyberbullying, cyber harassment, and cybercrime were used to collect the recent tweets from Twitter. The following
steps were used for collecting tweets and making the dataset.
Procedure for Tweets Collection and Making the Dataset
Step:1. Logged in to a Twitter developer account, registered for the Twitter API, and obtained the required
authentication keys and tokens by creating a new application or using an existing application on the Twitter
developer account. Twitter generally provides four authentication keys and tokens, namely "consumer-
key," "consumer-secret," "access-token," and "access-token-secret," for fetching tweets from Twitter.
Step:2. Required libraries (Tweepy in Python) were imported, and the Twitter API keys were authenticated. After
that, several related hashtag keywords were defined to fetch the tweets, such as #harassment,
#cyberstalking, #cyberbullying, #stalker, #stalking, #cyberharassment, #revengeporn, #sexy, #hate, #troll,
#hatespeech, #sexism, #racism, #cybercrime, #hacking, #abuse, #victim, #love, #onlinesafety,
#bullyingsucks, #thebullyexposed, #internetsafety, etc.
Step:3. Fetched the recent tweets from Twitter based on hashtags, time intervals, and user profiles and
saved them to text and CSV files. This paper collected tweets with the user name, user id, tweet location,
retweet count, follower count, and tweet date. Some tweets were also collected from the timelines of
suspicious user profiles listed in a small pre-defined dataset.
Step:4. Step 3 was repeated until a sufficient number of tweets had been collected. In the first phase, more than
8000 tweets were collected over several attempts. All collected tweets were saved into dataset D2.
Step:5. A mixed labeled training dataset D1 (classified as cyberstalking and non-cyberstalking text) containing
35734 unique records was prepared separately to train the machine learning classifiers so that a trained
machine learning model could predict the probability of the collected tweets. This pre-defined training
dataset contains tweets and comments from different internet sources. Further, this labeled
dataset was automatically updated with live tweets classified by the proposed model.
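The bookkeeping in Steps 2 to 4 can be sketched as follows. This is a minimal illustration and not the authors' implementation: the fetch itself (done with Tweepy in the paper) is left out, and the names `HASHTAGS`, `FIELDS`, `build_query`, `append_unique`, and `to_csv_text` are hypothetical helpers introduced here. The sketch assumes each fetched tweet is already available as a dictionary with the fields listed in Step 3, and that hashtags are combined with `OR` in one search query.

```python
import csv
import io

# A few of the hashtags from Step 2; combining them with OR is an assumption.
HASHTAGS = ["#cyberstalking", "#cyberbullying", "#harassment", "#stalking"]

# Columns stored per tweet, following the fields named in Step 3.
FIELDS = ["tweet_id", "user_name", "user_id", "location",
          "retweet_count", "follower_count", "created_at", "text"]


def build_query(hashtags):
    """Join hashtags into a single OR query string."""
    return " OR ".join(hashtags)


def append_unique(dataset, fetched):
    """Add fetched tweets to the dataset, skipping tweet ids already stored."""
    seen = {row["tweet_id"] for row in dataset}
    for tweet in fetched:
        if tweet["tweet_id"] not in seen:
            dataset.append(tweet)
            seen.add(tweet["tweet_id"])
    return dataset


def to_csv_text(dataset):
    """Serialize the dataset to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(dataset)
    return buffer.getvalue()
```

Repeating the fetch and calling `append_unique` after each attempt mirrors Step 4, where tweets are accumulated over several attempts into dataset D2.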
Fig. 1. The basic layout of the proposed automatic model for real-time cyberstalking detection on Twitter
3.2 Tweets Pre-processing
The tweets collected from the Twitter API contain raw text with unnecessary characters, blank spaces, blank lines,
meaningless characters, and various symbols. The tweets must be cleaned properly before feature extraction
and classification. In this phase, the collected tweets were cleaned, filtered, and normalized into a proper format.
This paper performed several pre-processing tasks: stop word removal, noise removal, tokenization, normalization,
and stemming. In the first step of pre-processing, all stop words were removed from the tweets. Words
such as articles, prepositions, and pronouns that are not useful for sentiment analysis and tweet classification are called
stop words [34]. The collected tweets also contain noise data, which was removed. In tweets,
repeated words, symbols (such as @ and #), blank lines, blank spaces, special markers (such as RT), URLs,
punctuation marks, and unneeded digits are considered noise data [35]. After removing the noise data and stop words, the
text of each tweet was split into individual words, which were added to a separate list. This process of splitting a
sentence into words is called tokenization [35]. Further, the tokenized tweets were converted to lowercase letters through
normalization [35] for uniformity. After that, the tokenized words need to be reduced to their base form
using lemmatization [35] or stemming [36]. Stemming strips suffixes from a word to obtain its stem, whereas
lemmatization performs proper morphological analysis and maps the inflected forms of a word to a single
dictionary form (lemma) [37]. In this paper, the stemming method was used.
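The pre-processing tasks described in this section can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the stop-word list is a tiny hypothetical sample, the `stem` function is a naive suffix stripper standing in for a real stemmer, and the order of operations (noise removal first, stop-word removal after tokenization) is chosen for convenience and may differ from the order used in the paper.

```python
import re

# Tiny illustrative stop-word list (assumption); a real run would use a fuller list.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "in", "on", "and",
              "i", "me", "you", "he", "she", "it", "this", "that"}


def remove_noise(text):
    """Remove URLs, the RT marker, symbols, digits, and punctuation."""
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"\bRT\b", " ", text)         # retweet marker
    text = re.sub(r"[^A-Za-z\s]", " ", text)    # @, #, digits, punctuation
    return text


def stem(word):
    """Naive suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(tweet):
    """Return the cleaned, lowercased, stemmed tokens of a tweet."""
    tokens = remove_noise(tweet).lower().split()         # normalize + tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return [stem(t) for t in tokens]
```

For example, `preprocess("RT @user: You are STALKING me!!! http://t.co/x #hate")` yields the tokens `user`, `stalk`, and `hate`.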
3.3 Feature Extraction
After the pre-processing tasks, the tweets dataset was ready for classification and labeling using the
lexicon-based approach. In contrast, the machine learning model uses feature vectors to estimate the predicted
probability of cleaned tweets. Feature extraction is essential in the machine learning-based process before classifying
tweets because machine learning algorithms work on feature vectors and cannot process tweets as raw text.
Feature extraction computes the weights of tweet words and creates a feature vector in numerical form. Feature
extraction plays a crucial role in improving the performance of classifiers [38]. Several traditional, word
embedding-based, and language model-based feature extraction methods are available at the word,
sentence, and n-gram levels [38]. TF-IDF, Word2Vec, BOW, BERT, FastText, GloVe, XLNet,
ELECTRA, InferSent, GPT-2, and Universal Sentence Encoder are some widely used
feature extraction methods [39-43]. The proposed detection model of this study applied the TF-IDF method for feature
extraction. TF-IDF is an efficient, calculation-based feature extraction method that measures the weight of a word in a
document relative to a collection of documents [44]. TF-IDF assigns a higher weight to words that occur frequently in
a document but in few documents of the collection, because such words are more informative for classification [45].
Equation (1) is used to calculate the feature vector in TF-IDF.
$$\text{TF-IDF}(T, D) = \frac{T_{\text{in } D}}{W_{\text{in } D}} \times \left( \log \frac{N}{T_{\text{in } N}} + 1 \right) \tag{1}$$
Where:
$\frac{T_{\text{in } D}}{W_{\text{in } D}}$ = {Occurrences of word "T" in document "D"} divided by {Total words in document "D"}; represents the Term Frequency
$T_{\text{in } N}$ = {Total occurrence of word "T" in total documents}; represents the Document Frequency
$N$ = Total documents
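A small worked sketch of the TF-IDF weighting may help. It assumes the weight is the term frequency (occurrences of the word in the document divided by the document length) multiplied by log(N / document frequency) + 1; the exact placement of the "+ 1" term is an assumption here, and the function name `tf_idf` is introduced only for illustration. The document frequency is taken as the number of documents that contain the word.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            # term frequency * (log(N / df) + 1); the "+ 1" placement is assumed
            term: (count / total) * (math.log(n_docs / df[term]) + 1.0)
            for term, count in counts.items()
        })
    return vectors
```

With the two toy documents `["stalk", "hate", "hate"]` and `["love", "hate"]`, the word "hate" appears in both documents, so its logarithmic factor is zero and its weight in the first document reduces to its term frequency, 2/3. The word "stalk" appears in only one document and therefore receives a larger factor, log(2) + 1.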
3.4 Classification and Labeling of the Tweets
In this phase, the tweets collected through the Twitter API were classified into cyberstalking and non-cyberstalking
tweets using different methods, as shown in Fig. 1. Recent tweets collected directly through the Twitter API were
classified in the primary detection phase. In the first approach, the lexicon-based method was applied, and the labeled
tweets were saved in a separate dataset. In the second approach, a machine learning technique with a trained
SVM model was applied to classify the same tweets, and the labeled tweets were saved in a separate dataset. In the third
approach, a hybrid method combining lexicon-based polarity and SVM-based probability was used to classify
the same tweets, and the labeled tweets were saved in another dataset. Finally, another hybrid approach, which combines
the lexicon-based polarity score with the probability scores of trained SVM and Naïve Bayes models, was applied for
automated cyberstalking detection on Twitter in real time while live tweets were being fetched. The detailed procedure of
each approach is explained in the following subsections.
3.4.1 Lexicon-based approach for classification and labeling of the tweets
A lexicon is the vocabulary of a person, language, or subject area. The lexicon-based approach is a popular method for
sentiment analysis that uses a dictionary and rules to assign a positive or negative score to a word. The lexicon-based
process uses a pre-prepared sentiment dictionary to score the words. The lexicon-based method uses different techniques,
namely dictionary-based and context-based lexicons, to produce the polarity score [46, 47]. The dictionary-based lexicon
[48] uses a pre-defined dictionary of good and bad words that is extended using synonyms and antonyms. The context-based
lexicon [49] uses semantic and statistical methods to find context-specific sentiment. The semantic approach uses
the synonyms and antonyms of a word and semantically close words to assign the sentiment value. The statistical
technique of the lexicon-based process examines how often words occur in positive and negative contexts. If a
word occurs mostly in positive contexts, positive polarity is assigned, while if it
occurs mostly in negative contexts, negative polarity is assigned. Neutral polarity is assigned when
positive and negative occurrences are equal. Several pre-defined and pre-trained lexicon-based libraries are widely
used, namely TextBlob, VADER, SentiWordNet, and AFINN [50]. This paper used the TextBlob library as the lexicon-based
approach for classifying and labeling tweets. TextBlob computes the sentiment and returns a polarity in the
range [-1.0, 1.0] and a subjectivity in the range [0.0, 1.0]. Equation (2) is used to calculate the polarity of
the tweet.
$$\text{Polarity}(tweet) = \frac{\sum_{k=1}^{n} PS_k}{n} \tag{2}$$
Where: 'n' is the total number of words in a tweet and $PS_k$ is the polarity score of the k-th word of the tweet available in the dictionary.
This approach used the following stepwise procedure to classify and label the tweets.
Method 1 Lexicon-based approach for classification and labeling of the tweets
Step:1. The polarity of the unlabeled tweet (denoted by PT) from dataset D2 was calculated using equation (2).
Step:2. If PT >= 0, the tweet was classified as non-cyberstalking, and a label (value=0, positive
tweet) was assigned to the tweet in dataset D2.
Step:3. If PT < 0, the tweet was classified as highly suspicious. In this case, the tweet was very close to a
cyberstalking tweet, but before the final decision was taken, the tweets on the user timeline and the retweet
count were checked for confirmation.
Step:4. The average polarity of the tweets from the user timeline (denoted by UPT) and the retweet count
(denoted by RT) were calculated (at least three recent tweets were considered from the user timeline).
Step:5. If PT < 0 AND (RT > 0 OR UPT < 0), the suspicious tweet was classified as cyberstalking and assigned
the cyberstalking label (value=1, negative tweet); otherwise, it was classified as a non-cyberstalking
tweet.
Step:6. After classification, the labeled tweet was stored in a separate dataset D3.
Step:7. Steps 1 to 6 were repeated until all tweets of dataset D2 were classified.
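The decision rule of Method 1 can be sketched as follows. This is an illustrative sketch only: the paper obtains polarity from TextBlob, whereas the code below uses a tiny hypothetical dictionary (`LEXICON`) so that it runs without external libraries, and it treats words missing from the dictionary as contributing a score of zero. The function names are assumptions introduced here.

```python
# Toy polarity dictionary (assumption); the paper uses TextBlob instead.
LEXICON = {"love": 0.5, "great": 0.8, "hate": -0.8, "stupid": -0.8, "ugly": -0.7}


def polarity(tokens, lexicon=LEXICON):
    """Equation (2): mean polarity score over the n words of a tweet."""
    if not tokens:
        return 0.0
    return sum(lexicon.get(t, 0.0) for t in tokens) / len(tokens)


def classify_lexicon(tweet_tokens, timeline_tokens, retweet_count):
    """Method 1: return 1 (cyberstalking) or 0 (non-cyberstalking).

    timeline_tokens is a list of token lists, one per recent timeline tweet.
    """
    pt = polarity(tweet_tokens)
    if pt >= 0:                      # Step 2: non-negative polarity
        return 0
    # Steps 3-4: suspicious tweet, so check retweets and the user timeline.
    upt = 0.0
    if timeline_tokens:
        upt = sum(polarity(t) for t in timeline_tokens) / len(timeline_tokens)
    # Step 5: confirm only if the tweet was retweeted or the timeline is negative.
    return 1 if (retweet_count > 0 or upt < 0) else 0
```

A negative tweet from a user whose recent timeline is positive and which has no retweets is therefore labeled 0, which reflects the confirmation step that distinguishes this procedure from a plain polarity threshold.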
3.4.2 Machine Learning-based approach for classification and labeling of the tweets
Machine learning is widely used to classify and label tweets with the support of sentiment analysis. In this approach,
this paper used the Support Vector Machine (SVM) for classifying and labeling the tweets. SVM is an
efficient, versatile, and popular supervised machine learning algorithm that is widely used to classify tweets with
accurate results [51]. SVM constructs a separating hyperplane and uses the distance between the hyperplane and the
support vectors to classify the text. SVM offers linear and nonlinear kernels (such as polynomial, sigmoid, and Radial
Basis Function) with different mathematical functions [52]. By nature, SVM outputs a class prediction and does not
provide probabilities directly; however, with Platt scaling or isotonic regression, the probability of a text belonging
to the target class can be estimated. This paper used the probability calibration classifier method for SVM to calculate
the prediction probability of tweets. Expression (3) is used to calculate the prediction probability of tweets in the SVM model.
$$P(y \mid tweet) = \frac{1}{1 + \exp\left(A f(tweet) + B\right)} \tag{3}$$
Where 'A' and 'B' are scalar parameters learned by the algorithm during training, 'y' is the target class (y=1 for
cyberstalking and y=0 for non-cyberstalking), and f(tweet) is a real-valued function (the SVM decision value for the tweet).
In this approach, the following stepwise procedure was used for classifying and labeling the tweets.
Step:4. If the predicted probability (PPT) <= 0.5, the tweet was classified as a non-cyberstalking tweet, and
a label (value=0, positive tweet) was assigned to the tweet in dataset D2.
Step:5. If PPT > 0.5, the tweet was classified as suspicious. In this case, tweets from the concerned user's
timeline were checked, and retweets (denoted by RT) were counted.
Step:6. The average predicted probability of tweets from the user timeline (denoted by UPPT) was calculated
(at least three recent tweets were considered from the user timeline).
Step:7. If PPT > 0.5 AND (RT > 0 OR UPPT > 0.5), the suspicious tweet was classified as a cyberstalking
tweet and assigned a label (value=1, negative tweet); otherwise, it was classified as a non-cyberstalking
tweet.
Step:8. The classified tweet was saved into a separate dataset D4.
Step:9. Steps 3 to 8 were repeated until all tweets of dataset D2 were classified.
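The thresholding in Steps 4 to 7 can be sketched as follows, together with the sigmoid form of Platt scaling used to turn an SVM decision value into a probability. This is a minimal sketch under assumptions: the parameters `a` and `b` would normally be fitted on held-out data by a calibration routine, the values used in any example call are arbitrary, and the function names are introduced here for illustration only.

```python
import math


def platt_probability(decision_value, a, b):
    """Platt scaling: P(y | tweet) = 1 / (1 + exp(A * f(tweet) + B))."""
    return 1.0 / (1.0 + math.exp(a * decision_value + b))


def classify_ml(ppt, timeline_ppts, retweet_count):
    """Steps 4-7: return 1 (cyberstalking) or 0 (non-cyberstalking).

    ppt is the predicted probability of the tweet; timeline_ppts holds the
    predicted probabilities of recent tweets from the same user's timeline.
    """
    if ppt <= 0.5:                   # Step 4
        return 0
    # Steps 5-6: suspicious tweet, so average the timeline probabilities.
    uppt = sum(timeline_ppts) / len(timeline_ppts) if timeline_ppts else 0.0
    # Step 7: confirm only if retweeted or the timeline is also suspicious.
    return 1 if (retweet_count > 0 or uppt > 0.5) else 0
```

With a negative value of `a`, larger decision values map to higher probabilities; a decision value of zero with `b = 0` gives exactly 0.5, which falls on the non-cyberstalking side of the Step 4 threshold.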
3.4.3 Hybrid approach for classification and labeling of the Tweets
The first segment of the hybrid approach used lexicon-based polarity scores and machine learning-based probability
scores to classify and label the tweets. The following stepwise procedure was used.
Method 3 Hybrid approach for classification and labeling of the Tweets
Step:1. The polarity of the unlabeled tweet (denoted by PT) from dataset D2 was calculated using lexicon-based
sentiment analysis (as discussed in section 3.4.1).
Step:2. The predicted probability of the unlabeled tweet (denoted by PPT) from dataset D2 was calculated using
the SVM model trained on dataset D1 (as discussed in section 3.4.2).
Step:3. If PT >= 0 AND PPT <= 0.5, the tweet was classified as non-cyberstalking and assigned the label 0
(positive tweet). In this case, both the lexicon-based and machine learning methods produced the same
sentiment (non-cyberstalking) for the tweet.
Step:4. If PT < 0 AND PPT > 0.5, the tweet was classified as highly suspicious. In this case, both the
lexicon-based and machine learning methods produced the same sentiment (cyberstalking), so the tweet was
very close to a cyberstalking tweet. Tweets on the user timeline and retweets were checked to confirm
before making the final decision.
Step:5. The average predicted probability (denoted by UPPT) and the average polarity (denoted by UPT) of
tweets from the user timeline and retweets (RT) were calculated (at least three recent tweets were
considered from the user timeline).
Step:6. If (PT < 0 AND PPT > 0.5) AND (RT > 0 OR UPPT > 0.5 OR UPT < 0), the highly suspicious tweet was
classified as cyberstalking and assigned the label 1 (negative tweet); otherwise, it was classified as a
non-cyberstalking tweet. Dataset D2 was then updated and saved.
Step:7. The tweet labeled by this approach was saved into a separate dataset D5.
Step:8. Steps 1 to 7 were repeated until all tweets of dataset D2 had been classified.
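The decision logic of Steps 3 to 6 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name, the argument names, and the reading of RT as a retweet count are assumptions, and a tweet on which the two methods disagree is treated as non-cyberstalking, following the "otherwise" clause of Step 6.

```python
from statistics import mean
from typing import Sequence

CYBERSTALKING = 1      # "negative" tweet
NON_CYBERSTALKING = 0  # "positive" tweet


def hybrid_label(pt: float, ppt: float,
                 timeline_polarities: Sequence[float],
                 timeline_probabilities: Sequence[float],
                 rt: int) -> int:
    """Label one tweet following Steps 3-6 of Method 3 (hypothetical helper)."""
    # Steps 3-4: only tweets flagged by BOTH methods are "highly suspicious".
    # Assumption: any other combination is labeled non-cyberstalking.
    if not (pt < 0 and ppt > 0.5):
        return NON_CYBERSTALKING

    # Step 5: average the timeline scores (at least three recent tweets).
    if len(timeline_polarities) < 3 or len(timeline_probabilities) < 3:
        raise ValueError("need at least three recent timeline tweets")
    upt = mean(timeline_polarities)
    uppt = mean(timeline_probabilities)

    # Step 6: confirm using retweets (assumed to be a count) or timeline averages.
    if rt > 0 or uppt > 0.5 or upt < 0:
        return CYBERSTALKING
    return NON_CYBERSTALKING
```

Because the confirmation in Step 6 uses OR, a single supporting signal (a retweet, a high average probability, or a negative average polarity on the timeline) is enough to confirm a highly suspicious tweet.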
After the tweets of dataset D2 were classified and labeled, several ML classifiers, specifically SVM, Logistic
Regression (LR), Naïve Bayes (NB), Decision Tree (DT), Random Forest (RF), and K-Nearest Neighbor (KNN), were
trained and tested on datasets D3, D4, and D5. Performance was measured for all three classification and
labeling methods: lexicon-based, machine learning, and hybrid.
3.5 Real-Time Cyberstalking Detection on Live Tweets
Tweets collected through Twitter's search API (as discussed in section 3.1) had already been posted and were
therefore not real-time. In this section, live tweets were fetched in real time using Twitter's Streaming API and
Twitter's Firehose. Using a hybrid approach, the tweets were automatically classified and labeled as cyberstalking
or non-cyberstalking in real time while they were being fetched. At this stage, the proposed hybrid approach used
the lexicon-based method, the trained SVM model, and a trained Naïve Bayes model. Naïve Bayes (NB) is an efficient
and straightforward supervised machine learning algorithm. It is based on Bayes' theorem and derived from
conditional probability [53]. In this paper, the multinomial NB model was used; other NB variants include
Gaussian NB and Bernoulli NB. In Naïve Bayes, the following equation gives the predicted probability of a tweet
for the target class (cyberstalking or non-cyberstalking).
$$P(y \mid \text{tweet}) = \frac{P(y)\,\prod_{i=1}^{n} P(x_i \mid y)}{P(x_1)\,P(x_2)\cdots P(x_n)} \tag{4}$$
where 'y' is the target class (y=1 for cyberstalking and y=0 for non-cyberstalking). P(y|tweet) is the
posterior probability of the target class 'y' given the tweet. P(tweet)=P(x1)P(x2)…P(xn) is the prior probability
of the predictor (the tweet). P(y) is the prior probability of the target class. P(xi|y) is the likelihood, that is,
the conditional probability of predictor xi given the target class 'y'.
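As a worked illustration of equation (4), the sketch below computes the Naïve Bayes posterior for a two-word tweet. The priors and word likelihoods are hypothetical values chosen for the example, not estimates from dataset D5. Equation (4) divides by P(x1)P(x2)…P(xn); the sketch instead normalises over the two classes, which is how this class-independent constant is usually handled in practice.

```python
from math import prod


def nb_posteriors(words, priors, likelihoods):
    """Return P(y | tweet) for each class y, normalised over the classes."""
    # Numerator of equation (4): P(y) * prod_i P(x_i | y)
    joint = {
        y: priors[y] * prod(likelihoods[y][w] for w in words)
        for y in priors
    }
    # The denominator is the same for every class, so dividing by the sum
    # of the numerators yields posteriors that add up to 1.
    evidence = sum(joint.values())
    return {y: joint[y] / evidence for y in joint}


# Hypothetical priors and per-class word likelihoods (illustrative only).
priors = {1: 0.3, 0: 0.7}
likelihoods = {
    1: {"watching": 0.20, "you": 0.30},
    0: {"watching": 0.05, "you": 0.20},
}
posteriors = nb_posteriors(["watching", "you"], priors, likelihoods)
```

With these values the numerators are 0.018 for y=1 and 0.007 for y=0, so the posterior for the cyberstalking class is 0.018 / 0.025 = 0.72.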
Automatic Cyberstalking Detection on Twitter in Real-Time using Hybrid Approach
Volume 15 (2023), Issue 1 65
The SVM model was trained on dataset D1, while the NB model was trained on the newly created labeled dataset D5
(recent tweets classified by the hybrid approach, as discussed in section 3.4.3). The following stepwise procedure
was used for real-time automated cyberstalking detection on live tweets.
Method 4 Hybrid approach for Real-Time Cyberstalking Detection on Live Tweets
Step:1. A live tweet was fetched in real time through Twitter's Streaming API and filtered by various
hashtag keywords.
Step:2. The polarity PT of the fetched unlabeled tweet was calculated using the lexicon-based method (as
discussed in section 3.4.1).
Step:3. The predicted probabilities PPT_SVM and PPT_NB of the fetched unlabeled tweet were calculated using
the trained SVM and NB models, respectively.
Step:4. The average predicted probability APPT_ML of the fetched unlabeled tweet was calculated from
PPT_SVM and PPT_NB.
Step:5. If PT >= 0 OR APPT_ML <= 0.5, the tweet was classified as non-cyberstalking and assigned the label 0
(positive tweet).
Step:6. If PT < 0 AND APPT_ML > 0.5, the tweet was classified as highly suspicious. In this case, tweets on the
user timeline were checked to confirm before making the final decision.
Step:7. The average predicted probability UAPPT_ML (UAPPT_ML = (UAPPT_SVM + UAPPT_NB)/2) and the
average polarity UAPT of the timeline tweets were calculated (at least three recent tweets were considered
from the user timeline).
Step:8. If (PT < 0 AND APPT_ML > 0.5) AND (UAPT < 0 AND UAPPT_ML > 0.5), the highly suspicious
tweet was classified as cyberstalking and assigned the label 1 (negative tweet); otherwise, it was classified
as a non-cyberstalking tweet.
Step:9. The labeled live tweet (cyberstalking or non-cyberstalking, along with user id, username, location,
and date) was stored in dataset D6.
Step:10. Dataset D1 was updated with the labeled tweets of dataset D6 for further use.
Step:11. Steps 1 to 10 were repeated until a sufficient number of live tweets (more than 10000) had been
fetched.
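Steps 4 to 8 of Method 4 can be sketched as below. The `Scores` container and the function name are hypothetical, and the scores themselves (lexicon polarity, SVM and NB probabilities) are assumed to have been computed already; fetching tweets and storing them in D6 are outside the scope of the sketch.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass
class Scores:
    """Scores for one tweet (hypothetical container)."""
    polarity: float  # lexicon polarity (PT)
    p_svm: float     # SVM predicted probability (PPT_SVM)
    p_nb: float      # NB predicted probability (PPT_NB)

    @property
    def appt_ml(self) -> float:
        # Step 4: average of the two model probabilities.
        return (self.p_svm + self.p_nb) / 2


def label_live_tweet(tweet: Scores, timeline: Sequence[Scores]) -> int:
    """Return 1 (cyberstalking) or 0 (non-cyberstalking) for a live tweet."""
    # Step 5: if either signal is benign, label non-cyberstalking.
    if tweet.polarity >= 0 or tweet.appt_ml <= 0.5:
        return 0

    # Step 7: timeline averages over at least three recent tweets.
    if len(timeline) < 3:
        raise ValueError("need at least three recent timeline tweets")
    uapt = mean(t.polarity for t in timeline)
    uappt_ml = mean(t.appt_ml for t in timeline)

    # Step 8: BOTH timeline signals must agree to confirm cyberstalking.
    return 1 if (uapt < 0 and uappt_ml > 0.5) else 0
```

Compared with Method 3, the confirmation in Step 8 uses AND rather than OR, so a highly suspicious live tweet is labeled cyberstalking only when the user's timeline is negative by both the lexicon and the averaged model probabilities.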
3.6 Measuring the Performance of Model
The performance of the classifiers with each applied method (lexicon-based, machine learning, and hybrid) was
measured separately for the classification and labeling of recent tweets (fetched through the Twitter API) and live
tweets (fetched through Twitter Streaming). Performance metrics are the measures used to assess a model's
performance during training and testing [54]. They are usually derived from the confusion matrix.
In this study, the confusion matrix is a 2x2 table containing the counts True_Pos, True_Neg,
False_Neg, and False_Pos. True_Pos (True Positive) is the number of correctly detected
cyberstalking tweets, while True_Neg (True Negative) is the number of correctly detected non-cyberstalking
tweets. In contrast, False_Pos (False Positive) is the number of tweets incorrectly detected as
cyberstalking, while False_Neg (False Negative) is the number of tweets wrongly
detected as non-cyberstalking. In this paper, the widely used metrics accuracy, precision, recall, and f-score
were calculated to measure the performance of the cyberstalking detection method. AUC (Area Under the Curve) was
also calculated during the automatic detection of live tweets in real time.
3.6.1 Accuracy
Accuracy is the proportion of all predictions that the classifier got right. Accuracy can be
calculated using equation (5).
$$\text{Accuracy} = \frac{\text{True\_Pos} + \text{True\_Neg}}{\text{True\_Pos} + \text{False\_Pos} + \text{False\_Neg} + \text{True\_Neg}} \tag{5}$$
3.6.2 Precision
Precision is the proportion of true positives among all tweets predicted as positive. Precision can be
calculated using equation (6).
$$\text{Precision} = \frac{\text{True\_Pos}}{\text{True\_Pos} + \text{False\_Pos}} \tag{6}$$
3.6.3 Recall
Recall, also called sensitivity, is the proportion of actual positives that are correctly predicted as positive.
Recall can be determined using equation (7).
$$\text{Recall} = \frac{\text{True\_Pos}}{\text{True\_Pos} + \text{False\_Neg}} \tag{7}$$
3.6.4 F-Score
F-Score measures test accuracy as the harmonic mean of precision and recall. F-score can be
determined using equation (8).
$$\text{F-Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{8}$$
3.6.5 AUC (Area Under the Curve)
AUC estimates the capacity of the classifier to separate the classes correctly. ROC (Receiver Operating
Characteristic) is a probability curve that plots the True Positive Rate (TPR) against the False Positive Rate (FPR).
Equation (9) can be used to calculate the AUC.
$$\text{AUC} = \frac{1}{2}\left(\frac{\text{True\_Pos}}{\text{True\_Pos} + \text{False\_Neg}} + \frac{\text{True\_Neg}}{\text{True\_Neg} + \text{False\_Pos}}\right) \tag{9}$$
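Equations (5) to (9) translate directly into a few lines of Python. The sketch below is illustrative: the function names and the confusion-matrix counts are hypothetical, and no guard against zero denominators is included. Note that equation (9), evaluated on a single confusion matrix, is the average of the true positive rate and the true negative rate.

```python
def f_score(precision: float, recall: float) -> float:
    """Equation (8): harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Equations (5)-(9) from the four confusion-matrix counts."""
    precision = tp / (tp + fp)        # equation (6)
    recall = tp / (tp + fn)           # equation (7), true positive rate
    specificity = tn / (tn + fp)      # true negative rate
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),  # equation (5)
        "precision": precision,
        "recall": recall,
        "f_score": f_score(precision, recall),        # equation (8)
        "auc": 0.5 * (recall + specificity),          # equation (9)
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = classification_metrics(tp=80, fp=10, fn=20, tn=90)
```

For these counts, accuracy is 170/200 = 0.85 and the AUC of equation (9) is 0.5 × (0.8 + 0.9) = 0.85. As a check against the reported results, `f_score(0.891509, 0.739726)` with the SVM precision and recall from Table 1 gives approximately 0.8086, in agreement with the tabulated F-score of 0.808556.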
4. Experimental Setup, Results, and Discussion
This section discusses the experimental setup and results for automatically detecting cyberstalking tweets in
real time. The experiments were implemented in Python using Scikit Learn, Tweepy, Twitter Streaming, TextBlob,
NLTK, and other library packages. To train the machine learning classifiers in the initial phase (in the
machine learning and hybrid approaches for classification and labeling of the collected tweets), a mixed labeled
dataset D1 was prepared [55-59]. Training dataset D1 contains 35734 unique records classified as cyberstalking and
non-cyberstalking text. Fig. 2 shows the distribution of tweets/comments in the training dataset D1.
Fig. 2. Distribution of Tweets/Comments in the training dataset D1
In the first stage of the experiment, recent tweets were collected using the Twitter API. A total of 24178 tweets
were collected over several attempts. After removing duplicate tweets and blank lines, 8066 unique
tweets were selected and saved to dataset D2 for classification and labeling. Separate experiments then
classified these tweets using the different methods (as discussed in the methodology section). In the second stage,
the collected recent tweets were classified and labeled using the lexicon-based method with the support of TextBlob
sentiment analysis. Experiments were also performed with other pre-trained and pre-defined lexicon-based methods,
such as Vader, SentiWordNet, and AFINN, and produced almost similar results. The classified tweets were stored in a
separate dataset (D3), and the model was tested using different machine learning classifiers. The performance of the
different classifiers with lexicon-based labeling is given in Table 1. As per the experimental results, the
lexicon-based approach classified 24.2% of recent tweets as cyberstalking tweets and 75.8% as non-cyberstalking
tweets. It provided a maximum accuracy of 91.1%, a precision of 91.4%, a recall of 81%, an f-score of 80.9%, and an
AUC of 90.9% in the classification and labeling of the recent tweets. SVM achieved the maximum accuracy and AUC,
Logistic Regression achieved the maximum precision, while the Decision Tree achieved the maximum recall and f-score.
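The clean-up performed before saving dataset D2 (removing duplicate tweets and blank lines) might look like the following sketch. The function name is hypothetical, and treating two tweets as duplicates when they match exactly after whitespace normalisation is an assumption; the paper does not state the matching rule.

```python
def unique_tweets(raw_tweets):
    """Drop blank lines and exact duplicates, keeping first-seen order."""
    seen = set()
    unique = []
    for text in raw_tweets:
        # Assumption: collapse runs of whitespace before comparing.
        cleaned = " ".join(text.split())
        if not cleaned:
            continue  # skip blank lines
        if cleaned not in seen:
            seen.add(cleaned)
            unique.append(cleaned)
    return unique


sample = ["hello  world", "", "hello world", "   ", "Bye"]
deduplicated = unique_tweets(sample)
```

Here `deduplicated` keeps only `"hello world"` and `"Bye"`; the empty and whitespace-only entries and the repeated tweet are dropped.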
Table 1. Performance of Classifiers with Lexicon-Based Classification and labeling of Tweets
In the third stage of the experiment, the trained SVM model (as discussed in the methodology
section 3.4.2) was used as the machine learning approach to classify and label the recently collected tweets. The
classified tweets were again stored in a separate dataset (D4), and the model was tested using the different machine
learning classifiers. The performance of the machine learning approach for classification and labeling of the tweets
is given in Table 2. As per the experimental results, 23.3% of recent tweets were classified as cyberstalking
tweets, while 76.7% were classified as non-cyberstalking tweets using the machine learning approach. The machine
learning approach for tweet classification provided a maximum accuracy of 92.4%, precision of 90.5%, recall of
89.3%, f-score of 89.9%, and AUC of 96.6%. SVM performed better than the other classifiers.
Table 2. Performance of Classifiers with Machine learning approach for Labeling of Tweets
In the fourth stage of the experiment, the hybrid approach (as discussed in the methodology section 3.4.3) was used
to classify and label the recently collected tweets. After classification, the labeled tweets were saved in a
separate dataset (D5), and the hybrid approach was tested using the different machine learning classifiers. The
performance of the hybrid approach for the classification and labeling of the tweets is shown in Table 3. As per
the experimental results, 5.1% of recent tweets were classified as cyberstalking tweets, while 94.9% were
classified as non-cyberstalking tweets using the hybrid approach. The hybrid approach for tweet classification
achieved the highest accuracy of 95.8%, precision of 98.2%, recall of 38.8%, and an f-score of 40.6%. SVM again
performed better than the other classifiers.
Table 3. Performance of Classifiers with Hybrid Approach for Labeling of Tweets
The comparative performance of all three applied approaches is presented in Fig. 3. The experimental results
in Table 1, Table 2, Table 3, and Fig. 3 show that the machine learning approach performed better
than the lexicon-based approach, while the hybrid approach achieved the highest accuracy of all. SVM
outperformed the other classifiers and achieved the highest accuracy of 91.1%, 92.4%, and 95.8% for the
lexicon-based, machine learning, and hybrid approaches, respectively.
Dataset (D2): 8066 unique recent tweets collected through Twitter API
Tweets classified and labeled by: Method 1 - Lexicon-based sentiment
Cyberstalking tweets found: 24.2%, Non-Cyberstalking tweets found: 75.8%

| S. No | ML Algorithm | Accuracy | Precision | Recall | F-Score |
|---|---|---|---|---|---|
| 1 | Support Vector Machine (SVM) | 0.911254 | 0.891509 | 0.739726 | 0.808556 |
| 2 | Decision Tree | 0.903322 | 0.808594 | 0.810176 | 0.809384 |
| 3 | Random Forest | 0.896381 | 0.881313 | 0.682975 | 0.769570 |
| 4 | Logistic Regression | 0.865642 | 0.913793 | 0.518591 | 0.661673 |
| 5 | Naive Bayes | 0.861675 | 0.953333 | 0.679843 | 0.632678 |
| 6 | K-Nearest Neighbor | 0.836886 | 0.760000 | 0.520548 | 0.617886 |
Dataset (D2): 8066 unique recent tweets collected through Twitter API
Tweets classified and labeled by: Method 2 - Machine Learning Approach
Cyberstalking tweets found: 11.7%, Non-Cyberstalking tweets found: 88.3%

| S. No | ML Algorithm | Accuracy | Precision | Recall | F-Score |
|---|---|---|---|---|---|
| 1 | Support Vector Machine | 0.923649 | 0.834646 | 0.443515 | 0.579235 |
| 2 | Random Forest | 0.908775 | 0.747748 | 0.347280 | 0.474286 |
| 3 | Logistic Regression | 0.902826 | 0.957447 | 0.188285 | 0.314685 |
| 4 | K-Nearest Neighbor | 0.898364 | 0.621429 | 0.364017 | 0.459103 |
| 5 | Naive Bayes | 0.892910 | 0.727536 | 0.096234 | 0.175573 |
| 6 | Decision Tree | 0.869608 | 0.453488 | 0.489540 | 0.470825 |
Dataset (D2): 8066 unique recent tweets collected through Twitter API
Tweets classified and labeled by: Hybrid Approach
Cyberstalking tweets found: 5.1%, Non-Cyberstalking tweets found: 94.9%

| S. No | ML Algorithm | Accuracy | Precision | Recall | F-Score |
|---|---|---|---|---|---|
| 1 | Support Vector Machine | 0.958004 | 0.813725 | 0.247761 | 0.379863 |
| 2 | Decision Tree | 0.941113 | 0.426230 | 0.388060 | 0.406250 |
| 3 | Random Forest | 0.957229 | 0.953846 | 0.185075 | 0.310000 |
| 4 | Naive Bayes | 0.956454 | 0.965517 | 0.167164 | 0.284987 |
| 5 | K-Nearest Neighbor | 0.952735 | 0.636364 | 0.208955 | 0.314607 |
| 6 | Logistic Regression | 0.956609 | 0.982456 | 0.167164 | 0.285714 |
Fig. 3. Performances of Classifiers with all classification Methods
In the final experiment, an enhanced hybrid approach was applied for automatic cyberstalking detection on
live tweets in real time, since the hybrid approach had performed best. This time, the live tweets were fetched
through Twitter Streaming and classified in real time by the hybrid approach while they were being fetched. During
fetching and classification, the labeled live tweets were recorded in a separate dataset (D6), and the model was
tested using several machine learning algorithms. The performance of the hybrid approach for automatic
classification and labeling of the live tweets in real time is shown in Table 4 and Fig. 4. The AUC score and ROC
curve are shown in Fig. 5, while the distribution of classified live tweets is shown in Fig. 6. As per the
experimental results, 48.1% of tweets were labeled as cyberstalking, while 51.9% were labeled as non-cyberstalking
during the fetching and classification of live tweets using the hybrid approach. The results in Table 4 and Fig. 4
show that the hybrid approach achieved an accuracy of 94.2%, recall and f-score of 94.1%, precision of 94.6%, and
AUC of 98% for automatic cyberstalking detection on live tweets in real time. SVM again achieved the highest
accuracy, recall, and f-score, while Random Forest obtained the highest precision. The AUC score and ROC curve
plotted in Fig. 5 indicate that SVM and Random Forest achieved the highest AUC of 98%.
Table 4. Performance of Classifiers with Hybrid Approach for Labeling of Live Tweets in Real-Time
Fig. 4. Performances of Classifiers with Hybrid Approach for Labeling of Live Tweets in Real-Time
Dataset size: 13294 unique live tweets collected through Twitter Streaming
Live Tweets classified and labeled by: Hybrid Approach
Cyberstalking tweets found: 48.1%, Non-Cyberstalking tweets found: 51.9%

| S. No | ML Algorithm | Accuracy | Precision | Recall | F-Score | AUC |
|---|---|---|---|---|---|---|
| 1 | Support Vector Machine | 0.941937 | 0.940564 | 0.941140 | 0.940852 | 0.980237 |
| 2 | Random Forest | 0.937425 | 0.946082 | 0.925199 | 0.935524 | 0.980135 |
| 3 | Decision Tree | 0.928700 | 0.929716 | 0.924586 | 0.927144 | 0.927545 |
| 4 | Logistic Regression | 0.923887 | 0.924784 | 0.919681 | 0.922226 | 0.971913 |
| 5 | Naive Bayes | 0.867329 | 0.872340 | 0.854690 | 0.863425 | 0.946197 |
| 6 | K-Nearest Neighbor | 0.726233 | 0.943419 | 0.470264 | 0.627660 | 0.792647 |
Fig. 5. AUC score and ROC curve for Live Tweets Classification in Real-Time
Fig. 6. Distribution of Live Tweets Classification in Real-Time
5. Conclusion and Future Work
Cyberstalkers give Twitter a negative and fearful face, and combating cyberstalking automatically in real time
is a challenging task. This paper proposed a hybrid approach that combines lexicon-based and machine
learning-based models, applied in different ways in separate segments, for automatic cyberstalking detection on live
tweets in real time. A total of 24178 recent tweets were collected using the Twitter API. In the initial stage,
separate experiments were performed using the lexicon-based, machine learning-based, and hybrid approaches on
8066 recent tweets (the unique tweets out of 24178). The machine learning-based and hybrid approaches used a
pre-defined dataset containing 35734 individual tweets and comments to train the machine learning model. The
performance of each method was measured using several metrics. The lexicon-based model obtained a maximum accuracy
of 91.1%, while the machine learning-based model achieved 92.4% accuracy. The proposed hybrid approach
achieved the highest accuracy of 95.8%. The experimental results show that the machine learning-based
model performed better than the lexicon-based model, while the hybrid approach outperformed both in cyberstalking
detection on Twitter.
Because the hybrid approach performed better, a second hybrid approach was applied for
cyberstalking detection on live tweets fetched directly through Twitter Streaming in real time. Cyberstalking
detection was performed on 13294 live tweets using this hybrid approach. 48.1% of live tweets were
classified as cyberstalking tweets, while 51.9% were classified as non-cyberstalking tweets during real-time
cyberstalking detection. The hybrid approach again performed well on live tweets, achieving a maximum accuracy of
94.2%, recall of 94.1%, precision of 94.6%, f-score of 94.1%, and AUC of 98%. In all approaches, the support vector
machine outperformed the other classifiers. The experimental results show that the hybrid approach is better than
the other methods at combating cyberstalking in real time. Lexicon-based models depend on rules and dictionaries,
while machine learning models require labeled datasets for training before prediction. The proposed hybrid approach
utilizes the benefits of both the lexicon-based and machine learning approaches. The training dataset was
automatically updated by the proposed detection model to improve the performance of each subsequent execution. The
performance of the proposed model can be enhanced with a larger and more accurate dataset. Future work includes
designing a more efficient hybrid model that combines lexicon-based, machine learning, deep learning, and fuzzy
logic methods for cyberstalking detection in real time.
References
[1] (2021) The Backlinko Website. How Many People Use Twitter in 2021? [New Twitter Stats] [Online]. Available:
https://backlinko.com/twitter-users
[2] Arkaitz Zubiaga, Alex Voss, Rob Procter, Maria Liakata, Bo Wang, Adam Tsakalidis, "Towards Real-Time, Country-Level
Location Classification of Worldwide Tweets," IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE), 29(9),
2017.
[3] M. Baer. "Cyberstalking and the Internet Landscape We Have Constructed," Virginia Journal of Law & Technology, 154(15),
2020, pp. 153-227.
[4] Gautam, Arvind Kumar, and Abhishek Bansal. "Email-Based Cyberstalking Detection On Textual Data Using Multi-Model
Soft Voting Technique Of Machine Learning Approach." Journal of Computer Information Systems (2023): 1-20. doi:
10.1080/08874417.2022.2155267
[5] Tarmizi, Nursyahirah, Suhaila Saee, and Dayang Hanani Abanag Ibrahim, "Detecting the usage of vulgar words in cyberbully
activities from Twitter," International Journal on Advanced Science, Engineering and Information Technology 10(3), 2020, pp.
1117-1122.
[6] S. Lal, L. Tiwari, R. Ranjan, A. Verma, N. Sardana, & R. Mourya, Analysis and classification of crime tweets. Procedia
Computer Science, 167, 2020, pp. 1911-1919.
[7] Arvind Kumar Gautam, and Abhishek Bansal, "A Review on Cyberstalking Detection Using Machine Learning Techniques:
Current Trends and Future Direction." International Journal of Engineering Trends and Technology, 70(3), 2022, pp. 95-
107. Crossref, https://doi.org/10.14445/22315381/IJETT-V70I3P211
[8] Salawu, Semiu, Yulan He, and Joanna Lumsden, "Approaches to automated detection of cyberbullying: A survey," IEEE
Transactions on Affective Computing, 11(1), 2017, pp. 3-24.
[9] Abdur Rahman, Mobashir Sadat, Saeed Siddik, "Sentiment Analysis on Twitter Data: Comparative Study on Different
Approaches," International Journal of Intelligent Systems and Applications(IJISA), 13(4), 2021, pp.1-13. DOI:
10.5815/ijisa.2021.04.01
[10] K. Rakshitha, H. M. Ramalingam, M. Pavithra, H.D. Advi, & M. Hegde, "Sentimental analysis of Indian regional languages on
social media," Global Transitions Proceedings, 2(2), 2021, pp. 414-420.
[11] Khoo, Christopher SG, and Sathik Basha Johnkhan, "Lexicon-based sentiment analysis: Comparative evaluation of six
sentiment lexicons," Journal of Information Science 44(4), 2018, pp. 491-511.
[12] Norah AL-Harbi, Amirrudin Bin Kamsin, "An Effective Text Classifier using Machine Learning for Identifying Tweets'
Polarity Concerning Terrorist Connotation," International Journal of Information Technology and Computer Science(IJITCS),
13(5), 2021, pp.19-29. DOI: 10.5815/ijitcs.2021.05.02
[13] A. Hasan, S. Moin, A. Karim, & S. Shamshirband, "Machine learning-based sentiment analysis for twitter
accounts," Mathematical and Computational Applications, 23(1), 2018, pp. 11.
[14] Golam Mostafa, Ikhtiar Ahmed, Masum Shah Junayed, "Investigation of Different Machine Learning Algorithms to Determine
Human Sentiment Using Twitter Data," International Journal of Information Technology and Computer Science(IJITCS), 13(2),
2021, pp.38-48. DOI: 10.5815/ijitcs.2021.02.04
[15] Z. Ghasem, I. Frommholz, and C. Maple, "Machine learning solutions for controlling cyberbullying and
cyberstalking," International Journal of Information Security, 6(2), 2015, pp. 55-64.
[16] Ingo Frommholz, Haider M. al-Khateeb, Martin Potthast, Zinnar Ghasem, Mitul Shukla, Emma Short, "On Textual Analysis
and Machine Learning for Cyberstalking Detection," Datenbank Spektrum 16, 2016, pp. 127-135.
[17] Saravanaraj, A., J. I. Sheeba, and S. Pradeep Devaneyan, "Automatic detection of cyberbullying from twitter," International
Journal of Computer Science and Information Technology & Security (IJCSITS), 2016.
[18] J. Zhang, T. Otomo, L. Li, & S. Nakajima, "Cyberbullying Detection on Twitter using Multiple Textual Features," In 2019
IEEE 10th International Conference on Awareness Science and Technology (CAST), IEEE, 2019, pp. 1-6.
[19] S. W. Liew, N. F. M. Sani, M. T. Abdullah, R. Yaakob & M. Y. Sharum, "An effective security alert mechanism for real-time
phishing tweet detection on Twitter," Computers & Security, 83, 2019, pp. 201-207.
[20] V. Balakrishnan, S. Khan, H.R. Arabnia, "Improving cyberbullying detection using Twitter users' psychological features and
machine learning," Science Direct, ELSEVIER, Computer & Security, 90, 2020.
[21] R. Shah, S. Aparajit, R. Chopdekar, & R. Patil, "Machine Learning based Approach for Detection of Cyberbullying
Tweets," International Journal of Computer Applications, 175(37), 2020
[22] Kazim Raza Talpur, Siti Sophiayati Yuhaniz, Nilam Nur binti Amir Sjarif, Bandeh Ali, "Cyberbullying Detection In Roman
Urdu Language Using Lexicon Based Approach," JOURNAL OF CRITICAL REVIEWS, 16, 2020, pp. 834-
848. doi: 10.31838/jcr.07.16.109
[23] R. Geetha, S. Karthika, C. J. Sowmika, & B. M. Janani, "Auto-Off ID: Automatic Detection of Offensive Language in Social
Media," In Journal of Physics: Conference Series, 1911(1), 2021.
[24] Bandi Yoshna, A. K. Jaithunbi, G. Lavanya, D.V. Smitha, "Detecting Twitter Cyberbullying Using Machine Learning," Annals
of the Romanian Society for Cell Biology, 2021, pp. 16307-16315.
[25] Kumar, Akshi, and Nitin Sachdeva, "Multi-input integrative learning using deep neural networks and transfer learning for
cyberbullying detection in real-time code-mix data," Multimedia systems, 2020, pp. 1-15.
[26] N. Yuvaraj, V. Chang, B. Gobinathan, A. Pinagapani, S. Kannan, G. Dhiman, & A. R. Rajan, "Automatic detection of
cyberbullying using multi-feature based artificial intelligence with deep decision tree classification," Computers & Electrical
Engineering, 92, 2021.
[27] S. Sadiq, A. Mehmood, S. Ullah, M. Ahmad, G.S. Choi, "Aggression detection through deep neural model on twitter," Future
Generation Computer Systems, 114, 2021, pp. 120-129.
[28] Sangwan, Saurabh Raj, and M. P. S. Bhatia, "D-BullyRumbler: a safety rumble strip to resolve online denigration bullying
using a hybrid filter-wrapper approach," Multimedia Systems, 2020, pp. 1-17.
[29] Lepe-Faúndez M, Segura-Navarrete A, Vidal-Castro C, Martínez-Araneda C, Rubio-Manzano C, "Detecting Aggressiveness in
Tweets: A Hybrid Model for Detecting Cyberbullying in the Spanish Language," Applied Sciences, 22(11), 2021.
https://doi.org/10.3390/app112210706
[30] Madan, Anjum, and Udayan Ghose. "Sentiment Analysis for Twitter Data in the Hindi Language," 2021 11th International
Conference on Cloud Computing, Data Science & Engineering (Confluence). IEEE, 2021.
[31] Almutairi, Amjad Rasmi, and Muhammad Abdullah Al-Hagery, "Cyberbullying Detection by Sentiment Analysis of Tweets'
Contents Written in Arabic in Saudi Arabia Society," International Journal of Computer Science & Network Security 21(3),
2021, pp. 112-119.
[32] I. Arora, J. Guo, S. L. Levitan, S. McGregor, & J. Hirschberg, "A novel methodology for developing automatic harassment
classifiers for Twitter," In Proceedings of the Fourth Workshop on Online Abuse and Harms, 2020, pp. 7-15.
[33] F.E. Ayo, O. Folorunso, F.T. Ibharalu, I.A. Osinuga, & A. Abayomi-Alli, "A probabilistic clustering model for hate speech
classification in twitter," Expert Systems with Applications, 173, 2021.
[34] S. Vijayarani, J. Ilamathi, and Nithya, "Pre-processing techniques for text mining-an overview," International Journal of
Computer Science & Communication Networks, 5(1), 2015, pp. 7-16.
[35] (2020) Towardsdatascience website. All you need to know about text pre-processing for NLP and Machine Learning. [Online].
Available: https://towardsdatascience.com/all-you-need-to-know-about-text-preprocessing-for-nlp-and-machine-learning-
bc1c5765ff67.
[36] Kadhim, Ammar Ismael, "An evaluation of pre-processing techniques for text classification," International Journal of Computer
Science and Information Security (IJCSIS), 16(6), 2018, pp. 22-32.
[37] Dimple Tiwari, Nanhay Singh, "Ensemble Approach for Twitter Sentiment Analysis", International Journal of Information
Technology and Computer Science(IJITCS), 11(8), 2019, pp. 20-26. DOI: 10.5815/ijitcs.2019.08.03
[38] Gautam, Arvind Kumar, and Abhishek Bansal, "Effect of Features Extraction Techniques on Cyberstalking Detection using
Machine Learning Framework," Journal of Advances in Information Technology, 13(5), 2022.
[39] Rui, Weikang, Kai Xing, and Yawei Jia. "BOWL: Bag of word clusters text representation using word
embeddings." International Conference on Knowledge Science, Engineering and Management. Springer, Cham, 2016.
[40] (2020) Medium Website. All about Embeddings. [Online]. Available: https://medium.com/@kashyapkathrani/all-about-
embeddings-829c8ff0bf5b
[41] T. Mikolov, K. Chen, G. Corrado, J. Dean, "Efficient estimation of word representations in vector space," arXiv preprint
arXiv:1301.3781, 2013. https://arxiv.org/pdf/1301.3781.pdf
[42] Pennington, Jeffrey, Richard Socher, and Christopher D. Manning, "Glove: Global vectors for word representation," Proceedings of the
2014 conference on empirical methods in natural language processing (EMNLP), 2014.
[43] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, "Bag of tricks for efficient text classification," arXiv preprint
arXiv:1607.01759, 2016. https://arxiv.org/pdf/1607.01759.pdf
[44] C. Raj, A. Agarwal, G. Bharathy, B. Narayan, M. Prasad, "Cyberbullying Detection: Hybrid Models Based on Machine
Learning and Natural Language Processing Techniques," Electronics, 22(10), 2021.
[45] B. Das, S. Chakraborty, "An improved text sentiment classification model using TF-IDF and next word negation," arXiv
preprint arXiv:1806.06407, 2018.
[46] S. Alashri, S. Alzahrani, M. Alhoshan, I. Alkhanen, S. Alghunaim, & M. Alhassoun, "Lexi-Augmenter: Lexicon-Based Model
for Tweets Sentiment Analysis," In 2019 IEEE International Conference on Computational Science and Engineering (CSE) and
IEEE International Conference on Embedded and Ubiquitous Computing (EUC), IEEE, 2019, pp. 7-10.
[47] Gupta, Neha, and Rashmi Agrawal, "Application and techniques of opinion mining," Hybrid Computational Intelligence.
Academic Press, 2020, pp. 1-23.
[48] Osman, Aida, Said Ahmad. "Current trends and research directions in the dictionary-based approach for sentiment lexicon
generation: a survey," Journal of theoretical and applied information technology 97(2), 2019.
[49] Kumar, Akshi, and Geetanjali Garg, "Systematic literature review on context-based sentiment analysis in social
multimedia," Multimedia tools and Applications, 2020, pp. 15349-15380.
[50] Sazzed, Salim, and Sampath Jayarathna. "Ssentia: a self-supervised sentiment analyzer for classification from unlabeled
data," Machine Learning with Applications, 4, 2021.
[51] Gautam, Arvind Kumar, and Abhishek Bansal, "Performance Analysis of Supervised Machine Learning Techniques For
Cyberstalking Detection In Social Media," Journal of Theoretical and Applied Information Technology, 100(2), 2022.
[52] (2017) Data Flair website. Kernel Functions-Introduction to SVM Kernel & Examples. [Online]. Available: https://data-
flair.training/blogs/svm-kernel-functions/
[53] I. Rish, "An empirical study of the naive bayes classifier," IJCAI 2001 workshop on empirical methods in artificial intelligence,
3(22), 2001, pp. 41-46.
[54] Eman Bashir, Mohamed Bouguessa, "Data Mining for Cyberbullying and Harassment Detection in Arabic Texts," International
Journal of Information Technology and Computer Science(IJITCS), 13(5),2021, pp.41-50. DOI: 10.5815/ijitcs.2021.05.04
[55] (2020) Mendeley Cyberbullying datasets. [Online]. Available:https://data.mendeley.com/datasets/jf4pzyvnpj/1
[56] (2020) The Kaggle website-dataset. [Online]. Available: https://www.kaggle.com/mrmorj/hate-speech-and-offensive-language-
dataset
[57] (2022) The Kaggle website-dataset. [Online]. Available: https://www.kaggle.com/andrewmvd/cyberbullying-classification
[58] (2021) The Kaggle website-dataset. [Online]. Available: https://www.kaggle.com/sanamps/toxiccommentclassification
[59] (2014) The Kaggle website-dataset. [Online]. Available: https://www.kaggle.com/c/detecting-insults-in-social-commentary/data
Authors Profiles
Arvind Kumar Gautam was born in Rewa, Madhya Pradesh, India. He received his Master of Philosophy degree
in Computer Science in 2009 from APS University, Rewa, Madhya Pradesh, India. He is a Ph.D. research scholar
in the Department of Computer Science, Indira Gandhi National Tribal University, Amarkantak, Madhya Pradesh,
India. He has also worked as a System Analyst for 9 years at Indira Gandhi National Tribal University,
Amarkantak, Madhya Pradesh. He has more than 10 years of working experience in server administration,
networking, cyber security, web programming, and teaching. He has published several research papers in
international journals and conferences. His academic research interests mainly include Cyber Security, Machine
Learning, and Web Engineering.
Abhishek Bansal received the MCA degree from Dr. B. R. Ambedkar University, Agra, Uttar Pradesh, India, in
2004, and the Ph.D. degree from Delhi University, Delhi, India. He is currently working as a Senior Assistant
Professor with the Department of Computer Science, Indira Gandhi National Tribal University, Amarkantak,
Madhya Pradesh, India. He has more than 12 years of teaching and research experience and has supervised several
Ph.D. research scholars. He has also published several papers in reputed journals and conferences.
How to cite this paper: Arvind Kumar Gautam, Abhishek Bansal, "Automatic Cyberstalking Detection on Twitter in Real-Time
using Hybrid Approach", International Journal of Modern Education and Computer Science (IJMECS), Vol.15, No.1, pp. 58-72, 2023.
DOI:10.5815/ijmecs.2023.01.05
... After training all five datasets (ranging from one gramme to five gramme), we now have five trained models that can handle a wide range of input lengths. These models accept word sequences of variable lengths [32] as their inputs and create a single result, which is the next word that ought to most likely follow from the input word arrangement [33]. The length of the word sequences that are used as inputs might vary. ...
... Thus, the disempowering construct needs further examination regarding its constituent elements and antecedents. Future researchers may benefit from examining disempowering behaviors using the hybrid approach consisting of lexicon-based and machine learning approaches applied to detect cyberbullying on Twitter (Gautam & Bansal, 2023). ...
Article
Researchers have focused on leadership, often overlooking followership. The notion of followership was irreversibly transformed with the advent and societal adoption of followership systems, such as Twitter. To examine such emergent systems, this paper advances a distinct form of followership: eFollowership. To understand Twitter and its users, the eFollowership concept is explicated and synthesized by adapting several followership lenses from the literature. The authors empirically examined eFollowership by assessing the roles constructed by 301 Twitter users and the relationships between these users' role-based characteristics and behaviors with partial least squares structural equation modeling (PLS-SEM). Results showed that users' voicing and empowering behaviors were significantly influenced by users' characteristics: personal sense of power, eCourage, and social capital. Users' helping behaviors were related to users' personal sense of power and social capital, but not to eCourage. Surprisingly, users' disempowering behaviors were unrelated to all three users' characteristics.
... Future studies could include these control variables to provide a more comprehensive understanding of the factors influencing movie success. Machine learning (Fadhli et al., 2022;Gautam & Bansal, 2023) and experimental methods can also be used to improve the study. ...
Article
In the growth process of movie IWOM, the antecedent IWOM has a significant influence on the subsequent IWOM. IWOM does not form all at once, but iteratively over a short period. This article explores the influence of IWOM publishers on IWOM growth and the dynamic impact of IWOM on movie box office by using vector autoregressive model (VAR model) and impulse response analysis. The findings reveal that highly influential and active users' statements stimulate discussion enthusiasm and increase related topic discussions. These statements also reduce the discreteness of IWOM. On the other hand, highly professional users make IWOM more discreet. Both increased discussion enthusiasm and differentiated IWOM contribute to the growth of movie box office. Additionally, during the growth of IWOM, there is an approximately five day “advance period of word-of-mouth regeneration”: it takes audiences about three days from reading movie reviews to watching a movie, followed by about two days to write their own reviews, and the whole process takes about five days.
... The authors' proposed model mainly focuses on identifying the cyberstalker online and detecting cyberstalking incidents. Another enhanced automated cyberstalking detection model in real-time on Twitter is designed and developed by Gautam et al. [34] using a hybrid approach inspired by machine learning. The authors performed the experimental work on live tweets in real time for cyberstalking detection using lexicon-based, machine learning-based (single approach), and hybrid approaches (multimodel based inspired by machine learning) and found the hybrid approach performed better in cyberstalking detection. ...
Article
In the virtual world, many internet applications are used by a mass of people for several purposes. Internet applications are the basic needs of people in the modern days of lifestyle which are also making habitual society. Like social media, e-mail technology is also more prevalent among people of different categories for personal and official communications. The widespread use of e-mail-based communication is also raising various types of cybercrimes, including cyberstalking. Cyberstalkers also use an e-mail-based approach to harass the victim in the form of cyberstalking. Cyberstalkers utilize several content-wise and intent-wise approaches to target the victim, such as spamming, phishing, spoofing, malicious, defamatory, e-mail bombing, and non-spam e-mails, including sexism, racism, and threatening, and finally, trying to hack the account over e-mail technology. This paper proposed an EBCD model for automatic cyberstalking detection on textual data of e-mail using the multi-model soft voting technique of the machine learning approach. Initially, experimental works were performed to train, test, and validate all classifiers of three model sets on three different labeled datasets. Dataset D1 contains spam, fraudulent, and phishing e-mail subject, dataset D2 contains spam e-mail body text, while dataset D3 contains harassment-related data. After that, trained, tested, and validated classifiers of all model sets were applied as a combined approach to automatically classify the unlabeled e-mails from the user’s mailbox using the multi-model soft voting technique. The proposed EBCD model successfully classifies the e-mails from the user’s mailbox into cyberstalking e-mails, suspicious e-mails (spam and fraudulent), and normal e-mails. In each model set of the EBCD model, several classifiers, namely support vector machine, random forest, naïve bayes, logistic regression, and soft voting, were used. 
The final decision in classifying the e-mails from the user’s mailbox was taken by the soft voting technique of each model set. The TF-IDF feature extraction method was used with the entire applied machine learning model sets to obtain the feature vectors from the data. Experimental results show that the soft voting technique not only enhances the performance of the e-mail classification task but also supports making the right decision to avoid the wrong classification. Overall performance of the soft voting technique was better than other classifiers, although the performance of the support vector machine was also notable. As per experimental results, the soft voting technique obtained an accuracy of 97.7%, 97.7%, 98.9%, a precision of 97%, 98.3%, 98.6%, recall of 98.3%, 96.5%, 99.1%, f-score of 97.6%, 97.4%, 98.9%, and AUC of 99.4%, 99.7%, 99.9% on dataset D1, D2, and D3 respectively. The average performance of soft voting of each model set on classified e-mails from the user’s mailbox was also notable, with an accuracy of 96.3%, precision of 98.1%, recall of 94%, f-score of 95.9%, and AUC of 96.8%.
Chapter
Cyberstalking is one of the most widespread threats on digital platforms. It has included many forms of direct threats via email, online distribution of intimate photographs, seeking information about victims, harassment, and catfishing. The consequences of cyberstalking may lead to psychological problems such as mental health, distress, victim experiencing feelings of isolation, guilt, adverse effects on life activity. These psychological problems may further lead to reports of serious health issues such as anger, fear, suicidal ideation, depression, and post-traumatic stress disorder (PTSD). However, there are many coping strategies such as avoidant coping, ignoring the perpetrator, confrontational coping, support seeking, and cognitive reframing. In spite of these methods, awareness of preventive measures of cyberstalking may further help to overcome mental stress. In this chapter, the authors have pointed out the various psychological issues due to cyberstalking and further discuss their solutions through preventing or automatic detection methods inspired by machine learning approaches.
Article
Various cybercriminals are active with predefined and preplanned agendas to carry out cybercrimes in the Internet world. Cyberstalking, cyberbullying, cyber terrorism, cyber hacking, data leakage, identity theft, phishing, and other types of cyber harassment continually occur in the virtual world. Cyberstalking and cyberbullying are near to close in content and intent, involving the same internet-based technology to harass, bully and undermine others online. This paper implemented a cyberstalking detection model and analyzed the effect of various feature extraction techniques on different machine learning classifiers for cyberstalking detection. For feature extraction, the proposed model applied Word2vec, BOW, TF-IDF, FastText, GloVe, ELMo, and BERT. Logistic Regression (LR), Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Random Forest (RF), Naive Bayes (NB), and Decision Tree (DT) were used for classification. Effects of each feature extraction method to enhance the performance of the detection model were determined based on the performance results of applied classifiers with each feature extraction process. Experimental results show that BOW and TF-IDF outperformed advanced word embedding-based feature extraction methods. BOW (for LR) achieved the highest accuracy of 95.7%, highest precision of 97.9%, and highest F-Score of 97.3%. TF-IDF achieved the highest recall of 99.8% for NB. SVM classifier achieved the second-highest accuracy of 95.2% with TF-IDF. BERT model successfully obtained maximum accuracy of 90.9% and 90.7% for LR and SVM, respectively. ELMo model also performed well and produced maximum accuracy of 90.5% and 90.2% for LR and SVM, respectively. The SkipGram model of Word2Vec provided an accuracy of 85% for the LR classifier. GloVe provided 81.2% accuracy for the RF classifier. SkipGram and the CBOW model of FastText provided 85.7% and 82.2% accuracy, respectively, for the RF classifier.
Article
Web-based media organizations and other web applications, for example, WhatsApp, Facebook, YouTube, Instagram, Twitter, have become more well known among individuals for data sharing, live occasions, news, exposure, publicity, and cybercrimes. The utilization of online media stages additionally offers major issues through cyberstalking, cyberbullying, and different kinds of digital provocation. Cyberstalking and cyberbullying are frequently utilized reciprocally and include the utilization of the web to follow or target somebody in the web-based world. Cyberstalking is a basic worldwide issue that influences instructive foundations, casualties, and the whole human culture that should be distinguished, recognized, revealed, and controlled appropriately for the security of clients in online media. Machine learning is the most well-known method for making the cyberstalking recognition model. Researchers have recommended different recognition procedures utilizing machine learning to control and battle cyberstalking in web-based media. In this paper, the study relates to some popular features extraction methods machine learning classifiers for text classification and explores the datasets used by the researchers. The study also focuses on reasonably determining the research gaps and the scope for improving cyberstalking detection. This paper will review some cyberstalking detection techniques using machine learning, analyze the performance of popular machine learning classifiers and finally explore the issues, challenges, recent trends, and future direction for cyberstalking detection techniques.
Article
In the modern days of life, people use many social media sites for information sharing among friends, relatives, and others for personal, business, and official purposes. The use of social media platforms is also raising serious issues in the form of cyberstalking. Cyberstalking has been identified as a growing antisocial problem that affects educational institutions, victims, and entire human society. An intelligent system is required to detect cyberstalking in social media. In this paper, we proposed a cyberstalking detection model and analyzed the performance of six popular supervised machine learning algorithms, namely Logistic Regression, Support Vector Machines (SVM), Random Forest, Decision Trees, K-Nearest Neighbor, and Naive Bayes. These machine learning algorithms were implemented with two feature extraction methods, Bag of Words and TF-IDF, on two datasets of different sizes and distribution containing 35734 and 70019 comments and tweets, respectively. Performance of algorithms was measured in terms of Accuracy, Precision, Recall, f-score, training time, and prediction time. Our experimental results show that Logistic Regression and Support Vector Machine were top performer algorithms for both datasets with both feature extraction methods. Logistic Regression (92.6% with BOW and 92% with TF-IDF) and Support Vector Machine (92.5% with TF-IDF and 91.9% with BOW) achieved the highest accuracy on dataset-1. Logistic Regression and Support Vector Machine also achieved the highest Precision (96.4% and 96.6% respectively) and F-Score (94.3% and 93.8% respectively), while Naïve Bayes provides the best Recall (97.6% with TF-IDF on dataset-1) for both datasets.
Article
The rise in web and social media interactions has resulted in the effortless proliferation of offensive language and hate speech. Such online harassment, insults, and attacks are commonly termed cyberbullying. The sheer volume of user-generated content has made it challenging to identify such illicit content. Machine learning has wide applications in text classification, and researchers are shifting towards using deep neural networks in detecting cyberbullying due to the several advantages they have over traditional machine learning algorithms. This paper proposes a novel neural network framework with parameter optimization and an algorithmic comparative study of eleven classification methods: four traditional machine learning and seven shallow neural networks on two real world cyberbullying datasets. In addition, this paper also examines the effect of feature extraction and word-embedding-techniques-based natural language processing on algorithmic performance. Key observations from this study show that bidirectional neural networks and attention models provide high classification results. Logistic Regression was observed to be the best among the traditional machine learning classifiers used. Term Frequency-Inverse Document Frequency (TF-IDF) demonstrates consistently high accuracies with traditional machine learning techniques. Global Vectors (GloVe) perform better with neural network models. Bi-GRU and Bi-LSTM worked best amongst the neural networks used. The extensive experiments performed on the two datasets establish the importance of this work by comparing eleven classification methods and seven feature extraction techniques. Our proposed shallow neural networks outperform existing state-of-the-art approaches for cyberbullying detection, with accuracy and F1-scores as high as ~95% and ~98%, respectively.
Article
In recent years, the use of social networks has increased exponentially, which has led to a significant increase in cyberbullying. Currently, in the field of Computer Science, research has been made on how to detect aggressiveness in texts, which is a prelude to detecting cyberbullying. In this field, the main work has been done for English language texts, mainly using Machine Learning (ML) approaches, Lexicon approaches to a lesser extent, and very few works using hybrid approaches. In these, Lexicons and Machine Learning algorithms are used, such as counting the number of bad words in a sentence using a Lexicon of bad words, which serves as an input feature for classification algorithms. This research aims at contributing towards detecting aggressiveness in Spanish language texts by creating different models that combine the Lexicons and ML approach. Twenty-two models that combine techniques and algorithms from both approaches are proposed, and for their application, certain hyperparameters are adjusted in the training datasets of the corpora, to obtain the best results in the test datasets. Three Spanish language corpora are used in the evaluation: Chilean, Mexican, and Chilean-Mexican corpora. The results indicate that hybrid models obtain the best results in the 3 corpora, over implemented models that do not use Lexicons. This shows that by mixing approaches, aggressiveness detection improves. Finally, a web application is developed that gives applicability to each model by classifying tweets, allowing evaluating the performance of models with external corpus and receiving feedback on the prediction of each one for future research. In addition, an API is available that can be integrated into technological tools for parental control, online plugins for writing analysis in social networks, and educational tools, among others.
Article
Social media has become incredibly popular these days for communicating with friends and for sharing opinions. According to current statistics, almost 2.22 billion people used social media in 2016, which is roughly one third of the world population and three times the entire population of Europe. On social media, people share their likes, dislikes, opinions, interests, etc., so it is possible to learn a person's thoughts about a specific topic from the data shared on social media. Since Twitter is one of the most popular social media platforms in the world, it is a very good source for opinion mining and sentiment analysis about different topics. In this research, SVM with different kernel functions and Adaboost are tested using CPD and Chi-square feature extraction techniques to explore the best sentiment classification model. The reported average accuracy of Adaboost for Chi-square and CPD are 70.2% and 66.9%. The SVM radial basis kernel and polynomial kernel with Chi-square n-grams reported average accuracy of 73.73% and 68.67% respectively. Among the experiments performed, the SVM sigmoid kernel with Chi-square n-grams provided the maximum accuracy of 74.4%.
Article
In recent years, with the advancement of the internet, social media has become a promising platform for exploring what is going on around the world, sharing opinions, and personal development. Sentiment analysis, also known as text mining, is now widely used in the data science sector. It is an analysis of textual data that describes subjective information available in the source and allows an organization to identify the thoughts and feelings about its brand, goods, or services while monitoring conversations and reviews online. Sentiment analysis of Twitter data is a very popular research topic nowadays. Twitter is a social media platform where many users express their opinions and feelings through short tweets, and different machine learning classifier algorithms can be used to analyze those tweets. In this paper, some selected machine learning classifier algorithms were applied to crawled Twitter data after applying different types of preprocessors and encoding techniques, which ended up with satisfying accuracy. Later, a comparison between the achieved accuracies was shown. Experimental evaluations show that the Neural Network Classifier algorithm provides a remarkable accuracy of 81.33% compared with other classifiers.
Article
Terrorist groups in the Arab world have been using social networking sites like Twitter and Facebook to rapidly spread terror for the past few years. Detection and suspension of such accounts is a way to control the menace to some extent. This research is aimed at building an effective text classifier, using machine learning to identify the polarity of the tweets automatically. Five classifiers were chosen, which are AdB_SAMME, AdB_SAMME.R, Linear SVM, NB, and LR. These classifiers were applied on three features namely S1 (one word, unigram), S2 (word pair, bigram), and S3 (word triplet, trigram). All five classifiers evaluated samples S1, S2, and S3 in 346 preprocessed tweets. Feature extraction process utilized one of the most widely applied weighting schemes, tf-idf (term frequency-inverse document frequency). The results were validated by four experts in Arabic language (three teachers and an educational supervisor in Saudi Arabia) through a questionnaire. The study found that the Linear SVM classifier yielded the best results of 99.7% classification accuracy on S3 among all the other classifiers used. When both classification accuracy and time were considered, the NB classifier demonstrated the performance on S1 with 99.4% accuracy, which was comparable with Linear SVM. The Arab world has faced massive terrorist attacks in the past, and therefore, the research is highly significant and relevant due to its specific focus on detecting terrorism messages in Arabic. The state-of-the-art methods developed so far for tweets classification are mostly focused on analyzing English text, and hence, there was a dire need for devising machine learning algorithms for detecting Arabic terrorism messages. The innovative aspect of the model presented in the current study is that the five best classifiers were selected and applied on three language models S1, S2, and S3.
The comparative analysis based on classification accuracy and time constraints proposed the best classifiers for sentiment analysis in the Arabic language.
Article
Broadly, cyberbullying is viewed as a severe social danger that influences many individuals around the globe, particularly young people and teenagers. The Arabic world has embraced technology and continues using it in different ways to communicate inside social media platforms. However, the Arabic text has drawbacks for its complexity, challenges, and scarcity of its resources. This paper investigates several questions related to the content of how to protect an Arabic text from cyberbullying/harassment through the information posted on Twitter. To answer this question, we collected the Arab corpus covering the topics with specific words, which we will explain in detail. We devised experiments in which we investigated several learning approaches. Our results suggest that deep learning models like LSTM achieve better performance compared to other traditional cyberbullying classifiers with an accuracy of 72%.