© 2020 IJRAR April 2020, Volume 7, Issue 2 www.ijrar.org (E-ISSN 2348-1269, P- ISSN 2349-5138)
New Approach for Detecting Spammers on Twitter
using Machine Learning Framework
Deepali Prakash Sonawane Dr. Baisa L. Gunjal
Amrutvahini College of Engineering Amrutvahini College of Engineering
Sangamner, India Sangamner, India
Abstract: Social networking sites involve billions of users around the world. User interactions with these sites, such as Twitter,
have a tremendous and occasionally undesirable impact on daily life. The major social networking sites have become a
target platform for spammers to disperse large amounts of irrelevant and harmful information. Twitter has become one of the most
popular microblogging services of all time and is widely used to spread large amounts of spam. Fake users send unwanted tweets
to promote services or websites, which not only affects legitimate users but also wastes resources. Furthermore, the possibility
of spreading false information to users through fake identities has increased, resulting in malicious content. Recently, the
detection of spammers and the identification of fake users and fake tweets on Twitter has become an important area of research in
online social networks (OSNs). In this paper, we present techniques used to detect spammers on Twitter. In addition, a taxonomy of
Twitter spam detection approaches is presented, which classifies techniques based on their ability to detect fake content,
URL-based spam, and spam in trending topics. Twelve to nineteen different features, including six recently defined features and
two redefined features, are identified to train two supervised machine learning classifiers on a real-time data set to distinguish
legitimate users from spammers.
Index Terms – Classification, Social Network Security, Intrusion, Spam Detection, Machine Learning.
I. INTRODUCTION
Online social networking sites such as Twitter, Facebook, Instagram, and others have become extremely popular in recent years.
People spend a lot of time on OSNs making friends with people they are familiar with or interested in. The growing popularity of
social sites allows users to gather an abundant amount of data and information about other users. The large volumes of information
accessible on these sites also draw the attention of spammers. Twitter has quickly become an online hotspot for obtaining
real-time data about users. Twitter is an Online Social Network (OSN) where users can share anything and everything, such as news,
opinions, and even their moods. Discussions can be held over different topics, such as politics, current affairs, and important
events. When a user tweets something, it is immediately passed on to his/her followers, enabling them to spread the received
information at a much wider level. With the development of OSNs, the need to study and analyze users' behaviors on online social
platforms has intensified. Many people who do not have much information regarding OSNs can easily be deceived by fraudsters.
There is also a demand to combat and control the people who use OSNs only for advertisements and thereby spam others' accounts.
Recently, the detection of spam on social networking sites has attracted the attention of researchers. Spam detection is a
difficult task in maintaining the security of social networks. It is essential to recognize spam on OSN sites to save users from
various kinds of malicious attacks and to preserve their security and privacy. These harmful actions by spammers cause
significant damage to the community in the real world. Twitter spammers have various objectives, such as spreading invalid
information, fake news, rumors, and unsolicited messages. Spammers achieve their malicious objectives through advertisements and
several other means, in which they support different mailing lists and subsequently dispatch spam messages randomly to broadcast
their interests. These activities disturb the genuine users, who are known as non-spammers. Furthermore, they also damage the
reputation of the OSN platforms. Therefore, it is essential to design a scheme to spot spammers so that corrective efforts can be
taken to counter their malicious activities.
The ability to sort out useful information is essential for academia and industry to discover hidden insights and predict
trends on Twitter. However, spam generates a lot of noise on Twitter. To detect spam automatically, researchers have applied
machine learning algorithms, treating spam detection as a classification problem. Classifying an individual tweet, rather than a
Twitter user, as spam or non-spam is more realistic in the real world.
II. LITERATURE SURVEY
Nathan Aston, Jacob Liddle, and Wei Hu [4] describe Twitter sentiment analysis in data streams with the Perceptron. By
implementing feature reduction, they make the Perceptron and Voted Perceptron algorithms more viable in a stream environment.
The paper develops methods by which Twitter sentiment can be determined both quickly and accurately on a large scale.
Q. Cao, M. Sirivianos, X. Yang, and T. Pregueiro [6] address aiding the detection of fake accounts in large-scale social online
services. The paper presents SybilRank, an effective and efficient fake account inference scheme that allows OSNs to rank accounts
according to their perceived likelihood of being fake. It works on knowledge extracted from the network so that fake accounts can
be detected, verified, and removed.
G. Stringhini, C. Kruegel, and G. Vigna [8] address detecting spammers on social networks. Their approach helps to detect spam
profiles even when they do not contact a honey-profile. The irregular behavior of a user profile is detected and, based on that,
a profile is developed to identify the spammer.
J. Song, S. Lee, and J. Kim [9] describe spam filtering in Twitter using the sender-receiver relationship. The paper proposes a
spam filtering method for social networks using relation information between users. The system uses distance and connectivity as
features, which are hard for spammers to manipulate and effective for classifying spammers.
IJRAR2004252
International Journal of Research and Analytical Reviews (IJRAR)
www.ijrar.org
794
K. Lee, J. Caverlee, and S. Webb [10] describe uncovering social spammers with social honeypots and machine learning. The
system analyzes how spammers who target social networking sites operate. To collect data about spamming activity, the system
created a large set of honey-profiles on three large social networking sites.
K. Thomas, C. Grier, D. Song, and V. Paxson [12] present a retrospective analysis of Twitter spam based on suspended accounts.
The paper characterizes the behaviors of spammers on Twitter by analyzing the tweets sent by suspended users in retrospect. It
identifies an emerging spam-as-a-service market that includes reputable and not-so-reputable affiliate programs, ad-based
shorteners, and Twitter account sellers.
K. Thomas, C. Grier, J. Ma, V. Paxson, and D. Song [14] describe the design and evaluation of a real-time URL spam filtering
service. Monarch is a real-time system for filtering scam, phishing, and malware URLs as they are submitted to web services.
Monarch's architecture generalizes to many web services being targeted by URL spam, but accurate classification hinges on
having an intimate understanding of the spam campaigns abusing a service.
X. Jin, C. X. Lin, J. Luo, and J. Han [20] describe SocialSpamGuard, a data mining based spam detection system for social
media networks. The system automatically harvests spam activities in social networks by monitoring social sensors with popular
user bases. Introducing both image.
III. PROPOSED METHODOLOGY
We evaluate spam detection performance on our dataset using machine learning algorithms. The process of Twitter spam
detection with machine learning algorithms works as follows. Before classification, a classifier that contains the knowledge
structure should be trained with prelabeled tweets. After the classification model gains the knowledge structure of the training
data, it can be used to predict the class of a new incoming tweet.
The whole process consists of two steps:
1. Learning
2. Classifying.
First, features of tweets are extracted and formatted as a vector. The class labels (spam or non-spam) can be obtained via
other approaches (such as manual inspection). Features and class label are combined as one instance for training. One training
tweet can then be represented by a pair containing one feature vector, which represents the tweet, and the expected result; the
training set is the collection of such pairs. The training set is the input of the machine learning algorithm, and the
classification model is built after the training process. In the classifying process, tweets captured in a timely manner are
labeled by the trained classification model.
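The learning step described above can be sketched as follows. This is a minimal illustration only: the five features, the helper names `extract_features` and `build_training_set`, and the two example tweets are assumptions made for the sketch and are not the feature set used in the paper.

```python
import re

URL_RE = re.compile(r"https?://\S+")

def extract_features(tweet):
    """Map a tweet to a small numeric feature vector (illustrative features only)."""
    words = tweet.split()
    return [
        float(len(tweet)),                             # number of characters
        float(len(words)),                             # number of words
        float(len(URL_RE.findall(tweet))),             # number of URLs
        float(sum(w.startswith("#") for w in words)),  # number of hashtags
        float(sum(w.startswith("@") for w in words)),  # number of mentions
    ]

def build_training_set(labeled_tweets):
    """Pair each feature vector with its expected class label."""
    return [(extract_features(text), label) for text, label in labeled_tweets]

# Hypothetical prelabeled tweets (labels would come from, e.g., manual inspection)
labeled_tweets = [
    ("Win a FREE phone now http://spam.example #free #win", "spam"),
    ("Lunch with @alice was great today", "nonspam"),
]
training_set = build_training_set(labeled_tweets)
for features, label in training_set:
    print(label, features)
```

Each element of `training_set` is one (feature vector, label) pair, which is the form a supervised classifier expects as input.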
A. Architecture:
Figure 1: Proposed System Architecture
1. Tweets are collected with respect to trending topics on Twitter. After being stored in a particular
file format, the tweets are subsequently analyzed.
2. Spam labelling is performed by checking all available datasets to detect malicious URLs.
3. Feature extraction derives the characteristics based on a language model, which uses language as a
tool and helps in determining whether the tweets are fake or not.
4. Classification of the data set is performed by shortlisting the set of tweets described by the set of
features provided to the classifier, in order to train the model and acquire the knowledge for spam detection.
5. Spam detection uses the classification technique to accept tweets as input and classify them as spam or non-spam.
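The labelling stage (step 2) can be illustrated with a small sketch that marks a tweet as spam when one of its URLs points to a known malicious host. The blacklist contents, the function name `label_tweet`, and the example tweets are hypothetical; the paper does not specify which URL datasets were consulted.

```python
import re
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://\S+")

# Hypothetical blacklist of malicious hosts; a real system would load this
# from a maintained data source.
BLACKLISTED_HOSTS = {"malware.example", "phish.example"}

def label_tweet(text, blacklist=BLACKLISTED_HOSTS):
    """Return 'spam' if any URL in the tweet points to a blacklisted host, else 'nonspam'."""
    for url in URL_RE.findall(text):
        host = urlparse(url).hostname
        if host in blacklist:
            return "spam"
    return "nonspam"

tweets = [
    "Claim your prize at http://phish.example/win",
    "Reading about #MachineLearning at https://news.example/ml",
    "No links here",
]
labels = [label_tweet(t) for t in tweets]
print(labels)
```

Labels produced this way can serve as the class labels for the training pairs used in the classification stage.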
B. Algorithm:
1. Support Vector Machine:
Support Vector Machine (SVM) is used to classify the tweets. Support vector machines are mainly two-class
classifiers with linear or non-linear class boundaries. The idea behind SVM is to form a hyperplane between the data
sets to express which class each point belongs to. The task is to train the machine with known data; the SVM then finds
the optimal hyperplane, which gives the maximum distance to the nearest training data points of any class.
Steps:
Step 1: Read the test features and the trained features.
Step 2: Check all the test features and also get all the training features.
Step 3: Choose the kernel.
Step 4: Train the SVM using both features and show the output.
Step 5: Classify an observation using the trained SVM classifier.
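The steps above can be sketched with a small linear SVM trained by subgradient descent on the regularized hinge loss. This is an illustration under stated assumptions, not the implementation used in the paper: the linear kernel, the two features (URL count and hashtag count), the toy data, and the hyperparameters `lam`, `epochs`, and `lr` are all chosen for the example.

```python
def train_linear_svm(samples, labels, lam=0.01, epochs=200, lr=0.1):
    """Train a linear SVM by subgradient descent on the hinge loss.

    labels must be +1 (spam) or -1 (non-spam).
    """
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Point lies inside the margin or is misclassified:
                # apply the hinge-loss subgradient plus regularization.
                w = [wi + lr * (y * xi - lam * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Point is safely classified: only regularization shrinks w.
                w = [wi - lr * lam * wi for wi in w]
    return w, b

def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Toy feature vectors: [number of URLs, number of hashtags] (assumed features)
samples = [[2, 3], [3, 4], [2, 4], [0, 0], [0, 1], [1, 0]]
labels = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(samples, labels)
predictions = [predict(w, b, x) for x in samples]
print(predictions)
```

A non-linear kernel would replace the dot product with a kernel function; the linear case is shown because it keeps the hyperplane explicit as the pair `(w, b)`.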
2. Naïve Bayes Classification:
The Naive Bayes algorithm learns the probability of an object with certain features belonging to
a particular group/class. In short, it is a probabilistic classifier. The Naive Bayes algorithm is called naive because it
assumes that the occurrence of a certain feature is independent of the occurrence of other features.
The Naive Bayesian classifier is based on Bayes' theorem with the independence assumption between predictors.
A Naive Bayesian model is easy to build, with no complicated iterative parameter estimation, which makes it
particularly useful for very large datasets. Despite its simplicity, the Naive Bayesian classifier often does
surprisingly well and is widely used because it often outperforms more sophisticated classification methods.
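As an illustration of the independence assumption, the following sketch implements a small multinomial Naive Bayes classifier over tweet words. The choice of the multinomial variant, the add-one (Laplace) smoothing, the function names, and the four training tweets are assumptions made for the example; the paper does not state which variant it used.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs):
    """docs: list of (text, label). Returns class counts, per-class word counts, vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def classify(text, model):
    """Return the class with the highest log posterior for the given text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # log P(class) + sum of log P(word | class): words are treated as independent
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen during training
            # Add-one smoothing avoids zero probabilities for unseen class/word pairs
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled tweets
docs = [
    ("win free prize now", "spam"),
    ("free money click link", "spam"),
    ("meeting at noon today", "nonspam"),
    ("see you at lunch today", "nonspam"),
]
model = train_naive_bayes(docs)
print(classify("free prize link", model))
print(classify("lunch meeting today", model))
```

Training only counts words per class, which is why no iterative parameter estimation is needed.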
C. Mathematical Model:
1) Working of Support Vector Machine:
We have k sub-spaces, so there are k sub-space classification results for classifying tweets, called
CL_SS_1, CL_SS_2, ..., CL_SS_k. Thus the problem is how to integrate all of those results. The simplest
way is to calculate the mean value:
$$CL = \frac{1}{k} \sum_{i=1}^{k} CL\_SS_i \tag{1}$$
or the weighted mean value:
$$CL = \sum_{i=1}^{k} W_i \, CL\_SS_i \tag{2}$$
where CL is the combined result and W_i is the weight of the classification result of sub-space SS_i, which satisfies:
$$\sum_{i=1}^{k} W_i = 1 \tag{3}$$
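Equations (1) to (3) can be sketched in a few lines. The sketch assumes the weighted mean is the plain weighted sum of the sub-space results, since the weights are required to sum to 1; the function names and the example scores and weights are hypothetical.

```python
def mean_combination(results):
    """Equation (1): unweighted mean of the k sub-space results."""
    return sum(results) / len(results)

def weighted_combination(results, weights):
    """Equation (2): weighted mean; the weights must sum to 1 (equation (3))."""
    if len(results) != len(weights):
        raise ValueError("results and weights must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * r for w, r in zip(weights, results))

# Hypothetical classification scores from k = 3 sub-spaces
scores = [0.9, 0.6, 0.3]
weights = [0.5, 0.3, 0.2]

mean_value = mean_combination(scores)
weighted_value = weighted_combination(scores, weights)
print(mean_value, weighted_value)
```

With equal weights of 1/k, the weighted mean reduces to the simple mean.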
The centroid is calculated as follows:
$$\bar{x} = \frac{1}{k} \sum_{i=1}^{k} x_i, \qquad \bar{y} = \frac{1}{k} \sum_{i=1}^{k} y_i \tag{4}$$
where $(\bar{x}, \bar{y})$ represents the centroid, $x_i$ and $y_i$ are the x and y coordinates of the i-th point in the
region, and k denotes the number of points in that region. In the next step, the distance between the centroid and each point
is calculated. For distance, the following Euclidean distance is used:
$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} \tag{5}$$
where $(x_1, y_1)$ and $(x_2, y_2)$ are the coordinates of the two points.
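The centroid and Euclidean distance computations can be checked with a short sketch. The function names and the four sample points are assumptions chosen so the result is easy to verify by hand.

```python
import math

def centroid(points):
    """Mean of the x coordinates and mean of the y coordinates of k points."""
    k = len(points)
    return (sum(x for x, _ in points) / k, sum(y for _, y in points) / k)

def euclidean_distance(p, q):
    """Straight-line distance between two 2-D points p and q."""
    return math.sqrt((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

# Hypothetical points: the corners of a 4 x 6 rectangle
points = [(0.0, 0.0), (4.0, 0.0), (4.0, 6.0), (0.0, 6.0)]
c = centroid(points)
distances = [euclidean_distance(c, p) for p in points]
print(c, distances)
```

For these points the centroid is the center of the rectangle, so every corner is at the same distance from it.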
2) Working of Naïve Bayes Classification:
Bayes' theorem gives us a method to calculate the conditional probability, i.e., the probability of an event based on
previous knowledge available about related events. Here we use this technique for spam classification. More formally,
Bayes' theorem is stated as the following equation:
$$P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)} \tag{6}$$
Let us first understand the statement. The components of the above statement are:
P(A|B): Probability (conditional probability) of the occurrence of event A given that event B is true.
P(B|A): Probability of the occurrence of event B given that event A is true.
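A small worked example may help make the theorem concrete. Here A is "the tweet is spam" and B is "the tweet contains a given word"; all probabilities are hypothetical values chosen for illustration, and P(B) is obtained from the law of total probability.

```python
def posterior(p_b_given_a, p_a, p_b):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical probabilities
p_spam = 0.2                # P(A): prior probability that a tweet is spam
p_word_given_spam = 0.6     # P(B|A): the word appears in a spam tweet
p_word_given_nonspam = 0.05 # the word appears in a non-spam tweet

# P(B) by the law of total probability
p_word = p_word_given_spam * p_spam + p_word_given_nonspam * (1 - p_spam)

p_spam_given_word = posterior(p_word_given_spam, p_spam, p_word)
print(round(p_spam_given_word, 4))
```

Under these assumed numbers, seeing the word raises the probability of spam from the prior of 0.2 to a posterior of 0.75.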
IV. RESULT AND DISCUSSION
Experimental evaluation is done to compare the performance of Naive Bayes and Support Vector Machine. For the
evaluation of the experimental results, we use the following notation:
TP: True positive (number of instances correctly predicted as positive)
FP: False positive (number of instances incorrectly predicted as positive)
TN: True negative (number of instances correctly predicted as not required)
FN: False negative (number of instances incorrectly predicted as not required)
On the basis of these parameters, we can calculate four measurements. Accuracy is computed as:
Accuracy = (TP + TN) / (TP + FP + TN + FN)
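The accuracy formula can be computed directly from the four counts, as in the sketch below. Only accuracy is defined in the text; precision, recall, and F1 are included here as an assumption about what the other measurements might be, and the labels and predictions are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Count TP, FP, TN, FN, treating `positive` as the positive class."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)

# Assumed additional measures (not defined in the text)
def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1_score(p, r):
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Hypothetical ground truth and classifier output
y_true = ["spam", "spam", "spam", "nonspam", "nonspam", "nonspam", "nonspam", "spam"]
y_pred = ["spam", "spam", "nonspam", "nonspam", "spam", "nonspam", "nonspam", "spam"]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, fp, tn, fn)
prec = precision(tp, fp)
rec = recall(tp, fn)
f1 = f1_score(prec, rec)
print(tp, fp, tn, fn, acc, prec, rec, f1)
```

Note the parentheses in the accuracy formula: the sum TP + TN is divided by the total number of instances.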
Figure 2: Accuracy graph
Sr. No.    Support Vector Machine    Naïve Bayes
1          83%                       92%
Table 1: Comparative Table
V. CONCLUSION
In this paper, we performed a review of techniques used for detecting spammers on Twitter. In addition, we
presented a taxonomy of Twitter spam detection approaches and categorized them as fake content detection, URL-based spam
detection, spam detection in trending topics, and fake user detection techniques. We also compared the presented techniques
based on several features, such as user features, content features, graph features, structure features, and time features.
Moreover, the techniques were also compared in terms of their specified goals and datasets used. It is anticipated that the
presented review will help researchers find information on state-of-the-art Twitter spam detection techniques in a
consolidated form.
REFERENCES
[1] Mohd Fazil and Muhammad Abulaish, “A Hybrid Approach for Detecting Automated Spammers in Twitter,” IEEE Transactions on
Information Forensics and Security, Vol. 11, No. 2, January 2019.
[2] Ishaq Azhar Mohammed, "ARTIFICIAL INTELLIGENCE: THE KEY TO SELF-DRIVING IDENTITY GOVERNANCE",
International Journal of Creative Research Thoughts (IJCRT), ISSN:2320-2882, Volume.4, Issue 4, pp.664-667, November 2016,
Available at :http://www.ijcrt.org/papers/IJCRT1134112.pdf
[3] Sikender Mohsienuddin Mohammad, Surya Lakshmisri , "SECURITY AUTOMATION IN INFORMATION TECHNOLOGY",
International Journal of Creative Research Thoughts (IJCRT), ISSN:2320-2882, Volume.6, Issue 2, pp.901-905, June 2018,
Available at :http://www.ijcrt.org/papers/IJCRT1133434.pdf
[4] Nathan Aston, Jacob Liddle and Wei Hu*, Twitter Sentiment in Data Streams with Perceptron, in Journal of Computer and
Communications, 2014, Vol-2 No-11.
[5] Sudhir Allam, "BIG DATA MIGHT JUST CURE CANCER - THE RESEARCH AND THE REALITY", International Journal of
Creative Research Thoughts (IJCRT), ISSN:2320-2882, Volume.7, Issue 4, pp.820-825, December-2019, Available at
:http://www.ijcrt.org/papers/IJCRT1133998.pdf
[6] Q. Cao, M. Sirivianos, X. Yang, and T. Pregueiro, Aiding the detection of fake accounts in large scale social online services, in
Proc. Symp. Netw. Syst. Des. Implement. (NSDI), 2012, pp. 197–210.
[7] Ishaq Azhar Mohammed, "Identity-based Encryption: From Identity and Access Management to Enterprise Privacy
Management", International Journal of Emerging Technologies and Innovative Research (www.jetir.org | UGC and issn
Approved), ISSN:2349-5162, Vol.4, Issue 9, page no. pp719-722, September-2017, Available at :
http://www.jetir.org/papers/JETIR1709107.pdf
[8] G. Stringhini, C. Kruegel, and G. Vigna, Detecting spammers on social networks, in Proc. 26th Annu. Comput. Sec. Appl. Conf.,
2010, pp. 1–9.
[9] J. Song, S. Lee, and J. Kim, Spam filtering in Twitter using sender receiver relationship, in Proc. 14th Int. Conf. Recent Adv.
Intrusion Detection, 2011, pp. 301–317.
[10] K. Lee, J. Caverlee, and S. Webb, Uncovering social spammers: social honeypots + machine learning, in Proc. 33rd Int. ACM
SIGIR Conf. Res. Develop. Inf. Retrieval, 2010, pp. 435–442.
[11] Ravi Teja Yarlagadda, "Implementation of DevOps in healthcare systems", International Journal of Emerging Technologies and
Innovative Research (www.jetir.org), ISSN:2349-5162, Vol.4, Issue 6, page no.537-541, June-2017, Available
:http://www.jetir.org/papers/JETIR1706100.pdf
[12] K. Thomas, C. Grier, D. Song, and V. Paxson, Suspended accounts in retrospect: An analysis of Twitter spam, in Proc. ACM
SIGCOMM Conf. Internet Meas., 2011, pp. 243–258.
[13] Ishaq Azhar Mohammed, "Identity and Access Management for the Internet of Things", International Journal of Emerging
Technologies and Innovative Research (www.jetir.org), ISSN:2349-5162, Vol.5, Issue 5, page no.1299-1303, May-2018,
Available :http://www.jetir.org/papers/JETIR1805954.pdf
[14] K. Thomas, C. Grier, J. Ma, V. Paxson, and D. Song, Design and evaluation of a real-time URL spam filtering service, in Proc.
IEEE Symp. Sec. Privacy, 2011, pp. 447–462.
[15] Manishaben Jaiswal “ SOFTWARE ARCHITECTURE AND SOFTWARE DESIGN” International Research Journal of
Engineering and Technology (IRJET) e-ISSN: 2395-0056, p-ISSN: 2395-0072, Volume: 06 Issue: 11, s. no -303 , pp. 2452-2454 , Nov
2019 Available at: https://www.irjet.net/archives/V6/i11/IRJET-V6I11303.pdf
[16] Manishaben Jaiswal "RISK ANALYSIS IN INFORMATION TECHNOLOGY" , International Journal of Scientific Research and
Engineering Development (IJSRED) , ISSN:2581-7175, Vol 2-Issue 6, P110, pp. 857-860, November - December 2019 Available
at: http://www.ijsred.com/volume2/issue6/IJSRED-V2I6P110.pdf
[17] Manishaben Jaiswal, Mehul Patel “THE LEARNING ON CRM IN ERP- WITH SPECIAL REFERENCES TO SELECTED
ENGINEERING COMPANIES IN GUJARAT”, International Journal of Management and Humanities Scopus (IJMH) , published
by Blue Eyes Intelligence Engineering & Sciences Publication (BEIESP), ISSN 2394-0913, Volume-4 Issue-8, April 2020, Pg-
117-126,Available At,http://www.ijmh.org/wp-content/uploads/papers/v4i8/H0798044820.pdf
[18] Sudhir Allam, "RESEARCH ON THE SECURE MEDICAL BIG DATA ECOSYSTEM BASED ON HADOOP", International
Journal of Creative Research Thoughts (IJCRT), ISSN:2320-2882, Volume.7, Issue 1, pp.815-819, March 2019, Available at
:http://www.ijcrt.org/papers/IJCRT1133997.pdf
[19] Ishaq Azhar Mohammed. (2019). A SYSTEMATIC LITERATURE MAPPING ON SECURE IDENTITY MANAGEMENT
USING BLOCKCHAIN TECHNOLOGY. International Journal of Innovations in Engineering Research and Technology, 6(5),
86–91. Retrieved from https://repo.ijiert.org/index.php/ijiert/article/view/2798
[20] X. Jin, C. X. Lin, J. Luo, and J. Han, Socialspamguard: A data mining based spam detection system for social media networks,
PVLDB, vol. 4, no. 12, pp. 1458–1461, 2011.
[21] Surya Lakshmisri, "SOFTWARE AS A SERVICE IN CLOUD COMPUTING", International Journal of Creative Research
Thoughts (IJCRT), ISSN:2320-2882, Volume.7, Issue 4, pp.182-186, December 2019, Available at
:http://www.ijcrt.org/papers/IJCRT1133471.pdf
[22] S. Ghosh et al., Understanding and combating link farming in the Twitter social network, in Proc. 21st Int. Conf. World Wide
Web, 2012, pp. 61–70.
[23] H. Costa, F. Benevenuto, and L. H. C. Merschmann, Detecting tip spam in location-based social networks, in Proc. 28th Annu.
ACM Symp. Appl. Comput., 2013, pp. 724–729.
[24] M. Tsikerdekis, Identity deception prevention using common contribution network data, IEEE Transactions on Information
[25] Ishaq Azhar Mohammed, "RISK-BASED ACCESS CONTROL MODEL: A SYSTEMATIC LITERATURE REVIEW",
International Journal of Creative Research Thoughts (IJCRT), ISSN:2320-2882, Volume.7, Issue 2, pp.794-797, May 2019,
Available at :http://www.ijcrt.org/papers/IJCRT1134133.pdf
[26] T. Anwar and M. Abulaish, Ranking radically influential web forum users, IEEE Transactions on Information Forensics and
Security, vol. 10, no. 6, pp. 1289–1298, 2015.
[27] Y. Boshmaf, I. Muslukhov, K. Beznosov, and M. Ripeanu, Design and analysis of social botnet, Computer Networks, vol. 57, no.
2, pp. 556–578, 2013.
[28] Ravi Teja Yarlagadda, "Understanding DevOps & bridging the gap from continuous integration to continuous delivery",
International Journal of Emerging Technologies and Innovative Research (www.jetir.org), ISSN:2349-5162, Vol.5, Issue 2, page
no.1420-1424, February-2018, Available :http://www.jetir.org/papers/JETIR1802284.pdf
[29] D. Fletcher, A brief history of spam, TIME, Tech. Rep., 2009.
[30] Y. Boshmaf, M. Ripeanu, K. Beznosov, and E. Santos-Neto, Thwarting fake osn accounts by predicting their victims, in Proc.
AISec., Denver, 2015, pp. 81–89.
[31] Ishaq Azhar Mohammed, "Artificial Intelligence for Caregivers of Persons with Alzheimer’s Disease and Related Dementias:
Systematic Literature Review", International Journal of Emerging Technologies and Innovative Research (www.jetir.org | UGC
and issn Approved), ISSN:2349-5162, Vol.6, Issue 1, page no. pp741-744, January-2019, Available at :
http://www.jetir.org/papers/JETIR1901E97.pdf
[32] N. R. Amit A Amleshwaram, S. Yadav, G. Gu, and C. Yang, Cats: Characterizing automation of twitter spammers, in Proc.
[33] Sudhir Allam, "THE FUTURE OF URBAN MODELS IN THE BIG DATA AND AI ERA: A BIBLIOMETRIC ANALYSIS",
International Journal of Creative Research Thoughts (IJCRT), ISSN:2320-2882, Volume.6, Issue 1, pp.797-800, February-2018,
Available at :http://www.ijcrt.org/papers/IJCRT1133993.pdf
[34] K. Lee, J. Caverlee, and S. Webb, Uncovering social spammers: Social honeypots + machine learning, in Proc. SIGIR, Geneva,
2010, pp. 435–442. [18] G. Stringhini, C. Kruegel, and G. Vigna, Detecting spammers on social networks, in Proc. ACSAC,
Austin, Texas, 2010, pp. 1–9.
[35] Lakshmisri Surya,Ravi Teja Yarlagadda, "AI ECONOMICAL SMART DEVICE TO IDENTIFY COVID-19 PANDEMIC, AND
ALERT ON SOCIAL DISTANCING WHO MEASURES", International Journal of Creative Research Thoughts (IJCRT),
ISSN:2320-2882, Volume.8, Issue 5, pp.4152-4156, May 2020, Available at :http://www.ijcrt.org/papers/IJCRT2005556.pdf
[36] Ishaq Azhar Mohammed. (2019). CLOUD IDENTITY AND ACCESS MANAGEMENT – A MODEL PROPOSAL.
International Journal of Innovations in Engineering Research and Technology, 6(10), 1–8. Retrieved from
https://repo.ijiert.org/index.php/ijiert/article/view/2781
[37] H. Yu, M. Kaminsky, P. B. Gibbons, and A. Flaxman, Sybilguard: Defending against sybil attacks via social networks,
IEEE/ACM Transactions on Networking, vol. 16, no. 3, pp. 576–589, 2008.