Detection of Rumor in Social Media

Manan Vohra, Misha Kakkar
Department of CSE
Amity University, Uttar Pradesh
India
Abstract - The proliferation of the internet has made
social media popular, with approximately 2.34 billion
users worldwide. This popularity has made social media
one of the major sources of information, but it has also
made the spread of rumors very easy, and hence the
information on social media carries many false claims.
Most previous work on rumor detection focused on manual
detection, language processing or the construction of a
directed graph. In this paper, we propose a system for
automatic rumor detection on social media. Our system
collects data from social media sites such as Twitter,
preprocesses this data to generate a topic and checks the
topic's veracity. Rumor detection may have practical
applications for journalists, news readers, emergency
services and financial markets, and can help minimize the
spread of false information on social media.
Keywords- Rumor Detection, Social Media, Rumor
Today social media has taken over from the traditional
methods of communication and with this has brought
a change in how information is delivered to a
large audience, making it easy to spread information
within a short period. But the veracity of the
information or news that spreads across social
media is not confirmed, and many times it has
spread chaos among people all over the world.
Today most people get their news from social media
and are exposed to a daily dose of rumors, hoaxes,
conspiracy theories and misleading news, all of which
gets mixed with correct information from honest or
reliable sources, making the truth harder to discern.
Social media has an enormous audience, which makes
false information as likely to go viral as correct
information: people use social media on a daily basis,
and with a few shares or retweets misleading
information reaches many people who share it further,
and the process continues, spreading the rumor to a
wide audience.
Results of various experiments show that most
people tend to trust links or information that their
friends share without even verifying the source of
the information. People also believe the
misinformation they get from the links they click on
social media, for example fake news and ads. A huge
number of people take the bait and open the page,
thereby receiving fake news while the page owner earns
money from the ads on that page, so fake news can
also make money while polluting social media
with falsehood. These sites or pages are commonly
known as clickbait sites, which manufacture hoaxes
to make money from advertisements, while
hyper-partisan sites publish and spread rumors on
social media to influence public opinion.
Social media sites like Facebook and Twitter lack a
tool that can detect whether information being spread
on the site is a rumor and, if necessary, shut it down
as soon as possible before it creates any problem. At
present these giant social media sites depend entirely
on their staff and time to label any information as a
rumor. This could have succeeded 10 or 15 years ago,
when social media usage and the data on it were
smaller, but social media sites now have billions of
users and a huge amount of data growing exponentially,
which makes it really difficult to manually check the
veracity of all the information flowing through social
media.
In June 2016 a fake “Facebook privacy notice”
spread like wildfire on social media, urging users
to copy and paste a particular piece of text on their
Facebook wall, which would supposedly help them retain
the privacy of their profile, such as the things they
share, the photos they upload or their personal
information. It ultimately turned out to be a rumor,
and millions of Facebook users shared it and became
victims of this false claim. A rumor is a piece of
information whose veracity and source are not
confirmed and which can cause harm in many ways.
Thus, in this paper, a rumor detection system is
proposed which determines the authenticity of a piece
of information and classifies it as a rumor or not a
rumor. The paper is organized as follows: section II
discusses the literature review. The problem is defined
in section III. The methodology used to develop the
rumor detection system is explained in section IV.
Results and analysis are presented in section V,
followed by the conclusion in section VI.
Much previous work has tried to address complex
problems such as detecting whether a meme spreading
over social media is true or false [1, 2, 3, 4]. The
“Truthy” system attempted to solve the meme rumor
problem by categorising memes on the basis of their
spread, that is, whether a meme spreads “organically”
or through an “astroturf” campaign, in which it is
spread as a rumor by a single person or organisation
[5, 6].
Recent studies collected images of Hurricane Sandy
from Twitter, which contained both fake and real
images of the hurricane. From these they
randomly selected 5767 tweets and analysed them
using properties of a tweet such as number of friends,
status, followers, content and metadata, on the
basis of which they categorised them as real or fake.
In recent studies Zhao, et al. proposed early detection
of rumors using cue terms like “unconfirmed”,
“not true” or “debunk” in the tweet content to find
out whether it expresses uncertainty, denial or
questioning. These terms capture the hidden
implications of the tweet content, and they categorized
the questioning or uncertain tweets as possible rumors.
They were able to define the temporal traits of rumor
and non-rumor events, but there was no clear-cut
difference between them [8].
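The cue-term idea can be illustrated with a minimal sketch. It is not the
implementation of Zhao, et al.: it uses only the three terms quoted above,
and the names CUE_TERMS and has_cue_term are hypothetical.

```python
import re

# Only the three cue terms quoted in the text; the full list used in the
# cited work is not reproduced here.
CUE_TERMS = ("unconfirmed", "not true", "debunk")
CUE_RE = re.compile("|".join(re.escape(term) for term in CUE_TERMS), re.IGNORECASE)

def has_cue_term(tweet):
    """Return True if the tweet text contains any cue term (case-insensitive)."""
    return CUE_RE.search(tweet) is not None
```

Because the match is a plain substring search, “debunk” also matches
“debunked” and “debunking”.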
In their work Jing Ma et al. presented an RNN-based
rumor detection model. Using Sina Weibo and Twitter
data they developed two microblog datasets and
proposed a method that converts the incoming
stream of microblog posts into continuous
variable-length time series, and presented RNNs with
different kinds of layers and hidden units for
classification [9].
Friggeri, et al. characterized the structure of rumors
that spread on Facebook. They considered the
copying-and-pasting of text as a text post and the
uploading and re-sharing of photos as the two major
technological affordances, and constructed the
near-exact path that a rumor takes on the social
network. Within Facebook they measured the longevity
and replication of rumor instances and analysed
comments with links to a rumor debunking website [10].
However, both [Jing Ma et al., 2016] and [Friggeri, et al.
2014] suffer from delays and limited coverage: they
only work after a rumor has gained the attention of
rumor debunking sites. Our model does not depend on
any rumor debunking site and works to detect a rumor
early.
Social media has an enormous audience, which makes
false information as likely to go viral as correct
information: people use social media on a daily basis,
and with a few shares or retweets misleading
information reaches many people who share it further,
and the process continues, spreading the rumor to a
wide audience. A rumor detection system works in
two phases. In the first phase, it extracts a
piece of information from social networking site(s).
In the second phase, it classifies whether that piece
of information is correct or not.
This paper presents a system which works in the same
two-phase manner. The first phase includes data
collection and data preprocessing, followed by topic
and text extraction. In the second phase, web scraping
and text classification take place.
In this section, we present empirical experiments to
evaluate the proposed method of rumor detection.
Fig.1. Methodology: Dataset collection from social media
-> Data Preprocessing -> Topic Extraction via LDA ->
Text Extraction -> Web scraping for News Extraction ->
News? (no: Rumor; yes: Not a Rumor)
3.1 Data Collection
We started by selecting the social media site from
which the data is to be collected; Twitter was chosen
for this purpose. The Twitter API was used to collect
tweets for a particular hashtag, and only the text
content of each tweet was saved in a text file. The
tweet content is pre-processed to remove URLs,
emoticons, etc. This generated our dataset, which is
unstructured in nature and contains only text content.
3.2 Data pre-processing
The collected data is cleaned by removing URLs,
usernames and punctuation marks. The cleaned data is
then encoded to ASCII so that UTF-8 emoticons and
symbols are removed. Removing emoticons from
the data is essential because they occur frequently,
which tends to make the topic modelling algorithm
generate wrong output. The next preprocessing step is
the removal of stop words. Stop words are words which
do not possess any meaning alone and are just text
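The cleaning steps described in this subsection can be sketched as follows.
This is a minimal illustration under stated assumptions, not the code of the
system: the stop-word list is a tiny hypothetical one (the paper does not
say which list was used), and the name clean_tweet and the regular
expressions are our own.

```python
import re
import string

# Hypothetical, deliberately small stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "on", "for"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"@\w+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_tweet(text):
    """Return the list of cleaned, lower-cased tokens of one tweet."""
    text = URL_RE.sub(" ", text)      # remove URLs before punctuation is stripped
    text = USER_RE.sub(" ", text)     # remove @usernames
    # Encoding to ASCII and ignoring errors drops emoticons and other symbols.
    text = text.encode("ascii", "ignore").decode("ascii")
    text = text.translate(PUNCT_TABLE)  # remove punctuation marks
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]
```

URLs are removed first on purpose: stripping punctuation earlier would break
a URL into ordinary-looking tokens that the URL pattern no longer matches.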
3.3 Keyword Generation
To generate topics from the preprocessed data, topic
modelling is performed via Latent Dirichlet
Allocation (LDA). LDA is one of the best-known topic
modelling algorithms; it takes the text corpus as
input and works on that corpus to generate topic
keywords.
LDA procedure:
1. Traverse all the documents.
2. Assign each word in the document to one of
the k topics specified in the input.
3. This assignment gives word distributions for
all topics and topic representations for all
documents.
4. Improve the topic representation by finding
the probabilities
a. p(topic t | document d)
b. p(word w | topic t)
5. By finding p(topic t’ | document d) * p(word
w | topic t’), reassign a word to a new
topic t’.
We configured LDA to generate one topic with three
keywords. These keywords represent the dataset as a
whole and serve as the query for our news website
scraping step.
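The LDA procedure listed above can be sketched as a small collapsed Gibbs
sampler in plain Python. This is an illustration only: the paper does not
name the LDA implementation it used, and the function name lda_keywords and
the parameters iters, alpha, beta and seed are assumptions introduced here.

```python
import random
from collections import Counter

def lda_keywords(docs, k=1, n_keywords=3, iters=50, alpha=0.1, beta=0.01, seed=0):
    """docs: list of token lists. Returns, per topic, its most frequent words."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Steps 1-3: traverse all documents and give every word a random topic,
    # which yields initial topic-document and word-topic counts.
    assignments = [[rng.randrange(k) for _ in doc] for doc in docs]
    doc_topic = [Counter(z) for z in assignments]   # counts behind p(topic | document)
    topic_word = [Counter() for _ in range(k)]      # counts behind p(word | topic)
    topic_total = [0] * k
    for doc, z in zip(docs, assignments):
        for w, t in zip(doc, z):
            topic_word[t][w] += 1
            topic_total[t] += 1

    # Steps 4-5: re-draw each word's topic with weight
    # p(topic t | document d) * p(word w | topic t).
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                old = assignments[d][i]
                doc_topic[d][old] -= 1
                topic_word[old][w] -= 1
                topic_total[old] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + beta * vocab_size)
                    for t in range(k)
                ]
                new = rng.choices(range(k), weights=weights)[0]
                assignments[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][w] += 1
                topic_total[new] += 1

    return [[w for w, _ in topic_word[t].most_common(n_keywords)] for t in range(k)]
```

With k=1 and n_keywords=3, the configuration described in the text, every
word belongs to the single topic, so the three keywords are simply the three
most frequent words of the cleaned corpus.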
3.4 News website scraping
Most newspapers and news channels have developed
their own news websites to take advantage of the large
audience of the internet, and since then these news
websites have become the major source of credible and
verified news for people all over the world. Their
strong brand recognition and credibility have made
them popular on the internet. These news websites are
trusted because they publish news articles backed by
a credible source or evidence and do not publish
hoaxes or unconfirmed content, so they are considered
credible news sources. For this reason we use these
news sites in our system for detecting rumors.
We chose four trusted news websites:
1. Associated Press
2. The New York Times
3. Yahoo News
We scraped these four news websites using the
keywords generated in the keyword generation step as
the search query. An “AND” search was done with
these keywords so that only articles related to all of
the keywords are returned, i.e. the query is searched
as a whole and not keyword by keyword. Otherwise, if
no articles exist for the query, an “OR” search would
return articles for individual keywords, which is
not good for our system and would generate faulty
results. After scraping these sites for articles, the
links of these articles are saved and displayed in the
GUI.
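The difference between the “AND” and “OR” searches can be shown with a
small sketch. It assumes that headlines and links have already been fetched
from a site; no real site search interface is modelled, and the names
and_match, or_match and search_articles are hypothetical.

```python
def and_match(keywords, text):
    """True only if every keyword occurs as a word in the text."""
    words = set(text.lower().split())
    return all(k.lower() in words for k in keywords)

def or_match(keywords, text):
    """True if at least one keyword occurs as a word in the text."""
    words = set(text.lower().split())
    return any(k.lower() in words for k in keywords)

def search_articles(keywords, articles, match=and_match):
    """articles: iterable of (headline, link) pairs already scraped from a site."""
    return [link for headline, link in articles if match(keywords, headline)]
```

With the keywords hurricane, sandy and shark, a headline that mentions only
a shark is returned by the “OR” search but not by the “AND” search, which
is why the “OR” search would wrongly suggest that a credible source exists.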
3.5 Detection
In the detection module, the veracity of the topic is
checked using the results from the news sites. If,
after scraping, any one of the news sites returns even
one article related to the topic, the topic is
classified as not a rumor, since a credible source for
that topic is present. Otherwise, if the results from
all the news sites are empty, that is, no article
related to the topic is found, we label the topic as a
possible rumor, since no credible source is found.
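The decision rule of this module is simple enough to state as code. The
sketch below assumes the scraping results are available as a mapping from
site name to a list of article links; the function name classify_topic and
the label strings are our own.

```python
def classify_topic(results_by_site):
    """results_by_site: dict mapping a news site name to its list of article links."""
    # One article from any single site is enough to treat the topic as sourced.
    if any(links for links in results_by_site.values()):
        return "not a rumor"
    return "possible rumor"
```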
Fig. 2. GUI Main Menu
Fig. 2 illustrates the main menu of our system, from
which we can open all three modules of the rumor
detection system.
Fig. 3. Data Collection
Fig. 3 illustrates the streaming of tweets from the
Twitter API for the hashtag “trump”.
Fig. 4. Dataset file
Fig. 4 illustrates the generated dataset file, which
contains the tweet text after preprocessing.
Fig. 5. LDA output
Fig. 5 illustrates the output generated by the LDA
algorithm with our dataset as input. Fig. 6 illustrates
the output of the news websites. Here article links are
shown for all four news sites. Fig. 7 illustrates the
labelling of a topic as a rumor on the basis of the
output of the news websites.
Fig. 6. News site scraping output
Fig. 7. Detection output
The system is developed using the Python programming
language. The system was tested on 200 pieces of
information, which included 150 rumor topics and 50
non-rumor topics. To measure the performance of the
proposed system, a confusion matrix was generated, as
shown in Table 1. As the table shows, out of 150
rumor topics our system detected 140 correctly and
10 incorrectly, whereas of the 50 non-rumor topics 44
were correctly detected and 6 incorrectly.
TABLE 1 : Confusion matrix for 200 sample topics
                    Predicted Not Rumor   Predicted Rumor   Total
Actual Not Rumor             44                   6           50
Actual Rumor                 10                 140          150
Total                        54                 146          200
Based on the generated confusion matrix, our system
has an accuracy of 92%, with 96% precision and
70% recall.
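The metrics can be recomputed from the counts in Table 1. The sketch below
assumes that “rumor” is the positive class, which the text does not state;
the function name metrics is our own. Under that assumption the accuracy
(92%) and precision (about 96%) agree with the reported values, while the
recall comes out at about 93% rather than the reported 70%. Taking
“not a rumor” as the positive class instead gives about 81% precision and
88% recall.

```python
def metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# Assumption: rumor is the positive class.
# 140 rumors detected, 10 rumors missed, 6 non-rumors flagged, 44 non-rumors passed.
accuracy, precision, recall = metrics(tp=140, fp=6, fn=10, tn=44)
```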
Social media is a big platform for sharing and
spreading information, as it is easy to use and has
millions of users. But as much as social media is a
boon for us, it also has a disadvantage, which is the
spreading of false information or rumors. Detection of
rumors is of major importance as it will help us to
reduce or stop the spreading of false information
which can cause harm or create disturbance among
So, for detecting rumors we designed a system which
collects posts from a social media site and
pre-processes them to generate a topic-specific
database. We then applied topic modelling to our
dataset to generate three keywords which capture the
meaning of the dataset, that is, what the dataset is
about. These keywords were searched on our four
selected news websites and news articles were extracted
from the results. If no article was found on any of the
four sites, the system labelled that topic as a rumor;
otherwise, if an article was found, it was labelled as
not a rumor.
There is still a possibility of improving the
effectiveness of our rumor detection system, as more
news websites can be added so that a broader search is
provided to the system. Our system is also not limited
to any one social media site: data can be collected
from any social media site, for example Facebook or
LinkedIn, by implementing its API. Also, the more
tweets are collected for a particular hashtag, the
more accurate the topic generated by LDA will be.
