Detecting rumor patterns in streaming social media

2015 IEEE International Conference on Big Data (Big Data)
978-1-4799-9926-2/15/$31.00 ©2015 IEEE
Shihan Wang
Department of Computational Intelligence
and Systems Science
Tokyo Institute of Technology
Yokohama, Japan
ShihanW@trn.dis.titech.ac.jp
Takao Terano
Department of Computational Intelligence
and Systems Science
Tokyo Institute of Technology
Yokohama, Japan
terano@dis.titech.ac.jp
Abstract—Rumor detection in streaming social media is a
significant but challenging problem. In this paper, we present
a method to identify rumor patterns in a streaming social
media environment. We first propose patterns that combine
structural and behavioral properties of rumors to distinguish
false rumors from valid news. We then describe a novel
graph-based pattern matching algorithm that detects these
patterns in streaming social media data. In a comparison on
Twitter data of rumors and non-rumors, the selected rumor
patterns capture distinct properties of rumors in short-term
time series.
Keywords-rumor detection; social media; streaming pattern
matching; socioeconomic sustainability
I. INTRODUCTION
As microblog platforms such as Twitter and Sina Weibo
grow rapidly, social media has become a popular
communication tool in daily life and attracts increasing
attention. Thanks to its tremendous reach, social media
gives organizations and individuals wider opportunities
for collaboration and is considered a new driver of
sustainability. Nevertheless, social media carries not only
valid information but also a vast number of false rumors.
Because information spreads extremely fast and widely, an
online rumor can cause devastating socioeconomic damage
before it is effectively corrected. Rumor detection in online
social media is therefore important for sustainable development.
A rumor is a piece of information or a statement that
cannot be verified as true or false but spreads quickly
from person to person [1]. Recently, many researchers have
focused on automatically detecting rumors and determining
their credibility. However, these approaches analyze and
evaluate a rumor only after it has spread widely, which
leaves an important gap: rumor detection in a real-time
streaming environment. It is essential to discover a rumor
directly from online social media streams before it causes
too much damage.
Rumor detection in streaming social media is very
challenging, not only because the data is massive and noisy
but also because of the streaming environment itself. Most
traditional methods employ classification or clustering
techniques to identify rumors, which limits their use in a
streaming scenario. To address these challenges, we aim to
detect rumors in streaming social media using a pattern
matching approach. In this paper, we focus on discovering
important rumor patterns and detecting them in streaming data.
We make two contributions in this work. First, we present
a group of rumor patterns that combine structural and
behavioral features, which has not previously been done
specifically for the streaming detection environment. Second,
we propose a novel graph-based pattern matching algorithm
designed to identify patterns in real social media streams.
The rest of this paper is organized as follows. Section II
reviews related work on rumor detection. Section III
describes our pattern design and its theoretical basis.
Section IV explains the pattern matching algorithm, and
Section V presents the preliminary experiments and results.
Section VI summarizes the paper and outlines future work.
II. RELATED WORK
While rumors have long been studied in psychology [2],
computer scientists have focused on automatic rumor
detection in online social media only in recent years. Since
research on rumor detection in a streaming environment is
quite limited, this section mainly reviews related work on
traditional offline identification. Regardless of whether the
literature uses Twitter or Sina Weibo as its data source, we
group it by approach: classification-based and pattern
matching-based.
A. Classification-based Rumor Detection
Much of the previous literature treats rumor detection
as a binary classification problem. Researchers use
supervised learning to determine automatically whether a
spreading trending topic is true or false. Because
identifying the credibility of information is complex,
most existing approaches employ various kinds of features
beyond the text of the posts [3].
Castillo et al. [4] first grouped and reviewed several
features that are widely used in rumor detection, including
content-based, user-based, behavior-based, and
propagation-based features. Other works extended these
features with properties specific to their own settings.
Sun et al. [5] and Yang et al. [6] extracted multimedia-based
and location-based features, respectively, to distinguish
rumors on Sina Weibo from ordinary posts. Kwon et al. [7]
first examined temporal characteristics of rumor spreading.
B. Pattern Matching-based Rumor Detection
Ennals et al. [8] used pattern matching techniques to
highlight disputed claims on the web. Their method
automatically searched for lexical patterns that indicate
claims, then filtered the claims with a classifier to produce
a corpus of disputed claims only. Zhao et al. [9], in turn,
identified trending rumors in social media based on patterns
of inquiry phrases. Observing that content features appear
early in the rumor diffusion process, they presented an
approach that clusters only the tweets containing signal
patterns and surfaces controversial events with a high
likelihood of being rumors. While both works acquired
rumor-related patterns, neither used properties beyond the
post text.
Although classification methods built on multiple features
achieve decent detection accuracy, most of these features
become available only after a rumor has already flourished
and been passed on by many users. Such approaches are
therefore impractical in a real-time setting, because rumors
may already have caused serious socioeconomic damage by the
time they are detected and corrected. We expect a rumor
pattern detection method based on pattern matching
techniques to overcome this drawback. Since previous pattern
matching-based research considered only text-related
features, which are not enough for the rumor detection task
[4], we propose to extend social media rumor patterns with
additional kinds of features in this work.
III. RUMOR PATTERN DESIGN
To use patterns to detect rumors from a social media
stream, two important aspects need to be balanced in the
pattern design.
On the one hand, we want the pattern to be as complex
as possible, because combining various features can
contribute to higher rumor detection accuracy. On the
other hand, the streaming environment restricts the
complexity of patterns: a data stream has the one-pass
constraint, which makes iterative calculation difficult and
limits computing and storage capabilities [10].
Therefore, we focus not only on the most influential
features for the rumor detection task but also on properties
that are practical to compute in a streaming process. Two
significant properties are extracted: the propagation
structure and the behavior of users, that is, their opinions
on the target posts. We explain the detailed design and
theoretical basis of both properties below.
A. Structural Design
In their study of information credibility, Castillo et al.
[4] analyzed the impact of different features. They observed
that the graph structure of propagation is among the
features most relevant for detecting non-credible news. This
leads us to consider the propagation structure first in our
patterns.
Figure 1. Frequent-ordered Nontrivial Cascades of Trending Topic
Propagation in Twitter [12] & Sina Weibo [13]
On social media like Twitter, there is no significant
community structure [11], and overall properties of the
graph are hard to measure in streaming data. So, instead of
macro-level measurements, we focus on micro-level cascade
motifs that are representative of the event diffusion
network. Zhou et al. [12] and Fan et al. [13] traced
information propagation in trending microblog topics and
obtained topological features. Figure 1 shows the seven most
frequent nontrivial cascade shapes in both Twitter and
Sina Weibo data.
From the figure, we find that, apart from the basic
two-node shape, T2(S3) and T4(S4) are the most important
structures among the common cascades. First, they are among
the most frequent shapes. Second, all other important
cascades can be decomposed into a set of them.
In a real detection setting, as the social media data
stream keeps arriving, the propagation graph grows from
these basic structures. Based on this observation, we picked
these two subgraphs as the structural features in our pattern.
B. Behavioral Design
Meanwhile, many studies have extracted a behavioral
property, namely how users feel about the target post, and
treated it as a significant signal. We therefore propose to
include this user behavior feature as well.
A study of how information propagated through the
Twitter network after the 2010 Chile earthquake provides
promising support for user opinion analysis [14]. It showed
that user attitude is one clear difference between the
propagation of rumor tweets and that of valid news. In
fact, more negative and doubtful users tend to be involved
in false rumors, while tweets that exhibit a positive
attitude are more closely related to credible information
[14][4]. Other research has also indicated the importance of
question-asking behavior in social media [9].
Figure 2. Two Examples of The Rumor Pattern
Overall, combining the two parts of the design, our rumor
pattern is a labeled graph. Two examples are shown in
Figure 2. The two essential subgraphs are used as the
structural base, and three labels, SUPPORT, DENY and
QUESTION, represent user opinion.
In total, 45 possible patterns are generated. For each
node in a graph pattern, the three possible labels are
enumerated in every position. However, the two children in
a 'Star' pattern are symmetric, which means that swapping
them represents the same propagation information. So, we
consider patterns like {SUPPORT DENY QUESTION} and
{QUESTION DENY SUPPORT} to be the same one.
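The count of 45 can be checked by enumeration. The sketch below is only an illustration: it assumes that the three positions of a path pattern are ordered (3 x 3 x 3 = 27 labelings) and that the two children of a star pattern form an unordered pair (3 roots x 6 pairs = 18 labelings). The tuple layout and the name enumerate_patterns are hypothetical and not part of the original design.

```python
from itertools import combinations_with_replacement, product

LABELS = ("SUPPORT", "DENY", "QUESTION")

def enumerate_patterns():
    """Return every labeled pattern as a (kind, root, a, b) tuple."""
    # Path: three ordered positions, each taking any of the three labels.
    paths = [("Path", r, a, b) for r, a, b in product(LABELS, repeat=3)]
    # Star: the two children are interchangeable, so use unordered pairs.
    stars = [("Star", r, a, b)
             for r in LABELS
             for a, b in combinations_with_replacement(LABELS, 2)]
    return paths + stars

patterns = enumerate_patterns()
print(len(patterns))  # 27 path patterns + 18 star patterns = 45
```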
IV. PATTERN MATCHING ALGORITHM
In this section, we present an algorithm that tracks
matches of the above graph-based rumor patterns in
streaming social media data.
Following the pattern design, a labeled, directed graph is
first extracted from the stream of posts, which is the
original social media data. Each post is pre-processed by
semantic analysis to identify both the user attitude and the
information propagation relationships (such as retweet or
mention). If a post contains a propagation relationship, it
is transformed into an edge. The resulting stream of edges
is the input of the pattern matching algorithm. The
direction of an edge is the direction in which information
spreads, and the label of each node on the edge is the
opinion of its poster. The algorithm processes the data
stream and outputs a list of matched patterns and the times
at which they appear.
We begin by introducing the index data structure used for
dynamic labeled graph pattern search, then present the
algorithm in detail.
A. Relational Index Structure
We first introduce a data structure called the Relational
Index (R-index). The R-index stores the attitude (label)
information related to each node. It contains the label of
the current node as well as the labels of all nodes linked
to it. To save storage space, it keeps the in-degree and
out-degree counts for each kind of label instead of
individual node ids. In our pattern graph, there are three
kinds of labels: SUPPORT, DENY and QUESTION. These counts
provide adequate information to discover the incremental
pattern matches at each step as edges arrive in the stream.
An example of the current basic structure of the R-index is
shown in Figure 3.
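A minimal sketch of such an index is given below. It is one possible reading of the description, not the layout shown in Figure 3: each node keeps its own label plus per-label counters of incoming and outgoing neighbors. The names RIndexEntry, RIndex, add_node and add_edge are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class RIndexEntry:
    label: str                                          # attitude of this node
    in_count: Counter = field(default_factory=Counter)  # label -> number of in-neighbors
    out_count: Counter = field(default_factory=Counter) # label -> number of out-neighbors

class RIndex:
    """Per-node label counters; no individual neighbor ids are stored."""

    def __init__(self):
        self.entries = {}

    def add_node(self, node_id, label):
        # Keep the first label seen for a node (an assumption of this sketch).
        return self.entries.setdefault(node_id, RIndexEntry(label))

    def add_edge(self, start_id, end_id):
        start, end = self.entries[start_id], self.entries[end_id]
        start.out_count[end.label] += 1
        end.in_count[start.label] += 1

index = RIndex()
index.add_node("u1", "DENY")
index.add_node("u2", "QUESTION")
index.add_edge("u1", "u2")
print(index.entries["u1"].out_count["QUESTION"])  # 1
```

Because only counts are kept, the memory per node is bounded by the number of labels rather than by the node degree.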
B. Graph-based pattern matching algorithm
With the R-index structure defined, we describe the
graph-based pattern matching algorithm. We first give some
basic definitions used in the algorithm.
Definition 1. Given a set of labeled nodes
$NT = \{n_1, n_2, n_3, \ldots\}$, each edge consists of two
nodes and the time at which it appears, defined as
$e = \langle n_{start}, n_{end}, time \rangle$. The Edge
Stream is the continual sequence of edges, defined as
$ES = \{e_1, e_2, e_3, \ldots\}$.
Definition 2. Since there are two kinds of pattern
structure, a Pattern is defined in one of the following two
forms:
$p = \{\text{'Star'}, n_{root}.label, n_{left}.label, n_{right}.label\}$
and
$p = \{\text{'Path'}, n_{root}.label, n_{up}.label, n_{down}.label\}$.
A set of patterns is defined as $PT = \{p_1, p_2, p_3, \ldots\}$.
For example, the two patterns in Figure 2 are defined as
$\{\text{'Star'}, n_{root}.label = DENY, n_{left}.label = SUPPORT, n_{right}.label = QUESTION\}$
and
$\{\text{'Path'}, n_{root}.label = SUPPORT, n_{up}.label = SUPPORT, n_{down}.label = DENY\}$,
respectively.
Algorithm 1 matchGraphPattern(ES, PT)
 1: graph G ← ∅
 2: for each e = <n_start, n_end, time> ∈ ES do
 3:   for all n_i ∈ {n_start, n_end} do
 4:     createNodeIfNew(n_i) in G
 5:     for all p_i ∈ PT do
 6:       if n_i.label matches p_i.n_root.label then
 7:         n_root ← n_i
 8:         if e is subgraph of p_i then
 9:           num ← getNumOfNewPattern(n_root, e, p_i)
10:           updateResult(p_i, num, e.time)
11:         end if
12:       end if
13:     end for
14:     updateIndex(n_i)
15:   end for
16: end for
The input to the matchGraphPattern algorithm is an edge
stream $ES$ and a set of query patterns $PT$. For every
incoming edge $e$, any of its nodes that are new to graph
$G$ are first added to the graph (line 4). We then iterate
over every query pattern $p_i$ to identify matches (line 5).
Every node of $e$ that has the same label as the root node
of the given pattern is selected and recorded as the root
node of possible matches (lines 6-7). Next, we use basic
subgraph isomorphism to check whether the new edge is a
subgraph of $p_i$, which is a necessary condition for
further identification (line 8). Because the R-index
maintains all previous label-related information of the root
node $n_{root}$, it is efficient to obtain the number of
nodes that have been linked to $n_{root}$ and match the
other label of $p_i$ (line 9). After that, the algorithm
reports the matches contributed by the new edge $e$ in real
time, in the format $\langle p_i, num, e.time \rangle$ (the
pattern, the number of new matches, and the timestamp)
(line 10). Finally, the R-index of each node of the edge is
updated for future calculation (line 14).
Figure 3. The Format of R-index Structure
An example illustrates the main matching procedure. Given
the star pattern in Figure 2,
$p = \{\text{'Star'}, n_{root}.label = DENY, n_{left}.label = SUPPORT, n_{right}.label = QUESTION\}$,
and a new edge $e = \langle n_{start}, n_{end}, time \rangle$
with $n_{start}.label = DENY$ and $n_{end}.label = SUPPORT$,
we first find that $n_{start}$ is the root node $n_{root}$
and that $e$ is a subgraph of $p$. In the next step, we
proceed to getNumOfNewPattern. As $p$ is of the 'Star' type,
we look for matches of the other part of $p$, which is an
edge with $n_{start}.label = DENY$ and
$n_{end}.label = QUESTION$. We therefore check whether the
out-degree count for the label QUESTION (Question out) in
the R-index of $n_{start}$ is zero. If it is not, the new
edge contributes new matches of $p$. In this way, we capture
the number of new matches and the time at which they are
discovered ($e.time$).
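The procedure can be sketched in Python as follows. This is an illustrative reading of Algorithm 1 under stated assumptions, not the authors' implementation: each edge arrives as a tuple (start, start_label, end, end_label, time); a pattern is a tuple (kind, root_label, a, b); in a 'Star' pattern the root points to two children labeled a and b; and in a 'Path' pattern the root is assumed to be the middle node, with a the upstream label and b the downstream label. The function names and the use of per-label counters in place of the full R-index are hypothetical.

```python
from collections import Counter, defaultdict

def count_new(kind, a, b, root_is_source, other_label, in_cnt, out_cnt):
    """Number of new matches that one edge adds for one candidate root."""
    if kind == "Star":
        if not root_is_source:      # a star root only has outgoing edges
            return 0
        if a == b:                  # both children carry the same label
            return out_cnt[a] if other_label == a else 0
        if other_label == a:        # new child fills slot a, pair it with b children
            return out_cnt[b]
        if other_label == b:
            return out_cnt[a]
        return 0
    # Path, assumed shape: up(a) -> root -> down(b)
    if root_is_source:              # new edge is root -> down
        return in_cnt[a] if other_label == b else 0
    return out_cnt[b] if other_label == a else 0   # new edge is up -> root

def match_graph_pattern(edge_stream, patterns):
    in_cnt = defaultdict(Counter)   # node -> label -> number of in-neighbors
    out_cnt = defaultdict(Counter)  # node -> label -> number of out-neighbors
    results = []
    for start, start_label, end, end_label, time in edge_stream:
        for node, label, other_label, is_source in (
            (start, start_label, end_label, True),
            (end, end_label, start_label, False),
        ):
            for pattern in patterns:
                kind, root_label, a, b = pattern
                if label != root_label:
                    continue
                num = count_new(kind, a, b, is_source, other_label,
                                in_cnt[node], out_cnt[node])
                if num:
                    results.append((pattern, num, time))
            # Update this node's counters only after it has been matched.
            if is_source:
                out_cnt[node][other_label] += 1
            else:
                in_cnt[node][other_label] += 1
    return results

star = ("Star", "DENY", "SUPPORT", "QUESTION")
stream = [("u1", "DENY", "u2", "QUESTION", 1),
          ("u1", "DENY", "u3", "SUPPORT", 2)]
print(match_graph_pattern(stream, [star]))
# [(('Star', 'DENY', 'SUPPORT', 'QUESTION'), 1, 2)]
```

The usage lines replay the worked example: the second edge completes one star match because the root already has one outgoing QUESTION neighbor. The sketch does not deduplicate repeated edges between the same pair of nodes.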
V. PRELIMINARY EXPERIMENT
In this section, we present a preliminary experiment that
extracts a set of rumor patterns from streaming social media
data and uses them to distinguish false rumors from news
events.
A. Data Set
We used the dataset published by Kwon et al. [7]. It
contains Twitter data on trending topics, which are
separated into false rumors and credible news. The rumor
and non-rumor labels were annotated and validated by the
previous researchers using both investigation websites and
human participants. Because the size of the 109 topics
varies widely (from 10 to 33401 tweets), we selected 5
rumors with a large number of tweets and 5 non-rumors of
similar size. The selected topics contain around 5000 tweets
on average, and the smallest has more than 2000 tweets.
B. Data Pre-process and Visualization
After sorting the tweets of each topic by timestamp,
we first turned every group of tweets into a stream of
posts, which mimics data arriving in real time. Then, for
each topic, we counted the tweet frequency per hour. The
previous work investigated tweet frequency per day and
presented bursty fluctuations over 60 days [7]. Because
those temporal features usually span days or even weeks,
they are of limited use for real-time streaming detection.
We therefore focus on hour-based frequency, since such a
short-term temporal property can be captured even in
streaming analysis. Figure 4 shows this frequency of tweets
over time for both non-rumors and rumors.
Figure 4. Tweet Frequency of Valid News and False Rumors in Short-term
Series
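The hourly counting step can be sketched as below. This is only an illustration; the function name hourly_frequency and the sample timestamps are hypothetical and do not come from the dataset.

```python
from collections import Counter
from datetime import datetime

def hourly_frequency(timestamps):
    """Count tweets per one-hour bucket, returned in time order."""
    buckets = Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )
    return sorted(buckets.items())

# Hypothetical timestamps, for illustration only.
sample = [
    datetime(2009, 4, 27, 10, 5),
    datetime(2009, 4, 27, 10, 59),
    datetime(2009, 4, 27, 12, 0),
]
for hour, count in hourly_frequency(sample):
    print(hour, count)
```

Hours without any tweet are simply absent from the result; a plotting step would need to fill them with zero.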
In each plot of Figure 4, the x-axis represents time in
units of one hour, and the y-axis represents the number of
tweets posted in each unit. We observed that valid news
generally shows dramatic fluctuations, while rumors usually
have one sharp peak. This indicates that rumors and valid
information commonly differ from each other even in
short-term time series. Based on this difference, it is
possible to identify a kind of pattern that distinguishes
them in a stream.
C. Tweet Semantic Analysis
Given the stream of posts, we next analyzed the semantic
information of each post. To prepare the data for
graph-based pattern detection, we began by extracting the
propagation relationships within tweets, then proceeded to
analyze the user behavior feature of each tweet.
According to the official Twitter APIs¹, the retweet,
mention and reply information is provided for each
individual tweet. For example, given a tweet $t_i$, the set
of tweets it refers to can be acquired as
$T_i = \{t_m, t_n, t_j, t_k\}$. Among them, we can identify
that $t_i$ retweets $t_m$ and replies to $t_k$.
¹ Source: https://dev.twitter.com/rest/public
Table I
USER OPINION MINING RESULTS OF TRENDING TOPICS

Valid News      CharlieWilsonWar  ChristianTheLion  PalmPre   PregnantMan  TwitterSummize  Average Percentage of All Users
SUPPORT Users   1093              351               863       667          862             32.74%
DENY Users      315               367               322       216          164             11.81%
QUESTION Users  176               342               303       301          183             11.14%

False Rumors    SwinePork         SwineZombie       LadyGaga  Montauk      IphoneNano      Average Percentage of All Users
SUPPORT Users   2469              433               337       177          239             13.86%
DENY Users      4309              1148              474       366          137             24.40%
QUESTION Users  3641              379               985       542          507             22.96%
Then, we captured the linkages within the propagating
information. In this example, the retweet and the reply
imply that information is transferred from $t_m$ to $t_i$
and from $t_i$ to $t_k$, respectively. For the remaining
mentions, the direction of transfer is from $t_i$ to the
mentioned nodes ($t_n$ and $t_j$).
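These direction rules can be written down as a small function. The sketch below is illustrative only: the function name tweet_to_edges and its parameters are hypothetical and do not correspond to fields of the Twitter API.

```python
def tweet_to_edges(tweet_id, retweet_of=None, reply_to=None, mentions=()):
    """Turn one tweet's relationships into directed (source, target) edges."""
    edges = []
    if retweet_of is not None:
        edges.append((retweet_of, tweet_id))   # retweet: t_m -> t_i
    if reply_to is not None:
        edges.append((tweet_id, reply_to))     # reply: t_i -> t_k
    for mentioned in mentions:
        if mentioned not in (retweet_of, reply_to):
            edges.append((tweet_id, mentioned))  # other mentions: t_i -> t_n, t_j
    return edges

print(tweet_to_edges("t_i", retweet_of="t_m", reply_to="t_k",
                     mentions=["t_n", "t_j"]))
# [('t_m', 't_i'), ('t_i', 't_k'), ('t_i', 't_n'), ('t_i', 't_j')]
```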
In our dataset, some retweet information is missing
because some historical tweets have been deleted or hidden.
So, we also took the 'RT' marker in the tweet text into
consideration to identify retweets. For real streaming
tweets, such information is fully provided by the Twitter
Streaming APIs².
In addition, we employed sentiment analysis [15]
techniques to identify user opinion from the tweet content.
We identified positive (SUPPORT) and negative (DENY)
attitudes using the free version of Semantria³.
At the same time, we identified question-asking tweets
using simple lexical patterns based on previous research. We
used the question mark and the 5W1H question words (What,
Why, Who, When, Where and How) as basic patterns, but
required a 5W1H word to appear at the beginning of a
sentence [16]. Another pattern, the regular expression
'is (that|this|it) true' [9], is also included to improve
the precision.
² https://dev.twitter.com/streaming/overview
³ https://semantria.com/
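The lexical rules for question-asking tweets can be sketched with Python's re module. The sketch assumes that a tweet is labeled QUESTION when any one of the three patterns fires; how the patterns are actually combined is not specified here, and the names below are hypothetical.

```python
import re

QUESTION_MARK = re.compile(r"\?")
# A 5W1H word counts only at the start of the text or right after
# sentence-ending punctuation.
W5H1_AT_SENTENCE_START = re.compile(
    r"(?:^|[.!?]\s+)(?:what|why|who|when|where|how)\b", re.IGNORECASE
)
IS_IT_TRUE = re.compile(r"\bis (?:that|this|it) true\b", re.IGNORECASE)

def is_question(text):
    """Return True if any lexical question pattern matches the tweet text."""
    text = text.strip()
    return any(
        pattern.search(text)
        for pattern in (QUESTION_MARK, W5H1_AT_SENTENCE_START, IS_IT_TRUE)
    )

print(is_question("Is it true that pigs spread the flu"))  # True
print(is_question("I know what you did."))                 # False
```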
Besides the three types of opinion, there is still a group
of users who do not show any attitude. We do not consider
them in our behavioral patterns. The identification results
(SUPPORT, DENY and QUESTION) for the ten topics are
summarized in Table I.
Table I shows the number of individual users (by their
posts) identified with each of the three attitudes. We
summarized the average percentages for rumor and non-rumor
topics separately. Overall, about one-third of all users
hold a positive opinion on credible information, roughly
three times as many as those who deny or question it. In
contrast, more users tend to deny and question the
non-credible rumors. This result is consistent with previous
studies [14] and serves as input to the following process.
D. Rumor Pattern Detection
Based on the propagation relationships and user opinions,
we detected rumor patterns in the streaming trending topic
data using the proposed pattern matching algorithm.
In the first step, we iteratively processed the data
stream of every topic to obtain the number of matches and
the times at which they occurred. Then, we analyzed the
matched patterns from both rumors and non-rumors. To
discover distinct patterns that are especially relevant and
important in rumor events, we evaluated them using term
frequency-inverse document frequency (TF-IDF). We expect it
to down-weight patterns that appear frequently in general
and to distinguish rumor patterns from non-rumor ones.
Figure 5. The Correlation Matrix Between Trending Topics and Patterns
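One way to compute such scores is sketched below, treating each topic as a document and each pattern as a term. The exact TF-IDF variant is not stated in the text, so this sketch assumes the common form tf = count / total matches in the topic and idf = log(N / df); the function name pattern_tf_idf and the toy counts are hypothetical.

```python
import math

def pattern_tf_idf(match_counts):
    """match_counts: {topic: {pattern: number of matches}} -> TF-IDF scores."""
    n_topics = len(match_counts)
    # Document frequency: in how many topics does each pattern occur?
    doc_freq = {}
    for counts in match_counts.values():
        for pattern, count in counts.items():
            if count > 0:
                doc_freq[pattern] = doc_freq.get(pattern, 0) + 1
    scores = {}
    for topic, counts in match_counts.items():
        total = sum(counts.values())
        scores[topic] = {
            pattern: (count / total) * math.log(n_topics / doc_freq[pattern])
            for pattern, count in counts.items()
            if count > 0
        }
    return scores

# Toy counts, for illustration only.
toy = {"rumorA": {"P1": 3, "P2": 1}, "newsB": {"P2": 4}}
scores = pattern_tf_idf(toy)
print(scores["rumorA"]["P1"])  # 0.75 * ln(2)
print(scores["rumorA"]["P2"])  # 0.0, because P2 occurs in every topic
```

Under this variant, a pattern that occurs in every topic receives a score of zero, which is the down-weighting effect described above.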
In Figure 5, a large matrix shows the correlations between
the 10 topics and the 45 patterns, with the 5 valid news
topics on the upper side and the rumors on the lower side.
A larger TF-IDF value corresponds to a darker gray cell.
Interesting patterns can be observed in Figure 5. For
example, several patterns such as
'PATH:DENY QUESTION QUESTION' (marked in green) appeared
only in rumors. The pattern 'STAR:DENY QUESTION QUESTION'
(marked in orange) appears in the majority of rumors and
also shows a higher TF-IDF in rumors. This indicates that it
is possible to identify patterns that are either unique to
or more relevant in rumor events.
Next, we further evaluated the TF-IDF values of the
patterns and selected a set of important rumor patterns
whose average TF-IDF in rumors is more than 10 times that
in non-rumors. Table II lists the selected patterns. Among
them, the top three appeared only in the false rumors, while
the others show closer correlations with rumors than with
non-rumors.
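The selection rule can be sketched as follows. The sketch assumes that a pattern missing from a topic contributes a score of zero to the average, so a pattern that never occurs in non-rumors is selected as soon as it occurs in any rumor. The function name select_rumor_patterns and the toy scores are hypothetical.

```python
def select_rumor_patterns(scores, rumor_topics, news_topics, ratio=10.0):
    """Keep patterns whose mean score in rumors exceeds ratio x the mean in news."""
    def average(pattern, topics):
        return sum(scores[t].get(pattern, 0.0) for t in topics) / len(topics)

    all_patterns = {p for topic_scores in scores.values() for p in topic_scores}
    selected = []
    for pattern in sorted(all_patterns):
        rumor_avg = average(pattern, rumor_topics)
        news_avg = average(pattern, news_topics)
        if rumor_avg > ratio * news_avg:
            selected.append(pattern)
    return selected

# Toy TF-IDF scores, for illustration only.
toy_scores = {
    "r1": {"A": 0.5, "B": 0.2},
    "r2": {"A": 0.3},
    "n1": {"B": 0.1},
    "n2": {},
}
print(select_rumor_patterns(toy_scores, ["r1", "r2"], ["n1", "n2"]))  # ['A']
```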
Using the set of selected rumor patterns, we calculated
the pattern frequency over time with the same one-hour
interval as in the previous tweet frequency analysis. The
comparison for valid news and false rumors is shown in
Figure 6.
Comparing the left and right parts of Figure 6, we observed
a clear difference between rumors and non-rumors. In
general, the temporal frequency of the selected patterns
matches the bursts of tweets in rumors very well. However,
the patterns appear rarely in the credible news: they do
not appear at all in two events and do not follow the shape
of the tweet frequency in the other events. This result
indicates that the acquired patterns represent significant
properties of rumor events and are capable of distinguishing
rumors from non-rumors. It suggests good potential for using
our proposed patterns to detect rumors in streaming social
media.
Table II
A LIST OF SELECTED IMPORTANT RUMOR PATTERNS
PATH:DENY DENY QUESTION
PATH:DENY QUESTION QUESTION
PATH:DENY SUPPORT QUESTION
PATH:DENY DENY SUPPORT
PATH:DENY QUESTION DENY
PATH:DENY QUESTION SUPPORT
PATH:DENY SUPPORT DENY
PATH:QUESTION DENY DENY
PATH:QUESTION SUPPORT SUPPORT
PATH:SUPPORT DENY QUESTION
PATH:SUPPORT QUESTION QUESTION
STAR:QUESTION QUESTION QUESTION
STAR:DENY DENY QUESTION
STAR:DENY QUESTION QUESTION
STAR:DENY SUPPORT DENY
STAR:DENY SUPPORT QUESTION
VI. CONCLUSION AND FUTURE WORK
In this paper, we have addressed the streaming rumor
detection problem by detecting rumor patterns in social
media data streams. First, we extended previous work by
combining properties of propagation structure and user
behavior in the rumor pattern design. Second, we applied our
proposed algorithm directly to streaming datasets of both
valid news and false rumors. We identified a set of distinct
rumor patterns that differentiate rumors from non-rumors.
The short-term temporal frequency of the selected patterns
matched the trend of rumor-related tweets very well, which
indicates good potential for using this approach to detect
rumors in real-time social media streams.
As for future work, we first plan further evaluations to
identify more correlations within the tweets. In addition,
we would like to extend this rumor pattern matching approach
to detect rumors in real-time social media streams. A
topic-based filtering and monitoring tool will be explored
and combined with our method, so that it can be evaluated on
real-time streaming social media datasets.
Figure 6. Frequency Comparison between Tweet and Pattern of Valid News (left) and False Rumors (right) in Short-term Series
ACKNOWLEDGMENT
We thank the research team from KAIST for providing the
Twitter dataset, as well as Ji Qi for supporting us with the
matrix visualization tool.
REFERENCES
[1] G. W. Allport and L. Postman, “The psychology of rumor,” 1947.
[2] R. H. Knapp, “A psychology of rumor,” Public Opinion
Quarterly, vol. 8, no. 1, pp. 22–37, 1944.
[3] V. Qazvinian, E. Rosengren, D. R. Radev, and Q. Mei,
“Rumor has it: Identifying misinformation in microblogs,” in
Proceedings of the Conference on Empirical Methods in
Natural Language Processing. Association for Computational
Linguistics, 2011, pp. 1589–1599.
[4] C. Castillo, M. Mendoza, and B. Poblete, “Information
credibility on twitter,” in Proceedings of the 20th international
conference on World wide web. ACM, 2011, pp. 675–684.
[5] S. Sun, H. Liu, J. He, and X. Du, “Detecting event rumors
on sina weibo automatically,” in Web Technologies and
Applications. Springer, 2013, pp. 120–131.
[6] F. Yang, Y. Liu, X. Yu, and M. Yang, “Automatic detection of
rumor on sina weibo,” in Proceedings of the ACM SIGKDD
Workshop on Mining Data Semantics. ACM, 2012, p. 13.
[7] S. Kwon, M. Cha, K. Jung, W. Chen, and Y. Wang,
“Prominent features of rumor propagation in online social media,”
in Data Mining (ICDM), 2013 IEEE 13th International
Conference on. IEEE, 2013, pp. 1103–1108.
[8] R. Ennals, D. Byler, J. M. Agosta, and B. Rosario, “What is
disputed on the web?” in Proceedings of the 4th workshop
on Information credibility. ACM, 2010, pp. 67–74.
[9] Z. Zhao, P. Resnick, and Q. Mei, “Enquiring minds: Early
detection of rumors in social media from enquiry posts,” in
Proceedings of the 24th International Conference on World
Wide Web. International World Wide Web Conferences
Steering Committee, 2015, pp. 1395–1405.
[10] J. Zhang, “A survey on streaming algorithms for massive
graphs,” in Managing and Mining Graph Data. Springer,
2010, pp. 393–420.
[11] O. Aarts, P.-P. van Maanen, T. Ouboter, and J. M. Schraagen,
“Online social behavior in twitter: A literature review,” in
Data Mining Workshops (ICDMW), 2012 IEEE 12th
International Conference on. IEEE, 2012, pp. 739–746.
[12] Z. Zhou, R. Bandari, J. Kong, H. Qian, and V. Roychowdhury,
“Information resonance on twitter: watching iran,” in
Proceedings of the first workshop on social media analytics.
ACM, 2010, pp. 123–131.
[13] P. Fan, P. Li, Z. Jiang, W. Li, and H. Wang, “Measurement
and analysis of topology and information propagation on
sina-microblog,” in Intelligence and Security Informatics (ISI),
2011 IEEE International Conference on. IEEE, 2011, pp.
396–401.
[14] M. Mendoza, B. Poblete, and C. Castillo, “Twitter under
crisis: Can we trust what we rt?” in Proceedings of the first
workshop on social media analytics. ACM, 2010, pp. 71–79.
[15] B. Pang and L. Lee, “Opinion mining and sentiment analysis,”
Foundations and trends in information retrieval, vol. 2, no.
1-2, pp. 1–135, 2008.
[16] B. Li, X. Si, M. R. Lyu, I. King, and E. Y. Chang, “Question
identification on twitter,” in Proceedings of the 20th ACM
international conference on Information and knowledge
management. ACM, 2011, pp. 2477–2480.
... Graph [194] News propagation Structural (statistical feature) [190] Att+Dee+Hom (graph embedding) [10] Social media interaction Structural (statistical feature) Node [212,244] Attributed (statistical feature) [156,157] Graph [82] Att+Sha+Het (graph embedding) ...
... A credibility propagation method is applied to obtain the credibility of the news. Other methods modeled behaviors between user and news through capturing the news propagation patterns [10,190,244]. ...
... They pre-defined or automatically learned fake news propagation patterns [10,190]. Recently, to explicitly model the interactions between users and news, user-news interaction graphs were constructed. ...
Preprint
The explosive growth of cyber attacks nowadays, such as malware, spam, and intrusions, caused severe consequences on society. Securing cyberspace has become an utmost concern for organizations and governments. Traditional Machine Learning (ML) based methods are extensively used in detecting cyber threats, but they hardly model the correlations between real-world cyber entities. In recent years, with the proliferation of graph mining techniques, many researchers investigated these techniques for capturing correlations between cyber entities and achieving high performance. It is imperative to summarize existing graph-based cybersecurity solutions to provide a guide for future studies. Therefore, as a key contribution of this paper, we provide a comprehensive review of graph mining for cybersecurity, including an overview of cybersecurity tasks, the typical graph mining techniques, and the general process of applying them to cybersecurity, as well as various solutions for different cybersecurity tasks. For each task, we probe into relevant methods and highlight the graph types, graph approaches, and task levels in their modeling. Furthermore, we collect open datasets and toolkits for graph-based cybersecurity. Finally, we outlook the potential directions of this field for future research.
... Other Methods (spread pattern, statistics, etc.) , (Papanastasiou, 2020), (Wang & Terano, 2015), , (Kumar & Geethakumari, 2014), (Chen et al., 2016) ...
Article
Full-text available
Social media platforms facilitate the sharing of a vast magnitude of information in split seconds among users. However, some false information is also widely spread, generally referred to as “fake news”. This can have major negative impacts on individuals and societies. Unfortunately, people are often not able to correctly identify fake news from truth. Therefore, there is an urgent need to find effective mechanisms to fight fake news on social media. To this end, this paper adapts the Straub Model of Security Action Cycle to the context of combating fake news on social media. It uses the adapted framework to classify the vast literature on fake news to action cycle phases (i.e., deterrence, prevention, detection, and mitigation/remedy). Based on a systematic and inter-disciplinary review of the relevant literature, we analyze the status and challenges in each stage of combating fake news, followed by introducing future research directions. These efforts allow the development of a holistic view of the research frontier on fighting fake news online. We conclude that this is a multidisciplinary issue; and as such, a collaborative effort from different fields is needed to effectively address this problem.
... It presents great challenges on managing and analysing such large volume and high velocity streaming graphs. In practice, real-time analytics over large graph streams has many applications, such as monitoring cyber security attacks and social network opinion tracking [19,28,30]. Recently, the task of query-friendly graph stream summarization (QfGSS for short) has attracted a lot of research efforts [2,6,10,14,29,34]. ...
Article
In massive and rapid graph streams, a useful and important task is to summarize the structure of graph streams in order to enable efficient and effective graph query processing. Although this task has been extensively studied in the literature, we observe that the existing solutions for graph sketches can only answer queries about the current status of the graph stream. In this paper, we aim at designing persistent graph sketches to support graph queries in any given time range in the past. To this end, we first introduce a baseline method by extending an existing graph summarization method. However, our empirical study suggests that the accuracy of the baseline method is unsatisfactory, especially when the query time interval is large. To tackle this issue, we propose a new method, PGSS-BDH, which stores the streaming edges using a set of hierarchically organized hashmaps. When a query arrives, we divide the query time interval into a set of disjoint sub-intervals and return the sum of query results on all sub-intervals as the overall query answer. Observing that PGSS-BDH incurs a space cost linear in the graph stream size, we further propose an advanced method, PGSS-MDC, which uses a set of fixed-size hierarchical counters to store the weight of edges, where the query processing is similar to PGSS-BDH. We theoretically analyze the accuracy of PGSS-BDH and PGSS-MDC. The experimental results on real-life datasets demonstrate that PGSS-MDC can return much more accurate answers than the competitors while consuming comparable query time and much less memory.
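The interval-splitting idea in the abstract above can be illustrated with a small sketch. This is not PGSS-BDH or PGSS-MDC: it stores exact counts rather than a compact sketch, and the class name `DyadicEdgeStore`, the power-of-two block layout, and the integer timestamps are all assumptions made for illustration. It only shows how a time-range query can be answered by summing over disjoint sub-intervals held in hierarchically organized hashmaps.

```python
from collections import defaultdict


class DyadicEdgeStore:
    """Exact edge weights bucketed by power-of-two time blocks (illustrative only)."""

    def __init__(self, levels=16):
        self.levels = levels
        # One hashmap per level; level k groups timestamps into blocks of size 2**k.
        # Key: (block_index, src, dst) -> accumulated weight.
        self.maps = [defaultdict(int) for _ in range(levels)]

    def add(self, t, src, dst, w=1):
        """Record a streaming edge (src, dst) with weight w at integer time t >= 0."""
        for lvl in range(self.levels):
            self.maps[lvl][(t >> lvl, src, dst)] += w

    def _cover(self, start, end):
        """Yield (level, block_index) pairs that exactly tile [start, end], inclusive."""
        t = start
        while t <= end:
            lvl = 0
            # Grow the block while it stays aligned at t and inside the query range.
            while (lvl + 1 < self.levels
                   and t % (1 << (lvl + 1)) == 0
                   and t + (1 << (lvl + 1)) - 1 <= end):
                lvl += 1
            yield lvl, t >> lvl
            t += 1 << lvl

    def query(self, src, dst, start, end):
        """Total weight of edge (src, dst) over the time range [start, end]."""
        return sum(self.maps[lvl].get((blk, src, dst), 0)
                   for lvl, blk in self._cover(start, end))
```

For example, after `store.add(3, "a", "b", 2)` and `store.add(5, "a", "b")`, calling `store.query("a", "b", 3, 5)` sums one level-0 block (time 3) and one level-1 block (times 4 to 5). Memory here grows with the stream, which is the cost the abstract says the fixed-size counter variant is meant to avoid.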
... Simultaneously, understanding the process and mode of rumor propagation aids in identifying rumors at an early stage of rumor propagation and achieves a good explanatory effect in detecting the authenticity of rumors. Rumor propagation topology enables us to better abstract rumor propagation patterns and further excavate the correlations between hybrid features via propagation aggregation [39], [40], [41], resulting in good interpretability on the veracity of rumors. Wu et al. [16] calculated the substructural similarity of two propagation trees on Sina Weibo datasets for detection. ...
Preprint
The rapid growth of social media has caused tremendous effects on information propagation, raising extreme challenges in detecting rumors. Existing rumor detection methods typically exploit the reposting propagation of a rumor candidate for detection by regarding all reposts to a rumor candidate as a temporal sequence and learning semantics representations of the repost sequence. However, extracting informative support from the topological structure of propagation and the influence of reposting authors for debunking rumors is crucial, which generally has not been well addressed by existing methods. In this paper, we organize a claim post in circulation as an adhoc event tree, extract event elements, and convert it to bipartite adhoc event trees in terms of both posts and authors, i.e., author tree and post tree. Accordingly, we propose a novel rumor detection model with hierarchical representation on the bipartite adhoc event trees called BAET. Specifically, we introduce word embedding and feature encoder for the author and post tree, respectively, and design a root-aware attention module to perform node representation. Then we adopt the tree-like RNN model to capture the structural correlations and propose a tree-aware attention module to learn tree representation for the author tree and post tree, respectively. Extensive experimental results on two public Twitter datasets demonstrate the effectiveness of BAET in exploring and exploiting the rumor propagation structure and the superior detection performance of BAET over state-of-the-art baseline methods.
Chapter
The openness characteristic of social networks facilitates the rapid spread of rumors, necessitating effective methods for detecting and managing the abundance of rumors on social media. Existing studies have primarily focused on improving the accuracy of rumor detection, but often overlook the vital aspects of interpretability and explanation of rumor patterns, limiting their credibility and real-world usability. Additionally, previous works have typically examined only a subset of user features, content, and spreading structure, neglecting the analysis of compound rules. To address these limitations, we propose a novel framework for detecting rumor patterns that emphasize comprehensive feature construction and the explanation of compound rules. Our framework incorporates multi-dimensional features, including user characteristics, post content, and the structure of information propagation. Advanced techniques, including large language models (such as ChatGPT) and graph motif discovery algorithms, are employed for feature construction. By leveraging diverse features, crucial integrated rules identified by Rulefit can investigate the contextually dependent associations among various interrelated rumor factors. We consolidate and analyze seven distinct rumor patterns based on the Credible Early Detection Dataset, deriving valuable insights into the inherent characteristics of rumors. The recognition of rumor patterns empowers social media platforms and fact-checking organizations to develop targeted and explainable interventions that effectively mitigate the spread of rumors and safeguard the integrity of information. These interventions greatly enhance the transparency and trustworthiness of rumor management, fostering a more reliable information ecosystem.
Article
Social media has gradually become the main medium for news transmission. Rumors and real information are mixed on social platforms, which will have certain impact on social order and public psychology. To solve this problem, many fake news detection models based on content and propagation path have been proposed. However, most previous methods do not consider the emotional information contained in the news. Therefore, we propose a novel framework for detecting fake news, which leverages graph neural network to jointly model the content, emotional information and propagation structure of news conversations. Also, in order to use emotion to amplify the spread of fake news, we propose an edge-aware method to enhance the news graph representation. The experimental results indicate that our model achieves state-of-the-art performance on various fake news detection tasks.
Article
Public sentiment can impact the implementation of public policies and even cause policy failure if public support is not received. Therefore, knowledge of public sentiment concerning new and emerging policies is critical for policymakers. During the coronavirus disease 2019 (COVID-19) pandemic, several precautionary measures have been suggested in an attempt to delay or mitigate the spread of the virus. This study presents a framework that applies natural language processing (NLP) techniques, such as sentiment and bigram analyses, to characterize the public sentiment on three prominent mitigation measures (mask wearing, social distancing, and quarantine) as shared by Twitter users in the United States. As part of the framework, we apply a bigram graph-based approach to visualize the most frequent topics in Twitter discussions during the COVID-19 pandemic. The objective is to provide insights into the most commonly discussed topics among Twitter users with similar demographic characteristics (e.g., age and gender). The sentiment and bigram analyses identified the most frequently discussed topics expressing both positive and negative sentiments among different age and gender groups. Discussions containing positive sentiment prevailed and revolved around the benefits of the measures and trust in the government, while the topics of negative sentiment involved conspiracy theories, skepticism, and distrust of government mandates. It is also notable that the discussions among people 19–29 and over 40 years old focus on government officials and political parties, benefits or inefficiency of mitigation measures, and conspiracy theories more often than other demographic groups. Our proposed approaches and results offer a novel and potentially valuable contribution to public policymakers.
Conference Paper
Sina Weibo has become one of the most popular social networks in China. In the meantime, it has also become a good place to spread various kinds of spam. Unlike previous studies on detecting spam such as ads, pornographic messages and phishing, we focus on identifying event rumors (rumors about social events), which are more harmful than other kinds of spam, especially in China. To detect event rumors among enormous numbers of posts, we studied the characteristics of event rumors and extracted features which can distinguish rumors from ordinary posts. The experiments conducted on a real dataset show that the new features are effective in improving the rumor classifier. Further analysis of the event rumors reveals that they can be classified into 4 different types. We propose an approach for detecting one major type, text-picture unmatched event rumors. The experiment demonstrates that this approach performs well.
Conference Paper
The problem of identifying rumors is of practical importance, especially in online social networks, since information can diffuse more rapidly and widely there than offline. In this paper, we identify characteristics of rumors by examining the following three aspects of diffusion: temporal, structural, and linguistic. For the temporal characteristics, we propose a new periodic time series model that considers daily and external shock cycles, where the model demonstrates that rumors are likely to fluctuate over time. We also identify key structural and linguistic differences in the spread of rumors and non-rumors. Our selected features classify rumors with high precision and recall in the range of 87% to 92%, which is higher than other state-of-the-art methods for rumor classification.
Conference Paper
This literature review is aimed at examining state of the art research in the field of online social networks. The goal is to identify the current challenges within this area of research, given the questions raised in society. In this review we pay attention to three aspects of social networks: actor, message, and network characteristics. We further limit our review to research based on Twitter data, because this online social network is the most widely used by researchers in the field.
Conference Paper
In this article we explore the behavior of Twitter users under an emergency situation. In particular, we analyze the activity related to the 2010 earthquake in Chile and characterize Twitter in the hours and days following this disaster. Furthermore, we perform a preliminary study of certain social phenomena, such as the dissemination of false rumors and confirmed news. We analyze how this information propagated through the Twitter network, with the purpose of assessing the reliability of Twitter as an information source under extreme circumstances. Our analysis shows that the propagation of tweets that correspond to rumors differs from tweets that spread news because rumors tend to be questioned more than news by the Twitter community. This result shows that it is possible to detect rumors by using aggregate analysis on tweets.
Article
Twitter has undoubtedly caught the attention of both the general public and academia as a microblogging service worthy of study and attention. Twitter has several features that set it apart from other social media/networking sites, including its 140 character limit on each user's message (tweet), and the unique combination of avenues via which information is shared: the directed social network of friends and followers, where messages posted by a user are broadcast to all its followers, and the public timeline, which provides real time access to posts or tweets on specific topics for everyone. While the character limit plays a role in shaping the type of messages that are posted and shared, the dual mode of sharing information (public vs posts to one's followers) provides multiple pathways in which a posting can propagate through the user landscape via forwarding or "Retweets", leading us to ask the following questions: How does a message resonate and spread widely among the users on Twitter, and are the resulting cascade dynamics different due to the unique features of Twitter? What role does content of a message play in its popularity? Realizing that tweet content would play a major role in the information propagation dynamics (as borne out by the empirical results reported in this paper), we focused on patterns of information propagation on Twitter by observing the sharing and reposting of messages around a specific topic, i.e. the Iranian election. We know that during the 2009 post-election protests in Iran, Twitter and its large community of users played an important role in disseminating news, images, and videos worldwide and in documenting the events. We collected tweets of more than 20 million publicly accessible users on Twitter and analyzed over three million tweets related to the Iranian election posted by around 500K users during June and July of 2009. Our results provide several key insights into the dynamics of information propagation that are special to Twitter. For example, the tweet cascade size distribution is a power-law with exponent of -2.51 and more than 99% of the cascades have depth less than 3. The exponent is different from what one expects from a branching process (usually used to model information cascades) and so is the shallow depth, implying that the dynamics underlying the cascades are potentially different on Twitter. Similarly, we are able to show that while Twitter's Friends-Followers network structure plays an important role in information propagation through retweets (re-posting of another user's message), the search bar and trending topics on Twitter's front page offer other significant avenues for the spread of information outside the explicit Friends-Followers network. We found that at most 63.7% of all retweets in this case were reposts of someone the user was following directly. We also found that at least 7% of retweets are from the public posts, and potentially more than 30% of retweets are from the public timeline. In the end, we examined the context and content of the kinds of information that gained the attention of users and spread widely on Twitter. Our data indicates that the retweet probabilities are highly content dependent.
Conference Paper
Many previous techniques identify trending topics in social media, even topics that are not pre-defined. We present a technique to identify trending rumors, which we define as topics that include disputed factual claims. Putting aside any attempt to assess whether the rumors are true or false, it is valuable to identify trending rumors as early as possible. It is extremely difficult to accurately classify whether every individual post is or is not making a disputed factual claim. We are able to identify trending rumors by recasting the problem as finding entire clusters of posts whose topic is a disputed factual claim. The key insight is that when there is a rumor, even though most posts do not raise questions about it, there may be a few that do. If we can find signature text phrases that are used by a few people to express skepticism about factual claims and are rarely used to express anything else, we can use those as detectors for rumor clusters. Indeed, we have found a few phrases that seem to be used exactly that way, including: "Is this true?", "Really?", and "What?". Relatively few posts related to any particular rumor use any of these enquiry phrases, but lots of rumor diffusion processes have some posts that do and have them quite early in the diffusion. We have developed a technique based on searching for the enquiry phrases, clustering similar posts together, and then collecting related posts that do not contain these simple phrases. We then rank the clusters by their likelihood of really containing a disputed factual claim. The detector, which searches for the very rare but very informative phrases, combined with clustering and a classifier on the clusters, yields surprisingly good performance. On a typical day of Twitter, about a third of the top 50 clusters were judged to be rumors, a high enough precision that human analysts might be willing to sift through them.
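The pipeline described in the abstract above (search for enquiry phrases, cluster similar posts, then collect related posts without those phrases) can be sketched roughly as follows. Only the three example phrases come from the abstract; the regular expression, the Jaccard word-overlap similarity, the `threshold` value, and all function names are assumptions for illustration, and the final cluster-ranking classifier is omitted.

```python
import re

# The three enquiry phrases quoted in the abstract; the regex form is an assumption.
ENQUIRY = re.compile(r"\b(is this true|really|what)\s*\?", re.IGNORECASE)


def tokens(text):
    """Lowercased word set of a post."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rumor_candidate_clusters(posts, threshold=0.3):
    """Group posts into candidate rumor clusters seeded by enquiry posts."""
    clusters = []  # each cluster: {"tokens": set of words, "posts": list of str}

    # Step 1: cluster only the posts that contain an enquiry phrase.
    for post in posts:
        if not ENQUIRY.search(post):
            continue
        words = tokens(post)
        for cluster in clusters:
            if jaccard(words, cluster["tokens"]) >= threshold:
                cluster["posts"].append(post)
                cluster["tokens"] |= words
                break
        else:
            clusters.append({"tokens": set(words), "posts": [post]})

    # Step 2: attach related posts that do not contain an enquiry phrase.
    for post in posts:
        if ENQUIRY.search(post):
            continue
        words = tokens(post)
        best = max(clusters, key=lambda c: jaccard(words, c["tokens"]), default=None)
        if best is not None and jaccard(words, best["tokens"]) >= threshold:
            best["posts"].append(post)

    return [cluster["posts"] for cluster in clusters]
```

Seeding clusters only from enquiry posts keeps the expensive similarity step small, which matches the abstract's point that the phrases are rare but informative.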
Article
China, we crawled Sina-Microblog for about 3 months and obtained the trace of its topology and topics. Compared with other online social networks, our measurement study shows a number of interesting findings. Our data suggests that the Sina-Microblog network has an apparent small-world effect and scale-free characteristic; in particular, the outdegree distribution appears to have multiple separate power-law regimes with different exponents. We also observe that the overlay graph of Sina-Microblog exhibits an assortative mixing pattern and a weak correlation of indegree and outdegree. Moreover, by constructing the cascades of different topics, our data suggests that the distribution of cascade size follows a power law with a heavy tail and a slope of approximately -2, and that the common motifs of cascades with different topics are very similar: above 93% of them are isolated nodes. In order to find the formative motivity of hot cascades, we find that they always evolve to structures like the 'star pattern' and 'two-polar pattern', which are mainly due to the indegree of participating nodes, and are also correlated with the content of the tweet.
Article
The problem of gauging information credibility on social networks has received considerable attention in recent years. Most previous work has chosen Twitter, the world's largest micro-blogging platform, as the premise of research. In this work, we shift the premise and study the problem of information credibility on Sina Weibo, China's leading micro-blogging service provider. With eight times more users than Twitter, Sina Weibo is more of a Facebook-Twitter hybrid than a pure Twitter clone, and exhibits several important characteristics that distinguish it from Twitter. We collect an extensive set of microblogs which have been confirmed to be false rumors based on information from the official rumor-busting service provided by Sina Weibo. Unlike previous studies on Twitter where the labeling of rumors is done manually by the participants of the experiments, the official nature of this service ensures the high quality of the dataset. We then examine an extensive set of features that can be extracted from the microblogs, and train a classifier to automatically detect the rumors from a mixed set of true information and false information. The experiments show that some of the new features we propose are indeed effective in the classification, and even the features considered in previous studies have different implications with Sina Weibo than with Twitter. To the best of our knowledge, this is the first study on rumor analysis and detection on Sina Weibo.
Chapter
Streaming is an important paradigm for handling massive graphs that are too large to fit in the main memory. In the streaming computational model, algorithms are restricted to use much less space than they would need to store the input. Furthermore, the input is accessed in a sequential fashion, therefore, can be viewed as a stream of data elements. The restriction limits the model and yet, algorithms exist for many graph problems in the streaming model. We survey a set of algorithms that compute graph statistics, matching and distance in a graph, and random walks. These are basic graph problems and the algorithms that compute them may be used as building blocks in graph-data management and mining.
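As a concrete instance of the streaming model described above, the sketch below shows the textbook one-pass greedy maximal matching: edges are read once, in order, and only the set of matched vertices is kept in memory rather than the full edge list. It is offered as a standard illustration and is not necessarily one of the algorithms covered in the survey; the function name is hypothetical.

```python
def greedy_streaming_matching(edge_stream):
    """One pass over (u, v) edges; keeps an edge iff both endpoints are still free."""
    matched = set()   # vertices already used by the matching
    matching = []     # edges selected so far
    for u, v in edge_stream:
        # Skip self-loops and any edge touching an already matched vertex.
        if u != v and u not in matched and v not in matched:
            matched.update((u, v))
            matching.append((u, v))
    return matching
```

The result is a maximal matching, so it contains at least half as many edges as a maximum matching, while the working memory is proportional to the number of vertices rather than the number of edges. The input can be any iterable, including a generator that reads edges lazily from disk or a network feed.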