Detecting rumor patterns in streaming social media

October 2015

DOI:10.1109/BigData.2015.7364071

Conference: 2015 IEEE International Conference on Big Data (Big Data)

Authors:

Shihan Wang

Utrecht University

Takao Terano

Chiba University of Commerce

Shihan Wang

Department of Computational Intelligence and Systems Science

Tokyo Institute of Technology

Yokohama, Japan

ShihanW@trn.dis.titech.ac.jp

Takao Terano

Department of Computational Intelligence and Systems Science

Tokyo Institute of Technology

Yokohama, Japan

terano@dis.titech.ac.jp

Abstract: Rumor detection in streaming social media is a significant but challenging problem. In this paper, we present a method to identify rumor patterns in the streaming social media environment. We first propose patterns that combine both structural and behavioral properties of rumors to distinguish false rumors from valid news. We then describe a novel graph-based pattern matching algorithm that detects rumor patterns in streaming social media data. In a comparison on Twitter data of rumors and non-rumors, the selected rumor patterns capture distinct properties of rumors in short-term time series.

Keywords: rumor detection; social media; streaming pattern matching; socioeconomic sustainability

I. INTRODUCTION

As microblog platforms like Twitter and Sina Weibo grow rapidly, social media has become a popular communication tool in daily life and attracts more and more attention. Thanks to its tremendous reach, social media gives organizations and individuals wider opportunities for collaboration and is considered a new driver of sustainability. Nevertheless, social media brings not only valid information but also a vast number of false rumors. Because information spreads extremely fast and wide, an online rumor can cause devastating socioeconomic damage before it is effectively corrected. Rumor detection in online social media is therefore significant for sustainable development.

A rumor is a piece of information or a statement that cannot be verified as true or false but spreads quickly from person to person [1]. Recently, many researchers have focused on automatically detecting rumors and determining their credibility. However, these studies analyze and evaluate a rumor only after it has spread widely, so an important gap remains for rumor detection in the real-time streaming environment. It is essential to discover a rumor directly from online social media streams before it causes too much damage.

Rumor detection in streaming social media is very challenging, not only because the dataset is massive and noisy but also because of the streaming environment itself. Most traditional methods employ classification or clustering techniques to identify rumors, which are of limited use in a streaming scenario. Faced with these challenges, we aim to detect rumors in streaming social media using a pattern matching approach. In this paper, we focus on discovering important rumor patterns and detecting them in a streaming dataset.

We make two contributions in this work. First, we present a group of rumor patterns that combine both structural and behavioral features, which has not previously been done specifically for the streaming detection environment. Second, we propose a novel graph-based pattern matching algorithm designed to identify patterns in real social media streams.

The rest of this paper is organized as follows. Section 2 reviews related work in rumor detection. Section 3 describes our pattern design and its theoretical basis. Section 4 explains the pattern matching algorithm, and Section 5 presents the preliminary experiments and results. Section 6 summarizes the paper and outlines future work.

II. RELATED WORK

While rumors have long been a topic of interest in psychology [2], computer scientists have focused on automatic rumor detection in online social media only in recent years. Since research on rumor detection in the streaming environment is quite limited, this section mainly reviews related work on traditional offline identification. Regardless of whether the literature uses Twitter or Chinese Sina Weibo as its data source, we group it by approach: classification-based and pattern matching-based.

A. Classiﬁcation-based Rumor Detection

Much previous literature has treated rumor detection as a binary classification problem. Researchers used supervised learning to automatically determine whether a spreading trending topic is true or false. Because identifying the credibility of information is complex, most existing approaches employ various kinds of features beyond the text of the posts alone [3].

Castillo et al. [4] first grouped and reviewed several features that are widely used in rumor detection, including content-based, user-based, behavior-based and propagation-based features. Other works extended these features with properties specific to their own setting. Sun et al. [5] and Yang et al. [6] extracted multimedia-based and location-based features, respectively, to distinguish rumors on Sina Weibo from ordinary posts. Kwon et al. [7] were the first to examine temporal characteristics of rumor spreading.

B. Pattern Matching-based Rumor Detection

Ennals et al. [8] used pattern matching techniques to highlight disputed claims on the web. Their method automatically searched for lexical patterns that indicate claims, then filtered the claims with a classifier to produce a corpus of disputed claims only. Zhao et al. [9], on the other hand, identified trending rumors in social media based on patterns of inquiry phrases. Observing that content features appear early in the rumor diffusion process, they presented an approach that clusters only the tweets containing signal patterns and identifies controversial events with a high likelihood of being rumors. While both works acquired rumor-related patterns, neither included properties beyond the post text.

Although classification methods built on multiple features achieve decent detection accuracy, most of these features become available only after a rumor has already flourished and been passed on by many users. It is therefore not practical to use such approaches in a real-time situation, because rumors will already have caused serious socioeconomic damage before they are detected and corrected. We expect a rumor pattern detection method based on pattern matching techniques to overcome this drawback. Previous pattern matching-based research considered only text-related features, which are not sufficient for the rumor detection task [4], so in this work we propose to extend social media rumor patterns with additional aspects.

III. RUMOR PATTERN DESIGN

To use patterns for detecting rumors in social media streams, two important aspects need to be balanced in the pattern design.

On the one hand, we want the pattern to be as complex as possible, because combining various features can contribute to higher accuracy in rumor detection. On the other hand, the streaming environment restricts the complexity of patterns: a data stream has a one-pass constraint, which makes iterative calculation difficult and limits computing and storage capabilities [10].

Therefore, we focus not only on the most influential features for the rumor detection task but also on properties that are practical in stream processing. In total, two significant properties are extracted: the propagation structure and the behavior that expresses users' opinions on target posts. We explain the detailed design and theoretical basis of both properties in the following subsections.

A. Structural Design

In [4], the authors analyzed the impact of different features on information credibility. They observed that the graph structure of propagation is among the most relevant features for detecting non-credible news. This leads us to consider propagation structure first in our patterns.

Figure 1. Frequent-ordered Nontrivial Cascades of Trending Topic Propagation in Twitter [12] & Sina Weibo [13]

On social media like Twitter, there is no important community structure [11], and overall properties of the graph are hard to measure in streaming data. So, instead of macro-level measurements, we focus on micro cascade motifs that are representative of the event diffusion network. Zhou et al. [12] and Fan et al. [13] traced information propagation in trending topics on microblogs and obtained topological features. Figure 1 shows the seven most frequent nontrivial cascade shapes from both Twitter and Sina Weibo data.

According to the figure, apart from the basic two-node shape, T2(S3) and T4(S4) are the most important structures among the common cascades. On the one hand, they are among the most frequent shapes. On the other hand, all the other important cascades can be decomposed into a set of them.

In a real detection situation, as the social media data stream keeps arriving, the propagation graph grows from these basic structures. Based on this observation, we chose these two subgraphs as the structural features in our pattern.

B. Behavioral Design

Meanwhile, many studies have extracted a behavioral property, namely how users feel about the target post, and treated it as a significant signal. We therefore propose to include the users' behavioral feature as well.

A study of how information propagated through the Twitter network after the 2010 Chile earthquake provides promising support for user opinion analysis. It showed that user attitude is one clear difference between the propagation of rumors and that of valid news. In fact, more negative and doubting users tend to be involved in false rumors, while tweets that exhibit an affirmative attitude are more often related to credible information [14][4]. Other research has also indicated the importance of question-asking behavior in social media [9].

Figure 2. Two Examples of The Rumor Pattern

Overall, combining the two parts of the design, our rumor pattern is a labeled graph. Two examples are shown in Figure 2. The two essential subgraphs serve as the structural base, and three labels, SUPPORT, DENY and QUESTION, represent user opinion.

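In total, 45 possible patterns are generated. For each node in a graph pattern, the three possible labels are enumerated in every position, which gives 3 × 3 × 3 = 27 'Path' patterns. The two children in a 'Star' pattern are symmetric, which means that swapping them represents the same propagation information, so we treat patterns like {SUPPORT ← DENY → QUESTION} and {QUESTION ← DENY → SUPPORT} as the same one. This leaves 3 × 6 = 18 'Star' patterns (three root labels times six unordered pairs of child labels), for 27 + 18 = 45 in total.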
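The count of 45 can be checked with a short enumeration. The following is a minimal sketch rather than the authors' code: the tuple layout `(kind, root, a, b)` and the names `LABELS`, `enumerate_patterns` and `PATTERNS` are assumptions made for illustration.

```python
from itertools import combinations_with_replacement, product

LABELS = ("SUPPORT", "DENY", "QUESTION")


def enumerate_patterns():
    """Return all labeled 'Path' and 'Star' patterns as (kind, root, a, b) tuples."""
    patterns = []
    # Path: the three positions are distinct, so label order matters (3^3 = 27).
    for root, up, down in product(LABELS, repeat=3):
        patterns.append(("Path", root, up, down))
    # Star: the two children are symmetric, so only unordered label pairs
    # are kept (3 roots x 6 pairs = 18).
    for root in LABELS:
        for left, right in combinations_with_replacement(LABELS, 2):
            patterns.append(("Star", root, left, right))
    return patterns


PATTERNS = enumerate_patterns()
print(len(PATTERNS))
```

Storing each 'Star' pattern with its child labels in one canonical order means that a matcher never has to treat {SUPPORT ← DENY → QUESTION} and its mirror image as separate queries.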

IV. PATTERN MATCHING ALGORITHM

In this section, we present an algorithm that tracks matches of the graph-based rumor patterns above in streaming social media data.

Following the pattern design, a labeled, directed graph is first extracted from a stream of posts, which is the original social media data. Each post is pre-processed by semantic analysis to determine both the user's attitude and any propagation relationships (such as retweet or mention). If a post contains a propagation relationship, it is transformed into an edge. The resulting stream of edges is the input to the pattern matching algorithm. The direction of an edge is defined by the direction in which information spreads, and the label of each node on the edge is defined by the opinion of its poster. The algorithm processes the data stream and outputs a list of matched patterns together with the times at which they appear.

We begin by introducing the index data structure for dynamic labeled graph pattern search, then present the detailed algorithm.

A. Relational Index Structure

We first introduce a data structure called the Relational Index (R-index). The R-index stores the attitude (label) information related to each node. It contains the label of the current node as well as the labels of all nodes linked to it. To save storage space, the indegree and outdegree are counted per label instead of storing every individual node id. In our pattern graph, there are three kinds of label: SUPPORT, DENY and QUESTION. These counts give us enough information to discover incremental patterns at each step as edges arrive in the stream. An example of the current basic structure of the R-index is shown in Figure 3.
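As an illustration of the idea described above, an R-index entry can be sketched as a node label plus two per-label counters. This is an assumption-laden sketch, not the format shown in Figure 3: the class name `RIndex` and the field names `in_count` and `out_count` are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class RIndex:
    """Per-node summary: the node's own label plus neighbor counts by label."""

    label: str
    # Labels of nodes with an edge pointing to this node (indegree per label).
    in_count: Counter = field(default_factory=Counter)
    # Labels of nodes this node points to (outdegree per label).
    out_count: Counter = field(default_factory=Counter)

    def add_in(self, neighbor_label: str) -> None:
        self.in_count[neighbor_label] += 1

    def add_out(self, neighbor_label: str) -> None:
        self.out_count[neighbor_label] += 1


# A DENY node that has spread to two QUESTION nodes and received from one SUPPORT node.
index = RIndex(label="DENY")
index.add_out("QUESTION")
index.add_out("QUESTION")
index.add_in("SUPPORT")
print(index.out_count["QUESTION"], index.in_count["SUPPORT"])
```

Because only six counters are kept per node (three labels, two directions), the storage per node stays constant no matter how many neighbors the node acquires.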

B. Graph-based pattern matching algorithm

With the R-index structure defined, we describe the graph-based pattern matching algorithm. We first give some basic definitions used in the algorithm.

Definition 1. Given a set of labeled nodes $NT = \{n_1, n_2, n_3, \ldots\}$, each edge contains two nodes and the time at which it appears, defined as $e = \langle n_{start}, n_{end}, time \rangle$. An Edge Stream is a continual sequence of edges, defined as $ES = \{e_1, e_2, e_3, \ldots\}$.

Definition 2. Since there are two kinds of pattern structure, a Pattern is defined in one of two forms: $p = \{\text{'Star'}, n_{root}.label, n_{left}.label, n_{right}.label\}$ and $p = \{\text{'Path'}, n_{root}.label, n_{up}.label, n_{down}.label\}$. A set of patterns is defined as $PT = \{p_1, p_2, p_3, \ldots\}$. For example, the two patterns in Figure 2 are defined as $\{\text{'Star'}, n_{root}.label = \text{DENY}, n_{left}.label = \text{SUPPORT}, n_{right}.label = \text{QUESTION}\}$ and $\{\text{'Path'}, n_{root}.label = \text{SUPPORT}, n_{up}.label = \text{SUPPORT}, n_{down}.label = \text{DENY}\}$, respectively.

Algorithm 1 matchGraphPattern(ES, PT)
 1: graph G ← ∅
 2: for each e = <n_start, n_end, time> ∈ ES do
 3:   for all n_i ∈ {n_start, n_end} do
 4:     createNodeIfNew(n_i) in G
 5:     for all p_i ∈ PT do
 6:       if n_i.label matches p_i.n_root.label then
 7:         n_root ← n_i
 8:         if e is subgraph of p_i then
 9:           num ← getNumOfNewPattern(n_root, e, p_i)
10:           updateResult(p_i, num, e.time)
11:         end if
12:       end if
13:     end for
14:     updateIndex(n_i)
15:   end for
16: end for

The input to the matchGraphPattern algorithm is an edge stream $ES$ and a set of query patterns $PT$. For every incoming edge $e$, any of its nodes that are new to graph $G$ are first added to the graph (line 4). We iterate over every query pattern $p_i$ to identify matches (line 5). Every node of $e$ that has the same label as the root node of the given pattern is selected and recorded as the root node of possible matches (lines 6-7). Next, we use basic subgraph isomorphism to check whether the new edge is a subgraph of $p_i$, which is a necessary condition for further identification (line 8). Because the R-index maintains all previous label-related information for the root node $n_{root}$, it is efficient to obtain the total number of nodes that have been linked to $n_{root}$ and match the other label of $p_i$ (line 9). The algorithm then reports the matches produced by the new edge $e$ in real time, in the format $\langle p_i, num, e.time \rangle$ (the pattern, the number of new matches and the timestamp) (line 10). Finally, the R-index of each node of the edge is updated for future calculation (line 14).

Figure 3. The Format of R-index Structure

An example illustrates the main matching procedure. Consider the star pattern in Figure 2, $p = \{\text{'Star'}, n_{root}.label = \text{DENY}, n_{left}.label = \text{SUPPORT}, n_{right}.label = \text{QUESTION}\}$, and a new edge $e = \langle n_{start}, n_{end}, time \rangle$ with $n_{start}.label = \text{DENY}$ and $n_{end}.label = \text{SUPPORT}$. We first find that $n_{start}$ is the root node $n_{root}$ and that $e$ is a subgraph of $p$. In the next step, we proceed to getNumOfNewPattern. Because $p$ is of the 'Star' type, we look for matches of the other part of $p$, which is an edge with $n_{start}.label = \text{DENY}$ and $n_{end}.label = \text{QUESTION}$. We therefore check whether the outdegree for label QUESTION (Question out) in the R-index of $n_{start}$ is zero. If it is not, the new edge has contributed new matches of $p$. In this way, we capture the number of new patterns and the time at which they are discovered ($e.time$).
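The matching procedure and the worked example can be sketched in Python as follows. This is one possible reading of Algorithm 1 under stated assumptions, not the authors' implementation: the function name `match_graph_pattern`, the `(kind, root, a, b)` pattern tuples, the `labels` lookup table and the node ids are all hypothetical, and the root of a 'Path' pattern is assumed to be its middle node, with `a` the label above it and `b` the label below it.

```python
from collections import Counter, defaultdict


def match_graph_pattern(edge_stream, patterns, labels):
    """Return a list of (pattern, num_new_matches, time) in stream order.

    edge_stream: iterable of (start, end, time) edges between node ids.
    patterns:    iterable of (kind, root, a, b), kind in {"Star", "Path"}.
    labels:      dict mapping node id -> "SUPPORT" | "DENY" | "QUESTION".
    """
    out_count = defaultdict(Counter)  # R-index: successor labels per node
    in_count = defaultdict(Counter)   # R-index: predecessor labels per node
    results = []
    for start, end, time in edge_stream:
        if start == end:
            continue  # self-loops carry no propagation information
        for node in (start, end):
            is_start = node == start
            other_label = labels[end if is_start else start]
            for pattern in patterns:
                kind, root, a, b = pattern
                if labels[node] != root:
                    continue
                num = 0
                if kind == "Star" and is_start:
                    # New edge root -> child: count existing children that
                    # carry the label of the other branch.
                    if other_label == a:
                        num = out_count[node][b]
                    elif other_label == b:
                        num = out_count[node][a]
                elif kind == "Path":
                    if is_start and other_label == b:
                        num = in_count[node][a]   # new edge is root -> down
                    elif not is_start and other_label == a:
                        num = out_count[node][b]  # new edge is up -> root
                if num:
                    results.append((pattern, num, time))
            # Update the R-index only after matching, so the new edge is
            # never paired with itself.
            if is_start:
                out_count[node][other_label] += 1
            else:
                in_count[node][other_label] += 1
    return results


# The star example from the text: a DENY root that already points to a
# QUESTION node receives a new edge to a SUPPORT node.
labels = {"u1": "DENY", "u2": "QUESTION", "u3": "SUPPORT"}
star = ("Star", "DENY", "SUPPORT", "QUESTION")
matches = match_graph_pattern([("u1", "u2", 1), ("u1", "u3", 2)], [star], labels)
print(matches)
```

Each edge is handled with a constant number of counter lookups per pattern, which is what makes the approach compatible with the one-pass constraint of a stream.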

V. PRELIMINARY EXPERIMENT

In this section, we present a preliminary experiment that extracts a set of rumor patterns from streaming social media data and uses them to distinguish false rumors from news events.

A. Data Set

We used the dataset published by Kwon et al. [7]. It contains Twitter data on trending topics, separated into false rumors and credible news. The rumor and non-rumor labels were annotated and evaluated by the previous researchers based on both investigation websites and human participants. Because the size of the 109 topics varies widely (from 10 to 33401 tweets), we selected 5 rumors with a large number of tweets, as well as 5 non-rumors of similar size to the selected rumors. On average, each topic has around 5000 tweets, and the smallest has more than 2000 tweets.

B. Data Pre-process and Visualization

After sorting the tweets of each topic by timestamp, we first processed every group of tweets into a stream of posts, which mimics data arriving in real time. Then, the tweet frequency per hour was counted for each topic in turn. The previous work investigated the daily tweet frequency and reported bursty fluctuations over 60 days [7]. Because the temporal features extracted there usually span days or even weeks, they are of limited use in real-time streaming detection. We therefore focus on hourly frequency, because such a short-term temporal property can be captured even in streaming analysis. Figure 4 shows this frequency of tweets as a time series for both non-rumors and rumors.

Figure 4. Tweet Frequency of Valid News and False Rumors in Short-term Series

In each panel of Figure 4, the x-axis represents time in units of one hour, and the y-axis represents the number of tweets posted in each unit of time. We observed that valid news generally shows dramatic fluctuations, while rumors usually have one sharp peak. This indicates that rumors and valid information commonly differ from each other even in short-term time series. Based on this difference, it should be possible to identify a kind of pattern that distinguishes them in a stream.
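The hourly counting described above can be sketched as follows. The function name `hourly_frequency` and the sample timestamps are made up for illustration, and measuring the bins from the earliest post of a topic is an assumption, since the text does not state where the first bin starts.

```python
from collections import Counter
from datetime import datetime


def hourly_frequency(timestamps):
    """Count posts per one-hour bin, measured from the earliest post."""
    ordered = sorted(timestamps)
    if not ordered:
        return []
    first = ordered[0]
    bins = Counter(int((ts - first).total_seconds() // 3600) for ts in ordered)
    # Fill empty hours with zero so the series has no gaps.
    return [bins.get(hour, 0) for hour in range(max(bins) + 1)]


# Three hypothetical tweet timestamps for one topic.
stamps = [
    datetime(2009, 4, 27, 10, 5),
    datetime(2009, 4, 27, 10, 50),
    datetime(2009, 4, 27, 12, 10),
]
series = hourly_frequency(stamps)
print(series)
```

The same function can be applied to the timestamps of matched patterns, which gives two series with a shared one-hour interval that can be compared directly.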

C. Tweet Semantical Analysis

Given the stream of posts, we next analyzed the semantic information of each post. To prepare the data for graph-based pattern detection, we began by extracting the propagation relationships within tweets, then analyzed the user behavior feature of each tweet.

According to the official Twitter APIs (see footnote 1), retweet, mention and reply information is provided for each individual tweet. For example, given a tweet $t_i$, the set of tweets it refers to can be acquired as $T_i = \{t_m, t_n, t_j, t_k\}$. Among them, we can identify that $t_i$ retweets $t_m$ and replies to $t_k$.

1 Source: https://dev.twitter.com/rest/public

Table I
USER OPINION MINING RESULTS OF TRENDING TOPICS

Valid News      CharlieWilsonWar  ChristianTheLion  PalmPre   PregnantMan  TwitterSummize  Aver Percentage of All Users
SUPPORT Users   1093              351               863       667          862             32.74%
DENY Users      315               367               322       216          164             11.81%
QUESTION Users  176               342               303       301          183             11.14%

False Rumors    SwinePork         SwineZombie       LadyGaga  Montauk      IphoneNano      Aver Percentage of All Users
SUPPORT Users   2469              433               337       177          239             13.86%
DENY Users      4309              1148              474       366          137             24.40%
QUESTION Users  3641              379               985       542          507             22.96%

Then, we captured the linkages along which information propagates. In this example, the retweet and the reply imply that information is transferred from $t_m$ to $t_i$ and from $t_i$ to $t_k$, respectively. For the remaining mentions, the direction of transfer is from $t_i$ to the mentioned nodes ($t_n$ and $t_j$).
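The direction rules above can be sketched as a small conversion from one tweet to a list of directed edges. The dictionary keys (`id`, `time`, `retweet_of`, `reply_to`, `mentions`) and the function name `edges_from_tweet` are hypothetical and do not correspond to actual Twitter API field names.

```python
def edges_from_tweet(tweet):
    """Turn one tweet into directed (source, target, time) propagation edges.

    `tweet` is a plain dict with hypothetical keys: "id", "time",
    "retweet_of", "reply_to" and "mentions" (a list).
    """
    edges = []
    t_i, time = tweet["id"], tweet["time"]
    if tweet.get("retweet_of") is not None:
        # Retweet: information flows from the original tweet to this one.
        edges.append((tweet["retweet_of"], t_i, time))
    if tweet.get("reply_to") is not None:
        # Reply: information flows from this tweet to the one replied to.
        edges.append((t_i, tweet["reply_to"], time))
    for target in tweet.get("mentions", []):
        # Mention: information flows from this tweet to the mentioned node.
        edges.append((t_i, target, time))
    return edges


# The example from the text: t_i retweets t_m, replies to t_k, mentions t_n and t_j.
example = {
    "id": "t_i",
    "time": 7,
    "retweet_of": "t_m",
    "reply_to": "t_k",
    "mentions": ["t_n", "t_j"],
}
example_edges = edges_from_tweet(example)
print(example_edges)
```

A tweet without any of the three relationships yields an empty list and therefore contributes no edge to the stream.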

In our dataset, some retweet information is missing because some historical tweets have been deleted or hidden. We therefore also used the 'RT' marker in the tweet text to identify retweets. In real streaming tweets, this information is fully provided by the Twitter Streaming APIs (see footnote 2).

We also employed sentiment analysis [15] techniques to identify user opinion from tweet content. We collected the positive (SUPPORT) and negative (DENY) attitudes using the free version of Semantria (see footnote 3).

At the same time, we identified question-asking tweets using simple lexical patterns based on previous research. We used the question mark and the 5W1H question words (What, Why, Who, When, Where and How) as basic patterns, but required the 5W1H words to appear at the beginning of a sentence [16]. A further pattern, the regular expression 'is (that|this|it) true' [9], is also included to improve precision.

2 https://dev.twitter.com/streaming/overview

3 https://semantria.com/
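The three lexical cues can be sketched with Python's `re` module. How the cues are combined is not stated in the text, so this sketch assumes that any single cue is enough to label a tweet as QUESTION. The names `WH_START`, `IS_IT_TRUE` and `is_question` are hypothetical, and a sentence start is approximated as the start of the text or the position after '.', '!' or '?'.

```python
import re

# A 5W1H word at the start of a sentence.
WH_START = re.compile(
    r"(?:^|[.!?]\s+)(?:what|why|who|when|where|how)\b", re.IGNORECASE
)
# The inquiry phrase 'is (that|this|it) true'.
IS_IT_TRUE = re.compile(r"\bis (?:that|this|it) true\b", re.IGNORECASE)


def is_question(text):
    """Return True if any of the three lexical cues fires."""
    return (
        "?" in text
        or bool(WH_START.search(text))
        or bool(IS_IT_TRUE.search(text))
    )


print(is_question("Is this true"))
print(is_question("I know who did it."))
```

Restricting the 5W1H words to sentence starts keeps statements such as "I know who did it." from being labeled as questions.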

Besides these three types of opinion, there is a group of users who do not show any attitude. We do not consider them in our behavioral patterns. The identification results (SUPPORT, DENY and QUESTION) for the ten topics are summarized in Table I.

Table I shows the number of individual users (via their posts) identified with each of the three attitudes. We summarized the average percentage of users for rumor and non-rumor topics separately. Overall, about one third of all users express a positive opinion on credible information, roughly three times as many as those who deny or question it. In contrast, more users tend to deny and question the non-credible rumors. This result is consistent with previous studies [14], and the labeled data is used in the following step.

D. Rumor Pattern Detection

Based on the propagation relationships and user opinions, we detected rumor patterns in the streaming trending-topic data using the proposed pattern matching algorithm.

In the first step, we processed the data stream of every topic in turn to generate the number of matches and the times at which they occurred. Then, we analyzed the matched patterns from both rumors and non-rumors. To discover distinct patterns that are especially relevant and important in rumor events, we evaluated them with term frequency-inverse document frequency (TF-IDF). We expect TF-IDF to down-weight patterns that appear frequently in general and to distinguish rumor patterns from non-rumor ones.

Figure 5. The Correlation Matrix Between Trending Topics and Patterns

In Figure 5, a large matrix shows the correlations between the 10 topics and the 45 patterns, with the 5 valid news topics in the upper half and the rumors in the lower half. A larger TF-IDF value corresponds to a darker gray in each cell. Interesting patterns can be observed in Figure 5. For example, several patterns such as 'PATH:DENY QUESTION QUESTION' (marked in green) appear only in rumors. The pattern 'STAR:DENY QUESTION QUESTION' (marked in orange) appears in the majority of rumors and also shows a higher TF-IDF in rumors. These observations indicate that it is possible to identify patterns that are either unique to or more relevant in rumor events.

Next, we further evaluated the TF-IDF values of the patterns and selected a set of important rumor patterns whose average TF-IDF in rumors is more than 10 times larger than in non-rumors. Table II lists the selected patterns. Among them, the top three appear only in the false rumors, while the others are more closely correlated with rumors than with non-rumors.
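The scoring and selection step can be sketched as below, treating each topic as a document and each pattern as a term. The exact TF-IDF variant is not given in the text, so this sketch assumes tf = matches of the pattern divided by all matches in the topic, and idf = ln(number of topics / number of topics containing the pattern). The function names and the toy counts are made up for illustration.

```python
import math


def tf_idf(counts):
    """counts: {topic: {pattern: match_count}} -> {topic: {pattern: score}}."""
    n_topics = len(counts)
    doc_freq = {}
    for matches in counts.values():
        for pattern, c in matches.items():
            if c > 0:
                doc_freq[pattern] = doc_freq.get(pattern, 0) + 1
    scores = {}
    for topic, matches in counts.items():
        total = sum(matches.values())
        scores[topic] = {
            p: (c / total) * math.log(n_topics / doc_freq[p])
            for p, c in matches.items()
            if c > 0
        }
    return scores


def select_rumor_patterns(scores, rumor_topics, news_topics, ratio=10.0):
    """Keep patterns whose mean score over rumors exceeds ratio x the mean over news."""

    def mean(pattern, topics):
        return sum(scores[t].get(pattern, 0.0) for t in topics) / len(topics)

    patterns = {p for topic_scores in scores.values() for p in topic_scores}
    return sorted(
        p for p in patterns if mean(p, rumor_topics) > ratio * mean(p, news_topics)
    )


# Toy match counts for two rumor topics and two news topics.
counts = {
    "rumor_a": {"P1": 4, "P2": 1},
    "rumor_b": {"P1": 2, "P3": 2},
    "news_a": {"P3": 5},
    "news_b": {"P3": 1, "P2": 1},
}
scores = tf_idf(counts)
selected = select_rumor_patterns(scores, ["rumor_a", "rumor_b"], ["news_a", "news_b"])
print(selected)
```

A pattern that never appears in news has a news mean of zero, so it passes the ratio test as soon as it appears in any rumor, which corresponds to the patterns reported as occurring only in false rumors.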

Using the set of selected rumor patterns, we calculated the pattern frequency as a time series with the same one-hour interval as in the earlier tweet frequency analysis. The comparisons for valid news and false rumors are shown in the left and right parts of Figure 6, respectively.

Comparing the left and right parts of Figure 6, we observed a clear difference between rumors and non-rumors. In general, the temporal frequency of the selected patterns matches the trend of tweet bursts in rumors very well. However, the patterns rarely appear in the credible news: they do not appear at all in two events and are not consistent with the shape of the tweet frequency in the other events. This result indicates that the acquired patterns represent significant properties of rumor events and are capable of distinguishing rumors from non-rumors. It shows good potential for using our proposed patterns to detect rumors in streaming social media.

Table II
A LIST OF SELECTED IMPORTANT RUMOR PATTERNS

PATH:DENY DENY QUESTION
PATH:DENY QUESTION QUESTION
PATH:DENY SUPPORT QUESTION
PATH:DENY DENY SUPPORT
PATH:DENY QUESTION DENY
PATH:DENY QUESTION SUPPORT
PATH:DENY SUPPORT DENY
PATH:QUESTION DENY DENY
PATH:QUESTION SUPPORT SUPPORT
PATH:SUPPORT DENY QUESTION
PATH:SUPPORT QUESTION QUESTION
STAR:QUESTION QUESTION QUESTION
STAR:DENY DENY QUESTION
STAR:DENY QUESTION QUESTION
STAR:DENY SUPPORT DENY
STAR:DENY SUPPORT QUESTION

VI. CONCLUSION AND FUTURE WORK

In this paper, we have addressed the streaming rumor detection problem by detecting rumor patterns in social media data streams. First, we extended previous work by combining properties of propagation structure and user behavior in the rumor pattern design. Second, we applied our proposed algorithm directly to streaming datasets of both valid news and false rumors. We identified a set of distinct rumor patterns that differentiate rumors from non-rumors. The short-term temporal frequency of the selected patterns matched the trend of rumor-related tweets very well, which indicates good potential for using this approach to detect rumors in real-time social media streams.

As future work, we first plan further evaluations to specify more correlations within the tweets. In addition, we would like to extend this rumor pattern matching approach to detect rumors in real-time social media streams. A topic-based filtering and monitoring tool will be explored and combined with our method, so that it can be evaluated on real-time streaming social media datasets.

Figure 6. Frequency Comparison between Tweet and Pattern of Valid News (left) and False Rumors (right) in Short-term Series

ACKNOWLEDGMENT

We thank the research team from KAIST for providing the Twitter dataset, as well as Ji Qi for supporting us with the matrix visualization tool.

REFERENCES

[1] G. W. Allport and L. Postman, “The psychology of rumor.” 1947.

[2] R. H. Knapp, “A psychology of rumor,” Public Opinion Quarterly, vol. 8, no. 1, pp. 22–37, 1944.

[3] V. Qazvinian, E. Rosengren, D. R. Radev, and Q. Mei, “Rumor has it: Identifying misinformation in microblogs,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2011, pp. 1589–1599.

[4] C. Castillo, M. Mendoza, and B. Poblete, “Information credibility on twitter,” in Proceedings of the 20th international conference on World wide web. ACM, 2011, pp. 675–684.

[5] S. Sun, H. Liu, J. He, and X. Du, “Detecting event rumors on sina weibo automatically,” in Web Technologies and Applications. Springer, 2013, pp. 120–131.

[6] F. Yang, Y. Liu, X. Yu, and M. Yang, “Automatic detection of rumor on sina weibo,” in Proceedings of the ACM SIGKDD Workshop on Mining Data Semantics. ACM, 2012, p. 13.

[7] S. Kwon, M. Cha, K. Jung, W. Chen, and Y. Wang, “Prominent features of rumor propagation in online social media,” in Data Mining (ICDM), 2013 IEEE 13th International Conference on. IEEE, 2013, pp. 1103–1108.

[8] R. Ennals, D. Byler, J. M. Agosta, and B. Rosario, “What is disputed on the web?” in Proceedings of the 4th workshop on Information credibility. ACM, 2010, pp. 67–74.

[9] Z. Zhao, P. Resnick, and Q. Mei, “Enquiring minds: Early detection of rumors in social media from enquiry posts,” in Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2015, pp. 1395–1405.

[10] J. Zhang, “A survey on streaming algorithms for massive graphs,” in Managing and Mining Graph Data. Springer, 2010, pp. 393–420.

[11] O. Aarts, P.-P. van Maanen, T. Ouboter, and J. M. Schraagen, “Online social behavior in twitter: A literature review,” in Data Mining Workshops (ICDMW), 2012 IEEE 12th International Conference on. IEEE, 2012, pp. 739–746.

[12] Z. Zhou, R. Bandari, J. Kong, H. Qian, and V. Roychowdhury, “Information resonance on twitter: watching iran,” in Proceedings of the first workshop on social media analytics. ACM, 2010, pp. 123–131.

[13] P. Fan, P. Li, Z. Jiang, W. Li, and H. Wang, “Measurement and analysis of topology and information propagation on sina-microblog,” in Intelligence and Security Informatics (ISI), 2011 IEEE International Conference on. IEEE, 2011, pp. 396–401.

[14] M. Mendoza, B. Poblete, and C. Castillo, “Twitter under crisis: Can we trust what we rt?” in Proceedings of the first workshop on social media analytics. ACM, 2010, pp. 71–79.

[15] B. Pang and L. Lee, “Opinion mining and sentiment analysis,” Foundations and trends in information retrieval, vol. 2, no. 1-2, pp. 1–135, 2008.

[16] B. Li, X. Si, M. R. Lyu, I. King, and E. Y. Chang, “Question identification on twitter,” in Proceedings of the 20th ACM international conference on Information and knowledge management. ACM, 2011, pp. 2477–2480.


Preprint

Full-text available

Apr 2023

The rapid growth of social media has caused tremendous effects on information propagation, raising extreme challenges in detecting rumors. Existing rumor detection methods typically exploit the reposting propagation of a rumor candidate for detection by regarding all reposts to a rumor candidate as a temporal sequence and learning semantics representations of the repost sequence. However, extracting informative support from the topological structure of propagation and the influence of reposting authors for debunking rumors is crucial, which generally has not been well addressed by existing methods. In this paper, we organize a claim post in circulation as an adhoc event tree, extract event elements, and convert it to bipartite adhoc event trees in terms of both posts and authors, i.e., author tree and post tree. Accordingly, we propose a novel rumor detection model with hierarchical representation on the bipartite adhoc event trees called BAET. Specifically, we introduce word embedding and feature encoder for the author and post tree, respectively, and design a root-aware attention module to perform node representation. Then we adopt the tree-like RNN model to capture the structural correlations and propose a tree-aware attention module to learn tree representation for the author tree and post tree, respectively. Extensive experimental results on two public Twitter datasets demonstrate the effectiveness of BAET in exploring and exploiting the rumor propagation structure and the superior detection performance of BAET over state-of-the-art baseline methods.

Enhancing Rumour Detection: A Hybrid Deep Learning Approach with ELMo Embeddings & CNN

Conference Paper

Mar 2024

Detecting the Rumor Patterns Integrating Features of User, Content, and the Spreading Structure

Chapter

Apr 2024

The openness characteristic of social networks facilitates the rapid spread of rumors, necessitating effective methods for detecting and managing the abundance of rumors on social media. Existing studies have primarily focused on improving the accuracy of rumor detection, but often overlook the vital aspects of interpretability and explanation of rumor patterns, limiting their credibility and real-world usability. Additionally, previous works have typically examined only a subset of user features, content, and spreading structure, neglecting the analysis of compound rules. To address these limitations, we propose a novel framework for detecting rumor patterns that emphasize comprehensive feature construction and the explanation of compound rules. Our framework incorporates multi-dimensional features, including user characteristics, post content, and the structure of information propagation. Advanced techniques, including large language models (such as ChatGPT) and graph motif discovery algorithms, are employed for feature construction. By leveraging diverse features, crucial integrated rules identified by Rulefit can investigate the contextually dependent associations among various interrelated rumor factors. We consolidate and analyze seven distinct rumor patterns based on the Credible Early Detection Dataset, deriving valuable insights into the inherent characteristics of rumors. The recognition of rumor patterns empowers social media platforms and fact-checking organizations to develop targeted and explainable interventions that effectively mitigate the spread of rumors and safeguard the integrity of information. These interventions greatly enhance the transparency and trustworthiness of rumor management, fostering a more reliable information ecosystem.

An Overview of Fake News Detection: From A New Perspective

Article

Feb 2024

An Emotion-Aware Approach for Fake News Detection

Article

Jan 2023

Social media has gradually become the main medium for news transmission. Rumors and real information are mixed on social platforms, which will have certain impact on social order and public psychology. To solve this problem, many fake news detection models based on content and propagation path have been proposed. However, most previous methods do not consider the emotional information contained in the news. Therefore, we propose a novel framework for detecting fake news, which leverages graph neural network to jointly model the content, emotional information and propagation structure of news conversations. Also, in order to use emotion to amplify the spread of fake news, we propose an edge-aware method to enhance the news graph representation. The experimental results indicate that our model achieves state-of-the-art performance on various fake news detection tasks.

Graph Mining for Cybersecurity: A Survey

Article

Jul 2023

Analysis of Public Sentiment on COVID-19 Mitigation Measures in Social Media in the United States Using Machine Learning

Article

Jan 2022

Public sentiment can impact the implementation of public policies and even cause policy failure if public support is not received. Therefore, knowledge of public sentiment concerning new and emerging policies is critical for policymakers. During the coronavirus disease 2019 (COVID-19) pandemic, several precautionary measures have been suggested in an attempt to delay or mitigate the spread of the virus. This study presents a framework that applies natural language processing (NLP) techniques, such as sentiment and bigram analyses, to characterize the public sentiment on three prominent mitigation measures (mask wearing, social distancing, and quarantine) as shared by Twitter users in the United States. As part of the framework, we apply a bigram graph-based approach to visualize the most frequent topics in Twitter discussions during the COVID-19 pandemic. The objective is to provide insights into the most commonly discussed topics among Twitter users with similar demographic characteristics (e.g., age and gender). The sentiment and bigram analyses identified the most frequently discussed topics expressing both positive and negative sentiments among different age and gender groups. Discussions containing positive sentiment prevailed and revolved around the benefits of the measures and trust in the government, while the topics of negative sentiment involved conspiracy theories, skepticism, and distrust of government mandates. It is also notable that the discussions among people 19–29 and over 40 years old focus on government officials and political parties, benefits or inefficiency of mitigation measures, and conspiracy theories more often than other demographic groups. Our proposed approaches and results offer a novel and potentially valuable contribution to public policymakers.

Detecting Event Rumors on Sina Weibo Automatically

Conference Paper

Full-text available

Apr 2013

Sina Weibo has become one of the most popular social networks in China. At the same time, it has also become a convenient place to spread various kinds of spam. Unlike previous studies on detecting spam such as ads, pornographic messages and phishing, we focus on identifying event rumors (rumors about social events), which are more harmful than other kinds of spam, especially in China. To detect event rumors among enormous numbers of posts, we studied the characteristics of event rumors and extracted features which can distinguish rumors from ordinary posts. The experiments conducted on a real dataset show that the new features are effective in improving the rumor classifier. Further analysis of the event rumors reveals that they can be classified into 4 different types. We propose an approach for detecting one major type, text-picture unmatched event rumors. The experiment demonstrates that this approach performs well.

Prominent Features of Rumor Propagation in Online Social Media

Conference Paper

Full-text available

Dec 2013

The problem of identifying rumors is of practical importance especially in online social networks, since information can diffuse more rapidly and widely than in the offline counterpart. In this paper, we identify characteristics of rumors by examining the following three aspects of diffusion: temporal, structural, and linguistic. For the temporal characteristics, we propose a new periodic time series model that considers daily and external shock cycles, where the model demonstrates that rumors are likely to fluctuate over time. We also identify key structural and linguistic differences in the spread of rumors and non-rumors. Our selected features classify rumors with high precision and recall in the range of 87% to 92%, which is higher than other state-of-the-art methods for rumor classification.

Online Social Behavior in Twitter: A Literature Review

Conference Paper

Full-text available

Dec 2012

This literature review is aimed at examining state of the art research in the field of online social networks. The goal is to identify the current challenges within this area of research, given the questions raised in society. In this review we pay attention to three aspects of social networks: actor, message, and network characteristics. We further limit our review to research based on Twitter data, because this online social network is the most widely used by researchers in the field.

Twitter Under Crisis: Can we trust what we RT?

Conference Paper

Full-text available

Jul 2010

In this article we explore the behavior of Twitter users under an emergency situation. In particular, we analyze the activity related to the 2010 earthquake in Chile and characterize Twitter in the hours and days following this disaster. Furthermore, we perform a preliminary study of certain social phenomena, such as the dissemination of false rumors and confirmed news. We analyze how this information propagated through the Twitter network, with the purpose of assessing the reliability of Twitter as an information source under extreme circumstances. Our analysis shows that the propagation of tweets that correspond to rumors differs from tweets that spread news because rumors tend to be questioned more than news by the Twitter community. This result shows that it is possible to detect rumors by using aggregate analysis on tweets.

Information resonance on Twitter: watching Iran

Article

Full-text available

Aug 2010

Twitter has undoubtedly caught the attention of both the general public and academia as a microblogging service worthy of study and attention. Twitter has several features that set it apart from other social media/networking sites, including its 140 character limit on each user's message (tweet), and the unique combination of avenues via which information is shared: the directed social network of friends and followers, where messages posted by a user are broadcast to all of that user's followers, and the public timeline, which provides real-time access to posts or tweets on specific topics for everyone. While the character limit plays a role in shaping the type of messages that are posted and shared, the dual mode of sharing information (public vs. posts to one's followers) provides multiple pathways in which a posting can propagate through the user landscape via forwarding or "Retweets", leading us to ask the following questions: How does a message resonate and spread widely among the users on Twitter, and are the resulting cascade dynamics different due to the unique features of Twitter? What role does the content of a message play in its popularity? Realizing that tweet content would play a major role in the information propagation dynamics (as borne out by the empirical results reported in this paper), we focused on patterns of information propagation on Twitter by observing the sharing and reposting of messages around a specific topic, i.e. the Iranian election. We know that during the 2009 post-election protests in Iran, Twitter and its large community of users played an important role in disseminating news, images, and videos worldwide and in documenting the events. We collected tweets of more than 20 million publicly accessible users on Twitter and analyzed over three million tweets related to the Iranian election posted by around 500K users during June and July of 2009. Our results provide several key insights into the dynamics of information propagation that are special to Twitter. For example, the tweet cascade size distribution is a power-law with exponent of -2.51 and more than 99% of the cascades have depth less than 3. The exponent is different from what one expects from a branching process (usually used to model information cascades) and so is the shallow depth, implying that the dynamics underlying the cascades are potentially different on Twitter. Similarly, we are able to show that while Twitter's Friends-Followers network structure plays an important role in information propagation through retweets (re-posting of another user's message), the search bar and trending topics on Twitter's front page offer other significant avenues for the spread of information outside the explicit Friends-Followers network. We found that at most 63.7% of all retweets in this case were reposts of someone the user was following directly. We also found that at least 7% of retweets are from the public posts, and potentially more than 30% of retweets are from the public timeline. In the end, we examined the context and content of the kinds of information that gained the attention of users and spread widely on Twitter. Our data indicates that the retweet probabilities are highly content dependent.

Enquiring Minds: Early Detection of Rumors in Social Media from Enquiry Posts

Conference Paper

May 2015

Many previous techniques identify trending topics in social media, even topics that are not pre-defined. We present a technique to identify trending rumors, which we define as topics that include disputed factual claims. Putting aside any attempt to assess whether the rumors are true or false, it is valuable to identify trending rumors as early as possible. It is extremely difficult to accurately classify whether every individual post is or is not making a disputed factual claim. We are able to identify trending rumors by recasting the problem as finding entire clusters of posts whose topic is a disputed factual claim. The key insight is that when there is a rumor, even though most posts do not raise questions about it, there may be a few that do. If we can find signature text phrases that are used by a few people to express skepticism about factual claims and are rarely used to express anything else, we can use those as detectors for rumor clusters. Indeed, we have found a few phrases that seem to be used exactly that way, including: "Is this true?", "Really?", and "What?". Relatively few posts related to any particular rumor use any of these enquiry phrases, but lots of rumor diffusion processes have some posts that do and have them quite early in the diffusion. We have developed a technique based on searching for the enquiry phrases, clustering similar posts together, and then collecting related posts that do not contain these simple phrases. We then rank the clusters by their likelihood of really containing a disputed factual claim. The detector, which searches for the very rare but very informative phrases, combined with clustering and a classifier on the clusters, yields surprisingly good performance. On a typical day of Twitter, about a third of the top 50 clusters were judged to be rumors, a high enough precision that human analysts might be willing to sift through them.

The Psychology of Rumor

Article

Jan 1947

Measurement and analysis of topology and information propagation on Sina-Microblog

Article

Jul 2011

China, we crawled Sina-Microblog for about 3 months and obtain the trace of its topology and topics. Compared with other online social networks, our measurement study shows a number of interesting findings. Our data suggests that Sina-Microblog network has apparent small-world effect and scale-free characteristic, specially, the outdegree distribution appears to have multiple separate power-law regimes with different exponents. We also observe the overlay graph of Sina-Microblog represents assortative mixing pattern and weak correlation of indegree and outdegree. Moreover, by constructing the cascades of different topics, our data suggests that the distribution of cascades size follows a power-law and heavy-tailed property with the slope approximately -2, and the common motifs of cascades with different topics are very similar, above 93% of them are isolated nodes. In order to find the formative motivity of hot cascades, we find that they always evolve to the structures like 'star pattern' and 'two-polar pattern', which are mainly due to the indegree of participating nodes, and are also correlated with the content of tweet.

Automatic detection of rumor on Sina Weibo

Article

Aug 2012

The problem of gauging information credibility on social networks has received considerable attention in recent years. Most previous work has chosen Twitter, the world's largest micro-blogging platform, as the premise of research. In this work, we shift the premise and study the problem of information credibility on Sina Weibo, China's leading micro-blogging service provider. With eight times more users than Twitter, Sina Weibo is more of a Facebook-Twitter hybrid than a pure Twitter clone, and exhibits several important characteristics that distinguish it from Twitter. We collect an extensive set of microblogs which have been confirmed to be false rumors based on information from the official rumor-busting service provided by Sina Weibo. Unlike previous studies on Twitter where the labeling of rumors is done manually by the participants of the experiments, the official nature of this service ensures the high quality of the dataset. We then examine an extensive set of features that can be extracted from the microblogs, and train a classifier to automatically detect the rumors from a mixed set of true information and false information. The experiments show that some of the new features we propose are indeed effective in the classification, and even the features considered in previous studies have different implications with Sina Weibo than with Twitter. To the best of our knowledge, this is the first study on rumor analysis and detection on Sina Weibo.

A Survey on Streaming Algorithms for Massive Graphs

Chapter

Feb 2010

Jian Zhang

Streaming is an important paradigm for handling massive graphs that are too large to fit in the main memory. In the streaming computational model, algorithms are restricted to use much less space than they would need to store the input. Furthermore, the input is accessed in a sequential fashion, therefore, can be viewed as a stream of data elements. The restriction limits the model and yet, algorithms exist for many graph problems in the streaming model. We survey a set of algorithms that compute graph statistics, matching and distance in a graph, and random walks. These are basic graph problems and the algorithms that compute them may be used as building blocks in graph-data management and mining.
