An evaluation of document clustering and topic modelling in two online social networks: Twitter and Reddit

Stephan A. Curiskis, Barry Drake, Thomas R. Osborn, Paul J. Kennedy
Centre for Artificial Intelligence, Faculty of Engineering and Information Technology, University of Technology Sydney, 15 Broadway, Ultimo, NSW 2007
Email: stephan.a.curiskis@student.uts.edu.au
Abstract

Methods for document clustering and topic modelling in online social networks (OSNs) offer a means of categorising, annotating and making sense of large volumes of user generated content. Many techniques have been developed over the years, ranging from text mining and clustering methods to latent topic models and neural embedding approaches. However, many of these methods deliver poor results when applied to OSN data as such text is notoriously short and noisy, and often results are not comparable across studies. In this study we evaluate several techniques for document clustering and topic modelling on three datasets from Twitter and Reddit. We benchmark four different feature representations derived from term-frequency inverse-document-frequency (tf-idf) matrices and word embedding models combined with four clustering methods, and we include a Latent Dirichlet Allocation topic model for comparison. Several different evaluation measures are used in the literature, so we provide a discussion and recommendation for the most appropriate extrinsic measures for this task. We also demonstrate the performance of the methods over data sets with different document lengths. Our results show that clustering techniques applied to neural embedding feature representations delivered the best performance over all data sets using appropriate extrinsic evaluation measures. We also demonstrate a method for interpreting the clusters with a top-words based approach using tf-idf weights combined with embedding distance measures.

Keywords: document clustering, topic modelling, topic discovery, embedding models, Online Social Networks
1. Introduction
In January 2018 there were estimated to be around 4.021 billion people around the world who use the internet. Of these, 3.196 billion people use social media in some form, generating a staggering amount of content.¹ Online platforms and social networks have become a key source of information for nearly half of the world's population. These platforms are increasingly being used to disseminate information regarding news, brands, political discussion, global events and more (Bakshy et al., 2012). However, much of the data generated is unstructured and not annotated. This means that it is difficult to understand how topics of information diffuse through online social networks (OSNs), and how users engage with different topics (Guille et al., 2013). Automatically annotating topics within OSNs may facilitate analysis of information diffusion and user preferences by enriching the data available from these platforms in a way that is readily analysed. With the rise of phenomena like echo chambers and filter bubbles, which lead to individuals receiving biased and narrowly focused content, the challenge of automatically annotating OSN data has become important.

¹ https://wearesocial.com/uk/blog/2018/01/global-digital-report-2018, accessed Sep. 2018
Document clustering is a set of machine learning techniques that aim to automatically organise documents into clusters, such that documents within a cluster are more similar to each other than to documents in other clusters. Many methods for clustering documents have been proposed (Bisht and Paul, 2013; Naik et al., 2015). These techniques typically involve the use of a feature matrix, such as a term-frequency inverse-document-frequency (tf-idf) matrix, to represent a corpus, with a clustering method applied to this matrix. More recently, representations derived from neural word embeddings have seen applications on social media data as they can produce dense representations with semantic properties and require less manual preprocessing than traditional methods (Li et al., 2017). Common clustering methods applied in this context build hierarchies or partitions (Irfan et al., 2015). Example hierarchical methods are agglomerative clustering and divisive clustering; example partitioning methods are k-means and k-medoids clustering.
Topic modelling involves methods to discover patterns of word use within documents, and is an active research area with several techniques recently applied to OSN data (Chinnov et al., 2015). Topics are typically defined as a distribution over words, with documents modelled as mixtures of topics. Like document clustering, topic modelling can be used to cluster documents by giving a probability distribution over a range of topics for each document. This can be viewed as a form of soft partition clustering, where the data points have a probabilistic degree of membership in each cluster. The topic representation also provides the word distribution for each topic, which aids interpretation. Commonly used topic models with applications on OSN text data include Latent Dirichlet Allocation (Blei et al., 2003), the Author-Topic model (Hong and Davison, 2010), and more recently Dynamic Topic Models, which discover topics over time (Alghamdi and Alfalqi, 2015).
Document clustering and topic modelling are increasingly important research areas as these methods can be applied to large amounts of readily available OSN text data, yielding homogeneous groups of documents. These document groups may then align to relevant topics and trends. Clustering is particularly suited to OSN data as platforms like Twitter and Facebook use hashtags as a form of topic annotation (Steinskog et al., 2017), which may be used for evaluation of document clustering and topic modelling methods. Large scale clustering can help make sense of the huge amount of content being created online every day, and can subsequently be used in further machine learning tasks. Additional features derived from OSN data (such as user demographic, geographic and network data) have also been clustered to find groups of online posts or comments that are semantically similar (Alnajran et al., 2017). However, OSN data presents many challenges when applying topic modelling and document clustering methods. For example, such text is typically short and contains noise such as misspellings and grammatical errors (Chinnov et al., 2015).
There are two key challenges with topic modelling and document clustering research on OSN data sets. Firstly, results are often not reproducible since the data used in the studies frequently cannot be published. For instance, Twitter's terms of service do not allow tweets to be published. Instead, researchers can publish a list of the tweet identifiers that were used and retrieved via the API. Unfortunately, over time the associated tweets are removed from the platform, which degrades the underlying data. The data sets used are also often small or biased towards particular contexts. These issues result from the complex data collection and preparation that is often required to extract large data sets from an OSN platform, as well as restrictions on the platforms themselves (Stieglitz et al., 2018).

Secondly, different studies often use different methods for evaluating the performance of clustered documents. Evaluation methods on Twitter data vary from extrinsic measures, which compare clusters against labelled data, to manual assessments of cluster performance and interpretability (Alnajran et al., 2017). It is therefore difficult to compare empirical results. With the fast pace of research in this area, there is little guidance on which method or family of methods will perform best in specific circumstances, such as on short Twitter data or relatively longer Reddit comments.
In this paper we provide an analysis of the performance of several methods for document clustering and topic modelling of OSN content on three data sets: two Twitter data sets and a publicly available Reddit data set. We evaluate four feature representation methods derived from tf-idf and embedding matrices combined with four clustering techniques, and include a Latent Dirichlet Allocation (LDA) topic model for comparison. We also provide a discussion of the properties and appropriateness of document clustering evaluation measures commonly used in the literature. We evaluate performance with three such measures, namely the Normalised Mutual Information (NMI), the Adjusted Mutual Information (AMI), and the Adjusted Rand Index (ARI). Furthermore, we have made our data sets available so that our results can be reproduced. To comply with Twitter's terms of use, we have made available the tweet identifiers used along with the topic labels. We have also made available the full Reddit data set used (Curiskis et al., submitted).
Further to this, by tuning key hyper-parameters we demonstrate how embedding models can be used to generate feature sets for document clustering that deliver good performance and capture latent structure in the data. We also show how word embedding distances can aid in the interpretation of the clusters by ranking the top words, forming a topic vector of words. This contribution is significant since data sets from OSNs are often short and contain noise such as misspellings, abbreviations, acronyms, special characters, emojis, URLs and hashtags. These issues can result in poor performance for many commonly used techniques. Furthermore, a clear consensus is lacking in the literature regarding methods that work effectively on OSN data. The results of this paper provide guidance on methods giving good performance over different types of OSN data. These results show that traditional topic modelling and document clustering approaches do not work well on short and noisy social media posts; instead, clustering approaches applied to more recent neural network embedding representations can deliver improved performance.

The structure of this paper is as follows. In Section 2 we review the current literature in this research area. In Section 3 we present the details of our methods, including a description of the data extraction, the preparation process, the feature representations, the clustering methods, and the evaluation measures. In Section 4 we present our results. In Section 5 we provide a discussion, followed by our conclusion in Section 6.
2. Literature Review

We organise the literature on document clustering and topic modelling of OSNs into three areas. Firstly, many studies have centred on identifying and interpreting memes in this domain, incorporating textual, network and user data. Secondly, identifying topics through topic models and clustering approaches has received much attention as a means of understanding and categorising online content. Thirdly, recent advances in neural word embedding models have been used to provide dense feature representations of documents from OSNs.
2.1. Meme Identification

The term "meme" is commonly used to represent an element of culture or system of behaviour that spreads from one individual to another by imitation. In the context of OSNs, for this paper we define a "meme" as a semantic unit expressed as electronic text, where the semantics are transferred across multiple individuals even though the text may differ. This specific definition of "meme" is sometimes called "ememe" (Shabunina and Pasi, 2018). A topic in OSN applications can be defined as a coherent set of semantically related terms which express a single argument (Guille et al., 2013). In comparison to this definition of a topic, a meme does not necessarily need to be derived from a set or distribution of words, but instead captures significant semantic content. Often in practice, however, there is an overlap between the two concepts. The concept of a meme is useful for OSN applications as it can be thought of as a latent representation of textual content, but can also be discovered through analysis of OSN user and network data.
A study by Ferrara et al. (2013) aimed to identify memes within large social media data. In that study, several similarity measures were defined for Twitter data which leverage content, metadata and network features. The authors defined the concept of a 'protomeme', which was used to refer to hashtags, user mentions, URLs and phrases. Data was aggregated by creating protomeme projections onto spaces based on tweet, user and content features. For each protomeme pair, common user, tweet, content and diffusion similarity measures were calculated. These similarity matrices were then aggregated in several different ways, such as the element-wise mean and maximum. Finally, the aggregated similarity matrix was clustered with hierarchical clustering. The resulting clusters were taken to represent memes within the data. The data set used was a collection of 5,523 tweets related to the US presidential primaries in April 2012. Twenty-six topics were manually identified and assigned as labels to each tweet. Since the memes and topics can overlap per tweet, performance was evaluated using a variation of Normalised Mutual Information designated as LFK-NMI. Given the optimal parameters for this approach, the protomeme clustering method delivered average 5-fold cross-validation LFK-NMI scores of around 0.13. JafariAsbagh et al. (2014) later extended the algorithm to work on streaming data.
More recently, Shabunina and Pasi (2018) developed a method to identify and characterise memes, considered as a set of frequently occurring related words propagating through a network over time. The relationships between terms in a social media stream were modelled using a graph of words. To identify memes, a k-core degeneracy process was applied to the graph to generate subgraphs, which constituted meme bases. A meme was defined as the fuzzy subset of terms in a meme basis. The method was applied to over 800,000 tweets from the search queries #economy, #politics and #finance. Although useful for characterising and interpreting topics in social media streams, memes were not attributed to individual social media documents or users. Evaluation of the method was limited to subjective interpretation and intrinsic measures.
2.2. Document Clustering and Topic Modelling

In contrast to methods for meme identification, many studies have focused on detecting topics in OSNs. Topic models typically refer to methods that group both documents which share similar words, and words that occur in a similar set of documents. Document clustering refers to methods that group documents according to some feature matrix, such that documents within a cluster are more similar to each other than to documents in other clusters. Due to the short document size and high degree of noise inherent in OSN data, such as Twitter data, clustering based methods are often applied in favour of more traditional topic models (Chinnov et al., 2015). Nevertheless, topic models applied to OSN data are still an active area of research (Alghamdi and Alfalqi, 2015). Indeed, the term 'topic discovery' may refer to either topic modelling or document clustering.
Document clustering methods have typically used vector space representations of word occurrence by document. Commonly, bag-of-words methods model each document as a point in the space of words. Each word is a feature or dimension of this space, with element values assigned in one of several ways: one-hot encodings, where the value is set to 1 if the word exists in the document and 0 otherwise; term frequencies; or term-frequency inverse-document-frequency calculations. Given that the total dimension size is the number of unique words, often there is a threshold cut-off to use only those words with high values (Patki and Khot, 2017). A range of clustering algorithms may then be applied to the feature matrix, such as k-means, hierarchical clustering, self-organising maps, and so on (Naik et al., 2015).
For instance, Godfrey et al. (2014) developed an algorithm to identify topics within a specific Twitter data set, a collection of about 30,000 tweets extracted using the query term 'world cup'. Non-negative Matrix Factorisation (NMF) and k-means clustering were applied to the tf-idf representation of tweets to create topic clusters. Due to the noisiness of Twitter data, Godfrey et al. (2014) developed a preliminary filtering step using multiple runs of the DBSCAN clustering algorithm combined with consensus clustering. The rationale was that tweets which are not close to any particular cluster may be treated as noise and removed from an analysis. Using this approach, k-means clustering and NMF performed similarly. However, when analysing the clusters using a subjective evaluation of tweet network diagrams and word clouds, NMF seemed to produce more interpretable clusters.
Fang et al. (2014) approached detecting topics in Twitter using additional information about the tweet. Recognising that the textual content of tweets can be quite limited, a 'multi-view' topic detection framework was developed based on more granular 'multi-relations'. These multi-relations were defined as useful relations from the Twitter social network and included hashtags, user mentions, retweets, meaningful words and similar posting times. To measure these multi-relations, a document similarity measure was developed. Multi-relation similarity scores were then combined into a multi-view and clustered using three different methods. These clusters were taken to represent topics, and a keyword extraction method, based on suffix trees and tf-idf weights, was applied to derive representative keywords for each cluster. This method was evaluated using a dataset of 12,000 tweets with 60 'hot' topics extracted from the Twitter API. Three evaluation measures were used, namely the F-measure, NMI, and entropy. The results showed that including more multi-views improved performance, with results above 0.928 on the F-measure and 0.935 on NMI. However, the authors did not remove any of the hot topic key words from the text. These key words are generally short phrases or hashtags, and can be discovered easily by tf-idf approaches.
Another study compared the efficacy of different clustering methods to detect topics in Twitter data centred around recent earthquakes in Nepal (Klinczak and Kaestner, 2016). In this study, tweets were represented by their tf-idf vectors. Four clustering methods applied to this representation were compared, namely k-means, k-medoids, DBSCAN and NMF. By evaluating each clustering method with measures for cohesion and separation of clusters (i.e. intrinsic evaluation measures), it was clear that NMF produced superior clusters which were simpler and easier to interpret. More recently, Suri and Roy (2017) applied LDA and NMF to detect topics on a Twitter data set, as well as an RSS news feed. Both methods were found to have similar performance. LDA was deemed to be more interpretable, but NMF was faster to calculate. However, performance was evaluated by manual inspection of the key terms for topics.
Many studies have applied topic modelling techniques to OSN data. For instance, Paul and Dredze (2014) developed a topic modelling framework for discovering self-reported health topics using Twitter data. 5,128 tweets were annotated with a positive status if they related to the user's health, and negative if not. A logistic regression model was trained to predict the positive labels in the annotated data, and applied to a Twitter stream filtered with a large number of health related keywords. This provided a set of 144 million health tweets which was used to run the Ailment Topic Aspect Model. While this study is useful for filtering and interpreting large amounts of relevant tweets, validation of the discovered topics focused on correlation measures against external health trend data.
Further to topic models applied to a static data set, dynamic topic models, which incorporate the temporal nature of OSN data, are gaining attention (Alghamdi and Alfalqi, 2015). Ha et al. (2017) applied dynamic topic models to Reddit data to understand user perceptions of smart watches. While these results are interesting for gauging public opinion in this area, no ground truth label was used and likewise no extrinsic evaluation measures were applied. Recently, Klein et al. (2018) applied topic modelling to reveal distinct interests in the Reddit conspiracy page (a subreddit page). NMF was used to create topic loadings for each user contributing to the page. These topic loadings were then clustered using k-means to reveal user subgroups. Again, this study is useful in understanding the user population within OSN discussion threads, but no extrinsic evaluation was made to validate the quality of the topic modelling or the clustering.
2.3. Neural Network Embedding Models

Much of the literature on clustering OSN text data used tf-idf matrix representations of tweets at some level. These matrices treat terms as one-hot encoded vectors, where each term is represented by a binary vector with exactly one non-zero element. This means that relationships between words, such as synonyms, are not incorporated, and the resulting document matrix representation is sparse and high dimensional. The concept of dense, distributional representations of words, or word embeddings, provides an alternative approach (Bengio et al., 2003). In these methods, each word is represented by a real valued vector of fixed dimension. Word embeddings are commonly trained using neural network language models, such as word2vec (Mikolov et al., 2013). However, when using word embedding models to create document level representations, the word vectors need to be aggregated in some way. Common approaches in the literature are to simply take the mean of the word vectors for all terms in the document, or to concatenate the vectors into a document vector of fixed size (Yang et al., 2017). Document representations derived from tf-idf weighted word vector averages have also been proposed (Zhao et al., 2015; Corrêa Júnior et al., 2017). Another method trains document level dense vector representations at the same time as the word vectors (Le and Mikolov, 2014). We refer to this latter method as doc2vec.
Much research has applied neural word embeddings to classification and semantic evaluation tasks. For instance, Billah Nagoudi et al. (2017) applied word embeddings to model semantic similarity between Arabic sentences. Three different sentence level aggregations were proposed, namely the sum of the word vectors for all words in a sentence, an inverse-document-frequency weighted sum of the word vectors, and a part-of-speech weighted sum. The authors found that the weighted sum representations delivered more accurate sentence similarities. In another study, Corrêa Júnior et al. (2017) developed a classification method for sentiment analysis using an ensemble of classifiers with different feature representations, namely a tf-idf matrix, a mean word vector representation, and a tf-idf weighted mean of the word vectors. Recently, Li et al. (2017) published a number of word2vec models pre-trained on a Twitter data set of 390 million English tweets with a range of pre-processing steps. Embedding representations are becoming more widely used in NLP tasks involving OSN data.
Further to word and document embeddings, character level embedding models have been proposed and applied to Twitter data, creating tweet2vec (Dhingra et al., 2016). The motivation for tweet2vec is that social media data are noisy, suffering from spelling errors, abbreviations, acronyms and special characters, which can lead to prohibitively large vocabulary sizes. Tweet2vec takes as input sequences of characters for each tweet and passes them through a bidirectional GRU neural network encoder to create a fixed dimensional tweet embedding vector. This tweet embedding is then passed through a linear softmax layer to predict the hashtags of a tweet. The algorithm was evaluated on hashtag classification performance. While this method may promise to create useful tweet embeddings, it assumes that hashtags are valid labels for tweets. This assumption may not hold, as other text, user mentions and URLs can also be important in defining the topic of a tweet, and tweets can have multiple hashtags.
Recently, contextualised extensions to word embeddings have been proposed. One challenge for traditional word embeddings is polysemy, where a word has multiple meanings dependent on the context. Peters et al. (2018) introduced a deep contextualised word embedding model, which models both the syntactic and semantic characteristics of word use, and how these uses vary across linguistic contexts. This method involves coupling embedding vectors trained from a bidirectional LSTM with a language model objective. Named ELMo (Embeddings from Language Models), the method assigns an embedding vector to each token that is a function of the entire input sentence. This technique may be useful for clustering social media documents.
In addition to the document clustering and topic modelling approaches discussed so far, a new series of deep learning based clustering methods has been developed (Min et al., 2018). Many of these techniques use deep neural networks to learn feature representations trained at the same time as clustering. Examples include several deep autoencoder networks with a clustering layer, where the loss function is a combination of reconstruction loss and clustering loss. Clustering methods based on generative models such as Variational Autoencoders and Generative Adversarial Networks look promising from a document clustering perspective since they can also generate representative samples from the clusters. However, the focus for these techniques to date has been on image data sets.
Many approaches to document clustering and topic modelling have been proposed for OSN text data. These methods typically involve creating document level feature representations with tf-idf matrices or other techniques, followed by clustering methods to group documents into semantically related clusters. However, there are many variations on these methods, and to the best of our knowledge word embedding representations have not yet been effectively applied and benchmarked on document clustering tasks in OSN data.

Figure 1: Process pipeline for document clustering (Data Extraction, three data sets used → Data Preparation → Feature Representations, four methods applied → Clustering Methods, four methods applied → Evaluation Measures, three measures used). The contribution of this paper is an evaluation of four methods for feature representation and four clustering methods using three evaluation measures over three data sets.
3. Methods

In this section we describe the three data sets used and the processing steps, the feature representations and clustering algorithms, and the evaluation measures used, with a discussion of their properties.

Document clustering and topic modelling methods applied to OSN data typically involve several processing steps, as outlined in Figure 1. Data is first extracted from a source. From the raw data set or OSN platform API, documents are extracted which consist of text data from an individual user; a tweet and a Reddit parent comment are examples of a document. The textual elements are then processed to remove common punctuation and stop words, and tokenised. Feature representations of each document are created, followed by a clustering method. Extrinsic clustering evaluation measures are then calculated using ground truth labels. The variations at each step of the process are outlined in Table 1. In the rest of this section we detail our approach to each step of Figure 1.

Table 1: Outline of the data sets, methods for feature representations and clustering, and extrinsic evaluation measures used in this study. For the three data sets, we evaluate the feature representation and clustering method combinations and the LDA topic model (17 combinations) with the three evaluation measures.

Data sets:
  Twitter stream filtered by #Auspol, 29,283 tweets
  RepLab 2013 competition Twitter data, 2,657 tweets
  Reddit data from May 2015, 40,000 parent comments
Feature representations:
  FR1  tf-idf matrix with the top 1,000 terms per document
  FR2  Mean word2vec matrix
  FR3  Mean word2vec matrix weighted by the top 1,000 tf-idf scores
  FR4  doc2vec matrix for each document
Clustering methods:
  CM1  k-means clustering
  CM2  k-medoids clustering
  CM3  Hierarchical agglomerative clustering
  CM4  Non-negative matrix factorisation (NMF)
Topic model:
  LDA  Latent Dirichlet Allocation topic model
Evaluation measures:
  NMI  Normalised Mutual Information
  AMI  Adjusted Mutual Information
  ARI  Adjusted Rand Index
3.1. Data Extraction

We used three OSN data sets for evaluation: two Twitter data sets and a Reddit data set. We used Twitter data since it has been widely used in the literature on topic modelling and document clustering. While there appear to be fewer studies which have used Reddit data, Reddit still represents a valuable source of OSN data for topic modelling and document clustering. Reddit is also used more as a discussion forum, and its comments have a wider range of document lengths than Twitter data. All three data sets have been made available (Curiskis et al., submitted).

Twitter data provides a readily accessible source of short and topical user driven content. It is widely used for research purposes, but presents many challenges due to the short tweet length and the use of hashtags, acronyms, user mentions and URLs (Stieglitz et al., 2018). The first Twitter data set was collected through Twitter's public API. It was constructed by filtering the Twitter stream for the hashtag #Auspol, which is frequently used in Australia for political discussion. A common application for document clustering on OSN data is to take a set of documents related to a particular theme and discover topics, such as the study of health topics in Twitter data (Paul and Dredze, 2014). The #Auspol Twitter data set is suitable for comparing document clustering methods since the hashtag is widely used to link a large number of disparate discussions, often with additional hashtags, related to public opinion in Australia. Data was collected between 13 June and 2 September 2017 and consisted of 1,364,326 tweets. We filtered this data set by selecting English language tweets only and removed retweets based on the retweeted status field and a text filter. This resulted in 205,895 tweets.
No ground truth topic labels exist for this data set, so we used a set of high count hashtags as ground truth labels. We further removed the search hashtag (#Auspol) from the data set, since all tweets contained this token. It is common for a tweet to have multiple hashtags, so to avoid overlapping topics we removed tweets which contained more than one of the top hashtags. We also manually removed some related hashtags, such as #ssm (same sex marriage), which is closely related to #marriageequality; we kept the latter as it was used in more tweets. Lastly, we filtered to hashtags with at least 1,000 tweets to keep the topics relatively balanced. This resulted in 29,283 tweets with 13 hashtags denoting topic labels, as given in Table 2.
Table 2: Count of tweets per hashtag in the #Auspol Twitter data set.

Topic number | Hashtag | Tweets
1 | #qldpol | 3,845
2 | #qanda | 3,592
3 | #insiders | 3,495
4 | #lnp | 3,434
5 | #politas | 2,618
6 | #marriageequality | 2,562
7 | #springst | 1,708
8 | #nbn | 1,626
9 | #trump | 1,547
10 | #uspoli | 1,498
11 | #stopadani | 1,186
12 | #climatechange | 1,148
13 | #turnbull | 1,024
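For illustration, the hashtag labelling step described above can be sketched as follows. This is our own sketch, not the authors' code; the DataFrame and column names are assumptions.

```python
# Sketch of the hashtag label construction (assumes a pandas DataFrame
# `tweets` with a `hashtags` column holding a list of lowercase tags per tweet).
from collections import Counter

counts = Counter(h for tags in tweets["hashtags"] for h in tags if h != "#auspol")
top_hashtags = {h for h, n in counts.items() if n >= 1000}

def label_of(tags):
    # Keep a tweet only if it contains exactly one top hashtag,
    # so that topic labels do not overlap.
    matched = [h for h in tags if h in top_hashtags]
    return matched[0] if len(matched) == 1 else None

tweets["label"] = tweets["hashtags"].apply(label_of)
labelled = tweets.dropna(subset=["label"])
```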
The second Twitter data set was taken from the RepLab 2013 competition (Amigó et al., 2013). This competition focused on monitoring the reputation of entities (companies and individuals), and involved tasks such as named entity recognition, polarity classification and topic detection. The tweets used in this competition were annotated with topic labels by several trained annotators supervised and monitored by reputation experts. For the purposes of this paper, the topics annotated in these tweets were taken as a gold standard. We used this data set because it has gold standard labels already annotated and has been used for topic detection tasks.

We downloaded the list of Twitter identifiers from the training and testing data sets for the topic detection task made available through the RepLab 2013 competition and retrieved the details through the Twitter API on 19 January 2019. Out of 110,344 published tweet identifiers with labelled topics, we could only retrieve the tweet text and other information for 23,684 tweets. This is likely due to tweets and users being deleted since the tweets were published. Furthermore, there is a long tail of topics labelled in this data: for the 23,684 tweets there were a total of 3,432 distinct topics, with 1,263 topics containing a single tweet. To ensure that there were sufficient data points for our methods to detect, we required a frequency count per topic of at least 100. We also removed the label denoted 'other topics' as it does not represent an internally consistent topic. After this filtering we had a data set of 2,657 tweets with 13 topic labels from the competition. The list of topic labels used is given in Table 3.
We originally included the RepLab 2013 data set primarily because comparative results for topic discovery are available from the competition. However, due to the large volume of tweets which could not be retrieved from Twitter's API, accurate comparisons are no longer possible. Nevertheless, the ground truth topic labels still allow for the performance of the methods to be benchmarked.
The third data set was from the Reddit platform and consisted of parent comments and their related comments, by subreddit page, from May 2015. The Reddit platform is widely used for discussion related to specific topics or themes, grouped by subreddit page, so it is ideal for this study. Furthermore, Reddit comments can be longer than tweets. A Reddit parent comment refers to the top comment, which may or may not have responses from other users. This data was made public on the Reddit website (Reddit, 2015). The full data set contained around 54.5 million comments on 50,138 subreddit pages. We chose this data set since it is freely available in full and contains discussion on multiple themes. It is therefore an ideal data set for benchmarking methods. We chose five subreddit pages which represent disjoint themes for analysis. These five subreddit pages were also used in a previous study benchmarking classification models (Gutman and Nam, 2015). Since parent comments and responses are inherently related, we pooled all the user posts into documents grouped by the parent comment identifier. Table 4 shows the count of parent comments per subreddit page. We randomly sampled 40,000 parent comment identifiers from across the five subreddit pages, then used the subreddit pages to denote the ground truth labels.
Reddit data is especially useful in this study since it contains a wider range of character lengths per document than Twitter data, as Twitter limits the number of characters per tweet. An evaluation of the performance of the document clustering methods by document length can provide guidance for future studies on the optimal method for a particular data set. To examine this, we partitioned the Reddit data into four distinct subsets based on the number of characters per document. Details of the four data partitions are given in Table 5. For comparison with the Twitter data sets, a tweet has a maximum of 240 characters. For the #Auspol Twitter data, the mean character length was 117, with a 25th percentile of 103 and a 75th percentile of 138. Most tweets therefore fall into the 101 to 200 character length document group.

Table 3: Count of tweets per topic label in the RepLab 2013 Twitter data set.

Topic number | Topic | Tweets
1 | For Sale | 329
2 | Suzuki cup | 296
3 | User Comments | 262
4 | Money laundering / terrorism finance | 199
5 | Record of views on YouTube | 195
6 | Fan Craze - Beliebers | 154
7 | Princeton Offense | 131
8 | For Sale - Nissan Cars, Parts and Accessories | 127
9 | Jokes | 127
10 | Sports sponsors | 127
11 | Spam | 114
12 | Ironic Criticism | 111
13 | MotoGP - User Comments | 103

Table 4: Count of parent comments per subreddit page.

Topic number | Subreddit page | Parent comments
1 | NFL | 10,563
2 | news | 9,488
3 | pcmasterrace | 9,186
4 | movies | 6,263
5 | relationships | 4,500

Table 5: Reddit data was partitioned into four sets based on document character length. Documents are grouped by the parent comment. The mean character length and mean number of tokens per document are given.

Character length range | Number of documents | Mean character length | Mean number of tokens
1 to 100 | 15,273 | 46.1 | 4.5
101 to 200 | 8,360 | 144.9 | 13.3
201 to 500 | 9,310 | 317.4 | 28.6
501 or greater | 7,057 | 1,584.5 | 141.1
3.2. Data Preparation

Data preparation and analysis in this study was conducted using Python 3.6.1. For text preprocessing, we removed the list of stopwords from the nltk 3.2.4 package and punctuation from string. A customised tokeniser function was created for tweets which retained hashtags and user mentions, and removed URLs. To tokenise the Reddit data, we simply removed punctuation and standard stopwords. We did not apply any stemming or lemmatisation. We also used the TfidfVectorizer function from sklearn 0.19.1 for the tf-idf method and the weighted word2vec method.
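For illustration, a minimal sketch of this preprocessing is given below. The paper specifies only the packages used; the exact tokenisation rules and the regular expression here are our own assumptions.

```python
# Minimal preprocessing sketch, assuming the packages named above
# (nltk stopwords, string punctuation, sklearn's TfidfVectorizer).
import re
import string
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

STOPWORDS = set(stopwords.words("english"))
# Strip punctuation but keep '#' and '@' so hashtags and mentions survive.
PUNCT = str.maketrans("", "", string.punctuation.replace("#", "").replace("@", ""))

def tokenise_tweet(text):
    text = re.sub(r"https?://\S+", "", text.lower())  # remove URLs
    tokens = text.translate(PUNCT).split()
    return [t for t in tokens if t not in STOPWORDS]

# tf-idf features (FR1), limited to the top 1,000 terms by frequency.
vectoriser = TfidfVectorizer(tokenizer=tokenise_tweet, max_features=1000)
# tfidf = vectoriser.fit_transform(docs)  # docs: list of raw document strings
```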
For the #Auspol Twitter data, we removed the list of 14 hashtags taken as ground truth labels from the text, in addition to the #Auspol Twitter API search query. The RepLab 2013 Twitter data set had annotated topic labels that were not based directly on any individual tokens, so no modification was required. For the Reddit data, as the subreddit page was used as the ground truth label, we did not need to modify the text.
3.3. Feature Representations

In this study we evaluated the performance of four methods for constructing feature representations of documents, combined with four commonly used clustering algorithms. We also included an LDA topic model in a separate topic models category, since that technique only takes a bag-of-words matrix as input. These methods are outlined in Table 1, where each method component is given a code for ease of reference: the four feature representations are coded as FR1-FR4, the four clustering methods as CM1-CM4, and the LDA topic model simply as LDA. While many other techniques have been proposed in the literature, such as the meme identification studies (JafariAsbagh et al., 2014; Shabunina and Pasi, 2018), we did not implement them for evaluation as they are specific to data from Twitter. However, we provide comparison results in our discussion where they were available from other studies.

For FR1, the tf-idf matrix was limited to the top 1,000 terms per document by frequency, since no performance improvement was gained by including more terms. This is likely due to the short nature of social media text, which produces sparse tf-idf feature vectors; terms with lower frequency would not generally be useful in clustering.
A word2vec model is a neural network trained to create a dense vector of fixed dimension for each token in a corpus. While a pre-trained word2vec model is available for Twitter data (Godin et al., 2015), we found that it did not perform well on the Twitter data sets used in this study. One issue was that many tokens in the data were out of the trained model's vocabulary; in addition, the semantic relationships between words may be very different on different data sets. Additionally, a pre-trained model on a large amount of Reddit data was not available. Furthermore, there are many hyper-parameters in these models, so finding an ideal set of values for different data sets is a useful contribution. For these reasons, we trained our own word embedding and document embedding models.

The word2vec models used in FR2 and FR3 were trained with the continuous bag of words (CBOW) method (Mikolov et al., 2013), 100 dimensions, a context window of size 5 and a minimum word count of 1. We tested variations of these hyper-parameters, including context window sizes ranging from 3 to 15, higher dimensions and higher minimum word counts. We found that the variation in performance on the three clustering evaluation measures was minimal and the chosen hyper-parameters were optimal. Some of these results make sense given the short document length of social media text. We concluded that 100 dimensions for word2vec was sufficient to represent words for short documents. The mean number of tokens per tweet was 9, and the 75th percentile was 11, so a context window of size 5 captured all the tokens of most tweets. However, we did find significant variation with the number of training epochs used for the three data sets; we report on this analysis in Section 4.1. For all other hyper-parameters, we used the default values provided by the gensim 3.4.0 python package (Řehůřek and Sojka, 2010).
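A sketch of this training setup with gensim is given below (our own illustration, not the authors' code). Parameter names follow the current gensim 4.x API; in the gensim 3.4.0 release used in this study, vector_size and epochs were named size and iter.

```python
# Word2vec training sketch matching the stated hyper-parameters:
# CBOW, 100 dimensions, context window 5, minimum word count 1.
# `tokenised_docs` is a list of token lists from the preprocessing step.
from gensim.models import Word2Vec

w2v = Word2Vec(
    sentences=tokenised_docs,
    vector_size=100,   # `size=` in gensim 3.x
    window=5,
    min_count=1,
    sg=0,              # 0 selects CBOW
    epochs=250,        # `iter=` in gensim 3.x; tuned per data set (Section 4.1)
)
```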
FR2 was constructed by taking the element-wise mean of the word vectors for each token in each document, returning a dense feature vector of 100 dimensions. FR3 was constructed by taking the tf-idf weighted mean of the word vectors for each word of a document. The tf-idf matrix used was the top 1,000 term matrix by frequency constructed in FR1. This process excluded any word vectors that were not in the top 1,000 tf-idf terms, although again this was tried with larger numbers of top terms, for which the evaluation measures were found to decrease. We discuss the evaluation measures used in Section 3.5.
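A minimal sketch of these two aggregations, under the same assumptions as the previous snippets (a trained w2v model and a fitted vectoriser), might look as follows.

```python
# FR2/FR3 sketch: unweighted and tf-idf weighted means of word vectors.
# `tfidf_row` is the document's row of the (sparse) tf-idf matrix.
import numpy as np

vocab = vectoriser.vocabulary_  # term -> column index of the tf-idf matrix

def fr2(tokens):
    # Element-wise mean of the word vectors (FR2).
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(100)

def fr3(tokens, tfidf_row):
    # tf-idf weighted mean, restricted to the top 1,000 tf-idf terms (FR3).
    pairs = [(w2v.wv[t], tfidf_row[0, vocab[t]])
             for t in tokens if t in w2v.wv and t in vocab]
    pairs = [(v, w) for v, w in pairs if w > 0]
    if not pairs:
        return np.zeros(100)
    vecs, weights = zip(*pairs)
    return np.average(np.array(vecs), axis=0, weights=np.array(weights))
```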
A doc2vec model is a neural network trained to create a dense vector of fixed dimension for each document in a corpus. The doc2vec models in FR4 were trained with 100 dimensions using the distributed bag of words method (dbow), a context window of size 5 and a minimum word count of 1. The distributed bag of words method was used since it can train both word vectors and document vectors in the same embedding space (Le and Mikolov, 2014), which was useful for interpreting the document embeddings. As with the word2vec models, we tested variations of the hyper-parameters and found that the evaluation measures varied significantly with the number of training epochs, and that different data sets had different optimal epochs. This is similar to the results of Lau and Baldwin (2016), where a dbow doc2vec model trained on 4.3 million words had an optimal number of epochs of 20, while the optimal number was 400 for a data set of 0.5 million words. Lau and Baldwin (2016) also found that the optimal number of dimensions was 300 and the optimal window size was 15. The lower optimal values for our method are likely due to the short document lengths of OSN data, as well as the lower word count of our data sets, especially the Twitter data.
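A corresponding doc2vec sketch is shown below. Mapping the described setup onto gensim's dbow_words flag is our assumption; parameter names again follow gensim 4.x.

```python
# FR4 sketch: dbow doc2vec with 100 dimensions, window 5, minimum count 1.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

tagged = [TaggedDocument(words=tokens, tags=[i])
          for i, tokens in enumerate(tokenised_docs)]
d2v = Doc2Vec(
    documents=tagged,
    dm=0,             # distributed bag of words (dbow)
    dbow_words=1,     # also train word vectors in the same embedding space
    vector_size=100,
    window=5,
    min_count=1,
    epochs=75,        # tuned per data set (see Section 4.1)
)
doc_matrix = d2v.dv.vectors  # one 100-d vector per document (`docvecs` in gensim 3.x)
```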
3.4. Clustering Methods

For the clustering methods, we selected four techniques commonly used in the literature (Klinczak and Kaestner, 2016; Naik et al., 2015) which also gave comparable results on our data sets. Firstly, we applied a k-means clustering algorithm (CM1) using the Euclidean metric and a maximum of 100 iterations. The algorithm was run multiple times over the data with varying random seeds. CM2 refers to the k-medoids algorithm, for which we used the pyclustering 0.8.2 python package with starting centroids sampled according to a uniform distribution. Both k-means and k-medoids clustering were used in Klinczak and Kaestner (2016). For CM3 we applied a hierarchical agglomerative clustering algorithm with the Euclidean metric and Ward linkage. Hierarchical agglomerative clustering was used in Ferrara et al. (2013) to cluster a similarity matrix. For CM4 we used a Non-negative Matrix Factorisation (NMF) algorithm, with the default parameters in the sklearn 0.19.1 package. NMF has seen multiple applications for topic modelling in OSN data (Godfrey et al., 2014; Klein et al., 2018). For the clustering methods and the LDA model, we set the number of clusters or components equal to the number of unique labels in the evaluation data. In line with Klinczak and Kaestner (2016), we tested the DBSCAN clustering algorithm with a range of hyper-parameters, but found that it delivered poor performance for all feature representations: the documents would either be grouped into an outlier cluster, or into a large number of very small clusters. A possible reason for this is that the feature representations are high dimensional and sparse, so may not cluster well using density based approaches.
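For illustration, CM1, CM3 and CM4 can be sketched with sklearn as follows (CM2 used the pyclustering package and is omitted for brevity); this is our own sketch, not the exact configuration used in the study.

```python
# Sketch of CM1, CM3 and CM4. `X` is a documents-by-features matrix
# (e.g. FR2-FR4) and `k` is the number of unique ground truth labels.
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.decomposition import NMF

cm1_labels = KMeans(n_clusters=k, max_iter=100).fit_predict(X)
cm3_labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)

# NMF requires a non-negative input, such as the tf-idf matrix (FR1);
# each document is assigned to its highest-loading component.
W = NMF(n_components=k).fit_transform(X_tfidf)
cm4_labels = W.argmax(axis=1)
```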
The LDA topic model was trained with 10 passes, a chunk size of 10,000 and an update after every record. We again used the default values for the other hyper-parameters in the gensim 3.4.0 package. We included this method since it is commonly used in document clustering and topic modelling. To assign a topic label to each document, we chose the topic with the highest probability.
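A gensim sketch of this LDA configuration, including the highest-probability topic assignment, might look as follows (our own illustration).

```python
# LDA sketch matching the stated settings:
# 10 passes, chunk size 10,000, update every record.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

dictionary = Dictionary(tokenised_docs)
bow = [dictionary.doc2bow(tokens) for tokens in tokenised_docs]
lda = LdaModel(corpus=bow, id2word=dictionary, num_topics=k,
               passes=10, chunksize=10000, update_every=1)

# Label each document with its most probable topic.
labels = [max(lda.get_document_topics(doc), key=lambda t: t[1])[0]
          for doc in bow]
```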
3.5. Evaluation Measures

Measures used for evaluating document clustering methods typically fall into two categories: intrinsic and extrinsic measures. Intrinsic measures, such as measures of cluster separation and cohesion, do not require a ground truth label. Such measures describe the variation within clusters and between clusters. However, they are dependent on the feature representations used, so do not give comparable results for methods which use different feature sets. Extrinsic measures require a ground truth label, but can be compared across methods. Common extrinsic measures include precision, recall and F1 (Naik et al., 2015), but these depend on the matching of cluster labels to ground truth labels, which is a problem with a large number of labels. Measures such as mutual information and the Rand index are more appropriate in this case as they are independent of the absolute values of the labels.
Mutual information is a measure of the mutual dependence between two discrete random variables. It quantifies the reduction in uncertainty about one discrete random variable given knowledge of another. High mutual information indicates a large reduction in uncertainty. For two discrete random variables $X$ and $Y$ with joint probability distribution $p(x, y)$, the mutual information, $MI(X, Y)$, is given by

$$MI(X, Y) = \sum_{y \in Y} \sum_{x \in X} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}.$$

A commonly used measure is the normalised mutual information (NMI), which normalises the MI to take values between 0 and 1, with 0 representing no mutual information and 1 representing perfect agreement. This is useful to compare results across methods and studies. The NMI is given by

$$NMI(X, Y) = \frac{MI(X, Y)}{\sqrt{H(X)\,H(Y)}},$$

where $H(X)$ and $H(Y)$ denote the marginal entropies, given by

$$H(X) = -\sum_{i=1}^{n} p(x_i) \log p(x_i).$$
The Rand index is a pair counting measure of similarity between the labels and clusters. It also takes values between 0 and 1, with 0 indicating that the partitions agree on no pair of elements and 1 representing identical partitions. Given a set of elements $S = \{o_1, \ldots, o_n\}$ and two partitions of $S$ to compare, $X = \{X_1, \ldots, X_r\}$ and $Y = \{Y_1, \ldots, Y_s\}$, the Rand index represents the frequency with which the partitions $X$ and $Y$ are in agreement, over the total number of observation pairs. Mathematically, the Rand index, $RI$, is given by

$$RI(X, Y) = \frac{a + b}{a + b + c + d} = \frac{a + b}{\binom{n}{2}},$$

where $a$ represents the number of pairs of elements in $S$ that are in the same subset in $X$ and the same subset in $Y$, and $b$ represents the number of pairs of elements in $S$ that are in different subsets of $X$ and different subsets of $Y$. Values $a$ and $b$ together give the number of times the partitions are in agreement. The value $c$ represents the number of pairs of elements in $S$ that are in the same subset of $X$ and different subsets of $Y$, and $d$ gives the number of pairs of elements in $S$ that are in different subsets of $X$ and the same subset of $Y$.
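As a small worked example (ours, for illustration): take $S = \{o_1, o_2, o_3, o_4\}$ with $X = \{\{o_1, o_2\}, \{o_3, o_4\}\}$ and $Y = \{\{o_1, o_2, o_3\}, \{o_4\}\}$. Of the $\binom{4}{2} = 6$ pairs, $(o_1, o_2)$ is together in both partitions ($a = 1$); $(o_1, o_4)$ and $(o_2, o_4)$ are separated in both ($b = 2$); $(o_3, o_4)$ is together in $X$ but not in $Y$ ($c = 1$); and $(o_1, o_3)$ and $(o_2, o_3)$ are separated in $X$ but together in $Y$ ($d = 2$). Hence $RI = (1 + 2)/6 = 0.5$.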
For extrinsic clustering evaluation measures to be useful for comparison across methods and studies, such measures need a fixed bound and a constant baseline value. Both the NMI and the RI are scaled to have values between 0 and 1, so they satisfy the first condition. However, it has been shown that both measures increase monotonically with the number of labels, even with an arbitrary cluster assignment (Vinh et al., 2010). This is because neither the mutual information nor the Rand index has a constant baseline, implying that these measures are not comparable across clustering methods with different numbers of clusters. To account for this, adjusted versions of the MI and RI have been proposed. The adjusted Rand index, ARI, adjusts the RI by its expected value:

$$ARI(X, Y) = \frac{RI(X, Y) - E\{RI(X, Y)\}}{\max\{RI(X, Y)\} - E\{RI(X, Y)\}},$$

where $E\{RI(X, Y)\}$ denotes the expected value of $RI(X, Y)$. The ARI takes a maximum value of 1, representing identical partitions, and is adjusted for the number of partitions in $X$ and $Y$. In a similar way, the adjusted mutual information, AMI, is given by

$$AMI(X, Y) = \frac{MI(X, Y) - E\{MI(X, Y)\}}{\max\{H(X), H(Y)\} - E\{MI(X, Y)\}},$$

where $E\{MI(X, Y)\}$ represents the expected value of the MI (Vinh et al., 2010). The AMI likewise takes a maximum value of 1, representing identical partitions, and is adjusted for the number of partitions used. The measures that best ensure a comparable evaluation are therefore the AMI and the ARI. The next question is how these two measures compare to each other. By developing theory regarding generalised information theoretic measures, Romano et al. (2016) concluded that the AMI is the preferable measure when the labels are unbalanced and there are small clusters, while the ARI should be used when the labels have large and similarly sized volumes.
In this paper, we report the AMI, ARI and NMI measures. Many previous studies have reported the NMI measure, so for comparison purposes we include it in our evaluation. Given the data and methods of this study, it is likely that the ARI is more appropriate than the AMI, as Tables 2 and 4 show that the distribution of documents across labels is relatively balanced. We still include the AMI since it is interesting to see how much its results may differ from the NMI.

Due to the short and noisy nature of the data sets used in this study, we examined the effect of different random seeds on performance. We ran each method 20 times with different random seeds, calculated the mean of the NMI, AMI and ARI, and plotted the distributions of these measures.
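All three measures are available in sklearn; for illustration, given ground truth topic labels and cluster assignments:

```python
# The three extrinsic measures, as implemented in sklearn.
# `labels_true`: ground truth topic labels; `labels_pred`: cluster assignments.
from sklearn.metrics import (adjusted_mutual_info_score,
                             adjusted_rand_score,
                             normalized_mutual_info_score)

nmi = normalized_mutual_info_score(labels_true, labels_pred)
ami = adjusted_mutual_info_score(labels_true, labels_pred)
ari = adjusted_rand_score(labels_true, labels_pred)
```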
4. Results

In this section we present the results of our analysis. We first describe the results on the optimal number of epochs for the word2vec and doc2vec embedding representations, applied to all three data sets. We then evaluate the performance of all the methods. Lastly, we discuss methods for the interpretation of the topics using the doc2vec feature representation.
4.1. Optimal Training Epochs for Embedding Models

A key hyper-parameter for training neural network models is the number of epochs: with too many epochs the model may overfit the data, and with too few, performance may be poor. We first explored how the performance of the mean word2vec models (FR2 and FR3) and the doc2vec model (FR4) changes with the number of epochs. These results provide guidance for studies where a ground truth topic label is not present. We used k-means clustering (CM1) as the clustering method as it gave the best results for the embedding representations. For each epoch value between 25 and 300, in increments of 25, we trained the models 20 times using different random seeds and evaluated against the ground truth labels. This was done for all three data sets. Table 6 summarises the optimal epoch results by method and data set.
Table 6: Optimal number of training epochs for the word2vec and doc2vec methods on the three data sets.

Data set | doc2vec | wtd. word2vec | unwtd. word2vec
Twitter #Auspol | 75 | 250 | 250
Twitter RepLab 2013 | 300 | 200 | 200
Reddit: 1 to 100 | 175 | 75 | 50
Reddit: 101 to 200 | 150 | 100 | 200
Reddit: 201 to 500 | 100 | 50 | 50
Reddit: 501+ | 50 | 25 | 25
The plots for this analysis on the #Auspol Twitter data are shown in Figure 2(a) and on the RepLab 2013 data in Figure 2(b). The results for the Reddit data are shown in Figure 3. To save space, we only evaluated the AMI and ARI measures on the Reddit data, since the AMI typically gives similar results to the NMI but is chance adjusted.
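The epoch sweep described above can be sketched as follows (our own loop structure; train_doc2vec is a hypothetical helper wrapping the doc2vec training of Section 3.3, and k, labels_true and tokenised_docs are as before).

```python
# Sketch of the epoch sweep: for each epoch setting, train 20 seeded models,
# cluster with k-means and average the chance-adjusted measures.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score

results = {}
for n_epochs in range(25, 325, 25):               # 25 to 300 in steps of 25
    scores = []
    for seed in range(20):                        # 20 runs with varying seeds
        model = train_doc2vec(tokenised_docs, epochs=n_epochs, seed=seed)
        pred = KMeans(n_clusters=k, random_state=seed).fit_predict(model.dv.vectors)
        scores.append((adjusted_mutual_info_score(labels_true, pred),
                       adjusted_rand_score(labels_true, pred)))
    results[n_epochs] = np.mean(scores, axis=0)   # (mean AMI, mean ARI)
```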
For the #Auspol data in Figure 2(a), it is clear that doc2vec gave the best630
results and had a peak in performance at around 75 epochs. The word2vec
methods generally delivered better performance with more epochs, with a max-
imum value around 250. The tf-idf weighted mean word2vec method performed
better than the unweighted mean word2vec method, and its performance in-
creased more smoothly than the unweighted method. There was also not much635
variation over seeds as the 95% confidence bands are narrow.
On the RepLab 2013 data in Figure 2(b) the results were quite different. The
unweighted mean word2vec method gave the best performance on the NMI and
AMI measures. However, on the ARI measure both word2vec methods suffered
drops in performance after 100 epochs while the doc2vec method improved.640
This could be caused by some over-fitting of the word2vec models on the data,
which is likely since the RepLab 2013 data was much smaller than the #Auspol
data. The ARI measure is also the preferred measure where the labels have large
27
(a) #Auspol (b) RepLab 2013
Figure 2: Plot of the three evaluation measures (vertical axes) by training epoch
(horizontal axes) for 20 runs of the word2vec and doc2vec representations on Twitter
data using k-means clustering. (a) shows the results on the #Auspol Twitter data
and (b) shows the results on the RepLab 2013 Twitter data. 95% confidence bands
based on varying random seeds are shown.
volumes and are balanced (Romano et al., 2016). This data set was relatively
balanced (given in Table 3), so the ARI is the more appropriate performance645
measurement than the NMI and AMI. Overall on the RepLab 2013 data, the
optimal number of epochs for the word2vec methods was 200, while the doc2vec
method had an optimal value of 300. The higher number of optimal epochs for
28
Figure 3: Plots of the AMI and ARI evaluation measures (vertical axes) by training epoch (horizontal axes) for 20 runs of the word2vec and doc2vec representations on the Reddit data sets using k-means clustering. Different Reddit data sets by size range are given along the rows; column (a) shows the AMI results and column (b) shows the ARI results. 95% confidence bands based on varying random seeds are shown.
The higher number of optimal epochs for the doc2vec method is not surprising given that it also trains document vectors, and so has more parameters than word2vec.
Turning to the results on the four Reddit data sets in Figure 3, the doc2vec method again gave the best performance. In addition, there is an evident pattern with doc2vec where shorter documents required more training epochs to reach optimal performance. For documents with fewer than 100 characters, the performance of doc2vec with k-means clustering improved up to around 250 epochs. This dropped to 150 epochs for documents with 101 to 200 characters, then to 100 and 50 epochs for the larger document length ranges. This observed pattern aligns with the results of Lau and Baldwin (2016), confirming that doc2vec models require fewer training epochs on larger documents.
For the word2vec methods, the tf-idf weighted mean word vector method gave better performance than the unweighted mean method. This aligns with results in previous studies (Billah Nagoudi et al., 2017). On the shortest document range, both methods showed little performance improvement with more training, but then a drop in both measures at 75 epochs for the weighted word2vec method and at 50 epochs for the unweighted method. One possible explanation for this drop is that averaging word vectors may only make sense above a threshold number of words; for this size range, the average number of words per document is 4.5, which might be too low. On the 101 to 200 character documents, the weighted word2vec method gave better performance but also required fewer training epochs. These results also look similar to the results on the Twitter data sets, which typically have a similar character length range. On the largest documents, both methods required 25 or fewer epochs to reach optimal performance.
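For clarity, the weighted representation (FR2) can be sketched as follows. This is a minimal illustration assuming a trained gensim word2vec model w2v and scikit-learn's TfidfVectorizer; skipping out-of-vocabulary terms and normalising by the sum of weights are our assumed defaults rather than a definitive implementation.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def weighted_mean_vectors(docs, w2v, dim=100):
    """Represent each document as the tf-idf weighted mean of its word vectors."""
    vectoriser = TfidfVectorizer()
    tfidf = vectoriser.fit_transform(docs)            # documents x vocabulary
    vocab = vectoriser.get_feature_names_out()
    X = np.zeros((len(docs), dim))
    for d in range(len(docs)):
        row = tfidf.getrow(d)
        weight_sum = 0.0
        for idx, weight in zip(row.indices, row.data):
            word = vocab[idx]
            if word in w2v.wv:                        # skip out-of-vocabulary terms
                X[d] += weight * w2v.wv[word]
                weight_sum += weight
        if weight_sum > 0:
            X[d] /= weight_sum                        # weighted mean
    return X
```

The unweighted variant (FR3) follows by setting every weight to one.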
Through this analysis, it is clear that the doc2vec method consistently gave improved performance over the averaged word2vec methods, except in the case where the data set had a low number of documents. Furthermore, the number of training epochs for doc2vec was in general inversely proportional to the document size, with more epochs required to reach optimal performance on smaller documents. Doc2vec also generally required more training epochs than
word2vec. However, these relations were not observed for the #Auspol Twitter
data, where the doc2vec optimal epoch number was 75, below the word2vec optimum of 250. The optimal number of doc2vec epochs on the RepLab 2013 data was much higher at 300. An explanation might be that while the doc2vec model improved on its internal loss function with more training epochs on the #Auspol data, these improvements did not lead to better performance on the clustering task. This is likely because of the hashtag labels used, which may have some overlapping contributing terms. For the word2vec methods, weighting by tf-idf scores generally gave a performance lift and required fewer training epochs. However, care should be taken with the number of epochs, given the low peak on the shortest Reddit documents.
4.2. Performance Evaluation with Clustering Measures
In this section we report the mean evaluation measures for the four feature representations combined with the four clustering methods, together with the LDA model, where each method was run with 20 different random seeds on each data set. We also include distribution plots to illustrate the variability in performance.
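All three measures are available in scikit-learn, which is the implementation assumed in the minimal sketch below; the toy label and cluster assignments are illustrative only.

```python
from sklearn.metrics import (normalized_mutual_info_score,
                             adjusted_mutual_info_score,
                             adjusted_rand_score)

labels = [0, 0, 1, 1, 2, 2]  # illustrative ground truth topics
pred = [0, 0, 1, 2, 2, 2]    # illustrative cluster assignments

nmi = normalized_mutual_info_score(labels, pred)
ami = adjusted_mutual_info_score(labels, pred)  # mutual information adjusted for chance
ari = adjusted_rand_score(labels, pred)         # preferred when labels are balanced
```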
Table 7 provides the mean of each of the three evaluation measures for each method on the #Auspol Twitter data set. We set the number of epochs to 75 for the doc2vec method and 250 for the word2vec methods. It is clear from this table that the doc2vec feature representation with k-means clustering outperformed the other methods on all three evaluation measures, particularly on the ARI. Hierarchical clustering gave close scores for NMI and AMI, but a much lower ARI. For both doc2vec and word2vec feature representations, NMF performed poorly, and the performance of k-medoids clustering was similar to NMF. For the word2vec representations, k-means clustering also gave the best performance.
An interesting observation is that some methods had a relatively large drop in score between the NMI and AMI measures, indicating that the chance adjustment of the AMI is important. The tf-idf representation is the most affected by this.
Table 7: Performance evaluation of the feature representation and clustering methods
on #Auspol Twitter data with the Normalised Mutual Information (NMI), Adjusted
Mutual Information (AMI), and Adjusted Rand Index (ARI) measures.
Feature Representation Clustering NMI AMI ARI
doc2vec hierarchical .165 .154 .059
k-means .193 .191 .120
k-medoids .107 .105 .064
NMF .102 .100 .056
wtd word2vec hierarchical .088 .079 .021
k-means .105 .102 .047
k-medoids .043 .016 .001
NMF .062 .058 .030
unwtd word2vec hierarchical .085 .076 .020
k-means .094 .090 .041
k-medoids .043 .019 .001
NMF .058 .054 .025
TF-IDF hierarchical .163 .085 .013
k-means .114 .070 .014
k-medoids .079 .028 .004
NMF .132 .110 .032
LDA LDA .043 .041 .021
For instance, the tf-idf matrix with hierarchical clustering gave a high NMI of 0.163, well ahead of the word2vec methods, but an AMI of only 0.085. Comparatively, doc2vec and the word2vec methods had smaller drops. As discussed earlier, the AMI and ARI are more appropriate evaluation measures than the NMI due to their adjustment for chance. On this data set, the ARI is more appropriate as the volumes of tweets per hashtag label are relatively similar. The doc2vec representation with k-means clustering therefore far outperformed the other methods.
Table 8 shows the mean results for the RepLab 2013 Twitter data set, with the doc2vec model trained for 300 epochs and the word2vec methods trained for 200 epochs. Overall the performance is much higher than on the #Auspol data, which is explained by the RepLab 2013 data having expertly annotated topics that are more distinct.
Table 8: Performance evaluation of the feature representation and clustering methods
on RepLab 2013 Twitter data with the NMI, AMI and ARI measures.
Feature Representation Clustering NMI AMI ARI
doc2vec hierarchical .449 .437 .313
k-means .488 .478 .379
k-medoids .290 .278 .215
NMF .261 .249 .152
wtd word2vec hierarchical .506 .491 .330
k-means .488 .478 .352
k-medoids .421 .404 .274
NMF .401 .384 .266
unwtd word2vec hierarchical .519 .507 .347
k-means .508 .499 .360
k-medoids .435 .414 .278
NMF .425 .407 .286
TF-IDF hierarchical .466 .417 .203
k-means .450 .379 .179
k-medoids .192 .075 .011
NMF .437 .427 .348
LDA LDA .180 .169 .140
On the ARI score, the doc2vec method with k-means clustering performed best, but the unweighted word2vec method with hierarchical clustering gave higher performance on the NMI and AMI measures. One explanation is that this data set is too small for the embedding representations to be trained accurately, so further training does not necessarily lead to higher clustering performance. This is reflected in the sharp drops evident in Figures 2(b.i) and 2(b.ii).
To examine the variability around the mean measurements, we plot the distributions for each feature representation method with its best performing clustering algorithm, together with the LDA topic model. Figure 4 shows the distributions of the three evaluation measures over the #Auspol (a) and RepLab 2013 (b) Twitter data sets.
Figure 4: Density plots of the three evaluation measures (horizontal axes) over random seeds for the four feature representations with the best performing clustering algorithm, with LDA for comparison. (a) shows the results on the #Auspol Twitter data and (b) shows the results on the RepLab 2013 Twitter data.
In Figure 4(a), the doc2vec method with k-means clustering was distinctly ahead of the other methods on all three measures. There was also significant overlap between the results for the two word2vec methods, indicating that multiple runs are required when scores are close. Note that the tf-idf method with hierarchical clustering does not appear in the plot, since both algorithms are deterministic, so every run had the same result.
Figure 5: Plot of the three evaluation measures over random seeds for the methods
with the best performing clustering method on Reddit data with varying document
lengths in characters. (a) plots the NMI, (b) plots the AMI and (c) plots the ARI.
For the RepLab 2013 data set in Figure 4(b), the word2vec methods again showed significant overlap, with the doc2vec method performing in a lower range. It is interesting to note that the doc2vec method showed two close peaks. These peaks are most pronounced for the NMI and AMI measures, but also present for the ARI. This likely indicates that the doc2vec method optimised to local minima during training, resulting in poor performance for some of the runs over random seeds. Given that there was a large gap between the higher performance of doc2vec and the word2vec methods on the #Auspol data, but close performance between word2vec and doc2vec on RepLab 2013, the word2vec methods handled the smaller RepLab 2013 data set better than doc2vec. This may be because there were not enough data points in the RepLab 2013 data set to optimally train the doc2vec representation. Nevertheless, doc2vec still gave the best performance on the ARI measure for both Twitter data sets.
Lastly, we provide results from running the methods over the Reddit data. Figure 5 shows the NMI (a), AMI (b) and ARI (c) values for the methods on the Reddit data sets, with the horizontal axis comparing the document length partitions. Only the best performing clustering method is displayed for each feature representation. The mean scores of the evaluation measures for each method are given in Table 9 for document length ranges 1 to 100 and 101 to 200 characters, and in Table 10 for ranges 201 to 500 and 501 or greater. It is clear from these plots and mean results that the doc2vec method delivered the best performance on all four data sets by size range, corroborating the results from the #Auspol Twitter data set. The tf-idf weighted mean word2vec method consistently delivered a performance lift compared to the unweighted mean word2vec method. Interestingly, the tf-idf methods and the LDA model only gave comparable performance to the word2vec methods on the last size range, with more than 500 characters.
Table 9: Performance evaluation on the Reddit data for each method for document length ranges 1 to 100 and 101 to 200 characters.

Document Length  Feature Representation  Clustering  NMI  AMI  ARI
1 to 100 doc2vec hierarchical .029 .027 .017
k-means .034 .034 .026
k-medoids .012 .011 .004
NMF .030 .023 .015
wtd word2vec hierarchical .013 .012 .010
k-means .014 .013 .011
k-medoids .007 .003 .000
NMF .010 .009 .000
unwtd word2vec hierarchical .011 .011 .011
k-means .012 .012 .011
k-medoids .007 .006 .000
NMF .012 .011 .010
TF-IDF hierarchical .009 .003 .000
k-means .005 .002 .000
k-medoids .005 .001 .000
NMF .014 .011 .012
LDA LDA .009 .009 .003
101 to 200 doc2vec hierarchical .115 .111 .067
k-means .262 .257 .262
k-medoids .018 .006 .001
NMF .127 .096 .027
wtd word2vec hierarchical .112 .101 .032
k-means .176 .174 .144
k-medoids .036 .016 .001
NMF .116 .100 .033
unwtd word2vec hierarchical .086 .079 .027
k-means .144 .142 .114
k-medoids .020 .013 .008
NMF .089 .071 .015
TF-IDF hierarchical .009 .003 .000
k-means .005 .004 .002
k-medoids .008 .000 .000
NMF .006 .005 .000
LDA LDA .008 .007 .007
Table 10: Performance evaluation on the Reddit data for each method by document
length ranges 201 to 500 and 501 or greater.
Document Length  Feature Representation  Clustering  NMI  AMI  ARI
201 to 500 doc2vec hierarchical .261 .254 .212
k-means .487 .483 .496
k-medoids .037 .010 .002
NMF .194 .128 .044
wtd word2vec hierarchical .265 .246 .142
k-means .333 .331 .276
k-medoids .174 .172 .150
NMF .247 .226 .133
unwtd word2vec hierarchical .227 .200 .084
k-means .303 .301 .247
k-medoids .106 .103 .081
NMF .208 .183 .092
TF-IDF hierarchical .103 .061 .015
k-means .095 .085 .044
k-medoids .014 .013 .007
NMF .062 .057 .046
LDA LDA .080 .079 .071
501 + doc2vec hierarchical .532 .518 .499
k-means .686 .684 .708
k-medoids .094 .037 .007
NMF .331 .255 .154
wtd word2vec hierarchical .465 .400 .327
k-means .461 .433 .403
k-medoids .353 .330 .283
NMF .366 .325 .229
unwtd word2vec hierarchical .416 .367 .306
k-means .433 .405 .385
k-medoids .336 .322 .290
NMF .290 .242 .159
TF-IDF hierarchical .304 .244 .199
k-means .431 .382 .323
k-medoids .042 .007 .001
NMF .396 .344 .299
LDA LDA .341 .326 .291
Table 11: Top three topic labels and top three hashtags for each cluster. Note that
the topic labels did not appear in the clustering data, but were mostly recovered in
order when we created a tf-idf matrix for tweets pooled by cluster and selected the
three hashtags with the highest scores. Differences between the top three topic labels
and top three hashtags are highlighted in bold.
Cluster  Top Three Topic Labels               Top Three tf-idf Score Hashtags
1        #nbn, #lnp, #insiders                #nbn, #lnp, #insiders
2        #uspoli, #insiders, #turnbull        #uspoli, #insiders, #trump
3        #insiders, #lnp, #qldpol             #insiders, #lnp, #qldpol
4        #qldpol, #insiders, #lnp             #qldpol, #insiders, #lnp
5        #politas, #qldpol, #lnp              #politas, #utas, #discover
6        #qldpol, #qanda, #trump              #qldpol, #politas, #qanda
7        #insiders, #lnp, #qldpol             #insiders, #lnp, #qldpol
8        #qldpol, #stopadani, #springst       #qldpol, #stopadani, #springst
9        #lnp, #trump, #uspoli                #lnp, #trump, #insiders
10       #qanda, #insiders, #qldpol           #qanda, #insiders, #sayitwithstickers
11       #marriageequality, #politas, #lnp    #marriageequality, #equalitycampaign, #politas
12       #qldpol, #stopadani, #qanda          #qldpol, #qanda, #stopadani
13       #climatechange, #qldpol, #stopadani  #climatechange, #qldpol, #stopadani
4.3. Topic Interpretation
It is clear that the doc2vec model with k-means clustering delivered the best performance on the #Auspol Twitter data set and the Reddit data sets, as well as on the RepLab 2013 Twitter data set based on the ARI measure. However, the usefulness of a topic discovery model depends on how interpretable the resulting topics are. In this section we address this question through a deeper analysis of the clusters produced by the doc2vec representation with k-means clustering.
We consider firstly the results on the #Auspol data, where we analysed the extent to which the document clusters aligned to the label hashtags.
On the #Auspol data, our ground truth topic labels were the top 13 distinct hashtags, which were removed from the text prior to feature generation and clustering. These hashtags can therefore be considered latent tokens. We first identified the top three topic labels (hashtags) by frequency for each cluster. For comparison, we created a tf-idf matrix from the original data using all the hashtags, including the topic hashtags, and excluding all other tokens. We then extracted the three hashtags with the highest tf-idf scores for each cluster and compared them to the top three topic label hashtags. Table 11 outlines the results. The top topic matched the top hashtag for every cluster. Out of 39 top topics across the 13 clusters, only 7 differed from the top tf-idf hashtags (highlighted in bold), and in two clusters only the order differed. We conclude that the doc2vec clustering accurately captured the structure of the latent label hashtags.
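The hashtag comparison can be sketched as follows, assuming each tweet's hashtags are available as a space-separated string; the names hashtags_by_doc and top_hashtags_per_cluster are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def top_hashtags_per_cluster(hashtags_by_doc, clusters, k=3):
    """Pool tweets by cluster and rank hashtags by tf-idf within each pool."""
    pooled = {}
    for tags, c in zip(hashtags_by_doc, clusters):
        pooled.setdefault(c, []).append(tags)
    cluster_ids = sorted(pooled)
    docs = [' '.join(pooled[c]) for c in cluster_ids]     # one pooled document per cluster
    vectoriser = TfidfVectorizer(token_pattern=r'#\w+')   # keep only hashtag tokens
    tfidf = vectoriser.fit_transform(docs)
    vocab = np.array(vectoriser.get_feature_names_out())
    return {c: vocab[np.argsort(tfidf[i].toarray().ravel())[::-1][:k]].tolist()
            for i, c in enumerate(cluster_ids)}
```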
Another way of assessing the quality of the clustering is to analyse the overlap between ground truth labels and clusters. In the interest of space, we considered the Reddit data sets, which contain only 5 topics, and chose the data set with document sizes between 101 and 200 characters for consistency with the Twitter data sets. We then analysed the confusion matrix for the doc2vec features with k-means clustering against the ground truth labels, the subreddit pages. The results are shown in Table 12. The first cluster grouped most of the parent comments from the subreddit page ‘NFL’, and the second cluster grouped strongly around ‘pcmasterrace’. These pages clearly represent distinct topics. Clusters 3 and 4 grouped well around ‘news’ and ‘movies’ respectively, but cluster 5 is divided primarily between ‘relationships’ and ‘news’.
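Such a cross tabulation is a one-line computation; the sketch below assumes pandas, with toy subreddit labels and cluster assignments standing in for the real data.

```python
import pandas as pd

# Illustrative stand-ins for the subreddit labels and k-means cluster assignments
subreddits = ['NFL', 'NFL', 'pcmasterrace', 'news', 'movies', 'relationships']
clusters = [1, 1, 2, 3, 4, 5]

confusion = pd.crosstab(pd.Series(subreddits, name='Subreddit Page'),
                        pd.Series(clusters, name='Cluster'))
print(confusion)
```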
To further interpret the topics on this Reddit data set, we analysed the top words by cluster. For each cluster we calculated the centroid as the mean of the doc2vec representations of the documents in the cluster. Since the trained doc2vec model produces document embeddings in the same space as word embeddings, we calculated the cosine similarity between the cluster centroids and the words, the idea being that words closer to the cluster centroid may be representative of the cluster.
Table 12: Confusion matrix for the doc2vec representation with k-means clustering method on Reddit data with size range between 101 and 200 characters.
Subreddit Page Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5
NFL 1,351 60 298 273 395
pcmasterrace 78 1,295 215 185 260
news 93 89 952 204 538
movies 89 50 152 767 226
relationships 32 37 116 48 557
Table 13: Top 10 words per cluster based on combined embedding similarity score in embedding space and tf-idf score.

Cluster  Top Topic      Top 10 Words
1        NFL            talent, flacco, quarterback, tds, sb, wrs, roster, dolphins, tackle, foles
2        pcmasterrace   install, ps4, r9, mobo, gpus, i5, os, msi, processor, asus
3        news           federal, manslaughter, district, homicide, economic, isis, china, labor, upper, toke
4        movies         avengers, joss, horror, arnold, cinematography, rewatch, australian, doof, boobs, mcx
5        relationships  abusive, mentality, react, rdj, xanax, marriage, heaven, meeting, section, subjective
However, this approach does not account for the frequency of words appearing in each cluster, nor for the relative frequency of words across clusters. To incorporate this information, we pooled all the documents in each cluster and calculated a tf-idf matrix. We then created a combined score for each word and cluster as the sum of the cosine similarity and the tf-idf score. Table 13 shows the top 10 words per cluster ordered by this method. It is clear that this approach extracts very specific terms related to the main subreddit pages, particularly for clusters 1 and 2.
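A sketch of this combined scoring is given below, assuming a trained gensim Doc2Vec model whose word and document vectors share one space, a matrix of document vectors, and a dense pooled tf-idf matrix with one row per cluster; all function and variable names are illustrative only.

```python
import numpy as np

def top_words(model, doc_vectors, clusters, pooled_tfidf, vocab, cluster_id, k=10):
    """Rank words by cosine similarity to the cluster centroid plus pooled tf-idf."""
    clusters = np.asarray(clusters)
    centroid = doc_vectors[clusters == cluster_id].mean(axis=0)
    scores = {}
    for j, word in enumerate(vocab):
        if word not in model.wv:
            continue
        wv = model.wv[word]
        cosine = wv @ centroid / (np.linalg.norm(wv) * np.linalg.norm(centroid))
        # pooled_tfidf is assumed dense, with rows indexed by cluster id
        scores[word] = cosine + pooled_tfidf[cluster_id, j]
    return sorted(scores, key=scores.get, reverse=True)[:k]
```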
5. Discussion
Throughout this study it has become clear that for clustering OSN text data into topics, doc2vec feature representations combined with k-means clustering generally gave the best performance of all the methods compared. However, the cases where this method did not perform as well require discussion. On the RepLab 2013 Twitter data set, the doc2vec method gave performance below that of the mean word2vec methods on the NMI and AMI measures, but gave the best performance on the ARI after 100 epochs of training. Furthermore, the unweighted mean word2vec method performed better than the tf-idf weighted mean word2vec method on this data. Both of these results differ from the results on the other two data sets. The results on the #Auspol data and the Reddit data with document lengths between 101 and 200 characters indicate that it is not the size of each document that is the issue on the RepLab data; most likely the volume of data was not sufficient to accurately train the doc2vec model. The implication is that doc2vec models should be trained on data volumes greater than around three thousand documents. Interestingly, the Reddit data with lengths between 101 and 200 characters consisted of only 8,360 documents and doc2vec performed very well, although Reddit comments may differ considerably from tweets in the terms used.
Another interesting observation is that on the #Auspol Twitter data, the tf-idf matrix with NMF gave better performance on the NMI and AMI measures than the best clustering for both word2vec methods, although a lower score on the ARI. On the RepLab 2013 data, the word2vec methods performed better on NMI and AMI, but the tf-idf method was very close on ARI. However, on the Reddit data the tf-idf method gave very low performance until the document size exceeded 200 characters. This indicates that topics in Twitter text may rely heavily on keywords, since the tf-idf clustering performs comparatively well; this is not surprising given the use of user mentions and hashtags. The doc2vec method represented this information more effectively on the #Auspol data than the other feature methods.
Assigning a heavier weighting to hashtags and user mentions in the doc2vec model might give improved performance on Twitter data.
Two useful results stand out from this study based on the Reddit data. The first is that the optimal number of training epochs for doc2vec is inversely proportional to the average length of the documents. This result provides some guidance for future studies using OSN data. Unfortunately it was not consistent with the results on the #Auspol data, which may be due to the topic labels themselves not being clearly distinct. There is an ongoing challenge with using Twitter data, as manually labelling topics is time consuming and error prone, and the number of retrievable tweets diminishes over time. The result is consistent with the RepLab 2013 Twitter data, but as discussed above the data volume was small. The second result is that the performance of the doc2vec method increased with the length of the documents. The method gave high performance on the longest Reddit comments, so it should give good results applied to text data from OSN platforms in general.
Improving embedding representations of OSN documents can be useful for several natural language processing tasks. Such document-level representations can provide high quality feature matrices for other machine learning systems; an example application is sentiment analysis (Lee et al., 2016). In addition, it has been shown previously that pre-training the word vectors used by doc2vec provides a performance lift in several natural language processing tasks (Lau and Baldwin, 2016). Pre-training both word vectors and document vectors on large volumes of OSN data could then provide a performance lift for applications focused on specific samples of data. For instance, pre-trained document vectors could be used in streaming document classification or clustering applications. Such methods could also be applied in other domains where data can be modelled as documents with a small number of tokens. For example, embedding models are seeing applications on electronic health record data (Choi et al., 2016). In this instance, medical codes are treated as tokens and embedding models capture information about relationships between diseases and treatments, which can then be used in
subsequent prediction or clustering tasks.
6. Conclusion and Future Work
In this study we examined the performance of several document clustering and topic modelling methods on social media text data. Our results demonstrate that document and word embedding representations of online social network data can be used effectively as a basis for document clustering, outperforming traditional tf-idf based approaches and topic modelling techniques. Furthermore, doc2vec and tf-idf weighted mean word embedding representations delivered better results than simple averages of word embedding vectors in document clustering tasks. We also demonstrated that k-means clustering provided the best performance with doc2vec embeddings.
By applying these methods over the Reddit data set split by document length ranges, we outlined two key results for clustering doc2vec embeddings. Firstly, the optimal number of training epochs is in general inversely proportional to the character length of the documents. Secondly, doc2vec embeddings with k-means clustering provide good performance over all the document length ranges in the Reddit data used. These results indicate that this method should perform well on most OSN text data.
To interpret the resulting clusters, we developed a top term analysis combining tf-idf scores and word vector similarities, and demonstrated that it can provide a representative set of keywords for a topic cluster. We also showed that the doc2vec embedding with k-means clustering can successfully recover latent hashtag structure in Twitter data.
We plan several extensions to this work. Firstly, the doc2vec embeddings combined with k-means clustering can be applied readily to any social media text data; in further applications we intend to demonstrate the usefulness of this method for defining and interpreting dynamic topics in a streaming fashion. Secondly, this method may be extended to incorporate additional data available in social networks, specifically Twitter user and network data.
Thirdly, recent developments in neural embedding and deep learning techniques, such as contextualised embedding models (Peters et al., 2018), Latent LSTM Allocation (Zaheer et al., 2017) and deep learning based clustering models (Min et al., 2018), may be applied to deliver improved feature representations or document clusterings. Word and document embeddings may also be used as pre-trained initial layers in deep clustering and topic modelling techniques.
7. Acknowledgements and Declarations
This research did not receive any specific grant from funding agencies in the
public, commercial, or not-for-profit sectors.
Declarations of interest: none.
8. References
Alghamdi, R., Alfalqi, K., 2015. A survey of topic modeling in text mining. International Journal of Advanced Computer Science and Applications 3, 774–777.
Alnajran, N., Crockett, K., McLean, D., Latham, A., 2017. Cluster analysis of twitter data: A review of algorithms, in: Proceedings of the 9th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART, INSTICC, SciTePress. pp. 239–249.
Amigó, E., Carrillo de Albornoz, J., Chugur, I., Corujo, A., Gonzalo, J., Martín, T., Meij, E., de Rijke, M., Spina, D., 2013. Overview of RepLab 2013: Evaluating Online Reputation Monitoring Systems, in: Proceedings of the Fourth International Conference of the CLEF initiative, pp. 333–352.
Bakshy, E., Rosenn, I., Marlow, C., Adamic, L., 2012. The role of social networks in information diffusion, in: Proceedings of the 21st International Conference on World Wide Web, ACM, New York, NY, USA. pp. 519–528.
Bengio, Y., Ducharme, R., Vincent, P., Janvin, C., 2003. A neural probabilistic language model. Journal of Machine Learning Research 3, 1137–1155.
Billah Nagoudi, E.M., Ferrero, J., Schwab, D., 2017. LIM-LIG at SemEval-2017 Task 1: Enhancing the Semantic Similarity for Arabic Sentences with Vectors Weighting, in: International Workshop on Semantic Evaluations (SemEval-2017), Vancouver, Canada. pp. 125–129.
Bisht, S., Paul, A., 2013. Document clustering: A review. International Journal of Computer Applications 73, 26–33.
Blei, D.M., Ng, A.Y., Jordan, M.I., 2003. Latent dirichlet allocation. Journal of Machine Learning Research 3, 993–1022.
Chinnov, A., Kerschke, P., Meske, C., Stieglitz, S., Trautmann, H., 2015. An overview of topic discovery in twitter communication through social media analytics, in: Proceedings of the Americas Conference on Information Systems, pp. 1–10.
Choi, E., Bahadori, M.T., Searles, E., Coffey, C., Thompson, M., Bost, J., Tejedor-Sojo, J., Sun, J., 2016. Multi-layer representation learning for medical concepts, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA. pp. 1495–1504.
Corrêa Júnior, E.A., Marinho, V.Q., dos Santos, L.B., 2017. NILC-USP at SemEval-2017 task 4: A multi-view ensemble for twitter sentiment analysis, in: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Association for Computational Linguistics. pp. 611–615.
Curiskis, S., Drake, B., Osborn, T., Kennedy, P., submitted. Topic labelled online social network data sets from twitter and reddit. Data in Brief.
Dhingra, B., Zhou, Z., Fitzpatrick, D., Muehl, M., Cohen, W., 2016. Tweet2vec: Character-based distributed representations for social media, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics. pp. 269–274.
Fang, Y., Zhang, H., Ye, Y., Li, X., 2014. Detecting hot topics from twitter: A multiview approach. Journal of Information Science 40, 578–593.
Ferrara, E., JafariAsbagh, M., Varol, O., Qazvinian, V., Menczer, F., Flammini, A., 2013. Clustering memes in social media, in: Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ACM, New York, NY, USA. pp. 548–555.
Godfrey, D., Johns, C., Meyer, C.D., Race, S., Sadek, C., 2014. A case study in text mining: Interpreting twitter data from world cup tweets. CoRR abs/1408.5427, 1–11.
Godin, F., Vandersmissen, B., De Neve, W., Van de Walle, R., 2015. Multimedia lab @ ACL WNUT NER shared task: Named entity recognition for twitter microposts using distributed word representations, in: Proceedings of the Workshop on Noisy User-generated Text, Association for Computational Linguistics. pp. 146–153.
Guille, A., Hacid, H., Favre, C., Zighed, D., 2013. Information diffusion in online social networks: A survey. ACM SIGMOD Record 42, 17–28.
Gutman, J., Nam, R., 2015. Text classification of reddit posts. Technical Report. New York University.
Ha, T., Beijnon, B., Kim, S., Lee, S., Kim, J.H., 2017. Examining user perceptions of smartwatch through dynamic topic modeling. Telematics and Informatics 34, 1262–1273.
Hong, L., Davison, B.D., 2010. Empirical study of topic modeling in twitter, in: Proceedings of the First Workshop on Social Media Analytics, ACM, New York, NY, USA. pp. 80–88.
Irfan, R., King, C.K., Grages, D., Ewen, S., Khan, S.U., Madani, S.A., Kolodziej, J., Wang, L., Chen, D., Rayes, A., et al., 2015. A survey on text mining in social networks. The Knowledge Engineering Review 30, 157–170.
JafariAsbagh, M., Ferrara, E., Varol, O., Menczer, F., Flammini, A., 2014. Clustering memes in social media streams. Social Network Analysis and Mining 4, 237.
Klein, C., Clutton, P., Polito, V., 2018. Topic modeling reveals distinct interests within an online conspiracy forum. Frontiers in Psychology 9, 1–12.
Klinczak, M., Kaestner, C., 2016. Comparison of clustering algorithms for the identification of topics on twitter. Latin American Journal of Computing - LAJC 3, 19–26.
Lau, J.H., Baldwin, T., 2016. An empirical evaluation of doc2vec with practical insights into document embedding generation, in: Proceedings of the 1st Workshop on Representation Learning for NLP, Association for Computational Linguistics. pp. 78–86.
Le, Q.V., Mikolov, T., 2014. Distributed representations of sentences and documents, in: Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pp. 1188–1196.
Lee, S., Jin, X., Kim, W., 2016. Sentiment classification for unlabeled dataset using doc2vec with jst, in: Proceedings of the 18th Annual International Conference on Electronic Commerce: E-Commerce in Smart Connected World, ACM, New York, NY, USA. pp. 28:1–28:5.
Li, Q., Shah, S., Liu, X., Nourbakhsh, A., 2017. Data sets: Word embeddings learned from tweets and general data, in: Proceedings of the Eleventh International Conference on Web and Social Media, ICWSM 2017, Montréal, Québec, Canada, May 15-18, 2017, pp. 428–436.
Mikolov, T., Chen, K., Corrado, G., Dean, J., 2013. Efficient estimation of word representations in vector space. CoRR abs/1301.3781.
Min, E., Guo, X., Liu, Q., Zhang, G., Cui, J., Long, J., 2018. A survey of clustering with deep learning: From the perspective of network architecture. IEEE Access 6, 39501–39514.
Naik, M.P., Prajapati, H.B., Dabhi, V.K., 2015. A survey on semantic document clustering, in: 2015 IEEE International Conference on Electrical, Computer and Communication Technologies (ICECCT), pp. 1–10.
Patki, U., Khot, D.P., 2017. A literature review on text document clustering algorithms used in text mining. Journal of Engineering Computers and Applied Sciences 6, 16–20.
Paul, M.J., Dredze, M., 2014. Discovering health topics in social media using topic models. PLOS ONE 9, 1–11.
Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L., 2018. Deep contextualized word representations, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics. pp. 2227–2237.
Reddit, 2015. r/datasets - i have every publicly available reddit comment for research. 1.7 billion comments at 250 gb compressed. any interest in this? (accessed 19 january 2019). https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment.
Řehůřek, R., Sojka, P., 2010. Software Framework for Topic Modelling with Large Corpora, in: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, ELRA, Valletta, Malta. pp. 45–50.
Romano, S., Vinh, N.X., Bailey, J., Verspoor, K., 2016. Adjusting for chance clustering comparison measures. Journal of Machine Learning Research 17, 4635–4666.
Shabunina, E., Pasi, G., 2018. A graph-based approach to ememes identification and tracking in social media streams. Knowledge-Based Systems 139, 108–118.
Steinskog, A., Therkelsen, J., Gambäck, B., 2017. Twitter topic modeling by tweet aggregation, in: Proceedings of the 21st Nordic Conference on Computational Linguistics, Association for Computational Linguistics. pp. 77–86.
Stieglitz, S., Mirbabaie, M., Ross, B., Neuberger, C., 2018. Social media analytics – challenges in topic discovery, data collection, and data preparation. International Journal of Information Management 39, 156–168.
Suri, P., Roy, N.R., 2017. Comparison between LDA & NMF for event-detection from large text stream data, in: 2017 3rd International Conference on Computational Intelligence Communication Technology (CICT), pp. 1–5.
Vinh, N.X., Epps, J., Bailey, J., 2010. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research 11, 2837–2854.
Yang, X., Macdonald, C., Ounis, I., 2017. Using word embeddings in twitter election classification. Information Retrieval 21, 183–207.
Zaheer, M., Ahmed, A., Smola, A.J., 2017. Latent LSTM allocation: Joint clustering and non-linear dynamic modeling of sequence data, in: Proceedings of the 34th International Conference on Machine Learning, PMLR, International Convention Centre, Sydney, Australia. pp. 3967–3976.
Zhao, J., Lan, M., Tian, J.F., 2015. Using traditional similarity measurements and word embedding for semantic textual similarity estimation, in: 9th International Workshop on Semantic Evaluation (SemEval 2015), p. 117.