An evaluation of document clustering and topic modelling in two online social networks: Twitter and Reddit

Stephan A. Curiskis, Barry Drake, Thomas R. Osborn, Paul J. Kennedy
Centre for Artificial Intelligence, Faculty of Engineering and Information Technology, University of Technology Sydney, 15 Broadway, Ultimo, NSW 2007
Email: stephan.a.curiskis@student.uts.edu.au
Abstract

Methods for document clustering and topic modelling in online social networks (OSNs) offer a means of categorising, annotating and making sense of large volumes of user generated content. Many techniques have been developed over the years, ranging from text mining and clustering methods to latent topic models and neural embedding approaches. However, many of these methods deliver poor results when applied to OSN data as such text is notoriously short and noisy, and often results are not comparable across studies. In this study we evaluate several techniques for document clustering and topic modelling on three datasets from Twitter and Reddit. We benchmark four different feature representations derived from term-frequency inverse-document-frequency (tf-idf) matrices and word embedding models combined with four clustering methods, and we include a Latent Dirichlet Allocation topic model for comparison. Several different evaluation measures are used in the literature, so we provide a discussion and recommendation for the most appropriate extrinsic measures for this task. We also demonstrate the performance of the methods over data sets with different document lengths. Our results show that clustering techniques applied to neural embedding feature representations delivered the best performance over all data sets using appropriate extrinsic evaluation measures. We also demonstrate a method for interpreting the clusters with a top-words based approach using tf-idf weights combined with embedding distance measures.

Keywords: document clustering, topic modelling, topic discovery, embedding models, Online Social Networks
1. Introduction
In January 2018 there were estimated to be around 4.021 billion people around the world who use the internet. Of these, 3.196 billion people use social media in some form, generating a staggering amount of content.¹ Online platforms and social networks have become a key source of information for nearly half of the world's population. These platforms are increasingly being used to disseminate information regarding news, brands, political discussion, global events and more (Bakshy et al., 2012). However, much of the data generated is unstructured and not annotated. This means that it is difficult to understand how topics of information diffuse through online social networks (OSNs), and how users engage with different topics (Guille et al., 2013). Automatically annotating topics within OSNs may facilitate analysis of information diffusion and user preferences by enriching the data available from these platforms in a way that is readily analysed. With the rise of phenomena like echo chambers and filter bubbles, which lead to individuals receiving biased and narrowly focused content, the challenge of automatically annotating OSN data has become important.

¹ https://wearesocial.com/uk/blog/2018/01/global-digital-report-2018, accessed Sep. 2018
Document clustering is a set of machine learning techniques that aim to automatically organise documents into clusters, such that documents within a cluster are more similar to each other than to documents in other clusters. Many methods for clustering documents have been proposed (Bisht and Paul, 2013; Naik et al., 2015). These techniques typically involve the use of a feature matrix, such as a term-frequency inverse-document-frequency (tf-idf) matrix, to represent a corpus, with a clustering method applied to this matrix. More recently, representations derived from neural word embeddings have seen applications on social media data as they can produce dense representations with semantic properties and require less manual preprocessing than traditional methods (Li et al., 2017). Common clustering methods applied in this context build hierarchies or partitions (Irfan et al., 2015). Example hierarchical methods are agglomerative clustering and divisive clustering; example partitioning methods are k-means and k-medoids clustering.
Topic modelling involves methods to discover patterns of word use within documents, and is an active research area with several techniques recently applied to OSN data (Chinnov et al., 2015). Topics are typically defined as a distribution over words, with documents modelled as mixtures of topics. Like document clustering, topic modelling can be used to cluster documents by giving a probability distribution over a range of topics for each document. This can be viewed as a form of soft partition clustering, where the data points have a probabilistic degree of membership in each cluster. The topic representation also provides the word distribution for each topic, which aids interpretation. Commonly used topic models with applications on OSN text data include Latent Dirichlet Allocation (Blei et al., 2003), the Author-Topic model (Hong and Davison, 2010), and more recently Dynamic Topic Models, which discover topics over time (Alghamdi and Alfalqi, 2015).
Document clustering and topic modelling are increasingly important research areas as these methods can be applied to large amounts of readily available OSN text data, yielding homogeneous groups of documents. These document groups may then align to relevant topics and trends. Clustering is particularly suited to OSN data as platforms like Twitter and Facebook use hashtags as a form of topic annotation (Steinskog et al., 2017), which may be used for evaluation of document clustering and topic modelling methods. Large scale clustering can help make sense of the huge amount of content being created online every day, and can subsequently be used in further machine learning tasks. Additional features derived from OSN data (such as user demographic, geographic and network data) have also been clustered to find groups of online posts or comments that are semantically similar (Alnajran et al., 2017). However, OSN data presents many challenges when applying topic modelling and document clustering methods. For example, such text is typically short and contains noise such as misspellings and grammatical errors (Chinnov et al., 2015).
There are two key challenges with topic modelling and document clustering research on OSN data sets. Firstly, results are often not reproducible since the data used in the studies frequently cannot be published. For instance, Twitter's terms of service do not allow tweets to be published. Instead, researchers can publish a list of the tweet identifiers that were used and retrieved via the API. Unfortunately, over time the associated tweets are removed from the platform, which degrades the underlying data. The data sets used are also often small or biased towards particular contexts. These issues result from the complex data collection and preparation that is often required to extract large data sets from an OSN platform, as well as restrictions on the platforms themselves (Stieglitz et al., 2018).

Secondly, different studies often use different methods for evaluating the performance of clustered documents. Evaluation methods on Twitter data vary from extrinsic measures, which compare clusters against labelled data, to manual assessments of cluster performance and interpretability (Alnajran et al., 2017). It is therefore difficult to compare empirical results. With the fast pace of research in this area, there is little guidance on which method or family of methods will perform best in specific circumstances, such as on short Twitter data or relatively longer Reddit comments.
In this paper we provide an analysis of the performance of several methods for document clustering and topic modelling of OSN content on three data sets: two Twitter data sets and a publicly available Reddit data set. We evaluate four feature representation methods derived from tf-idf and embedding matrices combined with four clustering techniques, and include a Latent Dirichlet Allocation (LDA) topic model for comparison. We also provide a discussion of the properties and appropriateness of document clustering evaluation measures commonly used in the literature. We evaluate performance with three such measures, namely the Normalised Mutual Information (NMI), the Adjusted Mutual Information (AMI), and the Adjusted Rand Index (ARI). Furthermore, we have made our data sets available so that our results can be reproduced. To comply with Twitter's terms of use, we have made available the tweet identifiers used along with the topic labels. We have also made available the full Reddit data set used (Curiskis et al., submitted).
Further to this, by tuning key hyper-parameters we demonstrate how embedding models can be used to generate feature sets for document clustering that deliver good performance and capture latent structure in the data. We also show how word embedding distances can aid in the interpretation of the clusters by ranking the top words, forming a topic vector of words. This contribution is significant since data sets from OSNs are often short and contain noise such as misspellings, abbreviations, acronyms, special characters, emojis, URLs and hashtags. These issues can result in poor performance for many commonly used techniques. Furthermore, a clear consensus is lacking in the literature regarding methods that work effectively on OSN data. The results of this paper provide guidance on methods giving good performance over different types of OSN data. These results show that traditional topic modelling and document clustering approaches do not work well on short and noisy social media posts; instead, clustering approaches applied to more recent neural network embedding representations can deliver improved performance.

The structure of this paper is as follows. In Section 2 we review the current literature in this research area. In Section 3 we present the details of our methods, including a description of the data extraction, the preparation process, the feature representations, the clustering methods, and the evaluation measures. In Section 4 we present our results. In Section 5 we provide a discussion, followed by our conclusion in Section 6.
2. Literature Review

We organise the literature on document clustering and topic modelling of OSNs into three areas. Firstly, many studies have centred on identifying and interpreting memes in this domain, incorporating textual, network and user data. Secondly, identifying topics through topic models and clustering approaches has received much attention as a means of understanding and categorising online content. Thirdly, recent advances in neural word embedding models have been used to provide dense feature representations of documents from OSNs.
2.1. Meme Identification

The term "meme" is commonly used to represent an element of culture or system of behaviour that spreads from one individual to another by imitation. In the context of OSNs, for this paper we define a "meme" as a semantic unit expressed as electronic text, where the semantics are transferred across multiple individuals even though the text may differ. This specific definition of "meme" is sometimes called "ememe" (Shabunina and Pasi, 2018). A topic in OSN applications can be defined as a coherent set of semantically related terms which express a single argument (Guille et al., 2013). In comparison to this definition of a topic, a meme does not necessarily need to be derived from a set or distribution of words, but instead captures significant semantic content. Often in practice, however, there is an overlap between the two concepts. The concept of a meme is useful for OSN applications as it can be thought of as a latent representation of textual content, but can also be discovered through analysis of OSN user and network data.
A study by Ferrara et al. (2013) aimed to identify memes within large social media data. In that study, several similarity measures were defined for Twitter data which leverage content, metadata and network features. The authors defined the concept of a 'protomeme', which was used to refer to hashtags, user mentions, URLs and phrases. Data was aggregated by creating protomeme projections onto spaces based on tweet, user and content features. For each protomeme pair, common user, tweet, content and diffusion similarity measures were calculated. These similarity matrices were then aggregated in several different ways, such as the element-wise mean and maximum. Finally, the aggregated similarity matrix was clustered with hierarchical clustering. The resulting clusters were taken to represent memes within the data. The data set used was a collection of 5,523 tweets related to the US presidential primaries in April 2012. Twenty-six topics were manually identified and assigned as labels to each tweet. Since the memes and topics can overlap per tweet, performance was evaluated using a variation of Normalised Mutual Information designated as LFK-NMI. Given the optimal parameters for this approach, the protomeme clustering method delivered average 5-fold cross-validation LFK-NMI scores of around 0.13. JafariAsbagh et al. (2014) later extended the algorithm to work on streaming data.
More recently, Shabunina and Pasi (2018) developed a method to identify and characterise memes, considered as a set of frequently occurring related words propagating through a network over time. The relationships between terms in a social media stream were modelled using a graph of words. To identify memes, a k-core degeneracy process was applied to the graph to generate subgraphs, which constituted meme bases. A meme was defined as the fuzzy subset of terms in a meme basis. The method was applied to over 800,000 tweets from the search queries #economy, #politics and #finance. Although useful for characterising and interpreting topics in social media streams, memes were not attributed to individual social media documents or users. Evaluation of the method was limited to subjective interpretation and intrinsic measures.
2.2. Document Clustering and Topic Modelling

In contrast to methods for meme identification, many studies have focused on detecting topics in OSNs. Topic models typically refer to methods that group both documents which share similar words, and words that occur in a similar set of documents. Document clustering refers to methods that group documents according to some feature matrix, such that documents within a cluster are more similar to each other than to documents in other clusters. Due to the short document size and high degree of noise inherent in OSN data, such as Twitter data, clustering based methods are often applied in favour of more traditional topic models (Chinnov et al., 2015). Nevertheless, topic models applied to OSN data are still an active area of research (Alghamdi and Alfalqi, 2015). Indeed, the term 'topic discovery' may refer to either topic modelling or document clustering.
Document clustering methods have typically used vector space representations of word occurrence by document. Commonly, bag-of-words methods model each document as a point in the space of words. Each word is a feature or dimension of this space, with element values assigned in one of several ways: one-hot encodings, where the value is set to 1 if the word exists in the document and 0 otherwise; term frequencies; or term-frequency inverse-document-frequency calculations. Given that the total dimension size is the number of unique words, often there is a threshold cut-off to use only those words with high values (Patki and Khot, 2017). A range of clustering algorithms may then be applied to the feature matrix, such as k-means, hierarchical clustering, self-organising maps, and so on (Naik et al., 2015).
For instance, Godfrey et al. (2014) developed an algorithm to identify topics within a specific Twitter data set, a collection of about 30,000 tweets extracted using the query term 'world cup'. Non-negative Matrix Factorisation (NMF) and k-means clustering were applied to the tf-idf representation of tweets to create topic clusters. Due to the noisiness of Twitter data, Godfrey et al. (2014) developed a preliminary filtering step using multiple runs of the DBSCAN clustering algorithm combined with consensus clustering. The rationale was that tweets which are not close to any particular cluster may be treated as noise and removed from an analysis. Using this approach, k-means clustering and NMF performed similarly. However, when analysing the clusters using a subjective evaluation of tweet network diagrams and word clouds, NMF seemed to produce more interpretable clusters.
Fang et al. (2014) approached detecting topics in Twitter using additional information about the tweet. Recognising that the textual content of tweets can be quite limited, a 'multi-view' topic detection framework was developed based on more granular 'multi-relations'. These multi-relations were defined as useful relations from the Twitter social network and included hashtags, user mentions, retweets, meaningful words and similar posting times. To measure these multi-relations, a document similarity measure was developed. Multi-relation similarity scores were then combined into a multi-view and clustered using three different methods. These clusters were taken to represent topics, and a keyword extraction method, based on suffix trees and tf-idf weights, was applied to derive representative keywords for each cluster. This method was evaluated using a dataset of 12,000 tweets with 60 'hot' topics extracted from the Twitter API. Three evaluation measures were used, namely the F-measure, NMI, and entropy. The results showed that including more multi-views improved performance, with results above 0.928 on the F-measure and 0.935 on NMI. However, the authors did not remove any of the hot topic key words from the text. These key words are generally short phrases or hashtags, and can be discovered easily by tf-idf approaches.
Another study compared the efficacy of different clustering methods to detect topics in Twitter data centred around recent earthquakes in Nepal (Klinczak and Kaestner, 2016). In this study, tweets were represented by their tf-idf vectors. Four clustering methods applied to this representation were compared, namely k-means, k-medoids, DBSCAN and NMF. By evaluating each clustering method with measures for cohesion and separation of clusters (i.e. intrinsic evaluation measures), it was clear that NMF produced superior clusters which were simpler and easier to interpret. More recently, Suri and Roy (2017) applied LDA and NMF to detect topics on a Twitter data set, as well as an RSS news feed. Both methods were found to have similar performance. LDA was deemed to be more interpretable, but NMF was faster to calculate. However, performance was evaluated by manual inspection of the key terms for topics.
Many studies have applied topic modelling techniques to OSN data. For instance, Paul and Dredze (2014) developed a topic modelling framework for discovering self-reported health topics using Twitter data. 5,128 tweets were annotated with a positive status if they related to the user's health, and negative if not. A logistic regression model was trained to predict the positive labels in the annotated data, and applied to a Twitter stream filtered with a large number of health related keywords. This provided a set of 144 million health tweets which was used to run the Ailment Topic Aspect Model. While this study is useful for filtering and interpreting large amounts of relevant tweets, validation of the discovered topics focused on correlation measures against external health trend data.
Further to topic models applied to a static data set, dynamic topic models, which incorporate the temporal nature of OSN data, are gaining attention (Alghamdi and Alfalqi, 2015). Ha et al. (2017) applied dynamic topic models to Reddit data to understand user perceptions of smart watches. While these results are interesting for gauging public opinion in this area, no ground truth label was used and likewise no extrinsic evaluation measures were applied. Recently, Klein et al. (2018) applied topic modelling to reveal distinct interests in the Reddit conspiracy page (a subreddit page). NMF was used to create topic loadings for each user contributing to the page. These topic loadings were then clustered using k-means to reveal user subgroups. Again, this study is useful in understanding the user population within OSN discussion threads, but no extrinsic evaluation was made to validate the quality of the topic modelling or the clustering.
2.3. Neural Network Embedding Models

Much of the literature on clustering OSN text data used tf-idf matrix representations of tweets at some level. These matrices treat terms as one-hot encoded vectors, where each term is represented by a binary vector with exactly one non-zero element. This means that relationships between words, such as synonyms, are not incorporated, and the resulting document matrix representation is sparse and high dimensional. The concept of dense, distributional representations of words, or word embeddings, provides an alternative approach (Bengio et al., 2003). In these methods, each word is represented by a real valued vector of fixed dimension. Word embeddings are commonly trained using neural network language models, such as word2vec (Mikolov et al., 2013). However, when using word embedding models to create document level representations, the word vectors need to be aggregated in some way. Common approaches in the literature are to simply take the mean of the word vectors for all terms in the document, or to concatenate the vectors into a document vector of fixed size (Yang et al., 2017). Document representations derived from tf-idf weighted word vector averages have also been proposed (Zhao et al., 2015; Corrêa Júnior et al., 2017). Another method trains document level dense vector representations at the same time as the word vectors (Le and Mikolov, 2014). We refer to this latter method as doc2vec.
Much research has applied neural word embeddings to classification and semantic evaluation tasks. For instance, Billah Nagoudi et al. (2017) applied word embeddings to model semantic similarity between Arabic sentences. Three different sentence level aggregations were proposed, namely the sum of the word vectors for all words in a sentence, an inverse-document-frequency weighted sum of the word vectors, and a part-of-speech weighted sum. The authors found that the weighted sum representations delivered more accurate sentence similarities. In another study, Corrêa Júnior et al. (2017) developed a classification method for sentiment analysis using an ensemble of classifiers with different feature representations, namely a tf-idf matrix, a mean word vector representation, and a tf-idf weighted mean of the word vectors. Recently, Li et al. (2017) published a number of word2vec models pre-trained on a Twitter data set of 390 million English tweets with a range of pre-processing steps. Embedding representations are becoming more widely used in NLP tasks involving OSN data.
Further to word and document embeddings, character level embedding models have been proposed and applied to Twitter data, creating tweet2vec (Dhingra et al., 2016). The motivation for tweet2vec is that social media data are noisy, suffering from spelling errors, abbreviations, acronyms and special characters, which can lead to prohibitively large vocabulary sizes. Tweet2vec takes as input sequences of characters for each tweet and passes them through a bidirectional GRU neural network encoder to create a fixed dimensional tweet embedding vector. This tweet embedding is then passed through a linear softmax layer to predict the hashtags of a tweet. The algorithm was evaluated on hashtag classification performance. While this method may promise to create useful tweet embeddings, it assumes that hashtags are valid labels for tweets. This assumption may not hold, as other text, user mentions and URLs can also be important in defining the topic of a tweet, and tweets can have multiple hashtags.
Recently, contextualised extensions to word embeddings have been proposed. One challenge for traditional word embeddings is polysemy, where a word has multiple meanings dependent on the context. Peters et al. (2018) introduced a deep contextualised word embedding model, which models both the syntactic and semantic characteristics of word use, and how these uses vary across linguistic contexts. This method involves coupling embedding vectors trained from a bidirectional LSTM with a language model objective. Named ELMo (Embeddings from Language Models), the method assigns an embedding vector to each token that is a function of the entire input sentence. This technique may be useful for clustering social media documents.
In addition to the document clustering and topic modelling approaches discussed so far, a new series of deep learning based clustering methods has been developed (Min et al., 2018). Many of these techniques use deep neural networks to learn feature representations trained at the same time as clustering. Examples include several deep autoencoder networks with a clustering layer, where the loss function is a combination of reconstruction loss and clustering loss. Clustering methods based on generative models such as Variational Autoencoders and Generative Adversarial Networks look promising from a document clustering perspective since they can also generate representative samples from the clusters. However, the focus for these techniques to date has been on image data sets.
Many approaches to document clustering and topic modelling have been proposed for OSN text data. These methods typically involve creating document level feature representations with tf-idf matrices or other techniques, followed by clustering methods to group documents into semantically related clusters. However, there are many variations on these methods, and to the best of our knowledge word embedding representations have not yet been effectively applied and benchmarked on document clustering tasks in OSN data.

Figure 1: Process pipeline for document clustering (Data Extraction, three data sets used → Data Preparation → Feature Representations, four methods applied → Clustering Methods, four methods applied → Evaluation Measures, three measures used). The contribution of this paper is an evaluation of four methods for feature representation and four clustering methods using three evaluation measures over three data sets.
3. Methods

In this section we describe the three data sets used and the processing steps, the feature representations and clustering algorithms, and the evaluation measures used, with a discussion of their properties.

Document clustering and topic modelling methods applied to OSN data typically involve several processing steps, as outlined in Figure 1. Data is first extracted from a source. From the raw data set or OSN platform API, documents are extracted which consist of text data from an individual user; a tweet and a Reddit parent comment are examples of a document. The textual elements are then processed to remove common punctuation and stop words, and tokenised. Feature representations of each document are created, followed by a clustering method. Extrinsic clustering evaluation measures are then calculated using ground truth labels. The variations at each step of the process are outlined in Table 1. In the rest of this section we detail our approach to each step of Figure 1.

Table 1: Outline of the data sets, methods for feature representations and clustering, and extrinsic evaluation measures used in this study. For the three data sets, we evaluate the feature representation and clustering method combinations and the LDA topic model (17 combinations) with the three evaluation measures.

Data sets:
  Twitter stream filtered by #Auspol, 29,283 tweets
  RepLab 2013 competition Twitter data, 2,657 tweets
  Reddit data from May 2015, 40,000 parent comments
Feature representations:
  FR1  tf-idf matrix with the top 1,000 terms per document
  FR2  Mean word2vec matrix
  FR3  Mean word2vec matrix weighted by the top 1,000 tf-idf scores
  FR4  doc2vec matrix for each document
Clustering methods:
  CM1  k-means clustering
  CM2  k-medoids clustering
  CM3  Hierarchical agglomerative clustering
  CM4  Non-negative matrix factorisation (NMF)
Topic model:
  LDA  Latent Dirichlet Allocation topic model
Evaluation measures:
  NMI  Normalised Mutual Information
  AMI  Adjusted Mutual Information
  ARI  Adjusted Rand Index
3.1. Data Extraction

We used three OSN data sets for evaluation: two Twitter data sets and a Reddit data set. We used Twitter data since it has been widely used in the literature on topic modelling and document clustering. While there appear to be fewer studies which have used Reddit data, Reddit still represents a valuable source of OSN data for topic modelling and document clustering. Reddit is also used more as a discussion forum, and its comments have a wider range of document lengths than Twitter data. All three data sets have been made available (Curiskis et al., submitted).

Twitter data provides a readily accessible source of short and topical user driven content. It is widely used for research purposes, but presents many challenges due to the short tweet length and the use of hashtags, acronyms, user mentions and URLs (Stieglitz et al., 2018). The first Twitter data set was collected through Twitter's public API. It was constructed by filtering the Twitter stream for the hashtag #Auspol, which is frequently used in Australia for political discussion. A common application for document clustering on OSN data is to take a set of documents related to a particular theme and discover topics, such as the study of health topics in Twitter data (Paul and Dredze, 2014). The #Auspol Twitter data set is suitable for comparing document clustering methods since the hashtag is widely used to link a large number of disparate discussions, often with additional hashtags, related to public opinion in Australia. Data was collected between 13 June and 2 September 2017 and consisted of 1,364,326 tweets. We filtered this data set by selecting English language tweets only and removed retweets based on the retweeted status field and a text filter. This resulted in 205,895 tweets.
No ground truth topic labels exist for this data set, so we used a set of high count hashtags as ground truth labels. We further removed the search hashtag (#Auspol) from the data set, since all tweets contained this token. It is common for a tweet to have multiple hashtags, so to avoid overlapping topics we removed tweets which contained more than one of the top hashtags. We also manually removed some related hashtags, such as #ssm (same sex marriage), which is closely related to #marriageequality; we kept the latter as it was used in more tweets. Lastly, we filtered to hashtags with at least 1,000 tweets to keep the topics relatively balanced. This resulted in 29,283 tweets with 13 hashtags denoting topic labels, as given in Table 2.
Table 2: Count of tweets per hashtag in the #Auspol Twitter data set.

Topic number | Hashtag | Tweets
1 | #qldpol | 3,845
2 | #qanda | 3,592
3 | #insiders | 3,495
4 | #lnp | 3,434
5 | #politas | 2,618
6 | #marriageequality | 2,562
7 | #springst | 1,708
8 | #nbn | 1,626
9 | #trump | 1,547
10 | #uspoli | 1,498
11 | #stopadani | 1,186
12 | #climatechange | 1,148
13 | #turnbull | 1,024
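For illustration, the hashtag labelling step described above can be sketched as follows. This is our own sketch, not the authors' code; the DataFrame and column names are assumptions.

```python
# Sketch of the hashtag label construction (assumes a pandas DataFrame
# `tweets` with a `hashtags` column holding a list of lowercase tags per tweet).
from collections import Counter

counts = Counter(h for tags in tweets["hashtags"] for h in tags if h != "#auspol")
top_hashtags = {h for h, n in counts.items() if n >= 1000}

def label_of(tags):
    # Keep a tweet only if it contains exactly one top hashtag,
    # so that topic labels do not overlap.
    matched = [h for h in tags if h in top_hashtags]
    return matched[0] if len(matched) == 1 else None

tweets["label"] = tweets["hashtags"].apply(label_of)
labelled = tweets.dropna(subset=["label"])
```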
The second Twitter data set was taken from the RepLab 2013 competition (Amigó et al., 2013). This competition focused on monitoring the reputation of entities (companies and individuals), and involved tasks such as named entity recognition, polarity classification and topic detection. The tweets used in this competition were annotated with topic labels by several trained annotators supervised and monitored by reputation experts. For the purposes of this paper, the topics annotated in these tweets were taken as a gold standard. We used this data set because it has gold standard labels already annotated and has been used for topic detection tasks.

We downloaded the list of Twitter identifiers from the training and testing data sets for the topic detection task made available through the RepLab 2013 competition and retrieved the details through the Twitter API on 19 January 2019. Out of 110,344 published tweet identifiers with labelled topics, we could only retrieve the tweet text and other information for 23,684 tweets. This is likely due to tweets and users being deleted since the tweets were published. Furthermore, there is a long tail of topics labelled in this data: for the 23,684 tweets there were a total of 3,432 distinct topics, with 1,263 topics containing a single tweet. To ensure that there were sufficient data points for our methods to detect, we required a frequency count per topic of at least 100. We also removed the label denoted 'other topics' as it does not represent an internally consistent topic. After this filtering we had a data set of 2,657 tweets with 13 topic labels from the competition. The list of topic labels used is given in Table 3.
We originally included the RepLab 2013 data set primarily because comparative results for topic discovery are available from the competition. However, due to the large volume of tweets which could not be retrieved from Twitter's API, accurate comparisons are no longer possible. Nevertheless, the ground truth topic labels still allow for the performance of the methods to be benchmarked.
The third data set was from the Reddit platform and consisted of parent comments and their related comments, by subreddit page, from May 2015. The Reddit platform is widely used for discussion related to specific topics or themes, grouped by subreddit page, so it is ideal for this study. Furthermore, Reddit comments can be longer than tweets. A Reddit parent comment refers to the top comment, which may or may not have responses from other users. This data was made public on the Reddit website (Reddit, 2015). The full data set contained around 54.5 million comments on 50,138 subreddit pages. We chose this data set since it is freely available in full and contains discussion on multiple themes. It is therefore an ideal data set for benchmarking methods. We chose five subreddit pages which represent disjoint themes for analysis. These five subreddit pages were also used in a previous study benchmarking classification models (Gutman and Nam, 2015). Since parent comments and responses are inherently related, we pooled all the user posts into documents grouped by the parent comment identifier. Table 4 shows the count of parent comments per subreddit page. We randomly sampled 40,000 parent comment identifiers from across the five subreddit pages, then used the subreddit pages to denote the ground truth labels.
Reddit data is especially useful in this study since it contains a wider range of character lengths per document than Twitter data, as Twitter limits the number of characters per tweet. An evaluation of the performance of the document clustering methods by document length can provide guidance for future studies on the optimal method for a particular data set. To examine this, we partitioned the Reddit data into four distinct subsets based on the number of characters per document. Details of the four data partitions are given in Table 5. For comparison with the Twitter data sets, a tweet has a maximum of 240 characters. For the #Auspol Twitter data, the mean character length was 117, with a 25th percentile of 103 and a 75th percentile of 138. Most tweets therefore fall into the 101 to 200 character length document group.

Table 3: Count of tweets per topic label in the RepLab 2013 Twitter data set.

Topic number | Topic | Tweets
1 | For Sale | 329
2 | Suzuki cup | 296
3 | User Comments | 262
4 | Money laundering / terrorism finance | 199
5 | Record of views on YouTube | 195
6 | Fan Craze - Beliebers | 154
7 | Princeton Offense | 131
8 | For Sale - Nissan Cars, Parts and Accessories | 127
9 | Jokes | 127
10 | Sports sponsors | 127
11 | Spam | 114
12 | Ironic Criticism | 111
13 | MotoGP - User Comments | 103

Table 4: Count of parent comments per subreddit page.

Topic number | Subreddit page | Parent comments
1 | NFL | 10,563
2 | news | 9,488
3 | pcmasterrace | 9,186
4 | movies | 6,263
5 | relationships | 4,500

Table 5: Reddit data was partitioned into four sets based on document character length. Documents are grouped by the parent comment. The mean character length and mean number of tokens per document are given.

Character length range | Number of documents | Mean character length | Mean number of tokens
1 to 100 | 15,273 | 46.1 | 4.5
101 to 200 | 8,360 | 144.9 | 13.3
201 to 500 | 9,310 | 317.4 | 28.6
501 or greater | 7,057 | 1,584.5 | 141.1
3.2. Data Preparation

Data preparation and analysis in this study was conducted using Python 3.6.1. For text preprocessing, we removed the list of stopwords from the nltk 3.2.4 package and punctuation from string. A customised tokeniser function was created for tweets which retained hashtags and user mentions, and removed URLs. To tokenise the Reddit data, we simply removed punctuation and standard stopwords. We did not apply any stemming or lemmatisation. We also used the TfidfVectorizer function from sklearn 0.19.1 for the tf-idf method and the weighted word2vec method.
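For illustration, a minimal sketch of this preprocessing is given below. The paper specifies only the packages used; the exact tokenisation rules and the regular expression here are our own assumptions.

```python
# Minimal preprocessing sketch, assuming the packages named above
# (nltk stopwords, string punctuation, sklearn's TfidfVectorizer).
import re
import string
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

STOPWORDS = set(stopwords.words("english"))
# Strip punctuation but keep '#' and '@' so hashtags and mentions survive.
PUNCT = str.maketrans("", "", string.punctuation.replace("#", "").replace("@", ""))

def tokenise_tweet(text):
    text = re.sub(r"https?://\S+", "", text.lower())  # remove URLs
    tokens = text.translate(PUNCT).split()
    return [t for t in tokens if t not in STOPWORDS]

# tf-idf features (FR1), limited to the top 1,000 terms by frequency.
vectoriser = TfidfVectorizer(tokenizer=tokenise_tweet, max_features=1000)
# tfidf = vectoriser.fit_transform(docs)  # docs: list of raw document strings
```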
For the #Auspol Twitter data, we removed the list of 14 hashtags taken as ground truth labels from the text, in addition to the #Auspol Twitter API search query. The RepLab 2013 Twitter data set had annotated topic labels that were not based directly on any individual tokens, so no modification was required. For the Reddit data, as the subreddit page was used as the ground truth label, we did not need to modify the text.
3.3. Feature Representations

In this study we evaluated the performance of four methods for constructing feature representations of documents, combined with four commonly used clustering algorithms. We also included an LDA topic model in a separate topic models category, since that technique only takes a bag-of-words matrix as input. These methods are outlined in Table 1, where each method component is given a code for ease of reference: the four feature representations are coded as FR1-FR4, the four clustering methods as CM1-CM4, and the LDA topic model simply as LDA. While many other techniques have been proposed in the literature, such as the meme identification studies (JafariAsbagh et al., 2014; Shabunina and Pasi, 2018), we did not implement them for evaluation as they are specific to data from Twitter. However, we provide comparison results in our discussion where they were available from other studies.

For FR1, the tf-idf matrix was limited to the top 1,000 terms per document by frequency, since no performance improvement was gained by including more terms. This is likely due to the short nature of social media text, which produces sparse tf-idf feature vectors; terms with lower frequency would not generally be useful in clustering.
A word2vec model is a neural network trained to create a dense vector of fixed dimension for each token in a corpus. While a pre-trained word2vec model is available for Twitter data (Godin et al., 2015), we found that it did not perform well on the Twitter data sets used in this study. One issue was that many tokens in the data were out of the trained model's vocabulary; in addition, the semantic relationships between words may be very different on different data sets. Additionally, a pre-trained model on a large amount of Reddit data was not available. Furthermore, there are many hyper-parameters in these models, so finding an ideal set of values for different data sets is a useful contribution. For these reasons, we trained our own word embedding and document embedding models.

The word2vec models used in FR2 and FR3 were trained with the continuous bag of words (CBOW) method (Mikolov et al., 2013), 100 dimensions, a context window of size 5 and a minimum word count of 1. We tested variations of these hyper-parameters, including context window sizes ranging from 3 to 15, higher dimensions and higher minimum word counts. We found that the variation in performance on the three clustering evaluation measures was minimal and the chosen hyper-parameters were optimal. Some of these results make sense given the short document length of social media text. We concluded that 100 dimensions for word2vec was sufficient to represent words for short documents. The mean number of tokens per tweet was 9, and the 75th percentile was 11, so a context window of size 5 captured all the tokens of most tweets. However, we did find significant variation with the number of training epochs used for the three data sets; we report on this analysis in Section 4.1. For all other hyper-parameters, we used the default values provided by the gensim 3.4.0 python package (Řehůřek and Sojka, 2010).
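A sketch of this training setup with gensim is given below (our own illustration, not the authors' code). Parameter names follow the current gensim 4.x API; in the gensim 3.4.0 release used in this study, vector_size and epochs were named size and iter.

```python
# Word2vec training sketch matching the stated hyper-parameters:
# CBOW, 100 dimensions, context window 5, minimum word count 1.
# `tokenised_docs` is a list of token lists from the preprocessing step.
from gensim.models import Word2Vec

w2v = Word2Vec(
    sentences=tokenised_docs,
    vector_size=100,   # `size=` in gensim 3.x
    window=5,
    min_count=1,
    sg=0,              # 0 selects CBOW
    epochs=250,        # `iter=` in gensim 3.x; tuned per data set (Section 4.1)
)
```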
FR2 was constructed by taking the element-wise mean of the word vectors for each token in each document, returning a dense feature vector of 100 dimensions. FR3 was constructed by taking the tf-idf weighted mean of the word vectors for each word of a document. The tf-idf matrix used was the top 1,000 term matrix by frequency constructed in FR1. This process excluded any word vectors that were not in the top 1,000 tf-idf terms, although again this was tried with larger numbers of top terms, for which the evaluation measures were found to decrease. We discuss the evaluation measures used in Section 3.5.
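A minimal sketch of these two aggregations, under the same assumptions as the previous snippets (a trained w2v model and a fitted vectoriser), might look as follows.

```python
# FR2/FR3 sketch: unweighted and tf-idf weighted means of word vectors.
# `tfidf_row` is the document's row of the (sparse) tf-idf matrix.
import numpy as np

vocab = vectoriser.vocabulary_  # term -> column index of the tf-idf matrix

def fr2(tokens):
    # Element-wise mean of the word vectors (FR2).
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(100)

def fr3(tokens, tfidf_row):
    # tf-idf weighted mean, restricted to the top 1,000 tf-idf terms (FR3).
    pairs = [(w2v.wv[t], tfidf_row[0, vocab[t]])
             for t in tokens if t in w2v.wv and t in vocab]
    pairs = [(v, w) for v, w in pairs if w > 0]
    if not pairs:
        return np.zeros(100)
    vecs, weights = zip(*pairs)
    return np.average(np.array(vecs), axis=0, weights=np.array(weights))
```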
A doc2vec model is a neural network trained to create a dense vector of fixed dimension for each document in a corpus. The doc2vec models in FR4 were trained with 100 dimensions using the distributed bag of words method (dbow), a context window of size 5 and a minimum word count of 1. The distributed bag of words method was used since it can train both word vectors and document vectors in the same embedding space (Le and Mikolov, 2014), which was useful for interpreting the document embeddings. As with the word2vec models, we tested variations of the hyper-parameters and found that the evaluation measures varied significantly with the number of training epochs, and that different data sets had different optimal epochs. This is similar to the results of Lau and Baldwin (2016), where a dbow doc2vec model trained on 4.3 million words had an optimal number of epochs of 20, while the optimal number was 400 for a data set of 0.5 million words. Lau and Baldwin (2016) also found that the optimal number of dimensions was 300 and the optimal window size was 15. The lower optimal values for our method are likely due to the short document lengths of OSN data, as well as the lower word count of our data sets, especially the Twitter data.
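A corresponding doc2vec sketch is shown below. Mapping the described setup onto gensim's dbow_words flag is our assumption; parameter names again follow gensim 4.x.

```python
# FR4 sketch: dbow doc2vec with 100 dimensions, window 5, minimum count 1.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

tagged = [TaggedDocument(words=tokens, tags=[i])
          for i, tokens in enumerate(tokenised_docs)]
d2v = Doc2Vec(
    documents=tagged,
    dm=0,             # distributed bag of words (dbow)
    dbow_words=1,     # also train word vectors in the same embedding space
    vector_size=100,
    window=5,
    min_count=1,
    epochs=75,        # tuned per data set (see Section 4.1)
)
doc_matrix = d2v.dv.vectors  # one 100-d vector per document (`docvecs` in gensim 3.x)
```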
3.4. Clustering Methods

For the clustering methods, we selected four techniques commonly used in the literature (Klinczak and Kaestner, 2016; Naik et al., 2015) which also gave comparable results on our data sets. Firstly, we applied a k-means clustering algorithm (CM1) using the Euclidean metric and a maximum of 100 iterations. The algorithm was run multiple times over the data with varying random seeds. CM2 refers to the k-medoids algorithm, for which we used the pyclustering 0.8.2 python package with starting centroids sampled according to a uniform distribution. Both k-means and k-medoids clustering were used in Klinczak and Kaestner (2016). For CM3 we applied a hierarchical agglomerative clustering algorithm with the Euclidean metric and Ward linkage. Hierarchical agglomerative clustering was used in Ferrara et al. (2013) to cluster a similarity matrix. For CM4 we used a Non-negative Matrix Factorisation (NMF) algorithm, with the default parameters in the sklearn 0.19.1 package. NMF has seen multiple applications for topic modelling in OSN data (Godfrey et al., 2014; Klein et al., 2018). For the clustering methods and the LDA model, we set the number of clusters or components equal to the number of unique labels in the evaluation data. In line with Klinczak and Kaestner (2016), we tested the DBSCAN clustering algorithm with a range of hyper-parameters, but found that it delivered poor performance for all feature representations: the documents would either be grouped into an outlier cluster, or into a large number of very small clusters. A possible reason for this is that the feature representations are high dimensional and sparse, so may not cluster well using density based approaches.
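For illustration, CM1, CM3 and CM4 can be sketched with sklearn as follows (CM2 used the pyclustering package and is omitted for brevity); this is our own sketch, not the exact configuration used in the study.

```python
# Sketch of CM1, CM3 and CM4. `X` is a documents-by-features matrix
# (e.g. FR2-FR4) and `k` is the number of unique ground truth labels.
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.decomposition import NMF

cm1_labels = KMeans(n_clusters=k, max_iter=100).fit_predict(X)
cm3_labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)

# NMF requires a non-negative input, such as the tf-idf matrix (FR1);
# each document is assigned to its highest-loading component.
W = NMF(n_components=k).fit_transform(X_tfidf)
cm4_labels = W.argmax(axis=1)
```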
The LDA topic model was trained with 10 passes, a chunk size of 10,000 and an update after every record. We again used the default values for the other hyper-parameters in the gensim 3.4.0 package. We included this method since it is commonly used in document clustering and topic modelling. To assign a topic label to each document, we chose the topic with the highest probability.
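A gensim sketch of this LDA configuration, including the highest-probability topic assignment, might look as follows (our own illustration).

```python
# LDA sketch matching the stated settings:
# 10 passes, chunk size 10,000, update every record.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

dictionary = Dictionary(tokenised_docs)
bow = [dictionary.doc2bow(tokens) for tokens in tokenised_docs]
lda = LdaModel(corpus=bow, id2word=dictionary, num_topics=k,
               passes=10, chunksize=10000, update_every=1)

# Label each document with its most probable topic.
labels = [max(lda.get_document_topics(doc), key=lambda t: t[1])[0]
          for doc in bow]
```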
3.5. Evaluation Measures

Measures used for evaluating document clustering methods typically fall into two categories: intrinsic and extrinsic measures. Intrinsic measures, such as measures of cluster separation and cohesion, do not require a ground truth label. Such measures describe the variation within clusters and between clusters. However, they are dependent on the feature representations used, so do not give comparable results for methods which use different feature sets. Extrinsic measures require a ground truth label, but can be compared across methods. Common extrinsic measures include precision, recall and F1 (Naik et al., 2015), but these depend on the matching of cluster labels to ground truth labels, which is a problem with a large number of labels. Measures such as mutual information and the Rand index are more appropriate in this case as they are independent of the absolute values of the labels.
Mutual information is a measure of the mutual dependence between two discrete random variables. It quantifies the reduction in uncertainty about one discrete random variable given knowledge of another. High mutual information indicates a large reduction in uncertainty. For two discrete random variables $X$ and $Y$ with joint probability distribution $p(x, y)$, the mutual information, $MI(X, Y)$, is given by

$$MI(X, Y) = \sum_{y \in Y} \sum_{x \in X} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}.$$

A commonly used measure is the normalised mutual information (NMI), which normalises the MI to take values between 0 and 1, with 0 representing no mutual information and 1 representing perfect agreement. This is useful to compare results across methods and studies. The NMI is given by

$$NMI(X, Y) = \frac{MI(X, Y)}{\sqrt{H(X)\,H(Y)}},$$

where $H(X)$ and $H(Y)$ denote the marginal entropies, given by

$$H(X) = -\sum_{i=1}^{n} p(x_i) \log p(x_i).$$
The Rand index is a pair counting measure of similarity between the labels and clusters. It also takes values between 0 and 1, with 0 indicating that the partitions agree on no pair of elements and 1 representing identical partitions. Given a set of elements $S = \{o_1, \ldots, o_n\}$ and two partitions of $S$ to compare, $X = \{X_1, \ldots, X_r\}$ and $Y = \{Y_1, \ldots, Y_s\}$, the Rand index represents the frequency with which the partitions $X$ and $Y$ are in agreement, over the total number of observation pairs. Mathematically, the Rand index, $RI$, is given by

$$RI(X, Y) = \frac{a + b}{a + b + c + d} = \frac{a + b}{\binom{n}{2}},$$

where $a$ represents the number of pairs of elements in $S$ that are in the same subset in $X$ and the same subset in $Y$, and $b$ represents the number of pairs of elements in $S$ that are in different subsets of $X$ and different subsets of $Y$. Values $a$ and $b$ together give the number of times the partitions are in agreement. The value $c$ represents the number of pairs of elements in $S$ that are in the same subset of $X$ and different subsets of $Y$, and $d$ gives the number of pairs of elements in $S$ that are in different subsets of $X$ and the same subset of $Y$.
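As a small worked example (ours, for illustration): take $S = \{o_1, o_2, o_3, o_4\}$ with $X = \{\{o_1, o_2\}, \{o_3, o_4\}\}$ and $Y = \{\{o_1, o_2, o_3\}, \{o_4\}\}$. Of the $\binom{4}{2} = 6$ pairs, $(o_1, o_2)$ is together in both partitions ($a = 1$); $(o_1, o_4)$ and $(o_2, o_4)$ are separated in both ($b = 2$); $(o_3, o_4)$ is together in $X$ but not in $Y$ ($c = 1$); and $(o_1, o_3)$ and $(o_2, o_3)$ are separated in $X$ but together in $Y$ ($d = 2$). Hence $RI = (1 + 2)/6 = 0.5$.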
For extrinsic clustering evaluation measures to be useful for comparison across methods and studies, such measures need a fixed bound and a constant baseline value. Both the NMI and the RI are scaled to have values between 0 and 1, so they satisfy the first condition. However, it has been shown that both measures increase monotonically with the number of labels, even with an arbitrary cluster assignment (Vinh et al., 2010). This is because neither the mutual information nor the Rand index has a constant baseline, implying that these measures are not comparable across clustering methods with different numbers of clusters. To account for this, adjusted versions of the MI and RI have been proposed. The adjusted Rand index, ARI, adjusts the RI by its expected value:

$$ARI(X, Y) = \frac{RI(X, Y) - E\{RI(X, Y)\}}{\max\{RI(X, Y)\} - E\{RI(X, Y)\}},$$

where $E\{RI(X, Y)\}$ denotes the expected value of $RI(X, Y)$. The ARI takes a maximum value of 1, representing identical partitions, and is adjusted for the number of partitions in $X$ and $Y$. In a similar way, the adjusted mutual information, AMI, is given by

$$AMI(X, Y) = \frac{MI(X, Y) - E\{MI(X, Y)\}}{\max\{H(X), H(Y)\} - E\{MI(X, Y)\}},$$

where $E\{MI(X, Y)\}$ represents the expected value of the MI (Vinh et al., 2010). The AMI likewise takes a maximum value of 1, representing identical partitions, and is adjusted for the number of partitions used. The measures that best ensure a comparable evaluation are therefore the AMI and the ARI. The next question is how these two measures compare to each other. By developing theory regarding generalised information theoretic measures, Romano et al. (2016) concluded that the AMI is the preferable measure when the labels are unbalanced and there are small clusters, while the ARI should be used when the labels have large and similarly sized volumes.
In this paper, we report the AMI, ARI and NMI measures. Many previous studies have reported the NMI measure, so for comparison purposes we include it in our evaluation. Given the data and methods of this study, it is likely that the ARI is more appropriate than the AMI, as Tables 2 and 4 show that the distribution of documents across labels is relatively balanced. We still include the AMI since it is interesting to see how much its results may differ from the NMI.

Due to the short and noisy nature of the data sets used in this study, we examined the effect of different random seeds on performance. We ran each method 20 times with different random seeds, calculated the mean of the NMI, AMI and ARI, and plotted the distributions of these measures.
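All three measures are available in sklearn; for illustration, given ground truth topic labels and cluster assignments:

```python
# The three extrinsic measures, as implemented in sklearn.
# `labels_true`: ground truth topic labels; `labels_pred`: cluster assignments.
from sklearn.metrics import (adjusted_mutual_info_score,
                             adjusted_rand_score,
                             normalized_mutual_info_score)

nmi = normalized_mutual_info_score(labels_true, labels_pred)
ami = adjusted_mutual_info_score(labels_true, labels_pred)
ari = adjusted_rand_score(labels_true, labels_pred)
```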
4. Results

In this section we present the results of our analysis. We first describe the results on the optimal number of epochs for the word2vec and doc2vec embedding representations, applied to all three data sets. We then evaluate the performance of all the methods. Lastly, we discuss methods for the interpretation of the topics using the doc2vec feature representation.
4.1. Optimal Training Epochs for Embedding Models

A key hyper-parameter for training neural network models is the number of epochs: with too many epochs the model may overfit the data, and with too few, performance may be poor. We first explored how the performance of the mean word2vec models (FR2 and FR3) and the doc2vec model (FR4) changes with the number of epochs. These results provide guidance for studies where a ground truth topic label is not present. We used k-means clustering (CM1) as the clustering method as it gave the best results for the embedding representations. For each epoch value between 25 and 300, in increments of 25, we trained the models 20 times using different random seeds and evaluated against the ground truth labels. This was done for all three data sets. Table 6 summarises the optimal epoch results by method and data set.
Table 6: Optimal number of training epochs for the word2vec and doc2vec methods on the three data sets.

Data set | doc2vec | wtd. word2vec | unwtd. word2vec
Twitter #Auspol | 75 | 250 | 250
Twitter RepLab 2013 | 300 | 200 | 200
Reddit: 1 to 100 | 175 | 75 | 50
Reddit: 101 to 200 | 150 | 100 | 200
Reddit: 201 to 500 | 100 | 50 | 50
Reddit: 501+ | 50 | 25 | 25
The plots for this analysis on the #Auspol Twitter data are shown in Figure 2(a) and on the RepLab 2013 data in Figure 2(b). The results for the Reddit data are shown in Figure 3. To save space, we only evaluated the AMI and ARI measures on the Reddit data, since the AMI typically gives similar results to the NMI but is chance adjusted.
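The epoch sweep described above can be sketched as follows (our own loop structure; train_doc2vec is a hypothetical helper wrapping the doc2vec training of Section 3.3, and k, labels_true and tokenised_docs are as before).

```python
# Sketch of the epoch sweep: for each epoch setting, train 20 seeded models,
# cluster with k-means and average the chance-adjusted measures.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score

results = {}
for n_epochs in range(25, 325, 25):               # 25 to 300 in steps of 25
    scores = []
    for seed in range(20):                        # 20 runs with varying seeds
        model = train_doc2vec(tokenised_docs, epochs=n_epochs, seed=seed)
        pred = KMeans(n_clusters=k, random_state=seed).fit_predict(model.dv.vectors)
        scores.append((adjusted_mutual_info_score(labels_true, pred),
                       adjusted_rand_score(labels_true, pred)))
    results[n_epochs] = np.mean(scores, axis=0)   # (mean AMI, mean ARI)
```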
For the #Auspol data in Figure 2(a), it is clear that doc2vec gave the best630
results and had a peak in performance at around 75 epochs. The word2vec
methods generally delivered better performance with more epochs, with a max-
imum value around 250. The tf-idf weighted mean word2vec method performed
better than the unweighted mean word2vec method, and its performance in-
creased more smoothly than the unweighted method. There was also not much635
variation over seeds as the 95% confidence bands are narrow.
On the RepLab 2013 data in Figure 2(b) the results were quite different. The
unweighted mean word2vec method gave the best performance on the NMI and
AMI measures. However, on the ARI measure both word2vec methods suffered
drops in performance after 100 epochs while the doc2vec method improved.640
This could be caused by some over-fitting of the word2vec models on the data,
which is likely since the RepLab 2013 data was much smaller than the #Auspol
data. The ARI measure is also the preferred measure where the labels have large
27
(a) #Auspol (b) RepLab 2013
Figure 2: Plot of the three evaluation measures (vertical axes) by training epoch
(horizontal axes) for 20 runs of the word2vec and doc2vec representations on Twitter
data using k-means clustering. (a) shows the results on the #Auspol Twitter data
and (b) shows the results on the RepLab 2013 Twitter data. 95% confidence bands
based on varying random seeds are shown.
volumes and are balanced (Romano et al., 2016). This data set was relatively
balanced (given in Table 3), so the ARI is the more appropriate performance645
measurement than the NMI and AMI. Overall on the RepLab 2013 data, the
optimal number of epochs for the word2vec methods was 200, while the doc2vec
method had an optimal value of 300. The higher number of optimal epochs for
28
Figure 3: Plots of the AMI and ARI evaluation measures (vertical axes) by training epoch (horizontal axes) for 20 runs of the word2vec and doc2vec representations on the Reddit data sets using k-means clustering. Different Reddit data sets by size range are given along the rows; column (a) shows the AMI results and column (b) shows the ARI results. 95% confidence bands based on varying random seeds are shown.
The higher number of optimal epochs for the doc2vec method is not surprising given that it also trains document vectors, and so has more parameters than word2vec.
Turning to the results on the four Reddit data sets in Figure 3, the doc2vec method again gave the best performance. In addition, there is an evident pattern with doc2vec where shorter documents required more training epochs to reach optimal performance. For documents with fewer than 100 characters, the performance of doc2vec with k-means clustering improved up to around 250 epochs. This dropped to 150 epochs for documents with 101 to 200 characters, then to 100 and 50 epochs for the larger document length ranges. This observed pattern aligns with the results of Lau and Baldwin (2016), confirming that doc2vec models require fewer training epochs on larger documents.
For the word2vec methods, the tf-idf weighted mean word vector method gave better performance than the unweighted mean method. This aligns with results in previous studies (Billah Nagoudi et al., 2017). On the shortest document range, both methods showed little performance improvement with more training, but then a drop in both measures at 75 epochs for the weighted word2vec method and at 50 epochs for the unweighted method. One possible explanation for this drop is that averaging word vectors may only make sense above a threshold number of words; for this size range, the average number of words per document is 4.5, which might be too low. On the 101 to 200 character documents, the weighted word2vec method gave better performance but also required fewer training epochs. These results also look similar to the results on the Twitter data sets, which typically have a similar character length range. On the largest documents, both methods required 25 or fewer epochs to reach optimal performance.
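For clarity, the weighted representation (FR2) can be sketched as follows. This is a minimal illustration assuming a trained gensim word2vec model w2v and scikit-learn's TfidfVectorizer; skipping out-of-vocabulary terms and normalising by the sum of weights are our assumed defaults rather than a definitive implementation.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def weighted_mean_vectors(docs, w2v, dim=100):
    """Represent each document as the tf-idf weighted mean of its word vectors."""
    vectoriser = TfidfVectorizer()
    tfidf = vectoriser.fit_transform(docs)            # documents x vocabulary
    vocab = vectoriser.get_feature_names_out()
    X = np.zeros((len(docs), dim))
    for d in range(len(docs)):
        row = tfidf.getrow(d)
        weight_sum = 0.0
        for idx, weight in zip(row.indices, row.data):
            word = vocab[idx]
            if word in w2v.wv:                        # skip out-of-vocabulary terms
                X[d] += weight * w2v.wv[word]
                weight_sum += weight
        if weight_sum > 0:
            X[d] /= weight_sum                        # weighted mean
    return X
```

The unweighted variant (FR3) follows by setting every weight to one.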
Through this analysis, it is clear that the doc2vec method consistently gave improved performance over the averaged word2vec methods, except in the case where the data set had a low number of documents. Furthermore, the number of training epochs for doc2vec was in general inversely proportional to the document size, with more epochs required to reach optimal performance on smaller documents. Doc2vec also generally required more training epochs than
word2vec. However, these relations were not observed for the #Auspol Twitter
data, where the doc2vec optimal epoch number was 75, below the word2vec optimum of 250. The optimal number of doc2vec epochs on the RepLab 2013 data was much higher at 300. An explanation might be that while the doc2vec model improved on its internal loss function with more training epochs on the #Auspol data, these improvements did not lead to better performance on the clustering task. This is likely because of the hashtag labels used, which may have some overlapping contributing terms. For the word2vec methods, weighting by tf-idf scores generally gave a performance lift and required fewer training epochs. However, care should be taken with the number of epochs, given the low peak on the shortest Reddit documents.
4.2. Performance Evaluation with Clustering Measures
In this section we report the mean evaluation measures for the four feature representations combined with the four clustering methods, together with the LDA model, where each method was run with 20 different random seeds on each data set. We also include distribution plots to illustrate the variability in performance.
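All three measures are available in scikit-learn, which is the implementation assumed in the minimal sketch below; the toy label and cluster assignments are illustrative only.

```python
from sklearn.metrics import (normalized_mutual_info_score,
                             adjusted_mutual_info_score,
                             adjusted_rand_score)

labels = [0, 0, 1, 1, 2, 2]  # illustrative ground truth topics
pred = [0, 0, 1, 2, 2, 2]    # illustrative cluster assignments

nmi = normalized_mutual_info_score(labels, pred)
ami = adjusted_mutual_info_score(labels, pred)  # mutual information adjusted for chance
ari = adjusted_rand_score(labels, pred)         # preferred when labels are balanced
```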
Table 7 provides the mean of each of the three evaluation measures for each method on the #Auspol Twitter data set. We set the number of epochs to 75 for the doc2vec method and 250 for the word2vec methods. It is clear from this table that the doc2vec feature representation with k-means clustering outperformed the other methods on all three evaluation measures, particularly on the ARI. Hierarchical clustering gave close scores for NMI and AMI, but a much lower ARI. For both doc2vec and word2vec feature representations, NMF performed poorly, and the performance of k-medoids clustering was similar to NMF. For the word2vec representations, k-means clustering also gave the best performance.
An interesting observation is that some methods had a relatively large drop in score between the NMI and AMI measures, indicating that the chance adjustment of the AMI is important. The tf-idf representation is the most affected by this.
Table 7: Performance evaluation of the feature representation and clustering methods
on #Auspol Twitter data with the Normalised Mutual Information (NMI), Adjusted
Mutual Information (AMI), and Adjusted Rand Index (ARI) measures.
Feature Representation Clustering NMI AMI ARI
doc2vec hierarchical .165 .154 .059
k-means .193 .191 .120
k-medoids .107 .105 .064
NMF .102 .100 .056
wtd word2vec hierarchical .088 .079 .021
k-means .105 .102 .047
k-medoids .043 .016 .001
NMF .062 .058 .030
unwtd word2vec hierarchical .085 .076 .020
k-means .094 .090 .041
k-medoids .043 .019 .001
NMF .058 .054 .025
TF-IDF hierarchical .163 .085 .013
k-means .114 .070 .014
k-medoids .079 .028 .004
NMF .132 .110 .032
LDA LDA .043 .041 .021
For instance, the tf-idf matrix with hierarchical clustering gave a high NMI of 0.163, well ahead of the word2vec methods, but an AMI of only 0.085. Comparatively, doc2vec and the word2vec methods had smaller drops. As discussed earlier, the AMI and ARI are more appropriate evaluation measures than the NMI due to their adjustment for chance. On this data set, the ARI is more appropriate as the volumes of tweets per hashtag label are relatively similar. The doc2vec representation with k-means clustering therefore far outperformed the other methods.
Table 8 shows the mean results for the RepLab 2013 Twitter data set, with the doc2vec model trained for 300 epochs and the word2vec methods trained for 200 epochs. Overall the performance is much higher than on the #Auspol data, which is explained by the RepLab 2013 data having expertly annotated topics that are more distinct.
Table 8: Performance evaluation of the feature representation and clustering methods
on RepLab 2013 Twitter data with the NMI, AMI and ARI measures.
Feature Representation Clustering NMI AMI ARI
doc2vec hierarchical .449 .437 .313
k-means .488 .478 .379
k-medoids .290 .278 .215
NMF .261 .249 .152
wtd word2vec hierarchical .506 .491 .330
k-means .488 .478 .352
k-medoids .421 .404 .274
NMF .401 .384 .266
unwtd word2vec hierarchical .519 .507 .347
k-means .508 .499 .360
k-medoids .435 .414 .278
NMF .425 .407 .286
TF-IDF hierarchical .466 .417 .203
k-means .450 .379 .179
k-medoids .192 .075 .011
NMF .437 .427 .348
LDA LDA .180 .169 .140
On the ARI score, the doc2vec method with k-means clustering performed best, but the unweighted word2vec method with hierarchical clustering gave higher performance on the NMI and AMI measures. One explanation is that this data set is too small for the embedding representations to be trained accurately, so further training does not necessarily lead to higher clustering performance. This is reflected in the sharp drops evident in Figures 2(b.i) and 2(b.ii).
To examine the variability around the mean measurements, we plot the distributions for each feature representation method with its best performing clustering algorithm, together with the LDA topic model. Figure 4 shows the distributions of the three evaluation measures over the #Auspol (a) and RepLab 2013 (b) Twitter data sets.
Figure 4: Density plots of the three evaluation measures (horizontal axes) over random seeds for the four feature representations with the best performing clustering algorithm, with LDA for comparison. (a) shows the results on the #Auspol Twitter data and (b) shows the results on the RepLab 2013 Twitter data.
In Figure 4(a), the doc2vec method with k-means clustering was distinctly ahead of the other methods on all three measures. There was also significant overlap between the results for the two word2vec methods, indicating that multiple runs are required when scores are close. Note that the tf-idf method with hierarchical clustering does not appear in the plot, since both algorithms are deterministic, so every run had the same result.
Figure 5: Plot of the three evaluation measures over random seeds for the methods
with the best performing clustering method on Reddit data with varying document
lengths in characters. (a) plots the NMI, (b) plots the AMI and (c) plots the ARI.
For the RepLab 2013 data set in Figure 4(b), the word2vec methods again showed significant overlap, with the doc2vec method performing in a lower range. It is interesting to note that the doc2vec method showed two close peaks. These peaks are most pronounced for the NMI and AMI measures, but also present for the ARI. This likely indicates that the doc2vec method optimised to local minima during training, resulting in poor performance for some of the runs over random seeds. Given that there was a large gap between the higher performance of doc2vec and the word2vec methods on the #Auspol data, but close performance between word2vec and doc2vec on RepLab 2013, the word2vec methods handled the smaller RepLab 2013 data set better than doc2vec. This may be because there were not enough data points in the RepLab 2013 data set to optimally train the doc2vec representation. Nevertheless, doc2vec still gave the best performance on the ARI measure for both Twitter data sets.
Lastly, we provide results from running the methods over the Reddit data. Figure 5 shows the NMI (a), AMI (b) and ARI (c) values for the methods on the Reddit data sets, with the horizontal axis comparing the document length partitions. Only the best performing clustering method is displayed for each feature representation. The mean scores of the evaluation measures for each method are given in Table 9 for document length ranges 1 to 100 and 101 to 200 characters, and in Table 10 for ranges 201 to 500 and 501 or greater. It is clear from these plots and mean results that the doc2vec method delivered the best performance on all four data sets by size range, corroborating the results from the #Auspol Twitter data set. The tf-idf weighted mean word2vec method consistently delivered a performance lift compared to the unweighted mean word2vec method. Interestingly, the tf-idf methods and the LDA model only gave comparable performance to the word2vec methods on the last size range, with more than 500 characters.
Table 9: Performance evaluation on the Reddit data for each method for document length ranges 1 to 100 and 101 to 200 characters.

Document Length  Feature Representation  Clustering  NMI  AMI  ARI
1 to 100 doc2vec hierarchical .029 .027 .017
k-means .034 .034 .026
k-medoids .012 .011 .004
NMF .030 .023 .015
wtd word2vec hierarchical .013 .012 .010
k-means .014 .013 .011
k-medoids .007 .003 .000
NMF .010 .009 .000
unwtd word2vec hierarchical .011 .011 .011
k-means .012 .012 .011
k-medoids .007 .006 .000
NMF .012 .011 .010
TF-IDF hierarchical .009 .003 .000
k-means .005 .002 .000
k-medoids .005 .001 .000
NMF .014 .011 .012
LDA LDA .009 .009 .003
101 to 200 doc2vec hierarchical .115 .111 .067
k-means .262 .257 .262
k-medoids .018 .006 .001
NMF .127 .096 .027
wtd word2vec hierarchical .112 .101 .032
k-means .176 .174 .144
k-medoids .036 .016 .001
NMF .116 .100 .033
unwtd word2vec hierarchical .086 .079 .027
k-means .144 .142 .114
k-medoids .020 .013 .008
NMF .089 .071 .015
TF-IDF hierarchical .009 .003 .000
k-means .005 .004 .002
k-medoids .008 .000 .000
NMF .006 .005 .000
LDA LDA .008 .007 .007
Table 10: Performance evaluation on the Reddit data for each method by document
length ranges 201 to 500 and 501 or greater.
Document Length  Feature Representation  Clustering  NMI  AMI  ARI
201 to 500 doc2vec hierarchical .261 .254 .212
k-means .487 .483 .496
k-medoids .037 .010 .002
NMF .194 .128 .044
wtd word2vec hierarchical .265 .246 .142
k-means .333 .331 .276
k-medoids .174 .172 .150
NMF .247 .226 .133
unwtd word2vec hierarchical .227 .200 .084
k-means .303 .301 .247
k-medoids .106 .103 .081
NMF .208 .183 .092
TF-IDF hierarchical .103 .061 .015
k-means .095 .085 .044
k-medoids .014 .013 .007
NMF .062 .057 .046
LDA LDA .080 .079 .071
501 + doc2vec hierarchical .532 .518 .499
k-means .686 .684 .708
k-medoids .094 .037 .007
NMF .331 .255 .154
wtd word2vec hierarchical .465 .400 .327
k-means .461 .433 .403
k-medoids .353 .330 .283
NMF .366 .325 .229
unwtd word2vec hierarchical .416 .367 .306
k-means .433 .405 .385
k-medoids .336 .322 .290
NMF .290 .242 .159
TF-IDF hierarchical .304 .244 .199
k-means .431 .382 .323
k-medoids .042 .007 .001
NMF .396 .344 .299
LDA LDA .341 .326 .291
Table 11: Top three topic labels and top three hashtags for each cluster. Note that
the topic labels did not appear in the clustering data, but were mostly recovered in
order when we created a tf-idf matrix for tweets pooled by cluster and selected the
three hashtags with the highest scores. Differences between the top three topic labels
and top three hashtags are highlighted in bold.
Cluster  Top Three Topic Labels               Top Three tf-idf Score Hashtags
1        #nbn, #lnp, #insiders                #nbn, #lnp, #insiders
2        #uspoli, #insiders, #turnbull        #uspoli, #insiders, #trump
3        #insiders, #lnp, #qldpol             #insiders, #lnp, #qldpol
4        #qldpol, #insiders, #lnp             #qldpol, #insiders, #lnp
5        #politas, #qldpol, #lnp              #politas, #utas, #discover
6        #qldpol, #qanda, #trump              #qldpol, #politas, #qanda
7        #insiders, #lnp, #qldpol             #insiders, #lnp, #qldpol
8        #qldpol, #stopadani, #springst       #qldpol, #stopadani, #springst
9        #lnp, #trump, #uspoli                #lnp, #trump, #insiders
10       #qanda, #insiders, #qldpol           #qanda, #insiders, #sayitwithstickers
11       #marriageequality, #politas, #lnp    #marriageequality, #equalitycampaign, #politas
12       #qldpol, #stopadani, #qanda          #qldpol, #qanda, #stopadani
13       #climatechange, #qldpol, #stopadani  #climatechange, #qldpol, #stopadani
4.3. Topic Interpretation
It is clear that the doc2vec model with k-means clustering delivered the best performance on the #Auspol Twitter data set and the Reddit data sets, as well as on the RepLab 2013 Twitter data set based on the ARI measure. However, the usefulness of a topic discovery model depends on how interpretable the resulting topics are. In this section we address this question through a deeper analysis of the clusters produced by the doc2vec representation with k-means clustering.
We consider firstly the results on the #Auspol data, where we analysed the extent to which the document clusters aligned to the label hashtags.
On the #Auspol data, our ground truth topic labels were the top 13 distinct hashtags, which were removed from the text prior to feature generation and clustering. These hashtags can therefore be considered latent tokens. We first identified the top three topic labels (hashtags) by frequency for each cluster. For comparison, we created a tf-idf matrix from the original data using all the hashtags, including the topic hashtags, and excluding all other tokens. We then extracted the three hashtags with the highest tf-idf scores for each cluster and compared them to the top three topic label hashtags. Table 11 outlines the results. The top topic matched the top hashtag for every cluster. Out of 39 top topics across the 13 clusters, only 7 differed from the top tf-idf hashtags (highlighted in bold), and in two clusters only the order differed. We conclude that the doc2vec clustering accurately captured the structure of the latent label hashtags.
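The hashtag comparison can be sketched as follows, assuming each tweet's hashtags are available as a space-separated string; the names hashtags_by_doc and top_hashtags_per_cluster are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def top_hashtags_per_cluster(hashtags_by_doc, clusters, k=3):
    """Pool tweets by cluster and rank hashtags by tf-idf within each pool."""
    pooled = {}
    for tags, c in zip(hashtags_by_doc, clusters):
        pooled.setdefault(c, []).append(tags)
    cluster_ids = sorted(pooled)
    docs = [' '.join(pooled[c]) for c in cluster_ids]     # one pooled document per cluster
    vectoriser = TfidfVectorizer(token_pattern=r'#\w+')   # keep only hashtag tokens
    tfidf = vectoriser.fit_transform(docs)
    vocab = np.array(vectoriser.get_feature_names_out())
    return {c: vocab[np.argsort(tfidf[i].toarray().ravel())[::-1][:k]].tolist()
            for i, c in enumerate(cluster_ids)}
```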
Another way of assessing the quality of the clustering is to analyse the overlap between ground truth labels and clusters. In the interest of space, we considered the Reddit data sets, which contain only 5 topics, and chose the data set with document sizes between 101 and 200 characters for consistency with the Twitter data sets. We then analysed the confusion matrix for the doc2vec features with k-means clustering against the ground truth labels, the subreddit pages. The results are shown in Table 12. The first cluster grouped most of the parent comments from the subreddit page ‘NFL’, and the second cluster grouped strongly around ‘pcmasterrace’. These pages clearly represent distinct topics. Clusters 3 and 4 grouped well around ‘news’ and ‘movies’ respectively, but cluster 5 is divided primarily between ‘relationships’ and ‘news’.
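Such a cross tabulation is a one-line computation; the sketch below assumes pandas, with toy subreddit labels and cluster assignments standing in for the real data.

```python
import pandas as pd

# Illustrative stand-ins for the subreddit labels and k-means cluster assignments
subreddits = ['NFL', 'NFL', 'pcmasterrace', 'news', 'movies', 'relationships']
clusters = [1, 1, 2, 3, 4, 5]

confusion = pd.crosstab(pd.Series(subreddits, name='Subreddit Page'),
                        pd.Series(clusters, name='Cluster'))
print(confusion)
```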
To further interpret the topics on this Reddit data set, we analysed the top words by cluster. For each cluster we calculated the centroid as the mean of the doc2vec representations of the documents in the cluster. Since the trained doc2vec model produces document embeddings in the same space as word embeddings, we calculated the cosine similarity between the cluster centroids and the words, the idea being that words closer to the cluster centroid may be representative of the cluster.
Table 12: Confusion matrix for the doc2vec representation with k-means clustering method on Reddit data with size range between 101 and 200 characters.
Subreddit Page Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5
NFL 1,351 60 298 273 395
pcmasterrace 78 1,295 215 185 260
news 93 89 952 204 538
movies 89 50 152 767 226
relationships 32 37 116 48 557
Table 13: Top 10 words per cluster based on combined embedding similarity score in embedding space and tf-idf score.

Cluster  Top Topic      Top 10 Words
1        NFL            talent, flacco, quarterback, tds, sb, wrs, roster, dolphins, tackle, foles
2        pcmasterrace   install, ps4, r9, mobo, gpus, i5, os, msi, processor, asus
3        news           federal, manslaughter, district, homicide, economic, isis, china, labor, upper, toke
4        movies         avengers, joss, horror, arnold, cinematography, rewatch, australian, doof, boobs, mcx
5        relationships  abusive, mentality, react, rdj, xanax, marriage, heaven, meeting, section, subjective
However, this approach does not account for the frequency of words appearing in each cluster, nor for the relative frequency of words across clusters. To incorporate this information, we pooled all the documents in each cluster and calculated a tf-idf matrix. We then created a combined score for each word and cluster as the sum of the cosine similarity and the tf-idf score. Table 13 shows the top 10 words per cluster ordered by this method. It is clear that this approach extracts very specific terms related to the main subreddit pages, particularly for clusters 1 and 2.
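A sketch of this combined scoring is given below, assuming a trained gensim Doc2Vec model whose word and document vectors share one space, a matrix of document vectors, and a dense pooled tf-idf matrix with one row per cluster; all function and variable names are illustrative only.

```python
import numpy as np

def top_words(model, doc_vectors, clusters, pooled_tfidf, vocab, cluster_id, k=10):
    """Rank words by cosine similarity to the cluster centroid plus pooled tf-idf."""
    clusters = np.asarray(clusters)
    centroid = doc_vectors[clusters == cluster_id].mean(axis=0)
    scores = {}
    for j, word in enumerate(vocab):
        if word not in model.wv:
            continue
        wv = model.wv[word]
        cosine = wv @ centroid / (np.linalg.norm(wv) * np.linalg.norm(centroid))
        # pooled_tfidf is assumed dense, with rows indexed by cluster id
        scores[word] = cosine + pooled_tfidf[cluster_id, j]
    return sorted(scores, key=scores.get, reverse=True)[:k]
```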
5. Discussion
Throughout this study it has become clear that for clustering OSN text data into topics, doc2vec feature representations combined with k-means clustering generally gave the best performance of all the methods compared. However, the cases where this method did not perform as well require discussion. On the RepLab 2013 Twitter data set, the doc2vec method gave performance below that of the mean word2vec methods on the NMI and AMI measures, but gave the best performance on the ARI after 100 epochs of training. Furthermore, the unweighted mean word2vec method performed better than the tf-idf weighted mean word2vec method on this data. Both of these results differ from the results on the other two data sets. The results on the #Auspol data and the Reddit data with document lengths between 101 and 200 characters indicate that it is not the size of each document that is the issue on the RepLab data; most likely the volume of data was not sufficient to accurately train the doc2vec model. The implication is that doc2vec models should be trained on data volumes greater than around three thousand documents. Interestingly, the Reddit data with lengths between 101 and 200 characters consisted of only 8,360 documents and doc2vec performed very well, although Reddit comments may differ considerably from tweets in the terms used.
Another interesting observation is that on the #Auspol Twitter data, the tf-idf matrix with NMF gave better performance on the NMI and AMI measures than the best clustering for both word2vec methods, although a lower score on the ARI. On the RepLab 2013 data, the word2vec methods performed better on NMI and AMI, but the tf-idf method was very close on ARI. However, on the Reddit data the tf-idf method gave very low performance until the document size exceeded 200 characters. This indicates that topics in Twitter text may rely heavily on keywords, since the tf-idf clustering performs comparatively well; this is not surprising given the use of user mentions and hashtags. The doc2vec method represented this information more effectively on the #Auspol data than the other feature methods.
Assigning a heavier weighting to hashtags and user mentions in the doc2vec model might give improved performance on Twitter data.
Two useful results stand out from this study based on the Reddit data. The first is that the optimal number of training epochs for doc2vec is inversely proportional to the average length of the documents. This result provides some guidance for future studies using OSN data. Unfortunately it was not consistent with the results on the #Auspol data, which may be due to the topic labels themselves not being clearly distinct. There is an ongoing challenge with using Twitter data, as manually labelling topics is time consuming and error prone, and the number of retrievable tweets diminishes over time. The result is consistent with the RepLab 2013 Twitter data, but as discussed above the data volume was small. The second result is that the performance of the doc2vec method increased with the length of the documents. The method gave high performance on the longest Reddit comments, so it should give good results applied to text data from OSN platforms in general.
Improving embedding representations of OSN documents can be useful for several natural language processing tasks. Such document-level representations can provide high quality feature matrices for other machine learning systems; an example application is sentiment analysis (Lee et al., 2016). In addition, it has been shown previously that pre-training the word vectors used by doc2vec provides a performance lift in several natural language processing tasks (Lau and Baldwin, 2016). Pre-training both word vectors and document vectors on large volumes of OSN data could then provide a performance lift for applications focused on specific samples of data. For instance, pre-trained document vectors could be used in streaming document classification or clustering applications. Such methods could also be applied in other domains where data can be modelled as documents with a small number of tokens. For example, embedding models are seeing applications on electronic health record data (Choi et al., 2016). In this instance, medical codes are treated as tokens and embedding models capture information about relationships between diseases and treatments, which can then be used in
subsequent prediction or clustering tasks.
6. Conclusion and Future Work
In this study we examined the performance of several document clustering and topic modelling methods on social media text data. Our results demonstrate that document and word embedding representations of online social network data can be used effectively as a basis for document clustering, outperforming traditional tf-idf based approaches and topic modelling techniques. Furthermore, doc2vec and tf-idf weighted mean word embedding representations delivered better results than simple averages of word embedding vectors in document clustering tasks. We also demonstrated that k-means clustering provided the best performance with doc2vec embeddings.
By applying these methods over the Reddit data set split by document length ranges, we outlined two key results for clustering doc2vec embeddings. Firstly, the optimal number of training epochs is in general inversely proportional to the character length of the documents. Secondly, doc2vec embeddings with k-means clustering provide good performance over all the document length ranges in the Reddit data used. These results indicate that this method should perform well on most OSN text data.
To interpret the resulting clusters, we developed a top term analysis combining tf-idf scores and word vector similarities, and demonstrated that it can provide a representative set of keywords for a topic cluster. We also showed that the doc2vec embedding with k-means clustering can successfully recover latent hashtag structure in Twitter data.
We plan several extensions to this work. Firstly, the doc2vec embeddings combined with k-means clustering can be applied readily to any social media text data; in further applications we intend to demonstrate the usefulness of this method for defining and interpreting dynamic topics in a streaming fashion. Secondly, this method may be extended to incorporate additional data available in social networks, specifically Twitter user and network data.
Thirdly, recent developments in neural embedding and deep learning techniques, such as contextualised embedding models (Peters et al., 2018), Latent LSTM Allocation (Zaheer et al., 2017) and deep learning based clustering models (Min et al., 2018), may be applied to deliver improved feature representations or document clusterings. Word and document embeddings may also be used as pre-trained initial layers in deep clustering and topic modelling techniques.
7. Acknowledgements and Declarations
This research did not receive any specific grant from funding agencies in the
public, commercial, or not-for-profit sectors.
Declarations of interest: none.
8. References
Alghamdi, R., Alfalqi, K., 2015. A survey of topic modeling in text mining. International Journal of Advanced Computer Science and Applications 3, 774–777.
Alnajran, N., Crockett, K., McLean, D., Latham, A., 2017. Cluster analysis of twitter data: A review of algorithms, in: Proceedings of the 9th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART, INSTICC, SciTePress. pp. 239–249.
Amigó, E., Carrillo de Albornoz, J., Chugur, I., Corujo, A., Gonzalo, J., Martín, T., Meij, E., de Rijke, M., Spina, D., 2013. Overview of RepLab 2013: Evaluating Online Reputation Monitoring Systems, in: Proceedings of the Fourth International Conference of the CLEF initiative, pp. 333–352.
Bakshy, E., Rosenn, I., Marlow, C., Adamic, L., 2012. The role of social networks in information diffusion, in: Proceedings of the 21st International Conference on World Wide Web, ACM, New York, NY, USA. pp. 519–528.
Bengio, Y., Ducharme, R., Vincent, P., Janvin, C., 2003. A neural probabilistic language model. Journal of Machine Learning Research 3, 1137–1155.
Billah Nagoudi, E.M., Ferrero, J., Schwab, D., 2017. LIM-LIG at SemEval-2017 Task 1: Enhancing the Semantic Similarity for Arabic Sentences with Vectors Weighting, in: International Workshop on Semantic Evaluations (SemEval-2017), Vancouver, Canada. pp. 125–129.
Bisht, S., Paul, A., 2013. Document clustering: A review. International Journal of Computer Applications 73, 26–33.
Blei, D.M., Ng, A.Y., Jordan, M.I., 2003. Latent dirichlet allocation. Journal of Machine Learning Research 3, 993–1022.
Chinnov, A., Kerschke, P., Meske, C., Stieglitz, S., Trautmann, H., 2015. An overview of topic discovery in twitter communication through social media analytics, in: Proceedings of the Americas Conference on Information Systems, pp. 1–10.
Choi, E., Bahadori, M.T., Searles, E., Coffey, C., Thompson, M., Bost, J., Tejedor-Sojo, J., Sun, J., 2016. Multi-layer representation learning for medical concepts, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA. pp. 1495–1504.
Corrêa Júnior, E.A., Marinho, V.Q., dos Santos, L.B., 2017. NILC-USP at SemEval-2017 task 4: A multi-view ensemble for twitter sentiment analysis, in: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Association for Computational Linguistics. pp. 611–615.
Curiskis, S., Drake, B., Osborn, T., Kennedy, P., submitted. Topic labelled online social network data sets from twitter and reddit. Data in Brief.
Dhingra, B., Zhou, Z., Fitzpatrick, D., Muehl, M., Cohen, W., 2016. Tweet2vec: Character-based distributed representations for social media, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics. pp. 269–274.
Fang, Y., Zhang, H., Ye, Y., Li, X., 2014. Detecting hot topics from twitter: A multiview approach. Journal of Information Science 40, 578–593.
Ferrara, E., JafariAsbagh, M., Varol, O., Qazvinian, V., Menczer, F., Flammini, A., 2013. Clustering memes in social media, in: Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ACM, New York, NY, USA. pp. 548–555.
Godfrey, D., Johns, C., Meyer, C.D., Race, S., Sadek, C., 2014. A case study in text mining: Interpreting twitter data from world cup tweets. CoRR abs/1408.5427, 1–11.
Godin, F., Vandersmissen, B., De Neve, W., Van de Walle, R., 2015. Multimedia lab @ ACL WNUT NER shared task: Named entity recognition for twitter microposts using distributed word representations, in: Proceedings of the Workshop on Noisy User-generated Text, Association for Computational Linguistics. pp. 146–153.
Guille, A., Hacid, H., Favre, C., Zighed, D., 2013. Information diffusion in online social networks: A survey. ACM SIGMOD Record 42, 17–28.
Gutman, J., Nam, R., 2015. Text classification of reddit posts. Technical Report. New York University.
Ha, T., Beijnon, B., Kim, S., Lee, S., Kim, J.H., 2017. Examining user perceptions of smartwatch through dynamic topic modeling. Telematics and Informatics 34, 1262–1273.
Hong, L., Davison, B.D., 2010. Empirical study of topic modeling in twitter, in: Proceedings of the First Workshop on Social Media Analytics, ACM, New York, NY, USA. pp. 80–88.
Irfan, R., King, C.K., Grages, D., Ewen, S., Khan, S.U., Madani, S.A., Kolodziej, J., Wang, L., Chen, D., Rayes, A., et al., 2015. A survey on text mining in social networks. The Knowledge Engineering Review 30, 157–170.
JafariAsbagh, M., Ferrara, E., Varol, O., Menczer, F., Flammini, A., 2014. Clustering memes in social media streams. Social Network Analysis and Mining 4, 237.
Klein, C., Clutton, P., Polito, V., 2018. Topic modeling reveals distinct interests within an online conspiracy forum. Frontiers in Psychology 9, 1–12.
Klinczak, M., Kaestner, C., 2016. Comparison of clustering algorithms for the identification of topics on twitter. Latin American Journal of Computing - LAJC 3, 19–26.
Lau, J.H., Baldwin, T., 2016. An empirical evaluation of doc2vec with practical insights into document embedding generation, in: Proceedings of the 1st Workshop on Representation Learning for NLP, Association for Computational Linguistics. pp. 78–86.
Le, Q.V., Mikolov, T., 2014. Distributed representations of sentences and documents, in: Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pp. 1188–1196.
Lee, S., Jin, X., Kim, W., 2016. Sentiment classification for unlabeled dataset using doc2vec with jst, in: Proceedings of the 18th Annual International Conference on Electronic Commerce: E-Commerce in Smart Connected World, ACM, New York, NY, USA. pp. 28:1–28:5.
Li, Q., Shah, S., Liu, X., Nourbakhsh, A., 2017. Data sets: Word embeddings learned from tweets and general data, in: Proceedings of the Eleventh International Conference on Web and Social Media, ICWSM 2017, Montréal, Québec, Canada, May 15-18, 2017, pp. 428–436.
Mikolov, T., Chen, K., Corrado, G., Dean, J., 2013. Efficient estimation of word representations in vector space. CoRR abs/1301.3781.
Min, E., Guo, X., Liu, Q., Zhang, G., Cui, J., Long, J., 2018. A survey of clustering with deep learning: From the perspective of network architecture. IEEE Access 6, 39501–39514.
Naik, M.P., Prajapati, H.B., Dabhi, V.K., 2015. A survey on semantic document clustering, in: 2015 IEEE International Conference on Electrical, Computer and Communication Technologies (ICECCT), pp. 1–10.
Patki, U., Khot, D.P., 2017. A literature review on text document clustering algorithms used in text mining. Journal of Engineering Computers and Applied Sciences 6, 16–20.
Paul, M.J., Dredze, M., 2014. Discovering health topics in social media using topic models. PLOS ONE 9, 1–11.
Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L., 2018. Deep contextualized word representations, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics. pp. 2227–2237.
Reddit, 2015. r/datasets - i have every publicly available reddit comment for research. 1.7 billion comments at 250 gb compressed. any interest in this? (accessed 19 january 2019). https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment.
Řehůřek, R., Sojka, P., 2010. Software Framework for Topic Modelling with Large Corpora, in: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, ELRA, Valletta, Malta. pp. 45–50.
Romano, S., Vinh, N.X., Bailey, J., Verspoor, K., 2016. Adjusting for chance clustering comparison measures. Journal of Machine Learning Research 17, 4635–4666.
Shabunina, E., Pasi, G., 2018. A graph-based approach to ememes identification and tracking in social media streams. Knowledge-Based Systems 139, 108–118.
Steinskog, A., Therkelsen, J., Gambäck, B., 2017. Twitter topic modeling by tweet aggregation, in: Proceedings of the 21st Nordic Conference on Computational Linguistics, Association for Computational Linguistics. pp. 77–86.
Stieglitz, S., Mirbabaie, M., Ross, B., Neuberger, C., 2018. Social media analytics – challenges in topic discovery, data collection, and data preparation. International Journal of Information Management 39, 156–168.
Suri, P., Roy, N.R., 2017. Comparison between LDA & NMF for event-detection from large text stream data, in: 2017 3rd International Conference on Computational Intelligence Communication Technology (CICT), pp. 1–5.
Vinh, N.X., Epps, J., Bailey, J., 2010. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research 11, 2837–2854.
Yang, X., Macdonald, C., Ounis, I., 2017. Using word embeddings in twitter election classification. Information Retrieval 21, 183–207.
Zaheer, M., Ahmed, A., Smola, A.J., 2017. Latent LSTM allocation: Joint clustering and non-linear dynamic modeling of sequence data, in: Proceedings of the 34th International Conference on Machine Learning, PMLR, International Convention Centre, Sydney, Australia. pp. 3967–3976.
Zhao, J., Lan, M., Tian, J.F., 2015. Using traditional similarity measurements and word embedding for semantic textual similarity estimation, in: 9th International Workshop on Semantic Evaluation (SemEval 2015), p. 117.