Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 262–272,
Edinburgh, Scotland, UK, July 27–31, 2011. © 2011 Association for Computational Linguistics
Optimizing Semantic Coherence in Topic Models
David Mimno
Princeton University
Princeton, NJ 08540
mimno@cs.princeton.edu
Hanna M. Wallach
University of Massachusetts, Amherst
Amherst, MA 01003
wallach@cs.umass.edu
Edmund Talley, Miriam Leenders
National Institutes of Health
Bethesda, MD 20892
{talleye,leenderm}@ninds.nih.gov
Andrew McCallum
University of Massachusetts, Amherst
Amherst, MA 01003
mccallum@cs.umass.edu
Abstract
Latent variable models have the potential
to add value to large document collections
by discovering interpretable, low-dimensional
subspaces. In order for people to use such
models, however, they must trust them. Un-
fortunately, typical dimensionality reduction
methods for text, such as latent Dirichlet al-
location, often produce low-dimensional sub-
spaces (topics) that are obviously flawed to
human domain experts. The contributions of
this paper are threefold: (1) An analysis of the
ways in which topics can be flawed; (2) an au-
tomated evaluation metric for identifying such
topics that does not rely on human annotators
or reference collections outside the training
data; (3) a novel statistical topic model based
on this metric that significantly improves topic
quality in a large-scale document collection
from the National Institutes of Health (NIH).
1 Introduction
Statistical topic models such as latent Dirichlet al-
location (LDA) (Blei et al., 2003) provide a pow-
erful framework for representing and summarizing
the contents of large document collections. In our
experience, however, the primary obstacle to acceptance
of statistical topic models by users outside the
machine learning community is the presence of
poor-quality topics. Topics that mix unrelated or
loosely related concepts substantially reduce users’
confidence in the utility of such automated systems.
In general, users prefer models with larger num-
bers of topics because such models have greater res-
olution and are able to support finer-grained distinc-
tions. Unfortunately, we have observed that there
is a strong relationship between the size of topics
and the probability of topics being nonsensical as
judged by domain experts: as the number of topics
increases, the smallest topics (number of word to-
kens assigned to each topic) are almost always poor
quality. The common practice of displaying only a
small number of example topics hides the fact that as
many as 10% of topics may be so bad that they can-
not be shown without reducing users’ confidence.
The evaluation of statistical topic models has tra-
ditionally been dominated by either extrinsic meth-
ods (i.e., using the inferred topics to perform some
external task such as information retrieval (Wei
and Croft, 2006)) or quantitative intrinsic methods,
such as computing the probability of held-out doc-
uments (Wallach et al., 2009). Recent work has
focused on evaluation of topics as semantically-
coherent concepts. For example, Chang et al. (2009)
found that the probability of held-out documents is
not always a good predictor of human judgments.
Newman et al. (2010) showed that an automated
evaluation metric based on word co-occurrence
statistics gathered from Wikipedia could predict hu-
man evaluations of topic quality. AlSumait et al.
(2009) used differences between topic-specific dis-
tributions over words and the corpus-wide distribu-
tion over words to identify overly-general “vacuous”
topics. Finally, Andrzejewski et al. (2009) devel-
oped semi-supervised methods that avoid specific
user-labeled semantic coherence problems.
The contributions of this paper are threefold: (1)
To identify distinct classes of low-quality topics,
some of which are not flagged by existing evalua-
tion methods; (2) to introduce a new topic “coher-
ence” score that corresponds well with human co-
herence judgments and makes it possible to identify
specific semantic problems in topic models without
human evaluations or external reference corpora; (3)
to present an example of a new topic model that
learns latent topics by directly optimizing a metric
of topic coherence. With little additional computa-
tional cost beyond that of LDA, this model exhibits
significant gains in average topic coherence score.
Although the model does not result in a statistically-
significant reduction in the number of topics marked
“bad”, the model consistently improves the topic co-
herence score of the ten lowest-scoring topics (i.e.,
results in bad topics that are “less bad” than those
found using LDA) while retaining the ability to iden-
tify low-quality topics without human interaction.
2 Latent Dirichlet Allocation
LDA is a generative probabilistic model for documents $\mathcal{W} = \{\mathbf{w}^{(1)}, \mathbf{w}^{(2)}, \ldots, \mathbf{w}^{(D)}\}$. To generate a word token $w_n^{(d)}$ in document $d$, we draw a discrete topic assignment $z_n^{(d)}$ from a document-specific distribution over the $T$ topics $\theta_d$ (which is itself drawn from a Dirichlet prior with hyperparameter $\alpha$), and then draw a word type for that token from the topic-specific distribution over the vocabulary $\phi_{z_n^{(d)}}$. The inference task in topic models is generally cast as inferring the document–topic proportions $\{\theta_1, \ldots, \theta_D\}$ and the topic-specific distributions $\{\phi_1, \ldots, \phi_T\}$.
The multinomial topic distributions are usually drawn from a shared symmetric Dirichlet prior with hyperparameter $\beta$, such that conditioned on $\{\phi_t\}_{t=1}^{T}$ and the topic assignments $\{\mathbf{z}^{(1)}, \mathbf{z}^{(2)}, \ldots, \mathbf{z}^{(D)}\}$, the word tokens are independent. In practice, however, it is common to deal directly with the “collapsed” distributions that result from integrating over the topic-specific multinomial parameters. The resulting distribution over words for a topic $t$ is then a function of the hyperparameter $\beta$ and the number of words of each type assigned to that topic, $N_{w|t}$. This distribution, known as the Dirichlet compound multinomial (DCM) or Pólya distribution (Doyle and Elkan, 2009), breaks the assumption of conditional independence between word tokens given topics, but is useful during inference because the conditional probability of a word $w$ given topic $t$ takes a very simple form: $P(w \mid t, \beta) = \frac{N_{w|t} + \beta}{N_t + |\mathcal{V}|\beta}$, where $N_t = \sum_{w'} N_{w'|t}$ and $|\mathcal{V}|$ is the vocabulary size.
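The collapsed conditional probability above can be computed directly from per-topic word counts. The following is a minimal sketch, not the authors' implementation; the function name `collapsed_word_prob`, the toy vocabulary, and the counts are hypothetical and chosen only for illustration.

```python
from collections import Counter


def collapsed_word_prob(word, topic_counts, vocab_size, beta=0.01):
    """P(w | t, beta) = (N_{w|t} + beta) / (N_t + |V| * beta)."""
    n_t = sum(topic_counts.values())  # N_t: total tokens assigned to the topic
    return (topic_counts.get(word, 0) + beta) / (n_t + vocab_size * beta)


# Hypothetical counts N_{w|t} for one topic over a four-word vocabulary.
vocab = ["neurons", "axon", "brain", "grant"]
counts = Counter({"neurons": 5, "axon": 3, "brain": 2})
probs = {w: collapsed_word_prob(w, counts, len(vocab)) for w in vocab}
```

Because the denominator equals the sum of the numerators over the vocabulary, the values in `probs` sum to one, and a word type with zero count still receives mass proportional to $\beta$.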
The process for generating a sequence of words from such a model is known as the simple Pólya urn model (Mahmoud, 2008), in which the initial probability of word type $w$ in topic $t$ is proportional to $\beta$, while the probability of each subsequent occurrence of $w$ in topic $t$ is proportional to the number of times $w$ has been drawn in that topic plus $\beta$. Note that this unnormalized weight for each word type depends only on the count of that word type, and is independent of the count of any other word type $w'$. Thus, in the DCM/Pólya distribution, drawing word type $w$ must decrease the probability of seeing all other word types $w' \neq w$. In a later section, we will introduce a topic model that substitutes a generalized Pólya urn model for the DCM/Pólya distribution, allowing a draw of word type $w$ to increase the probability of seeing certain other word types.
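The simple Pólya urn process can be simulated in a few lines. This is an illustrative sketch under our own assumptions: the helper names `predictive` and `polya_urn_draws`, the seed, and the parameter values are hypothetical.

```python
import random


def predictive(counts, beta):
    """Normalized probability of each word type given the current counts."""
    total = sum(counts) + beta * len(counts)
    return [(c + beta) / total for c in counts]


def polya_urn_draws(num_types, num_draws, beta=0.01, seed=0):
    """Draw word types one at a time; each draw raises that type's weight by one."""
    rng = random.Random(seed)
    counts = [0] * num_types
    draws = []
    for _ in range(num_draws):
        weights = [c + beta for c in counts]  # unnormalized weight: count + beta
        w = rng.choices(range(num_types), weights=weights, k=1)[0]
        counts[w] += 1
        draws.append(w)
    return draws, counts


draws, counts = polya_urn_draws(num_types=5, num_draws=50)
before = predictive([1, 1, 0], beta=0.5)
after = predictive([2, 1, 0], beta=0.5)  # one more draw of type 0
```

Comparing `before` and `after` shows the property discussed above: the extra draw of type 0 raises its own probability and lowers the probability of every other type.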
For real-world data, documents $\mathcal{W}$ are observed, while the corresponding topic assignments $\mathcal{Z}$ are unobserved and may be inferred using either variational methods (Blei et al., 2003; Teh et al., 2006) or MCMC methods (Griffiths and Steyvers, 2004). Here, we use MCMC methods, specifically Gibbs sampling (Geman and Geman, 1984), which involves sequentially resampling each topic assignment $z_n^{(d)}$ from its conditional posterior given the documents $\mathcal{W}$, the hyperparameters $\alpha$ and $\beta$, and $\mathcal{Z}_{\setminus d,n}$ (the current topic assignments for all tokens other than the token at position $n$ in document $d$).
3 Expert Opinions of Topic Quality
Concentrating on 300,000 grant and related jour-
nal paper abstracts from the National Institutes of
Health (NIH), we worked with two experts from
the National Institute of Neurological Disorders and
Stroke (NINDS) to collaboratively design an expert-
driven topic annotation study. The goal of this study
was to develop an annotated set of baseline topics,
along with their salient characteristics, as a first step
towards automatically identifying and inferring the
kinds of topics desired by domain experts.¹
3.1 Expert-Driven Annotation Protocol
In order to ensure that the topics selected for annotation were within the NINDS experts’ area of expertise, they selected 148 topics (out of 500), all associated with areas funded by NINDS. Each topic $t$ was presented to the experts as a list of the thirty most probable words for that topic, in descending order of their topic-specific “collapsed” probabilities, $\frac{N_{w|t} + \beta}{N_t + |\mathcal{V}|\beta}$. In addition to the most probable words, the experts were also given metadata for each topic: the most common sequences of two or more consecutive words assigned to that topic, the four topics that most often co-occurred with that topic, the most common IDF-weighted words from titles of grants, thesaurus terms, NIH institutes, journal titles, and finally a list of the highest probability grants and PubMed papers for that topic.
¹ All evaluated models will be released publicly.
The experts first categorized each topic as one
of three types: “research”, “grant mechanisms and
publication types” or “general”.² The quality of
each topic (“good”, “intermediate”, or “bad”) was
then evaluated using criteria specific to the type
of topic. In general, topics were only annotated
as “good” if they contained words that could be
grouped together as a single coherent concept. Addi-
tionally, each “research” topic was only considered
to be “good” if, in addition to representing a sin-
gle coherent concept, the aggregate content of the
set of documents with appreciable allocations to that
topic clearly contained text referring to the concept
inferred from the topic words. Finally, for each topic
marked as being either “intermediate” or “bad”, one
or more of the following problems (defined by the
domain experts) was identified, as appropriate:
• Chained: every word is connected to every other word through some pairwise word chain, but not all word pairs make sense. For example, a topic whose top three words are “acids”, “fatty” and “nucleic” consists of two distinct concepts (i.e., acids produced when fats are broken down versus the building blocks of DNA and RNA) chained via the word “acids”.
• Intruded: either two or more unrelated sets of related words, joined arbitrarily, or an otherwise good topic with a few “intruder” words.
• Random: no clear, sensical connections between more than a few pairs of words.
• Unbalanced: the top words are all logically connected to each other, but the topic combines very general and specific terms (e.g., “signal transduction” versus “notch signaling”).
² Equivalent to “vacuous topics” of AlSumait et al. (2009).
Examples of a good general topic, a good research
topic, and a chained research topic are in Table 1.
3.2 Annotation Results
The experts annotated the topics independently and
then aggregated their results. Interestingly, no top-
ics were ever considered “good” by one expert and
“bad” by the other—when there was disagreement
between the experts, one expert always believed the
topic to be “intermediate.” In such cases, the ex-
perts discussed the reasons for their decisions and
came to a consensus. Of the 148 topics selected for
annotation, 90 were labeled as “good,” 21 as “inter-
mediate,” and 37 as “bad.” Of the topics labeled as
“bad” or “intermediate,” 23 were “chained,” 21 were
“intruded,” 3 were “random,” and 15 were “unbal-
anced”. (The annotators were permitted to assign
more than one problem to any given topic.)
4 Automated Metrics for Predicting Expert Annotations
The ultimate goal of this paper is to develop meth-
ods for building models with large numbers of spe-
cific, high-quality topics from domain-specific cor-
pora. We therefore explore the extent to which in-
formation already contained in the documents being
modeled can be used to assess topic quality.
In this section we evaluate several methods for
ranking the quality of topics and compare these
rankings to human annotations. No method is likely
to perfectly predict human judgments, as individual
annotators may disagree on particular topics. For
an application involving removing low quality top-
ics we recommend using a weighted combination of
metrics, with a threshold determined by users.
4.1 Topic Size
As a simple baseline, we considered the extent to
which topic “size” (as measured by the number of
tokens assigned to each topic via Gibbs sampling) is
a good metric for assessing topic quality. Figure 1
(top) displays the topic size (number of tokens as-
signed to that topic) and expert annotations (“good”,
“intermediate”, “bad”) for the 148 topics manually
labeled by annotators as described above. This figure suggests that topic size is a reasonable predictor of topic quality. Although there is some overlap, “bad” topics are generally smaller than “good” topics. Unfortunately, this observation conflicts with the goal of building highly specialized, domain-specific topic models with many high-quality, fine-grained topics: in such models the majority of topics will have relatively few tokens assigned to them.
[Figure 1, two panels. Top: expert-rated topics (good, inter, bad) plotted by topic size (x-axis: Tokens, 40000 to 160000). Bottom: the same topics plotted by coherence (x-axis: Coherence, −600 to −200).]
Figure 1: Topic size is a good indicator of quality; the new coherence metric is better. Top shows expert-rated topics ranked by topic size (AP 0.89, AUC 0.79), bottom shows same topics ranked by coherence (AP 0.94, AUC 0.87). Random jitter is added to the y-axis for clarity.
4.2 Topic Coherence
When displaying topics to users, each topic $t$ is
generally represented as a list of the $M = 5, \ldots, 20$
most probable words for that topic, in descending
order of their topic-specific “collapsed” probabilities.
Although there has been previous work on automated
generation of labels or headings for topics (Mei et
al., 2007), we choose to work only with the ordered
list representation. Labels may obscure or detract
from fundamental problems with topic coherence,
and better labels don’t make bad topics good.
The expert-driven annotation study described in
section 3 suggests that three of the four types of
poor-quality topics (“chained,” “intruded” and “ran-
dom”) could be detected using a metric based on
the co-occurrence of words within the documents
being modeled. For “chained” and “intruded” top-
ics, it is likely that although pairs of words belong-
ing to a single concept will co-occur within a single
document (e.g., “nucleic” and “acids” in documents
about DNA), word pairs belonging to different con-
cepts (e.g., “fatty” and “nucleic”) will not. For ran-
dom topics, it is likely that few words will co-occur.
This insight can be used to design a new metric for assessing topic quality. Letting $D(v)$ be the document frequency of word type $v$ (i.e., the number of documents with at least one token of type $v$) and $D(v, v')$ be the co-document frequency of word types $v$ and $v'$ (i.e., the number of documents containing one or more tokens of type $v$ and at least one token of type $v'$), we define topic coherence as
$$C(t; V^{(t)}) = \sum_{m=2}^{M} \sum_{l=1}^{m-1} \log \frac{D(v_m^{(t)}, v_l^{(t)}) + 1}{D(v_l^{(t)})}, \qquad (1)$$
where $V^{(t)} = (v_1^{(t)}, \ldots, v_M^{(t)})$ is a list of the $M$ most probable words in topic $t$. A smoothing count of 1 is included to avoid taking the logarithm of zero.
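Equation 1 translates almost directly into code. The sketch below assumes documents are given as lists of word tokens and that every top word occurs in at least one document (so that $D(v_l) > 0$); the function name `topic_coherence` and the four-document toy corpus are hypothetical.

```python
import math


def topic_coherence(top_words, documents):
    """Equation 1: sum over pairs l < m of log((D(v_m, v_l) + 1) / D(v_l))."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing every word in `words`.
        return sum(all(w in d for w in words) for d in doc_sets)

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            joint = doc_freq(top_words[m], top_words[l])
            score += math.log((joint + 1) / doc_freq(top_words[l]))
    return score


# Toy corpus: "globin" never shares a document with the aging-related words.
docs = [
    ["aging", "lifespan", "human"],
    ["aging", "longevity"],
    ["globin", "erythroid", "human"],
    ["aging", "lifespan"],
]
coherent = topic_coherence(["aging", "lifespan", "longevity"], docs)
chained = topic_coherence(["aging", "lifespan", "globin"], docs)
```

On this toy corpus the list containing “globin” scores lower (further from zero) than the list of aging-related words, because its pairs with “aging” and “lifespan” have zero co-document frequency.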
Figure 1 shows the association between the expert
annotations and both topic size (top) and our coher-
ence metric (bottom). We evaluate these results us-
ing standard ranking metrics, average precision and
the area under the ROC curve. Treating “good” top-
ics as positive and “intermediate” or “bad” topics as
negative, we get average precision values of 0.89 for
topic size vs. 0.94 for coherence and AUC 0.79 for
topic size vs. 0.87 for coherence. We performed a
logistic regression analysis on the binary variable “is
this topic bad”. Using topic size alone as a predic-
tor gives AIC (a measure of model fit) 152.5. Co-
herence alone has AIC 113.8 (substantially better).
Both predictors combined have AIC 115.8: the simpler
coherence-alone model provides the best performance.
We tried weighting the terms in equation 1
by their corresponding topic–word probabilities and
by their position in the sorted list of the $M$ most
probable words for that topic, but we found that a
uniform weighting better predicted topic quality.
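For readers unfamiliar with the two ranking metrics used above, the following sketch computes average precision and the area under the ROC curve from a list of scores and binary labels. It uses the standard definitions of both metrics; the scores and labels are made up for illustration and are not the annotation data from this study, and the code assumes at least one positive and one negative example.

```python
def average_precision(scores, labels):
    """Mean of the precision at the rank of each positive, ranked by descending score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            total += hits / rank
    return total / hits


def roc_auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical coherence scores; True marks a topic annotated as "good".
scores = [-170.0, -250.0, -300.0, -360.0, -420.0]
labels = [True, True, False, True, False]
ap = average_precision(scores, labels)
auc = roc_auc(scores, labels)
```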
Our topic coherence metric also exhibits good
qualitative behavior: of the 20 best-scoring topics,
18 are labeled as “good,” one is “intermediate” (“un-
balanced”), and one is “bad” (combining “cortex”
and “fmri”, words that commonly co-occur, but are
conceptually distinct). Of the 20 worst scoring top-
ics, 15 are “bad,” 4 are “intermediate,” and only one
(with the 19th worst coherence score) is “good.”
Our coherence metric relies only upon word co-
occurrence statistics gathered from the corpus being
modeled, and does not depend on an external ref-
erence corpus. Ideally, all such co-occurrence infor-
mation would already be accounted for in the model.
We believe that one of the main contributions of our
work is demonstrating that standard topic models
do not fully utilize available co-occurrence informa-
tion, and that a held-out reference corpus is therefore
not required for purposes of topic evaluation.
Equation 1 is very similar to pointwise mutual in-
formation (PMI), but is more closely associated with
our expert annotations than PMI (which achieves
AUC 0.64 and AIC 170.51). PMI has a long history
in language technology (Church and Hanks, 1990),
and was recently used by Newman et al. (2010) to
evaluate topic models. When expressed in terms of
count variables as in equation 1, PMI includes an
additional term for $D(v_m^{(t)})$. The improved
performance of our metric over PMI implies that what
matters is not the difference between the joint
probability of words $m$ and $l$ and the product of
marginals, but the conditional probability of each word
given each of the higher-ranked words in the topic.
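The relationship between the two quantities can be made explicit. As a sketch, assume that probabilities are estimated as document-frequency ratios over $D$ documents, set aside the smoothing count, and drop the topic superscript $(t)$ for brevity:

```latex
\mathrm{PMI}(v_m, v_l)
  = \log \frac{D(v_m, v_l) / D}{\bigl(D(v_m) / D\bigr)\bigl(D(v_l) / D\bigr)}
  = \log \frac{D(v_m, v_l)}{D(v_l)} - \log D(v_m) + \log D .
```

Under these assumptions the first term is the estimated log conditional probability $\log P(v_m \mid v_l)$, which matches the summand of equation 1 up to smoothing; $-\log D(v_m)$ is the additional term, and $\log D$ is constant across word pairs.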
In order to provide intuition for the behavior of
our topic coherence metric, table 1 shows three
example topics and their topic coherence scores.
The first topic, related to grant-funded training pro-
grams, is one of the best-scoring topics. All pairs
of words have high co-document frequencies. The
second topic, on neurons, is more typical of qual-
ity “research” topics. Overall, these words occur
less frequently, but generally occur moderately in-
terchangeably: there is little structure to their co-
variance. The last topic is one of the lowest-scoring
topics. Its co-document frequency matrix is shown
in table 2. The top two words are closely related:
487 documents include “aging” at least once, 122
include “lifespan”, and 55 include both. Meanwhile,
the third word “globin” occurs with only one of the
top seven words—the common word “human”.
4.3 Comparison to word intrusion
As an additional check for both our expert annota-
tions and our automated metric, we replicated the
“word intrusion” evaluation originally introduced by
Chang et al. (2009). In this task, one of the top ten
most probable words in a topic is replaced with another word, selected at random from the corpus. The resulting set of words is presented, in a random order, to users, who are asked to identify the “intruder” word. It is very unlikely that a randomly chosen word will be semantically related to any of the original words in the topic, so if a topic is a high quality representation of a semantically coherent concept, it should be easy for users to select the intruder word. If the topic is not coherent, there may be words in the topic that are also not semantically related to any other word, thus causing users to select “correct” words instead of the real intruder.
[Figure 2, four panels. Top: “Comparison of Topic Size to Intrusion Detection” (x-axis: tokens assigned to topic; y-axis: correct guesses) and “Comparison of Coherence to Intrusion Detection” (x-axis: coherence; y-axis: correct guesses). Bottom: histograms for “Good Topics” and “Bad Topics” (x-axis: correct guesses, 0 to 10; y-axis: frequency).]
Figure 2: Top: results of the intruder selection task relative to two topic quality metrics. Bottom: marginal intruder accuracy frequencies of good and bad topics.
We recruited ten additional expert annotators from NINDS, not including our original annotators, and presented them with the intruder selection task, using the set of previously evaluated topics. Results are shown in figure 2. In the first two plots, the x-axis is one of our two automated quality metrics (topic size and coherence) and the y-axis is the number of annotators that correctly identified the true intruder word (accuracy). The histograms below these plots show the number of topics with each level of annotator accuracy for good and bad topics. For good topics (green circles), the annotators were generally able to detect the intruder word with high accuracy. Bad topics (red diamonds) had more uniform accuracies. These results suggest that topics with low intruder detection accuracy tend to be bad, but some bad topics can have a high accuracy. For example, spotting an intruder word in a chained topic can be easy. The low-quality topic “receptors, cannabinoid, cannabinoids, ligands, cannabis, endocannabinoid, cxcr4, [virus], receptor, sdf1” is a typical “chained” topic, with CXCR4 linked to cannabinoids only through receptors, and otherwise unrelated. Eight out of ten annotators correctly identified “virus” as the correct intruder. Repeating the logistic regression experiment using intruder detection accuracy as input, the AIC value is 163.18, much worse than either topic size or coherence.

Table 1: Example topics (good/general, good/research, chained/research) with different coherence scores (numbers closer to zero indicate higher coherence). The chained topic combines words related to aging (indicated in plain text) and words describing blood and blood-related diseases (bold). The only connection is the common word human.
Coherence  Top words
-167.1     students, program, summer, biomedical, training, experience, undergraduate, career, minority, student, careers, underrepresented, medical students, week, science
-252.1     neurons, neuronal, brain, axon, neuron, guidance, nervous system, cns, axons, neural, axonal, cortical, survival, disorders, motor
-357.2     aging, lifespan, globin, age related, longevity, human, age, erythroid, sickle cell, beta globin, hb, senescence, adult, older, lcr

Table 2: Co-document frequency matrix for the top words in a low-quality topic (according to our coherence metric), shaded to highlight zeros. The diagonal (light gray) shows the overall document frequency for each word $w$. The column on the right is $N_{w|t}$. Note that “globin” and “erythroid” do not co-occur with any of the aging-related words.
word         aging  lifespan  globin  age related  longevity  erythroid  age  sickle cell  human   hb  N_{w|t}
aging          487        53       0           65         42          0   51            0    138    0      914
lifespan        53       122       0           15         28          0   15            0     44    0      205
globin           0         0      39            0          0         19    0           15     27    3      200
age related     65        15       0          119         12          0   25            0     37    0      160
longevity       42        28       0           12         73          0    6            0     20    1      159
erythroid        0         0      19            0          0         69    0            8     23    1      110
age             51        15       0           25          6          0  245            1     82    0      103
sickle cell      0         0      15            0          0          8    1           43     16    2       93
human          138        44      27           37         20         23   82           16   4347  157       91
hb               0         0       3            0          1          1    0            2      5   15       73
5 Generalized Pólya Urn Models
Although the topic coherence metric defined above
provides an accurate way of assessing topic quality,
preventing poor quality topics from occurring in the
first place is preferable. Our results in the previous
section show that we can identify low-quality top-
ics without making use of external supervision; the
training data by itself contains sufficient information
at least to reject poor combinations of words.
In this section, we describe a new topic model that
incorporates the corpus-specific word co-occurrence
information used in our coherence metric directly
into the statistical topic modeling framework. It
is important to note that simply disallowing words
that never co-occur from being assigned to the same
topic is not sufficient. Due to the power-law charac-
teristics of language, most words are rare and will
not co-occur with most other words regardless of
their semantic similarity. It is rather the degree
to which the most prominent words in a topic do
not co-occur with the other most prominent words
in that topic that is an indicator of topic incoher-
ence. We therefore desire models that guide topics
towards semantic similarity without imposing hard
constraints.
As an example of such a model, we present a new topic model in which the occurrence of word type $w$ in topic $t$ increases not only the probability of seeing that word type again, but also increases the probability of seeing other related words (as determined by co-document frequencies for the corpus being modeled). This new topic model retains the document–topic component of standard LDA, but replaces the usual Pólya urn topic–word component with a generalized Pólya urn framework (Mahmoud, 2008).
A sequence of i.i.d. samples from a discrete distribution can be imagined as arising by repeatedly drawing a random ball from an urn, where the number of balls of each color is proportional to the probability of that color, replacing the selected ball after each draw. In a Pólya urn, each ball is replaced along with another ball of the same color. Samples from this model exhibit the “burstiness” property: the probability of drawing a ball of color $w$ increases each time a ball of that color is drawn. This process represents the marginal distribution of a hierarchical model with a Dirichlet prior and a multinomial likelihood, and is used as the distribution over words for each topic in almost all previous topic models. In a generalized Pólya urn model, having drawn a ball of color $w$, $A_{vw}$ additional balls of each color $v \in \{1, \ldots, W\}$ are returned to the urn. Given $\mathcal{W}$ and $\mathcal{Z}$, the conditional posterior probability of word $w$ in topic $t$ implied by this generalized model is
$$P(w \mid t, \mathcal{W}, \mathcal{Z}, \beta, A) = \frac{\sum_v N_{v|t} A_{vw} + \beta}{N_t + |\mathcal{V}|\beta}, \qquad (2)$$
where $A$ is a $W \times W$ real-valued matrix, known as the addition matrix or schema. The simple Pólya urn model (and hence the conditional posterior probability of word $w$ in topic $t$ under LDA) can be recovered by setting the schema $A$ to the identity matrix. Unlike the simple Pólya distribution, we do not know of a representation of the generalized Pólya urn distribution that can be expressed using a concise set of conditional independence assumptions. A standard graphical model with plate notation would therefore not be helpful in highlighting the differences between the two models, and is not shown.
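Equation 2 can be evaluated with a small amount of code. The sketch below implements the formula exactly as written, with the schema stored as a dense list of rows indexed as `schema[v][w]`; the function name, the three-word vocabulary, the counts, and the symmetric example schema are all hypothetical.

```python
def generalized_polya_prob(w, counts, schema, beta=0.01):
    """Equation 2: (sum_v N_{v|t} * A_{vw} + beta) / (N_t + |V| * beta)."""
    vocab_size = len(counts)
    weighted = sum(counts[v] * schema[v][w] for v in range(vocab_size))
    return (weighted + beta) / (sum(counts) + vocab_size * beta)


counts = [4, 1, 0]  # hypothetical N_{v|t} for a three-word vocabulary

# With the identity schema, equation 2 reduces to the LDA collapsed probability.
identity = [[1.0 if v == w else 0.0 for w in range(3)] for v in range(3)]
lda_probs = [generalized_polya_prob(w, counts, identity) for w in range(3)]

# Hypothetical schema in which word types 0 and 1 are related to each other.
related = [
    [0.8, 0.2, 0.0],
    [0.2, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]
related_probs = [generalized_polya_prob(w, counts, related) for w in range(3)]
```

With the related schema, the four occurrences of word type 0 raise the probability of word type 1 above its LDA value, which is the behavior the generalized urn is meant to provide.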
Algorithm 1 shows pseudocode for a single Gibbs sweep over the latent variables $\mathcal{Z}$ in standard LDA. Algorithm 2 shows the modifications necessary to support a generalized Pólya urn model: rather than subtracting exactly one from the count of the word given the old topic, sampling, and then adding one to the count of the word given the new topic, we subtract a column of the schema matrix from the entire count vector over words for the old topic, sample, and add the same column to the count vector for the new topic. As long as $A$ is sparse, this operation adds only a constant factor to the computation.

Algorithm 1: One sweep of LDA Gibbs sampling.
for $d \in \mathcal{D}$ do
    for $w_i \in \mathbf{w}^{(d)}$ do
        $N_{z_i|d_i} \leftarrow N_{z_i|d_i} - 1$
        $N_{w_i|z_i} \leftarrow N_{w_i|z_i} - 1$
        sample $z_i \propto (N_{z|d_i} + \alpha_z) \frac{N_{w_i|z} + \beta}{\sum_{w'} (N_{w'|z} + \beta)}$
        $N_{z_i|d_i} \leftarrow N_{z_i|d_i} + 1$
        $N_{w_i|z_i} \leftarrow N_{w_i|z_i} + 1$
    end for
end for

Algorithm 2: One sweep of generalized Pólya Gibbs sampling. The two loops over $v$ are the differences from LDA.
for $d \in \mathcal{D}$ do
    for $w_i \in \mathbf{w}^{(d)}$ do
        $N_{z_i|d_i} \leftarrow N_{z_i|d_i} - 1$
        for all $v$ do
            $N_{v|z_i} \leftarrow N_{v|z_i} - A_{v w_i}$
        end for
        sample $z_i \propto (N_{z|d_i} + \alpha_z) \frac{N_{w_i|z} + \beta}{\sum_{w'} (N_{w'|z} + \beta)}$
        $N_{z_i|d_i} \leftarrow N_{z_i|d_i} + 1$
        for all $v$ do
            $N_{v|z_i} \leftarrow N_{v|z_i} + A_{v w_i}$
        end for
    end for
end for
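The sweep described above can be sketched in Python as follows. This is an illustrative sketch rather than the authors' code: it assumes a symmetric scalar `alpha`, stores each schema column as a sparse dictionary (`schema_cols[w]` maps `v` to $A_{vw}$), and recomputes the topic totals with `sum` for simplicity where a real implementation would cache them. All names and the toy data are hypothetical.

```python
import random


def gibbs_sweep(docs, z, n_td, n_wt, schema_cols, alpha, beta, rng):
    """One approximate Gibbs sweep; identity schema columns give plain LDA.

    docs[d][i]  word id of token i in document d
    z[d][i]     current topic assignment of that token
    n_td[d][t]  number of tokens in document d assigned to topic t
    n_wt[t][v]  schema-weighted count of word type v in topic t
    """
    num_topics = len(n_wt)
    vocab_size = len(n_wt[0])
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            old = z[d][i]
            n_td[d][old] -= 1
            for v, a in schema_cols[w].items():  # subtract column w of A
                n_wt[old][v] -= a
            weights = [
                (n_td[d][t] + alpha) * (n_wt[t][w] + beta)
                / (sum(n_wt[t]) + vocab_size * beta)
                for t in range(num_topics)
            ]
            new = rng.choices(range(num_topics), weights=weights, k=1)[0]
            z[d][i] = new
            n_td[d][new] += 1
            for v, a in schema_cols[w].items():  # add column w of A
                n_wt[new][v] += a


# Toy setup with an identity schema, which corresponds to standard LDA.
rng = random.Random(0)
docs = [[0, 1, 0], [2, 3, 2], [0, 1, 3]]
num_topics, vocab_size = 2, 4
identity_cols = [{w: 1.0} for w in range(vocab_size)]
z = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
n_td = [[0] * num_topics for _ in docs]
n_wt = [[0.0] * vocab_size for _ in range(num_topics)]
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        n_td[d][z[d][i]] += 1
        for v, a in identity_cols[w].items():
            n_wt[z[d][i]][v] += a
for _ in range(20):
    gibbs_sweep(docs, z, n_td, n_wt, identity_cols, alpha=0.1, beta=0.01, rng=rng)
```

Because only the nonzero entries of a schema column are visited, the extra work per token is proportional to the number of nonzeros in that column.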
Another property of the generalized Pólya urn model is that it is nonexchangeable: the joint probability of the tokens in any given topic is not invariant to permutation of those tokens. Inference of $\mathcal{Z}$ given $\mathcal{W}$ via Gibbs sampling involves repeatedly cycling through the tokens in $\mathcal{W}$ and, for each one, resampling its topic assignment conditioned on $\mathcal{W}$ and the current topic assignments for all tokens other than the token of interest. For LDA, the sampling distribution for each topic assignment is simply the product of two predictive probabilities, obtained by treating the token of interest as if it were the last. For a topic model with a generalized Pólya urn for the topic–word component, the sampling distribution is more complicated. Specifically, the topic–word component of the sampling distribution is no longer a simple predictive distribution: when sampling a new value for $z_n^{(d)}$, the implication of each possible value for subsequent tokens and their topic assignments must be considered. Unfortunately, this can be very computationally expensive, particularly for large corpora. There are several ways around this problem. The first is to use sequential Monte Carlo methods, which have been successfully applied to topic models previously (Canini et al., 2009). The second approach is to approximate the true Gibbs sampling distribution by treating each token as if it were the last, ignoring implications for subsequent tokens and their topic assignments. We find that this approximate method performs well empirically.
5.1 Setting the Schema A
Inspired by our evaluation metric, we define Aas
Avv ∝λvD(v)(3)
Avw ∝λvD(w, v)
where each element is scaled by a row-specific
weight λvand each column is normalized to sum
to 1. Normalizing columns makes comparison to
standard LDA simpler, because the relative effect of
smoothing parameter β= 0.01 is equivalent. We set
λv= log (D / D(v)), the standard IDF weight used
in information retrieval, which is larger for less fre-
quent words. The column for word type wcan be
interpreted as word types with significant associa-
tion with w. The IDF weighting therefore has the
effect of increasing the strength of association for
rare word types. We also found empirically that it is
helpful to remove off-diagonal elements for the most
common types, such as those that occur in more than
5% of documents (IDF <3.0). Including nonzero
off-diagonal values in Afor very frequent types
causes the model to disperse those types over many
topics, which leads to large numbers of extremely
similar topics. To measure this effect, we calcu-
lated the Jensen-Shannon divergence between all
pairs of topic–word distributions in a given model.
For a model using off-diagonal weights for all word
269
[Figure 3: a grid of line plots with one column per model size (100, 200, 300, and 400 topics) and three rows (Coherence, 10 Worst Coher, and HOLP), each plotted against Iteration.]
Figure 3: Pólya urn topics (blue) have higher average coherence and converge much faster than LDA topics
(red). The top plots show topic coherence (averaged over 15 runs) over 1000 iterations of Gibbs sampling. Error bars
are not visible in this plot. The middle plot shows the average coherence of the 10 lowest scoring topics. The bottom
plots show held-out log probability (in thousands) for the same models (three runs each of 5-fold cross-validation).
Name   Docs    Avg. Tok.        Tokens    Vocab
NIH    18756   114.64 ± 30.41   2150172   28702
Table 3: Data set statistics.
types, the mean of the 100 lowest divergences was
0.29 ± 0.05 (a divergence of 1.0 represents distributions
with no shared support) at T = 200. The average
divergence of the 100 most similar pairs of topics
for standard LDA (i.e., A = I) is 0.67 ± 0.05. The
same statistic for the generalized Pólya urn model
without off-diagonal elements for word types with
high document frequency is 0.822 ± 0.09.
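The similarity statistic above can be sketched as follows. This is a minimal illustration, not the code used for the reported numbers. It assumes base-2 logarithms, so that two distributions with no shared support have a divergence of exactly 1.0, as stated in the text. The function names are hypothetical, and each topic is assumed to be a list of word probabilities over a shared vocabulary.

```python
import math
from itertools import combinations

def js_divergence(p, q):
    """Jensen-Shannon divergence, in bits, between two distributions."""
    def kl(a, b):
        # Terms with zero probability in `a` contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    mixture = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)

def mean_lowest_divergences(topics, k=100):
    """Mean divergence over the k most similar pairs of topics.

    Requires at least two topics; if fewer than k pairs exist, all
    pairs are averaged.
    """
    divergences = sorted(js_divergence(p, q) for p, q in combinations(topics, 2))
    lowest = divergences[:k]
    return sum(lowest) / len(lowest)
```

A low value indicates that the model contains many near-duplicate topics.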
Setting the off-diagonal elements of the schema
A to zero for the most common word types also has
the fortunate effect of substantially reducing
preprocessing time. We find that Gibbs sampling for the
generalized Pólya urn model takes roughly two to three
times longer than for standard LDA, depending on
the sparsity of the schema, due to additional
bookkeeping needed before and after sampling topics.
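The construction of the schema described in this section can be sketched as below. This is an illustrative reading of Equation 3, not the authors' preprocessing code. Several details are assumptions: documents are lists of word tokens, an off-diagonal entry is dropped when either word type exceeds the frequency threshold, and a column whose entries are all zero (a type that occurs in every document has IDF 0) falls back to the identity. The natural logarithm is used, since log(1/0.05) ≈ 3.0 matches the IDF cutoff quoted above.

```python
import math
from collections import Counter
from itertools import combinations

def build_schema(docs, max_doc_fraction=0.05):
    """Return the columns of A as {w: {v: A[v][w]}}, each summing to 1."""
    num_docs = len(docs)
    doc_freq = Counter()     # D(v): documents containing v
    co_doc_freq = Counter()  # D(w, v): documents containing both
    for doc in docs:
        types = sorted(set(doc))
        doc_freq.update(types)
        co_doc_freq.update(combinations(types, 2))

    idf = {v: math.log(num_docs / df) for v, df in doc_freq.items()}
    frequent = {v for v, df in doc_freq.items()
                if df / num_docs > max_doc_fraction}

    # Diagonal entries: A[w][w] is proportional to idf[w] * D(w).
    columns = {w: {w: idf[w] * doc_freq[w]} for w in doc_freq}
    # Off-diagonal entries: A[v][w] is proportional to idf[v] * D(w, v).
    for (u, v), d in co_doc_freq.items():
        if u in frequent or v in frequent:
            continue  # assumed rule: skip pairs involving a frequent type
        columns[u][v] = idf[v] * d
        columns[v][u] = idf[u] * d

    for w, col in columns.items():
        total = sum(col.values())
        if total == 0.0:
            col[w] = 1.0  # assumed fallback: behave like standard LDA
            total = 1.0
        for v in col:
            col[v] /= total
    return columns
```

Storing only nonzero entries per column keeps the schema sparse, which is the property the running-time comparison above depends on.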
5.2 Experimental Results
We evaluated the new model on a corpus of NIH
grant abstracts. Details are given in Table 3. Figure 3
shows the performance of the generalized Pólya urn
model relative to LDA. Two metrics—our new topic
coherence metric and the log probability of held-out
documents—are shown over 1000 iterations at
50-iteration intervals. Each model was run over five folds
of cross-validation, each with three random
initializations. For each model we calculated an overall
coherence score by calculating the topic coherence
for each topic individually and then averaging these
values. We report the average over all 15 models in
each plot. Held-out probabilities were calculated using
the left-to-right method of Wallach et al. (2009),
with each cross-validation fold using its own schema
A. The generalized Pólya urn model performs very well
in average topic coherence, reaching levels within
the first 50 iterations that match the final score. This
model has an early advantage for held-out probability
as well, but is eventually overtaken by LDA.
This trend is consistent with Chang et al.’s observation
that held-out probabilities are not always good
predictors of human judgments (Chang et al., 2009).
Results are consistent over T ∈ {100, 200, 300}.
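The overall coherence score described above can be sketched as follows, assuming the coherence metric defined earlier in the paper: for a topic's top words v_1, ..., v_M, the sum over pairs l < m of log((D(v_m, v_l) + 1) / D(v_l)). The function names, the dictionary layout, and the convention that co-document counts are keyed by alphabetically sorted pairs are assumptions for this example.

```python
import math

def topic_coherence(top_words, doc_freq, co_doc_freq):
    """Coherence of one topic from its ordered list of top words.

    doc_freq[v] is D(v); co_doc_freq[(a, b)] with a < b is D(a, b).
    Every top word is assumed to occur in at least one document.
    """
    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            v_m, v_l = top_words[m], top_words[l]
            pair = (v_m, v_l) if v_m < v_l else (v_l, v_m)
            score += math.log((co_doc_freq.get(pair, 0) + 1) / doc_freq[v_l])
    return score

def model_coherence(topics, doc_freq, co_doc_freq):
    """Average of the per-topic coherence scores for one model."""
    scores = [topic_coherence(t, doc_freq, co_doc_freq) for t in topics]
    return sum(scores) / len(scores)
```

Averaging `model_coherence` over the 15 trained models (five folds, three initializations each) would give one point on a curve like those in the top row of Figure 3.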
In section 4.2, we demonstrated that our topic co-
herence metric correlates with expert opinions of
topic quality for standard LDA. The generalized
Pólya urn model was therefore designed with the
goal of directly optimizing that metric. It is
possible, however, that optimizing for coherence
directly could break the association between the
coherence metric and topic quality. We therefore repeated
the expert-driven evaluation protocol described in
section 3.1. We trained one standard LDA model
and one generalized Pólya urn model, each with
T = 200, and randomly shuffled the 400 resulting
topics. The topics were then presented to the experts
from NINDS, with no indication as to the identity of
the model from which each topic came. As these
evaluations are time consuming, the experts
evaluated only the first 200 topics, which consisted of
103 generalized Pólya urn topics and 97 LDA topics.
AUC values predicting bad topics given coherence
were 0.83 and 0.80, respectively. Coherence
effectively predicts topic quality in both models.
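An AUC of this kind can be computed directly from its rank interpretation: the probability that a randomly chosen bad topic has lower coherence than a randomly chosen good topic, with ties counted as one half. The sketch below is one way to do this and is not taken from the paper. The function name and input layout are assumptions.

```python
def auc_bad_given_coherence(coherence, is_bad):
    """AUC for flagging bad topics, where lower coherence predicts "bad".

    coherence: list of per-topic coherence scores.
    is_bad: list of booleans from the expert annotations.
    Requires at least one bad and one good topic.
    """
    bad = [c for c, b in zip(coherence, is_bad) if b]
    good = [c for c, b in zip(coherence, is_bad) if not b]
    wins = 0.0
    for c_bad in bad:
        for c_good in good:
            if c_bad < c_good:
                wins += 1.0
            elif c_bad == c_good:
                wins += 0.5
    return wins / (len(bad) * len(good))
```

A value of 0.5 corresponds to chance, and 1.0 means every bad topic scores below every good topic.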
Although we were able to improve the average
overall quality of topics and the average quality of
the ten lowest-scoring topics, we found that the
generalized Pólya urn model was less successful at
reducing the overall number of bad topics. Ignoring one
“unbalanced” topic from each model, 16.5% of the
LDA topics and 13.5% of the generalized Pólya
urn topics were marked as “bad.” While this result
is an improvement, it is not significant at p = 0.05.
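The paper does not state which significance test was used. One common choice for comparing two proportions is a pooled two-proportion z-test, sketched below. The function name is hypothetical, and the counts in the usage note are illustrative values chosen only to be close to the reported percentages, not figures from the paper.

```python
import math

def two_proportion_p_value(bad_a, n_a, bad_b, n_b):
    """Two-sided p-value for a difference between two proportions.

    Uses the pooled two-proportion z-test with a normal approximation.
    """
    p_a = bad_a / n_a
    p_b = bad_b / n_b
    pooled = (bad_a + bad_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_a - p_b) / std_err
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))
```

With hypothetical counts such as 16 bad topics out of 96 and 14 out of 102, `two_proportion_p_value(16, 96, 14, 102)` is well above 0.05, which is in line with the conclusion that a difference of this size is not significant with roughly 100 topics per model.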
6 Discussion
We have demonstrated the following:
• There is a class of low-quality topics that cannot
be detected using existing word-intrusion
tests, but that can be identified reliably using a
metric based on word co-occurrence statistics.
• It is possible to improve the coherence score
of topics, both overall and for the ten worst,
while retaining the ability to flag bad topics, all
without requiring semi-supervised data or
additional reference corpora. Although additional
information may be useful, it is not necessary.
• Such models achieve better performance with
substantially fewer Gibbs iterations than LDA.
We believe that the most important challenges in fu-
ture topic modeling research are improving the se-
mantic quality of topics, particularly at the low end,
and scaling to ever-larger data sets while ensuring
high-quality topics. Our results provide critical in-
sight into these problems. We found that it should be
possible to construct unsupervised topic models that
do not produce bad topics. We also found that Gibbs
sampling mixes faster for models that use word co-
occurrence information, suggesting that such meth-
ods may also be useful in guiding online stochastic
variational inference (Hoffman et al., 2010).
Acknowledgements
This work was supported in part by the Center
for Intelligent Information Retrieval, in part by the
CIA, the NSA and the NSF under NSF grant #IIS-0326249,
in part by NIH:HHSN271200900640P,
and in part by NSF grant #SBE-0965436. Any
opinions, findings and conclusions or recommendations
expressed in this material are the authors’ and
do not necessarily reflect those of the sponsor.
References
Loulwah AlSumait, Daniel Barbara, James Gentle, and
Carlotta Domeniconi. 2009. Topic significance rank-
ing of LDA generative models. In ECML.
David Andrzejewski, Xiaojin Zhu, and Mark Craven.
2009. Incorporating domain knowledge into topic
modeling via Dirichlet forest priors. In Proceedings of
the 26th Annual International Conference on Machine
Learning, pages 25–32.
David M. Blei, Andrew Y. Ng, and Michael I. Jordan.
2003. Latent Dirichlet allocation. Journal of Machine
Learning Research, 3:993–1022, January.
K.R. Canini, L. Shi, and T.L. Griffiths. 2009. Online
inference of topics with latent Dirichlet allocation. In
Proceedings of the 12th International Conference on
Artificial Intelligence and Statistics.
Jonathan Chang, Jordan Boyd-Graber, Chong Wang,
Sean Gerrish, and David M. Blei. 2009. Reading tea
leaves: How humans interpret topic models. In Ad-
vances in Neural Information Processing Systems 22,
pages 288–296.
Kenneth Church and Patrick Hanks. 1990. Word association
norms, mutual information, and lexicography.
Computational Linguistics, 16(1):22–29.
Gabriel Doyle and Charles Elkan. 2009. Accounting for
burstiness in topic models. In ICML.
S. Geman and D. Geman. 1984. Stochastic relaxation,
Gibbs distributions, and the Bayesian restoration of
images. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 6:721–741.
Thomas L. Griffiths and Mark Steyvers. 2004. Find-
ing scientific topics. Proceedings of the National
Academy of Sciences, 101(suppl. 1):5228–5235.
Matthew Hoffman, David Blei, and Francis Bach. 2010.
Online learning for latent Dirichlet allocation. In NIPS.
Hosam Mahmoud. 2008. Pólya Urn Models. Chapman
& Hall/CRC Texts in Statistical Science.
Qiaozhu Mei, Xuehua Shen, and ChengXiang Zhai.
2007. Automatic labeling of multinomial topic mod-
els. In Proceedings of the 13th ACM SIGKDD Interna-
tional Conference on Knowledge Discovery and Data
Mining, pages 490–499.
David Newman, Jey Han Lau, Karl Grieser, and Timothy
Baldwin. 2010. Automatic evaluation of topic coher-
ence. In Human Language Technologies: The Annual
Conference of the North American Chapter of the As-
sociation for Computational Linguistics.
Yee Whye Teh, Dave Newman, and Max Welling. 2006.
A collapsed variational Bayesian inference algorithm
for latent Dirichlet allocation. In Advances in Neural
Information Processing Systems 18.
Hanna Wallach, Iain Murray, Ruslan Salakhutdinov, and
David Mimno. 2009. Evaluation methods for topic
models. In Proceedings of the 26th International
Conference on Machine Learning.
Xing Wei and Bruce Croft. 2006. LDA-based document
models for ad-hoc retrieval. In Proceedings of the 29th
Annual International SIGIR Conference.