Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 262–272, Edinburgh, Scotland, UK, July 27–31, 2011. © 2011 Association for Computational Linguistics
Optimizing Semantic Coherence in Topic Models
David Mimno
Princeton University
Princeton, NJ 08540
mimno@cs.princeton.edu
Hanna M. Wallach
University of Massachusetts, Amherst
Amherst, MA 01003
wallach@cs.umass.edu
Edmund Talley Miriam Leenders
National Institutes of Health
Bethesda, MD 20892
{talleye,leenderm}@ninds.nih.gov
Andrew McCallum
University of Massachusetts, Amherst
Amherst, MA 01003
mccallum@cs.umass.edu
Abstract
Latent variable models have the potential
to add value to large document collections
by discovering interpretable, low-dimensional
subspaces. In order for people to use such
models, however, they must trust them. Un-
fortunately, typical dimensionality reduction
methods for text, such as latent Dirichlet al-
location, often produce low-dimensional sub-
spaces (topics) that are obviously flawed to
human domain experts. The contributions of
this paper are threefold: (1) An analysis of the
ways in which topics can be flawed; (2) an au-
tomated evaluation metric for identifying such
topics that does not rely on human annotators
or reference collections outside the training
data; (3) a novel statistical topic model based
on this metric that significantly improves topic
quality in a large-scale document collection
from the National Institutes of Health (NIH).
1 Introduction
Statistical topic models such as latent Dirichlet al-
location (LDA) (Blei et al., 2003) provide a pow-
erful framework for representing and summarizing
the contents of large document collections. In our
experience, however, the primary obstacle to accep-
tance of statistical topic models by users outside the
machine learning community is the presence of poor
quality topics. Topics that mix unrelated or loosely-
related concepts substantially reduce users’ confi-
dence in the utility of such automated systems.
In general, users prefer models with larger num-
bers of topics because such models have greater res-
olution and are able to support finer-grained distinc-
tions. Unfortunately, we have observed that there
is a strong relationship between the size of topics
and the probability of topics being nonsensical as
judged by domain experts: as the number of topics
increases, the smallest topics (as measured by the number of word tokens assigned to each topic) are almost always of poor
quality. The common practice of displaying only a
small number of example topics hides the fact that as
many as 10% of topics may be so bad that they can-
not be shown without reducing users’ confidence.
The evaluation of statistical topic models has tra-
ditionally been dominated by either extrinsic meth-
ods (i.e., using the inferred topics to perform some
external task such as information retrieval (Wei
and Croft, 2006)) or quantitative intrinsic methods,
such as computing the probability of held-out doc-
uments (Wallach et al., 2009). Recent work has
focused on evaluation of topics as semantically-
coherent concepts. For example, Chang et al. (2009)
found that the probability of held-out documents is
not always a good predictor of human judgments.
Newman et al. (2010) showed that an automated
evaluation metric based on word co-occurrence
statistics gathered from Wikipedia could predict hu-
man evaluations of topic quality. AlSumait et al.
(2009) used differences between topic-specific dis-
tributions over words and the corpus-wide distribu-
tion over words to identify overly-general “vacuous”
topics. Finally, Andrzejewski et al. (2009) devel-
oped semi-supervised methods that avoid specific
user-labeled semantic coherence problems.
The contributions of this paper are threefold: (1)
To identify distinct classes of low-quality topics,
some of which are not flagged by existing evalua-
tion methods; (2) to introduce a new topic “coher-
ence” score that corresponds well with human co-
herence judgments and makes it possible to identify
specific semantic problems in topic models without
human evaluations or external reference corpora; (3)
to present an example of a new topic model that
learns latent topics by directly optimizing a metric
of topic coherence. With little additional computa-
tional cost beyond that of LDA, this model exhibits
significant gains in average topic coherence score.
Although the model does not result in a statistically-
significant reduction in the number of topics marked
“bad”, the model consistently improves the topic co-
herence score of the ten lowest-scoring topics (i.e.,
results in bad topics that are “less bad” than those
found using LDA) while retaining the ability to iden-
tify low-quality topics without human interaction.
2 Latent Dirichlet Allocation
LDA is a generative probabilistic model for documents W = {w^(1), w^(2), ..., w^(D)}. To generate a word token w_n^(d) in document d, we draw a discrete topic assignment z_n^(d) from a document-specific distribution over the T topics θ_d (which is itself drawn from a Dirichlet prior with hyperparameter α), and then draw a word type for that token from the topic-specific distribution over the vocabulary φ_{z_n^(d)}. The inference task in topic models is generally cast as inferring the document–topic proportions {θ_1, ..., θ_D} and the topic-specific distributions {φ_1, ..., φ_T}.

The multinomial topic distributions are usually drawn from a shared symmetric Dirichlet prior with hyperparameter β, such that conditioned on {φ_t}_{t=1}^T and the topic assignments {z^(1), z^(2), ..., z^(D)}, the word tokens are independent. In practice, however, it is common to deal directly with the “collapsed” distributions that result from integrating over the topic-specific multinomial parameters. The resulting distribution over words for a topic t is then a function of the hyperparameter β and the number of words of each type assigned to that topic, N_{w|t}. This distribution, known as the Dirichlet compound multinomial (DCM) or Pólya distribution (Doyle and Elkan, 2009), breaks the assumption of conditional independence between word tokens given topics, but is useful during inference because the conditional probability of a word w given topic t takes a very simple form:

\[ P(w \mid t, \beta) = \frac{N_{w|t} + \beta}{N_t + |V|\,\beta}, \]

where N_t = Σ_{w'} N_{w'|t} and |V| is the vocabulary size.
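As a quick illustration of this collapsed conditional, the short Python sketch below reads the probability directly off a matrix of topic–word counts. It is our own illustrative helper with assumed array names, not code from the paper.

import numpy as np

def collapsed_word_prob(N_wt, beta, w, t):
    """P(w | t, beta) = (N_{w|t} + beta) / (N_t + |V| beta),
    where N_wt is a (V, T) matrix of word-type counts per topic."""
    V = N_wt.shape[0]
    N_t = N_wt[:, t].sum()              # total tokens currently assigned to topic t
    return (N_wt[w, t] + beta) / (N_t + V * beta)

# Toy counts: 4 word types, 2 topics.
N_wt = np.array([[3, 0], [1, 2], [0, 5], [2, 1]])
print(collapsed_word_prob(N_wt, beta=0.01, w=2, t=1))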
The process for generating a sequence of words from such a model is known as the simple Pólya urn model (Mahmoud, 2008), in which the initial probability of word type w in topic t is proportional to β, while the probability of each subsequent occurrence of w in topic t is proportional to the number of times w has been drawn in that topic plus β. Note that this unnormalized weight for each word type depends only on the count of that word type, and is independent of the count of any other word type w'. Thus, in the DCM/Pólya distribution, drawing word type w must decrease the probability of seeing all other word types w' ≠ w. In a later section, we will introduce a topic model that substitutes a generalized Pólya urn model for the DCM/Pólya distribution, allowing a draw of word type w to increase the probability of seeing certain other word types.
For real-world data, documents W are observed, while the corresponding topic assignments Z are unobserved and may be inferred using either variational methods (Blei et al., 2003; Teh et al., 2006) or MCMC methods (Griffiths and Steyvers, 2004). Here, we use MCMC methods—specifically Gibbs sampling (Geman and Geman, 1984), which involves sequentially resampling each topic assignment z_n^(d) from its conditional posterior given the documents W, the hyperparameters α and β, and Z\d,n (the current topic assignments for all tokens other than the token at position n in document d).
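A minimal sketch of this collapsed Gibbs update for a single token is given below, combining the document–topic and topic–word terms in the usual way (Griffiths and Steyvers, 2004). It is our own illustration; the count-array names and shapes are assumptions rather than the authors' implementation.

import numpy as np

rng = np.random.default_rng(0)

def resample_token(w, d, z_old, N_td, N_wt, N_t, alpha, beta):
    """Resample the topic of one token of word type w in document d.
    N_td: (T, D) topic counts per document; N_wt: (V, T) word counts per topic;
    N_t: (T,) total tokens per topic."""
    V = N_wt.shape[0]
    # Remove the token's current assignment from all counts.
    N_td[z_old, d] -= 1
    N_wt[w, z_old] -= 1
    N_t[z_old] -= 1
    # Unnormalized conditional posterior over topics.
    p = (N_td[:, d] + alpha) * (N_wt[w, :] + beta) / (N_t + V * beta)
    z_new = rng.choice(len(p), p=p / p.sum())
    # Record the new assignment.
    N_td[z_new, d] += 1
    N_wt[w, z_new] += 1
    N_t[z_new] += 1
    return z_new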
3 Expert Opinions of Topic Quality
Concentrating on 300,000 grant and related jour-
nal paper abstracts from the National Institutes of
Health (NIH), we worked with two experts from
the National Institute of Neurological Disorders and
Stroke (NINDS) to collaboratively design an expert-
driven topic annotation study. The goal of this study
was to develop an annotated set of baseline topics,
along with their salient characteristics, as a first step
towards automatically identifying and inferring the
kinds of topics desired by domain experts.[1]
3.1 Expert-Driven Annotation Protocol
In order to ensure that the topics selected for anno-
tation were within the NINDS experts’ area of ex-
pertise, they selected 148 topics (out of 500), all as-
sociated with areas funded by NINDS. Each topic
[1] All evaluated models will be released publicly.
t was presented to the experts as a list of the thirty most probable words for that topic, in descending order of their topic-specific “collapsed” probabilities, (N_{w|t} + β) / (N_t + |V| β). In addition to the most probable words, the experts were also given metadata for each topic: the most common sequences of two or more consecutive words assigned to that topic, the four topics that most often co-occurred with that topic, the most common IDF-weighted words from titles of grants, thesaurus terms, NIH institutes, journal titles, and finally a list of the highest-probability grants and PubMed papers for that topic.
The experts first categorized each topic as one
of three types: “research”, “grant mechanisms and
publication types”, or “general”.[2] The quality of
each topic (“good”, “intermediate”, or “bad”) was
then evaluated using criteria specific to the type
of topic. In general, topics were only annotated
as “good” if they contained words that could be
grouped together as a single coherent concept. Addi-
tionally, each “research” topic was only considered
to be “good” if, in addition to representing a sin-
gle coherent concept, the aggregate content of the
set of documents with appreciable allocations to that
topic clearly contained text referring to the concept
inferred from the topic words. Finally, for each topic
marked as being either “intermediate” or “bad”, one
or more of the following problems (defined by the
domain experts) was identified, as appropriate:
Chained: every word is connected to every
other word through some pairwise word chain,
but not all word pairs make sense. For exam-
ple, a topic whose top three words are “acids”,
“fatty” and “nucleic” consists of two distinct
concepts (i.e., acids produced when fats are
broken down versus the building blocks of
DNA and RNA) chained via the word “acids”.
Intruded: either two or more unrelated sets
of related words, joined arbitrarily, or an oth-
erwise good topic with a few “intruder” words.
Random: no clear, sensical connections be-
tween more than a few pairs of words.
Unbalanced: the top words are all logically
connected to each other, but the topic combines
very general and specific terms (e.g., “signal
transduction” versus “notch signaling”).

[2] Equivalent to “vacuous topics” of AlSumait et al. (2009).
Examples of a good general topic, a good research
topic, and a chained research topic are in Table 1.
3.2 Annotation Results
The experts annotated the topics independently and
then aggregated their results. Interestingly, no top-
ics were ever considered “good” by one expert and
“bad” by the other—when there was disagreement
between the experts, one expert always believed the
topic to be “intermediate.” In such cases, the ex-
perts discussed the reasons for their decisions and
came to a consensus. Of the 148 topics selected for
annotation, 90 were labeled as “good,” 21 as “inter-
mediate,” and 37 as “bad.” Of the topics labeled as
“bad” or “intermediate,” 23 were “chained,” 21 were
“intruded,” 3 were “random,” and 15 were “unbal-
anced”. (The annotators were permitted to assign
more than one problem to any given topic.)
4 Automated Metrics for Predicting
Expert Annotations
The ultimate goal of this paper is to develop meth-
ods for building models with large numbers of spe-
cific, high-quality topics from domain-specific cor-
pora. We therefore explore the extent to which in-
formation already contained in the documents being
modeled can be used to assess topic quality.
In this section we evaluate several methods for
ranking the quality of topics and compare these
rankings to human annotations. No method is likely
to perfectly predict human judgments, as individual
annotators may disagree on particular topics. For
an application involving removing low quality top-
ics we recommend using a weighted combination of
metrics, with a threshold determined by users.
4.1 Topic Size
As a simple baseline, we considered the extent to
which topic “size” (as measured by the number of
tokens assigned to each topic via Gibbs sampling) is
a good metric for assessing topic quality. Figure 1
(top) displays the topic size (number of tokens as-
signed to that topic) and expert annotations (“good”,
“intermediate”, “bad”) for the 148 topics manually
labeled by annotators as described above.

Figure 1: Topic size is a good indicator of quality; the new coherence metric is better. Top shows expert-rated topics ranked by topic size (AP 0.89, AUC 0.79), bottom shows the same topics ranked by coherence (AP 0.94, AUC 0.87). Random jitter is added to the y-axis for clarity.

This figure suggests that topic size is a reasonable predictor of topic quality. Although there is some overlap,
“bad” topics are generally smaller than “good” top-
ics. Unfortunately, this observation conflicts with
the goal of building highly specialized, domain-
specific topic models with many high-quality, fine-
grained topics—in such models the majority of top-
ics will have relatively few tokens assigned to them.
4.2 Topic Coherence
When displaying topics to users, each topic t is generally represented as a list of the M = 5, ..., 20 most
probable words for that topic, in descending order
of their topic-specific “collapsed” probabilities. Al-
though there has been previous work on automated
generation of labels or headings for topics (Mei et
al., 2007), we choose to work only with the ordered
list representation. Labels may obscure or detract
from fundamental problems with topic coherence,
and better labels don’t make bad topics good.
The expert-driven annotation study described in
section 3 suggests that three of the four types of
poor-quality topics (“chained,” “intruded” and “ran-
dom”) could be detected using a metric based on
the co-occurrence of words within the documents
being modeled. For “chained” and “intruded” top-
ics, it is likely that although pairs of words belong-
ing to a single concept will co-occur within a single
document (e.g., “nucleic” and “acids” in documents
about DNA), word pairs belonging to different con-
cepts (e.g., “fatty” and “nucleic”) will not. For ran-
dom topics, it is likely that few words will co-occur.
This insight can be used to design a new metric for assessing topic quality. Letting D(v) be the document frequency of word type v (i.e., the number of documents with at least one token of type v) and D(v, v') be the co-document frequency of word types v and v' (i.e., the number of documents containing one or more tokens of type v and at least one token of type v'), we define topic coherence as

\[ C(t; V^{(t)}) = \sum_{m=2}^{M} \sum_{l=1}^{m-1} \log \frac{D(v_m^{(t)}, v_l^{(t)}) + 1}{D(v_l^{(t)})}, \tag{1} \]

where V^{(t)} = (v_1^{(t)}, ..., v_M^{(t)}) is a list of the M most probable words in topic t. A smoothing count of 1 is included to avoid taking the logarithm of zero.
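Because Equation 1 depends only on document frequencies and co-document frequencies from the corpus being modeled, it can be computed directly from the training data. The following Python sketch is our own illustrative implementation of that reading (the function and variable names are ours, not from the paper's released code); it scores one topic's top-M word list against a corpus of tokenized documents.

import math

def topic_coherence(top_words, documents):
    """Topic coherence as in Equation 1: sum over ordered word pairs of
    log((D(v_m, v_l) + 1) / D(v_l)), where v_l is ranked above v_m."""
    doc_sets = [set(doc) for doc in documents]
    def doc_freq(v):
        return sum(1 for s in doc_sets if v in s)
    def co_doc_freq(v, w):
        return sum(1 for s in doc_sets if v in s and w in s)
    score = 0.0
    for m in range(1, len(top_words)):        # m = 2, ..., M (0-indexed here)
        for l in range(m):                    # l = 1, ..., m-1
            v_m, v_l = top_words[m], top_words[l]
            score += math.log((co_doc_freq(v_m, v_l) + 1) / doc_freq(v_l))
    return score

# Toy usage with a tiny corpus of tokenized documents.
docs = [["nucleic", "acids", "dna"], ["fatty", "acids", "lipids"], ["dna", "rna", "nucleic"]]
print(topic_coherence(["acids", "nucleic", "fatty"], docs))   # closer to 0 = more coherent

In this toy corpus the chained pair ("fatty", "nucleic") never co-occurs and so contributes a negative term, which is exactly the behavior the metric is designed to penalize.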
Figure 1 shows the association between the expert
annotations and both topic size (top) and our coher-
ence metric (bottom). We evaluate these results us-
ing standard ranking metrics, average precision and
the area under the ROC curve. Treating “good” top-
ics as positive and “intermediate” or “bad” topics as
negative, we get average precision values of 0.89 for
topic size vs. 0.94 for coherence and AUC 0.79 for
topic size vs. 0.87 for coherence. We performed a
logistic regression analysis on the binary variable “is
this topic bad”. Using topic size alone as a predic-
tor gives AIC (a measure of model fit) 152.5. Co-
herence alone has AIC 113.8 (substantially better).
Both predictors combined have AIC 115.8: the sim-
pler coherence alone model provides the best perfor-
mance. We tried weighting the terms in equation 1
by their corresponding topic–word probabilities and by their position in the sorted list of the M most
probable words for that topic, but we found that a
uniform weighting better predicted topic quality.
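For concreteness, numbers of this kind (average precision, AUC, and logistic-regression AIC) could be computed as in the sketch below. The expert labels and per-topic scores are placeholder arrays, and the use of scikit-learn and statsmodels is our choice of tooling rather than a description of the authors' analysis.

import numpy as np
import statsmodels.api as sm
from sklearn.metrics import average_precision_score, roc_auc_score

# Placeholder inputs: one entry per expert-annotated topic.
is_good = np.array([1, 1, 0, 1, 0, 0, 1, 1, 0, 1])          # "good" vs. "intermediate"/"bad"
topic_size = np.array([90e3, 120e3, 45e3, 50e3, 40e3, 98e3, 95e3, 130e3, 48e3, 100e3])
coherence = np.array([-210, -190, -480, -300, -510, -240, -250, -180, -450, -220])

# Ranking metrics, treating "good" topics as the positive class.
for name, score in [("size", topic_size), ("coherence", coherence)]:
    print(name, average_precision_score(is_good, score), roc_auc_score(is_good, score))

# Logistic regression on the binary variable "is this topic bad"; lower AIC is a better fit.
is_bad = 1 - is_good
for name, score in [("size", topic_size), ("coherence", coherence)]:
    X = sm.add_constant((score - score.mean()) / score.std())   # standardize the predictor
    print(name, "AIC:", sm.Logit(is_bad, X).fit(disp=0).aic)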
Our topic coherence metric also exhibits good
qualitative behavior: of the 20 best-scoring topics,
18 are labeled as “good,” one is “intermediate” (“un-
balanced”), and one is “bad” (combining “cortex”
and “fmri”, words that commonly co-occur, but are
conceptually distinct). Of the 20 worst scoring top-
ics, 15 are “bad,” 4 are “intermediate,” and only one
(with the 19th worst coherence score) is “good.”
Our coherence metric relies only upon word co-
occurrence statistics gathered from the corpus being
modeled, and does not depend on an external ref-
erence corpus. Ideally, all such co-occurrence infor-
mation would already be accounted for in the model.
We believe that one of the main contributions of our
work is demonstrating that standard topic models
do not fully utilize available co-occurrence informa-
tion, and that a held-out reference corpus is therefore
not required for purposes of topic evaluation.
Equation 1 is very similar to pointwise mutual information (PMI), but is more closely associated with our expert annotations than PMI (which achieves AUC 0.64 and AIC 170.51). PMI has a long history in language technology (Church and Hanks, 1990), and was recently used by Newman et al. (2010) to evaluate topic models. When expressed in terms of count variables as in Equation 1, PMI includes an additional term for D(v_m^{(t)}). The improved performance of our metric over PMI implies that what matters is not the difference between the joint probability of words m and l and the product of their marginals, but the conditional probability of each word given each of the higher-ranked words in the topic.
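To make the contrast explicit, the sketch below (our own, reusing the document-frequency helpers of the coherence sketch above) scores a single word pair both ways: the Equation 1 term conditions on the higher-ranked word, while the PMI-style term includes the additional D(v_m) factor through the product of marginals. The exact smoothed form of the PMI variant is our assumption.

import math

def pair_scores(v_m, v_l, doc_sets, num_docs):
    """Return (Equation-1 term, PMI-style term) for one ordered word pair."""
    D = lambda v: sum(1 for s in doc_sets if v in s)
    D2 = lambda v, w: sum(1 for s in doc_sets if v in s and w in s)
    joint = D2(v_m, v_l) + 1                                   # smoothed co-document frequency
    eq1 = math.log(joint / D(v_l))                             # conditional on the higher-ranked word
    pmi = math.log(joint * num_docs / (D(v_m) * D(v_l)))       # joint vs. product of marginals
    return eq1, pmi

docs = [set(d) for d in (["nucleic", "acids"], ["fatty", "acids"], ["nucleic", "dna"])]
print(pair_scores("fatty", "nucleic", docs, num_docs=3))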
In order to provide intuition for the behavior of
our topic coherence metric, table 1 shows three
example topics and their topic coherence scores.
The first topic, related to grant-funded training pro-
grams, is one of the best-scoring topics. All pairs
of words have high co-document frequencies. The
second topic, on neurons, is more typical of qual-
ity “research” topics. Overall, these words occur
less frequently, but generally occur moderately in-
terchangeably: there is little structure to their co-
variance. The last topic is one of the lowest-scoring
topics. Its co-document frequency matrix is shown
in table 2. The top two words are closely related:
487 documents include “aging” at least once, 122
include “lifespan”, and 55 include both. Meanwhile,
the third word “globin” occurs with only one of the
top seven words—the common word “human”.
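A matrix like the one in Table 2 can be tabulated directly from the corpus for any topic's top words; the small helper below is our own sketch of that bookkeeping, not code from the paper.

import numpy as np

def codoc_matrix(top_words, documents):
    """Co-document frequency matrix for a topic's top words (cf. Table 2);
    entry (i, j) counts documents containing both words, and the diagonal is D(w)."""
    doc_sets = [set(doc) for doc in documents]
    n = len(top_words)
    M = np.zeros((n, n), dtype=int)
    for i, v in enumerate(top_words):
        for j, w in enumerate(top_words):
            M[i, j] = sum(1 for s in doc_sets if v in s and w in s)
    return M

docs = [["aging", "lifespan", "human"], ["globin", "erythroid", "human"], ["aging", "human"]]
print(codoc_matrix(["aging", "lifespan", "globin", "human"], docs))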
4.3 Comparison to word intrusion
As an additional check for both our expert annota-
tions and our automated metric, we replicated the
“word intrusion” evaluation originally introduced by
Chang et al. (2009). In this task, one of the top ten most probable words in a topic is replaced with another word, selected at random from the corpus.

Figure 2: Top: results of the intruder selection task relative to two topic quality metrics (x-axes: tokens assigned to topic and coherence; y-axes: number of correct intruder guesses). Bottom: marginal intruder accuracy frequencies for good and bad topics.
The resulting set of words is presented, in a random
order, to users, who are asked to identify the “in-
truder” word. It is very unlikely that a randomly-
chosen word will be semantically related to any of
the original words in the topic, so if a topic is a
high quality representation of a semantically coher-
ent concept, it should be easy for users to select the
intruder word. If the topic is not coherent, there may
be words in the topic that are also not semantically
related to any other word, thus causing users to se-
lect “correct” words instead of the real intruder.
We recruited ten additional expert annotators
from NINDS, not including our original annotators,
and presented them with the intruder selection task,
using the set of previously evaluated topics. Re-
sults are shown in figure 2. In the first two plots,
the x-axis is one of our two automated quality met-
Table 1: Example topics (good/general, good/research, chained/research) with different coherence scores (numbers closer to zero indicate higher coherence). The chained topic combines words related to aging (indicated in plain text) and words describing blood and blood-related diseases (bold in the original). The only connection is the common word “human”.

-167.1  students, program, summer, biomedical, training, experience, undergraduate, career, minority, student, careers, underrepresented, medical students, week, science

-252.1  neurons, neuronal, brain, axon, neuron, guidance, nervous system, cns, axons, neural, axonal, cortical, survival, disorders, motor

-357.2  aging, lifespan, globin, age related, longevity, human, age, erythroid, sickle cell, beta globin, hb, senescence, adult, older, lcr
Table 2: Co-document frequency matrix for the top words in a low-quality topic (according to our coherence metric). The diagonal gives the overall document frequency for each word w; the rightmost column is N_{w|t}. Note that “globin” and “erythroid” do not co-occur with any of the aging-related words.

               aging  lifespan  globin  age related  longevity  erythroid  age  sickle cell  human   hb  |  N_{w|t}
aging            487        53       0           65         42          0   51            0    138    0  |      914
lifespan          53       122       0           15         28          0   15            0     44    0  |      205
globin             0         0      39            0          0         19    0           15     27    3  |      200
age related       65        15       0          119         12          0   25            0     37    0  |      160
longevity         42        28       0           12         73          0    6            0     20    1  |      159
erythroid          0         0      19            0          0         69    0            8     23    1  |      110
age               51        15       0           25          6          0  245            1     82    0  |      103
sickle cell        0         0      15            0          0          8    1           43     16    2  |       93
human            138        44      27           37         20         23   82           16   4347  157  |       91
hb                 0         0       3            0          1          1    0            2      5   15  |       73
rics (topic size and coherence) and the y-axis is the
number of annotators that correctly identified the
true intruder word (accuracy). The histograms be-
low these plots show the number of topics with each
level of annotator accuracy for good and bad top-
ics. For good topics (green circles), the annotators
were generally able to detect the intruder word with
high accuracy. Bad topics (red diamonds) had more
uniform accuracies. These results suggest that top-
ics with low intruder detection accuracy tend to be
bad, but some bad topics can have a high accuracy.
For example, spotting an intruder word in a chained
topic can be easy. The low-quality topic recep-
tors, cannabinoid, cannabinoids, ligands, cannabis,
endocannabinoid, cxcr4, [virus], receptor, sdf1, is
a typical “chained” topic, with CXCR4 linked to
cannabinoids only through receptors, and otherwise
unrelated. Eight out of ten annotators correctly iden-
tified “virus” as the correct intruder. Repeating the
logistic regression experiment using intruder detec-
tion accuracy as input, the AIC value is 163.18—
much worse than either topic size or coherence.
5 Generalized Pólya Urn Models
Although the topic coherence metric defined above
provides an accurate way of assessing topic quality,
preventing poor quality topics from occurring in the
first place is preferable. Our results in the previous
section show that we can identify low-quality top-
ics without making use of external supervision; the
training data by itself contains sufficient information
at least to reject poor combinations of words.
In this section, we describe a new topic model that
incorporates the corpus-specific word co-occurrence
information used in our coherence metric directly
into the statistical topic modeling framework. It
is important to note that simply disallowing words
that never co-occur from being assigned to the same
topic is not sufficient. Due to the power-law charac-
teristics of language, most words are rare and will
not co-occur with most other words regardless of
their semantic similarity. It is rather the degree
to which the most prominent words in a topic do
not co-occur with the other most prominent words
in that topic that is an indicator of topic incoher-
ence. We therefore desire models that guide topics
towards semantic similarity without imposing hard
constraints.
As an example of such a model, we present a new topic model in which the occurrence of word type w in topic t increases not only the probability of seeing that word type again, but also increases the probability of seeing other related words (as determined by co-document frequencies for the corpus being modeled). This new topic model retains the document–topic component of standard LDA, but replaces the usual Pólya urn topic–word component with a generalized Pólya urn framework (Mahmoud, 2008).
A sequence of i.i.d. samples from a discrete distribution can be imagined as arising by repeatedly drawing a random ball from an urn, where the number of balls of each color is proportional to the probability of that color, replacing the selected ball after each draw. In a Pólya urn, each ball is replaced along with another ball of the same color. Samples from this model exhibit the “burstiness” property: the probability of drawing a ball of color w increases each time a ball of that color is drawn. This process represents the marginal distribution of a hierarchical model with a Dirichlet prior and a multinomial likelihood, and is used as the distribution over words for each topic in almost all previous topic models.
In a generalized Pólya urn model, having drawn a ball of color w, A_{vw} additional balls of each color v ∈ {1, ..., W} are returned to the urn. Given W and Z, the conditional posterior probability of word w in topic t implied by this generalized model is

\[ P(w \mid t, \mathbf{W}, \mathbf{Z}, \beta, A) = \frac{\sum_v N_{v|t} A_{vw} + \beta}{N_t + |V|\,\beta}, \tag{2} \]

where A is a W × W real-valued matrix, known as the addition matrix or schema. The simple Pólya urn model (and hence the conditional posterior probability of word w in topic t under LDA) can be recovered by setting the schema A to the identity matrix. Unlike the simple Pólya distribution, we do not know of a representation of the generalized Pólya urn distribution that can be expressed using a concise set of conditional independence assumptions. A standard graphical model with plate notation would therefore not be helpful in highlighting the differences between the two models, and is not shown.
Algorithm 1 shows pseudocode for a single Gibbs sweep over the latent variables Z in standard LDA. Algorithm 2 shows the modifications necessary to
1: for d ∈ D do
2:   for w_i ∈ w^(d) do
3:     N_{z_i|d} ← N_{z_i|d} − 1
4:     N_{w_i|z_i} ← N_{w_i|z_i} − 1
5:     sample z_i ∝ (N_{z|d} + α_z) · (N_{w_i|z} + β) / Σ_{w'} (N_{w'|z} + β)
6:     N_{z_i|d} ← N_{z_i|d} + 1
7:     N_{w_i|z_i} ← N_{w_i|z_i} + 1
8:   end for
9: end for
Algorithm 1: One sweep of LDA Gibbs sampling.
1: for d ∈ D do
2:   for w_i ∈ w^(d) do
3:     N_{z_i|d} ← N_{z_i|d} − 1
4:     for all v do
5:       N_{v|z_i} ← N_{v|z_i} − A_{v w_i}
6:     end for
7:     sample z_i ∝ (N_{z|d} + α_z) · (N_{w_i|z} + β) / Σ_{w'} (N_{w'|z} + β)
8:     N_{z_i|d} ← N_{z_i|d} + 1
9:     for all v do
10:      N_{v|z_i} ← N_{v|z_i} + A_{v w_i}
11:    end for
12:  end for
13: end for
Algorithm 2: One sweep of generalized Pólya urn Gibbs sampling; the differences from LDA are in lines 4–6 and 9–11 (highlighted in red in the original).
support a generalized Pólya urn model: rather than subtracting exactly one from the count of the word given the old topic, sampling, and then adding one to the count of the word given the new topic, we subtract a column of the schema matrix from the entire count vector over words for the old topic, sample, and add the same column to the count vector for the new topic. As long as A is sparse, this operation adds only a constant factor to the computation.
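A compact Python rendering of one such sweep is given below. It is our own sketch of Algorithm 2 rather than the authors' implementation: it uses a dense schema matrix for clarity (the paper keeps A sparse), and the array names are assumptions. Setting A to the identity matrix recovers the standard LDA sweep of Algorithm 1.

import numpy as np

rng = np.random.default_rng(0)

def gen_polya_sweep(docs, Z, N_td, N_vt, alpha, beta, A):
    """One Gibbs sweep in the style of Algorithm 2.
    docs: list of lists of word-type ids; Z: matching list of lists of topic ids;
    N_td: (T, D) topic counts per document; N_vt: (V, T) float array of
    schema-weighted word counts per topic; A: (V, V) schema matrix (A = I gives LDA)."""
    V, T = N_vt.shape
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            z_old = Z[d][i]
            # Subtract this token's schema column from the old topic's count vector.
            N_td[z_old, d] -= 1
            N_vt[:, z_old] -= A[:, w]
            # Sampling weights: (N_{z|d} + alpha) * (N_{w|z} + beta) / sum_{w'}(N_{w'|z} + beta).
            totals = N_vt.sum(axis=0) + V * beta
            p = (N_td[:, d] + alpha) * (N_vt[w, :] + beta) / totals
            z_new = rng.choice(T, p=p / p.sum())
            # Add the same schema column to the new topic's count vector.
            Z[d][i] = z_new
            N_td[z_new, d] += 1
            N_vt[:, z_new] += A[:, w]
    return Z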
Another property of the generalized Pólya urn model is that it is nonexchangeable—the joint probability of the tokens in any given topic is not invariant to permutation of those tokens. Inference of Z given W via Gibbs sampling involves repeatedly cycling through the tokens in W and, for each one, resampling its topic assignment conditioned on W and the current topic assignments for all tokens other than the token of interest. For LDA, the sampling distribution for each topic assignment is simply the product of two predictive probabilities, obtained by treating the token of interest as if it were the last. For a topic model with a generalized Pólya urn for the topic–word component, the sampling distribution is more complicated. Specifically, the topic–word component of the sampling distribution is no longer a simple predictive distribution—when sampling a new value for z_n^(d), the implication of each possible value for subsequent tokens and their topic assignments must be considered. Unfortunately, this can be very computationally expensive, particularly for large corpora. There are several ways around this problem. The first is to use sequential Monte Carlo methods, which have been successfully applied to topic models previously (Canini et al., 2009). The second approach is to approximate the true Gibbs sampling distribution by treating each token as if it were the last, ignoring implications for subsequent tokens and their topic assignments. We find that this approximate method performs well empirically.
5.1 Setting the Schema A
Inspired by our evaluation metric, we define A as

\[ A_{vv} \propto \lambda_v\, D(v), \qquad A_{vw} \propto \lambda_v\, D(w, v), \tag{3} \]

where each element is scaled by a row-specific weight λ_v and each column is normalized to sum to 1. Normalizing columns makes comparison to standard LDA simpler, because the relative effect of the smoothing parameter β = 0.01 is equivalent. We set λ_v = log(D / D(v)), the standard IDF weight used in information retrieval, which is larger for less frequent words. The column for word type w can be interpreted as the word types with significant association with w. The IDF weighting therefore has the effect of increasing the strength of association for rare word types. We also found empirically that it is helpful to remove off-diagonal elements for the most common types, such as those that occur in more than 5% of documents (IDF < 3.0). Including nonzero off-diagonal values in A for very frequent types causes the model to disperse those types over many topics, which leads to large numbers of extremely similar topics. To measure this effect, we calculated the Jensen-Shannon divergence between all pairs of topic–word distributions in a given model. For a model using off-diagonal weights for all word
Figure 3: Pólya urn topics (blue) have higher average coherence and converge much faster than LDA topics (red). The top plots (for 100, 200, 300, and 400 topics) show topic coherence (averaged over 15 runs) over 1000 iterations of Gibbs sampling; error bars are not visible in this plot. The middle plots show the average coherence of the 10 lowest-scoring topics. The bottom plots show held-out log probability (HOLP, in thousands) for the same models (three runs each of 5-fold cross-validation).
Name    Docs     Avg. Tok.         Tokens     Vocab
NIH     18756    114.64 ± 30.41    2150172    28702

Table 3: Data set statistics.
types, the mean of the 100 lowest divergences was 0.29 ± 0.05 (a divergence of 1.0 represents distributions with no shared support) at T = 200. The average divergence of the 100 most similar pairs of topics for standard LDA (i.e., A = I) is 0.67 ± 0.05. The same statistic for the generalized Pólya urn model without off-diagonal elements for word types with high document frequency is 0.822 ± 0.09.
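A divergence computation of this kind could be carried out as in the sketch below; it assumes SciPy and a row-stochastic topic–word matrix named phi, and is our own illustration rather than the authors' script.

import numpy as np
from scipy.spatial.distance import jensenshannon

def min_topic_divergences(phi, k=100):
    """Pairwise Jensen-Shannon divergences (base 2, so 1.0 means no shared support)
    between rows of a (T, V) topic-word probability matrix; returns the k smallest."""
    T = phi.shape[0]
    divs = []
    for i in range(T):
        for j in range(i + 1, T):
            divs.append(jensenshannon(phi[i], phi[j], base=2) ** 2)   # distance -> divergence
    return np.sort(divs)[:k]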
Setting the off-diagonal elements of the schema A to zero for the most common word types also has the fortunate effect of substantially reducing preprocessing time. We find that Gibbs sampling for the generalized Pólya model takes roughly two to three times longer than for standard LDA, depending on the sparsity of the schema, due to additional bookkeeping needed before and after sampling topics.
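A sketch of how such a schema could be assembled from co-document frequencies is shown below. It reflects our reading of Equation 3 and the 5% document-frequency cutoff (dense numpy arrays, hypothetical names), not the authors' preprocessing code; in particular, exactly which off-diagonal entries are zeroed for common types is our interpretation.

import numpy as np

def build_schema(codoc, doc_freq, num_docs, idf_cutoff=3.0):
    """Schema A following Equation 3: A_vv ∝ λ_v D(v), A_vw ∝ λ_v D(w, v), with
    λ_v = log(D / D(v)), columns normalized to sum to 1, and off-diagonal entries
    involving very common types (IDF < idf_cutoff) set to zero.
    codoc: (V, V) co-document frequencies D(v, w); doc_freq: (V,) document frequencies."""
    V = len(doc_freq)
    idf = np.log(num_docs / doc_freq)                 # λ_v, larger for rarer types
    A = idf[:, None] * codoc.astype(float)            # scale each row v by λ_v
    np.fill_diagonal(A, idf * doc_freq)               # diagonal: λ_v D(v)
    # Zero off-diagonal mass touching frequent types (roughly > 5% of documents).
    common = idf < idf_cutoff
    keep = np.eye(V, dtype=bool) | ~(common[:, None] | common[None, :])
    A = np.where(keep, A, 0.0)
    col_sums = A.sum(axis=0)
    return A / np.where(col_sums > 0.0, col_sums, 1.0)   # normalize each column to 1

An A built this way can be passed directly to the Gibbs sweep sketched earlier (after Algorithm 2).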
5.2 Experimental Results
We evaluated the new model on a corpus of NIH grant abstracts. Details are given in Table 3. Figure 3 shows the performance of the generalized Pólya urn model relative to LDA. Two metrics—our new topic coherence metric and the log probability of held-out documents—are shown over 1000 iterations at 50-iteration intervals. Each model was run over five folds of cross-validation, each with three random initializations. For each model we calculated an overall coherence score by calculating the topic coherence for each topic individually and then averaging these values. We report the average over all 15 models in each plot. Held-out probabilities were calculated using the left-to-right method of Wallach et al. (2009), with each cross-validation fold using its own schema A. The generalized Pólya model performs very well in average topic coherence, reaching levels within the first 50 iterations that match the final score. This model has an early advantage for held-out probability as well, but is eventually overtaken by LDA. This trend is consistent with Chang et al.'s observation that held-out probabilities are not always good predictors of human judgments (Chang et al., 2009). Results are consistent over T ∈ {100, 200, 300}.
In section 4.2, we demonstrated that our topic coherence metric correlates with expert opinions of topic quality for standard LDA. The generalized Pólya urn model was therefore designed with the goal of directly optimizing that metric. It is possible, however, that optimizing for coherence directly could break the association between the coherence metric and topic quality. We therefore repeated the expert-driven evaluation protocol described in section 3.1. We trained one standard LDA model and one generalized Pólya urn model, each with T = 200, and randomly shuffled the 400 resulting topics. The topics were then presented to the experts from NINDS, with no indication as to the identity of the model from which each topic came. As these evaluations are time consuming, the experts evaluated only the first 200 topics, which consisted of 103 generalized Pólya urn topics and 97 LDA topics. AUC values predicting bad topics given coherence were 0.83 and 0.80, respectively. Coherence effectively predicts topic quality in both models.
Although we were able to improve the average overall quality of topics and the average quality of the ten lowest-scoring topics, we found that the generalized Pólya urn model was less successful at reducing the overall number of bad topics. Ignoring one "unbalanced" topic from each model, 16.5% of the LDA topics and 13.5% of the generalized Pólya urn topics were marked as "bad." While this result is an improvement, it is not significant at p = 0.05.
6 Discussion
We have demonstrated the following:

• There is a class of low-quality topics that cannot be detected using existing word-intrusion tests, but that can be identified reliably using a metric based on word co-occurrence statistics.

• It is possible to improve the coherence score of topics, both overall and for the ten worst, while retaining the ability to flag bad topics, all without requiring semi-supervised data or additional reference corpora. Although additional information may be useful, it is not necessary.

• Such models achieve better performance with substantially fewer Gibbs iterations than LDA.
We believe that the most important challenges in fu-
ture topic modeling research are improving the se-
mantic quality of topics, particularly at the low end,
and scaling to ever-larger data sets while ensuring
high-quality topics. Our results provide critical in-
sight into these problems. We found that it should be
possible to construct unsupervised topic models that
do not produce bad topics. We also found that Gibbs
sampling mixes faster for models that use word co-
occurrence information, suggesting that such meth-
ods may also be useful in guiding online stochastic
variational inference (Hoffman et al., 2010).
Acknowledgements
This work was supported in part by the Center
for Intelligent Information Retrieval, in part by the
CIA, the NSA and the NSF under NSF grant # IIS-
0326249, in part by NIH:HHSN271200900640P,
and in part by NSF grant SBE-0965436. Any
opinions, findings and conclusions or recommenda-
tions expressed in this material are the authors’ and
do not necessarily reflect those of the sponsor.
References
Loulwah AlSumait, Daniel Barbara, James Gentle, and
Carlotta Domeniconi. 2009. Topic significance rank-
ing of LDA generative models. In ECML.
David Andrzejewski, Xiaojin Zhu, and Mark Craven.
2009. Incorporating domain knowledge into topic
modeling via Dirichlet forest priors. In Proceedings of
the 26th Annual International Conference on Machine
Learning, pages 25–32.
David M. Blei, Andrew Y. Ng, and Michael I. Jordan.
2003. Latent Dirichlet allocation. Journal of Machine
Learning Research, 3:993–1022, January.
K.R. Canini, L. Shi, and T.L. Griffiths. 2009. Online
inference of topics with latent Dirichlet allocation. In
Proceedings of the 12th International Conference on
Artificial Intelligence and Statistics.
Jonathan Chang, Jordan Boyd-Graber, Chong Wang,
Sean Gerrish, and David M. Blei. 2009. Reading tea
leaves: How humans interpret topic models. In Ad-
vances in Neural Information Processing Systems 22,
pages 288–296.
Kenneth Church and Patrick Hanks. 1990. Word asso-
ciation norms, mutual information, and lexicography.
Computational Linguistics, 16(1):22–29.
Gabriel Doyle and Charles Elkan. 2009. Accounting for
burstiness in topic models. In ICML.
S. Geman and D. Geman. 1984. Stochastic relaxation,
Gibbs distributions, and the Bayesian restoration of
images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721–741.
271
Thomas L. Griffiths and Mark Steyvers. 2004. Find-
ing scientific topics. Proceedings of the National
Academy of Sciences, 101(suppl. 1):5228–5235.
Matthew Hoffman, David Blei, and Francis Bach. 2010.
Online learning for latent Dirichlet allocation. In NIPS.
Hosam Mahmoud. 2008. Pólya Urn Models. Chapman & Hall/CRC Texts in Statistical Science.
Qiaozhu Mei, Xuehua Shen, and ChengXiang Zhai.
2007. Automatic labeling of multinomial topic mod-
els. In Proceedings of the 13th ACM SIGKDD Interna-
tional Conference on Knowledge Discovery and Data
Mining, pages 490–499.
David Newman, Jey Han Lau, Karl Grieser, and Timothy
Baldwin. 2010. Automatic evaluation of topic coher-
ence. In Human Language Technologies: The Annual
Conference of the North American Chapter of the As-
sociation for Computational Linguistics.
Yee Whye Teh, Dave Newman, and Max Welling. 2006.
A collapsed variational Bayesian inference algorithm
for latent Dirichlet allocation. In Advances in Neural
Information Processing Systems 18.
Hanna Wallach, Iain Murray, Ruslan Salakhutdinov, and
David Mimno. 2009. Evaluation methods for topic
models. In Proceedings of the 26th International Con-
ference on Machine Learning.
Xing Wei and Bruce Croft. 2006. LDA-based document
models for ad-hoc retrieval. In Proceedings of the 29th
Annual International SIGIR Conference.