
Optimizing Semantic Coherence in Topic Models

January 2011


Conference: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, 27-31 July 2011, John McIntyre Conference Centre, Edinburgh, UK, A meeting of SIGDAT, a Special Interest Group of the ACL

Authors:

David Mimno

Cornell University

Miriam Leenders

U.S. Department of Health and Human Services




Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 262–272, Edinburgh, Scotland, UK, July 27–31, 2011. © 2011 Association for Computational Linguistics

Optimizing Semantic Coherence in Topic Models

David Mimno

Princeton University

Princeton, NJ 08540

mimno@cs.princeton.edu

Hanna M. Wallach

University of Massachusetts, Amherst

Amherst, MA 01003

wallach@cs.umass.edu

Edmund Talley Miriam Leenders

National Institutes of Health

Bethesda, MD 20892

{talleye,leenderm}@ninds.nih.gov

Andrew McCallum

University of Massachusetts, Amherst

Amherst, MA 01003

mccallum@cs.umass.edu

Abstract

Latent variable models have the potential to add value to large document collections by discovering interpretable, low-dimensional subspaces. In order for people to use such models, however, they must trust them. Unfortunately, typical dimensionality reduction methods for text, such as latent Dirichlet allocation, often produce low-dimensional subspaces (topics) that are obviously flawed to human domain experts. The contributions of this paper are threefold: (1) An analysis of the ways in which topics can be flawed; (2) an automated evaluation metric for identifying such topics that does not rely on human annotators or reference collections outside the training data; (3) a novel statistical topic model based on this metric that significantly improves topic quality in a large-scale document collection from the National Institutes of Health (NIH).

1 Introduction

Statistical topic models such as latent Dirichlet allocation (LDA) (Blei et al., 2003) provide a powerful framework for representing and summarizing the contents of large document collections. In our experience, however, the primary obstacle to acceptance of statistical topic models by users outside the machine learning community is the presence of poor quality topics. Topics that mix unrelated or loosely-related concepts substantially reduce users’ confidence in the utility of such automated systems.

In general, users prefer models with larger numbers of topics because such models have greater resolution and are able to support finer-grained distinctions. Unfortunately, we have observed that there is a strong relationship between the size of topics and the probability of topics being nonsensical as judged by domain experts: as the number of topics increases, the smallest topics (measured by the number of word tokens assigned to each topic) are almost always poor quality. The common practice of displaying only a small number of example topics hides the fact that as many as 10% of topics may be so bad that they cannot be shown without reducing users’ confidence.

The evaluation of statistical topic models has traditionally been dominated by either extrinsic methods (i.e., using the inferred topics to perform some external task such as information retrieval (Wei and Croft, 2006)) or quantitative intrinsic methods, such as computing the probability of held-out documents (Wallach et al., 2009). Recent work has focused on evaluation of topics as semantically-coherent concepts. For example, Chang et al. (2009) found that the probability of held-out documents is not always a good predictor of human judgments. Newman et al. (2010) showed that an automated evaluation metric based on word co-occurrence statistics gathered from Wikipedia could predict human evaluations of topic quality. AlSumait et al. (2009) used differences between topic-specific distributions over words and the corpus-wide distribution over words to identify overly-general “vacuous” topics. Finally, Andrzejewski et al. (2009) developed semi-supervised methods that avoid specific user-labeled semantic coherence problems.

The contributions of this paper are threefold: (1) to identify distinct classes of low-quality topics, some of which are not flagged by existing evaluation methods; (2) to introduce a new topic “coherence” score that corresponds well with human coherence judgments and makes it possible to identify specific semantic problems in topic models without human evaluations or external reference corpora; (3) to present an example of a new topic model that learns latent topics by directly optimizing a metric of topic coherence. With little additional computational cost beyond that of LDA, this model exhibits significant gains in average topic coherence score. Although the model does not result in a statistically-significant reduction in the number of topics marked “bad”, the model consistently improves the topic coherence score of the ten lowest-scoring topics (i.e., results in bad topics that are “less bad” than those found using LDA) while retaining the ability to identify low-quality topics without human interaction.

2 Latent Dirichlet Allocation

LDA is a generative probabilistic model for documents $\mathcal{W} = \{\mathbf{w}^{(1)}, \mathbf{w}^{(2)}, \ldots, \mathbf{w}^{(D)}\}$. To generate a word token $w_n^{(d)}$ in document $d$, we draw a discrete topic assignment $z_n^{(d)}$ from a document-specific distribution over the $T$ topics $\theta_d$ (which is itself drawn from a Dirichlet prior with hyperparameter $\alpha$), and then draw a word type for that token from the topic-specific distribution over the vocabulary $\phi_{z_n^{(d)}}$. The inference task in topic models is generally cast as inferring the document–topic proportions $\{\theta_1, \ldots, \theta_D\}$ and the topic-specific distributions $\{\phi_1, \ldots, \phi_T\}$. The multinomial topic distributions are usually drawn from a shared symmetric Dirichlet prior with hyperparameter $\beta$, such that conditioned on $\{\phi_t\}_{t=1}^{T}$ and the topic assignments $\{\mathbf{z}^{(1)}, \mathbf{z}^{(2)}, \ldots, \mathbf{z}^{(D)}\}$, the word tokens are independent. In practice, however, it is common to deal directly with the “collapsed” distributions that result from integrating over the topic-specific multinomial parameters. The resulting distribution over words for a topic $t$ is then a function of the hyperparameter $\beta$ and the number of words of each type assigned to that topic, $N_{w|t}$. This distribution, known as the Dirichlet compound multinomial (DCM) or Pólya distribution (Doyle and Elkan, 2009), breaks the assumption of conditional independence between word tokens given topics, but is useful during inference because the conditional probability of a word $w$ given topic $t$ takes a very simple form: $P(w \mid t, \beta) = \frac{N_{w|t} + \beta}{N_t + |V|\beta}$, where $N_t = \sum_{w'} N_{w'|t}$ and $|V|$ is the vocabulary size.
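The collapsed probability above can be computed directly from per-topic word counts. The following is a minimal sketch, not the authors' implementation; the function name `collapsed_word_prob`, the toy counts, and the four-word vocabulary are all hypothetical.

```python
from collections import Counter


def collapsed_word_prob(word, topic_counts, beta, vocab_size):
    """P(w | t, beta) = (N_{w|t} + beta) / (N_t + |V| * beta)."""
    n_t = sum(topic_counts.values())  # N_t: total tokens assigned to the topic
    return (topic_counts.get(word, 0) + beta) / (n_t + vocab_size * beta)


# Hypothetical counts N_{w|t} for a single topic over a toy vocabulary.
counts = Counter({"neuron": 3, "axon": 1})
vocab = ["neuron", "axon", "brain", "globin"]
probs = {
    w: collapsed_word_prob(w, counts, beta=0.01, vocab_size=len(vocab))
    for w in vocab
}
```

Word types that have never been assigned to the topic still receive a small probability through $\beta$, and the probabilities over the vocabulary sum to one.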

The process for generating a sequence of words from such a model is known as the simple Pólya urn model (Mahmoud, 2008), in which the initial probability of word type $w$ in topic $t$ is proportional to $\beta$, while the probability of each subsequent occurrence of $w$ in topic $t$ is proportional to the number of times $w$ has been drawn in that topic plus $\beta$. Note that this unnormalized weight for each word type depends only on the count of that word type, and is independent of the count of any other word type $w'$. Thus, in the DCM/Pólya distribution, drawing word type $w$ must decrease the probability of seeing all other word types $w' \neq w$. In a later section, we will introduce a topic model that substitutes a generalized Pólya urn model for the DCM/Pólya distribution, allowing a draw of word type $w$ to increase the probability of seeing certain other word types.

For real-world data, documents $\mathcal{W}$ are observed, while the corresponding topic assignments $\mathcal{Z}$ are unobserved and may be inferred using either variational methods (Blei et al., 2003; Teh et al., 2006) or MCMC methods (Griffiths and Steyvers, 2004). Here, we use MCMC methods, specifically Gibbs sampling (Geman and Geman, 1984), which involves sequentially resampling each topic assignment $z_n^{(d)}$ from its conditional posterior given the documents $\mathcal{W}$, the hyperparameters $\alpha$ and $\beta$, and $\mathcal{Z}_{\setminus d,n}$ (the current topic assignments for all tokens other than the token at position $n$ in document $d$).

3 Expert Opinions of Topic Quality

Concentrating on 300,000 grant and related journal paper abstracts from the National Institutes of Health (NIH), we worked with two experts from the National Institute of Neurological Disorders and Stroke (NINDS) to collaboratively design an expert-driven topic annotation study. The goal of this study was to develop an annotated set of baseline topics, along with their salient characteristics, as a first step towards automatically identifying and inferring the kinds of topics desired by domain experts.¹

3.1 Expert-Driven Annotation Protocol

In order to ensure that the topics selected for annotation were within the NINDS experts’ area of expertise, the experts selected 148 topics (out of 500), all associated with areas funded by NINDS. Each topic $t$ was presented to the experts as a list of the thirty most probable words for that topic, in descending order of their topic-specific “collapsed” probabilities, $\frac{N_{w|t} + \beta}{N_t + |V|\beta}$. In addition to the most probable words, the experts were also given metadata for each topic: the most common sequences of two or more consecutive words assigned to that topic, the four topics that most often co-occurred with that topic, the most common IDF-weighted words from titles of grants, thesaurus terms, NIH institutes, journal titles, and finally a list of the highest probability grants and PubMed papers for that topic.

¹ All evaluated models will be released publicly.

The experts first categorized each topic as one of three types: “research”, “grant mechanisms and publication types” or “general”.² The quality of each topic (“good”, “intermediate”, or “bad”) was then evaluated using criteria specific to the type of topic. In general, topics were only annotated as “good” if they contained words that could be grouped together as a single coherent concept. Additionally, each “research” topic was only considered to be “good” if, in addition to representing a single coherent concept, the aggregate content of the set of documents with appreciable allocations to that topic clearly contained text referring to the concept inferred from the topic words. Finally, for each topic marked as being either “intermediate” or “bad”, one or more of the following problems (defined by the domain experts) was identified, as appropriate:

• Chained: every word is connected to every other word through some pairwise word chain, but not all word pairs make sense. For example, a topic whose top three words are “acids”, “fatty” and “nucleic” consists of two distinct concepts (i.e., acids produced when fats are broken down versus the building blocks of DNA and RNA) chained via the word “acids”.

• Intruded: either two or more unrelated sets of related words, joined arbitrarily, or an otherwise good topic with a few “intruder” words.

• Random: no clear, sensical connections between more than a few pairs of words.

• Unbalanced: the top words are all logically connected to each other, but the topic combines very general and specific terms (e.g., “signal transduction” versus “notch signaling”).

² Equivalent to “vacuous topics” of AlSumait et al. (2009).

Examples of a good general topic, a good research topic, and a chained research topic are in Table 1.

3.2 Annotation Results

The experts annotated the topics independently and then aggregated their results. Interestingly, no topics were ever considered “good” by one expert and “bad” by the other: when there was disagreement between the experts, one expert always believed the topic to be “intermediate.” In such cases, the experts discussed the reasons for their decisions and came to a consensus. Of the 148 topics selected for annotation, 90 were labeled as “good,” 21 as “intermediate,” and 37 as “bad.” Of the topics labeled as “bad” or “intermediate,” 23 were “chained,” 21 were “intruded,” 3 were “random,” and 15 were “unbalanced”. (The annotators were permitted to assign more than one problem to any given topic.)

4 Automated Metrics for Predicting Expert Annotations

The ultimate goal of this paper is to develop methods for building models with large numbers of specific, high-quality topics from domain-specific corpora. We therefore explore the extent to which information already contained in the documents being modeled can be used to assess topic quality.

In this section we evaluate several methods for ranking the quality of topics and compare these rankings to human annotations. No method is likely to perfectly predict human judgments, as individual annotators may disagree on particular topics. For an application involving removing low quality topics we recommend using a weighted combination of metrics, with a threshold determined by users.

4.1 Topic Size

As a simple baseline, we considered the extent to which topic “size” (as measured by the number of tokens assigned to each topic via Gibbs sampling) is a good metric for assessing topic quality. Figure 1 (top) displays the topic size and expert annotations (“good”, “intermediate”, “bad”) for the 148 topics manually labeled by annotators as described above. This figure suggests that topic size is a reasonable predictor of topic quality. Although there is some overlap, “bad” topics are generally smaller than “good” topics. Unfortunately, this observation conflicts with the goal of building highly specialized, domain-specific topic models with many high-quality, fine-grained topics: in such models the majority of topics will have relatively few tokens assigned to them.

[Figure 1 panels: top, topics by Tokens (40,000 to 160,000); bottom, topics by Coherence (−600 to −200); each grouped as good, inter, bad.]

Figure 1: Topic size is a good indicator of quality; the new coherence metric is better. Top shows expert-rated topics ranked by topic size (AP 0.89, AUC 0.79), bottom shows same topics ranked by coherence (AP 0.94, AUC 0.87). Random jitter is added to the y-axis for clarity.

4.2 Topic Coherence

When displaying topics to users, each topic $t$ is generally represented as a list of the $M = 5, \ldots, 20$ most probable words for that topic, in descending order of their topic-specific “collapsed” probabilities. Although there has been previous work on automated generation of labels or headings for topics (Mei et al., 2007), we choose to work only with the ordered list representation. Labels may obscure or detract from fundamental problems with topic coherence, and better labels don’t make bad topics good.

The expert-driven annotation study described in section 3 suggests that three of the four types of poor-quality topics (“chained,” “intruded” and “random”) could be detected using a metric based on the co-occurrence of words within the documents being modeled. For “chained” and “intruded” topics, it is likely that although pairs of words belonging to a single concept will co-occur within a single document (e.g., “nucleic” and “acids” in documents about DNA), word pairs belonging to different concepts (e.g., “fatty” and “nucleic”) will not. For random topics, it is likely that few words will co-occur.

This insight can be used to design a new metric for assessing topic quality. Letting $D(v)$ be the document frequency of word type $v$ (i.e., the number of documents with at least one token of type $v$) and $D(v, v')$ be the co-document frequency of word types $v$ and $v'$ (i.e., the number of documents containing one or more tokens of type $v$ and at least one token of type $v'$), we define topic coherence as

$$C(t; V^{(t)}) = \sum_{m=2}^{M} \sum_{l=1}^{m-1} \log \frac{D(v_m^{(t)}, v_l^{(t)}) + 1}{D(v_l^{(t)})}, \qquad (1)$$

where $V^{(t)} = (v_1^{(t)}, \ldots, v_M^{(t)})$ is a list of the $M$ most probable words in topic $t$. A smoothing count of 1 is included to avoid taking the logarithm of zero.
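As an illustration of equation 1, the sketch below computes the coherence score from tokenized documents. It is a minimal, unoptimized sketch under stated assumptions: the function name `topic_coherence` and the four toy documents are hypothetical, and every word in the list is assumed to occur in at least one document so that $D(v_l) > 0$.

```python
import math


def topic_coherence(top_words, documents):
    """Equation 1: sum over ordered pairs of log((D(v_m, v_l) + 1) / D(v_l)).

    top_words: the M most probable words of a topic, most probable first.
    documents: iterable of token lists (the corpus being modeled).
    Assumes every word in top_words occurs in at least one document.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all of the given word types.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):  # l ranges over the higher-ranked words
            v_m, v_l = top_words[m], top_words[l]
            score += math.log((doc_freq(v_m, v_l) + 1) / doc_freq(v_l))
    return score


# Hypothetical toy corpus; real corpora contain many thousands of documents.
docs = [
    ["aging", "lifespan", "human"],
    ["aging", "human"],
    ["globin", "human"],
    ["globin", "erythroid"],
]
coherent = topic_coherence(["aging", "lifespan", "human"], docs)
chained = topic_coherence(["aging", "globin", "human"], docs)
```

On a corpus this small the smoothing count dominates, so only the relative order of the two scores is meaningful: the list that mixes “aging” with “globin” scores lower because the two words never share a document.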

Figure 1 shows the association between the expert annotations and both topic size (top) and our coherence metric (bottom). We evaluate these results using standard ranking metrics, average precision and the area under the ROC curve. Treating “good” topics as positive and “intermediate” or “bad” topics as negative, we get average precision values of 0.89 for topic size vs. 0.94 for coherence and AUC 0.79 for topic size vs. 0.87 for coherence. We performed a logistic regression analysis on the binary variable “is this topic bad”. Using topic size alone as a predictor gives AIC (a measure of model fit) 152.5. Coherence alone has AIC 113.8 (substantially better). Both predictors combined have AIC 115.8: the simpler coherence-only model provides the best performance. We tried weighting the terms in equation 1 by their corresponding topic–word probabilities and by their position in the sorted list of the $M$ most probable words for that topic, but we found that a uniform weighting better predicted topic quality.
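For readers unfamiliar with the two ranking metrics, the sketch below shows one common way to compute them from a list of topics sorted by a quality score. It is illustrative only: the function names and the five-item toy ranking are hypothetical, ties in the ranking are assumed absent, and the paper does not state which implementation was used.

```python
def average_precision(labels):
    """labels: booleans ordered from best- to worst-ranked; True marks a positive."""
    hits, total = 0, 0.0
    for rank, is_positive in enumerate(labels, start=1):
        if is_positive:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / hits if hits else 0.0


def roc_auc(labels):
    """Fraction of (positive, negative) pairs in which the positive is ranked higher."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative")
    negatives_below, wins = 0, 0
    for is_positive in reversed(labels):  # walk from worst-ranked to best-ranked
        if is_positive:
            wins += negatives_below
        else:
            negatives_below += 1
    return wins / (positives * negatives)


# Hypothetical ranking of five topics by a quality score; True = "good".
ranking = [True, True, False, True, False]
ap = average_precision(ranking)
auc = roc_auc(ranking)
```

Both metrics equal 1.0 when every “good” topic is ranked above every other topic, which is why they suit the comparison of topic size and coherence as rankers.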

Our topic coherence metric also exhibits good qualitative behavior: of the 20 best-scoring topics, 18 are labeled as “good,” one is “intermediate” (“unbalanced”), and one is “bad” (combining “cortex” and “fmri”, words that commonly co-occur, but are conceptually distinct). Of the 20 worst scoring topics, 15 are “bad,” 4 are “intermediate,” and only one (with the 19th worst coherence score) is “good.”

Our coherence metric relies only upon word co-occurrence statistics gathered from the corpus being modeled, and does not depend on an external reference corpus. Ideally, all such co-occurrence information would already be accounted for in the model. We believe that one of the main contributions of our work is demonstrating that standard topic models do not fully utilize available co-occurrence information, and that a held-out reference corpus is therefore not required for purposes of topic evaluation.

Equation 1 is very similar to pointwise mutual information (PMI), but is more closely associated with our expert annotations than PMI (which achieves AUC 0.64 and AIC 170.51). PMI has a long history in language technology (Church and Hanks, 1990), and was recently used by Newman et al. (2010) to evaluate topic models. When expressed in terms of count variables as in equation 1, PMI includes an additional term for $D(v_m^{(t)})$. The improved performance of our metric over PMI implies that what matters is not the difference between the joint probability of words $m$ and $l$ and the product of marginals, but the conditional probability of each word given each of the higher-ranked words in the topic.
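Writing both quantities in terms of document counts makes the extra term explicit. The derivation below is a sketch under an assumption not stated at this point in the text: probabilities are estimated as relative document frequencies, with $D$ the total number of documents, and PMI is shown without smoothing.

```latex
\begin{align*}
\mathrm{PMI}(v_m, v_l)
  &= \log \frac{p(v_m, v_l)}{p(v_m)\, p(v_l)}
   \approx \log \frac{D(v_m, v_l) / D}{\bigl(D(v_m)/D\bigr)\bigl(D(v_l)/D\bigr)} \\
  &= \log \frac{D(v_m, v_l)}{D(v_l)} \;-\; \log \frac{D(v_m)}{D}, \\[4pt]
\text{coherence term}
  &= \log \frac{D(v_m, v_l) + 1}{D(v_l)}
   \approx \log p(v_m \mid v_l).
\end{align*}
```

Under this reading, PMI equals the conditional log probability used in equation 1 minus the log marginal frequency of the lower-ranked word $v_m$, so the two scores differ only in whether that marginal is subtracted.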

In order to provide intuition for the behavior of our topic coherence metric, table 1 shows three example topics and their topic coherence scores. The first topic, related to grant-funded training programs, is one of the best-scoring topics. All pairs of words have high co-document frequencies. The second topic, on neurons, is more typical of quality “research” topics. Overall, these words occur less frequently, but generally occur moderately interchangeably: there is little structure to their covariance. The last topic is one of the lowest-scoring topics. Its co-document frequency matrix is shown in table 2. The top two words are closely related: 487 documents include “aging” at least once, 122 include “lifespan”, and 55 include both. Meanwhile, the third word “globin” occurs with only one of the top seven words: the common word “human”.

4.3 Comparison to word intrusion

As an additional check for both our expert annotations and our automated metric, we replicated the “word intrusion” evaluation originally introduced by Chang et al. (2009). In this task, one of the top ten most probable words in a topic is replaced with another word, selected at random from the corpus. The resulting set of words is presented, in a random order, to users, who are asked to identify the “intruder” word. It is very unlikely that a randomly-chosen word will be semantically related to any of the original words in the topic, so if a topic is a high quality representation of a semantically coherent concept, it should be easy for users to select the intruder word. If the topic is not coherent, there may be words in the topic that are also not semantically related to any other word, thus causing users to select “correct” words instead of the real intruder.

[Figure 2 panels: “Comparison of Topic Size to Intrusion Detection” (Correct Guesses vs. Tokens assigned to topic); “Comparison of Coherence to Intrusion Detection” (Correct Guesses vs. Coherence); histograms “Good Topics” and “Bad Topics” (Frequency vs. Correct Guesses).]

Figure 2: Top: results of the intruder selection task relative to two topic quality metrics. Bottom: marginal intruder accuracy frequencies of good and bad topics.

We recruited ten additional expert annotators from NINDS, not including our original annotators, and presented them with the intruder selection task, using the set of previously evaluated topics. Results are shown in figure 2. In the first two plots, the x-axis is one of our two automated quality metrics (topic size and coherence) and the y-axis is the number of annotators that correctly identified the true intruder word (accuracy). The histograms below these plots show the number of topics with each level of annotator accuracy for good and bad topics. For good topics (green circles), the annotators were generally able to detect the intruder word with high accuracy. Bad topics (red diamonds) had more uniform accuracies. These results suggest that topics with low intruder detection accuracy tend to be bad, but some bad topics can have a high accuracy. For example, spotting an intruder word in a chained topic can be easy. The low-quality topic receptors, cannabinoid, cannabinoids, ligands, cannabis, endocannabinoid, cxcr4, [virus], receptor, sdf1, is a typical “chained” topic, with CXCR4 linked to cannabinoids only through receptors, and otherwise unrelated. Eight out of ten annotators identified “virus” as the intruder. Repeating the logistic regression experiment using intruder detection accuracy as input, the AIC value is 163.18, much worse than either topic size or coherence.

Table 1: Example topics (good/general, good/research, chained/research) with different coherence scores (numbers closer to zero indicate higher coherence). The chained topic combines words related to aging (indicated in plain text) and words describing blood and blood-related diseases (bold). The only connection is the common word human.

Coherence  Top words
-167.1     students, program, summer, biomedical, training, experience, undergraduate, career, minority, student, careers, underrepresented, medical students, week, science
-252.1     neurons, neuronal, brain, axon, neuron, guidance, nervous system, cns, axons, neural, axonal, cortical, survival, disorders, motor
-357.2     aging, lifespan, globin, age related, longevity, human, age, erythroid, sickle cell, beta globin, hb, senescence, adult, older, lcr

Table 2: Co-document frequency matrix for the top words in a low-quality topic (according to our coherence metric), shaded to highlight zeros. The diagonal (light gray) shows the overall document frequency for each word $w$. The column on the right is $N_{w|t}$. Note that “globin” and “erythroid” do not co-occur with any of the aging-related words.

               aging  lifespan  globin  age related  longevity  erythroid  age  sickle cell  human  hb  N_{w|t}
aging            487        53       0           65         42          0   51            0    138   0      914
lifespan          53       122       0           15         28          0   15            0     44   0      205
globin             0         0      39            0          0         19    0           15     27   3      200
age related       65        15       0          119         12          0   25            0     37   0      160
longevity         42        28       0           12         73          0    6            0     20   1      159
erythroid          0         0      19            0          0         69    0            8     23   1      110
age               51        15       0           25          6          0  245            1     82   0      103
sickle cell        0         0      15            0          0          8    1           43     16   2       93
human            138        44      27           37         20         23   82           16   4347 157       91
hb                 0         0       3            0          1          1    0            2      5  15       73

5 Generalized Pólya Urn Models

Although the topic coherence metric defined above provides an accurate way of assessing topic quality, preventing poor quality topics from occurring in the first place is preferable. Our results in the previous section show that we can identify low-quality topics without making use of external supervision; the training data by itself contains sufficient information at least to reject poor combinations of words.

In this section, we describe a new topic model that incorporates the corpus-specific word co-occurrence information used in our coherence metric directly into the statistical topic modeling framework. It is important to note that simply disallowing words that never co-occur from being assigned to the same topic is not sufficient. Due to the power-law characteristics of language, most words are rare and will not co-occur with most other words regardless of their semantic similarity. It is rather the degree to which the most prominent words in a topic do not co-occur with the other most prominent words in that topic that is an indicator of topic incoherence. We therefore desire models that guide topics towards semantic similarity without imposing hard constraints.

As an example of such a model, we present a new topic model in which the occurrence of word type $w$ in topic $t$ increases not only the probability of seeing that word type again, but also increases the probability of seeing other related words (as determined by co-document frequencies for the corpus being modeled). This new topic model retains the document–topic component of standard LDA, but replaces the usual Pólya urn topic–word component with a generalized Pólya urn framework (Mahmoud, 2008).

A sequence of i.i.d. samples from a discrete distribution can be imagined as arising by repeatedly drawing a random ball from an urn, where the number of balls of each color is proportional to the probability of that color, replacing the selected ball after each draw. In a Pólya urn, each ball is replaced along with another ball of the same color. Samples from this model exhibit the “burstiness” property: the probability of drawing a ball of color $w$ increases each time a ball of that color is drawn. This process represents the marginal distribution of a hierarchical model with a Dirichlet prior and a multinomial likelihood, and is used as the distribution over words for each topic in almost all previous topic models.

In a generalized Pólya urn model, having drawn a ball of color $w$, $A_{vw}$ additional balls of each color $v \in \{1, \ldots, W\}$ are returned to the urn. Given $\mathcal{W}$ and $\mathcal{Z}$, the conditional posterior probability of word $w$ in topic $t$ implied by this generalized model is

$$P(w \mid t, \mathcal{W}, \mathcal{Z}, \beta, A) = \frac{\sum_v N_{v|t} A_{vw} + \beta}{N_t + |V|\beta}, \qquad (2)$$

where $A$ is a $W \times W$ real-valued matrix, known as the addition matrix or schema. The simple Pólya urn model (and hence the conditional posterior probability of word $w$ in topic $t$ under LDA) can be recovered by setting the schema $A$ to the identity matrix. Unlike the simple Pólya distribution, we do not know of a representation of the generalized Pólya urn distribution that can be expressed using a concise set of conditional independence assumptions. A standard graphical model with plate notation would therefore not be helpful in highlighting the differences between the two models, and is not shown.
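The effect of the schema can be seen on a three-color urn. The sketch below is illustrative only: the function `urn_predictive`, the `related` schema, and the draw counts are hypothetical, and the indexing follows the prose convention that drawing color $w$ adds column $w$ of $A$ to the urn, with columns summing to one.

```python
def urn_predictive(draw_counts, schema, beta):
    """Predictive distribution over colors after the recorded draws.

    draw_counts[u]: number of times color u has been drawn so far.
    schema[v][u]:   balls of color v added to the urn each time u is drawn.
    beta:           initial weight of every color.
    """
    size = len(draw_counts)
    weights = [
        beta + sum(schema[v][u] * draw_counts[u] for u in range(size))
        for v in range(size)
    ]
    total = sum(weights)
    return [weight / total for weight in weights]


identity = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]

# Hypothetical schema in which colors 0 and 1 are related; columns sum to 1.
related = [[0.8, 0.2, 0.0],
           [0.2, 0.8, 0.0],
           [0.0, 0.0, 1.0]]

# One draw of color 0 under each schema.
simple = urn_predictive([1, 0, 0], identity, beta=0.01)
general = urn_predictive([1, 0, 0], related, beta=0.01)
```

With the identity schema, drawing color 0 pushes the probability of color 1 well below its initial value of one third. With the `related` schema the same draw leaves color 1 far more probable than under the simple urn, while the unrelated color 2 is affected identically in both cases.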

Algorithm 1 shows pseudocode for a single Gibbs sweep over the latent variables $\mathcal{Z}$ in standard LDA. Algorithm 2 shows the modifications necessary to support a generalized Pólya urn model: rather than subtracting exactly one from the count of the word given the old topic, sampling, and then adding one to the count of the word given the new topic, we subtract a column of the schema matrix from the entire count vector over words for the old topic, sample, and add the same column to the count vector for the new topic. As long as $A$ is sparse, this operation adds only a constant factor to the computation.

Algorithm 1: One sweep of LDA Gibbs sampling.

    for $d \in \mathcal{D}$ do
        for $w_n \in \mathbf{w}^{(d)}$ do
            $N_{z_i|d_i} \leftarrow N_{z_i|d_i} - 1$
            $N_{w_i|z_i} \leftarrow N_{w_i|z_i} - 1$
            sample $z_i \propto (N_{z|d_i} + \alpha_z) \frac{N_{w_i|z} + \beta}{\sum_{z'} (N_{w_i|z'} + \beta)}$
            $N_{z_i|d_i} \leftarrow N_{z_i|d_i} + 1$
            $N_{w_i|z_i} \leftarrow N_{w_i|z_i} + 1$
        end for
    end for

Algorithm 2: One sweep of gen. Pólya Gibbs sampling, with differences from LDA (the two loops over $v$) highlighted in red.

    for $d \in \mathcal{D}$ do
        for $w_n \in \mathbf{w}^{(d)}$ do
            $N_{z_i|d_i} \leftarrow N_{z_i|d_i} - 1$
            for all $v$ do
                $N_{v|z_i} \leftarrow N_{v|z_i} - A_{v w_i}$
            end for
            sample $z_i \propto (N_{z|d_i} + \alpha_z) \frac{N_{w_i|z} + \beta}{\sum_{z'} (N_{w_i|z'} + \beta)}$
            $N_{z_i|d_i} \leftarrow N_{z_i|d_i} + 1$
            for all $v$ do
                $N_{v|z_i} \leftarrow N_{v|z_i} + A_{v w_i}$
            end for
        end for
    end for
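One sweep of this procedure can be sketched in Python as follows. This is a hedged illustration rather than the authors' code: the names `initialize` and `gibbs_sweep`, the sparse-column representation `schema_cols`, and the toy corpus are hypothetical, and the topic–word factor is assumed to use the normalizer $N_t + |V|\beta$ from equation 2. Passing identity columns gives a standard collapsed LDA sweep.

```python
import random


def initialize(docs, num_topics, vocab_size, schema_cols, rng):
    """Assign random topics and build the count tables they imply."""
    z = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    doc_topic = [[0] * num_topics for _ in docs]                    # N_{t|d}
    topic_word = [[0.0] * vocab_size for _ in range(num_topics)]    # N_{v|t}
    topic_total = [0.0] * num_topics                                # sum_v N_{v|t}
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            t = z[d][n]
            doc_topic[d][t] += 1
            for v, a in schema_cols[w].items():
                topic_word[t][v] += a
                topic_total[t] += a
    return z, doc_topic, topic_word, topic_total


def gibbs_sweep(docs, z, doc_topic, topic_word, topic_total,
                schema_cols, alpha, beta, vocab_size, rng):
    """One approximate sweep: each token is treated as if it were the last."""
    num_topics = len(topic_total)
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            old = z[d][n]
            doc_topic[d][old] -= 1
            for v, a in schema_cols[w].items():   # subtract column w of A
                topic_word[old][v] -= a
                topic_total[old] -= a
            weights = [
                (doc_topic[d][t] + alpha)
                * (topic_word[t][w] + beta)
                / (topic_total[t] + vocab_size * beta)
                for t in range(num_topics)
            ]
            new = rng.choices(range(num_topics), weights=weights)[0]
            z[d][n] = new
            doc_topic[d][new] += 1
            for v, a in schema_cols[w].items():   # add the same column back
                topic_word[new][v] += a
                topic_total[new] += a


# Hypothetical toy corpus of word ids over a 4-word vocabulary.
rng = random.Random(0)
docs = [[0, 1, 0], [2, 3, 3], [0, 1, 2]]
identity_cols = [{v: 1.0} for v in range(4)]   # identity schema: standard LDA
state = initialize(docs, 2, 4, identity_cols, rng)
for _ in range(10):
    gibbs_sweep(docs, *state, identity_cols,
                alpha=0.1, beta=0.01, vocab_size=4, rng=rng)
```

Storing each column of $A$ as a dictionary of its nonzero entries means the inner loops touch only the word types associated with the current token, which is where the sparsity of $A$ pays off.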

Another property of the generalized Pólya urn model is that it is nonexchangeable: the joint probability of the tokens in any given topic is not invariant to permutation of those tokens. Inference of $Z$ given $W$ via Gibbs sampling involves repeatedly cycling through the tokens in $W$ and, for each one, resampling its topic assignment conditioned on $W$ and the current topic assignments for all tokens other than the token of interest. For LDA, the sampling distribution for each topic assignment is simply the product of two predictive probabilities, obtained by treating the token of interest as if it were the last. For a topic model with a generalized Pólya urn for the topic–word component, the sampling distribution is more complicated. Specifically, the topic–word component of the sampling distribution is no longer a simple predictive distribution: when sampling a new value for $z^{(d)}_n$, the implication of each possible value for subsequent tokens and their topic assignments must be considered. Unfortunately, this can be very computationally expensive, particularly for large corpora. There are several ways around this problem. The first is to use sequential Monte Carlo methods, which have been successfully applied to topic models previously (Canini et al., 2009). The second approach is to approximate the true Gibbs sampling distribution by treating each token as if it were the last, ignoring implications for subsequent tokens and their topic assignments. We find that this approximate method performs well empirically.
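The second approach, treating the token as if it were the last, can be sketched as a product of a document–topic term and a topic–word term. The function name, argument layout, and toy counts below are assumptions, and the topic–word normalizer uses the familiar LDA form (total topic mass plus $V\beta$), which is an assumption about the intended normalization rather than something stated in this section.

```python
import numpy as np

def approx_topic_probs(doc_topic, word_topic, topic_totals, alpha, beta, vocab_size):
    """Approximate P(z = t) for one token, treating it as the last token.

    doc_topic[t]    : tokens in this document assigned to topic t (token removed)
    word_topic[t]   : possibly fractional count of this word in topic t
    topic_totals[t] : total count mass in topic t
    """
    doc_topic = np.asarray(doc_topic, dtype=float)
    word_topic = np.asarray(word_topic, dtype=float)
    topic_totals = np.asarray(topic_totals, dtype=float)
    weights = (doc_topic + alpha) * (word_topic + beta) / (topic_totals + vocab_size * beta)
    return weights / weights.sum()

# Hypothetical counts for one token in a 3-topic model.
probs = approx_topic_probs(
    doc_topic=[4, 1, 0],
    word_topic=[2.5, 0.3, 0.0],
    topic_totals=[50.0, 40.0, 30.0],
    alpha=0.1, beta=0.01, vocab_size=1000,
)
rng = np.random.default_rng(0)
new_topic = rng.choice(len(probs), p=probs)  # draw the new topic assignment
```

The word counts may be fractional here because schema columns, not unit increments, are added to the topic–word counts.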

5.1 Setting the Schema A

Inspired by our evaluation metric, we define $A$ as

\[
\begin{aligned}
A_{vv} &\propto \lambda_v D(v) \\
A_{vw} &\propto \lambda_v D(w, v)
\end{aligned}
\tag{3}
\]

where each element is scaled by a row-specific weight $\lambda_v$ and each column is normalized to sum to 1. Normalizing columns makes comparison to standard LDA simpler, because the relative effect of the smoothing parameter $\beta = 0.01$ is equivalent. We set $\lambda_v = \log(D / D(v))$, the standard IDF weight used in information retrieval, which is larger for less frequent words. The column for word type $w$ can be interpreted as the set of word types with significant association with $w$. The IDF weighting therefore has the effect of increasing the strength of association for rare word types. We also found empirically that it is helpful to remove off-diagonal elements for the most common types, such as those that occur in more than 5% of documents (IDF < 3.0). Including nonzero off-diagonal values in $A$ for very frequent types causes the model to disperse those types over many topics, which leads to large numbers of extremely similar topics. To measure this effect, we calculated the Jensen-Shannon divergence between all pairs of topic–word distributions in a given model.
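The schema definition in Equation 3, together with the pruning of common types, can be sketched as follows. This is one possible reading, not the authors' code: the function name `build_schema`, the choice to zero both the row and the column of each common type, and the fallback for empty columns are assumptions. The natural logarithm is used because $\ln(1/0.05) \approx 3.0$ agrees with the IDF threshold quoted above.

```python
import numpy as np

def build_schema(doc_word, max_doc_fraction=0.05):
    """Column-normalized schema from a (documents x word types) count matrix.

    Assumes every word type occurs in at least one document.
    """
    present = (np.asarray(doc_word) > 0).astype(float)
    num_docs = present.shape[0]
    co_doc = present.T @ present        # co_doc[v, w] = D(v, w); diagonal holds D(v)
    doc_freq = np.diag(co_doc).copy()
    idf = np.log(num_docs / doc_freq)   # lambda_v
    schema = idf[:, None] * co_doc      # A[v, w] = lambda_v * D(w, v), unnormalized

    # Assumed reading of the pruning rule: for each common type, zero its row
    # and its column but keep the diagonal entry.
    common = doc_freq / num_docs > max_doc_fraction
    diag = np.diag(schema).copy()
    schema[common, :] = 0.0
    schema[:, common] = 0.0
    np.fill_diagonal(schema, diag)

    # Assumed fallback: a column with no mass (IDF of zero) reverts to the
    # identity column, i.e. plain LDA behaviour for that word type.
    idx = np.flatnonzero(schema.sum(axis=0) == 0)
    schema[idx, idx] = 1.0
    return schema / schema.sum(axis=0, keepdims=True)

# Toy corpus: 4 documents, 3 word types; a loose threshold keeps the example non-trivial.
toy = [[1, 1, 0], [1, 1, 0], [1, 1, 1], [1, 0, 0]]
A = build_schema(toy, max_doc_fraction=0.8)
```

In the toy corpus the first word type appears in every document, so its column collapses to the identity column, while the two rarer types share mass with each other.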

[Figure 3: twelve panels in four columns (100, 200, 300, and 400 topics). Rows, top to bottom: average coherence, average coherence of the 10 worst topics, and held-out log probability (HOLP), each plotted against Gibbs iteration.]

Figure 3: Pólya urn topics (blue) have higher average coherence and converge much faster than LDA topics (red). The top plots show topic coherence (averaged over 15 runs) over 1000 iterations of Gibbs sampling. Error bars are not visible in this plot. The middle plots show the average coherence of the 10 lowest scoring topics. The bottom plots show held-out log probability (in thousands) for the same models (three runs each of 5-fold cross-validation).

Name   Docs    Avg. Tokens/Doc   Tokens    Vocab
NIH    18756   114.64 ± 30.41    2150172   28702

Table 3: Data set statistics.

For a model using off-diagonal weights for all word types, the mean of the 100 lowest divergences was 0.29 ± 0.05 (a divergence of 1.0 represents distributions with no shared support) at $T = 200$. The average divergence of the 100 most similar pairs of topics for standard LDA (i.e., $A = I$) is 0.67 ± 0.05. The same statistic for the generalized Pólya urn model without off-diagonal elements for word types with high document frequency is 0.822 ± 0.09.
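The topic-similarity statistic reported above can be sketched as a pairwise Jensen-Shannon computation. The function names and the toy distributions are assumptions; base-2 logarithms are used so that two distributions with no shared support score exactly 1.0, matching the scale described in the text.

```python
from itertools import combinations
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a = 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def mean_lowest_divergences(topics, k=100):
    """Mean divergence over the k most similar pairs of topic-word distributions."""
    divs = sorted(
        js_divergence(topics[i], topics[j])
        for i, j in combinations(range(len(topics)), 2)
    )
    return float(np.mean(divs[:k]))

# Hypothetical topic-word distributions over a 3-word vocabulary.
topics = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.0]]
closest = mean_lowest_divergences(topics, k=1)
```

A lower value means the most similar topics are closer to being duplicates, which is the failure mode the pruning of common word types is meant to avoid.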

Setting the off-diagonal elements of the schema $A$ to zero for the most common word types also has the fortunate effect of substantially reducing preprocessing time. We find that Gibbs sampling for the generalized Pólya model takes roughly two to three times longer than for standard LDA, depending on the sparsity of the schema, due to additional bookkeeping needed before and after sampling topics.

5.2 Experimental Results

We evaluated the new model on a corpus of NIH grant abstracts. Details are given in Table 3. Figure 3 shows the performance of the generalized Pólya urn model relative to LDA. Two metrics, our new topic coherence metric and the log probability of held-out documents, are shown over 1000 iterations at 50-iteration intervals. Each model was run over five folds of cross validation, each with three random initializations. For each model we calculated an overall coherence score by calculating the topic coherence for each topic individually and then averaging these values. We report the average over all 15 models in each plot. Held-out probabilities were calculated using the left-to-right method of Wallach et al. (2009), with each cross-validation fold using its own schema $A$. The generalized Pólya model performs very well in average topic coherence, reaching levels within the first 50 iterations that match the final score. This model has an early advantage for held-out probability as well, but is eventually overtaken by LDA. This trend is consistent with Chang et al.'s observation that held-out probabilities are not always good predictors of human judgments (Chang et al., 2009). Results are consistent over $T \in \{100, 200, 300\}$.
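The per-model summary used in these plots (mean coherence over topics, and mean over the ten worst topics) can be sketched as below. The coherence function restates the co-document-frequency metric defined earlier in the paper, $\sum_{m=2}^{M} \sum_{l=1}^{m-1} \log \frac{D(v_m, v_l) + 1}{D(v_l)}$, as an assumption, since that definition lies outside this section; the function names and toy data are hypothetical.

```python
import math

def topic_coherence(top_words, doc_sets):
    """Coherence of one topic.

    top_words : the topic's top words, ordered from most to least probable
    doc_sets  : maps each word to the set of ids of documents that contain it
    """
    score = 0.0
    for m in range(1, len(top_words)):
        for earlier in range(m):
            co_docs = len(doc_sets[top_words[m]] & doc_sets[top_words[earlier]])
            score += math.log((co_docs + 1) / len(doc_sets[top_words[earlier]]))
    return score

def summarize(scores, n_worst=10):
    """Return (mean coherence, mean coherence of the n_worst lowest-scoring topics)."""
    worst = sorted(scores)[:n_worst]
    return sum(scores) / len(scores), sum(worst) / len(worst)

# Hypothetical document sets for three word types.
doc_sets = {"aging": {1, 2}, "cell": {1, 2}, "tax": {3}}
related = topic_coherence(["aging", "cell"], doc_sets)   # words that co-occur
unrelated = topic_coherence(["aging", "tax"], doc_sets)  # words that never co-occur
```

Scores closer to zero indicate higher coherence, so the pair of co-occurring words scores above the pair that never shares a document.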

In section 4.2, we demonstrated that our topic coherence metric correlates with expert opinions of topic quality for standard LDA. The generalized Pólya urn model was therefore designed with the goal of directly optimizing that metric. It is possible, however, that optimizing for coherence directly could break the association between the coherence metric and topic quality. We therefore repeated the expert-driven evaluation protocol described in section 3.1. We trained one standard LDA model and one generalized Pólya urn model, each with $T = 200$, and randomly shuffled the 400 resulting topics. The topics were then presented to the experts from NINDS, with no indication as to the identity of the model from which each topic came. As these evaluations are time consuming, the experts evaluated only the first 200 topics, which consisted of 103 generalized Pólya urn topics and 97 LDA topics. AUC values predicting bad topics given coherence were 0.83 and 0.80, respectively. Coherence effectively predicts topic quality in both models.
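An AUC of this kind can be read as the probability that a randomly chosen bad topic has a lower coherence score than a randomly chosen good one. The sketch below computes it by direct pairwise comparison; the function name and the coherence values are hypothetical and are not taken from the expert evaluation.

```python
def auc_bad_vs_good(bad_scores, good_scores):
    """Fraction of (bad, good) pairs in which the bad topic has lower coherence.
    Ties count as half a win."""
    wins = 0.0
    for bad in bad_scores:
        for good in good_scores:
            if bad < good:
                wins += 1.0
            elif bad == good:
                wins += 0.5
    return wins / (len(bad_scores) * len(good_scores))

# Hypothetical coherence scores (closer to zero means more coherent).
bad = [-400.0, -350.0]
good = [-300.0, -360.0, -280.0]
auc = auc_bad_vs_good(bad, good)
```

A value of 0.5 would mean coherence carries no information about expert judgments, and 1.0 would mean every bad topic scores below every good topic.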

Although we were able to improve the average overall quality of topics and the average quality of the ten lowest-scoring topics, we found that the generalized Pólya urn model was less successful at reducing the overall number of bad topics. Ignoring one “unbalanced” topic from each model, 16.5% of the LDA topics and 13.5% of the generalized Pólya urn topics were marked as “bad.” While this result is an improvement, it is not significant at $p = 0.05$.
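To see why a gap of this size can fail to reach significance with about a hundred topics per model, consider a pooled two-proportion z-test. This section does not say which test was used or give the raw counts, so both the choice of test and the counts below are assumptions, picked only to land near the reported percentages.

```python
import math

def two_proportion_p_value(bad_a, total_a, bad_b, total_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    p_a = bad_a / total_a
    p_b = bad_b / total_b
    pooled = (bad_a + bad_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / std_err
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))

# Illustrative counts only (not reported in the paper): 16 of 96 vs. 14 of 102.
p_value = two_proportion_p_value(16, 96, 14, 102)
```

With samples this small, a difference of a few percentage points is well within what chance alone would produce.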

6 Discussion

We have demonstrated the following:

• There is a class of low-quality topics that cannot be detected using existing word-intrusion tests, but that can be identified reliably using a metric based on word co-occurrence statistics.

• It is possible to improve the coherence score of topics, both overall and for the ten worst, while retaining the ability to flag bad topics, all without requiring semi-supervised data or additional reference corpora. Although additional information may be useful, it is not necessary.

• Such models achieve better performance with substantially fewer Gibbs iterations than LDA.

We believe that the most important challenges in future topic modeling research are improving the semantic quality of topics, particularly at the low end, and scaling to ever-larger data sets while ensuring high-quality topics. Our results provide critical insight into these problems. We found that it should be possible to construct unsupervised topic models that do not produce bad topics. We also found that Gibbs sampling mixes faster for models that use word co-occurrence information, suggesting that such methods may also be useful in guiding online stochastic variational inference (Hoffman et al., 2010).

Acknowledgements

This work was supported in part by the Center for Intelligent Information Retrieval, in part by the CIA, the NSA and the NSF under NSF grant number IIS-0326249, in part by NIH:HHSN271200900640P, and in part by NSF grant number SBE-0965436. Any opinions, findings and conclusions or recommendations expressed in this material are the authors' and do not necessarily reflect those of the sponsor.

References

Loulwah AlSumait, Daniel Barbara, James Gentle, and Carlotta Domeniconi. 2009. Topic significance ranking of LDA generative models. In ECML.

David Andrzejewski, Xiaojin Zhu, and Mark Craven. 2009. Incorporating domain knowledge into topic modeling via Dirichlet forest priors. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 25–32.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, January.

K.R. Canini, L. Shi, and T.L. Griffiths. 2009. Online inference of topics with latent Dirichlet allocation. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics.

Jonathan Chang, Jordan Boyd-Graber, Chong Wang, Sean Gerrish, and David M. Blei. 2009. Reading tea leaves: How humans interpret topic models. In Advances in Neural Information Processing Systems 22, pages 288–296.

Kenneth Church and Patrick Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29.

Gabriel Doyle and Charles Elkan. 2009. Accounting for burstiness in topic models. In ICML.

S. Geman and D. Geman. 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721–741.


Thomas L. Griffiths and Mark Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl. 1):5228–5235.

Matthew Hoffman, David Blei, and Francis Bach. 2010. Online learning for latent Dirichlet allocation. In NIPS.

Hosam Mahmoud. 2008. Pólya Urn Models. Chapman & Hall/CRC Texts in Statistical Science.

Qiaozhu Mei, Xuehua Shen, and ChengXiang Zhai. 2007. Automatic labeling of multinomial topic models. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 490–499.

David Newman, Jey Han Lau, Karl Grieser, and Timothy Baldwin. 2010. Automatic evaluation of topic coherence. In Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics.

Yee Whye Teh, Dave Newman, and Max Welling. 2006. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In Advances in Neural Information Processing Systems 18.

Hanna Wallach, Iain Murray, Ruslan Salakhutdinov, and David Mimno. 2009. Evaluation methods for topic models. In Proceedings of the 26th International Conference on Machine Learning.

Xing Wei and Bruce Croft. 2006. LDA-based document models for ad-hoc retrieval. In Proceedings of the 29th Annual International SIGIR Conference.
