Hierarchical Theme and Topic Modeling
Jen-Tzung Chien, Senior Member, IEEE
Abstract—Considering the hierarchical data groupings in a text corpus, e.g., words, sentences and documents, we conduct structural learning and infer the latent themes and topics for sentences and words, respectively, from a collection of documents. The relation between themes and topics under different data groupings is explored through an unsupervised procedure without limiting the number of clusters. A tree stick-breaking process is presented to draw theme proportions for different sentences. We build a hierarchical theme and topic model which flexibly represents heterogeneous documents using Bayesian nonparametrics. Thematic sentences and topical words are extracted. In the experiments, the proposed method is shown to be effective in building a semantic tree structure for sentences and the corresponding words. The superiority of using the tree model to select expressive sentences for document summarization is illustrated.
Index Terms— Structural learning, topic model, Bayesian
nonparametrics, document summarization
I. INTRODUCTION
Unsupervised learning has a broad goal of extracting fea-
tures and discovering structure within the given data. The
unsupervised learning via probabilistic topic model [1] has
been successfully developed for document categorization [2],
image analysis [3], text segmentation [4], speech recognition
[5], information retrieval [6], document summarization [7],
[8] and many other applications. Using topic model, latent
semantic topics are learned from a bag of words to capture
the salient aspects embedded in data collection. In this study,
we propose a new topic model to represent a bag of sentences
as well as the corresponding words. As we know, the concept
of “topic” is well understood in the community. Here, we
use another related concept, “theme”. Themes are latent variables that occur at a different level of grouped data, e.g., sentences, so the concepts of themes and topics are distinct. We model the themes and topics separately but estimate them jointly. The hierarchical theme
and topic model is constructed. Figure 1 illustrates the diagram
of hierarchical generation from documents, sentences to words
given by the themes and topics which are drawn from their
proportions. We explore a semantic tree structure of sentence-
level latent variables from a bag of sentences while the word-
level latent variables are learned from a bag of grouped words
allocated in individual tree nodes. We build a two-level topic
model through a compound process. The process of generating
words conditions on the theme assigned to the sentence. The
motivation of this paper is to go beyond the word level and extend the topic model by discovering the hierarchical relations between the latent variables at the word and sentence levels. The benefit of this model is that it establishes a hierarchical latent variable model which can characterize heterogeneous documents with multiple levels of abstraction in different data groupings. This model is general and could be applied to document summarization and many other information systems.

This work was supported in part by the Ministry of Science and Technology, Taiwan, under contract MOST 103-2221-E-009-078-MY3. J.-T. Chien is with the Department of Electrical and Computer Engineering, National Chiao Tung University, Hsinchu 30010, Taiwan (e-mail: jtchien@nctu.edu.tw).

Fig. 1. Conceptual illustration for hierarchical generation of documents (yellow rectangle), sentences (green diamond) and words (blue circle) by using theme and topic proportions.
A. Related work for topic model
Topic model based on latent Dirichlet allocation (LDA) [2]
is constructed as a finite-dimensional mixture representation which assumes that 1) the number of topics is fixed and 2) different topics are independent. The hierarchical Dirichlet
process (HDP) [9] and the nested Chinese restaurant process
(nCRP) [10], [11] were proposed to relax these assumptions.
The HDP-LDA model by Teh et al. [9] is a nonparametric extension of LDA where the document representation is allowed to grow structurally when more documents are observed. The number of topics is unknown a priori. Each word token within a document is drawn from a mixture model where the hidden topics are shared across documents. A Dirichlet process (DP) is realized to find flexible data partitions and provide the nonparametric prior over the number of topics for each document. The base measure for the child DPs is itself drawn from a parent DP. The atoms are shared in a hierarchical way. The model selection problem is tackled through Bayesian nonparametric (BNP) learning. In the literature, the sparse
topic model was constructed by decoupling the sparsity and
smoothness for LDA [12] and HDP [13]. The spike and
slab prior over Dirichlet distributions was applied by using
a Bernoulli variable for each word to indicate whether the
word appears in the topic or not. In [14], the Indian buffet
process compound DP was developed to build a focused
topic model. In [15], a hierarchical Dirichlet prior with Polya
conditionals was used as the asymmetric Dirichlet prior over
the document-topic distributions. The improvement of using
this prior structure over the symmetric Dirichlet prior in LDA
was substantial even when not estimating the number of topics.
On the other hand, the nCRP conducts BNP inference of
topic hierarchies and learns the deeply branching trees from a
data collection. Using this hierarchical topic model [10], each
document is modeled by a path of topics given a random tree
where the hierarchically-correlated topics from global topic to
individual topics are extracted from root node to leaf nodes,
respectively. In [16], the topic correlation was strengthened
through a doubly correlated nonparametric topic model where
the annotations were incorporated into discovery of semantic
structure. In [17], the nested DP prior was proposed by replac-
ing the random atoms with the random probability measures
drawn from DP in a nested setting. In general, HDP and nCRP
were implemented according to the stick-breaking process
and Chinese restaurant process. The approximate inference
using Markov chain Monte Carlo (MCMC) [10], [11], [9] and
variational Bayesian (VB) [18], [19], [20] was developed.
B. Topic model beyond word level
In previous methods, the unsupervised learning over text
data finds the topic information from a bag of words. The
mixed membership modeling is implemented for representa-
tion of words from multiple documents. The word-level mix-
ture model is built. However, unsupervised learning beyond
word level is required in many information systems. For exam-
ple, the topic models based on a bag of bigrams [21] and a bag
of n-gram histories [5] were estimated for language modeling.
In a document summarization system, the text representation
is evaluated to select representative sentences from multiple
documents. Exploring the sentence clustering and ranking [8]
is essential to find the sentence-level themes and measure the
relevance between sentences and documents [22]. In [23], the
general sentences and specific sentences were identified for
document summarization. In [24], [7], a sentence-based topic
model based on LDA was proposed to learn word-document
and word-sentence associations. Furthermore, an information
retrieval system retrieves the condensed information from user
queries. Finding the underlying themes from documents is
beneficial to organize the ranked documents and extract the
relevant information. In addition to the word-level topic model,
it is desirable to build the hierarchical topic model in sentence
level or even in document level. In [25], [26], the latent
Dirichlet co-clustering model and the segmented topic model
were proposed to build topic model across different levels.
C. Main idea of this work
In this study, we construct a hierarchical latent variable
model for structural representation of text documents. The
thematic sentences and the topical words are learned from
hierarchical data groupings. Each path in tree model covers
from the general theme at root node to the individual themes
at leaf nodes. The themes in different tree nodes contain
coherent information but in varying degrees of sharing for
sentence representation. We basically build a tree model for
sentences according to the nCRP. The theme hierarchy is
explored. The brother nodes expand the diversity of themes
from different sentences within and across documents. This
model not only groups the sentences into a node but also distinguishes their concepts through different layers. The
words of the sentences clustered in a tree node are seen as the
grouped data. The grouped words in different tree nodes are
driven by an HDP. The nCRP compound HDP is developed
to build a hierarchical theme and topic model. To reflect the
heterogeneous documents in real-world data collection, a tree
stick-breaking process is addressed to draw a subtree of theme
proportions. We conduct structural learning and group the
sentences into a diversity of themes. The number of themes
and the dependency between themes are learned from data.
The words of the sentences within a node are represented by
a topic model which is drawn by a DP. All the topics from
different nodes are shared under a global DP. The sentence-
level themes and the word-level topics are estimated. This
approach is evaluated by the tasks of document modeling and
summarization.
This paper is organized as follows. Section II surveys BNP
learning based on DP, HDP and nCRP. Section III presents
the nCRP compound HDP for representation of sentences
and words from multiple documents. A tree stick-breaking
process is implemented to draw theme proportions for subtree
branches. Section IV formulates the posterior probabilities
which are applied for drawing the subtree, themes and topics.
In Section V, the experiments on text modeling and document
summarization are reported. The conclusions drawn from this
study are given in Section VI.
II. BAYESIAN NONPARAMETRIC LEARNING
A. Dirichlet process
Dirichlet process (DP) is essential for BNP learning. Basically, a DP for a random probability measure G is expressed by G ∼ DP(α0, G0). For any finite measurable partition {A1, . . . , Ar}, a finite-dimensional Dirichlet distribution

(G(A1), . . . , G(Ar)) ∼ Dir(α0G0(A1), . . . , α0G0(Ar))   (1)

or G ∼ Dir(α0G0) is produced from G with two parameters, a concentration parameter α0 > 0 and a base measure G0 which is seen as the mean of draws from the DP. The probability measure of the DP is established by

G = Σ_{k=1}^∞ βk δφk,  where  Σ_{k=1}^∞ βk = 1   (2)

where δφk is a unit mass at the point φk and the infinite sequences of weights {βk} and points {φk} are drawn from α0 and G0, respectively. The solution to the weights β = {βk}_{k=1}^∞ can be obtained by the stick-breaking process (SBP) which provides a distribution over infinite partitions of the unit interval, also called the GEM distribution β ∼ GEM(α0) [27].
Let θ1, θ2, . . . denote the parameters drawn from G for individual words w1, w2, . . . with multinomial distribution function, θi | G ∼ G and wi | θi ∼ p(wi | θi) = Mult(θi) for each i. According to the metaphor of the Chinese restaurant process
TABLE I
CONVENTION OF NOTATIONS

wd = {wdi} : words in document d
w−(di) : words of all documents except word wdi
θdi or θdji : multinomial parameter of word wdi or wdji
zdi = k : latent variable of word wdi at topic k
z−(di) : topics of all words in w except word wdi
φk : kth topic parameter
cd : tree path corresponding to document d
c−d : tree paths of all documents except document d
β = {βk} : global mixture weights or topic proportions in HDP
πd = {πdk} : topic proportions of document d over cd in HDP
α0 & γ : concentration or strength parameter of a DP
G0, H or Gd : base measure of a global DP or a DP for document d
nk : no. of customers or words sitting at table φk
wdj = {wdji} : words in sentence j of document d
wd(−j) : words of all sentences in wd except sentence j
ydj = l : latent variable of sentence j at theme l
yd(−j) : themes of all sentences in wd except sentence j
ψl : lth theme parameter
td = {tdj} : subtree branches for all sentences in document d
td(−j) : subtree branches for all sentences in wd except wdj
πl = {πlk} : topic proportions of theme l in snCRP
βd = {βdl} : theme proportions of document d over td in snCRP
γs & γw : strength parameters in sentence and word levels
V : vocabulary size
θt,l = {θt,l,v} : multinomial parameters of all words v in tuple (t, l)
η : shared Dirichlet prior parameter for θt,l
Ld : maximum no. of themes or nodes in subtree td
Kl : maximum no. of topics in theme l
L & K : total no. of themes and topics
βla & βlac : theme proportions for ancestor and child nodes
md,t : no. of sentences in document d choosing branch t
md(−j),lac : no. of sentences in wd(−j) allocated in node lac
nd,t,l,v : no. of word wdi = v in document d and tuple (t, l)
n−(di),l,v : no. of word wdi = v in w−(di) with theme l
Nd : total no. of words in document d
(CRP) [28], the conditional distribution of the current parameter θi given the previous parameters θ1, . . . , θi−1 is obtained by

θi | θ1, . . . , θi−1, α0, G0 ∼ Σ_{l=1}^{i−1} [1/(i−1+α0)] δθl + [α0/(i−1+α0)] G0
= Σ_{k=1}^{K} [nk/(i−1+α0)] δφk + [α0/(i−1+α0)] G0.   (3)

Here, φ1, . . . , φK denote the distinct values taken from the previous parameters θ1, . . . , θi−1 and nk is the number of customers or parameters θi′ who have been seated at table or have chosen value φk. If the base distribution is continuous, then each φk can correspond to a different table. However, if the base distribution is discrete and finite, φk corresponds to a distinct value, but not a table, because the draws from a discrete distribution can be repeated. Considering the continuous base measure, the ith customer sits at table φk with probability proportional to the number of customers nk who are already seated there, and sits at a new table with probability proportional to α0.
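As a concrete illustration of the seating rule in Eq. (3), the following Python sketch simulates CRP table assignments for a sequence of customers. It is a minimal illustration, not part of the paper; the function name and the random-seed handling are assumptions.

```python
import numpy as np

def crp_seating(num_customers, alpha0, seed=None):
    """Simulate Chinese restaurant process seating (Eq. 3).

    Customer i joins an existing table k with probability n_k / (i - 1 + alpha0)
    and opens a new table with probability alpha0 / (i - 1 + alpha0).
    Returns the table index chosen by each customer.
    """
    rng = np.random.default_rng(seed)
    counts = []          # n_k: number of customers per table
    assignments = []
    for i in range(1, num_customers + 1):
        probs = np.array(counts + [alpha0], dtype=float)
        probs /= (i - 1 + alpha0)
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)   # a new table is opened
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments

# Example: 20 customers with concentration alpha0 = 1.0
print(crp_seating(20, alpha0=1.0, seed=0))
```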
Fig. 2. Graphical representation for (a) the DP mixture model and (b) the HDP mixture model. Yellow and blue denote the variables in document and word levels, respectively.

Figure 2(a) displays the graphical representation of a DP mixture model where each word wi is generated by

φk | G0 ∼ G0, for each k
β | α0 ∼ GEM(α0)
zi = k | β ∼ β, for each i
wi | zi, {φk}_{k=1}^∞ ∼ Mult(θi = φzi), for each i   (4)

where the mixture component or latent topic zi = k is drawn from the topic proportions β which are determined by the strength parameter α0. Unsupervised learning of an infinite mixture representation is implemented.
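The generative view in Eqs. (2) and (4) can be mimicked with a truncated stick-breaking construction. The sketch below is a simplified illustration under an explicit truncation level and a symmetric Dirichlet base measure, both of which are assumptions for the example rather than choices made in the paper.

```python
import numpy as np

def stick_breaking(alpha0, truncation, rng):
    """Draw truncated GEM(alpha0) weights beta (Eq. 2)."""
    v = rng.beta(1.0, alpha0, size=truncation)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    beta = v * remaining
    beta[-1] = 1.0 - beta[:-1].sum()   # close the stick at the truncation level
    return beta

def sample_dp_mixture(num_words, vocab_size, alpha0, truncation=50, seed=None):
    """Generate words from a truncated DP mixture of multinomials (Eq. 4)."""
    rng = np.random.default_rng(seed)
    # Topic parameters phi_k ~ G0, here a symmetric Dirichlet base measure.
    phi = rng.dirichlet(np.ones(vocab_size), size=truncation)
    beta = stick_breaking(alpha0, truncation, rng)
    z = rng.choice(truncation, size=num_words, p=beta)            # z_i ~ beta
    words = np.array([rng.choice(vocab_size, p=phi[k]) for k in z])
    return z, words

z, words = sample_dp_mixture(num_words=100, vocab_size=20, alpha0=1.0, seed=1)
print("topics used:", np.unique(z))
```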
B. Hierarchical Dirichlet process
HDP [9] deals with the representation of documents or grouped data where each group is associated with a mixture model. Data in different groups share a global mixture model. Each document or data grouping wd is associated with a draw from a DP given probability measure Gd ∼ DP(α0, G0), which determines how much a member from a shared set of mixture components contributes to that data grouping. The base measure G0 is itself drawn from a global DP by G0 ∼ DP(γ, H) with strength parameter γ and base measure H, which ensures that there is a single set of discrete components shared across data. Each DP Gd governs the generation of the words wd = {wdi} or their multinomial parameters {θdi} for a document.
The global measure G0 and the individual measure Gd in HDP can be expressed by the mixture models with the shared atoms {φk}_{k=1}^∞ but different weights β = {βk}_{k=1}^∞ and πd = {πdk}_{k=1}^∞ given by

G0 = Σ_{k=1}^∞ βk δφk,  Gd = Σ_{k=1}^∞ πdk δφk, for each d,
where Σ_{k=1}^∞ βk = Σ_{k=1}^∞ πdk = 1.   (5)

The atom φk is drawn from the base measure H and the topic proportions β are drawn by SBP via β | γ ∼ GEM(γ). Note that the topic proportions πd are sampled from an independent DP Gd given β because the Gd's are independent given G0. We have πd ∼ DP(α0, β). Figure 2(b) shows the HDP mixture model which generates a word wdi in grouped data wd by

φk | H ∼ H, for each k
β | γ ∼ GEM(γ)
πd | α0, β ∼ DP(α0, β), for each d
zdi = k | πd ∼ πd, for each d and i
wdi | zdi, {φk}_{k=1}^∞ ∼ Mult(θdi = φzdi), for each d and i.   (6)
To implement this HDP, we can apply the stick-breaking construction to connect the relation between β and πd. We first conduct the stick-breaking construction by finding β = {βk} through a process of drawing beta variables

β′k ∼ Beta(1, γ),  βk = β′k ∏_{j=1}^{k−1} (1 − β′j).   (7)

Then, the stick-breaking construction for the probability measure πd of grouped data wd is performed to bridge the relation

π′dk ∼ Beta(α0βk, α0 Σ_{l=k+1}^∞ βl),  πdk = π′dk ∏_{j=1}^{k−1} (1 − π′dj)   (8)

where the beta variable π′dk at the kth draw is determined by the base measures of the two segments {βk, {βl}_{l=k+1}^∞}, which are scaled by the parameter α0. Basically, HDP does not involve learning topic hierarchies. Only a single level of data groupings, i.e., the document level, is modeled.
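A minimal sketch of the truncated stick-breaking construction in Eqs. (7) and (8) is given below, assuming a finite truncation level K; the variable names and the small numerical epsilon are illustrative assumptions, not part of the paper.

```python
import numpy as np

def gem_weights(gamma, K, rng):
    """Global weights beta via Eq. (7), truncated at K topics."""
    b = rng.beta(1.0, gamma, size=K)
    beta = b * np.concatenate(([1.0], np.cumprod(1.0 - b)[:-1]))
    beta[-1] = 1.0 - beta[:-1].sum()
    return beta

def document_weights(beta, alpha0, rng):
    """Document-level weights pi_d via Eq. (8), given the global beta."""
    # tail[k] = sum_{l > k} beta_l, the mass of the remaining segment.
    tail = np.concatenate((np.cumsum(beta[::-1])[::-1][1:], [0.0]))
    # Beta(alpha0 * beta_k, alpha0 * tail_k); epsilon avoids a zero parameter
    # at the truncation boundary (the last weight is renormalized anyway).
    pi_prime = rng.beta(alpha0 * beta, alpha0 * tail + 1e-12)
    pi = pi_prime * np.concatenate(([1.0], np.cumprod(1.0 - pi_prime)[:-1]))
    pi[-1] = 1.0 - pi[:-1].sum()
    return pi

rng = np.random.default_rng(0)
beta = gem_weights(gamma=1.0, K=10, rng=rng)
pi_d = document_weights(beta, alpha0=2.0, rng=rng)
print(np.round(beta, 3))
print(np.round(pi_d, 3))
```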
Fig. 3. An infinitely branching tree structure for representation of words and documents based on nCRP. Thick arrows denote a tree path cd drawn from nine words of a document wd. Here, we simply use the notation d for wd. The yellow rectangle and blue circles denote the document and words, respectively. Each word wdi is assigned a topic parameter φk at a tree node along cd with proportion or probability πdk.
C. The nested Chinese restaurant process
In many practical applications, it is desirable to represent text data based on different levels of aspects. The nCRP [10], [11] was proposed to infer the topic hierarchies and learn a deeply branching tree from a data collection. The resulting hierarchical LDA represents a text document wd by using a path cd of topics from a random tree consisting of the hierarchically-correlated topics {φk}_{k=1}^∞ from the global topic in the root node to the individual topics in the leaf nodes, as illustrated in Figure 3. Choosing a path of topics for a document is equivalent to selecting or visiting a chain of restaurants cd by a customer wd through cd ∼ nCRP(α0), which is controlled by a scaling parameter α0. Each word wdi in document wd is assigned a tree node φk or topic label zdi = k with probability or topic proportion πdk. The generative process for a set of text documents w = {wd | d = 1, . . . , D} based on nCRP is constructed as follows (a code sketch of the path drawing is given after the list):
1) For each node k in the infinite tree
   a) Draw a topic with parameter φk | H ∼ H.
2) For each document wd = {wdi | i = 1, . . . , Nd}
   a) Draw a tree path by cd ∼ nCRP(α0).
   b) Draw topic proportions over the layers of the tree path cd by the stick-breaking process πd | γ ∼ GEM(γ).
   c) For each word wdi
      i) Choose a layer or a topic by zdi = k | πd ∼ πd.
      ii) Choose a word based on topic zdi = k by wdi | zdi, cd, {φk}_{k=1}^∞ ∼ Mult(θdi = φcd(zdi)).
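To make the nested CRP concrete, the following sketch draws a fixed-depth tree path cd for each document by applying the CRP choice of Eq. (3) at every layer. The data structures and a fixed depth are hypothetical simplifications introduced for this example.

```python
import numpy as np

class NCRPNode:
    """A restaurant in the nested CRP: each occupied table points to a child restaurant."""
    def __init__(self):
        self.children = []      # child nodes (tables already occupied)
        self.counts = []        # number of documents that chose each child

def draw_ncrp_path(root, depth, alpha0, rng):
    """Draw a tree path c_d of the given depth for one document."""
    path, node = [root], root
    for _ in range(depth - 1):
        probs = np.array(node.counts + [alpha0], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(node.children):          # open a new branch (new table)
            node.children.append(NCRPNode())
            node.counts.append(0)
        node.counts[k] += 1
        node = node.children[k]
        path.append(node)
    return path

rng = np.random.default_rng(0)
root = NCRPNode()
paths = [draw_ncrp_path(root, depth=3, alpha0=1.0, rng=rng) for _ in range(5)]
print("branches under the root:", len(root.children))
```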
A Gibbs sampling algorithm is developed to sample the posterior tree path and word topic {cd, zdi} for different words in D documents w = {wd} = {wdi} according to the posterior probabilities of cd and zdi given w and the current values of all other latent variables, i.e., p(cd | c−d, w, z, α0, H) and p(zdi | w, z−(di), cd, γ, H). In these posterior probabilities, the notations c−d and z−(di) denote the latent variables c and z for all documents and words other than document d and word i, respectively. The sampling procedure is performed iteratively from the current state to a new state for D documents with Nd words in each document wd. In this procedure, the topic parameter φk in each node k of an infinite tree model is sampled from a prior measure φk | H ∼ H. The topic proportions πd of document wd are drawn by πd | γ ∼ GEM(γ) according to an SBP given a scaling parameter γ. Given the samples of tree path cd and topic zdi, the word wdi is distributed by using the multinomial parameter

θdi = φcd(zdi)   (9)

of topic zdi = k drawn from the topic proportions πd corresponding to tree path cd.
III. HIERARCHICAL THEME AND TOPIC MODEL
Although the topic hierarchies are explored in the topic model based on nCRP, only single-level data groupings, i.e., the document level, are considered in the generative process. The extension of text representation to different levels of data groupings is required to improve text modeling. In addition, a single tree path cd in nCRP may not sufficiently reflect the topic variations and theme ambiguities in heterogeneous documents. A flexible topic selection is required to compensate for model uncertainty. By conducting multiple-level unsupervised learning and flexible topic selection, we are able to upgrade system performance for document modeling. In this study, a hierarchical theme and topic model is proposed to conduct a kind of topical co-clustering [25], [26] over the sentence level and the word level so that one can cluster sentences while clustering words. The nCRP compound HDP is presented to implement the proposed model where the text modeling in word level, sentence level and document level is jointly performed. By referring to [29], a simplified tree-structured SBP is presented to draw a subtree td, which accommodates the theme and topic variations in document wd.
Fig. 4. An infinitely branching tree structure for representation of words, sentences and documents based on the nCRP compound HDP. Thick arrows denote the subtree branches td = {tdj} drawn for eight sentences {wdj | j = 1, . . . , 8} of a document wd. Here, we simply use the notations sj for wdj and d for wd. The yellow rectangle, green diamonds and blue circles denote the document, sentences and words, respectively. Each sentence wdj is assigned a theme parameter ψl at a tree node connected to the subtree branches with a document-dependent probability βdl while each word wdji in a tree node is assigned a topic parameter φk with a theme-dependent probability πlk.
A. The nCRP compound HDP
We propose an unsupervised structural learning approach to discover latent variables in different levels of data groupings. For the application of document representation, we aim to explore the latent themes from a set of sentences {wdj} and the latent topics from the corresponding set of words {wdji}. A text corpus w = {wd} = {wdj} = {wdji} is represented by using different levels of aspects. The proposed model is constructed by considering the structure of a document where each document consists of a “bag of sentences” and each sentence consists of a “bag of words”. Different from the topic model using HDP [9] and the hierarchical topic model using nCRP [10], [11], we develop a tree model to represent a “bag of sentences” and then represent the corresponding words allocated in individual tree nodes. A two-stage procedure is proposed for document modeling as illustrated in Figure 4.

In the first stage, each sentence wdj of a document wd is drawn from a mixture of theme model where the themes are shared for all sentences from a document collection w. The theme model of a document wd is composed of the themes under its corresponding subtree td for different sentences {wdj}, which are drawn by a sentence-based nCRP (snCRP) with a control parameter α0

td = {tdj} ∼ snCRP(α0)   (10)

where snCRP(·) is defined by considering the individual sentences {wdj} under a subtree td rather than using the individual words {wdi} with a single tree path cd as addressed for nCRP(·) in Section II-C. Notably, we consider a subtree td = {tdj} for thematic representation of sentences from a heterogeneous document wd. The proportions βd of theme parameters ψ = {ψl} over a subtree td are drawn according to a tree stick-breaking process (TSBP) which is simplified from the tree-structured stick-breaking process in [29]. The resulting distribution is called the treeGEM, which is expressed by

βd | γs ∼ treeGEM(γs)   (11)

where γs is a strength parameter. The distribution treeGEM(·) is derived through a TSBP which shall be described in Section III-D. With a tree structure of themes, the unsupervised grouping of sentences into different layers is obtained.
In the second stage, each word wdji in the set of sentences {wdj} allocated in tree node l is drawn by an individual mixture of topic model based on a DP. All topics from different nodes of a tree model are shared under a global topic model with parameters {φk} which are sampled by φk | H ∼ H. By treating the words in a tree node as the grouped data, the HDP is implemented for topical representation of the whole document collection w, which is composed of the grouped words in different tree nodes. We assume that the words of the sentences in tree node l are conditionally independent. These words are drawn from a topic model using topic parameters {φk}_{k=1}^∞. The sentences in a document given theme l are conditionally independent. These sentences are drawn from a theme model using theme parameters {ψl}_{l=1}^∞. The document-dependent theme proportions βd = {βdl} and the theme-dependent topic proportions πl = {πlk} are produced by a tree stick-breaking process βd | γs ∼ treeGEM(γs) and a standard stick-breaking process πl ∼ GEM(γw) where γs and γw denote the sentence-level and the word-level strength parameters, respectively.

Given the theme proportions βd and topic proportions πl, the probability measure of each sentence wdj is drawn from a document-dependent theme mixture model

Gs,d = Σ_{l=1}^∞ βdl δψl   (12)

while the probability measure of each word wdji in a tree node is drawn from a theme-dependent topic mixture model

Gw,l = Σ_{k=1}^∞ πlk δφk.   (13)

Since the theme for sentences is represented by a mixture model of topics for words in Eq. (13), we bridge the relation between the probability measures of themes {ψl} and topics {φk} via

ψl ∝ Σ_k πlk φk.   (14)
We implement the so-called nCRP compound HDP in a two-stage procedure and establish the hierarchical theme and topic model for document representation. The generative process for a set of documents in different levels of groupings is accordingly implemented. Having the sampled tree node ydj connected to the tree branches td and the associated topic zdi = k, the word wdji in sentence wdj is drawn by using the multinomial parameter

θdji = φtd(ydj, zdi).   (15)

The topic is determined by the topic proportions πl of theme ydj = l while the theme is determined by the theme proportions βd from a document wd. The generative process of this compound process is described as follows (a simplified code sketch follows the list):
1) For each node or theme l in the infinite tree
   a) For each topic k in a tree node
      i) Draw a topic with parameter φk | H ∼ H.
   b) Draw topic proportions by the stick-breaking process πl | γw ∼ GEM(γw).
   c) The theme model is constructed by ψl ∝ Σk πlk φk.
2) For each document wd = {wdj}
   a) Draw a subtree td = {tdj} ∼ snCRP(α0).
   b) Draw theme proportions over a subtree td in different layers by the tree stick-breaking process βd | γs ∼ treeGEM(γs).
   c) For each sentence wdj = {wdji}
      i) Choose a theme ydj = l | βd ∼ βd.
      ii) For each word wdji or simply wdi
         a. Choose a topic by zdi = k | πl ∼ πl.
         b. Choose a word based on topic zdi = k by wdji | zdi, ydj, td, {φk}_{k=1}^∞ ∼ Mult(θdji = φtd(ydj, zdi)).
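The sketch below illustrates the document-level part of this compound process in a finite, truncated form: a fixed set of L themes and K topics stands in for the infinite tree, and a plain Dirichlet draw stands in for the treeGEM theme proportions. These simplifications are assumptions made for the example only.

```python
import numpy as np

def generate_document(num_sentences, words_per_sentence, theme_topic_props,
                      topics, theme_props, rng):
    """Simplified finite version of the compound generative process.

    theme_props       : beta_d, proportions over L theme nodes (step 2b)
    theme_topic_props : pi_l, an (L x K) matrix of topic proportions per theme (step 1b)
    topics            : phi_k, a (K x V) matrix of word distributions (step 1a)
    """
    sentences = []
    for _ in range(num_sentences):
        l = rng.choice(len(theme_props), p=theme_props)              # y_dj ~ beta_d
        words = []
        for _ in range(words_per_sentence):
            k = rng.choice(topics.shape[0], p=theme_topic_props[l])  # z_di ~ pi_l
            words.append(rng.choice(topics.shape[1], p=topics[k]))   # w_dji ~ Mult(phi_k)
        sentences.append((l, words))
    return sentences

rng = np.random.default_rng(0)
L, K, V = 4, 6, 30                              # themes, topics, vocabulary size (toy values)
phi = rng.dirichlet(np.ones(V), size=K)         # step 1a: topics
pi = rng.dirichlet(np.ones(K), size=L)          # step 1b: per-theme topic proportions
beta_d = rng.dirichlet(np.ones(L))              # stand-in for treeGEM theme proportions
doc = generate_document(5, 8, pi, phi, beta_d, rng)
for theme, words in doc:
    print("theme", theme, "words", words)
```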
B. The nCRP for thematic sentences
We implement the snCRP and build an infinitely branching tree model which discovers latent themes from different sentences of a text corpus. The root node in this tree contains the general theme while the nodes in the leaf layer convey the specific themes. The hierarchical clustering of sentences is realized. The snCRP is developed to represent the sentences of a document wd = {wdj} based on the themes, which come from a subtree td = {tdj}. The ambiguity and uncertainty of themes existing in a heterogeneous document can thereby be compensated. This is different from the conventional word-based nCRP [10], [11] where only the topics along a single tree path cd are selected to represent different words in a document wd = {wdi}. The word-based nCRP using the GEM distribution for topic proportions πd ∼ GEM(γ) is now extended to the snCRP using the treeGEM distribution for theme proportions βd ∼ treeGEM(γs) by considering a subtree td.

The metaphor of snCRP is described as shown in Figure 4. There is an infinite number of Chinese restaurants in a city. Each restaurant has infinite tables. The tourist wdj of a tour group wd = {wdj} visits the first restaurant in the root node ψ1 where each of its tables has a card showing the next restaurant, which is arranged in the second layer consisting of {ψ2, ψ4, . . .}. Such visits repeat infinitely. Each restaurant is associated with a tree layer. The restaurants in a city are organized into an infinitely-branched and infinitely-deep tree structure. The tourists or sentences wdj from different tour groups or documents wd, who are closely related, shall visit the same restaurant and sit at the same table. The hierarchical grouping of sentences is therefore obtained by a nonparametric tree model based on this snCRP. Each tree node stands for a theme. Each tourist or sentence wdj is modeled by the multinomial parameter ψl of a theme l from a subtree or a chain of restaurants td. The thematic sentences allocated in tree nodes with high probabilities could be selected for document summarization.
Fig. 5. Graphical representation for the hierarchical theme and topic model. Yellow, green and blue denote the variables in document, sentence and word levels, respectively.
C. HDP for topical words
After having the hierarchical grouping of sentences, we treat the words corresponding to a tree node l as the grouped data and conduct the HDP by using the grouped data in different tree nodes. The grouped words from different documents are recognized under the same theme ψl. Such tree-based HDP is different from the document-based HDP [9] which treats documents as the grouped data for text representation. An organized topic model is constructed to draw words for individual themes. The theme-dependent topical words are learned from the tree-based HDP. According to the combination of snCRP and tree-based HDP, the theme parameter ψl is represented by a mixture model of topic parameters {φk}_{k=1}^∞ where the theme-dependent topic proportions {πlk}_{k=1}^∞ are used as the mixture weights as given in Eq. (14). The tree-based HDP is developed to infer the topic proportions πl based on a standard SBP. The tree-based multinomial parameter is inferred to determine the word distribution through

φk | H ∼ H, for each k
πl | γw ∼ GEM(γw),  ψl ∝ Σ_k πlk φk, for each l
td ∼ snCRP(α0),  βd | γs ∼ treeGEM(γs), for each d
ydj = l | βd ∼ βd, for each d and j
zdi = k | πl ∼ πl, for each d and i
wdji | zdi, ydj, td, {φk}_{k=1}^∞ ∼ Mult(φtd(ydj, zdi)), for each d, j and i.   (16)
A compound process of snCRP and HDP is fulfilled to build the hierarchical theme and topic model as illustrated in the graphical representation of Figure 5. Each word wdji is drawn by using the topic parameter θdji = φtd(ydj, zdi) which is controlled by the word-level topic zdi. This topic is allocated in the theme or tree node ydj of the subtree td which is selected from sentence wdj. The topical words in different tree nodes with high probability are selected. Such a process could be extended to represent multiple-level data groupings including words, paragraphs, documents, streams and corpora.
Fig. 6. Illustration for conducting (a) the tree stick-breaking process, and finding (b) the hierarchical theme proportions for an infinitely branching tree structure, which meets the constraint of unit length in the estimated theme proportions or in the treeGEM distribution. The dashed arrows and circles represent the future stick-breaking process. The stick-breaking process of a parent node la to its child nodes c = {la0, la1, la2, . . .} is shown in (c). The theme proportion of the first child node of a parent node la is denoted by βla0. After stick-breaking for a parent node la and its child nodes c, we have the replacement βla ← βla0. This example illustrates how the three-layer proportions {β1, β11, β12, β111, β121} or {β10, β110, β120, β111, β121} in Figure 4 are drawn to share a unit-length stick.
D. Tree stick-breaking process
In the implementation, a tree stick-breaking process is incorporated to draw a subtree [30] and determine the theme proportions for representation of sentences in a heterogeneous document wd = {wdj}. The traditional GEM distribution is not suitable to characterize the tree structure with dependencies between brother nodes and those between parent node and child nodes. The snCRP is combined with a TSBP, which is developed to draw theme proportions βd from a document wd based on the treeGEM distribution subject to the constraint of multinomial parameters Σ_{l=1}^∞ βdl = 1 for all nodes in subtree td. A variety of aspects from different sentences are revealed through βd.

As illustrated in Figures 6(a) and 6(c), we draw the theme proportions for an ancestor node and its child nodes {la, c = {la0, la1, la2, . . .}} that are connected between two layers by arrows. The theme proportion βla0 in the child node denotes the initial segment which is succeeded from an ancestor node la. The stick-breaking process is performed for the coming child nodes {la1, la2, . . .}. Given the treeGEM parameter γs, we sample a beta variable

β′lac ∼ Beta(1, γs) = [Γ(1 + γs) / (Γ(1)Γ(γs))] (1 − β′lac)^{γs−1}   (17)

for a child node lac ∈ c. The probability or theme proportion of generating a child node lac is calculated by

βlac = βla β′lac ∏_{j=1}^{c−1} (1 − β′laj).   (18)

In this calculation, the theme proportion βlac of a child node is determined from an initial proportion of the ancestor node βla, which is continuously chopped according to the draws of beta variables from the existing child nodes {β′laj | j = 1, . . . , c−1} to the new child node β′lac. Each child node can be further treated as an ancestor node for a future stick-breaking process to generate the grandchild nodes. Eq. (18) is recursively applied to find the theme proportions βla and βlac for different nodes in subtree td. The theme proportions of a parent node and its child nodes satisfy the condition

βla = βla0 + βla1 + βla2 + · · · .   (19)

Tree stick-breaking is run for each set of nodes {la, c}. After stick-breaking of a parent node la and its child nodes c, the theme proportion of the parent node is replaced by βla ← βla0. Figure 6(b) shows how the three-layer theme proportions {β1, β11, β12, β111, β121} in Figure 4 are inferred through the TSBP.
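A small sketch of the tree stick-breaking recursion of Eqs. (17)-(19) over a fixed finite tree is given below. The infinite tree is truncated here by giving any leftover mass to the last child of each parent, and the tree layout is a hypothetical toy example, so this is an illustration rather than the paper's exact construction.

```python
import numpy as np

def tree_stick_breaking(tree, gamma_s, rng, root_mass=1.0):
    """Assign theme proportions to a finite tree via Eqs. (17)-(19).

    `tree` maps a node name to the list of its child names. The mass beta_la of a
    parent is split into a self-portion beta_la0 and its children, so that all
    proportions in the subtree sum to the parent mass (Eq. 19).
    """
    proportions = {}

    def split(node, mass):
        children = tree.get(node, [])
        if not children:                      # leaf: keep all remaining mass
            proportions[node] = mass
            return
        b0 = rng.beta(1.0, gamma_s)           # Eq. (17): self-portion beta_la0
        proportions[node] = mass * b0
        remaining = mass * (1.0 - b0)
        for i, child in enumerate(children):
            if i < len(children) - 1:
                b = rng.beta(1.0, gamma_s)    # Eq. (17)
                child_mass = remaining * b    # Eq. (18)
                remaining *= (1.0 - b)
            else:
                child_mass = remaining        # truncation: leftover goes to the last child
            split(child, child_mass)

    split("root", root_mass)
    return proportions

rng = np.random.default_rng(0)
tree = {"root": ["n1", "n2"], "n1": ["n11"], "n2": ["n21", "n22"]}
props = tree_stick_breaking(tree, gamma_s=1.85, rng=rng)
print({k: round(v, 3) for k, v in props.items()})
print("total mass:", round(sum(props.values()), 3))   # sums to 1.0
```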
We accordingly infer the theme proportions βd for a document wd and meet the constraint that the summation over the theme proportions of all parent and child nodes has unit length. The inference βd ∼ treeGEM(γs) is completed. Therefore, a stick of unit length is partitioned at random locations. The random theme proportions βd are used to calculate the posterior probability for drawing theme ydj = l for a sentence wdj. In this study, we adopt a single beta parameter γs for the TSBP towards depth as well as branch. This solution is a simplified realization of the TSBP in [29] where two separate beta parameters {γsd, γsb} are adopted to conduct stick-breaking for depth and branch, βd ∼ treeGEM(γsd, γsb). The infinite version of the Dirichlet tree distribution used in [31] could be adopted as the conjugate prior over the tree multinomials βd.
IV. BAYES IA N INF ER EN CE
The approximate Bayesian inference using Gibbs sampling is developed to infer the posterior parameters or latent variables to implement the nCRP compound HDP. The latent variables consist of the subtree branch tdj, sentence-level theme ydj and word-level topic zdi for each word wdi and sentence wdj in a text corpus w. Each latent variable is iteratively sampled from the posterior probability of this variable given the observed data w and all the other variables. The sampling of latent variables is performed in an MCMC procedure. In this procedure, we sample a subtree td = {tdj} for a document wd via snCRP. Each sentence wdj is grouped into a tree node or a document-dependent theme ydj = l under a subtree td. Each word wdi of a sentence is then assigned the theme-dependent topic zdi = k which is sampled according to a tree-based HDP. In what follows, we address the calculation of the posterior probabilities for drawing tree branch tdj, theme label ydj and topic label zdi.
A. Sampling for document-dependent subtree branches
A document wd is seen as “a bag of sentences” for sampling a subtree td. We iteratively sample a tree branch or choose a table tdj for sentence wdj by using the posterior probability

p(tdj = t | td(−j), w, yd, α0, η) ∝ p(tdj = t | td(−j), α0) × p(wdj | wd(−j), td, yd, η)   (20)

where yd = {ydj, yd(−j)} denotes the set of theme labels for sentence wdj and all the remaining sentences wd(−j) of document wd, and td(−j) denotes the branches for document wd excluding those of sentence wdj. In Eq. (20), the snCRP parameter α0 and the Dirichlet prior parameter η provide control over the size of the inferred tree. The first term p(tdj = t | td(−j), α0) provides the prior on choosing a branch t for a sentence wdj according to the sentence-based nested CRP with the following metaphor. On a culinary vacation, the tourist or sentence wdj enters a restaurant and chooses a table or branch tdj = t using the CRP probability

p(tdj = t | td(−j), α0) = md,t / (md − 1 + α0)  if an occupied table t is chosen,
                         = α0 / (md − 1 + α0)   if a new table is chosen   (21)

where md,t denotes the number of tourists or sentences in document wd who are seated at table t and md denotes the total number of sentences in wd. On the next evening, this tourist goes to the restaurant in the next layer which is identified by the card on the table selected in the current evening. Eq. (21) is applied again to determine whether a new branch at this layer is drawn or not. The prior of a subtree td = {tdj} is accordingly determined in a nested fashion over different layers from different sentences in a document wd = {wdj}.
The second term p(wdj | wd(−j), td, yd, η) is calculated by considering the marginal likelihood over the multinomial parameters θt,l = {θt,l,v}_{v=1}^V of a node ydj = l connected to tree branch tdj = t for totally V words in the dictionary, where θt,l,v = p(wdi = v | tdj = t, ydj = l). The parameters θt,l in a tuple (t, l) are assumed to be Dirichlet distributed with a shared prior parameter η

p(θt,l | η) = [Γ(Σ_{v=1}^V η) / ∏_{v=1}^V Γ(η)] ∏_{v=1}^V θt,l,v^{η−1}.   (22)

We can derive the likelihood function of a sentence wdj given a tree branch t and the other sentences wd(−j) of wd as

p(wdj | wd(−j), td, yd, η) = p(wd | tdj = t, yd, η) / p(wd(−j) | tdj = t, yd, η)
= ∏_{l=1}^{Ld} [ Γ(Σ_{v=1}^V nd(−j),t,l,v + Vη) / ∏_{v=1}^V Γ(nd(−j),t,l,v + η) ] · [ ∏_{v=1}^V Γ(nd,t,l,v + η) / Γ(Σ_{v=1}^V nd,t,l,v + Vη) ]   (23)

where Ld denotes the maximum number of nodes in subtree td, nd,t,l,v denotes the number of words wdi = v in document wd which are allocated in tuple (t, l) and nd(−j),t,l,v denotes the number of words wdi = v of wd except sentence wdj which are allocated in tuple (t, l). Eq. (23) is obtained since the marginal likelihood is derived by

p(wd | t, yd, η) = ∫ p(wd | t, yd, θt,l) p(θt,l | η) dθt,l
= [Γ(Vη) / (Γ(η))^V] ∫ ( ∏_{l=1}^{Ld} ∏_{v=1}^V θt,l,v^{nd,t,l,v + η − 1} ) dθt,l
∝ ∏_{l=1}^{Ld} [ ∏_{v=1}^V Γ(nd,t,l,v + η) / Γ(Σ_{v=1}^V nd,t,l,v + Vη) ]   (24)

which is found by arranging an integral over a posterior Dirichlet distribution of θt,l with parameters {nd,t,l,v + η}_{v=1}^V.

Intuitively, a model with larger α0 will tend to a tree with more branches. A smaller η encourages fewer words in each tree node. The joint effect of large α0 and small η in the posterior probability will result in the circumstance that more themes are required to explain the data. Using snCRP, we use sentence wdj to find the corresponding branch tdj. The subtree branches td = {tdj} are drawn to accommodate the thematic variations in the heterogeneous document wd.
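For numerical stability, the ratio of Gamma functions in Eq. (23) is usually evaluated in log space. The following sketch illustrates this, assuming the counts n_{d,t,l,v} and n_{d(-j),t,l,v} are already available as arrays; the function name and the toy counts are hypothetical.

```python
import numpy as np
from scipy.special import gammaln

def log_sentence_likelihood(n_with, n_without, eta):
    """Log of Eq. (23): p(w_dj | w_d(-j), t_d, y_d, eta).

    n_with    : (L_d x V) counts n_{d,t,l,v} including sentence j
    n_without : (L_d x V) counts n_{d(-j),t,l,v} excluding sentence j
    """
    V = n_with.shape[1]
    log_p = 0.0
    for nw, nwo in zip(n_with, n_without):       # loop over nodes l in the subtree
        log_p += gammaln(nwo.sum() + V * eta) - gammaln(nw.sum() + V * eta)
        log_p += np.sum(gammaln(nw + eta) - gammaln(nwo + eta))
    return log_p

# Toy example: 2 nodes, vocabulary of 5 words, two extra tokens from sentence j.
n_without = np.array([[2, 0, 1, 0, 0], [0, 3, 0, 1, 0]], dtype=float)
n_with = n_without.copy()
n_with[0, 2] += 2                                # sentence j adds two tokens of word v=2 in node 0
print(log_sentence_likelihood(n_with, n_without, eta=0.05))
```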
B. Sampling for document-dependent themes
After finding the subtree branches td = {tdj} for the sentences in a document wd = {wdj} via snCRP, we sample the document-dependent theme label ydj = l for each sentence wdj according to the posterior probability of the latent variable ydj given document wd and the current values of all the other latent variables

p(ydj = l | wd, yd(−j), td, γs, η) ∝ p(ydj = l | yd(−j), td, γs) × p(wdj | wd(−j), yd, td, η).   (25)
The first term represents the distribution of the theme proportion of ydj = l or lac in a subtree td given the themes of the other sentences yd(−j). This distribution is calculated as an expectation over the treeGEM distribution which is determined by the TSBP. Considering the draw of the theme proportion for a child node lac given those for the parent node la and the preceding child nodes {la1, . . . , la(c−1)} in Eq. (18), we derive the first term based on a subtree structure td

p(ydj = lac | yd(−j), γs) = E[βla | yd(−j), γs] × E[β′lac | yd(−j), γs] ∏_{j=1}^{c−1} E[1 − β′laj | yd(−j), γs]
= p(ydj = la | yd(−j), γs) · (1 + md(−j),lac) / (1 + γs + Σ_{u=c}^{Ld} md(−j),lau) × ∏_{j=1}^{c−1} (γs + Σ_{u=j+1}^{Ld} md(−j),lau) / (1 + γs + Σ_{u=j}^{Ld} md(−j),lau)   (26)

which is a product of the expectations of the theme proportion of the parent node βla, the beta variable for the proportion of the current child node β′lac and the beta variables for the remaining proportions of the preceding child nodes {β′la1, . . . , β′la(c−1)}.

In Eq. (26), md(−j),lac denotes the number of sentences in wd(−j) which are allocated in node lac. The treeGEM parameter γs reflects a kind of pseudo-count of the observations for the proportion. The expectation over the beta variable E[β′ | yd(−j), γs] is derived by

∫ β′ p(β′ | yd(−j), γs) dβ′ = ∫ β′ p(β′ | γs) p(yd(−j) | β′) dβ′
∝ ∫ β′ (1 − β′)^{γs−1} (β′)^{md(−j),lac} (1 − β′)^{Σ_{n=c+1}^{Ld} md(−j),lan} dβ′
= (1 + md(−j),lac) / (1 + md(−j),lac + γs + Σ_{n=c+1}^{Ld} md(−j),lan)   (27)

which is obtained as the mean of the posterior beta variable p(β′ | yd(−j), γs) with parameters 1 + md(−j),lac and γs + Σ_{n=c+1}^{Ld} md(−j),lan.
On the other hand, the second term of Eq. (25) for sampling a theme is the same as that of Eq. (20) for sampling a subtree branch. This term p(wdj | wd(−j), yd, td, η) has been derived in Eq. (23). There are twofold relations in the posterior probabilities between nCRP and snCRP. First, we treat the sentence for snCRP as if it were the word for nCRP. Second, the calculation of Eq. (26) is done recursively for each set consisting of a parent node and its child nodes. Using nCRP, this calculation is performed by treating tree nodes in different layers at a flat level without considering the subtree structure.

In general, snCRP is completed by first choosing a subtree td for the sentences in a document wd and then assigning each sentence to one of the nodes in the chosen subtree according to the theme proportions βd drawn from the treeGEM. In the generation of the theme assignment ydj = l, the node l assigned to a sentence wdj might be connected to the branch tdj′ of a subtree td drawn by another sentence wdj′ from the same document wd. The variety of themes in a heterogeneous document wd is reflected by the subtree td which is drawn via snCRP. A whole infinite tree is accordingly established and shared for all documents in a corpus w.
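The prior term in Eq. (26) only needs the sentence counts of sibling nodes. The sketch below evaluates this expectation for the children of a single parent node, assuming the parent probability and the sibling counts are given; the names and the toy numbers are illustrative, not taken from the paper.

```python
import numpy as np

def child_theme_priors(parent_prob, sibling_counts, gamma_s):
    """Evaluate Eq. (26) for every child of one parent node.

    parent_prob    : p(y_dj = l_a | y_d(-j), gamma_s), prior mass of the parent node
    sibling_counts : m_{d(-j), l_ac} for the child nodes c = 1, ..., C
    Returns an array of prior probabilities for choosing each child node.
    """
    m = np.asarray(sibling_counts, dtype=float)
    C = len(m)
    priors = np.empty(C)
    for c in range(C):
        # Expectation of the beta variable for the current child (Eq. 27).
        term = (1.0 + m[c]) / (1.0 + gamma_s + m[c:].sum())
        # Expectations of (1 - beta') for the preceding children.
        for j in range(c):
            term *= (gamma_s + m[j + 1:].sum()) / (1.0 + gamma_s + m[j:].sum())
        priors[c] = parent_prob * term
    return priors

# Toy example: a parent node with prior mass 0.6 and three children.
print(np.round(child_theme_priors(0.6, sibling_counts=[4, 2, 0], gamma_s=1.85), 4))
```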
C. Sampling for theme-dependent topics
Finally, we implement the tree-based HDP by sampling the theme-dependent topic zdi = k for each word wdji or wdi under a theme ydj = l based on the posterior probability

p(zdi = k | w, z−(di), ydj = l, γw, η) ∝ p(zdi = k | z−(di), ydj = l, γw) × p(wdi | w−(di), z, ydj = l, η)   (28)

where z = {zdi, z−(di)} denotes the topic labels of word wdi and the remaining words w−(di) of a corpus w.

In Eq. (28), the first term indicates the distribution of the topic proportion of zdi = k which is calculated as an expectation over πl ∼ GEM(γw). This calculation is done via a word-level SBP with a control parameter γw. By drawing the beta variable and calculating the topic proportion similar to Eqs. (17) and (18), we derive the first term in the form of

p(zdi = k | z−(di), ydj = l, γw) = (1 + n−(di),l,k) / (1 + γw + Σ_{u=k}^{Kl} n−(di),l,u) × ∏_{j=1}^{k−1} (γw + Σ_{u=j+1}^{Kl} n−(di),l,u) / (1 + γw + Σ_{u=j}^{Kl} n−(di),l,u)   (29)

where Kl denotes the maximum number of topics in a theme l and n−(di),l,k denotes the number of words in w−(di) which are allocated in topic k of theme l. The words of the sentences in a node with theme l are treated as the grouped data to carry out the tree-based HDP. The terms in Eq. (29) are obtained as the expectations over the beta variable for the topic proportion of the current break πlk and the remaining proportions of the preceding breaks {πl1, . . . , πl(k−1)}.

The second term of Eq. (28) calculates the probability of generating the word wdi = v given w−(di) and the current topic variables z in theme l as expressed by

p(wdi = v | w−(di), z, ydj = l, η) = (n−(di),l,v + η) / (Σ_{v=1}^V n−(di),l,v + Vη)   (30)

where n−(di),l,v denotes the number of words wdi = v in w−(di) which are allocated in theme l.
Given the current state of the sampler, we iteratively sample each latent variable conditioned on the whole observations and the remaining variables. Given a text corpus w, we sequentially sample the subtree branch tdj = t and the theme ydj = l for each individual sentence wdj of document wd via the snCRP. After having a sentence-based tree structure, we sequentially sample the theme-dependent topic zdi = k for each individual word wdi in a tree node based on the tree-based HDP. These samples are iteratively employed to update the corresponding posterior probabilities p(tdj = t | td(−j), w, yd, α0, η), p(ydj = l | wd, yd(−j), td, γs, η) and p(zdi = k | w, z−(di), ydj = l, γw, η) in the Gibbs sampling procedure. The true posteriors are approximated by running sufficient sampling iterations. The hierarchical theme and topic model is established by fulfilling the nCRP compound HDP.
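At a high level, the Gibbs sampler alternates the three draws derived above. The following skeleton shows the control flow only; sample_branch, sample_theme and sample_topic are placeholder functions standing in for the evaluations of Eqs. (20), (25) and (28), and are not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder samplers: in the full model these evaluate Eqs. (20), (25) and (28).
def sample_branch(doc, j, state):          # draw t_dj (subtree branch) for sentence j
    return rng.integers(0, 3)

def sample_theme(doc, j, state):           # draw y_dj (theme node) for sentence j
    return rng.integers(0, 4)

def sample_topic(doc, j, i, state):        # draw z_di (topic) for word i
    return rng.integers(0, 6)

def gibbs_sweep(corpus, state):
    """One Gibbs sweep over all documents, sentences and words."""
    for d, doc in enumerate(corpus):
        for j, sentence in enumerate(doc):
            state["t"][d][j] = sample_branch(doc, j, state)           # Eq. (20)
            state["y"][d][j] = sample_theme(doc, j, state)            # Eq. (25)
            for i, _ in enumerate(sentence):
                state["z"][d][j][i] = sample_topic(doc, j, i, state)  # Eq. (28)
    return state

# Toy corpus: 2 documents, each a list of sentences (lists of word ids).
corpus = [[[1, 4, 2], [3, 3]], [[0, 2], [5, 1, 1]]]
state = {"t": [[0] * len(doc) for doc in corpus],
         "y": [[0] * len(doc) for doc in corpus],
         "z": [[[0] * len(s) for s in doc] for doc in corpus]}
for sweep in range(5):
    state = gibbs_sweep(corpus, state)
print(state["y"])
```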
V. EXPERIMENTS
A. Experimental setup
The experiments were performed by using four public-domain corpora: WSJ (Wall Street Journal) 1987-1992, AP (Associated Press) 1988-1990, NIPS (http://arbylon.net/resources.html and https://archive.ics.uci.edu/ml/datasets/Bag+of+Words) and DUC (Document Understanding Conference) 2007 (http://duc.nist.gov). The corpora of WSJ, AP and NIPS contained news documents and conference papers which were applied for evaluation of document representation. The perplexity of the test documents wtest = {wd | d = 1, . . . , D} was calculated by

Perplexity(wtest) = exp(−Σ_{d=1}^D log p(wd) / Σ_{d=1}^D Nd).   (31)
We wish to achieve a better model with lower perplexity on test documents. To investigate the topic coherence of the estimated topic models, we further calculated the pointwise mutual information (PMI) [32]

PMI(wtest) = (1/45) Σ_{i<j} log [ p(wi, wj) / (p(wi) p(wj)) ]   (32)

which is averaged over all pairs of words in the list of the top ten words, i.e., i, j ∈ {1, . . . , 10}, in the estimated topics.
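Eqs. (31) and (32) can be computed directly once held-out log-likelihoods and word co-occurrence probabilities are available. The sketch below assumes these quantities have already been estimated and simply applies the formulas; all numbers are made-up toy values.

```python
import numpy as np
from itertools import combinations

def perplexity(log_likelihoods, doc_lengths):
    """Eq. (31): exp(-sum_d log p(w_d) / sum_d N_d)."""
    return np.exp(-np.sum(log_likelihoods) / np.sum(doc_lengths))

def topic_pmi(top_words, p_single, p_pair):
    """Eq. (32): average PMI over all pairs of a topic's top words (45 pairs for ten words).

    p_single : dict word -> marginal probability p(w)
    p_pair   : dict frozenset({w_i, w_j}) -> joint probability p(w_i, w_j)
    """
    pairs = list(combinations(top_words, 2))
    scores = [np.log(p_pair[frozenset(pair)] / (p_single[pair[0]] * p_single[pair[1]]))
              for pair in pairs]
    return np.mean(scores)

# Toy example with made-up probabilities for a three-word "topic" (3 pairs).
p_single = {"bank": 0.02, "stock": 0.03, "market": 0.05}
p_pair = {frozenset({"bank", "stock"}): 0.002,
          frozenset({"bank", "market"}): 0.0015,
          frozenset({"stock", "market"}): 0.004}
print(topic_pmi(["bank", "stock", "market"], p_single, p_pair))
print(perplexity(log_likelihoods=[-5200.0, -4800.0], doc_lengths=[800, 750]))
```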
In WSJ, we chose 1,085 documents with 18,323 sentences and
203,731 words for model training and the other 120 documents
as test data for evaluation. This dataset had 4,999 unique
words. In AP, we used 1,211 documents with 19,109 sentences
and 198,420 words for model training and the other 130
documents for testing. This dataset had 5,183 unique words. In
NIPS training set, we collected 2,500 papers totally including
310,553 sentences and 3,026,153 words with a vocabulary
of 10,328 words. The other 250 papers with 290,345 words
were sampled to form the test set. The scale of NIPS corpus
is much larger than that of WSJ, AP and DUC. In DUC
corpus, there were 1,680 documents consisting of 22,961
sentences and 18,696 unique words. This corpus provided
the reference summary for individual document, which was
manually written for evaluation of document summarization.
The automatic summary for DUC was limited to 250 words
at most. The NIST evaluation tool, ROUGE (Recall-Oriented
Understudy for Gisting Evaluation) at http://berouge.com, was
adopted. ROUGE-1 was used to measure the matched uni-
grams between reference summary and automatic summary
in terms of recall, precision and F-measure. In these four datasets, we held out 10% of the training data as the validation data for selection of the hyperparameters α0, η, γs and γw, and trained the models on the remaining 90% of the data. Stop words were removed. Hyperparameters were individually selected for different datasets based on perplexity. The same selection was employed to choose the number of topics K in LDA and the numbers of themes L and topics K in sLDA. Fivefold cross validation was performed. We evaluated the computation time in seconds by using different methods over a personal computer equipped with an Intel(R) Core(TM) i7-4930K CPU at 3.4 GHz with 6 cores and 64 GB of RAM.
TABLE II
EVOLUTION FROM THE PARAMETRIC TO THE NONPARAMETRIC, FROM THE NON-HIERARCHICAL TO THE HIERARCHICAL, AND FROM THE WORD-BASED TO THE SENTENCE-BASED CLUSTERING MODELS.

                            Parametric          Nonparametric       Nonparametric
                            Non-Hierarchical    Non-Hierarchical    Hierarchical
                            Model               Model               Model
Word-Based Clustering       LDA                 HDP                 nCRP
Sentence-Based Clustering   sLDA                sHDP                snCRP
For comparative study, we implemented the parametric
topic models including LDA [2], sentence-based LDA (sLDA)
[7], and the nonparametric topic models including HDP [9],
sentence-based HDP (sHDP), nCRP [11] and the proposed
sentence-based nCRP compound HDP (simply denoted by
snCRP hereafter). These methods could also be grouped into the categories of non-hierarchical models (LDA, sLDA, HDP and sHDP) and hierarchical models (nCRP and snCRP) as shown in Table II. The sentence-based models based on sLDA, sHDP and snCRP were implemented with additional information of sentence boundaries. In [7], sLDA was implemented by representing each word wdji in document wd based on the theme-dependent topic zdi = k where the theme ydj = l was learned from a bag of sentences in a text corpus w = {wdj}.
The words were drawn from the theme-dependent topic model.
Using sLDA, the numbers of themes and topics were fixed
and the theme hierarchy was not considered. In addition, the
sHDP is an extension of sLDA by conducting BNP inference,
but sHDP is a simplification of snCRP without involving the
theme hierarchy. The sHDP is proposed in this study and
implemented for comparison. Basically, the GEM distribution
and the treeGEM distribution are applied to characterize
the distributions of theme proportions for sHDP and snCRP,
respectively.
We also examine the effect of topic and theme hierarchies in
document modeling based on nCRP and snCRP by comparing
the performance of standard stick-breaking process (SBP), the
tree stick-breaking process (TSBP1) in Section III-D and the
TSBP in [29] (TSBP2). For each document wd, SBP selects a single tree path cd [11] while TSBP1 and TSBP2 find the subtree branches td. This is the first work in which TSBP1 and TSBP2 are developed to explore the thematic hierarchy from heterogeneous sentences and documents.
In evaluation of document summarization, we carried out
the vector-space model (VSM) in [33] and the sentence-
based models using sLDA, sHDP and snCRP which performed
sentence clustering in different ways. The sLDA conducted
parametric and non-hierarchical clustering while the sHDP
executed nonparametric and non-hierarchical clustering. Only
snCRP performed nonparametric and hierarchical clustering.
The Kullback-Leibler (KL) divergence between document
model and sentence model was calculated. The thematic
sentences with the smallest KL divergence were selected.
Considering the tree structure by using snCRP, we investigated
four methods for sentence selection. The snCRP-root and
snCRP-leaf selected the thematic sentences allocated in root
node and leaf nodes, respectively. The snCRP-path selected representative sentences of a document only from the nodes along the most frequently-visited path. The snCRP-MMR selected sentences from all possible branches td by applying the maximal marginal relevance (MMR) [22].
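Sentence selection by KL divergence can be sketched as follows: each sentence and the document are represented by theme (or topic) distributions, and the sentences with the smallest KL divergence to the document distribution are ranked first. The distributions below are toy inputs, not the model's actual posteriors.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def select_sentences(doc_dist, sentence_dists, num_select):
    """Rank sentences by KL divergence to the document model and pick the closest."""
    scores = [kl_divergence(doc_dist, s) for s in sentence_dists]
    order = np.argsort(scores)                      # smallest divergence first
    return order[:num_select], scores

doc_dist = [0.5, 0.3, 0.2]                          # document-level theme proportions (toy)
sentence_dists = [[0.6, 0.3, 0.1],                  # per-sentence theme posteriors (toy)
                  [0.1, 0.1, 0.8],
                  [0.45, 0.35, 0.2]]
chosen, scores = select_sentences(doc_dist, sentence_dists, num_select=2)
print("selected sentence indices:", chosen.tolist())
```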
For simplicity, we constrained the tree growing in nCRP and snCRP to three layers in our experiments. The Dirichlet prior parameter η was separate for the three layers in nCRP and snCRP. We decreased the value of η (η1, η2, η3) from the root layer to the leaf layer to reflect the condition that the number of words allocated in the bottom layer was reduced. In the implementation of TSBP2, the beta prior parameter for depth γsd (γsd1, γsd2, γsd3) was also decreased by depth to reflect the decreasing number of sentences in the bottom layer. The system parameters were selected individually from the validation data in the different datasets. In the DUC dataset, the hyperparameters of snCRP-TSBP1 and snCRP-TSBP2 were selected as α0 = 0.5, η1 = 0.05, η2 = 0.025, η3 = 0.0125, γs = 1.85, γsd1 = 2.5, γsd2 = 1.25, γsd3 = 1.125, γsb = 1.1 and γw = 1.85.
Fig. 7. Perplexity versus Gibbs sampling iteration of using snCRP for three datasets: WSJ, AP and DUC.
We ran the Gibbs sampling procedure for 280 iterations with 100 samples per iteration. The burn-in samples in the first 30 iterations were abandoned. Figure 7 illustrates the perplexity of the training documents versus the Gibbs sampling iterations by using snCRP-TSBP1. The WSJ, AP and DUC datasets are investigated. We find that the number of tree nodes increases and the corresponding perplexity decreases over the sampling iterations. The perplexity converges after around 50 iterations. This phenomenon is consistent for the three datasets.
B. Evaluation of tree models and stick-breaking processes for
document modeling
In this set of experiments, we evaluate the performance
of different tree models and stick-breaking processes for
document representation by using WSJ and AP datasets. The
hierarchical models based on nCRP and snCRP and the draws
of topic proportions or theme proportions by using SBP,
TSBP1 and TSBP2 are investigated.

Fig. 8. Comparison of perplexity of using nCRP and snCRP where SBP, TSBP1 and TSBP2 are investigated. WSJ and AP are used. The error bars show the standard deviation across test documents in fivefold cross validation. Model complexity is compared. The blue number after nCRP denotes the total number of estimated topics while the blue numbers after snCRP denote the total numbers of estimated themes and topics: nCRP-SBP (WSJ 529; AP 602), nCRP-TSBP1 (WSJ 553; AP 632), nCRP-TSBP2 (WSJ 570; AP 650), snCRP-SBP (WSJ 402, 300; AP 414, 336), snCRP-TSBP1 (WSJ 418, 310; AP 431, 345), snCRP-TSBP2 (WSJ 437, 325; AP 451, 353).

Different from the word-based hierarchical topic model using nCRP, the proposed
snCRP builds the sentence-based tree model where each node
represents a theme from a set of sentences and the words
in these sentences are generated by HDP. In addition to the
baseline nCRP-SBP [11] with single tree path, we compare
the performance of nCRP and snCRP combined with TSBP1
and TSBP2 which select the subtree branches to deal with the
topical and thematic variations in heterogeneous documents.
Figure 8 shows the perplexities of test documents and
the estimated model complexities. The error bars show the
standard deviation across test documents in fivefold cross
validation. The model complexity is measured in terms of
total number of estimated topics in tree model of nCRP and
total numbers of estimated themes and topics in tree model
of snCRP. We find that snCRP obtains lower perplexity than
nCRP. Selection of subtree branches using TSBP1 and TSBP2
outperforms that of single path using SBP in both datasets.
The results of TSBP1 and TSBP2 are comparable. Such
performance is obtained by using nCRP as well as snCRP.
In addition, the estimated model complexity of using TSBP
is larger than that of SBP. This is reasonable because TSBP
adopts more latent variables to allow larger variations in topics
or themes. TSBP2 has higher freedom to choose more themes
and topics but obtained very limited improvement compared
with TSBP1. The model complexity of hierarchical models
nCRP and snCRP is larger than that of non-hierarchical models
LDA, sLDA, HDP and sHDP. It is interesting that snCRP uses a smaller number of topics than nCRP. However, the total number of latent variables used by snCRP (L + K) is larger than that used by nCRP (K). This implies that additional modeling at the sentence level can reduce the number of topics required in word-level modeling for document representation.
Fig. 9. Number of themes versus length of document by using snCRP with SBP, TSBP1 and TSBP2. Six documents of different lengths are selected from WSJ. Document length is measured by the total number of words in the document.

Figure 9 shows the number of themes which are estimated from six documents with different lengths or numbers of words in the documents (Nd = 200, 301, 405, 504, 688). We find that the number of themes used for representation of a document is
increased by the length of document when applying TSBP1
and TSBP2. SBP selects a single path so that only three themes
are selected for each document. TSBP2 chooses more themes
than TSBP1. This is an evidence that TSBPs perform better
than SBP for document modeling.
Fig. 10. Comparison of perplexity of using LDA, sLDA, HDP, sHDP, nCRP and snCRP on WSJ, AP and NIPS. Subtree branch selection in nCRP and snCRP based on TSBP1 is considered. The error bars show the standard deviation. Model complexity is compared: the number after LDA, HDP and nCRP denotes the total number of estimated topics, while the numbers after sLDA, sHDP and snCRP denote the total numbers of estimated themes and topics: LDA (WSJ 400; AP 450; NIPS 500); sLDA (WSJ 350, 250; AP 400, 300; NIPS 450, 350); HDP (WSJ 492; AP 550; NIPS 590); sHDP (WSJ 385, 293; AP 411, 319; NIPS 469, 374); nCRP (WSJ 553; AP 632; NIPS 692); snCRP (WSJ 418, 310; AP 431, 345; NIPS 475, 397).
C. Evaluation of different methods in terms of perplexity, topic
coherence and computation time
This study presents the evolution from the parametric and
non-hierarchical topic model based on LDA to the nonpara-
metric and hierarchical theme and topic model based on
snCRP. The additional modeling over subgrouping data is in-
troduced to conduct unsupervised structural learning of themes
and topics from a set of documents. Figure 10 compares the
perplexity of test documents by using LDA, sLDA, HDP,
sHDP, nCRP and snCRP. WSJ, AP and NIPS datasets are
used. The numbers of the estimated themes and topics are
shown to evaluate the effect of model complexity in different
methods. We find that perplexity is consistently reduced from
word-based clustering to sentence-based clustering across
different methods and datasets. This is because the two levels
of data modeling in sentence-based clustering methods provide
an organized way to represent a set of documents. The relation
between words and sentences through latent topics and themes
is beneficial for document modeling regardless of the model
style, model shape and inference procedure. We consistently
see that sLDA, sHDP and snCRP estimate a smaller number of
topics than LDA, HDP and nCRP, respectively. Nevertheless,
the total number of themes and topics in sentence-based clustering
is still larger than the number of topics in word-based clustering. In
this comparison, HDP and sHDP perform better than LDA
and sLDA, respectively. The lowest perplexity is obtained by
using snCRP. Compared with the news articles in WSJ and AP,
the large collection of scientific documents in NIPS does not
yield proportionally more topics and themes under the BNP methods.
Fig. 11. Comparison of PMI score of using LDA, sLDA, HDP, sHDP, nCRP and snCRP on WSJ, AP and NIPS. The error bars show the standard deviation.
Based on the estimated LDA, sLDA, HDP, sHDP, nCRP
and snCRP, Figure 11 further evaluates the performance of
topic coherence by comparing the corresponding PMI scores
where WSJ, AP and NIPS datasets are investigated. PMI is
known as an objective measure that closely reflects
human-judged topic coherence [32]. To conduct a consistent
comparison, the PMI scores for sLDA, sHDP and snCRP are
calculated from the pairs of frequent words in the estimated
topics rather than the themes. From the results of the PMI measure,
we again see the improvement of two-level clustering
over one-level clustering, and of nonparametric hierarchical
modeling over parametric non-hierarchical modeling.
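As a reference for how such a coherence score can be obtained, the following Python sketch averages the pointwise mutual information over pairs of a topic's top words using document co-occurrence counts. The reference corpus, the co-occurrence statistics and the smoothing constant are assumptions and may differ from the exact setup in [32].

import itertools
import math

def pmi_coherence(top_words, doc_sets, num_docs, eps=1e-12):
    """Average PMI over pairs of a topic's top words.

    top_words: list of the topic's most frequent words.
    doc_sets : dict mapping each word to the set of documents containing it
               (assumed precomputed from a reference corpus).
    num_docs : total number of documents used for the counts.
    """
    scores = []
    for w1, w2 in itertools.combinations(top_words, 2):
        p1 = len(doc_sets.get(w1, set())) / num_docs
        p2 = len(doc_sets.get(w2, set())) / num_docs
        p12 = len(doc_sets.get(w1, set()) & doc_sets.get(w2, set())) / num_docs
        scores.append(math.log((p12 + eps) / (p1 * p2 + eps)))
    return sum(scores) / max(len(scores), 1)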
In addition, the training time of using different methods is
evaluated as illustrated in Figure 12. NIPS corpus is used.
The nCRP and snCRP based on TSBP1 were examined.
Fig. 12. Comparison of computation time (in seconds) of using LDA, sLDA, HDP, sHDP, nCRP and snCRP under different amounts of training documents from NIPS.
To investigate the scalability of computation time with respect to the
amount of training data, we sampled training documents and
formed the training sets with 800, 1,600 and 2,500 documents.
The model size of LDA and sLDA is adjusted according to
the amount of training data. LDA and sLDA are estimated
by using VB inference, while HDP, sHDP, nCRP and snCRP
conduct inference based on Gibbs sampling. Computation
time is measured up to convergence of the model parameters, where
10 iterations are run for VB and 50 iterations are run for
Gibbs sampling. Basically, the computation overhead of
using sLDA, sHDP and snCRP over LDA, HDP and nCRP is
limited. Nevertheless, the computation cost of the nonparametric
methods HDP, sHDP, nCRP and snCRP is much higher
than that of the parametric methods LDA and sLDA. The
highest cost is measured for the sentence-based tree model,
i.e., snCRP. The computation time is roughly proportional to
the amount of training data.
Fig. 13. A tree model of DUC showing the topical words in each theme or tree node.
D. Evaluation for document summarization
The proposed snCRP conducts sentence-based clustering
or, equivalently, establishes a tree model which contains the
thematic sentences in tree nodes and facilitates document
summarization. The other sentence-based clustering methods,
including sLDA and sHDP, are implemented for comparative
study. Figure 13 displays an example of a three-layer tree structure
estimated from the DUC dataset based on snCRP-TSBP1 and treeGEM.
For the words of all sentences allocated to tree nodes, we conduct HDP
to find the topic proportions corresponding to each node based on the
GEM distribution. In this figure, five topical words are displayed for
the tree nodes in different layers, which are shaded with different colors.
The root node (yellow) contains general words while the
leaf nodes (white) consist of specific words. Semantic relationships
between tree nodes in different layers are evident along the
five selected tree paths, which are respectively related to animal,
television, disease, criminal and country. This illustrates the
performance of unsupervised structural learning.
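Since the topic proportions at each tree node are drawn according to the GEM distribution, a standard truncated GEM stick-breaking draw can be sketched in Python as follows; the truncation level and parameter names are illustrative and not tied to the actual implementation.

import numpy as np

def gem_proportions(alpha, truncation, rng=None):
    """Truncated GEM(alpha) stick-breaking: pi_k = beta_k * prod_{j<k}(1 - beta_j)."""
    rng = rng or np.random.default_rng(0)
    betas = rng.beta(1.0, alpha, size=truncation)
    betas[-1] = 1.0                            # last stick absorbs the remainder
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    return betas * remaining

pi = gem_proportions(alpha=1.0, truncation=20)
assert abs(pi.sum() - 1.0) < 1e-9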
Fig. 14. Comparison of recall, precision and F-measure under ROUGE-1 evaluation by using snCRP based on different sentence selection methods (snCRP-root, snCRP-leaf, snCRP-MMR and snCRP-path). The error bars show the standard deviation.
In the implementation of snCRP, sentences are selected
from the tree model by applying four selection
methods. Figure 14 compares the performance of document
summarization using snCRP-TSBP1 in terms of recall, precision
and F-measure under ROUGE-1. The performance of
selecting sentences from the root node (snCRP-root) and from the leaf
nodes (snCRP-leaf) is comparable. The MMR metric for
selection from all paths (snCRP-MMR) and the KL divergence
metric for selection from the most frequently visited
path (snCRP-path) perform better than snCRP-root and
snCRP-leaf. The snCRP-path obtains the highest F-measure
in this comparison. The sentences along the most frequently
visited path carry the most representative information
for summarization. We therefore fix the snCRP-path setting when
comparing with the other methods.
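For reference, snCRP-MMR follows the standard maximal marginal relevance criterion, which trades off relevance to the document against redundancy with the sentences already selected. The following Python sketch illustrates this criterion; the sentence vectors (e.g., theme or topic proportions), the trade-off weight and the function names are illustrative assumptions rather than the exact implementation.

import numpy as np

def mmr_select(sent_vecs, doc_vec, num_sentences, lam=0.7):
    """Greedy MMR selection over sentence vectors.

    sent_vecs: (S, D) array with one row per candidate sentence; doc_vec is
    the document-level vector. All vectors are assumed L2-normalized, so a
    dot product acts as a cosine similarity.
    """
    selected = []
    candidates = list(range(len(sent_vecs)))
    relevance = sent_vecs @ doc_vec
    while candidates and len(selected) < num_sentences:
        def score(i):
            redundancy = max((sent_vecs[i] @ sent_vecs[j] for j in selected),
                             default=0.0)
            return lam * relevance[i] - (1.0 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Hypothetical usage: pick three summary sentences from a sentence matrix.
# summary_ids = mmr_select(sentence_matrix, document_vector, num_sentences=3)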
Fig. 15. Comparison of recall, precision and F-measure under ROUGE-1 evaluation by using VSM, sLDA, sHDP and snCRP with SBP, TSBP1 and TSBP2. The error bars show the standard deviation. Model complexity is compared: the numbers denote the total numbers of estimated themes and topics: sLDA (200, 150); sHDP (235, 204); snCRP-SBP (310, 255); snCRP-TSBP1 (350, 283); snCRP-TSBP2 (377, 294).
Figure 15 shows the recall, precision and F-measure of document summarization by using VSM, sLDA, sHDP, snCRP-
SBP, snCRP-TSBP1 and snCRP-TSBP2 under ROUGE-1
evaluation. The numbers of the estimated themes and topics
are included for comparison. Again, snCRP-TSBPs estimate
more themes and topics than snCRP-SBP. The hierarchical model
based on snCRP has a larger model size than the non-hierarchical
models based on sLDA and sHDP. In terms of F-measure,
the theme and topic models using sLDA and sHDP are
significantly better than the baseline VSM. The nonparametric model
based on sHDP obtains a higher F-measure than the parametric
model based on sLDA. Nevertheless, the hierarchical theme
and topic model using snCRP is superior to that using sHDP.
The contributions of using snCRP come from the flexible
model complexity and the theme structure which are beneficial
for sentence clustering, document modeling and document
summarization. Similar to the evaluation for document model-
ing, snCRP-TSBPs perform better than snCRP-SBP for docu-
ment summarization. The snCRP-TSBP1 and snCRP-TSBP2
outperform the other methods in terms of F-measure.
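The ROUGE-1 scores in Figures 14 and 15 are based on unigram overlap between a system summary and the reference summary. The following Python sketch computes recall, precision and F-measure against a single reference, assuming tokenization is done beforehand; multi-reference averaging and the other options of the ROUGE toolkit are omitted.

from collections import Counter

def rouge1(candidate_tokens, reference_tokens):
    """ROUGE-1 recall, precision and F-measure from unigram overlap."""
    cand, ref = Counter(candidate_tokens), Counter(reference_tokens)
    overlap = sum((cand & ref).values())          # clipped unigram matches
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f_measure = (2 * recall * precision / (recall + precision)
                 if recall + precision else 0.0)
    return recall, precision, f_measure

# Example: rouge1("the cat sat".split(), "the cat was sitting".split())
# gives recall 0.5, precision about 0.67 and F-measure about 0.57.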
VI. CONCLUSIONS
This paper addressed a new hierarchical and nonparametric
model for document representation and summarization. A hier-
archical theme model was constructed according to a sentence-
level nCRP while the topic model was established through a
word-level HDP. The nCRP compound HDP was proposed
to build a tightly-coupled theme and topic model which was
also seen as a theme-dependent topic mixture model. A self-organized
document representation using themes at the sentence
level and topics at the word level was developed. We presented
the TSBP to draw subtree branches for possible thematic
variations in heterogeneous documents. A hierarchical mixture
model of themes was constructed according to the snCRP.
The hierarchical clustering of sentences was implemented.
The thematic sentences were allocated in tree nodes which
were frequently visited. Experimental results on document
modeling and summarization showed the merit of snCRP
in terms of perplexity, topic coherence and F-measure. The
proposed snCRP is a general model for unsupervised structural
learning. The model can be generalized to characterize the latent
structure in different levels of data groupings that exist in
various specialized technical domains.
REFERENCES
[1] D. M. Blei, “Probabilistic topic models,” Communications of the ACM,
vol. 55, no. 4, pp. 77–84, Apr. 2012.
[2] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,”
Journal of Machine Learning Research, vol. 3, no. 5, pp. 993–1022,
Jan. 2003.
[3] D. M. Blei, L. Carin, and D. Dunson, “Probabilistic topic models,” IEEE
Signal Processing Magazine, vol. 27, no. 6, pp. 55–65, Nov. 2010.
[4] J.-T. Chien and C.-H. Chueh, “Topic-based hierarchical segmentation,”
IEEE Transactions on Audio, Speech and Language Processing, vol. 20,
no. 1, pp. 55–66, Jan. 2012.
[5] ——, “Dirichlet class language models for speech recognition,” IEEE
Transactions on Audio, Speech and Language Processing, vol. 19, no. 3,
pp. 482–495, Mar. 2011.
[6] J.-T. Chien and M.-S. Wu, “Adaptive Bayesian latent semantic analysis,”
IEEE Transactions on Audio, Speech, and Language Processing, vol. 16,
no. 1, pp. 198–207, Jan. 2008.
[7] Y.-L. Chang and J.-T. Chien, “Latent Dirichlet learning for document
summarization,” in Proc. of International Conference on Acoustics,
Speech, and Signal Processing, Taipei, Taiwan, Apr. 2009, pp. 1689–
1692.
[8] J.-T. Chien and Y.-L. Chang, “Hierarchical theme and topic model for
summarization,” in Proc. of IEEE International Workshop on Machine
Learning for Signal Processing, Southampton, UK, Sep. 2013, pp. 1–6.
[9] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei, “Hierarchical
Dirichlet process,” Journal of the American Statistical Association, vol.
101, no. 476, pp. 1566–1581, Dec. 2006.
[10] D. M. Blei, T. L. Griffiths, M. I. Jordan, and J. B. Tenenbaum, “Hierarchi-
cal topic models and the nested Chinese restaurant process,” in Advances
in Neural Information Processing Systems, Vancouver, Canada, Dec.
2004, pp. 17–24.
[11] D. M. Blei, T. L. Griffiths, and M. I. Jordan, “The nested Chinese restau-
rant process and Bayesian nonparametric inference of topic hierarchies,”
Journal of the ACM, vol. 57, no. 2, article 7, Jan. 2010.
[12] J.-T. Chien and Y.-L. Chang, “Bayesian sparse topic model,” Journal of
Signal Processing Systems, vol. 74, no. 3, pp. 375–389, Mar. 2014.
[13] C. Wang and D. M. Blei, “Decoupling sparsity and smoothness in
the discrete hierarchical Dirichlet process,” in Advances in Neural
Information Processing Systems, Vancouver, Canada, Dec. 2009, pp.
1982–1989.
[14] S. Williamson, C. Wang, K. A. Heller, and D. M. Blei, “The IBP com-
pound Dirichlet process and its application to focused topic modeling,”
in Proc. of International Conference on Machine Learning, Haifa, Israel,
Jun. 2010, pp. 1151–1158.
[15] H. M. Wallach, D. Mimno, and A. McCallum, “Rethinking LDA: why
priors matter,” in Advances in Neural Information Processing Systems,
Vancouver, Canada, Dec. 2009, pp. 1973–1981.
[16] D. I. Kim and E. B. Sudderth, “The doubly correlated nonparametric
topic model,” in Advances in Neural Information Processing Systems,
Vancouver, Canada, Dec. 2011, pp. 1980–1988.
[17] A. Rodriguez, D. B. Dunson, and A. E. Gelfand, “The nested Dirichlet
process,” Journal of the American Statistical Association, vol. 103, no.
483, pp. 1131–1154, Sep. 2008.
[18] J. Paisley, L. Carin, and D. M. Blei, “Variational inference for stick-
breaking beta process priors,” in Proc. of International Conference on
Machine Learning, Bellevue, WA, Jun. 2011, pp. 889–896.
[19] Y. W. Teh, D. Newman, and M. Welling, “A collapsed variational
Bayesian inference algorithm for latent Dirichlet allocation,” in Ad-
vances in Neural Information Processing Systems, Vancouver, Canada,
Dec. 2007, pp. 1353–1360.
[20] C. Wang and D. M. Blei, “Variational inference for the nested Chinese
restaurant process,” in Advances in Neural Information Processing
Systems, Vancouver, Canada, Dec. 2009, pp. 1990–1998.
[21] H. M. Wallach, “Topic modeling: beyond bag-of-words,” in Proc. of
International Conference on Machine Learning, Haifa, Israel, Jun. 2010,
pp. 977–984.
[22] J. Goldstein, V. Mittal, J. Carbonell, and M. Kantrowitz, “Multi-
document summarization by sentence extraction,” in Proc. of
ANLP/NAACL Workshop on Automatic Summarization, vol. 4, Seattle,
WA, Apr. 2000, pp. 40–48.
[23] A. Haghighi and L. Vanderwende, “Exploring content models for multi-
document summarization,” in Proc. of Annual Conference of the North
American Chapter of the ACL, Boulder, CO, May 2009, pp. 362–370.
[24] D. Wang, S. Zhu, T. Li, and Y. Gong, “Multi-document summarization
using sentence-based topic models,” in Proc. of ACL-IJCNLP, Singa-
pore, Aug. 2009, pp. 297–300.
[25] M. M. Shafiei and E. E. Milios, “Latent Dirichlet co-clustering,” in Proc.
of IEEE International Conference on Data Mining, Hong Kong, Dec.
2006, pp. 542–551.
[26] L. Du, W. Buntine, and H. Jin, “A segmented topic model based on the
two-parameter Poisson-Dirichlet process,” Machine Learning, vol. 81,
no. 1, pp. 5–19, Oct. 2010.
[27] J. Pitman, “Poisson-Dirichlet and GEM invariant distributions for split-
and-merge transformation of an interval partition,” Combinatorics, Prob-
ability and Computing, vol. 11, pp. 501–514, Sep. 2002.
[28] D. Aldous, “Exchangeability and related topics,” in École d’Été de
Probabilités de Saint-Flour XIII–1983, Berlin: Springer-Verlag, 1985,
pp. 1–198.
[29] R. P. Adams, Z. Ghahramani, and M. I. Jordan, “Tree-structured stick
breaking for hierarchical data,” in Advances in Neural Information
Processing Systems, Vancouver, Canada, Dec. 2010, pp. 19–27.
[30] J. Paisley, C. Wang, D. M. Blei, and M. I. Jordan, “Nested hierarchical
Dirichlet processes,” IEEE Transactions on Pattern Analysis and Ma-
chine Intelligence, vol. 37, no. 2, pp. 256–270, Feb. 2015.
[31] D. Andrzejewski, X. Zhu, and M. Craven, “Incorporating domain
knowledge into topic modeling via Dirichlet forest priors,” in Proc. of
International Conference on Machine Learning, Montreal, Canada, Jun.
2009, pp. 25–32.
[32] D. Newman, E. V. Bonilla, and W. Buntine, “Improving topic coherence
with regularized topic models,” in Advances in Neural Information
Processing Systems, Vancouver, Canada, Dec. 2011, pp. 496–504.
[33] Y. Gong and X. Liu, “Generic text summarization using relevance
measure and latent semantic analysis,” in Proc. of ACM SIGIR, New
Orleans, LA, Sep. 2001, pp. 19–25.
Jen-Tzung Chien (S’97-A’98-M’99-SM’04) re-
ceived his Ph.D. degree in electrical engineering
from National Tsing Hua University, Hsinchu, Tai-
wan, ROC, in 1997. During 1997-2012, he was with
the National Cheng Kung University, Tainan, Tai-
wan. Since 2012, he has been with the Department
of Electrical and Computer Engineering and the
Department of Computer Science, National Chiao
Tung University, Hsinchu, where he is currently a
Distinguished Professor. He held the Visiting Re-
searcher positions with the Panasonic Technologies
Inc., Santa Barbara, CA, the Tokyo Institute of Technology, Japan, the Georgia
Institute of Technology, Atlanta, GA, the Microsoft Research Asia, Beijing,
China, and the IBM T. J. Watson Research Center, Yorktown Heights, NY.
His research interests include machine learning, information retrieval, speech
recognition, blind source separation, and face recognition.
Dr. Chien served as the associate editor of the IEEE Signal Processing
Letters in 2008-2011, the guest editor of the IEEE Transactions on Audio,
Speech, and Language Processing in 2012, and the tutorial speaker of the In-
terspeech 2013 and the ICASSP 2012 and 2015. He received the Distinguished
Research Award from the Ministry of Science and Technology, Taiwan and
the Best Paper Award of the 2011 IEEE Automatic Speech Recognition and
Understanding Workshop. He currently serves as an elected member of the
IEEE Machine Learning for Signal Processing Technical Committee.