Hierarchical Theme and Topic Modeling
Jen-Tzung Chien, Senior Member, IEEE
Abstract—Considering the hierarchical data groupings in a text corpus, e.g., words, sentences and documents, we conduct structural learning and infer the latent themes and topics for sentences and words, respectively, from a collection of documents. The relation between themes and topics under different data groupings is explored through an unsupervised procedure without limiting the number of clusters. A tree stick-breaking process is presented to draw theme proportions for different sentences. We build a hierarchical theme and topic model which flexibly represents heterogeneous documents using Bayesian nonparametrics. Thematic sentences and topical words are extracted. In the experiments, the proposed method is shown to be effective in building a semantic tree structure for sentences and the corresponding words. The superiority of using the tree model to select expressive sentences for document summarization is illustrated.
Index Terms— Structural learning, topic model, Bayesian
nonparametrics, document summarization
I. INTRODUCTION
Unsupervised learning has a broad goal of extracting fea-
tures and discovering structure within the given data. The
unsupervised learning via probabilistic topic model [1] has
been successfully developed for document categorization [2],
image analysis [3], text segmentation [4], speech recognition
[5], information retrieval [6], document summarization [7],
[8] and many other applications. Using topic model, latent
semantic topics are learned from a bag of words to capture
the salient aspects embedded in data collection. In this study,
we propose a new topic model to represent a bag of sentences
as well as the corresponding words. As we know, the concept
of “topic” is well understood in the community. Here, we
use another related concept, “theme”. Themes are latent variables that occur at a different level of grouped data, e.g., sentences, so the concepts of themes and topics are distinct. We model the themes and topics separately but estimate them jointly. The hierarchical theme
and topic model is constructed. Figure 1 illustrates the diagram
of hierarchical generation from documents, sentences to words
given by the themes and topics which are drawn from their
proportions. We explore a semantic tree structure of sentence-
level latent variables from a bag of sentences while the word-
level latent variables are learned from a bag of grouped words
allocated in individual tree nodes. We build a two-level topic
model through a compound process. The process of generating
words conditions on the theme assigned to the sentence. The
motivation of this paper is to go beyond the word level and extend the topic model by discovering the hierarchical relations between the latent variables at the word and sentence levels. The benefit of this model is that it establishes a hierarchical latent variable model which can characterize heterogeneous documents with multiple levels of abstraction in different data groupings. This model is general and could be applied to document summarization and many other information systems.

This work was supported in part by the Ministry of Science and Technology, Taiwan, under contract MOST 103-2221-E-009-078-MY3. J.-T. Chien is with the Department of Electrical and Computer Engineering, National Chiao Tung University, Hsinchu 30010, Taiwan (e-mail: jtchien@nctu.edu.tw).

Fig. 1. Conceptual illustration for hierarchical generation of documents (yellow rectangle), sentences (green diamond) and words (blue circle) by using theme and topic proportions.
A. Related work for topic model
Topic model based on latent Dirichlet allocation (LDA) [2]
is constructed as a finite-dimensional mixture representation which assumes that 1) the number of topics is fixed and 2) different topics are independent. The hierarchical Dirichlet
process (HDP) [9] and the nested Chinese restaurant process
(nCRP) [10], [11] were proposed to relax these assumptions.
The HDP-LDA model by Teh et al. [9] is a nonparametric extension of LDA where the document representation is allowed to grow structurally when more documents are observed. The number of topics is unknown a priori. Each word token within a document is drawn from a mixture model where the hidden topics are shared across documents. A Dirichlet process (DP) is realized to find flexible data partitions and provide the nonparametric prior over the number of topics for each document. The base measure for the child DPs is itself drawn from a parent DP. The atoms are shared in a hierarchical way. The model selection problem is tackled through Bayesian nonparametric (BNP) learning. In the literature, the sparse
topic model was constructed by decoupling the sparsity and
smoothness for LDA [12] and HDP [13]. The spike and
slab prior over Dirichlet distributions was applied by using
a Bernoulli variable for each word to indicate whether the
word appears in the topic or not. In [14], the Indian buffet
process compound DP was developed to build a focused
topic model. In [15], a hierarchical Dirichlet prior with Polya
conditionals was used as the asymmetric Dirichlet prior over
the document-topic distributions. The improvement of using
this prior structure over the symmetric Dirichlet prior in LDA
was substantial even when not estimating the number of topics.
On the other hand, the nCRP conducts BNP inference of
topic hierarchies and learns the deeply branching trees from a
data collection. Using this hierarchical topic model [10], each
document is modeled by a path of topics given a random tree
where the hierarchically-correlated topics from global topic to
individual topics are extracted from root node to leaf nodes,
respectively. In [16], the topic correlation was strengthened
through a doubly correlated nonparametric topic model where
the annotations were incorporated into discovery of semantic
structure. In [17], the nested DP prior was proposed by replac-
ing the random atoms with the random probability measures
drawn from DP in a nested setting. In general, HDP and nCRP
were implemented according to the stick-breaking process
and Chinese restaurant process. The approximate inference
using Markov chain Monte Carlo (MCMC) [10], [11], [9] and
variational Bayesian (VB) [18], [19], [20] was developed.
B. Topic model beyond word level
In previous methods, the unsupervised learning over text
data finds the topic information from a bag of words. The
mixed membership modeling is implemented for representa-
tion of words from multiple documents. The word-level mix-
ture model is built. However, unsupervised learning beyond
word level is required in many information systems. For exam-
ple, the topic models based on a bag of bigrams [21] and a bag
of n-gram histories [5] were estimated for language modeling.
In a document summarization system, the text representation
is evaluated to select representative sentences from multiple
documents. Exploring the sentence clustering and ranking [8]
is essential to find the sentence-level themes and measure the
relevance between sentences and documents [22]. In [23], the
general sentences and specific sentences were identified for
document summarization. In [24], [7], a sentence-based topic
model based on LDA was proposed to learn word-document
and word-sentence associations. Furthermore, an information
retrieval system retrieves the condensed information from user
queries. Finding the underlying themes from documents is
beneficial to organize the ranked documents and extract the
relevant information. In addition to the word-level topic model,
it is desirable to build the hierarchical topic model in sentence
level or even in document level. In [25], [26], the latent
Dirichlet co-clustering model and the segmented topic model
were proposed to build topic model across different levels.
C. Main idea of this work
In this study, we construct a hierarchical latent variable
model for structural representation of text documents. The
thematic sentences and the topical words are learned from
hierarchical data groupings. Each path in tree model covers
from the general theme at root node to the individual themes
at leaf nodes. The themes in different tree nodes contain
coherent information but in varying degrees of sharing for
sentence representation. We basically build a tree model for
sentences according to the nCRP. The theme hierarchy is
explored. The brother nodes expand the diversity of themes
from different sentences within and across documents. This
model not only groups the sentences into a node but also distinguishes their concepts through different layers. The
words of the sentences clustered in a tree node are seen as the
grouped data. The grouped words in different tree nodes are
driven by an HDP. The nCRP compound HDP is developed
to build a hierarchical theme and topic model. To reflect the
heterogeneous documents in real-world data collection, a tree
stick-breaking process is addressed to draw a subtree of theme
proportions. We conduct structural learning and group the
sentences into a diversity of themes. The number of themes
and the dependency between themes are learned from data.
The words of the sentences within a node are represented by
a topic model which is drawn by a DP. All the topics from
different nodes are shared under a global DP. The sentence-
level themes and the word-level topics are estimated. This
approach is evaluated by the tasks of document modeling and
summarization.
This paper is organized as follows. Section II surveys BNP
learning based on DP, HDP and nCRP. Section III presents
the nCRP compound HDP for representation of sentences
and words from multiple documents. A tree stick-breaking
process is implemented to draw theme proportions for subtree
branches. Section IV formulates the posterior probabilities
which are applied for drawing the subtree, themes and topics.
In Section V, the experiments on text modeling and document
summarization are reported. The conclusions drawn from this
study are given in Section VI.
II. BAYESIAN NONPARAMETRIC LEARNING
A. Dirichlet process
Dirichlet process (DP) is essential for BNP learning. Basically, a DP for a random probability measure G is expressed by G ∼ DP(α0, G0). For any finite measurable partition {A1, . . . , Ar}, a finite-dimensional Dirichlet distribution

(G(A1), . . . , G(Ar)) ∼ Dir(α0G0(A1), . . . , α0G0(Ar))   (1)

or G ∼ Dir(α0G0) is produced from G with two parameters, a concentration parameter α0 > 0 and a base measure G0 which is seen as the mean of draws from the DP. The probability measure of the DP is established by

G = Σ_{k=1}^∞ βk δφk,  where  Σ_{k=1}^∞ βk = 1   (2)

where δφk is a unit mass at the point φk and the infinite sequences of weights {βk} and points {φk} are drawn from α0 and G0, respectively. The solution to the weights β = {βk}_{k=1}^∞ can be obtained by the stick-breaking process (SBP) which provides a distribution over infinite partitions of the unit interval, also called the GEM distribution β ∼ GEM(α0) [27].
Let θ1, θ2, . . . denote the parameters drawn from G for individual words w1, w2, . . . with multinomial distribution function, θi | G ∼ G and wi | θi ∼ p(wi | θi) = Mult(θi) for each i. According to the metaphor of the Chinese restaurant process
TABLE I
CONVENTION OF NOTATIONS

wd = {wdi} : words in document d
w−(di) : words of all documents except word wdi
θdi or θdji : multinomial parameter of word wdi or wdji
zdi = k : latent variable of word wdi at topic k
z−(di) : topics of all words in w except word wdi
φk : kth topic parameter
cd : tree path corresponding to document d
c−d : tree paths of all documents except document d
β = {βk} : global mixture weights or topic proportions in HDP
πd = {πdk} : topic proportions of document d over cd in HDP
α0 & γ : concentration or strength parameter of a DP
G0, H or Gd : base measure of a global DP or a DP for document d
nk : no. of customers or words sitting at table φk
wdj = {wdji} : words in sentence j of document d
wd(−j) : words of all sentences in wd except sentence j
ydj = l : latent variable of sentence j at theme l
yd(−j) : themes of all sentences in wd except sentence j
ψl : lth theme parameter
td = {tdj} : subtree branches for all sentences in document d
td(−j) : subtree branches for all sentences in wd except wdj
πl = {πlk} : topic proportions of theme l in snCRP
βd = {βdl} : theme proportions of document d over td in snCRP
γs & γw : strength parameters in sentence and word levels
V : vocabulary size
θt,l = {θt,l,v} : multinomial parameters of all words v in tuple (t, l)
η : shared Dirichlet prior parameter for θt,l
Ld : maximum no. of themes or nodes in subtree td
Kl : maximum no. of topics in theme l
L & K : total no. of themes and topics
βla & βlac : theme proportions for ancestor and child nodes
md,t : no. of sentences in document d choosing branch t
md(−j),lac : no. of sentences in wd(−j) allocated in node lac
nd,t,l,v : no. of word wdi = v in document d and tuple (t, l)
n−(di),l,v : no. of word wdi = v in w−(di) with theme l
Nd : total no. of words in document d
(CRP) [28], the conditional distribution of the current parameter θi given the previous parameters θ1, . . . , θi−1 is obtained by

θi | θ1, . . . , θi−1, α0, G0 ∼ Σ_{l=1}^{i−1} [1/(i−1+α0)] δθl + [α0/(i−1+α0)] G0
= Σ_{k=1}^{K} [nk/(i−1+α0)] δφk + [α0/(i−1+α0)] G0.   (3)

Here, φ1, . . . , φK denote the distinct values taken from the previous parameters θ1, . . . , θi−1 and nk is the number of customers or parameters θi′ who have been seated at table or have chosen value φk. If the base distribution is continuous, then each φk can correspond to a different table. However, if the base distribution is discrete and finite, φk corresponds to a distinct value, but not a table, because the draws from a discrete distribution can be repeated. Considering the continuous base measure, the ith customer sits at table φk with probability proportional to the number of customers nk who are already seated there, and sits at a new table with probability proportional to α0.
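As a concrete illustration of the seating rule in Eq. (3), the following Python sketch simulates CRP table assignments for a sequence of customers. It is a minimal illustration, not part of the paper; the function name and the random-seed handling are assumptions.

```python
import numpy as np

def crp_seating(num_customers, alpha0, seed=None):
    """Simulate Chinese restaurant process seating (Eq. 3).

    Customer i joins an existing table k with probability n_k / (i - 1 + alpha0)
    and opens a new table with probability alpha0 / (i - 1 + alpha0).
    Returns the table index chosen by each customer.
    """
    rng = np.random.default_rng(seed)
    counts = []          # n_k: number of customers per table
    assignments = []
    for i in range(1, num_customers + 1):
        probs = np.array(counts + [alpha0], dtype=float)
        probs /= (i - 1 + alpha0)
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)   # a new table is opened
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments

# Example: 20 customers with concentration alpha0 = 1.0
print(crp_seating(20, alpha0=1.0, seed=0))
```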
Fig. 2. Graphical representation for (a) the DP mixture model and (b) the HDP mixture model. Yellow and blue denote the variables in document and word levels, respectively.

Figure 2(a) displays the graphical representation of a DP mixture model where each word wi is generated by

φk | G0 ∼ G0, for each k
β | α0 ∼ GEM(α0)
zi = k | β ∼ β, for each i
wi | zi, {φk}_{k=1}^∞ ∼ Mult(θi = φzi), for each i   (4)

where the mixture component or latent topic zi = k is drawn from the topic proportions β which are determined by the strength parameter α0. Unsupervised learning of an infinite mixture representation is implemented.
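The generative view in Eqs. (2) and (4) can be mimicked with a truncated stick-breaking construction. The sketch below is a simplified illustration under an explicit truncation level and a symmetric Dirichlet base measure, both of which are assumptions for the example rather than choices made in the paper.

```python
import numpy as np

def stick_breaking(alpha0, truncation, rng):
    """Draw truncated GEM(alpha0) weights beta (Eq. 2)."""
    v = rng.beta(1.0, alpha0, size=truncation)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    beta = v * remaining
    beta[-1] = 1.0 - beta[:-1].sum()   # close the stick at the truncation level
    return beta

def sample_dp_mixture(num_words, vocab_size, alpha0, truncation=50, seed=None):
    """Generate words from a truncated DP mixture of multinomials (Eq. 4)."""
    rng = np.random.default_rng(seed)
    # Topic parameters phi_k ~ G0, here a symmetric Dirichlet base measure.
    phi = rng.dirichlet(np.ones(vocab_size), size=truncation)
    beta = stick_breaking(alpha0, truncation, rng)
    z = rng.choice(truncation, size=num_words, p=beta)            # z_i ~ beta
    words = np.array([rng.choice(vocab_size, p=phi[k]) for k in z])
    return z, words

z, words = sample_dp_mixture(num_words=100, vocab_size=20, alpha0=1.0, seed=1)
print("topics used:", np.unique(z))
```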
B. Hierarchical Dirichlet process
HDP [9] deals with the representation of documents or grouped data where each group is associated with a mixture model. Data in different groups share a global mixture model. Each document or data grouping wd is associated with a draw from a DP given probability measure Gd ∼ DP(α0, G0), which determines how much a member from a shared set of mixture components contributes to that data grouping. The base measure G0 is itself drawn from a global DP by G0 ∼ DP(γ, H) with strength parameter γ and base measure H, which ensures that there is a single set of discrete components shared across data. Each DP Gd governs the generation of the words wd = {wdi} or their multinomial parameters {θdi} for a document.
The global measure G0 and the individual measure Gd in HDP can be expressed by the mixture models with the shared atoms {φk}_{k=1}^∞ but different weights β = {βk}_{k=1}^∞ and πd = {πdk}_{k=1}^∞ given by

G0 = Σ_{k=1}^∞ βk δφk,  Gd = Σ_{k=1}^∞ πdk δφk, for each d,
where Σ_{k=1}^∞ βk = Σ_{k=1}^∞ πdk = 1.   (5)

The atom φk is drawn from the base measure H and the topic proportions β are drawn by SBP via β | γ ∼ GEM(γ). Note that the topic proportions πd are sampled from an independent DP Gd given β because the Gd's are independent given G0. We have πd ∼ DP(α0, β). Figure 2(b) shows the HDP mixture model which generates a word wdi in grouped data wd by

φk | H ∼ H, for each k
β | γ ∼ GEM(γ)
πd | α0, β ∼ DP(α0, β), for each d
zdi = k | πd ∼ πd, for each d and i
wdi | zdi, {φk}_{k=1}^∞ ∼ Mult(θdi = φzdi), for each d and i.   (6)
To implement this HDP, we can apply the stick-breaking construction to connect the relation between β and πd. We first conduct the stick-breaking construction by finding β = {βk} through a process of drawing beta variables

β′k ∼ Beta(1, γ),  βk = β′k ∏_{j=1}^{k−1} (1 − β′j).   (7)

Then, the stick-breaking construction for the probability measure πd of grouped data wd is performed to bridge the relation

π′dk ∼ Beta(α0βk, α0 Σ_{l=k+1}^∞ βl),  πdk = π′dk ∏_{j=1}^{k−1} (1 − π′dj)   (8)

where the beta variable π′dk at the kth draw is determined by the base measures of the two segments {βk, {βl}_{l=k+1}^∞}, which are scaled by the parameter α0. Basically, HDP does not involve learning topic hierarchies. Only a single level of data groupings, i.e., the document level, is modeled.
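A minimal sketch of the truncated stick-breaking construction in Eqs. (7) and (8) is given below, assuming a finite truncation level K; the variable names and the small numerical epsilon are illustrative assumptions, not part of the paper.

```python
import numpy as np

def gem_weights(gamma, K, rng):
    """Global weights beta via Eq. (7), truncated at K topics."""
    b = rng.beta(1.0, gamma, size=K)
    beta = b * np.concatenate(([1.0], np.cumprod(1.0 - b)[:-1]))
    beta[-1] = 1.0 - beta[:-1].sum()
    return beta

def document_weights(beta, alpha0, rng):
    """Document-level weights pi_d via Eq. (8), given the global beta."""
    # tail[k] = sum_{l > k} beta_l, the mass of the remaining segment.
    tail = np.concatenate((np.cumsum(beta[::-1])[::-1][1:], [0.0]))
    # Beta(alpha0 * beta_k, alpha0 * tail_k); epsilon avoids a zero parameter
    # at the truncation boundary (the last weight is renormalized anyway).
    pi_prime = rng.beta(alpha0 * beta, alpha0 * tail + 1e-12)
    pi = pi_prime * np.concatenate(([1.0], np.cumprod(1.0 - pi_prime)[:-1]))
    pi[-1] = 1.0 - pi[:-1].sum()
    return pi

rng = np.random.default_rng(0)
beta = gem_weights(gamma=1.0, K=10, rng=rng)
pi_d = document_weights(beta, alpha0=2.0, rng=rng)
print(np.round(beta, 3))
print(np.round(pi_d, 3))
```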
Fig. 3. An infinitely branching tree structure for representation of words and documents based on nCRP. Thick arrows denote a tree path cd drawn from nine words of a document wd. Here, we simply use the notation d for wd. The yellow rectangle and blue circles denote the document and words, respectively. Each word wdi is assigned a topic parameter φk at a tree node along cd with proportion or probability πdk.
C. The nested Chinese restaurant process
In many practical applications, it is desirable to represent text data based on different levels of aspects. The nCRP [10], [11] was proposed to infer the topic hierarchies and learn a deeply branching tree from a data collection. The resulting hierarchical LDA represents a text document wd by using a path cd of topics from a random tree consisting of the hierarchically-correlated topics {φk}_{k=1}^∞ from the global topic in the root node to the individual topics in the leaf nodes, as illustrated in Figure 3. Choosing a path of topics for a document is equivalent to selecting or visiting a chain of restaurants cd by a customer wd through cd ∼ nCRP(α0), which is controlled by a scaling parameter α0. Each word wdi in document wd is assigned a tree node φk or topic label zdi = k with probability or topic proportion πdk. The generative process for a set of text documents w = {wd | d = 1, . . . , D} based on nCRP is constructed as follows (a code sketch of the path drawing is given after the list):
1) For each node k in the infinite tree
   a) Draw a topic with parameter φk | H ∼ H.
2) For each document wd = {wdi | i = 1, . . . , Nd}
   a) Draw a tree path by cd ∼ nCRP(α0).
   b) Draw topic proportions over the layers of the tree path cd by the stick-breaking process πd | γ ∼ GEM(γ).
   c) For each word wdi
      i) Choose a layer or a topic by zdi = k | πd ∼ πd.
      ii) Choose a word based on topic zdi = k by wdi | zdi, cd, {φk}_{k=1}^∞ ∼ Mult(θdi = φcd(zdi)).
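To make the nested CRP concrete, the following sketch draws a fixed-depth tree path cd for each document by applying the CRP choice of Eq. (3) at every layer. The data structures and a fixed depth are hypothetical simplifications introduced for this example.

```python
import numpy as np

class NCRPNode:
    """A restaurant in the nested CRP: each occupied table points to a child restaurant."""
    def __init__(self):
        self.children = []      # child nodes (tables already occupied)
        self.counts = []        # number of documents that chose each child

def draw_ncrp_path(root, depth, alpha0, rng):
    """Draw a tree path c_d of the given depth for one document."""
    path, node = [root], root
    for _ in range(depth - 1):
        probs = np.array(node.counts + [alpha0], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(node.children):          # open a new branch (new table)
            node.children.append(NCRPNode())
            node.counts.append(0)
        node.counts[k] += 1
        node = node.children[k]
        path.append(node)
    return path

rng = np.random.default_rng(0)
root = NCRPNode()
paths = [draw_ncrp_path(root, depth=3, alpha0=1.0, rng=rng) for _ in range(5)]
print("branches under the root:", len(root.children))
```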
A Gibbs sampling algorithm is developed to sample the posterior tree path and word topic {cd, zdi} for different words in D documents w = {wd} = {wdi} according to the posterior probabilities of cd and zdi given w and the current values of all other latent variables, i.e., p(cd | c−d, w, z, α0, H) and p(zdi | w, z−(di), cd, γ, H). In these posterior probabilities, the notations c−d and z−(di) denote the latent variables c and z for all documents and words other than document d and word i, respectively. The sampling procedure is performed iteratively from the current state to a new state for D documents with Nd words in each document wd. In this procedure, the topic parameter φk in each node k of an infinite tree model is sampled from a prior measure φk | H ∼ H. The topic proportions πd of document wd are drawn by πd | γ ∼ GEM(γ) according to an SBP given a scaling parameter γ. Given the samples of tree path cd and topic zdi, the word wdi is distributed by using the multinomial parameter

θdi = φcd(zdi)   (9)

of topic zdi = k drawn from the topic proportions πd corresponding to tree path cd.
III. HIERARCHICAL THEME AND TOPIC MODEL
Although the topic hierarchies are explored in the topic model based on nCRP, only single-level data groupings, i.e., the document level, are considered in the generative process. The extension of text representation to different levels of data groupings is required to improve text modeling. In addition, a single tree path cd in nCRP may not sufficiently reflect the topic variations and theme ambiguities in heterogeneous documents. A flexible topic selection is required to compensate for model uncertainty. By conducting multiple-level unsupervised learning and flexible topic selection, we are able to upgrade system performance for document modeling. In this study, a hierarchical theme and topic model is proposed to conduct a kind of topical co-clustering [25], [26] over the sentence level and the word level so that one can cluster sentences while clustering words. The nCRP compound HDP is presented to implement the proposed model where the text modeling in word level, sentence level and document level is jointly performed. By referring to [29], a simplified tree-structured SBP is presented to draw a subtree td, which accommodates the theme and topic variations in document wd.
Fig. 4. An infinitely branching tree structure for representation of words, sentences and documents based on the nCRP compound HDP. Thick arrows denote the subtree branches td = {tdj} drawn for eight sentences {wdj | j = 1, . . . , 8} of a document wd. Here, we simply use the notations sj for wdj and d for wd. The yellow rectangle, green diamonds and blue circles denote the document, sentences and words, respectively. Each sentence wdj is assigned a theme parameter ψl at a tree node connected to the subtree branches with a document-dependent probability βdl while each word wdji in a tree node is assigned a topic parameter φk with a theme-dependent probability πlk.
A. The nCRP compound HDP
We propose an unsupervised structural learning approach to discover latent variables in different levels of data groupings. For the application of document representation, we aim to explore the latent themes from a set of sentences {wdj} and the latent topics from the corresponding set of words {wdji}. A text corpus w = {wd} = {wdj} = {wdji} is represented by using different levels of aspects. The proposed model is constructed by considering the structure of a document where each document consists of a “bag of sentences” and each sentence consists of a “bag of words”. Different from the topic model using HDP [9] and the hierarchical topic model using nCRP [10], [11], we develop a tree model to represent a “bag of sentences” and then represent the corresponding words allocated in individual tree nodes. A two-stage procedure is proposed for document modeling as illustrated in Figure 4.

In the first stage, each sentence wdj of a document wd is drawn from a mixture of theme model where the themes are shared for all sentences from a document collection w. The theme model of a document wd is composed of the themes under its corresponding subtree td for different sentences {wdj}, which are drawn by a sentence-based nCRP (snCRP) with a control parameter α0

td = {tdj} ∼ snCRP(α0)   (10)

where snCRP(·) is defined by considering the individual sentences {wdj} under a subtree td rather than using the individual words {wdi} with a single tree path cd as addressed for nCRP(·) in Section II-C. Notably, we consider a subtree td = {tdj} for thematic representation of sentences from a heterogeneous document wd. The proportions βd of theme parameters ψ = {ψl} over a subtree td are drawn according to a tree stick-breaking process (TSBP) which is simplified from the tree-structured stick-breaking process in [29]. The resulting distribution is called the treeGEM, which is expressed by

βd | γs ∼ treeGEM(γs)   (11)

where γs is a strength parameter. The distribution treeGEM(·) is derived through a TSBP which shall be described in Section III-D. With a tree structure of themes, the unsupervised grouping of sentences into different layers is obtained.
In the second stage, each word wdji in the set of sentences {wdj} allocated in tree node l is drawn by an individual mixture of topic model based on a DP. All topics from different nodes of a tree model are shared under a global topic model with parameters {φk} which are sampled by φk | H ∼ H. By treating the words in a tree node as the grouped data, the HDP is implemented for topical representation of the whole document collection w, which is composed of the grouped words in different tree nodes. We assume that the words of the sentences in tree node l are conditionally independent. These words are drawn from a topic model using topic parameters {φk}_{k=1}^∞. The sentences in a document given theme l are conditionally independent. These sentences are drawn from a theme model using theme parameters {ψl}_{l=1}^∞. The document-dependent theme proportions βd = {βdl} and the theme-dependent topic proportions πl = {πlk} are produced by a tree stick-breaking process βd | γs ∼ treeGEM(γs) and a standard stick-breaking process πl ∼ GEM(γw) where γs and γw denote the sentence-level and the word-level strength parameters, respectively.

Given the theme proportions βd and topic proportions πl, the probability measure of each sentence wdj is drawn from a document-dependent theme mixture model

Gs,d = Σ_{l=1}^∞ βdl δψl   (12)

while the probability measure of each word wdji in a tree node is drawn from a theme-dependent topic mixture model

Gw,l = Σ_{k=1}^∞ πlk δφk.   (13)

Since the theme for sentences is represented by a mixture model of topics for words in Eq. (13), we bridge the relation between the probability measures of themes {ψl} and topics {φk} via

ψl ∝ Σ_k πlk φk.   (14)
We implement the so-called nCRP compound HDP in a two-stage procedure and establish the hierarchical theme and topic model for document representation. The generative process for a set of documents in different levels of groupings is accordingly implemented. Having the sampled tree node ydj connected to the tree branches td and the associated topic zdi = k, the word wdji in sentence wdj is drawn by using the multinomial parameter

θdji = φtd(ydj, zdi).   (15)

The topic is determined by the topic proportions πl of theme ydj = l while the theme is determined by the theme proportions βd from a document wd. The generative process of this compound process is described as follows (a simplified code sketch follows the list):
1) For each node or theme l in the infinite tree
   a) For each topic k in a tree node
      i) Draw a topic with parameter φk | H ∼ H.
   b) Draw topic proportions by the stick-breaking process πl | γw ∼ GEM(γw).
   c) The theme model is constructed by ψl ∝ Σk πlk φk.
2) For each document wd = {wdj}
   a) Draw a subtree td = {tdj} ∼ snCRP(α0).
   b) Draw theme proportions over a subtree td in different layers by the tree stick-breaking process βd | γs ∼ treeGEM(γs).
   c) For each sentence wdj = {wdji}
      i) Choose a theme ydj = l | βd ∼ βd.
      ii) For each word wdji or simply wdi
         a. Choose a topic by zdi = k | πl ∼ πl.
         b. Choose a word based on topic zdi = k by wdji | zdi, ydj, td, {φk}_{k=1}^∞ ∼ Mult(θdji = φtd(ydj, zdi)).
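The sketch below illustrates the document-level part of this compound process in a finite, truncated form: a fixed set of L themes and K topics stands in for the infinite tree, and a plain Dirichlet draw stands in for the treeGEM theme proportions. These simplifications are assumptions made for the example only.

```python
import numpy as np

def generate_document(num_sentences, words_per_sentence, theme_topic_props,
                      topics, theme_props, rng):
    """Simplified finite version of the compound generative process.

    theme_props       : beta_d, proportions over L theme nodes (step 2b)
    theme_topic_props : pi_l, an (L x K) matrix of topic proportions per theme (step 1b)
    topics            : phi_k, a (K x V) matrix of word distributions (step 1a)
    """
    sentences = []
    for _ in range(num_sentences):
        l = rng.choice(len(theme_props), p=theme_props)              # y_dj ~ beta_d
        words = []
        for _ in range(words_per_sentence):
            k = rng.choice(topics.shape[0], p=theme_topic_props[l])  # z_di ~ pi_l
            words.append(rng.choice(topics.shape[1], p=topics[k]))   # w_dji ~ Mult(phi_k)
        sentences.append((l, words))
    return sentences

rng = np.random.default_rng(0)
L, K, V = 4, 6, 30                              # themes, topics, vocabulary size (toy values)
phi = rng.dirichlet(np.ones(V), size=K)         # step 1a: topics
pi = rng.dirichlet(np.ones(K), size=L)          # step 1b: per-theme topic proportions
beta_d = rng.dirichlet(np.ones(L))              # stand-in for treeGEM theme proportions
doc = generate_document(5, 8, pi, phi, beta_d, rng)
for theme, words in doc:
    print("theme", theme, "words", words)
```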
B. The nCRP for thematic sentences
We implement the snCRP and build an infinitely branching tree model which discovers latent themes from different sentences of a text corpus. The root node in this tree contains the general theme while the nodes in the leaf layer convey the specific themes. The hierarchical clustering of sentences is realized. The snCRP is developed to represent the sentences of a document wd = {wdj} based on the themes, which come from a subtree td = {tdj}. The ambiguity and uncertainty of themes existing in a heterogeneous document can thereby be compensated. This is different from the conventional word-based nCRP [10], [11] where only the topics along a single tree path cd are selected to represent different words in a document wd = {wdi}. The word-based nCRP using the GEM distribution for topic proportions πd ∼ GEM(γ) is now extended to the snCRP using the treeGEM distribution for theme proportions βd ∼ treeGEM(γs) by considering a subtree td.

The metaphor of snCRP is described as shown in Figure 4. There is an infinite number of Chinese restaurants in a city. Each restaurant has infinite tables. The tourist wdj of a tour group wd = {wdj} visits the first restaurant in the root node ψ1 where each of its tables has a card showing the next restaurant, which is arranged in the second layer consisting of {ψ2, ψ4, . . .}. Such visits repeat infinitely. Each restaurant is associated with a tree layer. The restaurants in a city are organized into an infinitely-branched and infinitely-deep tree structure. The tourists or sentences wdj from different tour groups or documents wd, who are closely related, shall visit the same restaurant and sit at the same table. The hierarchical grouping of sentences is therefore obtained by a nonparametric tree model based on this snCRP. Each tree node stands for a theme. Each tourist or sentence wdj is modeled by the multinomial parameter ψl of a theme l from a subtree or a chain of restaurants td. The thematic sentences allocated in tree nodes with high probabilities could be selected for document summarization.
Fig. 5. Graphical representation for the hierarchical theme and topic model. Yellow, green and blue denote the variables in document, sentence and word levels, respectively.
C. HDP for topical words
After having the hierarchical grouping of sentences, we treat the words corresponding to a tree node l as the grouped data and conduct the HDP by using the grouped data in different tree nodes. The grouped words from different documents are recognized under the same theme ψl. Such tree-based HDP is different from the document-based HDP [9] which treats documents as the grouped data for text representation. An organized topic model is constructed to draw words for individual themes. The theme-dependent topical words are learned from the tree-based HDP. According to the combination of snCRP and tree-based HDP, the theme parameter ψl is represented by a mixture model of topic parameters {φk}_{k=1}^∞ where the theme-dependent topic proportions {πlk}_{k=1}^∞ are used as the mixture weights as given in Eq. (14). The tree-based HDP is developed to infer the topic proportions πl based on a standard SBP. The tree-based multinomial parameter is inferred to determine the word distribution through

φk | H ∼ H, for each k
πl | γw ∼ GEM(γw),  ψl ∝ Σ_k πlk φk, for each l
td ∼ snCRP(α0),  βd | γs ∼ treeGEM(γs), for each d
ydj = l | βd ∼ βd, for each d and j
zdi = k | πl ∼ πl, for each d and i
wdji | zdi, ydj, td, {φk}_{k=1}^∞ ∼ Mult(φtd(ydj, zdi)), for each d, j and i.   (16)
A compound process of snCRP and HDP is fulfilled to build the hierarchical theme and topic model as illustrated in the graphical representation of Figure 5. Each word wdji is drawn by using the topic parameter θdji = φtd(ydj, zdi) which is controlled by the word-level topic zdi. This topic is allocated in the theme or tree node ydj of the subtree td which is selected from sentence wdj. The topical words in different tree nodes with high probability are selected. Such a process could be extended to represent multiple-level data groupings including words, paragraphs, documents, streams and corpora.
Fig. 6. Illustration for conducting (a) the tree stick-breaking process, and finding (b) the hierarchical theme proportions for an infinitely branching tree structure, which meets the constraint of unit length in the estimated theme proportions or in the treeGEM distribution. The dashed arrows and circles represent the future stick-breaking process. The stick-breaking process of a parent node la to its child nodes c = {la0, la1, la2, . . .} is shown in (c). The theme proportion of the first child node of a parent node la is denoted by βla0. After stick-breaking for a parent node la and its child nodes c, we have the replacement βla ← βla0. This example illustrates how the three-layer proportions {β1, β11, β12, β111, β121} or {β10, β110, β120, β111, β121} in Figure 4 are drawn to share a unit-length stick.
D. Tree stick-breaking process
In the implementation, a tree stick-breaking process is incorporated to draw a subtree [30] and determine the theme proportions for representation of sentences in a heterogeneous document wd = {wdj}. The traditional GEM distribution is not suitable to characterize the tree structure with dependencies between brother nodes and those between parent node and child nodes. The snCRP is combined with a TSBP, which is developed to draw theme proportions βd from a document wd based on the treeGEM distribution subject to the constraint of multinomial parameters Σ_{l=1}^∞ βdl = 1 for all nodes in subtree td. A variety of aspects from different sentences are revealed through βd.

As illustrated in Figures 6(a) and 6(c), we draw the theme proportions for an ancestor node and its child nodes {la, c = {la0, la1, la2, . . .}} that are connected between two layers by arrows. The theme proportion βla0 in the child node denotes the initial segment which is succeeded from an ancestor node la. The stick-breaking process is performed for the coming child nodes {la1, la2, . . .}. Given the treeGEM parameter γs, we sample a beta variable

β′lac ∼ Beta(1, γs) = [Γ(1 + γs) / (Γ(1)Γ(γs))] (1 − β′lac)^{γs−1}   (17)

for a child node lac ∈ c. The probability or theme proportion of generating a child node lac is calculated by

βlac = βla β′lac ∏_{j=1}^{c−1} (1 − β′laj).   (18)

In this calculation, the theme proportion βlac of a child node is determined from an initial proportion of the ancestor node βla, which is continuously chopped according to the draws of beta variables from the existing child nodes {β′laj | j = 1, . . . , c−1} to the new child node β′lac. Each child node can be further treated as an ancestor node for a future stick-breaking process to generate the grandchild nodes. Eq. (18) is recursively applied to find the theme proportions βla and βlac for different nodes in subtree td. The theme proportions of a parent node and its child nodes satisfy the condition

βla = βla0 + βla1 + βla2 + · · · .   (19)

Tree stick-breaking is run for each set of nodes {la, c}. After stick-breaking of a parent node la and its child nodes c, the theme proportion of the parent node is replaced by βla ← βla0. Figure 6(b) shows how the three-layer theme proportions {β1, β11, β12, β111, β121} in Figure 4 are inferred through the TSBP.
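A small sketch of the tree stick-breaking recursion of Eqs. (17)-(19) over a fixed finite tree is given below. The infinite tree is truncated here by giving any leftover mass to the last child of each parent, and the tree layout is a hypothetical toy example, so this is an illustration rather than the paper's exact construction.

```python
import numpy as np

def tree_stick_breaking(tree, gamma_s, rng, root_mass=1.0):
    """Assign theme proportions to a finite tree via Eqs. (17)-(19).

    `tree` maps a node name to the list of its child names. The mass beta_la of a
    parent is split into a self-portion beta_la0 and its children, so that all
    proportions in the subtree sum to the parent mass (Eq. 19).
    """
    proportions = {}

    def split(node, mass):
        children = tree.get(node, [])
        if not children:                      # leaf: keep all remaining mass
            proportions[node] = mass
            return
        b0 = rng.beta(1.0, gamma_s)           # Eq. (17): self-portion beta_la0
        proportions[node] = mass * b0
        remaining = mass * (1.0 - b0)
        for i, child in enumerate(children):
            if i < len(children) - 1:
                b = rng.beta(1.0, gamma_s)    # Eq. (17)
                child_mass = remaining * b    # Eq. (18)
                remaining *= (1.0 - b)
            else:
                child_mass = remaining        # truncation: leftover goes to the last child
            split(child, child_mass)

    split("root", root_mass)
    return proportions

rng = np.random.default_rng(0)
tree = {"root": ["n1", "n2"], "n1": ["n11"], "n2": ["n21", "n22"]}
props = tree_stick_breaking(tree, gamma_s=1.85, rng=rng)
print({k: round(v, 3) for k, v in props.items()})
print("total mass:", round(sum(props.values()), 3))   # sums to 1.0
```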
We accordingly infer the theme proportions βd for a document wd and meet the constraint that the summation over the theme proportions of all parent and child nodes has unit length. The inference βd ∼ treeGEM(γs) is completed. Therefore, a stick of unit length is partitioned at random locations. The random theme proportions βd are used to calculate the posterior probability for drawing theme ydj = l for a sentence wdj. In this study, we adopt a single beta parameter γs for the TSBP towards depth as well as branch. This solution is a simplified realization of the TSBP in [29] where two separate beta parameters {γsd, γsb} are adopted to conduct stick-breaking for depth and branch, βd ∼ treeGEM(γsd, γsb). The infinite version of the Dirichlet tree distribution used in [31] could be adopted as the conjugate prior over the tree multinomials βd.
IV. BAYES IA N INF ER EN CE
The approximate Bayesian inference using Gibbs sampling is developed to infer the posterior parameters or latent variables to implement the nCRP compound HDP. The latent variables consist of the subtree branch tdj, sentence-level theme ydj and word-level topic zdi for each word wdi and sentence wdj in a text corpus w. Each latent variable is iteratively sampled from the posterior probability of this variable given the observed data w and all the other variables. The sampling of latent variables is performed in an MCMC procedure. In this procedure, we sample a subtree td = {tdj} for a document wd via snCRP. Each sentence wdj is grouped into a tree node or a document-dependent theme ydj = l under a subtree td. Each word wdi of a sentence is then assigned the theme-dependent topic zdi = k which is sampled according to a tree-based HDP. In what follows, we address the calculation of the posterior probabilities for drawing tree branch tdj, theme label ydj and topic label zdi.
A. Sampling for document-dependent subtree branches
A document wd is seen as “a bag of sentences” for sampling a subtree td. We iteratively sample a tree branch or choose a table tdj for sentence wdj by using the posterior probability

p(tdj = t | td(−j), w, yd, α0, η) ∝ p(tdj = t | td(−j), α0) × p(wdj | wd(−j), td, yd, η)   (20)

where yd = {ydj, yd(−j)} denotes the set of theme labels for sentence wdj and all the remaining sentences wd(−j) of document wd, and td(−j) denotes the branches for document wd excluding those of sentence wdj. In Eq. (20), the snCRP parameter α0 and the Dirichlet prior parameter η provide control over the size of the inferred tree. The first term p(tdj = t | td(−j), α0) provides the prior on choosing a branch t for a sentence wdj according to the sentence-based nested CRP with the following metaphor. On a culinary vacation, the tourist or sentence wdj enters a restaurant and chooses a table or branch tdj = t using the CRP probability

p(tdj = t | td(−j), α0) = md,t / (md − 1 + α0)  if an occupied table t is chosen,
                         = α0 / (md − 1 + α0)   if a new table is chosen   (21)

where md,t denotes the number of tourists or sentences in document wd who are seated at table t and md denotes the total number of sentences in wd. On the next evening, this tourist goes to the restaurant in the next layer which is identified by the card on the table selected in the current evening. Eq. (21) is applied again to determine whether a new branch at this layer is drawn or not. The prior of a subtree td = {tdj} is accordingly determined in a nested fashion over different layers from different sentences in a document wd = {wdj}.
The second term p(wdj | wd(−j), td, yd, η) is calculated by considering the marginal likelihood over the multinomial parameters θt,l = {θt,l,v}_{v=1}^V of a node ydj = l connected to tree branch tdj = t for totally V words in the dictionary, where θt,l,v = p(wdi = v | tdj = t, ydj = l). The parameters θt,l in a tuple (t, l) are assumed to be Dirichlet distributed with a shared prior parameter η

p(θt,l | η) = [Γ(Σ_{v=1}^V η) / ∏_{v=1}^V Γ(η)] ∏_{v=1}^V θt,l,v^{η−1}.   (22)

We can derive the likelihood function of a sentence wdj given a tree branch t and the other sentences wd(−j) of wd as

p(wdj | wd(−j), td, yd, η) = p(wd | tdj = t, yd, η) / p(wd(−j) | tdj = t, yd, η)
= ∏_{l=1}^{Ld} [ Γ(Σ_{v=1}^V nd(−j),t,l,v + Vη) / ∏_{v=1}^V Γ(nd(−j),t,l,v + η) ] · [ ∏_{v=1}^V Γ(nd,t,l,v + η) / Γ(Σ_{v=1}^V nd,t,l,v + Vη) ]   (23)

where Ld denotes the maximum number of nodes in subtree td, nd,t,l,v denotes the number of words wdi = v in document wd which are allocated in tuple (t, l) and nd(−j),t,l,v denotes the number of words wdi = v of wd except sentence wdj which are allocated in tuple (t, l). Eq. (23) is obtained since the marginal likelihood is derived by

p(wd | t, yd, η) = ∫ p(wd | t, yd, θt,l) p(θt,l | η) dθt,l
= [Γ(Vη) / (Γ(η))^V] ∫ ( ∏_{l=1}^{Ld} ∏_{v=1}^V θt,l,v^{nd,t,l,v + η − 1} ) dθt,l
∝ ∏_{l=1}^{Ld} [ ∏_{v=1}^V Γ(nd,t,l,v + η) / Γ(Σ_{v=1}^V nd,t,l,v + Vη) ]   (24)

which is found by arranging an integral over a posterior Dirichlet distribution of θt,l with parameters {nd,t,l,v + η}_{v=1}^V.

Intuitively, a model with larger α0 will tend to a tree with more branches. A smaller η encourages fewer words in each tree node. The joint effect of large α0 and small η in the posterior probability will result in the circumstance that more themes are required to explain the data. Using snCRP, we use sentence wdj to find the corresponding branch tdj. The subtree branches td = {tdj} are drawn to accommodate the thematic variations in the heterogeneous document wd.
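For numerical stability, the ratio of Gamma functions in Eq. (23) is usually evaluated in log space. The following sketch illustrates this, assuming the counts n_{d,t,l,v} and n_{d(-j),t,l,v} are already available as arrays; the function name and the toy counts are hypothetical.

```python
import numpy as np
from scipy.special import gammaln

def log_sentence_likelihood(n_with, n_without, eta):
    """Log of Eq. (23): p(w_dj | w_d(-j), t_d, y_d, eta).

    n_with    : (L_d x V) counts n_{d,t,l,v} including sentence j
    n_without : (L_d x V) counts n_{d(-j),t,l,v} excluding sentence j
    """
    V = n_with.shape[1]
    log_p = 0.0
    for nw, nwo in zip(n_with, n_without):       # loop over nodes l in the subtree
        log_p += gammaln(nwo.sum() + V * eta) - gammaln(nw.sum() + V * eta)
        log_p += np.sum(gammaln(nw + eta) - gammaln(nwo + eta))
    return log_p

# Toy example: 2 nodes, vocabulary of 5 words, two extra tokens from sentence j.
n_without = np.array([[2, 0, 1, 0, 0], [0, 3, 0, 1, 0]], dtype=float)
n_with = n_without.copy()
n_with[0, 2] += 2                                # sentence j adds two tokens of word v=2 in node 0
print(log_sentence_likelihood(n_with, n_without, eta=0.05))
```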
B. Sampling for document-dependent themes
After finding the subtree branches td = {tdj} for the sentences in a document wd = {wdj} via snCRP, we sample the document-dependent theme label ydj = l for each sentence wdj according to the posterior probability of the latent variable ydj given document wd and the current values of all the other latent variables

p(ydj = l | wd, yd(−j), td, γs, η) ∝ p(ydj = l | yd(−j), td, γs) × p(wdj | wd(−j), yd, td, η).   (25)
The first term represents the distribution of the theme proportion of ydj = l or lac in a subtree td given the themes of the other sentences yd(−j). This distribution is calculated as an expectation over the treeGEM distribution which is determined by the TSBP. Considering the draw of the theme proportion for a child node lac given those for the parent node la and the preceding child nodes {la1, . . . , la(c−1)} in Eq. (18), we derive the first term based on a subtree structure td

p(ydj = lac | yd(−j), γs) = E[βla | yd(−j), γs] × E[β′lac | yd(−j), γs] ∏_{j=1}^{c−1} E[1 − β′laj | yd(−j), γs]
= p(ydj = la | yd(−j), γs) · (1 + md(−j),lac) / (1 + γs + Σ_{u=c}^{Ld} md(−j),lau) × ∏_{j=1}^{c−1} (γs + Σ_{u=j+1}^{Ld} md(−j),lau) / (1 + γs + Σ_{u=j}^{Ld} md(−j),lau)   (26)

which is a product of the expectations of the theme proportion of the parent node βla, the beta variable for the proportion of the current child node β′lac and the beta variables for the remaining proportions of the preceding child nodes {β′la1, . . . , β′la(c−1)}.

In Eq. (26), md(−j),lac denotes the number of sentences in wd(−j) which are allocated in node lac. The treeGEM parameter γs reflects a kind of pseudo-count of the observations for the proportion. The expectation over the beta variable E[β′ | yd(−j), γs] is derived by

∫ β′ p(β′ | yd(−j), γs) dβ′ = ∫ β′ p(β′ | γs) p(yd(−j) | β′) dβ′
∝ ∫ β′ (1 − β′)^{γs−1} (β′)^{md(−j),lac} (1 − β′)^{Σ_{n=c+1}^{Ld} md(−j),lan} dβ′
= (1 + md(−j),lac) / (1 + md(−j),lac + γs + Σ_{n=c+1}^{Ld} md(−j),lan)   (27)

which is obtained as the mean of the posterior beta variable p(β′ | yd(−j), γs) with parameters 1 + md(−j),lac and γs + Σ_{n=c+1}^{Ld} md(−j),lan.
On the other hand, the second term of Eq. (25) for sampling a theme is the same as that of Eq. (20) for sampling a subtree branch. This term p(wdj | wd(−j), yd, td, η) has been derived in Eq. (23). There are twofold relations in the posterior probabilities between nCRP and snCRP. First, we treat the sentence for snCRP as if it were the word for nCRP. Second, the calculation of Eq. (26) is done recursively for each set consisting of a parent node and its child nodes. Using nCRP, this calculation is performed by treating tree nodes in different layers at a flat level without considering the subtree structure.

In general, snCRP is completed by first choosing a subtree td for the sentences in a document wd and then assigning each sentence to one of the nodes in the chosen subtree according to the theme proportions βd drawn from the treeGEM. In the generation of the theme assignment ydj = l, the node l assigned to a sentence wdj might be connected to the branch tdj′ of a subtree td drawn by another sentence wdj′ from the same document wd. The variety of themes in a heterogeneous document wd is reflected by the subtree td which is drawn via snCRP. A whole infinite tree is accordingly established and shared for all documents in a corpus w.
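The prior term in Eq. (26) only needs the sentence counts of sibling nodes. The sketch below evaluates this expectation for the children of a single parent node, assuming the parent probability and the sibling counts are given; the names and the toy numbers are illustrative, not taken from the paper.

```python
import numpy as np

def child_theme_priors(parent_prob, sibling_counts, gamma_s):
    """Evaluate Eq. (26) for every child of one parent node.

    parent_prob    : p(y_dj = l_a | y_d(-j), gamma_s), prior mass of the parent node
    sibling_counts : m_{d(-j), l_ac} for the child nodes c = 1, ..., C
    Returns an array of prior probabilities for choosing each child node.
    """
    m = np.asarray(sibling_counts, dtype=float)
    C = len(m)
    priors = np.empty(C)
    for c in range(C):
        # Expectation of the beta variable for the current child (Eq. 27).
        term = (1.0 + m[c]) / (1.0 + gamma_s + m[c:].sum())
        # Expectations of (1 - beta') for the preceding children.
        for j in range(c):
            term *= (gamma_s + m[j + 1:].sum()) / (1.0 + gamma_s + m[j:].sum())
        priors[c] = parent_prob * term
    return priors

# Toy example: a parent node with prior mass 0.6 and three children.
print(np.round(child_theme_priors(0.6, sibling_counts=[4, 2, 0], gamma_s=1.85), 4))
```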
C. Sampling for theme-dependent topics
Finally, we implement the tree-based HDP by sampling the theme-dependent topic zdi = k for each word wdji or wdi under a theme ydj = l based on the posterior probability

p(zdi = k | w, z−(di), ydj = l, γw, η) ∝ p(zdi = k | z−(di), ydj = l, γw) × p(wdi | w−(di), z, ydj = l, η)   (28)

where z = {zdi, z−(di)} denotes the topic labels of word wdi and the remaining words w−(di) of a corpus w.

In Eq. (28), the first term indicates the distribution of the topic proportion of zdi = k which is calculated as an expectation over πl ∼ GEM(γw). This calculation is done via a word-level SBP with a control parameter γw. By drawing the beta variable and calculating the topic proportion similar to Eqs. (17) and (18), we derive the first term in the form of

p(zdi = k | z−(di), ydj = l, γw) = (1 + n−(di),l,k) / (1 + γw + Σ_{u=k}^{Kl} n−(di),l,u) × ∏_{j=1}^{k−1} (γw + Σ_{u=j+1}^{Kl} n−(di),l,u) / (1 + γw + Σ_{u=j}^{Kl} n−(di),l,u)   (29)

where Kl denotes the maximum number of topics in a theme l and n−(di),l,k denotes the number of words in w−(di) which are allocated in topic k of theme l. The words of the sentences in a node with theme l are treated as the grouped data to carry out the tree-based HDP. The terms in Eq. (29) are obtained as the expectations over the beta variable for the topic proportion of the current break πlk and the remaining proportions of the preceding breaks {πl1, . . . , πl(k−1)}.

The second term of Eq. (28) calculates the probability of generating the word wdi = v given w−(di) and the current topic variables z in theme l as expressed by

p(wdi = v | w−(di), z, ydj = l, η) = (n−(di),l,v + η) / (Σ_{v=1}^V n−(di),l,v + Vη)   (30)

where n−(di),l,v denotes the number of words wdi = v in w−(di) which are allocated in theme l.
Given the current state of the sampler, we iteratively sample each latent variable conditioned on the whole observations and the remaining variables. Given a text corpus w, we sequentially sample the subtree branch tdj = t and the theme ydj = l for each individual sentence wdj of document wd via the snCRP. After having a sentence-based tree structure, we sequentially sample the theme-dependent topic zdi = k for each individual word wdi in a tree node based on the tree-based HDP. These samples are iteratively employed to update the corresponding posterior probabilities p(tdj = t | td(−j), w, yd, α0, η), p(ydj = l | wd, yd(−j), td, γs, η) and p(zdi = k | w, z−(di), ydj = l, γw, η) in the Gibbs sampling procedure. The true posteriors are approximated by running sufficient sampling iterations. The hierarchical theme and topic model is established by fulfilling the nCRP compound HDP.
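At a high level, the Gibbs sampler alternates the three draws derived above. The following skeleton shows the control flow only; sample_branch, sample_theme and sample_topic are placeholder functions standing in for the evaluations of Eqs. (20), (25) and (28), and are not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder samplers: in the full model these evaluate Eqs. (20), (25) and (28).
def sample_branch(doc, j, state):          # draw t_dj (subtree branch) for sentence j
    return rng.integers(0, 3)

def sample_theme(doc, j, state):           # draw y_dj (theme node) for sentence j
    return rng.integers(0, 4)

def sample_topic(doc, j, i, state):        # draw z_di (topic) for word i
    return rng.integers(0, 6)

def gibbs_sweep(corpus, state):
    """One Gibbs sweep over all documents, sentences and words."""
    for d, doc in enumerate(corpus):
        for j, sentence in enumerate(doc):
            state["t"][d][j] = sample_branch(doc, j, state)           # Eq. (20)
            state["y"][d][j] = sample_theme(doc, j, state)            # Eq. (25)
            for i, _ in enumerate(sentence):
                state["z"][d][j][i] = sample_topic(doc, j, i, state)  # Eq. (28)
    return state

# Toy corpus: 2 documents, each a list of sentences (lists of word ids).
corpus = [[[1, 4, 2], [3, 3]], [[0, 2], [5, 1, 1]]]
state = {"t": [[0] * len(doc) for doc in corpus],
         "y": [[0] * len(doc) for doc in corpus],
         "z": [[[0] * len(s) for s in doc] for doc in corpus]}
for sweep in range(5):
    state = gibbs_sweep(corpus, state)
print(state["y"])
```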
V. EXPERIMENTS
A. Experimental setup
The experiments were performed by using four public-domain corpora: WSJ (Wall Street Journal) 1987-1992, AP (Associated Press) 1988-1990, NIPS (http://arbylon.net/resources.html and https://archive.ics.uci.edu/ml/datasets/Bag+of+Words) and DUC (Document Understanding Conference) 2007 (http://duc.nist.gov). The corpora of WSJ, AP and NIPS contained news documents and conference papers which were applied for evaluation of document representation. The perplexity of the test documents wtest = {wd | d = 1, . . . , D} was calculated by

Perplexity(wtest) = exp(−Σ_{d=1}^D log p(wd) / Σ_{d=1}^D Nd).   (31)
We wish to achieve a better model with lower perplexity on test documents. To investigate the topic coherence of the estimated topic models, we further calculated the pointwise mutual information (PMI) [32]

PMI(wtest) = (1/45) Σ_{i<j} log [ p(wi, wj) / (p(wi) p(wj)) ]   (32)

which is averaged over all pairs of words in the list of the top ten words, i.e., i, j ∈ {1, . . . , 10}, in the estimated topics.
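Eqs. (31) and (32) can be computed directly once held-out log-likelihoods and word co-occurrence probabilities are available. The sketch below assumes these quantities have already been estimated and simply applies the formulas; all numbers are made-up toy values.

```python
import numpy as np
from itertools import combinations

def perplexity(log_likelihoods, doc_lengths):
    """Eq. (31): exp(-sum_d log p(w_d) / sum_d N_d)."""
    return np.exp(-np.sum(log_likelihoods) / np.sum(doc_lengths))

def topic_pmi(top_words, p_single, p_pair):
    """Eq. (32): average PMI over all pairs of a topic's top words (45 pairs for ten words).

    p_single : dict word -> marginal probability p(w)
    p_pair   : dict frozenset({w_i, w_j}) -> joint probability p(w_i, w_j)
    """
    pairs = list(combinations(top_words, 2))
    scores = [np.log(p_pair[frozenset(pair)] / (p_single[pair[0]] * p_single[pair[1]]))
              for pair in pairs]
    return np.mean(scores)

# Toy example with made-up probabilities for a three-word "topic" (3 pairs).
p_single = {"bank": 0.02, "stock": 0.03, "market": 0.05}
p_pair = {frozenset({"bank", "stock"}): 0.002,
          frozenset({"bank", "market"}): 0.0015,
          frozenset({"stock", "market"}): 0.004}
print(topic_pmi(["bank", "stock", "market"], p_single, p_pair))
print(perplexity(log_likelihoods=[-5200.0, -4800.0], doc_lengths=[800, 750]))
```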
In WSJ, we chose 1,085 documents with 18,323 sentences and
203,731 words for model training and the other 120 documents
as test data for evaluation. This dataset had 4,999 unique
words. In AP, we used 1,211 documents with 19,109 sentences
and 198,420 words for model training and the other 130
documents for testing. This dataset had 5,183 unique words. In
NIPS training set, we collected 2,500 papers totally including
310,553 sentences and 3,026,153 words with a vocabulary
of 10,328 words. The other 250 papers with 290,345 words
were sampled to form the test set. The scale of NIPS corpus
is much larger than that of WSJ, AP and DUC. In DUC
corpus, there were 1,680 documents consisting of 22,961
sentences and 18,696 unique words. This corpus provided
the reference summary for individual document, which was
manually written for evaluation of document summarization.
The automatic summary for DUC was limited to 250 words
at most. The NIST evaluation tool, ROUGE (Recall-Oriented
Understudy for Gisting Evaluation) at http://berouge.com, was
adopted. ROUGE-1 was used to measure the matched uni-
grams between reference summary and automatic summary
in terms of recall, precision and F-measure. In these four datasets, we held out 10% of the training data as the validation data for selection of the hyperparameters α0, η, γs and γw, and trained the models on the remaining 90% of the data. Stop words were removed. Hyperparameters were individually selected for different datasets based on perplexity. The same selection was employed to choose the number of topics K in LDA and the numbers of themes L and topics K in sLDA. Fivefold cross validation was performed. We evaluated the computation time in seconds by using different methods over a personal computer equipped with an Intel(R) Core(TM) i7-4930K CPU at 3.4 GHz with 6 cores and 64 GB of RAM.
TABLE II
EVOLUTION FROM THE PARAMETRIC TO THE NONPARAMETRIC, FROM THE NON-HIERARCHICAL TO THE HIERARCHICAL, AND FROM THE WORD-BASED TO THE SENTENCE-BASED CLUSTERING MODELS.

                            Parametric          Nonparametric       Nonparametric
                            Non-Hierarchical    Non-Hierarchical    Hierarchical
                            Model               Model               Model
Word-Based Clustering       LDA                 HDP                 nCRP
Sentence-Based Clustering   sLDA                sHDP                snCRP
For comparative study, we implemented the parametric
topic models including LDA [2], sentence-based LDA (sLDA)
[7], and the nonparametric topic models including HDP [9],
sentence-based HDP (sHDP), nCRP [11] and the proposed
sentence-based nCRP compound HDP (simply denoted by
snCRP hereafter). These methods could also be grouped into the categories of non-hierarchical models (LDA, sLDA, HDP and sHDP) and hierarchical models (nCRP and snCRP) as shown in Table II. The sentence-based models based on sLDA, sHDP and snCRP were implemented with additional information of sentence boundaries. In [7], sLDA was implemented by representing each word wdji in document wd based on the theme-dependent topic zdi = k where the theme ydj = l was learned from a bag of sentences in a text corpus w = {wdj}.
The words were drawn from the theme-dependent topic model.
Using sLDA, the numbers of themes and topics were fixed
and the theme hierarchy was not considered. In addition, the
sHDP is an extension of sLDA by conducting BNP inference,
but sHDP is a simplification of snCRP without involving the
theme hierarchy. The sHDP is proposed in this study and
implemented for comparison. Basically, the GEM distribution
and the treeGEM distribution are applied to characterize
the distributions of theme proportions for sHDP and snCRP,
respectively.
We also examine the effect of topic and theme hierarchies in
document modeling based on nCRP and snCRP by comparing
the performance of standard stick-breaking process (SBP), the
tree stick-breaking process (TSBP1) in Section III-D and the
TSBP in [29] (TSBP2). For each document wd, SBP selects a single tree path cd [11] while TSBP1 and TSBP2 find the subtree branches td. This is the first work in which TSBP1 and TSBP2 are developed to explore the thematic hierarchy from heterogeneous sentences and documents.
In evaluation of document summarization, we carried out
the vector-space model (VSM) in [33] and the sentence-
based models using sLDA, sHDP and snCRP which performed
sentence clustering in different ways. The sLDA conducted
parametric and non-hierarchical clustering while the sHDP
executed nonparametric and non-hierarchical clustering. Only
snCRP performed nonparametric and hierarchical clustering.
The Kullback-Leibler (KL) divergence between document
model and sentence model was calculated. The thematic
sentences with the smallest KL divergence were selected.
Considering the tree structure by using snCRP, we investigated
four methods for sentence selection. The snCRP-root and
snCRP-leaf selected the thematic sentences allocated in root
node and leaf nodes, respectively. The snCRP-path selected representative sentences of a document only from the nodes along the most frequently-visited path. The snCRP-MMR selected sentences from all possible branches td by applying the maximal marginal relevance (MMR) [22].
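Sentence selection by KL divergence can be sketched as follows: each sentence and the document are represented by theme (or topic) distributions, and the sentences with the smallest KL divergence to the document distribution are ranked first. The distributions below are toy inputs, not the model's actual posteriors.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def select_sentences(doc_dist, sentence_dists, num_select):
    """Rank sentences by KL divergence to the document model and pick the closest."""
    scores = [kl_divergence(doc_dist, s) for s in sentence_dists]
    order = np.argsort(scores)                      # smallest divergence first
    return order[:num_select], scores

doc_dist = [0.5, 0.3, 0.2]                          # document-level theme proportions (toy)
sentence_dists = [[0.6, 0.3, 0.1],                  # per-sentence theme posteriors (toy)
                  [0.1, 0.1, 0.8],
                  [0.45, 0.35, 0.2]]
chosen, scores = select_sentences(doc_dist, sentence_dists, num_select=2)
print("selected sentence indices:", chosen.tolist())
```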
For simplicity, we constrained the tree growing in nCRP and snCRP to three layers in our experiments. The Dirichlet prior parameter η was separate for the three layers in nCRP and snCRP. We decreased the value of η (η1, η2, η3) from the root layer to the leaf layer to reflect the condition that the number of words allocated in the bottom layer was reduced. In the implementation of TSBP2, the beta prior parameter for depth γsd (γsd1, γsd2, γsd3) was also decreased by depth to reflect the decreasing number of sentences in the bottom layer. The system parameters were selected individually from the validation data in the different datasets. In the DUC dataset, the hyperparameters of snCRP-TSBP1 and snCRP-TSBP2 were selected as α0 = 0.5, η1 = 0.05, η2 = 0.025, η3 = 0.0125, γs = 1.85, γsd1 = 2.5, γsd2 = 1.25, γsd3 = 1.125, γsb = 1.1 and γw = 1.85.
Fig. 7. Perplexity versus Gibbs sampling iteration of using snCRP for three datasets: WSJ, AP and DUC.
We ran the Gibbs sampling procedure for 280 iterations with 100 samples per iteration. The burn-in samples in the first 30 iterations were abandoned. Figure 7 illustrates the perplexity of the training documents versus the Gibbs sampling iterations by using snCRP-TSBP1. The WSJ, AP and DUC datasets are investigated. We find that the number of tree nodes increases and the corresponding perplexity decreases over the sampling iterations. The perplexity converges after around 50 iterations. This phenomenon is consistent for the three datasets.
B. Evaluation of tree models and stick-breaking processes for
document modeling
In this set of experiments, we evaluate the performance
of different tree models and stick-breaking processes for
document representation by using WSJ and AP datasets. The
hierarchical models based on nCRP and snCRP and the draws
of topic proportions or theme proportions by using SBP,
TSBP1 and TSBP2 are investigated.

Fig. 8. Comparison of perplexity of using nCRP and snCRP where SBP, TSBP1 and TSBP2 are investigated. WSJ and AP are used. The error bars show the standard deviation across test documents in fivefold cross validation. Model complexity is compared. The blue number after nCRP denotes the total number of estimated topics while the blue numbers after snCRP denote the total numbers of estimated themes and topics: nCRP-SBP (WSJ 529; AP 602), nCRP-TSBP1 (WSJ 553; AP 632), nCRP-TSBP2 (WSJ 570; AP 650), snCRP-SBP (WSJ 402, 300; AP 414, 336), snCRP-TSBP1 (WSJ 418, 310; AP 431, 345), snCRP-TSBP2 (WSJ 437, 325; AP 451, 353).

Different from the word-based hierarchical topic model using nCRP, the proposed
snCRP builds the sentence-based tree model where each node
represents a theme from a set of sentences and the words
in these sentences are generated by HDP. In addition to the
baseline nCRP-SBP [11] with single tree path, we compare
the performance of nCRP and snCRP combined with TSBP1
and TSBP2 which select the subtree branches to deal with the
topical and thematic variations in heterogeneous documents.
Figure 8 shows the perplexities of test documents and
the estimated model complexities. The error bars show the
standard deviation across test documents in fivefold cross
validation. The model complexity is measured in terms of
total number of estimated topics in tree model of nCRP and
total numbers of estimated themes and topics in tree model
of snCRP. We find that snCRP obtains lower perplexity than
nCRP. Selection of subtree branches using TSBP1 and TSBP2
outperforms that of single path using SBP in both datasets.
The results of TSBP1 and TSBP2 are comparable. Such
performance is obtained by using nCRP as well as snCRP.
In addition, the estimated model complexity of using TSBP
is larger than that of SBP. This is reasonable because TSBP
adopts more latent variables to allow larger variations in topics
or themes. TSBP2 has higher freedom to choose more themes
and topics but obtained very limited improvement compared
with TSBP1. The model complexity of hierarchical models
nCRP and snCRP is larger than that of non-hierarchical models
LDA, sLDA, HDP and sHDP. It is interesting that snCRP uses a smaller number of topics than nCRP. However, the total number of latent variables used by snCRP (L + K) is larger than that used by nCRP (K). This implies that additional modeling at the sentence level can reduce the number of topics required in word-level modeling for document representation.
Fig. 9. Number of themes versus length of document by using snCRP with SBP, TSBP1 and TSBP2. Six documents of different lengths are selected from WSJ. Document length is measured by the total number of words in the document.

Figure 9 shows the number of themes which are estimated from six documents with different lengths or numbers of words in the documents (Nd = 200, 301, 405, 504, 688). We find that the number of themes used for representation of a document is
increased by the length of document when applying TSBP1
and TSBP2. SBP selects a single path so that only three themes
are selected for each document. TSBP2 chooses more themes
than TSBP1. This is an evidence that TSBPs perform better
than SBP for document modeling.
Fig. 10. Comparison of perplexity of using LDA, sLDA, HDP, sHDP, nCRP and snCRP on WSJ, AP and NIPS. Subtree branch selection in nCRP and snCRP based on TSBP1 is considered. The error bars show the standard deviation. Model complexity is compared: the number after LDA, HDP and nCRP denotes the total number of estimated topics, while the numbers after sLDA, sHDP and snCRP denote the total numbers of estimated themes and topics: LDA (WSJ 400; AP 450; NIPS 500); sLDA (WSJ 350, 250; AP 400, 300; NIPS 450, 350); HDP (WSJ 492; AP 550; NIPS 590); sHDP (WSJ 385, 293; AP 411, 319; NIPS 469, 374); nCRP (WSJ 553; AP 632; NIPS 692); snCRP (WSJ 418, 310; AP 431, 345; NIPS 475, 397).
C. Evaluation of different methods in terms of perplexity, topic
coherence and computation time
This study presents the evolution from the parametric and
non-hierarchical topic model based on LDA to the nonpara-
metric and hierarchical theme and topic model based on
snCRP. The additional modeling over subgrouping data is in-
troduced to conduct unsupervised structural learning of themes
and topics from a set of documents. Figure 10 compares the
perplexity of test documents by using LDA, sLDA, HDP,
sHDP, nCRP and snCRP. WSJ, AP and NIPS datasets are
used. The numbers of the estimated themes and topics are
shown to evaluate the effect of model complexity in different
methods. We find that perplexity is consistently reduced from
word-based clustering to sentence-based clustering across
different methods and datasets. This is because the two levels
of data modeling in sentence-based clustering methods provide
an organized way to represent a set of documents. The relation
between words and sentences through latent topics and themes
is beneficial for document modeling regardless of the model
style, model shape and inference procedure. We consistently
see that sLDA, sHDP and snCRP estimate a smaller number of
topics than LDA, HDP and nCRP, respectively. Nevertheless,
the total number of themes and topics in sentence-based clustering
is still larger than the number of topics in word-based clustering. In
this comparison, HDP and sHDP perform better than LDA
and sLDA, respectively. The lowest perplexity is obtained by
using snCRP. Compared with the news articles in WSJ and AP,
the large collection of scientific documents in NIPS does not
yield proportionally more topics and themes under the BNP methods.
Fig. 11. Comparison of PMI score of using LDA, sLDA, HDP, sHDP, nCRP and snCRP on WSJ, AP and NIPS. The error bars show the standard deviation.
Based on the estimated LDA, sLDA, HDP, sHDP, nCRP
and snCRP, Figure 11 further evaluates the performance of
topic coherence by comparing the corresponding PMI scores
where WSJ, AP and NIPS datasets are investigated. PMI is
known as an objective measure that closely reflects
human-judged topic coherence [32]. To conduct a consistent
comparison, the PMI scores for sLDA, sHDP and snCRP are
calculated from the pairs of frequent words in the estimated
topics rather than the themes. From the results of the PMI measure,
we again see the improvement of two-level clustering
over one-level clustering, and of nonparametric hierarchical
modeling over parametric non-hierarchical modeling.
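As a reference for how such a coherence score can be obtained, the following Python sketch averages the pointwise mutual information over pairs of a topic's top words using document co-occurrence counts. The reference corpus, the co-occurrence statistics and the smoothing constant are assumptions and may differ from the exact setup in [32].

import itertools
import math

def pmi_coherence(top_words, doc_sets, num_docs, eps=1e-12):
    """Average PMI over pairs of a topic's top words.

    top_words: list of the topic's most frequent words.
    doc_sets : dict mapping each word to the set of documents containing it
               (assumed precomputed from a reference corpus).
    num_docs : total number of documents used for the counts.
    """
    scores = []
    for w1, w2 in itertools.combinations(top_words, 2):
        p1 = len(doc_sets.get(w1, set())) / num_docs
        p2 = len(doc_sets.get(w2, set())) / num_docs
        p12 = len(doc_sets.get(w1, set()) & doc_sets.get(w2, set())) / num_docs
        scores.append(math.log((p12 + eps) / (p1 * p2 + eps)))
    return sum(scores) / max(len(scores), 1)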
In addition, the training time of using different methods is
evaluated as illustrated in Figure 12. NIPS corpus is used.
The nCRP and snCRP based on TSBP1 were examined.
Fig. 12. Comparison of computation time (in seconds) of using LDA, sLDA, HDP, sHDP, nCRP and snCRP under different amounts of training documents from NIPS.
To investigate the scalability of computation time with respect to the
amount of training data, we sampled training documents and
formed the training sets with 800, 1,600 and 2,500 documents.
The model size of LDA and sLDA is adjusted according to
the amount of training data. LDA and sLDA are estimated
by using VB inference, while HDP, sHDP, nCRP and snCRP
conduct inference based on Gibbs sampling. Computation
time is measured up to convergence of the model parameters, where
10 iterations are run for VB and 50 iterations are run for
Gibbs sampling. Basically, the computation overhead of
using sLDA, sHDP and snCRP over LDA, HDP and nCRP is
limited. Nevertheless, the computation cost of the nonparametric
methods HDP, sHDP, nCRP and snCRP is much higher
than that of the parametric methods LDA and sLDA. The
highest cost is measured for the sentence-based tree model,
i.e., snCRP. The computation time is roughly proportional to
the amount of training data.
Fig. 13. A tree model of DUC showing the topical words in each theme or tree node.
D. Evaluation for document summarization
The proposed snCRP conducts sentence-based clustering
or, equivalently, establishes a tree model which contains the
thematic sentences in tree nodes and facilitates document
summarization. The other sentence-based clustering methods,
including sLDA and sHDP, are implemented for comparative
study. Figure 13 displays an example of a three-layer tree structure
estimated from the DUC dataset based on snCRP-TSBP1 and treeGEM.
For the words of all sentences allocated to tree nodes, we conduct HDP
to find the topic proportions corresponding to each node based on the
GEM distribution. In this figure, five topical words are displayed for
the tree nodes in different layers, which are shaded with different colors.
The root node (yellow) contains general words while the
leaf nodes (white) consist of specific words. Semantic relationships
between tree nodes in different layers are evident along the
five selected tree paths, which are respectively related to animal,
television, disease, criminal and country. This illustrates the
performance of unsupervised structural learning.
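Since the topic proportions at each tree node are drawn according to the GEM distribution, a standard truncated GEM stick-breaking draw can be sketched in Python as follows; the truncation level and parameter names are illustrative and not tied to the actual implementation.

import numpy as np

def gem_proportions(alpha, truncation, rng=None):
    """Truncated GEM(alpha) stick-breaking: pi_k = beta_k * prod_{j<k}(1 - beta_j)."""
    rng = rng or np.random.default_rng(0)
    betas = rng.beta(1.0, alpha, size=truncation)
    betas[-1] = 1.0                            # last stick absorbs the remainder
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    return betas * remaining

pi = gem_proportions(alpha=1.0, truncation=20)
assert abs(pi.sum() - 1.0) < 1e-9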
Fig. 14. Comparison of recall, precision and F-measure under ROUGE-1 evaluation by using snCRP based on different sentence selection methods (snCRP-root, snCRP-leaf, snCRP-MMR and snCRP-path). The error bars show the standard deviation.
In the implementation of snCRP, sentences are selected
from the tree model by applying four selection
methods. Figure 14 compares the performance of document
summarization using snCRP-TSBP1 in terms of recall, precision
and F-measure under ROUGE-1. The performance of
selecting sentences from the root node (snCRP-root) and from the leaf
nodes (snCRP-leaf) is comparable. The MMR metric for
selection from all paths (snCRP-MMR) and the KL divergence
metric for selection from the most frequently visited
path (snCRP-path) perform better than snCRP-root and
snCRP-leaf. The snCRP-path obtains the highest F-measure
in this comparison. The sentences along the most frequently
visited path carry the most representative information
for summarization. We therefore fix the snCRP-path setting when
comparing with the other methods.
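For reference, snCRP-MMR follows the standard maximal marginal relevance criterion, which trades off relevance to the document against redundancy with the sentences already selected. The following Python sketch illustrates this criterion; the sentence vectors (e.g., theme or topic proportions), the trade-off weight and the function names are illustrative assumptions rather than the exact implementation.

import numpy as np

def mmr_select(sent_vecs, doc_vec, num_sentences, lam=0.7):
    """Greedy MMR selection over sentence vectors.

    sent_vecs: (S, D) array with one row per candidate sentence; doc_vec is
    the document-level vector. All vectors are assumed L2-normalized, so a
    dot product acts as a cosine similarity.
    """
    selected = []
    candidates = list(range(len(sent_vecs)))
    relevance = sent_vecs @ doc_vec
    while candidates and len(selected) < num_sentences:
        def score(i):
            redundancy = max((sent_vecs[i] @ sent_vecs[j] for j in selected),
                             default=0.0)
            return lam * relevance[i] - (1.0 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Hypothetical usage: pick three summary sentences from a sentence matrix.
# summary_ids = mmr_select(sentence_matrix, document_vector, num_sentences=3)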
Fig. 15. Comparison of recall, precision and F-measure under ROUGE-1 evaluation by using VSM, sLDA, sHDP and snCRP with SBP, TSBP1 and TSBP2. The error bars show the standard deviation. Model complexity is compared: the numbers denote the total numbers of estimated themes and topics: sLDA (200, 150); sHDP (235, 204); snCRP-SBP (310, 255); snCRP-TSBP1 (350, 283); snCRP-TSBP2 (377, 294).
Figure 15 shows the recall, precision and F-measure of document summarization by using VSM, sLDA, sHDP, snCRP-
SBP, snCRP-TSBP1 and snCRP-TSBP2 under ROUGE-1
evaluation. The numbers of the estimated themes and topics
are included for comparison. Again, snCRP-TSBPs estimate
more themes and topics than snCRP-SBP. The hierarchical model
based on snCRP has a larger model size than the non-hierarchical
models based on sLDA and sHDP. In terms of F-measure,
the theme and topic models using sLDA and sHDP are
significantly better than the baseline VSM. The nonparametric model
based on sHDP obtains a higher F-measure than the parametric
model based on sLDA. Nevertheless, the hierarchical theme
and topic model using snCRP is superior to that using sHDP.
The contributions of using snCRP come from the flexible
model complexity and the theme structure which are beneficial
for sentence clustering, document modeling and document
summarization. Similar to the evaluation for document model-
ing, snCRP-TSBPs perform better than snCRP-SBP for docu-
ment summarization. The snCRP-TSBP1 and snCRP-TSBP2
outperform the other methods in terms of F-measure.
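The ROUGE-1 scores in Figures 14 and 15 are based on unigram overlap between a system summary and the reference summary. The following Python sketch computes recall, precision and F-measure against a single reference, assuming tokenization is done beforehand; multi-reference averaging and the other options of the ROUGE toolkit are omitted.

from collections import Counter

def rouge1(candidate_tokens, reference_tokens):
    """ROUGE-1 recall, precision and F-measure from unigram overlap."""
    cand, ref = Counter(candidate_tokens), Counter(reference_tokens)
    overlap = sum((cand & ref).values())          # clipped unigram matches
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f_measure = (2 * recall * precision / (recall + precision)
                 if recall + precision else 0.0)
    return recall, precision, f_measure

# Example: rouge1("the cat sat".split(), "the cat was sitting".split())
# gives recall 0.5, precision about 0.67 and F-measure about 0.57.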
VI. CONCLUSIONS
This paper addressed a new hierarchical and nonparametric
model for document representation and summarization. A hier-
archical theme model was constructed according to a sentence-
level nCRP while the topic model was established through a
word-level HDP. The nCRP compound HDP was proposed
to build a tightly-coupled theme and topic model which was
also seen as a theme-dependent topic mixture model. A self-organized
document representation using themes at the sentence
level and topics at the word level was developed. We presented
the TSBP to draw subtree branches for possible thematic
variations in heterogeneous documents. A hierarchical mixture
model of themes was constructed according to the snCRP.
The hierarchical clustering of sentences was implemented.
The thematic sentences were allocated in tree nodes which
were frequently visited. Experimental results on document
modeling and summarization showed the merit of snCRP
in terms of perplexity, topic coherence and F-measure. The
proposed snCRP is a general model for unsupervised structural
learning. The model can be generalized to characterize the latent
structure in different levels of data groupings that exist in
various specialized technical domains.
REFERENCES
[1] D. M. Blei, “Probabilistic topic models,” Communications of the ACM,
vol. 55, no. 4, pp. 77–84, Apr. 2012.
[2] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,”
Journal of Machine Learning Research, vol. 3, no. 5, pp. 993–1022,
Jan. 2003.
[3] D. M. Blei, L. Carin, and D. Dunson, “Probabilistic topic models,” IEEE
Signal Processing Magazine, vol. 27, no. 6, pp. 55–65, Nov. 2010.
[4] J.-T. Chien and C.-H. Chueh, “Topic-based hierarchical segmentation,”
IEEE Transactions on Audio, Speech and Language Processing, vol. 20,
no. 1, pp. 55–66, Jan. 2012.
[5] ——, “Dirichlet class language models for speech recognition,” IEEE
Transactions on Audio, Speech and Language Processing, vol. 19, no. 3,
pp. 482–495, Mar. 2011.
[6] J.-T. Chien and M.-S. Wu, “Adaptive Bayesian latent semantic analysis,”
IEEE Transactions on Audio, Speech, and Language Processing, vol. 16,
no. 1, pp. 198–207, Jan. 2008.
[7] Y.-L. Chang and J.-T. Chien, “Latent Dirichlet learning for document
summarization,” in Proc. of International Conference on Acoustics,
Speech, and Signal Processing, Taipei, Taiwan, Apr. 2009, pp. 1689–
1692.
[8] J.-T. Chien and Y.-L. Chang, “Hierarchical theme and topic model for
summarization,” in Proc. of IEEE International Workshop on Machine
Learning for Signal Processing, Southampton, UK, Sep. 2013, pp. 1–6.
[9] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei, “Hierarchical
Dirichlet process,” Journal of the American Statistical Association, vol.
101, no. 476, pp. 1566–1581, Dec. 2006.
[10] D. M. Blei, T. L. Griffiths, M. I. Jordan, and J. B. Tenenbaum, “Hierarchi-
cal topic models and the nested Chinese restaurant process,” in Advances
in Neural Information Processing Systems, Vancouver, Canada, Dec.
2004, pp. 17–24.
[11] D. M. Blei, T. L. Griffiths, and M. I. Jordan, “The nested Chinese restau-
rant process and Bayesian nonparametric inference of topic hierarchies,”
Journal of the ACM, vol. 57, no. 2, article 7, Jan. 2010.
[12] J.-T. Chien and Y.-L. Chang, “Bayesian sparse topic model,” Journal of
Signal Processing Systems, vol. 74, no. 3, pp. 375–389, Mar. 2014.
[13] C. Wang and D. M. Blei, “Decoupling sparsity and smoothness in
the discrete hierarchical Dirichlet process,” in Advances in Neural
Information Processing Systems, Vancouver, Canada, Dec. 2009, pp.
1982–1989.
[14] S. Williamson, C. Wang, K. A. Heller, and D. M. Blei, “The IBP com-
pound Dirichlet process and its application to focused topic modeling,”
in Proc. of International Conference on Machine Learning, Haifa, Israel,
Jun. 2010, pp. 1151–1158.
[15] H. M. Wallach, D. Mimno, and A. McCallum, “Rethinking LDA: why
priors matter,” in Advances in Neural Information Processing Systems,
Vancouver, Canada, Dec. 2009, pp. 1973–1981.
[16] D. I. Kim and E. B. Sudderth, “The doubly correlated nonparametric
topic model,” in Advances in Neural Information Processing Systems,
Vancouver, Canada, Dec. 2011, pp. 1980–1988.
[17] A. Rodriguez, D. B. Dunson, and A. E. Gelfand, “The nested Dirichlet
process,” Journal of the American Statistical Association, vol. 103, no.
483, pp. 1131–1154, Sep. 2008.
[18] J. Paisley, L. Carin, and D. M. Blei, “Variational inference for stick-
breaking beta process priors,” in Proc. of International Conference on
Machine Learning, Bellevue, WA, Jun. 2011, pp. 889–896.
[19] Y. W. Teh, D. Newman, and M. Welling, “A collapsed variational
Bayesian inference algorithm for latent Dirichlet allocation,” in Ad-
vances in Neural Information Processing Systems, Vancouver, Canada,
Dec. 2007, pp. 1353–1360.
[20] C. Wang and D. M. Blei, “Variational inference for the nested Chinese
restaurant process,” in Advances in Neural Information Processing
Systems, Vancouver, Canada, Dec. 2009, pp. 1990–1998.
[21] H. M. Wallach, “Topic modeling: beyond bag-of-words,” in Proc. of
International Conference on Machine Learning, Haifa, Israel, Jun. 2010,
pp. 977–984.
[22] J. Goldstein, V. Mittal, J. Carbonell, and M. Kantrowitz, “Multi-
document summarization by sentence extraction,” in Proc. of
ANLP/NAACL Workshop on Automatic Summarization, vol. 4, Seattle,
WA, Apr. 2000, pp. 40–48.
[23] A. Haghighi and L. Vanderwende, “Exploring content models for multi-
document summarization,” in Proc. of Annual Conference of the North
American Chapter of the ACL, Boulder, CO, May 2009, pp. 362–370.
[24] D. Wang, S. Zhu, T. Li, and Y. Gong, “Multi-document summarization
using sentence-based topic models,” in Proc. of ACL-IJCNLP, Singa-
pore, Aug. 2009, pp. 297–300.
[25] M. M. Shafiei and E. E. Milios, “Latent Dirichlet co-clustering,” in Proc.
of IEEE International Conference on Data Mining, Hong Kong, Dec.
2006, pp. 542–551.
[26] L. Du, W. Buntine, and H. Jin, “A segmented topic model based on the
two-parameter Poisson-Dirichlet process,” Machine Learning, vol. 81,
no. 1, pp. 5–19, Oct. 2010.
[27] J. Pitman, “Poisson-Dirichlet and GEM invariant distributions for split-
and-merge transformation of an interval partition,” Combinatorics, Prob-
ability and Computing, vol. 11, pp. 501–514, Sep. 2002.
[28] D. Aldous, “Exchangeability and related topics,” in École d’Été de
Probabilités de Saint-Flour XIII–1983, Berlin: Springer-Verlag, 1985,
pp. 1–198.
[29] R. P. Adams, Z. Ghahramani, and M. I. Jordan, “Tree-structured stick
breaking for hierarchical data,” in Advances in Neural Information
Processing Systems, Vancouver, Canada, Dec. 2010, pp. 19–27.
[30] J. Paisley, C. Wang, D. M. Blei, and M. I. Jordan, “Nested hierarchical
Dirichlet processes,” IEEE Transactions on Pattern Analysis and Ma-
chine Intelligence, vol. 37, no. 2, pp. 256–270, Feb. 2015.
[31] D. Andrzejewski, X. Zhu, and M. Craven, “Incorporating domain
knowledge into topic modeling via Dirichlet forest priors,” in Proc. of
International Conference on Machine Learning, Montreal, Canada, Jun.
2009, pp. 25–32.
[32] D. Newman, E. V. Bonilla, and W. Buntine, “Improving topic coherence
with regularized topic models,” in Advances in Neural Information
Processing Systems, Vancouver, Canada, Dec. 2011, pp. 496–504.
[33] Y. Gong and X. Liu, “Generic text summarization using relevance
measure and latent semantic analysis,” in Proc. of ACM SIGIR, New
Orleans, LA, Sep. 2001, pp. 19–25.
Jen-Tzung Chien (S’97-A’98-M’99-SM’04) re-
ceived his Ph.D. degree in electrical engineering
from National Tsing Hua University, Hsinchu, Tai-
wan, ROC, in 1997. During 1997-2012, he was with
the National Cheng Kung University, Tainan, Tai-
wan. Since 2012, he has been with the Department
of Electrical and Computer Engineering and the
Department of Computer Science, National Chiao
Tung University, Hsinchu, where he is currently a
Distinguished Professor. He held the Visiting Re-
searcher positions with the Panasonic Technologies
Inc., Santa Barbara, CA, the Tokyo Institute of Technology, Japan, the Georgia
Institute of Technology, Atlanta, GA, the Microsoft Research Asia, Beijing,
China, and the IBM T. J. Watson Research Center, Yorktown Heights, NY.
His research interests include machine learning, information retrieval, speech
recognition, blind source separation, and face recognition.
Dr. Chien served as the associate editor of the IEEE Signal Processing
Letters in 2008-2011, the guest editor of the IEEE Transactions on Audio,
Speech, and Language Processing in 2012, and the tutorial speaker of the In-
terspeech 2013 and the ICASSP 2012 and 2015. He received the Distinguished
Research Award from the Ministry of Science and Technology, Taiwan and
the Best Paper Award of the 2011 IEEE Automatic Speech Recognition and
Understanding Workshop. He currently serves as an elected member of the
IEEE Machine Learning for Signal Processing Technical Committee.