Article

Scalable training of hierarchical topic models

Abstract

Large-scale topic models serve as basic tools for feature extraction and dimensionality reduction in many practical applications. As a natural extension of flat topic models, hierarchical topic models (HTMs) are able to learn topics at different levels of abstraction, which leads to deeper understanding and better generalization than their flat counterparts. However, existing scalable systems for flat topic models cannot handle HTMs, due to their complicated data structures such as trees and concurrent dynamically growing matrices, as well as their susceptibility to local optima. In this paper, we study the hierarchical latent Dirichlet allocation (hLDA) model, a powerful nonparametric Bayesian HTM. We propose an efficient partially collapsed Gibbs sampling algorithm for hLDA, as well as an initialization strategy to deal with the local optima introduced by tree-structured models. We also identify new system challenges in building scalable systems for HTMs, and propose an efficient data layout for vectorizing HTMs as well as distributed data structures, including dynamic matrices and trees. Empirical studies show that our system is 87 times more efficient than the previous open-source implementation of hLDA, and can scale to thousands of CPU cores. We demonstrate our scalability on a 131-million-document corpus with 28 billion tokens, which is 4-5 orders of magnitude larger than previously used corpora. Our distributed implementation can extract 1,722 topics from this corpus with 50 machines in just 7 hours.
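To make the data-layout idea concrete, the sketch below shows a generic CSR-style flat layout in which all token-level state of a corpus lives in contiguous arrays indexed by per-document offsets, so per-document statistics reduce to vectorized operations. This is a minimal Python illustration of the general idea, not the actual layout used by the system described in the abstract.

```python
# Illustrative only: a CSR-style flat layout for token-level corpus state, the kind of
# contiguous arrangement that makes vectorized sampling practical. Generic sketch, not the
# data layout of the paper's system.
import numpy as np

docs = [[0, 2, 2, 5], [1, 3], [4, 4, 0]]                  # toy corpus: word ids per document
L = 3                                                      # assumed tree depth

tokens = np.concatenate([np.asarray(d, dtype=np.int32) for d in docs])
doc_offsets = np.zeros(len(docs) + 1, dtype=np.int64)
doc_offsets[1:] = np.cumsum([len(d) for d in docs])

levels = np.random.randint(0, L, size=tokens.shape[0])     # per-token level assignments

# Tokens of document d are the contiguous slice tokens[doc_offsets[d]:doc_offsets[d + 1]],
# so per-document statistics reduce to vectorized calls instead of Python-level loops.
d = 1
sl = slice(doc_offsets[d], doc_offsets[d + 1])
print(np.bincount(levels[sl], minlength=L))                # level counts for document d
```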

... These models build a topic hierarchy that allows for arbitrarily large branch structures and adaptive dataset growth. In addition, they have been successfully applied to document modeling, online advertising, and microblog location prediction, and have outperformed flat topic models [4]. ...
... The most commonly used topic model is LDA [3,13,14], which is represented as a probabilistic graphical model in Figure 1, and it has shown great success in various NLP tasks for discovering latent topics in the biomedical literature [15,16]. However, topics are naturally organized in a hierarchy [4], and the above models are flat topic models that cannot capture the hierarchical information of the topics. ...
... Xu et al. proposed a novel knowledge-based hierarchical topic model (KHTM), which can incorporate prior knowledge into topic hierarchy building to solve the problem of weak topic hierarchies [20]. Chen et al. proposed a partially collapsed Gibbs sampling algorithm and used a supercomputer to apply nHDP [7,21] to some large complete datasets, producing some very interesting topic hierarchies [4]. Zou et al. proposed a novel hierarchical topic model (HV-HTM) to incorporate observed hierarchical label information into the topic generation process while maintaining the flexibility of horizontal and vertical expansion of the hierarchical structure during modeling [22]. ...
Article
Full-text available
Over the past two decades, a number of advances in topic modeling have produced sophisticated models that are capable of generating topic hierarchies. In particular, hierarchical Latent Dirichlet Allocation (hLDA) builds a topic tree based on the nested Chinese Restaurant Process (nCRP) or other sampling processes to generate a topic hierarchy that allows arbitrarily large branch structures and adaptive dataset growth. In addition, hierarchical topic models based on the latent tree model, such as Hierarchical Latent Tree Analysis (HLTA), have been developed over the last five years. However, these models do not work well in cases with millions of documents and hundreds of thousands of terms. In addition, the topic trees generated by these models are always poorly interpretable, and the relationships among topics in different levels are relatively simple. The biomedical literature, including Medline abstracts, has large-scale documents in two major categories: biological laboratory research and medical clinical research. We propose a top-down binary hierarchical topic model (biHTM) for biomedical literature by iteratively applying a flat topic model and adaptively processing subtrees of the hierarchy. The biHTM topic hierarchy of complete Medline abstracts with more than 14 topic node levels shows good bimodality and interpretability. Compared to hLDA and HLTA, biHTM shows promising results in experiments assessed in terms of runtime and quality.
... A subcategory of context-agnostic (topic) models pertains to nonparametric models [16,59], which adapt their representations to the structure of the training data. They allow for an unbounded number of parameters that grows with the size of the training data, whereas the parametric models are restricted to a parameter space of fixed size. ...
... Hierarchical LDA (HLDA) [8,16]. This model extends LDA by organizing the topics in Z into a hierarchical tree such that every tree node represents a single topic. ...
... (ii) The testing time (ETime) expresses the total time that is required for processing the test set of all 60 users, i.e., to compare all user models with their testing tweets and to rank the latter in descending order of similarity. For a fair comparison, we do not consider works that parallelize representation models (e.g., [16,57,62]), as most models are not adapted to distributed processing. ...
Preprint
Full-text available
Microblogging platforms constitute a popular means of real-time communication and information sharing. They involve such a large volume of user-generated content that their users suffer from an information deluge. To address it, numerous recommendation methods have been proposed to organize the posts a user receives according to her interests. The content-based methods typically build a text-based model for every individual user to capture her tastes and then rank the posts in her timeline according to their similarity with that model. Even though content-based methods have attracted lots of interest in the data management community, there is no comprehensive evaluation of the main factors that affect their performance. These are: (i) the representation model that converts an unstructured text into a structured representation that elucidates its characteristics, (ii) the source of the microblog posts that compose the user models, and (iii) the type of user's posting activity. To cover this gap, we systematically examine the performance of 9 state-of-the-art representation models in combination with 13 representation sources and 3 user types over a large, real dataset from Twitter comprising 60 users. We also consider a wide range of 223 plausible configurations for the representation models in order to assess their robustness with respect to their internal parameters. To facilitate the interpretation of our experimental results, we introduce a novel taxonomy of representation models. Our analysis provides novel insights into the performance and functionality of the main factors determining the performance of content-based recommendation in microblogs.
... Similarly, Yakubovich and Lup (2006) said that there is a need for proper recruiting-process analytics to evaluate employees with a 360-degree view, as performance levels may differ across directions. Chen et al. (2018) and Chen et al. (2018) discussed the method and application of hierarchical topic modeling, which is a good approach for identifying topics in text. The usage of statistical topic modeling and probabilistic methods for text data is also discussed in detail by Yang et al. (2016). ...
Article
Full-text available
Organizations across industries face challenges in recruiting the right talent while expending precious resources and time. The complex process of fair screening and shortlisting could be substantially streamlined by deploying automated screening and matching of job applications. This study suggests an analytics-based approach for improving the competitiveness of the human resources recruitment process. Available tools have been used to extract the attributes from the profile job description and match them with prospective candidates' résumés from the database. Similarity analysis identified the most suitable applicants based on the desired attributes versus the résumé. The framework was trained on a random sample of 1029 job applicants' profiles from an IT company and was able to reduce manual screening efforts by 80%. This is expected to directly reflect in a saving of man-hours and allied operating costs. Though the current study is limited to the context of an IT company in India, the proposed artificial intelligence-based framework holds immense potential to be extended across industries. The study contributes to both theory and practice by helping leaders, associations, policymakers, and academia to strategize and optimize recruitment efforts.
... Thus, a document is drawn by choosing an $L$-level path through the restaurants (topics) and then sampling the words from the $L$ chosen topics. The inference can be approximated by means of collapsed Gibbs sampling (Blei, Griffiths & Jordan, 2010; Chen et al., 2018). The expressions for assessing the variables $z_{dn}$ and $c_d$ are the following: ...
... where $W$ is the vocabulary (set of words), $V$ is the vocabulary size, $z = \{z_d\}_{d=1}^{D}$, $\neg dn$ denotes all tokens excluding token $w_{dn}$, $c = \{c_d\}_{d=1}^{D}$, $a^{\neg dn}_{dl}$ is a counter that equals the number of tokens at level $l$ in document $d$ excluding token $w_{dn}$, $b^{\neg dn}_{c_{dl},v}$ is a counter that equals the number of tokens of word $v$ assigned to topic $c_{dl}$ excluding the current token $w_{dn}$, $s^{\neg dn}_{c_{dl}}$ is a counter that equals the number of tokens assigned to topic $c_{dl}$ excluding the current token $w_{dn}$, $c = \{c_1, \ldots, c_L\}$ is a path in the hierarchy, $b^{\neg d}_{c_l}$ is a counter that equals the number of tokens assigned to topic $c_l$ excluding the tokens of document $d$, $b^{d}_{c_l}$ is the number of tokens of document $d$ that were assigned to topic $c_l$, $B(\cdot)$ is the multivariate beta function, and $p(c \mid c^{\neg d})$ is given by the nCRP (2). The details can be found in Chen et al. (2018). Then, the matrix $\Phi$ is calculated according to ...
Article
Full-text available
Hierarchical topic modeling is a potentially powerful instrument for determining topical structures of text collections that additionally allows constructing a hierarchy representing the levels of topic abstractness. However, parameter optimization in hierarchical models, which includes finding an appropriate number of topics at each level of hierarchy, remains a challenging task. In this paper, we propose an approach based on Renyi entropy as a partial solution to the above problem. First, we introduce a Renyi entropy-based metric of quality for hierarchical models. Second, we propose a practical approach to obtaining the “correct” number of topics in hierarchical topic models and show how model hyperparameters should be tuned for that purpose. We test this approach on the datasets with the known number of topics, as determined by the human mark-up, three of these datasets being in the English language and one in Russian. In the numerical experiments, we consider three different hierarchical models: hierarchical latent Dirichlet allocation model (hLDA), hierarchical Pachinko allocation model (hPAM), and hierarchical additive regularization of topic models (hARTM). We demonstrate that the hLDA model possesses a significant level of instability and, moreover, the derived numbers of topics are far from the true numbers for the labeled datasets. For the hPAM model, the Renyi entropy approach allows determining only one level of the data structure. For hARTM model, the proposed approach allows us to estimate the number of topics for two levels of hierarchy.
... Thus, a document is drawn by choosing an $L$-level path through the restaurants (topics) and then sampling the words from the $L$ chosen topics. The inference can be approximated by means of collapsed Gibbs sampling (Blei et al., 2010; Chen et al., 2018). The expressions for assessing the variables $z_{dn}$ and $c_d$ are the following: ...
... where $W$ is the vocabulary (set of words), $V$ is the vocabulary size, $z = \{z_d\}_{d=1}^{D}$, $\neg dn$ denotes all tokens excluding token $w_{dn}$, $c = \{c_d\}_{d=1}^{D}$, $a^{\neg dn}_{dl}$ is a counter that equals the number of tokens at level $l$ in document $d$ excluding token $w_{dn}$, $b^{\neg dn}_{c_{dl},v}$ is a counter that equals the number of tokens of word $v$ assigned to topic $c_{dl}$ excluding the current token $w_{dn}$, $s^{\neg dn}_{c_{dl}}$ is a counter that equals the number of tokens assigned to topic $c_{dl}$ excluding the current token $w_{dn}$, $c = \{c_1, \ldots, c_L\}$ is a path in the hierarchy, $b^{\neg d}_{c_l}$ is a counter that equals the number of tokens assigned to topic $c_l$ excluding the tokens of document $d$, $b^{d}_{c_l}$ is the number of tokens of document $d$ that were assigned to topic $c_l$, $B(\cdot)$ is the multivariate beta function, and $p(c \mid c^{\neg d})$ is given by the nCRP (3). The details can be found in Chen et al. (2018). Then, the matrix $\Phi$ is calculated according to ...
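The counters above map naturally onto a few integer arrays. The following Python sketch shows one plausible form of the level-resampling step they support, assuming a symmetric Dirichlet prior over levels and Dirichlet-smoothed topics; the exact expressions in the cited papers may differ, so treat this purely as an illustration of how the $a$, $b$ and $s$ counters are decremented, used, and incremented.

```python
# Hedged sketch of level resampling in an hLDA-style Gibbs sampler.
# a[d, l] - tokens of document d assigned to level l
# b[k, v] - tokens of word v assigned to topic (tree node) k
# s[k]    - total tokens assigned to topic k
# Assumes a symmetric Dirichlet(alpha) prior over levels and Dirichlet(beta) topics.
import numpy as np

rng = np.random.default_rng(0)
L, V, K = 3, 1000, 50                      # depth, vocabulary size, number of tree nodes
alpha, beta = 0.5, 0.01

a = np.zeros((10, L), dtype=np.int64)      # 10 documents in this toy example
b = np.zeros((K, V), dtype=np.int64)
s = np.zeros(K, dtype=np.int64)

def resample_level(d, w, z_old, path):
    """Resample the level of token w in document d; `path` holds the node ids on d's path."""
    k_old = path[z_old]
    a[d, z_old] -= 1; b[k_old, w] -= 1; s[k_old] -= 1      # remove the token (the "excluding w_dn" counts)
    p = (a[d] + alpha) * (b[path, w] + beta) / (s[path] + V * beta)
    z_new = rng.choice(L, p=p / p.sum())
    k_new = path[z_new]
    a[d, z_new] += 1; b[k_new, w] += 1; s[k_new] += 1      # add the token back at its new level
    return z_new

# toy usage: place one token (word id 7) of document 0 at level 1 on path [0, 3, 4], then resample it
path = np.array([0, 3, 4]); d, w, z = 0, 7, 1
a[d, z] += 1; b[path[z], w] += 1; s[path[z]] += 1
z = resample_level(d, w, z, path)
print(z)
```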
Preprint
Full-text available
Hierarchical topic modeling is a potentially powerful instrument for determining the topical structure of text collections that allows constructing a topical hierarchy representing levels of topical abstraction. However, tuning of parameters of hierarchical models, including the number of topics on each hierarchical level, remains a challenging task and an open issue. In this paper, we propose a Renyi entropy-based approach for a partial solution to the above problem. First, we propose a Renyi entropy-based metric of quality for hierarchical models. Second, we propose a practical concept of hierarchical topic model tuning tested on datasets with human mark-up. In the numerical experiments, we consider three different hierarchical models, namely, hierarchical latent Dirichlet allocation (hLDA) model, hierarchical Pachinko allocation model (hPAM), and hierarchical additive regularization of topic models (hARTM). We demonstrate that hLDA model possesses a significant level of instability and, moreover, the derived numbers of topics are far away from the true numbers for labeled datasets. For hPAM model, the Renyi entropy approach allows us to determine only one level of the data structure. For hARTM model, the proposed approach allows us to estimate the number of topics for two hierarchical levels.
... The parameters $w_k$ and $s_i$ are inextricably linked due to the conjugacy of the Dirichlet distribution and the Categorical distribution; hence, we shall disregard $w_k$ for the sake of simplicity. Thus, a simpler two-part marginal Gibbs sampler is obtained, which is shown in (12). ...
Conference Paper
Full-text available
Big data management methods are paramount in the modern era as applications tend to create massive amounts of data that come from various sources. Therefore, there is an urge to create adaptive, speedy and robust frameworks that can effectively handle massive datasets. Distributed environments such as Apache Spark are of note, as they can handle such data by creating clusters where a portion of the data is stored locally and then the results are returned with the use of Resilient Distributed Datasets (RDDs). In this paper a method for distributed marginal Gibbs sampling for the widely used latent Dirichlet allocation (LDA) model is implemented on PySpark along with a Metropolis-Hastings random walker. The Distributed LDA (DLDA) algorithm distributes a given dataset into P partitions and performs local LDA on each partition, for each document independently. Every n-th iteration, local LDA models that were trained on distinct partitions are combined to ensure the model's ability to converge. Experimental results are promising, as the proposed system demonstrates comparable final model quality to sequential LDA and achieves significant speedups when utilized with massive datasets.
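The partition-and-merge scheme described above can be illustrated without Spark: each partition updates its own copy of the word-topic count matrix, and at each combination step the per-partition deltas are summed back into the global matrix. The numpy sketch below is a single-process stand-in for that combine step (the local Gibbs pass is a stub), not the authors' PySpark implementation.

```python
# Single-process sketch of the combine step: each partition works on a copy of the global
# word-topic counts, and every merge the local deltas are folded back in. Illustrative only.
import numpy as np

V, K, P = 500, 20, 4
global_counts = np.zeros((V, K), dtype=np.int64)
rng = np.random.default_rng(1)

def local_gibbs_pass(local_counts, rng):
    """Stand-in for one pass of collapsed Gibbs sampling over one partition's documents."""
    v = rng.integers(0, V, size=100)
    k = rng.integers(0, K, size=100)
    np.add.at(local_counts, (v, k), 1)        # pretend these tokens were (re)assigned
    return local_counts

for it in range(5):
    local_copies = [local_gibbs_pass(global_counts.copy(), rng) for _ in range(P)]
    # merge: global += sum of per-partition deltas (local - global)
    global_counts += sum(lc - global_counts for lc in local_copies)

print(global_counts.sum())
```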
... First, there is the scalability issue regarding the number of topics an LDA model can handle. The latest development (Chen et al., 2018) can process 131M documents with 28B tokens efficiently; however, it only extracts 1,722 topics. With the fast-growing body of scholarly communications, a comprehensive manually controlled vocabulary like Medical Subject Headings (MeSH) (Lowe and Barnett, 1994) contains tens of thousands of subjects (concepts), mostly in the bio-med domain; and an automated scientific knowledge exploration system such as Microsoft Academic Graph (MAG) (Shen et al., 2018; https://academic.microsoft.com/) has hundreds of thousands of topics across all academic disciplines. ...
... They represent LDA-based topic models. Methods such as the Pachinko allocation topic model [51], correlated topic model [52][53][54], supervised topic model [20,[55][56][57][58], dynamic topic model [29,[59][60][61], hierarchical topic model [62], and spherical topic model [63] all characterize the alternatives provided to the LDA architecture. Currently, within the framework of LDA-based topic models, the advancement of social media platforms [64] and online services such as Q&A (questions and answers) communities [34] is having a serious impact on extensions such as the dynamic topic model [29,60,[64][65][66], correlated topic model [52,53], supervised topic model, and online topic model schemes [41,[67][68][69][70]. Current topic models also provide improvements in semantic analysis [29,30,44,71,72] to enhance coherence in the estimated topics and the relationship between documents [73]. ...
Article
Full-text available
We propose an alternative to the generative classifier that usually models both the class conditionals and class priors separately, and then uses the Bayes theorem to compute the posterior distribution of classes given the training set as a decision boundary. Because SVM (support vector machine) is not a probabilistic framework, it is really difficult to implement a direct posterior distribution-based discriminative classifier. As SVM lacks in full Bayesian analysis, we propose a hybrid (generative–discriminative) technique where the generative topic features from a Bayesian learning are fed to the SVM. The standard latent Dirichlet allocation topic model with its Dirichlet (Dir) prior could be defined as Dir–Dir topic model to characterize the Dirichlet placed on the document and corpus parameters. With very flexible conjugate priors to the multinomials such as generalized Dirichlet (GD) and Beta-Liouville (BL) in our proposed approach, we define two new topic models: the BL–GD and GD–BL. We take advantage of the geometric interpretation of our generative topic (latent) models that associate a K-dimensional manifold (K is the size of the topics) embedded into a V-dimensional feature space (word simplex) where V is the vocabulary size. Under this structure, the low-dimensional topic simplex (the subspace) defines a document as a single point on its manifold and associates each document with a single probability. The SVM, with its kernel trick, performs on these document probabilities in classification where it utilizes the maximum margin learning approach as a decision boundary. The key note is that points or documents that are close to each other on the manifold must belong to the same class. Experimental results with text documents and images show the merits of the proposed framework.
... help to analyze the evolution and relationships of topics [36,37]. For example, dynamic topic models are a common method to study evolution trends of different topics [38]. ...
Article
Full-text available
The television ratings provide an effective way to analyze the popularity of TV programs and audiences’ watching habits. Most previous studies have analyzed the ratings from a single perspective. Few efforts have integrated analysis from different perspectives and explored the reasons for changes in ratings. In this paper, we design a visual analysis system called TVseer to analyze audience ratings from three perspectives: TV channels, TV programs, and audiences. The system can help users explore the factors that affect ratings, and assist them in decisions about program productions and schedules. There are six linked views in TVseer: the channel ratings view and program ratings view show ratings change information from the perspective of TV channels and programs respectively; the overlapping program competition view and the same-type program competition view indicate the competitive relationships among programs; the audience transfer view shows how audiences are moving among different channels; the audience group view displays audience groups based on their watching behavior. Besides, we also construct case studies and expert interviews to prove our system is useful and effective.
... Hierarchical [14] and DAG-structured [19] LDA models can learn topics of different levels of abstraction. Chen et al. [10] implemented a distributed hierarchical LDA system with 50 machines. ...
Article
Full-text available
Latent Dirichlet Allocation (LDA) is a popular tool for analyzing discrete count data such as text and images, and applications require it to handle large datasets and a large number of topics, e.g., tens of thousands of topics for industry-scale applications. Although distributed CPU systems have been used to address this problem, they are slow and resource inefficient. GPU-based systems have emerged as a promising alternative because of their high computational power and memory bandwidth. However, existing GPU-based LDA systems can only learn thousands of topics, because they use dense data structures and have time complexity linear in the number of topics. In this paper, we propose SaberLDA, a GPU-based LDA system that implements a sparsity-aware algorithm to achieve sublinear time complexity when learning a large number of topics. To address the challenges introduced by sparsity, we propose a novel data layout, a warp-based sampling kernel, an efficient sparse matrix counting method, and a fine-grained load balancing strategy. SaberLDA achieves linear speedup on 4 GPUs and is 6-10 times faster than existing GPU systems with thousands of topics. It can learn 40,000 topics from a dataset of billions of tokens in two hours, which was previously only achievable using clusters of tens of CPU servers.
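The core of a sparsity-aware sampler can be sketched in a few lines: the per-token distribution is split into a dense prior term and a sparse document-specific term, so that most of the work touches only the topics actually present in the document (in real corpora that term also carries most of the probability mass, since those topics have comparatively large phi values; in this toy example phi is random, so the split is only illustrative). The Python sketch below shows this generic decomposition, not SaberLDA's GPU kernels.

```python
# Generic sparsity-aware decomposition used by several fast samplers (a sketch in that spirit,
# not SaberLDA's actual kernel): split
#   p(k) proportional to (n_dk + alpha) * phi[k, w]
# into a dense term alpha * phi[:, w] and a sparse term n_dk * phi[k, w] over the topics
# actually present in document d.
import numpy as np

rng = np.random.default_rng(0)
K, V, alpha = 10_000, 50_000, 0.1
phi_w = rng.random(K); phi_w /= phi_w.sum()           # stand-in for the column phi[:, w]

doc_topics = np.array([3, 17, 42])                    # topics present in document d
doc_counts = np.array([5, 2, 1])                      # n_dk for those topics

mass_dense = alpha * phi_w.sum()                      # can be cached and updated incrementally
mass_sparse = float(np.dot(doc_counts, phi_w[doc_topics]))

u = rng.random() * (mass_dense + mass_sparse)
if u < mass_sparse:                                   # sparse branch: touches only K_d topics
    cdf = np.cumsum(doc_counts * phi_w[doc_topics])
    k = int(doc_topics[np.searchsorted(cdf, u)])
else:                                                 # dense prior branch: O(K) fallback
    cdf = np.cumsum(alpha * phi_w)
    k = int(np.searchsorted(cdf, u - mass_sparse))
print(k)
```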
... Wang et al. ignored tags assigned to fewer than 50 documents when building a Stack Overflow tag prediction model [32]. Also, since the computational cost of topic models grows superlinearly with the number of tags (or topics) [33], by examining the tag frequency distribution in Figure 3, we filter out tags assigned to fewer than 250 Stack Overflow questions. ...
Article
Full-text available
Developers rely on online Q&A forums to look up technical solutions, to pose questions on implementation problems, and to enhance their community profile by contributing answers. Many popular developer communication platforms, such as the Stack Overflow Q&A forum, require threads of discussion to be tagged by their contributors for easier lookup in both asking and answering questions. In this paper, we propose to leverage Stack Overflow’s tags to create a hierarchical organization of concepts discussed on this platform. The resulting concept hierarchy couples tags with a model of their relevancy to prospective questions and answers. For this purpose, we configure and apply a supervised multi-label hierarchical topic model to Stack Overflow questions and demonstrate the quality of the model in several ways: by identifying tag synonyms, by tagging previously unseen Stack Overflow posts, and by exploring how the hierarchy could aid exploratory searches of the corpus. The results suggest that when traversing the inferred hierarchical concept model of Stack Overflow the questions become more specific as one explores down the hierarchy and more diverse as one jumps to different branches. The results also indicate that the model is an improvement over the baseline for the detection of tag synonyms and that the model could enhance existing ensemble methods for suggesting tags for new questions. The paper indicates that the concept hierarchy as a modeling imperative can create a useful representation of the Stack Overflow corpus. This hierarchy can be in turn integrated into development tools which rely on information retrieval and natural language processing, and thereby help developers more efficiently navigate crowd-sourced online documentation.
... It may seem that the use of hierarchical topic modeling methods [18] removes the need to specify the number of topics in advance and can thus be superior to using ensembles of flat topic models. However, hierarchical topic models have their problems [19], [20], [21]. One problem is that they tend to generate large hierarchies where the overall number of topics is very high. ...
Article
We define behavior as a set of actions performed by some agent during a period of time. We consider the problem of analyzing a large collection of behaviors by multiple agents, more specifically, identifying typical behaviors as well as spotting behavior anomalies. We propose an approach leveraging topic modeling techniques -- LDA (Latent Dirichlet Allocation) Ensembles -- for representing categories of typical behaviors by topics obtained through applying topic modeling to a behavior collection. When such methods are applied to text documents, the goodness of the extracted topics is usually judged based on the semantic relatedness of the terms pertinent to the topics. This criterion, however, may not be applicable to topics extracted from non-textual data, such as action sets, since relationships between actions may not be obvious. We have developed a suite of visual and interactive techniques supporting the construction of an appropriate combination of topics based on other criteria, such as distinctiveness and coverage of the behavior set. Our case studies in the operation behaviors in the security management system and visiting behaviors in an amusement park and the expert evaluation of the first case study demonstrate the effectiveness of our approach.
Article
Learning topic hierarchies from a multi-domain corpus is crucial in topic modeling as it reveals valuable structural information embedded within documents. Despite the extensive literature on hierarchical topic models, effectively discovering inter-topic correlations and differences among subtopics at the same level in the topic hierarchy, obtained from multiple domains, remains an unresolved challenge. This paper proposes an enhanced nested Chinese restaurant process (nCRP), nCRP+, by introducing an additional mechanism based on Chinese restaurant franchise (CRF) for aspect-sharing pattern extraction in the original nCRP. Subsequently, by employing the distribution extracted from nCRP+ as the prior distribution for topic hierarchy in the hierarchical Dirichlet processes (HDP), we develop a hierarchical topic model for multi-domain corpus, named rHDP. We describe the model with the analogy of Chinese restaurant franchise based on the central kitchen and propose a hierarchical Gibbs sampling scheme to infer the model. Our method effectively constructs well-established topic hierarchies, accurately reflecting diverse parent-child topic relationships, explicit topic aspect sharing correlations for inter-topics, and differences between these shared topics. To validate the efficacy of our approach, we conduct experiments using a renowned public dataset and an online collection of Chinese financial documents. The experimental results confirm the superiority of our method over the state-of-the-art techniques in identifying multi-domain topic hierarchies, according to multiple evaluation metrics.
Article
Topic models have been applied to everything from books to newspapers to social media posts in an effort to identify the most prevalent themes of a text corpus. We provide an in-depth analysis of unsupervised topic models from their inception to today. We trace the origins of different types of contemporary topic models, beginning in the 1990s, and we compare their proposed algorithms, as well as their different evaluation approaches. Throughout, we also describe settings in which topic models have worked well and areas where new research is needed, setting the stage for the next generation of topic models.
Conference Paper
Full-text available
In the past decade, a number of advances in topic modeling have produced sophisticated models that are capable of generating hierarchies of topics. One challenge for these models is scalability: they are incapable of working at the massive scale of millions of documents and hundreds of thousands of terms. We address this challenge with a technique that learns a hierarchy of topics by iteratively applying topic models and processing subtrees of the hierarchy in parallel. This approach has a number of scalability advantages compared to existing techniques, and shows promising results in experiments assessing runtime and human evaluations of quality. We detail extensions to this approach that may further improve hierarchical topic modeling for large-scale applications.

1 Motivation

With massive datasets and corresponding computational resources readily available, the Big Data movement aims to provide deep insights into real-world data. Realizing this goal can require new approaches to well-studied problems. Complex models that, for example, incorporate many dependencies between parameters have alluring results for small datasets and single machines but are difficult to adapt to the Big Data paradigm. Topic models are an interesting example of this phenomenon. In the last decade, a number of sophisticated techniques have been developed to model collections of text, from Latent Dirichlet Allocation (LDA) [1] through extensions using statistical machinery such as the nested Chinese Restaurant Process [2][3] and Pachinko Allocation [4]. One strength of such approaches is the ability to model topics in a hierarchical fashion. Coarse and fine topics are learned jointly, creating a synergy at multiple levels from shared parameters. However, past experiments have focused on small corpora, such as sampled abstracts from journals with only thousands of documents and terms. Operating on the scale of text collections the size of Wikipedia (millions of documents with millions of terms) is beyond such models using current inference techniques. While complex models have shown strong analytical results, simpler models that use fewer parameters and produce less structured output are being used at scale. In recent years, many parallel versions of LDA (e.g., [5][6][7]) have been developed, capable of summarizing web-scale collections. Although these tools are a boon for analyzing massive amounts of data, they lack the ability to learn a hierarchical representation of topics. This situation highlights a common dilemma in Big Data: whether we can have our cake (models with rich output) and eat it too (operate on massive datasets). We propose a solution that addresses this dilemma, reusing infrastructure from an existing parallel implementation of LDA in a novel way. Our method provides a scalable mechanism for learning hierarchies from large text collections. We learn top-down hierarchies, first learning topic models at a coarse level, then splitting the corpus into these learned topics and iteratively learning subtopics.
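The top-down scheme is straightforward to express as recursion: fit a flat topic model, split the corpus by dominant topic, and recurse on each subset (subtrees can be processed in parallel). In the sketch below, flat_topic_model is a hypothetical stand-in for any LDA implementation and all thresholds are illustrative.

```python
# Minimal sketch of top-down hierarchy learning: run a flat topic model, partition the
# documents by their dominant topic, and recurse on each partition. Illustrative only.
import random

def flat_topic_model(docs, num_topics):
    """Stand-in: assign each document a dominant topic id (a real system would run LDA)."""
    return [random.randrange(num_topics) for _ in docs]

def build_hierarchy(docs, num_topics=5, min_docs=50, depth=0, max_depth=3):
    if depth >= max_depth or len(docs) < min_docs:
        return {"docs": docs, "children": []}
    assignments = flat_topic_model(docs, num_topics)
    children = []
    for k in range(num_topics):
        subset = [d for d, z in zip(docs, assignments) if z == k]
        if subset:                       # subtrees could be processed in parallel
            children.append(build_hierarchy(subset, num_topics, min_docs, depth + 1, max_depth))
    return {"docs": docs, "children": children}

tree = build_hierarchy([f"doc{i}" for i in range(500)])
print(len(tree["children"]))
```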
Article
Full-text available
This paper presents a methodology for creating streaming, distributed inference algorithms for Bayesian nonparametric (BNP) models. In the proposed framework, processing nodes receive a sequence of data minibatches, compute a variational posterior for each, and make asynchronous streaming updates to a central model. In contrast to previous algorithms, the proposed framework is truly streaming, distributed, asynchronous, learning-rate-free, and truncation-free. The key challenge in developing the framework, arising from the fact that BNP models do not impose an inherent ordering on their components, is finding the correspondence between minibatch and central BNP posterior components before performing each update. To address this, the paper develops a combinatorial optimization problem over component correspondences, and provides an efficient solution technique. The paper concludes with an application of the methodology to the DP mixture model, with experimental results demonstrating its practical scalability and performance.
Article
Full-text available
Learning meaningful topic models with massive document collections which contain millions of documents and billions of tokens is challenging because of two reasons: First, one needs to deal with a large number of topics (typically in the order of thousands). Second, one needs a scalable and efficient way of distributing the computation across multiple machines. In this paper we present a novel algorithm, F+Nomad LDA, which simultaneously tackles both these problems. In order to handle a large number of topics we use an appropriately modified Fenwick tree. This data structure allows us to sample from a multinomial distribution over $T$ items in $O(\log T)$ time. Moreover, when topic counts change the data structure can be updated in $O(\log T)$ time. In order to distribute the computation across multiple processors, we present a novel asynchronous framework inspired by the Nomad algorithm of Yun et al. (2013). We show that F+Nomad LDA significantly outperforms the state-of-the-art on massive problems which involve millions of documents, billions of words, and thousands of topics.
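The Fenwick-tree idea is easy to demonstrate on its own: keep unnormalized topic weights in a binary indexed tree so that a single weight can be changed in O(log T) and an index can be drawn proportionally to the weights in O(log T). The sketch below shows just this building block, not F+Nomad LDA itself.

```python
# Fenwick (binary indexed) tree supporting O(log n) point updates and O(log n) draws
# proportional to the stored weights. Illustrative building block only.
import random

class FenwickSampler:
    def __init__(self, weights):
        self.n = len(weights)
        self.tree = [0.0] * (self.n + 1)
        for i, w in enumerate(weights):
            self.add(i, w)

    def add(self, i, delta):                 # O(log n) point update (e.g. topic count change)
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & (-i)

    def total(self):
        s, i = 0.0, self.n
        while i > 0:
            s += self.tree[i]
            i -= i & (-i)
        return s

    def sample(self):                        # O(log n) draw proportional to the weights
        u = random.random() * self.total()
        idx, bit = 0, 1
        while bit * 2 <= self.n:
            bit *= 2
        while bit:
            nxt = idx + bit
            if nxt <= self.n and self.tree[nxt] < u:
                u -= self.tree[nxt]
                idx = nxt
            bit //= 2
        return idx                           # 0-based index of the sampled item

s = FenwickSampler([0.5, 3.0, 1.5, 2.0])
s.add(2, 4.0)                                # a topic count changes -> O(log T) update
print(s.sample())
```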
Article
Full-text available
Gibbs sampling is a workhorse for Bayesian inference but has several limitations when used for parameter estimation, and is often much slower than non-sampling inference methods. SAME (State Augmentation for Marginal Estimation) (Doucet et al., 1999; 2002) is an approach to MAP parameter estimation which gives improved parameter estimates over direct Gibbs sampling. SAME can be viewed as cooling the posterior parameter distribution and allows annealed search for the MAP parameters, often yielding very high quality (lower loss) estimates. But it does so at the expense of additional samples per iteration and generally slower performance. On the other hand, SAME dramatically increases the parallelism in the sampling schedule, and is an excellent match for modern (SIMD) hardware. In this paper we explore the application of SAME to graphical model inference on modern hardware. We show that combining SAME with factored sample representation (or approximation) gives throughput competitive with the fastest symbolic methods, but with potentially better quality. We describe experiments on Latent Dirichlet Allocation, achieving speeds similar to the fastest reported methods (online Variational Bayes) and lower cross-validated loss than other LDA implementations. The method is simple to implement and should be applicable to many other models.
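A toy illustration of the replication idea: drawing m independent assignments per token from the same conditional and averaging them yields lower-variance count statistics, which is the mechanism behind the "cooling" of the parameter posterior. This sketch only shows the replication step, not the full SAME update or its annealing schedule.

```python
# Replicated sampling: instead of one hard assignment per token, draw m independent
# assignments from the same conditional and average them, giving lower-variance counts.
# Sketch of the mechanism only; not the paper's exact update rules.
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.6, 0.3, 0.1])          # conditional over 3 topics for one token

def replicated_counts(m):
    draws = rng.choice(3, size=m, p=p)
    return np.bincount(draws, minlength=3) / m   # fractional counts contributed by this token

print(replicated_counts(1))    # a single hard assignment (ordinary Gibbs)
print(replicated_counts(100))  # close to p itself: much lower-variance statistics
```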
Article
Full-text available
Latent Dirichlet allocation (LDA) is a popular topic modeling technique in academia but less so in industry, especially in large-scale applications involving search engines and online advertisement systems. A main underlying reason is that the topic models used have been too small in scale to be useful; for example, some of the largest LDA models reported in the literature have up to $10^3$ topics, which can hardly cover the long-tail semantic word sets. In this paper, we show that the number of topics is a key factor that can significantly boost the utility of a topic-modeling system. In particular, we show that a "big" LDA model with at least $10^5$ topics inferred from $10^9$ search queries can achieve significant improvements on an industrial search engine and an online advertising system, both of which serve hundreds of millions of users. We develop a novel distributed system called Peacock to learn big LDA models from big data. The main features of Peacock include hierarchical parallel architecture, real-time prediction, and topic de-duplication. We empirically demonstrate that the Peacock system is capable of providing significant benefits via highly scalable LDA topic models for several industrial applications.
Article
Full-text available
We develop a nested hierarchical Dirichlet process (nHDP) for hierarchical topic modeling. The nHDP is a generalization of the nested Chinese restaurant process (nCRP) that allows each word to follow its own path to a topic node according to a document-specific distribution on a shared tree. This alleviates the rigid, single-path formulation of the nCRP, allowing a document to more easily express thematic borrowings as a random effect. We derive a stochastic variational inference algorithm for the model, in addition to a greedy subtree selection method for each document, which allows for efficient inference using massive collections of text documents. We demonstrate our algorithm on 1.8 million documents from The New York Times and 3.3 million documents from Wikipedia.
Conference Paper
Full-text available
We present a first lock-free design and implementation of a dynamically resizable array (vector). The most extensively used container in the C++ Standard Template Library (STL) is vector, offering a combination of dynamic memory management and constant-time random access. Our approach is based on a single 32-bit word atomic compare-and-swap (CAS) instruction. It provides a linearizable and highly parallelizable STL-like interface, lock-free memory allocation and management, and fast execution. Our current implementation is designed to be most efficient on multi-core architectures. Experiments on a dual-core Intel processor with shared L2 cache indicate that our lock-free vector outperforms its lock-based STL counterpart and the latest concurrent vector implementation provided by Intel by a large factor. The performance evaluation on a quad dual-core AMD system with non-shared L2 cache demonstrated timing results comparable to the best available lock-based techniques. The presented design implements the most common STL vector interfaces, namely random access read and write, tail insertion and deletion, pre-allocation of memory, and query of the container's size. Using the current implementation, a user has to avoid one particular ABA problem. Keywords: lock-free, STL, C++, vector, concurrency, real-time systems
Conference Paper
Full-text available
We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
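The generative process summarized here is compact enough to write out directly. The numpy sketch below draws a toy corpus from LDA with symmetric Dirichlet priors; all sizes are illustrative.

```python
# Minimal numpy sketch of the LDA generative process with symmetric priors and toy sizes:
# draw per-document topic proportions, per-topic word distributions, then tokens.
import numpy as np

rng = np.random.default_rng(0)
D, K, V, N = 5, 3, 20, 30                     # documents, topics, vocabulary, tokens per doc
alpha, beta = 0.5, 0.1

theta = rng.dirichlet([alpha] * K, size=D)    # document-topic proportions
phi = rng.dirichlet([beta] * V, size=K)       # topic-word distributions

corpus = []
for d in range(D):
    z = rng.choice(K, size=N, p=theta[d])     # topic assignment per token
    w = np.array([rng.choice(V, p=phi[k]) for k in z])
    corpus.append(w)
print(corpus[0][:10])
```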
Conference Paper
Full-text available
We propose the hierarchical Dirichlet process (HDP), a nonparametric Bayesian model for clustering problems involving multiple groups of data. Each group of data is modeled with a mixture, with the number of components being open-ended and inferred automatically by the model. Further, components can be shared across groups, allowing dependencies across groups to be modeled effectively as well as conferring generalization to new groups. Such grouped clustering problems occur often in practice, e.g. in the problem of topic discovery in document corpora. We report experimental results on three text corpora showing the effective and superior performance of the HDP over previous models.
Article
Full-text available
LIBLINEAR is an open source library for large-scale linear classification. It supports logistic regression and linear support vector machines. We provide easy-to-use command-line tools and library calls for users and developers. Comprehensive documents are available for both beginners and advanced users. Experiments demonstrate that LIBLINEAR is very efficient on large sparse data sets.
Article
Full-text available
We present the nested Chinese restaurant process (nCRP), a stochastic process which assigns probability distributions to infinitely-deep, infinitely-branching trees. We show how this stochastic process can be used as a prior distribution in a Bayesian nonparametric model of document collections. Specifically, we present an application to information retrieval in which documents are modeled as paths down a random tree, and the preferential attachment dynamics of the nCRP leads to clustering of documents according to sharing of topics at multiple levels of abstraction. Given a corpus of documents, a posterior inference algorithm finds an approximation to a posterior distribution over trees, topics and allocations of words to levels of the tree. We demonstrate this algorithm on collections of scientific abstracts from several journals. This model exemplifies a recent trend in statistical machine learning--the use of Bayesian nonparametric methods to infer distributions on flexible data structures.
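The nCRP prior is simple to simulate: at each level below the root, a document either follows an existing child, with probability proportional to how many earlier documents passed through it, or opens a new child with probability proportional to a concentration parameter gamma. A small Python sketch (variable names are illustrative):

```python
# Sketch of drawing a document's path from the nested Chinese restaurant process.
import random
from collections import defaultdict

gamma, L = 1.0, 3                  # concentration parameter and path length (root + L-1 levels)
children = defaultdict(list)       # node id -> list of child node ids
passes = defaultdict(int)          # node id -> number of documents that passed through it
next_id = [1]                      # node 0 is the root

def sample_path():
    node, path = 0, [0]
    for _ in range(L - 1):
        kids = children[node]
        weights = [passes[c] for c in kids] + [gamma]   # existing children, then "new table"
        r = random.uniform(0, sum(weights))
        acc, choice = 0.0, None
        for c, w in zip(kids + [None], weights):
            acc += w
            if r <= acc:
                choice = c
                break
        if choice is None:                              # open a new restaurant (new child node)
            choice = next_id[0]; next_id[0] += 1
            children[node].append(choice)
        passes[choice] += 1
        node = choice
        path.append(node)
    return path

for _ in range(5):
    print(sample_path())
```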
Article
Latent Dirichlet Allocation (LDA) is a popular tool for analyzing discrete count data such as text and images. Applications require LDA to handle both large datasets and a large number of topics. Though distributed CPU systems have been used, GPU-based systems have emerged as a promising alternative because of the high computational power and memory bandwidth of GPUs. However, existing GPU-based LDA systems cannot support a large number of topics because they use algorithms on dense data structures whose time and space complexity is linear to the number of topics. In this paper, we propose SaberLDA, a GPU-based LDA system that implements a sparsity-aware algorithm to achieve sublinear time complexity and scales well to learn a large number of topics. To address the challenges introduced by sparsity, we propose a novel data layout, a new warp-based sampling kernel, and an efficient sparse count matrix updating algorithm that improves locality, makes efficient utilization of GPU warps, and reduces memory consumption. Experiments show that SaberLDA can learn from billions-token-scale data with up to 10,000 topics, which is almost two orders of magnitude larger than that of the previous GPU-based systems. With a single GPU card, SaberLDA is able to learn 10,000 topics from a dataset of billions of tokens in a few hours, which is only achievable with clusters with tens of machines before.
Conference Paper
Learning meaningful topic models with massive document collections which contain millions of documents and billions of tokens is challenging because of two reasons. First, one needs to deal with a large number of topics (typically on the order of thousands). Second, one needs a scalable and efficient way of distributing the computation across multiple machines. In this paper, we present a novel algorithm, F+Nomad LDA, which simultaneously tackles both these problems. In order to handle a large number of topics we use an appropriately modified Fenwick tree. This data structure allows us to sample from a multinomial distribution over T items in O(log T) time. Moreover, when topic counts change the data structure can be updated in O(log T) time. In order to distribute the computation across multiple processors, we present a novel asynchronous framework inspired by the Nomad algorithm of Yun et al. (2014). We show that F+Nomad LDA significantly outperforms recent state-of-the-art topic modeling approaches on massive problems which involve millions of documents, billions of words, and thousands of topics.
Conference Paper
When building large-scale machine learning (ML) programs, such as massive topic models or deep neural networks with up to trillions of parameters and training examples, one usually assumes that such massive tasks can only be attempted with industrial-sized clusters with thousands of nodes, which are out of reach for most practitioners and academic researchers. We consider this challenge in the context of topic modeling on web-scale corpora, and show that with a modest cluster of as few as 8 machines, we can train a topic model with 1 million topics and a 1-million-word vocabulary (for a total of 1 trillion parameters), on a document collection with 200 billion tokens --- a scale not yet reported even with thousands of machines. Our major contributions include: 1) a new, highly-efficient O(1) Metropolis-Hastings sampling algorithm, whose running cost is (surprisingly) agnostic of model size, and empirically converges nearly an order of magnitude more quickly than current state-of-the-art Gibbs samplers; 2) a model-scheduling scheme to handle the big model challenge, where each worker machine schedules the fetch/use of sub-models as needed, resulting in a frugal use of limited memory capacity and network bandwidth; 3) a differential data-structure for model storage, which uses separate data structures for high- and low-frequency words to allow extremely large models to fit in memory, while maintaining high inference speed. These contributions are built on top of the Petuum open-source distributed ML framework, and we provide experimental evidence showing how this development puts massive data and models within reach on a small cluster, while still enjoying proportional time cost reductions with increasing cluster size.
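Per-token O(1) Metropolis-Hastings samplers of this kind typically draw proposals from alias tables that are rebuilt only occasionally. The sketch below shows Walker's alias method, the standard way to draw from a fixed discrete distribution in O(1); it illustrates the building block rather than the system's actual implementation.

```python
# Walker's alias method: O(n) table construction, O(1) per draw. Illustrative building block.
import random

def build_alias(probs):
    n = len(probs)
    scaled = [p * n for p in probs]
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    prob, alias = [0.0] * n, [0] * n
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l          # bin s keeps scaled[s] of its own mass
        scaled[l] -= (1.0 - scaled[s])            # donor l gives up the rest
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:                       # leftovers are (numerically) exactly 1
        prob[i] = 1.0
    return prob, alias

def draw(prob, alias):
    i = random.randrange(len(prob))               # O(1): one uniform bin plus one coin flip
    return i if random.random() < prob[i] else alias[i]

prob, alias = build_alias([0.5, 0.2, 0.2, 0.1])
print([draw(prob, alias) for _ in range(10)])
```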
Conference Paper
Gibbs sampling is a workhorse for Bayesian inference but has several limitations when used for parameter estimation, and is often much slower than non-sampling inference methods. SAME (State Augmentation for Marginal Estimation) [15, 8] is an approach to MAP parameter estimation which gives improved parameter estimates over direct Gibbs sampling. SAME can be viewed as cooling the posterior parameter distribution and allows annealed search for the MAP parameters, often yielding very high quality estimates. But it does so at the expense of additional samples per iteration and generally slower performance. On the other hand, SAME dramatically increases the parallelism in the sampling schedule, and is an excellent match for modern (SIMD) hardware. In this paper we explore the application of SAME to graphical model inference on modern hardware. We show that combining SAME with factored sample representation (or approximation) gives throughput competitive with the fastest symbolic methods, but with potentially better quality. We describe experiments on Latent Dirichlet Allocation, achieving speeds similar to the fastest reported methods (online Variational Bayes) and lower cross-validated loss than other LDA implementations. The method is simple to implement and should be applicable to many other models.
Article
This paper presents a visual analytics approach to analyzing a full picture of relevant topics discussed in multiple sources, such as news, blogs, or micro-blogs. The full picture consists of a number of common topics covered by multiple sources, as well as distinctive topics from each source. Our approach models each textual corpus as a topic graph. These graphs are then matched using a consistent graph matching method. Next, we develop a level-of-detail (LOD) visualization that balances both readability and stability. Accordingly, the resulting visualization enhances the ability of users to understand and analyze the matched graph from multiple perspectives. By incorporating metric learning and feature selection into the graph matching algorithm, we allow users to interactively modify the graph matching result based on their information needs. We have applied our approach to various types of data, including news articles, tweets, and blog data. Quantitative evaluation and real-world case studies demonstrate the promise of our approach, especially in support of examining a topic-graph-based full picture at different levels of detail.
Article
Much natural data is hierarchical in nature. Moreover, this hierarchy is often shared between different instances. We introduce the nested Chinese Restaurant Franchise Process to obtain hierarchical tree-structured representations for objects, akin to (but more general than) the nested Chinese Restaurant Process, while sharing their structure akin to the Hierarchical Dirichlet Process. Moreover, by decoupling the structure-generating part of the process from the components responsible for the observations, we are able to apply the same statistical approach to a variety of user-generated data. In particular, we model the joint distribution of microblogs and locations for Twitter users. This leads to a 40% reduction in location uncertainty relative to the best previously published results. Moreover, we model documents from the NIPS papers dataset, obtaining excellent perplexity relative to (hierarchical) Pachinko allocation and LDA.
Article
We present a truncation-free stochastic variational inference algorithm for Bayesian nonparametric models. While traditional variational inference algorithms require truncations for the model or the variational distribution, our method adapts model complexity on the fly. We studied our method with Dirichlet process mixture models and hierarchical Dirichlet process topic models on two large data sets. Our method performs better than previous stochastic variational inference algorithms.
Article
This article reviews Markov chain methods for sampling from the posterior distribution of a Dirichlet process mixture model and presents two new classes of methods. One new approach is to make Metropolis-Hastings updates of the indicators specifying which mixture component is associated with each observation, perhaps supplemented with a partial form of Gibbs sampling. The other new approach extends Gibbs sampling for these indicators by using a set of auxiliary parameters. These methods are simple to implement and are more efficient than previous ways of handling general Dirichlet process mixture models with non-conjugate priors.
Article
Developing efficient and scalable algorithms for Latent Dirichlet Allocation (LDA) is of wide interest for many applications. Previous work has developed an $O(1)$ Metropolis-Hastings sampling method for each token. However, the performance is far from optimal due to random accesses to the parameter matrices and frequent cache misses. In this paper, we propose WarpLDA, a novel $O(1)$ sampling algorithm for LDA. WarpLDA is a Metropolis-Hastings based algorithm which is designed to optimize the cache hit rate. Advantages of WarpLDA include 1) Efficiency and scalability: WarpLDA has good locality and a carefully designed partition method, and can be scaled to hundreds of machines; 2) Simplicity: WarpLDA does not have any complicated modules such as alias tables, hybrid data structures, or parameter servers, making it easy to understand and implement; 3) Robustness: WarpLDA is consistently faster than other algorithms, under various settings from small-scale to massive-scale datasets and models. WarpLDA is 5-15x faster than state-of-the-art LDA samplers, implying less cost of time and money. With WarpLDA users can learn up to one million topics from hundreds of millions of documents in a few hours, at the speed of 2G tokens per second, or learn topics from small-scale datasets in seconds.
Article
We present an MCMC sampler for Dirichlet process mixture models that can be parallelized to achieve significant computational gains. We combine a nonergodic, restricted Gibbs iteration with split/merge proposals in a manner that produces an ergodic Markov chain. Each cluster is augmented with two subclusters to construct likely split moves. Unlike some previous parallel samplers, the proposed sampler enforces the correct stationary distribution of the Markov chain without the need for finite approximations. Empirical results illustrate that the new sampler exhibits better convergence properties than current methods.
Article
We present a visual analytics approach to developing a full picture of relevant topics discussed in multiple sources such as news, blogs, or micro-blogs. The full picture consists of a number of common topics among multiple sources as well as distinctive topics. The key idea behind our approach is to jointly match the topics extracted from each source together in order to interactively and effectively analyze common and distinctive topics. We start by modeling each textual corpus as a topic graph. These graphs are then matched together with a consistent graph matching method. Next, we develop an LOD-based visualization for better understanding and analysis of the matched graph. The major feature of this visualization is that it combines a radially stacked tree visualization with a density-based graph visualization to facilitate the examination of the matched topic graph from multiple perspectives. To compensate for the deficiency of the graph matching algorithm and meet different users' needs, we allow users to interactively modify the graph matching result. We have applied our approach to various data including news, tweets, and blog data. Qualitative evaluation and a real-world case study with domain experts demonstrate the promise of our approach, especially in support of analyzing a topic-graph-based full picture at different levels of detail.
Article
Explosive growth in data and availability of cheap computing resources have sparked increasing interest in Big learning, an emerging subfield that studies scalable machine learning algorithms, systems, and applications with Big Data. Bayesian methods represent one important class of statistical methods for machine learning, with substantial recent developments on adaptive, flexible and scalable Bayesian learning. This article provides a survey of the recent advances in Big learning with Bayesian methods, termed Big Bayesian Learning, including nonparametric Bayesian methods for adaptively inferring model complexity, regularized Bayesian inference for improving the flexibility via posterior regularization, and scalable algorithms and systems based on stochastic subsampling and distributed computing for dealing with large-scale applications.
Article
We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
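In the usual notation, the generative process is: draw topic proportions \theta_d \sim \mathrm{Dirichlet}(\alpha) for each document, then for each token draw a topic z_{dn} \sim \mathrm{Mult}(\theta_d) and a word w_{dn} \sim \mathrm{Mult}(\beta_{z_{dn}}). The per-document marginal likelihood is

p(\mathbf{w}_d \mid \alpha, \beta) = \int p(\theta_d \mid \alpha) \prod_{n=1}^{N_d} \sum_{k=1}^{K} p(z_{dn} = k \mid \theta_d)\, p(w_{dn} \mid \beta_k)\, d\theta_d,

whose intractability is what motivates the variational approximation described in the abstract.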
Article
Latent Dirichlet Allocation (LDA) is a popular topic modeling technique for exploring document collections. Because of the increasing prevalence of large datasets, there is a need to improve the scalability of inference for LDA. In this paper, we introduce a novel and flexible large scale topic modeling package in MapReduce (Mr. LDA). As opposed to other techniques which use Gibbs sampling, our proposed framework uses variational inference, which easily fits into a distributed environment. More importantly, this variational implementation, unlike highly tuned and specialized implementations based on Gibbs sampling, is easily extensible. We demonstrate two extensions of the models possible with this scalable framework: informed priors to guide topic discovery and extracting topics from a multilingual corpus. We compare the scalability of Mr. LDA against Mahout, an existing large scale topic modeling package. Mr. LDA out-performs Mahout both in execution speed and held-out likelihood.
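The reason variational inference maps cleanly onto MapReduce is that the per-document E-step is embarrassingly parallel and only the topic-word statistics need to be aggregated. The sketch below is a schematic of that split, not Mr. LDA's actual interface; function names, the emitted key-value layout, and hyperparameters are illustrative assumptions.

    import numpy as np
    from scipy.special import digamma

    def map_document(doc_counts, lam, alpha, n_iter=20):
        """Mapper: per-document variational E-step given current topic-word params lam (K x V).
        doc_counts: dict {word_id: count}. Emits (word_id, partial statistics over topics)."""
        K = lam.shape[0]
        gamma = np.full(K, alpha) + sum(doc_counts.values()) / K
        Elog_beta = digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))
        for _ in range(n_iter):
            Elog_theta = digamma(gamma) - digamma(gamma.sum())
            gamma = np.full(K, alpha)
            for w, c in doc_counts.items():
                phi = np.exp(Elog_theta + Elog_beta[:, w])
                phi /= phi.sum()
                gamma += c * phi
        for w, c in doc_counts.items():                    # emit sufficient statistics for the M-step
            phi = np.exp(digamma(gamma) - digamma(gamma.sum()) + Elog_beta[:, w])
            phi /= phi.sum()
            yield w, c * phi

    def reduce_word(word_id, partial_stats, eta=0.01):
        """Reducer: sum per-document statistics for one word to update its column of lam."""
        return word_id, eta + np.sum(partial_stats, axis=0)

Each MapReduce iteration thus corresponds to one variational EM iteration, and extensions such as informed priors only change the mapper's local computation.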
Article
We make an analogy between images and statistical mechanics systems. Pixel gray levels and the presence and orientation of edges are viewed as states of atoms or molecules in a lattice-like physical system. The assignment of an energy function in the physical system determines its Gibbs distribution. Because of the Gibbs distribution, Markov random field (MRF) equivalence, this assignment also determines an MRF image model. The energy function is a more convenient and natural mechanism for embodying picture attributes than are the local characteristics of the MRF. For a range of degradation mechanisms, including blurring, nonlinear deformations, and multiplicative or additive noise, the posterior distribution is an MRF with a structure akin to the image model. By the analogy, the posterior distribution defines another (imaginary) physical system. Gradual temperature reduction in the physical system isolates low energy states ("annealing"), or what is the same thing, the most probable states under the Gibbs distribution. The analogous operation under the posterior distribution yields the maximum a posteriori (MAP) estimate of the image given the degraded observations. The result is a highly parallel "relaxation" algorithm for MAP estimation. We establish convergence properties of the algorithm and we experiment with some simple pictures, for which good restorations are obtained at low signal-to-noise ratios.
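In equation form (standard notation), the Gibbs distribution at temperature T is

\pi_T(x) = \frac{1}{Z_T} \exp\!\left(-\frac{E(x)}{T}\right),

and as T \to 0 it concentrates on the minimizers of the energy E. Sampling from \pi_T while lowering T slowly enough (the paper's convergence result uses a logarithmic schedule of the form T_k \ge c / \log(1 + k)) therefore drives the chain toward the MAP configuration; for restoration, E is the posterior energy combining the MRF prior with the data term.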
Article
A supervised topic model can use side information such as ratings or labels associated with documents or images to discover more predictive low dimensional topical representations of the data. However, existing supervised topic models predominantly employ likelihood-driven objective functions for learning and inference, leaving the popular and potentially powerful max-margin principle unexploited for seeking predictive representations of data and more discriminative topic bases for the corpus. In this paper, we propose the maximum entropy discrimination latent Dirichlet allocation (MedLDA) model, which integrates the mechanism behind the max-margin prediction models (e.g., SVMs) with the mechanism behind the hierarchical Bayesian topic models (e.g., LDA) under a unified constrained optimization framework, and yields latent topical representations that are more discriminative and more suitable for prediction tasks such as document classification or regression. The principle underlying the MedLDA formalism is quite general and can be applied for jointly max-margin and maximum likelihood learning of directed or undirected topic models when supervising side information is available. Efficient variational methods for posterior inference and parameter estimation are derived and extensive empirical studies on several real data sets are also provided. Our experimental results demonstrate qualitatively and quantitatively that MedLDA could: 1) discover sparse and highly discriminative topical representations; 2) achieve state of the art prediction performance; and 3) be more efficient than existing supervised topic models, especially for classification.
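Schematically, and in our own notation rather than the paper's exact formulation, MedLDA-style objectives regularize a variational topic-model bound with expected soft-margin constraints:

\min_{q,\; \xi \ge 0}\; \mathcal{L}_{\mathrm{LDA}}(q) + C \sum_{d} \xi_d \quad \text{s.t.} \quad \mathbb{E}_q\!\left[\text{margin of document } d\right] \ge \Delta\ell_d - \xi_d \;\; \forall d,

where \mathcal{L}_{\mathrm{LDA}}(q) is a variational bound on the negative log-likelihood of the corpus and the constraints are SVM-style margin conditions taken in expectation under the variational posterior q; the max-margin term is what pushes the learned topics toward discriminative directions.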
Article
In this paper, we examine the results of applying Term Frequency Inverse Document Frequency (TF-IDF) to determine what words in a corpus of documents might be more favorable to use in a query. As the term implies, TF-IDF calculates values for each word in a document through an inverse proportion of the frequency of the word in a particular document to the percentage of documents the word appears in. Words with high TF-IDF numbers imply a strong relationship with the document they appear in, suggesting that if that word were to appear in a query, the document could be of interest to the user. We provide evidence that this simple algorithm efficiently categorizes relevant words that can enhance query retrieval.
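A minimal sketch of one common TF-IDF weighting (raw term frequency times log(N/df)); the exact variant used by any given retrieval system may differ.

    import math
    from collections import Counter

    def tf_idf(docs):
        """docs: list of token lists. Returns one {term: weight} dict per document,
        using normalized term frequency and log(N / df) inverse document frequency."""
        N = len(docs)
        df = Counter(term for doc in docs for term in set(doc))   # document frequency
        weights = []
        for doc in docs:
            tf = Counter(doc)
            # terms appearing in every document get weight 0: frequent everywhere, hence uninformative
            weights.append({t: (c / len(doc)) * math.log(N / df[t]) for t, c in tf.items()})
        return weights

    docs = [["topic", "model", "scalable"], ["topic", "hierarchy", "tree"], ["tree", "sampler"]]
    print(tf_idf(docs)[0])

Terms concentrated in few documents receive high weights, which is the property the paper exploits for selecting query terms.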
Article
In this article, we investigate the use of a probabilistic model for unsupervised clustering in text collections. Unsupervised clustering has become a basic module for many intelligent text processing applications, such as information retrieval, text classification or information extraction. Recent proposals have been made of probabilistic clustering models, which build “soft” theme-document associations. These models allow one to compute, for each document, a probability vector whose values can be interpreted as the strength of the association between documents and clusters. As such, these vectors can also serve to project texts into a lower-dimensional “semantic” space. These models however pose non-trivial estimation problems, which are aggravated by the very high dimensionality of the parameter space. The model considered in this paper consists of a mixture of multinomial distributions over the word counts, each component corresponding to a different theme. We propose a systematic evaluation framework to contrast various estimation procedures for this model. Starting with the expectation-maximization (EM) algorithm as the basic tool for inference, we discuss the importance of initialization and the influence of other features, such as the smoothing strategy or the size of the vocabulary, thereby illustrating the difficulties incurred by the high dimensionality of the parameter space. We empirically show that, in the case of text processing, these difficulties can be alleviated by introducing the vocabulary incrementally, due to the specific profile of the word count distributions. Using the fact that the model parameters can be analytically integrated out, we finally show that Gibbs sampling on the theme configurations is tractable and compares favorably to the basic EM approach.
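A minimal EM sketch for the multinomial mixture described above; the initialization, smoothing value, and stopping rule are illustrative assumptions, and the paper's incremental-vocabulary and Gibbs-sampling variants are not reproduced here.

    import numpy as np
    from scipy.special import logsumexp

    def em_multinomial_mixture(X, K, n_iter=50, smooth=1e-2, seed=0):
        """EM for a mixture of multinomials. X: (D, V) matrix of word counts.
        Returns mixing weights pi (K,) and log word probabilities log_p (K, V)."""
        rng = np.random.default_rng(seed)
        D, V = X.shape
        resp = rng.dirichlet(np.ones(K), size=D)          # random soft initialization
        for _ in range(n_iter):
            # M-step: re-estimate parameters from soft counts, with smoothing
            # (cf. the paper's emphasis on initialization and smoothing in high dimensions)
            pi = resp.sum(axis=0) / D
            word_counts = resp.T @ X + smooth
            log_p = np.log(word_counts / word_counts.sum(axis=1, keepdims=True))
            # E-step: posterior responsibility of each theme for each document, in log space
            log_r = np.log(pi) + X @ log_p.T
            resp = np.exp(log_r - logsumexp(log_r, axis=1, keepdims=True))
        return pi, log_p

Working in log space avoids the underflow that otherwise dominates with realistic vocabulary sizes, which is part of why high dimensionality makes this estimation problem delicate.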
Conference Paper
Researchers have access to large online archives of scientific articles. As a consequence, finding relevant papers has become more difficult. Newly formed online communities of researchers sharing citations provides a new way to solve this problem. In this paper, we develop an algorithm to recommend scientific articles to users of an online community. Our approach combines the merits of traditional collaborative filtering and probabilistic topic modeling. It provides an interpretable latent structure for users and items, and can form recommendations about both existing and newly published articles. We study a large subset of data from CiteULike, a bibliography sharing service, and show that our algorithm provides a more effective recommender system than traditional collaborative filtering.
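In outline (our notation), the combined model represents each item by its topic proportions plus an item-specific offset and generates ratings from user and item latent vectors:

u_i \sim \mathcal{N}(0, \lambda_u^{-1} I_K), \qquad v_j = \theta_j + \epsilon_j, \;\; \epsilon_j \sim \mathcal{N}(0, \lambda_v^{-1} I_K), \qquad r_{ij} \sim \mathcal{N}(u_i^{\top} v_j, \; c_{ij}^{-1}),

where \theta_j are the topic proportions of article j inferred from its text and c_{ij} is a confidence weight that is larger for observed ratings. For a newly published article with no ratings, v_j defaults to its topic proportions \theta_j, which is how the model can recommend new items, while the offset \epsilon_j lets usage data correct the text-based representation for items that have been rated.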
Conference Paper
Latent variable techniques are pivotal in tasks ranging from predicting user click patterns and targeting ads to organizing the news and managing user generated content. Latent variable techniques like topic modeling, clustering, and subspace estimation provide substantial insight into the latent structure of complex data with little or no external guidance making them ideal for reasoning about large-scale, rapidly evolving datasets. Unfortunately, due to the data dependencies and global state introduced by latent variables and the iterative nature of latent variable inference, latent-variable techniques are often prohibitively expensive to apply to large-scale, streaming datasets. In this paper we present a scalable parallel framework for efficient inference in latent variable models over streaming web-scale data. Our framework addresses three key challenges: 1) synchronizing the global state which includes global latent variables (e.g., cluster centers and dictionaries); 2) efficiently storing and retrieving the large local state which includes the data-points and their corresponding latent variables (e.g., cluster membership); and 3) sequentially incorporating streaming data (e.g., the news). We address these challenges by introducing: 1) a novel delta-based aggregation system with a bandwidth-efficient communication protocol; 2) schedule-aware out-of-core storage; and 3) approximate forward sampling to rapidly incorporate new data. We demonstrate state-of-the-art performance of our framework by easily tackling datasets two orders of magnitude larger than those addressed by the current state-of-the-art. Furthermore, we provide an optimized and easily customizable open-source implementation of the framework.
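To make the first of those mechanisms concrete, the sketch below is a generic illustration of the delta-based aggregation pattern, not the authors' system or protocol; the class and function names are ours.

    from collections import defaultdict

    class DeltaSyncWorker:
        """Generic delta-based synchronization: each worker updates a stale local replica
        of the global state and accumulates deltas, which are periodically shipped and
        merged, so only changes (not the full state) cross the network."""
        def __init__(self, global_snapshot):
            self.local = dict(global_snapshot)     # stale local replica of global counts
            self.delta = defaultdict(int)          # changes made since the last sync

        def update(self, key, amount):
            self.local[key] = self.local.get(key, 0) + amount
            self.delta[key] += amount

        def flush(self):
            out, self.delta = dict(self.delta), defaultdict(int)
            return out                             # bandwidth ~ number of keys touched

    def merge_into_global(global_state, worker_deltas):
        for delta in worker_deltas:
            for key, amount in delta.items():
                global_state[key] = global_state.get(key, 0) + amount
        return global_state

Because deltas are additive, merges from many workers commute, which is what makes the protocol bandwidth-efficient and tolerant of asynchronous updates.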
Conference Paper
Latent Dirichlet allocation (LDA) and other related topic models are increasingly popular tools for summarization and manifold discovery in discrete data. However, LDA does not capture correlations between topics. In this paper, we introduce the pachinko allocation model (PAM), which captures arbitrary, nested, and possibly sparse correlations between topics using a directed acyclic graph (DAG). The leaves of the DAG represent individual words in the vocabulary, while each interior node represents a correlation among its children, which may be words or other interior nodes (topics). PAM provides a ...
Conference Paper
A natural evaluation metric for statistical topic models is the probability of held-out documents given a trained model. While exact computation of this probability is intractable, several estimators for this probability have been used in the topic modeling literature, including the harmonic mean method and empirical likelihood method. In this paper, we demonstrate experimentally that commonly-used methods are unlikely to accurately estimate the probability of held-out documents, and propose two alternative methods that are both accurate and efficient. In this paper we consider only the simplest topic model, latent Dirichlet allocation (LDA), and compare a number of methods for estimating the probability of held-out documents given a trained model. Most of the methods presented, however, are applicable to more complicated topic models. In addition to comparing evaluation methods that are currently used in the topic modeling literature, we propose several alternative methods. We present empirical results on synthetic and real-world data sets showing that the currently-used estimators are less accurate and have higher variance than the proposed new estimators.
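For concreteness, the most widely used of the criticized estimators is the harmonic mean method,

p(\mathbf{w}_d) \;\approx\; \left( \frac{1}{S} \sum_{s=1}^{S} \frac{1}{p(\mathbf{w}_d \mid \mathbf{z}^{(s)})} \right)^{-1}, \qquad \mathbf{z}^{(s)} \sim p(\mathbf{z} \mid \mathbf{w}_d),

which is consistent but can have extremely high (even infinite) variance because the average is dominated by samples with small likelihood; this instability is what motivates the more accurate estimators proposed in the paper.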
Conference Paper
Search algorithms incorporating some form of topic model have a long history in information retrieval. For example, cluster-based retrieval has been studied since the 60s and has recently produced good results in the language model framework. An approach to building topic models based on a formal generative model of documents, Latent Dirichlet Allocation (LDA), is heavily cited in the machine learning literature, but its feasibility and effectiveness in information retrieval is mostly unknown. In this paper, we study how to efficiently use LDA to improve ad-hoc retrieval. We propose an LDA-based document model within the language modeling framework, and evaluate it on several TREC collections. Gibbs sampling is employed to conduct approximate inference in LDA and the computational complexity is analyzed. We show that improvements over retrieval using cluster-based models can be obtained with reasonable efficiency.
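One common way to fold LDA into the query-likelihood framework (our notation; the paper's exact combination may differ in its smoothing details) is to interpolate a Dirichlet-smoothed document language model with an LDA-based document model:

p(w \mid d) = \lambda \left( \frac{N_d}{N_d + \mu}\, p_{ML}(w \mid d) + \frac{\mu}{N_d + \mu}\, p(w \mid \mathcal{C}) \right) + (1 - \lambda) \sum_{k=1}^{K} p(w \mid z = k)\, p(z = k \mid d),

where p(w \mid \mathcal{C}) is the collection model, and documents are ranked by p(q \mid d) = \prod_{w \in q} p(w \mid d). The LDA term rewards documents whose topics make a query word plausible even when the word itself does not occur in the document.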
Conference Paper
We develop latent Dirichlet allocation with WordNet (LDAWN), an unsupervised probabilistic topic model that includes word sense as a hidden variable. We develop a probabilistic posterior inference algorithm for simultaneously disambiguating a corpus and learning the domains in which to consider each word. Using the WordNet hierarchy, we embed the construction of Abney and Light (1999) in the topic model and show that automatically learned domains improve WSD accuracy compared to alternative contexts.
Article
We propose a split-merge Markov chain algorithm to address the problem of inefficient sampling for conjugate Dirichlet process mixture models. Traditional Markov chain Monte Carlo methods for Bayesian mixture models, such as Gibbs sampling, can become trapped in isolated modes corresponding to an inappropriate clustering of data points. This article describes a Metropolis-Hastings procedure that can escape such local modes by splitting or merging mixture components. Our Metropolis-Hastings algorithm employs a new technique in which an appropriate proposal for splitting or merging components is obtained by using a restricted Gibbs sampling scan. We demonstrate empirically that our method outperforms the Gibbs sampler in situations where two or more components are similar in structure.
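In outline (our notation), a proposed split or merge of the current clustering c into c* is accepted with the usual Metropolis-Hastings probability

a(c^{*} \mid c) = \min\!\left(1,\; \frac{q(c \mid c^{*})}{q(c^{*} \mid c)} \cdot \frac{P(c^{*})\, L(\mathbf{y} \mid c^{*})}{P(c)\, L(\mathbf{y} \mid c)} \right),

where P is the Dirichlet process prior over clusterings, L is the marginal likelihood of the data given a clustering (available in closed form under a conjugate prior), and q is the proposal density; the restricted Gibbs scan mentioned above is used both to construct a sensible split and to compute its proposal density.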
Distributed inference for Dirichlet process mixture models
  • H Ge
  • Y Chen
  • E Cam
  • M Wan
  • Z Ghahramani
Exponential stochastic cellular automata for massively parallel inference
  • M Zaheer
  • M Wick
  • J.-B Tristan
  • A Smola
  • G L Steele