Fig. 3. A parse tree from Treebank and a parse tree predicted by an induced grammar.

Source publication
Conference Paper
This paper presents a new grammar induction algorithm for probabilistic context-free grammars (PCFGs). There is an approach to PCFG induction that is based on parameter estimation. Following this approach, we apply variational Bayes to PCFGs. Variational Bayes (VB) is an approximation of Bayesian learning. It has been empirically shown th...
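To make the setting concrete, here is a standard way of writing the Bayesian PCFG that VB targets (a sketch in generic notation, not a reproduction of the paper's derivation): each nonterminal A receives a Dirichlet prior over its rule probabilities, and the intractable posterior over parameters and parse trees is approximated by a factorized (mean-field) distribution.

    \theta_A \sim \mathrm{Dir}(\alpha_A) \quad \text{for each nonterminal } A, \qquad
    p(\theta, t \mid w) \;\approx\; q(\theta)\, q(t)

The two factors are then updated in alternation, much as the inside-outside algorithm alternates expectation and maximization steps.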

Context in source publication

Context 1
... 0-CB is the ratio of sentences whose brackets are completely consistent with the correct brackets. Fig. 4 lists a subset of the induced grammar used in Fig. 3. In this example, non-terminal 15 derives "DT JJ * NN", where DT is a determiner, JJ an adjective and NN a noun. ...
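Purely to illustrate what such a pattern means (the actual induced rules appear in the paper's Fig. 4 and are not reproduced here; the nonterminal names below are hypothetical), a nonterminal deriving "DT JJ * NN" typically does so through a small recursive rule set of roughly this shape:

    X15 -> DT Y
    Y   -> JJ Y
    Y   -> NN

Here Y generates zero or more JJ symbols followed by NN, so X15 yields determiner-adjective*-noun sequences.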

Similar publications

Conference Paper
This paper presents a new approach to syntactic disambiguation based on lexicalized grammars. While existing disambiguation models decompose the probability of a parsing result into the probabilities of primitive dependencies between two words, our model selects the most ...

Citations

... This method is based on [19,20], where the grammar is induced incrementally. During each iteration of the algorithm, a new nonterminal symbol X_j is created from another nonterminal symbol X_i through a split operation. ...
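As a rough sketch of what such a split step can look like (illustrative code under assumed data structures, not the algorithm of [19,20]; `split_nonterminal` and the grammar representation are hypothetical), the new symbol inherits the old symbol's rules with slightly perturbed weights so that later re-estimation can pull the two symbols apart:

    import random

    def split_nonterminal(grammar, x_i, x_j, noise=0.01):
        """grammar: dict mapping a nonterminal to a list of (rhs_tuple, weight) rules.
        Clone x_i's rules for the new symbol x_j, with small weight perturbations."""
        def jitter(w):
            return w * (1.0 + random.uniform(-noise, noise))
        grammar[x_j] = [(rhs, jitter(w)) for rhs, w in grammar[x_i]]
        # Wherever x_i appears on a right-hand side, add a variant that uses x_j instead.
        for lhs in list(grammar):
            extra = [(tuple(x_j if s == x_i else s for s in rhs), jitter(w))
                     for rhs, w in grammar[lhs] if x_i in rhs]
            grammar[lhs].extend(extra)
        return grammar

After a split, the weights would normally be re-estimated (for example with inside-outside) before deciding whether to keep the new nonterminal.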
Article
A split-based method for weighted context-free grammar (WCFG) induction was formalised and verified on a comprehensive set of context-free languages. The WCFG is learned using a novel grammatical inference method. The proposed method learns the WCFG from both positive and negative samples, whereas the rule weights are estimated using a novel Inside-Outside Contrastive Estimation algorithm. The results showed that our approach outperforms other state-of-the-art methods in terms of F1 score.
... Such factors include tokenization, lemmatization, stemming, morphological analysis, syntax analysis, keyword selection methods, machine learning algorithms, model selection methods and parameter optimization techniques. These NLP methods can be logical or rule-based, such as inductive logic programming (Hossny et al. 2008, 2009), probabilistic, such as Bayesian classifiers and decision trees (Chien and Wu 2007; Kurihara and Sato 2006), or based on deep learning and high-performance computing (Azzam et al. 2017). This study focuses solely on the keyword selection method, as studying the combination of all factors would lead to a huge number of possibilities that are difficult to cover. ...
Article
Selecting keywords from Twitter as features to identify events is challenging due to language informality such as acronyms, misspelled words, synonyms, transliteration and ambiguous terms. In this paper, we compare and identify the best methods for selecting keywords as features for classification purposes. Specifically, we study the aspects affecting keywords as features to identify civil unrest and protests. These aspects include the word count, the word forms such as n-grams, skip-grams and bags-of-words, as well as the data association methods, including correlation and similarity techniques. To test the impact of these factors, we developed a framework that analyzed 641 days of tweets and extracted the words highly associated with event days over the same time frame. Then, we used the extracted words as features to classify any single day as either an event day or a non-event day in a specific location. In this framework, we used the same pipeline of data cleaning, preprocessing, feature selection, model learning and event classification for all combinations of keyword selection criteria. We used a Naive Bayes classifier to learn the selected features and accordingly predict the event days. The classification was tested using multiple metrics, such as accuracy, precision, recall, F-score and AUC. This study concluded that the best word form is bag-of-words (average AUC 0.72), the best word count is two (average AUC 0.74), the best feature selection method is Spearman's correlation (average AUC 0.89), and the best classifier for event detection is the Naive Bayes classifier.
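For orientation only, a minimal sketch of the kind of pipeline the abstract describes (bag-of-words day vectors, Spearman-correlation keyword selection, Naive Bayes classification); the data, names and parameter values below are placeholders, not the authors' implementation:

    import numpy as np
    from scipy.stats import spearmanr
    from sklearn.naive_bayes import BernoulliNB
    from sklearn.metrics import roc_auc_score

    def select_keywords(X, y, k=50):
        """Rank terms by absolute Spearman correlation between daily counts and
        the binary event label, and keep the top k column indices."""
        scores = np.array([abs(spearmanr(X[:, j], y)[0]) for j in range(X.shape[1])])
        return np.argsort(np.nan_to_num(scores))[::-1][:k]

    # Toy stand-in data: 641 days x 500 candidate terms, binary event labels.
    rng = np.random.default_rng(0)
    X = rng.poisson(1.0, size=(641, 500))
    y = rng.integers(0, 2, size=641)

    cols = select_keywords(X, y, k=50)
    features = (X[:, cols] > 0).astype(int)          # presence/absence of selected keywords
    clf = BernoulliNB().fit(features, y)
    auc = roc_auc_score(y, clf.predict_proba(features)[:, 1])
    print(f"illustrative in-sample AUC: {auc:.2f}")  # a real evaluation would hold out days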
... This method is strongly inspired by the works of [13,19]. In this approach, the grammar is induced incrementally. ...
Conference Paper
Protein sequence motifs are conserved amino acid patterns of biological significance. They are vital for annotating structural and functional features of proteins. Yet, the computational methods commonly used for defining sequence motifs are typically simplified linear representations neglecting the higher-order structure of the motif. The purpose of this work is to create models of sequence motifs that take into account the internal structure of the modeled fragments. The ultimate goal is to provide the community with accurate and concise models of diverse collections of remotely related amino acid sequences that share structural features. The internal structure of amino acid sequences is modeled using a novel algorithm for unsupervised learning of weighted context-free grammars (WCFG). The proposed method learns the WCFG from both positive and negative samples, whereas the rule weights are estimated using a novel Inside-Outside Contrastive Estimation algorithm. In comparison to existing approaches to learning CFGs, the new method generates more concise descriptors and provides good control of the trade-off between grammar size and specificity. The method is applied to the nicotinamide adenine dinucleotide phosphate binding site motif.
... More successful approaches to grammar induction have thus resorted to carefully-crafted auxiliary objectives (Klein and Manning, 2002), priors or non-parametric models (Kurihara and Sato, 2006; Johnson et al., 2007; Liang et al., 2007; Wang and Blunsom, 2013), and manually-engineered features (Huang et al., 2012; Golland et al., 2012) to encourage the desired structures to emerge. (Code: https://github.com/harvardnlp/compound-pcfg) ...
... In contrast to the usual Bayesian treatment of PCFGs which places priors on global rule probabilities (Kurihara and Sato, 2006; Johnson et al., 2007; Wang and Blunsom, 2013), the compound PCFG assumes a prior on local, sentence-level rule probabilities. It is therefore closely related to the Bayesian grammars studied by , who also sample local rule probabilities from a logistic normal prior for training dependency models with valence (DMV) (Klein and Manning, 2004). ...
Preprint
We study a formalization of the grammar induction problem that models sentences as being generated by a compound probabilistic context-free grammar. In contrast to traditional formulations which learn a single stochastic grammar, our context-free rule probabilities are modulated by a per-sentence continuous latent variable, which induces marginal dependencies beyond the traditional context-free assumptions. Inference in this grammar is performed by collapsed variational inference, in which an amortized variational posterior is placed on the continuous variable, and the latent trees are marginalized with dynamic programming. Experiments on English and Chinese show the effectiveness of our approach compared to recent state-of-the-art methods for grammar induction from words with neural language models.
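In outline, the generative story the abstract describes can be written as follows (notation chosen here for illustration, not taken from the paper): each sentence draws its own latent vector, which modulates the rule probabilities, and learning maximizes a collapsed ELBO in which the latent trees are summed out by the inside algorithm.

    z \sim \mathcal{N}(0, I), \qquad
    \pi_z = f_\theta(z) \ \ (\text{sentence-specific rule probabilities}), \qquad
    t \sim \mathrm{PCFG}(\pi_z), \qquad w = \mathrm{yield}(t)

    \mathcal{L}(\theta, \phi) =
    \mathbb{E}_{q_\phi(z \mid w)}\Big[\log \textstyle\sum_{t} p_\theta(t, w \mid z)\Big]
    - \mathrm{KL}\big(q_\phi(z \mid w) \,\|\, p(z)\big)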
... In natural language processing, variational inference has been used for solving problems such as parsing (Liang et al., 2007, 2009), grammar induction (Kurihara and Sato, 2006; Naseem et al., 2010; Cohen and Smith, 2010), models of streaming text (Yogatama et al., 2014), topic modeling (Blei et al., 2003), and hidden Markov models and part-of-speech tagging (Wang and Blunsom, 2013). In speech recognition, variational inference has been used to fit complex coupled hidden Markov models (Reyes-Gomez et al., 2004) and switching dynamic systems (Deng, 2004). ...
Article
One of the core problems of modern statistics is to approximate difficult-to-compute probability distributions. This problem is especially important in Bayesian statistics, which frames all inference about unknown quantities as a calculation about the posterior. In this paper, we review variational inference (VI), a method from machine learning that approximates probability distributions through optimization. VI has been used in myriad applications and tends to be faster than classical methods, such as Markov chain Monte Carlo sampling. The idea behind VI is to first posit a family of distributions and then to find the member of that family which is close to the target. Closeness is measured by Kullback-Leibler divergence. We review the ideas behind mean-field variational inference, discuss the special case of VI applied to exponential family models, present a full example with a Bayesian mixture of Gaussians, and derive a variant that uses stochastic optimization to scale up to massive data. We discuss modern research in VI and highlight important open problems. VI is powerful, but it is not yet well understood. Our hope in writing this paper is to catalyze statistical research on this widely-used class of algorithms.
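The core objective the review refers to can be stated in one line: pick the member of the approximating family Q that is closest in KL divergence to the posterior, which is equivalent to maximizing the evidence lower bound (ELBO), since log p(x) = ELBO(q) + KL(q || p(z|x)).

    q^{*} = \arg\min_{q \in \mathcal{Q}} \mathrm{KL}\big(q(z) \,\|\, p(z \mid x)\big)
    \quad\Longleftrightarrow\quad
    q^{*} = \arg\max_{q \in \mathcal{Q}} \;
    \underbrace{\mathbb{E}_{q}[\log p(x, z)] - \mathbb{E}_{q}[\log q(z)]}_{\mathrm{ELBO}(q)}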
... where ψ is the digamma function, and α_{A→φ} is the prior of the rule A → φ (Kurihara and Sato, 2006). It has previously been used for ITG learning (Zhang et al., 2008; Saers and Wu, 2013), but only with uniform priors. ...
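For context, digamma-based weights of this kind usually take the familiar Dirichlet-multinomial variational form shown below (written in the excerpt's notation as a sketch; the cited papers' exact update may differ in detail), where c(A → φ) counts how often the rule is used in a parse:

    \tilde{\pi}_{A \to \phi} =
    \exp\!\Big( \psi\big(\alpha_{A \to \phi} + \mathbb{E}_q[c(A \to \phi)]\big)
    - \psi\big(\textstyle\sum_{\phi'} \big(\alpha_{A \to \phi'} + \mathbb{E}_q[c(A \to \phi')]\big)\big) \Big)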
... We approximate the parsing model with the Variational Bayesian EM Algorithm [31], [32]. We follow the approach of Kurihara and Sato's [33] variational version of the Inside-Outside Algorithm for approximating the model parameters π in Eq. (4), because it was shown to be less prone to overfitting the data than the standard Inside-Outside algorithm. Let us summarize the variational Inside-Outside Algorithm as follows. ...
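Since the excerpt's own summary is truncated above, the following is only a compact sketch of the alternation it refers to; `expected_counts` is a placeholder standing in for the inside-outside chart computation over the corpus:

    import numpy as np
    from scipy.special import digamma

    def vb_inside_outside(expected_counts, alpha, n_iters=20):
        """alpha: dict nonterminal -> array of Dirichlet hyperparameters over its rules.
        expected_counts(pi): expected rule counts under the current weights pi."""
        pi = {A: np.full(len(a), 1.0 / len(a)) for A, a in alpha.items()}  # initial weights
        for _ in range(n_iters):
            counts = expected_counts(pi)          # E-step: inside-outside under current weights
            for A, a in alpha.items():            # M-step: digamma-based weight update
                post = a + counts[A]
                pi[A] = np.exp(digamma(post) - digamma(post.sum()))
        return pi

Note that the resulting weights are sub-normalized (they need not sum to one per nonterminal), which is a known property of this kind of variational update rather than a bug.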
Article
Developing a practical and accurate statistical parser for low-resourced languages is a hard problem, because it requires large-scale treebanks, which are expensive and labor-intensive to build from scratch. Unsupervised grammar induction theoretically offers a way to overcome this hurdle by learning hidden syntactic structures from raw text automatically. The accuracy of grammar induction, however, is still impractically low, because frequent collocations of non-linguistically associable units are commonly found, resulting in dependency attachment errors. We introduce a novel approach to building a statistical parser for low-resourced languages by using language parameters as a guide for grammar induction. The intuition of this paper is that most dependency attachment errors involve frequently used word orders, which can be captured by a small prescribed set of linguistic constraints, while the rest of the language can be learned statistically by grammar induction. We then show that covering the most frequent grammar rules via our language parameters has a strong impact on parsing accuracy in 12 languages.
... In this paper we study the ability of three models to predict reading difficulty as measured by either eye-fixation or reading times: the full-parsing model, implemented by Dirichlet-multinomial probabilistic context-free grammars (DMPCFG) (Kurihara and Sato, 2006; Johnson et al., 2007), the full-listing model, implemented by maximum a posteriori adaptor grammars (MAG) (Johnson et al., 2006), and the inference-based model, implemented by fragment grammars (FG) (O'Donnell, 2015). All three models start with the same underlying base system, a context-free grammar (CFG) specifying the space of possible syntactic derivations, and the same training data, a corpus of syntactic trees. ...
... In a Bayesian PCFG one puts Dirichlet priors Dir(α) on the rule probability vector θ, such that there is one Dirichlet parameter α_{A→α} for each rule A → α ∈ R. There are Markov Chain Monte Carlo (MCMC) and Variational Bayes procedures for estimating the posterior distribution over rule probabilities θ and parse trees given data consisting of terminal strings alone (Kurihara and Sato, 2006; Johnson et al., 2007a). ...
Conference Paper
Inspired by experimental psychological findings suggesting that function words play a special role in word learning, we make a simple modification to an Adaptor Grammar-based Bayesian word segmentation model to allow it to learn sequences of monosyllabic "function words" at the beginnings and endings of collocations of (possibly multi-syllabic) words. This modification improves unsupervised word segmentation on the standard Bernstein-Ratner (1987) corpus of child-directed English by more than 4% token f-score compared to an otherwise identical model that does not special-case "function words", setting a new state of the art of 92.4% token f-score. Our function word model assumes that function words appear at the left periphery, and while this is true of languages such as English, it is not true universally. We show that a learner can use Bayesian model selection to determine the location of function words in their language, even though the input to the model consists only of unsegmented sequences of phones. Thus our computational models support the hypothesis that function words play a special role in word learning.