Figure 1. Word Error Rate (WER) as a function of the number of example utterances used to adapt the underlying lexicon.

Source publication
Conference Paper
A lexicon containing explicit mappings between words and pronunciations is an integral part of most automatic speech recognizers (ASRs). While many ASR components can be trained or adapted using data, the lexicon is one of the few that typically remains static until experts make manual changes. This work takes a step towards alleviating the need fo...

Contexts in source publication

Context 1
... established our baseline experiments, we evaluated both the cascading recognizer approach and the PMM by varying the number of training utterances for each, and evaluating the WER of the test set against the lexicons produced under each condition. The resulting plot is shown in figure 1. It is encouraging to note that both models perform admirably, achieving expert-level pronunciations with just three example utterances. ...
Context 2
... illustrate the performance of the PMM, we plot in figure 1 the WER obtained by generating a lexicon according to equation 4 after two iterations of EM. This stopping criterion was determined by constructing a development set of 1,500 previously discarded PhoneBook utterances and running recognition using lexicons generated after each EM iteration. ...
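The stopping criterion described in this excerpt amounts to a small model-selection loop. A minimal sketch in Python, where run_em_iteration, build_lexicon, and decode_wer are hypothetical stand-ins for the EM trainer, the lexicon generation of equation 4, and recognition on the development set:

# Sketch of the dev-set stopping criterion: after each EM iteration,
# rebuild the lexicon and keep the iteration whose lexicon yields the
# lowest WER on held-out utterances. run_em_iteration, build_lexicon,
# and decode_wer are hypothetical helpers, not functions from the paper.
def select_em_iteration(model, dev_utterances, max_iters=10):
    best_wer, best_lexicon, best_iter = float("inf"), None, 0
    for i in range(1, max_iters + 1):
        model = run_em_iteration(model)            # one E+M pass over the data
        lexicon = build_lexicon(model)             # e.g., per equation 4
        wer = decode_wer(lexicon, dev_utterances)  # recognize the dev set
        if wer < best_wer:
            best_wer, best_lexicon, best_iter = wer, lexicon, i
    return best_lexicon, best_iter, best_wer

On the 1,500-utterance PhoneBook development set described above, this procedure selected two EM iterations.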

Similar publications

Conference Paper
Recently, a number of voice conversion methods have been developed. These methods attempt to improve conversion performance by using diverse mapping techniques in various acoustic domains, e.g. high-resolution spectra and low-resolution Mel-cepstral coefficients. Each individual method has its own pros and cons. In this paper, we introduce a system...
Chapter
Recently, Deep Neural Network (DNN), which is a feed-forward artificial neural network with many hidden layers, has opened a new research direction for Speech Synthesis. It can represent high-dimensional and correlated features efficiently and model highly complex mapping functions compactly. However, the research on DNN-based Mongolian speech synthes...
Conference Paper
The criteria and methods followed in preparing the acoustic mapping of the municipal territory, commissioned to ARPAV by the Municipality of Venice, are described, highlighting in particular the solutions adopted for the study and characterization of the types of noise sources that are peculiar to the lagoon city, such as the traffic...
Article
Modern optical imaging techniques demonstrate significant potential for high-resolution in vivo angiography. Optoacoustic angiography benefits from greater imaging depth compared to purely optical modalities. However, strong attenuation of the optoacoustic signal with depth poses serious challenges for adequate 3D vessel network mapping, and proper comp...

Citations

... Such weights can only be defined for words occurring in the training set, and no further training of the weights is performed. Another method, proposed in [7] and [8], is to use the EM algorithm to train these lexical probabilities, a method that seems prone to over-fitting the training data. This explains the turn towards discriminative models in recent years. ...
... However, a trained set of phonological rules has the advantage that it can provide pronunciation weights also for unseen words. Another EM training of the weights of the lexicon based on the utterances of a given word is presented in (Hazen et al., 2005) and (Badr et al., 2010). A discriminative training of the weights might generalize better and be applicable to phonological rules, as EM training has the drawbacks of finding a local maximum and of often over-fitting to the training data. ...
... This can be applied only to words present in the training set and no further training of the weights is performed. Another method proposed in (Shu and Lee Hetherington, 2002), (Hazen et al., 2005) and (Badr et al., 2010) is the EM training of the weights of the lexicon. Nevertheless, this generative method often suffers from over-fitting to the training data. ...
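For concreteness, the EM training of lexical weights that these excerpts critique can be written compactly; the notation below is a generic sketch, not any cited paper's exact formulation. Given N spoken examples u_1, ..., u_N of a word w with candidate pronunciations b and weights theta_{w,b}:

% E-step: posterior of pronunciation b for example u_i under the current weights
\gamma_i(b) = \frac{\theta_{w,b}\, p(u_i \mid b)}{\sum_{b'} \theta_{w,b'}\, p(u_i \mid b')}

% M-step: re-estimate each lexical weight as the average posterior
\hat{\theta}_{w,b} = \frac{1}{N} \sum_{i=1}^{N} \gamma_i(b)

With only a handful of examples, the weights can concentrate on whichever variants best match the acoustic idiosyncrasies of those particular recordings, which is the over-fitting behavior the excerpts point to.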
Article
This thesis addresses the problems of phonemic variability and confusability from the pronunciation modeling perspective for an automatic speech recognition (ASR) system. In particular, several research directions are investigated. First, automatic grapheme-to-phoneme (g2p) and phoneme-to-phoneme (p2p) converters are developed that generate alternative pronunciations for in-vocabulary as well as out-of-vocabulary (OOV) terms. Since the addition of alternative pronunciations may introduce homophones (or close homophones), the confusability of the system increases. A novel measure of this confusability is proposed in order to analyze it and study its relation to ASR performance. This pronunciation confusability is higher if pronunciation probabilities are not provided and can potentially severely degrade ASR performance. It should, thus, be taken into account during pronunciation generation. Discriminative training approaches are then investigated to train the weights of a phoneme confusion model that allows alternative ways of pronouncing a term, counterbalancing the phonemic confusability problem. The objective function to optimize is chosen to correspond to the performance measure of the particular task. In this thesis, two tasks are investigated, the ASR task and the Keyword Spotting (KWS) task. For ASR, an objective that minimizes the phoneme error rate is adopted. For experiments conducted on KWS, the Figure of Merit (FOM), a KWS performance measure, is directly maximized.
... Another method proposed in [7] and [8] is the EM training of the weights of the lexicon. Nevertheless, this generative method often suffers from over-fitting to the training data. ...
Article
To enhance the recognition lexicon, it is important to be able to add pronunciation variants while keeping the confusability introduced by the extra phonemic variation low. However, this confusability is not easily correlated with ASR performance, as it is an inherent phenomenon of speech. This paper proposes a method to construct a multiple-pronunciation lexicon with high discriminability. To do so, a phoneme confusion model is used to expand the phonemic search space of pronunciation variants during ASR decoding, and a discriminative framework is adopted for training the weights of the phoneme confusions. For the parameter estimation, two training algorithms are implemented, the perceptron and the CRF model, using finite state transducers. Experiments were conducted on English data using a large state-of-the-art ASR system for continuous speech.
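As a rough illustration of the perceptron variant, the sketch below trains weights over (reference phoneme, decoded phoneme) confusion-pair features on an enumerated candidate set. This is a toy reduction: the paper itself operates on finite state transducers over the full phonemic search space, and all names here are mine.

from collections import defaultdict

def confusion_features(reference, variant):
    # Count position-wise (reference phone, variant phone) confusion pairs;
    # a real system would use an edit-distance alignment instead of zip.
    feats = defaultdict(int)
    for r, v in zip(reference, variant):
        feats[(r, v)] += 1
    return feats

def perceptron_train(data, epochs=5):
    # data: list of (reference, candidates, gold_index) triples, where each
    # candidate is a phoneme sequence and gold_index marks the correct one.
    weights = defaultdict(float)
    for _ in range(epochs):
        for reference, candidates, gold in data:
            def score(c):
                return sum(weights[f] * n
                           for f, n in confusion_features(reference, c).items())
            pred = max(range(len(candidates)), key=lambda i: score(candidates[i]))
            if pred != gold:  # standard structured-perceptron update
                for f, n in confusion_features(reference, candidates[gold]).items():
                    weights[f] += n
                for f, n in confusion_features(reference, candidates[pred]).items():
                    weights[f] -= n
    return weights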
... Also, by combining the two dictionaries, we combine a knowledge-driven approach (L2S) and a data-driven approach (G2P). In the literature, there are similar efforts to use acoustic data together with conventional grapheme-to-phoneme conversion, such as [9,10], which use a multigram grapheme-to-phoneme conversion approach [11] alongside acoustic data. One major distinction between the approach presented in [9,10] and the approach investigated here is that we do not need acoustic data for a word when inferring its pronunciation model. In this work, we observe a large ASR performance difference between the baseline dictionary and the extracted dictionaries. ...
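A minimal sketch of the dictionary combination, assuming the union of pronunciation variants per word; how the cited work actually merges or weights the L2S and G2P entries is not specified in the excerpt.

def merge_lexicons(l2s, g2p):
    # l2s, g2p: dicts mapping a word to a list of pronunciations, each a
    # tuple of phones. Union the variants, keeping first-seen order.
    merged = {}
    for word in set(l2s) | set(g2p):
        seen, union = set(), []
        for pron in l2s.get(word, []) + g2p.get(word, []):
            if pron not in seen:
                seen.add(pron)
                union.append(pron)
        merged[word] = union
    return merged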
... In the Spoken Language Systems group at MIT, numerous lines of research have investigated learning certain aspects of the lexicon. [11] learns pronunciation baseforms and variants using statistically learned subword units, [12] uses linguistically motivated hybrid subword units to propose spellings for perfect phone transcripts and [13] learns both spellings and pronunciations for new words using the same linguistic subword units as well as a statistical letter model. ...
... Regardless of rank scoring, the spellneme-based method does not achieve the same performance as manually created pronunciations. Other automatic pronunciation learning approaches, e.g., [11], achieved better performance than the baseline lexicon. These approaches, however, condition the pronunciation search space on the word's assumed known spelling, which is not the case in our experiment. ...
Article
We present a framework for learning a pronunciation lexicon for an Automatic Speech Recognition (ASR) system from multiple utterances of the same training words, where the lexical identities of the words are unknown. Instead of only trying to learn pronunciations for known words we go one step further and try to learn both spelling and pronunciation in a joint optimization. Decoding based on linguistically motivated hybrid subword units generates the joint lexical search space, which is reduced to the most appropriate lexical entries based on a set of simple pruning techniques. A cascade of letter and acoustic pruning, followed by re-scoring N-best hypotheses with discriminative decoder statistics resulted in optimal lexical entries in terms of both spelling and pronunciation. Evaluating the framework on English isolated word recognition, we achieve reductions of 7.7% absolute on word error rate and 20.9% absolute on character error rate over baselines that use no pruning.
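The cascade this abstract describes can be pictured as successive filters followed by rescoring. In the sketch below the thresholds and the three scoring callables are placeholders, not the paper's actual decoder statistics.

def prune_and_rescore(hypotheses, letter_score, acoustic_score, rescore,
                      letter_thresh, acoustic_thresh, n_best=10):
    # Letter pruning, then acoustic pruning, then discriminative rescoring
    # of the surviving N-best joint (spelling, pronunciation) hypotheses.
    survivors = [h for h in hypotheses if letter_score(h) >= letter_thresh]
    survivors = [h for h in survivors if acoustic_score(h) >= acoustic_thresh]
    survivors.sort(key=acoustic_score, reverse=True)
    return max(survivors[:n_best], key=rescore, default=None)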
... In [11], we use the joint distribution learned from the lexicon as described above to seed a pronunciation mixture model (PMM), which employs EM to iteratively update a set of parameters based on example utterances. With the PMM approach, we learn a distribution over pronunciations for a particular word by treating them as components in a mixture and aggregating the posteriors across spoken examples. ...
... We now extend the Pronunciation Mixture Model (PMM) framework developed for isolated word recognition [11] to learn the appropriate weights that can model P(B|W) in continuous speech. ...
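The posterior aggregation these excerpts describe reduces to a short EM loop over a word's spoken examples. A minimal sketch, assuming the log acoustic likelihoods log p(u_i | b) are supplied by a recognizer (here they are simply an input); the default of two iterations mirrors the stopping criterion quoted earlier.

import math

def pmm_em(log_likes, prons, iterations=2):
    # log_likes[i][b] = log p(u_i | b) for each spoken example u_i and
    # candidate pronunciation b; theta is the mixture weight per pronunciation.
    theta = {b: 1.0 / len(prons) for b in prons}      # uniform initialization
    for _ in range(iterations):
        totals = {b: 0.0 for b in prons}
        for ll in log_likes:                          # E-step per utterance
            logs = {b: math.log(theta[b]) + ll[b] for b in prons}
            m = max(logs.values())                    # log-sum-exp for stability
            z = sum(math.exp(v - m) for v in logs.values())
            for b in prons:
                totals[b] += math.exp(logs[b] - m) / z
        theta = {b: totals[b] / len(log_likes) for b in prons}  # M-step: average posteriors
    return theta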
Conference Paper
This paper explores the use of continuous speech data to learn stochastic lexicons. Building on previous work in which we augmented graphones with acoustic examples of isolated words, we extend our pronunciation mixture model framework to two domains containing spontaneous speech: a weather information retrieval spoken dialogue system and the academic lectures domain. We find that our learned lexicons out-perform expert, hand-crafted lexicons in each domain. Index Terms: grapheme-to-phoneme conversion, pronunciation models, lexical representation
... In recent years, n-gram graphone models have been used in pronunciation generation tasks for several languages, including English, German and French [1,2,3]. In these graphone-based pronunciation generation tasks, an n-gram graphone model was usually trained on a large general dictionary of entries. ...
... As can be seen in Figure 1, a large difference in log likelihood between this interpolated model and the NBW-only model (λ = 1.0) shows the benefit of the interpolated approach. In generating pronunciations for NBW words from the interpolated model, the most likely graphone sequences and the corresponding phoneme sequences are found. The posterior probability of each graphone sequence was used as the pronunciation probability of the corresponding inferred pronunciation for the given word. ...
... The number of pronunciation variants for each NBW word in the Treebank source was examined, and it was found that for 96.7% of the words the number of pronunciation variants was no more than five. [Footnotes from the citing paper: (1) The LDC catalog numbers of the 4 text data are LDC2005T02, LDC2005T20, LDC2005T30 and LDC2004T0, and the 4 acoustic data are GALE-Y1Q1, GALE-Y1Q2, GALE-Y1Q3 and GALE-Y1Q4. (2) The distributed Treebank pronunciations for these BW words were consistent with those generated by BMA. (3) The BW-only model gave a much poorer log likelihood of -38.6.] ...
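The interpolation compared in these excerpts can be sketched as a two-component mixture of graphone models. Whether the cited work mixes per n-gram or per whole sequence is not stated in the excerpt, so the sketch below mixes whole-sequence probabilities; p_nbw and p_general are placeholder callables for the NBW-trained and general graphone models.

import math

def interpolated_log_prob(seq, p_nbw, p_general, lam):
    # lam = 1.0 reduces to the NBW-only model discussed above.
    return math.log(lam * p_nbw(seq) + (1.0 - lam) * p_general(seq))

def corpus_log_likelihood(seqs, p_nbw, p_general, lam):
    # Sum over held-out graphone sequences to reproduce the log-likelihood
    # comparison mentioned in the excerpt.
    return sum(interpolated_log_prob(s, p_nbw, p_general, lam) for s in seqs)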
... An alternative is to directly estimate the pronunciation from speech samples containing the OOV word. Previous works that take this approach make use of the phone lattice or the phoneme recognition output derived from the speech samples, sometimes constraining the pronunciations with a g2p model or language-specific rules [1,2,3,4,5,6,7]. ...
... The approach we describe is similar to the pronunciation mixture model in [7], except that we do not have access to the acoustic models or phone lattices, only the word recognition mistakes. We focus on the task of isolated word recognition; however, this model can be easily extended to continuous speech with a few minor modifications. ...
... In the Spoken Language Systems group at MIT, numerous lines of research have investigated learning certain aspects of the lexicon. [11] learns pronunciation baseforms and variants using statistically learned subword units, [12] uses linguistically motivated hybrid subword units to propose spellings for perfect phone transcripts and [13] learns both spellings and pronunciations for new words using the same linguistic subword units as well as a statistical letter model. ...
... Regardless of rank scoring, the spellneme-based method does not achieve the same performance as manually created pronunciations. Other automatic pronunciation learning approaches, e.g., [11], achieved better performance than the baseline lexicon. These approaches, however, condition the pronunciation search space on the word's assumed known spelling, which is not the case in our experiment. ...
Article
We present a framework for learning a pronunciation lexicon for an Automatic Speech Recognition (ASR) system from multiple utterances of the same training words, where the lexical identities of the words are unknown. Instead of only trying to learn pronunciations for known words we go one step further and try to learn both spelling and pronunciation in a joint optimization. Decoding based on linguistically motivated hybrid subword units generates the joint lexical search space, which is reduced to the most appropriate lexical entries based on a set of simple pruning techniques. A cascade of letter and acoustic pruning, followed by re-scoring N-best hypotheses with discriminative decoder statistics, resulted in optimal lexical entries in terms of both spelling and pronunciation. Evaluating the framework on English isolated word recognition, we achieve reductions of 7.7% absolute on word error rate and 14.4% absolute on character error rate.
Article
In many ways, the lexicon remains the Achilles heel of modern automatic speech recognizers. Unlike stochastic acoustic and language models that learn the values of their parameters from training data, the baseform pronunciations of words in a recognizer's lexicon are typically specified manually, and do not change, unless they are edited by an expert. Our work presents a novel generative framework that uses speech data to learn stochastic lexicons, thereby taking a step towards alleviating the need for manual intervention and automatically learning high-quality pronunciations for words. We test our model on continuous speech in a weather information domain. In our experiments, we see significant improvements over a manually specified “expert-pronunciation” lexicon. We then analyze variations of the parameter settings used to achieve these gains.