Article

Korean large vocabulary continuous speech recognition with morpheme-based recognition units

Authors: Kwon and Park

Abstract

In Korean writing, a space is placed between two adjacent word-phrases, each of which generally corresponds to two or three English words in a semantic sense. If the word-phrase is used as a recognition unit for Korean large vocabulary continuous speech recognition (LVCSR), the out-of-vocabulary (OOV) rate becomes very large. If a morpheme or a syllable is used instead, a severe inter-morpheme coarticulation problem arises because of short morphemes. We propose to use a merged morpheme as the recognition unit and pronunciation-dependent entries in the language model (LM), so that we can reduce these difficulties and incorporate the between-word phonology rule into the decoding algorithm of a Korean LVCSR system. Starting from the original morpheme units defined in Korean morphology, we merge pairs of short and frequent morphemes into larger units by using a rule-based method and a statistical method. We define the merged morpheme unit as a word and use it as the recognition unit. The performance of the system was evaluated on two business-related tasks: a read speech recognition task and a broadcast news transcription task. The OOV rate was reduced to a level comparable to that of American English in both tasks. In the read speech recognition task, with a 32k vocabulary and a word-based trigram LM computed from a newspaper text corpus, the word error rate (WER) of the baseline system was reduced from 25.0% to 20.0% by cross-word modeling and pronunciation-dependent language modeling, and finally to 15.5% by enlarging the speech database and text corpora. For the broadcast news transcription task, we showed that the statistical method reduced the WER of the baseline system without morpheme merging by 3.4% relative, and that both of the proposed methods yielded similar performance. Applying all the proposed techniques, we achieved a WER of 17.6% for clean speech and 27.7% for noisy speech.
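As an illustration of the statistical merging step described in the abstract, the sketch below greedily merges the most frequent adjacent morpheme pair in which at least one member is short, rewriting the corpus with the merged unit after each pass. This is a minimal Python sketch under assumed thresholds; the paper's actual rule-based and statistical merging criteria, and its pronunciation-dependent LM entries, are more elaborate.

```python
from collections import Counter

def merge_short_frequent_morphemes(corpus, max_len=2, min_count=100, n_merges=500):
    """Greedy merging of adjacent short, frequent morpheme pairs.

    corpus: list of sentences, each a list of morpheme strings.
    Returns the corpus rewritten with merged units joined by '+'.
    Thresholds are illustrative, not the paper's values.
    """
    for _ in range(n_merges):
        pair_counts = Counter()
        for sent in corpus:
            for a, b in zip(sent, sent[1:]):
                if len(a) <= max_len or len(b) <= max_len:  # at least one short member
                    pair_counts[(a, b)] += 1
        if not pair_counts:
            break
        (a, b), count = pair_counts.most_common(1)[0]
        if count < min_count:  # no sufficiently frequent pair left
            break
        merged = a + '+' + b
        new_corpus = []
        for sent in corpus:
            out, i = [], 0
            while i < len(sent):
                if i + 1 < len(sent) and sent[i] == a and sent[i + 1] == b:
                    out.append(merged)  # replace the pair with the merged unit
                    i += 2
                else:
                    out.append(sent[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return corpus
```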


... Some examples are shown in Table 2. The surface realizations of the morphological structure are constrained and modified by a number of language phenomena such as insertion, deletion, phonetic harmony and weakening (or disharmony, assimilation) [4][5][6]. The morphemes have their standard forms and surface forms. ...
... For statistical model-based approaches, a statistical learning model can be constructed and trained on a manually prepared training corpus to extract the most probable morpheme sequences. Furthermore, there are also unsupervised segmentation approaches which split words into morpheme-like units from a raw text corpus without considering linguistic properties [6][7][8]. In this paper, we focus on morphemes which are strictly meaning-bearing units. ...
Article
Full-text available
Uyghur is an agglutinative language in which words are derived from stems (or roots) by concatenating suffixes. This property yields a very large number of morpheme combinations and greatly increases the vocabulary size, causing out-of-vocabulary (OOV) and data sparseness problems for statistical models. Words are therefore split into sub-word units for use in text and speech processing applications. Proper sub-word units not only provide high coverage and a smaller lexicon size, but also provide the semantic and syntactic information needed by downstream applications. This paper discusses a general-purpose morphological analyzer tool which can split a text of words into sequences of morphemes or syllables. Uyghur morpheme segmentation is a basic part of the comprehensive effort of Uyghur language corpus compilation. As there are no delimiters for sub-word units, a supervised method combining rules with a statistical learning algorithm is applied for morpheme segmentation. Phonetic units such as syllables and phonemes can be extracted with high accuracy by purely rule-based methods. The most common and suitable sub-word units for various applications are linguistic morphemes, since they provide linguistic information, high coverage, and a small lexicon size, and can easily be restored to words. As Uyghur is written as it is pronounced, phonetic alterations of speech are openly expressed in text. This property gives rise to many surface forms for a particular morpheme. A general-purpose morphological analyzer must be able to analyze and export in both standard and surface forms, so the morpho-phonetic alterations such as phonetic harmony, weakening, and morphological changes are summarized and learned from a training corpus. A statistical morpheme segmentation tool is trained on a corpus of aligned word-morpheme sequences and applied to predict possible morpheme sequences. For an open test set, with word coverage of 86.8% and morpheme coverage of 98.4%, the morpheme segmentation accuracy is 97.6%. The segmentation tool can output both standard forms and surface forms without sacrificing segmentation accuracy. Furthermore, the statistical properties of the various basic lexical units (word, morpheme, and syllable) are compared as part of the comprehensive effort of Uyghur language corpus compilation.
... In order to increase vocabulary coverage, subword units are used as basic units in language modeling. Subword units may be found using morphological analysis [2,3], or discovered automatically based on some criteria [4]. ...
... The results are shown in Figure 1. The out-of-vocabulary rate of 2.1% for a 60,000-unit dictionary of morphemes is similar to the results reported for other agglutinative languages [3,2]. ...
Conference Paper
This paper describes the development of a large vocabulary continuous speaker-independent speech recognition system for Estonian. Estonian is an agglutinative language in which the number of different word forms is very large; in addition, the word order is relatively unconstrained. To achieve good language coverage, we use pseudo-morphemes as basic units in a statistical trigram language model. To improve language model robustness, we automatically find morpheme classes and interpolate the morpheme model with the class-based model. The language model is trained on a newspaper corpus of 15 million word forms. Clustered triphones with multiple Gaussian mixture components are used for acoustic modeling. The system with the interpolated morpheme language model is found to perform significantly better than the baseline word-form trigram system in all areas. The word error rate of the best system is 27.7%, which is a 10.0% absolute improvement over the baseline system.
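The interpolation of the morpheme model with the class-based model mentioned above is, at its simplest, a linear mixture of the two probabilities. A minimal sketch (the interpolation weight and the class factorization are standard textbook forms, not the paper's tuned configuration):

```python
def class_based_prob(p_class_given_history, p_morph_given_class):
    # Class-based model: P(m | h) = P(c(m) | class history) * P(m | c(m)).
    return p_class_given_history * p_morph_given_class

def interpolated_prob(p_morph_ngram, p_class_based, lam=0.7):
    # Linear interpolation of the morpheme trigram and the class-based model;
    # lam is normally tuned on held-out data (0.7 is illustrative).
    return lam * p_morph_ngram + (1.0 - lam) * p_class_based
```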
... Especially for English recognition engines, the most commonly used language modelling units are words. However, if this model is used in modelling agglutinative languages like Turkish, Finnish and Korean, the OOV rate will be very high [2,3], because the lexicon cannot contain all the words. In a previous study on the statistical language modelling of Turkish [4], new language modelling units like "stems and endings", "stems and morphemes" and "syllables" were proposed. ...
Article
Full-text available
We have designed a Turkish dictation system for broadcast news applications. Turkish is an agglutinative language with free word order. These characteristics of the language result in vocabulary explosion, a large number of out-of-vocabulary (OOV) words, and complex N-gram language models in speech recognition when words are used as recognition units. Therefore, we proposed new recognition units. We parsed some of the words into smaller recognition units like stems, endings and morphemes, and introduced these smaller units, together with the unparsed words, to the speech recognizer as lexicon entries. This way, we were able to overcome the problem of a large number of OOV words with a moderate vocabulary size and to get better estimates for the N-gram language models. However, the best recognition result was obtained using the word-based language model.
... Although a series of words is generally presented to a subject in conventional research, the sequence processed as a unit in short-term storage might be at any one of the following levels: (1) letters; (2) words (these would include multiple-morpheme sequences such as farmhouse); (3) sentences; and (4) units larger than sentences. Although morphemes exist as units with minimal meaning in other languages, a severe inter-morpheme coarticulation problem can arise in Korean due to short morphemes (Kwon & Park, 2003). Therefore, the analysis unit in Korean was defined as a word in which several morphemes are merged, and the morpheme unit was excluded from this study. ...
Article
Full-text available
The purpose of this study was to explore the unit of information and the size of that unit for designing a voice user interface. Through two experiments, this study investigated what form the information should take (the unit of information) and what size it should be (the size of the unit) when people are provided information by voice interfaces. Participants were presented with a recall task (an OX quiz) based on listening to and remembering information (drawn from an encyclopedia) provided by smart speakers. Experiment 1 revealed that participants stored information in their memory span on a sentence-by-sentence basis, determining how much information they could remember. In Experiment 2, sentence-based information was presented in various sizes, and participants evaluated 17 information units consisting of up to nine words as their memory limit. This information unit-based voice interface design could help improve users' memory performance and usability.
... Auditory phonetics is the branch of phonetics that studies how the ear mechanism receives language sounds as air vibrations (Moskowitz, 1973). The perception of sound waves distinguishes lexical meanings in the language (Kwon & Park, 2003). Under the umbrella of phonology, there are two branches of knowledge, each of which is a different study. ...
Article
Full-text available
Morpheme is the smallest grammatical unit that has meaning. Traditional grammar does not recognize morpheme concepts or terms, because morphemes are not syntactic units and not all morphemes have philosophical meanings. The concept of morphemes was only introduced by structuralists at the beginning of the twentieth century. To determine whether a unit of form is a morpheme or not, we must compare the form in question with other forms. If the form turns out to recur in other forms, then it is a morpheme. In morphological studies, a form that has the status of a morpheme is usually denoted by sandwiching it between curly brackets. For example, the word book is denoted as {book}, and the word rewrite as {re} + {write}. In every language there are forms (such as words) that can be cut into smaller pieces, and those again into smaller pieces, until units remain that cannot be cut any further.
... We also compared the CER of the obtained transcriptions to confirm the effect of BeParrot on transcription accuracy. Note that CER is widely used to evaluate transcription quality in languages without space delimiters [14], including Japanese [12], because the WER calculated for such languages depends on the quality of the morphological analysis. Still, the value of CER generally correlates with that of WER, while being expected to be lower, as in Petridis et al. [26]. ...
Conference Paper
Full-text available
Transcribing speech from audio files to text is an important task, not only for exploring the audio content in text form but also for utilizing the transcribed data as a source to train speech models, such as automated speech recognition (ASR) models. A post-correction approach has frequently been employed to reduce the time cost of transcription, where users edit errors in the recognition results of ASR models. However, this approach assumes clear speech and is not designed for unclear speech (such as speech with high levels of noise or reverberation), which severely degrades the accuracy of ASR and requires many manual corrections. To construct an alternative approach to transcribing unclear speech, we introduce the idea of respeaking, which has primarily been used to create captions for television programs in real time. In respeaking, a proficient human respeaker repeats the heard speech by shadowing, and their utterances are recognized by an ASR model. While this approach can be effective for transcribing unclear speech, one problem is that respeaking is a highly cognitively demanding task, and extensive training is often required to become a respeaker. We address this point with BeParrot, the first interface designed for respeaking that allows novice users to benefit from respeaking without extensive training through two key features: parameter adjustment and pronunciation feedback. Our user study involving 60 crowd workers demonstrated that they could transcribe different types of unclear speech 32.2% faster with BeParrot than with a conventional approach, without losing transcription accuracy. In addition, comments from the workers supported the design of the adjustment and feedback features, exhibiting a willingness to continue using BeParrot for transcription tasks. Our work demonstrates how recent advances in machine learning techniques can be leveraged, with a human-in-the-loop approach, in an area that is still challenging for computers by themselves.
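The context excerpt above prefers CER over WER for languages without space delimiters because CER requires no morphological analysis. A minimal sketch of CER as edit distance normalized by the reference length (stripping spaces and this normalization are common conventions, not necessarily those of the cited papers):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance with a single-row dynamic-programming table."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,          # deletion
                                   d[j - 1] + 1,      # insertion
                                   prev + (r != h))   # substitution
    return d[len(hyp)]

def cer(ref_text, hyp_text):
    """Character error rate: edit distance over the reference length.

    Operates directly on characters, so no word segmentation or
    morphological analysis is needed.
    """
    ref = ref_text.replace(' ', '')
    hyp = hyp_text.replace(' ', '')
    return edit_distance(ref, hyp) / max(len(ref), 1)
```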
... To overcome the limitations of the Word ASR model, a number of approaches have been suggested that have in common their use of morphemes (prefixes, stems, and suffixes) rather than words as the basic unit of analysis. Indeed, several studies have investigated the use of sub-lexical language constructs in speech recognition [3,4] and models incorporating this idea have been used in many languages, including German and Finnish [5,6], Korean [7,8,9], Dutch [10], Arabic [11,12,13,14,15,16], Turkish [5,17], Slovenian [18] and English [5]. Other works utilizing such an approach for multiple languages have been published [19,20]. ...
... Each BLSTM layer comprises 640 BLSTM cells and 128 projection units, while the output layer comprises 19901 units. For the language model, 38 GB of Korean text data is first preprocessed using text-normalization and word segmentation methods [41], and then, the most frequent 540k sub-words are obtained from the text data (For Korean, a sub-word unit is commonly used as a basic unit of an ASR system [42,43]). Next, we train a back-off trigram of 540k sub-words [41] using an SRILM toolkit [44,45]. ...
Article
Full-text available
This paper aims to design an online, low-latency, and high-performance speech recognition system using a bidirectional long short-term memory (BLSTM) acoustic model. To achieve this, we adopt a server-client model and a context-sensitive-chunk-based approach. The speech recognition server manages a main thread and a decoder thread for each client and one worker thread. The main thread communicates with the connected client, extracts speech features, and buffers the features. The decoder thread performs speech recognition, including the proposed multichannel parallel acoustic score computation of a BLSTM acoustic model, the proposed deep neural network-based voice activity detector, and Viterbi decoding. The proposed acoustic score computation method estimates the acoustic scores of a context-sensitive-chunk BLSTM acoustic model for the batched speech features from concurrent clients, using the worker thread. The proposed deep neural network-based voice activity detector detects short pauses in the utterance to reduce response latency while the user utters long sentences. In Korean speech recognition experiments, the number of concurrent clients increased from 22 to 44 using the proposed acoustic score computation. When combined with the frame-skipping method, the number further increased to 59 clients with a small accuracy degradation. Moreover, the average user-perceived latency was reduced from 11.71 s to 3.09-5.41 s by using the proposed deep neural network-based voice activity detector.
... The phoneme units of Table 1(b), on the other hand, typically consist of 40 phonemes, including silence. Here, the phoneme unit is derived from a grapheme-to-phoneme converter [33,34] and is acoustically highly discriminative despite being smaller in number than the grapheme units. However, the phonemes in spontaneous speech still include acoustically ambiguous units. ...
Article
Full-text available
We propose a method to extend a phoneme set by using a large amount of broadcast data to improve the performance of Korean spontaneous speech recognition. In the proposed method, we first extract variable-length phoneme-level segments from broadcast data and then convert them into fixed-length embedding vectors based on a long short-term memory architecture. We use decision tree-based clustering to find acoustically similar embedding vectors and then build new acoustic subword units by gathering the clustered vectors. To update the lexicon of a speech recognizer, we build a lookup table between the tri-phone units and the units derived from the decision tree. Finally, the proposed lexicon is obtained by updating the original phoneme-based lexicon by referencing the lookup table. To verify the performance of the proposed unit, we compare it with the previous units obtained by using the segment-based k-means clustering method or the frame-based decision-tree clustering method. As a result, the proposed unit is shown to produce better performance than the previous units in both spontaneous and read Korean speech recognition tasks.
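A rough sketch of the clustering stage described above: fixed-length embeddings of variable-length phoneme segments are grouped into acoustically similar clusters, and each cluster becomes a candidate subword unit. The paper's proposed method uses decision-tree clustering; the k-means variant below corresponds to the simpler segment-based baseline it is compared against (the file name and cluster count are hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans

# One fixed-length embedding per variable-length phoneme segment,
# e.g. produced by an LSTM encoder (hypothetical file name).
embeddings = np.load('segment_embeddings.npy')

# Group acoustically similar segments; each cluster index becomes a
# candidate acoustic subword unit (the cluster count is illustrative).
kmeans = KMeans(n_clusters=64, random_state=0).fit(embeddings)
unit_ids = kmeans.labels_  # new unit label for each segment
```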
... Most authors use some kind of parser, i.e., a word decomposition module, which identifies the significant morphological units (morphemes, word stems, affixes, and so on) to determine the corresponding lexical features and word classes; this information is then used as an additional constraint (through appropriate weights) for the decoder, in combination with, or instead of, the words themselves, which are used for this purpose in the standard approach. Many morphologically rich languages have been shown to share similar problems in language modeling (Kwon & Park, 2003; Sarikaya et al., 2008; Sak et al., 2010; Müller et al., 2012). One of the standard approaches is class-based language models (Jardino, 1996; Whittaker & Woodland, 2001) based on morphology. They are obtained by morphological annotation of the training text corpus, followed by classification of words into groups using information from the tagged corpus, and then by training on the transformed corpus, in which the corresponding classes appear instead of the words themselves. ...
Thesis
Full-text available
Automatic speech recognition is a technology that allows computers to convert spoken words into text. It can be applied in various areas which involve communication between humans and machines. This thesis primarily deals with one of two main components of speech recognition systems - the language model, that specifies the vocabulary of the system, as well as the rules by which individual words can be linked into sentences. The Serbian language belongs to a group of highly inflective and morphologically rich languages, which means that it uses a number of different word endings to express the desired grammatical, syntactic, or semantic function of the given word. Such behavior often leads to a significant number of errors in speech recognition systems where due to good acoustic matching the recognizer correctly guesses the basic form of the word, but an error occurs in the word ending. This word ending may indicate a different morphological category, for example, word case, grammatical gender, or grammatical number. The thesis presents a new language modeling tool which, along with the word identity, can also model additional lexical and morphological features of the word, thus testing the hypothesis that this additional information can help overcome a significant number of recognition errors that result from the high inflectivity of the Serbian language.
... Word error rate (WER) was computed to measure how our proposed model improves the linguistic consistency of the converted speech. In practice, morphemes were used instead of words, since morphemes are regarded as the recognition units of Korean speech [18,19,20]. The Google Cloud Speech-to-Text API transcribed the converted speech, and the transcripts were divided into morpheme sequences by the Komoran morphological analyzer in KoNLPy [21]. ...
Preprint
Full-text available
Voice conversion (VC) is the task of transforming a person's voice to a different style while preserving the linguistic content. The previous state of the art in VC is based on a sequence-to-sequence (seq2seq) model, which can corrupt linguistic information. There was an attempt to overcome this by using textual supervision, but it requires explicit alignment, which loses the benefit of using a seq2seq model. In this paper, a voice converter using multitask learning with text-to-speech (TTS) is presented. The embedding space of seq2seq-based TTS has abundant information on the text. The role of the TTS decoder is to convert the embedding space to speech, which is the same as in VC. In the proposed model, the whole network is trained to minimize the losses of VC and TTS. Through multitask learning, VC is expected to capture more linguistic information and to preserve training stability. VC experiments were performed on a male Korean emotional text-speech dataset, and it is shown that multitask learning helps keep linguistic content in VC.
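The context excerpt above computes WER over morpheme sequences produced by the Komoran analyzer in KoNLPy. A minimal sketch of that evaluation step, assuming the `konlpy` and `editdistance` packages are installed (normalizing by the reference length is the usual convention):

```python
import editdistance             # pip install editdistance (assumed available)
from konlpy.tag import Komoran  # pip install konlpy

komoran = Komoran()

def morpheme_wer(ref_text, hyp_text):
    """WER over morpheme sequences: Korean lacks a stable space-delimited
    word unit, so both transcripts are first split into morphemes."""
    ref = komoran.morphs(ref_text)
    hyp = komoran.morphs(hyp_text)
    return editdistance.eval(ref, hyp) / max(len(ref), 1)
```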
... There were several approaches for incorporating morphology knowledge into speech recognition systems for other languages, and most of them require some sort of parser (word decomposer) to determine significant morphological units (morphemes, affixes, etc.) to represent lexical items and word classes; that information is then used to provide additional constraints to the decoder (in combination with, or instead of, regular words in the conventional approach). A lot of morphologically rich languages face similar issues [6][7][8][9][10]. ...
Article
Full-text available
Serbian is in a group of highly inflective and morphologically rich languages that use a lot of different word suffixes to express different grammatical, syntactic, or semantic features. This kind of behaviour usually produces a lot of recognition errors, especially in large vocabulary systems—even when, due to good acoustical matching, the correct lemma is predicted by the automatic speech recognition system, a wrong word ending often occurs, which is nevertheless counted as an error. This effect is larger for contexts not present in the language model training corpus. In this manuscript, an approach which takes into account different morphological categories of words for language modeling is examined, and the benefits in terms of word error rates and perplexities are presented. These categories include word type, word case, grammatical number, and gender, and they were all assigned to words in the system vocabulary, where applicable. These additional word features helped to produce significant improvements over the baseline system, both for n-gram-based and neural network-based language models. The proposed system can help overcome a lot of tedious errors in a large vocabulary system, for example, for dictation, both for Serbian and for other languages with similar characteristics.
... In Uyghur, there are 49 noun suffixes, two times more than those in Turkish. These suffixes can be divided into three categories, namely, Number category, Ownership-dependent category and Case category [7]. ...
Article
Full-text available
In this paper, a hybrid strategy of rules and statistics is employed to implement a Uyghur noun re-inflection model. More specifically, completed Uyghur sentences are taken as input; these sentences are marked with part-of-speech tags, and the nouns in them remain in stem form. In this model, relevant linguistic rules and statistical algorithms are used to find the most probable noun suffixes and to output the Uyghur sentences after the nouns are re-inflected. With linguistic rules summarized manually, the training corpora are formed through human-machine interaction. The final experimental result shows that the Uyghur morphological re-inflection model performs well and can be applied to various fields of natural language processing, such as Uyghur machine translation and natural language generation.
... Pseudo-morphemes have also been proposed as an extension of a morpheme to solve the above two problems of morpheme-based lexicons [15][16][17]. The general approach to pseudo-morpheme generation is to concatenate frequent and short morphemes. ...
Article
In this paper, maximum likelihood-based automatic lexicon generation using mixed syllables is proposed for an unlimited-vocabulary voice interface for East Asian languages (e.g. Korean, Chinese and Japanese) in AI-assistant-based interaction with mobile devices. The conventional lexicon has two inevitable problems: 1) a tedious repetition of out-of-lexicon unit additions to the lexicon, and 2) the propagation of errors during morpheme analysis and space segmentation. The proposed method provides an automatic framework to solve the above problems. The proposed method produces a level of overall accuracy similar to that of previous methods in the presence of one out-of-lexicon word in a sentence, but provides superior results, with absolute improvements of 1.62%, 5.58%, and 10.09% in word accuracy when the number of out-of-lexicon words in a sentence is two, three and four, respectively.
... One OOV word usually translates into one recognition error, or even more due to erroneous context. Sub-lexical units are a commonly used technique for modeling OOV words in many languages [1,2,3,4], as multiple sub-lexical units can be combined to form a new word that has not been seen in the training data. The design of a sub-lexical unit is crucial and depends on the characteristics of each language. ...
Conference Paper
Full-text available
This paper investigates the use of a hybrid language model for open-vocabulary Thai LVCSR. Thai text is written without word boundary markers and the definition of word unit is often ambiguous due to the presence of compound words. Hence, to build open-vocabulary LVCSR, a very large lexicon is required to also handle word unit ambiguity. Pseudo-morpheme (PM), a syllable-like sub-word unit specifically designed for Thai is considered to be a more well-defined unit. To overcome the problem of out-of-vocabulary words and to also reduce the size of the lexicon, a hybrid language model which combines word and sub-word units is proposed. Words and sub-words frequently found in several domains constitute open-vocabulary for general domain Thai LVCSR. To verify our scheme, we run recognition experiments on data from various tasks including broadcast news transcription, dictation and mobile speech-to-speech translation. Open-vocabulary Thai LVCSR using the hybrid language model obviously reduces the out-of-vocabulary problem. The proposed model having a much smaller lexicon size achieves a comparable recognition error rate to a baseline system using a full-word lexicon.
... In order to give a notion of the inflection involved, let us mention that in terms of lemmas the vocabulary is reduced by 14%, 35% and 50% for English, Spanish and Basque, respectively. Although word splitting significantly reduces the vocabulary in agglutinative languages, it increases acoustic confusability in ASR tasks due to a higher number of short tokens [Kwon & Park, 2003] and context modelling at token boundaries [Hacioglu et al., 2003]. ...
... There is a trade-off between word unit and morpheme unit; generally the word unit provides better linguistic constraint, but increases the vocabulary size explosively, causing OOV (out-of-vocabulary) and data sparseness problems in language modeling. Therefore, the morpheme unit is conventionally adopted in many agglutinative languages, such as Japanese [1], Korean [5], and Turkish [9]. However, most of morphemes are short, often consisting of one or two phonemes, thus they are more likely to be confused in ASR than the word unit. ...
Conference Paper
Full-text available
In agglutinative languages, the selection of the lexical unit is not obvious. The morpheme unit is usually adopted to ensure sufficient coverage, but many morphemes are short, resulting in weak constraints and possible confusions. In this paper, we propose a discriminative approach to select lexical entries which directly contribute to ASR error reduction. We define an evaluation function for each word by a set of features and their weights, and the optimization measure as the difference between the WERs of the morpheme-based model and the word-based model. The weights of the features are then learned by a perceptron algorithm. Finally, word (or sub-word) entries with higher evaluation scores are selected to be added to the lexicon. This method is successfully applied to an Uyghur large-vocabulary continuous speech recognition system, resulting in a significant reduction of both WER and lexicon size. Further improvement is achieved by combining it with a statistical method based on a mutual information criterion.
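A skeletal version of the learning step described above: each candidate entry is represented by a feature vector, the label records whether adding it reduced WER relative to the morpheme-based hypotheses, and entries scoring above a threshold are added to the lexicon. Feature representation, labels and thresholds here are illustrative, not the paper's exact setup:

```python
def train_perceptron(candidates, epochs=10):
    """candidates: list of (features, label) pairs, where features is a dict
    of feature values for a candidate entry and label is +1 if the entry
    reduced WER versus the morpheme-based model, else -1."""
    w = {}
    for _ in range(epochs):
        for feats, label in candidates:
            score = sum(w.get(f, 0.0) * v for f, v in feats.items())
            if label * score <= 0:  # misclassified: perceptron update
                for f, v in feats.items():
                    w[f] = w.get(f, 0.0) + label * v
    return w

def select_lexicon_entries(candidates, w, threshold=0.0):
    """Keep candidate entries whose evaluation score exceeds the threshold."""
    return [feats for feats, _ in candidates
            if sum(w.get(f, 0.0) * v for f, v in feats.items()) > threshold]
```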
... The first class of studies uses a linguistically motivated approach, where words are decomposed morphologically into linguistic units. Morpheme-based language models have been proposed for German [7], Czech [8], and Korean [9]. A statistical language model based on morphological decomposition of words into roots and inflectional groups which contain the inflectional features for each derived form has been proposed for morphological disambiguation of Turkish text [10]. ...
Article
Full-text available
This paper introduces two complementary language modeling approaches for morphologically rich languages, aiming to alleviate the out-of-vocabulary (OOV) word problem and to exploit morphology as a knowledge source. The first model, the morpholexical language model, is a generative n-gram model whose modeling units are lexical-grammatical morphemes instead of commonly used words or statistical sub-words. This paper also proposes a novel approach for integrating morphology into an automatic speech recognition (ASR) system in the finite-state transducer framework as a knowledge source. We accomplish that by building a morpholexical search network obtained by the composition of the lexical transducer of a computational lexicon with a morpholexical language model. The second model is a linear reranking model trained discriminatively with a variant of the perceptron algorithm using morpholexical features. This variant of the perceptron algorithm, the WER-sensitive perceptron, is shown to perform better for reranking n-best candidates obtained with the generative model. We apply the proposed models to a Turkish broadcast news transcription task and give experimental results. The morpholexical model leads to an elegant morphology-integrated search network with unlimited vocabulary. Thus, it is highly effective in alleviating the OOV problem and improves the word error rate (WER) over word and statistical sub-word models by 1.8% and 0.4% absolute, respectively. The discriminatively trained morpholexical model further improves the WER of the system by 0.8% absolute.
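The morpholexical search network described above follows the usual weighted finite-state transducer recipe: compose the lexicon transducer with the language model acceptor and optimize the result. In standard Mohri-style notation (the general construction, not necessarily the paper's exact formulation):

```latex
% L: lexical transducer of the computational lexicon
% G: morpholexical n-gram language model (acceptor)
N = \mathrm{min}\bigl(\mathrm{det}(L \circ G)\bigr)
```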
... Especially in the case of an agglutinative language such as Korean, the performance of ASR and SMT differs significantly depending on the processing unit. Some research on Korean ASR and Korean-English SMT has been carried out [4][5]. ...
Article
Full-text available
In this demonstration, we present POSSLT (POSTECH Spoken Language Translation), a Korean-English statistical spoken language translation (SLT) system using pseudo-morphemes and a confusion network (CN) based technique. Like most other SLT systems, automatic speech recognition (ASR) and machine translation (MT) are coupled in a cascading manner in our SLT system. We used a confusion network based approach to couple ASR and MT; it has better translation quality and faster decoding time than the N-best approach. In ASR and SMT for Korean, how the processing units are defined affects the performance, and the pseudo-morpheme unit is the best choice for Korean-English SLT. The models used in the SLT system are trained on a travel-domain conversational corpus.
... Morphology modeling aims to reduce the out-of-vocabulary (OOV) rate as well as data sparsity, thereby producing more effective language models. However, obtaining considerable improvements in speech recognition accuracy seems hard, as is demonstrated by the fairly meager improvements (1–4% relative) over standard word-based models accomplished by, e.g., Berton et al. (1996), Ordelman et al. (2003), Kirchhoff et al. (2006), Whittaker and Woodland (2000), Kwon and Park (2003), and Shafran and Hall (2006) for Dutch, Arabic, English, Korean, and Czech, or even the worse performance reported by Larson et al. (2000) for German and Byrne et al. (2001) for Czech. Nevertheless, clear improvements over a word baseline have been achieved for Serbo-Croatian (Geutner et al., 1998), Finnish, Estonian (Kurimo et al., 2006b) and Turkish (Kurimo et al., 2006a). ...
Article
Full-text available
We explore the use of morph-based language models in large-vocabulary continuous-speech recognition systems across four so-called morphologically rich languages: Finnish, Estonian, Turkish, and Egyptian Colloquial Arabic. The morphs are subword units discovered in an unsupervised, data-driven way using the Morfessor algorithm. By estimating n-gram language models over sequences of morphs instead of words, the quality of the language model is improved through better vocabulary coverage and reduced data sparsity. Standard word models suffer from high out-of-vocabulary (OOV) rates, whereas the morph models can recognize previously unseen word forms by concatenating morphs. It is shown that the morph models do perform fairly well on OOVs without compromising the recognition accuracy on in-vocabulary words. The Arabic experiment constitutes the only exception since here the standard word model outperforms the morph model. Differences in the datasets and the amount of data are discussed as a plausible explanation.
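The morphs above are produced by the Morfessor algorithm. A minimal sketch of training and segmenting with its Python implementation, assuming the Morfessor 2.0 API and a hypothetical corpus path:

```python
import morfessor  # pip install morfessor (Morfessor 2.0 API, assumed)

io = morfessor.MorfessorIO()
train_data = list(io.read_corpus_file('corpus.txt'))  # hypothetical path

model = morfessor.BaselineModel()
model.load_data(train_data)
model.train_batch()

# A previously unseen word form is segmented into known morphs, so the
# n-gram model can still score it instead of treating it as an OOV.
morphs, cost = model.viterbi_segment('puhelimessani')
print(morphs)
```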
... In English speech recognition systems in particular, words are used as the language modeling units. Using these units for agglutinative languages such as Turkish, Finnish and Korean increases the number of out-of-vocabulary words [2,3]. Previous studies on statistical language modeling for Turkish and on the performance of these models in speech recognition proposed new modeling units (words, morphemes, syllables, stems and endings) [4]. ...
Article
Full-text available
We have designed a Turkish dictation system for broadcast news applications. Turkish is an agglutinative language with free word order. These characteristics of the language result in vocabulary explosion, a large number of out-of-vocabulary (OOV) words, and complex N-gram language models in speech recognition when words are used as recognition units. Therefore, we proposed new recognition units. We parsed some of the words into smaller recognition units like stems, endings and morphemes, and introduced these smaller units, together with the unparsed words, to the speech recognizer as lexicon entries. This way, we were able to overcome the problem of a large number of OOV words with a moderate vocabulary size and to get better estimates for the N-gram language models. However, the best recognition result for news applications was obtained using the word-based language model.
... In this way vocabulary size can be radically decreased and even OOV words can be recognized. Thus, recognition accuracies can be significantly improved over the word baseline [2] [3] [4]. The LVCSR improvement can be outstandingly high in the case of read speech in certain agglutinative languages such as Finnish and Estonian [2]. ...
Conference Paper
Full-text available
Efficient large vocabulary continuous speech recognition of morphologically rich languages is a big challenge due to the rapid vocabulary growth. To improve the results, various subword units - called morphs - are applied as basic language elements. The improvements over the word baseline, however, range from negative to error-rate halving across languages and tasks. In this paper we attempt to explore the source of this variability. Different LVCSR tasks of an agglutinative language are investigated in numerous experiments using full vocabularies. The improvement results are also compared to pre-existing results for other languages. Important correlations are found between the morph-based improvements and the vocabulary growth and corpus sizes.
Article
Full-text available
This paper introduces a large-scale spontaneous speech corpus of Korean, named KsponSpeech. This corpus contains 969 h of general open-domain dialog utterances, spoken by about 2000 native Korean speakers in a clean environment. All data were constructed by recording the dialogue of two people freely conversing on a variety of topics and manually transcribing the utterances. The transcription provides a dual transcription consisting of orthography and pronunciation, and disfluency tags for spontaneity of speech, such as filler words, repeated words, and word fragments. This paper also presents the baseline performance of an end-to-end speech recognition model trained with KsponSpeech. In addition, we investigated the performance of standard end-to-end architectures and the number of sub-word units suitable for Korean. We investigated issues that should be considered in spontaneous speech recognition in Korean. KsponSpeech is publicly available on an open data hub site of the Korea government.
Chapter
This chapter presents an overview of language modeling followed by a discussion of the challenges in Turkish language modeling. Sub-lexical units are commonly used to reduce the high out-of-vocabulary (OOV) rates of morphologically rich languages. These units are either obtained by morphological analysis or by unsupervised statistical techniques. For Turkish, the morphological analysis yields word segmentations both at the lexical and surface forms which can be used as sub-lexical language modeling units. Discriminative language models, which outperform generative models for various tasks, allow for easy integration of morphological and syntactic features into language modeling. The chapter provides a review of both generative and discriminative approaches for Turkish language modeling.
Chapter
We describe the recent development of the NECTEC Thai open-domain automatic speech recognition system. Some of the techniques found beneficial over its baseline system are: hybrid word-subword language modeling to enhance vocabulary coverage under resource constraints; multi-conditioned noisy acoustic modeling to improve system robustness using a newly developed large social media speech database; recent state-of-the-art speech features; and, lastly, online decoding and speech compression to reduce processing and data transmission time. These techniques result in a 32.4% word error rate on open-domain noisy speech test sets, which is 35.7% relative lower than the baseline system. The overall system operates at an average of 1.2xRT, which is promising for real applications.
Conference Paper
Sub-word units like morphemes are selected as the lexicon for highly inflectional languages, as they can provide better coverage and a smaller vocabulary size. However, short units shrink the context of statistical models, are prone to morpho-phonetic changes, and do not always outperform the word-based model. When sequences of units are merged or split, unit boundaries are phonetically harmonized in speech, which is reflected as morpho-phonetic changes in the text. This paper investigates morpho-phonetic confusions in the sub-word segmentation of Uyghur text and the phonetic reasons which affect automatic speech recognition (ASR) accuracy. An optimal lexicon set is obtained by comparing ASR results across different layers of lexica, which avoids phonetic confusions in the frequently misrecognized morpheme sequences. This optimal lexicon, obtained entirely from an HMM-based acoustic model, outperformed all the baseline linguistic units. When all these units were directly incorporated into a deep neural network (DNN) based acoustic model, without changing the training corpora and language models, the optimal lexicon not only drastically improved the ASR accuracy but also outperformed the other units, demonstrating the generality of our approach. Experimental results demonstrate that the optimal lexicon obtained by reducing morpho-phonetic confusions exhibits better ASR accuracy and robustness.
Article
In this paper we investigate the usefulness of morphosyntactic information, as well as clustering, in modeling Polish for automatic speech recognition. Polish is an inflectional language, thus we investigate the usefulness of an N-gram model based on morphosyntactic features. We show how individual types of features influence the model and which types are best suited for building a language model for automatic speech recognition. We compared the results of applying them with a class-based model that is automatically derived from the training corpus. We show that our approach to clustering performs significantly better than the frequently used SRILM clustering method. However, this difference is apparent only for smaller corpora.
Article
This paper investigates unsupervised training strategies for the Korean language in the context of the DGA RAPID Rapmat project. As with previous studies, we begin with only a small amount of manually transcribed data to build preliminary acoustic models. Using the initial models, a larger set of untranscribed audio data is decoded to produce approximate transcripts. We compare both GMM and DNN acoustic models for both the unsupervised transcription and the final recognition system. While the DNN acoustic models produce a lower word error rate on the test set, training on the transcripts from the GMM system provides the best overall performance. We also achieve better performance by expanding the original phone set. Finally, we examine the efficacy of automatically building a test set by comparing system performance both before and after manually correcting the test set.
Article
Korean is an agglutinative language in which word boundaries are not explicit. Thus, sub-word units are usually used in large-vocabulary continuous speech recognition (LVCSR) or LVCSR-based applications like keyword spotting. Coalescence degree is a new word property defined in this paper, which describes the quantity and length of the sub-word units decomposed from a word. Bilinear confidence warping modifies the confidence measure for a keyword candidate based on its coalescence degree. Experiments show that performance can be significantly improved when bilinear confidence warping is used.
Article
This paper proposes a new method to determine the recognition units for large vocabulary continuous speech recognition (LVCSR) in Korean by applying unsupervised segmentation and merging. In the proposed method, a text sentence is segmented into morphemes and position information is added to the morphemes. Submorpheme units are then obtained by splitting the morpheme units through the maximization of posterior probability terms. The posterior probability terms are computed from the morpheme frequency distribution, the morpheme length distribution, and the morpheme frequency-of-frequency distribution. Finally, the recognition units are obtained by sequentially merging the submorpheme pair with the highest frequency. Computer experiments are conducted using a Korean LVCSR system with a 100k-word vocabulary and a trigram language model obtained from a 300 million eojeol (word phrase) corpus. The proposed method is shown to reduce the out-of-vocabulary rate to 1.8% and the syllable error rate by 14.0% relative.
Article
Full-text available
In this paper we propose a new method for automatically segmenting a sentence in Japanese into a word sequence. The main advantage of our method is that the segmenter is, by using a maximum entropy framework, capable of referring to a list of compound words, i.e. word sequences without boundary information. This allows for a higher segmentation accuracy in many real situations where only some electronic dictionaries, whose entries are not consistent with the word segmentation standard, are available. Our method is also capable of exploiting a list of word sequences. It allows us to obtain a far greater accuracy gain with low manual annotation cost. We prepared segmented corpora, a compound word list, and a word sequence list. Then we conducted experiments to compare automatic word segmenters referring to various types of dictionaries. The results showed that the word segmenter we proposed is capable of exploiting a list of compound words and word sequences to yield a higher accuracy under realistic situations.
Conference Paper
Korean is an agglutinative language in which pronunciations are affected by long-term context. In this paper, long-time temporal information is investigated to improve Korean LVCSR. TRAP-based MLP features, which are able to utilize acoustic information scattered over several hundred milliseconds, are employed to obtain additional information besides the conventional cepstral features. In contrast to the traditional Korean phoneme set, in which consonants in the initial and final positions are treated as the same, a more specific phoneme set is constructed by making consonants position-dependent. In the Korean broadcast news speech recognition task, experiments show that with these improvements the character error rate is reduced by 25.3% relative over the baseline system.
Article
In the conventional training of an acoustic model (AM), transcriptions are converted into phoneme sequences by using a lexicon. In order to find the correct phoneme sequences of the transcriptions, forced Viterbi alignment is performed over the training data. If a lexicon had all the phoneme variations corresponding to the speech signals, the phoneme sequences could be represented correctly. However, it is impossible to cover all phoneme variations, because phoneme variations are partially due to speakers' characteristics. In this paper, we propose a data-derived AM training method that is robust against phoneme variations for large vocabulary continuous speech recognition. To reflect speakers' phoneme variations, we expand the lexicon by replacing a phoneme with a low acoustic score with a higher-scoring phoneme selected using phonetic information. We then modify the transcription by substituting the phoneme sequence produced by the expanded lexicon. As a result, the ASR system using the proposed method gives a relative word error rate reduction of 9.5% in Korean compared to the ASR system using the conventional method.
Article
Full-text available
The n-gram model is appropriate for languages, such as English, in which the word order is grammatically rigid. However, it is not suitable for Korean, in which the word order is relatively free. Previous work proposed a two-ply HMM that reflected the characteristics of Korean but failed to reflect word-order structures among words. In this paper, we define a new segment unit which combines two words in order to reflect the word-order characteristics of adjacent words that appear in verbal morphemes. Moreover, we propose a two-path language model that estimates probabilities depending on the context, based on the proposed segment unit. Experimental results show that the proposed two-path language model yields a 25.68% perplexity improvement over previous Korean language models and reduces perplexity by 94.03% for the prediction of verbal morphemes where words are combined.
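For reference, the perplexity figures quoted above follow the standard definition over a held-out sequence of N units (the textbook formula, not anything specific to the proposed two-path model):

```latex
\mathrm{PPL} = P(w_1, \dots, w_N)^{-1/N}
             = \exp\!\Bigl(-\frac{1}{N} \sum_{i=1}^{N} \ln P(w_i \mid w_{i-n+1}, \dots, w_{i-1})\Bigr)
```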
Conference Paper
In Korean, the pronunciations of phonemes are severely affected by their contexts. Thus, using phonemes directly translated from their written forms as basic units for acoustic modeling is problematic, as these units cannot capture the complex pronunciation variations that occur in continuous speech. The allophone, a sub-phonemic unit in phonetics that serves as an independent phoneme in speech recognition, is considered able to describe complex pronunciation variations. This paper presents a novel approach called Automatic Allophone Deriving (AAD). In this approach, statistics from Gaussian mixture models are used to create measurements for allophone candidates, and decision trees are used to derive allophones. The question set used by the decision tree is also generated automatically, since we assume no linguistic knowledge is used in this approach. This paper also adopts long-time features, rather than conventional cepstral features, to capture acoustic information over several hundred milliseconds for AAD, as coarticulation effects are unlikely to be limited to a single phoneme. Experiments show that AAD outperforms previous approaches which derive allophones from linguistic knowledge. Additional experiments use long-time features directly in acoustic modeling. The results show that, compared with the corresponding baselines, the performance achieved with the same allophones can be significantly improved by using long-time features.
Article
For automatic speech recognition (ASR) of agglutinative languages, selection of a lexical unit is not obvious. The morpheme unit is usually adopted to ensure sufficient coverage, but many morphemes are short, resulting in weak constraints and possible confusion. We propose a discriminative approach for lexicon optimization that directly contributes to ASR error reduction by taking into account not only linguistic constraints but also acoustic–phonetic confusability. It is based on an evaluation function for each word defined by a set of features and their weights, which are optimized by the difference in word error rates (WERs) between ASR hypotheses obtained by the morpheme-based model and those by the word-based model. Then, word or sub-word entries with higher evaluation scores are selected to be added to the lexicon. We investigate several discriminative models to realize this approach. Specifically, we implement it with support vector machines (SVM), logistic regression (LR) model as well as the simple perceptron algorithm. This approach was successfully applied to an Uyghur large-vocabulary continuous speech recognition system, resulting in a significant reduction of WER with a modest lexicon size and a small out-of-vocabulary rate. The use of SVM for a sub-word lexicon results in the best performance, outperforming the word-based model as well as conventional statistical concatenation approaches. The proposed learning approach is realized in an unsupervised manner because it does not require correct transcription for training data.
Article
Previous studies have reported verbal fluency impairment in obsessive–compulsive disorder (OCD), but no study has evaluated the cognitive processes underlying verbal fluency in OCD. In the present study, we sought to test the hypothesis that phonemic fluency impairment in OCD resulted from switching problems rather than lack of fluency per se. In addition, we aimed to evaluate whether certain symptom dimensions were associated with impaired phonemic fluency to better understand OCD heterogeneity. The study included 85 patients with OCD (45 drug-naïve and 40 drug-free) and 71 healthy controls matched for gender, age, education, and intelligence. The Controlled Oral Word Association (COWA) test was administered to assess phonemic fluency and switching performance. Patients with OCD generated a smaller number of words and displayed fewer switches than did healthy control subjects, and switching was found to mediate impaired phonemic fluency in OCD. Furthermore, impairment in switching and phonemic fluency was related to the symmetry dimension in patients with OCD. Our findings suggest that phonemic fluency impairment in OCD is mediated by a switching deficit that may originate from abnormal processing in the frontal-striatal circuitry involving the orbitofrontal cortex. Moreover, different obsessive–compulsive symptom dimensions may be characterized by distinct neurocognitive dysfunctions in OCD.
Article
Full-text available
For large-vocabulary continuous speech recognition (LVCSR) of highly-inflected languages, selection of an appropriate recognition unit is the first important step. The morpheme-based approach is often adopted because of its high coverage and linguistic properties. But morpheme units are short, often consisting of one or two phonemes, and thus they are more likely to be confused in ASR than word units. Generally, word units provide better linguistic constraints, but increase the vocabulary size explosively, causing OOV (out-of-vocabulary) and data sparseness problems in language modeling. In this research, we investigate approaches to selecting word entries by concatenating morpheme sequences, which reduces the word error rate (WER). Specifically, we compare the ASR results of the word-based model with those of the morpheme-based model, and extract typical patterns which reduce the WER. This method has been successfully applied to an Uyghur LVCSR system, resulting in a significant reduction of WER without a drastic increase in vocabulary size.
Article
In this paper, we study the vocabulary design problem in Uyghur large vocabulary continuous speech recognition (LVCSR). Uyghur is an agglutinative language in which words can be formed by concatenating several suffixes to the stem. As a result, the number of word types in Uyghur is unlimited. If the word is used as the recognition unit, the out-of-vocabulary (OOV) rate will be very large with typical vocabulary sizes of 60k-100k. To avoid this problem, we split words into stems and suffixes and use these sub-words as the recognition units. Speech recognition experiments are performed on two test sets, one containing sentences from books and the other containing sentences from conversations. Compared to the 80k-word baseline, the use of stems and suffixes alleviates the OOV problem dramatically, and the best system reduces the word error rate (WER) from 46.5% to 44.5% on the book sentences test set and from 57.6% to 47.5% on the conversation sentences test set.
Article
In the Korean language, a large proportion of word units are pronounced differently from their written forms, due to an agglutinative and highly inflective nature with severe phonological phenomena and coarticulation effects. This paper reports on an ongoing study of Korean pronunciation modeling, in which the mapping between phonemic and orthographic units is modeled by a Bayesian network (BN). The advantage of this graphical model framework is that the probabilistic relationship between these symbols, as well as additional knowledge sources, can be learned in a general and flexible way. Thus, we can easily incorporate various additional knowledge sources from different domains. In this preliminary study, we start with a simple topology where the additional knowledge includes only the preceding and succeeding contexts of the current phonemic unit. In practice, the proposed BN pronunciation model is applied to our syllable-based Korean large-vocabulary continuous speech recognition (LVCSR) system, where we construct the speech recognition task as a serial architecture composed of two independent parts. The first part performs standard hidden Markov model (HMM)-based recognition of the phonemic syllable units of the actual pronunciation (surface forms). In this way, the lexicon and out-of-vocabulary rates can be kept small, while avoiding high acoustic confusability. In the second part, the system transforms the phonemic syllable surface forms into the desired Korean orthography, the eumjeol, as a recognition unit, by utilizing the proposed BN pronunciation model. Experimental results show that the proposed BN model can successfully map the phonemic syllable surface forms to eumjeol transcriptions with more than 97% accuracy on average. It also helps to enhance our Korean LVCSR system, giving about a 25.53% absolute improvement on average with respect to baseline orthographic syllable recognition.
Article
This paper outlines the first Asian network-based speech-to-speech translation system, developed by the Asian Speech Translation Advanced Research (A-STAR) consortium. Eight research groups comprising the A-STAR members participated in the experiments, covering nine languages: eight Asian languages (Hindi, Indonesian, Japanese, Korean, Malay, Thai, Vietnamese, and Chinese) and English. Each A-STAR member contributed one or more of the following spoken language technologies through Web servers: automatic speech recognition, machine translation, and text-to-speech. The system was designed to translate common spoken utterances of travel conversations from a given source language into multiple target languages in order to facilitate multiparty travel conversations between people speaking different Asian languages. It covers travel expressions, including proper nouns that are names of famous places or attractions in Asian countries. In this paper, we describe the issues in developing spoken language technologies for Asian languages and discuss the difficulties involved in connecting different heterogeneous spoken language translation systems through Web servers. This paper also presents speech-translation results, including a subjective evaluation, from the first A-STAR field testing, which was carried out in July 2009.
Conference Paper
This work presents a two-pass recognition method for highly inflected agglutinative languages, based on an Estonian large vocabulary recognition task. Morphemes are used as basic recognition units in a standard trigram language model in the first pass. The recognized morphemes are reconstructed back into words using a hidden-event language model for compound word detection. In the second pass, the vocabulary from the N-best sentence candidates of the first pass is used to create an adaptive, sentence-specific word-based language model, which is applied to rescore the N-best hypotheses. The sentence-specific language model is based on the factored language model paradigm and estimates word probabilities from the preceding two words and part-of-speech tags. The method achieves a 7.3% relative word error rate improvement over the baseline system used in the first pass.
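The second pass reduces to a generic N-best rescoring step once the sentence-specific model is built; the sketch below abstracts that model into an arbitrary `lm_logprob` callable, and the weight is illustrative:

```python
def rescore_nbest(nbest, lm_logprob, lm_weight=10.0):
    """Re-rank first-pass hypotheses, given as (word_sequence,
    acoustic_logprob) pairs, with a second-pass language model."""
    return max(nbest, key=lambda h: h[1] + lm_weight * lm_logprob(h[0]))

toy_lm = lambda words: -1.5 * len(words)          # toy LM: prefer shorter hypotheses
nbest = [(["the", "cat", "sat"], -120.0), (["the", "cats", "at"], -118.0)]
print(rescore_nbest(nbest, toy_lm))               # hypothesis with best combined score
```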
Article
Full-text available
The field of large vocabulary, continuous-speech recognition has advanced to the point where there are several systems capable of attaining between 90 and 95% word accuracy for speaker-independent recognition of a 1000-word vocabulary, spoken fluently, for a task with a perplexity (average word branching factor) of about 60. There are several factors which account for the high performance achieved by these systems, including the use of hidden Markov model (HMM) methodology, the use of context-dependent sub-word units, the representation of between-word phonemic variations, and the use of corrective training techniques to emphasize differences between acoustically similar words in the vocabulary. In this paper we describe one of the large vocabulary speech-recognition systems being investigated at AT&T Bell Laboratories, and discuss the methods used to provide high word-recognition accuracy. In particular we focus on the techniques used to provide the acoustic models of the sub-word units (both context-independent and context-dependent units), and discuss the resulting system performance as a function of the type of acoustic modeling used.
Article
Full-text available
The key problem to be faced when building an HMM-based continuous speech recogniser is maintaining the balance between model complexity and available training data. For large vocabulary systems requiring cross-word context-dependent modelling, this is particularly acute since many such contexts will never occur in the training data. This paper describes a method of creating a tied-state continuous speech recognition system using a phonetic decision tree. This tree-based clustering is shown to lead to recognition performance similar to that obtained using an earlier data-driven approach, but with the additional advantage of providing a mapping for unseen triphones. State tying is also compared with traditional model-based tying and shown to be clearly superior. Experimental results are presented for both the Resource Management and Wall Street Journal tasks.
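The splitting criterion can be illustrated in one dimension: a phonetic question is worth asking if modelling the yes and no subsets with separate Gaussians gains log-likelihood over the pooled model. A minimal sketch, not the paper's full multivariate formulation:

```python
import math

def cluster_loglik(frames):
    """Log-likelihood of a set of 1-D frames under a single ML Gaussian."""
    n = len(frames)
    mean = sum(frames) / n
    var = max(sum((x - mean) ** 2 for x in frames) / n, 1e-6)  # floor the variance
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def split_gain(frames, answers_yes):
    """Likelihood gain of splitting `frames` by a question answered
    yes/no per frame; the tree grows by greedily taking the best gain."""
    yes = [x for x, a in zip(frames, answers_yes) if a]
    no = [x for x, a in zip(frames, answers_yes) if not a]
    if not yes or not no:
        return 0.0
    return cluster_loglik(yes) + cluster_loglik(no) - cluster_loglik(frames)

frames = [1.0, 1.1, 0.9, 1.05, 5.0, 5.2, 4.9]
answers = [True] * 4 + [False] * 3
print(split_gain(frames, answers) > 0)   # True: the question separates the data
```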
Conference Paper
Full-text available
When developing a speech recognition system, one must start by deciding what the units to be recognized should be. This is for the most part a straightforward choice for word-based languages such as English, but it becomes an issue even in handling languages with a complex compounding system like German; for an agglutinative language like Japanese, which provides no spaces in written text, the choice is not at all obvious. Once an appropriate unit has been determined, the problem of consistently segmenting transcriptions of training data must be addressed. This paper describes a method for learning a lexicon from a training corpus which contains no word-level segmentation, applied to the problem of building a Japanese speech recognition system. We show not only that one can satisfactorily segment transcribed training data automatically, avoiding human error, but also that our system, when trained with the automatically segmented corpus, showed a significant improvement in recognition performance.
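The segmentation half of such a procedure is a standard Viterbi search under a unigram lexicon model; lexicon learning then alternates this step with re-estimating unit probabilities (an EM loop omitted here). A minimal sketch:

```python
import math

def viterbi_segment(text, logprob):
    """Best segmentation of an unsegmented string; `logprob` returns the
    log probability of a candidate unit, or None if it is not in the lexicon."""
    n = len(text)
    best = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            lp = logprob(text[i:j])
            if lp is not None and best[i] + lp > best[j]:
                best[j], back[j] = best[i] + lp, i
    units, j = [], n
    while j > 0:                       # trace back the best path
        units.append(text[back[j]:j])
        j = back[j]
    return units[::-1]

lex = {"speech": math.log(0.3), "recognition": math.log(0.2)}
print(viterbi_segment("speechrecognition", lambda u: lex.get(u)))
# ['speech', 'recognition']
```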
Conference Paper
Full-text available
This paper presents an efficient look-ahead technique which incorporates the language model knowledge at the earliest possible stage during the search process. This so-called language model look-ahead is built into the time-synchronous beam search algorithm using a tree-organized pronunciation lexicon for a bigram language model. The language model look-ahead technique exploits the full knowledge of the bigram language model by distributing the language model probabilities over the nodes of the lexical tree for each predecessor word. We present a method for handling the resulting memory requirements. Recognition experiments performed on the 20,000-word North American Business task (Nov. 1996) demonstrate that, in comparison with unigram look-ahead, a factor-of-5 reduction in the acoustic search effort can be achieved without loss in recognition accuracy.
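The core of the technique is a bottom-up pass over the pronunciation prefix tree that stores, at every node, the best bigram score of any word still reachable below it. A hedged sketch with invented data structures:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None      # set at word-end nodes
    lookahead: float = 0.0

def lm_lookahead(node: Node, bigram, predecessor: str) -> float:
    """Distribute bigram log probabilities P(w | predecessor) over the
    lexical tree so beam search can prune before word identities are known."""
    scores = [lm_lookahead(c, bigram, predecessor) for c in node.children]
    if node.word is not None:
        scores.append(bigram(predecessor, node.word))
    node.lookahead = max(scores)
    return node.lookahead

cat, cab = Node(word="cat"), Node(word="cab")
root = Node(children=[Node(children=[cat, cab])])        # shared prefix 'ca'
bigram = lambda v, w: {"cat": -1.0, "cab": -2.5}[w]
lm_lookahead(root, bigram, "the")
print(root.lookahead)                                    # -1.0, inherited from 'cat'
```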
Conference Paper
Full-text available
This paper describes recent work on the HTK large vocabulary speech recognition system. The system uses tied-state cross-word context-dependent mixture Gaussian HMMs and a dynamic network decoder that can operate in a single pass. In the last year the decoder has been extended to produce word lattices to allow flexible and efficient system development, as well as multi-pass operation for use with computationally expensive acoustic and/or language models. The system vocabulary can now be up to 65k words, the final acoustic models have been extended to be sensitive to more acoustic context (quinphones), a 4-gram language model has been used, and unsupervised incremental speaker adaptation has been incorporated. The resulting system gave the lowest error rates on both the H1-P0 and H1-C1 hub tasks in the November 1994 ARPA CSR evaluation.
Conference Paper
Full-text available
The authors present two simple tests for deciding whether the difference in error rates between two algorithms tested on the same data set is statistically significant. The first (McNemar's test) requires the errors made by an algorithm to be independent events and is found to be most appropriate for isolated-word algorithms. The second (a matched-pairs test) can be used even when errors are not independent events and is more appropriate for connected speech.
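McNemar's test needs only the two discordant counts; everything both systems got right or both got wrong cancels out. A small self-contained version:

```python
def mcnemar(n01, n10):
    """Chi-square statistic (with continuity correction) from the counts of
    items system A got wrong but B got right (n01) and vice versa (n10);
    compare against 3.84 for significance at the 0.05 level (1 d.o.f.)."""
    if n01 + n10 == 0:
        return 0.0
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)

print(mcnemar(30, 12))   # about 6.88 > 3.84: the difference is significant
```

As the abstract notes, this is only valid when errors are independent events, which is why it suits isolated-word rather than connected-speech tests.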
Article
Full-text available
A basic overview is presented of the main ongoing efforts in large vocabulary, continuous speech recognition (LVCSR) for European languages. We address issues in acoustic modeling, lexical representation, and language modeling for several European languages, as well as issues in comparative evaluation.
Article
Full-text available
We describe an approach to grapheme-to-phoneme conversion which is both language-independent and data-oriented. Given a set of examples (spellings of words with their associated phonetic representations) in a language, a grapheme-to-phoneme conversion system is automatically produced for that language which takes as its input the spelling of words and produces as its output the phonetic transcription, according to the rules implicit in the training data. We describe the design of the system and compare its performance to knowledge-based and alternative data-oriented approaches.
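A toy version of the data-oriented idea: memorize the phoneme emitted by each letter in a one-letter context window and back off to the letter alone at test time. It assumes the training pairs are already aligned letter-to-phoneme (with '_' for silent letters), an alignment real systems must learn as well:

```python
from collections import Counter, defaultdict

class WindowG2P:
    def __init__(self):
        self.ctx = defaultdict(Counter)       # 3-letter window -> phoneme counts
        self.letter = defaultdict(Counter)    # single letter -> phoneme counts

    def train(self, spelling, phonemes):      # equal-length aligned sequences
        padded = "#" + spelling + "#"
        for i, ph in enumerate(phonemes):
            self.ctx[padded[i:i + 3]][ph] += 1
            self.letter[spelling[i]][ph] += 1

    def convert(self, spelling):
        padded = "#" + spelling + "#"
        out = []
        for i in range(len(spelling)):
            table = self.ctx.get(padded[i:i + 3]) or self.letter.get(spelling[i])
            out.append(table.most_common(1)[0][0] if table else "?")
        return [p for p in out if p != "_"]

g = WindowG2P()
g.train("cake", ["k", "ei", "k", "_"])        # toy alignment: final 'e' is silent
print(g.convert("cake"))                      # ['k', 'ei', 'k']
```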
Article
Full-text available
We describe the system used by IBM in the 1999 HUB4 Evaluation under the 10-times-real-time constraint. We detail the system architecture and show that this system is over 20 percent more accurate at the same speed than the system used in the 1998 Evaluation. Furthermore, we have closed the gap between our unlimited-resource system and our 10-times-real-time system from 45 percent to 14 percent.
Article
Full-text available
This paper presents the development of the HTK broadcast news transcription system for the November 1998 Hub4 evaluation. Relative to the previous year's system, a number of features were added, including vocal tract length normalisation; cluster-based variance normalisation; double the quantity of acoustic training data; interpolated word-level language models to combine text sources; increased broadcast news language model training data; and an extra adaptation stage using a full-variance transform. Overall, these changes reduced the error rate by 13% on the 1997 evaluation data, and the final system had an overall word error rate of 13.8% on the 1998 evaluation data sets.
Article
Full-text available
This paper presents the technologies implemented in IBM's Large Vocabulary Continuous Speech Recognition (LVCSR) system designed for the 1998 Mandarin broadcast news transcription evaluation task. Compared with the 1997 system, it focuses on acoustic improvements, implementing several new schemes such as LDA and MLLT transformation matrices, the BIC model selection criterion, and SAT and CAT models. In addition, new language model components and a new vocabulary were built. Some other schemes that were tried are also described.
Article
This paper describes the SQALE project, in which the ARPA large vocabulary evaluation paradigm was adapted to meet the needs of European multilingual speech recognition development. It involved establishing a framework for sharing training and test materials, defining common protocols for training and testing systems, developing systems, running an evaluation and analysing the results. The specifically multilingual issues addressed included the impact of the language on corpora and test set design, transcription issues, evaluation metrics, recognition system design, cross-system and cross-language performance, and results analysis. The project started in December 1993 and finished in September 1995. The paper describes the evaluation framework and the results obtained. The overall conclusions of the project were that the same general approach to recognition system design is applicable to all the languages studied, although there were some language-specific problems to solve. It was found that the evaluation paradigm used within ARPA could be used within the European context with little difficulty, and the consequent sharing amongst the sites of training and test materials and language-specific expertise was highly beneficial.
Article
This paper describes large-vocabulary continuous-speech recognition (LVCSR) of Japanese newspaper speech read aloud and Japanese broadcast-news speech. It describes the first Japanese LVCSR experiments using morpheme-based statistical language models. The statistical language models were trained using a large text corpus constructed from several years of newspaper texts and our LVCSR system was evaluated using recorded newspaper speech read by 10 male speakers. It is difficult to train statistical n-gram language models for Japanese because Japanese sentences are written without spaces between words. This difficulty was overcome by segmenting sentences into words with a morphological analyzer and then training the n-gram language models using those words. The LVCSR system was constructed with the language models trained using the newspaper articles, and the acoustic models, which were phoneme hidden Markov models (HMMs) trained using 20 h of speech. The results for recognition of read newspaper speech with a 7k vocabulary were comparable to those for other languages. For the automatic transcription of broadcast-news speech with our LVCSR system, the language models had 20k word vocabularies and were trained using broadcast-news manuscripts. These models achieved better performance than the language models trained using newspaper texts. Our experiments indicate that LVCSR for Japanese works in much the same way as LVCSR for European languages.
Article
This paper describes in detail the development of the HTK Broadcast News (BN) transcription system and presents full evaluation results from the 1996, 1997 and 1998 DARPA BN evaluations. It starts with a description of the underlying HTK large vocabulary recognition system and presents the modifications used in successive generations of the HTK BN system. Initially acoustic models that relied on fairly precise manual audio-type classification were used. To enable the use of automatic segmentation and classification systems, acoustic models were developed that were independent of fine audio classifications. The basic structure of the current HTK BN system includes a high-quality segmentation stage, multiple decoding passes which initially use triphones and trigrams, and then quinphone acoustic models along with word 4-gram and category language models applied in the final pass. This system gave the lowest error rate in the 1997 BN evaluation by a statistically significant margin. Refinements to the system are then described that examine the use of a larger acoustic training set, vocal tract length normalisation, full variance transforms and improved language modelling. Furthermore a version of the system was developed that ran in less than 10 times real time with only a small increase in error rate which has been used for the bulk transcription of broadcast news for information retrieval from audio data.
Article
In pursuit of better performance, current speech recognition systems tend to use more and more complicated models for both the acoustic and the language component. Cross-word context-dependent (CD) phone models and long-span statistical language models (LMs) are now widely used. In this paper, we present a memory-efficient search topology that enables the use of such detailed acoustic and language models in a one-pass time-synchronous recognition system. Characteristic of our approach is (1) the decoupling of the two basic knowledge sources, namely pronunciation information and LM information, and (2) the representation of the pronunciation information - the lexicon in terms of CD units - by means of a compact static network. The LM information is incorporated into the search at run-time by means of a slightly modified token-passing algorithm. The decoupling of the LM and lexicon allows great flexibility in the choice of LMs, while the static lexicon representation avoids the cost of dynamic tree expansion and facilitates the integration of additional pronunciation information such as assimilation rules. Moreover, the network representation results in a compact structure when words have various pronunciations and, by construction, offers partial LM forwarding at no extra cost.
Conference Paper
This paper presents a Korean large vocabulary continuous speech recognition system based on pseudomorpheme units. In Korean, an eojeol (word phrase) is a unit for spacing and a morpheme is the smallest unit with semantic meaning. If the eojeol is used as the dictionary and language modeling unit, the number of units becomes enormous. Instead, we propose to use a modified morpheme, or pseudomorpheme, as the basic recognition unit. We can recover the original eojeol by concatenating the graphemes of the pseudomorpheme components. We used a dictionary and language model with pseudomorpheme/part-of-speech entries, where each entry can have multiple pronunciations according to the morphology rule. With a 32k-word vocabulary, the speaker-independent character, pseudomorpheme, and eojeol recognition accuracies on an economy-article database were 90.8%, 84.5%, and 81.3%, respectively.
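Recovering eojeol text from recognized pseudomorphemes is then a plain string join; the sketch below assumes eojeol-internal units are marked with a leading '+', which is one common convention and not necessarily the authors' exact notation:

```python
def to_eojeols(pseudomorphemes):
    """Concatenate pseudomorpheme graphemes back into space-separated eojeols."""
    eojeols = []
    for unit in pseudomorphemes:
        if unit.startswith("+") and eojeols:
            eojeols[-1] += unit[1:]           # attach to the current eojeol
        else:
            eojeols.append(unit)              # start a new eojeol
    return " ".join(eojeols)

print(to_eojeols(["hakgyo", "+e", "gan", "+da"]))   # 'hakgyoe ganda'
```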
Conference Paper
For large vocabulary continuous speech recognition of highly inflected languages, the first step is to determine an appropriate recognition unit that reduces the high out-of-vocabulary rate. We investigate two approaches to selecting recognition units. In the morpheme-based approach, we use the morpheme as the basic recognition unit and merge frequent morpheme pairs into phrases by a rule-based method or a statistical unit-merging method. In statistical unit merging, we investigate the effects of part-of-speech constraints used in selecting merging candidates. In the syllable-based approach, assuming that only text data and pronunciations are available, we obtain merged syllables by using the same statistical merging method, where pronunciation variation is taken into account. The experimental results showed that the statistical merging method with appropriate linguistic constraints yields the best recognition accuracy. Although the syllable-based approach did not show comparable performance, it has the advantage that it does not require a part-of-speech tagging system.
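One plausible reading of statistical merging under a linguistic constraint: score adjacent tagged-morpheme pairs by pointwise mutual information and admit only part-of-speech combinations on an allowed list. The thresholds and the exact score are illustrative, not the paper's:

```python
import math
from collections import Counter

def merge_by_mi(tagged_corpus, allowed_pos_pairs, min_count=20, top_k=100):
    """tagged_corpus: sentences as lists of (morpheme, POS) tuples.
    Returns merged surface forms for high-PMI pairs whose POS pair is allowed."""
    uni, bi, total = Counter(), Counter(), 0
    for sent in tagged_corpus:
        uni.update(sent)
        bi.update(zip(sent, sent[1:]))
        total += len(sent)
    scored = []
    for (a, b), n in bi.items():
        if n >= min_count and (a[1], b[1]) in allowed_pos_pairs:
            pmi = math.log(n * total / (uni[a] * uni[b]))
            scored.append((pmi, a[0] + b[0]))
    return [m for _, m in sorted(scored, reverse=True)[:top_k]]

sents = [[("hak", "N"), ("eun", "J")]] * 40       # toy noun + particle data
print(merge_by_mi(sents, {("N", "J")}))           # ['hakeun']
```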
Conference Paper
This paper presents a comparative study on automatic speech recognition for two different Chinese dialects, namely Mandarin and Cantonese. It focuses on decision-tree based context-dependent acoustic modeling for large-vocabulary continuous speech recognition. Extensive phonological and phonetic knowledge is incorporated to design questions concerning the left and right context of sub-syllable units, namely INITIALs and FINALs. This results in a set of class-triphone models for each dialect. Syllable recognition accuracies of 81.7% and 75.5% are attained for Mandarin and Cantonese, respectively. This performance gap is attributable to various linguistic and practical factors, including: 1) phonological and phonetic discrepancies between the two dialects; 2) the design of the training databases; and 3) the design of the phonetic questions in decision-tree clustering.
Conference Paper
One of the most prevalent problems of large-vocabulary speech recognition systems is the large number of out-of-vocabulary words. This is especially the case when automatically transcribing broadcast news in languages other than English that have a large number of inflections and compound words. We introduce a set of techniques to decrease the number of out-of-vocabulary words during recognition by using linguistic knowledge about morphology and a two-pass recognition approach, where the first pass serves only to dynamically adapt the recognition dictionary to the speech segment to be recognized. A second recognition run is then carried out on the adapted vocabulary. With the proposed techniques we were able to reduce the OOV rate by more than 40%, thereby also improving the recognition results by an absolute 5.8%, from 64% word accuracy to 69.8%.
Conference Paper
We report on the LIMSI recognizer evaluated in the ARPA 1995 North American Business (NAB) news benchmark test. In contrast to previous evaluations, the new Hub 3 test aims at improving basic speaker-independent CSR performance on unlimited-vocabulary read speech recorded under more varied acoustical conditions (background environmental noise and unknown microphones). The LIMSI recognizer is an HMM-based system with Gaussian mixtures. Decoding is carried out in multiple forward acoustic passes, where more refined acoustic and language models are used in successive passes and information is transmitted via word graphs. In order to deal with the varied acoustic conditions, channel compensation is performed iteratively, refining the noise estimates before the first three decoding passes. The final decoding pass is carried out with speaker-adapted models obtained via unsupervised adaptation using the MLLR method. On the Sennheiser microphone (average SNR 29 dB) a word error rate of 9.1% was obtained, which can be compared to 17.5% on the secondary-microphone data (average SNR 15 dB) using the same recognition system.
Conference Paper
The development of tools for the analysis of benchmark speech recognition system tests is reported. One development is a tool implementing two statistical significance tests. Another involves studies of an alternative to the alignment process presently used in the DARPA/NIST scoring software (which minimizes a weighted sum of elementary word error types). The alternative process minimizes a measure of phonological implausibility. The purpose in developing a standard implementation of these tools is to make them uniformly available to system developers.
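The alignment underlying word-error scoring is a weighted Levenshtein dynamic program; the 4/3/3 substitution/insertion/deletion weights below are the ones commonly quoted for the NIST tool and should be treated as an assumption here:

```python
def align_wer(ref, hyp, sub_cost=4, ins_cost=3, del_cost=3):
    """Minimum weighted edit cost between reference and hypothesis word lists."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * del_cost
    for j in range(1, n + 1):
        d[0][j] = j * ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = d[i - 1][j - 1] + (0 if ref[i - 1] == hyp[j - 1] else sub_cost)
            d[i][j] = min(match, d[i - 1][j] + del_cost, d[i][j - 1] + ins_cost)
    return d[m][n]

print(align_wer("the cat sat".split(), "the cats at".split()))   # 8: two substitutions
```

The phonological-implausibility alternative mentioned in the abstract would replace the constant costs with word-pair-dependent ones; the dynamic program itself is unchanged.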
Article
To cope optimally with continuous speech recognition, we propose a stochastic lexicon model that effectively represents variations in pronunciation. In this lexicon model, the baseform of a word is represented by subword states with a probability distribution over subword units, as a two-level hidden Markov model (HMM), and this baseform is automatically trained from sample utterances. The proposed approach can also be applied to systems employing nonlinguistic recognition units.
Article
A description of a novel type of m-gram language model is given. The model offers, via a nonlinear recursive procedure, a computation- and space-efficient solution to the problem of estimating probabilities from sparse data. This solution compares favorably to other proposed methods. While the method was developed for, and successfully implemented in, the IBM Real Time Speech Recognizers, its generality makes it applicable in other areas where the problem of estimating probabilities from sparse data arises.
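The flavor of back-off estimation can be shown compactly. Katz's recipe uses Good-Turing discounts; the sketch below substitutes absolute discounting to stay short, so it illustrates the back-off structure rather than the exact method:

```python
from collections import Counter

def backoff_bigram(tokens, discount=0.5):
    """Absolute-discounted back-off bigram: subtract a constant from every
    seen bigram count and redistribute the freed mass over unigrams."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))

    def prob(w, prev):
        if (prev, w) in bi:
            return (bi[(prev, w)] - discount) / uni[prev]
        seen = [v for (p, v) in bi if p == prev]
        freed = discount * len(seen) / uni[prev]        # mass freed by discounting
        rest = sum(c for v, c in uni.items() if v not in seen)
        return freed * uni[w] / rest if rest else 0.0
    return prob

p = backoff_bigram("a rose is a rose is a rose".split())
print(round(p("rose", "a"), 3), round(p("is", "a"), 3))   # 0.833 0.067
```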
Article
Including phrases in the vocabulary list can improve n-gram language models used in speech recognition. In this paper, we report results of automatic extraction of phrases from the training text using frequency, likelihood, and correlation criteria. We show how a language model built from a vocabulary that includes useful phrases can systematically improve language model perplexity in a natural-language call-routing task and the 20K-Nov92 Wall Street Journal evaluation. We also discuss the impact of such phrase-based language models on recognition word error rate.
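A correlation-style extractor in miniature: keep adjacent pairs that are frequent and strongly predict each other in both directions, then add them to the vocabulary as single units. The thresholds are illustrative:

```python
from collections import Counter

def extract_phrases(corpus, min_count=25, min_cond=0.5):
    """Return 'a_b' phrase entries for adjacent word pairs with high
    bidirectional conditional probability."""
    uni, bi = Counter(), Counter()
    for sent in corpus:
        uni.update(sent)
        bi.update(zip(sent, sent[1:]))
    return [a + "_" + b for (a, b), n in bi.items()
            if n >= min_count and n / uni[a] >= min_cond and n / uni[b] >= min_cond]

corpus = [["new", "york", "city"]] * 30 + [["new", "deal"]] * 10
print(extract_phrases(corpus))    # ['new_york', 'york_city']
```

Likelihood-ratio criteria (also mentioned in the abstract) instead rank candidates by how surprising the pair's count is under an independence model.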
Article
The CMU Statistical Language Modeling toolkit was released in 1994 in order to facilitate the construction and testing of bigram and trigram language models. It is currently in use in over 40 academic, government and industrial laboratories in over 12 countries. This paper presents a new version of the toolkit. We outline the conventional language modeling technology, as implemented in the toolkit, and describe the extra efficiency and functionality that the new toolkit provides as compared to previous software for this task. Finally, we give an example of the use of the toolkit in constructing and testing a simple language model.
Article
This paper reports on recent improvements in Japanese broadcast news transcription and topic extraction. We constructed a language model that depends on the readings of words in order to prevent recognition errors caused by context-dependent readings of Japanese characters. We also introduced interjection modeling into the language model. To improve the model's performance for a series of sentences spoken by one speaker, on-line incremental speaker adaptation was applied. We investigated a method for extracting topic words from the speech recognition results based on a significance measure. This paper also proposes a new formulation for speech recognition/understanding systems, in which the a posteriori probability of the message that the speaker intends to convey, given an observed acoustic sequence, is maximized. We applied the formulation to rescoring the recognition hypotheses.
Large vocabulary Korean continuous speech recognition using a one-pass algorithm. In: Proc. ICSLP 2000.
Stochastic lexicon modeling for speech recognition
  • S.-J. Yun
  • Y.-H. Oh
Yun, S.-J., Oh, Y.-H., 1999. Stochastic lexicon modeling for speech recognition. IEEE Signal Process. Lett. 6 (2), 28–30.
Lexical disambiguation with error-driven learning
  • J.-H. Kim
Kim, J.-H., 1996. Lexical disambiguation with error-driven learning. Ph.D. Dissertation, Department of Computer Science, Korea Advanced Institute of Science and Technology.
Language-independent data-oriented grapheme-to-phoneme conversion
  • W. Daelemans
  • A. Bosch
Daelemans, W., Bosch, A., 1996. Language-independent data-oriented grapheme-to-phoneme conversion. In: Santen, J.V., Sproat, R., Olive, J., Hirschberg, J. (Eds.), Progress in Speech Synthesis. Springer, Berlin.
An efficient search space representation for large vocabulary continuous speech recognition
  • K. Demuynck
  • J. Duchateau
  • D.V. Compernolle
  • P. Wambacq
Demuynck, K., Duchateau, J., Compernolle, D.V., Wambacq, P., 2000. An efficient search space representation for large vocabulary continuous speech recognition. Speech Communication 30, 37–53.
Adaptive vocabularies for transcribing multilingual broadcast news
  • P. Geutner
  • M. Finke
  • P. Scheytt
Geutner, P., Finke, M., Scheytt, P., 1998. Adaptive vocabularies for transcribing multilingual broadcast news. In: Proc. ICASSP'98.
Some statistical issues in the comparison of speech recognition algorithms
  • L. Gillick
  • S. Cox
Gillick, L., Cox, S., 1989. Some statistical issues in the comparison of speech recognition algorithms. In: Proc. ICASSP'89.
The IBM LVCSR system for 1998 Mandarin broadcast news transcription evaluation
  • X.F. Guo
  • W.B. Zhu
  • Q. Shi
Guo, X.F., Zhu, W.B., Shi, Q., 1999. The IBM LVCSR system for 1998 Mandarin broadcast news transcription evaluation.
1998 Broadcast news benchmark test results
  • Pallett