
The 1999 BBN BYBLOS 10xRT Broadcast News Transcription System

November 2000

Authors:

Long Nguyen

Hochschule Neubrandenburg

Jayadev Billa

University of Southern California

Long Nguyen, Spyros Matsoukas, Jason Davenport, Jay Billa, Rich Schwartz, John Makhoul

BBN Technologies

70 Fawcett Street

Cambridge MA 02138

ln@bbn.com

ABSTRACT

In this paper, we describe the BBN BYBLOS system used for the 1999 Hub-4E 10xRT evaluation benchmark, and discuss the improvements made to the system in 1999. We focus on the techniques that were new in this year's system to achieve an optimal tradeoff between accuracy and speed for the evaluation benchmark test. Overall, we reduced the word error rate on the 1998 Hub-4E evaluation test by 14% relative to our 1998 10xRT system (from 17.1% to 14.7%), or equivalently we sped up the 1998 Primary system 24 times (from 240xRT to 10xRT) while maintaining the same word error rate (14.7%). This progress was attributed to improvement in fast segmentation using dual-band and dual-gender phone-class models based on RASTA-normalized features, supervised MLLR adaptation of band-limited models to real telephone training data, adaptation between decoding passes, and various adaptation speedups.

1. INTRODUCTION

The 1999 BBN BYBLOS 10xRT broadcast news transcription system was based on both the 1998 BYBLOS Primary System [1] and the 1998 BYBLOS 10xRT system [2], with substantial algorithmic improvements as well as system changes. Automatic transcription of broadcast news is a challenging speech recognition problem because of the frequent and unpredictable changes that occur in speaker, speaking style, topic, channel, and background conditions. A successful transcription system not only requires robust models to deal with this variability, but also needs an efficient segmentation strategy to break the continuous audio stream into smaller, manageable segments. In contrast to the slow segmentation scheme deployed in our 1998 10xRT system, this year's system used an improved segmentation algorithm that not only took much less time but also produced better segments, which eventually resulted in a lower word error rate.

Faster segmentation also provided an opportunity (within the 10xRT limit) to run multiple decoding stages with refined models that could lead to better recognition accuracy. The 1998 10xRT system had only one decoding stage, in which speaker/channel adapted models could be used only once, in the N-best rescoring pass. In this year's system we could run two decoding stages, with fast between-pass adaptation during the first decoding stage and full adaptation in the second stage.

Similar to last year's system, we had another set of band-limited acoustic models to handle the telephone speech portion of the evaluation test set. However, these acoustic models were further refined in this year's system. After obtaining the models trained on all acoustic training data analyzed with reduced bandwidth, we applied supervised MLLR [9] adaptation to these models using the subset of real telephone speech data.

The paper is organized as follows. Section 2 gives an overview of the system used for the 1999 10xRT Hub-4E evaluation. In Section 3 we discuss the improvements made to the system since the 1998 benchmark, along with experimental results. We finish in Section 4 with a description of our 1999 Hub-4E evaluation results and the computational resources used during the evaluation.

2. SYSTEM DESCRIPTION

We used 200 hours (nominal; 140 hours actual) of Broadcast News training data from the 1996, 1997, and 1998 LDC releases plus 5 hours of Marketplace data from 1995. The data was partitioned by gender to create two sets of gender-dependent (GD), speaker-independent (SI) models, without regard for speech condition or signal bandwidth. Two corresponding sets of reduced bandwidth (125-3750 Hz) GD, SI models were also created using the same training data.

For each gender, we created three SI models to be used in our multiple-pass recognizer:

- PTM: 512 Gaussians per phone, within-word triphones
- SCTM NX qph: 64 Gaussians per state, 3.7K states, within-word quinphones
- SCTM XW qph: 64 Gaussians per state, 4K states, cross-word quinphones

We also created reduced bandwidth PTM, SCTM NX qph, and SCTM XW qph models for each gender. (A detailed description of the acoustic models and how each model is trained can be found in [3].)

We used a total of 600 million words to train the language model. The data were from the following sources:

- 556 million words selected from the LDC official releases North American News Text Corpus, North American News Text Corpus (Supplement), and AP Worldstream English, and the previous release in 1997. Data prior to 1994 were excluded. We also excluded data in the previous year's test epochs (1996/10/15 to 1996/11/15, and after 1998/02/28).
- 40 million words from in-house data previously obtained through Primary Source Media, and
- 4 million words from the LDC-released acoustic training data (weighted by a factor of 20).

The resulting language model had 13M bigrams and 43M trigrams.
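The source weighting mentioned above can be illustrated with a minimal sketch. It assumes that "weighted by a factor of 20" means multiplying the n-gram counts of the small in-domain corpus before pooling them with the other sources; under that assumption the 4 million transcript words would count like roughly 80 million. The paper does not state how the weight was applied, so the function name `merge_weighted_counts` and the toy counts are hypothetical.

```python
from collections import Counter


def merge_weighted_counts(sources):
    """Pool n-gram counts from several corpora, scaling each corpus by a weight.

    `sources` is a list of (counts, weight) pairs, where `counts` maps an
    n-gram tuple to its raw count in that corpus.
    """
    merged = Counter()
    for counts, weight in sources:
        for ngram, count in counts.items():
            merged[ngram] += weight * count
    return merged


# Toy counts (hypothetical): a large news-text corpus and a small in-domain one.
news_text = Counter({("the", "president"): 50, ("stock", "market"): 30})
acoustic_transcripts = Counter({("the", "president"): 2, ("good", "evening"): 3})

# Assumption: the in-domain transcripts are up-weighted by a factor of 20.
merged = merge_weighted_counts([(news_text, 1), (acoustic_transcripts, 20)])
print(merged.most_common())
```

With this kind of pooling, an n-gram seen only in the small transcript set (such as the bigram "good evening" above) is not swamped by the much larger news-text counts.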

The 1999 BBN BYBLOS 10xRT system was run in three stages: segmentation, first decoding stage, and second decoding stage.

1. Segmentation: We first separate the test into wide-band and narrow-band material, using a dual-band phoneme decoder [1]. Each channel is then normalized with RASTA [4], and a dual-gender phoneme decoder is applied to detect gender changes and silence locations. Within each channel-gender chunk, we perform speaker change detection [5], so we end up with an automatic segmentation that defines speaker turns, along with their gender and channel labels. Vocal Tract Length Normalization (VTLN) is then employed to select the optimal stretch factor for each speaker turn, and the test material is re-analyzed using LPC smoothing and non-causal cepstral mean subtraction. At this stage the narrow-band-labeled segments are analyzed at a reduced bandwidth (125-3750 Hz). Finally, the speaker turns are chopped into short segments (averaging 4 seconds) based on the detected silence locations.

2. First Decoding Stage: The first decoding is carried out in a sequence of three passes with fast between-pass adaptation, as explained below.
   - forward PTM fastmatch [6]
   - constrained MLLR [7] unsupervised adaptation of SCTM NX qph models using the forward-pass hypotheses as transcripts
   - backward adapted SCTM within-word quinphone decoding, producing an N-best list [8]
   - constrained MLLR unsupervised adaptation of SCTM XW qph models using the top-1 hypothesis from the N-best list
   - adapted SCTM cross-word quinphone rescoring of the N-best, along with a trigram language model rescoring, to find the best hypotheses

3. Second Decoding Stage: The best hypotheses produced by the first decoding stage are then used to adapt the PTM and SCTM model means with 8 transformations. Then, a forward-backward-rescore cascade is run again with the adapted models to produce the system's final recognition output.
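The N-best rescoring step at the end of each decoding stage can be pictured with a small sketch. It assumes each hypothesis carries an acoustic log-likelihood (from the cross-word models) and a trigram language-model log-probability, and that the two are combined as a weighted sum with an optional per-word penalty. That combination rule, the parameter names `lm_weight` and `word_penalty`, and the toy scores are assumptions for illustration; the paper does not give the scoring formula.

```python
def rescore_nbest(nbest, lm_weight=1.0, word_penalty=0.0):
    """Return the N-best entries sorted by combined score, best first.

    Each entry is a dict with:
      'words'    - list of hypothesized words
      'acoustic' - acoustic log-likelihood of the hypothesis
      'lm'       - language-model log-probability of the word sequence
    """
    def total(hyp):
        return (hyp["acoustic"]
                + lm_weight * hyp["lm"]
                + word_penalty * len(hyp["words"]))

    return sorted(nbest, key=total, reverse=True)


# Toy 2-best list (hypothetical scores): the acoustically preferred entry
# is less plausible under the language model.
nbest = [
    {"words": ["the", "whether", "report"], "acoustic": -100.0, "lm": -12.0},
    {"words": ["the", "weather", "report"], "acoustic": -101.0, "lm": -6.0},
]

best = rescore_nbest(nbest, lm_weight=1.0)[0]
print(" ".join(best["words"]))
```

Because only the short N-best list is rescored, the more expensive cross-word models and the trigram language model never have to be applied to the full search space.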

3. RECENT IMPROVEMENTS

3.1. Fast Segmentation

In the 1998 evaluation we used an elaborate segmentation strategy that employed a GI 12-phone dual-band decoding for band detection, followed by a dual-gender word decoding for gender detection within channel turns. This procedure leads to fairly accurate band and gender detection, but is too expensive to incorporate in a real-time recognition system. Even for the 10x evaluation condition, we had to sacrifice accuracy by pruning the word decoding aggressively in order to bring the total segmentation time down to 1.9 xRT.

Besides the timing constraints, there is also the issue of which cepstral normalization method to use when the segment boundaries are not yet defined. Non-causal Cepstral Mean Subtraction (CMS) performs best when applied to pure channel/speaker segments, and introduces errors when the segments have mixed conditions. Dropping CMS altogether, on the other hand, results in bad silence detection and many speech deletions. For all these reasons we decided to use the RASTA method in place of CMS for normalization, since it is very robust and does not depend on the segmentation boundaries.
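The difference between the two normalizations can be seen in a toy sketch. It assumes the commonly cited RASTA band-pass filter (numerator 0.1 × [2, 1, 0, -1, -2], single pole at 0.98) applied causally to one coefficient trajectory; the pole value, the causal form, and the toy offsets are assumptions, not details taken from this system. The input is a trajectory with no speech content, only two different constant channel offsets back to back, which mimics a segment that mixes two conditions.

```python
def cms(track):
    """Non-causal cepstral mean subtraction over one whole segment."""
    mean = sum(track) / len(track)
    return [x - mean for x in track]


def rasta_filter(track, pole=0.98):
    """Causal RASTA-style band-pass filter on one coefficient trajectory.

    The FIR numerator sums to zero, so any constant offset is removed
    after a short transient, without knowing where segments begin or end.
    """
    b = [0.2, 0.1, 0.0, -0.1, -0.2]
    out = []
    prev = 0.0
    for t in range(len(track)):
        fir = sum(b[k] * track[t - k] for k in range(len(b)) if t - k >= 0)
        prev = fir + pole * prev
        out.append(prev)
    return out


# Toy trajectory (hypothetical): 600 frames with offset +5, then 600 with -3.
mixed = [5.0] * 600 + [-3.0] * 600

after_cms = cms(mixed)             # one mean for the whole mixed segment
after_rasta = rasta_filter(mixed)  # no segment boundaries needed

print(after_cms[0], after_cms[-1])
print(abs(after_rasta[599]) < 1e-3, abs(after_rasta[-1]) < 1e-3)
```

CMS subtracts a single mean computed over both conditions, so each half keeps a residual offset, whereas the filtered trajectory settles back near zero within each condition.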

To test the efficacy of RASTA phone-class models for segmentation, we performed a series of experiments on the 1997 Hub-4 evaluation test set, using SI gender-dependent models trained on approximately 150 hours of broadcast news. In those experiments we used RASTA not only for the initial test segmentation phone-class models, but for all the models used in later recognition passes. The results are shown in Table 1, where we can see that the segmentation using the RASTA context-independent 12-phone dual-gender model is 0.6% absolute better than last year's 10xRT system's segmentation (18.2% versus 18.8% WER), while running at 0.2 xRT instead of 1.9 xRT. The last line in Table 1 shows the WER obtained when the true channel/speaker boundaries are used for the segmentation, and when the silence detection is based on forced alignment of the reference transcripts with the best cross-word SCTM models. It is clear that the accuracy of the fast phone-class RASTA segmentation is very close to that of the unfair segmentation.

Segmentation method     xRT   WER (%)
1998 10x system         1.9   18.8
12-phone dual-gender    0.2   18.2
unfair segmentation     N/A   17.7

Table 1: Effect of fast RASTA segmentation (band-independent)

We also trained a 22-phone dual-band dual-gender model in an attempt to detect silence, band and gender simultaneously. This increased the cost of segmentation by 0.1 xRT, but unfortunately introduced more segmentation errors, and the resulting WER was 18.6%. We tried to fix this problem by separating the channel detection from the gender detection, using two passes of phoneme decoding: a first pass with a 12-phone dual-band model, followed by a second pass with a 12-phone dual-gender model. By examining the output of the dual-band phoneme recognizer, we found that the channel detection was not very accurate on RASTA-normalized features, so we concluded that it is better to perform the channel detection on unnormalized features. Table 2 shows the results using unnormalized cepstra for band detection:

Segmentation method                    xRT   BD models   WER (%)
12-ph dual-band + 12-ph dual-gender    0.4   no          18.3
12-ph dual-band + 12-ph dual-gender    0.4   yes         17.6

Table 2: Effect of fast band/gender detection with and without band-specific models in later recognition passes

So the segmentation procedure that works best is:

1. analyze the waveform to generate cepstra for each frame.
2. decode using the 12-phone dual-band phone-class model, in order to obtain channel change boundaries.
3. smooth out phoneme decoder output to eliminate too short channel turns.
4. apply RASTA normalization to the cepstra from step (1).
5. segment the RASTA input into channel turns, as specified in step (3).
6. decode each channel turn using the RASTA 12-phone dual-gender phone-class model, in order to obtain gender changes.
7. smooth out phoneme decoder output to eliminate short gender turns.
8. run fast speaker change detection within each channel-gender turn, to determine speaker boundaries, and divide into speaker turns. This information can be used later for adaptation.

Footnote (to the RASTA experiments reported in Table 1): We later found that there was a small gain from using non-causal CMS on each speaker turn after the segmentation is completed, so the final BYBLOS system used non-causal CMS for the word decoding passes.
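The smoothing in steps (3) and (7) can be sketched as a run-length operation over per-frame labels. The rule used here, absorbing any run shorter than a minimum length into the run that precedes it, is an assumption chosen for illustration; the paper does not say how short turns were removed, and the name `smooth_turns` and the threshold are hypothetical.

```python
from itertools import groupby


def smooth_turns(labels, min_frames):
    """Relabel runs shorter than `min_frames` with the preceding run's label.

    `labels` is a per-frame sequence such as band labels ("W"/"N") or
    gender labels ("M"/"F"). A short run at the very start is kept as is,
    since it has no preceding run to merge into.
    """
    runs = [[label, len(list(group))] for label, group in groupby(labels)]
    merged = []
    for label, length in runs:
        if merged and length < min_frames:
            merged[-1][1] += length      # absorb the short run
        elif merged and merged[-1][0] == label:
            merged[-1][1] += length      # rejoin runs that now share a label
        else:
            merged.append([label, length])
    return [label for label, length in merged for _ in range(length)]


# Toy decoder output: a 2-frame narrow-band blip inside wide-band audio.
frames = ["W"] * 10 + ["N"] * 2 + ["W"] * 8 + ["N"] * 12
smoothed = smooth_turns(frames, min_frames=5)
print("".join(smoothed))
```

After smoothing, the toy sequence contains one wide-band turn followed by one narrow-band turn, so later passes do not switch models for a two-frame detection error.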

3.2. Fast Adaptation

Adaptation is very desirable in the design of a real-time system, because it customizes the acoustic models to each test speaker, improving both recognition accuracy and speed. However, when the acoustic model is very large, the cost of adaptation is not negligible, and can be broken down into two parts: the cost of estimating the transformation and the cost of applying the transformation. The estimation stage requires a forward pass to obtain a frame-to-state alignment and accumulate sufficient statistics. In the case of MLLR, $n$ matrix accumulators are needed, where $n$ is the size of the feature vector ($n$ = … for our system). In order to speed up the forward pass, we used the same Fast Gaussian Computation (FGC) method that we apply during recognition. In addition, we were able to speed up the accumulation process by using a least squares criterion to estimate the transformation, instead of the usual maximum likelihood criterion. This Least Squares Linear Regression (LSLR) method requires only one accumulator matrix, and suffers only a small degradation in accuracy, as shown in Table 3.

Adaptation method    FGC in fw pass   xRT   WER (%)
MLLR                 no               1.5   17.0
MLLR                 yes              1.0   17.0
LSLR                 yes              0.6   17.1
constr. MLLR         yes              1.5   17.1
base constr. MLLR    yes              0.4   17.1

Table 3: Effect of FGC, LSLR and constrained MLLR during the estimation of the transformation. The results are on h4e97, using GD RASTA models trained on 140 hours. Timing was done on a Pentium-II 450 MHz machine.
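The difference in accumulator count between MLLR and LSLR can be made explicit with a short derivation. The notation below is not from the paper: it follows the standard MLLR mean-transform formulation of [9] for Gaussians with diagonal covariances, and the least squares form is an assumption about what the LSLR criterion amounts to. Here $\xi_m = [1, \mu_m^\top]^\top$ is the extended mean of Gaussian $m$, $\gamma_m(t)$ its occupation probability at frame $t$, $o(t)$ the observation, and $W$ the $n \times (n+1)$ transformation.

```latex
% MLLR (maximum likelihood, diagonal covariances): each row w_i of W
% has its own accumulator, because the variances \sigma_{m,i}^2 differ
% across feature dimensions i = 1, ..., n.
G^{(i)} = \sum_{m} \sum_{t} \gamma_m(t)\, \frac{1}{\sigma_{m,i}^{2}}\, \xi_m \xi_m^{\top},
\qquad
z^{(i)} = \sum_{m} \sum_{t} \gamma_m(t)\, \frac{o_i(t)}{\sigma_{m,i}^{2}}\, \xi_m,
\qquad
w_i = \bigl(G^{(i)}\bigr)^{-1} z^{(i)} .

% LSLR (least squares): minimizing
%   \sum_m \sum_t \gamma_m(t) \, \| o(t) - W \xi_m \|^2
% drops the variance weighting, so one accumulator G serves all rows.
G = \sum_{m} \sum_{t} \gamma_m(t)\, \xi_m \xi_m^{\top},
\qquad
Z = \sum_{m} \sum_{t} \gamma_m(t)\, o(t)\, \xi_m^{\top},
\qquad
W = Z\, G^{-1} .
```

Under these assumptions MLLR accumulates $n$ matrices of size $(n+1) \times (n+1)$, while LSLR accumulates a single one, which is consistent with the drop from 1.0 to 0.6 xRT in Table 3.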

Unfortunately, there isn't much we can do to reduce the cost of applying the transformation. The transformation matrix has to be multiplied with every Gaussian mean in the acoustic model, and this can be very costly when the model is large. It would be much faster if we applied the transformation to the features instead. This is possible with constrained MLLR adaptation, in which a single transformation matrix is used to adapt both the mean and variance of a Gaussian. In this case, it is equivalent to estimating a transformation matrix that is applied to the observations. This speeds up the application of the transform significantly, but the estimation is very expensive (a detailed analysis of the computational complexity of this method is given in [7]). We found that we can reduce the cost of the estimation process by at least a factor of three by estimating the transformation matrix based on the steady-state feature parameters only. In other words, we do not accumulate statistics for the first and second derivatives of the base input features. Then, the resulting base transform is applied to both the steady-state features and their derivatives. Interestingly, there is no degradation in accuracy from using this approximation, as demonstrated in Table 3.
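Applying a base transform to a full feature frame can be sketched as follows. The sketch assumes a frame laid out as three equal blocks (steady-state features, first derivatives, second derivatives) and an affine base transform with matrix `A` and offset `b`. Adding the offset only to the steady-state block is an assumption, motivated by the fact that a constant offset vanishes under differentiation; the paper does not spell this out, and the function names and toy values are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def apply_base_transform(frame, A, b):
    """Apply a transform estimated on steady-state features to a whole frame.

    `frame` is [static | delta | delta-delta], each block of length len(b).
    The same matrix A is reused for all three blocks, so no statistics are
    needed for the derivative features when estimating the transform.
    """
    d = len(b)
    assert len(frame) == 3 * d, "frame must hold three blocks of size d"
    static, delta, delta2 = frame[:d], frame[d:2 * d], frame[2 * d:]
    new_static = [y + bias for y, bias in zip(matvec(A, static), b)]
    # Assumption: derivative blocks receive only the linear part.
    return new_static + matvec(A, delta) + matvec(A, delta2)


# Toy 2-dimensional base features (hypothetical values).
A = [[2.0, 0.0],
     [0.0, 0.5]]
b = [1.0, -1.0]
frame = [1.0, 2.0, 0.5, 4.0, -1.0, 0.0]

print(apply_base_transform(frame, A, b))
```

Since the transform acts on the observations, the cost per frame depends only on the feature dimension and not on the number of Gaussians in the acoustic model.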

The constrained MLLR results reported in this table were obtained with a single transformation matrix. Contrary to our expectations, we found no additional gain from using more than one transformation with constrained MLLR. Thus, we decided to use this method only during the first decoding stage. In the second decoding stage we used LSLR adaptation of the model means with 8 transformations.

3.3. Narrow-Band Model Adaptation

The broadcast news acoustic training data contains only a small amount of telephone speech (about 8 hours of male and 3 hours of female), which is not enough for training gender-dependent telephone-specific models. In last year's system, we band-limited all the training data to make it sound like telephone speech, and trained a separate set of acoustic models. This year we extended this idea a bit further, and used the real telephone training data for adapting the band-limited models with MLLR. This gave us an extra gain on the telephone conditions (F2 and FX), as shown in Table 4. It is interesting to see that adapting the wideband models to the real telephone data is slightly better than using band-limited models without supervised adaptation.

Band-specific   Telephone-adapted   F2     FX     all
no              no                  23.8   32.4   16.4
no              yes                 21.3   30.3   15.8
yes             no                  21.9   30.6   16.0
yes             yes                 19.9   29.8   15.6

Table 4: Effect of band-specific models with and without supervised adaptation to real telephone training data, on h4e97 (WER, %). Models were trained on 140 hours.

Note that the second half of the acoustic training transcripts does not contain information about the channel, so we had to detect the channel automatically. We did this using the same 12-phone dual-band phone-class model that was used for channel detection during the test segmentation stage.

4. 1999 HUB-4E RESULTS

Table 5 shows the BBN results on the 1999 10xRT Hub-4E benchmark. (The Hub-4E evaluation test set in 1999 (h4e99) seems to be harder than that of the previous year.) We can see that the word error rate of the first decoding stage (17.9%) is very close to the final system's performance after two decoding stages (17.3%). In other words, the system can be configured to run in less than half the time (i.e., without the second decoding stage, as illustrated in Table 7) with a tradeoff of 3.4% relative accuracy degradation.

Stage   F0    F1     F2     F3     F4     F5     FX     all
1       9.4   17.6   19.1   16.1   15.8   20.6   44.3   17.9
2       9.1   16.8   18.3   15.6   15.5   19.2   43.1   17.3

Table 5: BBN results on the 1999 Hub-4 10xRT evaluation benchmark. The WER (%) is shown for both the first and second decoding stages.

It is also interesting to see the overall improvement of the BBN BYBLOS system in 1999. Using the techniques described in the previous section, we were able to reduce the word error rate on the 1998 Hub-4E evaluation set (h4e98) by 14% relative to our 1998 10x system, as shown in Table 6 by focus condition. We can also see that the 1999 10x system achieves the same accuracy as our 1998 primary system, but runs about 24 times faster.

Condition   1998 Primary   1998 10x   1999 10x
F0          9.6            10.3       9.1
F1          14.8           17.0       16.1
F2          18.6           24.9       18.9
F3          22.4           22.5       17.7
F4          21.0           16.5       14.1
F5          18.4           21.7       21.7
FX          29.5           29.7       24.2
Overall     14.8           17.1       14.7
xRT         244.0          10.0       10.0

Table 6: Comparison of the BYBLOS 1998 and 1999 systems on h4e98 (WER, %). Timing was performed on a Pentium-II 450 MHz machine.
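The headline figures can be recomputed from Table 6 with a few lines of arithmetic. The sketch below copies the 1998 10x and 1999 10x columns from the table; the helper name `relative_reduction` is hypothetical, and relative reduction is taken to mean (old - new) / old.

```python
# WER (%) on h4e98, copied from Table 6.
wer_1998_10x = {"F0": 10.3, "F1": 17.0, "F2": 24.9, "F3": 22.5,
                "F4": 16.5, "F5": 21.7, "FX": 29.7, "Overall": 17.1}
wer_1999_10x = {"F0": 9.1, "F1": 16.1, "F2": 18.9, "F3": 17.7,
                "F4": 14.1, "F5": 21.7, "FX": 24.2, "Overall": 14.7}


def relative_reduction(old, new):
    """Relative WER reduction in percent, measured against the old system."""
    return 100.0 * (old - new) / old


reductions = {cond: relative_reduction(wer_1998_10x[cond], wer_1999_10x[cond])
              for cond in wer_1998_10x}

# Speed-up of the 1999 10x system over the 1998 Primary system (xRT row).
speedup = 244.0 / 10.0

for cond, value in reductions.items():
    print(f"{cond:8s} {value:5.1f}% relative reduction")
print(f"speed-up over 1998 Primary: {speedup:.1f}x")
```

The overall figure comes out at about 14%, matching the abstract, while the per-condition values show that the gain is uneven: F5 does not change at all between the two 10x systems.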

Computational Resources and Timing Information: The computation for this evaluation was done on Intel-based PCs with 600 MHz Pentium-III CPUs, 1024 MB of RAM, and 2 GB of swap space. The operating system was Linux RedHat 4.1 and the compiler was GNU gcc version 2.95.1 from the Free Software Foundation. Table 7 shows timing information for the basic recognition stages of the 1999 10x system on the h4e99 test set.

Stage             xRT
Segmentation      1.1
First Decoding    3.1
Second Decoding   4.8
Total             9.0

Table 7: Timing information of the 1999 BYBLOS system on the h4e99 test set, measured on Pentium-III 600 MHz PCs running Linux.
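The speed/accuracy tradeoff of dropping the second decoding stage can be checked directly from Tables 5 and 7. The sketch below uses only numbers from those tables; measuring the degradation relative to the first-stage WER is an assumption about how the quoted 3.4% was computed.

```python
# Stage timings (xRT) on h4e99, copied from Table 7.
stage_xrt = {"Segmentation": 1.1, "First Decoding": 3.1, "Second Decoding": 4.8}

total_xrt = sum(stage_xrt.values())
first_stage_only_xrt = stage_xrt["Segmentation"] + stage_xrt["First Decoding"]
time_fraction = first_stage_only_xrt / total_xrt

# Overall WER (%) after each decoding stage, copied from Table 5.
wer_stage1, wer_stage2 = 17.9, 17.3

# Assumption: degradation is expressed relative to the first-stage WER.
relative_degradation = 100.0 * (wer_stage1 - wer_stage2) / wer_stage1

print(f"full system:      {total_xrt:.1f} xRT, WER {wer_stage2}%")
print(f"first stage only: {first_stage_only_xrt:.1f} xRT, WER {wer_stage1}%")
print(f"time fraction:    {time_fraction:.2f}")
print(f"relative WER degradation: {relative_degradation:.1f}%")
```

Stopping after the first decoding stage takes 4.2 of the 9.0 xRT, which is under half the total time, and the computed degradation rounds to the 3.4% quoted in the text.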

5. SUMMARY

We have described our 1999 BYBLOS 10xRT broadcast news transcription system deployed in the DARPA 1999 Hub-4E benchmark test. Compared to the previous year, we achieved a relative 14% word error rate reduction when running at the same speed (10xRT). Or equivalently, we sped up the BYBLOS transcription system by a factor of 24 while maintaining the same accuracy (14.7%). This optimal tradeoff was achieved not only through low-level code optimization, but also through higher-level algorithmic and system changes. We developed a faster and more accurate segmentation strategy in which only phone-class decoding is needed. Band detection is best done on unnormalized cepstra, while gender and silence detection are more accurate on RASTA-normalized cepstra. We also developed a fast adaptation approach in the feature space to be used between decoding passes. The adaptation transformation matrix can be estimated from the steady-state features only, yet applied to both the steady-state features and their derivatives. Narrow-band acoustic models trained on all training data analyzed with reduced bandwidth can be refined further by applying supervised adaptation using the subset of real telephone speech as adaptation data.

Acknowledgements

This work was sponsored by the Defense Advanced Research Projects Agency and monitored by the Space and Naval Warfare Systems Command under Contract No. N66001-97-D-8501. The views and findings contained in this material are those of the authors and do not necessarily reflect the position or policy of the Government and no official endorsement should be inferred.

References

1. S. Matsoukas, L. Nguyen, J. Davenport, J. Billa, F. Richardson, M. Siu, D. Liu, R. Schwartz, J. Makhoul, “The 1998 BBN BYBLOS Primary System Applied to English and Spanish Broadcast News Transcription,” DARPA Broadcast News Transcription Workshop, Herndon, VA, Feb. 1999, pp. 255-260.

2. J. Davenport, L. Nguyen, S. Matsoukas, R. Schwartz and J. Makhoul, “The 1998 BBN BYBLOS 10x Real Time System,” DARPA Broadcast News Transcription Workshop, Herndon, VA, Feb. 1999, pp. 261-263.

3. L. Nguyen, T. Anastasakos, F. Kubala, C. LaPre, J. Makhoul, R. Schwartz, N. Yuan, G. Zavaliagkos, Y. Zhao, “The 1994 BBN/BYBLOS Speech Recognition System,” ARPA Spoken Language Systems and Technology Workshop, Austin, TX, Jan. 1995, pp. 77-81.

4. H. Hermansky and N. Morgan, “RASTA processing of speech,” IEEE Transactions on Speech and Audio Processing, 2(4):578-589, Oct. 1994.

5. D. Liu, F. Kubala, “Fast Speaker Change Detection for Broadcast News Transcription and Indexing,” Eurospeech ’99, Budapest, Hungary, Sep. 1999, pp. 1031-1034.

6. L. Nguyen and R. Schwartz, “Single-Tree Method for Grammar-Directed Search,” ICASSP ’99, Phoenix, AZ, Mar. 1999, pp. 613-616.

7. M. J. F. Gales, “Maximum Likelihood Linear Transformations for HMM-based Speech Recognition,” Technical Report 291, Cambridge University, England, May 1997.

8. L. Nguyen and R. Schwartz, “Efficient 2-Pass N-Best Decoder,” EuroSpeech ’97, Rhodes, Greece, Sep. 1997, pp. 167-170.

9. C. J. Leggetter, P. C. Woodland, “Flexible Speaker Adaptation Using Maximum Likelihood Linear Regression,” Spoken Language Systems Technology Workshop, Austin, TX, Jan. 1995, pp. 110-115.

Robust time-synchronous environmental adaptation for continuous speech recognition systems

Conference Paper

Full-text available

Sep 2002

In this paper we describe system architectures for robust ML LR based environmental adaptation of continuous speech recognition systems. Inspired by an existing broadcast news transcript ion sys- tem (1) we refined the identification of acoustic scenarios by us- ing a combined GMM/HMM method. Thus environmental adapta- tion regarding arbitrary acoustic scenarios beyond speake r changes becomes possible. For deploying acoustic adaptation in interac- tive applications, such as human machine interaction, a time-syn- chronous adaptation approach is proposed. For different co rpora the evaluation of our approaches shows significant improvem ents in recognition accuracy while satisfying the constraint of time- synchronous processing.

Audio Indexing of Arabic broadcast news

Conference Paper

Full-text available

Feb 2002

This paper describes the development of the BBN audio indexing system for broadcast news in Arabic. Key issues addressed in this work revolve around the three major components of the audio indexing system: automatic speech recognition, speaker identification, and named entity identification. The system deals with several challenges introduced by the Arabic language, including the absence of short vowels in written text and the presence of compound words that are formed by the concatenation of certain conjunctions, prepositions, articles, and pronouns, as prefixes and suffixes to the word stem. The lack of short vowels in the transcripts prompted a novel solution that further demonstrated the power of hidden Markov models to deal with ambiguity. Another challenge was the acquisition of appropriate language modeling data, given the absence of broadcast news data for that purpose. We present performance results for all three components of the audio indexing system, which we believe represent the state of the art for Arabic broadcast news.

Speech Recognition

Chapter

Jan 2014

Gernot A. Fink

Today, a number of commercial speech recognition systems are available on the market and, just recently, speech enabled assistive services were introduced for smart phones. Nevertheless, the problem of automatic speech recognition should by no means be considered to be solved. In order to build a competitive speech recognition system, the integration of a multitude of techniques is required. The probably best documented research systems are the ones developed by Hermann Ney and colleagues at the former Philips Research Lab in Aachen, Germany, and later at RWTH Aachen University, Aachen, Germany. In this chapter we want to put the emphasis on the works at RWTH Aachen University. However, many aspects of the systems within the research tradition are identical with those developed by Philips. Afterwards we want to present the speech recognizer of BBN which, in contrast to most systems developed by private companies, is documented by several scientific publications. The chapter concludes with the description of a speech recognition system of our own developed on the basis of ESMERALDA.

Markov Models for Pattern Recognition: From Theory to Applications

Book

Jan 2007

Gernot A. Fink

Markov models are used to solve challenging pattern recognition problems on the basis of sequential data as, e.g., automatic speech or handwriting recognition. This comprehensive introduction to the Markov modeling framework describes both the underlying theoretical concepts of Markov models - covering Hidden Markov models and Markov chain models - as used for sequential data and presents the techniques necessary to build successful systems for practical applications. Additionally, the actual use of the technology in the three main application areas of pattern recognition methods based on Markov- Models - namely speech recognition, handwriting recognition, and biological sequence analysis - are demonstrated.

MeetingLogger: rich transcription of courtroom speech

Article

Mar 2002

In this paper we describe our on-going effort in developing a speech recognition system for transcribing courtroom hearings. Court hearings are a rich source of naturally occurring speech data, much of which is in public domain. The presence of multiple microphones coupled with presence of noise and reverberation makes the problem simultaneously rich and challenging. We have exploited the availability of multiple channels to mitigate, to some extent, the severe noise problem prevalent in courtroom speech. By using a novel technique for channel change detection, domain-specific language modeling, and unsupervised channel adaptation we have been able to achieve a word error rate (WER) of 36% with an acoustic model trained on 150 hours of broadcast news data. We also report on our preliminary acoustic modeling experiments with the "legal" transcripts provided with 120 hours of courtroom speech training data.

Light supervision in acoustic model training

Conference Paper

Jun 2004
Acoust Speech Signal Process

We present a new light supervision method to derive additional acoustic training data automatically for broadcast news transcription systems. A subset of the TDT corpus, which consists of broadcast audio with corresponding closed-caption (CC) transcripts, is identified by aligning the CC transcripts and the hypotheses generated by lightly-supervised decoding. Phrases of three or more contiguous words, on which both the CC transcripts and the decoder's hypotheses agree, are selected. The selection yields 702 hours, or 72% of the captioned data. When adding 700 hours of selected data to the baseline 141 hour broadcast news training data set, we achieved a 13% relative word error rate reduction. The key to the effectiveness of this light supervision method is the use of a biased language model (LM) in the lightly supervised decoding. The biased LM, in which the CC transcripts are added with heavy weighting, helps in selecting words the recognizer could have misrecognized if using a fair LM.

Improved speaker adaptation using speaker dependent feature projections

Conference Paper

Full-text available

Jan 2003

We extend the formulation of constrained maximum likelihood linear regression (CMLLR) adaptation to take into account full covariance matrices in the adapted model, and we use it in conjunction with heteroscedastic linear discriminant analysis (HLDA) in order to estimate speaker dependent feature projections on both training and test data. Results on the broadcast news corpus show that the proposed HLDA adaptation technique is very effective, even when combined with traditional CMLLR and MLLR adaptation, providing up to 8% relative improvement in recognition accuracy.

Advances in Transcription of Broadcast News and Conversational Telephone Speech Within the Combined EARS BBN/LIMSI System

Article

Full-text available

Oct 2006

This paper describes the progress made in the transcription of broadcast news (BN) and conversational telephone speech (CTS) within the combined BBN/LIMSI system from May 2002 to September 2004. During that period, BBN and LIMSI collaborated in an effort to produce significant reductions in the word error rate (WER), as directed by the aggressive goals of the Effective, Affordable, Reusable, Speech-to-text [Defense Advanced Research Projects Agency (DARPA) EARS] program. The paper focuses on general modeling techniques that led to recognition accuracy improvements, as well as engineering approaches that enabled efficient use of large amounts of training data and fast decoding architectures. Special attention is given on efforts to integrate components of the BBN and LIMSI systems, discussing the tradeoff between speed and accuracy for various system combination strategies. Results on the EARS progress test sets show that the combined BBN/LIMSI system achieved relative reductions of 47% and 51% on the BN and CTS domains, respectively

Fast speaker change detection for broadcast news transcription and indexing

Conference Paper

Full-text available

Sep 1999

Single-tree method for grammar-directed search

Conference Paper

Full-text available

Apr 1999

We present a very fast and accurate fast-match algorithm which, when followed by a regular beam search restricted within only the subset of words selected by the fast-match, can speed up the recognition process by at least two orders of magnitude in comparison to a typical single-pass speech recognizer utilizing the Viterbi (or beam) search algorithm. In this search strategy, the recognition vocabulary is structured as a single phonetic tree in the fast-match pass. The search on this phonetic tree is a variation of the Viterbi algorithm. Especially, we are able to use a word bigram language model without making copies of the tree during the search. This is a novel fast-match algorithm that has two important properties: high-accuracy recognition and run-time proportional to only the cube root of the vocabulary size

RASTA processing of speech

Article

Nov 1994

Performance of even the best current stochastic recognizers severely degrades in an unexpected communications environment. In some cases, the environmental effect can be modeled by a set of simple transformations and, in particular, by convolution with an environmental impulse response and the addition of some environmental noise. Often, the temporal properties of these environmental effects are quite different from the temporal properties of speech. We have been experimenting with filtering approaches that attempt to exploit these differences to produce robust representations for speech recognition and enhancement and have called this class of representations relative spectra (RASTA). In this paper, we review the theoretical and experimental foundations of the method, discuss the relationship with human auditory perception, and extend the original method to combinations of additive noise and convolutional noise. We discuss the relationship between RASTA features and the nature of the recognition models that are required and the relationship of these features to delta features and to cepstral mean subtraction. Finally, we show an application of the RASTA technique to speech enhancement.

The 1998 BBN Byblos 10x Real Time System

Article

Aug 2000

In this paper, we describe the BBN Byblos 10x real-time system used for the 1998 Hub-4 English tests. Given our state-of-the-art primary system [1] running at 230 times real time (230xRT), we show that eliminating and approximating many computationally expensive components speeds up the system by a factor of 23 with a relative loss in WER of 18%. This is accomplished without retraining or changing the primary system structure. The components of the primary system that are refined include segmentation, adaptation, decoding, cross-word rescoring with adaptation, and system combination. The time-saving algorithms used include fast Gaussian computation, grammar spreading, N-best tree rescoring, and block-diagonal adaptation. 1. INTRODUCTION Large vocabulary continuous speech recognition requires a considerable amount of computation. The amount of computation depends to a large degree on the quality of speech, with the computation increasing by a significant factor for more natural speech. ...

The 1998 BBN BYBLOS Primary System applied to English and Spanish Broadcast News Transcription

Article

Aug 2000

In this paper, we describe the BBN BYBLOS system used for the 1998 Hub-4E primary and Hub-4Sp evaluation benchmarks, and discuss the improvements made to the system in 1998. We focus on the techniques that were new in this year's system, including processing of the acoustic training data, test segmentation, revised cepstral normalization and Vocal Tract Length Normalization (VTLN), band-specific models, Diagonal transform Speaker Adaptive Training (DSAT), and a modified ROVER method for system combination. We show that by combining all the above techniques, we were able to improve the recognition accuracy on the 1997 Hub-4E evaluation test by 27% relative to our 1997 system (from 20.4% to 14.8%). We also present our results on the 1998 Hub-4E and Hub-4Sp benchmarks, and discuss the differences between the English and Spanish transcription systems. 1. INTRODUCTION The 1997 BBN BYBLOS system [1] was focused on improving the recognition accuracy of the F0 and F1 focus conditions (high fi...

Efficient 2-Pass N-Best Decoder

Article

Aug 2000

In this paper, we describe the new BBN BYBLOS efficient 2-Pass N-Best decoder used for the 1996 Hub-4 Benchmark Tests. The decoder uses a quick fast-match to determine the likely word endings. Then, in the second pass, it performs a time-synchronous beam search using a detailed continuous-density HMM and a trigram language model to decide the word starting positions. From these word starts, the decoder, without looking at the input speech, constructs a trigram word lattice and generates the top N likely hypotheses. This new 2-pass N-Best decoder maintains recognition performance comparable to that of the old 4-pass N-Best decoder, while its search strategy is simpler and much more efficient. 1. INTRODUCTION As previously described in [2], the old BBN BYBLOS decoder used a multi-pass search strategy consisting of 4 passes to generate the top N most likely hypotheses, which were then rescored using more detailed, but expensive knowledge sources. These N best hypotheses were then reordered and th...

Maximum likelihood linear transformations for HMM-based speech recognition

Article

Apr 1998

M.J.F. Gales

This paper examines the application of linear transformations for speaker and environmental adaptation in an HMM-based speech recognition system. In particular, transformations that are trained in a maximum likelihood sense on adaptation data are investigated. Only model-based linear transforms are considered, since, for linear transforms, they subsume the appropriate feature-space transforms. The paper compares the two possible forms of model-based transforms: (i) unconstrained, where any combination of mean and variance transform may be used, and (ii) constrained, which requires the variance transform to have the same form as the mean transform. Re-estimation formulae for all appropriate cases of transform are given. This includes a new and efficient full variance transform and the extension of the constrained model-space transform from the simple diagonal case to the full or block-diagonal case. The constrained and unconstrained transforms are evaluated in terms of computational cost, recognition time efficiency, and use for speaker adaptive training. The recognition performance of the two model-space transforms on a large vocabulary speech recognition task using incremental adaptation is investigated. In addition, initial experiments using the constrained model-space transform for speaker adaptive training are detailed.

Flexible Speaker Adaptation Using Maximum Likelihood Linear Regression

Article

Apr 1998

The maximum likelihood linear regression (MLLR) approach for speaker adaptation of continuous density mixture Gaussian HMMs is presented and its application to static and incremental adaptation for both supervised and unsupervised modes described. The approach involves computing a transformation for the mixture component means using linear regression. To allow adaptation to be performed with limited amounts of data, a small number of transformations are defined and each one is tied to a number of component mixtures. In previous work, the tyings were predetermined based on the amount of available data. Recently we have used dynamic regression class generation which chooses the appropriate number of classes and transform tying during the adaptation phase. This allows complete unsupervised operation with arbitrary adaptation data. Results are given for static supervised adaptation for non-native speakers and also unsupervised incremental adaptation. Both show the effectiveness and flexibility of the MLLR approach.

The 1994 BBN/BYBLOS Speech Recognition System

L. Nguyen, T. Anastasakos, F. Kubala, C. LaPre, J. Makhoul, R. Schwartz, N. Yuan, G. Zavaliagkos, Y. Zhao, "The 1994 BBN/BYBLOS Speech Recognition System", ARPA Spoken Language Systems and Technology Workshop, Austin, TX, Jan. 1995, pp. 77-81.
