LOW-RESOURCE KEYWORD SEARCH STRATEGIES FOR TAMIL
Nancy F. Chen1, Chongjia Ni1, I-Fan Chen2, Sunil Sivadas1, Van Tung Pham3, Haihua Xu3, Xiong Xiao3,
Tze Siong Lau3, Su Jun Leow3, Boon Pang Lim1, Cheung-Chi Leung1, Lei Wang1,
Chin-Hui Lee2, Alvina Goh3, Eng Siong Chng3, Bin Ma1, Haizhou Li1
1Institute for Infocomm Research, A*STAR, Singapore, 2Georgia Institute of Technology, USA,
3Nanyang Technological University, Singapore
nfychen@i2r.a-star.edu.sg
ABSTRACT
We propose strategies for a state-of-the-art keyword search (KWS)
system developed by the SINGA team in the context of the 2014
NIST Open Keyword Search Evaluation (OpenKWS14) using con-
versational Tamil provided by the IARPA Babel program. To tackle
low-resource challenges and the rich morphological nature of Tamil,
we present highlights of our current KWS system, including: (1)
Submodular optimization data selection to maximize acoustic diver-
sity through Gaussian component indexed N-grams; (2) Keyword-
aware language modeling; (3) Subword modeling of morphemes and
homophones.
Index Terms: Spoken term detection (STD), keyword spotting, under-resourced languages, active learning, unsupervised learning, semi-supervised learning, inflective languages, agglutinative languages, morphology, deep neural network (DNN)
1. INTRODUCTION
Keyword search (KWS) is a detection task where the goal is to find
all occurrences of an orthographic term (e.g., word or phrase) from
audio recordings. Applications of KWS include spoken document
indexing and retrieval [1] and spoken language understanding [2].
KWS systems can be categorized into two groups: (i) classic
keyword-filler based KWS [3], and (ii) large vocabulary continu-
ous speech recognition (LVCSR) based KWS [4]. In keyword-filler
based KWS systems, a spoken utterance is represented as a sequence
of keywords and non-keywords (often referred to as fillers [3]). Cus-
tomized detectors are built for the keywords. Keyword-filler based
systems often achieve a high detection rate using only a small dataset
for acoustic model training, but they do not scale well when the num-
ber of keywords increases.
By contrast, LVCSR-based KWS systems are flexible in han-
dling a large number of keywords, yet require sufficiently large
amounts of transcribed training data to achieve good performance.
Therefore, LVCSR-based KWS has worked well on resource-rich
languages like English, as has been shown in the NIST 2006 Spo-
ken Term Detection Evaluation [5]. However, such transcribe-
and-search approaches pose particular challenges to low-resource
languages such as Zulu and Tamil.
To tackle these challenges, the IARPA Babel program aims to
foster research “to rapidly develop speech recognition capability for
keyword search in a previously unstudied language, working with
speech recorded in a variety of conditions with limited amounts of
transcription.” The NIST 2014 Open Keyword Search Evaluation
(OpenKWS14) was held in April using the surprise language of
Tamil.
The challenges of the NIST OpenKWS14 Evaluation include
linguistic peculiarities of Tamil (e.g., a more than 30% out-of-vocabulary (OOV) rate due to its rich morphological structure), poor audio quality (e.g., noise, soft volume, cross-talk), and the limited amount of transcribed data. To address such challenges, we discuss our recent endeavors (see Figure 1) and related work below.

Fig. 1. Proposed keyword search system for low-resource languages. Orange blocks are highlights discussed in this work: submodular optimization for selecting data to transcribe, subword modeling of morphemes and homophones, and keyword-aware language modeling.
2. RELATION TO PRIOR WORK
2.1. Active Learning for Selecting Audio to Transcribe
Transcribing speech data is time-consuming and labor-intensive, es-
pecially for low-resource languages where linguistic expertise is lim-
ited or lacking. Thus it is critical to select the most informative
and representative subset of audio for human transcription. In prior
work, most approaches select utterances in a greedy fashion accord-
ing to their utility scores (e.g., confidence scores from automatic
speech recognition (ASR) [6, 7]). Similar to the confidence-based
approaches, [8, 9] use entropy reduction to select unlabeled utter-
ances. Low-resource methods for speech data selection include [10],
which only considers the ideal case where transcriptions are avail-
able in the first place.
The aforementioned methods usually require an ASR system to already be in place for data selection, and they do not guarantee optimality with regard to an objective function. Alternatively, [11, 12] formulate
this data selection problem as a constrained submodular optimiza-
tion setup. In this work, we follow this line of thinking and extend
it to KWS tasks. In particular, we propose to use Gaussian compo-
nent index based n-grams as acoustic features to select utterances to
transcribe.
2.2. Keyword-Aware Language Modeling
If keyword queries are known a priori, one can leverage such knowl-
edge to improve KWS performance. Previous work has shown how
to exploit keyword information for acoustic modeling [13] and de-
coding [14]. Our previous efforts exploit keyword information in
language modeling in Vietnamese [15]. In this work, we investigate
our approach in Tamil and further expand it to a framework integrat-
ing advantages from both keyword-filler based KWS and LVCSR-
based KWS.
2.3. Subword Modeling: Morphemes and Homophones
Mainstream LVCSR systems suffer from out-of-vocabulary (OOV)
issues. For morphologically-rich languages like Tamil, OOV rate is
especially high. While phones are commonly used to help resolve OOVs [16], morphs (automatically parsed morphemes; morphemes are the smallest semantic units in linguistics, e.g., the word unsuccessful consists of three morphemes: 'un', 'success', 'ful') have also been used in ASR [17]. In this work, we mitigate the data sparsity
issue of the morphologically-rich vocabulary in Tamil by integrating
morphs in the lexicon and language models. Our approach is similar
to [18], but we apply it on Tamil instead of Turkish. For Tamil ASR,
morph-based LMs have been reported [19]; in this work, we esti-
mate smoother word-morph LMs to address the serious data sparsity
issues of our dataset.
In our unpublished work, we found homophones useful in Viet-
namese KWS. In this work, we continue this line of investigation in
Tamil. To the best of our knowledge, to date there is no reported
work on using homophones in KWS.
3. LOW-RESOURCE KEYWORD SEARCH STRATEGIES
3.1. Submodular Optimization to Select Audio to Transcribe
3.1.1. Problem Formulation
Given a set of $N$ utterances $V = \{1, 2, \ldots, N\}$, we can construct a non-decreasing submodular set function $f : 2^V \rightarrow \mathbb{R}$ mapping each subset $S \subseteq V$ to a real number. The problem of selecting the best subset $S$ given some budget $K$ (e.g., the maximum number of transcribed utterances) can then be formulated as monotone submodular function maximization under a knapsack constraint:

$$\max_{S \subseteq V} \; \{ f(S) : |S| \le K \}. \qquad (1)$$

Submodularity can be interpreted as the property of diminishing returns: for any subsets $R \subseteq S \subseteq V$ and any utterance $s \in V \setminus S$,

$$f(S \cup \{s\}) - f(S) \le f(R \cup \{s\}) - f(R). \qquad (2)$$

Let $U$ be a set of features, and let $P = \{p_u\}_{u \in U}$ be a probability distribution over the set $U$. Let $m_u(S) = \sum_{s \in S} m_u(s)$ be a non-negative score for feature $u$ in set $S$, where $m_u(s)$ measures the degree to which utterance $s \in S$ possesses feature $u$. We can compute the KL-divergence between the distribution $P$ and the normalized distribution $\bar{m}_u(S) = m_u(S) / \sum_{u \in U} m_u(S)$:

$$D_{\mathrm{KL}}(P \,\|\, \bar{m}(S)) = \sum_{u \in U} p_u \log p_u - \sum_{u \in U} p_u \log m_u(S) + \log\Big(\sum_{u \in U} m_u(S)\Big) = \mathrm{const.} + \log\Big(\sum_{u \in U} m_u(S)\Big) - \sum_{u \in U} p_u \log m_u(S).$$

We define our objective function $f$ as follows:

$$f(S) = \log\Big(\sum_{u \in U} m_u(S)\Big) - D_{\mathrm{KL}}(P \,\|\, \bar{m}(S)) \qquad (3)$$
$$\;\;\;\;\;\;\;\; = \sum_{u \in U} p_u \log m_u(S), \qquad (4)$$

where the additive constant from the KL term is dropped since it does not affect the maximization. The first term in Eq. (3) represents the acoustic diversity characterized by $m$, while the second term represents how close the distribution $\bar{m}_u(S)$ is to the distribution $P$, estimated from a held-out dataset. We want to maximize the diversity characterized by $m$ (first term in Eq. (3)) while at the same time ensuring that $m$ characterizes the held-out data compactly (second term in Eq. (3)).
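As a concrete illustration of how Eqs. (1)-(4) can be optimized, the sketch below runs the standard greedy algorithm for monotone submodular maximization under a cardinality budget. It is a minimal sketch with hypothetical variable names, assuming the per-utterance feature scores m_u(s) and the held-out distribution p_u have already been computed (e.g., as in Section 3.1.2); it is not the system code used in the evaluation.

```python
import math
from collections import defaultdict

def greedy_select(utt_scores, p_u, budget, eps=1e-12):
    """Greedily maximize f(S) = sum_u p_u * log(m_u(S)) subject to |S| <= budget (Eqs. (1), (4)).

    utt_scores: dict utterance id -> {feature u: m_u(s)}
    p_u:        dict feature u -> probability estimated on the held-out set
    eps:        smoothing so the log stays finite while a feature is still uncovered
    """
    selected, m_S = [], defaultdict(float)      # m_S[u] accumulates m_u(S)
    remaining = set(utt_scores)
    for _ in range(budget):
        best_utt, best_gain = None, float("-inf")
        for s in remaining:
            # Marginal gain of adding s: only the features present in s change.
            gain = sum(
                p_u.get(u, 0.0) * (math.log(m_S[u] + v + eps) - math.log(m_S[u] + eps))
                for u, v in utt_scores[s].items()
            )
            if gain > best_gain:
                best_utt, best_gain = s, gain
        if best_utt is None:
            break
        for u, v in utt_scores[best_utt].items():
            m_S[u] += v
        selected.append(best_utt)
        remaining.remove(best_utt)
    return selected
```

Because f is monotone submodular, this greedy procedure enjoys the usual (1 - 1/e) approximation guarantee; in practice a lazy-greedy variant would be preferred, since the naive loop above re-evaluates every remaining utterance at each step.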
3.1.2. GMM Tokenization for Maximizing Acoustic Diversity
A GMM with $M$ components trained on unlabeled training data is used to label (tokenize) the training data and a held-out dataset in terms of Gaussian components. Each frame $i$ is assigned the label

$$j^{*} = \arg\max_{j} P(i \mid c_j), \qquad j = 1, \ldots, M,$$

where $c_j$ is the $j$-th Gaussian mixture component.

The concepts of term frequency and inverse document frequency are used to characterize the tokenized audio. A term is defined as an n-gram of the component labels; these terms make up the feature set $U$ of Section 3.1.1. The modular function of Section 3.1.1 is defined as the product of the term frequency $tf_u(s)$ and the inverse document frequency $idf_u$:

$$m_u(S) = \sum_{s \in S} m_u(s) = \sum_{s \in S} tf_u(s) \times idf_u, \qquad (5)$$

where each utterance $s$ is treated as a document in the training set. The probability distribution $p_u$ in Eq. (4) is the term frequency estimated from the held-out data: $p_u = n_u / \sum_{u \in U} n_u$, where $n_u$ is the number of times feature $u$ occurs.
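The sketch below illustrates how the feature scores m_u(s) of Eq. (5) could be derived: each frame is labeled with its most likely Gaussian component, label bigrams form the term set U, and term-frequency/inverse-document-frequency products give m_u(s). It is illustrative only; diagonal-covariance Gaussians, a standard log(N/df) idf, and the argument names are assumptions not specified in the text.

```python
import math
import numpy as np
from collections import Counter

def tokenize(frames, means, variances):
    """Assign each frame the index of its most likely diagonal-covariance Gaussian,
    i.e., argmax_j P(x_i | c_j) as in Section 3.1.2.

    frames:    (T, dim) array of feature vectors
    means:     (M, dim) component means
    variances: (M, dim) component (diagonal) variances
    """
    inv_var = 1.0 / variances
    log_det = np.sum(np.log(variances), axis=1)           # per-component log|Sigma|
    labels = []
    for x in frames:
        diff = x - means                                   # broadcast to (M, dim)
        # log-likelihood up to a constant shared by all components
        log_lik = -0.5 * (log_det + np.sum(diff * diff * inv_var, axis=1))
        labels.append(int(np.argmax(log_lik)))
    return labels

def bigram_tfidf(utt_labels):
    """utt_labels: dict utterance id -> list of component labels.
    Returns dict utterance id -> {bigram term u: tf_u(s) * idf_u}, i.e., m_u(s) in Eq. (5)."""
    tf = {s: Counter(zip(lab, lab[1:])) for s, lab in utt_labels.items()}
    df = Counter(u for counts in tf.values() for u in counts)   # document frequency of term u
    n_docs = len(utt_labels)
    idf = {u: math.log(n_docs / df[u]) for u in df}
    return {s: {u: c * idf[u] for u, c in counts.items()} for s, counts in tf.items()}
```

The output of bigram_tfidf is exactly the per-utterance score dictionary expected by the greedy selection sketch in Section 3.1.1, and the held-out distribution p_u is the normalized term frequency counted on the held-out tokenization.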
3.2. Keyword-Aware Language Modeling (LM)
Let $q = (w_1, w_2, \ldots, w_L)$ be an $L$-word query. In keyword-filler based KWS, the prior probability $P(q)$ is by default set to $P(q) = 1/N$, often resulting in high false alarms (typically $N \le 100$).

By contrast, in LVCSR-based KWS, the prior probability of $q$ can be estimated as

$$P_{\mathrm{LVCSR}}(q) = \sum_{h \in H} P(q \mid h) \approx \sum_{h \in H} \Big\{ \prod_{i=1}^{L} P_{\mathrm{ngram}}\big(w_i \mid h_i(h, q)\big) \Big\}, \qquad (6)$$

where $h_i(h, q)$ is the history of $w_i$ in the query $q$ dictated by the order $n$, and $P_{\mathrm{ngram}}(\cdot)$ is the probability estimated by the n-gram language model. In low-resource scenarios, where there is insufficient text data to properly train n-gram LMs, prior probabilities are often underestimated. This underestimation causes a high miss probability, especially for multi-word queries. To alleviate the underestimation problem, one can integrate the keyword-filler based and LVCSR-based KWS approaches:

$$P_{\mathrm{KWaware}}(q) = \max\{P_{\mathrm{LVCSR}}(q), \, k(q)\}, \qquad (7)$$

where $k(q)$ is the minimum prior set for query $q$ to alleviate the prior underestimation problem in low-resource LVCSR-KWS, where insufficient text data is available for language modeling. In this paper, we assume all keywords share the same $k$ (i.e., $k(q) = k$). For a more detailed discussion of the proposed grammar, please refer to [20].
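A minimal sketch of Eq. (7) follows, assuming an n-gram LM is available through a generic lm_logprob(word, history) call (a stand-in for whatever toolkit is used). The sum over external histories h in Eq. (6) is omitted here; only the within-query history is used, which is the only context available for positions beyond the first n-1 words of the query.

```python
def keyword_aware_prior(query_words, lm_logprob, k, order=3):
    """Keyword-aware prior of Eq. (7): the LVCSR n-gram estimate floored by k.

    query_words: list of words in the query q = (w_1, ..., w_L)
    lm_logprob:  function(word, history_tuple) -> log10 probability from the n-gram LM
    k:           minimum prior k(q); shared by all keywords in this paper
    """
    logp = 0.0
    for i, w in enumerate(query_words):
        history = tuple(query_words[max(0, i - order + 1):i])   # within-query history only
        logp += lm_logprob(w, history)
    p_lvcsr = 10.0 ** logp                                      # simplified P_LVCSR(q), cf. Eq. (6)
    return max(p_lvcsr, k)
```

In a low-resource LM the product above shrinks quickly with query length, so for multi-word queries the floor k is frequently the active term, which is exactly the regime in which the keyword-aware grammar reduces misses.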
3.3. Word-Morph Interpolated Language Model
Representing out-of-vocabulary (OOV) entries using morphs (automatically parsed morphemes) is insufficient to resolve the data sparsity issues of morphologically-rich languages like Tamil. If the morph-based lexical entries occur rarely, the miss probability of such keywords remains high even though they are no longer OOV. To mitigate this effect, we exploit word-morph interpolated language models (LMs) to provide smoother estimates.

Three LMs are first constructed: (1) Word-based LM $\lambda_W$: a 3-gram word LM trained on all the word entries. (2) Morph-based LM $\lambda_M$: a 3-gram morph LM trained after parsing word entries into morphs using Morfessor [21]. (3) Hybrid word-morph LM $\lambda_H$: words with more than one occurrence in the training data are retained, whereas words with only one occurrence are parsed into morphs by Morfessor [21]. An interpolated language model is then estimated: $\lambda_{WM} = \alpha\lambda_W + \beta\lambda_M + (1 - \alpha - \beta)\lambda_H$.
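The sketch below shows how the three LM training corpora of Section 3.3 could be prepared and how the interpolation weights are applied. The segment() argument is a stand-in for Morfessor segmentation, and the interpolation is shown at the level of probabilities assigned to a common event; how word- and morph-level events are aligned in practice is glossed over here.

```python
from collections import Counter

def build_lm_corpora(sentences, segment):
    """Prepare training text for the word LM (lambda_W), morph LM (lambda_M),
    and hybrid word-morph LM (lambda_H) of Section 3.3.

    sentences: list of token lists (word-level transcriptions)
    segment:   function(word) -> list of morphs (stand-in for Morfessor)
    """
    counts = Counter(w for sent in sentences for w in sent)
    word_corpus = sentences
    morph_corpus = [[m for w in sent for m in segment(w)] for sent in sentences]
    # Hybrid: keep words seen more than once; split singletons into morphs.
    hybrid_corpus = [
        [t for w in sent for t in ([w] if counts[w] > 1 else segment(w))]
        for sent in sentences
    ]
    return word_corpus, morph_corpus, hybrid_corpus

def interpolated_prob(p_word, p_morph, p_hybrid, alpha=0.4, beta=0.3):
    """lambda_WM = alpha*lambda_W + beta*lambda_M + (1 - alpha - beta)*lambda_H."""
    return alpha * p_word + beta * p_morph + (1.0 - alpha - beta) * p_hybrid
```

The default weights correspond to the values alpha = 0.4 and beta = 0.3 used in the experiments of Section 4.4.1.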
4. EXPERIMENTS
For clarity purposes, we only show a subset of submitted systems for
OpenKWS14 and corresponding follow-up analysis to demonstrate
the proposed strategies discussed here.
4.1. Setup
This effort uses the IARPA Babel Program Tamil language collection
release IARPA-babel204b-v1.1b for the NIST OpenKWS14 Evalu-
ation. The training set includes 80 hours of conversational telephone
speech. Two conditions are defined: (1) Full Language Pack (FLP):
60 hours of transcriptions and a corresponding lexicon. (2) Limited
Language Pack (LLP): a 10 hr subset of FLP transcriptions. The de-
velopmental set is 10 hr with transcriptions. The evaluation set is 75 hr with neither transcriptions nor timing information; transcriptions of a 15 hr subset (evalpart1) were released after OpenKWS14. All results reported here are on evalpart1.
Evaluation Metric: Term-weighted value (TWV) is 1 minus the weighted sum of the term-weighted probability of miss detection $P_{\mathrm{miss}}(\theta)$ and the term-weighted probability of false alarm $P_{\mathrm{FA}}(\theta)$:

$$\mathrm{TWV}(\theta) = 1 - [P_{\mathrm{miss}}(\theta) + \beta P_{\mathrm{FA}}(\theta)], \qquad (8)$$

where $\theta$ is the decision threshold. Actual term-weighted value (ATWV) is the TWV at the chosen decision threshold.
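A minimal sketch of Eq. (8). The per-keyword miss and false-alarm probabilities at threshold theta are assumed to be computed elsewhere from the detection counts; beta = 999.9 is the value conventionally used in the OpenKWS/Babel evaluations (an assumption, as the paper does not state it), and the term-weighted averages are taken over the scored keywords.

```python
def term_weighted_value(p_miss_per_kw, p_fa_per_kw, beta=999.9):
    """TWV(theta) = 1 - [P_miss(theta) + beta * P_FA(theta)]  (Eq. (8)).

    p_miss_per_kw: dict keyword -> P_miss(kw, theta)
    p_fa_per_kw:   dict keyword -> P_FA(kw, theta)
    The term-weighted probabilities are averages over the scored keywords.
    """
    kws = list(p_miss_per_kw)
    p_miss = sum(p_miss_per_kw[kw] for kw in kws) / len(kws)
    p_fa = sum(p_fa_per_kw[kw] for kw in kws) / len(kws)
    return 1.0 - (p_miss + beta * p_fa)
```

Note that P_FA is computed over the very large number of non-target trials, so despite the large beta its typical values are tiny; reducing misses is usually the more effective lever, consistent with the analysis in Section 4.3.2.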
4.2. Baseline System
All systems were developed using Kaldi [22]. While fundamental frequency variation features were shown to improve ASR for both tonal and non-tonal languages [23] and to improve KWS for OpenKWS13 [24], they actually hurt ASR/KWS performance on this data. F0 features, on the other hand, consistently helped in pilot experiments, so all systems adopted F0.
4.2.1. Implementation Details
We adapted the voice activity detection (VAD) of [24] to reduce noise. The WAV files are especially noisy, resulting in a virtually 100% word error rate, and classic noise cancellation methods such as Wiener filtering were ineffective. Instead, speech enhancement using a log minimum mean-square error spectral amplitude estimator [25] was applied to the WAV files before VAD, as it improved speech quality significantly by removing perceptually audible distortions. When comparing VAD and ground-truth segments, the ATWV difference was insignificant (<0.13% relative).

MFCC (13-dim) and F0 (2-dim) features were extracted; 9 adjacent feature frames were then concatenated and transformed with LDA+MLLT+fMLLR. The 40-dim fMLLR features were used for bottleneck feature (BNF) extraction (6 hidden layers, each with 2048 nodes), yielding 42-dim BNFs. The 40-dim fMLLR features and 42-dim BNFs were then concatenated to form 82-dim features, to which an fMLLR transform was applied again (60-dim). We used 6 hidden layers (2048 nodes each) and 4838 senone target states for the DNN acoustic model. The training procedure is as follows: (1) one iteration of pre-training; (2) cross-entropy criterion training; (3) scalable minimum Bayes risk criterion based sequence training [26].
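For reference, the front-end and acoustic-model settings described above can be collected into a single configuration summary (descriptive only, with hypothetical field names; the systems themselves were built with standard Kaldi recipes rather than this Python structure):

```python
from dataclasses import dataclass

@dataclass
class FrontEnd:
    mfcc_dim: int = 13        # MFCC features
    f0_dim: int = 2           # F0 features
    splice: int = 9           # adjacent frames concatenated
    fmllr_dim: int = 40       # after LDA + MLLT + fMLLR
    bnf_dim: int = 42         # bottleneck features from a 6 x 2048 BNF network
    concat_dim: int = 82      # fMLLR (40) + BNF (42)
    final_dim: int = 60       # second fMLLR transform on the concatenated features

@dataclass
class DnnAcousticModel:
    hidden_layers: int = 6
    hidden_nodes: int = 2048
    senone_targets: int = 4838
    # Training: 1 iteration of pre-training, then cross-entropy training,
    # then minimum-Bayes-risk-based sequence training [26].
```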
Phonetisaurus [27] was used to obtain OOV pronunciations. A
trigram language model was trained on word tokens. The beam
width was set to 18 for lattice decoding. Deterministic weighted
transducers were used to index and search soft-hits, which contain
the utterance identifications, start/end times, and posterior scores.
Sum-to-one normalization [16], WCombMNZ [16], and keyword-specific thresholding (KST) [28] were applied consecutively to combine systems. For individual systems, only KST was applied.
4.2.2. Results
Table 1 shows the baseline ATWV results when using the word 3-gram LM ($\lambda_W$) for LLP and FLP. In the next section, we examine how leveraging keyword information can boost performance.

Table 1. Keyword-Aware LM outperforms the baseline LM. All LMs are word-based.
Transcription Condition | Baseline ATWV | Keyword-Aware LM ATWV | Relative Gain (%)
LLP: 10 hr              | 0.2313        | 0.3182                | 37.6
FLP: 60 hr              | 0.4222        | 0.4852                | 14.9
4.3. Keyword-Aware Language Model (LM) Experiment
4.3.1. Implementation Details
The setup is the same as in Section 4.2.1, except that the LM is estimated using a context-simulated keyword LM [15]:

$$P'_{\mathrm{KWLM}}(w \mid h) = \gamma P_{\mathrm{KWLM}}(w \mid h) + (1 - \gamma) P_{\mathrm{LM}}(w \mid h), \qquad (9)$$

where $\gamma = 0.3$, $h$ is the history of the current word $w$, $P_{\mathrm{KWLM}}$ is an LM estimated by padding keywords with bigram entries from the training data, and $P_{\mathrm{LM}}$ is the trigram LM of Section 4.2.1.
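To make the effect of Eq. (9) concrete, consider a hypothetical keyword word $w$ whose probability is underestimated by the baseline trigram LM but covered by the keyword LM (the numbers are illustrative only, not taken from the data):

$$\gamma P_{\mathrm{KWLM}}(w \mid h) + (1-\gamma) P_{\mathrm{LM}}(w \mid h) = 0.3 \times 0.02 + 0.7 \times 0.001 = 0.0067,$$

roughly a seven-fold increase over the baseline estimate of 0.001, which raises the prior of queries containing $w$ and thereby reduces misses.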
4.3.2. Results
Table 1 shows that when the keyword-aware LM is used, relative gains reach 37.6% (LLP) and 14.9% (FLP). Further analysis shows that the gains are due to a reduction in miss probability, which is penalized more heavily than false alarms in OpenKWS settings. We also observe larger gains for multi-word keywords when using the keyword-aware LM framework. Due to space constraints, comparisons of how the effectiveness of the keyword-aware framework differs according to language peculiarities, implementation methods, and keyword length are reported in [20].
4.4. Subword Experiments
Subword modeling is essential, especially for low-resource languages
where keywords are not known a priori. Here we investigate how
morphemes and homophones help resolve data sparsity issues.
4.4.1. Morpheme Subword Modeling
The system implementation is the same as in Section 4.2.1, except that the lexicon and word-morph interpolated LM are set up as described in Section 3.3, with $\alpha = 0.4$ and $\beta = 0.3$. We applied linguistic constraints to reduce linguistically-illegal morphs, but the gains were insignificant compared to those from increasing lattice sizes. In addition, non-speech tags were removed from the LM (consistent marginal gains on 5 developmental keyword lists). Table 2 shows that using the word-morph interpolated LM improves ATWV by 4.5% relative for LLP and 3.3% relative for FLP.
Table 2. Word-Morph LM outperforms word LM.
Transcription Condition | Word LM ATWV | Word-Morph LM ATWV | Rel. Gain (%)
LLP: 10 hr              | 0.2313       | 0.2418             | 4.5
FLP: 60 hr              | 0.4222       | 0.4363             | 3.3
Table 3. Homophone system S_H and sub-homophone system S_Hsub complement each other.
System       | LLP ATWV | FLP ATWV
S_H          | 0.0832   | 0.2634
S_Hsub       | 0.0838   | 0.2748
S_H + S_Hsub | 0.1243   | 0.2872
4.4.2. Homophone Subword Modeling
Homophones are words that are written differently but sound the same, like see and sea. The homophone system S_H was implemented as in Section 4.2.1, except that words are replaced with their pronunciations. The sub-homophone system S_Hsub was implemented by further segmenting homophones into morphs using Morfessor. From Table 3, we see that the homophone system S_H and the sub-homophone system S_Hsub perform similarly to each other for both the LLP and FLP conditions. When fused with each other, we get a 49.4% relative gain for LLP and a 4.5% relative gain for FLP, suggesting that sub-homophones are much more complementary to homophones in low-resource scenarios. The homophone results shown here are suboptimal compared to their word-morph counterparts. We suspect this discrepancy is language-dependent. In the OpenKWS13 Vietnamese task, we observed a 26.3% relative gain when using homophones instead of words. Vietnamese words are constructed from a finite set of syllables, which are phonetically equivalent to homophones, making homophones an elegant choice for handling OOVs. For future work, we plan to investigate whether sub-homophones can drive further gains in Vietnamese.
4.5. Submodular Optimization Data Selection Experiment
In this experiment, we analyze how to select data to transcribe to
maximize KWS performance and minimize transcription cost.
4.5.1. Implementation Details
We follow the proposed algorithm described in Section 3.1.2. The total number of mixture components is $M = 2048$, and bigrams of labeled frames are used as terms for computing term frequency and inverse document frequency. The 10 hr developmental data is used as the held-out dataset for estimating the distribution $p_u$ in Eq. (4). The KWS system is the same as in Section 4.4.1.
4.5.2. Results
Table 4 shows that the proposed 10 hr subset outperforms Baseline-1 (random 10 hr subset) and Baseline-2 (NIST-LLP 10 hr subset) by 21.0% and 15.5% relative, respectively, showing that the LLP 10 hr subset can be chosen more optimally to achieve better KWS performance without increasing transcription cost. The relative ATWV gain from increasing transcriptions from 10 hr to 60 hr is smaller in the submodular case (52.7%) than for Baseline-1 (82.9%) and Baseline-2 (76.4%), indicating that the return on transcription cost is more effective with the submodular optimization approach.

Table 4 also shows that by maximizing acoustic diversity, the proposed approach implicitly enriches the vocabulary in the lexicon and thus alleviates OOV issues: compared to Baseline-1 (random 10 hr subset) and Baseline-2 (NIST-LLP 10 hr subset), the proposed 10 hr subset reduces the number of OOV keywords by 17.0% and 42.3% relative, respectively. This byproduct helps resolve OOV issues at a more fundamental stage when developing spoken language technology. For more detailed analysis, please see [29].
Table 4. Submodular data selection for word transcriptions.
Transcription Condition              | ATWV   | OOV counts
Baseline-1: Random 10 hr subset      | 0.2386 | 1171
Baseline-2: NIST-LLP (10 hr subset)  | 0.2474 | 1686
Proposed submodular 10 hr subset     | 0.2857 | 972
Upper bound: NIST-FLP (full 60 hr)   | 0.4363 | 407
5. DISCUSSION
In this work, we investigated three strategies for low-resource keyword search. We expect our submodular optimization data selection approach to generalize well to languages other than Tamil, since similar approaches work for Mandarin LVCSR [30]. Similarly, the keyword-aware language model approach also works for Vietnamese [20]. By contrast, subword modeling (morphemes, homophones) appears to be more language-dependent.

While our LVCSR-KWS work in Tamil and Vietnamese [24] focuses on text queries, it has inspired strategies used in spoken term detection with audio queries. For example, [31] proposed partial-matching symbolic search, which complements popular pattern matching approaches based on dynamic time warping in Query-by-Example Search on Speech (QUESST), formerly called Spoken Web Search (SWS), in MediaEval 2014.
6. REFERENCES
[1] John Makhoul, Francis Kubala, Timothy Leek, Daben Liu,
Long Nguyen, Richard Schwartz, and Amit Srivastava,
“Speech and language technologies for audio indexing and retrieval,” Proceedings of the IEEE, vol. 88, no. 8, pp. 1338–1353, 2000.
[2] Biing-Hwang Juang and Sadaoki Furui, “Automatic recogni-
tion and understanding of spoken language-a first step toward
natural human-machine communication,” Proceedings of the
IEEE, vol. 88, no. 8, pp. 1142–1165, 2000.
[3] Jay G Wilpon, L Rabiner, Chin-Hui Lee, and ER Goldman,
“Automatic recognition of keywords in unconstrained speech using hidden Markov models,” IEEE TASLP, vol. 38, no. 11, pp. 1870–1878, 1990.
[4] J Gauvain and Lori Lamel, “Large-vocabulary continuous
speech recognition: advances and applications,” Proceedings
of the IEEE, vol. 88, no. 8, pp. 1181–1200, 2000.
[5] Jonathan G Fiscus, Jerome Ajot, John S Garofolo, and George Doddington, “Results of the 2006 spoken term detection evaluation,” in Proceedings of the ACM SIGIR Workshop on Searching Spontaneous Conversational Speech, 2007, pp. 51–55.
[6] Dilek Hakkani-Tur, Giuseppe Riccardi, and Allen Gorin, “Active learning for automatic speech recognition,” in Proc. IEEE ICASSP, 2002, vol. 4, pp. IV–3904.
[7] Lori Lamel, Jean-Luc Gauvain, and Gilles Adda, “Lightly su-
pervised and unsupervised acoustic model training,” Computer
Speech & Language, vol. 16, no. 1, pp. 115–129, 2002.
[8] Dong Yu, Balakrishnan Varadarajan, Li Deng, and Alex
Acero, “Active learning and semi-supervised learning for
speech recognition: A unified framework using the global en-
tropy reduction maximization criterion,” Computer Speech &
Language, vol. 24, no. 3, pp. 433–444, 2010.
[9] Nobuyasu Itoh, Tara N Sainath, Dan Ning Jiang, Jie Zhou,
and Bhuvana Ramabhadran, “N-best entropy based data se-
lection for acoustic modeling,” in Proc. IEEE ICASSP, 2012,
pp. 4133–4136.
[10] Yi Wu, Rong Zhang, and Alexander Rudnicky, “Data selection
for speech recognition,” in Proc. IEEE ASRU, 2007, pp. 562–
565.
[11] Hui Lin and Jeff Bilmes, “How to select a good training-data
subset for transcription: Submodular active selection for se-
quences,” in INTERSPEECH, 2009.
[12] Kai Wei, Yuzong Liu, Katrin Kirchhoff, and Jeff Bilmes, “Us-
ing document summarization techniques for speech data subset
selection,” in HLT-NAACL, 2013, pp. 721–726.
[13] I-Fan Chen, Nancy F Chen, and Chin-Hui Lee, “A Keyword-
Boosted sMBR Criterion to Enhance Keyword Search Perfor-
mance in Deep Neural Network Based Acoustic Modeling,” in
INTERSPEECH, 2014.
[14] Bing Zhang, Richard M Schwartz, Stavros Tsakalidis, Long Nguyen, and Spyros Matsoukas, “White listing and score normalization for keyword spotting of noisy speech,” in INTERSPEECH, 2012.
[15] I-Fan Chen, Chongjia Ni, Boon Pang Lim, Nancy F Chen, and Chin-Hui Lee, “A novel keyword+LVCSR-filler based grammar network representation for spoken keyword search,” in ISCSLP, 2014.
[16] Jonathan Mamou, Bhuvana Ramabhadran, and Olivier Siohan, “Vocabulary independent spoken term detection,” in Proc. ACM SIGIR Conference on Research and Development in Information Retrieval, 2007, pp. 615–622.
[17] Hasim Sak, Murat Saraçlar, and Tunga Güngör, “Morpholexical and discriminative language models for Turkish automatic speech recognition,” IEEE TASLP, vol. 20, no. 8, pp. 2341–2351, 2012.
[18] Yanzhang He, Brian Hutchinson, Peter Baumann, Mari Ostendorf, Eric Fosler-Lussier, and Janet Pierrehumbert, “Subword-based modeling for handling OOV words in keyword spotting,” in Proc. IEEE ICASSP, 2014, pp. 7864–7868.
[19] Melvin Jose Johnson Premkumar, Ngoc Thang Vu, and Tanja Schultz, “Experiments towards a better LVCSR system for Tamil,” in INTERSPEECH, 2013.
[20] I-Fan Chen, Chongjia Ni, Boon Pang Lim, Nancy F Chen, and Chin-Hui Lee, “A keyword-aware grammar framework for LVCSR-based spoken keyword search,” in Proc. IEEE ICASSP, 2015.
[21] Morfessor 2.0.0, http://www.cis.hut.fi/projects/morpho/morfessor2.shtml, last accessed August 2014.
[22] Daniel Povey et al., “The Kaldi speech recognition toolkit,” in Proc. IEEE ASRU, 2011.
[23] Florian Metze, Zaid A. W. Sheikh, Alex Waibel, Jonas
Gehring, Kevin Kilgour, Quoc Bao Nguyen, and Van Huy
Nguyen, “Models of tone for tonal and non-tonal languages,”
in Proc. IEEE ASRU, Olomouc; Czech Republic, 2013.
[24] Nancy F Chen, Sunil Sivadas, Boon Pang Lim, Hoang Gia
Ngo, Haihua Xu, Van Tung Pham, Bin Ma, and Haizhou Li,
“Strategies for Vietnamese keyword search,” in Proc. IEEE
ICASSP, 2014, pp. 4121–4125.
[25] Yariv Ephraim and David Malah, “Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator,” IEEE TASLP, vol. 32, no. 6, pp. 1109–1121, 1984.
[26] Daniel Povey, Lukas Burget, Mohit Agarwal, Pinar Akyazi, Kai Feng, Arnab Ghoshal, Ondrej Glembek, Nagendra K Goel, Martin Karafiát, Ariya Rastrow, R. C. Rose, P. Schwarz, and S. Thomas, “Subspace Gaussian mixture models for speech recognition,” in Proc. IEEE ICASSP, 2010, pp. 4330–4333.
[27] J. R. Novak, “Phonetisaurus: a WFST-driven phoneticizer,” https://code.google.com/p/phonetisaurus, 2012.
[28] Damianos Karakos, Richard Schwartz, Stavros Tsakalidis,
Le Zhang, Shivesh Ranjan, Tim Ng, Roger Hsiao, Guruprasad
Saikumar, Ivan Bulyko, Long Nguyen, et al., “Score normal-
ization and system combination for improved keyword spot-
ting,” in Proc. IEEE ASRU, 2013, pp. 210–215.
[29] Chongjia Ni, Cheung-Chi Leung, Lei Wang, Nancy F Chen, and Bin Ma, “Unsupervised data selection and word-morph mixed language model for Tamil low-resource keyword search,” in Proc. IEEE ICASSP, 2015.
[30] Chongjia Ni, Lei Wang, Haibo Liu, Cheung-Chi Leung, Li Lu, and Bin Ma, “Submodular data selection with acoustic and phonetic features for automatic speech recognition,” in Proc. IEEE ICASSP, 2015.
[31] Haihua Xu, Peng Yang, Xiong Xiao, Lei Xie, Cheung-Chi Leung, Hongjie Chen, Jia Yu, Hang Lv, Lei Wang, Su Jun Leow, Bin Ma, Eng Siong Chng, and Haizhou Li, “Language independent query-by-example spoken term detection using n-best phone sequences and partial matching,” in Proc. IEEE ICASSP, 2015.