LOW-RESOURCE KEYWORD SEARCH STRATEGIES FOR TAMIL
Nancy F. Chen1, Chongjia Ni1, I-Fan Chen2, Sunil Sivadas1, Van Tung Pham3, Haihua Xu3, Xiong Xiao3,
Tze Siong Lau3, Su Jun Leow3, Boon Pang Lim1, Cheung-Chi Leung1, Lei Wang1,
Chin-Hui Lee2, Alvina Goh3, Eng Siong Chng3, Bin Ma1, Haizhou Li1
1Institute for Infocomm Research, A*STAR, Singapore, 2Georgia Institute of Technology, USA,
3Nanyang Technological University, Singapore
nfychen@i2r.a-star.edu.sg
ABSTRACT
We propose strategies for a state-of-the-art keyword search (KWS)
system developed by the SINGA team in the context of the 2014
NIST Open Keyword Search Evaluation (OpenKWS14) using con-
versational Tamil provided by the IARPA Babel program. To tackle
low-resource challenges and the rich morphological nature of Tamil,
we present highlights of our current KWS system, including: (1)
Submodular optimization data selection to maximize acoustic diver-
sity through Gaussian component indexed N-grams; (2) Keyword-
aware language modeling; (3) Subword modeling of morphemes and
homophones.
Index Terms: Spoken term detection (STD), keyword spotting, under-resourced languages, active learning, unsupervised learning, semi-supervised learning, inflective languages, agglutinative languages, morphology, deep neural network (DNN)
1. INTRODUCTION
Keyword search (KWS) is a detection task where the goal is to find
all occurrences of an orthographic term (e.g., word or phrase) from
audio recordings. Applications of KWS include spoken document
indexing and retrieval [1] and spoken language understanding [2].
KWS systems can be categorized into two groups: (i) classic
keyword-filler based KWS [3], and (ii) large vocabulary continu-
ous speech recognition (LVCSR) based KWS [4]. In keyword-filler
based KWS systems, a spoken utterance is represented as a sequence
of keywords and non-keywords (often referred to as fillers [3]). Cus-
tomized detectors are built for the keywords. Keyword-filler based
systems often achieve a high detection rate using only a small dataset
for acoustic model training, but they do not scale well when the num-
ber of keywords increases.
By contrast, LVCSR-based KWS systems are flexible in han-
dling a large number of keywords, yet require sufficiently large
amounts of transcribed training data to achieve good performance.
Therefore, LVCSR-based KWS has worked well on resource-rich
languages like English, as has been shown in the NIST 2006 Spo-
ken Term Detection Evaluation [5]. However, such transcribe-
and-search approaches pose particular challenges to low-resource
languages such as Zulu and Tamil.
To tackle these challenges, the IARPA Babel program aims to
foster research “to rapidly develop speech recognition capability for
keyword search in a previously unstudied language, working with
speech recorded in a variety of conditions with limited amounts of
transcription.” The NIST 2014 Open Keyword Search Evaluation
(OpenKWS14) was held in April using the surprise language of
Tamil.
The challenges of the NIST OpenKWS14 Evaluation include
linguistic peculiarities of Tamil (e.g., a more than 30% out-of-vocabulary (OOV) rate due to its rich morphological structure), poor audio quality (e.g., noise, soft volume, cross-talk), and the limited amount of transcribed data. To address such challenges, we discuss our recent endeavors (see Figure 1) and related work below.

Fig. 1. Proposed keyword search system for low-resource languages. Orange blocks are highlights discussed in this work: submodular optimization for selecting data to transcribe, subword modeling of morphemes and homophones, and keyword-aware language modeling.
2. RELATION TO PRIOR WORK
2.1. Active Learning for Selecting Audio to Transcribe
Transcribing speech data is time-consuming and labor-intensive, es-
pecially for low-resource languages where linguistic expertise is lim-
ited or lacking. Thus it is critical to select the most informative
and representative subset of audio for human transcription. In prior
work, most approaches select utterances in a greedy fashion accord-
ing to their utility scores (e.g., confidence scores from automatic
speech recognition (ASR) [6, 7]). Similar to the confidence-based
approaches, [8, 9] use entropy reduction to select unlabeled utter-
ances. Low-resource methods for speech data selection include [10],
which only considers the ideal case where transcriptions are avail-
able in the first place.
The aforementioned methods usually require an ASR system to already be in place for data selection, and they do not guarantee optimality with regard to an objective function. Alternatively, [11, 12] formulate
this data selection problem as a constrained submodular optimiza-
tion setup. In this work, we follow this line of thinking and extend
it to KWS tasks. In particular, we propose to use Gaussian compo-
nent index based n-grams as acoustic features to select utterances to
transcribe.
2.2. Keyword-Aware Language Modeling
If keyword queries are known a priori, one can leverage such knowl-
edge to improve KWS performance. Previous work has shown how
to exploit keyword information for acoustic modeling [13] and de-
coding [14]. Our previous efforts exploit keyword information in
language modeling in Vietnamese [15]. In this work, we investigate
our approach in Tamil and further expand it to a framework integrat-
ing advantages from both keyword-filler based KWS and LVCSR-
based KWS.
2.3. Subword Modeling: Morphemes and Homophones
Mainstream LVCSR systems suffer from out-of-vocabulary (OOV)
issues. For morphologically-rich languages like Tamil, OOV rate is
especially high. While phones are commonly used to help resolve OOVs [16], morphs (automatically parsed morphemes; morphemes are the smallest semantic units in linguistics, e.g., the word unsuccessful consists of three morphemes: 'un', 'success', 'ful') have also been used in ASR [17]. In this work, we mitigate the data sparsity
issue of the morphologically-rich vocabulary in Tamil by integrating
morphs in the lexicon and language models. Our approach is similar
to [18], but we apply it on Tamil instead of Turkish. For Tamil ASR,
morph-based LMs have been reported [19]; in this work, we esti-
mate smoother word-morph LMs to address the serious data sparsity
issues of our dataset.
In our unpublished work, we found homophones useful in Viet-
namese KWS. In this work, we continue this line of investigation in
Tamil. To the best of our knowledge, to date there is no reported
work on using homophones in KWS.
3. LOW-RESOURCE KEYWORD SEARCH STRATEGIES
3.1. Submodular Optimization to Select Audio to Transcribe
3.1.1. Problem Formulation
Given a set of $N$ utterances $V = \{1, 2, \ldots, N\}$, we can construct a non-decreasing submodular set function $f : 2^V \rightarrow \mathbb{R}$ mapping each subset $S \subseteq V$ to a real number. The problem of selecting the best subset $S$ given some budget $K$ (e.g., the maximum number of transcribed utterances) can then be formulated as monotone submodular function maximization under a knapsack constraint:

$$\max_{S \subseteq V} \; \{ f(S) : |S| \le K \}. \qquad (1)$$

Submodularity can be interpreted as the property of diminishing returns: for any subsets $R \subseteq S \subseteq V$ and any utterance $s \in V \setminus S$,

$$f(S \cup \{s\}) - f(S) \le f(R \cup \{s\}) - f(R). \qquad (2)$$

Let $U$ be a set of features, and let $P = \{p_u\}_{u \in U}$ be a probability distribution over the set $U$. Let $m_u(S) = \sum_{s \in S} m_u(s)$ be a non-negative score for feature $u$ in set $S$, where $m_u(s)$ measures the degree to which utterance $s \in S$ possesses feature $u$. We can compute the KL-divergence between the distribution $P$ and the normalized distribution $\bar{m}_u(S) = m_u(S) / \sum_{u \in U} m_u(S)$:

$$D_{\mathrm{KL}}(P \,\|\, \bar{m}(S)) = \sum_{u \in U} p_u \log p_u - \sum_{u \in U} p_u \log m_u(S) + \log\Big(\sum_{u \in U} m_u(S)\Big) = \mathrm{const.} + \log\Big(\sum_{u \in U} m_u(S)\Big) - \sum_{u \in U} p_u \log m_u(S).$$

We define our objective function $f$ as follows:

$$f(S) = \log\Big(\sum_{u \in U} m_u(S)\Big) - D_{\mathrm{KL}}(P \,\|\, \bar{m}(S)) \qquad (3)$$
$$\;\;\;\;\;\;\;\; = \sum_{u \in U} p_u \log m_u(S), \qquad (4)$$

where the additive constant from the KL term is dropped since it does not affect the maximization. The first term in Eq. (3) represents the acoustic diversity characterized by $m$, while the second term represents how close the distribution $\bar{m}_u(S)$ is to the distribution $P$, estimated from a held-out dataset. We want to maximize the diversity characterized by $m$ (first term in Eq. (3)) while at the same time ensuring that $m$ characterizes the held-out data compactly (second term in Eq. (3)).
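As a concrete illustration of how Eqs. (1)-(4) can be optimized, the sketch below runs the standard greedy algorithm for monotone submodular maximization under a cardinality budget. It is a minimal sketch with hypothetical variable names, assuming the per-utterance feature scores m_u(s) and the held-out distribution p_u have already been computed (e.g., as in Section 3.1.2); it is not the system code used in the evaluation.

```python
import math
from collections import defaultdict

def greedy_select(utt_scores, p_u, budget, eps=1e-12):
    """Greedily maximize f(S) = sum_u p_u * log(m_u(S)) subject to |S| <= budget (Eqs. (1), (4)).

    utt_scores: dict utterance id -> {feature u: m_u(s)}
    p_u:        dict feature u -> probability estimated on the held-out set
    eps:        smoothing so the log stays finite while a feature is still uncovered
    """
    selected, m_S = [], defaultdict(float)      # m_S[u] accumulates m_u(S)
    remaining = set(utt_scores)
    for _ in range(budget):
        best_utt, best_gain = None, float("-inf")
        for s in remaining:
            # Marginal gain of adding s: only the features present in s change.
            gain = sum(
                p_u.get(u, 0.0) * (math.log(m_S[u] + v + eps) - math.log(m_S[u] + eps))
                for u, v in utt_scores[s].items()
            )
            if gain > best_gain:
                best_utt, best_gain = s, gain
        if best_utt is None:
            break
        for u, v in utt_scores[best_utt].items():
            m_S[u] += v
        selected.append(best_utt)
        remaining.remove(best_utt)
    return selected
```

Because f is monotone submodular, this greedy procedure enjoys the usual (1 - 1/e) approximation guarantee; in practice a lazy-greedy variant would be preferred, since the naive loop above re-evaluates every remaining utterance at each step.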
3.1.2. GMM Tokenization for Maximizing Acoustic Diversity
A GMM with $M$ components trained on unlabeled training data is used to label (tokenize) the training data and a held-out dataset in terms of Gaussian components. Each frame $i$ is assigned the label

$$j^{*} = \arg\max_{j} P(i \mid c_j), \qquad j = 1, \ldots, M,$$

where $c_j$ is the $j$-th Gaussian mixture component.

The concepts of term frequency and inverse document frequency are used to characterize the tokenized audio. A term is defined as an n-gram of the component labels; these terms make up the feature set $U$ of Section 3.1.1. The modular function of Section 3.1.1 is defined as the product of the term frequency $tf_u(s)$ and the inverse document frequency $idf_u$:

$$m_u(S) = \sum_{s \in S} m_u(s) = \sum_{s \in S} tf_u(s) \times idf_u, \qquad (5)$$

where each utterance $s$ is treated as a document in the training set. The probability distribution $p_u$ in Eq. (4) is the term frequency estimated from the held-out data: $p_u = n_u / \sum_{u \in U} n_u$, where $n_u$ is the number of times feature $u$ occurs.
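The sketch below illustrates how the feature scores m_u(s) of Eq. (5) could be derived: each frame is labeled with its most likely Gaussian component, label bigrams form the term set U, and term-frequency/inverse-document-frequency products give m_u(s). It is illustrative only; diagonal-covariance Gaussians, a standard log(N/df) idf, and the argument names are assumptions not specified in the text.

```python
import math
import numpy as np
from collections import Counter

def tokenize(frames, means, variances):
    """Assign each frame the index of its most likely diagonal-covariance Gaussian,
    i.e., argmax_j P(x_i | c_j) as in Section 3.1.2.

    frames:    (T, dim) array of feature vectors
    means:     (M, dim) component means
    variances: (M, dim) component (diagonal) variances
    """
    inv_var = 1.0 / variances
    log_det = np.sum(np.log(variances), axis=1)           # per-component log|Sigma|
    labels = []
    for x in frames:
        diff = x - means                                   # broadcast to (M, dim)
        # log-likelihood up to a constant shared by all components
        log_lik = -0.5 * (log_det + np.sum(diff * diff * inv_var, axis=1))
        labels.append(int(np.argmax(log_lik)))
    return labels

def bigram_tfidf(utt_labels):
    """utt_labels: dict utterance id -> list of component labels.
    Returns dict utterance id -> {bigram term u: tf_u(s) * idf_u}, i.e., m_u(s) in Eq. (5)."""
    tf = {s: Counter(zip(lab, lab[1:])) for s, lab in utt_labels.items()}
    df = Counter(u for counts in tf.values() for u in counts)   # document frequency of term u
    n_docs = len(utt_labels)
    idf = {u: math.log(n_docs / df[u]) for u in df}
    return {s: {u: c * idf[u] for u, c in counts.items()} for s, counts in tf.items()}
```

The output of bigram_tfidf is exactly the per-utterance score dictionary expected by the greedy selection sketch in Section 3.1.1, and the held-out distribution p_u is the normalized term frequency counted on the held-out tokenization.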
3.2. Keyword-Aware Language Modeling (LM)
Let $q = (w_1, w_2, \ldots, w_L)$ be an $L$-word query. In keyword-filler based KWS, the prior probability $P(q)$ is by default set to $P(q) = 1/N$, often resulting in high false alarms (typically $N \le 100$).

By contrast, in LVCSR-based KWS, the prior probability of $q$ can be estimated as

$$P_{\mathrm{LVCSR}}(q) = \sum_{h \in H} P(q \mid h) \approx \sum_{h \in H} \Big\{ \prod_{i=1}^{L} P_{\mathrm{ngram}}\big(w_i \mid h_i(h, q)\big) \Big\}, \qquad (6)$$

where $h_i(h, q)$ is the history of $w_i$ in the query $q$ dictated by the order $n$, and $P_{\mathrm{ngram}}(\cdot)$ is the probability estimated by the n-gram language model. In low-resource scenarios, where there is insufficient text data to properly train n-gram LMs, prior probabilities are often underestimated. This underestimation causes a high miss probability, especially for multi-word queries. To alleviate the underestimation problem, one can integrate the keyword-filler based and LVCSR-based KWS approaches:

$$P_{\mathrm{KWaware}}(q) = \max\{P_{\mathrm{LVCSR}}(q), \, k(q)\}, \qquad (7)$$

where $k(q)$ is the minimum prior set for query $q$ to alleviate the prior underestimation problem in low-resource LVCSR-KWS, where insufficient text data is available for language modeling. In this paper, we assume all keywords share the same $k$ (i.e., $k(q) = k$). For a more detailed discussion of the proposed grammar, please refer to [20].
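A minimal sketch of Eq. (7) follows, assuming an n-gram LM is available through a generic lm_logprob(word, history) call (a stand-in for whatever toolkit is used). The sum over external histories h in Eq. (6) is omitted here; only the within-query history is used, which is the only context available for positions beyond the first n-1 words of the query.

```python
def keyword_aware_prior(query_words, lm_logprob, k, order=3):
    """Keyword-aware prior of Eq. (7): the LVCSR n-gram estimate floored by k.

    query_words: list of words in the query q = (w_1, ..., w_L)
    lm_logprob:  function(word, history_tuple) -> log10 probability from the n-gram LM
    k:           minimum prior k(q); shared by all keywords in this paper
    """
    logp = 0.0
    for i, w in enumerate(query_words):
        history = tuple(query_words[max(0, i - order + 1):i])   # within-query history only
        logp += lm_logprob(w, history)
    p_lvcsr = 10.0 ** logp                                      # simplified P_LVCSR(q), cf. Eq. (6)
    return max(p_lvcsr, k)
```

In a low-resource LM the product above shrinks quickly with query length, so for multi-word queries the floor k is frequently the active term, which is exactly the regime in which the keyword-aware grammar reduces misses.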
3.3. Word-Morph Interpolated Language Model
Representing out-of-vocabulary (OOV) entries using morphs (automatically parsed morphemes) is insufficient to resolve the data sparsity issues of morphologically-rich languages like Tamil. If the morph-based lexical entries occur rarely, the miss probability of such keywords remains high even though they are no longer OOV. To mitigate this effect, we exploit word-morph interpolated language models (LMs) to provide smoother estimates.

Three LMs are first constructed: (1) Word-based LM $\lambda_W$: a 3-gram word LM trained on all the word entries. (2) Morph-based LM $\lambda_M$: a 3-gram morph LM trained after parsing word entries into morphs using Morfessor [21]. (3) Hybrid word-morph LM $\lambda_H$: words with more than one occurrence in the training data are retained, whereas words with only one occurrence are parsed into morphs by Morfessor [21]. An interpolated language model is then estimated: $\lambda_{WM} = \alpha\lambda_W + \beta\lambda_M + (1 - \alpha - \beta)\lambda_H$.
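The sketch below shows how the three LM training corpora of Section 3.3 could be prepared and how the interpolation weights are applied. The segment() argument is a stand-in for Morfessor segmentation, and the interpolation is shown at the level of probabilities assigned to a common event; how word- and morph-level events are aligned in practice is glossed over here.

```python
from collections import Counter

def build_lm_corpora(sentences, segment):
    """Prepare training text for the word LM (lambda_W), morph LM (lambda_M),
    and hybrid word-morph LM (lambda_H) of Section 3.3.

    sentences: list of token lists (word-level transcriptions)
    segment:   function(word) -> list of morphs (stand-in for Morfessor)
    """
    counts = Counter(w for sent in sentences for w in sent)
    word_corpus = sentences
    morph_corpus = [[m for w in sent for m in segment(w)] for sent in sentences]
    # Hybrid: keep words seen more than once; split singletons into morphs.
    hybrid_corpus = [
        [t for w in sent for t in ([w] if counts[w] > 1 else segment(w))]
        for sent in sentences
    ]
    return word_corpus, morph_corpus, hybrid_corpus

def interpolated_prob(p_word, p_morph, p_hybrid, alpha=0.4, beta=0.3):
    """lambda_WM = alpha*lambda_W + beta*lambda_M + (1 - alpha - beta)*lambda_H."""
    return alpha * p_word + beta * p_morph + (1.0 - alpha - beta) * p_hybrid
```

The default weights correspond to the values alpha = 0.4 and beta = 0.3 used in the experiments of Section 4.4.1.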
4. EXPERIMENTS
For clarity purposes, we only show a subset of submitted systems for
OpenKWS14 and corresponding follow-up analysis to demonstrate
the proposed strategies discussed here.
4.1. Setup
This effort uses the IARPA Babel Program Tamil language collection
release IARPA-babel204b-v1.1b for the NIST OpenKWS14 Evalu-
ation. The training set includes 80 hours of conversational telephone
speech. Two conditions are defined: (1) Full Language Pack (FLP):
60 hours of transcriptions and a corresponding lexicon. (2) Limited
Language Pack (LLP): a 10 hr subset of FLP transcriptions. The de-
velopmental set is 10 hr with transcriptions. The evaluation set is 75 hr with neither transcriptions nor timing information; transcriptions of a 15 hr subset (evalpart1) were released after OpenKWS14. All results reported here are on evalpart1.
Evaluation Metric: Term-weighted value (TWV) is 1 minus the weighted sum of the term-weighted probability of miss detection $P_{\mathrm{miss}}(\theta)$ and the term-weighted probability of false alarm $P_{\mathrm{FA}}(\theta)$:

$$\mathrm{TWV}(\theta) = 1 - [P_{\mathrm{miss}}(\theta) + \beta P_{\mathrm{FA}}(\theta)], \qquad (8)$$

where $\theta$ is the decision threshold. Actual term-weighted value (ATWV) is the TWV at the chosen decision threshold.
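A minimal sketch of Eq. (8). The per-keyword miss and false-alarm probabilities at threshold theta are assumed to be computed elsewhere from the detection counts; beta = 999.9 is the value conventionally used in the OpenKWS/Babel evaluations (an assumption, as the paper does not state it), and the term-weighted averages are taken over the scored keywords.

```python
def term_weighted_value(p_miss_per_kw, p_fa_per_kw, beta=999.9):
    """TWV(theta) = 1 - [P_miss(theta) + beta * P_FA(theta)]  (Eq. (8)).

    p_miss_per_kw: dict keyword -> P_miss(kw, theta)
    p_fa_per_kw:   dict keyword -> P_FA(kw, theta)
    The term-weighted probabilities are averages over the scored keywords.
    """
    kws = list(p_miss_per_kw)
    p_miss = sum(p_miss_per_kw[kw] for kw in kws) / len(kws)
    p_fa = sum(p_fa_per_kw[kw] for kw in kws) / len(kws)
    return 1.0 - (p_miss + beta * p_fa)
```

Note that P_FA is computed over the very large number of non-target trials, so despite the large beta its typical values are tiny; reducing misses is usually the more effective lever, consistent with the analysis in Section 4.3.2.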
4.2. Baseline System
All systems were developed using Kaldi [22]. While fundamental frequency variation features were shown to improve ASR for both tonal and non-tonal languages [23] and to improve KWS for OpenKWS13 [24], they actually hurt ASR/KWS performance on this data. F0 features, on the other hand, consistently helped in pilot experiments, so all systems adopted F0.
4.2.1. Implementation Details
We adapted the voice activity detection (VAD) of [24] to reduce noise. The WAV files are especially noisy, resulting in a virtually 100% word error rate, and classic noise cancellation methods such as Wiener filtering were ineffective. Instead, speech enhancement using a log minimum mean-square error spectral amplitude estimator [25] was applied to the WAV files before VAD, as it improved speech quality significantly by removing perceptually audible distortions. When comparing VAD and ground-truth segments, the ATWV difference was insignificant (<0.13% relative).

MFCC (13-dim) and F0 (2-dim) features were extracted; 9 adjacent feature frames were then concatenated and transformed with LDA+MLLT+fMLLR. The 40-dim fMLLR features were used for bottleneck feature (BNF) extraction (6 hidden layers, each with 2048 nodes), yielding 42-dim BNFs. The 40-dim fMLLR features and 42-dim BNFs were then concatenated to form 82-dim features, to which an fMLLR transform was applied again (60-dim). We used 6 hidden layers (2048 nodes each) and 4838 senone target states for the DNN acoustic model. The training procedure is as follows: (1) one iteration of pre-training; (2) cross-entropy criterion training; (3) scalable minimum Bayes risk criterion based sequence training [26].
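For reference, the front-end and acoustic-model settings described above can be collected into a single configuration summary (descriptive only, with hypothetical field names; the systems themselves were built with standard Kaldi recipes rather than this Python structure):

```python
from dataclasses import dataclass

@dataclass
class FrontEnd:
    mfcc_dim: int = 13        # MFCC features
    f0_dim: int = 2           # F0 features
    splice: int = 9           # adjacent frames concatenated
    fmllr_dim: int = 40       # after LDA + MLLT + fMLLR
    bnf_dim: int = 42         # bottleneck features from a 6 x 2048 BNF network
    concat_dim: int = 82      # fMLLR (40) + BNF (42)
    final_dim: int = 60       # second fMLLR transform on the concatenated features

@dataclass
class DnnAcousticModel:
    hidden_layers: int = 6
    hidden_nodes: int = 2048
    senone_targets: int = 4838
    # Training: 1 iteration of pre-training, then cross-entropy training,
    # then minimum-Bayes-risk-based sequence training [26].
```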
Phonetisaurus [27] was used to obtain OOV pronunciations. A
trigram language model was trained on word tokens. The beam
width was set to 18 for lattice decoding. Deterministic weighted
transducers were used to index and search soft-hits, which contain
the utterance identifications, start/end times, and posterior scores.
Sum-to-one normalization [16], WCombMNZ [16], and keyword-specific thresholding (KST) [28] were applied consecutively to combine systems. For individual systems, only KST was applied.
4.2.2. Results
Table 1 shows the baseline ATWV results when using the word 3-gram LM ($\lambda_W$) for LLP and FLP. In the next section, we examine how leveraging keyword information can boost performance.

Table 1. Keyword-Aware LM outperforms the baseline LM. All LMs are word-based.
Transcription Condition | Baseline ATWV | Keyword-Aware LM ATWV | Relative Gain (%)
LLP: 10 hr              | 0.2313        | 0.3182                | 37.6
FLP: 60 hr              | 0.4222        | 0.4852                | 14.9
4.3. Keyword-Aware Language Model (LM) Experiment
4.3.1. Implementation Details
The setup is the same as in Section 4.2.1, except that the LM is estimated using a context-simulated keyword LM [15]:

$$P'_{\mathrm{KWLM}}(w \mid h) = \gamma P_{\mathrm{KWLM}}(w \mid h) + (1 - \gamma) P_{\mathrm{LM}}(w \mid h), \qquad (9)$$

where $\gamma = 0.3$, $h$ is the history of the current word $w$, $P_{\mathrm{KWLM}}$ is an LM estimated by padding keywords with bigram entries from the training data, and $P_{\mathrm{LM}}$ is the trigram LM of Section 4.2.1.
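To make the effect of Eq. (9) concrete, consider a hypothetical keyword word $w$ whose probability is underestimated by the baseline trigram LM but covered by the keyword LM (the numbers are illustrative only, not taken from the data):

$$\gamma P_{\mathrm{KWLM}}(w \mid h) + (1-\gamma) P_{\mathrm{LM}}(w \mid h) = 0.3 \times 0.02 + 0.7 \times 0.001 = 0.0067,$$

roughly a seven-fold increase over the baseline estimate of 0.001, which raises the prior of queries containing $w$ and thereby reduces misses.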
4.3.2. Results
Table 1 shows that when the keyword-aware LM is used, relative gains reach 37.6% (LLP) and 14.9% (FLP). Further analysis shows that the gains are due to a reduction in miss probability, which is penalized more heavily than false alarms in OpenKWS settings. We also observe larger gains for multi-word keywords when using the keyword-aware LM framework. Due to space constraints, comparisons of how the effectiveness of the keyword-aware framework differs according to language peculiarities, implementation methods, and keyword length are reported in [20].
4.4. Subword Experiments
Subword modeling is essential, especially for low-resource languages
where keywords are not known a priori. Here we investigate how
morphemes and homophones help resolve data sparsity issues.
4.4.1. Morpheme Subword Modeling
The system implementation is the same as in Section 4.2.1, except that the lexicon and word-morph interpolated LM are set up as described in Section 3.3, with $\alpha = 0.4$ and $\beta = 0.3$. We applied linguistic constraints to reduce linguistically-illegal morphs, but the gains were insignificant compared to those from increasing lattice sizes. In addition, non-speech tags were removed from the LM (consistent marginal gains on 5 developmental keyword lists). Table 2 shows that using the word-morph interpolated LM improves ATWV by 4.5% relative for LLP and 3.3% relative for FLP.
Table 2. Word-Morph LM outperforms word LM.
Transcription Condition | Word LM ATWV | Word-Morph LM ATWV | Rel. Gain (%)
LLP: 10 hr              | 0.2313       | 0.2418             | 4.5
FLP: 60 hr              | 0.4222       | 0.4363             | 3.3
Table 3. Homophone system S_H and sub-homophone system S_Hsub complement each other.
System       | LLP ATWV | FLP ATWV
S_H          | 0.0832   | 0.2634
S_Hsub       | 0.0838   | 0.2748
S_H + S_Hsub | 0.1243   | 0.2872
4.4.2. Homophone Subword Modeling
Homophones are words that are written differently but sound the same, like see and sea. The homophone system S_H was implemented as in Section 4.2.1, except that words are replaced with their pronunciations. The sub-homophone system S_Hsub was implemented by further segmenting homophones into morphs using Morfessor. From Table 3, we see that the homophone system S_H and the sub-homophone system S_Hsub perform similarly to each other for both the LLP and FLP conditions. When fused with each other, we get a 49.4% relative gain for LLP and a 4.5% relative gain for FLP, suggesting that sub-homophones are much more complementary to homophones in low-resource scenarios. The homophone results shown here are suboptimal compared to their word-morph counterparts. We suspect this discrepancy is language-dependent. In the OpenKWS13 Vietnamese task, we observed a 26.3% relative gain when using homophones instead of words. Vietnamese words are constructed from a finite set of syllables, which are phonetically equivalent to homophones, making homophones an elegant choice for handling OOVs. For future work, we plan to investigate whether sub-homophones can drive further gains in Vietnamese.
4.5. Submodular Optimization Data Selection Experiment
In this experiment, we analyze how to select data to transcribe to
maximize KWS performance and minimize transcription cost.
4.5.1. Implementation Details
We follow the proposed algorithm described in Section 3.1.2. The total number of mixture components is $M = 2048$, and bigrams of labeled frames are used as terms for computing term frequency and inverse document frequency. The 10 hr developmental data is used as the held-out dataset for estimating the distribution $p_u$ in Eq. (4). The KWS system is the same as in Section 4.4.1.
4.5.2. Results
Table 4 shows that the proposed 10 hr subset outperforms Baseline-1 (random 10 hr subset) and Baseline-2 (NIST-LLP 10 hr subset) by 21.0% and 15.5% relative, respectively, showing that the LLP 10 hr subset can be chosen more optimally to achieve better KWS performance without increasing transcription cost. The relative ATWV gain from increasing transcriptions from 10 hr to 60 hr is smaller in the submodular case (52.7%) than for Baseline-1 (82.9%) and Baseline-2 (76.4%), indicating that the return on transcription cost is more effective with the submodular optimization approach.

Table 4 also shows that by maximizing acoustic diversity, the proposed approach implicitly enriches the vocabulary in the lexicon and thus alleviates OOV issues: compared to Baseline-1 (random 10 hr subset) and Baseline-2 (NIST-LLP 10 hr subset), the proposed 10 hr subset reduces the number of OOV keywords by 17.0% and 42.3% relative, respectively. This byproduct helps resolve OOV issues at a more fundamental stage when developing spoken language technology. For more detailed analysis, please see [29].
Table 4. Submodular data selection for word transcriptions.
Transcription Condition              | ATWV   | OOV counts
Baseline-1: Random 10 hr subset      | 0.2386 | 1171
Baseline-2: NIST-LLP (10 hr subset)  | 0.2474 | 1686
Proposed submodular 10 hr subset     | 0.2857 | 972
Upper bound: NIST-FLP (full 60 hr)   | 0.4363 | 407
5. DISCUSSION
In this work, we investigated three strategies for low-resource keyword search. We expect our submodular optimization data selection approach to generalize well to languages other than Tamil, since similar approaches work for Mandarin LVCSR [30]. Similarly, the keyword-aware language model approach also works for Vietnamese [20]. By contrast, subword modeling (morphemes, homophones) appears to be more language-dependent.

While our LVCSR-KWS work in Tamil and Vietnamese [24] focuses on text queries, it has inspired strategies used in spoken term detection with audio queries. For example, [31] proposed partial-matching symbolic search, which complements popular pattern matching approaches based on dynamic time warping in Query-by-Example Search on Speech (QUESST), formerly called Spoken Web Search (SWS), in MediaEval 2014.
6. REFERENCES
[1] John Makhoul, Francis Kubala, Timothy Leek, Daben Liu,
Long Nguyen, Richard Schwartz, and Amit Srivastava,
“Speech and language technologies for audio indexing and retrieval,” Proceedings of the IEEE, vol. 88, no. 8, pp. 1338–1353, 2000.
[2] Biing-Hwang Juang and Sadaoki Furui, “Automatic recogni-
tion and understanding of spoken language-a first step toward
natural human-machine communication,” Proceedings of the
IEEE, vol. 88, no. 8, pp. 1142–1165, 2000.
[3] Jay G Wilpon, L Rabiner, Chin-Hui Lee, and ER Goldman,
“Automatic recognition of keywords in unconstrained speech using hidden Markov models,” IEEE TASLP, vol. 38, no. 11, pp. 1870–1878, 1990.
[4] J Gauvain and Lori Lamel, “Large-vocabulary continuous
speech recognition: advances and applications,” Proceedings
of the IEEE, vol. 88, no. 8, pp. 1181–1200, 2000.
[5] Jonathan G Fiscus, Jerome Ajot, John S Garofolo, and George Doddington, “Results of the 2006 spoken term detection evaluation,” in Proceedings of the ACM SIGIR Workshop on Searching Spontaneous Conversational Speech, 2007, pp. 51–55.
[6] Dilek Hakkani-Tur, Giuseppe Riccardi, and Allen Gorin, “Active learning for automatic speech recognition,” in Proc. IEEE ICASSP, 2002, vol. 4, pp. IV–3904.
[7] Lori Lamel, Jean-Luc Gauvain, and Gilles Adda, “Lightly su-
pervised and unsupervised acoustic model training,” Computer
Speech & Language, vol. 16, no. 1, pp. 115–129, 2002.
[8] Dong Yu, Balakrishnan Varadarajan, Li Deng, and Alex
Acero, “Active learning and semi-supervised learning for
speech recognition: A unified framework using the global en-
tropy reduction maximization criterion,” Computer Speech &
Language, vol. 24, no. 3, pp. 433–444, 2010.
[9] Nobuyasu Itoh, Tara N Sainath, Dan Ning Jiang, Jie Zhou,
and Bhuvana Ramabhadran, “N-best entropy based data se-
lection for acoustic modeling,” in Proc. IEEE ICASSP, 2012,
pp. 4133–4136.
[10] Yi Wu, Rong Zhang, and Alexander Rudnicky, “Data selection
for speech recognition,” in Proc. IEEE ASRU, 2007, pp. 562–
565.
[11] Hui Lin and Jeff Bilmes, “How to select a good training-data
subset for transcription: Submodular active selection for se-
quences,” in INTERSPEECH, 2009.
[12] Kai Wei, Yuzong Liu, Katrin Kirchhoff, and Jeff Bilmes, “Us-
ing document summarization techniques for speech data subset
selection,” in HLT-NAACL, 2013, pp. 721–726.
[13] I-Fan Chen, Nancy F Chen, and Chin-Hui Lee, “A Keyword-
Boosted sMBR Criterion to Enhance Keyword Search Perfor-
mance in Deep Neural Network Based Acoustic Modeling,” in
INTERSPEECH, 2014.
[14] Bing Zhang, Richard M Schwartz, Stavros Tsakalidis, Long Nguyen, and Spyros Matsoukas, “White listing and score normalization for keyword spotting of noisy speech,” in INTERSPEECH, 2012.
[15] I-Fan Chen, Chongjia Ni, Boon Pang Lim, Nancy F Chen, and Chin-Hui Lee, “A novel keyword+LVCSR-filler based grammar network representation for spoken keyword search,” in ISCSLP, 2014.
[16] Jonathan Mamou, Bhuvana Ramabhadran, and Olivier Siohan, “Vocabulary independent spoken term detection,” in Proc. ACM SIGIR Conference on Research and Development in Information Retrieval, 2007, pp. 615–622.
[17] Hasim Sak, Murat Saraçlar, and Tunga Güngör, “Morpholexical and discriminative language models for Turkish automatic speech recognition,” IEEE TASLP, vol. 20, no. 8, pp. 2341–2351, 2012.
[18] Yanzhang He, Brian Hutchinson, Peter Baumann, Mari Ostendorf, Eric Fosler-Lussier, and Janet Pierrehumbert, “Subword-based modeling for handling OOV words in keyword spotting,” in Proc. IEEE ICASSP, 2014, pp. 7864–7868.
[19] Melvin Jose Johnson Premkumar, Ngoc Thang Vu, and Tanja Schultz, “Experiments towards a better LVCSR system for Tamil,” in INTERSPEECH, 2013.
[20] I-Fan Chen, Chongjia Ni, Boon Pang Lim, Nancy F Chen, and Chin-Hui Lee, “A keyword-aware grammar framework for LVCSR-based spoken keyword search,” in Proc. IEEE ICASSP, 2015.
[21] Morfessor 2.0.0, http://www.cis.hut.fi/projects/morpho/morfessor2.shtml, last accessed August 2014.
[22] Daniel Povey et al., “The Kaldi speech recognition toolkit,” in Proc. IEEE ASRU, 2011.
[23] Florian Metze, Zaid A. W. Sheikh, Alex Waibel, Jonas
Gehring, Kevin Kilgour, Quoc Bao Nguyen, and Van Huy
Nguyen, “Models of tone for tonal and non-tonal languages,”
in Proc. IEEE ASRU, Olomouc; Czech Republic, 2013.
[24] Nancy F Chen, Sunil Sivadas, Boon Pang Lim, Hoang Gia
Ngo, Haihua Xu, Van Tung Pham, Bin Ma, and Haizhou Li,
“Strategies for Vietnamese keyword search,” in Proc. IEEE
ICASSP, 2014, pp. 4121–4125.
[25] Yariv Ephraim and David Malah, “Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator,” IEEE TASLP, vol. 32, no. 6, pp. 1109–1121, 1984.
[26] Daniel Povey, Lukas Burget, Mohit Agarwal, Pinar Akyazi, Kai Feng, Arnab Ghoshal, Ondrej Glembek, Nagendra K Goel, Martin Karafiát, Ariya Rastrow, R. C. Rose, P. Schwarz, and S. Thomas, “Subspace Gaussian mixture models for speech recognition,” in Proc. IEEE ICASSP, 2010, pp. 4330–4333.
[27] J. R. Novak, “Phonetisaurus: a WFST-driven phoneticizer,” https://code.google.com/p/phonetisaurus, 2012.
[28] Damianos Karakos, Richard Schwartz, Stavros Tsakalidis,
Le Zhang, Shivesh Ranjan, Tim Ng, Roger Hsiao, Guruprasad
Saikumar, Ivan Bulyko, Long Nguyen, et al., “Score normal-
ization and system combination for improved keyword spot-
ting,” in Proc. IEEE ASRU, 2013, pp. 210–215.
[29] Chongjia Ni, Cheung-Chi Leung, Lei Wang, Nancy F Chen, and Bin Ma, “Unsupervised data selection and word-morph mixed language model for Tamil low-resource keyword search,” in Proc. IEEE ICASSP, 2015.
[30] Chongjia Ni, Lei Wang, Haibo Liu, Cheung-Chi Leung, Li Lu, and Bin Ma, “Submodular data selection with acoustic and phonetic features for automatic speech recognition,” in Proc. IEEE ICASSP, 2015.
[31] Haihua Xu, Peng Yang, Xiong Xiao, Lei Xie, Cheung-Chi Leung, Hongjie Chen, Jia Yu, Hang Lv, Lei Wang, Su Jun Leow, Bin Ma, Eng Siong Chng, and Haizhou Li, “Language independent query-by-example spoken term detection using n-best phone sequences and partial matching,” in Proc. IEEE ICASSP, 2015.