Mining Live Transliterations
Using Incremental Learning Algorithms
HAIZHOU LI
Institute for Infocomm Research, Singapore 138632
hli@i2r.a-star.edu.sg
JIN-SHEA KUO
Chung-Hwa Telecom, Taoyuan 320, Taiwan
d8807302@gmail.com
JIAN SU
Institute for Infocomm Research, Singapore 138632
sujian@i2r.a-star.edu.sg
CHIH-LUNG LIN
Chung Yuan Christian University, Taoyuan 320, Taiwan
linclr@gmail.com
We study an adaptive learning framework for phonetic similarity modeling (PSM)
that supports the automatic acquisition of transliterations by exploiting minimum
prior knowledge about machine transliteration to mine transliterations incrementally
from the Web. We formulate an incremental learning strategy for the framework
based on Bayesian theory for PSM adaptation. The idea of incremental learning
is to benefit from the continuously developing history to update a static model
towards the intended reality. In this way, the learning process refines the PSM
incrementally while constructing a transliteration lexicon at the same time. We
further demonstrate that the proposed learning framework is reliably effective in
mining live transliterations from Web query results.
Keywords: Web mining; Incremental learning; Unsupervised learning; Transliteration
extraction; Machine transliteration; Machine translation; Machine learning; Phonetic
similarity model; Query results; Learning transliterations.
International Journal of Computer Processing of Languages
Vol. 21, No. 2 (2008) 183–203
© World Scientific Publishing Company
1. Introduction
Transliteration is a process of rewriting a word in one language into another
by preserving its pronunciation in its original language, which is also known
as translation-by-sound. It usually takes place between languages with different
scripts, for example, from English to Chinese, and for words, such as proper
nouns, that do not have “easy” or semantic translations.
The increasing size of multilingual content on the Web has made it a live
information source rich in transliterations. Research on automatic acquisition
of transliteration pairs in batch mode has shown promising results [16, 17]. In
dealing with the dynamic growth of the Web, it is almost impossible to collect
and store all its contents in local storage. Therefore, there is a need to develop
an incremental learning algorithm to mine transliterations in an on-line manner.
In general, an incremental learning technique is designed for adapting a model
towards a changing environment. In this paper, we are interested in deducing an
incremental learning method for automatically constructing an English-Chinese
(E-C) transliteration lexicon from Web query results.
In the deduction, we start with a phonetic similarity model (PSM), which
measures the phonetic similarity between words in two different scripts, and
study the learning mechanism of PSM in both batch and incremental modes.
Then, we conduct a series of experiments to evaluate the performance of these
two kinds of learning algorithms. In this way, the contributions of this paper
include: (i) the formulation of a batch learning framework and an incremental
learning framework for PSM learning; (ii) a comparative study of the batch and
incremental unsupervised learning strategies.
In this paper, Section 2 briefly introduces prior work related to machine
transliteration. In Section 3, we formulate the PSM and its batch and incremental
learning algorithms, while in Section 4 we discuss the practical issues in
implementation. Section 5 reports on the conducted experiments and, finally,
we conclude in Section 6.
2. Related Work
Much of the research on extraction of transliterations has been motivated by
information retrieval techniques, where attempts at extracting transliteration pairs
from large bodies of corpora have been made. Some have proposed to extract
translations from parallel or comparable bitexts using co-occurrence analysis or a
context-vector approach [9, 25]. These methods compare the semantic similarities
between source and target words without taking their phonetic similarities into
account.
Another direction of research focuses on establishing the phonetic
relationship between transliteration pairs. This typically involves the encoding
of phoneme- or grapheme-based mapping rules using a generative model that
is trained from a large bilingual lexicon. Most of these works are devoted to
the phoneme-based approach [15, 23, 27, 28, 29]. Suppose that EW and CW
form an E-C transliteration pair. The phoneme-based approach first converts EW
into an intermediate phonemic representation and then converts the phonemic
representation into its Chinese counterpart CW. On the other hand, the grapheme-
based approach, also known as direct orthographical mapping [18], treats
transliteration as a statistical machine translation problem under monotonic
constraints and has also achieved promising results.
Many efforts have also been channeled to tapping the wealth of the Web for
harvesting transliteration/translation pairs. These include studying the query logs
[3], unrelated corpora [26], parallel [20] and comparable corpora [12, 27]. To
establish correspondence, these algorithms usually rely on one or more statistical
clues [19], such as the correlation between word frequencies, and cognates of
similar spelling or pronunciations. In doing so, two things are needed: first,
a robust mechanism that establishes statistical relationships between bilingual
words, such as a phonetic similarity model which is motivated by transliteration
modeling research; and second, an effective learning framework that is able to
adaptively discover new events from the Web.
In Chinese/Japanese/Korean (CJK) Web pages, translated or transliterated
terms are frequently accompanied by their original Latin words, with the Latin
words serving as the appositives of the CJK words. In other words, the E-C pairs
are always closely collocated. Inspired by this observation in CJK texts, some
algorithms were proposed [17] to search over the close context of an English
word in a Chinese predominant bilingual snippet for transliteration.
Unfortunately, many of the reported works have not addressed practical
issues that concern learning transliterations from the growing Web, such as
incremental learning of transliteration lexicons. Incremental learning algorithms
are suitable for working in a changing environment and have been used in other
applications such as speech recognition to adapt model parameters to different
speakers successfully [6, 11]. In this paper, we study the learning framework of
the phonetic similarity model, which adopts a transliteration modeling approach
for transliteration extraction from the Web in an incremental manner.
3. Phonetic Similarity Model
Phonetic similarity model (PSM) is a probabilistic model that encodes the
syllable mapping between E-C pairs. Let ES = {e1, …, em, …, eM} be a sequence of
English syllables derived from EW and CS = {s1, …, sn, …, sN} be the sequence of
Chinese syllables derived from CW, which is represented by a Chinese character string
CW = w1, …, wn, …, wN. EW and CW form a transliteration pair. Each English
syllable is drawn from a vocabulary of I entries, em ∈ {x1, …, xI}, and each
Chinese syllable from a vocabulary of J entries, sn ∈ {y1, …, yJ}.
The E-C transliteration can be considered a generative process formulated by
the noisy channel model, which recovers the input CW from the observed output
EW. Applying the Bayes rule, we have Eq. (1), where P(EW|CW) is estimated to
characterize the noisy channel, known as the transliteration probability, and P(CW)
is a language model that characterizes the source language:

P(CW|EW) = P(EW|CW)P(CW)/P(EW). (1)

Following the translation-by-sound principle, P(EW|CW) can be approximated
by the phonetic probability P(ES|CS), which is given by Eq. (2):

P(ES|CS) = max_{Δ∈Γ} P(ES, Δ|CS), (2)
where Γ is the set of all possible alignment paths between ES and CS. To find
the best alignment path ∆, one can resort to a dynamic warping algorithm
[24]. Assuming conditional independence of syllables in ES and CS, we have
P(ES|CS) = ∏_{k=1}^{K} P(e_{m_k}|s_{n_k}), where k is the index of alignment, n_K = N and
m_K = M. Note that, typically, we have N ≤ M due to syllable elision [17]. With
the phonetic approximation, Eq. (1) can be rewritten as Eq. (3):

P(CW|EW) ≈ P(ES|CS)P(CW)/P(EW). (3)
The language model P(CW) in Eq. (3) can be represented by the n-gram
statistics of the Chinese characters derived from a monolingual corpus. Using
bigrams to approximate the n-gram model, we have Eq. (4):

P(CW) ≈ P(w_1) ∏_{n=2}^{N} P(w_n|w_{n-1}). (4)

Note that P(EW) is not a function of CW; therefore, it can be dropped
from Eq. (3) in finding the maximum CW. A PSM model, denoted as Θ, for
finding the CW with the highest probability now consists of both the P(ES|CS) and P(CW)
parameters. We now look into the mathematical formulation for learning the
P(ES|CS) parameters from a bilingual transliteration lexicon.
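To make the scoring concrete, the following is a minimal Python sketch of Eqs. (2)-(4): a dynamic-programming search for the best monotonic alignment between the two syllable sequences, combined with a character bigram language model. The data layout (a dictionary psm mapping syllable pairs to P(e|s), plus unigram and bigram dictionaries) is our own illustrative assumption, not the authors' implementation.

import math

def phonetic_prob(es, cs, psm, floor=1e-6):
    """Approximate P(ES|CS), Eq. (2): best alignment path between the English
    syllables es and the Chinese syllables cs, scored by summed log P(e_mk|s_nk).
    Monotonic many-to-one steps model syllable elision (N <= M)."""
    M, N = len(es), len(cs)
    NEG = float("-inf")
    dp = [[NEG] * (N + 1) for _ in range(M + 1)]
    dp[0][0] = 0.0
    for m in range(1, M + 1):
        for n in range(1, N + 1):
            step = math.log(psm.get((es[m - 1], cs[n - 1]), floor))
            # diagonal: start a new Chinese syllable; vertical: keep absorbing
            # further English syllables into the current Chinese syllable
            dp[m][n] = max(dp[m - 1][n - 1], dp[m - 1][n]) + step
    return dp[M][N]

def lm_logprob(cw, unigram, bigram, floor=1e-6):
    """Bigram approximation of P(CW) over the Chinese character string, Eq. (4)."""
    lp = math.log(unigram.get(cw[0], floor))
    for prev, cur in zip(cw, cw[1:]):
        lp += math.log(bigram.get((prev, cur), floor))
    return lp

def psm_score(es, cs, cw, psm, unigram, bigram):
    """log P(CW|EW) up to a constant, Eq. (3): log P(ES|CS) + log P(CW)."""
    return phonetic_prob(es, cs, psm) + lm_logprob(cw, unigram, bigram)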
3.1. Batch learning of PSM
A collection of manually selected or automatically extracted E-C pairs can form
a transliteration lexicon. Given such a lexicon for training, the PSM parameters
can be estimated in a batch mode. An initial PSM is bootstrapped using limited
prior knowledge such as a small amount of transliterations, which may be
obtained by exploiting co-occurrence information [27]. Then we align the E-C
pairs using the PSM Θ and derive syllable mapping statistics.
Suppose that we have the event counts c_{i,j} = count(e_m = x_i, s_n = y_j) and
c_j = count(s_n = y_j) for a given transliteration lexicon D with alignments Λ. We
would like to find the parameters p_{i|j} = P(e_m = x_i | s_n = y_j, Λ) that
maximize the probability in Eq. (5),

P(D, Λ|Θ) = ∏_Λ P(e_m|s_n) = ∏_j ∏_i p_{i|j}^{c_{i,j}}, (5)

where Θ = {p_{i|j}, i = 1, …, I, j = 1, …, J}, with the maximum likelihood estimation
(MLE) criterion, subject to the constraints Σ_i p_{i|j} = 1, ∀j. Rewriting Eq. (5)
in log-likelihood (LL) results in Eq. (6),

LL(D, Λ|Θ) = Σ_Λ log P(e_m|s_n) = Σ_j Σ_i c_{i,j} log p_{i|j}. (6)

It is the cross-entropy of the true data distribution c_{i,j} with regard
to the PSM model. Given an alignment Λ, the MLE estimate of the PSM is:

p_{i|j} = c_{i,j}/c_j. (7)
With a new PSM model, one is able to arrive at a new alignment. This is
formulated as an expectation-maximization (EM) process [7], which assumes that
there exists a mapping D → Λ, where Λ is introduced as the latent information,
also known as missing data in the EM literature. The EM algorithm maximizes the
likelihood probability P(D|Θ) over Θ by exploiting P(D|Θ) = Σ_Λ P(D, Λ|Θ).
The EM process guarantees a non-decreasing likelihood probability P(D|Θ)
through multiple EM steps until it converges. In the E-step, we derive the event
counts c_{i,j} and c_j by force-aligning all the E-C pairs in the training lexicon D
using a PSM model. In the M-step, we estimate the PSM parameters Θ by
Eq. (7). The EM process also serves as a refining process to obtain the best
alignment between the E-C syllables. In each EM cycle, the model is updated
after observing the whole corpus D. An EM cycle is also called an iteration in
batch learning. The batch learning process is described as follows and depicted
in Figure 1.
00186.indd 187 12/10/2008 3:50:28 PM
188 Haizhou Li, Jin-Shea Kuo, Jian Su and Chin-Lung Lin
Batch learning algorithm:
Start: Bootstrap the PSM parameters p_{i|j} using prior phonetic mapping knowledge;
E-Step: Force-align the corpus D using p_{i|j} to obtain Λ and hence the counts c_{i,j} and c_j;
M-Step: Re-estimate p_{i|j} = c_{i,j}/c_j using the counts from the E-Step;
Iterate: Repeat the E-Step and M-Step until P(D|Θ) converges.
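The loop below is a compact sketch of this batch EM procedure. It assumes a helper force_align(es, cs, psm) that returns the best alignment of one pair as a list of (English syllable, Chinese syllable) links, e.g. obtained by backtracking the dynamic-programming table of the earlier sketch; the helper and the data layout are hypothetical, not taken from the paper.

from collections import defaultdict

def batch_em(lexicon, psm, force_align, iterations=6):
    """lexicon: list of (english_syllables, chinese_syllables) training pairs.
    psm: dict mapping (e, s) -> P(e|s), bootstrapped from prior knowledge."""
    for _ in range(iterations):
        pair_counts = defaultdict(float)    # c_{i,j}
        target_counts = defaultdict(float)  # c_j
        # E-step: force-align every pair with the current model and count events
        for es, cs in lexicon:
            for e, s in force_align(es, cs, psm):
                pair_counts[(e, s)] += 1.0
                target_counts[s] += 1.0
        # M-step: MLE re-estimation, Eq. (7): p_{i|j} = c_{i,j} / c_j
        psm = {(e, s): c / target_counts[s] for (e, s), c in pair_counts.items()}
    return psm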
3.2. Incremental learning of PSM
In batch learning all the training samples have to be collected in advance. In
a dynamically changing environment, such as the Web, new samples always
appear and it is impossible to collect all of them. Incremental learning [8, 10,
11, 30] is devised to achieve rapid adaptation towards the working environment
by updating the model as learning samples arrive in sequence. It is believed that
if the statistics for the E-step are incrementally collected and the parameters are
frequently estimated, incremental learning converges more quickly because the
information from the new data contributes to the parameter estimation more
effectively than the batch algorithm does [11]. In incremental learning, the model
is typically updated progressively as the training samples become available and
the number of incremental samples may vary from as few as one to as many
as they are available. In the extreme case where all the learning samples are
available in advance and the updating is done after observing all of them,
incremental learning becomes batch learning. Therefore, batch learning can be
considered as a special case of incremental learning.
Figure 1. Batch learning of PSM.
The incremental learning can be formulated through the maximum a posteriori
(MAP) framework, also known as Bayesian learning, where we assume that the
parameters Θ are random variables subject to a prior distribution. A possible
candidate for the prior distribution of p_{i|j} is the Dirichlet density over each of
the parameters p_{i|j} [2]. Letting Θ_j = {p_{i|j}, i = 1, …, I}, we introduce

P(Θ_j) ∝ ∏_i p_{i|j}^{α h_{i|j} - 1}, ∀j, (8)

where Σ_i h_{i|j} = 1 and α, which can be empirically set, is a positive scalar. Assuming
H is a set of hyperparameters, we have as many hyperparameters h_{i|j} ∈ H as the
parameters p_{i|j}. The probability of generating the aligned transliteration lexicon is
obtained by integrating over the parameter space, P(D) = ∫ P(D|Θ)P(Θ) dΘ.
This integration can be easily written down in a closed form due to the
conjugacy between the Dirichlet distribution ∏_i p_{i|j}^{α h_{i|j} - 1} and the multinomial
distribution ∏_i p_{i|j}^{c_{i,j}}. Instead of finding the Θ that maximizes P(D|Θ) with MLE,
we maximize the a posteriori (MAP) probability as follows:

Θ_MAP = argmax_Θ P(Θ|D) = argmax_Θ P(D|Θ)P(Θ)/P(D) = argmax_Θ P(D|Θ)P(Θ). (9)

The MAP solution uses a distribution to model the uncertainty of the
parameter Θ, while the MLE gives a point estimation [14, 22]. We rewrite
Eq. (9) as Eq. (10) using Eq. (5) and Eq. (8):

Θ_MAP ≈ argmax_Θ ∏_j ∏_i p_{i|j}^{c_{i,j} + α h_{i|j} - 1}. (10)
Equation (10) can be seen as a Dirichlet function of Θ given H, or a
multinomial function of H given Θ. With a given prior H, the MAP estimation is
therefore similar to the MLE problem, which is to find the mode of the kernel
density in Eq. (10). In MAP estimation, the PSM parameters can be represented
as in Eq. (11),

p_{i|j} = λ h_{i|j} + (1 - λ) f_{i|j}, (11)

where f_{i|j} = c_{i,j}/c_j, c_j = Σ_i c_{i,j}, and λ = α/(α + c_j).
One can find that λ serves as a weighting factor between the prior and the
current observations. The difference between the MLE and MAP strategies lies in the
fact that MAP introduces prior knowledge into the parameter updating formula.
Equation (11) assumes that the prior parameters H are known and static while
the training samples are available all at once.
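As a small illustration, the MAP update of Eq. (11) could be written as follows, assuming the counts c_{i,j}, the totals c_j and the hyperparameters h_{i|j} are kept in dictionaries; the interface is ours, not the paper's.

def map_update(pair_counts, target_counts, hyper, alpha):
    """Eq. (11): p_{i|j} = lambda * h_{i|j} + (1 - lambda) * f_{i|j},
    with lambda = alpha / (alpha + c_j) and f_{i|j} = c_{i,j} / c_j."""
    psm = {}
    for (e, s), c in pair_counts.items():
        lam = alpha / (alpha + target_counts[s])   # weighting between prior and data
        f = c / target_counts[s]                   # relative frequency f_{i|j}
        psm[(e, s)] = lam * hyper.get((e, s), 0.0) + (1.0 - lam) * f
    return psm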
The idea of incremental learning is to benefit from the continuously
developing history to update the static model towards the intended reality.
As is often the case, the Web query results in an on-line application arrive in
sequence. It is of practical use to devise such an incremental mechanism that
adapts both parameters and the prior knowledge over time. The quasi-Bayesian
(QB) learning method offers a solution to it [1].
Let us break up a training corpus D into a sequence of sample subsets
D = {D_1, D_2, …, D_T} and denote an accumulated sample subset D^(t) =
{D_1, D_2, …, D_t}, 1 ≤ t ≤ T, as an incremental corpus. Therefore, we have
D = D^(T). The QB method approximates the posterior probability P(Θ|D^(t-1)) by
the closest tractable prior density P(Θ|H^(t-1)), with H^(t-1) evolved from the historical
corpus D^(t-1), as shown in Eq. (12).
Θ_QB^(t) = argmax_Θ P(Θ|D^(t))
         = argmax_Θ P(D_t|Θ)P(Θ|D^(t-1))
         ≈ argmax_Θ ∏_j ∏_i p_{i|j}^{c_{i,j}^(t) + α h_{i|j}^(t-1) - 1}. (12)
QB estimation offers a recursive learning mechanism. Starting with a
hyperparameter set H^(0) and a corpus subset D_1, we estimate H^(1) and Θ_QB^(1),
then H^(2) and Θ_QB^(2), and so on until H^(t) and Θ_QB^(t), as the observed samples arrive in
sequence. The updating of parameters can be iterated between the reproducible
prior and posterior estimates as in Eq. (13) and Eq. (14). Assuming T → ∞, we
have the following algorithm:
Incremental learning algorithm:
Start: Bootstrap Θ_QB^(0) and H^(0) using prior phonetic mapping knowledge and
set t = 1;
E-Step: Force-align the corpus subset D_t using Θ_QB^(t-1), compute the event counts
c_{i,j}^(t), and reproduce the prior parameters H^(t-1) → H^(t) by Eq. (13):

h_{i|j}^(t) = h_{i|j}^(t-1) + c_{i,j}^(t)/α. (13)

M-Step: Re-estimate the parameters H^(t) → Θ_QB^(t), i.e. p_{i|j}^(t), using the counts from
the E-Step by Eq. (14):

p_{i|j}^(t) = h_{i|j}^(t) / Σ_i h_{i|j}^(t). (14)
EM cycle: Repeat the E-Step and M-Step until P(Θ|D^(t)) converges.
Iterate: Repeat T EM cycles covering the entire data set D in an iteration.
The algorithm updates the PSM model as training samples become available.
The scalar factor α can be seen as a forgetting factor: when α is large, the updating
of the hyperparameters favors the prior; otherwise, the current observations are given
more attention. As for the sample subset size |D_t|, if we set |D_t| = 100, each
EM cycle updates Θ after observing every 100 samples. To be comparable with
batch learning, we define an iteration here to be a sequence of EM cycles that
covers the whole corpus D. If the corpus D has a fixed size, D = D^(T), an iteration
means T EM cycles in incremental learning. The iteration can then be repeated
just as in batch learning.
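The sketch below folds Eqs. (13) and (14) into one quasi-Bayesian EM cycle over a newly arrived subset D_t, reusing the hypothetical force_align helper of the batch sketch; as before, the interfaces are illustrative assumptions rather than the authors' code.

from collections import defaultdict

def qb_em_cycle(subset, psm, hyper, force_align, alpha=0.5):
    """subset: the sample subset D_t, a list of (ES, CS) syllable pairs.
    hyper: dict holding the hyperparameters h_{i|j} (kept unnormalized here)."""
    pair_counts = defaultdict(float)
    # E-step: align the new subset with the current model and count events c^{(t)}_{i,j}
    for es, cs in subset:
        for e, s in force_align(es, cs, psm):
            pair_counts[(e, s)] += 1.0
    # Eq. (13): reproduce the prior, h^{(t)}_{i|j} = h^{(t-1)}_{i|j} + c^{(t)}_{i,j} / alpha
    for key, c in pair_counts.items():
        hyper[key] = hyper.get(key, 0.0) + c / alpha
    # Eq. (14): M-step, p_{i|j} = h_{i|j} / sum_i h_{i|j}
    norm = defaultdict(float)
    for (e, s), h in hyper.items():
        norm[s] += h
    psm = {(e, s): h / norm[s] for (e, s), h in hyper.items()}
    return psm, hyper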
4. Mining Transliterations from the Web
Since the Web is dynamically changing and new transliterations come out all
the time, it is better to mine transliterations from the Web in an incremental
way. Words transliterated by closely observing common guidelines are referred
to as regular transliterations. However, in Web publishing, translators in
different regions may not observe the same guidelines. Sometimes they skew
the transliterations in different ways to introduce semantic implications, also
known as wordplay, resulting in casual transliterations. Casual transliteration
leads to multiple Chinese transliteration variants for the same English word.
For example, “Disney” may be transliterated into “迪士尼/Di-Shi-Ni/”,ᵃ “迪斯奈
/Di-Si-Nai/” and “狄斯耐/Di-Si-Nai/”.
Suppose that a sufficiently large, manually validated transliteration lexicon
is available, a PSM can be built in a supervised manner. However, this method
hinges on the availability of such a lexicon. A large transliteration lexicon is
not always available as it is labor-intensive to produce. Even if one is available,
the derived model can only be as good as what the training lexicon offers. New
transliterations, such as casual ones, may not be well handled. It is desirable to
adapt the PSM as new transliterations become available. This is also referred to
as the learning-at-work mechanism. Some solutions have been proposed recently
along this direction [16]. However, the effort was mainly devoted to mitigating
the need of manual labeling. A dynamic learning-at-work mechanism for mining
transliterations has not been well studied.
ᵃ The Chinese words are romanized in hanyu pinyin.
Here we are interested in an unsupervised learning process, in which we
adapt the PSM as we extract transliterations. The learning-at-work framework
is illustrated in Figure 2. As opposed to a manually labeled training corpus
used in Figure 1, we insert into the EM process an automatic transliteration
extraction mechanism, search and rank, as shown in the left panel of
Figure 2. The search and rank shortlists a set of transliterations from the Web
query results or bilingual snippets.
4.1. Search and rank
We obtain bilingual snippets from the Web by iteratively submitting queries to
the Web search engines [4]. Qualified sentences are extracted from the results
of each query. Each qualified sentence has at least one English word.
Given a qualified sentence, we first denote the competing Chinese trans-
literation candidates as a set Ω, from which we would like to pick the most likely
one. Second, we would like to know if there is indeed a Chinese transliteration
CW in the close context of the English word EW.
We propose ranking the candidates by Eq. (15), using the PSM model to find
the most likely CW for a given EW. The CW candidate that gives the highest
posterior probability is considered the most probable candidate CW′:

CW′ = argmax_{CW∈Ω} P(CW|EW) = argmax_{CW∈Ω} P(ES|CS)P(CW). (15)
Figure 2. Diagram of unsupervised transliteration extraction — learning-at-work.
The next step is to examine if CW ′ and EW indeed form a genuine E-C
pair. We define the confidence of the E-C pair as the posterior odds similar to
that in a hypothesis test under the Bayesian interpretation. We have H0, which
hypothesizes that CW and EW form an E-C pair, and H1, which hypothesizes
otherwise, and use posterior odds σ [17] for hypothesis tests.
Our search and rank formulation can be seen as an extension to a prior
work [3]. The posterior odds σ are used as the confidence score so that E-C pairs
extracted from different contexts can be directly compared. In practice, we set
a threshold on σ as the cutoff point for short-listing E-C pairs. In this
way, the search and rank is able to retrieve a collection of quasi transliterations
from the Web given a PSM.
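A rough sketch of search and rank is given below: the best candidate is selected by Eq. (15), reusing the psm_score function from the earlier sketch, and the pair is kept only if a confidence score passes a threshold. The posterior odds of [17] are treated here as a given function, and candidate generation and syllabification are placeholders; none of these names come from the paper.

def search_and_rank(ew, candidates, syllabify_en, syllabify_cn,
                    psm, unigram, bigram, posterior_odds, threshold):
    """candidates: the set Omega of Chinese strings found near EW in a snippet."""
    es = syllabify_en(ew)
    best_cw, best_score = None, float("-inf")
    for cw in candidates:
        # Eq. (15): rank by log P(ES|CS) + log P(CW)
        score = psm_score(es, syllabify_cn(cw), cw, psm, unigram, bigram)
        if score > best_score:
            best_cw, best_score = cw, score
    # hypothesis test: accept only if the posterior odds sigma pass the cutoff
    if best_cw is not None and posterior_odds(ew, best_cw) >= threshold:
        return (ew, best_cw)   # a quasi transliteration pair
    return None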
4.2. Unsupervised learning strategy
Now we can carry out PSM learning as formulated in Section 3 using the
quasi transliterations as if they were manually validated. By unsupervised batch
learning, we mean to re-estimate the PSM model after search and rank over
the whole database, i.e., in each iteration. Just as in supervised learning, one
can expect the PSM performance to improve over multiple iterations. We report
the F-measure at each iteration. The extracted transliterations also form a new
training corpus for the next iteration.
In contrast to the batch learning, incremental learning updates the PSM
parameters as the training samples arrive in sequence. This is especially useful
in Web mining. With the QB incremental optimization, one can think of an
EM process that continuously re-estimates PSM parameters as the Web crawler
discovers new “territories”. In this way, the search and rank process gathers
qualified training samples Dt after crawling a portion of the Web. Note that the
incremental EM process updates parameters more often than batch learning does.
To evaluate the performance of both learning methods, we define an iteration to be
T EM cycles of incremental learning on a training corpus D = D^(T), as discussed
in Section 3.2.
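To show how the pieces fit together, here is a hypothetical version of the learning-at-work loop of Figure 2, in which quasi pairs are accumulated from crawled sentences and the PSM is re-estimated after every |D_t| samples. The crawl() generator and the find_pair() wrapper (which would combine candidate extraction, the search-and-rank step and syllabification) are assumptions standing in for components the paper does not spell out.

def learn_at_work(crawl, psm, hyper, find_pair, qb_em_cycle_fn, subset_size=100):
    """crawl(): yields qualified sentences from Web query results.
    find_pair(sentence, psm): returns an aligned (ES, CS) syllable pair or None."""
    subset, lexicon = [], []
    for sentence in crawl():
        pair = find_pair(sentence, psm)
        if pair is None:
            continue
        subset.append(pair)
        lexicon.append(pair)   # in practice the word-level E-C pair is recorded too
        if len(subset) >= subset_size:
            # one quasi-Bayesian EM cycle per |D_t| newly extracted samples
            psm, hyper = qb_em_cycle_fn(subset, psm, hyper)
            subset = []
    return psm, lexicon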
4.3. Initializing PSM with context-based model
In incremental learning, the quality of the initial PSM Θ_QB^(0) and the hyperparameter set
H^(0) has an impact on the overall learning process. The initial model is defined
based on prior knowledge about transliteration. There are different ways to
bootstrap the initial model, for example, by using phonetic clues or syntactic
clues. Since the PSM is phonetically-motivated, we expect that it can be boosted
by live transliterations that are extracted from a context-based model.
Contextual information has provided an important clue in extracting
translation terms [12, 21] and in information extraction for generating repetitive
patterns or templates [5], and the same holds for the extraction of transliterations. We also
conduct an inquiry into a transliteration lexicon of 8,898 unique, manually
validated transliteration pairs, which we will discuss later in the experiments.
The study reveals that about 56% of them are singletons, which are observed
only once in the corpus, as shown in the count-counts distribution in
Figure 3. Context-based models have reportedly been effective [9] in extracting
high frequency pairs — non-singletons. The challenge is that these non-singletons
are not always phonetically transliterated because some of them are semantic
translations. The idea is to use a context-based model to extract a set of quasi
transliteration pairs (QTPs), which are statistically acquired without manual
validation, and then validate the QTPs by using an initial PSM. This process is
intended to improve the initial PSM and hence the subsequent performance of
transliteration extraction.
A simple statistical model for extracting transliteration pairs using a co-occurrence
model, shown in Eq. (16), is proposed. By applying mutual information
(MI) as the underlying co-occurrence model, we have

CW′ = argmax_{CW∈Ω} MI(CW, EW)
    = argmax_{CW∈Ω} P(CW, EW) log [P(CW, EW) / (P(CW)P(EW))]. (16)
The CW that gives the highest mutual information is considered the most
probable transliteration of EW. The mutual information, as defined in information
theory, represents the information gain about CW in the presence of EW. It is referred to
as the MI context-based model hereafter.
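A minimal sketch of Eq. (16) follows, assuming simple co-occurrence counts collected over the qualified sentences; the counting scheme and the dictionary layout are illustrative only.

import math

def mi_best_candidate(ew, candidates, cooc, count_cn, count_en, total):
    """cooc[(cw, ew)], count_cn[cw], count_en[ew]: raw counts; total: #sentences."""
    best_cw, best_mi = None, float("-inf")
    for cw in candidates:
        joint = cooc.get((cw, ew), 0) / total
        if joint == 0.0:
            continue
        p_cw, p_ew = count_cn[cw] / total, count_en[ew] / total
        mi = joint * math.log(joint / (p_cw * p_ew))   # Eq. (16)
        if mi > best_mi:
            best_cw, best_mi = cw, mi
    return best_cw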
Figure 3. The count-counts distribution of the manually validated transliteration pairs (x-axis: counts, 1 to ≥ 11; y-axis: count-counts in %).
Using the MI context-based model, a set of quasi transliteration pairs can
be acquired. To prepare for the initial PSM, the QTPs can be used to augment
a set of seed transliteration pairs or human-crafted phonetic mapping rules.
5. Experiments
To obtain the ground truth for performance evaluation, each possible transliteration
pair is manually checked based on the following transliteration criteria: (i) if
an EW is partly translated phonetically and partly translated semantically, only
the phonetic transliteration constituent is extracted to form a transliteration pair;
(ii) multiple E-C pairs may appear in one sentence; (iii) an EW can have multiple
valid Chinese transliterations and vice versa. The validation process results in
a collection of qualified E-C pairs, or distinct qualified transliteration pairs
(DQTPs), which form a transliteration lexicon.
To simulate the dynamic Web, we collected a Web corpus, which consists
of about 500 MB of Web pages, referred to as SET1 by [16, 17]. From SET1,
80,094 qualified sentences were automatically extracted, and 8,898 DQTPs were
further selected with manual validation.
To establish a reference for performance benchmarking, we first initialize a
PSM, referred to as seed PSM hereafter, using a random selection of 100 seed
DQTPs. By exploiting the seed PSM on all 8,898 DQTPs, we train a PSM in a
supervised batch mode and improve the PSM on SET1 after each iteration. The
resulting performance in precision, recall and F-measure, which are defined in
Eq. (17), at the 6th iteration is reported in Table 1 and the F-measure is also
shown in Figure 4.
precision = #extracted_DQTPs / #extracted_pairs,
recall = #extracted_DQTPs / #total_DQTPs, (17)
F-measure = 2 × recall × precision / (recall + precision).
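As a quick arithmetic check, the closed-test figures in Table 1 are consistent with these definitions: 2 × 0.663 × 0.834/(0.663 + 0.834) ≈ 0.739.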
We use this closed test (supervised batch learning) as the reference
point for the unsupervised experiments. Next we further implement two PSM
learning strategies, namely unsupervised batch and unsupervised incremental
learning.
Table 1. The performance achieved by supervised batch learning on SET1.

              Precision   Recall   F-measure
Closed-test     0.834     0.663      0.739
5.1. Unsupervised batch learning
We begin with the same seed PSM. However, we use quasi transliterations that
are extracted automatically instead of manually validated DQTPs for training.
Note that the quasi transliterations are extracted and collected at the end of each
iteration. The collected set may differ from one iteration to another. After re-estimating the PSM
model in each iteration, we evaluate the performance on SET1.
Comparing the two batch mode learning strategies in Figure 4, it is observed
that learning substantially improves the seed PSM after the first iteration. Without
surprise, the supervised learning consistently outperforms the unsupervised one,
which reaches a plateau at 0.679 F-measure. This performance is considered the
baseline for comparison in this paper. The unsupervised batch learning presented
here is similar to that in [16].
5.2. Unsupervised incremental learning
We now formulate an on-lineᵇ unsupervised incremental learning algorithm:
(i) Start with the seed PSM, set t = 1;
(ii) Extract |D_t| quasi transliteration pairs, followed by the E-Step of the incremental
learning algorithm;
(iii) Re-estimate the PSM using D_t (M-Step), set t = t + 1;
(iv) Repeat (ii) and (iii) to crawl over a corpus.
Figure 4. Comparison of F-measure over iterations 1–6 (U-Incremental: Unsupervised Incremental); curves: Supervised Batch, Unsupervised Batch, U-Incremental (100), U-Incremental (5,000).
ᵇ In an actual on-line environment, we are not supposed to store documents, thus no iteration can take place.
To simulate the on-line incremental learning just described, we train and
test on SET1 because of the availability of the gold standard and for comparison with
the performance of the batch mode. We empirically set α = 0.5 and study different |D_t|
settings. An iteration is defined as multiple cycles of steps (ii)–(iii) that screen
through the whole SET1 once. We run multiple iterations.
The performance of incremental learning with |D_t| = 100 and |D_t| = 5000
is reported in Figure 4. It is observed that incremental learning benefits
from more frequent PSM updating. With |D_t| = 100, it not only attains a good
F-measure in the first iteration, but also outperforms unsupervised batch learning
along the EM process. The PSM updating becomes less frequent for larger |D_t|.
When |D_t| is set to be the whole corpus, incremental learning becomes
batch mode learning, as evidenced by |D_t| = 5000, which performs
almost the same as batch mode learning. The experiments in Figure 4 are
considered closed-set tests. Next we move on to the actual online experiment
after exploiting the contextual model.
5.3. Initializing PSM with a context-based model
Contextual information has been used in extracting semantic translation terms.
However, algorithms exploiting contextual information often fail to extract terms
of low frequency [13]. To prepare an initial list of quality transliteration pairs to
enhance the initial PSM, quasi transliteration pairs or QTPs extracted by the MI
context-based model need to be further validated. For example, in “Lacroix-設計
師拉克華/She-Ji-Shi-La-Ke-Hua/” and “Hilary-第一夫人希拉蕊/Di-Yi-Fu-Ren-
Xi-La-Rui/”, “Lacroix-拉克華/La-Ke-Hua/” and “Hilary-希拉蕊/Xi-La-Rui/” are
extracted and accepted. However, “Asus-華碩/Hua-Shuo/” is extracted because of its high
frequency, but is disqualified for phonetic irrelevance.
We can enhance the learning algorithm described in Section 5.2 further by
combining both the contextual information and phonetic information to obtain
an initial list of quality transliterations. Step (i) of the algorithm can be rewritten
as follows:
Start with the seed PSM augmented by quasi pairs obtained by the MI
context-based model, set t= 1;
a) First, generate quasi transliteration pairs from the whole corpus by using the
MI context-based model.
b) Second, shortlist phonetically genuine transliteration pairs by using the seed
PSM.
c) Third, derive the initial PSM from the resulting genuine pairs.
With the context-based model, 1,103 high-quality transliteration pairs are
extracted and added for bootstrapping a better initial PSM. In Figure 5, the
performance of the unsupervised batch and that of proposed incremental learning
algorithm bootstrapping with the context-based model (bootstrapping U-Batch
learning and bootstrapping U-Incremental learning for short, respectively)
are reported. The bootstrapping strategy outperforms the unsupervised batch
in general. It benefits from the improved initial PSM. The bootstrapping
U-Incremental learning with Dt = 100 gives better performance than that
with Dt = 5000. This is similar to what we observed in Figure 4. We note that
context-based model does help improve the performance in either unsupervised
batch learning or incremental learning case.
5.4. Learning from the live Web
In practice, it is possible to extract bilingual snippets of interest by repeatedly
submitting queries to the Web. With the learning-at-work mechanism, we can
mine the query results for up-to-date transliterations, as in Figure 6. For example,
by submitting “Mkapa” to search engines, we may get “Mkapa-姆卡帕/Mu-Ka-
Pa/” and, as a by-product, “Dodoma-多多馬/Duo-Duo-Ma/” as well as shown in
Figure 6. In this way, new queries can be generated iteratively, thus new pairs
are discovered. With the promising test on SET1, we are now ready to move
on to a live test.
Figure 5. F-measure over iterations 1–6 using U-Batch (Unsupervised Batch) and U-Incremental learning bootstrapping with the MI context-based model on SET1; curves: Supervised Batch, U-Batch (Seeds), Bootstrapping U-Batch, Bootstrapping U-Incremental (100), Bootstrapping U-Incremental (5,000).
Following the unsupervised incremental learning algorithm, we start the
crawling with the same seed PSM as in Section 5.2. We adapt the PSM model
as every 100 quasi transliterations are extracted, i.e. |D_t| = 100. The crawling
stops after accumulating 67,944 Web pages, with at most 100 snippets per page,
yielding 2,122,026 qualified sentences. We obtain 123,215 distinct E-C
pairs when the crawling stops. For comparison, we also carry out unsupervised
batch learning over the same 2,122,026 qualified sentences in a single iteration.
As the gold standard for this live database is not available, we randomly select
500 quasi transliteration pairs for manual checking of precision and report the
performance in Table 2. A precision of 0.758 by unsupervised batch learning
and a precision of 0.768 by unsupervised incremental learning are reported.
Assuming the same precision in the whole extracted corpus, 51,323 and 94,629
DQTPs can be expected, respectively. It is found that incremental learning is
more effective and productive than batch learning in discovering transliteration
pairs. This finding is consistent with the test results on SET1.
Figure 6. An example of bilingual snippets returned from a search engine (Chinese-English snippets from www.ebaomonthly.com and www.africabuyer.cn mentioning “Benjamin William Mkapa” and “Dodoma”).
Table 2. Comparison between U-Batch learning, U-Incremental learning and the bootstrapping
U-Incremental learning (with the MI model) from the live Web.

                       U-Batch learning   U-Incremental learning   U-Incremental learning (with MI model)
#distinct E-C pairs         67,708               123,215                     123,337
Precision                    0.758                 0.768                       0.774
#expected DQTPs             51,323                94,629                      95,463
Generally, each bilingual snippet in a page of Web query results contains the
submitted query term. Because most of the Web-based search
engines [4] devise algorithms to detect and remove page duplication, most of the
sentences containing the query term in a Web page are different, as shown in
Figure 6. Therefore, we can explore the regularities from the context of the query
term to obtain the translation or transliteration terms. Exploiting this observation,
we further propose an online learning algorithm for mining transliterations by
incorporating the contextual information for handling query results. In this way,
we incorporate the MI model into the transliteration extraction of each page at
every EM cycle. Step (ii) of the learning algorithm described in Section 5.2
can be rewritten as:
a) First, generate quasi transliteration pairs by using the MI context-based model
to process each query result in a page.
b) Second, shortlist phonetically genuine transliteration pairs by using the current
PSM.
c) Third, derive a new PSM from the resulting genuine pairs and the current
PSM.
An experiment using U-Incremental learning with the MI model is conducted in
an online environment. Again, from the same live Web, we obtain 123,337 distinct
E-C pairs. The precision is estimated to be 0.774 by examining 500 randomly
selected quasi transliteration pairs, and hence 95,463 DQTPs can be expected.
As shown in Table 2, the U-Incremental learning with the MI model achieves even
better performance than the U-Incremental learning alone. Summarizing these
observations, we believe that the MI context-based model is helpful in both batch
and incremental learning strategies.
6. Conclusions
We have proposed a learning framework for mining E-C transliterations using
bilingual snippets from a live Web corpus. In this learning-at-work framework,
we formulate the PSM learning method and study strategies for PSM learning
in both batch and incremental manners. The batch mode learning benefits from
multiple iterations for improving performance, while the unsupervised incremental
one, which does not require all the training data to be available in advance,
adapts to the dynamically changing environment easily without compromising the
performance. Unsupervised incremental learning provides a practical and effective
solution for discovering transliterations from the query results, which can be
easily extended to other Web mining applications. It is also found that context-based
knowledge boosts the initial PSM, which improves the overall transliteration
extraction performance. It is suggested that the phonetically-motivated PSM
transliteration extraction model be augmented by the context-based model for
best performance.
For future work, the natural next steps include extending the method to
(i) named entity translation extraction in general; and (ii) extraction of
transliterations from comparable texts in multiple languages.
References
[1] S. Bai and H. Li, Bayesian learning of N-gram statistical language modeling,
in Proceedings of International Conference on Acoustics, Speech and Signal Processing,
2006, pp. 1045–1048.
[2] M. Bacchiani, B. Roark, M. Riley and R. Sproat, MAP adaptation of
stochastic grammars, Computer Speech and Language, 20(1), 2006, 41–68.
[3] E. Brill, G. Kacmarcik and C. Brockett, Automatically harvesting Katakana-
English term pairs from search engine query logs, in Proceedings of
Natural Language Processing Pacific Rim Symposium, 2001, pp. 393–399.
[4] S. Brin and L. Page, The anatomy of a large-scale hypertextual Web search
engine, in Proceedings of 7th International World Wide Web (WWW)
Conference, 1998, pp. 107–117.
[5] C.-H. Chang, M. Kayed, M. R. Girgis and K. Shaalan, A survey of Web
information extraction systems, IEEE Transactions on Knowledge and Data
Engineering, Vol. 18, No. 10, 2006, 1411–1428.
[6] J.-T. Chien, Online hierarchical transformation of hidden Markov models
for speech recognition, IEEE Transactions on Speech and Audio Processing,
Vol. 7, No. 6, 1999, 656–667.
[7] A. P. Dempster, N. M. Laird and D. B. Rubin, Maximum likelihood from
incomplete data via the EM algorithm, Journal of the Royal Statistical
Society, Ser. B, Vol. 39, 1977, 1–38.
[8] L.-M. Fu, Incremental knowledge acquisition in supervised learning
networks, IEEE Trans. on Systems, Man, and Cybernetics — Part A:
Systems and Humans, Vol. 26, No. 6, 1996, 801–809.
[9] P. Fung and L.-Y. Yee, An IR approach for translating new words from
nonparallel, comparable texts, in Proceedings of 17th International
Conference on Computational Linguistics (COLING) and 36th Annual
Meeting of the Association for Computational Linguistics (ACL), 1998,
pp. 414–420.
[10] C. Giraud-Carrier, A note on the utility of incremental learning, AI
Communications, 13(4), 2000, 215–223.
[11] Y. Gotoh, M. M. Hochberg and H. F. Silverman, Efficient training algorithms
for HMMs using incremental estimation, IEEE Transactions on Speech
and Audio Processing, Vol. 6, No. 6, 1998, 539–547.
[12] F. Huang, Y. Zhang and S. Vogel, Mining key phrase translations from
Web corpora, in Proceedings of Human Language Technology–Empirical
Methods on Natural Language Processing, 2005, pp. 483–490.
[13] L. Jiang, M. Zhou, L.-F. Chien and C. Niu, Named entity translation with
Web mining and transliteration, in Proceedings of International Joint
Conferences on Artificial Intelligence, 2007, pp. 1629–1634.
[14] F. Jelinek, Self-organized language modeling for speech recognition,
Readings in Speech Recognition, Morgan Kaufmann, 1999, pp. 450–506.
[15] K. Knight and J. Graehl, Machine transliteration, Computational Linguistics,
Vol. 24, No. 4, 1998, 599–612.
[16] J.-S. Kuo, H. Li and Y.-K. Yang, Learning transliteration lexicons from
the Web, in Proceedings of 44th Annual Meeting of the Association for
Computational Linguistics, 2006, pp. 1129–1136.
[17] J.-S. Kuo, H. Li and Y.-K. Yang, A phonetic similarity model for automatic
extraction of transliteration pairs, ACM Transactions on Asian Language
Information Processing, 6(2), 2007, 1–24.
[18] H. Li, M. Zhang and J. Su, A joint source channel model for machine
transliteration, in Proceedings of 42nd Annual Meeting of the Association
for Computational Linguistics, 2004, pp. 159–166.
[19] W. Lam, R.-Z. Huang and P.-S. Cheung, Learning phonetic similarity
for matching named entity translations and mining new translations, in
Proceedings of 27th ACM Special Interest Group on Information Retrieval,
2004, pp. 289–296.
[20] C.-J. Lee and J.-S. Chang, Acquisition of English-Chinese transliterated word
pairs from parallel-aligned texts using a statistical machine transliteration
model, in Proceedings of HLT-NAACL Workshop Data Driven MT and
Beyond, 2003, pp. 96–103.
[21] W.-H. Lu, L.-F. Chien and H.-J. Lee, Translation of Web queries using
anchor text mining, ACM Transactions on Asian Language and Information
Processing, 1(2), 2002, 159–172.
[22] D. J. C. MacKay and L. Peto, A hierarchical Dirichlet language model,
Natural Language Engineering, Vol. 1, No. 3, 1994, 1–19.
[23] H. M. Meng, W.-K. Lo, B. Chen and T. Tang, Generating phonetic cognates
to handle named entities in English-Chinese cross-language spoken document
retrieval, in Proceedings of Workshop on Automatic Speech Recognition
and Understanding, 2001, pp. 311–314.
[24] C. S. Myers and L. R. Rabiner, A comparative study of several dynamic
time-warping algorithms for connected word recognition, The Bell System
Technical Journal, 60(7), 1981, 1389–1409.
[25] J.-Y. Nie, P. Isabelle, M. Simard and R. Durand, Cross-language information
retrieval based on parallel texts and automatic mining of parallel text from
the Web, in Proceedings of 22nd ACM Special Interest Group on Information
Retrieval, 1999, pp. 74–81.
[26] R. Rapp, Automatic identification of word translations from unrelated
English and German corpora, in Proceedings of 37th Annual Meeting of
the Association for Computational Linguistics, 1999, pp. 519–526.
[27] R. Sproat, T. Tao and C. Zhai, Named entity transliteration with comparable
corpora, in Proceedings of 44th Annual Meeting of the Association for
Computational Linguistics, 2006, pp. 73–80.
[28] P. Virga and S. Khudanpur, Transliteration of proper names in cross-lingual
information retrieval, in Proceedings of 41st ACL Workshop on Multilingual
and Mixed Language Named Entity Recognition, 2003, pp. 57–64.
[29] S. Wan and C. M. Verspoor, Automatic English-Chinese name transliteration
for development of multilingual resources, in Proceedings of 17th COLING
and 36th ACL, 1998, pp. 1352–1356.
[30] G. Zavaliagkos, R. Schwartz, and J. Makhoul, Batch, incremental and
instantaneous adaptation techniques for speech recognition, in Proceedings of
International Conference on Acoustics, Speech and Signal Processing, 1995, pp. 676–679.