Mining Live Transliterations
Using Incremental Learning Algorithms
HAIZHOU LI
Institute for Infocomm Research, Singapore 138632
hli@i2r.a-star.edu.sg
JIN-SHEA KUO
Chung-Hwa Telecom, Taoyuan 320, Taiwan
d8807302@gmail.com
JIAN SU
Institute for Infocomm Research, Singapore 138632
sujian@i2r.a-star.edu.sg
CHIH-LUNG LIN
Chung Yuan Christian University, Taoyuan 320, Taiwan
linclr@gmail.com
We study an adaptive learning framework for phonetic similarity modeling (PSM)
that supports the automatic acquisition of transliterations by exploiting minimum
prior knowledge about machine transliteration to mine transliterations incrementally
from the Web. We formulate an incremental learning strategy for the framework
based on Bayesian theory for PSM adaptation. The idea of incremental learning
is to benefit from the continuously developing history to update a static model
towards the intended reality. In this way, the learning process refines the PSM
incrementally while constructing a transliteration lexicon at the same time. We
further demonstrate that the proposed learning framework is reliably effective in
mining live transliterations from Web query results.
Keywords: Web mining; Incremental learning; Unsupervised learning; Transliteration
extraction; Machine transliteration; Machine translation; Machine learning; Phonetic
similarity model; Query results; Learning transliterations.
International Journal of Computer Processing of Languages
Vol. 21, No. 2 (2008) 183–203
© World Scientific Publishing Company
1. Introduction
Transliteration is a process of rewriting a word in one language into another
by preserving its pronunciation in its original language, which is also known
as translation-by-sound. It usually takes place between languages with different
scripts, for example, from English to Chinese, and for words, such as proper
nouns, that do not have “easy” or semantic translations.
The increasing size of multilingual content on the Web has made it a live
information source rich in transliterations. Research on automatic acquisition
of transliteration pairs in batch mode has shown promising results [16, 17]. In
dealing with the dynamic growth of the Web, it is almost impossible to collect
and store all its contents in local storage. Therefore, there is a need to develop
an incremental learning algorithm to mine transliterations in an on-line manner.
In general, an incremental learning technique is designed for adapting a model
towards a changing environment. In this paper, we are interested in deducing an
incremental learning method for automatically constructing an English-Chinese
(E-C) transliteration lexicon from Web query results.
In the deduction, we start with a phonetic similarity model (PSM), which
measures the phonetic similarity between words in two different scripts, and
study the learning mechanism of PSM in both batch and incremental modes.
Then, we conduct a series of experiments to evaluate the performance of these
two kinds of learning algorithms. In this way, the contributions of this paper
include: (i) the formulation of a batch learning framework and an incremental
learning framework for PSM learning; (ii) a comparative study of the batch and
incremental unsupervised learning strategies.
In this paper, Section 2 briefly introduces prior work related to machine
transliteration. In Section 3, we formulate the PSM and its batch and incremental
learning algorithms, while in Section 4 we discuss the practical issues in
implementation. Section 5 reports on the conducted experiments and, finally,
we conclude in Section 6.
2. Related Work
Much of the research on extraction of transliterations has been motivated by
information retrieval techniques, where attempts at extracting transliteration pairs
from large bodies of corpora have been made. Some have proposed to extract
translations from parallel or comparable bitexts using co-occurrence analysis or a
context-vector approach [9, 25]. These methods compare the semantic similarities
between source and target words without taking their phonetic similarities into
account.
Another direction of research focuses on establishing the phonetic
relationship between transliteration pairs. This typically involves the encoding
of phoneme- or grapheme-based mapping rules using a generative model that
is trained from a large bilingual lexicon. Most of these works are devoted to
the phoneme-based approach [15, 23, 27, 28, 29]. Suppose that EW and CW
form an E-C transliteration pair. The phoneme-based approach first converts EW
into an intermediate phonemic representation and then converts the phonemic
representation into its Chinese counterpart CW. On the other hand, the grapheme-
based approach, also known as direct orthographical mapping [18], treats
transliteration as a statistical machine translation problem under monotonic
constraints and has also achieved promising results.
Many efforts have also been channeled to tapping the wealth of the Web for
harvesting transliteration/translation pairs. These include studying the query logs
[3], unrelated corpora [26], parallel [20] and comparable corpora [12, 27]. To
establish correspondence, these algorithms usually rely on one or more statistical
clues [19], such as the correlation between word frequencies, and cognates of
similar spelling or pronunciations. In doing so, two things are needed: first,
a robust mechanism that establishes statistical relationships between bilingual
words, such as a phonetic similarity model which is motivated by transliteration
modeling research; and second, an effective learning framework that is able to
adaptively discover new events from the Web.
In Chinese/Japanese/Korean (CJK) Web pages, translated or transliterated
terms are frequently accompanied by their original Latin words, with the Latin
words serving as the appositives of the CJK words. In other words, the E-C pairs
are always closely collocated. Inspired by this observation in CJK texts, some
algorithms were proposed [17] to search over the close context of an English
word in a Chinese predominant bilingual snippet for transliteration.
Unfortunately, many of the reported works have not addressed practical
issues that concern learning transliterations from the growing Web, such as
incremental learning of transliteration lexicons. Incremental learning algorithms
are suitable for working in a changing environment and have been used in other
applications such as speech recognition to adapt model parameters to different
speakers successfully [6, 11]. In this paper, we study the learning framework of
the phonetic similarity model, which adopts a transliteration modeling approach
for transliteration extraction from the Web in an incremental manner.
3. Phonetic Similarity Model
Phonetic similarity model (PSM) is a probabilistic model that encodes the
syllable mapping between E-C pairs. Let ES = {e1, …, em, …, eM} be a sequence of
English syllables derived from EW and CS = {s1, …, sn, …, sN} be the sequence of
Chinese syllables derived from CW, which is represented by a Chinese character string
CW = w1, …, wn, …, wN. EW and CW form a transliteration pair. Each English
syllable is drawn from a vocabulary of I entries, em ∈ {x1, …, xI}, and each
Chinese syllable from a vocabulary of J entries, sn ∈ {y1, …, yJ}.
The E-C transliteration can be considered a generative process formulated by
the noisy channel model, which recovers the input CW from the observed output
EW. Applying the Bayes rule, we have Eq. (1), where P(EW|CW) is estimated to
characterize the noisy channel, known as the transliteration probability, and P(CW)
is a language model that characterizes the source language:

P(CW|EW) = P(EW|CW)P(CW)/P(EW). (1)

Following the translation-by-sound principle, P(EW|CW) can be approximated
by the phonetic probability P(ES|CS), which is given by Eq. (2):

P(ES|CS) = max_{Δ∈Γ} P(ES, Δ|CS), (2)
where Γ is the set of all possible alignment paths between ES and CS. To find
the best alignment path ∆, one can resort to a dynamic warping algorithm
[24]. Assuming conditional independence of syllables in ES and CS, we have
P(ES|CS) = ∏_{k=1}^{K} P(e_{m_k}|s_{n_k}), where k is the index of alignment, n_K = N and
m_K = M. Note that, typically, we have N ≤ M due to syllable elision [17]. With
the phonetic approximation, Eq. (1) can be rewritten as Eq. (3):

P(CW|EW) ≈ P(ES|CS)P(CW)/P(EW). (3)
The language model P(CW) in Eq. (3) can be represented by the n-gram
statistics of the Chinese characters derived from a monolingual corpus. Using
bigrams to approximate the n-gram model, we have Eq. (4):

P(CW) ≈ P(w_1) ∏_{n=2}^{N} P(w_n|w_{n-1}). (4)

Note that P(EW) is not a function of CW; therefore, it can be dropped
from Eq. (3) in finding the maximum CW. A PSM model, denoted as Θ, for
finding the CW with the highest probability now consists of both the P(ES|CS) and P(CW)
parameters. We now look into the mathematical formulation for learning the
P(ES|CS) parameters from a bilingual transliteration lexicon.
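To make the scoring concrete, the following is a minimal Python sketch of Eqs. (2)-(4): a dynamic-programming search for the best monotonic alignment between the two syllable sequences, combined with a character bigram language model. The data layout (a dictionary psm mapping syllable pairs to P(e|s), plus unigram and bigram dictionaries) is our own illustrative assumption, not the authors' implementation.

import math

def phonetic_prob(es, cs, psm, floor=1e-6):
    """Approximate P(ES|CS), Eq. (2): best alignment path between the English
    syllables es and the Chinese syllables cs, scored by summed log P(e_mk|s_nk).
    Monotonic many-to-one steps model syllable elision (N <= M)."""
    M, N = len(es), len(cs)
    NEG = float("-inf")
    dp = [[NEG] * (N + 1) for _ in range(M + 1)]
    dp[0][0] = 0.0
    for m in range(1, M + 1):
        for n in range(1, N + 1):
            step = math.log(psm.get((es[m - 1], cs[n - 1]), floor))
            # diagonal: start a new Chinese syllable; vertical: keep absorbing
            # further English syllables into the current Chinese syllable
            dp[m][n] = max(dp[m - 1][n - 1], dp[m - 1][n]) + step
    return dp[M][N]

def lm_logprob(cw, unigram, bigram, floor=1e-6):
    """Bigram approximation of P(CW) over the Chinese character string, Eq. (4)."""
    lp = math.log(unigram.get(cw[0], floor))
    for prev, cur in zip(cw, cw[1:]):
        lp += math.log(bigram.get((prev, cur), floor))
    return lp

def psm_score(es, cs, cw, psm, unigram, bigram):
    """log P(CW|EW) up to a constant, Eq. (3): log P(ES|CS) + log P(CW)."""
    return phonetic_prob(es, cs, psm) + lm_logprob(cw, unigram, bigram)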
3.1. Batch learning of PSM
A collection of manually selected or automatically extracted E-C pairs can form
a transliteration lexicon. Given such a lexicon for training, the PSM parameters
can be estimated in a batch mode. An initial PSM is bootstrapped using limited
prior knowledge such as a small amount of transliterations, which may be
obtained by exploiting co-occurrence information [27]. Then we align the E-C
pairs using the PSM Θ and derive syllable mapping statistics.
Suppose that we have the event counts c_{i,j} = count(e_m = x_i, s_n = y_j) and
c_j = count(s_n = y_j) for a given transliteration lexicon D with alignments Λ. We
would like to find the parameters p_{i|j} = P(e_m = x_i | s_n = y_j, Λ) that
maximize the probability in Eq. (5),

P(D, Λ|Θ) = ∏_Λ P(e_m|s_n) = ∏_j ∏_i p_{i|j}^{c_{i,j}}, (5)

where Θ = {p_{i|j}, i = 1, …, I, j = 1, …, J}, with the maximum likelihood estimation
(MLE) criterion, subject to the constraints Σ_i p_{i|j} = 1, ∀j. Rewriting Eq. (5)
in log-likelihood (LL) results in Eq. (6),

LL(D, Λ|Θ) = Σ_Λ log P(e_m|s_n) = Σ_j Σ_i c_{i,j} log p_{i|j}. (6)

It is the cross-entropy of the true data distribution c_{i,j} with regard
to the PSM model. Given an alignment Λ, the MLE estimate of the PSM is:

p_{i|j} = c_{i,j}/c_j. (7)
With a new PSM model, one is able to arrive at a new alignment. This is
formulated as an expectation-maximization (EM) process [7], which assumes that
there exists a mapping D → Λ, where Λ is introduced as the latent information,
also known as missing data in the EM literature. The EM algorithm maximizes the
likelihood probability P(D|Θ) over Θ by exploiting P(D|Θ) = Σ_Λ P(D, Λ|Θ).
The EM process guarantees a non-decreasing likelihood probability P(D|Θ)
through multiple EM steps until it converges. In the E-step, we derive the event
counts c_{i,j} and c_j by force-aligning all the E-C pairs in the training lexicon D
using a PSM model. In the M-step, we estimate the PSM parameters Θ by
Eq. (7). The EM process also serves as a refining process to obtain the best
alignment between the E-C syllables. In each EM cycle, the model is updated
after observing the whole corpus D. An EM cycle is also called an iteration in
batch learning. The batch learning process is described as follows and depicted
in Figure 1.
00186.indd 187 12/10/2008 3:50:28 PM
188 Haizhou Li, Jin-Shea Kuo, Jian Su and Chin-Lung Lin
Batch learning algorithm:
Start: Bootstrap the PSM parameters p_{i|j} using prior phonetic mapping knowledge;
E-Step: Force-align the corpus D using p_{i|j} to obtain Λ and hence the counts c_{i,j} and c_j;
M-Step: Re-estimate p_{i|j} = c_{i,j}/c_j using the counts from the E-Step;
Iterate: Repeat the E-Step and M-Step until P(D|Θ) converges.
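The loop below is a compact sketch of this batch EM procedure. It assumes a helper force_align(es, cs, psm) that returns the best alignment of one pair as a list of (English syllable, Chinese syllable) links, e.g. obtained by backtracking the dynamic-programming table of the earlier sketch; the helper and the data layout are hypothetical, not taken from the paper.

from collections import defaultdict

def batch_em(lexicon, psm, force_align, iterations=6):
    """lexicon: list of (english_syllables, chinese_syllables) training pairs.
    psm: dict mapping (e, s) -> P(e|s), bootstrapped from prior knowledge."""
    for _ in range(iterations):
        pair_counts = defaultdict(float)    # c_{i,j}
        target_counts = defaultdict(float)  # c_j
        # E-step: force-align every pair with the current model and count events
        for es, cs in lexicon:
            for e, s in force_align(es, cs, psm):
                pair_counts[(e, s)] += 1.0
                target_counts[s] += 1.0
        # M-step: MLE re-estimation, Eq. (7): p_{i|j} = c_{i,j} / c_j
        psm = {(e, s): c / target_counts[s] for (e, s), c in pair_counts.items()}
    return psm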
3.2. Incremental learning of PSM
In batch learning all the training samples have to be collected in advance. In
a dynamically changing environment, such as the Web, new samples always
appear and it is impossible to collect all of them. Incremental learning [8, 10,
11, 30] is devised to achieve rapid adaptation towards the working environment
by updating the model as learning samples arrive in sequence. It is believed that
if the statistics for the E-step are incrementally collected and the parameters are
frequently estimated, incremental learning converges more quickly because the
information from the new data contributes to the parameter estimation more
effectively than the batch algorithm does [11]. In incremental learning, the model
is typically updated progressively as the training samples become available and
the number of incremental samples may vary from as few as one to as many
as they are available. In the extreme case where all the learning samples are
available in advance and the updating is done after observing all of them,
incremental learning becomes batch learning. Therefore, batch learning can be
considered as a special case of incremental learning.
Figure 1. Batch learning of PSM.
The incremental learning can be formulated through the maximum a posteriori
(MAP) framework, also known as Bayesian learning, where we assume that the
parameters Θ are random variables subject to a prior distribution. A possible
candidate for the prior distribution of p_{i|j} is the Dirichlet density over each of
the parameters p_{i|j} [2]. Letting Θ_j = {p_{i|j}, i = 1, …, I}, we introduce

P(Θ_j) ∝ ∏_i p_{i|j}^{α h_{i|j} - 1}, ∀j, (8)

where Σ_i h_{i|j} = 1 and α, which can be empirically set, is a positive scalar. Assuming
H is a set of hyperparameters, we have as many hyperparameters h_{i|j} ∈ H as the
parameters p_{i|j}. The probability of generating the aligned transliteration lexicon is
obtained by integrating over the parameter space, P(D) = ∫ P(D|Θ)P(Θ) dΘ.
This integration can be easily written down in a closed form due to the
conjugacy between the Dirichlet distribution ∏_i p_{i|j}^{α h_{i|j} - 1} and the multinomial
distribution ∏_i p_{i|j}^{c_{i,j}}. Instead of finding the Θ that maximizes P(D|Θ) with MLE,
we maximize the a posteriori (MAP) probability as follows:

Θ_MAP = argmax_Θ P(Θ|D) = argmax_Θ P(D|Θ)P(Θ)/P(D) = argmax_Θ P(D|Θ)P(Θ). (9)

The MAP solution uses a distribution to model the uncertainty of the
parameter Θ, while the MLE gives a point estimation [14, 22]. We rewrite
Eq. (9) as Eq. (10) using Eq. (5) and Eq. (8):

Θ_MAP ≈ argmax_Θ ∏_j ∏_i p_{i|j}^{c_{i,j} + α h_{i|j} - 1}. (10)
Equation (10) can be seen as a Dirichlet function of Θ given H, or a
multinomial function of H given Θ. With a given prior H, the MAP estimation is
therefore similar to the MLE problem, which is to find the mode of the kernel
density in Eq. (10). In MAP estimation, the PSM parameters can be represented
as in Eq. (11),

p_{i|j} = λ h_{i|j} + (1 - λ) f_{i|j}, (11)

where f_{i|j} = c_{i,j}/c_j, c_j = Σ_i c_{i,j}, and λ = α/(α + c_j).
One can find that λ serves as a weighting factor between the prior and the
current observations. The difference between the MLE and MAP strategies lies in the
fact that MAP introduces prior knowledge into the parameter updating formula.
Equation (11) assumes that the prior parameters H are known and static while
the training samples are available all at once.
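As a small illustration, the MAP update of Eq. (11) could be written as follows, assuming the counts c_{i,j}, the totals c_j and the hyperparameters h_{i|j} are kept in dictionaries; the interface is ours, not the paper's.

def map_update(pair_counts, target_counts, hyper, alpha):
    """Eq. (11): p_{i|j} = lambda * h_{i|j} + (1 - lambda) * f_{i|j},
    with lambda = alpha / (alpha + c_j) and f_{i|j} = c_{i,j} / c_j."""
    psm = {}
    for (e, s), c in pair_counts.items():
        lam = alpha / (alpha + target_counts[s])   # weighting between prior and data
        f = c / target_counts[s]                   # relative frequency f_{i|j}
        psm[(e, s)] = lam * hyper.get((e, s), 0.0) + (1.0 - lam) * f
    return psm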
The idea of incremental learning is to benefit from the continuously
developing history to update the static model towards the intended reality.
As is often the case, the Web query results in an on-line application arrive in
sequence. It is of practical use to devise such an incremental mechanism that
adapts both parameters and the prior knowledge over time. The quasi-Bayesian
(QB) learning method offers a solution to it [1].
Let us break up a training corpus D into a sequence of sample subsets
D = {D_1, D_2, …, D_T} and denote an accumulated sample subset D^(t) =
{D_1, D_2, …, D_t}, 1 ≤ t ≤ T, as an incremental corpus. Therefore, we have
D = D^(T). The QB method approximates the posterior probability P(Θ|D^(t-1)) by
the closest tractable prior density P(Θ|H^(t-1)), with H^(t-1) evolved from the historical
corpus D^(t-1), as shown in Eq. (12).
Θ_QB^(t) = argmax_Θ P(Θ|D^(t))
         = argmax_Θ P(D_t|Θ)P(Θ|D^(t-1))
         ≈ argmax_Θ ∏_j ∏_i p_{i|j}^{c_{i,j}^(t) + α h_{i|j}^(t-1) - 1}. (12)
QB estimation offers a recursive learning mechanism. Starting with a
hyperparameter set H^(0) and a corpus subset D_1, we estimate H^(1) and Θ_QB^(1),
then H^(2) and Θ_QB^(2), and so on until H^(t) and Θ_QB^(t), as the observed samples arrive in
sequence. The updating of parameters can be iterated between the reproducible
prior and posterior estimates as in Eq. (13) and Eq. (14). Assuming T → ∞, we
have the following algorithm:
Incremental learning algorithm:
Start: Bootstrap Θ_QB^(0) and H^(0) using prior phonetic mapping knowledge and
set t = 1;
E-Step: Force-align the corpus subset D_t using Θ_QB^(t-1), compute the event counts
c_{i,j}^(t), and reproduce the prior parameters H^(t-1) → H^(t) by Eq. (13):

h_{i|j}^(t) = h_{i|j}^(t-1) + c_{i,j}^(t)/α. (13)

M-Step: Re-estimate the parameters H^(t) → Θ_QB^(t), i.e. p_{i|j}^(t), using the counts from
the E-Step by Eq. (14):

p_{i|j}^(t) = h_{i|j}^(t) / Σ_i h_{i|j}^(t). (14)
EM cycle: Repeat the E-Step and M-Step until P(Θ|D^(t)) converges.
Iterate: Repeat T EM cycles covering the entire data set D in an iteration.
The algorithm updates the PSM model as training samples become available.
The scalar factor α can be seen as a forgetting factor: when α is large, the updating
of the hyperparameters favors the prior; otherwise, the current observations are given
more attention. As for the sample subset size |D_t|, if we set |D_t| = 100, each
EM cycle updates Θ after observing every 100 samples. To be comparable with
batch learning, we define an iteration here to be a sequence of EM cycles that
covers the whole corpus D. If the corpus D has a fixed size, D = D^(T), an iteration
means T EM cycles in incremental learning. The iteration can then be repeated
just as in batch learning.
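The sketch below folds Eqs. (13) and (14) into one quasi-Bayesian EM cycle over a newly arrived subset D_t, reusing the hypothetical force_align helper of the batch sketch; as before, the interfaces are illustrative assumptions rather than the authors' code.

from collections import defaultdict

def qb_em_cycle(subset, psm, hyper, force_align, alpha=0.5):
    """subset: the sample subset D_t, a list of (ES, CS) syllable pairs.
    hyper: dict holding the hyperparameters h_{i|j} (kept unnormalized here)."""
    pair_counts = defaultdict(float)
    # E-step: align the new subset with the current model and count events c^{(t)}_{i,j}
    for es, cs in subset:
        for e, s in force_align(es, cs, psm):
            pair_counts[(e, s)] += 1.0
    # Eq. (13): reproduce the prior, h^{(t)}_{i|j} = h^{(t-1)}_{i|j} + c^{(t)}_{i,j} / alpha
    for key, c in pair_counts.items():
        hyper[key] = hyper.get(key, 0.0) + c / alpha
    # Eq. (14): M-step, p_{i|j} = h_{i|j} / sum_i h_{i|j}
    norm = defaultdict(float)
    for (e, s), h in hyper.items():
        norm[s] += h
    psm = {(e, s): h / norm[s] for (e, s), h in hyper.items()}
    return psm, hyper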
4. Mining Transliterations from the Web
Since the Web is dynamically changing and new transliterations come out all
the time, it is better to mine transliterations from the Web in an incremental
way. Words transliterated by closely observing common guidelines are referred
to as regular transliterations. However, in Web publishing, translators in
different regions may not observe the same guidelines. Sometimes they skew
the transliterations in different ways to introduce semantic implications, also
known as wordplay, resulting in casual transliterations. Casual transliteration
leads to multiple Chinese transliteration variants for the same English word.
For example, “Disney” may be transliterated into “迪士尼/Di-Shi-Ni/”,ᵃ “迪斯奈
/Di-Si-Nai/” and “狄斯耐/Di-Si-Nai/”.
Suppose that a sufficiently large, manually validated transliteration lexicon
is available, a PSM can be built in a supervised manner. However, this method
hinges on the availability of such a lexicon. A large transliteration lexicon is
not always available as it is labor-intensive to produce. Even if one is available,
the derived model can only be as good as what the training lexicon offers. New
transliterations, such as casual ones, may not be well handled. It is desirable to
adapt the PSM as new transliterations become available. This is also referred to
as the learning-at-work mechanism. Some solutions have been proposed recently
along this direction [16]. However, the effort was mainly devoted to mitigating
the need of manual labeling. A dynamic learning-at-work mechanism for mining
transliterations has not been well studied.
ᵃ The Chinese words are romanized in hanyu pinyin.
Here we are interested in an unsupervised learning process, in which we
adapt the PSM as we extract transliterations. The learning-at-work framework
is illustrated in Figure 2. As opposed to a manually labeled training corpus
used in Figure 1, we insert into the EM process an automatic transliteration
extraction mechanism, search and rank, as shown in the left panel of
Figure 2. The search and rank shortlists a set of transliterations from the Web
query results or bilingual snippets.
4.1. Search and rank
We obtain bilingual snippets from the Web by iteratively submitting queries to
the Web search engines [4]. Qualified sentences are extracted from the results
of each query. Each qualified sentence has at least one English word.
Given a qualified sentence, we first denote the competing Chinese trans-
literation candidates as a set Ω, from which we would like to pick the most likely
one. Second, we would like to know if there is indeed a Chinese transliteration
CW in the close context of the English word EW.
We propose ranking the candidates by Eq. (15), using the PSM model to find
the most likely CW for a given EW. The CW candidate that gives the highest
posterior probability is considered the most probable candidate CW′:

CW′ = argmax_{CW∈Ω} P(CW|EW) = argmax_{CW∈Ω} P(ES|CS)P(CW). (15)
Figure 2. Diagram of unsupervised transliteration extraction — learning-at-work.
The next step is to examine if CW ′ and EW indeed form a genuine E-C
pair. We define the confidence of the E-C pair as the posterior odds similar to
that in a hypothesis test under the Bayesian interpretation. We have H0, which
hypothesizes that CW and EW form an E-C pair, and H1, which hypothesizes
otherwise, and use posterior odds σ [17] for hypothesis tests.
Our search and rank formulation can be seen as an extension to a prior
work [3]. The posterior odds σ are used as the confidence score so that E-C pairs
extracted from different contexts can be directly compared. In practice, we set
a threshold on σ as the cutoff point for short-listing E-C pairs. In this
way, the search and rank is able to retrieve a collection of quasi transliterations
from the Web given a PSM.
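A rough sketch of search and rank is given below: the best candidate is selected by Eq. (15), reusing the psm_score function from the earlier sketch, and the pair is kept only if a confidence score passes a threshold. The posterior odds of [17] are treated here as a given function, and candidate generation and syllabification are placeholders; none of these names come from the paper.

def search_and_rank(ew, candidates, syllabify_en, syllabify_cn,
                    psm, unigram, bigram, posterior_odds, threshold):
    """candidates: the set Omega of Chinese strings found near EW in a snippet."""
    es = syllabify_en(ew)
    best_cw, best_score = None, float("-inf")
    for cw in candidates:
        # Eq. (15): rank by log P(ES|CS) + log P(CW)
        score = psm_score(es, syllabify_cn(cw), cw, psm, unigram, bigram)
        if score > best_score:
            best_cw, best_score = cw, score
    # hypothesis test: accept only if the posterior odds sigma pass the cutoff
    if best_cw is not None and posterior_odds(ew, best_cw) >= threshold:
        return (ew, best_cw)   # a quasi transliteration pair
    return None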
4.2. Unsupervised learning strategy
Now we can carry out PSM learning as formulated in Section 3 using the
quasi transliterations as if they were manually validated. By unsupervised batch
learning, we mean to re-estimate the PSM model after search and rank over
the whole database, i.e., in each iteration. Just as in supervised learning, one
can expect the PSM performance to improve over multiple iterations. We report
the F-measure at each iteration. The extracted transliterations also form a new
training corpus for the next iteration.
In contrast to the batch learning, incremental learning updates the PSM
parameters as the training samples arrive in sequence. This is especially useful
in Web mining. With the QB incremental optimization, one can think of an
EM process that continuously re-estimates PSM parameters as the Web crawler
discovers new “territories”. In this way, the search and rank process gathers
qualified training samples Dt after crawling a portion of the Web. Note that the
incremental EM process updates parameters more often than batch learning does.
To evaluate the performance of both learning methods, we define an iteration to be
T EM cycles of incremental learning on a training corpus D = D^(T), as discussed
in Section 3.2.
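To show how the pieces fit together, here is a hypothetical version of the learning-at-work loop of Figure 2, in which quasi pairs are accumulated from crawled sentences and the PSM is re-estimated after every |D_t| samples. The crawl() generator and the find_pair() wrapper (which would combine candidate extraction, the search-and-rank step and syllabification) are assumptions standing in for components the paper does not spell out.

def learn_at_work(crawl, psm, hyper, find_pair, qb_em_cycle_fn, subset_size=100):
    """crawl(): yields qualified sentences from Web query results.
    find_pair(sentence, psm): returns an aligned (ES, CS) syllable pair or None."""
    subset, lexicon = [], []
    for sentence in crawl():
        pair = find_pair(sentence, psm)
        if pair is None:
            continue
        subset.append(pair)
        lexicon.append(pair)   # in practice the word-level E-C pair is recorded too
        if len(subset) >= subset_size:
            # one quasi-Bayesian EM cycle per |D_t| newly extracted samples
            psm, hyper = qb_em_cycle_fn(subset, psm, hyper)
            subset = []
    return psm, lexicon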
4.3. Initializing PSM with context-based model
In incremental learning, the quality of the initial PSM Θ_QB^(0) and the hyperparameter set
H^(0) has an impact on the overall learning process. The initial model is defined
based on prior knowledge about transliteration. There are different ways to
bootstrap the initial model, for example, by using phonetic clues or syntactic
clues. Since the PSM is phonetically-motivated, we expect that it can be boosted
by live transliterations that are extracted from a context-based model.
Contextual information has provided an important clue in extracting
translation terms [12, 21] and in information extraction for generating repetitive
patterns or templates [5], and the same holds for the extraction of transliterations. We also
conduct an inquiry into a transliteration lexicon of 8,898 unique, manually
validated transliteration pairs, which we will discuss later in the experiments.
The study reveals that about 56% of them are singletons, which are observed
only once in the corpus, as shown in the count-counts distribution in
Figure 3. Context-based models have reportedly been effective [9] in extracting
high frequency pairs — non-singletons. The challenge is that these non-singletons
are not always phonetically transliterated because some of them are semantic
translations. The idea is to use a context-based model to extract a set of quasi
transliteration pairs (QTPs), which are statistically acquired without manual
validation, and then validate the QTPs by using an initial PSM. This process is
intended to improve the initial PSM and hence the subsequent performance of
transliteration extraction.
A simple statistical model for extracting transliteration pairs using a co-occurrence
model, shown in Eq. (16), is proposed. By applying mutual information
(MI) as the underlying co-occurrence model, we have

CW′ = argmax_{CW∈Ω} MI(CW, EW)
    = argmax_{CW∈Ω} P(CW, EW) log [P(CW, EW) / (P(CW)P(EW))]. (16)
The CW that gives the highest mutual information is considered the most
probable transliteration of EW. The mutual information, as defined in information
theory, represents the information gain about CW in the presence of EW. It is referred to
as the MI context-based model hereafter.
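A minimal sketch of Eq. (16) follows, assuming simple co-occurrence counts collected over the qualified sentences; the counting scheme and the dictionary layout are illustrative only.

import math

def mi_best_candidate(ew, candidates, cooc, count_cn, count_en, total):
    """cooc[(cw, ew)], count_cn[cw], count_en[ew]: raw counts; total: #sentences."""
    best_cw, best_mi = None, float("-inf")
    for cw in candidates:
        joint = cooc.get((cw, ew), 0) / total
        if joint == 0.0:
            continue
        p_cw, p_ew = count_cn[cw] / total, count_en[ew] / total
        mi = joint * math.log(joint / (p_cw * p_ew))   # Eq. (16)
        if mi > best_mi:
            best_cw, best_mi = cw, mi
    return best_cw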
Figure 3. The count-counts distribution of the manually validated transliteration pairs (x-axis: counts, 1 to ≥ 11; y-axis: count-counts in %).
Using the MI context-based model, a set of quasi transliteration pairs can
be acquired. To prepare for the initial PSM, the QTPs can be used to augment
a set of seed transliteration pairs or human-crafted phonetic mapping rules.
5. Experiments
To obtain the ground truth for performance evaluation, each possible transliteration
pair is manually checked based on the following transliteration criteria: (i) if
an EW is partly translated phonetically and partly translated semantically, only
the phonetic transliteration constituent is extracted to form a transliteration pair;
(ii) multiple E-C pairs may appear in one sentence; (iii) an EW can have multiple
valid Chinese transliterations and vice versa. The validation process results in
a collection of qualified E-C pairs, or distinct qualified transliteration pairs
(DQTPs), which form a transliteration lexicon.
To simulate the dynamic Web, we collected a Web corpus, which consists
of about 500 MB of Web pages, referred to as SET1 by [16, 17]. From SET1,
80,094 qualified sentences were automatically extracted, and 8,898 DQTPs were
further selected with manual validation.
To establish a reference for performance benchmarking, we first initialize a
PSM, referred to as seed PSM hereafter, using a random selection of 100 seed
DQTPs. By exploiting the seed PSM on all 8,898 DQTPs, we train a PSM in a
supervised batch mode and improve the PSM on SET1 after each iteration. The
resulting performance in precision, recall and F-measure, which are defined in
Eq. (17), at the 6th iteration is reported in Table 1 and the F-measure is also
shown in Figure 4.
precision = #extracted_DQTPs / #extracted_pairs,
recall = #extracted_DQTPs / #total_DQTPs, (17)
F-measure = 2 × recall × precision / (recall + precision).
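As a quick arithmetic check, the closed-test figures in Table 1 are consistent with these definitions: 2 × 0.663 × 0.834/(0.663 + 0.834) ≈ 0.739.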
We use this closed test (supervised batch learning) as the reference
point for the unsupervised experiments. Next we further implement two PSM
learning strategies, namely unsupervised batch and unsupervised incremental
learning.
Table 1. The performance achieved by supervised batch learning on SET1.

              Precision   Recall   F-measure
Closed-test     0.834     0.663      0.739
5.1. Unsupervised batch learning
We begin with the same seed PSM. However, we use quasi transliterations that
are extracted automatically instead of manually validated DQTPs for training.
Note that the quasi transliterations are extracted and collected at the end of each
iteration. The collected set may differ from one iteration to another. After re-estimating the PSM
model in each iteration, we evaluate the performance on SET1.
Comparing the two batch mode learning strategies in Figure 4, it is observed
that learning substantially improves the seed PSM after the first iteration. Without
surprise, the supervised learning consistently outperforms the unsupervised one,
which reaches a plateau at 0.679 F-measure. This performance is considered the
baseline for comparison in this paper. The unsupervised batch learning presented
here is similar to that in [16].
5.2. Unsupervised incremental learning
We now formulate an on-lineᵇ unsupervised incremental learning algorithm:
(i) Start with the seed PSM, set t = 1;
(ii) Extract |D_t| quasi transliteration pairs, followed by the E-Step of the incremental
learning algorithm;
(iii) Re-estimate the PSM using D_t (M-Step), set t = t + 1;
(iv) Repeat (ii) and (iii) to crawl over a corpus.
Figure 4. Comparison of F-measure over iterations 1–6 (U-Incremental: Unsupervised Incremental); curves: Supervised Batch, Unsupervised Batch, U-Incremental (100), U-Incremental (5,000).
ᵇ In an actual on-line environment, we are not supposed to store documents, thus no iteration can take place.
To simulate the on-line incremental learning just described, we train and
test on SET1 because of the availability of the gold standard and for comparison with
the performance of the batch mode. We empirically set α = 0.5 and study different |D_t|
settings. An iteration is defined as multiple cycles of steps (ii)–(iii) that screen
through the whole SET1 once. We run multiple iterations.
The performance of incremental learning with |D_t| = 100 and |D_t| = 5000
is reported in Figure 4. It is observed that incremental learning benefits
from more frequent PSM updating. With |D_t| = 100, it not only attains a good
F-measure in the first iteration, but also outperforms unsupervised batch learning
along the EM process. The PSM updating becomes less frequent for larger |D_t|.
When |D_t| is set to be the whole corpus, incremental learning becomes
batch mode learning, as evidenced by |D_t| = 5000, which performs
almost the same as batch mode learning. The experiments in Figure 4 are
considered closed-set tests. Next we move on to the actual online experiment
after exploiting the contextual model.
5.3. Initializing PSM with a context-based model
Contextual information has been used in extracting semantic translation terms.
However, algorithms exploiting contextual information often fail to extract terms
of low frequency [13]. To prepare an initial list of quality transliteration pairs to
enhance the initial PSM, quasi transliteration pairs or QTPs extracted by the MI
context-based model need to be further validated. For example, in “Lacroix-設計
師拉克華/She-Ji-Shi-La-Ke-Hua/” and “Hilary-第一夫人希拉蕊/Di-Yi-Fu-Ren-
Xi-La-Rui/”, “Lacroix-拉克華/La-Ke-Hua/” and “Hilary-希拉蕊/Xi-La-Rui/” are
extracted and accepted. However, “Asus-華碩/Hua-Shuo/” is extracted because of its high
frequency, but is disqualified for phonetic irrelevance.
We can enhance the learning algorithm described in Section 5.2 further by
combining both the contextual information and phonetic information to obtain
an initial list of quality transliterations. Step (i) of the algorithm can be rewritten
as follows:
Start with the seed PSM augmented by quasi pairs obtained by the MI
context-based model, set t= 1;
a) First, generate quasi transliteration pairs from the whole corpus by using the
MI context-based model.
b) Second, shortlist phonetically genuine transliteration pairs by using the seed
PSM.
c) Third, derive the initial PSM from the resulting genuine pairs.
With the context-based model, 1,103 high-quality transliteration pairs are
extracted and added for bootstrapping a better initial PSM. In Figure 5, the
performance of the unsupervised batch and that of proposed incremental learning
algorithm bootstrapping with the context-based model (bootstrapping U-Batch
learning and bootstrapping U-Incremental learning for short, respectively)
are reported. The bootstrapping strategy outperforms the unsupervised batch
in general. It benefits from the improved initial PSM. The bootstrapping
U-Incremental learning with Dt = 100 gives better performance than that
with Dt = 5000. This is similar to what we observed in Figure 4. We note that
context-based model does help improve the performance in either unsupervised
batch learning or incremental learning case.
5.4. Learning from the live Web
In practice, it is possible to extract bilingual snippets of interest by repeatedly
submitting queries to the Web. With the learning-at-work mechanism, we can
mine the query results for up-to-date transliterations, as in Figure 6. For example,
by submitting “Mkapa” to search engines, we may get “Mkapa-姆卡帕/Mu-Ka-
Pa/” and, as a by-product, “Dodoma-多多馬/Duo-Duo-Ma/” as well as shown in
Figure 6. In this way, new queries can be generated iteratively, thus new pairs
are discovered. With the promising test on SET1, we are now ready to move
on to a live test.
Figure 5. F-measure over iterations 1–6 using U-Batch (Unsupervised Batch) and U-Incremental learning bootstrapping with the MI context-based model on SET1; curves: Supervised Batch, U-Batch (Seeds), Bootstrapping U-Batch, Bootstrapping U-Incremental (100), Bootstrapping U-Incremental (5,000).
Following the unsupervised incremental learning algorithm, we start the
crawling with the same seed PSM as in Section 5.2. We adapt the PSM model
as every 100 quasi transliterations are extracted, i.e. |D_t| = 100. The crawling
stops after accumulating 67,944 Web pages, with at most 100 snippets per page,
yielding 2,122,026 qualified sentences. We obtain 123,215 distinct E-C
pairs when the crawling stops. For comparison, we also carry out unsupervised
batch learning over the same 2,122,026 qualified sentences in a single iteration.
As the gold standard for this live database is not available, we randomly select
500 quasi transliteration pairs for manual checking of precision and report the
performance in Table 2. A precision of 0.758 by unsupervised batch learning
and a precision of 0.768 by unsupervised incremental learning are reported.
Assuming the same precision in the whole extracted corpus, 51,323 and 94,629
DQTPs can be expected, respectively. It is found that incremental learning is
more effective and productive than batch learning in discovering transliteration
pairs. This finding is consistent with the test results on SET1.
Figure 6. An example of bilingual snippets returned from a search engine (Chinese-English snippets from www.ebaomonthly.com and www.africabuyer.cn mentioning “Benjamin William Mkapa” and “Dodoma”).
Table 2. Comparison between U-Batch learning, U-Incremental learning and the bootstrapping
U-Incremental learning (with the MI model) from the live Web.

                       U-Batch learning   U-Incremental learning   U-Incremental learning (with MI model)
#distinct E-C pairs         67,708               123,215                     123,337
Precision                    0.758                 0.768                       0.774
#expected DQTPs             51,323                94,629                      95,463
Generally, each bilingual snippet in a page of Web query results contains the
submitted query term. Because most of the Web-based search
engines [4] devise algorithms to detect and remove page duplication, most of the
sentences containing the query term in a Web page are different, as shown in
Figure 6. Therefore, we can explore the regularities from the context of the query
term to obtain the translation or transliteration terms. Exploiting this observation,
we further propose an online learning algorithm for mining transliterations by
incorporating the contextual information for handling query results. In this way,
we incorporate the MI model into the transliteration extraction of each page at
every EM cycle. Step (ii) of the learning algorithm described in Section 5.2
can be rewritten as:
a) First, generate quasi transliteration pairs by using the MI context-based model
to process each query result in a page.
b) Second, shortlist phonetically genuine transliteration pairs by using the current
PSM.
c) Third, derive a new PSM from the resulting genuine pairs and the current
PSM.
An experiment using U-Incremental learning with the MI model is conducted in
an online environment. Again, from the same live Web, we obtain 123,337 distinct
E-C pairs. The precision is estimated to be 0.774 by examining 500 randomly
selected quasi transliteration pairs, and hence 95,463 DQTPs can be expected.
As shown in Table 2, the U-Incremental learning with the MI model achieves even
better performance than the U-Incremental learning alone. Summarizing these
observations, we believe that the MI context-based model is helpful in both batch
and incremental learning strategies.
6. Conclusions
We have proposed a learning framework for mining E-C transliterations using
bilingual snippets from a live Web corpus. In this learning-at-work framework,
we formulate the PSM learning method and study strategies for PSM learning
in both batch and incremental manners. The batch mode learning benefits from
multiple iterations for improving performance, while the unsupervised incremental
one, which does not require all the training data to be available in advance,
adapts to the dynamically changing environment easily without compromising the
performance. Unsupervised incremental learning provides a practical and effective
solution for discovering transliterations from the query results, which can be
easily extended to other Web mining applications. It is also found that context-based
knowledge boosts the initial PSM, which improves the overall transliteration
extraction performance. It is suggested that the phonetically-motivated PSM
transliteration extraction model be augmented by the context-based model for
best performance.
For future work, the natural next steps include extending the method to
(i) named entity translation extraction in general; and (ii) extraction of
transliterations from comparable texts in multiple languages.
References
[1] S. Bai and H. Li, Bayesian learning of N-gram statistical language modeling,
in Proceedings of International Conference on Acoustics, Speech and Signal Processing,
2006, pp. 1045–1048.
[2] M. Bacchiani, B. Roark, M. Riley and R. Sproat, MAP adaptation of
stochastic grammars, Computer Speech and Language, 20(1), 2006, 41–68.
[3] E. Brill, G. Kacmarcik and C. Brockett, Automatically harvesting Katakana-
English term pairs from search engine query logs, in Proceedings of
Natural Language Processing Pacific Rim Symposium, 2001, pp. 393–399.
[4] S. Brin and L. Page, The anatomy of a large-scale hypertextual Web search
engine, in Proceedings of 7th International World Wide Web (WWW)
Conference, 1998, pp. 107–117.
[5] C.-H. Chang, M. Kayed, M. R. Girgis and K. Shaalan, A survey of Web
information extraction systems, IEEE Transactions on Knowledge and Data
Engineering, Vol. 18, No. 10, 2006, 1411–1428.
[6] J.-T. Chien, Online hierarchical transformation of hidden Markov models
for speech recognition, IEEE Transactions on Speech and Audio Processing,
Vol. 7, No. 6, 1999, 656–667.
[7] A. P. Dempster, N. M. Laird and D. B. Rubin, Maximum likelihood from
incomplete data via the EM algorithm, Journal of the Royal Statistical
Society, Ser. B, Vol. 39, 1977, 1–38.
[8] L.-M. Fu, Incremental knowledge acquisition in supervised learning
networks, IEEE Trans. on Systems, Man, and Cybernetics — Part A:
Systems and Humans, Vol. 26, No. 6, 1996, 801–809.
[9] P. Fung and L.-Y. Yee, An IR approach for translating new words from
nonparallel, comparable texts, in Proceedings of 17th International
Conference on Computational Linguistics (COLING) and 36th Annual
Meeting of the Association for Computational Linguistics (ACL), 1998,
pp. 414–420.
[10] C. Giraud-Carrier, A note on the utility of incremental learning, AI
Communications, 13(4), 2000, 215–223.
[11] Y. Gotoh, M. M. Hochberg and H. F. Silverman, Efficient training algorithms
for HMMs using incremental estimation, IEEE Transactions on Speech
and Audio Processing, Vol. 6, No. 6, 1998, 539–547.
[12] F. Huang, Y. Zhang and S. Vogel, Mining key phrase translations from
Web corpora, in Proceedings of Human Language Technology–Empirical
Methods on Natural Language Processing, 2005, pp. 483–490.
[13] L. Jiang, M. Zhou, L.-F. Chien and C. Niu, Named entity translation with
Web mining and transliteration, in Proceedings of International Joint
Conferences on Artificial Intelligence, 2007, pp. 1629–1634.
[14] F. Jelinek, Self-organized language modeling for speech recognition,
Readings in Speech Recognition, Morgan Kaufmann, 1999, pp. 450–506.
[15] K. Knight and J. Graehl, Machine transliteration, Computational Linguistics,
Vol. 24, No. 4, 1998, 599–612.
[16] J.-S. Kuo, H. Li and Y.-K. Yang, Learning transliteration lexicons from
the Web, in Proceedings of 44th Annual Meeting of the Association for
Computational Linguistics, 2006, pp. 1129–1136.
[17] J.-S. Kuo, H. Li and Y.-K. Yang, A phonetic similarity model for automatic
extraction of transliteration pairs, ACM Transactions on Asian Language
Information Processing, 6(2), 2007, 1–24.
[18] H. Li, M. Zhang and J. Su, A joint source channel model for machine
transliteration, in Proceedings of 42nd Annual Meeting of the Association
for Computational Linguistics, 2004, pp. 159–166.
[19] W. Lam, R.-Z. Huang and P.-S. Cheung, Learning phonetic similarity
for matching named entity translations and mining new translations, in
Proceedings of 27th ACM Special Interest Group on Information Retrieval,
2004, pp. 289–296.
[20] C.-J. Lee and J.-S. Chang, Acquisition of English-Chinese transliterated word
pairs from parallel-aligned texts using a statistical machine transliteration
model, in Proceedings of HLT-NAACL Workshop Data Driven MT and
Beyond, 2003, pp. 96–103.
[21] W.-H. Lu, L.-F. Chien and H.-J. Lee, Translation of Web queries using
anchor text mining, ACM Transactions on Asian Language and Information
Processing, 1(2), 2002, 159–172.
[22] D. J. C. MacKay and L. Peto, A hierarchical Dirichlet language model,
Natural Language Engineering, Vol. 1, No. 3, 1994, 1–19.
[23] H. M. Meng, W.-K. Lo, B. Chen and T. Tang, Generating phonetic cognates
to handle named entities in English-Chinese cross-language spoken document
retrieval, in Proceedings of Workshop on Automatic Speech Recognition
and Understanding, 2001, pp. 311–314.
[24] C. S. Myers and L. R. Rabiner, A comparative study of several dynamic
time-warping algorithms for connected word recognition, The Bell System
Technical Journal, 60(7), 1981, 1389–1409.
[25] J.-Y. Nie, P. Isabelle, M. Simard and R. Durand, Cross-language information
retrieval based on parallel texts and automatic mining of parallel text from
the Web, in Proceedings of 22nd ACM Special Interest Group on Information
Retrieval, 1999, pp. 74–81.
[26] R. Rapp, Automatic identification of word translations from unrelated
English and German corpora, in Proceedings of 37th Annual Meeting of
the Association for Computational Linguistics, 1999, pp. 519–526.
[27] R. Sproat, T. Tao and C. Zhai, Named entity transliteration with comparable
corpora, in Proceedings of 44th Annual Meeting of the Association for
Computational Linguistics, 2006, pp. 73–80.
[28] P. Virga and S. Khudanpur, Transliteration of proper names in cross-lingual
information retrieval, in Proceedings of 41st ACL Workshop on Multilingual
and Mixed Language Named Entity Recognition, 2003, pp. 57–64.
[29] S. Wan and C. M. Verspoor, Automatic English-Chinese name transliteration
for development of multilingual resources, in Proceedings of 17th COLING
and 36th ACL, 1998, pp. 1352–1356.
[30] G. Zavaliagkos, R. Schwartz, and J. Makhoul, Batch, incremental and
instantaneous adaptation techniques for speech recognition, in Proceedings of
International Conference on Acoustics, Speech and Signal Processing, 1995, pp. 676–679.