Learning for transliteration of arabic-numeral expressions using decision tree for Korean TTS

October 2004

DOI:10.21437/Interspeech.2004-479

Source
DBLP

Conference: INTERSPEECH 2004 - ICSLP, 8th International Conference on Spoken Language Processing, Jeju Island, Korea, October 4-8, 2004

Authors:

Youngim Jung

Korea Institute of Science and Technology Information (KISTI)

Hyeonsook Nam

Dankook University

Hyuk-Chul Kwon

Pusan National University

Learning for Transliteration of Arabic-Numeral Expressions Using Decision Tree for Korean TTS

Youngim Jung, Donghun Lee, HyeonSook Nam†, Aesun Yoon‡, Hyuk-chul Kwon

School of Electrical & Computer Engineering, Dept. of French‡ at Pusan National University

Dept. of Internet Contents at Busan Digital University†

{acorn, huni77, asyoon‡, hckwon}@pusan.ac.kr, nosenam@bdu.ac.kr†

Abstract

Despite much work on TTS technologies and several TTS systems customized for Korean, current TTS systems produce many errors when transliterating non-alphabetic symbols such as Arabic numerals and text symbols. This paper proposes TLAN (Transliteration Learner for Arabic-Numeral expressions), which can efficiently disambiguate the reading and meaning of Arabic-Numeral Expressions (ANEs) in texts by using a decision tree. For the purpose of analyzing and learning data, three groups of learning elements are proposed: patterns of Arabic numerals combined with text symbols, contextual features, and heuristic information, each classified according to the senses and sounds of ANEs. Our corpus consists of news articles published from January 1st, 2000 to December 31st, 2001 by 10 major newspapers in Korea. By learning the three groups of learning elements, the model shows 97.38% and 97.28% accuracies for the training set and the test set, respectively.

1. Introduction

Text-To-Speech synthesis (TTS) technology has been widely applied to many domain-general systems such as customer support dialog systems, ARS, voice-news systems, e-mail readers, educational programs for language learning and voice-production programs for dysphonic patients.

TTS researchers have mainly studied the naturalness of pronunciations of alphabetic text through signal processing and prosody modeling; however, there have been few studies on the speech synthesis of non-alphabetic symbols, for example, Arabic numerals and various text symbols, using linguistic models.¹

¹ Arabic numerals as non-alphabetic symbols in texts have seldom been a subject of linguistic studies; however, once they are converted into sounds, their senses are governed by linguistic rules and are determined by contextual features.

Computer-readable texts contain not merely alphabetic letters but also non-alphabetic symbols. In particular, the more scientific and informative content a text contains (as in newspaper articles, academic papers and official reports), the more frequently Arabic numerals occur, because Arabic numerals are graphically simple and deliver exact information [1].

Despite their graphic simplicity and expressiveness, Arabic numerals are not easily transliterated because their Korean pronunciations vary with their senses. Arabic numerals represent time, location, quantity (cardinal), order (ordinal), ranks, indices, sports scores, victory marks, telephone numbers, and bank account numbers, among others. They are also used in the formation of proper nouns for arms, planes, visas and programs. In each context, their pronunciations vary, as we can see in examples (E1)~(E4).

(E1) 3[se] geuru²
     three stumps (of trees)
     “three trees”

(E2) 3[sam] nyeon
     “three years”

(E3) 3[seo] mal
     three mal³
     “54 liters”

(E4) big 3[seuli]
     “Big Three”

² In this paper, letters in italics stand for transliterated pronunciations of Korean, and underlined texts represent the target Arabic numeral expressions to which the pronunciations correspond. In each example, the words in the second line are translated directly into English, following the original order of the Korean words; phrases in quotation marks are the interpretation of each example phrase. ‘geuru’ is a Korean classifier which is used as a unit of trees.

³ mal is a Korean unit of volume for measuring liquid or grain; one mal is about 18 liters.

If a Korean classifier comes after an Arabic numeral, the numeral is read as se according to the Korean numeric system (E1), whereas if a Chinese classifier follows an Arabic numeral, it is read as sam according to the Chinese system (E2). ‘3’ and ‘4’ are read as seo or seok and neo or neok, respectively, if Korean units of measurement such as mal, geun, hob, doe come after them, as in example (E3). In some proper nouns, numerals are read in English, as in example (E4). As shown in (E1)~(E4), the same Arabic numeral ‘3’ is transliterated in four different ways. The readings of expressions that combine Arabic numerals with text symbols vary even more, as in examples (E5) and (E6).

(E5) geudeul-eun 3-5[set-eseo daseot] gae-leul bad-assda
     They 3 from 5 things received
     “They received 3 to 5 things.”

(E6) 02-5459-3333
     [gongyi-e osaogu samsamsamsam]
     “zero two (local number) five four five nine three three three three (a telephone number)”
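The context dependence shown in (E1)~(E4) can be sketched as a small lookup. This is only an illustration: the four readings of ‘3’ come from the examples above, while the function name `read_three` and the word lists are hypothetical and far smaller than any real dictionary would be.

```python
# Readings of the numeral '3' taken from examples (E1)~(E4).
READINGS = {
    "native_korean": "se",   # (E1) before a Korean classifier
    "sino_korean": "sam",    # (E2) before a Chinese classifier
    "korean_unit": "seo",    # (E3) before a Korean unit of measurement
    "english": "seuli",      # (E4) inside some proper nouns
}

# Illustrative context lists; these are assumptions, not the authors' dictionaries.
KOREAN_CLASSIFIERS = {"geuru"}
CHINESE_CLASSIFIERS = {"nyeon"}
KOREAN_UNITS = {"mal"}
ENGLISH_PROPER_NOUN_HEADS = {"big"}


def read_three(left_word=None, right_word=None):
    """Return a reading for '3' given its neighbouring words, or None if unknown."""
    if left_word in ENGLISH_PROPER_NOUN_HEADS:
        return READINGS["english"]
    if right_word in KOREAN_UNITS:
        return READINGS["korean_unit"]
    if right_word in KOREAN_CLASSIFIERS:
        return READINGS["native_korean"]
    if right_word in CHINESE_CLASSIFIERS:
        return READINGS["sino_korean"]
    return None
```

A hand-written lookup like this covers only the listed words, which is one reason a learned model over richer context is attractive.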

Thus, the reading of ANEs according to their context is a critical criterion in evaluating the intelligence of TTS systems. However, current TTS systems perform poorly in generating the correct sounds of Arabic-Numeral Expressions (ANEs).

In this paper, we propose TLAN (Transliteration Learner for Arabic-Numeral expressions), which can transliterate ANEs correctly and efficiently. The objectives of this paper are (1) to extract from data and analyze the learning elements which affect the reading of Arabic numerals, (2) to suggest a learning model for the transliteration of Arabic numerals, and (3) to improve current Korean TTS systems.

In order to analyze learning elements, to train on sample data and to test our model, we have built our corpus from the news articles of 10 major newspapers which were issued in Korea from January 1st, 2000 to December 31st, 2001. The sizes of the training and test sets are shown in Table 1 below.

Table 1 Size of training set and test set

Data set       Size (eo-jeol⁴)   Ratio (%)
Training set   90,000            90
Test set       10,000            10

⁴ An eo-jeol is a morpheme cluster of continuous alphanumeric characters and symbols with a space on either side in Korean. In general, symbols are placed between the two paralleled items without spacing. In most cases an eo-jeol is composed of several morphemes of different parts of speech [1].

The rest of the paper is organized as follows. In section 2, we briefly present previous studies related to the transliteration of Arabic numerals, and their limitations. In section 3, learning elements are analyzed, learning algorithms are suggested, and the overall structure of our proposed model is illustrated. Our proposed model is evaluated through experiments in section 4. The conclusions of this paper and suggestions for future research follow in section 5.

2. Related studies

In this section, we will describe previous studies on reading ANEs for TTS systems and present their limitations.

2.1. Rule-based approach

Few studies have dealt with the reading of Arabic numerals with respect to the implementation of an automatic transliteration system for TTS. Despite the scarcity of relevant studies, there are several customized TTS systems. Three daily newspapers offer voice news on their websites and more than 5 companies produce Korean TTS systems [3, 4, 5]. The current systems do not seem to have modules for selecting the accurate reading of ANEs and read Arabic numerals in only 1 to 3 ways; thus numerous incorrect readings are generated, as in (E7)~(E10).⁵

(E7) 3 [*sam/se] keob
     three cup
     “three cups”

(E8) -0.24 [*yeong-jeom i-sa/ma-i-neo-seu yeong-jeom i-sa] %
     *zero point two four/minus zero point two four
     “minus 0.24%”

(E9) 3~4[*sam-e-seo sa/seo-neo] gae
     “3 to 4 things”

(E10) 9.11 [*gu-jeom il-il/gu il-il] teleo
      *nine point one one/nine one one
      “9·11 terror”

⁵ In examples, ‘*’ indicates the incorrect transliteration of ANEs or the incorrect morphological analysis of ANEs.

Table 2 illustrates the accuracies of current TTS systems in reading ANEs.

Table 2 Accuracies of current TTS systems

TTS systems             Accuracy (%)   Resource of evaluation
Donga voice news        55             Numeral expressions in randomly chosen articles issued March 1st, 2003 to May 31st, 2003
Core Voice TTS system   85.2           Numeral expressions in some portion of our analyzed corpus
Voiceware TTS system    79.4~91.7      Numeral expressions in some portion of our analyzed corpus

[Yoon et al., 2003; Jung, 2004] proposed a rule-based system for the transliteration of Arabic numerals, which achieves highly competitive performance compared to the current systems. Though the rules of the system were built by analyzing one daily newspaper, the system shows an accuracy of 95.6~97.7% over 4 sets of unanalyzed data. However, the problem of the rule-based system is that no learning algorithm has been presented for the readings of Arabic numerals in varied and changing data.

2.2. Hybrid approach

In [Yu et al., 2003], a three-layer classifier (TLC) is proposed which disambiguates the senses of “/”, “:” and “-” and determines the oral expressions of these symbols in Chinese. The 1st layer is composed of rule-based pattern tables and a decision tree. In this model, the decision tree is used to exclude the impossible senses of each symbol. In the 2nd layer, a voting scheme calculates the disambiguation scores for all possible senses of the target symbols, and in the 3rd layer, the algorithm confidence of sense disambiguation is used to enhance the performance. Adopting several algorithms, merging layers and matching patterns improves its performance in disambiguating the senses of the three symbols. This hybrid approach achieves high accuracies of 99.8% and 97.5% for the training phase and the test phase, respectively. However, calculating scores and merging algorithms are very complex processes. We have found that well-classified learning elements and their algorithmic application can give us many clues for efficiently determining the senses and readings of ANEs. In section 3, we will investigate learning elements and their algorithmic application for Korean TTS.

3. Implementation

In this section, we will classify the senses of ANEs using linguistic knowledge. The learning elements will be classified into 3 groups; then we will apply the decision tree to determine the best elements and the algorithmic order of their application; lastly, we will illustrate the overall structure of our model.

3.1. Classification of senses of ANEs

ANEs represent various concepts, as described in the introduction. Through the analysis of our corpus and the investigation of previous work, we can classify the senses of ANEs as shown in Table 3.

Table 3 Classification of senses of ANEs

Senses   time    location   quantity1   quantity2   sports scores
Id       S1      S2         S3          S4          S5
Senses   order   indices    titles      numbers     proper nouns
Id       S6      S7         S8          S9          S10

3.2. Learning elements

We have found that three groups of learning elements determine the sense and the pronunciation of ANEs. The three groups are contextual features, pattern structure and heuristic information. Contextual features are extracted from the left or right eo-jeols of ANEs and are subcategorized according to the sense of ANEs. These features are compiled into dictionaries. Patterns are characterized by the number of figures, the number of text symbols and the kind of text symbols in ANEs. In ANEs, the size of figures, the difference between two figures, and the first place of a figure, among other clues, give us the heuristic information necessary to determine the sense of ANEs.

Table 4 Elements and values

Elements              Subcategories of elements           Id   Value
Contextual features   Right Associated Collocation (RAC)       0~30
                      Left Associated Collocation (LAC)        0~24
Patterns              Num. of fig.                             1~12
                      Num. of sym.                             0~5
                      Kind of sym.                             0~9
Heuristics            Size of fig.                             0~5
                      Difference between two fig.              0~2
                      1st place of fig.                        0, 1
                      Places of fig.                           0~4

(E5’) geudeul-eun 3-5 gae-leul bad-assda
      They 3 from 5 things received
      “They received 3 to 5 things.”

In (E5’), gae is RAC(1), the value of which is ‘5’. According to this contextual feature, the ANE is considered to represent the quantity of something (S3, S4) and is pronounced according to the pure Korean numeric system (S3). The pattern ‘3-5 (N-N)’ tells us that the number of figures has the value ‘2’, the number of text symbols has the value ‘1’, and the kind of text symbol has the value ‘2’ (an id number for ‘-’). The difference between the two figures is 2, so ‘N-N+RAC1’ can be recognized as a range of numbers.
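As a rough sketch of how the pattern and heuristic values in this walk-through could be computed, the snippet below counts figures and symbols in an ANE. Only the id ‘2’ for ‘-’ is taken from the text; the other symbol ids, the function name `pattern_features` and the dictionary keys are assumptions made for illustration, and the difference is returned as a raw number rather than as one of the coded values in Table 4.

```python
import re

# Only the id 2 for '-' comes from the text; the remaining ids are assumed.
SYMBOL_IDS = {"-": 2, ":": 3, ".": 4, "/": 5, "~": 6}


def pattern_features(ane):
    """Count figures and text symbols in an ANE such as '3-5'."""
    figures = re.findall(r"\d+", ane)                 # maximal digit runs
    symbols = [ch for ch in ane if not ch.isdigit() and not ch.isspace()]
    kind = SYMBOL_IDS.get(symbols[0], 0) if symbols else 0
    # Raw difference, only defined when exactly two figures are present.
    difference = None
    if len(figures) == 2:
        difference = abs(int(figures[1]) - int(figures[0]))
    return {
        "num_figures": len(figures),
        "num_symbols": len(symbols),
        "kind_of_symbol": kind,
        "difference": difference,
    }
```

For ‘3-5’ this yields two figures, one symbol of kind 2 and a difference of 2, matching the values discussed for (E5’).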

3.3. Application of learning elements and algorithm

As we have seen in Section 3.2, there are 3 groups and 9 subcategories of learning elements which affect the senses and the readings of ANEs. In order to choose a correct sense and reading for a single ANE, the elements should distinguish between candidate senses and readings. In order to obtain the most distinctive elements, we adopt the C4.5 algorithm. The elements, which have discrete values, are selected by calculating the information gain of each element.

$$\mathrm{info}(S) = -\sum_{j=1}^{k} \frac{\mathrm{freq}(C_j, S)}{|S|} \times \log_2\left(\frac{\mathrm{freq}(C_j, S)}{|S|}\right) \qquad (1)$$

$$\mathrm{info}_X(T) = \sum_{i=1}^{n} \frac{|T_i|}{|T|} \times \mathrm{info}(T_i) \qquad (2)$$

$$\mathrm{gain}(X) = \mathrm{info}(T) - \mathrm{info}_X(T) \qquad (3)$$

S: example set of ANEs, X: element, T: training set, T_i: i-th subset of T under the test X, C_j: class to which S belongs (S1~S10)

According to [Quinlan, 1993] and [Mitchell, 1997], info(S) is the entropy of a sample set S, and info_X(T) is the expected information after T has been partitioned in accordance with the n outcomes of a test X. We can obtain the information gain of an element X, gain(X), by partitioning T in accordance with the test X. By this gain criterion, we can select the best element to construct a decision tree. Because the gain criterion has a strong bias in favor of tests with many outcomes, however, it is rectified by normalization.

$$\mathrm{split\,info}(X) = -\sum_{i=1}^{n} \frac{|T_i|}{|T|} \times \log_2\left(\frac{|T_i|}{|T|}\right) \qquad (4)$$

$$\mathrm{gain\,ratio}(X) = \frac{\mathrm{gain}(X)}{\mathrm{split\,info}(X)} \qquad (5)$$

Equation (4) represents the potential information generated by dividing T into n subsets. Then we can obtain the proportion of information generated by the split that is useful for classification, as in equation (5) [Quinlan, 1993]. The gain ratio is helpful in ranking the elements in the construction of a decision tree used to determine the senses and readings of ANEs.
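The gain-ratio computation of equations (1)~(5) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the C4.5 implementation itself: examples are represented as plain dictionaries, and the names `info`, `partition` and `gain_ratio` are chosen here for readability.

```python
from collections import Counter
from math import log2


def info(labels):
    """Entropy of a list of class labels, as in equation (1)."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def partition(rows, labels, element):
    """Group the class labels by the value each row takes for `element`."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[element], []).append(label)
    return list(subsets.values())


def gain_ratio(rows, labels, element):
    """Gain ratio of `element`, following equations (2) to (5)."""
    subsets = partition(rows, labels, element)
    total = len(labels)
    info_x = sum(len(s) / total * info(s) for s in subsets)              # (2)
    gain = info(labels) - info_x                                         # (3)
    split = -sum(len(s) / total * log2(len(s) / total) for s in subsets)  # (4)
    return gain / split if split > 0 else 0.0                            # (5)
```

In a toy set where one element separates the senses perfectly and another does not separate them at all, the first gets a gain ratio of 1 and the second a gain ratio of 0, so the first would be chosen for the root test.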

3.4. System architecture

In this section, we will illustrate the procedure and the overall structure of our model. The procedure consists of two parts: a training part and a test part.

For training data, in Step 1, input data is preprocessed and sentences are segmented by tokenization. In Step 2, target ANEs and their adjacent eo-jeols are extracted from the tokens. In Step 3, target ANEs are converted into patterns; for example, ‘5:30 p.m.’ and ‘-0.24%’ are converted into ‘N:N’ and ‘-N.N’. Thus the pattern information is obtained in this step. Once a pattern structure is obtained, the corresponding heuristic information, such as the size of figures, is extracted from the target ANEs.⁶ In Step 4, the extracted adjacent eo-jeols are converted into subcategorized contextual features. For example, ‘p.m.’ in ‘5:30 p.m.’ is converted into ‘1’, the value of which is ‘15’. Here, if the conversion of eo-jeols into contextual features fails, meaningless morphemes are deleted through morphological analysis and then the analyzed eo-jeols are checked again. If the conversion fails even after this analysis, the eo-jeols are converted into default values.

⁶ The extraction of heuristic information may precede the conversion of target ANEs into patterns. In that case, the system must analyze all target ANEs in a time-consuming way. In this paper, we design the system to check heuristic information selectively under several specific patterns.

In Step 1 through Step 4, the input data is converted into an example data set which can be used to construct a decision tree. In Step 5, the C4.5 algorithm is applied and a decision tree is constructed.

For testing, the same procedure is run, Step 1 through Step 4. In Step 6, the constructed decision tree is applied to assign senses and readings to ANEs in the test data.
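Steps 3 and 4 can be sketched as follows. This is a simplified illustration: the two dictionary values (5 for gae, 15 for p.m.) come from the text, whereas the default value 0, the function names, and the toy `strip_particle` stand-in for morphological analysis are assumptions.

```python
import re

# Hypothetical excerpt of a contextual feature dictionary (right collocations).
RAC_DICTIONARY = {"gae": 5, "p.m.": 15}
DEFAULT_VALUE = 0  # assumed default when no feature can be found


def to_pattern(ane):
    """Step 3: replace every run of digits with 'N', keeping the symbols."""
    return re.sub(r"\d+", "N", ane)


def strip_particle(eojeol):
    """Toy stand-in for morphological analysis: keep the part before '-'."""
    return eojeol.split("-")[0]


def contextual_feature(eojeol, analyze=None):
    """Step 4: dictionary lookup, retry after analysis, else the default."""
    if eojeol in RAC_DICTIONARY:
        return RAC_DICTIONARY[eojeol]
    if analyze is not None:
        stem = analyze(eojeol)
        if stem in RAC_DICTIONARY:
            return RAC_DICTIONARY[stem]
    return DEFAULT_VALUE
```

Here `to_pattern("5:30")` gives ‘N:N’, and the eo-jeol ‘gae-leul’ is only resolved to the value of gae after the analysis step removes the particle.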

Figure 1 illustrates the overall structure of our model.

[Figure 1: flowchart with two paths. Training data: Text preprocessing; Extraction of target NEs and learning elements; Conversion of contextual features (with the contextual feature dictionary); Construction of decision tree. Test data: Text preprocessing; Extraction of target NEs and learning elements; Conversion of contextual features; Selected senses and readings for NEs.]

Fig.1 Overall structure of TLAN

4. Experimentation

For the evaluation of our model, we measured the accuracy on our training data set and performed 10-fold cross-validation on it. In addition, a test data set of 10,000 eo-jeols was reserved. Since, to the authors' knowledge, there are no corpora in which ANEs are transliterated in these ways, the results were manually evaluated by the authors. Table 5 shows the results.

Table 5 Size of data sets and accuracy of the model

Data set                    Size (eo-jeol)   Accuracy (%)
Training set                90,000           97.39
Training set (10-fold CV)   9,000 * 10       97.28
Test set                    10,000           97.29
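The 10-fold cross-validation row can be read as follows: the 90,000 training eo-jeols are split into ten folds of 9,000, and each fold is held out once while the other nine are used for training. The sketch below shows such a split and an accuracy measure; the function names are hypothetical and the contiguous, unshuffled folds are an assumption, since the text does not say how the folds were drawn.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train_indices, held_out_indices) for k contiguous folds."""
    fold_size = n_items // k
    for i in range(k):
        start = i * fold_size
        # The last fold absorbs any remainder when n_items is not divisible by k.
        stop = n_items if i == k - 1 else start + fold_size
        held_out = list(range(start, stop))
        train = [j for j in range(n_items) if j < start or j >= stop]
        yield train, held_out


def accuracy(predicted, gold):
    """Percentage of items whose predicted sense equals the gold sense."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)
```

Averaging `accuracy` over the ten held-out folds gives a single cross-validation figure of the kind reported in Table 5.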

The accuracy of our model exceeds that of current TTS systems by a large margin. Also, compared with the rule-based system, the learning model shows comparable performance. However, there are problems in extracting learning elements.

First, ANEs in proper nouns do not have consistent structural information or contextual features, as in (E11) and (E12).

(E11) MP3 [seuli]
      “MP three”

(E12) BK21 [isip-il]
      “BK 21 (the title of a national project)”

Second, errors in morphological analysis affect the extraction of learning elements. Example (E13) illustrates how morphological-analysis errors cause the extraction of contextual features to fail.

(E13) 17 [sip-chil]ilen
      *17 (quantifier) + il (“1”, quantifier) + en (classifier for Japanese currency, yen)/
      17 (quantifier) + il (classifier for a day) + en (josa)

Third, contextual ambiguities which humans cannot resolve without more than two contextual features also make extracting learning elements difficult.

5. Conclusions and further studies

In this paper, we have proposed TLAN (Transliteration Learner for Arabic-Numeral expressions), which can efficiently disambiguate the senses and readings of Arabic-Numeral Expressions (ANEs) in texts by using a decision tree. For the purpose of analyzing and learning data, three groups of learning elements were suggested: patterns of Arabic numerals combined with text symbols, contextual features, and heuristic information, each classified according to the senses and readings of ANEs. By learning the three groups of learning elements, the model shows 97.39% and 97.29% accuracies for the training set and the test set, respectively.

The accuracy of TLAN significantly exceeds the accuracies of current TTS systems, and our learning model shows its performance to be superior to that of the rule-based system. However, it still has problems in transliterating ANEs in proper nouns, ANEs with morphological-analysis errors and ANEs lacking contextual features. For the purpose of improving the system, a hybrid system combining the rule-based model and the learning model for transliterating Arabic-Numeral Expressions needs to be investigated. Also, we need to consider a system which can transliterate all symbols outside the Korean alphabet, such as the Roman alphabet, measurement symbols and Chinese characters, for Korean TTS. That will be the subject of our next study.

6. Acknowledgements

This work was supported by a National Research Laboratory Grant (Laboratory title: Korean Language Processing Lab. Project Number: M10203000028-02J0000-01510).

7. References

[1] Yoon, Aesun et al. (2003), “An Automatic Transcription System for Arabic Numerals in Korean”, Proceedings of 2003 International Conference on Natural Language Processing and Knowledge Engineering, pp. 221~226.

[2] Jung, Youngim (2004), Implementation of an Automatic Transliteration System of Arabic Numerals for Korean TTS, Master’s thesis, Graduate School of Pusan National University.

[3] Donga dotcom: http://www.donga.com

[4] VoiceWare: http://www.voiceware.co.kr

[5] Corevoice: http://www.corevoice.com

[6] Yu et al. (2003), “Disambiguating the senses of non-text symbols for Mandarin TTS systems with a three-layer classifier”, Speech Communication, v.39 no.3/4, pp. 191-229.

[7] Quinlan, J. Ross (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, San Mateo, Calif.

[8] Mitchell, Tom M. (1997), Machine Learning, McGraw-Hill.
