Learning for Transliteration of Arabic-Numeral Expressions Using Decision
Tree for Korean TTS
Youngim Jung, Donghun Lee, HyeonSook Nam, Aesun Yoon, Hyuk-chul Kwon
School of Electrical & Computer Engineering, Dept. of French at Pusan National University
Dept. of Internet Contents at Busan Digital University
{acorn, huni77, asyoon, hckwon}@pusan.ac.kr, nosenam@bdu.ac.kr
Abstract
Despite much work on TTS technologies and several TTS
systems customized for Korean, current TTS systems make
many errors in transliterating non-alphabetic symbols such as
Arabic numerals and text symbols. This paper proposes
TLAN (Transliteration Learner for Arabic-Numeral
expressions) which can efficiently disambiguate the reading
and meaning of Arabic Numeral Expressions (ANEs) in texts
by using a decision tree. For analyzing and learning the data,
three groups of learning elements are proposed: patterns of
Arabic numerals combined with text symbols, contextual
features, and heuristic information, each classified
according to the senses and sounds of ANEs. Our
corpus was made up of news articles issued from January 1st,
2000 to December 31st, 2001 from 10 major newspapers in
Korea. By learning the three groups of learning elements, the
model shows 97.39% and 97.29% accuracies for the training
set and the test set, respectively.
1. Introduction
Text-To-Speech synthesis (TTS) technology has been widely
applied to many domain-general systems such as customer
support dialog systems, ARS, voice-news systems, e-mail
readers, educational programs for language learning and
voice-production programs for dysphonic patients.
The naturalness of alphabetic letter-based pronunciations by
signal processing and prosody modeling has been mainly
studied by TTS researchers; however, there have been few
studies on the speech synthesis of non-alphabetic symbols, for
example, Arabic numerals and various text symbols using
linguistic models.
Computer-readable texts contain not only alphabetic letters
but also non-alphabetic symbols. In particular, the more
scientific and informative a text is (as in newspaper
articles, academic papers and official reports), the
more frequently Arabic numerals occur in it, because
Arabic numerals are graphically simple and deliver exact
information [1].
Arabic numerals, despite their graphic advantage and
representability, are not easily transliterated because their
Korean pronunciations vary with their senses. Arabic
numerals represent time, location, quantity (cardinal), order
(ordinal), ranks, indices, sports scores, victory marks,
telephone numbers, and bank account numbers, among others.
Arabic numerals as non-alphabetic symbols in texts have
seldom been a subject of linguistic studies; however, once they are
converted into sounds, their senses are ruled by linguistic rules
and are determined by contextual features.
They are also used in the formation of proper nouns for arms,
planes, visas and programs. In each context, their
pronunciations vary as we can see in examples (E1)~(E4).
(E1) 3[se] geuru
three stumps (of trees)
“three trees”
(E2) 3[sam] nyeon
“three years”
(E3) 3[seo] mal
three mal
“54 ℓ”
(E4) big 3[seuli]
“Big Three”
If a Korean classifier comes after an Arabic numeral, the
numeral is read as se according to the Korean numeric system
(E1), whereas if a Chinese classifier follows an Arabic
numeral, it is read as sam according to the Chinese system
(E2). ‘3’ and ‘4’ are read as seo or seok and neo or neok,
respectively if Korean units of measurement such as mal, geun,
hob, doe come after them as in example (E3). In some proper
nouns, numerals are read in English as in example (E4). As
shown in (E1)~(E4), the same Arabic numeral, ‘3’ has been
transliterated in four different ways. The readings of combined
expressions of Arabic numerals and text symbols are even more
varied, as in examples (E5) and (E6).
(E5) geudeul-eun 3-5[set-eseo daseot] gae-leul
bad-assda
They 3 from 5 things received
“They received 3 to 5 things.”
(E6) 02-5459-3333
[gongyi-e osaogu samsamsamsam]
“zero two (local number) five four five nine three
three three three (a telephone number)”
In this paper, letters in italics stand for transliterated
pronunciations of Korean, and underlined text represents the target
Arabic numeral expressions to which the pronunciations
correspond. In each example, the words in the second line are
translated directly into English, following the original order of
the Korean words; the phrase in quotation marks is the
interpretation of the example.
‘geuru’ is a Korean classifier used as a unit for counting trees.
mal is a Korean unit of volume for measuring liquid or grain;
one mal is about 18 ℓ.
An eo-jeol is a morpheme cluster of continuous alphanumeric
characters and symbols with a space on either side in Korean. In
general, symbols are placed between two parallel items
without spacing. In most cases an eo-jeol is composed of several
morphemes of different parts of speech [1].
Thus, the reading of ANEs according to their context is a
critical criterion in evaluating the intelligence of TTS systems.
However, current TTS systems show low performance in
generating the correct sounds of Arabic-Numeral Expressions
(ANEs).
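The context dependence shown in (E1)~(E4) can be sketched as a small lookup. The following Python fragment is only an illustration: the function name `read_three` and the word lists are hypothetical and contain just the words that appear in the examples above, not the dictionaries used by TLAN.

```python
# Readings of '3' taken from examples (E1)~(E4). The word lists below are
# illustrative assumptions, not the dictionaries used in this paper.
KOREAN_CLASSIFIERS = {"geuru"}                        # Korean system: se
CHINESE_CLASSIFIERS = {"nyeon"}                       # Chinese system: sam
KOREAN_MEASURE_UNITS = {"mal", "geun", "hob", "doe"}  # seo (or seok)
ENGLISH_PROPER_NOUN_HEADS = {"big"}                   # read in English: seuli


def read_three(left_word, right_word):
    """Return a reading of '3' given its left and right neighbours."""
    if left_word in ENGLISH_PROPER_NOUN_HEADS:
        return "seuli"
    if right_word in KOREAN_MEASURE_UNITS:
        return "seo"
    if right_word in KOREAN_CLASSIFIERS:
        return "se"
    if right_word in CHINESE_CLASSIFIERS:
        return "sam"
    return None  # context not covered by this sketch


print(read_three(None, "geuru"))  # se
```

A table of this kind covers only contexts that were listed by hand, which is the limitation that motivates a learned model.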
In this paper, we propose TLAN (Transliteration Learner for
Arabic-Numeral expressions), which can transliterate ANEs
correctly and efficiently. The objectives of this paper are (1) to
extract from data and analyze learning elements which affect
the reading of Arabic numerals, (2) to suggest a learning
model for the transliteration of Arabic numerals, and (3) to
improve current Korean TTS systems.
In order to analyze learning elements, to train sample data
and to test our model, we have built our corpus from the news
articles of 10 major newspapers which were issued in Korea
from January 1st, 2000 to December 31st, 2001. The sizes of
the training and test sets are shown in Table 1 below.
Table 1 Size of training set and test set
Data set       Size (eo-jeol)   Ratio (%)
Training set   90,000           90
Test set       10,000           10
The plan for the rest of the paper is as follows. In section 2,
we will briefly present previous studies related to the
transliteration of Arabic numerals, and their limitations. In
section 3, learning elements will be analyzed, learning
algorithms suggested, and the overall structure of our
proposed model illustrated. Our proposed model will be
evaluated through experiments in section 4. The
conclusions of this paper and suggestions for future research
follow in section 5.
2. Related studies
In this section, we will describe previous studies on reading
ANEs for TTS systems and present their limitations.
2.1. Rule-based approach
Few studies have dealt with the reading of Arabic numerals with
respect to the implementation of an automatic transliteration
system for TTS. Nevertheless, there are several customized TTS
systems: three daily newspapers offer voice news on their
websites and more than 5 companies produce Korean TTS
systems [3, 4, 5]. The current systems do not seem to have
modules for selecting the accurate reading of ANEs and read
Arabic numerals in only 1 to 3 ways; thus numerous
incorrect readings are generated, as in (E7)~(E10).
(E7) 3 [*sam/se] keob
three cup
“three cups”
(E8) -0.24 [*yeong-jeom i-sa/ma-i-neo-seu yeong-jeom
i-sa] %
*zero point two four/minus zero point two four
“minus 0.24%”
(E9) 34[*sam-e-seo sa/seo-neo] gae
“3 to 4 things”
(E10) 9.11 [*gu-jeom il-il/gu il-il] teleo
*nine point one one/ nine one one
“9·11 terror”
In the examples, ‘*’ indicates an incorrect transliteration of ANEs
or an incorrect morphological analysis of ANEs.
Table 2 illustrates the accuracies of current TTS systems in
reading ANEs.
Table 2 Accuracies of current TTS systems
TTS system              Accuracy (%)   Resource of evaluation
Donga voice news        55             Numeral expressions in randomly chosen articles issued March 1st, 2003 to May 31st, 2003
Core Voice TTS system   85.2           Numeral expressions in some portion of our analyzed corpus
Voiceware TTS system    79.4~91.7      Numeral expressions in some portion of our analyzed corpus
[Yoon et al., 2003; Jung, 2004] have proposed a rule-based
system for the transliteration of Arabic numerals, which
achieves highly competitive performance compared to the
current systems. Although the rules of the system were built
by analyzing one daily newspaper, the system shows an
accuracy of 95.6~97.7% over 4 sets of unanalyzed data.
However, the problem with the rule-based system is that no
learning algorithm has been presented for the readings of
Arabic numerals in diverse and changing data.
2.2. Hybrid approach
In [Yu et al., 2003], a three-layer classifier (TLC) which
disambiguates the senses of “/,” “:” and “-” and determines the
oral expressions of the symbols in Chinese has been proposed.
The 1st layer is composed of rule-based pattern tables and a
decision tree. In this model, the decision tree is used to
exclude the impossible senses of each symbol. In the 2nd layer,
a voting scheme calculates the disambiguation scores for all
possible senses of the target symbols, and within the 3rd layer,
the algorithm confidence of sense disambiguation is used to
enhance the performance. The method of adopting several
algorithms, merging layers and matching patterns is used to
improve its performance in disambiguating the senses of the
three symbols. This hybrid approach achieves high accuracies,
such as 99.8% and 97.5% for a training phase and a test phase,
respectively. However, calculating scores and merging
algorithms are very complex processes. We have found that
well-classified learning elements and their algorithmic
application can give us many clues in efficiently determining
the senses and readings of ANEs. In section 3, we will
investigate learning elements and their algorithmic
applications for Korean TTS.
3. Implementation
In this section, we will classify the senses of ANEs using
linguistic knowledge. The learning elements will be classified
into 3 groups; then we will apply the decision tree for the
purpose of determining the best elements and the algorithmic
order of their application; lastly we will illustrate the overall
structure of our model.
3.1. Classification of senses of ANEs
ANEs represent various concepts as introduced in the
introduction. Through the analysis of our corpus and the
investigation of previous work, we can classify the senses of
ANEs as shown in Table 3.
Table 3 Classification of senses of ANEs
Sense of ANEs   Id
time            S1
location        S2
quantity1       S3
quantity2       S4
sports scores   S5
order           S6
indices         S7
titles          S8
numbers         S9
proper nouns    S10
3.2. Learning elements
We have found that three groups of learning elements
determine the sense and the pronunciation of ANEs. The three
groups are contextual features, pattern structure and heuristic
information. Contextual features are extracted from the left or
right eo-jeols of ANEs and are subcategorized according to
the sense of ANEs. These features are built in dictionaries.
Patterns are characterized by the number of figures, the
number of text symbols and the kind of text symbols in ANEs.
In ANEs, the size of figures, the difference between two
figures, the first place of a figure, among other clues, give us
the necessary heuristic information to determine the sense of
ANEs.
Table 4 Elements and values
Elements              Subcategories of elements            Id   Value
Contextual features   Right Associated Collocation (RAC)   1    0~30
                      Left Associated Collocation (LAC)    2    0~24
Patterns              Num. of fig.                         3    1~12
                      Num. of sym.                         4    0~5
                      Kind of sym.                         5    0~9
Heuristics            Size of fig.                         6    0~5
                      Difference between two fig.          7    0~2
                      1st place of fig.                    8    0, 1
                      Places of fig.                       9    0~4
(E5’) geudeul-eun 3-5 gae-leul bad-assda
They 3 from 5 things received
“They received 3 to 5 things.”
In (E5’), gae is RAC(1), the value of which is ‘5’. According
to this contextual feature, the ANE is considered to represent the
quantity of something (S3, S4), and is pronounced according
to the pure Korean numeric system (S3). The pattern ‘3-5 (N-N)’
tells us that the number of figures has the value ‘2’, the
number of text symbols has the value ‘1’, and the kind of text
symbol has the value ‘2’ (an id number for ‘-’). The difference
between the two figures is 2, so ‘N-N+RAC1’ can be recognized
as a range of numbers.
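The pattern elements described for (E5’) can be extracted mechanically. The sketch below is a minimal illustration under stated assumptions: the function name `pattern_elements` and the dictionary keys are hypothetical, only the id ‘2’ for ‘-’ comes from the text, and the difference is returned as a raw number because the coding of the heuristic values (0~2 in Table 4) is not specified here.

```python
import re

# Only the id 2 for '-' is stated in the text; ids of other symbols are
# unknown, so they map to None in this sketch.
SYMBOL_KIND_IDS = {"-": 2}


def pattern_elements(ane):
    """Extract the pattern elements (ids 3~5) and the raw difference."""
    figures = re.findall(r"\d+", ane)      # runs of digits
    symbols = re.findall(r"[^\d\s]", ane)  # every non-digit, non-space char
    elements = {
        "num_figures": len(figures),
        "num_symbols": len(symbols),
        "kind_of_symbol": SYMBOL_KIND_IDS.get(symbols[0]) if symbols else None,
    }
    if len(figures) == 2:
        # Raw difference; the paper's coded value (0~2) is not reproduced.
        elements["difference"] = int(figures[1]) - int(figures[0])
    return elements


print(pattern_elements("3-5"))
# {'num_figures': 2, 'num_symbols': 1, 'kind_of_symbol': 2, 'difference': 2}
```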
3.3. Application of learning elements and algorithm
As we have seen in Section 3.2, there are 3 groups and 9
subcategories of learning elements which affect the senses and
the readings of ANEs. In order to choose a correct sense and a
reading for a single ANE, the elements should distinguish
between candidate senses and readings. In order to
obtain the most distinctive elements, we adopt the C4.5
algorithm. The elements, which have discrete values, are
selected by calculating their information gain.
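\[ \mathrm{info}(S) = -\sum_{j=1}^{k} \frac{\mathrm{freq}(C_j, S)}{|S|} \times \log_2\left(\frac{\mathrm{freq}(C_j, S)}{|S|}\right) \tag{1} \]
\[ \mathrm{info}_X(T) = \sum_{i=1}^{n} \frac{|T_i|}{|T|} \times \mathrm{info}(T_i) \tag{2} \]
\[ \mathrm{gain}(X) = \mathrm{info}(T) - \mathrm{info}_X(T) \tag{3} \]
S: Example set of ANEs, X: Elements, T: Training sets,
C_j: Class to which S belongs (S1~S10)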
According to [Quinlan, 1993] and [Mitchell, 1997], info(S) is
the entropy of a sample set S, and info_X(T) is the expected
information after T has been partitioned in accordance with
the n outcomes of a test X. We can obtain the information gain
of an element X (gain(X)) by partitioning T in accordance with
the test X. By this gain criterion, we can select the best
element to construct a decision tree. However, the gain
criterion has a strong bias in favor of tests with many
outcomes, so it is rectified by normalization.
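As a worked illustration of equations (1)~(3), the following Python sketch computes the entropy and the information gain on a toy example set. The data, the element names `rac` and `num_sym`, and the function names are invented for illustration and are not taken from the corpus.

```python
from collections import Counter
from math import log2


def info(labels):
    """Entropy of a list of class labels, as in equation (1)."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain(examples, labels, element):
    """Information gain of one element, as in equations (2) and (3)."""
    subsets = {}
    for example, label in zip(examples, labels):
        subsets.setdefault(example[element], []).append(label)
    info_x = sum(len(s) / len(labels) * info(s) for s in subsets.values())
    return info(labels) - info_x


# Toy data: 'rac' separates the two senses perfectly, while 'num_sym'
# does not separate them at all.
examples = [
    {"rac": 5, "num_sym": 1},
    {"rac": 5, "num_sym": 0},
    {"rac": 7, "num_sym": 1},
    {"rac": 7, "num_sym": 0},
]
labels = ["S3", "S3", "S1", "S1"]

print(gain(examples, labels, "rac"))      # 1.0
print(gain(examples, labels, "num_sym"))  # 0.0
```

The element with the larger gain would be tested first in the tree.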
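\[ \mathrm{splitinfo}(X) = -\sum_{i=1}^{n} \frac{|T_i|}{|T|} \times \log_2\left(\frac{|T_i|}{|T|}\right) \tag{4} \]
\[ \mathrm{gainratio}(X) = \frac{\mathrm{gain}(X)}{\mathrm{splitinfo}(X)} \tag{5} \]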
Equation (4) represents the potential information generated by
dividing T into n subsets. Then we can obtain the proportion
of information generated by the split, as in equation (5)
[Quinlan, 1993]. The gain ratio is helpful in classifying the
elements in the construction of a decision tree used to
determine the senses and reading of ANEs.
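The effect of the normalization in equations (4) and (5) can be seen on a toy example. The sketch below is an illustration only: the data and function names are invented, and it shows that an element with a distinct value for every example obtains the same gain as a two-valued element but a lower gain ratio.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Entropy of the empirical distribution of a list of values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain_ratio(element_values, labels):
    """gainratio(X) = gain(X) / splitinfo(X), equations (3)~(5)."""
    subsets = {}
    for value, label in zip(element_values, labels):
        subsets.setdefault(value, []).append(label)
    info_x = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    gain = entropy(labels) - info_x
    split_info = entropy(element_values)  # equation (4)
    return gain / split_info if split_info > 0 else 0.0


labels = ["S3", "S3", "S1", "S1"]
two_valued = [5, 5, 7, 7]   # splits T into 2 subsets
many_valued = [1, 2, 3, 4]  # splits T into 4 singleton subsets

print(gain_ratio(two_valued, labels))   # 1.0
print(gain_ratio(many_valued, labels))  # 0.5
```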
3.4. System architecture
In this section, we will illustrate the procedure and the overall
structure of our model. The procedure consists of two phases:
a training phase and a test phase.
For training data, in Step 1, input data is preprocessed and
sentences are segmented by tokenization. In Step 2, target
ANEs and their adjacent eo-jeols are extracted from tokens. In
Step 3, target ANEs are converted into patterns, for example,
‘5:30 p.m.’ and ‘-0.24%’ are converted into ‘N:N’ and ‘-N.N’.
Thus the pattern information is obtained in this step. Once a
pattern structure is obtained, the corresponding heuristic
information, such as the size of figures, is extracted from target
ANEs. (The extraction of heuristic information may precede the
conversion of target ANEs into patterns. In that case, the system
must analyze all target ANEs in a time-consuming way. In this
paper, we design the system to check heuristic information
selectively under several specific patterns.)
In Step 4, extracted adjacent eo-jeols are converted into
subcategorized contextual features. For example, ‘p.m.’ in
‘5:30 p.m.’ is converted into ‘1’, the value of which is ‘15’.
Here, if the conversion of eo-jeols into contextual features
fails, meaningless morphemes are deleted through
morphological analysis and then the analyzed eo-jeols are
checked again. If the conversion fails even after this
analysis, the eo-jeols are converted into default values.
In Step 1 through Step 4, the input data is converted into an
example data set which can be used to construct a decision
tree. In Step 5, the C4.5 algorithm is applied and a decision
tree is constructed.
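The pattern conversion of Step 3 can be sketched with a regular expression. This is a minimal illustration, not the implementation used in TLAN: the function name `to_pattern` is hypothetical, and the sketch assumes that units following the last figure (such as ‘%’ or ‘p.m.’) are left to the contextual features, which is consistent with the two examples given above.

```python
import re


def to_pattern(ane):
    """Replace every run of digits with 'N', keeping the symbols between them."""
    # Keep an optional sign and the digit groups joined by single symbols;
    # anything after the last figure is dropped from the pattern.
    core = re.match(r"[-+]?\d+(?:[^\w\s]\d+)*", ane)
    if core is None:
        return None
    return re.sub(r"\d+", "N", core.group())


print(to_pattern("5:30 p.m."))  # N:N
print(to_pattern("-0.24%"))     # -N.N
print(to_pattern("3-5"))        # N-N
```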
For the test data, the same procedure is run, Step 1 through
Step 4. In Step 6, the constructed decision tree is applied to
assign senses and readings to the ANEs in the test data.
Figure 1 illustrates the overall structure of our model.
[Figure 1. Training data: Text preprocessing → Extraction of target NEs and learning elements → Conversion of contextual features → Construction of decision tree. Test data: Text preprocessing → Extraction of target NEs and learning elements → Conversion of contextual features → Selected senses and readings for NEs. Resource: Contextual feature dictionary.]
Fig.1 Overall structure of TLAN
4. Experimentation
For the evaluation of our model, we measured its accuracy
on the training data set and by 10-fold cross-validation over
the same set. In addition, a test data set of 10,000 eo-jeols
was reserved. Since, to the authors' knowledge, there are
no corpora in which ANEs are transliterated in these ways,
the results were manually evaluated by the authors. Table 5
shows the results.
Table 5 Size of data sets and accuracy of the model
Data set                    Size (eo-jeol)   Accuracy (%)
Training set                90,000           97.39
Training set (10-fold CV)   9,000 * 10       97.28
Test set                    10,000           97.29
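The 10-fold cross-validation row corresponds to holding out 9,000 of the 90,000 training eo-jeols in each of 10 runs. The sketch below shows one way to generate such folds. It is an assumption-laden illustration: the function name `k_fold_indices` is hypothetical, and contiguous folds are used only for simplicity, since the way the folds were drawn is not stated here.

```python
def k_fold_indices(n_examples, k=10):
    """Yield (train_indices, held_out_indices) for each of k folds."""
    fold_size = n_examples // k
    for fold in range(k):
        start = fold * fold_size
        # The last fold absorbs any remainder.
        stop = start + fold_size if fold < k - 1 else n_examples
        held_out = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_examples))
        yield train, held_out


folds = list(k_fold_indices(90000, k=10))
print(len(folds), len(folds[0][1]))  # 10 9000
```

The reported figure would then be the average accuracy over the 10 held-out folds.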
The accuracy of our model exceeds that of current TTS
systems by a large margin. Also, compared with the rule-
based system, the learning model shows comparable
performance. However, there are problems in extracting
learning elements.
First, ANEs in proper nouns do not have consistent structural
information or contextual features, as in (E11) and (E12).
(E11) MP3 [seuli]
“MP three”
(E12) BK21 [isip-il]
“BK 21(the title of a national project)”
Second, errors in morphological analysis affect the extraction
of learning elements. Example (E13) illustrates how
morphological-analysis errors cause the extraction of
contextual features to fail.
(E13) 17 [sip-chil]ilen
*17 (quantifier) + il (“1”, quantifier) + en (classifier
for Japanese currency, yen)/
17 (quantifier) + il (classifier for a day) + en (josa)
Third, contextual ambiguities which humans cannot resolve
without more than two contextual features also make
extracting learning elements difficult.
5. Conclusions and further studies
In this paper, we have proposed TLAN (Transliteration
Learner for Arabic-Numeral expressions), which can
efficiently disambiguate the senses and readings of Arabic-
Numeral Expressions (ANEs) in texts by using a decision tree.
For the purpose of analyzing and learning data, three groups
of learning elements were suggested: patterns of Arabic
numerals combined with text symbols, contextual features,
and heuristic information, each classified according to the
senses and readings of ANEs. By learning the three groups of
learning elements, the model shows 97.39% and 97.29%
accuracies for the training set and the test set, respectively.
The accuracy of TLAN significantly exceeds the accuracies
of current TTS systems, and our learning model performs
comparably to the rule-based system.
However, it still has problems in transliterating ANEs in
proper nouns, ANEs with morphology-analysis errors and
ANEs lacking contextual features. For the purpose of
improving the system, a hybrid system, combining the rule-
based model and the learning model for transliterating
Arabic-Numeral Expressions, needs to be investigated. Also,
we need to consider a system which can transliterate all
non-Korean symbols, such as the Roman alphabet,
measurement symbols and Chinese characters, for Korean
TTS. That will be the subject of our next study.
6. Acknowledgements
This work was supported by a National Research Laboratory
Grant (Laboratory title: Korean Language Processing Lab.
Project Number: M10203000028-02J0000-01510).
7. References
[1] Yoon, Aesun et al. (2003) “An Automatic Transcription
System for Arabic Numerals in Korean”, Proceedings of 2003
International Conference on Natural Language Processing and
Knowledge Engineering, pp. 221~226.
[2] Jung, Youngim (2004), Implementation of an Automatic
Transliteration System of Arabic Numerals for Korean TTS,
Master’s thesis, Graduate School of Pusan National University.
[3] Donga dotcom: http://www.donga.com
[4] VoiceWare: http://www.voiceware.co.kr
[5] Corevoice: http://www.corevoice.com
[6] Yu et al. (2003), “Disambiguating the senses of non-text
symbols for Mandarin TTS systems with a three-layer
classifier”, Speech Communication, v.39 no.3/4, pp. 191-229.
[7] J. Ross Quinlan (1993), C4.5: programs for machine learning,
Morgan Kaufmann Publishers, San Mateo, Calif.
[8] Tom M. Mitchell (1997), Machine Learning, McGraw-Hill.