Content uploaded by Wuying Liu
Author content
All content in this area was uploaded by Wuying Liu on Mar 07, 2015
Content may be subject to copyright.
Probabilistic Ensemble Learning for Vietnamese
Word Segmentation
Wuying Liu
Luoyang University of Foreign Languages
471003 Luoyang, Henan, CHINA
wyliu@nudt.edu.cn
Li Lin
Luoyang University of Foreign Languages
471003 Luoyang, Henan, CHINA
lamle@163.com
ABSTRACT
Word segmentation is a challenging issue, and the corresponding
algorithms can be used in many applications of natural language
processing. This paper addresses the problem of Vietnamese word
segmentation, proposes a probabilistic ensemble learning (PEL)
framework, and designs a novel PEL-based word segmentation
(PELWS) algorithm. Supported by the data structure of syllable-
syllable frequency index, the PELWS algorithm combines
multiple weak segmenters to form a strong segmenter within the
PEL framework. The experimental results show that the PELWS
algorithm can achieve the state-of-the-art performance in the
Vietnamese word segmentation task.
Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis
and Indexing – dictionaries, indexing methods, linguistic
processing.
I.2.7 [Artificial Intelligence]: Natural Language Processing –
language parsing and understanding, text analysis.
General Terms
Algorithms, Performance, Experimentation, Languages
Keywords
Word Segmentation Algorithm, Probabilistic Ensemble Learning,
Multi-Segmenter, Vietnamese, Syllable-Syllable Frequency Index
1. INTRODUCTION
Vietnamese is a monosyllabic language, whose basic linguistic
unit is called “tiếng”, similar to traditional syllables in respect of
phonetic form. Like Thai, Japanese and Chinese text, Vietnamese
text is also a text without any explicit separator between words.
Thus, identifying the word boundaries is crucial to above oriental
languages in many applications of natural language processing
(NLP) [1]. For instance, the following Sentence1 is a raw
unsegmented Vietnamese sentence, and the following Sentence2
is its related segmented one.
Sentence1:
N
hập khẩu tôm vào Mỹ năm qua vẫn cao kỷ lục do
nguồn cung cấp từ Thái Lan và Indonesia tăng .
Sentence2:
N
hập_khẩu tôm vào Mỹ năm qua vẫn cao kỷ_lục do
nguồn_cung_cấp từ Thái_Lan và Indonesia tăng .
The above instance shows that Vietnamese text is a sequence of
syllables and each two continuous syllables are separated by a space
symbol. The space symbol belongs to an overload symbol (as a
connector within a word or as a separator between words) in raw
text. Therefore, the Vietnamese word segmentation task can be
defined as a binary categorization problem for each space symbol. If
a space symbol is a connector in a word, we will output a symbol
('_') to replace it. And if a space symbol is a separator between
words, we will maintain it as a space symbol (' ') in the segmented
result.
Since the early days of Vietnamese information processing, word
segmentation has been widely investigated. Till now, many
effective algorithms have been proposed for Vietnamese word
segmentation [2]. The early dictionary-based word segmentation
algorithms mainly include maximum matching algorithm and
reverse maximum matching algorithm. The dictionary-based
algorithm has a straightforward implementation, but its performance
highly depends on a suitable dictionary. Subsequently, various kinds
of advanced machine learning algorithms (such as maximum
entropy [3], support vector machines and conditional random fields
[4]) regard word segmentation as a sequence labeling problem, and
related algorithms can obtain preferable performance in Vietnamese
word segmentation. To resolve ambiguities of word segmentation, a
hybrid algorithm combines finite-state automata, regular expression
and maximum matching techniques [5] to implement a highly
accurate Vietnamese tokenizer (vnTokenizer). Recently, part of
speech (POS) is considered as a helpful tag for word segmentation
and word ambiguity resolution [6]. Therefore, many affixed
resources (such as POS tag) are applied in word segmentation
algorithm [7].
Previous researches show that ensemble learning has statistical,
computational and representational advantages [8]. Ensemble
learning can improve the performance of existing algorithms for
spam filtering [9] and text categorization [10]. Based on the
motivation of ensemble learning, we propose a novel probabilistic
ensemble learning (PEL) framework, which is a meta-framework
for integrating various existing word segmentation algorithms.
Under our PEL framework, we design and implement a PEL-based
word segmentation (PELWS) algorithm.
2. PROBABILISTIC ENSEMBLE LEARNING
2.1 Framework
In order to colligate various advantages of different word
segmentation algorithms, we propose a PEL framework by a
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. Copyrights
for components of this work owned by others than ACM must be
honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior
specific permission and/or a fee. Request permissions from
Permissions@acm.org.
SIGIR '14, July 06–11, 2014, Gold Coast, QLD, Australia.
Copyright 2014 ACM 978-1-4503-2257-7/14/07…$15.00.
http://dx.doi.org/10.1145/2600428.2609477
931
weighted-voting-like strategy. Figure 1 shows the PEL framework
for Vietnamese word segmentation. The framework mainly
includes several pairs of <Segmenter, PELearner (Probabilistic
Ensemble Learner)> and a PEPredictor (Probabilistic Ensemble
Predictor). The Segmenters are implemented from different
Vietnamese word segmentation algorithms. For each Segmenter,
we design and install a PELearner, which receives golden
standard segmented texts and automatic segmented texts from its
corresponding Segmenter, and learns a probabilistic ensemble
model from both two texts. The probabilistic ensemble model is
stored as a syllable-syllable frequency index (SSFI). The
PEPredictor combines multi-segmenter’s segmented texts
according to the SSFIs and makes the final prediction for each
unsegmented text.
Figure 1. Probabilistic ensemble learning framework.
Figure 2 shows the SSFI structure organized as a hash table. The
table entry of the SSFI is a key-value pair <Key, Value>, where
each key SiSj denotes a sequence of continuous two syllables and
each value consists of 4 integers. The integer fc(·) denotes the
times of correct prediction and the integer fm(·) denotes the times
of mistaken prediction. The symbol ('_') means that the
continuous two syllables Si and Sj are in a word, and the symbol ('
') means that the continuous two syllables Si and Sj are not in a
word. The hash function hash(SiSj) maps the sequence SiSj to the
address of the 4 integers.
Figure 2. Syllable-syllable frequency index.
Supported by the SSFI, we design a supervised PELWS algorithm,
which regards the historical precision as an ensemble probability
and uses it as an ensemble weighted coefficient.
2.2 Algorithm
Figure 3 gives the pseudo-code for the PELWS algorithm
consisting of two main functions: pep and pel. The PELWS
algorithm takes the predicting process as an index retrieving
process and takes the PEL process as an index updating process.
When an unsegmented (sss = null) text arrives, the pep function
will be triggered: (I) It calls each Vietnamese word segmenter for
multiple segmented result texts; (II) It retrieves the current SSFI
and calculates a probabilistic score according to the historical
precision described in Eq. (1) for each space symbol ' ' in each
segmented result text; and (III) It combines multiple probabilistic
scores to form an ensemble score described in Eq. (2) and use a
fixed threshold (score = 0) to make the final binary prediction for
each space symbol.
().()
() ().() ().()
ij c
ij
ij ij
cm
hash s s f
Ps s hash s s f hash s s f
(1)
12
1
(-1) ( )
n
i
i
s
core P s s
If = ' ', Then θ = 0
If = '_', Then θ = 1 (2)
When a segmented text arrives, the pel function firstly does the
same work to the above (I) step, and finally compares the
automatic segmented result with the golden standard to update the
values of the SSFI.
1. // PEL-based Word Segmentation (PELWS) Algorithm
2. Integer: n; // Number of Segmenters
3. Array[n]: ssfis; // Array of Syllable-Syllable Frequency Indexes
4. Array[n]: vwss; // Array of Vietnamese Word Segmenters
5. String: uss; // Unsegmented Sequence of Syllables
6. String: sss; // Segmented Sequence of Syllables
7. If (sss = null) Then sss pep(ssfis, vwss, uss);
8. Else ssfis pel(ssfis, vwss, uss, sss);
9.
10. // Probabilistic Ensemble Predictor
11. Function String: pep(ssfis, vwss, uss)
12. String: sss '';
13. String[n]: ssss; // Array of Segmented Sequences of Syllables
14. Integer: m; // Number of Syllables in Unsegmented Sequence
15. For Integer i 1 To n Do
16. ssss[i] vwss[i].segment(uss);
17. End For
18. For Integer j 1 To (m-1) Do
19. Float score 0;
20. String s1 getSyllable(uss, j); // Current Syllable
21. String s2 getSyllable(uss, j+1); // Next Syllable
22. For Integer i 1 To n Do
23. String space getSpace(ssss[i], j);
24. Float cor ssfis[i].hash(s1s2).fc(space);
25. Float mis ssfis[i].hash(s1s2).fm(space);
26. If (space = ' ') Then score score + cor / (cor + mis);
27. If (space = '_') Then score score - cor / (cor + mis);
28. End For
29. If (score > 0) Then sss sss + s1 + ' ';
30. Else sss sss + s1 + '_';
31. If (j = (m-1)) Then sss sss + s2;
32. End For
33. Return sss;
34.
35. // Probabilistic Ensemble Learners
36. Function Array[]: pel(ssfis, vwss, uss, sss)
37. String[n]: ssss; // Array of Segmented Sequences of Syllables
38. Integer: m; // Number of Syllables in Unsegmented Sequence
39. For Integer i 1 To n Do
40. ssss[i] vwss[i].segment(uss);
41. End For
42. For Integer j 1 To (m-1) Do
43. String s1 getSyllable(uss, j); // Current Syllable
44. String s2 getSyllable(uss, j+1); // Next Syllable
45. For Integer i 1 To n Do
46. String space1 getSpace(ssss[i], j);
47. String space2 getSpace(sss, j);
48. If (space1 = space2) Then ssfis[i].hash(s1s2).fc(space1) ++;
49. Else ssfis[i].hash(s1s2).fm(space1) ++;
50. End For
51. End For
52. Return ssfis;
Figure 3. PEL-based word segmentation algorithm.
932
The PELWS algorithm, independent of any concrete segmenter, is
a general meta-algorithm, whose space-time complexity mainly
depends on the SSFI storage space and the segment loops in the
pep and the pel functions. The SSFI is space-efficient owing to
the inherent compressible property of index files. Theoretically,
the property ensures that the SSFI storage space is proportional to
the total number of continuous two syllables and is independent
of the number of training texts. The updating or retrieving of SSFI
has constant time complexity according to the hash function. The
maximal space complexity O(np) and the maximal time
complexity O(nq) of the PEL framework are both acceptable in
practical NLP applications. Here, n means the number of
segmenters, usually a low number; p means the number of
continuous two syllables; and q means the total number of texts.
3. EXPERIMENT
3.1 Implementation
According to the PELWS algorithm, we firstly implement a
PELSegmenter (PEL), which combines three existing weak
segmenters: vnTokenizer 1 (VNT), RMMSegmenter (RMM) and
MMSegmenter 2 (MM). The VNT is an implementation of the
hybrid word segmentation algorithm. The RMM and the MM are
implemented from the dictionary-based reverse maximum
matching algorithm and the dictionary-based maximum matching
algorithm respectively.
Secondly, we implement a simple ensemble learning segmenter:
SELSegmenter (SEL) as the baseline, which applies a simple
majority rule to combine the VNT, RMM and MM. The SEL
resembles a simplified PELSegmenter without probabilistic
ensemble learners.
Because the VNT, RMM and MM all support dictionary
extension and the PELWS algorithm is a supervised algorithm, we
can add all words of training texts into the dictionary and build
three corresponding enhanced segmenters. Finally, we can get
two enhanced segmenters: enhanced PEL and enhanced SEL
through combining the three enhanced segmenters.
3.2 Corpus and Evaluation
In the experiment, we use a publicly available benchmark dataset
(Corpus for Vietnamese Word Segmentation3, CVWS), which
contains total 7807 sentences with word boundary labels from 305
Vietnamese newspaper articles in various domains.
The international Bakeoff [11] evaluation measure and associated
evaluation methodology are applied. In the experiment, we report
the classical Precision (P), Recall (R), F1-measure (F1) and Error
Rate (ER) to evaluate the result of segmenters. The value of P, R,
F1 belongs to [0, 1], where 1 is optimal, while the value of ER
belongs to [0, 1], where 0 is optimal.
PC
CM
(3)
RC
N
(4)
1 http://vlsp.vietlp.org:8080/demo/?page=resources&tool=tokenizer.
2 http://cbd.nichesite.org/CBD2013S002.htm.
3 http://www.jaist.ac.jp/~hieuxuan/vnwordseg/data.
2PR
F1 PR
(5)
ER
M
N
(6)
The above four measures are computed as Eq. (3) to Eq. (6)
separately. Where the N denotes the total number of words in the
manually segmented text, the C denotes the number of correctly
segmented words by an automatic segmenter, and the M denotes
the number of mistakenly segmented words by an automatic
segmenter.
3.3 Result and Discussion
In the experiment, we use three-fold cross validation by evenly
splitting the CVWS dataset into three parts and use two parts for
training and the remaining third for testing. We perform the
training-testing procedure three times and use the average of the
three performances as the final result.
Segmenter P R F1 ER
VNT 0.882 0.911 0.896 0.122
RMM 0.895 0.916 0.905 0.108
MM 0.899 0.921 0.910 0.103
SEL 0.899 0.923 0.911 0.103
PEL 0.938 0.945 0.942 0.062
Figure 4. Experimental result of the segmenters.
The experiment includes two parts. In part one, we run the five
segmenters (VNT, RMM, MM, SEL and PEL, mentioned in
Section 3.1) respectively in the Vietnamese word segmentation
task. Figure 4 presents the experimental result, which shows that
(I) the four measures of ensemble learning segmenter excel that of
each individual segmenter, for instance, the R value of SEL and
PEL is 0.923 and 0.945 respectively, while the R value of VNT,
RMM and MM is 0.911, 0.916 and 0.921 respectively; and (II)
the four measures of probabilistic ensemble learning segmenter
excel that of simple ensemble learning segmenter, for instance,
the P value of SEL and PEL is 0.899 and 0.938 respectively. The
experimental result proves that the PEL framework is effective to
improve the performance of existing algorithms for Vietnamese
word segmentation, and the probabilistic ensemble strategy
precedes the simple majority rule.
933
Segmenter P R F1 ER
VNT 0.924 0.907 0.915 0.075
RMM 0.946 0.920 0.933 0.052
MM 0.950 0.924 0.937 0.049
SEL 0.952 0.926 0.939 0.047
PEL 0.955 0.936 0.945 0.044
Figure 5. Experimental result of the enhanced segmenters.
In part two, we run the five enhanced segmenters respectively.
Figure 5 presents the experimental result, which shows that (I) the
four measures of each enhanced segmenter excel that of
corresponding original segmenter, for instance, the ER value of
VNT and enhanced VNT is 0.122 and 0.075 respectively; and (II)
the four measures of enhanced PEL are the best ones, for instance,
the F1 value of enhanced PEL is the best 0.945. The experimental
result verifies that the PELWS algorithm can combine weak
segmenters with an enhanced dictionary to form a strong
segmenter, which can achieve the optimal performance and even
exceeds the performance of some advanced machine learning
algorithms for word segmentation.
4. CONCLUSION
This paper elucidates that our PEL framework can take advantage
of complementary virtues from different algorithms, and is very
useful to improve the performance of existing algorithms for
Vietnamese word segmentation. Moreover, the PEL framework is
suitable to parallel running environment. If the PEL framework is
deployed on the reduplicate hardware for multi-segmenter, the
computational time to segment a text will theoretically near to the
lowest segmenter’s running time.
Further research will concern other sequence labeling problems
(such as named entity recognition, POS tagging, parsing, and
semantic role labeling) within the PEL framework. We will
transfer above research productions to other suitable oriental
languages like Thai, Japanese, Chinese, and so on.
5. REFERENCES
[1] Doan Nguyen. 2008. Query preprocessing: improving web
search through a Vietnamese word tokenization approach. In
Proceedings of the 31st Annual International ACM SIGIR
Conference on Research and Development in Information
Retrieval (Singapore, Singapore, July 20-24, 2008). SIGIR
'08. ACM New York, NY, USA, 765-766.
[2] Quang Thang Dinh, Hong Phuong Le, Thi Minh Huyen
Nguyen, Cam Tu Nguyen, Mathias Rossignol, Xuan Luong
Vu. 2008. Word segmentation of Vietnamese texts: a
comparison of approaches. In Proceedings of the 6th
International Conference on Language Resources and
Evaluation (Marrakech, Morocco, May 28-30, 2008). LREC
'08. European Language Resources Association, 1933-1936.
[3] Dinh Dien and Vu Thuy. 2006. A maximum entropy
approach for Vietnamese word segmentation. In Proceedings
of the 4th International Conference on Computer Sciences:
Research, Innovation and Vision for the Future (Ho Chi
Minh City, Vietnam, February 12-16, 2006). RIVF '06.
IEEE, 248-253.
[4] Cam Tu Nguyen, Trung Kien Nguyen, Xuan Hieu Phan, Le
Minh Nguyen, Quang Thuy Ha. 2006. Vietnamese word
segmentation with CRFs and SVMs: an investigation. In
Proceedings of the 20th Pacific Asia Conference on
Language, Information and Computation (Wuhan, China,
November 2-4, 2006). PACLIC 2006. Tsinghua University
Press, 215-222.
[5] Hong Phuong Le, Thi Minh Huyen Nguyen, Azim
Roussanaly, Tuong Vinh Ho. 2008. A hybrid approach to
word segmentation of Vietnamese texts. Language and
Automata Theory and Applications. Lecture Notes in
Computer Science, Volume 5196, 240-249.
[6] Thi Oanh Tran, Anh Cuong Le, Quang Thuy Ha. 2010.
Improving Vietnamese word segmentation and POS tagging
using MEM with various kinds of resources. Information and
Media Technologies. 5, 2 (2010), 890-909.
[7] Dang Duc Pham, Giang Binh Tran, Son Bao Pham. 2009. A
hybrid approach to Vietnamese word segmentation using
part of speech tags. In Proceedings of the 1st International
Conference on Knowledge and Systems Engineering (Hanoi,
Vietnam, October 13-17, 2009). KSE '09. IEEE Computer
Society Washington, DC, USA, 154-161.
[8] Thomas G. Dietterich. 2000. Ensemble methods in machine
learning. In Proceedings of the 1st International Workshop
on Multiple Classifier Systems (Cagliari, Italy, June, 2000).
MCS 2000. Lecture Notes in Computer Science, Volume
1857, 1-15.
[9] Wuying Liu and Ting Wang. 2010. Multi-field learning for
email spam filtering. In Proceedings of the 33rd Annual
International ACM SIGIR Conference on Research and
Development in Information Retrieval (Geneva, Switzerland,
July 19-23, 2010). SIGIR '10. ACM New York, NY, USA,
745-746.
[10] Wuying Liu and Ting Wang. 2012. Online active multi-field
learning for efficient email spam filtering. Knowledge and
Information Systems. 33, 1 (October, 2012), 117-136.
[11] Richard Sproat and Thomas Emerson. 2003. The first
international Chinese word segmentation Bakeoff. In
Proceedings of the 2nd SIGHAN Workshop on Chinese
Language Processing (Sapporo, Japan, July 11-12, 2003).
SIGHAN '03. Association for Computational Linguistics,
Stroudsburg, PA, USA, 133-143.
934