
Probabilistic Ensemble Learning for Vietnamese Word Segmentation


Wuying Liu
Luoyang University of Foreign Languages
471003 Luoyang, Henan, CHINA
Li Lin
Luoyang University of Foreign Languages
471003 Luoyang, Henan, CHINA
Word segmentation is a challenging issue, and the corresponding
algorithms can be used in many applications of natural language
processing. This paper addresses the problem of Vietnamese word
segmentation, proposes a probabilistic ensemble learning (PEL)
framework, and designs a novel PEL-based word segmentation
(PELWS) algorithm. Supported by the data structure of syllable-
syllable frequency index, the PELWS algorithm combines
multiple weak segmenters to form a strong segmenter within the
PEL framework. The experimental results show that the PELWS
algorithm can achieve the state-of-the-art performance in the
Vietnamese word segmentation task.
Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis
and Indexing – dictionaries, indexing methods, linguistic
processing; I.2.7 [Artificial Intelligence]: Natural Language Processing –
language parsing and understanding, text analysis.
General Terms
Algorithms, Performance, Experimentation, Languages
Keywords
Word Segmentation Algorithm, Probabilistic Ensemble Learning,
Multi-Segmenter, Vietnamese, Syllable-Syllable Frequency Index
Vietnamese is a monosyllabic language whose basic linguistic
unit is called “tiếng”, which resembles a traditional syllable in
phonetic form. Like Thai, Japanese and Chinese text, Vietnamese
text has no explicit separator between words.
Thus, identifying word boundaries is crucial for these
languages in many applications of natural language processing
(NLP) [1]. For instance, Sentence1 below is a raw
unsegmented Vietnamese sentence, and Sentence2
is its segmented counterpart.
Sentence1: Nhập khẩu tôm vào Mỹ năm qua vẫn cao kỷ lục do
nguồn cung cấp từ Thái Lan và Indonesia tăng .
Sentence2: Nhập_khẩu tôm vào Mỹ năm qua vẫn cao kỷ_lục do
nguồn_cung_cấp từ Thái_Lan và Indonesia tăng .
The above instance shows that Vietnamese text is a sequence of
syllables in which every two consecutive syllables are separated by a
space symbol. The space symbol is overloaded in raw text: it acts
either as a connector within a word or as a separator between words.
Therefore, the Vietnamese word segmentation task can be
defined as a binary classification problem for each space symbol. If
a space symbol is a connector within a word, we output an underscore
('_') to replace it. If a space symbol is a separator between
words, we keep it as a space symbol (' ') in the segmented text.
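The per-space formulation above can be illustrated with a minimal Python sketch. The helper names `boundary_labels` and `apply_labels` are hypothetical, and the sketch assumes that syllables are separated by single spaces and never contain an underscore themselves.

```python
def boundary_labels(segmented):
    """Return one label per gap between consecutive syllables.

    '_' marks a connector inside a word, ' ' marks a separator between words.
    Assumes syllables themselves contain neither spaces nor underscores.
    """
    normalized = " ".join(segmented.split())  # collapse repeated whitespace
    return [ch for ch in normalized if ch in (" ", "_")]


def apply_labels(syllables, labels):
    """Rebuild a segmented sentence from syllables and per-gap labels."""
    if len(labels) != len(syllables) - 1:
        raise ValueError("need exactly one label per gap")
    pieces = [syllables[0]]
    for label, syllable in zip(labels, syllables[1:]):
        pieces.append(label + syllable)
    return "".join(pieces)


labels = boundary_labels("Thái_Lan và Indonesia tăng")
rebuilt = apply_labels(["Thái", "Lan", "và", "Indonesia", "tăng"], labels)
```

With five syllables there are four gaps, so a segmenter only has to make four binary decisions for this fragment.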
Word segmentation has been widely investigated since the early days
of Vietnamese information processing, and many
effective algorithms have been proposed for Vietnamese word
segmentation [2]. The early dictionary-based word segmentation
algorithms mainly include the maximum matching algorithm and the
reverse maximum matching algorithm. A dictionary-based
algorithm is straightforward to implement, but its performance
depends heavily on a suitable dictionary. Subsequently, various
machine learning algorithms (such as maximum
entropy [3], support vector machines and conditional random fields
[4]) have treated word segmentation as a sequence labeling problem and
obtained better performance in Vietnamese
word segmentation. To resolve segmentation ambiguities, a
hybrid algorithm combines finite-state automata, regular expressions
and maximum matching techniques [5] to implement a highly
accurate Vietnamese tokenizer (vnTokenizer). More recently, part of
speech (POS) has been considered a helpful tag for word segmentation
and word ambiguity resolution [6]. Therefore, additional
resources (such as POS tags) are applied in word segmentation
algorithms [7].
Previous research shows that ensemble learning has statistical,
computational and representational advantages [8]. Ensemble
learning can improve the performance of existing algorithms for
spam filtering [9] and text categorization [10]. Motivated by
ensemble learning, we propose a novel probabilistic
ensemble learning (PEL) framework, which is a meta-framework
for integrating various existing word segmentation algorithms.
Under our PEL framework, we design and implement a PEL-based
word segmentation (PELWS) algorithm.
SIGIR '14, July 06–11, 2014, Gold Coast, QLD, Australia.
Copyright 2014 ACM 978-1-4503-2257-7/14/07…$15.00.
2.1 Framework
To combine the advantages of different word
segmentation algorithms, we propose a PEL framework based on a
weighted-voting-like strategy. Figure 1 shows the PEL framework
for Vietnamese word segmentation. The framework mainly
includes several pairs of <Segmenter, PELearner (Probabilistic
Ensemble Learner)> and a PEPredictor (Probabilistic Ensemble
Predictor). The Segmenters are implemented from different
Vietnamese word segmentation algorithms. For each Segmenter,
we design and install a PELearner, which receives gold-standard
segmented texts and automatically segmented texts from its
corresponding Segmenter, and learns a probabilistic ensemble
model from the two texts. The probabilistic ensemble model is
stored as a syllable-syllable frequency index (SSFI). The
PEPredictor combines the segmented texts of the multiple Segmenters
according to the SSFIs and makes the final prediction for each
unsegmented text.
Figure 1. Probabilistic ensemble learning framework.
Figure 2 shows the SSFI structure, organized as a hash table. Each
table entry of the SSFI is a key-value pair <Key, Value>, where
each key SiSj denotes a sequence of two consecutive syllables and
each value consists of 4 integers. The integer fc(·) denotes the
number of correct predictions and the integer fm(·) denotes the number
of mistaken predictions. The symbol ('_') means that the
two consecutive syllables Si and Sj are in the same word, and the
symbol (' ') means that the two consecutive syllables Si and Sj are
not in the same word. The hash function hash(SiSj) maps the sequence
SiSj to the address of the 4 integers.
Figure 2. Syllable-syllable frequency index.
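As a rough illustration of the SSFI layout described above, the following Python sketch models one index as a dictionary keyed by syllable pairs, with four counters per entry. The nested-dictionary layout, the name `new_entry` and all counts are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict


def new_entry():
    # Four integers per syllable pair: fc and fm for each predicted symbol.
    return {"fc": {" ": 0, "_": 0}, "fm": {" ": 0, "_": 0}}


# One SSFI per segmenter; Python's dict plays the role of the hash table.
ssfi = defaultdict(new_entry)

key = ("Thái", "Lan")        # the key SiSj
ssfi[key]["fc"]["_"] += 3    # predicted '_' and was right 3 times (made-up count)
ssfi[key]["fm"]["_"] += 1    # predicted '_' and was wrong once (made-up count)
ssfi[key]["fm"][" "] += 2    # predicted ' ' and was wrong twice (made-up count)

total_updates = sum(sum(counter.values()) for counter in ssfi[key].values())
```

Because each entry holds only counters, the index grows with the number of distinct syllable pairs seen, not with the number of training sentences.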
Supported by the SSFI, we design a supervised PELWS algorithm,
which treats each segmenter's historical precision as an ensemble
probability and uses it as the weighting coefficient in the ensemble.
2.2 Algorithm
Figure 3 gives the pseudo-code for the PELWS algorithm, which
consists of two main functions: pep and pel. The PELWS
algorithm treats the predicting process as an index retrieving
process and the PEL process as an index updating process.
When an unsegmented (sss = null) text arrives, the pep function
is triggered: (I) it calls each Vietnamese word segmenter to obtain
multiple segmented result texts; (II) it retrieves the current SSFI
and calculates a probabilistic score according to the historical
precision described in Eq. (1) for each space symbol in each
segmented result text; and (III) it combines the multiple probabilistic
scores to form an ensemble score described in Eq. (2) and uses a
fixed threshold (score = 0) to make the final binary prediction for
each space symbol.
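$$P(s_i s_j) = \frac{hash(s_i s_j).f_c(\cdot)}{hash(s_i s_j).f_c(\cdot) + hash(s_i s_j).f_m(\cdot)} \qquad (1)$$

$$Score = \sum_{k=1}^{n} (-1)^{\theta_k} P_k(s_i s_j), \qquad \theta_k = \begin{cases} 0 & \text{if the } k\text{-th segmenter outputs ' '} \\ 1 & \text{if the } k\text{-th segmenter outputs '\_'} \end{cases} \qquad (2)$$

Here (·) stands for the symbol predicted by the segmenter, k indexes the n
segmenters, and P_k is computed from the SSFI of the k-th segmenter.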
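A small worked example may help to read Eq. (1) and Eq. (2). The counts below are invented for illustration only: three segmenters look at the same syllable pair, the first and third predict '_' and the second predicts ' '.

```latex
\begin{align*}
P_1 &= \frac{3}{3 + 1} = 0.75, & \theta_1 &= 1 \quad (\text{predicts '\_'}) \\
P_2 &= \frac{1}{1 + 1} = 0.50, & \theta_2 &= 0 \quad (\text{predicts ' '}) \\
P_3 &= \frac{4}{4 + 1} = 0.80, & \theta_3 &= 1 \quad (\text{predicts '\_'}) \\
Score &= -0.75 + 0.50 - 0.80 = -1.05
\end{align*}
```

Since the score is not greater than 0, the final prediction for this space symbol would be '_', that is, the two syllables are joined into one word.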
When a segmented text arrives, the pel function first performs
step (I) above, and then compares each
automatically segmented result with the gold standard to update the
values of the SSFI.
// PEL-based Word Segmentation (PELWS) Algorithm
Integer: n; // Number of Segmenters
Array[n]: ssfis; // Array of Syllable-Syllable Frequency Indexes
Array[n]: vwss; // Array of Vietnamese Word Segmenters
String: uss; // Unsegmented Sequence of Syllables
String: sss; // Segmented Sequence of Syllables
If (sss = null) Then sss ← pep(ssfis, vwss, uss);
Else ssfis ← pel(ssfis, vwss, uss, sss);

// Probabilistic Ensemble Predictor
Function String: pep(ssfis, vwss, uss)
  String: sss ← '';
  String[n]: ssss; // Array of Segmented Sequences of Syllables
  Integer: m; // Number of Syllables in Unsegmented Sequence
  For Integer i ← 1 To n Do
    ssss[i] ← vwss[i].segment(uss);
  End For
  For Integer j ← 1 To (m-1) Do
    Float score ← 0;
    String s1 ← getSyllable(uss, j); // Current Syllable
    String s2 ← getSyllable(uss, j+1); // Next Syllable
    For Integer i ← 1 To n Do
      String space ← getSpace(ssss[i], j);
      Float cor ← ssfis[i].hash(s1s2).fc(space);
      Float mis ← ssfis[i].hash(s1s2).fm(space);
      If (space = ' ') Then score ← score + cor / (cor + mis);
      If (space = '_') Then score ← score - cor / (cor + mis);
    End For
    If (score > 0) Then sss ← sss + s1 + ' ';
    Else sss ← sss + s1 + '_';
    If (j = (m-1)) Then sss ← sss + s2;
  End For
  Return sss;

// Probabilistic Ensemble Learners
Function Array[]: pel(ssfis, vwss, uss, sss)
  String[n]: ssss; // Array of Segmented Sequences of Syllables
  Integer: m; // Number of Syllables in Unsegmented Sequence
  For Integer i ← 1 To n Do
    ssss[i] ← vwss[i].segment(uss);
  End For
  For Integer j ← 1 To (m-1) Do
    String s1 ← getSyllable(uss, j); // Current Syllable
    String s2 ← getSyllable(uss, j+1); // Next Syllable
    For Integer i ← 1 To n Do
      String space1 ← getSpace(ssss[i], j);
      String space2 ← getSpace(sss, j);
      If (space1 = space2) Then ssfis[i].hash(s1s2).fc(space1) ++;
      Else ssfis[i].hash(s1s2).fm(space1) ++;
    End For
  End For
  Return ssfis;
Figure 3. PEL-based word segmentation algorithm.
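The pseudo-code in Figure 3 can be approximated by the following Python sketch. It is a minimal reading of the figure under stated assumptions, not the authors' implementation: segmenters are plain functions that only turn some spaces into underscores, an SSFI is a dictionary keyed by syllable pairs, and a segmenter with no history for a pair contributes nothing to the score (the figure does not say how a zero denominator is handled). The toy segmenters `always_split` and `always_join` are hypothetical.

```python
from collections import defaultdict


def new_entry():
    return {"fc": {" ": 0, "_": 0}, "fm": {" ": 0, "_": 0}}


def gaps(segmented):
    """Symbols (' ' or '_') found between consecutive syllables."""
    return [ch for ch in segmented if ch in (" ", "_")]


def pel(ssfis, segmenters, uss, gold):
    """Index update: compare each segmenter's output with the gold standard."""
    syllables = uss.split()
    gold_gaps = gaps(gold)
    for ssfi, segment in zip(ssfis, segmenters):
        predicted = gaps(segment(uss))
        for j, (p, g) in enumerate(zip(predicted, gold_gaps)):
            entry = ssfi[(syllables[j], syllables[j + 1])]
            entry["fc" if p == g else "fm"][p] += 1
    return ssfis


def pep(ssfis, segmenters, uss):
    """Index retrieval: weighted vote with historical precision as weight."""
    syllables = uss.split()
    if not syllables:
        return ""
    outputs = [gaps(segment(uss)) for segment in segmenters]
    pieces = []
    for j in range(len(syllables) - 1):
        key = (syllables[j], syllables[j + 1])
        score = 0.0
        for ssfi, out in zip(ssfis, outputs):
            symbol = out[j]
            entry = ssfi.get(key)  # .get() avoids creating entries here
            if entry is None:
                continue  # assumption: no history, no vote
            cor, mis = entry["fc"][symbol], entry["fm"][symbol]
            if cor + mis == 0:
                continue  # assumption: no history for this symbol, no vote
            precision = cor / (cor + mis)
            score += precision if symbol == " " else -precision
        pieces.append(syllables[j] + (" " if score > 0 else "_"))
    pieces.append(syllables[-1])
    return "".join(pieces)


# Two toy segmenters, used only to exercise the sketch.
def always_split(text):
    return text


def always_join(text):
    return text.replace(" ", "_")


segmenters = [always_split, always_join]
ssfis = [defaultdict(new_entry) for _ in segmenters]
pel(ssfis, segmenters, "Thái Lan và Indonesia tăng", "Thái_Lan và Indonesia tăng")
result = pep(ssfis, segmenters, "Thái Lan và Indonesia tăng")
```

Even though both toy segmenters are wrong on part of the sentence, the learned precisions let the ensemble follow whichever segmenter has been right for each syllable pair.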
The PELWS algorithm is a general meta-algorithm, independent of any
concrete segmenter. Its space and time complexity mainly
depend on the SSFI storage space and the segmentation loops in the
pep and pel functions. The SSFI is space-efficient owing to
the inherently compressible nature of index files: in theory,
the SSFI storage space is proportional to
the number of distinct pairs of consecutive syllables and is independent
of the number of training texts. Updating or retrieving an SSFI entry
takes constant time through the hash function. The
maximal space complexity O(np) and the maximal time
complexity O(nq) of the PEL framework are both acceptable in
practical NLP applications. Here, n is the number of
segmenters, usually small; p is the number of
pairs of consecutive syllables; and q is the total number of texts.
3.1 Implementation
Following the PELWS algorithm, we first implement a
PELSegmenter (PEL), which combines three existing weak
segmenters: vnTokenizer (VNT), RMMSegmenter (RMM) and
MMSegmenter (MM). The VNT is an implementation of the
hybrid word segmentation algorithm. The RMM and the MM
implement the dictionary-based reverse maximum
matching algorithm and the dictionary-based maximum matching
algorithm respectively.
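The paper does not give implementation details for the MM and RMM segmenters. The following Python sketch shows the textbook greedy longest-match idea in both directions; the function name, the `max_len` limit and the one-entry dictionary are hypothetical.

```python
def maximum_matching(syllables, dictionary, max_len=4, reverse=False):
    """Greedy longest-match segmentation over a list of syllables.

    dictionary holds multi-syllable words written with spaces.
    reverse=True scans from the end of the sentence (reverse maximum matching).
    """
    seq = list(reversed(syllables)) if reverse else list(syllables)
    words = []
    i = 0
    while i < len(seq):
        # Try the longest candidate first, then shrink down to one syllable.
        for size in range(min(max_len, len(seq) - i), 0, -1):
            chunk = seq[i:i + size]
            if reverse:
                chunk = chunk[::-1]  # restore reading order
            if size == 1 or " ".join(chunk) in dictionary:
                words.append("_".join(chunk))
                i += size
                break
    if reverse:
        words.reverse()
    return " ".join(words)


toy_dictionary = {"Thái Lan"}  # hypothetical one-word dictionary
sentence = ["Thái", "Lan", "và", "Indonesia", "tăng"]
forward = maximum_matching(sentence, toy_dictionary)
backward = maximum_matching(sentence, toy_dictionary, reverse=True)
```

On this fragment both directions agree; the two scans can disagree when overlapping dictionary words compete for the same syllable, which is one reason to combine them in an ensemble.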
Second, we implement a simple ensemble learning segmenter,
SELSegmenter (SEL), as the baseline. It applies a simple
majority rule to combine the VNT, RMM and MM. The SEL
resembles a simplified PELSegmenter without the probabilistic
ensemble learners.
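A simple majority rule of this kind could look like the following Python sketch. It assumes that the vote is taken independently for each space symbol and that all inputs segment the same syllable sequence; the paper does not specify either point, and the function name is hypothetical.

```python
from collections import Counter


def majority_vote(segmentations):
    """Combine several segmentations of one sentence by per-gap majority."""
    syllables = segmentations[0].replace("_", " ").split()
    if not syllables:
        return ""
    gap_lists = [[ch for ch in s if ch in (" ", "_")] for s in segmentations]
    pieces = []
    for j in range(len(syllables) - 1):
        votes = Counter(gap_list[j] for gap_list in gap_lists)
        # Ties (possible only with an even number of voters) fall back to ' '.
        symbol = "_" if votes["_"] > votes[" "] else " "
        pieces.append(syllables[j] + symbol)
    pieces.append(syllables[-1])
    return "".join(pieces)


combined = majority_vote([
    "Thái_Lan và Indonesia tăng",
    "Thái Lan và Indonesia tăng",
    "Thái_Lan và_Indonesia tăng",
])
```

Unlike the probabilistic ensemble, every segmenter has the same weight here regardless of how reliable it has been for a given syllable pair.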
Because the VNT, RMM and MM all support dictionary
extension and the PELWS algorithm is a supervised algorithm, we
can add all words of the training texts to the dictionary and build
three corresponding enhanced segmenters. Finally, we obtain
two enhanced ensemble segmenters, enhanced PEL and enhanced SEL,
by combining the three enhanced segmenters.
3.2 Corpus and Evaluation
In the experiment, we use a publicly available benchmark dataset
(Corpus for Vietnamese Word Segmentation, CVWS), which
contains a total of 7807 sentences with word boundary labels from 305
Vietnamese newspaper articles in various domains.
The international Bakeoff [11] evaluation measures and associated
evaluation methodology are applied. In the experiment, we report
the classical Precision (P), Recall (R), F1-measure (F1) and Error
Rate (ER) to evaluate the results of the segmenters. The values of P, R
and F1 lie in [0, 1], where 1 is optimal, while the value of ER
lies in [0, 1], where 0 is optimal.
The four measures are computed by Eq. (3) to Eq. (6)
respectively, where N denotes the total number of words in the
manually segmented text, C denotes the number of words correctly
segmented by an automatic segmenter, and M denotes
the number of words mistakenly segmented by an automatic segmenter.

$$P = \frac{C}{C + M} \qquad (3)$$

$$R = \frac{C}{N} \qquad (4)$$

$$F1 = \frac{2 \times P \times R}{P + R} \qquad (5)$$

$$ER = \frac{M}{N} \qquad (6)$$
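The following Python sketch computes the four measures from N, C and M as defined above. The formulas are the usual Bakeoff-style definitions and are an assumption here, although they are consistent with the reported tables (for example, R = 0.911 and P = 0.882 imply ER of about 0.122). The function name and the counts in the example are hypothetical.

```python
def bakeoff_metrics(n_gold, n_correct, n_mistaken):
    """Return (P, R, F1, ER) from word counts.

    n_gold:     N, words in the manually segmented text
    n_correct:  C, words segmented correctly by the automatic segmenter
    n_mistaken: M, words segmented mistakenly by the automatic segmenter
    Assumes all denominators are non-zero.
    """
    precision = n_correct / (n_correct + n_mistaken)
    recall = n_correct / n_gold
    f1 = 2 * precision * recall / (precision + recall)
    error_rate = n_mistaken / n_gold
    return precision, recall, f1, error_rate


# Made-up counts: 1000 gold words, 900 correct and 100 mistaken outputs.
p, r, f1, er = bakeoff_metrics(1000, 900, 100)
```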
3.3 Result and Discussion
In the experiment, we use three-fold cross validation by evenly
splitting the CVWS dataset into three parts and use two parts for
training and the remaining third for testing. We perform the
training-testing procedure three times and use the average of the
three performances as the final result.
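The three-fold procedure can be sketched in Python as follows. How the authors divided the sentences is not stated, so the interleaved split below is an assumption, and `k_fold_splits` and `average` are hypothetical helper names.

```python
def k_fold_splits(sentences, k=3):
    """Yield (train, test) pairs; each fold serves as the test set once."""
    folds = [sentences[i::k] for i in range(k)]  # interleaved, near-even folds
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test


def average(values):
    return sum(values) / len(values)


# Toy data: integers stand in for labeled sentences.
splits = list(k_fold_splits(list(range(9))))
fold_sizes = [(len(train), len(test)) for train, test in splits]
```

The final score for each measure would then be the `average` of the three per-fold values.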
Segmenter  P      R      F1     ER
VNT        0.882  0.911  0.896  0.122
RMM        0.895  0.916  0.905  0.108
MM         0.899  0.921  0.910  0.103
SEL        0.899  0.923  0.911  0.103
PEL        0.938  0.945  0.942  0.062
Figure 4. Experimental result of the segmenters.
The experiment includes two parts. In part one, we run the five
segmenters (VNT, RMM, MM, SEL and PEL, described in
Section 3.1) on the Vietnamese word segmentation
task. Figure 4 presents the experimental result, which shows that
(I) the ensemble learning segmenters match or outperform
each individual segmenter on all four measures; for instance, the R
values of SEL and PEL are 0.923 and 0.945 respectively, while the R
values of VNT, RMM and MM are 0.911, 0.916 and 0.921 respectively; and (II)
the probabilistic ensemble learning segmenter
outperforms the simple ensemble learning segmenter on all four
measures; for instance,
the P values of SEL and PEL are 0.899 and 0.938 respectively. The
experimental result indicates that the PEL framework is effective in
improving the performance of existing algorithms for Vietnamese
word segmentation, and that the probabilistic ensemble strategy
outperforms the simple majority rule.
Segmenter  P      R      F1     ER
VNT        0.924  0.907  0.915  0.075
RMM        0.946  0.920  0.933  0.052
MM         0.950  0.924  0.937  0.049
SEL        0.952  0.926  0.939  0.047
PEL        0.955  0.936  0.945  0.044
Figure 5. Experimental result of the enhanced segmenters.
In part two, we run the five enhanced segmenters.
Figure 5 presents the experimental result, which shows that (I)
each enhanced segmenter improves on its
original segmenter in P, F1 and ER, although R drops slightly for VNT
(0.911 to 0.907) and PEL (0.945 to 0.936); for instance, the ER values of
VNT and enhanced VNT are 0.122 and 0.075 respectively; and (II)
enhanced PEL is the best of the enhanced segmenters on all four
measures; for instance, its F1 value of 0.945 is the highest. The experimental
result indicates that the PELWS algorithm can combine weak
segmenters with an enhanced dictionary to form a strong
segmenter, which achieves the best performance among the compared
segmenters and even exceeds the performance of some advanced machine
learning algorithms for word segmentation.
This paper shows that our PEL framework can take advantage
of the complementary strengths of different algorithms and is
useful for improving the performance of existing algorithms for
Vietnamese word segmentation. Moreover, the PEL framework is
well suited to parallel execution. If the PEL framework is
deployed on replicated hardware for the multiple segmenters, the
time needed to segment a text will in theory approach the
slowest segmenter's running time.
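The parallel deployment mentioned above can be sketched on a single machine with Python's standard `concurrent.futures` module. This is only an illustration: the function name and the toy segmenters are hypothetical, and threads shorten wall-clock time only when the segmenters spend their time outside the Python interpreter lock (for example, in external processes or I/O).

```python
from concurrent.futures import ThreadPoolExecutor


def segment_in_parallel(segmenters, uss):
    """Run every segmenter on the same text concurrently, keeping input order."""
    workers = max(1, len(segmenters))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(segment, uss) for segment in segmenters]
        # Waiting for all results takes about as long as the slowest segmenter.
        return [future.result() for future in futures]


outputs = segment_in_parallel(
    [lambda text: text, lambda text: text.replace(" ", "_")],
    "Thái Lan",
)
```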
Further research will address other sequence labeling problems
(such as named entity recognition, POS tagging, parsing, and
semantic role labeling) within the PEL framework. We will also
transfer the above results to other suitable
languages such as Thai, Japanese and Chinese.
[1] Doan Nguyen. 2008. Query preprocessing: improving web
search through a Vietnamese word tokenization approach. In
Proceedings of the 31st Annual International ACM SIGIR
Conference on Research and Development in Information
Retrieval (Singapore, Singapore, July 20-24, 2008). SIGIR
'08. ACM New York, NY, USA, 765-766.
[2] Quang Thang Dinh, Hong Phuong Le, Thi Minh Huyen
Nguyen, Cam Tu Nguyen, Mathias Rossignol, Xuan Luong
Vu. 2008. Word segmentation of Vietnamese texts: a
comparison of approaches. In Proceedings of the 6th
International Conference on Language Resources and
Evaluation (Marrakech, Morocco, May 28-30, 2008). LREC
'08. European Language Resources Association, 1933-1936.
[3] Dinh Dien and Vu Thuy. 2006. A maximum entropy
approach for Vietnamese word segmentation. In Proceedings
of the 4th International Conference on Computer Sciences:
Research, Innovation and Vision for the Future (Ho Chi
Minh City, Vietnam, February 12-16, 2006). RIVF '06.
IEEE, 248-253.
[4] Cam Tu Nguyen, Trung Kien Nguyen, Xuan Hieu Phan, Le
Minh Nguyen, Quang Thuy Ha. 2006. Vietnamese word
segmentation with CRFs and SVMs: an investigation. In
Proceedings of the 20th Pacific Asia Conference on
Language, Information and Computation (Wuhan, China,
November 2-4, 2006). PACLIC 2006. Tsinghua University
Press, 215-222.
[5] Hong Phuong Le, Thi Minh Huyen Nguyen, Azim
Roussanaly, Tuong Vinh Ho. 2008. A hybrid approach to
word segmentation of Vietnamese texts. Language and
Automata Theory and Applications. Lecture Notes in
Computer Science, Volume 5196, 240-249.
[6] Thi Oanh Tran, Anh Cuong Le, Quang Thuy Ha. 2010.
Improving Vietnamese word segmentation and POS tagging
using MEM with various kinds of resources. Information and
Media Technologies. 5, 2 (2010), 890-909.
[7] Dang Duc Pham, Giang Binh Tran, Son Bao Pham. 2009. A
hybrid approach to Vietnamese word segmentation using
part of speech tags. In Proceedings of the 1st International
Conference on Knowledge and Systems Engineering (Hanoi,
Vietnam, October 13-17, 2009). KSE '09. IEEE Computer
Society Washington, DC, USA, 154-161.
[8] Thomas G. Dietterich. 2000. Ensemble methods in machine
learning. In Proceedings of the 1st International Workshop
on Multiple Classifier Systems (Cagliari, Italy, June, 2000).
MCS 2000. Lecture Notes in Computer Science, Volume
1857, 1-15.
[9] Wuying Liu and Ting Wang. 2010. Multi-field learning for
email spam filtering. In Proceedings of the 33rd Annual
International ACM SIGIR Conference on Research and
Development in Information Retrieval (Geneva, Switzerland,
July 19-23, 2010). SIGIR '10. ACM New York, NY, USA,
[10] Wuying Liu and Ting Wang. 2012. Online active multi-field
learning for efficient email spam filtering. Knowledge and
Information Systems. 33, 1 (October, 2012), 117-136.
[11] Richard Sproat and Thomas Emerson. 2003. The first
international Chinese word segmentation Bakeoff. In
Proceedings of the 2nd SIGHAN Workshop on Chinese
Language Processing (Sapporo, Japan, July 11-12, 2003).
SIGHAN '03. Association for Computational Linguistics,
Stroudsburg, PA, USA, 133-143.