A Hybrid Approach
to Vietnamese Word Segmentation
Tuan-Phong Nguyen
Faculty of Information Technology
VNU University of Engineering and Technology
No. 144 Xuan Thuy Street
Dich Vong Hau Ward, Cau Giay District
Hanoi, Vietnam
Email: phongnt_570@vnu.edu.vn
Anh-Cuong Le
Faculty of Information Technology
Ton Duc Thang University
No. 19 Nguyen Huu Tho Street
Tan Phong Ward, District 7
Ho Chi Minh City, Vietnam
Email: leanhcuong@tdt.edu.vn
Abstract—Word segmentation is the very first task for Vietnamese language processing. Word-segmented text is the input of almost all other NLP tasks. This task faces some challenges due to specific characteristics of the language. As in many other Asian languages such as Japanese, Korean and Chinese, white spaces in Vietnamese are not always used as word separators and a word may contain one or more syllables. In this paper, we propose an efficient hybrid approach to detect word boundaries for Vietnamese texts using logistic regression as a binary classifier combined with the longest matching algorithm. First, the longest matching algorithm is used to catch words that contain more than two syllables in the input sentence. Next, the system utilizes the classifier to determine the boundaries of 2-syllable words and proper names. Then, the predictions having low confidence produced by the classifier are verified by a dictionary to get the final result. Our system can achieve an F-measure of 98.82%, which is the most accurate result for Vietnamese word segmentation to the best of our knowledge. Moreover, the system also has a high speed: it can run word segmentation at nearly 34k tokens per second.
I. INTRODUCTION
In linguistics, a word is the smallest meaningful unit of speech
that can stand by itself. Vietnamese, an Austroasiatic language,
uses a Latin alphabet with additional diacritics and certain
letters. However, unlike many occidental languages using Latin
alphabets, Vietnamese has similar characteristics to other East
Asian languages such as Japanese, Korean, Chinese and Thai
in which white spaces are not always word separators and
a word may consist of more than one syllable with many
ambiguous cases. This leads to some challenges in Vietnamese
word segmentation.
Studies on Vietnamese word segmentation used either
dictionary-based algorithms, statistical models or hybrid ap-
proaches. Recent studies using hybrid approaches such as [1],
[2], [3] can provide state-of-the-art results at approximately
97%.
In this study, we propose an efficient hybrid approach
to solve this task. In our approach, word segmentation is
represented as a binary classification problem in which we
have to determine the label of each white space in input
text. These two labels are SPACE (separator of two syllables
which belong to two different words) and UNDERSCORE
(separator of two syllables inside a word). Our system is
mainly based on three steps. First, we use a forward longest
matching algorithm to determine the boundary of all words
having at least three syllables. Next, the classifier using logistic
regression helps to detect the boundary of 2-syllable words
and proper names. Finally, we continue to use the dictionary
to recheck the predictions having low confidence produced
by the machine learning process and return final labels for
white spaces. For experiments, we evaluate our approach using
10-fold cross-validation on Vietnamese Treebank corpora [4]
of 75k manually word-segmented sentences. Our system can
yield an F-measure of 98.82% which is the best result for
Vietnamese word segmentation known to us. Furthermore, the
system can also perform at a high speed of nearly 34k tokens
per second when running on a personal computer.
The rest of this paper is organized as follows. In Section II,
we talk about the difficulties in Vietnamese word segmenta-
tion. In Section III, the methods used in other studies to resolve
word segmentation task are discussed. Section IV provides
details of our approach. We report and discuss the
experimental results of our system in Section V. Finally, we
make some conclusions on this work in Section VI.
II. DIFFICULTIES IN VIETNAMESE WORD SEGMENTATION
Vietnamese is an inflectionless language in which a word never changes its form. Vietnamese words are made of one or more syllables. A word which contains only one syllable is called a single word. On the other hand, a word which is composed of more than one syllable is called a compound word. The frequency of each kind of word differs from the others.
We made some statistics on the dictionary provided by the VLSP project¹, and the frequency analysis revealed some useful facts. Most of the words in this dictionary (71%) are 2-syllable words. Single words account for 17.67% of total words. Therefore, the percentage of over-2-syllable words is just under 12%. This low frequency leads us to a simple idea: we can just use the dictionary to cover those words and then find an efficient way to deal with the other words in the input text. There are two kinds of remaining words that we have to care about, which are 2-syllable words and proper names (in Vietnamese, names of people and locations are considered lexical units).

¹ http://vlsp.hpda.vn:8080/demo/?page=home
One simple method to deal with proper names is to compose
all consecutive upper-case syllables into a word. This method
is obviously not good in many cases such as when two proper
names appear consecutively.
For 2-syllable words, the easiest way is to scan through the input sentence and connect every two consecutive syllables that compose a word in the dictionary. There are many ambiguous cases where this method produces wrong results. One of the most frequent cases is called overlap ambiguity, in which a sentence has three consecutive syllables s_i s_{i+1} s_{i+2} where both s_i s_{i+1} and s_{i+1} s_{i+2} are words in the dictionary but, in the current context, only one of them is the right word. In another common situation, a word composed of two consecutive syllables s_i s_{i+1} is in the dictionary, but in the current context these two syllables are actually two single words. Another significant case that this method cannot handle is the out-of-vocabulary problem, in which two consecutive syllables s_i s_{i+1} actually compose a right word in its context but that word has not appeared in the dictionary.
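As a concrete illustration of the overlap ambiguity just described, the following is a small sketch of ours (not taken from the paper; the toy dictionary and the classic example sentence are assumptions) showing how a naive pairing of consecutive dictionary syllables picks the wrong reading:

# Sketch of the naive 2-syllable matching described above. With both
# "học sinh" (student) and "sinh học" (biology) in the dictionary, greedy
# left-to-right pairing segments "Học sinh học sinh học" as
# "Học_sinh học_sinh học", while the intended reading is
# "Học_sinh học sinh_học" ("Students study biology").
def naive_pairing(syllables, dictionary):
    out, i = [], 0
    while i < len(syllables):
        if i + 1 < len(syllables) and f"{syllables[i]} {syllables[i + 1]}".lower() in dictionary:
            out.append(syllables[i] + "_" + syllables[i + 1])
            i += 2
        else:
            out.append(syllables[i])
            i += 1
    return " ".join(out)

toy_dictionary = {"học sinh", "sinh học"}  # assumed toy dictionary
print(naive_pairing("Học sinh học sinh học".split(), toy_dictionary))  # Học_sinh học_sinh học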
Taken together, it is necessary to have more effective techniques to deal with those problems. In the next section, we discuss the approaches that have been studied for word segmentation of Vietnamese and other languages' texts.
III. RELATED WORKS
There are many effective approaches that have been studied
to resolve the word segmentation task [5], [6]. The first and traditional approach is dictionary-based. There are two common techniques of this approach, namely maximum matching
(MM) and longest matching (LM). While MM algorithm
aims to find the segmentation candidates by segmenting input
sentence into a sequence with the smallest number of words,
LM algorithm tends to scan through the sentence and at each
syllable, it finds the longest word composed of this syllable
and the next consecutive ones. Systems using this kind of
approach for Chinese can gain very promising results [7], [8].
However, for Vietnamese, this simple approach seems to be
unable to deal with out-of-vocabulary problem and overlap
ambiguity.
The second one is statistical approach. As in many other
core NLP tasks, this approach has proved to be good for word
segmentation too. For instance, the methods using Conditional
Random Fields (CRFs) and Support Vector Machines (SVMs)
in [9] can reach results of over 94% while evaluating on a
small corpus of 7800 Vietnamese sentences. Other studies
using CRFs [10], SVMs [11], Hidden Markov Model (HMM)
[12], [13], n-gram model [14], Maximum Entropy (MaxEnt)
[15], [16] and probabilistic ensemble learning [17] also produce high accuracy for Vietnamese and other East Asian languages. Statistical approaches help to gain good results for Thai [18], too.
Although statistical algorithms can provide a good way
to deal with ambiguous problems, both of those approaches
still have their own limitations. Thus, some studies combined
these two approaches into their systems. Some hybrid approaches for Vietnamese word segmentation were presented that use a Weighted Finite State Transducer (WFST) with a Neural Network [3], combine MM with an n-gram language model [1], or combine MM with stochastic models using part-of-speech information [2]. These approaches are able to reach
state-of-the-art results at approximately 97%. For Chinese, the
study in [19] proposes a lattice-based framework for joint
Chinese word segmentation, POS tagging and parsing which
helps to significantly improve the accuracy of the three sub-
tasks. A joint model of word segmentation and POS tagging was
also used for Japanese [20].
IV. OUR APPROACH
In this section, we first talk about how we represent the word segmentation task. Next, we describe the three main components
of our segmentation system before proposing its architecture.
A. Problem representation
The two main ways of problem representation for Viet-
namese word segmentation are syllable-based and white-
space-based.
The first one can be described as a sequential tagging task.
For example, in the approach presented in [9], there are three
labels for syllables, which are B_W (Begin of a Word), I_W
(Inside of a Word) and O (Outside of a word). This approach is
implemented in JVnSegmenter [21], a toolkit for Vietnamese
word segmentation.
The second way is to cast Vietnamese word segmentation
as a binary classification problem for white spaces. It should
be repeated that in Vietnamese, there are two kinds of white
space. The first one is separator of two syllables which
belong to two different words (SPACE) and the second one
is separator of two syllables inside a word (UNDERSCORE).
PELSegmenter [17] and DongDu² are toolkits that use this
problem representation.
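To make this representation concrete, here is a minimal sketch of ours (the sentence and labels are the same textbook example used above, not code from any of these toolkits) showing how a sequence of SPACE/UNDERSCORE decisions maps directly to word-segmented text:

# Sketch: turning white-space labels into word-segmented text.
SPACE, UNDERSCORE = 0, 1

def join_segmentation(syllables, labels):
    # labels[i] is the label of the white space between syllables[i] and syllables[i + 1]
    assert len(labels) == len(syllables) - 1
    out = [syllables[0]]
    for syllable, label in zip(syllables[1:], labels):
        out.append(("_" if label == UNDERSCORE else " ") + syllable)
    return "".join(out)

syllables = ["Học", "sinh", "học", "sinh", "học"]
labels = [UNDERSCORE, SPACE, SPACE, UNDERSCORE]
print(join_segmentation(syllables, labels))  # Học_sinh học sinh_học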
We use the second way of problem representation for our
system because of its simplicity. Moreover, in this way, it is
possible to modify the label of a white space without affecting the labels of the other white spaces beside it. In the next section,
we describe the simplest component of our system, longest
matching algorithm.
B. Longest matching
Because over-2-syllable words have low frequency and, in our observation, ambiguity among them is negligible, we just use the dictionary to deal with those words. The dictionary-based technique in our system is longest matching, and the dictionary is the one used in Section II. The remaining work after longest matching is to handle 2-syllable words and proper names efficiently. Our binary classifier using logistic regression is responsible for this task.
² https://github.com/rockkhuya/DongDu
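A rough sketch of this forward longest matching step for over-2-syllable words follows (our own illustration, not the authors' implementation; the maximum word length and the dictionary format are assumptions):

# Sketch: forward longest matching restricted to words of at least three syllables.
MAX_SYLLABLES = 4  # assumed upper bound on syllables per dictionary word

def longest_matching(syllables, dictionary):
    # Returns (start, end) spans of matched over-2-syllable words.
    spans, i, n = [], 0, len(syllables)
    while i < n:
        matched_end = None
        for k in range(min(MAX_SYLLABLES, n - i), 2, -1):  # try the longest candidate first, down to 3
            if " ".join(syllables[i:i + k]).lower() in dictionary:
                matched_end = i + k
                break
        if matched_end is None:
            i += 1
        else:
            spans.append((i, matched_end))
            i = matched_end
    return spans

toy_dictionary = {"khu công nghiệp"}  # assumed toy dictionary ("industrial zone")
print(longest_matching("Anh ấy làm việc ở khu công nghiệp".split(), toy_dictionary))  # [(5, 8)]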
C. Logistic regression as binary classification
Logistic regression is used to construct a binary classifier
for white spaces in our system. From training data, we have a training set D = {(X, Y)}, where X denotes a feature vector and Y denotes the corresponding label of a white space. For convenience, we denote the two values of Y as 1 and 0, corresponding to the UNDERSCORE and SPACE labels respectively. Based on this training set, logistic regression assumes a parametric model and learns the conditional distribution P(Y|X). The assumed parametric model is presented in equations (1) and (2), in which w_i denotes a weight (or parameter):

    P(Y = 1 | X) = 1 / (1 + exp(w_0 + Σ_{i=1}^{n} w_i X_i))    (1)

    P(Y = 0 | X) = 1 - P(Y = 1 | X)    (2)
The rule of our binary classifier is that we assign the UNDERSCORE label to a white space, given its feature vector X, if P(Y = 1 | X) > P(Y = 0 | X) (i.e., P(Y = 1 | X) > 0.5); otherwise, if P(Y = 1 | X) < 0.5, we assign the SPACE label to it.
This statistical method seems to be able to handle proper
names and many ambiguous problems well if we have a good
feature set and large training data. However, it still has serious limitations, as we map a continuous domain of probability P to a discrete domain of a binary variable. But that is also the reason why we choose logistic regression instead of other methods. Obviously, no machine learning method can perform perfectly in all cases, and it is necessary to verify its outcome. Logistic regression provides a simple way to detect low-confident predictions: it is clear that predictions with probabilities P in a narrow band around 0.5 have low confidence. Additionally, it is also possible that,
in the overlap ambiguity case, this classifier may connect all
three syllables to compose a word. We will propose our simple
techniques to resolve these problems in the following section.
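The following is a minimal sketch of this classifier (ours; the paper uses LIBLINEAR directly, whereas here we assume scikit-learn's liblinear-backed LogisticRegression and toy feature vectors in place of the real template features of Section V-A):

# Sketch: binary white-space classifier with low-confidence detection.
from sklearn.linear_model import LogisticRegression
import numpy as np

UNDERSCORE, SPACE = 1, 0
r = 0.33  # threshold value reported in the paper's experiments

# Toy binary feature vectors standing in for the real template features.
X_train = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 1]])
y_train = np.array([UNDERSCORE, SPACE, UNDERSCORE, SPACE])
X_test = np.array([[1, 0, 1], [0, 1, 0]])

clf = LogisticRegression(solver="liblinear", penalty="l2")  # L2-regularized LR, as with LIBLINEAR
clf.fit(X_train, y_train)

p_underscore = clf.predict_proba(X_test)[:, 1]              # P(Y = 1 | X)
labels = np.where(p_underscore > 0.5, UNDERSCORE, SPACE)    # decision rule from Section IV-C
low_confidence = np.abs(p_underscore - 0.5) < r             # flagged for post-processing
print(labels, low_confidence)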
D. Post-processing for binary classifier
We use the dictionary to handle the low-confident predic-
tions and the results in overlap ambiguity cases produced by
the binary classifier. First, we define that a prediction for
label Y of a white space given its feature vector X is a low-confident prediction if the following condition holds:

    |P(Y = 1 | X) - 0.5| < r,    where r is a threshold
Assume that we have a sequence of syllables and labeled white spaces, after the binary classification using logistic regression, in the form of:

    ... s_{i-1} [ ] s_i [*] s_{i+1} [ ] s_{i+2} ...

where s_j denotes a syllable, [ ] denotes the SPACE label, [_] denotes the UNDERSCORE label, and [*] is a label with low confidence. Our solution is to verify whether the word s_i s_{i+1} is in the dictionary or not. The result of this check is the final label for [*].
Figure 1. Architecture of our segmentation system: raw text → pre-processing → sentences → LM for over-2-syllable words (using the dictionary) → binary classifier using LR (trained on the training data) → post-processing → segmented text.
In another case, the resulting sequence looks like this:

    ... s_{i-2} [ ] s_{i-1} [_] s_i [_] s_{i+1} [ ] s_{i+2} ...

In this case, s_{i-1} s_i s_{i+1} is not a word in the dictionary. That means it is very likely a wrong word because of the low frequency of 3-syllable words. We divide this case into four possibilities:

- word s_{i-1} s_i is in the dictionary but word s_i s_{i+1} is not
- word s_i s_{i+1} is in the dictionary but word s_{i-1} s_i is not
- neither of them is in the dictionary
- both of them are in the dictionary

For the first and second cases, we only keep the word that appears in the dictionary. For the third case, we change both labels of the two white spaces into SPACE. The last case corresponds to overlap ambiguity. In this case, we keep the UNDERSCORE label of the white space that has the higher probability according to the classifier and change the other one to SPACE.
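These rules can be summarized in the following sketch (ours; it assumes the labels are stored as a 0/1 list aligned with the white spaces and that a simple lookup dictionary of multi-syllable words is available):

# Sketch of the post-processing rules from Section IV-D.
def post_process(syllables, labels, probs, dictionary, r=0.33):
    # labels[j] / probs[j] refer to the white space between syllables[j] and syllables[j + 1];
    # label 1 = UNDERSCORE, 0 = SPACE; probs[j] = P(Y = 1 | X) from the classifier.
    n = len(labels)
    for j in range(n):
        # Low-confident prediction: trust the dictionary for the 2-syllable word.
        if abs(probs[j] - 0.5) < r:
            labels[j] = 1 if f"{syllables[j]} {syllables[j + 1]}".lower() in dictionary else 0
    for j in range(1, n):
        # Two adjacent UNDERSCOREs joining three syllables that do not form a dictionary word.
        if labels[j - 1] == 1 and labels[j] == 1:
            if " ".join(syllables[j - 1:j + 2]).lower() in dictionary:
                continue
            left = f"{syllables[j - 1]} {syllables[j]}".lower() in dictionary
            right = f"{syllables[j]} {syllables[j + 1]}".lower() in dictionary
            if left and not right:
                labels[j] = 0
            elif right and not left:
                labels[j - 1] = 0
            elif not left and not right:
                labels[j - 1] = labels[j] = 0
            else:  # overlap ambiguity: keep the UNDERSCORE with the higher probability
                if probs[j - 1] >= probs[j]:
                    labels[j] = 0
                else:
                    labels[j - 1] = 0
    return labels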
E. Proposed segmentation system
Combining all the above components with the pre-processing step for raw input data, we obtain the architecture of our system
as presented in Figure 1.
In the pre-processing step, we first standardize the raw
text, then use regular expressions to recognize regular patterns
such as numbers, times and dates, then separate punctuation
marks, parentheses and quotation marks at the end of words,
and then utilize some simple heuristic rules to split the text
into sentences. Next, each sentence is passed into the LM
component to detect the boundaries of words having at least three syllables. Subsequently, the remaining white spaces are labeled by the classifier using LR, which was trained on the training data beforehand. The post-processing then handles the low-confident predictions produced by the classifier to return
the final segmented text.
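A very rough sketch of this kind of pre-processing is given below (ours; the actual patterns and heuristics of the system are not listed in the paper, so the regular expressions here are only illustrative assumptions):

# Sketch of simple pre-processing: separate punctuation marks and split the text
# into sentences with naive heuristics; patterns are illustrative, not the toolkit's.
import re

def preprocess(raw_text):
    text = re.sub(r"\s+", " ", raw_text.strip())           # standardize white space
    # regexes for numbers, times and dates would be applied here to keep them
    # as single tokens; omitted for brevity
    text = re.sub(r'([.,!?;:"()\[\]])', r" \1 ", text)      # separate punctuation marks
    text = re.sub(r"\s+", " ", text).strip()
    # naive sentence splitting on terminal punctuation
    return [s.strip() for s in re.split(r"(?<=[.!?]) ", text) if s.strip()]

print(preprocess("Hà Nội, ngày 20/11/2016. Trời rất đẹp!"))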
Figure 2. A 5-syllable window around a white space: syllables s_{-2}, ..., s_2 and white spaces y_{-2}, ..., y_2.
V. EXPERIMENTS
In this section, we present the feature templates used for
logistic regression and the performances of different systems
compared to our system. We also discuss the effect of the threshold r on the accuracy of our segmentation system.
A. Features
The performance of any statistical technique depends on the quality of its feature set. For the classifier using logistic regression in our system, to generate the feature vector of each white space, we capture a window of size 2 around it, as depicted in Figure 2, where s denotes a syllable, y denotes a white space, and the subscript is the index of the corresponding syllable or white space.
Table I presents all feature templates for logistic regression. In Table I, f_i denotes the lowercase-simplified form of syllable s_i; t_i is the type of syllable s_i; (f_i, f_j) is a combination feature; isVNFamilyName(s_i) returns true if and only if s_i is a Vietnamese family name; isVNSyllable(s_i) returns true if and only if s_i is a valid Vietnamese syllable. There are five types of syllable defined in our system, namely LOWER, UPPER, ALLUPPER, NUMBER and OTHER, corresponding to the cases in which the syllable has all lowercase letters, the syllable has an upper-case initial letter, the syllable has all upper-case letters, the syllable is a number, or the other cases, respectively.
For n-gram features, we use both the lowercase form of
syllables and their types. We do not use the original form of
syllables from input text to extract n-gram features because
we found that this way of feature extraction may produce unhelpful features that are not good for logistic regression. Moreover, we only add features of a syllable's type to the feature vector if the type is different from LOWER. Taking all features of the LOWER type into feature vectors can confuse the regression model and degrade its performance, because most syllables are of the LOWER type. These techniques can remove a large number of useless features. The sixth feature template in Table I is used for full-reduplicative words. The seventh one captures information about Vietnamese people's names, and the last one is used for detecting two consecutive proper names.
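To make the templates concrete, the sketch below (ours; the syllable-type function and the family-name/valid-syllable resources are crude stand-ins for the real ones) generates the feature strings for one white space from its 5-syllable window:

# Sketch of feature generation for one white space, following the templates in Table I.
VN_FAMILY_NAMES = {"nguyễn", "trần", "lê", "phạm"}  # assumed stand-in resource
VN_SYLLABLES = set()                                # would hold all valid Vietnamese syllables

def syllable_type(s):
    if s.isdigit():
        return "NUMBER"
    if s.isupper():
        return "ALLUPPER"
    if s[:1].isupper():
        return "UPPER"
    if s.islower():
        return "LOWER"
    return "OTHER"

def features(window):
    # window = [s_-2, s_-1, s_0, s_1, s_2]; list index 2 corresponds to s_0
    f = [s.lower() for s in window]
    t = [syllable_type(s) for s in window]
    feats = []
    feats += [f"f{i}={f[i]}" for i in range(5)]                                               # template 1
    feats += [f"f{i}{i+1}={f[i]}|{f[i+1]}" for i in range(4)]                                 # template 2
    feats += [f"t{i}={t[i]}" for i in range(5) if t[i] != "LOWER"]                            # template 3 (reduced as described above)
    feats += [f"t{i}{i+1}={t[i]}|{t[i+1]}" for i in range(4) if t[i] != "LOWER"]              # template 4
    feats += [f"t{i}..{i+2}={t[i]}|{t[i+1]}|{t[i+2]}" for i in range(3) if t[i] != "LOWER"]   # template 5
    if t[2] == t[3] == "LOWER" and f[2] == f[3]:
        feats.append("full_reduplicative")                                                    # template 6
    if t[2] == t[3] == "UPPER" and f[2] in VN_FAMILY_NAMES:
        feats.append("vn_family_name")                                                        # template 7
    if t[2] == t[3] == "UPPER" and f[2] in VN_SYLLABLES and f[3] not in VN_SYLLABLES:
        feats.append("two_proper_names")                                                      # template 8
    return feats

print(features(["Ông", "Nguyễn", "Văn", "A", "nói"]))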
B. Results
We analyze the effect of each component on our whole system. In our experiments, we use the Vietnamese Treebank corpus of 75k manually word-segmented sentences, which is one of the largest annotated corpora for Vietnamese. The corpus is randomly split into ten equal partitions for 10-fold cross-validation. F-measure is used, in which the precision ratio (P) is computed as the number of correctly segmented words over the total number of words produced by the segmentation system; the recall ratio (R) is computed as the number of correctly segmented words over the total number of words in the golden test set. The average
accuracies of systems over ten folds are presented in Table II.
We utilize LIBLINEAR L2-regularized logistic regression [22]
to implement the classifier for our experiments.
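As a small clarifying sketch (ours, not from the paper), word-level precision, recall and F-measure can be computed by comparing predicted word spans with gold spans over the same syllable sequence:

# Sketch: word-level P/R/F for segmentation, counting a word as correct
# only when both of its boundaries agree with the gold segmentation.
def spans(segmented):
    # 'Học_sinh học sinh_học' -> {(0, 2), (2, 3), (3, 5)} over syllable offsets
    out, start = set(), 0
    for token in segmented.split():
        size = len(token.split("_"))
        out.add((start, start + size))
        start += size
    return out

def prf(predicted, gold):
    p_spans, g_spans = spans(predicted), spans(gold)
    correct = len(p_spans & g_spans)
    precision = correct / len(p_spans)
    recall = correct / len(g_spans)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

print(prf("Học_sinh học sinh_học", "Học_sinh học sinh học"))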
Our baseline system is LM which uses only longest match-
ing algorithm and the rule to compose all consecutive UPPER
syllables into a word. Longest matching algorithm is only used
for phrases which have LOWER syllable(s) and do not contain any NUMBER or OTHER syllable. This system can gain
an F-measure of 97.21%; however, it obviously cannot resolve overlap ambiguity and out-of-vocabulary problems. Moreover, the rule for proper names is too greedy and fails in many cases. Meanwhile, if we only utilize the classifier using
logistic regression in system LR, the result is much better. The
regression model can handle many cases of overlap ambiguity
and out-of-vocabulary and provide a better way to detect
proper names. Combining these two components into the LM + LR system provides a slightly increased accuracy compared to the LR system. The precision ratio is higher because LM + LR is able to cover all the over-2-syllable words that the LR system fails to catch. However, its recall ratio is decreased because of the inconsistency of those words in the training data and rare ambiguous cases.
Post-processing for LR makes a significant impact on the
result of segmentation. The LR + Post system, which adds post-processing after the LR system, can reach the highest recall ratio of 98.99%. Our whole system, which is composed of all components (LM + LR + Post), achieves the best result of 98.82% F-measure. Obviously, the performance of post-processing mainly depends on the threshold r. In this experiment, we use r = 0.33, which gives the best result. In the next section, we take a deeper look into how to choose a proper threshold r and discuss its effect on the final result.
C. Discussion on threshold r and post-processing
Choosing a proper threshold r depends on the quality of the dictionary and how well the machine learning process performs. Roughly, choosing a high r means we rely on the dictionary more than on the result of the classifier, and vice versa. Figure 3 depicts our analysis of the effect of the threshold r on our system.
From this analysis, we can conclude that our system's results on the Vietnamese Treebank corpus are not too sensitive to the variability of r in a wide range from 0.25 to 0.40. However, the high similarity between the domains of the training data and the test set is one reason for the high performance of our system. When adapting to other domains, the system may face more problems with new words which even the dictionary cannot cover. In this situation, a validation set is needed in order to choose a proper threshold r for the new domain.
D. Comparison to other toolkits
Our approach is compared to other approaches that have
been presented in other studies. The accuracy figures are
depicted in Table III.
Table I
FEATURE TEMPLATES USED FOR LOGISTIC REGRESSION.

No.  Template
1    (f_i), i = -2, -1, 0, 1, 2
2    (f_i, f_{i+1}), i = -2, -1, 0, 1
3    (t_i), i = -2, -1, 0, 1, 2
4    (t_i, t_{i+1}), i = -2, -1, 0, 1 and t_i ≠ LOWER
5    (t_i, t_{i+1}, t_{i+2}), i = -2, -1, 0 and t_i ≠ LOWER
6    (t_0 = t_1 = LOWER and f_0 = f_1)?
7    (t_0 = t_1 = UPPER and isVNFamilyName(s_0))?
8    (t_0 = t_1 = UPPER and isVNSyllable(s_0) and !isVNSyllable(s_1))?
Table II
ACCURACIES OF SUB-SYSTEMS (%).
Sub-system P R F
LM 97.11 97.31 97.21
LR 97.95 98.29 98.12
LM + LR 98.11 98.16 98.14
LR + Post 98.59 98.99 98.79
LM + LR + Post 98.77 98.87 98.82
Figure 3. Effect of the threshold r on the word segmentation result.
Our system provides better results compared to other toolkits
on Vietnamese Treebank corpus. It should be repeated that
our classifier does not take information from the dictionary.
We suspect that this is the reason why it performs better than
other stochastic-based toolkits, DongDu and JVnSegmenter.
vnTokenizer [1], which uses regexes to cover proper names
before handling normal words, fails in many cases where
upper-case syllables appear consecutively. It is obvious that
statistical systems can perform better than vnTokenizer be-
cause the training data is not too different from the test set in
terms of content domain.
To make another comparison, we retrained each toolkit
using the full corpus of Vietnamese Treebank and then evalu-
ated them on an independent test set that consists of 10 files
from 800001.seg to 800010.seg provided by VLSP project.
From Table III, we can see that the performances of the statistical segmentation systems decrease considerably, because the new test set has a totally different domain, with many new words that have appeared in neither the training data nor
the dictionary of these systems. Our system with the main
component using logistic regression is not an exception but
it still has a good performance because of the simple feature
set which does not make use of information from dictionary.
vnTokenizer performs quite stably and its result is slightly increased. Notably, vnTokenizer's dictionary has more than 40k words [1], whereas ours has 32k. Although having
a poorer dictionary, our system is still able to outperform
vnTokenizer.
Moreover, we also collected a corpus of 1k articles from
Vietnamese online newspapers to measure the segmentation speed of the toolkits. Except for DongDu, which is developed in C++, the other toolkits are developed in Java. The evaluation was performed on a personal computer with 4 Intel Core i5-3337U CPUs @ 1.80GHz and 6GB of memory. The results are reported in Table IV. Our system runs faster than the other toolkits.
The DongDu toolkit also utilizes LIBLINEAR for machine learning; however, its feature set is much more complicated and its LIBLINEAR version is older than ours. We suspect these are the reasons why DongDu's speed is not as high as our system's. vnTokenizer and JVnSegmenter were written in old versions of Java. Their code for processing strings seems to be inefficient, so their speeds are quite low.
E. UETsegmenter
Our toolkit used for the above experiments is written in Java
and called UETsegmenter. It provides APIs for Vietnamese
word segmentation using a pretrained model and also some
methods for training and testing new models. The toolkit and
related resources are freely available for download³.
VI. CONCLUSIONS
In this paper, we propose a hybrid approach to Vietnamese
word segmentation using longest matching and logistic re-
gression. We cast this task as a binary classification problem
for white spaces, and the results show that the longest matching algorithm and logistic regression, combined with our simple post-processing techniques, help to gain high accuracy. Our system can reach a state-of-the-art result of 98.82% F-measure when evaluated on the Vietnamese Treebank corpus. Moreover, the system can perform at a high speed of 34k tokens per second. For future work, we will make a deeper study of the effect of the dictionary and the classifier on choosing a proper threshold and extend the post-processing to deal with other cases. We will
also find an efficient way to enrich the dictionary to produce
a better segmentation system.
³ https://github.com/phongnt570/UETsegmenter
Table III
ACCURACY COMPARISON (%).
Toolkit 10-fold CV Independent test set
P R F P R F
vnTokenizer 97.61 96.86 97.23 96.98 97.69 97.33
JVnSegmenter - Maxent 97.18 97.28 97.23 96.60 97.40 97.00
JVnSegmenter - CRFs 97.58 97.68 97.63 96.63 97.49 97.06
DongDu 97.44 98.01 97.72 96.35 97.46 96.90
Ours 98.77 98.87 98.82 97.51 98.23 97.87
Table IV
SPEED COMPARISON.
Toolkit JVnSeg (CRFs) JVnSeg (MaxEnt) vnTokenizer DongDu Ours
Speed (tokens/s) 764 1082 5322 16709 33705
ACKNOWLEDGMENT
This paper is supported by The Vietnam National Founda-
tion for Science and Technology Development (NAFOSTED)
under grant number 102.01-2014.22.
REFERENCES
[1] H. P. Le, T. M. H. Nguyen, A. Roussanaly, and T. V. Ho, “A hybrid ap-
proach to word segmentation of vietnamese texts,” in 2nd International
Conference on Language and Automata Theory and Applications-LATA
2008, vol. 5196. Springer Berlin/Heidelberg, 2008, pp. 240–249.
[2] D. D. Pham, G. B. Tran, and S. B. Pham, “A hybrid approach to viet-
namese word segmentation using part-of-speech tags,” in International
Conference on Knowledge and Systems Engineering 2009. IEEE, 2009,
pp. 154–161.
[3] D. Dinh, K. Hoang, and V. T. Nguyen, “Vietnamese word segmentation,”
in NLPRS, vol. 1, 2001, pp. 749–756.
[4] P.-T. Nguyen, X.-L. Vu, T.-M.-H. Nguyen, V.-H. Nguyen, and H.-P.
Le, “Building a large syntactically-annotated corpus of vietnamese,”
in Proceedings of the Third Linguistic Annotation Workshop,
ser. ACL-IJCNLP ’09. Stroudsburg, PA, USA: Association for
Computational Linguistics, 2009, pp. 182–185. [Online]. Available:
http://dl.acm.org/citation.cfm?id=1698381.1698416
[5] Q. T. Dinh, H. P. Le, T. M. H. Nguyen, C. T. Nguyen, M. Rossignol,
and X. L. Vu, “Word segmentation of Vietnamese texts: a comparison
of approaches,” in 6th International Conference on Language Resources
and Evaluation - LREC 2008. Marrakech, Morocco: ELRA -
European Language Resources Association, May 2008. [Online].
Available: https://hal.inria.fr/inria-00334760
[6] C. Huang and H. Zhao, “Chinese word segmentation: A decade review,”
Journal of Chinese Information Processing, vol. 21, no. 3, pp. 8–20,
2007.
[7] K.-J. Chen and S.-H. Liu, “Word identification for mandarin chinese
sentences,” in Proceedings of the 14th Conference on Computational
Linguistics - Volume 1, ser. COLING ’92. Stroudsburg, PA, USA:
Association for Computational Linguistics, 1992, pp. 101–107. [Online].
Available: http://dx.doi.org/10.3115/992066.992085
[8] P.-k. Wong and C. Chan, “Chinese word segmentation based
on maximum matching and word binding force,” in Proceedings
of the 16th Conference on Computational Linguistics - Volume
1, ser. COLING ’96. Stroudsburg, PA, USA: Association for
Computational Linguistics, 1996, pp. 200–203. [Online]. Available:
http://dx.doi.org/10.3115/992628.992665
[9] C.-T. Nguyen, T.-K. Nguyen, X.-H. Phan, L.-M. Nguyen, and Q.-T. Ha,
“Vietnamese word segmentation with crfs and svms: An investigation,”
in Proceedings of the 20th Pacific Asia Conference on Language,
Information and Computation (PACLIC 2006), 2006.
[10] T. Kudo, K. Yamamoto, and Y. Matsumoto, “Applying conditional
random fields to japanese morphological analysis,” in EMNLP, vol. 4,
2004, pp. 230–237.
[11] M. Sassano, “An empirical study of active learning with support
vector machines for japanese word segmentation,” in Proceedings
of the 40th Annual Meeting on Association for Computational
Linguistics, ser. ACL ’02. Stroudsburg, PA, USA: Association for
Computational Linguistics, 2002, pp. 505–512. [Online]. Available:
http://dx.doi.org/10.3115/1073083.1073168
[12] T. Nguyen, V. Nguyen, and A. Le, “Vietnamese word segmentation
using hidden markov model,” in International Workshop for Computer,
Information, and Communication Technologies in Korea and Vietnam,
2003.
[13] C. P. Papageorgiou, “Japanese word segmentation by hidden markov
model,” in Proceedings of the Workshop on Human Language
Technology, ser. HLT ’94. Stroudsburg, PA, USA: Association for
Computational Linguistics, 1994, pp. 283–288. [Online]. Available:
http://dx.doi.org/10.3115/1075812.1075875
[14] L. A. Ha, “A method for word segmentation in vietnamese,” in Proceed-
ings of Corpus Linguistics, 2003.
[15] O. T. Tran, C. A. Le, and T. Q. Ha, “Improving vietnamese word seg-
mentation and pos tagging using mem with various kinds of resources,”
Information and Media Technologies, vol. 5, no. 2, pp. 890–909, 2010.
[16] D. Dinh and T. Vu, “A maximum entropy approach for vietnamese word
segmentation,” in International Conference on Research, Innovation and
Vision for the Future, 2006. IEEE, 2006, pp. 248–253.
[17] W. Liu and L. Lin, “Probabilistic ensemble learning for vietnamese word
segmentation,” in Proceedings of the 37th International ACM SIGIR
Conference on Research & Development in Information Retrieval,
ser. SIGIR ’14. New York, NY, USA: ACM, 2014, pp. 931–934.
[Online]. Available: http://doi.acm.org/10.1145/2600428.2609477
[18] C. Haruechaiyasak, S. Kongyoung, and M. Dailey, “A comparative
study on thai word segmentation approaches,” in Electrical Engineer-
ing/Electronics, Computer, Telecommunications and Information Tech-
nology, 2008. ECTI-CON 2008. 5th International Conference on, vol. 1,
May 2008, pp. 125–128.
[19] Z. Wang, C. Zong, and N. Xue, “A lattice-based framework for joint
chinese word segmentation, pos tagging and parsing,” 2013.
[20] N. Kaji and M. Kitsuregawa, “Accurate word segmentation and pos
tagging for japanese microblogs: Corpus annotation and joint modeling
with lexical normalization,” in EMNLP, 2014, pp. 99–109.
[21] C.-T. Nguyen and X.-H. Phan, “Jvnsegmenter: A java-based vietnamese
word segmentation tool,” Retrieved on, vol. 30, 2011.
[22] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J.
Lin, “Liblinear: A library for large linear classification,” J. Mach.
Learn. Res., vol. 9, pp. 1871–1874, Jun. 2008. [Online]. Available:
http://dl.acm.org/citation.cfm?id=1390681.1442794