A Hybrid Approach
to Vietnamese Word Segmentation
Tuan-Phong Nguyen
Faculty of Information Technology
VNU University of Engineering and Technology
No. 144 Xuan Thuy Street
Dich Vong Hau Ward, Cau Giay District
Hanoi, Vietnam
Email: phongnt_570@vnu.edu.vn
Anh-Cuong Le
Faculty of Information Technology
Ton Duc Thang University
No. 19 Nguyen Huu Tho Street
Tan Phong Ward, District 7
Ho Chi Minh City, Vietnam
Email: leanhcuong@tdt.edu.vn
Abstract—Word segmentation is the very first task for Vietnamese language processing. Word-segmented text is the input of almost all other NLP tasks. This task faces some challenges due to specific characteristics of the language. As in many other Asian languages such as Japanese, Korean and Chinese, white spaces in Vietnamese are not always used as word separators and a word may contain one or more syllables. In this paper, we propose an efficient hybrid approach to detect word boundaries for Vietnamese texts using logistic regression as a binary classifier combined with the longest matching algorithm. First, the longest matching algorithm is used to catch words that contain more than two syllables in the input sentence. Next, the system utilizes the classifier to determine the boundaries of 2-syllable words and proper names. Then, the predictions having low confidence produced by the classifier are verified by a dictionary to get the final result. Our system can achieve an F-measure of 98.82%, which is the most accurate result for Vietnamese word segmentation to the best of our knowledge. Moreover, the system also has a high speed: it can run word segmentation at nearly 34k tokens per second.
I. INTRODUCTION
In linguistics, a word is the smallest meaningful unit of speech
that can stand by itself. Vietnamese, an Austroasiatic language,
uses a Latin alphabet with additional diacritics and certain
letters. However, unlike many occidental languages using Latin
alphabets, Vietnamese has similar characteristics to other East
Asian languages such as Japanese, Korean, Chinese and Thai
in which white spaces are not always word separators and
a word may consist of more than one syllable with many
ambiguous cases. This leads to some challenges in Vietnamese
word segmentation.
Studies on Vietnamese word segmentation used either
dictionary-based algorithms, statistical models or hybrid ap-
proaches. Recent studies using hybrid approaches such as [1],
[2], [3] can provide state-of-the-art results at approximately
97%.
In this study, we propose an efficient hybrid approach
to solve this task. In our approach, word segmentation is
represented as a binary classification problem in which we
have to determine the label of each white space in input
text. These two labels are SPACE (separator of two syllables
which belong to two different words) and UNDERSCORE
(separator of two syllables inside a word). Our system is
mainly based on three steps. First, we use a forward longest
matching algorithm to determine the boundary of all words
having at least three syllables. Next, the classifier using logistic
regression helps to detect the boundary of 2-syllable words
and proper names. Finally, we continue to use the dictionary
to recheck the predictions having low confidence produced
by the machine learning process and return final labels for
white spaces. For experiments, we evaluate our approach using
10-fold cross-validation on Vietnamese Treebank corpora [4]
of 75k manually word-segmented sentences. Our system can
yield an F-measure of 98.82% which is the best result for
Vietnamese word segmentation known to us. Furthermore, the
system can also perform at a high speed of nearly 34k tokens
per second when running on a personal computer.
The rest of this paper is organized as follows. In Section II,
we talk about the difficulties in Vietnamese word segmenta-
tion. In Section III, the methods used in other studies to resolve
word segmentation task are discussed. Section IV provides
details of our approach. We report and discuss the
experimental results of our system in Section V. Finally, we
make some conclusions on this work in Section VI.
II. DIFFICULTIES IN VIETNAMESE WORD SEGMENTATION
Vietnamese is an inflectionless language in which a word never changes its form. Vietnamese words are made of one or more syllables. A word which contains only one syllable is called a single word. On the other hand, a word which is composed of more than one syllable is called a compound word. The frequency of each kind of word differs from the others.
We made some statistics on the dictionary provided by the VLSP project¹, and the frequency analysis revealed some useful facts. Most of the words in this dictionary (71%) are 2-syllable words. Single words account for 17.67% of total words. Therefore, the percentage of over-2-syllable words is just under 12%. This low frequency leads us to a simple idea: we can just use the dictionary to cover those words and then find an efficient way to deal with the other words in the input text. There are two kinds of remaining words that we have to care about, which are 2-syllable words and proper names (in Vietnamese, names of people and locations are considered lexical units).

¹ http://vlsp.hpda.vn:8080/demo/?page=home
One simple method to deal with proper names is to compose
all consecutive upper-case syllables into a word. This method
is obviously not good in many cases such as when two proper
names appear consecutively.
For 2-syllable words, the easiest way is to scan through the input sentence and connect every two consecutive syllables that compose a word in the dictionary. There are many ambiguous cases where this method produces wrong results. One of the most frequent cases is called overlap ambiguity, in which a sentence has three consecutive syllables s_i s_{i+1} s_{i+2} where both s_i s_{i+1} and s_{i+1} s_{i+2} are words in the dictionary but, in the current context, only one of them is the right word. In another common situation, a word composed of two consecutive syllables s_i s_{i+1} is in the dictionary, but in the current context these two syllables are actually two single words. Another significant case that this method cannot handle is the out-of-vocabulary problem, in which two consecutive syllables s_i s_{i+1} actually compose a right word in its context but that word has not appeared in the dictionary.
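As a concrete illustration of the overlap ambiguity just described, the following is a small sketch of ours (not taken from the paper; the toy dictionary and the classic example sentence are assumptions) showing how a naive pairing of consecutive dictionary syllables picks the wrong reading:

# Sketch of the naive 2-syllable matching described above. With both
# "học sinh" (student) and "sinh học" (biology) in the dictionary, greedy
# left-to-right pairing segments "Học sinh học sinh học" as
# "Học_sinh học_sinh học", while the intended reading is
# "Học_sinh học sinh_học" ("Students study biology").
def naive_pairing(syllables, dictionary):
    out, i = [], 0
    while i < len(syllables):
        if i + 1 < len(syllables) and f"{syllables[i]} {syllables[i + 1]}".lower() in dictionary:
            out.append(syllables[i] + "_" + syllables[i + 1])
            i += 2
        else:
            out.append(syllables[i])
            i += 1
    return " ".join(out)

toy_dictionary = {"học sinh", "sinh học"}  # assumed toy dictionary
print(naive_pairing("Học sinh học sinh học".split(), toy_dictionary))  # Học_sinh học_sinh học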
Taken together, it is necessary to have more effective techniques to deal with those problems. In the next section, we discuss the approaches that have been studied for word segmentation of Vietnamese and other languages' texts.
III. RELATED WORKS
There are many effective approaches that have been studied
to resolve the word segmentation task [5], [6]. The first and traditional approach is dictionary-based. There are two common techniques of this approach, namely maximum matching
(MM) and longest matching (LM). While MM algorithm
aims to find the segmentation candidates by segmenting input
sentence into a sequence with the smallest number of words,
LM algorithm tends to scan through the sentence and at each
syllable, it finds the longest word composed of this syllable
and the next consecutive ones. Systems using this kind of
approach for Chinese can gain very promising results [7], [8].
However, for Vietnamese, this simple approach seems to be
unable to deal with out-of-vocabulary problem and overlap
ambiguity.
The second one is statistical approach. As in many other
core NLP tasks, this approach has proved to be good for word
segmentation too. For instance, the methods using Conditional
Random Fields (CRFs) and Support Vector Machines (SVMs)
in [9] can reach results of over 94% while evaluating on a
small corpus of 7800 Vietnamese sentences. Other studies
using CRFs [10], SVMs [11], Hidden Markov Model (HMM)
[12], [13], n-gram model [14], Maximum Entropy (MaxEnt)
[15], [16] and probabilistic ensemble learning [17] also produce high accuracy for Vietnamese and other East Asian languages. Statistical approaches help to gain good results for Thai [18], too.
Although statistical algorithms can provide a good way
to deal with ambiguous problems, both of those approaches
still have their own limitations. Thus, some studies combined
these two approaches into their systems. Some hybrid approaches for Vietnamese word segmentation were presented that use a Weighted Finite State Transducer (WFST) with a Neural Network [3], combine MM with an n-gram language model [1], or combine MM with stochastic models using part-of-speech information [2]. These approaches are able to reach
state-of-the-art results at approximately 97%. For Chinese, the
study in [19] proposes a lattice-based framework for joint
Chinese word segmentation, POS tagging and parsing which
helps to significantly improve the accuracy of the three sub-
tasks. A joint model of word segmentation and POS tagging was
also used for Japanese [20].
IV. OUR APPROACH
In this section, we first talk about how we represent the word segmentation task. Next, we describe the three main components
of our segmentation system before proposing its architecture.
A. Problem representation
The two main ways of problem representation for Viet-
namese word segmentation are syllable-based and white-
space-based.
The first one can be described as a sequential tagging task.
For example, in the approach presented in [9], there are three
labels for syllables, which are B_W (Begin of a Word), I_W
(Inside of a Word) and O (Outside of a word). This approach is
implemented in JVnSegmenter [21], a toolkit for Vietnamese
word segmentation.
The second way is to cast Vietnamese word segmentation
as a binary classification problem for white spaces. It should
be repeated that in Vietnamese, there are two kinds of white
space. The first one is separator of two syllables which
belong to two different words (SPACE) and the second one
is separator of two syllables inside a word (UNDERSCORE).
PELSegmenter [17] and DongDu² are toolkits that use this
problem representation.
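To make this representation concrete, here is a minimal sketch of ours (the sentence and labels are the same textbook example used above, not code from any of these toolkits) showing how a sequence of SPACE/UNDERSCORE decisions maps directly to word-segmented text:

# Sketch: turning white-space labels into word-segmented text.
SPACE, UNDERSCORE = 0, 1

def join_segmentation(syllables, labels):
    # labels[i] is the label of the white space between syllables[i] and syllables[i + 1]
    assert len(labels) == len(syllables) - 1
    out = [syllables[0]]
    for syllable, label in zip(syllables[1:], labels):
        out.append(("_" if label == UNDERSCORE else " ") + syllable)
    return "".join(out)

syllables = ["Học", "sinh", "học", "sinh", "học"]
labels = [UNDERSCORE, SPACE, SPACE, UNDERSCORE]
print(join_segmentation(syllables, labels))  # Học_sinh học sinh_học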
We use the second way of problem representation for our
system because of its simplicity. Moreover, in this way, it is
possible to modify the label of a white space without affecting the labels of the other white spaces beside it. In the next section,
we describe the simplest component of our system, longest
matching algorithm.
B. Longest matching
Because over-2-syllable words have low frequency and, in our observation, ambiguity among them is negligible, we just use the dictionary to deal with those words. The dictionary-based technique in our system is longest matching, and the dictionary is the one used in Section II. The remaining work after longest matching is to handle 2-syllable words and proper names efficiently. Our binary classifier using logistic regression is responsible for this task.
² https://github.com/rockkhuya/DongDu
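A rough sketch of this forward longest matching step for over-2-syllable words follows (our own illustration, not the authors' implementation; the maximum word length and the dictionary format are assumptions):

# Sketch: forward longest matching restricted to words of at least three syllables.
MAX_SYLLABLES = 4  # assumed upper bound on syllables per dictionary word

def longest_matching(syllables, dictionary):
    # Returns (start, end) spans of matched over-2-syllable words.
    spans, i, n = [], 0, len(syllables)
    while i < n:
        matched_end = None
        for k in range(min(MAX_SYLLABLES, n - i), 2, -1):  # try the longest candidate first, down to 3
            if " ".join(syllables[i:i + k]).lower() in dictionary:
                matched_end = i + k
                break
        if matched_end is None:
            i += 1
        else:
            spans.append((i, matched_end))
            i = matched_end
    return spans

toy_dictionary = {"khu công nghiệp"}  # assumed toy dictionary ("industrial zone")
print(longest_matching("Anh ấy làm việc ở khu công nghiệp".split(), toy_dictionary))  # [(5, 8)]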
C. Logistic regression as binary classification
Logistic regression is used to construct a binary classifier
for white spaces in our system. From training data, we have a training set D = {(X, Y)}, where X denotes a feature vector and Y denotes the corresponding label of a white space. For convenience, we denote the two values of Y as 1 and 0, corresponding to the UNDERSCORE and SPACE labels respectively. Based on this training set, logistic regression assumes a parametric model and learns the conditional distribution P(Y|X). The assumed parametric model is presented in equations (1) and (2), in which w_i denotes a weight (or parameter):

    P(Y = 1 | X) = 1 / (1 + exp(w_0 + Σ_{i=1}^{n} w_i X_i))    (1)

    P(Y = 0 | X) = 1 - P(Y = 1 | X)    (2)
The rule of our binary classifier is that we assign the UNDERSCORE label to a white space, given its feature vector X, if P(Y = 1 | X) > P(Y = 0 | X) (i.e., P(Y = 1 | X) > 0.5); otherwise, if P(Y = 1 | X) < 0.5, we assign the SPACE label to it.
This statistical method seems to be able to handle proper
names and many ambiguous problems well if we have a good
feature set and large training data. However, it still has serious limitations, as we map a continuous domain of probability P to a discrete domain of a binary variable. But that is also the reason why we choose logistic regression instead of other methods. Obviously, no machine learning method can perform perfectly in all cases, and it is necessary to verify its outcome. Logistic regression provides a simple way to detect low-confident predictions: it is clear that predictions with probabilities P in a narrow band around 0.5 have low confidence. Additionally, it is also possible that,
in the overlap ambiguity case, this classifier may connect all
three syllables to compose a word. We will propose our simple
techniques to resolve these problems in the following section.
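The following is a minimal sketch of this classifier (ours; the paper uses LIBLINEAR directly, whereas here we assume scikit-learn's liblinear-backed LogisticRegression and toy feature vectors in place of the real template features of Section V-A):

# Sketch: binary white-space classifier with low-confidence detection.
from sklearn.linear_model import LogisticRegression
import numpy as np

UNDERSCORE, SPACE = 1, 0
r = 0.33  # threshold value reported in the paper's experiments

# Toy binary feature vectors standing in for the real template features.
X_train = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 1]])
y_train = np.array([UNDERSCORE, SPACE, UNDERSCORE, SPACE])
X_test = np.array([[1, 0, 1], [0, 1, 0]])

clf = LogisticRegression(solver="liblinear", penalty="l2")  # L2-regularized LR, as with LIBLINEAR
clf.fit(X_train, y_train)

p_underscore = clf.predict_proba(X_test)[:, 1]              # P(Y = 1 | X)
labels = np.where(p_underscore > 0.5, UNDERSCORE, SPACE)    # decision rule from Section IV-C
low_confidence = np.abs(p_underscore - 0.5) < r             # flagged for post-processing
print(labels, low_confidence)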
D. Post-processing for binary classifier
We use the dictionary to handle the low-confident predic-
tions and the results in overlap ambiguity cases produced by
the binary classifier. First, we define that a prediction for
label Y of a white space given its feature vector X is a low-confident prediction if the following condition holds:

    |P(Y = 1 | X) - 0.5| < r,    where r is a threshold
Assume that we have a sequence of syllables and labeled white spaces, after the binary classification using logistic regression, in the form of:

    ... s_{i-1} [ ] s_i [*] s_{i+1} [ ] s_{i+2} ...

where s_j denotes a syllable, [ ] denotes the SPACE label, [_] denotes the UNDERSCORE label, and [*] is a label with low confidence. Our solution is to verify whether the word s_i s_{i+1} is in the dictionary or not. The result of this check is the final label for [*].
Figure 1. Architecture of our segmentation system: raw text → pre-processing → sentences → LM for over-2-syllable words (using the dictionary) → binary classifier using LR (trained on the training data) → post-processing → segmented text.
In another case, the resulting sequence looks like this:

    ... s_{i-2} [ ] s_{i-1} [_] s_i [_] s_{i+1} [ ] s_{i+2} ...

In this case, s_{i-1} s_i s_{i+1} is not a word in the dictionary. That means it is very likely a wrong word because of the low frequency of 3-syllable words. We divide this case into four possibilities:

- word s_{i-1} s_i is in the dictionary but word s_i s_{i+1} is not
- word s_i s_{i+1} is in the dictionary but word s_{i-1} s_i is not
- neither of them is in the dictionary
- both of them are in the dictionary

For the first and second cases, we only keep the word that appears in the dictionary. For the third case, we change both labels of the two white spaces into SPACE. The last case corresponds to overlap ambiguity. In this case, we keep the UNDERSCORE label of the white space that has the higher probability according to the classifier and change the other one to SPACE.
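These rules can be summarized in the following sketch (ours; it assumes the labels are stored as a 0/1 list aligned with the white spaces and that a simple lookup dictionary of multi-syllable words is available):

# Sketch of the post-processing rules from Section IV-D.
def post_process(syllables, labels, probs, dictionary, r=0.33):
    # labels[j] / probs[j] refer to the white space between syllables[j] and syllables[j + 1];
    # label 1 = UNDERSCORE, 0 = SPACE; probs[j] = P(Y = 1 | X) from the classifier.
    n = len(labels)
    for j in range(n):
        # Low-confident prediction: trust the dictionary for the 2-syllable word.
        if abs(probs[j] - 0.5) < r:
            labels[j] = 1 if f"{syllables[j]} {syllables[j + 1]}".lower() in dictionary else 0
    for j in range(1, n):
        # Two adjacent UNDERSCOREs joining three syllables that do not form a dictionary word.
        if labels[j - 1] == 1 and labels[j] == 1:
            if " ".join(syllables[j - 1:j + 2]).lower() in dictionary:
                continue
            left = f"{syllables[j - 1]} {syllables[j]}".lower() in dictionary
            right = f"{syllables[j]} {syllables[j + 1]}".lower() in dictionary
            if left and not right:
                labels[j] = 0
            elif right and not left:
                labels[j - 1] = 0
            elif not left and not right:
                labels[j - 1] = labels[j] = 0
            else:  # overlap ambiguity: keep the UNDERSCORE with the higher probability
                if probs[j - 1] >= probs[j]:
                    labels[j] = 0
                else:
                    labels[j - 1] = 0
    return labels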
E. Proposed segmentation system
Combining all the above components with the pre-processing step for raw input data, we obtain the architecture of our system
as presented in Figure 1.
In the pre-processing step, we first standardize the raw
text, then use regular expressions to recognize regular patterns
such as numbers, times and dates, then separate punctuation
marks, parentheses and quotation marks at the end of words,
and then utilize some simple heuristic rules to split the text
into sentences. Next, each sentence is passed into the LM
component to detect the boundaries of words having at least three syllables. Subsequently, the remaining white spaces are labeled by the classifier using LR, which was trained on the training data beforehand. The post-processing then handles the low-confident predictions produced by the classifier to return
the final segmented text.
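A very rough sketch of this kind of pre-processing is given below (ours; the actual patterns and heuristics of the system are not listed in the paper, so the regular expressions here are only illustrative assumptions):

# Sketch of simple pre-processing: separate punctuation marks and split the text
# into sentences with naive heuristics; patterns are illustrative, not the toolkit's.
import re

def preprocess(raw_text):
    text = re.sub(r"\s+", " ", raw_text.strip())           # standardize white space
    # regexes for numbers, times and dates would be applied here to keep them
    # as single tokens; omitted for brevity
    text = re.sub(r'([.,!?;:"()\[\]])', r" \1 ", text)      # separate punctuation marks
    text = re.sub(r"\s+", " ", text).strip()
    # naive sentence splitting on terminal punctuation
    return [s.strip() for s in re.split(r"(?<=[.!?]) ", text) if s.strip()]

print(preprocess("Hà Nội, ngày 20/11/2016. Trời rất đẹp!"))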
Figure 2. A 5-syllable window around a white space: syllables s_{-2}, ..., s_2 and white spaces y_{-2}, ..., y_2.
V. EXPERIMENTS
In this section, we present the feature templates used for
logistic regression and the performances of different systems
compared to our system. We also discuss the effect of the threshold r on the accuracy of our segmentation system.
A. Features
The performance of any statistical technique depends on the quality of its feature set. For the classifier using logistic regression in our system, to generate the feature vector of each white space, we capture a window of size 2 around it, as depicted in Figure 2, where s denotes a syllable, y denotes a white space, and the subscript is the index of the corresponding syllable or white space.
Table I presents all feature templates for logistic regression. In Table I, f_i denotes the lowercase-simplified form of syllable s_i; t_i is the type of syllable s_i; (f_i, f_j) is a combination feature; isVNFamilyName(s_i) returns true if and only if s_i is a Vietnamese family name; isVNSyllable(s_i) returns true if and only if s_i is a valid Vietnamese syllable. There are five types of syllable defined in our system, namely LOWER, UPPER, ALLUPPER, NUMBER and OTHER, corresponding to the cases in which the syllable has all lowercase letters, the syllable has an upper-case initial letter, the syllable has all upper-case letters, the syllable is a number, or the other cases, respectively.
For n-gram features, we use both the lowercase form of
syllables and their types. We do not use the original form of
syllables from input text to extract n-gram features because
we found that this way of feature extraction may produce unhelpful features that are not good for logistic regression. Moreover, we only add features of a syllable's type to the feature vector if the type is different from LOWER. Taking all features of the LOWER type into feature vectors can confuse the regression model and degrade its performance, because most syllables are of the LOWER type. These techniques can remove a large number of useless features. The sixth feature template in Table I is used for full-reduplicative words. The seventh one captures information about Vietnamese people's names, and the last one is used for detecting two consecutive proper names.
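To make the templates concrete, the sketch below (ours; the syllable-type function and the family-name/valid-syllable resources are crude stand-ins for the real ones) generates the feature strings for one white space from its 5-syllable window:

# Sketch of feature generation for one white space, following the templates in Table I.
VN_FAMILY_NAMES = {"nguyễn", "trần", "lê", "phạm"}  # assumed stand-in resource
VN_SYLLABLES = set()                                # would hold all valid Vietnamese syllables

def syllable_type(s):
    if s.isdigit():
        return "NUMBER"
    if s.isupper():
        return "ALLUPPER"
    if s[:1].isupper():
        return "UPPER"
    if s.islower():
        return "LOWER"
    return "OTHER"

def features(window):
    # window = [s_-2, s_-1, s_0, s_1, s_2]; list index 2 corresponds to s_0
    f = [s.lower() for s in window]
    t = [syllable_type(s) for s in window]
    feats = []
    feats += [f"f{i}={f[i]}" for i in range(5)]                                               # template 1
    feats += [f"f{i}{i+1}={f[i]}|{f[i+1]}" for i in range(4)]                                 # template 2
    feats += [f"t{i}={t[i]}" for i in range(5) if t[i] != "LOWER"]                            # template 3 (reduced as described above)
    feats += [f"t{i}{i+1}={t[i]}|{t[i+1]}" for i in range(4) if t[i] != "LOWER"]              # template 4
    feats += [f"t{i}..{i+2}={t[i]}|{t[i+1]}|{t[i+2]}" for i in range(3) if t[i] != "LOWER"]   # template 5
    if t[2] == t[3] == "LOWER" and f[2] == f[3]:
        feats.append("full_reduplicative")                                                    # template 6
    if t[2] == t[3] == "UPPER" and f[2] in VN_FAMILY_NAMES:
        feats.append("vn_family_name")                                                        # template 7
    if t[2] == t[3] == "UPPER" and f[2] in VN_SYLLABLES and f[3] not in VN_SYLLABLES:
        feats.append("two_proper_names")                                                      # template 8
    return feats

print(features(["Ông", "Nguyễn", "Văn", "A", "nói"]))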
B. Results
We analyze the effect of each component on our whole system. In our experiments, we use the Vietnamese Treebank corpus of 75k manually word-segmented sentences, which is one of the largest annotated corpora for Vietnamese. The corpus is randomly split into ten equal partitions for 10-fold cross-validation. F-measure is used, in which the precision ratio (P) is computed as the number of correctly segmented words over the total number of words produced by the segmentation system; the recall ratio (R) is computed as the number of correctly segmented words over the total number of words in the golden test set. The average
accuracies of systems over ten folds are presented in Table II.
We utilize LIBLINEAR L2-regularized logistic regression [22]
to implement the classifier for our experiments.
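As a small clarifying sketch (ours, not from the paper), word-level precision, recall and F-measure can be computed by comparing predicted word spans with gold spans over the same syllable sequence:

# Sketch: word-level P/R/F for segmentation, counting a word as correct
# only when both of its boundaries agree with the gold segmentation.
def spans(segmented):
    # 'Học_sinh học sinh_học' -> {(0, 2), (2, 3), (3, 5)} over syllable offsets
    out, start = set(), 0
    for token in segmented.split():
        size = len(token.split("_"))
        out.add((start, start + size))
        start += size
    return out

def prf(predicted, gold):
    p_spans, g_spans = spans(predicted), spans(gold)
    correct = len(p_spans & g_spans)
    precision = correct / len(p_spans)
    recall = correct / len(g_spans)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

print(prf("Học_sinh học sinh_học", "Học_sinh học sinh học"))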
Our baseline system is LM which uses only longest match-
ing algorithm and the rule to compose all consecutive UPPER
syllables into a word. Longest matching algorithm is only used
for phrases which have LOWER syllable(s) and do not contain any NUMBER or OTHER syllable. This system can gain
an F-measure of 97.21%; however, it obviously cannot resolve overlap ambiguity and out-of-vocabulary problems. Moreover, the rule for proper names is too greedy and fails in many cases. Meanwhile, if we only utilize the classifier using
logistic regression in system LR, the result is much better. The
regression model can handle many cases of overlap ambiguity
and out-of-vocabulary and provide a better way to detect
proper names. Combining these two components into the LM + LR system provides a slightly increased accuracy compared to the LR system. The precision ratio is higher because LM + LR is able to cover all the over-2-syllable words that the LR system fails to catch. However, its recall ratio is decreased because of the inconsistency of those words in the training data and rare ambiguous cases.
Post-processing for LR makes a significant impact on the
result of segmentation. The LR + Post system, which adds post-processing after the LR system, can reach the highest recall ratio of 98.99%. Our whole system, which is composed of all components (LM + LR + Post), achieves the best result of 98.82% F-measure. Obviously, the performance of post-processing mainly depends on the threshold r. In this experiment, we use r = 0.33, which gives the best result. In the next section, we take a deeper look into how to choose a proper threshold r and discuss its effect on the final result.
C. Discussion on threshold r and post-processing
Choosing a proper threshold r depends on the quality of the dictionary and how well the machine learning process performs. Roughly, choosing a high r means we rely on the dictionary more than on the result of the classifier, and vice versa. Figure 3 depicts our analysis of the effect of the threshold r on our system.
From this analysis, we can conclude that our system's results on the Vietnamese Treebank corpus are not too sensitive to the variability of r in a wide range from 0.25 to 0.40. However, the high similarity between the domains of the training data and the test set is one reason for the high performance of our system. When adapting to other domains, the system may face more problems with new words which even the dictionary cannot cover. In this situation, a validation set is needed in order to choose a proper threshold r for the new domain.
D. Comparison to other toolkits
Our approach is compared to other approaches that have
been presented in other studies. The accuracy figures are
depicted in Table III.
Table I
FEATURE TEMPLATES USED FOR LOGISTIC REGRESSION.

No.  Template
1    (f_i), i = -2, -1, 0, 1, 2
2    (f_i, f_{i+1}), i = -2, -1, 0, 1
3    (t_i), i = -2, -1, 0, 1, 2
4    (t_i, t_{i+1}), i = -2, -1, 0, 1 and t_i ≠ LOWER
5    (t_i, t_{i+1}, t_{i+2}), i = -2, -1, 0 and t_i ≠ LOWER
6    (t_0 = t_1 = LOWER and f_0 = f_1)?
7    (t_0 = t_1 = UPPER and isVNFamilyName(s_0))?
8    (t_0 = t_1 = UPPER and isVNSyllable(s_0) and !isVNSyllable(s_1))?
Table II
ACCURACIES OF SUB-SYSTEMS (%).
Sub-system P R F
LM 97.11 97.31 97.21
LR 97.95 98.29 98.12
LM + LR 98.11 98.16 98.14
LR + Post 98.59 98.99 98.79
LM + LR + Post 98.77 98.87 98.82
Figure 3. Effect of the threshold r on the word segmentation result.
Our system provides better results compared to other toolkits
on Vietnamese Treebank corpus. It should be repeated that
our classifier does not take information from the dictionary.
We suspect that this is the reason why it performs better than
other stochastic-based toolkits, DongDu and JVnSegmenter.
vnTokenizer [1], which uses regexes to cover proper names
before handling normal words, fails in many cases where
upper-case syllables appear consecutively. It is obvious that
statistical systems can perform better than vnTokenizer be-
cause the training data is not too different from the test set in
terms of content domain.
To make another comparison, we retrained each toolkit
using the full corpus of Vietnamese Treebank and then evalu-
ated them on an independent test set that consists of 10 files
from 800001.seg to 800010.seg provided by VLSP project.
From Table III, we can see that the performances of the statistical segmentation systems decrease considerably, because the new test set has a totally different domain, with many new words that have appeared in neither the training data nor
the dictionary of these systems. Our system with the main
component using logistic regression is not an exception but
it still has a good performance because of the simple feature
set which does not make use of information from dictionary.
vnTokenizer performs quite stably and its result is slightly increased. Notably, vnTokenizer's dictionary has more than 40k words [1], whereas ours has 32k. Although having
a poorer dictionary, our system is still able to outperform
vnTokenizer.
Moreover, we also collected a corpus of 1k articles from
Vietnamese online newspapers to measure the segmentation speed of the toolkits. Except for DongDu, which is developed in C++, the other toolkits are developed in Java. The evaluation was performed on a personal computer with 4 Intel Core i5-3337U CPUs @ 1.80GHz and 6GB of memory. The results are reported in Table IV. Our system runs faster than the other toolkits.
The DongDu toolkit also utilizes LIBLINEAR for machine learning; however, its feature set is much more complicated and its LIBLINEAR version is older than ours. We suspect these are the reasons why DongDu's speed is not as high as our system's. vnTokenizer and JVnSegmenter were written in old versions of Java. Their code for processing strings seems to be inefficient, so their speeds are quite low.
E. UETsegmenter
Our toolkit used for the above experiments is written in Java
and called UETsegmenter. It provides APIs for Vietnamese
word segmentation using a pretrained model and also some
methods for training and testing new models. The toolkit and
related resources are freely available for download³.
VI. CONCLUSIONS
In this paper, we propose a hybrid approach to Vietnamese
word segmentation using longest matching and logistic re-
gression. We cast this task as a binary classification problem
for white spaces, and the results show that the longest matching algorithm and logistic regression, combined with our simple post-processing techniques, help to gain high accuracy. Our system can reach a state-of-the-art result of 98.82% F-measure when evaluated on the Vietnamese Treebank corpus. Moreover, the system can perform at a high speed of 34k tokens per second. For future work, we will make a deeper study of the effect of the dictionary and the classifier on choosing a proper threshold and extend the post-processing to deal with other cases. We will
also find an efficient way to enrich the dictionary to produce
a better segmentation system.
³ https://github.com/phongnt570/UETsegmenter
Table III
ACCURACY COMPARISON (%).
Toolkit 10-fold CV Independent test set
P R F P R F
vnTokenizer 97.61 96.86 97.23 96.98 97.69 97.33
JVnSegmenter - Maxent 97.18 97.28 97.23 96.60 97.40 97.00
JVnSegmenter - CRFs 97.58 97.68 97.63 96.63 97.49 97.06
DongDu 97.44 98.01 97.72 96.35 97.46 96.90
Ours 98.77 98.87 98.82 97.51 98.23 97.87
Table IV
SPEED COMPARISON.
Toolkit JVnSeg (CRFs) JVnSeg (MaxEnt) vnTokenizer DongDu Ours
Speed (tokens/s) 764 1082 5322 16709 33705
ACKNOWLEDGMENT
This paper is supported by The Vietnam National Founda-
tion for Science and Technology Development (NAFOSTED)
under grant number 102.01-2014.22.
REFERENCES
[1] H. P. Le, T. M. H. Nguyen, A. Roussanaly, and T. V. Ho, “A hybrid ap-
proach to word segmentation of vietnamese texts,” in 2nd International
Conference on Language and Automata Theory and Applications-LATA
2008, vol. 5196. Springer Berlin/Heidelberg, 2008, pp. 240–249.
[2] D. D. Pham, G. B. Tran, and S. B. Pham, “A hybrid approach to viet-
namese word segmentation using part-of-speech tags,” in International
Conference on Knowledge and Systems Engineering 2009. IEEE, 2009,
pp. 154–161.
[3] D. Dinh, K. Hoang, and V. T. Nguyen, “Vietnamese word segmentation,”
in NLPRS, vol. 1, 2001, pp. 749–756.
[4] P.-T. Nguyen, X.-L. Vu, T.-M.-H. Nguyen, V.-H. Nguyen, and H.-P.
Le, “Building a large syntactically-annotated corpus of vietnamese,”
in Proceedings of the Third Linguistic Annotation Workshop,
ser. ACL-IJCNLP ’09. Stroudsburg, PA, USA: Association for
Computational Linguistics, 2009, pp. 182–185. [Online]. Available:
http://dl.acm.org/citation.cfm?id=1698381.1698416
[5] Q. T. Dinh, H. P. Le, T. M. H. Nguyen, C. T. Nguyen, M. Rossignol,
and X. L. Vu, “Word segmentation of Vietnamese texts: a comparison
of approaches,” in 6th International Conference on Language Resources
and Evaluation - LREC 2008. Marrakech, Morocco: ELRA -
European Language Resources Association, May 2008. [Online].
Available: https://hal.inria.fr/inria-00334760
[6] C. Huang and H. Zhao, “Chinese word segmentation: A decade review,”
Journal of Chinese Information Processing, vol. 21, no. 3, pp. 8–20,
2007.
[7] K.-J. Chen and S.-H. Liu, “Word identification for mandarin chinese
sentences,” in Proceedings of the 14th Conference on Computational
Linguistics - Volume 1, ser. COLING ’92. Stroudsburg, PA, USA:
Association for Computational Linguistics, 1992, pp. 101–107. [Online].
Available: http://dx.doi.org/10.3115/992066.992085
[8] P.-k. Wong and C. Chan, “Chinese word segmentation based
on maximum matching and word binding force,” in Proceedings
of the 16th Conference on Computational Linguistics - Volume
1, ser. COLING ’96. Stroudsburg, PA, USA: Association for
Computational Linguistics, 1996, pp. 200–203. [Online]. Available:
http://dx.doi.org/10.3115/992628.992665
[9] C.-T. Nguyen, T.-K. Nguyen, X.-H. Phan, L.-M. Nguyen, and Q.-T. Ha,
“Vietnamese word segmentation with crfs and svms: An investigation,”
in Proceedings of the 20th Pacific Asia Conference on Language,
Information and Computation (PACLIC 2006), 2006.
[10] T. Kudo, K. Yamamoto, and Y. Matsumoto, “Applying conditional
random fields to japanese morphological analysis,” in EMNLP, vol. 4,
2004, pp. 230–237.
[11] M. Sassano, “An empirical study of active learning with support
vector machines for japanese word segmentation,” in Proceedings
of the 40th Annual Meeting on Association for Computational
Linguistics, ser. ACL ’02. Stroudsburg, PA, USA: Association for
Computational Linguistics, 2002, pp. 505–512. [Online]. Available:
http://dx.doi.org/10.3115/1073083.1073168
[12] T. Nguyen, V. Nguyen, and A. Le, “Vietnamese word segmentation
using hidden markov model,” in International Workshop for Computer,
Information, and Communication Technologies in Korea and Vietnam,
2003.
[13] C. P. Papageorgiou, “Japanese word segmentation by hidden markov
model,” in Proceedings of the Workshop on Human Language
Technology, ser. HLT ’94. Stroudsburg, PA, USA: Association for
Computational Linguistics, 1994, pp. 283–288. [Online]. Available:
http://dx.doi.org/10.3115/1075812.1075875
[14] L. A. Ha, “A method for word segmentation in vietnamese,” in Proceed-
ings of Corpus Linguistics, 2003.
[15] O. T. Tran, C. A. Le, and T. Q. Ha, “Improving vietnamese word seg-
mentation and pos tagging using mem with various kinds of resources,”
Information and Media Technologies, vol. 5, no. 2, pp. 890–909, 2010.
[16] D. Dinh and T. Vu, “A maximum entropy approach for vietnamese word
segmentation,” in International Conference on Research, Innovation and
Vision for the Future, 2006. IEEE, 2006, pp. 248–253.
[17] W. Liu and L. Lin, “Probabilistic ensemble learning for vietnamese word
segmentation,” in Proceedings of the 37th International ACM SIGIR
Conference on Research & Development in Information Retrieval,
ser. SIGIR ’14. New York, NY, USA: ACM, 2014, pp. 931–934.
[Online]. Available: http://doi.acm.org/10.1145/2600428.2609477
[18] C. Haruechaiyasak, S. Kongyoung, and M. Dailey, “A comparative
study on thai word segmentation approaches,” in Electrical Engineer-
ing/Electronics, Computer, Telecommunications and Information Tech-
nology, 2008. ECTI-CON 2008. 5th International Conference on, vol. 1,
May 2008, pp. 125–128.
[19] Z. Wang, C. Zong, and N. Xue, “A lattice-based framework for joint
chinese word segmentation, pos tagging and parsing,” 2013.
[20] N. Kaji and M. Kitsuregawa, “Accurate word segmentation and pos
tagging for japanese microblogs: Corpus annotation and joint modeling
with lexical normalization,” in EMNLP, 2014, pp. 99–109.
[21] C.-T. Nguyen and X.-H. Phan, “Jvnsegmenter: A java-based vietnamese
word segmentation tool,” Retrieved on, vol. 30, 2011.
[22] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J.
Lin, “Liblinear: A library for large linear classification,” J. Mach.
Learn. Res., vol. 9, pp. 1871–1874, Jun. 2008. [Online]. Available:
http://dl.acm.org/citation.cfm?id=1390681.1442794