Conference PaperPDF Available

Probabilistic Ensemble Learning for Vietnamese Word Segmentation

July 2014

July 2014

DOI:10.1145/2600428.2609477

Conference: SIGIR 2014
At: Gold Coast, QLD, Australia

Authors:

Wuying Liu

Guangdong University of Foreign Studies

Word segmentation is a challenging issue, and the corresponding algorithms can be used in many applications of natural language processing. This paper addresses the problem of Vietnamese word segmentation, proposes a probabilistic ensemble learning (PEL) framework, and designs a novel PEL-based word segmentation (PELWS) algorithm. Supported by the data structure of syllable-syllable frequency index, the PELWS algorithm combines multiple weak segmenters to form a strong segmenter within the PEL framework. The experimental results show that the PELWS algorithm can achieve the state-of-the-art performance in the Vietnamese word segmentation task.

Probabilistic ensemble learning framework.

…

Syllable-syllable frequency index.

…

shows the SSFI structure organized as a hash table. The table entry of the SSFI is a key-value pair <Key, Value>, where each key S i S j denotes a sequence of continuous two syllables and each value consists of 4 integers. The integer f c (·) denotes the times of correct prediction and the integer f m (·) denotes the times of mistaken prediction. The symbol ('_') means that the continuous two syllables S i and S j are in a word, and the symbol (' ') means that the continuous two syllables S i and S j are not in a word. The hash function hash(S i S j ) maps the sequence S i S j to the address of the 4 integers.

…

PEL-based word segmentation algorithm.

…

Figures - uploaded by Wuying Liu

Content may be subject to copyright.

Content uploaded by Wuying Liu

Content may be subject to copyright.

Probabilistic Ensemble Learning for Vietnamese

Word Segmentation

Wuying Liu

Luoyang University of Foreign Languages

471003 Luoyang, Henan, CHINA

wyliu@nudt.edu.cn

Li Lin

Luoyang University of Foreign Languages

471003 Luoyang, Henan, CHINA

lamle@163.com

ABSTRACT

Word segmentation is a challenging issue, and the corresponding

algorithms can be used in many applications of natural language

processing. This paper addresses the problem of Vietnamese word

segmentation, proposes a probabilistic ensemble learning (PEL)

framework, and designs a novel PEL-based word segmentation

(PELWS) algorithm. Supported by the data structure of syllable-

syllable frequency index, the PELWS algorithm combines

multiple weak segmenters to form a strong segmenter within the

PEL framework. The experimental results show that the PELWS

algorithm can achieve the state-of-the-art performance in the

Vietnamese word segmentation task.

Categories and Subject Descriptors

H.3.1 [Information Storage and Retrieval]: Content Analysis

and Indexing – dictionaries, indexing methods, linguistic

processing.

I.2.7 [Artificial Intelligence]: Natural Language Processing –

language parsing and understanding, text analysis.

General Terms

Algorithms, Performance, Experimentation, Languages

Keywords

Word Segmentation Algorithm, Probabilistic Ensemble Learning,

Multi-Segmenter, Vietnamese, Syllable-Syllable Frequency Index

1. INTRODUCTION

Vietnamese is a monosyllabic language, whose basic linguistic

unit is called “tiếng”, similar to traditional syllables in respect of

phonetic form. Like Thai, Japanese and Chinese text, Vietnamese

text is also a text without any explicit separator between words.

Thus, identifying the word boundaries is crucial to above oriental

languages in many applications of natural language processing

(NLP) [1]. For instance, the following Sentence1 is a raw

unsegmented Vietnamese sentence, and the following Sentence2

is its related segmented one.

Sentence1:

hập khẩu tôm vào Mỹ năm qua vẫn cao kỷ lục do

nguồn cung cấp từ Thái Lan và Indonesia tăng .

Sentence2:

hập_khẩu tôm vào Mỹ năm qua vẫn cao kỷ_lục do

nguồn_cung_cấp từ Thái_Lan và Indonesia tăng .

The above instance shows that Vietnamese text is a sequence of

syllables and each two continuous syllables are separated by a space

symbol. The space symbol belongs to an overload symbol (as a

connector within a word or as a separator between words) in raw

text. Therefore, the Vietnamese word segmentation task can be

defined as a binary categorization problem for each space symbol. If

a space symbol is a connector in a word, we will output a symbol

('_') to replace it. And if a space symbol is a separator between

words, we will maintain it as a space symbol (' ') in the segmented

result.

Since the early days of Vietnamese information processing, word

segmentation has been widely investigated. Till now, many

effective algorithms have been proposed for Vietnamese word

segmentation [2]. The early dictionary-based word segmentation

algorithms mainly include maximum matching algorithm and

reverse maximum matching algorithm. The dictionary-based

algorithm has a straightforward implementation, but its performance

highly depends on a suitable dictionary. Subsequently, various kinds

of advanced machine learning algorithms (such as maximum

entropy [3], support vector machines and conditional random fields

[4]) regard word segmentation as a sequence labeling problem, and

related algorithms can obtain preferable performance in Vietnamese

word segmentation. To resolve ambiguities of word segmentation, a

hybrid algorithm combines finite-state automata, regular expression

and maximum matching techniques [5] to implement a highly

accurate Vietnamese tokenizer (vnTokenizer). Recently, part of

speech (POS) is considered as a helpful tag for word segmentation

and word ambiguity resolution [6]. Therefore, many affixed

resources (such as POS tag) are applied in word segmentation

algorithm [7].

Previous researches show that ensemble learning has statistical,

computational and representational advantages [8]. Ensemble

learning can improve the performance of existing algorithms for

spam filtering [9] and text categorization [10]. Based on the

motivation of ensemble learning, we propose a novel probabilistic

ensemble learning (PEL) framework, which is a meta-framework

for integrating various existing word segmentation algorithms.

Under our PEL framework, we design and implement a PEL-based

word segmentation (PELWS) algorithm.

2. PROBABILISTIC ENSEMBLE LEARNING

2.1 Framework

In order to colligate various advantages of different word

segmentation algorithms, we propose a PEL framework by a

Permission to make digital or hard copies of all or part of this work for

personal or classroom use is granted without fee provided that copies are

not made or distributed for profit or commercial advantage and that

copies bear this notice and the full citation on the first page. Copyrights

for components of this work owned by others than ACM must be

honored. Abstracting with credit is permitted. To copy otherwise, or

republish, to post on servers or to redistribute to lists, requires prior

specific permission and/or a fee. Request permissions from

Permissions@acm.org.

SIGIR '14, July 06–11, 2014, Gold Coast, QLD, Australia.

http://dx.doi.org/10.1145/2600428.2609477

931

weighted-voting-like strategy. Figure 1 shows the PEL framework

for Vietnamese word segmentation. The framework mainly

includes several pairs of <Segmenter, PELearner (Probabilistic

Ensemble Learner)> and a PEPredictor (Probabilistic Ensemble

Predictor). The Segmenters are implemented from different

Vietnamese word segmentation algorithms. For each Segmenter,

we design and install a PELearner, which receives golden

standard segmented texts and automatic segmented texts from its

corresponding Segmenter, and learns a probabilistic ensemble

model from both two texts. The probabilistic ensemble model is

stored as a syllable-syllable frequency index (SSFI). The

PEPredictor combines multi-segmenter’s segmented texts

according to the SSFIs and makes the final prediction for each

unsegmented text.

Figure 1. Probabilistic ensemble learning framework.

Figure 2 shows the SSFI structure organized as a hash table. The

table entry of the SSFI is a key-value pair <Key, Value>, where

each key SiSj denotes a sequence of continuous two syllables and

each value consists of 4 integers. The integer fc(·) denotes the

times of correct prediction and the integer fm(·) denotes the times

of mistaken prediction. The symbol ('_') means that the

continuous two syllables Si and Sj are in a word, and the symbol ('

') means that the continuous two syllables Si and Sj are not in a

word. The hash function hash(SiSj) maps the sequence SiSj to the

address of the 4 integers.

Figure 2. Syllable-syllable frequency index.

Supported by the SSFI, we design a supervised PELWS algorithm,

which regards the historical precision as an ensemble probability

and uses it as an ensemble weighted coefficient.

2.2 Algorithm

Figure 3 gives the pseudo-code for the PELWS algorithm

consisting of two main functions: pep and pel. The PELWS

algorithm takes the predicting process as an index retrieving

process and takes the PEL process as an index updating process.

When an unsegmented (sss = null) text arrives, the pep function

will be triggered: (I) It calls each Vietnamese word segmenter for

multiple segmented result texts; (II) It retrieves the current SSFI

and calculates a probabilistic score according to the historical

precision described in Eq. (1) for each space symbol ' ' in each

segmented result text; and (III) It combines multiple probabilistic

scores to form an ensemble score described in Eq. (2) and use a

fixed threshold (score = 0) to make the final binary prediction for

each space symbol.

().()

() ().() ().()

ij c

ij ij

hash s s f

Ps s hash s s f hash s s f







(1)

(-1) ( )

core P s s





 If = ' ', Then θ = 0

If = '_', Then θ = 1 (2)

When a segmented text arrives, the pel function firstly does the

same work to the above (I) step, and finally compares the

automatic segmented result with the golden standard to update the

values of the SSFI.

1. // PEL-based Word Segmentation (PELWS) Algorithm

2. Integer: n; // Number of Segmenters

3. Array[n]: ssfis; // Array of Syllable-Syllable Frequency Indexes

4. Array[n]: vwss; // Array of Vietnamese Word Segmenters

5. String: uss; // Unsegmented Sequence of Syllables

6. String: sss; // Segmented Sequence of Syllables

7. If (sss = null) Then sss  pep(ssfis, vwss, uss);

8. Else ssfis  pel(ssfis, vwss, uss, sss);

10. // Probabilistic Ensemble Predictor

11. Function String: pep(ssfis, vwss, uss)

12. String: sss  '';

13. String[n]: ssss; // Array of Segmented Sequences of Syllables

14. Integer: m; // Number of Syllables in Unsegmented Sequence

15. For Integer i  1 To n Do

16. ssss[i]  vwss[i].segment(uss);

17. End For

18. For Integer j  1 To (m-1) Do

19. Float score  0;

20. String s1  getSyllable(uss, j); // Current Syllable

21. String s2  getSyllable(uss, j+1); // Next Syllable

22. For Integer i  1 To n Do

23. String space  getSpace(ssss[i], j);

24. Float cor  ssfis[i].hash(s1s2).fc(space);

25. Float mis  ssfis[i].hash(s1s2).fm(space);

26. If (space = ' ') Then score  score + cor / (cor + mis);

27. If (space = '_') Then score  score - cor / (cor + mis);

28. End For

29. If (score > 0) Then sss  sss + s1 + ' ';

30. Else sss  sss + s1 + '_';

31. If (j = (m-1)) Then sss  sss + s2;

32. End For

33. Return sss;

34.

35. // Probabilistic Ensemble Learners

36. Function Array[]: pel(ssfis, vwss, uss, sss)

37. String[n]: ssss; // Array of Segmented Sequences of Syllables

38. Integer: m; // Number of Syllables in Unsegmented Sequence

39. For Integer i  1 To n Do

40. ssss[i]  vwss[i].segment(uss);

41. End For

42. For Integer j  1 To (m-1) Do

43. String s1  getSyllable(uss, j); // Current Syllable

44. String s2  getSyllable(uss, j+1); // Next Syllable

45. For Integer i  1 To n Do

46. String space1  getSpace(ssss[i], j);

47. String space2  getSpace(sss, j);

48. If (space1 = space2) Then ssfis[i].hash(s1s2).fc(space1) ++;

49. Else ssfis[i].hash(s1s2).fm(space1) ++;

50. End For

51. End For

52. Return ssfis;

Figure 3. PEL-based word segmentation algorithm.

932

The PELWS algorithm, independent of any concrete segmenter, is

a general meta-algorithm, whose space-time complexity mainly

depends on the SSFI storage space and the segment loops in the

pep and the pel functions. The SSFI is space-efficient owing to

the inherent compressible property of index files. Theoretically,

the property ensures that the SSFI storage space is proportional to

the total number of continuous two syllables and is independent

of the number of training texts. The updating or retrieving of SSFI

has constant time complexity according to the hash function. The

maximal space complexity O(np) and the maximal time

complexity O(nq) of the PEL framework are both acceptable in

practical NLP applications. Here, n means the number of

segmenters, usually a low number; p means the number of

continuous two syllables; and q means the total number of texts.

3. EXPERIMENT

3.1 Implementation

According to the PELWS algorithm, we firstly implement a

PELSegmenter (PEL), which combines three existing weak

segmenters: vnTokenizer 1 (VNT), RMMSegmenter (RMM) and

MMSegmenter 2 (MM). The VNT is an implementation of the

hybrid word segmentation algorithm. The RMM and the MM are

implemented from the dictionary-based reverse maximum

matching algorithm and the dictionary-based maximum matching

algorithm respectively.

Secondly, we implement a simple ensemble learning segmenter:

SELSegmenter (SEL) as the baseline, which applies a simple

majority rule to combine the VNT, RMM and MM. The SEL

resembles a simplified PELSegmenter without probabilistic

ensemble learners.

Because the VNT, RMM and MM all support dictionary

extension and the PELWS algorithm is a supervised algorithm, we

can add all words of training texts into the dictionary and build

three corresponding enhanced segmenters. Finally, we can get

two enhanced segmenters: enhanced PEL and enhanced SEL

through combining the three enhanced segmenters.

3.2 Corpus and Evaluation

In the experiment, we use a publicly available benchmark dataset

(Corpus for Vietnamese Word Segmentation3, CVWS), which

contains total 7807 sentences with word boundary labels from 305

Vietnamese newspaper articles in various domains.

The international Bakeoff [11] evaluation measure and associated

evaluation methodology are applied. In the experiment, we report

the classical Precision (P), Recall (R), F1-measure (F1) and Error

Rate (ER) to evaluate the result of segmenters. The value of P, R,

F1 belongs to [0, 1], where 1 is optimal, while the value of ER

belongs to [0, 1], where 0 is optimal.

 (3)

 (4)

1 http://vlsp.vietlp.org:8080/demo/?page=resources&tool=tokenizer.

2 http://cbd.nichesite.org/CBD2013S002.htm.

3 http://www.jaist.ac.jp/~hieuxuan/vnwordseg/data.

2PR

F1 PR





(5)

 (6)

The above four measures are computed as Eq. (3) to Eq. (6)

separately. Where the N denotes the total number of words in the

manually segmented text, the C denotes the number of correctly

segmented words by an automatic segmenter, and the M denotes

the number of mistakenly segmented words by an automatic

segmenter.

3.3 Result and Discussion

In the experiment, we use three-fold cross validation by evenly

splitting the CVWS dataset into three parts and use two parts for

training and the remaining third for testing. We perform the

training-testing procedure three times and use the average of the

three performances as the final result.

Segmenter P R F1 ER

VNT 0.882 0.911 0.896 0.122

RMM 0.895 0.916 0.905 0.108

MM 0.899 0.921 0.910 0.103

SEL 0.899 0.923 0.911 0.103

PEL 0.938 0.945 0.942 0.062

Figure 4. Experimental result of the segmenters.

The experiment includes two parts. In part one, we run the five

segmenters (VNT, RMM, MM, SEL and PEL, mentioned in

Section 3.1) respectively in the Vietnamese word segmentation

task. Figure 4 presents the experimental result, which shows that

(I) the four measures of ensemble learning segmenter excel that of

each individual segmenter, for instance, the R value of SEL and

PEL is 0.923 and 0.945 respectively, while the R value of VNT,

RMM and MM is 0.911, 0.916 and 0.921 respectively; and (II)

the four measures of probabilistic ensemble learning segmenter

excel that of simple ensemble learning segmenter, for instance,

the P value of SEL and PEL is 0.899 and 0.938 respectively. The

experimental result proves that the PEL framework is effective to

improve the performance of existing algorithms for Vietnamese

word segmentation, and the probabilistic ensemble strategy

precedes the simple majority rule.

933

Segmenter P R F1 ER

VNT 0.924 0.907 0.915 0.075

RMM 0.946 0.920 0.933 0.052

MM 0.950 0.924 0.937 0.049

SEL 0.952 0.926 0.939 0.047

PEL 0.955 0.936 0.945 0.044

Figure 5. Experimental result of the enhanced segmenters.

In part two, we run the five enhanced segmenters respectively.

Figure 5 presents the experimental result, which shows that (I) the

four measures of each enhanced segmenter excel that of

corresponding original segmenter, for instance, the ER value of

VNT and enhanced VNT is 0.122 and 0.075 respectively; and (II)

the four measures of enhanced PEL are the best ones, for instance,

the F1 value of enhanced PEL is the best 0.945. The experimental

result verifies that the PELWS algorithm can combine weak

segmenters with an enhanced dictionary to form a strong

segmenter, which can achieve the optimal performance and even

exceeds the performance of some advanced machine learning

algorithms for word segmentation.

4. CONCLUSION

This paper elucidates that our PEL framework can take advantage

of complementary virtues from different algorithms, and is very

useful to improve the performance of existing algorithms for

Vietnamese word segmentation. Moreover, the PEL framework is

suitable to parallel running environment. If the PEL framework is

deployed on the reduplicate hardware for multi-segmenter, the

computational time to segment a text will theoretically near to the

lowest segmenter’s running time.

Further research will concern other sequence labeling problems

(such as named entity recognition, POS tagging, parsing, and

semantic role labeling) within the PEL framework. We will

transfer above research productions to other suitable oriental

languages like Thai, Japanese, Chinese, and so on.

5. REFERENCES

[1] Doan Nguyen. 2008. Query preprocessing: improving web

search through a Vietnamese word tokenization approach. In

Proceedings of the 31st Annual International ACM SIGIR

Conference on Research and Development in Information

Retrieval (Singapore, Singapore, July 20-24, 2008). SIGIR

'08. ACM New York, NY, USA, 765-766.

[2] Quang Thang Dinh, Hong Phuong Le, Thi Minh Huyen

Nguyen, Cam Tu Nguyen, Mathias Rossignol, Xuan Luong

Vu. 2008. Word segmentation of Vietnamese texts: a

comparison of approaches. In Proceedings of the 6th

International Conference on Language Resources and

Evaluation (Marrakech, Morocco, May 28-30, 2008). LREC

'08. European Language Resources Association, 1933-1936.

[3] Dinh Dien and Vu Thuy. 2006. A maximum entropy

approach for Vietnamese word segmentation. In Proceedings

of the 4th International Conference on Computer Sciences:

Research, Innovation and Vision for the Future (Ho Chi

Minh City, Vietnam, February 12-16, 2006). RIVF '06.

IEEE, 248-253.

[4] Cam Tu Nguyen, Trung Kien Nguyen, Xuan Hieu Phan, Le

Minh Nguyen, Quang Thuy Ha. 2006. Vietnamese word

segmentation with CRFs and SVMs: an investigation. In

Proceedings of the 20th Pacific Asia Conference on

Language, Information and Computation (Wuhan, China,

November 2-4, 2006). PACLIC 2006. Tsinghua University

Press, 215-222.

[5] Hong Phuong Le, Thi Minh Huyen Nguyen, Azim

Roussanaly, Tuong Vinh Ho. 2008. A hybrid approach to

word segmentation of Vietnamese texts. Language and

Automata Theory and Applications. Lecture Notes in

Computer Science, Volume 5196, 240-249.

[6] Thi Oanh Tran, Anh Cuong Le, Quang Thuy Ha. 2010.

Improving Vietnamese word segmentation and POS tagging

using MEM with various kinds of resources. Information and

Media Technologies. 5, 2 (2010), 890-909.

[7] Dang Duc Pham, Giang Binh Tran, Son Bao Pham. 2009. A

hybrid approach to Vietnamese word segmentation using

part of speech tags. In Proceedings of the 1st International

Conference on Knowledge and Systems Engineering (Hanoi,

Vietnam, October 13-17, 2009). KSE '09. IEEE Computer

Society Washington, DC, USA, 154-161.

[8] Thomas G. Dietterich. 2000. Ensemble methods in machine

learning. In Proceedings of the 1st International Workshop

on Multiple Classifier Systems (Cagliari, Italy, June, 2000).

MCS 2000. Lecture Notes in Computer Science, Volume

1857, 1-15.

[9] Wuying Liu and Ting Wang. 2010. Multi-field learning for

email spam filtering. In Proceedings of the 33rd Annual

International ACM SIGIR Conference on Research and

Development in Information Retrieval (Geneva, Switzerland,

July 19-23, 2010). SIGIR '10. ACM New York, NY, USA,

745-746.

[10] Wuying Liu and Ting Wang. 2012. Online active multi-field

learning for efficient email spam filtering. Knowledge and

Information Systems. 33, 1 (October, 2012), 117-136.

[11] Richard Sproat and Thomas Emerson. 2003. The first

international Chinese word segmentation Bakeoff. In

Proceedings of the 2nd SIGHAN Workshop on Chinese

Language Processing (Sapporo, Japan, July 11-12, 2003).

SIGHAN '03. Association for Computational Linguistics,

Stroudsburg, PA, USA, 133-143.

934

A hybrid approach to Vietnamese word segmentation

Conference Paper

Full-text available

Nov 2016

Word segmentation is the very first task for Vietnamese language processing. Word-segmented text is the input of almost other NLP tasks. This task faces some challenges due to specific characteristics of the language. As in many other Asian languages such as Japanese, Korean and Chinese, white spaces in Vietnamese are not always used as word separators and a word may contain one or more syllables. In this paper, we propose an efficient hybrid approach to detect word boundary for Vietnamese texts using logistic regression as a binary classifier combining with longest matching algorithm. First, longest matching algorithm is used to catch words that contain more than two syllables in input sentence. Next, the system utilizes the classifier to determine the boundary of 2-syllable words and proper names. Then, the predictions having low confidence conducted by the classifier are verified by a dictionary to get the final result. Our system can achieve an F-measure of 98.82% which is the most accurate result for Vietnamese word segmentation to the best of our knowledge. Moreover, the system also has a high speed. It can run word segmentation for nearly 34k tokens per second.

A Fast and Accurate Vietnamese Word Segmenter

Article

Sep 2017

We propose a novel rule-based approach to Vietnamese word segmentation. Our approach is based on the Single Classification Ripple Down Rules methodology (Compton and Jansen, 1990), where rules are stored in an exception structure and new rules are only added to correct segmentation errors given by existing rules. Experimental results on the benchmark Vietnamese treebank show that our approach outperforms previous state-of-the-art approaches JVnSegmenter, vnTokenizer, DongDu and UETsegmenter in terms of both accuracy and performance speed. Our code is open-source and available at: https://github.com/datquocnguyen/RDRsegmenter

Malay-Corpus-Enhanced Indonesian-Chinese Neural Machine Translation: 10th International Symposium, ISICA 2018, Jiujiang, China, October 13–14, 2018, Revised Selected Papers

Chapter

Feb 2019

Due to the lack of structured language resources, low-resource language machine translation often faces difficulties in cross-language semantic paraphrasing. In order to solve the problem of low-resource machine translation from Indonesian to Chinese, a cognate-parallel-corpus-based expanding method of language resources is proposed, and an improved neural machine translation model is trained by the Malay-corpus-enhanced corpus. The improved model can achieve a comparable result as that of Google in the experiment of Indonesian-Chinese machine translation. The experimental results also show that the morphological similarity and semantic equivalence between the languages are very effective computational features to improve the performance of neural machine translation for low-resource languages.

From Word Segmentation to POS Tagging for Vietnamese

Article

Nov 2017

This paper presents an empirical comparison of two strategies for Vietnamese Part-of-Speech (POS) tagging from unsegmented text: (i) a pipeline strategy where we consider the output of a word segmenter as the input of a POS tagger, and (ii) a joint strategy where we predict a combined segmentation and POS tag for each syllable. We also make a comparison between state-of-the-art (SOTA) feature-based and neural network-based models. On the benchmark Vietnamese treebank (Nguyen et al., 2009), experimental results show that the pipeline strategy produces better scores of POS tagging from unsegmented text than the joint strategy, and the highest accuracy is obtained by using a feature-based model.

Unsupervised ensemble learning for Vietnamese multisyllabic word extraction

Conference Paper

Nov 2016

Development of thai word segmentation technique for solving problems with unknown words

Conference Paper

Nov 2015

Multi-Field Learning for Email Spam Filtering

Conference Paper

Full-text available

Jul 2010

Through the investigation of email document structure, this paper proposes a multi-field learning (MFL) framework, which breaks the multi-field document Text Classification (TC) problem into several sub-document TC problems, and makes the final category prediction by weighted linear combination of several sub-document TC results. Many previous statistical TC algorithms can be easily rebuilt within the MFL framework via turning binary result to spamminess score, which is a real number and reflects the likelihood that the classified email is spam. The experimental results in the TREC spam track show that the performances of many TC algorithms can be improved within the MFL framework.

Online Active Multi-Field Learning for Efficient Email Spam Filtering

Article

Full-text available

Oct 2011

Email spam causes a serious waste of time and resources. This paper addresses the email spam filtering problem and proposes an online active multi-field learning approach, which is based on the following ideas: (1) Email spam filtering is an online application, which suggests an online learning idea; (2) Email document has a multi-field text structure, which suggests a multi-field learning idea; and (3) It is costly to obtain a label for a real-world email spam filter, which suggests an active learning idea. The online learner regards the email spam filtering as an incremental supervised binary streaming text classification. The multi-field learner combines multiple results predicted by field classifiers in a novel compound weight schema, and each field classifier calculates the arithmetical average of multiple conditional probabilities calculated from feature strings according to a data structure of string-frequency index. Comparing the current variance of field classifying results with the historical variance, the active learner evaluates the classifying confidence and takes the more uncertain email as the more informative sample for which to request a label. The experimental results show that the proposed approach can achieve the state-of-the-art performance with greatly reduced label requirements and very low space-time costs. The performance of our online active multi-field learning, the standard (1-ROCA)% measurement, even exceeds the full feedback performance of some advanced individual text classification algorithms.

Word segmentation of Vietnamese texts: a comparison of approaches

Article

Full-text available

May 2008

We present in this paper a comparison between three segmentation systems for the Vietnamese language. Indeed, the majority of Vietnamese words is built by semantic composition from about 7,000 syllables, that also have a meaning as isolated words. So the identification of word boundaries in a text is not a simple task, and ambiguities often appear. Beyond the presentation of the tested systems, we also propose a standard definition for word segmentation in Vietnamese, and introduce a reference corpus developed for the purpose of evaluating such a task. The results observed confirm that it can be relatively well treated by automatic means, although a solution needs to be found to take into account out-of-vocabulary words.

A Hybrid Approach to Word Segmentation of Vietnamese Texts

Conference Paper

Full-text available

Dec 2013

We present in this article a hybrid approach to automatically tokenize Vietnamese text. The approach combines both finite-state automata technique, regular expression parsing and the maximal-matching strategy which is augmented by statistical methods to resolve ambiguities of segmentation. The Vietnamese lexicon in use is compactly represented by a minimal finite-state automaton. A text to be tokenized is first parsed into lexical phrases and other patterns using pre-defined regular expressions. The automaton is then deployed to build linear graphs corresponding to the phrases to be segmented. The application of a maximal- matching strategy on a graph results in all candidate segmentations of a phrase. It is the responsibility of an ambiguity resolver, which uses a smoothed bigram language model, to choose the most probable segmentation of the phrase. The hybrid approach is implemented to create vnTokenizer, a highly accurate tokenizer for Vietnamese texts.

The First International Chinese Word Segmentation Bakeoff

Article

Full-text available

Jun 2003

This paper presents the results from the ACL-SIGHAN-sponsored First International Chinese Word Segmentation Bakeoff held in 2003 and reported in conjunction with the Second SIGHAN Workshop on Chinese Language Processing, Sapporo, Japan. We give the motivation for having an international segmentation contest (given that there have been two within-China contests to date) and we report on the results of this first international contest, analyze these results, and make some recommendations for the future.

Ensemble methods in machine learning

Conference Paper

Jan 2000
Lect Notes Comput Sci

Thomas G. Dietterich

Ensemble methods are learning algorithms that construct a set of classifiers and then classify new data points by taking a (weighted) vote of their predictions. The original ensemble method is Bayesian averaging, but more recent algorithms include error-correcting output coding, Bagging, and boosting. This paper reviews these methods and explains why ensembles can often perform better than any single classifier. Some previous studies comparing ensemble methods are reviewed, and some new experiments are presented to uncover the reasons that Adaboost does not overfit rapidly.

Improving Vietnamese Word Segmentation and POS Tagging using MEM with Various Kinds of Resources

Article

Jan 2010

Word segmentation and POS tagging are two important problems included in many NLP tasks. They, however, have not drawn much attention of Vietnamese researchers all over the world. In this paper, we focus on the integration of advantages from several resourses to improve the accuracy of Vietnamese word segmentation as well as POS tagging task. For word segmentation, we propose a solution in which we try to utilize multiple knowledge resources including dictionary-based model, N-gram model, and named entity recognition model and then integrate them into a Maximum Entropy model. The result of experiments on a public corpus has shown its effectiveness in comparison with the best current models. We got 95.30% F1 measure. For POS tagging, motivated from Chinese research and Vietnamese characteristics, we present a new kind of features based on the idea of word composition. We call it morpheme-based features. Our experiments based on two POS-tagged corpora showed that morpheme-based features always give promising results. In the best case, we got 89.64% precision on a Vietnamese POS-tagged corpus when using Maximum Entropy model.

A Hybrid Approach to Vietnamese Word Segmentation Using Part of Speech Tags

Article

Oct 2009

Word segmentation is one of the most important tasks in NLP. This task, within Vietnamese language and its own features, faces some challenges, especially in words boundary determination. To tackle the task of Vietnamese word segmentation, in this paper, we propose the WS4VN system that uses a new approach based on Maximum matching algorithm combining with stochastic models using part-of-speech information. The approach can resolve word ambiguity and choose the best segmentation for each input sentence. Our system gives a promising result with an F-measure of 97%, higher than the results of existing publicly available Vietnamese word segmentation systems.

A maximum entropy approach for vietnamese word segmentation

Conference Paper

Feb 2006

First Page of the Article

Query preprocessing: improving web search through a Vietnamese word tokenization approach

Conference Paper

Jul 2008

Doan Nguyen

In this poster paper, we propose a novel approach to improve web search relevancy by tokenizing a Vietnamese query text prior submitting it to a search engine. Evaluations demonstrate its effectiveness and practical value.

Probabilistic Ensemble Learning for Vietnamese Word Segmentation

Abstract and Figures

Recommended publications

Efficient Interactive Search for Geo-tagged Multimedia Data

Multidimensional bisection: A dual viewpoint

A Framework of Online Proxy-Based Web Prefetching