Angular Measures for Feature Selection in Text Categorization

E.F. Combarro
Artificial Intelligence Center
University of Oviedo
Campus de Viesques S/N
Gijón, Spain
elias@aic.uniovi.es

Elena Montañés
Artificial Intelligence Center
University of Oviedo
Campus de Viesques S/N
Gijón, Spain
elena@aic.uniovi.es

José Ranilla
Artificial Intelligence Center
University of Oviedo
Campus de Viesques S/N
Gijón, Spain
ranilla@uniovi.es
ABSTRACT
Text Categorization, which consists of automatically assigning documents to a set of categories, usually involves the management of a huge number of features. Most of them are irrelevant or introduce noise which misleads the classifiers. Thus, feature reduction is often performed in order to increase the efficiency and effectiveness of the classification. In this paper we propose to select relevant features by means of what we call Angular Measures, which are simpler than other measures usually applied for this purpose. We carry out experiments over two different corpora and find that the proposed measures perform as well as or better than some of the existing ones.
Categories and Subject Descriptors
I.5.2 [Pattern Recognition]: Design Methodology—Feature evaluation and selection; I.7.1 [Document and Text Processing]: Document and Text Editing—Document management
General Terms
Theory, Measurement, Experimentation, Performance
1. INTRODUCTION
One of the main tasks in the processing of large collections of text files is that of assigning the documents of a corpus to a set of previously fixed categories, a task known as Text Categorization (TC) [11]. The most common way of representing the documents for TC is the bag of words (see [10]). In this representation, a vector is associated to each document whose components quantify the importance of each of its words. This usually involves a great number of features, and most of them can be irrelevant or noisy [10]. Thus, feature reduction often leads to an improvement in the performance of the classification, while at the same time
reducing the computational cost and the storage requirements of the task.
A common approach to feature reduction is Feature Selection (FS), which consists of choosing a subset of the original features to represent the documents. In TC, this task is usually performed by scoring the features with a certain measure, ordering them according to that measure, and removing a predefined number or percentage of them [8, 13]. Several measures have been proposed for this purpose, like information gain [13] or cross entropy for text [8].
In this paper we introduce new measures for FS in TC, which we call Angular Measures. We define them and study their behavior by means of experiments over two well-known corpora.
The paper is organized as follows. Section 2 deals with some previous work, including some of the state-of-the-art measures. Section 3 presents the new family of measures. Section 4 describes the main stages of the TC task. The corpora and the experiments are described in Sections 5 and 6, respectively. Finally, Section 7 discusses some conclusions and ideas for future work.
2. PREVIOUS WORK
FS is one of the approaches commonly adopted in TC. It involves selecting a subset of features from the original feature set. By contrast, Feature Extraction (FE) methods transform or combine the original features to obtain a reduced number of features. Methods of this kind include clustering [4] and Latent Semantic Indexing (LSI) [6].
On the other hand, John et al. distinguish two kinds of FS, namely filtering and wrapping. In the former, a feature subset is selected independently of the performance of the classifier; in the latter, a feature subset is selected using an evaluation function based on the classifier. A widely adopted approach in TC is filtering based on selecting the features with the highest score granted by a certain measure. The reason for preferring filtering approaches over wrappers in TC is that the latter usually result in a considerably more time-consuming process.
In the following paragraphs we briefly describe the measures which have been most widely adopted for FS in TC.
2.1 Statistical Measures
The simplest filtering measures are the term frequency (tf) and the document frequency (df). They quantify the relevance of a word by means of its total number of appearances and by means of the number of different documents in which it appears, respectively. They can be combined into tfidf [10], defined by
$$\mathrm{tfidf} = \mathrm{tf} \cdot \log\frac{N}{\mathrm{df}}$$
where $N$ is the number of documents in the corpus. Notice that words appearing in all the documents are considered non-informative, independently of their absolute frequency, and, in general, a word occurring in many documents will have a smaller tfidf than words with the same tf appearing in fewer documents. Despite their simple appearance, these measures perform acceptably in many situations [5].
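As a concrete illustration (ours, not part of the original paper), the following Python sketch computes tf, df and tfidf over a toy corpus:

```python
import math
from collections import Counter

# Toy corpus: each document is a list of tokens.
docs = [["oil", "price", "oil"], ["price", "rise"], ["oil", "market"]]
N = len(docs)

tf = Counter(w for d in docs for w in d)       # total number of appearances
df = Counter(w for d in docs for w in set(d))  # documents containing the word

# tfidf = tf * log(N / df); a word appearing in every document scores 0.
tfidf = {w: tf[w] * math.log(N / df[w]) for w in tf}
print(sorted(tfidf.items(), key=lambda x: -x[1]))
```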
2.2 Information Theory Measures
Measures taken from Information Theory (IT) have been widely used, because it is interesting to consider the distribution of a word over the different categories. Among these measures, information gain ($IG$) takes into account the presence of the word in a category as well as its absence, and can be defined by (see, for instance, [13])
$$IG(w,c) = P(w)\,P(c \mid w)\log\frac{P(c \mid w)}{P(c)} + P(\bar{w})\,P(c \mid \bar{w})\log\frac{P(c \mid \bar{w})}{P(c)}$$
where $P(w)$ is the probability that the word $w$ appears in a document, $P(c \mid w)$ is the probability that a document belongs to the category $c$ knowing that the word $w$ appears in it, $P(\bar{w})$ is the probability that the word $w$ does not appear in a document, and $P(c \mid \bar{w})$ is the probability that a document belongs to the category $c$ if we know that the word $w$ does not occur in it. Usually, these probabilities are estimated by means of the corresponding relative frequencies.
Another measure of this kind is the expected cross entropy for text ($CET$) [8], which only takes into account the presence of the word in a category. It is defined by
$$CET(w,c) = P(w)\,P(c \mid w)\log\frac{P(c \mid w)}{P(c)}$$
These are the measures of this kind that have obtained the best results in TC [7, 8, 11, 13].
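A minimal sketch (ours) of how both measures can be computed from document counts, assuming relative-frequency estimates and the convention $0 \log 0 = 0$:

```python
import math

def probabilities(a, b, n_c, n):
    """Relative-frequency estimates of the probabilities used by IG and CET.

    a: documents of category c containing w    b: other documents containing w
    n_c: documents in category c               n: total number of documents
    """
    p_w = (a + b) / n
    p_c = n_c / n
    p_c_w = a / (a + b) if a + b else 0.0                     # P(c|w)
    p_c_notw = (n_c - a) / (n - a - b) if n > a + b else 0.0  # P(c|not w)
    return p_w, p_c, p_c_w, p_c_notw

def cet(a, b, n_c, n):
    p_w, p_c, p_c_w, _ = probabilities(a, b, n_c, n)
    return p_w * p_c_w * math.log(p_c_w / p_c) if p_c_w else 0.0

def ig(a, b, n_c, n):
    # IG adds to CET the term for the absence of the word.
    p_w, p_c, _, p_c_notw = probabilities(a, b, n_c, n)
    absent = (1 - p_w) * p_c_notw * math.log(p_c_notw / p_c) if p_c_notw else 0.0
    return cet(a, b, n_c, n) + absent

print(ig(a=30, b=5, n_c=100, n=1000), cet(a=30, b=5, n_c=100, n=1000))
```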
3. ANGULAR MEASURES
Before defining this family of measures, let us consider a category $c$ and a word $w$, and identify each word $w$ with the pair $(a_{w,c}, b_{w,c})$, where $a_{w,c}$ denotes the number of documents of the category $c$ in which $w$ appears and $b_{w,c}$ denotes the number of documents that contain the word $w$ but do not belong to the category $c$. In what follows, we denote the pair $(a_{w,c}, b_{w,c})$ by $(a_w, b_w)$ for simplicity.
Then, we can study the words that receive an identical score under a filtering measure $m(w)$ which depends only on $(a_w, b_w)$ by means of the level curves defined by that measure. In fact, it was demonstrated in [2] that if $m(w)$ is a filtering measure and $N$ and $M$ are natural numbers, then the level curves passing through the words with $a_w \le N$ and $b_w \le M$ can be considered as straight lines.
From that fact, an interesting special case is the family of measures $m(w)$ which have just one level curve for each value, that is, the measures $m(w)$ that satisfy
$$a_w = f(m(w))\, b_w + g(m(w))$$
for some functions $f$ and $g$. Some measures [2], like $df$ (with $f(df) = -1$ and $g(df) = df$, since $df = a_w + b_w$), have this property.
In [2] it has also been proven that if $N$ and $M$ are two natural numbers and $m(w)$ is a filtering measure which has exactly one straight line as level curve for each value that $m(w)$ attains over the words with $a_w \le N$ and $b_w \le M$, then there exist two polynomials $p$ and $q$ such that
$$a_w = p(m(w))\, b_w + q(m(w))$$
for any word $w$ such that $a_w \le N$ and $b_w \le M$.
Therefore, it is interesting to study the filtering measures satisfying the above expression, at least when the degrees of $p$ and $q$ are low. The family of measures obtained when $\deg(p) = 0$ and $\deg(q) = 1$ has been studied in [2], leading to what we call Linear Measures. This paper deals with the measures obtained when $\deg(p) = 1$ and $\deg(q) = 0$ (notice that the degrees of $p$ and $q$ cannot both be zero at the same time).
Thus, if $\deg(p) = 1$ and $\deg(q) = 0$ we have
$$a_w = (c_1 m(w) + c_2)\, b_w + c_3$$
for some constants $c_1$, $c_2$ and $c_3$ such that $c_1 \neq 0$, and thus
$$m(w) = \frac{\frac{a_w - c_3}{b_w} - c_2}{c_1}$$
or, equivalently,
$$m(w) = \frac{a_w - c_2 b_w - c_3}{c_1 b_w}$$
The value of $c_1$ can be taken as $1$, since it does not affect the ordering of words produced by the measure. Then, we obtain
$$m(w) = \frac{a_w - c_2 b_w - c_3}{b_w}$$
with $c_2$ and $c_3$ any real numbers. But the above expression is equivalent to
$$m(w) = \frac{a_w - c_3}{b_w} - c_2$$
and, again, the value of $c_2$ is irrelevant in the sense that the ordering of the words provided by the measure is independent of this constant. Hence, $c_2$ can be taken to be zero. Therefore, the family of measures to study is of the form
$$m(w) = \frac{a_w - c_3}{b_w}$$
or, equivalently,
$$AM_k(w) = \frac{a_w - k}{b_w}$$
where $k$ is a real parameter which defines the family. These measures have a simple geometrical interpretation, as the next theorem establishes; its proof can be found in [3].
Theorem 1. The value $AM_k(w)$ is the tangent of the angle formed by the $x$-axis and the line determined by the points $(a_w, b_w)$ and $(k, 0)$.
This is the reason why we call these measures Angular Measures.
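As an illustration of how this family can be used for FS (a sketch of ours, not code from the paper), the following Python function scores each word with $AM_k$ and keeps the top-ranked ones. The handling of $b_w = 0$ is our assumption, since that case is not discussed here:

```python
def am_k(a_w, b_w, k):
    """Angular Measure AM_k(w) = (a_w - k) / b_w.

    a_w: number of documents of category c containing w
    b_w: number of documents containing w that do not belong to c
    """
    if b_w == 0:
        # w never occurs outside the category; we assume it is maximally
        # relevant (the paper does not state how this case is handled).
        return float("inf")
    return (a_w - k) / b_w


def select_features(counts, k, keep):
    """counts: dict word -> (a_w, b_w); keep the 'keep' best-scored words."""
    ranked = sorted(counts, key=lambda w: am_k(*counts[w], k), reverse=True)
    return ranked[:keep]


# Toy usage: select the two best words for a category.
counts = {"oil": (30, 5), "price": (12, 40), "the": (90, 900)}
print(select_features(counts, k=1.0, keep=2))
```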
4. TASK OF TEXT CATEGORIZATION
This section describes the stages of the TC task.
The bag of words model [10] is adopted for representing the documents: a document is viewed as a set of words without order or structure. Also, tf is chosen to quantify the importance of each word in each document, since it is one of the weightings most widely used in the literature [8, 11].
The classification stage consists of assigning a category to a document from a finite set of $m$ categories. This is commonly converted into $m$ binary problems, each one consisting of determining whether a document belongs to a fixed category or not. This approach is called one-against-the-rest [1].
That process allows different sets of words to be used in the document representation. The local approach uses, for each category, only the words occurring in its documents, while the global approach considers the words from all categories. In this work the local approach is adopted, since it offers better results [11].
Additionally, stop words (words without meaning for the classification) are removed, because they are useless for this task. Also, stemming is performed, which consists of mapping words with the same meaning but slightly different spellings onto a common root. The Porter algorithm [9] is adopted for this purpose.
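A minimal sketch of this preprocessing step (ours, for illustration; the tiny stop list is not the one used by the authors, and NLTK's Porter stemmer stands in for their implementation):

```python
from nltk.stem import PorterStemmer

STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is"}  # illustrative only
stemmer = PorterStemmer()

def preprocess(text):
    """Lowercase, drop stop words, and map each word to its Porter root."""
    tokens = text.lower().split()
    return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The categorization of documents is performed automatically"))
# e.g. ['categor', 'document', 'perform', 'automat']
```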
In this paper the classification is performed with Support Vector Machines (SVM) [7], since they have been shown to perform fast and well in TC [12]. They deal satisfactorily with many features and with sparse examples. They are binary classifiers which find threshold functions that separate the documents of a certain category from the rest. We adopt a linear threshold, since most TC problems are linearly separable [7].
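To make the setup concrete, here is a sketch of the classification stage (our reconstruction with scikit-learn, not the authors' actual toolchain; LinearSVC gives the linear threshold and OneVsRestClassifier the one-against-the-rest decomposition described above):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

train_texts = ["oil price rises", "wheat crop grows", "oil market falls"]
train_labels = ["oil", "grain", "oil"]

vectorizer = CountVectorizer()  # tf weighting (raw term counts)
X = vectorizer.fit_transform(train_texts)

# One binary linear SVM per category.
clf = OneVsRestClassifier(LinearSVC()).fit(X, train_labels)
print(clf.predict(vectorizer.transform(["oil price falls"])))
```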
The popular and well-known measure $F_1$ [11] is adopted in this paper to evaluate the effectiveness of the TC task. It is defined by
$$F_1 = \frac{1}{0.5\,\frac{1}{P} + 0.5\,\frac{1}{R}}$$
where the precision $P$ quantifies the percentage of documents classified as belonging to the category that actually belong to it, while the recall $R$ quantifies the percentage of documents of the category that are correctly classified.
To compute the global performance over all the categories, we use the macroaverage, which consists of averaging the values obtained in each category [11].
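For instance, a direct implementation (ours) of these two definitions is:

```python
def f1(p, r):
    """F1 = harmonic mean of precision P and recall R (0 if both are 0)."""
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_f1(per_category):
    """Macroaverage: mean of the per-category F1 values.

    per_category: list of (precision, recall) pairs, one per category.
    """
    return sum(f1(p, r) for p, r in per_category) / len(per_category)

print(macro_f1([(0.9, 0.8), (0.5, 0.4)]))  # toy values
```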
5. THE CORPORA
In this section, the corpora used in the experiments are described and analyzed: the Reuters-21578 collection and the Ohsumed collection.
5.1 Reuters-21578 Collection
The Reuters-21578 corpus is a set of economic news stories published by Reuters in 1987¹. They are distributed over 135 categories, and each document belongs to one or more of them. The split into train and test documents is that of Apté [1]. After removing documents without body or topics, 7063 train and 2742 test documents assigned to 90 categories are obtained.
¹ Publicly available at http://www.research.att.com/~lewis/reuters21578.html
The distribution of documents into the categories is quite unbalanced. In fact, the relative dispersion of the number of documents per category is 3.36 over the interval [1, 2709] for the training documents and 3.39 over [1, 1044] for the test documents. In addition, 76.40% (in train) and 78.65% (in test) of the categories have less than 1% of the documents.
The words in the corpus are not very scattered: almost half of them (49.91%) appear in only one category and 16.25% in only two.
5.2 Ohsumed Collection
Ohsumed is a subset of MEDLINE references from 270 medical journals over the period 1987-1991². They are classified into the 15 fixed categories of MeSH³: A, B, C, ... Each category is in turn split into subcategories. Following [7], we have taken the first 20000 documents of 1991 with abstract, labelling the first 10000 documents as training and the rest as test, and we classify them into the 23 subcategories of category C of MeSH.
² It can be found at http://trec.nist.gov/data/t9-filtering
³ Available at www.nlm.nih.gov/mesh/2002/index.html
The distribution of documents over the categories is much more balanced than in Reuters. In fact, the relative dispersion of the number of documents per category is 0.86 over the interval [100, 2476] for train and 0.88 over the interval [82, 2424] for test. Furthermore, only 4.35% (in train) and 8.70% (in test) of the categories have less than 1% of the documents, against about 77% in Reuters.
The words in this collection are considerably more scattered than in Reuters: on average, only 19.55% of the words appear in just one category (against 49.91% in Reuters).
6. THE EXPERIMENTS
In the theoretical study developed in [3] we have proved that the values of $k$ of the form
$$\frac{a_w b_v - a_v b_w}{b_v - b_w}$$
with $v$ and $w$ two words of the collection are relevant, since they provide measures which discriminate the words of one category from the rest. Hence, as a first approach we select the deciles of the distribution formed by those values (when $w$ ranges over all the words of the category under study) as candidate values of $k$.
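The following sketch (ours) illustrates this construction: it computes the candidate values for pairs of words and takes the deciles of the resulting distribution. The choice of a single reference word $v$ and the exclusion of pairs with $b_v = b_w$ are assumptions of the illustration:

```python
import numpy as np

def candidate_ks(counts, v):
    """Deciles of (a_w*b_v - a_v*b_w) / (b_v - b_w) as w ranges over counts.

    counts: dict word -> (a_w, b_w); v: a reference word of the category.
    """
    a_v, b_v = counts[v]
    vals = [(a_w * b_v - a_v * b_w) / (b_v - b_w)
            for a_w, b_w in counts.values() if b_w != b_v]
    return np.percentile(vals, range(10, 100, 10))  # 1st decile .. 9th decile

counts = {"oil": (30, 5), "price": (12, 40), "the": (90, 900), "crude": (7, 2)}
print(candidate_ks(counts, v="oil"))
```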
Figures 1 and 2 show the macroaverage of $F_1$ for those deciles on Reuters, and Figures 3 and 4 show it for Ohsumed. They also compare the deciles with two well-known and effective IT measures, $CET$ and $IG$, described in Section 2.
In both corpora, the value of $F_1$ progressively increases from the 1st decile to the median and then decreases until the 9th decile, the maximum being reached at the median.
In the case of Reuters, only the median beats the state-of-the-art measures $CET$ and $IG$, whereas in Ohsumed all the deciles from the 1st to the median achieve better results.
The different behavior of the Angular Measures in the two corpora might be due to the different nature of the collections. As already mentioned, the distribution of documents into categories is considerably more unbalanced in Reuters than in Ohsumed, and the words in Reuters are less scattered than those in Ohsumed.
It is also remarkable that some of the Angular Measures obtain their best performance at very high filtering levels (around 95%), especially on the Ohsumed collection. This makes these measures a very appealing choice when an aggressive reduction of the number of features is intended.
[Figure 1: Macroaverage of F1 for Reuters. F1 (0-50) vs. filtering level percentage (0-100) for the 1st-4th deciles.]
[Figure 2: Macroaverage of F1 for Reuters. F1 (34-50) vs. filtering level percentage (0-100) for the median, the 6th-9th deciles, CET and IG.]
[Figure 3: Macroaverage of F1 for Ohsumed. F1 (0-56) vs. filtering level percentage (0-100) for the 1st-4th deciles.]
[Figure 4: Macroaverage of F1 for Ohsumed. F1 (36-56) vs. filtering level percentage (0-100) for the median, the 6th-9th deciles, CET and IG.]
7. CONCLUSIONS AND FUTURE WORK
This paper presents a family of measures, called Angular Measures, for Feature Selection in Text Categorization. They are obtained from the study of their level curves and are defined by a parameter whose adequate values have been carefully selected.
The median of a strategically chosen distribution offers the best results, beating some of the state-of-the-art measures on both corpora considered. Other deciles of that distribution also beat those measures on one of the corpora. Additionally, the best performance is obtained when most of the words (about 95% of them) are removed, which makes this family of measures suitable for aggressive feature reductions.
As future work, we plan to refine the values of the parameter by taking into account the centiles around the median of the chosen distribution. We also plan to propose several modifications of the Angular Measures based on the performance of other state-of-the-art measures.
8. ACKNOWLEDGMENTS
The research reported in this paper has been supported
in part under MEC and FEDER grant TIN2004-05920.
9. ADDITIONAL AUTHORS
Additional authors: Irene Díaz (Artificial Intelligence Center, University of Oviedo, email: sirene@aic.uniovi.es).
10. REFERENCES
[1] C. Apté, F. Damerau, and S. Weiss. Automated learning of decision rules for text categorization. ACM Transactions on Information Systems, 12(3):233-251, 1994.
[2] E. F. Combarro, E. Montañés, I. Díaz, J. Ranilla, and R. Mones. Introducing a family of linear measures for feature selection in text categorization. IEEE Transactions on Knowledge and Data Engineering, 17(9):1223-1232, 2005.
[3] E. F. Combarro, E. Montañés, J. Ranilla, and I. Díaz. A theoretical framework of angular, Laplace angular and modified Laplace angular measures. Technical report, University of Oviedo, 2005.
[4] I. S. Dhillon, S. Mallela, and R. Kumar. A divisive
information theoretic feature clustering algorithm for
text classification. Journal of Machine Learning
Research, 3:1265–1287, 2003.
[5] I. Díaz, J. Ranilla, E. Montañés, J. Fernández, and E. F. Combarro. Improving performance of text categorisation by combining filtering and support vector machines. Journal of the American Society for Information Science and Technology, 55(7):579-592, 2004.
[6] G. Forman. An extensive empirical study of feature
selection metrics for text categorization. Journal of
Machine Learning Research, 3:1289–1305, 2003.
[7] T. Joachims. Text categorization with support vector machines: learning with many relevant features. In C. Nédellec and C. Rouveirol, editors, Proc. 10th European Conference on Machine Learning ECML-98, number 1398, pages 137-142, Chemnitz, DE, 1998. Springer-Verlag.
[8] D. Mladenic and M. Grobelnik. Feature selection for
unbalanced class distribution and naive bayes. In
Proc. 16th International Conference on Machine
Learning ICML-99, pages 258–267, Bled, SL, 1999.
[9] M. F. Porter. An algorithm for suffix stripping.
Program (Automated Library and Information
Systems), 14(3):130–137, 1980.
[10] G. Salton and M. J. McGill. An introduction to
modern information retrieval. McGraw-Hill, 1983.
[11] F. Sebastiani. Machine learning in automated text categorisation. ACM Computing Surveys, 34(1):1-47, 2002.
[12] Y. Yang and X. Liu. A re-examination of text
categorization methods. In M. A. Hearst, F. Gey, and
R. Tong, editors, Proc. 22nd ACM International
Conference on Research and Development in
Information Retrieval SIGIR-99, pages 42–49,
Berkeley, US, 1999. ACM Press, New York, US.
[13] Y. Yang and J. O. Pedersen. A comparative study on
feature selection in text categorisation. In Proc. 14th
International Conference on Machine Learning
ICML-97, pages 412–420, 1997.