Angular Measures for Feature Selection in Text
Categorization
E.F. Combarro
Artificial Intelligence Center
University of Oviedo
Campus de Viesques S/N
Gijón, Spain
elias@aic.uniovi.es
Elena Montañés
Artificial Intelligence Center
University of Oviedo
Campus de Viesques S/N
Gijón, Spain
elena@aic.uniovi.es
José Ranilla
Artificial Intelligence Center
University of Oviedo
Campus de Viesques S/N
Gijón, Spain
ranilla@uniovi.es
ABSTRACT
Text Categorization, which consists of automatically assigning documents to a set of categories, usually involves the management of a huge number of features. Most of them are irrelevant or introduce noise that misleads the classifiers. Thus, feature reduction is often performed in order to increase the efficiency and effectiveness of the classification. In this paper we propose to select relevant features by means of what we call Angular Measures, which are simpler than other measures usually applied for this purpose. We carry out experiments over two different corpora and find that the proposed measures perform as well as or better than some of the existing ones.
Categories and Subject Descriptors
I.5.2 [Pattern Recognition]: Design Methodology—Feature evaluation and selection; I.7.1 [Document and Text Processing]: Document and Text Editing—Document management
General Terms
Theory, Measurement, Experimentation, Performance
1. INTRODUCTION
One of the main tasks in the processing of large collec-
tions of text files is that of assigning the documents of a
corpus to a set of previously fixed categories, which is known
as Text Categorization (TC) [11]. The most common way
of representing the documents for TC is the bag of words
(see [10]). In this representation, each document is associated with a vector whose components quantify the importance of each of its words. This usually involves a great number of features, most of which can be irrelevant or noisy [10].
Thus, feature reduction often leads to an improvement in the performance of the classification, at the same time that it reduces the computational cost and the storage requirements of the task.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SAC'06 April 23-27, 2006, Dijon, France
Copyright 2006 ACM 1-59593-108-2/06/0004 ...$5.00.
A common approach for feature reduction is Feature Se-
lection (FS), which consists in choosing a subset of the orig-
inal features for representing the documents. In TC, this task is usually performed by scoring the features with a certain measure, ordering them according to this score, and removing a predefined number or percentage of them [8, 13].
Several measures have been proposed for this purpose, like
information gain [13] or cross entropy for text [8].
In this paper we introduce some measures for FS in TC,
which we call Angular Measures. We define them and study
their behavior by means of experimentation over two well
known corpora.
The paper is organized as follows. Section 2 deals with
some previous work including some of the state-of-the-art
measures. Section 3 presents the new family of measures
proposed. Section 4 describes the main stages of the TC
task. The description of the corpora and the experiments
are detailed in Sections 5 and 6 respectively. Finally, in
Section 7 some conclusions and ideas for future work are
commented.
2. PREVIOUS WORK
FS is one of the approaches commonly adopted in TC.
It involves selecting a subset of features from the original
feature set. By contrast, Feature Extraction (FE) meth-
ods transform or combine the original features to obtain a
reduced number of features. Methods of this kind are clus-
tering ones [4] or Latent Semantic Indexing (LSI) [6].
On the other hand, John et al. distinguish two kinds of
FS, namely filtering and wrapping. In the former, a feature
subset is selected independently of the performance of the
classifier. In the latter, a feature subset is selected using an
evaluation function based on the classifier. A widely adopted approach in TC is the filtering one, based on selecting the features with the highest scores granted by a certain measure. The reason for preferring filtering approaches over wrappers in TC is that the latter usually result in a considerably time-consuming process.
In the following paragraphs we briefly describe the measures that have been most widely adopted for FS in TC.
2.1 Statistical Measures
The simplest filtering measures are the term frequency (tf) and the document frequency (df). They quantify the relevance of a word by means of its total number of appearances and by means of the number of different documents in which it appears, respectively. They can be combined into tfidf [10], defined by

tfidf = tf · log(N/df)

where N is the number of documents in the corpus. Notice that words appearing in all the documents are considered non-informative, independently of their absolute frequency, and, in general, a word occurring in many documents will have a smaller tfidf than others with the same tf but appearing in fewer documents. Despite their simple appearance, these measures perform acceptably in many situations [5].
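As a concrete illustration, these frequency-based measures translate into a few lines of Python; the toy corpus below is invented for the example:

```python
import math

# Toy corpus: each document is a list of (already preprocessed) words.
docs = [
    ["trade", "oil", "price"],
    ["oil", "barrel", "oil"],
    ["grain", "trade"],
]

N = len(docs)  # number of documents in the corpus

def tf(word):
    """Term frequency: total number of appearances of `word` in the corpus."""
    return sum(d.count(word) for d in docs)

def df(word):
    """Document frequency: number of different documents containing `word`."""
    return sum(1 for d in docs if word in d)

def tfidf(word):
    """tfidf = tf * log(N / df); a word occurring in every document scores 0."""
    return tf(word) * math.log(N / df(word))
```

For instance, tfidf("oil") is 3 · log(3/2), larger than that of "trade", which has the same df but a smaller tf.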
2.2 Information Theory Measures
Measures taken from Information Theory (IT) have been widely used, since it is interesting to consider the distribution of a word over the different categories. Among these measures, information gain (IG) takes into account the presence of the word in a category as well as its absence, and can be defined by (see, for instance, [13])

IG(w, c) = P(w) P(c|w) log(P(c|w)/P(c)) + P(w̄) P(c|w̄) log(P(c|w̄)/P(c))

where P(w) is the probability that the word w appears in a document, P(c|w) is the probability that a document belongs to the category c knowing that the word w appears in it, P(w̄) is the probability that the word w does not appear in a document, and P(c|w̄) is the probability that a document belongs to the category c if we know that the word w does not occur in it. Usually, these probabilities are estimated by means of the corresponding relative frequencies.

Another measure of this kind is the expected cross entropy for text (CET) [8], which only takes into account the presence of the word in a category. It is defined by

CET(w, c) = P(w) P(c|w) log(P(c|w)/P(c))

These are the measures of this kind that have obtained the best results in TC [7, 8, 11, 13].
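Estimating the probabilities by relative frequencies, both measures can be sketched as follows; the argument names a, b, n_c and N are our own shorthand for the documents of c containing w, the documents outside c containing w, the documents of c, and the total number of documents:

```python
import math

def cet(a, b, n_c, N):
    """Expected cross entropy for text: P(w) P(c|w) log(P(c|w)/P(c)).
    a: docs of category c containing w; b: docs outside c containing w;
    n_c: total docs of c; N: total docs in the corpus."""
    if a == 0:
        return 0.0
    p_w = (a + b) / N            # P(w)
    p_c = n_c / N                # P(c)
    p_c_w = a / (a + b)          # P(c|w)
    return p_w * p_c_w * math.log(p_c_w / p_c)

def ig(a, b, n_c, N):
    """Information gain: CET plus the symmetric term for the absence of w."""
    result = cet(a, b, n_c, N)
    if N - a - b > 0 and n_c - a > 0:
        p_notw = (N - a - b) / N            # P(w absent)
        p_c_notw = (n_c - a) / (N - a - b)  # P(c | w absent)
        result += p_notw * p_c_notw * math.log(p_c_notw / (n_c / N))
    return result
```

With a = 5, b = 5, n_c = 10 and N = 100, for example, P(c|w) = 0.5 is five times P(c) = 0.1, so both measures are positive.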
3. ANGULAR MEASURES
Before defining this family of measures, let us consider a category c and a word w, and identify the word w with the pair (a_{w,c}, b_{w,c}), where a_{w,c} denotes the number of documents of the category c in which w appears and b_{w,c} denotes the number of documents that contain the word w but do not belong to the category c. In what follows, we denote the pair (a_{w,c}, b_{w,c}) by (a_w, b_w) for simplicity.
Then, we can study the words that receive identical scores under a filtering measure m(w) depending only on (a_w, b_w) by means of the level curves defined by such a measure. In fact, it was demonstrated in [2] that if m(w) is a filtering measure and N and M are natural numbers, then the level curves passing through the words with a_w ≤ N and b_w ≤ M can be considered straight lines.

From that fact, an interesting special case is the family of measures m(w) which have just one level curve for each value. That is, the measures m(w) that satisfy

a_w = f(m(w)) b_w + g(m(w))

for some functions f and g. Some measures [2] have this property, like df: since df = a_w + b_w, we have a_w = −b_w + df, so f(df) = −1 and g(df) = df.
In [2] it has also been proven that if N and M are two natural numbers and m(w) is a filtering measure which has exactly one straight line as level curve for each value that m(w) attains over the words with a_w ≤ N and b_w ≤ M, then there exist two polynomials p and q such that

a_w = p(m(w)) b_w + q(m(w))

for any word w such that a_w ≤ N and b_w ≤ M.

Therefore, it is interesting to study the filtering measures satisfying the above expression, at least when the degrees of p and q are low. The family of measures obtained when degree(p) = 0 and degree(q) = 1 has been studied in [2], leading to what we call Linear Measures. This paper deals with those obtained when degree(p) = 1 and degree(q) = 0 (notice that it is not possible for the degrees of p and q to both be zero at the same time).
Thus, if degree(p) = 1 and degree(q) = 0, we have

a_w = (c1 m(w) + c2) b_w + c3

for some constants c1, c2 and c3 such that c1 ≠ 0, and thus

m(w) = ((a_w − c3)/b_w − c2) / c1

or equivalently

m(w) = (a_w − c2 b_w − c3) / (c1 b_w)

The value of c1 can be taken as 1, since it does not affect the ordering of words produced by the measure. Then, we obtain

m(w) = (a_w − c2 b_w − c3) / b_w

with c2 and c3 any real numbers. But the above expression is equivalent to the following one

m(w) = (a_w − c3)/b_w − c2

and, again, the value of c2 is irrelevant, in the sense that the ordering of the words provided by the measure is independent of the value of this constant. Hence, the value of c2 can be taken to be zero. Therefore, the measures to study are of the form

m(w) = (a_w − c3)/b_w

or equivalently

AM_k(w) = (a_w − k)/b_w

where k is a real parameter which defines the family. These measures have a simple geometrical interpretation, as the next theorem (whose proof can be found in [3]) establishes.
Theorem 1. The value AM_k(w) is the tangent of the angle formed by the x-axis and the line determined by the points (a_w, b_w) and (k, 0).

This is the reason why we call these measures Angular Measures.
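As a sketch, computing AM_k and ranking the words of a category by it takes only a few lines; the per-category word statistics below are invented for illustration:

```python
def am_k(a_w, b_w, k):
    """Angular Measure AM_k(w) = (a_w - k) / b_w, assuming b_w > 0."""
    return (a_w - k) / b_w

# Hypothetical per-category statistics: word -> (a_w, b_w).
words = {"crude": (40, 5), "price": (30, 25), "said": (50, 200)}

def rank(stats, k):
    """Words sorted by decreasing AM_k score."""
    return sorted(stats, key=lambda w: am_k(*stats[w], k), reverse=True)
```

With k = 0 the measure reduces to the ratio a_w/b_w, so rank(words, 0) places "crude" (score 8) ahead of "price" (1.2) and "said" (0.25): a frequent but category-unspecific word like "said" is penalized.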
4. TASK OF TEXT CATEGORIZATION
This section describes the stages of the TC task.
The bag of words model [10] is adopted for representing the documents. It consists in viewing a document as a set of words without order or structure. Also, tf is chosen to quantify the importance of each word in each document, since it is one of the most widely used weights in the literature [8, 11].
The classification stage consists in assigning a category to a document from a finite set of m categories. This is commonly converted into m binary problems, each one consisting of determining whether a document belongs to a fixed category or not. This approach is called one-against-the-rest [1].
This process leads to using different sets of words in the document representation. One option consists of the words that belong to each category in isolation from the rest, which is known as the local approach. On the other hand, the global approach considers the words from all categories. In this work, the local approach is adopted, since it offers better results [11].
Additionally, stop words (words without meaning) are removed, because they are useless for the classification. Also, stemming is performed, which consists in mapping words with the same meaning but slightly different spellings into a common root. The Porter algorithm [9] is adopted for this purpose.
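A minimal sketch of this preprocessing stage, with a made-up stop-word list and a crude suffix stripper standing in for the full Porter algorithm:

```python
# Hypothetical stop-word list; real lists contain a few hundred entries.
STOP_WORDS = {"the", "of", "and", "a", "to", "in", "is", "are"}

def strip_suffix(word):
    """Very rough stand-in for Porter stemming: drop a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, remove stop words, and map remaining words to rough roots."""
    return [strip_suffix(w) for w in text.lower().split()
            if w not in STOP_WORDS]
```

For example, preprocess("the rising of oil prices") yields ["ris", "oil", "price"]; the real Porter algorithm applies a much larger, carefully ordered rule set.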
In this paper the classification is performed using Support Vector Machines (SVM) [7], since they have been shown to perform fast and well in TC [12]. They deal satisfactorily with many features and with sparse examples. They are binary classifiers which find threshold functions to separate the documents of a certain category from the rest. We adopt a linear threshold, since most TC problems are linearly separable [7].
The popular and well-known measure F1 [11] is adopted in this paper to evaluate the effectiveness of the TC task. It is defined by

F1 = 1 / (0.5/P + 0.5/R)

where the precision P quantifies the percentage of documents assigned to the category that are correctly classified, while the recall R quantifies the percentage of documents of the category that are correctly classified.
To compute the global performance over all the categories, the macroaverage, which consists in averaging the values obtained in each category [11], is used.
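These definitions translate directly into code; a small sketch:

```python
def f1(p, r):
    """F1 = 1 / (0.5/P + 0.5/R), the harmonic mean of precision and recall."""
    if p == 0.0 or r == 0.0:
        return 0.0
    return 1.0 / (0.5 / p + 0.5 / r)

def macro_f1(per_category):
    """Macroaverage: plain mean of the per-category F1 values."""
    scores = [f1(p, r) for p, r in per_category]
    return sum(scores) / len(scores)
```

For instance, f1(1.0, 0.5) = 2/3, and macro-averaging weights every category equally, regardless of its number of documents.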
5. THE CORPORA
In this subsection the corpora used in the experiments
are described and analyzed. They are the Reuters-21578
collection and the Ohsumed collection.
5.1 Reuters-21578 Collection
The Reuters-21578 corpus is a set of economic news published by Reuters in 1987¹. They are distributed over 135 categories. Each document belongs to one or more of them. The split into train and test documents chosen is that of Apté [1]. Removing some documents without body or topics, 7063 train and 2742 test documents assigned to 90 categories are obtained.
¹It is publicly available at http://www.research.attp.com/lewis/reuters21578.html
The distribution of documents into the categories is quite unbalanced. In fact, the relative dispersion of the number of documents of the categories is 3.36% in the interval [1, 2709] for training documents and 3.39% in [1, 1044] for test documents. In addition, 76.40% (in train) and 78.65% (in test) of the categories have less than 1% of the documents. The words in the corpus are not very scattered, since almost half (49.91%) of the words appear in only one category and 16.25% in only two categories.
5.2 Ohsumed Collection
Ohsumed is a MEDLINE subset of references from 270 medical journals over 1987–1991². They are classified into the 15 fixed categories of MeSH³: A, B, C, ... Each category is in turn split into subcategories. We have taken the first 20000 documents of 1991 with abstract, labelling the first 10000 documents as training and the rest as test, following [7]. We split them into the 23 subcategories of category C of MeSH, again following [7].
The distribution of documents over the categories is much more balanced than in Reuters. In fact, the relative dispersion of the number of documents of the categories is 0.86% in the interval [100, 2476] for train and 0.88% in the interval [82, 2424] for test. Furthermore, only 4.35% (in train) and 8.70% (in test) of the categories have less than 1% of the documents, against about 77% in Reuters.
The words in this collection are considerably more scattered than in Reuters, since only 19.55% of the words (on average) appear in just one category (against 49.91% in Reuters).
6. THE EXPERIMENTS
In the theoretical study developed in [3] we proved that the values of k of the form

k = (a_w b_v − a_v b_w) / (b_v − b_w)

with v and w two words of the collection are relevant, since they provide measures which discriminate the words of one category from the rest. Hence, due to this fact, and as a first approach, we select the deciles of the distribution formed by those values (when w ranges over all the words of the category under study) as candidate values of k.
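Under these assumptions, the candidate values of k and their deciles can be sketched as follows; the (a_w, b_w) pairs below are invented toy statistics:

```python
# Hypothetical (a_w, b_w) statistics for the words of one category.
pairs = [(3, 1), (5, 2), (8, 3), (2, 4), (7, 6)]

def candidate_ks(pairs):
    """All values (a_w*b_v - a_v*b_w)/(b_v - b_w) over word pairs, sorted."""
    ks = []
    for i, (a_w, b_w) in enumerate(pairs):
        for a_v, b_v in pairs[i + 1:]:
            if b_v != b_w:  # the expression is undefined when b_v == b_w
                ks.append((a_w * b_v - a_v * b_w) / (b_v - b_w))
    return sorted(ks)

def decile(sorted_values, d):
    """d-th decile (d = 1, ..., 9) of a sorted list, nearest-rank style."""
    idx = min(len(sorted_values) - 1, d * len(sorted_values) // 10)
    return sorted_values[idx]
```

Here decile(candidate_ks(pairs), 5) picks the median of the candidate distribution, which is the parameter value evaluated together with the other deciles in the experiments.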
Figures 1, 2, 3 and 4 show the macroaverage of F1 for those deciles for Reuters and Ohsumed, respectively. They also show a comparison with two well-known and effective IT measures, CET and IG, mentioned in Section 2.

In both corpora, the value of F1 progressively increases from the 1st decile to the median and then decreases until the 9th decile, the median being the point at which the maximum is reached.

In the case of Reuters, only the median beats the state-of-the-art measures CET and IG, while in Ohsumed all the deciles from the 1st to the median achieve better results.
The different behavior of the Angular Measures in the two corpora might be due to the different nature of the collections. As we have already mentioned, the distribution of documents into the categories in Reuters is considerably more unbalanced than in Ohsumed. Also, the words in Reuters are less scattered than in Ohsumed.
²It can be found at http://trec.nist.gov/data/t9-filtering
³Available at www.nlm.nih.gov/mesh/2002/index.html
It is also remarkable that some of the Angular Measures obtain their best performance at very high filtering levels (around 95%), especially in the Ohsumed collection. This makes these measures a very appealing choice when an aggressive reduction of the number of features is intended.
[Figure 1: Macroaverage of F1 for Reuters (x-axis: filtering level percentage; curves: 1st–4th deciles)]
[Figure 2: Macroaverage of F1 for Reuters (x-axis: filtering level percentage; curves: median, 6th–9th deciles, CET and IG)]
[Figure 3: Macroaverage of F1 for Ohsumed (x-axis: filtering level percentage; curves: 1st–4th deciles)]
[Figure 4: Macroaverage of F1 for Ohsumed (x-axis: filtering level percentage; curves: median, 6th–9th deciles, CET and IG)]
7. CONCLUSIONS AND FUTURE WORK
This paper presents a family of measures, called Angular Measures, for Feature Selection in Text Categorization. They
are obtained from their level curves and are defined by a parameter whose adequate values have been carefully selected. The median of a certain, strategically chosen distribution offers the best results, beating some of the state-of-the-art measures on the two corpora considered. Also, some deciles of that distribution beat those measures in one of the corpora. Additionally, the best performance is obtained when most of the words are removed (about 95% of them), which makes it possible to conduct aggressive reductions using this family of measures.
As future work, we plan to refine the values of the parameter by taking into account the centiles around the median of the chosen distribution. We also plan to propose several modifications of the Angular Measures based on the performance of other state-of-the-art measures.
8. ACKNOWLEDGMENTS
The research reported in this paper has been supported
in part under MEC and FEDER grant TIN2004-05920.
9. ADDITIONAL AUTHORS
Additional authors: Irene Díaz (Artificial Intelligence Center, University of Oviedo, email: sirene@aic.uniovi.es).
10. REFERENCES
[1] C. Apte, F. Damerau, and S. Weiss. Automated
learning of decision rules for text categorization.
Information Systems, 12(3):233–251, 1994.
[2] E. F. Combarro, E. Montañés, I. Díaz, J. Ranilla, and
R. Mones. Introducing a family of linear measures for
feature selection in text categorization. IEEE
Transactions on Knowledge and Data Engineering,
17(9):1223–1232, 2005.
[3] E. F. Combarro, E. Montañés, J. Ranilla, and I. Díaz.
A theoretical framework of angular, laplace angular
and modified laplace angular measures. Technical
report, University of Oviedo, 2005.
[4] I. S. Dhillon, S. Mallela, and R. Kumar. A divisive
information theoretic feature clustering algorithm for
text classification. Journal of Machine Learning
Research, 3:1265–1287, 2003.
[5] I. Díaz, J. Ranilla, E. Montañés, J. Fernández, and
E. F. Combarro. Improving performance of text
categorisation by combining filtering and support
vector. Journal of the American Society for
Information Science and Technology, 55(7):579–592,
2004.
[6] G. Forman. An extensive empirical study of feature
selection metrics for text categorization. Journal of
Machine Learning Research, 3:1289–1305, 2003.
[7] T. Joachims. Text categorization with support vector
machines: learning with many relevant features. In
C. Nédellec and C. Rouveirol, editors, Proc. 10th
European Conference on Machine Learning ECML-98,
number 1398, pages 137–142, Chemnitz, DE, 1998.
Springer-Verlag.
[8] D. Mladenic and M. Grobelnik. Feature selection for
unbalanced class distribution and naive bayes. In
Proc. 16th International Conference on Machine
Learning ICML-99, pages 258–267, Bled, SL, 1999.
[9] M. F. Porter. An algorithm for suffix stripping.
Program (Automated Library and Information
Systems), 14(3):130–137, 1980.
[10] G. Salton and M. J. McGill. An introduction to
modern information retrieval. McGraw-Hill, 1983.
[11] F. Sebastiani. Machine learning in automated text
categorisation. ACM Computing Surveys, 34(1), 2002.
[12] Y. Yang and X. Liu. A re-examination of text
categorization methods. In M. A. Hearst, F. Gey, and
R. Tong, editors, Proc. 22nd ACM International
Conference on Research and Development in
Information Retrieval SIGIR-99, pages 42–49,
Berkeley, US, 1999. ACM Press, New York, US.
[13] Y. Yang and J. O. Pedersen. A comparative study on
feature selection in text categorisation. In Proc. 14th
International Conference on Machine Learning
ICML-97, pages 412–420, 1997.