Angular Measures for Feature Selection in Text Categorization

E.F. Combarro
Artificial Intelligence Center
University of Oviedo
Campus de Viesques S/N
Gijón, Spain
elias@aic.uniovi.es

Elena Montañés
Artificial Intelligence Center
University of Oviedo
Campus de Viesques S/N
Gijón, Spain
elena@aic.uniovi.es

José Ranilla
Artificial Intelligence Center
University of Oviedo
Campus de Viesques S/N
Gijón, Spain
ranilla@uniovi.es
ABSTRACT
Text Categorization, which consists of automatically assigning documents to a set of categories, usually involves the management of a huge number of features. Most of them are irrelevant or introduce noise which misleads the classifiers. Thus, feature reduction is often performed in order to increase the efficiency and effectiveness of the classification. In this paper we propose to select relevant features by means of what we call Angular Measures, which are simpler than other measures usually applied for this purpose. We carry out experiments over two different corpora and find that the proposed measures perform as well as or better than some of the existing ones.
Categories and Subject Descriptors
I.5.2 [Pattern Recognition]: Design Methodology—Feature evaluation and selection; I.7.1 [Document and Text Processing]: Document and Text Editing—Document management
General Terms
Theory, Measurement, Experimentation, Performance
1. INTRODUCTION
One of the main tasks in the processing of large collections of text files is that of assigning the documents of a corpus to a set of previously fixed categories, a task known as Text Categorization (TC) [11]. The most common way of representing the documents for TC is the bag of words (see [10]). In this representation, a vector is associated to each document whose components quantify the importance of each of its words. This usually involves a great number of features, and most of them can be irrelevant or noisy [10]. Thus, feature reduction often leads to an improvement in the performance of the classification, while at the same time
reducing the computational cost and the storage requirements of the task.
A common approach to feature reduction is Feature Selection (FS), which consists of choosing a subset of the original features to represent the documents. In TC, this task is usually performed by scoring the features with a certain measure, ordering them according to that measure, and removing a predefined number or percentage of them [8, 13]. Several measures have been proposed for this purpose, like information gain [13] or cross entropy for text [8].
In this paper we introduce new measures for FS in TC, which we call Angular Measures. We define them and study their behavior by means of experiments over two well-known corpora.
The paper is organized as follows. Section 2 deals with some previous work, including some of the state-of-the-art measures. Section 3 presents the new family of measures. Section 4 describes the main stages of the TC task. The corpora and the experiments are described in Sections 5 and 6, respectively. Finally, Section 7 discusses some conclusions and ideas for future work.
2. PREVIOUS WORK
FS is one of the approaches commonly adopted in TC. It involves selecting a subset of features from the original feature set. By contrast, Feature Extraction (FE) methods transform or combine the original features to obtain a reduced number of features. Methods of this kind include clustering [4] and Latent Semantic Indexing (LSI) [6].
On the other hand, John et al. distinguish two kinds of FS, namely filtering and wrapping. In the former, a feature subset is selected independently of the performance of the classifier; in the latter, a feature subset is selected using an evaluation function based on the classifier. A widely adopted approach in TC is filtering based on selecting the features with the highest score granted by a certain measure. The reason for preferring filtering approaches over wrappers in TC is that the latter usually result in a considerably more time-consuming process.
In the following paragraphs we briefly describe the measures which have been most widely adopted for FS in TC.
2.1 Statistical Measures
The simplest filtering measures are the term frequency (tf) and the document frequency (df). They quantify the relevance of a word by means of its total number of appearances and by means of the number of different documents in which it appears, respectively. They can be combined into tfidf [10], defined by
$$\mathrm{tfidf} = \mathrm{tf} \cdot \log\frac{N}{\mathrm{df}}$$
where $N$ is the number of documents in the corpus. Notice that words appearing in all the documents are considered non-informative, independently of their absolute frequency, and, in general, a word occurring in many documents will have a smaller tfidf than words with the same tf appearing in fewer documents. Despite their simple appearance, these measures perform acceptably in many situations [5].
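As a concrete illustration (ours, not part of the original paper), the following Python sketch computes tf, df and tfidf over a toy corpus:

```python
import math
from collections import Counter

# Toy corpus: each document is a list of tokens.
docs = [["oil", "price", "oil"], ["price", "rise"], ["oil", "market"]]
N = len(docs)

tf = Counter(w for d in docs for w in d)       # total number of appearances
df = Counter(w for d in docs for w in set(d))  # documents containing the word

# tfidf = tf * log(N / df); a word appearing in every document scores 0.
tfidf = {w: tf[w] * math.log(N / df[w]) for w in tf}
print(sorted(tfidf.items(), key=lambda x: -x[1]))
```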
2.2 Information Theory Measures
Measures taken from Information Theory (IT) have been widely used, because it is interesting to consider the distribution of a word over the different categories. Among these measures, information gain ($IG$) takes into account the presence of the word in a category as well as its absence, and can be defined by (see, for instance, [13])
$$IG(w,c) = P(w)\,P(c \mid w)\log\frac{P(c \mid w)}{P(c)} + P(\bar{w})\,P(c \mid \bar{w})\log\frac{P(c \mid \bar{w})}{P(c)}$$
where $P(w)$ is the probability that the word $w$ appears in a document, $P(c \mid w)$ is the probability that a document belongs to the category $c$ knowing that the word $w$ appears in it, $P(\bar{w})$ is the probability that the word $w$ does not appear in a document, and $P(c \mid \bar{w})$ is the probability that a document belongs to the category $c$ if we know that the word $w$ does not occur in it. Usually, these probabilities are estimated by means of the corresponding relative frequencies.
Another measure of this kind is the expected cross entropy for text ($CET$) [8], which only takes into account the presence of the word in a category. It is defined by
$$CET(w,c) = P(w)\,P(c \mid w)\log\frac{P(c \mid w)}{P(c)}$$
These are the measures of this kind that have obtained the best results in TC [7, 8, 11, 13].
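A minimal sketch (ours) of how both measures can be computed from document counts, assuming relative-frequency estimates and the convention $0 \log 0 = 0$:

```python
import math

def probabilities(a, b, n_c, n):
    """Relative-frequency estimates of the probabilities used by IG and CET.

    a: documents of category c containing w    b: other documents containing w
    n_c: documents in category c               n: total number of documents
    """
    p_w = (a + b) / n
    p_c = n_c / n
    p_c_w = a / (a + b) if a + b else 0.0                     # P(c|w)
    p_c_notw = (n_c - a) / (n - a - b) if n > a + b else 0.0  # P(c|not w)
    return p_w, p_c, p_c_w, p_c_notw

def cet(a, b, n_c, n):
    p_w, p_c, p_c_w, _ = probabilities(a, b, n_c, n)
    return p_w * p_c_w * math.log(p_c_w / p_c) if p_c_w else 0.0

def ig(a, b, n_c, n):
    # IG adds to CET the term for the absence of the word.
    p_w, p_c, _, p_c_notw = probabilities(a, b, n_c, n)
    absent = (1 - p_w) * p_c_notw * math.log(p_c_notw / p_c) if p_c_notw else 0.0
    return cet(a, b, n_c, n) + absent

print(ig(a=30, b=5, n_c=100, n=1000), cet(a=30, b=5, n_c=100, n=1000))
```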
3. ANGULAR MEASURES
Before defining this family of measures, let us consider a category $c$ and a word $w$, and identify each word $w$ with the pair $(a_{w,c}, b_{w,c})$, where $a_{w,c}$ denotes the number of documents of the category $c$ in which $w$ appears and $b_{w,c}$ denotes the number of documents that contain the word $w$ but do not belong to the category $c$. In what follows, we denote the pair $(a_{w,c}, b_{w,c})$ by $(a_w, b_w)$ for simplicity.
Then, we can study the words that receive an identical score under a filtering measure $m(w)$ which depends only on $(a_w, b_w)$ by means of the level curves defined by that measure. In fact, it was demonstrated in [2] that if $m(w)$ is a filtering measure and $N$ and $M$ are natural numbers, then the level curves passing through the words with $a_w \le N$ and $b_w \le M$ can be considered as straight lines.
From that fact, an interesting special case is the family of measures $m(w)$ which have just one level curve for each value, that is, the measures $m(w)$ that satisfy
$$a_w = f(m(w))\, b_w + g(m(w))$$
for some functions $f$ and $g$. Some measures [2], like $df$ (with $f(df) = -1$ and $g(df) = df$, since $df = a_w + b_w$), have this property.
In [2] it has also been proven that if $N$ and $M$ are two natural numbers and $m(w)$ is a filtering measure which has exactly one straight line as level curve for each value that $m(w)$ attains over the words with $a_w \le N$ and $b_w \le M$, then there exist two polynomials $p$ and $q$ such that
$$a_w = p(m(w))\, b_w + q(m(w))$$
for any word $w$ such that $a_w \le N$ and $b_w \le M$.
Therefore, it is interesting to study the filtering measures satisfying the above expression, at least when the degrees of $p$ and $q$ are low. The family of measures obtained when $\deg(p) = 0$ and $\deg(q) = 1$ has been studied in [2], leading to what we call Linear Measures. This paper deals with the measures obtained when $\deg(p) = 1$ and $\deg(q) = 0$ (notice that the degrees of $p$ and $q$ cannot both be zero at the same time).
Thus, if $\deg(p) = 1$ and $\deg(q) = 0$ we have
$$a_w = (c_1 m(w) + c_2)\, b_w + c_3$$
for some constants $c_1$, $c_2$ and $c_3$ such that $c_1 \neq 0$, and thus
$$m(w) = \frac{\frac{a_w - c_3}{b_w} - c_2}{c_1}$$
or, equivalently,
$$m(w) = \frac{a_w - c_2 b_w - c_3}{c_1 b_w}$$
The value of $c_1$ can be taken as $1$, since it does not affect the ordering of words produced by the measure. Then, we obtain
$$m(w) = \frac{a_w - c_2 b_w - c_3}{b_w}$$
with $c_2$ and $c_3$ any real numbers. But the above expression is equivalent to
$$m(w) = \frac{a_w - c_3}{b_w} - c_2$$
and, again, the value of $c_2$ is irrelevant in the sense that the ordering of the words provided by the measure is independent of this constant. Hence, $c_2$ can be taken to be zero. Therefore, the family of measures to study is of the form
$$m(w) = \frac{a_w - c_3}{b_w}$$
or, equivalently,
$$AM_k(w) = \frac{a_w - k}{b_w}$$
where $k$ is a real parameter which defines the family. These measures have a simple geometrical interpretation, as the next theorem establishes; its proof can be found in [3].
Theorem 1. The value $AM_k(w)$ is the tangent of the angle formed by the $x$-axis and the line determined by the points $(a_w, b_w)$ and $(k, 0)$.
This is the reason why we call these measures Angular Measures.
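As an illustration of how this family can be used for FS (a sketch of ours, not code from the paper), the following Python function scores each word with $AM_k$ and keeps the top-ranked ones. The handling of $b_w = 0$ is our assumption, since that case is not discussed here:

```python
def am_k(a_w, b_w, k):
    """Angular Measure AM_k(w) = (a_w - k) / b_w.

    a_w: number of documents of category c containing w
    b_w: number of documents containing w that do not belong to c
    """
    if b_w == 0:
        # w never occurs outside the category; we assume it is maximally
        # relevant (the paper does not state how this case is handled).
        return float("inf")
    return (a_w - k) / b_w


def select_features(counts, k, keep):
    """counts: dict word -> (a_w, b_w); keep the 'keep' best-scored words."""
    ranked = sorted(counts, key=lambda w: am_k(*counts[w], k), reverse=True)
    return ranked[:keep]


# Toy usage: select the two best words for a category.
counts = {"oil": (30, 5), "price": (12, 40), "the": (90, 900)}
print(select_features(counts, k=1.0, keep=2))
```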
4. TASK OF TEXT CATEGORIZATION
This section describes the stages of the TC task.
The bag of words model [10] is adopted for representing the documents: a document is viewed as a set of words without order or structure. Also, tf is chosen to quantify the importance of each word in each document, since it is one of the weightings most widely used in the literature [8, 11].
The classification stage consists of assigning a category to a document from a finite set of $m$ categories. This is commonly converted into $m$ binary problems, each one consisting of determining whether a document belongs to a fixed category or not. This approach is called one-against-the-rest [1].
That process allows different sets of words to be used in the document representation. The local approach uses, for each category, only the words occurring in its documents, while the global approach considers the words from all categories. In this work the local approach is adopted, since it offers better results [11].
Additionally, stop words (words without meaning for the classification) are removed, because they are useless for this task. Also, stemming is performed, which consists of mapping words with the same meaning but slightly different spellings onto a common root. The Porter algorithm [9] is adopted for this purpose.
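A minimal sketch of this preprocessing step (ours, for illustration; the tiny stop list is not the one used by the authors, and NLTK's Porter stemmer stands in for their implementation):

```python
from nltk.stem import PorterStemmer

STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is"}  # illustrative only
stemmer = PorterStemmer()

def preprocess(text):
    """Lowercase, drop stop words, and map each word to its Porter root."""
    tokens = text.lower().split()
    return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The categorization of documents is performed automatically"))
# e.g. ['categor', 'document', 'perform', 'automat']
```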
In this paper the classification is performed with Support Vector Machines (SVM) [7], since they have been shown to perform fast and well in TC [12]. They deal satisfactorily with many features and with sparse examples. They are binary classifiers which find threshold functions that separate the documents of a certain category from the rest. We adopt a linear threshold, since most TC problems are linearly separable [7].
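To make the setup concrete, here is a sketch of the classification stage (our reconstruction with scikit-learn, not the authors' actual toolchain; LinearSVC gives the linear threshold and OneVsRestClassifier the one-against-the-rest decomposition described above):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

train_texts = ["oil price rises", "wheat crop grows", "oil market falls"]
train_labels = ["oil", "grain", "oil"]

vectorizer = CountVectorizer()  # tf weighting (raw term counts)
X = vectorizer.fit_transform(train_texts)

# One binary linear SVM per category.
clf = OneVsRestClassifier(LinearSVC()).fit(X, train_labels)
print(clf.predict(vectorizer.transform(["oil price falls"])))
```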
The popular and well-known measure $F_1$ [11] is adopted in this paper to evaluate the effectiveness of the TC task. It is defined by
$$F_1 = \frac{1}{0.5\,\frac{1}{P} + 0.5\,\frac{1}{R}}$$
where the precision $P$ quantifies the percentage of documents classified as belonging to the category that actually belong to it, while the recall $R$ quantifies the percentage of documents of the category that are correctly classified.
To compute the global performance over all the categories, we use the macroaverage, which consists of averaging the values obtained in each category [11].
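For instance, a direct implementation (ours) of these two definitions is:

```python
def f1(p, r):
    """F1 = harmonic mean of precision P and recall R (0 if both are 0)."""
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_f1(per_category):
    """Macroaverage: mean of the per-category F1 values.

    per_category: list of (precision, recall) pairs, one per category.
    """
    return sum(f1(p, r) for p, r in per_category) / len(per_category)

print(macro_f1([(0.9, 0.8), (0.5, 0.4)]))  # toy values
```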
5. THE CORPORA
In this section, the corpora used in the experiments are described and analyzed: the Reuters-21578 collection and the Ohsumed collection.
5.1 Reuters-21578 Collection
The Reuters-21578 corpus is a set of economic news stories published by Reuters in 1987¹. They are distributed over 135 categories, and each document belongs to one or more of them. The split into train and test documents is that of Apté [1]. After removing documents without body or topics, 7063 train and 2742 test documents assigned to 90 categories are obtained.
¹ Publicly available at http://www.research.att.com/~lewis/reuters21578.html
The distribution of documents into the categories is quite unbalanced. In fact, the relative dispersion of the number of documents per category is 3.36 over the interval [1, 2709] for the training documents and 3.39 over [1, 1044] for the test documents. In addition, 76.40% (in train) and 78.65% (in test) of the categories have less than 1% of the documents.
The words in the corpus are not very scattered: almost half of them (49.91%) appear in only one category and 16.25% in only two.
5.2 Ohsumed Collection
Ohsumed is a subset of MEDLINE references from 270 medical journals over the period 1987-1991². They are classified into the 15 fixed categories of MeSH³: A, B, C, ... Each category is in turn split into subcategories. Following [7], we have taken the first 20000 documents of 1991 with abstract, labelling the first 10000 documents as training and the rest as test, and we classify them into the 23 subcategories of category C of MeSH.
² It can be found at http://trec.nist.gov/data/t9-filtering
³ Available at www.nlm.nih.gov/mesh/2002/index.html
The distribution of documents over the categories is much more balanced than in Reuters. In fact, the relative dispersion of the number of documents per category is 0.86 over the interval [100, 2476] for train and 0.88 over the interval [82, 2424] for test. Furthermore, only 4.35% (in train) and 8.70% (in test) of the categories have less than 1% of the documents, against about 77% in Reuters.
The words in this collection are considerably more scattered than in Reuters: on average, only 19.55% of the words appear in just one category (against 49.91% in Reuters).
6. THE EXPERIMENTS
In the theoretical study developed in [3] we have proved that the values of $k$ of the form
$$\frac{a_w b_v - a_v b_w}{b_v - b_w}$$
with $v$ and $w$ two words of the collection are relevant, since they provide measures which discriminate the words of one category from the rest. Hence, as a first approach we select the deciles of the distribution formed by those values (when $w$ ranges over all the words of the category under study) as candidate values of $k$.
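The following sketch (ours) illustrates this construction: it computes the candidate values for pairs of words and takes the deciles of the resulting distribution. The choice of a single reference word $v$ and the exclusion of pairs with $b_v = b_w$ are assumptions of the illustration:

```python
import numpy as np

def candidate_ks(counts, v):
    """Deciles of (a_w*b_v - a_v*b_w) / (b_v - b_w) as w ranges over counts.

    counts: dict word -> (a_w, b_w); v: a reference word of the category.
    """
    a_v, b_v = counts[v]
    vals = [(a_w * b_v - a_v * b_w) / (b_v - b_w)
            for a_w, b_w in counts.values() if b_w != b_v]
    return np.percentile(vals, range(10, 100, 10))  # 1st decile .. 9th decile

counts = {"oil": (30, 5), "price": (12, 40), "the": (90, 900), "crude": (7, 2)}
print(candidate_ks(counts, v="oil"))
```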
Figures 1 and 2 show the macroaverage of $F_1$ for those deciles on Reuters, and Figures 3 and 4 show it for Ohsumed. They also compare the deciles with two well-known and effective IT measures, $CET$ and $IG$, described in Section 2.
In both corpora, the value of $F_1$ progressively increases from the 1st decile to the median and then decreases until the 9th decile, the maximum being reached at the median.
In the case of Reuters, only the median beats the state-of-the-art measures $CET$ and $IG$, whereas in Ohsumed all the deciles from the 1st to the median achieve better results.
The different behavior of the Angular Measures in the two corpora might be due to the different nature of the collections. As already mentioned, the distribution of documents into categories is considerably more unbalanced in Reuters than in Ohsumed, and the words in Reuters are less scattered than those in Ohsumed.
It is also remarkable that some of the Angular Measures obtain their best performance at very high filtering levels (around 95%), especially on the Ohsumed collection. This makes these measures a very appealing choice when an aggressive reduction of the number of features is intended.
[Figure 1: Macroaverage of F1 for Reuters. F1 (0-50) vs. filtering level percentage (0-100) for the 1st-4th deciles.]
[Figure 2: Macroaverage of F1 for Reuters. F1 (34-50) vs. filtering level percentage (0-100) for the median, the 6th-9th deciles, CET and IG.]
[Figure 3: Macroaverage of F1 for Ohsumed. F1 (0-56) vs. filtering level percentage (0-100) for the 1st-4th deciles.]
[Figure 4: Macroaverage of F1 for Ohsumed. F1 (36-56) vs. filtering level percentage (0-100) for the median, the 6th-9th deciles, CET and IG.]
7. CONCLUSIONS AND FUTURE WORK
This paper presents a family of measures, called Angular Measures, for Feature Selection in Text Categorization. They are obtained from the study of their level curves and are defined by a parameter whose adequate values have been carefully selected.
The median of a strategically chosen distribution offers the best results, beating some of the state-of-the-art measures on both corpora considered. Other deciles of that distribution also beat those measures on one of the corpora. Additionally, the best performance is obtained when most of the words (about 95% of them) are removed, which makes this family of measures suitable for aggressive feature reductions.
As future work, we plan to refine the values of the parameter by taking into account the centiles around the median of the chosen distribution. We also plan to propose several modifications of the Angular Measures based on the performance of other state-of-the-art measures.
8. ACKNOWLEDGMENTS
The research reported in this paper has been supported
in part under MEC and FEDER grant TIN2004-05920.
9. ADDITIONAL AUTHORS
Additional authors: Irene Díaz (Artificial Intelligence Center, University of Oviedo, email: sirene@aic.uniovi.es).
10. REFERENCES
[1] C. Apté, F. Damerau, and S. Weiss. Automated learning of decision rules for text categorization. ACM Transactions on Information Systems, 12(3):233-251, 1994.
[2] E. F. Combarro, E. Montañés, I. Díaz, J. Ranilla, and R. Mones. Introducing a family of linear measures for feature selection in text categorization. IEEE Transactions on Knowledge and Data Engineering, 17(9):1223-1232, 2005.
[3] E. F. Combarro, E. Montañés, J. Ranilla, and I. Díaz. A theoretical framework of angular, Laplace angular and modified Laplace angular measures. Technical report, University of Oviedo, 2005.
[4] I. S. Dhillon, S. Mallela, and R. Kumar. A divisive
information theoretic feature clustering algorithm for
text classification. Journal of Machine Learning
Research, 3:1265–1287, 2003.
[5] I. Díaz, J. Ranilla, E. Montañés, J. Fernández, and E. F. Combarro. Improving performance of text categorisation by combining filtering and support vector machines. Journal of the American Society for Information Science and Technology, 55(7):579-592, 2004.
[6] G. Forman. An extensive empirical study of feature
selection metrics for text categorization. Journal of
Machine Learning Research, 3:1289–1305, 2003.
[7] T. Joachims. Text categorization with support vector machines: learning with many relevant features. In C. Nédellec and C. Rouveirol, editors, Proc. 10th European Conference on Machine Learning ECML-98, number 1398, pages 137-142, Chemnitz, DE, 1998. Springer-Verlag.
[8] D. Mladenic and M. Grobelnik. Feature selection for
unbalanced class distribution and naive bayes. In
Proc. 16th International Conference on Machine
Learning ICML-99, pages 258–267, Bled, SL, 1999.
[9] M. F. Porter. An algorithm for suffix stripping.
Program (Automated Library and Information
Systems), 14(3):130–137, 1980.
[10] G. Salton and M. J. McGill. An introduction to
modern information retrieval. McGraw-Hill, 1983.
[11] F. Sebastiani. Machine learning in automated text categorisation. ACM Computing Surveys, 34(1):1-47, 2002.
[12] Y. Yang and X. Liu. A re-examination of text
categorization methods. In M. A. Hearst, F. Gey, and
R. Tong, editors, Proc. 22nd ACM International
Conference on Research and Development in
Information Retrieval SIGIR-99, pages 42–49,
Berkeley, US, 1999. ACM Press, New York, US.
[13] Y. Yang and J. O. Pedersen. A comparative study on
feature selection in text categorisation. In Proc. 14th
International Conference on Machine Learning
ICML-97, pages 412–420, 1997.