AN EMPIRICAL STUDY ON VARIOUS TEXT CLASSIFIERS
B S Harish
Department of Information Science & Engineering
S J College of Engineering
Mysore 570 006
bsharish@ymail.com

Ramya M Hegde
Department of Computer Science & Engineering
S J College of Engineering
Mysore 570 006
ramyahegde91@gmail.com

N Neeti
Department of Computer Science & Engineering
S J College of Engineering
Mysore 570 006
neetibhat@yahoo.com

M Meghana
Department of Computer Science & Engineering
S J College of Engineering
Mysore 570 006
meghana.m23@gmail.com
ABSTRACT
Text classification has gained importance more than ever in the present day owing to the huge amount of data generated with the advent of technology. There are numerous well-established techniques available to achieve classification, yet it is difficult to declare any algorithm universally efficient over the huge variety of datasets created in real time. In this paper, the existing methods are compared and contrasted based on experimental results. The experiment involves testing documents against a previously created training set. The results give quantitative values for the comparable parameters and are hence helpful in the choice of a classification algorithm.
Categories and Subject Descriptors
I.5 [Pattern Recognition], I.5.2 [Design Methodology]:
Classifier design and evaluation, Feature evaluation and selection,
I.5.4 [Applications]: Text Processing
General Terms
Algorithms, Design, Experimentation.
Keywords
Documents, Dimensionality Reduction, Text Classification,
Classifiers
1. INTRODUCTION
In the last decade, the volume of textual information in electronic
format has increased enormously with the advent of many new
sources of information like WWW, emails, newsgroup messages,
Internet news feed, digital libraries etc. With such amount of
electronic text documents available, users started to feel the need
of an automated system to profitably search and manage these
huge repositories of information. Millions of pages available on
the web, hundreds of emails, updated news and all other text
resources on the internet had to be categorized. The information
also had to be easily organized in a way to allow simple search
and navigation.
It is obvious that handling such a huge collection of data manually
is impractical. That’s where various automated text classification
methods come into picture. Text classification algorithms are
available in good number. However, the existing algorithms have
to deal with many challenges.
The main problems inherent in large amounts of textual data are their organization and the procedure of labeling them. When the data has to be searched for a particular query, a structured organization of the data clearly helps the user and facilitates the retrieval of the target documents. It is difficult to capture high-level semantics and abstract concepts of natural languages just from a few keywords. Text data is usually characterized by its high dimensionality and huge size. These features make the process of text classification challenging: they place both efficiency and accuracy demands on classification systems.
It is essential that the raw data is converted to a standard form before any classification algorithm is applied. Extensive research has been carried out on various text representation and classification schemes. It is therefore important for researchers to have a complete knowledge of the existing representation schemes and classifiers in order to select the representation scheme and classifier which best suit their application.
This paper provides an overview of the well-known dimensionality reduction techniques Principal Component Analysis (PCA), Singular Value Decomposition (SVD) and Locality Preserving Indexing (LPI), and of the text classification algorithms K-Nearest Neighbors (KNN), the Rocchio algorithm and the Linear Least Square Fit (LLSF) algorithm, used to classify documents into pre-defined classes. In addition, we present a comparative study of the above-mentioned classifiers based on quantitative values.
2. RELATED WORK
In automatic text classification, it has been shown that the term is the best unit for text representation and classification [1]. Though a text document expresses a vast range of information, it unfortunately lacks the imposed structure of a traditional database. Therefore, unstructured data, particularly free running text, has to be transformed into structured data. To do this, many pre-processing techniques have been proposed in the literature [2, 3]. After converting unstructured data into structured data, we need an effective document representation model to build an efficient classification system. Bag of Words (BoW) is one of the basic methods of representing a document. BoW forms a vector representing a document using the frequency count of each term in the document. This method of document representation is called the Vector Space Model (VSM) [4]. Unfortunately, the BoW/VSM representation scheme has its own limitations. Some of them are: high dimensionality of the representation, loss of correlation with adjacent words, and loss of the semantic relationships that exist among the terms in a document [5]. To overcome these problems, term weighting methods are used to assign appropriate weights to the terms to improve the performance of text classification [6, 7]. Li and Jain in [8] used a binary representation for a given document. The major drawback of this model is that it results in a huge sparse matrix, which raises the problem of high dimensionality. Hotho et al. in [9] proposed an ontology representation for a document to keep the semantic relationship between the terms in a document. This ontology model preserves the domain knowledge of a term present in a document. However, automatic ontology construction is a difficult task due to the lack of a structured knowledge base. Cavnar (1994) in [10] used sequences of symbols (a byte, a character or a word) called N-Grams, extracted from a long string in a document. In an N-Gram scheme, it is very difficult to decide the number of grams to be considered for effective document representation.
Another approach in [11] uses multi-word terms as vector components to represent a document. However, this method requires sophisticated automatic term extraction algorithms to extract the terms automatically from a document. Wei et al. (2008) in [12]
proposed an approach called Latent Semantic Indexing (LSI)
which preserves the representative features for a document. The
LSI preserves the most representative features rather than
discriminating features. Thus to overcome this problem, Locality
Preserving Indexing (LPI) [13] was proposed for document
representation. The LPI discovers the local semantic structure of a
document. Unfortunately LPI is not efficient in time and memory
[14]. Choudhary and Bhattacharyya (2002) in [15] used Universal
Networking Language (UNL) to represent a document. The UNL
represents the document in the form of a graph with words as
nodes and relation between them as links. This method requires
the construction of a graph for every document and hence it is
unwieldy to use for an application where large numbers of
documents are present. Craven et al. (1998) in [16] developed the Web-KB project for constructing and maintaining large knowledge bases: an ontology is constructed manually, and a seed knowledge base comprising a set of labeled web pages is used to learn to instantiate knowledge-base objects and relations from the Web.
In [17], a new representation to model the web documents is
proposed. HTML tags are used to build the web document
representation. They used histogram representation for frequency
of terms in four sections of HTML codes: text, bold, links and
titles. Each symbolic object is built after the web collection is
analyzed and the most frequent terms are obtained. Isa et al.,
(2008) in [18] used the Bayes formula to vectorize a document
according to a probability distribution reflecting the probable
categories that the document may belong to. Using this
probability distribution as the vectors to represent the document,
the SVM is used to classify the documents. The same work has
been extended by Guru et al. (2010) [16] to represent a text document using interval-valued symbolic features. The probability distributions of terms in a document are used to form a symbolic representation, which is then used for training and classification. Dinesh et al. (2009) [19] proposed a new data structure called the status matrix, which preserves the sequence of term occurrence in a document. Classification of documents is done based on this new representation.
3. TEXT CLASSIFICATION
Consider a set of documents $D$, and let $C = \{c_1, c_2, c_3, \ldots, c_{|C|}\}$ be the set of predefined classes. Text classification aims at assigning each of the documents in a testing dataset to its appropriate class by testing it against the dataset $D$. Before applying any of the existing classification algorithms to the data, we must represent the data as a term-document matrix.
3.1 Text Representation
Stop words elimination
The most common words do not give any information for classifying text. Articles, adverbs, conjunctions and so on do not characterize a specific topic; their use is only functional for applying the syntactic rules of the language, they are uniformly distributed over the collection, and they can be safely removed. The simplest way to prune the vocabulary is to use a list of unnecessary words and remove them from the vocabulary. Many stop-word lists exist on the Internet for each language, and they include many different types of words: adjectives, pronouns, adverbs, common verbs and common nouns.
We have used a stop-word list during the construction of the term-document matrix to eliminate the presence of common words in the dictionary.
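For illustration, a minimal sketch of such stop-word pruning is given below; the word list and the tokenizer are placeholders for the example, not the exact list used in our experiments.

```python
# Minimal stop-word elimination sketch (illustrative word list only).
import re

STOP_WORDS = {"a", "an", "the", "and", "or", "but", "is", "are", "was",
              "of", "to", "in", "on", "for", "with", "that", "this", "it"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(text):
    """Return the tokens of the text with stop words filtered out."""
    return [tok for tok in tokenize(text) if tok not in STOP_WORDS]

print(remove_stop_words("The quick brown fox jumps over the lazy dog"))
# ['quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog']
```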
Term document matrix
As the system cannot understand the semantics of a document, it needs some representation of the document with which it can classify the document. The term-document matrix is one such representation. A term-document matrix is a mathematical matrix that describes the frequency of the terms that occur in a collection of documents. In the form used here, rows correspond to documents in the collection and columns correspond to terms. There are various schemes for determining the value that each entry in the matrix should take.
The term-document matrix of a class consists of all the training documents of the class and the terms or words that are selected as features for that particular class (which are stored in the dictionary), with the frequency of each term in each document.
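For illustration, the following sketch builds a small term-document matrix with plain Python and NumPy; the example documents and the tiny stop-word list are invented.

```python
# Build a small term-document matrix: rows are documents, columns are
# the dictionary terms, entries are raw term frequencies (illustrative).
from collections import Counter
import numpy as np

docs = ["the match was won in the last over",
        "stock prices fell after the earnings report",
        "the striker scored twice in the match"]

STOP_WORDS = {"the", "was", "in", "after"}

def tokens(text):
    return [t for t in text.lower().split() if t not in STOP_WORDS]

dictionary = sorted({t for d in docs for t in tokens(d)})
term_index = {t: j for j, t in enumerate(dictionary)}

tdm = np.zeros((len(docs), len(dictionary)), dtype=int)
for i, d in enumerate(docs):
    for term, count in Counter(tokens(d)).items():
        tdm[i, term_index[term]] = count

print(dictionary)
print(tdm)   # each row is the term-frequency vector of one document
```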
3.2 Dimensionality Reduction
It is essential for a document to be represented in a standard form
before applying any classification algorithm. The term document
matrix is the standard form that we have considered. As
mentioned earlier, text data is characterized by its high
dimensionality. It is necessary to mathematically reduce the data
so as to represent each document using only a few dimensions.
Such techniques are called Dimensionality Reduction techniques.
We have used three dimensionality reduction techniques in this
paper and they are explained in the subsequent subsections.
Principal Component Analysis (PCA)
PCA is based on the covariance matrix and is known to be the best linear dimensionality reduction technique in the mean-square-error sense. Its goal is to find a set of mutually orthogonal basis functions that capture the directions of maximum variance in the data, so that the pairwise Euclidean distances are best preserved. If the data is embedded in a linear subspace, PCA is guaranteed to discover the dimensionality of the subspace and produce a compact representation.
PCA reduces the dimension of the data by finding a few orthogonal linear combinations of the original variables with the largest variance [20]. We assume that we have $n$ observations, each being a realization of the $p$-dimensional random variable $x = (x_1, \ldots, x_p)^T$ with mean $\mu = E(x) = (\mu_1, \mu_2, \ldots, \mu_p)^T$ and $p \times p$ covariance matrix $\Sigma = E\{(x-\mu)(x-\mu)^T\}$. We denote such an observation matrix by $X$.
We find the first principal component as the linear combination with the largest variance. Denoting the first principal component by $p_1$, we have $p_1 = x^T w_1$, where the $p$-dimensional coefficient vector $w_1 = (w_{1,1}, \ldots, w_{1,p})^T$ solves
$$w_1 = \arg\max_{\|w\|=1} \mathrm{Var}\{x^T w\}.$$
The second principal component is the linear combination with the second largest variance that is orthogonal to the first principal component, and so on. The number of principal components is equal to the number of original variables. In most datasets, the first several principal components account for most of the variance, and the remaining principal components can be disregarded with minimal loss of information.
Variance depends on the scale of the variables, so it is customary to standardize all the variables to have a mean of 0 and a standard deviation of 1. After this standardization, all the original variables are in comparable units. Assuming the standardized data are collected in the matrix $X$, the $p \times p$ covariance matrix is
$$\Sigma = \frac{1}{n} X X^T,$$
and we can use the spectral decomposition theorem to write
$$\Sigma = U \Lambda U^T,$$
where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$ is the diagonal matrix of the ordered eigenvalues $\lambda_1 \geq \ldots \geq \lambda_p$ and $U$ is a $p \times p$ orthogonal matrix containing the eigenvectors. It can be shown [21] that the principal components are given by the $p$ rows of the $p \times n$ matrix $S = U^T X$, from which we see that the weight matrix $W$ is given by $U^T$. It can also be shown [21] that the subspace spanned by the first $k$ eigenvectors has the smallest mean square deviation from $X$ among all subspaces of dimension $k$.
Another property of the eigenvalue decomposition is that the total variation is equal to the sum of the eigenvalues of the covariance matrix,
$$\sum_{i=1}^{p} \mathrm{Var}(PC_i) = \sum_{i=1}^{p} \lambda_i = \mathrm{trace}(\Sigma),$$
and the fraction
$$\sum_{i=1}^{k} \lambda_i \,/\, \mathrm{trace}(\Sigma)$$
gives the cumulative proportion of the variance explained by the first $k$ principal components. By plotting the cumulative proportion as a function of $k$, one can select the appropriate number of principal components to keep in order to explain a given percentage of the overall variation.
There is another method to find the number of principal components: fix a threshold $\lambda_0$ and keep only the eigenvectors whose eigenvalues are greater than this threshold, thereby reducing the dimension of the data.
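For illustration, the sketch below performs this procedure on a document matrix: it standardizes the variables, eigendecomposes the covariance matrix, and keeps enough components to explain a chosen fraction of the variance. The 95% threshold and the random data are assumptions made for the example only.

```python
# PCA by eigendecomposition of the covariance matrix (illustrative sketch).
import numpy as np

def pca_reduce(X, var_fraction=0.95):
    """Project the rows of X onto the leading principal components.

    X is an (n_documents x n_terms) term-document matrix; the returned
    matrix has one row per document in the reduced space."""
    # Standardize each variable to zero mean and unit standard deviation.
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0                  # guard against constant columns
    Z = (X - mu) / sigma

    # Covariance matrix and its spectral decomposition.
    cov = np.cov(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # returned in ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Keep enough components to explain the requested variance fraction.
    cum = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(cum, var_fraction)) + 1
    return Z @ eigvecs[:, :k]

X = np.random.default_rng(0).poisson(1.0, size=(20, 50)).astype(float)
print(pca_reduce(X).shape)                   # (20, k) with k <= 50
```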
Singular Value Decomposition
Singular Value Decomposition (SVD) is a method of transforming correlated variables into a set of uncorrelated ones that better represent the relations between the original data set items [22]. In other words, SVD is a method for data reduction that finds the best approximation of the original data points using fewer dimensions.
A regression line is the line that minimizes the distance between every original data point and the line. Here, a first regression line running through the data points is drawn. Then a perpendicular is dropped from each original data point onto the regression line, and the intersection point is taken as the approximation of that data point. This gives a reduced representation of the original data points that captures the variation among them as much as possible.
A second regression line, perpendicular to the first, is then drawn. This line represents as much as possible of the variation along the second dimension of the original data set. It contributes less to approximating the original data because it corresponds to a dimension exhibiting less variation. These regression lines can be used to generate a set of uncorrelated data points that reveal subgroupings in the original data. SVD neglects variations below a specific threshold and massively reduces the data while still preserving the main relationships of interest.
SVD takes a rectangular matrix $A$ and breaks it into the product of three matrices: an orthogonal matrix $U$, a diagonal matrix $S$, and the transpose of an orthogonal matrix $V$. The SVD theorem states:
$$A_{m \times n} = U_{m \times m}\, S_{m \times n}\, V^{T}_{n \times n},$$
where $U^T U = I$ and $V^T V = I$, $I$ being the identity matrix. The eigenvectors of $A^T A$ make up the columns of $V$, and the eigenvectors of $A A^T$ make up the columns of $U$. $S$ is a diagonal matrix containing the square roots of the eigenvalues of $A^T A$ (equivalently, of $A A^T$), arranged in descending order.
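For illustration, the following sketch obtains a rank-k document representation from the SVD of a small matrix; the matrix and the choice k = 2 are arbitrary examples.

```python
# Rank-k approximation of a term-document matrix via SVD (illustrative).
import numpy as np

def svd_reduce(A, k):
    """Return the k-dimensional document representation U_k * S_k.

    A is an (m x n) matrix; U, S, Vt satisfy A = U @ diag(S) @ Vt with
    singular values in descending order."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] * S[:k]          # documents projected onto k directions

A = np.array([[2., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 0., 3., 1.],
              [0., 2., 0., 1.]])
docs_2d = svd_reduce(A, k=2)
print(docs_2d.shape)                 # (4, 2)
```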
Locality Preserving Indexing
Locality Preserving Indexing (LPI) is used for document indexing [23]. Every document is represented by a vector of lower dimension than the original document vector. The local structure of the document space and the semantic structure of the documents are studied to obtain this concise document representation.
Given a document set $D$ consisting of documents $d_1, d_2, \ldots, d_m$, the goal is to find a lower-dimensional representation $y_i$ of each $d_i$ such that the norm $\|y_i - y_j\|$ reflects the semantic relation between $d_i$ and $d_j$.
Generally, the dimension of the document vectors ($n$) is much higher than the number of documents ($m$), and when $n$ is large the computational complexity of the eigen-problem is high. To overcome this problem, the documents are first projected onto the PCA subspace. The resulting matrix is non-singular and of lower dimension. The algorithmic procedure for LPI is stated in [23].
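For illustration, the sketch below follows the general LPI recipe of [23] (PCA projection, a nearest-neighbour affinity graph, and a generalized eigen-problem on the graph Laplacian). The neighbourhood size, the cosine weighting, the regularization term and the SciPy dependency are our own assumptions for the demo, not the exact settings of the original algorithm or of our experiments.

```python
# Illustrative sketch of Locality Preserving Indexing (after He et al. [23]).
import numpy as np
from scipy.linalg import eigh

def lpi_reduce(X, dim=2, n_neighbors=3):
    """X: (n_docs x n_terms) matrix. Returns an (n_docs x dim) embedding."""
    # Step 1: project onto the PCA subspace so the matrices below are
    # non-singular (documents become columns from here on).
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = (U * S).T                      # (p x n_docs) PCA coordinates

    # Step 2: k-nearest-neighbour adjacency graph with cosine weights.
    n = Z.shape[1]
    norms = np.linalg.norm(Z, axis=0) + 1e-12
    C = (Z / norms).T @ (Z / norms)    # cosine similarity between documents
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(C[i])[::-1][1:n_neighbors + 1]
        W[i, nbrs] = np.clip(C[i, nbrs], 0.0, None)
    W = np.maximum(W, W.T)             # symmetrize

    # Step 3: graph Laplacian and generalized eigen-problem
    #         Z L Z^T a = lambda Z D Z^T a, keeping the smallest eigenvalues.
    D = np.diag(W.sum(axis=1))
    L = D - W
    A_mat = Z @ L @ Z.T
    B_mat = Z @ D @ Z.T + 1e-9 * np.eye(Z.shape[0])   # regularize
    vals, vecs = eigh(A_mat, B_mat)
    return (vecs[:, :dim].T @ Z).T     # (n_docs x dim) embedding

X = np.random.default_rng(1).poisson(1.0, size=(12, 30)).astype(float)
print(lpi_reduce(X).shape)             # (12, 2)
```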
3.3 Classifiers
Given a set of classes, the task of a classifier is to find to which class a text document belongs. Well-represented data is the input to the classifier. Various classifiers are well established in the present day; three of them are considered for the task of classification in this paper.
K-Nearest Neighbor Classification (KNN)
KNN assigns a category to a document based on the class(es) of its K nearest neighbors in the training data, as defined by a similarity function. A neighbor is considered nearest if it has the smallest Euclidean distance in the feature space. An object is classified by a majority vote of its neighbors: the object is assigned to the class that is most common amongst its K nearest neighbors. If K = 1, the object is simply assigned to the class of its nearest neighbor. Its large computational requirement is a disadvantage, because classifying an object requires calculating its distance to every object in the learning set.
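For illustration, a minimal KNN sketch over reduced document vectors is shown below; the Euclidean distance and majority vote follow the description above, while the data and K = 3 are invented.

```python
# K-nearest-neighbour classification with Euclidean distance (sketch).
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Assign x to the majority class among its k nearest training vectors."""
    dists = np.linalg.norm(X_train - x, axis=1)  # distance to every training doc
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

X_train = np.array([[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]])
y_train = ["sports", "sports", "business", "business"]
print(knn_predict(X_train, y_train, np.array([0.15, 0.85])))   # 'sports'
```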
Rocchio Algorithm
The Rocchio algorithm adapts the relevance feedback method to the classification setting [24]. The feedback approach was developed using the vector space model. The algorithm is based on the assumption that most users have a general conception of which documents should be denoted as relevant or non-relevant; therefore, the user's search query is revised to include an arbitrary percentage of relevant and non-relevant documents.
Here, each document is treated as a normalized vector (unit length). For each class $C$, we compute the centroid $\mu(C)$ of the labeled documents in $C$. The centroid of a class $C$ is the vector average, or center of mass, of its members:
$$\mu(C) = \frac{1}{|D_C|} \sum_{d \in D_C} v(d),$$
where $D_C$ is the set of documents $d_1, d_2, \ldots$ belonging to class $C$ and $v(d)$ denotes the normalized vector of document $d$. Now, for a test document $d$, we find the closest centroid and put $d$ into the corresponding class. The boundary between two classes in Rocchio classification is the set of points with equal distance from the two centroids.
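For illustration, the sketch below computes class centroids from normalized vectors and assigns a test document to the nearest centroid; the two-dimensional document vectors are invented.

```python
# Rocchio / nearest-centroid classification sketch.
import numpy as np

def normalize(v):
    return v / (np.linalg.norm(v) + 1e-12)

def train_centroids(docs, labels):
    """Centroid of each class = mean of the normalized member vectors."""
    centroids = {}
    for c in set(labels):
        members = [normalize(d) for d, l in zip(docs, labels) if l == c]
        centroids[c] = np.mean(members, axis=0)
    return centroids

def rocchio_predict(centroids, d):
    """Assign d to the class whose centroid is closest."""
    v = normalize(d)
    return min(centroids, key=lambda c: np.linalg.norm(v - centroids[c]))

docs = [np.array([3.0, 1.0]), np.array([4.0, 0.5]),
        np.array([0.5, 4.0]), np.array([1.0, 3.0])]
labels = ["politics", "politics", "sports", "sports"]
cents = train_centroids(docs, labels)
print(rocchio_predict(cents, np.array([2.5, 0.8])))   # 'politics'
```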
Linear Least Square Fit (LLSF)
In LLSF, each document $d_j$ has two vectors associated with it: an input vector $I(d_j)$ of $|T_j|$ weighted terms, where $T_j$ is the set of terms in document $d_j$, and an output vector $O(d_j)$ of $|C|$ weights representing the categories [25]. Text classification in LLSF therefore means: given a test document $d_j$ and its input vector $I(d_j)$, the task is to find the output vector $O(d_j)$.
A linear classifier [26] [27] is used for text classification; for each class it computes a categorization status value that corresponds to the dot product of the document vector of $d_j$ and the weight vector of that class.
There are two methods to learn linear classifiers: batch methods, in which a classifier is built by analyzing the training set all at once, and on-line methods, in which a classifier is built soon after examining the first training document and is refined incrementally as new documents are examined. The classifier [28] examines the closeness of a test document to the centroid of the positive training examples and its distance from the centroid of the negative training examples. It finds the hyperplanes that approximately separate a class of document vectors from its complement.
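For illustration, the sketch below learns the LLSF mapping as an ordinary least-squares fit from input (term) vectors to output (category) vectors; the tiny matrices and one-hot output vectors are invented for the example.

```python
# Linear Least Squares Fit: learn a term-to-category mapping (sketch).
import numpy as np

# Rows of I_train are input (term-weight) vectors, rows of O_train are
# output vectors with one weight per category (here one-hot labels).
I_train = np.array([[2., 0., 1., 0.],
                    [1., 1., 0., 0.],
                    [0., 0., 2., 1.],
                    [0., 1., 0., 2.]])
O_train = np.array([[1., 0.],     # category 0
                    [1., 0.],
                    [0., 1.],     # category 1
                    [0., 1.]])

# Solve min_W || I_train @ W - O_train ||_F by least squares.
W, *_ = np.linalg.lstsq(I_train, O_train, rcond=None)

def llsf_predict(x):
    """Output vector of a test document; pick the highest-scoring category."""
    scores = x @ W
    return int(np.argmax(scores)), scores

print(llsf_predict(np.array([1.5, 0.2, 0.8, 0.0])))   # category 0 expected
```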
4. EXPERIMENTATION AND RESULTS
4.1 Datasets
The classification methods discussed above are applied on two datasets. The first consists of 5 categories (Entertainment, Sports, Image Processing, Business and Politics), each containing 100 documents. We call this dataset Dataset1.
A Usenet newsgroup is a repository created, usually within the Usenet system, for messages posted by many users in different locations. The articles that users post to Usenet are organized into topical categories called newsgroups, which are themselves logically organized into hierarchies of subjects. For example, sci.math and sci.physics are within the sci hierarchy for science. In most newsgroups, the majority of the articles are responses to some other article. Most newsgroups can be categorized into major subject areas such as news, rec (recreation), soc (society), sci (science), comp (computers), and so forth (there are many more). Typically, a newsgroup is focused on a particular topic of interest. Some newsgroups allow the posting of messages on a wide variety of themes, regarding anything that a member chooses to discuss as on-topic, while others keep more strictly to their particular subject and treat other postings as off-topic. There are currently over 100,000 Usenet newsgroups, but only 20,000 are active.
For our experimentation, we used the above newsgroups and created our own dataset. Since we downloaded all the newsgroups from the Google Usenet newsgroups, we call this dataset the Google Newsgroup Dataset (Dataset2 in the tables below). The main objective in creating this dataset was to have a large discriminating content between the documents of one class and those of the other classes. Hence, we picked documents from the Google newsgroups which have maximum discriminating content from class to class.
Originally, some of the documents contained free running text and HTML tags. To maintain uniformity among the documents, we converted all HTML-tagged content into .txt format. Thus, the dataset contains free running texts stored in .txt files. As a preprocessing step, we employed stop-word elimination for all the documents.
4.2 Experimental Settings
Text classification requires a good collection of test data. Huge manual effort is required to collect a sufficiently large body of text and ultimately produce it in a machine-readable format. When classification is carried out on the testing set, a document is counted as correctly classified if it is assigned to its one correct class from the training set, and as an error otherwise. The machine used to carry out the experiments was configured with 2 GB of RAM and an i5 processor, running Windows 7.
Experiment 1
The 100 documents in each class are divided into two sets of 50
documents each. The first set of 50 documents in each class
comprises the training set. The remaining set comprises the
testing set. The reduced lower dimension matrix is stored in the
knowledge base and it is further used for classification.
Experiment 2
The 100 documents in each class are divided into two sets, one
containing 60 documents and the other with 40 documents. The
set of all 60 documents in each class comprises the training set.
The remaining set of 40 documents each comprises the testing set.
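For concreteness, the sketch below wires one such run together end to end: a per-class 50:50 split, SVD-based dimensionality reduction, and a 1-nearest-neighbour classifier, all on synthetic term counts. The synthetic data, the number of retained dimensions and the particular DRT/classifier pair are illustrative assumptions, not a reproduction of the reported setup.

```python
# End-to-end sketch of one experimental run: per-class 50:50 split,
# SVD reduction, nearest-neighbour classification (synthetic data).
import numpy as np

rng = np.random.default_rng(42)
n_classes, docs_per_class, n_terms, k_dims = 5, 100, 200, 20

# Synthetic term-count documents: each class has its own term profile.
profiles = rng.gamma(2.0, 1.0, size=(n_classes, n_terms))
X = np.vstack([rng.poisson(profiles[c], size=(docs_per_class, n_terms))
               for c in range(n_classes)]).astype(float)
y = np.repeat(np.arange(n_classes), docs_per_class)

# Per-class 50:50 split into training and testing sets.
train_idx = np.hstack([np.arange(c * docs_per_class, c * docs_per_class + 50)
                       for c in range(n_classes)])
test_idx = np.setdiff1d(np.arange(len(y)), train_idx)

# Dimensionality reduction: project both sets onto the top SVD directions
# learned from the training documents only.
U, S, Vt = np.linalg.svd(X[train_idx], full_matrices=False)
P = Vt[:k_dims].T
X_train, X_test = X[train_idx] @ P, X[test_idx] @ P

# 1-NN classification and accuracy.
preds = []
for x in X_test:
    nearest = np.argmin(np.linalg.norm(X_train - x, axis=1))
    preds.append(y[train_idx][nearest])
accuracy = np.mean(np.array(preds) == y[test_idx])
print(f"accuracy: {accuracy:.2%}")
```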
4.3 Results
The results of each experiment, obtained using the two datasets, are tabulated in Table 1, Table 2, Table 3 and Table 4. The classification methods are compared quantitatively using these results.
Table 1. Classification accuracies obtained from experiment 1 using Dataset1.

  Dataset    Training vs. Testing   DRT   Classifier   Accuracy (%)
  Dataset1   50:50                  PCA   KNN             68.50
  Dataset1   50:50                  PCA   Rocchio         66.35
  Dataset1   50:50                  PCA   LLSF            59.85
  Dataset1   50:50                  SVD   KNN             71.50
  Dataset1   50:50                  SVD   Rocchio         66.90
  Dataset1   50:50                  SVD   LLSF            54.85
  Dataset1   50:50                  LPI   KNN             76.35
  Dataset1   50:50                  LPI   Rocchio         74.50
  Dataset1   50:50                  LPI   LLSF            68.35
The following conclusions are drawn after tabulating the results from experiment 1: if the DRT used is LPI and the classifier employed is KNN, we obtain the best result for Dataset1. When the LLSF classifier is used along with SVD, the classification accuracy is the lowest of all.
The following conclusions are drawn after tabulating the results from experiment 2: if the DRT used is LPI and the classifier employed is KNN, we obtain the best result for Dataset1. When any DRT (PCA, SVD or LPI) is paired with the LLSF classifier, we achieve poor results. The Rocchio classifier gives better results than the LLSF classifier, but poorer results than the KNN classifier. However, the accuracy for each combination of DRT and classification method is higher than in experiment 1.
Table 2. Classification accuracies obtained from experiment 2 using Dataset1.

  Dataset    Training vs. Testing   DRT   Classifier   Accuracy (%)
  Dataset1   60:40                  PCA   KNN             72.50
  Dataset1   60:40                  PCA   Rocchio         69.85
  Dataset1   60:40                  PCA   LLSF            60.10
  Dataset1   60:40                  SVD   KNN             74.35
  Dataset1   60:40                  SVD   Rocchio         68.10
  Dataset1   60:40                  SVD   LLSF            56.40
  Dataset1   60:40                  LPI   KNN             80.15
  Dataset1   60:40                  LPI   Rocchio         77.40
  Dataset1   60:40                  LPI   LLSF            69.15
The following conclusions are drawn after tabulating the results from experiment 1: if the DRT used is LPI and the classifier employed is KNN, we obtain the best result for Dataset2. When any DRT (PCA, SVD or LPI) is paired with the LLSF classifier, we get poor results. The Rocchio classifier gives better results than the LLSF classifier, but poorer results than the KNN classifier.
Table 3. Classification accuracies obtained from experiment 1 using Dataset2.

  Dataset    Training vs. Testing   DRT   Classifier   Accuracy (%)
  Dataset2   50:50                  PCA   KNN             72.35
  Dataset2   50:50                  PCA   Rocchio         69.50
  Dataset2   50:50                  PCA   LLSF            61.45
  Dataset2   50:50                  SVD   KNN             74.95
  Dataset2   50:50                  SVD   Rocchio         68.45
  Dataset2   50:50                  SVD   LLSF            57.15
  Dataset2   50:50                  LPI   KNN             76.80
  Dataset2   50:50                  LPI   Rocchio         69.25
  Dataset2   50:50                  LPI   LLSF            63.00
The following conclusions are drawn after tabulating the results from experiment 2: if the DRT used is LPI and the classifier employed is KNN, we obtain the best result for Dataset2. When any DRT (PCA, SVD or LPI) is paired with the LLSF classifier, we get poor results. The Rocchio classifier gives better results than the LLSF classifier, but poorer results than the KNN classifier.
Table 4. Classification accuracies obtained from experiment 2 using Dataset2.

  Dataset    Training vs. Testing   DRT   Classifier   Accuracy (%)
  Dataset2   60:40                  PCA   KNN             74.90
  Dataset2   60:40                  PCA   Rocchio         70.00
  Dataset2   60:40                  PCA   LLSF            63.85
  Dataset2   60:40                  SVD   KNN             76.55
  Dataset2   60:40                  SVD   Rocchio         69.15
  Dataset2   60:40                  SVD   LLSF            58.50
  Dataset2   60:40                  LPI   KNN             81.55
  Dataset2   60:40                  LPI   Rocchio         75.90
  Dataset2   60:40                  LPI   LLSF            67.35
The performance of any classifier is greatly dependent on the data being classified and on the dataset used to measure its performance. However, based on the datasets considered and the results obtained above, it can be said that K-Nearest Neighbor serves classification better than the other two classifiers, while the Linear Least Square Fit algorithm does a comparatively poor job. It can also be seen that when the dimensionality is reduced using LPI and the KNN classifier is employed, we obtain the best accuracy.
5. CONCLUSION
Text classification has evolved into a research field which has delivered efficient, effective, and overall workable solutions that have been used to tackle a wide variety of real-world application domains. We have applied three different classification methods (KNN, Rocchio and LLSF) to the problem of document classification. These methods were evaluated individually when used with each of the dimensionality reduction techniques (PCA, SVD, LPI).
All three classifiers perform reasonably well on our datasets. From the point of view of accuracy, the KNN classifier works better than the Rocchio and LLSF methods. In every case, the classifiers work most efficiently when the LPI dimensionality reduction technique is employed.
Though KNN is comparatively better than the other two classifiers, it is a lazy learning method. Support Vector Machines can be used to overcome its limitations.
Hence, in future work we intend to explore other representation methods, dimensionality reduction techniques and classifiers to achieve better classification accuracy. Further, we would also make use of various standard datasets to evaluate different classifiers.
6. REFERENCES
[1] Song, F., Liu, S., and Yang, J. 2005. A comparative study on text representation schemes in text categorization. Pattern Analysis and Applications, Vol. 8, pp. 199-209.
[2] Porter, M. F. 1980. An algorithm for suffix stripping. Program, Vol. 14 (3), pp. 130-137.
[3] Hotho, A., Nürnberger, A., and Paaß, G. 2005. A Brief Survey of Text Mining. Journal for Computational Linguistics and Language Technology, Vol. 20, pp. 19-62.
[4] Salton, G., Wong, A., and Yang, C. S. 1975. A Vector Space Model for Automatic Indexing. Communications of the ACM, Vol. 18, pp. 613-620.
[5] Bernotas, M., Karklius, K., Laurutis, R., and Slotkiene, A. 2007. The peculiarities of the text document representation, using ontology and tagging-based clustering technique. Journal of Information Technology and Control, Vol. 36, pp. 217-220.
[6] Lan, M., Tan, C. L., Su, J., and Lu, Y. 2009. Supervised and Traditional Term Weighting Methods for Automatic Text Categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 31 (4), pp. 721-735.
[7] Altınçay, H., and Erenel, Z. 2010. Analytical evaluation of term weighting schemes for text categorization. Pattern Recognition Letters, Vol. 31 (11), pp. 1310-1323.
[8] Li, Y. H., and Jain, A. K. 1998. Classification of Text Documents. The Computer Journal, Vol. 41, pp. 537-546.
[9] Hotho, A., Maedche, A., and Staab, S. 2001. Ontology-based text clustering. In Proceedings of the International Joint Conference on Artificial Intelligence, pp. 30-37.
[10] Cavnar, W. B. 1994. Using an N-Gram based document representation with a vector processing retrieval model. In Proceedings of the Third Text REtrieval Conference (TREC-3), pp. 269-278.
[11] Milios, E., Zhang, Y., He, B., and Dong, L. 2003. Automatic term extraction and document similarity in special text corpora. In Proceedings of the Sixth Conference of the Pacific Association for Computational Linguistics (PACLing'03), pp. 275-284.
[12] Wei, C. P., Yang, C. C., and Lin, C. M. 2008. A Latent Semantic Indexing-based approach to multilingual document clustering. Decision Support Systems, Vol. 45, pp. 606-620.
[13] He, X., Cai, D., Liu, H., and Ma, W. Y. 2004. Locality Preserving Indexing for document representation. In Proceedings of SIGIR, pp. 96-103.
[14] Cai, D., He, X., Zhang, W. V., and Han, J. 2007. Regularized Locality Preserving Indexing via Spectral Regression. In Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM'07), pp. 741-750.
[15] Choudhary, B., and Bhattacharyya, P. 2002. Text clustering using Universal Networking Language representation. In Proceedings of the Eleventh International World Wide Web Conference.
[16] Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T. M., Nigam, K., and Slattery, S. 1998. Learning to Extract Symbolic Knowledge from the World Wide Web. In Proceedings of AAAI/IAAI, pp. 509-516.
[17] Esteban, M., and Rodríguez, O. R. 2006. A Symbolic Representation for Distributed Web Document Clustering. In Proceedings of the Fourth Latin American Web Congress, Cholula, Mexico.
[18] Isa, D., Lee, L. H., Kallimani, V. P., and Rajkumar, R. 2008. Text document preprocessing with the Bayes formula for classification using the support vector machine. IEEE Transactions on Knowledge and Data Engineering, Vol. 20, pp. 23-31.
[19] Dinesh, R., Harish, B. S., Guru, D. S., and Manjunath, S. 2009. Concept of Status Matrix in Text Classification. In Proceedings of the Indian International Conference on Artificial Intelligence, Tumkur, India, pp. 2071-2079.
[20] Fodor, I. K. 2002. A Survey of Dimension Reduction Techniques.
[21] Mardia, K. V., Kent, J. T., and Bibby, J. M. 1995. Multivariate Analysis. Probability and Mathematical Statistics. Academic Press.
[22] Baker, K. 2005. Singular Value Decomposition Tutorial.
[23] He, X., Cai, D., Liu, H., and Ma, W. Y. Locality Preserving Indexing for Document Representation.
[24] Rocchio, J. J. 1971. Relevance Feedback in Information Retrieval. Prentice-Hall Inc.
[25] Harish, B. S., Guru, D. S., and Manjunath, S. 2010. Representation and Classification of Text Documents: A Brief Review. IJCA Special Issue on "Recent Trends in Image Processing and Pattern Recognition" (RTIPPR).
[26] Sebastiani, F. 2002. Machine learning in automated text categorization. ACM Computing Surveys, Vol. 34, pp. 1-47.
[27] Lewis, D. D., Schapire, R. E., Callan, J. P., and Papka, R. 1996. Training algorithms for linear text classifiers. In Proceedings of the Nineteenth International Conference on Research and Development in Information Retrieval (SIGIR'96), pp. 289-297.
[28] Joachims, T. 1997. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In Proceedings of the Fourteenth International Conference on Machine Learning, pp. 143-151.