AN EMPIRICAL STUDY ON VARIOUS TEXT CLASSIFIERS
B S Harish
Department of Information Science & Engineering
S J College of Engineering
Mysore 570 006
bsharish@ymail.com

Ramya M Hegde
Department of Computer Science & Engineering
S J College of Engineering
Mysore 570 006
ramyahegde91@gmail.com

N Neeti
Department of Computer Science & Engineering
S J College of Engineering
Mysore 570 006
neetibhat@yahoo.com

M Meghana
Department of Computer Science & Engineering
S J College of Engineering
Mysore 570 006
meghana.m23@gmail.com
ABSTRACT
Text classification has gained importance more than ever in the present day owing to the huge amount of data generated with the advent of technology. There are numerous well-established techniques available to achieve classification, yet it is difficult to declare any algorithm universally efficient over the huge variety of datasets created in real time. In this paper, the existing methods are compared and contrasted based on experimental results. The experiment involves testing documents against a previously created training set. The results give quantitative values for the comparable parameters and are hence helpful in the choice of a classification algorithm.
Categories and Subject Descriptors
I.5 [Pattern Recognition], I.5.2 [Design Methodology]:
Classifier design and evaluation, Feature evaluation and selection,
I.5.4 [Applications]: Text Processing
General Terms
Algorithms, Design, Experimentation.
Keywords
Documents, Dimensionality Reduction, Text Classification,
Classifiers
1. INTRODUCTION
In the last decade, the volume of textual information in electronic
format has increased enormously with the advent of many new
sources of information like WWW, emails, newsgroup messages,
Internet news feed, digital libraries etc. With such amount of
electronic text documents available, users started to feel the need
of an automated system to profitably search and manage these
huge repositories of information. Millions of pages available on
the web, hundreds of emails, updated news and all other text
resources on the internet had to be categorized. The information
also had to be easily organized in a way to allow simple search
and navigation.
It is obvious that handling such a huge collection of data manually
is impractical. That’s where various automated text classification
methods come into picture. Text classification algorithms are
available in good number. However, the existing algorithms have
to deal with many challenges.
The main problems inherent in large amounts of textual data are their organization and the procedure of labeling them. When the data has to be searched for a particular query, a structured organization of the data clearly helps the user and facilitates the retrieval of the target documents. It is difficult to capture high-level semantics and abstract concepts of natural languages just from a few keywords. Text data is usually characterized by its high dimensionality and huge size. These features make the process of text classification challenging: they place both efficiency and accuracy demands on classification systems.
It is essential that the raw data is converted to a standard form before any classification algorithm is applied. Extensive research has been carried out on various text representation and classification schemes. It is therefore important for researchers to have a complete knowledge of the existing representation schemes and classifiers in order to select the representation scheme and classifier which best suit their application.
This paper provides an overview of the well-known dimensionality reduction techniques Principal Component Analysis (PCA), Singular Value Decomposition (SVD) and Locality Preserving Indexing (LPI), and of the text classification algorithms K-Nearest Neighbors (KNN), the Rocchio algorithm and the Linear Least Square Fit (LLSF) algorithm, used to classify documents into pre-defined classes. In addition, we present a comparative study of the above-mentioned classifiers based on quantitative values.
2. RELATED WORK
In automatic text classification, it has been shown that the term is the best unit for text representation and classification [1]. Though a text document expresses a vast range of information, it unfortunately lacks the imposed structure of a traditional database. Therefore, unstructured data, particularly free running text, has to be transformed into structured data. To do this, many pre-processing techniques have been proposed in the literature [2, 3]. After converting unstructured data into structured data, we need an effective document representation model to build an efficient classification system. Bag of Words (BoW) is one of the basic methods of representing a document. BoW forms a vector representing a document using the frequency count of each term in the document. This method of document representation is called the Vector Space Model (VSM) [4]. Unfortunately, the BoW/VSM representation scheme has its own limitations. Some of them are: high dimensionality of the representation, loss of correlation with adjacent words, and loss of the semantic relationships that exist among the terms in a document [5]. To overcome these problems, term weighting methods are used to assign appropriate weights to the terms to improve the performance of text classification [6, 7]. Li and Jain in [8] used a binary representation for a given document. The major drawback of this model is that it results in a huge sparse matrix, which raises the problem of high dimensionality. Hotho et al. in [9] proposed an ontology representation for a document to keep the semantic relationship between the terms in a document. This ontology model preserves the domain knowledge of a term present in a document. However, automatic ontology construction is a difficult task due to the lack of a structured knowledge base. Cavnar (1994) in [10] used sequences of symbols (a byte, a character or a word) called N-Grams, extracted from a long string in a document. In an N-Gram scheme, it is very difficult to decide the number of grams to be considered for effective document representation.
Another approach in [11] uses multi-word terms as vector components to represent a document. However, this method requires sophisticated automatic term extraction algorithms to extract the terms automatically from a document. Wei et al. (2008) in [12]
proposed an approach called Latent Semantic Indexing (LSI)
which preserves the representative features for a document. The
LSI preserves the most representative features rather than
discriminating features. Thus to overcome this problem, Locality
Preserving Indexing (LPI) [13] was proposed for document
representation. The LPI discovers the local semantic structure of a
document. Unfortunately LPI is not efficient in time and memory
[14]. Choudhary and Bhattacharyya (2002) in [15] used Universal
Networking Language (UNL) to represent a document. The UNL
represents the document in the form of a graph with words as
nodes and relation between them as links. This method requires
the construction of a graph for every document and hence it is
unwieldy to use for an application where large numbers of
documents are present. Craven et al. (1998) in [16] developed the Web-KB project for constructing and maintaining large knowledge bases: an ontology is constructed manually, and a seed knowledge base comprising a set of labeled web pages is used to learn to instantiate knowledge-base objects and relations from the Web.
In [17], a new representation to model the web documents is
proposed. HTML tags are used to build the web document
representation. They used histogram representation for frequency
of terms in four sections of HTML codes: text, bold, links and
titles. Each symbolic object is built after the web collection is
analyzed and the most frequent terms are obtained. Isa et al.,
(2008) in [18] used the Bayes formula to vectorize a document
according to a probability distribution reflecting the probable
categories that the document may belong to. Using this
probability distribution as the vectors to represent the document,
the SVM is used to classify the documents. The same work has
been extended by Guru et al. (2010) [16] to represent a text document using interval-valued symbolic features. The probability distributions of terms in a document are used to form a symbolic representation, which is then used for training and classification. Dinesh et al. (2009) [19] proposed a new data structure called the status matrix, which preserves the sequence of term occurrence in a document. Classification of documents is done based on this new representation.
3. TEXT CLASSIFICATION
Consider a set of documents $D$, and let $C = \{c_1, c_2, c_3, \ldots, c_{|C|}\}$ be the set of predefined classes. Text classification aims at assigning each of the documents in a testing dataset to its appropriate class by testing it against the dataset $D$. Before applying any of the existing classification algorithms to the data, we must represent the data as a term-document matrix.
3.1 Text Representation
Stop words elimination
The most common words do not give any information for classifying text. Articles, adverbs, conjunctions and so on do not characterize a specific topic; their use is only functional for applying the syntactic rules of the language, they are uniformly distributed over the collection, and they can be safely removed. The simplest way to prune the vocabulary is to use a list of unnecessary words and remove them from the vocabulary. Many stop-word lists exist on the Internet for each language, and they include many different types of words: adjectives, pronouns, adverbs, common verbs and common nouns.
We have used a stop-word list during the construction of the term-document matrix to eliminate the presence of common words in the dictionary.
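For illustration, a minimal sketch of such stop-word pruning is given below; the word list and the tokenizer are placeholders for the example, not the exact list used in our experiments.

```python
# Minimal stop-word elimination sketch (illustrative word list only).
import re

STOP_WORDS = {"a", "an", "the", "and", "or", "but", "is", "are", "was",
              "of", "to", "in", "on", "for", "with", "that", "this", "it"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(text):
    """Return the tokens of the text with stop words filtered out."""
    return [tok for tok in tokenize(text) if tok not in STOP_WORDS]

print(remove_stop_words("The quick brown fox jumps over the lazy dog"))
# ['quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog']
```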
Term document matrix
As the system cannot understand the semantics of a document, it needs some representation of the document with which it can classify the document. The term-document matrix is one such representation. A term-document matrix is a mathematical matrix that describes the frequency of the terms that occur in a collection of documents. In the form used here, rows correspond to documents in the collection and columns correspond to terms. There are various schemes for determining the value that each entry in the matrix should take.
The term-document matrix of a class consists of all the training documents of the class and the terms or words that are selected as features for that particular class (which are stored in the dictionary), with the frequency of each term in each document.
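For illustration, the following sketch builds a small term-document matrix with plain Python and NumPy; the example documents and the tiny stop-word list are invented.

```python
# Build a small term-document matrix: rows are documents, columns are
# the dictionary terms, entries are raw term frequencies (illustrative).
from collections import Counter
import numpy as np

docs = ["the match was won in the last over",
        "stock prices fell after the earnings report",
        "the striker scored twice in the match"]

STOP_WORDS = {"the", "was", "in", "after"}

def tokens(text):
    return [t for t in text.lower().split() if t not in STOP_WORDS]

dictionary = sorted({t for d in docs for t in tokens(d)})
term_index = {t: j for j, t in enumerate(dictionary)}

tdm = np.zeros((len(docs), len(dictionary)), dtype=int)
for i, d in enumerate(docs):
    for term, count in Counter(tokens(d)).items():
        tdm[i, term_index[term]] = count

print(dictionary)
print(tdm)   # each row is the term-frequency vector of one document
```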
3.2 Dimensionality Reduction
It is essential for a document to be represented in a standard form
before applying any classification algorithm. The term document
matrix is the standard form that we have considered. As
mentioned earlier, text data is characterized by its high
dimensionality. It is necessary to mathematically reduce the data
so as to represent each document using only a few dimensions.
Such techniques are called Dimensionality Reduction techniques.
We have used three dimensionality reduction techniques in this
paper and they are explained in the subsequent subsections.
Principal Component Analysis (PCA)
PCA is based on the covariance matrix and is known to be the best linear dimensionality reduction technique in the mean-square-error sense. Its goal is to find a set of mutually orthogonal basis functions that capture the directions of maximum variance in the data, so that the pairwise Euclidean distances are best preserved. If the data is embedded in a linear subspace, PCA is guaranteed to discover the dimensionality of the subspace and produce a compact representation.
PCA reduces the dimension of the data by finding a few orthogonal linear combinations of the original variables with the largest variance [20]. We assume that we have $n$ observations, each being a realization of the $p$-dimensional random variable $x = (x_1, \ldots, x_p)^T$ with mean $\mu = E(x) = (\mu_1, \mu_2, \ldots, \mu_p)^T$ and $p \times p$ covariance matrix $\Sigma = E\{(x-\mu)(x-\mu)^T\}$. We denote such an observation matrix by $X$.
We find the first principal component as the linear combination with the largest variance. Denoting the first principal component by $p_1$, we have $p_1 = x^T w_1$, where the $p$-dimensional coefficient vector $w_1 = (w_{1,1}, \ldots, w_{1,p})^T$ solves
$$w_1 = \arg\max_{\|w\|=1} \mathrm{Var}\{x^T w\}.$$
The second principal component is the linear combination with the second largest variance that is orthogonal to the first principal component, and so on. The number of principal components is equal to the number of original variables. In most datasets, the first several principal components account for most of the variance, and the remaining principal components can be disregarded with minimal loss of information.
Variance depends on the scale of the variables, so it is customary to standardize all the variables to have a mean of 0 and a standard deviation of 1. After this standardization, all the original variables are in comparable units. Assuming the standardized data are collected in the matrix $X$, the $p \times p$ covariance matrix is
$$\Sigma = \frac{1}{n} X X^T,$$
and we can use the spectral decomposition theorem to write
$$\Sigma = U \Lambda U^T,$$
where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$ is the diagonal matrix of the ordered eigenvalues $\lambda_1 \geq \ldots \geq \lambda_p$ and $U$ is a $p \times p$ orthogonal matrix containing the eigenvectors. It can be shown [21] that the principal components are given by the $p$ rows of the $p \times n$ matrix $S = U^T X$, from which we see that the weight matrix $W$ is given by $U^T$. It can also be shown [21] that the subspace spanned by the first $k$ eigenvectors has the smallest mean square deviation from $X$ among all subspaces of dimension $k$.
Another property of the eigenvalue decomposition is that the total variation is equal to the sum of the eigenvalues of the covariance matrix,
$$\sum_{i=1}^{p} \mathrm{Var}(PC_i) = \sum_{i=1}^{p} \lambda_i = \mathrm{trace}(\Sigma),$$
and the fraction
$$\sum_{i=1}^{k} \lambda_i \,/\, \mathrm{trace}(\Sigma)$$
gives the cumulative proportion of the variance explained by the first $k$ principal components. By plotting the cumulative proportion as a function of $k$, one can select the appropriate number of principal components to keep in order to explain a given percentage of the overall variation.
There is another method to find the number of principal components: fix a threshold $\lambda_0$ and keep only the eigenvectors whose eigenvalues are greater than this threshold, thereby reducing the dimension of the data.
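For illustration, the sketch below performs this procedure on a document matrix: it standardizes the variables, eigendecomposes the covariance matrix, and keeps enough components to explain a chosen fraction of the variance. The 95% threshold and the random data are assumptions made for the example only.

```python
# PCA by eigendecomposition of the covariance matrix (illustrative sketch).
import numpy as np

def pca_reduce(X, var_fraction=0.95):
    """Project the rows of X onto the leading principal components.

    X is an (n_documents x n_terms) term-document matrix; the returned
    matrix has one row per document in the reduced space."""
    # Standardize each variable to zero mean and unit standard deviation.
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0                  # guard against constant columns
    Z = (X - mu) / sigma

    # Covariance matrix and its spectral decomposition.
    cov = np.cov(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # returned in ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Keep enough components to explain the requested variance fraction.
    cum = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(cum, var_fraction)) + 1
    return Z @ eigvecs[:, :k]

X = np.random.default_rng(0).poisson(1.0, size=(20, 50)).astype(float)
print(pca_reduce(X).shape)                   # (20, k) with k <= 50
```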
Singular Value Decomposition
Singular Value Decomposition (SVD) is a method of transforming correlated variables into a set of uncorrelated ones that better represent the relations between the original data set items [22]. In other words, SVD is a method for data reduction that finds the best approximation of the original data points using fewer dimensions.
A regression line is the line that minimizes the distance between every original data point and the line. Here, a first regression line running through the data points is drawn. Then a perpendicular is dropped from each original data point onto the regression line, and the intersection point is taken as the approximation of that data point. This gives a reduced representation of the original data points that captures the variation among them as much as possible.
A second regression line, perpendicular to the first, is then drawn. This line represents as much as possible of the variation along the second dimension of the original data set. It contributes less to approximating the original data because it corresponds to a dimension exhibiting less variation. These regression lines can be used to generate a set of uncorrelated data points that reveal subgroupings in the original data. SVD neglects variations below a specific threshold and massively reduces the data while still preserving the main relationships of interest.
SVD takes a rectangular matrix $A$ and breaks it into the product of three matrices: an orthogonal matrix $U$, a diagonal matrix $S$, and the transpose of an orthogonal matrix $V$. The SVD theorem states:
$$A_{m \times n} = U_{m \times m}\, S_{m \times n}\, V^{T}_{n \times n},$$
where $U^T U = I$ and $V^T V = I$, $I$ being the identity matrix. The eigenvectors of $A^T A$ make up the columns of $V$, and the eigenvectors of $A A^T$ make up the columns of $U$. $S$ is a diagonal matrix containing the square roots of the eigenvalues of $A^T A$ (equivalently, of $A A^T$), arranged in descending order.
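For illustration, the following sketch obtains a rank-k document representation from the SVD of a small matrix; the matrix and the choice k = 2 are arbitrary examples.

```python
# Rank-k approximation of a term-document matrix via SVD (illustrative).
import numpy as np

def svd_reduce(A, k):
    """Return the k-dimensional document representation U_k * S_k.

    A is an (m x n) matrix; U, S, Vt satisfy A = U @ diag(S) @ Vt with
    singular values in descending order."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] * S[:k]          # documents projected onto k directions

A = np.array([[2., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 0., 3., 1.],
              [0., 2., 0., 1.]])
docs_2d = svd_reduce(A, k=2)
print(docs_2d.shape)                 # (4, 2)
```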
Locality Preserving Indexing
Locality Preserving Indexing (LPI) is used for document indexing [23]. Every document is represented by a vector of lower dimension than the original document vector. The local structure of the document space and the semantic structure of the documents are studied to obtain this concise document representation.
Given a document set $D$ consisting of documents $d_1, d_2, \ldots, d_m$, the goal is to find a lower-dimensional representation $y_i$ of each $d_i$ such that the norm $\|y_i - y_j\|$ reflects the semantic relation between $d_i$ and $d_j$.
Generally, the dimension of the document vectors ($n$) is much higher than the number of documents ($m$), and when $n$ is large the computational complexity of the eigen-problem is high. To overcome this problem, the documents are first projected onto the PCA subspace. The resulting matrix is non-singular and of lower dimension. The algorithmic procedure for LPI is stated in [23].
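For illustration, the sketch below follows the general LPI recipe of [23] (PCA projection, a nearest-neighbour affinity graph, and a generalized eigen-problem on the graph Laplacian). The neighbourhood size, the cosine weighting, the regularization term and the SciPy dependency are our own assumptions for the demo, not the exact settings of the original algorithm or of our experiments.

```python
# Illustrative sketch of Locality Preserving Indexing (after He et al. [23]).
import numpy as np
from scipy.linalg import eigh

def lpi_reduce(X, dim=2, n_neighbors=3):
    """X: (n_docs x n_terms) matrix. Returns an (n_docs x dim) embedding."""
    # Step 1: project onto the PCA subspace so the matrices below are
    # non-singular (documents become columns from here on).
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = (U * S).T                      # (p x n_docs) PCA coordinates

    # Step 2: k-nearest-neighbour adjacency graph with cosine weights.
    n = Z.shape[1]
    norms = np.linalg.norm(Z, axis=0) + 1e-12
    C = (Z / norms).T @ (Z / norms)    # cosine similarity between documents
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(C[i])[::-1][1:n_neighbors + 1]
        W[i, nbrs] = np.clip(C[i, nbrs], 0.0, None)
    W = np.maximum(W, W.T)             # symmetrize

    # Step 3: graph Laplacian and generalized eigen-problem
    #         Z L Z^T a = lambda Z D Z^T a, keeping the smallest eigenvalues.
    D = np.diag(W.sum(axis=1))
    L = D - W
    A_mat = Z @ L @ Z.T
    B_mat = Z @ D @ Z.T + 1e-9 * np.eye(Z.shape[0])   # regularize
    vals, vecs = eigh(A_mat, B_mat)
    return (vecs[:, :dim].T @ Z).T     # (n_docs x dim) embedding

X = np.random.default_rng(1).poisson(1.0, size=(12, 30)).astype(float)
print(lpi_reduce(X).shape)             # (12, 2)
```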
3.3 Classifiers
Given a set of classes, the task of a classifier is to find to which class a text document belongs. Well-represented data is the input to the classifier. Various classifiers are well established in the present day; three of them are considered for the task of classification in this paper.
K-Nearest Neighbor Classification (KNN)
KNN assigns a category to a document based on the class(es) of its K nearest neighbors in the training data, as defined by a similarity function. A neighbor is considered nearest if it has the smallest Euclidean distance in the feature space. An object is classified by a majority vote of its neighbors: the object is assigned to the class that is most common amongst its K nearest neighbors. If K = 1, the object is simply assigned to the class of its nearest neighbor. Its large computational requirement is a disadvantage, because classifying an object requires calculating its distance to every object in the learning set.
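For illustration, a minimal KNN sketch over reduced document vectors is shown below; the Euclidean distance and majority vote follow the description above, while the data and K = 3 are invented.

```python
# K-nearest-neighbour classification with Euclidean distance (sketch).
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Assign x to the majority class among its k nearest training vectors."""
    dists = np.linalg.norm(X_train - x, axis=1)  # distance to every training doc
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

X_train = np.array([[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]])
y_train = ["sports", "sports", "business", "business"]
print(knn_predict(X_train, y_train, np.array([0.15, 0.85])))   # 'sports'
```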
Rocchio Algorithm
The Rocchio algorithm adapts the relevance feedback method to the classification setting [24]. The feedback approach was developed using the vector space model. The algorithm is based on the assumption that most users have a general conception of which documents should be denoted as relevant or non-relevant; therefore, the user's search query is revised to include an arbitrary percentage of relevant and non-relevant documents.
Here, each document is treated as a normalized vector (unit length). For each class $C$, we compute the centroid $\mu(C)$ of the labeled documents in $C$. The centroid of a class $C$ is the vector average, or center of mass, of its members:
$$\mu(C) = \frac{1}{|D_C|} \sum_{d \in D_C} v(d),$$
where $D_C$ is the set of documents $d_1, d_2, \ldots$ belonging to class $C$ and $v(d)$ denotes the normalized vector of document $d$. Now, for a test document $d$, we find the closest centroid and put $d$ into the corresponding class. The boundary between two classes in Rocchio classification is the set of points with equal distance from the two centroids.
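For illustration, the sketch below computes class centroids from normalized vectors and assigns a test document to the nearest centroid; the two-dimensional document vectors are invented.

```python
# Rocchio / nearest-centroid classification sketch.
import numpy as np

def normalize(v):
    return v / (np.linalg.norm(v) + 1e-12)

def train_centroids(docs, labels):
    """Centroid of each class = mean of the normalized member vectors."""
    centroids = {}
    for c in set(labels):
        members = [normalize(d) for d, l in zip(docs, labels) if l == c]
        centroids[c] = np.mean(members, axis=0)
    return centroids

def rocchio_predict(centroids, d):
    """Assign d to the class whose centroid is closest."""
    v = normalize(d)
    return min(centroids, key=lambda c: np.linalg.norm(v - centroids[c]))

docs = [np.array([3.0, 1.0]), np.array([4.0, 0.5]),
        np.array([0.5, 4.0]), np.array([1.0, 3.0])]
labels = ["politics", "politics", "sports", "sports"]
cents = train_centroids(docs, labels)
print(rocchio_predict(cents, np.array([2.5, 0.8])))   # 'politics'
```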
Linear Least Square Fit (LLSF)
In LLSF, each document $d_j$ has two vectors associated with it: an input vector $I(d_j)$ of $|T_j|$ weighted terms, where $T_j$ is the set of terms in document $d_j$, and an output vector $O(d_j)$ of $|C|$ weights representing the categories [25]. Text classification in LLSF therefore means: given a test document $d_j$ and its input vector $I(d_j)$, the task is to find the output vector $O(d_j)$.
A linear classifier [26] [27] is used for text classification; for each class it computes a categorization status value that corresponds to the dot product of the document vector of $d_j$ and the weight vector of that class.
There are two methods to learn linear classifiers: batch methods, in which a classifier is built by analyzing the training set all at once, and on-line methods, in which a classifier is built soon after examining the first training document and is refined incrementally as new documents are examined. The classifier [28] examines the closeness of a test document to the centroid of the positive training examples and its distance from the centroid of the negative training examples. It finds the hyperplanes that approximately separate a class of document vectors from its complement.
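For illustration, the sketch below learns the LLSF mapping as an ordinary least-squares fit from input (term) vectors to output (category) vectors; the tiny matrices and one-hot output vectors are invented for the example.

```python
# Linear Least Squares Fit: learn a term-to-category mapping (sketch).
import numpy as np

# Rows of I_train are input (term-weight) vectors, rows of O_train are
# output vectors with one weight per category (here one-hot labels).
I_train = np.array([[2., 0., 1., 0.],
                    [1., 1., 0., 0.],
                    [0., 0., 2., 1.],
                    [0., 1., 0., 2.]])
O_train = np.array([[1., 0.],     # category 0
                    [1., 0.],
                    [0., 1.],     # category 1
                    [0., 1.]])

# Solve min_W || I_train @ W - O_train ||_F by least squares.
W, *_ = np.linalg.lstsq(I_train, O_train, rcond=None)

def llsf_predict(x):
    """Output vector of a test document; pick the highest-scoring category."""
    scores = x @ W
    return int(np.argmax(scores)), scores

print(llsf_predict(np.array([1.5, 0.2, 0.8, 0.0])))   # category 0 expected
```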
4. EXPERIMENTATION AND RESULTS
4.1 Datasets
The classification methods discussed above are applied on two datasets. The first consists of 5 categories (Entertainment, Sports, Image Processing, Business and Politics), each containing 100 documents. We call this dataset Dataset1.
A Usenet newsgroup is a repository created, usually within the Usenet system, for messages posted by many users in different locations. The articles that users post to Usenet are organized into topical categories called newsgroups, which are themselves logically organized into hierarchies of subjects. For example, sci.math and sci.physics are within the sci hierarchy for science. In most newsgroups, the majority of the articles are responses to some other article. Most newsgroups can be categorized into major subject areas such as news, rec (recreation), soc (society), sci (science), comp (computers), and so forth (there are many more). Typically, a newsgroup is focused on a particular topic of interest. Some newsgroups allow the posting of messages on a wide variety of themes, regarding anything that a member chooses to discuss as on-topic, while others keep more strictly to their particular subject and treat other postings as off-topic. There are currently over 100,000 Usenet newsgroups, but only 20,000 are active.
For our experimentation, we used the above newsgroups and created our own dataset. Since we downloaded all the newsgroups from the Google Usenet newsgroups, we call this dataset the Google Newsgroup Dataset (Dataset2 in the tables below). The main objective in creating this dataset was to have a large discriminating content between the documents of one class and those of the other classes. Hence, we picked documents from the Google newsgroups which have maximum discriminating content from class to class.
Originally, some of the documents contained free running text and HTML tags. To maintain uniformity among the documents, we converted all HTML-tagged content into .txt format. Thus, the dataset contains free running texts stored in .txt files. As a preprocessing step, we employed stop-word elimination for all the documents.
4.2 Experimental Settings
Text classification requires a good collection of test data. Huge manual effort is required to collect a sufficiently large body of text and ultimately produce it in a machine-readable format. When classification is carried out on the testing set, a document is counted as correctly classified if it is assigned to its one correct class from the training set, and as an error otherwise. The machine used to carry out the experiments was configured with 2 GB of RAM and an i5 processor, running Windows 7.
Experiment 1
The 100 documents in each class are divided into two sets of 50
documents each. The first set of 50 documents in each class
comprises the training set. The remaining set comprises the
testing set. The reduced lower dimension matrix is stored in the
knowledge base and it is further used for classification.
Experiment 2
The 100 documents in each class are divided into two sets, one
containing 60 documents and the other with 40 documents. The
set of all 60 documents in each class comprises the training set.
The remaining set of 40 documents each comprises the testing set.
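For concreteness, the sketch below wires one such run together end to end: a per-class 50:50 split, SVD-based dimensionality reduction, and a 1-nearest-neighbour classifier, all on synthetic term counts. The synthetic data, the number of retained dimensions and the particular DRT/classifier pair are illustrative assumptions, not a reproduction of the reported setup.

```python
# End-to-end sketch of one experimental run: per-class 50:50 split,
# SVD reduction, nearest-neighbour classification (synthetic data).
import numpy as np

rng = np.random.default_rng(42)
n_classes, docs_per_class, n_terms, k_dims = 5, 100, 200, 20

# Synthetic term-count documents: each class has its own term profile.
profiles = rng.gamma(2.0, 1.0, size=(n_classes, n_terms))
X = np.vstack([rng.poisson(profiles[c], size=(docs_per_class, n_terms))
               for c in range(n_classes)]).astype(float)
y = np.repeat(np.arange(n_classes), docs_per_class)

# Per-class 50:50 split into training and testing sets.
train_idx = np.hstack([np.arange(c * docs_per_class, c * docs_per_class + 50)
                       for c in range(n_classes)])
test_idx = np.setdiff1d(np.arange(len(y)), train_idx)

# Dimensionality reduction: project both sets onto the top SVD directions
# learned from the training documents only.
U, S, Vt = np.linalg.svd(X[train_idx], full_matrices=False)
P = Vt[:k_dims].T
X_train, X_test = X[train_idx] @ P, X[test_idx] @ P

# 1-NN classification and accuracy.
preds = []
for x in X_test:
    nearest = np.argmin(np.linalg.norm(X_train - x, axis=1))
    preds.append(y[train_idx][nearest])
accuracy = np.mean(np.array(preds) == y[test_idx])
print(f"accuracy: {accuracy:.2%}")
```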
4.3 Results
The results of each experiment, obtained using the two datasets, are tabulated in Table 1, Table 2, Table 3 and Table 4. The classification methods are compared quantitatively using these results.
Table 1. Classification accuracies obtained from experiment 1 using Dataset1.

  Dataset    Training vs. Testing   DRT   Classifier   Accuracy (%)
  Dataset1   50:50                  PCA   KNN             68.50
  Dataset1   50:50                  PCA   Rocchio         66.35
  Dataset1   50:50                  PCA   LLSF            59.85
  Dataset1   50:50                  SVD   KNN             71.50
  Dataset1   50:50                  SVD   Rocchio         66.90
  Dataset1   50:50                  SVD   LLSF            54.85
  Dataset1   50:50                  LPI   KNN             76.35
  Dataset1   50:50                  LPI   Rocchio         74.50
  Dataset1   50:50                  LPI   LLSF            68.35
The following conclusions are drawn after tabulating the results from experiment 1: if the DRT used is LPI and the classifier employed is KNN, we obtain the best result for Dataset1. When the LLSF classifier is used along with SVD, the classification accuracy is the lowest of all.
The following conclusions are drawn after tabulating the results from experiment 2: if the DRT used is LPI and the classifier employed is KNN, we obtain the best result for Dataset1. When any DRT (PCA, SVD or LPI) is paired with the LLSF classifier, we achieve poor results. The Rocchio classifier gives better results than the LLSF classifier, but poorer results than the KNN classifier. However, the accuracy for each combination of DRT and classification method is higher than in experiment 1.
Table 2. Classification accuracies obtained from experiment 2 using Dataset1.

  Dataset    Training vs. Testing   DRT   Classifier   Accuracy (%)
  Dataset1   60:40                  PCA   KNN             72.50
  Dataset1   60:40                  PCA   Rocchio         69.85
  Dataset1   60:40                  PCA   LLSF            60.10
  Dataset1   60:40                  SVD   KNN             74.35
  Dataset1   60:40                  SVD   Rocchio         68.10
  Dataset1   60:40                  SVD   LLSF            56.40
  Dataset1   60:40                  LPI   KNN             80.15
  Dataset1   60:40                  LPI   Rocchio         77.40
  Dataset1   60:40                  LPI   LLSF            69.15
The following conclusions are drawn after tabulating the results from experiment 1: if the DRT used is LPI and the classifier employed is KNN, we obtain the best result for Dataset2. When any DRT (PCA, SVD or LPI) is paired with the LLSF classifier, we get poor results. The Rocchio classifier gives better results than the LLSF classifier, but poorer results than the KNN classifier.
Table 3. Classification accuracies obtained from experiment 1 using Dataset2.

  Dataset    Training vs. Testing   DRT   Classifier   Accuracy (%)
  Dataset2   50:50                  PCA   KNN             72.35
  Dataset2   50:50                  PCA   Rocchio         69.50
  Dataset2   50:50                  PCA   LLSF            61.45
  Dataset2   50:50                  SVD   KNN             74.95
  Dataset2   50:50                  SVD   Rocchio         68.45
  Dataset2   50:50                  SVD   LLSF            57.15
  Dataset2   50:50                  LPI   KNN             76.80
  Dataset2   50:50                  LPI   Rocchio         69.25
  Dataset2   50:50                  LPI   LLSF            63.00
The following conclusions are drawn after tabulating the results from experiment 2: if the DRT used is LPI and the classifier employed is KNN, we obtain the best result for Dataset2. When any DRT (PCA, SVD or LPI) is paired with the LLSF classifier, we get poor results. The Rocchio classifier gives better results than the LLSF classifier, but poorer results than the KNN classifier.
Table 4. Classification accuracies obtained from experiment 2 using Dataset2.

  Dataset    Training vs. Testing   DRT   Classifier   Accuracy (%)
  Dataset2   60:40                  PCA   KNN             74.90
  Dataset2   60:40                  PCA   Rocchio         70.00
  Dataset2   60:40                  PCA   LLSF            63.85
  Dataset2   60:40                  SVD   KNN             76.55
  Dataset2   60:40                  SVD   Rocchio         69.15
  Dataset2   60:40                  SVD   LLSF            58.50
  Dataset2   60:40                  LPI   KNN             81.55
  Dataset2   60:40                  LPI   Rocchio         75.90
  Dataset2   60:40                  LPI   LLSF            67.35
The performance of any classifier is greatly dependent on the data being classified and on the dataset used to measure its performance. However, based on the datasets considered and the results obtained above, it can be said that K-Nearest Neighbor serves classification better than the other two classifiers, while the Linear Least Square Fit algorithm does a comparatively poor job. It can also be seen that when the dimensionality is reduced using LPI and the KNN classifier is employed, we obtain the best accuracy.
5. CONCLUSION
Text classification has evolved into a research field which has delivered efficient, effective, and overall workable solutions that have been used to tackle a wide variety of real-world application domains. We have applied three different classification methods (KNN, Rocchio and LLSF) to the problem of document classification. These methods were evaluated individually when used with each of the dimensionality reduction techniques (PCA, SVD, LPI).
All three classifiers perform reasonably well on our datasets. From the point of view of accuracy, the KNN classifier works better than the Rocchio and LLSF methods. In every case, the classifiers work most efficiently when the LPI dimensionality reduction technique is employed.
Though KNN is comparatively better than the other two classifiers, it is a lazy learning method. Support Vector Machines can be used to overcome its limitations.
Hence, in future work we intend to explore other representation methods, dimensionality reduction techniques and classifiers to achieve better classification accuracy. Further, we would also make use of various standard datasets to evaluate different classifiers.
6. REFERENCES
[1] Song, F., Liu, S., and Yang, J. 2005. A comparative study on text representation schemes in text categorization. Pattern Analysis and Applications, Vol. 8, pp. 199-209.
[2] Porter, M. F. 1980. An algorithm for suffix stripping. Program, Vol. 14 (3), pp. 130-137.
[3] Hotho, A., Nürnberger, A., and Paaß, G. 2005. A Brief Survey of Text Mining. Journal for Computational Linguistics and Language Technology, Vol. 20, pp. 19-62.
[4] Salton, G., Wong, A., and Yang, C. S. 1975. A Vector Space Model for Automatic Indexing. Communications of the ACM, Vol. 18, pp. 613-620.
[5] Bernotas, M., Karklius, K., Laurutis, R., and Slotkiene, A. 2007. The peculiarities of the text document representation, using ontology and tagging-based clustering technique. Journal of Information Technology and Control, Vol. 36, pp. 217-220.
[6] Lan, M., Tan, C. L., Su, J., and Lu, Y. 2009. Supervised and Traditional Term Weighting Methods for Automatic Text Categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 31 (4), pp. 721-735.
[7] Altınçay, H., and Erenel, Z. 2010. Analytical evaluation of term weighting schemes for text categorization. Pattern Recognition Letters, Vol. 31 (11), pp. 1310-1323.
[8] Li, Y. H., and Jain, A. K. 1998. Classification of Text Documents. The Computer Journal, Vol. 41, pp. 537-546.
[9] Hotho, A., Maedche, A., and Staab, S. 2001. Ontology-based text clustering. In Proceedings of the International Joint Conference on Artificial Intelligence, pp. 30-37.
[10] Cavnar, W. B. 1994. Using an N-Gram based document representation with a vector processing retrieval model. In Proceedings of the Third Text REtrieval Conference (TREC-3), pp. 269-278.
[11] Milios, E., Zhang, Y., He, B., and Dong, L. 2003. Automatic term extraction and document similarity in special text corpora. In Proceedings of the Sixth Conference of the Pacific Association for Computational Linguistics (PACLing'03), pp. 275-284.
[12] Wei, C. P., Yang, C. C., and Lin, C. M. 2008. A Latent Semantic Indexing-based approach to multilingual document clustering. Decision Support Systems, Vol. 45, pp. 606-620.
[13] He, X., Cai, D., Liu, H., and Ma, W. Y. 2004. Locality Preserving Indexing for document representation. In Proceedings of SIGIR, pp. 96-103.
[14] Cai, D., He, X., Zhang, W. V., and Han, J. 2007. Regularized Locality Preserving Indexing via Spectral Regression. In Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM'07), pp. 741-750.
[15] Choudhary, B., and Bhattacharyya, P. 2002. Text clustering using Universal Networking Language representation. In Proceedings of the Eleventh International World Wide Web Conference.
[16] Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T. M., Nigam, K., and Slattery, S. 1998. Learning to Extract Symbolic Knowledge from the World Wide Web. In Proceedings of AAAI/IAAI, pp. 509-516.
[17] Esteban, M., and Rodríguez, O. R. 2006. A Symbolic Representation for Distributed Web Document Clustering. In Proceedings of the Fourth Latin American Web Congress, Cholula, Mexico.
[18] Isa, D., Lee, L. H., Kallimani, V. P., and Rajkumar, R. 2008. Text document preprocessing with the Bayes formula for classification using the support vector machine. IEEE Transactions on Knowledge and Data Engineering, Vol. 20, pp. 23-31.
[19] Dinesh, R., Harish, B. S., Guru, D. S., and Manjunath, S. 2009. Concept of Status Matrix in Text Classification. In Proceedings of the Indian International Conference on Artificial Intelligence, Tumkur, India, pp. 2071-2079.
[20] Fodor, I. K. 2002. A Survey of Dimension Reduction Techniques.
[21] Mardia, K. V., Kent, J. T., and Bibby, J. M. 1995. Multivariate Analysis. Probability and Mathematical Statistics. Academic Press.
[22] Baker, K. 2005. Singular Value Decomposition Tutorial.
[23] He, X., Cai, D., Liu, H., and Ma, W. Y. Locality Preserving Indexing for Document Representation.
[24] Rocchio, J. J. 1971. Relevance Feedback in Information Retrieval. Prentice-Hall Inc.
[25] Harish, B. S., Guru, D. S., and Manjunath, S. 2010. Representation and Classification of Text Documents: A Brief Review. IJCA Special Issue on "Recent Trends in Image Processing and Pattern Recognition" (RTIPPR).
[26] Sebastiani, F. 2002. Machine learning in automated text categorization. ACM Computing Surveys, Vol. 34, pp. 1-47.
[27] Lewis, D. D., Schapire, R. E., Callan, J. P., and Papka, R. 1996. Training algorithms for linear text classifiers. In Proceedings of the Nineteenth International Conference on Research and Development in Information Retrieval (SIGIR'96), pp. 289-297.
[28] Joachims, T. 1997. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In Proceedings of the Fourteenth International Conference on Machine Learning, pp. 143-151.