Bilateral Russian-Indian Scientific Workshop on Emerging Applications of Computer Vision, Russia, Moscow, Nov. 1 – 5, 2011
Symbolic Similarity and Symbolic Feature Selection for Text
Classification
B S Harish (a)*, D S Guru (b), S Manjunath (b) and Bapu B Kiranagi (c)
(a) Department of Information Science & Engineering, S J College of Engineering, Mysore, India;
b Department of Studies in Computer Science, University of Mysore, Mysore, India;
c HCL Technologies, Bangalore, India.
bsharish@ymail.com, dsg@compsci.uni-mysore.ac.in, manju_uom@yahoo.co.in, bapukb@hcl.com
ABSTRACT
In this paper, a simple and efficient symbolic text classification method is presented. A text document is represented by the use of interval-valued symbolic features. Subsequently, a new feature selection method based on a symbolic similarity measure is also presented. The new feature selection method reduces the features in the proximity space for effective text classification: it retains the best features for effective text representation and simultaneously reduces the time taken to classify a given document. To corroborate the efficacy of the proposed method, experiments have been conducted on four different datasets. Experimental results reveal that the proposed method gives better results than state-of-the-art techniques. In addition, since it is based on a simple matching scheme, it achieves classification within negligible time and thus appears to be more effective for classification.
Keywords: Symbolic features, Similarity measure, Feature selection, Document representation, Text classification
1. INTRODUCTION
Text classification is one of the important research issues in the field of text mining, where documents are classified with supervised knowledge: based on the likelihood of the training set, a new document is classified. The major challenges and difficulties that arise in text classification are: high dimensionality (thousands of features); variable length, content and quality of the documents; sparse distribution of terms in documents; loss of correlation between adjacent words; and understanding the complex semantics of terms in a document [1]. To tackle these problems, a number of methods have been reported in the literature for effective classification of text documents. Many representation schemes, such as binary representation [2], ontology [3], N-grams [4], multiword terms as vectors [5] and UNL [6], have been proposed for effective text classification. Also, in [7], [8] new representation models for web documents are proposed. Recently, in [9] the Bayes formula is used to vectorize a document according to a probability distribution reflecting the probable categories to which the document may belong.
Although many representation models for text documents are available in the literature, the Vector Space Model (VSM) is one of the most popular and widely used models for document representation [10]. Unfortunately, the major limitation of the VSM is the loss of the correlation and context of each term, which are very important for understanding a document. To deal with these problems, Latent Semantic Indexing (LSI) was proposed in [11]. LSI is optimal in the sense of preserving the global geometric structure of a document space; however, it might not be optimal in the sense of discrimination [11]. Thus, to discover the discriminating structure of a document space, Locality Preserving Indexing (LPI) was proposed in [12]. An assumption behind LPI is that close inputs should have similar documents. However, the computational cost of LPI is very high, making it almost infeasible to apply LPI to very large datasets. Hence, to reduce the computational complexity of LPI, Regularized Locality Preserving Indexing (RLPI) has been proposed in [13]. RLPI is significantly faster and obtains similar or better results compared to LPI. This makes RLPI an efficient and effective data preprocessing method for large-scale text classification [13].
However, RLPI fails to preserve the intraclass variations among the documents of different classes. Also, in the case of RLPI, we need to select the number of dimensions m to be optimal; unfortunately, we cannot guarantee that the selected m value is optimal. Hence, in this paper we give a symbolic representation for a given document to capture the intraclass variation, and subsequently we employ the symbolic feature selection scheme of [14] to select the best few features d for storing in the knowledge base. In addition, we also present the corresponding classification method. To the best of our knowledge, no work has been reported in the literature which uses a symbolic representation and a symbolic feature selection method for text documents. Recent developments in the area of symbolic data analysis have shown that real-life objects can be better described by symbolic data, which are extensions of classical crisp data [15]. These issues motivated us to use symbolic data rather than conventional crisp data to represent a document.
The rest of the paper is organized as follows: The working principle of the proposed method is presented in section 2.
Details of the dataset used, experimental settings and the obtained results are presented in section 3. Section 4 concludes
with future directions.
2. PROPOSED METHOD
In this section, we propose a novel method of representing a text document using symbolic data. Initially, documents are represented using the regularized locality preserving indexing approach, as it has the property of preserving the local structure of a document. Further, to capture intraclass variations across documents of the same class, an interval-valued feature vector representation is formulated to represent each class by feature assimilation. Subsequently, we employ the symbolic feature selection scheme to select the best few features for storage. In addition, we also present the corresponding classification method.
2.1 Symbolic Representation
Let there be K classes, each containing N documents, where each document is described by an n-dimensional term frequency vector. The term-document matrix [16], say X, of size (K*N) x n is constructed such that each row represents a document of a class and each column represents a term. We recommend employing RLPI on X to obtain the transformed term-document matrix Y of size (K*N) x m, where m is the number of features chosen out of n, which is not necessarily optimal.
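The construction of the raw term-document matrix X can be sketched as below. This is a minimal illustration only; `term_document_matrix` is a hypothetical helper name, and the RLPI projection of [13] that maps X to Y is a separate step not reproduced here.

```python
from collections import Counter

def term_document_matrix(docs):
    """Build the raw term-document matrix X: one row per document,
    one column per term, entries are term frequencies.  RLPI [13]
    would subsequently project X to the (K*N) x m matrix Y."""
    vocab = sorted({t for d in docs for t in d.split()})
    index = {t: i for i, t in enumerate(vocab)}
    X = [[0] * len(vocab) for _ in docs]
    for row, doc in enumerate(docs):
        for term, count in Counter(doc.split()).items():
            X[row][index[term]] = count
    return vocab, X

docs = ["the car engine", "the boat the sail"]
vocab, X = term_document_matrix(docs)
# vocab: ['boat', 'car', 'engine', 'sail', 'the']
```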
In order to preserve the intraclass variation in each feature of every document of the i-th class, we propose in this work to aggregate the term frequency vectors of all documents of the i-th class in the form of an interval-valued feature vector. Let [P_j1, P_j2, ..., P_jm] be the transformed term frequency vector obtained by RLPI for the j-th document of the i-th class. The reference interval-valued feature vector of dimension m representing all the documents of the i-th class on aggregation is computed and stored in the knowledge base for classification purposes.

Let

    R_i = { [P_i1^-, P_i1^+], [P_i2^-, P_i2^+], ..., [P_im^-, P_im^+] }

be the computed reference term frequency vector of the i-th class, where

    P_il^- = mean(P_jl, j = 1,...,N) - std(P_jl, j = 1,...,N)   and
    P_il^+ = mean(P_jl, j = 1,...,N) + std(P_jl, j = 1,...,N).

Similarly, we compute interval-valued reference term frequency vectors for all the classes and store them in the knowledge base for the subsequent classification task. It should be noticed that the reference term frequency vectors are interval valued and of dimension m, which is arbitrarily fixed (not necessarily optimal). Hence the class representatives are stored in a matrix F of size K x m. In the subsequent stage we propose to employ a symbolic feature selection scheme to select the best few d features out of the m features.
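The interval-valued class representative described above can be sketched as follows; `class_interval_representative` is an illustrative name, not code from the paper, and the toy data is invented for demonstration.

```python
import numpy as np

def class_interval_representative(Y_class):
    """Aggregate the RLPI-transformed vectors of one class (rows of
    Y_class, shape (N, m)) into the interval-valued reference vector
    R_i: feature l becomes the interval
    [mean(P_jl) - std(P_jl), mean(P_jl) + std(P_jl)]."""
    mean = Y_class.mean(axis=0)
    std = Y_class.std(axis=0)
    return np.stack([mean - std, mean + std], axis=1)  # (m, 2): lower, upper

# Toy class with N = 3 documents and m = 2 features.
Y_class = np.array([[1.0, 4.0],
                    [2.0, 4.0],
                    [3.0, 4.0]])
R = class_interval_representative(Y_class)
# Feature 2 is constant across documents, so its interval
# degenerates to [4.0, 4.0].
```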
2.2 Symbolic Feature Selection
Feature selection is used to identify the useful features and also to remove the redundant information. Basically, feature
selection methods fall into two broad categories, the filter model and the wrapper model. The wrapper model requires
one predetermined learning algorithm in feature selection and uses its performance to evaluate and determine which
features are selected. And the filter model relies on general characteristics of the training data to select some features
without involving any specific learning algorithm. There is evidence that wrapper methods often perform better on small
scale problems, but on large scale problems, such as text classification, wrapper methods are shown to be impractical
because of its high computational cost [19]. Hence in this paper we make use of the filter method to select the best
features.
In order to select the best features from the class representative matrix F, we need to study the correlation present among the individual features of each class. The features which have maximum correlation shall be selected as the best features to represent the classes. Since F is an interval matrix, we compute a proximity matrix of size K x K, with each element being of multivalued type and of dimension m, by computing the similarity among the features using the similarity function proposed in [15]. The similarity from class i to class j with respect to the l-th feature is given by

    S_ij^l = |I_il ∩ I_jl| / |I_jl|,    (1)

where I_il = [p_il^-, p_il^+], l = 1, 2, ..., m, are the interval-type features of class C_i and I_jl = [p_jl^-, p_jl^+], l = 1, 2, ..., m, are the interval-type features of class C_j.
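One plausible reading of the per-feature similarity, namely the length of the overlap between I_il and I_jl normalized by the length of I_jl, can be sketched as below. This reading is an assumption on our part (the exact measure is defined in [15]), and the sketch assumes I_jl has positive length.

```python
def interval_similarity(I_i, I_j):
    """Overlap of interval I_i with I_j, normalized by the length of
    I_j.  The measure is asymmetric, matching 'similarity from class
    i to class j'.  Assumes I_j is non-degenerate (positive length)."""
    lo = max(I_i[0], I_j[0])           # left end of the overlap
    hi = min(I_i[1], I_j[1])           # right end of the overlap
    overlap = max(0.0, hi - lo)        # zero when the intervals are disjoint
    return overlap / (I_j[1] - I_j[0])

s = interval_similarity((1.0, 4.0), (2.0, 6.0))  # overlap [2, 4] over length 4
```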
From the obtained proximity matrix, the matrix M of size K^2 x m is constructed by listing out each multivalued-type element one by one in the form of rows. The standard deviation of each column of the matrix M is computed, and also the average standard deviation over all m columns is computed.

Let TCorr_l be the total correlation of the l-th column with all other columns of the matrix M, and let AvgTCorr be the average of the total correlations obtained over all columns, i.e.,

    TCorr_l = Sum_{q=1, q≠l}^{m} Corr(l-th column, q-th column)    (2)

and

    AvgTCorr = ( Sum_{l=1}^{m} TCorr_l ) / m.    (3)

We are interested in those features which have high discriminating capability, and thus we recommend selecting those features for which TCorr_l is higher than the average correlation AvgTCorr.
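The filter step of Eqs. (2)-(3) can be sketched as follows. The function name is illustrative, the toy matrix is invented, and we assume the self-correlation term is excluded from TCorr_l ("all other columns").

```python
import numpy as np

def select_features_by_total_correlation(M):
    """Filter-style symbolic feature selection: for each column l of M
    (the K^2 x m listing of the proximity matrix), sum its correlation
    with every other column (Eq. 2) and keep the columns whose total
    exceeds the average total correlation (Eq. 3)."""
    C = np.corrcoef(M, rowvar=False)   # (m, m) column-wise correlations
    tcorr = C.sum(axis=0) - 1.0        # subtract the self-correlation term
    avg = tcorr.mean()
    return np.where(tcorr > avg)[0]    # indices of the selected features

# Toy M with 4 rows and m = 4 columns; columns 0 and 1 are nearly
# identical, so they carry the highest total correlation.
M = np.array([[1.0, 1.1, 5.0, 0.0],
              [2.0, 2.1, 1.0, 3.0],
              [3.0, 3.1, 4.0, 1.0],
              [4.0, 3.9, 2.0, 2.0]])
selected = select_features_by_total_correlation(M)
```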
2.3 Document Classification
Given a test document
q
D
represented by term document feature vector, it is projected onto lower dimension using
RLPI, to get the transformed term frequency feature vector of size
m
which are of type crisp. Among these
m
feature
values only the
d
values are selected from feature selection method. Then these
d
values are used to compare with each
class representative stored in the knowledge base. Let
1 2 3
[ , , ..., ]
tq t t t td
F f f f f
be a
d
dimensional feature vector
describing a test document. Let
j
R
be the interval symbolic feature vector of the
th
j
class. Now each
th
l
feature value of
the test document is compared with the corresponding interval
l
j
R
to examine belongingness between the test document
feature value and interval in
l
j
R
. We make use of Belongingness Count
c
B
as a measure of degree of belongingness for
the test document to decide its class label.
1,,
d
c tl jl jl
l
B C f f f
and (4)
1;
,, 0;
tl jl tl jl
tl jl jl
if f f and f f
C f f f Otherwise
(5)
The crisp value of a test document falling into its respective feature interval of the reference class contributes a value 1
towards
c
B
and there will be no contribution from other features which fall outside the interval. Similarly, we compute
the
c
B
value for all classes representing and of the representative which has highest
c
B
value will be assigned to the test
document as its label.
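The belongingness-count classification of Eqs. (4)-(5) can be sketched as below; function names and the toy intervals are illustrative, not from the paper.

```python
import numpy as np

def belongingness_count(f_test, R_class):
    """B_c of Eqs. (4)-(5): the number of test feature values that fall
    inside the corresponding class interval [f^-, f^+]."""
    lo, hi = R_class[:, 0], R_class[:, 1]
    return int(np.sum((lo <= f_test) & (f_test <= hi)))

def classify(f_test, class_representatives):
    """Assign the label of the class whose interval representative
    gives the highest belongingness count."""
    counts = [belongingness_count(f_test, R) for R in class_representatives]
    return int(np.argmax(counts))

# Two classes, d = 3 selected features; each row is [lower, upper].
R0 = np.array([[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]])
R1 = np.array([[2.0, 3.0], [2.0, 3.0], [0.0, 1.0]])
f = np.array([0.5, 0.2, 0.9])
label = classify(f, [R0, R1])  # all three values fall in class 0's intervals
```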
3. EXPERIMENTAL SETUP
3.1 Datasets
To test the efficacy of the proposed model, we have used the following four datasets. The first dataset consists of vehicle characteristics extracted from Wikipedia pages (Vehicles-Wikipedia) [9]. It contains four categories of vehicles with low degrees of inter-category similarity: Aircraft, Boats, Cars and Trains. All four categories are easily differentiated, and every category has a set of unique keywords. The second dataset is the standard 20 mini newsgroup dataset [17], which contains about 2000 documents evenly divided among 20 Usenet discussion groups; it is a subset of the 20 newsgroups dataset of 20,000 documents. In the 20 mini newsgroup dataset, each of the 20 classes contains 100 documents randomly picked from the original dataset. In our experiments, we have used the first 60 documents of each class to create a training set and the remaining 40 documents for testing. The third dataset is a text corpus of 1000 documents downloaded from Google-Newsgroup [19]; it contains 10 classes (Business, Cricket, Music, Electronics, Biofuels, Biometrics, Astronomy, Health, Video Processing and Text Mining) of 100 documents each. In each class, 60 documents (60%) are used for training and the remaining 40 documents (40%) for testing. The fourth dataset is a collection of 1000 research article abstracts downloaded from scientific web portals, with 100 documents in each of 10 classes. From each class, 60 documents (60%) are used for training and the remaining 40 documents (40%) as a test set.
3.2 Experimentation
In this section, the experimental results obtained using the proposed method are presented. In all the experiments, a randomly selected 60% of the documents are used for training and the remaining 40% for testing. While conducting the experiments, we varied the number of features (m) selected through RLPI from 1 to 100 dimensions. For each dataset we conducted the same experiments and employed the feature selection method discussed in Section 2.2. Table 1 gives the average classification accuracy over 5 trials with randomly selected training sets on the aforementioned datasets, along with the number of dimensions selected using the proposed feature selection approach.
A comparative analysis of the proposed method against other state-of-the-art techniques on the Vehicles-Wikipedia and 20 mini newsgroup datasets, using well-accepted classifiers such as the naïve Bayes classifier, the K-nearest neighbor method and the support vector machine, is given in Table 2. From Table 2 it can be seen that the proposed method achieves better classification accuracy than the state-of-the-art techniques. The entry NA in Table 2 indicates that the corresponding result is not reported in the respective work.
4. CONCLUSION
In this paper, a simple and efficient symbolic text classification is presented. In this work, a text document is represented
by the use of symbolic features. The proposed new feature selection method selects the best features in the proximity
space and effectively improves the classification accuracy. To check the effectiveness and robustness of the proposed
method, extensive experimentation is conducted on various datasets. The experimental results reveal that the proposed
method outperforms the other existing methods.
In the future, our research will emphasize in enhancing the ability and performance of our model by considering
other parameters to effectively capture the variations between the classes, which in turn improves the classification
accuracy. Along with this, we have a plan of exploiting other similarity/dissimilarity measures, selection of dynamic
threshold value and studying the classification accuracy under varying dimensions. Besides this we are also targeting
towards the study of complexity issues of the proposed model with the existing representation models.
Table 1: Average classification accuracy of the proposed method on different datasets (number of selected dimensions in parentheses)

    Dataset              Proposed Method
    Vehicle-Wikipedia    89.24 (02)
    Mini 20 News Group   89.13 (04)
    Google-Newsgroup     90.50 (04)
    Research Articles    90.49 (03)
Table 2: Comparative analysis of the proposed method with other state-of-the-art techniques

    Method                                              Vehicle-    Mini 20
                                                        Wikipedia   News Group
    Status Matrix Representation [18]                   76.00       71.12
    Probability based representation [9]:
      Naive Bayes Classifier with flat ranking          81.25       NA
      Naïve Bayes Classifier + SVM (Linear)             85.42       NA
      Naïve Bayes Classifier + SVM (RBF)                85.42       NA
      Naïve Bayes Classifier + SVM (Sigmoid)            84.58       NA
      Naïve Bayes Classifier + SVM (Polynomial)         81.66       NA
    Bag of word representation [18]:
      Naive Bayes Classifier                            NA          66.22
      KNN                                               NA          38.73
      SVM                                               NA          51.02
    Proposed Method                                     89.24       89.13
REFERENCES
1. Sebastiani F., 2002. Machine learning in automated text categorization. ACM Computing Surveys. Vol. 34, pp. 1
– 47.
2. Li Y. H and A. K. Jain, 1998. Classification of text documents. The Computer Journal, vol. 41, no. 8, pp. 537 –
546.
3. Hotho A., A. Nurnberger and G. Paab, 2005. A brief survey of text mining. Journal for Computational Linguistics
and Language Technology, vol. 20, pp. 19 – 62.
4. Cavnar W B., 1994. N-Gram based text categorization. Proceedings of the 3rd Symposium on Document Analysis
and Information Retrieval, Las Vegas, pp. 161 – 176.
5. Milios E, Zhang Y, He B, and Dong L, 2003. Automatic term extraction and document similarity in special text
corpora. Proceedings of the Sixth Conference of the Pacific Association for Computational Linguistics
(PACLing’03), Canada, pp. 275—284.
6. Choudhary B and P. Bhattacharyya, 2002. Text clustering using universal networking language representation.
Proceedings of the 11th International Conference on World Wide Web, USA, (http://www-
clips.imag.fr/geta/User/wang-ju.tsai/articles/BhChPBh-UNL01.PDF).
7. Craven M, DiPasquo D, Freitag D, McCallum A, Mitchell T. M, Nigam K, and Slattery S, 1998. Learning to
Extract Symbolic Knowledge from the World Wide Web. In the Proceedings of AAAI/IAAI', pp. 509—516,
1998.
8. Esteban M and Rodríguez O. R, 2006. A Symbolic Representation for Distributed Web Document Clustering. In
the Proceedings of Fourth Latin American Web Congress, Cholula, Mexico, October.
9. Isa D., L. H. Lee., V. P. Kallimani and R. Rajkumar., 2008. Text document preprocessing with the bayes formula
for classification using the support vector machine. IEEE Transactions on Knowledge and Data Engineering, vol.
20, no. 9, pp. 23 – 31.
10. Salton G., A. Wang and C. S. Yang, 1975. A vector space model for automatic indexing. Communications of the
ACM, vol. 18, no. 11, pp. 613 – 620.
11. Deerwester S. C., S. T. Dumais., T. K. Landauer., G. W. Furnas and R. A. Harshman, 1990. Indexing by latent
semantic analysis. Journal of the American Society of Information Science, vol. 41, no. 6, pp. 391 – 407.
12. He X., D. Cai., H. Liu and W. Y. Ma, 2004. Locality preserving indexing for document representation.
Proceedings of the International Conference on Research and Development in Information Retrieval, UK, pp. 96 –
103.
13. Cai D., X. He., W. V. Zhang and J. Han, 2007. Regularized locality preserving indexing via spectral regression.
Proceedings of the 16th Conference on Information and Knowledge Management, USA, pp. 741 – 750.
14. Kiranagi B. B, Guru D. S, and Ichino M, 2007. Exploitation of Multivalued Type Proximity for Symbolic Feature
Selection. In the Proceedings of International Conference on Computing: Theory and Applications, Kolkata, pp.
320—324.
15. Guru D.S., Kiranagi B. B., Nagabhushan P, 2004. Multivalued type proximity measure and concept of mutual
similarity value useful for clustering symbolic patterns. Journal of Pattern Recognition Letters, vol. 25, pp. 1003 –
1013.
16. Zeimpekis D and E. Gallopoulos, 2006. TMG: A MATLAB Toolbox for generating term-document matrices from
text collections. Springer Publications, Berlin, pp. 187 – 210.
17. http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html
18. Dinesh R, Harish B. S, Guru D. S, and Manjunath S, 2009. Concept of status matrix in classification of text
documents. In the Proceedings of Indian International Conference on Artificial Intelligence, India, pp. 2071—
2079.
19. Li S., R. Xia, C. Zong and C. R. Huang, 2009. A framework of feature selection methods for text
categorization. Proceedings of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on
Natural Language Processing of the AFNLP, vol. 2, pp. 692 – 700.