Bilateral Russian-Indian Scientific Workshop on Emerging Applications of Computer Vision, Russia, Moscow, Nov. 1-5, 2011
Symbolic Similarity and Symbolic Feature Selection for Text Classification
B S Harish (a)*, D S Guru (b), S Manjunath (b) and Bapu B Kiranagi (c)
(a) Department of Information Science & Engineering, S J College of Engineering, Mysore;
(b) Department of Studies in Computer Science, University of Mysore, Mysore, India;
(c) HCL Technologies, Bangalore, India.
bsharish@ymail.com, dsg@compsci.uni-mysore.ac.in, manju_uom@yahoo.co.in, bapukb@hcl.com
ABSTRACT
In this paper, a simple and efficient symbolic text classification method is presented. A text document is represented by the use of interval-valued symbolic features. Subsequently, a new feature selection method based on a symbolic similarity measure is also presented. The new feature selection method reduces the features in the proximity space for effective text classification. It keeps the best features for effective text representation and simultaneously reduces the time taken to classify a given document. To corroborate the efficacy of the proposed method, experimentation has been conducted on four different datasets. Experimental results reveal that the proposed method gives better results when compared to state-of-the-art techniques. In addition, as it is based on a simple matching scheme, it achieves classification within negligible time and thus appears to be more effective in classification.
Keywords: Symbolic features, Similarity measure, Feature selection, Document representation, Text classification
1. INTRODUCTION
Text classification is one of the important research issues in the field of text mining, where documents are classified with supervised knowledge. Based on the likelihood of the training set, a new document is classified. The major challenges and difficulties that arise in the problem of text classification are: high dimensionality (thousands of features), variable length, content and quality of the documents, sparse distribution of terms in documents, loss of correlation between adjacent words, and understanding the complex semantics of terms in a document [1]. To tackle these problems, a number of methods have been reported in the literature for effective classification of text documents. Many representation schemes like binary representation [2], ontology [3], N-grams [4], multiword terms as vectors [5] and UNL [6] have been proposed for effective text classification. Also, in [7], [8] new representation models for web documents are proposed. Recently, in [9] the Bayes formula is used to vectorize a document according to a probability distribution reflecting the probable categories that the document may belong to.
Although many representation models for text documents are available in the literature, the Vector Space Model (VSM) is one of the most popular and widely used models for document representation [10]. Unfortunately, the major limitation of the VSM is the loss of the correlation and context of each term, which are very important in understanding a document. To deal with these problems, Latent Semantic Indexing (LSI) was proposed in [11]. LSI is optimal in the sense of preserving the global geometric structure of a document space. However, it might not be optimal in the sense of discrimination [11]. Thus, to discover the discriminating structure of a document space, Locality Preserving Indexing (LPI) was proposed in [12]. An assumption behind LPI is that close inputs should have similar representations. However, the computational cost of LPI is very high, making it almost infeasible to apply LPI to very large datasets. Hence, to reduce the computational complexity of LPI, Regularized Locality Preserving Indexing (RLPI) has been proposed in [13]. RLPI is significantly faster and obtains similar or better results when compared to LPI. This makes RLPI an efficient and effective data preprocessing method for large scale text classification [13].
However, RLPI fails to preserve the intraclass variations among the documents of different classes. Also, in the case of RLPI, we need to select the number of dimensions m to be optimal, and unfortunately we cannot guarantee that the selected m value is optimal. Hence, in this paper we give a symbolic representation for a given document to capture the intraclass variation, and subsequently we employ the symbolic feature selection scheme [14] to select the best few d features for storing in the knowledge base. In addition, we also present the corresponding classification method. To the best of our knowledge, no work has been reported in the literature which uses a symbolic representation and a symbolic feature selection method for text documents. Recent developments in the area of symbolic data analysis have proven that real life objects can be better described by the use of symbolic data, which are extensions of classical crisp data [15]. These issues motivated us to use symbolic data rather than conventional classical crisp data to represent a document.
The rest of the paper is organized as follows: the working principle of the proposed method is presented in Section 2. Details of the datasets used, the experimental settings and the obtained results are presented in Section 3. Section 4 concludes with future directions.
2. PROPOSED METHOD
In this section we propose a novel method of representing a text document using symbolic data representation. Initially, documents are represented using the regularized locality preserving indexing approach, as it has the property of preserving the local structure of a document. Further, to capture intraclass variations across documents of the same class, an interval valued feature vector representation is formulated to represent each class by feature assimilation. Subsequently, we employ the symbolic feature selection scheme to select the best few features for storage. In addition, we also present the corresponding classification method.
2.1 Symbolic Representation
Let there be K classes, each containing N documents, where each document is described by an n-dimensional term frequency vector. The term document matrix [16], say X, of size (KN x n), is constructed such that each row represents a document of a class and each column represents a term. We recommend employing the RLPI on X to obtain the transformed term document matrix Y of size (KN x m), where m is the number of features chosen out of n, which is not necessarily optimal.
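As a point of reference, the raw term-document matrix X can be produced with any standard term-counting tool before applying RLPI. A minimal sketch using scikit-learn's CountVectorizer (an illustrative choice, not the TMG toolbox [16] used in the paper):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["first training document ...", "second training document ..."]  # the KN documents
X = CountVectorizer().fit_transform(docs).toarray()  # KN x n term frequency matrix
# X is then projected to a KN x m matrix Y with RLPI [13] (not reproduced here).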
In order to preserve the intraclass variation in each feature of every document of the i-th class, we propose in this work to aggregate the term frequency vectors of all documents of the i-th class in the form of an interval valued feature vector. Let [P_{j1}, P_{j2}, ..., P_{jm}] be the transformed term frequency vector obtained by the RLPI for the j-th document of the i-th class. The reference interval valued feature vector of dimension m representing all the documents of the i-th class on aggregation is computed and stored in the knowledge base for classification purposes.
Let

    R_i = ([P_{i1}^-, P_{i1}^+], [P_{i2}^-, P_{i2}^+], ..., [P_{im}^-, P_{im}^+])

be the computed reference term frequency vector of the i-th class, where

    P_{il}^- = mean(P_{jl}, j = 1, ..., N) - std(P_{jl}, j = 1, ..., N)

and

    P_{il}^+ = mean(P_{jl}, j = 1, ..., N) + std(P_{jl}, j = 1, ..., N).
Similarly, we compute interval valued reference term frequency vectors for all the classes and store them in the knowledge base for the subsequent classification task. It shall be noticed that the reference term frequency vectors are interval valued and of dimension m, which is arbitrarily fixed (not necessarily optimal). Hence the class representatives are stored in a matrix called F, which is of size K x m. In the subsequent stage we propose to employ a symbolic feature selection scheme to select the best d features out of the m features.
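To make the aggregation concrete, the following minimal Python sketch shows how the interval valued class representatives could be computed; the function name, the numpy-based data layout and the label array are illustrative assumptions rather than the paper's own implementation.

import numpy as np

def class_representatives(Y, labels):
    """Build interval valued class representatives from the RLPI-transformed
    term-document matrix Y (one row per document, m columns).
    For every class and feature l the interval is [mean - std, mean + std]
    over the documents of that class, as defined in Section 2.1."""
    classes = np.unique(labels)
    K, m = len(classes), Y.shape[1]
    lower = np.empty((K, m))
    upper = np.empty((K, m))
    for i, c in enumerate(classes):
        Yc = Y[labels == c]                      # documents of class c
        mu, sigma = Yc.mean(axis=0), Yc.std(axis=0)
        lower[i], upper[i] = mu - sigma, mu + sigma
    return lower, upper                          # the K x m matrix F, as two arrays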
2.2 Symbolic Feature Selection
Feature selection is used to identify the useful features and to remove redundant information. Basically, feature selection methods fall into two broad categories: the filter model and the wrapper model. The wrapper model requires one predetermined learning algorithm in feature selection and uses its performance to evaluate and determine which features are selected. The filter model relies on general characteristics of the training data to select features without involving any specific learning algorithm. There is evidence that wrapper methods often perform better on small scale problems, but on large scale problems, such as text classification, wrapper methods are shown to be impractical because of their high computational cost [19]. Hence, in this paper we make use of the filter method to select the best features.
In order to select the best features from the class representative matrix F, we need to study the correlation present among the individual features of each class. The features which have maximum correlation shall be selected as the best features to represent the classes. Since F is an interval matrix, we compute a proximity matrix of size K x K, with each element being of multivalued type of dimension m, by computing the similarity among the features using the similarity function proposed in [15]. The similarity from class i to class j with respect to the l-th feature is given by

    S_{ij}^{l} = |I_{il} ∩ I_{jl}| / |I_{jl}|    (1)
where I_{il} = [p_{il}^-, p_{il}^+], l = 1, 2, ..., m, are the interval type features of the class C_i and I_{jl} = [p_{jl}^-, p_{jl}^+], l = 1, 2, ..., m, are the interval type features of the class C_j.
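Under one plausible reading of Eq. (1), with |.| taken as interval length, the per-feature similarity could be computed as in the sketch below; the overlap-over-length form is our interpretation of the measure from [15], and the names are illustrative.

def interval_similarity(lo_i, hi_i, lo_j, hi_j):
    """Similarity from class i to class j for one feature (Eq. 1),
    read as |I_il ∩ I_jl| / |I_jl|: the length of the overlap of the
    two intervals divided by the length of class j's interval.
    Asymmetric, so the K x K proximity matrix need not be symmetric."""
    overlap = max(0.0, min(hi_i, hi_j) - max(lo_i, lo_j))
    width_j = hi_j - lo_j
    return overlap / width_j if width_j > 0 else 0.0

Applying this function to every ordered class pair and every feature yields the K x K proximity matrix with m-dimensional multivalued entries described above.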
Therefore, from the obtained proximity matrix, the matrix M of size K^2 x m is constructed by listing out each multivalued type element one by one in the form of rows. The standard deviation of each column of the matrix M is computed, and also the average standard deviation due to all m columns is computed.
Let TCorr_l be the total correlation of the l-th column with all other columns of the matrix M, and let AvgTCorr be the average of the total correlations obtained due to all columns, i.e.,

    TCorr_l = \sum_{q=1, q \ne l}^{m} Corr(l-th column, q-th column)    (2)

and

    AvgTCorr = (1/m) \sum_{l=1}^{m} TCorr_l    (3)

We are interested in those features which have high discriminating capability, and thus we recommend selecting those features for which TCorr_l is higher than the average correlation AvgTCorr.
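The selection rule of Eqs. (2)-(3) can be sketched as follows, assuming Corr denotes the Pearson correlation between columns of M and that a column's self-correlation is excluded from its total; this is an illustrative reading, not the paper's code.

import numpy as np

def select_features(M):
    """Given the K^2 x m matrix M of per-feature similarity values,
    keep the features whose total correlation with the other columns
    (Eq. 2) exceeds the average total correlation (Eq. 3)."""
    corr = np.corrcoef(M, rowvar=False)        # m x m column-wise correlations
    tcorr = corr.sum(axis=1) - 1.0             # drop each column's self-correlation of 1
    return np.where(tcorr > tcorr.mean())[0]   # indices of the d selected features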
2.3 Document Classification
Given a test document
q
D
represented by term document feature vector, it is projected onto lower dimension using
RLPI, to get the transformed term frequency feature vector of size
m
which are of type crisp. Among these
m
feature
values only the
d
values are selected from feature selection method. Then these
d
values are used to compare with each
class representative stored in the knowledge base. Let
1 2 3
[ , , ..., ]
tq t t t td
F f f f f
be a
d
dimensional feature vector
describing a test document. Let
j
R
be the interval symbolic feature vector of the
th
j
class. Now each
th
l
feature value of
the test document is compared with the corresponding interval
l
j
R
to examine belongingness between the test document
feature value and interval in
l
j
R
. We make use of Belongingness Count
c
B
as a measure of degree of belongingness for
the test document to decide its class label.
 
1,,
d
c tl jl jl
l
B C f f f



and (4)
 
 
1;
,, 0;
tl jl tl jl
tl jl jl
if f f and f f
C f f f Otherwise





(5)
The crisp value of a test document falling into its respective feature interval of the reference class contributes a value 1
towards
c
B
and there will be no contribution from other features which fall outside the interval. Similarly, we compute
the
c
B
value for all classes representing and of the representative which has highest
c
B
value will be assigned to the test
document as its label.
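The decision rule of Eqs. (4)-(5) amounts to counting, per class, how many of the d selected feature values of the test document fall inside the stored intervals. A minimal sketch, reusing the illustrative representatives and selected indices from the earlier sketches:

import numpy as np

def classify(f, lower, upper, selected):
    """f: the m crisp RLPI features of a test document.
    lower/upper: the K x m interval representatives (Section 2.1).
    selected: indices of the d features kept in Section 2.2.
    Returns the index of the class with the highest belongingness count B_c."""
    fd = f[selected]
    inside = (fd >= lower[:, selected]) & (fd <= upper[:, selected])  # Eq. (5) per class
    Bc = inside.sum(axis=1)                                          # Eq. (4)
    return int(Bc.argmax())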
3. EXPERIMENTAL SETUP
3.1 Datasets
To test the efficacy of the proposed model, we have used the following four datasets. The first dataset consists of vehicle characteristics extracted from Wikipedia pages (Vehicles-Wikipedia) [9]. It contains four categories of vehicles with low degrees of inter-category similarity: Aircraft, Boats, Cars and Trains. All four categories are easily differentiated and every category has a set of unique keywords. The second dataset is the standard 20 mini newsgroup dataset [17], which contains about 2000 documents evenly divided among 20 Usenet discussion groups; it is a subset of the 20 newsgroups collection of 20,000 documents. In the 20 mini newsgroup dataset, each of the 20 classes contains 100 documents randomly picked from the original dataset. In our experiments, we have used the first 60 documents of each class to create a training set and the remaining 40 documents for testing. The third dataset is a text corpus of 1000 documents downloaded from Google-Newsgroup [19]. Each of its 10 classes (Business, Cricket, Music, Electronics, Biofuels, Biometrics, Astronomy, Health, Video Processing and Text Mining) contains 100 documents. In each class, 60 documents (60%) are used for training and the remaining 40 documents (40%) for testing the system. The fourth dataset is a collection of research article abstracts downloaded from scientific web portals. We have collected 1000 documents from 10 different classes, each containing 100 documents. From each class, 60 documents (60%) are used for training and the remaining 40 documents (40%) are used as a test set.
3.2 Experimentation
In this section, the experimental results obtained by the use of the proposed method are presented. In all the experiments, 60% of the documents, selected randomly, are used for training and the remaining 40% for testing. While conducting the experiments, we varied the number of features (m) selected through RLPI from 1 to 100 dimensions. For each dataset we conducted the same experiments and employed the feature selection method discussed in Section 2.2. Table 1 gives the average classification accuracy over 5 trials with randomly selected training sets on the aforementioned datasets, along with the number of dimensions selected using the proposed feature selection approach.
Also, a comparative analysis of the proposed method with other state-of-the-art techniques on the same datasets, viz., Vehicles-Wikipedia and 20 mini newsgroup, using well-accepted classifiers such as the Naïve Bayes classifier, the K-nearest neighbor method and the support vector machine, is given in Table 2. From Table 2 it can be seen that the proposed method achieves better classification accuracy than the state-of-the-art techniques. The entry NA in Table 2 indicates that the result is not reported in the respective work.
4. CONCLUSION
In this paper, a simple and efficient symbolic text classification method is presented. A text document is represented by the use of symbolic features. The proposed feature selection method selects the best features in the proximity space and effectively improves the classification accuracy. To check the effectiveness and robustness of the proposed method, extensive experimentation is conducted on various datasets. The experimental results reveal that the proposed method outperforms the other existing methods.
In the future, our research will focus on enhancing the ability and performance of our model by considering other parameters to effectively capture the variations between the classes, which in turn should improve the classification accuracy. Along with this, we plan to exploit other similarity/dissimilarity measures, to select a dynamic threshold value, and to study the classification accuracy under varying dimensions. Besides this, we are also targeting a study of the complexity of the proposed model against the existing representation models.
Table 1: Average classification accuracy of the proposed method on different datasets (the number of selected dimensions is given in parentheses)

Dataset               Proposed Method
Vehicle-Wikipedia     89.24 (02)
Mini 20 News Group    89.13 (04)
Google                90.50 (04)
Research Articles     90.49 (03)
Table 2: Comparative analysis of the proposed method with other state-of-the-art techniques

Method                                                   Vehicle-Wikipedia   Mini 20 News Group
Status Matrix Representation [18]                        76.00               71.12
Probability based representation [9]:
  Naïve Bayes Classifier with flat ranking               81.25               NA
  Naïve Bayes Classifier + SVM (Linear)                  85.42               NA
  Naïve Bayes Classifier + SVM (RBF)                     85.42               NA
  Naïve Bayes Classifier + SVM (Sigmoid)                 84.58               NA
  Naïve Bayes Classifier + SVM (Polynomial)              81.66               NA
Bag of words representation [18]:
  Naïve Bayes Classifier                                 NA                  66.22
  KNN                                                    NA                  38.73
  SVM                                                    NA                  51.02
Proposed Method                                          89.24               89.13
REFERENCES
1. Sebastiani F., 2002. Machine learning in automated text categorization. ACM Computing Surveys, vol. 34, pp. 1-47.
2. Li Y. H. and A. K. Jain, 1998. Classification of text documents. The Computer Journal, vol. 41, no. 8, pp. 537-546.
3. Hotho A., A. Nurnberger and G. Paab, 2005. A brief survey of text mining. Journal for Computational Linguistics and Language Technology, vol. 20, pp. 19-62.
4. Cavnar W. B., 1994. N-Gram based text categorization. Proceedings of the 3rd Symposium on Document Analysis and Information Retrieval, Las Vegas, pp. 161-176.
5. Milios E., Zhang Y., He B. and Dong L., 2003. Automatic term extraction and document similarity in special text corpora. Proceedings of the Sixth Conference of the Pacific Association for Computational Linguistics (PACLing'03), Canada, pp. 275-284.
6. Choudhary B. and P. Bhattacharyya, 2002. Text clustering using universal networking language representation. Proceedings of the 11th International Conference on World Wide Web, USA (http://www-clips.imag.fr/geta/User/wang-ju.tsai/articles/BhChPBh-UNL01.PDF).
7. Craven M., DiPasquo D., Freitag D., McCallum A., Mitchell T. M., Nigam K. and Slattery S., 1998. Learning to extract symbolic knowledge from the World Wide Web. Proceedings of AAAI/IAAI, pp. 509-516.
8. Esteban M. and Rodríguez O. R., 2006. A symbolic representation for distributed web document clustering. Proceedings of the Fourth Latin American Web Congress, Cholula, Mexico, October.
9. Isa D., L. H. Lee, V. P. Kallimani and R. Rajkumar, 2008. Text document preprocessing with the Bayes formula for classification using the support vector machine. IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 9, pp. 23-31.
10. Salton G., A. Wang and C. S. Yang, 1975. A vector space model for automatic indexing. Communications of the ACM, vol. 18, no. 11, pp. 613-620.
11. Deerwester S. C., S. T. Dumais, T. K. Landauer, G. W. Furnas and R. A. Harshman, 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391-407.
12. He X., D. Cai, H. Liu and W. Y. Ma, 2004. Locality preserving indexing for document representation. Proceedings of the International Conference on Research and Development in Information Retrieval, UK, pp. 96-103.
13. Cai D., X. He, W. V. Zhang and J. Han, 2007. Regularized locality preserving indexing via spectral regression. Proceedings of the 16th Conference on Information and Knowledge Management, USA, pp. 741-750.
14. Kiranagi B. B., Guru D. S. and Ichino M., 2007. Exploitation of multivalued type proximity for symbolic feature selection. Proceedings of the International Conference on Computing: Theory and Applications, Kolkata, pp. 320-324.
15. Guru D. S., Kiranagi B. B. and Nagabhushan P., 2004. Multivalued type proximity measure and concept of mutual similarity value useful for clustering symbolic patterns. Pattern Recognition Letters, vol. 25, pp. 1003-1013.
16. Zeimpekis D. and E. Gallopoulos, 2006. TMG: A MATLAB toolbox for generating term-document matrices from text collections. Springer, Berlin, pp. 187-210.
17. http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html
18. Dinesh R., Harish B. S., Guru D. S. and Manjunath S., 2009. Concept of status matrix in classification of text documents. Proceedings of the Indian International Conference on Artificial Intelligence, India, pp. 2071-2079.
19. Li S., R. Xia, C. Zong and C. R. Huang, 2009. A framework of feature selection methods for text categorization. Proceedings of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, vol. 2, pp. 692-700.