Bilateral Russian-Indian Scientific Workshop on Emerging Applications of Computer Vision, Russia, Moscow, Nov. 1 – 5, 2011
Symbolic Similarity and Symbolic Feature Selection for Text
Classification
B S Harish (a)*, D S Guru (b), S Manjunath (b) and Bapu B Kiranagi (c)
(a) Department of Information Science & Engineering, S J College of Engineering, Mysore, India;
b Department of Studies in Computer Science, University of Mysore, Mysore, India;
c HCL Technologies, Bangalore, India.
bsharish@ymail.com, dsg@compsci.uni-mysore.ac.in, manju_uom@yahoo.co.in, bapukb@hcl.com
ABSTRACT
In this paper, a simple and efficient symbolic text classification method is presented. A text document is represented by the use of interval-valued symbolic features. Subsequently, a new feature selection method based on a symbolic similarity measure is also presented. The new feature selection method reduces the features in the proximity space for effective text classification: it retains the best features for effective text representation and simultaneously reduces the time taken to classify a given document. To corroborate the efficacy of the proposed method, experiments have been conducted on four different datasets. Experimental results reveal that the proposed method gives better results than state-of-the-art techniques. In addition, since it is based on a simple matching scheme, it achieves classification within negligible time and thus appears to be more effective for classification.
Keywords: Symbolic features, Similarity measure, Feature selection, Document representation, Text classification
1. INTRODUCTION
Text classification is one of the important research issues in the field of text mining, where documents are classified with supervised knowledge: based on the likelihood of the training set, a new document is classified. The major challenges and difficulties that arise in text classification are: high dimensionality (thousands of features); variable length, content and quality of the documents; sparse distribution of terms in documents; loss of correlation between adjacent words; and understanding the complex semantics of terms in a document [1]. To tackle these problems, a number of methods have been reported in the literature for effective classification of text documents. Many representation schemes, such as binary representation [2], ontology [3], N-grams [4], multiword terms as vectors [5] and UNL [6], have been proposed for effective text classification. Also, in [7], [8] new representation models for web documents are proposed. Recently, in [9] the Bayes formula is used to vectorize a document according to a probability distribution reflecting the probable categories to which the document may belong.
Although many representation models for text documents are available in the literature, the Vector Space Model (VSM) is one of the most popular and widely used models for document representation [10]. Unfortunately, the major limitation of the VSM is the loss of the correlation and context of each term, which are very important for understanding a document. To deal with these problems, Latent Semantic Indexing (LSI) was proposed in [11]. LSI is optimal in the sense of preserving the global geometric structure of a document space; however, it might not be optimal in the sense of discrimination [11]. Thus, to discover the discriminating structure of a document space, Locality Preserving Indexing (LPI) was proposed in [12]. An assumption behind LPI is that close inputs should have similar documents. However, the computational cost of LPI is very high, making it almost infeasible to apply LPI to very large datasets. Hence, to reduce the computational complexity of LPI, Regularized Locality Preserving Indexing (RLPI) has been proposed in [13]. RLPI is significantly faster and obtains similar or better results compared to LPI. This makes RLPI an efficient and effective data preprocessing method for large-scale text classification [13].
However, RLPI fails to preserve the intraclass variations among the documents of different classes. Also, in the case of RLPI, we need to select the number of dimensions m to be optimal; unfortunately, we cannot guarantee that the selected m value is optimal. Hence, in this paper we give a symbolic representation for a given document to capture the intraclass variation, and subsequently we employ the symbolic feature selection scheme of [14] to select the best few features d for storing in the knowledge base. In addition, we also present the corresponding classification method. To the best of our knowledge, no work has been reported in the literature which uses a symbolic representation and a symbolic feature selection method for text documents. Recent developments in the area of symbolic data analysis have shown that real-life objects can be better described by symbolic data, which are extensions of classical crisp data [15]. These issues motivated us to use symbolic data rather than conventional crisp data to represent a document.
The rest of the paper is organized as follows: The working principle of the proposed method is presented in section 2.
Details of the dataset used, experimental settings and the obtained results are presented in section 3. Section 4 concludes
with future directions.
2. PROPOSED METHOD
In this section, we propose a novel method of representing a text document using symbolic data. Initially, documents are represented using the regularized locality preserving indexing approach, as it has the property of preserving the local structure of a document. Further, to capture intraclass variations across documents of the same class, an interval-valued feature vector representation is formulated to represent each class by feature assimilation. Subsequently, we employ the symbolic feature selection scheme to select the best few features for storage. In addition, we also present the corresponding classification method.
2.1 Symbolic Representation
Let there be K classes, each containing N documents, where each document is described by an n-dimensional term frequency vector. The term-document matrix [16], say X, of size (K*N) x n is constructed such that each row represents a document of a class and each column represents a term. We recommend employing RLPI on X to obtain the transformed term-document matrix Y of size (K*N) x m, where m is the number of features chosen out of n, which is not necessarily optimal.
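The construction of the raw term-document matrix X can be sketched as below. This is a minimal illustration only; `term_document_matrix` is a hypothetical helper name, and the RLPI projection of [13] that maps X to Y is a separate step not reproduced here.

```python
from collections import Counter

def term_document_matrix(docs):
    """Build the raw term-document matrix X: one row per document,
    one column per term, entries are term frequencies.  RLPI [13]
    would subsequently project X to the (K*N) x m matrix Y."""
    vocab = sorted({t for d in docs for t in d.split()})
    index = {t: i for i, t in enumerate(vocab)}
    X = [[0] * len(vocab) for _ in docs]
    for row, doc in enumerate(docs):
        for term, count in Counter(doc.split()).items():
            X[row][index[term]] = count
    return vocab, X

docs = ["the car engine", "the boat the sail"]
vocab, X = term_document_matrix(docs)
# vocab: ['boat', 'car', 'engine', 'sail', 'the']
```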
In order to preserve the intraclass variation in each feature of every document of the i-th class, we propose in this work to aggregate the term frequency vectors of all documents of the i-th class in the form of an interval-valued feature vector. Let [P_j1, P_j2, ..., P_jm] be the transformed term frequency vector obtained by RLPI for the j-th document of the i-th class. The reference interval-valued feature vector of dimension m representing all the documents of the i-th class on aggregation is computed and stored in the knowledge base for classification purposes.

Let

    R_i = { [P_i1^-, P_i1^+], [P_i2^-, P_i2^+], ..., [P_im^-, P_im^+] }

be the computed reference term frequency vector of the i-th class, where

    P_il^- = mean(P_jl, j = 1,...,N) - std(P_jl, j = 1,...,N)   and
    P_il^+ = mean(P_jl, j = 1,...,N) + std(P_jl, j = 1,...,N).

Similarly, we compute interval-valued reference term frequency vectors for all the classes and store them in the knowledge base for the subsequent classification task. It should be noticed that the reference term frequency vectors are interval valued and of dimension m, which is arbitrarily fixed (not necessarily optimal). Hence the class representatives are stored in a matrix F of size K x m. In the subsequent stage we propose to employ a symbolic feature selection scheme to select the best few d features out of the m features.
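The interval-valued class representative described above can be sketched as follows; `class_interval_representative` is an illustrative name, not code from the paper, and the toy data is invented for demonstration.

```python
import numpy as np

def class_interval_representative(Y_class):
    """Aggregate the RLPI-transformed vectors of one class (rows of
    Y_class, shape (N, m)) into the interval-valued reference vector
    R_i: feature l becomes the interval
    [mean(P_jl) - std(P_jl), mean(P_jl) + std(P_jl)]."""
    mean = Y_class.mean(axis=0)
    std = Y_class.std(axis=0)
    return np.stack([mean - std, mean + std], axis=1)  # (m, 2): lower, upper

# Toy class with N = 3 documents and m = 2 features.
Y_class = np.array([[1.0, 4.0],
                    [2.0, 4.0],
                    [3.0, 4.0]])
R = class_interval_representative(Y_class)
# Feature 2 is constant across documents, so its interval
# degenerates to [4.0, 4.0].
```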
2.2 Symbolic Feature Selection
Feature selection is used to identify the useful features and also to remove the redundant information. Basically, feature
selection methods fall into two broad categories, the filter model and the wrapper model. The wrapper model requires
one predetermined learning algorithm in feature selection and uses its performance to evaluate and determine which
features are selected. And the filter model relies on general characteristics of the training data to select some features
without involving any specific learning algorithm. There is evidence that wrapper methods often perform better on small
scale problems, but on large scale problems, such as text classification, wrapper methods are shown to be impractical
because of its high computational cost [19]. Hence in this paper we make use of the filter method to select the best
features.
In order to select the best features from the class representative matrix F, we need to study the correlation present among the individual features of each class. The features which have maximum correlation shall be selected as the best features to represent the classes. Since F is an interval matrix, we compute a proximity matrix of size K x K, with each element being of multivalued type and of dimension m, by computing the similarity among the features using the similarity function proposed in [15]. The similarity from class i to class j with respect to the l-th feature is given by

    S_ij^l = |I_il ∩ I_jl| / |I_jl|,    (1)

where I_il = [p_il^-, p_il^+], l = 1, 2, ..., m, are the interval-type features of class C_i and I_jl = [p_jl^-, p_jl^+], l = 1, 2, ..., m, are the interval-type features of class C_j.
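One plausible reading of the per-feature similarity, namely the length of the overlap between I_il and I_jl normalized by the length of I_jl, can be sketched as below. This reading is an assumption on our part (the exact measure is defined in [15]), and the sketch assumes I_jl has positive length.

```python
def interval_similarity(I_i, I_j):
    """Overlap of interval I_i with I_j, normalized by the length of
    I_j.  The measure is asymmetric, matching 'similarity from class
    i to class j'.  Assumes I_j is non-degenerate (positive length)."""
    lo = max(I_i[0], I_j[0])           # left end of the overlap
    hi = min(I_i[1], I_j[1])           # right end of the overlap
    overlap = max(0.0, hi - lo)        # zero when the intervals are disjoint
    return overlap / (I_j[1] - I_j[0])

s = interval_similarity((1.0, 4.0), (2.0, 6.0))  # overlap [2, 4] over length 4
```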
From the obtained proximity matrix, the matrix M of size K^2 x m is constructed by listing out each multivalued-type element one by one in the form of rows. The standard deviation of each column of the matrix M is computed, and also the average standard deviation over all m columns is computed.

Let TCorr_l be the total correlation of the l-th column with all other columns of the matrix M, and let AvgTCorr be the average of the total correlations obtained over all columns, i.e.,

    TCorr_l = Sum_{q=1, q≠l}^{m} Corr(l-th column, q-th column)    (2)

and

    AvgTCorr = ( Sum_{l=1}^{m} TCorr_l ) / m.    (3)

We are interested in those features which have high discriminating capability, and thus we recommend selecting those features for which TCorr_l is higher than the average correlation AvgTCorr.
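The filter step of Eqs. (2)-(3) can be sketched as follows. The function name is illustrative, the toy matrix is invented, and we assume the self-correlation term is excluded from TCorr_l ("all other columns").

```python
import numpy as np

def select_features_by_total_correlation(M):
    """Filter-style symbolic feature selection: for each column l of M
    (the K^2 x m listing of the proximity matrix), sum its correlation
    with every other column (Eq. 2) and keep the columns whose total
    exceeds the average total correlation (Eq. 3)."""
    C = np.corrcoef(M, rowvar=False)   # (m, m) column-wise correlations
    tcorr = C.sum(axis=0) - 1.0        # subtract the self-correlation term
    avg = tcorr.mean()
    return np.where(tcorr > avg)[0]    # indices of the selected features

# Toy M with 4 rows and m = 4 columns; columns 0 and 1 are nearly
# identical, so they carry the highest total correlation.
M = np.array([[1.0, 1.1, 5.0, 0.0],
              [2.0, 2.1, 1.0, 3.0],
              [3.0, 3.1, 4.0, 1.0],
              [4.0, 3.9, 2.0, 2.0]])
selected = select_features_by_total_correlation(M)
```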
2.3 Document Classification
Given a test document
q
D
represented by term document feature vector, it is projected onto lower dimension using
RLPI, to get the transformed term frequency feature vector of size
m
which are of type crisp. Among these
m
feature
values only the
d
values are selected from feature selection method. Then these
d
values are used to compare with each
class representative stored in the knowledge base. Let
1 2 3
[ , , ..., ]
tq t t t td
F f f f f
be a
d
dimensional feature vector
describing a test document. Let
j
R
be the interval symbolic feature vector of the
th
j
class. Now each
th
l
feature value of
the test document is compared with the corresponding interval
l
j
R
to examine belongingness between the test document
feature value and interval in
l
j
R
. We make use of Belongingness Count
c
B
as a measure of degree of belongingness for
the test document to decide its class label.
1,,
d
c tl jl jl
l
B C f f f
and (4)
1;
,, 0;
tl jl tl jl
tl jl jl
if f f and f f
C f f f Otherwise
(5)
The crisp value of a test document falling into its respective feature interval of the reference class contributes a value 1
towards
c
B
and there will be no contribution from other features which fall outside the interval. Similarly, we compute
the
c
B
value for all classes representing and of the representative which has highest
c
B
value will be assigned to the test
document as its label.
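The belongingness-count classification of Eqs. (4)-(5) can be sketched as below; function names and the toy intervals are illustrative, not from the paper.

```python
import numpy as np

def belongingness_count(f_test, R_class):
    """B_c of Eqs. (4)-(5): the number of test feature values that fall
    inside the corresponding class interval [f^-, f^+]."""
    lo, hi = R_class[:, 0], R_class[:, 1]
    return int(np.sum((lo <= f_test) & (f_test <= hi)))

def classify(f_test, class_representatives):
    """Assign the label of the class whose interval representative
    gives the highest belongingness count."""
    counts = [belongingness_count(f_test, R) for R in class_representatives]
    return int(np.argmax(counts))

# Two classes, d = 3 selected features; each row is [lower, upper].
R0 = np.array([[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]])
R1 = np.array([[2.0, 3.0], [2.0, 3.0], [0.0, 1.0]])
f = np.array([0.5, 0.2, 0.9])
label = classify(f, [R0, R1])  # all three values fall in class 0's intervals
```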
3. EXPERIMENTAL SETUP
3.1 Datasets
To test the efficacy of the proposed model, we have used the following four datasets. The first dataset consists of vehicle characteristics extracted from Wikipedia pages (Vehicles-Wikipedia) [9]. It contains four categories of vehicles with low degrees of inter-category similarity: Aircraft, Boats, Cars and Trains. All four categories are easily differentiated, and every category has a set of unique keywords. The second dataset is the standard 20 mini newsgroup dataset [17], which contains about 2000 documents evenly divided among 20 Usenet discussion groups; it is a subset of the 20 newsgroups dataset of 20,000 documents. In the 20 mini newsgroup dataset, each of the 20 classes contains 100 documents randomly picked from the original dataset. In our experiments, we have used the first 60 documents of each class to create a training set and the remaining 40 documents for testing. The third dataset is a text corpus of 1000 documents downloaded from Google-Newsgroup [19]; it contains 10 classes (Business, Cricket, Music, Electronics, Biofuels, Biometrics, Astronomy, Health, Video Processing and Text Mining) of 100 documents each. In each class, 60 documents (60%) are used for training and the remaining 40 documents (40%) for testing. The fourth dataset is a collection of 1000 research article abstracts downloaded from scientific web portals, with 100 documents in each of 10 classes. From each class, 60 documents (60%) are used for training and the remaining 40 documents (40%) as a test set.
3.2 Experimentation
In this section, the experimental results obtained using the proposed method are presented. In all the experiments, a randomly selected 60% of the documents are used for training and the remaining 40% for testing. While conducting the experiments, we varied the number of features (m) selected through RLPI from 1 to 100 dimensions. For each dataset we conducted the same experiments and employed the feature selection method discussed in Section 2.2. Table 1 gives the average classification accuracy over 5 trials with randomly selected training sets on the aforementioned datasets, along with the number of dimensions selected using the proposed feature selection approach.
A comparative analysis of the proposed method against other state-of-the-art techniques on the Vehicles-Wikipedia and 20 mini newsgroup datasets, using well-accepted classifiers such as the naïve Bayes classifier, the K-nearest neighbor method and the support vector machine, is given in Table 2. From Table 2 it can be seen that the proposed method achieves better classification accuracy than the state-of-the-art techniques. The entry NA in Table 2 indicates that the corresponding result is not reported in the respective work.
4. CONCLUSION
In this paper, a simple and efficient symbolic text classification is presented. In this work, a text document is represented
by the use of symbolic features. The proposed new feature selection method selects the best features in the proximity
space and effectively improves the classification accuracy. To check the effectiveness and robustness of the proposed
method, extensive experimentation is conducted on various datasets. The experimental results reveal that the proposed
method outperforms the other existing methods.
In the future, our research will emphasize in enhancing the ability and performance of our model by considering
other parameters to effectively capture the variations between the classes, which in turn improves the classification
accuracy. Along with this, we have a plan of exploiting other similarity/dissimilarity measures, selection of dynamic
threshold value and studying the classification accuracy under varying dimensions. Besides this we are also targeting
towards the study of complexity issues of the proposed model with the existing representation models.
Table 1: Average classification accuracy of the proposed method on different datasets (number of selected dimensions in parentheses)

    Dataset              Proposed Method
    Vehicle-Wikipedia    89.24 (02)
    Mini 20 News Group   89.13 (04)
    Google-Newsgroup     90.50 (04)
    Research Articles    90.49 (03)
Table 2: Comparative analysis of the proposed method with other state-of-the-art techniques

    Method                                              Vehicle-    Mini 20
                                                        Wikipedia   News Group
    Status Matrix Representation [18]                   76.00       71.12
    Probability based representation [9]:
      Naive Bayes Classifier with flat ranking          81.25       NA
      Naïve Bayes Classifier + SVM (Linear)             85.42       NA
      Naïve Bayes Classifier + SVM (RBF)                85.42       NA
      Naïve Bayes Classifier + SVM (Sigmoid)            84.58       NA
      Naïve Bayes Classifier + SVM (Polynomial)         81.66       NA
    Bag of word representation [18]:
      Naive Bayes Classifier                            NA          66.22
      KNN                                               NA          38.73
      SVM                                               NA          51.02
    Proposed Method                                     89.24       89.13
REFERENCES
1. Sebastiani F., 2002. Machine learning in automated text categorization. ACM Computing Surveys. Vol. 34, pp. 1
– 47.
2. Li Y. H and A. K. Jain, 1998. Classification of text documents. The Computer Journal, vol. 41, no. 8, pp. 537 –
546.
3. Hotho A., A. Nurnberger and G. Paab, 2005. A brief survey of text mining. Journal for Computational Linguistics
and Language Technology, vol. 20, pp. 19 – 62.
4. Cavnar W B., 1994. N-Gram based text categorization. Proceedings of the 3rd Symposium on Document Analysis
and Information Retrieval, Las Vegas, pp. 161 – 176.
5. Milios E, Zhang Y, He B, and Dong L, 2003. Automatic term extraction and document similarity in special text
corpora. Proceedings of the Sixth Conference of the Pacific Association for Computational Linguistics
(PACLing’03), Canada, pp. 275—284.
6. Choudhary B and P. Bhattacharyya, 2002. Text clustering using universal networking language representation.
Proceedings of the 11th International Conference on World Wide Web, USA, (http://www-
clips.imag.fr/geta/User/wang-ju.tsai/articles/BhChPBh-UNL01.PDF).
7. Craven M, DiPasquo D, Freitag D, McCallum A, Mitchell T. M, Nigam K, and Slattery S, 1998. Learning to
Extract Symbolic Knowledge from the World Wide Web. In the Proceedings of AAAI/IAAI', pp. 509—516,
1998.
8. Esteban M and Rodríguez O. R, 2006. A Symbolic Representation for Distributed Web Document Clustering. In
the Proceedings of Fourth Latin American Web Congress, Cholula, Mexico, October.
9. Isa D., L. H. Lee., V. P. Kallimani and R. Rajkumar., 2008. Text document preprocessing with the bayes formula
for classification using the support vector machine. IEEE Transactions on Knowledge and Data Engineering, vol.
20, no. 9, pp. 23 – 31.
10. Salton G., A. Wang and C. S. Yang, 1975. A vector space model for automatic indexing. Communications of the
ACM, vol. 18, no. 11, pp. 613 – 620.
11. Deerwester S. C., S. T. Dumais., T. K. Landauer., G. W. Furnas and R. A. Harshman, 1990. Indexing by latent
semantic analysis. Journal of the American Society of Information Science, vol. 41, no. 6, pp. 391 – 407.
12. He X., D. Cai., H. Liu and W. Y. Ma, 2004. Locality preserving indexing for document representation.
Proceedings of the International Conference on Research and Development in Information Retrieval, UK, pp. 96 –
103.
13. Cai D., X. He., W. V. Zhang and J. Han, 2007. Regularized locality preserving indexing via spectral regression.
Proceedings of the 16th Conference on Information and Knowledge Management, USA, pp. 741 – 750.
14. Kiranagi B. B, Guru D. S, and Ichino M, 2007. Exploitation of Multivalued Type Proximity for Symbolic Feature
Selection. In the Proceedings of International Conference on Computing: Theory and Applications, Kolkata, pp.
320—324.
15. Guru D.S., Kiranagi B. B., Nagabhushan P, 2004. Multivalued type proximity measure and concept of mutual
similarity value useful for clustering symbolic patterns. Journal of Pattern Recognition Letters, vol. 25, pp. 1003 –
1013.
16. Zeimpekis D and E. Gallopoulos, 2006. TMG: A MATLAB Toolbox for generating term-document matrices from
text collections. Springer Publications, Berlin, pp. 187 – 210.
17. http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html
18. Dinesh R, Harish B. S, Guru D. S, and Manjunath S, 2009. Concept of status matrix in classification of text
documents. In the Proceedings of Indian International Conference on Artificial Intelligence, India, pp. 2071—
2079.
19. Li S., R. Xia, C. Zong and C. R. Huang, 2009. A framework of feature selection methods for text
categorization. Proceedings of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on
Natural Language Processing of the AFNLP, vol. 2, pp. 692 – 700.