Document Classification using Symbolic Classifiers

M B Revanasiddappa
Department of Information Science & Engineering, SJCE, Mysore, Karnataka, India
revan.cr.is@gmail.com

B S Harish
Member, IEEE
Department of Information Science & Engineering, SJCE, Mysore, Karnataka, India
bsharish@ymail.com

S Manjunath
Department of Computer Science, Central University Kerala, Kasargod, Kerala, India
manju_uom@yahoo.co.in
ABSTRACT
In this paper, we present symbolic classifiers to classify text documents. We propose to use cluster based symbolic representation followed by symbolic feature selection methods to classify text documents. In particular, we propose symbolic clustering approaches: symbolic cluster based without feature selection; symbolic cluster based with feature selection (using a similarity measure); symbolic cluster based with feature selection (using a dissimilarity measure); and symbolic feature clustering approaches. The above mentioned representation methods are very effective in reducing the dimensionality of feature vectors for text classification. To corroborate the efficacy of the proposed model, we conducted extensive experimentation on various standard text datasets. The experimental results reveal that the symbolic feature clustering approach achieves better classification accuracy than the existing cluster based symbolic approaches.
Categories and Subject Descriptors
Data Mining, Machine Learning, Pattern Recognition
General Terms
Algorithms, Experimentation.
Keywords
Symbolic Classifier, Representation, Feature Selection, Text
Documents.
1. INTRODUCTION
Over the past two decades, automatic management of electronic
documents has been a major research field in computer science.
Text documents have become the most common type of
information repositories especially due to the increased popularity
of the internet and the World Wide Web (WWW). Internet and web documents like web pages, e-mails, newsgroup messages, internet news feeds etc., contain millions or even billions of text documents. In recent decades, content-based document management tasks have gained a prominent status in the field of information systems, due to the increased availability of documents in digital form [13], [15].
Earlier, the task of text classification was based on knowledge engineering (KE), where a set of rules was defined manually to encode the expert knowledge on how to classify the documents under the given categories [14]. Since knowledge engineering requires human intervention, researchers in later days proposed many machine learning techniques to automatically manage text documents. The advantages of a machine learning based approach are that its accuracy is comparable to that of human experts and that no intervention from either knowledge engineers or domain experts is needed for the construction of a document management tool [6]. Many text
mining methods like document retrieval, clustering, classification,
routing and filtering are often used for effective management of
text documents. Among these tasks, text classification is the one most commonly used in text information systems. Therefore, devising effective and efficient models for representing and classifying text documents for real time applications is the current requirement.
The task of text classification is to assign a boolean value to each pair $(d_j, k_i) \in D \times K$, where $D$ is the domain of documents and $K$ is a set of predefined categories. The task is to approximate the true function $\Phi : D \times K \to \{1, 0\}$ by means of a function $\hat{\Phi} : D \times K \to \{1, 0\}$ such that $\hat{\Phi}$ and $\Phi$ coincide as much as possible. The function $\hat{\Phi}$ is called a classifier. A classifier can be built by training it systematically using a set of training documents [13], [3].
The major challenges and difficulties that arise in the problem of
text classification are: High dimensionality (thousands of
features), variable length, content and quality of the documents,
sparse distribution of terms in documents, loss of correlation
between adjacent words and understanding complex semantics of
terms in a document [16]. To tackle these problems, a number of methods have been reported in the literature for the effective classification of text documents. Many representation schemes
like binary representation, ontology, N-Grams, multiword terms as
vector, Universal Networking Language, Latent Semantic
Indexing, Locality Preserving Indexing, Regularized Locality
Preserving Indexing are proposed as text representation schemes
for effective text classification [10]. Also, in [11] a new representation model for web documents is proposed. Recently, the Bayes formula was used to vectorize a document according to a probability distribution reflecting the probable categories that the document may belong to. Further, clustering has been used in the text classification literature as an alternative representation scheme for text documents. Several approaches like [7] are used to represent text documents.
All in all, the above mentioned classification algorithms work on
conventional word frequency vector. Conventionally the feature
vectors of term document matrix (very sparse and very high
dimensional feature vector describing a document) are used to
represent the class. Later, this matrix is used to train the system
using different classifiers for classification. Generally, the term document matrix contains the frequencies of occurrence of terms, and the values of the term frequency vary from document to document in the same class. Hence, to preserve these variations, we propose a new interval representation for each document. An initial attempt was made in [4] by giving an interval representation using the maximum and minimum values of the term frequency vectors of the documents. In this paper, however, we use means and standard deviations to give the interval valued representation for documents. Thus, the variations of the term frequencies of documents within a class are assimilated in the form of an interval representation. Moreover, conventional data analysis may not be able to preserve intraclass variations, but unconventional data analysis, such as symbolic data analysis, provides methods for effective representation by preserving intraclass variations.
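As a concrete illustration of the mean/standard-deviation interval idea, the sketch below builds one interval per term from the term frequency vectors of a class. This is our own minimal reconstruction, not the authors' implementation; the function and variable names are ours.

```python
from statistics import mean, stdev

def interval_representation(tf_vectors):
    """Build an interval-valued feature vector [mu - sigma, mu + sigma]
    per term from the term frequency vectors of one class (or cluster)."""
    n_terms = len(tf_vectors[0])
    intervals = []
    for t in range(n_terms):
        column = [doc[t] for doc in tf_vectors]   # term t across documents
        mu = mean(column)
        sigma = stdev(column) if len(column) > 1 else 0.0
        intervals.append((mu - sigma, mu + sigma))
    return intervals

# Three documents of one class, each described by 2 term frequencies.
docs = [[2.0, 4.0], [4.0, 4.0], [6.0, 4.0]]
rep = interval_representation(docs)
```

Here the first term varies across the class, so its interval is wide, while the second term is constant and collapses to a degenerate interval.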
The recent developments in the area of symbolic data analysis have proven that real-life objects can be better described by the use of symbolic data, which are extensions of classical crisp data. The aim of symbolic data analysis [1] is to provide suitable methods for managing aggregated data described by multi-valued variables, where the cells of the data contain sets of categories, intervals, or weight distributions. Symbolic data analysis provides a number of clustering methods for symbolic data. These methods differ in the type of symbolic data considered, in their cluster structures, and/or in the clustering criterion considered. These issues motivated us to use symbolic data rather than conventional crisp data to represent a document. To preserve the intraclass variations, we create multiple clusters for each class. The term frequency vectors of the documents of each cluster are used to form an interval valued feature vector. With this backdrop, the work presented in [4] is extended towards creating multiple representatives per class using clustering after symbolic representation.
The rest of the paper is organized as follows: The proposed
representation and classification stages are presented in section 2.
Details of dataset used, experimental settings and results are
presented in section 3. The paper is concluded in section 4.
2. PROPOSED METHOD
The classification model has three stages: (i) Symbolic
Representation (ii) Symbolic Feature Selection and (iii) Document
Classification.
2.1 Symbolic Representation
Let there be $K$ classes, each containing $N$ documents, where each document is described by an $n$-dimensional term frequency vector. The term document matrix, say $X$, of size $(KN \times n)$ is constructed such that each row represents a document of a class and each column represents a term. We recommend employing Regularized Locality Preserving Indexing [2] on $X$ to obtain the transformed term document matrix $Y$ of size $(KN \times m)$, where $m$ is the number of features chosen out of $n$, which is not necessarily optimal. In order to preserve the intraclass variation in each feature of every document of the $i^{th}$ class, we have proposed a symbolic representation for text documents in [4]. Further, to select the best features from the class representative matrix, we need to study the correlation/covariance present among the individual features of each class. The features which have maximum correlation/covariance shall be selected as the best features to represent the classes. Since the matrix is an interval matrix, we compute a proximity matrix of size $K \times K$, with each element being of multivalued type of dimension $m$, by computing the similarity/dissimilarity among the features using the functions proposed in [9], [12].
Generally, each class contains several documents, which are classified according to their content. In this context, we intend to have an effective representation by providing multiple reference documents (representative vectors) for each class. Therefore, we recommend applying a clustering algorithm (the adaptive fuzzy c-means clustering algorithm) to obtain a number of clusters of documents of the training set for all the classes, and then to have a symbolic representative vector for each cluster of documents [8]. To corroborate the efficacy of the proposed model, we also employed a new method of representing text documents based on a feature clustering approach [7].
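The clustering step can be sketched as follows. The paper uses the adaptive fuzzy c-means algorithm of [8]; as a simplified stand-in, this sketch substitutes a plain k-means (all names are ours) and then forms one interval valued representative per cluster via the mean/standard-deviation rule.

```python
import random
from statistics import mean, stdev

def kmeans(docs, k, iters=20, seed=0):
    """Plain k-means, used here only as a stand-in for the
    adaptive fuzzy c-means clustering of [8]."""
    rng = random.Random(seed)
    centers = rng.sample(docs, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for d in docs:
            # assign to nearest center by squared Euclidean distance
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(d, centers[c])))
            clusters[i].append(d)
        centers = [[mean(col) for col in zip(*cl)] if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return [cl for cl in clusters if cl]   # drop empty clusters

def cluster_representatives(docs, k):
    """One interval valued representative vector per cluster of a class."""
    reps = []
    for cl in kmeans(docs, k):
        rep = []
        for col in zip(*cl):
            mu = mean(col)
            sigma = stdev(col) if len(col) > 1 else 0.0
            rep.append((mu - sigma, mu + sigma))
        reps.append(rep)
    return reps
```

With two well separated groups of documents, the two cluster representatives recover the two group means as interval midpoints.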
2.2 Symbolic Feature Selection
Feature selection is used to identify the useful features and also to
remove the redundant information. Basically, feature selection
methods fall into two broad categories, the filter model and the
wrapper model. The wrapper model requires one predetermined
learning algorithm in feature selection and uses its performance to
evaluate and determine which features are selected. And the filter
model relies on general characteristics of the training data to
select some features without involving any specific learning
algorithm. There is evidence that wrapper methods often perform
better on small scale problems, but on large scale problems, such
as text classification, wrapper methods are shown to be
impractical because of its high computational cost. Hence in this
paper we make use of the filter method to select the best features.
In order to select the best features from the class representative matrix $F$, we need to study the correlation present among the individual features of each class. The features which have maximum correlation shall be selected as the best features to represent the classes. Since $F$ is an interval matrix, we compute a proximity matrix of size $K \times K$, with each element being of multivalued type of dimension $m$, by computing the similarity among the features using the similarity function proposed in [17]. The similarity from class $i$ to class $j$ with respect to the $l^{th}$ feature is given by

$$S_{ij}^{l} = \frac{|I_{il} \cap I_{jl}|}{|I_{jl}|}$$

where $I_{il} = [p_{il}^{-}, p_{il}^{+}],\ l = 1, 2, \dots, m$ are the interval type features of the class $C_i$ and $I_{jl} = [p_{jl}^{-}, p_{jl}^{+}],\ l = 1, 2, \dots, m$ are the interval type features of the class $C_j$.
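Under an interval-overlap reading of such a similarity function (an assumption on our part; the authoritative functional form is the one given in [17]), the computation can be sketched as:

```python
def interval_overlap_similarity(I_i, I_j):
    """Similarity from interval I_i = (lo, hi) to interval I_j:
    length of their overlap divided by the length of I_j.
    Interval-overlap reading of the measure in [17]; names are ours."""
    lo = max(I_i[0], I_j[0])
    hi = min(I_i[1], I_j[1])
    overlap = max(0.0, hi - lo)
    length_j = I_j[1] - I_j[0]
    if length_j == 0.0:  # degenerate interval: fall back to point containment
        return 1.0 if I_i[0] <= I_j[0] <= I_i[1] else 0.0
    return overlap / length_j
```

Note that the measure is asymmetric (it normalizes by the length of the second interval), which is why the text speaks of similarity "from class i to class j".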
Therefore, from the obtained proximity matrix, the matrix $M$ of size $K^2 \times m$ is constructed by listing out each multivalued type element one by one in the form of rows. The standard deviation of each column of the matrix $M$ is computed, and also the average standard deviation over all $m$ columns is computed. Let $TCorr_l$ be the total correlation of the $l^{th}$ column with all other columns of the matrix $M$, and let $AvgTCorr$ be the average of the total correlations obtained over all columns, i.e.,
300 2014 International Conference on Contemporary Computing and Informatics (IC3I)
$$TCorr_l = \sum_{q=1,\ q \neq l}^{m} Corr(l^{th}\ column,\ q^{th}\ column)$$

and

$$AvgTCorr = \frac{1}{m} \sum_{l=1}^{m} TCorr_l$$
We are interested in those features which have high discriminating capability, and thus we recommend selecting those features for which $TCorr_l$ is higher than the average correlation $AvgTCorr$. Further, to compute the dissimilarity based feature selection, we make use of covariance rather than correlation.
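The correlation-based selection rule can be sketched as below. This is an illustrative reconstruction with our own helper names, using Pearson correlation between the columns of $M$ and keeping the columns whose total correlation exceeds the average.

```python
from statistics import mean, stdev

def pearson(x, y):
    """Sample Pearson correlation coefficient between two columns."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (len(x) - 1) * stdev(x) * stdev(y)
    return num / den if den else 0.0

def select_features(M):
    """M: rows of the K^2 x m matrix built from the proximity matrix.
    Return indices of columns whose total correlation with the other
    columns is higher than the average total correlation."""
    m = len(M[0])
    cols = [[row[l] for row in M] for l in range(m)]
    tcorr = [sum(pearson(cols[l], cols[q]) for q in range(m) if q != l)
             for l in range(m)]
    avg = sum(tcorr) / m
    return [l for l in range(m) if tcorr[l] > avg]
```

In the toy matrix below, columns 0 and 1 are perfectly correlated and column 2 only weakly, so the rule keeps the first two.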
2.3 Document Classification
Given a test document $D_q$, represented by a term document feature vector, it is projected onto a lower dimension using RLPI to get a transformed term frequency feature vector of size $m$, whose values are of crisp type. Among these $m$ feature values, only $d$ values are selected by the feature selection method. These $d$ values are then compared with each class representative stored in the knowledge base. Let $F_{D_q} = [p_{D_q}^{1}, p_{D_q}^{2}, p_{D_q}^{3}, \dots, p_{D_q}^{d}]$ be a $d$-dimensional feature vector describing the test document, and let $R_j$ be the interval symbolic feature vector of the $j^{th}$ class. Now, each $l^{th}$ feature value of the test document is compared with the corresponding interval $R_j^{l}$ to examine the similarity between the test document feature value and the interval in $R_j^{l}$. We make use of the similarity measure proposed in [5] to measure the similarity between the test feature vector $F_{D_q}$ and the $j^{th}$ class representative $R_j$:
$$Total\_Sim(F_{D_q}, R_j) = \sum_{l=1}^{d} Sim(p_{D_q}^{l}, [p_{jl}^{-}, p_{jl}^{+}])$$

Here, $[p_{jl}^{-}, p_{jl}^{+}]$ represents the $l^{th}$ feature interval of the $j^{th}$ class, and

$$Sim(p_{D_q}^{l}, [p_{jl}^{-}, p_{jl}^{+}]) = \begin{cases} 1 & \text{if } p_{D_q}^{l} \geq p_{jl}^{-} \text{ and } p_{D_q}^{l} \leq p_{jl}^{+} \\ \max\left( \dfrac{1}{1 + |p_{jl}^{-} - p_{D_q}^{l}|},\ \dfrac{1}{1 + |p_{jl}^{+} - p_{D_q}^{l}|} \right) & \text{otherwise} \end{cases}$$
Similarly, we compute the similarity value for all $K$ classes, and we use the symbolic classifier to classify a given query document.
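The classification step can be sketched as follows; this is our own illustrative reading of the measure in [5] (function names are ours): each crisp feature scores 1 when it falls inside the class interval and decays with distance outside it, and the class with the largest total similarity wins.

```python
def sim(p, interval):
    """Similarity between a crisp feature value p and a class
    interval [lo, hi]: 1 inside, decaying with distance outside."""
    lo, hi = interval
    if lo <= p <= hi:
        return 1.0
    return max(1.0 / (1.0 + abs(lo - p)), 1.0 / (1.0 + abs(hi - p)))

def classify(f_dq, class_reps):
    """Assign the d-dimensional test vector f_dq to the class whose
    interval representative yields the largest total similarity."""
    totals = [sum(sim(p, iv) for p, iv in zip(f_dq, rep))
              for rep in class_reps]
    return max(range(len(totals)), key=lambda j: totals[j])
```

With multiple representatives per class, as proposed above, one would score the test vector against every cluster representative and take the best-scoring cluster's class.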
3. EXPERIMENTAL SETUP
3.1 Datasets
For experimentation we have used the classic Reuters 21578 collection as the benchmark dataset. Originally, Reuters 21578 contains 21578 documents in 135 categories. However, in our experiments we discarded those documents with multiple category labels and selected the ten largest categories. For the smooth conduct of experiments we used the ten largest classes in the Reuters 21578 collection, with the numbers of documents in the training and test sets as follows: earn (2877 vs 1087), trade (369 vs 119), acquisitions (1650 vs 179), interest (347 vs 131), money-fx (538 vs 179), ship (197 vs 89), grain (433 vs 149), wheat (212 vs 71), crude (389 vs 189), corn (182 vs 56). The second dataset is the standard 20 Newsgroup Large. It is one of the standard benchmark datasets used by many text classification research groups. It contains 20000 documents categorized into 20 classes. For our experimentation, we have considered the term document matrix constructed for 20 Newsgroup. The third dataset consists of vehicle characteristics extracted from Wikipedia pages (Vehicles - Wikipedia). It contains four categories of vehicles that have low degrees of inter-category similarity: Aircraft, Boats, Cars and Trains. All four categories are easily differentiated, and every category has a set of unique keywords.
3.2 Experimentation
In this section, we present the results of the experiments conducted to demonstrate the effectiveness of the proposed method on all three datasets, viz., Reuters 21578, 20 Newsgroup, and Vehicles Wikipedia. During experimentation, we conducted two different sets of experiments. In the first set of experiments, we used 50% of the documents of each class of a dataset to create the training set and the remaining 50% of the documents for testing. In the second set of experiments, the numbers of training and testing documents are in the ratio 60:40. Both experiments are repeated 5 times by choosing the training samples randomly. As a measure of goodness of the proposed method, we computed the percentage classification accuracy. The minimum, maximum and average values of the classification accuracy over all 5 trials are presented in Table 1 to Table 3. For both experiments, we randomly selected the training documents to create the symbolic feature vectors for each class.
4. CONCLUSIONS
This paper proposes novel symbolic classifiers to classify text documents. The proposed symbolic classifiers are evaluated based on conventional cluster based symbolic approaches, cluster based symbolic approaches without feature selection, cluster based symbolic approaches with feature selection, and feature clustering approaches. The above mentioned representation methods are very effective in reducing the dimensionality of feature vectors for text classification. To corroborate the efficacy of the proposed model, we conducted extensive experimentation on standard text datasets. The experimental results reveal that the symbolic representation using feature clustering techniques achieves better classification accuracy than the existing cluster based symbolic representation approaches.
Table 1. Classification accuracy on Reuters 21578 dataset

| Training vs Testing | P | A Min | A Max | A Avg | B Min | B Max | B Avg | C Min | C Max | C Avg | D Min | D Max | D Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 50:50 | 2 | 60.85 | 63.50 | 62.55 | 68.40 | 69.95 | 69.10 | 67.20 | 69.10 | 68.45 | 69.85 | 71.45 | 70.65 |
| 50:50 | 3 | 61.45 | 64.10 | 63.90 | 69.90 | 71.25 | 70.95 | 68.85 | 69.95 | 69.45 | 70.15 | 72.35 | 71.55 |
| 50:50 | 4 | 59.90 | 61.20 | 60.85 | 67.55 | 69.50 | 68.20 | 65.45 | 66.30 | 65.60 | 68.90 | 69.55 | 69.10 |
| 60:40 | 2 | 63.55 | 65.20 | 64.20 | 68.90 | 70.65 | 70.10 | 68.50 | 69.25 | 69.25 | 73.50 | 76.10 | 75.85 |
| 60:40 | 3 | 68.40 | 70.25 | 69.55 | 78.50 | 80.10 | 79.55 | 76.45 | 78.10 | 77.25 | 84.65 | 86.75 | 85.55 |
| 60:40 | 4 | 65.85 | 66.90 | 66.10 | 72.50 | 75.85 | 73.20 | 70.20 | 72.55 | 71.25 | 80.55 | 82.65 | 81.45 |
Table 2. Classification accuracy on 20 Newsgroup dataset

| Training vs Testing | P | A Min | A Max | A Avg | B Min | B Max | B Avg | C Min | C Max | C Avg | D Min | D Max | D Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 50:50 | 2 | 79.58 | 82.50 | 81.85 | 81.25 | 86.85 | 83.80 | 79.38 | 86.85 | 81.66 | 80.55 | 81.45 | 80.90 |
| 50:50 | 3 | 84.34 | 87.80 | 86.25 | 86.20 | 87.25 | 86.46 | 84.85 | 87.00 | 86.33 | 87.40 | 89.60 | 88.95 |
| 50:50 | 4 | 77.63 | 79.15 | 78.37 | 76.95 | 81.20 | 79.61 | 78.65 | 80.25 | 79.42 | 85.20 | 85.65 | 85.40 |
| 60:40 | 2 | 82.68 | 84.25 | 83.20 | 83.45 | 87.25 | 85.14 | 82.90 | 86.50 | 84.42 | 86.55 | 88.10 | 87.20 |
| 60:40 | 3 | 88.45 | 90.20 | 89.36 | 93.63 | 90.85 | 92.04 | 90.25 | 91.65 | 90.86 | 91.10 | 92.65 | 91.95 |
| 60:40 | 4 | 79.55 | 81.56 | 80.18 | 79.68 | 83.45 | 81.47 | 79.68 | 81.45 | 80.71 | 88.20 | 89.90 | 89.10 |
Table 3. Classification accuracy on Vehicles Wikipedia dataset

| Training vs Testing | P | A Min | A Max | A Avg | B Min | B Max | B Avg | C Min | C Max | C Avg | D Min | D Max | D Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 50:50 | 2 | 80.65 | 83.65 | 81.54 | 82.19 | 83.90 | 83.11 | 70.50 | 77.50 | 74.79 | 84.35 | 85.40 | 84.90 |
| 50:50 | 3 | 83.75 | 88.20 | 84.55 | 86.25 | 95.00 | 91.39 | 86.26 | 87.35 | 86.92 | 90.05 | 93.45 | 92.15 |
| 50:50 | 4 | 79.80 | 82.50 | 80.86 | 82.00 | 85.00 | 83.84 | 81.26 | 82.00 | 81.70 | 86.70 | 88.90 | 87.45 |
| 60:40 | 2 | 78.33 | 84.63 | 81.95 | 84.12 | 84.86 | 84.31 | 72.50 | 80.62 | 77.62 | 86.70 | 87.20 | 86.95 |
| 60:40 | 3 | 89.55 | 91.50 | 90.86 | 94.00 | 98.00 | 95.74 | 89.20 | 89.50 | 89.40 | 91.25 | 94.50 | 93.60 |
| 60:40 | 4 | 85.20 | 87.85 | 85.62 | 85.00 | 93.75 | 89.60 | 79.50 | 84.50 | 81.83 | 89.10 | 90.10 | 89.40 |
Where,
P: Number of Clusters Selected
A: Symbolic Clustering Approaches
B: Symbolic Cluster Based with Feature Selection (Similarity Based)
C: Symbolic Cluster Based with Feature Selection (Dissimilarity Based)
D: Symbolic Feature Clustering Approaches
Min: Minimum Accuracy Recorded
Max: Maximum Accuracy Recorded
Avg: Average Accuracy Recorded
5. REFERENCES
[1]. Bock H. H. and Diday E., 1999. Analysis of Symbolic Data. Springer Verlag.
[2]. Cai D, He X, Zhang W. V. and Han J., 2007.
Regularized Locality Preserving Indexing via Spectral
Regression. Proceedings of Conference on
Information and Knowledge Management (CIKM’07),
pp. 741 – 750.
[3]. Fung G. P. C., Yu J. X., Lu H. and Yu P. S., 2006.
Text classification without negative example revisit.
IEEE Transactions on Knowledge and Data
Engineering. vol. 18, pp. 23 – 47.
[4]. Guru D. S., Harish B. S. and Manjunath S., 2010. Symbolic representation of text documents. Proceedings of Third Annual ACM Compute, Bangalore.
[5]. Guru D. S and Nagendraswamy H. S., 2006. Symbolic
Representation of Two-Dimensional Shapes. Pattern
Recognition Letters, vol. 28, no. 1, pp. 144 – 155.
[6]. Guru D. S., 2001. Classification of text documents:
An overview, the challenges and future avenues.
Proceedings of the Pre-Conference workshop on
Document Processing, Karnataka, India, pp. 28 – 34.
[7]. Harish B S and Udayasri., 2014. Document
Classification: An Approach Using Feature
Clustering. Recent Advances in Intelligent Informatics
Advances in Intelligent Systems and Computing, Vol.
235, pp. 163-173.
[8]. Harish B S, Bhanu Prasad and B Udayasri., 2014.
Classification of Text Documents using Adaptive
Fuzzy C-Means Clustering. Recent Advances in
Intelligent Informatics Advances in Intelligent
Systems and Computing, Vol. 235, pp. 205-214.
[9]. Harish B S, Guru D S, Manjunath S and Dinesh R.,
2011. Symbolic Similarity and Symbolic Feature
Selection for Text Classification. Bilateral Russian-
Indian Scientific Workshop on Emerging Applications
of Computer Vision, November 1 – 5, 2011, Moscow,
Russia, pp. 141 – 146.
[10]. Harish B S, Manjunath S and Guru D S, 2010.
Representation and Classification of Text Documents:
A Brief Review. International Journal of Computer
Applications Special Issue on Recent Trends in Image
Processing and Pattern Recognition, pp. 110 – 119.
[11]. Isa D., Lee L. H., Kallimani V. P. and Rajkumar R.,
2008. Text document preprocessing with the Bayes formula for classification using the support vector
machine. IEEE Transactions on Knowledge and Data
Engineering, vol. 20, no. 9, pp. 23 – 31.
[12]. Manjunath S, Harish B S and Guru D S., 2011.
Dissimilarity Based Feature Selection for Text
Classification: A Cluster Based Approach. In the
proceedings of ACM International Conference and
Workshop on Emerging Trends and Technology,
Association of Computing Machinery, New York,
USA, pp. 495 – 499.
[13]. Rigutini L., 2004. Automatic text processing: Machine
learning techniques. Ph.D. Thesis, University of
Siena.
[14]. Salton G and Buckley C, 1988. Term weighting
approaches in automatic text retrieval. Journal of
Information Processing and Management, vol. 24, no.
5, pp. 513 – 523.
[15]. Song F., S. Liu and J. Yang, 2005. A comparative
study on text representation schemes in text
categorization. Journal of Pattern Analysis
Application, vol. 8, pp. 199 – 209.
[16]. Zeimpekis D and E. Gallopoulos, 2006. TMG: A
MATLAB Toolbox for generating term-document
matrices from text collections. Springer Publications,
Berlin, pp. 187 – 210.
[17]. Guru D.S., Kiranagi B. B., Nagabhushan P, 2004.
Multivalued type proximity measure and concept of
mutual similarity value useful for clustering symbolic
patterns. Journal of Pattern Recognition Letters, vol.
25, pp. 1003 – 1013.