Document Classification using Symbolic Classifiers

M B Revanasiddappa
Department of Information Science & Engineering, SJCE, Mysore, Karnataka, India
revan.cr.is@gmail.com

B S Harish, Member, IEEE
Department of Information Science & Engineering, SJCE, Mysore, Karnataka, India
bsharish@ymail.com

S Manjunath
Department of Computer Science, Central University Kerala, Kasargod, Kerala, India
manju_uom@yahoo.co.in
ABSTRACT
In this paper, we present symbolic classifiers to classify text documents. We propose to use a cluster based symbolic representation followed by symbolic feature selection methods to classify text documents. In particular, we propose symbolic clustering approaches: symbolic cluster based representation without feature selection, symbolic cluster based representation with feature selection (using a similarity measure), symbolic cluster based representation with feature selection (using a dissimilarity measure), and symbolic feature clustering approaches. The above mentioned representation methods are very powerful in reducing the dimensionality of feature vectors for text classification. To corroborate the efficacy of the proposed model, we conducted extensive experimentation on various standard text datasets. The experimental results reveal that the symbolic feature clustering approach achieves better classification accuracy than the existing cluster based symbolic approaches.
Categories and Subject Descriptors
Data Mining, Machine Learning, Pattern Recognition
General Terms
Algorithms, Experimentation.
Keywords
Symbolic Classifier, Representation, Feature Selection, Text
Documents.
1. INTRODUCTION
Over the past two decades, automatic management of electronic documents has been a major research field in computer science. Text documents have become the most common type of information repository, especially due to the increased popularity of the internet and the World Wide Web (WWW). Internet and web sources like web pages, e-mails, newsgroup messages, internet news feeds etc. contain millions or even billions of text documents. In recent decades, content-based document management tasks have gained a prominent status in the field of information systems, due to the increased availability of documents in digital form [13], [15].

Earlier, the task of text classification was based on knowledge engineering (KE), where a set of rules was defined manually to encode the expert knowledge on how to classify the documents under the given categories [14]. Since knowledge engineering requires human intervention, researchers have since proposed many machine learning techniques to automatically manage text documents. The advantages of a machine learning based approach are that the accuracy is comparable to that of human experts and that no intervention from either knowledge engineers or domain experts is needed for the construction of a document management tool [6]. Many text mining methods like document retrieval, clustering, classification, routing and filtering are often used for effective management of text documents. Among these tasks, text classification is the one most commonly used in text information systems. Therefore, devising effective and efficient models for the representation and classification of text documents in real time applications is the current requirement.
The task of text classification is to assign a boolean value to each pair $(d_j, k_i) \in D \times K$, where $D$ is the domain of documents and $K$ is a set of predefined categories. The task is to approximate the true function $\Phi : D \times K \to \{1, 0\}$ by means of a function $\hat{\Phi} : D \times K \to \{1, 0\}$ such that $\hat{\Phi}$ and $\Phi$ coincide as much as possible. The function $\hat{\Phi}$ is called a classifier. A classifier can be built by training it systematically using a set of training documents [13], [3].
The major challenges and difficulties that arise in the problem of text classification are: high dimensionality (thousands of features), variable length, content and quality of the documents, sparse distribution of terms in documents, loss of correlation between adjacent words, and understanding the complex semantics of terms in a document [16]. To tackle these problems, a number of methods have been reported in the literature for effective classification of text documents. Many representation schemes like binary representation, ontology, N-Grams, multiword terms as vector, Universal Networking Language, Latent Semantic Indexing, Locality Preserving Indexing and Regularized Locality Preserving Indexing have been proposed as text representation schemes for effective text classification [10]. Also, in [11] a new representation model for web documents is proposed. Recently, the Bayes formula was used to vectorize a document according to a probability distribution reflecting the probable categories that the document may belong to. Further, clustering has been used in the text classification literature as an alternative representation scheme for text documents. Several approaches like [7] are used to represent text documents.

All in all, the above mentioned classification algorithms work on conventional word frequency vectors. Conventionally, the feature vectors of the term document matrix (a very sparse and very high dimensional feature vector describing a document) are used to represent the class. Later, this matrix is used to train the system
using different classifiers for classification. Generally, the term document matrix contains the frequency of occurrences of terms, and the values of the term frequency vary from document to document in the same class. Hence, to preserve these variations, we propose a new interval representation for each document. An initial attempt was made in [4] by giving an interval representation using the maximum and minimum values of the term frequency vectors of the documents. In this paper, however, we use the mean and standard deviation to give the interval valued representation for documents. Thus, the variations of the term frequencies of documents within a class are assimilated in the form of an interval representation. Moreover, conventional data analysis may not be able to preserve intraclass variations, but unconventional data analysis, such as symbolic data analysis, provides methods for effective representation by preserving intraclass variations.
The recent developments in the area of symbolic data analysis have proven that real-life objects can be better described by the use of symbolic data, which are extensions of classical crisp data. The aim of symbolic data analysis [1] is to provide suitable methods for managing aggregated data described by multivalued variables, where the cells of the data contain sets of categories, intervals, or weight distributions. Symbolic data analysis provides a number of clustering methods for symbolic data. These methods differ in the type of symbolic data considered, in their cluster structures and/or in the clustering criterion considered. These issues motivated us to use symbolic data rather than conventional crisp data to represent a document. To preserve the intraclass variations, we create multiple clusters for each class. The term frequency vectors of the documents of each cluster are used to form an interval valued feature vector. With this backdrop, the work presented in [4] is extended towards creating multiple representatives per class using clustering after symbolic representation.
The rest of the paper is organized as follows: the proposed representation and classification stages are presented in Section 2. Details of the datasets used, the experimental settings and the results are presented in Section 3. The paper is concluded in Section 4.
2. PROPOSED METHOD
The classification model has three stages: (i) Symbolic Representation, (ii) Symbolic Feature Selection and (iii) Document Classification.
2.1 Symbolic Representation
Let there be $K$ classes, each containing $N$ documents, where each document is described by an $n$-dimensional term frequency vector. The term document matrix, say $X$, of size $(K \cdot N) \times n$ is constructed such that each row represents a document of a class and each column represents a term. We recommend employing Regularized Locality Preserving Indexing [2] on $X$ to obtain the transformed term document matrix $Y$ of size $(K \cdot N) \times m$, where $m$ is the number of features chosen out of $n$, which is not necessarily optimal. In order to preserve the intraclass variation in each feature of every document of the $i^{th}$ class, we have proposed a symbolic representation for text documents in [4]. Further, to select the best features from the class representative matrix, we need to study the correlation/covariance present among the individual features of each class. The features which have maximum correlation/covariance shall be selected as the best features to represent the classes. Since the matrix is an interval matrix, we compute a proximity matrix of size $K \times K$, with each element being of multivalued type of dimension $m$, by computing the similarity/dissimilarity among the features using the functions proposed in [9], [12].
Generally, each class contains several documents, which are classified according to the content of the document. In this context, we intend to have an effective representation by providing multiple reference documents (representative vectors) for each class. Therefore, we recommend applying a clustering algorithm (the adaptive fuzzy c-means clustering algorithm) to obtain a number of clusters of documents of the training set for all the classes, and then to have a symbolic representative vector for each cluster of documents [8]. To corroborate the efficacy of the proposed model, we also employ a new method of representing text documents based on the feature clustering approach [7].
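
To make this step concrete, the following minimal sketch builds one interval-valued representative per cluster from the (RLPI-transformed) term frequency vectors of a class, using mean ± standard deviation as described above. It substitutes ordinary k-means for the adaptive fuzzy c-means of [8], and all names (`interval_representation`, `term_freq_vectors`) are illustrative, not from the original implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def interval_representation(term_freq_vectors, n_clusters=3):
    """Build one interval-valued representative per cluster.

    term_freq_vectors: (N, m) matrix of transformed term frequency
    vectors for the documents of a single class.
    Returns a list of (lower, upper) bound-vector pairs, one per cluster,
    where lower = mean - std and upper = mean + std per feature.
    """
    # Plain k-means stands in here for the adaptive fuzzy c-means of [8].
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(term_freq_vectors)
    representatives = []
    for c in range(n_clusters):
        members = term_freq_vectors[labels == c]
        mu, sigma = members.mean(axis=0), members.std(axis=0)
        representatives.append((mu - sigma, mu + sigma))  # one interval per feature
    return representatives
```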
2.2 Symbolic Feature Selection
Feature selection is used to identify the useful features and also to remove redundant information. Basically, feature selection methods fall into two broad categories: the filter model and the wrapper model. The wrapper model requires one predetermined learning algorithm in feature selection and uses its performance to evaluate and determine which features are selected. The filter model, on the other hand, relies on general characteristics of the training data to select features without involving any specific learning algorithm. There is evidence that wrapper methods often perform better on small scale problems, but on large scale problems, such as text classification, wrapper methods are shown to be impractical because of their high computational cost. Hence, in this paper we make use of the filter method to select the best features.
In order to select the best features from the class representative matrix $F$, we need to study the correlation present among the individual features of each class. The features which have maximum correlation shall be selected as the best features to represent the classes. Since $F$ is an interval matrix, we compute a proximity matrix of size $K \times K$, with each element being of multivalued type of dimension $m$, by computing the similarity among the features using the similarity function proposed in [17]. The similarity from class $i$ to class $j$ with respect to the $l^{th}$ feature is given by

$$S_{ij}^{l} = \left( \frac{I_{il} \cap I_{jl}}{I_{jl}} \right)$$

where $I_{il} = [p_{il}^{-}, p_{il}^{+}],\ l = 1, 2, \ldots, m$ are the interval type features of the class $C_i$ and $I_{jl} = [p_{jl}^{-}, p_{jl}^{+}],\ l = 1, 2, \ldots, m$ are the interval type features of the class $C_j$.
Therefore, from the obtained proximity matrix, the matrix $M$ of size $K^2 \times m$ is constructed by listing out each multivalued type element one by one in the form of rows. The standard deviation for each column of the matrix $M$ is computed, and also the average standard deviation due to all $m$ columns is computed. Let $TCorr_l$ be the total correlation of the $l^{th}$ column with all other columns of the matrix $M$, and let $AvgTCorr$ be the average of all total correlations obtained due to all columns, i.e.,
$$TCorr_l = \sum_{q=1}^{m} Corr(l^{th}\ \text{Column},\ q^{th}\ \text{Column})$$

and

$$AvgTCorr = \frac{\sum_{l=1}^{m} TCorr_l}{m}$$
We are interested in those features which have high discriminating capability, and thus we recommend selecting those features for which $TCorr_l$ is higher than the average correlation $AvgTCorr$. Further, to compute the dissimilarity based feature selection, we make use of covariance rather than correlation.
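
A sketch of this selection step follows, under two stated assumptions: the overlap-based reading of the asymmetric similarity measure of [17] given above, and NumPy's `corrcoef` as the column correlation. Function names are illustrative.

```python
import numpy as np

def interval_similarity(a, b):
    """Similarity of interval a to interval b as their overlap relative to
    the width of b -- one plausible reading of the overlap-based measure
    of [17]; note it is asymmetric by construction."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    width = b[1] - b[0]
    return max(0.0, hi - lo) / width if width > 0 else 0.0

def select_features(class_intervals):
    """class_intervals: (K, m, 2) array, one [lower, upper] interval per
    class and feature. Returns indices of features whose total column
    correlation in M exceeds the average, as in Section 2.2."""
    K, m, _ = class_intervals.shape
    # Proximity matrix: an m-dimensional similarity vector per ordered class
    # pair, listed row by row to form M of size (K*K, m).
    M = np.array([[interval_similarity(class_intervals[i, l], class_intervals[j, l])
                   for l in range(m)]
                  for i in range(K) for j in range(K)])
    corr = np.corrcoef(M, rowvar=False)   # (m, m) correlations between columns
    tcorr = corr.sum(axis=1) - 1.0        # total correlation, minus the self term
    return np.where(tcorr > tcorr.mean())[0]
```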
2.3 Document Classification
A given test document $D_q$, represented by a term document feature vector, is projected onto a lower dimension using RLPI to get a transformed term frequency feature vector of size $m$, whose values are of crisp type. Among these $m$ feature values, only $d$ values are selected by the feature selection method. These $d$ values are then compared with each class representative stored in the knowledge base. Let $F_{D_q} = [p_{D_q}^{1}, p_{D_q}^{2}, p_{D_q}^{3}, \ldots, p_{D_q}^{d}]$ be a $d$-dimensional feature vector describing the test document. Let $R_j$ be the interval symbolic feature vector of the $j^{th}$ class. Now, each $l^{th}$ feature value of the test document is compared with the corresponding interval $R_j^{l}$ to examine the similarity between the test document feature value and the interval in $R_j^{l}$. We make use of the similarity measure proposed in [5] to measure the similarity between the test feature vector $F_{D_q}$ and the $j^{th}$ class representative $R_j$:
$$Total\_Sim(F_{D_q}, R_j) = \sum_{l=1}^{d} Sim(p_{D_q}^{l}, [p_{jl}^{-}, p_{jl}^{+}])$$

Here, $[p_{jl}^{-}, p_{jl}^{+}]$ represents the $l^{th}$ feature interval of the $j^{th}$ class, and

$$Sim(p_{D_q}^{l}, [p_{jl}^{-}, p_{jl}^{+}]) = \begin{cases} 1 & \text{if } p_{D_q}^{l} \geq p_{jl}^{-} \text{ and } p_{D_q}^{l} \leq p_{jl}^{+} \\ \max\left( \dfrac{1}{1 + |p_{jl}^{-} - p_{D_q}^{l}|},\ \dfrac{1}{1 + |p_{D_q}^{l} - p_{jl}^{+}|} \right) & \text{otherwise} \end{cases}$$
Similarly, we compute the similarity value for all $K$ classes and use the symbolic classifier to classify a given query document.
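
A compact sketch of this classification rule is given below, assuming for simplicity one interval representative per class (with multiple cluster representatives, one would take the best-matching representative per class); names are illustrative.

```python
import numpy as np

def sim(p, lower, upper):
    """Similarity of a crisp value p to the interval [lower, upper], as in
    the measure of [5]: 1 inside the interval, otherwise decaying with the
    distance to the nearer interval bound."""
    if lower <= p <= upper:
        return 1.0
    return max(1.0 / (1.0 + abs(lower - p)), 1.0 / (1.0 + abs(p - upper)))

def classify(query, class_reps):
    """query: length-d crisp feature vector of the test document.
    class_reps: (K, d, 2) array of interval class representatives.
    Returns the index of the class with the largest total similarity."""
    totals = [sum(sim(p, lo, hi) for p, (lo, hi) in zip(query, rep))
              for rep in class_reps]
    return int(np.argmax(totals))
```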
3. EXPERIMENTAL SETUP
3.1 Datasets
For experimentation we have used the classic Reuters 21578 collection as the benchmark dataset. Originally, Reuters 21578 contains 21578 documents in 135 categories. However, in our experiments we discarded those documents with multiple category labels and selected the ten largest categories, with the numbers of documents in the training and test sets as follows: earn (2877 vs 1087), trade (369 vs 119), acquisitions (1650 vs 179), interest (347 vs 131), money-fx (538 vs 179), ship (197 vs 89), grain (433 vs 149), wheat (212 vs 71), crude (389 vs 189) and corn (182 vs 56). The second dataset is the standard 20 Newsgroup Large collection, one of the standard benchmark datasets used by many text classification research groups. It contains 20000 documents categorized into 20 classes. For our experimentation, we have considered the term document matrix constructed for 20 Newsgroup. The third dataset consists of vehicle characteristics extracted from Wikipedia pages (Vehicles - Wikipedia). It contains four categories of vehicles with low degrees of inter-class similarity: Aircraft, Boats, Cars and Trains. All four categories are easily differentiated and every category has a set of unique keywords.
3.2 Experimentation
In this section, we present the results of the experiments conducted to demonstrate the effectiveness of the proposed method on all three datasets, viz. Reuters 21578, 20 Newsgroup, and Vehicles Wikipedia. We conducted two different sets of experiments. In the first set, we used 50% of the documents of each class of a dataset to create the training set and the remaining 50% of the documents for testing. In the second set, the numbers of training and testing documents are in the ratio 60:40. Both experiments are repeated 5 times by choosing the training samples randomly. As the measure of goodness of the proposed method, we computed the percentage of classification accuracy. The minimum, maximum and average values of the classification accuracy over all 5 trials are presented in Table 1 to Table 3. For both experiments, the training documents used to create the symbolic feature vectors for each class were selected randomly.
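
This evaluation protocol can be summarised by the following sketch: per-class random splits (50:50 or 60:40) repeated five times, reporting min/max/avg accuracy. The full symbolic pipeline is passed in as a callable; `train_and_test` is a placeholder name, not part of the paper.

```python
import numpy as np

def evaluate(X, y, train_and_test, train_frac=0.5, trials=5, seed=42):
    """Random per-class splits repeated `trials` times, reporting the
    min/max/avg accuracy as in Tables 1-3. `train_and_test` stands in for
    the whole pipeline (RLPI projection, clustering, interval
    representation, feature selection, symbolic classification) and must
    return an accuracy percentage."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(trials):
        train_idx = []
        for c in np.unique(y):              # split each class separately
            members = np.where(y == c)[0]
            rng.shuffle(members)
            train_idx.extend(members[: int(train_frac * len(members))])
        mask = np.zeros(len(y), dtype=bool)
        mask[train_idx] = True
        accs.append(train_and_test(X[mask], y[mask], X[~mask], y[~mask]))
    return min(accs), max(accs), float(np.mean(accs))
```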
4. CONCLUSIONS
This paper proposes novel symbolic classifiers to classify text documents. The proposed symbolic classifiers are evaluated on conventional cluster based symbolic approaches, cluster based symbolic approaches without feature selection, cluster based symbolic approaches with feature selection, and feature clustering approaches. The above mentioned representation methods are very powerful in reducing the dimensionality of feature vectors for text classification. To corroborate the efficacy of the proposed model, we conducted extensive experimentation on standard text datasets. The experimental results reveal that the symbolic representation using feature clustering techniques achieves better classification accuracy than the existing cluster based symbolic representation approaches.
Table 1. Classification accuracy on Reuters 21578 dataset

Training vs Testing | P |    A (Min/Max/Avg)    |    B (Min/Max/Avg)    |    C (Min/Max/Avg)    |    D (Min/Max/Avg)
50:50               | 2 | 60.85 / 63.50 / 62.55 | 68.40 / 69.95 / 69.10 | 67.20 / 69.10 / 68.45 | 69.85 / 71.45 / 70.65
                    | 3 | 61.45 / 64.10 / 63.90 | 69.90 / 71.25 / 70.95 | 68.85 / 69.95 / 69.45 | 70.15 / 72.35 / 71.55
                    | 4 | 59.90 / 61.20 / 60.85 | 67.55 / 69.50 / 68.20 | 65.45 / 66.30 / 65.60 | 68.90 / 69.55 / 69.10
60:40               | 2 | 63.55 / 65.20 / 64.20 | 68.90 / 70.65 / 70.10 | 68.50 / 69.25 / 69.25 | 73.50 / 76.10 / 75.85
                    | 3 | 68.40 / 70.25 / 69.55 | 78.50 / 80.10 / 79.55 | 76.45 / 78.10 / 77.25 | 84.65 / 86.75 / 85.55
                    | 4 | 65.85 / 66.90 / 66.10 | 72.50 / 75.85 / 73.20 | 70.20 / 72.55 / 71.25 | 80.55 / 82.65 / 81.45
Table 2. Classification accuracy on 20 Newsgroup dataset

Training vs Testing | P |    A (Min/Max/Avg)    |    B (Min/Max/Avg)    |    C (Min/Max/Avg)    |    D (Min/Max/Avg)
50:50               | 2 | 79.58 / 82.50 / 81.85 | 81.25 / 86.85 / 83.80 | 79.38 / 86.85 / 81.66 | 80.55 / 81.45 / 80.90
                    | 3 | 84.34 / 87.80 / 86.25 | 86.20 / 87.25 / 86.46 | 84.85 / 87.00 / 86.33 | 87.40 / 89.60 / 88.95
                    | 4 | 77.63 / 79.15 / 78.37 | 76.95 / 81.20 / 79.61 | 78.65 / 80.25 / 79.42 | 85.20 / 85.65 / 85.40
60:40               | 2 | 82.68 / 84.25 / 83.20 | 83.45 / 87.25 / 85.14 | 82.90 / 86.50 / 84.42 | 86.55 / 88.10 / 87.20
                    | 3 | 88.45 / 90.20 / 89.36 | 90.85 / 93.63 / 92.04 | 90.25 / 91.65 / 90.86 | 91.10 / 92.65 / 91.95
                    | 4 | 79.55 / 81.56 / 80.18 | 79.68 / 83.45 / 81.47 | 79.68 / 81.45 / 80.71 | 88.20 / 89.90 / 89.10
Table 3. Classification accuracy on Vehicles Wikipedia dataset

Training vs Testing | P |    A (Min/Max/Avg)    |    B (Min/Max/Avg)    |    C (Min/Max/Avg)    |    D (Min/Max/Avg)
50:50               | 2 | 80.65 / 83.65 / 81.54 | 82.19 / 83.90 / 83.11 | 70.50 / 77.50 / 74.79 | 84.35 / 85.40 / 84.90
                    | 3 | 83.75 / 88.20 / 84.55 | 86.25 / 95.00 / 91.39 | 86.26 / 87.35 / 86.92 | 90.05 / 93.45 / 92.15
                    | 4 | 79.80 / 82.50 / 80.86 | 82.00 / 85.00 / 83.84 | 81.26 / 82.00 / 81.70 | 86.70 / 88.90 / 87.45
60:40               | 2 | 78.33 / 84.63 / 81.95 | 84.12 / 84.86 / 84.31 | 72.50 / 80.62 / 77.62 | 86.70 / 87.20 / 86.95
                    | 3 | 89.55 / 91.50 / 90.86 | 94.00 / 98.00 / 95.74 | 89.20 / 89.50 / 89.40 | 91.25 / 94.50 / 93.60
                    | 4 | 85.20 / 87.85 / 85.62 | 85.00 / 93.75 / 89.60 | 79.50 / 84.50 / 81.83 | 89.10 / 90.10 / 89.40
Where,
P: Number of Clusters Selected
A: Symbolic Clustering Approaches
B: Symbolic Cluster Based with Feature Selection (Similarity Based)
C: Symbolic Cluster Based with Feature Selection (Dissimilarity Based)
D: Symbolic Feature Clustering Approaches
Min: Minimum Accuracy Recorded
Max: Maximum Accuracy Recorded
Avg: Average of Min and Max Accuracy
5. REFERENCES
[1] Bock H. H. and Diday E., 1999. Analysis of Symbolic Data. Springer Verlag.
[2] Cai D., He X., Zhang W. V. and Han J., 2007. Regularized Locality Preserving Indexing via Spectral Regression. Proceedings of the Conference on Information and Knowledge Management (CIKM'07), pp. 741–750.
[3] Fung G. P. C., Yu J. X., Lu H. and Yu P. S., 2006. Text classification without negative examples revisit. IEEE Transactions on Knowledge and Data Engineering, vol. 18, pp. 23–47.
[4] Guru D. S., Harish B. S. and Manjunath S., 2010. Symbolic representation of text documents. Proceedings of the Third Annual ACM Compute, Bangalore.
[5] Guru D. S. and Nagendraswamy H. S., 2006. Symbolic Representation of Two-Dimensional Shapes. Pattern Recognition Letters, vol. 28, no. 1, pp. 144–155.
[6] Guru D. S., 2001. Classification of text documents: An overview, the challenges and future avenues. Proceedings of the Pre-Conference Workshop on Document Processing, Karnataka, India, pp. 28–34.
[7] Harish B. S. and Udayasri B., 2014. Document Classification: An Approach Using Feature Clustering. Recent Advances in Intelligent Informatics, Advances in Intelligent Systems and Computing, vol. 235, pp. 163–173.
[8] Harish B. S., Bhanu Prasad and Udayasri B., 2014. Classification of Text Documents using Adaptive Fuzzy C-Means Clustering. Recent Advances in Intelligent Informatics, Advances in Intelligent Systems and Computing, vol. 235, pp. 205–214.
[9] Harish B. S., Guru D. S., Manjunath S. and Dinesh R., 2011. Symbolic Similarity and Symbolic Feature Selection for Text Classification. Bilateral Russian-Indian Scientific Workshop on Emerging Applications of Computer Vision, November 1–5, 2011, Moscow, Russia, pp. 141–146.
[10] Harish B. S., Manjunath S. and Guru D. S., 2010. Representation and Classification of Text Documents: A Brief Review. International Journal of Computer Applications, Special Issue on Recent Trends in Image Processing and Pattern Recognition, pp. 110–119.
[11] Isa D., Lee L. H., Kallimani V. P. and Rajkumar R., 2008. Text document preprocessing with the Bayes formula for classification using the support vector machine. IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 9, pp. 23–31.
[12] Manjunath S., Harish B. S. and Guru D. S., 2011. Dissimilarity Based Feature Selection for Text Classification: A Cluster Based Approach. Proceedings of the ACM International Conference and Workshop on Emerging Trends and Technology, Association for Computing Machinery, New York, USA, pp. 495–499.
[13] Rigutini L., 2004. Automatic text processing: Machine learning techniques. Ph.D. Thesis, University of Siena.
[14] Salton G. and Buckley C., 1988. Term weighting approaches in automatic text retrieval. Journal of Information Processing and Management, vol. 24, no. 5, pp. 513–523.
[15] Song F., Liu S. and Yang J., 2005. A comparative study on text representation schemes in text categorization. Journal of Pattern Analysis and Application, vol. 8, pp. 199–209.
[16] Zeimpekis D. and Gallopoulos E., 2006. TMG: A MATLAB Toolbox for generating term-document matrices from text collections. Springer Publications, Berlin, pp. 187–210.
[17] Guru D. S., Kiranagi B. B. and Nagabhushan P., 2004. Multivalued type proximity measure and concept of mutual similarity value useful for clustering symbolic patterns. Pattern Recognition Letters, vol. 25, pp. 1003–1013.