Document Classification using Symbolic Classifiers

M B Revanasiddappa
Department of Information Science & Engineering, SJCE, Mysore, Karnataka, India
revan.cr.is@gmail.com

B S Harish
Member, IEEE
Department of Information Science & Engineering, SJCE, Mysore, Karnataka, India
bsharish@ymail.com

S Manjunath
Department of Computer Science, Central University Kerala, Kasargod, Kerala, India
manju_uom@yahoo.co.in
ABSTRACT
In this paper, we present symbolic classifiers to classify text documents. We propose to use cluster based symbolic representation followed by symbolic feature selection methods to classify text documents. In particular, we propose symbolic clustering approaches: symbolic cluster based without feature selection; symbolic cluster based with feature selection (using a similarity measure); symbolic cluster based with feature selection (using a dissimilarity measure); and symbolic feature clustering approaches. The above mentioned representation methods are very effective in reducing the dimensionality of feature vectors for text classification. To corroborate the efficacy of the proposed model, we conducted extensive experimentation on various standard text datasets. The experimental results reveal that the symbolic feature clustering approach achieves better classification accuracy than the existing cluster based symbolic approaches.
Categories and Subject Descriptors
Data Mining, Machine Learning, Pattern Recognition
General Terms
Algorithms, Experimentation.
Keywords
Symbolic Classifier, Representation, Feature Selection, Text
Documents.
1. INTRODUCTION
Over the past two decades, automatic management of electronic
documents has been a major research field in computer science.
Text documents have become the most common type of
information repositories especially due to the increased popularity
of the internet and the World Wide Web (WWW). Internet and web documents like web pages, e-mails, newsgroup messages, internet news feeds etc., contain millions or even billions of text documents. In recent decades, content-based document management tasks have gained a prominent status in the field of information systems, due to the increased availability of documents in digital form [13], [15].
Earlier, the task of text classification was based on knowledge engineering (KE), where a set of rules was defined manually to encode the expert knowledge on how to classify the documents under the given categories [14]. Since knowledge engineering requires human intervention, researchers in later days proposed many machine learning techniques to automatically manage text documents. The advantages of a machine learning based approach are that its accuracy is comparable to that of human experts and that no intervention from either knowledge engineers or domain experts is needed for the construction of a document management tool [6]. Many text
mining methods like document retrieval, clustering, classification,
routing and filtering are often used for effective management of
text documents. Among these tasks, text classification is the one most commonly used in text information systems. Therefore, devising effective and efficient models for representing and classifying text documents for real time applications is the current requirement.
The task of text classification is to assign a boolean value to each pair $(d_j, k_i) \in D \times K$, where $D$ is the domain of documents and $K$ is a set of predefined categories. The task is to approximate the true function $\Phi : D \times K \to \{1, 0\}$ by means of a function $\hat{\Phi} : D \times K \to \{1, 0\}$ such that $\hat{\Phi}$ and $\Phi$ coincide as much as possible. The function $\hat{\Phi}$ is called a classifier. A classifier can be built by training it systematically using a set of training documents [13], [3].
The major challenges and difficulties that arise in the problem of
text classification are: High dimensionality (thousands of
features), variable length, content and quality of the documents,
sparse distribution of terms in documents, loss of correlation
between adjacent words and understanding complex semantics of
terms in a document [16]. To tackle these problems, a number of methods have been reported in the literature for the effective classification of text documents. Many representation schemes
like binary representation, ontology, N-Grams, multiword terms as
vector, Universal Networking Language, Latent Semantic
Indexing, Locality Preserving Indexing, Regularized Locality
Preserving Indexing are proposed as text representation schemes
for effective text classification [10]. Also, in [11] a new representation model for web documents is proposed. Recently, the Bayes formula was used to vectorize a document according to a probability distribution reflecting the probable categories that the document may belong to. Further, clustering has been used in the text classification literature as an alternative representation scheme for text documents. Several approaches like [7] are used to represent text documents.
All in all, the above mentioned classification algorithms work on
conventional word frequency vector. Conventionally the feature
vectors of term document matrix (very sparse and very high
dimensional feature vector describing a document) are used to
represent the class. Later, this matrix is used to train the system
using different classifiers for classification. Generally, the term document matrix contains the frequencies of occurrence of terms, and the values of the term frequency vary from document to document in the same class. Hence, to preserve these variations, we propose a new interval representation for each document. An initial attempt was made in [4] by giving an interval representation using the maximum and minimum values of the term frequency vectors of the documents. In this paper, however, we use means and standard deviations to give the interval valued representation for documents. Thus, the variations of the term frequencies of documents within a class are assimilated in the form of an interval representation. Moreover, conventional data analysis may not be able to preserve intraclass variations, but unconventional data analysis, such as symbolic data analysis, provides methods for effective representation by preserving intraclass variations.
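As a concrete illustration of the mean/standard-deviation interval idea, the sketch below builds one interval per term from the term frequency vectors of a class. This is our own minimal reconstruction, not the authors' implementation; the function and variable names are ours.

```python
from statistics import mean, stdev

def interval_representation(tf_vectors):
    """Build an interval-valued feature vector [mu - sigma, mu + sigma]
    per term from the term frequency vectors of one class (or cluster)."""
    n_terms = len(tf_vectors[0])
    intervals = []
    for t in range(n_terms):
        column = [doc[t] for doc in tf_vectors]   # term t across documents
        mu = mean(column)
        sigma = stdev(column) if len(column) > 1 else 0.0
        intervals.append((mu - sigma, mu + sigma))
    return intervals

# Three documents of one class, each described by 2 term frequencies.
docs = [[2.0, 4.0], [4.0, 4.0], [6.0, 4.0]]
rep = interval_representation(docs)
```

Here the first term varies across the class, so its interval is wide, while the second term is constant and collapses to a degenerate interval.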
The recent developments in the area of symbolic data analysis have proven that real-life objects can be better described by the use of symbolic data, which are extensions of classical crisp data. The aim of symbolic data analysis [1] is to provide suitable methods for managing aggregated data described by multi-valued variables, where the cells of the data contain sets of categories, intervals, or weight distributions. Symbolic data analysis provides a number of clustering methods for symbolic data. These methods differ in the type of symbolic data considered, in their cluster structures, and/or in the clustering criterion considered. These issues motivated us to use symbolic data rather than conventional crisp data to represent a document. To preserve the intraclass variations, we create multiple clusters for each class. The term frequency vectors of the documents of each cluster are used to form an interval valued feature vector. With this backdrop, the work presented in [4] is extended towards creating multiple representatives per class using clustering after symbolic representation.
The rest of the paper is organized as follows: The proposed
representation and classification stages are presented in section 2.
Details of dataset used, experimental settings and results are
presented in section 3. The paper is concluded in section 4.
2. PROPOSED METHOD
The classification model has three stages: (i) Symbolic
Representation (ii) Symbolic Feature Selection and (iii) Document
Classification.
2.1 Symbolic Representation
Let there be $K$ classes, each containing $N$ documents, where each document is described by an $n$-dimensional term frequency vector. The term document matrix, say $X$, of size $(KN \times n)$ is constructed such that each row represents a document of a class and each column represents a term. We recommend employing Regularized Locality Preserving Indexing [2] on $X$ to obtain the transformed term document matrix $Y$ of size $(KN \times m)$, where $m$ is the number of features chosen out of $n$, which is not necessarily optimal. In order to preserve the intraclass variation in each feature of every document of the $i^{th}$ class, we have proposed a symbolic representation for text documents in [4]. Further, to select the best features from the class representative matrix, we need to study the correlation/covariance present among the individual features of each class. The features which have maximum correlation/covariance shall be selected as the best features to represent the classes. Since the matrix is an interval matrix, we compute a proximity matrix of size $K \times K$, with each element being of multivalued type of dimension $m$, by computing the similarity/dissimilarity among the features using the functions proposed in [9], [12].
Generally, each class contains several documents, which are classified according to their content. In this context, we intend to have an effective representation by providing multiple reference documents (representative vectors) for each class. Therefore, we recommend applying a clustering algorithm (the adaptive fuzzy c-means clustering algorithm) to obtain a number of clusters of documents of the training set for all the classes, and then to have a symbolic representative vector for each cluster of documents [8]. To corroborate the efficacy of the proposed model, we also employed a new method of representing text documents based on a feature clustering approach [7].
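The clustering step can be sketched as follows. The paper uses the adaptive fuzzy c-means algorithm of [8]; as a simplified stand-in, this sketch substitutes a plain k-means (all names are ours) and then forms one interval valued representative per cluster via the mean/standard-deviation rule.

```python
import random
from statistics import mean, stdev

def kmeans(docs, k, iters=20, seed=0):
    """Plain k-means, used here only as a stand-in for the
    adaptive fuzzy c-means clustering of [8]."""
    rng = random.Random(seed)
    centers = rng.sample(docs, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for d in docs:
            # assign to nearest center by squared Euclidean distance
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(d, centers[c])))
            clusters[i].append(d)
        centers = [[mean(col) for col in zip(*cl)] if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return [cl for cl in clusters if cl]   # drop empty clusters

def cluster_representatives(docs, k):
    """One interval valued representative vector per cluster of a class."""
    reps = []
    for cl in kmeans(docs, k):
        rep = []
        for col in zip(*cl):
            mu = mean(col)
            sigma = stdev(col) if len(col) > 1 else 0.0
            rep.append((mu - sigma, mu + sigma))
        reps.append(rep)
    return reps
```

With two well separated groups of documents, the two cluster representatives recover the two group means as interval midpoints.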
2.2 Symbolic Feature Selection
Feature selection is used to identify the useful features and also to
remove the redundant information. Basically, feature selection
methods fall into two broad categories, the filter model and the
wrapper model. The wrapper model requires one predetermined
learning algorithm in feature selection and uses its performance to
evaluate and determine which features are selected. And the filter
model relies on general characteristics of the training data to
select some features without involving any specific learning
algorithm. There is evidence that wrapper methods often perform
better on small scale problems, but on large scale problems, such
as text classification, wrapper methods are shown to be
impractical because of its high computational cost. Hence in this
paper we make use of the filter method to select the best features.
In order to select the best features from the class representative matrix $F$, we need to study the correlation present among the individual features of each class. The features which have maximum correlation shall be selected as the best features to represent the classes. Since $F$ is an interval matrix, we compute a proximity matrix of size $K \times K$, with each element being of multivalued type of dimension $m$, by computing the similarity among the features using the similarity function proposed in [17]. The similarity from class $i$ to class $j$ with respect to the $l^{th}$ feature is given by

$$S_{ij}^{l} = \frac{|I_{il} \cap I_{jl}|}{|I_{jl}|}$$

where $I_{il} = [p_{il}^{-}, p_{il}^{+}],\ l = 1, 2, \dots, m$ are the interval type features of the class $C_i$ and $I_{jl} = [p_{jl}^{-}, p_{jl}^{+}],\ l = 1, 2, \dots, m$ are the interval type features of the class $C_j$.
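Under an interval-overlap reading of such a similarity function (an assumption on our part; the authoritative functional form is the one given in [17]), the computation can be sketched as:

```python
def interval_overlap_similarity(I_i, I_j):
    """Similarity from interval I_i = (lo, hi) to interval I_j:
    length of their overlap divided by the length of I_j.
    Interval-overlap reading of the measure in [17]; names are ours."""
    lo = max(I_i[0], I_j[0])
    hi = min(I_i[1], I_j[1])
    overlap = max(0.0, hi - lo)
    length_j = I_j[1] - I_j[0]
    if length_j == 0.0:  # degenerate interval: fall back to point containment
        return 1.0 if I_i[0] <= I_j[0] <= I_i[1] else 0.0
    return overlap / length_j
```

Note that the measure is asymmetric (it normalizes by the length of the second interval), which is why the text speaks of similarity "from class i to class j".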
Therefore, from the obtained proximity matrix, the matrix $M$ of size $K^2 \times m$ is constructed by listing out each multivalued type element one by one in the form of rows. The standard deviation of each column of the matrix $M$ is computed, and also the average standard deviation over all $m$ columns is computed. Let $TCorr_l$ be the total correlation of the $l^{th}$ column with all other columns of the matrix $M$, and let $AvgTCorr$ be the average of the total correlations obtained over all columns, i.e.,
300 2014 International Conference on Contemporary Computing and Informatics (IC3I)
$$TCorr_l = \sum_{q=1,\ q \neq l}^{m} Corr(l^{th}\ column,\ q^{th}\ column)$$

and

$$AvgTCorr = \frac{1}{m} \sum_{l=1}^{m} TCorr_l$$
We are interested in those features which have high discriminating capability, and thus we recommend selecting those features for which $TCorr_l$ is higher than the average correlation $AvgTCorr$. Further, to compute the dissimilarity based feature selection, we make use of covariance rather than correlation.
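The correlation-based selection rule can be sketched as below. This is an illustrative reconstruction with our own helper names, using Pearson correlation between the columns of $M$ and keeping the columns whose total correlation exceeds the average.

```python
from statistics import mean, stdev

def pearson(x, y):
    """Sample Pearson correlation coefficient between two columns."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (len(x) - 1) * stdev(x) * stdev(y)
    return num / den if den else 0.0

def select_features(M):
    """M: rows of the K^2 x m matrix built from the proximity matrix.
    Return indices of columns whose total correlation with the other
    columns is higher than the average total correlation."""
    m = len(M[0])
    cols = [[row[l] for row in M] for l in range(m)]
    tcorr = [sum(pearson(cols[l], cols[q]) for q in range(m) if q != l)
             for l in range(m)]
    avg = sum(tcorr) / m
    return [l for l in range(m) if tcorr[l] > avg]
```

In the toy matrix below, columns 0 and 1 are perfectly correlated and column 2 only weakly, so the rule keeps the first two.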
2.3 Document Classification
Given a test document $D_q$, represented by a term document feature vector, it is projected onto a lower dimension using RLPI to get a transformed term frequency feature vector of size $m$, whose values are of crisp type. Among these $m$ feature values, only $d$ values are selected by the feature selection method. These $d$ values are then compared with each class representative stored in the knowledge base. Let $F_{D_q} = [p_{D_q}^{1}, p_{D_q}^{2}, p_{D_q}^{3}, \dots, p_{D_q}^{d}]$ be a $d$-dimensional feature vector describing the test document, and let $R_j$ be the interval symbolic feature vector of the $j^{th}$ class. Now, each $l^{th}$ feature value of the test document is compared with the corresponding interval $R_j^{l}$ to examine the similarity between the test document feature value and the interval in $R_j^{l}$. We make use of the similarity measure proposed in [5] to measure the similarity between the test feature vector $F_{D_q}$ and the $j^{th}$ class representative $R_j$:
$$Total\_Sim(F_{D_q}, R_j) = \sum_{l=1}^{d} Sim(p_{D_q}^{l}, [p_{jl}^{-}, p_{jl}^{+}])$$

Here, $[p_{jl}^{-}, p_{jl}^{+}]$ represents the $l^{th}$ feature interval of the $j^{th}$ class, and

$$Sim(p_{D_q}^{l}, [p_{jl}^{-}, p_{jl}^{+}]) = \begin{cases} 1 & \text{if } p_{D_q}^{l} \geq p_{jl}^{-} \text{ and } p_{D_q}^{l} \leq p_{jl}^{+} \\ \max\left( \dfrac{1}{1 + |p_{jl}^{-} - p_{D_q}^{l}|},\ \dfrac{1}{1 + |p_{jl}^{+} - p_{D_q}^{l}|} \right) & \text{otherwise} \end{cases}$$
Similarly, we compute the similarity value for all $K$ classes, and we use the symbolic classifier to classify a given query document.
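The classification step can be sketched as follows; this is our own illustrative reading of the measure in [5] (function names are ours): each crisp feature scores 1 when it falls inside the class interval and decays with distance outside it, and the class with the largest total similarity wins.

```python
def sim(p, interval):
    """Similarity between a crisp feature value p and a class
    interval [lo, hi]: 1 inside, decaying with distance outside."""
    lo, hi = interval
    if lo <= p <= hi:
        return 1.0
    return max(1.0 / (1.0 + abs(lo - p)), 1.0 / (1.0 + abs(hi - p)))

def classify(f_dq, class_reps):
    """Assign the d-dimensional test vector f_dq to the class whose
    interval representative yields the largest total similarity."""
    totals = [sum(sim(p, iv) for p, iv in zip(f_dq, rep))
              for rep in class_reps]
    return max(range(len(totals)), key=lambda j: totals[j])
```

With multiple representatives per class, as proposed above, one would score the test vector against every cluster representative and take the best-scoring cluster's class.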
3. EXPERIMENTAL SETUP
3.1 Datasets
For experimentation we have used the classic Reuters 21578 collection as the benchmark dataset. Originally, Reuters 21578 contains 21578 documents in 135 categories. However, in our experiments we discarded those documents with multiple category labels and selected the ten largest categories. For the smooth conduct of experiments we used the ten largest classes in the Reuters 21578 collection, with the numbers of documents in the training and test sets as follows: earn (2877 vs 1087), trade (369 vs 119), acquisitions (1650 vs 179), interest (347 vs 131), money-fx (538 vs 179), ship (197 vs 89), grain (433 vs 149), wheat (212 vs 71), crude (389 vs 189), corn (182 vs 56). The second dataset is the standard 20 Newsgroup Large. It is one of the standard benchmark datasets used by many text classification research groups. It contains 20000 documents categorized into 20 classes. For our experimentation, we have considered the term document matrix constructed for 20 Newsgroup. The third dataset consists of vehicle characteristics extracted from Wikipedia pages (Vehicles - Wikipedia). It contains four categories of vehicles that have low degrees of inter-category similarity: Aircraft, Boats, Cars and Trains. All four categories are easily differentiated, and every category has a set of unique keywords.
3.2 Experimentation
In this section, we present the results of the experiments conducted to demonstrate the effectiveness of the proposed method on all three datasets, viz., Reuters 21578, 20 Newsgroup, and Vehicles Wikipedia. During experimentation, we conducted two different sets of experiments. In the first set of experiments, we used 50% of the documents of each class of a dataset to create the training set and the remaining 50% of the documents for testing. In the second set of experiments, the numbers of training and testing documents are in the ratio 60:40. Both experiments are repeated 5 times by choosing the training samples randomly. As a measure of goodness of the proposed method, we computed the percentage classification accuracy. The minimum, maximum and average values of the classification accuracy over all 5 trials are presented in Table 1 to Table 3. For both experiments, we randomly selected the training documents to create the symbolic feature vectors for each class.
4. CONCLUSIONS
This paper proposes novel symbolic classifiers to classify text documents. The proposed symbolic classifiers are evaluated based on conventional cluster based symbolic approaches, cluster based symbolic approaches without feature selection, cluster based symbolic approaches with feature selection, and feature clustering approaches. The above mentioned representation methods are very effective in reducing the dimensionality of feature vectors for text classification. To corroborate the efficacy of the proposed model, we conducted extensive experimentation on standard text datasets. The experimental results reveal that the symbolic representation using feature clustering techniques achieves better classification accuracy than the existing cluster based symbolic representation approaches.
Table 1. Classification accuracy on Reuters 21578 dataset

| Training vs Testing | P | A Min | A Max | A Avg | B Min | B Max | B Avg | C Min | C Max | C Avg | D Min | D Max | D Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 50:50 | 2 | 60.85 | 63.50 | 62.55 | 68.40 | 69.95 | 69.10 | 67.20 | 69.10 | 68.45 | 69.85 | 71.45 | 70.65 |
| 50:50 | 3 | 61.45 | 64.10 | 63.90 | 69.90 | 71.25 | 70.95 | 68.85 | 69.95 | 69.45 | 70.15 | 72.35 | 71.55 |
| 50:50 | 4 | 59.90 | 61.20 | 60.85 | 67.55 | 69.50 | 68.20 | 65.45 | 66.30 | 65.60 | 68.90 | 69.55 | 69.10 |
| 60:40 | 2 | 63.55 | 65.20 | 64.20 | 68.90 | 70.65 | 70.10 | 68.50 | 69.25 | 69.25 | 73.50 | 76.10 | 75.85 |
| 60:40 | 3 | 68.40 | 70.25 | 69.55 | 78.50 | 80.10 | 79.55 | 76.45 | 78.10 | 77.25 | 84.65 | 86.75 | 85.55 |
| 60:40 | 4 | 65.85 | 66.90 | 66.10 | 72.50 | 75.85 | 73.20 | 70.20 | 72.55 | 71.25 | 80.55 | 82.65 | 81.45 |
Table 2. Classification accuracy on 20 Newsgroup dataset

| Training vs Testing | P | A Min | A Max | A Avg | B Min | B Max | B Avg | C Min | C Max | C Avg | D Min | D Max | D Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 50:50 | 2 | 79.58 | 82.50 | 81.85 | 81.25 | 86.85 | 83.80 | 79.38 | 86.85 | 81.66 | 80.55 | 81.45 | 80.90 |
| 50:50 | 3 | 84.34 | 87.80 | 86.25 | 86.20 | 87.25 | 86.46 | 84.85 | 87.00 | 86.33 | 87.40 | 89.60 | 88.95 |
| 50:50 | 4 | 77.63 | 79.15 | 78.37 | 76.95 | 81.20 | 79.61 | 78.65 | 80.25 | 79.42 | 85.20 | 85.65 | 85.40 |
| 60:40 | 2 | 82.68 | 84.25 | 83.20 | 83.45 | 87.25 | 85.14 | 82.90 | 86.50 | 84.42 | 86.55 | 88.10 | 87.20 |
| 60:40 | 3 | 88.45 | 90.20 | 89.36 | 93.63 | 90.85 | 92.04 | 90.25 | 91.65 | 90.86 | 91.10 | 92.65 | 91.95 |
| 60:40 | 4 | 79.55 | 81.56 | 80.18 | 79.68 | 83.45 | 81.47 | 79.68 | 81.45 | 80.71 | 88.20 | 89.90 | 89.10 |
Table 3. Classification accuracy on Vehicles Wikipedia dataset

| Training vs Testing | P | A Min | A Max | A Avg | B Min | B Max | B Avg | C Min | C Max | C Avg | D Min | D Max | D Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 50:50 | 2 | 80.65 | 83.65 | 81.54 | 82.19 | 83.90 | 83.11 | 70.50 | 77.50 | 74.79 | 84.35 | 85.40 | 84.90 |
| 50:50 | 3 | 83.75 | 88.20 | 84.55 | 86.25 | 95.00 | 91.39 | 86.26 | 87.35 | 86.92 | 90.05 | 93.45 | 92.15 |
| 50:50 | 4 | 79.80 | 82.50 | 80.86 | 82.00 | 85.00 | 83.84 | 81.26 | 82.00 | 81.70 | 86.70 | 88.90 | 87.45 |
| 60:40 | 2 | 78.33 | 84.63 | 81.95 | 84.12 | 84.86 | 84.31 | 72.50 | 80.62 | 77.62 | 86.70 | 87.20 | 86.95 |
| 60:40 | 3 | 89.55 | 91.50 | 90.86 | 94.00 | 98.00 | 95.74 | 89.20 | 89.50 | 89.40 | 91.25 | 94.50 | 93.60 |
| 60:40 | 4 | 85.20 | 87.85 | 85.62 | 85.00 | 93.75 | 89.60 | 79.50 | 84.50 | 81.83 | 89.10 | 90.10 | 89.40 |
Where,
P: Number of Clusters Selected
A: Symbolic Clustering Approaches
B: Symbolic Cluster Based with Feature Selection (Similarity Based)
C: Symbolic Cluster Based with Feature Selection (Dissimilarity Based)
D: Symbolic Feature Clustering Approaches
Min: Minimum Accuracy Recorded
Max: Maximum Accuracy Recorded
Avg: Average Accuracy Recorded
5. REFERENCES
[1]. Bock H. H. and Diday E., 1999. Analysis of Symbolic Data. Springer Verlag.
[2]. Cai D, He X, Zhang W. V. and Han J., 2007.
Regularized Locality Preserving Indexing via Spectral
Regression. Proceedings of Conference on
Information and Knowledge Management (CIKM’07),
pp. 741 – 750.
[3]. Fung G. P. C., Yu J. X., Lu H. and Yu P. S., 2006.
Text classification without negative example revisit.
IEEE Transactions on Knowledge and Data
Engineering. vol. 18, pp. 23 – 47.
[4]. Guru D. S., Harish B. S. and Manjunath S., 2010. Symbolic representation of text documents. Proceedings of Third Annual ACM Compute, Bangalore.
[5]. Guru D. S and Nagendraswamy H. S., 2006. Symbolic
Representation of Two-Dimensional Shapes. Pattern
Recognition Letters, vol. 28, no. 1, pp. 144 – 155.
[6]. Guru D. S., 2001. Classification of text documents:
An overview, the challenges and future avenues.
Proceedings of the Pre-Conference workshop on
Document Processing, Karnataka, India, pp. 28 – 34.
[7]. Harish B S and Udayasri., 2014. Document
Classification: An Approach Using Feature
Clustering. Recent Advances in Intelligent Informatics
Advances in Intelligent Systems and Computing, Vol.
235, pp. 163-173.
[8]. Harish B S, Bhanu Prasad and B Udayasri., 2014.
Classification of Text Documents using Adaptive
Fuzzy C-Means Clustering. Recent Advances in
Intelligent Informatics Advances in Intelligent
Systems and Computing, Vol. 235, pp. 205-214.
[9]. Harish B S, Guru D S, Manjunath S and Dinesh R.,
2011. Symbolic Similarity and Symbolic Feature
Selection for Text Classification. Bilateral Russian-
Indian Scientific Workshop on Emerging Applications
of Computer Vision, November 1 – 5, 2011, Moscow,
Russia, pp. 141 – 146.
[10]. Harish B S, Manjunath S and Guru D S, 2010.
Representation and Classification of Text Documents:
A Brief Review. International Journal of Computer
Applications Special Issue on Recent Trends in Image
Processing and Pattern Recognition, pp. 110 – 119.
[11]. Isa D., Lee L. H., Kallimani V. P. and Rajkumar R.,
2008. Text document preprocessing with the Bayes formula for classification using the support vector
machine. IEEE Transactions on Knowledge and Data
Engineering, vol. 20, no. 9, pp. 23 – 31.
[12]. Manjunath S, Harish B S and Guru D S., 2011.
Dissimilarity Based Feature Selection for Text
Classification: A Cluster Based Approach. In the
proceedings of ACM International Conference and
Workshop on Emerging Trends and Technology,
Association of Computing Machinery, New York,
USA, pp. 495 – 499.
[13]. Rigutini L., 2004. Automatic text processing: Machine
learning techniques. Ph.D. Thesis, University of
Siena.
[14]. Salton G and Buckley C, 1988. Term weighting
approaches in automatic text retrieval. Journal of
Information Processing and Management, vol. 24, no.
5, pp. 513 – 523.
[15]. Song F., S. Liu and J. Yang, 2005. A comparative
study on text representation schemes in text
categorization. Journal of Pattern Analysis
Application, vol. 8, pp. 199 – 209.
[16]. Zeimpekis D and E. Gallopoulos, 2006. TMG: A
MATLAB Toolbox for generating term-document
matrices from text collections. Springer Publications,
Berlin, pp. 187 – 210.
[17]. Guru D.S., Kiranagi B. B., Nagabhushan P, 2004.
Multivalued type proximity measure and concept of
mutual similarity value useful for clustering symbolic
patterns. Journal of Pattern Recognition Letters, vol.
25, pp. 1003 – 1013.