An Automated Text Classification Method: Using
Improved Fuzzy Set Approach for Feature
Selection
Bushra Zaheer Abbasi, Shahid Hussain, *Muhammad Imran Faisal
Department of Computer Science
COMSATS University of Information Technology
Islamabad, Pakistan, *Federal Urdu University, Islamabad
bushrazaheer.abbasi@gmail.com, shussain@comsats.edu.pk, *faisii00700@gmail.com
Abstract— A well-representing feature set with enough discriminative power plays an important role in classification. The existing techniques for feature set selection are mostly statistical. They are not flexible enough to incorporate human reasoning or the changing requirements and preferences of real-life systems, and they only make a binary decision between feature inclusion and exclusion. The fuzziness of human reasoning and thinking, which may improve feature selection and hence the accuracy of the classifier, is not considered at all. Also, the selection of overlapping features in Local Feature Selection (LFS) methods is an important issue that negatively impacts classification accuracy; for example, in the case of Odds Ratio (OR), the selection may contain overlapping features of multiple classes. In this paper, a Fuzzy Set Theory (FST) based feature selection method is proposed. The approach aims to tackle both of the above-mentioned issues efficiently. The selected final feature set is used to train well-known classification algorithms, and the results are compared with Global Feature Selection (GFS) and LFS methods. The comparison shows that the proposed method improves the accuracy of the classifiers and also extracts a comparatively small feature set, which ultimately reduces the time complexity of the system.
Keywords— Classification, accuracy, feature selection, Fuzzy
Set Theory, global feature selection, local feature selection
I. INTRODUCTION
Feature selection is the activity of selecting the most appropriate and representative features. This process curtails the number of features by skipping duplicate, noisy and unimportant features. Feature selection can be performed either at the global or the local level [1]. GFS methods calculate the overall importance of a feature irrespective of its relevance to any particular class [2]. LFS methods calculate the feature importance for every possible class individually and perform the final selection on these individual scores [2]. These selection methods are mostly statistical and declare a feature's status as either important or unimportant, but in most real-life scenarios the decisions are not that simple and involve a number of human uncertainties. This happens because of varying ground realities that cannot be bound to a binary {0, 1} selection [3]. This fact suggests that many incorrectly classified records may be misclassified because of this binary nature of statistical methods. The literature also reveals that most LFS methods, being local to individual classes, may select repeated features for multiple classes [4]. An extra check may be required to handle this issue, which may increase computational cost; otherwise, the duplicate selection of a feature from different classes may hinder the classifier's performance.
In this paper, we propose a novel FST-based approach for feature selection. The proposed method uses a fuzzy decision matrix consisting of decision variables, and it hybridizes the concepts of GFS and LFS by selecting features at both levels. First, we select features at the local level for each class on the basis of the feature importance calculated using fuzzy decision matrices. Later, we use one more variable to make the final selection at the global level. This mixing of global and local levels helps to avoid the repetition of the same feature across different classes. We name this hybrid approach the Improved FST based feature selection method. The decision matrix of FST in the proposed method consists of 4 levels of decision on the decision scale. This arrangement avoids the binary nature of selection: the proposed method makes decision making more flexible by introducing 4 levels of selection instead of only two, as in binary systems. The proposed system is applied to a case study selected from the Gang-of-Four (GoF) design patterns. We also used three highly recommended and preferred classifiers (Naïve Bayes (NB), Support Vector Machine (SVM) and Random Forest (RF)) to check the effect of the proposed method on classification accuracy. The proposed method is compared with benchmark GFS and LFS methods for performance evaluation.
The rest of the paper is organized as follows: Section 2 presents the state-of-the-art work in the area of feature selection and the utilization of FST for feature selection. The proposed methodology, along with the experimental setup used in this research, is given in Section 3. Section 4 states the results along with a discussion of them. We conclude the work in the fifth and last section of the paper.
Proceedings of 2019 16th International Bhurban Conference on Applied Sciences & Technology (IBCAST)
Islamabad, Pakistan, 8th – 12th January, 2019
666
978-1-5386-7729-2/19/$31.00©2019 IEEE
II. RELATED WORK
The last few decades have witnessed rapidly growing interest in the area of Data Mining (DM) from the research community. Researchers are working on different aspects of DM such as data extraction, pre-processing, feature extraction, feature construction, feature selection and classification/prediction. They are utilizing statistical, mathematical, heuristic, metaheuristic and nature-inspired algorithms to deal with the issues at hand. In this section, a few recent studies on feature selection for DM are presented.
D. F. Gillies in [4] performed a number of experiments using the most frequently used filter, wrapper and embedded FS methods, specifically checking their performance on microarray data. Microarray data comes from a biological platform that gathers gene expressions for performing different experiments and analyses. The conducted survey comprised a number of FS methods; the authors compared the performance of each and concluded that there is no free lunch in the case of FS methods either.
The nature-inspired Particle Swarm Optimization (PSO) algorithm is adopted by [5, 6]. In both of these papers, PSO is used for feature selection for different tasks. In [5], sentiment analysis is performed based on aspects extracted from the given data; PSO is utilized for aspect extraction and construction of the feature set. In this work, M. S. Akhtar et al. further constructed an ensemble of three classifiers and presented a cascade model comprising both of the above-mentioned parts. In [6], PSO is utilized for feature selection from high-dimensional real-life data streams; the proposed method is a lightweight, computationally efficient FS method.
An FS method to handle the class imbalance problem was proposed by A. K. Uysal in [7]. The proposed method, named the improved global feature selection scheme, addresses the class imbalance problem by selecting an equal number of features from each class. In this method, the author used OR to find the inclination towards any particular class, either positive or negative, and then selected N features from each class. This method improved the representation of the smallest class in the final feature set, and the experimental results validate its effectiveness. A drawback of this method was later highlighted by D. Agnihotri in [8]: because of its rigidness, it may ignore some highly representative features of the majority class. This omission affects the training and ultimately the performance of the classifier. The author therefore proposed the variable global feature selection scheme to select a variable number of features from each class based on the volume of the class.
A. Rehman et al. proposed the normalized difference measure method in [9] for feature selection from text documents. According to the authors, existing benchmark methods assign equal ranks to features that have the same differences but ignore the difference in their relative document frequencies. Keeping this fact in mind, the authors proposed the normalized difference measure method and compared it with seven well-known FS methods on seven datasets and two classifiers, namely NB and SVM. The macro F1 measure of the results shows that the proposed method performed better than the compared methods in 66% of cases, whereas for the micro F1 measure this value is 51%.
From the discussion in the above paragraphs it can be concluded that feature selection is an important open issue of the current era. Researchers have focused on this issue and proposed a number of approaches to tackle it from different perspectives. Statistical, nature-inspired, heuristic, metaheuristic, hybrid and distance-based approaches have been adopted to select the best representative features. The literature review of these approaches concludes that there is no free lunch and all the approaches have their own pros and cons: if one gives better results in a particular domain, another may outperform the rest in some other domain.
The improved fuzzy rough set has been utilized by [10, 11] for feature selection. The existing fuzzy rough set has the problem that it may not maintain the maximal dependency function and could misfit the given dataset. To overcome these issues, W. Changzhong et al. introduced an improved form of the fuzzy rough set that can fit the given dataset well. Concepts of fuzzy neighborhood and parameterized fuzzy relations are utilized by the authors to define the upper and lower bounds of the decision boundary. The results show the validity of the proposed method over the basic fuzzy rough set method.
Opinion mining from online social media sites is also a rapidly growing task of DM, and researchers have utilized FST for opinion mining as well. In [12-16], FST is used for opinion mining from different types of social media sites. M. Afzaal et al. proposed an opinion classification system for mining tourist reviews. The huge quantity of data available to tourists makes it difficult for them to choose between places to visit. A number of opinion mining techniques have been proposed for this task, but this research work proposes mining the opinions on the basis of the aspects given in the reviews; mining the given aspects can be useful for decision making. J. Shaidah introduced information extraction and fuzzy set theory based opinion mining; the proposed system was capable of dealing with natural language text and categorizes the opinions into two categories, positive or neutral. The business and finance field needs immediate response analysis to make its services more attractive and impressive, and social media makes it easy for customers to express their grievances over what they feel was not up to the mark. Keeping in view the strength of social media, the banking and business sectors find it appropriate to have automated opinion analysis systems. Kumar Ravi et al. in their research work proposed a methodology based on fuzzy formal concept analysis and concept-level sentiment analysis. Their work presents an automated system for opinion analysis of comments and reviews expressed by customers based on the services they
avail. Despite the heavy research work done on the topic of opinion mining, the depth and diversity of opinions still leave room for more research on this topic. Farman Ali et al. in their research work proposed an opinion mining methodology that hybridizes fuzzy domain ontology and SVM to increase the precision rate and the accuracy of the classifier. The proposed method targets the issue of opinion polarity and tries to extract the extreme aspect of the opinion; on the basis of that polarity, the opinions are analysed to help develop knowledge-based automated systems.
The literature review on the use of FST for DM shows the importance of FST for performing a number of different tasks related to DM, including FS, opinion mining, sentiment analysis and classification.
III. PROPOSED METHODOLOGY
The main goal of the proposed methodology is to reduce the number of features in the feature set and to incorporate human thinking and reasoning in the decision-making system. The system will also reduce the chances of selecting overlapping features from multiple classes. The proposed method consists of five main steps, presented in Figure 1.
In the first step, features are extracted from the text documents of the selected dataset. In the second and third steps, two variables X and Y are selected to describe the feature importance for particular classes and the feature strength to represent the individual class, respectively. In the fourth step, the fate of the record is decided on the basis of the feature importance for the individual class for X and Y, calculated via equations 1 and 2. In the fifth and final step, we select the final feature set on the basis of the value of variable Z. The variable Z defines the total number of records of a particular class that contain a specific feature. We use this variable because, for the correct prediction of records of a class, a feature that covers most of the records can be more valuable than a feature that has high occurrences within the class but covers only a few instances.
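The coverage count Z described above can be tallied with a simple per-class document count. A minimal sketch, assuming toy records whose class names and feature names are made up for illustration:

```python
from collections import defaultdict

# Hypothetical toy records: (class label, set of distinct features present).
records = [
    ("ClassA", {"f1", "f2"}),
    ("ClassA", {"f1"}),
    ("ClassB", {"f2", "f3"}),
]

# Z[(cls, feat)]: how many records of cls contain feat at least once.
# A feature counts once per record, regardless of how often it occurs inside it.
Z = defaultdict(int)
for cls, feats in records:
    for f in feats:
        Z[(cls, f)] += 1

print(Z[("ClassA", "f1")])  # 2: f1 covers both ClassA records
```

This is exactly the distinction the paragraph draws: f1 covers both ClassA records, so it is a stronger candidate than a feature occurring many times in a single record.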
X = (fc / tfd) × 100    (1)

Y = (fc / tfc) × 100    (2)
Here fc stands for the feature occurrence per class, tfd stands for the total feature occurrence in the dataset and tfc stands for the total features in a class. In the fourth step, we used a scale of four possible intervals to categorize the calculated values of X and Y as relevant (RL), moderate (MD), low relevance (LRL) and irrelevant (IRL). For X the overall upper limit is 100 and the lowest is 0, so we made four ranges: 0-25, 26-50, 51-75 and 76-100. For Y the upper limit changes for every class, so we divided the highest value by 4 and made the ranges accordingly, as done in [17, 18]. The label is assigned according to the importance of X and Y and their calculated values. The variable Y has priority over variable X, as is clearly visible from the decision matrix given in Table 1.
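The four-level labeling and the decision-matrix lookup described above can be sketched as follows. The range boundaries for X follow the text (0-25, 26-50, 51-75, 76-100), and for Y the per-class maximum is split into four equal bands; the assumption that the lowest band maps to "irrelevant" and the highest to "relevant", and all function names, are illustrative rather than taken from the paper.

```python
def label_x(x):
    """Map an X value (0-100) onto the four-level scale (assumed ordering)."""
    if x <= 25:
        return "ir"   # irrelevant band
    if x <= 50:
        return "lr"   # low-relevance band
    if x <= 75:
        return "m"    # moderate band
    return "r"        # relevant band

def label_y(y, y_max):
    """Map a Y value onto four equal bands below the per-class maximum."""
    band = y_max / 4.0
    if y <= band:
        return "ir"
    if y <= 2 * band:
        return "lr"
    if y <= 3 * band:
        return "m"
    return "r"

# Decision matrix from Table 1: rows indexed by the X label, columns by the
# Y label. Note the Y label dominates: every "r" column yields RL and every
# "m" column yields MD, regardless of X.
DECISION = {
    "r":  {"r": "RL", "m": "MD", "lr": "MD",  "ir": "IRL"},
    "m":  {"r": "RL", "m": "MD", "lr": "LRL", "ir": "IRL"},
    "lr": {"r": "RL", "m": "MD", "lr": "LRL", "ir": "IRL"},
    "ir": {"r": "RL", "m": "MD", "lr": "IRL", "ir": "IRL"},
}

def decide(x, y, y_max):
    """Assign a feature one of the four decision labels."""
    return DECISION[label_x(x)][label_y(y, y_max)]

print(decide(80, 90, 100))  # RL: both X and Y in the top band
```

The table-driven lookup makes Y's priority over X explicit in the data rather than in nested conditionals.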
The performance of the proposed method is compared with benchmark methods: two from GFS (Information Gain (IG) and Gini Index (GI)) and one from LFS (OR). The description of the classifiers and the evaluation metrics that we used for performance comparison is given below.
Table 1. Decision matrix
A. Classification Algorithms
The classifiers we selected for training the model on the features chosen by each method are the well-known NB, SVM and RF.
NB: Conditional independence of attributes is the basic assumption of NB. It estimates the class-conditional probability of the attributes and assumes that the attributes are independent of each other. This algorithm performs well when the dataset has noise and outliers. The mathematical representation of this algorithm (Bayes' rule) is:

P(X|Y) = P(Y|X) P(X) / P(Y)    (3)
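As a quick numeric illustration of equation (3), the prior, likelihood and evidence values below are made up purely for demonstration:

```python
# Toy Bayes-rule computation: P(X|Y) = P(Y|X) * P(X) / P(Y).
# All three probability values are illustrative, not from the paper.
p_x = 0.3          # prior P(X), e.g. the class prior
p_y_given_x = 0.8  # likelihood P(Y|X), e.g. feature given class
p_y = 0.5          # evidence P(Y), the feature's overall probability

p_x_given_y = p_y_given_x * p_x / p_y
print(p_x_given_y)  # 0.48
```

NB applies this rule per attribute and multiplies the per-attribute likelihoods, which is where the conditional independence assumption enters.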
RF: As RF is an ensemble algorithm, it combines the characteristics of two or more models. RF works by creating a set of decision trees, each from a random selection of a subset of the training set. The final class of an object is decided by aggregating the votes for each class.
SVM: One of the most frequently used classification algorithms, with both linear and non-linear versions. Its main idea is to draw a separating line within the population plane in such a way that the data points lie at the maximum distance from that line. The mathematical expression for this algorithm is stated as:

A0 + A1 Z1 + A2 Z2 = 0    (4)

Here Z1 and Z2 are the input features and the slope of the separating line is determined by A1 and A2. The variable A0 in the formula is the intercept determined by the algorithm.

The decision matrix (Table 1) is:

        Yr   Ym   Ylr   Yir
  Xr    RL   MD   MD    IRL
  Xm    RL   MD   LRL   IRL
  Xlr   RL   MD   LRL   IRL
  Xir   RL   MD   IRL   IRL

Figure 1. Overview of proposed methodology
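The linear decision rule in equation (4) can be sketched as a sign test: a point (Z1, Z2) falls on one side of the separating line or the other. The coefficient values here are arbitrary illustrative numbers, not learned ones.

```python
# Linear decision rule from equation (4): the sign of a0 + a1*z1 + a2*z2
# determines which side of the separating line a point lies on.
a0, a1, a2 = -1.0, 2.0, 1.0  # illustrative coefficients, not fitted values

def classify(z1, z2):
    """Return +1 or -1 depending on the side of the separating line."""
    score = a0 + a1 * z1 + a2 * z2
    return 1 if score >= 0 else -1

print(classify(1.0, 0.5))  # 1  (score = 1.5, above the line)
print(classify(0.0, 0.0))  # -1 (score = -1.0, below the line)
```

Training the SVM amounts to choosing a0, a1, a2 so that this line maximizes the margin to the nearest points of each class.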
B. Performance Evaluation
Precision = TP / (TP + FP)    (5)

Recall = TP / (TP + FN)    (6)

F-measure = 2 × Precision × Recall / (Precision + Recall)    (7)
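Equations (5)-(7) can be computed directly from the per-class true-positive, false-positive and false-negative counts; the counts used in the example call are made up for illustration.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are correct, eq. (5)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of actual positives that are found, eq. (6)."""
    return tp / (tp + fn)

def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall, eq. (7)."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

print(f_measure(8, 2, 2))  # 0.8: precision and recall are both 0.8 here
```

The harmonic mean penalizes an imbalance between precision and recall, which is why F-measure is used for the comparison in Section IV.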
IV. RESULTS AND DISCUSSION
The results obtained by training the models on each of the feature sets selected by the benchmark methods and by the proposed method are presented in Table 2. We selected thresholds of one fourth and one half of the extracted features to be used for training. These thresholds check the effect of reducing the number of features on the efficiency of the FS method.
Table 2. Results of improved FST approach vs. benchmark FS methods

No. of features: 27
                     NB       SVM      RF
  GI                 74.1%    74.7%    82.7%
  IG                 86.9%    86.7%    80.2%
  OR                 73.4%    71.6%    69.8%
  Proposed Method    90.0%    85.4%    89.7%

No. of features: 54
                     NB       SVM      RF
  GI                 81.5%    83.5%    77.7%
  IG                 89.0%    82.1%    84.0%
  OR                 76.6%    73.7%    78.5%
  Proposed Method    88.7%    86.3%    87.6%
The results presented in Table 2 show that the proposed method performed well for all classifiers compared to GI and OR. The performance of the proposed method and IG shows only minor fluctuation for each classifier. It can also be deduced from the presented results that the proposed method can show better performance for a reduced feature set, whereas the performance of the other methods improves only by increasing the number of features, which may result in increased time complexity. The performance graph presented in Figure 2 clearly depicts the efficiency of the proposed method.
V. CONCLUSION
This research work presents a novel method of feature selection that incorporates the human thinking process in feature selection and can improve the classifier's accuracy with the least number of features. The results of the proposed method were compared with benchmark FS methods, namely IG, GI and OR. The F-measure shows that the proposed method based on FST not only reflects human thinking but also enhances the classifier's accuracy with a minimum number of features.
REFERENCES
Figure 2. Graphical representation of accuracy achieved by using different FS methods
[1] Ramírez-Gallego, S., Krawczyk, B., García, S., Woźniak, M., &
Herrera, F. (2017). “A survey on data pre-processing for data stream
mining: current status and future directions”. Neurocomputing, 239,
39-57.
[2] Noh, S., Zoltowski, M. D., Sung, Y., & Love, D. J. (2014). “Pilot
beam pattern design for channel estimation in massive MIMO
systems”. IEEE Journal of Selected Topics in Signal Processing, 8(5),
787-801.
[3] Melin, P., & Castillo, O. (2014). “A review on type-2 fuzzy logic
applications in clustering, classification and pattern recognition”.
Applied soft computing, 21, 568-577.
[4] Akhtar, M. S., Gupta, D., Ekbal, A. , & Bhattacharyya, P, (2017).
"Feature selection and ensemble construction: A two-step method for
aspect based sentiment analysis". Knowledge-Based Systems, 125,
116-135
[5] Akhtar, M. S., Gupta, D., Ekbal, A., & Bhattacharyya, P. (2017).
“Feature selection and ensemble construction: A two-step method for
aspect based sentiment analysis”. Knowledge-Based Systems, 125,
116-135.
[6] Fong, S., Wong, R., & Vasilakos, A. (2016). “Accelerated PSO
swarm search feature selection for data stream mining big data”.
IEEE transactions on services computing, (1), 1-1.
[7] Uysal, A. K. (2017). "An improved global feature selection scheme
for text classification". Expert systems with Applications, 43, 82-92.
[8] Agnihotri, D., Verma, K., & Tripathi, P. (2017). "Variable Global
Feature Selection Scheme for automatic classification of text
documents". Expert Systems with Applications, 81, 268-281.
[9] Rehman, A., Javed, K., & Babri, H. A. (2017). Feature selection
based on a normalized difference measure for text classification.
Information Processing & Management, 53(2), 473-489.
[10] Wang, C., Qi, Y., Shao, M., Hu, Q., Chen, D., Qian, Y., & Lin, Y.
(2017). A fitting model for feature selection with fuzzy rough sets.
IEEE Transactions on Fuzzy Systems, 25(4), 741-753.
[11] Hussain. S, A Methodology to Predict the Instable Classes, 32nd
ACM Symposium on Applied Computing (SAC) Morocco, 4th to 6th
April 2017
[12] Afzaal, M., Usman, M., Fong, A. C. M., Fong, S., & Zhuang, Y.
(2016). “Fuzzy aspect based opinion classification system for mining
tourist reviews”. Advances in Fuzzy Systems, 2016, 2.
[13] Deng, Y., Ren, Z., Kong, Y., Bao, F., & Dai, Q. (2017). “A
hierarchical fused fuzzy deep neural network for data classification”.
IEEE Transactions on Fuzzy Systems, 25(4), 1006-1012.
[14] Hussain, S., "Threshold Analysis of Design Metrics to Detect
Design Flaws", 31st ACM Symposium on Applied Computing
(SRC), 2016, 4-8 April, 2016, Pisa, Italy.
[15] Ravi, K., Ravi, V., & Prasad, P. S. R. K. (2017). “Fuzzy formal
concept analysis based opinion mining for CRM in financial
services”. Applied Soft Computing, 60, 786-807.
[16] Ali, F., Kwak, K. S., & Kim, Y. G. (2016). “Opinion mining based on
fuzzy domain ontology and support vector machine: a proposal to
automate online review classification”. Applied Soft Computing, 47,
235-250.
[17] R. H. Gamma, "Design Pattern," in MA: Adison Wesley, Boston,
1995.
[18] Hussain. S, Keung. J., Sohail. M. K, Ilahi. M, Khan. A. A, Automated
Framework for Classification and Selection of Software Design
Patterns, Applied Soft Computing, October,2018.