An Automated Text Classification Method: Using
Improved Fuzzy Set Approach for Feature
Selection
Bushra Zaheer Abbasi, Shahid Hussain, *Muhammad Imran Faisal
Department of Computer Science
COMSATS University of Information Technology
Islamabad, Pakistan, *Federal Urdu University, Islamabad
bushrazaheer.abbasi@gmail.com, shussain@comsats.edu.pk, *faisii00700@gmail.com
Abstract— A well-representing feature set with enough discriminative power plays an important role in classification. The existing techniques for feature set selection are mostly statistical. They are not flexible enough to incorporate human reasoning or the changing requirements and preferences of real-life systems, and they only make a binary decision between feature inclusion and exclusion. The fuzziness of human reasoning and thinking, which may improve feature selection and hence the accuracy of the classifier, is not considered at all. Also, the selection of overlapping features in Local Feature Selection (LFS) methods is an important issue that negatively impacts classification accuracy; for example, in the case of Odds Ratio (OR), the selection may contain overlapping features of multiple classes. In this paper, a Fuzzy Set Theory (FST) based feature selection method is proposed. The approach aims to tackle both of the above-mentioned issues efficiently. The selected final feature set is used to train well-known classification algorithms, and the results are compared with Global Feature Selection (GFS) and LFS methods. The comparison shows that the proposed method improves the accuracy of the classifiers and also extracts a comparatively small feature set, which ultimately reduces the time complexity of the system.
Keywords— Classification, accuracy, feature selection, Fuzzy
Set Theory, global feature selection, local feature selection
I. INTRODUCTION
Feature selection is the activity of selecting the most appropriate and representative features. This process curtails the number of features by skipping duplicate, noisy and unimportant features. Feature selection can be performed either at the global or the local level [1]. GFS methods calculate the overall importance of a feature irrespective of its relevance to any particular class [2]. LFS methods calculate the feature importance for every possible class individually and perform the final selection on these individual scores [2]. These selection methods are mostly statistical and declare a feature's status as either important or unimportant, but in most real-life scenarios the decisions are not that simple and involve a number of human uncertainties. This happens because of varying ground realities that cannot be bound to a binary {0, 1} selection [3]. This fact suggests that many incorrectly classified records may be misclassified because of this binary nature of statistical methods. The literature also reveals that most LFS methods, being local to individual classes, may select repeated features for multiple classes [4]. An extra check may be required to handle this issue, which may increase computational cost; otherwise, the duplicate selection of a feature from different classes may hinder the classifier's performance.
In this paper, we propose a novel FST-based approach for feature selection. The proposed method uses a fuzzy decision matrix consisting of decision variables, and it hybridizes the concepts of GFS and LFS by selecting features at both levels. First, we select features at the local level for each class on the basis of the feature importance calculated using fuzzy decision matrices. Later, we use one more variable to make the final selection at the global level. This mixing of global and local levels helps to avoid the repetition of the same feature across different classes. We name this hybrid approach the Improved FST based feature selection method. The decision matrix of FST in the proposed method consists of 4 levels of decision on the decision scale. This arrangement avoids the binary nature of selection: the proposed method makes decision making more flexible by introducing 4 levels of selection instead of only two, as in binary systems. The proposed system is applied to a case study selected from the Gang-of-Four (GoF) design patterns. We also used three highly recommended and preferred classifiers (Naïve Bayes (NB), Support Vector Machine (SVM) and Random Forest (RF)) to check the effect of the proposed method on classification accuracy. The proposed method is compared with benchmark GFS and LFS methods for performance evaluation.
The rest of the paper is organized as follows: Section 2 presents the state-of-the-art work in the area of feature selection and the utilization of FST for feature selection. The proposed methodology, along with the experimental setup used in this research, is given in Section 3. Section 4 states the results along with a discussion of them. We conclude the work in the fifth and last section of the paper.
Proceedings of 2019 16th International Bhurban Conference on Applied Sciences & Technology (IBCAST)
Islamabad, Pakistan, 8th – 12th January, 2019
666
978-1-5386-7729-2/19/$31.00©2019 IEEE
II. RELATED WORK
The last few decades have witnessed rapidly growing interest in the area of Data Mining (DM) from the research community. Researchers are working on different aspects of DM such as data extraction, pre-processing, feature extraction, feature construction, feature selection and classification/prediction. They are utilizing statistical, mathematical, heuristic, metaheuristic and nature-inspired algorithms to deal with the issues at hand. In this section, a few recent studies on feature selection for DM are presented.
D. F. Gillies in [4] performed a number of experiments using the most frequently used filter, wrapper and embedded FS methods, specifically checking their performance on microarray data. Microarray data comes from a biological platform that gathers gene expressions for performing different experiments and analyses. The conducted survey comprised a number of FS methods; the authors compared the performance of each and concluded that there is no free lunch in the case of FS methods either.
The nature-inspired Particle Swarm Optimization (PSO) algorithm is adopted by [5, 6]. In both of these papers, PSO is used for feature selection for different tasks. In [5], sentiment analysis is performed based on aspects extracted from the given data; PSO is utilized for aspect extraction and construction of the feature set. In this work, M. S. Akhtar et al. further constructed an ensemble of three classifiers and presented a cascade model comprising both of the above-mentioned parts. In [6], PSO is utilized for feature selection from high-dimensional real-life data streams; the proposed method is a lightweight, computationally efficient FS method.
An FS method to handle the class imbalance problem was proposed by A. K. Uysal in [7]. The proposed method, named the improved global feature selection scheme, addresses the class imbalance problem by selecting an equal number of features from each class. In this method, the author used OR to find the inclination towards any particular class, either positive or negative, and then selected N features from each class. This method improved the representation of the smallest class in the final feature set, and the experimental results validate its effectiveness. A drawback of this method was later highlighted by D. Agnihotri in [8]: because of its rigidness, it may ignore some highly representative features of the majority class. This omission affects the training and ultimately the performance of the classifier. The author therefore proposed the variable global feature selection scheme to select a variable number of features from each class based on the volume of the class.
A. Rehman et al. proposed the normalized difference measure method in [9] for feature selection from text documents. According to the authors, existing benchmark methods assign equal ranks to features that have the same differences but ignore the difference in their relative document frequencies. Keeping this fact in mind, the authors proposed the normalized difference measure method and compared it with seven well-known FS methods on seven datasets and two classifiers, namely NB and SVM. The macro F1 measure of the results shows that the proposed method performed better than the compared methods in 66% of cases, whereas for the micro F1 measure this value is 51%.
From the discussion in the above paragraphs it can be concluded that feature selection is an important open issue of the current era. Researchers have focused on this issue and proposed a number of approaches to tackle it from different perspectives. Statistical, nature-inspired, heuristic, metaheuristic, hybrid and distance-based approaches have been adopted to select the best representative features. The literature review of these approaches concludes that there is no free lunch and all the approaches have their own pros and cons: if one gives better results in a particular domain, another may outperform the rest in some other domain.
The improved fuzzy rough set has been utilized by [10, 11] for feature selection. The existing fuzzy rough set has the problem that it may not maintain the maximal dependency function and could misfit the given dataset. To overcome these issues, W. Changzhong et al. introduced an improved form of the fuzzy rough set that can fit the given dataset well. Concepts of fuzzy neighborhood and parameterized fuzzy relations are utilized by the authors to define the upper and lower bounds of the decision boundary. The results show the validity of the proposed method over the basic fuzzy rough set method.
Opinion mining from online social media sites is also a rapidly growing task of DM, and researchers have utilized FST for opinion mining as well. In [12-16], FST is used for opinion mining from different types of social media sites. M. Afzaal et al. proposed an opinion classification system for mining tourist reviews. The huge quantity of data available to tourists makes it difficult for them to choose between places to visit. A number of opinion mining techniques have been proposed for this task, but this research work proposes mining the opinions on the basis of the aspects given in the reviews; mining the given aspects can be useful for decision making. J. Shaidah introduced information extraction and fuzzy set theory based opinion mining; the proposed system was capable of dealing with natural language text and categorizes the opinions into two categories, positive or neutral. The business and finance field needs immediate response analysis to make its services more attractive and impressive, and social media makes it easy for customers to express their grievances over what they feel was not up to the mark. Keeping in view the strength of social media, the banking and business sectors find it appropriate to have automated opinion analysis systems. Kumar Ravi et al. in their research work proposed a methodology based on fuzzy formal concept analysis and concept-level sentiment analysis. Their work presents an automated system for opinion analysis of comments and reviews expressed by customers based on the services they
avail. Despite the heavy research work done on the topic of opinion mining, the depth and diversity of opinions still leave room for more research on this topic. Farman Ali et al. in their research work proposed an opinion mining methodology that hybridizes fuzzy domain ontology and SVM to increase the precision rate and the accuracy of the classifier. The proposed method targets the issue of opinion polarity and tries to extract the extreme aspect of the opinion; on the basis of that polarity, the opinions are analysed to help develop knowledge-based automated systems.
The literature review on the use of FST for DM shows the importance of FST for performing a number of different tasks related to DM, including FS, opinion mining, sentiment analysis and classification.
III. PROPOSED METHODOLOGY
The main goal of the proposed methodology is to reduce the number of features in the feature set and to incorporate human thinking and reasoning in the decision-making system. The system will also reduce the chances of selecting overlapping features from multiple classes. The proposed method consists of five main steps, presented in Figure 1.
In the first step, features are extracted from the text documents of the selected dataset. In the second and third steps, two variables X and Y are selected to describe the feature importance for particular classes and the feature strength to represent the individual class, respectively. In the fourth step, the fate of the record is decided on the basis of the feature importance for the individual class for X and Y, calculated via equations 1 and 2. In the fifth and final step, we select the final feature set on the basis of the value of variable Z. The variable Z defines the total number of records of a particular class that contain a specific feature. We use this variable because, for the correct prediction of records of a class, a feature that covers most of the records can be more valuable than a feature that has high occurrences within the class but covers only a few instances.
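The coverage count Z described above can be tallied with a simple per-class document count. A minimal sketch, assuming toy records whose class names and feature names are made up for illustration:

```python
from collections import defaultdict

# Hypothetical toy records: (class label, set of distinct features present).
records = [
    ("ClassA", {"f1", "f2"}),
    ("ClassA", {"f1"}),
    ("ClassB", {"f2", "f3"}),
]

# Z[(cls, feat)]: how many records of cls contain feat at least once.
# A feature counts once per record, regardless of how often it occurs inside it.
Z = defaultdict(int)
for cls, feats in records:
    for f in feats:
        Z[(cls, f)] += 1

print(Z[("ClassA", "f1")])  # 2: f1 covers both ClassA records
```

This is exactly the distinction the paragraph draws: f1 covers both ClassA records, so it is a stronger candidate than a feature occurring many times in a single record.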
X = (fc / tfd) × 100    (1)

Y = (fc / tfc) × 100    (2)
Here fc stands for the feature occurrence per class, tfd stands for the total feature occurrence in the dataset and tfc stands for the total features in a class. In the fourth step, we used a scale of four possible intervals to categorize the calculated values of X and Y as relevant (RL), moderate (MD), low relevance (LRL) and irrelevant (IRL). For X the overall upper limit is 100 and the lowest is 0, so we made four ranges: 0-25, 26-50, 51-75 and 76-100. For Y the upper limit changes for every class, so we divided the highest value by 4 and made the ranges accordingly, as done in [17, 18]. The label is assigned according to the importance of X and Y and their calculated values. The variable Y has priority over variable X, as is clearly visible from the decision matrix given in Table 1.
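The four-level labeling and the decision-matrix lookup described above can be sketched as follows. The range boundaries for X follow the text (0-25, 26-50, 51-75, 76-100), and for Y the per-class maximum is split into four equal bands; the assumption that the lowest band maps to "irrelevant" and the highest to "relevant", and all function names, are illustrative rather than taken from the paper.

```python
def label_x(x):
    """Map an X value (0-100) onto the four-level scale (assumed ordering)."""
    if x <= 25:
        return "ir"   # irrelevant band
    if x <= 50:
        return "lr"   # low-relevance band
    if x <= 75:
        return "m"    # moderate band
    return "r"        # relevant band

def label_y(y, y_max):
    """Map a Y value onto four equal bands below the per-class maximum."""
    band = y_max / 4.0
    if y <= band:
        return "ir"
    if y <= 2 * band:
        return "lr"
    if y <= 3 * band:
        return "m"
    return "r"

# Decision matrix from Table 1: rows indexed by the X label, columns by the
# Y label. Note the Y label dominates: every "r" column yields RL and every
# "m" column yields MD, regardless of X.
DECISION = {
    "r":  {"r": "RL", "m": "MD", "lr": "MD",  "ir": "IRL"},
    "m":  {"r": "RL", "m": "MD", "lr": "LRL", "ir": "IRL"},
    "lr": {"r": "RL", "m": "MD", "lr": "LRL", "ir": "IRL"},
    "ir": {"r": "RL", "m": "MD", "lr": "IRL", "ir": "IRL"},
}

def decide(x, y, y_max):
    """Assign a feature one of the four decision labels."""
    return DECISION[label_x(x)][label_y(y, y_max)]

print(decide(80, 90, 100))  # RL: both X and Y in the top band
```

The table-driven lookup makes Y's priority over X explicit in the data rather than in nested conditionals.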
The performance of the proposed method is compared with benchmark methods: two from GFS (Information Gain (IG) and Gini Index (GI)) and one from LFS (OR). The description of the classifiers and the evaluation metrics that we used for performance comparison is given below.
Table 1. Decision matrix
A. Classification Algorithms
The classifiers we selected for training the model on the features chosen by each method are the well-known NB, SVM and RF.
NB: Conditional independence of attributes is the basic assumption of NB. It estimates the class-conditional probability of the attributes and assumes that the attributes are independent of each other. This algorithm performs well when the dataset has noise and outliers. The mathematical representation of this algorithm (Bayes' rule) is:

P(X|Y) = P(Y|X) P(X) / P(Y)    (3)
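As a quick numeric illustration of equation (3), the prior, likelihood and evidence values below are made up purely for demonstration:

```python
# Toy Bayes-rule computation: P(X|Y) = P(Y|X) * P(X) / P(Y).
# All three probability values are illustrative, not from the paper.
p_x = 0.3          # prior P(X), e.g. the class prior
p_y_given_x = 0.8  # likelihood P(Y|X), e.g. feature given class
p_y = 0.5          # evidence P(Y), the feature's overall probability

p_x_given_y = p_y_given_x * p_x / p_y
print(p_x_given_y)  # 0.48
```

NB applies this rule per attribute and multiplies the per-attribute likelihoods, which is where the conditional independence assumption enters.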
RF: As RF is an ensemble algorithm, it combines the characteristics of two or more models. RF works by creating a set of decision trees, each from a random selection of a subset of the training set. The final class of an object is decided by aggregating the votes for each class.
SVM: One of the most frequently used classification algorithms, with both linear and non-linear versions. Its main idea is to draw a separating line within the population plane in such a way that the data points lie at the maximum distance from that line. The mathematical expression for this algorithm is stated as:

A0 + A1 Z1 + A2 Z2 = 0    (4)

Here Z1 and Z2 are the input features and the slope of the separating line is determined by A1 and A2. The variable A0 in the formula is the intercept determined by the algorithm.

The decision matrix (Table 1) is:

        Yr   Ym   Ylr   Yir
  Xr    RL   MD   MD    IRL
  Xm    RL   MD   LRL   IRL
  Xlr   RL   MD   LRL   IRL
  Xir   RL   MD   IRL   IRL

Figure 1. Overview of proposed methodology
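The linear decision rule in equation (4) can be sketched as a sign test: a point (Z1, Z2) falls on one side of the separating line or the other. The coefficient values here are arbitrary illustrative numbers, not learned ones.

```python
# Linear decision rule from equation (4): the sign of a0 + a1*z1 + a2*z2
# determines which side of the separating line a point lies on.
a0, a1, a2 = -1.0, 2.0, 1.0  # illustrative coefficients, not fitted values

def classify(z1, z2):
    """Return +1 or -1 depending on the side of the separating line."""
    score = a0 + a1 * z1 + a2 * z2
    return 1 if score >= 0 else -1

print(classify(1.0, 0.5))  # 1  (score = 1.5, above the line)
print(classify(0.0, 0.0))  # -1 (score = -1.0, below the line)
```

Training the SVM amounts to choosing a0, a1, a2 so that this line maximizes the margin to the nearest points of each class.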
B. Performance Evaluation
Precision = TP / (TP + FP)    (5)

Recall = TP / (TP + FN)    (6)

F-measure = 2 × Precision × Recall / (Precision + Recall)    (7)
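Equations (5)-(7) can be computed directly from the per-class true-positive, false-positive and false-negative counts; the counts used in the example call are made up for illustration.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are correct, eq. (5)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of actual positives that are found, eq. (6)."""
    return tp / (tp + fn)

def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall, eq. (7)."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

print(f_measure(8, 2, 2))  # 0.8: precision and recall are both 0.8 here
```

The harmonic mean penalizes an imbalance between precision and recall, which is why F-measure is used for the comparison in Section IV.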
IV. RESULTS AND DISCUSSION
The results obtained by training the models on each of the feature sets selected by the benchmark methods and by the proposed method are presented in Table 2. We selected thresholds of one fourth and one half of the extracted features to be used for training. These thresholds check the effect of reducing the number of features on the efficiency of the FS method.
Table 2. Results of improved FST approach vs. benchmark FS methods

No. of features: 27
                     NB       SVM      RF
  GI                 74.1%    74.7%    82.7%
  IG                 86.9%    86.7%    80.2%
  OR                 73.4%    71.6%    69.8%
  Proposed Method    90.0%    85.4%    89.7%

No. of features: 54
                     NB       SVM      RF
  GI                 81.5%    83.5%    77.7%
  IG                 89.0%    82.1%    84.0%
  OR                 76.6%    73.7%    78.5%
  Proposed Method    88.7%    86.3%    87.6%
The results presented in Table 2 show that the proposed method performed well for all classifiers compared to GI and OR. The performance of the proposed method and IG shows only minor fluctuation for each classifier. It can also be deduced from the presented results that the proposed method can show better performance for a reduced feature set, whereas the performance of the other methods improves only by increasing the number of features, which may result in increased time complexity. The performance graph presented in Figure 2 clearly depicts the efficiency of the proposed method.
V. CONCLUSION
This research work presents a novel method of feature selection that incorporates the human thinking process in feature selection and can improve the classifier's accuracy with the least number of features. The results of the proposed method were compared with benchmark FS methods, namely IG, GI and OR. The F-measure shows that the proposed method based on FST not only reflects human thinking but also enhances the classifier's accuracy with a minimum number of features.
REFERENCES
Figure 2. Graphical representation of accuracy achieved by using different FS methods
[1] Ramírez-Gallego, S., Krawczyk, B., García, S., Woźniak, M., &
Herrera, F. (2017). “A survey on data pre-processing for data stream
mining: current status and future directions”. Neurocomputing, 239,
39-57.
[2] Noh, S., Zoltowski, M. D., Sung, Y., & Love, D. J. (2014). “Pilot
beam pattern design for channel estimation in massive MIMO
systems”. IEEE Journal of Selected Topics in Signal Processing, 8(5),
787-801.
[3] Melin, P., & Castillo, O. (2014). “A review on type-2 fuzzy logic
applications in clustering, classification and pattern recognition”.
Applied soft computing, 21, 568-577.
[4] Akhtar, M. S., Gupta, D., Ekbal, A. , & Bhattacharyya, P, (2017).
"Feature selection and ensemble construction: A two-step method for
aspect based sentiment analysis". Knowledge-Based Systems, 125,
116-135
[5] Akhtar, M. S., Gupta, D., Ekbal, A., & Bhattacharyya, P. (2017).
“Feature selection and ensemble construction: A two-step method for
aspect based sentiment analysis”. Knowledge-Based Systems, 125,
116-135.
[6] Fong, S., Wong, R., & Vasilakos, A. (2016). “Accelerated PSO
swarm search feature selection for data stream mining big data”.
IEEE transactions on services computing, (1), 1-1.
[7] Uysal, A. K. (2017). "An improved global feature selection scheme
for text classification". Expert systems with Applications, 43, 82-92.
[8] Agnihotri, D., Verma, K., & Tripathi, P. (2017). "Variable Global
Feature Selection Scheme for automatic classification of text
documents". Expert Systems with Applications, 81, 268-281.
[9] Rehman, A., Javed, K., & Babri, H. A. (2017). Feature selection
based on a normalized difference measure for text classification.
Information Processing & Management, 53(2), 473-489.
[10] Wang, C., Qi, Y., Shao, M., Hu, Q., Chen, D., Qian, Y., & Lin, Y.
(2017). A fitting model for feature selection with fuzzy rough sets.
IEEE Transactions on Fuzzy Systems, 25(4), 741-753.
[11] Hussain. S, A Methodology to Predict the Instable Classes, 32nd
ACM Symposium on Applied Computing (SAC) Morocco, 4th to 6th
April 2017
[12] Afzaal, M., Usman, M., Fong, A. C. M., Fong, S., & Zhuang, Y.
(2016). “Fuzzy aspect based opinion classification system for mining
tourist reviews”. Advances in Fuzzy Systems, 2016, 2.
[13] Deng, Y., Ren, Z., Kong, Y., Bao, F., & Dai, Q. (2017). “A
hierarchical fused fuzzy deep neural network for data classification”.
IEEE Transactions on Fuzzy Systems, 25(4), 1006-1012.
[14] Hussain, S., "Threshold Analysis of Design Metrics to Detect
Design Flaws", 31st ACM Symposium on Applied Computing
(SRC), 2016, 4-8 April, 2016, Pisa, Italy.
[15] Ravi, K., Ravi, V., & Prasad, P. S. R. K. (2017). “Fuzzy formal
concept analysis based opinion mining for CRM in financial
services”. Applied Soft Computing, 60, 786-807.
[16] Ali, F., Kwak, K. S., & Kim, Y. G. (2016). “Opinion mining based on
fuzzy domain ontology and support vector machine: a proposal to
automate online review classification”. Applied Soft Computing, 47,
235-250.
[17] R. H. Gamma, "Design Pattern," in MA: Adison Wesley, Boston,
1995.
[18] Hussain. S, Keung. J., Sohail. M. K, Ilahi. M, Khan. A. A, Automated
Framework for Classification and Selection of Software Design
Patterns, Applied Soft Computing, October,2018.