
An Automated Text Classification Method: Using Improved Fuzzy Set Approach for Feature Selection

Bushra Zaheer Abbasi, Shahid Hussain, *Muhammad Imran Faisal
Department of Computer Science
COMSATS University of Information Technology
Islamabad, Pakistan, *Federal Urdu University, Islamabad
bushrazaheer.abbasi@gmail.com, shussain@comsats.edu.pk, *faisii00700@gmail.com
Abstract: A well-representing feature set with enough differentiating power plays an important role in classification. The existing techniques for feature set selection are mostly statistical. They are not flexible enough to incorporate human reasoning or the changing requirements and preferences of real-life systems, and they only make a binary decision between a feature's inclusion or exclusion. The fuzziness of human reasoning and thinking, which may improve feature selection and hence the accuracy of the classifier, is not considered at all. In addition, the selection of overlapping features in Local Feature Selection (LFS) methods is an important issue that negatively impacts classification accuracy; for example, in the case of Odds Ratio (OR), the selection may contain overlapping features from multiple classes. In this paper, a Fuzzy Set Theory (FST) based feature selection method is proposed that aims to tackle both of the above-mentioned issues efficiently. The selected final feature set is used to train well-known classification algorithms, and the results are compared with Global Feature Selection (GFS) and LFS methods. The comparison shows that the proposed method improves the accuracy of the classifiers and also extracts a comparatively small feature set, which ultimately reduces the time complexity of the system.
Keywords: Classification, accuracy, feature selection, Fuzzy Set Theory, global feature selection, local feature selection
I. INTRODUCTION
Feature selection is the activity of selecting the most appropriate and representative features. This process curtails the number of features by skipping duplicate, noisy and least important features. Feature selection can be performed either at the global or the local level [1]. GFS methods calculate the overall importance of a feature irrespective of its relevance to any particular class [2]. LFS methods calculate the importance of a feature for every possible class individually and perform the final selection on these individual scores [2]. These selection methods are mostly statistical and declare a feature's status as either important or unimportant, but in most real-life scenarios the decisions are not that simple and involve a number of human uncertainties. This may happen because of varying ground realities that cannot be bound to a binary {0,1} selection [3]. This suggests that many incorrectly classified records may be misclassified because of this binary nature of statistical methods. The literature also reveals that most LFS methods, being local to an individual class, may select repeated features for multiple classes [4]. An extra check may be required to handle this issue, which may result in increased computational cost; otherwise, the duplicate selection of a feature from different classes may hinder the classifier's performance.
In this paper, we propose a novel FST-based approach for feature selection. The proposed method uses a fuzzy decision matrix consisting of decision variables and hybridizes the concepts of GFS and LFS, making the selection of features at both levels. First, we select features at the local level for each class on the basis of the feature importance calculated using fuzzy decision matrices. Later, we use one more variable to make the final selection at the global level. This mixing of the global and local levels helps to avoid the repetition of the same feature across different classes. We name this hybrid approach the Improved FST based feature selection method. The decision matrix of FST in the proposed method uses four levels of decision on the decision scale. This arrangement helps to avoid the binary nature of selection and makes decision making more flexible by introducing four levels of selection instead of only two, as in binary systems. The proposed system is applied to a case study selected from the Gang-of-Four (GoF) design pattern series. We also use three highly recommended and widely preferred classifiers (Naïve Bayes (NB), Support Vector Machine (SVM) and Random Forest (RF)) to check the effect of the proposed method on classification accuracy. The proposed method is compared with benchmark GFS and LFS methods for performance evaluation.
The rest of the paper is organized as follows. Section II presents the state-of-the-art work in the area of feature selection and the utilization of FST for feature selection. The proposed methodology, along with the experimental setup used in this research, is given in Section III. Section IV states the results along with a discussion of those results. We conclude the work in Section V, the last section of the paper.
II. RELATED WORK
The last few decades have witnessed rapidly growing interest in the area of Data Mining (DM) within the research community. Researchers are working on different aspects of DM such as data extraction, pre-processing, feature extraction, feature construction, feature selection and classification/prediction. They utilize statistical, mathematical, heuristic, metaheuristic and nature-inspired algorithms to deal with the issues at hand. In this section, a few studies from the recent past that address feature selection for DM are presented.
D. F. Gillies in [4] performed a number of experiments using the most frequently used filter, wrapper and embedded FS methods, and specifically checked their performance on microarray data. Microarray data come from a biological platform that gathers gene expression measurements for different experiments and analyses. The conducted survey comprises a number of FS methods. They compared the performance of each and concluded that there is no free lunch in the case of FS methods either.
The nature-inspired Particle Swarm Optimization (PSO) algorithm is adopted in [5, 6]. In both of these papers, PSO is used for feature selection for different tasks. In [5], sentiment analysis is performed based on aspects extracted from the given data, and PSO is utilized for aspect extraction and construction of the feature set. In this work, M. S. Akhtar et al. further constructed an ensemble of three classifiers and presented a cascade model comprising both of the above-mentioned parts. In [6], PSO is utilized for feature selection from high-dimensional real-life data streams; the proposed method is a lightweight, computationally efficient FS method.
An FS method to handle the class imbalance problem was proposed by A. K. Uysal in [7]. The proposed method, named the improved global feature selection scheme, addresses the class imbalance problem by selecting an equal number of features from each class. In this method, the author used OR to find the inclination towards a particular class, either positive or negative, and then selected N features from each class. This method improved the representation of the smallest class in the final feature set, and the experimental results validate its effectiveness. A drawback of this method was later highlighted by D. Agnihotri in [8]. According to the author, the previously proposed method may ignore some highly representative features of the majority class because of its rigidness. This omission affects the training and ultimately the performance of the classifier. The author proposed a variable global feature selection scheme to select a variable number of features from each class based on the volume of the class.
A. Rehman et al. proposed a normalized difference measure for feature selection from text documents in [9]. According to the authors, existing benchmark methods assign equal ranks to features that have the same difference of document frequencies but ignore their relative document frequencies. Keeping this fact in mind, the authors proposed the normalized difference measure and compared it with seven well-known FS methods on seven datasets and two classifiers, namely NB and SVM. The macro-F1 measure of the results shows that the proposed method performed better than the compared methods in 66% of cases, whereas for the micro-F1 measure this value is 51%.
From the discussion in the above paragraphs, it can be concluded that feature selection remains an important open issue. Researchers have focused on this issue and proposed a number of approaches to tackle it from different perspectives. Statistical, nature-inspired, heuristic, metaheuristic, hybrid and distance-based approaches have been adopted to select the best representative features. The review of these approaches concludes that there is no free lunch and that all the approaches have their own pros and cons: if one gives better results in a particular domain, another may outperform the rest in some other domain.
An improved fuzzy rough set has been utilized in [10, 11] for feature selection. The existing fuzzy rough set model has the problem that it can merely maintain a maximal dependency function and may not fit the given dataset well. To overcome these issues, C. Wang et al. introduced an improved form of fuzzy rough set that can fit the given dataset. The concepts of fuzzy neighborhood and a parameterized fuzzy relation are utilized by the authors to define the upper and lower approximations of the decision. The results show the validity of the proposed method over the basic fuzzy rough set method.
Opinion mining from online social media sites is also a rapidly growing DM task, and researchers have utilized FST for opinion mining as well. In [12-16], FST is used for opinion mining from different types of social media sites. M. Afzaal et al. proposed an opinion classification system for mining tourist reviews. The huge quantity of data available to tourists makes it difficult for them to choose between the places to visit. A number of opinion mining techniques have been proposed for this task, but this work mines the opinions on the basis of the aspects given in the reviews; mining the given aspects can be useful for decision making. J. Shaidah introduced opinion mining based on information extraction and fuzzy set theory. The proposed system was capable of dealing with natural language text and categorizes the opinions into two categories, positive or neutral. The business and finance field needs immediate response analysis to make its services more attractive and impressive. Social media makes it easy for customers to express their grievances over what they feel was not up to the mark. Keeping in view the strength of social media, the banking and business sectors find it appropriate to have automated opinion analysis systems. Kumar Ravi et al. in their research work proposed a methodology based on fuzzy formal concept analysis and concept-level sentiment analysis. Their work presents an automated system for opinion analysis of the comments and reviews expressed by customers based on the services they
avail. Despite the heavy research work done on the topic of opinion mining, the depth and diversity of opinions still leave room for more research on this topic. Farman Ali et al. in their research work proposed an opinion mining methodology that hybridizes a fuzzy domain ontology and SVM to increase the precision rate and the accuracy of the classifier. The proposed method targets the issue of opinion polarity and tries to extract the extreme aspect of the opinion. On the basis of that polarity, the opinions are analyzed to help develop knowledge-based automated systems.
The literature review on the use of FST for DM shows the importance of FST for performing a number of different DM tasks, including FS, opinion mining, sentiment analysis and classification.
III. PROPOSED METHODOLOGY
The main goal of the proposed methodology is to reduce the number of features in the feature set and to incorporate human thinking and reasoning into the decision-making system. The system also reduces the chance of selecting overlapping features from multiple classes. The proposed method consists of five main steps, presented in Figure 1 below.

Figure 1. Overview of the proposed methodology
In the first step, features are extracted from the text documents of the selected dataset. In the second and third steps, two variables X and Y are computed to describe, respectively, the feature importance for a particular class and the feature's strength in representing that individual class. In the fourth step, the fate of the feature is decided on the basis of its importance for the individual class, using the values of X and Y calculated via equations (1) and (2). In the final, fifth step, we select the final feature set (FFS) on the basis of the value of a variable Z, which is the total number of records of a particular class that contain a specific feature. We use this variable because, for the correct prediction of the records of a class, a feature that covers most of the records can be more valuable than a feature that has many occurrences within the class but covers only a few instances.
X = (fc / tfd) × 100        (1)

Y = (fc / tfc) × 100        (2)
Here fc stands for the feature occurrence per class, tfd stands for the total feature occurrence in the dataset, and tfc stands for the total features in a class. In the fourth step, we used a scale of four possible intervals to categorize the calculated values of X and Y as relevant (RL), moderate (MD), low relevance (LRL) and irrelevant (IRL). For X, the overall upper limit is 100 and the lowest is 0; we made four ranges: 0-25, 26-50, 51-75 and 76-100. For Y, the upper limit changes for every class; therefore, we divided the highest value by 4 and made the ranges accordingly, as done in [17, 18]. The label is assigned according to the importance of X and Y and their calculated values. The variable Y has priority over the variable X, as is clearly visible from the decision matrix given in Table 1.
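To make the procedure concrete, the following minimal Python sketch shows one possible reading of steps two to five: scoring each feature per class with equations (1) and (2), bucketing the scores into the four-level scale, labelling through the decision matrix of Table 1 (with Y taking priority), and finally ranking the locally relevant features by Z while avoiding overlap across classes. All names (scores, level, select_features, the input dictionaries) and the collapsing of the two lowest Y levels into IRL are our own assumptions for illustration, not the authors' code.

```python
def scores(fc, tfd, tfc):
    """Equations (1) and (2): X = (fc/tfd)*100, Y = (fc/tfc)*100."""
    X = 100.0 * fc / tfd if tfd else 0.0
    Y = 100.0 * fc / tfc if tfc else 0.0
    return X, Y

def level(value, upper):
    """Bucket a score into four equal intervals (e.g. 0-25, 26-50, 51-75, 76-100)."""
    step = upper / 4.0
    if value <= step:
        return "IRL"
    if value <= 2 * step:
        return "LRL"
    if value <= 3 * step:
        return "MD"
    return "RL"

# Table 1: the final label follows the level of Y regardless of the level of X.
DECISION = {"RL": "RL", "MD": "MD", "LRL": "IRL", "IRL": "IRL"}

def select_features(class_feature_counts, dataset_feature_counts,
                    class_record_counts, per_class=10):
    """
    class_feature_counts:   {cls: {feature: occurrences of the feature in cls}}
    dataset_feature_counts: {feature: occurrences of the feature in the dataset}
    class_record_counts:    {cls: {feature: Z, i.e. records of cls containing it}}
    Returns one merged feature set without repeating a feature across classes.
    """
    chosen = {}                                   # feature -> class that claimed it
    for cls, counts in class_feature_counts.items():
        if not counts:
            continue
        tfc = sum(counts.values())                # total features in the class
        xy = {f: scores(fc, dataset_feature_counts.get(f, fc), tfc)
              for f, fc in counts.items()}
        y_max = max(y for _, y in xy.values()) or 1.0   # Y's class-specific upper limit
        relevant = [f for f, (x, y) in xy.items()
                    if DECISION[level(y, y_max)] == "RL"]
        # Global step: prefer features that cover the most records of the class (Z).
        relevant.sort(key=lambda f: class_record_counts[cls].get(f, 0), reverse=True)
        picked = 0
        for f in relevant:
            if f not in chosen:                   # avoid overlap with other classes
                chosen[f] = cls
                picked += 1
            if picked >= per_class:
                break
    return set(chosen)
```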
The performance of the proposed method is compared with benchmark methods: two from GFS (Information Gain (IG) and Gini Index (GI)) and one from LFS (OR). The classifiers and the evaluation metrics used for the performance comparison are described below.
Table 1. Decision matrix

          Yr     Ym     Yir
  Xr      RL     MD     IRL
  Xm      RL     MD     IRL
  Xlr     RL     MD     IRL
  Xir     RL     MD     IRL
A. Classification Algorithms
The classifiers that we selected for training the model on the features chosen by each method are the well-known NB, SVM and RF.
NB: Conditional independence of the attributes is the basic assumption of NB. It estimates the class-conditional probability of the attributes and assumes that the attributes are independent of each other. This algorithm performs well when the dataset has noise and outliers. The mathematical representation of this algorithm (Bayes' rule) is:

P(X|Y) = P(Y|X) P(X) / P(Y)        (3)
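As a toy illustration of the naive Bayes assumption, the snippet below computes unnormalized class posteriors as the prior times a product of per-attribute conditional probabilities. The priors, attribute names and probability values are invented purely for demonstration and are not taken from the paper.

```python
# Invented example values: P(class) and P(attribute | class) under independence.
priors = {"pos": 0.6, "neg": 0.4}
cond = {
    "pos": {"contains_factory": 0.7, "contains_observer": 0.2},
    "neg": {"contains_factory": 0.1, "contains_observer": 0.5},
}

def posterior_scores(attributes):
    """Unnormalized P(class | attributes) under conditional independence."""
    result = {}
    for cls, prior in priors.items():
        score = prior
        for a in attributes:
            score *= cond[cls][a]   # multiply the independent attribute likelihoods
        result[cls] = score
    return result

# The class with the larger score is the predicted class.
print(posterior_scores(["contains_factory", "contains_observer"]))
```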
RF: RF is an ensemble algorithm, so it combines the predictions of multiple base learners. RF works by creating a set of decision trees, each built from a random selection of a subset of the training set. The final class of an object is decided by aggregating the votes of the individual trees for each class.
SVM: SVM is one of the most frequently used classification algorithms and has both linear and non-linear versions. Its main idea is to draw a separating line (hyperplane) in the feature space in such a way that the data points lie at the maximum possible distance from that line.
The mathematical expression for this algorithm is stated as:

A0 + A1 × Z1 + A2 × Z2 = 0        (4)

Here Z1 and Z2 are the input features, the slope of the separating line is determined by A1 and A2, and the variable A0 in the formula is the intercept determined by the algorithm.
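As a small worked illustration of equation (4), the snippet below evaluates such a linear decision function for two example points; the coefficient values are arbitrary demonstration numbers, not values learned from the paper's data.

```python
# Toy illustration of the linear decision function A0 + A1*Z1 + A2*Z2 from Eq. (4).
A0, A1, A2 = -1.0, 0.8, 0.5   # arbitrary example coefficients

def decision(z1, z2):
    """Signed score; its sign indicates the side of the separating line."""
    return A0 + A1 * z1 + A2 * z2

for z1, z2 in [(2.0, 1.0), (0.5, 0.2)]:
    side = "positive class" if decision(z1, z2) > 0 else "negative class"
    print(f"point ({z1}, {z2}) -> score {decision(z1, z2):+.2f} -> {side}")
```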
B. Performance Evaluation
The classifiers trained on each selected feature set are evaluated using precision, recall and the F-measure:

Precision = TP / (TP + FP)        (5)

Recall = TP / (TP + FN)        (6)

F-measure = (2 × Precision × Recall) / (Precision + Recall)        (7)

where TP, FP and FN are the numbers of true positives, false positives and false negatives, respectively.
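For illustration, a minimal scikit-learn sketch of how the three classifiers could be trained on a selected feature subset and scored with these metrics is given below. It assumes the documents, class labels and selected terms are already available as docs, labels and selected_features (hypothetical names), and the specific estimators (MultinomialNB, LinearSVC, RandomForestClassifier) are our choices, not the authors' original experimental setup.

```python
# Minimal sketch: train NB, SVM and RF on a bag-of-words representation restricted
# to a selected feature subset, then report macro precision, recall and F-measure.
# Assumes `docs` (list of str), `labels` (list of class names) and
# `selected_features` (list of terms chosen by a feature selection method) exist.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support

vectorizer = CountVectorizer(vocabulary=selected_features)   # keep only selected terms
X = vectorizer.fit_transform(docs)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, random_state=0)

for name, clf in [("NB", MultinomialNB()),
                  ("SVM", LinearSVC()),
                  ("RF", RandomForestClassifier(n_estimators=100))]:
    clf.fit(X_train, y_train)
    p, r, f1, _ = precision_recall_fscore_support(
        y_test, clf.predict(X_test), average="macro", zero_division=0)
    print(f"{name}: precision={p:.3f} recall={r:.3f} F-measure={f1:.3f}")
```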
IV. RESULTS AND DISCUSSION
The results obtained by training the model on each of the feature sets selected by the benchmark methods and by the proposed method are presented in Table 2. We selected thresholds of one quarter and one half of the extracted features to be used for training. These thresholds check the effect of reducing the number of features on the efficiency of each FS method.
Table 2. Results of the improved FST approach vs. benchmark FS methods

Number of features: 27
                      NB       SVM      RF
  GI                  74.1%    74.7%    82.7%
  IG                  86.9%    86.7%    80.2%
  OR                  73.4%    71.6%    69.8%
  Proposed Method     90.0%    85.4%    89.7%

Number of features: 54
                      NB       SVM      RF
  GI                  81.5%    83.5%    77.7%
  IG                  89.0%    82.1%    84.0%
  OR                  76.6%    73.7%    78.5%
  Proposed Method     88.7%    86.3%    87.6%
The results presented in Table 2 show that the proposed method performs well for all the classifiers compared to GI and OR. The performance of the proposed method and IG shows only minor fluctuations for each classifier. It can also be deduced from the presented results that the proposed method can show better performance with a reduced feature set, whereas the performance of the other methods improves only by increasing the number of features, which may result in increased time complexity. The performance graph presented in Figure 2 clearly depicts the efficiency of the proposed method.

Figure 2. Graphical representation of the accuracy achieved by using different FS methods
V. CONCLUSION
This research work presents a novel feature selection method that incorporates the human thinking process into feature selection and can improve the classifier's accuracy with the least number of features. The results of the proposed method are compared with benchmark FS methods, namely IG, GI and OR. The F-measure shows that the proposed method based on FST not only reflects human thinking but also enhances the classifier's accuracy with a minimum number of features.
REFERENCES
[1] Ramírez-Gallego, S., Krawczyk, B., García, S., Woźniak, M., &
Herrera, F. (2017). A survey on data pre-processing for data stream
mining: current status and future directions. Neurocomputing, 239,
39-57.
[2] Noh, S., Zoltowski, M. D., Sung, Y., & Love, D. J. (2014). Pilot
beam pattern design for channel estimation in massive MIMO
systems. IEEE Journal of Selected Topics in Signal Processing, 8(5),
787-801.
[3] Melin, P., & Castillo, O. (2014). A review on type-2 fuzzy logic
applications in clustering, classification and pattern recognition.
Applied soft computing, 21, 568-577.
[4] Akhtar, M. S., Gupta, D., Ekbal, A. , & Bhattacharyya, P, (2017).
"Feature selection and ensemble construction: A two-step method for
aspect based sentiment analysis". Knowledge-Based Systems, 125,
116-135
[5] Akhtar, M. S., Gupta, D., Ekbal, A., & Bhattacharyya, P. (2017).
Feature selection and ensemble construction: A two-step method for
aspect based sentiment analysis. Knowledge-Based Systems, 125,
116-135.
[6] Fong, S., Wong, R., & Vasilakos, A. (2016). Accelerated PSO
swarm search feature selection for data stream mining big data.
IEEE transactions on services computing, (1), 1-1.
[7] Uysal, A. K. (2017). "An improved global feature selection scheme
for text classification". Expert systems with Applications, 43, 82-92.
[8] Agnihotri, D., Verma, K., & Tripathi, P. (2017). Variable Global Feature Selection Scheme for automatic classification of text documents. Expert Systems with Applications, 81, 268-281.
[9] Rehman, A., Javed, K., & Babri, H. A. (2017). Feature selection
based on a normalized difference measure for text classification.
Information Processing & Management, 53(2), 473-489.
[10] Wang, C., Qi, Y., Shao, M., Hu, Q., Chen, D., Qian, Y., & Lin, Y.
(2017). A fitting model for feature selection with fuzzy rough sets.
IEEE Transactions on Fuzzy Systems, 25(4), 741-753.
[11] Hussain, S. (2017). A methodology to predict the instable classes. 32nd ACM Symposium on Applied Computing (SAC), Morocco, 4-6 April 2017.
[12] Afzaal, M., Usman, M., Fong, A. C. M., Fong, S., & Zhuang, Y.
(2016). Fuzzy aspect based opinion classification system for mining
tourist reviews. Advances in Fuzzy Systems, 2016, 2.
[13] Deng, Y., Ren, Z., Kong, Y., Bao, F., & Dai, Q. (2017). A
hierarchical fused fuzzy deep neural network for data classification.
IEEE Transactions on Fuzzy Systems, 25(4), 1006-1012.
[14] Hussain, S. (2016). Threshold analysis of design metrics to detect design flaws. 31st ACM Symposium on Applied Computing (SRC), 4-8 April 2016, Pisa, Italy.
[15] Ravi, K., Ravi, V., & Prasad, P. S. R. K. (2017). Fuzzy formal
concept analysis based opinion mining for CRM in financial
services. Applied Soft Computing, 60, 786-807.
[16] Ali, F., Kwak, K. S., & Kim, Y. G. (2016). Opinion mining based on
fuzzy domain ontology and support vector machine: a proposal to
automate online review classification. Applied Soft Computing, 47,
235-250.
[17] Gamma, E., Helm, R., Johnson, R., & Vlissides, J. (1995). Design Patterns: Elements of Reusable Object-Oriented Software. Boston, MA: Addison-Wesley.
[18] Hussain, S., Keung, J., Sohail, M. K., Ilahi, M., & Khan, A. A. (2018). Automated framework for classification and selection of software design patterns. Applied Soft Computing.