Variance-Based Feature Selection
for Enhanced Classification Performance
D. Lakshmi Padmaja and B. Vishnuvardhan
Abstract Irrelevant-feature elimination, when used correctly, helps to enhance feature selection accuracy, which is critical in dimensionality reduction tasks. The additional intelligence enhances the search for an optimal subset of features by reducing the dataset based on previous performance. The search procedures used are completely probabilistic and heuristic. Although existing algorithms use various measures to evaluate the best feature subsets, they fail to eliminate irrelevant features. The procedure explained in the current paper focuses on an enhanced feature selection process based on random subset feature selection (RSFS). RSFS uses the random forest (RF) algorithm for better feature reduction. Through extensive testing of this procedure, carried out previously on several scientific datasets with different geometries, we aim to show in this paper that an optimal subset of features can be derived by eliminating the features that lie more than two standard deviations away from the mean. In many real-world applications involving scientific data (e.g., cancer detection, diabetes, and medical diagnosis), removing irrelevant features increases detection accuracy at lower cost and time. This helps domain experts by reducing the number of features and saving valuable diagnosis time.
Keywords Random subset feature selection · Random forest · Variance-based selection · Classification accuracy
D. Lakshmi Padmaja
Department of Information Technology, Anurag Group of Institutions (CVSR),
Hyderabad, India
e-mail: glpadmaja@gmail.com
B. Vishnuvardhan (B)
Department of Computer Science and Engineering, JNTUH, Hyderabad, India
e-mail: mailvishnuvardhan@gmail.com
© Springer Nature Singapore Pte Ltd. 2019
S. C. Satapathy et al. (eds.), Information Systems Design and Intelligent Applications,
Advances in Intelligent Systems and Computing 862,
https://doi.org/10.1007/978-981-13-3329-3_51
1 Introduction
Dimensionality reduction [1–3] is an important preprocessing task in data mining. It helps reduce the dimensionality of a dataset, thereby addressing the curse-of-dimensionality issue. In general, datasets collected from scientific experiments have more features than instances. This results in increased processing time and reduced classification accuracy. Hence, it is prudent to reduce the dimensionality of the dataset without compromising the intrinsic geometric properties of the data. For reducing the dimensionality and removing irrelevant features [4–7], we proposed a modified RSFS algorithm in our earlier work. Evaluation measures such as accuracy, information, distance, dependency, and consistency [8–10] have been discussed in detail in the literature. This paper focuses on the consistency of feature selection as a parameter [11, 12].
Feature selection algorithms are mainly categorized into filter, wrapper, and hybrid algorithms. Filter methods are classifier-independent, whereas wrapper methods are classifier-dependent and computationally intensive. In view of this, the RSFS algorithm is used for selecting the relevant features. This helps in evaluating performance measures such as accuracy and consistency on real-world applications. It also helps in reducing the dimensionality of scientific datasets, which are unstructured, sparse, and often contain missing values [13, 14]. The randomization process eliminates bias and overfitting problems. Eliminating irrelevant features from scientific data reduces cost and time (e.g., for cancer patients, by reducing the number of tests), with the help of domain experts. This approach is especially useful in medical diagnosis.
In this paper, Sect. 1 gives the introduction, Sect. 2 discusses the existing method, Sect. 3 presents the proposed algorithm, Sect. 4 describes the datasets and experimental setup, Sect. 5 provides the results, and Sect. 6 gives the conclusion.
2 Existing Method
We have selected the RSFS algorithm, along with eight publicly available datasets, for this study.
2.1 Existing RSFS Algorithm
The aim of the random subset feature selection (RSFS) algorithm [8, 15] is to identify the best possible feature subsets, from a large dataset, in terms of their usefulness in a classification task. The feature selection is carried out iteratively. Traditional feature selection techniques find the best subset of features using methods such as the forward selection algorithm, which iteratively adds new relevant features to the existing set, and the backward elimination algorithm, which iteratively removes unwanted features. Another approach is based on feature correlation, ranking the features according to their usefulness [16]. Using a kNN classifier, features are selected in each iteration. Relevancy is calculated as the difference between the observed performance criterion and the expected criterion. In RSFS, feature subset selection is based on a statistical comparison against random-walk statistics. To improve the classification performance, the random subset size is not fixed. Iterations continue until good features are found; they stop either at a user-defined threshold or when the search is exhausted, whichever comes first. In RSFS, the random subset classification is performed as many times as necessary in order to separate genuinely good features from features that merely appear useful due to the random components of the process.
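The published RSFS implementation is in MATLAB; purely as an illustration of this random-walk comparison, the Python fragment below sketches one way to score a feature against chance. The function name good_feature_probability and the step_std parameter are hypothetical, introduced here only for exposition; the original code organizes this bookkeeping differently.

    import math

    def good_feature_probability(cum_relevance, n_draws, step_std):
        # Under the null hypothesis, a feature's cumulated relevance behaves
        # like a zero-mean random walk whose spread after n_draws subset
        # draws is step_std * sqrt(n_draws).
        z = cum_relevance / (step_std * math.sqrt(n_draws))
        # Standard normal CDF: probability that pure chance stays below z.
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

Features for which this probability approaches 1 are unlikely to look useful by chance alone, which is the sense in which RSFS separates good features from random fluctuations.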
2.2 Classification
In the random subset feature selection algorithm, a kNN classifier is used to classify the data over the random feature subsets generated in the manner of the random forest algorithm [17–19]. A kNN classifier is simple to use, and it is easy to select features based on a nearest-neighbor measure. Usually, when the value of k is large, low variance and high bias are observed, and for small values of k, high variance and low bias are observed.
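As a small self-contained illustration of this bias-variance trade-off (ours, not the authors' experiment), the sketch below scores a kNN classifier for several values of k on the Fisher iris data, which also appears in Table 1:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    for k in (1, 5, 15, 45):
        clf = KNeighborsClassifier(n_neighbors=k)
        # Larger k smooths the decision boundary (lower variance, higher
        # bias); smaller k fits local structure (higher variance, lower bias).
        print(k, cross_val_score(clf, X, y, cv=5).mean())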
2.3 Motivation
The existing methodology suffers from inconsistency: it selects different subsets of features in different runs. This creates suspicion in the view of domain experts and cannot produce a reliable and stable outcome. This has motivated us to enhance the existing methodology with a view to reducing the inconsistency and improving the classification accuracy.
3 Proposed Methodology
The proposed methodology is built and evaluated using the RSFS algorithm [8, 15, 20]. The procedure is suitably modified to assess the consistency and accuracy of the features, taking the variance of the feature relevancies from the mean as the criterion. It is evident from Table 1 that the RSFS + 1NN algorithm gives better accuracy and greater feature reduction than the plain kNN algorithm. The pseudocode for the algorithm is shown below:
1. Normalize both training and testing datasets.
2. Randomly select the subset from the full feature set.
3. Classify the dataset using kNN classifier.
Table 1 Dataset details

| S. No. | Dataset        | No. of features | No. of instances | kNN accuracy (%) | kNN + RSFS features | kNN + RSFS accuracy (%) |
|--------|----------------|-----------------|------------------|------------------|---------------------|-------------------------|
| 1      | Colon cancer   | 2000            | 62               | 54.55            | 20                  | 63.63                   |
| 2      | CTG            | 20              | 2126             | 85.07            | 10                  | 87.61                   |
| 3      | Lung cancer    | 12,533          | 149              | 93.75            | 2                   | 71.88                   |
| 4      | Lung           | 3312            | 203              | 94.20            | 66                  | 95.65                   |
| 5      | Fisher iris    | 4               | 150              | 92.16            | 2                   | 96.08                   |
| 6      | ALLAML         | 7129            | 72               | 84.00            | 2                   | 80.00                   |
| 7      | Carcinom       | 9182            | 174              | 93.22            | 10                  | 61.02                   |
| 8      | Prostate       | 12,600          | 102              | 88.14            | 11                  | 45.76                   |
| 9      | Forest         | 27              | 523              | 86.46            | 15                  | 82.15                   |
| 10     | Ovarian cancer | 4000            | 216              | 91.89            | 27                  | 95.95                   |
4. Record the feature relevance using the unweighted average recall (UAR).
5. Select the top 1% of features based on the probability scale of performance.
6. Eliminate the bottommost features, i.e., the outliers whose relevancy lies more than three standard deviations below the mean.
7. Repeat the process; once the number of selected features has remained constant for 1000 iterations, choose the features with the best relevancies (those whose probability is greater than 0.99).
8. Otherwise, go to step 2 and repeat all the steps until the selected features are constant (see the sketch after this list).
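The following is a minimal Python rendering of steps 1–8 under stated assumptions; it is not the authors' implementation (which builds on the MATLAB RSFS release). The names modified_rsfs and subset_size, the validation-fold arguments, and the z > 2.326 cut-off (the 99th percentile of the standard normal, standing in for "probability greater than 0.99") are all introduced here only for illustration.

    import numpy as np
    from sklearn.metrics import recall_score
    from sklearn.neighbors import KNeighborsClassifier

    def uar(y_true, y_pred):
        # Unweighted average recall (step 4): mean of per-class recalls.
        return recall_score(y_true, y_pred, average="macro")

    def modified_rsfs(X_tr, y_tr, X_val, y_val, subset_size=20,
                      max_iters=20000):
        # Step 1 (normalization of X_tr / X_val) is assumed to have been
        # done by the caller. subset_size is fixed here for simplicity;
        # RSFS itself varies the subset size.
        rng = np.random.default_rng(0)
        active = np.arange(X_tr.shape[1])   # features still in the running
        relevance = np.zeros(X_tr.shape[1])
        scores, stable_for = [], 0
        for _ in range(max_iters):
            # Step 2: draw a random subset of the surviving features.
            subset = rng.choice(active, size=min(subset_size, active.size),
                                replace=False)
            # Step 3: classify with 1NN on that subset.
            clf = KNeighborsClassifier(n_neighbors=1)
            clf.fit(X_tr[:, subset], y_tr)
            score = uar(y_val, clf.predict(X_val[:, subset]))
            scores.append(score)
            # A feature gains relevance when it appears in subsets that
            # perform better than the running average.
            relevance[subset] += score - np.mean(scores)
            # Step 6: prune outliers > 3 sigma below the mean relevance.
            r = relevance[active]
            keep = r >= r.mean() - 3.0 * r.std()
            stable_for = stable_for + 1 if keep.all() else 0
            active = active[keep]
            if stable_for >= 1000:          # step 7 stopping criterion
                break
        # Step 7: keep features in the extreme upper tail of relevance.
        r = relevance[active]
        z = (r - r.mean()) / (r.std() + 1e-12)
        return active[z > 2.326]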
The algorithm works as shown in Fig. 1.
4 Datasets and Experimental Setup
All the datasets are taken from the UCI machine learning repository [16], www.featureselection.asu.edu [21], www.broadinstitute.org [22], and some are from cancer research studies. The dataset properties are given in Table 1.
From the above results, it is evident that the performance of the RSFS algorithm is superior to that of the other algorithms. To improve it further, the algorithm is modified as shown in Fig. 2.
Fig. 1 Algorithm
Fig. 2 Modified process for variance-based selection
5 Results
The process, as shown in Fig. 2, is followed for reducing the dimensionality while maintaining the consistency of feature selection. The details of the variance-based selection and the resulting feature reduction are shown in Fig. 3.
Fig. 3 Mapping of feature space to variance space
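As a schematic of this mapping (our illustration, not the authors' code), each feature's accumulated relevance can be re-expressed as a z-score, so that the two-sigma and three-sigma rules compared in the conclusion become simple thresholds:

    import numpy as np

    def variance_space(relevance):
        # Map each feature's relevance to a z-score: its distance from the
        # mean relevance measured in standard deviations (cf. Fig. 3).
        return (relevance - relevance.mean()) / relevance.std()

    z = variance_space(np.asarray([0.1, 0.4, 0.9, -0.2, 2.5]))  # toy values
    kept_2_sigma = z > -2.0   # stricter cut: discards borderline features
    kept_3_sigma = z > -3.0   # looser cut: removes only extreme outliers

The three-sigma cut removes only extreme outliers, which is what preserves the consistency of the selected subsets across runs.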
6 Conclusion
Consistency of the selected features is maintained when features are retained within three standard deviations of the mean relevance, compared with a cut-off of two standard deviations from the mean. These selection techniques, combined with the RSFS algorithm, produce very satisfactory results for feature subset selection. The time complexity is also improved compared with the original RSFS algorithm. In future, more studies and experiments are required to fine-tune the algorithm and the process, and to apply them to various types of data, such as ordinal, categorical, and mixed data, along with multi-label data and missing values.
References
1. C. Bartenhagen, H.U. Klein, C. Ruckert, X. Jiang, M. Dugas, Comparative study of unsuper-
vised dimension reduction techniques for the visualization of microarray gene expression data.
BMC Bioinform. 11(1), 567 (2010)
2. L. Shen, E.C. Tan, Dimension reduction-based penalized logistic regression for cancer clas-
sification using microarray data. IEEE/ACM Trans. Comput. Biol. Bioinform. (TCBB) 2(2),
166–175 (2005)
3. L. Van Der Maaten, E. Postma, J. Van den Herik, Dimensionality reduction: a comparative review. J. Mach. Learn. Res. 10, 66–71 (2009)
4. C. Ding, H. Peng, Minimum redundancy feature selection from microarray gene expression
data. J. Bioinform. Comput. Biol. 3(02), 185–205 (2005)
5. H. Peng, F. Long, C. Ding, Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 27(8), 1226–1238 (2005)
6. L. Yu, H. Liu, Efficiently handling feature redundancy in high-dimensional data, in SIGKDD '03 (Aug 2003)
7. H. Liu, H. Motoda, Feature Selection for Knowledge Discovery and Data Mining, vol. 454
(Springer Science & Business Media, Berlin, 2012)
8. J. Pohjalainen, O. Rasanen, S. Kadioglu, Feature selection methods and their combinations in high-dimensional classification of speaker likability, intelligibility and personality traits (2013)
9. L. Yu, C. Ding, S. Loscalzo, Stable feature selection via dense feature groups, in Proceedings
of the 14th ACM SIGKDD (2008)
10. M. Dash, H. Liu, Feature selection for classification. Intell. Data Anal. 1(3), 131–156 (1997)
11. K. Kira, L.A. Rendell, A practical approach to feature selection, in Proceedings of the Ninth International Workshop on Machine Learning, pp. 249–256 (1992)
12. J. Reunanen, Overfitting in making comparisons between variable selection methods. J. Mach. Learn. Res. 3, 1371–1382 (2003)
13. E. Maltseva, C. Pizzuti, D. Talia, Mining high dimensional scientific data sets using singular
value decomposition, in Data Mining for Scientific and Engineering Applications (Kluwer
Academic Publishers, Dordrecht, 2001), pp. 425–438
14. J. Kehrer, H. Hauser, Visualization and visual analysis of multifaceted scientific data: a survey.
IEEE Trans. Visual Comput. Graphics 19(3), 495–513 (2013)
15. J. Pohjalainen, O. Rasanen, S. Kadioglu, Feature selection methods and their combinations
in high-dimensional classification of speaker likability, intelligibility and personality traits.
Comput. Speech Lang. 29(1), 145–171 (2015)
16. D. Dheeru, E. Karra Taniskidou, UCI machine learning repository (2017), http://archive.ics.uci.edu/ml
17. S. Li, J. Harner, D. Adjeroh, Random KNN feature selection - a fast and stable alternative to random forests. BMC Bioinform. 12, 450 (2011)
18. L. Breiman, Random forests. Mach. Learn. 45(1), 5–32 (2001)
19. I. Guyon, A. Elisseeff, An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182 (2003)
20. O. Räsänen, J. Pohjalainen, Random subset feature selection in automatic recognition of developmental disorders, affective states, and level of conflict from speech, in INTERSPEECH, pp. 210–214 (2013)
21. J. Li, K. Cheng, S. Wang, F. Morstatter, R.P. Trevino, J. Tang, H. Liu, Feature selection: a data perspective. arXiv:1601.07996 (2016)
22. https://www.broadinstitute.org/