Variance-Based Feature Selection
for Enhanced Classification Performance
D. Lakshmi Padmaja and B. Vishnuvardhan
Abstract Irrelevant feature elimination, when used correctly, aids feature selection accuracy, which is critical in the dimensionality reduction task. The added intelligence narrows the search for an optimal subset of features by shrinking the dataset based on previous performance. The search procedures in use are largely probabilistic and heuristic, and although existing algorithms apply various measures to evaluate candidate feature subsets, they fail to eliminate irrelevant features. The procedure described in this paper enhances the feature selection process built on random subset feature selection (RSFS), which employs the random forest (RF) algorithm for better feature reduction. Through extensive testing of this procedure on several scientific datasets of differing geometries, we aim to show that a near-optimal subset of features can be derived by eliminating the features that lie more than two standard deviations from the mean. In many real-world applications involving scientific data (e.g., cancer detection, diabetes, and medical diagnosis), removing irrelevant features increases detection accuracy at lower cost and time. This helps domain experts by reducing the number of features to examine and saving valuable diagnosis time.
Keywords Random subset feature selection · Random forest · Variance-based selection · Classification accuracy
D. Lakshmi Padmaja
Department of Information Technology, Anurag Group of Institutions (CVSR),
Hyderabad, India
e-mail: glpadmaja@gmail.com
B. Vishnuvardhan (B)
Department of Computer Science and Engineering, JNTUH, Hyderabad, India
e-mail: mailvishnuvardhan@gmail.com
© Springer Nature Singapore Pte Ltd. 2019
S. C. Satapathy et al. (eds.), Information Systems Design and Intelligent Applications,
Advances in Intelligent Systems and Computing 862,
https://doi.org/10.1007/978-981-13-3329-3_51
1 Introduction
Dimensionality reduction [1–3] is an important preprocessing task in data mining. It reduces the dimensionality of a dataset, thereby addressing the curse-of-dimensionality issue. In general, datasets collected from scientific experiments have more features than instances, which increases processing time and reduces classification accuracy. Hence, it is prudent to reduce the dimensionality of the dataset without compromising the intrinsic geometric properties of the data. For reducing the dimensionality and removing irrelevant features [4–7], we proposed a modified RSFS algorithm in our earlier work. Evaluation measures such as accuracy, information, distance, dependency, and consistency [8–10] have been discussed in detail in the literature. This paper focuses on the consistency of feature selection as a parameter [11, 12].
Feature selection algorithms are mainly categorized as filter, wrapper, and hybrid algorithms. Filter methods are classifier independent, while wrapper methods are classifier dependent and computationally intensive. In view of this, the RSFS algorithm is used for selecting the relevant features. This helps in evaluating performance measures such as accuracy and consistency on real-world applications, and in reducing the dimensionality of scientific datasets, which are often unstructured, sparse, and contain missing values [13, 14]. The randomization process mitigates bias and overfitting. Eliminating irrelevant features from scientific data reduces cost and time (for cancer patients, by reducing the number of tests) with the help of domain experts. This approach is especially useful in medical diagnosis.
In this paper, Sect. 1 gives the introduction, Sect. 2 discusses the existing method, Sect. 3 presents the proposed algorithm, Sect. 4 describes the datasets and experimental setup, Sect. 5 presents the results, and Sect. 6 gives the conclusion.
2 Existing Method
We have selected the RSFS algorithm along with ten publicly available datasets (listed in Table 1) for this study.
2.1 Existing RSFS Algorithm
The aim of the random subset feature selection (RSFS) algorithm [8, 15] is to identify the best possible feature subset from a large dataset in terms of its usefulness in a classification task. The feature selection is carried out iteratively. Traditional feature selection techniques find the best subset of features in different ways: the forward selection algorithm repeatedly adds a new relevant feature to the existing subset, while the backward elimination algorithm iteratively removes unwanted features. Another approach ranks features according to their usefulness, based on feature correlation [16]. In RSFS, a kNN classifier evaluates a random subset of features in each iteration, and relevancy is calculated as the difference between the observed performance criterion and the expected criterion. Feature subset selection is then based on a statistical comparison against random-walk statistics. To improve classification performance, the size of the random subsets is not fixed. Iterations continue until good features are found; they are stopped by the stopping criterion, a user-defined threshold or exhaustion of the search, whichever comes first. The random subset classification is performed as many times as necessary to distinguish genuinely good features from features that merely appear useful due to the random components of the process.
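To make these mechanics concrete, the following is a minimal sketch of the relevance bookkeeping, assuming scikit-learn's KNeighborsClassifier as the kNN classifier and balanced accuracy as a stand-in for UAR; the varying subset size and the update rule reflect our reading of [8, 15], not the authors' released code.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score  # equals UAR
from sklearn.neighbors import KNeighborsClassifier

def rsfs_relevances(X_train, y_train, X_dev, y_dev, n_iters=5000, seed=0):
    """Accumulate a random-walk-style relevance value per feature."""
    rng = np.random.default_rng(seed)
    n_feats = X_train.shape[1]
    relevance = np.zeros(n_feats)
    uar_history = []
    for _ in range(n_iters):
        # The subset size is not fixed; here it is drawn uniformly (assumption).
        size = int(rng.integers(1, n_feats + 1))
        subset = rng.choice(n_feats, size=size, replace=False)
        clf = KNeighborsClassifier(n_neighbors=1)
        clf.fit(X_train[:, subset], y_train)
        uar = balanced_accuracy_score(y_dev, clf.predict(X_dev[:, subset]))
        uar_history.append(uar)
        # Features in better-than-average subsets gain relevance; the rest lose it.
        relevance[subset] += uar - np.mean(uar_history)
    return relevance
```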
2.2 Classification
In the random subset feature selection algorithm, a kNN classifier is used to classify the feature subsets generated by the random forest algorithm [17–19]. A kNN classifier is simple to use, and features are easy to select based on a nearest-neighbor measure. Usually, large values of k give low variance and high bias, while small values of k give high variance and low bias.
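As a small illustration (not taken from the paper), the snippet below shows this trade-off on a synthetic dataset: the mean and spread of cross-validated accuracy shift as k grows.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
for k in (1, 5, 25):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10)
    # Small k: low bias, high variance; large k: high bias, low variance.
    print(f"k={k:2d}  mean acc={scores.mean():.3f}  std={scores.std():.3f}")
```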
2.3 Motivation
The existing methodology suffers from inconsistency: it selects different subsets of features in different runs. This creates suspicion in the eyes of domain experts and cannot yield a reliable, stable outcome. This has motivated us to enhance the existing methodology with a view to reducing the inconsistency and improving the classification accuracy.
3 Proposed Methodology
The proposed methodology builds on the RSFS algorithm [8, 15, 20]. The procedure is suitably modified to measure the consistency and accuracy of the selected features, using the variance of feature relevance from the mean as the selection criterion. It is evident from Table 1 that the RSFS + 1NN algorithm achieves better accuracy and stronger feature reduction than the plain kNN algorithm. The pseudocode for the algorithm is shown below; a code sketch of its selection logic follows Fig. 1.
1. Normalize both the training and testing datasets.
2. Randomly select a subset from the full feature set.
3. Classify the dataset using the kNN classifier.
4. Record the feature relevance using the unweighted average recall (UAR).
5. Select the top 1% of features on the probability scale of performance.
6. Eliminate the bottommost features whose relevancy lies more than three standard deviations below the mean (outliers).
7. If the number of selected features has remained constant for 1000 iterations, stop and choose the features with the best relevancies (those whose probability is greater than 0.99).
8. Otherwise, go to step 2 and repeat all the steps until the selected features are constant.

Table 1 Dataset details

S. No. | Dataset              | No. of features | No. of instances | kNN accuracy (%) | kNN + RSFS features | kNN + RSFS accuracy (%)
1      | Colon cancer         | 62              | 2000             | 54.55            | 20                  | 63.63
2      | CTG                  | 20              | 2126             | 85.07            | 10                  | 87.61
3      | Lung cancer (32/149) | 12,533          | 149              | 93.75            | 2                   | 71.88
4      | Lung                 | 3312            | 203              | 94.20            | 66                  | 95.65
5      | Fisher iris          | 4               | 150              | 92.16            | 2                   | 96.08
6      | ALLAML               | 7129            | 72               | 84.00            | 2                   | 80.00
7      | Carcinom             | 9182            | 174              | 93.22            | 10                  | 61.02
8      | Prostate (102/34)    | 12,600          | 174              | 88.14            | 11                  | 45.76
9      | Forest               | 27              | 523              | 86.46            | 15                  | 82.15
10     | Ovarian cancer       | 4000            | 216              | 91.89            | 27                  | 95.95
The algorithm works as shown in Fig. 1.
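The selection and pruning logic of steps 5–7 can be sketched as follows; z-scoring the relevances against their own mean and standard deviation and mapping them through the normal CDF is our stand-in for the paper's random-walk comparison, and the function name and interface are illustrative.

```python
import numpy as np
from scipy.stats import norm

def select_and_prune(relevance, p_threshold=0.99, n_sigma=3):
    """One selection/pruning pass over accumulated feature relevances."""
    mu, sigma = relevance.mean(), relevance.std() or 1.0
    prob = norm.cdf((relevance - mu) / sigma)      # probability scale (steps 5, 7)
    selected = np.where(prob > p_threshold)[0]     # features with best relevancies
    outliers = np.where(relevance < mu - n_sigma * sigma)[0]  # step 6 elimination
    return selected, outliers
```

In the full loop (steps 2–8), this pass would be repeated, with `outliers` removed from the candidate pool, until `selected` stays unchanged for 1000 consecutive iterations.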
4 Datasets and Experimental Setup
All the datasets are taken from the UCI machine learning repository [16], www.featureselection.asu.edu [21], and www.broadinstitute.org [22]; some are from cancer research studies. The dataset properties are listed in Table 1.
From these results, it is evident that the performance of the RSFS algorithm is superior to that of the plain kNN baseline. To improve it further, the algorithm is modified as shown in Fig. 2.
Fig. 1 Algorithm
Fig. 2 Modified process for variance-based selection
5 Results
The process shown in Fig. 2 is followed to reduce the dimensionality while maintaining the consistency of feature selection. The details of the variance-based selection and the resulting feature reduction are shown in Fig. 3, which maps the feature space to the variance space.
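A hedged sketch of this variance-space filter, with the cutoff (two or three sigma) as a parameter, might look as follows; the function name and interface are ours, not the authors'.

```python
import numpy as np

def variance_filter(relevance, n_sigma=3):
    """Map relevances into 'variance space' and drop low outliers."""
    mu, sigma = relevance.mean(), relevance.std() or 1.0
    distance = (relevance - mu) / sigma          # distance from mean in sigmas
    return np.where(distance > -n_sigma)[0]      # indices of retained features

# kept_3s = variance_filter(rel, n_sigma=3)  # conservative, keeps more features
# kept_2s = variance_filter(rel, n_sigma=2)  # prunes more aggressively
```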
6 Conclusion
Consistency of the selected features is better maintained when features are retained within three standard deviations of the mean than with a two-sigma cutoff. These selection techniques, combined with the RSFS algorithm, produce very satisfactory results for feature subset selection. Time complexity is also improved compared to the original RSFS algorithm. In future, more studies and experiments are required to fine-tune the algorithm
and the process so that they apply to various types of data, such as ordinal, categorical, and mixed data, along with multi-label data and missing values.
References
1. C. Bartenhagen, H.U. Klein, C. Ruckert, X. Jiang, M. Dugas, Comparative study of unsuper-
vised dimension reduction techniques for the visualization of microarray gene expression data.
BMC Bioinform. 11(1), 567 (2010)
2. L. Shen, E.C. Tan, Dimension reduction-based penalized logistic regression for cancer clas-
sification using microarray data. IEEE/ACM Trans. Comput. Biol. Bioinform. (TCBB) 2(2),
166–175 (2005)
3. L. Van Der Maaten, E. Postma, J. Van den Herik, Dimensionality reduction: a comparative review. J. Mach. Learn. Res. 10, 66–71 (2009)
4. C. Ding, H. Peng, Minimum redundancy feature selection from microarray gene expression
data. J. Bioinform. Comput. Biol. 3(02), 185–205 (2005)
5. H. Peng, F. Long, C. Ding, Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 27(8), 1226–1238 (2005)
6. L. Yu, H. Liu, Efficiently handling feature redundancy in high-dimensional data, in SIGKDD
03 (Aug 2003)
7. H. Liu, H. Motoda, Feature Selection for Knowledge Discovery and Data Mining, vol. 454
(Springer Science & Business Media, Berlin, 2012)
8. J. Pohjalainen, O. Rasanen, S. Kadioglu, Feature selection methods and their combinations in high-dimensional classification of speaker likability, intelligibility and personality traits (2013)
9. L. Yu, C. Ding, S. Loscalzo, Stable feature selection via dense feature groups, in Proceedings
of the 14th ACM SIGKDD (2008)
10. M. Dash, H. Liu, Feature selection for classification. Intell. Data Anal. 1(3), 131–156 (1997)
11. K. Kira, L.A. Rendell, A practical approach to feature selection, in Proceedings of the Ninth International Workshop on Machine Learning, pp. 249–256 (1992)
12. J. Reunanen, Overfitting in making comparisons between variable selection methods. J. Mach. Learn. Res. 3, 1371–1382 (2003)
13. E. Maltseva, C. Pizzuti, D. Talia, Mining high dimensional scientific data sets using singular
value decomposition, in Data Mining for Scientific and Engineering Applications (Kluwer
Academic Publishers, Dordrecht, 2001), pp. 425–438
14. J. Kehrer, H. Hauser, Visualization and visual analysis of multifaceted scientific data: a survey.
IEEE Trans. Visual Comput. Graphics 19(3), 495–513 (2013)
15. J. Pohjalainen, O. Rasanen, S. Kadioglu, Feature selection methods and their combinations
in high-dimensional classification of speaker likability, intelligibility and personality traits.
Comput. Speech Lang. 29(1), 145–171 (2015)
16. D. Dheeru, E. Karra Taniskidou, UCI machine learning repository (2017), http://archive.ics.uci.edu/ml
17. S. Li, J. Harner, D. Adjeroh, Random KNN feature selection: a fast and stable alternative to random forests. BMC Bioinform. 12, 450 (Dec 2011)
18. L. Breiman, Random forests. Mach. Learn. 45(1), 5–32 (2001)
19. I. Guyon, A. Elisseeff, An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182 (2003)
20. O. Räsänen, J. Pohjalainen, Random subset feature selection in automatic recognition of developmental disorders, affective states, and level of conflict from speech, in INTERSPEECH, pp. 210–214 (2013)
21. J. Li, K. Cheng, S. Wang, F. Morstatter, R.P. Trevino, J. Tang, H. Liu, Feature selection: a data perspective. arXiv:1601.07996 (2016)
22. https://www.broadinstitute.org/