Variance-Based Feature Selection
for Enhanced Classification Performance
D. Lakshmi Padmaja and B. Vishnuvardhan
Abstract Irrelevant-feature elimination, when used correctly, helps to enhance feature selection accuracy, which is critical in dimensionality reduction tasks. The additional intelligence enhances the search for an optimal subset of features by reducing the dataset based on previous performance. The search procedures used are completely probabilistic and heuristic. Although existing algorithms use various measures to evaluate the best feature subsets, they fail to eliminate irrelevant features. The procedure explained in the current paper focuses on an enhanced feature selection process based on random subset feature selection (RSFS). RSFS uses the random forest (RF) algorithm for better feature reduction. Through extensive testing of this procedure, carried out previously on several scientific datasets with different geometries, we aim to show in this paper that an optimal subset of features can be derived by eliminating the features that lie more than two standard deviations away from the mean. In many real-world applications involving scientific data (e.g., cancer detection, diabetes, and medical diagnosis), removing irrelevant features increases detection accuracy at lower cost and time. This helps domain experts by reducing the number of features and saving valuable diagnosis time.
Keywords Random subset feature selection · Random forest · Variance-based selection · Classification accuracy
D. Lakshmi Padmaja
Department of Information Technology, Anurag Group of Institutions (CVSR),
Hyderabad, India
e-mail: glpadmaja@gmail.com
B. Vishnuvardhan (B)
Department of Computer Science and Engineering, JNTUH, Hyderabad, India
e-mail: mailvishnuvardhan@gmail.com
© Springer Nature Singapore Pte Ltd. 2019
S. C. Satapathy et al. (eds.), Information Systems Design and Intelligent Applications,
Advances in Intelligent Systems and Computing 862,
https://doi.org/10.1007/978-981-13-3329-3_51
1 Introduction
Dimensionality reduction [1–3] is an important preprocessing task in data mining. It helps reduce the dimensionality of a dataset, thereby addressing the curse-of-dimensionality issue. In general, datasets collected from scientific experiments have more features than instances. This results in increased processing time and reduced classification accuracy. Hence, it is prudent to reduce the dimensionality of the dataset without compromising the intrinsic geometric properties of the data. For reducing the dimensionality and removing irrelevant features [4–7], we proposed a modified RSFS algorithm in our earlier work. Evaluation measures such as accuracy, information, distance, dependency, and consistency [8–10] have been discussed in detail in the literature. This paper focuses on the consistency of feature selection as a parameter [11, 12].
Feature selection algorithms are mainly categorized into filter, wrapper, and hybrid algorithms. Filter methods are classifier-independent, whereas wrapper methods are classifier-dependent and computationally intensive. In view of this, the RSFS algorithm is used for selecting the relevant features. This helps in evaluating performance measures such as accuracy and consistency on real-world applications. It also helps in reducing the dimensionality of scientific datasets, which are unstructured, sparse, and often contain missing values [13, 14]. The randomization process eliminates bias and overfitting problems. Eliminating irrelevant features from scientific data reduces cost and time (e.g., for cancer patients, by reducing the number of tests), with the help of domain experts. This approach is especially useful in medical diagnosis.
In this paper, Sect. 1 gives the introduction, Sect. 2 discusses the existing method, Sect. 3 presents the proposed algorithm, Sect. 4 describes the datasets and experimental setup, Sect. 5 provides the results, and Sect. 6 gives the conclusion.
2 Existing Method
We have selected the RSFS algorithm, along with eight publicly available datasets, for this study.
2.1 Existing RSFS Algorithm
The aim of the random subset feature selection (RSFS) algorithm [8, 15] is to identify the best possible feature subsets, from a large dataset, in terms of their usefulness in a classification task. The feature selection is carried out iteratively. Traditional feature selection techniques find the best subset of features using methods such as the forward selection algorithm, which iteratively adds new relevant features to the existing set, and the backward elimination algorithm, which iteratively removes unwanted features. Another approach is based on feature correlation, ranking the features according to their usefulness [16]. Using a kNN classifier, features are selected in each iteration. Relevancy is calculated as the difference between the observed performance criterion and the expected criterion. In RSFS, feature subset selection is based on a statistical comparison against random-walk statistics. To improve the classification performance, the random subset size is not fixed. Iterations continue until good features are found; they stop either at a user-defined threshold or when the search is exhausted, whichever comes first. In RSFS, the random subset classification is performed as many times as necessary in order to separate genuinely good features from features that merely appear useful due to the random components of the process.
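The published RSFS implementation is in MATLAB; purely as an illustration of this random-walk comparison, the Python fragment below sketches one way to score a feature against chance. The function name good_feature_probability and the step_std parameter are hypothetical, introduced here only for exposition; the original code organizes this bookkeeping differently.

    import math

    def good_feature_probability(cum_relevance, n_draws, step_std):
        # Under the null hypothesis, a feature's cumulated relevance behaves
        # like a zero-mean random walk whose spread after n_draws subset
        # draws is step_std * sqrt(n_draws).
        z = cum_relevance / (step_std * math.sqrt(n_draws))
        # Standard normal CDF: probability that pure chance stays below z.
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

Features for which this probability approaches 1 are unlikely to look useful by chance alone, which is the sense in which RSFS separates good features from random fluctuations.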
2.2 Classification
In the random subset feature selection algorithm, a kNN classifier is used to classify the data over the random feature subsets generated in the manner of the random forest algorithm [17–19]. A kNN classifier is simple to use, and it is easy to select features based on a nearest-neighbor measure. Usually, when the value of k is large, low variance and high bias are observed, and for small values of k, high variance and low bias are observed.
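As a small self-contained illustration of this bias-variance trade-off (ours, not the authors' experiment), the sketch below scores a kNN classifier for several values of k on the Fisher iris data, which also appears in Table 1:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    for k in (1, 5, 15, 45):
        clf = KNeighborsClassifier(n_neighbors=k)
        # Larger k smooths the decision boundary (lower variance, higher
        # bias); smaller k fits local structure (higher variance, lower bias).
        print(k, cross_val_score(clf, X, y, cv=5).mean())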
2.3 Motivation
The existing methodology suffers from inconsistency: it selects different subsets of features in different runs. This creates suspicion in the view of domain experts and cannot produce a reliable and stable outcome. This has motivated us to enhance the existing methodology with a view to reducing the inconsistency and improving the classification accuracy.
3 Proposed Methodology
The proposed methodology is built and evaluated using the RSFS algorithm [8, 15, 20]. The procedure is suitably modified to assess the consistency and accuracy of the features, taking the variance of the feature relevancies from the mean as the criterion. It is evident from Table 1 that the RSFS + 1NN algorithm gives better accuracy and greater feature reduction than the plain kNN algorithm. The pseudocode for the algorithm is shown below:
1. Normalize both training and testing datasets.
2. Randomly select the subset from the full feature set.
3. Classify the dataset using kNN classifier.
Table 1 Dataset details

| S. No. | Dataset        | No. of features | No. of instances | kNN accuracy (%) | kNN + RSFS features | kNN + RSFS accuracy (%) |
|--------|----------------|-----------------|------------------|------------------|---------------------|-------------------------|
| 1      | Colon cancer   | 2000            | 62               | 54.55            | 20                  | 63.63                   |
| 2      | CTG            | 20              | 2126             | 85.07            | 10                  | 87.61                   |
| 3      | Lung cancer    | 12,533          | 149              | 93.75            | 2                   | 71.88                   |
| 4      | Lung           | 3312            | 203              | 94.20            | 66                  | 95.65                   |
| 5      | Fisher iris    | 4               | 150              | 92.16            | 2                   | 96.08                   |
| 6      | ALLAML         | 7129            | 72               | 84.00            | 2                   | 80.00                   |
| 7      | Carcinom       | 9182            | 174              | 93.22            | 10                  | 61.02                   |
| 8      | Prostate       | 12,600          | 102              | 88.14            | 11                  | 45.76                   |
| 9      | Forest         | 27              | 523              | 86.46            | 15                  | 82.15                   |
| 10     | Ovarian cancer | 4000            | 216              | 91.89            | 27                  | 95.95                   |
4. Record the feature relevance using the unweighted average recall (UAR).
5. Select the top 1% of features based on the probability scale of performance.
6. Eliminate the bottommost features, i.e., the outliers whose relevancy lies more than three standard deviations below the mean.
7. Repeat the process; once the number of selected features has remained constant for 1000 iterations, choose the features with the best relevancies (those whose probability is greater than 0.99).
8. Otherwise, go to step 2 and repeat all the steps until the selected features are constant (see the sketch after this list).
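The following is a minimal Python rendering of steps 1–8 under stated assumptions; it is not the authors' implementation (which builds on the MATLAB RSFS release). The names modified_rsfs and subset_size, the validation-fold arguments, and the z > 2.326 cut-off (the 99th percentile of the standard normal, standing in for "probability greater than 0.99") are all introduced here only for illustration.

    import numpy as np
    from sklearn.metrics import recall_score
    from sklearn.neighbors import KNeighborsClassifier

    def uar(y_true, y_pred):
        # Unweighted average recall (step 4): mean of per-class recalls.
        return recall_score(y_true, y_pred, average="macro")

    def modified_rsfs(X_tr, y_tr, X_val, y_val, subset_size=20,
                      max_iters=20000):
        # Step 1 (normalization of X_tr / X_val) is assumed to have been
        # done by the caller. subset_size is fixed here for simplicity;
        # RSFS itself varies the subset size.
        rng = np.random.default_rng(0)
        active = np.arange(X_tr.shape[1])   # features still in the running
        relevance = np.zeros(X_tr.shape[1])
        scores, stable_for = [], 0
        for _ in range(max_iters):
            # Step 2: draw a random subset of the surviving features.
            subset = rng.choice(active, size=min(subset_size, active.size),
                                replace=False)
            # Step 3: classify with 1NN on that subset.
            clf = KNeighborsClassifier(n_neighbors=1)
            clf.fit(X_tr[:, subset], y_tr)
            score = uar(y_val, clf.predict(X_val[:, subset]))
            scores.append(score)
            # A feature gains relevance when it appears in subsets that
            # perform better than the running average.
            relevance[subset] += score - np.mean(scores)
            # Step 6: prune outliers > 3 sigma below the mean relevance.
            r = relevance[active]
            keep = r >= r.mean() - 3.0 * r.std()
            stable_for = stable_for + 1 if keep.all() else 0
            active = active[keep]
            if stable_for >= 1000:          # step 7 stopping criterion
                break
        # Step 7: keep features in the extreme upper tail of relevance.
        r = relevance[active]
        z = (r - r.mean()) / (r.std() + 1e-12)
        return active[z > 2.326]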
The algorithm works as shown in Fig. 1.
4 Datasets and Experimental Setup
All the datasets are taken from the UCI machine learning repository [16], www.featureselection.asu.edu [21], www.broadinstitute.org [22], and some are from cancer research studies. The dataset properties are given in Table 1.
From the above results, it is evident that the performance of the RSFS algorithm is superior to that of the other algorithms. To improve it further, the algorithm is modified as shown in Fig. 2.
Fig. 1 Algorithm
Fig. 2 Modified process for variance-based selection
5 Results
The process, as shown in Fig. 2, is followed for reducing the dimensionality while maintaining the consistency of feature selection. The details of the variance-based selection and the resulting feature reduction are shown in Fig. 3.
Fig. 3 Mapping of feature space to variance space
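As a schematic of this mapping (our illustration, not the authors' code), each feature's accumulated relevance can be re-expressed as a z-score, so that the two-sigma and three-sigma rules compared in the conclusion become simple thresholds:

    import numpy as np

    def variance_space(relevance):
        # Map each feature's relevance to a z-score: its distance from the
        # mean relevance measured in standard deviations (cf. Fig. 3).
        return (relevance - relevance.mean()) / relevance.std()

    z = variance_space(np.asarray([0.1, 0.4, 0.9, -0.2, 2.5]))  # toy values
    kept_2_sigma = z > -2.0   # stricter cut: discards borderline features
    kept_3_sigma = z > -3.0   # looser cut: removes only extreme outliers

The three-sigma cut removes only extreme outliers, which is what preserves the consistency of the selected subsets across runs.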
6 Conclusion
Consistency of the selected features is maintained when features are retained within three standard deviations of the mean relevance, compared with a cut-off of two standard deviations from the mean. These selection techniques, combined with the RSFS algorithm, produce very satisfactory results for feature subset selection. The time complexity is also improved compared with the original RSFS algorithm. In future, more studies and experiments are required to fine-tune the algorithm and the process, and to apply them to various types of data, such as ordinal, categorical, and mixed data, along with multi-label data and missing values.
References
1. C. Bartenhagen, H.U. Klein, C. Ruckert, X. Jiang, M. Dugas, Comparative study of unsuper-
vised dimension reduction techniques for the visualization of microarray gene expression data.
BMC Bioinform. 11(1), 567 (2010)
2. L. Shen, E.C. Tan, Dimension reduction-based penalized logistic regression for cancer clas-
sification using microarray data. IEEE/ACM Trans. Comput. Biol. Bioinform. (TCBB) 2(2),
166–175 (2005)
3. L. Van Der Maaten, E. Postma, J. Van den Herik, Dimensionality reduction: a comparative review. J. Mach. Learn. Res. 10, 66–71 (2009)
4. C. Ding, H. Peng, Minimum redundancy feature selection from microarray gene expression
data. J. Bioinform. Comput. Biol. 3(02), 185–205 (2005)
5. H. Peng, F. Long, C. Ding, Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 27(8), 1226–1238 (2005)
6. L. Yu, H. Liu, Efficiently handling feature redundancy in high-dimensional data, in SIGKDD '03 (Aug 2003)
7. H. Liu, H. Motoda, Feature Selection for Knowledge Discovery and Data Mining, vol. 454
(Springer Science & Business Media, Berlin, 2012)
8. J. Pohjalainen, O. Rasanen, S. Kadioglu, Feature selection methods and their combinations in high-dimensional classification of speaker likability, intelligibility and personality traits (2013)
9. L. Yu, C. Ding, S. Loscalzo, Stable feature selection via dense feature groups, in Proceedings
of the 14th ACM SIGKDD (2008)
10. M. Dash, H. Liu, Feature selection for classification. Intell. Data Anal. 1(3), 131–156 (1997)
11. K. Kira, L.A. Rendell, A practical approach to feature selection, in Proceedings of the Ninth International Workshop on Machine Learning, pp. 249–256 (1992)
12. J. Reunanen, Overfitting in making comparisons between variable selection methods. J. Mach. Learn. Res. 3, 1371–1382 (2003)
13. E. Maltseva, C. Pizzuti, D. Talia, Mining high dimensional scientific data sets using singular
value decomposition, in Data Mining for Scientific and Engineering Applications (Kluwer
Academic Publishers, Dordrecht, 2001), pp. 425–438
14. J. Kehrer, H. Hauser, Visualization and visual analysis of multifaceted scientific data: a survey.
IEEE Trans. Visual Comput. Graphics 19(3), 495–513 (2013)
15. J. Pohjalainen, O. Rasanen, S. Kadioglu, Feature selection methods and their combinations
in high-dimensional classification of speaker likability, intelligibility and personality traits.
Comput. Speech Lang. 29(1), 145–171 (2015)
16. D. Dheeru, E. Karra Taniskidou, UCI machine learning repository (2017), http://archive.ics.uci.edu/ml
17. S. Li, J. Harner, D. Adjeroh, Random KNN feature selection - a fast and stable alternative to random forests. BMC Bioinform. 12, 450 (2011)
18. L. Breiman, Random forests. Mach. Learn. 45(1), 5–32 (2001)
19. I. Guyon, A. Elisseeff, An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182 (2003)
20. O. Räsänen, J. Pohjalainen, Random subset feature selection in automatic recognition of developmental disorders, affective states, and level of conflict from speech, in INTERSPEECH, pp. 210–214 (2013)
21. J. Li, K. Cheng, S. Wang, F. Morstatter, R.P. Trevino, J. Tang, H. Liu, Feature selection: a data perspective. arXiv:1601.07996 (2016)
22. https://www.broadinstitute.org/