Figure - available via license: Creative Commons Attribution 2.0 Generic
The description of the SVM-RCE algorithm. A flowchart of the SVM-RCE algorithm, which consists of three main steps: the Cluster step for clustering the genes, the SVM scoring step for assessing the significance of each cluster, and the RCE step for removing clusters with low scores

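The three steps in the caption can be sketched in code. The following is a minimal illustration assuming scikit-learn; the function name, parameters, and elimination schedule are ours, not the authors' implementation:

```python
# Minimal sketch of SVM-RCE's three steps (illustrative, not the authors' code):
# 1) Cluster step: group genes with K-means,
# 2) SVM scoring step: score each gene cluster by cross-validated accuracy,
# 3) RCE step: recursively eliminate the lowest-scoring clusters.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def svm_rce(X, y, n_clusters=10, keep_fraction=0.5, min_clusters=2):
    """X: samples x genes expression matrix; y: class labels.
    Returns the column indices of the surviving genes."""
    genes = np.arange(X.shape[1])
    while n_clusters >= min_clusters:
        # Cluster step: cluster genes (columns) by their expression profiles.
        labels = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=0).fit_predict(X[:, genes].T)
        # SVM scoring step: score each cluster of genes separately.
        scores = {}
        for c in range(n_clusters):
            members = genes[labels == c]
            scores[c] = cross_val_score(SVC(kernel="linear"),
                                        X[:, members], y, cv=3).mean()
        # RCE step: keep only the top-scoring fraction of clusters.
        n_keep = max(min_clusters, int(n_clusters * keep_fraction))
        kept = sorted(scores, key=scores.get, reverse=True)[:n_keep]
        genes = np.concatenate([genes[labels == c] for c in kept])
        n_clusters = n_keep
        if n_keep == min_clusters:
            break
    return genes
```

The surviving genes can then be used to train a final classifier; the keep fraction and stopping level here stand in for the fixed cluster-level schedule used in the experiments.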

Source publication
Article
Full-text available
Classification studies using gene expression datasets are usually based on small numbers of samples and tens of thousands of genes. Selecting the genes that are important for distinguishing the different sample classes being compared poses a challenging problem in high-dimensional data analysis. We describe a new procedure for selecting s...

Similar publications

Article
Full-text available
Neuroblastoma tumor cells are assumed to originate from primitive neuroblasts giving rise to the sympathetic nervous system. Because these precursor cells are not detectable in postnatal life, their transcription profile has remained inaccessible for comparative data mining strategies in neuroblastoma. This study provides the first genome-wide mRNA...
Article
Full-text available
Among various risk factors for the initiation and progression of cancer, alternative polyadenylation (APA) is a remarkable endogenous contributor that directly triggers the malignant phenotype of cancer cells. APA affects biological processes at a transcriptional level in various ways. As such, APA can be involved in tumorigenesis through gene expr...
Article
Full-text available
Ensemble learning combines multiple learners to perform combinatorial learning, which has advantages of good flexibility and higher generalization performance. To achieve higher quality cancer classification, in this study, the fast correlation-based feature selection (FCBF) method was used to preprocess the data to eliminate irrelevant and redunda...

Citations

... Recursive cluster elimination based on support vector machines (SVM-RCE), proposed by Yousef et al. [5], introduced the term recursive cluster elimination into the literature, and this approach outperformed Support Vector Machines with Recursive Feature Elimination (SVM-RFE) [6], which was widely accepted as an effective approach in the field. The superiority of SVM-RCE stems from its consideration of feature (i.e., gene) clusters instead of individual features in the classification task. ...
... All methods were executed for 100 iterations to provide stability in the results. We adopted a set of fixed cluster levels [90, 80, 70, 60, 50, 40, 30, 20, 10, 5, 2, 1] in the experiments. The number of genes to be removed from surviving clusters is set to 10% if the cluster contains more than five genes. ...
Preprint
Full-text available
The computational and interpretational difficulties caused by the ever-increasing dimensionality of biological data generated by new technologies pose a major challenge. Feature selection (FS) methods aim to reduce the dimension, and feature grouping has emerged as a foundation for FS techniques that seek to detect strong correlations among features and the existence of irrelevant features. In this work, we develop Recursive Cluster Elimination with Intra-Cluster Feature Elimination (RCE-IFE), a method that iterates clustering and elimination steps in a supervised context. Recursively, feature clusters are formed, then scored, and less contributing clusters are eliminated. Next, low-scoring features in retained clusters are eliminated. Intra-cluster feature elimination aims to reduce noisy features while keeping a minimum number of predictive features. The performance of RCE-IFE is evaluated and compared to other FS techniques in several datasets. The results show that the proposed strategy effectively reduces the size of the feature set and also improves the model performance.
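The intra-cluster elimination step that distinguishes RCE-IFE from plain cluster elimination can be sketched as follows. This is an illustration only: the function name and the use of linear-SVM weight magnitudes as the intra-cluster ranking are our assumptions, not the authors' code:

```python
# Illustrative sketch of intra-cluster feature elimination: within a retained
# cluster, rank features by the magnitude of a linear model's weights and drop
# the weakest ones, keeping a minimum number of predictive features.
import numpy as np
from sklearn.svm import SVC

def prune_cluster(X_cluster, y, drop_fraction=0.3, min_features=2):
    """X_cluster: samples x features for one retained cluster.
    Returns indices of the features kept inside the cluster."""
    n = X_cluster.shape[1]
    if n <= min_features:
        return np.arange(n)
    # Weight magnitudes of a linear SVM serve as per-feature scores.
    w = SVC(kernel="linear").fit(X_cluster, y).coef_[0]
    order = np.argsort(np.abs(w))          # weakest features first
    n_keep = max(min_features, n - int(n * drop_fraction))
    return np.sort(order[-n_keep:])        # keep the strongest n_keep
```

In the full method this pruning would run on every cluster that survives a round of cluster elimination, reducing noisy features while retaining the cluster's most predictive members.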
... In our earlier studies, in order to improve classification performance, we proposed grouping-based feature elimination techniques, e.g., SVM RCE [39], SVM-RCE-R [40], and SVM-RCE-R-OPT [41]. Recently, we proposed numerous tools which incorporate biological information into the machine learning algorithm to accomplish feature selection or to choose groups of features. ...
Article
Full-text available
Due to the increasing resistance of bacteria to antibiotics, scientists have begun seeking new solutions to this problem. One of the most promising solutions in this field is antimicrobial peptides (AMPs). To identify antimicrobial peptides, and to aid the design and production of novel antimicrobial peptides, there is a growing interest in the development of computational prediction approaches, in parallel with studies performing wet-lab experiments. The computational approaches aim to understand what controls antimicrobial activity from the perspective of machine learning, and to uncover the biological properties that define antimicrobial activity. Throughout this study, we aim to develop a novel prediction approach that can identify peptides with high antimicrobial activity against selected target bacteria. Along this line, we propose a novel method called AMP-GSM (antimicrobial peptide grouping-scoring-modeling). AMP-GSM includes three main components: grouping, scoring, and modeling. The grouping component creates sub-datasets by placing the physicochemical, linguistic, sequence, and structure-based features into different groups. The scoring component gives each group a score according to its ability to distinguish whether a peptide is antimicrobial or not. As the final part of our method, the model built using the top-ranked groups is evaluated (modeling component). The method was tested on three AMP prediction datasets, and the prediction performance of AMP-GSM was comparatively evaluated against several feature selection methods and several classifiers. When we used 10 features (which are members of the physicochemical group), we obtained the highest area under the curve (AUC) value for both the Gram-negative (99%) and Gram-positive (98%) datasets. AMP-GSM investigates the most significant feature groups that improve AMP prediction.
A number of physico-chemical features from AMP-GSM's final selection demonstrate how important these variables are in defining peptide characteristics and how they should be taken into account when creating models to predict peptide activity.
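The grouping-scoring-modeling (G-S-M) pattern behind AMP-GSM can be sketched generically: each feature group receives a cross-validation score, and the final model is trained on the union of the top-ranked groups. The group names, classifier choice, and function names below are illustrative assumptions, not the published implementation:

```python
# A hedged sketch of the G-S-M pattern: score feature groups, rank them,
# and build the final model on the top-ranked groups only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def score_groups(X, y, groups, cv=3):
    """groups: dict mapping group name -> list of column indices.
    Returns {name: mean CV accuracy using only that group's features}."""
    return {name: cross_val_score(RandomForestClassifier(random_state=0),
                                  X[:, cols], y, cv=cv).mean()
            for name, cols in groups.items()}

def model_on_top_groups(X, y, groups, k=2):
    """Train the final model on the union of the k best-scoring groups."""
    scores = score_groups(X, y, groups)
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    cols = np.concatenate([groups[g] for g in top])
    return top, RandomForestClassifier(random_state=0).fit(X[:, cols], y)
```

The same skeleton underlies the other G-S-M tools cited on this page; only the grouping source (physicochemical properties, pathways, miRNA targets, etc.) changes.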
... Integrating Gene Ontology Based Grouping and Ranking, CogNet, SVM-RCE (Yousef et al., 2007), SVM-RCE-R (Yousef and Bakir-Gungor, 2021), PriPath, miRModuleNet, TextNetTopics (Yousef and Voskergian, 2022), GediNet (Qumsiyeh et al., 2022). These different G-S-M approaches are also reviewed in . ...
Article
Full-text available
During recent years, biological experiments and increasing evidence have shown that microRNAs play an important role in the diagnosis and treatment of human complex diseases. Therefore, to diagnose and treat human complex diseases, it is necessary to reveal the associations between a specific disease and related miRNAs. Although current computational models based on machine learning attempt to determine miRNA-disease associations, the accuracy of these models needs to be improved, and candidate miRNA-disease relations need to be evaluated from a biological perspective. In this paper, we propose a computational model named miRdisNET to predict potential miRNA-disease associations. Specifically, miRdisNET requires two types of data, i.e., miRNA expression profiles and known disease-miRNA associations, as input files. First, we generate subsets of specific diseases by applying the grouping component. These subsets contain miRNA expressions with class labels associated with each specific disease. Then, we assign an importance score to each group by using a machine learning method for classification. Finally, we apply a modeling component and obtain outputs. One of the most important outputs of miRdisNET is the performance of miRNA-disease prediction. Compared with the existing methods, miRdisNET obtained the highest AUC value of .9998. Another output of miRdisNET is a list of significant miRNAs for the disease under study. The miRNAs identified by miRdisNET are validated by referring to the gold-standard databases which hold information on experimentally verified microRNA-disease associations. miRdisNET has been developed to predict candidate miRNAs for new diseases, where the miRNA-disease relation is not yet known. In addition, miRdisNET presents candidate disease-disease associations based on shared miRNA knowledge. The miRdisNET tool and other supplementary files are publicly available at: https://github.com/malikyousef/miRdisNET.
... An extension of the FAST method, but another threshold-based feature selection (TBFS) technique, is discussed by Wang, Khoshgoftaar & Van Hulse (2010), where they produce 11 distinct versions of TBFS based on 11 different classifier performance metrics. A cluster-based feature selection method, SVM-RCE, has been introduced by Yousef et al. (2007, 2021), which uses K-means to identify correlated gene clusters and an SVM to rank each cluster. Then, the recursive cluster elimination (RCE) method iteratively removes the clusters with the lowest classification accuracy. ...
Article
Full-text available
High dimensional classification problems have gained increasing attention in machine learning, and feature selection has become essential in executing machine learning algorithms. In general, most feature selection methods compare the scores of several feature subsets and select the one that gives the maximum score. There may be other selections with fewer features and a lower score, yet the difference is negligible. This article proposes and applies an extended version of such feature selection methods, which selects a smaller feature subset with performance similar to that of the original subset under a pre-defined threshold. It further validates the results of the suggested extended version of Principal Component Loading Feature Selection (PCLFS-ext) by simulating data for several practical scenarios with different numbers of features and different imbalance rates on several classification methods. Our simulated results show that the proposed method outperforms the original PCLFS and existing Recursive Feature Elimination (RFE) by giving reasonable feature reduction on various data sets, which is important in some applications.
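The selection rule described in this abstract, i.e., preferring a smaller subset whose score is within a pre-defined threshold of the maximum, reduces to a few lines. The function name is ours, and the scores could come from any subset-evaluation method:

```python
# Sketch of the "extended" selection rule: among candidate subset sizes,
# take the smallest one whose score is within `threshold` of the best score,
# instead of simply taking the argmax.
def smallest_within_threshold(scores_by_size, threshold=0.01):
    """scores_by_size: dict {subset_size: validation score}.
    Returns the smallest size scoring within `threshold` of the maximum."""
    best = max(scores_by_size.values())
    eligible = [k for k, s in scores_by_size.items() if best - s <= threshold]
    return min(eligible)
```

With a looser threshold the rule trades a negligible amount of score for a markedly smaller feature set, which is the behavior the article argues for.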
... When dealing with high-dimensional data, many feature selection approaches can successfully remove irrelevant features but fail to pull redundant ones out [18,19]. To overcome this problem, several feature selection algorithms that use feature clustering were proposed in recent decades in both supervised and unsupervised contexts [18, 20-24]. This paper focuses on clustering-based feature selection approaches in a supervised context. ...
... In the first one, clustering is applied as a pre-processing stage where only one feature is selected from each group to constitute the feature set from which a feature subset search is performed [48]. Other schemes cluster the initial feature set into a predefined number of groups, and then evaluate the relevance of each group in order to remove irrelevant feature groups before merging the remaining groups and repeating the whole scheme [24]. In the last kind of scheme, the feature subset search is applied in each group defined by the clustering algorithm and the features selected from each group are merged to form the final selected feature subset [23]. ...
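The first scheme above, clustering as a pre-processing stage with one feature selected per group, can be sketched as follows. This assumes scikit-learn, and the representative-selection rule (feature closest to its cluster centroid) is one common choice, not one prescribed by the text:

```python
# Sketch of clustering as feature-selection pre-processing: cluster the
# features, then keep a single representative per cluster before any
# further subset search is performed.
import numpy as np
from sklearn.cluster import KMeans

def one_per_cluster(X, n_clusters=3, random_state=0):
    """Returns one representative column index per feature cluster."""
    km = KMeans(n_clusters=n_clusters, n_init=10,
                random_state=random_state).fit(X.T)   # rows = features
    reps = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        # Representative: the member feature closest to the cluster centroid.
        dists = np.linalg.norm(X.T[members] - km.cluster_centers_[c], axis=1)
        reps.append(int(members[np.argmin(dists)]))
    return sorted(reps)
```

Because correlated features fall into the same cluster, the representatives form a compact, low-redundancy starting set for the subsequent subset search.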
Article
Full-text available
Color texture classification aims to recognize patterns by the analysis of their colors and their textures. This process requires using descriptors to represent and discriminate the different texture classes. In most traditional approaches, these descriptors are used with a predefined setting of their parameters and computed from images coded in a chosen color space. The prior choice of a color space, a descriptor and its setting suited to a given application is a crucial but difficult problem that strongly impacts the classification results. To overcome this problem, this paper proposes a color texture representation that simultaneously takes into account the properties of several settings from different descriptors computed from images coded in multiple color spaces. Since the number of color texture features generated from this representation is high, a dimensionality reduction scheme by clustering-based sequential feature selection is applied to provide a compact hybrid multi-color space (CHMCS) descriptor. The experimental results carried out on five benchmark color texture databases with five color spaces and manifold settings of two texture descriptors show that combining different configurations always improves the accuracy compared to a predetermined configuration. On average, the CHMCS representation achieves 94.16% accuracy and outperforms deep learning networks and handcrafted color texture descriptors by over 5%, especially when the dataset is small.
... miRModuleNet was developed based on the generic approach named G-S-M. This generic approach was adopted by different tools such as SVM RCE, SVM-RCE-R (Yousef et al., 2007; Yousef et al., 2021a), maTE (Yousef et al., 2019), CogNet (Yousef et al., 2021d), miRcorrNet (Yousef et al., 2021b), and Integrating Gene Ontology Based Grouping and Ranking (Yousef et al., 2021c). Recently, these tools and their competitors were reviewed in (Yousef et al., 2020). ...
Article
Full-text available
Increasing evidence that microRNAs (miRNAs) play a key role in carcinogenesis has revealed the need for elucidating the mechanisms of miRNA regulation and the roles of miRNAs in gene-regulatory networks. A better understanding of the interactions between miRNAs and their mRNA targets will provide a better understanding of the complex biological processes that occur during carcinogenesis. Increased efforts to reveal these interactions have led to the development of a variety of tools to detect and understand these interactions. We have recently described a machine learning approach miRcorrNet, based on grouping and scoring (ranking) groups of genes, where each group is associated with a miRNA and the group members are genes with expression patterns that are correlated with this specific miRNA. The miRcorrNet tool requires two types of -omics data, miRNA and mRNA expression profiles, as an input file. In this study we describe miRModuleNet, which groups mRNA (genes) that are correlated with each miRNA to form a star shape, which we identify as a miRNA-mRNA regulatory module. A scoring procedure is then applied to each module to further assess their contribution in terms of classification. An important output of miRModuleNet is that it provides a hierarchical list of significant miRNA-mRNA regulatory modules. miRModuleNet was further validated on external datasets for their disease associations, and functional enrichment analysis was also performed. The application of miRModuleNet aids the identification of functional relationships between significant biomarkers and reveals essential pathways involved in cancer pathogenesis. The miRModuleNet tool and all other supplementary files are available at https://github.com/malikyousef/miRModuleNet/
... Moreover, more advanced approaches that integrate biological knowledge into the machine learning algorithm for performing feature selection or for selecting groups of features are used in different recent tools. Such an approach was adopted by different tools such as SVM RCE, SVM-RCE-R [87][88][89], maTE [90], CogNet [91], miRcorrNet [92], miRModuleNet [93], and Integrating Gene Ontology-Based Grouping and Ranking [94]. Recently, these tools and their competitors were reviewed in [95]. ...
Article
Full-text available
Antimicrobial peptides (AMPs) are considered promising alternatives to conventional antibiotics in order to overcome the growing problem of antibiotic resistance. Computational prediction approaches are receiving increasing interest for identifying and designing the best candidate AMPs prior to in vitro tests. In this study, we focused on linear cationic peptides with non-hemolytic activity, which were downloaded from the Database of Antimicrobial Activity and Structure of Peptides (DBAASP). Referring to the MIC (minimum inhibitory concentration) values, we assigned a positive label to a peptide if it shows antimicrobial activity; otherwise, the peptide is labeled as negative. Here, we focused on the peptides showing antimicrobial activity against Gram-negative and against Gram-positive bacteria separately, and we created two datasets accordingly. Ten different physico-chemical properties of the peptides are calculated and used as features in our study. Following data exploration and data preprocessing steps, a variety of classification algorithms are used with 100-fold Monte Carlo Cross-Validation to build models and to predict the antimicrobial activity of the peptides. Among the generated models, Random Forest resulted in the best performance metrics for both the Gram-negative dataset (Accuracy: 0.98, Recall: 0.99, Specificity: 0.97, Precision: 0.97, AUC: 0.99, F1: 0.98) and the Gram-positive dataset (Accuracy: 0.95, Recall: 0.95, Specificity: 0.95, Precision: 0.90, AUC: 0.97, F1: 0.92) after outlier elimination was applied. This prediction approach might be useful for evaluating the antibacterial potential of a candidate peptide sequence before moving to experimental studies.
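The evaluation loop in this abstract, Monte Carlo cross-validation with a Random Forest, repeatedly re-splits the data at random and averages the metric over the repeats. The sketch below is generic: the 100-repeat count comes from the study, while the split size, seed handling, and function name are illustrative assumptions:

```python
# Sketch of Monte Carlo cross-validation: repeatedly draw a random stratified
# train/test split, fit a Random Forest, and average the test accuracy.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def monte_carlo_cv(X, y, n_repeats=100, test_size=0.3, seed=0):
    accs = []
    for i in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=seed + i)
        model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
        accs.append(accuracy_score(y_te, model.predict(X_te)))
    return float(np.mean(accs))
```

Unlike k-fold cross-validation, the random splits may overlap, so the repeat count can be set independently of the dataset size, which is convenient for small peptide datasets.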
... SVM-RCE (Support Vector Machines - Recursive Cluster Elimination) is an example of grouping genes based on their gene expression values; it scores each cluster of genes by incorporating a machine learning algorithm. This approach has received attention from other researchers [24]. Similarly, SVM-RNE [25] is based on gene network detection, where the detected networks serve as groups for scoring by the G-S-M model. ...
Preprint
Full-text available
Background: Cell homeostasis relies on the concerted actions of several genes; and dysregulated genes lead to disease manifestations. In living organisms, genes or their products do not act alone, but instead act within a large network. Subsets of these networks can be viewed as modules which provide certain functionality in an organism. Kyoto Encyclopedia of Genes and Genomes (KEGG) systematically analyzes gene functions, proteins, molecules, and provides a PATHWAY database. Measurements of gene expression (e.g., RNA-seq data) can be mapped into KEGG pathways in order to determine which modules are affected or dysregulated in a disease. However, genes acting in multiple pathways, and some other inherent issues complicate such analyses. To detect dysregulated pathways, current approaches may only employ gene expression data and neglect some of the existing knowledge stored in KEGG pathways. For a more holistic association between gene expression and pathways, new approaches which take into account more of the compiled information are required. Results: PriPath is a novel approach that transfers the generic approach of grouping, scoring followed by modeling for the analysis of gene expression with KEGG pathways. In PriPath, we utilize the KEGG pathway as the grouping information and insert this information into a machine learning algorithm for selecting the most significant KEGG pathways. Those groups are utilized to train a machine learning model for the classification task. We have tested PriPath on 13 gene expression datasets of various cancers and other diseases. Our proposed approach successfully assigned biologically and clinically relevant KEGG terms to the differentially expressed genes. We have comparatively evaluated the performance of PriPath against other tools, which are similar in their merit. For each dataset, we manually confirmed the top results of PriPath in literature, and we compared PriPath predictions to the predictions of Reactome and DAVID. 
Conclusions: PriPath can thus aid in determining dysregulated pathways, which is applicable to medical diagnostics. In the future, we aim to advance this approach in such a way that it will be possible to perform patient stratification based on gene expression and to identify druggable targets. Thereby, we cover two aspects of precision medicine.
... Random forest has many advantages when dealing with high-dimensional small sample data [24], while random bits forest improves on random forest and performs better on classification problems. In this paper, based on the random bits forest, combined with the support vector machine-based recursive clustering elimination feature selection method proposed by Yousef et al. [25] and the improved SVM-RCE method proposed by Luo et al. [26], a random bits forest recursive clustering elimination feature selection method is proposed. The following is a detailed description of the overall approach. ...
Article
Full-text available
With the rapid development of artificial intelligence in recent years, research on image processing, text mining, and genome informatics has gradually deepened, and the mining of large-scale databases has begun to receive more and more attention. The objects of data mining have also become more complex, and the data dimensions of mining objects have become higher and higher. Compared with the ultra-high data dimensions, the number of samples available for analysis is too small, resulting in high-dimensional small sample data. High-dimensional small sample data bring serious dimensional disasters to the mining process. Through feature selection, redundant and noisy features in high-dimensional small sample data can be effectively eliminated, avoiding dimensional disasters and improving the actual efficiency of mining algorithms. However, existing feature selection methods emphasize the classification or clustering performance of the feature selection results and ignore the stability of the feature selection results, which leads to unstable feature selection results, and it is difficult to obtain real and understandable features. Based on traditional feature selection methods, this paper proposes an ensemble feature selection method, the Random Bits Forest Recursive Clustering Elimination (RBF-RCE) feature selection method, which combines multiple sets of base classifiers to carry out parallel learning and screen out the best feature classification results. It optimizes the classification performance of traditional feature selection methods and can also improve the stability of feature selection. Then, this paper analyzes the reasons for the instability of feature selection and introduces a feature selection stability measurement method, the Intersection Measurement (IM), to evaluate whether the feature selection process is stable.
The effectiveness of the proposed method is verified by experiments on several groups of high-dimensional small sample data sets.

1. Introduction

At present, the research on data mining has always been a hot issue in the fields of artificial intelligence, machine learning, and databases. The reason why data mining is so valued is that it can extract hidden and previously unknown information of potential value from a large number of complex data in the database to assist in decision-making. With the continuous emergence of large-scale data mining tasks, such as microarray gene expression data [1], which contains tens of thousands of gene features while the number of samples is small, the data dimension of the mining object is significantly expanded and the difficulty of mining is increased. With the development of big data in the future, more and more data mining tasks with high-dimensional and small sample characteristics will continue to emerge. How to process these data will also become a research difficulty: on the one hand, high data dimensionality will lead to dimensionality disasters; on the other hand, because the number of samples is too small, overfitting problems will be caused. Both will seriously reduce the classification or clustering accuracy and greatly increase the burden of learning. Therefore, in order to process high-dimensional small sample data and extract the required information from it, feature selection becomes a feasible way. Feature selection filters a feature subset from the original feature space, which can effectively reduce the dimension of the feature space [2]. Feature selection does not change the original feature space structure but only selects some important features from the original features to reconstruct a low-dimensional feature space with the same spatial structure as the original features. It is an optimization process [3]. Many existing studies have explained the significance and importance of feature selection [4-6].
At present, the mainstream feature selection methods are mainly divided into three types, namely, Filter, Wrapper, and Embedded. Filter measures the feature classification ability by analyzing the internal features of the feature subset and is generally used to filter out the feature subset with the highest score. According to the selection of selected subsets, Filter can be divided into two types: based on feature sorting [7] and feature space search [8] such as correlation-based feature selection (CFS) [9], maximum relevance minimum redundancy (MRMR) [10], and Bayesian framework [11-13]. However, the two methods of Filter have the problem of difficulty in coordination of computational complexity and classification accuracy, which leads to unsatisfactory processing results.

As for Wrapper, it can be divided into two types: sequential search method [14] and heuristic search [15]. The sequential search strategy reduces the computational complexity by continuously adding (deleting) a single feature, but it is easy to select feature subsets whose inner features are highly correlated [16]. The heuristic search algorithm is represented by the particle swarm optimization algorithm [17]. The initial feature subset is randomly generated, and the heuristic rule is gradually approached to the optimal solution, which can meet most of the needs. However, the high cost of reconstructing the classification model when dealing with different data sets limits its further development.

The emergence of Embedded is to solve the high cost of reconstructing the classification model when Wrapper processes different data sets. Taking the SVM-RFE method proposed by Guyon et al. [18] based on the idea of recursive feature search and elimination as an example, each dimension of the SVM hyperplane corresponds to each feature in the high-dimensional small sample data set, the importance of each feature is measured by feature weight, and the lower ranked feature is deleted in descending order.
The high-dimensional data dimensionality reduction work is completed after iteration, which effectively improves the time and space performance of the method and ensures high-precision classification results.

Although there are many mature feature selection methods, these methods emphasize the high classification performance or clustering performance of the feature selection results and ignore the stability of the feature selection results. The stability of feature selection refers to the insensitivity of feature selection results to small fluctuations in training content. In some situations, when the sample content changes slightly, the feature subsets or the feature importance ranking results obtained by feature selections are quite different, and even some incomprehensible feature sequences are output, which seriously reduces the accuracy of the feature selection method. This is the performance of poor feature selection stability. If the feature selection is performed by combining multiple learners in an ensemble way and the best feature selection result is selected from many learners, the stability of the feature selection result can be effectively improved. Li et al. [19] generated test objects by resampling technology and repeatedly used recursive decision trees for feature selection. Dutkowski and Gambin [20] used different feature selection algorithms for gene selection and integrated the results of each algorithm through optimization strategies to form the final feature subset. Saeys et al. [21] and Abeel et al. [22] used the bagging idea for ensemble feature selection and achieved good processing results. Based on the above research, this paper proposes a random bits forest [23] recursive clustering elimination (RBF-RCE) feature selection method based on the idea of ensemble.
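The SVM-RFE scheme described above, ranking features by the weights of the SVM hyperplane and deleting the lowest-ranked one each round, can be sketched briefly. The step size and stopping rule below are illustrative choices, not part of the original method's specification:

```python
# Sketch of the SVM-RFE idea: fit a linear SVM, rank features by the squared
# weights of the separating hyperplane, drop the lowest-ranked feature, refit,
# and repeat until the desired number of features remains.
import numpy as np
from sklearn.svm import SVC

def svm_rfe(X, y, n_select=3):
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_select:
        w = SVC(kernel="linear").fit(X[:, remaining], y).coef_[0]
        worst = int(np.argmin(w ** 2))   # smallest weight contributes least
        remaining.pop(worst)             # eliminate it and refit on the rest
    return remaining
```

Refitting after every elimination is what makes the procedure "recursive": the weights, and hence the ranking, change as features are removed.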
First, through K-means clustering, the research object is divided into several feature clusters, random bits forest (RBF) is used to calculate the importance of any feature in the cluster, and the feature score is calculated according to the importance of the feature. Then, after sorting in descending order according to the feature scores, the relevant deletion parameters are set. By judging the relationship between the number of existing features and the deletion parameters, the features in the cluster are deleted in reverse order to achieve feature dimensionality reduction processing. In addition, by analyzing the reasons for the unstable feature selection, this paper introduces a feature selection stability measurement method, which measures whether the feature selection is stable or not through the intersection measurement (IM). Eventually, through experiments on high-dimensional and small-sample data sets, the results demonstrate the effectiveness of the method and can achieve highly stable feature selection results.

2. RBF-RCE Feature Selection Method

Random forest has many advantages when dealing with high-dimensional small sample data [24], while random bits forest is improved by random forest, and it performs better on classification problems. In this paper, based on the random bits forest, combined with the support vector machine-based recursive clustering elimination feature selection method proposed by Yousef et al. [25] and the improved SVM-RCE method proposed by Luo et al. [26], a random bits forest recursive clustering elimination feature selection method is proposed. The following is a detailed description of the overall approach.

2.1. Feature Importance Analysis Based on Random Bits Forest

Random bits forest has been applied to high-dimensional small sample data processing due to its good performance in data classification processing.
It inherits from random forest the practice of screening by the importance of each feature when performing feature selection, and it combines neural networks [27] to improve model depth, gradient boosting [28, 29] to extend model breadth, and random forest [24] to improve model classification accuracy. In dealing with high-dimensional small sample data, it has higher accuracy and better convergence than random forest. For a high-dimensional small sample data set, random bits forest uses Bootstrap resampling technology [30], randomly sampling with replacement N times to obtain M sample sets. About 36% of the original samples are never sampled; this portion of the data is classified as out-of-bag (OOB) data, and the importance of features is evaluated through the out-of-bag data. The process is shown in Figure 1.
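The "about 36%" figure comes from the bootstrap itself: sampling N items with replacement leaves each item out with probability (1 - 1/N)^N, which approaches 1/e ≈ 36.8%. The snippet below checks this empirically and shows scikit-learn's OOB score as the kind of out-of-bag evaluation referred to here; the data and setup are illustrative, not from the paper:

```python
# Empirical check of the out-of-bag fraction under bootstrap resampling,
# plus an OOB-based evaluation via scikit-learn's RandomForestClassifier.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def oob_fraction(n, n_draws, seed=0):
    """Average fraction of items left out of a size-n bootstrap sample."""
    rng = np.random.default_rng(seed)
    fracs = [1 - len(set(rng.integers(0, n, size=n).tolist())) / n
             for _ in range(n_draws)]
    return float(np.mean(fracs))

# OOB evaluation: each tree is scored on the samples it never saw.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)
forest = RandomForestClassifier(oob_score=True, random_state=0).fit(X, y)
```

Because the OOB samples were never used to fit the corresponding trees, the OOB score gives a nearly unbiased estimate of generalization error without a separate validation set.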
... A different approach was proposed in Yousef et al. for selecting significant genes in gene expression studies [19]. To deal with highly correlated genes, the authors first identified clusters of highly correlated genes and then applied RFE to rank the identified clusters and select the most important ones for the final model. ...
... The aim of this work is to propose a new variable ranking method that is able to deal with the presence of highly correlated features and to preserve their interpretation. To offer more detail, the proposed method is a modified version of the RFE-Borda count method that takes into account the correlation between numerical features while performing the ranking, by grouping highly correlated features, as carried out in [19]. The proposed approach is validated on simulated datasets, in which the true variable ranking is known, and compared to the standard RFE-Borda count method that does not consider the correlation between features. ...
... The results of the variable ranking obtained for the representative dataset of Section 2.2.2 are reported in Table 4 for both the standard RFE-Borda count method and the proposed algorithm, performed on B = 100 training set variants generated by bootstrap resampling. We can observe that the standard RFE-Borda count approach, which ignores variable correlation, commits some ranking mistakes: x2 is ranked below x3; x5 is ranked in the 6th position, below x6; x8 is ranked in the 9th position, after x15; x9 and x10 are ranked in the 14th and the 18th position, respectively, and they are surpassed in the ranking even by noise variables, such as x18, x19, and x20. Conversely, the ranking obtained by the proposed approach, which considers the variable correlation, is almost completely correct; the only ranking mistake is that, also in this case, x8 is ranked in the 9th position, after x9 and its correlated feature x15. ...
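The Borda-count aggregation discussed in these excerpts can be sketched simply: each bootstrap resampling contributes one RFE ranking, and the Borda count sums each feature's positions across rankings and re-ranks by the total. The rankings below are toy inputs, not the paper's results:

```python
# Sketch of Borda-count aggregation of per-resampling feature rankings.
import numpy as np

def borda_aggregate(rankings):
    """rankings: list of lists, each an ordering of feature indices from
    most to least important. Returns the aggregated ordering."""
    n = len(rankings[0])
    points = np.zeros(n)
    for r in rankings:
        for position, feature in enumerate(r):
            points[feature] += position   # fewer total points = better rank
    return [int(f) for f in np.argsort(points)]
```

The correlation-aware variant proposed in the article would group highly correlated features before ranking, so that correlated features do not split each other's Borda points across resamplings.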
Article
Full-text available
When building a predictive model for predicting a clinical outcome using machine learning techniques, the model developers are often interested in ranking the features according to their predictive ability. A commonly used approach to obtain a robust variable ranking is to apply recursive feature elimination (RFE) on multiple resamplings of the training set and then to aggregate the ranking results using the Borda count method. However, the presence of highly correlated features in the training set can deteriorate the ranking performance. In this work, we propose a variant of the method based on RFE and Borda count that takes into account the correlation between variables during the ranking procedure in order to improve the ranking performance in the presence of highly correlated features. The proposed algorithm is tested on simulated datasets in which the true variable importance is known and compared to the standard RFE-Borda count method. According to the root mean square error between the estimated rank and the true (i.e., simulated) feature importance, the proposed algorithm outperforms the standard RFE-Borda count method. Finally, the proposed algorithm is applied to a case study related to the development of a predictive model of type 2 diabetes onset.