AUC (10-fold cross-validation).

Source publication
Article
Biomarker discovery from high-dimensional data is a crucial problem with enormous applications in biology and medicine. It is also extremely challenging from a statistical viewpoint, but surprisingly few studies have investigated the relative strengths and weaknesses of the plethora of existing feature selection methods. In this study we compare 32...

Similar publications

Article
Semiparametric generalized varying coefficient partially linear models with longitudinal data arise in contemporary biology, medicine, and life science. In this paper, we consider a variable selection procedure based on the combination of the basis function approximations and quadratic inference functions with SCAD penalty. The proposed procedure s...

Citations

... On the one hand, it simplifies the ML models and improves the interpretability of the method by highlighting the most influential and relevant features. On the other hand, it reduces overfitting owing to the small size of the dataset and relatively more features, ensuring more reliable and generalizable results (Haury et al., 2011). More specifically, by eliminating redundant input features in the dataset, feature selection can facilitate more precise data visualization and improve predictive performance. ...
Article
Conceptual design is crucial for designing offshore jacket substructures because it sets the direction for the entire design process. Nevertheless, conventional simulation-based optimization methods for jacket conceptual design face challenges, such as high computational costs and restricted optimization objectives. This paper proposes a data-driven method for offshore jacket conceptual design using machine learning (ML). First, a novel dataset of completed and under-construction jackets worldwide was established as the cornerstone of ML. The dataset comprised "in-action" data capturing key structural parameters of jackets and information on design boundary conditions. Subsequently, different features were comprehensively selected to identify and visualize their correlations for an interpretable data-driven design, ensuring the effectiveness of the dataset for training the ML models. Finally, random forest and eXtreme gradient boosting models were trained on the data from the selected feature subsets and then employed to predict individual jacket structural parameters. The predictive performance of the models indicates that the dataset and feature selection can capture the fundamental and shared characteristics of well-designed jackets, thereby improving the accuracy and efficiency of the conceptual design process. This study suggests the potential of a data-driven conceptual design for offshore jacket substructures.
... It has been established that SVM-RFE is a highly effective feature selection technique in terms of accuracy. However, filter methods perform better than SVM-RFE in terms of stability in many classification setups [25]. That can be linked to the simplicity of the filter methods, where within each training fold, features are ranked only in one iteration. ...
Article
Recent studies have shown that ensemble feature selection (EFS) has achieved outstanding performance in microarray data classification. However, some issues remain partially resolved, such as suboptimal aggregation methods and non-optimised underlying FS techniques. This study proposed the logarithmic rank aggregate (LRA) method to improve feature aggregation in EFS. Additionally, a hybrid aggregation framework was presented to improve the performance of the proposed method by combining it with several methods. Furthermore, the proposed method was applied to the feature rank lists obtained from the optimised FS technique to investigate the impact of FS technique optimisation. The experiments were performed on five binary microarray datasets. The experimental results showed that LRA provides a comparable classification performance to mean rank aggregation (MRA) and outperforms MRA in terms of gene selection stability. In addition, hybrid techniques provided the same or better classification accuracy as MRA and significantly improved stability. Moreover, some proposed configurations had better accuracy, sensitivity, and specificity performance than MRA. Furthermore, the optimised LRA drastically improved the FS stability compared to the unoptimised LRA and MRA. Finally, when the results were compared with other studies, it was shown that optimised LRA provided a remarkable stability performance, which can help domain experts diagnose cancer diseases with a relatively smaller subset of genes.
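The per-fold filter ranking and the notion of selection stability mentioned in this entry can be illustrated with a minimal sketch. It is not the method of any of the cited studies: the scoring rule (absolute difference of class means), the interleaved fold layout, the Jaccard-based stability measure, the toy data, and all function names are assumptions chosen only for illustration.

```python
from itertools import combinations
from statistics import mean


def interleaved_folds(n_samples, k):
    """Yield (train, test) index lists for k interleaved folds."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def filter_select(X, y, train, n_keep):
    """Rank features once, on the training fold only, and keep the top n_keep."""
    scored = []
    for j in range(len(X[0])):
        pos = [X[i][j] for i in train if y[i] == 1]
        neg = [X[i][j] for i in train if y[i] == 0]
        # Hypothetical filter score: absolute difference of class means.
        scored.append((abs(mean(pos) - mean(neg)), j))
    scored.sort(reverse=True)
    return frozenset(j for _, j in scored[:n_keep])


def mean_pairwise_jaccard(subsets):
    """Average Jaccard overlap between every pair of selected subsets."""
    overlaps = [len(a & b) / len(a | b) for a, b in combinations(subsets, 2)]
    return mean(overlaps)


# Toy data (invented): feature 0 separates the classes, features 1 and 2 are noise.
X = [
    [5.0, 1.0, 0.3],
    [1.0, 1.1, 0.2],
    [5.2, 0.9, 0.4],
    [0.9, 1.0, 0.3],
    [4.8, 1.2, 0.1],
    [1.1, 0.8, 0.2],
    [5.1, 1.0, 0.3],
    [1.0, 1.1, 0.4],
]
y = [1, 0, 1, 0, 1, 0, 1, 0]

selected = [filter_select(X, y, train, n_keep=1)
            for train, _ in interleaved_folds(len(X), 4)]
stability = mean_pairwise_jaccard(selected)
print(selected, stability)
```

Because the ranking is recomputed from scratch on each training fold, the overlap between the per-fold subsets gives a rough view of how sensitive the selection is to changes in the training samples.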
... To address the high dimensionality issue, some feature selection processes can be applied to raw data [22]. It can rely on expert knowledge, or it can be an automated task trained to select the most relevant and nonredundant features for a specific classification task [22,23]. Regarding imbalance, data-based methods, such as undersampling and oversampling, are often considered to balance the two classes. ...
... This can be illustrated with CKD versus AKI: while Lin similarity considers these two phenotypes as highly similar, the restricted hierarchical similarity between these two phenotypes is null, leading to fewer false positives and better overall performance. Considering the top ranked patients, restricted hierarchical similarity with RF showed a strong capacity to identify ciliopathy patients, with a recall@1% of 59% (i.e., precision around 25% whereas using a random model it would be 0.41%) and a recall@10% of 85%. Comparable performances among top 1% and top 10% were obtained by CODER embeddings with ridge regression. ...
Article
Background Rare diseases affect approximately 400 million people worldwide. Many of them suffer from delayed diagnosis. Among them, NPHP1-related renal ciliopathies need to be diagnosed as early as possible as potential treatments have been recently investigated with promising results. Our objective was to develop a supervised machine learning pipeline for the detection of NPHP1 ciliopathy patients from a large number of nephrology patients using electronic health records (EHRs). Methods and results We designed a pipeline combining a phenotyping module re-using unstructured EHR data, a semantic similarity module to address the phenotype dependence, a feature selection step to deal with high dimensionality, an undersampling step to address the class imbalance, and a classification step with multiple train-test split for the small number of rare cases. The pipeline was applied to thirty NPHP1 patients and 7231 controls and achieved good performances (sensitivity 86% with specificity 90%). A qualitative review of the EHRs of 40 misclassified controls showed that 25% had phenotypes belonging to the ciliopathy spectrum, which demonstrates the ability of our system to detect patients with similar conditions. Conclusions Our pipeline reached very encouraging performance scores for pre-diagnosing ciliopathy patients. The identified patients could then undergo genetic testing. The same data-driven approach can be adapted to other rare diseases facing underdiagnosis challenges. Supplementary Information The online version contains supplementary material available at 10.1186/s13023-024-03063-7.
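The recall@1% and precision figures quoted in this entry can be related to each other with a short sketch. It assumes that the ranking covers the full cohort of 30 cases and 7231 controls reported in the abstract, which the quoted passage does not state explicitly; the function name and the toy scores are hypothetical.

```python
def recall_precision_at_fraction(scores, labels, fraction):
    """Recall and precision among the top `fraction` of samples ranked by score."""
    n_top = max(1, round(len(scores) * fraction))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(label for _, label in ranked[:n_top])
    return hits / sum(labels), hits / n_top


# Toy ranking (invented): 10 samples, 3 true cases, inspect the top 30%.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]
recall, precision = recall_precision_at_fraction(scores, labels, 0.3)

# Arithmetic check against the reported figures (assumes the full cohort is ranked).
n_cases, n_controls = 30, 7231
prevalence = n_cases / (n_cases + n_controls)   # precision of a random ranking
n_top_1pct = 0.01 * (n_cases + n_controls)      # patients in the top 1%
implied_precision = 0.59 * n_cases / n_top_1pct  # from recall@1% = 59%
print(round(prevalence * 100, 2), round(implied_precision * 100, 1))
```

Under that assumption the random-model precision comes out at about 0.41%, matching the quoted value, and a recall@1% of 59% implies a precision of roughly 24%, in line with the "around 25%" stated in the passage.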
... The best feature selection algorithms are those that are stable to the addition or removal of training samples (Chandrashekar & Sahin, 2014; Haury et al., 2011; Dunne et al., 2002; Kalousis et al., 2007; Somol & Novovicova, 2010; Yang & Mao, 2011). We had limited samples, so we made statistical scores of the regression model as our judging criteria. ...
Article
Mangroves are woody halophytes thriving in muddy substratum along the coastal areas of the tropics and sub-tropics. They are often credited for their exceptional carbon sequestration capability. Estimating above-ground biomass (AGB) through field survey is tedious, particularly in a hostile environment like a mangrove ecosystem. However, the quantification of AGB is made possible with the help of continued advancements in sensor technology and computational algorithms. This research attempts to model the AGB of mangroves present in Bhitarkanika, Odisha, using a multi-sensor approach. We utilized multispectral Sentinel-2 (SM) and Landsat-8 (LO), and hyperspectral Airborne Visible Infra-Red Imaging Spectrometer-Next Generation (AN) datasets in our analysis. The mangrove biomass was calculated for 42 sample plots from a field survey using species-specific and common allometric equations. After data-specific preprocessing, six feature sets, namely reflectance bands, band ratios, vegetation indices (VIs), and texture-based Gray Level Co-occurrence Matrix (GLCM) features of reflectance, band ratios and VIs, were extracted for each dataset. The co-located set of features derived from each dataset were regressed against the AGB estimated using field methods of 42 sample plots (1) independently for each feature set, (2) in a combination of feature sets for each dataset and (3) in a combination of the feature sets of all three datasets as a multi-sensor approach. Feature selection techniques were used to get the best possible output of combined AN, SM and LO datasets. The results show that the combination of textural features gave better prediction models than independent sets of features. Also, Genetic Algorithm (GA) and Recursive Feature Elimination CV (RFECV) proved to be better feature selectors than other classical approaches. AN, SM and LO resulted in R² values of 0.41, 0.85 and 0.35 with RMSE of 356.81, 195.49 and 366.84 t/ha, respectively, while the multi-sensor approach yielded a maximum R² value of 0.7 and RMSE of 244.86 t/ha. The results show that the structural information of vegetation canopy obtained from textural parameters of different input bands has improved the regression model to predict the biomass.
... Consequently, as a research hotspot, machine learning offers numerous feature selection operators, which are divided into three main categories, namely, filter, wrapper, and embedded methods (27,28). The performance of feature selection is entwined with the classification method (29,30). ...
Article
Background The establishment of an accurate, stable, and non-invasive prediction model of sentinel lymph node (SLN) metastasis in breast cancer is difficult nowadays. The aim of this work is to identify the optimal machine learning model based on the three-dimensional (3D) image features of magnetic resonance imaging (MRI) for the preoperative prediction of SLN metastasis in breast cancer patients. Methods A total of 172 patients with histologically proven breast cancer were enrolled retrospectively, including 74 SLN metastasis patients and 98 non-SLN metastasis patients. All of them underwent a diffusion-weighted imaging (DWI) MRI scan. Firstly, a total of 10,320 texture and four non-texture features were extracted from the regions of interest (ROIs) of the images. Twenty-four feature selection methods and 11 classification methods were then evaluated by using 10-fold cross-validation to identify the optimal machine learning model in terms of the mean area under the curve (AUC), accuracy (ACC), and stability. Results The result showed that the model based on the combination of minimum redundancy maximum relevance (MRMR) + random forest (RF) exhibited the optimal predictive performance (AUC: 0.97±0.03; ACC: 0.89±0.05; stability: 2.94). Moreover, we independently investigated the performance of feature selection methods and classification methods, and observed that L1-support vector machine (L1-SVM) (AUC: 0.80±0.08; ACC: 0.76±0.07) and sequential forward floating selection (SFFS) (stability: 3.04) presented the best average predictive performance and stability among all feature selection methods, respectively. RF (AUC: 0.85±0.11; ACC: 0.80±0.09) and SVM (stability: 8.43) showed the best average predictive performance and stability among all classification methods, respectively. Conclusions The identified model based on the 3D image features of MRI provides a non-invasive way for the preoperative prediction of SLN metastasis in breast cancer patients.
... Presenting such categorization is a challenging and time-sensitive task. Based on an analysis of available literature on developing and evaluating feature selection algorithms, it was found that some researchers compared algorithms by applying them to different domain-specific datasets [2,5,13,21,26,82]. However, others compared these algorithms by applying them to the same datasets [9,11,15,60]. ...
Article
Learning algorithms can be less effective on datasets with an extensive feature space due to the presence of irrelevant and redundant features. Feature selection is a technique that effectively reduces the dimensionality of the feature space by eliminating irrelevant and redundant features without significantly affecting the quality of decision-making of the trained model. In the last few decades, numerous algorithms have been developed to identify the most significant features for specific learning tasks. Each algorithm has its advantages and disadvantages, and it is the responsibility of a data scientist to determine the suitability of a specific algorithm for a particular task. However, with the availability of a vast number of feature selection algorithms, selecting the appropriate one can be a daunting task for an expert. These challenges in feature selection have motivated us to analyze the properties of algorithms and dataset characteristics together. This paper presents significant efforts to review existing feature selection algorithms, providing an exhaustive analysis of their properties and relative performance. It also addresses the evolution, formulation, and usefulness of these algorithms. The manuscript further categorizes the algorithms analyzed in this review based on the properties required for a specific dataset and objective under study. Additionally, it discusses popular area-specific feature selection techniques. Finally, it identifies and discusses some open research challenges in feature selection that are yet to be overcome.
... Such molecular subtyping has been performed for malignancies of the breast [21], ovaries [22], colorectal [23], and skin [24]. The cancer subtyping is performed by computational approaches, viz. k-nearest neighbours and support vector machines (SVMs), but these are prone to biases because of their reliance on signature genes and omission of eminent biological information [25], [26]. These limitations can be overcome by the DL approach as it assimilates algorithmic patterns from the transcriptome. ...
Article
Deep learning, a branch of artificial intelligence, excavates massive data sets for patterns and predictions using a machine learning method known as artificial neural networks. Research on the potential applications of deep learning in understanding the intricate biology of cancer has intensified due to its increasing applications among healthcare domains and the accessibility of extensively characterized cancer datasets. Although preliminary findings are encouraging, this is a fast-moving sector where novel insights into deep learning and cancer biology are being discovered. We give a framework for new deep learning methods and their applications in oncology in this review. Our attention was directed towards its applications for DNA methylation, transcriptomic, and genomic data, along with histopathological inferences. We offer insights into how these disparate data sets can be combined for the creation of decision support systems. Specific instances of learning applications in cancer prognosis, diagnosis, and therapy planning are presented. Additionally, the present barriers and difficulties in deep learning applications in the field of precision oncology, such as the dearth of phenotypical data and the requirement for more explicable deep learning techniques have been elaborated. We wrap up by talking about ways to get beyond the existing challenges so that deep learning can be used in healthcare settings in the future.
... The expression values are produced by a quantification method such as reverse transcriptase-PCR, as used for quantification in the Oncotype DX and EndoPredict signatures, or DNA-microarray technology, as used for quantification in the MammaPrint signature. Previous studies have shown that the stability of gene selections varies across different datasets [54,55]. Therefore, we use 8 different but well-established datasets: METABRIC [56], GSE9893 [57], GSE7390 [58], GSE96058 [59], GSE11121 [60], GSE4922 [61], NKI [62,63], and data generated by the TCGA Research Network: https:// www. ...
Article
Gene expression signatures refer to patterns of gene activities and are used to classify different types of cancer, determine prognosis, and guide treatment decisions. Advancements in high-throughput technology and machine learning have led to improvements to predict a patient’s prognosis for different cancer phenotypes. However, computational methods for analyzing signatures have not been used to evaluate their prognostic power. Contention remains on the utility of gene expression signatures for prognosis. The prevalent approaches include random signatures, expert knowledge, and machine learning to construct an improved signature. We unify these approaches to evaluate their prognostic power. Re-evaluation of publicly available gene-expression data from 8 databases with 9 machine-learning models revealed previously unreported results. Gene-expression signatures are confirmed to be useful in predicting a patient’s prognosis. Convergent evidence from ≈10,000 signatures implicates a maximum prognostic power. By calculating the concordance index, which measures how well patients with different prognoses can be discriminated, we show that a signature can correctly discriminate patients’ prognoses no more than 80% of the time. Additionally, we show that more than 50% of the potentially available information is still missing at this value. We surmise that an accurate prognosis must incorporate molecular, clinical, histological, and other complementary factors.
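For readers unfamiliar with the concordance index mentioned in this abstract, a minimal sketch of one common formulation (Harrell's C for right-censored data) follows. The abstract does not say which variant the authors computed, so the pair-counting rule, the function name, and the toy follow-up data are assumptions for illustration only.

```python
def concordance_index(times, events, risks):
    """Share of comparable patient pairs whose predicted risks are ordered correctly.

    A pair (i, j) is comparable when patient i had an observed event strictly
    before patient j's follow-up time. Tied risk scores count as half.
    """
    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        if not events[i]:
            continue  # a censored patient cannot be the earlier member of a pair
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable


# Toy data (invented): follow-up time, event indicator (1 = event observed),
# and a predicted risk score where higher means a worse expected prognosis.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.5, 0.1]
c_index = concordance_index(times, events, risks)
print(c_index)
```

A value of 0.5 corresponds to random ordering and 1.0 to a perfect ordering of the comparable pairs.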
... In this paper, we have applied a simple t-test method to choose a subset of features having the most discriminative power. It is seen that the t-test performs better than any complex wrapper and embedded methods for a large number of features [12]. The probability of obtaining the evaluated t-statistic is called the p-value [11]. ...
Conference Paper
Depressive disorder (DD) has a high morbidity and death rate, contributing to suicide, the incidence and unfavorable effects of medical disease, and drug addiction. DD necessitates long-term monitoring, changes in symptom presentations, and subjective evaluation, which opens the door to analyze brain signals using modern neuro-signal processing and machine learning techniques. To detect depression in an automatic and non-invasive fashion, we have chosen electroencephalogram (EEG) signals from 199 DD adults and 95 healthy control (HC) adults. We have used a machine learning classification framework which uses spectral and functional connectivity features along with their feature-level fusion. Compared to the individual features, the fusion-based feature outperforms (98.06 ± 1.19 % accuracy) at a significant p-value (<0.05) for distinguishing DD from HC. Artificial neural network (ANN) performs better than different kernel-based support vector machines (SVM). In frequency band-wise analysis, using our fusion method, the delta frequency band (1-4 Hz) demonstrates the most discriminant features. Using connectivity networks and their strength, our findings reveal that the left posterior and right anterior brain regions are the most linked in depression. The improved performance signifies the clinical relevance of the work providing reliable assistance for depression detection and has the potential to find fusion-based connectivity patterns of depression.
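The t-test feature selection described in the citing passage of this entry can be sketched as follows. The passage does not specify which t-test variant or cut-off was used, so Welch's unequal-variance statistic, ranking by absolute t value instead of by p-value, the function names, and the toy data are all assumptions.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two independent samples (unequal variances)."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def top_features_by_t(X, y, k):
    """Return the indices of the k features with the largest absolute t-statistic."""
    scored = []
    for j in range(len(X[0])):
        group1 = [row[j] for row, label in zip(X, y) if label == 1]
        group0 = [row[j] for row, label in zip(X, y) if label == 0]
        scored.append((abs(welch_t(group1, group0)), j))
    scored.sort(reverse=True)
    return [j for _, j in scored[:k]]


# Toy data (invented): only feature 0 differs clearly between the two groups.
X = [
    [5.1, 0.2, 1.0],
    [4.9, 0.4, 1.2],
    [5.3, 0.1, 0.9],
    [1.0, 0.3, 1.1],
    [1.2, 0.2, 1.0],
    [0.8, 0.5, 0.8],
]
y = [1, 1, 1, 0, 0, 0]
top = top_features_by_t(X, y, k=1)
print(top)
```

Converting each statistic to a p-value would additionally require the t distribution with the appropriate degrees of freedom, which is omitted here; ranking by absolute t is a simplification, since Welch degrees of freedom can differ between features.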
... BFS and RFE are comparatively simple for implementation when contrasted with other intricate feature selection methodologies. Additionally, both methods have been shown to produce positive outcomes concerning classification accuracy and the quantity of chosen features [72,75]. ...
Article
Knowledge discovery in databases (KDD) is crucial in analyzing data to extract valuable insights. In medical outcome prediction, KDD is increasingly applied, particularly in diseases with high incidence, mortality, and costs, like cancer. ML techniques can develop more accurate predictive models for cancer patients’ clinical outcomes, aiding informed healthcare decision-making. However, cancer prediction modeling faces challenges because of the unbalanced nature of the datasets, where there is a small minority category of patients with a cancer diagnosis compared to a majority category of cancer-free patients. Imbalanced datasets pose statistical hurdles like bias and overfitting when developing accurate prediction models. This systematic review focuses on breast cancer prediction articles published from 2008 to 2023. The objective is to examine ML methods used in three critical steps of KDD: preprocessing, data mining, and interpretation which address the imbalanced data problem in breast cancer prediction. This work synthesizes prior research in ML methods for breast cancer prediction. The findings help identify effective preprocessing strategies, including balancing and feature selection methods, robust predictive models, and evaluation metrics of those models. The study aims to inform healthcare providers and researchers about effective techniques for accurate breast cancer prediction.