Article

Regularization and variable selection via the elastic net (vol B 67, pg 301, 2005)

Authors: Hui Zou, Trevor Hastie

... By using machine learning approaches, data can be used to produce insights that may lead to inductive theory building (Choudhury et al., 2018). For example, to obtain the best predictive model of a construct, machine learning methods (e.g., Elastic Net) employ algorithmic induction only from the data (Zou & Hastie, 2005). We suggest that PISA large-scale assessment data be used to either refine existing theories or to generate new theories by employing a grounded theory approach that utilizes machine learning algorithms (e.g., Elastic Net). ...
... To address our research questions, we firstly applied the Elastic Net (ENET) approach (Zou & Hastie, 2005), which is a theoretically agnostic variable selection method in machine learning designed to select the most statistically salient predictors of students' reading self-concept at the student, teacher, and school levels. After screening and reviewing the ENET identified variables based on theory and relevant literature, we examined the variables that were identified via ENET analysis by conducting a two-level multilevel modeling analysis with the entire United States' student sample as the reference model and then investigated the reference model's generalizability for emergent bilinguals. ...
... ENET is a commonly used feature/variable selection approach within machine learning for datasets in which the number of predictors is significantly large. Developed by Zou and Hastie (2005), this approach combines two other penalized or regularized regression approaches, the L1-norm penalty (lasso) and the L2-norm penalty (ridge), to select the best-performing predictive model. Implementation of ENET removes the limitations that arise when lasso and ridge are used separately by (a) performing continuous shrinkage and automatic variable selection (Zou & Hastie, 2005), (b) encouraging the grouping effect (Zhou, 2013) by allowing the models to select a group of highly correlated variables, and (c) removing the limitation regarding the number of selected variables (Wei et al., 2019). ...
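For reference, the (naive) elastic net criterion that these excerpts describe, in the notation of Zou and Hastie (2005), combines the two penalties in a single objective; the published paper additionally rescales the resulting coefficients by (1 + λ2):

    \hat{\beta} = \arg\min_{\beta}\; \lVert y - X\beta \rVert_2^2 + \lambda_2 \lVert \beta \rVert_2^2 + \lambda_1 \lVert \beta \rVert_1 ,
    \qquad
    \lVert \beta \rVert_1 = \sum_{j=1}^{p} \lvert \beta_j \rvert , \quad
    \lVert \beta \rVert_2^2 = \sum_{j=1}^{p} \beta_j^2 .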
Article
Decades of research have indicated that reading self-concept is an important predictor of reading achievement. During this period, the population of emergent bilinguals has continued to increase within United States' schools. However, the existing literature has tended to examine native English speakers' and emergent bilinguals' reading self-concept in the aggregate, thereby potentially obfuscating the unique pathways through which reading self-concept predicts reading achievement. Furthermore, due to the overreliance on native English speakers in samples used for theory development, researchers attempting to examine predictors of reading achievement may a priori select variables that are more aligned with native English speakers' experiences. To address this issue, we adopted Elastic Net, a theoretically agnostic machine learning approach to variable selection, to identify the proximal and distal predictors of reading self-concept for the entire population; in our study, participants from the United States who participated in PISA 2018 served as the baseline group to determine significant predictors of reading self-concept with the intent of identifying potential new directions for future researchers. Based on Elastic Net analysis, 20 variables at the student level, three variables at the teacher level, and 12 variables at the school level were identified as the most salient predictors of reading self-concept. We then utilized a multilevel modeling approach to test model generalizability of the identified predictors of reading self-concept for emergent bilinguals and native English speakers. We disaggregated and compared findings for both emergent bilinguals and native English speakers. Our results indicate that although some predictors were important for both groups (e.g., perceived information and communications technologies competence), other predictors were not (e.g., competitiveness). Suggestions for future directions and implications of the present study are examined.
... N(0, σ²). To deal with the instability, new approaches such as the Nonnegative Garrote [6], LASSO [11], SCAD [4], LARS [3] and Elastic Net [16] have been proposed. Through continuous penalization and automated variable selection, these approaches allow us to increase both model interpretability and prediction accuracy. ...
... In this section, we compare the performance of the proposed MAVE-SiER method with three related methods on simulated data. The first method, the elastic net [16], can select groups of correlated variables and overcomes the difficulty of p > n. The elastic net is based on a combination of the ridge (L2) and lasso (L1) penalties. ...
Article
Full-text available
In this paper, a new sparse method called MAVE-SiER is proposed. To construct MAVE-SiER, we combine the effective sufficient dimension reduction method MAVE with the sparse signal extraction approach to multivariate regression (SiER). MAVE-SiER has the benefit of extending SiER to nonlinear and multi-dimensional regression. MAVE-SiER also allows MAVE to deal with problems in which the predictors are highly correlated. MAVE-SiER may estimate dimensions exhaustively while concurrently choosing useful variables. Simulation studies confirmed MAVE-SiER's performance.
... As such, in this study, we focus on examining the best set of SNP features suitable for use in the prediction of CYP2D6-associated CpG methylation levels using CYP2D6 SNP genotypes after various feature selection methods. We also compare the performance of two ML algorithms, Elastic Net (Zou and Hastie, 2005) and eXtreme Gradient Boosting (XGBoost; Chen and Guestrin, 2016) and investigate their performance and prediction accuracy with regard to the different SNP feature sets to identify the optimal method for SNP feature selection. ...
... Elastic Net performing comparably to XGBoost may be due to its ability to handle features with collinearity. The grouping effect of Elastic Net groups highly correlated variables together and either drops or retains all of them together (Zou and Hastie, 2005). ...
Article
Full-text available
Introduction Pharmacogenetics currently supports clinical decision-making on the basis of a limited number of variants in a few genes and may benefit paediatric prescribing where there is a need for more precise dosing. Integrating genomic information such as methylation into pharmacogenetic models holds the potential to improve their accuracy and consequently prescribing decisions. Cytochrome P450 2D6 (CYP2D6) is a highly polymorphic gene conventionally associated with the metabolism of commonly used drugs and endogenous substrates. We thus sought to predict epigenetic loci from single nucleotide polymorphisms (SNPs) related to CYP2D6 in children from the GUSTO cohort. Methods Buffy coat DNA methylation was quantified using the Illumina Infinium Methylation EPIC beadchip. CpG sites associated with CYP2D6 were used as outcome variables in Linear Regression, Elastic Net and XGBoost models. We compared feature selection of SNPs from GWAS mQTLs, GTEx eQTLs and SNPs within 2 Mb of the CYP2D6 gene and the impact of adding demographic data. The samples were split into training (75%) sets and test (25%) sets for validation. In the Elastic Net and XGBoost models, optimal hyperparameter search was done using 10-fold cross-validation. Root Mean Square Error and R-squared values were obtained to investigate each model's performance. When GWAS was performed to determine SNPs associated with CpG sites, a total of 15 SNPs were identified, where several SNPs appeared to influence multiple CpG sites. Results Overall, Elastic Net models of genetic features appeared to perform marginally better than heritability estimates and substantially better than Linear Regression and XGBoost models. The addition of nongenetic features appeared to improve performance for some but not all feature sets and probes. The best feature set and Machine Learning (ML) approach differed substantially between CpG sites, and a number of top variables were identified for each model. Discussion The development of SNP-based prediction models for CYP2D6 CpG methylation in Singaporean children of varying ethnicities in this study has clinical application. With further validation, they may add to the set of tools available to improve precision medicine and pharmacogenetics-based dosing.
... We substitute equation (34) into equation (38) and simplify: min ...
... λ1 is used to control the sparsity of the linear expression weights, and λ2 is used to control the degree of association between the linear expression weight vector w and the similarity vector s. Following [13,38,39], the grid search range for these two parameters is {0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10}. σ is used to control the bandwidth of the Gaussian kernel; following [40] and Scott's rule [41], we set its grid search range to {0.1, 0.5, 1, 5, 10}. ϕ is used to control the similarity of the elements w_ij and s_ij, and we set its search range to {0.005, 0.01, 0.05, 0.1, 0.5, 1, 10} using the experimental validation method. ε is a small positive constant used to ensure that the concatenated terms are nonzero; its value is taken as 10⁻¹⁵ in the proposed BLA-KNN method. ...
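A minimal sketch of this kind of exhaustive grid search over the quoted candidate values (the scoring function below is a dummy placeholder, not the BLA-KNN objective):

import math
from itertools import product

# Candidate grids quoted in the excerpt above.
lambda1_grid = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10]
lambda2_grid = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10]
sigma_grid = [0.1, 0.5, 1, 5, 10]
phi_grid = [0.005, 0.01, 0.05, 0.1, 0.5, 1, 10]

def validation_score(lam1, lam2, sigma, phi):
    # Dummy stand-in for "fit the model on training folds, score on validation".
    return -(math.log10(lam1) ** 2 + math.log10(lam2) ** 2
             + math.log10(sigma) ** 2 + math.log10(phi) ** 2)

best_score, best_params = float("-inf"), None
for params in product(lambda1_grid, lambda2_grid, sigma_grid, phi_grid):
    score = validation_score(*params)
    if score > best_score:
        best_score, best_params = score, params

print("selected (lambda1, lambda2, sigma, phi):", best_params)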
Article
Full-text available
While the classical KNN (k nearest neighbor) shares its avoidance of the consistent distribution assumption between training and testing samples to achieve fast prediction, it still faces two challenges: (a) its generalization ability heavily depends on an appropriate number k of nearest neighbors; (b) its prediction behavior lacks interpretability. In order to address the two challenges, a novel Bayes-decisive linear KNN with adaptive nearest neighbors (i.e., BLA-KNN) is proposed to obtain the following three merits: (a) a diagonal matrix is introduced to adaptively select the nearest neighbors and simultaneously improve the generalization capability of the proposed BLA-KNN method; (b) the proposed BLA-KNN method owns the group effect, which inherits and extends the group property of the sum of squares for total deviations by reflecting the training sample class-aware information in the group effect regularization term; (c) the prediction behavior of the proposed BLA-KNN method can be interpreted from the Bayes-decision-rule perspective. In order to do so, we first use a diagonal matrix to weigh each training sample so as to obtain the importance of the sample, while constraining the importance weights to ensure that the adaptive k value is carried out efficiently. Second, we introduce a class-aware information regularization term in the objective function to obtain the nearest neighbor group effect of the samples. Finally, we introduce linear expression weights related to the distance measure between the testing and training samples in the regularization term to ensure that the interpretation of Bayes-decision-rule can be performed smoothly. We also optimize the proposed objective function using an alternating optimization strategy. We experimentally demonstrate the effectiveness of the proposed BLA-KNN method by comparing it with 7 comparative methods on 15 benchmark datasets.
... Predictors of model parameters were analyzed using penalized elastic-net regression. This statistical technique is highly effective at handling multiple predictors and multicollinearity [27]. The predictors were standardized before fitting the model. ...
... Researchers have explored the potential of SVM to address the challenges by utilizing a small training dataset, further highlighting the versatility and effectiveness of SVM in providing accurate and reliable predictions even with limited training data [27]. The Elastic Net Multivariate Linear Regression (ENMLR) was introduced by Zou and Hastie [28] as a robust approach for analyzing high-dimensional datasets. It was designed to overcome the limitations of the LASSO method. ...
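As a rough, hedged sketch of the workflow these excerpts describe (standardize the predictors, then fit an elastic net whose penalties are tuned by cross-validation), using scikit-learn on toy data rather than the cited studies' datasets:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data with correlated predictors, standing in for the studies' datasets.
X, y = make_regression(n_samples=200, n_features=50, n_informative=8,
                       effective_rank=10, noise=5.0, random_state=0)

# Standardize predictors, then fit an elastic net whose overall penalty (alpha)
# and lasso/ridge mix (l1_ratio) are chosen by 5-fold cross-validation.
model = make_pipeline(
    StandardScaler(),
    ElasticNetCV(l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 1.0], cv=5, random_state=0),
)
model.fit(X, y)

enet = model[-1]
print("chosen l1_ratio:", enet.l1_ratio_, " chosen alpha:", enet.alpha_)
print("non-zero coefficients:", int(np.sum(enet.coef_ != 0)), "of", X.shape[1])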
Article
Full-text available
Background The process of optimizing in vitro shoot proliferation is a complicated task, as it is influenced by interactions of many factors as well as genotype. This study investigated the role of various concentrations of plant growth regulators (zeatin and gibberellic acid) in the successful in vitro shoot proliferation of three Punica granatum cultivars (‘Faroogh’, ‘Atabaki’ and ‘Shirineshahvar’). Also, the utility of five Machine Learning (ML) algorithms—Support Vector Regression (SVR), Random Forest (RF), Extreme Gradient Boosting (XGB), Ensemble Stacking Regression (ESR) and Elastic Net Multivariate Linear Regression (ENMLR)—as modeling tools was evaluated on in vitro multiplication of pomegranate. A new automatic hyperparameter optimization method named Adaptive Tree Parzen Estimator (ATPE) was developed to tune the hyperparameters. The performance of the models was evaluated and compared using statistical indicators (MAE, RMSE, RRMSE, MAPE, R and R²), while a specific Global Performance Indicator (GPI) was introduced to rank the models based on a single parameter. Moreover, Non-dominated Sorting Genetic Algorithm-II (NSGA-II) was employed to optimize the selected prediction model. Results The results demonstrated that the ESR algorithm exhibited higher predictive accuracy in comparison to other ML algorithms. The ESR model was subsequently introduced for optimization by NSGA-II. ESR-NSGA-II revealed that the highest proliferation rate (3.47, 3.84, and 3.22), shoot length (2.74, 3.32, and 1.86 cm), leaf number (18.18, 19.76, and 18.77), and explant survival (84.21%, 85.49%, and 56.39%) could be achieved with a medium containing 0.750, 0.654, and 0.705 mg/L zeatin, and 0.50, 0.329, and 0.347 mg/L gibberellic acid in the ‘Atabaki’, ‘Faroogh’, and ‘Shirineshahvar’ cultivars, respectively. Conclusions This study demonstrates that the ‘Shirineshahvar’ cultivar exhibited lower shoot proliferation success compared to the other cultivars. The results indicated the good performance of ESR-NSGA-II in modeling and optimizing in vitro propagation. ESR-NSGA-II can be applied as an up-to-date and reliable computational tool for future studies in plant in vitro culture.
... SVRs are accurate and have acceptable overhead; though their CPU overhead is higher than that of linear models, at ~10%, the higher overhead of SVRs would be acceptable if they can better capture complex execution profiles. Lasso regression has its limitations (see [70]), as it suffers if operators execute in multiple power states or there are dependencies between features in the training data. Each feature in the training data corresponds to an operator type and its instance count. ...
Preprint
Full-text available
Managing the limited energy on mobile platforms executing long-running, resource intensive streaming applications requires adapting an application's operators in response to their power consumption. For example, the frame refresh rate may be reduced if the rendering operation is consuming too much power. Currently, predicting an application's power consumption requires (1) building a device-specific power model for each hardware component, and (2) analyzing the application's code. This approach can be complicated and error-prone given the complexity of an application's logic and the hardware platforms with heterogeneous components that it may execute on. We propose eScope, an alternative method to directly estimate power consumption by each operator in an application. Specifically, eScope correlates an application's execution traces with its device-level energy draw. We implement eScope as a tool for Android platforms and evaluate it using workloads on several synthetic applications as well as two video stream analytics applications. Our evaluation suggests that eScope predicts an application's power use with 97% or better accuracy while incurring a compute time overhead of less than 3%.
... Next, we asked whether a concise gene signature could be used to identify TP53Mut-like AML across datasets. We used elastic net regression, which results in sparser models [9] and is better suited to identify a concise gene signature. We performed multiple rounds of elastic net optimization and identified 25 core genes that accurately classify TP53Mut-like AML. ...
... The 5-minute duration was chosen as a reasonable timeframe in which pain level is not expected to fluctuate dramatically. Due to the large number of potential spectro-spatial features present in the data (e.g., channel and power-band combinations), we employed Elastic-Net regression, a hybrid of ridge regression and lasso regularization favored in the presence of highly correlated variables, which is often the case with intracranial recordings [27]. To avoid data leakage while optimizing hyperparameters in the feature pruning process, we used a nested k-fold cross-validation (CV) scheme (Figure 2B). ...
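A generic sketch of nested cross-validation of this kind (not the authors' pipeline): the inner loop tunes the elastic net hyperparameters, while each outer test fold stays untouched by tuning, which is what prevents leakage.

from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=150, n_features=40, noise=10.0, random_state=1)

# Inner loop: tune the penalty strength and the L1/L2 mix of the elastic net.
inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)
search = GridSearchCV(
    make_pipeline(StandardScaler(), ElasticNet(max_iter=10_000)),
    param_grid={"elasticnet__alpha": [0.01, 0.1, 1.0],
                "elasticnet__l1_ratio": [0.2, 0.5, 0.8]},
    cv=inner_cv,
)

# Outer loop: performance is estimated on folds never seen during tuning.
outer_cv = KFold(n_splits=5, shuffle=True, random_state=2)
scores = cross_val_score(search, X, y, cv=outer_cv)
print("nested-CV R^2 per outer fold:", scores.round(3))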
Preprint
Full-text available
Pain is a complex experience that remains largely unexplored in naturalistic contexts, hindering our understanding of its neurobehavioral representation in ecologically valid settings. To address this, we employed a multimodal, data-driven approach integrating intracranial electroencephalography, pain self-reports, and facial expression quantification to characterize the neural and behavioral correlates of naturalistic acute pain in twelve epilepsy patients undergoing continuous monitoring with neural and audiovisual recordings. High self-reported pain states were associated with elevated blood pressure, increased pain medication use, and distinct facial muscle activations. Using machine learning, we successfully decoded individual participants' high versus low self-reported pain states from distributed neural activity patterns (mean AUC = 0.70), involving mesolimbic regions, striatum, and temporoparietal cortex. High self-reported pain states exhibited increased low-frequency activity in temporoparietal areas and decreased high-frequency activity in mesolimbic regions (hippocampus, cingulate, and orbitofrontal cortex) compared to low pain states. This neural pain representation remained stable for hours and was modulated by pain onset and relief. Objective facial expression changes also classified self-reported pain states, with results concordant with electrophysiological predictions. Importantly, we identified transient periods of momentary pain as a distinct naturalistic acute pain measure, which could be reliably differentiated from affect-neutral periods using intracranial and facial features, albeit with neural and facial patterns distinct from self-reported pain. These findings reveal reliable neurobehavioral markers of naturalistic acute pain across contexts and timescales, underscoring the potential for developing personalized pain interventions in real-world settings.
... In high-dimensional regression modeling, many covariates are usually included in the model, but only a small subset is statistically significant. Many regularization methods, including the LASSO (Tibshirani [27]: least absolute shrinkage and selection operator), adaptive LASSO (Zou [32]), SCAD (Fan and Li [7]: smoothly clipped absolute deviation), Enet (Zou and Hastie [12]: the elastic net), bridge penalized regression (Fu [8]), and L1∕2-norm penalization (Xu [30]), are frequently used to conduct variable selection and model estimation simultaneously. In the Bayesian framework, Park and Casella [19] proposed Bayesian LASSO regression, Alhamzawi et al. [4] considered Bayesian adaptive LASSO QR, Polson et al. [20] studied Bayesian bridge regression, Betancourt et al. (2017) studied Bayesian fused Lasso regression for dynamic binary networks, Alhamzawi and Ali [3] discussed Bayesian L1∕2 Tobit QR, and Mallick and Yi [16] considered Bayesian L1∕2 regularization. ...
Article
Ordinal data frequently occur in various fields such as knowledge level assessment, credit rating, clinical disease diagnosis, and psychological evaluation. The classic models, including cumulative logistic regression or probit regression, are often used to model such ordinal data. But these modeling approaches depict the mean characteristic of the response variable conditional on a cluster of predictive variables, which often results in non-robust estimation results. As an appealing alternative, the composite quantile regression (CQR) approach is usually employed to gain more robust and relatively efficient results. In this paper, we propose a Bayesian CQR modeling approach for the ordinal latent regression model. In order to overcome the identifiability problem of the considered model and obtain more robust estimation results, we advocate using the Bayesian relative CQR approach to estimate regression parameters. Additionally, in regression modeling, it is a highly desirable task to obtain a parsimonious model that retains only important covariates. We incorporate the Bayesian penalty into the ordinal latent CQR regression model to simultaneously conduct parameter estimation and variable selection. Finally, the proposed Bayesian relative CQR approach is illustrated by Monte Carlo simulations and a real data application. Simulation results and real data examples show that the suggested Bayesian relative CQR approach has good performance for ordinal regression models.
... As part of the machine-learning pipeline, we integrated a minimum redundancy maximum relevance feature selection algorithm (MRMR; Peng et al., 2005) via the python library featurewiz (Seshadri, 2023). We compared seven different regression-based machine-learning algorithms, most of which have been successfully employed for psychotherapy research before (Aafjes-van Doorn et al., 2021): (1) Elastic net regularization and variable selection (Elastic Net; Zou & Hastie, 2005) conducts both L1 and L2 regression regularization; (2) eXtreme Gradient Boosting (XGBoost; Chen & Guestrin, 2016) creates a series of decision trees that are trained sequentially for error correction, resulting in a final ensemble of trees; (3) Random Forest (RF; Breiman, 2001) also creates an ensemble of decision trees, which is aggregated to calculate the average of the predictions; (4) Support-Vector-Regression (SVR; Cortes & Vapnik, 1995) identifies a hyperplane that separates the data points into different classes by maximizing the distance between the hyperplane and the nearest data points from the classes; (5) Mixed Effects Random Forest (MERF; Hajjem et al., 2014) is an extension of RF allowing a random intercept for nested data (i.e., sessions nested in patients); (6) Gaussian Process Boosting (GPBoost; Sigrist, 2022) employs a boosting framework incorporating Gaussian Process regression and mixed effects modeling; (7) SuperLearner (van der Laan et al., 2007) is an algorithm that takes the predicted data from other algorithms as predictors and feeds them into a meta-algorithm that predicts the target variable. We chose an SVR algorithm as our meta-algorithm and used the predicted data from the previous algorithms (Elastic Net, XGBoost, RF, SVR, MERF, GPBoost) as predictors. ...
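A rough illustration of the stacking idea described above, using scikit-learn's StackingRegressor with an SVR meta-learner; the base learners below are placeholders (the study's MERF and GPBoost models, for instance, are not scikit-learn estimators):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import ElasticNet
from sklearn.svm import SVR

X, y = make_regression(n_samples=300, n_features=30, noise=15.0, random_state=0)

# Base learners: their out-of-fold predictions become the meta-learner's inputs.
base_learners = [
    ("enet", ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10_000)),
    ("rf", RandomForestRegressor(n_estimators=200, random_state=0)),
    ("svr", SVR(C=1.0)),
]

# Meta-learner: an SVR combines the base predictions, as in the excerpt above.
stack = StackingRegressor(estimators=base_learners, final_estimator=SVR(), cv=5)
stack.fit(X, y)
print("training R^2 of the stacked model:", round(stack.score(X, y), 3))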
Article
Full-text available
We aim to use topic modeling, an approach for discovering clusters of related words (“topics”), to predict symptom severity and therapeutic alliance in psychotherapy transcripts, while also identifying the most important topics and overarching themes for prediction. We analyzed 552 psychotherapy transcripts from 124 patients. Using BERTopic (Grootendorst, 2022), we extracted 250 topics each for patient and therapist speech. These topics were used to predict symptom severity and alliance with various competing machine-learning methods. Sensitivity analyses were calculated for a model based on 50 topics, LDA-based topic modeling, and a bigram model. Additionally, we grouped topics into themes using qualitative analysis and identified key topics and themes with eXplainable Artificial Intelligence (XAI). Symptom severity could be predicted with highest accuracy by patient topics (r = 0.45, 95% CI [0.40, 0.51]), whereas alliance was better predicted by therapist topics (r = 0.20, 95% CI [0.16, 0.24]). Drivers for symptom severity were themes related to health and negative experiences. Lower alliance was correlated with various themes, especially psychotherapy framework, income, and everyday life. This analysis shows the potential of using topic modeling in psychotherapy research, allowing the prediction of several treatment-relevant metrics with reasonable accuracy. Further, the use of XAI allows for an analysis of the individual predictive value of topics and themes. Limitations entail heterogeneity across different topic modeling hyperparameters and a relatively small sample size.
... Elastic Net can select PRSs to include and efficiently handle multi-collinearity. [35][36][37] Furthermore, PRSmix and PRSmix+ only required a set of SNP effects to estimate the PRSs and estimated the prediction accuracy to the target trait to select the best scores for the combination. Additionally, compared to the preselected traits for stroke by Abraham et al., 8 we also observed that our method could identify more related risk factors to include compared to previous work conducted on stroke such as usual walking pace, arthropathies, and lipoprotein(a) ( Figure S10; Tables S16 and S17). ...
Article
Full-text available
Polygenic risk scores (PRSs) are an emerging tool to predict the clinical phenotypes and outcomes of individuals. We propose PRSmix, a framework that leverages the PRS corpus of a target trait to improve prediction accuracy, and PRSmix+, which incorporates genetically correlated traits to better capture the human genetic architecture for 47 and 32 diseases/traits in European and South Asian ancestries, respectively. PRSmix demonstrated a mean prediction accuracy improvement of 1.20-fold (95% confidence interval [CI], [1.10; 1.3]; p = 9.17 × 10⁻⁵) and 1.19-fold (95% CI, [1.11; 1.27]; p = 1.92 × 10⁻⁶), and PRSmix+ improved the prediction accuracy by 1.72-fold (95% CI, [1.40; 2.04]; p = 7.58 × 10⁻⁶) and 1.42-fold (95% CI, [1.25; 1.59]; p = 8.01 × 10⁻⁷) in European and South Asian ancestries, respectively. Compared to the previously cross-trait-combination methods with scores from pre-defined correlated traits, we demonstrated that our method improved prediction accuracy for coronary artery disease up to 3.27-fold (95% CI, [2.1; 4.44]; p value after false discovery rate (FDR) correction = 2.6 × 10⁻⁴). Our method provides a comprehensive framework to benchmark and leverage the combined power of PRS for maximal performance in a desired target population.
... However, the partial likelihood assumes that event times are unique. To handle ties, where multiple individuals experience the event at the same time, we use the Breslow approximation of the partial likelihood in [3] ...
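For context, the Breslow approximation handles ties by raising each risk-set denominator to the number of events at that time. With distinct event times t_1 < … < t_K, d_k events at t_k, risk set R(t_k), and s_k the sum of the covariate vectors of the subjects failing at t_k, the approximated partial likelihood is (standard notation, not copied from the preprint):

    L_{\text{Breslow}}(\beta) = \prod_{k=1}^{K} \frac{\exp\!\left(\beta^{\top} s_k\right)}{\left[\sum_{j \in R(t_k)} \exp\!\left(\beta^{\top} x_j\right)\right]^{d_k}} .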
Preprint
Full-text available
Background Associated with high-dimensional omics data there are often “meta-features” such as pathways and functional annotations that can be informative for predicting an outcome of interest. We extend to Cox regression the regularized hierarchical framework of Kawaguchi et al. (2022) (1) for integrating meta-features, with the goal of improving prediction and feature selection performance with time-to-event outcomes. Methods A hierarchical framework is deployed to incorporate meta-features. Regularization is applied to the omic features as well as the meta-features so that high-dimensional data can be handled at both levels. The proposed hierarchical Cox model can be efficiently fitted by a combination of iterative reweighted least squares and cyclic coordinate descent. Results In a simulation study we show that when the external meta-features are informative, the regularized hierarchical model can substantially improve prediction performance over standard regularized Cox regression. We illustrate the proposed model with applications to breast cancer and melanoma survival based on gene expression profiles, which show the improvement in prediction performance by applying meta-features, as well as the discovery of important omic feature sets with sparse regularization at the meta-feature level. Conclusions The proposed hierarchical regularized regression model enables integration of external meta-feature information directly into the modeling process for time-to-event outcomes and improves prediction performance when the external meta-feature data is informative. Importantly, when the external meta-features are uninformative, the prediction performance based on the regularized hierarchical model is on par with standard regularized Cox regression, indicating robustness of the framework. In addition to developing predictive signatures, the model can also be deployed in discovery applications where the main goal is to identify important features associated with the outcome rather than developing a predictive model.
... Clinical and demographic characteristics of our study population are summarized in Table 1, and there were no statistically significant differences in clinical variables between the training and validation sets (p > 0.05). LASSO incorporates an L1 regularization penalty term, which limits the magnitude of the regression coefficients [21]. Thus, LASSO regression was used to screen acoustic features in the training set. ...
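A generic illustration of L1-based screening of this kind (the data and features below are placeholders, not the study's acoustic features): an L1-penalized logistic regression shrinks uninformative coefficients to exactly zero, and the surviving features form the screened set carried forward into downstream modeling.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Placeholder data standing in for acoustic features in a training set.
X, y = make_classification(n_samples=260, n_features=60, n_informative=10,
                           random_state=0)
X = StandardScaler().fit_transform(X)

# The L1 penalty drives most coefficients to exactly zero; the remaining
# features are the screened subset.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
lasso.fit(X, y)

selected = np.flatnonzero(lasso.coef_.ravel())
print(f"{selected.size} of {X.shape[1]} features retained:", selected)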
Preprint
Full-text available
Objective Lung cancer has the highest incidence of all malignant tumors worldwide, and early diagnosis and treatment are crucial for improving patient survival rates. The aim of this study is to develop a nomogram based on acoustic and clinical features, providing clinical trial evidence for predicting lung cancer. Methods We reviewed the voice data and clinical data from 350 individuals: 189 pathologically confirmed lung cancer patients and 161 non-lung-cancer patients, which included 77 patients with benign pulmonary lesions and 84 healthy volunteers. First, acoustic features were extracted from all subjects, and optimal features were selected by least absolute shrinkage and selection operator (LASSO) regression. Subsequently, acoustic and clinical features were combined to build a nomogram for predicting lung cancer based on a multivariate logistic regression model. The performance of the nomogram was evaluated by the area under the receiver operating characteristic curve (AUC) and the calibration curve, the clinical utility was estimated by decision curve analysis (DCA), and a validation set was applied to confirm the predictive value of the nomogram. Results The acoustic-clinical nomogram model exhibited good diagnostic performance in the training set, achieving an AUC of 0.774, an accuracy of 0.701, a sensitivity of 0.693, and a specificity of 0.710. In the validation set, the nomogram attained an AUC of 0.714, an accuracy of 0.642, a sensitivity of 0.673, and a specificity of 0.611. The DCA curve demonstrated that the nomogram had good clinical usefulness. Conclusions The acoustic-clinical nomogram constructed in this study exhibited good discrimination, calibration, and clinical application value, providing a tool to predict lung cancer.
... The selectivity ratio method has been recently applied to the discovery of bioactive constituents in botanical extracts. Recently, our research group showed that Elastic Net (EN), a regularized regression model [16], was capable of correctly predicting the anti-inflammatory bioactive constituents in hop extracts utilizing high-resolution mass spectrometry m/z profiles of extract fractions [17]. ...
Article
Full-text available
Rapid screening of botanical extracts for the discovery of bioactive natural products was performed using a fractionation approach in conjunction with flow-injection high-resolution mass spectrometry for obtaining chemical fingerprints of each fraction, enabling the correlation of the relative abundance of molecular features (representing individual phytochemicals) with the read-outs of bioassays. We applied this strategy for discovering and identifying constituents of Centella asiatica (C. asiatica) that protect against Aβ cytotoxicity in vitro. C. asiatica has been associated with improving mental health and cognitive function, with potential use in Alzheimer’s disease. Human neuroblastoma MC65 cells were exposed to subfractions of an aqueous extract of C. asiatica to evaluate the protective benefit derived from these subfractions against amyloid β-cytotoxicity. The % viability score of the cells exposed to each subfraction was used in conjunction with the intensity of the molecular features in two computational models, namely Elastic Net and selectivity ratio, to determine the relationship of the peak intensity of molecular features with % viability. Finally, the correlation of mass spectral features with MC65 protection and their abundance in different sub-fractions were visualized using GNPS molecular networking. Both computational methods unequivocally identified dicaffeoylquinic acids as providing strong protection against Aβ-toxicity in MC65 cells, in agreement with the protective effects observed for these compounds in previous preclinical model studies.
... As a result, researchers have begun to explore methods of variable selection. Zou and Hastie (2005) [8] conducted a study on this topic, while Yuan and Lin (2006) [9] proposed the Group Lasso model, which utilizes the group structure between variables as prior information. One advantage of this model is that its objective function is a convex function of the unknown parameters, ensuring the existence of a unique global minimum. ...
Article
Full-text available
This paper presents a multi-algorithm fusion model (StackingGroup) based on the Stacking ensemble learning framework to address the variable selection problem in high-dimensional group structure data. The proposed algorithm takes into account the differences in data observation and training principles of different algorithms. It leverages the strengths of each model and incorporates Stacking ensemble learning with multiple group structure regularization methods. The main approach involves dividing the data set into K parts on average, using more than 10 algorithms as basic learning models, and selecting the base learners based on low correlation, strong prediction ability, and small model error. Finally, we selected the grSubset + grLasso, grLasso, and grSCAD algorithms as the base learners for the Stacking algorithm. The Lasso algorithm was used as the meta-learner to create a comprehensive algorithm called StackingGroup. This algorithm is designed to handle high-dimensional group structure data. Simulation experiments showed that the proposed method outperformed other prediction methods in terms of R², RMSE, and MAE. Lastly, we applied the proposed algorithm to investigate the risk factors of low birth weight in infants and young children. The final results demonstrate that the proposed method achieves a mean absolute error (MAE) of 0.508 and a root mean square error (RMSE) of 0.668. The obtained values are smaller than those obtained from a single model, indicating that the proposed method surpasses other algorithms in terms of prediction accuracy.
... Like wrapper methods, embedded methods frequently interact with the classifier, but they have better computational efficiency than wrapper methods and are also personalized to a specific classifier (Şahin & Kılıç, 2019). Feature Selection-Perceptron (FS-P) (Chakraborty & Pal, 2015), Support Vector Machines with recursive feature elimination (SVM-RFE) (Kari et al., 2018), and Lasso (L1) and Elastic Net (L1+L2) based models (Tibshirani, 1996; Zou & Hastie, 2005) are a few examples of embedded methods. This study focuses only on filter-based FS methods. ...
... We write (18) in the following form: ...
Article
Full-text available
Feature selection methods are widely used in machine learning tasks to reduce the dimensionality and improve the performance of the models. However, traditional feature selection methods based on regression often suffer from a lack of robustness and generalization ability and are easily affected by outliers in the data. To address this problem, we propose a robust feature selection method based on sparse regression. This method uses a non-square form of the L2,1 norm as both the loss function and regularization term, which can effectively enhance the model’s resistance to outliers and achieve feature selection simultaneously. Furthermore, to improve the model’s robustness and prevent overfitting, we add an elastic variable to the loss function. We design two efficient convergent iterative processes to solve the optimization problem based on the L2,1 norm and propose a robust joint sparse regression algorithm. Extensive experimental results on three public datasets show that our feature selection method outperforms other comparison methods.
... In this way, redundant or not useful predictors are removed, which should lead to a simpler and potentially better model. Finally, the Elastic Net regression (Zou and Hastie, 2005) is a method that combines both the L1 and L2 penalties used in the Lasso and Ridge regression models. ...
Article
Full-text available
Purpose Cancer survivors commonly report cognitive declines after cancer therapy. Due to the complex etiology of cancer-related cognitive decline (CRCD), predicting who will be at risk of CRCD remains a clinical challenge. We developed a model to predict breast cancer survivors who would experience CRCD after systematic treatment. Methods We used the Thinking and Living with Cancer study, a large ongoing multisite prospective study of older breast cancer survivors with complete assessments pre-systemic therapy, 12 months and 24 months after initiation of systemic therapy. Cognition was measured using neuropsychological testing of attention, processing speed, and executive function (APE). CRCD was defined as a 0.25 SD (of observed changes from baseline to 12 months in matched controls) decline or greater in APE score from baseline to 12 months (transient) or persistent as a decline 0.25 SD or greater sustained to 24 months. We used machine learning approaches to predict CRCD using baseline demographics, tumor characteristics and treatment, genotypes, comorbidity, and self-reported physical, psychosocial, and cognitive function. Results Thirty-two percent of survivors had transient cognitive decline, and 41% of these women experienced persistent decline. Prediction of CRCD was good: yielding an area under the curve of 0.75 and 0.79 for transient and persistent decline, respectively. Variables most informative in predicting CRCD included apolipoprotein E4 positivity, tumor HER2 positivity, obesity, cardiovascular comorbidities, more prescription medications, and higher baseline APE score. Conclusions Our proof-of-concept tool demonstrates our prediction models are potentially useful to predict risk of CRCD. Future research is needed to validate this approach for predicting CRCD in routine practice settings.
Article
Full-text available
Though tremendous advances have been made in the field of in vitro fertilization (IVF), a portion of patients are still affected by embryo implantation failure issues. One of the most significant factors contributing to implantation failure is a uterine condition called displaced window of implantation (WOI), which refers to an unsynchronized endometrium and embryo transfer time for IVF patients. Previous studies have shown that microRNAs (miRNAs) can be important biomarkers in the reproductive process. In this study, we aim to develop a miRNA-based classifier to identify the WOI for optimal time for embryo transfer. A reproductive-related PanelChip® was used to obtain the miRNA expression profiles from the 200 patients who underwent IVF treatment. In total, 143 out of the 167 miRNAs with amplification signals across 90% of the expression profiles were utilized to build a miRNA-based classifier. The microRNA-based classifier identified the optimal timing for embryo transfer with an accuracy of 93.9%, a sensitivity of 85.3%, and a specificity of 92.4% in the training set, and an accuracy of 88.5% in the testing set, showing high promise in accurately identifying the WOI for the optimal timing for embryo transfer.
Article
Full-text available
A brain-computer interface (BCI) based on an electroencephalograph (EEG) establishes a new channel of communication between the human brain and a computer. Redundant, noisy, and irrelevant channels lead to high computational costs and poor classification accuracy. Therefore, an effective feature selection technique for determining the optimal number of channels can improve BCI’s performance. However, existing meta-heuristic algorithms are prone to get trapped in local optimum due to high dimensional dataset. Thus, to reduce dimension, solve inter subject variation and choose an optimal subset of channels, a novel framework called Component Loading followed by Clustering and Classification (CLCC) is proposed in this paper. This novel framework is further divided into two experiment configurations-CLCC with Feature Selection (CLCC-FS) and CLCC without Feature Selection (CLCC-WFS). All these frameworks have been implemented on a motor imagery (MI) EEG dataset of 10 subjects in order to choose the best subset of channels. Further, seven different classifiers have been employed to assess the performance. Experimental outcomes show that on comparing various feature selection techniques, our proposed algorithm i.e., CLCC-FS Opposition-Based Whale Optimization Algorithm (CLCC-FS(OBWOA)) performed substantially better than the other feature selection techniques. We demonstrate that the proposed algorithm is able to achieve 99.6% accuracy by using only few channels and can improve the practicality of the BCI system by reducing the computation cost.
Article
Full-text available
Background The impact of the gut microbiota on neuropsychiatric disorders has gained much attention in recent years; however, comprehensive data on the relationship between the gut microbiome and its metabolites and resistance to treatment for depression and anxiety is lacking. Here, we investigated intestinal metabolites in patients with depression and anxiety disorders, and their possible roles in treatment resistance. Results We analyzed fecal metabolites and microbiomes in 34 participants with depression and anxiety disorders. Fecal samples were obtained three times for each participant during the treatment. Propensity score matching led us to analyze data from nine treatment responders and nine non-responders, and the results were validated in the residual sample sets. Using elastic net regression analysis, we identified several metabolites, including N-ε-acetyllysine; baseline levels of the former were low in responders (AUC = 0.86; 95% confidence interval, 0.69–1). In addition, fecal levels of N-ε-acetyllysine were negatively associated with the abundance of Odoribacter. N-ε-acetyllysine levels increased as symptoms improved with treatment. Conclusion Fecal N-ε-acetyllysine levels before treatment may be a predictive biomarker of treatment-refractory depression and anxiety. Odoribacter may play a role in the homeostasis of intestinal L-lysine levels. More attention should be paid to the importance of L-lysine metabolism in those with depression and anxiety.
Article
Full-text available
Background Genome-wide association studies have successfully identified genetic variants associated with human disease. Various statistical approaches based on penalized and machine learning methods have recently been proposed for disease prediction. In this study, we evaluated the performance of several such methods for predicting asthma using the Korean Chip (KORV1.1) from the Korean Genome and Epidemiology Study (KoGES). Results First, single-nucleotide polymorphisms were selected via single-variant tests using logistic regression with adjustment for several epidemiological factors. Next, we evaluated the following methods for disease prediction: ridge, least absolute shrinkage and selection operator, elastic net, smoothly clipped absolute deviation, support vector machine, random forest, boosting, bagging, naïve Bayes, and k-nearest neighbor. Finally, we compared their predictive performance based on the area under the receiver operating characteristic curve, precision, recall, F1-score, Cohen's Kappa, balanced accuracy, error rate, Matthews correlation coefficient, and area under the precision-recall curve. Additionally, three oversampling algorithms were used to deal with imbalance problems. Conclusions Our results show that penalized methods exhibit better predictive performance for asthma than that achieved via machine learning methods. On the other hand, in the oversampling study, random forest and boosting methods overall showed better prediction performance than penalized methods.
Article
Full-text available
Spatial omics technologies can help identify spatially organized biological processes, but existing computational approaches often overlook structural dependencies in the data. Here, we introduce Smoother, a unified framework that integrates positional information into non-spatial models via modular priors and losses. In simulated and real datasets, Smoother enables accurate data imputation, cell-type deconvolution, and dimensionality reduction with remarkable efficiency. In colorectal cancer, Smoother-guided deconvolution reveals plasma cell and fibroblast subtype localizations linked to tumor microenvironment restructuring. Additionally, joint modeling of spatial and single-cell human prostate data with Smoother allows for spatial mapping of reference populations with significantly reduced ambiguity.
Article
Study Objectives Hypnograms contain a wealth of information and play an important role in sleep medicine. However, interpretation of the hypnogram is a difficult task and requires domain knowledge and “clinical intuition.” This study aimed to uncover which features of the hypnogram drive interpretation by physicians. In other words, make explicit which features physicians implicitly look for in hypnograms. Methods Three sleep experts evaluated up to 612 hypnograms, indicating normal or abnormal sleep structure and suspicion of disorders. ElasticNet and convolutional neural network classification models were trained to predict the collected expert evaluations using hypnogram features and stages as input. The models were evaluated using several measures, including accuracy, Cohen’s kappa, Matthew’s correlation coefficient, and confusion matrices. Finally, model coefficients and visual analytics techniques were used to interpret the models to associate hypnogram features with expert evaluation. Results Agreement between models and experts (Kappa between 0.47 and 0.52) is similar to agreement between experts (Kappa between 0.38 and 0.50). Sleep fragmentation, measured by transitions between sleep stages per hour, and sleep stage distribution were identified as important predictors for expert interpretation. Conclusions By comparing hypnograms not solely on an epoch-by-epoch basis, but also on these more specific features that are relevant for the evaluation of experts, performance assessment of (automatic) sleep-staging and surrogate sleep trackers may be improved. In particular, sleep fragmentation is a feature that deserves more attention as it is often not included in the PSG report, and existing (wearable) sleep trackers have shown relatively poor performance in this aspect.
Article
Towards the identification of genetic basis of complex traits, transcriptome-wide association study (TWAS) is successful in integrating transcriptome data. However, TWAS is only applicable for common variants, excluding rare variants in exome or whole genome sequences. This is partly because of the inherent limitation of TWAS protocols that rely on predicting gene expressions. Our previous research has revealed the insight into TWAS: the two steps in TWAS, building and applying the expression prediction models, are essentially genetic feature selection and aggregations that do not have to involve predictions. Based on this insight disentangling TWAS, rare variants’ inability of predicting expression traits is no longer an obstacle. Herein, we developed “rare variant TWAS”, or rvTWAS, that first uses a Bayesian model to conduct expression-directed feature selection and then uses a kernel machine to carry out feature aggregation, forming a model leveraging expressions for association mapping including rare variants. We demonstrated the performance of rvTWAS by thorough simulations and real data analysis in three psychiatric disorders, namely schizophrenia, bipolar disorder, and autism spectrum disorder. We confirmed that rvTWAS outperforms existing TWAS protocols and revealed additional genes underlying psychiatric disorders. Particularly, we formed a hypothetical mechanism in which zinc finger genes impact all three disorders through transcriptional regulations. rvTWAS will open a door for sequence-based association mappings integrating gene expressions.
Article
Background and aims Why only half of the idiopathic peripheral polyneuropathy (IPN) patients develop neuropathic pain is unknown. By conducting a proteomics analysis on IPN patients, we aimed to discover proteins and new pathways that are associated with neuropathic pain. Methods We conducted an unbiased mass-spectrometry proteomics analysis on blood plasma from 31 IPN patients with severe neuropathic pain and 29 IPN patients with no pain, to investigate protein biomarkers and protein-protein interactions associated with neuropathic pain. Univariate modeling was done with Linear Mixed Modeling (LMM) and corrected for multiple testing. Multivariate modeling was performed using elastic net analysis and validated with internal cross-validation and bootstrapping. Results In the univariate analysis, 73 proteins showed a p-value <0.05 and 12 proteins showed a p-value <0.01. None were significant after Benjamini-Hochberg adjustment for multiple testing. Elastic net analysis created a model containing 12 proteins with reasonable discriminatory power to differentiate between painful and painless IPN (false negative rate 0.10, false positive rate 0.18, and an area under the curve of 0.75). Eight of these 12 proteins were clustered into one interaction network, significantly enriched for the complement and coagulation pathway (Benjamini-Hochberg adjusted p-value = 0.0057), with Complement Component 3 (C3) as the central node. Bootstrap validation identified insulin-like growth factor-binding protein 2 (IGFBP2), complement factor H-related protein 4 (CFHR4), and ferritin light chain (FTL) as the most discriminatory proteins of the original 12 identified. Interpretation This proteomics analysis suggests a role for the complement system in neuropathic pain in IPN.
Article
Full-text available
Online portfolio optimization with transaction costs is a major challenge in the large-scale intelligent computing community, owing to undersampling from a rapidly changing market and the complexity introduced by varying transaction costs. In this paper, we focus on this problem and solve it with a machine learning system. Specifically, we reformulate the optimization problem as a minimization over the simplex involving three terms: the negative expected return, an elastic net regularization term controlling transaction costs, and the portfolio variable. We propose to apply the linearized augmented Lagrangian method (LALM) and the alternating direction method of multipliers (ADMM) to solve the optimization model with higher efficiency, while theoretically guaranteeing their convergence and deducing closed-form solutions of their subproblems in each iteration. Furthermore, we conduct extensive experiments on five benchmark datasets from real markets to demonstrate that the proposed algorithms outperform state-of-the-art strategies in most cases across six dimensions.
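Schematically, and with notation chosen here rather than taken from the paper, an elastic-net-regularized rebalancing objective over the simplex has the form below, where w_t is the current portfolio, μ the expected returns, and the penalties discourage large (transaction-cost-incurring) changes; the paper's exact formulation and its LALM/ADMM splitting differ in the details:

    \min_{w \in \Delta} \; -\mu^{\top} w + \lambda_1 \lVert w - w_t \rVert_1 + \frac{\lambda_2}{2} \lVert w - w_t \rVert_2^2 ,
    \qquad
    \Delta = \left\{ w \in \mathbb{R}^n : w \ge 0, \; \textstyle\sum_i w_i = 1 \right\} .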
Article
Full-text available
In this article, we propose the optimization of the resolution of time-frequency atoms and the regularization of fitting models to obtain better representations of heart sound signals. This is done by evaluating the classification performance of deep learning (DL) networks in discriminating five heart valvular conditions based on a new class of time-frequency feature matrices derived from the models. We inspect several combinations of resolution and regularization, and the optimal one is the one that provides the highest performance. To this end, a fitting model is obtained based on a heart sound signal and an over-complete dictionary of Gabor atoms using elastic net regularization of linear models. We consider two different DL architectures, the first mainly consisting of a 1D convolutional neural network (CNN) layer and a long short-term memory (LSTM) layer, while the second is composed of 1D and 2D CNN layers followed by an LSTM layer. The networks are trained with two algorithms, namely stochastic gradient descent with momentum (SGDM) and adaptive moment (ADAM). Extensive experimentation has been conducted using a database containing heart sound signals of five heart valvular conditions. The best classification accuracy of 98.95% is achieved with the second architecture when trained with ADAM and feature matrices derived from optimal fitting models obtained with a Gabor dictionary consisting of atoms with high-time low-frequency resolution and imposing sparsity on the models.
Article
Full-text available
Atopic dermatitis (AD) is a skin disease that is heterogeneous both in terms of clinical manifestations and molecular profiles. It is increasingly recognized that AD is a systemic rather than a local disease and should be assessed in the context of whole-body pathophysiology. Here we show, via integrated RNA-sequencing of skin tissue and peripheral blood mononuclear cell (PBMC) samples along with clinical data from 115 AD patients and 14 matched healthy controls, that specific clinical presentations associate with matching differential molecular signatures. We establish a regression model based on transcriptome modules identified in weighted gene co-expression network analysis to extract molecular features associated with detailed clinical phenotypes of AD. The two main, qualitatively differential skin manifestations of AD, erythema and papulation are distinguished by differential immunological signatures. We further apply the regression model to a longitudinal dataset of 30 AD patients for personalized monitoring, highlighting patient heterogeneity in disease trajectories. The longitudinal features of blood tests and PBMC transcriptome modules identify three patient clusters which are aligned with clinical severity and reflect treatment history. Our approach thus serves as a framework for effective clinical investigation to gain a holistic view on the pathophysiology of complex human diseases.
Article
Objective Hepatocellular carcinoma (HCC) is one of the leading cancer types with increasing annual incidence and high mortality in the USA. MicroRNAs (miRNAs) have emerged as valuable prognostic indicators in cancer patients. To identify a miRNA signature predictive of survival in patients with HCC, we developed a machine learning-based HCC survival estimation method, HCCse, using the miRNA expression profiles of 122 patients with HCC. Methods The HCCse method was designed using an optimal feature selection algorithm incorporated with support vector regression. Results HCCse identified a robust miRNA signature consisting of 32 miRNAs and obtained a mean correlation coefficient (R) and mean absolute error (MAE) of 0.87 ± 0.02 and 0.73 years between the actual and estimated survival times of patients with HCC; and the jackknife test achieved an R and MAE of 0.73 and 0.97 years between actual and estimated survival times, respectively. The identified signature has seven prognostic miRNAs (hsa-miR-146a-3p, hsa-miR-200a-3p, hsa-miR-652-3p, hsa-miR-34a-3p, hsa-miR-132-5p, hsa-miR-1301-3p and hsa-miR-374b-3p) and four diagnostic miRNAs (hsa-miR-1301-3p, hsa-miR-17-5p, hsa-miR-34a-3p and hsa-miR-200a-3p). Notably, three of these miRNAs, hsa-miR-200a-3p, hsa-miR-1301-3p and hsa-miR-17-5p, also displayed association with tumor stage, further emphasizing their clinical relevance. Furthermore, we performed pathway enrichment analysis and found that the target genes of the identified miRNA signature were significantly enriched in the hepatitis B pathway, suggesting its potential involvement in HCC pathogenesis. Conclusions Our study developed HCCse, a machine learning-based method, to predict survival in HCC patients using miRNA expression profiles. We identified a robust miRNA signature of 32 miRNAs with prognostic and diagnostic value, highlighting their clinical relevance in HCC management and potential involvement in HCC pathogenesis.
Article
Full-text available
Background "Molecular signatures" or "gene-expression signatures" are used to model patients' clinically relevant information (e.g., prognosis, survival time) us-ing expression data from coexpressed genes. Signatures are a key feature in cancer research because they can provide insight into biological mechanisms and have po-tential diagnostic use. However, available methods to search for signatures fail to address key requirements of signatures and signature components, especially the discovery of tightly coexpressed sets of genes. Results We suggest a method with good predictive performance that follows from the biologically relevant features of signatures. After identifying a seed gene with good predictive abilities, we search for a group of genes that is highly correlated with the seed gene, shows tight coexpression, and has good predictive abilities; this set of genes is reduced to a signature component using using Principal Components Analysis. The process is repeated until no further component is found. We show that the suggested method can recover signatures present in the data, and has predictive performance comparable to state-of-the-art methods. The code (R with C++) is freely available under GNU GPL license.
Article
Full-text available
In this paper we study boosting methods from a new perspective. We build on recent work by Efron et al. to show that boosting approximately (and in some cases exactly) minimizes its loss criterion subject to an ℓ1 constraint and, in the separable case, converges to a solution that maximizes the minimal ℓ1-margin of the training data, as defined in the boosting literature. An interesting fundamental similarity between boosting and kernel support vector machines emerges, as both can be described as methods for regularized optimization in high-dimensional predictor space, using a computational trick to make the calculation practical, and converging to margin-maximizing solutions. While this statement describes SVMs exactly, it applies to boosting only approximately.
Article
Full-text available
Although cancer classification has improved over the past 30 years, there has been no general approach for identifying new cancer classes (class discovery) or for assigning tumors to known classes (class prediction). Here, a generic approach to cancer classification based on gene expression monitoring by DNA microarrays is described and applied to human acute leukemias as a test case. A class discovery procedure automatically discovered the distinction between acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) without previous knowledge of these classes. An automatically derived class predictor was able to determine the class of new leukemia cases. The results demonstrate the feasibility of cancer classification based solely on gene expression monitoring and suggest a general strategy for discovering and predicting cancer classes for other types of cancer, independent of previous biological knowledge.
Article
Full-text available
Prognostic and predictive factors are indispensable tools in the treatment of patients with neoplastic disease. For the most part, such factors rely on a few specific cell surface, histological, or gross pathologic features. Gene expression assays have the potential to supplement what were previously a few distinct features with many thousands of features. We have developed Bayesian regression models that provide predictive capability based on gene expression data derived from DNA microarray analysis of a series of primary breast cancer samples. These patterns have the capacity to discriminate breast tumors on the basis of estrogen receptor status and also on the categorized lymph node status. Importantly, we assess the utility and validity of such models in predicting the status of tumors in crossvalidation determinations. The practical value of such approaches relies on the ability not only to assess relative probabilities of clinical outcomes for future samples but also to provide an honest assessment of the uncertainties associated with such predictive classifications on the basis of the selection of gene subsets for each validation analysis. This latter point is of critical importance in the ability to apply these methodologies to clinical assessment of tumor phenotype.
Article
Full-text available
Classification of patient samples is an important aspect of cancer diagnosis and treatment. The support vector machine (SVM) has been successfully applied to microarray cancer diagnosis problems. However, one weakness of the SVM is that given a tumor sample, it only predicts a cancer class label but does not provide any estimate of the underlying probability. We propose penalized logistic regression (PLR) as an alternative to the SVM for the microarray cancer diagnosis problem. We show that when using the same set of genes, PLR and the SVM perform similarly in cancer classification, but PLR has the advantage of additionally providing an estimate of the underlying probability. Often a primary goal in microarray cancer diagnosis is to identify the genes responsible for the classification, rather than class prediction. We consider two gene selection methods in this paper, univariate ranking (UR) and recursive feature elimination (RFE). Empirical results indicate that PLR combined with RFE tends to select fewer genes than other methods and also performs well in both cross-validation and test samples. A fast algorithm for solving PLR is also described.
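A hedged sketch of the PLR + RFE combination discussed above is given below using scikit-learn's generic L2-penalized logistic regression and recursive feature elimination; the fast solver described in the paper is not reproduced, and the expression data are synthetic.

```python
# Sketch: penalized logistic regression with recursive feature elimination.
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.standard_normal((72, 2000))                    # samples x genes (synthetic)
y = (X[:, :10].sum(axis=1) + rng.standard_normal(72) > 0).astype(int)

plr = LogisticRegression(penalty="l2", C=0.1, max_iter=5000)
pipe = make_pipeline(
    StandardScaler(),
    RFE(plr, n_features_to_select=16, step=0.5),       # drop half the genes per pass
    LogisticRegression(penalty="l2", C=0.1, max_iter=5000),
)

acc = cross_val_score(pipe, X, y, cv=5).mean()
proba = pipe.fit(X, y).predict_proba(X)[:, 1]          # class probabilities, unlike the SVM
print(f"CV accuracy: {acc:.2f}")
print("first probabilities:", np.round(proba[:5], 2))
```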
Article
Full-text available
A variety of new procedures have been devised to handle the two-sample comparison (e.g., tumor versus normal tissue) of gene expression values as measured with microarrays. Such new methods are required in part because of some defining characteristics of microarray-based studies: (i) the very large number of genes contributing expression measures which far exceeds the number of samples (observations) available and (ii) the fact that by virtue of pathway/network relationships, the gene expression measures tend to be highly correlated. These concerns are exacerbated in the regression setting, where the objective is to relate gene expression, simultaneously for multiple genes, to some external outcome or phenotype. Correspondingly, several methods have been recently proposed for addressing these issues. We briefly critique some of these methods prior to a detailed evaluation of gene harvesting. This reveals that gene harvesting, without additional constraints, can yield artifactual solutions. Results obtained employing such constraints motivate the use of regularized regression procedures such as the lasso, least angle regression, and support vector machines. Model selection and solution multiplicity issues are also discussed. The methods are evaluated using a microarray-based study of cardiomyopathy in transgenic mice.
Article
Background: We propose a new method for supervised learning from gene expression data. We call it 'tree harvesting'. This technique starts with a hierarchical clustering of genes, then models the outcome variable as a sum of the average expression profiles of chosen clusters and their products. It can be applied to many different kinds of outcome measures such as censored survival times, or a response falling in two or more classes (for example, cancer classes). The method can discover genes that have strong effects on their own, and genes that interact with other genes. Results: We illustrate the method on data from a lymphoma study, and on a dataset containing samples from eight different cancers. It identified some potentially interesting gene clusters. In simulation studies we found that the procedure may require a large number of experimental samples to successfully discover interactions. Conclusions: Tree harvesting is a potentially useful tool for exploration of gene expression data and identification of interesting clusters of genes worthy of further investigation.
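A loose sketch of the tree-harvesting idea follows: hierarchically cluster the genes, average expression within clusters, then regress the outcome on a small set of cluster profiles. Interaction products are omitted for brevity, and the cut height, cluster structure, and data are placeholders rather than the authors' settings.

```python
# Sketch: cluster genes, then regress the outcome on cluster average profiles.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
latent = rng.standard_normal((50, 20))                       # 20 hidden gene programs
X = np.repeat(latent, 20, axis=1) + 0.3 * rng.standard_normal((50, 400))
y = latent[:, 0] + 0.5 * rng.standard_normal(50)             # outcome driven by one program

# Hierarchical clustering of genes (columns) with 1 - |correlation| as distance.
corr_dist = 1 - np.abs(np.corrcoef(X.T))
np.fill_diagonal(corr_dist, 0.0)
Z = linkage(squareform(corr_dist, checks=False), method="average")
labels = fcluster(Z, t=0.7, criterion="distance")

# One predictor per cluster: the cluster's average expression profile.
profiles = np.column_stack([X[:, labels == c].mean(axis=1) for c in np.unique(labels)])
fit = LinearRegression().fit(profiles, y)
print("clusters used as predictors:", profiles.shape[1],
      "in-sample R^2:", round(fit.score(profiles, y), 2))
```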
Article
Microarray experiments generate large datasets with expression values for thousands of genes, but not more than a few dozens of samples. A challenging task with these data is to reveal groups of genes which act together and whose collective expression is strongly associated with an outcome variable of interest. To find these groups, we suggest the use of supervised algorithms: these are procedures which use external information about the response variable for grouping the genes. We present Pelora, an algorithm based on penalized logistic regression analysis, that combines gene selection, gene grouping and sample classification in a supervised, simultaneous way. With an empirical study on six different microarray datasets, we show that Pelora identifies gene groups whose expression centroids have very good predictive potential and yield results that can keep up with state-of-the-art classification methods based on single genes. Thus, our gene groups can be beneficial in medical diagnostics and prognostics, but they may also provide more biological insights into gene function and regulation.
Article
Bridge regression, a special family of penalized regressions with penalty function Σ|βj|^γ, γ ≥ 1, is considered. A general approach to solve for the bridge estimator is developed. A new algorithm for the lasso (γ = 1) is obtained by studying the structure of the bridge estimators. The shrinkage parameter γ and the tuning parameter λ are selected via generalized cross-validation (GCV). Comparison between the bridge model (γ ≥ 1) and several other shrinkage models, namely ordinary least squares regression (λ = 0), the lasso (γ = 1) and ridge regression (γ = 2), is made through a simulation study. It is shown that bridge regression performs well compared to the lasso and ridge regression. These methods are demonstrated through an analysis of prostate cancer data. Some computational advantages and limitations are discussed.
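As an illustrative sketch only (not the paper's algorithm), note that for 1 < γ ≤ 2 the bridge penalty is differentiable everywhere, so the penalized least-squares objective can be approximately minimized by plain gradient descent; the step size, λ, and γ below are arbitrary choices.

```python
# Sketch: bridge regression ||y - Xb||^2 + lam * sum(|b|**gamma) by gradient descent.
import numpy as np

def bridge_regression(X, y, lam=1.0, gamma=1.5, lr=1e-3, n_iter=20000):
    """Gradient descent on the bridge-penalized least-squares objective (1 < gamma <= 2)."""
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad_loss = -2 * X.T @ (y - X @ b)
        grad_pen = lam * gamma * np.sign(b) * np.abs(b) ** (gamma - 1)
        b -= lr * (grad_loss + grad_pen)
    return b

rng = np.random.default_rng(5)
X = rng.standard_normal((100, 8))
beta_true = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
y = X @ beta_true + rng.standard_normal(100)

print(np.round(bridge_regression(X, y, lam=5.0, gamma=1.5), 2))
```

In practice λ and γ would be chosen by GCV or cross-validation rather than fixed by hand, and γ = 1 (the lasso) needs a non-smooth solver.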
Article
This discussion concerns the following papers: W. Jiang [Process consistency for AdaBoost. ibid., 13–29 (2004; Zbl 1105.62316)]; G. Lugosi and N. Vayatis [On the Bayes-risk consistency of regularized boosting methods. ibid., 30–55 (2004; Zbl 1105.62319)]; and T. Zhang [Statistical behavior and consistency of classification methods based on convex risk minimization. ibid., 56–85 (2004; Zbl 1105.62323)].
Article
Linear and quadratic discriminant analysis are considered in the small-sample, high-dimensional setting. Alternatives to the usual maximum likelihood (plug-in) estimates for the covariance matrices are proposed. These alternatives are characterized by two parameters, the values of which are customized to individual situations by jointly minimizing a sample-based estimate of future misclassification risk. Computationally fast implementations are presented, and the efficacy of the approach is examined through simulation studies and application to data. These studies indicate that in many circumstances dramatic gains in classification accuracy can be achieved.
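The flavor of the approach can be sketched with scikit-learn's related covariance regularizers, a shrinkage parameter for LDA and a ridge-like reg_param for QDA, tuned by cross-validated accuracy; this is not Friedman's exact two-parameter RDA, and the small-sample data are synthetic.

```python
# Sketch: regularized LDA/QDA with the regularization strength chosen by CV.
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0.0, 1.0, (30, 20)), rng.normal(0.7, 1.5, (30, 20))])
y = np.repeat([0, 1], 30)                       # small n relative to dimension

lda = GridSearchCV(
    LinearDiscriminantAnalysis(solver="lsqr"),
    {"shrinkage": np.linspace(0.0, 1.0, 11)}, cv=5).fit(X, y)
qda = GridSearchCV(
    QuadraticDiscriminantAnalysis(),
    {"reg_param": np.linspace(0.0, 1.0, 11)}, cv=5).fit(X, y)

print("best LDA shrinkage:", lda.best_params_, round(lda.best_score_, 2))
print("best QDA reg_param:", qda.best_params_, round(qda.best_score_, 2))
```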
Article
Chemometrics is a field of chemistry that studies the application of statistical methods to chemical data analysis. In addition to borrowing many techniques from the statistics and engineering literatures, chemometrics itself has given rise to several new data-analytical methods. This article examines two methods commonly used in chemometrics for predictive modeling—partial least squares and principal components regression—from a statistical perspective. The goal is to try to understand their apparent successes and in what situations they can be expected to work well and to compare them with other statistical methods intended for those situations. These methods include ordinary least squares, variable subset selection, and ridge regression.
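For a quick side-by-side of the two chemometrics workhorses discussed above, the sketch below compares principal components regression and partial least squares on synthetic collinear data; the latent structure and component counts are illustrative.

```python
# Sketch: principal components regression (PCR) vs. partial least squares (PLS).
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(7)
latent = rng.standard_normal((80, 3))                       # 3 latent factors
X = latent @ rng.standard_normal((3, 50)) + 0.1 * rng.standard_normal((80, 50))
y = latent @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.standard_normal(80)

pcr = make_pipeline(PCA(n_components=3), LinearRegression())
pls = PLSRegression(n_components=3)

print("PCR CV R^2:", round(cross_val_score(pcr, X, y, cv=5).mean(), 2))
print("PLS CV R^2:", round(cross_val_score(pls, X, y, cv=5).mean(), 2))
```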
Article
  We propose the elastic net, a new regularization and variable selection method. Real world data and a simulation study show that the elastic net often outperforms the lasso, while enjoying a similar sparsity of representation. In addition, the elastic net encourages a grouping effect, where strongly correlated predictors tend to be in or out of the model together. The elastic net is particularly useful when the number of predictors (p) is much bigger than the number of observations (n). By contrast, the lasso is not a very satisfactory variable selection method in the p≫n case. An algorithm called LARS-EN is proposed for computing elastic net regularization paths efficiently, much like algorithm LARS does for the lasso.
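A minimal usage sketch of the elastic net in the p ≫ n setting with a group of highly correlated predictors is shown below; it relies on scikit-learn's implementation rather than the LARS-EN algorithm, and all data are synthetic.

```python
# Sketch: elastic net vs. lasso on correlated predictors with p >> n.
import numpy as np
from sklearn.linear_model import ElasticNetCV, LassoCV

rng = np.random.default_rng(8)
n, p = 50, 200
z = rng.standard_normal((n, 1))
X = rng.standard_normal((n, p))
X[:, :5] = z + 0.05 * rng.standard_normal((n, 5))   # a tight group of 5 correlated predictors
y = 3 * z.ravel() + rng.standard_normal(n)

enet = ElasticNetCV(l1_ratio=[0.3, 0.5, 0.7, 0.9], cv=5).fit(X, y)
lasso = LassoCV(cv=5).fit(X, y)

# The elastic net tends to keep the whole correlated group in or out together;
# the lasso tends to pick only one member of the group.
print("group coefficients (enet): ", np.round(enet.coef_[:5], 2))
print("group coefficients (lasso):", np.round(lasso.coef_[:5], 2))
```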
Article
DNA micro-arrays now permit scientists to screen thousands of genes simultaneously and determine whether those genes are active, hyperactive or silent in normal or cancerous tissue. Because these new micro-array devices generate bewildering amounts of raw data, new analytical methods must be developed to sort out whether cancer tissues have distinctive signatures of gene expression over normal tissues or other types of cancer tissues. In this paper, we address the problem of selection of a small subset of genes from broad patterns of gene expression data, recorded on DNA micro-arrays. Using available training examples from cancer and normal patients, we build a classifier suitable for genetic diagnosis, as well as drug discovery. Previous attempts to address this problem select genes with correlation techniques. We propose a new method of gene selection utilizing Support Vector Machine methods based on Recursive Feature Elimination (RFE). We demonstrate experimentally that the genes selected by our techniques yield better classification performance and are biologically relevant to cancer. In contrast with the baseline method, our method eliminates gene redundancy automatically and yields better and more compact gene subsets. In patients with leukemia our method discovered 2 genes that yield zero leave-one-out error, while 64 genes are necessary for the baseline method to get the best result (one leave-one-out error). In the colon cancer database, using only 4 genes our method is 98% accurate, while the baseline method is only 86% accurate.
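A hedged sketch of SVM-based recursive feature elimination with scikit-learn is shown below, in the spirit of the procedure above; the gene-expression matrix and the target signature size are synthetic placeholders.

```python
# Sketch: SVM-RFE gene selection followed by a linear SVM classifier.
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(9)
X = rng.standard_normal((62, 2000))                     # samples x genes (synthetic)
y = (X[:, :4].sum(axis=1) + 0.5 * rng.standard_normal(62) > 0).astype(int)

svm = SVC(kernel="linear", C=1.0)
pipe = make_pipeline(
    StandardScaler(),
    RFE(svm, n_features_to_select=4, step=0.5),         # drop half the genes per pass
    SVC(kernel="linear", C=1.0),
)
print("CV accuracy:", round(cross_val_score(pipe, X, y, cv=5).mean(), 2))
```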
Article
In model selection, usually a "best" predictor is chosen from a collection $\{\hat{\mu}(\cdot, s)\}$ of predictors, where $\hat{\mu}(\cdot, s)$ is the minimum least-squares predictor in a collection $\mathsf{U}_s$ of predictors. Here s is a complexity parameter; that is, the smaller s, the lower dimensional/smoother the models in $\mathsf{U}_s$. If $\mathsf{L}$ is the data used to derive the sequence $\{\hat{\mu}(\cdot, s)\}$, the procedure is called unstable if a small change in $\mathsf{L}$ can cause large changes in $\{\hat{\mu}(\cdot, s)\}$. With a crystal ball, one could pick the predictor in $\{\hat{\mu}(\cdot, s)\}$ having minimum prediction error. Without prescience, one uses test sets, cross-validation and so forth. The difference in prediction error between the crystal ball selection and the statistician's choice we call predictive loss. For an unstable procedure the predictive loss is large. This is shown by some analytics in a simple case and by simulation results in a more complex comparison of four different linear regression methods. Unstable procedures can be stabilized by perturbing the data, getting a new predictor sequence $\{\hat{\mu}'(\cdot, s)\}$ and then averaging over many such predictor sequences.
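A toy sketch of the stabilization idea, perturbing the data (here by bootstrap resampling), refitting an unstable selection procedure such as the lasso, and averaging the resulting predictors, is given below; all settings and data are illustrative.

```python
# Sketch: stabilizing an unstable procedure by perturb-and-average.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(11)
n, p = 60, 30
X = rng.standard_normal((n, p))
y = X[:, :3] @ np.array([2.0, -1.0, 1.5]) + rng.standard_normal(n)
X_new = rng.standard_normal((10, p))              # points at which to predict

preds = []
for _ in range(50):
    idx = rng.integers(0, n, size=n)              # a perturbed copy of the data
    model = Lasso(alpha=0.1).fit(X[idx], y[idx])
    preds.append(model.predict(X_new))

stabilized = np.mean(preds, axis=0)               # averaged predictor
print(np.round(stabilized, 2))
```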
Article
Serum prostate specific antigen was determined (Yang polyclonal radioimmunoassay) in 102 men before hospitalization for radical prostatectomy. Prostate specimens were subjected to detailed histological and morphometric analysis. Levels of prostate specific antigen were significantly different between patients with and without a Gleason score of 7 or greater (p < 0.001), capsular penetration greater than 1 cm. in linear extent (p < 0.001), seminal vesicle invasion (p < 0.001) and pelvic lymph node metastasis (p < 0.005). Prostate specific antigen was strongly correlated with volume of prostate cancer (r = 0.70). Bivariate and multivariate analyses indicate that cancer volume is the primary determinant of serum prostate specific antigen levels. Prostate specific antigen was elevated 3.5 ng. per ml. for every cc of cancer, a level at least 10 times that observed for benign prostatic hyperplasia.
Article
We propose a new method for supervised learning from gene expression data. We call it 'tree harvesting'. This technique starts with a hierarchical clustering of genes, then models the outcome variable as a sum of the average expression profiles of chosen clusters and their products. It can be applied to many different kinds of outcome measures such as censored survival times, or a response falling in two or more classes (for example, cancer classes). The method can discover genes that have strong effects on their own, and genes that interact with other genes. We illustrate the method on data from a lymphoma study, and on a dataset containing samples from eight different cancers. It identified some potentially interesting gene clusters. In simulation studies we found that the procedure may require a large number of experimental samples to successfully discover interactions. Tree harvesting is a potentially useful tool for exploration of gene expression data and identification of interesting clusters of genes worthy of further investigation.
Article
Microarrays can measure the expression of thousands of genes to identify changes in expression between different biological states. Methods are needed to determine the significance of these changes while accounting for the enormous number of genes. We describe a method, Significance Analysis of Microarrays (SAM), that assigns a score to each gene on the basis of change in gene expression relative to the standard deviation of repeated measurements. For genes with scores greater than an adjustable threshold, SAM uses permutations of the repeated measurements to estimate the percentage of genes identified by chance, the false discovery rate (FDR). When the transcriptional response of human cells to ionizing radiation was measured by microarrays, SAM identified 34 genes that changed at least 1.5-fold with an estimated FDR of 12%, compared with FDRs of 60 and 84% by using conventional methods of analysis. Of the 34 genes, 19 were involved in cell cycle regulation and 3 in apoptosis. Surprisingly, four nucleotide excision repair genes were induced, suggesting that this repair pathway for UV-damaged DNA might play a previously unrecognized role in repairing DNA damaged by ionizing radiation.
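The core SAM mechanics, a moderated per-gene statistic plus label permutations to estimate the number of genes called by chance, can be sketched as below; the fudge constant s0, the threshold, and the synthetic expression matrix are illustrative rather than SAM's actual tuning.

```python
# Sketch: SAM-style scores and a permutation-based false discovery rate estimate.
import numpy as np

def sam_scores(X, labels, s0=0.1):
    """X: genes x samples, labels: 0/1 group membership per sample."""
    a, b = X[:, labels == 0], X[:, labels == 1]
    diff = b.mean(axis=1) - a.mean(axis=1)
    se = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1] + b.var(axis=1, ddof=1) / b.shape[1])
    return diff / (se + s0)                      # s0 damps genes with tiny variance

rng = np.random.default_rng(10)
X = rng.standard_normal((5000, 20))
labels = np.repeat([0, 1], 10)
X[:50, labels == 1] += 1.0                       # 50 truly changed genes

obs = sam_scores(X, labels)
threshold = 3.0
called = np.abs(obs) > threshold

# Permute the labels to estimate how many genes exceed the threshold by chance.
perm_counts = [np.sum(np.abs(sam_scores(X, rng.permutation(labels))) > threshold)
               for _ in range(200)]
fdr = np.median(perm_counts) / max(called.sum(), 1)
print(f"genes called: {called.sum()}, estimated FDR: {fdr:.2%}")
```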
Article
Variable selection is fundamental to high-dimensional statistical modeling, including nonparametric regression. Many approaches in use are stepwise selection procedures, which can be computationally expensive and ignore stochastic errors in the variable selection process. In this article, penalized likelihood approaches are proposed to handle these kinds of problems. The proposed methods select variables and estimate coefficients simultaneously. Hence they enable us to construct confidence intervals for estimated parameters. The proposed approaches are distinguished from others in that the penalty functions are symmetric, nonconcave on (0, ∞), and have singularities at the origin to produce sparse solutions. Furthermore, the penalty functions should be bounded by a constant to reduce bias and satisfy certain conditions to yield continuous solutions. A new algorithm is proposed for optimizing penalized likelihood functions. The proposed ideas are widely applicable. They are readily applied to a variety of parametric models such as generalized linear models and robust regression models. They can also be applied easily to nonparametric modeling by using wavelets and splines. Rates of convergence of the proposed penalized likelihood estimators are established. Furthermore, with proper choice of regularization parameters, we show that the proposed estimators perform as well as the oracle procedure in variable selection; namely, they work as well as if the correct submodel were known. Our simulation shows that the newly proposed methods compare favorably with other variable selection techniques. Furthermore, the standard error formulas are tested to be accurate enough for practical applications.
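One concrete instance of such a penalty is SCAD; its thresholding rule (the penalized least-squares solution for a single coefficient under an orthonormal design, with the conventional a = 3.7) is sketched below to show how small coefficients are set to zero while large ones pass through nearly unbiased.

```python
# Sketch: the SCAD thresholding rule applied elementwise.
import numpy as np

def scad_threshold(z, lam, a=3.7):
    """Elementwise SCAD thresholding of z with penalty level lam."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    small = np.abs(z) <= 2 * lam
    mid = (np.abs(z) > 2 * lam) & (np.abs(z) <= a * lam)
    large = np.abs(z) > a * lam
    out[small] = np.sign(z[small]) * np.maximum(np.abs(z[small]) - lam, 0.0)  # soft-threshold
    out[mid] = ((a - 1) * z[mid] - np.sign(z[mid]) * a * lam) / (a - 2)       # tapered shrinkage
    out[large] = z[large]                                                     # no shrinkage
    return out

z = np.array([-4.0, -1.5, -0.3, 0.2, 1.2, 2.5, 5.0])
print(scad_threshold(z, lam=1.0))
# Small inputs are zeroed (sparsity); large inputs are untouched (reduced bias),
# unlike soft-thresholding, which shrinks every coefficient by lam.
```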
Article
The purpose of model selection algorithms such as All Subsets, Forward Selection, and Backward Elimination is to choose a linear model on the basis of the same set of data to which the model will be applied. Typically we have available a large collection of possible covariates from which we hope to select a parsimonious set for the efficient prediction of a response variable. Least Angle Regression ("LARS"), a new model selection algorithm, is a useful and less greedy version of traditional forward selection methods.
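A small sketch of tracing a LARS path with scikit-learn on the diabetes data is shown below, only to illustrate how the full sequence of solutions can be computed; the dataset and settings are just a convenient example.

```python
# Sketch: computing the least angle regression path with scikit-learn.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import lars_path

X, y = load_diabetes(return_X_y=True)

# method="lar" gives the least angle regression path;
# method="lasso" gives the lasso path via the LARS modification.
alphas, active, coefs = lars_path(X, y, method="lar")

print("number of path breakpoints:", len(alphas))
print("active set, in order of entry:", active)
print("final (least squares) coefficients:", np.round(coefs[:, -1], 1))
```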
Article
The title Lasso has been suggested by Tibshirani [7] as a colourful name for a technique of variable selection which requires the minimization of a sum of squares subject to an ℓ1 bound κ on the solution. This forces zero components in the minimizing solution for small values of κ. Thus this bound can function as a selection parameter. This paper makes two contributions to computational problems associated with implementing the Lasso: (1) a compact descent method for solving the constrained problem for a particular value of κ is formulated, and (2) a homotopy method, in which the constraint bound κ becomes the homotopy parameter, is developed to completely describe the possible selection regimes. Both algorithms have a finite termination property.
Article
We study how close the optimal Bayes error rate can be approximately reached using a classification algorithm that computes a classifier by minimizing a convex upper bound of the classification error function. The measurement of closeness is characterized by the loss function used in the estimation. We show that such a classification scheme can be generally regarded as a (non maximum-likelihood) conditional in-class probability estimate, and we use this analysis to compare various convex loss functions that have appeared in the literature. Furthermore, the theoretical insight allows us to design good loss functions with desirable properties. Another aspect of our analysis is to demonstrate the consistency of certain classification methods using convex risk minimization.