Article

Regularization and variable selection via the elastic net (vol B 67, pg 301, 2005)

Authors: Hui Zou, Trevor Hastie

... By using machine learning approaches, data can be used to produce insights that may lead to inductive theory building (Choudhury et al., 2018). For example, to obtain the best predictive model of a construct, machine learning methods (e.g., Elastic Net) employ algorithmic induction only from the data (Zou & Hastie, 2005). We suggest that PISA large-scale assessment data be used to either refine existing theories or to generate new theories by employing a grounded theory approach that utilizes machine learning algorithms (e.g., Elastic Net). ...
... To address our research questions, we firstly applied the Elastic Net (ENET) approach (Zou & Hastie, 2005), which is a theoretically agnostic variable selection method in machine learning designed to select the most statistically salient predictors of students' reading self-concept at the student, teacher, and school levels. After screening and reviewing the ENET identified variables based on theory and relevant literature, we examined the variables that were identified via ENET analysis by conducting a two-level multilevel modeling analysis with the entire United States' student sample as the reference model and then investigated the reference model's generalizability for emergent bilinguals. ...
... ENET is a commonly used feature/variable selection approach within machine learning for datasets in which the number of predictors is significantly large. Developed by Zou and Hastie (2005), this approach combines two other penalized or regularized regression approaches, the L1-norm penalty (lasso) and the L2-norm penalty (ridge), to select the best-performing predictive model. Implementation of ENET removes the limitations that arise when lasso and ridge are used separately by (a) performing continuous shrinkage and automatic variable selection (Zou & Hastie, 2005), (b) encouraging the grouping effect (Zhou, 2013) by allowing the models to select a group of highly correlated variables, and (c) removing the limitation regarding the number of selected variables (Wei et al., 2019). ...
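For reference, the (naive) elastic net criterion that these excerpts describe, in the notation of Zou and Hastie (2005), combines the two penalties in a single objective; the published paper additionally rescales the resulting coefficients by (1 + λ2):

    \hat{\beta} = \arg\min_{\beta}\; \lVert y - X\beta \rVert_2^2 + \lambda_2 \lVert \beta \rVert_2^2 + \lambda_1 \lVert \beta \rVert_1 ,
    \qquad
    \lVert \beta \rVert_1 = \sum_{j=1}^{p} \lvert \beta_j \rvert , \quad
    \lVert \beta \rVert_2^2 = \sum_{j=1}^{p} \beta_j^2 .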
Article
Decades of research have indicated that reading self-concept is an important predictor of reading achievement. During this period, the population of emergent bilinguals has continued to increase within United States' schools. However, the existing literature has tended to examine native English speakers' and emergent bilinguals' reading self-concept in the aggregate, thereby potentially obfuscating the unique pathways through which reading self-concept predicts reading achievement. Furthermore, due to the overreliance on native English speakers in samples used for theory development, researchers attempting to examine predictors of reading achievement may a priori select variables that are more aligned with native English speakers' experiences. To address this issue, we adopted Elastic Net, a theoretically agnostic machine learning approach to variable selection, to identify the proximal and distal predictors of reading self-concept for the entire population; in our study, participants from the United States who participated in PISA 2018 served as the baseline group to determine significant predictors of reading self-concept with the intent of identifying potential new directions for future researchers. Based on Elastic Net analysis, 20 variables at the student level, three variables at the teacher level, and 12 variables at the school level were identified as the most salient predictors of reading self-concept. We then utilized a multilevel modeling approach to test model generalizability of the identified predictors of reading self-concept for emergent bilinguals and native English speakers. We disaggregated and compared findings for both emergent bilinguals and native English speakers. Our results indicate that although some predictors were important for both groups (e.g., perceived information and communications technologies competence), other predictors were not (e.g., competitiveness). Suggestions for future directions and implications of the present study are examined.
... N(0, σ²). To deal with the instability, new approaches such as the Nonnegative Garrote [6], LASSO [11], SCAD [4], LARS [3] and Elastic Net [16] have been proposed. Through continuous penalization and automated variable selection, these approaches allow us to increase both model interpretability and prediction accuracy. ...
... In this section, we compare the performance of the proposed MAVE-SiER method with three related methods on simulated data. The first method, the elastic net [16], can select groups of correlated variables and overcomes the difficulty of p > n. The elastic net is based on a combination of the ridge (L2) and lasso (L1) penalties. ...
Article
Full-text available
In this paper, a new sparse method called MAVE-SiER is proposed. To construct MAVE-SiER, we combine the effective sufficient dimension reduction method MAVE with the sparse signal extraction approach to multivariate regression (SiER). MAVE-SiER has the benefit of extending SiER to nonlinear and multi-dimensional regression. MAVE-SiER also allows MAVE to deal with problems in which the predictors are highly correlated. MAVE-SiER may estimate dimensions exhaustively while concurrently choosing useful variables. Simulation studies confirmed MAVE-SiER's performance.
... As such, in this study, we focus on examining the best set of SNP features suitable for use in the prediction of CYP2D6-associated CpG methylation levels using CYP2D6 SNP genotypes after various feature selection methods. We also compare the performance of two ML algorithms, Elastic Net (Zou and Hastie, 2005) and eXtreme Gradient Boosting (XGBoost; Chen and Guestrin, 2016) and investigate their performance and prediction accuracy with regard to the different SNP feature sets to identify the optimal method for SNP feature selection. ...
... Elastic Net performing comparably to XGBoost may be due to its ability to handle features with collinearity. The grouping effect of Elastic Net groups highly correlated variables together and either drops or retains all of them together (Zou and Hastie, 2005). ...
Article
Full-text available
Introduction Pharmacogenetics currently supports clinical decision-making on the basis of a limited number of variants in a few genes and may benefit paediatric prescribing where there is a need for more precise dosing. Integrating genomic information such as methylation into pharmacogenetic models holds the potential to improve their accuracy and consequently prescribing decisions. Cytochrome P450 2D6 (CYP2D6) is a highly polymorphic gene conventionally associated with the metabolism of commonly used drugs and endogenous substrates. We thus sought to predict epigenetic loci from single nucleotide polymorphisms (SNPs) related to CYP2D6 in children from the GUSTO cohort. Methods Buffy coat DNA methylation was quantified using the Illumina Infinium Methylation EPIC beadchip. CpG sites associated with CYP2D6 were used as outcome variables in Linear Regression, Elastic Net and XGBoost models. We compared feature selection of SNPs from GWAS mQTLs, GTEx eQTLs and SNPs within 2 Mb of the CYP2D6 gene and the impact of adding demographic data. The samples were split into training (75%) sets and test (25%) sets for validation. In the Elastic Net and XGBoost models, optimal hyperparameter search was done using 10-fold cross-validation. Root Mean Square Error and R-squared values were obtained to investigate each model's performance. When GWAS was performed to determine SNPs associated with CpG sites, a total of 15 SNPs were identified, where several SNPs appeared to influence multiple CpG sites. Results Overall, Elastic Net models of genetic features appeared to perform marginally better than heritability estimates and substantially better than Linear Regression and XGBoost models. The addition of nongenetic features appeared to improve performance for some but not all feature sets and probes. The best feature set and Machine Learning (ML) approach differed substantially between CpG sites, and a number of top variables were identified for each model. Discussion The development of SNP-based prediction models for CYP2D6 CpG methylation in Singaporean children of varying ethnicities in this study has clinical application. With further validation, they may add to the set of tools available to improve precision medicine and pharmacogenetics-based dosing.
... We substitute equation (34) into equation (38) and simplify: min ...
... λ1 is used to control the sparsity of the linear expression weights, and λ2 is used to control the degree of association between the linear expression weight vector w and the similarity vector s. Following [13,38,39], the grid search range for these two parameters is {0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10}. σ is used to control the bandwidth of the Gaussian kernel; following [40] and Scott's rule [41], we set its grid search range to {0.1, 0.5, 1, 5, 10}. ϕ is used to control the similarity of the elements w_ij and s_ij, and we set its search range to {0.005, 0.01, 0.05, 0.1, 0.5, 1, 10} using the experimental validation method. ε is a small positive constant used to ensure that the concatenated terms are nonzero; its value is taken as 10⁻¹⁵ in the proposed BLA-KNN method. ...
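A minimal sketch of this kind of exhaustive grid search over the quoted candidate values (the scoring function below is a dummy placeholder, not the BLA-KNN objective):

import math
from itertools import product

# Candidate grids quoted in the excerpt above.
lambda1_grid = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10]
lambda2_grid = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10]
sigma_grid = [0.1, 0.5, 1, 5, 10]
phi_grid = [0.005, 0.01, 0.05, 0.1, 0.5, 1, 10]

def validation_score(lam1, lam2, sigma, phi):
    # Dummy stand-in for "fit the model on training folds, score on validation".
    return -(math.log10(lam1) ** 2 + math.log10(lam2) ** 2
             + math.log10(sigma) ** 2 + math.log10(phi) ** 2)

best_score, best_params = float("-inf"), None
for params in product(lambda1_grid, lambda2_grid, sigma_grid, phi_grid):
    score = validation_score(*params)
    if score > best_score:
        best_score, best_params = score, params

print("selected (lambda1, lambda2, sigma, phi):", best_params)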
Article
Full-text available
While the classical KNN (k nearest neighbor) shares its avoidance of the consistent distribution assumption between training and testing samples to achieve fast prediction, it still faces two challenges: (a) its generalization ability heavily depends on an appropriate number k of nearest neighbors; (b) its prediction behavior lacks interpretability. In order to address the two challenges, a novel Bayes-decisive linear KNN with adaptive nearest neighbors (i.e., BLA-KNN) is proposed to obtain the following three merits: (a) a diagonal matrix is introduced to adaptively select the nearest neighbors and simultaneously improve the generalization capability of the proposed BLA-KNN method; (b) the proposed BLA-KNN method owns the group effect, which inherits and extends the group property of the sum of squares for total deviations by reflecting the training sample class-aware information in the group effect regularization term; (c) the prediction behavior of the proposed BLA-KNN method can be interpreted from the Bayes-decision-rule perspective. In order to do so, we first use a diagonal matrix to weigh each training sample so as to obtain the importance of the sample, while constraining the importance weights to ensure that the adaptive k value is carried out efficiently. Second, we introduce a class-aware information regularization term in the objective function to obtain the nearest neighbor group effect of the samples. Finally, we introduce linear expression weights related to the distance measure between the testing and training samples in the regularization term to ensure that the interpretation of Bayes-decision-rule can be performed smoothly. We also optimize the proposed objective function using an alternating optimization strategy. We experimentally demonstrate the effectiveness of the proposed BLA-KNN method by comparing it with 7 comparative methods on 15 benchmark datasets.
... Predictors of model parameters were analyzed using penalized elastic-net regression. This statistical technique is highly effective at handling multiple predictors and multicollinearity [27]. The predictors were standardized before fitting the model. ...
... Researchers have explored the potential of SVM to address the challenges by utilizing a small training dataset, further highlighting the versatility and effectiveness of SVM in providing accurate and reliable predictions even with limited training data [27]. The Elastic Net Multivariate Linear Regression (ENMLR) was introduced by Zou and Hastie [28] as a robust approach for analyzing high-dimensional datasets. It was designed to overcome the limitations of the LASSO method. ...
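As a rough, hedged sketch of the workflow these excerpts describe (standardize the predictors, then fit an elastic net whose penalties are tuned by cross-validation), using scikit-learn on toy data rather than the cited studies' datasets:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data with correlated predictors, standing in for the studies' datasets.
X, y = make_regression(n_samples=200, n_features=50, n_informative=8,
                       effective_rank=10, noise=5.0, random_state=0)

# Standardize predictors, then fit an elastic net whose overall penalty (alpha)
# and lasso/ridge mix (l1_ratio) are chosen by 5-fold cross-validation.
model = make_pipeline(
    StandardScaler(),
    ElasticNetCV(l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 1.0], cv=5, random_state=0),
)
model.fit(X, y)

enet = model[-1]
print("chosen l1_ratio:", enet.l1_ratio_, " chosen alpha:", enet.alpha_)
print("non-zero coefficients:", int(np.sum(enet.coef_ != 0)), "of", X.shape[1])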
Article
Full-text available
Background The process of optimizing in vitro shoot proliferation is a complicated task, as it is influenced by interactions of many factors as well as genotype. This study investigated the role of various concentrations of plant growth regulators (zeatin and gibberellic acid) in the successful in vitro shoot proliferation of three Punica granatum cultivars (‘Faroogh’, ‘Atabaki’ and ‘Shirineshahvar’). Also, the utility of five Machine Learning (ML) algorithms—Support Vector Regression (SVR), Random Forest (RF), Extreme Gradient Boosting (XGB), Ensemble Stacking Regression (ESR) and Elastic Net Multivariate Linear Regression (ENMLR)—as modeling tools was evaluated on in vitro multiplication of pomegranate. A new automatic hyperparameter optimization method named Adaptive Tree Parzen Estimator (ATPE) was developed to tune the hyperparameters. The performance of the models was evaluated and compared using statistical indicators (MAE, RMSE, RRMSE, MAPE, R and R²), while a specific Global Performance Indicator (GPI) was introduced to rank the models based on a single parameter. Moreover, Non-dominated Sorting Genetic Algorithm-II (NSGA-II) was employed to optimize the selected prediction model. Results The results demonstrated that the ESR algorithm exhibited higher predictive accuracy in comparison to other ML algorithms. The ESR model was subsequently introduced for optimization by NSGA-II. ESR-NSGA-II revealed that the highest proliferation rate (3.47, 3.84, and 3.22), shoot length (2.74, 3.32, and 1.86 cm), leaf number (18.18, 19.76, and 18.77), and explant survival (84.21%, 85.49%, and 56.39%) could be achieved with a medium containing 0.750, 0.654, and 0.705 mg/L zeatin, and 0.50, 0.329, and 0.347 mg/L gibberellic acid in the ‘Atabaki’, ‘Faroogh’, and ‘Shirineshahvar’ cultivars, respectively. Conclusions This study demonstrates that the ‘Shirineshahvar’ cultivar exhibited lower shoot proliferation success compared to the other cultivars. The results indicated the good performance of ESR-NSGA-II in modeling and optimizing in vitro propagation. ESR-NSGA-II can be applied as an up-to-date and reliable computational tool for future studies in plant in vitro culture.
... SVRs are accurate and have acceptable overhead; though their CPU overhead is higher than that of linear models, at ~10%, the higher overhead of SVRs would be acceptable if they can better capture complex execution profiles. Lasso regression has its limitations (see [70]), as it suffers if operators execute in multiple power states or there are dependencies between features in the training data. Each feature in the training data corresponds to an operator type and its instance count. ...
Preprint
Full-text available
Managing the limited energy on mobile platforms executing long-running, resource intensive streaming applications requires adapting an application's operators in response to their power consumption. For example, the frame refresh rate may be reduced if the rendering operation is consuming too much power. Currently, predicting an application's power consumption requires (1) building a device-specific power model for each hardware component, and (2) analyzing the application's code. This approach can be complicated and error-prone given the complexity of an application's logic and the hardware platforms with heterogeneous components that it may execute on. We propose eScope, an alternative method to directly estimate power consumption by each operator in an application. Specifically, eScope correlates an application's execution traces with its device-level energy draw. We implement eScope as a tool for Android platforms and evaluate it using workloads on several synthetic applications as well as two video stream analytics applications. Our evaluation suggests that eScope predicts an application's power use with 97% or better accuracy while incurring a compute time overhead of less than 3%.
... Next, we asked whether a concise gene signature could be used to identify TP53Mut-like AML across datasets. We used elastic net regression, which results in sparser models [9] and is better suited to identify a concise gene signature. We performed multiple rounds of elastic net optimization and identified 25 core genes that accurately classify TP53Mut-like AML. ...
... The 5-minute duration was chosen as a reasonable timeframe in which pain level is not expected to fluctuate dramatically. Due to the large number of potential spectro-spatial features present in the data (e.g., channel and power-band combinations), we employed Elastic-Net regression, a hybrid of ridge regression and lasso regularization favored in the presence of highly correlated variables, which is often the case with intracranial recordings [27]. To avoid data leakage while optimizing hyperparameters in the feature pruning process, we used a nested k-fold cross-validation (CV) scheme (Figure 2B). ...
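A generic sketch of nested cross-validation of this kind (not the authors' pipeline): the inner loop tunes the elastic net hyperparameters, while each outer test fold stays untouched by tuning, which is what prevents leakage.

from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=150, n_features=40, noise=10.0, random_state=1)

# Inner loop: tune the penalty strength and the L1/L2 mix of the elastic net.
inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)
search = GridSearchCV(
    make_pipeline(StandardScaler(), ElasticNet(max_iter=10_000)),
    param_grid={"elasticnet__alpha": [0.01, 0.1, 1.0],
                "elasticnet__l1_ratio": [0.2, 0.5, 0.8]},
    cv=inner_cv,
)

# Outer loop: performance is estimated on folds never seen during tuning.
outer_cv = KFold(n_splits=5, shuffle=True, random_state=2)
scores = cross_val_score(search, X, y, cv=outer_cv)
print("nested-CV R^2 per outer fold:", scores.round(3))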
Preprint
Full-text available
Pain is a complex experience that remains largely unexplored in naturalistic contexts, hindering our understanding of its neurobehavioral representation in ecologically valid settings. To address this, we employed a multimodal, data-driven approach integrating intracranial electroencephalography, pain self-reports, and facial expression quantification to characterize the neural and behavioral correlates of naturalistic acute pain in twelve epilepsy patients undergoing continuous monitoring with neural and audiovisual recordings. High self-reported pain states were associated with elevated blood pressure, increased pain medication use, and distinct facial muscle activations. Using machine learning, we successfully decoded individual participants' high versus low self-reported pain states from distributed neural activity patterns (mean AUC = 0.70), involving mesolimbic regions, striatum, and temporoparietal cortex. High self-reported pain states exhibited increased low-frequency activity in temporoparietal areas and decreased high-frequency activity in mesolimbic regions (hippocampus, cingulate, and orbitofrontal cortex) compared to low pain states. This neural pain representation remained stable for hours and was modulated by pain onset and relief. Objective facial expression changes also classified self-reported pain states, with results concordant with electrophysiological predictions. Importantly, we identified transient periods of momentary pain as a distinct naturalistic acute pain measure, which could be reliably differentiated from affect-neutral periods using intracranial and facial features, albeit with neural and facial patterns distinct from self-reported pain. These findings reveal reliable neurobehavioral markers of naturalistic acute pain across contexts and timescales, underscoring the potential for developing personalized pain interventions in real-world settings.
... In high-dimensional regression modeling, many covariates are usually included in the model, but only a small subset is statistically significant. Many regularization methods, including the LASSO (Tibshirani [27]: least absolute shrinkage and selection operator), adaptive LASSO (Zou [32]), SCAD (Fan and Li [7]: smoothly clipped absolute deviation), Enet (Zou and Hastie [12]: the elastic net), bridge penalized regression (Fu [8]), and L1∕2-norm penalization (Xu [30]), are frequently used to conduct variable selection and model estimation simultaneously. In the Bayesian framework, Park and Casella [19] proposed Bayesian LASSO regression, Alhamzawi et al. [4] considered Bayesian adaptive LASSO QR, Polson et al. [20] studied Bayesian bridge regression, Betancourt et al. (2017) studied Bayesian fused Lasso regression for dynamic binary networks, Alhamzawi and Ali [3] discussed Bayesian L1∕2 Tobit QR, and Mallick and Yi [16] considered Bayesian L1∕2 regularization. ...
Article
Ordinal data frequently occur in various fields such as knowledge level assessment, credit rating, clinical disease diagnosis, and psychological evaluation. The classic models, including cumulative logistic regression or probit regression, are often used to model such ordinal data. But these modeling approaches depict the mean characteristic of the response variable conditional on a cluster of predictive variables, which often results in non-robust estimation results. As an appealing alternative, the composite quantile regression (CQR) approach is usually employed to gain more robust and relatively efficient results. In this paper, we propose a Bayesian CQR modeling approach for the ordinal latent regression model. In order to overcome the identifiability problem of the considered model and obtain more robust estimation results, we advocate using the Bayesian relative CQR approach to estimate regression parameters. Additionally, in regression modeling, it is a highly desirable task to obtain a parsimonious model that retains only important covariates. We incorporate the Bayesian penalty into the ordinal latent CQR regression model to simultaneously conduct parameter estimation and variable selection. Finally, the proposed Bayesian relative CQR approach is illustrated by Monte Carlo simulations and a real data application. Simulation results and real data examples show that the suggested Bayesian relative CQR approach has good performance for ordinal regression models.
... As part of the machine-learning pipeline, we integrated a minimum redundancy maximum relevance feature selection algorithm (MRMR; Peng et al., 2005) via the python library featurewiz (Seshadri, 2023). We compared seven different regression-based machine-learning algorithms, most of which have been successfully employed for psychotherapy research before (Aafjes-van Doorn et al., 2021): (1) Elastic net regularization and variable selection (Elastic Net; Zou & Hastie, 2005) conducts both L1 and L2 regression regularization; (2) eXtreme Gradient Boosting (XGBoost; Chen & Guestrin, 2016) creates a series of decision trees that are trained sequentially for error correction, resulting in a final ensemble of trees; (3) Random Forest (RF; Breiman, 2001) also creates an ensemble of decision trees, which is aggregated to calculate the average of the predictions; (4) Support-Vector-Regression (SVR; Cortes & Vapnik, 1995) identifies a hyperplane that separates the data points into different classes by maximizing the distance between the hyperplane and the nearest data points from the classes; (5) Mixed Effects Random Forest (MERF; Hajjem et al., 2014) is an extension of RF allowing a random intercept for nested data (i.e., sessions nested in patients); (6) Gaussian Process Boosting (GPBoost; Sigrist, 2022) employs a boosting framework incorporating Gaussian Process regression and mixed effects modeling; (7) SuperLearner (van der Laan et al., 2007) is an algorithm that takes the predicted data from other algorithms as predictors and feeds them into a meta-algorithm that predicts the target variable. We chose an SVR algorithm as our meta-algorithm and used the predicted data from the previous algorithms (Elastic Net, XGBoost, RF, SVR, MERF, GPBoost) as predictors. ...
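A rough illustration of the stacking idea described above, using scikit-learn's StackingRegressor with an SVR meta-learner; the base learners below are placeholders (the study's MERF and GPBoost models, for instance, are not scikit-learn estimators):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import ElasticNet
from sklearn.svm import SVR

X, y = make_regression(n_samples=300, n_features=30, noise=15.0, random_state=0)

# Base learners: their out-of-fold predictions become the meta-learner's inputs.
base_learners = [
    ("enet", ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10_000)),
    ("rf", RandomForestRegressor(n_estimators=200, random_state=0)),
    ("svr", SVR(C=1.0)),
]

# Meta-learner: an SVR combines the base predictions, as in the excerpt above.
stack = StackingRegressor(estimators=base_learners, final_estimator=SVR(), cv=5)
stack.fit(X, y)
print("training R^2 of the stacked model:", round(stack.score(X, y), 3))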
Article
Full-text available
We aim to use topic modeling, an approach for discovering clusters of related words (“topics”), to predict symptom severity and therapeutic alliance in psychotherapy transcripts, while also identifying the most important topics and overarching themes for prediction. We analyzed 552 psychotherapy transcripts from 124 patients. Using BERTopic (Grootendorst, 2022), we extracted 250 topics each for patient and therapist speech. These topics were used to predict symptom severity and alliance with various competing machine-learning methods. Sensitivity analyses were calculated for a model based on 50 topics, LDA-based topic modeling, and a bigram model. Additionally, we grouped topics into themes using qualitative analysis and identified key topics and themes with eXplainable Artificial Intelligence (XAI). Symptom severity could be predicted with highest accuracy by patient topics (r = 0.45, 95% CI [0.40, 0.51]), whereas alliance was better predicted by therapist topics (r = 0.20, 95% CI [0.16, 0.24]). Drivers for symptom severity were themes related to health and negative experiences. Lower alliance was correlated with various themes, especially psychotherapy framework, income, and everyday life. This analysis shows the potential of using topic modeling in psychotherapy research, allowing the prediction of several treatment-relevant metrics with reasonable accuracy. Further, the use of XAI allows for an analysis of the individual predictive value of topics and themes. Limitations entail heterogeneity across different topic modeling hyperparameters and a relatively small sample size.
... Elastic Net can select PRSs to include and efficiently handle multi-collinearity. [35][36][37] Furthermore, PRSmix and PRSmix+ only required a set of SNP effects to estimate the PRSs and estimated the prediction accuracy to the target trait to select the best scores for the combination. Additionally, compared to the preselected traits for stroke by Abraham et al., 8 we also observed that our method could identify more related risk factors to include compared to previous work conducted on stroke such as usual walking pace, arthropathies, and lipoprotein(a) ( Figure S10; Tables S16 and S17). ...
Article
Full-text available
Polygenic risk scores (PRSs) are an emerging tool to predict the clinical phenotypes and outcomes of individuals. We propose PRSmix, a framework that leverages the PRS corpus of a target trait to improve prediction accuracy, and PRSmix+, which incorporates genetically correlated traits to better capture the human genetic architecture for 47 and 32 diseases/traits in European and South Asian ancestries, respectively. PRSmix demonstrated a mean prediction accuracy improvement of 1.20-fold (95% confidence interval [CI], [1.10; 1.3]; p = 9.17 × 10⁻⁵) and 1.19-fold (95% CI, [1.11; 1.27]; p = 1.92 × 10⁻⁶), and PRSmix+ improved the prediction accuracy by 1.72-fold (95% CI, [1.40; 2.04]; p = 7.58 × 10⁻⁶) and 1.42-fold (95% CI, [1.25; 1.59]; p = 8.01 × 10⁻⁷) in European and South Asian ancestries, respectively. Compared to the previously cross-trait-combination methods with scores from pre-defined correlated traits, we demonstrated that our method improved prediction accuracy for coronary artery disease up to 3.27-fold (95% CI, [2.1; 4.44]; p value after false discovery rate (FDR) correction = 2.6 × 10⁻⁴). Our method provides a comprehensive framework to benchmark and leverage the combined power of PRS for maximal performance in a desired target population.
... However, the partial likelihood assumes that event times are unique. To handle ties, where multiple individuals experience the event at the same time, we use the Breslow approximation of the partial likelihood in [3] ...
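For context, the Breslow approximation handles ties by raising each risk-set denominator to the number of events at that time. With distinct event times t_1 < … < t_K, d_k events at t_k, risk set R(t_k), and s_k the sum of the covariate vectors of the subjects failing at t_k, the approximated partial likelihood is (standard notation, not copied from the preprint):

    L_{\text{Breslow}}(\beta) = \prod_{k=1}^{K} \frac{\exp\!\left(\beta^{\top} s_k\right)}{\left[\sum_{j \in R(t_k)} \exp\!\left(\beta^{\top} x_j\right)\right]^{d_k}} .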
Preprint
Full-text available
Background Associated with high-dimensional omics data there are often “meta-features” such as pathways and functional annotations that can be informative for predicting an outcome of interest. We extend to Cox regression the regularized hierarchical framework of Kawaguchi et al. (2022) (1) for integrating meta-features, with the goal of improving prediction and feature selection performance with time-to-event outcomes. Methods A hierarchical framework is deployed to incorporate meta-features. Regularization is applied to the omic features as well as the meta-features so that high-dimensional data can be handled at both levels. The proposed hierarchical Cox model can be efficiently fitted by a combination of iterative reweighted least squares and cyclic coordinate descent. Results In a simulation study we show that when the external meta-features are informative, the regularized hierarchical model can substantially improve prediction performance over standard regularized Cox regression. We illustrate the proposed model with applications to breast cancer and melanoma survival based on gene expression profiles, which show the improvement in prediction performance by applying meta-features, as well as the discovery of important omic feature sets with sparse regularization at the meta-feature level. Conclusions The proposed hierarchical regularized regression model enables integration of external meta-feature information directly into the modeling process for time-to-event outcomes and improves prediction performance when the external meta-feature data is informative. Importantly, when the external meta-features are uninformative, the prediction performance based on the regularized hierarchical model is on par with standard regularized Cox regression, indicating robustness of the framework. In addition to developing predictive signatures, the model can also be deployed in discovery applications where the main goal is to identify important features associated with the outcome rather than developing a predictive model.
... Clinical and demographic characteristics of our study population are summarized in Table 1, and there were no statistically significant differences in clinical variables between the training and validation sets (p > 0.05). LASSO incorporates an L1 regularization penalty term, which limits the magnitude of the regression coefficients [21]. Thus, LASSO regression was used to screen acoustic features in the training set. ...
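A generic illustration of L1-based screening of this kind (the data and features below are placeholders, not the study's acoustic features): an L1-penalized logistic regression shrinks uninformative coefficients to exactly zero, and the surviving features form the screened set carried forward into downstream modeling.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Placeholder data standing in for acoustic features in a training set.
X, y = make_classification(n_samples=260, n_features=60, n_informative=10,
                           random_state=0)
X = StandardScaler().fit_transform(X)

# The L1 penalty drives most coefficients to exactly zero; the remaining
# features are the screened subset.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
lasso.fit(X, y)

selected = np.flatnonzero(lasso.coef_.ravel())
print(f"{selected.size} of {X.shape[1]} features retained:", selected)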
Preprint
Full-text available
Objective Lung cancer has the highest incidence of all malignant tumors worldwide, and early diagnosis and treatment are crucial for improving patient survival rates. The aim of this study is to develop a nomogram based on acoustic and clinical features, providing clinical trial evidence for predicting lung cancer. Methods We reviewed the voice data and clinical data from 350 individuals: 189 pathologically confirmed lung cancer patients and 161 non-lung-cancer patients, which included 77 patients with benign pulmonary lesions and 84 healthy volunteers. First, acoustic features were extracted from all subjects, and optimal features were selected by least absolute shrinkage and selection operator (LASSO) regression. Subsequently, acoustic and clinical features were combined to build a nomogram for predicting lung cancer based on a multivariate logistic regression model. The performance of the nomogram was evaluated by the area under the receiver operating characteristic curve (AUC) and the calibration curve, the clinical utility was estimated by decision curve analysis (DCA), and a validation set was applied to confirm the predictive value of the nomogram. Results The acoustic-clinical nomogram model exhibited good diagnostic performance in the training set, achieving an AUC of 0.774, an accuracy of 0.701, a sensitivity of 0.693, and a specificity of 0.710. In the validation set, the nomogram attained an AUC of 0.714, an accuracy of 0.642, a sensitivity of 0.673, and a specificity of 0.611. The DCA curve demonstrated that the nomogram had good clinical usefulness. Conclusions The acoustic-clinical nomogram constructed in this study exhibited good discrimination, calibration, and clinical application value, providing a tool to predict lung cancer.
... The selectivity ratio method has been recently applied to the discovery of bioactive constituents in botanical extracts. Recently, our research group showed that Elastic Net (EN), a regularized regression model [16], was capable of correctly predicting the anti-inflammatory bioactive constituents in hop extracts utilizing high-resolution mass spectrometry m/z profiles of extract fractions [17]. ...
Article
Full-text available
Rapid screening of botanical extracts for the discovery of bioactive natural products was performed using a fractionation approach in conjunction with flow-injection high-resolution mass spectrometry for obtaining chemical fingerprints of each fraction, enabling the correlation of the relative abundance of molecular features (representing individual phytochemicals) with the read-outs of bioassays. We applied this strategy for discovering and identifying constituents of Centella asiatica (C. asiatica) that protect against Aβ cytotoxicity in vitro. C. asiatica has been associated with improving mental health and cognitive function, with potential use in Alzheimer’s disease. Human neuroblastoma MC65 cells were exposed to subfractions of an aqueous extract of C. asiatica to evaluate the protective benefit derived from these subfractions against amyloid β-cytotoxicity. The % viability score of the cells exposed to each subfraction was used in conjunction with the intensity of the molecular features in two computational models, namely Elastic Net and selectivity ratio, to determine the relationship of the peak intensity of molecular features with % viability. Finally, the correlation of mass spectral features with MC65 protection and their abundance in different sub-fractions were visualized using GNPS molecular networking. Both computational methods unequivocally identified dicaffeoylquinic acids as providing strong protection against Aβ-toxicity in MC65 cells, in agreement with the protective effects observed for these compounds in previous preclinical model studies.
... As a result, researchers have begun to explore methods of variable selection. Zou and Hastie (2005) [8] conducted a study on this topic, while Yuan and Lin (2006) [9] proposed the Group Lasso model, which utilizes the group structure between variables as prior information. One advantage of this model is that its objective function is a convex function of the unknown parameters, ensuring the existence of a unique global minimum. ...
Article
Full-text available
This paper presents a multi-algorithm fusion model (StackingGroup) based on the Stacking ensemble learning framework to address the variable selection problem in high-dimensional group structure data. The proposed algorithm takes into account the differences in data observation and training principles of different algorithms. It leverages the strengths of each model and incorporates Stacking ensemble learning with multiple group structure regularization methods. The main approach involves dividing the data set into K parts on average, using more than 10 algorithms as basic learning models, and selecting the base learners based on low correlation, strong prediction ability, and small model error. Finally, we selected the grSubset + grLasso, grLasso, and grSCAD algorithms as the base learners for the Stacking algorithm. The Lasso algorithm was used as the meta-learner to create a comprehensive algorithm called StackingGroup. This algorithm is designed to handle high-dimensional group structure data. Simulation experiments showed that the proposed method outperformed other prediction methods in terms of R², RMSE, and MAE. Lastly, we applied the proposed algorithm to investigate the risk factors of low birth weight in infants and young children. The final results demonstrate that the proposed method achieves a mean absolute error (MAE) of 0.508 and a root mean square error (RMSE) of 0.668. The obtained values are smaller than those obtained from a single model, indicating that the proposed method surpasses other algorithms in terms of prediction accuracy.
... Like wrapper methods, embedded methods frequently interact with the classifier, but they have better computational efficiency than wrapper methods and are also personalized to a specific classifier (Şahin & Kılıç, 2019). Feature Selection-Perceptron (FS-P) (Chakraborty & Pal, 2015), Support Vector Machines with recursive feature elimination (SVM-RFE) (Kari et al., 2018), and Lasso (L1) and Elastic Net (L1+L2) based models (Tibshirani, 1996; Zou & Hastie, 2005) are a few examples of embedded methods. This study focuses only on filter-based FS methods. ...
... We write (18) in the following form: ...
Article
Full-text available
Feature selection methods are widely used in machine learning tasks to reduce the dimensionality and improve the performance of the models. However, traditional feature selection methods based on regression often suffer from a lack of robustness and generalization ability and are easily affected by outliers in the data. To address this problem, we propose a robust feature selection method based on sparse regression. This method uses a non-square form of the L2,1 norm as both the loss function and regularization term, which can effectively enhance the model’s resistance to outliers and achieve feature selection simultaneously. Furthermore, to improve the model’s robustness and prevent overfitting, we add an elastic variable to the loss function. We design two efficient convergent iterative processes to solve the optimization problem based on the L2,1 norm and propose a robust joint sparse regression algorithm. Extensive experimental results on three public datasets show that our feature selection method outperforms other comparison methods.
... In this way, redundant or not useful predictors are removed, which should lead to a simpler and potentially better model. Finally, the Elastic Net regression (Zou and Hastie, 2005) is a method that combines both the L1 and L2 penalties used in the Lasso and Ridge regression models. ...
Article
Full-text available
Purpose Cancer survivors commonly report cognitive declines after cancer therapy. Due to the complex etiology of cancer-related cognitive decline (CRCD), predicting who will be at risk of CRCD remains a clinical challenge. We developed a model to predict breast cancer survivors who would experience CRCD after systematic treatment. Methods We used the Thinking and Living with Cancer study, a large ongoing multisite prospective study of older breast cancer survivors with complete assessments pre-systemic therapy, 12 months and 24 months after initiation of systemic therapy. Cognition was measured using neuropsychological testing of attention, processing speed, and executive function (APE). CRCD was defined as a 0.25 SD (of observed changes from baseline to 12 months in matched controls) decline or greater in APE score from baseline to 12 months (transient) or persistent as a decline 0.25 SD or greater sustained to 24 months. We used machine learning approaches to predict CRCD using baseline demographics, tumor characteristics and treatment, genotypes, comorbidity, and self-reported physical, psychosocial, and cognitive function. Results Thirty-two percent of survivors had transient cognitive decline, and 41% of these women experienced persistent decline. Prediction of CRCD was good: yielding an area under the curve of 0.75 and 0.79 for transient and persistent decline, respectively. Variables most informative in predicting CRCD included apolipoprotein E4 positivity, tumor HER2 positivity, obesity, cardiovascular comorbidities, more prescription medications, and higher baseline APE score. Conclusions Our proof-of-concept tool demonstrates our prediction models are potentially useful to predict risk of CRCD. Future research is needed to validate this approach for predicting CRCD in routine practice settings.
Article
Full-text available
Though tremendous advances have been made in the field of in vitro fertilization (IVF), a portion of patients are still affected by embryo implantation failure issues. One of the most significant factors contributing to implantation failure is a uterine condition called displaced window of implantation (WOI), which refers to an unsynchronized endometrium and embryo transfer time for IVF patients. Previous studies have shown that microRNAs (miRNAs) can be important biomarkers in the reproductive process. In this study, we aim to develop a miRNA-based classifier to identify the WOI for optimal time for embryo transfer. A reproductive-related PanelChip® was used to obtain the miRNA expression profiles from the 200 patients who underwent IVF treatment. In total, 143 out of the 167 miRNAs with amplification signals across 90% of the expression profiles were utilized to build a miRNA-based classifier. The microRNA-based classifier identified the optimal timing for embryo transfer with an accuracy of 93.9%, a sensitivity of 85.3%, and a specificity of 92.4% in the training set, and an accuracy of 88.5% in the testing set, showing high promise in accurately identifying the WOI for the optimal timing for embryo transfer.
Article
Full-text available
A brain-computer interface (BCI) based on an electroencephalograph (EEG) establishes a new channel of communication between the human brain and a computer. Redundant, noisy, and irrelevant channels lead to high computational costs and poor classification accuracy. Therefore, an effective feature selection technique for determining the optimal number of channels can improve BCI’s performance. However, existing meta-heuristic algorithms are prone to get trapped in local optimum due to high dimensional dataset. Thus, to reduce dimension, solve inter subject variation and choose an optimal subset of channels, a novel framework called Component Loading followed by Clustering and Classification (CLCC) is proposed in this paper. This novel framework is further divided into two experiment configurations-CLCC with Feature Selection (CLCC-FS) and CLCC without Feature Selection (CLCC-WFS). All these frameworks have been implemented on a motor imagery (MI) EEG dataset of 10 subjects in order to choose the best subset of channels. Further, seven different classifiers have been employed to assess the performance. Experimental outcomes show that on comparing various feature selection techniques, our proposed algorithm i.e., CLCC-FS Opposition-Based Whale Optimization Algorithm (CLCC-FS(OBWOA)) performed substantially better than the other feature selection techniques. We demonstrate that the proposed algorithm is able to achieve 99.6% accuracy by using only few channels and can improve the practicality of the BCI system by reducing the computation cost.
Article
Full-text available
Background The impact of the gut microbiota on neuropsychiatric disorders has gained much attention in recent years; however, comprehensive data on the relationship between the gut microbiome and its metabolites and resistance to treatment for depression and anxiety is lacking. Here, we investigated intestinal metabolites in patients with depression and anxiety disorders, and their possible roles in treatment resistance. Results We analyzed fecal metabolites and microbiomes in 34 participants with depression and anxiety disorders. Fecal samples were obtained three times for each participant during the treatment. Propensity score matching led us to analyze data from nine treatment responders and nine non-responders, and the results were validated in the residual sample sets. Using elastic net regression analysis, we identified several metabolites, including N-ε-acetyllysine; baseline levels of the former were low in responders (AUC = 0.86; 95% confidence interval, 0.69–1). In addition, fecal levels of N-ε-acetyllysine were negatively associated with the abundance of Odoribacter. N-ε-acetyllysine levels increased as symptoms improved with treatment. Conclusion Fecal N-ε-acetyllysine levels before treatment may be a predictive biomarker of treatment-refractory depression and anxiety. Odoribacter may play a role in the homeostasis of intestinal L-lysine levels. More attention should be paid to the importance of L-lysine metabolism in those with depression and anxiety.
Article
Full-text available
Background Genome-wide association studies have successfully identified genetic variants associated with human disease. Various statistical approaches based on penalized and machine learning methods have recently been proposed for disease prediction. In this study, we evaluated the performance of several such methods for predicting asthma using the Korean Chip (KORV1.1) from the Korean Genome and Epidemiology Study (KoGES). Results First, single-nucleotide polymorphisms were selected via single-variant tests using logistic regression with adjustment for several epidemiological factors. Next, we evaluated the following methods for disease prediction: ridge, least absolute shrinkage and selection operator, elastic net, smoothly clipped absolute deviation, support vector machine, random forest, boosting, bagging, naïve Bayes, and k-nearest neighbor. Finally, we compared their predictive performance based on the area under the receiver operating characteristic curve, precision, recall, F1-score, Cohen's Kappa, balanced accuracy, error rate, Matthews correlation coefficient, and area under the precision-recall curve. Additionally, three oversampling algorithms were used to deal with imbalance problems. Conclusions Our results show that penalized methods exhibit better predictive performance for asthma than that achieved via machine learning methods. On the other hand, in the oversampling study, random forest and boosting methods overall showed better prediction performance than penalized methods.
Article
Full-text available
Spatial omics technologies can help identify spatially organized biological processes, but existing computational approaches often overlook structural dependencies in the data. Here, we introduce Smoother, a unified framework that integrates positional information into non-spatial models via modular priors and losses. In simulated and real datasets, Smoother enables accurate data imputation, cell-type deconvolution, and dimensionality reduction with remarkable efficiency. In colorectal cancer, Smoother-guided deconvolution reveals plasma cell and fibroblast subtype localizations linked to tumor microenvironment restructuring. Additionally, joint modeling of spatial and single-cell human prostate data with Smoother allows for spatial mapping of reference populations with significantly reduced ambiguity.
Article
Study Objectives Hypnograms contain a wealth of information and play an important role in sleep medicine. However, interpretation of the hypnogram is a difficult task and requires domain knowledge and “clinical intuition.” This study aimed to uncover which features of the hypnogram drive interpretation by physicians. In other words, make explicit which features physicians implicitly look for in hypnograms. Methods Three sleep experts evaluated up to 612 hypnograms, indicating normal or abnormal sleep structure and suspicion of disorders. ElasticNet and convolutional neural network classification models were trained to predict the collected expert evaluations using hypnogram features and stages as input. The models were evaluated using several measures, including accuracy, Cohen’s kappa, Matthew’s correlation coefficient, and confusion matrices. Finally, model coefficients and visual analytics techniques were used to interpret the models to associate hypnogram features with expert evaluation. Results Agreement between models and experts (Kappa between 0.47 and 0.52) is similar to agreement between experts (Kappa between 0.38 and 0.50). Sleep fragmentation, measured by transitions between sleep stages per hour, and sleep stage distribution were identified as important predictors for expert interpretation. Conclusions By comparing hypnograms not solely on an epoch-by-epoch basis, but also on these more specific features that are relevant for the evaluation of experts, performance assessment of (automatic) sleep-staging and surrogate sleep trackers may be improved. In particular, sleep fragmentation is a feature that deserves more attention as it is often not included in the PSG report, and existing (wearable) sleep trackers have shown relatively poor performance in this aspect.
Article
Towards the identification of genetic basis of complex traits, transcriptome-wide association study (TWAS) is successful in integrating transcriptome data. However, TWAS is only applicable for common variants, excluding rare variants in exome or whole genome sequences. This is partly because of the inherent limitation of TWAS protocols that rely on predicting gene expressions. Our previous research has revealed the insight into TWAS: the two steps in TWAS, building and applying the expression prediction models, are essentially genetic feature selection and aggregations that do not have to involve predictions. Based on this insight disentangling TWAS, rare variants’ inability of predicting expression traits is no longer an obstacle. Herein, we developed “rare variant TWAS”, or rvTWAS, that first uses a Bayesian model to conduct expression-directed feature selection and then uses a kernel machine to carry out feature aggregation, forming a model leveraging expressions for association mapping including rare variants. We demonstrated the performance of rvTWAS by thorough simulations and real data analysis in three psychiatric disorders, namely schizophrenia, bipolar disorder, and autism spectrum disorder. We confirmed that rvTWAS outperforms existing TWAS protocols and revealed additional genes underlying psychiatric disorders. Particularly, we formed a hypothetical mechanism in which zinc finger genes impact all three disorders through transcriptional regulations. rvTWAS will open a door for sequence-based association mappings integrating gene expressions.
Article
Background and aims Why only half of the idiopathic peripheral polyneuropathy (IPN) patients develop neuropathic pain is unknown. By conducting a proteomics analysis on IPN patients, we aimed to discover proteins and new pathways that are associated with neuropathic pain. Methods We conducted an unbiased mass-spectrometry proteomics analysis on blood plasma from 31 IPN patients with severe neuropathic pain and 29 IPN patients with no pain, to investigate protein biomarkers and protein-protein interactions associated with neuropathic pain. Univariate modeling was done with Linear Mixed Modeling (LMM) and corrected for multiple testing. Multivariate modeling was performed using elastic net analysis and validated with internal cross-validation and bootstrapping. Results In the univariate analysis, 73 proteins showed a p-value <0.05 and 12 proteins showed a p-value <0.01. None were significant after Benjamini-Hochberg adjustment for multiple testing. Elastic net analysis created a model containing 12 proteins with reasonable discriminatory power to differentiate between painful and painless IPN (false negative rate 0.10, false positive rate 0.18, and an area under the curve of 0.75). Eight of these 12 proteins were clustered into one interaction network, significantly enriched for the complement and coagulation pathway (Benjamini-Hochberg adjusted p-value = 0.0057), with Complement Component 3 (C3) as the central node. Bootstrap validation identified insulin-like growth factor-binding protein 2 (IGFBP2), complement factor H-related protein 4 (CFHR4), and ferritin light chain (FTL) as the most discriminatory proteins of the original 12 identified. Interpretation This proteomics analysis suggests a role for the complement system in neuropathic pain in IPN.
Article
Full-text available
Online portfolio optimization with transaction costs is a major challenge in the large-scale intelligent computing community, owing to undersampling from a rapidly changing market and the complexity introduced by varying transaction costs. In this paper, we focus on this problem and solve it with a machine learning system. Specifically, we reformulate the optimization problem as a minimization over the simplex involving three terms: the negative expected return, an elastic net regularization term controlling transaction costs, and the portfolio variable. We propose to apply the linearized augmented Lagrangian method (LALM) and the alternating direction method of multipliers (ADMM) to solve the optimization model with higher efficiency, while theoretically guaranteeing their convergence and deducing closed-form solutions of their subproblems in each iteration. Furthermore, we conduct extensive experiments on five benchmark datasets from real markets to demonstrate that the proposed algorithms outperform state-of-the-art strategies in most cases across six dimensions.
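Schematically, and with notation chosen here rather than taken from the paper, an elastic-net-regularized rebalancing objective over the simplex has the form below, where w_t is the current portfolio, μ the expected returns, and the penalties discourage large (transaction-cost-incurring) changes; the paper's exact formulation and its LALM/ADMM splitting differ in the details:

    \min_{w \in \Delta} \; -\mu^{\top} w + \lambda_1 \lVert w - w_t \rVert_1 + \frac{\lambda_2}{2} \lVert w - w_t \rVert_2^2 ,
    \qquad
    \Delta = \left\{ w \in \mathbb{R}^n : w \ge 0, \; \textstyle\sum_i w_i = 1 \right\} .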
Article
Full-text available
In this article, we propose the optimization of the resolution of time-frequency atoms and the regularization of fitting models to obtain better representations of heart sound signals. This is done by evaluating the classification performance of deep learning (DL) networks in discriminating five heart valvular conditions based on a new class of time-frequency feature matrices derived from the models. We inspect several combinations of resolution and regularization, and the optimal one is the one that provides the highest performance. To this end, a fitting model is obtained based on a heart sound signal and an over-complete dictionary of Gabor atoms using elastic net regularization of linear models. We consider two different DL architectures, the first mainly consisting of a 1D convolutional neural network (CNN) layer and a long short-term memory (LSTM) layer, while the second is composed of 1D and 2D CNN layers followed by an LSTM layer. The networks are trained with two algorithms, namely stochastic gradient descent with momentum (SGDM) and adaptive moment (ADAM). Extensive experimentation has been conducted using a database containing heart sound signals of five heart valvular conditions. The best classification accuracy of 98.95% is achieved with the second architecture when trained with ADAM and feature matrices derived from optimal fitting models obtained with a Gabor dictionary consisting of atoms with high-time low-frequency resolution and imposing sparsity on the models.
Article
Full-text available
Atopic dermatitis (AD) is a skin disease that is heterogeneous both in terms of clinical manifestations and molecular profiles. It is increasingly recognized that AD is a systemic rather than a local disease and should be assessed in the context of whole-body pathophysiology. Here we show, via integrated RNA-sequencing of skin tissue and peripheral blood mononuclear cell (PBMC) samples along with clinical data from 115 AD patients and 14 matched healthy controls, that specific clinical presentations associate with matching differential molecular signatures. We establish a regression model based on transcriptome modules identified in weighted gene co-expression network analysis to extract molecular features associated with detailed clinical phenotypes of AD. The two main, qualitatively differential skin manifestations of AD, erythema and papulation are distinguished by differential immunological signatures. We further apply the regression model to a longitudinal dataset of 30 AD patients for personalized monitoring, highlighting patient heterogeneity in disease trajectories. The longitudinal features of blood tests and PBMC transcriptome modules identify three patient clusters which are aligned with clinical severity and reflect treatment history. Our approach thus serves as a framework for effective clinical investigation to gain a holistic view on the pathophysiology of complex human diseases.
Article
Objective Hepatocellular carcinoma (HCC) is one of the leading cancer types with increasing annual incidence and high mortality in the USA. MicroRNAs (miRNAs) have emerged as valuable prognostic indicators in cancer patients. To identify a miRNA signature predictive of survival in patients with HCC, we developed a machine learning-based HCC survival estimation method, HCCse, using the miRNA expression profiles of 122 patients with HCC. Methods The HCCse method was designed using an optimal feature selection algorithm incorporated with support vector regression. Results HCCse identified a robust miRNA signature consisting of 32 miRNAs and obtained a mean correlation coefficient (R) and mean absolute error (MAE) of 0.87 ± 0.02 and 0.73 years between the actual and estimated survival times of patients with HCC; and the jackknife test achieved an R and MAE of 0.73 and 0.97 years between actual and estimated survival times, respectively. The identified signature has seven prognostic miRNAs (hsa-miR-146a-3p, hsa-miR-200a-3p, hsa-miR-652-3p, hsa-miR-34a-3p, hsa-miR-132-5p, hsa-miR-1301-3p and hsa-miR-374b-3p) and four diagnostic miRNAs (hsa-miR-1301-3p, hsa-miR-17-5p, hsa-miR-34a-3p and hsa-miR-200a-3p). Notably, three of these miRNAs, hsa-miR-200a-3p, hsa-miR-1301-3p and hsa-miR-17-5p, also displayed association with tumor stage, further emphasizing their clinical relevance. Furthermore, we performed pathway enrichment analysis and found that the target genes of the identified miRNA signature were significantly enriched in the hepatitis B pathway, suggesting its potential involvement in HCC pathogenesis. Conclusions Our study developed HCCse, a machine learning-based method, to predict survival in HCC patients using miRNA expression profiles. We identified a robust miRNA signature of 32 miRNAs with prognostic and diagnostic value, highlighting their clinical relevance in HCC management and potential involvement in HCC pathogenesis.
Article
Full-text available
Background "Molecular signatures" or "gene-expression signatures" are used to model patients' clinically relevant information (e.g., prognosis, survival time) us-ing expression data from coexpressed genes. Signatures are a key feature in cancer research because they can provide insight into biological mechanisms and have po-tential diagnostic use. However, available methods to search for signatures fail to address key requirements of signatures and signature components, especially the discovery of tightly coexpressed sets of genes. Results We suggest a method with good predictive performance that follows from the biologically relevant features of signatures. After identifying a seed gene with good predictive abilities, we search for a group of genes that is highly correlated with the seed gene, shows tight coexpression, and has good predictive abilities; this set of genes is reduced to a signature component using using Principal Components Analysis. The process is repeated until no further component is found. We show that the suggested method can recover signatures present in the data, and has predictive performance comparable to state-of-the-art methods. The code (R with C++) is freely available under GNU GPL license.
Article
Full-text available
In this paper we study boosting methods from a new perspective. We build on recent work by Efron et al. to show that boosting approximately (and in some cases exactly) minimizes its loss criterion subject to an ℓ1 constraint and, in the separable case, converges to a solution that maximizes the minimal ℓ1-margin of the training data, as defined in the boosting literature. An interesting fundamental similarity between boosting and kernel support vector machines emerges, as both can be described as methods for regularized optimization in high-dimensional predictor space, using a computational trick to make the calculation practical, and converging to margin-maximizing solutions. While this statement describes SVMs exactly, it applies to boosting only approximately.
Article
Full-text available
Although cancer classification has improved over the past 30 years, there has been no general approach for identifying new cancer classes (class discovery) or for assigning tumors to known classes (class prediction). Here, a generic approach to cancer classification based on gene expression monitoring by DNA microarrays is described and applied to human acute leukemias as a test case. A class discovery procedure automatically discovered the distinction between acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) without previous knowledge of these classes. An automatically derived class predictor was able to determine the class of new leukemia cases. The results demonstrate the feasibility of cancer classification based solely on gene expression monitoring and suggest a general strategy for discovering and predicting cancer classes for other types of cancer, independent of previous biological knowledge.
Article
Full-text available
Prognostic and predictive factors are indispensable tools in the treatment of patients with neoplastic disease. For the most part, such factors rely on a few specific cell surface, histological, or gross pathologic features. Gene expression assays have the potential to supplement what were previously a few distinct features with many thousands of features. We have developed Bayesian regression models that provide predictive capability based on gene expression data derived from DNA microarray analysis of a series of primary breast cancer samples. These patterns have the capacity to discriminate breast tumors on the basis of estrogen receptor status and also on the categorized lymph node status. Importantly, we assess the utility and validity of such models in predicting the status of tumors in crossvalidation determinations. The practical value of such approaches relies on the ability not only to assess relative probabilities of clinical outcomes for future samples but also to provide an honest assessment of the uncertainties associated with such predictive classifications on the basis of the selection of gene subsets for each validation analysis. This latter point is of critical importance in the ability to apply these methodologies to clinical assessment of tumor phenotype.
Article
Full-text available
Classification of patient samples is an important aspect of cancer diagnosis and treatment. The support vector machine (SVM) has been successfully applied to microarray cancer diagnosis problems. However, one weakness of the SVM is that given a tumor sample, it only predicts a cancer class label but does not provide any estimate of the underlying probability. We propose penalized logistic regression (PLR) as an alternative to the SVM for the microarray cancer diagnosis problem. We show that when using the same set of genes, PLR and the SVM perform similarly in cancer classification, but PLR has the advantage of additionally providing an estimate of the underlying probability. Often a primary goal in microarray cancer diagnosis is to identify the genes responsible for the classification, rather than class prediction. We consider two gene selection methods in this paper, univariate ranking (UR) and recursive feature elimination (RFE). Empirical results indicate that PLR combined with RFE tends to select fewer genes than other methods and also performs well in both cross-validation and test samples. A fast algorithm for solving PLR is also described.
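A hedged sketch of the PLR + RFE combination discussed above is given below using scikit-learn's generic L2-penalized logistic regression and recursive feature elimination; the fast solver described in the paper is not reproduced, and the expression data are synthetic.

```python
# Sketch: penalized logistic regression with recursive feature elimination.
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.standard_normal((72, 2000))                    # samples x genes (synthetic)
y = (X[:, :10].sum(axis=1) + rng.standard_normal(72) > 0).astype(int)

plr = LogisticRegression(penalty="l2", C=0.1, max_iter=5000)
pipe = make_pipeline(
    StandardScaler(),
    RFE(plr, n_features_to_select=16, step=0.5),       # drop half the genes per pass
    LogisticRegression(penalty="l2", C=0.1, max_iter=5000),
)

acc = cross_val_score(pipe, X, y, cv=5).mean()
proba = pipe.fit(X, y).predict_proba(X)[:, 1]          # class probabilities, unlike the SVM
print(f"CV accuracy: {acc:.2f}")
print("first probabilities:", np.round(proba[:5], 2))
```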
Article
Full-text available
A variety of new procedures have been devised to handle the two-sample comparison (e.g., tumor versus normal tissue) of gene expression values as measured with microarrays. Such new methods are required in part because of some defining characteristics of microarray-based studies: (i) the very large number of genes contributing expression measures which far exceeds the number of samples (observations) available and (ii) the fact that by virtue of pathway/network relationships, the gene expression measures tend to be highly correlated. These concerns are exacerbated in the regression setting, where the objective is to relate gene expression, simultaneously for multiple genes, to some external outcome or phenotype. Correspondingly, several methods have been recently proposed for addressing these issues. We briefly critique some of these methods prior to a detailed evaluation of gene harvesting. This reveals that gene harvesting, without additional constraints, can yield artifactual solutions. Results obtained employing such constraints motivate the use of regularized regression procedures such as the lasso, least angle regression, and support vector machines. Model selection and solution multiplicity issues are also discussed. The methods are evaluated using a microarray-based study of cardiomyopathy in transgenic mice.
Article
Background: We propose a new method for supervised learning from gene expression data. We call it 'tree harvesting'. This technique starts with a hierarchical clustering of genes, then models the outcome variable as a sum of the average expression profiles of chosen clusters and their products. It can be applied to many different kinds of outcome measures such as censored survival times, or a response falling in two or more classes (for example, cancer classes). The method can discover genes that have strong effects on their own, and genes that interact with other genes. Results: We illustrate the method on data from a lymphoma study, and on a dataset containing samples from eight different cancers. It identified some potentially interesting gene clusters. In simulation studies we found that the procedure may require a large number of experimental samples to successfully discover interactions. Conclusions: Tree harvesting is a potentially useful tool for exploration of gene expression data and identification of interesting clusters of genes worthy of further investigation.
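A loose sketch of the tree-harvesting idea follows: hierarchically cluster the genes, average expression within clusters, then regress the outcome on a small set of cluster profiles. Interaction products are omitted for brevity, and the cut height, cluster structure, and data are placeholders rather than the authors' settings.

```python
# Sketch: cluster genes, then regress the outcome on cluster average profiles.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
latent = rng.standard_normal((50, 20))                       # 20 hidden gene programs
X = np.repeat(latent, 20, axis=1) + 0.3 * rng.standard_normal((50, 400))
y = latent[:, 0] + 0.5 * rng.standard_normal(50)             # outcome driven by one program

# Hierarchical clustering of genes (columns) with 1 - |correlation| as distance.
corr_dist = 1 - np.abs(np.corrcoef(X.T))
np.fill_diagonal(corr_dist, 0.0)
Z = linkage(squareform(corr_dist, checks=False), method="average")
labels = fcluster(Z, t=0.7, criterion="distance")

# One predictor per cluster: the cluster's average expression profile.
profiles = np.column_stack([X[:, labels == c].mean(axis=1) for c in np.unique(labels)])
fit = LinearRegression().fit(profiles, y)
print("clusters used as predictors:", profiles.shape[1],
      "in-sample R^2:", round(fit.score(profiles, y), 2))
```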
Article
Microarray experiments generate large datasets with expression values for thousands of genes, but not more than a few dozens of samples. A challenging task with these data is to reveal groups of genes which act together and whose collective expression is strongly associated with an outcome variable of interest. To find these groups, we suggest the use of supervised algorithms: these are procedures which use external information about the response variable for grouping the genes. We present Pelora, an algorithm based on penalized logistic regression analysis, that combines gene selection, gene grouping and sample classification in a supervised, simultaneous way. With an empirical study on six different microarray datasets, we show that Pelora identifies gene groups whose expression centroids have very good predictive potential and yield results that can keep up with state-of-the-art classification methods based on single genes. Thus, our gene groups can be beneficial in medical diagnostics and prognostics, but they may also provide more biological insights into gene function and regulation.
Article
Bridge regression, a special family of penalized regressions with penalty function Σ|βj|^γ, γ ≥ 1, is considered. A general approach to solve for the bridge estimator is developed. A new algorithm for the lasso (γ = 1) is obtained by studying the structure of the bridge estimators. The shrinkage parameter γ and the tuning parameter λ are selected via generalized cross-validation (GCV). Comparison between the bridge model (γ ≥ 1) and several other shrinkage models, namely ordinary least squares regression (λ = 0), the lasso (γ = 1) and ridge regression (γ = 2), is made through a simulation study. It is shown that bridge regression performs well compared to the lasso and ridge regression. These methods are demonstrated through an analysis of prostate cancer data. Some computational advantages and limitations are discussed.
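As an illustrative sketch only (not the paper's algorithm), note that for 1 < γ ≤ 2 the bridge penalty is differentiable everywhere, so the penalized least-squares objective can be approximately minimized by plain gradient descent; the step size, λ, and γ below are arbitrary choices.

```python
# Sketch: bridge regression ||y - Xb||^2 + lam * sum(|b|**gamma) by gradient descent.
import numpy as np

def bridge_regression(X, y, lam=1.0, gamma=1.5, lr=1e-3, n_iter=20000):
    """Gradient descent on the bridge-penalized least-squares objective (1 < gamma <= 2)."""
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad_loss = -2 * X.T @ (y - X @ b)
        grad_pen = lam * gamma * np.sign(b) * np.abs(b) ** (gamma - 1)
        b -= lr * (grad_loss + grad_pen)
    return b

rng = np.random.default_rng(5)
X = rng.standard_normal((100, 8))
beta_true = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
y = X @ beta_true + rng.standard_normal(100)

print(np.round(bridge_regression(X, y, lam=5.0, gamma=1.5), 2))
```

In practice λ and γ would be chosen by GCV or cross-validation rather than fixed by hand, and γ = 1 (the lasso) needs a non-smooth solver.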
Article
This discussion concerns the following papers: W. Jiang [Process consistency for AdaBoost. ibid., 13–29 (2004; Zbl 1105.62316)]; G. Lugosi and N. Vayatis [On the Bayes-risk consistency of regularized boosting methods. ibid., 30–55 (2004; Zbl 1105.62319)]; and T. Zhang [Statistical behavior and consistency of classification methods based on convex risk minimization. ibid., 56–85 (2004; Zbl 1105.62323)].
Article
Linear and quadratic discriminant analysis are considered in the small-sample, high-dimensional setting. Alternatives to the usual maximum likelihood (plug-in) estimates for the covariance matrices are proposed. These alternatives are characterized by two parameters, the values of which are customized to individual situations by jointly minimizing a sample-based estimate of future misclassification risk. Computationally fast implementations are presented, and the efficacy of the approach is examined through simulation studies and application to data. These studies indicate that in many circumstances dramatic gains in classification accuracy can be achieved.
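The flavor of the approach can be sketched with scikit-learn's related covariance regularizers, a shrinkage parameter for LDA and a ridge-like reg_param for QDA, tuned by cross-validated accuracy; this is not Friedman's exact two-parameter RDA, and the small-sample data are synthetic.

```python
# Sketch: regularized LDA/QDA with the regularization strength chosen by CV.
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0.0, 1.0, (30, 20)), rng.normal(0.7, 1.5, (30, 20))])
y = np.repeat([0, 1], 30)                       # small n relative to dimension

lda = GridSearchCV(
    LinearDiscriminantAnalysis(solver="lsqr"),
    {"shrinkage": np.linspace(0.0, 1.0, 11)}, cv=5).fit(X, y)
qda = GridSearchCV(
    QuadraticDiscriminantAnalysis(),
    {"reg_param": np.linspace(0.0, 1.0, 11)}, cv=5).fit(X, y)

print("best LDA shrinkage:", lda.best_params_, round(lda.best_score_, 2))
print("best QDA reg_param:", qda.best_params_, round(qda.best_score_, 2))
```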
Article
Chemometrics is a field of chemistry that studies the application of statistical methods to chemical data analysis. In addition to borrowing many techniques from the statistics and engineering literatures, chemometrics itself has given rise to several new data-analytical methods. This article examines two methods commonly used in chemometrics for predictive modeling—partial least squares and principal components regression—from a statistical perspective. The goal is to try to understand their apparent successes and in what situations they can be expected to work well and to compare them with other statistical methods intended for those situations. These methods include ordinary least squares, variable subset selection, and ridge regression.
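For a quick side-by-side of the two chemometrics workhorses discussed above, the sketch below compares principal components regression and partial least squares on synthetic collinear data; the latent structure and component counts are illustrative.

```python
# Sketch: principal components regression (PCR) vs. partial least squares (PLS).
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(7)
latent = rng.standard_normal((80, 3))                       # 3 latent factors
X = latent @ rng.standard_normal((3, 50)) + 0.1 * rng.standard_normal((80, 50))
y = latent @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.standard_normal(80)

pcr = make_pipeline(PCA(n_components=3), LinearRegression())
pls = PLSRegression(n_components=3)

print("PCR CV R^2:", round(cross_val_score(pcr, X, y, cv=5).mean(), 2))
print("PLS CV R^2:", round(cross_val_score(pls, X, y, cv=5).mean(), 2))
```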
Article
  We propose the elastic net, a new regularization and variable selection method. Real world data and a simulation study show that the elastic net often outperforms the lasso, while enjoying a similar sparsity of representation. In addition, the elastic net encourages a grouping effect, where strongly correlated predictors tend to be in or out of the model together. The elastic net is particularly useful when the number of predictors (p) is much bigger than the number of observations (n). By contrast, the lasso is not a very satisfactory variable selection method in the p≫n case. An algorithm called LARS-EN is proposed for computing elastic net regularization paths efficiently, much like algorithm LARS does for the lasso.
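A minimal usage sketch of the elastic net in the p ≫ n setting with a group of highly correlated predictors is shown below; it relies on scikit-learn's implementation rather than the LARS-EN algorithm, and all data are synthetic.

```python
# Sketch: elastic net vs. lasso on correlated predictors with p >> n.
import numpy as np
from sklearn.linear_model import ElasticNetCV, LassoCV

rng = np.random.default_rng(8)
n, p = 50, 200
z = rng.standard_normal((n, 1))
X = rng.standard_normal((n, p))
X[:, :5] = z + 0.05 * rng.standard_normal((n, 5))   # a tight group of 5 correlated predictors
y = 3 * z.ravel() + rng.standard_normal(n)

enet = ElasticNetCV(l1_ratio=[0.3, 0.5, 0.7, 0.9], cv=5).fit(X, y)
lasso = LassoCV(cv=5).fit(X, y)

# The elastic net tends to keep the whole correlated group in or out together;
# the lasso tends to pick only one member of the group.
print("group coefficients (enet): ", np.round(enet.coef_[:5], 2))
print("group coefficients (lasso):", np.round(lasso.coef_[:5], 2))
```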
Article
DNA micro-arrays now permit scientists to screen thousands of genes simultaneously and determine whether those genes are active, hyperactive or silent in normal or cancerous tissue. Because these new micro-array devices generate bewildering amounts of raw data, new analytical methods must be developed to sort out whether cancer tissues have distinctive signatures of gene expression over normal tissues or other types of cancer tissues. In this paper, we address the problem of selection of a small subset of genes from broad patterns of gene expression data, recorded on DNA micro-arrays. Using available training examples from cancer and normal patients, we build a classifier suitable for genetic diagnosis, as well as drug discovery. Previous attempts to address this problem select genes with correlation techniques. We propose a new method of gene selection utilizing Support Vector Machine methods based on Recursive Feature Elimination (RFE). We demonstrate experimentally that the genes selected by our techniques yield better classification performance and are biologically relevant to cancer. In contrast with the baseline method, our method eliminates gene redundancy automatically and yields better and more compact gene subsets. In patients with leukemia our method discovered 2 genes that yield zero leave-one-out error, while 64 genes are necessary for the baseline method to get the best result (one leave-one-out error). In the colon cancer database, using only 4 genes our method is 98% accurate, while the baseline method is only 86% accurate.
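A hedged sketch of SVM-based recursive feature elimination with scikit-learn is shown below, in the spirit of the procedure above; the gene-expression matrix and the target signature size are synthetic placeholders.

```python
# Sketch: SVM-RFE gene selection followed by a linear SVM classifier.
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(9)
X = rng.standard_normal((62, 2000))                     # samples x genes (synthetic)
y = (X[:, :4].sum(axis=1) + 0.5 * rng.standard_normal(62) > 0).astype(int)

svm = SVC(kernel="linear", C=1.0)
pipe = make_pipeline(
    StandardScaler(),
    RFE(svm, n_features_to_select=4, step=0.5),         # drop half the genes per pass
    SVC(kernel="linear", C=1.0),
)
print("CV accuracy:", round(cross_val_score(pipe, X, y, cv=5).mean(), 2))
```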
Article
In model selection, usually a "best" predictor is chosen from a collection $\{\hat{\mu}(\cdot, s)\}$ of predictors, where $\hat{\mu}(\cdot, s)$ is the minimum least-squares predictor in a collection $\mathsf{U}_s$ of predictors. Here s is a complexity parameter; that is, the smaller s, the lower dimensional/smoother the models in $\mathsf{U}_s$. If $\mathsf{L}$ is the data used to derive the sequence $\{\hat{\mu}(\cdot, s)\}$, the procedure is called unstable if a small change in $\mathsf{L}$ can cause large changes in $\{\hat{\mu}(\cdot, s)\}$. With a crystal ball, one could pick the predictor in $\{\hat{\mu}(\cdot, s)\}$ having minimum prediction error. Without prescience, one uses test sets, cross-validation and so forth. The difference in prediction error between the crystal ball selection and the statistician's choice we call predictive loss. For an unstable procedure the predictive loss is large. This is shown by some analytics in a simple case and by simulation results in a more complex comparison of four different linear regression methods. Unstable procedures can be stabilized by perturbing the data, getting a new predictor sequence $\{\hat{\mu}'(\cdot, s)\}$ and then averaging over many such predictor sequences.
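A toy sketch of the stabilization idea, perturbing the data (here by bootstrap resampling), refitting an unstable selection procedure such as the lasso, and averaging the resulting predictors, is given below; all settings and data are illustrative.

```python
# Sketch: stabilizing an unstable procedure by perturb-and-average.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(11)
n, p = 60, 30
X = rng.standard_normal((n, p))
y = X[:, :3] @ np.array([2.0, -1.0, 1.5]) + rng.standard_normal(n)
X_new = rng.standard_normal((10, p))              # points at which to predict

preds = []
for _ in range(50):
    idx = rng.integers(0, n, size=n)              # a perturbed copy of the data
    model = Lasso(alpha=0.1).fit(X[idx], y[idx])
    preds.append(model.predict(X_new))

stabilized = np.mean(preds, axis=0)               # averaged predictor
print(np.round(stabilized, 2))
```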
Article
Serum prostate specific antigen was determined (Yang polyclonal radioimmunoassay) in 102 men before hospitalization for radical prostatectomy. Prostate specimens were subjected to detailed histological and morphometric analysis. Levels of prostate specific antigen were significantly different between patients with and without a Gleason score of 7 or greater (p < 0.001), capsular penetration greater than 1 cm. in linear extent (p < 0.001), seminal vesicle invasion (p < 0.001) and pelvic lymph node metastasis (p < 0.005). Prostate specific antigen was strongly correlated with volume of prostate cancer (r = 0.70). Bivariate and multivariate analyses indicate that cancer volume is the primary determinant of serum prostate specific antigen levels. Prostate specific antigen was elevated 3.5 ng. per ml. for every cc of cancer, a level at least 10 times that observed for benign prostatic hyperplasia.
Article
We propose a new method for supervised learning from gene expression data. We call it 'tree harvesting'. This technique starts with a hierarchical clustering of genes, then models the outcome variable as a sum of the average expression profiles of chosen clusters and their products. It can be applied to many different kinds of outcome measures such as censored survival times, or a response falling in two or more classes (for example, cancer classes). The method can discover genes that have strong effects on their own, and genes that interact with other genes. We illustrate the method on data from a lymphoma study, and on a dataset containing samples from eight different cancers. It identified some potentially interesting gene clusters. In simulation studies we found that the procedure may require a large number of experimental samples to successfully discover interactions. Tree harvesting is a potentially useful tool for exploration of gene expression data and identification of interesting clusters of genes worthy of further investigation.
Article
Microarrays can measure the expression of thousands of genes to identify changes in expression between different biological states. Methods are needed to determine the significance of these changes while accounting for the enormous number of genes. We describe a method, Significance Analysis of Microarrays (SAM), that assigns a score to each gene on the basis of change in gene expression relative to the standard deviation of repeated measurements. For genes with scores greater than an adjustable threshold, SAM uses permutations of the repeated measurements to estimate the percentage of genes identified by chance, the false discovery rate (FDR). When the transcriptional response of human cells to ionizing radiation was measured by microarrays, SAM identified 34 genes that changed at least 1.5-fold with an estimated FDR of 12%, compared with FDRs of 60 and 84% by using conventional methods of analysis. Of the 34 genes, 19 were involved in cell cycle regulation and 3 in apoptosis. Surprisingly, four nucleotide excision repair genes were induced, suggesting that this repair pathway for UV-damaged DNA might play a previously unrecognized role in repairing DNA damaged by ionizing radiation.
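The core SAM mechanics, a moderated per-gene statistic plus label permutations to estimate the number of genes called by chance, can be sketched as below; the fudge constant s0, the threshold, and the synthetic expression matrix are illustrative rather than SAM's actual tuning.

```python
# Sketch: SAM-style scores and a permutation-based false discovery rate estimate.
import numpy as np

def sam_scores(X, labels, s0=0.1):
    """X: genes x samples, labels: 0/1 group membership per sample."""
    a, b = X[:, labels == 0], X[:, labels == 1]
    diff = b.mean(axis=1) - a.mean(axis=1)
    se = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1] + b.var(axis=1, ddof=1) / b.shape[1])
    return diff / (se + s0)                      # s0 damps genes with tiny variance

rng = np.random.default_rng(10)
X = rng.standard_normal((5000, 20))
labels = np.repeat([0, 1], 10)
X[:50, labels == 1] += 1.0                       # 50 truly changed genes

obs = sam_scores(X, labels)
threshold = 3.0
called = np.abs(obs) > threshold

# Permute the labels to estimate how many genes exceed the threshold by chance.
perm_counts = [np.sum(np.abs(sam_scores(X, rng.permutation(labels))) > threshold)
               for _ in range(200)]
fdr = np.median(perm_counts) / max(called.sum(), 1)
print(f"genes called: {called.sum()}, estimated FDR: {fdr:.2%}")
```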
Article
Variable selection is fundamental to high-dimensional statistical modeling, including nonparametric regression. Many approaches in use are stepwise selection procedures, which can be computationally expensive and ignore stochastic errors in the variable selection process. In this article, penalized likelihood approaches are proposed to handle these kinds of problems. The proposed methods select variables and estimate coefficients simultaneously. Hence they enable us to construct confidence intervals for estimated parameters. The proposed approaches are distinguished from others in that the penalty functions are symmetric, nonconcave on (0, ∞), and have singularities at the origin to produce sparse solutions. Furthermore, the penalty functions should be bounded by a constant to reduce bias and satisfy certain conditions to yield continuous solutions. A new algorithm is proposed for optimizing penalized likelihood functions. The proposed ideas are widely applicable. They are readily applied to a variety of parametric models such as generalized linear models and robust regression models. They can also be applied easily to nonparametric modeling by using wavelets and splines. Rates of convergence of the proposed penalized likelihood estimators are established. Furthermore, with proper choice of regularization parameters, we show that the proposed estimators perform as well as the oracle procedure in variable selection; namely, they work as well as if the correct submodel were known. Our simulation shows that the newly proposed methods compare favorably with other variable selection techniques. Furthermore, the standard error formulas are tested to be accurate enough for practical applications.
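One concrete instance of such a penalty is SCAD; its thresholding rule (the penalized least-squares solution for a single coefficient under an orthonormal design, with the conventional a = 3.7) is sketched below to show how small coefficients are set to zero while large ones pass through nearly unbiased.

```python
# Sketch: the SCAD thresholding rule applied elementwise.
import numpy as np

def scad_threshold(z, lam, a=3.7):
    """Elementwise SCAD thresholding of z with penalty level lam."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    small = np.abs(z) <= 2 * lam
    mid = (np.abs(z) > 2 * lam) & (np.abs(z) <= a * lam)
    large = np.abs(z) > a * lam
    out[small] = np.sign(z[small]) * np.maximum(np.abs(z[small]) - lam, 0.0)  # soft-threshold
    out[mid] = ((a - 1) * z[mid] - np.sign(z[mid]) * a * lam) / (a - 2)       # tapered shrinkage
    out[large] = z[large]                                                     # no shrinkage
    return out

z = np.array([-4.0, -1.5, -0.3, 0.2, 1.2, 2.5, 5.0])
print(scad_threshold(z, lam=1.0))
# Small inputs are zeroed (sparsity); large inputs are untouched (reduced bias),
# unlike soft-thresholding, which shrinks every coefficient by lam.
```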
Article
The purpose of model selection algorithms such as All Subsets, Forward Selection, and Backward Elimination is to choose a linear model on the basis of the same set of data to which the model will be applied. Typically we have available a large collection of possible covariates from which we hope to select a parsimonious set for the efficient prediction of a response variable. Least Angle Regression ("LARS"), a new model selection algorithm, is a useful and less greedy version of traditional forward selection methods.
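A small sketch of tracing a LARS path with scikit-learn on the diabetes data is shown below, only to illustrate how the full sequence of solutions can be computed; the dataset and settings are just a convenient example.

```python
# Sketch: computing the least angle regression path with scikit-learn.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import lars_path

X, y = load_diabetes(return_X_y=True)

# method="lar" gives the least angle regression path;
# method="lasso" gives the lasso path via the LARS modification.
alphas, active, coefs = lars_path(X, y, method="lar")

print("number of path breakpoints:", len(alphas))
print("active set, in order of entry:", active)
print("final (least squares) coefficients:", np.round(coefs[:, -1], 1))
```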
Article
The title Lasso has been suggested by Tibshirani [7] as a colourful name for a technique of variable selection which requires the minimization of a sum of squares subject to an ℓ1 bound κ on the solution. This forces zero components in the minimizing solution for small values of κ. Thus this bound can function as a selection parameter. This paper makes two contributions to computational problems associated with implementing the Lasso: (1) a compact descent method for solving the constrained problem for a particular value of κ is formulated, and (2) a homotopy method, in which the constraint bound κ becomes the homotopy parameter, is developed to completely describe the possible selection regimes. Both algorithms have a finite termination property.
Article
We study how close the optimal Bayes error rate can be approximately reached using a classification algorithm that computes a classifier by minimizing a convex upper bound of the classification error function. The measurement of closeness is characterized by the loss function used in the estimation. We show that such a classification scheme can be generally regarded as a (non maximum-likelihood) conditional in-class probability estimate, and we use this analysis to compare various convex loss functions that have appeared in the literature. Furthermore, the theoretical insight allows us to design good loss functions with desirable properties. Another aspect of our analysis is to demonstrate the consistency of certain classification methods using convex risk minimization.