Article · Literature Review

Survival analysis with high-dimensional covariates

Authors: Daniela M. Witten, Robert Tibshirani

Abstract

In recent years, breakthroughs in biomedical technology have led to a wealth of data in which the number of features (for instance, genes on which expression measurements are available) exceeds the number of observations (e.g. patients). Sometimes survival outcomes are also available for those same observations. In this case, one might be interested in (a) identifying features that are associated with survival (in a univariate sense), and (b) developing a multivariate model for the relationship between the features and survival that can be used to predict survival in a new observation. Due to the high dimensionality of this data, most classical statistical methods for survival analysis cannot be applied directly. Here, we review a number of methods from the literature that address these two problems.
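As a concrete illustration of problem (a), univariate screening is typically run one feature at a time. The following is a minimal sketch in Python; the lifelines package, the column names, and the 0.01 threshold are illustrative assumptions rather than anything prescribed by the review.

```python
# Univariate Cox screening: fit one single-gene Cox model per feature and keep
# the genes whose p-value falls below a chosen threshold.
# A sketch only: `df` is assumed to hold "time", "status", and one column per gene.
import pandas as pd
from lifelines import CoxPHFitter

def univariate_cox_screen(df, gene_cols, time_col="time", event_col="status", alpha=0.01):
    records = []
    for g in gene_cols:
        cph = CoxPHFitter()
        cph.fit(df[[time_col, event_col, g]], duration_col=time_col, event_col=event_col)
        records.append((g, cph.summary.loc[g, "coef"], cph.summary.loc[g, "p"]))
    res = pd.DataFrame(records, columns=["gene", "coef", "p"]).set_index("gene")
    return res[res["p"] < alpha].sort_values("p")

# selected = univariate_cox_screen(expr_df, gene_columns)  # hypothetical inputs
```

Problem (b) is then usually handled by fitting a multivariate model (for example a penalized Cox regression) on the retained features; sketches of that step appear alongside the relevant citing works below.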


... Prognostic analysis for survival often employs gene expressions obtained from high-throughput screening of tumor tissues from patients [1-6]. Here, the primary interest is often to find a set of genes that are strong predictors of survival. ...
... In this section, we shall first review gene selection methods [1,9,41] based on survival data. Gene selection is a required step before building a prognostic prediction method. ...
... The approach called univariate selection [1,9] is performed as follows: in the initial step, the significance of each gene is examined by univariate Cox regression, one gene at a time. Then, a subset of significant genes is selected with a p-value threshold, such as 0.05, 0.01, or 0.001. ...
Article
Full-text available
Prognostic analysis for patient survival often employs gene expressions obtained from high-throughput screening for tumor tissues from patients. When dealing with survival data, a dependent censoring phenomenon arises, and thus the traditional Cox model may not correctly identify the effect of each gene. A copula-based gene selection model can effectively adjust for dependent censoring, yielding a multi-gene predictor for survival prognosis. However, methods to assess the impact of various types of dependent censoring on the multi-gene predictor have not been developed. In this article, we propose a sensitivity analysis method using the copula-graphic estimator under dependent censoring, and implement relevant methods in the R package “compound.Cox”. The purpose of the proposed method is to investigate the sensitivity of the multi-gene predictor to a variety of dependent censoring mechanisms. In order to make the proposed sensitivity analysis practical, we develop a web application. We apply the proposed method and the web application to a lung cancer dataset. We provide a template file so that developers can modify the template to establish their own web applications.
... Bilyk and Cheng (2014) [63] used a multidimensional scaling algorithm based on PCA to analyse differential gene expression of Pagothenia borchgrevinki under heat stress. This approach was expanded in Bilyk et al. (2018) [64], with the use of a Generalized Linear Model (GLM) to analyse gene expression after performing the multidimensional scaling. ...
... By contrast, the primary genes detected for the liver are lipase genes (LPL and LIPG), which are involved in breakdown of lipids to make fatty acids available to other tissues during acclimation [63]. Also present is the PMM2 gene, involved in breakdown of simple sugars as a further energy source, and the ENO1 gene, which is part of the glycolytic pathway, and thus suggests reduced oxygen during some or all of the acclimation process [63]. ...
Thesis
This thesis presents new computational methods to analyse both short and long-term effects of temperature increase on biological systems. First, we consider the problem of acclimation of an organism to increased temperatures on short timescales. We develop a novel method of network regression, AccliNet, based on the acclimation times, which takes into account prior knowledge of functional links between genes to improve the performance of the algorithm. The results obtained by AccliNet are compared with the performance of existing algorithms and are shown to be an improvement in this area. Next, we delve deeper into the metabolic response of the organism to changing temperatures, and develop methods to model and simulate the fluxes of metabolites occurring through a metabolic network. In particular, we construct a simplified model of aerobic respiration for an Antarctic species, and, given a gene expression dataset across different temperatures, we develop two different machine learning approaches to model the fluxes through the metabolic network. The first approach we use is based on denoising autoencoders. The performance of this method is compared to a traditional Bayesian inference approach and found to have higher accuracy. Next, we develop a different machine learning approach to model the unknown data distributions, in this case using a Generative Adversarial Network (GAN) to learn an SDE path through the sampled data points. The performance of this method is compared to the earlier autoencoder approach, as well as to other algorithms. The GAN method is found to have similar accuracy but less robustness to noise than the autoencoder approach. Lastly, we also consider the long-term effects of changing temperatures on biological systems. In particular, we develop a novel package for phylogenetic analysis, called PhylSim, which allows simulations and studies of adaptation and evolution under different scenarios of climate change. We apply the package to the case of adaptation of Antarctic species to their environment in recent evolutionary history. The work in this thesis was carried out in collaboration with the British Antarctic Survey, and used genetic datasets of Antarctic organisms, although the methods developed here are general and can be readily applied to other datasets as well. Thus, the proposed modeling framework holds some promise for tackling important problems in the future, in areas ranging from bioinformatics to environmental science.
... In the high-dimensional setting (p > n), classical statistical methods cannot be applied directly to predict survival (Witten and Tibshirani, 2010). Therefore, many methods have been applied to identify a subset of informative methylation features. ...
... The univariate Cox proportional hazards regression model is a straightforward univariate feature selection method to find the methylation sites significantly associated with the survival of patients. Each feature is fitted with the model, and its relevance to the survival outcome can be measured by the p-value of the likelihood ratio test (De Bin et al., 2014; Hu and Zhou, 2017) or the Cox score (Beer et al., 2002; Witten and Tibshirani, 2010; Jiang et al., 2020). After the ranking of feature significance is obtained, the best combination of k features will be selected. ...
... To choose the tuning parameter k, different strategies, such as fitting multivariate Cox proportional hazards regression models (De Bin et al., 2014) or permutation-based false discovery rate approaches (Beer et al., 2002; Witten and Tibshirani, 2010), can be applied. In practice, the univariate Cox model combined with the multivariate Cox model is widely used to identify the optimal subset of prognostic CpG sites (Guo et al., 2019; Yang et al., 2019; Peng et al., 2020; Zhang et al., 2020). ...
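One common way to choose the tuning parameter k, sketched below under assumed inputs (a data frame `df`, a p-value-ranked gene list, and lifelines), is to score each candidate k by the cross-validated concordance index of a multivariate Cox model fitted on the top-k genes; the 5-fold split and the small ridge penalizer are illustrative choices.

```python
# Score a candidate k by the cross-validated C-index of a Cox model on the
# top-k ranked genes; the k with the best score is then selected.
import numpy as np
from sklearn.model_selection import KFold
from lifelines import CoxPHFitter
from lifelines.utils import concordance_index

def cv_cindex_for_k(df, ranked_genes, k, time_col="time", event_col="status",
                    n_splits=5, penalizer=0.1):
    cols = list(ranked_genes[:k]) + [time_col, event_col]
    scores = []
    for tr_idx, te_idx in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(df):
        train, test = df.iloc[tr_idx][cols], df.iloc[te_idx][cols]
        cph = CoxPHFitter(penalizer=penalizer)  # small ridge penalty for numerical stability
        cph.fit(train, duration_col=time_col, event_col=event_col)
        risk = cph.predict_partial_hazard(test)
        # higher risk should correspond to shorter survival, hence the minus sign
        scores.append(concordance_index(test[time_col], -risk, test[event_col]))
    return np.mean(scores)

# best_k = max(range(1, 51), key=lambda k: cv_cindex_for_k(df, ranked_genes, k))
```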
Article
Full-text available
Developing cancer prognostic models using multiomics data is a major goal of precision oncology. DNA methylation provides promising prognostic biomarkers, which have been used to predict survival and treatment response in solid tumor or plasma samples. This review article presents an overview of recently published computational analyses on DNA methylation for cancer prognosis. To address the challenges of survival analysis with high-dimensional methylation data, various feature selection methods have been applied to screen a subset of informative markers. Using candidate markers associated with survival, prognostic models either predict risk scores or stratify patients into subtypes. The model's discriminatory power can be assessed by multiple evaluation metrics. Finally, we discuss the limitations of existing studies and present the prospects of applying machine learning algorithms to fully exploit the prognostic value of DNA methylation.
... We consider models which depend on the feature vectors characterizing the objects whose times to an event are studied, for example, patients or system structures. Many survival models are based on the Cox proportional hazards method, or the Cox model [4], which estimates the effects of observed covariates on the risk of an event occurring under the assumption that the hazard depends on a linear combination of the instance covariates. The success of the Cox model in many applied tasks has motivated generalizations of the model that consider non-linear functions of covariates or impose regularization constraints on the model parameters [5,6,7,8]. To improve survival analysis and to enhance the prediction accuracy of machine learning survival models, other approaches have been developed, including the SVM approach [9], random survival forests (RSF) [10,11,12,13,14], and survival neural networks [5,6,15,8]. ...
... The third group consists of combinations of models from the first two groups. Examples of these models are modifications of the Cox model that relax the linear relationship assumption accepted in the Cox model, for instance by replacing the linear predictor with neural networks [66,5], and the Lasso modifications [67,68,7]. ...
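For reference, the Cox model that these excerpts repeatedly build on specifies the hazard through a baseline hazard and a linear predictor, with the coefficients estimated by maximizing the partial likelihood (standard notation, with event indicator δᵢ and risk set R(tᵢ)):

```latex
h(t \mid x_i) = h_0(t)\,\exp\!\left(x_i^{\top}\beta\right),
\qquad
L(\beta) = \prod_{i:\,\delta_i = 1}
  \frac{\exp\!\left(x_i^{\top}\beta\right)}
       {\sum_{j \in R(t_i)} \exp\!\left(x_j^{\top}\beta\right)}.
```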
Preprint
An explanation method called SurvBeX is proposed to interpret predictions of machine learning survival black-box models. The main idea behind the method is to use a modified Beran estimator as the surrogate explanation model. The coefficients incorporated into the Beran estimator can be regarded as values of the feature impacts on the black-box model prediction. Following the well-known LIME method, many points are generated in a local area around an example of interest. For every generated example, the survival function of the black-box model is computed, and the survival function of the surrogate model (the Beran estimator) is constructed as a function of the explanation coefficients. In order to find the explanation coefficients, it is proposed to minimize the mean distance between the survival functions of the black-box model and the Beran estimator produced by the generated examples. Many numerical experiments with synthetic and real survival data demonstrate the efficiency of SurvBeX and compare the method with the well-known SurvLIME method. The method is also compared with SurvSHAP. The code implementing SurvBeX is available at: https://github.com/DanilaEremenko/SurvBeX
... The Cox proportional hazard model [9] is one of the most popular approaches in medicine to link covariates to survival data. When considering the number of covariates, p (which can typically be 20,000 gene products), in relation to the number of patients in the databases, n, and so the number of events e (which can typically be only a few hundred), various issues occur due to the high dimensionality [10], which include the lack of stability of the selected genes [11] and over-fitting [12]. This p ≫ e problem is referred to as the 'curse of dimensionality' . ...
... The issues are aggravated when integrating multi-omics data [13], which is a research area of growing interest [14,15]. Among many, there are two main distinct strategies to tackle issues arising from high dimensionality, both of which aim to reduce the number of variables considered: screening procedures and penalization methods [10,16]. ...
Article
Full-text available
Background Prediction of patient survival from tumor molecular ‘-omics’ data is a key step toward personalized medicine. Cox models performed on RNA profiling datasets are popular for clinical outcome predictions. But these models are applied in the context of “high dimension”, as the number p of covariates (gene expressions) greatly exceeds the number n of patients and e of events. Thus, pre-screening together with penalization methods is widely used for dimension reduction. Methods In the present paper, (i) we benchmark the performance of the lasso penalization and three variants (i.e., ridge, elastic net, adaptive elastic net) on 16 cancers from TCGA after pre-screening, (ii) we propose a bi-dimensional pre-screening procedure based on both gene variability and p-values from single variable Cox models to predict survival, and (iii) we compare our results with iterative sure independence screening (ISIS). Results First, we show that integration of mRNA-seq data with clinical data improves predictions over clinical data alone. Second, our bi-dimensional pre-screening procedure improves the C-index and/or the integrated Brier score only moderately, while excluding genes irrelevant for prediction. We demonstrate that the different penalization methods reached comparable prediction performances, with slight differences among datasets. Finally, we provide advice in the case of multi-omics data integration. Conclusions Tumor profiles convey more prognostic information than clinical variables such as stage for many cancer subtypes. Lasso and Ridge penalizations perform similarly to Elastic Net penalizations for Cox models in high dimension. Pre-screening of the top 200 genes in terms of single variable Cox model p-values is a practical way to reduce dimension, which may be particularly useful when integrating multi-omics.
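A hedged sketch of this kind of pipeline (bi-dimensional pre-screening followed by a penalized Cox fit) is given below using scikit-survival; the 1000/200 cutoffs, the elastic-net mixing value, and the `univariate_pvalues` helper (in the spirit of the screening sketch earlier) are illustrative assumptions, not the authors' implementation.

```python
# Pre-screen genes by variance and by univariate Cox p-value, then fit an
# elastic-net-penalized Cox model on the retained genes.
# Assumed inputs: X (n x p pandas DataFrame of expressions), time/event arrays.
import numpy as np
from sksurv.util import Surv
from sksurv.linear_model import CoxnetSurvivalAnalysis

# Dimension 1: keep the most variable genes.
top_var = X.var(axis=0).sort_values(ascending=False).index[:1000]
# Dimension 2: among those, keep the 200 smallest univariate Cox p-values.
pvals = univariate_pvalues(X[top_var], time, event)   # hypothetical helper, see earlier sketch
top_genes = pvals.sort_values().index[:200]

y = Surv.from_arrays(event=event.astype(bool), time=time)
enet = CoxnetSurvivalAnalysis(l1_ratio=0.5)           # 0 < l1_ratio < 1 gives the elastic net
enet.fit(X[top_genes].values, y)
# genes with a nonzero coefficient anywhere along the fitted regularization path
n_selected = int(np.sum(np.any(enet.coef_ != 0, axis=1)))
```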
... The development of high-throughput sequencing technologies has led to the production of large-scale molecular profiling data, allowing us to gain insights into underlying biological processes (Widłak, 2013). One such technology is microarray sequencing, in which mRNA counts are used to describe gene expression. ...
... In recent years, several methods have been proposed to analyze sparse high-dimensional data, with one of the most popular being the LASSO (Tibshirani, 1996). As biomedical studies are often concerned with clinical phenotypes, such as time to disease recurrence or overall survival time, these methods have been adapted to support survival analysis (Antoniadis et al., 2010;Witten and Tibshirani, 2010). For instance, the LASSO, ridge and elastic-net penalties have all been extended to the PHM (Gui and Li, 2005;Simon et al., 2011;Tibshirani, 1997;Zou and Hastie, 2005). ...
Article
Full-text available
Motivation Few Bayesian methods for analyzing high-dimensional sparse survival data provide scalable variable selection, effect estimation and uncertainty quantification. Such methods often either sacrifice uncertainty quantification by computing maximum a posteriori estimates, or quantify the uncertainty at high (unscalable) computational expense. Results We bridge this gap and develop an interpretable and scalable Bayesian proportional hazards model for prediction and variable selection, referred to as SVB. Our method, based on a mean-field variational approximation, overcomes the high computational cost of MCMC whilst retaining useful features, providing a posterior distribution for the parameters and offering a natural mechanism for variable selection via posterior inclusion probabilities. The performance of our proposed method is assessed via extensive simulations and compared against other state-of-the-art Bayesian variable selection methods, demonstrating comparable or better performance. Finally, we demonstrate how the proposed method can be used for variable selection on two transcriptomic datasets with censored survival outcomes, and how the uncertainty quantification offered by our method can be used to provide an interpretable assessment of patient risk. Availability and implementation our method has been implemented as a freely available R package survival.svb (https://github.com/mkomod/survival.svb). Supplementary information Supplementary materials are available at Bioinformatics online.
... Compared with KM, Cox is a regression model. Instead of directly using survival time, Cox uses hazard as a dependent variable to find the contribution of a particular risk factor to the occurrence of the outcome event (Witten & Tibshirani, 2010). The survival tree method is suitable for extensive cohort data with many variables, and the tree structure graph is more intuitive and easier to understand (Emura et al., 2023). ...
Article
Full-text available
Gastric cardia cancer is a high-incidence malignant tumour, which seriously endangers human health and life safety. The patient prognosis of gastric cardia cancer is affected by diet, physical condition, regional environment, medical history and other factors. Traditional prediction methods cannot fully reflect the prognosis characteristics and survival risks of all patients. Therefore, this paper proposes a data-driven method for the survival risk of gastric cardia cancer based on an adaptive particle swarm optimization algorithm (APSO) and a dynamic modular neural network (DMNN). First, the article uses density clustering to cluster 293 patients’ blood characteristics and generate different sub-networks. Second, the weight is calculated through the APSO algorithm and the sub-network output is obtained by the integration algorithm. Finally, the effectiveness of this network is verified through a 50% cross-validation of training sets and test sets. The results show that the survival prediction based on the APSO-DMNN data-driven method shows good classification performance and accuracy.
... The genes initially screened using the univariate feature selection were screened again using the multivariate feature selection. For both screening analyses a p-value <0.01 was considered statistically significant [21,22]. Subsequently, we performed a third screening of the obtained prognosis-related genes using Kaplan-Meier analysis and finally identified the prognosis-associated TAM-related genes. ...
Article
Full-text available
Tumor-associated macrophages (TAMs) play a crucial role in lung adenocarcinoma (LUAD), where they can cause the proliferation, migration and invasion of tumor cells. In particular, TAMs mainly regulate changes in the tumor microenvironment, thereby contributing to tumorigenesis and progression. Recently, an increasing number of studies are using single-cell RNA (Sc-RNA) sequencing to investigate changes in the composition and transcriptomics of the tumor microenvironment. We obtained Sc-RNA sequencing data of LUAD from the GEO database and transcriptome data with clinical information of LUAD patients from the TCGA database. A group of important genes in the state transition of TAMs was identified by analyzing TAMs at the single-cell level, while 5 TAM-related prognostic genes were obtained by omics data integration, and a prognostic model was constructed. GOBP analysis revealed that TAM-related genes were mainly enriched in tumor-promoting and immunosuppression-related pathways. After ROC analysis, it was found that the AUC of the prognosis model reached 0.751, with good predictive effectiveness. The 5 unique genes, HLA-DMB, HMGN3, ID3, PEBP1, and TUBA1B, were finally identified through synthesized analysis. The transcriptional characteristics of the 5 genes were determined through the GEPIA2 database and RT-qPCR. The increased expression of TUBA1B in advanced LUAD may serve as a prognostic indicator, while low expression of PEBP1 in LUAD may have the potential to become a therapeutic target.
... They can be conditionally divided into two groups. The first group retains the linear relationship of the covariates and includes various modifications of the Lasso models [38]. The second group of models relaxes the linear relationship assumption accepted in the Cox model [39]. ...
Article
Full-text available
A method for estimating the conditional average treatment effect under the condition of censored time-to-event data, called BENK (the Beran Estimator with Neural Kernels), is proposed. The main idea behind the method is to apply the Beran estimator for estimating the survival functions of controls and treatments. Instead of typical kernel functions in the Beran estimator, it is proposed to implement kernels in the form of neural networks of a specific form, called neural kernels. The conditional average treatment effect is estimated by using the survival functions as outcomes of the control and treatment neural networks, which consist of a set of neural kernels with shared parameters. The neural kernels are more flexible and can accurately model a complex location structure of feature vectors. BENK does not require a large dataset for training due to its special way for training networks by means of pairs of examples from the control and treatment groups. The proposed method extends a set of models that estimate the conditional average treatment effect. Various numerical simulation experiments illustrate BENK and compare it with the well-known T-learner, S-learner and X-learner for several types of control and treatment outcome functions based on the Cox models, the random survival forest and the Beran estimator with Gaussian kernels. The code of the proposed algorithms implementing BENK is publicly available.
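For context, the Beran estimator used here as a baseline is the kernel-weighted (conditional) Kaplan–Meier estimator. A minimal NumPy sketch with Gaussian Nadaraya–Watson weights follows; the bandwidth and variable names are illustrative, and this is not the BENK implementation itself.

```python
# Beran (conditional Kaplan-Meier) estimator with Gaussian kernel weights.
# Assumed inputs: X_train (n x d array), times (n,), events (n, values 0/1),
# a query point x, and a bandwidth h.
import numpy as np

def beran_survival(x, X_train, times, events, h=1.0):
    order = np.argsort(times)
    X_s, t_s, d_s = X_train[order], times[order], events[order]
    # Nadaraya-Watson weights from a Gaussian kernel on the covariates
    w = np.exp(-np.sum((X_s - x) ** 2, axis=1) / (2.0 * h ** 2))
    w = w / w.sum()
    # cumulative weight of strictly earlier observations
    cum_w = np.concatenate(([0.0], np.cumsum(w)[:-1]))
    # product-limit factors: censored observations contribute a factor of 1
    factors = np.where(d_s == 1, 1.0 - w / np.maximum(1.0 - cum_w, 1e-12), 1.0)
    return t_s, np.cumprod(factors)   # survival curve evaluated at the sorted times

# t_grid, S_hat = beran_survival(x_query, X_train, times, events, h=0.5)
```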
... This challenge also extends to Cox regression for time-to-event data. Studies [56][57][58] have emphasized the importance of developing parsimonious Cox models. We propose a feature selection approach to optimize the Cox Proportional Hazard model, consisting of several steps outlined in Table 4. Firstly, we pruned the highly correlated features. ...
Article
Full-text available
This study addresses the limited non-invasive tools for Oral Cavity Squamous Cell Carcinoma (OSCC) survival prediction by identifying Computed Tomography (CT)-based biomarkers to improve prognosis prediction. A retrospective analysis was conducted on data from 149 OSCC patients, including CT radiomics and clinical information. An ensemble approach involving correlation analysis, score screening, and the Sparse-L1 algorithm was used to select functional features, which were then used to build Cox Proportional Hazards models (CPH). Our CPH achieved a 0.70 concordance index in testing. The model identified two CT-based radiomics features, Gradient-Neighboring-Gray-Tone-Difference-Matrix-Strength (GNS) and normalized-Wavelet-LLL-Gray-Level-Dependence-Matrix-Large-Dependence-High-Gray-Level-Emphasis (HLE), as well as stage and alcohol usage, as survival biomarkers. The GNS group with values above 14 showed a hazard ratio of 0.12 and a 3-year survival rate of about 90%. Conversely, the GNS group with values less than or equal to 14 had a 49% survival rate. For normalized HLE, the high-end group (HLE > −0.415) had a hazard ratio of 2.41, resulting in a 3-year survival rate of 70%, while the low-end group (HLE ≤ −0.415) had a 36% survival rate. These findings contribute to our knowledge of how radiomics can be used to predict the outcome so that treatment plans can be tailored for patients with OSCC to improve their survival.
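The correlation-pruning step mentioned in the preceding excerpt (dropping one feature from each highly correlated pair before fitting the Cox model) is commonly implemented greedily, e.g. as sketched below; the 0.9 cutoff and the greedy rule are illustrative choices, not the authors' exact procedure.

```python
# Greedy pruning of highly correlated features: scan the upper triangle of the
# absolute correlation matrix and drop one feature of every pair whose
# correlation exceeds a cutoff. `X` is assumed to be a pandas DataFrame.
import numpy as np
import pandas as pd

def prune_correlated(X: pd.DataFrame, cutoff: float = 0.9) -> pd.DataFrame:
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > cutoff).any()]
    return X.drop(columns=to_drop)

# X_pruned = prune_correlated(radiomics_features, cutoff=0.9)  # hypothetical input
```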
... We applied univariate Cox regression analysis to preliminarily screen crucial prognostic genes from the previously identified DEGs [68]. Then, we conducted LASSO analysis using the “glmnet” R package to select the crucial genes in the TCGA cohort [69]. We used 10-fold cross-validation to estimate the confidence interval for each lambda and subsequently identified the optimal lambda with the lowest average error. ...
Article
Full-text available
Background: Clear cell renal cell carcinoma (ccRCC) is a common urinary cancer. Although diagnostic and therapeutic approaches for ccRCC have been improved, the survival outcomes of patients with advanced ccRCC remain unsatisfactory. Fatty acid metabolism (FAM) has been increasingly recognized as a critical modulator of cancer development. However, the significance of FAM in ccRCC remains unclear. Herein, we explored the function of a FAM-related risk score in the stratification and prediction of treatment responses in patients with ccRCC. Methods: First, we applied an unsupervised clustering method to categorize patients from The Cancer Genome Atlas (TCGA) and International Cancer Genome Consortium (ICGC) datasets into subtypes and retrieved FAM-related genes from the MSigDB database. We discerned differentially expressed genes (DEGs) among the different subtypes. Then, we applied univariate Cox regression analysis followed by least absolute shrinkage and selection operator (LASSO) linear regression based on DEG expression to establish a FAM-related risk score for ccRCC. Results: We stratified the three ccRCC subtypes based on FAM-related genes with distinct overall survival (OS), clinical features, immune infiltration patterns, and treatment sensitivities. We screened nine genes from the FAM-related DEGs in the three subtypes to establish a risk prediction model for ccRCC. Nine FAM-related genes were differentially expressed in the ccRCC cell line ACHN compared to the normal kidney cell line HK2. High-risk patients had worse OS, higher genomic heterogeneity, a more complex tumor microenvironment (TME), and elevated expression of immune checkpoints. This phenomenon was validated in the ICGC cohort. Conclusion: We constructed a FAM-related risk score that predicts the prognosis and therapeutic response of ccRCC. The close association between FAM and ccRCC progression lays a foundation for further exploring FAM-related functions in ccRCC.
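The last step of a risk-score model of this kind (turning the fitted coefficients into a per-patient score and testing the high/low-risk split) typically looks like the sketch below; the lifelines implementation, the median cutoff, and the column names are assumptions rather than the authors' code.

```python
# Derive a linear-predictor risk score from an already-fitted lifelines Cox model
# `cph`, split patients at the median score, and compare the groups with
# Kaplan-Meier curves and a log-rank test. `df` holds covariates plus
# "time"/"status" columns (assumed names).
import numpy as np
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

risk = cph.predict_log_partial_hazard(df)   # risk score = centered linear predictor
high = risk > np.median(risk)

km_hi, km_lo = KaplanMeierFitter(), KaplanMeierFitter()
km_hi.fit(df.loc[high, "time"], df.loc[high, "status"], label="high risk")
km_lo.fit(df.loc[~high, "time"], df.loc[~high, "status"], label="low risk")

res = logrank_test(df.loc[high, "time"], df.loc[~high, "time"],
                   event_observed_A=df.loc[high, "status"],
                   event_observed_B=df.loc[~high, "status"])
print(res.p_value)   # significance of the high- vs low-risk separation
```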
... From a frequentist point of view and in the context of high-dimensional data, the Lasso or ℓ1 regularization, proposed by Tibshirani (1996, 1997), has been a popular choice. An overview of time-to-event analysis with high-dimensional data more generally can be found in Witten and Tibshirani (2010). Computationally efficient implementations for the Cox model under penalization have been developed by Friedman, Hastie and Tibshirani (2010), Simon et al. (2011), Mittal et al. (2014), and Yang and Zou (2013). ...
... Recent years have seen the spawning of high-dimensional data (i.e. data with a large number of features) as biotechnology develops rapidly (Assent, 2012;Witten et al., 2010). For instance, metabolomics, immunomics and metagenomics are becoming prevalent. ...
Article
Full-text available
We introduce LongDat, an R package that analyzes longitudinal multivariable (cohort) data while simultaneously accounting for a potentially large number of covariates. The primary use case is to differentiate direct from indirect effects of an intervention (or treatment) and to identify covariates (potential mechanistic intermediates) in longitudinal data. LongDat focuses on analyzing longitudinal microbiome data, but its usage can be expanded to other data types, such as binary, categorical, and continuous data. We tested and compared LongDat with other tools (i.e., MaAsLin2, ANCOM, lgpr, and ZIBR) on both simulated and real data. We showed that LongDat outperformed these tools in accuracy, runtime, and memory cost, especially when there were multiple covariates. The results indicate that the LongDat R package is a computationally efficient and low-memory-cost tool for longitudinal data with multiple covariates and facilitates robust biomarker searches in high-dimensional datasets. Availability and Implementation The R package LongDat is available on CRAN (https://cran.r-project.org/web/packages/LongDat/) and GitHub (https://github.com/CCY-dev/LongDat). Supplementary information Supplementary data are available at Bioinformatics Advances online.
... The survival data used in this work are right-censored data, which cannot be used directly with the aforementioned methods. Witten and Tibshirani [21] reviewed survival models for high-dimensional data, but effective dimensionality reduction methods are still to be discovered. Recently, variable screening methods for survival models have been greatly enhanced, such as the Kolmogorov-Smirnov-based survival data variable screening method proposed by Liu et al. [22], the joint screening method for right-censored ultrahigh-dimensional data proposed by Liu et al. [23], and the feature screening method for semi-competing risks outcomes proposed by Peng and Xiang [24]. ...
Article
High-dimensional covariates in lifetime data are a challenge in survival analysis, especially with gene expression profiles. The objective of this paper is to propose an efficient algorithm to extend the generalized additive model to survival data with high-dimensional covariates. The algorithm combines the generalized additive model (GAM) with Buckley–James estimation, making a nonparametric extension to the nonlinear model: the GAM is exploited to capture the nonlinear effects of the covariates, while the Buckley–James estimation is used to address the regression model with a right-censored response. In addition, we use maximal-information-coefficient (MIC)-type variable screening and weighted p-values to reduce the dimension in high-dimensional situations. The performance of the proposed algorithm is compared with three benchmark models (the Cox proportional hazards regression model, random survival forest, and BJ-AFT) on a simulated dataset and two real survival datasets. The results, evaluated by the concordance index (C-index) as well as a modified mean squared error (mMSE), illustrate the superiority of the proposed algorithm.
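For readers unfamiliar with it, the Buckley–James step replaces each right-censored response by its estimated conditional expectation before the regression (here, the GAM) is refitted, and the two steps are iterated. In the standard formulation (sketched from the general literature, not this paper's notation), with event indicator δᵢ, residuals eᵢ = Yᵢ − xᵢᵀβ̂, and the Kaplan–Meier estimator F̂ of the residual distribution with jumps ŵⱼ, the imputed response is

```latex
\hat{Y}_i \;=\; \delta_i\,Y_i \;+\; (1-\delta_i)
\left[\, x_i^{\top}\hat{\beta} \;+\;
\frac{\sum_{j:\, e_j > e_i} \hat{w}_j\, e_j}{1 - \hat{F}(e_i)} \right],
\qquad e_i = Y_i - x_i^{\top}\hat{\beta}.
```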
... This method was implemented in line with that outlined by Beer et al. in their assessment of gene expression and the prediction of lung cancer [32]. Such multiple testing methods can be prone to high false positive rates [33], and FDR p-value correction is a popular method used to mitigate this risk. All traditional CPH models underwent further feature selection by means of forward stepwise selection, with a maximum of 10 variables chosen for the models with mRNA variables. ...
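The FDR correction referred to in this excerpt is typically the Benjamini–Hochberg procedure; a minimal sketch with statsmodels, assuming an array of per-gene univariate p-values, is:

```python
# Benjamini-Hochberg FDR correction of univariate p-values.
# `pvals` and `gene_names` are assumed inputs (e.g. from single-gene Cox models).
from statsmodels.stats.multitest import multipletests

reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
selected = [g for g, keep in zip(gene_names, reject) if keep]
```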
Article
Full-text available
Simple Summary Prostate cancer is among the most prevalent cancers for men globally, accounting for 13% of cancer diagnoses in the male population each year. Surgical intervention is the primary treatment option but fails in up to 40% of patients, who experience biochemical recurrence (BCR). Determining the likelihood of recurrence and the length of time between surgery and BCR is critical for patient treatment decision-making. Traditional predictive models exploit routine clinical variables such as cancer stage, and may be improved upon by leveraging other accessible information about the patient. This study considers including patient-specific genomic data to identify relevant additional predictors of BCR-free survival, which requires the use of modern machine learning techniques. The results of this study indicate that including such genomic data leads to a gain in BCR prediction performance over models using clinical variables only. Abstract Predicting the risk of, and time to biochemical recurrence (BCR) in prostate cancer patients post-operatively is critical in patient treatment decision pathways following surgical intervention. This study aimed to investigate the predictive potential of mRNA information to improve upon reference nomograms and clinical-only models, using a dataset of 187 patients that includes over 20,000 features. Several machine learning methodologies were implemented for the analysis of censored patient follow-up information with such high-dimensional genomic data. Our findings demonstrated the potential of inclusion of mRNA information for BCR-free survival prediction. A random survival forest pipeline was found to achieve high predictive performance with respect to discrimination, calibration, and net benefit. Two mRNA variables, namely ESM1 and DHAH8, were identified as consistently strong predictors with this dataset.
... From a frequentist point of view, in the context of high-dimensional data, the Lasso or ℓ1 regularization proposed by Tibshirani (1996, 1997) has been a popular choice. An overview of time-to-event analysis with high-dimensional data more generally can be found in Witten and Tibshirani (2010). While these methods can handle large-scale datasets, they cannot incorporate time-varying covariates measured at different time points, nor do they provide standard errors or other measures of uncertainty. ...
Thesis
The digital transformation of health care provides new opportunities to study disease and gain unprecedented insights into the underlying biology. With the wealth of data generated, new statistical challenges arise. This thesis will address some of them, with a particular focus on Time-to-Event analysis. The Cox hazard model, one of the most widely used statistical tools in biomedicine, is extended to analyses for large-scale and high-dimensional data sets. Built on recent machine learning frameworks, the approach scales readily to big data settings. The method is extensively evaluated in simulation and case studies, showcasing its applicability to different data modalities, ranging from hospital admission episodes to histopathological images of tumour resections. The motivating application of this thesis is electronic health records (EHR), collections of various interlinked data at an individual level. With many countries starting to implement national health data resources, methods that can cope with these datasets become paramount. In particular, cancers could benefit significantly from these developments. The lifetime risk of developing a malignancy is around 50%. However, the associated risks are not equally distributed, with large differences between individuals. Hence, being able to utilise the data available in EHR could potentially help to stratify individuals by their risk profiles and screen or even intervene early. The proposed method is used to build a predictive model for 20 primary cancer sites based on clinical disease histories, basic health parameters, and family histories covering 6.7 million Danish individuals over a combined 193 million life years. The obtained risk score can predict cancer incidence across most organ sites. Further, the information could potentially be used to create cohorts with similar efficiency while screening earlier, creating the possibility for risk-targeted screening programs. Additionally, the obtained result could also be transferred between health care systems, as shown here between Denmark and the UK. Taken together, the thesis establishes a method to analyse the extensive amounts of data being generated nowadays, as well as an evaluation of the potential these data sources can have in the context of cancer risk.
... They can be conditionally divided into two groups. The first group retains the linear relationship of the covariates and includes various modifications of the Lasso models [59,60]. The second group of models relaxes the linear relationship assumption accepted in the Cox model [61,62]. ...
Preprint
A method for estimating the conditional average treatment effect under the condition of censored time-to-event data, called BENK (the Beran Estimator with Neural Kernels), is proposed. The main idea behind the method is to apply the Beran estimator for estimating the survival functions of controls and treatments. Instead of typical kernel functions in the Beran estimator, it is proposed to implement kernels in the form of neural networks of a specific form, called neural kernels. The conditional average treatment effect is estimated by using the survival functions as outcomes of the control and treatment neural networks, which consist of a set of neural kernels with shared parameters. The neural kernels are more flexible and can accurately model a complex location structure of feature vectors. Various numerical simulation experiments illustrate BENK and compare it with the well-known T-learner, S-learner and X-learner for several types of control and treatment outcome functions based on the Cox models, the random survival forest and the Nadaraya-Watson regression with Gaussian kernels. The code of the proposed algorithms implementing BENK is available at https://github.com/Stasychbr/BENK.
... Mat. Table S1), including among others over-fitting: in some cases, the dimension of x_i equals, or even exceeds, the number of uncensored instances [12,13,14,15,16,17]. ...
Preprint
Full-text available
Survival analysis (SA) prediction involves the prediction of the time until an event of interest occurs (TTE), based on input attributes. The main challenge of SA is instances where the event is not observed (censored). Censoring can represent an alternative event (e.g., death) or missing data. Most SA prediction methods suffer from multiple drawbacks that limit the usage of advanced machine learning methods: A) Simplistic models, B) Ignoring the input of the censored samples, C) No separation between the model and the loss function, and D) Typical small datasets and high input dimensions. We propose a loss function, denoted suRvival Analysis lefT barrIer lOss (RATIO), that explicitly incorporates the censored samples input in the prediction, but still accounts for the difference between censored and uncensored samples. RATIO can be incorporated with any prediction model. We further propose a new data augmentation (DA) method based on the TTE of uncensored samples and the input of censored samples. We show that RATIO drastically improves the precision and reduces the bias of SA prediction, and the DA allows for the inclusion of high-dimension data in SA methods even with a small number of uncensored samples.
... After cross-validation has been used to determine how large the penalty parameter should be, this penalty is applied to all variables in the model. There is thus no need to make a discretionary decision for each individual variable as to whether or not it should be included in the model (Witten & Tibshirani, 2010). Moreover, the procedure does not suffer from numerical problems caused by multicollinearity. ...
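A sketch of the procedure described in this excerpt (pick the penalty strength by cross-validation, then apply that single penalty to all variables at once) using scikit-survival; the alpha grid, the pure-lasso setting, and the variable names are illustrative assumptions.

```python
# Choose the Cox penalty strength by cross-validation; the chosen penalty is
# then applied to all variables, with no per-variable inclusion decisions.
# Assumed inputs: X (feature matrix), event and time arrays.
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sksurv.linear_model import CoxnetSurvivalAnalysis
from sksurv.util import Surv

y = Surv.from_arrays(event=event.astype(bool), time=time)
alpha_grid = np.logspace(-3, 1, 30)

gcv = GridSearchCV(
    CoxnetSurvivalAnalysis(l1_ratio=1.0),               # lasso; l1_ratio < 1 = elastic net
    param_grid={"alphas": [[a] for a in alpha_grid]},   # one penalty value per fit
    cv=KFold(n_splits=5, shuffle=True, random_state=0), # scored by the C-index
)
gcv.fit(X, y)
best_alpha = gcv.best_params_["alphas"][0]
```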
Research
Full-text available
This study found no evidence to support the effectiveness of the BORG training programme in terms of recidivism among perpetrators of intimate partner violence. In accordance with and in supplement to previous studies on the BORG training programme, various problems and barriers in the execution of the programme have been identified. For example, the programme does not appear to be reaching the intended target group, the set objectives do not seem feasible and are not formulated in SMART terms, participants’ partners are insufficiently involved in the programme, the content of the programme is insufficiently based on evidence-based techniques and on a robust comprehensive theoretical framework, and there is insufficient nationwide management and control of the execution of the programme. These problems, however, provide starting points for the improvement of the BORG training programme, which may make the programme effective after all. In addition, a substantial percentage of participants seems positive about the training programme. Furthermore, the BORG training programme could potentially be used at an early stage for a large group of perpetrators (couples) of intimate partner violence, given that the Probation and Parole Service is already involved by the police in any arrests due to domestic violence. By devoting attention to the aforementioned problems identified in this study within the planned revision of the BORG training programme, and through the establishment of better framework conditions for an integrated, cross-domain, system-oriented approach to domestic violence by the Ministry of Justice and Security and the Ministry of Health, Welfare and Sport, an effective intervention could be created that would contribute to reducing intimate partner violence in society.
... This algorithm, and the ones previously mentioned, are dedicated to unsupervised tasks or classification and are not intended to identify prognostic molecular signatures. A large number of works have used a two-step procedure [11-15], as follows: (i) fit an unsupervised LVM on the omics data (e.g. PCA, ICA, etc.) and (ii) use the individual weights as explanatory variables in a survival regression. ...
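A minimal sketch of this generic two-step procedure, using PCA for step (i) and a Cox regression on the per-patient component scores for step (ii); all names and the number of components are placeholders.

```python
# Two-step procedure: (i) fit an unsupervised latent variable model (here PCA)
# on the omics matrix; (ii) use the individual component scores as explanatory
# variables in a Cox regression. Assumed inputs: `expr` (n x p DataFrame),
# `time` and `status` arrays.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from lifelines import CoxPHFitter

n_comp = 10
scores = PCA(n_components=n_comp).fit_transform(StandardScaler().fit_transform(expr))
surv_df = pd.DataFrame(scores, columns=[f"PC{i+1}" for i in range(n_comp)])
surv_df["time"], surv_df["status"] = time, status

cph = CoxPHFitter()
cph.fit(surv_df, duration_col="time", event_col="status")
cph.print_summary()   # which components carry prognostic information
```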
Article
The development of prognostic molecular signatures considering the inter-patient heterogeneity is a key challenge for the precision medicine. We propose a joint model of this heterogeneity and the patient survival, assuming that tumor expression results from a mixture of a subset of independent signatures. We deconvolute the omics data using a non-parametric independent component analysis with a double sparseness structure for the source and the weight matrices, corresponding to the gene-component and individual-component associations, respectively. In a simulation study, our approach identified the correct number of components and reconstructed with high accuracy the weight ([Formula: see text]0.85) and the source ([Formula: see text]0.75) matrices sparseness. The selection rate of components with high-to-moderate prognostic impacts was close to 95%, while the weak impacts were selected with a frequency close to the observed false positive rate ([Formula: see text]25%). When applied to the expression of 1063 genes from 614 breast cancer patients, our model identified 15 components, including six associated to patient survival, and related to three known prognostic pathways in early breast cancer (i.e. immune system, proliferation, and stromal invasion). The proposed algorithm provides a new insight into the individual molecular heterogeneity that is associated with patient prognosis to better understand the complex tumor mechanisms.
... Many methods have been developed to identify important gene expressions (and other omics measurements) associated with cancer overall survival (and other outcomes) [28][29][30]. Replicability can be directly affected by the choice of analysis methods [10]. To make our analysis more relevant, we adopt the 'simplest' and highly popular methods. ...
Article
In biomedical research, the replicability of findings across studies is highly desired. In this study, we focus on cancer omics data, for which the examination of replicability has been mostly focused on important omics variables identified in different studies. In published literature, although there have been extensive attention and ad hoc discussions, there is insufficient quantitative research looking into replicability measures and their properties. The goal of this study is to fill this important knowledge gap. In particular, we consider three sensible replicability measures, for which we examine distributional properties and develop a way of making inference. Applying them to three The Cancer Genome Atlas (TCGA) datasets reveals in general low replicability and significant across-data variations. To further comprehend such findings, we resort to simulation, which confirms the validity of the findings with the TCGA data and further informs the dependence of replicability on signal level (or equivalently sample size). Overall, this study can advance our understanding of replicability for cancer omics and other studies that have identification as a key goal.
... • Cox proportional hazards model [12,13]. • [20] is a tree-based ensemble method, which is initiated with the survival tree that stands first in rank; further trees are then tested one by one by adding them to the ensemble in order of rank. • UST [17] constructs a survival tree by a novel matrix-based algorithm in order to test a number of nodes simultaneously via stabilized score tests [59]. ...
Article
Full-text available
The Cox proportional hazards model and random survival forests (RSF) are useful semi-parametric and non-parametric methods for modeling time-to-event data. However, both approaches may fail in the case of a small sample size and/or a high censoring rate. In this research, we want to tackle such problems within the random forests framework using semi-supervised data transduction techniques and layer-by-layer processing similar to deep forest. Experiments on both extensive simulated data and real-world benchmark datasets have shown that the proposed deep survival forests (DSF) outperform Cox and RSF by a noticeable margin and also work better than several state-of-the-art survival ensembles, including Cox boosting models and the latest survival forest extensions, on a variety of scenarios. The superiority of DSF stands out when small-sample and highly censored data are confronted.
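For comparison with the Cox and RSF baselines discussed here, a plain random survival forest fit and C-index evaluation with scikit-survival looks roughly as follows; the hyperparameters and variable names are illustrative.

```python
# Fit a random survival forest and evaluate it by Harrell's concordance index.
# Assumed inputs: X_train/X_test feature matrices, time_*/event_* arrays.
from sksurv.ensemble import RandomSurvivalForest
from sksurv.util import Surv
from sksurv.metrics import concordance_index_censored

y_train = Surv.from_arrays(event=event_train.astype(bool), time=time_train)
rsf = RandomSurvivalForest(n_estimators=500, min_samples_leaf=10, random_state=0)
rsf.fit(X_train, y_train)

risk = rsf.predict(X_test)   # higher value = higher predicted risk
cindex = concordance_index_censored(event_test.astype(bool), time_test, risk)[0]
print(f"test C-index: {cindex:.3f}")
```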
... This issue highlights a critical flaw in biomarker development pipelines and is one important reason why genomic biomarkers are infrequently translated into clinical practice (Boutros, 2015). Another pervasive issue hindering clinical translation arises from the reliance on a large number of predictive features in complex models that are difficult to interpret and often perform poorly in independent validation due to overfitting (Taylor et al, 2008;Witten & Tibshirani, 2010). ...
Article
Full-text available
Advanced and metastatic estrogen receptor-positive (ER+ ) breast cancers are often endocrine resistant. However, endocrine therapy remains the primary treatment for all advanced ER+ breast cancers. Treatment options that may benefit resistant cancers, such as add-on drugs that target resistance pathways or switching to chemotherapy, are only available after progression on endocrine therapy. Here we developed an endocrine therapy prognostic model for early and advanced ER+ breast cancers. The endocrine resistance (ENDORSE) model is composed of two components, each based on the empirical cumulative distribution function of ranked expression of gene signatures. These signatures include a feature set associated with long-term survival outcomes on endocrine therapy selected using lasso-regularized Cox regression and a pathway-based curated set of genes expressed in response to estrogen. We extensively validated ENDORSE in multiple ER+ clinical trial datasets and demonstrated superior and consistent performance of the model over clinical covariates, proliferation markers, and multiple published signatures. Finally, genomic and pathway analyses in patient data revealed possible mechanisms that may help develop rational stratification strategies for endocrine-resistant ER+ breast cancer patients.
... Because of collinearity, one characteristic variable in a multiple regression model can be predicted linearly by the other variables. To alleviate this problem, ridge regression adds a small squared-deviation factor (a regularization term) to the objective; this factor introduces a small amount of bias into the model but greatly reduces the variance [10]. Lasso is very similar to ridge regression in that a penalty term is added to the regression optimization function to reduce the effect of collinearity, thus shrinking the model coefficients. ...
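In the linear-regression case, the "squared-deviation factor" is the L2 penalty; its closed-form solution shows explicitly how the added λI term stabilizes an ill-conditioned (collinear) design matrix, while the lasso replaces the squared L2 norm with an L1 norm and thereby shrinks some coefficients exactly to zero:

```latex
\hat{\beta}_{\mathrm{ridge}}
  = \arg\min_{\beta}\; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2
  = \left( X^{\top}X + \lambda I \right)^{-1} X^{\top} y,
\qquad
\hat{\beta}_{\mathrm{lasso}}
  = \arg\min_{\beta}\; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1 .
```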
Article
Full-text available
Accurate prediction of the survival risk level of patients with esophageal cancer is significant for the selection of appropriate treatment methods. It contributes to improving the living quality and survival chance of patients. However, considering that the characteristics of blood indices vary between individuals on the basis of their age, personal habits, living environment, etc., a unified artificial intelligence prediction model is not precisely adequate. In order to enhance the precision of the model on the prediction of esophageal cancer survival risk, this study proposes a different model based on the Kohonen network clustering algorithm and the kernel extreme learning machine (KELM), aiming to classify the tested population into five categories and provide better efficiency with the use of machine learning. Firstly, the Kohonen network clustering method was used to cluster the patient samples, and five types of samples were obtained. Secondly, patients were divided into two risk levels based on 5-year net survival. Then, the Taylor formula was used to analyze the influence of different activation functions on the KELM modeling effect, and to conduct experimental verification. RBF was selected as the activation function of the KELM. Finally, the adaptive mutation sparrow search algorithm (AMSSA) was used to optimize the model parameters. The experimental results were compared with the methods of the artificial bee colony optimized support vector machine (ABC-SVM), the three layers of random forest (TLRF), the gray relational analysis–particle swarm optimization support vector machine (GP-SVM) and the mixed-effects Cox model (Cox-LMM). The results showed that the prediction model proposed in this study had certain advantages in terms of prediction accuracy and running time, and could provide support for medical personnel to choose the treatment mode of esophageal cancer patients.
... [49] using the CoxPHSurvivalAnalysis and CoxPHFitter classes, respectively, with default parameter settings. In order to overcome the p ≫ n problem while modeling survival in high-dimensional data, a number of methods, including discrete feature selection using univariable and stepwise selection, have been proposed [50,51]. We have adapted a computational pipeline described by Shukla et al. [51], using scikit-learn (version 0.24.2) [52] and the SelectFpr (selecting miRNA features to which a Cox proportional hazards model could be fit with P < 0.05) and SequentialFeatureSelector (sequentially adding single miRNA features to a multivariable Cox proportional hazards model and assessing based on minimizing the log-rank test P value between low/high-risk cross-validated predictions) classes. ...
Article
Full-text available
Immunotherapies have recently gained traction as highly effective therapies in a subset of late-stage cancers. Unfortunately, only a minority of patients experience the remarkable benefits of immunotherapies, whilst others fail to respond or even come to harm through immune-related adverse events. For immunotherapies within the PD-1/PD-L1 inhibitor class, patient stratification is currently performed using tumor (tissue-based) PD-L1 expression. However, PD-L1 is an accurate predictor of response in only ~30% of cases. There is pressing need for more accurate biomarkers for immunotherapy response prediction. We sought to identify peripheral blood biomarkers, predictive of response to immunotherapies against lung cancer, based on whole blood microRNA profiling. Using three well-characterized cohorts consisting of a total of 334 stage IV NSCLC patients, we have defined a 5 microRNA risk score (miRisk) that is predictive of overall survival following immunotherapy in training and independent validation (HR 2.40, 95% CI 1.37–4.19; P < 0.01) cohorts. We have traced the signature to a myeloid origin and performed miRNA target prediction to make a direct mechanistic link to the PD-L1 signaling pathway and PD-L1 itself. The miRisk score offers a potential blood-based companion diagnostic for immunotherapy that outperforms tissue-based PD-L1 staining.
... Omics data, containing several thousand gene expression measurements, are defined as high-dimensional data. The joint analysis of survival data and high-dimensional data is not new [6]. There are several challenges in high-dimensional data [7], many of which have been solved and many of which have not. ...
Article
Full-text available
Background The five-year overall survival (OS) of advanced-stage ovarian cancer remains nearly 25-35%, although several treatment strategies have evolved to achieve better outcomes. A considerable amount of heterogeneity and complexity has been seen in ovarian cancer. This study aimed to establish gene signatures that can be used for better prognosis through risk prediction for the survival of ovarian cancer patients. The heterogeneity of different studies is brought into a single platform to explore the genes driving poor or better survival. An integrative analysis of multiple data sets was done to determine the genes that influence poor or better survival. A total of 6 independent data sets was considered. The Cox proportional hazards model was used to obtain significant genes that had an impact on ovarian cancer patients. The gene signatures were prepared by splitting the over-expressed and under-expressed genes in parallel by the variable selection technique. Data visualisation techniques were prepared to present the overall survival prediction, which could support the therapeutic regimen. Results We preferred to select 20 genes in each data set as upregulated and downregulated. Despite the selection of multiple genes, not even a single gene was found to be common among the data sets for the survival of ovarian cancer patients, even though the same analytical approach was adopted. A chord plot was presented to give a comprehensive understanding of the outcome. Conclusions This study helps us to understand the results obtained from different studies. It shows the impact of heterogeneity from one study to another, and it shows the need for integrated studies to form a holistic view of the gene signature for ovarian cancer survival.
... In order to take into account the high dimensionality of survival data and to solve the feature selection problem with these data, Tibshirani (1997) presented a modification based on the Lasso method. Similar Lasso modifications, for example, the adaptive Lasso, were also proposed by several authors (Kim et al., 2012;Witten and Tibshirani, 2010;Zhang and Lu, 2007). A further extension of the Cox model is a set of SVM modifications (Khan and Zubek, 2008;Widodo and Yang, 2011). ...
... Though we focused on statistical regularization techniques, modelling approaches enabling variable selection are not restricted to them. More methods exist and are useful (Witten and Tibshirani 2010), mostly coming from the field of machine learning, such as tree-based survival techniques (recursive partitioning, random forests), survival principal component analysis, support vector machines, etc. (LeBlanc and Crowley 1992; Bair et al. 2006; Li and Luan 2002). However, the underlying theory of most machine learning algorithms assumes that the training data are independent and identically distributed (i.i.d.); they also require large training sets and are said not to perform so well with imbalanced cases (i.e., in our context, a very low number of injuries). ...
Article
Data-based methods and statistical models are given special attention in the study of sports injuries to gain an in-depth understanding of their risk factors and mechanisms. The objective of this work is to evaluate the use of shared frailty Cox models for the prediction of sports injuries, and to compare their performance with different sets of variables selected by several regularized variable selection approaches. The study is motivated by specific characteristics commonly found in sports injury data, which usually include a reduced sample size and an even smaller number of injuries, coupled with a large number of potentially influential variables. Hence, we conduct a simulation study to address these statistical challenges and to explore regularized Cox model strategies together with shared frailty models in different controlled situations. We show that predictive performance greatly improves as more player observations are available. Methods that result in sparse models and favour interpretability, e.g. Best Subset Selection and Boosting, are preferred when the sample size is small. We include a real case study of injuries of female football players from a Spanish football club.
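The shared frailty Cox model evaluated in this work multiplies the hazard of every observation in a cluster (here, a player) by an unobserved random effect; in the common gamma-frailty formulation (standard notation, not necessarily the authors'):

```latex
h_{ij}(t \mid u_i) = u_i\, h_0(t)\, \exp\!\left(x_{ij}^{\top}\beta\right),
\qquad
u_i \sim \mathrm{Gamma}\!\left(\tfrac{1}{\theta}, \tfrac{1}{\theta}\right),
\quad \mathbb{E}[u_i] = 1,\ \mathrm{Var}(u_i) = \theta.
```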
... We decided not to impute missing data due to the low number of cases with missing values in both the MARGINS and METABRIC cohorts (2% and 5%, respectively) [35]. To deal with the high dimensionality of gene expression data, a principal component analysis (PCA) was performed on the scaled gene expression data of the specific gene set, and the first principal component (PC1) was treated as the variable representative of the biological pathway in the multivariable survival model [22,36]. To validate a possible association between discovered gene expression pathways associated with CPE and survival in an external dataset, we applied the PCA from the MARGINS data to the METABRIC data, and fitted a regular multivariable Cox proportional hazards model including the PC representative of the gene expression, adjusted for age, tumor size and grade, axillary load, and AST. ...
Article
Full-text available
Purpose To assess whether contralateral parenchymal enhancement (CPE) on MRI is associated with gene expression pathways in ER+/HER2- breast cancer, and if so, whether such pathways are related to survival. Methods Preoperative breast MRIs of early ER+/HER2- breast cancer patients eligible for breast-conserving surgery, included in a prospective observational cohort study (MARGINS), were analyzed. The contralateral parenchyma was segmented and CPE was calculated as the average of the top-10% delayed enhancement. Total tumor RNA sequencing was performed and gene set enrichment analysis was used to reveal gene expression pathways associated with CPE (N = 226) and related to overall survival (OS) and invasive disease-free survival (IDFS) in multivariable survival analysis. The latter was also done for the METABRIC cohort (N = 1355). Results CPE was most strongly correlated with proteasome pathways (normalized enrichment statistic = 2.04, false discovery rate = .11). Patients with high CPE showed lower tumor proteasome gene expression. Proteasome gene expression had a hazard ratio (HR) of 1.40 (95% CI = 0.89, 2.16; P = .143) for OS in the MARGINS cohort and 1.53 (95% CI = 1.08, 2.14; P = .017) for IDFS; in METABRIC, proteasome gene expression had an HR of 1.09 (95% CI = 1.01, 1.18; P = .020) for OS and 1.10 (95% CI = 1.02, 1.18; P = .012) for IDFS. Conclusion CPE was negatively correlated with tumor proteasome gene expression in early ER+/HER2- breast cancer patients. Low tumor proteasome gene expression was associated with improved survival in the METABRIC data.
... We follow the practice of existing survival analysis literature and model the time series of physiological indicators as both time-dependent and trajectory covariates (Witten and Tibshirani 2010, Fang et al. 2016, Ma et al. 2019. However, some of these variables, such as oxygen levels, blood pressure, or pulse, could potentially be modeled separately as stochastic processes. ...
Article
Full-text available
Having an interpretable, dynamic length-of-stay model can help hospital administrators and clinicians make better decisions and improve the quality of care. The widespread implementation of electronic medical record (EMR) systems has enabled hospitals to collect massive amounts of health data. However, how to integrate this deluge of data into healthcare operations remains unclear. We propose a framework grounded in established clinical knowledge to model patients’ lengths of stay. In particular, we impose expert knowledge when grouping raw clinical data into medically meaningful variables that summarize patients’ health trajectories. We use dynamic, predictive models to output patients’ remaining lengths of stay, future discharges, and census probability distributions based on their health trajectories up to the current stay. Evaluated with large-scale EMR data, the dynamic model significantly improves predictive power over the performance of any model in previous literature and remains medically interpretable. Summary of Contribution: The widespread implementation of electronic health systems has created opportunities and challenges to best utilize mounting clinical data for healthcare operations. In this study, we propose a new approach that integrates clinical analysis in generating variables and implementations of computational methods. This approach allows our model to remain interpretable to the medical professionals while being accurate. We believe our study has broader relevance to researchers and practitioners of healthcare operations.
... For this purpose, the most widely used approach is the Cox proportional hazards model (Cox-PH), which relies on the assumption that each covariate has a multiplicative effect on the hazard function that is constant over time [6]. Regularization approaches for handling high-dimensional covariates with the Cox-PH model have also been developed [7]. Alternative techniques such as regression- or ranking-based survival support vector machines (SSVMs) use L2 regularization of the weights and a squared error function, which makes these models vulnerable to scaling issues and outliers in the data [8], [9]. ...
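In standard notation (a textbook restatement, not quoted from the paper above), the proportional hazards assumption referred to here is

$$h(t \mid x) = h_0(t)\,\exp\big(x^{\top}\beta\big),$$

so each covariate $x_j$ multiplies the baseline hazard $h_0(t)$ by the time-constant factor $\exp(\beta_j x_j)$.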
Preprint
Full-text available
Can we predict if an early stage cancer patient is at high risk of developing distant metastasis and what clinicopathological factors are associated with such a risk? In this paper, we propose a ranking based censoring-aware machine learning model for answering such questions. The proposed model is able to generate an interpretable formula for risk stratification using a minimal number of clinicopathological covariates through L1-regularization. Using this approach, we analyze the association of time to distant metastasis (TTDM) with various clinical parameters for early stage, luminal (ER+ or HER2-) breast cancer patients who received endocrine therapy but no chemotherapy (n = 728). The TTDM risk stratification formula obtained using the proposed approach is primarily based on mitotic score, histological tumor type and lymphovascular invasion. These findings corroborate the known role of these covariates in increased risk for distant metastasis. Our analysis shows that the proposed risk stratification formula can discriminate between cases with high and low risk of distant metastasis (p-value < 0.005) and can also rank cases based on their time to distant metastasis with a concordance-index of 0.73.
... The aim of this article is to develop a set of methods for assessing the calibration of a prediction model with a time-to-event outcome. This class of models has been dealt with extensively during the past years, see, for example, Henderson & Keiding (2005), Witten & Tibshirani (2010), Soave & Strug (2018) and Braun et al. (2018). Here, we explicitly assume that event times are measured on a discrete time scale t = 1, 2, … (Ding et al., 2012; Tutz & Schmid, 2016; Berger & Schmid, 2018), and that the event of interest may occur along with one or more "competing" events (Fahrmeir & Wagenpfeil, 1996; Fine & Gray, 1999; Lau, Cole & Gange, 2009; Beyersmann, Allignol & Schumacher, 2011; Austin, Lee & Fine, 2016; Lee, Feuer & Fine, 2018). ...
Article
Full-text available
The generalization performance of a risk prediction model can be evaluated by its calibration, which measures the agreement between predicted and observed outcomes on external validation data. Here, we propose methods for assessing the calibration of discrete time-to-event models in the presence of competing risks. Specifically, we consider the class of discrete subdistribution hazard models, which directly relate the cumulative incidence function of one event of interest to a set of covariates. We apply the methods to a prediction model for the development of nosocomial pneumonia. Simulation studies show that the methods are strong tools for calibration assessment even in scenarios with a high censoring rate and/or a large number of discrete time points.
... It picks up genes having P-values lower than a prespecified cutoff when testing the association between each gene and the survival outcome. 19,20 Rosenwald et al 44 fitted univariate Cox PH models for each gene's expression using 160 patients and reported 473 genes that were significant at P < .01 based on the Wald test. Khan and Shaw 23 fitted univariate log-normal AFT models with individual genes and selected 463 genes that were significant at level P < .01 using the Wald test. ...
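A minimal Python sketch of this kind of univariate screening is shown below, assuming hypothetical expression data and using the lifelines package; the gene names, sample size, and the p < 0.01 cutoff are illustrative only.

import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(1)
n, p = 160, 500                                   # patients, genes (hypothetical)
expr = pd.DataFrame(rng.normal(size=(n, p)),
                    columns=[f"gene_{j}" for j in range(p)])
time = rng.exponential(scale=5, size=n)           # follow-up times
event = rng.integers(0, 2, size=n)                # 1 = event observed, 0 = censored

selected = []
for gene in expr.columns:
    df = pd.DataFrame({"time": time, "event": event, gene: expr[gene]})
    cph = CoxPHFitter().fit(df, duration_col="time", event_col="event")
    if cph.summary.loc[gene, "p"] < 0.01:         # Wald p-value threshold
        selected.append(gene)

print(f"{len(selected)} genes pass the univariate screen at p < 0.01")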
Article
Full-text available
Dealing with high-dimensional censored data is very challenging because of the complexities in the data structure. This article focuses on developing a variable selection procedure for censored high-dimensional data under AFT models using the Modified Correlation Adjusted coRrelation (MCAR) scores method. The latter is developed based on the CAR scores method, which provides a canonical ordering that encourages grouping of correlated predictors and down-weights antagonistic variables. The proposed MCAR scores method is developed as an extension of the CAR scores method using a NOVEL integration of the sample and thresholded estimators of the correlation matrix, as suggested by Huang and Fryzlewicz. The proposed MCAR yields computationally more efficient estimates under model sparsity and can provide a canonical ordering among the predictors. The MCAR method is a greedy method that is also easy to understand and can perform estimation and variable selection simultaneously. The variable selection performance of the MCAR method has been compared with other existing regularized techniques in the literature, such as the lasso and elastic net, with a machine learning technique called boosting, and with the censored CAR, through a number of simulation studies and a real microarray data set on diffuse large-B-cell lymphoma. Results indicate that when correlation exists among the covariates, the MCAR method outperforms all five techniques, while for uncorrelated data the MCAR performs quite similarly to the CAR method but clearly outperforms the other three methods. The empirical study further reveals that the MCAR method exhibits the best predictive performance among the methods.
... From a frequentist point of view, in the context of high-dimensional data, the Lasso or ℓ1 regularization proposed by Tibshirani (1996, 1997) has been a popular choice. An overview of time-to-event analysis with high-dimensional data more generally can be found in Witten & Tibshirani (2010). Computationally efficient implementations for the Cox model under penalization have been developed by Friedman et al. (2010), Simon et al. (2011), Mittal et al. (2014) and Yang & Zou (2013). ...
Preprint
The Cox model is an indispensable tool for time-to-event analysis, particularly in biomedical research. However, medicine is undergoing a profound transformation, generating data at an unprecedented scale, which opens new frontiers to study and understand diseases. With the wealth of data collected, new challenges for statistical inference arise, as datasets are often high dimensional, exhibit an increasing number of measurements at irregularly spaced time points, and are simply too large to fit in memory. Many current implementations for time-to-event analysis are ill-suited for these problems as inference is computationally demanding and requires access to the full data at once. Here we propose a Bayesian version for the counting process representation of Cox's partial likelihood for efficient inference on large-scale datasets with millions of data points and thousands of time-dependent covariates. Through the combination of stochastic variational inference and a reweighting of the log-likelihood, we obtain an approximation for the posterior distribution that factorizes over subsamples of the data, enabling the analysis in big data settings. Crucially, the method produces viable uncertainty estimates for large-scale and high-dimensional datasets. We show the utility of our method through a simulation study and an application to myocardial infarction in the UK Biobank.
... For HD data (e.g., high-throughput genomic data) with p ≫ n, standard statistical methods cannot be applied directly. The same problem also occurs with survival data [18]. ...
Article
Full-text available
With the development of high-throughput technologies, more and more high-dimensional or ultra-high-dimensional genomic data are being generated. Therefore, effectively analyzing such data has become a significant challenge. Machine learning (ML) algorithms have been widely applied for modeling nonlinear and complicated interactions in a variety of practical fields such as high-dimensional survival data. Recently, multilayer deep neural network (DNN) models have made remarkable achievements. Thus, a Cox-based DNN prediction survival model (DNNSurv model), which was built with Keras and TensorFlow, was developed. However, its results had previously been evaluated only on survival datasets that were high-dimensional or had large sample sizes. In this paper, we evaluated the prediction performance of the DNNSurv model using ultra-high-dimensional and high-dimensional survival datasets and compared it with three popular ML survival prediction models (i.e., random survival forest and the Cox-based LASSO and Ridge models). For this purpose, we also present the optimal setting of several hyperparameters, including the selection of a tuning parameter. The data analysis demonstrated that the DNNSurv model performed well overall compared with the ML models, in terms of the three main evaluation measures (i.e., concordance index, time-dependent Brier score, and time-dependent AUC) for survival prediction performance.
... However, given the large number of variables, overfitting is a major concern in this study and results in the inclusion of some variables with little relevance to our research question. In addition, the expression levels of different genes are not strictly independent of each other because of the many regulatory relationships among genes, which results in a multicollinearity problem [22]. The LASSO is suitable for high-dimensional data because it shrinks all regression coefficients toward zero and automatically sets many of them exactly to zero. ...
Article
Full-text available
Background The prognosis of colon cancer (CC) is challenging to predict due to its highly heterogeneous nature. Ferroptosis, an iron-dependent form of cell death, has roles in various cancers; however, the correlation between ferroptosis-related genes (FRGs) and prognosis in CC remains unclear. Methods The expression profiles of FRGs and relevant clinical information were retrieved from the Cancer Genome Atlas (TCGA) database. Cox regression analysis and the least absolute shrinkage and selection operator (LASSO) regression model were performed to build a prognostic model in TCGA cohort. Results Ten FRGs, five of which had mutation rates ≥ 3%, were found to be related to the overall survival (OS) of patients with CC. Patients were divided into high- and low-risk groups based on the results of Cox regression and LASSO analysis. Patients in the low-risk group had a significantly longer survival time than patients in the high-risk group ( P < 0.001). Enrichment analyses in different risk groups showed that the altered genes were associated with the extracellular matrix, fatty acid metabolism, and peroxisome. Age, risk score, T stage, N stage, and M stage were independent predictors of patient OS based on the results of Cox analysis. Finally, a nomogram was constructed to predict 1-, 3-, and 5-year OS of patients with CC based on the above five independent factors. Conclusion A novel FRG model can be used for prognostic prediction in CC and may be helpful for individualized treatment.
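To illustrate the general LASSO-Cox risk-score workflow described above (a rough sketch on simulated data, not the authors' analysis of the TCGA cohort), the following Python code fits an L1-penalized Cox model with lifelines, builds a risk score, and compares high- and low-risk groups with a log-rank test; the penalty value and all variable names are arbitrary assumptions.

import numpy as np
import pandas as pd
from lifelines import CoxPHFitter
from lifelines.statistics import logrank_test

rng = np.random.default_rng(2)
n, p = 300, 40                                    # patients, candidate genes (hypothetical)
df = pd.DataFrame(rng.normal(size=(n, p)),
                  columns=[f"gene_{j}" for j in range(p)])
df["time"] = rng.exponential(scale=4, size=n)
df["event"] = rng.integers(0, 2, size=n)

# l1_ratio=1.0 gives a pure LASSO penalty; in practice the penalizer would be
# chosen by cross-validation rather than fixed.
cph = CoxPHFitter(penalizer=0.1, l1_ratio=1.0)
cph.fit(df, duration_col="time", event_col="event")

risk = cph.predict_partial_hazard(df)             # risk score (exp of the linear predictor)
high = risk > risk.median()
res = logrank_test(df.loc[high, "time"], df.loc[~high, "time"],
                   df.loc[high, "event"], df.loc[~high, "event"])
print("log-rank p-value, high- vs. low-risk group:", res.p_value)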
Article
Background Pancreatic mucinous adenocarcinoma (PMAC) is a rare malignant tumour, and there is limited understanding of its epidemiology and prognosis. Initially, PMAC was considered a metastatic manifestation of other cancers; however, instances of non-metastatic PMAC have been documented through monitoring, epidemiological studies, and data from the Surveillance, Epidemiology, and End Results (SEER) database. Therefore, it is crucial to investigate the epidemiological characteristics of PMAC and discern the prognostic differences between PMAC and the more prevalent pancreatic ductal adenocarcinoma (PDAC). Methods The study used data from the SEER database from 2000 to 2018 to identify patients diagnosed with PMAC or PDAC. To ensure comparable demographic characteristics between PDAC and PMAC, propensity score matching was employed. Kaplan–Meier analysis was used to analyse overall survival (OS) and cancer-specific survival (CSS). Univariate and multivariate Cox regression analyses were used to determine independent risk factors influencing OS and CSS. Additionally, the construction and validation of risk-scoring models for OS and CSS were achieved through the least absolute shrinkage and selection operator-Cox regression technique. Results The SEER database included 84,857 patients with PDAC and 3345 patients with PMAC. Notably, significant distinctions were observed in the distribution of tumour sites, diagnosis time, use of radiotherapy and chemotherapy, tumour size, grading, and staging between the two groups. The prognosis exhibited notable improvement among married individuals, those receiving acceptable chemotherapy, and those with focal PMAC (p < 0.05). Conversely, patients with elevated log odds of positive lymph node scores or higher pathological grades in the pancreatic tail exhibited a more unfavourable prognosis (p < 0.05). The risk-scoring models for OS or CSS based on prognostic factors indicated a significantly lower prognosis for high-risk patients compared to their low-risk counterparts (area under the curve OS: 0.81–0.82, CSS: 0.80–0.82). Conclusion PMAC exhibits distinct clinical characteristics compared to non-specific PDAC. Leveraging these features and pathological classifications allows for accurate prognostication of PMAC or PDAC.
Article
There has been an increasing interest in decomposing high-dimensional multi-omics data into a product of low-rank and sparse matrices for the purpose of dimension reduction and feature engineering. Bayesian factor models achieve such a low-dimensional representation of the original data through different sparsity-inducing priors. However, few of these models can efficiently incorporate the information encoded by biological graphs, which has already been proven useful in many analysis tasks. In this work, we propose a Bayesian factor model with novel hierarchical priors, which incorporate biological graph knowledge as a tool for identifying groups of genes functioning collaboratively. The proposed model therefore enables sparsity within networks by allowing each factor loading to be shrunk adaptively and by considering additional layers that relate individual shrinkage parameters to the underlying graph information, both of which yield a more accurate structure recovery of the factor loadings. Further, these new priors overcome the phase transition phenomenon, in contrast to existing graph-incorporated approaches, so that the model is robust to noisy edges that are inconsistent with the actual sparsity structure of the factor loadings. Finally, our model can handle both continuous and discrete data types. The proposed method is shown to outperform several existing factor analysis methods through simulation experiments and real data analyses.
Article
The Cox proportional hazards model, commonly used in clinical trials, assumes proportional hazards. However, this assumption does not hold when, for example, there is a delayed onset of the treatment effect. In such a situation, an acute change in the hazard ratio function is expected to exist. This paper considers the Cox model with change-points and derives AIC-type information criteria for detecting those change-points. The change-point model does not allow for conventional statistical asymptotics due to its irregularity; thus a formal AIC that penalizes twice the number of parameters cannot be derived analytically, and using it would clearly give overfitted analysis results. Therefore, we construct specific asymptotics using the partial likelihood estimation method in the Cox model with change-points, and propose information criteria based on the original derivation method for the AIC. If the partial likelihood is used in the estimation, information criteria with penalties much larger than twice the number of parameters can be obtained in an explicit form. Numerical experiments confirm that the proposed criteria are clearly superior in terms of the original purpose of the AIC, which is to provide an estimate that is close to the true structure. We also apply the proposed criterion to actual clinical trial data to show that it can easily lead to different results from the formal AIC.
Article
Although induction of differentiation represents an effective strategy for neuroblastoma treatment, the mechanisms underlying neuroblastoma differentiation are poorly understood. We generated a computational model of neuroblastoma differentiation consisting of interconnected gene clusters identified based on symmetric and asymmetric gene expression relationships. We identified a differentiation signature consisting of a series of gene clusters comprising 1251 independent genes that predicted neuroblastoma differentiation in independent datasets and in neuroblastoma cell lines treated with agents known to induce differentiation. This differentiation signature was associated with patient outcomes in multiple independent patient cohorts and validated the role of MYCN expression as a marker of neuroblastoma differentiation. Our results further identified novel genes associated with MYCN via asymmetric Boolean implication relationships that would not have been identified using symmetric computational approaches and that were associated with both neuroblastoma differentiation and patient outcomes. Our differentiation signature included a cluster of genes involved in intracellular signaling and growth factor receptor trafficking pathways that is strongly associated with neuroblastoma differentiation, and we validated the associations of UBE4B, a gene within this cluster, with neuroblastoma cell and tumor differentiation. Our findings demonstrate that Boolean network analyses of symmetric and asymmetric gene expression relationships can identify novel genes and pathways relevant for neuroblastoma tumor differentiation that could represent potential therapeutic targets.
Article
Background The high mortality rate in acute heart failure (AHF) necessitates proper risk stratification. However, risk assessment tools for long-term mortality are largely lacking. We aimed to develop a machine learning (ML)-based risk prediction model for long-term all-cause mortality in patients admitted for AHF. Methods and Results An ML model based on a boosted Cox regression algorithm (CoxBoost) was trained on 2,704 consecutive patients hospitalized for AHF (median age 73 years, 55% male, and median left ventricular ejection fraction 38%). Twenty-seven input variables, including 19 clinical features and eight echocardiographic parameters, were selected for model development. The best performing model, along with pre-existing risk scores (BIOSTAT-CHF and AHEAD scores), was validated on an independent test cohort of 1,608 patients. During the median 32 months (interquartile range 12–54 months) of the follow-up period, 1,050 (38.8%) and 690 (42.9%) deaths occurred in the training and test cohorts, respectively. The area under the receiver operating characteristic curve (AUROC) of the ML model for all-cause mortality at 3 years was 0.761 (95% CI: 0.754–0.767) in the training cohort and 0.760 (95% CI: 0.752–0.768) in the test cohort. The discrimination performance of the ML model significantly outperformed that of the pre-existing risk scores (AUROC 0.714, 95% CI 0.706–0.722 by BIOSTAT-CHF; and 0.681, 95% CI 0.672–0.689 by AHEAD). Risk stratification based on the ML model identified patients at high mortality risk regardless of heart failure phenotype. Conclusions The ML-based mortality prediction model can accurately predict long-term mortality, leading to optimal risk stratification in patients with AHF.
Chapter
“Can we predict if an early stage cancer patient is at high risk of developing distant metastasis and what clinicopathological factors are associated with such a risk?” In this paper, we propose a ranking based censoring-aware machine learning model for answering such questions. The proposed model is able to generate an interpretable formula for risk stratification using a minimal number of clinicopathological covariates through L1-regularization. Using this approach, we analyze the association of time to distant metastasis (TTDM) with various clinical parameters for early stage, luminal (ER+/HER2-) breast cancer patients who received endocrine therapy but no chemotherapy (n = 728). The TTDM risk stratification formula obtained using the proposed approach is primarily based on mitotic score, histological tumor type and lymphovascular invasion. These findings corroborate the known role of these covariates in increased risk for distant metastasis. Our analysis shows that the proposed risk stratification formula can discriminate between cases with high and low risk of distant metastasis (p-value < 0.005) and can also rank cases based on their time to distant metastasis with a concordance-index of 0.73. Keywords: Survival analysis; Survival ranking; Neural networks; Luminal breast cancer
Article
The 1972 paper introducing the Cox proportional hazards regression model is one of the most widely cited statistical articles. In the present article, we give an account of the model, with a detailed description of its properties, and discuss the marked influence that the model has had on both statistical and medical research. We will also review points of criticism that have been raised against the model.
Article
Background Despite recent advances in endometrial carcinoma (EC) molecular characterization, its prognostication remains challenging. We aimed to assess whether RNAseq could stratify EC patient prognosis beyond current classification systems. Methods A prognostic signature was identified using a LASSO-penalized Cox model trained on TCGA (N = 543 patients). A clinically applicable polyA-RNAseq-based workflow was developed for validation of the signature in a cohort of stage I-IV patients treated in two hospitals [2010–2017]. Model performances were evaluated using time-dependent ROC curves (prediction of disease-specific survival (DSS)). The additional value of the RNAseq signature was evaluated by a multivariable Cox model, adjusted on the high-risk prognostic group (2021 ESGO-ESTRO-ESP guidelines: non-endometrioid histology or stage III-IVA or TP53-mutated molecular subgroup). Results Among 209 patients included in the external validation cohort, 61 (30%), 10 (5%), 52 (25%), and 82 (40%) had mismatch repair-deficient, POLE-mutated, TP53-mutated tumors, and tumors with no specific molecular profile, respectively. The 38-gene signature accurately predicted DSS (AUC = 0.80). Most disease-related deaths occurred in high-risk patients (5-year DSS = 78% (95% CI = [68%–89%]) versus 99% [97%–100%] in patients without high risk). A composite classifier accounting for the TP53-mutated subgroup and the RNAseq signature identified three classes independently associated with DSS: RNAseq-good prognosis (reference, 5-year DSS = 99%), non-TP53 tumors but with RNAseq-poor prognosis (adjusted hazard ratio (aHR) = 5.75, 95% CI [1.14–29.0]), and the TP53-mutated subgroup (aHR = 5.64 [1.12–28.3]). The model accounting for the high-risk group and the composite classifier predicted DSS with AUC = 0.84, versus AUC = 0.76 without it (p = 0.01). Conclusion RNA-seq profiling can provide additional prognostic information to established classification systems, and warrants validation for potential RNAseq-based therapeutic strategies in EC.
Article
An extension of the Neural Additive Model (NAM), called SurvNAM, and its modifications are proposed to explain predictions of a black-box machine learning survival model. The method is based on applying the original NAM to solving the explanation problem in the framework of survival analysis. The basic idea behind SurvNAM is to train the network by means of a specific expected loss function which takes into account the peculiarities of survival model predictions. Moreover, the loss function approximates the black-box model by an extension of the Cox proportional hazards model, which uses the well-known Generalized Additive Model (GAM) in place of the simple linear relationship of covariates. The proposed SurvNAM method allows both local and global explanations. The global explanation uses the whole training dataset. In contrast, for the local explanation a set of synthetic examples around the explained example is randomly generated. The proposed modifications of SurvNAM are based on using Lasso-based regularization for the functions from the GAM and on a special representation of the GAM functions using their weighted linear and non-linear parts, which is implemented as a shortcut connection. Many numerical experiments illustrate the efficiency of SurvNAM.
Preprint
Immunotherapies have recently gained traction as highly effective therapies in a subset of late-stage cancers. Unfortunately, only a minority of patients experience the remarkable benefits of immunotherapies, whilst others fail to respond or even come to harm through immune-related adverse events. For immunotherapies within the PD-1/PD-L1 inhibitor class, patient stratification is currently performed using tumor (tissue-based) PD-L1 expression. However, PD-L1 is an accurate predictor of response in only ~30% of cases. There is a pressing need for more accurate biomarkers for immunotherapy response prediction. We sought to identify peripheral blood biomarkers predictive of response to immunotherapies against lung cancer, based on whole blood microRNA profiling. Using three well-characterized cohorts consisting of a total of 334 stage IV NSCLC patients, we have defined a 5-microRNA risk score (miRisk) that is predictive of immunotherapy response in training and independent validation cohorts. We have traced the signature to a myeloid origin and performed miRNA target prediction to make a direct mechanistic link to the PD-L1 signalling pathway and PD-L1 itself. The miRisk score offers a potential blood-based companion diagnostic for immunotherapy that outperforms tissue-based PD-L1 staining.
Article
Full-text available
A survival tree can classify subjects into different survival prognostic groups. However, when the data contain high-dimensional covariates, the two popular classification trees exhibit fatal drawbacks: the logrank tree is unstable and tends to have false nodes, while for the conditional inference tree the adjusted P-value is difficult to interpret for high-dimensional tests. Motivated by these problems, we propose a new survival tree based on stabilized score tests. We propose a novel matrix-based algorithm in order to test a number of nodes simultaneously via stabilized score tests. We propose a recursive partitioning algorithm to construct a survival tree and develop our original R package uni.survival.tree (https://cran.r-project.org/package=uni.survival.tree) for implementation. Simulations are performed to demonstrate the superiority of the proposed method over the existing methods. The lung cancer data analysis demonstrates the usefulness of the proposed method.
Article
Full-text available
A method for interpreting the uncertainty of predictions provided by machine learning survival models is proposed. It is called UncSurvEx and aims to determine which features of an analyzed example lead to uncertain predictions of an explainable black-box survival model. One of the ideas behind the proposed method is to approximate the uncertainty measure of a local black-box survival model prediction by the uncertainty measure of the Cox proportional hazards model in the local area around a test example. The linear relationship between covariates and predictions in the corresponding Cox model allows determining the quantitative impacts of covariates on the uncertainty measure. A specific certainty measure of the survival function, taking into account the most uncertain survival function, is introduced to interpret the prediction uncertainty. The L2-norm is used to compute the distance between survival functions. The method leads to an unconstrained non-convex optimization problem, which is solved by means of the well-known Broyden–Fletcher–Goldfarb–Shanno algorithm. A large number of numerical experiments demonstrate the uncertainty interpretation method.
Chapter
Two new survival models, the deep survival forest and the Elastic-Net-Cox Cascade, are presented in the paper. They can be regarded as a combination of random survival forests and the Elastic-Net-Cox models with the deep forest (DF) proposed by Zhou and Feng. The main idea behind the models is to replace the original random forests incorporated into the DF with the corresponding survival analysis models. A stacking algorithm implemented in the deep survival forest and the Elastic-Net-Cox Cascade, which can be regarded as a link between the DF levels, uses quantiles of the random time-to-event and the mean time-to-event computed from the estimated survival functions at every level of the DF. Numerical examples with real data illustrate the proposed models.
Article
Full-text available
Type 2 diabetes mellitus results from the interaction of environmental factors with a combination of genetic variants, most of which were hitherto unknown. A systematic search for these variants was recently made possible by the development of high-density arrays that permit the genotyping of hundreds of thousands of polymorphisms. We tested 392,935 single-nucleotide polymorphisms in a French case-control cohort. Markers with the most significant difference in genotype frequencies between cases of type 2 diabetes and controls were fast-tracked for testing in a second cohort. This identified four loci containing variants that confer type 2 diabetes risk, in addition to confirming the known association with the TCF7L2 gene. These loci include a non-synonymous polymorphism in the zinc transporter SLC30A8, which is expressed exclusively in insulin-producing beta-cells, and two linkage disequilibrium blocks that contain genes potentially involved in beta-cell development or function (IDE-KIF11-HHEX and EXT2-ALX4). These associations explain a substantial portion of disease risk and constitute proof of principle for the genome-wide approach to the elucidation of complex genetic traits.
Article
Full-text available
cDNA microarrays and a clustering algorithm were used to identify patterns of gene expression in human mammary epithelial cells growing in culture and in primary human breast tumors. Clusters of coexpressed genes identified through manipulations of mammary epithelial cells in vitro also showed consistent patterns of variation in expression among breast tumor samples. By using immunohistochemistry with antibodies against proteins encoded by a particular gene in a cluster, the identity of the cell type within the tumor specimen that contributed the observed gene expression pattern could be determined. Clusters of genes with coherent expression patterns in cultured cells and in the breast tumor samples could be related to specific features of biological variation among the samples. Two such clusters were found to have patterns that correlated with variation in cell proliferation rates and with activation of the IFN-regulated signal transduction pathway, respectively. Clusters of genes expressed by stromal cells and lymphocytes in the breast tumors also were identified in this analysis. These results support the feasibility and usefulness of this systematic approach to studying variation in gene expression patterns in human cancers as a means to dissect and classify solid tumors.
Article
Full-text available
Characteristic patterns of gene expression measured by DNA microarrays have been used to classify tumors into clinically relevant subgroups. In this study, we have refined the previously defined subtypes of breast tumors that could be distinguished by their distinct patterns of gene expression. A total of 115 malignant breast tumors were analyzed by hierarchical clustering based on patterns of expression of 534 "intrinsic" genes and shown to subdivide into one basal-like, one ERBB2-overexpressing, two luminal-like, and one normal breast tissue-like subgroup. The genes used for classification were selected based on their similar expression levels between pairs of consecutive samples taken from the same tumor separated by 15 weeks of neoadjuvant treatment. Similar cluster analyses of two published, independent data sets representing different patient cohorts from different laboratories, uncovered some of the same breast cancer subtypes. In the one data set that included information on time to development of distant metastasis, subtypes were associated with significant differences in this clinical feature. By including a group of tumors from BRCA1 carriers in the analysis, we found that this genotype predisposes to the basal tumor subtype. Our results strongly support the idea that many of these breast tumor subtypes represent biologically distinct disease entities.
Article
Full-text available
The common approach to the multiplicity problem calls for controlling the familywise error rate (FWER). This approach, though, has faults, and we point out a few. A different approach to problems of multiple significance testing is presented. It calls for controlling the expected proportion of falsely rejected hypotheses – the false discovery rate. This error rate is equivalent to the FWER when all hypotheses are true but is smaller otherwise. Therefore, in problems where the control of the false discovery rate rather than that of the FWER is desired, there is potential for a gain in power. A simple sequential Bonferroni-type procedure is proved to control the false discovery rate for independent test statistics, and a simulation study shows that the gain in power is substantial. The use of the new procedure and the appropriateness of the criterion are illustrated with examples.
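As a concrete illustration of the step-up procedure described above, here is a short, self-contained Python sketch of the Benjamini-Hochberg rule applied to a hypothetical vector of p-values; the FDR level q = 0.05 and the simulated p-values are arbitrary choices.

import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level q."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)                        # rank p-values in ascending order
    crit = q * np.arange(1, m + 1) / m           # BH critical values i*q/m
    below = p[order] <= crit
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])         # largest rank i with p_(i) <= i*q/m
        reject[order[:k + 1]] = True             # reject all hypotheses up to that rank
    return reject

rng = np.random.default_rng(3)
pvals = np.concatenate([rng.uniform(size=950),           # mostly null p-values
                        rng.uniform(0, 1e-4, size=50)])  # a few strong signals
print("rejections at q = 0.05:", benjamini_hochberg(pvals).sum())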
Article
Full-text available
DNA microarrays are part of a new and promising class of biotechnologies that allow the monitoring of expression levels in cells for thousands of genes simultaneously. An important and common question in DNA microarray experiments is the identification of differentially expressed genes, that is, genes whose expression levels are associated with a response or covariate of interest. The biological question of differential expression can be restated as a problem in multiple hypothesis testing: the simultaneous test for each gene of the null hypothesis of no association between the expression levels and the responses or covariates. As a typical microarray experiment measures expression levels for thousands of genes simultaneously, large multiplicity problems are generated. This article discusses different approaches to multiple hypothesis testing in the context of DNA microarray experiments and compares the procedures on microarray and simulated data sets.
Article
Full-text available
The optimal treatment of patients with cancer depends on establishing accurate diagnoses by using a complex combination of clinical and histopathological data. In some instances, this task is difficult or impossible because of atypical clinical presentation or histopathology. To determine whether the diagnosis of multiple common adult malignancies could be achieved purely by molecular classification, we subjected 218 tumor samples, spanning 14 common tumor types, and 90 normal tissue samples to oligonucleotide microarray gene expression analysis. The expression levels of 16,063 genes and expressed sequence tags were used to evaluate the accuracy of a multiclass classifier based on a support vector machine algorithm. Overall classification accuracy was 78%, far exceeding the accuracy of random classification (9%). Poorly differentiated cancers resulted in low-confidence predictions and could not be accurately classified according to their tissue of origin, indicating that they are molecularly distinct entities with dramatically different gene expression patterns compared with their well differentiated counterparts. Taken together, these results demonstrate the feasibility of accurate, multiclass molecular cancer classification and suggest a strategy for future clinical implementation of molecular cancer diagnostics.
Article
Full-text available
Diffuse large B-cell lymphoma (DLBCL), the most common subtype of non-Hodgkin's lymphoma, is clinically heterogeneous: 40% of patients respond well to current therapy and have prolonged survival, whereas the remainder succumb to the disease. We proposed that this variability in natural history reflects unrecognized molecular heterogeneity in the tumours. Using DNA microarrays, we have conducted a systematic characterization of gene expression in B-cell malignancies. Here we show that there is diversity in gene expression among the tumours of DLBCL patients, apparently reflecting the variation in tumour proliferation rate, host response and differentiation state of the tumour. We identified two molecularly distinct forms of DLBCL which had gene expression patterns indicative of different stages of B-cell differentiation. One type expressed genes characteristic of germinal centre B cells ('germinal centre B-like DLBCL'); the second type expressed genes normally induced during in vitro activation of peripheral blood B cells ('activated B-like DLBCL'). Patients with germinal centre B-like DLBCL had a significantly better overall survival than those with activated B-like DLBCL. The molecular classification of tumours on the basis of gene expression can thus identify previously undetected and clinically significant subtypes of cancer.
Article
Full-text available
Many cases of hereditary breast cancer are due to mutations in either the BRCA1 or the BRCA2 gene. The histopathological changes in these cancers are often characteristic of the mutant gene. We hypothesized that the genes expressed by these two types of tumors are also distinctive, perhaps allowing us to identify cases of hereditary breast cancer on the basis of gene-expression profiles. RNA from samples of primary tumor from seven carriers of the BRCA1 mutation, seven carriers of the BRCA2 mutation, and seven patients with sporadic cases of breast cancer was compared with a microarray of 6512 complementary DNA clones of 5361 genes. Statistical analyses were used to identify a set of genes that could distinguish the BRCA1 genotype from the BRCA2 genotype. Permutation analysis of multivariate classification functions established that the gene-expression profiles of tumors with BRCA1 mutations, tumors with BRCA2 mutations, and sporadic tumors differed significantly from each other. An analysis of variance between the levels of gene expression and the genotype of the samples identified 176 genes that were differentially expressed in tumors with BRCA1 mutations and tumors with BRCA2 mutations. Given the known properties of some of the genes in this panel, our findings indicate that there are functional differences between breast tumors with BRCA1 mutations and those with BRCA2 mutations. Significantly different groups of genes are expressed by breast cancers with BRCA1 mutations and breast cancers with BRCA2 mutations. Our results suggest that a heritable mutation influences the gene-expression profile of the cancer.
Article
Full-text available
The purpose of this study was to classify breast carcinomas based on variations in gene expression patterns derived from cDNA microarrays and to correlate tumor characteristics to clinical outcome. A total of 85 cDNA microarray experiments representing 78 cancers, three fibroadenomas, and four normal breast tissues were analyzed by hierarchical clustering. As reported previously, the cancers could be classified into a basal epithelial-like group, an ERBB2-overexpressing group and a normal breast-like group based on variations in gene expression. A novel finding was that the previously characterized luminal epithelial/estrogen receptor-positive group could be divided into at least two subgroups, each with a distinctive expression profile. These subtypes proved to be reasonably robust by clustering using two different gene sets: first, a set of 456 cDNA clones previously selected to reflect intrinsic properties of the tumors and, second, a gene set that highly correlated with patient outcome. Survival analyses on a subcohort of patients with locally advanced breast cancer uniformly treated in a prospective study showed significantly different outcomes for the patients belonging to the various groups, including a poor prognosis for the basal-like subtype and a significant difference in outcome for the two estrogen receptor-positive groups.
Article
Full-text available
Breast cancer patients with the same stage of disease can have markedly different treatment responses and overall outcome. The strongest predictors for metastases (for example, lymph node status and histological grade) fail to classify accurately breast tumours according to their clinical behaviour. Chemotherapy or hormonal therapy reduces the risk of distant metastases by approximately one-third; however, 70-80% of patients receiving this treatment would have survived without it. None of the signatures of breast cancer gene expression reported to date allow for patient-tailored therapy strategies. Here we used DNA microarray analysis on primary breast tumours of 117 young patients, and applied supervised classification to identify a gene expression signature strongly predictive of a short interval to distant metastases ('poor prognosis' signature) in patients without tumour cells in local lymph nodes at diagnosis (lymph node negative). In addition, we established a signature that identifies tumours of BRCA1 carriers. The poor prognosis signature consists of genes regulating cell cycle, invasion, metastasis and angiogenesis. This gene expression profile will outperform all currently used clinical parameters in predicting disease outcome. Our findings provide a strategy to select patients who would benefit from adjuvant therapy.
Article
Full-text available
The survival of patients with diffuse large-B-cell lymphoma after chemotherapy is influenced by molecular features of the tumors. We used the gene-expression profiles of these lymphomas to develop a molecular predictor of survival. Biopsy samples of diffuse large-B-cell lymphoma from 240 patients were examined for gene expression with the use of DNA microarrays and analyzed for genomic abnormalities. Subgroups with distinctive gene-expression profiles were defined on the basis of hierarchical clustering. A molecular predictor of risk was constructed with the use of genes with expression patterns that were associated with survival in a preliminary group of 160 patients and was then tested in a validation group of 80 patients. The accuracy of this predictor was compared with that of the international prognostic index. Three gene-expression subgroups--germinal-center B-cell-like, activated B-cell-like, and type 3 diffuse large-B-cell lymphoma--were identified. Two common oncogenic events in diffuse large-B-cell lymphoma, bcl-2 translocation and c-rel amplification, were detected only in the germinal-center B-cell-like subgroup. Patients in this subgroup had the highest five-year survival rate. To identify other molecular determinants of outcome, we searched for individual genes with expression patterns that correlated with survival in the preliminary group of patients. Most of these genes fell within four gene-expression signatures characteristic of germinal-center B cells, proliferating cells, reactive stromal and immune cells in the lymph node, or major-histocompatibility-complex class II complex. We used 17 genes to construct a predictor of overall survival after chemotherapy. This gene-based predictor and the international prognostic index were independent prognostic indicators. DNA microarrays can be used to formulate a molecular predictor of survival after chemotherapy for diffuse large-B-cell lymphoma.
Article
Full-text available
Histopathology is insufficient to predict disease progression and clinical outcome in lung adenocarcinoma. Here we show that gene-expression profiles based on microarray analysis can be used to predict patient survival in early-stage lung adenocarcinomas. Genes most related to survival were identified with univariate Cox analysis. Using either two equivalent but independent training and testing sets, or 'leave-one-out' cross-validation analysis with all tumors, a risk index based on the top 50 genes identified low-risk and high-risk stage I lung adenocarcinomas, which differed significantly with respect to survival. This risk index was then validated using an independent sample of lung adenocarcinomas that predicted high- and low-risk groups. This index included genes not previously associated with survival. The identification of a set of genes that predict survival in early-stage lung adenocarcinoma allows delineation of a high-risk group that may benefit from adjuvant therapy.
Article
Full-text available
Extracting biological information from microarray data requires appropriate statistical methods. The simplest statistical method for detecting differential expression is the t test, which can be used to compare two conditions when there is replication of samples. With more than two conditions, analysis of variance (ANOVA) can be used, and the mixed ANOVA model is a general and powerful approach for microarray experiments with multiple factors and/or several sources of variation.
Article
Full-text available
An important goal of DNA microarray research is to develop tools to diagnose cancer more accurately based on the genetic profile of a tumor. There are several existing techniques in the literature for performing this type of diagnosis. Unfortunately, most of these techniques assume that different subtypes of cancer are already known to exist. Their utility is limited when such subtypes have not been previously identified. Although methods for identifying such subtypes exist, these methods do not work well for all datasets. It would be desirable to develop a procedure to find such subtypes that is applicable in a wide variety of circumstances. Even if no information is known about possible subtypes of a certain form of cancer, clinical information about the patients, such as their survival time, is often available. In this study, we develop some procedures that utilize both the gene expression data and the clinical data to identify subtypes of cancer and use this knowledge to diagnose future patients. These procedures were successfully applied to several publicly available datasets. We present diagnostic procedures that accurately predict the survival of future patients based on the gene expression profile and survival times of previous patients. This has the potential to be a powerful tool for diagnosing and treating cancer.
Article
I propose a new method for variable selection and shrinkage in Cox's proportional hazards model. My proposal maximizes the log partial likelihood subject to the sum of the absolute values of the parameters being bounded by a constant. Because of the nature of this constraint, it shrinks coefficients and produces some coefficients that are exactly zero. As a result it reduces the estimation variance while providing an interpretable final model. The method is a variation of the ‘lasso’ proposal of Tibshirani, designed for the linear regression context. Simulations indicate that the lasso can be more accurate than stepwise selection in this setting.
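In the usual notation (a standard restatement, not quoted from the abstract), the Lasso estimator for the Cox model solves

$$\hat{\beta} = \arg\max_{\beta} \; \ell(\beta) \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le s, \qquad \ell(\beta) = \sum_{i:\, \delta_i = 1} \Big[ x_i^{\top}\beta - \log \sum_{k \in R(t_i)} \exp\big(x_k^{\top}\beta\big) \Big],$$

where $\ell(\beta)$ is the Cox log partial likelihood, $R(t_i)$ is the risk set at event time $t_i$, $\delta_i$ is the event indicator, and $s \ge 0$ is a tuning constant; equivalently, one minimizes $-\ell(\beta) + \lambda \sum_j |\beta_j|$ for some penalty parameter $\lambda \ge 0$.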
Article
Modern advances in computing power have greatly widened scientists' scope in gathering and investigating information from many variables, information which might have been ignored in the past. Yet to effectively scan a large pool of variables is not an easy task, although our ability to interact with data has been much enhanced by recent innovations in dynamic graphics. In this article, we propose a novel data-analytic tool, sliced inverse regression (SIR), for reducing the dimension of the input variable x without going through any parametric or nonparametric model-fitting process. This method explores the simplicity of the inverse view of regression; that is, instead of regressing the univariate output variable y against the multivariate X, we regress x against y. Forward regression and inverse regression are connected by a theorem that motivates this method. The theoretical properties of SIR are investigated under a model of the form y = f(β_1 x, …, β_K x, ε), where the β_k's are unknown row vectors. This model looks like a nonlinear regression, except for the crucial difference that the functional form of f is completely unknown. For effectively reducing the dimension, we need only to estimate the space [effective dimension reduction (e.d.r.) space] generated by the β_k's. This makes our goal different from the usual one in regression analysis, the estimation of all the regression coefficients. In fact, the β_k's themselves are not identifiable without a specific structural form on f. Our main theorem shows that under a suitable condition, if the distribution of x has been standardized to have zero mean and the identity covariance, the inverse regression curve, E(x | y), will fall into the e.d.r. space. Hence a principal component analysis on the covariance matrix of the estimated inverse regression curve can be conducted to locate its main orientation, yielding our estimates for the e.d.r. directions. Furthermore, we use a simple step function to estimate the inverse regression curve. No complicated smoothing is needed. SIR can be easily implemented on personal computers. By simulation, we demonstrate how SIR can effectively reduce the dimension of the input variable from, say, 10 to K = 2 for a data set with 400 observations. The spin-plot of y against the two projected variables obtained by SIR is found to mimic the spin-plot of y against the true directions very well. A chi-squared statistic is proposed to address the issue of whether or not a direction found by SIR is spurious.
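The slicing-and-eigendecomposition recipe described in this abstract can be sketched in a few lines of Python; the following is a rough illustration on simulated data, where the dimensions, slice count, and generating model are assumptions of mine, not the article's.

import numpy as np

rng = np.random.default_rng(6)
n, p, K, H = 400, 10, 2, 10                     # samples, input dim, e.d.r. dims, slices
B = rng.normal(size=(p, K))                     # true directions (unknown in practice)
X = rng.normal(size=(n, p))
y = np.sin(X @ B[:, 0]) + 0.5 * (X @ B[:, 1]) ** 2 + 0.1 * rng.normal(size=n)

# Standardize x to zero mean and identity covariance
Xc = X - X.mean(axis=0)
Sigma = np.cov(Xc, rowvar=False)
L = np.linalg.cholesky(np.linalg.inv(Sigma))    # whitening factor: cov(Xc @ L) = I
Z = Xc @ L

# Slice on y (a step-function estimate of the inverse regression curve E(x | y))
slices = np.array_split(np.argsort(y), H)
M = sum(len(s) / n * np.outer(Z[s].mean(axis=0), Z[s].mean(axis=0)) for s in slices)

# Leading eigenvectors of the slice-mean covariance estimate the e.d.r. space
eigvals, eigvecs = np.linalg.eigh(M)
edr_directions = L @ eigvecs[:, -K:]            # columns span the estimated e.d.r. space
print(edr_directions.shape)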
Article
We consider the problem of testing the significance of features in high-dimensional settings. In particular, we test for differentially-expressed genes in a microarray experiment. We wish to identify genes that are associated with some type of outcome, such as survival time or cancer type. We propose a new procedure, called Lassoed Principal Components (LPC), that builds upon existing methods and can provide a sizable improvement. For instance, in the case of two-class data, a standard (albeit simple) approach might be to compute a two-sample t-statistic for each gene. The LPC method involves projecting these conventional gene scores onto the eigenvectors of the gene expression data covariance matrix and then applying an L1 penalty in order to de-noise the resulting projections. We present a theoretical framework under which LPC is the logical choice for identifying significant genes, and we show that LPC can provide a marked reduction in false discovery rates over the conventional methods on both real and simulated data. Moreover, this flexible procedure can be applied to a variety of types of data and can be used to improve many existing methods for the identification of significant features.
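A simplified numpy sketch of the projection-plus-shrinkage idea follows (for orthonormal eigenvectors, soft-thresholding the projected coefficients is the lasso solution); this is not the authors' implementation, and the eigenvalue weighting and penalty choice of the published method are omitted.

```python
import numpy as np

def lpc_scores(X, y, lam):
    """Project per-gene two-sample t-statistics onto the eigenvectors of the
    gene-gene covariance matrix and soft-threshold the projections."""
    x1, x0 = X[y == 1], X[y == 0]
    n1, n0 = len(x1), len(x0)
    se = np.sqrt(x1.var(axis=0, ddof=1) / n1 + x0.var(axis=0, ddof=1) / n0)
    t = (x1.mean(axis=0) - x0.mean(axis=0)) / se          # conventional gene scores
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)     # rows of Vt: covariance eigenvectors
    coef = Vt @ t                                         # projections of the score vector
    coef = np.sign(coef) * np.maximum(np.abs(coef) - lam, 0.0)   # L1 de-noising
    return Vt.T @ coef                                    # de-noised gene scores

rng = np.random.default_rng(1)
n, p = 40, 200
X = rng.standard_normal((n, p))
y = rng.integers(0, 2, size=n)
X[y == 1, :10] += 1.0                # the first 10 genes are differentially expressed
print(np.argsort(-np.abs(lpc_scores(X, y, lam=2.0)))[:10])
```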
Article
Background: We propose a new method for supervised learning from gene expression data. We call it 'tree harvesting'. This technique starts with a hierarchical clustering of genes, then models the outcome variable as a sum of the average expression profiles of chosen clusters and their products. It can be applied to many different kinds of outcome measures such as censored survival times, or a response falling in two or more classes (for example, cancer classes). The method can discover genes that have strong effects on their own, and genes that interact with other genes. Results: We illustrate the method on data from a lymphoma study, and on a dataset containing samples from eight different cancers. It identified some potentially interesting gene clusters. In simulation studies we found that the procedure may require a large number of experimental samples to successfully discover interactions. Conclusions: Tree harvesting is a potentially useful tool for exploration of gene expression data and identification of interesting clusters of genes worthy of further investigation.
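The sketch below illustrates only the first step as described (hierarchical clustering of genes, then cluster-average expression profiles offered as candidate regressors); the forward selection of clusters and their products in the actual tree-harvesting procedure is not reproduced, and the data are simulated.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
n, p, k = 60, 300, 10
X = rng.standard_normal((n, p))          # rows = samples, columns = genes

# hierarchical clustering of the genes (columns), correlation-based distance
Z = linkage(X.T, method='average', metric='correlation')
labels = fcluster(Z, t=k, criterion='maxclust')      # cut the tree into k clusters

# one derived feature per cluster: the average expression profile of its genes
cluster_means = np.column_stack(
    [X[:, labels == c].mean(axis=1) for c in range(1, k + 1)]   # assumes non-empty clusters
)
print(cluster_means.shape)   # (n, k): these profiles would be offered to the outcome model
```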
Article
Regression upon principal components of the percentage points of the income and education distributions for 1950 census tracts in the city of Chicago led to the estimation of “beta coefficient profiles” for television receiver and refrigerator ownership, for central heating system usage, and for a measure of dwelling unit overcrowding. The betas are standardized coefficients of regression of a dependent variable upon the proportions of families in the classes of the marginal income and education distributions. They measure the relative contribution of families in these classes to the over-all per cent saturation of the dependent variable in the tract. The coefficients were estimated by techniques developed in the first portion of the paper; estimation by classical regression methods would have been impossible because of multicollinearity. The empirical results are in substantial agreement with findings from regressions of the dependent variables upon the mean values of income and education, and their squares. The statistical devices appear to be useful in exploratory empirical research.
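The device itself, regression on principal components to sidestep multicollinearity, can be written in a few lines of numpy; the sketch below uses simulated near-collinear predictors rather than the census variables of the paper.

```python
import numpy as np

def pc_regression(X, y, n_components):
    """Regress y on the leading principal components of X, then map the
    component coefficients back to the original (collinear) variables."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components].T                  # loadings of the leading components
    scores = Xc @ V                          # component scores
    gamma, *_ = np.linalg.lstsq(scores, y - y.mean(), rcond=None)
    return V @ gamma                         # implied coefficients on the original scale

rng = np.random.default_rng(3)
n = 200
z = rng.standard_normal(n)
X = np.column_stack([z + 0.01 * rng.standard_normal(n) for _ in range(5)])  # near-collinear
y = X @ np.array([1.0, 0.5, 0.0, 0.0, -0.5]) + rng.standard_normal(n)
# the implied coefficients are stable even though ordinary least squares is ill-conditioned
print(np.round(pc_regression(X, y, n_components=2), 2))
```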
Article
• Contains additional discussion and examples on left truncation as well as material on more general censoring and truncation patterns. • Introduces the martingale and counting process formulation in a new chapter. • Develops multivariate failure time data in a separate chapter and extends the material on Markov and semi-Markov formulations. • Presents new examples and applications of data analysis.
Article
Modern advances in computing power have greatly widened scientists' scope in gathering and investigating information from many variables, information which might have been ignored in the past. Yet to effectively scan a large pool of variables is not an easy task, although our ability to interact with data has been much enhanced by recent innovations in dynamic graphics. In this article, we propose a novel data-analytic tool, sliced inverse regression (SIR), for reducing the dimension of the input variable x without going through any parametric or nonparametric model-fitting process. This method explores the simplicity of the inverse view of regression; that is, instead of regressing the univariate output variable y against the multivariate x, we regress x against y. Forward regression and inverse regression are connected by a theorem that motivates this method. The theoretical properties of SIR are investigated under a model of the form, y = f(β1x, …, βKx, ε), where the βk's are the unknown row vectors. This model looks like a nonlinear regression, except for the crucial difference that the functional form of f is completely unknown. For effectively reducing the dimension, we need only to estimate the space [effective dimension reduction (e.d.r.) space] generated by the βk's. This makes our goal different from the usual one in regression analysis, the estimation of all the regression coefficients. In fact, the βk's themselves are not identifiable without a specific structural form on f. Our main theorem shows that under a suitable condition, if the distribution of x has been standardized to have the zero mean and the identity covariance, the inverse regression curve, E(x | y), will fall into the e.d.r. space. Hence a principal component analysis on the covariance matrix for the estimated inverse regression curve can be conducted to locate its main orientation, yielding our estimates for e.d.r. directions. Furthermore, we use a simple step function to estimate the inverse regression curve. No complicated smoothing is needed. SIR can be easily implemented on personal computers. By simulation, we demonstrate how SIR can effectively reduce the dimension of the input variable from, say, 10 to K = 2 for a data set with 400 observations. The spin-plot of y against the two projected variables obtained by SIR is found to mimic the spin-plot of y against the true directions very well. A chi-squared statistic is proposed to address the issue of whether or not a direction found by SIR is spurious.
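A compact numpy sketch of the algorithm as described (standardize x, slice on y, average x within slices, eigendecompose the weighted covariance of the slice means) is given below; the number of slices, the simulated model and the sample size are arbitrary.

```python
import numpy as np

def sir_directions(X, y, n_slices=10, n_dirs=2):
    """Sliced inverse regression: estimate e.d.r. directions."""
    n, p = X.shape
    # standardize x to zero mean and identity covariance
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    L = np.linalg.cholesky(np.linalg.inv(cov))      # so that Z = (X - mu) L has identity covariance
    Z = (X - mu) @ L
    # slice on y and average the standardized x within each slice
    order = np.argsort(y)
    slices = np.array_split(order, n_slices)
    means = np.array([Z[s].mean(axis=0) for s in slices])
    weights = np.array([len(s) / n for s in slices])
    # weighted covariance of the slice means; its leading eigenvectors estimate the e.d.r. space
    M = (means * weights[:, None]).T @ means
    _, eigvec = np.linalg.eigh(M)
    dirs = eigvec[:, ::-1][:, :n_dirs]
    return L @ dirs                                 # back-transform to the original x scale

rng = np.random.default_rng(4)
n, p = 400, 10
X = rng.standard_normal((n, p))
y = X[:, 0] / (0.5 + (X[:, 1] + 1.5) ** 2) + 0.1 * rng.standard_normal(n)   # depends on 2 directions
B = sir_directions(X, y)
print(np.round(B / np.linalg.norm(B, axis=0), 2))   # columns should roughly span e1, e2
```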
Article
Although cancer classification has improved over the past 30 years, there has been no general approach for identifying new cancer classes (class discovery) or for assigning tumors to known classes (class prediction). Here, a generic approach to cancer classification based on gene expression monitoring by DNA microarrays is described and applied to human acute leukemias as a test case. A class discovery procedure automatically discovered the distinction between acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) without previous knowledge of these classes. An automatically derived class predictor was able to determine the class of new leukemia cases. The results demonstrate the feasibility of cancer classification based solely on gene expression monitoring and suggest a general strategy for discovering and predicting cancer classes for other types of cancer, independent of previous biological knowledge.
Article
In this study, we introduce a path-following algorithm for L1 regularized generalized linear models. The L1 regularization procedure is useful especially because it, in effect, selects variables according to the amount of penalization on the L1 norm of the coefficients, in a manner less greedy than forward selection/backward deletion. The GLM path algorithm efficiently computes solutions along the entire regularization path using the predictor-corrector method of convex optimization. Selecting the step length of the regularization parameter is critical in controlling the overall accuracy of the paths; we suggest intuitive and flexible strategies for choosing appropriate values. We demonstrate the implementation with several simulated and real datasets.
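The sketch below is not the predictor-corrector path algorithm itself, only a rough stand-in for the output it produces: an L1-penalized logistic regression refit over a decreasing grid of penalty strengths with warm starts, using scikit-learn on simulated data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -2.0, 1.5]
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-X @ beta_true))).astype(int)

# crude regularization "path": refit with warm starts as the penalty weakens
clf = LogisticRegression(penalty='l1', solver='saga', warm_start=True, max_iter=5000)
for C in np.logspace(-2, 1, 10):          # small C = heavy L1 penalty
    clf.set_params(C=C)
    clf.fit(X, y)
    print(f"C={C:.3f}  non-zero coefficients: {np.sum(clf.coef_ != 0)}")
```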
Article
  We propose the elastic net, a new regularization and variable selection method. Real world data and a simulation study show that the elastic net often outperforms the lasso, while enjoying a similar sparsity of representation. In addition, the elastic net encourages a grouping effect, where strongly correlated predictors tend to be in or out of the model together. The elastic net is particularly useful when the number of predictors (p) is much bigger than the number of observations (n). By contrast, the lasso is not a very satisfactory variable selection method in the p≫n case. An algorithm called LARS-EN is proposed for computing elastic net regularization paths efficiently, much like algorithm LARS does for the lasso.
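A small sketch of the grouping tendency described, using scikit-learn's coordinate-descent ElasticNet and Lasso (not the LARS-EN algorithm of the paper) on a simulated p ≫ n problem with one block of strongly correlated predictors; the penalty values are arbitrary.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(6)
n, p = 50, 200
z = rng.standard_normal((n, 1))
X = rng.standard_normal((n, p))
X[:, :5] = z + 0.05 * rng.standard_normal((n, 5))    # a correlated group of 5 predictors
y = z.ravel() + 0.5 * rng.standard_normal(n)

enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
# the elastic net tends to keep the whole correlated group; the lasso often picks only part of it
print("elastic net, group coefficients:", np.round(enet.coef_[:5], 2))
print("lasso,       group coefficients:", np.round(lasso.coef_[:5], 2))
```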
Article
The Dantzig selector (DS) is a recent approach to estimation in high-dimensional linear regression models with a large number of explanatory variables and a relatively small number of observations. As in the least absolute shrinkage and selection operator (LASSO), this approach sets certain regression coefficients exactly to zero, thus performing variable selection. However, such a framework, contrary to the LASSO, has never been used in regression models for survival data with censoring. A key motivation of this article is to study the estimation problem for Cox's proportional hazards (PH) function regression models using a framework that extends the theory, the computational advantages and the optimal asymptotic rate properties of the DS to the class of Cox's PH under appropriate sparsity scenarios. We perform a detailed simulation study to compare our approach with other methods and illustrate it on a well-known microarray gene expression data set for predicting survival from gene expressions.
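For reference, here is a sketch of the original Dantzig selector for the linear model, written as a linear program and solved with scipy; the Cox extension proposed in the paper works with the partial-likelihood score and is not reproduced here, and the tuning constant below is only a common theoretical default.

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(X, y, lam):
    """Linear-model Dantzig selector:
       minimize ||beta||_1  subject to  ||X'(y - X beta)||_inf <= lam,
       as an LP in beta = u - v with u, v >= 0."""
    n, p = X.shape
    G, Xty = X.T @ X, X.T @ y
    c = np.ones(2 * p)                                   # sum(u) + sum(v) = ||beta||_1 at optimum
    A_ub = np.block([[G, -G], [-G, G]])
    b_ub = np.concatenate([lam + Xty, lam - Xty])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method='highs')
    return res.x[:p] - res.x[p:]

rng = np.random.default_rng(7)
n, p = 60, 100
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:4] = [3, -3, 2, -2]
y = X @ beta + 0.5 * rng.standard_normal(n)
lam = 0.5 * np.sqrt(2 * n * np.log(p))                   # sigma * sqrt(2 n log p)
bhat = dantzig_selector(X, y, lam)
print("non-zero estimates at:", np.nonzero(np.abs(bhat) > 1e-6)[0])
```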
Article
This paper considers covariate selection for the additive hazards model. This model is particularly simple to study theoretically and its practical implementation has several major advantages to the similar methodology for the proportional hazards model. One complication compared with the proportional model is, however, that there is no simple likelihood to work with. We here study a least squares criterion with desirable properties and show how this criterion can be interpreted as a prediction error. Given this criterion, we define ridge and Lasso estimators as well as an adaptive Lasso and study their large sample properties for the situation where the number of covariates p is smaller than the number of observations. We also show that the adaptive Lasso has the oracle property. In many practical situations, it is more relevant to tackle the situation with large p compared with the number of observations. We do this by studying the properties of the so-called Dantzig selector in the setting of the additive risk model. Specifically, we establish a bound on how close the solution is to a true sparse signal in the case where the number of covariates is large. In a simulation study, we also compare the Dantzig and adaptive Lasso for a moderate to small number of covariates. The methods are applied to a breast cancer data set with gene expression recordings and to the primary biliary cirrhosis clinical data. Copyright (c) 2009 Board of the Foundation of the Scandinavian Journal of Statistics.
Article
In recent years, many methods have been developed for regression in high-dimensional settings. We propose covariance-regularized regression, a family of methods that use a shrunken estimate of the inverse covariance matrix of the features in order to achieve superior prediction. An estimate of the inverse covariance matrix is obtained by maximizing its log likelihood, under a multivariate normal model, subject to a constraint on its elements; this estimate is then used to estimate coefficients for the regression of the response onto the features. We show that ridge regression, the lasso, and the elastic net are special cases of covariance-regularized regression, and we demonstrate that certain previously unexplored forms of covariance-regularized regression can outperform existing methods in a range of situations. The covariance-regularized regression framework is extended to generalized linear models and linear discriminant analysis, and is used to analyze gene expression data sets with multiple class and survival outcomes.
Article
Without parametric assumptions, high-dimensional regression analysis is already complex. This is made even harder when data are subject to censoring. In this article, we seek ways of reducing the dimensionality of the regressor before applying nonparametric smoothing techniques. If the censoring time is independent of the lifetime, then the method of sliced inverse regression can be applied directly. Otherwise, modification is needed to adjust for the censoring bias. A key identity leading to the bias correction is derived and the root-n consistency of the modified estimate is established. Patterns of censoring can also be studied under a similar dimension reduction framework. Some simulation results and an application to a real data set are reported.
Article
We propose a method for prediction in Cox's proportional hazards model, when the number of features (regressors), p, exceeds the number of observations, n. The method assumes that the features are independent in each risk set, so that the partial likelihood factors into a product. As such, it is analogous to univariate thresholding in linear regression and nearest shrunken centroids in classification. We call the procedure Cox univariate shrinkage and demonstrate its usefulness on real and simulated data. The method has the attractive property of being essentially univariate in its operation: the features are entered into the model based on the size of their Cox score statistics. We illustrate the new method on real and simulated data, and compare it to other proposed methods for survival prediction with a large number of predictors.
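A numpy sketch of the univariate Cox score statistics on which the feature ranking is based (the score test at beta = 0, computed feature by feature) follows; the shrinkage step of the actual proposal is not included, and the data are simulated.

```python
import numpy as np

def cox_score_stats(X, time, event):
    """Per-feature Cox score statistics at beta = 0 (log-rank-type z-scores)."""
    order = np.argsort(-time)                  # decreasing time: risk set = prefix of rows
    Xs, es = X[order], event[order].astype(bool)
    n_at_risk = np.arange(1, len(time) + 1)[:, None]
    csum = np.cumsum(Xs, axis=0)               # sum of x over each risk set
    csum2 = np.cumsum(Xs ** 2, axis=0)         # sum of x^2 over each risk set
    rmean = csum / n_at_risk
    rvar = csum2 / n_at_risk - rmean ** 2      # risk-set variance of each feature
    U = np.sum(Xs[es] - rmean[es], axis=0)     # score
    V = np.sum(rvar[es], axis=0)               # information
    return U / np.sqrt(V)

rng = np.random.default_rng(8)
n, p = 100, 500
X = rng.standard_normal((n, p))
time = rng.exponential(scale=1.0 / np.exp(X[:, 0]))   # only feature 0 affects the hazard
event = (rng.uniform(size=n) < 0.8).astype(int)
z = cox_score_stats(X, time, event)
print("top-ranked features:", np.argsort(-np.abs(z))[:5])   # feature 0 should typically lead
```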
Article
In a Cox regression model, instability of the estimated regression coefficients can be reduced by maximizing a penalized partial log-likelihood, where a penalty function of the regression coefficients is subtracted from the partial log-likelihood. In this paper, we choose the optimal weight of the penalty function by maximizing the predictive value of the model, as measured by the cross-validated partial log-likelihood. Our methods are illustrated by a study of ovarian cancer survival and by a study of centre-effects in kidney graft survival.
Article
The predictive value of a statistical model is conceptually different from the explained variation. In this paper we construct a measure of the predictive value of the Cox proportional hazards model, computed from the leave-one-out regression coefficients. These coefficients can also be used to calculate a shrinkage factor which can be applied to improve the predictions and that can be used in R2-type measures of the proportion of explained variation. Our methods are illustrated by a study of chemotherapy for advanced ovarian cancer.
Article
A system of cluster analysis for genome-wide expression data from DNA microarray hybridization is described that uses standard statistical algorithms to arrange genes according to similarity in pattern of gene expression. The output is displayed graphically, conveying the clustering and the underlying expression data simultaneously in a form intuitive for biologists. We have found in the budding yeast Saccharomyces cerevisiae that clustering gene expression data groups together efficiently genes of known similar function, and we find a similar tendency in human data. Thus patterns seen in genome-wide expression experiments can be interpreted as indications of the status of cellular processes. Also, coexpression of genes of known function with poorly characterized or novel genes may provide a simple means of gaining leads to the functions of many genes for which information is not available currently.
Article
Prognostic classification schemes have often been used in medical applications, but rarely subjected to a rigorous examination of their adequacy. For survival data, the statistical methodology to assess such schemes consists mainly of a range of ad hoc approaches, and there is an alarming lack of commonly accepted standards in this field. We review these methods and develop measures of inaccuracy which may be calculated in a validation study in order to assess the usefulness of estimated patient-specific survival probabilities associated with a prognostic classification scheme. These measures are meaningful even when the estimated probabilities are misspecified, and asymptotically they are not affected by random censorship. In addition, they can be used to derive R(2)-type measures of explained residual variation. A breast cancer study will serve for illustration throughout the paper.
Article
ROC curves are a popular method for displaying sensitivity and specificity of a continuous diagnostic marker, X, for a binary disease variable, D. However, many disease outcomes are time dependent, D(t), and ROC curves that vary as a function of time may be more appropriate. A common example of a time-dependent variable is vital status, where D(t) = 1 if a patient has died prior to time t and zero otherwise. We propose summarizing the discrimination potential of a marker X, measured at baseline (t = 0), by calculating ROC curves for cumulative disease or death incidence by time t, which we denote as ROC(t). A typical complexity with survival data is that observations may be censored. Two ROC curve estimators are proposed that can accommodate censored data. A simple estimator is based on using the Kaplan-Meier estimator for each possible subset X > c. However, this estimator does not guarantee the necessary condition that sensitivity and specificity are monotone in X. An alternative estimator that does guarantee monotonicity is based on a nearest neighbor estimator for the bivariate distribution function of (X, T), where T represents survival time (Akritas, M. J., 1994, Annals of Statistics 22, 1299-1327). We present an example where ROC(t) is used to compare a standard and a modified flow cytometry measurement for predicting survival after detection of breast cancer and an example where the ROC(t) curve displays the impact of modifying eligibility criteria for sample size and power in HIV prevention trials.
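The simpler of the two estimators described can be sketched directly with a Kaplan-Meier routine (sensitivity and specificity for cumulative incidence by time t, using KM within the subsets X > c and X ≤ c); the nearest-neighbor estimator that guarantees monotonicity is not reproduced, and the marker, censoring model and cutoffs below are simulated/arbitrary.

```python
import numpy as np

def km_survival(time, event, t):
    """Kaplan-Meier estimate of S(t)."""
    s = 1.0
    for u in np.sort(np.unique(time[(event == 1) & (time <= t)])):
        at_risk = np.sum(time >= u)
        deaths = np.sum((time == u) & (event == 1))
        s *= 1.0 - deaths / at_risk
    return s

def roc_t(marker, time, event, t, cutoffs):
    """ROC(t) for death by time t; this simple estimator need not be monotone in the cutoff,
    which is exactly what motivates the nearest-neighbor alternative in the paper."""
    S_all = km_survival(time, event, t)
    sens, spec = [], []
    for c in cutoffs:
        hi = marker > c
        p_hi = hi.mean()
        S_hi = km_survival(time[hi], event[hi], t) if hi.any() else 1.0
        S_lo = km_survival(time[~hi], event[~hi], t) if (~hi).any() else 1.0
        sens.append((1 - S_hi) * p_hi / (1 - S_all))        # P(X > c | T <= t)
        spec.append(S_lo * (1 - p_hi) / S_all)              # P(X <= c | T > t)
    return np.array(sens), np.array(spec)

rng = np.random.default_rng(9)
n = 300
x = rng.standard_normal(n)                                # baseline marker
t_true = rng.exponential(scale=1.0 / np.exp(x))           # higher marker = shorter survival
cens = rng.exponential(scale=2.0, size=n)
event = (t_true <= cens).astype(int)
time = np.minimum(t_true, cens)
sens, spec = roc_t(x, time, event, t=1.0, cutoffs=np.quantile(x, [0.25, 0.5, 0.75]))
print("sensitivity:", np.round(sens, 2), " specificity:", np.round(spec, 2))
```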
Article
Microarrays can measure the expression of thousands of genes to identify changes in expression between different biological states. Methods are needed to determine the significance of these changes while accounting for the enormous number of genes. We describe a method, Significance Analysis of Microarrays (SAM), that assigns a score to each gene on the basis of change in gene expression relative to the standard deviation of repeated measurements. For genes with scores greater than an adjustable threshold, SAM uses permutations of the repeated measurements to estimate the percentage of genes identified by chance, the false discovery rate (FDR). When the transcriptional response of human cells to ionizing radiation was measured by microarrays, SAM identified 34 genes that changed at least 1.5-fold with an estimated FDR of 12%, compared with FDRs of 60 and 84% by using conventional methods of analysis. Of the 34 genes, 19 were involved in cell cycle regulation and 3 in apoptosis. Surprisingly, four nucleotide excision repair genes were induced, suggesting that this repair pathway for UV-damaged DNA might play a previously unrecognized role in repairing DNA damaged by ionizing radiation.
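A stripped-down numpy sketch of the idea follows (a fudge-constant-stabilized t-like score per gene, with label permutations used to estimate the false discovery rate at a threshold); the data are simulated, the fudge constant and threshold are fixed arbitrarily, and the published method's data-driven choice of the fudge factor and asymmetric thresholding are omitted.

```python
import numpy as np

def sam_scores(X, y, s0=0.1):
    """SAM-type relative difference d_i = (mean1 - mean0) / (s_i + s0)."""
    x1, x0 = X[y == 1], X[y == 0]
    n1, n0 = len(x1), len(x0)
    num = x1.mean(axis=0) - x0.mean(axis=0)
    pooled = ((x1.var(axis=0, ddof=1) * (n1 - 1) + x0.var(axis=0, ddof=1) * (n0 - 1))
              / (n1 + n0 - 2))
    s = np.sqrt(pooled * (1 / n1 + 1 / n0))
    return num / (s + s0)

def sam_fdr(X, y, threshold, n_perm=200, seed=0):
    """Estimate the FDR for the call |d| > threshold by permuting the class labels."""
    rng = np.random.default_rng(seed)
    called = np.sum(np.abs(sam_scores(X, y)) > threshold)
    null_calls = [np.sum(np.abs(sam_scores(X, rng.permutation(y))) > threshold)
                  for _ in range(n_perm)]
    return np.mean(null_calls) / max(called, 1), called

rng = np.random.default_rng(10)
n, p = 20, 1000
y = np.repeat([0, 1], n // 2)
X = rng.standard_normal((n, p))
X[y == 1, :25] += 2.0                        # 25 truly changed genes
fdr, n_called = sam_fdr(X, y, threshold=3.0)
print(f"genes called: {n_called}, estimated FDR: {fdr:.2f}")
```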
Article
There is an increasing need to link the large amount of genotypic data, gathered using microarrays for example, with various phenotypic data from patients. The classification problem in which gene expression data serve as predictors and a class label phenotype as the binary outcome variable has been examined extensively, but there has been less emphasis on dealing with other types of phenotypic data. In particular, patient survival times with censoring are often not used directly as a response variable due to the complications that arise from censoring. We show that the issues involving censored data can be circumvented by reformulating the problem as a standard Poisson regression problem. The procedure for solving the transformed problem is a combination of two approaches: partial least squares, a regression technique that is especially effective when there is severe collinearity due to a large number of predictors, and generalized linear regression, which extends standard linear regression to deal with various types of response variables. The linear combinations of the original variables identified by the method are highly correlated with the patient survival times and at the same time account for the variability in the covariates. The algorithm is fast, as it does not involve any matrix decompositions in the iterations. We apply our method to data sets from lung carcinoma and diffuse large B-cell lymphoma studies to verify its effectiveness. Contact: peter_park@harvard.edu. Keywords: microarrays; generalized linear models; survival analysis; Poisson regression; principal components analysis.
Article
Recent research has shown that gene expression profiles can potentially be used for predicting various clinical phenotypes, such as tumor class, drug response and survival time. While there have been extensive studies on tumor classification, there has been less emphasis on other phenotypic features, in particular, patient survival time or time to cancer recurrence, which are subject to right censoring. We consider in this paper an analysis of censored survival time based on microarray gene expression profiles. We propose a dimension reduction strategy, which combines principal components analysis and sliced inverse regression, to identify linear combinations of genes that both account for the variability in the gene expression levels and preserve the phenotypic information. The extracted gene combinations are then employed as covariates in a predictive survival model formulation. We apply the proposed method to a large diffuse large-B-cell lymphoma dataset, which consists of 240 patients and 7399 genes, and build a Cox proportional hazards model based on the derived gene expression components. The proposed method is shown to provide good predictive performance for patient survival, as demonstrated by both the significant survival difference between the predicted risk groups and the receiver operating characteristic analysis. R programs are available upon request from the authors. http://dna.ucdavis.edu/~hli/bioinfo-surv-supp.pdf.