
A knowledge-based approach for interpreting genome-wide expression profiles

January 2005

Authors:

Pablo Tamayo

University of California, San Diego

Sayan Mukherjee

Duke University

Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles

Aravind Subramanian (a,b), Pablo Tamayo (a,b), Vamsi K. Mootha (a,c), Sayan Mukherjee, Benjamin L. Ebert (a,e), Michael A. Gillette (a,f), Amanda Paulovich, Scott L. Pomeroy, Todd R. Golub (a,e), Eric S. Lander (a,c,i,j,k), and Jill P. Mesirov (a,k)

Broad Institute of Massachusetts Institute of Technology and Harvard, 320 Charles Street, Cambridge, MA 02141;

Department of Systems Biology, Alpert 536, Harvard Medical School, 200 Longwood Avenue, Boston, MA 02446;

Institute for Genome Sciences and Policy, Center for Interdisciplinary Engineering, Medicine, and Applied Sciences, Duke University, 101 Science Drive, Durham, NC 27708;

Department of Medical Oncology, Dana–Farber Cancer Institute, 44 Binney Street, Boston, MA 02115;

Division of Pulmonary and Critical Care Medicine, Massachusetts General Hospital, 55 Fruit Street, Boston, MA 02114;

Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue North, C2-023, P.O. Box 19024, Seattle, WA 98109-1024;

Department of Neurology, Enders 260, Children’s Hospital, Harvard Medical School, 300 Longwood Avenue, Boston, MA 02115;

Department of Biology, Massachusetts Institute of Technology, Cambridge, MA 02142; and

Whitehead Institute for Biomedical Research, Massachusetts Institute of Technology, Cambridge, MA 02142

Contributed by Eric S. Lander, August 2, 2005

Although genomewide RNA expression analysis has become a routine tool in biomedical research, extracting biological insight from such information remains a major challenge. Here, we describe a powerful analytical method called Gene Set Enrichment Analysis (GSEA) for interpreting gene expression data. The method derives its power by focusing on gene sets, that is, groups of genes that share common biological function, chromosomal location, or regulation. We demonstrate how GSEA yields insights into several cancer-related data sets, including leukemia and lung cancer. Notably, where single-gene analysis finds little similarity between two independent studies of patient survival in lung cancer, GSEA reveals many biological pathways in common. The GSEA method is embodied in a freely available software package, together with an initial database of 1,325 biologically defined gene sets.

microarray

Genomewide expression analysis with DNA microarrays has become a mainstay of genomics research (1, 2). The challenge no longer lies in obtaining gene expression profiles, but rather in interpreting the results to gain insights into biological mechanisms. In a typical experiment, mRNA expression profiles are generated for thousands of genes from a collection of samples belonging to one of two classes, for example, tumors that are sensitive vs. resistant to a drug. The genes can be ordered in a ranked list L, according to their differential expression between the classes. The challenge is to extract meaning from this list.

A common approach involves focusing on a handful of genes at the top and bottom of L (i.e., those showing the largest difference) to discern telltale biological clues. This approach has a few major limitations.

(i) After correcting for multiple hypotheses testing, no individual gene may meet the threshold for statistical significance, because the relevant biological differences are modest relative to the noise inherent to the microarray technology.

(ii) Alternatively, one may be left with a long list of statistically significant genes without any unifying biological theme. Interpretation can be daunting and ad hoc, being dependent on a biologist’s area of expertise.

(iii) Single-gene analysis may miss important effects on pathways. Cellular processes often affect sets of genes acting in concert. An increase of 20% in all genes encoding members of a metabolic pathway may dramatically alter the flux through the pathway and may be more important than a 20-fold increase in a single gene.

(iv) When different groups study the same biological system, the list of statistically significant genes from the two studies may show distressingly little overlap (3).

To overcome these analytical challenges, we recently developed a method called Gene Set Enrichment Analysis (GSEA) that evaluates microarray data at the level of gene sets. The gene sets are defined based on prior biological knowledge, e.g., published information about biochemical pathways or coexpression in previous experiments. The goal of GSEA is to determine whether members of a gene set S tend to occur toward the top (or bottom) of the list L, in which case the gene set is correlated with the phenotypic class distinction.

We used a preliminary version of GSEA to analyze data from muscle biopsies from diabetics vs. healthy controls (4). The method revealed that genes involved in oxidative phosphorylation show reduced expression in diabetics, although the average decrease per gene is only 20%. The results from this study have been independently validated by other microarray studies (5) and by in vivo functional studies (6).

Given this success, we have developed GSEA into a robust technique for analyzing molecular profiling data. We studied its characteristics and performance and substantially revised and generalized the original method for broader applicability.

In this paper, we provide a full mathematical description of the GSEA methodology and illustrate its utility by applying it to several diverse biological problems. We have also created a software package, called GSEA-P, and an initial inventory of gene sets (Molecular Signature Database, MSigDB), both of which are freely available.

Methods

Overview of GSEA. GSEA considers experiments with genomewide expression profiles from samples belonging to two classes, labeled 1 or 2. Genes are ranked based on the correlation between their expression and the class distinction by using any suitable metric (Fig. 1A).

Given an a priori defined set of genes S (e.g., genes encoding products in a metabolic pathway, located in the same cytogenetic band, or sharing the same GO category), the goal of GSEA is to determine whether the members of S are randomly distributed throughout L or primarily found at the top or bottom. We expect that sets related to the phenotypic distinction will tend to show the latter distribution.

Freely available online through the PNAS open access option.

Abbreviations: ALL, acute lymphoid leukemia; AML, acute myeloid leukemia; ES, enrichment score; FDR, false discovery rate; GSEA, Gene Set Enrichment Analysis; MAPK, mitogen-activated protein kinase; MSigDB, Molecular Signature Database; NES, normalized enrichment score.

See Commentary on page 15278.

A.S. and P.T. contributed equally to this work.

To whom correspondence may be addressed. E-mail: lander@broad.mit.edu or mesirov@broad.mit.edu.

www.pnas.org/cgi/doi/10.1073/pnas.0506580102 PNAS | October 25, 2005 | vol. 102 | no. 43 | 15545–15550

GENETICS SEE COMMENTARY

There are three key elements of the GSEA method:

Step 1: Calculation of an Enrichment Score. We calculate an enrichment score (ES) that reflects the degree to which a set S is overrepresented at the extremes (top or bottom) of the entire ranked list L. The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter genes not in S. The magnitude of the increment depends on the correlation of the gene with the phenotype. The enrichment score is the maximum deviation from zero encountered in the random walk; it corresponds to a weighted Kolmogorov–Smirnov-like statistic (ref. 7 and Fig. 1B).
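As an illustration of the running-sum calculation in Step 1, the following Python sketch computes an ES for a toy ranked list. It is a minimal sketch under stated assumptions, not the GSEA-P implementation: the function name `enrichment_score`, the weighting exponent `p`, and the scaling of the steps (hits scaled so that they sum to 1, misses stepping down by one over the number of genes outside S) are choices made here for illustration. The exact definition is the one given in the Appendix.

```python
import numpy as np

def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Return (ES, running_sum) for gene_set against a ranked list.

    ranked_genes: gene names sorted by decreasing correlation with the phenotype
    scores:       correlation of each gene, in the same order
    p:            weighting exponent (an assumption of this sketch)
    """
    scores = np.asarray(scores, dtype=float)
    in_set = np.array([g in gene_set for g in ranked_genes], dtype=bool)
    weights = np.abs(scores) ** p * in_set
    n_miss = int((~in_set).sum())
    if weights.sum() == 0 or n_miss == 0:
        raise ValueError("gene set needs weighted hits and at least one miss")
    # Hits step up in proportion to |correlation|; misses step down uniformly.
    steps = np.where(in_set, weights / weights.sum(), -1.0 / n_miss)
    running = np.cumsum(steps)
    # ES is the signed value at the maximum deviation from zero.
    es = running[np.argmax(np.abs(running))]
    return float(es), running

# Toy example with hypothetical gene names.
genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
corr = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
es_top, walk = enrichment_score(genes, corr, {"g1", "g2"})
```

With `p=0` every hit contributes an equal step, which corresponds to an unweighted running sum.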

Step 2: Estimation of Significance Level of ES. We estimate the statistical significance (nominal P value) of the ES by using an empirical phenotype-based permutation test procedure that preserves the complex correlation structure of the gene expression data. Specifically, we permute the phenotype labels and recompute the ES of the gene set for the permuted data, which generates a null distribution for the ES. The empirical, nominal P value of the observed ES is then calculated relative to this null distribution. Importantly, the permutation of class labels preserves gene-gene correlations and, thus, provides a more biologically reasonable assessment of significance than would be obtained by permuting genes.
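The phenotype-permutation procedure of Step 2 can be sketched as follows. This is an illustrative sketch, not the authors' code: the helper names, the use of a signal-to-noise ranking metric, the weighting exponent `p`, and the default of 1,000 permutations are assumptions made here. The comparison against null scores of the same sign as the observed ES follows the separate treatment of positively and negatively scoring sets described in this paper.

```python
import numpy as np

def signal_to_noise(expr, labels):
    """Per-gene (mean1 - mean2) / (sd1 + sd2); expr is genes x samples."""
    a, b = expr[:, labels == 1], expr[:, labels == 2]
    return (a.mean(axis=1) - b.mean(axis=1)) / (
        a.std(axis=1, ddof=1) + b.std(axis=1, ddof=1))

def es_from_scores(scores, in_set, p=1.0):
    """Weighted running-sum ES for a boolean gene-set membership mask."""
    order = np.argsort(-scores)            # rank genes by decreasing score
    hit = in_set[order]
    w = np.abs(scores[order]) ** p * hit
    steps = np.where(hit, w / w.sum(), -1.0 / (~hit).sum())
    running = np.cumsum(steps)
    return running[np.argmax(np.abs(running))]

def permutation_p_value(expr, labels, in_set, n_perm=1000, seed=0):
    """Nominal P value of the ES from permuting the class labels."""
    rng = np.random.default_rng(seed)
    observed = es_from_scores(signal_to_noise(expr, labels), in_set)
    # Shuffling labels (not genes) keeps gene-gene correlations intact.
    null = np.array([
        es_from_scores(signal_to_noise(expr, rng.permutation(labels)), in_set)
        for _ in range(n_perm)
    ])
    same_sign = null[null >= 0] if observed >= 0 else null[null < 0]
    if same_sign.size == 0:
        return observed, float("nan")
    return observed, float(np.mean(np.abs(same_sign) >= abs(observed)))
```

Here `labels` is a NumPy array of 1s and 2s, matching the class labels used in the overview, and `in_set` is a boolean array with one entry per gene.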

Step 3: Adjustment for Multiple Hypothesis Testing. When an entire database of gene sets is evaluated, we adjust the estimated significance level to account for multiple hypothesis testing. We first normalize the ES for each gene set to account for the size of the set, yielding a normalized enrichment score (NES). We then control the proportion of false positives by calculating the false discovery rate (FDR) (8, 9) corresponding to each NES. The FDR is the estimated probability that a set with a given NES represents a false positive finding; it is computed by comparing the tails of the observed and null distributions for the NES.
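One possible reading of Step 3 is sketched below. The specific choices are assumptions of this sketch rather than a statement of the Appendix: each ES is divided by the mean of the same-sign null scores of its own gene set, and the FDR for a positively scoring set is taken as the ratio of the null tail to the observed tail at that NES. The function names are hypothetical, and the sketch assumes every set's null distribution contains scores of both signs.

```python
import numpy as np

def normalize(es_obs, es_null):
    """Scale observed and null ES by each set's mean same-sign null ES.

    es_obs:  shape (n_sets,)
    es_null: shape (n_sets, n_perm), permutation ES for each set
    """
    pos_mean = np.array([row[row >= 0].mean() for row in es_null])
    neg_mean = np.array([np.abs(row[row < 0]).mean() for row in es_null])
    nes_obs = np.where(es_obs >= 0, es_obs / pos_mean, es_obs / neg_mean)
    nes_null = np.where(es_null >= 0,
                        es_null / pos_mean[:, None],
                        es_null / neg_mean[:, None])
    return nes_obs, nes_null

def fdr_positive(nes_obs, nes_null):
    """Tail-ratio FDR estimate for each positively scoring gene set."""
    obs_pos = nes_obs[nes_obs >= 0]
    null_pos = nes_null[nes_null >= 0]
    q = {}
    for i, nes in enumerate(nes_obs):
        if nes < 0:
            continue
        null_tail = np.mean(null_pos >= nes)   # fraction of null NES this extreme
        obs_tail = np.mean(obs_pos >= nes)     # fraction of observed NES this extreme
        q[i] = float(min(1.0, null_tail / obs_tail))
    return q
```

Negatively scoring sets would be handled the same way with the signs reversed.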

The details of the implementation are described in the Appendix (see also Supporting Text, which is published as supporting information on the PNAS web site).

We note that the GSEA method differs in several important ways from the preliminary version (see Supporting Text). In the original implementation, the running-sum statistic used equal weights at every step, which yielded high scores for sets clustered near the middle of the ranked list (Fig. 2 and Table 1). These sets do not represent biologically relevant correlation with the phenotype. We addressed this issue by weighting the steps according to each gene’s correlation with a phenotype. We noticed that the use of weighted steps could cause the distribution of observed ES scores to be asymmetric in cases where many more genes are correlated with one of the two phenotypes. We therefore estimate the significance levels by considering separately the positively and negatively scoring gene sets (Appendix; see also Fig. 4, which is published as supporting information on the PNAS web site).

Our preliminary implementation used a different approach, familywise-error rate (FWER), to correct for multiple hypotheses testing. The FWER is a conservative correction that seeks to ensure that the list of reported results does not include even a single false-positive gene set. This criterion turned out to be so conservative that many applications yielded no statistically significant results. Because our primary goal is to generate hypotheses, we chose to use the FDR to focus on controlling the probability that each reported result is a false positive.

Based on our statistical analysis and empirical evaluation, GSEA shows broad applicability. It can detect subtle enrichment signals and it preserves our original results in ref. 4, with the oxidative phosphorylation pathway significantly enriched in the normal samples (P = 0.008, FDR = 0.04). This methodology has been implemented in a software tool called GSEA-P.

Fig. 1. A GSEA overview illustrating the method. (A) An expression data set sorted by correlation with phenotype, the corresponding heat map, and the "gene tags," i.e., location of genes from a set S within the sorted list. (B) Plot of the running sum for S in the data set, including the location of the maximum enrichment score (ES) and the leading-edge subset.

Fig. 2. Original (4) enrichment score behavior. The distribution of three gene sets, from the C2 functional collection, in the list of genes in the male/female lymphoblastoid cell line example ranked by their correlation with gender: S1, a set of chromosome X inactivation genes; S2, a pathway describing vitamin C import into neurons; S3, related to chemokine receptors expressed by T helper cells. Shown are plots of the running sum for the three gene sets: S1 is significantly enriched in females as expected, S2 is randomly distributed and scores poorly, and S3 is not enriched at the top of the list but is nonrandom, so it scores well. Arrows show the location of the maximum enrichment score and the point where the correlation (signal-to-noise ratio) crosses zero. Table 1 compares the nominal P values for S1, S2, and S3 by using the original and new method. The new method reduces the significance of sets like S3.

Table 1. P value comparison of gene sets by using original and new methods

Gene set             Original method      New method
                     nominal P value      nominal P value
S1: chrX inactive    0.007                <0.001
S2: vitcb pathway    0.51                 0.38
S3: nkt pathway      0.023                0.54


The Leading-Edge Subset. Gene sets can be defined by using a variety of methods, but not all of the members of a gene set will typically participate in a biological process. Often it is useful to extract the core members of high scoring gene sets that contribute to the ES. We define the leading-edge subset to be those genes in the gene set S that appear in the ranked list L at, or before, the point where the running sum reaches its maximum deviation from zero (Fig. 1B). The leading-edge subset can be interpreted as the core of a gene set that accounts for the enrichment signal.
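The definition above translates directly into a short routine. The sketch below is illustrative only: the function name `leading_edge` and the weighting exponent `p` are assumptions, and for a negative ES it takes the set members at or after the extreme point (the bottom of the list), which is a choice of this sketch by symmetry with the positive case stated in the text.

```python
import numpy as np

def leading_edge(ranked_genes, scores, gene_set, p=1.0):
    """Members of gene_set that account for the peak of the running sum."""
    scores = np.asarray(scores, dtype=float)
    hit = np.array([g in gene_set for g in ranked_genes], dtype=bool)
    w = np.abs(scores) ** p * hit
    running = np.cumsum(np.where(hit, w / w.sum(), -1.0 / (~hit).sum()))
    peak = int(np.argmax(np.abs(running)))
    if running[peak] >= 0:
        # Positive ES: set members at or before the maximum.
        idx = [i for i in range(peak + 1) if hit[i]]
    else:
        # Negative ES (assumption of this sketch): members at or after the minimum.
        idx = [i for i in range(peak, len(ranked_genes)) if hit[i]]
    return [ranked_genes[i] for i in idx]

# Toy example with hypothetical gene names.
genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
corr = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
core = leading_edge(genes, corr, {"g1", "g2", "g5"})
```

In the toy example the running sum peaks after the second gene, so the straggling member near the bottom of the list is excluded from the leading edge.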

Examination of the leading-edge subset can reveal a biologically important subset within a gene set as we show below in our analysis of P53 status in cancer cell lines. This approach is especially useful with manually curated gene sets, which may represent an amalgamation of interacting processes. We first observed this effect in our previous study (4) where we manually identified two high scoring sets, a curated pathway and a computationally derived cluster, which shared a large subset of genes later confirmed to be a key regulon altered in human diabetes.

High scoring gene sets can be grouped on the basis of leading-edge subsets of genes that they share. Such groupings can reveal which of those gene sets correspond to the same biological processes and which represent distinct processes.

The GSEA-P software package includes tools for examining and clustering leading-edge subsets (Supporting Text).
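A simple way to compare leading-edge subsets is to compute the genes they share and a pairwise overlap measure. The sketch below is not the GSEA-P tool: the Jaccard index, the function names, and the toy gene and set names are all assumptions chosen for illustration.

```python
from itertools import combinations

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def shared_core(leading_edges):
    """Genes present in every leading-edge subset."""
    return set.intersection(*(set(v) for v in leading_edges.values()))

def pairwise_overlap(leading_edges):
    """Jaccard overlap for every pair of gene sets, keyed by sorted name pair."""
    return {(m, n): jaccard(leading_edges[m], leading_edges[n])
            for m, n in combinations(sorted(leading_edges), 2)}

# Hypothetical leading-edge subsets for three gene sets.
toy = {
    "set_a": ["G1", "G2", "G3"],
    "set_b": ["G2", "G3", "G4"],
    "set_c": ["G2", "G3", "G5", "G6"],
}
core = shared_core(toy)
overlaps = pairwise_overlap(toy)
```

Sets with high pairwise overlap are candidates for reflecting the same underlying process, while a nonempty shared core points to a common subgroup of genes.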

Variations of the GSEA Method. We focus above and in Results on the use of GSEA to analyze a ranked gene list reflecting differential expression between two classes, each represented by a large number of samples. However, the method can be applied to ranked gene lists arising in other settings.

Genes may be ranked based on the differences seen in a small data set, with too few samples to allow rigorous evaluation of significance levels by permuting the class labels. In these cases, a P value can be estimated by permuting the genes, with the result that genes are randomly assigned to the sets while maintaining their size. This approach is not strictly accurate: because it ignores gene-gene correlations, it will overestimate the significance levels and may lead to false positives. Nonetheless, it can be useful for hypothesis generation. The GSEA-P software supports this option.
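The gene-permutation variant can be sketched by shuffling gene-set membership over a fixed ranked list, which yields random sets of the same size. As with the other sketches, this is a hedged illustration rather than the GSEA-P option itself: the function names, the weighting exponent `p`, and the same-sign comparison are assumptions made here.

```python
import numpy as np

def running_es(sorted_scores, hit, p=1.0):
    """Weighted running-sum ES; sorted_scores must be in decreasing order."""
    w = np.abs(sorted_scores) ** p * hit
    running = np.cumsum(np.where(hit, w / w.sum(), -1.0 / (~hit).sum()))
    return running[np.argmax(np.abs(running))]

def gene_permutation_p_value(sorted_scores, hit, n_perm=1000, seed=0):
    """Approximate P value from random gene sets of the same size."""
    rng = np.random.default_rng(seed)
    sorted_scores = np.asarray(sorted_scores, dtype=float)
    hit = np.asarray(hit, dtype=bool)
    observed = running_es(sorted_scores, hit)
    # Shuffling the membership mask keeps set size but ignores gene-gene correlation.
    null = np.array([running_es(sorted_scores, rng.permutation(hit))
                     for _ in range(n_perm)])
    same_sign = null[null >= 0] if observed >= 0 else null[null < 0]
    if same_sign.size == 0:
        return observed, float("nan")
    return observed, float(np.mean(np.abs(same_sign) >= abs(observed)))
```

Because the ranked list is computed once, this variant is much cheaper than permuting class labels, at the cost of the optimistic significance levels noted above.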

Genes may also be ranked based on how well their expression correlates with a given target pattern (such as the expression pattern of a particular gene). In Lamb et al. (10), a GSEA-like procedure was used to demonstrate the enrichment of a set of targets of cyclin D1 in a gene list ranked by correlation with the profile of cyclin D1 in a compendium of tumor types. Again, approximate P values can be estimated by permutation of genes.

An Initial Catalog of Human Gene Sets. GSEA evaluates a query microarray data set by using a collection of gene sets. We therefore created an initial catalog of 1,325 gene sets, which we call MSigDB 1.0 (Supporting Text; see also Table 3, which is published as supporting information on the PNAS web site), consisting of four types of sets.

Cytogenetic sets (C1, 319 gene sets). This catalog includes 24 sets, one for each of the 24 human chromosomes, and 295 sets corresponding to cytogenetic bands. These sets are helpful in identifying effects related to chromosomal deletions or amplifications, dosage compensation, epigenetic silencing, and other regional effects.

Functional sets (C2, 522 gene sets). This catalog includes 472 sets containing genes whose products are involved in specific metabolic and signaling pathways, as reported in eight publicly available, manually curated databases, and 50 sets containing genes coregulated in response to genetic and chemical perturbations, as reported in various experimental papers.

Regulatory-motif sets (C3, 57 gene sets). This catalog is based on our recent work reporting 57 commonly conserved regulatory motifs in the promoter regions of human genes (11) and makes it possible to link changes in a microarray experiment to a conserved, putative cis-regulatory element.

Neighborhood sets (C4, 427 gene sets). These sets are defined by expression neighborhoods centered on cancer-related genes.

This database provides an initial collection of gene sets for use with GSEA and illustrates the types of gene sets that can be defined, including those based on prior knowledge or derived computationally.

GSEA-P Software and MSigDB Gene Sets. To facilitate the use of GSEA, we have developed resources that are freely available from the Broad Institute upon request. These resources include the GSEA-P software, MSigDB 1.0, and accompanying documentation. The software is available as (i) a platform-independent desktop application with a graphical user interface; (ii) programs in R and Java that advanced users may incorporate into their own analyses or software environments; (iii) an analytic module in our GenePattern microarray analysis package (available upon request); and (iv) a future web-based GSEA server to allow users to run their own analysis directly on the web site. A detailed example of the output format of GSEA is available on the site, as well as in Supporting Text.

Results

We explored the ability of GSEA to provide biologically meaningful insights in six examples for which considerable background information is available. In each case, we searched for significantly associated gene sets from one or both of the subcatalogs C1 and C2 (see above). Table 2 lists all gene sets with an FDR ≤ 0.25.

Male vs. Female Lymphoblastoid Cells. As a simple test, we generated mRNA expression profiles from lymphoblastoid cell lines derived from 15 males and 17 females (unpublished data) and sought to identify gene sets correlated with the distinctions "male>female" and "female>male."

We first tested enrichment of cytogenetic gene sets (C1). For the male>female comparison, we would expect to find the gene sets on chromosome Y. Indeed, GSEA identified chromosome Y and the two Y bands with at least 15 genes (Yp11 and Yq11). For the female>male comparison, we would not expect to see enrichment for bands on chromosome X because most X-linked genes are subject to dosage compensation and, thus, not more highly expressed in females (12).

We next considered enrichment of functional gene sets (C2). The analysis yielded three biologically informative sets. One consists of genes escaping X inactivation [merged from two sources (13, 14) that largely overlap] and shows the expected enrichment in female cells. Two additional sets consist of genes enriched in reproductive tissues (testis and uterus), which is notable inasmuch as mRNA expression was measured in lymphoblastoid cells. This result is not simply due to differential expression of genes on chromosomes X and Y but remains significant when restricted to the autosomal genes within the sets (Table 5, which is published as supporting information on the PNAS web site).

p53 Status in Cancer Cell Lines. We next examined gene expression patterns from the NCI-60 collection of cancer cell lines. We sought to use these data to identify targets of the transcription factor p53, which regulates gene expression in response to various signals of cellular stress. The mutational status of the p53 gene has been reported for 50 of the NCI-60 cell lines, with 17 being classified as normal and 33 as carrying mutations in the gene (15).

We first applied GSEA to identify functional gene sets (C2) correlated with p53 status. The p53+>p53− analysis identified five sets whose expression is correlated with normal p53 function (Table 2). All are clearly related to p53 function. The sets are (i) a biologically annotated collection of genes encoding proteins in the p53-signaling pathway that causes cell-cycle arrest in response to DNA damage; (ii) a collection of downstream targets of p53 defined by experimental induction of a temperature-sensitive allele of p53 in a lung cancer cell line; (iii) an annotated collection of genes induced by radiation, whose response is known to involve p53; (iv) an annotated collection of genes induced by hypoxia, which is known to act through a p53-mediated pathway distinct from the response pathway to DNA damage; and (v) an annotated collection of genes encoding heat shock-protein signaling pathways that protect cells from death in response to various cellular stresses.

The complementary analysis (p53−>p53+) identifies one significant gene set: genes involved in the Ras signaling pathway. Interestingly, two additional sets that fall just short of the significance threshold contain genes involved in the Ngf and Igf1 signaling pathways. To explore whether these three sets reflect a common biological function, we examined the leading-edge subset for each gene set (defined above). The leading-edge subsets consist of 16, 11, and 13 genes, respectively, with each containing four genes encoding products involved in the mitogen-activated protein kinase (MAPK) signaling subpathway (MAP2K1, RAF1, ELK1, and PIK3CA) (Fig. 3). This shared subset in the GSEA signal of the Ras, Ngf, and Igf1 signaling pathways points to up-regulation of this component of the MAPK pathway as a key distinction between the p53− and p53+ tumors. (We note that a full MAPK pathway appears as the ninth set on the list.)

Acute Leukemias. We next sought to study acute lymphoid leukemia (ALL) and acute myeloid leukemia (AML) by comparing gene expression profiles that we had previously obtained from 24 ALL patients and 24 AML patients (16).

We applied GSEA to the cytogenetic gene sets (C1), expecting that chromosomal bands showing enrichment in one class would likely represent regions of frequent cytogenetic alteration in one of the two leukemias. The ALL>AML comparison yielded five gene sets (Table 2), which could represent frequent amplification in ALL or deletion in AML. Indeed, all five regions are readily interpreted in terms of the current knowledge of leukemia.

The 5q31 band is consistent with the known cytogenetics of AML. Chromosome 5q deletions are present in most AML patients, with the critical region having been localized to 5q31 (17). The 17q23 band is a site of known genetic rearrangements in myeloid malignancies (18). The 13q14 band, containing the RB locus, is frequently deleted in AML but rarely in ALL (19). Finally, the 6q21 band contains a site of common chromosomal fragility and is commonly deleted in hematologic malignancies (20).

Interestingly, the remaining high scoring band is 14q32. This band contains the Ig heavy chain locus, which includes >100 genes expressed almost exclusively in the lymphoid lineage. The enrichment of 14q32 in ALL thus reflects tissue-specific expression in the lineage rather than a chromosomal abnormality.

The reciprocal analysis (AML>ALL) yielded no significantly enriched bands, which likely reflects the relative infrequency of deletions in ALL (21). The analyses with the cytogenetic gene sets thus show that GSEA is able to identify chromosomal aberrations common in particular cancer subtypes.

Comparing Two Studies of Lung Cancer. A goal of GSEA is to provide a more robust way to compare independently derived gene expression data sets (possibly obtained with different platforms) and obtain more consistent results than single gene analysis. To test robustness, we reanalyzed data from two recent studies of lung cancer reported by our own group in Boston (22) and another group in Michigan (23). Our goal was not to evaluate the results reported by the individual studies, but rather to examine whether common features between the data sets can be more effectively revealed by gene-set analysis rather than single-gene analysis.

Both studies determined gene-expression profiles in tumor samples from patients with lung adenocarcinomas (n = 62 for Boston; n = 86 for Michigan) and provided clinical outcomes (classified here as "good" or "poor" outcome). We found that no genes in either study were strongly associated with outcome at a significance level of 5% after correcting for multiple hypotheses testing.

Table 2. Summary of GSEA results with FDR ≤ 0.25

Gene set                                              FDR

Data set: Lymphoblast cell lines
  Enriched in males
    chrY                                              <0.001
    chrYp11                                           <0.001
    chrYq11                                           <0.001
    Testis expressed genes                            0.012
  Enriched in females
    X inactivation genes                              <0.001
    Female reproductive tissue expressed genes        0.045

Data set: p53 status in NCI-60 cell lines
  Enriched in p53 mutant
    Ras signaling pathway                             0.171
  Enriched in p53 wild type
    Hypoxia and p53 in the cardiovascular system      <0.001
    Stress induction of HSP regulation                <0.001
    p53 signaling pathway                             <0.001
    p53 up-regulated genes                            0.013
    Radiation sensitivity genes                       0.078

Data set: Acute leukemias
  Enriched in ALL
    chr6q21                                           0.011
    chr5q31                                           0.046
    chr13q14                                          0.057
    chr14q32                                          0.082
    chr17q23                                          0.071

Data set: Lung cancer outcome, Boston study
  Enriched in poor outcome
    Hypoxia and p53 in the cardiovascular system      0.050
    Aminoacyl tRNA biosynthesis                       0.144
    Insulin up-regulated genes                        0.118
    tRNA synthetases                                  0.157
    Leucine deprivation down-regulated genes          0.144
    Telomerase up-regulated genes                     0.128
    Glutamine deprivation down-regulated genes        0.146
    Cell cycle checkpoint                             0.216

Data set: Lung cancer outcome, Michigan study
  Enriched in poor outcome
    Glycolysis gluconeogenesis                        0.006
    vegf pathway                                      0.028
    Insulin up-regulated genes                        0.147
    Insulin signalling                                0.170
    Telomerase up-regulated genes                     0.188
    Glutamate metabolism                              0.200
    Ceramide pathway                                  0.204
    p53 signalling                                    0.179
    tRNA synthetases                                  0.225
    Breast cancer estrogen signalling                 0.250
    Aminoacyl tRNA biosynthesis                       0.229

For detailed results, see Table 4, which is published as supporting information on the PNAS web site.

Fig. 3. Leading edge overlap for p53 study. This plot shows the ras, ngf, and igf1 gene sets correlated with p53− clustered by their leading-edge subsets indicated in dark blue. A common subgroup of genes, apparent as a dark vertical stripe, consists of MAP2K1, PIK3CA, ELK1, and RAF1 and represents a subsection of the MAPK pathway.

From the perspective of individual genes, the data from the two studies show little in common. A traditional approach is to compare the genes most highly correlated with a phenotype. We defined the gene set S_Boston to be the top 100 genes correlated with poor outcome in the Boston study and similarly S_Michigan from the Michigan study. The overlap is distressingly small (12 genes in common) and is barely statistically significant with a permutation test (P = 0.012). When we added a Stanford study (24) involving 24 adenocarcinomas, the three data sets share only one gene in common among the top 100 genes correlated with poor outcome (Fig. 5 and Table 6, which are published as supporting information on the PNAS web site). Moreover, no clear common themes emerge from the genes in the overlaps to provide biological insight.

We then explored whether GSEA would reveal greater similarity between the Boston and Michigan lung cancer data sets. We compared the gene set from one data set, S_Boston, to the entire ranked gene list from the other. The set S_Boston shows a strong significant enrichment in the Michigan data (NES = 1.90, P < 0.001). Conversely, the poor outcome set S_Michigan is enriched in the Boston data (NES = 2.13, P < 0.001). GSEA is thus able to detect a strong common signal in the poor outcome data (Fig. 6, which is published as supporting information on the PNAS web site).

Having found that GSEA is able to detect similarities between independently derived data sets, we then went on to see whether GSEA could provide biological insight by identifying important functional sets correlated with poor outcome in lung cancer. For this purpose, we performed GSEA on the Boston and Michigan data with the C2 catalog of functional gene sets. Given the relatively weak signals found by conventional single-gene analysis in each study, it was not clear whether any significant gene sets would be found by GSEA. Nonetheless, we identified a number of gene sets significantly correlated with poor outcome (FDR ≤ 0.25): 8 in the Boston data and 11 in the Michigan data (Table 2). (The Stanford data had no genes or gene sets significantly correlated with outcome, which is most likely due to the smaller number of samples and many missing values in the data.)

Moreover, there is a large overlap among the significantly enriched gene sets in the two studies. Approximately half of the significant gene sets were shared between the two studies and an additional few, although not identical, were clearly related to the same biological process. Specifically, we found a set up-regulated by telomerase (25), two different tRNA synthesis-related sets, two different insulin-related sets, and two different p53-related sets. Thus, a total of 5 of 8 of the significant sets in Boston are identical or related to 6 of 11 in Michigan.

To provide greater insight, we next extended the analysis to include sets beyond those that met the FDR ≤ 0.25 criterion. Specifically, we considered the top scoring 20 gene sets in each of the three studies (60 gene sets) and their corresponding leading-edge subsets to better understand the underlying biology in the poor outcome samples (Table 4). Already in the Boston/Michigan overlap, we saw evidence of telomerase and p53 response as noted above. Telomerase activation is believed to be a key aspect of pathogenesis in lung adenocarcinoma and is well documented as prognostic of poor outcome in lung cancer.

In all three studie s, two additional themes emerge around rapid

cellular proliferation and amino acid biosynthesis (Table 7, which is

published as supporting information on the PNAS web site):

(i) We see striking evidence in all three studies of the effects of

rapid cell proliferation, including sets related to Ras activation and

the cell cycle as well as responses to hypoxia including angiogenesis,

glycolysis, and carbohydrate metabolism. More than one-third of

the gene sets (23 of 60) are related to such processes. These
responses have been observed in malignant tumor microenvironments
where enhanced proliferation of tumor cells leads to low

oxygen and glucose levels (26). The leading-edge subsets of the

associated significant gene sets include hypoxia-response genes

such as HIF1A, VEGF, CRK, PXN, EIF2B1, EIF2B2, EIF2S2,

FADD, NFKB1, RELA, GADD45A, and also Ras/MAPK activation
genes (HRAS, RAF1, and MAP2K1).

(ii) We find strong evidence for the simultaneous presence of

increased amino acid biosynthesis, mTor signaling, and up-

regulation of a set of genes down-regulated by both amino acid

deprivation and rapamycin treatment (27). Supporting this finding

are 17 gene sets associated with amino acid and nucleotide metab-

olism, immune modulation, and mTor signaling. Based on these

results, one might speculate that rapamycin treatment might have

an effect on this specific component of the poor outcome signal. We

note there is evidence of the efficacy of rapamycin in inhibiting

growth and metastatic progression of non-small cell lung cancer in

mice and human cell lines (28).

Our analysis shows that we find much greater consistency across

the three lung data sets by using GSEA than by single-gene analysis.

Moreover, we are better able to generate compelling hypotheses for

further exploration. In particular, 40 of the 60 top scoring gene sets

across these three studies give a consistent picture of underlying
biological processes in poor outcome cases.

Discussion

Traditional strategies for gene expression analysis have focused on

identifying individual genes that exhibit differences between two

states of interest. Although useful, they fail to detect biological
processes, such as metabolic pathways, transcriptional programs,
and stress responses, that are distributed across an entire network
of genes and subtle at the level of individual genes.

We previously introduced GSEA to analyze such data at the level

of gene sets. The method was initially used to discover metabolic

pathways altered in human diabetes and was subsequently applied

to discover processes involved in diffuse large B cell lymphoma (29),

nutrient-sensing pathways involved in prostate cancer (30), and in

comparing the expression profiles of mouse to those of humans

(31). In the current paper, we have refined the original approach

into a sensitive, robust analytical method and tool with much

broader applicability along with a large database of gene sets.

GSEA can clearly be applied to other data sets such as serum

proteomics data, genotyping information, or metabolite profiles.

GSEA features a number of advantages when compared with

single-gene methods. First, it eases the interpretation of a large-

scale experiment by identifying pathways and processes. Rather

than focus on high scoring genes (which can be poorly annotated

and may not be reproducible), researchers can focus on gene sets,

which tend to be more reproducible and more interpretable.

Second, when the members of a gene set exhibit strong cross-

correlation, GSEA can boost the signal-to-noise ratio and make it

possible to detect modest changes in individual genes. Third, the

leading-edge analysis can help define gene subsets to elucidate the

results.

Several other tools have recently been developed to analyze gene

expression by using pathway or ontology information, e.g., (32–34).

Most determine whether a group of differentially expressed genes

is enriched for a pathway or ontology term by using overlap statistics

such as the cumulative hypergeometric distribution. We note that

this approach is not able to detect the oxidative phosphorylation

results discussed above (P = 0.08, FDR = 0.50). GSEA differs in

two important regards. First, GSEA considers all of the genes in an

experiment, not only those above an arbitrary cutoff in terms of

fold-change or significance. Second, GSEA assesses the significance
by permuting the class labels, which preserves gene-gene correlations
and, thus, provides a more accurate null model.
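The overlap statistic used by such tools can be sketched as follows. This is a minimal illustration of an upper-tail cumulative hypergeometric test, not the implementation of any of the cited tools; the function name `hypergeom_enrichment_p` and the example counts are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(n_total, n_in_set, n_selected, n_overlap):
    """Upper-tail cumulative hypergeometric P value.

    Probability of finding at least `n_overlap` gene-set members among
    `n_selected` genes drawn without replacement from `n_total` genes,
    of which `n_in_set` belong to the gene set.
    """
    max_overlap = min(n_in_set, n_selected)
    denom = comb(n_total, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_total - n_in_set, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / denom

# Hypothetical counts: 10,000 genes on the array, a 100-gene pathway,
# 200 genes above the cutoff, 6 of which fall in the pathway.
p_example = hypergeom_enrichment_p(10000, 100, 200, 6)
```

Note that the result depends entirely on which genes pass the cutoff (`n_selected`), which is the arbitrary threshold that GSEA avoids by scoring the full ranked list.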

The real power of GSEA, however, lies in its flexibility. We have

created an initial molecular signature database consisting of 1,325

gene sets, including ones based on biological pathways, chromo-

somal location, upstream cis motifs, responses to a drug treatment,

or expression profiles in previously generated microarray data sets.

Further sets can be created through genetic and chemical pertur-

bation, computational analysis of genomic information, and addi-

tional biological annotation. In addition, GSEA itself could be used

to refine manually curated pathways and sets by identifying the

Subramanian et al. | PNAS | October 25, 2005 | vol. 102 | no. 43 | 15549

GENETICS SEE COMMENTARY

leading-edge sets that are shared across diverse experimental data

sets. As such sets are added, tools such as GSEA will help link prior

knowledge to newly generated data and thereby help uncover the

collective behavior of genes in states of health and disease.

Appendix: Mathematical Description of Methods

Inputs to GSEA.

1. Expression data set D with N genes and k samples.

2. Ranking procedure to produce Gene List L. Includes a corre-

lation (or other ranking metric) and a phenotype or profile of

interest C. We use only one probe per gene to prevent overes-

timation of the enrichment statistic (Supporting Text; see also

Table 8, which is published as supporting information on the

PNAS web site).

3. An exponent p to control the weight of the step.

4. Independently derived Gene Set S of $N_H$ genes (e.g., a pathway,
a cytogenetic band, or a GO category). In the analyses above,

we used only gene sets with at least 15 members to focus on

robust signals (78% of MSigDB) (Table 3).
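The ranking and gene-set inputs above can be illustrated with a short sketch. It assumes Pearson correlation as the ranking metric and a numeric phenotype profile C; the function names `pearson`, `rank_genes`, and `filter_gene_sets`, and the data layout (a dict from gene name to a list of k sample values), are hypothetical choices for illustration and are not part of the GSEA software.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    sxx = sum(a * a for a in dx)
    syy = sum(b * b for b in dy)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: treat a constant profile as uncorrelated
    return sum(a * b for a, b in zip(dx, dy)) / sqrt(sxx * syy)

def rank_genes(expression, phenotype):
    """Return [(gene, r_j), ...] sorted from most positive to most negative r_j."""
    scored = [(gene, pearson(values, phenotype))
              for gene, values in expression.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

def filter_gene_sets(gene_sets, min_size=15):
    """Keep only gene sets with at least `min_size` members."""
    return {name: genes for name, genes in gene_sets.items()
            if len(genes) >= min_size}

# Toy data: 3 genes, 4 samples, phenotype coded 0/0/1/1 (illustrative only).
expression = {"g1": [1, 2, 3, 4], "g2": [4, 3, 2, 1], "g3": [1, 3, 2, 4]}
ranked = rank_genes(expression, [0, 0, 1, 1])
```

The ranked list plays the role of L, and the second element of each pair is the correlation $r_j$ used to weight the enrichment statistic.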

Enrichment Score ES(S).

1. Rank order the N genes in D to form $L = \{g_1, \ldots, g_N\}$ according to the correlation, $r(g_j) = r_j$, of their expression profiles with C.

2. Evaluate the fraction of genes in S ("hits") weighted by their correlation and the fraction of genes not in S ("misses") present up to a given position i in L.

$$P_{\mathrm{hit}}(S, i) = \sum_{\substack{g_j \in S \\ j \le i}} \frac{|r_j|^p}{N_R}, \quad \text{where } N_R = \sum_{g_j \in S} |r_j|^p \tag{1}$$

$$P_{\mathrm{miss}}(S, i) = \sum_{\substack{g_j \notin S \\ j \le i}} \frac{1}{N - N_H}$$

The ES is the maximum deviation from zero of $P_{\mathrm{hit}} - P_{\mathrm{miss}}$. For

a randomly distributed S, ES(S) will be relatively small, but if it is

concentrated at the top or bottom of the list, or otherwise nonrandomly
distributed, then ES(S) will be correspondingly high. When
p = 0, ES(S) reduces to the standard Kolmogorov–Smirnov statistic;
when p = 1, we are weighting the genes in S by their correlation
with C normalized by the sum of the correlations over all of the
genes in S. We set p = 1 for the examples in this paper. (See Fig.

7, which is published as supporting information on the PNAS web

site.)
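The running-sum calculation of Eq. 1 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the GSEA software itself: the function name `enrichment_score`, its arguments, and the four-gene toy list are hypothetical, and the ranked list with its correlations is taken as given.

```python
def enrichment_score(ranked_genes, correlations, gene_set, p=1.0):
    """Return (ES, running_sum) for `gene_set` against a ranked gene list.

    ranked_genes : genes ordered by correlation with the phenotype (the list L)
    correlations : r_j for each gene, in the same order
    gene_set     : the gene set S
    p            : exponent; each hit is weighted by |r_j|**p
    """
    members = set(gene_set)
    in_set = [g in members for g in ranked_genes]
    n_total = len(ranked_genes)   # N
    n_hits = sum(in_set)          # number of genes of S present in L
    if n_hits == 0 or n_hits == n_total:
        raise ValueError("gene set must overlap the list without covering it")
    # N_R: sum of |r_j|^p over the genes in S
    n_r = sum(abs(r) ** p for r, hit in zip(correlations, in_set) if hit)
    if n_r == 0:
        raise ValueError("all hit weights are zero; try p=0")
    miss_step = 1.0 / (n_total - n_hits)

    running, p_hit, p_miss = [], 0.0, 0.0
    for r, hit in zip(correlations, in_set):
        if hit:
            p_hit += abs(r) ** p / n_r
        else:
            p_miss += miss_step
        running.append(p_hit - p_miss)
    # ES is the running-sum value with the largest absolute deviation from zero.
    es = max(running, key=abs)
    return es, running

# Toy ranked list (illustrative only): the set sits at the top of the list.
genes = ["a", "b", "c", "d"]
r = [2.0, 1.0, -1.0, -2.0]
es_top, _ = enrichment_score(genes, r, {"a", "b"}, p=1.0)
es_bottom, _ = enrichment_score(genes, r, {"c", "d"}, p=1.0)
```

With `p=0` every hit contributes equally, which corresponds to the unweighted Kolmogorov–Smirnov-style statistic described above; with `p=1` hits near the extremes of the list contribute more.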

Estimating Significance. We assess the significance of an observed ES by comparing it with the set of scores $ES_{\mathrm{NULL}}$ computed with randomly assigned phenotypes.

1. Randomly assign the original phenotype labels to samples, reorder genes, and re-compute ES(S).

2. Repeat step 1 for 1,000 permutations, and create a histogram of the corresponding enrichment scores $ES_{\mathrm{NULL}}$.

3. Estimate nominal P value for S from $ES_{\mathrm{NULL}}$ by using the positive or negative portion of the distribution corresponding to the sign of the observed ES(S).
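The three steps above can be sketched as follows. This is an illustration under assumptions rather than the authors' code: `score_fn` is a hypothetical callable that re-ranks the genes for a given label assignment and returns ES(S), and the names `permutation_null` and `nominal_p_value` are invented for this example.

```python
import random

def permutation_null(expression, labels, score_fn, n_perm=1000, seed=0):
    """Build ES_NULL by recomputing the score under shuffled phenotype labels.

    score_fn(expression, labels) -> ES is a hypothetical callable that
    reorders the genes for the given labels and returns ES(S).
    """
    rng = random.Random(seed)
    shuffled = list(labels)
    null_scores = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute class labels, not genes
        null_scores.append(score_fn(expression, list(shuffled)))
    return null_scores

def nominal_p_value(observed_es, null_scores):
    """Fraction of same-sign null scores at least as extreme as the observed ES."""
    if observed_es >= 0:
        same_sign = [s for s in null_scores if s >= 0]
        extreme = sum(s >= observed_es for s in same_sign)
    else:
        same_sign = [s for s in null_scores if s < 0]
        extreme = sum(s <= observed_es for s in same_sign)
    if not same_sign:
        return float("nan")  # undefined: no null scores share the observed sign
    return extreme / len(same_sign)
```

Shuffling the sample labels leaves each sample's expression vector intact, so correlations between genes are the same in every permutation; this is the property that distinguishes the approach from permuting genes.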

Multiple Hypothesis Testing.

1. Determine ES(S) for each gene set in the collection or database.

2. For each S and 1,000 fixed permutations $\pi$ of the phenotype labels, reorder the genes in L and determine ES(S, $\pi$).

3. Adjust for variation in gene set size. Normalize the ES(S, $\pi$) and the observed ES(S), separately rescaling the positive and negative scores by dividing by the mean of the ES(S, $\pi$) to yield the normalized scores NES(S, $\pi$) and NES(S) (see Supporting Text).

4. Compute FDR. Control the ratio of false positives to the total number of gene sets attaining a fixed level of significance separately for positive (negative) NES(S) and NES(S, $\pi$).

Create a histogram of all NES(S, $\pi$) over all S and $\pi$. Use this null distribution to compute an FDR q value, for a given NES(S) = NES* ≥ 0. The FDR is the ratio of the percentage of all (S, $\pi$) with NES(S, $\pi$) ≥ 0, whose NES(S, $\pi$) ≥ NES*, divided by the percentage of observed S with NES(S) ≥ 0, whose NES(S) ≥ NES*, and similarly if NES(S) = NES* ≤ 0.
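The normalization and FDR steps can be sketched as follows. This is a simplified reading of the description above, not the reference implementation: the function names are hypothetical, negative scores are divided by the mean magnitude of the negative null scores so that the sign is preserved, a score of exactly zero is handled with the positive side, and the q value is capped at 1. The last three choices are assumptions made for this example.

```python
def normalize_scores(observed_es, null_es):
    """Rescale one gene set's observed and null ES by the same-sign null mean."""
    pos = [s for s in null_es if s >= 0]
    neg = [-s for s in null_es if s < 0]  # magnitudes of the negative scores
    pos_mean = sum(pos) / len(pos) if pos else None
    neg_mean = sum(neg) / len(neg) if neg else None

    def scale(score):
        denom = pos_mean if score >= 0 else neg_mean
        if not denom:
            raise ValueError("no usable null scores with the same sign")
        return score / denom

    return scale(observed_es), [scale(s) for s in null_es]

def fdr_q_value(nes_star, observed_nes, null_nes):
    """FDR q value for NES*, given all observed NES(S) and all null NES(S, pi)."""
    if nes_star >= 0:
        null_side = [x for x in null_nes if x >= 0]
        obs_side = [x for x in observed_nes if x >= 0]
        null_hits = sum(x >= nes_star for x in null_side)
        obs_hits = sum(x >= nes_star for x in obs_side)
    else:
        null_side = [x for x in null_nes if x < 0]
        obs_side = [x for x in observed_nes if x < 0]
        null_hits = sum(x <= nes_star for x in null_side)
        obs_hits = sum(x <= nes_star for x in obs_side)
    if not null_side or obs_hits == 0:
        raise ValueError("NES* must be one of the observed scores, "
                         "and the null must contain same-sign scores")
    null_frac = null_hits / len(null_side)
    obs_frac = obs_hits / len(obs_side)
    return min(1.0, null_frac / obs_frac)  # cap at 1 is an assumption

# Toy values (illustrative only).
observed = [2.0, 1.0, -1.5]
null = [0.5, 1.0, 1.5, 2.5, -0.5, -1.0, -2.0, 0.8]
q_top = fdr_q_value(2.0, observed, null)
```

Here `null` pools the normalized scores over all gene sets and permutations, so a single null distribution is shared by every gene set on a given side of zero.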

We acknowledge discussions with or data from D. Altshuler, N. Patter-

son, J. Lamb, X. Xie, J.-Ph. Brunet, S. Ramaswamy, J.-P. Bourquin, B.

Sellers, L. Sturla, C. Nutt, and J. C. Florez and comments from reviewers.

1. Schena, M., Shalon, D., Davis, R. W. & Brown, P. O. (1995) Science 270, 467–470.

2. Lockhart, D. J., Dong, H., Byrne, M. C., Follettie, M. T., Gallo, M. V., Chee, M. S.,

Mittmann, M., Wang, C., Kobayashi, M., Horton, H., et al. (1996) Nat. Biotechnol.

14, 1675–1680.

3. Fortunel, N. O., Otu, H. H., Ng, H. H., Chen, J., Mu, X., Chevassut, T., Li, X., Joseph,

M., Bailey, C., Hatzfeld, J. A., et al. (2003) Science 302, 393, author reply 393.

4. Mootha, V. K., Lindgren, C. M., Eriksson, K. F., Subramanian, A., Sihag, S., Lehar,

J., Puigserver, P., Carlsson, E., Ridderstrale, M., Laurila, E., et al. (2003) Nat. Genet.

34, 267–273.

5. Patti, M. E., Butte, A. J., Crunkhorn, S., Cusi, K., Berria, R., Kashyap, S., Miyazaki,

Y., Kohane, I., Costello, M., Saccone, R., et al. (2003) Proc. Natl. Acad. Sci. USA 100,

8466–8471.

6. Petersen, K. F., Dufour, S., Befroy, D., Garcia, R. & Shulman, G. I. (2004) N. Engl.

J. Med. 350, 664–671.

7. Hollander, M. & Wolfe, D. A. (1999) Nonparametric Statistical Methods (Wiley, New

York).

8. Benjamini, Y., Drai, D., Elmer, G., Kafkafi, N. & Golani, I. (2001) Behav. Brain Res.

125, 279–284.

9. Reiner, A., Yekutieli, D. & Benjamini, Y. (2003) Bioinformatics 19, 368–375.

10. Lamb, J., Ramaswamy, S., Ford, H. L., Contreras, B., Martinez, R. V., Kittrell, F. S.,

Zahnow, C. A., Patterson, N., Golub, T. R. & Ewen, M. E. (2003) Cell 114, 323–334.

11. Xie, X., Lu, J., Kulbokas, E. J., Golub, T. R., Mootha, V., Lindblad-Toh, K., Lander,

E. S. & Kellis, M. (2005) Nature 434, 338–345.

12. Plath, K., Mlynarczyk-Evans, S., Nusinow, D. A. & Panning, B. (2002) Annu. Rev.

Genet. 36, 233–278.

13. Carrel, L., Cottle, A. A., Goglin, K. C. & Willard, H. F. (1999) Proc. Natl. Acad. Sci.

USA 96, 14440–14444.

14. Disteche, C. M., Filippova, G. N. & Tsuchiya, K. D. (2002) Cytogenet. Genome Res.

99, 36–43.

15. Olivier, M., Eeles, R., Hollstein, M., Khan, M. A., Harris, C. C. & Hainaut, P. (2002)

Hum. Mutat. 19, 607–614.

16. Armstrong, S. A., Staunton, J. E., Silverman, L. B., Pieters, R., den Boer, M. L.,

Minden, M. D., Sallan, S. E., Lander, E. S., Golub, T. R. & Korsmeyer, S. J. (2002)

Nat. Genet. 30, 41–47.

17. Zhao, N., Stoffel, A., Wang, P. W., Eisenbart, J. D., Espinosa, R., 3rd, Larson, R. A.

& Le Beau, M. M. (1997) Proc. Natl. Acad. Sci. USA 94, 6948–6953.

18. Barbouti, A., Hoglund, M., Johansson, B., Lassen, C., Nilsson, P. G., Hagemeijer, A.,

Mitelman, F. & Fioretos, T. (2003) Cancer Res. 63, 1202–1206.

19. Tanaka, K., Arif, M., Eguchi, M., Guo, S. X., Hayashi, Y., Asaoku, H., Kyo, T., Dohy,

H. & Kamada, N. (1999) Leukemia 13, 1367–1373.

20. Morelli, C., Karayianni, E., Magnanini, C., Mungall, A. J., Thorland, E., Negrini, M.,

Smith, D. I. & Barbanti-Brodano, G. (2002) Oncogene 21, 7266–7276.

21. Mrozek, K., Heerema, N. A. & Bloomfield, C. D. (2004) Blood Rev. 18, 115–136.

22. Bhattacharjee, A., Richards, W. G., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd,

C., Beheshti, J., Bueno, R., Gillette, M., et al. (2001) Proc. Natl. Acad. Sci. USA 98,

13790–13795.

23. Beer, D. G., Kardia, S. L., Huang, C. C., Giordano, T. J., Levin, A. M., Misek, D. E., Lin,

L., Chen, G., Gharib, T. G., Thomas, D. G., et al. (2002) Nat. Med. 8, 816–824.

24. Garber, M. E., Troyanskaya, O. G., Schluens, K., Petersen, S., Thaesler, Z.,

Pacyna-Gengelbach, M., van de Rijn, M., Rosen, G. D., Perou, C. M., Whyte, R. I.,

et al. (2001) Proc. Natl. Acad. Sci. USA 98, 13784–13789.

25. Smith, L. L., Coller, H. A. & Roberts, J. M. (2003) Nat. Cell Biol. 5, 474–479.

26. Acker, T. & Plate, K. H. (2002) J. Mol. Med. 80, 562–575.

27. Peng, T., Golub, T. R. & Sabatini, D. M. (2002) Mol. Cell. Biol. 22, 5575–5584.

28. Boffa, D. J., Luan, F., Thomas, D., Yang, H., Sharma, V. K., Lagman, M. &

Suthanthiran, M. (2004) Clin. Cancer Res. 10, 293–300.

29. Monti, S., Savage, K. J., Kutok, J. L., Feuerhake, F., Kurtin, P., Mihm, M., Wu, B.,

Pasqualucci, L., Neuberg, D., Aguiar, R. C., et al. (2004) Blood 105, 1851–1861.

30. Majumder, P. K., Febbo, P. G., Bikoff, R., Berger, R., Xue, Q., McMahon, L. M., Manola,

J., Brugarolas, J., McDonnell, T. J., Golub, T. R., et al. (2004) Nat. Med. 10, 594–601.

31. Sweet-Cordero, A., Mukherjee, S., Subramanian, A., You, H., Roix, J. J., Ladd-

Acosta, C., Mesirov, J., Golub, T. R. & Jacks, T. (2005) Nat. Genet. 37, 48–55.

32. Doniger, S. W., Salomonis, N., Dahlquist, K. D., Vranizan, K., Lawlor, S. C. &

Conklin, B. R. (2003) Genome Biol. 4, R7.

33. Zhong, S., Storch, K. F., Lipan, O., Kao, M. C., Weitz, C. J. & Wong, W. H. (2004)

Appl. Bioinformatics 3, 261–264.

34. Berriz, G. F., King, O. D., Bryant, B., Sander, C. & Roth, F. P. (2003) Bioinformatics

19, 2502–2504.

15550 | www.pnas.org/cgi/doi/10.1073/pnas.0506580102 | Subramanian et al.

Glia-enriched stem-cell 3D model of the human brain mimics the glial-immune neurodegenerative phenotypes of multiple sclerosis

Preprint

Jun 2024

The role of central nervous system (CNS) glia in sustaining self-autonomous inflammation and driving clinical progression in multiple sclerosis (MS) is gaining scientific interest. We applied a single transcription factor (SOX10)-based protocol to accelerate oligodendrocyte differentiation from hiPSC-derived neural precursor cells, generating self-organizing forebrain organoids. These organoids include neurons, astrocytes, oligodendroglia, and hiPSC-derived microglia to achieve immunocompetence. Over 8 weeks, organoids reproducibly generated mature CNS cell types, exhibiting single-cell transcriptional profiles similar to the adult human brain. Exposed to inflamed cerebrospinal fluid (CSF) from MS patients, organoids properly mimic macroglia-microglia neurodegenerative phenotypes and intercellular communication seen in chronic active MS. Oligodendrocyte vulnerability emerged by day 6 post-MS-CSF exposure, with nearly 50% reduction. Temporally-resolved organoid data support and expand on the role of soluble CSF mediators in sustaining downstream events leading to oligodendrocyte death and inflammatory neurodegeneration. Such findings support implementing this organoid model for drug screening to halt inflammatory neurodegeneration.

Comprehensively analysis of IL33 in hepatocellular carcinoma prognosis, immune microenvironment and biological role

Article

Full-text available

Jun 2024
J CELL MOL MED

IL33 plays an important role in cancer. However, the role of liver cancer remains unclear. Open‐accessed data was obtained from the Cancer Genome Atlas, Xena, and TISCH databases. Different algorithms and R packages are used to perform various analyses. Here, in our comprehensive study on IL33 in HCC, we observed its differential expression across cancers, implicating its role in cancer development. The single‐cell analysis highlighted its primary expression in endothelial cells, unveiling correlations within the HCC microenvironment. Also, the expression level of IL33 was correlated with patients survival, emphasizing its potential prognostic value. Biological enrichment analyses revealed associations with stem cell division, angiogenesis, and inflammatory response. IL33's impact on the immune microenvironment showcased correlations with diverse immune cells. Genomic features and drug sensitivity analyses provided insights into IL33's broader implications. In a pan‐cancer context, IL33 emerged as a potential tumour‐inhibitor, influencing immune‐related molecules. This study significantly advances our understanding of IL33 in cancer biology. IL33 exhibited differential expression across cancers, particularly in endothelial cells within the HCC microenvironment. IL33 is correlated with the survival of HCC patients, indicating potential prognostic value and highlighting its broader implications in cancer biology.

Early reappearance of intraclonal proliferative subpopulations in ibrutinib-resistant chronic lymphocytic leukemia

Article

Full-text available

Jun 2024
LEUKEMIA

The Bruton’s tyrosine kinase (BTK) inhibitor ibrutinib represents an effective strategy for treatment of chronic lymphocytic leukemia (CLL), nevertheless about 30% of patients eventually undergo disease progression. Here we investigated by flow cytometry the long-term modulation of the CLL CXCR4dim/CD5bright proliferative fraction (PF), its correlation with therapeutic outcome and emergence of ibrutinib resistance. By longitudinal tracking, the PF, initially suppressed by ibrutinib, reappeared upon early disease progression, without association with lymphocyte count or serum beta-2-microglobulin. Somatic mutations of BTK/PLCG2, detected in 57% of progressing cases, were significantly enriched in PF with a 3-fold greater allele frequency than the non-PF fraction, suggesting a BTK/PLCG2-mutated reservoir resident within the proliferative compartments. PF increase was also present in BTK/PLCG2-unmutated cases at progression, indicating that PF evaluation could represent a marker of CLL progression under ibrutinib. Furthermore, we evidence different transcriptomic profiles of PF at progression in cases with or without BTK/PLCG2 mutations, suggestive of a reactivation of B-cell receptor signaling or the emergence of bypass signaling through MYC and/or Toll-Like-Receptor-9. Clinically, longitudinal monitoring of the CXCR4dim/CD5bright PF by flow cytometry may provide a simple tool helping to intercept CLL progression under ibrutinib therapy.

Liver cancer from the perspective of single-cell sequencing: a review combined with bibliometric analysis

Article

Full-text available

Jun 2024
J CANCER RES CLIN

Background Liver cancer (LC) is a prevalent malignancy and a leading cause of cancer-related mortality worldwide. Extensive research has been conducted to enhance patient outcomes and develop effective prevention strategies, ranging from molecular mechanisms to clinical interventions. Single-cell sequencing, as a novel bioanalysis technology, has significantly contributed to the understanding of the global cognition and dynamic changes in liver cancer. However, there is a lack of bibliometric analysis in this specific research area. Therefore, the objective of this study is to provide a comprehensive overview of the knowledge structure and research hotspots in the field of single-cell sequencing in liver cancer research through the use of bibliometrics. Method Publications related to the application of single-cell sequencing technology to liver cancer research as of December 31, 2023, were searched on the web of science core collection (WoSCC) database. VOSviewers, CiteSpace, and R package “bibliometrix” were used to conduct this bibliometric analysis. Results A total of 331 publications from 34 countries, primarily led by China and the United States, were included in this study. The research focuses on the application of single cell sequencing technology to liver cancer, and the number of related publications has been increasing year by year. The main research institutions involved in this field are Fudan University, Sun Yat-Sen University, and the Chinese Academy of Sciences. Frontiers in Immunology and Nature Communications is the most popular journal in this field, while Cell is the most frequently co-cited journal. These publications are authored by 2799 individuals, with Fan Jia and Zhou Jian having the most published papers, and Llovet Jm being the most frequently co-cited author. 
The use of single cell sequencing to explore the immune microenvironment of liver cancer, as well as its implications in immunotherapy and chemotherapy, remains the central focus of this field. The emerging research hotspots are characterized by keywords such as 'Gene-Expression', 'Prognosis', 'Tumor Heterogeneity', 'Immunoregulation', and 'Tumor Immune Microenvironment'. Conclusion This is the first bibliometric study that comprehensively summarizes the research trends and developments on the application of single cell sequencing in liver cancer. The study identifies recent research frontiers and hot directions, providing a valuable reference for researchers exploring the landscape of liver cancer, understanding the composition of the immune microenvironment, and utilizing single-cell sequencing technology to guide and enhance the prognosis of liver cancer patients.

A novel mitochondrial-related lncRNA signature mediated prediction of overall survival, immune landscape, and the chemotherapeutic outcomes for bladder cancer patients

Article

Full-text available

Jun 2024

Objective To develop a prognostic risk model for Bladder Cancer (BLCA) based on mitochondrial-related long non-coding RNAs (lncRNAs). Methods Transcriptome and clinical data of BLCA patients were retrieved from the TCGA database. Mitochondrial-related lncRNAs with independent prognostic significance were screened to develop a prognostic risk model. Patients were categorized into high- and low-risk groups using the model. Various methods including Kaplan–Meier (KM) analysis, ROC curve analysis, Gene Set Enrichment Analysis (GSEA), immune analysis, and chemotherapy drug analysis were used to verify and evaluate the model. Results A mitochondrial-associated lncRNA prognostic risk model with independent prognostic significance was developed. High-risk group (HRG) patients exhibited significantly shorter survival periods compared to low-risk group (LRG) patients (P < 0.01). The risk score from the model was an independent predictor of BLCA prognosis, correlating with tumor grade, pathological stage, and lymph node metastasis (P < 0.05). The HRG showed significant positive correlations with high expressions of immune checkpoints (CTLA4, LAG3, PD-1, TIGIT, PD-L1, PD-L2, and TIM-3) and lower IC50 for chemotherapy drugs (cisplatin, docetaxel, paclitaxel, methotrexate, and vinblastine) (P < 0.001). Conclusions The mitochondrial-related lncRNA-based prognostic risk model effectively predicts BLCA prognosis and can guide individualized treatment for BLCA patients.

Peptidyl-prolyl isomerase F as a prognostic biomarker associated with immune infiltrates and mitophagy in lung adenocarcinoma

Article

Jan 2023

Multiomics integration-based immunological characterizations of adamantinomatous craniopharyngioma in relation to keratinization

Article

Full-text available

Jun 2024

Although adamantinomatous craniopharyngioma (ACP) is a tumour with low histological malignancy, there are very few therapeutic options other than surgery. ACP has high histological complexity, and the unique features of the immunological microenvironment within ACP remain elusive. Further elucidation of the tumour microenvironment is particularly important to expand our knowledge of potential therapeutic targets. Here, we performed integrative analysis of 58,081 nuclei through single-nucleus RNA sequencing and spatial transcriptomics on ACP specimens to characterize the features and intercellular network within the microenvironment. The ACP environment is highly immunosuppressive with low levels of T-cell infiltration/cytotoxicity. Moreover, tumour-associated macrophages (TAMs), which originate from distinct sources, highly infiltrate the microenvironment. Using spatial transcriptomic data, we observed one kind of non-microglial derived TAM that highly expressed GPNMB close to the terminally differentiated epithelial cell characterized by RHCG, and this colocalization was verified by asmFISH. We also found the positive correlation of infiltration between these two cell types in datasets with larger cohort. According to intercellular communication analysis, we report a regulatory network that could facilitate the keratinization of RHCG ⁺ epithelial cells, eventually causing tumour progression. Our findings provide a comprehensive analysis of the ACP immune microenvironment and reveal a potential therapeutic strategy base on interfering with these two types of cells.

Revealing the key role of cuproptosis in osteoporosis via the bioinformatic analysis and experimental validation of cuproptosis-related genes

Article

Full-text available

Jun 2024
MAMM GENOME

The incidence of osteoporosis has rapidly increased owing to the ageing population. Cuproptosis, a novel mechanism that regulates cell death, may be a new therapeutic approach. However, the relevance of cuproptosis in the immune microenvironment and osteoporosis immunotherapy is still unknown. We intersected the differentially expressed genes from osteoporotic samples with 75 cuproptosis-related genes to identify 16 significantly expressed cuproptosis genes. We further explored the connection between the cuproptosis pattern, immune microenvironment, and immunotherapy. The weighted gene co-expression network analysis algorithm was used to identify cuproptosis phenotype-associated genes, and we used quantitative real-time PCR and immunohistochemistry in mouse femur tissues to verify hub gene (MAP2K2, FDX1, COX19, VEGFA, CDKN2A, and NFE2L2) expression. Six hub genes and 59 cuproptosis phenotype-associated genes involved in immunisation were identified among the osteoporosis and control groups, and the majority of these 59 genes were enriched in the inflammatory response, as well as in signal transducers, Janus kinase, and transcription pathway activators. In addition, two different clusters of cuproptosis were found, and immune infiltration analysis showed that gene Cluster 1 had a greater immune score and immune infiltration level. Further analysis revealed that three key genes (COX19, MAP2K2, and FDX1) were highly correlated with immune cell infiltration, and external experiments validated the association of these three genes with the prognosis of osteoporosis. We used the three key mRNAs COX19, MAP2K2, and FDX1 as a classification model that may systematically elucidate the complex connection between cuproptosis and the immune microenvironment of osteoporosis. New insights into osteoporosis pathogenesis and immunotherapy prospects may be gained from this study.

UBQLN1 links proteostasis and mitochondria function to telomere maintenance in human embryonic stem cells

Article

Full-text available

Jun 2024

Background Telomeres consist of repetitive DNA sequences at the chromosome ends to protect chromosomal stability, and primarily maintained by telomerase or occasionally by alternative telomere lengthening of telomeres (ALT) through recombination-based mechanisms. Additional mechanisms that may regulate telomere maintenance remain to be explored. Simultaneous measurement of telomere length and transcriptome in the same human embryonic stem cell (hESC) revealed that mRNA expression levels of UBQLN1 exhibit linear relationship with telomere length. Methods In this study, we first generated UBQLN1-deficient hESCs and compared with the wild-type (WT) hESCs the telomere length and molecular change at RNA and protein level by RNA-seq and proteomics. Then we identified the potential interacting proteins with UBQLN1 using immunoprecipitation-mass spectrometry (IP-MS). Furthermore, the potential mechanisms underlying the shortened telomeres in UBQLN1-deficient hESCs were analyzed. Results We show that Ubiquilin1 (UBQLN1) is critical for telomere maintenance in human embryonic stem cells (hESCs) via promoting mitochondrial function. UBQLN1 deficiency leads to oxidative stress, loss of proteostasis, mitochondria dysfunction, DNA damage, and telomere attrition. Reducing oxidative damage and promoting mitochondria function by culture under hypoxia condition or supplementation with N-acetylcysteine partly attenuate the telomere attrition induced by UBQLN1 deficiency. Moreover, UBQLN1 deficiency/telomere shortening downregulates genes for neuro-ectoderm lineage differentiation. Conclusions Altogether, UBQLN1 functions to scavenge ubiquitinated proteins, preventing their overloading mitochondria and elevated mitophagy. UBQLN1 maintains mitochondria and telomeres by regulating proteostasis and plays critical role in neuro-ectoderm differentiation.

From Cardiac Myosin to the Beta Receptor: Autoantibodies Promote a Fibrotic Transcriptome and Reduced Ventricular Recovery in Human Myocarditis

Preprint

Full-text available

Jun 2024

Background Myocarditis leads to dilated cardiomyopathy (DCM) with one-third failing to recover normal ejection fraction (EF50%), and there is a critical need for prognostic biomarkers to assess risk of nonrecovery. Cardiac myosin (CM) autoantibodies (AAbs) cross-reactive with the β−adrenergic receptor (βAR) are associated with myocarditis/DCM, but their potential for prognosis and functional relevance is not fully understood. Methods CM AAbs and myocarditis-derived human monoclonal antibodies (mAbs) were investigated to define pathogenic mechanisms and CM epitopes of nonrecovery. Myocarditis patients who do not recover ejection fraction (EF<50%) by one year were studied in a longitudinal (n=41) cohort. Sera IgG and human mAbs were investigated for autoreactivity with CM and CM peptides by ELISA, protein kinase A (PKA) activation, and transcriptomic analysis in H9c2 heart cell line. Results CM AAbs were significantly elevated in nonrecovered compared to recovered patients and correlated with reduced EF (<50%). CM epitopes specific to nonrecovery were identified. Transcriptomic analysis revealed serum IgG and mAb 2C.4 induced fibrosis/apoptosis pathways in vitro similar to isoproterenol treated cells. Sera IgG and 2C.4 activated PKA in an IgG and βAR-dependent manner. Endomyocardial biopsies from myocarditis/DCM revealed IgG+ trichrome+ tissues. Conclusions CM AAbs were significantly elevated in nonrecovered patients, suggesting novel prognostic relevance. CM AAbs correlated with lower EF, and Ab-induced fibrosis/apoptosis pathways suggested a role for CM AAbs in patients who do not recover and develop irreversible heart failure. Homology between CM and βARs supports mechanisms related to cross-reactivity of CM AAbs with the βAR, a potential AAb target in nonrecovery.

Cloning and characterization of the common fragile site FRA6F harboring a replicative senescence gene and frequently deleted in human tumors

Article

Full-text available

Nov 2002
ONCOGENE

The common fragile site FRA6F, located at 6q21, is an extended region of about 1200 kb, with two hot spots of breakage each spanning about 200 kb. Transcription mapping of the FRA6F region identified 19 known genes, 10 within the FRA6F interval and nine in a proximal or distal position. The nucleotide sequence of FRA6F is rich in repetitive elements (LINE1 and LINE2, Alu, MIR, MER and endogenous retroviral sequences) as well as in matrix attachment regions (MARs), and shows several DNA segments with increased helix flexibility. We found that tight clusters of stem-loop structures were localized exclusively in the two regions with greater frequency of breakage. Chromosomal instability at FRA6F probably depends on a complex interaction of different factors, involving regions of greater DNA flexibility and MARs. We propose an additional mechanism of fragility at FRA6F, based on stem-loop structures which may cause delay or arrest in DNA replication. A senescence gene likely maps within FRA6F, as suggested by detection of deletion and translocation breakpoints involving this fragile site in immortal human-mouse cell hybrids and in SV40-immortalized human fibroblasts containing a human chromosome 6 deleted at q21. Deletion breakpoints within FRA6F are common in several types of human leukemias and solid tumors, suggesting the presence of a tumor suppressor gene in the region. Moreover, a gene associated to hereditary schizophrenia maps within FRA6F. Therefore, FRA6F may represent a landmark for the identification and cloning of genes involved in senescence, leukemia, cancer and schizophrenia.

XIST RNA and the mechanism of X chromosome inactivation

Article

Feb 2002

Dosage compensation in mammals is achieved by the transcriptional inactivation of one X chromosome in female cells. From the time X chromosome inactivation was initially described, it was clear that several mechanisms must be precisely integrated to achieve correct regulation of this complex process. X-inactivation appears to be triggered upon differentiation, suggesting its regulation by developmental cues. Whereas any number of X chromosomes greater than one is silenced, only one X chromosome remains active. Silencing on the inactive X chromosome coincides with the acquisition of a multitude of chromatin modifications, resulting in the formation of extraordinarily stable facultative heterochromatin that is faithfully propagated through subsequent cell divisions. The integration of all these processes requires a region of the X chromosome known as the X-inactivation center, which contains the Xist gene and its cis-regulatory elements. Xist encodes an RNA molecule that plays critical roles in the choice of which X chromosome remains active, and in the initial spread and establishment of silencing on the inactive X chromosome. We are now on the threshold of discovering the factors that regulate and interact with Xist to control X-inactivation, and closer to an understanding of the molecular mechanisms that underlie this complex process.

Telomerase modulates expression of growth-controlling genes and enhances cell proliferation

Article

Jun 2003

Most somatic cells do not express sufficient amounts of telomerase to maintain a constant telomere length during cycles of chromosome replication. Consequently, there is a limit to the number of doublings somatic cells can undergo before telomere shortening triggers an irreversible state of cellular senescence. Ectopic expression of telomerase overcomes this limitation, and in conjunction with specific oncogenes can transform cells to a tumorigenic phenotype. However, recent studies have questioned whether the stabilization of chromosome ends entirely explains the ability of telomerase to promote tumorigenesis and have resulted in the hypothesis that telomerase has a second function that also supports cell division. Here we show that ectopic expression of telomerase in human mammary epithelial cells (HMECs) results in a diminished requirement for exogenous mitogens and that this correlates with telomerase-dependent induction of genes that promote cell growth. Furthermore, we show that inhibiting expression of one of these genes, the epidermal growth factor receptor (EGFR), reverses the enhanced proliferation caused by telomerase. We conclude that telomerase may affect proliferation of epithelial cells not only by stabilizing telomeres, but also by affecting the expression of growth-promoting genes.

Nonparametric Statistical Methods

Book

Jan 1973

Frequent allelic loss of the RB, D13S319 and D13S25 locus in myeloid malignancies with deletion/translocation at 13q14 of chromosome 13, but not in lymphoid malignancies

Article

Oct 1999

In order to identify a commonly deleted region of 13q14 on chromosome 13, we performed fluorescence in situ hybridization (FISH) on 17 patients with myeloid malignancies and 12 patients with lymphoid leukemia/lymphoma who exhibited either deletion or translocation at 13q14. Three cosmid probes (RB, D13S319 and D13S25) hybridizing to sequences on 13q14 were used. Fourteen of the 17 patients with myeloid malignancies (82.4%) exhibited allelic loss at the RB, D13S319 and D13S25 loci, whereas only three of the 12 patients with lymphoid malignancies (25.0%) exhibited loss within these loci. These three patients had chronic lymphocytic leukemia (CLL). Six, two and one of the remaining nine lymphoid leukemia/lymphoma patients had breakpoints centromeric to the RB gene, telomeric to D13S25 and within the D13S319 locus, respectively. A high frequency of allelic loss was found using these probes in patients with myeloid malignancies, compared with patients with lymphoid malignancies other than CLL. These results indicate that loss of the RB gene itself or a region between RB and D13S319, which includes commonly deleted loci, may play an important role in myeloid leukemogenesis.

Benjamini Y, Drai D, Elmer G, Kafkafi N, Golani I. Controlling the false discovery rate in behavior genetics research. Beh Brain Res 125: 279-284

Article

Dec 2001
BEHAV BRAIN RES

The screening of many endpoints when comparing groups from different strains, searching for some statistically significant difference, raises the multiple comparisons problem in its most severe form. Using the 0.05 level to decide which of the many endpoints' differences are statistically significant, the probability of finding a difference to be significant even though it is not real increases far beyond 0.05. The traditional approach to this problem has been to control the probability of making even one such error--the Bonferroni procedure being the most familiar procedure achieving such control. However, the incurred loss of power stemming from such control led many practitioners to neglect multiplicity control altogether. The False Discovery Rate (FDR), suggested by Benjamini and Hochberg [J Royal Stat Soc Ser B 57 (1995) 289], is a new, different, and compromising point of view regarding the error in multiple comparisons. The FDR is the expected proportion of false discoveries among the discoveries, and controlling the FDR goes a long way towards controlling the increased error from multiplicity while losing less in the ability to discover real differences. In this paper we demonstrate the problem in two studies: the study of exploratory behavior [Behav Brain Res (2001)], and the study of the interaction of strain differences with laboratory environment [Science 284 (1999) 1670]. We explain the FDR criterion, and present two simple procedures that control the FDR. We demonstrate their increased power when used in the above two studies.

MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia

Article

Feb 2002

Acute lymphoblastic leukemias carrying a chromosomal translocation involving the mixed-lineage leukemia gene (MLL, ALL1, HRX) have a particularly poor prognosis. Here we show that they have a characteristic, highly distinct gene expression profile that is consistent with an early hematopoietic progenitor expressing select multilineage markers and individual HOX genes. Clustering algorithms reveal that lymphoblastic leukemias with MLL translocations can clearly be separated from conventional acute lymphoblastic and acute myelogenous leukemias. We propose that they constitute a distinct disease, denoted here as MLL, and show that the differences in gene expression are robust enough to classify leukemias correctly as MLL, acute lymphoblastic leukemia or acute myelogenous leukemia. Establishing that MLL is a unique entity is critical, as it mandates the examination of selectively expressed genes for urgently needed molecular targets.

The Immunosuppressant Rapamycin Mimics a Starvation-Like Signal Distinct from Amino Acid and Glucose Deprivation

Article

Sep 2002

RAFT1/FRAP/mTOR is a key regulator of cell growth and division and the mammalian target of rapamycin, an immunosuppressive and anticancer drug. Rapamycin treatment and nutrient deprivation have similar effects on the activity of S6 kinase 1 (S6K1) and 4E-BP1, two downstream effectors of RAFT1, but the relationship between nutrient- and rapamycin-sensitive pathways is unknown. Using transcriptional profiling, we show that, in human BJAB B-lymphoma cells and murine CTLL-2 T lymphocytes, rapamycin treatment affects the expression of many genes involved in nutrient and protein metabolism. The rapamycin-induced transcriptional profile is distinct from those induced by glucose, glutamine, or leucine deprivation but is most similar to that induced by amino acid deprivation. In particular, rapamycin treatment and amino acid deprivation up-regulate genes involved in nutrient catabolism and energy production and down-regulate genes participating in lipid and nucleotide synthesis and in protein synthesis, turnover, and folding. Surprisingly, however, rapamycin had effects opposite from those of amino acid starvation on the expression of a large group of genes involved in the synthesis, transport, and use of amino acids. Supported by measurements of nutrient use, the data suggest that RAFT1 is an energy and nutrient sensor and that rapamycin mimics a signal generated by the starvation of amino acids but that the signal is unlikely to be the absence of amino acids themselves. These observations underscore the importance of metabolism in controlling lymphocyte proliferation and offer a novel explanation for immunosuppression by rapamycin.

A Mechanism of Cyclin D1 Action Encoded in the Patterns of Gene Expression in Human Cancer

Article

Sep 2003
CELL

Here we describe how patterns of gene expression in human tumors have been deconvoluted to reveal a mechanism of action for the cyclin D1 oncogene. Computational analysis of the expression patterns of thousands of genes across hundreds of tumor specimens suggested that a transcription factor, C/EBPbeta/Nf-Il6, participates in the consequences of cyclin D1 overexpression. Functional analyses confirmed the involvement of C/EBPbeta in the regulation of genes affected by cyclin D1 and established this protein as an indispensable effector of a potentially important facet of cyclin D1 biology. This work demonstrates that tumor gene expression databases can be used to study the function of a human oncogene in situ.

Rapamycin Inhibits the Growth and Metastatic Progression of Non-Small Cell Lung Cancer

Article

Jan 2004

Lung cancer has a dismal prognosis and comprises 5.5% of post-transplant malignancies. We explored whether rapamycin inhibits the growth and metastatic progression of non-small cell lung cancer (NSCLC). Murine KLN-205 NSCLC was used as the model tumor in syngeneic DBA/2 mice to explore the effect of rapamycin on tumor growth and metastatic progression. We also examined the effect of rapamycin on cell cycle progression, apoptosis, and proliferation using murine KLN-205 NSCLC cells and human A-549 NSCLC cells as targets. The in vivo and in vitro effects of cyclosporine and those of rapamycin plus cyclosporine were also investigated. Rapamycin but not cyclosporine inhibited tumor growth; s.c. tumor volume was 1290 ± 173 mm³ in untreated DBA/2 mice, 246 ± 80 mm³ in mice treated with rapamycin, and 1203 ± 227 mm³ in mice treated with cyclosporine (P < 0.001). Rapamycin but not cyclosporine prevented the formation of distant metastases; eight of eight untreated mice and four of six mice treated with cyclosporine developed pulmonary metastases whereas only one of six mice treated with rapamycin developed pulmonary metastases (P = 0.003). In vitro, rapamycin induced cell cycle arrest at the G1 checkpoint and blocked proliferation of both KLN-205 and A-549 cells but did not induce apoptosis. Cyclosporine did not prevent cell cycle progression and had a minimal antiproliferative effect on KLN-205 and A-549 cells. The immunosuppressive macrolide rapamycin but not cyclosporine prevents the growth and metastatic progression of NSCLC. A rapamycin-based immunosuppressive regimen may be of value in recipients of allografts.
