
qSVA framework for RNA quality correction in differential expression analysis

June 2017
Proceedings of the National Academy of Sciences 114(27):201617384



Significance: Many studies use measurements of gene expression in human postmortem and ex vivo tissues like brain and blood to characterize genomic correlates of illness. However, molecular analyses of these tissues can be susceptible to a wide range of confounders that may be difficult to measure and remove. In this article, we describe an analysis framework for identifying and removing previously uncharacterized quality biases in measurements of RNA. Our paper critically highlights the shortcomings of standard RNA quality correction approaches, such as statistically adjusting for RNA integrity numbers. We show that our framework removes residual confounding by RNA quality and improves replication of significant differentially expressed genes across independent datasets by more than threefold compared with previous approaches.



Andrew E. Jaffe (a,b,c,d,1), Ran Tao, Alexis L. Norris (e,f), Marc Kealhofer (a,g), Abhinav Nellore (c,d,h), Joo Heon Shin, Dewey Kim, Yankai Jia, Thomas M. Hyde (a,i,j), Joel E. Kleinman (a,j), Richard E. Straub, Jeffrey T. Leek (c,d), and Daniel R. Weinberger (a,e,j,k)

(a) Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD 21205; (b) Department of Mental Health, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205; (c) Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205; (d) Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21205; (e) Department of Neuroscience, Johns Hopkins School of Medicine, Baltimore, MD 21205; (f) Department of Neurology, Kennedy Krieger Institute, Baltimore, MD 21205; (g) Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205; (h) Department of Computer Science, Johns Hopkins University, Baltimore, MD 21205; (i) Department of Neurology, Johns Hopkins School of Medicine, Baltimore, MD 21205; (j) Department of Psychiatry and Behavioral Sciences, Johns Hopkins School of Medicine, Baltimore, MD 21205; and (k) McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, MD 21205

Edited by Pasko Rakic, Yale University, New Haven, CT, and approved May 19, 2017 (received for review October 27, 2016)

RNA sequencing (RNA-seq) is a powerful approach for measuring gene expression levels in cells and tissues, but it relies on high-quality RNA. We demonstrate here that statistical adjustment using existing quality measures largely fails to remove the effects of RNA degradation when RNA quality associates with the outcome of interest. Using RNA-seq data from molecular degradation experiments of human primary tissues, we introduce a method, quality surrogate variable analysis (qSVA), as a framework for estimating and removing the confounding effect of RNA quality in differential expression analysis. We show that this approach results in greatly improved replication rates (>3×) across two large independent postmortem human brain studies of schizophrenia and also removes potential RNA quality biases in earlier published work that compared expression levels of different brain regions and other diagnostic groups. Our approach can therefore improve the interpretation of differential expression analysis of transcriptomic data from human tissue.

RNA sequencing | differential expression analysis | statistical modeling | RNA quality

Microarrays and RNA sequencing (RNA-seq) can measure gene expression levels across hundreds of samples in a single experiment. As gene expression levels are measured with error, normalization procedures have been implemented for both microarray (1) and RNA sequencing (2) data to reduce technical variability, including controlling for variability associated with how and when the samples are run, so-called “batch” effects (3). Recent research has further characterized this expression variability in RNA-seq data (4–6), including demonstrating variability associated with technical factors involved in the preparation, sequencing, and analysis of samples. Variability in gene expression is particularly influenced by RNA quality (7) because accurately measuring gene expression levels strongly depends on the quality of the input RNA. This suggests that a portion of traditionally measured latent “batch” effects could actually be attributed to the underlying quality of the input RNA.

Postmortem studies typically extract RNA from tissue that has been susceptible to a wide variety of antemortem and postmortem factors. Several approaches exist for quantifying the quality of the input RNA before sequencing library construction, including UV absorption ratios of 280 nm to 260 nm and RNA integrity numbers (RINs). RIN is a machine learning-derived measurement resulting from placing RNA on a Bioanalyzer and obtaining a tracing of fragment sizes per sample. RIN ranges from 10 (very high quality RNA) to 0 (completely degraded RNA), and the apparent intactness of ribosomal RNAs (which are two large peaks in the fragment size tracing) is one of the most discriminating factors that distinguishes very high quality from moderate quality RNA (8). Recommended RIN thresholds for sample exclusion before data generation have been suggested as low as 5.0 for PCR (7) and 7.0 for RNA-seq (9). However, even high quality samples (RIN > 8) demonstrate evidence of degradation, as transcriptome-wide gene expression levels strongly associate with RIN even among samples with high RINs, for example, in lymphoblastoid cell lines (6). Furthermore, the recent introduction of ribosomal depletion approaches for library construction, such as the Illumina Ribo-Zero technique, has permitted the sequencing of lower quality samples compared with previous polyadenylation selection-based approaches (polyA+), including samples with RINs less than 3 (10).

Proposed measures of RNA quality can also be derived from the resulting RNA sequencing data, for example, by calculating the 5′ to 3′ read coverage bias (particularly in polyA+ data); transcript integrity numbers (11); various read mapping rates, including to autosomes, ribosomal RNAs, and mitochondrial RNAs (chrM); and gene/exon assignment rates (7). Although many of these approaches appear to capture the largest global effects on expression, for example, through positively correlating factors of expression data with the above-mentioned quality measures, the


Author contributions: A.E.J., J.T.L., and D.R.W. designed research; A.E.J., R.T., A.L.N., J.H.S., D.K., Y.J., T.M.H., J.E.K., R.E.S., J.T.L., and D.R.W. performed research; A.E.J. and A.N. contributed new reagents/analytic tools; A.E.J., A.L.N., and M.K. analyzed data; and A.E.J., J.T.L., and D.R.W. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

Freely available online through the PNAS open access option.

Data deposition: The sequences reported in this paper have been deposited with the National Center for Biotechnology Information (NCBI BioProject number PRJNA389171 and NCBI SRA project SRP108559).

To whom correspondence should be addressed. Email: andrew.jaffe@libd.org.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1617384114/-/DCSupplemental.

PNAS | July 3, 2017 | vol. 114 | no. 27 | 7130–7135 | www.pnas.org/cgi/doi/10.1073/pnas.1617384114

presence and role of more subtle and gene/transcript-specific biases in RNA quality on measures of gene expression and resulting differential expression analysis are unclear. Furthermore, application of existing statistical methods to model latent RNA quality risks retaining false-positive associations in supervised approaches such as surrogate variable analysis (SVA) (12) or removing an outcome-associated biological signal in unsupervised approaches such as principal component analysis (PCA). Here we describe a general analytic framework to estimate and remove RNA quality confounding in differential expression analysis that first identifies transcript features most susceptible to RNA degradation using tissue degradation experiments and subsequently corrects independent datasets using the expression levels of these transcript features. We show that this framework, called quality surrogate variable analysis (qSVA), better identifies and removes confounding related to RNA quality in differential expression analysis than do observed measures of RNA quality alone.

Results

Degradation Experiments to Model Changes in RNA Quality. We hypothesized that examining RNA degradation in human tissue from experimental approaches consisting of leaving tissue at room temperature would identify metrics useful for quantifying RNA quality. We therefore examined the transcriptional landscape of degradation in dorsolateral prefrontal cortex (DLPFC) tissue (in a degradation experiment that we performed) and blood (specifically, peripheral blood mononuclear cells [PBMCs] that were publicly available). Briefly, we left DLPFC tissue from five brains at room temperature (off of ice) for 0, 15, 30, and 60 min; extracted RNA; measured RINs; and then constructed and sequenced both polyA+ and RiboZero libraries (Materials and Methods and SI Appendix, Table S1). The PBMC degradation experiment was a similar design over a longer time period, ranging from 12 h to 84 h (13), and the resulting RNAs were sequenced with polyA+ libraries. The RNA-seq reads from both experiments were processed identically (SI Appendix, Full Methods and Materials). Many technical covariates were strongly associated with degradation time in both blood and brain (SI Appendix, Table S2), and PCA suggested that degradation time was the strongest explanatory variable (PC1) of the transcriptome across each library type, with PC1 explaining 47.5% and 39.0% of the variance in normalized gene counts in polyA+ and RiboZero DLPFC libraries, respectively, and 54.4% in blood (SI Appendix, Fig. S1). These first PCs were more independently associated with degradation time than with RIN in multivariate regression analysis (SI Appendix, Table S2), suggesting that these degradation experiments induce widespread changes in RNA quality that are not fully recognized by RIN.
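The PC1 summary above can be illustrated with a minimal sketch. It assumes a genes × samples matrix of log-transformed normalized expression; the function name `pc1_summary` and the simulated data are hypothetical and are not taken from the paper's own pipeline.

```python
import numpy as np

def pc1_summary(log_expr, covariate):
    """Return (fraction of variance explained by PC1, |Pearson r| of PC1 with covariate).

    log_expr: genes x samples matrix of log-transformed normalized expression.
    covariate: one value per sample (here, minutes at room temperature).
    """
    # Center each gene so the decomposition describes variation across samples.
    centered = log_expr - log_expr.mean(axis=1, keepdims=True)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    var_explained = s**2 / np.sum(s**2)
    pc1_scores = vt[0]  # one PC1 score per sample
    r = np.corrcoef(pc1_scores, covariate)[0, 1]
    return float(var_explained[0]), float(abs(r))

# Simulated stand-in for the design described above: 5 donors x 4 time points.
rng = np.random.default_rng(0)
minutes = np.tile([0.0, 15.0, 30.0, 60.0], 5)
slopes = rng.normal(0, 0.02, size=200)  # hypothetical per-gene degradation slopes
log_expr = 5 + np.outer(slopes, minutes) + rng.normal(0, 0.2, size=(200, 20))

frac, abs_r = pc1_summary(log_expr, minutes)
print(f"PC1 explains {frac:.1%} of variance; |r| with time = {abs_r:.2f}")
```

The sign of a principal component is arbitrary, which is why the sketch reports the absolute correlation.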

Different mRNAs Degrade at Different Rates in Human Tissues. Many genes were highly susceptible to the effects of RNA degradation, including 12,324 genes at a false discovery rate (FDR) < 5% significance in the DLPFC polyA+ dataset (n = 2,303 at p_bonf < 5%), 10,981 genes in the DLPFC RiboZero dataset (n = 2,017 at p_bonf < 5%), and 11,170 genes in blood polyA+ data (n = 2,833 at p_bonf < 5%, Dataset S1). Regardless of tissue or library type, increased susceptibility to RNA degradation (e.g., a more negative degradation t-statistic) was associated with increased gene length and increased coding lengths, increased transcript expression, decreased guanine–cytosine (GC) content, and increased number of annotated transcripts (all but one P value < 2.2 × 10⁻¹⁶, SI Appendix, Table S3). Enrichment analyses among predefined gene sets suggested dysregulation of a wide variety of cellular processes associated with increased degradation susceptibility (Dataset S2).
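The two significance criteria used above (FDR < 5% and Bonferroni-adjusted P < 5%) differ only in how per-gene P values are adjusted. The sketch below assumes the FDR is controlled with the Benjamini-Hochberg procedure, which this section does not state explicitly; the P values are made up for illustration.

```python
import numpy as np

def bonferroni(p):
    """Bonferroni-adjusted P values: multiply by the number of tests, cap at 1."""
    p = np.asarray(p, dtype=float)
    return np.minimum(p * p.size, 1.0)

def benjamini_hochberg(p):
    """Benjamini-Hochberg adjusted P values (q values)."""
    p = np.asarray(p, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest P value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.minimum(ranked, 1.0)
    return q

# Hypothetical per-gene P values for the degradation effect.
p_values = np.array([0.0001, 0.004, 0.019, 0.03, 0.2, 0.7])
n_fdr = int(np.sum(benjamini_hochberg(p_values) < 0.05))
n_bonf = int(np.sum(bonferroni(p_values) < 0.05))
print(n_fdr, n_bonf)  # 4 genes pass FDR < 5%, 2 pass Bonferroni < 5%
```

Bonferroni is the stricter of the two, which is consistent with the much smaller Bonferroni-significant counts reported for each dataset.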

Because RNAs from different cell types may degrade at different rates, and both blood and DLPFC are mixtures of diverse cell types, we explored the role of cell-type–specific signal on RNA degradation. We estimated the relative proportions of 22 different blood cell types using existing reference data in the PBMCs (14) and found significant changes in the relative cellular composition comparing degraded to intact PBMC samples. Increased degradation time decreased the relative proportion of monocytes (P = 1.82 × 10⁻⁵) and increased the relative proportion of macrophages (P = 8.63 × 10⁻⁵), regulatory T cells (P = 5.47 × 10⁻³), and activated mast cells (P = 1.07 × 10⁻⁶, SI Appendix, Fig. S2 and Dataset S3). In DLPFC, because such a reference profile of brain cell types does not exist, we derived cell-type–specific candidate gene lists using available single-cell RNA-seq data (15). We found significant enrichment of these candidate genes among our degradation statistics overall (P < 2.2 × 10⁻¹⁶, SI Appendix, Fig. S3) as well as differential degradation effects by cell type (SI Appendix, Table S4 and Materials and Methods). These enrichment analyses indeed suggest that RNAs from different cell types may be differentially susceptible to degradation, which is captured uniquely by different RNA-seq library preparation methods.

Biological and Technical Specificity of RNA Degradation Transcriptome Associations. Given the strong influence of RNA degradation on the transcriptome, we examined whether these degradation effects were brain- and degradation-method–specific. We directly compared the DLPFC polyA+ and PBMC degradation datasets to determine tissue specificity. The rate of degradation, as measured by RIN, was more rapid in our brain samples, as PBMCs still had high quality RNA after 12 h at room temperature (all RINs > 7.7), compared with DLPFC samples having RINs less than 6.6 after just 1 h. We found only a weak global correlation between the gene degradation susceptibility statistics (SI Appendix, Fig. S4A) and much smaller degradation rates of individual genes (median: 33.6% versus 0.44%; 90th percentile: 213.8% versus 1.4% change per hour) between PBMCs and DLPFC, suggesting global differences in the transcriptome changes resulting from degradation. However, we processed public Association of Biomolecular Resource Facilities Next Generation Sequencing (ABRF-NGS) data (that were sequenced with a RiboZero protocol) that compared three brain reference RNA samples treated with RNase A to nine untreated samples (10). In this RNA (rather than tissue) degradation experiment, there were 13,553 genes significantly associated with RNase treatment (at FDR < 5%). There was significant global overlap between degradation induced by our experiment at the tissue level (using DLPFC RiboZero data) and degradation at the RNA level: 7,700 (65.7%) genes were significantly differentially expressed in both experiments (odds ratio: 6.28, P < 2.2 × 10⁻¹⁶) and there was significant global correlation of degradation susceptibility statistics (P < 2.2 × 10⁻¹⁶, SI Appendix, Fig. S4B). Therefore, the strongest RNA degradation effects appear tissue-specific, but within a tissue, RNase A-like activity is likely a major factor contributing to the RNA degradation.

Strong Bias in Differential Expression Analysis in Confounded Designs. Based on the preceding results, we reasoned that many prior findings in differential expression analyses of postmortem brain datasets may have been susceptible to RNA degradation confounding. For example, many studies comparing different diagnostic groups typically have significant group differences in measures of RNA quality (e.g., RINs). We therefore used two large RNA-seq datasets from the prefrontal cortex comparing patients with schizophrenia to adult controls: Lieber Institute for Brain Development (LIBD, “discovery” data, polyA+ protocol, n = 351) and CommonMind Consortium (CMC, “replication” data, RiboZero protocol, n = 331) (14). Both studies indeed had significantly lower RINs in the control versus schizophrenia groups: LIBD, P = 4.4 × 10⁻⁵ (mean RIN: 8.4 versus 8.1), and CMC, P = 7.6 × 10⁻⁸ (mean RIN: 7.8 versus 7.4). We first created a new diagnostic plot to compare differential expression statistics for outcome to the degradation statistics from RNA degradation experiments (fold change in expression per minute or its corresponding t-statistic). This approach, which we call the “differential expression quality” (DEqual) plot, can illustrate transcriptome-wide RNA degradation bias in a given dataset. We observed strong positive correlation between univariate differential expression statistics for diagnosis and experimental degradation in the LIBD dataset (Fig. 1A and SI Appendix, Fig. S5A). Here the directionality of change associated with diagnosis at a particular gene can be predicted almost entirely by its relationship with degradation and the difference in RNA quality between outcome groups. Among the 24,122 genes with reads per kilobase per million mapped (RPKM) > 0.1, we found that 11,408 (47.3%) genes were significantly differentially expressed at FDR < 5% in the discovery dataset, further suggesting confounding by RNA quality. We posit that removing the correlation between degradation-associated and diagnosis-associated statistics illustrated in the DEqual plot will show that RNA quality has been properly adjusted for in the differential expression analysis.
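The quantity summarized by a DEqual plot is the correlation, across genes, between the outcome statistic and the degradation statistic. A minimal sketch follows; it assumes both sets of statistics are available as gene-keyed dictionaries, and the function name `dequal_correlation` and the simulated statistics are hypothetical.

```python
import numpy as np

def dequal_correlation(de_stats, degradation_stats):
    """Pearson r between outcome and degradation statistics over shared genes.

    Both arguments map gene ID -> statistic (for example, a t-statistic).
    Returns (r, number of shared genes).
    """
    shared = sorted(set(de_stats) & set(degradation_stats))
    x = np.array([degradation_stats[g] for g in shared])
    y = np.array([de_stats[g] for g in shared])
    return float(np.corrcoef(x, y)[0, 1]), len(shared)

# Simulated statistics for 1,000 genes (not real data).
rng = np.random.default_rng(3)
genes = [f"gene{i}" for i in range(1000)]
deg_t = rng.normal(0, 1, 1000)
degradation_stats = dict(zip(genes, deg_t))
# A confounded analysis: outcome statistics track the degradation statistics.
confounded = dict(zip(genes, 0.5 * deg_t + rng.normal(0, 1, 1000)))
# A well-adjusted analysis: outcome statistics are unrelated to degradation.
corrected = dict(zip(genes, rng.normal(0, 1, 1000)))

r_confounded, n_shared = dequal_correlation(confounded, degradation_stats)
r_corrected, _ = dequal_correlation(corrected, degradation_stats)
print(f"confounded r = {r_confounded:.2f}, corrected r = {r_corrected:.2f}")
```

Plotting `y` against `x` for the shared genes gives the scatterplot itself; a correlation near zero is the pattern expected after adequate quality adjustment.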

Statistically Adjusting for RIN Fails to Remove Degradation Bias. Given the DEqual plots from the univariate analysis, the significant difference in RIN between the schizophrenia and control groups, and the large number of differentially expressed genes, we expected that adjusting the differential expression analysis for RINs would reduce the degree of degradation bias. However, RIN adjustment only partially reduced the correlation between diagnosis and degradation statistics (Fig. 1B and SI Appendix, Fig. S5B), from Pearson correlation r = 0.464 to r = 0.358, and only reduced the number of FDR-significant differentially expressed genes from 11,408 to 6,622 in the discovery dataset. The degree of RNA degradation bias was practically identical when further modeling RIN nonlinearly, e.g., adjusting for both RIN and RIN² (SI Appendix, Fig. S6). We further adjusted the differential expression analysis for other observed variables, including clinical and technical covariates (“observed” model: age, sex, ethnicity, chrM map rate, gene assignment rate, and RIN), which again only partially reduced both the correlation between diagnosis and degradation statistics (to r = 0.291, Fig. 1C) and the number of genes that were significantly differentially expressed (n = 2,215).

We also used the PBMC degradation dataset to show that RIN adjustment fails to account for the differences in RNA degradation between outcome groups. Here, we modeled differences in expression between individuals 1 and 2 after inducing confounding by degradation time by removing T = 0 for individual 1 and T = 84 for individual 2 (Materials and Methods). As expected, univariate analysis showed a strong correlation between the individual effect and the degradation effect (SI Appendix, Fig. S7A, r = 0.495). Again, statistical adjustment for RIN in this confounded design does not remove the strong degradation bias (SI Appendix, Fig. S7B, r = 0.307). Here, in this experimental dataset, unlike the schizophrenia case-control datasets described above, we have a gold standard surrogate of RNA degradation (the time at room temperature) and show that adjusting for this measure completely removes the RNA degradation bias (SI Appendix, Fig. S7C, r = −0.09). These results suggest that RIN or other observed quality variables may be a poor surrogate for total RNA quality and that adjusting for RIN in differential expression analysis is insufficient to remove potential RNA degradation confounding.
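One way to see why adjusting for an imperfect surrogate leaves residual bias is a small simulation, sketched below under strong simplifying assumptions: expression depends linearly on a single latent quality variable, the true group effect is zero for every gene, and the observed proxy (standing in for RIN) is the latent variable plus independent noise. All names and parameter values are hypothetical and do not reproduce the PBMC analysis.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_genes = 200, 500

group = np.repeat([0.0, 1.0], n_samples // 2)             # outcome of interest
degradation = group + rng.normal(0, 1, n_samples)          # latent quality, confounded with group
rin_proxy = -degradation + rng.normal(0, 1, n_samples)     # noisy observed surrogate
deg_effect = rng.normal(0, 0.5, n_genes)                   # per-gene degradation slope
# Expression depends on degradation only, so any estimated group effect is bias.
expr = np.outer(degradation, deg_effect) + rng.normal(0, 1, (n_samples, n_genes))

def group_effects(covariates):
    """Per-gene OLS coefficient for group, adjusting for the given covariates."""
    X = np.column_stack([np.ones(n_samples), group] + covariates)
    coef, *_ = np.linalg.lstsq(X, expr, rcond=None)
    return coef[1]

def bias_correlation(covariates):
    """Correlation across genes between group effects and degradation slopes."""
    return float(np.corrcoef(group_effects(covariates), deg_effect)[0, 1])

r_unadjusted = bias_correlation([])
r_proxy = bias_correlation([rin_proxy])
r_true_quality = bias_correlation([degradation])
print(f"unadjusted {r_unadjusted:.2f}, proxy-adjusted {r_proxy:.2f}, "
      f"true-quality-adjusted {r_true_quality:.2f}")
```

In this toy setting the proxy-adjusted correlation falls but stays clearly positive, while adjusting for the latent variable itself brings it close to zero, mirroring the qualitative pattern reported for the PBMC data.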

qSVA to Correct for RNA Degradation Bias. We hypothesized that we could leverage the experimental degradation datasets to better estimate factors related to RNA quality in RNA-sequencing datasets. This approach relies on estimating the transcript features most susceptible to RNA degradation and using these features as “negative control” features akin to approaches such as remove unwanted variation (RUV) (2) or SVA (12). The broad concept of the algorithm is to identify transcript features that are especially sensitive to degradation in the tissue of interest and then to quantify these same features in the experimental differential expression dataset and create a set of factors that are used to control for RNA quality bias (see SI Appendix, Full Methods and Materials for details). We defined those features that were Bonferroni-significantly associated with degradation in each dataset: the top 1,000 features in the DLPFC and PBMC polyA+ datasets (among thousands that were significant) and the 515 features in the DLPFC RiboZero data (step #1, see SI Appendix, Full Methods and Materials and Datasets S4 and S5). Interestingly, the transcript features in DLPFC across these two library types were completely nonoverlapping, suggesting that the features most susceptible to degradation likely differ by library type. Within polyA+ data, there were only four degradation-susceptible features overlapping between DLPFC and PBMCs (within genes: PNKD, MBOAT7, ENG, and SULF2). These features can then be quantified in new user-provided samples for step #2 from BAM or BigWig files (SI Appendix, Full Methods and Materials), resulting in coverage estimates for each feature and new sample. Then, for step #3, factor analysis on the log-transformed degradation matrix of coverage estimates generates quality surrogate variables (qSVs). In step #4, the qSVs are then included as adjustment variables in differential expression analysis. The qSVA approach is available in the SVA Bioconductor package (https://bioconductor.org/packages/sva) (16), and the example code to run the statistical framework is described in SI Appendix, Full Methods and Materials.
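Steps #3 and #4 can be sketched in a few lines. The version below is an illustration, not the published implementation: it uses PCA (via SVD) as the factor analysis, applies a log2(x + 1) transform, and takes the number of qSVs `k` as a user-supplied argument, all of which are assumptions made here. The function names and simulated data are hypothetical.

```python
import numpy as np

def quality_surrogate_variables(degradation_matrix, k):
    """Step #3 (sketch): top-k principal components of log-transformed coverage.

    degradation_matrix: features x samples coverage of degradation-susceptible features.
    Returns a samples x k matrix of qSVs.
    """
    log_cov = np.log2(np.asarray(degradation_matrix, dtype=float) + 1.0)
    centered = log_cov - log_cov.mean(axis=1, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T

def outcome_effects(expr, outcome, extra_covariates=None):
    """Step #4 (sketch): per-gene OLS outcome coefficient, optionally adjusting for qSVs.

    expr: samples x genes expression matrix.
    """
    cols = [np.ones(len(outcome)), outcome]
    if extra_covariates is not None:
        cols.append(extra_covariates)
    X = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(X, expr, rcond=None)
    return coef[1]

# Simulated data: quality is confounded with group; no gene has a true group effect.
rng = np.random.default_rng(2)
n = 60
group = np.repeat([0.0, 1.0], n // 2)
quality = 0.8 * group + rng.normal(0, 1, n)
loadings = rng.normal(0, 1, 100)  # 100 degradation-susceptible features
coverage = 2.0 ** (8 + np.outer(loadings, quality) + rng.normal(0, 0.2, (100, n)))
gene_slopes = rng.normal(0, 1, 50)
expr = np.outer(quality, gene_slopes) + rng.normal(0, 0.3, (n, 50))

qsvs = quality_surrogate_variables(coverage, k=1)
unadjusted = outcome_effects(expr, group)
adjusted = outcome_effects(expr, group, qsvs)
qsv1_vs_quality = abs(np.corrcoef(qsvs[:, 0], quality)[0, 1])
print(np.mean(np.abs(unadjusted)), np.mean(np.abs(adjusted)), qsv1_vs_quality)
```

In practice the observed clinical and technical covariates would sit in the design matrix alongside the qSVs, and the per-gene model would be fit with a dedicated differential expression tool rather than plain least squares.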

[Figure 1: DEqual scatterplots, panels A–D. Recoverable labels: panel title “SZ vs Control (Univariate)”, r = 0.464; axis label “Degradation Stat”.]

Fig. 1. Differential expression quality (DEqual) plots for schizophrenia-control expression differences. Each DEqual plot compares the effect of RNA degradation from an independent degradation experiment on the y axis to the effect of the outcome of interest, here schizophrenia (SZ) compared with controls. Each point is a gene, and effects here are shown as t-statistics for each effect. (A) DEqual plot for univariate case-control analysis shows strong correlation between degradation and diagnosis effects. (B) DEqual plot for RIN-adjusted case-control differences largely fails to remove degradation bias. (C) DEqual plot when adjusting for observed clinical and technical covariates, including age, sex, ethnicity, chrM mapping rate, gene assignment rate, and RIN, also fails to remove degradation bias. (D) DEqual plot demonstrating that the qSVA framework successfully removes positive correlation between degradation and SZ effects.


Improved Replication for Schizophrenia Differential Expression Using qSVA. We applied the qSVA algorithm to the LIBD polyA+ RNA-seq data with the observed model (consisting of observed clinical and technical confounders) described above. Here, adjustment completely attenuated degradation bias (Fig. 1D, r = −0.09 using t-statistics and r = −0.037 using log2 fold changes). Following this adjustment, there were only 183 genes differentially expressed at FDR < 5%, further suggesting a reduction of RNA degradation bias in differential expression analysis of schizophrenia patients versus controls. The qSVs themselves were strongly associated with observed variables including chrM alignment rate, RIN, total gene assignment rate, overall alignment rate, age, and postmortem interval (SI Appendix, Fig. S8). Similarly, in the CMC dataset, the qSVs, calculated using the DLPFC RiboZero-based degradation features, were strongly associated with RIN, total gene assignment rate, institute where the sample was collected, and sequencing and flow cell batches (SI Appendix, Fig. S9). These results suggest that enriching for degradation signal via the independent tissue degradation experiment can identify more robust measures of RNA quality directly from RNA-seq experiments than relying on single observable measures.

Although the qSVA approach appears to remove RNA degradation bias in brain differential expression analysis as illustrated in the DEqual plot, we further observed that adjusting for transcriptome-wide PCs also removes the degradation effects (SI Appendix, Fig. S10, r = −0.02). This suggests that factor-based approaches, including qSVA but also more generally PCA, can identify and subsequently remove latent measures of RNA quality. However, unsupervised approaches like PCA run the risk of removing true biological difference. Moreover, “supervised” factor-based approaches, such as SVA, that rely on residualizing around a provided statistical model largely preserved RNA degradation bias (SI Appendix, Fig. S11). We therefore used replication signal across these independent datasets (LIBD and CMC) to more fully contrast the value of the different degradation adjustment approaches. For a given adjustment approach, we calculated replication rates of differentially expressed genes discovered in the LIBD dataset at different significance thresholds in the CMC dataset. We found the lowest replication rates (<20%) regardless of significance threshold when adjusting only for observed clinical and technical variables including RIN, as well as SVA residualizing on only diagnosis (Fig. 2). Although we had high replication rates among marginally significant genes (P < 0.001) using SVA residualizing on the observed variables described above, we found strong inflation of the test statistics among both the LIBD (9,033 genes at FDR < 5%) and CMC (6,924 genes at FDR < 5%) datasets. Among those genes significantly differentially expressed (P < 10⁻⁴), we found the highest replication rates using qSVA, as well as relatively linear improvements in the replication rate as the discovery P value threshold dropped. Importantly, the qSVs calculated in the LIBD and CMC datasets were based on different degradation features, as the CMC data were RiboZero and the LIBD data were polyA+. These results therefore show that qSVA leads to greatly improved replication in postmortem brain transcriptomic studies.
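The replication rate used in this comparison follows the definition given in the Fig. 2 legend: among genes passing a discovery threshold, the fraction with the same fold-change direction and P < 0.05 in the replication dataset. A minimal sketch with made-up values (the function name and inputs are hypothetical):

```python
import numpy as np

def replication_rate(disc_p, disc_lfc, rep_p, rep_lfc, threshold, rep_alpha=0.05):
    """Fraction of discovery genes (P < threshold) that replicate.

    A gene replicates if its fold change has the same sign in both datasets
    and its replication P value is below rep_alpha. Arrays are aligned by gene.
    """
    disc_p, disc_lfc, rep_p, rep_lfc = map(np.asarray, (disc_p, disc_lfc, rep_p, rep_lfc))
    discovered = disc_p < threshold
    if not discovered.any():
        return float("nan")  # no genes pass the discovery threshold
    same_direction = np.sign(disc_lfc) == np.sign(rep_lfc)
    replicated = discovered & same_direction & (rep_p < rep_alpha)
    return float(replicated.sum() / discovered.sum())

# Five hypothetical genes, aligned across discovery and replication datasets.
disc_p = [1e-6, 1e-4, 0.003, 0.02, 0.5]
disc_lfc = [0.5, -0.4, 0.3, -0.2, 0.1]
rep_p = [0.001, 0.04, 0.2, 0.01, 0.03]
rep_lfc = [0.4, 0.3, 0.2, -0.1, 0.2]

rate_01 = replication_rate(disc_p, disc_lfc, rep_p, rep_lfc, threshold=0.01)
rate_05 = replication_rate(disc_p, disc_lfc, rep_p, rep_lfc, threshold=0.05)
print(rate_01, rate_05)  # 1 of 3 genes replicates at P < 0.01; 2 of 4 at P < 0.05
```

Evaluating this function over a grid of discovery thresholds, once per adjustment model, yields the kind of curves compared in Fig. 2.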

Applicability of qSVA to Other Tissues and Brain Regions. We next examined the generalizability of the qSVA framework to other tissues and brain regions. We tested the first step of degradation feature selection in the PBMC dataset (resulting in degradation features; Dataset S6) and the ABRF RNase A dataset using DLPFC RiboZero-specific degradation features (Dataset S5). In both datasets, the top estimated qSV was strongly associated with the experimental degradation condition (PBMC: P = 4.56 × 10⁻¹³, SI Appendix, Fig. S12A; ABRF: P = 3.57 × 10⁻⁷, SI Appendix, Fig. S12B). In the confounded individual example from the PBMC dataset, we successfully removed degradation bias by selecting degradation-susceptible features from the PBMC degradation data (SI Appendix, Fig. S13A). Here the qSV adjustment resulted in less statistically biased effect estimates (i.e., log2 fold changes for the effect of “individual”) compared with the statistical model adjusting for observed degradation time (SI Appendix, Fig. S13B). Conversely, the statistical bias in differential expression signal from the RIN-adjusted model for the effect of individual relative to the degradation time-adjusted model was much larger (SI Appendix, Fig. S13C). These results suggest this general framework can work well in other tissues.

As the first step in our framework involves generating experimentally derived degradation expression profiles, which may be impractical for small laboratories or projects, we assessed the cross-tissue and cell-type applicability of our PBMC- and DLPFC-derived degradation-susceptible features. First, we quantified DLPFC-derived (polyA+, Dataset S4) degradation features in the PBMC dataset; here the top qSV showed similar association with degradation time as above (P = 7.93 × 10⁻⁹) and also successfully removed correlation between confounded individual effects and the effect of degradation (r = 0.015). The estimated log2 fold changes for the quality-corrected individual effects were highly correlated using qSVs derived either from PBMC or DLPFC degradation data features (r = 0.997, SI Appendix, Fig. S13D).

We next derived qSVs from the PBMC degradation-susceptible

transcript features in the LIBD DLPFC schizophrenia-control

data and evaluated the performance using DEqual plots and cal-

culating the number of genes significantly differently expressed.

Here, although the log

fold changes when adjusting using blood

versus brain degradation features were correlated (SI Appendix,

Fig. S14A,r=0.6), there was stronger negative correlation between

degradation susceptibility in brain- and blood-adjusted case control

0.00.20.40.6

Gene RPKMs

Replication Rate

p<0.05

p<0.01

p<0.005

p<0.001

p<1e−04

p<1e−05

p<1e−06

adj

qsva

pca

svaFull

Fig. 2. qSVA improves replication across independent datasets. We modeled

SZ-control expression differences using four statistical models in the LIBD

(discovery) and CMC (replication) datasets. For a given significance threshold in

the discovery dataset, we computed the replication rate (same fold-change

direction for case status and P<0.05) in the replication dataset. The qSVA

approach had the highest replication rate, and the covariate-adjusted and SVA

approaches had the lowest replication rates.

Jaffe et al. PNAS

July 3, 2017

vol. 114

no. 27

7133

NEUROSCIENCEBIOPHYSICS AND

COMPUTATIONAL BIOLOGY

differences (SI Appendix, Fig. S14B, r=−0.11). However, using the
blood degradation features, qSVA yielded 1,057 genes significantly
differentially expressed at FDR <5%, approximately five times more
than using the brain degradation-susceptible transcript features,

suggesting that brain-specific degradation effects might not be

captured using PBMC-susceptible features.
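The gene counts at FDR <5% quoted above depend on a multiple-testing adjustment of the per-gene P values. The text here does not state which FDR procedure was used, so the following is only a minimal sketch that assumes the Benjamini-Hochberg procedure; the function names and the five example P values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def count_significant(p_values, fdr=0.05):
    """Number of genes whose adjusted p-value falls below the FDR threshold."""
    return sum(q < fdr for q in benjamini_hochberg(p_values))


# Hypothetical p-values for five genes (not taken from the paper).
example_p = [0.001, 0.008, 0.039, 0.041, 0.60]
example_q = benjamini_hochberg(example_p)
n_significant = count_significant(example_p)
```

Comparing `count_significant` across models (for example, blood-derived versus brain-derived qSVs) mirrors the comparison of gene counts made in the text.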

We further used the Genotype-Tissue Expression (GTEx)

project RNA-seq expression data—n=9,502 across 49 detailed

tissues (17)—to characterize the generalizability of DLPFC-

derived degradation features to other brain regions and tissue

types. We ran differential expression analysis comparing each of

48 detailed tissues in GTEx to BA9 frontal cortex before and

after qSVA correction. In the unadjusted analyses, we found a

strong association between resulting correlations in DEqual plots

and the difference in perceived RNA quality (in chrM mapping

rates, SI Appendix, Fig. S15A, r=0.736, P=2.44 ×10^−9). These
quality associations were driven by the 12 other brain regions
(r=0.88, P=2.44 ×10^−9) as the nonbrain tissues showed no
association (r=0.19, P=0.26). Here qSVA correction removed
the overall quality effects across the detailed tissues, largely by
removing the positive correlation in the brain samples (SI Appendix,
Fig. S15B, r=0.0, P=0.97). These results suggest that

using DLPFC-derived degradation features for qSVA correction

may work well in other brain regions, but may not be appropriate

for RNA degradation correction in other tissues in the body.

Degradation Bias Signal in Published Differential Expression Analyses.

We finally compared the presence of RNA quality bias in pub-

lished differential expression analyses in human brain for dif-

ferent disorders. As there are currently few additional large

RNA-seq studies of postmortem human brain tissue in disease

states, we used previously published large microarray datasets on

differential expression in autism spectrum disorder (ASD) (18)

and Alzheimer’s disease (AD) (19) across multiple brain regions.

In the ASD dataset, patients had significantly lower RINs than

controls in the frontal (P=0.021) but not temporal (P=0.70)

cortex, and, in the AD dataset, patients scored significantly lower

than controls for the single RIN provided across the three

brain regions (P=1.23 ×10^−10). To generate qSVs for these data,
we mapped the probes on each microarray platform to the genome,
extracted coverage from our RNA-seq data, and selected those probe
sequences that were significantly associated with degradation
(Materials and Methods). In the ASD dataset, those probes most

associated with degradation (n=1,129 at p_bonf <1%) were almost
uniformly more lowly expressed in patients compared with
controls in the frontal cortex (SI Appendix, Fig. S16A, P=2.2 ×10^−49).
The directionality of enrichment followed the diagnosis

and degradation associations, given that almost all degradation-

susceptible probes decreased in expression over time (98.5%)

and that RINs were lower in patients compared with controls. In

the temporal cortex, where RINs did not significantly differ between
cases and controls, there was attenuated, but still significant,
enrichment in the same negative direction (P=4.77 ×10^−6).
Following the qSVA procedure (PCA on the 1,129 susceptible
probes and the adjustment for the resulting qSVs), the association
between degradation-susceptible probes and diagnosis was

removed (P=0.496, SI Appendix, Fig. S16B).
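The qSVA procedure described above (PCA on the degradation-susceptible features, then adjustment for the resulting qSVs) can be sketched on toy data. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses only the top qSV, finds it by power iteration, and uses ordinary least squares rather than limma's moderated statistics. All function names and data values are hypothetical.

```python
import math


def center(values):
    """Subtract the mean from a list of numbers."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def top_qsv(features, n_iter=100):
    """First principal component (unit-length sample scores) of a
    samples x features matrix of degradation-susceptible features."""
    # Center each feature across samples, then return to sample rows.
    rows = list(zip(*[center(col) for col in zip(*features)]))
    # The leading eigenvector of the sample-by-sample Gram matrix gives
    # the PC1 scores up to scale; power iteration finds it.
    gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]
    vec = [float(i + 1) for i in range(len(rows))]  # arbitrary start vector
    for _ in range(n_iter):
        nxt = [sum(g * v for g, v in zip(row, vec)) for row in gram]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec


def slope(y, x):
    """Least-squares slope of y on x, with an intercept."""
    yc, xc = center(y), center(x)
    return sum(a * b for a, b in zip(xc, yc)) / sum(a * a for a in xc)


def residualize(y, covariate):
    """Residuals of y after regressing out an intercept and one covariate."""
    b = slope(y, covariate)
    return [a - b * c for a, c in zip(center(y), center(covariate))]


def qsva_adjusted_effect(expression, diagnosis, qsv):
    """Diagnosis coefficient in expression ~ diagnosis + qSV, obtained by
    partialling the qSV out of both variables (Frisch-Waugh-Lovell)."""
    return slope(residualize(expression, qsv), residualize(diagnosis, qsv))


# Hypothetical toy data: latent RNA quality differs between cases and controls.
quality = [0.1, 0.4, 0.3, 0.9, 0.7, 0.8]
diagnosis = [0, 0, 0, 1, 1, 1]
# Three degradation-susceptible features that track latent quality.
features = [[1.0 * q, 2.0 * q, -1.0 * q] for q in quality]
# One gene with a true diagnosis effect of 0.5 plus a quality effect.
expression = [3.0 + 2.0 * q + 0.5 * d for q, d in zip(quality, diagnosis)]

naive_effect = slope(expression, diagnosis)
adjusted_effect = qsva_adjusted_effect(expression, diagnosis, top_qsv(features))
```

In this toy setting the unadjusted estimate is inflated by the quality difference between groups, whereas adjusting for the top qSV recovers the simulated diagnosis effect.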

We found the same enrichment among differentially expressed

probes for AD across all three brain regions and the 653

degradation-susceptible probes on this microarray, including in

the prefrontal cortex (P=1.27 ×10^−48, SI Appendix, Fig. S16C),
cerebellum (P=1.82 ×10^−33), and visual cortex (P=2.35 ×10^−35).
Adjusting for the resulting qSVs again removed the association
between diagnosis and degradation susceptibility in the prefrontal
cortex (P=0.66, SI Appendix, Fig. S16D) and cerebellum
(P=0.49) and greatly reduced the association in the visual
cortex (P=6.11 ×10^−4). The qSVA correction also greatly reduced
the magnitude of the differential expression test statistics
across the entire platform (SI Appendix, Fig. S16C versus D).

These results further underscore the risk of potentially spurious

findings based on uncorrected RNA quality confounding.

Discussion

We describe a framework for quantifying and removing RNA

quality biases in differential expression analysis. We first char-

acterized aspects of the landscape of RNA degradation across

the human DLPFC and PBMC transcriptomes and identified

largely tissue-specific degradation signals. The cell types repre-

sented in bulk/mixed tissues like brain and PBMCs further

showed differential susceptibility to RNA degradation. We used

these experimental degradation datasets to identify the most

degradation-susceptible transcript features in PBMC and DLPFC

RNA-seq libraries and developed an approach called qSVA to use

expression levels of these regions in new/user-provided samples to

estimate and remove RNA degradation bias in differential expres-

sion analyses. We show that the qSVA approach results in better

replication across independent studies and in various public tissue

datasets than existing popular statistical models that model ob-

served measures of RNA quality like RIN, chrM mapping rate, and

gene assignment rate. Our qSVA approach has a potential advan-

tage over general PCA or RUV adjustments—particularly, less risk

of removing true signals along with the noise. Reanalysis of pre-

viously published microarray datasets for AD and ASD further

suggests that probes differentially expressed for diagnosis were

highly associated in a predictable directionality with RNA deg-

radation susceptibility in both datasets.

We also demonstrated that adjusting for measures of RIN

largely fails to remove RNA degradation bias and formally

showed that RIN correction is more statistically biased at esti-

mating fold changes than qSVA when the true degradation effect

is known. The estimation of RIN itself is heavily driven by the

intactness of ribosomal RNAs (8), which appears only weakly

associated with the underlying quality of total or polyadenylated

RNAs across different subjects or tissues. Variance components

analysis of RIN values within the full GTEx dataset suggests that

tissue source explains approximately three times more variance

than individual identity (44.5% versus 14.7%). However, within

only the GTEx brain samples, the predictor corresponding to

individual explained more variability in RIN than did brain re-

gion (28.0% versus 18.7%). Finally, using the LIBD DLPFC

dataset, we found no evidence that individual genotype predicted

individual RIN; the smallest FDR for a genotype effect on RIN

was 0.64 (SI Appendix). Indeed, total RNA quality may be more

complex than a single number per sample, as the resulting qSVs

in both the LIBD and CMC datasets associate with a variety of

technical factors (SI Appendix, Figs. S7 and S8) that may each

influence RNA quality in subtle ways. Therefore, although the

RIN value may be a rough guide in determining whether or not

to study a particular sample, we would argue that it is not a

particularly accurate or useful gauge of RNA quality after data

have already been generated.

The applicability of specific tissue-derived degradation-

susceptible regions to other tissues or cell types is an impor-

tant consideration in differential expression analysis, particularly

when measured RNA quality associates with the outcome of

interest. One practical recommendation for other brain regions

would be to use the degradation data from DLPFC and PBMCs

to create DEqual plots, quantify the potential RNA degradation

bias from its correlation, and then evaluate how the DEqual plot

changes when performing qSVA using the DLPFC and PBMC

degradation regions. If this qSVA correction fails to remove

strong correlation between differential expression effects of

degradation and outcome, researchers probably need to generate

their own reference degradation datasets and apply the qSVA

algorithm.
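The DEqual check recommended above reduces to a correlation, across genes, between the degradation effect and the outcome effect. A minimal sketch follows, assuming per-gene statistics are held in dictionaries keyed by gene identifier; the function names and the toy T-statistics are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def dequal_correlation(degradation_stats, outcome_stats):
    """Correlate degradation and outcome statistics over shared genes."""
    shared = sorted(set(degradation_stats) & set(outcome_stats))
    return pearson([degradation_stats[g] for g in shared],
                   [outcome_stats[g] for g in shared])


# Hypothetical per-gene T-statistics (not taken from the paper).
degradation_t = {"g1": -5.0, "g2": -1.0, "g3": 2.0, "g4": 4.0}
unadjusted_t = {"g1": -2.5, "g2": -0.5, "g3": 1.0, "g4": 2.0, "g5": 9.0}
adjusted_t = {"g1": 1.0, "g2": -1.0, "g3": -1.0, "g4": 1.0}

r_before = dequal_correlation(degradation_t, unadjusted_t)
r_after = dequal_correlation(degradation_t, adjusted_t)
```

A correlation near zero after adjustment is the pattern the text associates with successful correction; a strong residual correlation would point toward generating a tissue-specific degradation reference.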

7134

www.pnas.org/cgi/doi/10.1073/pnas.1617384114 Jaffe et al.

Differences in latent RNA quality and the underlying cellular

composition of homogenate tissue sources (20–22) are two of the

strongest confounding factors in postmortem human studies. The

qSVA approach here that uses quality-associated features is

analogous to our previously proposed approach that uses cell-

type–associated features to untangle the confounding effects of

cellular composition (sparse PCA) (23). The current study does

suggest a potential interaction between RNA quality and cellular

composition (SI Appendix, Fig. S2 and Table S4), which may make it
more difficult to statistically isolate the two strong confounding
effects, particularly in PBMCs, or when shifting cellular composition
is involved in a disease process. Nevertheless, our degradation
correction framework can improve the interpretation of

differential expression analysis of transcriptomic data.

Materials and Methods

Tissue Degradation Experiment. DLPFC gray matter from five donors was

dissected, pulverized, and mixed on dry ice. Approximately 100 mg of
pulverized tissue was aliquoted four times for each subject on dry ice.
The aliquots were then left at room temperature, except for one aliquot
from each subject that was kept on dry ice as the time 0 data point.
RNA was extracted and sequenced using polyA+ and RiboZero protocols.
Data were processed with TopHat (v2.0.13) using the reference
transcriptome to initially guide alignment, based on known transcripts
in the Illumina iGenomes version of the University of California at
Santa Cruz knownGene GTF file (using the “-G” argument in the
software) (24). Gene counts were generated using the

featureCounts tool (25) based on the more recent Ensembl v75, and counts

were converted to RPKM values using the total number of aligned reads

across the autosomal and sex chromosomes. All public datasets were pro-

cessed with a similar protocol. All tissues were obtained with informed

consent from the legal next of kin (protocol #12–24 approved by the In-

stitutional Review Board of the Department of Health and Mental Hygiene

of the State of Maryland).
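The conversion from counts to RPKM described above can be sketched as follows. This assumes the standard RPKM definition (reads per kilobase of gene length per million aligned reads) and that mitochondrial reads are the ones left out of the library size; the chromosome names, function names, and numbers are hypothetical, and the log2 offset of 1 is the one mentioned for the replication dataset.

```python
import math


def total_aligned_reads(reads_by_chrom, excluded=("chrM",)):
    """Library size: aligned reads on autosomes and sex chromosomes only."""
    return sum(n for chrom, n in reads_by_chrom.items() if chrom not in excluded)


def rpkm(count, gene_length_bp, total_reads):
    """Reads per kilobase of gene length per million aligned reads."""
    return count * 1e9 / (gene_length_bp * total_reads)


def log2_rpkm(value, offset=1.0):
    """log2-transform an RPKM value after adding an offset."""
    return math.log2(value + offset)


# Hypothetical read totals for one sample (not taken from the paper).
reads = {"chr1": 15_000_000, "chrX": 10_000_000, "chrM": 4_000_000}
library_size = total_aligned_reads(reads)
gene_rpkm = rpkm(500, 2_000, library_size)
```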

Degradation Data Analysis. For the samples in each library and tissue type, we

separately modeled expression as a function of degradation time, adjusting

for the donor and using the limma R Bioconductor package (26). Gene set

enrichment analyses were performed on the ordered degradation
T-statistics from the polyA+ and RiboZero library types among those
genes with Entrez Gene IDs using the gseGO and gseKEGG functions in the
clusterProfiler R package (27). Cell-type–specific analyses were conducted with

CIBERSORT with the default LM22 reference panel and 500 permutations

(14) for the PBMC degradation datasets, and DLPFC enrichment was based

on 285 cells from adult donors that were previously classified as astrocytes,

endothelial cells, microglia, neurons, oligodendrocytes, and oligodendrocyte

progenitor cells (15).
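The per-gene degradation model described at the start of this section (expression as a function of degradation time, adjusting for donor) can be sketched with ordinary least squares. This is an illustration only: it returns an ordinary T-statistic, not the moderated statistic that limma computes, and the time points and expression values below are hypothetical.

```python
import math
from collections import defaultdict


def degradation_effect(expression, minutes, donors):
    """Slope and ordinary t-statistic for time in expression ~ time + donor."""
    groups = defaultdict(list)
    for idx, donor in enumerate(donors):
        groups[donor].append(idx)
    n = len(expression)
    y_c, x_c = [0.0] * n, [0.0] * n
    # Donor fixed effects are absorbed by centering within each donor.
    for idxs in groups.values():
        y_mean = sum(expression[i] for i in idxs) / len(idxs)
        x_mean = sum(minutes[i] for i in idxs) / len(idxs)
        for i in idxs:
            y_c[i] = expression[i] - y_mean
            x_c[i] = minutes[i] - x_mean
    sxx = sum(x * x for x in x_c)
    beta = sum(x * y for x, y in zip(x_c, y_c)) / sxx
    rss = sum((y - beta * x) ** 2 for x, y in zip(x_c, y_c))
    df = n - len(groups) - 1  # samples minus donor intercepts minus slope
    se = math.sqrt(rss / df / sxx)
    return beta, beta / se


# Hypothetical log2 expression for one gene: two donors, four aliquots each.
minutes = [0, 15, 30, 60, 0, 15, 30, 60]
donors = ["A", "A", "A", "A", "B", "B", "B", "B"]
expression = [5.0, 4.8, 4.3, 3.9, 6.0, 5.6, 5.5, 4.7]

beta, t_stat = degradation_effect(expression, minutes, donors)
```

Ranking genes by such a statistic gives the ordered list that a gene set enrichment analysis would take as input.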

LIBD Discovery Dataset Modeling. We used the LIBD DLPFC polyA+ RNA-seq
data on 155 schizophrenia cases and 196 controls (criteria: ages between 17 and
80, gene assignment rate >0.5, mapping rate >0.7, RIN >6, not outlying on
second ancestry PC, only self-reported Caucasians and African Americans)
described in Jaffe et al. (28). We fit a series of statistical models at the gene
level, modeling log2-transformed gene-level RPKM (SI Appendix). We used
the lmFit and eBayes functions in the limma Bioconductor package (26) to
fit all of the statistical models to estimate log2 fold changes, moderated
T-statistics, and corresponding P values.

CMC Replication Dataset Analysis. We performed differential expression
analysis on 159 patients and 172 controls (selecting on total gene assignment
rate >0.3, alignment rate >0.8, RIN >6, ages between 18 and 80, nonoutlying
on genetic ancestry PCs 3 and 5, and keeping only reported Caucasians
and African Americans). We similarly fit four of the statistical models
at the gene level, modeling log2-transformed gene-level RPKM (with an

offset of 1).
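The replication rate reported in Fig. 2 is defined there as the fraction of genes passing a discovery threshold that show the same fold-change direction and P<0.05 in the replication dataset. A minimal sketch of that calculation follows; the data layout (gene mapped to a pair of log2 fold change and P value), the function name, and the example values are hypothetical.

```python
def replication_rate(discovery, replication, threshold, rep_alpha=0.05):
    """Fraction of discovery-significant genes that replicate.

    A gene replicates when its fold change has the same sign in both
    datasets and its replication p-value is below rep_alpha.
    Returns None when no gene passes the discovery threshold.
    """
    tested = [g for g, (_, p) in discovery.items()
              if p < threshold and g in replication]
    if not tested:
        return None
    hits = 0
    for gene in tested:
        disc_lfc, _ = discovery[gene]
        rep_lfc, rep_p = replication[gene]
        if disc_lfc * rep_lfc > 0 and rep_p < rep_alpha:
            hits += 1
    return hits / len(tested)


# Hypothetical results: gene -> (log2 fold change, p-value).
discovery = {"g1": (0.30, 1e-6), "g2": (-0.20, 0.004),
             "g3": (0.10, 0.03), "g4": (0.05, 0.40)}
replication = {"g1": (0.25, 0.001), "g2": (0.10, 0.02),
               "g3": (0.08, 0.20), "g4": (0.01, 0.90)}

rate_p05 = replication_rate(discovery, replication, threshold=0.05)
rate_p001 = replication_rate(discovery, replication, threshold=0.001)
```

Evaluating the function over a grid of discovery thresholds, once per statistical model, would give the kind of curves summarized in Fig. 2.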

GTEx Analysis. We retained all GTEx samples that had RINs >5 and belonged

to subtissues (SMTSD metadata column) with at least 40 samples, resulting in

data on 9,502 samples across 49 detailed tissues. We retained the 36,552 genes

that had mean RPKM >0.2 in at least one subtissue. We modeled differential
expression of each of 48 subtissues compared with Brain-Frontal Cortex (BA9)
and measured the Pearson correlation present in the resulting DEqual plots,
e.g., between the subtissue-specific log2 fold changes and the DLPFC polyA+
degradation data log2

fold changes for degradation time.
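The sample and gene filters described in this section can be sketched as below. Two points are assumptions rather than statements from the text: subtissue sizes are counted after the RIN filter, and each sample is a dictionary with `id`, `rin`, and `SMTSD` keys. The function names and toy records are hypothetical, and the 40-sample minimum is lowered to 2 in the example so that it stays small.

```python
from collections import Counter, defaultdict


def filter_samples(samples, min_rin=5.0, min_per_tissue=40):
    """Keep samples with RIN above min_rin in sufficiently large subtissues."""
    passing = [s for s in samples if s["rin"] > min_rin]
    per_tissue = Counter(s["SMTSD"] for s in passing)
    return [s for s in passing if per_tissue[s["SMTSD"]] >= min_per_tissue]


def filter_genes(rpkm_by_gene, tissue_of_sample, min_mean=0.2):
    """Keep genes whose mean RPKM exceeds min_mean in at least one subtissue."""
    kept = []
    for gene, values in rpkm_by_gene.items():
        by_tissue = defaultdict(list)
        for sample_id, value in values.items():
            by_tissue[tissue_of_sample[sample_id]].append(value)
        if any(sum(v) / len(v) > min_mean for v in by_tissue.values()):
            kept.append(gene)
    return kept


# Hypothetical toy records (not GTEx data).
samples = [
    {"id": "s1", "rin": 7.1, "SMTSD": "Brain - Frontal Cortex (BA9)"},
    {"id": "s2", "rin": 6.0, "SMTSD": "Brain - Frontal Cortex (BA9)"},
    {"id": "s3", "rin": 4.5, "SMTSD": "Brain - Frontal Cortex (BA9)"},
    {"id": "s4", "rin": 8.0, "SMTSD": "Liver"},
]
kept_samples = filter_samples(samples, min_per_tissue=2)
tissue_of_sample = {s["id"]: s["SMTSD"] for s in kept_samples}
rpkm_by_gene = {
    "geneA": {"s1": 0.3, "s2": 0.2},
    "geneB": {"s1": 0.1, "s2": 0.05},
}
kept_genes = filter_genes(rpkm_by_gene, tissue_of_sample)
```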

Microarray Data Processing and Analysis of Published Studies. We extrapo-

lated the expression levels of the probes for each microarray platform in our

degradation RNA-seq dataset by aligning microarray probes to the genome

and quantifying resulting coverage in the RNA-seq datasets.

See additional details in SI Appendix, Full Methods and Materials.

ACKNOWLEDGMENTS. A.E.J. was partially supported by NIH Grant

R21MH109956 and J.T.L. was supported by NIH Grant R01GM105705. This

work was also supported by the Lieber Institute for Brain Development.

Corresponding acknowledgment statements for GTEx and CMC datasets

are available in the SI Appendix.

1. Irizarry RA, et al. (2003) Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4:249–264.
2. Risso D, Ngai J, Speed TP, Dudoit S (2014) Normalization of RNA-seq data using factor analysis of control genes or samples. Nat Biotechnol 32:896–902.
3. Leek JT, et al. (2010) Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet 11:733–739.
4. Li S, et al. (2014) Detecting and correcting systematic variation in large-scale RNA sequencing data. Nat Biotechnol 32:888–895.
5. SEQC/MAQC-III Consortium (2014) A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium. Nat Biotechnol 32:903–914.
6. ’t Hoen PA, et al. (2013) Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories. Nat Biotechnol 31:1015–1022.
7. Adiconis X, et al. (2013) Comparative analysis of RNA sequencing methods for degraded or low-input samples. Nat Methods 10:623–629.
8. Schroeder A, et al. (2006) The RIN: An RNA integrity number for assigning integrity values to RNA measurements. BMC Mol Biol 7:3.
9. Consortium ER (2014) REMC standards and guidelines for RNA-sequencing. Available at www.roadmapepigenomics.org/files/protocols/data/rna-analysis/REMC_RNA-seqStandards_final.pdf. Accessed June 7, 2017.
10. Li S, et al. (2014) Multi-platform assessment of transcriptome profiling using RNA-seq in the ABRF next-generation sequencing study. Nat Biotechnol 32:915–925.
11. Wang L, et al. (2016) Measure transcript integrity using RNA-seq data. BMC Bioinformatics 17:58.
12. Leek JT, Storey JD (2007) Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet 3:1724–1735.
13. Gallego Romero I, Pai AA, Tung J, Gilad Y (2014) RNA-seq: Impact of RNA degradation on transcript quantification. BMC Biol 12:42.
14. Fromer M, et al. (2016) Gene expression elucidates functional impact of polygenic risk for schizophrenia. Nat Neurosci 19:1442–1453.
15. Darmanis S, et al. (2015) A survey of human brain transcriptome diversity at the single cell level. Proc Natl Acad Sci USA 112:7285–7290.
16. Leek JT, Johnson WE, Parker HS, Jaffe AE, Storey JD (2012) The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics 28:882–883.
17. GTEx Consortium (2015) Human genomics. The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans. Science 348:648–660.
18. Voineagu I, et al. (2011) Transcriptomic analysis of autistic brain reveals convergent molecular pathology. Nature 474:380–384.
19. Zhang B, et al. (2013) Integrated systems approach identifies genetic nodes and networks in late-onset Alzheimer’s disease. Cell 153:707–720.
20. Jaffe AE (2016) Postmortem human brain genomics in neuropsychiatric disorders: How far can we go? Curr Opin Neurobiol 36:107–111.
21. Jaffe AE, et al. (2016) Mapping DNA methylation across development, genotype and schizophrenia in the human frontal cortex. Nat Neurosci 19:40–47.
22. Jaffe AE, et al. (2015) Developmental regulation of human cortex transcription and its clinical relevance at single base resolution. Nat Neurosci 18:154–161.
23. Jaffe AE, Irizarry RA (2014) Accounting for cellular heterogeneity is critical in epigenome-wide association studies. Genome Biol 15:R31.
24. Kim D, et al. (2013) TopHat2: Accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol 14:R36.
25. Liao Y, Smyth GK, Shi W (2014) featureCounts: An efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 30:923–930.
26. Smyth GK (2004) Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 3:Article3.
27. Yu G, Wang LG, Han Y, He QY (2012) clusterProfiler: An R package for comparing biological themes among gene clusters. Omics 16:284–287.
28. Jaffe AE, et al. (April 5, 2017) Developmental and genetic regulation of the human cortex transcriptome in schizophrenia, doi.org/10.1101/124321.


Comparison of gene expression in living and postmortem human brain

Preprint

Full-text available

Nov 2023

Molecular mechanisms of neuropsychiatric disorders are challenging to study in human brain. For decades, the preferred model has been to study postmortem human brain samples despite the limitations they entail. A recent study generated RNA sequencing data from biopsies of prefrontal cortex from living patients with Parkinson's Disease and compared gene expression to postmortem tissue samples, from which they found vast differences between the two. This led the authors to question the utility of postmortem human brain studies. Through re-analysis of the same data, we unexpectedly found that the living brain tissue samples were of much lower quality than the postmortem samples across multiple standard metrics. We also performed simulations that illustrate the effects of ignoring RNA degradation in differential gene expression analyses, showing the effects can be substantial and of similar magnitude to what the authors find. For these reasons, we believe the authors' conclusions are unjustified. To the contrary, while opportunities to study gene expression in the living brain are welcome, evidence that this eclipses the value of postmortem analyses is not apparent.

Transcriptomics and proteomics of projection neurons in a circuit linking hippocampus with dorsolateral prefrontal cortex in human brain

Preprint

Full-text available

Jun 2024

RNA-sequencing studies of brain tissue homogenates have shed light on the molecular processes underlying schizophrenia (SCZ) but lack biological granularity at the cell type level. Laser capture microdissection (LCM) can isolate selective cell populations with intact cell bodies to allow complementary gene expression analyses of mRNA and protein. We used LCM to collect excitatory neuron-enriched samples from CA1 and subiculum (SUB) of the hippocampus and layer III of the dorsolateral prefrontal cortex (DLPFC), from which we generated gene, transcript, and peptide level data. In a machine learning framework, LCM-derived expression achieved superior regional identity predictions as compared to bulk tissue, with further improvements when using isoform-level transcript and protein quantifications. LCM-derived co-expression also had increased co-expression strength of neuronal gene sets compared to tissue homogenates. SCZ risk co-expression pathways were identified and replicated across transcript and protein networks and were consistently enriched for glutamate receptor complex and post-synaptic functions. Finally, through inter-regional co-expression analyses, we show that CA1 to SUB transcriptomic connectivity may be altered in SCZ.

Relationship between sex biases in gene expression and sex biases in autism and Alzheimer's disease

Article

Jun 2024

Background: Sex differences in the brain may play an important role in sex-differential prevalence of neuropsychiatric conditions. Methods: In order to understand the transcriptional basis of sex differences, we analyzed multiple, large-scale, human postmortem brain RNA-Seq datasets using both within-region and pan-regional frameworks. Results: We find evidence of sex-biased transcription in many autosomal genes, some of which provide evidence for pathways and cell population differences between chromosomally male and female individuals. These analyses also highlight regional differences in the extent of sex-differential gene expression. We observe an increase in specific neuronal transcripts in male brains and an increase in immune and glial function-related transcripts in female brains. Integration with single-nucleus data suggests this corresponds to sex differences in cellular states rather than cell abundance. Integration with case-control gene expression studies suggests a female molecular predisposition towards Alzheimer's disease, a female-biased disease. Autism, a male-biased diagnosis, does not exhibit a male predisposition pattern in our analysis. Conclusion: Overall, these analyses highlight mechanisms by which sex differences may interact with sex-biased conditions in the brain. Furthermore, we provide region-specific analyses of sex differences in brain gene expression to enable additional studies at the interface of gene expression and diagnostic differences.

Distinctive whole-brain cell types predict tissue damage patterns in thirteen neurodegenerative conditions

Article

Full-text available

Mar 2024
eLife

For over a century, brain research narrative has mainly centered on neuron cells. Accordingly, most neurodegenerative studies focus on neuronal dysfunction and their selective vulnerability, while we lack comprehensive analyses of other major cell types’ contribution. By unifying spatial gene expression, structural MRI, and cell deconvolution, here we describe how the human brain distribution of canonical cell types extensively predicts tissue damage in 13 neurodegenerative conditions, including early- and late-onset Alzheimer’s disease, Parkinson’s disease, dementia with Lewy bodies, amyotrophic lateral sclerosis, mutations in presenilin-1, and 3 clinical variants of frontotemporal lobar degeneration (behavioral variant, semantic and non-fluent primary progressive aphasia) along with associated three-repeat and four-repeat tauopathies and TDP43 proteinopathies types A and C. We reconstructed comprehensive whole-brain reference maps of cellular abundance for six major cell types and identified characteristic axes of spatial overlapping with atrophy. Our results support the strong mediating role of non-neuronal cells, primarily microglia and astrocytes, in spatial vulnerability to tissue loss in neurodegeneration, with distinct and shared across-disorder pathomechanisms. These observations provide critical insights into the multicellular pathophysiology underlying spatiotemporal advance in neurodegeneration. Notably, they also emphasize the need to exceed the current neuro-centric view of brain diseases, supporting the imperative for cell-specific therapeutic targets in neurodegeneration.

Network-based drug repurposing for schizophrenia

Article

Full-text available

Feb 2024
NEUROPSYCHOPHARMACOL

Despite recent progress, the challenges in drug discovery for schizophrenia persist. However, computational drug repurposing has gained popularity as it leverages the wealth of expanding biomedical databases. Network analyses provide a comprehensive understanding of transcription factor (TF) regulatory effects through gene regulatory networks, which capture the interactions between TFs and target genes by integrating various lines of evidence. Using the PANDA algorithm, we examined the topological variances in TF-gene regulatory networks between individuals with schizophrenia and healthy controls. This algorithm incorporates binding motifs, protein interactions, and gene co-expression data. To identify these differences, we subtracted the edge weights of the healthy control network from those of the schizophrenia network. The resulting differential network was then analysed using the CLUEreg tool in the GRAND database. This tool employs differential network signatures to identify drugs that potentially target the gene signature associated with the disease. Our analysis utilised a large RNA-seq dataset comprising 532 post-mortem brain samples from the CommonMind project. We constructed co-expression gene regulatory networks for both schizophrenia cases and healthy control subjects, incorporating 15,831 genes and 413 overlapping TFs. Through drug repurposing, we identified 18 promising candidates for repurposing as potential treatments for schizophrenia. The analysis of TF-gene regulatory networks revealed that the TFs in schizophrenia predominantly regulate pathways associated with energy metabolism, immune response, cell adhesion, and thyroid hormone signalling. These pathways represent significant targets for therapeutic intervention. The identified drug repurposing candidates likely act through TF-targeted pathways. 
These promising candidates, particularly those with preclinical evidence such as rimonabant and kaempferol, warrant further investigation into their potential mechanisms of action and efficacy in alleviating the symptoms of schizophrenia.

Relationship between sex biases in gene expression and sex biases in autism and Alzheimer's disease

Preprint

Full-text available

Sep 2023

Sex differences in the brain may play an important role in sex-differential prevalence of neuropsychiatric conditions. In order to understand the transcriptional basis of sex differences, we analyzed multiple, large-scale, human postmortem brain RNA-seq datasets using both within-region and pan-regional frameworks. We find evidence of sex-biased transcription in many autosomal genes, some of which provide evidence for pathways and cell population differences between chromosomally male and female individuals. These analyses also highlight regional differences in the extent of sex-differential gene expression. We observe an increase in specific neuronal transcripts in male brains and an increase in immune and glial function-related transcripts in female brains. Integration with single-cell data suggests this corresponds to sex differences in cellular states rather than cell abundance. Integration with case-control gene expression studies suggests a female molecular predisposition towards Alzheimer’s disease, a female-biased disease. Autism, a male-biased diagnosis, does not exhibit a male predisposition pattern in our analysis. Finally, we provide region specific analyses of sex differences in brain gene expression to enable additional studies at the interface of gene expression and diagnostic differences. Graphical Abstract

Analysis of gene expression in the postmortem brain of neurotypical Black Americans reveals contributions of genetic ancestry

Article

Full-text available

May 2024
NAT NEUROSCI

Ancestral differences in genomic variation affect the regulation of gene expression; however, most gene expression studies have been limited to European ancestry samples or adjusted to identify ancestry-independent associations. Here, we instead examined the impact of genetic ancestry on gene expression and DNA methylation in the postmortem brain tissue of admixed Black American neurotypical individuals to identify ancestry-dependent and ancestry-independent contributions. Ancestry-associated differentially expressed genes (DEGs), transcripts and gene networks, while notably not implicating neurons, are enriched for genes related to the immune response and vascular tissue and explain up to 26% of heritability for ischemic stroke, 27% of heritability for Parkinson disease and 30% of heritability for Alzheimer’s disease. Ancestry-associated DEGs also show general enrichment for the heritability of diverse immune-related traits but depletion for psychiatric-related traits. We also compared Black and non-Hispanic white Americans, confirming most ancestry-associated DEGs. Our results delineate the extent to which genetic ancestry affects differences in gene expression in the human brain and the implications for brain illness risk.

Distinctive Whole-brain Cell-Types Strongly Predict Tissue Damage Patterns in Eleven Neurodegenerative Disorders

Preprint

Full-text available

Sep 2023

For over a century, brain research narrative has mainly centered on neuron cells. Accordingly, most whole-brain neurodegenerative studies focus on neuronal dysfunction and their selective vulnerability, while we lack comprehensive analyses of other major cell-types’ contribution. By unifying spatial gene expression, structural MRI, and cell deconvolution, here we describe how the human brain distribution of canonical cell-types extensively predicts tissue damage in eleven neurodegenerative disorders, including early- and late-onset Alzheimer’s disease, Parkinson’s disease, dementia with Lewy bodies, amyotrophic lateral sclerosis, frontotemporal dementia, and tauopathies. We reconstructed comprehensive whole-brain reference maps of cellular abundance for six major cell-types and identified characteristic axes of spatial overlapping with atrophy. Our results support the strong mediating role of non-neuronal cells, primarily microglia and astrocytes, on spatial vulnerability to tissue loss in neurodegeneration, with distinct and shared across-disorders pathomechanisms. These observations provide critical insights into the multicellular pathophysiology underlying spatiotemporal advance in neurodegeneration. Notably, they also emphasize the need to exceed the current neuro-centric view of brain diseases, supporting the imperative for cell-specific therapeutic targets in neurodegeneration. Major cell-types distinctively associate with spatial vulnerability to tissue loss in eleven neurodegenerative disorders.

Distinctive Whole-brain Cell-Types Strongly Predict Tissue Damage Patterns in Eleven Neurodegenerative Disorders

Preprint

Sep 2023
eLife

Pleiotropy with sex-specific traits reveals genetic aspects of sex differences in Parkinson's disease

Article

Full-text available

Sep 2023
BRAIN

Parkinson’s disease is an age-related neurodegenerative disorder with a higher incidence in males than females. The causes for this sex difference are unknown. Genome-wide association studies (GWAS) have identified 90 Parkinson’s disease risk loci, but the genetic studies have not found sex-specific differences in allele frequency on autosomal chromosomes or sex chromosomes. Genetic variants, however, could exert sex-specific effects on gene function and regulation of gene expression. To identify genetic loci that might have sex-specific effects, we studied pleiotropy between Parkinson’s disease and sex-specific traits. Summary statistics from GWASs were acquired from large-scale consortia for Parkinson’s disease (n cases=13 708; n controls=95 282), age at menarche (n=368 888 women) and age at menopause (n=69 360 women). We applied the conditional/conjunctional false discovery rate (FDR) method to identify shared loci between Parkinson’s disease and these sex-specific traits. Next, we investigated sex-specific gene expression differences in the superior frontal cortex of both neuropathologically healthy individuals and Parkinson’s disease patients (n cases=61; n controls=23). To provide biological insights to the genetic pleiotropy, we performed sex-specific expression quantitative trait locus (eQTL) analysis and sex-specific age-related differential expression analysis for genes mapped to Parkinson’s disease risk loci. Through conditional/conjunctional FDR analysis we found 11 loci shared between Parkinson’s disease and the sex-specific traits age at menarche and age at menopause. Gene-set and pathway analysis of the genes mapped to these loci highlighted the importance of the immune response in determining an increased disease incidence in the male population. Moreover, we highlighted a total of nine genes whose expression or age-related expression in the human brain is influenced by genetic variants in a sex-specific manner. 
With these analyses we demonstrated that the lack of clear sex-specific differences in allele frequencies for Parkinson’s disease loci does not exclude a genetic contribution to differences in disease incidence. Moreover, further studies are needed to elucidate the role that the candidate genes identified here could have in determining a higher incidence of Parkinson’s disease in the male population.

Developmental and genetic regulation of the human cortex transcriptome illuminate schizophrenia pathogenesis

Article

Aug 2018
NAT NEUROSCI

Genome-wide association studies have identified 108 schizophrenia risk loci, but biological mechanisms for individual loci are largely unknown. Using developmental, genetic and illness-based RNA sequencing expression analysis in human brain, we characterized the human brain transcriptome around these loci and found enrichment for developmentally regulated genes with novel examples of shifting isoform usage across pre- and postnatal life. We found widespread expression quantitative trait loci (eQTLs), including many with transcript specificity and previously unannotated sequence that were independently replicated. We leveraged this general eQTL database to show that 48.1% of risk variants for schizophrenia associate with nearby expression. We lastly found 237 genes significantly differentially expressed between patients and controls, which replicated in an independent dataset, implicated synaptic processes, and were strongly regulated in early development. These findings together offer genetics- and diagnosis-related targets for better modeling of schizophrenia risk. This resource is publicly available at http://eqtl.brainseq.org/phase1.

Gene Expression Elucidates Functional Impact of Polygenic Risk for Schizophrenia

Article

Sep 2016

Over 100 genetic loci harbor schizophrenia-associated variants, yet how these variants confer liability is uncertain. The CommonMind Consortium sequenced RNA from dorsolateral prefrontal cortex of people with schizophrenia (N = 258) and control subjects (N = 279), creating a resource of gene expression and its genetic regulation. Using this resource, ∼20% of schizophrenia loci have variants that could contribute to altered gene expression and liability. In five loci, only a single gene was involved: FURIN, TSNARE1, CNTN4, CLCN3 or SNAP91. Altering expression of FURIN, TSNARE1 or CNTN4 changed neurodevelopment in zebrafish; knockdown of FURIN in human neural progenitor cells yielded abnormal migration. Of 693 genes showing significant case-versus-control differential expression, their fold changes were ≤ 1.33, and an independent cohort yielded similar results. Gene co-expression implicates a network relevant for schizophrenia. Our findings show that schizophrenia is polygenic and highlight the utility of this resource for mechanistic interpretations of genetic liability for brain diseases.

Measure transcript integrity using RNA-seq data

Article

Feb 2016
BMC BIOINFORMATICS

Stored biological samples with pathology information and medical records are invaluable resources for translational medical research. However, RNAs extracted from the archived clinical tissues are often substantially degraded. RNA degradation distorts the RNA-seq read coverage in a gene-specific manner, and has profound influences on whole-genome gene expression profiling. We developed the transcript integrity number (TIN) to measure RNA degradation. When applied to 3 independent RNA-seq datasets, we demonstrated TIN is a reliable and sensitive measure of the RNA degradation at both transcript and sample level. Through comparing 10 prostate cancer clinical samples with lower RNA integrity to 10 samples with higher RNA quality, we demonstrated that calibrating gene expression counts with TIN scores could effectively neutralize RNA degradation effects by reducing false positives and recovering biologically meaningful pathways. When further evaluating the performance of TIN correction using spike-in transcripts in RNA-seq data generated from the Sequencing Quality Control consortium, we found TIN adjustment had better control of false positives and false negatives (sensitivity = 0.89, specificity = 0.91, accuracy = 0.90), as compared to gene expression analysis results without TIN correction (sensitivity = 0.98, specificity = 0.50, accuracy = 0.86). TIN is a reliable measurement of RNA integrity and a valuable approach used to neutralize in vitro RNA degradation effect and improve differential gene expression analysis.

A survey of human brain transcriptome diversity at the single cell level

Article

Jun 2015

Significance The brain comprises an immense number of cells and cellular connections. We describe the first, to our knowledge, single cell whole transcriptome analysis of human adult cortical samples. We have established an experimental and analytical framework with which the complexity of the human brain can be dissected on the single cell level. Using this approach, we were able to identify all major cell types of the brain and characterize subtypes of neuronal cells. We observed changes in neurons from early developmental to late differentiated stages in the adult. We found a subset of adult neurons which express major histocompatibility complex class I genes and thus are not immune privileged.

The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans

Article

May 2015
SCIENCE

Understanding the functional consequences of genetic variation, and how it affects complex human disease and quantitative traits, remains a critical challenge for biomedicine. We present an analysis of RNA sequencing data from 1641 samples across 43 tissues from 175 individuals, generated as part of the pilot phase of the Genotype-Tissue Expression (GTEx) project. We describe the landscape of gene expression across tissues, catalog thousands of tissue-specific and shared regulatory expression quantitative trait loci (eQTL) variants, describe complex network relationships, and identify signals from genome-wide association studies explained by eQTLs. These findings provide a systematic understanding of the cellular and biological consequences of human genetic variation and of the heterogeneity of such effects among a diverse set of human tissues.

Capturing Heterogeneity in Gene Expression Studies by "Surrogate Variable Analysis"

Article

Jan 2005

It has unambiguously been shown that genetic, environmental, demographic, and technical factors may have substantial effects on gene expression levels. In addition to the measured variable(s) of interest, there will tend to be sources of signal due to factors that are unknown, unmeasured, or too complicated to capture through simple models. We show that failing to incorporate these sources of heterogeneity into an analysis can have widespread and detrimental effects on the study. Not only can this reduce power or induce unwanted dependence across genes, but it can also introduce sources of spurious signal to many genes. This phenomenon is true even for well-designed, randomized studies. We introduce “surrogate variable analysis” (SVA) to overcome the problems caused by heterogeneity in expression studies. SVA can be applied in conjunction with standard analysis techniques to accurately capture the relationship between expression and any modeled variables of interest. We apply SVA to disease class, time course, and genetics of gene expression studies. We show that SVA increases the biological accuracy and reproducibility of analyses in genome-wide expression studies.

Postmortem human brain genomics in neuropsychiatric disorders—how far can we go?

Article

Feb 2016

Andrew E Jaffe

Large-scale collection of postmortem human brain tissue and subsequent genomic data generation has become a useful approach for better identifying etiological factors contributing to neuropsychiatric disorders. In particular, studying genetic risk variants in non-psychiatric controls can identify biological mechanisms of risk free from confounding factors related to epiphenomena of illness. While the field has begun moving towards cell type-specific analyses, homogenate brain tissue with accompanying cellular profiles can still identify useful hypotheses for more focused experiments, particularly when the dysregulated cell types are unknown. Technological advances, larger sample sizes, and focused research questions can continue to further leverage postmortem human brain research to better identify and understand the molecular etiology of neuropsychiatric disorders.

Mapping DNA methylation across development, genotype and schizophrenia in the human frontal cortex

Article

Nov 2015

DNA methylation (DNAm) is important in brain development and is potentially important in schizophrenia. We characterized DNAm in prefrontal cortex from 335 non-psychiatric controls across the lifespan and 191 patients with schizophrenia and identified widespread changes in the transition from prenatal to postnatal life. These DNAm changes manifest in the transcriptome, correlate strongly with a shifting cellular landscape and overlap regions of genetic risk for schizophrenia. A quarter of published genome-wide association studies (GWAS)-suggestive loci (4,208 of 15,930, P < 10⁻¹⁰⁰) manifest as significant methylation quantitative trait loci (meQTLs), including 59.6% of GWAS-positive schizophrenia loci. We identified 2,104 CpGs that differ between schizophrenia patients and controls that were enriched for genes related to development and neurodifferentiation. The schizophrenia-associated CpGs strongly correlate with changes related to the prenatal-postnatal transition and show slight enrichment for GWAS risk loci while not corresponding to CpGs differentiating adolescence from later adult life. These data implicate an epigenetic component to the developmental origins of this disorder.

Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data

Article

Jan 2003
BIOSTATISTICS

Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments

Article

Jan 2004
STAT APPL GENET MOL

GK Smyth
