qSVA framework for RNA quality correction in
differential expression analysis
Andrew E. Jaffe (a,b,c,d,1), Ran Tao (a), Alexis L. Norris (e,f), Marc Kealhofer (a,g), Abhinav Nellore (c,d,h), Joo Heon Shin (a), Dewey Kim (a), Yankai Jia (a), Thomas M. Hyde (a,i,j), Joel E. Kleinman (a,j), Richard E. Straub (a), Jeffrey T. Leek (c,d), and Daniel R. Weinberger (a,e,j,k)
(a) Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD 21205
(b) Department of Mental Health, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205
(c) Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205
(d) Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21205
(e) Department of Neuroscience, Johns Hopkins School of Medicine, Baltimore, MD 21205
(f) Department of Neurology, Kennedy Krieger Institute, Baltimore, MD 21205
(g) Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205
(h) Department of Computer Science, Johns Hopkins University, Baltimore, MD 21205
(i) Department of Neurology, Johns Hopkins School of Medicine, Baltimore, MD 21205
(j) Department of Psychiatry and Behavioral Sciences, Johns Hopkins School of Medicine, Baltimore, MD 21205
(k) McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, MD 21205
Edited by Pasko Rakic, Yale University, New Haven, CT, and approved May 19, 2017 (received for review October 27, 2016)
RNA sequencing (RNA-seq) is a powerful approach for measuring
gene expression levels in cells and tissues, but it relies on high-
quality RNA. We demonstrate here that statistical adjustment
using existing quality measures largely fails to remove the effects
of RNA degradation when RNA quality associates with the out-
come of interest. Using RNA-seq data from molecular degradation
experiments of human primary tissues, we introduce a method,
quality surrogate variable analysis (qSVA), as a framework for
estimating and removing the confounding effect of RNA quality
in differential expression analysis. We show that this approach
results in greatly improved replication rates (>3×) across two large
independent postmortem human brain studies of schizophrenia
and also removes potential RNA quality biases in earlier published
work that compared expression levels of different brain regions
and other diagnostic groups. Our approach can therefore improve
the interpretation of differential expression analysis of transcrip-
tomic data from human tissue.
RNA sequencing | differential expression analysis | statistical modeling | RNA quality
Microarrays and RNA sequencing (RNA-seq) can measure
gene expression levels across hundreds of samples in a
single experiment. As gene expression levels are measured with
error, normalization procedures have been implemented for
both microarray (1) and RNA sequencing (2) data to reduce
technical variability, including controlling for variability associ-
ated with how and when the samples are run, so-called batch
effects (3). Recent research has further characterized this expression
variability in RNA-seq data (4–6), including demonstrating
variability associated with technical factors involved in
the preparation, sequencing, and analysis of samples. Variability
in gene expression is particularly influenced by RNA quality (7)
because accurately measuring gene expression levels strongly
depends on the quality of the input RNA. This suggests that a
portion of traditionally measured latent batch effects could
actually be attributed to the underlying quality of the input RNA.
Postmortem studies typically extract RNA from tissue that has
been susceptible to a wide variety of antemortem and postmortem
factors. Several approaches exist for quantifying the quality of the
input RNA before sequencing library construction, including UV
absorption ratios of 280 nm to 260 nm and RNA integrity numbers
(RINs). RIN is a machine learning-derived measurement resulting
from placing RNA on a Bioanalyzer and obtaining a tracing of
fragment sizes per sample. RIN ranges from 10 (very high quality
RNA) to 0 (completely degraded RNA), and the apparent in-
tactness of ribosomal RNAs (which are two large peaks in the
fragment size tracing) is one of the most discriminating factors
that distinguishes very high quality from moderate quality RNA
(8). Recommended RIN thresholds for sample exclusion before
data generation have been suggested as low as 5.0 for PCR (7) and
7.0 for RNA-seq (9). However, even high quality samples (RIN >
8) demonstrate evidence of degradation, as transcriptome-wide
gene expression levels strongly associate with RIN even among
samples with high RINs, for example, in lymphoblastoid cell lines
(6). Furthermore, the recent introduction of ribosomal depletion
approaches for library construction, such as the Illumina Ribo-Zero
technique, has permitted the sequencing of lower quality
samples compared with previous polyadenylation selection-based
approaches (polyA+), including samples with RINs less than 3 (10).
Proposed measures of RNA quality can also be derived from the resulting RNA sequencing data, for example, by calculating the 5′ to 3′ read coverage bias (particularly in polyA+ data); transcript integrity numbers (11); various read mapping rates, including to autosomes, ribosomal RNAs, and mitochondrial RNAs (chrM); and gene/exon assignment rates (7). Although many of these approaches appear to capture the largest global effects on expression, for example, through positively correlating factors of expression data with the above-mentioned quality measures, the presence and role of more subtle and gene/transcript-specific biases in RNA quality on measures of gene expression and resulting differential expression analysis are unclear. Furthermore, application of existing statistical methods to model latent RNA quality risks retaining false-positive associations in supervised approaches such as surrogate variable analysis (SVA) (12) or removing an outcome-associated biological signal in unsupervised approaches such as principal component analysis (PCA). Here we describe a general analytic framework to estimate and remove RNA quality confounding in differential expression analysis that first identifies transcript features most susceptible to RNA degradation using tissue degradation experiments and subsequently corrects independent datasets using the expression levels of these transcript features. We show that this framework, called quality surrogate variable analysis (qSVA), better identifies and removes confounding related to RNA quality in differential expression analysis than do observed measures of RNA quality alone.
Significance
Many studies use measurements of gene expression in human postmortem and ex vivo tissues like brain and blood to characterize genomic correlates of illness. However, molecular analyses of these tissues can be susceptible to a wide range of confounders that may be difficult to measure and remove. In this article, we describe an analysis framework for identifying and removing previously uncharacterized quality biases in measurements of RNA. Our paper critically highlights the shortcomings of standard RNA quality correction approaches, such as statistically adjusting for RNA integrity numbers. We show that our framework removes residual confounding by RNA quality and greatly improves replication of significant differentially expressed genes across independent datasets by more than threefold compared with previous approaches.
Author contributions: A.E.J., J.T.L., and D.R.W. designed research; A.E.J., R.T., A.L.N., J.H.S., D.K., Y.J., T.M.H., J.E.K., R.E.S., J.T.L., and D.R.W. performed research; A.E.J. and A.N. contributed new reagents/analytic tools; A.E.J., A.L.N., and M.K. analyzed data; and A.E.J., J.T.L., and D.R.W. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
Freely available online through the PNAS open access option.
Data deposition: The sequences reported in this paper have been deposited with the National Center for Biotechnology Information (NCBI BioProject number PRJNA389171 and NCBI SRA project SRP108559).
(1) To whom correspondence should be addressed. Email: andrew.jaffe@libd.org.
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1617384114/-/DCSupplemental.
7130–7135 | PNAS | July 3, 2017 | vol. 114 | no. 27 | www.pnas.org/cgi/doi/10.1073/pnas.1617384114
Results
Degradation Experiments to Model Changes in RNA Quality. We hypothesized that examining RNA degradation in human tissue from experimental approaches consisting of leaving tissue at room temperature would identify metrics useful for quantifying RNA quality. We therefore examined the transcriptional landscape of degradation in dorsolateral prefrontal cortex (DLPFC) tissue (in a degradation experiment that we performed) and blood (specifically, peripheral blood mononuclear cells, PBMCs, that were publicly available). Briefly, we left DLPFC tissue from five brains at room temperature (off of ice) for 0, 15, 30, and 60 min; extracted RNA; measured RINs; and then constructed and sequenced both polyA+ and RiboZero libraries (Materials and Methods and SI Appendix, Table S1). The PBMC degradation experiment was a similar design over a longer time period, ranging from 12 h to 84 h (13), and the resulting RNAs were sequenced with polyA+ libraries. The RNA-seq reads from both experiments were processed identically (SI Appendix, Full Methods and Materials). Many technical covariates were strongly associated with degradation time in both blood and brain (SI Appendix, Table S2), and PCA suggested that degradation time was the strongest explanatory variable (PC1) of the transcriptome across each library type, explaining 47.5% and 39.0% of the variance in normalized gene counts in polyA+ and RiboZero DLPFC libraries, respectively, and 54.4% in blood (SI Appendix, Fig. S1). These first PCs were more independently associated with degradation time than RIN in multivariate regression analysis (SI Appendix, Table S2), suggesting that these degradation experiments induce widespread changes in RNA quality that are not fully recognized by RIN.
Different mRNAs Degrade at Different Rates in Human Tissues. Many genes were highly susceptible to the effects of RNA degradation, including 12,324 genes at a false discovery rate (FDR) < 5% significance in the DLPFC polyA+ dataset (n = 2,303 at p_bonf < 5%), 10,981 genes in the DLPFC RiboZero dataset (n = 2,017 at p_bonf < 5%), and 11,170 genes in blood polyA+ data (n = 2,833 at p_bonf < 5%, Dataset S1). Regardless of tissue or library type, increased susceptibility to RNA degradation (e.g., a more negative degradation t-statistic) was associated with increased gene length and increased coding lengths, increased transcript expression, decreased guanine-cytosine (GC) content, and increased number of annotated transcripts (all but one P value < 2.2 × 10^-16, SI Appendix, Table S3). Enrichment analyses among predefined gene sets suggested dysregulation of a wide variety of cellular processes associated with increased degradation susceptibility (Dataset S2).
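The degradation t-statistic mentioned above can be illustrated with a minimal sketch. It assumes a simple per-gene least-squares fit of log2 expression on degradation time; the function name `degradation_fit` and the expression values are hypothetical, and the fit ignores the donor structure of the actual experiments, so it is not the model used in the paper.

```python
import math

def degradation_fit(times, log_expr):
    """Least-squares fit of log expression on degradation time.

    Returns (slope, t_statistic) for the time coefficient.
    """
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(log_expr) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, log_expr))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    # Residual sum of squares; the variance estimate uses n - 2 degrees of freedom.
    rss = sum((y - (intercept + slope * t)) ** 2
              for t, y in zip(times, log_expr))
    standard_error = math.sqrt(rss / (n - 2) / sxx)
    return slope, slope / standard_error

# Hypothetical log2 expression for two genes at 0, 15, 30, and 60 min.
times = [0, 15, 30, 60]
susceptible = [8.0, 7.6, 7.1, 6.2]  # expression falls steadily with time
stable = [5.0, 5.1, 4.9, 5.0]       # expression is roughly flat

slope_s, t_s = degradation_fit(times, susceptible)
slope_f, t_f = degradation_fit(times, stable)
print(f"susceptible: slope={slope_s:.4f} per min, t={t_s:.1f}")
print(f"stable:      slope={slope_f:.4f} per min, t={t_f:.1f}")
```

A strongly negative t-statistic marks the first gene as degradation-susceptible, whereas the second gene's statistic stays near zero.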
Because RNAs from different cell types may degrade at different rates, and both blood and DLPFC are mixtures of diverse cell types, we explored the role of cell-type-specific signal on RNA degradation. We estimated the relative proportions of 22 different blood cell types using existing reference data in the PBMCs (14) and found significant changes in the relative cellular composition comparing degraded to intact PBMC samples. Increased degradation time decreased the relative proportion of monocytes (P = 1.82 × 10^-5) and increased the relative proportion of macrophages (P = 8.63 × 10^-5), regulatory T cells (P = 5.47 × 10^-3), and activated mast cells (P = 1.07 × 10^-6, SI Appendix, Fig. S2 and Dataset S3). In DLPFC, because such a reference profile of brain cell types does not exist, we derived cell-type-specific candidate gene lists using available single-cell RNA-seq data (15). We found significant enrichment of these candidate genes among our degradation statistics overall (P < 2.2 × 10^-16, SI Appendix, Fig. S3) as well as differential degradation effects by cell type (SI Appendix, Table S4 and Materials and Methods). These enrichment analyses indeed suggest that RNAs from different cell types may be differentially susceptible to degradation, which is captured uniquely by different RNA-seq library preparation methods.
Biological and Technical Specificity of RNA Degradation Transcriptome Associations. Given the strong influence of RNA degradation on the transcriptome, we examined whether these degradation effects were brain- and degradation-method-specific. We directly compared the DLPFC polyA+ and PBMC degradation datasets to determine tissue specificity. The rate of degradation, as measured by RIN, was more rapid in our brain samples, as PBMCs still had high quality RNA after 12 h at room temperature (all RINs > 7.7), compared with DLPFC samples having RINs less than 6.6 after just 1 h. We found only a weak global correlation between the gene degradation susceptibility statistics (SI Appendix, Fig. S4A) and much smaller degradation rates of individual genes (median: 33.6% versus 0.44%; 90th percentile: 213.8% versus 1.4% change per hour) between PBMCs and DLPFC, suggesting global differences in the transcriptome changes resulting from degradation. However, we processed public Association of Biomolecular Resource Facilities Next Generation Sequencing (ABRF-NGS) data (that were sequenced with a RiboZero protocol) that compared three brain reference RNA samples treated with RNase A to nine untreated samples (10). In this RNA (rather than tissue) degradation experiment, there were 13,553 genes significantly associated with RNase treatment (at FDR < 5%). There was significant global overlap between degradation induced by our experiment at the tissue level (using DLPFC RiboZero data) compared with the RNA level: 7,700 (65.7%) genes were significantly differentially expressed in both experiments (odds ratio: 6.28, P < 2.2 × 10^-16) and there was significant global correlation of degradation susceptibility statistics (P < 2.2 × 10^-16, SI Appendix, Fig. S4B). Therefore, the strongest RNA degradation effects appear tissue-specific, but within a tissue, RNase A-like activity is likely a major factor contributing to the RNA degradation.
Strong Bias in Differential Expression Analysis in Confounded Designs. Based on the preceding results, we reasoned that many prior findings in differential expression analyses of postmortem brain datasets may have been susceptible to RNA degradation confounding. For example, many studies comparing different diagnostic groups typically have significant group differences in measures of RNA quality (e.g., RINs). We therefore used two large RNA-seq datasets from the prefrontal cortex comparing patients with schizophrenia to adult controls: Lieber Institute for Brain Development (LIBD, "discovery" data, polyA+ protocol, n = 351) and CommonMind Consortium (CMC, "replication" data, RiboZero protocol, n = 331) (14). Both studies indeed had significantly lower RINs in the control versus schizophrenia groups: LIBD, P = 4.4 × 10^-5 (mean RIN: 8.4 versus 8.1), and CMC, P = 7.6 × 10^-8 (mean RIN: 7.8 versus 7.4). We first created a new diagnostic plot to compare differential expression statistics for outcome to the degradation statistics from RNA degradation experiments (fold change in expression per minute or its corresponding t-statistic). This approach, which we call the "differential expression quality" (DEqual) plot, can illustrate transcriptome-wide RNA degradation bias in a given dataset. We observed strong positive correlation between univariate differential expression statistics for diagnosis and experimental degradation in the LIBD dataset (Fig. 1A and SI Appendix, Fig. S5A). Here the directionality of change associated with diagnosis at a particular gene can be predicted almost entirely by its relationship with degradation and the difference in RNA quality between outcome groups. Among the 24,122 genes with reads per kilobase per million mapped (RPKM) > 0.1, we found that 11,408 (47.3%) genes were significantly differentially expressed at FDR < 5% in the discovery dataset, further suggesting confounding by RNA quality. We posit that removing the correlation between degradation-associated and diagnosis-associated statistics illustrated in the DEqual plot will show that RNA quality has been properly adjusted for in the differential expression analysis.
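The quantity summarized by a DEqual plot can be sketched as a Pearson correlation between two sets of per-gene statistics. The following is a minimal illustration under stated assumptions: the helper names `pearson` and `dequal_correlation`, the gene identifiers, and all values are hypothetical, and genes are simply matched by identifier.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def dequal_correlation(degradation_stats, outcome_stats):
    """Correlate per-gene statistics for genes present in both inputs."""
    shared = sorted(set(degradation_stats) & set(outcome_stats))
    outcome = [outcome_stats[gene] for gene in shared]
    degradation = [degradation_stats[gene] for gene in shared]
    return pearson(outcome, degradation), len(shared)

# Hypothetical t-statistics keyed by gene identifier.
degradation_t = {"geneA": -6.0, "geneB": -2.0, "geneC": 1.0,
                 "geneD": 4.0, "geneE": 0.5}
diagnosis_t = {"geneA": -3.0, "geneB": -1.0, "geneC": 0.5,
               "geneD": 2.0, "geneF": 1.0}

r, n_shared = dequal_correlation(degradation_t, diagnosis_t)
print(f"r = {r:.3f} across {n_shared} shared genes")
```

In this toy input the diagnosis statistics are exactly half the degradation statistics, so the correlation is 1; a value near zero would be the pattern expected when RNA quality no longer tracks the outcome.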
Statistically Adjusting for RIN Fails to Remove Degradation Bias. Given the DEqual plots from the univariate analysis, the significant difference in RIN between the schizophrenia and control groups, and the large number of differentially expressed genes, we expected that adjusting the differential expression analysis for RINs would reduce the degree of degradation bias. However, RIN adjustment only partially reduced the correlation between diagnosis and degradation statistics (Fig. 1B and SI Appendix, Fig. S5B) from a Pearson correlation of r = 0.464 to r = 0.358 and only reduced the number of FDR-significant differentially expressed genes from 11,408 to 6,622 in the discovery dataset. The degree of RNA degradation bias was practically identical when further modeling RIN nonlinearly, e.g., further adjusting for RIN and RIN^2 (SI Appendix, Fig. S6). We further adjusted the differential expression analysis for other observed variables, including clinical and technical covariates ("observed" model: age, sex, ethnicity, chrM map rate, gene assignment rate, and RIN), which again only partially reduced both the correlation between diagnosis and degradation statistics (to r = 0.291, Fig. 1C) and the number of genes that were significantly differentially expressed (n = 2,215).
We also used the PBMC degradation dataset to show that RIN adjustment fails to account for the differences in RNA degradation between outcome groups. Here, we modeled differences in expression between individuals 1 and 2 after inducing confounding by degradation time by removing T = 0 for individual 1 and T = 84 for individual 2 (Materials and Methods). As expected, univariate analysis showed a strong correlation between the individual effect and the degradation effect (SI Appendix, Fig. S7A, r = 0.495). Again, statistical adjustment for RIN in this confounded design does not remove the strong degradation bias (SI Appendix, Fig. S7B, r = 0.307). Here, in this experimental dataset, unlike the schizophrenia case-control datasets described above, we have a gold standard surrogate of RNA degradation, the time at room temperature, and show that adjusting for this measure completely removes the RNA degradation bias (SI Appendix, Fig. S7C, r = 0.09). These results suggest that RIN or other observed quality variables may be a poor surrogate for total RNA quality and that adjusting for RIN in differential expression analysis is insufficient to remove potential RNA degradation confounding.
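To make the confounding argument concrete, the sketch below builds a toy gene whose expression depends only on degradation time and compares the apparent group effect before and after adjusting for time. The adjustment uses residualization (the Frisch-Waugh-Lovell result for linear models), which is swapped in here for simplicity; all data, units, and function names are hypothetical and do not reproduce the paper's analysis.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx
    return mean_y - slope * mean_x, slope

def residuals(x, y):
    """Residuals of y after regressing out x."""
    intercept, slope = fit_line(x, y)
    return [b - (intercept + slope * a) for a, b in zip(x, y)]

def adjusted_effect(group, y, covariate):
    """Group coefficient after adjusting for one covariate.

    Both the outcome and the group indicator are residualized on the
    covariate; the slope between the residuals equals the group
    coefficient from the joint linear model.
    """
    y_res = residuals(covariate, y)
    g_res = residuals(covariate, group)
    return fit_line(g_res, y_res)[1]

# Hypothetical design: group 1 samples tend to have longer degradation times.
time_h = [0, 12, 24, 36, 48, 60, 72, 84]
group = [0, 0, 0, 1, 0, 1, 1, 1]
# Expression depends only on degradation time, not on group.
expr = [10.0 - 0.05 * t for t in time_h]

unadjusted = fit_line(group, expr)[1]
adjusted = adjusted_effect(group, expr, time_h)
print(f"unadjusted group effect: {unadjusted:.3f}")
print(f"time-adjusted group effect: {adjusted:.3f}")
```

The unadjusted analysis reports a spurious group difference that is entirely inherited from degradation time, and adjusting for the true degradation measure removes it. The sketch does not model RIN, so it says nothing by itself about how well RIN adjustment performs.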
qSVA to Correct for RNA Degradation Bias. We hypothesized that we could leverage the experimental degradation datasets to better estimate factors related to RNA quality in RNA-sequencing datasets. This approach relies on estimating the transcript features most susceptible to RNA degradation and using these features as "negative control" features akin to approaches such as remove unwanted variation (RUV) (2) or SVA (12). The broad concept of the algorithm is to identify transcript features that are especially sensitive to degradation in the tissue of interest and then to quantify these same features in the experimental differential expression dataset and create a set of factors that are used to control for RNA quality bias (see SI Appendix, Full Methods and Materials for details). We defined those features that were Bonferroni-significantly associated with degradation in each dataset: the top 1,000 features in the DLPFC and PBMC polyA+ datasets (among thousands that were significant) and the 515 features in the DLPFC RiboZero data (step #1, see SI Appendix, Full Methods and Materials and Datasets S4 and S5). Interestingly, the transcript features in DLPFC across these two library types were completely nonoverlapping, suggesting that the features most susceptible to degradation likely differ by library type. Within polyA+ data, there were only four degradation-susceptible features overlapping between DLPFC and PBMCs (within genes: PNKD, MBOAT7, ENG, and SULF2). These features can then be quantified in new user-provided samples for step #2 from BAM or BigWig files (SI Appendix, Full Methods and Materials), resulting in coverage estimates for each feature and new sample. Then, for step #3, factor analysis on the log-transformed degradation matrix of coverage estimates generates quality surrogate variables (qSVs). In step #4, the qSVs are then included as adjustment variables in differential expression analysis. The qSVA approach is available in the SVA Bioconductor package (https://bioconductor.org/packages/sva) (16), and the example code to run the statistical framework is described in SI Appendix, Full Methods and Materials.
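Steps #3 and #4 can be sketched in a few lines, assuming the factor analysis is a principal component analysis of the log2(x + 1) degradation matrix and extracting only the first factor by power iteration. This is an illustration, not the code from the SVA package; the function names, the coverage values, and the choice to keep a single factor are all assumptions.

```python
import math

def log2_center(coverage):
    """Apply log2(x + 1) and center each feature (row) across samples."""
    centered = []
    for row in coverage:
        logged = [math.log2(value + 1.0) for value in row]
        mean = sum(logged) / len(logged)
        centered.append([value - mean for value in logged])
    return centered

def first_qsv(coverage, iterations=200):
    """Unit-length first principal component, one value per sample.

    Power iteration on the sample-by-sample Gram matrix of the centered
    log coverage; the overall sign of the result is arbitrary.
    """
    x = log2_center(coverage)  # features x samples
    n = len(x[0])
    gram = [[sum(row[i] * row[j] for row in x) for j in range(n)]
            for i in range(n)]
    vec = [float(i + 1) for i in range(n)]  # arbitrary nonzero start
    for _ in range(iterations):
        nxt = [sum(gram[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    return vec

# Hypothetical coverage of three degradation-susceptible features in four
# samples; samples 3 and 4 are assumed to be more degraded.
coverage = [
    [100.0, 90.0, 40.0, 30.0],
    [200.0, 210.0, 80.0, 70.0],
    [50.0, 55.0, 20.0, 25.0],
]
qsv1 = first_qsv(coverage)

# Step #4: append the qSV to the design matrix (intercept, diagnosis, qSV).
diagnosis = [0, 1, 0, 1]
design = [[1.0, float(d), q] for d, q in zip(diagnosis, qsv1)]
print([round(v, 3) for v in qsv1])
```

The first factor separates the two intact samples from the two degraded ones, which is the kind of quality signal the qSVs are meant to carry into the differential expression model. How many factors to retain is not addressed by this sketch.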
[Figure 1, panels A–D: DEqual scatterplots with the degradation statistic on the y axis; panel A is titled "SZ vs Control (Univariate)" and annotated r = 0.464.]
Fig. 1. Differential expression quality (DEqual) plots for schizophrenia-
control expression differences. Each DEqual plot compares the effect of
RNA degradation from an independent degradation experiment on the y
axis to the effect of the outcome of interest, here schizophrenia (SZ) com-
pared with controls. Each point is a gene, and effects here are shown as
T-statistics for each effect. (A) DEqual plot for univariate case-control anal-
ysis shows strong correlation between degradation and diagnosis effects.
(B) DEqual plot for RIN-adjusted case-control differences largely fails to
remove degradation bias. (C) DEqual plot when adjusting for observed
clinical and technical covariates, including age, sex, ethnicity, chrM mapping
rate, gene assignment rate, and RIN, also fails to remove degradation bias.
(D) DEqual plot demonstrating that the qSVA framework successfully
removes positive correlation between degradation and SZ effects.
Improved Replication for Schizophrenia Differential Expression Using qSVA. We applied the qSVA algorithm to the LIBD polyA+ RNA-seq data with the observed model (consisting of observed clinical and technical confounders) described above. Here, adjustment completely attenuated degradation bias (Fig. 1D, r = 0.09 using T-statistics and r = 0.037 using log2 fold changes). Following this adjustment, there were only 183 genes differentially expressed at FDR < 5%, further suggesting a reduction of RNA degradation bias in differential expression analysis of schizophrenia patients versus controls. The qSVs themselves were strongly associated with observed variables including chrM alignment rate, RIN, total gene assignment rate, overall alignment rate, age, and postmortem interval (SI Appendix, Fig. S8). Similarly, in the CMC dataset, the qSVs, calculated using the DLPFC RiboZero-based degradation features, were strongly associated with RIN, total gene assignment rate, institute where the sample was collected, and sequencing and flow cell batches (SI Appendix, Fig. S9). These results suggest that enriching for degradation signal via the independent tissue degradation experiment can identify more robust measures of RNA quality directly from RNA-seq experiments than relying on single observable measures.
Although the qSVA approach appears to remove RNA degradation bias in brain differential expression analysis as illustrated in the DEqual plot, we further observed that adjusting for transcriptome-wide PCs also removes the degradation effects (SI Appendix, Fig. S10, r = 0.02). This suggests that factor-based approaches, including qSVA but also more generally PCA, can identify and subsequently remove latent measures of RNA quality. However, unsupervised approaches like PCA run the risk of removing true biological difference. Moreover, supervised factor-based approaches, such as SVA, that rely on residualizing around a provided statistical model largely preserved RNA degradation bias (SI Appendix, Fig. S11). We therefore used replication signal across these independent datasets, LIBD and CMC, to more fully contrast the value of the different degradation adjustment approaches. For a given adjustment approach, we calculated replication rates of differentially expressed genes discovered in the LIBD dataset at different significance thresholds in the CMC dataset. We found the lowest replication rates (<20%) regardless of significance threshold when adjusting only for observed clinical and technical variables including RIN, as well as SVA residualizing on only diagnosis (Fig. 2). Although we had high replication rates among marginally significant genes (P < 0.001) using SVA residualizing on the observed variables described above, we found strong inflation of the test statistics among both the LIBD (9,033 genes at FDR < 5%) and CMC (6,924 genes at FDR < 5%) datasets. Among those genes significantly differentially expressed (P < 10^-4), we found the highest replication rates using qSVA, as well as relatively linear improvements in the replication rate as the discovery P value threshold dropped. Importantly, the qSVs calculated in the LIBD and CMC datasets were based on different degradation features, as the CMC data were RiboZero and the LIBD data were polyA+. These results therefore show that qSVA leads to greatly improved replication in postmortem brain transcriptomic studies.
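The replication rate used in this comparison can be sketched as follows, using the definition given in the Fig. 2 legend: among genes passing a discovery threshold, the fraction with the same fold-change direction and P < 0.05 in the replication dataset. The function name, the data layout, and every number below are hypothetical.

```python
def replication_rate(discovery, replication, threshold, rep_alpha=0.05):
    """Fraction of discovery genes that replicate.

    Both inputs map gene -> (log2_fold_change, p_value). A gene replicates
    when its fold change has the same sign in both datasets and its
    replication P value is below rep_alpha. Returns None when no gene
    passes the discovery threshold.
    """
    tested = [gene for gene, (_, p) in discovery.items()
              if p < threshold and gene in replication]
    if not tested:
        return None
    hits = 0
    for gene in tested:
        lfc_discovery, _ = discovery[gene]
        lfc_replication, p_replication = replication[gene]
        same_direction = (lfc_discovery > 0) == (lfc_replication > 0)
        if same_direction and p_replication < rep_alpha:
            hits += 1
    return hits / len(tested)

# Hypothetical results for four genes in each dataset.
discovery = {"g1": (0.4, 1e-5), "g2": (-0.3, 2e-4),
             "g3": (0.2, 0.03), "g4": (0.5, 5e-4)}
replication = {"g1": (0.3, 0.01), "g2": (0.1, 0.02),
               "g3": (0.2, 0.2), "g4": (0.4, 0.2)}

rate = replication_rate(discovery, replication, threshold=0.001)
print(f"replication rate at P < 0.001: {rate:.3f}")
```

Repeating the calculation over a grid of discovery thresholds for each adjustment model would give curves of the kind compared in Fig. 2.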
Applicability of qSVA to Other Tissues and Brain Regions. We next examined the generalizability of the qSVA framework to other tissues and brain regions. We tested the first step of degradation feature selection in the PBMC dataset (resulting in degradation features, Dataset S6) and the ABRF RNase A dataset using DLPFC RiboZero-specific degradation features (Dataset S5). In both datasets, the top estimated qSV was strongly associated with the experimental degradation condition (PBMC: P = 4.56 × 10^-13, SI Appendix, Fig. S12A; ABRF: P = 3.57 × 10^-7, SI Appendix, Fig. S12B). In the confounded individual example from the PBMC dataset, we successfully removed degradation bias by selecting degradation-susceptible features from the PBMC degradation data (SI Appendix, Fig. S13A). Here the qSV adjustment resulted in less statistically biased effect estimates (i.e., log2 fold changes for the effect of individual) compared with the statistical model adjusting for observed degradation time (SI Appendix, Fig. S13B). Conversely, the statistical bias in differential expression signal from the RIN-adjusted model for the effect of individual relative to the degradation time-adjusted model was much larger (SI Appendix, Fig. S13C). These results suggest this general framework can work well in other tissues.
As the first step in our framework involves generating experimentally derived degradation expression profiles, which may be impractical for small laboratories or projects, we assessed the cross-tissue and cell-type applicability of our PBMC- and DLPFC-derived degradation-susceptible features. First, we quantified DLPFC-derived (polyA+, Dataset S4) degradation features in the PBMC dataset; here the top qSV showed similar association with degradation time as above (P = 7.93 × 10^-9) and also successfully removed correlation between confounded individual effects and the effect of degradation (r = 0.015). The estimated log2 fold changes for the quality-corrected individual effects were highly correlated using qSVs derived either from PBMC or DLPFC degradation data features (r = 0.997, SI Appendix, Fig. S13D).
We next derived qSVs from the PBMC degradation-susceptible transcript features in the LIBD DLPFC schizophrenia-control data and evaluated the performance using DEqual plots and calculating the number of genes significantly differentially expressed. Here, although the log2 fold changes when adjusting using blood versus brain degradation features were correlated (SI Appendix, Fig. S14A, r = 0.6), there was stronger negative correlation between degradation susceptibility in brain- and blood-adjusted case-control differences (SI Appendix, Fig. S14B, r = 0.11). However, qSVA using the blood degradation features yielded 1,057 genes significantly differentially expressed at FDR < 5%, approximately five times more than using the brain degradation-susceptible transcript features, suggesting that brain-specific degradation effects might not be captured using PBMC-susceptible features.
[Figure 2: replication rate (y axis, 0.0 to 0.6) for gene RPKMs at discovery thresholds p<0.05, p<0.01, p<0.005, p<0.001, p<1e-04, p<1e-05, and p<1e-06; legend: adj, qsva, pca, svaFull.]
Fig. 2. qSVA improves replication across independent datasets. We modeled SZ-control expression differences using four statistical models in the LIBD (discovery) and CMC (replication) datasets. For a given significance threshold in the discovery dataset, we computed the replication rate (same fold-change direction for case status and P < 0.05) in the replication dataset. The qSVA approach had the highest replication rate, and the covariate-adjusted and SVA approaches had the lowest replication rates.
We further used the Genotype-Tissue Expression (GTEx) project RNA-seq expression data (n = 9,502 across 49 detailed tissues) (17) to characterize the generalizability of DLPFC-derived degradation features to other brain regions and tissue types. We ran differential expression analysis comparing each of 48 detailed tissues in GTEx to BA9 frontal cortex before and after qSVA correction. In the unadjusted analyses, we found a strong association between resulting correlations in DEqual plots and the difference in perceived RNA quality (in chrM mapping rates, SI Appendix, Fig. S15A, r = 0.736, P = 2.44 × 10^-9). These quality associations were driven by the 12 other brain regions (r = 0.88, P = 2.44 × 10^-9) as the nonbrain tissues showed no association (r = 0.19, P = 0.26). Here qSVA correction removed the overall quality effects across the detailed tissues, largely by removing the positive correlation in the brain samples (SI Appendix, Fig. S15B, r = 0.0, P = 0.97). These results suggest that using DLPFC-derived degradation features for qSVA correction may work well in other brain regions, but may not be appropriate for RNA degradation correction in other tissues in the body.
Degradation Bias Signal in Published Differential Expression Analyses.
We finally compared the presence of RNA quality bias in published
differential expression analyses in human brain for different
disorders. As there are currently few additional large
RNA-seq studies of postmortem human brain tissue in disease
states, we used previously published large microarray datasets on
differential expression in autism spectrum disorder (ASD) (18)
and Alzheimer's disease (AD) (19) across multiple brain regions.
In the ASD dataset, patients had significantly lower RINs than
controls in the frontal (P = 0.021) but not temporal (P = 0.70)
cortex, and, in the AD dataset, patients scored significantly lower
than controls for the single RIN provided across the three
brain regions (P = 1.23 × 10^-10). To generate qSVs for these data,
we mapped the probes on each microarray platform to the genome,
extracted coverage from our RNA-seq data, and selected those probe
sequences that were significantly associated with degradation
(Materials and Methods). In the ASD dataset, those probes most
associated with degradation (n = 1,129 at p_bonf < 1%) were almost
uniformly more lowly expressed in patients compared with
controls in the frontal cortex (SI Appendix, Fig. S16A, P = 2.2 × 10^-49).
The directionality of enrichment followed the diagnosis
and degradation associations, given that almost all
degradation-susceptible probes decreased in expression over time (98.5%)
and that RINs were lower in patients compared with controls. In
the temporal cortex, where RINs did not significantly differ between
cases and controls, there was attenuated, but still significant,
enrichment in the same negative direction (P = 4.77 × 10^-6).
Following the qSVA procedure (PCA on the 1,129 susceptible
probes and the adjustment for the resulting qSVs), the association
between degradation-susceptible probes and diagnosis was
removed (P = 0.496, SI Appendix, Fig. S16B).
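The two steps of the qSVA procedure named above (PCA on the degradation-susceptible features, then adjustment for the resulting qSVs) can be illustrated with a minimal sketch. It is not the authors' implementation: it uses ordinary least squares rather than limma's moderated statistics, the number of qSVs `k` is simply fixed by hand, and the function names and simulated data are hypothetical.

```python
import numpy as np

def quality_surrogate_variables(degradation_expr, k):
    """Return the top-k sample-level PCs (samples x k) of a features x samples matrix."""
    # Center each degradation-susceptible feature across samples.
    centered = degradation_expr - degradation_expr.mean(axis=1, keepdims=True)
    # The right singular vectors of the centered matrix are the sample-level PCs.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T

def diagnosis_effects(expr, diagnosis, covariates):
    """Per-gene OLS coefficient for diagnosis, adjusting for the given covariates."""
    n = len(diagnosis)
    design = np.column_stack([np.ones(n), diagnosis, covariates])
    coef, *_ = np.linalg.lstsq(design, expr.T, rcond=None)
    return coef[1]  # row 1 holds the diagnosis coefficient for every gene

# Simulated data (assumption): a latent quality factor that differs by diagnosis.
rng = np.random.default_rng(0)
n = 60
diagnosis = np.repeat([0.0, 1.0], n // 2)
quality = 1.5 * diagnosis + rng.normal(size=n)

# 200 degradation-susceptible features that track the latent quality factor.
feature_loadings = 2.0 + rng.normal(size=200)
degradation_expr = np.outer(feature_loadings, quality) + 0.1 * rng.normal(size=(200, n))

# 50 genes affected by quality only; their true diagnosis effect is zero.
gene_loadings = rng.normal(size=50)
expr = np.outer(gene_loadings, quality) + 0.1 * rng.normal(size=(50, n))

qsvs = quality_surrogate_variables(degradation_expr, k=1)
naive = diagnosis_effects(expr, diagnosis, np.empty((n, 0)))  # no quality adjustment
corrected = diagnosis_effects(expr, diagnosis, qsvs)          # adjusted for the qSV
```

In this simulation the unadjusted diagnosis coefficients are spurious by construction, and including the qSV in the design matrix shrinks them toward zero.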
We found the same enrichment among differentially expressed
probes for AD across all three brain regions and the 653
degradation-susceptible probes on this microarray, including in
the prefrontal cortex (P = 1.27 × 10^-48, SI Appendix, Fig. S16C),
cerebellum (P = 1.82 × 10^-33), and visual cortex (P = 2.35 × 10^-35).
Adjusting for the resulting qSVs again removed the association
between diagnosis and degradation susceptibility in the prefrontal
cortex (P = 0.66, SI Appendix, Fig. S16D) and cerebellum
(P = 0.49) and greatly reduced the association in the visual
cortex (P = 6.11 × 10^-4). The qSVA correction also greatly reduced
the magnitude of the differential expression test statistics
across the entire platform (SI Appendix, Fig. S16C versus D).
These results further underscore the risk of potentially spurious
findings based on uncorrected RNA quality confounding.
Discussion
We describe a framework for quantifying and removing RNA
quality biases in differential expression analysis. We first
characterized aspects of the landscape of RNA degradation across
the human DLPFC and PBMC transcriptomes and identified
largely tissue-specific degradation signals. The cell types
represented in bulk/mixed tissues like brain and PBMCs further
showed differential susceptibility to RNA degradation. We used
these experimental degradation datasets to identify the most
degradation-susceptible transcript features in PBMC and DLPFC
RNA-seq libraries and developed an approach called qSVA to use
expression levels of these regions in new/user-provided samples to
estimate and remove RNA degradation bias in differential expression
analyses. We show that the qSVA approach results in better
replication across independent studies and in various public tissue
datasets than existing popular statistical models that model observed
measures of RNA quality like RIN, chrM mapping rate, and
gene assignment rate. Our qSVA approach has a potential advantage
over general PCA or RUV adjustments: in particular, less risk
of removing true signals along with the noise. Reanalysis of previously
published microarray datasets for AD and ASD further
suggests that probes differentially expressed for diagnosis were
highly associated in a predictable directionality with RNA degradation
susceptibility in both datasets.
We also demonstrated that adjusting for measures of RIN
largely fails to remove RNA degradation bias and formally
showed that RIN correction is more statistically biased at estimating
fold changes than qSVA when the true degradation effect
is known. The estimation of RIN itself is heavily driven by the
intactness of ribosomal RNAs (8), which appears only weakly
associated with the underlying quality of total or polyadenylated
RNAs across different subjects or tissues. Variance components
analysis of RIN values within the full GTEx dataset suggests that
tissue source explains approximately three times more variance
than individual identity (44.5% versus 14.7%). However, within
only the GTEx brain samples, the predictor corresponding to
individual explained more variability in RIN than did brain region
(28.0% versus 18.7%). Finally, using the LIBD DLPFC
dataset, we found no evidence that individual genotype predicted
individual RIN; the smallest FDR for a genotype effect on RIN
was 0.64 (SI Appendix). Indeed, total RNA quality may be more
complex than a single number per sample, as the resulting qSVs
in both the LIBD and CMC datasets associate with a variety of
technical factors (SI Appendix, Figs. S7 and S8) that may each
influence RNA quality in subtle ways. Therefore, although the
RIN value may be a rough guide in determining whether or not
to study a particular sample, we would argue that it is not a
particularly accurate or useful gauge of RNA quality after data
have already been generated.
The applicability of specific tissue-derived degradation-
susceptible regions to other tissues or cell types is an important
consideration in differential expression analysis, particularly
when measured RNA quality associates with the outcome of
interest. One practical recommendation for other brain regions
would be to use the degradation data from DLPFC and PBMCs
to create DEqual plots, quantify the potential RNA degradation
bias from its correlation, and then evaluate how the DEqual plot
changes when performing qSVA using the DLPFC and PBMC
degradation regions. If this qSVA correction fails to remove
strong correlation between differential expression effects of
degradation and outcome, researchers probably need to generate
their own reference degradation datasets and apply the qSVA
algorithm.
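The DEqual check recommended above reduces to a single number: the correlation, across features, between the degradation effect and the outcome effect. The following is a minimal sketch of that comparison. The function names and toy statistics are hypothetical, and the inputs could equally be log2 fold changes or t-statistics.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def dequal_correlation(degradation_stats, outcome_stats):
    """Correlate two {feature: statistic} mappings over their shared features."""
    shared = sorted(set(degradation_stats) & set(outcome_stats))
    return pearson([degradation_stats[f] for f in shared],
                   [outcome_stats[f] for f in shared])

# Toy statistics (assumption): the outcome effects mirror degradation before
# correction and are unrelated to it afterwards.
degradation = {"g1": -2.0, "g2": -1.0, "g3": 0.5, "g4": 1.5}
before_qsva = {"g1": -1.0, "g2": -0.5, "g3": 0.25, "g4": 0.75}
after_qsva = {"g1": 0.1, "g2": -0.1, "g3": -0.1, "g4": 0.1}

r_before = dequal_correlation(degradation, before_qsva)
r_after = dequal_correlation(degradation, after_qsva)
```

A correlation that stays far from zero after qSVA correction would be the signal, in the sense described above, that the available degradation regions do not transfer to the tissue under study.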
7134 | www.pnas.org/cgi/doi/10.1073/pnas.1617384114 Jaffe et al.
Differences in latent RNA quality and the underlying cellular
composition of homogenate tissue sources (20–22) are two of the
strongest confounding factors in postmortem human studies. The
qSVA approach here that uses quality-associated features is
analogous to our previously proposed approach that uses
cell-type-associated features to untangle the confounding effects of
cellular composition (sparse PCA) (23). The current study does
suggest a potential interaction between RNA quality and cellular
composition (SI Appendix, Fig. S2 and Table S4), which may make it
more difficult to statistically isolate the two strong confounding
effects, particularly in PBMCs, or when shifting cellular composition
is involved in a disease process. Nevertheless, our degradation
correction framework can improve the interpretation of
differential expression analysis of transcriptomic data.
Materials and Methods
Tissue Degradation Experiment. DLPFC gray matter from five donors was
dissected, pulverized, and mixed on dry ice. Approximately 100 mg of
pulverized tissue was aliquoted four times for each subject on dry ice; the
aliquots were then left at room temperature, except for one aliquot per
subject that was kept on dry ice for the time 0 data point. RNA was extracted
and sequenced using polyA+ and RiboZero protocols. Data were processed with
TopHat (v2.0.13) using the reference transcriptome to initially guide
alignment, based on known transcripts in the Illumina iGenomes version of the
University of California at Santa Cruz knownGene GTF file (using the "-G"
argument in the software) (24). Gene counts were generated using the
featureCounts tool (25) based on the more recent Ensembl v75, and counts
were converted to RPKM values using the total number of aligned reads
across the autosomal and sex chromosomes. All public datasets were processed
with a similar protocol. All tissues were obtained with informed
consent from the legal next of kin (protocol #1224 approved by the
Institutional Review Board of the Department of Health and Mental Hygiene
of the State of Maryland).
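As a worked illustration of the count-to-RPKM conversion described above, the sketch below applies the standard RPKM definition. It assumes that gene length is measured in base pairs and that the library size is the number of reads aligned to the autosomal and sex chromosomes. The function names, the example numbers, and the default offset of 1 in the log2 transform are illustrative assumptions.

```python
import math

def rpkm(count, gene_length_bp, total_aligned_reads):
    """Reads per kilobase of gene per million aligned reads."""
    # count / (length in kb) / (library size in millions) = count * 1e9 / (bp * reads)
    return count * 1e9 / (gene_length_bp * total_aligned_reads)

def log2_rpkm(value, offset=1.0):
    """log2 transform with an offset so that zero expression maps to zero."""
    return math.log2(value + offset)

# Hypothetical gene: 500 reads, 2,000 bp long, in a library of 10 million
# reads aligned to the autosomal and sex chromosomes.
value = rpkm(500, 2000, 10_000_000)
transformed = log2_rpkm(value)
```

Here 500 reads over 2 kb in a library of 10 million aligned reads gives 500 / 2 / 10 = 25 RPKM.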
Degradation Data Analysis. For the samples in each library and tissue type, we
separately modeled expression as a function of degradation time, adjusting
for the donor and using the limma R Bioconductor package (26). Gene set
enrichment analyses were performed on the ordered degradation T-statistics
from the polyA+ and RiboZero library types among those genes with
Entrez Gene IDs using the gseGO and gseKEGG functions in the clusterProfiler
R package (27). Cell-type-specific analyses were conducted with
CIBERSORT with the default LM22 reference panel and 500 permutations
(14) for the PBMC degradation datasets, and DLPFC enrichment was based
on 285 cells from adult donors that were previously classified as astrocytes,
endothelial cells, microglia, neurons, oligodendrocytes, and oligodendrocyte
progenitor cells (15).
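The degradation model described above (expression as a function of degradation time, adjusting for donor) can be sketched with ordinary least squares. This is a simplified stand-in for the limma fit, which additionally moderates the variances. The time points, donor labels, and simulated expression values are hypothetical.

```python
import numpy as np

def degradation_slopes(expr, minutes, donor):
    """Per-gene OLS slope of expression on degradation time, adjusting for donor."""
    donors = sorted(set(donor))
    # Indicator columns for every donor except the first, which is the reference.
    donor_cols = [[1.0 if d == ref else 0.0 for d in donor] for ref in donors[1:]]
    design = np.column_stack([np.ones(len(minutes)), minutes] + donor_cols)
    coef, *_ = np.linalg.lstsq(design, np.asarray(expr, dtype=float).T, rcond=None)
    return coef[1]  # row 1 holds the time coefficient for every gene

# Hypothetical design: five donors, each measured at four time points.
donor = [d for d in ["d1", "d2", "d3", "d4", "d5"] for _ in range(4)]
minutes = [0.0, 15.0, 30.0, 60.0] * 5
offsets = {"d1": 5.0, "d2": 6.0, "d3": 7.0, "d4": 8.0, "d5": 9.0}

# Noise-free toy genes: donor-specific baseline plus a linear time trend.
true_slopes = [-0.02, 0.0, 0.01]
expr = [[offsets[d] + s * t for d, t in zip(donor, minutes)] for s in true_slopes]

slopes = degradation_slopes(expr, minutes, donor)
```

Because the toy data are noise free, the fitted slopes recover the simulated ones; with real data the per-gene slope and its t-statistic would serve as the degradation effect.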
LIBD Discovery Dataset Modeling. We used the LIBD DLPFC polyA+ RNA-seq
on 155 schizophrenia cases and 196 controls (criteria: ages between 17 and
80, gene assignment rate > 0.5, mapping rate > 0.7, RIN > 6, not outlying on
second ancestry PC, only self-reported Caucasians and African Americans)
described in Jaffe et al. (28). We fit a series of statistical models at the gene
level, modeling log2-transformed gene-level RPKM (SI Appendix). We used
the lmFit and ebayes functions in the limma Bioconductor package (26) to
fit all of the statistical models to estimate log2 fold changes, moderated
T-statistics, and corresponding P values.
CMC Replications Dataset Analysis. We performed differential expression
analysis on 159 patients and 172 controls (selecting on total gene assignment
rate > 0.3, alignment rate > 0.8, RIN > 6, ages between 18 and 80,
nonoutlying on genetic ancestry PCs 3 and 5, and keeping only reported
Caucasians and African Americans). We similarly fit four of the statistical models
at the gene level, modeling log2-transformed gene-level RPKM (with an
offset of 1).
GTEx Analysis. We retained all GTEx samples that had RINs > 5 and belonged
to subtissues (SMTSD metadata column) with at least 40 samples, resulting in
data on 9,502 samples across 49 detailed tissues. We retained the 36,552 genes
that had mean RPKM > 0.2 in at least one subtissue. We modeled differential
expression of each of 48 subtissues compared with Brain-Frontal Cortex (BA9)
and measured the Pearson correlation present in the resulting DEqual plots,
e.g., between the subtissue-specific log2 fold changes and the DLPFC polyA+
degradation data log2 fold changes for degradation time.
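The two retention rules above (subtissues with at least 40 samples, and genes with mean RPKM > 0.2 in at least one retained subtissue) can be sketched as follows. The data structures and function names are hypothetical, and the toy example lowers the sample threshold to 2 only to keep it short.

```python
from collections import Counter, defaultdict

def keep_subtissues(sample_tissue, min_samples=40):
    """Subtissues represented by at least min_samples samples."""
    counts = Counter(sample_tissue.values())
    return {tissue for tissue, n in counts.items() if n >= min_samples}

def keep_genes(rpkm, sample_tissue, tissues, min_mean=0.2):
    """Genes whose mean RPKM exceeds min_mean in at least one retained subtissue."""
    kept = []
    for gene, values in rpkm.items():
        by_tissue = defaultdict(list)
        for sample, value in values.items():
            tissue = sample_tissue[sample]
            if tissue in tissues:  # samples from dropped subtissues are ignored
                by_tissue[tissue].append(value)
        if any(sum(v) / len(v) > min_mean for v in by_tissue.values()):
            kept.append(gene)
    return kept

# Toy metadata (assumption): sample -> subtissue label.
sample_tissue = {
    "s1": "Brain - Frontal Cortex (BA9)", "s2": "Brain - Frontal Cortex (BA9)",
    "s3": "Whole Blood", "s4": "Whole Blood", "s5": "Rare Tissue",
}
# Toy expression (assumption): gene -> {sample -> RPKM}.
rpkm = {
    "geneA": {"s1": 0.5, "s2": 0.3, "s3": 0.0, "s4": 0.0, "s5": 0.0},
    "geneB": {"s1": 0.1, "s2": 0.1, "s3": 0.1, "s4": 0.1, "s5": 9.0},
    "geneC": {"s1": 0.0, "s2": 0.0, "s3": 0.3, "s4": 0.2, "s5": 0.0},
}

tissues = keep_subtissues(sample_tissue, min_samples=2)
genes = keep_genes(rpkm, sample_tissue, tissues)
```

In this toy example geneB is dropped because its only high value comes from a subtissue that did not meet the sample threshold.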
Microarray Data Processing and Analysis of Published Studies. We extrapolated
the expression levels of the probes for each microarray platform in our
degradation RNA-seq dataset by aligning microarray probes to the genome
and quantifying resulting coverage in the RNA-seq datasets.
See additional details in SI Appendix, Full Methods and Materials.
ACKNOWLEDGMENTS. A.E.J. was partially supported by NIH Grant
R21MH109956 and J.T.L. was supported by NIH Grant R01GM105705. This
work was also supported by the Lieber Institute for Brain Development.
Corresponding acknowledgment statements for GTEx and CMC datasets
are available in the SI Appendix.
1. Irizarry RA, et al. (2003) Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4:249–264.
2. Risso D, Ngai J, Speed TP, Dudoit S (2014) Normalization of RNA-seq data using factor analysis of control genes or samples. Nat Biotechnol 32:896–902.
3. Leek JT, et al. (2010) Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet 11:733–739.
4. Li S, et al. (2014) Detecting and correcting systematic variation in large-scale RNA sequencing data. Nat Biotechnol 32:888–895.
5. SEQC/MAQC-III Consortium (2014) A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium. Nat Biotechnol 32:903–914.
6. 't Hoen PA, et al. (2013) Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories. Nat Biotechnol 31:1015–1022.
7. Adiconis X, et al. (2013) Comparative analysis of RNA sequencing methods for degraded or low-input samples. Nat Methods 10:623–629.
8. Schroeder A, et al. (2006) The RIN: An RNA integrity number for assigning integrity values to RNA measurements. BMC Mol Biol 7:3.
9. Consortium ER (2014) REMC standards and guidelines for RNA-sequencing. Available at www.roadmapepigenomics.org/files/protocols/data/rna-analysis/REMC_RNA-seqStandards_final.pdf. Accessed June 7, 2017.
10. Li S, et al. (2014) Multi-platform assessment of transcriptome profiling using RNA-seq in the ABRF next-generation sequencing study. Nat Biotechnol 32:915–925.
11. Wang L, et al. (2016) Measure transcript integrity using RNA-seq data. BMC Bioinformatics 17:58.
12. Leek JT, Storey JD (2007) Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet 3:1724–1735.
13. Gallego Romero I, Pai AA, Tung J, Gilad Y (2014) RNA-seq: Impact of RNA degradation on transcript quantification. BMC Biol 12:42.
14. Fromer M, et al. (2016) Gene expression elucidates functional impact of polygenic risk for schizophrenia. Nat Neurosci 19:1442–1453.
15. Darmanis S, et al. (2015) A survey of human brain transcriptome diversity at the single cell level. Proc Natl Acad Sci USA 112:7285–7290.
16. Leek JT, Johnson WE, Parker HS, Jaffe AE, Storey JD (2012) The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics 28:882–883.
17. Consortium GT; GTEx Consortium (2015) Human genomics. The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans. Science 348:648–660.
18. Voineagu I, et al. (2011) Transcriptomic analysis of autistic brain reveals convergent molecular pathology. Nature 474:380–384.
19. Zhang B, et al. (2013) Integrated systems approach identifies genetic nodes and networks in late-onset Alzheimer's disease. Cell 153:707–720.
20. Jaffe AE (2016) Postmortem human brain genomics in neuropsychiatric disorders: How far can we go? Curr Opin Neurobiol 36:107–111.
21. Jaffe AE, et al. (2016) Mapping DNA methylation across development, genotype and schizophrenia in the human frontal cortex. Nat Neurosci 19:40–47.
22. Jaffe AE, et al. (2015) Developmental regulation of human cortex transcription and its clinical relevance at single base resolution. Nat Neurosci 18:154–161.
23. Jaffe AE, Irizarry RA (2014) Accounting for cellular heterogeneity is critical in epigenome-wide association studies. Genome Biol 15:R31.
24. Kim D, et al. (2013) TopHat2: Accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol 14:R36.
25. Liao Y, Smyth GK, Shi W (2014) featureCounts: An efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 30:923–930.
26. Smyth GK (2004) Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 3:Article3.
27. Yu G, Wang LG, Han Y, He QY (2012) clusterProfiler: An R package for comparing biological themes among gene clusters. Omics 16:284–287.
28. Jaffe AE, et al. (April 5, 2017) Developmental and genetic regulation of the human cortex transcriptome in schizophrenia, doi.org/10.1101/124321.
... D, Table S4). Using bulk RNA-seq data from a brain tissue degradation experiment on the dorsolateral prefrontal cortex (Jaffe et al. 2017), processed with the recount3 pipeline (Wilks et al. 2021), we contrasted at the transcript level the t-statistics for degradation against t-statistics for LIV vs PM tissue differences that we re-calculated using limma (Ritchie et al. 2015), resulting in a correlation of 0.39 (Fig 1. E, Table S5). ...
... (Even comparisons of separate postmortem samples can be challenging.) Liharska's analyses have not adequately controlled for PM interval in PM tissue, let alone RNA degradation, which is not adequately represented only by PMI (Jaffe et al. 2017). ...
... Exploratory results showed that there were significant differences between the LIV and PM tissue on these quality metrics (Fig. 1 B), with surprisingly lower quality on the LIV tissue compared to the PM tissue. We also showed how we can mostly reproduce their statistical model ( Fig. 1 D), prior to highlighting how the LIV vs PM differences are confounded with RNA degradation signal observed independently (Jaffe et al. 2017, Fig. 1 E). Using the latest quality surrogate variable method, we showed how the differences between LIV and PM tissues were substantially attenuated, resulting in a reduction of 76% in the number of differentially expressed genes (15,553 vs 3,732). ...
Preprint
Full-text available
Molecular mechanisms of neuropsychiatric disorders are challenging to study in human brain. For decades, the preferred model has been to study postmortem human brain samples despite the limitations they entail. A recent study generated RNA sequencing data from biopsies of prefrontal cortex from living patients with Parkinson's Disease and compared gene expression to postmortem tissue samples, from which they found vast differences between the two. This led the authors to question the utility of postmortem human brain studies. Through re-analysis of the same data, we unexpectedly found that the living brain tissue samples were of much lower quality than the postmortem samples across multiple standard metrics. We also performed simulations that illustrate the effects of ignoring RNA degradation in differential gene expression analyses, showing the effects can be substantial and of similar magnitude to what the authors find. For these reasons, we believe the authors' conclusions are unjustified. To the contrary, while opportunities to study gene expression in the living brain are welcome, evidence that this eclipses the value of postmortem analyses is not apparent.
... Fifteen DG-specific SCZ-risk eQTLs were identified in the cell population-enriched samples not detected in bulk hippocampal tissue. Co-expression networks derived from DG-LCM were more faithful to neuronal gene ontologies and presented a greater SCZ risk gene enrichment than bulk tissue 6 , although some of this granular information could be rescued from bulk tissue using statistical models of RNA degradation 25 . The granule cell layer of the DG is nonetheless quite compact and easy to capture morphologically, and an LCM analysis targeting cell populations across a distributed circuit classically linked to SCZ has yet to be tested. ...
Preprint
Full-text available
RNA-sequencing studies of brain tissue homogenates have shed light on the molecular processes underlying schizophrenia (SCZ) but lack biological granularity at the cell type level. Laser capture microdissection (LCM) can isolate selective cell populations with intact cell bodies to allow complementary gene expression analyses of mRNA and protein. We used LCM to collect excitatory neuron-enriched samples from CA1 and subiculum (SUB) of the hippocampus and layer III of the dorsolateral prefrontal cortex (DLPFC), from which we generated gene, transcript, and peptide level data. In a machine learning framework, LCM-derived expression achieved superior regional identity predictions as compared to bulk tissue, with further improvements when using isoform-level transcript and protein quantifications. LCM-derived co-expression also had increased co-expression strength of neuronal gene sets compared to tissue homogenates. SCZ risk co-expression pathways were identified and replicated across transcript and protein networks and were consistently enriched for glutamate receptor complex and post-synaptic functions. Finally, through inter-regional co-expression analyses, we show that CA1 to SUB transcriptomic connectivity may be altered in SCZ.
... In short, this means that samples with high variance relative to the mean variance of all the samples will have less weight when detecting differentially expressed (DE) genes. This is a good replacement for trying to model the quality of the samples with RIN scores, which is generally a poor estimator [40]. The linear "mixed" model is then fit to the adjusted data. ...
Article
Background: Sex differences in the brain may play an important role in sex-differential prevalence of neuropsychiatric conditions. Methods: In order to understand the transcriptional basis of sex differences, we analyzed multiple, large-scale, human postmortem brain RNA-Seq datasets using both within-region and pan-regional frameworks. Results: We find evidence of sex-biased transcription in many autosomal genes, some of which provide evidence for pathways and cell population differences between chromosomally male and female individuals. These analyses also highlight regional differences in the extent of sex-differential gene expression. We observe an increase in specific neuronal transcripts in male brains and an increase in immune and glial function-related transcripts in female brains. Integration with single-nucleus data suggests this corresponds to sex differences in cellular states rather than cell abundance. Integration with case-control gene expression studies suggests a female molecular predisposition towards Alzheimer's disease, a female-biased disease. Autism, a male-biased diagnosis, does not exhibit a male predisposition pattern in our analysis. Conclusion: Overall, these analyses highlight mechanisms by which sex differences may interact with sex-biased conditions in the brain. Furthermore, we provide region-specific analyses of sex differences in brain gene expression to enable additional studies at the interface of gene expression and diagnostic differences.
... Conveniently, this method is freely available for researchers (R package, BRETIGEA), which will facilitate reproducibility analyses of our study. Other important considerations are the dynamic nature of gene expression as disease progresses (Iturria-Medina et al., 2020;Hammond et al., 2019), postmortem RNA degradation of the used templates (Jaffe et al., 2017), and the subsequent limited ability of bulk RNA sequencing to reflect cell-to-cell variability, which is relevant for understanding cell heterogeneity and the roles of specific cell populations in disease (Yu et al., 2021). Lastly, a promising future direction would be to validate our findings with single-cell spatial analyses. ...
Article
Full-text available
For over a century, brain research narrative has mainly centered on neuron cells. Accordingly, most neurodegenerative studies focus on neuronal dysfunction and their selective vulnerability, while we lack comprehensive analyses of other major cell types’ contribution. By unifying spatial gene expression, structural MRI, and cell deconvolution, here we describe how the human brain distribution of canonical cell types extensively predicts tissue damage in 13 neurodegenerative conditions, including early- and late-onset Alzheimer’s disease, Parkinson’s disease, dementia with Lewy bodies, amyotrophic lateral sclerosis, mutations in presenilin-1, and 3 clinical variants of frontotemporal lobar degeneration (behavioral variant, semantic and non-fluent primary progressive aphasia) along with associated three-repeat and four-repeat tauopathies and TDP43 proteinopathies types A and C. We reconstructed comprehensive whole-brain reference maps of cellular abundance for six major cell types and identified characteristic axes of spatial overlapping with atrophy. Our results support the strong mediating role of non-neuronal cells, primarily microglia and astrocytes, in spatial vulnerability to tissue loss in neurodegeneration, with distinct and shared across-disorder pathomechanisms. These observations provide critical insights into the multicellular pathophysiology underlying spatiotemporal advance in neurodegeneration. Notably, they also emphasize the need to exceed the current neuro-centric view of brain diseases, supporting the imperative for cell-specific therapeutic targets in neurodegeneration.
... It would be important to examine other brain regions in the future. Furthermore, a hurdle in post-mortem brain analyses lies in the fact that even with the inclusion of explicit, observed covariates, there may still be an incomplete accounting for the effects of RNA degradation or other latent variables [108]. In view of this, the results of this study should be interpreted carefully, as more research is necessary before clinical implementation. ...
Article
Full-text available
Despite recent progress, the challenges in drug discovery for schizophrenia persist. However, computational drug repurposing has gained popularity as it leverages the wealth of expanding biomedical databases. Network analyses provide a comprehensive understanding of transcription factor (TF) regulatory effects through gene regulatory networks, which capture the interactions between TFs and target genes by integrating various lines of evidence. Using the PANDA algorithm, we examined the topological variances in TF-gene regulatory networks between individuals with schizophrenia and healthy controls. This algorithm incorporates binding motifs, protein interactions, and gene co-expression data. To identify these differences, we subtracted the edge weights of the healthy control network from those of the schizophrenia network. The resulting differential network was then analysed using the CLUEreg tool in the GRAND database. This tool employs differential network signatures to identify drugs that potentially target the gene signature associated with the disease. Our analysis utilised a large RNA-seq dataset comprising 532 post-mortem brain samples from the CommonMind project. We constructed co-expression gene regulatory networks for both schizophrenia cases and healthy control subjects, incorporating 15,831 genes and 413 overlapping TFs. Through drug repurposing, we identified 18 promising candidates for repurposing as potential treatments for schizophrenia. The analysis of TF-gene regulatory networks revealed that the TFs in schizophrenia predominantly regulate pathways associated with energy metabolism, immune response, cell adhesion, and thyroid hormone signalling. These pathways represent significant targets for therapeutic intervention. The identified drug repurposing candidates likely act through TF-targeted pathways. 
These promising candidates, particularly those with preclinical evidence such as rimonabant and kaempferol, warrant further investigation into their potential mechanisms of action and efficacy in alleviating the symptoms of schizophrenia.
... This means samples with high variance relative to the mean variance of all the samples will have less weight when detecting differentially expressed (DE) genes. This is a good replacement for trying to model the quality of the samples with RIN scores, which is generally a poor estimator (Jaffe et al., 2017). The linear "mixed" model is then fit to the adjusted data using the lmfit() in limma. ...
Preprint
Full-text available
Sex differences in the brain may play an important role in sex-differential prevalence of neuropsychiatric conditions. In order to understand the transcriptional basis of sex differences, we analyzed multiple, large-scale, human postmortem brain RNA-seq datasets using both within-region and pan-regional frameworks. We find evidence of sex-biased transcription in many autosomal genes, some of which provide evidence for pathways and cell population differences between chromosomally male and female individuals. These analyses also highlight regional differences in the extent of sex-differential gene expression. We observe an increase in specific neuronal transcripts in male brains and an increase in immune and glial function-related transcripts in female brains. Integration with single-cell data suggests this corresponds to sex differences in cellular states rather than cell abundance. Integration with case-control gene expression studies suggests a female molecular predisposition towards Alzheimer’s disease, a female-biased disease. Autism, a male-biased diagnosis, does not exhibit a male predisposition pattern in our analysis. Finally, we provide region specific analyses of sex differences in brain gene expression to enable additional studies at the interface of gene expression and diagnostic differences. Graphical Abstract
Article
Full-text available
Ancestral differences in genomic variation affect the regulation of gene expression; however, most gene expression studies have been limited to European ancestry samples or adjusted to identify ancestry-independent associations. Here, we instead examined the impact of genetic ancestry on gene expression and DNA methylation in the postmortem brain tissue of admixed Black American neurotypical individuals to identify ancestry-dependent and ancestry-independent contributions. Ancestry-associated differentially expressed genes (DEGs), transcripts and gene networks, while notably not implicating neurons, are enriched for genes related to the immune response and vascular tissue and explain up to 26% of heritability for ischemic stroke, 27% of heritability for Parkinson disease and 30% of heritability for Alzheimer’s disease. Ancestry-associated DEGs also show general enrichment for the heritability of diverse immune-related traits but depletion for psychiatric-related traits. We also compared Black and non-Hispanic white Americans, confirming most ancestry-associated DEGs. Our results delineate the extent to which genetic ancestry affects differences in gene expression in the human brain and the implications for brain illness risk.
Preprint
Full-text available
For over a century, brain research narrative has mainly centered on neuron cells. Accordingly, most whole-brain neurodegenerative studies focus on neuronal dysfunction and their selective vulnerability, while we lack comprehensive analyses of other major cell-types’ contribution. By unifying spatial gene expression, structural MRI, and cell deconvolution, here we describe how the human brain distribution of canonical cell-types extensively predicts tissue damage in eleven neurodegenerative disorders, including early- and late-onset Alzheimer’s disease, Parkinson’s disease, dementia with Lewy bodies, amyotrophic lateral sclerosis, frontotemporal dementia, and tauopathies. We reconstructed comprehensive whole-brain reference maps of cellular abundance for six major cell-types and identified characteristic axes of spatial overlapping with atrophy. Our results support the strong mediating role of non-neuronal cells, primarily microglia and astrocytes, on spatial vulnerability to tissue loss in neurodegeneration, with distinct and shared across-disorders pathomechanisms. These observations provide critical insights into the multicellular pathophysiology underlying spatiotemporal advance in neurodegeneration. Notably, they also emphasize the need to exceed the current neuro-centric view of brain diseases, supporting the imperative for cell-specific therapeutic targets in neurodegeneration. Major cell-types distinctively associate with spatial vulnerability to tissue loss in eleven neurodegenerative disorders.
Preprint
For over a century, brain research narrative has mainly centered on neuron cells. Accordingly, most whole-brain neurodegenerative studies focus on neuronal dysfunction and their selective vulnerability, while we lack comprehensive analyses of other major cell-types’ contribution. By unifying spatial gene expression, structural MRI, and cell deconvolution, here we describe how the human brain distribution of canonical cell-types extensively predicts tissue damage in eleven neurodegenerative disorders, including early- and late-onset Alzheimer’s disease, Parkinson’s disease, dementia with Lewy bodies, amyotrophic lateral sclerosis, frontotemporal dementia, and tauopathies. We reconstructed comprehensive whole-brain reference maps of cellular abundance for six major cell-types and identified characteristic axes of spatial overlapping with atrophy. Our results support the strong mediating role of non-neuronal cells, primarily microglia and astrocytes, on spatial vulnerability to tissue loss in neurodegeneration, with distinct and shared across-disorders pathomechanisms. These observations provide critical insights into the multicellular pathophysiology underlying spatiotemporal advance in neurodegeneration. Notably, they also emphasize the need to exceed the current neuro-centric view of brain diseases, supporting the imperative for cell-specific therapeutic targets in neurodegeneration. Major cell-types distinctively associate with spatial vulnerability to tissue loss in eleven neurodegenerative disorders.
Article
Parkinson’s disease is an age-related neurodegenerative disorder with a higher incidence in males than females. The causes for this sex difference are unknown. Genome-wide association studies (GWAS) have identified 90 Parkinson’s disease risk loci, but the genetic studies have not found sex-specific differences in allele frequency on autosomal chromosomes or sex chromosomes. Genetic variants, however, could exert sex-specific effects on gene function and regulation of gene expression. To identify genetic loci that might have sex-specific effects, we studied pleiotropy between Parkinson’s disease and sex-specific traits. Summary statistics from GWASs were acquired from large-scale consortia for Parkinson’s disease (n cases=13 708; n controls=95 282), age at menarche (n=368 888 women) and age at menopause (n=69 360 women). We applied the conditional/conjunctional false discovery rate (FDR) method to identify shared loci between Parkinson’s disease and these sex-specific traits. Next, we investigated sex-specific gene expression differences in the superior frontal cortex of both neuropathologically healthy individuals and Parkinson’s disease patients (n cases=61; n controls=23). To provide biological insights to the genetic pleiotropy, we performed sex-specific expression quantitative trait locus (eQTL) analysis and sex-specific age-related differential expression analysis for genes mapped to Parkinson’s disease risk loci. Through conditional/conjunctional FDR analysis we found 11 loci shared between Parkinson’s disease and the sex-specific traits age at menarche and age at menopause. Gene-set and pathway analysis of the genes mapped to these loci highlighted the importance of the immune response in determining an increased disease incidence in the male population. Moreover, we highlighted a total of nine genes whose expression or age-related expression in the human brain is influenced by genetic variants in a sex-specific manner. 
With these analyses we demonstrated that the lack of clear sex-specific differences in allele frequencies for Parkinson’s disease loci does not exclude a genetic contribution to differences in disease incidence. Moreover, further studies are needed to elucidate the role that the candidate genes identified here could have in determining a higher incidence of Parkinson’s disease in the male population.
Article
Genome-wide association studies have identified 108 schizophrenia risk loci, but biological mechanisms for individual loci are largely unknown. Using developmental, genetic and illness-based RNA sequencing expression analysis in human brain, we characterized the human brain transcriptome around these loci and found enrichment for developmentally regulated genes with novel examples of shifting isoform usage across pre- and postnatal life. We found widespread expression quantitative trait loci (eQTLs), including many with transcript specificity and previously unannotated sequence that were independently replicated. We leveraged this general eQTL database to show that 48.1% of risk variants for schizophrenia associate with nearby expression. We lastly found 237 genes significantly differentially expressed between patients and controls, which replicated in an independent dataset, implicated synaptic processes, and were strongly regulated in early development. These findings together offer genetics- and diagnosis-related targets for better modeling of schizophrenia risk. This resource is publicly available at http://eqtl.brainseq.org/phase1 .
Article
Over 100 genetic loci harbor schizophrenia-associated variants, yet how these variants confer liability is uncertain. The CommonMind Consortium sequenced RNA from dorsolateral prefrontal cortex of people with schizophrenia (N = 258) and control subjects (N = 279), creating a resource of gene expression and its genetic regulation. Using this resource, ∼20% of schizophrenia loci have variants that could contribute to altered gene expression and liability. In five loci, only a single gene was involved: FURIN, TSNARE1, CNTN4, CLCN3 or SNAP91. Altering expression of FURIN, TSNARE1 or CNTN4 changed neurodevelopment in zebrafish; knockdown of FURIN in human neural progenitor cells yielded abnormal migration. Of 693 genes showing significant case-versus-control differential expression, their fold changes were ≤ 1.33, and an independent cohort yielded similar results. Gene co-expression implicates a network relevant for schizophrenia. Our findings show that schizophrenia is polygenic and highlight the utility of this resource for mechanistic interpretations of genetic liability for brain diseases.
Article
Stored biological samples with pathology information and medical records are invaluable resources for translational medical research. However, RNAs extracted from the archived clinical tissues are often substantially degraded. RNA degradation distorts the RNA-seq read coverage in a gene-specific manner, and has profound influences on whole-genome gene expression profiling. We developed the transcript integrity number (TIN) to measure RNA degradation. When applied to 3 independent RNA-seq datasets, we demonstrated TIN is a reliable and sensitive measure of the RNA degradation at both transcript and sample level. Through comparing 10 prostate cancer clinical samples with lower RNA integrity to 10 samples with higher RNA quality, we demonstrated that calibrating gene expression counts with TIN scores could effectively neutralize RNA degradation effects by reducing false positives and recovering biologically meaningful pathways. When further evaluating the performance of TIN correction using spike-in transcripts in RNA-seq data generated from the Sequencing Quality Control consortium, we found TIN adjustment had better control of false positives and false negatives (sensitivity = 0.89, specificity = 0.91, accuracy = 0.90), as compared to gene expression analysis results without TIN correction (sensitivity = 0.98, specificity = 0.50, accuracy = 0.86). TIN is a reliable measurement of RNA integrity and a valuable approach used to neutralize in vitro RNA degradation effect and improve differential gene expression analysis.
Article
Significance The brain comprises an immense number of cells and cellular connections. We describe the first, to our knowledge, single cell whole transcriptome analysis of human adult cortical samples. We have established an experimental and analytical framework with which the complexity of the human brain can be dissected on the single cell level. Using this approach, we were able to identify all major cell types of the brain and characterize subtypes of neuronal cells. We observed changes in neurons from early developmental to late differentiated stages in the adult. We found a subset of adult neurons which express major histocompatibility complex class I genes and thus are not immune privileged.
Article
Understanding the functional consequences of genetic variation, and how it affects complex human disease and quantitative traits, remains a critical challenge for biomedicine. We present an analysis of RNA sequencing data from 1641 samples across 43 tissues from 175 individuals, generated as part of the pilot phase of the Genotype-Tissue Expression (GTEx) project. We describe the landscape of gene expression across tissues, catalog thousands of tissue-specific and shared regulatory expression quantitative trait loci (eQTL) variants, describe complex network relationships, and identify signals from genome-wide association studies explained by eQTLs. These findings provide a systematic understanding of the cellular and biological consequences of human genetic variation and of the heterogeneity of such effects among a diverse set of human tissues.
Article
It has unambiguously been shown that genetic, environmental, demographic, and technical factors may have substantial effects on gene expression levels. In addition to the measured variable(s) of interest, there will tend to be sources of signal due to factors that are unknown, unmeasured, or too complicated to capture through simple models. We show that failing to incorporate these sources of heterogeneity into an analysis can have widespread and detrimental effects on the study. Not only can this reduce power or induce unwanted dependence across genes, but it can also introduce sources of spurious signal to many genes. This phenomenon is true even for well-designed, randomized studies. We introduce “surrogate variable analysis” (SVA) to overcome the problems caused by heterogeneity in expression studies. SVA can be applied in conjunction with standard analysis techniques to accurately capture the relationship between expression and any modeled variables of interest. We apply SVA to disease class, time course, and genetics of gene expression studies. We show that SVA increases the biological accuracy and reproducibility of analyses in genome-wide expression studies.
Article
Large-scale collection of postmortem human brain tissue and subsequent genomic data generation has become a useful approach for better identifying etiological factors contributing to neuropsychiatric disorders. In particular, studying genetic risk variants in non-psychiatric controls can identify biological mechanisms of risk free from confounding factors related to epiphenomena of illness. While the field has begun moving towards cell type-specific analyses, homogenate brain tissue with accompanying cellular profiles can still identify useful hypotheses for more focused experiments, particularly when the dysregulated cell types are unknown. Technological advances, larger sample sizes, and focused research questions can continue to further leverage postmortem human brain research to better identify and understand the molecular etiology of neuropsychiatric disorders.
Article
DNA methylation (DNAm) is important in brain development and is potentially important in schizophrenia. We characterized DNAm in prefrontal cortex from 335 non-psychiatric controls across the lifespan and 191 patients with schizophrenia and identified widespread changes in the transition from prenatal to postnatal life. These DNAm changes manifest in the transcriptome, correlate strongly with a shifting cellular landscape and overlap regions of genetic risk for schizophrenia. A quarter of published genome-wide association studies (GWAS)-suggestive loci (4,208 of 15,930, P < 10^-100) manifest as significant methylation quantitative trait loci (meQTLs), including 59.6% of GWAS-positive schizophrenia loci. We identified 2,104 CpGs that differ between schizophrenia patients and controls that were enriched for genes related to development and neurodifferentiation. The schizophrenia-associated CpGs strongly correlate with changes related to the prenatal-postnatal transition and show slight enrichment for GWAS risk loci while not corresponding to CpGs differentiating adolescence from later adult life. These data implicate an epigenetic component to the developmental origins of this disorder.