Statistical Applications in Genetics
and Molecular Biology
Volume 7, Issue 1 2008 Article 22
A Comparison of Normalization Techniques
for MicroRNA Microarray Data
Youlan Rao∗, Yoonkyung Lee†, David Jarjoura‡,
Amy S. Ruppert∗∗, Chang-gong Liu††,
Jason C. Hsu‡‡, John P. Hagan§
∗The Ohio State University, rao@stat.ohio-state.edu
†The Ohio State University, yklee@stat.ohio-state.edu
‡The Ohio State University, david.jarjoura@osumc.edu
∗∗The Ohio State University, amy.ruppert@osumc.edu
††The Ohio State University, chang-gong.liu@osumc.edu
‡‡The Ohio State University, jch@stat.ohio-state.edu
§The Ohio State University, microrna@gmail.com
Copyright © 2008 The Berkeley Electronic Press. All rights reserved.
A Comparison of Normalization Techniques
for MicroRNA Microarray Data∗
Youlan Rao, Yoonkyung Lee, David Jarjoura, Amy S. Ruppert, Chang-gong Liu,
Jason C. Hsu, and John P. Hagan
Abstract
Normalization of expression levels applied to microarray data can help in reducing measure-
ment error. Different methods, including cyclic loess, quantile normalization and median or mean
normalization, have been utilized to normalize microarray data. Although there is considerable lit-
erature regarding normalization techniques for mRNA microarray data, there are no publications
comparing normalization techniques for microRNA (miRNA) microarray data, which are subject
to similar sources of measurement error. In this paper, we compare the performance of cyclic loess,
quantile normalization, median normalization and no normalization for a single-color microRNA
microarray dataset. We show that the quantile normalization method works best in reducing dif-
ferences in miRNA expression values for replicate tissue samples. By showing that the total mean
squared error is lowest across almost all 36 investigated tissue samples, we are assured that the
bias correction provided by quantile normalization is not outweighed by additional error variance
that can arise from a more complex normalization method. Furthermore, we show that quantile
normalization does not achieve these results by compression of scale.
KEYWORDS: microRNA, median normalization, cyclic loess normalization, quantile normal-
ization, robust estimates, smoothing spline, mean squared error
∗This material is based in part upon work supported by the National Science Foundation under
Agreement No. 0635561. Jason C. Hsu’s research is supported by NSF Grant Number DMS-
0505519
1 Introduction
In microarray experiments, variation of expression measurements among arrays
can be attributed to many sources, such as differences in sample RNA prepa-
ration, cDNA labeling, image intensity and microarray hybridization/wash effi-
ciency. Normalization of expression levels applied to microarray data can help
in removing this error. Different methods, including cyclic loess, quantile nor-
malization (Bolstad et al. 2003) and median or mean normalization (Churchill
2002, Churchill 2003, Churchill and Oliver 2001, Kerr and Churchill 2001, and
Wolfinger et al. 2001), have been utilized to normalize microarray data. Briefly,
cyclic loess makes the MA plot of probe intensities from every pair of arrays
scatter about the M = 0 axis, quantile normalization makes the distributions
of expression levels the same across arrays, and median or mean normalization
shifts the individual log-intensities on each array so that the median or mean
log-intensities, respectively, are the same across arrays. These normalization al-
gorithms can be applied either globally to an entire data set or locally to some
physical subset of the data (Quackenbush 2002). Irizarry et al. (2003) applied
the quantile normalization procedure to normalize dilution data and spike-in data
from Affymetrix arrays, and showed how quantile normalization removed bias
as compared to no normalization. Their analysis was unique in that they knew
the true expression levels and could therefore determine the degree of bias re-
duction from quantile normalization.
MicroRNAs (miRNAs) are noncoding RNAs of 19-24 nucleotides that are
negative regulators of gene expression. Recently implicated as important in
development and normal physiology, microRNAs are abnormally expressed in
many human cancers (Volinia et al. 2006, Lu et al. 2005). Moreover, aberrant
microRNA expression has been shown to initiate and promote carcinogenesis
(reviewed in Hagan and Croce 2007). These microRNA expression signatures
may reveal new oncogenetic pathways in human cancers. For systematic in-
vestigation of microRNA expression, oligonucleotide-based microarrays for mi-
croRNAs in human and mouse tissues have been developed recently (Liu et al.
2004) and several commercial platforms are now available. To date, more than a
hundred published reports have used microRNA microarrays to investigate microRNA
expression profiles, and more than two-thirds of them have used single color rather than
two color hybridization systems. Although there is substantial literature regard-
ing normalization techniques for mRNA microarray data, there are no published
reports comparing normalization techniques for microRNA (miRNA) microarray
data, which are subject to similar sources of error variation.
Many statistical reports on mRNA microarrays have focused on Affymetrix
Rao et al.: Comparing Normalization Techniques for MicroRNA Microarray Data
Published by The Berkeley Electronic Press, 2008
mRNA arrays, which have an exceedingly high density of probes that are in situ
synthesized on the array. For example, in one Human Genome U133 Plus2.0
GeneChip, probe sets for each mRNA, including numerous housekeeping genes,
consist of eleven oligonucleotide probes selected to maximize specificity and to
have similar melting temperatures across the entire array. In contrast, microRNA
microarrays are often lower density spotted arrays. Our focus is on single color
microRNA microarrays, which are used more often than dual color arrays.
Results from the Version 3.0 microRNA microarray
used in this study and its earlier versions have appeared in more than 40 publi-
cations. The Version 3.0 microarray contains 3790 probes spotted in duplicate.
The probes are 40 nucleotides in length, consisting of the genomic sequence
that has the mature microRNA sequence and additional flanking bases. With
the exception of six probes designed against Arabidopsis thaliana microRNAs,
the rest of the probes are derived from known and predicted human and mouse
microRNAs. This design allows for the detection of mature as well as precursor
miRNAs and is particularly helpful in determining if computationally predicted
miRNAs are real. Although U6 snRNA is frequently used as a control for mi-
croRNA experiments, this noncoding RNA has been shown to vary as much as
five fold for equivalent amounts of total RNA by both microarray and North-
ern analysis (Hagan and Liu, unpublished observations). Hence, probes for U6
snRNA were not included in the Version 3.0 microarray. Most, if not all, com-
mercially available microRNA microarrays do not have controls for endogenous
RNAs that have been shown to be largely invariant between tissue samples.
Given the short length of miRNAs and the fact that far more mRNAs are
known than miRNAs, it is important to compare normalization methods specif-
ically for the miRNA microarray data. Although microRNA microarrays are
lower density spotted arrays than mRNA microarrays, they are not “boutique”
arrays. For example, microRNA arrays do not meet the following criteria: “more
than half the probes might be differentially expressed between any two samples
and that the differential expression might be predominately in one direction”
(Oshlack et al. 2007). We also do not expect global differences across miRNA
arrays. As an example, although the biggest difference in miRNA expression was
expected between brain and heart tissues, we found that only 15% of miRNAs were
differentially expressed with a greater than 2 fold difference when comparing
these distinct tissue types. Other examples include the referenced miRNA stud-
ies in cancer (Calin et al. 2005, Volinia et al. 2006, Yanaihara et al. 2006)
and tissue differentiation (Babak et al. 2004, Barad et al. 2004, Garzon et al.
2004) in Davison et al. (2006). For the three referenced cancer studies that used
microRNA microarrays, the numbers of differentially expressed microRNAs are
Statistical Applications in Genetics and Molecular Biology, Vol. 7 [2008], Iss. 1, Art. 22
http://www.bepress.com/sagmb/vol7/iss1/art22
13/245 (5.3%), 22/228 to 57/228 (9.6% to 25.0%; the range depends on which of six
tumor/normal comparisons was performed) and 43/352 (12.2%). For the three
referenced differentiation studies, the numbers of differentially expressed
microRNAs are 19/399 (4.8%), 25/154 to 35/154 (15.2% to 22.7%; the range depends on
the specific pairwise tissue comparison) and 35/150 to 57/150 (23.3% to 38.0%;
the range depends on the specific pairwise tissue comparison). We can conclude
with confidence that much less than 50% of miRNAs are differentially expressed
based on our experience and assessment of the literature. In addition to our cus-
tom microRNA microarrays, there are numerous commercially available miRNA
microarrays. For example, LC Sciences, Exiqon, Agilent, Invitrogen, and Ambion
sell miRNA microarrays with 1564, 4000, 15000, 3000, and 1224 miRNA
probes, respectively. Hence, the probe density of our array is similar to many cur-
rently available commercial platforms. Importantly, high throughput sequencing
of microRNAs is rapidly expanding the number of known microRNAs. Hence,
our custom arrays soon will need to be updated with even more probes to reflect
the recently identified microRNAs. The microRNA registry (Version 10.1) cur-
rently has sequences for 5395 miRNAs. Even though microRNA microarrays
are not “boutique” arrays in general, a few cases exist where large numbers of
microRNAs will be differentially expressed in only one direction. Knockouts
of essential microRNA biogenesis proteins such as Drosha, DGCR8, or Dicer1
lead to a dramatic reduction in steady state microRNA levels by blocking pro-
duction of mature microRNAs (Kumar et al. 2007). These global downregula-
tion cases are exceptionally easy to detect by microarray as the percentage of
microRNAs expressed above background is considerably different in compari-
son to controls. Other confirmed examples that show unidirectional microRNA
regulation are quite rare. Using a novel bead-based microRNA profiling system,
Lu et al. (2005) reported that microRNAs were primarily downregulated in cancers (129 of 217
investigated). Almost all studies of microRNAs in cancer, including all the re-
search referenced in this manuscript, have found roughly balanced numbers or
a slight enrichment for upregulated microRNAs in cancer, casting doubt on the
conclusions of Lu et al. (2005). Even research that at first glance might seem
to support the conclusions of Lu and colleagues demonstrates unequivocally the
opposite. For example, Chang et al. (2008) reported that Myc expression leads
to widespread repression of microRNAs. As their Supplemental Table 1 shows
for 313 human microRNAs investigated, 11 and 17 microRNAs are upregulated
and downregulated, respectively, at least two fold upon induced Myc expression.
Although vigilance must be exercised to make sure that the underlying
assumptions are valid, the normalization methods that we present are suitable for the
vast majority of studies using microRNA microarrays.
In this paper, we compare the performance of median, cyclic loess, quantile,
and no normalization for miRNA microarray data. The data included 72
microarrays obtained from RNA from 26 human and 10 mouse tissues that were
hybridized as technical replicates. Hence, each RNA sample was hybridized to
two independent microarrays. Since replicate samples should, in theory, have al-
most identical values for expressions, one can compare different normalization
techniques in terms of the closeness of normalized measurements in the repli-
cated samples. Moreover, there are no confounding biological effects that come
from tissues from different individuals. The differences between these paired
expression levels with and without normalization can be divided into bias and
variance components by expression level. Both components of these
miRNA-by-miRNA differences should be reduced after applying normalization methods.
We used these differences to provide direct evidence of the capability of each
method of reducing these two components. It was critical to examine the effects
on both quantities because the complexity of a transformation may increase the
error variance over and above its bias reduction. To resemble how normalization
is typically applied to samples, normalization was done globally across all 72
samples. This is an important distinction from normalizing each of 36 replicate
pairs separately, where this level of normalization could produce artificially low
variance and bias.
Section 2 describes the normalization methods in detail. Section 3 describes
the miRNA data used in this paper. Section 4 compares normalization methods.
2 Normalization Methods
Three commonly used normalization techniques are reviewed. Suppose that we
have the (log base 2 transformed) probe level expression values from p miRNAs
and n arrays in a p × n matrix X.
Median normalization shifts the miRNA expression values on each array by an additive
constant so that the median miRNA expression value is the same across arrays,
by the following steps:
• Take the median of each column of X to generate an n-dimensional median
vector M;
• Calculate the overall median of the vector M;
• Shift the miRNA expression values of each array by subtracting from them the
difference between the median of that array and the overall median.
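The three steps of median normalization can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the code used in the study; the function name `median_normalize` and the toy matrix are assumptions made for the example.

```python
from statistics import median


def median_normalize(X):
    """Shift each array (column) so that all column medians equal the overall median.

    X is a list of p rows (miRNAs), each a list of n log2 expression values (arrays).
    """
    n = len(X[0])
    # Step 1: median of each column (array), giving the n-dimensional vector M.
    col_medians = [median(row[j] for row in X) for j in range(n)]
    # Step 2: overall median of the vector M.
    overall = median(col_medians)
    # Step 3: subtract (array median - overall median) from every value on that array.
    shifts = [m - overall for m in col_medians]
    return [[row[j] - shifts[j] for j in range(n)] for row in X]


# Toy data (hypothetical): 3 miRNAs on 3 arrays that differ only by an additive shift.
X = [[1.0, 3.0, 6.0],
     [2.0, 4.0, 7.0],
     [3.0, 5.0, 8.0]]
normalized = median_normalize(X)
```

Because the toy arrays differ only by a constant shift, the three normalized columns coincide. Differences that depend on intensity are not removed by this method.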
Instead of matching only the median across arrays, quantile normalization
makes the distributions of expression levels the same across arrays by the
following steps:
• Sort each column of X separately to generate a sorted p × n matrix Y;
• Take the mean of each row of Y to generate a p-dimensional vector A_b,
called the baseline array;
• Get the normalized miRNA expression values for each array by rearranging the
baseline array A_b to have the same ordering as the corresponding column
of the matrix X, so that the empirical distribution of miRNA expression values
on every array is the same as that of the baseline array.
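The quantile normalization steps can be sketched as follows. This is an illustrative standard-library version, not the implementation used in the study; the function name `quantile_normalize` is an assumption, and ties within a column are broken by row order, which is a simplification that the text does not address.

```python
def quantile_normalize(X):
    """Give every array (column) of X the same empirical distribution.

    X is a list of p rows (miRNAs), each a list of n log2 expression values (arrays).
    """
    p, n = len(X), len(X[0])
    columns = [[X[i][j] for i in range(p)] for j in range(n)]
    # Step 1: sort each column separately (the matrix Y, stored column by column).
    sorted_cols = [sorted(col) for col in columns]
    # Step 2: the baseline array is the mean of each row of Y.
    baseline = [sum(sc[r] for sc in sorted_cols) / n for r in range(p)]
    # Step 3: put the baseline values back in each column's original rank order.
    out = [[0.0] * n for _ in range(p)]
    for j, col in enumerate(columns):
        order = sorted(range(p), key=lambda i: col[i])  # row indices, smallest value first
        for rank, i in enumerate(order):
            out[i][j] = baseline[rank]
    return out


# Toy data (hypothetical): 3 miRNAs on 2 arrays, no ties.
X = [[2.0, 4.0],
     [1.0, 6.0],
     [3.0, 5.0]]
qn = quantile_normalize(X)
```

After normalization each column contains exactly the baseline values, only in a different order, which is the sense in which the distributions are made identical.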
Cyclic loess considers the MA plot of probe intensities from every pair of
arrays $(X_{ij}, X_{ij'})$, with fixed $j \neq j'$ and $i = 1, 2, \ldots, p$, and makes the M and A
pairs scatter around the $M = 0$ axis by the following steps:
• Compute $M_i = X_{ij} - X_{ij'}$ and $A_i = \frac{1}{2}(X_{ij} + X_{ij'})$;
• Fit a loess curve by regressing $M$ on $A$, and denote the fitted vector by $\hat{M}$;
• Setting the vector $D = (M - \hat{M})/2$, get the normalized miRNA expression
values for $(X_{ij}, X_{ij'})$ by modifying $X_{ij}$ to $A_i + D_i$ and $X_{ij'}$ to $A_i - D_i$,
$i = 1, 2, \ldots, p$.
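The pairwise step can be sketched as below. Two points are assumptions rather than details from the text. First, `local_linear_fit` is a simplified stand-in for loess (a tricube-weighted local linear fit with a nearest-neighbour span and no robustness iterations), written here so that the example needs only the standard library. Second, the update is written as A_i + D_i and A_i − D_i, the form given by Bolstad et al. (2003), which is equivalent to moving X_ij down and X_ij′ up by half the fitted curve. The names `loess_normalize_pair`, `cyclic_loess`, `span` and `n_cycles` are hypothetical.

```python
from itertools import combinations


def local_linear_fit(x, y, span=0.75):
    """Simplified loess stand-in: tricube-weighted local linear fit evaluated at each x."""
    n = len(x)
    k = max(2, min(n, round(span * n)))  # number of nearest neighbours used per fit
    fitted = []
    for x0 in x:
        dist = [abs(xi - x0) for xi in x]
        h = sorted(dist)[k - 1] or 1e-12  # bandwidth: distance to the k-th nearest point
        w = [(1 - min(d / h, 1.0) ** 3) ** 3 for d in dist]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def loess_normalize_pair(xj, xk, span=0.75):
    """One pairwise step: centre the MA plot of two arrays on M = 0."""
    M = [a - b for a, b in zip(xj, xk)]
    A = [(a + b) / 2 for a, b in zip(xj, xk)]
    M_hat = local_linear_fit(A, M, span)
    D = [(m - mh) / 2 for m, mh in zip(M, M_hat)]
    return [a + d for a, d in zip(A, D)], [a - d for a, d in zip(A, D)]


def cyclic_loess(arrays, n_cycles=1, span=0.75):
    """Apply the pairwise step to every pair of arrays, n_cycles times (assumed default 1)."""
    arrays = [list(a) for a in arrays]
    for _ in range(n_cycles):
        for j, k in combinations(range(len(arrays)), 2):
            arrays[j], arrays[k] = loess_normalize_pair(arrays[j], arrays[k], span)
    return arrays


# Toy data (hypothetical): two arrays whose difference M depends linearly on A.
A_true = [6.0 + i for i in range(10)]
M_true = [0.2 * a - 1.0 for a in A_true]
array_j = [a + m / 2 for a, m in zip(A_true, M_true)]
array_k = [a - m / 2 for a, m in zip(A_true, M_true)]
new_j, new_k = loess_normalize_pair(array_j, array_k)
residual_M = [a - b for a, b in zip(new_j, new_k)]
```

In this toy case the intensity-dependent trend is removed while each probe's average over the two arrays is left unchanged.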
3 Description of Data
Total RNA was purchased from Ambion Inc. Microarray labeling and hybridization
were performed as previously described in Liu et al. (2004), with
the exceptions noted below. The Ohio State University Comprehensive Cancer
Center Version 3.0 microRNA microarray was used. This array contains
3790 oligo probes derived from 578 mature miRNAs spotted in duplicate (329
Homo sapiens and 249 Mus musculus) that are annotated in the miRNA
registry http://microrna.sanger.ac.uk/sequences/ (Accessed Nov. 2005). Of the 396
evolutionarily conserved mature microRNAs between mice and human in Ver-
sion 10.1 of the microRNA registry, 68% are identical in length and sequence.
Hence, many of the mouse probes serve as additional controls for their human
counterparts and vice versa. In addition, 1493 human and 1137 mouse oligo
probes for miRNAs computationally predicted in human and mouse, respec-
tively, are also spotted in duplicate. Often, more than one probe set exists for a
given mature miRNA. Additionally, there are duplicate probe spots correspond-
ing to most precursor miRNAs. Hybridization signals were ultimately detected
with Streptavidin-Alexa 647 conjugate, and scanned images (Axon 4000B) were
quantified using the Genepix 6.0 software (Axon Instruments, Sunnyvale, CA)
with local background correction.
4 Analysis
Background-corrected median signals for duplicate probes on an array were
averaged. After normalization across all 72 arrays, let $X_i$ be the log base 2
transformed expression value of the $i$th miRNA for a certain tissue, and let $Y_i$ be the
log base 2 transformed expression value of the $i$th miRNA for the replicate of
the tissue.
Bias. The average $A_i = (X_i + Y_i)/2$ and the difference $M_i = X_i - Y_i$ of
expression values for each miRNA can then be computed. The MA plot of the
two vectors $X_i$ and $Y_i$ is a 45-degree rotation and axis scaling of their scatter
plot. This plot is particularly useful for array data because $M_i$ represents the
log fold change and $A_i$ represents the average log intensity for the $i$th miRNA.
When the loess curve of the MA plot deviates from the horizontal line at $M = 0$,
this demonstrates differences in the intensity levels between two arrays from
the same tissue (Gentleman et al. 2005). In contrast, if the loess curve aligns
with $M = 0$, the normalization method is considered to exhibit little bias at all
levels of expression. When MA plots and loess curves were made for the replicate
array data from human brain tissue using no normalization, median normalization,
quantile normalization and cyclic loess, we observed that the quantile
normalization method removed bias the best (Figure 1C): the loess curve closely
followed the horizontal line at $M = 0$. No normalization, median normalization
and cyclic loess behaved similarly in that their loess curves did not align with
$M = 0$ closely enough (Figure 1A, 1B and 1D).
Binning. To compare the normalization methods in how much they reduced
error variance in addition to reducing bias, we formally modeled the mean and
variance of differences in replicate arrays as a function of their expression lev-
els. In order to obtain reliable estimates of the expression levels, we binned
duplicates according to their average expression level first and then proceeded
by modeling the mean and variance based on the binned data.
We created equally sized bins, each containing 34 miRNA probes. For each bin,
Figure 1: MA and loess plots of expression values for the human brain tissue
data. A) without normalization, B) after median normalization, C) after quantile
normalization and D) after cyclic loess. Horizontal axis: A, average of expressions;
vertical axis: M, difference of expressions.
we summarized the differences in the replicate arrays by median absolute devi-
ation (MAD) of the differences and median of the differences to obtain robust
estimates of variance and bias, respectively (Lin et al. 2002). The smoothed
MADs and medians of the differences were used to detect systematic effects due
to the different normalization methods as a function of expression levels. Lower
values of smoothed MADs and smoothed medians closer to zero across average
expressions correspond to a superior normalization method.
As stated above, each bin consisted of 34 miRNA probes. For fixed $k$
($1 \le k \le K$), let $X_{(i)k}$ ($i = 1, 2, \ldots, 34$) be the expression value of the $i$th
miRNA in the $k$th bin for a specific tissue, and let $Y_{(i)k}$ ($i = 1, 2, \ldots, 34$) be the
expression value of the $i$th miRNA in the $k$th bin for the replicate of the tissue.
The difference between the replicate arrays' expression values for each miRNA
in the $k$th bin can be denoted by $D_{(i)k} = X_{(i)k} - Y_{(i)k}$ ($i = 1, 2, \ldots, 34$), and the
corresponding observations by $d_{(i)k}$. We assume that for fixed $k$,
$$D_{(i)k} \overset{\text{i.i.d.}}{\sim} N(\mu_k, \sigma_k^2), \quad i = 1, 2, \ldots, 34,$$
and use
$$\mathrm{md}_k = \operatorname*{median}_{1 \le i \le 34} \left( d_{(i)k} \right)$$
as a robust location (center) estimate of $\mu_k = E[D_{(1)k}]$, and
$$\mathrm{MADd}_k = \operatorname*{median}_{1 \le i \le 34} \left| d_{(i)k} - \operatorname*{median}_{1 \le i \le 34} \left( d_{(i)k} \right) \right|$$
as a robust estimate of scale (spread), which is proportional to $\sigma_k = \sqrt{\operatorname{var}[D_{(1)k}]}$
under normality.
For the average expression values of miRNAs in the $k$th bin across certain
tissue replicates, let $A_{(i)k} = (X_{(i)k} + Y_{(i)k})/2$ ($i = 1, 2, \ldots, 34$) and $a_{(i)k}$ be the
$i$th observation. Similarly, for estimation of the center of the average expression
values in each bin, we consider
$$\mathrm{ma}_k = \operatorname*{median}_{1 \le i \le 34} \left( a_{(i)k} \right).$$
As Figure 1A suggests, it is sensible to model $\mu_k$ and $\sigma_k$ as functions of the
center of the average expression values of miRNA replicates in the $k$th bin.
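The binning and the three per-bin summaries (median average, median difference and MAD of the differences) can be sketched as follows. This is an illustration only. The function name `bin_summaries` is hypothetical, and dropping probes left over after the last full bin is an assumption, since the text does not say how a remainder was handled.

```python
from statistics import median


def bin_summaries(x, y, bin_size=34):
    """Bin replicate pairs by average expression and summarise each bin robustly.

    x and y hold the log2 expression values of the same miRNAs on two replicate arrays.
    Returns one dict per full bin with the median average (ma), the median
    difference (md) and the MAD of the differences (mad).
    """
    # Order the pairs by their average expression so that bins are contiguous in A.
    pairs = sorted(zip(x, y), key=lambda xy: (xy[0] + xy[1]) / 2)
    summaries = []
    # Assumption: probes left over after the last full bin are dropped.
    for start in range(0, len(pairs) - bin_size + 1, bin_size):
        chunk = pairs[start:start + bin_size]
        d = [xi - yi for xi, yi in chunk]
        a = [(xi + yi) / 2 for xi, yi in chunk]
        md = median(d)
        mad = median(abs(di - md) for di in d)
        summaries.append({"ma": median(a), "md": md, "mad": mad})
    return summaries


# Toy data (hypothetical), with bins of 4 instead of 34 to keep the example short.
x = [1.0, 2.0, 3.0, 4.0, 10.0, 11.0, 12.0, 13.0]
y = [1.0, 2.0, 3.0, 4.0, 9.0, 10.5, 10.0, 12.5]
summaries = bin_summaries(x, y, bin_size=4)
```

The resulting (ma, md) and (ma, mad) pairs are the inputs to the smoothing steps that follow.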
For the paired observations $(\mathrm{ma}_1, \mathrm{md}_1), (\mathrm{ma}_2, \mathrm{md}_2), \ldots, (\mathrm{ma}_K, \mathrm{md}_K)$, we
modeled the median difference as a smooth function of the median average
$$\mathrm{md}_k = \eta(\mathrm{ma}_k) + \epsilon_k, \quad k = 1, 2, \ldots, K,$$
with $\epsilon_k \sim N(0, \sigma_{m,k}^2)$ and with a different variance for each bin. The smoothed
relationship $\eta$ was obtained by the weighted smoothing spline with weights equal
to the reciprocal of the squared MAD of difference. Quantile normalization gave
the best results when comparing the weighted smoothed curves for the median
difference in expression values using the human brain tissue data (Figure 2).
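Similarly, for the paired observations $(\mathrm{ma}_1, \mathrm{MADd}_1), (\mathrm{ma}_2, \mathrm{MADd}_2), \ldots,
(\mathrm{ma}_K, \mathrm{MADd}_K)$, we considered the following model with unequal variance
$$\mathrm{MADd}_k = \xi(\mathrm{ma}_k) + \epsilon_k, \quad k = 1, 2, \ldots, K,$$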
Figure 2: Weighted smoothed medians of difference of expression values for the
human brain tissue data. A) without normalization, B) after median normalization,
C) after quantile normalization and D) after cyclic loess. Horizontal axis: median of
average of expressions; vertical axis: median of difference of expressions.
and $\epsilon_k \sim N(0, \sigma_{\mathrm{MAD}}^2)$. The smoothed MAD of differences $\xi$ can again be
obtained by smoothing splines with the smoothing parameter selected by
generalized maximum likelihood (GML) (Gu 2002). It was difficult to see differences
in the relationship between MADd and ma among the normalization methods
(Figure 3), but they became more apparent if the bias and variance were com-
bined into a mean-squared error statistic.
Confidence intervals. The fitted median of differences $\hat{\eta}$ is the smoothed
estimate of the bias parameter $\mu_k$, and the fitted MAD of differences $\hat{\xi}$ is the smoothed
estimate of the scale parameter. We used the fitted MAD to estimate confidence
intervals around the bias and obtained a pointwise confidence interval for the bias by
Figure 3: Smoothed MADs versus median averages for the human brain tissue
data. A) without normalization, B) after median normalization, C) after quantile
normalization and D) after cyclic loess. Horizontal axis: median of average of
expressions; vertical axis: MAD of difference of expressions.
binned expression values as
$$\hat{\eta}(\mathrm{ma}_k) \pm \frac{3.98}{\sqrt{34}} \, \hat{\xi}(\mathrm{ma}_k)$$
(see Hoaglin et al. 2000). The confidence band after quantile normalization
encompasses the horizontal line at M= 0, while those using no normalization,
median normalization or cyclic loess do not include zero for larger expression
values (Figure 4).
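The pointwise band is simple arithmetic on the two smoothed curves. The sketch below only illustrates that arithmetic: the multiplier 3.98 and the bin size 34 are taken from the formula above, while the function names and the input values are hypothetical and are not results from the paper.

```python
from math import sqrt


def bias_confidence_band(eta_hat, xi_hat, bin_size=34, multiplier=3.98):
    """Pointwise band eta_hat +/- (multiplier / sqrt(bin_size)) * xi_hat for each bin."""
    scale = multiplier / sqrt(bin_size)
    return [(e - scale * x, e + scale * x) for e, x in zip(eta_hat, xi_hat)]


def covers_zero(band):
    """True where the band includes the M = 0 line, that is, where no bias is detected."""
    return [lo <= 0.0 <= hi for lo, hi in band]


# Hypothetical smoothed values for two bins (not taken from the paper).
eta_hat = [0.05, -0.40]   # smoothed median differences
xi_hat = [0.30, 0.25]     # smoothed MADs
band = bias_confidence_band(eta_hat, xi_hat)
flags = covers_zero(band)
```

In this toy input the first bin's band includes zero and the second does not, which mirrors the kind of comparison made between panels of Figure 4.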
Mean Squared Error. We obtained the mean squared error (MSE) of the
difference in expression values (including variance and squared bias)
$$\mathrm{MSE}_k = E[D_{(1)k}^2] = \operatorname{var}[D_{(1)k}] + E[D_{(1)k}]^2 = \sigma_k^2 + \mu_k^2,$$
Figure 4: Confidence band of the bias for the human brain tissue data. A) without
normalization, B) after median normalization, C) after quantile normalization
and D) after cyclic loess. Horizontal axis: A, average of expressions; vertical axis:
M, difference of expressions.
which can be estimated by the smoothed estimates
$$\left[ \frac{\hat{\xi}(\mathrm{ma}_k)}{0.6745} \right]^2 + \hat{\eta}(\mathrm{ma}_k)^2$$
(see Huber 2003). The estimated MSE for quantile normalization is smallest
when average expression values are greater than noise levels of measurements,
and the estimated MSE for cyclic loess is slightly larger than that of quantile
normalization across all average expression values. Median normalization per-
formed similarly to no normalization (Figure 5).
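The MSE estimate combines the two smoothed curves bin by bin: the smoothed MAD is rescaled to a standard deviation by the normal-consistency constant 0.6745 and squared, and the squared smoothed median difference is added. A minimal sketch, with a hypothetical function name and hypothetical inputs:

```python
def estimated_mse(eta_hat, xi_hat, mad_to_sd=0.6745):
    """Per-bin MSE estimate: (xi_hat / 0.6745)^2 + eta_hat^2."""
    return [(x / mad_to_sd) ** 2 + e ** 2 for e, x in zip(eta_hat, xi_hat)]


# Hypothetical smoothed values for two bins (not taken from the paper).
eta_hat = [0.0, 0.3]      # smoothed median differences (bias)
xi_hat = [0.6745, 0.0]    # smoothed MADs (scale)
mse = estimated_mse(eta_hat, xi_hat)
```

The first toy bin is pure variance and the second pure bias, so the two terms of the decomposition can be read off separately.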
To evaluate the global bias and variance for each method, we averaged MSEs
across expression levels greater than 4.5; the value 4.5 (log base 2 transformed)
Figure 5: MSE curves for the brain tissue data without normalization (black, solid
line), after median normalization (green, dashed line), after quantile normalization
(red, dot-dashed line) and after cyclic loess (blue, dotted line). Horizontal axis:
median of average of expressions; vertical axis: MSE of difference of expressions.
was selected because 95% of the blanks (spots lacking oligonucleotide probes)
gave intensities less than this value. The average MSEs for no normalization,
median normalization, quantile normalization and cyclic loess using the brain
tissue data were 0.278, 0.274, 0.225 and 0.270, respectively. These results were
found consistently across the other 35 tissue types (Figure 6), where the MSEs
were lower for quantile normalization (coded 2) in almost all tissue samples
compared to no normalization (coded 0), median normalization (coded 1) and
cyclic loess (coded 3), except for human lung, human liver, human thymus,
mouse liver and mouse lung. When the normalization methods were applied
to each tissue type separately, instead of to all 72 arrays together, the results
were similar.
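The single summary number per tissue and method is an average of the estimated MSEs over expression levels above the 4.5 cutoff. The sketch below assumes the average is taken over bins, indexed by their median average expression; the text does not state whether bins or a finer grid on the smoothed curve were used, so that choice, the function name and the input values are all hypothetical.

```python
def average_mse_above(ma, mse, cutoff=4.5):
    """Average the per-bin MSE estimates over bins whose median average exceeds cutoff."""
    kept = [m for a, m in zip(ma, mse) if a > cutoff]
    if not kept:
        raise ValueError("no bins above the cutoff")
    return sum(kept) / len(kept)


# Hypothetical per-bin values (not taken from the paper).
ma = [3.0, 5.0, 9.0]      # median average expression per bin (log2 scale)
mse = [0.9, 0.3, 0.1]     # estimated MSE per bin
avg = average_mse_above(ma, mse)
```

Here the first bin falls below the cutoff and is excluded, so only the last two bins contribute to the average.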
Checking for Scale Compression. It is possible that the superior results for
Figure 6: Mean of MSEs for the difference in expression values without normalization
(0 and black), after median normalization (1 and green), after quantile
normalization (2 and red) and after cyclic loess (3 and blue). Horizontal axis: mean
of MSEs. Vertical axis, one row per tissue: Esophagus, Colon, Cervix, Lung, Brain,
Bladder, Liver, Kidney, Adipose, Heart, Thymus, Ovary, Placenta, Testes, Thyroid,
Skeletal Muscle, Small Intestine, Spleen, Prostate, Trachea, Pancreas, Breast,
Stomach, Uterus, Adrenal, Lymph Node, Mouse Spleen, Mouse Liver, Mouse Brain,
Mouse Heart, Mouse Ovary, Mouse Embryo, Mouse Lung, Mouse Thymus, Mouse Kidney
and Mouse Testicle.
quantile normalization are the result of the compression of the scale downward
after transformation. To check this, we first calculated coefficients of variation
(CV) as the ratio of an estimate of the standard deviation of measurement
(√MSE) for each bin to the mean expression for that bin and then averaged the
ratios across bins. We found the CVs followed the same pattern as the MSEs,
that is, typically lower values for quantile normalization across tissues (Figure
7). It is also possible that the superior results for quantile normalization are the
result of compressing the scale from both ends after transformation, thereby
reducing the spread and sensitivity of transformed measurements. To check this, we
calculated the average variance of expression levels across the 36 tissues for each
miRNA. This variance consists of true variance across tissues and measurement
error as obtained with the MSE. Averaging the variance across miRNAs and the
MSEs across tissues, we found the ratios of signal (true) variance to noise
(measurement error) variance were 12.0, 14.0, 16.3 and 16.3 for no, median, quantile
and cyclic loess normalization, respectively.
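The first check, the coefficient of variation, divides the estimated measurement standard deviation (√MSE) of each bin by that bin's mean expression and averages the ratios across bins. A minimal sketch with a hypothetical function name and hypothetical inputs follows; the signal-to-noise ratio is not sketched because the text does not give enough detail to reproduce it.

```python
from math import sqrt


def mean_cv(mse, mean_expression):
    """Average over bins of sqrt(MSE_k) / (mean expression in bin k)."""
    ratios = [sqrt(m) / a for m, a in zip(mse, mean_expression)]
    return sum(ratios) / len(ratios)


# Hypothetical per-bin values (not taken from the paper).
mse = [0.25, 0.16]              # estimated MSE per bin
mean_expression = [5.0, 8.0]    # mean log2 expression per bin
cv = mean_cv(mse, mean_expression)
```

Because the error is expressed relative to the expression level, a method that merely shrank all values toward zero would lower the MSE without lowering this ratio.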
Figure 7: Mean of CVs for the difference in expression values without normalization
(0 and black), after median normalization (1 and green), after quantile
normalization (2 and red) and after cyclic loess (3 and blue). Horizontal axis: mean
of CVs. Vertical axis, one row per tissue: Esophagus, Colon, Cervix, Lung, Brain,
Bladder, Liver, Kidney, Adipose, Heart, Thymus, Ovary, Placenta, Testes, Thyroid,
Skeletal Muscle, Small Intestine, Spleen, Prostate, Trachea, Pancreas, Breast,
Stomach, Uterus, Adrenal, Lymph Node, Mouse Spleen, Mouse Liver, Mouse Brain,
Mouse Heart, Mouse Ovary, Mouse Embryo, Mouse Lung, Mouse Thymus, Mouse Kidney
and Mouse Testicle.
Comparative Study. We compared real-time RT-PCR miRNA data (Lee et al.
2008) with our microarray miRNA data, since twenty-one tissues were common
to both datasets. Specifically, we focused on brain and heart, since these tissues
are quite biologically distinct and have substantial differences in their miRNA
expression profiles. If a normalization technique were overly aggressive, then
there would be an “averaging-out” effect, leading to a significant decrease in the
number of differentially expressed miRNAs. A well known difference between
microarray and RT-PCR data is that the fold changes observed by microarray
tend to be compressed in comparison with fold changes observed by RT-PCR.
We found 51 miRNAs were characterized by a four fold difference in expression
by RT-PCR. For the microarray data on identical miRNAs, we found that 36,
35, 35 and 35 miRNAs were two fold differentially expressed for no, median, cyclic
loess and quantile normalization, respectively. This set of miRNAs was found to
have roughly a 70% overlap with the RT-PCR data. The observed values for
fold changes varied little with respect to the normalization method used. In this
respect, we could not conclude any superior normalization method based strictly
on this analysis, but we could at least conclude that quantile normalization is not
worse than other methods in terms of its sensitivity.
5 Conclusion
We showed that the quantile normalization method works best in reducing
differences in miRNA expression values for duplicate tissue samples. Cyclic loess
works slightly worse than quantile normalization, whereas no normalization and
median normalization behave similarly and seem to be inferior to quantile
normalization and cyclic loess with regard to bias. This is not surprising because
quantile normalization adjusted better for differential bias across the scale of
expression values. By showing that the total MSE was lower across almost all
36 tissue samples, we were assured that the bias correction provided by quan-
tile normalization was not outweighed by additional error variance that can arise
from a more complex normalization method. Furthermore, we showed that quan-
tile normalization does not achieve smaller replication error by compressing the
scale downward or by compressing the scale from both ends.
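For readers unfamiliar with the method, the following is a minimal sketch of quantile normalization in its standard form, where each array's k-th smallest value is replaced by the mean of the k-th smallest values across all arrays. It is an illustration under stated assumptions, not the implementation used in this study: the function names and toy replicate values are hypothetical, ties are broken by position rather than averaged, and the mean squared difference between replicates is used only as a simple stand-in for replication error.

```python
def quantile_normalize(columns):
    """Quantile-normalize a list of equal-length columns, one per array."""
    n = len(columns[0])
    if any(len(col) != n for col in columns):
        raise ValueError("all arrays must have the same number of probes")
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value across arrays.
    reference = [sum(col[k] for col in sorted_cols) / len(columns)
                 for k in range(n)]
    normalized = []
    for col in columns:
        # Probe indices ordered from smallest to largest value.
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]
        normalized.append(out)
    return normalized


def mean_squared_difference(a, b):
    """Mean squared difference between two replicate arrays."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Hypothetical replicate measurements, not taken from the paper.
rep1 = [2.0, 5.0, 3.0, 8.0]
rep2 = [3.5, 7.0, 8.0, 10.5]

norm1, norm2 = quantile_normalize([rep1, rep2])
before = mean_squared_difference(rep1, rep2)
after = mean_squared_difference(norm1, norm2)
```

After normalization both arrays share the same set of values, so any remaining difference between replicates comes only from disagreements in rank order.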
References
Babak, T., Zhang, W., Morris, Q., Blencowe, B. and Hughes, T. (2004). Probing
microRNAs with microarrays: Tissue specificity and functional inference,
RNA 10: 1813–1819.
Barad, O., Meiri, E., Avniel, A., Aharonov, R., Barzilai, A., Bentwich, I., Einav,
U., Gilad, S., Hurban, P., Karov, Y., Lobenhofer, E. K., Sharon, E., Shibo-
leth, Y. M., Shtutman, M., Bentwich, Z. and Einat, P. (2004). MicroRNA ex-
pression detected by oligonucleotide microarrays: System establishment and
expression profiling in human tissues, Genome Research 14: 2486–2494.
Bolstad, B. M., Irizarry, R. A., Astrand, M. and Speed, T. P. (2003). A
comparison of normalization methods for high density oligonucleotide array data
based on variance and bias, Bioinformatics 19: 185–193.
Calin, G., Ferracin, M., Cimmino, A., DiLeva, G., Shimizu, M., Wojcik, S.,
Iorio, M., Visone, R., Sever, N., Fabbri, M., Iuliano, R., Palumbo, T., Pichiorri,
F., Roldo, C., Garzon, R., Sevignani, C., Rassenti, L., Alder, H., Volinia, S.,
Liu, C. G., Kipps, T. J., Negrini, M. and Croce, C. M. (2005). A microRNA
signature associated with prognosis and progression in chronic lymphocytic
leukemia, The New England Journal of Medicine 353: 1793–1801.
Chang, T., Yu, D., Lee, Y., Wentzel, E., Arking, D., West, K., Dang, C. V.,
Thomas-Tikhonenko, A. and Mendell, J. T. (2008). Widespread microRNA
repression by Myc contributes to tumorigenesis, Nature Genetics 40(1): 43–50.
Churchill, G. A. (2002). Fundamentals of experimental design for cDNA
microarrays, Nature Genetics 32: 490–495.
Churchill, G. A. (2003). Discussion to statistical challenges in functional
genomics-comment, Statistical Science 18: 64–69.
Churchill, G. A. and Oliver, B. (2001). Sex, flies and microarrays, Nature Ge-
netics 29: 355–356.
Davison, T., Johnson, C. and Andruss, B. (2006). Analyzing micro-RNA ex-
pression using microarrays, Methods in Enzymology 411: 14–34.
Garzon, R., Pichiorri, F., Palumbo, T., Iuliano, R., Cimmino, A., Aqeilan, R.,
Volinia, S., Bhatt, D., Alder, H., Marcucci, G., Calin, G., Liu, C. G.,
Bloomfield, C., Andreeff, M. and Croce, C. (2006). MiRNA fingerprints during
human megakaryocytopoiesis, Proceedings of the National Academy of Sciences
of the United States of America 103: 5078–5083.
Gentleman, R., Carey, V. J., Huber, W., Irizarry, R. and Dudoit, S. (2005).
Bioinformatics and computational biology solutions using R and Bioconductor,
Springer: New York.
Gu, C. (2002). Smoothing Spline ANOVA Models, Springer: New York.
Hagan, J. and Croce, C. (2007). MicroRNAs in carcinogenesis, Cytogenetic and
Genome Research 118: 252–259.
Hoaglin, D. C., Mosteller, F. and Tukey, J. W. (2000). Understanding Robust
and Exploratory Data Analysis, John Wiley & Sons.
Huber, P. (2003). Robust Statistics, John Wiley & Sons.
Irizarry, R. A., Hobbs, B., Collin, F. and Speed, T. (2003). Exploration,
normalization, and summaries of high density oligonucleotide array probe level
data, Biostatistics 4: 249–264.
Kerr, M. K. and Churchill, G. (2001). Experimental design for gene expression
microarrays, Biostatistics 2: 183–201.
Kumar, M., Lu, J., Mercer, K., Golub, T. and Jacks, T. (2007). Impaired mi-
croRNA processing enhances cellular transformation and tumorigenesis, Na-
ture Genetics 39(5): 673–677.
Lee, E., Baek, M., Gusev, Y., Brackett, D. J., Nuovo, G. and Schmittgen, T.
(2008). Systematic evaluation of microRNA processing patterns in tissues,
cell lines, and tumors, RNA 14: 35–42.
Lin, Y., Nadler, S. T., Lan, H., Attie, A. D. and Yandell, B. S. (2003). Adaptive
gene picking with microarray data: detecting important low abundance sig-
nals, in G. Parmigiani, E. S. Garrett, R. A. Irizarry and S. L. Zeger (eds), The
Analysis of Gene Expression Data: Methods and Software, Springer-Verlag.
Liu, C., Calin, G., Meloon, B., Gamliel, N., Sevignani, C., Ferracin, M., Du-
mitru, C., Shimizu, M., Zupo, S., Dono, M., Alder, H., Bullrich, F., Negrini,
M. and Croce, C. (2004). An oligonucleotide microchip for genome-wide mi-
croRNA profiling in human and mouse tissues, Proceedings of the National
Academy of Sciences of the United States of America 101(26): 9740–9744.
Lu, J., Getz, G., Miska, E., Alvarez-Saavedra, E., Lamb, J., Peck, D.,
Sweet-Cordero, A., Ebert, B. L., Mak, R. H., Ferrando, A., Downing, J. R.,
Jacks, T., Horvitz, H. R. and Golub, T. R. (2005). MicroRNA expression profiles
classify human cancers, Nature 435(7043): 834–838.
Oshlack, A., Emslie, D., Corcoran, L. and Smyth, G. (2007). Normalization
of boutique two-color microarrays with a high proportion of differentially ex-
pressed probes, Genome Biology 8(1):R2.
Quackenbush, J. (2002). Microarray data normalization and transformation, Na-
ture Genetics 32: 496–501.
Volinia, S., Calin, G., Liu, C., Ambs, S., Cimmino, A., Petrocca, F., Visone,
R., Iorio, M., Roldo, C., Ferracin, M., Prueitt, R., Yanaihara, N., Lanza, G.,
Scarpa, A., Vecchione, A., Negrini, M., Harris, C. and Croce, C. (2006). A
microRNA expression signature of human solid tumors defines cancer gene
targets, Proceedings of the National Academy of Sciences of the United States
of America 103(7): 2257–2261.
Wolfinger, R., Gibson, G., Wolfinger, E. D., Bennett, L., Hamadeh, H., Bushel,
P., Afshari, C. and Paules, R. (2001). Assessing gene significance from cDNA
microarray expression data via mixed models, Journal of Computational
Biology 8: 625–637.
Yanaihara, N., Caplen, N., Bowman, E., Seike, M., Kumamoto, K., Yi, M.,
Stephens, R., Okamoto, A., Yokota, J., Tanaka, T., Calin, G., Liu, C. G.,
Croce, C. and Harris, C. (2006). Unique miRNA molecular profiles in lung
cancer diagnosis and prognosis, Cancer Cell 9(3): 189–198.