
Differential splicing using whole-transcript microarrays

May 2009
BMC Bioinformatics 10(1)

License
CC BY 2.0

BioMed Central

BMC Bioinformatics

Open Access

Methodology article

Differential splicing using whole-transcript microarrays

Mark D Robinson*1,2,3 and Terence P Speed

Address: 1 Department of Medical Biology, The University of Melbourne, Parkville, Victoria 3010, Australia, 2 Cancer Research Program, Garvan Institute of Medical Research, Darlinghurst, NSW 2010, Australia and 3 Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, Parkville, Victoria 3050, Australia

Email: Mark D Robinson* - mrobinson@wehi.edu.au; Terence P Speed - terry@wehi.edu.au

* Corresponding author

Abstract

Background: The latest generation of Affymetrix microarrays are designed to interrogate expression over the entire length of every locus, thus giving the opportunity to study alternative splicing genome-wide. The Exon 1.0 ST (sense target) platform, with versions for Human, Mouse and Rat, is designed primarily to probe every known or predicted exon. The smaller Gene 1.0 ST array is designed as an expression microarray but still interrogates expression with probes along the full length of each well-characterized transcript. We explore the possibility of using the Gene 1.0 ST platform to identify differential splicing events.

Results: We propose a strategy to score differential splicing by using the auxiliary information from fitting the statistical model, RMA (robust multichip analysis). RMA partitions the probe-level data into probe effects and expression levels, operating robustly so that if a small number of probes behave differently than the rest, they are downweighted in the fitting step. We argue that adjacent poorly fitting probes for a given sample can be evidence of differential splicing and have designed a statistic to search for this behaviour. Using a public tissue panel dataset, we show many examples of tissue-specific alternative splicing. Furthermore, we show that evidence for putative alternative splicing has a strong correspondence between the Gene 1.0 ST and Exon 1.0 ST platforms.

Conclusion: We propose a new approach, FIRMAGene, to search for differentially spliced genes using the Gene 1.0 ST platform. Such an analysis complements the search for differential expression. We validate the method by illustrating several known examples and we note some of the challenges in interpreting the probe-level data.

Software implementing our methods is freely available as an R package.

Published: 22 May 2009

BMC Bioinformatics 2009, 10:156 doi:10.1186/1471-2105-10-156

Received: 7 December 2008

Accepted: 22 May 2009

This article is available from: http://www.biomedcentral.com/1471-2105/10/156

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

Alternative splicing

Alternative splicing is the ubiquitous phenomenon where the same genetic locus can transcribe multiple messenger RNAs (mRNAs), by splicing out different subsets of intronic regions from a common pre-mRNA product. Splice variants of a gene can be functionally distinct and generate considerable proteomic diversity. Despite early estimates of near 50% [1], it is now thought that greater than 90% of all human genes exhibit alternative splicing [2,3], accounting for much of the complexity of metazoan organisms. Alternative splicing is known to be prominent in many important physiological processes, such as cell differentiation, apoptosis and development, and is especially prevalent in the nervous system [4-6]. Mis-regulation or mutations that affect the splicing mechanism can result in disease, including cancer [7]. It is no surprise then that alternative splice variants have been observed in a tissue-specific or cancer-specific manner.

Until recently, predicting alternative splice events usually involved the comparison of expressed sequence tags (ESTs) across several libraries. For example, algorithms that compare EST abundance across human tissues deduced many tissue-specific isoforms [8,9]. Recently, DNA microarrays have been successfully utilized to explore alternative splicing, finding many genes with known and putative tissue-specific isoforms [1,10,11].

In this study, we propose a statistical method of scoring differential splicing for the Gene 1.0 ST (hereafter referred to as Gene) array data, which is the latest generation of Affymetrix genome-wide expression profiling chips. Note that the aim of this work is not to suggest the Gene platform as a replacement for the Exon 1.0 ST (referred to as Exon) array. The considerations of cost, probe coverage and protocol (e.g. amount of RNA needed) will ultimately guide this decision for experimenters. We expect the Gene platform will be used regularly for expression profiling studies and here, we describe the potential to identify differential splicing at no additional experimental cost. We are simply providing an additional data analysis-based avenue of interrogation. Here, our motivation is to outline the possibilities and limitations of using the Gene array for the detection of differential splicing, not to rigorously compare and contrast the platforms.

Both the Gene and Exon arrays interrogate well-annotated exonic content. Perhaps not surprisingly, given that the two platforms share a large number of probes, we have discovered that many of the patterns observed in Exon data are also observed in Gene data. In addition, we note some of the challenges and ambiguities of analyzing whole transcript microarray data in the context of alternative splicing.

We have shown previously that Gene has similar performance to Exon and a previous generation of Affymetrix chips, in various respects in the context of gene expression [12]. There is certainly value in having an expression platform that can additionally deduce alternative splice forms. Exon does this [12,13,19]. We show here that Gene has potential to do so as well, if we are willing to interrogate only well-annotated content and have reduced coverage for some transcripts.

Differential splicing

It is worth noting at the outset that microarrays, in general, will not be able to detect alternative splicing, per se. For example, if an exon is spliced out of all tissues or samples in the study, there is no ability to detect it as alternative splicing. So, the focus of the methodology presented here and other related methods is on detecting differential splicing, or more generally, the differential expression of alternative isoforms.

Affymetrix array design

Figure 1 shows a UCSC browser view [14] of the locations of Exon probes and probesets and Gene probes for a single human gene, SLC25A3 (solute carrier family 25, member 3). As is standard with Affymetrix design, all probes are 25 base pairs; however, on the newer generation of chips, there is no longer a mismatch probe paired with each perfect match probe. The Exon probesets, one for each probe selection region (PSR), are shown in black for well-annotated exons and 2 shades of grey depending on the original prediction evidence. PSRs are defined by Affymetrix according to whether a particular region may act as an independent unit, based on several levels of annotation projected to the genome. The array design for Exon aims to have 4 probes per PSR whereas the Gene array has approximately 25 probes per transcript cluster [12].

Figure 1: UCSC browser view of solute carrier family 25, member 3 (SLC25A3). Custom tracks have been added for the locations of the 25-mer probes for the Affymetrix Gene, Exon and HG-U133 human expression arrays, relative to the locations of exons for RefSeq or Ensembl gene builds. The Exon probesets are shown in black and grey in the lowermost track. Several probes are common to both the Gene and Exon platforms.

The Gene platform shares a large number of its probes (approximately 65%) with the Exon array, but also includes a significant number of probes unique to the platform. In terms of differential splice detection, the coverage by either platform is locus-specific. The ability to detect differential isoform usage will depend not only on the number of probes covering the region, but also on the nature of the splicing, the degree of differential usage and the performance of the probes near to the event. This could also mean there is a bias in the ability to detect differential splicing with Gene towards genes having fewer rather than more exons. In general, genes with fewer than 5 or 6 exons will have more probes per exon on the Gene array. We have not studied this possible bias in any detail. Instead, we focus on determining differential splicing predictions based on the available data with the current Gene design.

RMA decomposition

After background adjustment and normalization, one of the commonly used methods for summarizing probe-level Affymetrix data into expression levels is robust multichip analysis (RMA) [15]. The approach accounts for relative probe-specific effects according to the following model:

$$Y_{ij} = \alpha_i + \beta_j + \varepsilon_{ij} \qquad (1)$$

where $Y_{ij}$ are the $\log_2$ background adjusted and normalized intensities for probe j from sample i, $\alpha_i$ are the chip effects (i = 1, ..., N) and $\beta_j$ are the probe effects (j = 1, ..., J), given N samples and J probes, and $\varepsilon_{ij}$ are the errors. For simplicity, a subscript for gene is suppressed here since all models are fit to genes one by one. The constraint $\sum_{j=1}^{J} \beta_j = 0$ is imposed to make the probe effects relative and identifiable. Figure 2A illustrates probe-level data for a gene that is strongly differentially expressed between heart and brain across a full mixture of RNA samples (red: 100% heart, green: 100% brain, blue: mixture). The most striking observation of the probe-level data is the parallelism across all samples, largely due to the sequence-specific probe intensity effects. That is, because this gene is differentially expressed between brain and heart, each individual probe shows a relative change in abundance, even though the range of intensity for each probe may be quite different. RMA models this behaviour by estimating probe-specific effects (Figure 2B), leaving the relevant sample-specific features (chip effects, Figure 2C) for downstream analysis of expression. The residuals (Figure 2D), which are the differences between the observed intensities and that explained by the model, are random and centred around 0. The models are fitted robustly using iteratively reweighted least squares [16] so that individual observations do not have undue influence in the estimation of $\alpha_i$ and $\beta_j$.
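The robust additive fit described above can be illustrated with a small sketch. This is not the authors' R implementation: it is a minimal Python illustration that assumes Huber weights with tuning constant 1.345, a MAD-based scale estimate, and a simple alternating (backfitting) solver for each weighted least-squares step. The function name `fit_additive_irls` and its parameters are hypothetical.

```python
import numpy as np

def fit_additive_irls(Y, n_iter=20, k=1.345):
    """Robustly fit Y[i, j] ~ alpha[i] + beta[j], with sum(beta) == 0.

    Y is an N x J matrix (samples x probes) of log-scale intensities.
    Huber weights and the constant k are assumptions of this sketch.
    """
    Y = np.asarray(Y, dtype=float)
    N, J = Y.shape
    W = np.ones((N, J))
    alpha = np.zeros(N)
    beta = np.zeros(J)
    for _ in range(n_iter):
        # Weighted least squares by alternating weighted row/column means.
        for _ in range(10):
            alpha = np.sum(W * (Y - beta[None, :]), axis=1) / np.sum(W, axis=1)
            beta = np.sum(W * (Y - alpha[:, None]), axis=0) / np.sum(W, axis=0)
            shift = beta.mean()      # enforce the sum-to-zero constraint
            beta -= shift
            alpha += shift
        resid = Y - alpha[:, None] - beta[None, :]
        scale = 1.4826 * np.median(np.abs(resid))
        if scale == 0:
            break
        # Huber weights: 1 for small residuals, k/|u| for large ones.
        u = np.abs(resid) / scale
        W = np.minimum(1.0, k / np.maximum(u, 1e-12))
    resid = Y - alpha[:, None] - beta[None, :]
    return alpha, beta, resid
```

Because poorly fitting observations receive small weights, a run of aberrant probes in one sample barely moves the chip and probe effects, and the departure shows up in the returned residuals instead.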

Next, we show that alternative splicing can be highlighted by focusing on the residuals. Take for example WNK1 (lysine deficient protein kinase 1), a gene known to express a kidney-specific isoform having a 5' region spliced out [8]. Figure 3 shows the probe-level data for WNK1, as well as the residuals after fitting the RMA model. Figure 3A illustrates quite clearly that several probes near the 5' region of the gene for WNK1 are expressed at noticeably lower levels in the kidney samples than in the remaining samples, as would be expected. Since the RMA model is fitted robustly, the 5' probes for the kidney samples, which depart from the parallelism we saw previously, are downweighted, and so have a relatively small influence on the overall estimation of chip and probe effects. However, for the determination of differential splicing, these observations in the 5' region of the gene are very much of interest. Figure 3B highlights a sequence of residuals that appear very different from the rest of the gene. We return to this observation in the next Section. Figure 3C shows the genomic context of the probe-level data and the known Ensembl transcripts for WNK1. The sequence of residuals showing the persistently low values suggests kidney-specific expression of the short transcript ENST00000340908, in agreement with the previously published result [8].

Related Work

To the best of our knowledge, this paper is the first attempt at using the Gene platform to investigate alternative splicing. Several methods have been proposed for the differential splicing analysis of Exon data, including the Splicing Index (SI) [10], pattern-based correlation (PAC) [17], microarray analysis of different splicing (MADS) [18] and finding isoforms using the robust multichip average algorithm (FIRMA) [19]. The SI forms a score that represents the difference between the gene-level summary (as fitted by an RMA-like algorithm) and an exon-level summary, requiring two estimation steps. Effectively, the method estimates probe effects twice independently, once with all probes for the gene and once with only the probes for a probeset. We do not see SI as a feasible approach with Gene data, since even if we were to create probesets that represent exons, often very few probes will be available and it will be difficult to get reliable estimates. The fewer probes per probeset will have a similar effect on applying FIRMA directly to Gene data. FIRMA fits the standard RMA model (as above) to all probes for a given gene and summarizes probeset-wise departures from the model through the residuals. With very few probes, the probeset summaries of residuals may not be precise. PAC, on the other hand, is an all-sample approach that scores each probeset on whether it correlates with the rest of a gene, over all samples. Simulation studies for a modest number of chips (e.g. 20) show that FIRMA and SI generally outperform PAC [19]. MADS is a new approach for Exon data that combines several steps, including probe selection and compensation for sequence-specific cross-hybridization effects. Though it has not been applied to the Gene platform, it appears that since calculations are done at the probe level and combined together to make inferences about probesets, it may be possible to adapt the method.

Differential splicing

Scoring persistence of residuals

The method presented in this paper differs from previous approaches in that we focus on identifying genes with possible alternative splice forms, instead of highlighting exons or probesets. This has a subtle statistical advantage in that the multiple testing penalty is considerably smaller.

As mentioned above, the residuals from the RMA model hold the key to finding differential splicing events. Instead of focusing on individual exons (and the organization of annotation that this requires), we score a persistent deviation from zero of adjacent residuals. The residuals are defined as:

$$r_{ij} = Y_{ij} - (\hat{\alpha}_i + \hat{\beta}_j) \qquad (2)$$

where $\hat{\alpha}_i$ and $\hat{\beta}_j$ are estimated using robust fits of Equation 1. In order to normalize across genes, we calculate standardized residuals $\tilde{r}_{ij} = r_{ij}/\mathrm{MAD}\{r_{uv}, u = 1, ..., N; v = 1, ..., J\}$, where MAD(.) is defined as 1.4826 times the median absolute deviation from 0 over all residuals for that gene and all samples. In the absence of alternative splicing, standardized residuals will have approximately unit variance. FIRMA [19] takes advantage of the Exon array design, where each PSR has 4 probes and residuals can be summarized at the probeset level. If a particular PSR is differentially spliced, then it is expected that most if not all probes for the PSR would have a large-in-magnitude residual (i.e. not fit well by the RMA model). For the Gene array, we are not guaranteed 4 probes per exon and, depending on the probes designed for a particular transcript, may have very little power to detect single exons that are differentially spliced. Since the performance of the summary will be related to the number of observations used to calculate it, we consider an alternative procedure. We take the approach of finding a persistence of residuals that are away from zero and in the same direction, thus entirely avoiding the non-uniformity of probes per exon. We only require that the probes are put in genomic order for the calculations below. Several adjacent probes, interrogating exon regions that are adjacent in an mRNA product, that show the same departure from the model are evidence of potential differential splicing. For example, Figure 3 illustrates that probes 2–8 of WNK1 all have strongly negative residuals. Such an observation is unlikely to occur by chance.

Figure 2: RMA decomposition of probe-level Affymetrix data. Panel A shows the background-adjusted and normalized probe-level data for PRRX1, from the Affymetrix mixture dataset (see Methods). The probes are displayed in the order in which they map to the human genome (not to scale), and lines join all probe intensities of the same sample. PRRX1 is expressed significantly higher in heart tissue compared to brain. Three replicates of pure heart tissue are shown as red lines; green lines represent pure brain tissue replicates and the blue lines represent a mixture of 75% brain tissue and 25% heart tissue. Panel B shows the estimated relative probe effects. Panel C shows the chip effects (i.e. summarized expression levels) and Panel D shows residuals, using the same colour scheme.

Figure 3: Differential splicing of WNK1. Panels A and B show the normalized data and residuals, respectively, of WNK1 for the Affymetrix tissue dataset (see Methods). The three replicates for human kidney tissue are shown as blue lines, and the remaining 10 tissues (30 samples) are shown with black lines. Panel C shows the set of exonic regions joined together in a gene model (green) and the three known Ensembl transcripts (blue). The blue lines linking Panels B and C illustrate the correspondence between probes and exons.
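The residual standardization described above can be sketched in a few lines. This is an illustrative Python fragment, not the authors' code; the function name `standardize_residuals` is hypothetical, and the input is assumed to be one gene's N x J matrix of RMA residuals (samples by probes).

```python
import numpy as np

def standardize_residuals(resid):
    """Scale one gene's residual matrix by its MAD about zero.

    The MAD here is 1.4826 times the median absolute deviation from 0
    (not from the median), taken over all samples and probes of the gene.
    """
    resid = np.asarray(resid, dtype=float)
    mad = 1.4826 * np.median(np.abs(resid))
    if mad == 0:
        raise ValueError("MAD is zero; residuals cannot be standardized")
    return resid / mad
```

Using a single scale per gene keeps the relative sizes of residuals within the gene intact while making scores comparable across genes.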

One possibility to highlight such persistence of residuals is an extreme value of the absolute values of all partial sums of adjacent residuals. A natural statistic, inspired by the monitoring of nuclear material unaccounted for (MUF) [20], is the maximum absolute partial sum:

$$\mathrm{MUF}_i = \max_{1 \le s \le e \le J} \left| \frac{1}{\sqrt{e - s + 1}} \sum_{j=s}^{e} \tilde{r}_{ij} \right| \qquad (3)$$

over the J(J + 1)/2 possible consecutive sums of J probes. This calculation is repeated separately for each gene, giving a score for each gene and each sample. That is, this approach can be applied in the absence of replicates. However, if replicates are available, we recommend precomputing a probe-wise summarized residual [equation not recoverable from this copy: a summary of the standardized residuals $\tilde{r}_{i(k)j}$ over the replicates k] and using this in place of $\tilde{r}_{ij}$ in Equation 3, where i(k) is the index of replicate k.
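The maximum absolute partial sum of Equation 3 can be sketched as follows. This is a plain Python illustration rather than the authors' implementation; `muf_score` is a hypothetical name, the input is assumed to be one sample's standardized residuals in genomic probe order, and the returned run uses 0-based inclusive indices.

```python
import numpy as np

def muf_score(z):
    """Largest |sum of a consecutive run| / sqrt(run length).

    Returns the score and the (start, end) indices of the run attaining it.
    All J(J + 1)/2 consecutive runs are examined via cumulative sums.
    """
    z = np.asarray(z, dtype=float)
    J = z.size
    csum = np.concatenate(([0.0], np.cumsum(z)))
    best, best_run = 0.0, (0, 0)
    for s in range(J):
        for e in range(s, J):
            val = abs(csum[e + 1] - csum[s]) / np.sqrt(e - s + 1)
            if val > best:
                best, best_run = val, (s, e)
    return best, best_run
```

Returning the run alongside the score is a convenience of this sketch: it shows which stretch of probes drives a high score, which is useful when relating a detection back to exon annotation.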

The MUF statistic is very flexible. An extreme MUF statistic can result from a single probe if it is extreme enough. But, it can also highlight a subtle change that persists across any number of probes if the score is deemed to be extreme. Notice that the denominator of the partial sums is the square root of the number of data points. This ensures the variances (of the sum) are constant, thus putting all the partial sums on an even footing.
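The constant-variance property can be checked with a short derivation. It assumes, as an idealization not stated explicitly in the text, that in the absence of differential splicing the standardized residuals of a sample (written $\tilde{r}_{ij}$ here) are uncorrelated with unit variance; $m$ denotes the number of probes in a run from probe $s$ to probe $e$.

```latex
\operatorname{Var}\!\left(\frac{1}{\sqrt{m}}\sum_{j=s}^{e}\tilde{r}_{ij}\right)
  = \frac{1}{m}\sum_{j=s}^{e}\operatorname{Var}\!\left(\tilde{r}_{ij}\right)
  = \frac{m}{m} = 1,
\qquad m = e - s + 1 .
```

Under this assumption a run of any length has the same null variance. For instance, four adjacent standardized residuals each equal to -2 give a scaled sum of |-8|/2 = 4, which can be compared directly with a single residual of magnitude 4.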

As the number of probes increases, there are more partial sums to consider, making the distribution of maximum order statistics more likely to take on more extreme values. To account for this, we repeatedly sample J probes from the empirical distribution of all standardized residuals and calculate the MUF score, giving a null distribution of MUF scores for J probes. A false discovery rate can be calculated for the discoveries above a given quantile of the null distributions.
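The resampling procedure above can be sketched as follows. This is a hedged Python illustration, not the authors' code: the names `muf`, `null_muf` and `empirical_pvalue` are hypothetical, the number of draws and the seed are arbitrary, and the empirical p-value is an addition of this sketch (the text itself describes thresholding at a quantile and computing a false discovery rate, which is not shown here).

```python
import numpy as np

def muf(z):
    """Maximum absolute scaled partial sum over all consecutive runs."""
    z = np.asarray(z, dtype=float)
    csum = np.concatenate(([0.0], np.cumsum(z)))
    s, e = np.triu_indices(z.size)   # all index pairs with s <= e
    return np.max(np.abs(csum[e + 1] - csum[s]) / np.sqrt(e - s + 1))

def null_muf(pool, J, n_draws=1000, seed=0):
    """Null MUF scores for genes with J probes.

    Each draw samples J values with replacement from the pooled
    standardized residuals, which removes any positional structure.
    """
    rng = np.random.default_rng(seed)
    pool = np.asarray(pool, dtype=float).ravel()
    return np.array([muf(rng.choice(pool, size=J, replace=True))
                     for _ in range(n_draws)])

def empirical_pvalue(observed, null_scores):
    """Share of null scores at least as large as the observed score."""
    return (1 + np.sum(null_scores >= observed)) / (1 + null_scores.size)
```

A threshold such as the 95th percentile of the null scores for a given J can then be obtained with `np.quantile(null_scores, 0.95)`, so that genes with many probes are judged against a correspondingly more extreme reference.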

We call this approach FIRMAGene, since it is only a small modification to FIRMA [19], in terms of operating on residuals from an RMA fit, but is applied to the Affymetrix Gene 1.0 ST platform and scores differential splicing at the gene level instead of the probeset level.

Results

Validation of using the Gene platform for splicing

We first validate the approach of using the Gene platform for differential splice detection by comparing the residuals for a gene known to express a vastly different isoform in human brain [21], using the publicly available data of the same tissue RNA hybridized to both the Gene and Exon platforms. Figure 4 shows residual plots for MBP (myelin basic protein) and highlights a very distinct pattern in the brain samples. This pattern is observed almost identically from the 36 probes represented on the Gene array or 72 probes from the Exon array. The exact splicing mechanism is not as apparent as in the previous example (WNK1, Figure 3), but it is straightforward to put the probe-level data in the context of known genome annotation.

For a genome-wide comparison, we matched the probes from the Gene array to the Affymetrix-defined probesets of the Exon array, allowing us to run FIRMA on the Gene platform. Note that we are not advocating the use of FIRMA for the Gene platform, although we do highlight that it can be done and allows us to make the comparison. See Methods for further description of procedures used to construct the annotation. FIRMA scores are calculated for Gene and Exon data, generating a table of scores by probeset and sample, one for each platform. Note that the summaries for the two platforms are often from different numbers of probes and therefore have different precision. Table 1 gives a cross-tabulation of the numbers of probes for both platforms amongst matched probesets. The Exon array most often has 4 probes per probeset, whereas the Gene platform most often has 1 or 2 probes. In some cases (e.g. genes with few exons), Gene will have more than 4 probes. Taking the average of FIRMA scores over the 3 brain replicates, Figure 5 illustrates convincing genome-wide evidence that extreme residuals observed on the Exon array are also observed on the Gene array (correlation r = 0.53 over more than 230,000 Exon probesets). This is especially promising considering the majority of summarized sets of residuals will be centred close to 0. Shown in Figure 5 are summaries for the brain replicates from each platform, since brain tissue is expected to exhibit more alternative splicing than most other tissues.
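The cross-platform comparison above (average the scores over replicates, then correlate across matched probesets) can be sketched as below. This is an illustrative Python fragment under stated assumptions: the two score tables are taken to be arrays with matched probesets in rows and samples in columns, in the same order on both platforms, and the function name `replicate_mean_correlation` is hypothetical.

```python
import numpy as np

def replicate_mean_correlation(exon_scores, gene_scores, cols):
    """Pearson correlation of replicate-averaged scores across probesets.

    cols lists the column indices of the replicates to average
    (for example the three brain hybridizations).
    """
    x = np.asarray(exon_scores, dtype=float)[:, cols].mean(axis=1)
    y = np.asarray(gene_scores, dtype=float)[:, cols].mean(axis=1)
    ok = np.isfinite(x) & np.isfinite(y)   # drop probesets with missing scores
    return np.corrcoef(x[ok], y[ok])[0, 1]
```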

Next, we were interested to determine whether Gene data is able to detect a significant proportion of the differential splicing events that FIRMA detects on Exon data. The tissue panel dataset, where the same source of RNA is hybridized to both platforms, is an ideal test set for this comparison. We applied FIRMA to the Exon data and FIRMAGene to the Gene data. We compared the top 100 probeset-tissue scores from FIRMA to the corresponding gene-tissue scores from FIRMAGene, as shown in Figure 6. The vast majority of the MUF statistics are large in magnitude, suggesting that the Gene platform is quite capable of detecting similar differential splicing events. In fact, 86 of the 100 corresponding gene-tissue scores have MUF statistics more extreme than the 95th percentile of their null distribution. On the other hand, because FIRMA gives (sub-)exon-level and FIRMAGene gives gene-level statistics, there may be some cases where the scores do not correspond. For example, 4 of the 14 MUF statistics that are not extreme have no Gene probes represented in the region where the Exon probes are. Furthermore, since the MUF score is an extreme value statistic, there may be a set of probes within the gene that are more extreme in the opposite direction, as shown in 8 of the 14 non-extreme MUF scores. Overall, this analysis suggests that the Gene platform will be quite promising for the analysis of differential splicing.

Tissue panel dataset

The publicly available 11 tissue panel dataset, where the same human tissue RNA was run on both the Gene and Exon platforms in a single laboratory by Affymetrix, provides an ideal testing ground for the methodology and for illustrating some of the features of whole-transcript microarray data. Although there are many individual examples in the literature, there is no readily available positive control set of tissue-specific alternative splice events that can be used for benchmarking. However, tables of EST-based predictions exist. A rigorous comparison of EST predictions and microarray analysis of alternative splicing events is beyond the scope of this study. Instead, we calculate scores genome-wide (using Gene data) across the 11 tissues and show that many of the top ranking scores have been observed previously to either have tissue-specific variants or tissue-specific expression patterns.

Figure 7 shows the genome-wide scores, stratified by the number of probes for each gene. The plot shows only the genes that have between 10 and 70 probes (nearly 95% of the genes on the array). Because genes with more probes have more partial sums to consider, the maximum gently increases with the number of probes per gene. The two examples shown earlier, WNK1 for kidney tissue and MBP for brain tissue, have high scores, as highlighted. Table 2 shows the top scoring gene-tissue combinations. Of the top 20 gene-tissue scores, many have previous evidence of tissue-specific behaviour. Plots of the normalized data and residuals can be found in Additional file 1, in addition to a list of publications corroborating the tissue-specific evidence.

Some tissues have considerably more differential splice detections. For example, of the top 1000 gene-tissue scores (see Additional file 2), the top three tissues are testis (295), brain (258) and liver (116). This corresponds with previous EST studies where brain, liver and testis have the highest percentage of alternatively spliced genes [22].

Figure 4: Normalized probe-level data and RMA residuals for MBP. Panels A and B show the residuals for Gene and Exon for RMA fits, respectively. There are 36 probes for Gene and 72 probes for Exon. Both panels show 33 lines, one for each hybridization (11 tissues with 3 biological replicates each). The brain and muscle replicates are shown as blue and red lines, respectively.

BMC Bioinformatics 2009, 10:156 http://www.biomedcentral.com/1471-2105/10/156

Conclusion and Discussion

We have proposed a novel scoring method called FIRMAGene, based on decomposing probe-level microarray data with a linear model. The major motivation for this

work is to provide an extra investigation, in addition to

differential expression analysis, thus giving researchers

added value from their collected data. The design of the

latest generation of Affymetrix expression array facilitates

this. Using a public tissue panel dataset, we show the

method highlights many previously known and poten-

tially new differential tissue-specific splice events and

shows strong correspondence with the Exon array over the

same RNA samples. The strategy we propose can be

applied directly to the Affymetrix Human, Mouse and Rat

Gene 1.0 ST platforms, or any other whole-transcript platform that exhibits probe-specific effects. Although we have not investigated this thoroughly, FIRMAGene may also be useful for the Exon array. It comes at some additional computational cost, since there are more probes (and therefore even more partial sums), but it may be better able to highlight smaller, but persistent, changes in adjacent probesets. The procedure can operate in a single-sample mode or can make use of replicates. Technical or biological replicates can be used, although significant detections from the latter will give more generalizable results. One subtle advantage of FIRMAGene is that scoring by gene instead of by exon results in a much smaller penalty for multiple testing.

Figure 5: A comparison of FIRMA scores for Gene and Exon platforms. Each point in the scatter plot represents an Exon probeset that has been matched to probes on the Gene array. The X-axis gives the averaged (over brain replicates) FIRMA score for Exon data. The Y-axis gives the average FIRMA score for the corresponding Gene samples.

The Gene platform will be used in various profiling stud-

ies and this work simply provides an additional analysis

that will be of interest. The approach is not without limi-

tations. The Gene platform only covers well-annotated

exons, whereas the Exon platform covers a considerable

amount of additional content, based on either EST evi-

dence or computational predictions. However, having no

features representing predicted exonic content has some

advantages. For example, in the analysis of Exon data, it is

not always clear whether to include all probesets (for well-

characterized and predicted exons together) in the RMA

model fit. The MADS approach uses a computational

probe selection for this [18]. In many cases, the probes for

content with weak evidence are not used for the primary

analysis [19]. Since the well-characterized exonic content

on the Exon array only represents approximately 20% of

all features, the selection of probesets to include may have

a large impact on the differential splice detections. In

addition, for short genes, the Gene platform will generally

have more probes than the Exon array, giving potentially

higher power to detect new variants.

Since we are scoring a gene over all partial sums of probes,

the MUF score is very flexible. It simultaneously searches

for extreme residuals over any number of adjacent probes,

including a single probe if it is extreme enough. There are

variations of the MUF score that may be worth pursuing

for a more refined mapping of differential splice events.

For example, it is generally unreasonable that all residuals for a single sample will be non-zero. It may suffice to consider only partial sums of length less than $\lfloor J/2 \rfloor$, for example. Another variation would be to target specific patterns.

For example, SLC25A3 (solute carrier family 25, member

3) has a very distinct mutually exclusive differential splic-

ing pattern (see Additional file 1). If this or other distinc-

tive patterns were of particular interest, the scoring of

adjacent residuals could be tailored towards it.
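The search over partial sums of adjacent residuals can be sketched in a few lines. The Python code below is illustrative only and is not the authors' R implementation. The function name `muf_score`, the `max_len` argument and, in particular, the scaling of each partial sum by the square root of its length (so that runs of different lengths are comparable) are assumptions made here; the exact definition of the MUF score is not restated in this section.

```python
from math import sqrt


def muf_score(residuals, max_len=None):
    """Return (score, start, end) for the most extreme run of adjacent residuals.

    The run covers residuals[start:end]. The score keeps its sign.
    Assumption: each partial sum is divided by sqrt(run length).
    """
    n = len(residuals)
    if n == 0:
        raise ValueError("need at least one residual")
    if max_len is None:
        max_len = n  # by default, consider runs of every length

    # prefix[k] holds the sum of the first k residuals, so any
    # partial sum is a difference of two prefix values.
    prefix = [0.0]
    for r in residuals:
        prefix.append(prefix[-1] + r)

    best = (0.0, 0, 0)
    for start in range(n):
        for end in range(start + 1, min(n, start + max_len) + 1):
            value = (prefix[end] - prefix[start]) / sqrt(end - start)
            if abs(value) > abs(best[0]):
                best = (value, start, end)
    return best


# Three adjacent probes with large positive standardized residuals.
example = [0.1, -0.2, 3.0, 2.5, 2.8, 0.0, -0.1]
score, start, end = muf_score(example)
print(round(score, 3), start, end)
```

Passing a smaller `max_len` restricts the search to shorter runs, which is one way the length restriction discussed above could be tried. A single extreme probe is still found, because runs of length one are included.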

It is difficult to know in what experimental circumstances

the Gene platform and a procedure such as FIRMAGene

will be most successful. We have shown FIRMAGene can

be useful in a panel of tissues, where in general the major-

ity of samples exhibit the same probe-level pattern and

only a small number of samples differ. We expect the pro-

cedure will be useful even in a balanced two group com-

parison, where differential isoform usage would still

present as a persistent departure from the linear model.

However, there may be limitations in the robust fitting for

probe effects in cases where the probe intensities are split

into two distinct groups. One possible option would be to

use existing Gene data (e.g. from a public source), in order

to stabilize the probe effect determination. We have not

investigated this thoroughly. As mentioned above, microarrays are only able to detect differential splicing, so in order to detect such events, there needs to be enough variation amongst the samples for a pattern to stand out. Depending on the strength of the difference and the number of probes represented on the array for the alternative splicing event (which can vary from gene to gene), a large sample size may be required.

Identifying departures through residuals from the RMA

model will not always be perfect. Some departures from

the RMA linear model may not be alternative splicing at

all. In some cases, large residuals may result from cross-hybridizing probes, from probes that have a different range of intensity, or, for example, from an exon that is not expressed in any of the samples in combination with strong differential expression. It may be

possible to compensate for cross-hybridization, as dem-

onstrated recently (see [23,24]). With relevance to studies

involving human populations, it has been recently shown

that single nucleotide polymorphisms can significantly

affect probe-level Exon data [25]. In addition, a resource

has now been created to track Exon probes that may be

affected [26]. Individual probe performance aside, we argue that most of the detected examples are biologically meaningful. These problems are not isolated to FIRMAGene; they reflect the challenge of designing methodology that operates over a range of probe behaviours. Other procedures, such as SI, MADS or PAC, if they were adapted to the Gene platform, would need to deal effectively with these same challenges.

Table 1: Number of probes per matched probeset. After taking a subset of common probesets, the numbers of probes for each matched Exon probe selection region are given.

No. of Exon Probes    No. of Gene Probes: 1 2 3 4 5

168020000

226544191 6 1 1

3279427881155 11 7

4 87092 83621 24219 12301 4019

Figure 6: FIRMAGene scores for the top 100 FIRMA scores for the Affymetrix tissue panel dataset. The X-axis gives the FIRMA score (calculated on the Affymetrix Exon tissue panel dataset) for the top 100 probesets with 4 probes. The Y-axis gives the signed MUF scores (calculated on the Affymetrix Gene tissue panel dataset) for the corresponding genes. Filled circles correspond to MUF scores that are more extreme than the 95th percentile of the permutation-based null distribution.

There are a number of other issues that we are aware of, but that are beyond the scope of this investigation. For example, in some cases, the probes for a gene are overlapping. This may induce a correlation between residuals of neighbouring probes. The current model assumes independence for all probes and makes no compensation for this.

As evidenced by the top ranked genes, our current scoring

scheme seems reasonable and does highlight interesting

cases. Despite the limitations mentioned above, this

research highlights an additional avenue of investigation

beyond differential expression that is freely available at a

minimal additional computational cost.

Methods

Datasets

The mixture dataset used for illustration of RMA (Figure

2) and the tissue panel datasets (Gene and Exon) were

run by Affymetrix and made publicly available (see http://www.affymetrix.com). Briefly, the mixture dataset comprises 33 total samples, 3 technical replicates each of 11

separate mixtures. The tissue panel datasets use the same

RNA on both the Gene and Exon platforms. Again, there

are 33 total samples representing 3 biological replicates of

each of the following human tissues: brain, thyroid,

breast, pancreas, prostate, heart, skeletal muscle, kidney,

testis, spleen and liver.

Figure 7: FIRMAGene scoring for the Affymetrix tissue panel dataset. Each point in the plot gives a FIRMAGene score for a tissue-gene combination. The X-axis gives the number of probes and the Y-axis gives the raw MUF score. Jitter is added in the X dimension. The simulated null distribution is shown as blue boxplots. The scores for gene-tissue combinations shown in Figures 3 and 4 are highlighted.

Table 2: Top scoring tissue-specific differential splicing

candidates.

ID Sample Score* Symbol

7922737 Testis 24.76 C1orf14

8086077 Brain 21.84 CLASP2

8086842 Brain 20.99 MAP4

7957746 SkMus 19.76 SLC25A3

7957746 Heart 19.41 SLC25A3

8165653 Heart 18.72 --

8166876 Testis 18.29 DDX3X

8064191 Brain 18.08 TPD52L2

8007188 Brain 18.01 CNP

7922627 Kidney 18.01 NPHS2

8170215 Liver 17.93 F9

8100458 Testis 17.84 PDCL2

7962194 Testis 17.78 LOC440093

7940971 Testis 17.56 KCNK4

8176419 Testis 17.18 TSPY2

8155203 Brain 16.97 CLTA

8170390 Brain 16.84 --

8023889 Brain 16.79 MBP

8176544 Testis 16.64 TSPY1

8024194 Testis 16.32 GPX4

The Affymetrix Human Gene 1.0 ST identifiers and gene symbols are

given for the top 20 tissue-gene combinations.

*See Methods.


Data processing

All data processing was performed in the open source statistical package R [27], and the methods implemented in this paper are available from the authors as an R package, operating on objects created using the aroma.affymetrix package [28]. Chip definition files (CDFs) were created for both arrays, based on library files and annotation made available by Affymetrix, using the Bioconductor [29] affxparser package. To facilitate alternative splicing analysis, probe collections are organized in a gene-centric fashion, so that probes from all known isoforms of a gene can be analyzed within a single framework (i.e. fit with the RMA model). For the Exon platform, we used core probesets only. For the Gene array, the coordinates of the probes are matched to the Exon probeset coordinates, so that summaries for the same regions can be compared. Some probes for the Gene array, however, fall outside the Exon probe selection regions (PSRs). These are still kept within the Gene probe collection, but are not used for the comparison.
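For readers unfamiliar with the gene-level fit mentioned above, the RMA model writes each log2 probe intensity as a sample (expression) effect plus a probe effect plus a residual. As a minimal illustration, the Python sketch below fits this additive model by Tukey's median polish, the procedure used in standard RMA [15]. This choice is an assumption made for illustration: the software described here performs its robust fit through aroma.affymetrix and may use a different robust estimator. The function name and the toy data are hypothetical.

```python
from statistics import median


def median_polish(table, n_iter=10):
    """Fit table[i][j] ~ sample_effect[i] + probe_effect[j] by median polish.

    table: one row per sample, one column per probe (log2 intensities).
    Returns (sample_effects, probe_effects, residuals).
    The overall level is absorbed into the sample effects.
    """
    resid = [list(row) for row in table]
    n_rows, n_cols = len(resid), len(resid[0])
    sample_eff = [0.0] * n_rows
    probe_eff = [0.0] * n_cols
    for _ in range(n_iter):
        # Sweep row medians out of the residuals into the sample effects.
        for i in range(n_rows):
            m = median(resid[i])
            sample_eff[i] += m
            resid[i] = [v - m for v in resid[i]]
        # Sweep column medians out of the residuals into the probe effects.
        for j in range(n_cols):
            m = median([resid[i][j] for i in range(n_rows)])
            probe_eff[j] += m
            for i in range(n_rows):
                resid[i][j] -= m
    return sample_eff, probe_eff, resid


# Toy gene: 3 samples, 5 probes, perfectly additive except one probe
# in the first sample that is 3 units higher than the model predicts.
samples = [8.0, 9.0, 10.0]
probes = [0.0, 1.0, -1.0, 0.5, -0.5]
toy = [[s + p for p in probes] for s in samples]
toy[0][1] += 3.0

sample_eff, probe_eff, resid = median_polish(toy)
print(sample_eff)
print(resid[0])
```

Because medians are used at each sweep, the single aberrant probe does not distort the estimated effects and its departure appears in the residuals, which is the property the scoring relies on.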

Running FIRMAGene consists of the following steps:

1) fit the RMA probe-level model robustly for each gene;

2) standardize the residuals by dividing by the gene-wise MAD, and summarize over residuals if replicates are used;

3) calculate the maximum MUF score for each sample;

4) given the number of probes for a gene, sample a large number of vectors of residuals of the same length (from the empirical distribution of all residuals) and calculate the MUF score on each one to generate the null distribution;

5) at a given cutoff, calculate the false discovery rate.

An example R script for running these steps on the tissue dataset is provided in Additional file 3.
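Steps 2 to 5 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' R code (which is provided in Additional file 3). The assumptions are: the gene-wise MAD is computed over all residuals of the gene; replicates are summarized by their mean; the MUF score is taken as the largest absolute partial sum scaled by the square root of the run length; and the false discovery rate is estimated as the expected number of null calls divided by the number of observed calls. All function names are hypothetical, and step 1 (the robust RMA fit) is assumed to have already produced the residuals.

```python
import random
from math import sqrt
from statistics import median


def mad(values):
    """Median absolute deviation (unscaled)."""
    m = median(values)
    return median([abs(v - m) for v in values])


def standardize(residuals_by_sample):
    """Step 2a: divide all residuals of one gene by the gene-wise MAD."""
    pooled = [r for sample in residuals_by_sample for r in sample]
    scale = mad(pooled)
    if scale == 0:
        raise ValueError("MAD is zero; cannot standardize")
    return [[r / scale for r in sample] for sample in residuals_by_sample]


def summarize_replicates(replicates):
    """Step 2b: probe-wise mean over replicates (assumed summary)."""
    return [sum(col) / len(replicates) for col in zip(*replicates)]


def muf(residuals):
    """Step 3: largest |sum of adjacent residuals| / sqrt(run length)."""
    best = 0.0
    for start in range(len(residuals)):
        total = 0.0
        for end in range(start, len(residuals)):
            total += residuals[end]
            best = max(best, abs(total) / sqrt(end - start + 1))
    return best


def null_scores(residual_pool, n_probes, n_draws=1000, seed=0):
    """Step 4: MUF scores of vectors resampled from the empirical residuals."""
    rng = random.Random(seed)
    return [
        muf([rng.choice(residual_pool) for _ in range(n_probes)])
        for _ in range(n_draws)
    ]


def estimated_fdr(observed, null, cutoff):
    """Step 5: expected null calls divided by observed calls (assumed estimator)."""
    n_called = sum(s >= cutoff for s in observed)
    if n_called == 0:
        return 0.0
    null_rate = sum(s >= cutoff for s in null) / len(null)
    return min(1.0, null_rate * len(observed) / n_called)
```

Because the null distribution depends only on the number of probes, it can be generated once per gene length and reused for every gene of that length.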

The score reported in Table 2 standardizes the tissue-gene score against the permutation-based null distribution: the null mean is subtracted and the result is divided by the null standard deviation.
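The standardization used for Table 2 amounts to a z-score against the null distribution. A minimal Python sketch follows; the function name is hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the text does not say which form was used.

```python
from statistics import mean, stdev


def standardized_score(observed, null_scores):
    """Subtract the null mean and divide by the null standard deviation."""
    mu = mean(null_scores)
    sigma = stdev(null_scores)  # sample standard deviation (assumed)
    if sigma == 0:
        raise ValueError("null distribution has zero spread")
    return (observed - mu) / sigma


# Toy null scores with mean 2 and standard deviation 1.
print(standardized_score(5.0, [1.0, 2.0, 3.0]))
```

With a large number of null draws, the choice between the sample and population standard deviation makes little practical difference.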

Authors' contributions

MDR conceived the original idea, analyzed the data,

implemented the software and wrote the paper. TPS

refined the statistical analysis and directed the project.

Additional material

Acknowledgements

The authors wish to thank Henrik Bengtsson for various discussions

regarding implementation using aroma.affymetrix data structures. In addi-

tion, we wish to thank Anna Tsykin, Mark Cowley and Paul Spellman for

helpful comments on an earlier version of the manuscript and Elizabeth Pur-

dom for discussions regarding FIRMA.

References

1. Johnson JM, Castle J, Garrett-Engele P, Kan Z, Loerch PM, Armour

CD, Santos R, Schadt EE, Stoughton R, Shoemaker DD: Genome-

wide survey of human alternative pre-mRNA splicing with

exon junction microarrays. Science 2003, 302(5653):2141-2144.

2. Pan Q, Shai O, Lee LJ, Frey BJ, Blencowe BJ: Deep surveying of

alternative splicing complexity in the human transcriptome

by high-throughput sequencing. Nat Genet 2008,

40(12):1413-1415.

3. Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C,

Kingsmore SF, Schroth GP, Burge CB: Alternative isoform regu-

lation in human tissue transcriptomes. Nature 2008,

456:470-476.

4. Schwerk C, Schulze-Osthoff K: Regulation of apoptosis by alter-

native pre-mRNA splicing. Mol Cell 2005, 19:1-13.

5. Singh N, Preiser P, Renia L, Balu B, Barnwell J, Blair P, Jarra W, Voza

T, Landau I, Adams JH: Conservation and developmental con-

trol of alternative splicing in maebl among malaria parasites.

J Mol Biol 2004, 343(3):589-599.

6. Grabowski PJ, Black DL: Alternative RNA splicing in the nerv-

ous system. Prog Neurobiol 2001, 65(3):289-308.

7. Venables JP: Aberrant and alternative splicing in cancer. Cancer

Res 2004, 64(21):7647-7654.

8. Xu Q, Modrek B, Lee C: Genome-wide detection of tissue-spe-

cific alternative splicing in the human transcriptome. Nucleic

Acids Res 2002, 30(17):3754-3766.

9. Gupta S, Zink D, Korn B, Vingron M, Haas SA: Strengths and

weaknesses of EST-based prediction of tissue-specific alter-

native splicing. BMC Genomics 2004, 5:72.

10. Clark TA, Schweitzer AC, Chen TX, Staples MK, Lu G, Wang H, Wil-

liams A, Blume JE: Discovery of tissue-specific exons using com-

prehensive human exon microarrays. Genome Biol 2007,

8(4):R64.

11. Gardina PJ, Clark TA, Shimada B, Staples MK, Yang Q, Veitch J, Sch-

weitzer A, Awad T, Sugnet C, Dee S, Davies C, Williams A, Turpaz Y:

Alternative splicing and differential gene expression in colon

cancer detected by a whole genome exon array. BMC Genom-

ics 2006, 7:325.

12. Robinson MD, Speed TP: A comparison of Affymetrix gene

expression arrays. BMC Bioinformatics 2007, 8:449.

13. Bemmo A, Benovoy D, Kwan T, Gaffney DJ, Jensen RV, Majewski J:

Gene expression and isoform variation analysis using

Affymetrix Exon Arrays. BMC Genomics 2008, 9:529.

Additional file 1

Plots and corroborating evidence for the top 20 gene-tissue scores. Probe-level data and residuals for the top 20 gene-tissue scores, from applying FIRMAGene to the Affymetrix tissue panel dataset. Additionally, links to various corroborating evidence of tissue-specific splicing or expression.

[http://www.biomedcentral.com/content/supplementary/1471-2105-10-156-S1.pdf]

Additional file 2

Top 1000 Gene-tissue scores for the tissue panel dataset. Table giving the probeset identifier, tissue sample, FIRMAGene score and gene symbol, after applying FIRMAGene to the Affymetrix tissue panel dataset.

[http://www.biomedcentral.com/content/supplementary/1471-2105-10-156-S2.zip]

Additional file 3

Example R script for FIRMAGene (R). Source code example to run FIRMAGene on the Affymetrix tissue panel dataset.

[http://www.biomedcentral.com/content/supplementary/1471-2105-10-156-S3.zip]


14. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM,

Haussler D: The human genome browser at UCSC. Genome

Res 2002, 12(6):996-1006.

15. Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP:

Summaries of Affymetrix GeneChip probe level data. Nucleic

Acids Res 2003, 31(4):e15.

16. Marazzi A: Algorithms, Routines and S Functions for Robust Statistics

Pacific Grove, California: Wadsworth & Brooks/Cole; 1993.

17. French PJ, Peeters J, Horsman S, Duijm E, Siccama I, Bent MJ van den,

Luider TM, Kros JM, Spek P van der, Smitt PAS: Identification of dif-

ferentially regulated splice variants and novel exons in glial

brain tumors using exon expression arrays. Cancer Res 2007,

67(12):5635-5642.

18. Xing Y, Stoilov P, Kapur K, Han A, Jiang H, Shen S, Black DL, Wong

WH: MADS: a new and improved method for analysis of dif-

ferential alternative splicing by exon-tiling microarrays. RNA

2008, 14(8):1470-1479.

19. Purdom E, Simpson KM, Robinson MD, Conboy JG, Lapuk AV, Speed

TP: FIRMA: a method for detection of alternative splicing

from exon array data. Bioinformatics 2008, 24(15):1707-1714.

20. Speed TP, Culpin D: The Role of Statistics in Nuclear Materials

Accounting: Issues and Problems. Journal of the Royal Statistical

Society Series A 1986, 149:281-313.

21. de Ferra F, Engh H, Hudson L, Kamholz J, Puckett C, Molineaux S, Laz-

zarini RA: Alternative splicing accounts for the four forms of

myelin basic protein. Cell 1985, 43(3 Pt 2):721-727.

22. Yeo G, Holste D, Kreiman G, Burge CB: Variation in alternative

splicing across human tissues. Genome Biol 2004, 5(10):R74.

23. Huang JC, Morris QD, Hughes TR, Frey BJ: GenXHC: a probabil-

istic generative model for cross-hybridization compensation

in high-density genome-wide microarray data. Bioinformatics

2005, 21(Suppl 1):i222-i231.

24. Kapur K, Jiang H, Xing Y, Wong WH: Cross-Hybridization Mod-

eling on Affymetrix Exon Arrays. Bioinformatics 2008.

25. Zhang W, Duan S, Bleibel W, Wisel S, Huang R, Wu X, He L, Clark

T, Chen T, Schweitzer A, Blume J, Dolan M, Cox N: Identification

of common genetic variants that account for transcript iso-

form variation between human populations. Hum Genet 2008,

125:81-93.

26. Duan S, Zhang W, Bleibel WK, Cox NJ, Dolan ME: SNPinProbe

1.0: A database for filtering out probes in the Affymetrix

GeneChip(R) Human Exon 1.0 ST array potentially affected

by SNPs. Bioinformation 2008, 2(10):469-470.

27. R Development Core Team: R: A Language and Environment for Statistical Computing 2008 [http://www.R-project.org]. R Foundation for Statistical Computing, Vienna, Austria ISBN 3-900051-07-0

28. Bengtsson H, Simpson K, Bullard J, Hansen K: aroma.affymetrix: An R framework for analyzing small to large Affymetrix data sets in bounded memory. Tech Report 745, Department of Statistics, University of California, Berkeley 2008.

29. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S,

Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W,

Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G,

Smith C, Smyth G, Tierney L, Yang JYH, Zhang J: Bioconductor:

open software development for computational biology and

bioinformatics. Genome Biol 2004, 5(10):R80.

Additional file 3

Data

May 2009

Mark D Robinson · Terence P Speed


iGEMS: An integrated model for identification of alternative exon usage events

Article

Full-text available

Apr 2016
NUCLEIC ACIDS RES

DNA microarrays and RNAseq are complementary methods for studying RNA molecules. Current computational methods to determine alternative exon usage (AEU) using such data require impractical visual inspection and still yield high false-positive rates. Integrated Gene and Exon Model of Splicing (iGEMS) adapts a gene-level residuals model with a gene size adjusted false discovery rate and exon-level analysis to circumvent these limitations. iGEMS was applied to two new DNA microarray datasets, including the high coverage Human Transcriptome Arrays 2.0 and performance was validated using RT-qPCR. First, AEU was studied in adipocytes treated with (n = 9) or without (n = 8) the anti-diabetes drug, rosiglitazone. iGEMS identified 555 genes with AEU, and robust verification by RT-qPCR (∼90%). Second, in a three-way human tissue comparison (muscle, adipose and blood, n = 41) iGEMS identified 4421 genes with at least one AEU event, with excellent RT-qPCR verification (95%, n = 22). Importantly, iGEMS identified a variety of AEU events, including 3′UTR extension, as well as exon inclusion/exclusion impacting on protein kinase and extracellular matrix domains. In conclusion, iGEMS is a robust method for identification of AEU while the variety of exon usage between human tissues is 5–10 times more prevalent than reported by the Genotype-Tissue Expression consortium using RNA sequencing.

A Computational Analysis of Alternative Splicing across Mammalian Tissues Reveals Circadian and Ultradian Rhythms in Splicing Events

Article

Full-text available

Aug 2019
INT J MOL SCI

Mounting evidence points to a role of the circadian clock in the temporal regulation of post-transcriptional processes in mammals, including alternative splicing (AS). In this study, we carried out a computational analysis of circadian and ultradian rhythms on the transcriptome level to characterise the landscape of rhythmic AS events in published datasets covering 76 tissues from mouse and olive baboon. Splicing-related genes with 24-h rhythmic expression patterns showed a bimodal distribution of peak phases across tissues and species, indicating that they might be controlled by the circadian clock. On the output level, we identified putative oscillating AS events in murine microarray data and pairs of differentially rhythmic splice isoforms of the same gene in baboon RNA-seq data that peaked at opposing times of the day and included oncogenes and tumour suppressors. We further explored these findings using a new circadian RNA-seq dataset of human colorectal cancer cell lines. Rhythmic isoform expression patterns differed between the primary tumour and the metastatic cell line and were associated with cancer-related biological processes, indicating a functional role of rhythmic AS that might be implicated in tumour progression. Our data shows that rhythmic AS events are widespread across mammalian tissues and might contribute to a temporal diversification of the proteome.

Differential splicing across immune system lineages

Article

Aug 2013

Comprehensive analysis of prognostic genes in gastric cancer

Article

Full-text available

Oct 2021

Background: Gastric cancer is associated with high mortality, and effective methods for predicting prognosis are lacking. We aimed to identify potential prognostic markers associated with the development of gastric cancer through bioinformatic analyses. Methods: Gastric cancer-associated gene expression profiles were obtained from The Cancer Genome Atlas and Gene Expression Omnibus databases. The key genes involved in the development of gastric cancer were obtained by differential expression analysis, coexpression analysis, and short time-series expression miner (STEM) analysis. The potential prognostic value of differentially expressed genes was further evaluated using a Cox regression model and risk scores. Hierarchical clustering was applied to validate the impact of key genes on the overall survival of gastric cancer patients. Results: A total of 1381 genes were consistently dysregulated in the development of gastric cancer. Among them, 186 genes affected the overall survival of gastric cancer patients. The following genes had areas under the receiver operating characteristic curve greater than 0.9 in both datasets and were therefore considered key genes: ADAM12, CEP55, LRFN4, INHBA, ADH1B, DPT, FAM107A, and LOC100506388. LRFN4, DPT, and LOC100506388 were identified as potential prognostic genes for gastric cancer through a nomogram. Overexpression of LRFN4 and LOC100506388 was associated with a higher risk of gastric cancer. Finally, we found that tumors were infiltrated with high levels of Th2 cells and mast cells, and the infiltration levels were associated with overall survival in gastric cancer patients. Conclusions: We found that key dysregulated genes may have a prognostic value for the development of gastric cancer.

SUPPLEMENTARY DATA

Data

Full-text available

Jun 2015

Distribution of Alternatively Spliced Transcript Isoforms within Human and Mouse Transcriptomes

Article

Full-text available

Jan 2011

Transcriptomics for Clinical and Experimental Biology Research: Hang on a Seq

Article

Full-text available

Jan 2023

Sequencing the human genome empowers translational medicine, facilitating transcriptome-wide molecular diagnosis, pathway biology, and drug repositioning. Initially, microarrays are used to study the bulk transcriptome; but now short-read RNA sequencing (RNA-seq) predominates. Positioned as a superior technology, that makes the discovery of novel transcripts routine, most RNA-seq analyses are in fact modeled on the known transcriptome. Limitations of the RNA-seq methodology have emerged, while the design of, and the analysis strategies applied to, arrays have matured. An equitable comparison between these technologies is provided, highlighting advantages that modern arrays hold over RNA-seq. Array protocols more accurately quantify constitutively expressed protein coding genes across tissue replicates, and are more reliable for studying lower expressed genes. Arrays reveal long noncoding RNAs (lncRNA) are neither sparsely nor lower expressed than protein coding genes. Heterogeneous coverage of constitutively expressed genes observed with RNA-seq, undermines the validity and reproducibility of pathway analyses. The factors driving these observations, many of which are relevant to long-read or single-cell sequencing are discussed. As proposed herein, a reappreciation of bulk transcriptomic methods is required, including wider use of the modern high-density array data-to urgently revise existing anatomical RNA reference atlases and assist with more accurate study of lncRNAs.

Statistical method development and design of computational pipelines for differential analyses of high-throughput data, including DNA microarrays, RNA sequencing and HDCyto data

Article

Jan 2017

Malgorzata Franciszka Nowicka

Introduction to Microarrays Technology and Data Analysis

Chapter

Jan 2018
Compr Anal Chem

Microarrays, and more specifically RNA microarrays, are engines developed in the late 1990s to measure gene expression. Inspired in classic technologies like Northern blot, there exist different types of microarrays and different classifications are possible: (1) Spotted microarrays, where probes—the sequences whose expression is to be detected—are printed on a glass or plastic slide, vs oligonucleotide arrays, where probes are synthetized on the array. (2) Two-colour arrays, based on competitive hybridization vs one colour arrays, where each array hybridizes to one sample type. Today microarrays have evolved into a mature technology and have greatly increased their capacity: from a few hundreds or thousands of ESTs in the first arrays to millions of probes on most regions of the genome, covering exons, introns, and many other part of the genes and other transcriptome variants such as miRNAs and long coding RNAs. The analysis of microarray data has been an active field of development. It has been applied to a wide variety of problems such as selecting differentially expressed genes, building prognostic or diagnostic predictors or discovering groups in data. The analysis typically proceeds through a series of steps: data exploration, quality control, normalization, statistical analysis and biological significance or pathway analysis. There are many tools available for microarray data analysis. Some of these tools are commercial programs while others are open source software such as the Bioconductor libraries (http://bioconductor.org) based on the R Statistical language.

Title: Distribution and function of 3',5'-Cyclic-AMP phosphodiesterases in the human ovary.

Article

Jan 2015

The concentration of the important second messenger cAMP is regulated by phosphodiesterases (PDEs) and hence an attractive drug target. However, limited human data is available about the PDEs in the ovary. The aim of the present study was to describe and characterise the PDEs in the human ovary. Results were obtained by analysis of mRNA microarray data from follicles and granulosa cells (GCs), combined RT-PCR and enzymatic activity analysis in GCs, immunohistochemical analysis of ovarian sections and by studying the effect of PDE inhibitors on progesterone production from cultured GCs. We found that PDE3, PDE4, PDE7 and PDE8 are the major families present while PDE11A was not detected. PDE8B was differentially expressed during folliculogenesis. In cultured GCs, inhibition of PDE7 and PDE8 increased basal progesterone secretion while PDE4 inhibition increased forskolin-stimulated progesterone secretion. In conclusion, we identified PDE3, PDE4, PDE7 and PDE8 as the major PDEs in the human ovary. Copyright © 2015. Published by Elsevier Ireland Ltd.

Identification of Common Genetic Variants That Account for Transcript Isoform Variation Between Human Populations

Article

Full-text available

Feb 2009

In addition to the differences between populations in transcriptional and translational regulation of genes, alternative pre-mRNA splicing (AS) is also likely to play an important role in regulating gene expression and generating variation in mRNA and protein isoforms. Recently, the genetic contribution to transcript isoform variation has been reported in individuals of recent European descent. We report here results of an investigation of the differences in AS patterns between human populations. AS patterns in 176 HapMap lymphoblastoid cell lines derived from individuals of European and African ancestry were evaluated using the Affymetrix GeneChip® Human Exon 1.0 ST Array. A variety of biological processes such as response to stimulus and transcription were found to be enriched among the differentially spliced genes. The differentially spliced genes also include some involved in human diseases that have different prevalence or susceptibility between populations. The genetic contribution to the population differences in transcript isoform variation was then evaluated by a genome-wide association using the HapMap genotypic data on single nucleotide polymorphisms (SNPs). The results suggest that local and distant genetic variants account for a substantial fraction of the observed transcript isoform variation between human populations. Our findings provide new insights into the complexity of the human genome as well as the health disparities between the two populations.

Addendum: Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing

Jul 2009
Nat Genet

Gene Expression and Isoform Variation Analysis Using Affymetrix Exon Arrays

Feb 2008
BMC GENOMICS

Alternative splicing and isoform level expression profiling is an emerging field of interest within genomics. Splicing sensitive microarrays, with probes targeted to individual exons or exon-junctions, are becoming increasingly popular as a tool capable of both expression profiling and finer scale isoform detection. Despite their intuitive appeal, relatively little is known about the performance of such tools, particularly in comparison with more traditional 3' targeted microarrays. Here, we use the well studied Microarray Quality Control (MAQC) dataset to benchmark the Affymetrix Exon Array, and compare it to two other popular platforms: Illumina, and Affymetrix U133. We show that at the gene expression level, the Exon Array performs comparably with the two 3' targeted platforms. However, the interplatform correlation of the results is slightly lower than between the two 3' arrays. We show that some of the discrepancies stem from the RNA amplification protocols, e.g. the Exon Array is able to detect expression of non-polyadenylated transcripts. More importantly, we show that many other differences result from the ability of the Exon Array to monitor more detailed isoform-level changes; several examples illustrate that changes detected by the 3' platforms are actually isoform variations, and that the nature of these variations can be resolved using Exon Array data. Finally, we show how the Exon Array can be used to detect alternative isoform differences, such as alternative splicing, transcript termination, and alternative promoter usage. We discuss the possible pitfalls and false positives resulting from isoform-level analysis. The Exon Array is a valuable tool that can be used to profile gene expression while providing important additional information regarding the types of gene isoforms that are expressed and variable. 
However, analysis of alternative splicing requires much more hands-on effort and visualization of results in order to correctly interpret the data, and generally results in considerably higher false positive rates than expression analysis. One of the main sources of error in the MAQC dataset is variation in amplification efficiency across transcripts, most likely caused by joint effects of elevated GC content in the 5' ends of genes and reduced likelihood of random-primed first strand synthesis in the 3' ends of genes. These effects are currently not adequately corrected using existing statistical methods. We outline approaches to reduce such errors by filtering out potentially problematic data.

Cross-Hybridization Modeling on Affymetrix Exon Arrays

Dec 2008
BIOINFORMATICS

Microarray designs have become increasingly probe-rich, enabling targeting of specific features, such as individual exons or single nucleotide polymorphisms. These arrays have the potential to achieve quantitative high-throughput estimates of transcript abundances, but currently these estimates are affected by biases due to cross-hybridization, in which probes hybridize to off-target transcripts. To study cross-hybridization, we map Affymetrix exon array probes to a set of annotated mRNA transcripts, allowing a small number of mismatches or insertions/deletions between the two sequences. Based on a systematic study of the degree to which probes with a given match type to a transcript are affected by cross-hybridization, we developed a strategy to correct for cross-hybridization biases of gene-level expression estimates. Comparison with Solexa ultra high-throughput sequencing data demonstrates that correction for cross-hybridization leads to a significant improvement of gene expression estimates. We provide mappings between human and mouse exon array probes and off-target transcripts, and provide software extending the GeneBASE program for generating gene-level expression estimates including the cross-hybridization correction, available at http://biogibbs.stanford.edu/~kkapur/GeneBase/.

Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing

Jan 2009
Nat Genet

We carried out the first analysis of alternative splicing complexity in human tissues using mRNA-Seq data. New splice junctions were detected in approximately 20% of multiexon genes, many of which are tissue specific. By combining mRNA-Seq and EST-cDNA sequence data, we estimate that transcripts from approximately 95% of multiexon genes undergo alternative splicing and that there are approximately 100,000 intermediate- to high-abundance alternative splicing events in major human tissues. From a comparison with quantitative alternative splicing microarray profiling data, we also show that mRNA-Seq data provide reliable measurements for exon inclusion levels.

Summaries of Affymetrix GeneChip probe level data

Feb 2003
NUCLEIC ACIDS RES

Rafael A Irizarry

High density oligonucleotide array technology is widely used in many areas of biomedical research for quantitative and highly parallel measurements of gene expression. Affymetrix GeneChip arrays are the most popular. In this technology each gene is typically represented by a set of 11–20 pairs of probes. In order to obtain expression measures it is necessary to summarize the probe level data. Using two extensive spike‐in studies and a dilution study, we developed a set of tools for assessing the effectiveness of expression measures. We found that the performance of the current version of the default expression measure provided by Affymetrix Microarray Suite can be significantly improved by the use of probe level summaries derived from empirically motivated statistical models. In particular, improvements in the ability to detect differentially expressed genes are demonstrated.

Genome-wide detection of tissue-specific alternative splicing in the human transcriptome

Sep 2002
NUCLEIC ACIDS RES

Qiang Xu

We have developed an automated method for discovering tissue‐specific regulation of alternative splicing through a genome‐wide analysis of expressed sequence tags (ESTs). Using this approach, we have identified 667 tissue‐specific alternative splice forms of human genes. We validated our muscle‐specific and brain‐specific splice forms for known genes. A high fraction (8/10) were reported to have a matching tissue specificity by independent studies in the published literature. The number of tissue‐specific alternative splice forms is highest in brain, while eye_retina, muscle, skin, testis and lymph have the greatest enrichment of tissue‐specific splicing. Overall, 10–30% of human alternatively spliced genes in our data show evidence of tissue‐specific splice forms. Seventy‐eight percent of our tissue‐specific alternative splices appear to be novel discoveries. We present bioinformatics analysis of several tissue‐specific splice forms, including automated protein isoform sequence and domain prediction, showing how our data can provide valuable insights into gene function in different tissues. For example, we have discovered a novel kidney‐specific alternative splice form of the WNK1 gene, which appears to specifically disrupt its N‐terminal kinase domain and may play a role in PHAII hypertension. Our database greatly expands knowledge of tissue‐specific alternative splicing and provides a comprehensive dataset for investigating its functional roles and regulation in different human tissues.

aroma.affymetrix: A generic framework in R for analyzing small to very large Affymetrix data sets in bounded memory

Jan 2008

We have developed a cross-platform open-source framework for analyzing Affymetrix data sets consisting of 1 to 1,000s of arrays. By working directly with CDF and CEL files (standard Affymetrix file formats) most chip types are automatically supported, e.g. expression, SNP, and exon arrays. The package provides methods for low-level analysis such as background correction of different kinds, allelic cross-talk calibration, quantile and affine normalization, PCR fragment-length and GC-content normalization, probe-level summarization such as robust log-additive and multiplicative modeling, as well as a set of methods for high-level analysis, e.g. chromosomal segmentation and alternative splicing. Results can be exported to dynamical HTML reports for easy navigation of a large set of arrays both offline and online. All algorithms have been optimized to run in bounded memory (as low as 500MB of RAM) by either redesigning the algorithms or by processing data in chunks. Transformed data and parameter estimates are stored on file in standard file formats, which in turn minimizes the memory overhead, but also makes them immediately accessible to other software. Moreover, storing intermediate results in persistent memory makes computationally expensive analyses more robust against system failures and allows for quick resumes. In addition to making common algorithms readily available, this package was designed to allow for quicker development of novel models and incorporation of existing ones, such as Bioconductor methods, and to be prepared for future chip types.

The Role of Statistics in Nuclear Materials Accounting: Issues and Problems

Jan 1986
J Roy Stat Soc

The paper gives a critical discussion of some statistical aspects of nuclear materials accounting. The discussion centres on the problem of setting statistical guidelines for testing for the diversion of nuclear material to unauthorized use. Existing test criteria and test procedures are criticized and alternative approaches are proposed.

Bioconductor: Open Software Development for Computational Biology and Bioinformatics

Sep 2004
GENOME BIOL

The Bioconductor project is an initiative for the collaborative creation of extensible software for computational biology and bioinformatics. We detail some of the design decisions, software paradigms and operational strategies that have allowed a small number of researchers to provide a wide variety of innovative, extensible software solutions in a relatively short time. The use of an object-oriented programming paradigm, the adoption and development of a software package system, designing by contract, distributed development and collaboration with other projects are elements of this project's success. Individually, each of these concepts is useful and important, but when combined they have provided a strong basis for rapid development and deployment of innovative and flexible research software for scientific computation. A primary objective of this initiative is achievement of total remote reproducibility of novel algorithmic research results.
