Genome-wide Discovery of Transcriptional Modules from DNA Sequence and Gene Expression

ABSTRACT

In this paper, we describe an approach for understanding transcriptional regulation from both gene expression and promoter sequence data. We aim to identify transcriptional modules: sets of genes that are co-regulated in a set of experiments, through a common motif profile. Using the EM algorithm, our approach refines both the module assignment and the motif profile so as to best explain the expression data as a function of transcriptional motifs. It also dynamically adds and deletes motifs, as required to provide a genome-wide explanation of the expression data. We evaluate the method on two Saccharomyces cerevisiae gene expression data sets, showing that our approach is better than a standard one at recovering known motifs and at generating biologically coherent modules. We also combine our results with binding localization data to obtain regulatory relationships with known transcription factors, and show that many of the inferred relationships have support in the literature.

Contact: eran@cs.stanford.edu
Keywords: probabilistic models, gene expression, transcriptional regulation.
*To whom correspondence should be addressed.
BIOINFORMATICS
Vol. 19 Suppl. 1 2003, pages i273–i282
DOI: 10.1093/bioinformatics/btg1038
E. Segal*, R. Yelensky and D. Koller
Computer Science Department of Stanford University, Stanford, CA 94305-9010,
USA
Received on January 6, 2003; accepted on February 20, 2003
INTRODUCTION
Many cellular processes are regulated at the transcriptional level, by one or more transcription factors that bind to short DNA sequence motifs in the upstream regions of the process genes. These co-regulated genes then exhibit similar patterns of expression. Given the upstream regions of all genes, and measurements of their expression under various conditions, we could hope to ‘reverse engineer’ the underlying regulatory mechanisms and identify transcriptional modules: sets of genes that are co-regulated under these conditions through a common motif or combination of motifs.
In this paper, we take a genome-wide approach for discovering this modular organization, based on the premise that transcriptional elements should ‘explain’ the observed expression patterns as much as possible. We define a probabilistic graphical model (Pearl, 1988) that integrates both the gene expression measurements and the DNA sequence data into a unified model. The model assumes that genes are partitioned into modules, which determine the gene’s expression profile. Each module is characterized by a motif profile, which specifies the relevance of different sequence motifs to the module. A gene’s module assignment is a function of the sequence motifs in its promoter region. However, our model does not assume that all motifs are necessarily active. In fact, as motifs are usually short, there are many genes where a motif is randomly present but does not play a role. Furthermore, our goal is to discover motifs that play a regulatory role in some particular set of experiments; a motif that is active in some settings may be completely irrelevant in others. Our model identifies motif targets: genes where the motif plays an active role in affecting regulation in a particular expression data set. These motif targets are genes that have the motif and that are assigned to modules containing the motif in their profile.
Our algorithm is outlined in Figure 1. It begins by
clustering the expression data, creating one module from
each of the resulting clusters. As the first attempt towards
explaining these expression patterns, it searches for a
common motif in the upstream regions of genes assigned
to the same module. It then iteratively refines the model,
trying to optimize the extent to which the expression
profile can be predicted transcriptionally. For example,
we might want to move a gene g whose promoter region
does not match its current module’s motif profile, to
another module whose expression profile is still a good
match, and whose motif profile is much closer. Given these
assignments, we could then learn better motif models and
motif profiles for each module. This refinement process
arises naturally within our algorithm, as a byproduct of the
expectation maximization (EM) algorithm for estimating
the model parameters.
Bioinformatics 19(Suppl. 1) © Oxford University Press 2003; all rights reserved.
[Figure 1: flow diagram. Pre-processing: data selection (expression data, upstream sequences). Transcriptional module discovery procedure: clustering and motif search, followed by an E-step/M-step loop over the gene partition and motif set with an add/delete motifs step, yielding transcriptional modules. Post-processing: annotation analysis, visualization and analysis, using gene annotations (GO) and protein complexes.]
Fig. 1. Schematic flow diagram of our proposed method. The pre-processing step includes selecting the input gene expression and upstream sequence data. The model is then trained using EM, and our algorithm for dynamically adding and deleting motifs. It is then evaluated on additional data sets.
In general, the motifs learned will not suffice to characterize
all of the modules. As our goal is to provide a
genome-wide explanation of the expression behavior, our
algorithm identifies poorly explained genes in modules
and searches for new motifs in their upstream regions. The
new motifs are then added to the model and subsequently
refined using EM. As part of this dynamic learning
procedure, some motifs may become obsolete and are
removed from the model. The algorithm iterates until
convergence, adding and deleting motifs, and refining
motif models and module assignments.
Our algorithm has several important advantages over
other attempts to relate upstream sequences and expression
data. First, we use both expression and sequence
data together, requiring that modules display a coherent
profile for both. This approach allows us to refine both the
cluster assignments and motifs within the same algorithm.
In contrast, many approaches (e.g. Brazma et al., 1998;
Liu et al., 2001; Roth et al., 1998; Sinha and Tompa,
2000; Tavazoie et al., 1999) use gene expression measurements
to define clusters of genes that are potentially
co-regulated, and then search for common motifs in
the upstream regions of the genes in each cluster. The
expression analysis and motif finding are thus decoupled,
and neither the clusters nor the motifs are re-evaluated
once they are learned. Other approaches (e.g. Bussemaker
et al., 2001; Pilpel et al., 2001) work in the opposite
direction, first identifying a set of candidate motifs, and
then trying to explain the expression using these motifs.
However, these approaches use a prespecified set of
motifs, which are never adapted during the algorithm.
Our approach is based on the framework of Segal et
al. (2002) but extends it in several important directions.
First, their approach made use of DNA localization
data, which are not widely available for all organisms
and under multiple growth conditions. In contrast, we
construct models that are based solely on sequence and
expression data, which are much easier to obtain. Second,
their approach used a predetermined number of motifs to
construct the model. To allow a genome-wide analysis, our
algorithm dynamically removes and adds motifs as needed
to explain the expression data as a whole. Finally, while
the models of Segal et al. (2002) allowed for detection of
context-specific regulation, the resulting structure is hard
to interpret. Our model assigns each gene to one module,
facilitating interpretability.
We tested our method on two distinct Saccharomyces cerevisiae expression datasets. We show that our learned models find motifs that account for a much larger fraction of the observed expression patterns in comparison to standard approaches that first cluster the expression profiles and then search for motifs in the upstream regions of the genes in each cluster. Our approach also recovers a much larger number of known motifs. We evaluated the functional coherence of our transcriptional modules using a gene functional annotation database and two protein complex databases that were not given to the model as input. We found enrichment for many more groups in our models compared to standard approaches, suggesting that our transcriptional modules are biologically more accurate. Finally, we used the recent binding assays of Lee et al. (2002) to relate the actual transcription factors to the modules they regulate, resulting in a regulatory network; we show that many of the regulatory relationships discovered have support in the literature.
PROBABILISTIC MODEL
The basic entities in our model are the genes in some set $G$. We assume that the genes are partitioned into a set of $K$ mutually exclusive and exhaustive transcriptional modules. Thus, each gene is associated with an attribute $M \in \{1, \ldots, K\}$ whose value represents the module to which the gene belongs. We now describe how these modules are related to expression profiles and to motif profiles.
Gene expression model For each gene $g$ in $G$, we have expression measurements $g.E_1, \ldots, g.E_J$, where $g.E_j$ represents the log ratio mRNA expression level measured for gene $g$ in experiment $j$. We assume that all of the genes in a single module exhibit the same gene expression behavior, and use the simple yet powerful Naive Bayes model (Cheeseman and Stutz, 1995) to represent this behavior. In this model, as applied in our setting, we assume that the expression measurements are conditionally independent given the module assignment:
$$P(E_1, \ldots, E_J \mid M) = \prod_{j=1}^{J} P(E_j \mid M).$$
As the expression measurements are real-valued, we model each conditional probability distribution $P(E_j \mid M = m)$ using a Gaussian distribution $N(\mu_{jm}; \sigma_{jm})$.
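As a minimal sketch of the expression model above, the log of $P(E_1, \ldots, E_J \mid M = m)$ is a sum of per-experiment Gaussian log densities. The function names and the toy numbers below are illustrative assumptions, not values from the paper.

```python
import math

def gaussian_log_pdf(x, mean, std):
    """Log density of a univariate Gaussian with the given mean and standard deviation."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)

def expression_log_likelihood(expr, means, stds):
    """log P(E_1..E_J | M = m): sum of independent per-experiment Gaussian terms."""
    return sum(gaussian_log_pdf(e, mu, sd) for e, mu, sd in zip(expr, means, stds))

# Toy data (hypothetical): one gene, three experiments, two candidate modules.
gene_expr = [1.0, -0.5, 2.0]
module_means = {1: [1.1, -0.4, 1.8], 2: [-1.0, 0.5, 0.0]}
module_stds = {1: [0.5, 0.5, 0.5], 2: [0.5, 0.5, 0.5]}
scores = {m: expression_log_likelihood(gene_expr, module_means[m], module_stds[m])
          for m in module_means}
best_module = max(scores, key=scores.get)  # module whose profile fits the gene best
```

Working in log space avoids numerical underflow when the number of experiments is large.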
Motif model The second key component in our model is a set of variables that represent the regulation of the gene by motifs. For each gene $g$, we have a set of binary-valued Regulates variables $R = \{R_1, \ldots, R_L\}$, where $g.R_i$ takes the value true if motif $i$ appears in the promoter region of gene $g$, allowing the motif to play a regulatory role in the gene’s expression. We model the motif using a standard position specific scoring matrix (PSSM; Bailey and Elkan, 1994; Roth et al., 1998), which assigns a weight to each position in the motif and each nucleotide $\ell \in \{A, C, G, T\}$; this weight represents the extent to which the nucleotide’s presence in this position is associated with the motif.
[Figure 2: graphical model with sequence variables g.S_1, g.S_2, g.S_3, Regulates variables g.R_1, g.R_2, module variable g.M and expression variables g.E_1, g.E_2, g.E_3; the panels label the distributions P(g.R | g.S) and P(g.E | g.M).]
Fig. 2. Illustration of our unified probabilistic model, for a simple example with upstream regions of length three, two sequence motifs, three possible module assignments and three expression measurements for each gene.

We use the discriminative PSSM approach of Segal et al. (2002), which trains the PSSM weights to discriminate as much as possible between the presence and the absence of the motif. This approach provides better predictions, and entirely avoids the problems arising from high-frequency but meaningless motifs that are common in many upstream sequences. This model is specified using a standard binary logistic model. We have $p$ position-specific weights $w_j[\ell]$, one for each position $j$ and each letter $\ell \in \{A, C, G, T\}$, and a threshold $w_0$. For a promoter sequence of length $n$, we assume that binding occurs once, and with equal probability at each of the $n - p + 1$ possible positions in the sequence. The probability of binding given the sequence is then specified simply as:
$$P(g.R = \text{true} \mid S_1, \ldots, S_n) = \operatorname{logit}\left(\log\left(\frac{w_0}{n - p + 1} \sum_{j=1}^{n-p+1} \exp\left\{\sum_{i=1}^{p} w_i[S_{i+j-1}]\right\}\right)\right).$$
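The binding probability above can be sketched in a few lines. This sketch assumes that "logit" in the formula denotes the logistic function $1/(1+e^{-x})$, as the phrase "standard binary logistic model" suggests; the helper `consensus_weights` and all numeric weights are hypothetical.

```python
import math

def binding_probability(seq, weights, w0):
    """P(R = true | S) under the logistic PSSM model.

    seq     -- promoter sequence over A/C/G/T, of length n
    weights -- list of p dicts; weights[i][letter] is the weight at motif position i
    w0      -- positive threshold parameter (it sits inside the log, as in the formula)
    """
    n, p = len(seq), len(weights)
    num_positions = n - p + 1
    total = 0.0
    for j in range(num_positions):  # every possible binding position
        score = sum(weights[i][seq[j + i]] for i in range(p))
        total += math.exp(score)
    x = math.log(w0 / num_positions * total)
    return 1.0 / (1.0 + math.exp(-x))  # logistic function (assumed meaning of "logit")

def consensus_weights(consensus, hit=2.0, miss=-2.0):
    """Toy PSSM: `hit` for the consensus letter at each position, `miss` otherwise."""
    return [{c: (hit if c == letter else miss) for c in "ACGT"} for letter in consensus]

weights = consensus_weights("ACG")
p_with = binding_probability("TTACGTT", weights, w0=1.0)     # contains ACG
p_without = binding_probability("TTTTTTT", weights, w0=1.0)  # no match
```

Because the sum runs over all positions, a single strong match dominates the total, which is what lets the model tolerate long promoters with one true site.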
Regulation model We define the motif profile of a transcriptional module to be a set of weights $u_{mi}$, one for each motif, such that $u_{mi}$ specifies the extent to which motif $i$ plays a regulatory role in module $m$. Roughly speaking, the strength of the association of a gene $g$ with a module $m$ is $\sum_{i=1}^{L} g.R_i\, u_{mi}$. The stronger the association of a gene with a module, the more likely it is to be assigned to it. We model this using a softmax conditional distribution, a standard extension of the binary logistic conditional distribution to the multi-class case:
$$P(g.M = \bar{m} \mid R_1 = r_1, \ldots, R_L = r_L) = \frac{\exp\left\{\sum_{i=1}^{L} u_{\bar{m}i}\, r_i\right\}}{\sum_{m'=1}^{K} \exp\left\{\sum_{i=1}^{L} u_{m'i}\, r_i\right\}}.$$
As we expect a motif to be active in regulating only a small set of modules in a given setting, we limit the number of weights $u_{1i}, \ldots, u_{Ki}$ that are non-zero to some $h \ll K$. This restriction results in a sparse weight matrix for $P(M \mid R)$, and ensures that each regulator affects at most $h$ modules. In addition, for interpretability considerations, we require all weights to be non-negative. Intuitively, this means that a gene’s assignment to specific transcriptional modules can only depend on features that correspond to the presence of certain motifs and not on the absence of motifs. For a module $m$, the set of motifs $i$ whose weights $u_{mi}$ are non-zero is called the motif profile of $m$.
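The softmax distribution over modules can be illustrated with a small sketch. The weight matrix below is a made-up example in which each motif has a non-zero, non-negative weight in a single module, mirroring the sparsity constraint described above.

```python
import math

def module_posterior(r, u):
    """P(M = m | R = r) for every module m, via a softmax over sum_i u[m][i] * r[i].

    r -- list of 0/1 Regulates values, one per motif
    u -- K x L list of non-negative weights; u[m][i] is the weight of motif i in module m
    """
    scores = [sum(w * ri for w, ri in zip(row, r)) for row in u]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Three modules, two motifs (hypothetical weights).
u = [[3.0, 0.0],
     [0.0, 3.0],
     [0.0, 0.0]]
probs = module_posterior([1, 0], u)    # gene with motif 0 only
uniform = module_posterior([0, 0], u)  # gene with no active motif
```

With all weights non-negative, a gene with no active motifs is spread uniformly over the modules, so only the presence of a motif can pull it toward a module.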
Unified model These three components are put together as a probabilistic graphical model, as shown in Figure 2; the model defines the following joint distribution:
$$P(g.R, g.M, g.E \mid g.S) = \prod_{i=1}^{L} P(g.R_i \mid g.S) \cdot P(g.M \mid g.R) \cdot \prod_{j=1}^{J} P(g.E_j \mid g.M),$$
where each of the above conditional probability distributions is parameterized as described in the previous sections.
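It can help to view the joint distribution above on a log scale, where the product becomes a sum of three groups of terms (motif, regulation and expression). This is a direct restatement of the factorization, not an additional modeling assumption:

```latex
\log P(g.R, g.M, g.E \mid g.S)
  = \sum_{i=1}^{L} \log P(g.R_i \mid g.S)
  + \log P(g.M \mid g.R)
  + \sum_{j=1}^{J} \log P(g.E_j \mid g.M)
```

Since the logarithm is monotone, comparing candidate assignments by this sum gives the same ranking as comparing them by the joint probability itself.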
LEARNING THE MODELS
In the previous section, we presented our probabilistic
model. We now turn to the task of learning this model from
data. Our data set $D$ consists of a set of genes $G$, where
for each gene $g \in G$ we have a set of gene expression
measurements $g.e_j$ for $j = 1, \ldots, J$ and a DNA sequence
$g.S$ in the upstream region of the transcription start site
for $g$. For this section, we restrict attention to a fixed
number of motifs, and address the problem of estimating
the model parameters to fit the data. The model parameters
to be estimated are: the means and variances of the
normal distributions of the expression model, the softmax
weights and structure of the module assignments (i.e.
which sequence motifs each module depends on), and the
PSSM weights for each sequence motif.
We follow the standard approach of maximum likelihood
estimation: we find the parameters θ that maximize
P(D | θ). Our learning task is made considerably more
difficult by the fact that both the module assignment g.M
and the Regulates variables g.R are unobserved in the
training data. In this case, the likelihood function has
multiple local maxima, and no general method exists for
finding the global maximum. We thus use the Expectation
Maximization (EM) algorithm (Dempster et al., 1977),
which provides an approach for finding a local maximum
of the likelihood function.
Starting from an initial guess $\theta^{(0)}$ for the parameters,
EM iterates the following two steps. The E-step computes
the distribution over the unobserved variables given the
observed data and the current estimate of the parameters.
We use the hard assignment version of the EM algorithm,
where this distribution is used to select a likely completion
of the hidden variables. The M-step then re-estimates the
parameters by maximizing the likelihood with respect to
the completion computed in the E-step. This estimation
task differs for the different parts of the model.
E-step: inferring modules and regulation Our task in the
E-step is to compute the distribution over the unobserved
data, which in our setting means computing $P(g.M, g.R \mid g.E, g.S)$.
As genes are assumed to be independent,
this computation can be done separately for each gene.
However, although the softmax distribution for P(g.M |
g.R) has a compact parameterization, inference using this
distribution is still exponential in the number of Regulates
variables. Even if only a small number of these variables
is associated with any single module, for the purpose of
module assignment we need to consider all of the variables
associated with any module; this number can be quite
large, rendering exact inference intractable.
We devise a simple approximate algorithm for doing
this computation, which is particularly well-suited for
our setting. It exploits our expectation that, while a large
number of sequence motifs determine module assignment,
only a small number of motifs regulate a particular
transcriptional module. Consequently, given the module
assignment for a gene, we expect a small number of
Regulates variables for that gene to take the value true.
Our approximate algorithm therefore searches greedily for a small number of Regulates variables to activate for each module assignment. For each gene $g$, it considers every possible module assignment $m$, and finds a good assignment to the Regulates variables given that $g.M = m$. This assignment is constructed in a greedy way, by setting $g.R$ variables to true one at a time, as long as $P(g.M, g.R, g.E \mid g.S)$ improves. The joint setting for $g.M$ and $g.R$ which gives the overall best likelihood is then selected as the (approximate) most likely assignment. For the remainder of this section, let $g.\bar{m}$ and $g.\bar{r}_1, \ldots, g.\bar{r}_L$ represent the values selected for $g.M$ and $g.R_1, \ldots, g.R_L$ respectively by the E-step. Full details of the algorithm are given in Figure 3a.
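The greedy search described in this paragraph might look roughly as follows. This sketch follows the prose description rather than reproducing Figure 3a line by line, and it assumes a caller-supplied scoring function `joint_log_prob(m, r)` (a hypothetical name) that returns the log of the joint probability for module `m` and a 0/1 list `r`.

```python
def greedy_e_step(num_modules, num_motifs, joint_log_prob):
    """Greedy search for a good (module, Regulates) assignment for one gene."""
    best_m, best_r, best_score = None, None, float("-inf")
    for m in range(num_modules):
        r = [0] * num_motifs  # start with every Regulates variable off
        score = joint_log_prob(m, r)
        improved = True
        while improved:  # keep switching variables on while the score improves
            improved = False
            for i in range(num_motifs):
                if r[i]:
                    continue
                r[i] = 1
                trial = joint_log_prob(m, r)
                if trial > score:
                    score, improved = trial, True
                else:
                    r[i] = 0  # revert: this variable did not help
        if score > best_score:
            best_m, best_r, best_score = m, list(r), score
    return best_m, best_r, best_score

def toy_score(m, r):
    """Hypothetical score: module 1 is rewarded for motif 0 and penalised for motif 1."""
    base = [-5.0, -4.0][m]
    return base + (2.0 * r[0] - 1.0 * r[1] if m == 1 else -1.0 * sum(r))

result = greedy_e_step(2, 2, toy_score)
```

Each accepted change switches one more variable on, so the inner loop runs at most `num_motifs` passes per module and the search avoids enumerating all $2^L$ settings.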
M-step: expression model Given the assignments of genes to modules as computed in the E-step, the maximum likelihood setting for the parameters of the expression model Gaussian distributions has a closed form solution. Letting $N_m$ be the number of genes assigned to module $m$, we have that the mean and variance of the Gaussian for experiment $j$ given module assignment $m$ are
$$\mu_{jm} = \frac{1}{N_m} \sum_{g \in G \,:\, g.\bar{m} = m} g.e_j \quad\text{and}\quad \sigma^2_{jm} = \frac{1}{N_m} \sum_{g \in G \,:\, g.\bar{m} = m} g.e_j^2 - \mu_{jm}^2.$$
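The closed-form update above amounts to per-module, per-experiment averages. A minimal sketch, with illustrative gene names and values that are not from the paper:

```python
from collections import defaultdict

def fit_expression_model(expr, assignment):
    """Maximum likelihood mean and variance per module and experiment.

    expr       -- dict gene -> list of J expression values
    assignment -- dict gene -> module chosen in the E-step
    """
    by_module = defaultdict(list)
    for gene, module in assignment.items():
        by_module[module].append(expr[gene])
    means, variances = {}, {}
    for module, rows in by_module.items():
        n = len(rows)
        cols = list(zip(*rows))  # one tuple of values per experiment
        means[module] = [sum(c) / n for c in cols]
        # mean of squares minus squared mean, matching the closed form in the text
        variances[module] = [sum(x * x for x in c) / n - mu * mu
                             for c, mu in zip(cols, means[module])]
    return means, variances

expr = {"g1": [1.0, 2.0], "g2": [3.0, 2.0], "g3": [0.0, -1.0]}
assignment = {"g1": 1, "g2": 1, "g3": 2}
means, variances = fit_expression_model(expr, assignment)
```

Note that a module with a single gene, or with identical values in some experiment, gets a variance of zero; a practical implementation would probably need a variance floor, which is an assumption on our part and not something the text specifies.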
[Figure 3 pseudocode, panels (a)–(c).]

(a)
For each gene g ∈ G
    Set g.M = 1
    Set g.R_i = false for 1 ≤ i ≤ L
    Set p = P(g.M, g.R | g.S, g.E)
    For m = 1 to K                      // for all modules
        Repeat                          // find g.R_i that increases p
            Set p_best = p
            For i = 1 to L              // for all Regulates variables
                Set g.R_i = true
                p' = P(g.M = m, g.R | g.S, g.E)
                if p' > p
                    Set g.M = m
                    Set p = p'
                else
                    Set g.R_i = false
        Until p_best = p

(b)
Set U = {}
Set iteration = 0
Let V = {v_mi}, 1 ≤ m ≤ K, 1 ≤ i ≤ L
Set MaxScore = max_V Score[V]           // MaxScore = score of unconstrained fit
Set T = threshold for closeness to MaxScore
Repeat
    Set iteration = iteration + 1
    Let U' = {u'_mi}, 1 ≤ m ≤ K, 1 ≤ i ≤ L, − U
    Set U' = argmax_{U' ≥ 0} Score[U', U]   // optimize weights not in U; weights in U fixed
    For i = 1 to L                      // for all Regulates variables
        Let m = argmax_m {u'_mi}, 1 ≤ m ≤ K
        Set U = U ∪ {u'_mi}             // add new non-zero weight
    Set U = argmax_{U ≥ 0} Score[U, 0]  // reoptimize weights in U; other weights = 0
Until iteration = max iteration or Score[U] ≥ MaxScore − T

(c)
Delete:
For i = 1 to L                          // for all Regulates variables
    Set U' = U
    Set u'_mi = 0 for 1 ≤ m ≤ K
    If Score[U] − Score[U'] ≤ threshold
        Delete R_i
        Set U = U'
Add:
For m = 1 to K                          // for all modules
    Let G' = {}
    For each g s.t. g.m̄ = m
        Set g.r̄' = argmax_{r̄'} P(g.r̄' | g.S)
        Set g.m' = argmax_{m'} P(g.M = m' | g.R = g.r̄')
        If m' ≠ m
            Set G' = G' ∪ {g}
    Learn motif with positive set G'
    Add new Regulates variable with learned PSSM

Fig. 3. (a) Search procedure for E-step of EM. (b) Learning the softmax distribution for $P(g.M \mid g.R)$ in the M-step. (c) Procedure for dynamically deleting and adding Regulates variables. In (b) and (c), $U$ denotes the non-zero weights of $P_U(g.M \mid g.R)$, and $\mathrm{Score}[U] = \sum_{g \in G} \log P_U(g.\bar{m} \mid g.\bar{r})$.

M-step: motif model We want the motif model to be a good predictor of the assignment $\bar{r}$ to the Regulates variables computed in the E-step. Thus, for each $R_i$, we aim to find the values of the parameters $w_0, w_j[\ell]$ that maximize the conditional log probability $\sum_{g \in G} \log P(g.\bar{r}_i \mid g.S_1, \ldots, g.S_n)$. Unfortunately, this optimization problem has no closed form solution, and there are many local maxima. We therefore use conjugate gradient ascent to find a local optimum in the parameter space. Conjugate gradient starts from an initial guess of the weights $w^{(0)}$. As for all local hill climbing methods, the quality of the starting point has a huge impact on the quality of the local optimum found by the algorithm. We therefore initialize the weights using the method of Barash et al. (2001), which efficiently generates motif seeds of length 6–15 and then scores them using the hypergeometric significance test. Each seed produced by this method is then expanded to produce a PSSM of the desired length, whose weights serve as an initialization point for the conjugate gradient procedure.
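To illustrate the kind of hypergeometric significance test mentioned above, the sketch below computes an upper-tail p-value for a candidate seed. How Barash et al. (2001) define the counts is not described in this text, so the interpretation in the docstring (genes in a positive set that contain the seed, versus all genes) is an assumption, and the function name is hypothetical. It relies on `math.comb`, available from Python 3.8.

```python
from math import comb

def hypergeom_tail(k, n, K, N):
    """P(X >= k) for a hypergeometric draw.

    N -- total number of genes
    K -- genes (out of N) whose promoter contains the seed
    n -- size of the positive set (e.g. genes in one module)
    k -- genes in the positive set that contain the seed
    """
    upper = min(n, K)
    hits = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return hits / comb(N, n)

# Toy numbers (hypothetical): 10 genes, 3 contain the seed, positive set of 4, 2 of them hit.
p_value = hypergeom_tail(2, 4, 3, 10)
```

A small p-value means the seed is over-represented in the positive set relative to what random sampling would give, which makes it a plausible starting point for the gradient procedure.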
M-step: regulation model Finally, we consider the task of estimating the parameters for the distribution $P(g.M \mid g.R)$. Our goal is to find a setting for the softmax weights $\{u_{mi}\}_{1 \le m \le K,\, 1 \le i \le L}$ so as to maximize the conditional log probability $\sum_{g \in G} \log P(g.M = g.\bar{m} \mid g.R = g.\bar{r})$. Although this optimization does not have a closed form solution, the function is concave in the weights of the softmax. Thus, any local maximum is also a global maximum, which we can find using gradient ascent.
However, as we discussed in the previous section, we also constrain this weight matrix to be sparse and each weight to be non-negative. These constraints lead to more desirable models, but also turn our task into a hard combinatorial optimization problem. We use a greedy selection algorithm that tries to include non-zero weights for the most predictive motifs for each Regulates variable $R_i$. The algorithm, shown in Figure 3b, first finds the optimal setting to the full weight matrix; as we discussed, the optimal setting can be found using gradient ascent. For each variable $R_i$, it then selects the most predictive motif (the one whose weight is largest) and adds it to the motif profile $U$, which contains motifs that have non-zero weight. The optimal setting for the weights in $U$ is then found by optimizing these weights, under the constraint that each weight in $U$ is non-negative and the weights not in $U$ must be zero. This problem is also convex, and can be solved using gradient methods. The algorithm then continues to search for additional motifs to include in the profile $U$. It finds the optimal setting to all weights while holding the weights in $U$ fixed; it then selects the highest weight motifs not in $U$, adds them to $U$, and repeats. Weights are added to $U$ until the sparseness limit is reached, or until the addition of motifs to $U$ does not improve the overall score.
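The selection step of this greedy procedure can be sketched on its own, leaving the weight optimization to a separate routine. The sketch assumes a fitted weight matrix is already available as input; the function name is hypothetical, and skipping non-positive weights is our reading of the non-negativity constraint rather than something the text states.

```python
def add_top_weight_per_motif(weights, selected):
    """For each motif i, add the index (m, i) of its largest not-yet-selected weight.

    weights  -- K x L matrix from the latest fit (hypothetical input)
    selected -- set of (module, motif) pairs already allowed to be non-zero
    Returns a new set; pairs whose weight is not positive are skipped.
    """
    updated = set(selected)
    num_modules, num_motifs = len(weights), len(weights[0])
    for i in range(num_motifs):
        candidates = [(weights[m][i], m) for m in range(num_modules)
                      if (m, i) not in selected]
        if not candidates:
            continue
        value, m = max(candidates)
        if value > 0.0:
            updated.add((m, i))
    return updated

# Toy fit (hypothetical): three modules, two motifs.
fit = [[0.2, 1.5],
       [0.9, 0.0],
       [0.1, 0.3]]
round_one = add_top_weight_per_motif(fit, set())
round_two = add_top_weight_per_motif(fit, round_one)
```

In the full procedure the matrix would be re-fitted between rounds with the selected weights held fixed, so `fit` would change from one call to the next; the example reuses it only to keep the sketch short.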
DYNAMICALLY ADDING AND REMOVING
SEQUENCE MOTIFS
In the previous section, we showed how to optimize the
model parameters given a fixed set of motifs. We now
wish to devise a dynamic learning algorithm, capable of
both removing and adding sequence motifs as part of the
learning process. As we learn the models, some motifs
may turn out not to be predictive, or to be redundant given the
newly discovered motifs. Conversely, some modules may
not be well explained by sequence motifs, so that new
motifs should be added.
We add and remove motifs after each completion of the EM algorithm. (Note that EM itself iterates several times between the E-step and the M-step.) To determine whether $R_i$ should be deleted, we compute the conditional log probability $\sum_{g \in G} \log P(g.\bar{m} \mid g.\bar{r})$ both with and without $g.R_i$, leaving the values of other Regulates variables fixed. This computation tells us the contribution that $R_i$ makes towards the overall fit of the model. Variables that contribute below a certain threshold are subsequently removed from the model.
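The deletion test described above compares the fit with and without one Regulates variable. A minimal sketch, assuming the E-step assignments are given as `(module, r)` pairs and using made-up weights; the function names are hypothetical:

```python
import math

def log_softmax_prob(m, r, u):
    """log P(M = m | R = r) under softmax weights u (K x L)."""
    scores = [sum(w * ri for w, ri in zip(row, r)) for row in u]
    top = max(scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return scores[m] - log_z

def motif_contribution(i, genes, u):
    """Drop in total log probability when motif i is switched off for every gene.

    genes -- list of (module, r) pairs from the E-step, r being a 0/1 list
    """
    with_motif = sum(log_softmax_prob(m, r, u) for m, r in genes)
    without = 0.0
    for m, r in genes:
        r_off = list(r)
        r_off[i] = 0  # other Regulates variables keep their values
        without += log_softmax_prob(m, r_off, u)
    return with_motif - without

u = [[2.0, 0.0],
     [0.0, 0.0]]
genes = [(0, [1, 0]), (0, [1, 1]), (1, [0, 1])]
gain_motif0 = motif_contribution(0, genes, u)  # motif 0 helps explain module 0
gain_motif1 = motif_contribution(1, genes, u)  # motif 1 has no non-zero weight
```

Switching $R_i$ off for every gene has the same effect on the score as zeroing the weights $u_{mi}$ for all modules, which is how Figure 3c phrases the test. A motif whose contribution falls below the chosen threshold would then be a candidate for removal.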
We try to add motifs when the current set of motifs does not provide a satisfactory explanation of the expression data: when there are genes for which the sequence predictions do not match the expression profile. We define the residual for a transcriptional module $m$ to be the set of genes that are assigned to module $m$ in the E-step, but would not be assigned to $m$ based on the sequence alone. We determine the sequence-only assignment of each gene by computing
$$g.\bar{r}' = \arg\max_{r} P(g.r \mid g.S) \quad\text{and}\quad g.m' = \arg\max_{m} P(g.M = m \mid g.R = g.\bar{r}').$$
We then attempt to provide a better prediction for the residual genes by adding a sequence motif that is trained to match these genes. Once a new Regulates variable is added, it becomes part of the model and its assignment and parameterization are adapted as part of the next EM iteration, as described in the previous section. This process tests whether a new motif contributes to the overall model fit, and may assign it a non-zero weight. Importantly, a motif that was trained for the residuals of one module often gets non-zero weights for other modules as well, allowing the same motif to participate in multiple modules. Full details of the algorithm are given in Figure 3c.
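The residual of a module, as defined above, is a simple set difference once both assignments are known. In this sketch the sequence-only prediction is taken as given (the two argmax steps are left to the caller), and the gene names are made up:

```python
def residual_genes(em_assignment, sequence_only_assignment):
    """Group genes whose sequence-only module differs from their E-step module.

    Both arguments map gene -> module. The result maps each module to the
    genes assigned to it in the E-step but not predicted for it from sequence.
    """
    residuals = {}
    for gene, module in em_assignment.items():
        if sequence_only_assignment.get(gene) != module:
            residuals.setdefault(module, []).append(gene)
    return residuals

em_assignment = {"g1": 1, "g2": 1, "g3": 2}
sequence_only = {"g1": 1, "g2": 2, "g3": 2}
residuals = residual_genes(em_assignment, sequence_only)
```

Each non-empty residual list would then serve as the positive set for learning one new motif.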
RESULTS
Models learned We evaluated our method separately on two different S. cerevisiae gene expression datasets, one consisting of 173 microarrays, measuring the responses to various stress conditions (Gasch et al., 2000), and another consisting of 77 microarrays, measuring expression during cell cycle (Spellman et al., 1998). We also obtained the 500 bp upstream region of each gene (sequences were retrieved from SGD; Cherry et al., 1998).
The EM algorithm requires an initial setting to all parameters. We use the standard procedure for learning motifs from expression data to initialize the model parameters: we first cluster the expression profiles, resulting in a partition of genes to clusters, and then learn a motif for each of the resulting clusters. For clustering the expression, we use the probabilistic hierarchical clustering algorithm of Segal et al. (2001). For learning motifs, we use the motif finder described above. To specify the initial parameterization of our model, we treat these clusters and motifs as if they were the result of an E-step, assigning a value to all of the variables g.M and g.R, and learn the model parameters as described above.
For the stress data, we use 1010 genes which showed a significant change in expression, excluding members of the generic stress response cluster (Gasch et al., 2000). We initialized 20 modules using standard clustering, and learned the associated 20 sequence motifs. From this starting point, the algorithm converged after 5 iterations, each consisting of an EM step and a motif addition/deletion step, resulting in a total of 49 motifs. For the cell cycle data, we learned a model with 15 clusters over the 795 cell cycle genes defined by Spellman et al. (1998). The algorithm converged after 6 iterations, ending with 27 motifs.

[Figure 4: line plot of the number of genes correctly predicted (y-axis, 150 to 350) against the number of iterations (x-axis, 1 to 6), with one series for the cell cycle model and one for the stress model.]
Fig. 4. Number of genes whose module assignment can be correctly predicted based on sequence alone, where a correct prediction is one that matches the module assignment when the expression is included. Predictions are shown for each iteration of the learning procedure.
Predicting expression from sequence Our approach aims
to explain expression data as a function of sequence
motifs. Hence, one metric for evaluating a model is its
ability to associate genes with modules based on their
promoter sequence alone. Specifically, we compare the
module assignment of each gene when we consider only
the sequence data to its module assignment considering
both expression and sequence data. Figure 4 shows the
total number of genes whose expression-based module
assignment is correctly predicted using only the sequence,
as the algorithm progresses and sequence motifs are
added. As can be seen, the predictions improve across
the learning iterations, and significantly outperform the
standard approach (which is iteration 1). Ultimately, our
model converges to 340 and 296 genes correctly predicted
in the cell cycle and stress models, respectively, compared
to 158 and 152 for the standard approach.
Gene expression coherence These results indicate that
our model assigns genes to modules such that genes
assigned to the same module are generally enriched for
the same motifs. However, we can achieve such an orga-
nization by simply assigning genes to modules based only
on their sequence, while entirely ignoring the expresssion
data. To verify the quality of our modules relative to gene
expression data, we define the expression coherence of
a module to be the average Pearson correlation between
each pair of genes assigned to it, where the Pearson
i278
by guest on February 25, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
Genome-wide discovery of transcriptional modules
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
00.10.20.3 0.4 0.5 0.6 0.7 0.8
Expression coherence (standard clustering)
Expression coherence (our method)
0
2
4
6
8
10
12
14
02468101214
-Log(pvalue) for standard clustering
-Log(pvalue) for our method
0
1
2
3
4
5
6
7
01234567
-Log (pvalue) for standard clustering
-Log (pvalue) for our method
(a) (b) (c)
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
00.10.20.3 0.4 0.5 0.6 0.7 0.8 0.9
Expression coherence (standard clustering)
Expression coherence (our method)
0
5
10
15
20
25
30
35
051015 20 25 30 35
-Log(pvalue) for standard clustering
-Log(pvalue) for our method
0
1
2
3
4
5
6
7
8
9
0123456789
-Log (pvalue) for standard clustering
-Log (pvalue) for our method
(d) (e) (f)
Fig. 5. Comparison of standard clustering and the proposed method. (a)–(c) are for the cell cycle dataset (Spellman et al., 1998) and (d)–(f)
are for the stress expression dataset (Gasch et al., 2000). (a), (d) Comparison of the expression coherence for each inferred module (or cluster
in the standard clustering model). (b), (e) Comparison of enrichment of the targets of each motif for functional annotations from the GO
database. For each annotation, the largest negative log p-value obtained from analyzing the targets of all motifs is shown. (c), (f) Comparison
of enrichment of the targets of each motif for protein complexes. For each protein complex, shown is the largest negative log p-value obtained
from any of the motifs.
correlation is
Pearson(g
i
.E, g
j
.E) =
1
L
L
l=1
(g
i
.E
l
µ
i
)
σ
i
(g
j
.E
l
µ
j
)
σ
j
,
where µ
i
i
are the mean and standard deviation of
the entries in g
i
.E. Figure 5a,d compares the expression
coherence of our modules to those built from standard
clustering for the cell cycle and stress data, showing
essentially identical coherence of expression profiles. For the cell
cycle data, there was even a slight increase in the
coherence of the expression profiles for our model. Thus,
our model results in clusters that are more enriched for
motifs, while achieving the same quality of expression
patterns as standard clustering, which only tries to optimize
the expression score.
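As a minimal sketch of the expression coherence measure defined above, the following Python computes the average Pearson correlation over all pairs of genes in one module. The function names and the toy profiles are illustrative assumptions, not the authors' code. The population standard deviation (dividing by L) is used so that the 1/L factor in the formula yields a correlation in [-1, 1].

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant expression profiles."""
    L = len(x)
    mu_x, mu_y = sum(x) / L, sum(y) / L
    # Population standard deviations (divide by L), matching the 1/L factor.
    sd_x = sqrt(sum((v - mu_x) ** 2 for v in x) / L)
    sd_y = sqrt(sum((v - mu_y) ** 2 for v in y) / L)
    return sum((a - mu_x) / sd_x * (b - mu_y) / sd_y for a, b in zip(x, y)) / L


def expression_coherence(profiles):
    """Average Pearson correlation over all pairs of genes in one module."""
    pairs = list(combinations(profiles, 2))
    return sum(pearson(x, y) for x, y in pairs) / len(pairs)


# Hypothetical module: three genes measured in four experiments.
module = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
]
coherence = expression_coherence(module)
```

In this toy module the first two genes are perfectly correlated and the third is anti-correlated with both, so the coherence is negative. A module of genuinely co-expressed genes would score close to 1.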
Coherence of motif targets As we discussed, the motif
profile characterizing a module allows us to define a notion
of motif targets: genes that contain the motif and in whose
expression profile the motif plays a role, i.e. those
assigned to a module whose motif profile contains the
motif. In the standard clustering model, we can define the
targets of a motif to be those genes that have the motif and
belong to the cluster from which the motif was learned.
We tested whether our motif targets correspond to
functional groups, by measuring their enrichment for
genes in the same functional category according to the
gene annotation database of GO (Ashburner et al., 2000).
We used only GO categories with 5 or more genes
associated with them, resulting in 537 categories. For each
annotation and each motif, we computed the fraction of
genes in the targets of that motif associated with that
annotation and used the hypergeometric distribution to
calculate a p-value for this fraction, and took p-value <
0.05 to be significant. We compared, for both expression
data sets, the enrichment of the motif targets for GO
annotations between our model and standard clustering.
We found many annotations that were enriched in both
models. However, there were 24 and 29 annotations that
were significantly enriched in our cell cycle and stress
models, respectively, that were not enriched at all in the
standard clustering model, compared to only 4 and 14
categories enriched only in the standard clustering model
for the respective datasets. Among those categories
enriched only in our model were carbohydrate catabolism,
cell wall organization and galactose metabolism, all of
which are processes known to be active in response to
various stress conditions that we can now characterize by
sequence motifs. A full comparison of the GO enrichment
for both datasets is shown in Figure 5b,e.
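The enrichment test described above can be sketched as follows. The text states only that the hypergeometric distribution was used, so reading the p-value as the upper tail P(X ≥ k) is an assumption. The function name, the parameter names and the example counts are hypothetical.

```python
from math import comb


def hypergeom_pvalue(N, K, n, k):
    """Upper-tail hypergeometric p-value P(X >= k).

    N: total number of genes
    K: genes associated with the annotation
    n: number of motif targets
    k: motif targets associated with the annotation
    """
    upper = min(K, n)
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


# Hypothetical counts: 1010 genes in total, 20 with the annotation,
# 30 motif targets, 5 of which carry the annotation.
p = hypergeom_pvalue(N=1010, K=20, n=30, k=5)
significant = p < 0.05  # significance threshold used in the text
```

Exact integer binomial coefficients avoid floating-point underflow for the small tail probabilities that arise with strongly enriched annotations.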
Since functional categories do not necessarily correspond
to co-regulation groups, we also tested the
enrichment of our motif targets for protein complexes,
as compiled experimentally in the assays of Gavin et al.
(2002) and Ho et al. (2002), consisting of 590 and 493
complexes, respectively. The member genes of protein
complexes are often co-regulated and we thus expect
to find enrichment for them in our motif targets. We
associated each gene with the complexes it is assigned
to in each protein complex dataset and computed the
p-value of the enrichment of the targets of each motif
for each complex, as we did above for the GO annotations.
The results for the cell cycle and stress datasets
are summarized in Figure 5c,f, showing much greater
enrichment of our motif targets than the targets of the
motifs identified using the standard approach, with 63 and
10 complexes significantly enriched only in our model,
and no complexes enriched only in the standard approach,
for the stress and cell cycle models, respectively.
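The per-complex comparison summarized in Figure 5c,f can be sketched as below: for each complex, keep the largest negative log p-value over all motifs, then list the complexes that pass the significance threshold in one method but not in the other. This is an illustrative sketch, not the authors' code. The base-10 logarithm, the function names and the toy p-values are assumptions.

```python
from math import log10


def best_scores(pvalues):
    """Map each complex to its largest -log10(p) over all motifs.

    `pvalues` maps (motif, complex) pairs to enrichment p-values (all > 0).
    """
    best = {}
    for (_motif, complex_id), p in pvalues.items():
        score = -log10(p)
        best[complex_id] = max(score, best.get(complex_id, score))
    return best


def enriched_only_in_first(best_a, best_b, alpha=0.05):
    """Complexes with p < alpha in method A and no p < alpha in method B."""
    cutoff = -log10(alpha)
    return {c for c, s in best_a.items() if s > cutoff and best_b.get(c, 0.0) <= cutoff}


# Hypothetical enrichment p-values for two complexes under each method.
ours = {("m1", "cA"): 1e-4, ("m2", "cA"): 0.2, ("m1", "cB"): 0.01}
standard = {("s1", "cA"): 0.001, ("s1", "cB"): 0.3}

only_ours = enriched_only_in_first(best_scores(ours), best_scores(standard))
only_standard = enriched_only_in_first(best_scores(standard), best_scores(ours))
```

Here complex cA is significant under both methods, so it is excluded from both sets, while cB is significant only under the first.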
Motifs and motif profiles We compared the motifs we
identified to motifs from the literature. Of the 49 motifs
learned for the stress model, 22 are known, compared
to only 10 known motifs learned using the standard
approach. For the cell cycle model, 15 of the 27 learned
motifs are known, compared to only 8 known motifs
learned using the standard approach. Many of the known
motifs identified, such as the stress element STRE, the
heat shock motif HSF and the cell cycle motif MCM1, are
also known to be active in the respective datasets.
A powerful feature of our approach is its ability to
characterize modules by motif profiles. This ability is
particularly important for higher eukaryotes, in which
regulation often occurs through multiple distinct motifs.
To illustrate the motif profiles found by our approach, we
identified, for each motif, all modules enriched for the presence
of that motif. This was done by associating each gene with
the motifs in its upstream region, and then computing the
p-value of the enrichment of the member genes of each
module. Figure 6a shows all the module-motif pairs in
which the module was enriched for the motif with p-value
< 0.05. In addition, the figure indicates (by red circles) all
pairs in which the motif appears in the module’s motif
profile. As can be seen, many of the profiles contain multiple
motifs, and many motifs were used by more than one
module. Even though modules share motifs, each module
is characterized by a unique combination of motifs.
Inferring regulatory networks Identifying the active motifs
is a significant step towards understanding the regulatory
mechanisms governing gene expression. However, we
would also like to know the identity of the transcription
factor (TF) molecules that bind to these sequence motifs.
We used the DNA binding assays of Lee et al. (2002),
which directly detect the promoter regions to which a particular
TF binds in vivo, and associated TFs with the motifs we
learned. For each motif, we computed the fraction, among
the motif targets, of genes bound by each TF, as measured
in the data of Lee et al. We used the hypergeometric distribution
to assign a p-value to each such fraction and took
p-value < 0.05 to be significant. Inspection of the significant
associations showed that, in most cases, there was a
unique motif that was significant for the TF and that a high
fraction (>0.5) of the TF’s binding targets were among the
motif target genes.
Based on this strong association between TFs and
motifs, for each such TF-motif pair, we predicted that
the TF regulates all the modules that are characterized
by the motif. By combining all associations, we arrived
at the regulatory network shown in Figure 6b. Of the
106 transcription factors measured in Lee et al., 28 were
enriched in the targets of at least one motif and were
thus added to the resulting network. Of the 20 modules,
16 were associated with at least one TF. To validate
the quality of the network, we searched the biological
literature and compiled a list of experimentally verified
targets for each of the 28 TFs in our network. We then
marked each association between a TF and a module as
supported if the module contains at least one gene that the
TF is known to regulate from biological experiments. As
current knowledge is limited, there are very few known
targets for most TFs. Nevertheless, we found support
for 21 of the 64 associations. We also computed the p-value
for each supported association between a TF and a
module, using the binomial distribution with probability
of success $p = K/N$, where $K$ is the total number of
known targets for the TF and $N$ is the total number of
genes (1010). The p-value is then $P(X \geq k \mid X \sim B(p, n))$,
where $k$ is the total number of known targets of
the regulator in the supported module and $n$ is the number
of genes in the supported module. The resulting p-values
are shown in Figure 6b by edge thickness and color.
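A minimal sketch of this binomial support test follows, taking the p-value to be the upper-tail probability that a module of n genes contains at least k known targets of the TF. The symbols K, N and n follow the text. The symbol k for the number of known targets in the module, the function name and the example counts are assumptions made for illustration.

```python
from math import comb


def binomial_support_pvalue(K, N, n, k):
    """P(X >= k) for X ~ Binomial(n, p) with success probability p = K / N.

    K: total number of known targets of the TF
    N: total number of genes
    n: number of genes in the supported module
    k: known targets of the TF found in that module
    """
    p = K / N
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical example: a TF with 20 known targets among 1010 genes,
# 3 of which fall in a module of 31 genes.
p_value = binomial_support_pvalue(K=20, N=1010, n=31, k=3)
```

The resulting value can then be compared with the thresholds in the Figure 6b legend (p < .05 and p < .001).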
We assigned a name to each module based on a concise
summary of its gene content (compiled from both gene
annotation and literature). The regulatory network thus
contains predictions for the processes regulated by each
TF, where for each association the prediction includes
the motif through which the regulation occurs. In many
[Figure 6: image not reproduced; the labels recoverable from the figure are listed below.]
Panel (a), motifs (matrix columns, in the order they appear in the extracted figure): Motif 15, 5, 12, 2, 17, 7, 14, 19, 11, 9, 1, 16, 6, 13, 3, 10, 8; Added Motif 1, 2, 3, 6, 7, 9, 11, 12, 13, 14, 15, 16, 17, 21, 27, 29, 30, 31, 33, 34, 36, 39, 42, 48, 51, 54, 55, 60, 61; Motif 20, 4; Added Motif 10.
Panel (a), modules (matrix rows), listed as index, name and the count printed in parentheses in the figure:
19 – Glycolysis (80)
18 – Redox regulation (10)
17 – Mixed I (74)
16 – Protein folding (45)
15 – Chromatin remodeling (8)
14 – Energy and TCA cycle (66)
13 – Oxidative phosphorylation (31)
12 – Unknown I (75)
11 – Nitrogen metabolism (35)
10 – Cell cycle (44)
9 – Protein folding (47)
8 – Ergosterol biosynthesis (65)
7 – Sucrose metabolism (55)
6 – Translation (80)
5 – Heat shock (61)
4 – Pheromone response (93)
3 – Transport (30)
2 – Aldehyde metabolism (34)
1 – Galactose metabolism (15)
0 – Amino acid metabolism (52)
Panel (b), transcription factors: Cbf1, Arg81, Fzf1, Met4, Gat1, Dal82, Rfx1, Swi6, Cha4, Hir2, Hms1, Gln3, Hap4, Ste12, Pdr1, Swi4, Swi5, Fkh1, Aro80, Crz1, Yap3, Nrg1, Dig1, Dot6, Mig1, Gcr2, Cad1, Gcr1.
Panel (b), modules: numbered 0 to 19.
Panel (b), edge legend: Not supported; Supported (p>.05); Supported (p<.05); Supported (p<.001).
Fig. 6. (a) Matrix of motifs vs. modules for the stress data, where a module-motif entry is colored if the member genes of that module were
enriched for that motif with p-value <0.05. The intensity corresponds to the fraction of genes in the module that had the motif. Entries in the
module’s motif profile are circled in red. Modules were assigned names based on a summary of their gene content. (b) Regulatory network
inferred from our model using the DNA binding assays of Lee et al. (2002). Ovals correspond to transcription factors and rectangles to modules (see
(a) for module names).
cases, our approach recovered coherent biological processes
along with their known regulators. Examples of
such associations include: Hap4, the known activator of
oxidative phosphorylation, with the oxidative phosphorylation
module (13); Gcr2, a known positive regulator
of glycolysis, with the glycolysis module (19); Mig1, a
glucose repressor, with the galactose metabolism module
(1); Ste12, involved in regulation of pheromone pathways,
with the pheromone response module (4); and Met4, a
positive regulator of sulfur amino acid metabolism, with
the amino acid metabolism module (0).
CONCLUSIONS
We presented a unified probabilistic model over both gene
expression and sequence data, whose goal is to identify
transcriptional modules and the regulatory motif binding
sites that regulate them within a given set of experiments.
Our results indicate that our method discovers
modules that are both highly coherent in their expression
profiles and significantly enriched for common motif binding
sites in upstream regions of genes assigned to the same
module. A comparison to the common approach of constructing
clusters based only on expression and then learning
a motif for each cluster shows that our method recovers
modules that have a much higher correspondence to external
biological knowledge of gene annotations and protein
complex data.
ACKNOWLEDGEMENTS
This work was supported by the National Science Founda-
tion, grant ACI-0082554. Eran Segal was also supported
by a Stanford Graduate Fellowship (SGF).
REFERENCES
Ashburner,M., Ball,C.A., Blake,J.A., Botstein,D., Butler,H.,
Cherry,J.M., Davis,A.P., Dolinski,K., Dwight,S.S., Eppig,J.T.,
et al. (2000) Gene ontology: tool for the unification of biology.
The gene ontology consortium. Nat. Genet., 25, 25–29.
Bailey,T.L. and Elkan,C. (1994) Fitting a mixture model by
expectation maximization to discover motifs in biopolymers.
Proc. Int. Conf. Intell. Syst. Mol. Biol. Volume 2. pp. 28–36.
Barash,Y., Bejerano,G. and Friedman,N. (2001) A simple hyper-
geometric approach for discovering putative transcription factor
binding sites. Algorithms in Bioinformatics, Number 2149 in
LNCS, pp. 278–293.
Brazma,A., Jonassen,I., Vilo,J. and Ukkonen,E. (1998) Predicting
gene regulatory elements in silico on a genomic scale. Genome
Res., 8, 1202–1215.
Bussemaker,H., Li,H. and Siggia,E. (2001) Regulatory element
detection using correlation with expression. Nat. Genet., 27,
167–171.
Cheeseman,P. and Stutz,J. (1995) Bayesian classification (Auto-
Class): Theory and results. Advances in Knowledge Discovery
and Data Mining. AAAI Press, Menlo Park, CA, pp. 153–180.
Cherry,J.M., Adler,C., Ball,C., Chervitz,S.A., Dwight,S.S., Hes-
ter,E.T., Jia,Y., Juvik,G., Roe,T., Schroeder,M., Weng,S., Bot-
stein,D. (1998) Sgd: Saccharomyces genome database. Nucleic
Acid Res., 26, 73–79.
Dempster,A.P., Laird,N.M. and Rubin,D.B. (1977) Maximum like-
lihood from incomplete data via the EM algorithm. J. Roy. Stat.
Soc. B, 39, 1–39.
Gasch,A.P., Spellman,P.T., Kao,C.M., Carmel-Harel,O.,
Eisen,M.B., Storz,G., Botstein,D. and Brown,P.O. (2000)
Genomic expression program in the response of yeast cells
to environmental changes. Mol. Biol. Cell, 11, 4241–4257.
Gavin,A.C., Bosche,M., Krause,R., Grandi,P., Marzioch,M.,
Bauer,A., Schultz,J., Rick,J.M., Michon,A.M. and Cruciat,C.M.,
et al. (2002) Functional organization of the yeast proteome
by systematic analysis of protein complexes. Nature, 415,
141–147.
Ho,Y., Gruhler,A., Heilbut,A., Bader,G.D., Moore,L., Adams,S.L.,
Millar,A., Taylor,P., Bennett,K., Boutilier,K., et al. (2002) Sys-
tematic identification of protein complexes in Saccharomyces
cerevisiae by mass spectrometry. Nature, 415, 180–183.
Lee,T., Rinaldi,N.J., Robert,F., Odom,D.T., Bar-Joseph,Z., Ger-
ber,G.K., Hannett,N.M., Harbison,C.T., Thompson,C.M., Si-
mon,I., et al. (2002) Transcriptional regulatory networks in Sac-
charomyces cerevisiae. Science, 298, 824–827.
Liu,X., Brutlag,D. and Liu,J. (2001) Bioprospector: discovering
conserved DNA motifs in upstream regulatory regions of co-
expressed genes. Pac. Symp. Biocomput. pp. 127–138.
Pearl,J. (1988) Probabilistic Reasoning in Intelligent Systems.
Morgan Kaufmann.
Pilpel,Y., Sudarsanam,P. and Church,G. (2001) Identifying regula-
tory networks by combinatorial analysis of promoter elements.
Nat. Genet., 29, 153–159.
Roth,F., Hughes,P., Estep,J.D. and Church,G. (1998) Finding DNA
regulatory motifs within unaligned noncoding sequences clus-
tered by whole-genome mRNA quantitation. Nat. Biotechnol.,
16, 939–945.
Segal,E., Taskar,B., Gasch,A., Friedman,N. and Koller,D. (2001)
Rich probabilistic models for gene expression. Bioinformatics,
17(Suppl 1), S243–S252.
Segal,E., Barash,Y., Simon,I., Friedman,N. and Koller,D. (2002)
From sequence to expression: A probabilistic framework. In
Proc. RECOMB. pp. 263–272.
Sinha,S. and Tompa,M. (2000) A statistical method for finding
transcription factor binding sites. In Proc. Int. Conf. Intell. Syst.
Mol. Biol., Volume 8, pp. 344–354.
Spellman,P.T., Sherlock,G., Zhang,M.Q., Iyer,V.R., Anders,K.,
Eisen,M.B., Brown,P.O., Botstein,D. and Futcher,B. (1998)
Comprehensive identification of cell cycle-regulated genes of
the yeast Saccharomyces cerevisiae by microarray hybridization.
Mol. Biol. Cell, 9(12), 3273–3297.
Tavazoie,S., Hughes,J.D., Campbell,M.J., Cho,R.J. and
Church,G.M. (1999) Systematic determination of genetic
network architecture. Nat. Genet., 22, 281–285 Comment in:
Nat. Genet. 22 213-215.
i282
by guest on February 25, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
... Biological annotations such as the Gene Ontology (GO) or the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways have been used in studies to extrapolate biological roles and regulatory associations from changes in individual genes [24]. As a result, various microarray studies have identified similarities and differences in mRNA expression levels among samples [25][26][27][28][29]. Single genes or groups of related genes are typically associated with many biological activities, and these activities usually correspond to changes in gene expression across distinct tissue types [30]. ...
Article
Full-text available
Genes associated with growth factors were previously analyzed in a radiation- and estrogen-induced experimental breast cancer model. Such in vitro experimental breast cancer model was developed by exposure of the immortalized human breast epithelial cell line, MCF-10F, to low doses of high linear energy transfer (LET) α particle radiation (150 keV/μm) and subsequent growth in the presence or absence of 17β-estradiol. The MCF-10F cell line was analyzed in different stages of transformation after being irradiated with either a single 60 cGy dose or 60/60 cGy doses of alpha particles. In the present report, the profiling of differentially expressed genes associated with growth factors was analyzed in their relationship with clinical parameters. Thus, the results indicated that Fibroblast growth factor2 gene expression levels were higher in cells transformed by radiation or in the presence of ionizing radiation; whereas the fibroblast growth factor-binding protein 1gene expression was higher in the tumor cell line derived from this model. Such expressions were coincident with higher values in normal than malignant tissues and with estrogen receptor (ER) negative samples for both gene types. The results also showed that transforming growth factor alpha gene expression was higher in the tumor cell line than the tumorigenic A5 and the transformed A3 cell line, whereas the transforming growth factor beta receptor 3 gene expression was higher in A3 and A5 than in Tumor2 cell lines and the untreated controls and the E cell lines. Such gene expression was accompanied by results indicating negative and positive receptors for transforming growth factor alpha and the transforming growth factor beta receptor 3, respectively. Such expressions were low in malignant tissues when compared with benign ones. 
Furthermore, Fibroblast growth factor2, the fibroblast growth factor-binding protein 1, transforming growth factor alpha, the transforming growth factor beta receptor 3, and the insulin growth factor receptor gene expressions were found to be present in all BRCA patients that are BRCA-Basal, BRCA-LumA, and BRCA-LumB, except in BRCA-Her2 patients. The results also indicated that the insulin growth factor receptor gene expression was higher in the tumor cell line Tumor2 than in Alpha3 cells transformed by ionizing radiation only; then, the insulin growth factor receptor was higher in the A5 than E cell line. The insulin growth factor receptor gene expression was higher in breast cancer than in normal tissues in breast cancer patients. Furthermore, Fibroblast growth factor2, the fibroblast growth factor-binding protein 1, transforming growth factor alpha, the transforming growth factor beta receptor 3, and the insulin growth factor receptor gene expression levels were in stages 3 and 4 of breast cancer patients. It can be concluded that, by using gene technology and molecular information, it is possible to improve therapy and reduce the side effects of therapeutic radiation use. Knowing the different genes involved in breast cancer will make possible the improvement of clinical chemotherapy.
... Gene expression microarrays have been an effective tool for comparing and contrasting cell lines and disease states in people [32,33]. Various microarray studies have identified similarities and differences in mRNA expression levels among samples; thus biological annotations such as Gene Ontology (GO) or KEGG pathways have been utilized in studies to extrapolate biological roles [34,35] and regulatory relationships from changes in individual genes [36,37]. To further investigate the biological functions of identified differentially expressed genes; the GO and KEGG functional enrichment analysis tools revealed that up-regulated genes were enhanced, whereas others were down-regulated as those involved in cell cycle regulation [38]. ...
Article
Full-text available
Cancer develops in a multi-step process where environmental carcinogenic exposure is a primary etiological component, and where cell–cell communication governs the biological activities of tissues. Identifying the molecular genes that regulate this process is essential to targeting metastatic breast cancer. Ionizing radiation can modify and damage DNA, RNA, and cell membrane components such as lipids and proteins by direct ionization. Comparing differential gene expression can help to determine the effect of radiation and estrogens on cell adhesion. An in vitro experimental breast cancer model was developed by exposure of the immortalized human breast epithelial cell line MCF-10F to low doses of high linear energy transfer α particle radiation and subsequent growth in the presence of 17β-estradiol. The MCF-10F cell line was analyzed in different stages of transformation that showed gradual phenotypic changes including altered morphology, increase in cell proliferation relative to the control, anchorage-independent growth, and invasive capability before becoming tumorigenic in nude mice. This model was used to determine genes associated with cell adhesion and communication such as E-cadherin, the desmocollin 3, the gap junction protein alpha 1, the Integrin alpha 6, the Integrin beta 6, the Keratin 14, Keratin 16, Keratin 17, Keratin 6B, and the laminin beta 3. Results indicated that most genes had greater expression in the tumorigenic cell line Tumor2 derived from the athymic animal than the Alpha3, a non-tumorigenic cell line exposed only to radiation, indicating that altered expression levels of adhesion molecules depended on estrogen. There is a significant need for experimental model systems that facilitate the study of cell plasticity to assess the importance of estrogens in modulating the biology of cancer cells.
... Our algorithm relies on accurate predictions of genetic regulatory modules. A large body of gene expression data is publicly available 16,17 and has enabled computational prediction of gene modules (co-regulated genes) by several groups 8,[18][19][20][21][22][23][24][25][26][27][28][29][30] . Preliminary experimentation with published methods led us to choose ICA for performing module prediction, as modules predicted with ICA yielded stronger oligonucleotide enrichment in promoter regions than did modules predicted with the other methods we tested ( Fig. 2e and additional data not shown; see Lee & Batzoglou 12 for additional comparisons of ICA to other methods). ...
Preprint
Full-text available
Annotating genes with information describing their role in the cell is a fundamental goal in biology, and essential for interpreting data-rich assays such as microarray analysis and RNA-Seq. Gene annotation takes many forms, from Gene Ontology (GO) terms, to tissues or cell types of significant expression, to putative regulatory factors and DNA sequences. Almost invariably in gene databases, annotations are connected to genes by a Boolean relationship, e.g., a GO term either is or isn’t associated with a particular gene. While useful for many purposes, Boolean-type annotations fail to capture the varying degrees by which some annotations describe their associated genes and give no indication of the relevance of annotations to cellular logistical activities such as gene expression. We hypothesized that weighted annotations could prove useful for understanding gene function and for interpreting gene expression data, and developed a method to generate these from Boolean annotations and a large compendium of gene expression data. The method uses an independent component analysis-based approach to find gene modules in the compendium, and then assigns gene-specific weights to annotations proportional to the degree to which they are shared among members of the module, with the reasoning that the more an annotation is shared by genes in a module, the more likely it is to be relevant to their function and, therefore, the higher it should be weighted. In this paper, we show that analysis of expression data with module-weighted annotations appears to be more resistant to the confounding effect of gene-gene correlations than non-weighted annotation enrichment analysis, and show several examples in which module-weighted annotations provide biological insights not revealed by Boolean annotations. 
We also show that application of the method to a simple form of genetic regulatory annotation, namely, the presence or absence of putative regulatory words (oligonucleotides) in gene promoters, leads to module-weighted words that closely match known regulatory sequences, and that these can be used to quickly determine key regulatory sequences in differential expression data.
... A large body of gene-expression data is publicly available (Barrett et al. 2011;Rustici et al.) and has enabled computational prediction of gene modules (Kim et al. 2001;Segal, Yelensky, et al. 2003;Ihmels et al. 2004;Michoel et al. 2009;Engreitz et al. 2010). We refer to our method for going from a raw compendium of gene expression data to an optimized set of gene modules and a list of genes that belong to each module as DEXICA, for Deep EXtraction Independent Component Analysis (described below). ...
Article
Full-text available
Identification of co-expressed sets of genes (gene modules) is used widely for grouping functionally related genes during transcriptomic data analysis. An organism-wide atlas of high-quality gene modules would provide a powerful tool for unbiased detection of biological signals from gene expression data. Here, using a method based on independent component analysis we call DEXICA, we have defined and optimized 209 modules that broadly represent transcriptional wiring of the key experimental organism Caenorhabditis elegans . These modules represent responses to changes in the environment (e.g. starvation, exposure to xenobiotics), genes regulated by transcriptions factors (e.g. ATFS-1, DAF-16), genes specific to tissues (e.g. neurons, muscle), genes that change during development, and other complex transcriptional responses to genetic, environmental and temporal perturbations. Interrogation of these modules reveals processes that are activated in long-lived mutants in cases where traditional analyses of differentially expressed genes fail to do so. Additionally, we show that modules can inform the strength of the association between a gene and an annotation (e.g. GO term). Analysis of "module-weighted annotations" improves on several aspects of traditional annotation-enrichment tests and can aid in functional interpretation of poorly annotated genes. We provide an online interactive resource at http://genemodules.org/ in which users can find detailed information on each module, check genes for module-weighted annotations, and use both of these to analyze their own transcription data or gene sets of interest.
Thesis
Identifier les différentes parties d’une séquence biologique (séquence nucléique, ou séquence d’acides aminés) constitue un premier pas vers la compréhension de la biologie de l’organisme dont elle est issue. Étant donné un ensemble de séquences biologiques d’un organisme, nous nous intéressons dans cette thèse à la découverte de «domaines», c-à-d de sous-séquences relativement grandes (plusieurs dizaines de nucléotides ou d’acides aminés) que l’on retrouve dans un nombre important de séquences. Cette thèse est décomposée en deux axes correspondant à la découverte de domaines dans les séquences protéiques et dans les séquences nucléiques. Dans chaque axe, les méthodes développées sont appliquées à Plasmodium falciparum, le pathogène responsable du paludisme chez l’Homme, et pour lequel les méthodes bio-informatiques classiques peinent à produire des annotations satisfaisantes. Le premier axe développé porte sur la découverte de domaines dans les séquences protéiques. Une approche commune pour identifier les domaines d’une protéine consiste à exécuter des comparaisons de paires de séquences avec des outils d’alignements locaux comme BLAST. Cependant, ces approches manquent parfois de sensibilité, en particulier pour les espèces phylogénétiquement éloignées des organismes de référence classiques. Nous proposons ici une approche pour augmenter la sensibilité des comparaisons de paires de séquences. Cette nouvelle approche utilise le fait que les domaines protéiques ont tendance à apparaître avec un nombre limité d’autres domaines sur une même protéine. Chez Plasmodium falciparum, cette méthode permet la découverte de 2 240 nouveaux domaines pour lesquels, dans la majorité des cas, il n’existe pas de modèle semblable dans les bases de données de domaines. Le deuxième axe développé porte sur la découverte de domaines dans les séquences régulatrices (séquences ADN). 
Plusieurs études ont montré qu’il existe un lien fort entre la composition nucléotidique de régions particulières (séquences promotrices notamment) et l’expression des gènes. Nous proposons ici une nouvelle approche permettant de découvrir de manière automatique ces régions, que l’on nomme domaines de régulation. Plus précisément notre approche est basée sur une stratégie d’exploration itérative des compositions nucléotidiques, des plus simples (dinucléotides) aux plus complexes (k-mers), ainsi qu’une stratégie de segmentation supervisée pour découvrir les compositions et les régions d’intérêt. En utilisant les domaines ainsi identifiés, nous montrons que l’on peut prédire l’expression des gènes de Plasmodium falciparum avec une étonnante précision. Appliquée à différentes autres espèces eucaryotes, cette approche montre des résultats très différents suivant les espèces (entre 40 et 70 % de corrélation) ce qui laisse entrevoir un mécanisme de régulation sans doute partagé par toutes les espèces eucaryotes mais dont l’importance varie d’une espèce à l’autre.
Chapter
In this chapter we focus on the problem of comparing two clusterings in the context of genomic data. As in the previous chapter, we use the term clustering to refer both to the process as well as the result. We use the term hard clustering to mean a clustering in which an object (data point) is assigned to one cluster only. In soft clustering, the degree to which an object is associated with a cluster is indicated by a membership grade. Clustering is often employed as a first step in genomic data analysis in order to identify groups of data samples (often groups of genes) for further analysis. It is often of interest to compare a clustering result with an external "reference" clustering, for instance comparing a grouping of genes obtained by applying a clustering algorithm on gene expression data against a grouping derived from existing knowledge (based on gene ontology for instance). Although the choice of the reference clustering depends on the task at hand, typically groupings of genes such as those based on gene ontology, are comprised of overlapping clusters. Many of the external cluster validity measures in the literature assume partitions (non-overlapping and exhaustive clusters), hence they may not be suitable in the context described above. We suggest that a Mallows distance based method proposed recently may be more appropriate for cluster validation
Chapter
One of the key issues in the post-genomic era is to assign functions to uncharacterized proteins. Since proteins seldom act alone, but rather interact with other biomolecular units to execute their functions, the functions of unknown proteins may be discovered through studying their associations with proteins having known functions.In this chapter, the authors discuss possible approaches to exploit protein interaction networks for automated prediction of protein functions. The major focus is on discussing the utilities and limitations of current algorithms and computational techniques for accurate computational function prediction. The chapter highlights the challenges faced in this task and explores how similarity information among different gene ontology (GO) annotation terms can be taken into account to enhance function prediction.The authors describe a new strategy that has better prediction performance than previous methods, which gives additional insights about the importance of the dependence between functional terms when inferring protein function.
Article
Empirical evidence suggests that the malaria parasite Plasmodium falciparum employs a broad range of mechanisms to regulate gene transcription throughout the organism's complex life cycle. To better understand this regulatory machinery, we assembled a rich collection of genomic and epigenomic data sets, including information about transcription factor (TF) binding motifs, patterns of covalent histone modifications, nucleosome occupancy, GC content, and global 3D genome architecture. We used these data to train machine learning models to discriminate between high-expression and low-expression genes, focusing on three distinct stages of the red blood cell phase of the Plasmodium life cycle. Our results highlight the importance of histone modifications and 3D chromatin architecture in Plasmodium transcriptional regulation and suggest that AP2 transcription factors may play a limited regulatory role, perhaps operating in conjunction with epigenetic factors.
Article
The Saccharomyces Genome Database (SGD) provides Internet access to the complete Saccharomyces cerevisiae genomic sequence, its genes and their products, the phenotypes of its mutants, and the literature supporting these data. The amount of information and the number of features provided by SGD have increased greatly following the release of the S. cerevisiae genomic sequence, which is currently the only complete sequence of a eukaryotic genome. SGD aids researchers by providing not only basic information, but also tools such as sequence similarity searching that lead to detailed information about features of the genome and relationships between genes. SGD presents information using a variety of user-friendly, dynamically created graphical displays illustrating physical, genetic and sequence feature maps. SGD can be accessed via the World Wide Web at http://genome-www.stanford.edu/Saccharomyces/
Article
We performed a systematic analysis of gene upstream regions in the yeast genome for occurrences of regular expression-type patterns with the goal of identifying potential regulatory elements. To achieve this goal, we have developed a new sequence pattern discovery algorithm that searches exhaustively for a priori unknown regular expression-type patterns that are over-represented in a given set of sequences. We applied the algorithm in two cases: (1) discovery of patterns in the complete set of >6000 sequences taken upstream of the putative yeast genes and (2) discovery of patterns in the regions upstream of the genes with similar expression profiles. In the first case, we looked for patterns that occur more frequently in the gene upstream regions than in the genome overall. In the second case, we first clustered the upstream regions of all the genes by similarity of their expression profiles on the basis of publicly available gene expression data and then looked for sequence patterns that are over-represented in each cluster. In both cases we considered each pattern that occurred in at least some minimum number of sequences, and rated the patterns on the basis of their over-representation. Among the highest-rated patterns, most have matches to substrings in known yeast transcription factor-binding sites. Moreover, several of them are known to be relevant to the expression of the genes from the respective clusters. Experiments on simulated data show that the majority of the discovered patterns are not expected to occur by chance.
Article
Whole-genome mRNA quantitation can be used to identify the genes that are most responsive to environmental or genotypic change. By searching for mutually similar DNA elements among the upstream non-coding DNA sequences of these genes, we can identify candidate regulatory motifs and corresponding candidate sets of coregulated genes. We have tested this strategy by applying it to three extensively studied regulatory systems in the yeast Saccharomyces cerevisiae: galactose response, heat shock, and mating type. Galactose-response data yielded the known binding site of Gal4, and six of nine genes known to be induced by galactose. Heat shock data yielded the cell-cycle activation motif, which is known to mediate cell-cycle dependent activation, and a set of genes coding for all four nucleosomal proteins. Mating type alpha and a data yielded all of the four relevant DNA motifs and most of the known a- and alpha-specific genes.
Article
We sought to create a comprehensive catalog of yeast genes whose transcript levels vary periodically within the cell cycle. To this end, we used DNA microarrays and samples from yeast cultures synchronized by three independent methods: alpha factor arrest, elutriation, and arrest of a cdc15 temperature-sensitive mutant. Using periodicity and correlation algorithms, we identified 800 genes that meet an objective minimum criterion for cell cycle regulation. In separate experiments, designed to examine the effects of inducing either the G1 cyclin Cln3p or the B-type cyclin Clb2p, we found that the mRNA levels of more than half of these 800 genes respond to one or both of these cyclins. Furthermore, we analyzed our set of cell cycle-regulated genes for known and new promoter elements and show that several known elements (or variations thereof) contain information predictive of cell cycle regulation. A full description and complete data sets are available at http://cellcycle-www.stanford.edu
Article
Summary: A broadly applicable algorithm for computing maximum likelihood estimates from incomplete data is presented at various levels of generality. Theory showing the monotone behaviour of the likelihood and convergence of the algorithm is derived. Many examples are sketched, including missing value situations, applications to grouped, censored or truncated data, finite mixture models, variance component estimation, hyperparameter estimation, iteratively reweighted least squares and factor analysis.