
Genome-wide Discovery of Transcriptional Modules from DNA Sequence and Gene Expression

February 2003
Bioinformatics 19 Suppl 1(Suppl 1):i273-82


DOI:10.1093/bioinformatics/btg1038


Authors:

Eran Segal

Weizmann Institute of Science



BIOINFORMATICS

Vol. 19 Suppl. 1 2003, pages i273–i282

DOI: 10.1093/bioinformatics/btg1038

Genome-wide discovery of transcriptional

modules from DNA sequence and gene

expression

E. Segal∗, R. Yelensky and D. Koller

Computer Science Department of Stanford University, Stanford, CA 94305-9010,

USA

Received on January 6, 2003; accepted on February 20, 2003

ABSTRACT

In this paper, we describe an approach for understanding transcriptional regulation from both gene expression and promoter sequence data. We aim to identify transcriptional modules: sets of genes that are co-regulated in a set of experiments, through a common motif profile. Using the EM algorithm, our approach refines both the module assignment and the motif profile so as to best explain the expression data as a function of transcriptional motifs. It also dynamically adds and deletes motifs, as required to provide a genome-wide explanation of the expression data. We evaluate the method on two Saccharomyces cerevisiae gene expression data sets, showing that our approach is better than a standard one at recovering known motifs and at generating biologically coherent modules. We also combine our results with binding localization data to obtain regulatory relationships with known transcription factors, and show that many of the inferred relationships have support in the literature.

Contact: eran@cs.stanford.edu

Keywords: probabilistic models, gene expression, transcriptional regulation.

INTRODUCTION

Many cellular processes are regulated at the transcriptional

level, by one or more transcription factors that bind to

short DNA sequence motifs in the upstream regions of the

process genes. These co-regulated genes then exhibit sim-

ilar patterns of expression. Given the upstream regions of

all genes, and measurements of their expression under var-

ious conditions, we could hope to ‘reverse engineer’ the

underlying regulatory mechanisms and identify transcrip-

tional modules—sets of genes that are co-regulated under

these conditions through a common motif or combination

of motifs.

∗ To whom correspondence should be addressed.

In this paper, we take a genome-wide approach for discovering this modular organization, based on the premise that transcriptional elements should 'explain' the observed

expression patterns as much as possible. We deﬁne a prob-

abilistic graphical model (Pearl, 1988) that integrates both

the gene expression measurements and the DNA sequence

data into a uniﬁed model. The model assumes that genes

are partitioned into modules, which determine the gene’s

expression proﬁle. Each module is characterized by a mo-

tif proﬁle, which speciﬁes the relevance of different se-

quence motifs to the module. A gene’s module assignment

is a function of the sequence motifs in its promoter re-

gion. However, our model does not assume that all motifs

are necessarily active. In fact, as motifs are usually short,

there are many genes where a motif is randomly present

but does not play a role. Furthermore, our goal is to dis-

cover motifs that play a regulatory role in some particular

set of experiments; a motif that is active in some settings

may be completely irrelevant in others. Our model identi-

ﬁes motif targets—genes where the motif plays an active

role in affecting regulation in a particular expression data

set. These motif targets are genes that have the motif and

that are assigned to modules containing the motif in their

proﬁle.

Our algorithm is outlined in Figure 1. It begins by

clustering the expression data, creating one module from

each of the resulting clusters. As the ﬁrst attempt towards

explaining these expression patterns, it searches for a

common motif in the upstream regions of genes assigned

to the same module. It then iteratively reﬁnes the model,

trying to optimize the extent to which the expression

proﬁle can be predicted transcriptionally. For example,

we might want to move a gene g whose promoter region

does not match its current module’s motif proﬁle, to

another module whose expression proﬁle is still a good

match, and whose motif proﬁle is much closer. Given these

assignments, we could then learn better motif models and

motif proﬁles for each module. This reﬁnement process

arises naturally within our algorithm, as a byproduct of the

expectation maximization (EM) algorithm for estimating

the model parameters.

Bioinformatics 19(Suppl. 1)

by guest on February 25, 2013http://bioinformatics.oxfordjournals.org/Downloaded from

E.Segal et al.

[Figure 1 diagram labels: Expression data; Upstream sequences; data selection; Clustering; Motif; Gene partition; Motif set; E-step; M-step; Add/Delete motifs; Transcriptional modules; Transcriptional module discovery procedure; Post-processing; Annotation analysis; Visualization & analysis; Gene annotations (GO); Protein complexes.]

Fig. 1. Schematic flow diagram of our proposed method. The pre-processing step includes selecting the input gene expression and upstream sequence data. The model is then trained using EM, and our algorithm for dynamically adding and deleting motifs. It is then evaluated on additional data sets.

In general, the motifs learned will not sufﬁce to char-

acterize all of the modules. As our goal is to provide a

genome-wide explanation of the expression behavior, our

algorithm identiﬁes poorly explained genes in modules

and searches for new motifs in their upstream regions. The

new motifs are then added to the model and subsequently

reﬁned using EM. As part of this dynamic learning

procedure, some motifs may become obsolete and are

removed from the model. The algorithm iterates until

convergence, adding and deleting motifs, and reﬁning

motif models and module assignments.

Our algorithm has several important advantages over

other attempts to relate upstream sequences and expres-

sion data. First, we use both expression and sequence

data together, requiring that modules display a coherent

proﬁle for both. This approach allows us to reﬁne both the

cluster assignments and motifs within the same algorithm.

In contrast, many approaches (e.g. Brazma et al., 1998;

Liu et al., 2001; Roth et al., 1998; Sinha and Tompa,

2000; Tavazoie et al., 1999) use gene expression mea-

surements to deﬁne clusters of genes that are potentially

co-regulated, and then search for common motifs in

the upstream regions of the genes in each cluster. The

expression analysis and motif ﬁnding are thus decoupled,

and neither the clusters nor the motifs are re-evaluated

once they are learned. Other approaches (e.g. Bussemaker

et al., 2001; Pilpel et al., 2001) work in the opposite

direction, ﬁrst identifying a set of candidate motifs, and

then trying to explain the expression using these motifs.

However, these approaches use a prespeciﬁed set of

motifs, which are never adapted during the algorithm.

Our approach is based on the framework of Segal et

al. (2002) but extends it in several important directions.

First, their approach made use of DNA localization

data, which are not widely available for all organisms

and under multiple growth conditions. In contrast, we

construct models that are based solely on sequence and

expression data, which are much easier to obtain. Second,

their approach used a predetermined number of motifs to

construct the model. To allow a genome-wide analysis, our

algorithm dynamically removes and adds motifs as needed

to explain the expression data as a whole. Finally, while

the models of Segal et al. (2002) allowed for detection of

context-speciﬁc regulation, the resulting structure is hard

to interpret. Our model assigns each gene to one module,

facilitating interpretability.

We tested our method on two distinct Saccharomyces

cerevisiae expression datasets. We show that our learned

models ﬁnd motifs that account for a much larger frac-

tion of the observed expression patterns in comparison to

standard approaches that ﬁrst cluster the expression pro-

ﬁles and then search for motifs in the upstream regions

of the genes in each cluster. Our approach also recovers a

much larger number of known motifs. We evaluated the

functional coherence of our transcriptional modules us-

ing a gene functional annotation database and two pro-

tein complex databases that were not given to the model

as input. We found enrichment for many more groups in

our models compared to standard approaches, suggesting

that our transcriptional modules are biologically more ac-

curate. Finally, we used the recent binding assays of Lee

et al. (2002) to relate the actual transcription factors to the

modules they regulate, resulting in a regulatory network;

we show that many of the regulatory relationships discov-

ered have support in the literature.


PROBABILISTIC MODEL

The basic entities in our model are the genes in some set G. We assume that the genes are partitioned into a set of K mutually exclusive and exhaustive transcriptional modules. Thus, each gene is associated with an attribute $M \in \{1, \ldots, K\}$ whose value represents the module to which the gene belongs. We now describe how these modules are related to expression profiles and to motif profiles.

Gene expression model For each gene g in G, we have expression measurements $g.E_1, \ldots, g.E_J$, where $g.E_j$ represents the log ratio mRNA expression level measured for gene g in experiment j. We assume that all of the genes in a single module exhibit the same gene expression behavior, and use the simple yet powerful Naive Bayes model (Cheeseman and Stutz, 1995) to represent this behavior. In this model, as applied in our setting, we assume that the expression measurements are conditionally independent given the module assignment:

$$P(E_1, \ldots, E_J \mid M) = \prod_{j=1}^{J} P(E_j \mid M).$$

As the expression measurements are real-valued, we model each conditional probability distribution $P(E_j \mid M = m)$ using a Gaussian distribution $\mathcal{N}(\mu_{mj}; \sigma^2_{mj})$.
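As an illustration of the expression model, the following minimal Python sketch evaluates the log of the product above for one gene and one module. The function name `expression_log_likelihood` and the example parameter values are illustrative assumptions and do not come from the paper.

```python
import math

def expression_log_likelihood(profile, means, variances):
    """Log P(E_1..E_J | M = m) for one gene under independent Gaussians.

    profile, means and variances are equal-length lists, one entry per experiment.
    """
    total = 0.0
    for e, mu, var in zip(profile, means, variances):
        # Log density of a univariate Gaussian with mean mu and variance var.
        total += -0.5 * math.log(2.0 * math.pi * var) - (e - mu) ** 2 / (2.0 * var)
    return total

# Hypothetical module parameters for J = 3 experiments.
module_means = [1.0, -0.5, 0.0]
module_vars = [0.25, 0.25, 1.0]
gene_profile = [0.9, -0.4, 0.3]
print(expression_log_likelihood(gene_profile, module_means, module_vars))
```

Working in log space avoids numerical underflow when J is large, for example with the 173 arrays of the stress data set.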

Motif model The second key component in our model is a set of variables that represent the regulation of the gene by motifs. For each gene g, we have a set of binary-valued Regulates variables $R = \{R_1, \ldots, R_L\}$, where $g.R_i$ takes the value true if motif i appears in the promoter region of gene g, allowing the motif to play a regulatory role in the gene's expression. We model the motif using a standard position specific scoring matrix (PSSM; Bailey and Elkan, 1994; Roth et al., 1998), which assigns a weight to each position in the motif and each nucleotide $\ell \in \{A, C, G, T\}$; this weight represents the extent to which the nucleotide's presence in this position is associated with the motif.

Fig. 2. Illustration of our unified probabilistic model, for a simple example with upstream regions of length three, two sequence motifs, three possible module assignments and three expression measurements for each gene.

We use the discriminative PSSM approach of Segal et al. (2002), which trains the PSSM weights to discriminate as much as possible between the presence and the absence of the motif. This approach provides better predictions, and entirely avoids the problems arising from high-frequency but meaningless motifs that are common in many upstream sequences. This model is specified using a standard binary logistic model. We have p position-specific weights $w_j[\ell]$, one for each position j and each letter $\ell \in \{A, C, G, T\}$, and a threshold $w_0$. For a promoter sequence of length N, we assume that binding occurs once, and with equal probability at each of the N − p + 1 possible positions in the sequence. The probability of binding given the sequence is then specified simply as:

$$P(g.R = \text{true} \mid S_1, \ldots, S_N) = \mathrm{logit}\left( \log\left( \frac{w_0}{N - p + 1} \sum_{j=1}^{N-p+1} \exp\left\{ \sum_{i=1}^{p} w_i[S_{i+j-1}] \right\} \right) \right)$$
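The binding probability can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `logit` is taken to be the logistic function 1/(1 + e^{-x}), the threshold w_0 is assumed to be positive and to multiply the position-averaged sum inside the logarithm, and the names `binding_probability`, `weights` and `w0` are hypothetical.

```python
import math

def binding_probability(seq, weights, w0):
    """P(R = true | S) for one motif (a sketch, not a reference implementation).

    seq     : promoter sequence over the alphabet A, C, G, T
    weights : list of p dicts, weights[i][letter] is the weight of letter at position i
    w0      : positive threshold (assumed to sit inside the logarithm)
    """
    p = len(weights)
    n_pos = len(seq) - p + 1
    if n_pos < 1:
        raise ValueError("sequence is shorter than the motif")
    # Score of the motif placed at every possible start position j.
    scores = [sum(weights[i][seq[j + i]] for i in range(p)) for j in range(n_pos)]
    # Log-sum-exp keeps the sum of exponentials numerically stable.
    top = max(scores)
    log_sum = top + math.log(sum(math.exp(s - top) for s in scores))
    x = math.log(w0) - math.log(n_pos) + log_sum
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    return math.exp(x) / (1.0 + math.exp(x))

# Hypothetical two-position motif that favors the dinucleotide "AC".
motif = [{"A": 2.0, "C": 0.0, "G": 0.0, "T": 0.0},
         {"A": 0.0, "C": 2.0, "G": 0.0, "T": 0.0}]
print(binding_probability("GGACGG", motif, 1.0))
```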


Regulation model We define the motif profile of a transcriptional module to be a set of weights $u_{mi}$, one for each motif, such that $u_{mi}$ specifies the extent to which motif i plays a regulatory role in module m. Roughly speaking, the strength of the association of a gene g with a module m is $\sum_{i=1}^{L} u_{mi}\, g.R_i$. The stronger the association of a gene with a module, the more likely it is to be assigned to it. We model this using a softmax conditional distribution, a standard extension of the binary logistic conditional distribution to the multi-class case:

$$P(g.M = \bar{m} \mid R_1 = r_1, \ldots, R_L = r_L) = \frac{\exp\left\{\sum_{i=1}^{L} u_{\bar{m}i}\, r_i\right\}}{\sum_{m'=1}^{K} \exp\left\{\sum_{i=1}^{L} u_{m'i}\, r_i\right\}}$$

As we expect a motif to be active in regulating only a small set of modules in a given setting, we limit the number of weights $u_{1i}, \ldots, u_{Ki}$ that are non-zero to some $h \ll K$. This restriction results in a sparse weight matrix for P(M | R), and ensures that each regulator affects at most h modules. In addition, for interpretability considerations, we require all weights to be non-negative. Intuitively, this means that a gene's assignment to specific transcriptional modules can only depend on features that correspond to the presence of certain motifs and not on the absence of motifs. For a module m, the set of motifs whose weights $u_{mi}$ are non-zero is called the motif profile of m.
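The softmax regulation model can be illustrated with a short Python sketch. The weight matrix below is a made-up example with three modules and two motifs; `module_posterior` is a hypothetical name, not one used in the paper.

```python
import math

def module_posterior(r, U):
    """P(M = m | R = r) for every module m.

    r : list of 0/1 values, one per motif (the Regulates variables)
    U : list of rows, U[m][i] >= 0 is the weight of motif i in module m
    """
    logits = [sum(u_mi * r_i for u_mi, r_i in zip(row, r)) for row in U]
    top = max(logits)                      # subtract the max for stability
    exps = [math.exp(z - top) for z in logits]
    norm = sum(exps)
    return [e / norm for e in exps]

# Hypothetical sparse, non-negative weights: motif 0 belongs to the profile of
# module 0, motif 1 to the profile of module 1, and module 2 has an empty profile.
U = [[2.0, 0.0],
     [0.0, 1.0],
     [0.0, 0.0]]
print(module_posterior([1, 0], U))
```

With no motif present, every module is equally likely under this parameterization, which is one consequence of requiring non-negative weights and using no bias term in this sketch.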


Unified model These three components are put together as a probabilistic graphical model, as shown in Figure 2; the model defines the following joint distribution:

$$P(g.R, g.M, g.E \mid g.S) = \prod_{i=1}^{L} P(g.R_i \mid g.S) \cdot P(g.M \mid g.R) \cdot \prod_{j=1}^{J} P(g.E_j \mid g.M),$$

where each of the above conditional probability distributions is parameterized as described in the previous sections.

LEARNING THE MODELS

In the previous section, we presented our probabilistic

model. We now turn to the task of learning this model from

data. Our data set D consists of a set of genes G, where for each gene g ∈ G we have a set of gene expression measurements $g.e_j$ for j = 1, ..., J and a DNA sequence g.S in the upstream region of the transcription start site for g. For this section, we restrict attention to a fixed

number of motifs, and address the problem of estimating

the model parameters to ﬁt the data. The model parameters

to be estimated are: the means and variances of the

normal distributions of the expression model, the softmax

weights and structure of the module assignments (i.e.

which sequence motifs each module depends on), and the

PSSM weights for each sequence motif.

We follow the standard approach of maximum likelihood

estimation: we ﬁnd the parameters θ that maximize

P(D | θ). Our learning task is made considerably more

difﬁcult by the fact that both the module assignment g.M

and the Regulates variables g.R are unobserved in the

training data. In this case, the likelihood function has

multiple local maxima, and no general method exists for

ﬁnding the global maximum. We thus use the Expectation

Maximization (EM) algorithm (Dempster et al., 1977),

which provides an approach for ﬁnding a local maximum

of the likelihood function.

Starting from an initial guess $\theta^{(0)}$ for the parameters,

EM iterates the following two steps. The E-step computes

the distribution over the unobserved variables given the

observed data and the current estimate of the parameters.

We use the hard assignment version of the EM algorithm,

where this distribution is used to select a likely completion

of the hidden variables. The M-step then re-estimates the

parameters by maximizing the likelihood with respect to

the completion computed in the E-step. This estimation

task differs for the different parts of the model.
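The alternation between a hard E-step and an M-step can be sketched on a deliberately simplified problem. The toy example below is an assumption made for illustration only: it clusters one-dimensional points under unit-variance Gaussians, which is far simpler than the model described in this paper, and all function names are hypothetical.

```python
def assign_points(points, means):
    """Hard E-step: pick the most likely cluster (closest mean) for every point."""
    return [min(range(len(means)), key=lambda k: (x - means[k]) ** 2) for x in points]

def update_means(points, assignment, means):
    """M-step: maximum likelihood mean of every cluster given the completion."""
    new_means = []
    for k, old in enumerate(means):
        members = [x for x, a in zip(points, assignment) if a == k]
        # Keep the old mean when a cluster is empty.
        new_means.append(sum(members) / len(members) if members else old)
    return new_means

def hard_em(points, means, max_iterations=50):
    """Iterate hard E-step and M-step until the parameters stop changing."""
    means = list(means)
    for _ in range(max_iterations):
        assignment = assign_points(points, means)
        new_means = update_means(points, assignment, means)
        if new_means == means:
            break
        means = new_means
    return means, assign_points(points, means)

means, assignment = hard_em([0.0, 0.2, 0.1, 5.0, 5.2, 4.9], [0.0, 1.0])
print(means, assignment)
```

As with any EM variant, the result depends on the initial parameters, which is why a good initialization matters.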

E-step: inferring modules and regulation Our task in the E-step is to compute the distribution over the unobserved data, which in our setting means computing P(g.M, g.R | g.E, g.S). As genes are assumed to be independent, this computation can be done separately for each gene.

However, although the softmax distribution for P(g.M |

g.R) has a compact parameterization, inference using this

distribution is still exponential in the number of Regulates

variables. Even if only a small number of these variables

is associated with any single module, for the purpose of

module assignment we need to consider all of the variables

associated with any module; this number can be quite

large, rendering exact inference intractable.

We devise a simple approximate algorithm for doing

this computation, which is particularly well-suited for

our setting. It exploits our expectation that, while a large

number of sequence motifs determine module assignment,

only a small number of motifs regulate a particular

transcriptional module. Consequently, given the module

assignment for a gene, we expect a small number of

Regulates variables for that gene to take the value true.

Our approximate algorithm therefore searches greedily for a small number of Regulates variables to activate for each module assignment. For each gene g, it considers every possible module assignment m, and finds a good assignment to the Regulates variables given that g.M = m. This assignment is constructed in a greedy way, by setting g.R variables to true one at a time, as long as P(g.M, g.R, g.E | g.S) improves. The joint setting for g.M and g.R which gives the overall best likelihood is then selected as the (approximate) most likely assignment. For the remainder of this section, let $g.\bar{m}$ and $g.\bar{r}_1, \ldots, g.\bar{r}_L$ represent the values selected for g.M and $g.R_1, \ldots, g.R_L$ respectively by the E-step. Full details of the algorithm are given in Figure 3a.
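The greedy search described above can be sketched as follows for a single gene. The scoring callback `joint_log_prob` stands in for log P(g.M, g.R, g.E | g.S); it and the toy scoring function are hypothetical placeholders, and the sketch follows the textual description (restart the Regulates variables for every candidate module, then keep the best joint setting).

```python
def greedy_e_step(num_modules, num_motifs, joint_log_prob):
    """Approximate most likely (module, Regulates) completion for one gene."""
    best_m, best_r, best_score = None, None, float("-inf")
    for m in range(num_modules):
        r = [False] * num_motifs
        score = joint_log_prob(m, r)
        improved = True
        while improved:                      # repeat until no variable helps
            improved = False
            for i in range(num_motifs):
                if r[i]:
                    continue
                r[i] = True                  # tentatively activate motif i
                trial = joint_log_prob(m, r)
                if trial > score:
                    score, improved = trial, True
                else:
                    r[i] = False             # undo: no improvement
        if score > best_score:
            best_m, best_r, best_score = m, list(r), score
    return best_m, best_r, best_score

def toy_score(m, r):
    """Hypothetical score: motif 0 strongly supports module 1; each active motif costs 1."""
    bonus = 3.0 if (m == 1 and r[0]) else 0.0
    prior = 0.5 if m == 0 else 0.0
    return bonus + prior - sum(r)

print(greedy_e_step(3, 2, toy_score))
```

The search evaluates on the order of K·L² joint settings per gene in the worst case, rather than the 2^L settings needed for exact inference.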

M-step: expression model Given the assignments of genes to modules as computed in the E-step, the maximum likelihood setting for the parameters of the expression model Gaussian distributions has a closed form solution. Letting $N_m$ be the number of genes assigned to module m, we have that the mean and variance of the Gaussian for experiment j given module assignment m are

$$\mu_{mj} = \frac{1}{N_m} \sum_{g \in G \,:\, g.\bar{m} = m} g.e_j \quad \text{and} \quad \sigma^2_{mj} = \frac{1}{N_m} \sum_{g \in G \,:\, g.\bar{m} = m} (g.e_j - \mu_{mj})^2.$$
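These closed-form estimates translate directly into code. The sketch below is a minimal illustration; `fit_expression_model` and the toy data are hypothetical.

```python
def fit_expression_model(profiles, assignments, num_modules):
    """Maximum likelihood mean and variance per module and experiment.

    profiles    : list of expression profiles, one list of J values per gene
    assignments : module index chosen for each gene in the E-step
    Returns a dict mapping module index -> (means, variances).
    """
    num_experiments = len(profiles[0])
    params = {}
    for m in range(num_modules):
        members = [p for p, a in zip(profiles, assignments) if a == m]
        if not members:
            continue                         # empty module: nothing to estimate
        n = len(members)
        means = [sum(p[j] for p in members) / n for j in range(num_experiments)]
        variances = [sum((p[j] - means[j]) ** 2 for p in members) / n
                     for j in range(num_experiments)]
        params[m] = (means, variances)
    return params

toy_profiles = [[1.0, 2.0], [3.0, 2.0], [0.0, 0.0]]
toy_assignments = [0, 0, 1]
print(fit_expression_model(toy_profiles, toy_assignments, 2))
```

A variance estimate can be exactly zero for small modules, as in the toy data; a practical implementation would probably need a small variance floor, which is an assumption not discussed in the text.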

(a)
For each gene g ∈ G
  Set g.M = 1
  Set g.R_i = false for 1 ≤ i ≤ L
  Set p = P(g.M, g.R | g.S, g.E)
  For m = 1 to K // for all modules
    Repeat // Find g.R that increases p
      Set p_best = p
      For i = 1 to L // for all regulates variables
        Set g.R_i = true
        Set p' = P(g.M = m, g.R | g.S, g.E)
        if p' > p
          Set g.M = m
          Set p = p'
        else
          Set g.R_i = false
    Until p_best = p

(b)
Set U = {}
Set iteration = 0
Let V = {v_mi} for 1 ≤ m ≤ K, 1 ≤ i ≤ L
Set MaxScore = max_V Score[V] // MaxScore = score of unconstrained fit
Set T = Threshold for closeness to MaxScore
Repeat
  Set iteration = iteration + 1
  Let U' = {u'_mi} for 1 ≤ m ≤ K, 1 ≤ i ≤ L, minus U
  Set U' = argmax_{U' ≥ 0} Score[U', U] // Optimize weights not in U; weights in U fixed
  For i = 1 to L // for all regulates variables
    Let m = argmax_m {u'_mi} for 1 ≤ m ≤ K
    Set U = U ∪ {u'_mi} // Add new non-zero weight
  Set U = argmax_{U ≥ 0} Score[U, 0] // Reoptimize weights in U; other weights = 0
Until iteration = max iteration or Score[U] >= MaxScore − T

(c)
Delete:
For i = 1 to L // for all regulates variables
  Set U' = U
  Set u'_mi = 0 for 1 ≤ m ≤ K
  If Score[U] − Score[U'] ≤ threshold
    Delete R_i
    Set U = U'
Add:
For m = 1 to K // for all modules
  Let G' = {}
  For each g s.t. g.m̄ = m
    Set g.r' = argmax_r P(g.r | g.S)
    Set g.m' = argmax_m P(g.M = m | g.R = g.r')
    If m' ≠ m
      Set G' = G' ∪ {g}
  Learn motif with positive set G'
  Add new Regulates variable with learned PSSM

Fig. 3. (a) Search procedure for E-step of EM. (b) Learning the softmax distribution for P(g.M | g.R) in the M-step. (c) Procedure for dynamically deleting and adding Regulates variables. In (b) and (c), U denotes the non-zero weights of P(g.M | g.R), and $\mathrm{Score}[U] = \sum_{g \in G} \log P(g.\bar{m} \mid g.\bar{r})$.

M-step: motif model We want the motif model to be a good predictor of the assignment $\bar{r}$ to the Regulates variables computed in the E-step. Thus, for each $R_i$, we aim to find the values of the parameters $w_0, w_j[\ell]$ that maximize the conditional log probability $\sum_{g \in G} \log P(g.\bar{r}_i \mid g.S_1, \ldots, g.S_N)$. Unfortunately, this optimization problem has no closed form solution, and there are many local maxima. We therefore use conjugate gradient ascent to find a local optimum in the parameter space. Conjugate gradient starts from an initial guess of the weights $w^{(0)}$. As for all local hill climbing methods, the quality of the starting point has a huge impact on the quality of the local optimum found by the algorithm. We therefore initialize the weights using the method of Barash et al. (2001), which efficiently generates motif seeds of length 6–15 and then scores them using the hypergeometric significance test. Each seed produced by this method is then expanded to produce a PSSM of the desired length, whose weights serve as an initialization point for the conjugate gradient procedure.
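To make the seed scoring step concrete, the sketch below computes a hypergeometric tail probability for a candidate seed. How exactly the test is applied is an assumption here: the sketch asks how surprising it is that a seed occurs in at least `cluster_with_seed` promoters of a gene set, given how often it occurs among all promoters. The function name and the example counts are hypothetical and are not taken from Barash et al. (2001).

```python
from math import comb

def hypergeom_tail(total, total_with_seed, cluster_size, cluster_with_seed):
    """P(X >= cluster_with_seed) when cluster_size genes are drawn without
    replacement from total genes, total_with_seed of which contain the seed."""
    upper = min(total_with_seed, cluster_size)
    denom = comb(total, cluster_size)
    numer = sum(
        comb(total_with_seed, k) * comb(total - total_with_seed, cluster_size - k)
        for k in range(cluster_with_seed, upper + 1)
    )
    return numer / denom

# Hypothetical counts: 1000 genes, 50 contain the seed, and 12 of the
# 40 genes in the set of interest contain it.
print(hypergeom_tail(1000, 50, 40, 12))
```

Smaller values indicate seeds that are more specific to the gene set and therefore more promising starting points for the gradient procedure.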

M-step: regulation model Finally, we consider the task of estimating the parameters for the distribution P(g.M | g.R). Our goal is to find a setting for the softmax weights $\{u_{mi}\}_{1 \le m \le K,\, 1 \le i \le L}$ so as to maximize the conditional log probability $\sum_{g \in G} \log P(g.M = g.\bar{m} \mid g.R = g.\bar{r})$. Although this optimization does not have a closed form solution, the function is concave in the weights of the softmax. Thus, every local maximum is a global maximum, which we can find using gradient ascent.

However, as we discussed in the previous section, we

also constrain this weight matrix to be sparse and each

weight to be non-negative. These constraints lead to more

desirable models, but also turn our task into a hard

combinatorial optimization problem. We use a greedy selection algorithm that tries to include non-zero weights for the most predictive motifs for each Regulates variable $R_i$. The algorithm, shown in Figure 3b, first finds the optimal setting to the full weight matrix; as we discussed, the optimal setting can be found using gradient ascent. For each variable $R_i$, it then selects the most predictive motif (the one whose weight is largest) and adds it to the motif profile U, which contains motifs that have non-zero weight. The optimal setting for the weights in U is then found by optimizing these weights, under the constraint that each weight in U is non-negative and the weights not in U must be zero. This problem is also convex, and can be solved using gradient methods. The algorithm then continues to search for additional motifs to include in the profile U. It finds the optimal setting to all weights while holding the weights in U fixed; it then selects the highest weight motifs not in U, adds them to U, and repeats. Weights are added to U until the sparseness limit is reached, or until the addition of motifs to U does not improve the overall score.
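The selection step of this greedy procedure can be sketched in isolation. The code below only illustrates how one new non-zero weight per Regulates variable might be chosen from a freshly optimized weight matrix; the optimization of the weights themselves is not shown, and `add_strongest_weights` and the example matrix are hypothetical.

```python
def add_strongest_weights(weights, selected):
    """For each motif i, add the (module, motif) pair with the largest
    positive weight that is not yet in the selected set.

    weights  : weights[m][i], optimized with the selected weights held fixed
    selected : set of (m, i) pairs that are already allowed to be non-zero
    """
    num_modules, num_motifs = len(weights), len(weights[0])
    updated = set(selected)
    for i in range(num_motifs):
        candidates = [(weights[m][i], m) for m in range(num_modules)
                      if (m, i) not in selected]
        if not candidates:
            continue
        best_weight, best_module = max(candidates)
        if best_weight > 0:                  # weights must stay non-negative
            updated.add((best_module, i))
    return updated

toy_weights = [[0.2, 0.0],
               [1.5, 0.0],
               [0.1, 0.7]]
print(add_strongest_weights(toy_weights, set()))
```

In the full procedure each call would be followed by a reoptimization of the selected weights with all other weights fixed at zero, and a check of the sparseness limit h for each motif.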

DYNAMICALLY ADDING AND REMOVING

SEQUENCE MOTIFS

In the previous section, we showed how to optimize the

model parameters given a ﬁxed set of motifs. We now

wish to devise a dynamic learning algorithm, capable of

both removing and adding sequence motifs as part of the

learning process. As we learn the models, some motifs may turn out not to be predictive, or to be redundant given the newly discovered motifs. Conversely, some modules may not be well explained by sequence motifs, so that new motifs should be added.

We add and remove motifs after each completion of the EM algorithm. (Note that EM itself iterates several times between the E-step and the M-step.) To determine whether $R_i$ should be deleted, we compute the conditional log probability $\sum_{g \in G} \log P(g.\bar{m} \mid g.\bar{r})$ both with and without $g.R_i$, leaving the values of other Regulates variables fixed. This computation tells us the contribution that $R_i$ makes towards the overall fit of the model. Variables that contribute below a certain threshold are subsequently removed from the model.
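The deletion test can be sketched as a difference of two scores. In the sketch below, removing a motif is modelled by zeroing its column of softmax weights, as in Figure 3c; the function names and the toy inputs are hypothetical.

```python
import math

def log_softmax_prob(m, r, U):
    """log P(M = m | R = r) under the softmax regulation model."""
    logits = [sum(u * x for u, x in zip(row, r)) for row in U]
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(z - top) for z in logits))
    return logits[m] - log_norm

def score(assignments, regulates, U):
    """Sum over genes of log P(assigned module | Regulates values)."""
    return sum(log_softmax_prob(m, r, U) for m, r in zip(assignments, regulates))

def motif_contribution(i, assignments, regulates, U):
    """Drop in score when the weights of motif i are set to zero."""
    without = [[0.0 if k == i else u for k, u in enumerate(row)] for row in U]
    return score(assignments, regulates, U) - score(assignments, regulates, without)

# Toy model: two modules, two motifs; only motif 0 carries a non-zero weight.
toy_U = [[2.0, 0.0],
         [0.0, 0.0]]
toy_assign = [0, 1]
toy_regs = [[1, 0], [0, 0]]
print(motif_contribution(0, toy_assign, toy_regs, toy_U))
```

A motif whose contribution falls below the chosen threshold would be a candidate for deletion; the threshold value itself is not specified in the text.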


We try to add motifs when the current set of motifs does

not provide a satisfactory explanation of the expression

data: when there are genes for which the sequence

predictions do not match the expression proﬁle. We deﬁne

the residual for a transcriptional module m to be the set

of genes that are assigned to module m in the E-step, but

would not be assigned to m based on the sequence alone.

We determine the sequence-only assignment of each gene by computing

$$g.r' = \arg\max_{r} P(g.r \mid g.S) \quad \text{and} \quad g.m' = \arg\max_{m} P(g.M = m \mid g.R = g.r').$$

We then attempt to provide a better prediction for the residual genes by adding a sequence motif that is trained to match these genes. Once a new Regulates variable is added, it becomes part of the model and its assignment and parameterization are adapted as part of the next EM iteration, as described in the previous section. This

process tests whether a new motif contributes to the

overall model ﬁt, and may assign it a non-zero weight.

Importantly, a motif that was trained for the residuals

of one module often gets non-zero weights for other

modules as well, allowing the same motif to participate in

multiple modules. Full details of the algorithm are given

in Figure 3c.
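The residual of each module can be collected with a simple comparison of the two assignments. The sketch below assumes both assignments are already available as dictionaries; `residual_genes` and the gene identifiers are hypothetical.

```python
def residual_genes(em_assignment, sequence_only_assignment):
    """Group genes whose sequence-only module differs from their E-step module.

    Both arguments map gene identifiers to module indices. The result maps
    each module to the set of its residual genes.
    """
    residuals = {}
    for gene, module in em_assignment.items():
        if sequence_only_assignment.get(gene) != module:
            residuals.setdefault(module, set()).add(gene)
    return residuals

toy_em = {"g1": 0, "g2": 0, "g3": 1}
toy_seq = {"g1": 0, "g2": 1, "g3": 0}
print(residual_genes(toy_em, toy_seq))
```

Each non-empty residual set would then serve as the positive set for training one new motif.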

RESULTS

Models learned We evaluated our method separately on two different S. cerevisiae gene expression datasets, one consisting of 173 microarrays, measuring the responses to various stress conditions (Gasch et al., 2000), and another consisting of 77 microarrays, measuring expression during the cell cycle (Spellman et al., 1998). We also obtained the 500 bp upstream region of each gene (sequences were retrieved from SGD; Cherry et al., 1998).

The EM algorithm requires an initial setting to all pa-

rameters. We use the standard procedure for learning mo-

tifs from expression data to initialize the model parame-

ters: we ﬁrst cluster the expression proﬁles, resulting in a

partition of genes to clusters, and then learn a motif for

each of the resulting clusters. For clustering the expres-

sion, we use the probabilistic hierarchical clustering algo-

rithm of Segal et al. (2001). For learning motifs, we use

the motif ﬁnder described above. To specify the initial pa-

rameterization of our model, we treat these clusters and

motifs as if they were the result of an E-step, assigning a

value to all of the variables g.M and g.R, and learn the

model parameters as described above.

For the stress data, we use 1010 genes which showed a significant change in expression, excluding members of the generic stress response cluster (Gasch et al., 2000). We initialized 20 modules using standard clustering, and learned the associated 20 sequence motifs. From this starting point, the algorithm converged after 5 iterations, each consisting of an EM step and a motif addition/deletion step, resulting in a total of 49 motifs. For the cell cycle data, we learned a model with 15 clusters over the 795 cell cycle genes defined by Spellman et al. (1998). The algorithm converged after 6 iterations, ending with 27 motifs.

[Figure 4 plot: x-axis, Num Iterations (1 to 6); y-axis, Num Genes Correctly Predicted (150 to 350); series: Cell cycle model, Stress model.]

Fig. 4. Number of genes whose module assignment can be correctly predicted based on sequence alone, where a correct prediction is one that matches the module assignment when the expression is included. Predictions are shown for each iteration of the learning procedure.

Predicting expression from sequence Our approach aims

to explain expression data as a function of sequence

motifs. Hence, one metric for evaluating a model is its

ability to associate genes with modules based on their

promoter sequence alone. Speciﬁcally, we compare the

module assignment of each gene when we consider only

the sequence data to its module assignment considering

both expression and sequence data. Figure 4 shows the

total number of genes whose expression-based module

assignment is correctly predicted using only the sequence,

as the algorithm progresses and sequence motifs are

added. As can be seen, the predictions improve across

the learning iterations, and signiﬁcantly outperform the

standard approach (which is iteration 1). Ultimately, our

model converges to 340 and 296 genes correctly predicted

in the cell cycle and stress models, respectively, compared

to 158 and 152 for the standard approach.

Gene expression coherence These results indicate that

our model assigns genes to modules such that genes

assigned to the same module are generally enriched for

the same motifs. However, we can achieve such an orga-

nization by simply assigning genes to modules based only

on their sequence, while entirely ignoring the expresssion

data. To verify the quality of our modules relative to gene

expression data, we deﬁne the expression coherence of

a module to be the average Pearson correlation between

each pair of genes assigned to it, where the Pearson

i278

by guest on February 25, 2013http://bioinformatics.oxfordjournals.org/Downloaded from

Genome-wide discovery of transcriptional modules

[Figure 5: six scatter plots, panels (a)–(f). Panels (a) and (d) have the axes "Expression coherence (standard clustering)" and "Expression coherence (our method)". Panels (b), (c), (e) and (f) have the axes "-Log(p-value) for standard clustering" and "-Log(p-value) for our method".]

Fig. 5. Comparison of standard clustering and the proposed method. (a)–(c) are for the cell cycle dataset (Spellman et al., 1998) and (d)–(f) are for the stress expression dataset (Gasch et al., 2000). (a), (d) Comparison of the expression coherence for each inferred module (or cluster in the standard clustering model). (b), (e) Comparison of enrichment of the targets of each motif for functional annotations from the GO database. For each annotation, the largest negative log p-value obtained from analyzing the targets of all motifs is shown. (c), (f) Comparison of enrichment of the targets of each motif for protein complexes. For each protein complex, the largest negative log p-value obtained from any of the motifs is shown.

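correlation is

$$\mathrm{Pearson}(g_1.E, g_2.E) = \frac{1}{m} \sum_{l=1}^{m} \frac{(g_1.E_l - \mu_{g_1})}{\sigma_{g_1}} \cdot \frac{(g_2.E_l - \mu_{g_2})}{\sigma_{g_2}}$$

where $\mu_{g_i}, \sigma_{g_i}$ are the mean and standard deviation of the $m$ entries in $g_i.E$. Figure 5a,d compares the expression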

coherence of our modules to those built from standard clustering for the cell cycle and stress data, showing nearly identical coherence of expression profiles. For the cell cycle data, there was even a slight increase in the coherence of the expression profiles for our model. Thus, our model results in clusters that are more enriched for motifs, while achieving the same quality of expression patterns as standard clustering, which only tries to optimize the expression score.
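The expression coherence of a single module can be sketched as follows. This is an illustrative implementation rather than the code used for the experiments: it assumes each gene's profile is a list of numbers of equal length, that a module contains at least two genes, and that no profile is constant (otherwise the standard deviation is zero). The function names are hypothetical.

```python
from itertools import combinations
from statistics import fmean, pstdev


def pearson(x, y):
    """Pearson correlation as the mean product of standardized entries,
    using the population standard deviation."""
    mx, my = fmean(x), fmean(y)
    sx, sy = pstdev(x), pstdev(y)
    return fmean([((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)])


def expression_coherence(profiles):
    """Average Pearson correlation over all pairs of genes in one module."""
    pairs = list(combinations(profiles, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


module = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with the first gene
    [4.0, 3.0, 2.0, 1.0],  # perfectly anti-correlated with the first gene
]
print(round(expression_coherence(module), 3))  # -0.333
```

In the toy module, one pair has correlation +1 and two pairs have correlation -1, so the average is -1/3.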

Coherence of motif targets As we discussed, the motif

proﬁle characterizing a module allows us to deﬁne a notion

of motif targets—genes that contain the motif, and where

the motif plays a role in its expression proﬁle, i.e. those

assigned to a module whose motif proﬁle contains the

motif. In the standard clustering model, we can deﬁne the

targets of a motif to be those genes that have the motif and

belong to the cluster from which the motif was learned.

We tested whether our motif targets correspond to

functional groups, by measuring their enrichment for

genes in the same functional category according to the

gene annotation database of GO (Ashburner et al., 2000).

We used only GO categories with 5 or more genes

associated with them, resulting in 537 categories. For each

annotation and each motif, we computed the fraction of

genes in the targets of that motif associated with that

annotation and used the hypergeometric distribution to

calculate a p-value for this fraction, and took p-value <

0.05 to be signiﬁcant. We compared, for both expression

data sets, the enrichment of the motif targets for GO

annotations between our model and standard clustering.

We found many annotations that were enriched in both

models. However, there were 24 and 29 annotations that

were signiﬁcantly enriched in our cell cycle and stress


models, respectively, that were not enriched at all in the standard clustering model, compared to only 4 and 14 categories that were enriched only in the standard clustering model for the same two datasets. Among those categories

enriched only in our model were carbohydrate catabolism,

cell wall organization and galactose metabolism, all of

which are processes known to be active in response to

various stress conditions that we can now characterize by

sequence motifs. A full comparison of the GO enrichment

for both datasets is shown in Figure 5b,e.
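The hypergeometric enrichment test used for the GO annotations can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: the p-value is taken to be the upper tail P(X >= overlap), and the function and argument names are hypothetical rather than taken from the paper.

```python
from math import comb


def hypergeom_pvalue(n_total, n_annotated, n_targets, n_overlap):
    """P(X >= n_overlap) when n_targets genes are drawn without replacement
    from n_total genes, of which n_annotated carry the annotation."""
    upper = min(n_annotated, n_targets)
    tail = sum(
        comb(n_annotated, k) * comb(n_total - n_annotated, n_targets - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / comb(n_total, n_targets)


# Toy numbers: 10 genes, 4 annotated, 3 motif targets, 2 of them annotated.
p = hypergeom_pvalue(10, 4, 3, 2)
print(round(p, 4))  # 0.3333
print(p < 0.05)     # False
```

With the toy numbers the tail is (6*6 + 4*1) / 120 = 1/3, so this annotation would not be called significant at the 0.05 threshold.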

Since functional categories do not necessarily correspond to co-regulation groups, we also tested the enrichment of our motif targets for protein complexes, as compiled experimentally in the assays of Gavin et al. (2002) and Ho et al. (2002), consisting of 590 and 493 complexes, respectively. The member genes of protein complexes are often co-regulated and we thus expect to find enrichment for them in our motif targets. We associated each gene with the complexes it is assigned to in each protein complex dataset and computed the p-value of the enrichment of the targets of each motif for each complex, as we did above for the GO annotations. The results for the cell cycle and stress datasets are summarized in Figure 5c,f, showing much greater enrichment of our motif targets than the targets of the motifs identified using the standard approach, with 63 and 10 complexes significantly enriched only in our model, and no complexes only enriched in the standard approach, for the stress and cell cycle models, respectively.
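The comparison between the two approaches reduces to keeping, for each complex (or annotation), the best p-value over all motifs and then checking which complexes pass the threshold in only one approach. A minimal sketch, assuming the p-values are stored in dictionaries keyed by (motif, annotation) pairs; the data layout, function names and toy values are hypothetical:

```python
def best_pvalue_per_annotation(pvalues):
    """Keep, for each annotation, the smallest p-value over all motifs.

    `pvalues` maps (motif, annotation) pairs to enrichment p-values.
    """
    best = {}
    for (_motif, annotation), p in pvalues.items():
        best[annotation] = min(p, best.get(annotation, 1.0))
    return best


def enriched_only_in_one(ours, standard, alpha=0.05):
    """Return (annotations significant only in `ours`,
    annotations significant only in `standard`)."""
    sig_ours = {a for a, p in best_pvalue_per_annotation(ours).items() if p < alpha}
    sig_std = {a for a, p in best_pvalue_per_annotation(standard).items() if p < alpha}
    return sig_ours - sig_std, sig_std - sig_ours


# Hypothetical p-values for two complexes under the two approaches.
ours = {("motif 1", "complex A"): 0.001, ("motif 2", "complex A"): 0.2,
        ("motif 1", "complex B"): 0.03}
standard = {("motif 1", "complex A"): 0.01, ("motif 1", "complex B"): 0.4}
only_ours, only_standard = enriched_only_in_one(ours, standard)
print(sorted(only_ours), sorted(only_standard))  # ['complex B'] []
```

The smallest p-value per annotation corresponds to the largest negative log p-value plotted in Figure 5.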

Motifs and motif proﬁles We compared the motifs we

identiﬁed to motifs from the literature. Of the 49 motifs

learned for the stress model, 22 are known, compared

to only 10 known motifs learned using the standard

approach. For the cell cycle model, 15 of the 27 learned

motifs are known, compared to only 8 known motifs

learned using the standard approach. Many of the known

motifs identiﬁed, such as the stress element STRE, the

heat shock motif HSF and the cell cycle motif MCM1, are

also known to be active in the respective datasets.

A powerful feature of our approach is its ability to

characterize modules by motif proﬁles. This ability is

particularly important for higher eukaryotes, in which

regulation often occurs through multiple distinct motifs.

To illustrate the motif proﬁles found by our approach, we

found for each motif all modules enriched for the presence

of that motif. This was done by associating each gene with

the motifs in its upstream region, and then computing the

p-value of the enrichment of the member genes of each

module. Figure 6a shows all the module-motif pairs in

which the module was enriched for the motif with p-value

<0.05. In addition, the ﬁgure indicates (by red circles) all

pairs in which the motif appears in the module’s motif

profile. As can be seen, many of the profiles contain multiple

motifs, and many motifs were used by more than one

module. Even though modules share motifs, each module

is characterized by a unique combination of motifs.
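The two observations above (motifs reused across modules, yet a distinct combination per module) can be checked mechanically. A minimal sketch, assuming each motif profile is represented as a set of motif names keyed by module name; this representation and the toy profiles are illustrative only:

```python
from collections import Counter


def motifs_shared_across_modules(profiles):
    """Return the motifs that appear in more than one module's profile."""
    counts = Counter(m for profile in profiles.values() for m in profile)
    return {m for m, c in counts.items() if c > 1}


def profiles_are_unique(profiles):
    """True if no two modules have exactly the same set of motifs."""
    return len({frozenset(p) for p in profiles.values()}) == len(profiles)


# Hypothetical motif profiles for three modules.
profiles = {
    "module 1": {"motif A", "motif B"},
    "module 2": {"motif B"},
    "module 3": {"motif A", "motif C"},
}
print(sorted(motifs_shared_across_modules(profiles)))  # ['motif A', 'motif B']
print(profiles_are_unique(profiles))  # True
```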

Inferring regulatory networks Identifying the active motifs is a significant step towards understanding the regulatory mechanisms governing gene expression. However, we would also want to know the identity of the transcription factor (TF) molecules that bind to these sequence motifs. We used the DNA binding assays of Lee et al. (2002), which directly detect to which promoter regions a particular TF binds in vivo, and associated TFs with the motifs we learned. For each motif, we computed the fraction, among the motif targets, of genes bound by each TF, as measured in the data of Lee et al. We used the hypergeometric distribution to assign a p-value to each such fraction and took p-value <0.05 to be significant. Inspection of the significant associations showed that, in most cases, there was a unique motif that was significant for the TF and that a high fraction (>0.5) of the TF's binding targets were among the motif target genes.
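The two fractions mentioned above are simple set overlaps. A minimal sketch, assuming the motif targets and the genes bound by a TF are given as sets of gene names; the function name and the toy gene identifiers are hypothetical, and the significance test itself is not repeated here:

```python
def tf_motif_overlap(motif_targets, tf_bound_genes):
    """Return (fraction of motif targets bound by the TF,
    fraction of the TF's bound genes that are motif targets)."""
    shared = motif_targets & tf_bound_genes
    return len(shared) / len(motif_targets), len(shared) / len(tf_bound_genes)


# Toy sets of gene names.
motif_targets = {"g1", "g2", "g3", "g4"}
tf_bound_genes = {"g2", "g3", "g4", "g9", "g10"}
frac_targets_bound, frac_bound_in_targets = tf_motif_overlap(motif_targets, tf_bound_genes)
print(frac_targets_bound, frac_bound_in_targets)  # 0.75 0.6
```

In this toy case the second fraction (0.6) exceeds the 0.5 level mentioned in the text, meaning most of the TF's bound genes are targets of the motif.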

Based on this strong association between TFs and

motifs, for each such TF-motif pair, we predicted that

the TF regulates all the modules that are characterized

by the motif. By combining all associations, we arrived

at the regulatory network shown in Figure 6b. Of the

106 transcription factors measured in Lee et al., 28 were

enriched in the targets of at least one motif and were

thus added to the resulting network. Of the 20 modules,

16 were associated with at least one TF. To validate

the quality of the network, we searched the biological

literature and compiled a list of experimentally veriﬁed

targets for each of the 28 TFs in our network. We then

marked each association between a TF and a module as

supported if the module contains at least one gene that the

TF is known to regulate from biological experiments. As

current knowledge is limited, there are very few known

targets for most TFs. Nevertheless, we found support

for 21 of the 64 associations. We also computed the p-value for each supported association between a TF and a module, using the binomial distribution with probability of success p = K/N, where K is the total number of known targets for the TF and N is the total number of genes (1010). The p-value is then P(X ≥ ℓ | X ∼ B(p, n)), where ℓ is the total number of known targets of the regulator in the supported module and n is the number of genes in the supported module. The resulting p-values

are shown in Figure 6b by edge thickness and color.
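The binomial tail used for a supported association can be sketched directly from its definition. This is an illustrative implementation; the function name is hypothetical, and the numbers in the second example (20 known targets, 3 of them in a module of 31 genes) are invented for illustration and are not results from the paper. Only N = 1010 comes from the text.

```python
from math import comb


def binomial_tail(ell, n, p):
    """P(X >= ell) for X ~ Binomial(n trials, success probability p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(ell, n + 1))


# Small case that can be checked by hand: P(X >= 1) with n = 2, p = 0.5.
print(binomial_tail(1, 2, 0.5))  # 0.75

# Hypothetical association: a TF with K = 20 known targets among N = 1010
# genes, 3 of which fall in a module of n = 31 genes.
K, N = 20, 1010
print(binomial_tail(3, 31, K / N))
```

A small tail probability means that finding that many known targets in a module of that size would be unlikely if known targets were spread uniformly over all genes.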

We assigned a name to each module based on a concise

summary of its gene content (compiled from both gene

annotation and literature). The regulatory network thus

contains predictions for the processes regulated by each

TF, where for each association the prediction includes

the motif through which the regulation occurs. In many


[Figure 6(a): matrix of motifs vs. modules. Motif labels: Motif 1–17, 19 and 20; Added Motif 1–3, 6, 7, 9–17, 21, 27, 29–31, 33, 34, 36, 39, 42, 48, 51, 54, 55, 60 and 61. Module labels:
19 – Glycolysis (80)
18 – Redox regulation (10)
17 – Mixed I (74)
16 – Protein folding (45)
15 – Chromatin remodeling (8)
14 – Energy and TCA cycle (66)
13 – Oxidative phosphorylation (31)
12 – Unknown I (75)
11 – Nitrogen metabolism (35)
10 – Cell cycle (44)
9 – Protein folding (47)
8 – Ergosterol biosynthesis (65)
7 – Sucrose metabolism (55)
6 – Translation (80)
5 – Heat shock (61)
4 – Pheromone response (93)
3 – Transport (30)
2 – Aldehyde metabolism (34)
1 – Galactose metabolism (15)
0 – Amino acid metabolism (52)
Figure 6(b): regulatory network. Transcription factor labels: Cbf1, g81, Fzf1, Met4, Gat1, Dal8, Cha4, Hir2, Gln3, Hap4, Ste12, o80, Crz1, Yap3, Nrg1, Dig1, Dot6, Gcr2, Cad1, Gcr1. Edge legend: Not supported; Supported (p>.05); Supported (p<.05); Supported (p<.001).]

Fig. 6. (a) Matrix of motifs vs. modules for the stress data, where a module-motif entry is colored if the member genes of that module were enriched for that motif with p-value <0.05. The intensity corresponds to the fraction of genes in the module that had the motif. Entries in the module's motif profile are circled in red. Modules were assigned names based on a summary of their gene content. (b) Regulatory network inferred from our model using the DNA binding assays of Lee et al. Ovals correspond to transcription factors and rectangles to modules (see (a) for module names).

cases, our approach recovered coherent biological processes along with their known regulators. Examples of such associations include: Hap4, the known activator of oxidative phosphorylation, with the oxidative phosphorylation module (13); Gcr2, a known positive regulator of glycolysis, with the glycolysis module (19); Mig1, a glucose repressor, with the galactose metabolism module (1); Ste12, involved in regulation of pheromone pathways, with the pheromone response module (4); and Met4, a positive regulator of sulfur amino acid metabolism, with the amino acid metabolism module (0).

CONCLUSIONS

We presented a unified probabilistic model over both gene expression and sequence data, whose goal is to identify transcriptional modules and the regulatory motif binding sites that control their regulation within a given set of experiments. Our results indicate that our method discovers modules that are both highly coherent in their expression profiles and significantly enriched for common motif binding sites in upstream regions of genes assigned to the same module. A comparison to the common approach of constructing clusters based only on expression and then learning a motif for each cluster shows that our method recovers modules that have a much higher correspondence to external biological knowledge of gene annotations and protein complex data.

ACKNOWLEDGEMENTS

This work was supported by the National Science Foundation, grant ACI-0082554. Eran Segal was also supported by a Stanford Graduate Fellowship (SGF).

REFERENCES

Ashburner,M., Ball,C.A., Blake,J.A., Botstein,D., Butler,H., Cherry,J.M., Davis,A.P., Dolinski,K., Dwight,S.S., Eppig,J.T., et al. (2000) Gene ontology: tool for the unification of biology. The gene ontology consortium. Nat. Genet., 25, 25–29.

Bailey,T.L. and Elkan,C. (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc. Int. Conf. Intell. Syst. Mol. Biol., Volume 2, pp. 28–36.

Barash,Y., Bejerano,G. and Friedman,N. (2001) A simple hypergeometric approach for discovering putative transcription factor binding sites. Algorithms in Bioinformatics, Number 2149 in LNCS, pp. 278–293.

Brazma,A., Jonassen,I., Vilo,J. and Ukkonen,E. (1998) Predicting gene regulatory elements in silico on a genomic scale. Genome Res., 8, 1202–1215.

Bussemaker,H., Li,H. and Siggia,E. (2001) Regulatory element detection using correlation with expression. Nat. Genet., 27, 167–171.

Cheeseman,P. and Stutz,J. (1995) Bayesian classification (AutoClass): Theory and results. Advances in Knowledge Discovery and Data Mining. AAAI Press, Menlo Park, CA, pp. 153–180.

Cherry,J.M., Adler,C., Ball,C., Chervitz,S.A., Dwight,S.S., Hester,E.T., Jia,Y., Juvik,G., Roe,T., Schroeder,M., Weng,S. and Botstein,D. (1998) SGD: Saccharomyces genome database. Nucleic Acids Res., 26, 73–79.

Dempster,A.P., Laird,N.M. and Rubin,D.B. (1977) Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. B, 39, 1–39.

Gasch,A.P., Spellman,P.T., Kao,C.M., Carmel-Harel,O., Eisen,M.B., Storz,G., Botstein,D. and Brown,P.O. (2000) Genomic expression program in the response of yeast cells to environmental changes. Mol. Biol. Cell, 11, 4241–4257.

Gavin,A.C., Bosche,M., Krause,R., Grandi,P., Marzioch,M., Bauer,A., Schultz,J., Rick,J.M., Michon,A.M., Cruciat,C.M., et al. (2002) Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415, 141–147.


Ho,Y., Gruhler,A., Heilbut,A., Bader,G.D., Moore,L., Adams,S.L., Millar,A., Taylor,P., Bennett,K., Boutilier,K., et al. (2002) Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature, 415, 180–183.

Lee,T., Rinaldi,N.J., Robert,F., Odom,D.T., Bar-Joseph,Z., Gerber,G.K., Hannett,N.M., Harbison,C.T., Thompson,C.M., Simon,I., et al. (2002) Transcriptional regulatory networks in Saccharomyces cerevisiae. Science, 298, 824–827.

Liu,X., Brutlag,D. and Liu,J. (2001) Bioprospector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac. Symp. Biocomput., pp. 127–138.

Pearl,J. (1988) Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann.

Pilpel,Y., Sudarsanam,P. and Church,G. (2001) Identifying regulatory networks by combinatorial analysis of promoter elements. Nat. Genet., 29, 153–159.

Roth,F., Hughes,P., Estep,J.D. and Church,G. (1998) Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat. Biotechnol., 16, 939–945.

Segal,E., Taskar,B., Gasch,A., Friedman,N. and Koller,D. (2001) Rich probabilistic models for gene expression. Bioinformatics, 17(Suppl 1), S243–S252.

Segal,E., Barash,Y., Simon,I., Friedman,N. and Koller,D. (2002) From sequence to expression: A probabilistic framework. In Proc. RECOMB, pp. 263–272.

Sinha,S. and Tompa,M. (2000) A statistical method for finding transcription factor binding sites. In Proc. Int. Conf. Intell. Syst. Mol. Biol., Volume 8, pp. 344–354.

Spellman,P.T., Sherlock,G., Zhang,M.O., Iyer,V.R., Anders,K., Eisen,M.B., Brown,P.O., Botstein,D. and Futcher,B. (1998) Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell, 9(12), 3273–3297.

Tavazoie,S., Hughes,J.D., Campbell,M.J., Cho,R.J. and Church,G.M. (1999) Systematic determination of genetic network architecture. Nat. Genet., 22, 281–285. Comment in: Nat. Genet., 22, 213–215.

i282

by guest on February 25, 2013http://bioinformatics.oxfordjournals.org/Downloaded from

Ionizing Radiation and Estrogen Affecting Growth Factor Genes in an Experimental Breast Cancer Model

Article

Full-text available

Nov 2022
INT J MOL SCI

Genes associated with growth factors were previously analyzed in a radiation- and estrogen-induced experimental breast cancer model. Such in vitro experimental breast cancer model was developed by exposure of the immortalized human breast epithelial cell line, MCF-10F, to low doses of high linear energy transfer (LET) α particle radiation (150 keV/μm) and subsequent growth in the presence or absence of 17β-estradiol. The MCF-10F cell line was analyzed in different stages of transformation after being irradiated with either a single 60 cGy dose or 60/60 cGy doses of alpha particles. In the present report, the profiling of differentially expressed genes associated with growth factors was analyzed in their relationship with clinical parameters. Thus, the results indicated that Fibroblast growth factor2 gene expression levels were higher in cells transformed by radiation or in the presence of ionizing radiation; whereas the fibroblast growth factor-binding protein 1gene expression was higher in the tumor cell line derived from this model. Such expressions were coincident with higher values in normal than malignant tissues and with estrogen receptor (ER) negative samples for both gene types. The results also showed that transforming growth factor alpha gene expression was higher in the tumor cell line than the tumorigenic A5 and the transformed A3 cell line, whereas the transforming growth factor beta receptor 3 gene expression was higher in A3 and A5 than in Tumor2 cell lines and the untreated controls and the E cell lines. Such gene expression was accompanied by results indicating negative and positive receptors for transforming growth factor alpha and the transforming growth factor beta receptor 3, respectively. Such expressions were low in malignant tissues when compared with benign ones. 
Furthermore, Fibroblast growth factor2, the fibroblast growth factor-binding protein 1, transforming growth factor alpha, the transforming growth factor beta receptor 3, and the insulin growth factor receptor gene expressions were found to be present in all BRCA patients that are BRCA-Basal, BRCA-LumA, and BRCA-LumB, except in BRCA-Her2 patients. The results also indicated that the insulin growth factor receptor gene expression was higher in the tumor cell line Tumor2 than in Alpha3 cells transformed by ionizing radiation only; then, the insulin growth factor receptor was higher in the A5 than E cell line. The insulin growth factor receptor gene expression was higher in breast cancer than in normal tissues in breast cancer patients. Furthermore, Fibroblast growth factor2, the fibroblast growth factor-binding protein 1, transforming growth factor alpha, the transforming growth factor beta receptor 3, and the insulin growth factor receptor gene expression levels were in stages 3 and 4 of breast cancer patients. It can be concluded that, by using gene technology and molecular information, it is possible to improve therapy and reduce the side effects of therapeutic radiation use. Knowing the different genes involved in breast cancer will make possible the improvement of clinical chemotherapy.

Cell Adhesion Molecules Affected by Ionizing Radiation and Estrogen in an Experimental Breast Cancer Model

Article

Full-text available

Oct 2022
INT J MOL SCI

Cancer develops in a multi-step process where environmental carcinogenic exposure is a primary etiological component, and where cell–cell communication governs the biological activities of tissues. Identifying the molecular genes that regulate this process is essential to targeting metastatic breast cancer. Ionizing radiation can modify and damage DNA, RNA, and cell membrane components such as lipids and proteins by direct ionization. Comparing differential gene expression can help to determine the effect of radiation and estrogens on cell adhesion. An in vitro experimental breast cancer model was developed by exposure of the immortalized human breast epithelial cell line MCF-10F to low doses of high linear energy transfer α particle radiation and subsequent growth in the presence of 17β-estradiol. The MCF-10F cell line was analyzed in different stages of transformation that showed gradual phenotypic changes including altered morphology, increase in cell proliferation relative to the control, anchorage-independent growth, and invasive capability before becoming tumorigenic in nude mice. This model was used to determine genes associated with cell adhesion and communication such as E-cadherin, the desmocollin 3, the gap junction protein alpha 1, the Integrin alpha 6, the Integrin beta 6, the Keratin 14, Keratin 16, Keratin 17, Keratin 6B, and the laminin beta 3. Results indicated that most genes had greater expression in the tumorigenic cell line Tumor2 derived from the athymic animal than the Alpha3, a non-tumorigenic cell line exposed only to radiation, indicating that altered expression levels of adhesion molecules depended on estrogen. There is a significant need for experimental model systems that facilitate the study of cell plasticity to assess the importance of estrogens in modulating the biology of cancer cells.

Inferring weighted gene annotations from expression data

Preprint

Full-text available

Dec 2016

Annotating genes with information describing their role in the cell is a fundamental goal in biology, and essential for interpreting data-rich assays such as microarray analysis and RNA-Seq. Gene annotation takes many forms, from Gene Ontology (GO) terms, to tissues or cell types of significant expression, to putative regulatory factors and DNA sequences. Almost invariably in gene databases, annotations are connected to genes by a Boolean relationship, e.g., a GO term either is or isn’t associated with a particular gene. While useful for many purposes, Boolean-type annotations fail to capture the varying degrees by which some annotations describe their associated genes and give no indication of the relevance of annotations to cellular logistical activities such as gene expression. We hypothesized that weighted annotations could prove useful for understanding gene function and for interpreting gene expression data, and developed a method to generate these from Boolean annotations and a large compendium of gene expression data. The method uses an independent component analysis-based approach to find gene modules in the compendium, and then assigns gene-specific weights to annotations proportional to the degree to which they are shared among members of the module, with the reasoning that the more an annotation is shared by genes in a module, the more likely it is to be relevant to their function and, therefore, the higher it should be weighted. In this paper, we show that analysis of expression data with module-weighted annotations appears to be more resistant to the confounding effect of gene-gene correlations than non-weighted annotation enrichment analysis, and show several examples in which module-weighted annotations provide biological insights not revealed by Boolean annotations. 
We also show that application of the method to a simple form of genetic regulatory annotation, namely, the presence or absence of putative regulatory words (oligonucleotides) in gene promoters, leads to module-weighted words that closely match known regulatory sequences, and that these can be used to quickly determine key regulatory sequences in differential expression data.

Application of Transcriptional Gene Modules to Analysis of Caenorhabditis elegans' Gene Expression Data

Article

Full-text available

Aug 2020

Identification of co-expressed sets of genes (gene modules) is used widely for grouping functionally related genes during transcriptomic data analysis. An organism-wide atlas of high-quality gene modules would provide a powerful tool for unbiased detection of biological signals from gene expression data. Here, using a method based on independent component analysis we call DEXICA, we have defined and optimized 209 modules that broadly represent transcriptional wiring of the key experimental organism Caenorhabditis elegans . These modules represent responses to changes in the environment (e.g. starvation, exposure to xenobiotics), genes regulated by transcriptions factors (e.g. ATFS-1, DAF-16), genes specific to tissues (e.g. neurons, muscle), genes that change during development, and other complex transcriptional responses to genetic, environmental and temporal perturbations. Interrogation of these modules reveals processes that are activated in long-lived mutants in cases where traditional analyses of differentially expressed genes fail to do so. Additionally, we show that modules can inform the strength of the association between a gene and an annotation (e.g. GO term). Analysis of "module-weighted annotations" improves on several aspects of traditional annotation-enrichment tests and can aid in functional interpretation of poorly annotated genes. We provide an online interactive resource at http://genemodules.org/ in which users can find detailed information on each module, check genes for module-weighted annotations, and use both of these to analyze their own transcription data or gene sets of interest.

Méthodes de découverte de nouveaux domaines dans les séquences biologiques : application à Plasmodium falciparum

Thesis

Nov 2019

Christophe Menichelli

Identifier les différentes parties d’une séquence biologique (séquence nucléique, ou séquence d’acides aminés) constitue un premier pas vers la compréhension de la biologie de l’organisme dont elle est issue. Étant donné un ensemble de séquences biologiques d’un organisme, nous nous intéressons dans cette thèse à la découverte de «domaines», c-à-d de sous-séquences relativement grandes (plusieurs dizaines de nucléotides ou d’acides aminés) que l’on retrouve dans un nombre important de séquences. Cette thèse est décomposée en deux axes correspondant à la découverte de domaines dans les séquences protéiques et dans les séquences nucléiques. Dans chaque axe, les méthodes développées sont appliquées à Plasmodium falciparum, le pathogène responsable du paludisme chez l’Homme, et pour lequel les méthodes bio-informatiques classiques peinent à produire des annotations satisfaisantes. Le premier axe développé porte sur la découverte de domaines dans les séquences protéiques. Une approche commune pour identifier les domaines d’une protéine consiste à exécuter des comparaisons de paires de séquences avec des outils d’alignements locaux comme BLAST. Cependant, ces approches manquent parfois de sensibilité, en particulier pour les espèces phylogénétiquement éloignées des organismes de référence classiques. Nous proposons ici une approche pour augmenter la sensibilité des comparaisons de paires de séquences. Cette nouvelle approche utilise le fait que les domaines protéiques ont tendance à apparaître avec un nombre limité d’autres domaines sur une même protéine. Chez Plasmodium falciparum, cette méthode permet la découverte de 2 240 nouveaux domaines pour lesquels, dans la majorité des cas, il n’existe pas de modèle semblable dans les bases de données de domaines. Le deuxième axe développé porte sur la découverte de domaines dans les séquences régulatrices (séquences ADN). 
Plusieurs études ont montré qu’il existe un lien fort entre la composition nucléotidique de régions particulières (séquences promotrices notamment) et l’expression des gènes. Nous proposons ici une nouvelle approche permettant de découvrir de manière automatique ces régions, que l’on nomme domaines de régulation. Plus précisément notre approche est basée sur une stratégie d’exploration itérative des compositions nucléotidiques, des plus simples (dinucléotides) aux plus complexes (k-mers), ainsi qu’une stratégie de segmentation supervisée pour découvrir les compositions et les régions d’intérêt. En utilisant les domaines ainsi identifiés, nous montrons que l’on peut prédire l’expression des gènes de Plasmodium falciparum avec une étonnante précision. Appliquée à différentes autres espèces eucaryotes, cette approche montre des résultats très différents suivant les espèces (entre 40 et 70 % de corrélation) ce qui laisse entrevoir un mécanisme de régulation sans doute partagé par toutes les espèces eucaryotes mais dont l’importance varie d’une espèce à l’autre.

External cluster validity indices for genomic data and formulating theoretical problems in cluster validation as flow problems

Chapter

Jan 2021

Anand Narasimhamurthy

In this chapter we focus on the problem of comparing two clusterings in the context of genomic data. As in the previous chapter, we use the term clustering to refer both to the process as well as the result. We use the term hard clustering to mean a clustering in which an object (data point) is assigned to one cluster only. In soft clustering, the degree to which an object is associated with a cluster is indicated by a membership grade. Clustering is often employed as a first step in genomic data analysis in order to identify groups of data samples (often groups of genes) for further analysis. It is often of interest to compare a clustering result with an external "reference" clustering, for instance comparing a grouping of genes obtained by applying a clustering algorithm on gene expression data against a grouping derived from existing knowledge (based on gene ontology for instance). Although the choice of the reference clustering depends on the task at hand, typically groupings of genes such as those based on gene ontology, are comprised of overlapping clusters. Many of the external cluster validity measures in the literature assume partitions (non-overlapping and exhaustive clusters), hence they may not be suitable in the context described above. We suggest that a Mallows distance based method proposed recently may be more appropriate for cluster validation

Incorporating Correlations among Gene Ontology Terms into Predicting Protein Functions

Chapter

Jan 2013

One of the key issues in the post-genomic era is to assign functions to uncharacterized proteins. Since proteins seldom act alone, but rather interact with other biomolecular units to execute their functions, the functions of unknown proteins may be discovered through studying their associations with proteins having known functions.In this chapter, the authors discuss possible approaches to exploit protein interaction networks for automated prediction of protein functions. The major focus is on discussing the utilities and limitations of current algorithms and computational techniques for accurate computational function prediction. The chapter highlights the challenges faced in this task and explores how similarity information among different gene ontology (GO) annotation terms can be taken into account to enhance function prediction.The authors describe a new strategy that has better prediction performance than previous methods, which gives additional insights about the importance of the dependence between functional terms when inferring protein function.

Workshop on data and text mining for integrative biology

Book

Jan 2006

BENIN: combining knockout data with time series gene expression data for the gene regulatory network inference

Conference Paper

Dec 2019

Predicting gene expression in the human malaria parasite Plasmodium falciparum using histone modification, nucleosome positioning, and 3D localization features

Article

Sep 2019
PLOS COMPUT BIOL

Empirical evidence suggests that the malaria parasite Plasmodium falciparum employs a broad range of mechanisms to regulate gene transcription throughout the organism's complex life cycle. To better understand this regulatory machinery, we assembled a rich collection of genomic and epigenomic data sets, including information about transcription factor (TF) binding motifs, patterns of covalent histone modifications, nucleosome occupancy, GC content, and global 3D genome architecture. We used these data to train machine learning models to discriminate between high-expression and low-expression genes, focusing on three distinct stages of the red blood cell phase of the Plasmodium life cycle. Our results highlight the importance of histone modifications and 3D chromatin architecture in Plasmodium transcriptional regulation and suggest that AP2 transcription factors may play a limited regulatory role, perhaps operating in conjunction with epigenetic factors.

Gene ontology: Tool for the unification of biology

Article

Jan 2000

SGD: Saccharomyces Genome Database

Article

Feb 1998

The Saccharomyces Genome Database (SGD) provides Internet access to the complete Saccharomyces cerevisiae genomic sequence, its genes and their products, the phenotypes of its mutants, and the literature supporting these data. The amount of information and the number of features provided by SGD have increased greatly following the release of the S. cerevisiae genomic sequence, which is currently the only complete sequence of a eukaryotic genome. SGD aids researchers by providing not only basic information, but also tools such as sequence similarity searching that lead to detailed information about features of the genome and relationships between genes. SGD presents information using a variety of user-friendly, dynamically created graphical displays illustrating physical, genetic and sequence feature maps. SGD can be accessed via the World Wide Web at http://genome-www.stanford.edu/Saccharomyces/

Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium

Article

May 2000

Predicting Gene Regulatory Elements in Silico on a Genomic Scale

Article

Dec 1998
GENOME RES

We performed a systematic analysis of gene upstream regions in the yeast genome for occurrences of regular expression-type patterns with the goal of identifying potential regulatory elements. To achieve this goal, we have developed a new sequence pattern discovery algorithm that searches exhaustively for a priori unknown regular expression-type patterns that are over-represented in a given set of sequences. We applied the algorithm in two cases, (1) discovery of patterns in the complete set of >6000 sequences taken upstream of the putative yeast genes and (2) discovery of patterns in the regions upstream of the genes with similar expression profiles. In the first case, we looked for patterns that occur more frequently in the gene upstream regions than in the genome overall. In the second case, first we clustered the upstream regions of all the genes by similarity of their expression profiles on the basis of publicly available gene expression data and then looked for sequence patterns that are over-represented in each cluster. In both cases we considered each pattern that occurred at least in some minimum number of sequences, and rated them on the basis of their over-representation. Among the highest rating patterns, most have matches to substrings in known yeast transcription factor-binding sites. Moreover, several of them are known to be relevant to the expression of the genes from the respective clusters. Experiments on simulated data show that the majority of the discovered patterns are not expected to occur by chance.

Finding DNA Regulatory Motifs within Unaligned Non-Coding Sequences Clustered by Whole-Genome mRNA Quantitation

Article

Nov 1998

Whole-genome mRNA quantitation can be used to identify the genes that are most responsive to environmental or genotypic change. By searching for mutually similar DNA elements among the upstream non-coding DNA sequences of these genes, we can identify candidate regulatory motifs and corresponding candidate sets of coregulated genes. We have tested this strategy by applying it to three extensively studied regulatory systems in the yeast Saccharomyces cerevisiae: galactose response, heat shock, and mating type. Galactose-response data yielded the known binding site of Gal4, and six of nine genes known to be induced by galactose. Heat shock data yielded the cell-cycle activation motif, which is known to mediate cell-cycle dependent activation, and a set of genes coding for all four nucleosomal proteins. Mating type alpha and a data yielded all of the four relevant DNA motifs and most of the known a- and alpha-specific genes.

Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization

Article

Jan 1999

We sought to create a comprehensive catalog of yeast genes whose transcript levels vary periodically within the cell cycle. To this end, we used DNA microarrays and samples from yeast cultures synchronized by three independent methods: alpha factor arrest, elutriation, and arrest of a cdc15 temperature-sensitive mutant. Using periodicity and correlation algorithms, we identified 800 genes that meet an objective minimum criterion for cell cycle regulation. In separate experiments, designed to examine the effects of inducing either the G1 cyclin Cln3p or the B-type cyclin Clb2p, we found that the mRNA levels of more than half of these 800 genes respond to one or both of these cyclins. Furthermore, we analyzed our set of cell cycle-regulated genes for known and new promoter elements and show that several known elements (or variations thereof) contain information predictive of cell cycle regulation. A full description and complete data sets are available at http://cellcycle-www.stanford.edu

Fitting a mixture model by expectation maximization to discover motifs in biopolymers

Article

Jan 1994

Probabilistic reasoning in intelligent systems

Article

Jan 1988

J. Pearl

Maximum Likelihood from Incomplete Data Via EM Algorithm

Article

Sep 1977

A broadly applicable algorithm for computing maximum likelihood estimates from incomplete data is presented at various levels of generality. Theory showing the monotone behaviour of the likelihood and convergence of the algorithm is derived. Many examples are sketched, including missing value situations, applications to grouped, censored or truncated data, finite mixture models, variance component estimation, hyperparameter estimation, iteratively reweighted least squares and factor analysis.
