A higher-order background model improves the detection of promoter regulatory elements by Gibbs sampling

BIOINFORMATICS Vol. 17 no. 12 2001
Pages 1113–1122
© Oxford University Press 2001
Gert Thijs 1,*, Magali Lescot 1, Kathleen Marchal 1, Stephane
Rombauts 2, Bart De Moor 1, Pierre Rouzé 3 and Yves Moreau 1
1 ESAT-SISTA/COSIC, KULeuven, Kasteelpark Arenberg 10, 3001 Leuven-Heverlee,
Belgium, 2 Department of Plant Genetics, VIB, UGent, Ledeganckstraat 35, 9000
Gent, Belgium and 3 INRA Associated Laboratory, VIB, UGent, Ledeganckstraat 35,
9000 Gent, Belgium
* To whom correspondence should be addressed.
Received on February 6, 2001; revised on June 4, 2001; accepted on June 6, 2001
ABSTRACT
Motivation: Transcriptome analysis allows detection and
clustering of genes that are coexpressed under various
biological circumstances. Under the assumption that
coregulated genes share cis-acting regulatory elements, it is
important to investigate the upstream sequences controlling
the transcription of these genes. To improve the robustness
of the Gibbs sampling algorithm to noisy data sets,
we propose an extension of this algorithm for motif finding
with a higher-order background model.
Results: Simulated data and real biological data sets
with well-described regulatory elements are used to test
the influence of the different background models on
the performance of the motif detection algorithm. We
show that the use of a higher-order model considerably
enhances the performance of our motif finding algorithm
in the presence of noisy data. For Arabidopsis thaliana,
a reliable background model based on a set of carefully
selected intergenic sequences was constructed.
Availability: Our implementation of the Gibbs sampler,
called the Motif Sampler, can be used through a web
interface: http://www.esat.kuleuven.ac.be/~thijs/Work/MotifSampler.html.
Contact: gert.thijs@esat.kuleuven.ac.be;
yves.moreau@esat.kuleuven.ac.be
INTRODUCTION
Recent high-throughput techniques to monitor gene
expression levels constitute an important advance in the
identification of coexpressed genes (for a review, see
Lockhart and Winzeler, 2000). The commonly accepted
assumption that coregulated genes share similarities in
their regulatory mechanism has led to a major challenge
for the computational biologist: detecting novel
regulatory elements (motifs) in such sets of coexpressed
genes (Brazma and Vilo, 2000; Bucher, 1999; Chu et
al., 1998; DeRisi et al., 1997; Spellman et al., 1998;
Wolfsberg et al., 1999; Zhang, 1999). These similarities
at transcriptional level imply that the promoter region
might contain consensus motifs recognized by the same
regulatory proteins. In the upstream regions of such sets
of coregulated genes, the common consensus motifs
are statistically over-represented as compared to their
frequency in a background set (of non-coregulated genes).
Several methods to search for over-represented motifs
in the upstream region of a set of coregulated genes have
been developed and tested (Ohler and Niemann, 2001).
These methods can be divided into two major classes:
methods based on word counting (Jensen and Knudsen,
2000; Vanet et al., 2000; van Helden et al., 1998, 2000)
and methods based on probabilistic sequence models
(Bailey and Elkan, 1995; Hughes et al., 2000; Lawrence
et al., 1993; Liu et al., 1995; Neuwald et al., 1995;
Roth et al., 1998; Workman and Stormo, 2000). Word
counting methods are based on the frequency analysis
of oligonucleotides in the upstream sequences. Over-
representation is measured by comparing the counted
number of occurrences of a word to the expected number
of occurrences. A common motif is then compiled by
grouping similar words. In the probabilistic methods,
the motif model is represented as a position probability
matrix and the motif is assumed to be hidden in a noisy
background sequence. To find the parameters of such a
model, maximum likelihood estimation is used. The methods
most frequently used to do so are Expectation Maximization
(EM) and Gibbs sampling. EM is a maximum likelihood
algorithm for estimating the parameters of a probabilistic
model. Gibbs sampling is a stochastic equivalent of EM.
The drawback of these algorithms is that they tend to be
sensitive to noise. Noise is due to the presence of upstream
sequences in the data set that do not contain the motif.
These sources of noise either have an experimental origin
or are artefacts of the clustering process and are difficult
to avoid.
Another source of noise comes from the large size of
the upstream sequences of the selected genes as compared
to the small size of the motifs. Parts of the sequence not
containing a motif can indeed be considered as noise.
This second source of noise obviously depends on the
compactness of the genome. For higher eukaryotes, the
size of intergenic regions varies considerably between
organisms, being much larger on average in humans than
in Arabidopsis thaliana. Even within the same species the
size of the intergenic region can vary (e.g., by at least 2
orders of magnitude for A.thaliana, from $<10^2$ to $>10^4$;
Pavy et al., 1999). Therefore the influence of the noise
can be expected to be reasonably low for bacteria and
still limited for lower eukaryotes such as yeast, but more
pronounced for higher eukaryotes.
Conceivably, it is important to have a motif detection
algorithm that can cope with this noise and discriminate
between motifs that are over-represented by chance and
motifs that are biologically functional. An improved
background model (model of non-coregulated genes) can
considerably improve this discrimination.
The most popular probabilistic models published so far
(Bailey and Elkan, 1995; Hughes et al., 2000; Lawrence
et al., 1993; Liu et al., 1995; Neuwald et al., 1995; Roth
et al., 1998) use a simple background model based on the
frequency of the nucleotides A, C, G, and T in the data
set to represent an intergenic sequence. However, a
background model based solely on single nucleotide
frequencies poorly reflects the complex structure of genome
sequences. A description of DNA sequences as higher-order
Markov chains, on the other hand, has been used in most
of the state-of-the-art gene recognition software to
represent coding and non-coding regions (e.g., in Glimmer and
the A.thaliana specific GlimmerA (Delcher et al., 1999),
HMMgene (Krogh, 1997) and GeneMark.hmm (Lukashin
and Borodovsky, 1998)). In this paper we describe the
extension of the Gibbs sampling algorithm with a complex
context-dependent background model.
Recently other researchers have also proposed the use
of advanced background models in the Gibbs sampling
algorithm (Liu et al., 2001; McCue et al., 2001). Here
we would like to address the specific issues associated with the
background models used in these methods. McCue et al.
(2001) extended the Gibbs sampler with a position-specific
background model estimated with a Bayesian segmentation
algorithm that was presented by Liu and Lawrence
(1999). This model tries to capture the varying GC and
AT-content of the different regions in DNA sequences. The
model parameters and corresponding change points are
found using a Bayesian sequence segmentation algorithm
that maximizes the joint likelihood of the data, parameters
and missing values (change point positions). These model
parameters can be used to calculate the probability that a
certain site in the sequence is generated with this
background model. Furthermore, Liu and Lawrence have
situated the problem of sequence alignment within a Bayesian
framework that we also favour. Liu et al. (2001) have
developed an extended version of the Gibbs sampler called
BioProspector. They have proposed the use of a
context-dependent Markov model to represent the probability that a
site is generated by the background model. We will comment
on some of the specific technicalities of this method
in the next section.
Together with the development of the algorithm a
selected data set of intergenic sequences from A.thaliana
was used to construct a reliable higher-order background
model of gene upstream regions of this model plant.
The influence of different background models on the
robustness of the motif sampler in the presence of
noisy data was exhaustively tested. We will describe the
construction of the background models and the use of these
background models with the Gibbs sampling algorithm.
To test the influence of these models, we used simulated
data and several well-described data sets of genes from
A.thaliana for which different motifs are documented.
HIGHER-ORDER BACKGROUND MODEL
As stated in the introduction, most of the state-of-the-art
gene detection software uses a context-dependent model
based on a higher-order Markov process to represent DNA
sequences. Based on the rationale of these algorithms,
the use of such a model to detect motifs in the upstream
region of coregulated genes seems a logical decision.
Using a context-dependent model of order $m$ means that
the probability of finding a nucleotide $b$ at position $l$ in
a sequence depends on the $m$ previous nucleotides in the
sequence. The probability of the sequence $S$, of length $L$,
being generated by this background model $B_m$ is given by
$$P(S \mid B_m) = P(b_1, \ldots, b_m) \prod_{l=m+1}^{L} P(b_l \mid b_{l-1}, \ldots, b_{l-m}).$$
The probabilities $P(b_l \mid b_{l-1}, \ldots, b_{l-m})$ are stored in a
transition matrix and the prior frequency of the
oligonucleotides of length $m$ is given by $P(b_1, b_2, \ldots, b_m)$.
The construction of the transition matrix of an $m$th-order
background model is based on the counting of all
oligonucleotides of length $(m+1)$ in the data set. To
compensate for zero occurrences of certain oligonucleotides
a pseudocount is added. Based on Bayesian statistics,
we assume that the more data are available the more we
can rely on these data to approximate the true biological
model. Based on the comparison of experimental results,
the pseudocounts are chosen proportional to the single
nucleotide frequency and inversely proportional to the
square root of the size of the data set.
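The counting scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions and not the Motif Sampler implementation: the function name `build_background_model`, the dictionary layout and the exact pseudocount `snf[b] / sqrt(N)` (with `N` the number of nucleotides in the data set) are hypothetical choices. The text only states that the pseudocounts are proportional to the single nucleotide frequency and inversely proportional to the square root of the data set size. Sequences are assumed to contain only the upper-case letters A, C, G and T.

```python
from collections import Counter
from itertools import product
from math import prod, sqrt

ALPHABET = "ACGT"


def build_background_model(sequences, m):
    """Estimate an m-th order Markov background model.

    Returns (prior, transition): prior maps every m-mer to its smoothed
    relative frequency, transition maps every m-mer context to a dict
    holding P(next base | context).
    """
    n = sum(len(s) for s in sequences)            # data set size in nucleotides
    base_counts = Counter("".join(sequences))
    snf = {b: base_counts[b] / n for b in ALPHABET}  # single nucleotide frequency
    scale = 1.0 / sqrt(n)                         # assumed pseudocount scaling

    # Count all oligonucleotides of length m (contexts) and m + 1 (words).
    ctx_counts, word_counts = Counter(), Counter()
    for s in sequences:
        for i in range(len(s) - m + 1):
            ctx_counts[s[i:i + m]] += 1
        for i in range(len(s) - m):
            word_counts[s[i:i + m + 1]] += 1

    n_ctx = sum(ctx_counts.values())
    prior, transition = {}, {}
    for ctx in map("".join, product(ALPHABET, repeat=m)):
        # Pseudocount of an m-mer: product of its single nucleotide frequencies.
        pseudo = scale * prod(snf[b] for b in ctx)
        prior[ctx] = (ctx_counts[ctx] + pseudo) / (n_ctx + scale)
        row_total = sum(word_counts[ctx + b] for b in ALPHABET) + scale
        transition[ctx] = {
            b: (word_counts[ctx + b] + scale * snf[b]) / row_total
            for b in ALPHABET
        }
    return prior, transition
```

With `m = 0` the table has a single empty context and reduces to the single nucleotide frequency model. A context that never occurs in the data falls back to the single nucleotide frequencies, so no entry is zero as long as all four bases occur in the data.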
The background model can be either constructed based
on the input sequences or based on an independent data
set of intergenic sequences. The latter approach seems
the more sensible one to produce a reliable background
model. The quality of the background model depends on
the quality of the data set. In this paper a set of carefully
selected intergenic sequences from A.thaliana is used to
construct a reliable background model.
Extension of the motif sampler
The implementation of our motif finding algorithm is
based on the original Gibbs sampling algorithm
previously described by Lawrence et al. (1993). An elaborate
description of our algorithm is given elsewhere (Thijs
et al., 2001). In this paper we emphasize the
specific aspects of the higher-order model and
summarize the other aspects of the algorithm only to
facilitate further understanding.
The calculation of the background model is done as an
initialization step of the algorithm. The background model
$B_m$ is computed either from the input sequences, making
it useful for any organism, or from an independent data
set. This model $B_m$ is not updated during the algorithm
since there is no need to re-estimate the background model
at each iteration step of the algorithm. The background
model is used to calculate, for each site $x$ of length $W$
in a sequence, the probability that this site was generated
by the background model. This probability is referred to
as $P_x$:
$$P_x = P(\mathrm{Site} \mid B_m) = \prod_{l=1}^{W} P(b_l \mid b_{l-1}, \ldots, b_{l-m}).$$
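The motif of length $W$ is represented with a position
probability matrix $\theta_W$ where the entry $q_{i,b}$ contains the
probability of finding nucleotide $b$ at position $i$ in the
motif:
$$\theta_W = \begin{pmatrix}
q_{1,A} & q_{2,A} & \cdots & q_{W,A} \\
q_{1,C} & q_{2,C} & \cdots & q_{W,C} \\
q_{1,G} & q_{2,G} & \cdots & q_{W,G} \\
q_{1,T} & q_{2,T} & \cdots & q_{W,T}
\end{pmatrix}.$$
For each site $x$ of length $W$ in a sequence the probability
$Q_x$ of site $x$ being generated by the motif model $\theta_W$ is
calculated:
$$Q_x = P(\mathrm{Site} \mid \theta_W) = \prod_{l=1}^{W} q_{l,b_l}.$$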
Based on these probabilities a weight $A_x$ is then assigned
to each segment $x$ in the sequence
$$A_x = \frac{Q_x}{P_x}.$$
Subsequently the alignment vector of the motif in this
sequence is sampled according to the distribution of
normalized weights $A_x$. By updating this distribution
we can find the alignment that maximizes the ratio
of the corresponding site probability to the background
probability.
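The three quantities $P_x$, $Q_x$ and $A_x$ can be illustrated with a short Python sketch of a single sampling step in one sequence. It is not the full sampler (the motif model is not re-estimated here and multiple copies are not handled), and the function names, the representation of $\theta_W$ as a list of per-position dictionaries and the decision to skip the first $m$ positions (which lack a complete context) are assumptions made for the example. The background table is assumed to contain no zero entries, as is the case when pseudocounts are used.

```python
import random


def background_prob(seq, start, width, transition, m):
    """P_x: each base of the site is conditioned on the m bases preceding it."""
    p = 1.0
    for l in range(start, start + width):
        p *= transition[seq[l - m:l]][seq[l]]
    return p


def motif_prob(seq, start, theta):
    """Q_x: product over motif positions of the matrix entry for the observed base."""
    q = 1.0
    for i, column in enumerate(theta):
        q *= column[seq[start + i]]
    return q


def sample_site(seq, theta, transition, m, rng=random):
    """Draw one site start with probability proportional to A_x = Q_x / P_x.

    Returns the sampled start position and the unnormalized weights
    for every candidate start position.
    """
    width = len(theta)
    starts = list(range(m, len(seq) - width + 1))  # need m bases of context
    weights = [
        motif_prob(seq, x, theta) / background_prob(seq, x, width, transition, m)
        for x in starts
    ]
    chosen = rng.choices(starts, weights=weights, k=1)[0]
    return chosen, dict(zip(starts, weights))
```

`random.choices` normalizes the weights internally, so the explicit normalization of $A_x$ is not needed in this sketch.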
As stated in the introduction, Liu et al. (2001) have also
proposed the use of a context-dependent Markov model
in a Gibbs sampling algorithm. In their algorithm the
probability of a site being generated by this background
model is computed as
$$\begin{aligned}
P(b_l, b_{l+1}, b_{l+2}, \ldots, b_{l+W-1}) = {} & P(b_l)\, P(b_{l+1} \mid b_l) \\
& \times P(b_{l+2} \mid b_l, b_{l+1})\, P(b_{l+3} \mid b_l, b_{l+1}, b_{l+2}) \\
& \cdots P(b_{l+W-1} \mid b_{l+W-2}, b_{l+W-3}, b_{l+W-4}).
\end{aligned}$$
This approach resembles the one proposed in this paper
but differs in the calculation of the probability of the
site being generated by the background model. In our
algorithm we take the $m$ preceding bases of the site into
account to compute the background probability of the site
since the complete sequence information is available at the
time of computation. This also means that we only need
to compute the parameters of the $m$th-order background
model and not the parameters for all the models from 1
to $m$.
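The difference between the two calculations can be made concrete with a small Python sketch. It follows our reading of the formula quoted above and is not taken from either implementation: the helper `estimate` (with add-one pseudocounts), the function names and the toy sequence in the usage note are all hypothetical.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def estimate(sequences, k):
    """k-th order transition table with add-one pseudocounts (illustrative only)."""
    counts = Counter(s[i:i + k + 1] for s in sequences for i in range(len(s) - k))
    table = {}
    for ctx in map("".join, product(ALPHABET, repeat=k)):
        total = sum(counts[ctx + b] + 1 for b in ALPHABET)
        table[ctx] = {b: (counts[ctx + b] + 1) / total for b in ALPHABET}
    return table


def prob_preceding_context(seq, start, width, model_m, m):
    """Every base of the site is conditioned on its m predecessors in the
    full sequence, so a single m-th order table is sufficient."""
    p = 1.0
    for l in range(start, start + width):
        p *= model_m[seq[l - m:l]][seq[l]]
    return p


def prob_within_site(seq, start, width, models, m):
    """The context is restricted to the site itself: the order grows from 0
    at the first base up to m, so tables of every order up to m are needed."""
    p = 1.0
    for offset in range(width):
        k = min(offset, m)
        l = start + offset
        p *= models[k][seq[l - k:l]][seq[l]]
    return p
```

For a site of width $W > m$ the two functions differ only in the factors for the first $m$ bases of the site; from base $m+1$ onwards both use the full $m$th-order context.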
DATA SETS
Intergenic data set
To construct the best possible representation of promoter
sequences or intergenic sequences, a data set consisting of
carefully selected intergenic sequences was constructed,
following a previously described rationale to build Araset
(Pavy et al., 1999). To define clean intergenic sequences,
all complete cDNAs were downloaded through SRS and
aligned on BAC sequences. The aligned genes were
manually checked by an expert (S. Aubourg, personal
communication). Each time the cDNAs matched two consecutive
genes on the BAC, the intergenic sequence was extracted.
The sequences with a length below 10 kb were then
extensively checked for any unannotated potential coding
sequences, using BLAST (Altschul et al., 1990) for
homology searches and prediction software such as EuGène
(Schiex et al., 2000). 78 intergenic sequences were
retained, representing a total of 156 087 bp. These sequences
were added to the 94 intergenic sequences retrieved from
Araset, resulting in a data set with 341 248 bp. Figure 1
shows the three different configurations in which
neighboring genes can occur. 105 sequences have the first
configuration (1) consisting of two genes coding in tandem on
the same strand. In this case the intergenic region is
expected to contain only one promoter. 38 sequences have
the second configuration (2) where both genes are
pointing away from each other. In this intergenic region
divergent promoters are expected to control transcription. In the
last case (3), containing 29 sequences, the transcription of the
two genes is convergent. No promoter regulatory element
is expected to occur in the intergenic region. The transition
matrix was only built from the intergenic sequences of the
classes (1) and (2), which likely contain either one or two
promoters.

Fig. 1. Representation of the three different configurations of the
intergenic region in DNA. The genes are represented with an arrow
and the core promoter for each gene is indicated with an oval box.
(1) The two genes are pointing in the same direction and there is one
core promoter in the intergenic region. (2) Two genes are pointing in
opposite directions and the intergenic region contains the two core
promoters. In the last case (3) the genes are pointing towards each
other and in the intergenic region there is no core promoter present.
Data sets for testing
To test the performance of our implementation we
constructed several data sets. The data sets are accessible
on the web: http://www.plantgenetics.rug.ac.be/males/Datasets/Data.html.
Simulated data: these sequences were used to test the
influence of the background model on the detection
of different types of motifs. By sampling according
to the 4th-order background model, 20 sequences of
500 bp were generated. In several tests, instances of
predefined motifs were inserted in these sequences at
random positions.
G-box sequences: this set of sequences was extracted
from PlantCARE (Rombauts et al., 1999) and contains
the upstream region of genes that are known to be
regulated by G-box binding proteins in dicots. The
consensus of the G-box is CACGTG. The position
of the G-box is well defined in this data set. The
set contains 33 sequences of 500 bp. The G-box
(CACGTG) is a well-conserved ubiquitous cis-acting
regulatory element found in plant genomes and is
bound by the GBF (G-box binding factors) family of
bZIP proteins (Donald and Cashmore, 1990).
Light induced: this set contains the upstream region
of 28 coexpressed A.thaliana genes. Coexpression
was based on the cluster analysis of a microarray
experiment (Desprez et al., 1998).
Random: set of randomly selected A.thaliana upstream
sequences of at least 150 bp, not described to be
involved in light regulation and not containing a known
G-box.
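The construction of the simulated data set can be sketched in Python as follows. This is a minimal illustration and not the script used to build the published data: the function names are hypothetical, the first $m$ bases are drawn uniformly instead of from the prior over oligonucleotides of length $m$, and a motif instance overwrites the bases at the chosen position so that the sequence length stays fixed (whether the original instances were spliced in or overwritten is not stated).

```python
import random

ALPHABET = "ACGT"


def sample_sequence(transition, m, length, rng):
    """Generate one sequence from an m-th order transition table."""
    seq = [rng.choice(ALPHABET) for _ in range(m)]   # assumed uniform start
    while len(seq) < length:
        context = "".join(seq[len(seq) - m:])
        row = transition[context]
        seq.append(rng.choices(ALPHABET, weights=[row[b] for b in ALPHABET])[0])
    return "".join(seq[:length])


def sample_instance(ppm, rng):
    """Draw one motif instance, one base per column of the position
    probability matrix (given as a list of per-position dictionaries)."""
    return "".join(
        rng.choices(ALPHABET, weights=[column[b] for b in ALPHABET])[0]
        for column in ppm
    )


def plant(seq, instance, rng):
    """Overwrite a random window of seq with the instance; return both the
    new sequence and the chosen start position."""
    pos = rng.randrange(len(seq) - len(instance) + 1)
    return seq[:pos] + instance + seq[pos + len(instance):], pos
```

Calling `sample_sequence` 20 times with `length=500` and a 4th-order table reproduces the dimensions of the simulated data set described above; recording the positions returned by `plant` gives the ground truth needed to score a motif finder.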
RESULTS
Construction of an independent background model
The construction of a Markov process can rely either on
the upstream sequences from the input data or on an
independent data set. This independent data set consists
of a well-defined set of intergenic regions of A.thaliana
genes (see Section Data sets). It should be noted that the
number of nucleotides used to construct the Markov model
limits the order of the background model that can be used.
Indeed, when a transition matrix of order $m$ is constructed,
all oligonucleotides of length $(m+1)$ are counted. The
number of possible different oligonucleotides equals $4^{m+1}$
and increases exponentially with $m$. The data set used for
the construction of the background model should, under
the assumption of an equal nucleotide distribution, contain
at least $4^{m+1}$ base pairs to have a single count
for each oligonucleotide. In reality the assumption of equal
nucleotide distribution does not hold and a much larger
data set will be needed. When an oligonucleotide does not
occur in the data set, its count will be replaced by a pseudocount.
When the order of the background model is too high relative to
the size of the data set on which this background model
was based, less frequent motifs will be encountered, which
deteriorates the motif model. Following this reasoning,
the improvement of using a Markov chain background
model will be more explicit when its construction is
based on a large data set (such as the one used in this
study). A thorough study of the intergenic data set shows
that when all hexamers are counted, there are 430 388
examples in this data set. The hexamer with the highest
number of occurrences, 2018, was AAAAAA, while
no instance at all was found of the hexamer GCGGGC.
The consequence of this observation is that the 5th-order
background model is less reliable, as will be shown in the
tests.
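The arithmetic behind this argument is easy to check. A 5th-order model requires counts for $4^6 = 4096$ hexamers, so the 430 388 counted hexamers correspond to about 105 occurrences per hexamer on average; the observed range from 0 (GCGGGC) to 2018 (AAAAAA) shows how far the real distribution is from this average. The Python sketch below computes such a summary for any data set; the function name and the returned fields are hypothetical, and sequences are assumed to contain only A, C, G and T.

```python
from collections import Counter


def oligo_coverage(sequences, m):
    """Summarize how well a data set supports an m-th order model by
    counting all oligonucleotides of length m + 1."""
    k = m + 1
    counts = Counter(
        s[i:i + k] for s in sequences for i in range(len(s) - k + 1)
    )
    possible = 4 ** k                 # number of distinct k-mers
    total = sum(counts.values())      # number of counted k-mers
    return {
        "possible": possible,
        "observed": len(counts),
        "missing": possible - len(counts),
        "mean_count": total / possible,
        "max_count": max(counts.values()) if counts else 0,
    }
```

A large value of `missing`, or a `mean_count` close to 1, indicates that many entries of the transition matrix would be determined by the pseudocounts alone.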
Simulated sequences
As a first set of tests, simulated sequences were used.
The sequences were generated according to the 4th-order
background model. Although randomly generated
sequences do not fully resemble biological sequences, the
use of the 4th-order model ensures the conservation of the
pentamer composition of the sequences as compared to the
composition of the intergenic data set. This also means
that these simulated sequences are AT-rich. Subsequently
the different motifs were inserted at random positions
throughout the sequences. Motifs were represented with
a position probability matrix and each base of a created
instance was sampled according to the distribution in the
corresponding column of the matrix. Distinct types of
motifs can be observed: AT-rich versus GC-rich and well
conserved versus more degenerate. Using these simulated
data sets we are able to test the influence of the different
background models on the rate of detecting the inserted
motifs.

Table 1. Results of the motif finding in the simulated sequences with all
four motifs inserted. Each number corresponds to the number of runs, out of 20,
in which the corresponding consensus was found, for the different
orders of the background models (snf: single nucleotide frequency)

                  Input sequences       Intergenic sequences
Consensus           3    2    1  snf   snf    1    2    3    4    5
GCTGCAGC            0   18   14   19    19   19   20   20   18   15
TAGAATA             4   12   14    8     9   17   16   20   20   15
GsCCGnnnCGGsC       2    4    7    7    13   11   14   14   18   10
ATAnwCCwnTAA        9    9    5    4     3    9   13   13   14   12

In a first set of tests we included all four types of motifs
in 20 sequences of 500 bp. Figure 2 gives an overview of
the logos of the four created motifs based on the inserted
instances. We tried several parameter settings and the best
results were obtained when searching for six different
motifs with a length of 14 bp that can have 0, 1 or 2 copies.
Since our method is probabilistic, each test was repeated
20 times. Table 1 indicates for the different types of the
background model how many times the corresponding
consensus term was found in the 20 repeated runs. The
background model was computed from either the input
sequences or the precompiled intergenic models.
The first motif, GCTGCAGC, a well-conserved,
GC-rich motif, was easily found with each of the
background models used, except with the 3rd-order model based
on the input sequences. To explain this phenomenon we
plotted in Figure 3 the transition matrices of the two
3rd-order background models. Each point in the plot
corresponds to the same entry in both transition matrices.
The $x$ axis shows the value for the matrix based on the
intergenic sequences while the value for the matrix based on
the input sequences is depicted on the $y$ axis. For a perfect
3rd-order background model a one-to-one correspondence
is expected; however, some outliers are clearly visible.
For these points the corresponding
entry in the transition matrix is indicated. The points that
are largely overestimated by the background model based
on the input sequences correspond to the words that are part of
the inserted GC-rich motifs. Therefore, the computed
background model based on the input sequences is biased
towards this motif and thus it will overestimate $P_x$ for
these sites, which lowers their weight $A_x = Q_x / P_x$.
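The comparison behind Figure 3 can be mimicked with a few lines of Python that list the transition-matrix entries for which one model most exceeds the other. The function name and the nested-dictionary layout (context to base to probability) are assumptions made for this sketch.

```python
def largest_overestimates(reference, candidate, top=5):
    """Return the (word, difference) pairs for which the candidate model
    assigns the largest excess probability relative to the reference model.

    A word is the context followed by the predicted base, so for a
    3rd-order model each word is a tetramer.
    """
    rows = [
        (candidate[ctx][b] - reference[ctx][b], ctx + b)
        for ctx in reference
        for b in reference[ctx]
    ]
    rows.sort(reverse=True)
    return [(word, diff) for diff, word in rows[:top]]
```

Applied with the intergenic model as `reference` and the model built from the input sequences as `candidate`, the words at the top of the list would be the candidates for a bias introduced by over-represented motifs in the input data.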
The second inserted motif is a frequent, short,
well-conserved and AT-rich motif: TAGAATA. When using the
single nucleotide background model the motif was only
found in 50% of the runs, while it was found in all 20
runs when using either the 3rd- or 4th-order intergenic
model. Finding a short AT-rich motif seems to be much
harder when using the single nucleotide model. Even when
searching for shorter motifs (8 bp) it was picked up less
frequently with the single nucleotide model.
The third motif consists of two GC-rich blocks and it
has a fixed gap of 3 bp. This motif was retrieved with
the same consistency for all background models except
when the background models were compiled from the
input sequences. In this case there is no difference between
the higher order models and the single nucleotide models.
We repeated this test also with a data set where only
this degenerated GC-rich motif was inserted and there the
same consistent performance was observed.
Finally a rather degenerate, AT-rich motif, resembling
the background, was inserted. This motif is assumed to be
the hardest to find. Using the single nucleotide background
model resulted in a fairly poor performance. The motif
was detected in only 3 or 4 out of 20 runs. Using higher-
order background models, the motif was more frequently
found. This effect was even more pronounced when
we inserted only this degenerated motif in the random
sequences. The number of detections with the single
nucleotide model increased to 6 out of 20 runs, but for
the 3rd-order model this was 18 out of 20 and even 20
out of 20 runs for the 4th-order model. To further test
this difficult case we inserted slightly modified versions
of this motif, TTAAwCwATA and wTAATsTATA, in the
generated sequences. In these two test cases we only
used the intergenic background models. Again a similar
performance was measured, as is shown in Table 2.
Using the single nucleotide model it is difficult to retrieve
the inserted motifs, while using the higher-order models
significantly improves the performance.
In this section we described the analysis of several
tests with simulated data. Although the simulated data
do not fully resemble biological sequences, we can still
draw some conclusions from these tests. It was shown in
these tests that the use of a higher-order background model,
compiled from an intergenic data set, improves the rate
of detection of difficult motifs as compared to the use
of a single nucleotide model. The results show that the
3rd and 4th-order models based on the intergenic data
set have the most positive influence on the detection of
distinct motifs. The effect of the 1st and 2nd-order models
is less pronounced. The 5th-order model on the other
hand has in some cases a rather poor performance. This
is due to the fact that the transition matrix is built by
counting hexamers. As was shown in the section about
the construction of the background model this count is
not reliable enough. The influence of using the intergenic
data set instead of using the input sequences is obvious
especially for the 3rd-order model. However we should
take the necessary care when interpreting the results in
Tables 1 and 2. These results show that tests with the
4th-order model have the best results, but this is possibly
related to the fact that this background model was used to
generate the sequences.

Fig. 2. Logo representation of the four inserted sites (panels 1 to 4) in the
simulated sequences. These logos are created from all the inserted instances of
the motifs in the sequences.

Fig. 3. Comparison of the 3rd-order transition matrices computed from the
intergenic sequences and the input sequences. Each point
represents a corresponding entry in both transition matrices. The $x$ axis
indicates the value in the intergenic matrix while the $y$ axis depicts
the value in the matrix based on the input sequences. For some of the outliers
the corresponding entry in the matrix is given.

Table 2. Results of the motif finding in the simulated sequences with only
the AT-rich motif inserted. Each number corresponds to the number of runs,
out of 20, in which the correct consensus was found, for all the
background models based on the intergenic data set (snf: single nucleotide
frequency)

Consensus    Result     snf    1    2    3    4    5
TTAAwCwATA   Correct      5    9   14   10   13    5
             Shifted      1    2    3    4    3    2
wTAATsTATA   Correct      0   12    7   12    8    8
             Shifted      1    5    6    4    5    3

Influence of noisy sequences
To test the influence of the complex background model
on the robustness of the motif sampler in the presence
of noise, another set of tests was performed. In this set
of tests we used a set of biologically relevant sequences
upstream of 33 genes (G-box data set, see Section Data
sets). In subsequent tests, the number of noisy sequences
added to the G-box data set was progressively increased
(10 at a time). The set of noisy sequences, from which
each time 10 sequences were sampled, consisted of a
random mixture of the light-induced (Desprez et al.,
1998) and random data sets (see Section Data sets).
Preliminary tests showed that the 3rd-order background
model had a better performance than the other higher-
order background models on this data set. Therefore in
the next set of exhaustive tests only the single nucleotide
and the 3rd-order background model were compared. All
parameters of the motif sampler algorithm were kept
fixed except for the order of the background model (we
tried either single nucleotide frequency, 3rd-order Markov
model computed from the input data or 3rd-order Markov
model computed from the intergenic data set). In each test
we searched for 10 different motifs with a length of 8 bp
that can have 0 or 1 copy in the sequence and each test was
repeated 10 times.
To evaluate the results we checked in which runs
the G-box consensus CACGTG was detected. Based
on this definition of the G-box sequence, we calculated
the number of times the G-box was found in each test
(group of 10 runs). Figure 4a describes the behaviour of
the algorithm in the presence of an increasing number
of noisy sequences for different background models.
The 3rd-order background model clearly outperforms the
single nucleotide background model. With the single
nucleotide model, the algorithm only detects the G-box
consensus in a small number of runs even in the presence
of only a limited number of noisy sequences. Both
higher-order background models can find the G-box
consensus in the presence of a large number of noisy
sequences. To further validate the outcome, the positions
of the G-box motifs predicted by the algorithm were
compared with the documented G-box positions in the
33 G-box sequences. Three different possibilities can be
distinguished:
(1) the predicted motif is located at the same position as
the known G-box motif (true positive);
(2) the algorithm could not detect a motif although the
presence of a motif was described (false negative);
(3) a potential G-box motif detected by the algorithm
is located at a different position than the described
G-box (ambiguous case).
In the last case, the predicted position might represent
a yet undetected G-box and is therefore inconclusive.
Figure 4b shows the average number of correctly
predicted motif positions. The calculation was based for each
experiment only on the runs in which a G-box consensus
was detected. Figure 4b demonstrates that the number
of correctly predicted motifs (true positives) decreases
slightly with increasing noise. However, the order of the
background model does not interfere drastically with the
number of correctly predicted motifs. On average,
approximately 70% of the G-boxes were correctly predicted.
This indicates that, if a motif is detected, it is in 70% of
the cases the right motif, irrespective of the background
model. The background model does not improve the
performance of the algorithm in making correct predictions.
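The three-way classification used in this evaluation can be written down as a short Python function. The function name, the dictionary-based input format and the `tolerance` parameter (which would allow a small positional shift to count as a match) are assumptions introduced for illustration; the evaluation described above compares positions directly.

```python
def classify_predictions(known, predicted, tolerance=0):
    """Classify predictions for sequences with a documented motif.

    known:     dict mapping sequence id to the documented start position
    predicted: dict mapping sequence id to the predicted start position,
               or None (or absent) when no motif was predicted
    """
    counts = {"true_positive": 0, "false_negative": 0, "ambiguous": 0}
    for seq_id, true_pos in known.items():
        pred = predicted.get(seq_id)
        if pred is None:
            counts["false_negative"] += 1      # documented motif missed
        elif abs(pred - true_pos) <= tolerance:
            counts["true_positive"] += 1       # same position as documented
        else:
            counts["ambiguous"] += 1           # possibly an undocumented site
    return counts
```

Only sequences with a documented motif are scored, so predictions in the added noise sequences do not enter these counts.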
However, the influence of the higher-order background
model on the robustness and performance of the algorithm
in the presence of noise becomes obvious when taking
into account the number of missed true positives. Figure 4c
depicts the average number of sequences in which
the algorithm could not predict the right G-box motif
(false negatives). This number of false negatives consists
of all the sequences in a run in which the algorithm could
not detect a G-box consensus although a G-box was
described in the sequence. Figure 4c shows that the more
noise is added to the data set, the higher the percentage
of missed G-boxes. Moreover, this effect is considerably
more pronounced for the single nucleotide background
model than for the 3rd-order background model. The
3rd-order model based on the set of intergenic sequences
performs better than the 3rd-order model based on the
input data.

[Figure 4, panels (a) to (c). Horizontal axis in all panels: number of added noise sequences (0 to 60). Vertical axes: (a) number of runs where the G-box is found; (b) % correctly predicted G-box positions; (c) % missed G-boxes. Legend: Order 0; Intergenic order 3; Data order 3.]
Fig. 4. (a) Total number of times the G-box consensus is found in 10 runs. The horizontal axis shows the number of noisy sequences added
to the G-box data set. (b) Average number of correctly predicted G-box positions. This number is based on comparison of the described
G-box positions and the predicted positions of the G-box motif in all the runs where a G-box consensus was found. (c) Average percentage
of wrongly classified motifs. This number is based on the number of sequences that are indicated as not having a G-box although a G-box was
documented (including the runs where no G-box consensus is found).
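
The quantities plotted in Figures 4b and 4c are averaged over different sets of runs, and a short sketch may make that distinction concrete. The record layout (`consensus_found`, `true_positives`, `false_negatives`) and the function names are hypothetical, chosen here for illustration. The only rules taken from the text are that correct predictions are averaged over runs in which a G-box consensus was detected, and that every documented G-box counts as missed in a run without a consensus.

```python
from typing import Dict, List, Optional


def percent_correct(runs: List[Dict], n_known: int) -> Optional[float]:
    """Average % of documented positions predicted correctly.

    Only runs in which a G-box consensus was detected are considered.
    Returns None when no run detected the consensus.
    """
    detected = [r for r in runs if r["consensus_found"]]
    if not detected:
        return None
    hits = sum(r["true_positives"] for r in detected)
    return 100.0 * hits / (n_known * len(detected))


def percent_missed(runs: List[Dict], n_known: int) -> float:
    """Average % of documented G-boxes that were missed, over all runs.

    A run without a detected consensus misses all n_known documented motifs.
    """
    missed = sum(
        r["false_negatives"] if r["consensus_found"] else n_known
        for r in runs
    )
    return 100.0 * missed / (n_known * len(runs))


if __name__ == "__main__":
    # Toy numbers, invented for illustration; 33 documented G-box sequences.
    toy_runs = [
        {"consensus_found": True, "true_positives": 23, "false_negatives": 5},
        {"consensus_found": False, "true_positives": 0, "false_negatives": 0},
    ]
    print(percent_correct(toy_runs, 33), percent_missed(toy_runs, 33))
```

Because the two measures use different denominators, a background model can leave the first nearly unchanged while strongly reducing the second, which is the pattern described in the text.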
During the tests we usually observed that, when using a
3rd-order background model, the algorithm retrieved the
G-box consensus as one of the first motifs, while this was
not the case for the single nucleotide model. The rapid
convergence of the algorithm to the G-box indicates that
it is a very stable motif in the presence of a 3rd-order
model. This was further corroborated by the fact that the
G-box motif was in these cases also the motif with the
highest log-likelihood score.
DISCUSSION
We aimed at improving the performance of a probabilistic
implementation of a motif finding algorithm in the
presence of noisy data. To this end, the existing algorithm was
extended with a more complex background model. We
anticipated that the description of the background sequences
as single nucleotide frequencies was not sufficient to
capture the complex information in the inherently
non-random sequence code. Therefore we used higher-order
Markov models to represent the intergenic sequences in
DNA. We adapted the original Gibbs sampling algorithm
in such a way that we can incorporate the higher-order
background model to update the probabilities of finding
a motif at a certain position in the sequence. A set of
carefully selected intergenic regions was used to construct
a higher-order background model for A. thaliana. As was
shown in the Results section, the quality and the size of
this intergenic data set determine the reliability of the
order of the model. We tested our implementation on
different simulated data sets, where the sequences were
generated from the 4th-order background model and
different types of motifs were inserted. These tests have
shown that the use of a higher-order background model,
especially 3rd or 4th-order, can improve the performance
of the algorithm. We also showed with the simulated data
that when using an unreliable higher-order model (e.g.,
the 3rd-order model based on the input sequences) the
performance decreases significantly.
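
To make the role of the higher-order background model more concrete, the sketch below estimates an m-th order Markov model from a set of background sequences and uses it to compute the log-probability of a sequence segment. It is a minimal illustration, not the Motif Sampler implementation. The function names, the pseudocount value, and the uniform scoring of the first m bases of a segment are assumptions made here for brevity; a sampler scanning a longer sequence could instead use the bases preceding the segment as context.

```python
import math
from collections import defaultdict
from itertools import product

ALPHABET = "ACGT"


def train_background(sequences, order, pseudocount=1.0):
    """Estimate P(base | previous `order` bases) from background sequences.

    Returns a dict mapping every context of length `order` to a dict of
    base probabilities. Windows containing non-ACGT symbols are skipped.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            if base in ALPHABET and all(c in ALPHABET for c in context):
                counts[context][base] += 1.0
    model = {}
    for letters in product(ALPHABET, repeat=order):
        context = "".join(letters)
        # Pseudocounts keep unseen transitions from getting probability zero.
        total = sum(counts[context].values()) + pseudocount * len(ALPHABET)
        model[context] = {
            b: (counts[context][b] + pseudocount) / total for b in ALPHABET
        }
    return model


def background_log_prob(segment, model, order):
    """Log-probability of an ACGT-only `segment` under the background model.

    Assumption: the first `order` bases lack a full context inside the
    segment and are scored with a uniform probability of 1/4 each.
    """
    segment = segment.upper()
    logp = min(order, len(segment)) * math.log(1.0 / len(ALPHABET))
    for i in range(order, len(segment)):
        logp += math.log(model[segment[i - order:i]][segment[i]])
    return logp


if __name__ == "__main__":
    # Toy background sequences, invented for illustration only.
    intergenic = ["ACGTACGTACGTTTGACA", "TTTTACGTGGCCAACGT"]
    bg3 = train_background(intergenic, order=3)
    print(background_log_prob("CACGTGGC", bg3, order=3))
```

Setting `order=0` reduces this to the single nucleotide frequency model. The number of contexts grows as 4^m, which is one way to see why the size of the intergenic data set limits the order that can be estimated reliably.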
The behaviour of the algorithm in the presence of an
increasing amount of noisy data has been tested
extensively. The algorithm was shown to be considerably
more robust with a 3rd-order model than with a single
nucleotide background model. The overall recovery of the
motifs was higher in the presence of a higher-order model,
though the number of correctly predicted motifs was only
marginally affected by the complexity of the background model.
Future work will concentrate on the improvement of
the A. thaliana background model through extending the
intergenic data set and also by using interpolated Markov
chains to augment the significance of the transition matrix.
Focus will be on the automatic selection of the best
background model. We will also compile background
models of other organisms.
ACKNOWLEDGEMENTS
Gert Thijs is a research assistant with the IWT; Yves
Moreau is a post-doctoral researcher of the FWO;
Professor Bart De Moor is a full-time professor at
the KULeuven; Pierre Rouzé is Research Director of
INRA (Institut National de la Recherche Agronomique,
France). This work is partially supported by: 1. IWT
project: STWW-980396; 2. Research Council KULeuven:
GOA Mesto-666; 3. FWO projects: G.0240.99 and
G.0256.97; 4. IUAP P4-02 (1997-2001); 5. Industrial
Contract Research: Data4s. The scientific responsibility is
assumed by its authors.