ArticlePDF Available

A higher-order background model improves the detection of promoter regulatory elements by Gibbs sampling

January 2002
Bioinformatics 17(12):1113-22

January 2002
17(12):1113-22

DOI:10.1093/bioinformatics/17.12.1113

Source
PubMed

Authors:

Gert Thijs

Agilent

Magali Lescot

Institut Méditerranéen d’océanologie

Kathleen Marchal

Ghent University

Stephane Rombauts

Vlaams Instituut voor Biotechnologie

Show all 7 authorsHide

Transcriptome analysis allows detection and clustering of genes that are coexpressed under various biological circumstances. Under the assumption that coregulated genes share cis-acting regulatory elements, it is important to investigate the upstream sequences controlling the transcription of these genes. To improve the robustness of the Gibbs sampling algorithm to noisy data sets we propose an extension of this algorithm for motif finding with a higher-order background model. Simulated data and real biological data sets with well-described regulatory elements are used to test the influence of the different background models on the performance of the motif detection algorithm. We show that the use of a higher-order model considerably enhances the performance of our motif finding algorithm in the presence of noisy data. For Arabidopsis thaliana, a reliable background model based on a set of carefully selected intergenic sequences was constructed. Our implementation of the Gibbs sampler called the Motif Sampler can be used through a web interface: http://www.esat.kuleuven.ac.be/~thijs/Work/MotifSampler.html. gert.thijs@esat.kuleuven.ac.be; yves.moreau@esat.kuleuven.ac.be

Representation of the three different configuration of the intergenic region in DNA. The genes are represented with an arrow and the core promoter for each gene is indicated with an oval box. (1) The two genes are pointing in the same direction and there is one core promoter in the intergenic region. (2) Two genes are pointing in opposite directions and the intergenic region contains the two core promoters. In the last case (3) the genes are pointing towards each other and in the intergenic region there is no core promoter present.

…

Results of the motif finding in the simulated sequences with all four motifs inserted. Each number corresponds to number of runs, out of 20, in which the corresponding consensus was found and this for the different orders of the background models

…

Figures - uploaded by Bart L.R. De Moor

Content may be subject to copyright.

Content uploaded by Bart L.R. De Moor

Content may be subject to copyright.

Content uploaded by Stephane Rombauts

Content may be subject to copyright.

BIOINFORMATICS Vol. 17 no. 12 2001

Pages 1113–1122

A higher-order background model improves the

detection of promoter regulatory elements by

Gibbs sampling

Gert Thijs 1,∗

, Magali Lescot 1, Kathleen Marchal 1, Stephane

Rombauts 2, Bart De Moor 1, Pierre Rouz ´

e3and Yves Moreau 1

1ESAT-SISTA/COSIC, KULeuven, Kasteelpark Arenberg 10, 3001 Leuven-Heverlee,

Belgium, 2Department of Plant Genetics, VIB, UGent, Ledeganckstraat 35, 9000

Gent, Belgium and 3INRA Associated Laboratory, VIB, UGent, Ledeganckstraat 35,

9000 Gent, Belgium

Received on February 6, 2001; revised on June 4, 2001; accepted on June 6, 2001

ABSTRACT

Motivation: Transcriptome analysis allows detection and

clustering of genes that are coexpressed under various bi-

ological circumstances. Under the assumption that coreg-

ulated genes share cis-acting regulatory elements, it is im-

portant to investigate the upstream sequences controlling

the transcription of these genes. To improve the robust-

ness of the Gibbs sampling algorithm to noisy data sets

we propose an extension of this algorithm for motif ﬁnding

with a higher-order background model.

Results: Simulated data and real biological data sets

with well-described regulatory elements are used to test

the inﬂuence of the different background models on

the performance of the motif detection algorithm. We

show that the use of a higher-order model considerably

enhances the performance of our motif ﬁnding algorithm

in the presence of noisy data. For Arabidopsis thaliana,

a reliable background model based on a set of carefully

selected intergenic sequences was constructed.

Availability: Our implementation of the Gibbs sampler

called the Motif Sampler can be used through a web

interface: http://www.esat.kuleuven.ac.be/∼thijs/Work/

MotifSampler.html.

Contact: gert.thijs@esat.kuleuven.ac.be;

yves.moreau@esat.kuleuven.ac.be

INTRODUCTION

Recent high-throughput techniques to monitor gene

expression levels constitute an important advance in the

identiﬁcation of coexpressed genes (for a review, see

Lockhart and Winzeler, 2000). The commonly accepted

assumption that coregulated genes share similarities in

their regulatory mechanism has led to a major challenge

∗To whom correspondence should be addressed.

for the computational biologist: detecting novel reg-

ulatory elements (motifs) in such sets of coexpressed

genes (Brazma and Vilo, 2000; Bucher, 1999; Chu et

al., 1998; DeRisi et al., 1997; Spellman et al., 1998;

Wolfsberg et al., 1999; Zhang, 1999). These similarities

at transcriptional level imply that the promoter region

might contain consensus motifs recognized by the same

regulatory proteins. In the upstream regions of such sets

of coregulated genes, the common consensus motifs

are statistically over-represented as compared to their

frequency in a background set (of non-coregulated genes).

Several methods to search for over-represented motifs

in the upstream region of a set of coregulated genes have

been developed and tested (Ohler and Niemann, 2001).

These methods can be divided in two major classes:

methods based on word counting (Jensen and Knudsen,

2000; Vanet et al., 2000; van Helden et al., 1998, 2000)

and methods based on probabilistic sequence models

(Bailey and Elkan, 1995; Hughes et al., 2000; Lawrence

et al., 1993; Liu et al., 1995; Neuwald et al., 1995;

Roth et al., 1998; Workman and Stormo, 2000). Word

counting methods are based on the frequency analysis

of oligonucleotides in the upstream sequences. Over-

representation is measured by comparing the counted

number of occurrences of a word to the expected number

of occurrences. A common motif is then compiled by

grouping similar words. In the probabilistic methods,

the motif model is represented as a position probability

matrix and the motif is assumed to be hidden in a noisy

background sequence. To ﬁnd the parameters of such a

model, maximum likelihood estimation is used. The most

frequent methods to do so are Expectation Maximization

(EM) and Gibbs sampling. EM is a maximum likelihood

algorithm for estimating the parameters of a probabilistic

model. Gibbs sampling is a stochastic equivalent of EM.

The drawback of these algorithms is that they tend to be

Oxford University Press 2001 1113

G.Thijs et al.

sensitive to noise. Noise is due to the presence of upstream

sequences in the data set that do not contain the motif.

These sources of noise have either an experimental origin

or are artefacts of the clustering process and are difﬁcult

to avoid.

Another source of noise comes from the large size of

the upstream sequences of the selected genes as compared

to the small size of the motifs. Parts of the sequence not

containing a motif can indeed be considered as noise.

This second source of noise obviously depends on the

compactness of the genome. For higher eukaryotes, the

size of intergenic regions varies considerably between

organisms, being much larger on average in humans than

in Arabidopsis thaliana. Even within the same species the

size of the intergenic region can vary (e.g., by at least 2

orders of magnitude for A.thaliana, from <102to >104;

Pavy et al., 1999). Therefore the inﬂuence of the noise

can be expected to be reasonably low for bacteria and

still limited for lower eukaryotes such as yeast, but more

pronounced for higher eukaryotes.

Conceivably, it is important to have a motif detection

algorithm that can cope with this noise and discriminate

between motifs that are over-represented by chance and

motifs that are biologically functional. An improved

background model (model of non-coregulated genes) can

considerably improve this discrimination.

The most popular probabilistic models published so far

(Bailey and Elkan, 1995; Hughes et al., 2000; Lawrence

et al., 1993; Liu et al., 1995; Neuwald et al., 1995; Roth

et al., 1998) use a simple background model based on the

frequency of the nucleotides A, C, G, and T in the data

set to represent an intergenic sequence. However, a back-

ground model solely based on single nucleotide frequen-

cies poorly reﬂects the complex structure of genome se-

quences. A description of DNA sequences as higher-order

Markov chains on the other hand has been used in most

of the state-of-the-art gene recognition software to repre-

sent coding and non-coding regions, (e.g., in Glimmer and

the A.thaliana speciﬁc GlimmerA (Delcher et al., 1999),

HMMgene (Krogh, 1997) and GeneMark.hmm (Lukashin

and Borodowsky, 1998)). In this paper we describe the ex-

tension of the Gibbs sampling algorithm with a complex

context-dependent background model.

Recently other researchers have also proposed the use

of advanced background models in the Gibbs sampling

algorithm (Liu et al., 2001; McCue et al., 2001). Here

we like to address the speciﬁc issues associated with the

background models used in these methods. McCue et al.

(2001) extended the Gibbs sampler with a position speciﬁc

background model estimated with a Bayesian segmenta-

tion algorithm that was presented by Liu and Lawrence

(1999). This model tries to capture the varying GC and

AT-content of the different regions in DNA sequences. The

model parameters and corresponding change points are

found using a Bayesian sequence segmentation algorithm

that maximizes the joint likelihood of the data, parameters

and missing values (change point positions). These model

parameters can be used to calculate the probability that a

certain site in the sequence is generated with this back-

ground model. Furthermore, Liu and Lawrence have situ-

ated the problem of sequence alignment within a Bayesian

framework that we also favour. Liu et al. (2001) have de-

veloped an extended version of the Gibbs sampler called

BioProspector. They have proposed the use of a context-

depend Markov model to represent the probability that a

site is generated by the background model. We will com-

ment on some of the speciﬁc technicalities of this method

in the next section.

Together with the development of the algorithm a

selected data set of intergenic sequences from A.thaliana

was used to construct a reliable higher-order background

model of gene upstream regions of this model plant.

The inﬂuence of different background models on the

robustness of the motif sampler in the presence of

noisy data was exhaustively tested. We will describe the

construction of the background models and the use of this

background model with the Gibbs sampling algorithm.

To test the inﬂuence of these models, we used simulated

data and several well-described data sets of genes from

A.thaliana for which different motifs are documented.

HIGHER-ORDER BACKGROUND MODEL

As stated in the introduction, most of the state-of-the-art

gene detection software uses a context-dependent model

based on a higher-order Markov process to represent DNA

sequences. Based on the rationale of these algorithms,

the use of such a model to detect motifs in the upstream

region of coregulated genes seems a logical decision.

Using a context-dependent model of order mmeans that

the probability of ﬁnding a nucleotide bat position lin

a sequence depends on the mprevious nucleotides in the

sequence. The probability of the sequence being generated

by this background model Bmis given by

P(S|Bm)=P(b1,...,bm)



l=m+1

P(bl|bl−1,...,bl−m).

The probabilities P(bl|bl−1,...,bl−m)are stored in a

transition matrix and the prior frequency of the oligonu-

cleotides of length mis given by P(b1,b2,···,bm).

The construction of the transition matrix of an mth-

order background model is based on the counting of all

oligonucleotides of length (m+1) in the data set. To com-

pensate for zero occurrences of certain oligonucleotides

a pseudocount is added. Based on Bayesian statistics,

we assume that the more data are available the more we

can rely on these data to approximate the true biological

1114

Higher-order background model for motif detection

model. Based on the comparison of experimental results,

the pseudocounts are chosen proportional to the single

nucleotide frequency and inverse proportional to the

square root of the size of the data set.

The background model can be either constructed based

on the input sequences or based on an independent data

set of intergenic sequences. The latter approach seems

the more sensible one to produce a reliable background

model. The quality of the background model depends on

the quality of the data set. In this paper a set of carefully

selected intergenic sequences from A.thaliana is used to

construct a reliable background model.

Extension of the motif sampler

The implementation of our motif ﬁnding algorithm is

based on the original Gibbs sampling algorithm previ-

ously described by Lawrence et al. (1993). An elaborate

description of our algorithm is given elsewhere (Thijs

et al., 2001). In this paper we like to emphasize the

speciﬁc aspects of the higher-order model and we will

only summarize the aspects of the algorithm to facilitate

further understanding.

The calculation of the background model is done as an

initialization step of the algorithm. The background model

Bmis computed either from the input sequences, making

it useful for any organism, or from an independent data

set. This model Bmis not updated during the algorithm

since there is no need to re-estimate the background model

at each iteration step of the algorithm. The background

model will be used to calculate for each site xof length W

in a sequence the probability that this site was generated

by the background model. This probability is referred to

as Px:

Px=P(Site |Bm)=



l=1

P(bl|bl−1,...,bl−m).

The motif of length Wis represented with a position

probability matrix θWwhere the entry qi,bcontains the

probability of ﬁnding nucleotide bat position iin the

motif:

θW=





q1,Aq2,A··· qW,A

q1,Cq2,C··· qW,C

q1,Gq2,G··· qW,G

q1,Tq2,T··· qW,T





.

For each site xof length Win a sequence the probability

Qxof site xbeing generated by the motif model θWis

calculated:

Qx=P(Site |θW)=



l=1

ql,bl.

Based on these probabilities a weight Axis then assigned

to each segment xin the sequence

Ax=Qx

Subsequently the alignment vector of the motif in this

sequence is sampled according to the distribution of

normalized weights Ax. By updating this distribution

we can ﬁnd the alignment that maximizes the ratio

of the corresponding site probability to the background

probability.

As stated in the introduction, Liu et al. (2001) have also

proposed the use of a context-dependent Markov model

in a Gibbs sampling algorithm. In their algorithm the

probability of a site being generated by this background

model is computed as

P(bl,bl+1,bl+2,···,bl+W−1)=P(bl)P(bl+1|bl)

×P(bl+2|bl,bl+1)P(bl+3|bl,bl+1,bl+2)

···P(bl+W−1|bl+W−2,bl+W−3,bl+W−4).

This approach resembles the one proposed in this paper

but differs in the calculation of the probability of the

site being generated by the background model. In our

algorithm we take the mpreceding bases of the site into

account to compute the background probability of the site

since the complete sequence information is available at the

time of computation. This also means that we only need

to compute the parameters of the mth-order background

model and not the parameters for all the models from 1

to m.

DATA SETS

Intergenic data set

To construct the best possible representation of promoter

sequences or intergenic sequences a data set consisting of

carefully selected intergenic sequences was constructed,

following a previously described rationale to build Araset

(Pavy et al., 1999). To deﬁne clean intergenic sequences,

all complete cDNAs were downloaded through SRS and

aligned on BAC sequences. The aligned genes were man-

ually checked by an expert (S. Aubourg, personal commu-

nication). Each time the cDNAs matched two consecutive

genes on the BAC, the intergenic sequence was extracted.

The sequences with a length below 10 kb were then ex-

tensively checked for any unannotated potential coding

sequences, using BLAST (Altschul et al., 1990) for ho-

mology searches and prediction software such as EuG`

ene

(Schiex et al., 2000). 78 intergenic sequences were re-

tained, representing a total of 156 087 bp. These sequences

were added to the 94 intergenic sequences retrieved from

Araset, resulting in a data set with 341 248 bp. Figure 1

1115

G.Thijs et al.

Gene Gene

Intergenic

Region

Core Promoter

Fig. 1. Representation of the three different conﬁguration of the

intergenic region in DNA. The genes are represented with an arrow

and the core promoter for each gene is indicated with an oval box.

(1) The two genes are pointing in the same direction and there is one

core promoter in the intergenic region. (2) Two genes are pointing in

opposite directions and the intergenic region contains the two core

promoters. In the last case (3) the genes are pointing towards each

other and in the intergenic region there is no core promoter present.

shows the three different conﬁgurations in which neigh-

boring genes can occur. 105 sequences have the ﬁrst con-

ﬁguration (1) consisting of two genes coding in tandem on

the same strand. In this case the intergenic region is ex-

pected to contain only one promoter. 38 sequences have

the second conﬁguration (2) where both genes are point-

ing away from each other. In this intergenic region diver-

gent promoters are expected to control transcription. In the

last case (containing 29 sequences) the transcription of the

two genes is convergent. No promoter regulatory element

is expected to occur in the intergenic region. The transition

matrix was only built from the intergenic sequences of the

classes (1) and (2), which likely contain either one or two

promoters.

Data sets for testing

To test the performance of our implementation we con-

structed several data sets. The data sets are accessible

on the web: http://www.plantgenetics.rug.ac.be/∼males/

Datasets/Data.html.

•Simulated data: these sequences were used to test the

inﬂuence of the background model on the detection

of different types of motifs. By sampling according

to the 4th-order background model 20 sequences of

500 bp were generated. In several tests, instances of

predeﬁned motifs were inserted in these sequences at

random positions.

•G-box sequences: this set of sequences was extracted

from PlantCARE (Rombauts et al., 1999) and contains

the upstream region of genes that are known to be

regulated by G-box binding proteins in dicots. The

consensus of the G-box is CACGTG. The position

of the G-box is well deﬁned in this data set. The

set contains 33 sequences of 500 bp. The G-box

(CACGTG) is a well-conserved ubiquitous cis-acting

regulatory element found in plant genomes and is

bound by the GBF (G-box binding factors) family of

bZIP proteins (Donald and Cashmore, 1990).

•Light induced: this set contains the upstream region

of 28 coexpressed A.thaliana genes. Coexpression

was based on the cluster analysis of a microarray

experiment (Desprez et al., 1998).

•Random: set of randomly selected A.thaliana upstream

sequences of at least 150 bp, not described to be

involved in light regulation and not containing a known

G-box.

RESULTS

Construction of an independent background model

The construction of a Markov process can rely either on

the upstream sequences from the input data or from an

independent data set. This independent data set consists

of a well-deﬁned set of intergenic regions of A.thaliana

genes (see Section Data sets). It should be noted that the

number of nucleotides used to construct the Markov model

limits the order of the background model that can be used.

Indeed, when a transition matrix of order mis constructed,

all oligonucleotides of length (m+1) are counted. The

number of possible different oligonucleotides equals 4m+1

and increases exponentially with m. The data set used for

the construction of the background model should, under

the assumption of an equal nucleotide distribution, at least

contain 4m+1different base pairs to have a single count

for each nucleotide. In reality the assumption of equal

nucleotide distribution does not hold and a much larger

data set will be needed. When an oligonucleotide does not

occur in the data set, it will be replaced by a pseudocount.

When the order of the background is too high relative to

the size of the data set on which this background model

was based, less frequent motifs will be encountered which

deteriorates the motif model. Following this reasoning

the improvement of using a Markov chain background

model will be more explicit when its construction is

based on a large data set (such as the one used in this

study). A thorough study of the intergenic data set shows

that when all hexamers are counted, there are 430 388

examples in this data set. The hexamer with highest

number of occurrences, 2018, was AAAAAA, while there

was no instance at all found of the hexamer GCGGGC.

The consequence of this observation is that the 5th-order

background model is less reliable, as will be shown in the

tests.

Simulated sequences

As a ﬁrst set of tests, simulated sequences were used.

The sequences were generated according to the 4th-

order background model. Although randomly generated

1116

Higher-order background model for motif detection

Table 1. Results of the motif ﬁnding in the simulated sequences with all

four motifs inserted. Each number corresponds to number of runs, out of 20,

in which the corresponding consensus was found and this for the different

orders of the background models

Input sequences Intergenic sequences

321snf snf12345

GCTGCAGC 0 18 14 19 19 19 20 20 18 15

TAGAATA 4 12 14 8 9 17 16 20 20 15

GsCCGnnnCGGsC 2 4 7 7 13 11 14 14 18 10

ATAnwCCwnTAA 9 9 5 4 3 9 13 13 14 12

sequences do not fully resemble biological sequences, the

use of the 4th-order model ensures the conservation of the

pentamer composition of the sequences as compared to the

composition of the intergenic data set. This also means

that these simulated sequences are AT-rich. Subsequently

the different motifs were inserted at random positions

throughout the sequences. Motifs were represented with

a position probability matrix and each base of a created

instance is sampled according to the distribution in the

corresponding column of the matrix. Distinct types of

motifs can be observed: AT-rich versus GC-rich and well

conserved versus more degenerate. Using these simulated

data sets we are able to test the inﬂuence of the different

background models on the rate of detecting the inserted

motifs.

In a ﬁrst set of tests we included all four types of motifs

in 20 sequences of 500 bp. Figure 2 gives an overview of

the logos of the four created motifs based on the inserted

instances. We tried several parameter settings and the best

results were obtained when searching for six different

motifs with a length of 14 bp that can have 0, 1 or 2 copies.

Since our method is probabilistic, each test was repeated

20 times. Table 1 indicates for the different types of the

background model how many times the corresponding

consensus term was found in the 20 repeated runs. The

background model was computed from either the input

sequences or the precompiled intergenic models.

The ﬁrst motif, GCTGCAGC, a well-conserved,

GC-rich motif was easily found with each of the back-

ground model used, except with the 3rd-order model based

on the input sequences. To explain this phenomenon we

plotted in Figure 3 the transition matrices of the two

3rd-order background models. Each point in the plot

corresponds to the same entry in both transition matrices.

The xaxis shows the value for the matrix based on the in-

tergenic sequences while the value for the matrix based on

the input sequences is depicted on the yaxis. For a perfect

3rd-order background model a one-to-one correspondence

is expected, however some outliers are clearly visible.

For these points the corresponding representation of the

entry in the transition matrix is indicated. The points that

are largely overestimated by the background model based

on the input sequences correspond to the words part of

the inserted GC-rich motifs. Therefore, the computed

background model based on the input sequences is biased

towards this motif and thus it will overestimate Pxfor

these sites.

The second inserted motif is a frequent, short, well-

conserved and AT-rich motif: TAGAATA. When using the

single nucleotide background model the motif was only

found in 50% of the runs, while it was found in all 20

runs when using the intergenic either the 3rd or 4th-order

models. Finding a short AT-rich motif seems to be much

harder when using the single nucleotide model. Even when

searching for shorter motifs (8 bp) it was picked up less

frequently with the single nucleotide model.

The third motif consists of two GC-rich blocks and it

has a ﬁxed gap of 3 bp. This motif was retrieved with

the same consistency for all background models except

when the background models were compiled from the

input sequences. In this case there is no difference between

the higher order models and the single nucleotide models.

We repeated this test also with a data set where only

this degenerated GC-rich motif was inserted and there the

same consistent performance was observed.

Finally a rather degenerate, AT-rich motif, resembling

the background, was inserted. This motif is assumed to be

the hardest to ﬁnd. Using the single nucleotide background

model resulted in a fairly poor performance. The motif

was detected in only 3 or 4 out of 20 runs. Using higher-

order background models, the motif was more frequently

found. This effect was even more pronounced when

we inserted only this degenerated motif in the random

sequences. The number of detections with the single

nucleotide model increased to 6 out of 20 runs, but for

the 3rd-order model this was 18 out of 20 and even 20

out of 20 runs for the 4th-order model. To further test

this difﬁcult case we inserted slightly modiﬁed versions

of this motif, TTAAwCwATA and wTAATsTATA, in the

generated sequences. In these two test cases we only

used the intergenic background models. Again a similar

performance was measured, as is shown in Table 2.

Using the single nucleotide model it is difﬁcult to retrieve

the inserted motifs, while using the higher-order models

signiﬁcantly improves the performance.

In this section we described the analysis of several

tests with simulated data. Although the simulated data

do not fully resemble biological sequences, we can still

draw some conclusion from these tests. It was shown in

these tests that the use of higher order background model,

compiled from an intergenic data set, improves the rate

of detection of difﬁcult motifs as compared to the use

of a single nucleotide model. The results show that the

3rd and 4th-order models based on the intergenic data

1117

G.Thijs et al.

Fig. 2. Logo representation of the four inserted sites in the simulated sequences. These logos are created from all the inserted instances of

the motifs in the sequences.

Fig. 3. Comparison of the 3rd-order transition matrices computed from the intergenic sequences and the input sequences. Each point

represents a corresponding entry in both transition matrices. The xaxis indicates the value in the intergenic matrix while the yaxis depicts

the value in the matrix based on the input sequences. For some of the outliers the corresponding entry in the matrix is given.

1118

Higher-order background model for motif detection

Table 2. Results of the motif ﬁnding in the simulated sequences with only

the AT-rich motif inserted. Each number corresponds to number of runs,

out of 20, in which the correct consensus was found and this for all the

background models based on the intergenic data set

snf 1 2 3 4 5

TTAAwCwATA Correct 5 9 14 10 13 5

Shifted 1 2 3 4 3 2

wTAATsTATA Correct 0 12 7 12 8 8

Shifted 1 5 6 4 5 3

set have the most positive inﬂuence on the detection of

distinct motifs. The effect of the 1st and 2nd-order models

is less pronounced. The 5th-order model on the other

hand has in some cases a rather poor performance. This

is due to the fact that the transition matrix is built by

counting hexamers. As was shown in the section about

the construction of the background model this count is

not reliable enough. The inﬂuence of using the intergenic

data set instead of using the input sequences is obvious

especially for the 3rd-order model. However we should

take the necessary care when interpreting the results in

Tables 1 and 2. These results show that tests with the

4th-order model have the best results, but this is possibly

related to the fact that this background model was used to

generate the sequences.

Inﬂuence of noisy sequences

To test the inﬂuence of the complex background model

on the robustness of the motif sampler in the presence

of noise, another set of tests was performed. In this set

of tests we used a set of biologically relevant sequences

upstream of 33 genes (G-box data set, see Section Data

sets). In subsequent tests, the number of noisy sequences

added to the G-box data set was progressively increased

(10 at a time). The set of noisy sequences, from which

each time 10 sequences were sampled, consisted of a

random mixture of the light-induced (Desprez et al.,

1998) and random data sets (see Section Data sets).

Preliminary tests showed that the 3rd-order background

model had a better performance than the other higher-

order background models on this data set. Therefore in

the next set of exhaustive tests only the single nucleotide

and the 3rd-order background model were compared. All

parameters of the motif sampler algorithm were kept

ﬁxed except for the order of the background model (we

tried either single nucleotide frequency, 3rd-order Markov

model computed from the input data or 3rd-order Markov

model computed from the intergenic data set). In each test

we searched for 10 different motifs with a length of 8 bp

that can have 0 or 1 copy in the sequence and each test was

repeated 10 times.

To evaluate the results we checked in which runs

the G-box consensus CACGTG was detected. Based

on this deﬁnition of the G-box sequence, we calculated

the number of times the G-box was found in each test

(group of 10 runs). Figure 4a describes the behaviour of

the algorithm in the presence of an increasing number

of noisy sequences for different background models.

The 3rd-order background model clearly outperforms the

single nucleotide background model. With the single

nucleotide model, the algorithm only detects the G-box

consensus in a small number of runs even in the presence

of only a limited number of noisy sequences. Both

higher-order background models can ﬁnd the G-box

consensus in the presence of a large number of noisy

sequences. To further validate the outcome, the positions

of the G-box motifs predicted by the algorithm were

compared with the positions of G-box of the documented

33 G-box sequences. Three different possibilities can be

distinguished:

(1) the predicted motif is located at the same position as

the known G-box motif (true positive);

(2) the algorithm could not detect a motif although the

presence of a motif was described (false negative);

(3) a potential G-box motif detected by the algorithm

is located at a different position than the described

G-box (ambiguous case).

In the last case, the predicted position might represent

a yet undetected G-box and is therefore inconclusive.

Figure 4a shows the average number of correctly pre-

dicted motif positions. The calculation was based for each

experiment only on the runs in which a G-box consensus

was detected. Figure 4b demonstrates that the number

of correctly predicted motifs (true positives) decreases

slightly with increasing noise. However, the order of the

background model does not interfere drastically with the

number of correctly predicted motifs. On average, ap-

proximately 70% of the G-boxes were correctly predicted.

This indicates that, if a motif is detected, it is in 70% of

the cases the right motif, irrespective of the background

model. The background model does not improve the per-

formance of the algorithm in making correct predictions.

However, the inﬂuence of the higher-order background

model on the robustness and performance of the algorithm

in the presence of noise becomes obvious when taking

into account the number of missed true positives. Fig-

ure 4c depicts the average number of sequences in which

the algorithm could not predict the right G-box motif

(false negatives). This number of false negatives consists

of all the sequences in a run in which the algorithm could

not detect a G-box consensus although a G-box was

described in the sequence. Figure 4c shows that the more

noise is added to the data set, the higher the percentage

1119

G.Thijs et al.

0 10 20 30 40 50 60

Number of runs where G-boxes is found

(a)

0 10 20 30 40 50 60

Number of added noise sequences

(b)

0 10 20 30 40 50 60

100

(c)

Order 0

Intergenic Order 3

Data order 3

% Correctly predicted G-box positions

% Missed G-boxes

Fig. 4. (a) Total number of times the G-box consensus is found in 10 runs. The horizontal axis shows the number of noisy sequences added

to the G-box data set. (b) Average number of correctly predicted G-box positions. This number is based on comparison of the described

G-box positions and the predicted positions of the G-box motif in all the runs where a G-box consensus was found. (c) Average percentage

of wrongly classiﬁed motifs. This number is based the number of sequences that are indicated as not having a G-box although a G-box was

documented (including the runs where no G-box consensus is found).

of missed G-boxes. Moreover, this effect is considerably

more pronounced for the single nucleotide background

model than for the 3rd-order background model. The

3rd-order model based on the set of intergenic sequences

performs better than the 3rd-order model based on the

input data.

We usually observed during the tests that when using a

3rd-order background model, the algorithm retrieved the

G-box consensus as one of the ﬁrst motifs, while this was

not the case for the single nucleotide model. The rapid

convergence of the algorithm to the G-box indicates that

it is a very stable motif in the presence of a 3rd-order

model. This was further corroborated by the fact the G-box

motif was in these cases also the motif with the highest

log-likelihood score.

DISCUSSION

We aimed at improving the performance of a probabilistic

implementation of a motif ﬁnding algorithm in the pres-

ence of noisy data. To this end, the existing algorithm was

extended with a more complex background model. We an-

ticipated that the description of the background sequences

as single nucleotide frequencies was not sufﬁcient to

capture the complex information in the inherently non-

random sequence code. Therefore we used higher-order

Markov models to represent the intergenic sequences in

DNA. We adapted the original Gibbs sampling algorithm

in such a way that we can incorporate the higher-order

background model to update the probabilities of ﬁnding

a motif at a certain position in the sequence. A set of

carefully selected intergenic regions was used to construct

a higher-order background model for A.thaliana.Aswas

shown in the Section Results the quality and the size of

this intergenic data set determine the reliability of the

order of the model. We tested our implementation on

different simulated data sets, where the sequences were

generated from the 4th-order background model and

different types of motifs were inserted. These tests have

shown that the use of a higher order background model,

especially 3rd or 4th-order, can improve the performance

of the algorithm. We also showed with the simulated data

that when using an unreliable higher-order model (e.g.,

the 3rd-order model based on the input sequences) the

performance decreases signiﬁcantly.

The behaviour of the algorithm in the presence of an

increasing amount of noisy data has extensively been

tested. The use of a 3rd-order model was shown to

be considerably more robust than a single nucleotide

background model. The overall recovery of the motifs was

higher in the presence of a higher-order model, though the

number of correctly predicted motifs was only marginally

affected by the complexity of the background model.

1120

Higher-order background model for motif detection

Future work will concentrate on the improvement of

the A.thaliana background model through extending the

intergenic data set and also by using interpolated Markov

chains to augment the signiﬁcance of the transition matrix.

Focus will be on the automatic selection of the best

background model. We will also compile background

models of other organisms.

ACKNOWLEDGEMENTS

Gert Thijs is research assistant with the IWT; Yves

Moreau is a post-doctoral researcher of the FWO;

Professor Bart De Moor is a full time professor at

the KULeuven; Pierre Rouz´

e is Research Director of

INRA (Institut National de la Recherche Agronomique,

France). This work is partially supported by: 1. IWT

project: STWW-980396; 2. Research Council KULeuven:

GOA Meﬁsto-666; 3. FWO projects: G.0240.99 and

G.0256.97; 4. IUAP P4-02 (1997–2001); 5. Industrial

Contract Research: Data4s. The scientiﬁc responsibility is

assumed by its authors.

REFERENCES

Altschul,S., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J.

(1990) Basic local alignment search tool. J. Mol. Biol.,215, 403–

410.

Bailey,T.L. and Elkan,C. (1995) Unsupervised learning of multiple

motifs in biopolymers using Expectation Maximization. Mach.

Learn.,21,51–80.

Brazma,A. and Vilo,J. (2000) Gene expression data analysis. FEBS

Lett.,480,17–24.

Bucher,P. (1999) Regulatory elements and expression proﬁles. Curr.

Opin. Struct. Biol.,9, 400–407.

Chu,S., DeRisi,J., Eisen,M.B., Mulholland,J., Botstein,D.,

Brown,P.O. and Herskowitz,I. (1998) The transcriptional

program of sporulation in budding yeast. Science,282, 699–705.

Delcher,A.L., Harman,D., Kasif,S., White,O. and Salzberg,S.L.

(1999) Improved microbial gene identiﬁcation with Glimmer.

Nucleic Acids Res.,27, 4636–4641.

DeRisi,J.L., Iyer,V.R. and Brown,P.O. (1997) Exploring the

metabolic and genetic control of gene expression on a genomic

scale. Science,278, 680–686.

Desprez,T., Amselem,J., Caboche,M. and Hofte,H. (1998) Differ-

ential gene expression in Arabidopsis monitored using cDNA ar-

rays. Plant J.,14, 643–652.

Donald,R.G.K. and Cashmore,A.R. (1990) Mutation of either

G-box or I-box sequences profoundly affects expression from the

Arabidopsis rbcS-1A promoter. EMBO J.,9, 1717–1726.

Hughes,J.D., Estep, Preston,W., Tavazoie,S. and Church,G.M.

(2000) Computational identiﬁcation of cis-regulatory elements

associated with groups of functionally related genes in Saccha-

romyces cerevisiae.J. Mol. Biol.,296, 1205–1214.

van Helden,J., Andr´

e,B. and Collado-Vides,L. (1998) Extracting

regulatory sites from upstream region of yeast genes by com-

putational analysis of oligonucleotide frequencies. J. Mol. Biol.,

281, 827–842.

van Helden,J., Rios,A.F. and Collado-Vides,J. (2000) Discovering

regulatory elements in non-coding sequences by analysis of

spaced dyads. Nucleic Acids Res.,28, 1808–1818.

Jensen,L.J. and Knudsen,S. (2000) Automatic discovery of regula-

tory patterns in promoter regions based on whole cell expression

data and functional annotation. Bioinformatics,16, 326–333.

Krogh,A. (1997) Two methods for improving performance of an

HMM and their application for gene ﬁnding. ISMB,5, 179–186.

Lawrence,C.E., Altschul,S.F., Boguski,M.S., Liu,J.S.,

Neuwald,A.F. and Wootton,J.C. (1993) Detecting subtle

sequence signals: a Gibbs sampling strategy for multiple

alignment. Science,262, 208–214.

Liu,J.S. and Lawrence,C.E. (1999) Bayesian inference on biopoly-

mer models. Bioinformatics,15,38–52.

Liu,J.S., Neuwald,A.F. and Lawrence,C.E. (1995) Bayesian models

for multiple local sequence alignment and Gibbs sampling

strategies. J. Am. Stat. Assoc.,90, 1156–1170.

Liu,X., Brutlag,D.L. and Liu,J.S. (2001) BioProspector: discovering

conserved DNA motifs in upstream regulatory regions of co-

expressed genes. Pac. Symp. Biocomput.,6, 127–138.

Lockhart,D.J. and Winzeler,E.A. (2000) Genomics, gene expression

and DNA arrays. Nature,405, 827–836.

Lukashin,A.V. and Borodowsky,M. (1998) GeneMark.hmm: new

solutions for gene ﬁnding. Nucleic Acids Res.,26, 1107–1115.

McCue,L.A., Thompson,W., Carmack,C.S., Ryan,M.P., Liu,J.S.,

Derbyshire,V. and Lawrence,C.E. (2001) Phylogenetic footprint-

ing of transcription factor binding sites in probacterial genomes.

Nucleic Acids Res.,29, 774–782.

Neuwald,A.F., Liu,J.S. and Lawrence,C.E. (1995) Gibbs motif

sampling: detection of bacterial outer membrane protein repeats.

Protein Sci.,4, 1618–1632.

Ohler,U. and Niemann,H. (2001) Identiﬁcation and analysis of

eukaryotic promoters: recent computational approaches. Trends

Genet.,17,56–60.

Pavy,N., Rombauts,S., D´

ehais,P., Math´

e,C., Ramana,D.V.V.,

Leroy,P. and Rouz ´

e,P. (1999) Evaluation of gene prediction

software using a genomic data set: application to Arabidopsis

thaliana sequences. Bioinformatics,15, 887–899.

Rombauts,S., D´

ehais,P., Van Montagu,M. and Rouz´

e,P. (1999)

PlantCARE, a plant cis-acting regulatory element database.

Nucleic Acids Res.,27, 295–296.

Roth,F.P., Hughes,J.D., Estep,P.W. and Church,G.M. (1998) Finding

DNA regulatory motifs within unaligned noncoding sequences

clustered by whole genome mRNA quantitation. Nat. Biotech.,

16, 939–945.

Schiex,T., Moisan,A., Duret,L. and Rouz´

e,P. (2000) EuGene: a

eucaryotic gene ﬁnder that combines several sources of evidence.

In Proc. JOBIM’2000. http://www.inra.fr/bia/T/schiex/Export/

EuGene2.pdf

Spellman,P.T., Sherlock,G., Zhang,M.Q., Iyer,V.R., Anders,K.,

Eisen,M.B., Brown,P.O., Botstein,D. and Futcher,B. (1998)

Comprehensive identiﬁcation of cell cycle-regulated genes of

the yeast Saccharomyces cerevisiae by microarray hybridisation.

Mol. Biol. Cell.,9, 3273–3297.

Thijs,G., Marchal,K., Lescot,M., Rombauts,S., De Moor,B.,

Rouz´

e,P. and Moreau,Y. (2001) A Gibbs sampling method to de-

tect over-represented motifs in upstream regions of co-expressed

genes. RECOMB,5, 305–312.

Vanet,A., Marsan,L., Labigne,A. and Sagot,M.F. (2000) Inferring

1121

G.Thijs et al.

regulatory elements from a whole genome. An analysis of

Helicobacter pylori sigma(80) family of promoter signals. J.

Mol. Biol.,297, 335–353.

Wolfsberg,T.G., Gabrielian,A.E., Campbell,M.J., Cho,R.J.,

Spouge,J.L. and Landsman,D. (1999) Candidate regulatory

sequence elements for cell cycle-dependent transcription in

Saccharomyces cerevisiae.Genome Res.,9, 775–792.

Workman,C.T. and Stormo,G.D. (2000) ANN-SPEC: a method for

discovering transcription binding sites with improved speciﬁcity.

Pac. Symp. Biocomput.,5, 467–478.

Zhang,M.Q. (1999) Large-scale gene expression data analysis: a

new challenge to computational biologist. Genome Res.,9, 681–

688.

1122

A Review on Planted (l, d) Motif Discovery Algorithms for Medical Diagnose

Article

Full-text available

Feb 2022
SENSORS-BASEL

Personalized diagnosis of chronic disease requires capturing the continual pattern across the biological sequence. This repeating pattern in medical science is called “Motif”. Motifs are the short, recurring patterns of biological sequences that are supposed signify some health disorder. They identify the binding sites for transcription factors that modulate and synchronize the gene expression. These motifs are important for the analysis and interpretation of various health issues like human disease, gene function, drug design, patient’s conditions, etc. Searching for these patterns is an important step in unraveling the mechanisms of gene expression properly diagnose and treat chronic disease. Thus, motif identification has a vital role in healthcare studies and attracts many researchers. Numerous approaches have been characterized for the motif discovery process. This article attempts to review and analyze fifty-four of the most frequently found motif discovery processes/algorithms from different approaches and summarizes the discussion with their strengths and weaknesses.

A Network-Based Approach for Improving Annotation of Transcription Factor Functions and Binding Sites in Arabidopsis thaliana

Article

Full-text available

Jan 2023

Transcription factors are an integral component of the cellular machinery responsible for regulating many biological processes, and they recognize distinct DNA sequence patterns as well as internal/external signals to mediate target gene expression. The functional roles of an individual transcription factor can be traced back to the functions of its target genes. While such functional associations can be inferred through the use of binding evidence from high-throughput sequencing technologies available today, including chromatin immunoprecipitation sequencing, such experiments can be resource-consuming. On the other hand, exploratory analysis driven by computational techniques can alleviate this burden by narrowing the search scope, but the results are often deemed low-quality or non-specific by biologists. In this paper, we introduce a data-driven, statistics-based strategy to predict novel functional associations for transcription factors in the model plant Arabidopsis thaliana. To achieve this, we leverage one of the largest available gene expression compendia to build a genome-wide transcriptional regulatory network and infer regulatory relationships among transcription factors and their targets. We then use this network to build a pool of likely downstream targets for each transcription factor and query each target pool for functionally enriched gene ontology terms. The results exhibited sufficient statistical significance to annotate most of the transcription factors in Arabidopsis with highly specific biological processes. We also perform DNA binding motif discovery for transcription factors based on their target pool. We show that the predicted functions and motifs strongly agree with curated databases constructed from experimental evidence. In addition, statistical analysis of the network revealed interesting patterns and connections between network topology and system-level transcriptional regulation properties. We believe that the methods demonstrated in this work can be extended to other species to improve the annotation of transcription factors and understand transcriptional regulation on a system level.

An Efficient Metaheuristic Approach for Finding Motifs from DNA Sequences

Conference Paper

Full-text available

Dec 2021

MotifHub: Detection of trans-acting DNA motif group with probabilistic modeling algorithm

Article

Nov 2023

A survey on algorithms to characterize transcription factor binding sites

Article

Apr 2023
BRIEF BIOINFORM

Transcription factors (TFs) are key regulatory proteins that control the transcriptional rate of cells by binding short DNA sequences called transcription factor binding sites (TFBS) or motifs. Identifying and characterizing TFBS is fundamental to understanding the regulatory mechanisms governing the transcriptional state of cells. During the last decades, several experimental methods have been developed to recover DNA sequences containing TFBS. In parallel, computational methods have been proposed to discover and identify TFBS motifs based on these DNA sequences. This is one of the most widely investigated problems in bioinformatics and is referred to as the motif discovery problem. In this manuscript, we review classical and novel experimental and computational methods developed to discover and characterize TFBS motifs in DNA sequences, highlighting their advantages and drawbacks. We also discuss open challenges and future perspectives that could fill the remaining gaps in the field.

Deciphering Macromolecular Interactions Involved in Abiotic Stress Signaling: A Review of Bioinformatics Analysis

Chapter

Mar 2023

Gwenola Gouesbet

Plant functioning and responses to abiotic stresses largely involve regulations at the transcriptomic level via complex interactions of signal molecules, signaling cascades, and regulators. Nevertheless, all the signaling networks involved in responses to abiotic stresses have not yet been fully established. The in-depth analysis of transcriptomes in stressed plants has become a relevant state-of-the-art methodology to study these regulations and signaling pathways that allow plants to cope with or attempt to survive abiotic stresses. The plant science and molecular biology community has developed databases about genes, proteins, protein-protein interactions, protein-DNA interactions and ontologies, which are valuable sources of knowledge for deciphering such regulatory and signaling networks. The use of these data and the development of bioinformatics tools help to make sense of transcriptomic data in specific contexts, such as that of abiotic stress signaling, using functional biological approaches. The aim of this chapter is to present and assess some of the essential online tools and resources that will allow novices in bioinformatics to decipher transcriptomic data in order to characterize the cellular processes and functions involved in abiotic stress responses and signaling. The analysis of case studies further describes how these tools can be used to conceive signaling networks on the basis of transcriptomic data. In these case studies, particular attention was paid to the characterization of abiotic stress responses and signaling related to chemical and xenobiotic stressors.Key wordsOnline toolsDatabasesTranscriptomic dataSignaling pathwaysChemical stressXenobiotic stressInteractions

Towards a Better Understanding of Heuristic Approaches Applied to the Biological Motif Discovery

Chapter

Nov 2022

The detection of transcription factor binding sites (TFBS) play a important role inside bioinformatics challenges. Its correct identification in the promoter regions of co-expressed genes is a crucial step for understanding gene expression mechanisms and creating new drugs and vaccines. The problem of finding motifs consists of looking for conserved patterns in biological datasets of sequences through the use of unsupervised learning algorithms. For that reason, it is considered one of the classic problems of computational biology, which in its simplest formulation has been proven to be NP-HARD. Moreover, heuristic and meta-heuristic algorithms have been shown to be very promising in solving combinatorial problems with very large search spaces. In this work, we propose an evaluation of different heuristics and meta-heuristics approaches in order to measure its performance: Variable Neighborhood Search (VNS), Expectation Maximization (EM) and Iterated Local Search (ILS). For each of them, two sets of experiments were carried out: In the first, the heuristics were performed alone and in the second, a constructive procedure was introduced with respect to improve the quality of initial solutions. Finally, the metrics were compared with the state-of-art MEME algorithm, which is very used in biological motif discovery. The results obtained suggest that the heuristics are more efficient when used together and also, a constructive procedure was very promising, managing to improve the performance metrics of the evaluated heuristics in most experiments. Also, the combination between a constructive procedure and EM proved to be quite competitive, managing to outperform the MEME algorithm in several datasets.KeywordsBiological motifsHeuristicsMeta-heuristicsUnsupervised learning

Expectation Maximization based algorithm applied to DNA sequence motif finder

Conference Paper

Jul 2022

Improving the accuracy of schedule information communication between humans and data

Article

Aug 2022
ADV ENG INFORM

Construction schedules are written instructions of construction execution shared between stakeholders for essential project information exchange. However, construction schedules are semi-structured data that lack semantic details and coherence within and across projects. This study proposes an ontology-based Recurrent Neural Network approach to bi-directionally translate between human written language and machinery ontological language. The proposed approach is assessed in three areas: text generation accuracy, machine readability, and human understandability. This study collected 30 project schedules with 19,589 activities (sample size = 19,589) from a Tier-1 contractor in the UK. The experimental results indicate that: (1) precision and recall of text generation LSTM-RNN model is 0.991 and 0.874, respectively; (2) schedule readability improved by increasing the semantic distinctiveness, measured using the cosine similarity which was reduced from 0.995 to 0.990 (p < 0.01); (3) schedule understandability improved from 75.90% to 85.55%. The proposed approach formalises text descriptions in construction schedules and other construction documents with less labour investment. It supports contractors to establish knowledge management systems to learn from historic data and make more informed decisions in future similar scenarios.

Context dependency of nucleotide probabilities and variants in human DNA

Article

Full-text available

Jan 2022
BMC GENOMICS

Background Genomic DNA has been shaped by mutational processes through evolution. The cellular machinery for error correction and repair has left its marks in the nucleotide composition along with structural and functional constraints. Therefore, the probability of observing a base in a certain position in the human genome is highly context-dependent. Results Here we develop context-dependent nucleotide models. We first investigate models of nucleotides conditioned on sequence context. We develop a bidirectional Markov model that use an average of the probability from a Markov model applied to both strands of the sequence and thus depends on up to 14 bases to each side of the nucleotide. We show how the genome predictability varies across different types of genomic regions. Surprisingly, this model can predict a base from its context with an average of more than 50% accuracy. For somatic variants we show a tendency towards higher probability for the variant base than for the reference base. Inspired by DNA substitution models, we develop a model of mutability that estimates a mutation matrix (called the alpha matrix) on top of the nucleotide distribution. The alpha matrix can be estimated from a much smaller context than the nucleotide model, but the final model will still depend on the full context of the nucleotide model. With the bidirectional Markov model of order 14 and an alpha matrix dependent on just one base to each side, we obtain a model that compares well with a model of mutability that estimates mutation probabilities directly conditioned on three nucleotides to each side. For somatic variants in particular, our model fits better than the simpler model. Interestingly, the model is not very sensitive to the size of the context for the alpha matrix. Conclusions Our study found strong context dependencies of nucleotides in the human genome. The best model uses a context of 14 nucleotides to each side. Based on these models, a substitution model was constructed that separates into the context model and a matrix dependent on a small context. The model fit somatic variants particularly well.

Bayesian Models for Multiple Local Sequence Alignment and Gibbs Sampling Strategies

Article

Full-text available

Dec 1995
J AM STAT ASSOC

A wealth of data concerning life's basic molecules, proteins and nucleic acids, has emerged from the biotechnology revolution. The human genome project has accelerated the growth of these data. Multiple observations of homologous protein or nucleic acid sequences from different organisms are often available. But because mutations and sequence errors misalign these data, multiple sequence alignment has become an essential and valuable tool for understanding structures and functions of these molecules. A recently developed Gibbs sampling algorithm has been applied with substantial advantage in this setting. In this article we develop a full Bayesian foundation for this algorithm and present extensions that permit relaxation of two important restrictions. We also present a rank test for the assessment of the significance of multiple sequence alignment. As an example, we study the set of dinucleotide binding proteins and predict binding segments for dozens of its members.

A Gibbs Sampling Method to Detect Over-Represented Motifs in the Upstream Regions of Co-Expressed Genes

Conference Paper

Full-text available

Apr 2001

Microarray experiments can reveal useful information on the transcriptional regulation. We try to find regulatory elements in the region upstream of translation start of coexpressed genes. Here we present a modification to the original Gibbs Sampling algorithm [12]. We introduce a probability distribution to estimate the number of copies of the motif in a sequence. The second modification is the incorporation of a higher-order background model. We have successfully tested our algorithm on several data sets. First we show results on two selected data set: sequences from plants containing the G-box motif and the upstream sequences from bacterial genes regulated by O2-responsive protein FNR. In both cases the motif sampler is able to find the expected motifs. Finally, the sampler is tested on 4 clusters of coexpressed genes from a wounding experiment in Arabidopsis thaliana. We find several putative motifs that are related to the pathways involved in the plant defense mechanism.

PlantCARE, a plant cis-acting regulatory element database

Article

Full-text available

Jan 1999

PlantCARE is a database of plant cis-acting regulatory elements, enhancers and repressors. Besides the transcription motifs found on a sequence, it also offers a link to the EMBL entry that contains the full gene sequence as well as a description of the conditions in which a motif becomes functional. The information on these sites is given by matrices, consensus and individual site sequences on particular genes, depending on the available information. PlantCARE is a relational database available via the web at the URL: http://sphinx.rug.ac.be:8080/PlantCARE/

Evaluation of gene prediction software using a genomic data set: application to <$O_SSF>Arabidopsis thaliana<$C_SSF>sequences

Article

Full-text available

Nov 1999
BIOINFORMATICS

Motivation: The annotation of the Arabidopsis thaliana genome remains a problem in terms of time and quality. To improve the annotation process, we want to choose the most appropriate tools to use inside a computer-assisted annotation platform. We therefore need evaluation of prediction programs with Arabidopsis sequences containing multiple genes. Results: We have developed AraSet, a data set of contigs of validated genes, enabling the evaluation of multi-gene models for the Arabidopsis genome. Besides conventional metrics to evaluate gene prediction at the site and the exon levels, new measures were introduced for the prediction at the protein sequence level as well as for the evaluation of gene models. This evaluation method is of general interest and could apply to any new gene prediction software and to any eukaryotic genome. The GeneMark.hmm program appears to be the most accurate software at all three level's for the Arabidopsis genomic sequences. Gene modeling could be further improved by combination of prediction software.

Discovering regulatory elements in non-coding sequence by analysis of space dyads

Article

Full-text available

May 2000
NUCLEIC ACIDS RES

The application of microarray and related technologies is currently generating a systematic catalog of the transcriptional response of any single gene to a multiplicity of experimental conditions. Clustering genes according to the similarity of their transcriptional response provides a direct hint to the regulons of the different transcription factors, many of which have still not been characterized. We have developed a new method for deciphering the mechanism underlying the common transcriptional response of a set of genes, i.e. discovering cis-acting regulatory elements from a set of unaligned upstream sequences. This method, called dyad analysis, is based on the observation that many regulatory sites consist of a pair of highly conserved trinucleotides, spaced by a non-conserved region of fixed width. The approach is to count the number of occurrences of each possible spaced pair of trinucleotides, and to assess its statistical significance. The method is highly efficient in the detection of sites bound by C6 Zn2 binuclear cluster proteins, as well as other transcription factors. In addition, we show that the dyad and single-word analyses are efficient for the detection of regulatory patterns in gene clusters from DNA chip experiments. In combination, these programs should provide a fast and efficient way to discover new regulatory sites for as yet unknown transcription factors.

Basic Local Aligment Search Tool

Article

Full-text available

Oct 1990

A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score. Recent mathematical results on the stochastic properties of MSP scores allow an analysis of the performance of this method as well as the statistical significance of alignments it generates. The basic algorithm is simple and robust; it can be implemented in a number of ways and applied in a variety of contexts including straightforward DNA and protein sequence database searches, motif searches, gene identification searches, and in the analysis of multiple regions of similarity in long DNA sequences. In addition to its flexibility and tractability to mathematical analysis, BLAST is an order of magnitude faster than existing sequence comparison tools of comparable sensitivity.

Mutation of either G box or I box sequences profoundly affects expression from the Arabidopsis rbcS-1A promoter

Article

Full-text available

Jul 1990

A deletion analysis of the Arabidopsis thaliana rbcS-1A promoter defined a 196 bp region (-320 to -125) sufficient to confer light-regulated expression on a heterologous Arabidopsis alcohol dehydrogenase (Adh) reporter gene in transgenic Nicotiana tabacum (tobacco) leaves. This region, which contains DNA sequences I, G and GT boxes, with homology to other ribulose-1,5-bisphosphate carboxylase small subunit (RBCS) gene promoter sequences, directed expression independent of orientation and relative position in the Adh promoter. Site-specific mutagenesis of these conserved sequences and subsequent expression analysis in transgenic tobacco showed that both G box and I box mutations in the context of the full (-1700 to +21) rbcS-1A promoter substantially reduced the expression of Adh and beta-glucuronidase (GUS) reporter genes. The G box has previously been shown to specifically bind in vitro a factor isolated from nuclear extracts of tomato and Arabidopsis. This factor (GBF) is distinct from the factor GT-1 which binds to adjacent GT boxes in the pea rbcS-3A promoter. Multiple mutations in putative Arabidopsis rbcS-1A promoter GT boxes had no pronounced affect on expression, possibly due to a redundancy of these sites. Experiments in which rbcS-1A promoter fragments were fused to truncated 35S CaMV (cauliflower mosaic virus) promoter--GUS reporter constructs showed that cis-acting CaMV promoter elements could partially restore expression to G-box-mutated rbcS-1A sequences.

Exploring the metabolic and genetic control of gene expression on a genomic scale

Article

Mar 1999
Chemtracts

DNA microarrays containing virtually every gene of Saccharomyces cerevisiae were used to carry out a comprehensive investigation of the temporal program of gene expression accompanying the metabolic shift from fermentation to respiration. The expression profiles observed for genes with known metabolic functions pointed to features of the metabolic reprogramming that occur during the diauxic shift, and the expression patterns of many previously uncharacterized genes provided clues to their possible functions. The same DNA microarrays were also used to identify genes whose expression was affected by deletion of the transcriptional co-repressor TUP1 or overexpression of the transcriptional activator YAP1. These results demonstrate the feasibility and utility of this approach to genomewide exploration of gene expression patterns.

Infering regulatory elements from a whole genome. An application to the analysis of genome of $\itemize{Helicobacter Pylori}$ $\sigma_{80}$ family of promoter signals

Article

Jan 2000

Eug??ne: An Eukaryotic Gene Finder That Combines Several Sources of Evidence

Conference Paper

Jan 2000
Lect Notes Comput Sci

In this paper, we describe the basis of EUGÈNE, a gene finder for eukaryotic organisms applied to Arabidopsis thaliana. The specificity of EuGène, compared to existing gene finding software, is that EUGÈNE has been designed to combine the output of several information sources, including output of other software or user information. To achieve this, a weighted directed acyclic graph (DAG) is built in such a way that a shortest feasible path in this graph represents the most likely gene structure of the underlying DNA sequence. The usual simple Bellman linear time shortest path algorithm for DAG has been replaced by a shortest path with constraints algorithm. The constraints express minimum length of introns or intergenic regions. The specificity of the constraints leads to an algorithm which is still linear both in time and space. EUGÈNE effectiveness has been assessed on Araset, a recent dataset of Arabidopsis thaliana sequences used to evaluate several existing gene finding software. It appears that, despite its simplicity, EUGÈNE gives results which compare very favourably to existing software. We try to analyse the reasons of these results.

A higher-order background model improves the detection of promoter regulatory elements by Gibbs sampling

Abstract and Figures

Recommended publications

Complete workflows for pesticide testing in tea and spinach

Reference guide: GC/MS/MS pesticide residue analysis

Protein sizing and quantification

eBooklet: Automated Electrophoresis for Nucleic Acid-Based Vaccine

Cloning and characterization of a cDNA encoding serine palmitoyltransferase in Arabidopsis thaliana

A comparative analysis of green fluorescent protein and β-glucuronidase protein-encoding genes as a...

Leveraging Hierarchical Population Structure in Discrete Association Studies

DYT1 Directly Regulates the Expression of TDF1 for Tapetum Development and Pollen Wall Formation in...