
An effective model for natural selection in promoters

March 2010
Genome Research 20(5):685-92


DOI:10.1101/gr.096719.109


Authors:

Michael M Hoffman

Princess Margaret Cancer Centre/University of Toronto

We have produced an evolutionary model for promoters, analogous to the commonly used synonymous/nonsynonymous mutation models for protein-coding sequences. Although our model, called Sunflower, relies on some simple assumptions, it captures enough of the biology of transcription factor action to show clear correlation with other biological features. Sunflower predicts a binding profile of transcription factors to DNA sequences, in which different factors compete for the same potential binding sites. The parametrized model simultaneously estimates a continuous measurement of binding occupancy across the genomic sequence for each factor. We can then introduce a localized mutation, rerun the binding model, and record the difference in binding profiles. A single mutation can alter interactions both upstream and downstream of its position due to potential overlapping binding sites, and our statistic captures this domino effect. Over evolutionary time, we observe a clear excess of low-scoring mutations fixed in promoters, consistent with most changes being neutral. However, this is not consistent across all promoters, and some promoters show more rapid divergence. This divergence often occurs in the presence of relatively constant protein-coding divergence. Interestingly, different classes of promoters show different sensitivity to mutations, with phosphorylation-related genes having promoters inherently more sensitive to mutations than immune genes. Although there have previously been a number of models attempting to handle transcription factor binding, Sunflower provides a richer biological model, incorporating weak binding sites and the possibility of competition. The results show the first clear correlations between such a model and evolutionary processes.


Resource


Michael M. Hoffman(1,2,3) and Ewan Birney(2,4)

EMBL–European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge CB10 1SD, United Kingdom;

Graduate School of Life Sciences, University of Cambridge, Cambridge CB2 1RX, United Kingdom


[Supplemental material is available online at http://www.genome.org. The Sunflower package and source code are available at http://www.ebi.ac.uk/~hoffman/software/sunflower/.]

Evolution is a fundamental force that has shaped all living organisms. By comparing the genomes of different species, and considering their similarities and differences through the lens of evolutionary theory, we can discover interesting aspects of biology and better understand their past development (C. elegans Sequencing Consortium 1998; Adams et al. 2000; Lander et al. 2001; Mouse Genome Sequencing Consortium 2002). To quantify selective pressure in protein-coding genes, many researchers have estimated the number of nonsynonymous substitutions (called d_N or K_A) and synonymous substitutions (called d_S or K_S), and then taken their ratio, described as d_N/d_S or ω (Nei and Kumar 2000). This has provided an invaluable model for characterizing the evolution of genes in relatively closely related species. Contrasting rates of evolution in classes of nucleotides with differing functional effects is also used in a variety of population genetics procedures, such as the McDonald–Kreitman test (McDonald and Kreitman 1991). Although this model crudely equates phenotypic change with amino acid sequence change, ignoring more complex effects, it has repeatedly shown its worth in classifying proteins and specific sites in proteins undergoing both positive (adaptive) selection and negative (purifying) selection (Nielsen 2001; Hurst 2002; Eyre-Walker 2006).

Because of its extensive use, the methodology for assessing relative nonsynonymous to synonymous rates has progressively improved over time. Salser et al. (1976) were the first to count synonymous and nonsynonymous differences between mammalian protein-coding nucleotide sequences, and others (Miyata and Yasunaga 1980; Perler et al. 1980; Li et al. 1985; Nei and Gojobori 1986) developed more robust methods to estimate the number of synonymous and nonsynonymous substitutions where multiple substitutions occurred in a single site. More recently, researchers increasingly use maximum likelihood methods to estimate these quantities, accounting for local variations in mutation rate according to various models of evolution (Goldman and Yang 1994). This framework has often been adapted by other researchers to investigate evolution of protein-coding sequence (Kosiol et al. 2007; Boyko et al. 2008). New extensions to the basic models, such as the sitewise likelihood ratio (Massingham and Goldman 2005), continue to expand the utility of this basic protein model.

In contrast, an analogous phenotypic change model has not existed for noncoding regions of the genome, including those regions that regulate transcription. Most researchers use straightforward measures to approximate change in these regions that lack a model of the variable susceptibility of different positions in transcription factor binding sites (TFBSs) to mutations (Wong and Nielsen 2004; Haygood et al. 2007). Although investigators have identified and commented on this variable susceptibility (Dermitzakis et al. 2003; Moses et al. 2003; Mustonen et al. 2008), a good model for the impact of variation on transcription factor binding that can be integrated into traditional d_N/d_S methods would be more useful. The lack of a more realistic phenotypic model is particularly frustrating as the protein-coding complement does not change significantly between mammalian species outside of olfaction and the immune system (and even less so between primates), leading many researchers to suggest that changes in regulation include many of the most important changes for positive selection in mammalian and primate evolution (King and Wilson 1975).

Here, we introduce a phenotypic model for the impact of change in promoter sequence. We were inspired by the success of transcription factor binding models that integrate over the complete range of binding affinities (Rajewsky et al. 2002; Granek and Clarke 2005; Foat et al. 2006; Sinha 2006; Roider et al. 2007; Manke et al. 2008) using a library of position weight matrices (PWMs). Additionally, Wasson and Hartemink (2009) published a similar model during the preparation of this manuscript. These models have shown their utility by providing robust models of Drosophila enhancers (Segal et al. 2008). This work differs from previous efforts to use multispecies conservation information to improve the identification of functional TFBSs (Moses et al. 2004; Ray et al. 2008), because we hold out evolutionary information from the TFBS identification process in order to avoid circularity in the subsequent estimation of evolutionary distances. The necessary modeling instead seeks to grade potential mutations for their impact on cis-regulation prior to analyzing information on the actual substitutions found in evolution, in a similar way to methods that determine potentially disruptive protein-coding substitutions, such as PolyPhen (Sunyaev et al. 2001).

Present address: Department of Genome Sciences, University of Washington, PO Box 355065, Seattle, WA 98195-5065, USA.

Corresponding author. E-mail birney@ebi.ac.uk.

Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.096719.109. Freely available online through the Genome Research Open Access option.

Genome Research 20:685–692 © 2010 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/10; www.genome.org

Quantifying phenotypic change with such a model suggests a corresponding measurement d_τ (by analogy to d_N and d_S) to quantify the putative change in transcriptional function. Although itself a crude approximation of the biochemical process we wish to model, this measurement shows the expected suppression of larger changes over evolutionary time. To correct for the local neutral rate of evolution, we combine d_τ with the protein-coding d_S using the ratio ψ = d_τ/d_S, which can distinguish different functional categories of genes with varying degrees of selection on their promoter regions. The ratio shows strong purifying selection on developmental process genes, as expected, but also shows potential positive selection or extensive relaxation of constraint in other functional classes, such as phospholipid biosynthetic process genes.

Results

We used a hidden Markov model (HMM) framework (Durbin et al. 1998) to provide a reasonable model of the competitive binding of an ensemble of transcription factors (TFs), assuming steric hindrance between factors competing for the same segment of DNA. The architecture of the model is shown in Figure 1, and because of its floral resemblance, we call the model Sunflower. Each TF forms a petal of nucleotide-emitting states, with each state parametrized from a column in a PWM, which may come from a public TF database such as JASPAR (Vlieghe et al. 2006) or TRANSFAC (Matys et al. 2006), or from high-throughput protein-binding microarray experiments (Mukherjee et al. 2004; Bulyk 2006). For the analysis presented here we used vertebrate JASPAR CORE PWMs, specifically those listed in Supplemental Table 1. A single unbound state represents parts of the DNA not bound by a factor, and it is parametrized using the base composition of the whole genome. The entry probability to the unbound state was arbitrarily set to 0.99, representing a postulated prior that the fraction of nucleotides bound to TFs is on the order of magnitude of 1%. The entry probability to each TF petal, roughly analogous to the cellular concentration of each factor, is set flat for all factors. This equally divides the remaining 0.01 probability for entry to a petal. Ideally, the model would summarize effects across all cell types, which precludes setting these values according to the concentrations of individual TFs under particular cellular conditions. Because we lack the knowledge necessary to integrate the expression levels of genes in every cell type over evolutionary history, we used this arbitrary flat prior.
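As a small worked illustration of the flat prior, the entry probabilities out of the silent state can be written as below. The symbols a_{0→U}, a_{0→j}, and M are notation introduced here for illustration, and M = 89 is an assumption taken from the number of single-PWM models mentioned later in the Results.

```latex
% U: unbound state; j: first state of the petal for factor j; M: number of PWMs
a_{0 \to U} = 0.99, \qquad
a_{0 \to j} = \frac{1 - 0.99}{M} = \frac{0.01}{M}, \quad j = 1, \dots, M
% assuming M = 89: a_{0 \to j} = 0.01 / 89 \approx 1.12 \times 10^{-4}
```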

The HMM forward–backward algorithm allows the efficient calculation of the marginal probability of each factor explaining each base, analogous to the base being bound by the factor. This means that for each base in the sequence, the algorithm calculates a vector of the marginal probabilities for each PWM column summarizing the combined behavior of the ensemble of TFs at that position. Although this model is admittedly simple, with no provision for different concentrations of factors or different potential cooperative modes between factors, it does maintain many useful known aspects of TF biology. In particular, it considers a continuous range of TF affinities for different genomic sites and steric effects between factors.
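The marginal probabilities described above can be illustrated with a minimal forward-backward sketch on a toy hub-and-petal HMM. This is not the Sunflower implementation: the function names, the background composition, the single three-column PWM, and the rule that a path must end in a state returning to the silent hub are all assumptions made for illustration.

```python
# Toy sketch of a Sunflower-like HMM: one silent hub, one unbound state, and
# one "petal" of emitting states per PWM. All numbers below are illustrative.
BACKGROUND = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}  # assumed base composition


def build_model(pwms, p_unbound=0.99, background=BACKGROUND):
    """Return parallel lists describing the emitting states of the model."""
    labels, emissions = ["unbound"], [background]
    entry = [p_unbound]   # probability of entering each state from the hub
    successor = [None]    # None: the state returns to the silent hub
    p_petal = (1.0 - p_unbound) / len(pwms)  # flat prior across factors
    for name, columns in pwms.items():
        for c, column in enumerate(columns):
            labels.append(f"{name}.{c}")
            emissions.append(column)
            entry.append(p_petal if c == 0 else 0.0)
            is_last = c == len(columns) - 1
            successor.append(None if is_last else len(labels))
    return labels, emissions, entry, successor


def normalize(values):
    total = sum(values)
    return [v / total for v in values]


def posteriors(seq, model):
    """Forward-backward: P(state at position i | whole sequence) for every i."""
    labels, emissions, entry, successor = model
    n = len(labels)
    # Forward pass, rescaled at each position to avoid underflow.
    forward = []
    for i, base in enumerate(seq):
        if i == 0:
            arriving = list(entry)
        else:
            prev = forward[-1]
            at_hub = sum(p for p, nxt in zip(prev, successor) if nxt is None)
            arriving = [at_hub * e for e in entry]
            for s, nxt in enumerate(successor):
                if nxt is not None:
                    arriving[nxt] += prev[s]
        forward.append(normalize([arriving[s] * emissions[s][base] for s in range(n)]))
    # Backward pass; assumption: a path must finish in a hub-returning state.
    backward = [None] * len(seq)
    backward[-1] = [1.0 if nxt is None else 0.0 for nxt in successor]
    for i in range(len(seq) - 2, -1, -1):
        nxt_base, nxt_b = seq[i + 1], backward[i + 1]
        via_hub = sum(entry[t] * emissions[t][nxt_base] * nxt_b[t] for t in range(n))
        backward[i] = normalize([
            via_hub if successor[s] is None
            else emissions[successor[s]][nxt_base] * nxt_b[successor[s]]
            for s in range(n)
        ])
    return [normalize([f * b for f, b in zip(fr, br)])
            for fr, br in zip(forward, backward)]


pwms = {"A": [  # hypothetical three-column PWM preferring "ACG"
    {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},
    {"A": 0.01, "C": 0.97, "G": 0.01, "T": 0.01},
    {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},
]}
model = build_model(pwms)
profile = posteriors("TTACGTT", model)  # one probability vector per base
```

Each row of `profile` sums to 1 and plays the role of the per-base marginal probability vector. Because every petal is a fixed chain of states, the probability of occupying the first column at one position equals that of the second column at the next position, which is how a bound site shows up as a contiguous block.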

In this simulation it is possible for a single mutation to affect a longer chain of binding sites due to changes in steric overlap. An illustration of this domino effect is shown in Figure 2, where a single mutation changes the predicted binding not only at NR1H2-RXR, PPARG-RXRA, and T binding sites directly overlapping the mutation, but also at the predicted nearby NR3C1, REL, Roaz, SP1, and Spz1 binding sites, leading to a complete reorganization of the predicted binding occupancy on this promoter.

In order to investigate the importance of the domino effect, we compared probabilities estimated with this joint model with probabilities estimated with 89 similar models where we included only one PWM at a time. We defined proximal promoter sequences as 1400 bp around 17,600 transcription start sites (TSSs) in the human genome. We took the probability distribution inferred using each single-motif model at each proximal promoter sequence position, and the probability distribution generated from the cognate portion of the joint model (see Methods). The median relative entropy per nucleotide calculated between these two distributions for each model is 0.6 bits, which means the joint model provides a large amount of additional information over a 1400-bp promoter.
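For reference, the relative entropy between two discrete distributions p and q over the same states, measured in bits, has the standard form below. Which of the two models plays the role of p is an assumption here, since the exact construction is given in the Methods. The second line is only a rough reading of the scale involved, assuming a typical position contributes the median value.

```latex
D(p \,\|\, q) = \sum_{k} p_k \log_2 \frac{p_k}{q_k}
% rough scale: 0.6 \text{ bits/nt} \times 1400 \text{ nt} = 840 \text{ bits per promoter}
```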

Figure 1. Toy example schematic of a Sunflower model for TFs. (Circle) Silent state, (squares) emitting states, (arcs) transitions between states with nonzero probability. The transition probability is either designated by a label, or is 1 in the case of unlabeled arcs. (Squares with ellipses) Arbitrary number of sequential states. This toy example includes TFs A, B, C, and D, each one with a petal of emitting states, labeled such that D.0 corresponds to the first column of the D PWM, and D.n the last column. The arc from empty space indicates the initial state of the model.

To examine the impact of a potential mutation in the joint model, we introduce it into the sequence and recalculate the marginal probability vector for the mutated sequence at every position, not just the mutated position. We then add together the relative entropies (Durbin et al. 1998) for each pair of marginal probability vectors (one from the reference and one from the mutated sequence). We refer to the sum as the binding shift of the mutation and denote it by the symbol τ (see Methods).
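The summation just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the logarithm is taken base 2 (bits), and the direction D(reference || mutated) is assumed because the text here does not specify it.

```python
import math


def relative_entropy_bits(p, q, floor=1e-12):
    """D(p || q) in bits; q is floored to avoid division by zero (an assumption)."""
    return sum(pi * math.log2(pi / max(qi, floor)) for pi, qi in zip(p, q) if pi > 0)


def binding_shift(reference_profile, mutated_profile):
    """Sum the per-position relative entropies between two binding profiles.

    Each profile is a list with one probability vector per sequence position,
    with states in the same order in both profiles.
    """
    return sum(
        relative_entropy_bits(ref, mut)
        for ref, mut in zip(reference_profile, mutated_profile)
    )


# Hypothetical two-position, two-state profiles (bound vs. unbound).
reference = [[0.9, 0.1], [0.5, 0.5]]
mutated = [[0.5, 0.5], [0.5, 0.5]]
tau = binding_shift(reference, mutated)
```

Only the first position differs between the two toy profiles, so it alone contributes to `tau`. In a real profile, a mutation can change vectors at many neighboring positions, and all of those contributions are added.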

To explore the properties of the new τ measurement, we exhaustively simulated every possible point mutation in the human promoters (4200 changes per promoter, 73,920,000 overall). We then compared the human sequences with aligned sequences in the dog genome, chosen because it was distantly related enough for many neutral changes to occur, yet close enough that the effects of selection on cis-regulation would still be observable. We separated the changes observed in dog (4,069,878, 8% of the mutations at a human position aligned to a dog nucleotide) from those not observed. Figure 3 shows the mean τ for both changes observed and unobserved in dog, averaged at each TSS-relative position.
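The mutation counts quoted above follow from three alternative bases at each of 1400 positions. The short sketch below checks that arithmetic and shows one way to enumerate the substitutions; the generator name is hypothetical, and it assumes the sequence contains only A, C, G, and T.

```python
BASES = "ACGT"


def point_mutations(sequence):
    """Yield (position, reference_base, alternative_base) for every substitution."""
    for position, ref in enumerate(sequence):
        for alt in BASES:
            if alt != ref:
                yield position, ref, alt


PROMOTER_LENGTH = 1400   # bp per proximal promoter, from the text
N_PROMOTERS = 17_600     # promoters analyzed, from the text

per_promoter = PROMOTER_LENGTH * (len(BASES) - 1)   # 3 alternatives per base
overall = per_promoter * N_PROMOTERS

toy = list(point_mutations("ACGT"))  # 4 positions x 3 alternatives
```

Here `per_promoter` evaluates to 4200 and `overall` to 73,920,000, matching the figures in the text.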

Overall, τ rises steadily as mutations approach the TSS, as expected from the increase in density of TF binding sites. More importantly, there is a strong separation over the TSS of the observed from the unobserved mutations, leading to consistently higher τ values in the unobserved portion. Both the overall shape of this plot and its consistency with the prediction that higher τ mutations are less favored by the predominantly selectively neutral changes accepted over evolutionary time suggest that this measurement models something that correlates with evolutionary acceptance of mutations near TSSs.

For confirmation, we repeated this analysis on mouse–rat aligned proximal promoters and found similar results (Supplemental Fig. 1). We found different results when looking at human–dog aligned regions (Supplemental Fig. 2) with enhancer activity validated in transgenic mice (Pennacchio et al. 2006), or at human–dog aligned ancestral repeats (Supplemental Fig. 3; Paten et al. 2008). The relatively flat binding shift lines in these results lead us to conclude that with the input PWMs used, this model will primarily detect signatures of selection in proximal promoter regions rather than enhancer regions or negative control ancestral repeat regions.

The τ measurement provides an approximation to the binding occupancy change of a mutation, which is the simplest predictable phenotypic change in a promoter, much like the number of changed residues in a protein is the simplest measurement of phenotypic change in a protein. We also sum up the total potential change of a promoter, considering every possible mutation, and call this T. Interestingly, different classes of genes, as determined by Gene Ontology (GO) (Gene Ontology Consortium 2006) annotations, show varying levels of this inherent propensity to change (see Supplemental Tables 1–6). Genes involved in developmental processes are expected to have complex, finely tuned promoters, and therefore are expected to have high T. Somewhat more unexpected in high-T genes are those involved in phosphorylation and the cell cycle. Interestingly, these GO terms are also excluded from copy number variant (CNV) regions (Redon et al. 2006).

In order to examine how actual changes affect the binding profile, we can sum up only those values of τ that correspond to observed substitutions. To control for different inherent propensities to change, we divide by the potential total binding shift T, and then transform this proportion using the Jukes–Cantor model (Jukes and Cantor 1964) to correct for multiple substitutions along an evolutionary lineage (see Methods). This results in a transcriptional distance measurement d_τ.
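The correction step can be sketched with the standard Jukes–Cantor distance formula, d = -(3/4) ln(1 - (4/3) p). How the proportion p is normalized in Sunflower is defined in the Methods and is not reproduced here, so the helper below (hypothetical names) simply takes the summed observed binding shifts divided by T, as the paragraph describes.

```python
import math


def jukes_cantor(p):
    """Standard Jukes-Cantor correction of an observed proportion p of change."""
    if not 0.0 <= p < 0.75:
        raise ValueError("Jukes-Cantor is defined only for 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


def transcriptional_distance(observed_shifts, total_shift):
    """Assumed form: correct the fraction of potential binding shift observed."""
    proportion = sum(observed_shifts) / total_shift
    return jukes_cantor(proportion)


# Hypothetical numbers: three observed substitutions out of a total T of 1000.
d_tau = transcriptional_distance([40.0, 35.0, 25.0], 1000.0)
```

With these toy values the proportion is 0.1 and the corrected distance is slightly larger, about 0.107, reflecting the allowance for multiple changes at the same site.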

Figure 2. Changes in predicted binding profile for a guanine-to-thymine mutation at position +29 of ENST00000373379, a transcript of UPRT, uracil phospho-ribosyltransferase. (Green lines) Probability of eight individual TFs binding to the reference sequence, or the probability that a region is unbound (upper right panel). The names above the TF binding panels refer to JASPAR PWM names, and the corresponding Human Genome Organization Nomenclature Committee (HGNC) symbols are contained in Supplemental Table 1. (Orange lines) Probability that a TF binds the mutated sequence. These displayed changes, when added to smaller changes for other TFs, represent a binding shift τ of 124.8 (see Results and Methods).

Figure 3. Aggregation plot of the binding shifts of 17,600 human genes, averaged within two groups: one where the simulated mutation was observed in dog (green circles, solid line), and one where it was unobserved (orange crosses, dashed line). (A) Local regressions for ±700 bp around the TSS, estimated with the loess (Cleveland and Devlin 1988) function in R (R Development Core Team 2007), with second-degree polynomials and α = 0.1. Shaded regions in this plot are magnified as separate panels below to show mean binding shifts at individual positions proximal to (B) and more distal from (C) the TSS.

We developed an evolutionary measurement, which we call ψ, by analogy to the protein-coding ω parameter for the nonsynonymous-to-synonymous substitution rate ratio. For ψ, we wish to control both for the inherent binding shift mutability and for the local mutation rate, so we take d_τ and divide it by the local neutral mutation rate d_S, analogously to d_N/d_S. The measurement ψ = d_τ/d_S therefore summarizes our approximation of the binding occupancy change in a promoter due to mutations, normalizing for both local mutation rate and inherent mutability of a promoter.
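Written side by side, the protein-coding ratio and the transcriptional ratio share the same denominator. The subscript in d_τ is the label assumed in this text for the transcriptional distance.

```latex
\omega = \frac{d_N}{d_S}, \qquad \psi = \frac{d_\tau}{d_S}
```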

Values of human–dog ψ are not strongly correlated to the local mutation rate, measured either using synonymous coding sites (Supplemental Fig. 6; r = 0.51; P < 2.2 × 10^-16) or at introns (Hoffman and Birney 2007) (Supplemental Fig. 7; r = 0.24; P < 2.2 × 10^-16). Neither is it correlated to the raw mutability (T) of each promoter (Supplemental Fig. 8; r = 0.20; P < 2.2 × 10^-16). This suggests that ψ captures an aspect of biology independent of these quantities, such as selection on promoters, just as ω does for coding sequence. While others have identified purifying selection adjacent to the TSS (Taylor et al. 2006), we can identify a potential mechanism for this selection.

Considering classes of genes with high or low amounts of selective pressure on promoters provides interesting insights into biology. Focusing first on cellular components, it has long been known that plasma membrane and extracellular compartments show strong enrichment for high values of the protein-coding ω. The transcriptional ψ, however, shows an almost perfect contrast to this, with these compartments showing striking enrichment for low ψ values (Fig. 4). Turning to more specific functional categories, Figure 5 shows a scatter plot of median ψ against median ω for biological process and molecular function GO terms with at least 10 genes annotated. It is clear that ψ and ω are not strongly correlated for functional classes of genes (r = 0.081; P = 9.87 × 10^-12), nor are they correlated on a gene by gene basis (Supplemental Fig. 4; r = 0.10; P < 2.2 × 10^-16). More importantly, functional classes enriched for high ω are rarely enriched for high ψ, and vice versa. This implies that positive selection amongst genes associated with a GO term predominantly works in a single modality. In contrast, many of the categories that show negative selection in both the transcriptional and protein-coding measurements, having low ω and low ψ, agree with perceptions of transcriptional complexity, with terms such as sensory organ development (low ψ: P = 7 × 10^-6, q < 1 × 10^-4; low ω: P = 2 × 10^-5, q < 1 × 10^-4) and transcription factor activity (low ψ: P = 5 × 10^-38, q < 1 × 10^-4; low ω: P = 4 × 10^-5, q = 6 × 10^-4) enriched in both modalities. As expected, there are genes showing evidence of strong transcriptional negative selection with no striking shift in protein selection, such as those associated with signal transduction (low ψ: P = 8 × 10^-17, q < 1 × 10^-4; low ω: P = 0.5, q = 1), cell adhesion (low ψ: P = 8 × 10^-10, q < 1 × 10^-4; low ω: P = 1, q = 1), and cell migration (low ψ: P = 3 × 10^-5, q = 2 × 10^-4; low ω: P = 0.1, q = 1). Finally, gene classes enriched for more positive transcriptional selection (high ψ) without striking changes in protein evolution include phospholipid biosynthetic process genes (high ψ: P = 2 × 10^(…), q = 3 × 10^-4; high ω: P = 0.6, q = 1) such as CEPT1 (ψ = 2.24; ω = 0.04), and DNA repair genes (high ψ: P = 3 × 10^-10, q < 1 × 10^-4; high ω: P = 0.006, q = 0.07) such as UBE2B (ψ = 2.53; ω = 0.002).

Discussion

We have developed, assessed, and used a new series of measurements that aim to capture the effect of DNA sequence change on transcriptional regulation. Although our model crudely approximates the known complexity of this process and does not include more poorly understood processes such as TFBS turnover (Dermitzakis and Clark 2002), it is not obviously less sophisticated than the d_N/d_S measurement commonly and successfully used to study protein-coding evolution. An important component to the Sunflower model is that it penalizes the creation of motifs overlapping with existing motifs. The aggregate evolutionary signature of this measurement shows an expected suppression of highly perturbing mutations in both human–dog and mouse–rat promoter comparisons. In contrast, ancestral mammalian repeats, thought to be predominantly neutral, show no difference in predicted impact between observed and unobserved mutations. The functional processes that have transcriptional sensitivity agree with preconceptions derived from our understanding of cellular and molecular biology.

Figure 4. Box plot of ψ = d_τ/d_S values arranged by GO cellular component term, for each term associated with significantly high (above dividing line) or low (below dividing line) ψ values, as determined by the Wilcoxon rank sum test (P < 1 × 10^-4) performed by FUNC (see Methods). The vertical bar in each box indicates the median ψ, the extents of each box the first and third quartiles of ψ, and the whiskers extend to the furthest data point that is no more than 1.5 times the interquartile range from the nearest quartile. High outliers are used in calculating statistics, but are omitted from the display for clarity.

Figure 5. Scatter plot of median ψ = d_τ/d_S versus median ω = d_N/d_S for the genes in 1402 GO terms. Only terms that are annotated on at least 15 genes are shown. The term has a significantly high or low value of ψ (blue), ω (red), both measurements (yellow), or neither measurement (gray), as determined by FDR threshold q < 0.05. Labels indicate the terms with the 10 highest and 10 lowest ψ values.

While PWM methods are often used to predict TF occupancy, we cannot be certain that such methods accurately estimate TF binding or transcriptional output. One major limitation of our technique is that it relies on the assumption that the PWM-based method it uses will be accurate much of the time. Another limitation is the lack of a complete set of PWMs for TFs. The development of a number of high-throughput methods for quantifying in vitro binding preferences (Mukherjee et al. 2004) will provide a larger set of matrices over time, and the integration with other methods (Ren et al. 2000; Hudson and Snyder 2006; Robertson et al. 2007; Wang et al. 2007; Jothi et al. 2008) will likely drive the library of accessible matrices closer to completion. It is interesting to note that the set of distal enhancers did not show the same separation of observed versus unobserved changes. The lack of enhancer-specific factors may well be the explanation for this result, as JASPAR’s contents have a bias toward promoter-associated TFs.

A more complex problem is how to set the entry probabilities to each petal. We have chosen to take a uniform prior as a way to handle the large diversity of cell types that vary in expression values. Potentially, one could consider integrating this signal over a variety of relative expression levels of the transcription factors at the expense of a more computationally expensive procedure. Related to this problem is the issue of redundancy in PWMs. The JASPAR database provides a curated PWM set with some efforts made toward eliminating redundancy. The JASPAR PWMs used in this study (Supplemental Table 1) include a pair of PWMs representing different motifs recognized by one protein (ZNF42_1-4, ZNF42_5-13), two motifs recognized by dimers that include one overlapping protein (HAND1-TCF3, TAL1-TCF3), and two pairs of PWMs from the same protein recognizing similar motifs but with data from different sources (NFKB, NFKB1; RORA, RORA1). In reality, some redundancy is acceptable because there are likely to be some transcription factors with similar motifs in vivo. Increasing the number of TFs considered by the model will increase the redundancy in motifs yet still more accurately model the actual processes in living cells.

It is feasible to imagine more complex impact models than the one presented here, such as considering compensatory creation of new binding sites in a TFBS turnover model, at the conceptual and computational expense that comes with more complicated models. This would be analogous to integrating structural adjacency of amino acids for protein selection. It is interesting to note that, probably due to the complexity of a more advanced model, the simpler site-wise model in protein sequences has remained the predominant evolutionary model.

The ψ measurement generated using pairwise alignments shows a weak correlation between orthologs in different clades (Supplemental Fig. 5; r = 0.33; P < 2.2 × 10^-16). This means that this measure of selection is consistent between clades, at least within mammals, although obviously it will not have the same consistency as ω measurements generated from a maximum likelihood method on a single multiple alignment. It would be possible to use the Sunflower method to find pairwise d_τ values for multiple species pairs from a single alignment, but effective use of multispecies alignments would require integration of the Sunflower change model during the sampling of potential ancestral sequences in the tree. This is an interesting approach that requires both more theoretical and practical work. Similar to the research arising from the d_N/d_S model, the pairwise model presented here would be the starting point for that work.

The protein-coding ω measurement has the property that neutral changes are predicted (and observed) to be around ω = 1, while the ψ measurement does not come with such a principle for its interpretation. Much of the use of ω, however, consists in averaging values over several genes, where identifying deviations from the bulk distribution (as performed in this analysis) is the primary mode of analyzing gene sets. In contrast, the ψ measurement lends itself more naturally to the joint analysis of multiple changes, such as those found on haplotypes. As genome-wide association studies implicate haplotypes, and extensive resequencing (Kaiser 2008) will provide a complete set of changes on nearly all common haplotypes, a haplotype-level analysis of functional changes will become a more important form of analysis. Integration of these mechanistic models with expression quantitative trait locus studies (Veyrieras et al. 2008) in the context of complete sequencing will provide an interesting comparison.

Many of the associations of ψ were expected, such as the suppression of promoter changes in signal transduction and developmental genes. The bulk suppression of ψ in genes associated with extracellular components and the plasma membrane is more puzzling, in particular given the striking signals of positive selection in these proteins (Kim et al. 2007). One possible explanation is that inappropriate expression of many extracellular proteins may have a far more deleterious effect, given their potential to interact with other components outside of the cell. More generally, ψ is not strongly correlated with the protein-coding ω, showing a very different behavior of transcriptional selection compared with protein-coding selection.

In this work, we focused on an evolutionary analysis of intramammalian substitutions, although one could apply the same framework both to other clades and to other mutational processes, such as natural polymorphisms and somatic changes discovered in cancer. In the latter two cases the low rate of change will make gaining statistical power hard, just as analyzing protein changes also requires extensive aggregation of signals (Stratton et al. 2009). With the large number of sequenced genomes appropriate for this analysis and the aggressive generation of polymorphism and somatic mutation data sets, Sunflower provides a key additional tool in the interpretation of genomic sequence differences.

Methods

Posterior inference

Sunflower does posterior inference using an algorithm we call Sunflower-Reference. The algorithm calculates the posterior probability $P_{k,i} = P(x_i \mid k)$ that a particular nucleotide $x_i$ was emitted by a given state $k$ in the Sunflower model. The results are the same as the standard Forward-Backward algorithm when the silent state is the start state ($k_{\text{silent}} = 0$).
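As an illustration of this posterior calculation, the following minimal Python sketch runs a textbook Forward-Backward pass on a toy two-state model. It is not the Sunflower-Reference implementation: the function name `forward_backward`, the parameters `INIT`, `TRANS`, and `EMIT`, and all of their values are assumptions made for this example, and the silent state of the real model is omitted.

```python
def forward_backward(init, trans, emit, seq):
    """Return post[i][k], the posterior probability of state k at position i."""
    n, m = len(seq), len(init)
    # Forward pass: fwd[i][k] = P(x_1..x_i, state at i is k).
    fwd = [[0.0] * m for _ in range(n)]
    for k in range(m):
        fwd[0][k] = init[k] * emit[k][seq[0]]
    for i in range(1, n):
        for k in range(m):
            reach = sum(fwd[i - 1][j] * trans[j][k] for j in range(m))
            fwd[i][k] = reach * emit[k][seq[i]]
    # Backward pass: bwd[i][k] = P(x_{i+1}..x_n | state at i is k).
    bwd = [[1.0] * m for _ in range(n)]
    for i in range(n - 2, -1, -1):
        for k in range(m):
            bwd[i][k] = sum(trans[k][j] * emit[j][seq[i + 1]] * bwd[i + 1][j]
                            for j in range(m))
    likelihood = sum(fwd[n - 1])
    return [[fwd[i][k] * bwd[i][k] / likelihood for k in range(m)]
            for i in range(n)]


# Hypothetical two-state model: state 0 is background, state 1 prefers G/C.
INIT = [0.9, 0.1]
TRANS = [[0.9, 0.1], [0.5, 0.5]]
EMIT = [{"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
        {"A": 0.05, "C": 0.45, "G": 0.45, "T": 0.05}]
post = forward_backward(INIT, TRANS, EMIT, "ATGCGCAT")
```

Each row of `post` is a probability distribution over states at one position. The sketch works with raw probabilities, so a practical version for long promoters would need scaling or log-space arithmetic to avoid underflow.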

Posterior inference is equivalent to tracing all of the pathways through this model that can emit a single sequence, and estimating the posterior probability that the model is in each of the states at each position of that sequence. Underlying the model is the physical mechanism that transcription factors are continuously binding and leaving chromosomal sequences, at a rate related to their affinity for the sequence. The statistical mechanics of the biophysical model are approximated by the probabilities that a transcription factor is bound in the sequence model. Indeed, PWMs, which are frequently thought of as purely probabilistic concepts, were originally proposed as part of a statistical mechanics model (Berg and von Hippel 1987).

The new parameters in Sunflower-Reference allow two optimizations. The first is the use of connection set vectors $c$ and $c$, which contain information about which states are connected to which other states, relieving the algorithm from the necessity in each round of doing calculations involving transition probabilities of zero. The other optimization is that one can specify a calculation starting position $i$ to indicate that the intermediate forward matrix $F$ and backward matrix $B$ have already been partially calculated, such that recalculation is only necessary in the forward direction for values $> i$, and in the reverse direction for values $< i$.
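The second optimization can be sketched for the forward matrix alone. The Python below is a hedged illustration, not Sunflower code: `forward`, its `prev` and `start` parameters, and the toy model values are all hypothetical names chosen for this example. It uses zero-based positions and recomputes the column at the changed position itself as well as every later column, reusing the earlier columns unchanged.

```python
def forward(init, trans, emit, seq, prev=None, start=0):
    """Forward matrix fwd[i][k]; columns before `start` are copied from `prev`."""
    n, m = len(seq), len(init)
    if prev is None:
        fwd = [[0.0] * m for _ in range(n)]
    else:
        fwd = [list(column) for column in prev]
    for i in range(start, n):
        for k in range(m):
            if i == 0:
                reach = init[k]
            else:
                reach = sum(fwd[i - 1][j] * trans[j][k] for j in range(m))
            fwd[i][k] = reach * emit[k][seq[i]]
    return fwd


# Hypothetical two-state model (background, G/C-preferring site).
INIT = [0.9, 0.1]
TRANS = [[0.9, 0.1], [0.5, 0.5]]
EMIT = [{"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
        {"A": 0.05, "C": 0.45, "G": 0.45, "T": 0.05}]

reference = "ATGCGCAT"
mutant = reference[:5] + "A" + reference[6:]      # substitution at index 5
fwd_reference = forward(INIT, TRANS, EMIT, reference)
fwd_partial = forward(INIT, TRANS, EMIT, mutant, prev=fwd_reference, start=5)
fwd_full = forward(INIT, TRANS, EMIT, mutant)
```

Here `fwd_partial` agrees with `fwd_full` while skipping the first five columns. The backward matrix can be handled symmetrically, recomputing only the columns at and before the changed position.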

We wrote Sunflower in the Python language (van Rossum 2006), with inner loops in the C language (Kernighan and Ritchie 1988) for speed.

Comparing joint and single-motif models

We compared the joint model used in the rest of this work with single-motif control models for each motif $m$ to investigate the importance of the domino effect. After performing posterior inference on all of the models, we derived a two-state probability distribution for each motif from the joint model at each position by taking $P(x_i \mid m)$ and $1 - P(x_i \mid m)$. We then compared each probability distribution generated from the joint model $P_{\text{joint}}$ with the equivalent probability distributions from the single-motif model $P_{\text{single}}$ by taking the relative entropy $H(P_{\text{joint}} \,\|\, P_{\text{single}})$.
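The comparison can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, the logarithm base (bits) is a choice made here, and summing the position-wise relative entropies over a motif is one plausible way to aggregate them. The second distribution is assumed to be nonzero wherever the first is.

```python
import math


def relative_entropy(p, q):
    """H(p || q) in bits for two discrete distributions over the same outcomes."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def two_state(prob_bound):
    """Two-state distribution (motif bound, motif not bound) at one position."""
    return [prob_bound, 1.0 - prob_bound]


def motif_divergence(joint, single):
    """Sum of position-wise H(P_joint || P_single) for one motif."""
    return sum(relative_entropy(two_state(pj), two_state(ps))
               for pj, ps in zip(joint, single))


# Hypothetical per-position binding probabilities for one motif.
joint_profile = [0.10, 0.50, 0.30]
single_profile = [0.10, 0.25, 0.30]
divergence = motif_divergence(joint_profile, single_profile)
```

Only the middle position differs between the two hypothetical profiles, so it is the only one that contributes to `divergence`.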

The binding shift $\tau$

The algorithm used to investigate the effects of mutations can be described simply. First, run the Sunflower-Reference algorithm with a Sunflower model and a nucleic acid sequence to get the posterior probability matrix $P$. Then use the Sunflower-Mutate algorithm to calculate the relative entropy $H(P \,\|\, P') = \tau$ for each position $i$ and each nucleotide $a \in \mathcal{A} = \{\mathrm{A}, \mathrm{C}, \mathrm{G}, \mathrm{T}\}$:

Sunflower-Mutate(A, E, X = (x_1 ... x_n), F = (f_{k,x})_{m×n}, B, P)
    X′ = (x′_1 ... x′_n) ← X
    F′ = (f′_{k,x})_{m×n} ← F
    for i ← 1 to n
        do for each a in 𝒜
               do if a = x_i
                     then τ_{i−1,a} ← 0.0
                     else x′_i ← a
                          P′ ← Sunflower-Reference(A, E, X′, F′, B, i)
                          τ_{i−1,a} ← H(P ‖ P′)
           x′_i ← x_i
           f′_{i−1} ← f_{i−1}
    return T = (τ_{i,x})_{n×|𝒜|}

This algorithm includes a significant optimization over the naive implementation, because it uses the three extra arguments in Sunflower-Reference to avoid rerunning the whole Forward-Backward algorithm each time. Only those columns $j$ of the forward matrix where $j \ge i$ and the backward matrix where $j \le i$ are recalculated, as the left and right partitions of these two matrices, respectively, would have the same value as when calculated from the reference sequence.
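For comparison, the naive implementation mentioned above can be sketched in Python as follows. This is a deliberately unoptimized illustration that reruns a full Forward-Backward pass for every substitution. The function names, the toy two-state model, and the definition of the matrix relative entropy as a sum of position-wise relative entropies (in nats) are assumptions made for this example, not details confirmed by the text.

```python
import math

ALPHABET = "ACGT"


def posterior(init, trans, emit, seq):
    """Forward-Backward posterior post[i][k] for a small HMM."""
    n, m = len(seq), len(init)
    fwd = [[init[k] * emit[k][seq[0]] for k in range(m)]]
    for i in range(1, n):
        fwd.append([sum(fwd[i - 1][j] * trans[j][k] for j in range(m))
                    * emit[k][seq[i]] for k in range(m)])
    bwd = [[1.0] * m for _ in range(n)]
    for i in range(n - 2, -1, -1):
        bwd[i] = [sum(trans[k][j] * emit[j][seq[i + 1]] * bwd[i + 1][j]
                      for j in range(m)) for k in range(m)]
    likelihood = sum(fwd[-1])
    return [[f * b / likelihood for f, b in zip(fwd[i], bwd[i])]
            for i in range(n)]


def matrix_relative_entropy(p, q):
    """Assumed form of H(P || P'): position-wise relative entropies, summed."""
    return sum(pk * math.log(pk / qk)
               for p_row, q_row in zip(p, q)
               for pk, qk in zip(p_row, q_row) if pk > 0.0)


def mutation_scan(init, trans, emit, seq):
    """Return tau[i][a], the binding shift for substituting a at position i."""
    reference = posterior(init, trans, emit, seq)
    tau = []
    for i, x in enumerate(seq):
        row = {}
        for a in ALPHABET:
            if a == x:
                row[a] = 0.0                      # no change, no shift
            else:
                mutant = seq[:i] + a + seq[i + 1:]
                row[a] = matrix_relative_entropy(
                    reference, posterior(init, trans, emit, mutant))
        tau.append(row)
    return tau


# Hypothetical two-state model (background, G/C-preferring site).
INIT = [0.9, 0.1]
TRANS = [[0.9, 0.1], [0.5, 0.5]]
EMIT = [{"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
        {"A": 0.05, "C": 0.45, "G": 0.45, "T": 0.05}]
tau = mutation_scan(INIT, TRANS, EMIT, "ATGCGCAT")
```

The naive scan performs 3n full inference passes for a sequence of length n, which is the cost the partial recalculation is designed to reduce.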

Sunflower avoids the binary classification of binding and thresholds commonly used in TFBS finders, as they are not essential to the biology of transcription factor binding (Roider et al. 2007), and uses a probabilistic model instead. The result of Sunflower-Reference is a two-dimensional matrix of the posterior probabilities defined at each position for each PWM column of all TFs in the input set. These values specify how likely it is that a particular TF binds to a particular string of positions.

The promoter distance $d_T$

One can think of the binding shift measurement $\tau$ introduced above as a measurement of the synonymity of a particular nucleotide. To get a measurement of the potential disruption in TF binding for a gene, $T$, similar to the total number of nonsynonymous nucleotides, $N$, one first must select a region of interest. We limit our inspection to only those nucleotides we are most sure have an effect on transcripts by selecting the region $[-100, +100)$ relative to the TSS. These are the nucleotides where $\tau$ is highest on average. If the value of $T$ is used for further comparisons to an aligned sequence, then we exclude positions that do not align. We use $\mathcal{P}$ to refer to the set of included positions in the region of interest.

Inspired by the logic used by Nei and Gojobori (1986) to assign a fractional synonymity to protein-coding nucleotides that are only partially degenerate, we consider the average binding shift from the reference nucleotide to all other possibilities as a measurement of the potential disruption for that nucleotide. Summing the values for all these nucleotides, and dividing by 3, the number of different possible substitutions, we get

$$T = \frac{1}{3} \sum_{i \in \mathcal{P}} \sum_{a \in \mathcal{A}} \tau_{i,a}.$$

To compare a human promoter with sequence $X = (x_1 \ldots x_n)$ with the promoter in a related species, we limit to only those alignments of upstream regions of Ensembl orthologs with fewer than 25% gap columns. We call the sequence in the other species $Y = (y_1 \ldots y_n)$, and the two sequences align at each position $i$. With $Y$, we can define the amount of observed binding profile disruption

$$T_d = \sum_{i \in \mathcal{P}} \tau_{i,y_i}.$$

Since $\tau_{i,y_i} = 0$ whenever $x_i = y_i$, the summand is nonzero only at positions where the two sequences differ. While $\tau_{i,y_i}$ may be larger than the average $\tau$ for any given position, this is unlikely to be true across the whole gene.

Using $T$ and $T_d$, we can calculate a proportion of binding profile disruption

$$p_T = \frac{T_d}{T}$$

analogous to $p_N$ and $p_S$ (Nei and Gojobori 1986). We use the Jukes–Cantor equation (Jukes and Cantor 1964), which converts a proportion such as this into a distance measurement

$$d_T = -\frac{3}{4} \ln\left(1 - \frac{4}{3} p_T\right).$$
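The chain from binding shifts to promoter distance can be followed in a short worked example. The Python sketch below assumes the binding shifts are already available as one dictionary per included position, with a zero entry for the reference nucleotide. The function name `promoter_distance` and the numerical values are hypothetical and serve only to illustrate the arithmetic.

```python
import math


def promoter_distance(tau, other):
    """Return (T, T_d, p_T, d_T) for the included, gap-free positions.

    tau[i][a] is the binding shift for substituting nucleotide a at
    position i; other[i] is the aligned nucleotide in the second species.
    """
    # Potential disruption: average over the three possible substitutions.
    potential = sum(sum(row.values()) for row in tau) / 3.0
    # Observed disruption: shift toward the nucleotide actually observed.
    observed = sum(row[y] for row, y in zip(tau, other))
    proportion = observed / potential
    # Jukes-Cantor correction; only defined while proportion < 3/4.
    distance = -0.75 * math.log(1.0 - 4.0 * proportion / 3.0)
    return potential, observed, proportion, distance


# Hypothetical binding shifts for a three-nucleotide reference "ACT".
tau = [{"A": 0.0, "C": 0.3, "G": 0.6, "T": 0.3},
       {"A": 0.2, "C": 0.0, "G": 0.2, "T": 0.2},
       {"A": 0.9, "C": 0.9, "G": 0.9, "T": 0.0}]
T, T_d, p_T, d_T = promoter_distance(tau, "ACG")
```

With these values only the third position differs between the two sequences, so it alone contributes to `T_d`. The correction grows without bound as the proportion approaches 3/4, so a practical implementation would need to decide how to treat promoters at or beyond that limit.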


Gene Ontology enrichment analysis

We use FUNC (Prüfer et al. 2007) to determine GO terms enriched for a particular gene set (hypergeometric test) or for low or high values of various measurements associated with genes (Wilcoxon rank sum test). Considering the genes in a specified set as marked, the hypergeometric test compares the number of marked genes associated with a GO term with the number of marked genes associated with any term in a specific ontology. The Wilcoxon rank sum test involves rank-ordering genes by a measurement, and then comparing the ranks of the genes associated with one GO term with the ranks of the other genes associated with any other term in a specific ontology. We use the false discovery rate (FDR) reported by FUNC as an FDR threshold $q$ (Storey and Tibshirani 2003). To alleviate the multiple testing problem, we consider a term to be enriched only when $q < 0.05$. For measurements that involve alignments of potential transcriptional regulatory regions, we include in the analysis only those sequences where fewer than 50 of the pairwise alignment columns include gaps.
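The hypergeometric test described above can be illustrated with the textbook upper-tail calculation. The Python sketch below is not FUNC and does not reproduce its multiple-testing correction or its handling of the ontology graph. The function name, parameter names, and gene counts are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(population, annotated, marked, overlap):
    """P(X >= overlap) for X marked genes carrying a GO term.

    population: genes associated with any term in the ontology
    annotated:  genes associated with the GO term of interest
    marked:     genes in the specified (marked) set
    overlap:    marked genes associated with the GO term
    """
    total = comb(population, marked)
    upper = min(annotated, marked)
    favorable = sum(comb(annotated, k) * comb(population - annotated, marked - k)
                    for k in range(overlap, upper + 1))
    return favorable / total


# Hypothetical counts: 20 genes, 5 with the term, 4 marked, 2 in the overlap.
p_value = hypergeom_upper_tail(20, 5, 4, 2)
```

The sum uses exact integer binomial coefficients from `math.comb` (Python 3.8 or later), which keeps the example free of third-party dependencies. For genome-scale gene sets a dedicated statistics library would be the more usual choice.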

Acknowledgments

This material is based upon work supported under a National Science Foundation Graduate Research Fellowship. We thank Kathryn Beal and Michael Schuster for providing data used in the analysis. We thank Benedict Paten and three anonymous reviewers for helpful comments on the manuscript.

References

Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD, Amanatides PG, Scherer SE, Li PW, Hoskins RA, Galle RF, et al. 2000. The genome sequence of Drosophila melanogaster. Science 287: 2185–2195.

Berg OG, von Hippel PH. 1987. Selection of DNA binding sites by regulatory proteins. Statistical-mechanical theory and application to operators and promoters. J Mol Biol 193: 723–750.

Boyko AR, Williamson SH, Indap AR, Degenhardt JD, Hernandez RD, Lohmueller KE, Adams MD, Schmidt S, Sninsky JJ, Sunyaev SR, et al. 2008. Assessing the evolutionary impact of amino acid mutations in the human genome. PLoS Genet 4: e1000083. doi: 10.1371/journal.pgen.1000083.

Bulyk ML. 2006. DNA microarray technologies for measuring protein–DNA interactions. Curr Opin Biotechnol 17: 422–430.

C. elegans Sequencing Consortium. 1998. Genome sequence of the nematode C. elegans: A platform for investigating biology. Science 282: 2012–2018.

Cleveland WS, Devlin SJ. 1988. Locally weighted regression: An approach to regression analysis by local fitting. J Am Stat Assoc 83: 596–610.

Dermitzakis ET, Clark AG. 2002. Evolution of transcription factor binding sites in mammalian gene regulatory regions: Conservation and turnover. Mol Biol Evol 19: 1114–1121.

Dermitzakis ET, Bergman CM, Clark AG. 2003. Tracing the evolutionary history of Drosophila regulatory regions with models that identify transcription factor binding sites. Mol Biol Evol 20: 703–714.

Durbin R, Eddy SR, Krogh A, Mitchison G. 1998. Biological sequence analysis, 1st ed. Cambridge University Press, Cambridge.

Eyre-Walker A. 2006. The genomic rate of adaptive evolution. Trends Ecol Evol 21: 569–575.

Foat BC, Morozov AV, Bussemaker HJ. 2006. Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE. Bioinformatics 22: e141–e149.

Gene Ontology Consortium. 2006. The Gene Ontology (GO) project in 2006. Nucleic Acids Res 34: D322–D326.

Goldman N, Yang Z. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol Biol Evol 11: 725–736.

Granek JA, Clarke ND. 2005. Explicit equilibrium modeling of transcription-factor binding and gene regulation. Genome Biol 6: R87. doi: 10.1186/gb-2005-6-10-r87.

Haygood R, Fedrigo O, Hanson B, Yokoyama KD, Wray GA. 2007. Promoter regions of many neural- and nutrition-related genes have experienced positive selection during human evolution. Nat Genet 39: 1140–1144.

Hoffman MM, Birney E. 2007. Estimating the neutral rate of nucleotide substitution using introns. Mol Biol Evol 24: 522–531.

Hudson ME, Snyder M. 2006. High-throughput methods of regulatory element discovery. Biotechniques 41: 673–681.

Hurst LD. 2002. The Ka/Ks ratio: Diagnosing the form of sequence evolution. Trends Genet 18: 486–487.

Jothi R, Cuddapah S, Barski A, Cui K, Zhao K. 2008. Genome-wide identification of in vivo protein-DNA binding sites from ChIP-seq data. Nucleic Acids Res 36: 5221–5231.

Jukes TH, Cantor CR. 1964. Evolution of protein molecules. In Mammalian protein metabolism (ed. HN Munro, JB Allison), pp. 21–132. Academic Press, New York.

Kaiser J. 2008. A plan to capture human diversity in 1000 genomes. Science 319: 395.

Kernighan BW, Ritchie DM. 1988. The C programming language, 2nd ed. Prentice Hall, Englewood Cliffs, NJ.

Kim PM, Korbel JO, Gerstein MB. 2007. Positive selection at the protein network periphery: Evaluation in terms of structural constraints and cellular context. Proc Natl Acad Sci 104: 20274–20279.

King MC, Wilson AC. 1975. Evolution at two levels in humans and chimpanzees. Science 188: 107–116.

Kosiol C, Holmes I, Goldman N. 2007. An empirical codon model for protein sequence evolution. Mol Biol Evol 24: 1464–1479.

Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, et al. 2001. Initial sequencing and analysis of the human genome. Nature 409: 860–921.

Li WH, Wu CI, Luo CC. 1985. A new method for estimating synonymous and non-synonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes. Mol Biol Evol 2: 150–174.

Manke T, Roider HG, Vingron M. 2008. Statistical modeling of transcription factor binding affinities predicts regulatory interactions. PLoS Comput Biol 4: e1000039. doi: 10.1371/journal.pcbi.1000039.

Massingham T, Goldman N. 2005. Detecting amino acid sites under positive selection and purifying selection. Genetics 169: 1753–1762.

Matys V, Kel-Margoulis OV, Fricke E, Liebich I, Land S, Barre-Dirrie A, Reuter I, Chekmenev D, Krull M, Hornischer K, et al. 2006. TRANSFAC and its module TRANSCompel: Transcriptional gene regulation in eukaryotes. Nucleic Acids Res 34: D108–D110.

McDonald JH, Kreitman M. 1991. Adaptive protein evolution at the Adh locus in Drosophila. Nature 351: 652–654.

Miyata T, Yasunaga T. 1980. Molecular evolution of mRNA: A method for estimating evolutionary rates of synonymous and amino acid substitutions from homologous nucleotide sequences and its application. J Mol Evol 16: 23–36.

Moses AM, Chiang DY, Kellis M, Lander ES, Eisen MB. 2003. Position specific variation in the rate of evolution in transcription factor binding sites. BMC Evol Biol 3: 19. doi: 10.1186/1471-2148-3-19.

Moses AM, Chiang DY, Pollard DA, Iyer VN, Eisen MB. 2004. MONKEY: Identifying conserved transcription-factor binding sites in multiple alignments using a binding site-specific evolutionary model. Genome Biol 5: R98. doi: 10.1186/gb-2004-5-12-r98.

Mouse Genome Sequencing Consortium. 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420: 520–562.

Mukherjee S, Berger MF, Jona G, Wang XS, Muzzey D, Snyder M, Young RA, Bulyk ML. 2004. Rapid analysis of the DNA-binding specificities of transcription factors with DNA microarrays. Nat Genet 36: 1331–1339.

Mustonen V, Kinney J, Callan CG, Lässig M. 2008. Energy-dependent fitness: A quantitative model for the evolution of yeast transcription factor binding sites. Proc Natl Acad Sci 105: 12376–12381.

Nei M, Gojobori T. 1986. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol Biol Evol 3: 418–426.

Nei M, Kumar S. 2000. Molecular evolution and phylogenetics. Oxford University Press, Oxford, UK.

Nielsen R. 2001. Statistical tests of selective neutrality in the age of genomics. Heredity 86: 641–647.

Paten B, Herrero J, Beal K, Fitzgerald S, Birney E. 2008. Enredo and Pecan: Genome-wide mammalian consistency-based multiple alignment with paralogs. Genome Res 18: 1814–1828.

Pennacchio LA, Ahituv N, Moses AM, Prabhakar S, Nobrega MA, Shoukry M, Minovitsky S, Dubchak I, Holt A, Lewis KD, et al. 2006. In vivo enhancer analysis of human conserved non-coding sequences. Nature 444: 499–502.

Perler F, Efstratiadis A, Lomedico P, Gilbert W, Kolodner R, Dodgson J. 1980. The evolution of genes: The chicken preproinsulin gene. Cell 20: 555–566.

Prüfer K, Muetzel B, Do HH, Weiss G, Khaitovich P, Rahm E, Pääbo S, Lachmann M, Enard W. 2007. FUNC: A package for detecting significant associations between gene sets and ontological annotations. BMC Bioinformatics 8: 41. doi: 10.1186/1471-2105-8-41.

R Development Core Team. 2007. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.


Rajewsky N, Vergassola M, Gaul U, Siggia ED. 2002. Computational detection of genomic cis-regulatory modules applied to body patterning in the early Drosophila embryo. BMC Bioinformatics 3: 30. doi: 10.1186/1471-2105-3-30.

Ray P, Shringarpure S, Kolar M, Xing EP. 2008. CSMET: Comparative genomic motif detection via multi-resolution phylogenetic shadowing. PLoS Comput Biol 4: e1000090. doi: 10.1371/journal.pcbi.1000090.

Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W, et al. 2006. Global variation in copy number in the human genome. Nature 444: 444–454.

Ren B, Robert F, Wyrick JJ, Aparicio O, Jennings EG, Simon I, Zeitlinger J, Schreiber J, Hannett N, Kanin E, et al. 2000. Genome-wide location and function of DNA binding proteins. Science 290: 2306–2309.

Robertson G, Hirst M, Bainbridge M, Bilenky M, Zhao Y, Zeng T, Euskirchen G, Bernier B, Varhol R, Delaney A, et al. 2007. Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat Methods 4: 651–657.

Roider HG, Kanhere A, Manke T, Vingron M. 2007. Predicting transcription factor affinities to DNA from a biophysical model. Bioinformatics 23: 134–141.

Salser W, Bowen S, Browne D, el Adli F, Fedoroff N, Fry K, Heindell H, Paddock G, Poon R, Wallace B, et al. 1976. Investigation of the organization of mammalian chromosomes at the DNA sequence level. Fed Proc 35: 23–35.

Segal E, Raveh-Sadka T, Schroeder M, Unnerstall U, Gaul U. 2008. Predicting expression patterns from regulatory sequence in Drosophila segmentation. Nature 451: 535–540.

Sinha S. 2006. On counting position weight matrix matches in a sequence, with application to discriminative motif finding. Bioinformatics 22: e454–e463.

Storey JD, Tibshirani R. 2003. Statistical significance for genomewide studies. Proc Natl Acad Sci 100: 9440–9445.

Stratton MR, Campbell PJ, Futreal PA. 2009. The cancer genome. Nature 458: 719–724.

Sunyaev S, Ramensky V, Koch I, Lathe W, Kondrashov AS, Bork P. 2001. Prediction of deleterious human alleles. Hum Mol Genet 10: 591–597.

Taylor MS, Kai C, Kawai J, Carninci P, Hayashizaki Y, Semple CAM. 2006. Heterotachy in mammalian promoter evolution. PLoS Genet 2: e30. doi: 10.1371/journal.pgen.0020030.

van Rossum G. 2006. Python reference manual. Python Software Foundation. Hampton, NH. http://docs.python.org/release/2.5/ref/.

Veyrieras JB, Kudaravalli S, Kim SY, Dermitzakis ET, Gilad Y, Stephens M, Pritchard JK. 2008. High-resolution mapping of expression-QTLs yields insight into human gene regulation. PLoS Genet 4: e1000214. doi: 10.1371/journal.pgen.1000214.

Vlieghe D, Sandelin A, De Bleser PJ, Vleminckx K, Wasserman WW, van Roy F, Lenhard B. 2006. A new generation of JASPAR, the open-access repository for transcription factor binding site profiles. Nucleic Acids Res 34: D95–D97.

Wang H, Johnston M, Mitra RD. 2007. Calling cards for DNA-binding proteins. Genome Res 17: 1202–1209.

Wasson T, Hartemink AJ. 2009. An ensemble model of competitive multi-factor binding of the genome. Genome Res 19: 2101–2112.

Wong WSW, Nielsen R. 2004. Detecting selection in noncoding regions of nucleotide sequences. Genetics 167: 949–958.

Received June 1, 2009; accepted in revised form February 9, 2010.

Hoffman and Birney

692 Genome Research

www.genome.org

TRACE: transcription factor footprinting using chromatin accessibility data and DNA sequence

Article

Full-text available

Jul 2020
GENOME RES

Transcription is tightly regulated by cis-regulatory DNA elements where transcription factors can bind. Thus, identification of transcription factor binding sites (TFBSs) is key to understanding gene expression and whole regulatory networks within a cell. The standard approaches used for TFBS prediction, such as position weight matrices (PWMs) and chromatin immunoprecipitation followed by sequencing (ChIP-seq), are widely used, but have their drawbacks including high false positive rates and limited antibody availability, respectively. Several computational footprinting algorithms have been developed to detect TFBSs by investigating chromatin accessibility patterns, however these also have limitations. We have developed a footprinting method to predict Transcription factor footpRints in Active Chromatin Elements (TRACE) to improve the prediction of TFBS footprints. TRACE incorporates DNase-seq data and PWMs within a multivariate Hidden Markov Model (HMM) to detect footprint-like regions with matching motifs. TRACE is an unsupervised method that accurately annotates binding sites for specific TFs automatically with no requirement for pre-generated candidate binding sites or ChIP-seq training data. Compared to published footprinting algorithms, TRACE has the best overall performance with the distinct advantage of targeting multiple motifs in a single model.

Functional effects of variation in transcription factor binding highlight long-range gene regulation by epromoters

Article

Full-text available

Feb 2020
NUCLEIC ACIDS RES

Identifying DNA cis-regulatory modules (CRMs) that control the expression of specific genes is crucial for deciphering the logic of transcriptional control. Natural genetic variation can point to the possible gene regulatory function of specific sequences through their allelic associations with gene expression. However, comprehensive identification of causal regulatory sequences in brute-force association testing without incorporating prior knowledge is challenging due to limited statistical power and effects of linkage disequilibrium. Sequence variants affecting transcription factor (TF) binding at CRMs have a strong potential to influence gene regulatory function, which provides a motivation for prioritizing such variants in association testing. Here, we generate an atlas of CRMs showing predicted allelic variation in TF binding affinity in human lymphoblastoid cell lines and test their association with the expression of their putative target genes inferred from Promoter Capture Hi-C and immediate linear proximity. We reveal >1300 CRM TF-binding variants associated with target gene expression, the majority of them undetected with standard association testing. A large proportion of CRMs showing associations with the expression of genes they contact in 3D localize to the promoter regions of other genes, supporting the notion of 'epromoters': dual-action CRMs with promoter and distal enhancer activity.

TRACE: transcription factor footprinting using DNase I hypersensitivity data and DNA sequence

Preprint

Full-text available

Oct 2019

Transcription is tightly regulated by cis-regulatory DNA elements where transcription factors can bind. Thus, identification of transcription factor binding sites is key to understanding gene expression and whole regulatory networks within a cell. The standard approaches for TFBSs prediction such as position weight matrices (PWMs) and chromatin immunoprecipitation followed by sequencing (ChIP-seq) are widely used but have their drawbacks such as high false positive rates and limited antibody availability, respectively. Several computational footprinting algorithms have been developed to detect TFBSs by investigating chromatin accessibility patterns, but also have their limitations. To improve on these methods, we have developed a footprinting method to predict Transcription factor footpRints in Active Chromatin Elements (TRACE). Trace incorporates DNase-seq data and PWMs within a multivariate Hidden Markov Model (HMM) to detect footprint-like regions with matching motifs. Trace is an unsupervised method that accurately annotates binding sites for specific TFs automatically with no requirement on pre-generated candidate binding sites or ChIP-seq training data. Compared to published footprinting algorithms, TRACE has the best overall performance with the distinct advantage of targeting multiple motifs in a single model.

Physical constraints determine the logic of bacterial promoter architectures

Article

Full-text available

Jan 2014

Site-specific transcription factors (TFs) bind to their target sites on the DNA, where they regulate the rate at which genes are transcribed. Bacterial TFs undergo facilitated diffusion (a combination of 3D diffusion around and 1D random walk on the DNA) when searching for their target sites. Using computer simulations of this search process, we show that the organisation of the binding sites, in conjunction with TF copy number and binding site affinity, plays an important role in determining not only the steady state of promoter occupancy, but also the order at which TFs bind. These effects can be captured by facilitated diffusion-based models, but not by standard thermodynamics. We show that the spacing of binding sites encodes complex logic, which can be derived from combinations of three basic building blocks: switches, barriers and clusters, whose response alone and in higher orders of organisation we characterise in detail. Effective promoter organizations are commonly found in the E. coli genome and are highly conserved between strains. This will allow studies of gene regulation at a previously unprecedented level of detail, where our framework can create testable hypothesis of promoter logic.

Mutational scans reveal differential evolvability of Drosophila promoters and enhancers

Article

Full-text available

Apr 2023

Rapid enhancer and slow promoter evolution have been demonstrated through comparative genomics. However, it is not clear how this information is encoded genetically and if this can be used to place evolution in a predictive context. Part of the challenge is that our understanding of the potential for regulatory evolution is biased primarily toward natural variation or limited experimental perturbations. Here, to explore the evolutionary capacity of promoter variation, we surveyed an unbiased mutation library for three promoters in Drosophila melanogaster . We found that mutations in promoters had limited to no effect on spatial patterns of gene expression. Compared to developmental enhancers, promoters are more robust to mutations and have more access to mutations that can increase gene expression, suggesting that their low activity might be a result of selection. Consistent with these observations, increasing the promoter activity at the endogenous locus of shavenbaby led to increased transcription yet limited phenotypic changes. Taken together, developmental promoters may encode robust transcriptional outputs allowing evolvability through the integration of diverse developmental enhancers. This article is part of the theme issue ‘Interdisciplinary approaches to predicting evolutionary biology’.

Evidence of a Conserved Molecular Response to Selection for Increased Brain Size in Primates

Article

Full-text available

Mar 2017

The adaptive significance of human brain evolution has been frequently studied through comparisons with other primates. However, the evolution of increased brain size is not restricted to the human lineage but is a general characteristic of primate evolution. Whether or not these independent episodes of increased brain size share a common genetic basis is unclear. We sequenced and de novo assembled the transcriptome from the neocortical tissue of the most highly encephalized nonhuman primate, the tufted capuchin monkey (Cebus apella). Using this novel data set, we conducted a genome-wide analysis of orthologous brain-expressed protein coding genes to identify evidence of conserved gene-phenotype associations and species-specific adaptations during three independent episodes of brain size increase. We identify a greater number of genes associated with either total brain mass or relative brain size across these six species than show species-specific accelerated rates of evolution in individual large-brained lineages. We test the robustness of these associations in an expanded data set of 13 species, through permutation tests and by analyzing how genome-wide patterns of substitution co-vary with brain size. Many of the genes targeted by selection during brain expansion have glutamatergic functions or roles in cell cycle dynamics. We also identify accelerated evolution in a number of individual capuchin genes whose human orthologs are associated with human neuropsychiatric disorders. These findings demonstrate the value of phenotypically informed genome analyses, and suggest at least some aspects of human brain evolution have occurred through conserved gene-phenotype associations. Understanding these commonalities is essential for distinguishing human-specific selection events from general trends in brain evolution. © The Author(s) 2017. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.

Evidence of a Conserved Molecular Response to Selection for Increased Brain Size in Primates

Article

Full-text available

Mar 2017

The adaptive significance of human brain evolution has been frequently studied through comparisons with other primates. However, the evolution of increased brain size is not restricted to the human lineage but is a general characteristic of primate evolution. Whether or not these independent episodes of increased brain size share a common genetic basis is unclear. We sequenced and de novo assembled the transcriptome from the neocortical tissue of the most highly encephalized nonhuman primate, the tufted capuchin monkey ($\textit{Cebus apella}$). Using this novel data set, we conducted a genome-wide analysis of orthologous brain-expressed protein coding genes to identify evidence of conserved gene-phenotype associations and species-specific adaptations during three independent episodes of brain size increase. We identify a greater number of genes associated with either total brain mass or relative brain size across these six species than show species-specific accelerated rates of evolution in individual large-brained lineages. We test the robustness of these associations in an expanded data set of 13 species, through permutation tests and by analyzing how genome-wide patterns of substitution co-vary with brain size. Many of the genes targeted by selection during brain expansion have glutamatergic functions or roles in cell cycle dynamics. We also identify accelerated evolution in a number of individual capuchin genes whose human orthologs are associated with human neuropsychiatric disorders. These findings demonstrate the value of phenotypically informed genome analyses, and suggest at least some aspects of human brain evolution have occurred through conserved gene-phenotype associations. Understanding these commonalities is essential for distinguishing human-specific selection events from general trends in brain evolution.

Evolution of Brain Active Gene Promoters in Human Lineage Towards the Increased Plasticity of Gene Regulation

Article

Full-text available

Mar 2018
MOL NEUROBIOL

Adaptability to a variety of environmental conditions is a prominent feature of Homo sapiens. We hypothesize that this feature can be explained by evolutionary changes in gene promoters active in the brain prefrontal cortex leading to a more flexible gene regulation network. The genotype-dependent range of gene expression can be broader in humans than in other higher primates. Thus, we searched for specific signatures of evolutionary changes in promoter architectures of multiple hominid genes, including the genes active in human cortical neurons that may indicate an increase of variability of gene expression rather than just changes in the level of expression, such as downregulation or upregulation of the genes. We performed a whole-genome search for genetic-based alterations that may impact gene regulation “flexibility” in a process of hominids evolution, such as (i) CpG dinucleotide content, (ii) predicted nucleosome-DNA dissociation constant, and (iii) predicted affinities for TATA-binding protein (TBP) in gene promoters. We tested all putative promoter regions across the human genome and especially gene promoters in active chromatin state in neurons of prefrontal cortex, the brain region critical for abstract thinking and social and behavioral adaptation. Our data imply that the origin of modern man has been associated with an increase of flexibility of promoter-driven gene regulation in brain. In contrast, after splitting from the ancestral lineages of H. sapiens, the evolution of ape species is characterized by reduced flexibility of gene promoter functioning, underlying reduced variability of the gene expression.

Natural selection in a population of Drosophila melanogaster explained by changes in gene expression caused by sequence variation in core promoter regions

Article

Feb 2016
BMC EVOL BIOL

Understanding the evolutionary forces that influence variation in gene regulatory regions in natural populations is an important challenge for evolutionary biology because natural selection on such variation could promote adaptive phenotypic evolution. Recently, whole-genome sequence analyses have identified regulatory regions subject to natural selection. However, these studies could not identify the relationship between sequence variation in the detected regions and changes in gene expression levels. We analyzed sequence variation in core promoter regions, which are critical for gene regulation in higher eukaryotes, in a natural population of Drosophila melanogaster, and identified core promoter sequence variations that are associated with differences in gene expression levels and have been subject to natural selection. Among the core promoter regions whose sequence variation could change transcription factor binding sites and explain differences in expression levels, three were detected as candidates associated with purifying selection or a selective sweep and seven as candidates associated with balancing selection, excluding the possibility of linkage between these regions and core promoter regions. CHKov1, which confers resistance to the sigma virus and related insecticides, was identified as having a core promoter region that has been subject to a selective sweep, although it could not be ruled out that the apparent selection on variation in the core promoter region was due to linked single nucleotide polymorphisms in the regulatory region outside the core promoter. Nucleotide changes in the core promoter region of CHKov1 caused the loss of two basal transcription factor binding sites and the acquisition of one transcription factor binding site, resulting in decreased gene expression levels.
Of nine core promoter regions associated with balancing selection, brat and CG9044 are associated with neuromuscular junction development, and Nmda1 is associated with learning, behavioral plasticity, and memory. Diversity of neural and behavioral traits may have been maintained by balancing selection. Our results reveal the evolutionary process by which natural selection acts on differences in gene expression levels caused by sequence variation in core promoter regions in a natural population. The sequences of core promoter regions were diverse even within the population, possibly providing a source for natural selection.

Preparation of high temperature resistant Ag/PI/Cu composite nano particles inserted with PI insulating layer

Article

Nov 2018

Polyimide (PI)@copper (Cu) composite nanoparticles have been successfully synthesized from poly(amic acid) triethylamine salts (PAAS) and Cu(II) ions via a one-step high-temperature induction/imidization route. The formation of PI@Cu nanoparticles was investigated as a function of the stoichiometric ratio of PAAS to Cu ions. The resulting products formed stable core-shell structures with a uniform core size and a thick shell layer. Additionally, the multilayer structure Ag@PI@Cu was successfully prepared by post-processing the PI@Cu nanoparticles. The morphology of the resulting “Sunflower-mode” structure, with Cu as the pistil, PI as the sunflower seed, and Ag as the petal, was characterized by SEM and TEM. Both the electrical resistivity and the thermal conductivity of the nanoparticles were measured. The coefficient of heat conduction of Ag@PI@Cu is 255, 754, 3081, and 1310 times as large as that of PI@Cu at 50 °C, 100 °C, 150 °C, and 200 °C, respectively. The measured resistances of PI@Cu and Ag@PI@Cu are 11.0 × 10⁹ Ω and 0.11 Ω, respectively, and the difference between them is reported as more than 10¹².

Initial sequencing and analysis of the human genome

Article

Feb 2001

The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution. Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence.

C PROGRAMMING LANGUAGE.

Article

Dec 1981

C is a general-purpose programming language that was originally designed for "system programming", that is, for writing programs such as compilers, operating systems, text editors, etc. Its other applications include data base systems, numerical analysis and engineering programs, and a great deal of text-processing software. It is the primary language of the UNIX system, and is also available in several other environments.

A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes.

Article

Mar 1985

A new method is proposed for estimating the number of synonymous and nonsynonymous nucleotide substitutions between homologous genes. In this method, a nucleotide site is classified as nondegenerate, twofold degenerate, or fourfold degenerate, depending on how often nucleotide substitutions will result in amino acid replacement; nucleotide changes are classified as either transitional or transversional, and changes between codons are assumed to occur with different probabilities, which are determined by their relative frequencies among more than 3,000 changes in mammalian genes. The method is applied to a large number of mammalian genes. The rate of nonsynonymous substitution is extremely variable among genes; it ranges from 0.004 × 10⁻⁹ (histone H4) to 2.80 × 10⁻⁹ (interferon gamma), with a mean of 0.88 × 10⁻⁹ substitutions per nonsynonymous site per year. The rate of synonymous substitution is also variable among genes; the highest rate is three to four times higher than the lowest one, with a mean of 4.7 × 10⁻⁹ substitutions per synonymous site per year. The rate of nucleotide substitution is lowest at nondegenerate sites (the average being 0.94 × 10⁻⁹), intermediate at twofold degenerate sites (2.26 × 10⁻⁹), and highest at fourfold degenerate sites (4.2 × 10⁻⁹). The implication of our results for the mechanisms of DNA evolution and that of the relative likelihood of codon interchanges in parsimonious phylogenetic reconstruction are discussed.
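The site classification described in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the published method: it uses the standard genetic code, the function names `site_degeneracy` and `count_site_classes` are hypothetical, and sites where one or two of the three possible changes are synonymous (for example the third position of isoleucine codons) are all lumped into the twofold class. Stop codons are treated like any other codon class, and the transition/transversion weighting from the abstract is not modeled.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated as TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def site_degeneracy(codon, position):
    """Classify one codon site (0-based position) as 0, 2 or 4.

    0 = nondegenerate (every change alters the amino acid),
    4 = fourfold degenerate (no change alters the amino acid),
    2 = everything in between (simplifying assumption, see lead-in).
    """
    codon = codon.upper()
    original = CODON_TABLE[codon]
    synonymous = 0
    for base in BASES:
        if base == codon[position]:
            continue
        mutant = codon[:position] + base + codon[position + 1:]
        if CODON_TABLE[mutant] == original:
            synonymous += 1
    if synonymous == 0:
        return 0
    if synonymous == 3:
        return 4
    return 2


def count_site_classes(cds):
    """Count sites of each degeneracy class in an in-frame coding sequence."""
    counts = {0: 0, 2: 0, 4: 0}
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    for start in range(0, usable, 3):
        codon = cds[start:start + 3]
        for position in range(3):
            counts[site_degeneracy(codon, position)] += 1
    return counts


if __name__ == "__main__":
    print(count_site_classes("ATGGGTTTT"))
```

Counts like these give the denominators against which substitutions at each class of site would be normalized when estimating per-site rates.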

Sequencing Consortium. Genome sequence of the nematode C. elegans: A platform for investigating biology

Article

C. Elegans

Molecular Evolution and Phylogenetics, 1st ed. Oxford University Press

Article

Locally Weighted Regression: An Approach to Regression Analysis by Local Fitting

Article

Sep 1988

The evolution of genes: the chicken preproinsulin gene

Article

Jun 1980
CELL

F Perler

We have characterized a clone carrying a chicken preproinsulin gene, which is present in only one copy in the chicken genome. The gene contains two introns: a 3.5 kb intron interrupting the region encoding the connecting peptide and a 119 bp intron interrupting the DNA corresponding to the 5′ noncoding region of the mRNA. This is similar to the structure of rat insulin gene II; therefore it represents the common ancestor. Since the rat insulin gene I lacks a 499 bp intron in the coding region, the rat genes have evolved by a recent gene duplication followed by loss of this intron in one copy. The divergences between insulin gene sequences, and also between globin genes, show that changes at introns and silent positions in coding regions appear very rapidly (7 × 10⁻⁹ substitutions per nucleotide site per year), but that the accumulation of changes in these sites saturates, although not completely, after about 100 million years. From this we conclude that not all of these sites are neutral and that they do not behave as accurate evolutionary clocks over long periods of time. However, nucleotide substitutions leading to amino acid replacements are an excellent clock. Our analysis indicates that this clock is driven by selection.

Molecular Evolution and Phylogenetics

Book

Jul 2000

This book presents the statistical methods that are useful in the study of molecular evolution and illustrates how to use them in actual data analysis. Molecular evolution has been developing at a great pace over the past decade or so, driven by the huge increase in genetic sequence data from many organisms, the improvement of high-speed microcomputers, and the development of several new methods for phylogenetic analysis. This book for graduate students and researchers, assuming a basic knowledge of evolution, molecular biology, and elementary statistics, should make it possible for many investigators to incorporate refined statistical analysis of large-scale data in their own work. Nei is one of the leading workers in this area. He and Kumar have developed a computer program called MEGA, which has been sold for about $20 to over 1900 users. For the book, the authors are thoroughly revising MEGA and will make it available via FTP. The book also included analysis using the other most popular programs for phylogenetic studies, including PAUP, PHYLIP, MOLPHY, and PAML.

Initial sequencing and comparative analysis of the mouse genome

Article

Jan 2002

Initial sequencing and analysis of the human genome. Nature

Article

Jan 2001
