
An effective model for natural selection in promoters

Authors:
  • Princess Margaret Cancer Centre/University of Toronto

Abstract

We have produced an evolutionary model for promoters, analogous to the commonly used synonymous/nonsynonymous mutation models for protein-coding sequences. Although our model, called Sunflower, relies on some simple assumptions, it captures enough of the biology of transcription factor action to show clear correlation with other biological features. Sunflower predicts a binding profile of transcription factors to DNA sequences, in which different factors compete for the same potential binding sites. The parametrized model simultaneously estimates a continuous measurement of binding occupancy across the genomic sequence for each factor. We can then introduce a localized mutation, rerun the binding model, and record the difference in binding profiles. A single mutation can alter interactions both upstream and downstream of its position due to potential overlapping binding sites, and our statistic captures this domino effect. Over evolutionary time, we observe a clear excess of low-scoring mutations fixed in promoters, consistent with most changes being neutral. However, this is not consistent across all promoters, and some promoters show more rapid divergence. This divergence often occurs in the presence of relatively constant protein-coding divergence. Interestingly, different classes of promoters show different sensitivity to mutations, with phosphorylation-related genes having promoters inherently more sensitive to mutations than immune genes. Although there have previously been a number of models attempting to handle transcription factor binding, Sunflower provides a richer biological model, incorporating weak binding sites and the possibility of competition. The results show the first clear correlations between such a model and evolutionary processes.
Resource
Michael M. Hoffman$^{1,2,3}$ and Ewan Birney$^{2,4}$
$^1$ EMBL–European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge CB10 1SD, United Kingdom
$^2$ Graduate School of Life Sciences, University of Cambridge, Cambridge CB2 1RX, United Kingdom
[Supplemental material is available online at http://www.genome.org. The Sunflower package and source code are available at http://www.ebi.ac.uk/~hoffman/software/sunflower/.]
Evolution is a fundamental force that has shaped all living organisms. By comparing the genomes of different species, and considering their similarities and differences through the lens of evolutionary theory, we can discover interesting aspects of biology and better understand their past development (C. elegans Sequencing Consortium 1998; Adams et al. 2000; Lander et al. 2001; Mouse Genome Sequencing Consortium 2002). To quantify selective pressure in protein-coding genes, many researchers have estimated the number of nonsynonymous substitutions (called $d_N$ or $K_a$) and synonymous substitutions (called $d_S$ or $K_s$), and then taken their ratio, described as $d_N/d_S$, $K_a/K_s$, or $\omega$ (Nei and Kumar 2000). This has provided an invaluable model for characterizing the evolution of genes in relatively closely related species. Contrasting rates of evolution in classes of nucleotides with differing functional effects is also used in a variety of population genetics procedures, such as the McDonald–Kreitman test (McDonald and Kreitman 1991). Although this model crudely equates phenotypic change with amino acid sequence change, ignoring more complex effects, it has repeatedly shown its worth in classifying proteins and specific sites in proteins undergoing both positive (adaptive) selection and negative (purifying) selection (Nielsen 2001; Hurst 2002; Eyre-Walker 2006).
Due to its extensive use, methodology to assess relative nonsynonymous to synonymous rates has progressively improved over time. Salser et al. (1976) were the first to count synonymous and nonsynonymous differences between mammalian protein-coding nucleotide sequences, and others (Miyata and Yasunaga 1980; Perler et al. 1980; Li et al. 1985; Nei and Gojobori 1986) developed more robust methods to estimate the number of synonymous and nonsynonymous substitutions where multiple substitutions occurred in a single site. More recently, researchers increasingly use maximum likelihood methods to estimate these quantities, accounting for local variations in mutation rate according to various models of evolution (Goldman and Yang 1994). This framework has often been adapted by other researchers to investigate evolution of protein-coding sequence (Kosiol et al. 2007; Boyko et al. 2008). New extensions to the basic models, such as the sitewise likelihood ratio (Massingham and Goldman 2005), continue to expand the utility of this basic protein model.
In contrast, an analogous phenotypic change model has not existed for noncoding regions of the genome, including those regions that regulate transcription. Most researchers use straightforward measures to approximate change in these regions that lack a model of the variable susceptibility of different positions in transcription factor binding sites (TFBSs) to mutations (Wong and Nielsen 2004; Haygood et al. 2007). Although investigators have identified and commented on this variable susceptibility (Dermitzakis et al. 2003; Moses et al. 2003; Mustonen et al. 2008), a good model for the impact of variation on transcription factor binding that can be integrated into traditional $d_N/d_S$ methods would be more useful. The lack of a more realistic phenotypic model is particularly frustrating as the protein-coding complement does not change significantly between mammalian species outside of olfaction and the immune system (and even less so between primates), leading many researchers to suggest that changes in regulation include many of the most important changes for positive selection in mammalian and primate evolution (King and Wilson 1975).
Here, we introduce a phenotypic model for the impact of change in promoter sequence. We were inspired by the success of transcription factor binding models that integrate over the complete range of binding affinities (Rajewsky et al. 2002; Granek and Clarke 2005; Foat et al. 2006; Sinha 2006; Roider et al. 2007; Manke et al. 2008) using a library of position weight matrices (PWMs). Additionally, Wasson and Hartemink (2009) published a similar model during the preparation of this manuscript. These models have shown their utility by providing robust models of Drosophila enhancers (Segal et al. 2008). This work differs from previous efforts to use multispecies conservation information to improve the identification of functional TFBSs (Moses et al. 2004; Ray et al. 2008), because we hold out evolutionary information from the TFBS identification process in order to avoid circularity in the subsequent estimation of evolutionary distances. The necessary modeling instead seeks to grade potential mutations for their impact on cis-regulation prior to analyzing information on the actual substitutions found in evolution, in a similar way to methods that determine potentially disruptive protein-coding substitutions, such as PolyPhen (Sunyaev et al. 2001).
$^3$ Present address: Department of Genome Sciences, University of Washington, PO Box 355065, Seattle, WA 98195-5065, USA.
$^4$ Corresponding author. E-mail birney@ebi.ac.uk.
Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.096719.109. Freely available online through the Genome Research Open Access option.
Genome Research 20:685–692 © 2010 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/10; www.genome.org
Quantifying phenotypic change with such a model suggests a corresponding measurement $d_T$ (by analogy to $d_N$ and $d_S$) to quantify the putative change in transcriptional function. Although itself a crude approximation of the biochemical process we wish to model, this measurement shows the expected suppression of larger changes over evolutionary time. To correct for the local neutral rate of evolution, we combine $d_T$ with the protein-coding $d_S$ using the ratio $\psi = d_T/d_S$, which can distinguish different functional categories of genes with varying degrees of selection on their promoter regions. The ratio shows strong purifying selection on developmental process genes, as expected, but also shows potential positive selection or extensive relaxation of constraint in other functional classes, such as phospholipid biosynthetic process genes.
Results
We used a hidden Markov model (HMM) framework (Durbin et al. 1998) to provide a reasonable model of the competitive binding of an ensemble of transcription factors (TFs), assuming steric hindrance between factors competing for the same segment of DNA. The architecture of the model is shown in Figure 1, and because of its floral resemblance, we call the model Sunflower. Each TF forms a petal of nucleotide-emitting states, with each state parametrized from a column in a PWM, which may come from a public TF database such as JASPAR (Vlieghe et al. 2006) or TRANSFAC (Matys et al. 2006), or from high-throughput protein-binding microarray experiments (Mukherjee et al. 2004; Bulyk 2006). For the analysis presented here we used vertebrate JASPAR CORE PWMs, specifically those listed in Supplemental Table 1. A single unbound state represents parts of the DNA not bound by a factor, and it is parametrized using the base composition of the whole genome. The entry probability to the unbound state was arbitrarily set to 0.99, representing a postulated prior that the fraction of nucleotides bound to TFs is on the order of magnitude of 1%. The entry probability to each TF petal, roughly analogous to the cellular concentration of each factor, is set flat for all factors. This equally divides the remaining 0.01 probability for entry to a petal. Ideally, the model would summarize effects across all cell types, which precludes setting these values according to the concentrations of individual TFs under particular cellular conditions. Because we lack the knowledge necessary to integrate the expression levels of genes in every cell type over evolutionary history, we used this arbitrary flat prior.
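The flat prior described above can be illustrated with a minimal Python sketch. The function name `entry_probabilities` and the placeholder factor names are hypothetical and are not taken from the Sunflower package; the only values drawn from the text are the 0.99 entry probability for the unbound state and the equal split of the remainder. The count of 89 factors assumes that the 89 single-PWM models mentioned in the Results correspond to the full PWM library.

```python
def entry_probabilities(tf_names, p_unbound=0.99):
    """Return entry probabilities from the silent state (hypothetical helper).

    The unbound state receives p_unbound; the remaining probability mass
    is divided equally among the TF petals (the flat prior).
    """
    if not tf_names:
        raise ValueError("need at least one transcription factor")
    p_petal = (1.0 - p_unbound) / len(tf_names)
    probs = {"unbound": p_unbound}
    probs.update({name: p_petal for name in tf_names})
    return probs


# Placeholder names; assumes a library of 89 PWMs.
priors = entry_probabilities([f"TF{i}" for i in range(89)])
```

Under these assumptions each petal receives an entry probability of 0.01/89, a little over 1 in 10,000, so any single factor needs a strong sequence match to outcompete the unbound state.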
The HMM forward–backward algorithm allows the efficient calculation of the marginal probability of each factor explaining each base, analogous to the base being bound by the factor. This means that for each base in the sequence, the algorithm calculates a vector of the marginal probabilities for each PWM column summarizing the combined behavior of the ensemble of TFs at that position. Although this model is admittedly simple, with no provision for different concentrations of factors or different potential cooperative modes between factors, it does maintain many useful known aspects of TF biology. In particular, it considers a continuous range of TF affinities for different genomic sites and steric effects between factors.
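As a rough illustration of the marginal probability calculation, the following Python sketch builds a toy model with one unbound state and one petal per PWM, then runs a scaled forward–backward pass with NumPy. It is not the Sunflower implementation: the names `build_model` and `posterior` are hypothetical, the silent state is folded into the transition matrix rather than represented explicitly, and the sketch assumes the sequence may end in any state.

```python
import numpy as np

BASES = "ACGT"


def build_model(pwms, background, p_unbound=0.99):
    """Assemble a toy petal model (hypothetical helper, not Sunflower's API).

    pwms maps a factor name to a list of columns, each a probability
    distribution over A, C, G, T. State 0 is the unbound state.
    """
    emissions = [np.asarray(background, dtype=float)]
    labels = ["unbound"]
    spans = []
    for name, cols in pwms.items():
        start = len(emissions)
        for j, col in enumerate(np.asarray(cols, dtype=float)):
            emissions.append(col)
            labels.append(f"{name}.{j}")
        spans.append((start, len(emissions) - 1))
    n = len(emissions)
    # Entry distribution: unbound state or the first column of any petal.
    entry = np.zeros(n)
    entry[0] = p_unbound
    for start, _ in spans:
        entry[start] = (1.0 - p_unbound) / len(spans)
    trans = np.zeros((n, n))
    trans[0] = entry                      # unbound -> any entry point
    for start, end in spans:
        for s in range(start, end):
            trans[s, s + 1] = 1.0         # march along the petal
        trans[end] = entry                # last column -> any entry point
    return labels, entry, trans, np.array(emissions)


def posterior(seq, entry, trans, emissions):
    """Per-position marginal state probabilities via scaled forward-backward."""
    obs = [BASES.index(b) for b in seq]
    length, n = len(obs), len(entry)
    fwd = np.zeros((length, n))
    bwd = np.ones((length, n))
    scale = np.zeros(length)
    fwd[0] = entry * emissions[:, obs[0]]
    scale[0] = fwd[0].sum()
    fwd[0] /= scale[0]
    for i in range(1, length):
        fwd[i] = (fwd[i - 1] @ trans) * emissions[:, obs[i]]
        scale[i] = fwd[i].sum()
        fwd[i] /= scale[i]
    for i in range(length - 2, -1, -1):
        bwd[i] = trans @ (emissions[:, obs[i + 1]] * bwd[i + 1]) / scale[i + 1]
    post = fwd * bwd
    return post / post.sum(axis=1, keepdims=True)


# Toy PWM that prefers the dinucleotide "AG"; values are illustrative only.
pwms = {"A": [[0.97, 0.01, 0.01, 0.01], [0.01, 0.01, 0.97, 0.01]]}
labels, entry, trans, emissions = build_model(pwms, [0.25] * 4)
post = posterior("TTAGTT", entry, trans, emissions)
```

Each row of `post` is the vector of marginal probabilities for one base. Because a petal must be traversed column by column, the probability of being in `A.0` at one position always equals the probability of being in `A.1` at the next, which is how steric exclusion between overlapping sites arises in this kind of model.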
In this simulation it is possible for a single mutation to affect a longer chain of binding sites due to changes in steric overlap. An illustration of this domino effect is shown in Figure 2, where a single mutation changes the predicted binding not only at NR1H2-RXR, PPARG-RXRA, and T binding sites directly overlapping the mutation, but also at the predicted nearby NR3C1, REL, Roaz, SP1, and Spz1 binding sites, leading to a complete reorganization of the predicted binding occupancy on this promoter.
In order to investigate the importance of the domino effect, we compared probabilities estimated with this joint model with probabilities estimated with 89 similar models where we included only one PWM at a time. We defined proximal promoter sequences as 1400 bp around 17,600 transcription start sites (TSSs) in the human genome. We took the probability distribution inferred using each single-motif model at each proximal promoter sequence position, and the probability distribution generated from the cognate portion of the joint model (see Methods). The median relative entropy per nucleotide calculated between these two distributions for each model is 0.6 bits, which means the joint model provides a large amount of additional information over a 1400-bp promoter.
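For readers unfamiliar with relative entropy measured in bits, the following Python sketch shows the standard Kullback–Leibler calculation between two discrete distributions. The function name `relative_entropy_bits` is hypothetical, and how the single-motif and joint distributions are aligned before comparison is described in the Methods rather than here.

```python
import numpy as np


def relative_entropy_bits(p, q, eps=1e-12):
    """Relative entropy D(p || q) in bits between two discrete distributions.

    Inputs are normalized to sum to 1. Terms where p is zero contribute
    nothing; q is floored at eps to avoid division by zero.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / np.maximum(q[mask], eps))))


same = relative_entropy_bits([0.5, 0.5], [0.5, 0.5])
certain = relative_entropy_bits([1.0, 0.0], [0.5, 0.5])
```

In the second call, moving from an even split to complete certainty about one of two outcomes corresponds to exactly one bit, which gives a sense of scale for the 0.6 bits per nucleotide reported above.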
To examine the impact of a potential mutation in the joint model, we introduce it into the sequence and recalculate the marginal probability vector for the mutated sequence at every position, not just the mutated position. We then add together the relative entropies (Durbin et al. 1998) for each pair of marginal probability vectors (both the reference and the mutated sequence). We refer to the sum as the binding shift of the mutation and denote it by the symbol $\tau$ (see Methods).
Figure 1. Toy example schematic of a Sunflower model for TFs. (Circle) Silent state, (squares) emitting states, (arcs) transitions between states with nonzero probability. The transition probability is either designated by a label, or is 1 in the case of unlabeled areas. (Squares with ellipses) Arbitrary number of sequential states. This toy example includes TFs A, B, C, and D, each one with a petal of emitting states, labeled such that D.0 corresponds to the first column of the D PWM, and D.n the last column. The arc from empty space indicates the initial state of the model.
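The binding shift can be sketched in Python as a sum of per-position relative entropies between two posterior matrices, one row per base and one column per state. This is an illustration under stated assumptions rather than the published definition: the function name `binding_shift` is hypothetical, the direction of the relative entropy (reference relative to mutated) and the use of base-2 logarithms are assumptions, and the exact form is given in the Methods.

```python
import numpy as np


def binding_shift(post_ref, post_mut, eps=1e-12):
    """Sum over positions of D(reference || mutated), in bits.

    post_ref and post_mut are (positions x states) arrays whose rows are
    marginal probability vectors. Probabilities are floored at eps so that
    states with zero probability do not produce infinities.
    """
    ref = np.clip(np.asarray(post_ref, dtype=float), eps, None)
    mut = np.clip(np.asarray(post_mut, dtype=float), eps, None)
    per_position = np.sum(ref * np.log2(ref / mut), axis=1)
    return float(per_position.sum())


# Illustrative two-position, two-state posteriors (not real data).
reference = [[0.9, 0.1], [0.5, 0.5]]
mutated = [[0.5, 0.5], [0.5, 0.5]]
tau = binding_shift(reference, mutated)
```

Because every position contributes, a mutation that only nudges occupancy at its own site yields a small value, while one that reorganizes binding along the promoter accumulates contributions from many rows.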
To explore the properties of the new $\tau$ measurement, we exhaustively simulated every possible point mutation in the human promoters (4200 changes per promoter, 73,920,000 overall). We then compared the human sequences with aligned sequences in the dog genome, chosen because it was distantly related enough for many neutral changes to occur, yet close enough that the effects of selection on cis-regulation would still be observable. We separated the changes observed in dog (4,069,878, 8% of the mutations at a human position aligned to a dog nucleotide). Figure 3 shows the mean $\tau$ for both changes observed and unobserved in dog averaged at each TSS-relative position.
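The counts quoted above follow from three alternative bases at each of 1400 positions across 17,600 promoters. A short Python sketch of the enumeration, with a hypothetical generator name `point_mutations` and assuming sequences contain only A, C, G, and T:

```python
def point_mutations(seq, alphabet="ACGT"):
    """Yield (position, reference base, alternative base, mutated sequence)."""
    for pos, ref in enumerate(seq):
        for alt in alphabet:
            if alt != ref:
                yield pos, ref, alt, seq[:pos] + alt + seq[pos + 1:]


# Arithmetic behind the figures in the text.
per_promoter = 1400 * 3             # three alternative bases per position
overall = per_promoter * 17_600     # across all promoters

toy = list(point_mutations("ACGT"))
```

Each yielded sequence would be passed through the binding model once, which is why reusing partial results from the reference sequence matters for run time.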
Overall, $\tau$ rises steadily as mutations approach the TSS, as expected from the increase in density of TF binding sites. More importantly, there is a strong separation over the TSS of the observed from the unobserved mutations, leading to consistently higher $\tau$ values in the unobserved portion. Both the overall shape of this plot and its consistency with the prediction that higher-$\tau$ mutations are less favored by the predominantly selectively neutral changes accepted over evolutionary time suggest that this measurement models something that correlates with evolutionary acceptance of mutations near TSSs.
For confirmation, we repeated this analysis on mouse–rat aligned proximal promoters and found similar results (Supplemental Fig. 1). We found different results when looking at human–dog aligned regions (Supplemental Fig. 2) with enhancer activity validated in transgenic mice (Pennacchio et al. 2006), or at human–dog aligned ancestral repeats (Supplemental Fig. 3; Paten et al. 2008). The relatively flat binding shift lines in these results lead us to conclude that with the input PWMs used, this model will primarily detect signatures of selection in proximal promoter regions rather than enhancer regions or negative control ancestral repeat regions.
The $\tau$ measurement provides an approximation to the binding occupancy change of a mutation, which is the simplest predictable phenotypic change in a promoter, much like the number of changed residues in a protein is the simplest measurement of phenotypic change in a protein. We also sum up the total potential change of a promoter, considering every possible mutation, and call this $T$. Interestingly, different classes of genes, as determined by Gene Ontology (GO) (Gene Ontology Consortium 2006) annotations, show varying levels of this inherent propensity to change (see Supplemental Tables 1–6). Genes involved in developmental processes are expected to have complex, finely tuned promoters, and therefore are expected to have high $T$. Somewhat more unexpected in high-$T$ genes are those involved in phosphorylation and the cell cycle. Interestingly, these GO terms are also excluded from copy number variant (CNV) regions (Redon et al. 2006).
In order to examine how actual changes affect the binding profile, we can sum up only those values of $\tau$ that correspond to observed substitutions. To control for different inherent propensities to change, we divide by the potential total binding shift $T$, and then transform this proportion using the Jukes–Cantor model (Jukes and Cantor 1964) to correct for multiple substitutions along an evolutionary lineage (see Methods). This results in a transcriptional distance measurement $d_T$.
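A sketch of this transformation in Python follows. It assumes the textbook Jukes–Cantor distance correction, $d = -\frac{3}{4}\ln(1 - \frac{4}{3}p)$, applied to the proportion of potential binding shift that was realized; the exact form used by Sunflower is specified in the Methods and may differ. The function names and the input values are hypothetical.

```python
import math


def jukes_cantor(p):
    """Textbook Jukes-Cantor correction of a proportion p of differences."""
    if not 0.0 <= p < 0.75:
        raise ValueError("proportion must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def transcriptional_distance(observed_shifts, total_shift):
    """Assumed form of d_T: correct the observed fraction of T for multiple hits."""
    proportion = sum(observed_shifts) / total_shift
    return jukes_cantor(proportion)


# Illustrative numbers only: observed tau values summing to 3 out of T = 30.
d_t = transcriptional_distance([1.0, 2.0], 30.0)
```

The corrected distance is always at least as large as the raw proportion, and the gap widens as the proportion grows, reflecting substitutions hidden by later changes at the same site.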
We developed an evolutionary measurement, which we call $\psi$, by analogy to the protein-coding $\omega$ parameter for the nonsynonymous-to-synonymous substitution rate ratio. For $\psi$, we wish to control both for the inherent binding shift mutability and for the local mutation rate, so we take $d_T$ and divide it by the local neutral mutation rate $d_S$, analogously to $d_N/d_S$. The measurement $\psi = d_T/d_S$ therefore summarizes our approximation of the binding occupancy change in a promoter due to mutations, normalizing for both local mutation rate and inherent mutability of a promoter. Values of human–dog $\psi$ are not strongly correlated to the local mutation rate, measured either using synonymous coding sites (Supplemental Fig. 6; $r_S = 0.51$; $P < 2.2 \times 10^{-16}$) or at introns (Hoffman and Birney 2007) (Supplemental Fig. 7; $r_S = 0.24$; $P < 2.2 \times 10^{-16}$). Neither is it correlated to the raw mutability ($T$) of each promoter (Supplemental Fig. 8; $r_S = 0.20$; $P < 2.2 \times 10^{-16}$). This suggests that $\psi$ captures an aspect of biology independent of these quantities, such as selection on promoters, just as $\omega$ captures for coding sequence. While others have identified purifying selection adjacent to the TSS (Taylor et al. 2006), we can identify a potential mechanism for this selection.
Figure 2. Changes in predicted binding profile for a guanine-to-thymine mutation at position +29 of ENST00000373379, a transcript of UPRT, uracil phospho-ribosyltransferase. (Green lines) Probability of eight individual TFs binding to the reference sequence, or the probability that a region is unbound (upper right panel). The names above the TF binding panels refer to JASPAR PWM names, and the corresponding Human Genome Organization Nomenclature Committee (HGNC) symbols are contained in Supplemental Table 1. (Orange lines) Probability that a TF binds the mutated sequence. These displayed changes, when added to smaller changes for other TFs, represent a binding shift $\tau$ of 124.8 (see Results and Methods).
Figure 3. Aggregation plot of the binding shifts of 17,600 human genes, averaged within two groups: one where the simulated mutation was observed in dog (green circles, solid line), and one where it was unobserved (orange crosses, dashed line). (A) Local regressions for ±700 bp around the TSS, estimated with the loess (Cleveland and Devlin 1988) function in R (R Development Core Team 2007), with second-degree polynomials and $\alpha = 0.1$. Shaded regions in this plot are magnified as separate panels below to show mean binding shifts at individual positions proximal to (B) and more distal from (C) the TSS.
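The chain of definitions leading to $\psi$ can be summarized in one place. The following LaTeX block is a reconstruction from the prose of the Results, not a copy of the Methods: the direction of the relative entropy $D(\cdot \,\|\, \cdot)$ and the textbook Jukes–Cantor form for $d_T$ are assumptions, and the symbols $P^{\mathrm{ref}}_{\cdot,i}$ and $P^{m}_{\cdot,i}$ (the marginal probability vectors at position $i$ for the reference sequence and for the sequence carrying mutation $m$) are introduced here for illustration.

```latex
\begin{align}
\tau(m) &= \sum_{i} D\!\left(P^{\mathrm{ref}}_{\cdot,i} \,\middle\|\, P^{m}_{\cdot,i}\right)
  && \text{binding shift of mutation } m \\
T &= \sum_{m \,\in\, \text{all point mutations}} \tau(m)
  && \text{total potential change of a promoter} \\
p &= \frac{1}{T} \sum_{m \,\in\, \text{observed}} \tau(m)
  && \text{realized fraction of } T \\
d_T &= -\tfrac{3}{4} \ln\!\left(1 - \tfrac{4}{3}\,p\right)
  && \text{Jukes--Cantor correction (assumed form)} \\
\psi &= \frac{d_T}{d_S}
  && \text{normalized by the local neutral rate}
\end{align}
```

Dividing by $T$ controls for the inherent mutability of a promoter, and dividing by $d_S$ controls for the local mutation rate, which matches the two normalizations named in the text.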
Considering classes of genes with high or low amounts of selective pressure on promoters provides interesting insights into biology. Focusing first on cellular components, it has long been known that plasma membrane and extracellular compartments show strong enrichment for high values of the protein-coding $\omega$. The transcriptional $\psi$, however, shows an almost perfect contrast to this, with these compartments showing striking enrichment for low $\psi$ values (Fig. 4). Turning to more specific functional categories, Figure 5 shows a scatter plot of median $\psi$ against median $\omega$ for biological process and molecular function GO terms with at least 10 genes annotated. It is clear that $\psi$ and $\omega$ are not strongly correlated for functional classes of genes ($r_S = 0.081$; $P = 9.87 \times 10^{-12}$), nor are they correlated on a gene by gene basis (Supplemental Fig. 4; $r_S = 0.10$; $P < 2.2 \times 10^{-16}$). More importantly, functional classes enriched for high $\omega$ are rarely enriched for high $\psi$, and vice versa. This implies that positive selection amongst genes associated with a GO term predominantly works in a single modality. In contrast, many of the categories that show negative selection in both the transcriptional and protein-coding measurements, having low $\omega$ and low $\psi$, agree with perceptions of transcriptional complexity, with terms such as sensory organ development (low $\psi$: $P = 7 \times 10^{-6}$, $q < 1 \times 10^{-4}$; low $\omega$: $P = 2 \times 10^{-5}$, $q < 1 \times 10^{-4}$) and transcription factor activity (low $\psi$: $P = 5 \times 10^{-38}$, $q < 1 \times 10^{-4}$; low $\omega$: $P = 4 \times 10^{-5}$, $q = 6 \times 10^{-4}$) enriched in both modalities. As expected, there are genes showing evidence of strong transcriptional negative selection with no striking shift in protein selection, such as those associated with signal transduction (low $\psi$: $P = 8 \times 10^{-17}$, $q < 1 \times 10^{-4}$; low $\omega$: $P = 0.5$, $q = 1$), cell adhesion (low $\psi$: $P = 8 \times 10^{-10}$, $q < 1 \times 10^{-4}$; low $\omega$: $P = 1$, $q = 1$), and cell migration (low $\psi$: $P = 3 \times 10^{-5}$, $q = 2 \times 10^{-4}$; low $\omega$: $P = 0.1$, $q = 1$). Finally, gene classes enriched for more positive transcriptional selection (high $\psi$) without striking changes in protein evolution include phospholipid biosynthetic process genes (high $\psi$: $P = 2 \times 10^{-5}$, $q = 3 \times 10^{-4}$; high $\omega$: $P = 0.6$, $q = 1$) such as CEPT1 ($\psi = 2.24$; $\omega = 0.04$), and DNA repair genes (high $\psi$: $P = 3 \times 10^{-10}$, $q < 1 \times 10^{-4}$; high $\omega$: $P = 0.006$, $q = 0.07$) such as UBE2B ($\psi = 2.53$; $\omega = 0.002$).
Discussion
We have developed, assessed, and used a new series of measurements that aim to capture the effect of DNA sequence change on transcriptional regulation. Although our model crudely approximates the known complexity of this process and does not include more poorly understood processes such as TFBS turnover (Dermitzakis and Clark 2002), it is not obviously less sophisticated than the $d_N/d_S$ measurement commonly and successfully used to study protein-coding evolution. An important component to the Sunflower model is that it penalizes the creation of motifs overlapping with existing motifs. The aggregate evolutionary signature of this measurement shows an expected suppression of highly perturbing mutations in both human–dog and mouse–rat promoter comparisons. In contrast, ancestral mammalian repeats, thought to be predominantly neutral, show no difference in predicted impact between observed and unobserved mutations. The functional processes that have transcriptional sensitivity agree with preconceptions derived from our understanding of cellular and molecular biology.
Figure 4. Box plot of $\psi = d_T/d_S$ values arranged by GO cellular component term, for each term associated with significantly high (above dividing line) or low (below dividing line) $\psi$ values, as determined by the Wilcoxon rank sum test ($P < 1 \times 10^{-4}$) performed by FUNC (see Methods). The vertical bar in each box indicates the median $\psi$, the extents of each box the first and third quartiles of $\psi$, and the whiskers extend to the furthest data point that is no more than 1.5 times the interquartile range from the nearest quartile. High outliers are used in calculating statistics, but are omitted from the display for clarity.
Figure 5. Scatter plot of median $\psi = d_T/d_S$ versus median $\omega = d_N/d_S$ for the genes in 1402 GO terms. Only terms that are annotated on at least 15 genes are shown. The term has a significantly high or low value of $\psi$ (blue), $\omega$ (red), both measurements (yellow), or neither measurement (gray), as determined by FDR threshold $q < 0.05$. Labels indicate the terms with the 10 highest and 10 lowest $\psi$ values.
While PWM methods are often used to predict TF occupancy, we cannot be certain that such methods accurately estimate TF binding or transcriptional output. One major limitation of our technique is that it relies on the assumption that the PWM-based method it uses will be accurate much of the time. Another limitation is the lack of a complete set of PWMs for TFs. The development of a number of high-throughput methods for quantifying in vitro binding preferences (Mukherjee et al. 2004) will provide a larger set of matrices over time, and the integration with other methods (Ren et al. 2000; Hudson and Snyder 2006; Robertson et al. 2007; Wang et al. 2007; Jothi et al. 2008) will likely drive the library of accessible matrices closer to completion. It is interesting to note that the set of distal enhancers did not show the same separation of observed versus unobserved changes. The lack of enhancer-specific factors may well be the explanation for this result, as JASPAR’s contents have a bias toward promoter-associated TFs.
A more complex problem is how to set the entry probabilities to each petal. We have chosen to take a uniform prior as a way to handle the large diversity of cell types that vary in expression values. Potentially, one could consider integrating this signal over a variety of relative expression levels of the transcription factors at the expense of a more computationally expensive procedure. Related to this problem is the issue of redundancy in PWMs. The JASPAR database provides a curated PWM set with some efforts made toward eliminating redundancy. The JASPAR PWMs used in this study (Supplemental Table 1) include a pair of PWMs representing different motifs recognized by one protein (ZNF42_1-4, ZNF42_5-13), two motifs recognized by dimers that include one overlapping protein (HAND1-TCF3, TAL1-TCF3), and two pairs of PWMs from the same protein recognizing similar motifs but with data from different sources (NFKB, NFKB1; RORA, RORA1). In reality, some redundancy is acceptable because there are likely to be some transcription factors with similar motifs in vivo. Increasing the number of TFs considered by the model will increase the redundancy in motifs yet still more accurately model the actual processes in living cells.
It is feasible to imagine more complex impact models than the one presented here, such as considering compensatory creation of new binding sites in a TFBS turnover model, at the conceptual and computational expense that comes with more complicated models. This would be analogous to integrating structural adjacency of amino acids for protein selection. It is interesting to note that, probably due to the complexity of a more advanced model, the simpler site-wise model in protein sequences has remained the predominant evolutionary model.
The $\psi$ measurement generated using pairwise alignments shows a weak correlation between orthologs in different clades (Supplemental Fig. 5; $r_S = 0.33$; $P < 2.2 \times 10^{-16}$). This means that this measure of selection is consistent between clades, at least within mammals, although obviously it will not have the same consistency as $\omega$ measurements generated from a maximum likelihood method on a single multiple alignment. It would be possible to use the Sunflower method to find pairwise $d_T$ values for multiple species pairs from a single alignment, but effective use of multispecies alignments would require integration of the Sunflower change model during the sampling of potential ancestral sequences in the tree. This is an interesting approach that requires both more theoretical and practical work. Similar to the research arising from the $d_N/d_S$ model, the pairwise model presented here would be the starting point for that work.
The protein-coding $\omega$ measurement has the property that neutral changes are predicted (and observed) to be around $\omega = 1$, while the $\psi$ measurement does not come with such a principle for its interpretation. Much of the use of $\omega$, however, consists in averaging values over several genes, where identifying deviations from the bulk distribution (as performed in this analysis) is the primary mode of analyzing gene sets. In contrast, the $\psi$ measurement lends itself more naturally to the joint analysis of multiple changes, such as those found on haplotypes. As genome-wide association studies implicate haplotypes, and extensive resequencing (Kaiser 2008) will provide a complete set of changes on nearly all common haplotypes, a haplotype-level analysis of functional changes will become a more important form of analysis. Integration of these mechanistic models with expression quantitative trait locus studies (Veyrieras et al. 2008) in the context of complete sequencing will provide an interesting comparison.
Many of the associations of $\psi$ were expected, such as the suppression of promoter changes in signal transduction and developmental genes. The bulk suppression of $\psi$ in genes associated with extracellular components and the plasma membrane is more puzzling, in particular given the striking signals of positive selection in these proteins (Kim et al. 2007). Alternatively, inappropriate expression of many extracellular proteins may have a far more deleterious effect, given their potential to interact with other components outside of the cell. More generally, $\psi$ is not strongly correlated with the protein-coding $\omega$, showing a very different behavior of transcriptional selection compared with protein-coding selection.
In this work, we focused on an evolutionary analysis of intramammalian substitutions, although one could apply the same framework both to other clades and to other mutational processes, such as natural polymorphisms and somatic changes discovered in cancer. In the latter two cases the low rate of change will make gaining statistical power hard, just as analyzing protein changes also requires extensive aggregation of signals (Stratton et al. 2009). With the large number of sequenced genomes appropriate for this analysis and the aggressive generation of polymorphism and somatic mutation data sets, Sunflower provides a key additional tool in the interpretation of genomic sequence differences.
Methods
Posterior inference
Sunflower performs posterior inference using an algorithm we call
Sunflower-Reference. The algorithm calculates the posterior probability
$P_{k,i} = P(x_i \mid k)$ that a particular nucleotide $x_i$ was emitted by
a given state $k$ in the Sunflower model. The results are the same as
the standard Forward-Backward algorithm when the silent state is
the start state ($k_{\text{silent}} = 0$).
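As a minimal sketch of this kind of posterior computation, the standard Forward-Backward algorithm can be written for a tiny two-state model as follows. The state names, the toy probabilities, and the function name `posterior` are illustrative assumptions and are not taken from Sunflower itself, which uses one state per PWM column plus a silent state.

```python
# Hypothetical two-state toy model (not Sunflower's actual parameters).
STATES = ("silent", "bound")
START = {"silent": 0.9, "bound": 0.1}
TRANS = {
    "silent": {"silent": 0.9, "bound": 0.1},
    "bound": {"silent": 0.2, "bound": 0.8},
}
EMIT = {
    "silent": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "bound": {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
}


def posterior(seq, states, start, trans, emit):
    """Return post[i][k], the probability that state k emitted seq[i] given seq."""
    n = len(seq)
    # Forward pass: fwd[i][k] = P(seq[:i+1], state at position i is k).
    fwd = [{k: start[k] * emit[k][seq[0]] for k in states}]
    for i in range(1, n):
        fwd.append({
            k: emit[k][seq[i]] * sum(fwd[i - 1][j] * trans[j][k] for j in states)
            for k in states
        })
    # Backward pass: bwd[i][k] = P(seq[i+1:] | state at position i is k).
    bwd = [dict.fromkeys(states, 1.0) for _ in range(n)]
    for i in range(n - 2, -1, -1):
        for k in states:
            bwd[i][k] = sum(
                trans[k][j] * emit[j][seq[i + 1]] * bwd[i + 1][j] for j in states
            )
    total = sum(fwd[n - 1][k] for k in states)  # likelihood of the whole sequence
    return [{k: fwd[i][k] * bwd[i][k] / total for k in states} for i in range(n)]


if __name__ == "__main__":
    for i, column in enumerate(posterior("CCAAACC", STATES, START, TRANS, EMIT)):
        print(i, round(column["bound"], 3))
```

This sketch multiplies raw probabilities, so it would underflow on long sequences; a practical implementation would rescale each column or work in log space.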
Posterior inference is equivalent to tracing all of the pathways
through this model that can emit a single sequence, and estimat-
ing the posterior probability that the model is in each of the states
at each position of that sequence. Underlying the model is the
physical mechanism that transcription factors are continuously
binding and leaving chromosomal sequences, at a rate related to
their affinity for the sequence. The statistical mechanics of the
biophysical model are approximated by the probabilities that
a transcription factor is bound in the sequence model. Indeed,
PWMs, which are frequently thought of as purely probabilistic
concepts, were originally proposed as part of a statistical me-
chanics model (Berg and von Hippel 1987).
The new parameters in Sunflower-Reference allow two optimizations.
The first is the use of connection set vectors $c_f$ and $c_b$,
which contain information about which states are connected to
which other states, relieving the algorithm from the necessity in
each round of doing calculations involving transition probabilities
of zero. The other optimization is that one can specify a calculation
starting position $i$ to indicate that the intermediate forward matrix
$F$ and backward matrix $B$ have already been partially calculated,
such that recalculation is only necessary in the forward direction
for values $> i$, and in the reverse direction for values $< i$.
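The first optimization can be sketched as follows. Reading $c_f$ as the list of successor states and $c_b$ as the list of predecessor states is an assumption made for illustration, as are the function names and the three-state toy transition table; the paper does not spell out the data layout.

```python
def connection_sets(trans):
    """Build successor (c_f) and predecessor (c_b) lists from nonzero transitions."""
    states = list(trans)
    c_f = {j: [k for k in states if trans[j].get(k, 0.0) > 0.0] for j in states}
    c_b = {k: [j for j in states if trans[j].get(k, 0.0) > 0.0] for k in states}
    return c_f, c_b


def forward_step(prev, trans, emit_col, c_b):
    """Compute one forward column, summing only over connected predecessors."""
    return {
        k: emit_col[k] * sum(prev[j] * trans[j][k] for j in c_b[k])
        for k in c_b
    }


# Hypothetical model: a silent state and a two-column motif (m1 then m2).
TRANS = {
    "silent": {"silent": 0.9, "m1": 0.1},
    "m1": {"m2": 1.0},
    "m2": {"silent": 1.0},
}
C_F, C_B = connection_sets(TRANS)
```

With many motif states, most transition probabilities are zero, so iterating over `c_b[k]` instead of all states avoids most of the multiplications in each column.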
We wrote Sunflower in the Python language (van Rossum
2006) and inner loops in the C language (Kernighan and Ritchie
1988) for speed.
Comparing joint and single-motif models
We compared the joint model used in the rest of this work with
single-motif control models for each motif $m$ to investigate the
importance of the domino effect. After performing posterior inference
on all of the models, we derived a two-state probability
distribution for each motif from the joint model at each position
by taking $P(x_i \mid m)$ and $1 - P(x_i \mid m)$. We then compared each
probability distribution generated from the joint model $P_{\text{joint}}$ with
the equivalent probability distributions from the single-motif
model $P_{\text{single}}$ by taking the relative entropy
$H(P_{\text{joint}} \,\|\, P_{\text{single}})$.
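The comparison can be illustrated with a short sketch of relative entropy between two-state distributions. The logarithm base (bits here), the function names, and the example probabilities are assumptions made for illustration; the second distribution is assumed to be strictly positive wherever the first is.

```python
import math


def two_state(prob):
    """Turn a binding probability into the two-state distribution (bound, not bound)."""
    return (prob, 1.0 - prob)


def relative_entropy(p, q):
    """H(p || q) in bits for two discrete distributions over the same outcomes."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Illustrative values: a motif's posterior under a joint and a single-motif model.
p_joint = two_state(0.9)
p_single = two_state(0.6)
print(relative_entropy(p_joint, p_single))
```

Identical distributions give a relative entropy of zero, and the value grows as the joint model moves probability away from the single-motif prediction.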
The binding shift $\tau$
The algorithm used to investigate the effects of mutations can be
described simply. First, run the Sunflower-Reference algorithm
with a Sunflower model and a nucleic acid sequence to get the
posterior probability matrix $P$. Then use the Sunflower-Mutate
algorithm to calculate the relative entropy $H(P \,\|\, P') = \tau$ for each
position $i$ and each nucleotide $a \in A = \{\mathrm{A}, \mathrm{C}, \mathrm{G}, \mathrm{T}\}$:
Sunflower-Mutate($A$, $E$, $X = (x_1 \ldots x_n)$, $F = (f_{k,x})_{m \times n}$, $B$, $P$)
    $X' = (x'_1 \ldots x'_n) \leftarrow X$
    $F' = (f'_{k,x})_{m \times n} \leftarrow F$
    for $i \leftarrow 1$ to $n$
        do for each $a$ in $A$
               do if $a = x_i$
                      then $\tau_{i-1,a} \leftarrow 0.0$
                      else $x'_i \leftarrow a$
                           $P' \leftarrow$ Sunflower-Reference($A$, $E$, $X'$, $F'$, $B$, $i$)
                           $\tau_{i-1,a} \leftarrow H(P \,\|\, P')$
           $x'_i \leftarrow x_i$
           $f'_{i-1} \leftarrow f_{i-1}$
    return $T = (\tau_{i,x})_{n \times |A|}$
This algorithm includes a significant optimization over the naive
implementation, because it uses the three extra arguments
in Sunflower-Reference to avoid rerunning the whole Forward-Backward
algorithm each time. Only those columns $j$ of the forward
matrix where $j \geq i$ and the backward matrix where $j \leq i$ are
recalculated, as the left and right partitions of these two matrices,
respectively, would have the same value as when calculated from
the reference sequence.
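The mutation scan and its partial recomputation can be sketched as below. This is an illustration under stated assumptions rather than Sunflower's implementation: the two-state `MODEL`, all function names, and the choice to define $H(P \,\|\, P')$ as the sum of per-position relative entropies in bits are hypothetical. With the backward matrix as defined in this sketch, column $i$ itself does not change, so recomputing $j \leq i$ is a harmless superset.

```python
import math

# Hypothetical toy model: (states, initial, transition, emission) probabilities.
MODEL = (
    ("silent", "bound"),
    {"silent": 0.9, "bound": 0.1},
    {"silent": {"silent": 0.9, "bound": 0.1}, "bound": {"silent": 0.2, "bound": 0.8}},
    {"silent": dict.fromkeys("ACGT", 0.25),
     "bound": {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}},
)


def forward(seq, model, reuse=(), start=0):
    """Forward columns; columns before `start` are copied from `reuse`."""
    states, init, trans, emit = model
    fwd = list(reuse[:start])
    for j in range(start, len(seq)):
        fwd.append({
            k: emit[k][seq[j]] * (
                init[k] if j == 0
                else sum(fwd[j - 1][s] * trans[s][k] for s in states)
            )
            for k in states
        })
    return fwd


def backward(seq, model, reuse=(), stop=None):
    """Backward columns; columns after `stop` are copied from `reuse`."""
    states, init, trans, emit = model
    n = len(seq)
    stop = n - 1 if stop is None else stop
    bwd = [None] * (stop + 1) + list(reuse[stop + 1:])
    for j in range(stop, -1, -1):
        bwd[j] = {
            k: 1.0 if j == n - 1
            else sum(trans[k][s] * emit[s][seq[j + 1]] * bwd[j + 1][s] for s in states)
            for k in states
        }
    return bwd


def posterior(fwd, bwd, states):
    """Combine forward and backward columns into posterior probabilities."""
    total = sum(fwd[-1].values())
    return [{k: f[k] * b[k] / total for k in states} for f, b in zip(fwd, bwd)]


def shift(p, q, states):
    """Summed per-position relative entropy between two posterior matrices (bits)."""
    return sum(
        pc[k] * math.log2(pc[k] / qc[k])
        for pc, qc in zip(p, q) for k in states if pc[k] > 0.0
    )


def mutate_scan(seq, model, alphabet="ACGT"):
    """Return tau[i][a]: the binding shift of substituting nucleotide a at position i."""
    states = model[0]
    ref_f, ref_b = forward(seq, model), backward(seq, model)
    ref_p = posterior(ref_f, ref_b, states)
    tau = []
    for i, x in enumerate(seq):
        row = {}
        for a in alphabet:
            if a == x:
                row[a] = 0.0  # the reference nucleotide causes no shift
                continue
            mut = seq[:i] + a + seq[i + 1:]
            f = forward(mut, model, ref_f, start=i)   # recompute columns j >= i
            b = backward(mut, model, ref_b, stop=i)   # recompute columns j <= i
            row[a] = shift(ref_p, posterior(f, b, states), states)
        tau.append(row)
    return tau
```

Calling `mutate_scan("CCAAACC", MODEL)` returns one dictionary per position, with a zero entry for the reference nucleotide and a nonnegative shift for each substitution.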
Sunflower avoids the binary classification of binding and
thresholds commonly used in TFBS finders, as they are not essential
to the biology of transcription factor binding (Roider et al. 2007),
and uses a probabilistic model instead. The result of Sunflower-
Reference is a two-dimensional matrix of the posterior probabili-
ties defined at each position for each PWM column of all TFs in the
input set. These values specify how likely it is that a particular TF
binds to a particular string of positions.
The promoter distance $d_T$
One can think of the binding shift measurement $\tau$ introduced
above as a measurement of the synonymity of a particular nucleotide.
To get a measurement of the potential disruption in TF
binding for a gene, $T$, similar to the total number of nonsynonymous
nucleotides, $N$, one first must select a region of interest.
We limit our inspection to only those nucleotides we are most sure
have an effect on transcripts by selecting the region $[-100, +100)$
relative to the TSS. These are the nucleotides where $\tau$ is highest on
average. If the value of $T$ is used for further comparisons to an
aligned sequence, then we exclude positions that do not align.
We use $P$ to refer to the set of included positions in the region of
interest.
Inspired by the logic used by Nei and Gojobori (1986) to assign
a fractional synonymity to protein-coding nucleotides that
are only partially degenerate, we consider the average binding shift
from the reference nucleotide to all other possibilities as a measurement
of the potential disruption for that nucleotide. Summing
the values for all these nucleotides, and dividing by 3, the number
of different possible substitutions, we get
$$T = \frac{1}{3} \sum_{i \in P} \sum_{a \in A} \tau_{i,a}$$
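A small worked example of this sum, using made-up binding shift values, is sketched below. The function name `potential_disruption` and the list-of-dictionaries layout for $\tau$ are assumptions chosen for illustration.

```python
def potential_disruption(tau, positions):
    """T: one third of the summed binding shifts over the included positions."""
    return sum(sum(tau[i].values()) for i in positions) / 3.0


# Illustrative values only: tau[i][a] is the binding shift for nucleotide a at position i.
tau = [
    {"A": 0.0, "C": 0.25, "G": 0.5, "T": 0.75},  # reference nucleotide A
    {"A": 1.5, "C": 0.0, "G": 0.75, "T": 0.75},  # reference nucleotide C
]
print(potential_disruption(tau, positions=[0, 1]))  # 1.5
```

Because the shift for the reference nucleotide is zero, summing over all four nucleotides and dividing by 3 gives the average shift over the three possible substitutions at each position.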
To compare a human promoter with sequence $X = (x_0 \ldots x_n)$
with the promoter in a related species, we limit to only those
alignments of upstream regions of Ensembl orthologs with fewer
than 25% gap columns. We call the sequence in the other species
$Y = (y_0 \ldots y_n)$, and the two sequences align at each position $i$. With $Y$,
we can define the amount of observed binding profile disruption
$$T_d = \sum_{i \in P} \tau_{i,y_i}$$
Since $\tau_{i,y_i} = 0$ whenever $x_i = y_i$, $T_d$ is nonzero only at positions where
the two sequences differ. While $\tau_{i,y_i}$ may be larger than the average
$\tau$ for any given position, this is unlikely to be true across the whole
gene.
Using $T$ and $T_d$, we can calculate a proportion of binding
profile disruption
$$p_T = \frac{T_d}{T},$$
analogous to $p_N$ and $p_S$ (Nei and Gojobori 1986). We use the Jukes–Cantor
equation (Jukes and Cantor 1964), which transforms a proportion
such as this into a distance measurement
$$d_T = -\frac{3}{4} \ln\left(1 - \frac{4}{3} p_T\right).$$
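The chain from observed disruption to distance can be sketched as follows. The function names, the toy $\tau$ values, and the two-nucleotide sequences are hypothetical, and raising an error when $p_T \geq 3/4$ is a design choice of this sketch rather than something the text specifies.

```python
import math


def observed_disruption(tau, other, positions):
    """T_d: sum of tau[i][y_i] over the included positions."""
    return sum(tau[i][other[i]] for i in positions)


def promoter_distance(t_d, t):
    """Jukes-Cantor correction of p_T = T_d / T into the distance d_T."""
    p_t = t_d / t
    if p_t >= 0.75:
        raise ValueError("p_T >= 3/4: the Jukes-Cantor correction is undefined")
    return -0.75 * math.log(1.0 - 4.0 * p_t / 3.0)


# Illustrative values only: the reference sequence is "AC", the other species has "GC".
tau = [
    {"A": 0.0, "C": 0.25, "G": 0.5, "T": 0.75},
    {"A": 1.5, "C": 0.0, "G": 0.75, "T": 0.75},
]
t = sum(sum(row.values()) for row in tau) / 3.0
t_d = observed_disruption(tau, "GC", positions=[0, 1])
print(promoter_distance(t_d, t))
```

As with nucleotide distances, the corrected value exceeds the raw proportion, and the gap widens as $p_T$ approaches 3/4.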
Gene Ontology enrichment analysis
We use FUNC (Prüfer et al. 2007) to determine GO terms enriched
for a particular gene set (hypergeometric test) or for low or high
values of various measurements associated with genes (Wilcoxon
rank sum test). Considering the genes in a specified set as marked,
the hypergeometric test compares the number of marked genes
associated with a GO term with the number of marked genes as-
sociated with any term in a specific ontology. The Wilcoxon rank
sum test involves rank-ordering genes by a measurement, and then
comparing the ranks of the genes associated with one GO term
with the ranks of the other genes associated with any other term in
a specific ontology. We use the false discovery rate (FDR) reported
by FUNC as an FDR threshold $q$ (Storey and Tibshirani 2003).
To alleviate the multiple testing problem, we consider a term to
be enriched only when $q < 0.05$. For measurements that involve
alignments of potential transcriptional regulatory regions, we in-
clude in the analysis only those sequences where fewer than 50 of
the pairwise alignment columns include gaps.
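To make the hypergeometric comparison concrete, the following sketch computes an upper-tail probability for one GO term. It is not FUNC's implementation: the function name, argument names, and example counts are hypothetical, and no multiple-testing correction is applied.

```python
from math import comb


def hypergeom_upper_tail(k, marked, term, total):
    """P(X >= k), where X counts marked genes among the `term` genes of one GO term.

    `total` genes are annotated in the ontology and `marked` of them are in the set.
    """
    hi = min(marked, term)
    numerator = sum(
        comb(marked, x) * comb(total - marked, term - x) for x in range(k, hi + 1)
    )
    return numerator / comb(total, term)


# Illustrative counts: 10 annotated genes, 5 marked, a term with 3 genes, all marked.
print(hypergeom_upper_tail(3, marked=5, term=3, total=10))
```

A small probability indicates that the term contains more marked genes than expected by chance; the resulting values across all terms would then need an FDR procedure such as the one reported by FUNC.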
Acknowledgments
This material is based upon work supported under a National
Science Foundation Graduate Research Fellowship. We thank
Kathryn Beal and Michael Schuster for providing data used in the
analysis. We thank Benedict Paten and three anonymous reviewers
for helpful comments on the manuscript.
References
Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD, Amanatides PG,
Scherer SE, Li PW, Hoskins RA, Galle RF, et al. 2000. The genome
sequence of Drosophila melanogaster. Science 287: 2185–2195.
Berg OG, von Hippel PH. 1987. Selection of DNA binding sites by regulatory
proteins. Statistical-mechanical theory and application to operators and
promoters. J Mol Biol 193: 723–750.
Boyko AR, Williamson SH, Indap AR, Degenhardt JD, Hernandez RD,
Lohmueller KE, Adams MD, Schmidt S, Sninsky JJ, Sunyaev SR, et al.
2008. Assessing the evolutionary impact of amino acid mutations in the
human genome. PLoS Genet 4: e1000083. doi: 10.1371/journal.pgen.1000083.
Bulyk ML. 2006. DNA microarray technologies for measuring protein–DNA
interactions. Curr Opin Biotechnol 17: 422–430.
C. elegans Sequencing Consortium. 1998. Genome sequence of the nematode
C. elegans: A platform for investigating biology. Science 282: 2012–2018.
Cleveland WS, Devlin SJ. 1988. Locally weighted regression: An approach to
regression analysis by local fitting. J Am Stat Assoc 83: 596–610.
Dermitzakis ET, Clark AG. 2002. Evolution of transcription factor binding
sites in mammalian gene regulatory regions: Conservation and
turnover. Mol Biol Evol 19: 1114–1121.
Dermitzakis ET, Bergman CM, Clark AG. 2003. Tracing the evolutionary
history of Drosophila regulatory regions with models that identify
transcription factor binding sites. Mol Biol Evol 20: 703–714.
Durbin R, Eddy SR, Krogh A, Mitchison G. 1998. Biological sequence analysis,
1st ed. Cambridge University Press, Cambridge.
Eyre-Walker A. 2006. The genomic rate of adaptive evolution. Trends Ecol
Evol 21: 569–575.
Foat BC, Morozov AV, Bussemaker HJ. 2006. Statistical mechanical modeling
of genome-wide transcription factor occupancy data by MatrixREDUCE.
Bioinformatics 22: e141–e149.
Gene Ontology Consortium. 2006. The Gene Ontology (GO) project in
2006. Nucleic Acids Res 34: D322–D326.
Goldman N, Yang Z. 1994. A codon-based model of nucleotide substitution
for protein-coding DNA sequences. Mol Biol Evol 11: 725–736.
Granek JA, Clarke ND. 2005. Explicit equilibrium modeling of transcription-
factor binding and gene regulation. Genome Biol 6: R87. doi: 10.1186/
gb-2005-6-10-r87.
Haygood R, Fedrigo O, Hanson B, Yokoyama KD, Wray GA. 2007. Promoter
regions of many neural- and nutrition-related genes have experienced
positive selection during human evolution. Nat Genet 39: 1140–1144.
Hoffman MM, Birney E. 2007. Estimating the neutral rate of nucleotide
substitution using introns. Mol Biol Evol 24: 522–531.
Hudson ME, Snyder M. 2006. High-throughput methods of regulatory
element discovery. Biotechniques 41: 673–681.
Hurst LD. 2002. The Ka/Ks ratio: Diagnosing the form of sequence
evolution. Trends Genet 18: 486–487.
Jothi R, Cuddapah S, Barski A, Cui K, Zhao K. 2008. Genome-wide
identification of in vivo protein-DNA binding sites from ChIP-seq data.
Nucleic Acids Res 36: 5221–5231.
Jukes TH, Cantor CR. 1964. Evolution of protein molecules. In Mammalian
protein metabolism (ed. HN Munro, JB Allison), pp. 21–132. Academic
Press, New York.
Kaiser J. 2008. A plan to capture human diversity in 1000 genomes. Science
319: 395.
Kernighan BW, Ritchie DM. 1988. The C programming language, 2nd ed.
Prentice Hall, Englewood Cliffs, NJ.
Kim PM, Korbel JO, Gerstein MB. 2007. Positive selection at the protein
network periphery: Evaluation in terms of structural constraints and
cellular context. Proc Natl Acad Sci 104: 20274–20279.
King MC, Wilson AC. 1975. Evolution at two levels in humans and
chimpanzees. Science 188: 107–116.
Kosiol C, Holmes I, Goldman N. 2007. An empirical codon model for
protein sequence evolution. Mol Biol Evol 24: 1464–1479.
Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K,
Dewar K, Doyle M, FitzHugh W, et al. 2001. Initial sequencing and
analysis of the human genome. Nature 409: 860–921.
Li WH, Wu CI, Luo CC. 1985. A new method for estimating synonymous
and non-synonymous rates of nucleotide substitution considering
the relative likelihood of nucleotide and codon changes. Mol Biol Evol
2: 150–174.
Manke T, Roider HG, Vingron M. 2008. Statistical modeling of transcription
factor binding affinities predicts regulatory interactions. PLoS Comput
Biol 4: e1000039. doi: 10.1371/journal.pcbi.1000039.
Massingham T, Goldman N. 2005. Detecting amino acid sites under positive
selection and purifying selection. Genetics 169: 1753–1762.
Matys V, Kel-Margoulis OV, Fricke E, Liebich I, Land S, Barre-Dirrie A, Reuter
I, Chekmenev D, Krull M, Hornischer K, et al. 2006. TRANSFAC and
its module TRANSCompel: Transcriptional gene regulation in
eukaryotes. Nucleic Acids Res 34: D108–D110.
McDonald JH, Kreitman M. 1991. Adaptive protein evolution at the Adh
locus in Drosophila. Nature 351: 652–654.
Miyata T, Yasunaga T. 1980. Molecular evolution of mRNA: A method for
estimating evolutionary rates of synonymous and amino acid
substitutions from homologous nucleotide sequences and its
application. J Mol Evol 16: 23–36.
Moses AM, Chiang DY, Kellis M, Lander ES, Eisen MB. 2003. Position specific
variation in the rate of evolution in transcription factor binding sites.
BMC Evol Biol 3: 19. doi: 10.1186/1471-2148-3-19.
Moses AM, Chiang DY, Pollard DA, Iyer VN, Eisen MB. 2004. MONKEY:
Identifying conserved transcription-factor binding sites in multiple
alignments using a binding site-specific evolutionary model. Genome
Biol 5: R98. doi: 10.1186/gb-2004-5-12-r98.
Mouse Genome Sequencing Consortium. 2002. Initial sequencing and
comparative analysis of the mouse genome. Nature 420: 520–562.
Mukherjee S, Berger MF, Jona G, Wang XS, Muzzey D, Snyder M, Young RA,
Bulyk ML. 2004. Rapid analysis of the DNA-binding specificities
of transcription factors with DNA microarrays. Nat Genet 36:
1331–1339.
Mustonen V, Kinney J, Callan CG, Lässig M. 2008. Energy-dependent
fitness: A quantitative model for the evolution of yeast transcription
factor binding sites. Proc Natl Acad Sci 105: 12376–12381.
Nei M, Gojobori T. 1986. Simple methods for estimating the numbers of
synonymous and nonsynonymous nucleotide substitutions. Mol Biol
Evol 3: 418–426.
Nei M, Kumar S. 2000. Molecular evolution and phylogenetics. Oxford
University Press, Oxford, UK.
Nielsen R. 2001. Statistical tests of selective neutrality in the age of
genomics. Heredity 86: 641–647.
Paten B, Herrero J, Beal K, Fitzgerald S, Birney E. 2008. Enredo and Pecan:
Genome-wide mammalian consistency-based multiple alignment with
paralogs. Genome Res 18: 1814–1828.
Pennacchio LA, Ahituv N, Moses AM, Prabhakar S, Nobrega MA, Shoukry M,
Minovitsky S, Dubchak I, Holt A, Lewis KD, et al. 2006. In vivo
enhancer analysis of human conserved non-coding sequences. Nature
444: 499–502.
Perler F, Efstratiadis A, Lomedico P, Gilbert W, Kolodner R, Dodgson J. 1980.
The evolution of genes: The chicken preproinsulin gene. Cell 20:
555–566.
Prüfer K, Muetzel B, Do HH, Weiss G, Khaitovich P, Rahm E, Pääbo S,
Lachmann M, Enard W. 2007. FUNC: A package for detecting significant
associations between gene sets and ontological annotations. BMC
Bioinformatics 8: 41. doi: 10.1186/1471-2105-8-41.
R Development Core Team. 2007. R: A language and environment for
statistical computing. R Foundation for Statistical Computing, Vienna,
Austria.
Rajewsky N, Vergassola M, Gaul U, Siggia ED. 2002. Computational
detection of genomic cis-regulatory modules applied to body patterning
in the early Drosophila embryo. BMC Bioinformatics 3: 30. doi: 10.1186/
1471-2105-3-30.
Ray P, Shringarpure S, Kolar M, Xing EP. 2008. CSMET: Comparative
genomic motif detection via multi-resolution phylogenetic shadowing.
PLoS Comput Biol 4: e1000090. doi: 10.1371/journal.pcbi.1000090.
Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H,
Shapero MH, Carson AR, Chen W, et al. 2006. Global variation in copy
number in the human genome. Nature 444: 444–454.
Ren B, Robert F, Wyrick JJ, Aparicio O, Jennings EG, Simon I, Zeitlinger J,
Schreiber J, Hannett N, Kanin E, et al. 2000. Genome-wide location and
function of DNA binding proteins. Science 290: 2306–2309.
Robertson G, Hirst M, Bainbridge M, Bilenky M, Zhao Y, Zeng T, Euskirchen
G, Bernier B, Varhol R, Delaney A, et al. 2007. Genome-wide profiles
of STAT1 DNA association using chromatin immunoprecipitation
and massively parallel sequencing. Nat Methods 4: 651–657.
Roider HG, Kanhere A, Manke T, Vingron M. 2007. Predicting transcription
factor affinities to DNA from a biophysical model. Bioinformatics 23:
134–141.
Salser W, Bowen S, Browne D, el Adli F, Fedoroff N, Fry K, Heindell H,
Paddock G, Poon R, Wallace B, et al. 1976. Investigation of the
organization of mammalian chromosomes at the DNA sequence level.
Fed Proc 35: 23–35.
Segal E, Raveh-Sadka T, Schroeder M, Unnerstall U, Gaul U. 2008. Predicting
expression patterns from regulatory sequence in Drosophila
segmentation. Nature 451: 535–540.
Sinha S. 2006. On counting position weight matrix matches in a sequence,
with application to discriminative motif finding. Bioinformatics 22:
e454–e463.
Storey JD, Tibshirani R. 2003. Statistical significance for genomewide
studies. Proc Natl Acad Sci 100: 9440–9445.
Stratton MR, Campbell PJ, Futreal PA. 2009. The cancer genome. Nature 458:
719–724.
Sunyaev S, Ramensky V, Koch I, Lathe W, Kondrashov AS, Bork P. 2001.
Prediction of deleterious human alleles. Hum Mol Genet 10: 591–
597.
Taylor MS, Kai C, Kawai J, Carninci P, Hayashizaki Y, Semple CAM. 2006.
Heterotachy in mammalian promoter evolution. PLoS Genet 2: e30. doi:
10.1371/journal.pgen.0020030.
van Rossum G. 2006. Python reference manual. Python Software Foundation.
Hampton, NH. http://docs.python.org/release/2.5/ref/.
Veyrieras JB, Kudaravalli S, Kim SY, Dermitzakis ET, Gilad Y, Stephens M,
Pritchard JK. 2008. High-resolution mapping of expression-QTLs yields
insight into human gene regulation. PLoS Genet 4: e1000214. doi:
10.1371/journal.pgen.1000214.
Vlieghe D, Sandelin A, De Bleser PJ, Vleminckx K, Wasserman WW, van Roy
F, Lenhard B. 2006. A new generation of JASPAR, the open-access
repository for transcription factor binding site profiles. Nucleic Acids Res
34: D95–D97.
Wang H, Johnston M, Mitra RD. 2007. Calling cards for DNA-binding
proteins. Genome Res 17: 1202–1209.
Wasson T, Hartemink AJ. 2009. An ensemble model of competitive multi-
factor binding of the genome. Genome Res 19: 2101–2112.
Wong WSW, Nielsen R. 2004. Detecting selection in noncoding regions of
nucleotide sequences. Genetics 167: 949–958.
Received June 1, 2009; accepted in revised form February 9, 2010.
Hoffman and Birney
692 Genome Research
www.genome.org
... In addition to DNase digestion patterns, more detailed modeling of sequence preference information has been used in TFBS identification. Hoffman and Birney (2010) have previously proposed a hidden Markov model (HMM)-based method, termed Sunflower, to predict TFBSs based solely on sequence data. Instead of scanning for motif sequences directly, this model takes into consideration the competition between multiple TFs to provide a binding profile for all factors included in the model. ...
... In addition, adding extra motifs to the model for a specific TF can potentially increase the accuracy of identifying TF-specific binding sites. These additional motifs serve as baits, discouraging the prediction of weakly matching sites and introducing competition, thus decreasing FPRs (Hoffman and Birney 2010). However, including PWMs with similar sequence preference does not provide useful information and could decrease our model's ability to distinguish between binding sites of different motifs. ...
Article
Full-text available
Transcription is tightly regulated by cis-regulatory DNA elements where transcription factors can bind. Thus, identification of transcription factor binding sites (TFBSs) is key to understanding gene expression and whole regulatory networks within a cell. The standard approaches used for TFBS prediction, such as position weight matrices (PWMs) and chromatin immunoprecipitation followed by sequencing (ChIP-seq), are widely used, but have their drawbacks including high false positive rates and limited antibody availability, respectively. Several computational footprinting algorithms have been developed to detect TFBSs by investigating chromatin accessibility patterns, however these also have limitations. We have developed a footprinting method to predict Transcription factor footpRints in Active Chromatin Elements (TRACE) to improve the prediction of TFBS footprints. TRACE incorporates DNase-seq data and PWMs within a multivariate Hidden Markov Model (HMM) to detect footprint-like regions with matching motifs. TRACE is an unsupervised method that accurately annotates binding sites for specific TFs automatically with no requirement for pre-generated candidate binding sites or ChIP-seq training data. Compared to published footprinting algorithms, TRACE has the best overall performance with the distinct advantage of targeting multiple motifs in a single model.
... In choosing to quantify variant effects on TF binding in terms of affinity changes, we were attracted by the direct biological interpretability of this metric. A complementary strategy to score TF affinity at CRM level is provided by hidden Markov models (HMMs) (90)(91)(92). HMM-based frameworks can be useful, for example, for modelling effects of TF cooperativity (90,91), which could be incorporated into future variant prioritization frameworks. Machine learning algorithms, and particularly deep neural networks, may potentially model even more complex relationships between DNA sequence and TF binding (68,(93)(94)(95), although typically at the expense of direct biological interpretability. ...
... A complementary strategy to score TF affinity at CRM level is provided by hidden Markov models (HMMs) (90)(91)(92). HMM-based frameworks can be useful, for example, for modelling effects of TF cooperativity (90,91), which could be incorporated into future variant prioritization frameworks. Machine learning algorithms, and particularly deep neural networks, may potentially model even more complex relationships between DNA sequence and TF binding (68,(93)(94)(95), although typically at the expense of direct biological interpretability. ...
Article
Full-text available
Identifying DNA cis-regulatory modules (CRMs) that control the expression of specific genes is crucial for deciphering the logic of transcriptional control. Natural genetic variation can point to the possible gene regulatory function of specific sequences through their allelic associations with gene expression. However, comprehensive identification of causal regulatory sequences in brute-force association testing without incorporating prior knowledge is challenging due to limited statistical power and effects of linkage disequilibrium. Sequence variants affecting transcription factor (TF) binding at CRMs have a strong potential to influence gene regulatory function, which provides a motivation for prioritizing such variants in association testing. Here, we generate an atlas of CRMs showing predicted allelic variation in TF binding affinity in human lymphoblastoid cell lines and test their association with the expression of their putative target genes inferred from Promoter Capture Hi-C and immediate linear proximity. We reveal >1300 CRM TF-binding variants associated with target gene expression, the majority of them undetected with standard association testing. A large proportion of CRMs showing associations with the expression of genes they contact in 3D localize to the promoter regions of other genes, supporting the notion of 'epromoters': dual-action CRMs with promoter and distal enhancer activity.
... In addition to DNase digestion patterns, more detailed modeling of sequence preference information has been used in TFBSs identification. Hoffman and Birney (2010) have previously proposed an Hidden Markov Model (HMM)-based method, Sunflower, to predict TFBSs based on sequence data alone. Instead of scanning for motif sequences directly, this model takes into consideration the competition between multiple TFs to provide a binding profile for all factors included in the model. ...
... In addition, adding extra motifs to the model for a specific TF can potentially increase the accuracy of identifying TF-specific binding sites. The additional motifs introduced in the model work as baits, discouraging prediction of weakly matching sites and introducing competition into the model, thus decreasing the false positive rates (Hoffman and Birney 2010). ...
Preprint
Full-text available
Transcription is tightly regulated by cis-regulatory DNA elements where transcription factors can bind. Thus, identification of transcription factor binding sites is key to understanding gene expression and whole regulatory networks within a cell. The standard approaches for TFBSs prediction such as position weight matrices (PWMs) and chromatin immunoprecipitation followed by sequencing (ChIP-seq) are widely used but have their drawbacks such as high false positive rates and limited antibody availability, respectively. Several computational footprinting algorithms have been developed to detect TFBSs by investigating chromatin accessibility patterns, but also have their limitations. To improve on these methods, we have developed a footprinting method to predict Transcription factor footpRints in Active Chromatin Elements (TRACE). Trace incorporates DNase-seq data and PWMs within a multivariate Hidden Markov Model (HMM) to detect footprint-like regions with matching motifs. Trace is an unsupervised method that accurately annotates binding sites for specific TFs automatically with no requirement on pre-generated candidate binding sites or ChIP-seq training data. Compared to published footprinting algorithms, TRACE has the best overall performance with the distinct advantage of targeting multiple motifs in a single model.
... One common method of identifying where TFs bind is to search a DNA sequence for TF binding site motifs, as specified by position weight matrices (PWMs) (2). Frequently, PWMs are used alongside statistical thermodynamic-based methods to incorporate additional properties influencing TF binding, such as TF concentration and spatial hindrance between TFs (3)(4)(5)(6)(7)(8)(9)(10)(11)(12)(13)(14). ...
... The underlying principle behind this assumption is that the cell operates as a well-stirred reactor, which can lead to misleading results because of rapid TF-DNA rebindings (40). Alternatively, the binding of TFs has been predicted by scanning the DNA for a PWM and then calculating the probability of binding using a statistical thermodynamic framework to take into account TF concentration (3)(4)(5)(7)(8)(9) and steric hindrance on the DNA (6,(10)(11)(12)14). However, these models assume that TFs are bound at thermodynamic equilibrium, even though thermodynamic equilibrium might not be reached in the time frame of a cell cycle. ...
Article
Full-text available
Site-specific transcription factors (TFs) bind to their target sites on the DNA, where they regulate the rate at which genes are transcribed. Bacterial TFs undergo facilitated diffusion (a combination of 3D diffusion around and 1D random walk on the DNA) when searching for their target sites. Using computer simulations of this search process, we show that the organisation of the binding sites, in conjunction with TF copy number and binding site affinity, plays an important role in determining not only the steady state of promoter occupancy, but also the order at which TFs bind. These effects can be captured by facilitated diffusion-based models, but not by standard thermodynamics. We show that the spacing of binding sites encodes complex logic, which can be derived from combinations of three basic building blocks: switches, barriers and clusters, whose response alone and in higher orders of organisation we characterise in detail. Effective promoter organizations are commonly found in the E. coli genome and are highly conserved between strains. This will allow studies of gene regulation at a previously unprecedented level of detail, where our framework can create testable hypothesis of promoter logic.
... Promoters have been shown to exhibit higher levels of sequence divergence than surrounding regions in insects, possibly associated with an increased mutation rate [40]. Changes in promoters tend to be neutral [41], consistent with our findings. They have also been shown to evolve more slowly than enhancers in mammals [22]. ...
Article
Full-text available
Rapid enhancer and slow promoter evolution have been demonstrated through comparative genomics. However, it is not clear how this information is encoded genetically and if this can be used to place evolution in a predictive context. Part of the challenge is that our understanding of the potential for regulatory evolution is biased primarily toward natural variation or limited experimental perturbations. Here, to explore the evolutionary capacity of promoter variation, we surveyed an unbiased mutation library for three promoters in Drosophila melanogaster . We found that mutations in promoters had limited to no effect on spatial patterns of gene expression. Compared to developmental enhancers, promoters are more robust to mutations and have more access to mutations that can increase gene expression, suggesting that their low activity might be a result of selection. Consistent with these observations, increasing the promoter activity at the endogenous locus of shavenbaby led to increased transcription yet limited phenotypic changes. Taken together, developmental promoters may encode robust transcriptional outputs allowing evolvability through the integration of diverse developmental enhancers. This article is part of the theme issue ‘Interdisciplinary approaches to predicting evolutionary biology’.
... The continued genome and transcriptome sequencing of a wider range of species and the phylogenetic analysis of promoter, or conserved noncoding regions may offer a feasible alternative. Recent advances in models analyzing selection on promoter regions (Hoffman and Birney 2010) and tests of gene-phenotype associations in noncoding sequences (O'Connor and Mundy 2013) provide useful tools to pursue these aims. ...
Article
Full-text available
The adaptive significance of human brain evolution has been frequently studied through comparisons with other primates. However, the evolution of increased brain size is not restricted to the human lineage but is a general characteristic of primate evolution. Whether or not these independent episodes of increased brain size share a common genetic basis is unclear. We sequenced and de novo assembled the transcriptome from the neocortical tissue of the most highly encephalized nonhuman primate, the tufted capuchin monkey (Cebus apella). Using this novel data set, we conducted a genome-wide analysis of orthologous brain-expressed protein coding genes to identify evidence of conserved gene-phenotype associations and species-specific adaptations during three independent episodes of brain size increase. We identify a greater number of genes associated with either total brain mass or relative brain size across these six species than show species-specific accelerated rates of evolution in individual large-brained lineages. We test the robustness of these associations in an expanded data set of 13 species, through permutation tests and by analyzing how genome-wide patterns of substitution co-vary with brain size. Many of the genes targeted by selection during brain expansion have glutamatergic functions or roles in cell cycle dynamics. We also identify accelerated evolution in a number of individual capuchin genes whose human orthologs are associated with human neuropsychiatric disorders. These findings demonstrate the value of phenotypically informed genome analyses, and suggest at least some aspects of human brain evolution have occurred through conserved gene-phenotype associations. Understanding these commonalities is essential for distinguishing human-specific selection events from general trends in brain evolution. © The Author(s) 2017. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
... We hypothesized that traces of evolution in favor of widening of the norm of reaction in the hominoid lineage compared to anthropoid lineages can be associated with the evolution of regulatory sequences, particularly epigenetic signals [1-4, 6, 9, 14]. Previous evolutionary studies of regulatory regions included extensive searches for positive selection in the evolution of human promoters [175,176] and cis-regulatory elements [177,178], genomic regions with signs of accelerated evolution [179,180], a comparison of splicing in humans and chimpanzees [181], a comparative study of the neuronal open chromatin landscape (marked by H3K4me3) in human and other primate brains [55], a comparison of insulator landscapes (CTCF-binding loci) in human and gorilla genomes [3], a comparison of human and gorilla DNA methylomes [182], and cytogenetic studies on heterochromatin evolution in hominids [183][184][185]. ...
Article
Full-text available
Adaptability to a variety of environmental conditions is a prominent feature of Homo sapiens. We hypothesize that this feature can be explained by evolutionary changes in gene promoters active in the brain prefrontal cortex leading to a more flexible gene regulation network. The genotype-dependent range of gene expression can be broader in humans than in other higher primates. Thus, we searched for specific signatures of evolutionary changes in promoter architectures of multiple hominid genes, including the genes active in human cortical neurons, that may indicate an increase in the variability of gene expression rather than just changes in the level of expression, such as downregulation or upregulation of the genes. We performed a whole-genome search for genetically based alterations that may impact gene regulation “flexibility” in the course of hominid evolution, such as (i) CpG dinucleotide content, (ii) predicted nucleosome-DNA dissociation constant, and (iii) predicted affinities for TATA-binding protein (TBP) in gene promoters. We tested all putative promoter regions across the human genome and especially gene promoters in an active chromatin state in neurons of the prefrontal cortex, the brain region critical for abstract thinking and social and behavioral adaptation. Our data imply that the origin of modern man has been associated with an increase in the flexibility of promoter-driven gene regulation in the brain. In contrast, after splitting from the ancestral lineages of H. sapiens, the evolution of ape species is characterized by reduced flexibility of gene promoter functioning, underlying reduced variability of gene expression.
... Recent or current natural selection and adaptive evolution can be detected by various methods, such as nucleotide diversity (π) and Tajima's D test for whole-genome DNA sequences and the McDonald-Kreitman test [4] for coding regions. Even for sequences of gene regulatory regions, several methods have been proposed and genome-wide analyses have also shown statistical evidence of natural selection in non-coding and cis-regulatory regions [5][6][7]. These studies focused on nucleotide substitutions in regulatory sequences or transcription factor binding sites (TFBSs) between species and did not evaluate recent or ongoing selection for standing genetic variation in natural populations. ...
Article
Full-text available
Understanding the evolutionary forces that influence variation in gene regulatory regions in natural populations is an important challenge for evolutionary biology because natural selection for such variations could promote adaptive phenotypic evolution. Recently, whole-genome sequence analyses have identified regulatory regions subject to natural selection. However, these studies could not identify the relationship between sequence variation in the detected regions and change in gene expression levels. We analyzed sequence variations in core promoter regions, which are critical regions for gene regulation in higher eukaryotes, in a natural population of Drosophila melanogaster, and identified core promoter sequence variations associated with differences in gene expression levels subjected to natural selection. Among the core promoter regions whose sequence variation could change transcription factor binding sites and explain differences in expression levels, three core promoter regions were detected as candidates associated with purifying selection or selective sweep and seven as candidates associated with balancing selection, excluding the possibility of linkage between these regions and core promoter regions. CHKov1, which confers resistance to the sigma virus and related insecticides, was identified as having a core promoter region that has been subject to a selective sweep, although it could not be ruled out that selection for variation in this core promoter region was due to linked single nucleotide polymorphisms in the regulatory region outside the core promoter region. Nucleotide changes in the core promoter region of CHKov1 caused the loss of two basal transcription factor binding sites and the acquisition of one transcription factor binding site, resulting in decreased gene expression levels.
Of nine core promoter regions associated with balancing selection, brat and CG9044 are associated with neuromuscular junction development, and Nmda1 is associated with learning, behavioral plasticity, and memory. Diversity of neural and behavioral traits may have been maintained by balancing selection. Our results revealed the evolutionary process occurring by natural selection for differences in gene expression levels caused by sequence variation in core promoter regions in a natural population. The sequences of core promoter regions were diverse even within the population, possibly providing a source for natural selection.
Article
Polyimide (PI)@copper (Cu) composite nanoparticles have been successfully synthesized from poly(amic acid) triethylamine salts (PAAS) and Cu(II) ions via a one-step high-temperature induction/imidization route. The formation of PI@Cu nanoparticles has been investigated in terms of the stoichiometric ratio of PAAS to Cu ions. The resulting products formed stable shell-core structures and exhibited a uniform core size and a thick shell layer. Additionally, the multi-layer structure Ag@PI@Cu was successfully prepared via post-processing of the PI@Cu nanoparticles. The morphology of the resulting “Sunflower-mode” structure, with the pistil of Cu, the sunflower seed of PI, and the petal of Ag, was also characterized by SEM and TEM. Both the electrical resistivity and the thermal conductivity of the nanoparticles were measured. The coefficient of heat conduction of Ag@PI@Cu is 255, 754, 3081, and 1310 times as large as that of PI@Cu at 50 °C, 100 °C, 150 °C, and 200 °C, respectively. The resistances of the two nanoparticles, RsPI@Cu and RsAg@PI@Cu, are 11.0 × 10⁹ Ω and 0.11 Ω, respectively, and the difference between them is more than 10¹².
Article
Full-text available
The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution. Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence.
Article
C is a general-purpose programming language that was originally designed for "system programming", that is, for writing programs such as compilers, operating systems, text editors, etc. Its other applications include database systems, numerical analysis and engineering programs, and a great deal of text-processing software. It is the primary language of the UNIX system, and is also available in several other environments.
Article
A new method is proposed for estimating the number of synonymous and nonsynonymous nucleotide substitutions between homologous genes. In this method, a nucleotide site is classified as nondegenerate, twofold degenerate, or fourfold degenerate, depending on how often nucleotide substitutions will result in amino acid replacement; nucleotide changes are classified as either transitional or transversional, and changes between codons are assumed to occur with different probabilities, which are determined by their relative frequencies among more than 3,000 changes in mammalian genes. The method is applied to a large number of mammalian genes. The rate of nonsynonymous substitution is extremely variable among genes; it ranges from 0.004 × 10⁻⁹ (histone H4) to 2.80 × 10⁻⁹ (interferon gamma), with a mean of 0.88 × 10⁻⁹ substitutions per nonsynonymous site per year. The rate of synonymous substitution is also variable among genes; the highest rate is three to four times higher than the lowest one, with a mean of 4.7 × 10⁻⁹ substitutions per synonymous site per year. The rate of nucleotide substitution is lowest at nondegenerate sites (the average being 0.94 × 10⁻⁹), intermediate at twofold degenerate sites (2.26 × 10⁻⁹), and highest at fourfold degenerate sites (4.2 × 10⁻⁹). The implications of our results for the mechanisms of DNA evolution and for the relative likelihood of codon interchanges in parsimonious phylogenetic reconstruction are discussed.
Article
We have characterized a clone carrying a chicken preproinsulin gene, which is present in only one copy in the chicken genome. The gene contains two introns: a 3.5 kb intron interrupting the region encoding the connecting peptide and a 119 bp intron interrupting the DNA corresponding to the 5′ noncoding region of the mRNA. This is similar to the structure of rat insulin gene II; therefore it represents the common ancestor. Since the rat insulin gene I lacks a 499 bp intron in the coding region, the rat genes have evolved by a recent gene duplication followed by loss of this intron in one copy. The divergences between insulin gene sequences, and also between globin genes, show that changes at introns and silent positions in coding regions appear very rapidly (7 × 10⁻⁹ substitutions per nucleotide site per year), but that the accumulation of changes in these sites saturates, although not completely, after about 100 million years. From this we conclude that not all of these sites are neutral and that they do not behave as accurate evolutionary clocks over long periods of time. However, nucleotide substitutions leading to amino acid replacements are an excellent clock. Our analysis indicates that this clock is driven by selection.
Book
This book presents the statistical methods that are useful in the study of molecular evolution and illustrates how to use them in actual data analysis. Molecular evolution has been developing at a great pace over the past decade or so, driven by the huge increase in genetic sequence data from many organisms, the improvement of high-speed microcomputers, and the development of several new methods for phylogenetic analysis. This book for graduate students and researchers, assuming a basic knowledge of evolution, molecular biology, and elementary statistics, should make it possible for many investigators to incorporate refined statistical analysis of large-scale data in their own work. Nei is one of the leading workers in this area. He and Kumar have developed a computer program called MEGA, which has been sold for about $20 to over 1900 users. For the book, the authors are thoroughly revising MEGA and will make it available via FTP. The book also included analysis using the other most popular programs for phylogenetic studies, including PAUP, PHYLIP, MOLPHY, and PAML.