Figure - uploaded by JM Mercader
Orthogonal support of predicted TSSs: CAGE analysis. Distribution of distinct 5′-ends of CAGE tags from several representative CAGE experiments in H1-hESC, HepG2 and HeLa-S3 cell types, based on cytosolic polyA+ transcripts. For every distinct 5′-most end of a CAGE tag detected within, and on the same strand as, a particular promoter region, we incremented the CAGE frequency of the percent distance bin corresponding to the distance between the CAGE tag 5′-end and the promoter region 5′-end. As the predicted promoter regions were 1200 bp long, each percent distance bin spans 12 bp, so the TSS is expected to fall in the 84th distance bin (i.e. at 1000 bp from the region 5′-end). (a) PS+L+, subset 1. For most of the cell types, the major peak appears around the 63rd bin (i.e. 750 bp), closely matching the prediction. (b) PS+L−, subset 2. We observe poorly defined peaks around the 30th–50th bins (350–600 bp). On the other hand, the number of CAGE tags is significantly higher than for subset 1. (c) PS−L+, subset 3. (d) PS−L−, subset 4. The ProStar-negative (PS−) subsets clearly show an almost nonexistent CAGE signal.
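
The binning described in the caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the function names `percent_bin` and `cage_profile`, the 0-based half-open coordinate convention, and the 1-based bin numbering are all hypothetical choices made here. With those choices the arithmetic agrees with the caption (1000 bp falls in bin 84, 750 bp in bin 63).

```python
from collections import Counter

REGION_LENGTH = 1200                   # bp, promoter region length from the caption
N_BINS = 100                           # percent distance bins
BIN_WIDTH = REGION_LENGTH // N_BINS    # 12 bp per bin

def percent_bin(offset_bp):
    """Map a 0-based offset from the region 5'-end to a 1-based bin (assumed convention)."""
    if not 0 <= offset_bp < REGION_LENGTH:
        raise ValueError("offset lies outside the promoter region")
    return offset_bp // BIN_WIDTH + 1

def cage_profile(regions, tag_ends):
    """Count distinct CAGE tag 5'-ends per percent distance bin.

    regions:  iterable of (chrom, start, end, strand), 0-based half-open (assumed)
    tag_ends: iterable of (chrom, pos, strand) for CAGE tag 5'-ends
    """
    distinct_ends = set(tag_ends)      # each distinct 5'-end is counted once
    profile = Counter()
    for chrom, start, end, strand in regions:
        for t_chrom, pos, t_strand in distinct_ends:
            # Keep only tags inside the region and on the same strand.
            if t_chrom != chrom or t_strand != strand or not start <= pos < end:
                continue
            # The region 5'-end is `start` on the + strand and `end - 1` on the - strand.
            offset = pos - start if strand == "+" else (end - 1) - pos
            profile[percent_bin(offset)] += 1
    return profile
```

The nested loop is quadratic and only meant to make the counting rule explicit; a real analysis would more likely use an interval index or a genome-arithmetic tool.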

Source publication
Article
Although protein recognition of DNA motifs in promoter regions has been traditionally considered as a critical regulatory element in transcription, the location of promoters, and in particular transcription start sites (TSSs), still remains a challenge. Here we perform a comprehensive analysis of putative core promoter sequences relative to non-ann...

Contexts in source publication

Context 1
... results summarized in Figure 2 show that regions from subsets 1 (PS+L+) and 2 (PS+L−) were dramatically enriched for CAGE tags that could be confidently mapped to single positions (Figure 2a and b), as compared with the ProStar negative subsets 3 (PS−L+) and 4 (PS−L−) (Figure 2c and d). Subset 1 displayed the highest proportion of sequences with CAGE-tagged 5′-ends around 750 bp, indicating that those regions contained reliable TSS marks (Figure 2a, around 60th distance bin). ...
Context 4
... results summarized in Figure 2 show that regions from subsets 1 (PS+L+) and 2 (PS+L−) were dramatically enriched for CAGE tags that could be confidently mapped to single positions (Figure 2a and b), as compared with the ProStar negative subsets 3 (PS−L+) and 4 (PS−L−) (Figure 2c and d). Subset 1 displayed the highest proportion of sequences with CAGE-tagged 5′-ends around 750 bp, indicating that those regions contained reliable TSS marks (Figure 2a, around 60th distance bin). CAGE tags were detected in most of the human cell type experiments, but a particular enrichment was found for polyadenylated (polyA+) transcripts, suggesting that active regions might correspond to promoter elements regulating protein-coding genes. ...
Context 5
... subset 3 (PS−L+) regions contain few CAGE tags, although they showed some activity in luciferase expression assays (Figure 2c). We could simply assume that this subset contains luciferase false positives. ...
Context 6
... more intriguingly, subset 2 regions (PS+L−) did show clear CAGE enrichment although they did not provide a luciferase response (Figure 2b). These discrepancies could simply result from luciferase false negatives. ...
Context 7
... discrepancies could simply result from luciferase false negatives. However, the strength and the profile of CAGE signals (Figure 2b) indicated that other factors could also account for the low luciferase/high CAGE response. Comparison of the CAGE profiles indicated that subset 1 peaks are located at the expected TSSs (i.e. ...
Context 8
... of the CAGE profiles indicated that subset 1 peaks are located at the expected TSSs (i.e. around 60th bin; Figure 2a), while subset 2 peaks are displaced upstream of the original prediction (around 30th bin; Figure 2b). These findings suggest that, under certain conditions, physical properties are able to signal promoter regions although the predicted TSS location can be displaced upstream of the true site. ...
Context 10
... validate this hypothesis, we carried out RNA-sequencing (RNA-seq) analysis to survey the transcription profiles of the selected regions and to identify putative exons near the suggested TSSs. We performed the analysis of 2000 bp regions centered on the predicted TSSs, using RNA-seq data of subcellular-fractionated RNAs from the ENCODE Consortium (Supplementary Figure S2, see 'Materials and Methods' section for details) (9,25). Interestingly, the profiles of subset 1 presented a sharp RNA-seq peak at 800 bp, which coincided with the CAGE major peak around 750 bp (Figure 3a, around 40th bin, orange frames). ...
Context 11
... subset 2 profiles showed two sharp RNA-seq peaks at 200 and 400 bp, respectively, which matched CAGE major peaks around 360 bp (Figure 3b, 10th–20th bins, highlighted with orange frames). Moreover, a downstream broad peak likely corresponding to an exon (Figure 3b, 20th–40th bins, i.e. from 400 to 800 bp, highlighted with purple frames) could further confirm the displaced TSS positions at ∼700–500 bp upstream relative to predictions. ...

Citations

... It was used to study indirect readout of proteins 9,131 or small ligands, 132 to model DNA circular topologies, 40,133 or even to determine promoter locations. 16,134 It has been instrumental in understanding the mechanics of DNA 135,136 and protein-DNA interactions, 137 especially in the nucleosome 138,139 or in histone-like protein complexes. 140,141 Schiessel and his coworkers employed the local base-pair step model to study nucleosome mechanics. ...
Article
Structure and deformability of the DNA double helix play a key role in protein and small ligand binding, in genome regulation via looping, or in nanotechnology applications. Here we review some of the recent developments in modeling mechanical properties of DNA in its most common B‐form. We proceed from atomic‐resolution molecular dynamics (MD) simulations through rigid base and base‐pair models, both harmonic and multistate, to rod‐like descriptions in terms of persistence length and elastic constants. The reviewed models are illustrated and critically examined using MD data for which the two current Amber force fields, bsc1 and OL15, were employed. This article is categorized under: Structure and Mechanism > Molecular Structures Structure and Mechanism > Computational Biochemistry and Biophysics Molecular and Statistical Mechanics > Molecular Dynamics and Monte‐Carlo Methods
... The above observation strongly indicates that DNA speaks a universal language. The strength of the physical signals of DNA language at promoters has also been previously observed by combining experiments and simulations studies (32). ...
Article
With almost no consensus promoter sequence in prokaryotes, recruitment of RNA polymerase (RNAP) to precise transcriptional start sites (TSSs) has remained an unsolved puzzle. Uncovering the underlying mechanism is critical for understanding the principle of gene regulation. We attempted to search the hidden code in ∼16,500 promoters of 12 prokaryotes representing two kingdoms in their structure and energetics. Twenty-eight fundamental parameters of DNA structure including backbone angles, basepair axis, and interbasepair and intrabasepair parameters were used, and information was extracted from x-ray crystallography data. Three parameters (solvation energy, hydrogen-bond energy, and stacking energy) were selected for creating energetics profiles using in-house programs. DNA of promoter regions was found to be inherently designed to undergo a change in every parameter undertaken for the study, in all prokaryotes. The change starts from some distance upstream of TSSs and continues past some distance from TSS, hence giving a signature state to promoter regions. These signature states might be the universal hidden codes recognized by RNAP. This observation was reiterated when randomly selected promoter sequences (with little sequence conservation) were subjected to structure generation; all developed into very similar three-dimensional structures quite distinct from those of conventional B-DNA and coding sequences. Fine structural details at important motifs (viz. -11, -35, and -75 positions relative to TSS) of promoters reveal novel to our knowledge and pointed insights for RNAP interaction at these locations; it could be correlated with how some particular structural changes at the -11 region may allow insertion of RNAP amino acids in interbasepair space as well as facilitate the flipping out of bases from the DNA duplex.
... Accordingly, there is a growing interest in identifying other characteristics that are common to promoter regions and could be applied for promoter search. [1][2][3][4][5] The most frequently analyzed features include distribution of physical parameters around the DNA molecule, [4][5][6][7] patterns in nucleotide composition as well as combinations of motifs 8,9 and free energy values. 10 Although all of them are entirely encoded by DNA primary structure, their distributions contain valuable information that cannot be replaced with the analysis of nucleotide composition (text analysis). ...
... Accordingly, several DNA features (textual as well as physical) should be included in the whole-genome promoter search. Numerous studies of regulatory DNA regions in prokaryotic 1,5,6,13,14 and eukaryotic [15][16][17] genomes have proven the efficiency of analysis employing several characteristics distributed throughout regulatory DNA regions. Among them, the most notable physical characteristics are: electrostatic potential, [18][19][20] stress-induced duplex destabilization, 12 bendability; 21 and textual characteristics: z-curve, 22 CG-skew, 23 triplets, tetramers, pentamers and hexamers. ...
Article
Predicting promoter activity of DNA fragment is an important task for computational biology. Approaches using physical properties of DNA to predict bacterial promoters have recently gained a lot of attention. To select an adequate set of physical properties for training a classifier, various characteristics of DNA molecule should be taken into consideration. Here, we present a systematic approach that allows us to select less correlated properties for classification by means of both correlation and cophenetic coefficients as well as concordance matrices. To prove this concept, we have developed the first classifier that uses not only sequence and static physical properties of DNA fragment, but also dynamic properties of DNA open states. Therefore, the best performing models with accuracy values up to 90% for all types of sequences were obtained. Furthermore, we have demonstrated that the classifier can serve as a reliable tool enabling promoter DNA fragments to be distinguished from promoter islands despite the similarity of their nucleotide sequences.
... The information density in higher orders of the NP and BP models seems rather low, but it sums up to considerable sizes across the modelled region ( Supplementary Figures S14 and S15). We speculate that the nucleotide dependencies in our models reflect, at least in part, DNA structural properties of core promoter regions that contribute to TSS recognition (55). ...
Article
Position weight matrices (PWMs) are the standard model for DNA and RNA regulatory motifs. In PWMs nucleotide probabilities are independent of nucleotides at other positions. Models that account for dependencies need many parameters and are prone to overfitting. We have developed a Bayesian approach for motif discovery using Markov models in which conditional probabilities of order k − 1 act as priors for those of order k. This Bayesian Markov model (BaMM) training automatically adapts model complexity to the amount of available data. We also derive an EM algorithm for de-novo discovery of enriched motifs. For transcription factor binding, BaMMs achieve significantly (P = 1/16) higher cross-validated partial AUC than PWMs in 97% of 446 ChIP-seq ENCODE datasets and improve performance by 36% on average. BaMMs also learn complex multipartite motifs, improving predictions of transcription start sites, polyadenylation sites, bacterial pause sites, and RNA binding sites by 26–101%. BaMMs never performed worse than PWMs. These robust improvements argue in favour of generally replacing PWMs by BaMMs.
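
The idea of letting order k − 1 conditional probabilities act as priors for order k can be illustrated with additive pseudo-counts. The sketch below is a simplified, position-independent illustration and not the BaMM software itself: the function names, the single fixed prior strength `alpha`, and the uniform prior at order 0 are all assumptions introduced here, and the abstract does not state how the authors set these quantities.

```python
from collections import Counter

ALPHABET = "ACGT"

def count_kmers(seqs, max_order):
    """Count all k-mers of length 1 .. max_order + 1 in the training sequences."""
    counts = Counter()
    for seq in seqs:
        for k in range(1, max_order + 2):
            for i in range(len(seq) - k + 1):
                counts[seq[i:i + k]] += 1
    return counts

def cond_prob(counts, context, x, alpha=1.0):
    """Estimate P(x | context) with the lower-order estimate acting as the prior.

    alpha is a hypothetical prior strength: the larger it is, the more the
    estimate falls back on the shorter context when data are scarce.
    """
    if context == "":
        # Order 0: smooth the observed nucleotide counts toward a uniform prior.
        total = sum(counts[a] for a in ALPHABET)
        return (counts[x] + alpha / len(ALPHABET)) / (total + alpha)
    # Prior comes from the context shortened by its most distant nucleotide.
    prior = cond_prob(counts, context[1:], x, alpha)
    n_context = sum(counts[context + a] for a in ALPHABET)
    return (counts[context + x] + alpha * prior) / (n_context + alpha)
```

With abundant counts for a context the observed frequencies dominate, while for an unseen context the estimate reduces to the lower-order one, which is one way model complexity can adapt to the amount of data.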
... The information density in higher orders of the NP and BP models seems rather low, but it sums up to considerable sizes across the modeled region ( Supplementary Figures S14 and S15). We speculate that the nucleotide dependencies in our models reflect, at least in part, DNA structural properties of core promoter regions that contribute to TSS recognition [55]. ...
Preprint
Position weight matrices (PWMs) are the standard model for DNA and RNA regulatory motifs. In PWMs nucleotide probabilities are independent of nucleotides at other positions. Models that account for dependencies need many parameters and are prone to overfitting. We have developed a Bayesian approach for motif discovery using Markov models in which conditional probabilities of order k − 1 act as priors for those of order k. This Bayesian Markov model (BMM) training automatically adapts model complexity to the amount of available data. We also derive an EM algorithm for de-novo discovery of enriched motifs. For transcription factor binding, BMMs achieve significantly (p < 0.063) higher cross-validated partial AUC than PWMs in 97% of 446 ChIP-seq ENCODE datasets and improve performance by 36% on average. BMMs also learn complex multipartite motifs, improving predictions of transcription start sites, polyadenylation sites, bacterial pause sites, and RNA binding sites by 26%–101%. BMMs never performed worse than PWMs. These robust improvements argue in favour of generally replacing PWMs by BMMs. The Bayesian Markov Model motif discovery software BaMM!motif is available under GPL at http://github.com/soedinglab/BaMMmotif .
... Finally, the same team showed that, in many transcription factor families, the immediate environment of the recognized sequence, which affects the structure of the helix but carries little sequence information, plays an important role in factor binding. Overall, these data suggest that the recognition of binding sites by transcription factors must integrate both the sequence of the site and the local 3D structure of the DNA helix. Thermodynamic models based on crystallographic data from short DNA sequences have been developed to predict the shape and physical properties of a DNA helix as a function of its sequence (Bishop et al., 2011; Broos et al., 2013; Durán et al., 2013). B. Enhancer signatures 1. ...
Thesis
Enhancers are crucial regulators of gene expression during embryonic development. The ascidian Ciona intestinalis is a model organism well suited to the study of these cis-regulatory sequences, because its enhancers are generally small and compact and the invariant cell lineage of the embryo makes it possible to visualize their activity at cellular resolution. Two independent signatures associated with enhancer activity had been identified: the presence of binding sites for specific transcription factors, and a global dinucleotide signature at the scale of the enhancer (Khoueiry 2010). However, although these signatures correlate with enhancer activity, they do not allow new enhancers to be identified from their sequence. During my thesis, I used an early neural enhancer of Ciona, the very well characterized a-element of the Otx gene, as a model enhancer. This small enhancer (55 bp) is bound by the transcription factors GATA-a and ETS1/2 and activated by the FGF signaling pathway. To better understand the determinants of the early neural activity of an enhancer, I tested the impact of point mutations affecting the affinity of the a-element binding sites for transcription factors. I also randomized the spacer sequences located between the ETS and GATA binding sites in four clusters of these sites. Our results suggest at least two levels of control of cis-regulation: (i) the spatiotemporal specificity of enhancer activity is defined by the identity of the transcription factor binding sites, and (ii) its level of activity depends both on the affinity of the transcription factors for their binding sites and on the composition of the spacer sequences. Most randomized variants of the a-element are active in the same cell lineages as the wild type, and their activity levels are highly diverse.
The same result is obtained by randomizing the spacer sequences of another active ETS/GATA cluster. Randomizing these sequences even conferred enhancer activity on many variants of inactive clusters. Consistent with their early neural activity and the presence of ETS and GATA binding sites, these variants, like the a-element, respond to neural induction by FGF. We were unable to explain the effect of the spacer sequences on enhancer activity by simple features of their sequences (nucleotide or dinucleotide composition), and it is not understood why it is so easy to create a synthetic enhancer when most genomic clusters of putative ETS and GATA binding sites are inactive. Using an in vitro transcription factor binding approach, we showed that randomizing the spacer sequences can affect the binding of a transcription factor to the a-element without changing the primary sequence of the binding site, but that binding to the whole element cannot always be explained by binding to the isolated sites. These results suggest that the physical structure of the DNA helix around binding sites may play an important role in controlling the activity of a gene.
... However, these experiments provide very little structural detail and cannot reveal the microscopic mechanism, a gap which MD is suitable to fill. Another application involves the identification of previously unknown promoters on the basis of particular DNA mechanical properties [47,48]. ...
Article
Mechanical properties of DNA are important not only in a wide range of biological processes but also in the emerging field of DNA nanotechnology. We review some of the recent developments in modeling these properties, emphasizing the multiscale nature of the problem. Modern atomic resolution, explicit solvent molecular dynamics simulations have contributed to our understanding of DNA fine structure and conformational polymorphism. These simulations may serve as data sources to parameterize rigid base models which themselves have undergone major development. A consistent buildup of larger entities involving multiple rigid bases enables us to describe DNA at more global scales. Free energy methods to impose large strains on DNA, as well as bead models and other approaches, are also briefly discussed.
... [27][28][29] Most of these programs utilize other features, such as transcription factor binding sites, physical properties of the DNA, DNA accessibility, RNAP II occupancy and various epigenetic markers. [29][30][31][32][33][34][35] However, even available programs that aim to identify core promoter elements, such as McPromoter 36 and Eukaryotic Core Promoter Predictor (YAPP, http://www.bioinformatics.org/yapp/cgi-bin/yapp.cgi), rarely consider the functional constraint of the strict spacing required by the Inr-dependent elements, namely, DPE, MTE, and Bridge. ...
Article
Core promoter elements play a pivotal role in the transcriptional output, yet they are often detected manually within sequences of interest. Here, we present 2 contributions to the detection and curation of core promoter elements within given sequences. First, the Elements Navigation Tool (ElemeNT) is a user-friendly web-based, interactive tool for prediction and display of putative core promoter elements and their biologically-relevant combinations. Second, the CORE database summarizes ElemeNT-predicted core promoter elements near CAGE and RNA-seq-defined Drosophila melanogaster transcription start sites (TSSs). ElemeNT's predictions are based on biologically-functional core promoter elements, and can be used to infer core promoter compositions. ElemeNT does not assume prior knowledge of the actual TSS position, and can therefore assist in annotation of any given sequence. These resources, freely accessible at http://lifefaculty.biu.ac.il/gershon-tamar/index.php/resources, facilitate the identification of core promoter elements as active contributors to gene expression.
... It was shown in these models that the physicochemical properties did play a crucial role in promoter recognition. Recently, the report by Duran et al. (65) strongly supports the hypothesis that an ancient regulatory mechanism encoded by the intrinsic physical properties of the DNA may contribute to the complexity of transcription regulation in the human genome. Supplementary Table S1 of Supporting Information S3, which will be used to calculate the global or long-range sequence-order effects for the promoter sequences via Equations (6) and (7). ...
Article
The σ54 promoters are unique in prokaryotic genomes and are responsible for transcribing carbon- and nitrogen-related genes. With the avalanche of genome sequences generated in the postgenomic age, it is highly desirable to develop automated methods for rapidly and effectively identifying the σ54 promoters. Here, a predictor called ‘iPro54-PseKNC’ was developed. In the predictor, the samples of DNA sequences were formulated by a novel feature vector called ‘pseudo k-tuple nucleotide composition’, which was further optimized by the incremental feature selection procedure. The performance of iPro54-PseKNC was examined by rigorous jackknife cross-validation tests on a stringent benchmark data set. As a user-friendly web-server, iPro54-PseKNC is freely accessible at http://lin.uestc.edu.cn/server/iPro54-PseKNC. For the convenience of the vast majority of experimental scientists, a step-by-step protocol guide was provided on how to use the web-server to get the desired results without the need to follow the complicated mathematics that were presented in this paper just for its integrity. Meanwhile, we also discovered through an in-depth statistical analysis that the distribution of distances between the transcription start sites and the translation initiation sites was governed by the gamma distribution, which may provide a fundamental physical principle for studying the σ54 promoters.