Enrichment of high-scoring binding site predictions in co-expressed gene sets for ArgR (a), SoxS (b) and PurR (c), respectively. Each plot corresponds to the entire ranked gene list obtained from screening with the PWM (red) and CRoSSeD (green) motif models, with confidence decreasing from left to right. The marked positions are those of genes found to be co-expressed with the known target genes of the respective TFs.
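As a rough illustration of how such a ranked-list plot can be read, the sketch below counts how many co-expressed genes appear by each rank of a ranked gene list. It is a minimal example under stated assumptions, not the procedure used in the source publication: the gene identifiers, both rankings, and the function names `running_hits` and `top_k_fraction` are hypothetical.

```python
def running_hits(ranked_genes, coexpressed):
    """Cumulative number of co-expressed genes seen at each rank."""
    seen = 0
    curve = []
    for gene in ranked_genes:
        if gene in coexpressed:
            seen += 1
        curve.append(seen)
    return curve


def top_k_fraction(ranked_genes, coexpressed, k):
    """Fraction of the co-expressed set recovered within the top k ranks."""
    return running_hits(ranked_genes, coexpressed)[k - 1] / len(coexpressed)


# Hypothetical data: two rankings of the same eight genes, highest
# confidence first, and a made-up set of co-expressed genes.
ranked_a = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
ranked_b = ["g5", "g1", "g7", "g2", "g8", "g3", "g6", "g4"]
coexpressed = {"g1", "g2", "g3"}

frac_a = top_k_fraction(ranked_a, coexpressed, 3)
frac_b = top_k_fraction(ranked_b, coexpressed, 3)
print(f"top-3 recovery: ranking A = {frac_a:.2f}, ranking B = {frac_b:.2f}")
```

A ranking that places the co-expressed genes further to the left yields a curve that rises earlier, which is the visual pattern the figure panels compare between the two motif models.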

Source publication
Article
Recognition of genomic binding sites by transcription factors can occur through base-specific recognition, or by recognition of variations within the structure of the DNA macromolecule. In this article, we investigate what information can be retrieved from local DNA structural properties that is relevant to transcription factor binding and that can...

Citations

... This set of regulons constitutes the transcriptional regulatory network (TRN) of E. coli, representing thousands of interactions between regulators and promoter sequences [2]. While our understanding of the TRN of E. coli likely exceeds that of any other organism, there is still much that we do not understand regarding how the TRN is encoded in the genome itself [3,4]. ...
... Therefore, the motif score is important but not always sufficient to determine regulon membership. To characterize regulator-sequence interactions structurally, DNA shape features were included in the feature matrix for machine learning models, and they acted as significant features in the prediction, consistent with previous work in this area [3]. TF-specific DNA shape vectors reflect unique regulator protein structures and provide another measurement for regulator binding affinity. ...
Article
The transcriptional regulatory network (TRN) of E. coli consists of thousands of interactions between regulators and DNA sequences. Regulons are typically determined either from resource-intensive experimental measurement of functional binding sites, or inferred from analysis of high-throughput gene expression datasets. Recently, independent component analysis (ICA) of RNA-seq compendia has been shown to be a powerful method for inferring bacterial regulons. However, it remains unclear to what extent regulons predicted by ICA structure have a biochemical basis in promoter sequences. Here, we address this question by developing machine learning models that predict inferred regulon structures in E. coli based on promoter sequence features. Models were constructed successfully (cross-validation AUROC >= 0.8) for 85% (40/47) of ICA-inferred E. coli regulons. We found that: 1) the presence of a high-scoring regulator motif in the promoter region was sufficient to specify regulatory activity in 40% (19/47) of the regulons; 2) additional features, such as DNA shape and extended motifs that can account for regulator multimeric binding, helped to specify regulon structure for the remaining 60% of regulons (28/47); 3) investigating regulons where initial machine learning models failed revealed new regulator-specific sequence features that improved model accuracy. Finally, we found that strong regulatory binding sequences underlie both the genes shared between ICA-inferred and experimental regulons as well as genes in the E. coli core pan-regulon of Fur. This work demonstrates that the structure of ICA-inferred regulons largely can be understood through the strength of regulator binding sites in promoter regions, reinforcing the utility of top-down inference for regulon discovery.
... While our understanding of the TRN of E. coli likely exceeds that of any other organism, there is still much that we do not understand regarding how the TRN is encoded in the genome itself (11). While we have identified binding sequence motifs for many transcription factors, the presence of a motif in the promoter sequence does not guarantee that a regulator will bind or influence gene expression (11). Thus, we still lack a quantitative understanding of how promoter sequence encodes a functioning regulatory site. ...
... Therefore, the motif score is important but not sufficient to determine regulon membership. To characterize regulator-sequence interactions structurally, DNA shape features were included in the feature matrix for machine learning models, and they acted as significant features in the prediction, consistent with previous work in this area (11). TF-specific DNA shape vectors reflect unique regulator protein structures and provide another measurement for regulator binding affinity. ...
Preprint
The transcriptional regulatory network (TRN) of E. coli consists of thousands of interactions between regulators and DNA sequences. Inherently the DNA sequence is the primary determinant of the TRN; however, it is well established that the presence of a DNA binding motif does not guarantee a functional regulatory protein binding site. Thus, the extent to which the TRN architecture can be predicted by the genome DNA sequence alone remains unclear. Here, we developed machine learning models that predict the TRN structure of E. coli based on genome sequence. Models were constructed successfully (cross-validation AUROC >= 0.8) for 84% (57/68) of valid E. coli regulons identified from top-down analysis of RNA-seq data. We found that: 1) While regulatory motif strength is the most important sequence feature for determining regulon membership, additional features such as DNA shape substantially influence membership; 2) complex regulons involving multiple interacting regulators could be unraveled by machine learning; 3) investigating regulons where initial ML models failed revealed new regulator-specific sequence features that improved model accuracy. Finally, while regulon structure can appear to be variable across estimation methods and strains, we found that strong regulatory sequence features underlie both the genes that appear most consistently in regulons across estimation methods as well as the core regulon genes in the Fur pan-regulon. This work develops a quantitative understanding of the sequence basis of the TRN and suggests a path towards computationally-guided control of transcriptional regulation for synthetic biology applications.
... Transcription factors (TF) are modulators of gene expression that act on all eukaryotic biochemical systems [1]. When a transcription factor interacts with its binding site or motif, it will exert an inhibitory or activating effect, allowing transcriptional regulation [2]. One of the most well-known and essential transcription factors is TATA box-binding protein (TBP). ...
Research
Final project for the course COMP 561 (Computational Biology Methods and Research) at McGill. Created and assessed the performance of a method for predicting transcription factor binding sites using DNA physical properties. The method involved machine learning and the use of Galaxy.
... As the Boolean feature functions evaluate one of the two states of being true or false for a feature appearing at an exact position, all structural features are treated as discrete rather than continuous values during model training. In addition, considering that substrate cleavage depends on the overall 3D shape or neighbourhood of multiple AAs, i.e., on structural features recognized at cleavage sites such as the overall shape of the P4-P4′ segment surrounding the potential cleavage sites, we combined CRF with a LOWESS data-smoothing approach [35] and examined whether cleavage site prediction could be further improved. Specifically, feature optimization first ran the LOWESS smoothing algorithm on the input vectors of each structural feature. ...
Article
Proteases are enzymes that cleave and hydrolyse the peptide bonds between two specific amino acids of target substrate proteins. Protease-controlled proteolysis plays a key role in the degradation and recycling of proteins, which is essential for various physiological processes. Thus, solving the substrate identification problem will have important implications for the precise understanding of protease functions and their physiological roles, as well as for therapeutic target identification and pharmaceutical applicability. Consequently, there is a great demand for bioinformatics methods that can predict novel substrate cleavage events with high accuracy by utilizing both sequence and structural information. In this study, we present Procleave, a novel bioinformatics approach for predicting protease-specific substrates and specific cleavage sites by combining both sequence and 3D structural information. Structural features of known cleavage sites were represented by discrete values using a LOWESS data-smoothing optimization method, which turned out to be critical for the performance of Procleave. The optimal approximations of all structural parameter values were encoded in a conditional random field (CRF) computational framework, alongside sequence and chemical-group-based features. Here, we demonstrate the outstanding performance of Procleave through extensive benchmarking and independent tests. Procleave was able to correctly identify most cleavage sites in case study proteins. Importantly, when applied to the human structural proteome encompassing 17,628 protein structures, Procleave was able to suggest a number of potentially novel protease target substrates and cleavage sites. The web server of Procleave is publicly available at http://procleave.erc.monash.edu/.
... Transcription factor binding sites (TFBS) on DNA are short sequences ranging from a few to about 20 base pairs, located in the upstream regulatory regions of genes (Khamis et al., 2018). Computational approaches for predicting TFBS have been successful (Elnitski et al., 2006) using varying methods, such as pattern matching based on PWMs (Wasserman and Sandelin, 2004); including sequence-specific and structural features of DNA, such as DNA shape or local chemical and structural descriptors (Meysman et al., 2011); and PWM modeling, including nucleotide k-mer relationships or remote dependencies of nucleotide positions (Siddharthan, 2010; Keilwagen and Grau, 2015). However, only a fraction of the predicted TFBS are actually bound by transcription factors. ...
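The PWM-based pattern matching mentioned in this excerpt can be sketched in a few lines. This is a minimal illustration, assuming a uniform background of 0.25 per base and a pseudocount of 1; the aligned sites, the scanned sequence, and the function names `build_pwm` and `scan` are hypothetical and not taken from any of the cited tools.

```python
import math

BASES = "ACGT"

# Hypothetical aligned binding sites, invented for illustration only.
SITES = ["TGACTCA", "TGAGTCA", "TGACTCA", "TTACTCA"]


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Return one dict of log2-odds scores per motif position."""
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        total = len(column) + pseudocount * len(BASES)
        pwm.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in BASES
        })
    return pwm


def scan(sequence, pwm):
    """Score every window of the sequence; return (offset, score) pairs."""
    width = len(pwm)
    return [
        (start, sum(pwm[i][sequence[start + i]] for i in range(width)))
        for start in range(len(sequence) - width + 1)
    ]


pwm = build_pwm(SITES)
hits = scan("AAGCTGACTCATTG", pwm)  # hypothetical promoter fragment
best_offset, best_score = max(hits, key=lambda hit: hit[1])
print(f"best window starts at offset {best_offset} with score {best_score:.2f}")
```

Each window score is a plain sum over positions, so the model cannot express dependencies between positions. Any window above a chosen threshold is reported as a site whether or not a factor binds there, which is one reason such scans over-predict.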
Article
Soil salinization is an increasing global threat to economically important agricultural crops such as bread wheat (Triticum aestivum L.). Among the main regulators of plants’ responses to salt stress are WRKY transcription factors, a protein family that binds to DNA and alters the rate of transcription of specific genes. In this study, we identified 297 WRKY genes in the Chinese Spring wheat genome (Ensembl Plants International Wheat Genome Sequencing Consortium (IWGSC)), of which 126 were identified as putative. We classified the 297 WRKY genes into three Groups: I, II (a–e) and III based on phylogenetic analysis. Principal component analysis (PCA) of WRKY proteins using physicochemical properties resulted in a clustering very similar to that observed through phylogenetic analysis. The 5′ upstream regions (−2 000 bp) of 107 891 sequences from the wheat genome were used to predict WRKY transcription factor binding sites, and from this we identified 31 296 genes with putative WRKY binding motifs using the Find Individual Motif Occurrences (FIMO) tool. Among these predicted genes, 47 genes were expressed during salt stress according to a literature survey. Thus, we provide insight into the structure and diversity of WRKY domains in wheat and a foundation for future studies of DNA-binding specificity and for analysis of the transcriptional regulation of plants’ response to different stressors, such as salt stress, as addressed in this study.
... CRFs, which are widely used in the field of bioinformatics [43–46], are a probabilistic model proposed by Lafferty et al. [37] for labeling sequence data. In this study, protein sequences and their corresponding label sequences are used to train the CRF model, which is a conditional probability model to annotate unlabeled protein sequences. ...
Article
Accurate identification of intrinsically disordered proteins/regions (IDPs/IDRs) is critical for predicting protein structure and function. Previous studies have shown that IDRs of different lengths have different characteristics, and several classification-based predictors have been proposed for predicting different types of IDRs. Compared with these classification-based predictors, the previously proposed predictor IDP-CRF, a sequence labeling model based on conditional random fields (CRFs), exhibits state-of-the-art performance for predicting IDPs/IDRs. Motivated by these methods, we propose a predictor called IDP-FSP, which is an ensemble of three CRF-based predictors called IDP-FSP-L, IDP-FSP-S, and IDP-FSP-G. These three predictors are specially designed to predict long, short, and generic disordered regions, respectively, and they are constructed based on different features. To the best of our knowledge, IDP-FSP is the first predictor that combines a sequence labeling algorithm with IDRs of different lengths. Experimental results using two independent test datasets show that IDP-FSP achieves better or at least comparable predictive performance compared with 26 existing state-of-the-art methods in this field, demonstrating the effectiveness of IDP-FSP. Keywords: intrinsically disordered proteins/regions, ensemble predictor, length-dependent predictors, conditional random fields, CRFs
... Plasticity in DNA also plays a significant role in DNA-protein recognition, DNA melting, nucleosome assembly and genome integrity. Thus, intrinsic structural properties that define DNA bendability, duplex stability, curvature, groove shape and topography, are more accurate determinants of DNA binding specificities of TFs than the simple nucleotide sequence (10–20). ...
Article
Spatial and temporal expression of genes is essential for maintaining phenotype integrity. Transcription factors (TFs) modulate expression patterns by binding to specific DNA sequences in the genome. Along with the core binding motif, the flanking sequence context can play a role in DNA-TF recognition. Here, we employ high-throughput in vitro and in silico analyses to understand the influence of sequences flanking the cognate sites in binding of three most prevalent eukaryotic TF families (zinc finger, homeodomain and bZIP). In vitro binding preferences of each TF toward the entire DNA sequence space were correlated with a wide range of DNA structural parameters, including DNA flexibility. Results demonstrate that conformational plasticity of flanking regions modulates binding affinity of certain TF families. DNA duplex stability and minor groove width also play an important role in DNA-TF recognition but differ in how exactly they influence the binding in each specific case. Our analyses further reveal that the structural features of preferred flanking sequences are not universal, as similar DNA-binding folds can employ distinct DNA recognition modes.
... The last example is 4AD4A, which contains two IDRs with a total of 31 disordered residues (Figure 4b (33,33), (53,54), (65,66), (68,68), (72,76), (78,82), (86,86), (88,88) , 4) and (49,85). (b) Actual IDR is: (52,85). ...
... Conditional random fields (CRFs) were proposed by Lafferty et al. [19], and compose a probabilistic model for labeling sequence data. Due to their advantages, CRFs have been widely applied to solve a number of prediction tasks in the field of bioinformatics and computational biology, including protein-protein interaction prediction [79,80], phosphorylation site prediction [81], transcription factor binding site prediction [82], and protein-RNA residue-based contact prediction [83]. ...
... (c) IDRs predicted by the RF predictor are: (1, 3), (13, 31), (69, 81), (132, 133), (236, 236), (346, 346) and (377, 380). (d) IDRs predicted by the SVM predictors are: (11, 31), (54, 55), (65, 69), (72, 75), (78, 82), (88, 88), (97, 99), (104, 104), (337, 337), (346, 346) and (376, 380). (e) IDRs predicted by the ANN predictor are: (10, 31), ...
Article
Accurate prediction of intrinsically disordered proteins/regions is one of the most important tasks in bioinformatics, and some computational predictors have been proposed to solve this problem. How to efficiently incorporate the sequence-order effect is critical for constructing an accurate predictor because disordered region distributions show global sequence patterns. In order to capture these sequence patterns, several sequence labelling models have been applied to this field, such as conditional random fields (CRFs). However, these methods suffer from certain disadvantages. In this study, we proposed a new computational predictor called IDP–CRF, which is trained on an updated benchmark dataset based on the MobiDB database and the DisProt database, and incorporates more comprehensive sequence-based features, including PSSMs (position-specific scoring matrices), kmer, predicted secondary structures, and relative solvent accessibilities. Experimental results on the benchmark dataset and two independent datasets show that IDP–CRF outperforms 25 existing state-of-the-art methods in this field, demonstrating that IDP–CRF is a very useful tool for identifying IDPs/IDRs (intrinsically disordered proteins/regions). We anticipate that IDP–CRF will facilitate the development of protein sequence analysis.
... Such dependencies are usually not accounted for in consensus binding sequences or position weight matrices, which implicitly assume independence among the base-pair positions of a DNA-binding site. Thus, identifying whether TFs employ a shape readout mechanism can significantly improve our understanding of binding specificities and can improve the accuracy of binding site prediction models (Bauer et al., 2010; Meysman et al., 2011; Maienschein-Cline et al., 2012; Gordan et al., 2013; Dror et al., 2014; Yang et al., 2014; Abe et al., 2015; Zhou et al., 2015). ...
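The independence assumption mentioned in this excerpt can be written out explicitly. The notation below is an assumption introduced for illustration and does not come from the cited papers: $x = x_1 \dots x_L$ is a candidate site of length $L$, $p_i(a)$ is the motif probability of base $a$ at position $i$, and $b(a)$ is the background probability of base $a$.

```latex
% Under a PWM, the site probability factorises over positions:
P(x \mid \text{motif}) = \prod_{i=1}^{L} p_i(x_i)

% so the log-odds score is a sum of independent per-position terms:
S(x) = \log \frac{P(x \mid \text{motif})}{P(x \mid \text{background})}
     = \sum_{i=1}^{L} \log \frac{p_i(x_i)}{b(x_i)}
```

Because each term depends on a single position, no choice of $p_i$ can make the contribution of $x_i$ depend on a neighbouring base $x_{i+1}$. Shape-aware models aim to capture that kind of dependency.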
Article
SEPALLATA3 of Arabidopsis thaliana is a MADS‐domain transcription factor and a key regulator of flower development. MADS‐domain proteins bind to sequences termed ‘CArG‐boxes’ (consensus 5′‐CC(A/T)6GG‐3′). Since only a fraction of the CArG‐boxes in the Arabidopsis genome are bound by SEPALLATA3, more elaborate principles have to be discovered to better understand which features turn CArG‐boxes into genuine recognition sites. Here, we investigate to what extent the shape of the DNA is involved in a ‘shape readout’ that contributes to the binding of SEPALLATA3. We determined in vitro binding affinities of SEPALLATA3 to DNA probes which all contain the CArG‐box motif, but differ in their predicted DNA shape. We found that binding affinity correlates well with a narrow minor groove of the DNA. Substitution of canonical bases with non‐standard bases supports the hypothesis of minor groove shape readout by SEPALLATA3. Analysis of mutant SEPALLATA3 proteins further revealed that a highly conserved arginine residue, which is expected to contact the DNA minor groove, contributes significantly to the shape readout. Our studies show that the specific recognition of cis‐regulatory elements by a plant MADS‐domain transcription factor, and by inference probably also of other transcription factors of this type, heavily depends on shape readout mechanisms.
... Also, more flexible approaches have been implemented to develop customized models of TFBSs, such as those based on Bayesian networks (25), Hidden Markov Models (HMM) (14) and recently deep learning of Neural Networks (NN) (26). Various methods have incorporated sequence-specific and structural features of DNA for prediction of TFBSs, for example, DNA shape (27,28), or local chemical and structural properties (29). Some other approaches, like (30) used additional information such as DNA accessibility. ...
Article
Identifying transcription factor (TF) binding sites (TFBSs) is important in the computational inference of gene regulation. Widely used computational methods of TFBS prediction based on position weight matrices (PWMs) usually have high false positive rates. Moreover, computational studies of transcription regulation in eukaryotes frequently require numerous PWM models of TFBSs due to the large number of TFs involved. To overcome these problems we developed DRAF, a novel method for TFBS prediction that requires only 14 prediction models for 232 human TFs, while at the same time significantly improving prediction accuracy. DRAF models use more features than PWM models, as they combine information from TFBS sequences and physicochemical properties of TF DNA-binding domains into machine learning models. Evaluation of DRAF on 98 human ChIP-seq datasets shows on average a 1.54-, 1.96- and 5.19-fold reduction of false positives at the same sensitivities compared to models from HOCOMOCO, TRANSFAC and DeepBind, respectively. This observation suggests that one can efficiently replace the PWM models for TFBS prediction by a small number of DRAF models that significantly improve prediction accuracy. The DRAF method is implemented in a web tool and in stand-alone software freely available at http://cbrc.kaust.edu.sa/DRAF.