Figure - available from: Nature Genetics
Few TFs display strong transcriptional activity in cells
a, Schematic representation of the MPRA (STARR-seq) libraries. For enhancer activity assays, a DNA library comprising synthetic TF motifs (i), human genomic fragments (ii) or completely random synthetic DNA oligonucleotides (iii) is cloned within the 3′ UTR of the reporter gene (open reading frame (ORF)) driven by a minimal δ1-crystallin gene (Sasaki) or EF1α promoter. For binary promoter–enhancer (iv) activity assays, random synthetic DNA sequences are cloned in place of the minimal promoter and in the 3′ UTR (Methods, Supplementary Note and Supplementary Tables 3 and 4). b, MPRA (STARR-seq) reporter construct and its variations, and the experimental workflow for measuring promoter or enhancer activity. The MPRA libraries are transfected into human cells, and RNA is isolated 24 h later, followed by enrichment of reporter-specific RNA, library preparation, sequencing and data analysis. The active promoters are recovered by mapping their transcribed enhancers to the input DNA and identifying the corresponding promoter. c, Enhancer activity of HT-SELEX motifs measured from the synthetic TF motif library in GP5d cells. Median fold change of the sequence patterns containing a single instance of the motif consensus or its reverse complement over the input library is shown. Red line marks 1% activity relative to the strongest motif. Dimeric motifs are indicated by orientation with respect to core consensus sequence (GGAA for ETS, ACAA for SOX, AACCGG for GRHL and GAAA for IRF; HH, head to head; HT, head to tail; TT, tail to tail), followed by gap length between the core sequences. Asterisk indicates an A-rich sequence 5′ of the IRF HT2 dimer. Supplementary Table 5 describes the naming of the motifs in each figure. d, The effect of a mismatch on enhancer activity of the p53 family (p63) motif when a consensus base is substituted by any other base one position at a time. The log2 fold change compared to input is plotted for the same motif pattern in two different sequence contexts. The PWMs for HT-SELEX and STARR-seq motifs are shown; note that mutating G to any other base (H) at position 5 (H05) leads to almost complete loss of activity.
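
The quantities in panels c and d can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the pseudocount and the depth normalization are assumptions made for illustration only. It shows how a motif's reverse complement is formed, how every single-base mismatch of a consensus can be enumerated one position at a time, and how a log2 fold change of RNA over input DNA might be computed from read counts.

```python
import math
from statistics import median

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def single_mismatches(consensus: str):
    """Yield (1-based position, new base, variant) for each one-base substitution."""
    for i, ref in enumerate(consensus):
        for alt in "ACGT":
            if alt != ref:
                yield i + 1, alt, consensus[:i] + alt + consensus[i + 1:]


def log2_fold_change(rna_count, dna_count, rna_total, dna_total, pseudocount=1.0):
    """log2 of the depth-normalized RNA frequency over the input DNA frequency.

    The pseudocount (an assumption of this sketch) avoids division by zero
    for sequences that are absent from one of the two libraries.
    """
    rna_freq = (rna_count + pseudocount) / rna_total
    dna_freq = (dna_count + pseudocount) / dna_total
    return math.log2(rna_freq / dna_freq)


def median_activity(fold_changes):
    """Median fold change across all sequence patterns carrying one motif."""
    return median(fold_changes)
```

For a four-base core such as GGAA, `single_mismatches` yields 12 variants (three alternative bases at each of four positions), which corresponds to the kind of one-position-at-a-time scan described for panel d.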

Source publication
Article
Full-text available
DNA can determine where and when genes are expressed, but the full set of sequence determinants that control gene expression is unknown. Here, we measured the transcriptional activity of DNA sequences that represent an ~100 times larger sequence space than the human genome using massively parallel reporter assays (MPRAs). Machine learning models re...

Citations

... While the impact of large SV on precision oncology has been discussed elsewhere [5], this review provides an overview of the recent findings on the functional impact of non-coding somatic and germline single nucleotide alterations affecting GREs in cancer. Considering the growing body of evidence highlighting the clinical significance of SNVs within non-coding regions of the genome, there has been a surge of innovation in technologies aimed at their comprehensive characterization and the exploration of their intricate molecular mechanisms [18,19], which is also discussed. ...
Article
Full-text available
Discoveries in the field of genomics have revealed that non-coding genomic regions are not merely "junk DNA", but rather comprise critical elements involved in gene expression. These gene regulatory elements (GREs) include enhancers, insulators, silencers, and gene promoters. Notably, new evidence shows how mutations within these regions substantially influence gene expression programs, especially in the context of cancer. Advances in high-throughput sequencing technologies have accelerated the identification of somatic and germline single nucleotide mutations in non-coding genomic regions. This review provides an overview of somatic and germline non-coding single nucleotide alterations affecting transcription factor binding sites in GREs, specifically involved in cancer biology. It also summarizes the technologies available for exploring GREs and the challenges associated with studying and characterizing non-coding single nucleotide mutations. Understanding the role of GRE alterations in cancer is essential for improving diagnostic and prognostic capabilities in the precision medicine era, leading to enhanced patient-centered clinical outcomes.
... One additional motif, HNF1A, exhibited a multiplicity effect (saturation at n=3) in the manual homotypic designs, which was not observed in the global R2 analysis due to low prevalence of high multiplicity sequences. Our findings on motif multiplicity are in good agreement with prior reporting 32,35. For the remaining 5 motifs for which we tested manual homotypic enhancers, no multiplicity effect was observed (Fig. S5B). ...
Preprint
Full-text available
An important and largely unsolved problem in synthetic biology is how to target gene expression to specific cell types. Here, we apply iterative deep learning to design synthetic enhancers with strong differential activity between two human cell lines. We initially train models on published datasets of enhancer activity and chromatin accessibility and use them to guide the design of synthetic enhancers that maximize predicted specificity. We experimentally validate these sequences, use the measurements to re-optimize the predictor, and design a second generation of enhancers with improved specificity. Our design methods embed relevant transcription factor binding site (TFBS) motifs with higher frequencies than comparable endogenous enhancers while using a more selective motif vocabulary, and we show that enhancer activity is correlated with transcription factor expression at the single cell level. Finally, we characterize causal features of top enhancers via perturbation experiments and show enhancers as short as 50bp can maintain specificity.
... This uneven amplification is undesirable for many applications, as it results in the inability to characterize lowly abundant sequences, which may even be altogether absent from the final experiment ("dropping out"), while highly abundant sequences waste a large number of sequencing reads or use up screening resources that would be better spent on other sequences. Many new methods require extremely complex plasmid libraries 7,8 , and the size of libraries will likely continue to grow with increased sequencing throughput. Given the investment required to generate and screen large plasmid libraries, it is crucial to ensure that each unique sequence remains at the desired abundance after library amplification so that as many high-quality measurements as possible can be made. ...
Preprint
Full-text available
DNA libraries are critical components of many biological assays. These libraries are often kept in plasmids that are amplified in E. coli to generate sufficient material for an experiment. Library uniformity is critical for ensuring that every element in the library is tested similarly, and is thought to be influenced by the culture approach used during library amplification. We tested five commonly used culturing methods for their ability to uniformly amplify plasmid libraries: liquid, semisolid agar, cell spreader-spread plates with high or low colony density, and bead-spread plates. Each approach was evaluated with two library types: a random 80-mer library, representing high complexity low coverage of similar sequence lengths, and a human TF ORF library, representing low complexity high coverage of diverse sequence lengths. We found that no method was better than liquid culture, which produced relatively uniform libraries regardless of library type. However, when libraries were transformed with high coverage, culturing method had minimal impact on uniformity or amplification bias. Plating libraries was the worst approach by almost every measure for both library types, and, counter-intuitively, produced the strongest biases against long sequence representation. Semisolid agar amplified most elements of the library uniformly but also included outliers with orders of magnitude higher abundance. For amplifying DNA libraries, liquid culture, the simplest method, appears to be best.
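
One way to put a number on library uniformity is sketched below. The abstract does not state which metric the authors used, so the Gini coefficient here is an assumption chosen for illustration, and the function name is hypothetical. A value of 0 means every element has the same read count; values approaching 1 mean a few elements dominate the library.

```python
def gini(counts):
    """Gini coefficient of per-element read counts (0 = perfectly uniform)."""
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one element with a non-zero count")
    # Rank-weighted sum over the sorted counts (ranks start at 1).
    weighted = sum((rank + 1) * x for rank, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n
```

A library with equal counts, such as `[5, 5, 5, 5]`, gives 0, while one in which a single element holds all reads, such as `[0, 0, 0, 10]`, gives 0.75, the maximum for four elements.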
... In contrast, Peng et al. investigated all genomic loci with enhancer potential in murine embryonic stem cells, while Sahu et al. explored the entire human genome. Both studies demonstrated a significant enrichment of p53REs among sequences that transactivated the reporters [37,38]. Interestingly, the p53RE was identified as the most effective transcription factor binding motif for transactivation [38]. ...
... Both studies demonstrated a significant enrichment of p53REs among sequences that transactivated the reporters [37,38]. Interestingly, the p53RE was identified as the most effective transcription factor binding motif for transactivation [38]. Additionally, p53REs facilitated transactivation at DNA sequences associated with chromatin that is poorly accessible [37], which highlights the pioneer function of p53 in binding to nucleosome-bound DNA and promoting DNA accessibility. ...
... A similar MOSAIC-based strategy could also be deployed to define the binding specificities of EndoTFs to their normal binding sites in any given promoter. Doing so could provide a more physiologic study of EndoTF binding activities in situ in their native chromatin contexts as opposed to previously described in vitro analysis of binding specificities of EndoTFs in the literature that are performed in synthetic contexts 10,11. ...
Preprint
Current technologies for upregulation of endogenous genes use targeted artificial transcriptional activators, but stable gene activation requires persistent expression of these synthetic factors. Although general “hit-and-run” strategies exist for inducing long-term silencing of endogenous genes using targeted artificial transcriptional repressors, to our knowledge no equivalent approach for gene activation has been described to date. Here we show stable gene activation can be achieved by harnessing endogenous transcription factors (EndoTFs) that are normally expressed in human cells. Specifically, EndoTFs can be recruited to activate endogenous human genes of interest by using CRISPR-based gene editing to introduce EndoTF DNA binding motifs into a target gene promoter. This Precision Editing of Regulatory Sequences to Induce Stable Transcription-On (PERSIST-On) approach results in stable long-term gene activation, which we show is durable for at least five months. Using a high-throughput CRISPR prime editing pooled screening method, we also show that the magnitude of gene activation can be finely tuned either by using binding sites for different EndoTFs or by introducing specific mutations within such sites. Our results delineate a generalizable framework for using PERSIST-On to induce heritable and fine-tunable gene activation in a hit-and-run fashion, thereby enabling a wide range of research and therapeutic applications that require long-term upregulation of a target gene.
... Taken together, the histone marks, TFBS, and STARR-seq data indicate that the TEs discussed here match two recently described enhancer classes, namely the closed-chromatin enhancers and the cryptic ones [53]. These represent weakly active or inactive enhancers. ...
Article
Full-text available
RepEnTools, our new program, is now published in Mobile DNA! RepEnTools takes raw ChIP-seq (& input) data and reports genome-wide enrichments and depletions on repeat elements (REs). We use experimental and simulated data to show that it is fast, efficient, and accurate. It is also easily accessible to all users, thanks to Galaxy workflows or the UNIX auto-installer, and the step-by-step instructions for both! Validated files are available for the human chm13v2.0 and mouse mm39 assemblies on GitHub and figshare. Using RepEnTools, we show that hUHRF1-TTD and endogenous mUHRF1 in human hepatoblastoma HepG2 cells and mouse preimplantation ESCs colocalise with H3K4me1-K9me3 on promoters of species-specific REs which are silenced by UHRF1. Together, the data suggest a functional role for UHRF1 in silencing of REs that is mediated by TTD binding to the H3K4me1-K9me3 double mark and is conserved in two mammalian species. GitHub: https://github.com/PavelBashtrykov/RepEnTools Figshare: https://figshare.com/projects/RepEnTools_An_automated_repeat_enrichment_analysis_package_for_ChIP-seq_data_reveals_hUHRF1_Tandem-Tudor_domain_enrichment_in_young_repeats/178368
... Binding motifs, which represent the sequence specificities of DNA and RNA-binding proteins (RBP), are considered to be the "atomic units of gene expression" [31]. Accordingly, to verify that LMs can capture aspects of the regulatory code in an alignment-free manner, we first needed to verify that they capture important known motifs. ...
Article
Full-text available
Background: The rise of large-scale multi-species genome sequencing projects promises to shed new light on how genomes encode gene regulatory instructions. To this end, new algorithms are needed that can leverage conservation to capture regulatory elements while accounting for their evolution. Results: Here, we introduce species-aware DNA language models, which we trained on more than 800 species spanning over 500 million years of evolution. Investigating their ability to predict masked nucleotides from context, we show that DNA language models distinguish transcription factor and RNA-binding protein motifs from background non-coding sequence. Owing to their flexibility, DNA language models capture conserved regulatory elements over much further evolutionary distances than sequence alignment would allow. Remarkably, DNA language models reconstruct motif instances bound in vivo better than unbound ones and account for the evolution of motif sequences and their positional constraints, showing that these models capture functional high-order sequence and evolutionary context. We further show that species-aware training yields improved sequence representations for endogenous and MPRA-based gene expression prediction, as well as motif discovery. Conclusions: Collectively, these results demonstrate that species-aware DNA language models are a powerful, flexible, and scalable tool to integrate information from large compendia of highly diverged genomes.
... These motifs might impact the affinity of CTCF for DNA binding, or alter its binding kinetics in other ways 28,29,43. Finally, CTCF binding can be hindered by the presence of 5meC outside of its binding motifs, for example through action with other 5meC-sensitive transcription factors, or through local 5meC-dependent changes to chromatin that could impair CTCF's access to DNA 44,45. Together, these mechanisms can explain sequence-independent 5meC-mediated reorganization of CTCF binding during normal cellular differentiation and lineage specification, and perturbed organization in developmental defects and disease. ...
Article
Cytosine DNA methylation is a highly conserved epigenetic mark in eukaryotes. Although the role of DNA methylation at gene promoters and repetitive elements has been extensively studied, the function of DNA methylation in other genomic contexts remains less clear. In the nucleus of mammalian cells, the genome is spatially organized at different levels, and strongly influences myriad genomic processes. There are a number of factors that regulate the three-dimensional (3D) organization of the genome, with the CTCF insulator protein being among the most well-characterized. Pertinently, CTCF binding has been reported as being DNA methylation-sensitive in certain contexts, perhaps most notably in the process of genomic imprinting. Therefore, it stands to reason that DNA methylation may play a broader role in the regulation of chromatin architecture. Here we summarize the current understanding that is relevant to both the mammalian DNA methylation and chromatin architecture fields and attempt to assess the extent to which DNA methylation impacts the folding of the genome. The focus is in early embryonic development and cellular transitions when the epigenome is in flux, but we also describe insights from pathological contexts, such as cancer, in which the epigenome and 3D genome organization are misregulated.
... The combination of different repressing and activating DNA regulatory motifs enables tuning promoter activity and, as a result, gene expression level. The sequence space of combined DNA regulatory motifs is often referred to in the literature as the regulatory code or grammar [6][7][8]. ...
... In the first approach, large regulatory regions were dissected using a traditional knock-down and rescue approach until the regulatory effect of every active TFBS was characterized [9][10][11]. In the second approach, a multitude of synthetic cis-regulatory regions (e.g., synthetic enhancers) composed of a small number of TFBSs arranged in various configurations were encoded in an oligo library (OL) and characterized using massively parallel reporter assays (MPRA) such as SORT-seq 8,12,13. These studies found that it was possible to increase the expression of a particular target gene by positioning a cassette of repeat TFBS immediately upstream of minimal core promoters. ...
... Third, the convergence of both the MBO and the de Boer model 24, which were trained on different experimentally generated promoter architectures, indicates that there is a large amount of redundancy in transcriptional regulation. Our work, the de Boer study 24, and others 8,33 further suggest that most mutations in the regulatory region may be non-essential (aside from fine-tuning expression at the local level), and thus do not radically affect the efficacy of the ML-based prediction models on unseen sequences, in different growth conditions, or in different cell types. This observation is further supported by a lack of detectable interaction between motifs, and the independence of the mean expression level from the position of the motifs within the sURS (see Fig. 5 and Supplementary Fig. 8). ...
Article
Full-text available
We demonstrate a transcriptional regulatory design algorithm that can boost expression in yeast and mammalian cell lines. The system consists of a simplified transcriptional architecture composed of a minimal core promoter and a synthetic upstream regulatory region (sURS) composed of up to three motifs selected from a list of 41 motifs conserved in the eukaryotic lineage. The sURS system was first characterized using an oligo-library containing 189,990 variants. We validate the resultant expression model using a set of 43 unseen sURS designs. The validation sURS experiments indicate that a generic set of grammar rules for boosting and attenuation may exist in yeast cells. Finally, we demonstrate that this generic set of grammar rules functions similarly in mammalian CHO-K1 and HeLa cells. Consequently, our work provides a design algorithm for boosting the expression of promoters used for expressing industrially relevant proteins in yeast and mammalian cell lines.
... A recent general-purpose modeling framework, called MAVE-NN, overcomes these challenges using a neural-network based approach to fit interpretable genotype-phenotype maps to data from massively parallel functional assays [57]. An important difference between MAVE-NN and recent deep learning models such as Deep-STARR and others [58][59][60][61] is that MAVE-NN explicitly models the relationship between sequence and activity separately from features of the experimental measurement, such as saturation, detection limits, and noise, rather than attempting to model all features of an MPRA dataset in a monolithic architecture trained end-to-end. This enables MAVE-NN models to learn interpretable parameters that correspond straightforwardly to additive contributions and interactions between sequence features such as TFBSs. ...
Article
Full-text available
The effects of transcription factor binding sites (TFBSs) on the activity of a cis-regulatory element (CRE) depend on the local sequence context. In rod photoreceptors, binding sites for the transcription factor (TF) Cone-rod homeobox (CRX) occur in both enhancers and silencers, but the sequence context that determines whether CRX binding sites contribute to activation or repression of transcription is not understood. To investigate the context-dependent activity of CRX sites, we fit neural network-based models to the activities of synthetic CREs composed of photoreceptor TFBSs. The models revealed that CRX binding sites consistently make positive, independent contributions to CRE activity, while negative homotypic interactions between sites cause CREs composed of multiple CRX sites to function as silencers. The effects of negative homotypic interactions can be overcome by the presence of other TFBSs that either interact cooperatively with CRX sites or make independent positive contributions to activity. The context-dependent activity of CRX sites is thus determined by the balance between positive heterotypic interactions, independent contributions of TFBSs, and negative homotypic interactions. Our findings explain observed patterns of activity among genomic CRX-bound enhancers and silencers, and suggest that enhancers may require diverse TFBSs to overcome negative homotypic interactions between TFBSs.