Figure - available from: Nature Genetics
DeepSTARR quantitatively predicts enhancer activity genome wide from DNA sequence
a, Schematics of genome-wide UMI-STARR-seq using developmental (Dev) (DSCP; red) and housekeeping (Hk) (RpS12; blue) promoters. b, DeepSTARR predicts enhancer activity genome wide. Genome browser screenshot depicting observed and predicted UMI-STARR-seq profiles for both promoters for a locus on the held-out test chromosome (Chr) 2R. c, Architecture of the multitask convolutional neural network DeepSTARR that was trained to simultaneously predict quantitative Dev and Hk enhancer activities from 249-bp DNA sequences. d, DeepSTARR predicts enhancer activity quantitatively. Scatter plots of predicted versus observed Dev (left) and Hk (right) enhancer activity signal across all DNA sequences in the test set chromosome. Color reflects point density. e, DeepSTARR quantitatively predicts Dev and Hk enhancer–promoter specificity. Predicted versus observed log2FC between Dev and Hk activity for all enhancer sequences in the test set chromosome. PCC, Pearson correlation coefficient.
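The caption mentions two ideas that a short example can make concrete: the model takes fixed-length 249-bp DNA sequences as input, and its predictions are scored against observed activity with the Pearson correlation coefficient (PCC). The sketch below is illustrative only and is not the authors' code. The helper names `one_hot` and `pearson`, the A/C/G/T channel order, the all-zero encoding of unknown bases, and the activity values are all assumptions made for the example.

```python
import math

BASES = "ACGT"  # assumed channel order; the source does not specify one


def one_hot(seq):
    """Encode a DNA string as rows of 4 floats (A, C, G, T).

    Bases outside ACGT (for example N) become an all-zero row; this is an
    assumption of the sketch, not something stated in the source.
    """
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("PCC is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# A made-up 249-bp sequence (62 repeats of ACGT plus one A).
sequence = "ACGT" * 62 + "A"
encoded = one_hot(sequence)
print(len(encoded), len(encoded[0]))  # rows = sequence length, columns = 4

# Made-up observed and predicted activities for four sequences.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 4.0, 6.0, 8.0]
pcc = pearson(observed, predicted)
print(pcc)
```

In a multitask setting such as the one described, the same PCC calculation would be applied separately to each output (Dev and Hk).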

Source publication
Article
Full-text available
Enhancer sequences control gene expression and comprise binding sites (motifs) for different transcription factors (TFs). Despite extensive genetic and computational studies, the relationship between DNA sequence and regulatory activity is poorly understood, and de novo enhancer design has been challenging. Here, we built a deep-learning model, Dee...

Similar publications

Article
Full-text available
Background: Gene regulation is critical for proper cellular function. Next-generation sequencing technology has revealed the presence of regulatory networks that regulate gene expression and essential cellular functions. Studies investigating the epigenome have begun to uncover the complex mechanisms regulating transcription. Assay for transposase-...

Citations

... The position of these example binding motifs, representative of low variation and high variation, is marked in the left plot using a circle and square, respectively. Note: we used K = 50 for the K-Lasso implementation of LIME; results for other choices of K are shown in Extended Data Fig. 1 of other TFs nearby [32][33][34] . However, variation in TF binding motifs can be exacerbated by the specific ways in which attribution methods quantify the behaviour of the DNN function in localized regions of sequence space (Fig. 1c). ...
...
• Saliency Maps scores were computed by evaluating the gradient of the scalar DNN prediction at the sequence of interest with respect to the one-hot encoding of that sequence.
• DeepSHAP scores were computed using the algorithm implemented in the DeepSTARR repository 34.
• DeepLIFT scores were computed using the algorithm implemented in the BPNet repository 5.
...
Article
Full-text available
Deep neural networks (DNNs) have greatly advanced the ability to predict genome function from sequence. However, elucidating underlying biological mechanisms from genomic DNNs remains challenging. Existing interpretability methods, such as attribution maps, have their origins in non-biological machine learning applications and therefore have the potential to be improved by incorporating domain-specific interpretation strategies. Here we introduce SQUID (Surrogate Quantitative Interpretability for Deepnets), a genomic DNN interpretability framework based on domain-specific surrogate modelling. SQUID approximates genomic DNNs in user-specified regions of sequence space using surrogate models—simpler quantitative models that have inherently interpretable mathematical forms. SQUID leverages domain knowledge to model cis-regulatory mechanisms in genomic DNNs, in particular by removing the confounding effects that nonlinearities and heteroscedastic noise in functional genomics data can have on model interpretation. Benchmarking analysis on multiple genomic DNNs shows that SQUID, when compared to established interpretability methods, identifies motifs that are more consistent across genomic loci and yields improved single-nucleotide variant-effect predictions. SQUID also supports surrogate models that quantify epistatic interactions within and between cis-regulatory elements, as well as global explanations of cis-regulatory mechanisms across sequence contexts. SQUID thus advances the ability to mechanistically interpret genomic DNNs.
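The citing excerpt above describes saliency maps as the gradient of a scalar DNN prediction with respect to the one-hot encoding of the input sequence. The following is a minimal sketch of that idea under stated assumptions: `toy_model` is a hypothetical stand-in (a tanh of a position-specific weighted sum) rather than DeepSTARR or any published network, and the gradient is approximated with central finite differences instead of the automatic differentiation that deep-learning frameworks would normally provide. All names and weights are invented for illustration.

```python
import math

BASES = "ACGT"  # assumed channel order


def one_hot(seq):
    """Encode a DNA string as rows of 4 floats (A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def toy_model(x, weights):
    """Hypothetical scalar predictor: tanh of a position-specific weighted sum."""
    total = sum(w * v for w_row, x_row in zip(weights, x)
                for w, v in zip(w_row, x_row))
    return math.tanh(total)


def saliency(model, x, eps=1e-5):
    """Approximate d model(x) / d x[i][j] by central finite differences.

    Each entry of x is perturbed in place and then restored, so x is
    unchanged when the function returns.
    """
    grads = []
    for row in x:
        grad_row = []
        for j in range(len(row)):
            original = row[j]
            row[j] = original + eps
            up = model(x)
            row[j] = original - eps
            down = model(x)
            row[j] = original  # restore the input
            grad_row.append((up - down) / (2 * eps))
        grads.append(grad_row)
    return grads


# Invented weights for a 3-bp toy input (one row per position).
weights = [
    [0.1, -0.2, 0.3, 0.0],
    [0.0, 0.5, -0.1, 0.2],
    [-0.3, 0.1, 0.0, 0.4],
]
x = one_hot("ACT")
grads = saliency(lambda z: toy_model(z, weights), x)

# One common summary keeps only the gradient at the observed base.
per_position = [sum(g * v for g, v in zip(g_row, x_row))
                for g_row, x_row in zip(grads, x)]
print(per_position)
```

For this toy model the gradient has a closed form, (1 - tanh(s)^2) times the weight, where s is the weighted sum, which makes the finite-difference result easy to check.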
... Current state-of-the-art approaches to confront the limitation of purely experimental screens involve deep learning models trained on large-scale genomic and/or MPRA datasets. Such models are theoretically capable of learning a sequence-to-function mapping that captures underlying biological principles [12][13][14][15][16][17] , and can thereby guide the design of synthetic enhancers with targeted activity levels. DaSilva et al. and Lal et al. demonstrated synthetic cell type-specific enhancer design in silico 18,19 . ...
... DaSilva et al. and Lal et al. demonstrated synthetic cell type-specific enhancer design in silico 18,19 . De Almeida et al. experimentally validated synthetic enhancer designs, initially using STARR-seq to prove enhancer activity in a single Drosophila developmental cell type 15 , then targeting enhancers to four distinct tissue types in the Drosophila embryo and confirming specificity with in vivo assays 20 . In these studies, sequences were randomly generated and those with the highest model-predicted activity and/or specificity were selected for testing. ...
Preprint
Full-text available
An important and largely unsolved problem in synthetic biology is how to target gene expression to specific cell types. Here, we apply iterative deep learning to design synthetic enhancers with strong differential activity between two human cell lines. We initially train models on published datasets of enhancer activity and chromatin accessibility and use them to guide the design of synthetic enhancers that maximize predicted specificity. We experimentally validate these sequences, use the measurements to re-optimize the predictor, and design a second generation of enhancers with improved specificity. Our design methods embed relevant transcription factor binding site (TFBS) motifs with higher frequencies than comparable endogenous enhancers while using a more selective motif vocabulary, and we show that enhancer activity is correlated with transcription factor expression at the single cell level. Finally, we characterize causal features of top enhancers via perturbation experiments and show enhancers as short as 50bp can maintain specificity.
... These correlations yield an information footprint that reveals activator and repressor binding sites (Ireland et al., 2020). Further, by computationally designing these libraries, regulatory parameters such as the relative placement of binding sites can be systematically tested (de Almeida et al., 2022;Kircher et al., 2019;Qi et al., 2020;Sharon et al., 2012;Yu et al., 2021). The result has been a pipeline that allows for the rapid identification and experimental validation of binding sites within an enhancer, and for the engagement in a theory-experiment dialogue aimed at reaching a predictive understanding of how binding site architecture dictates gene expression. ...
... We posit that this tool can be readily adapted to other multicellular and single-celled organisms. Building on recent previous works aimed at using MPRAs to find the number, placement and affinity of transcription factor binding sites in previously uncharted regulatory regions (de Almeida et al., 2022;Ireland et al., 2020), future work from our lab will harness this new technology to map the regulatory architecture of enhancers relevant to fly embryogenesis in a single experiment. ...
Preprint
Full-text available
Understanding how the number, placement and affinity of transcription factor binding sites dictate gene regulatory programs remains a major unsolved challenge in biology, particularly in the context of multicellular organisms. To uncover these rules, it is first necessary to find the binding sites within a regulatory region with high precision, and then to systematically modulate this binding site arrangement while simultaneously measuring the effect of this modulation on output gene expression. Massively parallel reporter assays (MPRAs), where the gene expression stemming from 10,000s of in vitro-generated regulatory sequences is measured, have made this feat possible in high throughput in single cells in culture. However, because of a lack of technologies to incorporate DNA libraries, MPRAs are limited in whole organisms. To enable MPRAs in multicellular organisms, we generated tools to create a high degree of mutagenesis in specific genomic loci in vivo using base editing. Targeting GFP integrated in the genome of Drosophila cell culture and whole animals as a case study, we show that the base editor AID evoCDA1 stemming from sea lamprey fused to nCas9 is highly mutagenic. Surprisingly, longer gRNAs increase mutation efficiency and expand the mutating window, which can allow the introduction of mutations in previously untargetable sequences. Finally, we demonstrate arrays of >20 gRNAs that can efficiently introduce mutations along a 200 bp sequence, making it a promising tool to test enhancer function in vivo in a high-throughput manner.
... HyenaDNA and Caduceus adopt a Char tokenizer due to their subquadratic space complexity and linear space complexity. Besides, we have included expert models for particular downstream tasks for comparison, named SpliceAI [14], DeepSTARR [15], CNN [16], and Orca [17]. ...
... Short-range tasks are characterized by input lengths of less than one thousand. Our analysis covers thirty-eight datasets related to short-range tasks, which include various types of tasks like sequence classification, variant classification, Epigenetic mark prediction, promoter detection, enhancer prediction, transcription factor detection, and splice site prediction [6,11,15,1]. ...
Preprint
The Genomic Foundation Model (GFM) paradigm is expected to facilitate the extraction of generalizable representations from massive genomic data, thereby enabling their application across a spectrum of downstream applications. Despite advancements, a lack of evaluation framework makes it difficult to ensure equitable assessment due to experimental settings, model intricacy, benchmark datasets, and reproducibility challenges. In the absence of standardization, comparative analyses risk becoming biased and unreliable. To surmount this impasse, we introduce GenBench, a comprehensive benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models. GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies. Through systematic evaluations of datasets spanning diverse biological domains with a particular emphasis on both short-range and long-range genomic tasks, firstly including the three most important DNA tasks covering Coding Region, Non-Coding Region, Genome Structure, etc. Moreover, we provide a nuanced analysis of the interplay between model architecture and dataset characteristics on task-specific performance. Our findings reveal an interesting observation: independent of the number of parameters, the discernible difference in preference between the attention-based and convolution-based models on short- and long-range tasks may provide insights into the future design of GFM.
... Despite the fact that employing promoters of varying strengths is one of the most crucial methods for regulating gene expressions, the integration of diverse regulatory elements such as Ribosome Binding Sites (RBS), terminators, insulators, and others, offers expanded options for gene expression modulation (53)(54)(55). Furthermore, synthetic promoters, which establish a predetermined expression intensity, present challenges in dynamically regulating gene expression during growth. ...
Article
Full-text available
Native prokaryotic promoters share common sequence patterns, but are species dependent. For understudied species with limited data, it is challenging to predict the strength of existing promoters and generate novel promoters. Here, we developed PromoGen, a collection of nucleotide language models to generate species-specific functional promoters, across dozens of species in a data- and parameter-efficient way. Twenty-seven species-specific models in this collection were finetuned from the pretrained model which was trained on multi-species promoters. When systematically compared with native promoters, the Escherichia coli- and Bacillus subtilis-specific artificial PromoGen-generated promoters (PGPs) were demonstrated to hold all distribution patterns of native promoters. A regression model was developed to score promoters generated either by PromoGen or by another competitive neural network, and the overall score of PGPs is higher. Encouraged by in silico analysis, we further experimentally characterized twenty-two B. subtilis PGPs; results showed that four of the tested PGPs reached the strong promoter level while all were active. Furthermore, we developed a user-friendly website to generate species-specific promoters for 27 different species by PromoGen. This work presented an efficient deep-learning strategy for de novo species-specific promoter generation even with limited datasets, providing valuable promoter toolboxes especially for the metabolic engineering of understudied microorganisms.
... For promoter activity prediction, we used EPDnew-derived promoter sequences as positive samples set and random genomic sequences non-overlapping promoters as a negative samples set. For other tasks, we used original datasets provided by authors (10,12,14). ...
... Enhancer Activity Annotation Using outputs of the model trained on Drosophila cell reporter assays (14), this task evaluates the sequence's potential to enhance the activity of housekeeping and developmental gene promoters, presented in two distinct tracks. ...
Preprint
The advent of advanced sequencing technologies has significantly reduced the cost and increased the feasibility of assembling high-quality genomes. Yet, the annotation of genomic elements remains a complex challenge. Even for species with comprehensively annotated reference genomes, the functional assessment of individual genetic variants is not straightforward. In response to these challenges, recent breakthroughs in machine learning have led to the development of DNA language models. These transformer-based architectures are designed to tackle a wide array of genomic tasks with enhanced efficiency and accuracy. In this context, we introduce GENA-Web, a web-based platform that consolidates a suite of genome annotation tools powered by DNA language models. The version of GENA-Web presented here encompasses a diverse set of models trained on human data, including the prediction of promoter activity, annotation of splice sites, determination of various chromatin features, and a model for scoring of enhancer activity in Drosophila. GENA-Web is accessible online at https://dnalm.airi.net/
... Sequence-based machine learning models trained on large-scale genomics data capture complex patterns in the sequence and can predict diverse molecular phenotypes with great accuracy. Recently, convolutional neural networks have demonstrated superior performance over other architectures across most sequence-based problems [3,4,5,6,7,8,9,10,11], sometimes combined with LSTMs [12,13,14,15] or transformer layers [16,17]. ...
Preprint
Full-text available
Foundation models have achieved remarkable success in several fields such as natural language processing, computer vision and more recently biology. DNA foundation models in particular are emerging as a promising approach for genomics. However, so far no model has delivered granular, nucleotide-level predictions across a wide range of genomic and regulatory elements, limiting its practical usefulness. In this paper, we build on our previous work on the Nucleotide Transformer (NT) to develop a segmentation model, SegmentNT, that processes input DNA sequences up to 30kb length to predict 14 different classes of genomics elements at single nucleotide resolution. By utilizing pre-trained weights from NT, SegmentNT surpasses the performance of several ablation models, including convolution networks with one-hot encoded nucleotide sequences and models trained from scratch. SegmentNT can process multiple sequence lengths with zero-shot generalization for sequences of up to 50kb. We show improved performance on the detection of splice sites throughout the genome and demonstrate strong nucleotide-level precision. Because it evaluates all gene elements simultaneously, SegmentNT can predict the impact of sequence variants not only on splice site changes but also on exon and intron rearrangements in transcript isoforms. Finally, we show that a SegmentNT model trained on human genomics elements can generalize to elements of different species and that a trained multispecies SegmentNT model achieves stronger generalization for all genic elements on unseen species. In summary, SegmentNT demonstrates that DNA foundation models can tackle complex, granular tasks in genomics at a single-nucleotide resolution. SegmentNT can be easily extended to additional genomics elements and species, thus representing a new paradigm on how we analyze and interpret DNA. We make our SegmentNT-30kb human and multispecies models available on our github repository in Jax and HuggingFace space in Pytorch.
... Recent advances in MPRA and STARR-seq analysis have largely been driven by two complementary methodologies. The first employs deep learning techniques [7][8][9][10][11][12][13][14], where neural networks process DNA sequences to predict the STARR-seq output [15]. These methods are able to identify known and novel binding motifs that play a part in the transcriptional program being studied, though interpretability still remains a challenge. ...
Preprint
Full-text available
One of the primary regulatory processes in cells is transcription, during which RNA polymerase II (Pol-II) transcribes DNA into RNA. The binding of Pol-II to its site is regulated through interactions with transcription factors (TFs) that bind to DNA at enhancer cis-regulatory elements. Measuring the enhancer activity of large libraries of distinct DNA sequences is now possible using Massively Parallel Reporter Assays (MPRAs), and computational methods have been developed to identify the dominant statistical patterns of TF binding within these large datasets. Such methods are global in their approach and may overlook important regulatory sites which function only within the local context. Here we introduce a method for inferring functional regulatory sites (their number, location and width) within an enhancer sequence based on measurements of its transcriptional activity from an MPRA method such as STARR-seq. The model is based on a mean-field thermodynamic description of Pol-II binding that includes interactions with bound TFs. Our method applied to simulated STARR-seq data for a variety of enhancer architectures shows how data quality impacts the inference and also how it can find local regulatory sites that may be missed in a global approach. We also apply the method to recently measured STARR-seq data on androgen receptor (AR) bound sequences, a TF that plays an important role in the regulation of prostate cancer. The method identifies key regulatory sites within these sequences which are found to overlap with binding sites of known co-regulators of AR.
Author Summary
We present an inference method for identifying regulatory sites within a putative DNA enhancer sequence, given only the measured transcriptional output of a set of overlapping sequences using an assay like STARR-seq. It is based on a mean-field thermodynamic model that calculates the binding probability of Pol-II to its promoter and includes interactions with sites in the DNA sequence of interest. By maximizing the likelihood of the data given the model, we can infer the number of regulatory sites, their locations, and their widths. Since it is a local model, it can in principle find regulatory sites that are important within a local context that may get missed in a global fit. We test our method on simulated data of simple enhancer architectures and show that it is able to find only the functional sites. We also apply our method to experimental STARR-seq data from 36 androgen receptor bound DNA sequences from a prostate cancer cell line. The inferred regulatory sites overlap known important regulatory motifs and their ChIP-seq data in these regions. Our method shows potential at identifying locally important functional regulatory sites within an enhancer given only its measured transcriptional output.
... Here, we hypothesized that this TF cooperativity is DNA sequence-driven and thus can be studied by measuring the binding of TFs on DNA and identifying the underlying sequence rules using interpretable deep learning. During training, deep learning models accurately learn sequence rules within genomic regions in an inherently combinatorial manner de novo until they can predict the data from sequence alone [15][16][17][18][19][20][21] . The key step is then to interrogate the model and extract the learned sequence rules using interpretation tools 15 . ...
... Likewise, individually manipulating enhancer sequences in vivo limits throughput, and the effects can be difficult to interpret since they may be enhancer-specific or caused by the inadvertent disruption of other important sequences 30,31 . Large-scale reporter assays, on the other hand, have produced conflicting results on whether motif syntax is important and have not revealed whether synergistic effects of motifs are mediated through cooperative binding 16,27,[32][33][34][35][36] . For these reasons, TF binding cooperativity downstream of signaling pathways has not been systematically studied from a sequence perspective. ...
... If the binding of YAP1 can both determine enhancer activation and be enhanced by cell-specific TFs, we would expect the corresponding motifs of cell-specific TFs to activate transcription synergistically. Synergistic activation by two motifs has been documented 16,[82][83][84] , but the mechanisms are not clear and could vary. Hence, it is important to understand whether YAP1's cooperativity with cell type-specific TFs is mechanistically connected to its ability to drive enhancer activation. ...
Preprint
Full-text available
The response to signaling pathways is highly context-specific, and identifying the transcription factors and mechanisms that are responsible is very challenging. Using the Hippo pathway in mouse trophoblast stem cells as a model, we show here that this information is encoded in cis-regulatory sequences and can be learned from high-resolution binding data of signaling transcription factors. Using interpretable deep learning, we show that the binding levels of TEAD4 and YAP1 are enhanced in a distance-dependent manner by cell type-specific transcription factors, including TFAP2C. We also discovered that strictly spaced Tead double motifs are widespread highly active canonical response elements that mediate cooperativity by promoting labile TEAD4 protein-protein interactions on DNA. These syntax rules and mechanisms apply genome-wide and allow us to predict how small sequence changes alter the activity of enhancers in vivo. This illustrates the power of interpretable deep learning to decode canonical and cell type-specific sequence rules of signaling pathways.
... To benchmark the performance of EvoAug-TF, we utilized the data and deep learning model from the DeepSTARR study 14 . The prediction task is set up to take as input 249 nucleotide (nt) sequences and predict enhancer activity (measured via STARR-seq 15 ) for developmental and housekeeping transcriptional promoters in D. melanogaster S2 cells as a multi-task regression. ...
Preprint
Full-text available
Deep neural networks (DNNs) have been widely applied to predict the molecular functions of regulatory regions in the non-coding genome. DNNs are data hungry and thus require many training examples to fit data well. However, functional genomics experiments typically generate limited amounts of data, constrained by the activity levels of the molecular function under study inside the cell. Recently, EvoAug was introduced to train a genomic DNN with evolution-inspired augmentations. EvoAug-trained DNNs have demonstrated improved generalization and interpretability with attribution analysis. However, EvoAug only supports PyTorch-based models, which limits its applications to a broad class of genomic DNNs based in TensorFlow. Here, we extend EvoAug's functionality to TensorFlow in a new package we call EvoAug-TF. Through a systematic benchmark, we find that EvoAug-TF yields comparable performance with the original EvoAug package. Availability: EvoAug-TF is freely available for users and is distributed under an open-source MIT license. Researchers can access the open-source code on GitHub (https://github.com/p-koo/evoaug-tf). The pre-compiled package is provided via PyPI (https://pypi.org/project/evoaug-tf) with in-depth documentation on ReadTheDocs (https://evoaug-tf.readthedocs.io). The scripts for reproducing the results are available at https://github.com/p-koo/evoaug-tf_analysis.