DeepSTARR quantitatively predicts enhancer activity genome wide from DNA sequence. a, Schematics of genome-wide UMI-STARR-seq using developmental (Dev) (DSCP; red) and housekeeping (Hk) (RpS12; blue) promoters. b, DeepSTARR predicts enhancer activity genome wide. Genome browser screenshot depicting observed and predicted UMI-STARR-seq profiles for both promoters for a locus on the held-out test chromosome (Chr) 2R. c, Architecture of the multitask convolutional neural network DeepSTARR that was trained to simultaneously predict quantitative Dev and Hk enhancer activities from 249-bp DNA sequences. d, DeepSTARR predicts enhancer activity quantitatively. Scatter plots of predicted versus observed Dev (left) and Hk (right) enhancer activity signal across all DNA sequences in the test set chromosome. Color reflects point density. e, DeepSTARR quantitatively predicts Dev and Hk enhancer–promoter specificity. Predicted versus observed log2FC between Dev and Hk activity for all enhancer sequences in the test set chromosome. PCC, Pearson correlation coefficient.
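
The comparisons in panels d and e can be illustrated numerically. The following is a minimal sketch, assuming predicted and observed activities are available as NumPy arrays of log2 values, so that the log2FC between Dev and Hk reduces to a difference. The function names and the toy numbers are hypothetical and are not taken from the publication.

```python
import numpy as np

def pearson_cc(pred, obs):
    """Pearson correlation coefficient (PCC) between two 1-D arrays."""
    pred = np.asarray(pred, dtype=float)
    obs = np.asarray(obs, dtype=float)
    pc = pred - pred.mean()
    oc = obs - obs.mean()
    return float((pc * oc).sum() / np.sqrt((pc ** 2).sum() * (oc ** 2).sum()))

def specificity_log2fc(dev, hk):
    """Dev-versus-Hk specificity.

    Assumption: both inputs are already log2 activities,
    so the log2 fold change is simply Dev minus Hk.
    """
    return np.asarray(dev, dtype=float) - np.asarray(hk, dtype=float)

# Toy values for four sequences (illustrative only, not real data).
obs_dev = np.array([0.5, 2.0, 3.5, 1.0])
obs_hk = np.array([1.5, 0.5, 1.0, 2.5])
pred_dev = np.array([0.7, 1.8, 3.0, 1.2])
pred_hk = np.array([1.2, 0.9, 1.1, 2.0])

pcc_dev = pearson_cc(pred_dev, obs_dev)   # panel d, left
pcc_hk = pearson_cc(pred_hk, obs_hk)      # panel d, right
pcc_spec = pearson_cc(                    # panel e
    specificity_log2fc(pred_dev, pred_hk),
    specificity_log2fc(obs_dev, obs_hk),
)
print(pcc_dev, pcc_hk, pcc_spec)
```

Computing the specificity PCC on the difference, rather than on each task separately, checks whether a model captures which promoter type an enhancer prefers and not only its overall strength.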

Source publication

DeepSTARR predicts enhancer activity from DNA sequence and enables the de novo design of synthetic enhancers

Article

May 2022

Enhancer sequences control gene expression and comprise binding sites (motifs) for different transcription factors (TFs). Despite extensive genetic and computational studies, the relationship between DNA sequence and regulatory activity is poorly understood, and de novo enhancer design has been challenging. Here, we built a deep-learning model, Dee...

Harnessing changes in open chromatin determined by ATAC‑seq to generate insulin‑responsive reporter constructs

Article

May 2022

Background: Gene regulation is critical for proper cellular function. Next-generation sequencing technology has revealed the presence of regulatory networks that regulate gene expression and essential cellular functions. Studies investigating the epigenome have begun to uncover the complex mechanisms regulating transcription. Assay for transposase-...

Interpreting cis-regulatory mechanisms from genomic deep neural networks using surrogate models

Article

Jun 2024

Deep neural networks (DNNs) have greatly advanced the ability to predict genome function from sequence. However, elucidating underlying biological mechanisms from genomic DNNs remains challenging. Existing interpretability methods, such as attribution maps, have their origins in non-biological machine learning applications and therefore have the potential to be improved by incorporating domain-specific interpretation strategies. Here we introduce SQUID (Surrogate Quantitative Interpretability for Deepnets), a genomic DNN interpretability framework based on domain-specific surrogate modelling. SQUID approximates genomic DNNs in user-specified regions of sequence space using surrogate models—simpler quantitative models that have inherently interpretable mathematical forms. SQUID leverages domain knowledge to model cis-regulatory mechanisms in genomic DNNs, in particular by removing the confounding effects that nonlinearities and heteroscedastic noise in functional genomics data can have on model interpretation. Benchmarking analysis on multiple genomic DNNs shows that SQUID, when compared to established interpretability methods, identifies motifs that are more consistent across genomic loci and yields improved single-nucleotide variant-effect predictions. SQUID also supports surrogate models that quantify epistatic interactions within and between cis-regulatory elements, as well as global explanations of cis-regulatory mechanisms across sequence contexts. SQUID thus advances the ability to mechanistically interpret genomic DNNs.
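
The surrogate-modelling idea described above can be sketched in a few lines. This is a minimal illustration, assuming a plain additive (linear) surrogate fitted by least squares to a black-box scorer on a locally mutagenized library. It is not SQUID's API, and it omits the nonlinearity and heteroscedastic noise modelling that the abstract describes. `toy_model`, the mutation rate, and the library size are hypothetical.

```python
import numpy as np

ALPHABET = "ACGT"

def one_hot(seq):
    """Flattened one-hot encoding of a DNA string (length 4 * len(seq))."""
    x = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        x[i, ALPHABET.index(base)] = 1.0
    return x.ravel()

def mutagenize(seq, n, rate, rng):
    """Sample n variants of seq, mutating each position with probability rate."""
    variants = []
    for _ in range(n):
        chars = list(seq)
        for i in range(len(chars)):
            if rng.random() < rate:
                chars[i] = rng.choice([b for b in ALPHABET if b != chars[i]])
        variants.append("".join(chars))
    return variants

def fit_additive_surrogate(seqs, scores):
    """Least-squares fit of score ~ intercept + per-position, per-base effects.

    The design matrix is rank deficient (each position's one-hot columns sum
    to the intercept column), so lstsq returns the minimum-norm solution.
    """
    X = np.array([one_hot(s) for s in seqs])
    X = np.hstack([np.ones((len(seqs), 1)), X])
    coef, *_ = np.linalg.lstsq(X, np.asarray(scores, dtype=float), rcond=None)
    return coef[0], coef[1:].reshape(-1, 4)

def toy_model(seq):
    """Hypothetical stand-in for a genomic DNN: rewards 'TGA' at positions 3-5."""
    return 2.0 * (seq[3:6] == "TGA") + 0.1 * seq.count("G")

rng = np.random.default_rng(0)
wild_type = "ACGTGACGTA"
library = mutagenize(wild_type, n=2000, rate=0.1, rng=rng)
scores = [toy_model(s) for s in library]
intercept, effects = fit_additive_surrogate(library, scores)
print(effects.round(2))  # rows: positions, columns: A, C, G, T
```

The fitted effect matrix plays the role of an attribution map for the chosen region of sequence space: positions inside the toy motif should carry the largest base-specific effects.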

Iterative deep learning-design of human enhancers exploits condensed sequence grammar to achieve cell type-specificity

Preprint

Jun 2024

An important and largely unsolved problem in synthetic biology is how to target gene expression to specific cell types. Here, we apply iterative deep learning to design synthetic enhancers with strong differential activity between two human cell lines. We initially train models on published datasets of enhancer activity and chromatin accessibility and use them to guide the design of synthetic enhancers that maximize predicted specificity. We experimentally validate these sequences, use the measurements to re-optimize the predictor, and design a second generation of enhancers with improved specificity. Our design methods embed relevant transcription factor binding site (TFBS) motifs with higher frequencies than comparable endogenous enhancers while using a more selective motif vocabulary, and we show that enhancer activity is correlated with transcription factor expression at the single cell level. Finally, we characterize causal features of top enhancers via perturbation experiments and show enhancers as short as 50bp can maintain specificity.

Targeted mutagenesis of specific genomic DNA sequences in animals for the in vivo generation of variant libraries

Preprint

Jun 2024

Understanding how the number, placement and affinity of transcription factor binding sites dictate gene regulatory programs remains a major unsolved challenge in biology, particularly in the context of multicellular organisms. To uncover these rules, it is first necessary to find the binding sites within a regulatory region with high precision, and then to systematically modulate this binding site arrangement while simultaneously measuring the effect of this modulation on output gene expression. Massively parallel reporter assays (MPRAs), where the gene expression stemming from 10,000s of in vitro-generated regulatory sequences is measured, have made this feat possible in high throughput in single cells in culture. However, because of a lack of technologies to incorporate DNA libraries, MPRAs are limited in whole organisms. To enable MPRAs in multicellular organisms, we generated tools to create a high degree of mutagenesis in specific genomic loci in vivo using base editing. Targeting GFP integrated in the genome of Drosophila cell culture and whole animals as a case study, we show that the base editor AID evoCDA1, stemming from sea lamprey and fused to nCas9, is highly mutagenic. Surprisingly, longer gRNAs increase mutation efficiency and expand the mutating window, which can allow the introduction of mutations in previously untargetable sequences. Finally, we demonstrate arrays of >20 gRNAs that can efficiently introduce mutations along a 200bp sequence, making it a promising tool to test enhancer function in vivo in a high-throughput manner.

GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models

Preprint

Jun 2024

The Genomic Foundation Model (GFM) paradigm is expected to facilitate the extraction of generalizable representations from massive genomic data, thereby enabling their application across a spectrum of downstream applications. Despite advancements, the lack of an evaluation framework makes it difficult to ensure equitable assessment, owing to differences in experimental settings, model intricacy, benchmark datasets, and reproducibility challenges. In the absence of standardization, comparative analyses risk becoming biased and unreliable. To surmount this impasse, we introduce GenBench, a comprehensive benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models. GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies. It systematically evaluates datasets spanning diverse biological domains, with a particular emphasis on both short-range and long-range genomic tasks, starting with the three most important DNA tasks covering Coding Region, Non-Coding Region, Genome Structure, etc. Moreover, we provide a nuanced analysis of the interplay between model architecture and dataset characteristics on task-specific performance. Our findings reveal an interesting observation: independent of the number of parameters, the discernible difference in preference between the attention-based and convolution-based models on short- and long-range tasks may provide insights into the future design of GFM.

Species-specific design of artificial promoters by transfer-learning based generative deep-learning model

Article

May 2024
NUCLEIC ACIDS RES

Native prokaryotic promoters share common sequence patterns, but are species dependent. For understudied species with limited data, it is challenging to predict the strength of existing promoters and generate novel promoters. Here, we developed PromoGen, a collection of nucleotide language models to generate species-specific functional promoters across dozens of species in a data- and parameter-efficient way. Twenty-seven species-specific models in this collection were finetuned from the pretrained model, which was trained on multi-species promoters. When systematically compared with native promoters, the Escherichia coli- and Bacillus subtilis-specific artificial PromoGen-generated promoters (PGPs) were demonstrated to hold all distribution patterns of native promoters. A regression model was developed to score promoters generated either by PromoGen or by another competitive neural network, and the overall score of PGPs is higher. Encouraged by in silico analysis, we further experimentally characterized twenty-two B. subtilis PGPs; the results showed that four of the tested PGPs reached the strong promoter level, while all were active. Furthermore, we developed a user-friendly website to generate species-specific promoters for 27 different species by PromoGen. This work presents an efficient deep-learning strategy for de novo species-specific promoter generation even with limited datasets, providing valuable promoter toolboxes especially for the metabolic engineering of understudied microorganisms.

GENA-Web - GENomic Annotations Web Inference using DNA language models

Preprint

Apr 2024

The advent of advanced sequencing technologies has significantly reduced the cost and increased the feasibility of assembling high-quality genomes. Yet, the annotation of genomic elements remains a complex challenge. Even for species with comprehensively annotated reference genomes, the functional assessment of individual genetic variants is not straightforward. In response to these challenges, recent breakthroughs in machine learning have led to the development of DNA language models. These transformer-based architectures are designed to tackle a wide array of genomic tasks with enhanced efficiency and accuracy. In this context, we introduce GENA-Web, a web-based platform that consolidates a suite of genome annotation tools powered by DNA language models. The version of GENA-Web presented here encompasses a diverse set of models trained on human data, including the prediction of promoter activity, annotation of splice sites, determination of various chromatin features, and a model for scoring of enhancer activity in Drosophila. GENA-Web is accessible online at https://dnalm.airi.net/

SegmentNT: annotating the genome at single-nucleotide resolution with DNA foundation models

Preprint

Mar 2024

Foundation models have achieved remarkable success in several fields such as natural language processing, computer vision and more recently biology. DNA foundation models in particular are emerging as a promising approach for genomics. However, so far no model has delivered granular, nucleotide-level predictions across a wide range of genomic and regulatory elements, limiting its practical usefulness. In this paper, we build on our previous work on the Nucleotide Transformer (NT) to develop a segmentation model, SegmentNT, that processes input DNA sequences up to 30kb in length to predict 14 different classes of genomics elements at single-nucleotide resolution. By utilizing pre-trained weights from NT, SegmentNT surpasses the performance of several ablation models, including convolution networks with one-hot encoded nucleotide sequences and models trained from scratch. SegmentNT can process multiple sequence lengths with zero-shot generalization for sequences of up to 50kb. We show improved performance on the detection of splice sites throughout the genome and demonstrate strong nucleotide-level precision. Because it evaluates all gene elements simultaneously, SegmentNT can predict the impact of sequence variants not only on splice site changes but also on exon and intron rearrangements in transcript isoforms. Finally, we show that a SegmentNT model trained on human genomics elements can generalize to elements of different species and that a trained multispecies SegmentNT model achieves stronger generalization for all genic elements on unseen species. In summary, SegmentNT demonstrates that DNA foundation models can tackle complex, granular tasks in genomics at single-nucleotide resolution. SegmentNT can be easily extended to additional genomics elements and species, thus representing a new paradigm for how we analyze and interpret DNA. We make our SegmentNT-30kb human and multispecies models available on our GitHub repository in JAX and HuggingFace space in PyTorch.

Inference of Transcriptional Regulation From STARR-seq Data

Preprint

Mar 2024

One of the primary regulatory processes in cells is transcription, during which RNA polymerase II (Pol-II) transcribes DNA into RNA. The binding of Pol-II to its site is regulated through interactions with transcription factors (TFs) that bind to DNA at enhancer cis-regulatory elements. Measuring the enhancer activity of large libraries of distinct DNA sequences is now possible using Massively Parallel Reporter Assays (MPRAs), and computational methods have been developed to identify the dominant statistical patterns of TF binding within these large datasets. Such methods are global in their approach and may overlook important regulatory sites which function only within the local context. Here we introduce a method for inferring functional regulatory sites (their number, location and width) within an enhancer sequence based on measurements of its transcriptional activity from an MPRA method such as STARR-seq. The model is based on a mean-field thermodynamic description of Pol-II binding that includes interactions with bound TFs. Our method applied to simulated STARR-seq data for a variety of enhancer architectures shows how data quality impacts the inference and also how it can find local regulatory sites that may be missed in a global approach. We also apply the method to recently measured STARR-seq data on androgen receptor (AR) bound sequences, a TF that plays an important role in the regulation of prostate cancer. The method identifies key regulatory sites within these sequences which are found to overlap with binding sites of known co-regulators of AR.

Author Summary

We present an inference method for identifying regulatory sites within a putative DNA enhancer sequence, given only the measured transcriptional output of a set of overlapping sequences using an assay like STARR-seq. It is based on a mean-field thermodynamic model that calculates the binding probability of Pol-II to its promoter and includes interactions with sites in the DNA sequence of interest. By maximizing the likelihood of the data given the model, we can infer the number of regulatory sites, their locations, and their widths. Since it is a local model, it can in principle find regulatory sites that are important within a local context that may get missed in a global fit. We test our method on simulated data of simple enhancer architectures and show that it is able to find only the functional sites. We also apply our method to experimental STARR-seq data from 36 androgen receptor bound DNA sequences from a prostate cancer cell line. The inferred regulatory sites overlap known important regulatory motifs and their ChIP-seq data in these regions. Our method shows potential at identifying locally important functional regulatory sites within an enhancer given only its measured transcriptional output.

Interpretable deep learning reveals the sequence rules of Hippo signaling

Preprint

Feb 2024

The response to signaling pathways is highly context-specific, and identifying the transcription factors and mechanisms that are responsible is very challenging. Using the Hippo pathway in mouse trophoblast stem cells as a model, we show here that this information is encoded in cis-regulatory sequences and can be learned from high-resolution binding data of signaling transcription factors. Using interpretable deep learning, we show that the binding levels of TEAD4 and YAP1 are enhanced in a distance-dependent manner by cell type-specific transcription factors, including TFAP2C. We also discovered that strictly spaced Tead double motifs are widespread highly active canonical response elements that mediate cooperativity by promoting labile TEAD4 protein-protein interactions on DNA. These syntax rules and mechanisms apply genome-wide and allow us to predict how small sequence changes alter the activity of enhancers in vivo. This illustrates the power of interpretable deep learning to decode canonical and cell type-specific sequence rules of signaling pathways.

EvoAug-TF: Extending evolution-inspired data augmentations for genomic deep learning to TensorFlow

Preprint

Jan 2024

Deep neural networks (DNNs) have been widely applied to predict the molecular functions of regulatory regions in the non-coding genome. DNNs are data hungry and thus require many training examples to fit data well. However, functional genomics experiments typically generate limited amounts of data, constrained by the activity levels of the molecular function under study inside the cell. Recently, EvoAug was introduced to train a genomic DNN with evolution-inspired augmentations. EvoAug-trained DNNs have demonstrated improved generalization and interpretability with attribution analysis. However, EvoAug only supports PyTorch-based models, which limits its applications to a broad class of genomic DNNs based in TensorFlow. Here, we extend EvoAug's functionality to TensorFlow in a new package we call EvoAug-TF. Through a systematic benchmark, we find that EvoAug-TF yields comparable performance with the original EvoAug package. Availability: EvoAug-TF is freely available for users and is distributed under an open-source MIT license. Researchers can access the open-source code on GitHub (https://github.com/p-koo/evoaug-tf). The pre-compiled package is provided via PyPI (https://pypi.org/project/evoaug-tf) with in-depth documentation on ReadTheDocs (https://evoaug-tf.readthedocs.io). The scripts for reproducing the results are available at https://github.com/p-koo/evoaug-tf_analysis.
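
To make the idea of sequence-level augmentation concrete, here is a minimal sketch in plain NumPy. It assumes two illustrative perturbations, random point substitution and a random deletion padded back to the original length. It does not use or reproduce the EvoAug or EvoAug-TF API, and the function names, rates, and padding choice are hypothetical.

```python
import numpy as np

ALPHABET = np.array(list("ACGT"))

def random_mutation(seq, rate, rng):
    """Replace each position with a random base with probability rate.

    The replacement is drawn uniformly from A/C/G/T, so it may
    occasionally equal the original base.
    """
    chars = np.array(list(seq))
    mask = rng.random(len(chars)) < rate
    chars[mask] = rng.choice(ALPHABET, size=int(mask.sum()))
    return "".join(chars)

def random_deletion(seq, max_len, rng):
    """Delete a random stretch of up to max_len bases.

    Assumption: the sequence is padded at its 3' end with random bases
    so that the model input length stays fixed.
    """
    del_len = int(rng.integers(0, max_len + 1))
    start = int(rng.integers(0, len(seq) - del_len + 1))
    kept = seq[:start] + seq[start + del_len:]
    pad = "".join(rng.choice(ALPHABET, size=del_len))
    return kept + pad

def augment(seq, rng):
    """Apply both perturbations; the training label would stay unchanged."""
    seq = random_deletion(seq, max_len=5, rng=rng)
    return random_mutation(seq, rate=0.05, rng=rng)

rng = np.random.default_rng(0)
original = "ACGTACGTACGTACGTACGT"
print(augment(original, rng))
```

In a training loop, a function like `augment` would be applied to each sequence on the fly, so the network sees a slightly different variant of every example in each epoch.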
