DeepSTARR predicts enhancer activity from DNA sequence and enables the de novo design of enhancers

Abstract

Enhancer sequences control gene expression and comprise binding sites (motifs) for different transcription factors (TFs). Despite extensive genetic and computational studies, the relationship between DNA sequence and regulatory activity is poorly understood and enhancer de novo design is considered impossible. Here we built a deep learning model, DeepSTARR, to quantitatively predict the activities of thousands of developmental and housekeeping enhancers directly from DNA sequence in Drosophila melanogaster S2 cells. The model learned relevant TF motifs and higher-order syntax rules, including functionally non-equivalent instances of the same TF motif that are determined by motif-flanking sequence and inter-motif distances. We validated these rules experimentally and demonstrated their conservation in human by testing more than 40,000 wildtype and mutant Drosophila and human enhancers. Finally, we designed and functionally validated synthetic enhancers with desired activities de novo.
Bernardo P. de Almeida (1,2), Franziska Reiter (1,2), Michaela Pagani (1), Alexander Stark (1,3)

1 Research Institute of Molecular Pathology (IMP), Vienna BioCenter (VBC), Campus-Vienna-Biocenter 1, Vienna, Austria
2 Vienna BioCenter PhD Program, Doctoral School of the University of Vienna and Medical University of Vienna, A-1030, Vienna, Austria
3 Medical University of Vienna, Vienna BioCenter (VBC), Vienna, Austria

Correspondence should be addressed to A.S. (stark@starklab.org)

bioRxiv preprint doi: https://doi.org/10.1101/2021.10.05.463203; this version posted October 7, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

Enhancers [1] are genomic elements that regulate the cell type-specific transcription of target genes, thereby controlling animal development and physiology [2]. A characteristic feature of enhancers is their ability to activate transcription outside their endogenous genomic contexts [3], which suggests that all the necessary cis-regulatory information is contained within the enhancers' DNA sequences. Indeed, enhancer sequence mutations can drastically alter enhancer function and are associated with developmental defects [2], morphological evolution [4], and human disease [5].
Enhancers typically contain multiple sequence motifs that are binding sites for sequence-specific transcription factors (TFs) [6]. Understanding how motifs and their arrangements (i.e. their number, order, orientation and spacing, termed here collectively motif syntax) relate to enhancer function has remained one of the most important open questions in modern biology. Systematic mutagenesis of various individual enhancers revealed a complex picture, whereby changing nucleotides or altering motif syntax affected the function of some enhancers but not others [7–26]. These contradictory observations made it difficult to define the relationships between enhancer sequence and function [18,27].
Many computational approaches have sought to predict enhancer activities from DNA sequences using local DNA features, e.g. motif dictionaries or de novo k-mers, and selected syntax rules in various thermodynamic or machine-learning frameworks [16,17,26,28–39]. Despite remarkable success, these approaches did not reveal how the elements of motif syntax collaborate to determine enhancer activity. In addition, they did not consider the mutual compatibilities between certain enhancer and promoter types recently reported for different transcriptional programs [40–42]. Thus, quantitatively predicting the regulatory activity of enhancers and the de novo design of synthetic enhancers have remained open challenges for decades.
Previous approaches typically modelled enhancer sequences explicitly via pre-defined sets of features, which were informed by prior biological knowledge [43]. In contrast, deep learning approaches, in particular convolutional neural networks, do not require prior knowledge and can learn accurate models directly from raw data [44–53]. Once trained on raw data, these models allow the extraction and interpretation of the learned rules by novel types of tools [44,45,47,48,54–60]. For example, when applied to ChIP-nexus data that measures TF binding genome-wide at high resolution, a convolutional neural network was able to learn motifs and syntax rules for cooperative TF binding [47]. Similarly, this approach was used to model DNA accessibility [45,46,52,59] and transcriptional reporter activities [51], and to predict genetic variant effects [53]. Nevertheless, an ultimate sequence-to-enhancer activity model that learns the cis-regulatory code to quantitatively predict enhancer activities in a single cell type is still missing.
Here, we built a new deep learning model, DeepSTARR, to predict enhancer activity towards two promoters from the distinct developmental and housekeeping transcriptional programs in Drosophila melanogaster S2 cells directly from the DNA sequence. For both programs, DeepSTARR quantitatively predicts enhancer activity for unseen sequences and reveals different coding features for the two programs, including specific TF motifs that we validate experimentally. We further extract motif syntax rules, including favorable and unfavorable sequence contexts and inter-motif distances, which are predictive of enhancer activity in Drosophila and human enhancers, as we validate experimentally by high-throughput mutagenesis of thousands of enhancers and enhancer variants. These rules allowed the de novo design of synthetic enhancers with desired activity levels.

Results
DeepSTARR quantitatively predicts enhancer activity from DNA sequence

Figure 1. DeepSTARR quantitatively predicts enhancer activity genome-wide from DNA sequence. A) Schematics of genome-wide UMI-STARR-seq using the developmental (Drosophila synthetic core promoter (DSCP); red) and housekeeping (RpS12; blue) promoters. B) DeepSTARR predicts enhancer activity genome-wide. Genome browser screenshot depicting UMI-STARR-seq observed and predicted profiles for both promoters for a locus on held-out test chromosome 2R. C) Architecture of the multi-task convolutional neural network DeepSTARR that was trained to simultaneously predict quantitative developmental and housekeeping enhancer activities (UMI-STARR-seq) from 249 bp DNA sequences. D) DeepSTARR predicts enhancer activity quantitatively. Scatter plots of predicted vs. observed developmental (left) and housekeeping (right) enhancer activity signal across all DNA sequences in the test set chromosome. Color reflects point density. E) DeepSTARR quantitatively predicts developmental and housekeeping enhancer-promoter specificity. Predicted vs. observed log2 fold-change (log2FC) between developmental and housekeeping activity for all enhancer sequences in the test set chromosome. PCC: Pearson correlation coefficient.

To learn the cis-regulatory information encoded in enhancer sequences in an unbiased way, we developed a new deep learning model called DeepSTARR that predicts enhancer activity directly from DNA sequence. First, we used UMI-STARR-seq [61,62] to generate genome-wide, high-resolution, quantitative activity maps of developmental and housekeeping enhancers, representing the two main transcriptional programs in Drosophila S2 cells [40–42] (Fig 1A). We identified 11,658 developmental and 7,062 housekeeping enhancers (Fig 1B, S1A,B). These enhancers are largely non-overlapping, confirming the specificity of the different transcriptional programs [40–42]. These genome-wide enhancer activity maps provide a high-quality dataset to build predictive models of enhancer activity and characterize the sequence determinants of two major enhancer types.
We built the multi-task convolutional neural network DeepSTARR to map 249 bp long DNA sequences tiled across the genome to both their developmental and their housekeeping enhancer activities (Fig 1C). We adapted the Basset convolutional neural network architecture [45] and designed DeepSTARR with four convolution layers, each followed by a max-pooling layer, and two fully connected layers (Fig 1C). The convolution layers identify local sequence features (e.g. TF motifs) and increasingly complex patterns (e.g. TF motif syntax), while the fully connected layers combine these features and patterns to predict enhancer activity separately for each enhancer type.
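
To make the model input described above concrete, the following minimal sketch tiles a sequence into 249 bp windows and one-hot encodes each window. The window length follows the text; the stride, the treatment of non-ACGT characters, and the function names `tile` and `one_hot` are assumptions made for illustration and are not taken from this section.

```python
# Illustrative sketch only: tile a DNA sequence into fixed-length windows
# and one-hot encode them. WINDOW follows the text (249 bp); STRIDE is a
# hypothetical value chosen for the example.
WINDOW = 249
STRIDE = 100
BASES = "ACGT"


def tile(sequence, window=WINDOW, stride=STRIDE):
    """Return (start, subsequence) pairs for every full-length window."""
    last_start = len(sequence) - window
    return [(s, sequence[s:s + window]) for s in range(0, last_start + 1, stride)]


def one_hot(seq):
    """Encode a sequence as rows of four 0/1 values in A, C, G, T order.

    Characters other than A, C, G, T (e.g. N) become an all-zero row,
    which is an assumption of this sketch.
    """
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]


if __name__ == "__main__":
    toy_genome = "ACGT" * 125  # 500 bp toy sequence
    windows = tile(toy_genome)
    encoded = one_hot(windows[0][1])
    print(len(windows), len(encoded), len(encoded[0]))
```

Each encoded window is a 249 x 4 matrix, the kind of input a convolutional network over DNA sequence typically consumes.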
We evaluated the predictive performance of DeepSTARR on a held-out test chromosome. The predicted and observed enhancer-activity profiles were highly similar for both developmental (Pearson correlation coefficient (PCC)=0.68) and housekeeping (PCC=0.74) enhancers (Fig 1B,D, S1). This performance is close to the concordance between experimental replicates (PCC=0.73 and 0.76, respectively; Fig S1C), suggesting that the model accurately captures the regulatory information present in the sequences and the differences between developmental and housekeeping enhancers (Fig 1E). DeepSTARR performed better than methods based on known TF motifs or unbiased k-mer counts [39], both at predicting continuous enhancer activity and at binary classification of enhancer sequences (Fig S1D). Thus, DeepSTARR learned generalizable features and rules de novo directly from the DNA sequence that allow the prediction of enhancer activities for unseen sequences.
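
The evaluation metric used above, the Pearson correlation coefficient between predicted and observed activities, can be computed as in the short sketch below. The function name `pearson` and the example values are illustrative; they are not the authors' code or data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    observed = [0.2, 1.5, 3.1, 4.0]   # made-up activity values
    predicted = [0.5, 1.2, 2.8, 4.4]  # made-up model outputs
    print(round(pearson(observed, predicted), 3))
```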

DeepSTARR reveals important TF motif types required for enhancer activity

Figure 2. DeepSTARR reveals important TF motif types that we validate experimentally. A) DeepSTARR-derived developmental and housekeeping nucleotide contribution scores for strong developmental (left) and housekeeping (right) enhancer sequences, respectively. Regions with high scores resembling known TF motifs are highlighted. Log2 fold-change values (log2FC; bottom) indicate the impact on enhancer activity of mutating all instances of each motif type. B) DeepSTARR motifs discovered by TFModisco by summarizing recurring predictive sequence patterns from the sequences of all developmental (top) and housekeeping (bottom) enhancers and their associated nucleotide contribution scores. C) Developmental and housekeeping TF motifs are specifically required for the respective enhancer types. Enhancer activity changes (log2 FC) for developmental (top) and housekeeping (bottom) enhancers after mutating all instances of three control motifs (grey), four predicted developmental motifs (AP-1, GATA, twist, Trl; red) and three predicted housekeeping motifs (Dref, Ohler1, Ohler6; blue). The number of enhancers mutated for each motif type is shown. The box plots mark the median, upper and lower quartiles and 1.5× interquartile range (whiskers); outliers are shown individually. D) DeepSTARR discovers important TF motifs not obvious by motif enrichment. Comparison between motif enrichment (log2 odds ratio; x-axis) and DeepSTARR's predicted global importance (y-axis) for all representative TF motifs (Fig S2) in developmental (top) and housekeeping (bottom) enhancers. Important motifs for each enhancer type are highlighted.

In order to understand the features and rules learned by DeepSTARR, we quantified how each individual nucleotide in every sequence contributes to the predicted developmental and housekeeping enhancer activities [47,55,63,64] (Fig 2A; see Methods), and consolidated recurrent highly scoring sequence patterns into motifs [56]. This uncovered distinct TF motifs that are known to occur in developmental and housekeeping enhancers [26,40], thus validating the approach and reinforcing the mutual incompatibility of the two transcriptional programs (Fig 2A,B, S2).
We experimentally tested the requirements of select TF motifs for enhancer activity across hundreds of enhancers by performing large-scale motif mutagenesis (3,415 motif mutations in 804 developmental and 872 housekeeping enhancers; Fig. 2C, S3). Consistent with their predicted importance, mutating four developmental motifs (GATA, AP-1, twist, Trl) substantially reduced the activity of developmental but not housekeeping enhancers, with AP-1 and GATA motifs being the most important, as predicted by DeepSTARR. In contrast, mutating three housekeeping motifs (Dref, Ohler1, Ohler6) affected only housekeeping enhancers, and mutating three control motifs (length-matched random motifs to control for enhancer-sequence perturbation) did not have any impact. For example, GATA motifs were only important when present in developmental but not housekeeping enhancers, whereas the opposite was true for Dref motifs (Fig. 2C).
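
The idea of mutating all instances of a motif type and reporting the resulting log2 fold-change can be sketched as follows. This is a simplified illustration, not the mutagenesis design used in the study: matching exact core strings on both strands, overwriting them with a fixed same-length replacement, and the helper names are all assumptions.

```python
import math

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_instances(seq, motif):
    """0-based start positions where the motif occurs on either strand."""
    targets = {motif, reverse_complement(motif)}
    k = len(motif)
    return [i for i in range(len(seq) - k + 1) if seq[i:i + k] in targets]


def mutate_instances(seq, motif, replacement):
    """Overwrite every motif instance with a same-length replacement string."""
    if len(replacement) != len(motif):
        raise ValueError("replacement must have the same length as the motif")
    chars = list(seq)
    # Positions are located on the original sequence before any edits.
    for i in find_instances(seq, motif):
        chars[i:i + len(motif)] = replacement
    return "".join(chars)


def log2_fold_change(mutant_activity, wildtype_activity):
    """log2 ratio of mutant over wildtype activity (both must be positive)."""
    return math.log2(mutant_activity / wildtype_activity)


if __name__ == "__main__":
    enhancer = "TTGATAATTTTATCGG"  # toy sequence with GATAA on both strands
    print(find_instances(enhancer, "GATAA"))
    print(mutate_instances(enhancer, "GATAA", "CCCCC"))
    print(log2_fold_change(1.0, 4.0))
```

A negative log2 fold-change indicates that the mutant is less active than the wildtype sequence.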
Interestingly, the motifs learned by DeepSTARR were not restricted to highly abundant motifs but included other motifs such as SREBP, CREB and ETS motifs, which were not or only weakly enriched in S2 developmental enhancers and could not have been found by methods based on over-representation (Fig. 2B,D). Even for more abundant motifs, motif enrichment did not always predict motif importance (Fig. 2D), defined here as the DeepSTARR score of the motif embedded in 100 random DNA sequences (see Methods and ref. 58). This shows that DeepSTARR can discover motifs, and likely other sequence features, that are relatively rare in enhancers but still important for enhancer activity.
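
The motif importance measure mentioned above, the model score of a motif embedded in random DNA sequences, can be illustrated with the sketch below. The trained model is replaced by a toy scoring function, and the choice to place the motif at the center, to average across backbones, and all function names are assumptions for illustration; the exact procedure is described in the Methods, which are not part of this section.

```python
import random


def random_backbone(length, rng):
    """Uniformly random DNA sequence of the given length."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def embed(backbone, motif, position):
    """Overwrite the backbone with the motif starting at `position`."""
    if position < 0 or position + len(motif) > len(backbone):
        raise ValueError("motif does not fit at this position")
    return backbone[:position] + motif + backbone[position + len(motif):]


def global_importance(motif, predict, n_backbones=100, length=249, seed=0):
    """Mean predicted score of the motif embedded in random backbones."""
    rng = random.Random(seed)
    start = (length - len(motif)) // 2
    scores = []
    for _ in range(n_backbones):
        backbone = random_backbone(length, rng)
        scores.append(predict(embed(backbone, motif, start)))
    return sum(scores) / len(scores)


def toy_predict(seq):
    """Hypothetical stand-in for a trained model: counts GATAA occurrences."""
    return float(seq.count("GATAA"))


if __name__ == "__main__":
    print(global_importance("GATAA", toy_predict))
```

In practice `predict` would wrap the trained network; the toy function only keeps the example runnable.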

Non-equivalent instances of the same TF motif
Since enhancers often contain multiple instances of the same motif type, we next assessed the contribution of each individual instance of the GATA, AP-1, twist, Trl, and Dref motifs by DeepSTARR (Fig S5A) and by experimental mutagenesis (Fig S3A, S5B). Unexpectedly, individual instances of the same motif were frequently predicted and experimentally validated to have distinct contributions to enhancer activities, both across different enhancers and within the same enhancer (Fig 3A-C, S5).

Figure 3. Instances of the same TF motif have non-equivalent contributions to enhancer activity. A) Developmental enhancer with three non-equivalent GATA instances. Left: Genome browser screenshot showing tracks for DNA accessibility (from ref. 61) and developmental and housekeeping UMI-STARR-seq for the CG11255 locus. The designed oligo covering the enhancer selected for motif mutagenesis is shown. Right: log2 activity of the wildtype enhancer compared with the activity when all GATA instances are simultaneously mutated or each individual instance at a time. Bottom: DeepSTARR nucleotide contribution scores for the same developmental enhancer with the three GATA instances highlighted. B) DeepSTARR predicts the contribution of individual GATA instances. Distribution of experimentally measured enhancer activity fold-change (log2 FC) after mutating 1,013 different GATA instances across developmental enhancers (violin plot), compared with the log2 FC predicted by DeepSTARR. The box plots mark the median, upper and lower quartiles and 1.5× interquartile range (whiskers). C) Different instances of the same TF motif in the same enhancer are not equivalent. Left: Distribution of enhancer activity change (log2 FC) between mutating the least and the most important instance of each motif type per enhancer. Dashed line represents 2-fold difference between instances in the same enhancer. Right: Proportion of enhancers with two or more instances that have an instance at least 2-fold more important than another instance (dark grey). Dashed line represents the average across the different motif types (excluding control motifs): 57% of enhancers. The number of enhancers mutated for each motif type is shown. Box plots as in (B). D) DeepSTARR predicts motif instance contribution better than position weight matrix (PWM) motif scores. Bar plots showing the PCC between predicted (by DeepSTARR or PWM) and observed log2 fold-change for mutating individual instances of each motif type.

The enhancer shown in Fig 3A, for example, contains three GATA instances with very different contributions as predicted and determined experimentally: the second instance is the most important one, followed by the first, and the third. The agreement between predictions and experiments holds across all 1,013 GATA instances tested (PCC=0.53; Fig 3B) and the non-equivalency of motif instances is widespread: 57% of enhancers with multiple instances had motifs with >2-fold and 70% with >1.5-fold differences (Fig 3C). These differences are not well captured by existing position weight matrix (PWM) motif scores (Fig 3D, S6), suggesting that the importance of motif instances depends on complex sequence features outside the core motif. The observation that different instances of the same motif type (with identical sequences) can have vastly different contributions to enhancer activity is an important, underappreciated phenomenon that complicates our understanding of enhancer sequences and non-coding variants (see Discussion).
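
The proportion of enhancers with non-equivalent instances reported above can be computed from per-instance mutation effects roughly as follows. The data layout (a mapping from enhancer ID to the log2 fold-changes of its individual instances), the function names, and the example values are assumptions for illustration.

```python
import math


def instance_spread(log2fc_values):
    """Difference in log2 FC between the least and most important instance."""
    return max(log2fc_values) - min(log2fc_values)


def fraction_non_equivalent(enhancers, fold=2.0):
    """Fraction of multi-instance enhancers whose instances differ by >= `fold`.

    `enhancers` maps an enhancer ID to the log2 fold-changes measured after
    mutating each instance individually. Enhancers with fewer than two
    instances are ignored.
    """
    threshold = math.log2(fold)
    eligible = [v for v in enhancers.values() if len(v) >= 2]
    if not eligible:
        raise ValueError("no enhancer with two or more instances")
    hits = sum(1 for v in eligible if instance_spread(v) >= threshold)
    return hits / len(eligible)


if __name__ == "__main__":
    toy = {
        "enh1": [-2.0, -0.5],  # 1.5 log2 units apart: non-equivalent
        "enh2": [-1.0, -0.8],  # 0.2 log2 units apart: roughly equivalent
        "enh3": [-0.3],        # single instance, ignored
    }
    print(fraction_non_equivalent(toy))
```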

Flanking sequence influences the importance of TF motifs

Figure 4. Contribution of TF motifs depends on the flanking sequence. A) Motif contribution correlates with flanking base-pairs. Heatmap: Flanking nucleotides of GATAA (GATA; left) and GAGAG (Trl; right) instances across developmental enhancers sorted by their DeepSTARR predicted contribution. Box plots: Importance of motif instances according to the different bases at each flanking position. * marks positions with significant differences between the four nucleotides (FDR-corrected Welch One-Way ANOVA test p-value < 0.01). The box plots mark the median, upper and lower quartiles and 1.5× interquartile range (whiskers). Top: logos of the top 90th percentile motif instances. B) Length and identity of flanks differ between motif types. Comparison of optimal motif logos (top 90th percentile motif instances) as predicted by DeepSTARR or measured experimentally by motif mutation, with the PWM logos existing in Drosophila TF databases. Note that DeepSTARR and mutagenesis motif instances were selected to all contain the same core sequence and therefore only differ in their flanking sequence. C) GATA flanking nucleotides are sufficient to switch motif contribution in 47 developmental enhancers that contain one strong (purple) and one weak (green) GATA instance (≥ 2-fold difference between instances as assessed by mutagenesis). Enhancer activity change (log2 FC) when 2 bp flanks of strong instances were replaced by the flanks of weak instances (purple) and vice versa (green). ** p-value < 0.01, * < 0.05 (Wilcoxon signed rank test). Box plots as in (A). D) Example of a developmental enhancer with one weak (TGGATAATG; green) and one strong (AAGATAAAG; purple) GATA instance. DeepSTARR nucleotide contribution scores and UMI-STARR-seq measured enhancer activity (log2; on the right) are shown for the wildtype sequence (top) and for the sequences where the 2 bp flanks of the strong instance were replaced by the ones of the weak instance (middle) and vice versa (bottom).

To explore the syntax features that affect the importance of a motif instance, we examined the motif flanking nucleotides, which can contribute to enhancer activity [12,13,18,29,65–69]. For each motif type, we sorted all instances by their predicted importance to determine the optimal flank length and sequence (Fig 4A,B, S7A). For example, important GATAA sequences had a G at position +1, whereas non-important ones had a T at position +1 and a G at position -1 (Fig 4A). In contrast, up to 5 bp flanking up- and downstream affected the importance of Trl instances, with flanking GA-repeats correlating with increased importance (Fig 4A). The flanks of high and low importance motif instances predicted by DeepSTARR were largely concordant with those identified by motif mutagenesis (Fig 4B, S7A) and refine known PWM models for the predicted TFs (Fig 4B).
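
Sorting instances by importance and tallying the bases at each flanking position, as described above, can be sketched as below. The tuple layout `(sequence, motif_start, importance)`, the position convention (+1 is the first base after the core, -1 the last base before it), and the function names are assumptions of this illustration.

```python
from collections import Counter


def flank_base(seq, start, motif_len, offset):
    """Base at a flank offset relative to a motif core, or None if outside seq.

    offset +1 is the first base after the core; -1 is the last base before it.
    """
    if offset == 0:
        raise ValueError("offset must be non-zero")
    idx = start + motif_len - 1 + offset if offset > 0 else start + offset
    return seq[idx] if 0 <= idx < len(seq) else None


def top_instances(instances, fraction=0.1):
    """Keep the most important instances (at least one)."""
    ranked = sorted(instances, key=lambda inst: inst[2], reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


def base_counts(instances, offset, motif_len=5):
    """Count bases observed at one flank offset across instances."""
    counts = Counter()
    for seq, start, _importance in instances:
        base = flank_base(seq, start, motif_len, offset)
        if base is not None:
            counts[base] += 1
    return counts


if __name__ == "__main__":
    # Toy instances with made-up importance values.
    toy = [("TGGATAATG", 2, 0.1), ("AAGATAAGG", 2, 0.9)]
    print(base_counts(top_instances(toy, 0.5), offset=1))
    print(base_counts(toy, offset=-1))
```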
To experimentally validate the functional contribution of motif flanking sequence predicted by DeepSTARR, we swapped the flanking nucleotides of strong and weak GATA instances (≥ 2-fold difference as assessed by mutagenesis) in 47 enhancers (Fig 4C). Indeed, replacing the 2 bp flanks of strong instances by the flanks of weak instances reduced enhancer activity, whereas replacing the flanks of weak instances by the flanks of strong ones increased enhancer activity (Fig 4C, S7B). DeepSTARR recapitulated the observed effects, i.e. the addition of weak flanks converted a strong GATA instance to a weak one as indicated by the decreased contribution at the nucleotide level, and vice versa for a weak instance that was converted to a strong one (Fig 4D). Swapping 5 bp flanks yielded consistent results with slightly stronger effects (Fig S7B). In addition, swapping the flanks was sufficient to switch motif contributions, as determined by subsequent motif mutagenesis (Fig. S7B). Thus, as DeepSTARR is not biased by prior knowledge about TF motifs but is trained on DNA sequence alone, it can not only identify important motif types but also refine the optimal flanking sequence. Experimentally, we confirm that the flanking sequence can be sufficient to switch motif contribution and should be considered when assessing motif importance or the impact of motif-disrupting mutations.
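
The flank replacement described above amounts to a simple sequence edit, sketched below for one construct at a time: the flanks of a target instance are overwritten with the flanks of a donor instance in the same sequence. The function name, argument layout, and the toy sequence (built around the two 9-mers quoted in Fig 4D) are assumptions for illustration.

```python
def replace_flanks(seq, target_start, donor_start, motif_len=5, flank=2):
    """Return `seq` with the target instance's flanks replaced by the donor's.

    `target_start` and `donor_start` are 0-based starts of the motif cores.
    `flank` bases on each side of the target core are overwritten.
    """
    for start in (target_start, donor_start):
        if start - flank < 0 or start + motif_len + flank > len(seq):
            raise ValueError("instance plus flanks must lie inside the sequence")
    if abs(target_start - donor_start) < motif_len + 2 * flank:
        raise ValueError("instances (including flanks) must not overlap")
    left = seq[donor_start - flank:donor_start]
    right = seq[donor_start + motif_len:donor_start + motif_len + flank]
    return (seq[:target_start - flank] + left
            + seq[target_start:target_start + motif_len]
            + right + seq[target_start + motif_len + flank:])


if __name__ == "__main__":
    # Weak instance TGGATAATG (core at 4), strong instance AAGATAAAG (core at 17).
    enhancer = "CC" + "TGGATAATG" + "CCCC" + "AAGATAAAG" + "CC"
    print(replace_flanks(enhancer, target_start=17, donor_start=4))  # strong gets weak flanks
    print(replace_flanks(enhancer, target_start=4, donor_start=17))  # weak gets strong flanks
```

Passing `flank=5` would give the 5 bp variant, provided both instances have enough surrounding sequence.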

In silico analysis reveals distinct modes of motif cooperativity

Figure 5. In silico analysis reveals distinct modes of motif cooperativity. A) Schematic of in silico characterization of TF motif distance preferences. MotifA was embedded in the center of 60 synthetic random DNA sequences and MotifB at a range of distances from MotifA, both up- and downstream. Both the average developmental and housekeeping enhancer activity is predicted by DeepSTARR. The cooperativity (residuals) between MotifA and MotifB as a function of distance is quantified as the activity of MotifA+B divided by the sum of the marginal effects of MotifA and MotifB (MotifA + MotifB - backbone (b)) (see Methods). B) DeepSTARR predicts distinct modes of motif cooperativity. Shown for example motif pairs: ETS/SREBP (mode 1), GATA/GATA (2), AP-1/GATA (3) and Dref/Dref (4). Top: Cooperativity between two motif instances at different distances. Points show the median interaction across all 60 backbones for each motif pair distance (both up- and downstream distances are combined) together with smooth lines; dashed line at 1 represents no synergy. Middle: Association between enhancer activity and the distance at which the motif pair is found. Coefficient (y-axis) and p-value from a multiple linear regression including, as independent variables, the number of instances for the different developmental or housekeeping TF motif types. Bottom: Motifs are often at suboptimal distances in developmental enhancers. Odds ratio (log2) by which the two motifs are found within a specified distance from each other in enhancers compared with negative genomic regions. Color legend is shown. * FDR-corrected Fisher's Exact test p-value < 0.05. C) Cooperativity between three motif types (and GGGCT as control) with a central GATA motif in developmental enhancers at different distances to the GATA motif. Points show the median interaction across all 60 backbones for each motif pair distance (both up- and downstream distances are combined) together with smooth lines; dashed line at 1 represents no synergy. D) Motif mutagenesis validates that GATA instances distal to a second GATA instance are more important. Left: expected mutational impact when mutating GATA instances depending on the distance to other GATA motifs. Right: enhancer activity changes (log2 FC) after mutating GATA instances at suboptimal close (< 25 bp) or optimal longer (> 50 bp) distance to a second instance. The number of instances is shown. ** p-value < 0.01 (Wilcoxon rank-sum test). The box plots mark the median, upper and lower quartiles and 1.5× interquartile range (whiskers). E) Motif mutagenesis validates that AP-1 instances closer to a second GATA instance are more important. Same as in (D). * p-value < 0.05 (Wilcoxon rank-sum test).

The distance between TF motifs is thought to be important for TF cooperativity [6,13,18,47,70–73]. To determine how the distance between motifs contributes to enhancer activity, we interrogated DeepSTARR to uncover potential preferences in TF motif distance in enhancer sequences. We analyzed in silico how predicted enhancer activity is affected by the relative distance between two motif instances (MotifA/MotifB), following a strategy adapted from ref. 47 (Fig 5A, S8A): we embedded MotifA in the center of synthetic random DNA sequences and MotifB at a range of distances from MotifA, both up- and downstream. We then predicted the activity of the different synthetic sequences using DeepSTARR and calculated a cooperativity score for each motif pair, where a value higher than 1 means positive synergy (Fig 5A, S8A).
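
The cooperativity score can be sketched from the definition given in the Figure 5 legend: the predicted activity of the sequence carrying both motifs, divided by the sum of the marginal effects (MotifA + MotifB - backbone). In the sketch below the model is passed in as a generic `predict` function; measuring distance as the gap between the two motifs, scanning downstream only, and all function names are assumptions of this illustration rather than details from the study.

```python
import random
import statistics


def embed(backbone, motif, position):
    """Overwrite the backbone with the motif starting at `position`."""
    if position < 0 or position + len(motif) > len(backbone):
        raise ValueError("motif does not fit at this position")
    return backbone[:position] + motif + backbone[position + len(motif):]


def cooperativity(pred_ab, pred_a, pred_b, pred_backbone):
    """Activity of A+B divided by the additive expectation (A + B - backbone)."""
    expected = pred_a + pred_b - pred_backbone
    if expected == 0:
        raise ZeroDivisionError("additive expectation is zero")
    return pred_ab / expected


def cooperativity_by_distance(motif_a, motif_b, distances, predict,
                              n_backbones=60, length=249, seed=0):
    """Median cooperativity across random backbones for each distance.

    MotifA sits at the center; MotifB is placed `d` bases downstream of the
    end of MotifA (an upstream scan would mirror this).
    """
    rng = random.Random(seed)
    backbones = ["".join(rng.choice("ACGT") for _ in range(length))
                 for _ in range(n_backbones)]
    center = (length - len(motif_a)) // 2
    result = {}
    for d in distances:
        pos_b = center + len(motif_a) + d
        scores = []
        for backbone in backbones:
            seq_a = embed(backbone, motif_a, center)
            seq_b = embed(backbone, motif_b, pos_b)
            seq_ab = embed(seq_a, motif_b, pos_b)
            scores.append(cooperativity(predict(seq_ab), predict(seq_a),
                                        predict(seq_b), predict(backbone)))
        result[d] = statistics.median(scores)
    return result
```

With a purely additive model the score stays at 1 for every distance; values above 1 would indicate synergy between the two motifs.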
Motif distances indeed had a strong influence on predicted enhancer activity and we observed four distinct modes of distance-dependent TF motif cooperativity: motif pairs can synergize exclusively at close distances (< 25 bp; mode 1), exclusively at longer distances (> 25 bp; mode 2), or preferentially at closer distances and either plateau (mode 3) or decay (mode 4) at long distances (> 75 bp; Fig 5B, S8B-D). While all motifs in housekeeping enhancers cooperate according to mode 4 (decay), modes 1 to 3 all occur for motifs in developmental enhancers (Fig S8C,D). Interestingly, whether cooperativity followed modes 1, 2 or 3 depended on the motif pair and even changed for a given motif based on the partner motif (Fig 5C, S8C). For example, GATA/ETS synergized only when closer than 25 bp (mode 1), whereas GATA/GATA synergy was lost at short distances (mode 2) and GATA/AP-1 cooperated according to mode 3 (Fig 5C). Thus, DeepSTARR predicts distinct modes of motif cooperativity that can determine the contribution of different motif instances.
We next asked how frequently these optimal inter-motif distances occur in
endogenous enhancers compared to negative regions. Motif pairs of housekeeping
enhancers followed the optimal spacing rules (enrichment at close distances; Fig 5B,
S9A,D), as did some motif pairs in developmental enhancers such as GATA/GATA motif
pairs that were strongly depleted at close and enriched at longer distances (Fig 5B).
bioRxiv preprint doi: https://doi.org/10.1101/2021.10.05.463203; this version posted October 7, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
However, several pairs in developmental enhancers occurred only rarely at optimal
distances (e.g. ETS/SREBP and AP-1/GATA; Fig 5B, S9A,C), even though the enhancer
activities followed the predicted optimal spacing rules also in these cases (Fig 5B, S9). For
instance, even though ETS/SREBP motifs separated by short distances (< 25 bp) were
rare, such motif pairs were associated with stronger enhancer activity than pairs
separated by larger distances (75-100 bp; Fig 5B), validating the ETS/SREBP motifs’
optimal distance.
To experimentally test the importance of motif pairs at optimal versus non-optimal
distances more directly, we mutated either GATA or AP-1 motifs located at close (< 25 bp)
or longer (> 50 bp) distances from a GATA instance (Fig 5D,E). The results validated the
DeepSTARR predictions and showed higher importance of GATA/GATA pairs at longer
(Fig 5D) and AP-1/GATA pairs at closer distances (Fig 5E). Thus, different motif pairs
display distinct distance preferences, which dictate the contribution of individual motif
instances to overall enhancer activity. As endogenous enhancers often contain motif pairs
at non-optimal distances, optimal distances only become apparent in our in silico analysis
but not in frequency-based analyses.

Motif syntax rules are generalizable to human enhancers
To test if individual instances of the same motif also contribute differently to
enhancer activities in humans and if motif flanks and spacing determine the different
contributions, we chose the human colon cancer cell line HCT116 as a model. We selected
nine TF motifs based on motif enrichment analysis (AP-1, P53, MAF, CREB1, ETS, EGR1,
MECP2, E2F1 and Ebox/MYC), mutated all their instances in 1,083 enhancers and
assessed the enhancer activity of wildtype and mutant sequences by UMI-STARR-seq (Fig
S10; see Methods). This revealed that AP-1 and P53 motifs were the most important
motifs (median 5.6- and 5.5-fold reduction, respectively), followed by MAF (3.1), CREB1
(2), ETS (1.9) and EGR1 (1.5), while MeCP2, E2F1 and Ebox/MYC motifs had the least
impact on enhancer activity (lower than 1.5-fold; Fig S10D-F). Based on these results, we
chose AP-1, P53, MAF, CREB1, ETS and EGR1 motifs for the analysis of motif instances.
Figure 6. Motif syntax rules dictate the contribution of TF motif instances in human
enhancers. A) Top: Genome browser screenshot showing DNA accessibility (ATAC-seq data
from ref. 74) and enhancer chromatin marks (H3K27ac and H3K4me1 from ENCODE75) for a
human HCT116 enhancer (chr14:44,786,141-44,786,389) with 4 AP-1 instances. Bottom: The
designed 249 bp oligo covering the enhancer summit used for motif mutagenesis is shown
together with the AP-1 motif instances it contains and the impact on enhancer activity
(negative fold-change) of mutating each individual instance. Observed and expected
per-nucleotide DNase I cleavage and consensus TF footprints from a related colon cancer cell line
(RKO; data from ref. 76) are shown below. B) TF motif non-equivalence is widespread in human
enhancers. Distribution of log2 FC enhancer activity between mutating the least and the most
important instance of each motif type per enhancer. Dashed line represents 2-fold difference
between instances in the same enhancer. The number of enhancers mutated for each motif type
is shown. The box plots mark the median, upper and lower quartiles and 1.5× interquartile
range (whiskers). C) 57% of enhancers have a motif instance that is at least 2-fold more
important than another instance. Grey bars: proportion per motif type; dashed line: average
across motif types (excluding control motifs). D) Important TF motif instances are associated
with TF footprints. Log2 FC enhancer activity of mutating individual instances that do not (-)
or do (+) overlap TF footprints (FDR 0.001) in RKO cells (DNase I footprinting data from ref. 76).
**** p-value < 0.0001, ** < 0.01, * < 0.05, n.s. non-significant (Wilcoxon rank-sum test). Box
plots as in (B). E) Motif syntax rules dictate the contribution of TF motif instances in human
enhancers. For each TF motif type (rows), we built a linear model containing the number of
instances, the motif core (defined as the nucleotides included in each TF motif PWM model)
and flanking nucleotides, and the distance to all other TF motifs to predict the contribution of
its individual instances (mutation log2 fold-change, from Fig S11A) across all enhancers. The
PCC between predicted and observed motif contribution is shown with the green color scale
on the left. Heatmap shows the contribution of each feature (columns) for each model, colored
by the FDR-corrected p-value (red scale). F) Motif mutagenesis shows that AP-1 and ETS
instances closer to each other are more important to enhancer activity. Left: expected
mutational impact when mutating AP-1 and ETS instances depending on the distance to each
other. Middle and right: enhancer activity changes (log2 FC) after mutating AP-1 or ETS
instances at close (< 25 bp) or longer (> 50 bp) distance. The number of instances is shown. ****
p-value < 0.0001 and ** < 0.01 (Wilcoxon rank-sum test). Box plots as in (B). G) Motif
instances need to be analyzed within their cis-regulatory context. Motif syntax rules, such as
motif combination, flanks and distance dictate the contribution of TF motif instances in
enhancer sequences. Important motif instances will have a higher impact on enhancer activity
when mutated.

Mutation of hundreds of individual motif instances showed that instances of the
same TF motif are not functionally equivalent (Fig 6A-C, S11A). For example, the enhancer
shown in Fig 6A contains four AP-1 instances with very different contributions to
enhancer activity, as judged by fold-changes after motif instance mutagenesis of between
1.2- and 3.8-fold. Interestingly, DNase I footprinting data from a related colon cancer cell
line (RKO76) suggest that the AP-1 instance with low importance was not bound
endogenously, in contrast to the three important AP-1 instances (Fig 6A). Both results
generalize to all tested motifs and across enhancers: 57% of human enhancers displayed
non-equivalent instances of the same motif type (Fig 6B,C) and TF motif instances with
DNase I footprints are more important than those without (Fig 6D), supporting the
functional differences between motif instances at endogenous enhancers.
Having trained a convolutional neural network to learn the motif syntax rules for
Drosophila enhancers, we wanted to determine if the same type of rules also apply to
human enhancers. Therefore, we generated simple linear models based on these rules to
predict the contribution of individual motif instances in human enhancers. Specifically,
these models consider the number of instances, the motif core and flanking sequence, and
distance to other TF motifs (Fig 6E, S11B,C). Despite their simplicity, these models were
able to predict motif-instance importance, with PCCs to experimentally assessed log2
fold-changes of 0.66 (P53), 0.61 (ETS), 0.59 (MAF) and 0.5 (AP-1), outperforming models
based solely on PWM scores (Fig S11D). For most TFs, motif instances closer to an AP-1
or ETS motif were more important, suggesting that high cooperativity with these TFs is
important in HCT116 enhancer sequences (Fig 6E, S11B). This was also observed
between AP-1 and ETS motifs themselves, where mutation of either AP-1 or ETS instances
had stronger impact on enhancer function if located at close (< 25 bp) rather than longer
distances (> 50 bp) from each other (Fig 6F), and similarly between two AP-1 instances
(Fig S11E). Altogether, these results confirm that the motif syntax rules derived for motif
flanking sequence and inter-motif distances dictate the contribution of individual TF
motif instances in human enhancers (Fig 6G). Determining how distinct instances of the
same motif differentially contribute to enhancer activity could improve the ability to
predict the functional impact of disease-associated variants, which typically affect only
one motif instance.

DeepSTARR designs synthetic enhancers with desired activities
Figure 7. DeepSTARR designs synthetic enhancers using optimal sequence rules. A)
Comparison between DeepSTARR predicted and experimentally measured enhancer activity
(log2) for 249 synthetic sequences binned (left) or not (right). The ‘‘Native’’ category contains
all Drosophila developmental enhancer sequences. The box plots mark the median, upper and
lower quartiles and 1.5× interquartile range (whiskers); outliers are shown individually. The
three synthetic sequences shown in (B) are highlighted. B) DeepSTARR nucleotide
contribution scores for three synthetic sequences from (A) spanning different activity levels.
Instances of GATA, AP-1 and ETS motifs are shown together with their observed distances
(proximal or distal).

Understanding how DNA sequence encodes enhancer activity should enable the
design of synthetic enhancers with desired activity levels. We used DeepSTARR to
computationally design new S2 cell developmental enhancers, by predicting enhancer
activity for one billion random 249 bp DNA sequences that are not present in the
Drosophila genome (see Methods). We then selected 249 of these sequences spanning
different predicted activity levels and experimentally measured their enhancer activity
by UMI-STARR-seq in S2 cells. The predicted and measured activities of the synthetic
sequences correlated well (PCC=0.62; Fig 7A) and DeepSTARR was able to design synthetic
enhancers as strong as the strongest native S2 developmental enhancers (activity
(fold-change over negative regions) ≈ 500; Supplementary Table 17).
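The screening strategy described above (score many random sequences, then pick a subset spanning the range of predicted activities) can be sketched as follows. This is an illustrative outline only: `predict` is a hypothetical stand-in for the trained model, equal-width binning is an assumption about how "spanning different predicted activity levels" might be realized, and the filter against sequences present in the genome is omitted.

```python
import random


def random_sequences(n, length=249, seed=0):
    """Draw n random DNA sequences of the given length."""
    rng = random.Random(seed)
    return ["".join(rng.choice("ACGT") for _ in range(length)) for _ in range(n)]


def select_spanning(seqs, predict, n_bins=5, per_bin=2):
    """Pick up to per_bin sequences from each equal-width bin of predicted activity.

    Returns (score, sequence) pairs ordered by increasing predicted activity.
    """
    scored = sorted((predict(s), s) for s in seqs)
    lo, hi = scored[0][0], scored[-1][0]
    width = (hi - lo) / n_bins or 1.0  # avoid zero width if all scores are equal
    bins = [[] for _ in range(n_bins)]
    for score, s in scored:
        idx = min(int((score - lo) / width), n_bins - 1)
        if len(bins[idx]) < per_bin:
            bins[idx].append((score, s))
    return [item for b in bins for item in b]
```

Because strongly active sequences are rare among random DNA, binning by predicted activity rather than sampling uniformly keeps the high-activity end of the range represented in the selected set.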
Inspection of the designed sequences suggested that their different activity levels
correlated not only with motif composition but also the motif syntax (Fig 7B). For
example, three different synthetic sequences, all containing two GATA and two AP-1
motifs, were predicted by DeepSTARR and validated experimentally to have very
different activities (from 0.87 to 630). Interestingly, the strongest synthetic enhancer
followed the optimal spacing rules predicted by DeepSTARR, such as distal GATA
instances and proximal AP-1/GATA and ETS/AP-1 instances, whereas the other two
synthetic sequences contained motifs in suboptimal syntax, such as distal AP-1 instances
and proximal GATA instances (Fig 7B). This proof-of-concept experiment shows that the
rules learned by DeepSTARR enable the a priori design of synthetic enhancers with
desired activity levels.

Discussion
Deciphering the rules governing the relationship between enhancer sequence and
function, typically called the cis-regulatory code of enhancers, has remained a
long-standing open problem. It has proved so challenging because methods to functionally
characterize large numbers of enhancers became available only a few years ago and
also because the cis-regulatory code, unlike the protein-coding genetic code, follows
complex and cell type-specific sequence rules.
To dissect the relationship between enhancer sequence and activity for a single
model cell type, we built a deep learning model, DeepSTARR, that accurately predicts
enhancer activity for two different transcriptional programs directly from DNA sequence.
DeepSTARR learned important TF motif types and higher-order syntax rules: different
instances of the same TF motif are not functionally equivalent, and the differences are
determined by motif flanks and inter-motif distances. These types of rules are also
important in human enhancers and will be relevant to predict the impact of genetic
variants linked to disease in the human genome.
The discovery that relatively rare sequence features can be important and
predictive of enhancer activity is important and unexpected and highlights the potential
of unbiased deep learning models that are not based on over-representation47,77. The fact
that motifs are often not arranged in optimal syntax agrees with previous work that
suggested that suboptimal enhancers might have evolved to allow cell type specificity12,13.
Consistent with this interpretation, we observed optimized sequences of housekeeping
enhancers that operate in all cell types.
Our results reveal an underappreciated property of enhancers: identical instances
of the same TF motif with non-equivalent contributions to enhancer activity. Although the
observation that only a small fraction of potential motifs throughout the genome is
actually bound21,78,79 suggests that motif instances cannot all be equivalent, the
non-equivalence of motif instances within the same enhancer is surprising. In fact, previous
studies and computational models have typically considered different motif instances
solely according to their PWM scores or even as equivalent17,26,80. The contribution of
motif instances depended on higher-order motif syntax rules such as inter-motif distances
that are not captured by traditional PWM models and need to be modelled within the full
enhancer sequence. This is in line with the recently reported limitations of PWM models
for predicting the effects of noncoding variants on TF binding in vitro81 and improved
performance of deep learning models for the prediction of motif instances bound in
vivo47,57. Together these results suggest that motif instances need to be analysed within
their cis-regulatory context, which should improve our ability to predict and interpret the
impact of disease-related sequence variants that typically affect individual motif
instances.
The rules learned by DeepSTARR allowed the de novo design of synthetic S2 cell
enhancers with desired activity levels, which not only demonstrates the validity of the
model and its rules but also illustrates the power of this approach. Although libraries of
synthetic elements have been used to explore enhancer structure68, it has remained
impossible to build fully synthetic sequences with specific characteristics. It is interesting
that these synthetic enhancers are of similar complexity to endogenous enhancers, e.g. in
terms of motif number and diversity, and that a vast number of different sequences can
have similar enhancer strengths, highlighting regulatory sequence flexibility and
evolutionary opportunities. We expect that combining DeepSTARR with emerging
algorithms that allow the direct generation of DNA sequences from deep learning
models54 will provide unanticipated opportunities for the engineering of synthetic
enhancers.
A next key challenge for the field will be to generalize such models from individual
deeply characterized model cell lines to all cell types of an organism or even across
species. This task is challenging because enhancers form the basis of differential gene
transcription, and their activities are inherently cell-type specific. The underlying
sequences and rules must therefore, by definition, also differ between cell types, at least
to some extent. It is well known for example that enhancers that are active in different
cell types or tissues contain different TF motifs26,80,82, which enables the binding of cell
type-specific TFs. Therefore, it remains unclear how and to what extent cis-regulatory
rules generalize or even apply universally.
We show here that differences between motif instances as well as the importance
of motif flanks and distances generalize from Drosophila to human enhancers.
Unexpectedly, for AP-1 motifs, which we could assess in both species, the
Drosophila-trained DeepSTARR model was able to predict the importance of AP-1 instances in human
enhancers (PCC=0.42; Fig S12) and in both species ETS-AP-1 pairs synergize only at short
distances but not at longer ones (mode 1; Fig S8C and Fig 6E,F, S11B). Ultimately, this
demonstrates that although the specific rules vary between TF motif types and motif
combinations, the types of rules as well as some specific rules apply more generally.
Dissecting important types of rules in model cell lines together with the wealth of genomic
data across many cell types (such as those from ENCODE) should unveil the
gene-regulatory information in our genomes and a general cis-regulatory code.
Methods
UMI-STARR-seq
Cell culture
Drosophila S2 cells
Schneider  cells were grown in Schneider’s Drosophila Medium (Gibco; 21720-024)
supplemented with 10% heat inactivated FBS (Sigma; F7524) at 27ºC. Cells were
passaged every 2-3 days.
Human HCT116 cells
Human HCT116 cells were cultured in DMEM (Gibco; 52100-047) supplemented with
10% heat inactivated FBS (Sigma; F7524) and 2 mM L-Glutamine (Sigma; G7513) at 37ºC
in a 5% CO2-enriched atmosphere. Cells were passaged every 2-3 days.
Electroporation
The MaxCyte-STX system was used for all electroporations. S2 cells were electroporated
at a density of 50 × 10^7 cells per 100 µL and g of DNA using the Optimization 1
protocol. HCT116 cells were electroporated at a density of 1 × 10^7 cells per 100 µL and
0g of DNA using the preset “HCT11” program.
UMI-STARR-seq experiments
Library cloning
Drosophila genome-wide libraries were generated by shearing genomic DNA from the
sequenced D.mel strain (y; cn bw sp) to an average of 200 bp fragments. Inserts were
cloned into the standard Drosophila STARR-seq vector61 containing either the DSCP or
Rps12 core-promoters, and libraries were grown in 6 l of LB-Amp.
Drosophila and human oligo libraries were synthesized by Twist Bioscience including 249
bp enhancer sequence and adaptors for library cloning. Fragments from the Drosophila
library were amplified (primers see Supplementary Table 1) and cloned into Drosophila
STARR-seq vectors containing either the DSCP or Rps12 core-promoters using Gibson
cloning (New England BioLabs; E2611S). The oligo library for human STARR-seq screens
was amplified (primers see Supplementary Table 1) and cloned into the human STARR-seq
plasmid with the ORI in place of the core promoter83. Libraries were grown in 2 l of
LB-Amp.
All libraries were purified with Qiagen Plasmid Plus Giga Kit (cat. no. 12991).
Drosophila S2 cells
UMI-STARR-seq was performed as described previously61,62. In brief, the screening
libraries were generated from genomic DNA isolated from the sequenced D.mel strain (y; cn
bw sp) or synthesized as oligo pools by Twist Bioscience (see above). We transfected 400
× 10^6 S2 cells total per replicate with 0 μg of the input library using the MaxCyte
electroporation system. After 24 hr incubation, poly-A RNA was isolated and processed
as described before62. Briefly, after reverse transcription and second-strand synthesis, a
unique molecular identifier (UMI) was added to each transcript. This was followed by two
nested PCR steps, each with primers that are specific to the reporter transcripts such that
STARR-seq does not detect endogenous cellular RNAs.
Human HCT116 cells
STARR-seq was performed as described previously61,62,83. Screening libraries were
generated from synthesized oligo pools by Twist Bioscience (see above). We transfected
80 × 10^6 HCT116 cells total per replicate with 160 µg of the input library using the
MaxCyte electroporation system. After 6 hr incubation, poly-A RNA was isolated and
further processed as described before62.
Illumina sequencing
Next-generation sequencing was performed at the VBCF NGS facility on an Illumina HiSeq
2500, NextSeq 550 or NovaSeq SP platform, following manufacturer’s protocol. Genome-
wide UMI-STARR-seq screens were sequenced as paired-end 36 cycle runs (except the
developmental input library, as paired-end 50 cycle runs) and Twist-oligo library screens
were sequenced as paired-end 150 cycle runs, using standard Illumina i5 indexes as well
as unique molecular identifiers (UMIs) at the i7 index.
Genome-wide UMI-STARR-seq data analysis
Paired-end genome-wide UMI-STARR-seq RNA and DNA input reads (36 bp; except the
developmental input library that was 50 bp) were mapped to the Drosophila genome
(dm3), excluding chromosomes U, Uextra, and the mitochondrial genome, using Bowtie
v.1.2.2 (ref. 84). Mapping reads with up to three mismatches and a maximal insert size of 2 kb
were kept. For paired-end RNA reads that mapped to the same positions, we collapsed
those that have identical UMIs (10 bp, allowing one mismatch) to ensure the counting of
unique reporter transcripts (Supplementary Table 2). We further computationally
selected both RNA and input fragments of length 150-250 bp to only capture active
sequences derived from short fragments. After processing the two biological replicates
separately, we pooled both replicates for developmental and housekeeping screens for
further analyses.
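The UMI-collapsing step described above can be illustrated with a short sketch. The greedy strategy used here (visit UMIs from most to least abundant and keep a UMI only if it differs by more than one base from every UMI already kept) is an assumption made for illustration; the text specifies only that identical UMIs, allowing one mismatch, were collapsed among reads mapping to the same positions.

```python
from collections import Counter, defaultdict


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def count_unique_fragments(reads, max_mismatch=1):
    """Count unique reporter transcripts per mapped fragment.

    reads: iterable of (chrom, start, end, strand, umi) tuples.
    Returns {(chrom, start, end, strand): number of distinct UMIs}.
    """
    by_pos = defaultdict(Counter)
    for chrom, start, end, strand, umi in reads:
        by_pos[(chrom, start, end, strand)][umi] += 1

    unique = {}
    for pos, umis in by_pos.items():
        kept = []
        # Most abundant UMIs first, so likely sequencing errors merge into them.
        for umi, _ in umis.most_common():
            if all(hamming(umi, k) > max_mismatch for k in kept):
                kept.append(umi)
        unique[pos] = len(kept)
    return unique
```

Counting collapsed UMIs rather than raw reads guards against PCR duplicates inflating the apparent number of reporter transcripts for a fragment.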
Peak calling was performed as described previously61. Peaks that had a hypergeometric
p-value <= 0.001 and a corrected enrichment over input (corrected to the conservative
lower bound of a 95% confidence interval) greater than 3 were defined as enhancers and
resized to 249 bp (same length as used in oligo libraries) (Supplementary Table 3). Non-
corrected enrichment over input was used as enhancer activity metric. Enhancers were
classified as developmental or housekeeping based on the screen with the highest
activity.
Oligo library UMI-STARR-seq data analysis
Oligo library UMI-STARR-seq RNA and DNA input reads (paired-end 150 bp) were
mapped to a reference containing 249 bp long sequences containing both wildtype and
mutated fragments from the Drosophila or human libraries using Bowtie v.1.2.2 (ref. 84). For the
Drosophila library we demultiplexed reads by the i5 and i7 indexes and oligo identity.
Mapping reads with the correct length, strand and with no mismatches (to identify all
sequence variants) were kept. Both DNA and RNA reads were collapsed by UMIs (10 bp)
as above (Supplementary Table 2).
We excluded oligos with fewer than 10 reads in any of the input replicates and added one
read pseudocount to oligos with zero RNA counts. The enhancer activity of each oligo in
each screen was calculated as the log2 fold-change over input, using all replicates, with
DESeq2 (ref. 85). We used the counts of wildtype negative regions in each library as scaling
factors between samples. This normalization only changes the position of the zero and
consequently does not affect the calculation of log2 fold-changes between different
sequences or the p-values for the statistical tests used.
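The effect of this normalization can be illustrated with a simplified sketch. It does not reproduce DESeq2: it uses a plain log2 ratio of RNA over input counts for a single replicate, with the pooled counts of the negative regions as the scaling factor, purely to show that the scaling shifts every value by the same constant and therefore leaves differences between sequences unchanged. The function and argument names are hypothetical.

```python
import math


def oligo_activity(rna, dna, negatives, min_input=10):
    """Simplified log2(RNA/input) per oligo, scaled by negative regions.

    rna, dna: dicts mapping oligo id -> read count (single replicate).
    negatives: ids of wildtype negative-region oligos used for scaling.
    """
    # Drop oligos with too few input reads.
    kept = [o for o in dna if dna[o] >= min_input]
    # Oligos with zero RNA counts receive one pseudocount read.
    rna_pc = {o: max(rna.get(o, 0), 1) for o in kept}
    neg = [o for o in negatives if o in rna_pc]
    scale = sum(rna_pc[o] for o in neg) / sum(dna[o] for o in neg)
    return {o: math.log2(rna_pc[o] / dna[o] / scale) for o in kept}
```

In this sketch the log2 fold-change between a wildtype and a mutant oligo is `act[wt] - act[mut]`, in which the scaling factor cancels.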
Deep Learning
Data preparation
We selected all windows at the summit of developmental and housekeeping enhancers,
in addition to three windows on either side of the regions (stride 100 bp). The remaining
part of the genome was binned into 249 bp windows with a stride of 100 bp, excluding
chromosomes U, Uextra, and the mitochondrial genome. We only included bins with more
than five reads in the input and at least one read in the RNA of both developmental and
housekeeping screens. To have a diversity of inactive sequences, we selected (1) 20,000
random bins overlapping accessible regions in different Drosophila cell types (S2, Kc167
and OSC61,86) and embryogenesis stages87, as well as all bins overlapping (2) enhancers
from different Drosophila cell types (OSC and BG3; ref. 26) and (3) inducible enhancers in S2
cells for two different stimuli (ecdysone88 and Wnt signaling89). Lastly, we added 59,081
random windows with a range of enhancer activity levels. We augmented our dataset by
adding the reverse complement of each original sequence, with the same output, ending
up with 242,026 examples (484,052 post-augmentation). Sequences from the first
(40,570; 8.4%) and second half of chr2R (41,186; 8.5%) were held out for validation and
testing, respectively.
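The reverse-complement augmentation described above amounts to the following minimal sketch, in which each example is represented as a (sequence, developmental activity, housekeeping activity) tuple; this tuple layout is an assumption made for illustration.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def augment(examples):
    """Add the reverse complement of every sequence with the same outputs."""
    out = []
    for seq, dev, hk in examples:
        out.append((seq, dev, hk))
        out.append((reverse_complement(seq), dev, hk))
    return out
```

The augmentation doubles the dataset, consistent with the 242,026 examples becoming 484,052 after augmentation, and encourages the model to treat both strands of a sequence equivalently.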
DeepSTARR model architecture and training
DeepSTARR was designed as a multi-task convolutional neural network (CNN) that uses
one-hot encoded 249 bp long DNA sequence (A=[1,0,0,0], C=[0,1,0,0], G=[0,0,1,0],
T=[0,0,0,1]) to predict both its developmental and housekeeping enhancer activities (Fig
1C). We adapted the Basset CNN architecture45 and built DeepSTARR with four 1D
convolutional layers (filters=246,60,60,120; size=7,3,5,3), each followed by batch
normalization, a ReLU non-linearity, and max-pooling (size=2). After the convolutional
layers there are two fully connected layers, each with 256 neurons and followed by batch
normalization, a ReLU non-linearity, and dropout where the fraction is 0.4. The final layer
mapped to both developmental and housekeeping outputs. Hyperparameters were
manually adjusted to yield best performance on the validation set. The model was
implemented and trained in Keras (v.2.2.4; ref. 90) with TensorFlow (v.1.14.0; ref. 91) using the
Adam optimizer92 (learning rate = 0.002), mean squared error (MSE) as loss function, a
batch size of 128, and early stopping with patience of ten epochs. Model training,
hyperparameter tuning and performance evaluation were performed on different sets of
genomic regions in distinct chromosomes.
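The input encoding and the size of the feature maps through the convolutional stack can be traced with a small sketch. Two assumptions are made that the text does not state: each convolution uses "same" padding (so only pooling changes the sequence length, and kernel sizes do not enter the calculation), and max-pooling uses non-overlapping windows that drop any trailing remainder.

```python
ONE_HOT = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0],
           "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}


def one_hot(seq):
    """One-hot encode a DNA string; unknown bases (e.g. N) become all zeros."""
    return [ONE_HOT.get(base, [0, 0, 0, 0]) for base in seq.upper()]


def trace_shapes(length=249, filters=(246, 60, 60, 120), pool=2):
    """(length, channels) after the input and after each conv + pooling block.

    Assumes 'same' padding in the convolutions and floor division in pooling.
    """
    shapes = [(length, 4)]
    for n_filters in filters:
        length = length // pool
        shapes.append((length, n_filters))
    return shapes
```

Under these assumptions a 249 bp input yields lengths 124, 62, 31 and 15 after the four blocks, so 15 × 120 = 1,800 values would be flattened and passed to the first fully connected layer.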
Performance evaluation
The performance of the model was evaluated separately for developmental and
housekeeping predictions on the held-out test sequences. We used the Pearson
correlation coefficient (PCC) across all bins for a quantitative genome-wide evaluation
and the area under the precision-recall curve (AUPRC; calculated using pr.curve from R
package PRROC v.1.3.1; ref. 92) for enhancer classification (enhancers vs. 2,685 negative
control regions from the test set).
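The quantitative part of this evaluation, a separate PCC for each output, can be sketched in plain Python as follows. The representation of observations and predictions as lists of (developmental, housekeeping) pairs is an assumption made for illustration; the AUPRC calculation is not shown.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def evaluate(observed, predicted):
    """PCC per task; inputs are lists of (developmental, housekeeping) pairs."""
    return {
        "developmental": pearson([o[0] for o in observed],
                                 [p[0] for p in predicted]),
        "housekeeping": pearson([o[1] for o in observed],
                                [p[1] for p in predicted]),
    }
```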
To test the robustness of the model, we trained 1,000 DeepSTARR models with the same
set of hyperparameters and compared their performance. This accounted for the
stochastic heterogeneity due to the randomly initialized weights in the neural network.
Prediction on full Drosophila genome
We extracted 249 bp sequences tiled across the Drosophila dm3 genome with a stride of
20 bp using ‘‘bedtools makewindows’’ (parameters ‘‘-w 249 -s 20’’) and ‘‘bedtools
getfasta’’ (ref. 93). We next predicted the developmental and housekeeping enhancer activity of
each genomic window with DeepSTARR and averaged these per nucleotide to obtain
genome-wide coverage. The DeepSTARR predicted coverage tracks are shown as
examples in Fig 1B and S1A,B and are available at
https://genome.ucsc.edu/s/bernardo.almeida/DeepSTARR_manuscript.
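The per-nucleotide averaging of overlapping window predictions can be sketched as below. `predict_window` is a hypothetical callable returning the predicted activity of the window [start, end); only full-length windows are considered here, which is a simplification relative to the tiling produced by the tools named above.

```python
def per_nucleotide_coverage(chrom_len, predict_window, width=249, stride=20):
    """Average the predictions of all windows overlapping each position.

    Returns a list of length chrom_len; positions covered by no window are None.
    """
    total = [0.0] * chrom_len
    n_windows = [0] * chrom_len
    for start in range(0, chrom_len - width + 1, stride):
        score = predict_window(start, start + width)
        for i in range(start, start + width):
            total[i] += score
            n_windows[i] += 1
    return [t / n if n else None for t, n in zip(total, n_windows)]
```

With a 249 bp window and a 20 bp stride, each interior nucleotide is covered by 12 or 13 windows, so the resulting track is a smoothed version of the window-level predictions.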
Models for comparison
The performance of DeepSTARR in the test set sequences was compared with two
different methods: (1) a gapped k-mer support vector machine (gkm-SVM)39 and (2) a
lasso regression model based on TF motif counts (Fig S1D).
(1) We used a 10-fold cross-validation scheme to train a developmental and a
housekeeping gkm-SVM model to classify 249 bp DNA sequences into enhancers. Training
was performed using developmental or housekeeping enhancers and a set of 21,463
negative control regions from the training set. The gkm-SVMs were trained using LS-GKM (ref. 94)
with the following parameters: (dev) gkmtrain -t 0 -l 8 -k 5 -x 10; (hk) gkmtrain -t 0 -l 11
-k 7 -x 10. We used the resulting support vectors of each trained model to score the DNA
sequences of the test set by running gkmpredict and used these scores for the PCC and
AUPRC analysis.
(2) We trained lasso regression models for developmental and housekeeping enhancer
activity using the counts of 6,502 known TF motifs (see “Reference compendium of
non-redundant TF motifs” below) as features across 0,000 randomly selected bins from the
training set. Motif counts were calculated using the matchMotifs function from R package
motifmatchr (v.1.4.0; ref. 95) with the following parameters: genome =
BSgenome.Dmelanogaster.UCSC.dm3, p.cutoff = 5e-04, bg="even". The model was trained
using the optimal lambda retrieved from 10-fold cross-validation and the glmnet function
from R package glmnet (v.2.0-16; ref. 96).
Nucleotide contribution scores
We used DeepExplainer (the DeepSHAP implementation of DeepLIFT, see refs. 55, 63, 64;
update from
https://github.com/AvantiShri/shap/blob/master/shap/explainers/deep/deep_tf.py)
to compute contribution scores for all nucleotides in all sequences with respect to either
developmental or housekeeping enhancer activity. We used 100 dinucleotide-shuffled
versions of each input sequence as reference sequences. For each sequence, the obtained
hypothetical importance scores were multiplied by the one-hot encoded matrix of the
sequence to derive the final nucleotide contribution scores, which were visualized using
the ggseqlogo function from R package ggseqlogo (v.0.1; ref. 97).
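The final masking step (hypothetical scores multiplied by the one-hot matrix) can be illustrated with a small Python sketch. It does not call DeepExplainer; the hypothetical scores are assumed to be given as one row of four values (A, C, G, T) per position, and the function names are illustrative.

```python
ALPHABET = "ACGT"

def one_hot(seq):
    """One-hot encode a DNA sequence as rows of four values (A, C, G, T)."""
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET]
            for base in seq]

def contribution_scores(seq, hypothetical):
    """Element-wise product of hypothetical scores and the one-hot matrix.

    Only the score of the base actually present at each position survives.
    """
    encoded = one_hot(seq)
    return [[h * x for h, x in zip(h_row, x_row)]
            for h_row, x_row in zip(hypothetical, encoded)]
```

Each row of the result has at most one non-zero entry, which is the value plotted as the letter height in a contribution-score logo.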
Motif discovery using TFModisco
To consolidate motifs, we ran TFModisco (v.0.5.12.0; ref. 56) on the nucleotide contribution
scores for each enhancer type separately using all developmental or housekeeping
enhancers (Fig 2B). We specified the following parameters: sliding_window_size=15,
flank_size=5, max_seqlets_per_metacluster=50000 and
TfModiscoSeqletsToPatternsFactory(trim_to_window_size=15, initial_flank_to_add=5).
Reference compendium of non-redundant TF motifs
6,502 TF motif models were obtained from iRegulon
(http://iregulon.aertslab.org/collections.html; ref. 98) covering the following databases:
Bergman (version 1.1; ref. 99), CIS-BP (version 1.02; ref. 100), FlyFactorSurvey (2010; ref. 101),
HOMER (2010; ref. 102), JASPAR (version 5.0_ALPHA; ref. 103), Stark (2007; ref. 104) and
iDMMPMM (2009; ref. 105). We
systematically collapsed redundant motifs by similarity using a previously described
approach (ref. 76). Specifically, we computed the distances between all motif pairs using
TOMTOM (ref. 106) and performed hierarchical clustering using Pearson correlation as the
distance metric and complete linkage using the hclust R function. The tree was cut at
height 0.8, resulting in 901 non-redundant motif clusters that were manually annotated
(Fig S2A-E). Clustering of motifs from each cluster and their logos were visualized using
the motifStack R package (v.1.26.0; ref. 107). The code and TF motif compendium are available
from https://github.com/bernardo-de-almeida/motif-clustering.
TF motif enrichment analyses in developmental and housekeeping enhancers
We tested the enrichment of each TF motif in developmental or housekeeping enhancers
over negative genomic regions (Fig S2F,G, Supplementary Table 4). Counts for each motif
in each sequence were calculated using the matchMotifs function from R package
motifmatchr (v.1.4.0; ref. 95) with the following parameters: genome =
BSgenome.Dmelanogaster.UCSC.dm3, p.cutoff = 1e-04, bg="genome". For each enhancer
type, we assessed the differential distribution of each motif between the enhancers and
negative regions by two-sided Fisher’s exact test. Obtained P-values were corrected for
multiple testing by the Benjamini-Hochberg procedure and considered significant if FDR
≤ 0.05. To remove motif redundancy, only the most significant TF motif per motif cluster
was shown.
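The enrichment test and multiple-testing correction can be sketched in plain Python. This is an illustration rather than the authors' R code: the 2x2 table is assumed to hold the number of enhancers and negative regions with and without a motif, and the function names are hypothetical.

```python
from math import comb

def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):  # hypergeometric probability of x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are at least as unlikely as the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values (FDR), in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A motif would then be called significant when its adjusted value is at most 0.05.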
TF motif mutagenesis in Drosophila S2 enhancers
Oligo library design
Selection of enhancer regions
A comprehensive library of 5,082 wildtype enhancer sequences in D. melanogaster S2
cells was compiled by selecting previously published developmental (ref. 61), housekeeping (ref. 40)
and inducible (ecdysone, ref. 88; Wnt signaling, ref. 89) enhancers. 249 bp sequences centered on
the enhancer summits in both forward and reverse orientation were retrieved. We added
524 249-bp negative genomic regions in both orientations as controls (Supplementary
Table 5).
Generation of TF motif mutations
We selected four predicted developmental motifs (GATA, AP-1, twist, Trl), three predicted
housekeeping motifs (Dref, Ohler1, Ohler6) and three control motifs (length-matched
random motifs to control for enhancer-sequence perturbation). For each motif type, we
mapped all instances using string-matching (GATA: GATAA; AP-1: TGA.TCA; twist:
CATCTG/CATATG; Trl: GAGAG; Dref: ATCGAT; Ohler1: GTGTGACC; Ohler6: AAAATACCA;
control: TAGG, GGGCT, CCTTA) in 2,194 enhancers (both motif orientations) and mutated
all instances both simultaneously and each instance individually to a motif shuffled
variant (Supplementary Table 5; Fig S3A). Each instance of a given motif was always
mutated to the same shuffled variant to allow the comparison of effects between instances
of the same motif type. We designed motif-mutant sequences for each enhancer only for
the orientation with the strongest wildtype enhancer activity. In addition, for each motif
type we repeated mutations with two additional shuffled variants in 50 enhancers to
control for the impact of the selected shuffled variant (Supplementary Table 5; Fig S3C).
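String-matching of motif instances on both strands, followed by replacement with a fixed shuffled variant, can be sketched as below. This is a minimal illustration: the function names are hypothetical, the shuffled variant used in the usage note is invented for the example (the variants actually used are listed in Supplementary Table 5), and motifs with alternatives such as twist would be written as a regular expression, e.g. `CAT[CA]TG`.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def find_instances(seq, pattern):
    """Sorted (start, end, strand) hits of a regex motif on both strands."""
    n = len(seq)
    scanner = re.compile(f"(?=({pattern}))")  # lookahead keeps overlapping hits
    hits = {(m.start(1), m.end(1), "+") for m in scanner.finditer(seq)}
    for m in scanner.finditer(reverse_complement(seq)):
        start, end = n - m.end(1), n - m.start(1)
        if (start, end, "+") not in hits:  # report palindromic hits once
            hits.add((start, end, "-"))
    return sorted(hits)

def mutate(seq, instances, shuffled):
    """Replace each instance with the same shuffled variant (strand-aware)."""
    out = list(seq)
    for start, end, strand in instances:
        variant = shuffled if strand == "+" else reverse_complement(shuffled)
        out[start:end] = variant  # variant must have the motif's length
    return "".join(out)
```

For example, `mutate(seq, find_instances(seq, "GATAA"), "ATAGA")` mutates all GATA instances simultaneously, whereas passing a single-element list mutates one instance individually.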
Enhancers with swapped GATA motif flanks
We selected 100 developmental enhancers from above that contain 2 GATA instances
(inst1 and inst2) with different importance as predicted by DeepSTARR and swapped the
flanking nucleotides (both 2 bp and 5 bp separately) between both instances (Fig 4C,
S7B). For each enhancer, we designed sequences where the flanks of inst1 were replaced
by the flanks of inst2 and vice versa, resulting in sequences in which both GATA
instances contained either the flanks of inst1 or the flanks of inst2. In addition, when
replacing the flanks of inst1 by the flanks of inst2, we also mutated inst2 to assess how the
flanks of inst2 affected the contribution of inst1. The opposite was also done, with the
flanks of inst2 being replaced by the flanks of inst1 together with mutation of inst1. The
mutated sequences are listed in Supplementary Table 5. The 47 active enhancers that contained
one strong and one weak GATA instance (2-fold difference as assessed afterwards by
mutagenesis) were used for the analyses in Fig 4C and S7B (Supplementary Table 11).
Design of synthetic S2 developmental enhancers
1 billion random 249 bp DNA sequences were generated in bash with the following code:
cat /dev/urandom | tr -dc 'ACGT' | fold -w 249 | head -n 1000000000. Bowtie v.1.2.2 (ref. 84) was
used to identify and remove sequences that exist in the D. melanogaster genome; none were found.
The developmental enhancer activity of these sequences was predicted using DeepSTARR
and 249 sequences spanning different activity levels were selected for the oligo library
(Supplementary Tables 5 and 17).
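For readers who prefer a seeded, reproducible alternative to the shell one-liner, an equivalent Python sketch is shown below. The function name and the seed are assumptions for illustration; the published sequences were generated with the bash command above.

```python
import random

def random_sequences(n, length=249, seed=0):
    """Generate n random DNA sequences with equal base probabilities."""
    rng = random.Random(seed)  # fixed seed makes the draw reproducible
    return ["".join(rng.choice("ACGT") for _ in range(length))
            for _ in range(n)]
```

Each of the 4^249 possible sequences is equally likely, which is consistent with none of the generated sequences matching the genome.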
Oligo library synthesis and UMI-STARR-seq
The Drosophila enhancers’ motif mutagenesis oligo library contained wildtype (both
orientations) and motif-mutant enhancers, enhancers with swapped GATA motif flanks
and synthetic enhancer sequences (Supplementary Table 5). All sequences were designed
using the dm3 genome version. The enhancer sequences spanned 249 bp total, flanked by
the Illumina i5 (25 bp; 5′-TCCCTACACGACGCTCTTCCGATCT) and i7 (26 bp;
5′-AGATCGGAAGAGCACACGTCTGAACT) adaptor sequences upstream and downstream,
respectively, serving as constant linkers for amplification and cloning. The resulting
21,758-plex 300-mer oligonucleotide library was synthesized by Twist Biosciences Inc.
UMI-STARR-seq using this oligo library was performed (“UMI-STARR-seq experiments”)
and analyzed (“Oligo library UMI-STARR-seq data analysis”) as described above. We
performed three independent replicates for developmental and housekeeping screens
(correlation PCC=0.93-0.98; Fig S3B).
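The oligo layout described above (25 bp i5 linker, 249 bp enhancer sequence, 26 bp i7 linker, 300 nt in total) can be checked with a short Python sketch. The adaptor sequences are taken from the text; the function name is hypothetical.

```python
I5 = "TCCCTACACGACGCTCTTCCGATCT"   # 25 bp upstream linker, from the text
I7 = "AGATCGGAAGAGCACACGTCTGAACT"  # 26 bp downstream linker, from the text

def build_oligo(insert):
    """Flank a 249 bp candidate sequence with the constant linkers."""
    if len(insert) != 249:
        raise ValueError(f"expected a 249 bp insert, got {len(insert)} bp")
    return I5 + insert + I7
```

The length check guards against candidate sequences that would push the oligo beyond the 300-mer synthesis format.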
TF motif mutation analysis and equivalency
From the candidate 249 bp enhancers, we identified 855 active developmental and 905
active housekeeping Drosophila enhancers (log2 wildtype activity in oligo UMI-STARR-
seq >= 3.15 and 2.51, respectively; the strongest negative region in each screen) that we
used in the subsequent TF motif mutation analyses. The impact of mutating all instances
of a TF motif type simultaneously or each instance individually was measured as the log2
fold-change in enhancer activity between the respective mutant and wildtype sequences
(Supplementary Tables 6 and 8). This was done separately for developmental and
housekeeping enhancer activities.
Motif non-equivalency across all enhancers (Fig 3B, S5B,D) or within the same enhancer
(Fig 3A,C) was assessed by comparing the impact of mutating individual instances of the
same TF motif, i.e. the log2 fold-changes of each instance (Supplementary Table 8). For
the comparison between instances in the same enhancer, only enhancers that require the
TF motif (> 2-fold reduction in activity after mutating all instances) and contain two or
more instances were used. Motif instances with >2-fold different contributions in the
same enhancer were considered as non-equivalent. The same comparison across
enhancers or within the same enhancer was performed for the three control motifs.
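The fold-change and non-equivalency criteria can be written down as a small Python sketch. Two assumptions are made here and are not stated in this form in the text: activities are passed on a linear scale, and ">2-fold different contributions" is read as a difference of more than 1 between the log2 fold-changes of two instances. The function names are hypothetical.

```python
import math

def log2_fold_change(mutant_activity, wildtype_activity):
    """log2 fold-change of mutant over wildtype (linear-scale activities)."""
    return math.log2(mutant_activity / wildtype_activity)

def requires_motif(log2fc_all_mutated):
    """True if mutating all instances reduces activity more than 2-fold."""
    return log2fc_all_mutated < -1.0

def has_non_equivalent_instances(instance_log2fcs):
    """True if two instances differ by more than 2-fold in contribution."""
    if len(instance_log2fcs) < 2:
        return False
    return max(instance_log2fcs) - min(instance_log2fcs) > 1.0
```

An enhancer would enter the within-enhancer comparison only if `requires_motif` is true and it contains at least two instances of the motif.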
Motif syntax features
DeepSTARR predicted global importance of motif types and comparison with motif
enrichment
To quantify the global importance of all known TF motifs to enhancer activity in silico (see
ref. 58), we embedded each motif from the 6,502 TF motif compendium at five different
locations and in both strands in 100 random backbone DNA sequences and predicted
their developmental and housekeeping enhancer activity with DeepSTARR. The 249 bp
backbone sequences were generated by sampling the base at each position with equal
probability. The five different locations were the same for all motifs, centered at positions
25, 75, 125 (middle of the 249 bp oligo), 175 and 225. For each motif, we used the
sequence corresponding to the highest affinity according to the annotated PWM models.
The average activity across the different locations per backbone was divided by the
backbone initial activity to get the predicted increase in enhancer activity per TF motif.
The resultant log2 fold-change was averaged across all 100 backbones to derive the final
global importance of each TF motif.
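The in silico embedding procedure can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: `predict` is a placeholder for any function that returns a positive, linear-scale activity for a sequence (DeepSTARR itself is not called here), the motif replaces the backbone bases at the insertion site, and the function names are hypothetical.

```python
import math

CENTERS = (25, 75, 125, 175, 225)  # 1-based motif centres, from the text
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def embed(backbone, motif, center):
    """Overwrite the backbone with the motif centred at a 1-based position."""
    start = center - 1 - len(motif) // 2
    return backbone[:start] + motif + backbone[start + len(motif):]

def global_importance(motif, backbones, predict):
    """Mean log2 fold-change over backbones after embedding the motif."""
    revcomp = motif.translate(_COMPLEMENT)[::-1]
    log2fcs = []
    for backbone in backbones:
        variants = [embed(backbone, m, c)
                    for c in CENTERS for m in (motif, revcomp)]
        mean_activity = sum(predict(v) for v in variants) / len(variants)
        log2fcs.append(math.log2(mean_activity / predict(backbone)))
    return sum(log2fcs) / len(log2fcs)
```

Averaging over positions and strands before taking the ratio means a motif that only works at one location contributes less than one that works everywhere.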
The global motif importance predicted by DeepSTARR was compared with the
enrichment of TF motifs at developmental and housekeeping enhancers, measured as the
two-sided Fisher’s exact test log odds ratio (described in “TF motif enrichment analyses
in developmental and housekeeping enhancers”) (Fig D, Supplementary Table 7). To
remove motif redundancy, only the TF motif with the strongest predicted global
importance or the strongest motif enrichment per motif cluster are shown in Fig 2D.
DeepSTARR predictions for the contribution of motif instances
We used two complementary approaches to measure the predicted contribution of each
motif instance by DeepSTARR.
First, we measured the predicted importance of all string-matched instances of each TF
motif type in 9,074 developmental enhancers, 6,369 housekeeping enhancers or 26,938
negative genomic regions (Fig S5A,C; Supplementary Table 9). The predicted importance
of an instance was calculated as the average developmental or housekeeping DeepSTARR
contribution scores over all its nucleotides. These scores represent the global
contribution of motif instances captured by the model and were used for the analyses of
Fig 4A,B, S5A,C and S7A.
Second, to compare with the experimentally derived motif importance through motif
mutagenesis, we used DeepSTARR to predict the log2 fold-change between wildtype and
the motif-mutant enhancer sequences included in the oligo library for all instances of the
different motif types (Fig 3B,D, S6). This was done by calculating the log2 fold-change
between the predicted activity of the wildtype and respective motif-mutant sequences.
Since the experimentally derived importance can be dependent on the shuffled mutant
variant selected, this provides a more direct evaluation of the capability of DeepSTARR to
predict the importance of a motif instance assessed by experimental mutagenesis.
Scoring of TF motif instances with PWM motif scores
To assess how the PWM motif models predict the importance of a motif instance, we
scored the wildtype sequence of each mutated motif instance (extended 10 nucleotides
on each flank to account for the flanking sequence) with the PWM models of the selected
TF motifs (Supplementary Table 10). We used the matchMotifs function from R package
motifmatchr (v.1.4.0; ref. 95; genome = BSgenome.Dmelanogaster.UCSC.dm3, bg="even") with
a p-value cutoff of 1 to retrieve the PWM scores of all sequences. These PWM scores were
compared with the experimental log2 fold-changes using Pearson correlation (Fig 3D).
We tested different PWM models for each TF motif where available and always reported the
one with the best correlation (Supplementary Table 10).
Correlation between motif importance and motif flanks
String-matched motif instances of each TF were sorted by their predicted (DeepSTARR)
or experimentally derived (negated mutation log2 fold-change) importance.
Their 5 flanking nucleotides were shown using heatmaps and the importance of each
nucleotide at each flanking position summarized using box plots (Fig 4A, S7A). Significant
differences between the four nucleotides per position were assessed through Welch One-
Way ANOVA test followed by FDR multiple testing correction. The motif logos represent
the frequency of each nucleotide at each position among the top 90th percentile instances
and were compared with the logos of existing PWM models (Fig 4B).
In silico motif distance preferences
Two consensus TF motifs were embedded in 60 random backbone 249 bp DNA
sequences, MotifA in the center and MotifB at a range of distances (d) from MotifA, both
up- and downstream (Fig 5A, S8). Backbone sequences were generated by sampling the
base at each position with equal probability. DeepSTARR was used to predict the
developmental or housekeeping activity of the backbone synthetic sequences (1) without
any motif (b), (2) only with MotifA in the center (A), (3) only with MotifB d-bases up- or
downstream (B) and (4) with both MotifA and MotifB (AB). The cooperativity between
MotifA and MotifB at each distance d was then defined as the fold-change between AB and
(b + (A-b) + (B-b) = A+B-b), where a value of 1 means an additive effect or no synergy
between the motifs, and a value higher than 1 means positive synergy. The median of the
fold-changes across the 60 backbones was used as the final cooperativity score. This analysis
was performed for all motif pair combinations of AP-1, SREBP, GATA, Trl, twist and ETS
motifs for developmental enhancer activity, and Dref, Ohler1 and Ohler6 for
housekeeping enhancer activity in both strand orientations. Pairs with a negative control
motif (GGGCT) were also included.
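The cooperativity definition (observed AB divided by the additive expectation A + B - b, then the median over backbones) can be expressed as a short Python sketch. The function and argument names are hypothetical, and activities are assumed to be positive, linear-scale predictions.

```python
from statistics import median

def cooperativity(backbone, only_a, only_b, both):
    """Fold-change of AB over the additive expectation A + B - b."""
    return both / (only_a + only_b - backbone)

def cooperativity_score(predictions):
    """Median cooperativity; each item is a (b, A, B, AB) tuple of activities."""
    return median(cooperativity(*p) for p in predictions)
```

For instance, with b = 1, A = 2 and B = 3 the additive expectation is 4, so a predicted AB of 4 gives a score of 1 (no synergy) and an AB of 8 gives a score of 2.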
Enrichment of motif pairs at different distances in genomic enhancers
We obtained the positions of the different TF motif instances across all 9,074
developmental enhancers, 6,369 housekeeping enhancers and 26,938 negative genomic
regions as described above (“Correlation between motif number and enhancer activity”).
To compute whether MotifA is located within a certain distance (bins: 0-25, 25-50, 50-75,
75-100, 100-125, 125-150, 150-250 bp) of MotifB more/less frequently in enhancers than
in negative sequences, we counted the number of times a MotifA instance is at each
distance bin to a MotifB instance in enhancers and in negative sequences. The enrichment
or depletion of motif pairs at each bin was tested with two-sided Fisher’s exact test and
the log2 odds ratio used as metric. Obtained P-values were corrected for multiple testing
by the Benjamini-Hochberg procedure and considered significant if FDR ≤ 0.05. We
performed this analysis separately for all developmental motif pairs in developmental
enhancers and all housekeeping motif pairs in housekeeping enhancers (Fig 5B, S9A,C,D).
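Counting motif pairs per distance bin can be sketched in Python as below. Several details are assumptions made for the illustration: distances are taken between instance positions (for example motif centres) within one sequence, bins are treated as half-open intervals [lo, hi), and the function names are hypothetical. When MotifA and MotifB are the same motif type, self-pairs would have to be excluded by the caller.

```python
from bisect import bisect_right

BIN_EDGES = (0, 25, 50, 75, 100, 125, 150, 250)  # bp, from the text

def distance_bin(distance):
    """Index of the bin containing a distance, or None if out of range."""
    if not 0 <= distance < BIN_EDGES[-1]:
        return None
    return bisect_right(BIN_EDGES, distance) - 1

def count_pairs_per_bin(positions_a, positions_b):
    """Number of MotifA-MotifB pairs per distance bin in one sequence."""
    counts = [0] * (len(BIN_EDGES) - 1)
    for pa in positions_a:
        for pb in positions_b:
            idx = distance_bin(abs(pa - pb))
            if idx is not None:
                counts[idx] += 1
    return counts
```

Summing these counts over all enhancers and over all negative regions gives, for each bin, the entries of the 2x2 table that is tested with Fisher's exact test.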
Association between motif pair distances and enhancer activity
We obtained the positions of the different TF motif instances across all 9,074
developmental enhancers, 6,369 housekeeping enhancers and 26,938 negative genomic
regions as described above (“Correlation between motif number and enhancer activity”).
For each pair of motif instances at each distance bin (0-25, 25-50, 50-75, 75-100, 100-
125, 125-150, 150-250 bp), we tested the association between enhancer activity and the
presence of the pair at the respective distance bin using a multiple linear regression,
including as independent variables the number of instances for the different
developmental or housekeeping TF motif types. The linear model coefficient was used as
metric and considered significant if the FDR-corrected p-value was ≤ 0.05. We performed this
analysis separately for all developmental motif pairs in developmental enhancers and all
housekeeping motif pairs in housekeeping enhancers (Fig 5B, S9B-D).
Validation of motif distance preferences by motif mutagenesis
To test how the importance of GATA and AP-1 instances is associated with the absolute
distance d to a second GATA instance, we compared the log2 fold-change in enhancer
activity after mutating individual GATA (Fig 5D) or AP-1 (Fig 5E) instances at close (< 25
bp; n=14 and 29, respectively) or longer (> 50 bp; n=129 and 38) distance to a second
GATA instance. Only pairs of non-overlapping motif instances were used. A Wilcoxon
rank-sum test was used to test this association.
TF motif mutagenesis in human HCT116 enhancers
TF motif enrichment
We characterized the motif composition of 5,891 strong STARR-seq enhancers in human
HCT116 cells (ref. 83) using the 501 bp sequence centered on the summit. We generated 5,891
negative GC-matched genomic regions using the genNullSeqs function from R package
gkmSVM (ref. 108). 1,689 TF motif PWM models and respective motif clustering information
were retrieved from Vierstra et al. (ref. 76), covering the following databases: JASPAR (2018),
Taipale HT-SELEX (2013) and HOCOMOCO (version 11). Counts for each motif in each
501 bp enhancer and negative sequence were calculated using the matchMotifs function
from R package motifmatchr (v.1.4.0; ref. 95) with the following parameters: genome =
BSgenome.Hsapiens.UCSC.hg19, p.cutoff = 1e-04, bg="genome". We assessed the
differential distribution of each motif between the enhancers and negative regions by
two-sided Fisher’s exact test. We selected the nine TF motifs with the strongest
enrichment in enhancers: AP-1, P53, MAF, CREB1, ETS, EGR1, MECP2, E2F1 and
Ebox/MYC (Supplementary Table 12).
TF motif mutagenesis oligo library design and synthesis
Generation of TF motif mutations
For UMI-STARR-seq of wild type and mutant enhancers, we selected 3,200 enhancer
candidates, defining short 249 bp windows (the limits of oligo synthesis), and mapped
the position of all instances of the nine TF motif types in these candidates using the
matchMotifs function from R package motifmatchr (v.1.4.0; ref. 95) with the following
parameters: genome = BSgenome.Hsapiens.UCSC.hg19, p.cutoff = 5e-04, bg="genome".
Overlapping instances (minimum 70%) for the same TF motif were collapsed. We also
mapped all instances of four control motifs (length-matched random motifs to control for
enhancer-sequence perturbation) using string-matching. We then designed enhancer
variants with all instances of each motif type mutated simultaneously or individually to a
motif shuffled variant (Supplementary Table 13; Fig S10A). Each instance of a given
motif was always mutated to the same shuffled variant to allow the comparison of effects
between motif instances. We designed motif-mutant sequences for each enhancer only
for the orientation with the strongest activity in the genome-wide STARR-seq. In addition,
for each motif type we repeated mutations with two other different shuffled variants in
50 enhancers to control for the impact of the selected shuffled variant (Supplementary
Table 13; Fig S10F).
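One way to collapse overlapping instances of the same TF motif is sketched below in Python. The text only states the 70% minimum overlap, so the remaining choices are assumptions: overlap is measured relative to the shorter instance, instances are processed in order of start position, and the first instance of an overlapping group is the one kept. The function names are hypothetical.

```python
def overlap_fraction(a, b):
    """Overlap of two (start, end) intervals as a fraction of the shorter one."""
    shared = min(a[1], b[1]) - max(a[0], b[0])
    return max(0, shared) / min(a[1] - a[0], b[1] - b[0])

def collapse_instances(instances, min_overlap=0.7):
    """Keep one instance per group of strongly overlapping instances."""
    kept = []
    for inst in sorted(instances):
        if kept and overlap_fraction(kept[-1], inst) >= min_overlap:
            continue  # treated as the same site as the previously kept hit
        kept.append(inst)
    return kept
```

Collapsing matters for the mutagenesis design because several PWMs of the same TF often match nearly the same bases, and each site should be mutated only once.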
Oligo library synthesis and UMI-STARR-seq
The final human enhancers’ motif mutagenesis library contained 3,200 wildtype and
18,780 motif-mutant enhancer sequences that we combined with 920 249-bp negative
genomic regions as controls (Supplementary Table 13). All sequences were designed
using the hg19 genome version. Apart from the specific sequences, this human motif
mutagenesis library exhibits the same specifications as the Drosophila library and was
also synthesized by Twist Biosciences Inc. UMI-STARR-seq using this oligo library was
performed (“UMI-STARR-seq experiments”) and analyzed (“Oligo library UMI-STARR-seq
data analysis”) as described above. We performed two independent replicates
(correlation PCC=0.99; Fig S10B).
TF motif mutation analysis
From the 3,200 designed candidate 249 bp enhancers, we identified 1,083 active short
human enhancers (log2 wildtype activity in oligo UMI-STARR-seq >= 2.03, the strongest
negative region; Fig S10C) that we used in the subsequent TF motif analyses. The impact
of mutating all instances of a TF motif type simultaneously or each instance individually
was calculated as the log2 fold-change in enhancer activity between the respective mutant
and wildtype sequences (Fig S10D,E, S11A; Supplementary Tables 14 and 15). Motif
non-equivalency across all enhancers (Fig S11A) or within the same enhancer (Fig 6B,C) was
assessed as in the Drosophila enhancers.
Validation of important TF motif instances with genomic DNase I footprinting data
We compared the importance of individual motif instances with genomic DNase I
footprinting data of RKO cells (another human colon cancer cell line;
https://www.vierstra.org/resources/dgf; ref. 76), as a surrogate for TF occupancy (Fig 6D).
Footprints detected at different FPR adjusted p-value thresholds and coverage tracks
with observed and expected cleavage counts were downloaded from
https://resources.altius.org/~jvierstra/projects/footprinting.2020/per.dataset/h.RKO-DS40362/,
in hg38 coordinates. All coordinates were converted to hg19 coordinates
using the UCSC liftOver tool (ref. 109) and the hg38ToHg19.over.chain chain file
(https://hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver/hg38ToHg19.over.chain.gz).
For each TF motif type, a Wilcoxon rank-sum test was used to determine whether the
mutation log2 fold-change of instances overlapping TF footprints (FPR threshold of
0.001) is significantly greater or less than that of instances not overlapping footprints.
Only instances within HCT116-accessible enhancers were used in the analysis. Enhancers
were defined as accessible if they overlap any of the DNase-seq peaks from the following
ENCODE (ref. 75) identifiers (hg19 coordinates) (https://www.encodeproject.org/):
ENCFF001SQU, ENCFF001WIJ, ENCFF001WIK, ENCFF175RBN, ENCFF228YKV,
ENCFF851NWR, ENCFF927AHJ, ENCFF945KJN and ENCFF360XGA.
Association between motif syntax rules and the contribution of TF motif instances
For each TF motif type, we built a multiple linear regression model to predict the
contribution of its individual instances (log2 fold-changes) using as covariates the
number of instances of the respective motif type in the enhancer, the motif core (defined
as the nucleotides included in each TF motif PWM model) and flanking nucleotides (5 bp
on each side), and the distance to all other TF motifs (close: < 25 bp; intermediate: ≥ 25
bp and ≤ 50 bp; distal: > 50 bp) (Fig 6E, S11B-D). Only motif instances that start after
position 5 and end before position 245 of the 249 bp oligos were used, in order to be able
to retrieve their 5 bp flanking sequences. In addition, for the motif distance analyses only
non-overlapping motif pairs were used. All models were built using the caret R package
(v.6.0-80; ref. 110) and 10-fold cross-validation. Predictions for each held-out test set were
compared with the observed log2 fold-changes to assess model performance. The
linear model coefficients and respective FDR-corrected p-values were used as metrics of
importance for each feature (Fig 6E, S11B). For each TF motif type, we compared the main
regression model with a simple linear model only using the PWM scores as covariate (Fig
S11D).
DeepSTARR prediction of the importance of AP-1 instances in human enhancers
We used the DeepSTARR model trained in Drosophila S2 enhancers to predict the
importance of AP-1 instances in human HCT116 enhancers. This was done by predicting
the activity of the wildtype and motif-mutant enhancer sequences included in the human
oligo library for all AP-1 instances and further calculating the log2 fold-change. This
predicted log2 fold-change was compared with the experimentally measured log2 fold-
change and its association assessed through Pearson correlation (Fig S12; Supplementary
Table 16).
Statistics and data visualization
All statistical calculations and graphical displays were performed in the R statistical
computing environment (v.3.5.1; ref. 111) using the R package ggplot2 (v.3.2.1; ref. 112).
Coverage data tracks were visualized in the UCSC Genome Browser (ref. 113) and used to
create displays of representative genomic loci. In all box plots, the central line denotes the
median, the box encompasses 25th to 75th percentile (interquartile range) and the
whiskers extend to 1.5×interquartile range.
Data availability
The raw sequencing data are available from GEO (https://www.ncbi.nlm.nih.gov/geo/)
under accession number GSE183939. Data used to train and evaluate the DeepSTARR
model as well as the final pre-trained model are available on Zenodo at
https://doi.org/10.5281/zenodo.5502060. We also plan to release the pre-trained
DeepSTARR model in the Kipoi model repository (ref. 114). Genome browser tracks showing
genome-wide UMI-STARR-seq and DeepSTARR predictions in Drosophila S2 cells,
together with the enhancers used for mutagenesis, mutated motif instances and
respective log2 fold-changes in enhancer activity, are available at
https://genome.ucsc.edu/s/bernardo.almeida/DeepSTARR_manuscript. TF motif
models were obtained from iRegulon (http://iregulon.aertslab.org/collections.html; ref. 98).
DNase-seq data in Drosophila S2 cells were obtained from ref. 61. Genomic DNase I
footprinting data of RKO cells were downloaded from
https://resources.altius.org/~jvierstra/projects/footprinting.2020/per.dataset/h.RKO-DS40362/.
HCT116 DNase-seq, H3K27ac and H3K4me1 data were obtained from
ENCODE (ref. 75; https://www.encodeproject.org/) and ATAC-seq data from ref. 74.
Code availability
Code used to process the genome-wide and oligo UMI-STARR-seq data and train
DeepSTARR, as well as to predict the enhancer activity for new DNA sequences is
available on GitHub (https://github.com/bernardo-de-almeida/DeepSTARR). The code
and TF motif compendium are available from
https://github.com/bernardo-de-almeida/motif-clustering.
Acknowledgements. The authors thank Angela Andersen (Life Science Editors), Vincent
Loubiere and Franziska Lorbeer (IMP) for comments on the manuscript, and Gert
Hulselmans and Stein Aerts (KU Leuven) for sharing the TF motif PWM collection. Deep
sequencing was performed at the Vienna Biocenter Core Facilities GmbH. Research in the
Stark group is supported by the European Research Council (ERC) under the European
Union’s Horizon 00 research and innovation programme (grant agreement no.
647320) and by the Austrian Science Fund (FWF, F4303-B09). Basic research at the IMP
is supported by Boehringer Ingelheim GmbH and the Austrian Research Promotion
Agency (FFG).
Author Contributions. B.P.d.A., F.R. and A.S. conceived the project. F.R. and M.P.
performed all experiments. B.P.d.A. performed all computational analyses. B.P.d.A., F.R.
and A.S. interpreted the data and wrote the manuscript. A.S. supervised the project.
References
1. Banerji, J., Rusconi, S. & Schaffner, W. Expression of a β-globin gene is enhanced by remote SV40 DNA sequences. Cell 27, 299–308 (1981).
2. Levine, M. Transcriptional Enhancers in Animal Development and Evolution. Curr. Biol. 20, R754–R763 (2010).
3. Catarino, R. R. & Stark, A. Assessing sufficiency and necessity of enhancer activities for gene expression and the mechanisms of transcription activation. Genes Dev. 32, 202–223 (2018).
4. Gompel, N., Prud’homme, B., Wittkopp, P. J., Kassner, V. A. & Carroll, S. B. Chance caught on the wing: cis-regulatory evolution and the origin of pigment patterns in Drosophila. Nature 433, 481–487 (2005).
5. Rickels, R. & Shilatifard, A. Enhancer Logic and Mechanics in Development and Disease. Trends Cell Biol. 28, 608–630 (2018).
6. Spitz, F. & Furlong, E. E. M. Transcription factors: From enhancer binding to developmental control. Nat. Rev. Genet. 13, 613–626 (2012).
7. Kulkarni, M. M. & Arnosti, D. N. Information display by transcriptional enhancers. Development 130, 6569–6575 (2003).
8. Zinzen, R. P., Senger, K., Levine, M. & Papatsenko, D. Computational Models for Neurogenic Gene Expression in the Drosophila Embryo. Curr. Biol. 16, 1358–1365 (2006).
9. Erceg, J. et al. Subtle Changes in Motif Positioning Cause Tissue-Specific Effects on Robustness of an Enhancer’s Activity. PLoS Genet. 10, e1004060 (2014).
10. Levo, M. & Segal, E. In pursuit of design principles of regulatory sequences. Nat. Rev. Genet. 15, 453–468 (2014).
11. Crocker, J. et al. Low Affinity Binding Site Clusters Confer Hox Specificity and Regulatory Robustness. Cell 160, 191–203 (2015).
12. Farley, E. K. et al. Suboptimization of developmental enhancers. Science 350, 325–328 (2015).
13. Farley, E. K., Olson, K. M., Zhang, W., Rokhsar, D. S. & Levine, M. S. Syntax compensates for poor binding sites to encode tissue specificity of developmental enhancers. Proc. Natl. Acad. Sci. 113, 6508–6513 (2016).
14. Fiore, C. & Cohen, B. A. Interactions between pluripotency factors specify cis-regulation in embryonic stem cells. Genome Res. 26, 778–786 (2016).
15. Mathelier, A. et al. DNA Shape Features Improve Transcription Factor Binding Site Predictions In Vivo. Cell Syst. 3, 278–286 (2016).
16. Sayal, R., Dresch, J. M., Pushel, I., Taylor, B. R. & Arnosti, D. N. Quantitative perturbation-based analysis of gene expression predicts enhancer activity in early Drosophila embryo. Elife 5, e08445 (2016).
17. King, D. M. et al. Synthetic and genomic regulatory elements reveal aspects of cis-regulatory grammar in mouse embryonic stem cells. Elife 9, 1–24 (2020).
18. Jindal, G. A. & Farley, E. K. Enhancer grammar in development, evolution, and disease: dependencies and interplay. Dev. Cell 56, 575–587 (2021).
19. Swanson, C. I., Evans, N. C. & Barolo, S. Structural Rules and Complex Regulatory Circuitry Constrain Expression of a Notch- and EGFR-Regulated Eye Enhancer. Dev. Cell 18, 359–376 (2010).
20. Panne, D. The enhanceosome. Curr. Opin. Struct. Biol. 18, 236–242 (2008).
21. Wang, J. et al. Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors. Genome Res. 22, 1798–1812 (2012).
22. Guo, Y., Mahony, S. & Gifford, D. K. High Resolution Genome Wide Binding Event Finding and Motif Discovery Reveals Transcription Factor Spatial Binding Constraints. PLoS Comput. Biol. 8, e1002638 (2012).
23. Junion, G. et al. A transcription factor collective defines cardiac cell fate and reflects lineage history. Cell 148, 473–486 (2012).
24. Liu, F. & Posakony, J. W. Role of architecture in the function and specificity of two notch-regulated transcriptional enhancer modules. PLoS Genet. 8, e1002796 (2012).
25. Smith, R. P. et al. Massively parallel decoding of mammalian regulatory sequences supports a flexible organizational model. Nat. Genet. 45, 1021–1028 (2013).
26. Yanez-Cuna, J. O. et al. Dissection of thousands of cell type-specific enhancers identifies dinucleotide repeat motifs as general enhancer features. Genome Res. 24, 1147–56 (2014).
27. Arnosti, D. N. & Kulkarni, M. M. Transcriptional enhancers: Intelligent enhanceosomes or flexible billboards? J. Cell. Biochem. 94, 890–898 (2005).
28. Kwasnieski, J. C., Fiore, C., Chaudhari, H. G. & Cohen, B. A. High-throughput functional testing of ENCODE segmentation predictions. Genome Res. 24, 1595–1602 (2014).
29. Grossman, S. R. et al. Systematic dissection of genomic features determining transcription factor binding and enhancer function. Proc. Natl. Acad. Sci. 114, E1291–E1300 (2017).
30. Kheradpour, P. et al. Systematic dissection of regulatory motifs in 2000 predicted human enhancers using a massively parallel reporter assay. Genome Res. 23, 800–811 (2013).
31. Svetlichnyy, D., Imrichova, H., Fiers, M., Kalender Atak, Z. & Aerts, S. Identification of High-Impact cis-Regulatory Mutations Using Transcription Factor Specific Random Forest Models. PLoS Comput. Biol. 11, 1–28 (2015).
32. Dibaeinia, P. & Sinha, S. Deciphering enhancer sequence using thermodynamics-based models and convolutional neural networks. bioRxiv (2021).
33. Berman, B. P. et al. Computational identification of developmental enhancers: conservation and function of transcription factor binding-site clusters in Drosophila melanogaster and Drosophila pseudoobscura. Genome Biol. 5, R61 (2004).
34. Crocker, J., Ilsley, G. R. & Stern, D. L. Quantitatively predictable control of Drosophila transcriptional enhancers in vivo with engineered transcription factors. Nat. Genet. 48, 292–298 (2016).
35. He, X., Samee, M. A. H., Blatti, C. & Sinha, S. Thermodynamics-based models of transcriptional regulation by enhancers: The roles of synergistic activation, cooperative binding and short-range repression. PLoS Comput. Biol. 6, e1000935 (2010).
36. Segal, E., Raveh-Sadka, T., Schroeder, M., Unnerstall, U. & Gaul, U. Predicting expression patterns from regulatory sequence in Drosophila segmentation. Nature 451, 535–540 (2008).
37. Beer, M. A. & Tavazoie, S. Predicting Gene Expression from Sequence. Cell 117, 185–198 (2004).
38. Zinzen, R. P. & Papatsenko, D. Enhancer responses to similarly distributed antagonistic gradients in development. PLoS Comput. Biol. 3, 0826–0835 (2007).
39. Ghandi, M., Lee, D., Mohammad-noori, M. & Beer, M. A. Enhanced Regulatory Sequence Prediction Using Gapped k-mer Features. PLoS Comput. Biol. 10, e1003711 (2014).
40. Zabidi, M. A. et al. Enhancer-core-promoter specificity separates developmental and housekeeping gene regulation. Nature 518, 556–559 (2015).
41. Arnold, C. D. et al. Genome-wide assessment of sequence-intrinsic enhancer responsiveness at single-base-pair resolution. Nat. Biotechnol. 35, 136–144 (2017).
42. Haberle, V. et al. Transcriptional cofactors display specificity for distinct types of core promoters. Nature 570, 122–126 (2019).
43. Kleftogiannis, D., Kalnis, P. & Bajic, V. B. Progress and challenges in bioinformatics approaches for enhancer identification. Brief. Bioinform. 17, 967–979 (2016).
44. Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838 (2015).
45. Kelley, D. R., Snoek, J. & Rinn, J. L. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 26, 990–999 (2016).
46. Kelley, D. R. et al. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res. 28, 739–750 (2018).
47. Avsec, Ž. et al. Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat. Genet. 53, 354–366 (2021).
48. Avsec, Ž. et al. Effective gene expression prediction from sequence by integrating long-range interactions. bioRxiv (2021).
49. Karbalayghareh, A., Sahin, M. & Leslie, C. S. Chromatin interaction aware gene regulatory modeling with graph attention networks. bioRxiv (2021).
50. Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 12, 931–934 (2015).
51. Movva, R. et al. Deciphering regulatory DNA sequences and noncoding genetic variants using neural network models of massively parallel reporter assays. PLoS One 14, 1–20 (2019).
52. Minnoye, L. et al. Cross-species analysis of enhancer logic using deep learning. Genome Res. 30, 1815–34 (2020).
53. Zhou, J. et al. Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk. Nat. Genet. 50, 1171–1179 (2018).
54. Bogard, N., Linder, J., Rosenberg, A. B. & Seelig, G. A Deep Neural Network for Predicting and Engineering Alternative Polyadenylation. Cell 178, 91–106 (2019).
55. Shrikumar, A., Greenside, P. & Kundaje, A. Learning important features through propagating activation differences. arXiv 1704.02685 (2017).
56. Shrikumar, A. et al. TF-MoDISco v0.4.4.2-alpha: Technical Note. arXiv 1811.00416 (2018).
57. Zheng, A. et al. Deep neural networks identify sequence context features predictive of transcription factor binding. Nat. Mach. Intell. 3, 172–180 (2021).
58. Koo, P. K., Majdandzic, A., Ploenzke, M., Anand, P. & Paul, S. B. Global importance analysis: An interpretability method to quantify importance of genomic features in deep neural networks. PLoS Comput. Biol. 17, e1008925 (2021).
59. Kim, D. et al. The dynamic, combinatorial cis-regulatory lexicon of epidermal differentiation. bioRxiv (2020).
60. Greenside, P., Shimko, T., Fordyce, P. & Kundaje, A. Discovering epistatic feature interactions from neural network models of regulatory DNA sequences. Bioinformatics 34, i629–i637 (2018).
61. Arnold, C. D. et al. Genome-wide quantitative enhancer activity maps identified by STARR-seq. Science 339, 1074–1077 (2013).
62. Neumayr, C., Pagani, M., Stark, A. & Arnold, C. D. STARR-seq and UMI-STARR-seq: Assessing Enhancer Activities for Genome-Wide-, High-, and Low-Complexity Candidate Libraries. Curr. Protoc. Mol. Biol. 128, e105 (2019).
63. Lundberg, S. M. & Lee, S.-I. A Unified Approach to Interpreting Model Predictions. 31st Conf. Neural Inf. Process. Syst. (2017).
64. Lundberg, S. M. et al. From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell. 2, 56–67 (2020).
65. Scardigli, R., Bäumer, N., Gruss, P., Guillemot, F. & Le Roux, I. Direct and concentration-dependent regulation of the proneural gene Neurogenin2 by Pax6. Development 130, 3269–3281 (2003).
66. Swanson, C. I., Schwimmer, D. B. & Barolo, S. Rapid evolutionary rewiring of a structurally constrained eye enhancer. Curr. Biol. 21, 1186–1196 (2011).
67. Crocker, J., Preger-Ben Noon, E. & Stern, D. L. The Soft Touch: Low-Affinity Transcription Factor Binding Sites in Development and Evolution. in Current Topics in Developmental Biology 117, 455–469 (Elsevier Inc., 2016).
68. Crocker, J. & Ilsley, G. R. Using synthetic biology to study gene regulatory evolution. Curr. Opin. Genet. Dev. 47, 91–101 (2017).
69. Boisclair Lachance, J. F., Webber, J. L., Hong, L., Dinner, A. R. & Rebay, I. Cooperative recruitment of Yan via a high-affinity ETS supersite organizes repression to confer specificity and robustness to cardiac cell fate specification. Genes Dev. 32, 389–401 (2018).
70. Scully, K. H. et al. Allosteric effects of Pit-1 DNA sites on long-term repression in cell type specification. Science 290, 1127–1131 (2000).
71. Crocker, J., Tamori, Y. & Erives, A. Evolution acts on enhancer organization to fine-tune gradient threshold readouts. PLoS Biol. 6, 2576–2587 (2008).
72. Cheng, Q. et al. Computational Identification of Diverse Mechanisms Underlying Transcription Factor-DNA Occupancy. PLoS Genet. 9, e1003571 (2013).
73. Morgunova, E. & Taipale, J. Structural perspective of cooperative transcription factor binding. Curr. Opin. Struct. Biol. 47, 1–8 (2017).
74. Ponnaluri, V. K. C. et al. NicE-seq: High resolution open chromatin profiling. Genome Biol. 18, 1–15 (2017).
75. Sloan, C. A. et al. ENCODE data at the ENCODE portal. Nucleic Acids Res. 44, D726–D732 (2016).
76. Vierstra, J. et al. Global reference mapping of human transcription factor footprints. Nature 583, 729–736 (2020).
77. Eraslan, G., Avsec, Ž., Gagneur, J. & Theis, F. J. Deep learning: new computational modelling techniques for genomics. Nat. Rev. Genet. 20, 389–403 (2019).
78. Dror, I., Golan, T., Levy, C. & Rohs, R. A widespread role of the motif environment in transcription factor binding across diverse protein families. Genome Res. 25, 1268–1280 (2015).
79. Yanez-Cuna, J. O., Dinh, H. Q., Kvon, E. Z., Shlyueva, D. & Stark, A. Uncovering cis-regulatory sequence requirements for context-specific transcription factor binding. Genome Res. 2018–2030 (2012). doi:10.1101/gr.132811.111.
80. Kvon, E. Z. et al. Genome-scale functional characterization of Drosophila developmental enhancers in vivo. Nature 512, 91–95 (2014).
81. Yan, J. et al. Systematic analysis of binding of transcription factors to noncoding variants. Nature 591, 147–151 (2021).
82. Meuleman, W. et al. Index and biological spectrum of human DNase I hypersensitive sites. Nature 584, 244–251 (2020).
83. Muerdter, F. et al. Resolving systematic errors in widely used enhancer activity assays in human cells. Nat. Methods 15, 141–149 (2018).
84. Langmead, B., Trapnell, C., Pop, M. & Salzberg, S. L. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009).
85. Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 1–21 (2014).
86. Albig, C. et al. Factor cooperation for chromosome discrimination in Drosophila. Nucleic Acids Res. 47, 1706–1724 (2019).
87. Thomas, S. et al. Dynamic reprogramming of chromatin accessibility during Drosophila embryo development. Genome Biol. 12, R43 (2011).
88. Shlyueva, D. et al. Hormone-Responsive Enhancer-Activity Maps Reveal Predictive Motifs, Indirect Repression, and Targeting of Closed Chromatin. Mol. Cell 54, 180–192 (2014).
89. Franz, A., Shlyueva, D., Brunner, E., Stark, A. & Basler, K. Probing the canonicity of the Wnt/Wingless signaling pathway. PLoS Genet. 13, 1–18 (2017).
90. Chollet, F. & others. Keras. https://keras.io. (2015).
91. Abadi, M. et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv 1603.04467 (2016).
92. Kingma, D. P. & Ba, J. L. Adam: A method for stochastic optimization. arXiv 1412.6980 (2015).
93. Quinlan, A. R. & Hall, I. M. BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
94. Lee, D. LS-GKM: A new gkm-SVM for large-scale datasets. Bioinformatics 32, 2196–2198 (2016).
95. Schep, A. motifmatchr: Fast Motif Matching in R. R package version 1.14.0. (2021).
96. Friedman, J., Hastie, T. & Tibshirani, R. Regularization Paths for Generalized Linear Models via Coordinate Descent. J. Stat. Softw. 33, 1–22 (2010).
97. Omar Wagih. ggseqlogo: A ‘ggplot2’ Extension for Drawing Publication-Ready Sequence Logos. R package version 0.1. https://CRAN.R-project.org/package=ggseqlogo. (2017).
98. Janky, R. et al. iRegulon: From a Gene List to a Gene Regulatory Network Using Large Motif and Track Collections. PLoS Comput. Biol. 10, e1003731 (2014).
99. Down, T. A., Bergman, C. M., Su, J. & Hubbard, T. J. P. Large-scale discovery of promoter motifs in Drosophila melanogaster. PLoS Comput. Biol. 3, 0095–0109 (2007).
100. Weirauch, M. T. et al. Determination and inference of eukaryotic transcription factor sequence specificity. Cell 158, 1431–1443 (2014).
101. Zhu, L. J. et al. FlyFactorSurvey: A database of Drosophila transcription factor binding specificities determined using the bacterial one-hybrid system. Nucleic Acids Res. 39, 111–117 (2011).
102. Heinz, S. et al. Simple Combinations of Lineage-Determining Transcription Factors Prime cis-Regulatory Elements Required for Macrophage and B Cell Identities. Mol. Cell 38, 576–589 (2010).
103. Mathelier, A. et al. JASPAR 2016: A major expansion and update of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 44, D110–D115 (2016).
104. Stark, A. et al. Discovery of functional elements in 12 Drosophila genomes using evolutionary signatures. Nature 450, 219–232 (2007).
105. Kulakovskiy, I. V. & Makeev, V. J. Discovery of DNA motifs recognized by transcription factors through integration of different experimental sources. Biophysics (Oxf). 54, 667–674 (2009).
106. Gupta, S., Stamatoyannopoulos, J. A., Bailey, T. L. & Noble, W. S. Quantifying similarity between motifs. Genome Biol. 8, R24 (2007).
107. Ou, J., Wolfe, S. A., Brodsky, M. H. & Zhu, L. J. MotifStack for the analysis of transcription factor binding site evolution. Nat. Methods 15, 8–9 (2018).
108. Ghandi, M. et al. GkmSVM: An R package for gapped-kmer SVM. Bioinformatics 32, 2205–2207 (2016).
109. Kuhn, R. M., Haussler, D. & James Kent, W. The UCSC genome browser and associated tools. Brief. Bioinform. 14, 144–161 (2013).
110. Kuhn, M. caret: Classification and Regression Training. R package version 6.0-80. https://CRAN.R-project.org/package=caret. (2018).
111. R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/. (2020).
112. Wickham, H. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4, http://ggplot2.org. (2016).
113. Kent, W. J. et al. The Human Genome Browser at UCSC. Genome Res. 12, 996–1006 (2002).
114. Avsec, Ž. et al. The Kipoi repository accelerates community exchange and reuse of predictive models for genomics. Nat. Biotechnol. 37, 592–600 (2019).
Supplementary Figures

Supplementary Figure 1. Additional performance evaluation of DeepSTARR predictions.

A-B) DeepSTARR predicts enhancer activity genome-wide. Genome browser screenshot
depicting UMI-STARR-seq observed (top) and predicted (bottom) profiles for both promoters
(developmental, red; housekeeping, blue) for two loci located on held-out test chromosome 2R.
C) DeepSTARR predicts enhancer activity quantitatively. Left: Scatter plots of predicted vs.
observed developmental (top) and housekeeping (bottom) enhancer activity signal across all
DNA sequences in the train, validation and test set chromosomes. Right: Scatter plots of
developmental (top) and housekeeping (bottom) enhancer activity signal between two
biological replicates across all DNA sequences in the test set chromosome. Color reflects point
density. The PCC is denoted for each comparison. D) DeepSTARR performed better than
methods based on known TF motifs or unbiased k-mers. Left: Comparison of different models
for predicting enhancer activity. Bar plots with the PCC between observed and predicted
activities for both developmental and housekeeping enhancer types across all DNA sequences
in the test set chromosome. PCC between replicates is also shown. Middle: Bar plots with the
auPRC for the classification of enhancer sequences from the test set for the different models,
compared with that expected from a random model. Right: Precision-recall curve for the different
models on test data. Error bars represent the 5th and 95th percentile of the performance of
1000 DeepSTARR models. PCC: Pearson correlation coefficient, R2: R-squared, auPRC: area
under precision-recall curve.
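The PCC reported in these panels can be computed as in the following minimal sketch, which uses only the Python standard library. The function name `pearson_cc` and the toy activity values are illustrative assumptions, not data from the study.

```python
import math


def pearson_cc(observed, predicted):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        raise ValueError("PCC is undefined for a constant sequence")
    return cov / math.sqrt(var_o * var_p)


# Toy log2 activities (made up for illustration).
observed = [0.5, 1.2, 2.8, 3.1]
predicted = [0.7, 1.0, 2.5, 3.4]
print(round(pearson_cc(observed, predicted), 3))
```

Applying the same function to two replicate measurements gives the replicate PCC that such legends typically use as a reference point for model performance.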

Supplementary Figure 2. Developmental and housekeeping enhancers are
enriched in different TF motifs.

A) Hierarchically clustered heat map of the pairwise similarity scores between 6,502 TF
motifs. The cluster dendrogram was cut at height 0.8, resulting in 901 non-redundant motif
clusters that were manually annotated. B-E) Exemplar TF motif clusters. F) Enrichment of TF
motifs in developmental (left) and housekeeping (right) enhancers over negative genomic
regions. Log2 Fisher’s odds ratio compared with significance (-log10 p-value) for the most
significant TF motif per motif cluster, to remove motif redundancy. Motifs significantly
(FDR<0.05) enriched or depleted are highlighted. G) Scatter plot comparing the motif
enrichment (log2 odds ratio) in developmental and housekeeping enhancers. To remove motif
redundancy, only the most significant TF motif per motif cluster is shown. Motifs
significantly (FDR<0.05) enriched or depleted in each or both enhancer types are highlighted.
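A minimal sketch of the enrichment statistics named in panels F and G is given below: a log2 odds ratio and a Fisher's exact test p-value from a 2×2 table. The table layout (a: enhancers with the motif, b: enhancers without, c: negative regions with the motif, d: negative regions without), the two-sided alternative, the function names and the example counts are all assumptions for illustration; the legend does not specify them.

```python
import math


def log2_odds_ratio(a, b, c, d):
    """log2 of (a*d)/(b*c) for the 2x2 table [[a, b], [c, d]]."""
    if a * d == 0 or b * c == 0:
        raise ValueError("zero cell: odds ratio is 0 or undefined")
    return math.log2((a * d) / (b * c))


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value (sum of tables no more likely than observed)."""
    row1, row2, col1 = a + b, c + d, a + c
    total = math.comb(row1 + row2, col1)

    def weight(x):
        # Unnormalised hypergeometric probability of x in the top-left cell.
        return math.comb(row1, x) * math.comb(row2, col1 - x)

    observed = weight(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    extreme = sum(weight(x) for x in range(lowest, highest + 1) if weight(x) <= observed)
    return extreme / total


# Made-up counts: motif present in 8 of 10 enhancers and 1 of 6 negative regions.
print(log2_odds_ratio(8, 2, 1, 5), fisher_exact_two_sided(8, 2, 1, 5))
```

With many motifs tested, the resulting p-values would still need a multiple-testing correction to obtain the FDR thresholds mentioned in the legend; that step is not shown here.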
Supplementary Figure 3. Large-scale systematic TF motif mutagenesis.

A) Overview of the (1) design, (2) synthesis and (3) UMI-STARR-seq screen of the
mutagenesis oligo library. UMI-STARR-seq was performed with a developmental (red) and a
housekeeping (blue) promoter in D. melanogaster S2 cells. B) Pairwise comparisons of input
(top) and UMI-STARR-seq (bottom) signal between three independent biological replicates
across all oligos included in the library with a developmental (left) or housekeeping (right)
promoter. Axes show counts per million in logarithmic scale. The PCC is denoted for each
comparison. C) Motif requirements are independent of motif mutant variants. Pairwise
comparisons of log2 fold-change (log2 FC) to wildtype activity between the three motif-mutant
shuffled versions across developmental (left) and housekeeping (right) enhancers.
The PCC is denoted for each comparison. D) Activity (log2) of wildtype and motif-mutant
developmental (left) and housekeeping (right) enhancers that were used to derive the log2
fold-changes from Fig 2C. The number of enhancers mutated for each motif type is shown. The
box plots mark the median, upper and lower quartiles and 1.5× interquartile range
(whiskers); outliers are shown individually.
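The quantities in panel D can be sketched as follows: a log2 fold-change is the difference between mutant and wildtype log2 activities, and the box plot elements follow from the quartiles and the 1.5× interquartile range rule. The function names, the quartile interpolation method and the example numbers are assumptions for illustration; plotting libraries differ slightly in how they compute quartiles.

```python
import statistics


def log2_fold_changes(wildtype_log2, mutant_log2):
    """log2 FC of each mutant relative to its wildtype (inputs already in log2)."""
    return [m - w for w, m in zip(wildtype_log2, mutant_log2)]


def box_plot_summary(values):
    """Median, quartiles, whisker ends (within 1.5x IQR) and outliers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "whisker_low": min(inside),    # most extreme points inside the fences
        "whisker_high": max(inside),
        "outliers": sorted(v for v in values if v < low_fence or v > high_fence),
    }


# Made-up log2 activities for three enhancers and their motif mutants.
fold_changes = log2_fold_changes([3.0, 4.0, 2.5], [1.0, 4.5, 0.5])
print(fold_changes, box_plot_summary(fold_changes))
```

Negative values indicate that mutating the motif reduced enhancer activity relative to the wildtype sequence.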

Supplementary Figure 4. DeepSTARR predicts enhancer activity of wildtype
sequences in oligo UMI-STARR-seq.

Scatter plots of predicted vs. observed developmental (A) and housekeeping (B) enhancer
activity signal across wildtype sequences from the test set chromosome. The PCC is denoted
for each comparison.

Supplementary Figure 5. Instances of the same TF motif do not have equivalent
contribution to enhancer activity.

A) DeepSTARR predicts that instances of the same TF motif do not have equivalent
contribution. Density distributions of the DeepSTARR predicted contribution scores (average
over all its nucleotides) of GATA (blue) or GGGCT (as control; grey) instances in
developmental enhancers. B) Systematic mutagenesis of individual TF motif instances
validates motif non-equivalency. Density distributions of the experimentally derived (oligo
UMI-STARR-seq) log2 FC in enhancer activity after mutation of GATA (blue) or control (grey)
individual instances in developmental enhancers. C) DeepSTARR predicts that instances of
the same TF motif are not equivalent. Distributions of the DeepSTARR predicted contribution
scores (average over all its nucleotides) of instances of different TF motif types across
developmental enhancers (red), housekeeping enhancers (blue) and negative genomic
regions (grey). The number of instances for each motif type is shown. The box plots mark the
median, upper and lower quartiles and 1.5× interquartile range (whiskers). D) Motif
mutagenesis validates motif non-equivalency. Distributions of the experimentally derived
(oligo UMI-STARR-seq) log2 FC in enhancer activity after mutation of individual instances of
different TF motif types or control motifs in developmental or housekeeping enhancers. Note
that the core sequences of different instances of the same motif type are identical, despite the
different log2 FC. The number of instances for each motif type is shown. The Fligner-Killeen test
of homogeneity of variances was used to compare the distribution of each TF motif type with
the one from control motifs: **** p-value < 0.0001 and * < 0.05. Box plots as in (C).

Supplementary Figure 6. Prediction of motif contribution by PWM scores or
DeepSTARR.

Distribution of experimentally measured fold-change (log2 FC) in enhancer activity after
mutating individual motif instances of the GATA (A), AP-1 (B), twist (C), Trl (D) and Dref (E)
motifs (violin plots), compared with the respective TF motif PWM scores and the log2 FC
predicted by DeepSTARR. The PCC is denoted for each comparison. The box plots mark the
median, upper and lower quartiles and 1.5× interquartile range (whiskers).
Supplementary Figure 7. Contribution of TF motifs depends on their flanks.

A) Motif contribution correlates with flanking base-pairs. Heatmaps: Flanking nucleotides of
instances of different TF motif types across developmental (GATA: GATAA, AP-1: TGA.TCA,
Trl: GAGAG, twist: CATCTG) or housekeeping (Dref: ATCGAT) enhancers sorted by their
DeepSTARR predicted contribution (left) or the experimentally derived (oligo UMI-STARR-seq)
log2 fold-change in enhancer activity after mutation (right; minus log2 fold-change, -log2
FC). Box plots: Importance of motif instances according to the different bases at each flanking
position. * marks positions with significant differences between the four nucleotides
(FDR-corrected Welch One-Way ANOVA test p-value < 0.01). The box plots mark the median, upper
and lower quartiles and 1.5× interquartile range (whiskers). The number of instances for each
motif type is shown. Top: logos of the top 90th percentile motif instances for each sorting
method. B) GATA flanking nucleotides are sufficient to switch motif contribution. 47
developmental enhancers containing both one strong (purple) and one weak (green) GATA
instance (≥ 2-fold difference between instances) were selected. Top: log2 FC enhancer activity
to wildtype for sequences where the 2 or 5 bp flanks of strong instances were replaced by the
ones of weak instances (purple) and vice versa (green). Bottom: log2 FC enhancer activity to
wildtype of mutating the strong instance (purple) compared to mutating this instance and
additionally replacing the 2 or 5 bp flanks of the weak instance by the flanks of the strong
instance (light purple). Log2 FC of mutating the weak instance (green) compared to mutating
this instance and additionally replacing the 2 or 5 bp flanks of the strong instance by the flanks
of the weak instance (light green). **** p-value < 0.0001, *** < 0.001, ** < 0.01, * < 0.05
(Wilcoxon signed rank test). Box plots as in (A).
Supplementary Figure 8. Interpretation of DeepSTARR reveals TF motif distance preferences.

A) In silico characterization of TF motif distance preferences. MotifA was embedded in the center of 60 synthetic random DNA sequences and MotifB at a range of distances from MotifA, both up- and downstream. The average developmental and housekeeping enhancer activities are both predicted by DeepSTARR. The cooperativity (residuals fold-change) between MotifA and MotifB as a function of distance is quantified as the activity of MotifA+B divided by the sum of the marginal effects of MotifA and MotifB (MotifA + MotifB - backbone (b)).
B) Heatmaps showing the pairwise cooperativity (residuals) between different TF motif types in developmental (left) or housekeeping (right) enhancers.
C-D) Cooperativity between motif pairs at different distances in (C) developmental and (D) housekeeping enhancers. Points and smooth lines show the median cooperativity across all 60 backbones for each motif pair distance up- and downstream. The MotifA in the center is named in each plot's title and tested with all MotifB motifs (different colours). The GGGCT motif was used as control (grey). The dashed line at 1 represents no interaction.
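
The cooperativity ratio described in (A) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names `cooperativity` and `median_cooperativity` are hypothetical, the activities are assumed to be on a linear (non-log) scale so that a ratio is meaningful, and the numbers in the usage example are made up.

```python
from statistics import median


def cooperativity(act_ab, act_a, act_b, act_backbone):
    """Activity of MotifA+B divided by the additive expectation.

    The additive expectation follows the caption: MotifA + MotifB - backbone.
    A value of 1 corresponds to no interaction (the dashed line in the plots).
    """
    expected = act_a + act_b - act_backbone
    if expected == 0:
        raise ValueError("additive expectation is zero; ratio is undefined")
    return act_ab / expected


def median_cooperativity(predictions):
    """Median cooperativity over backbones for one motif pair and distance.

    `predictions` is an iterable of (act_ab, act_a, act_b, act_backbone)
    tuples, one tuple per backbone sequence.
    """
    return median(cooperativity(*p) for p in predictions)


# Hypothetical predicted activities for three backbones (illustrative only).
example = [
    (6.0, 3.0, 2.0, 1.0),  # expected 4.0, ratio 1.5
    (4.0, 3.0, 2.0, 1.0),  # expected 4.0, ratio 1.0
    (8.0, 3.0, 2.0, 1.0),  # expected 4.0, ratio 2.0
]
print(median_cooperativity(example))  # 1.5
```

Taking the median over backbones, as in panels C-D, keeps a single unusual random sequence from dominating the estimate for a given motif pair and distance.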
Supplementary Figure 9. Motifs are not often at optimal distances in developmental enhancers, but enhancer activity follows optimal spacing rules.

A) Occurrence of motif pairs at different distances in genomic enhancers. Heatmaps showing the enrichment (Fisher's odds ratio) of motif pairs at different distance bins in developmental (left) or housekeeping (right) enhancers. * represents significant enrichment or depletion (FDR-corrected p-value < 0.05).
B) Validation of optimal spacing rules for enhancer activity. Heatmaps showing the association between enhancer activity and the presence of motif pairs at different distance bins in developmental (left) or housekeeping (right) enhancers using a multiple linear regression. The multiple linear regression included, as independent variables, the number of instances for the different developmental or housekeeping TF motif types. Linear model coefficients are shown and * represents significant positive or negative associations (FDR-corrected p-value < 0.05).
C-D) Top: Same as in Fig S8C,D (but with up- and downstream distances combined) per (C) developmental or (D) housekeeping motif pair. Middle: Association between enhancer activity and the distance at which the motif pair is found. Coefficient (y-axis) and p-value from a multiple linear regression including, as independent variables, the number of instances for the different developmental or housekeeping TF motif types. Bottom: Odds ratio (log2) by which the two motifs are found within a specified distance from each other in enhancers compared with negative genomic regions. Color legend is shown. Example motif pairs where optimal spacing preferences are concordant or discordant with their occurrence in enhancers are shown. * FDR-corrected Fisher's Exact test p-value < 0.05.
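
The log2 odds ratio in the bottom panels of (C-D) can be illustrated with a small Python sketch. It assumes a 2x2 table of counts (enhancers versus negative regions, with versus without the motif pair inside a given distance bin); the function name `log2_odds_ratio` and the counts below are hypothetical, and the sketch computes the sample odds ratio only, not the Fisher's exact test p-value or the FDR correction.

```python
import math


def log2_odds_ratio(enh_with, enh_without, neg_with, neg_without):
    """log2 odds ratio for a 2x2 table of motif-pair occurrence.

    enh_with / enh_without: enhancers with / without the pair in the bin.
    neg_with / neg_without: negative regions with / without the pair in the bin.
    Positive values mean the pair is enriched in enhancers, negative depleted.
    """
    if min(enh_with, enh_without, neg_with, neg_without) <= 0:
        raise ValueError("all four counts must be positive for a finite odds ratio")
    odds_ratio = (enh_with * neg_without) / (enh_without * neg_with)
    return math.log2(odds_ratio)


# Made-up counts: pair found in 40 of 200 enhancers and 10 of 200 negatives.
print(log2_odds_ratio(40, 160, 10, 190))
```

For the significance test itself, an existing implementation such as `scipy.stats.fisher_exact` could be applied to the same 2x2 table.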

Supplementary Figure 10. Systematic TF motif mutagenesis in human HCT116 enhancers.

A) Systematic TF motif mutagenesis in human HCT116 enhancers. We selected 1,083 strong human enhancers and 9 TF motif types and mutated all instances of the same motif simultaneously or each instance individually. The activities of the wildtype and mutant sequences were measured through UMI-STARR-seq.
B) Pairwise comparisons of input and STARR-seq signal between two independent biological replicates across all oligos included in the human oligo library. Axes are in logarithmic scale. The PCC is denoted for each comparison.
C) Identification of 1,083 active short human enhancers. Distribution of log2 enhancer activity for oligos selected from negative regions (grey) or enhancer sequences (blue). 1,083 active short human enhancers (log2 wildtype activity in oligo UMI-STARR-seq ≥ 2.03, the strongest negative region, red dashed line; see Methods) were selected for subsequent motif mutation analyses. The box plots mark the median, upper and lower quartiles and 1.5× interquartile range (whiskers).
D) TF motif requirements of human HCT116 enhancers. Log2 FC enhancer activity for hundreds of human enhancers after mutating all instances of four control (grey) and nine candidate human TF motifs. The number of enhancers mutated for each motif type and the respective motif PWM logos are shown. Box plots as in (C), but outliers are shown individually.
E) Activity (log2) of wildtype and motif-mutant enhancer sequences that were used to derive the log2 fold-changes from Fig S10D. The number of enhancers mutated is shown. Box plots as in (C), but outliers are shown individually.
F) Motif requirements are independent of motif mutant variants. Left: Distribution of enhancer activity for wildtype or motif-mutant enhancer sequences for the different TF motifs. The activity of sequences where the motifs were mutated to different motif shuffled versions is shown. The number of enhancers mutated for each motif type is shown. Box plots as in (C), but outliers are shown individually. Right: Pairwise comparisons of log2 FC to wildtype activity between the three motif-mutant shuffled versions across all enhancers. The PCC is denoted for each comparison.
Supplementary Figure 11. Motif syntax rules dictate the contribution of motif instances.

A) TF motif non-equivalence is widespread in human enhancers. Distributions of the log2 FC in enhancer activity after mutation of individual instances of different TF motif types or control motifs. The number of instances for each motif type is shown. The Fligner-Killeen test of homogeneity of variances was used to compare the distributions of each TF motif type with the one from control motifs: **** p-value < 0.0001. The box plots mark the median, upper and lower quartiles and 1.5× interquartile range (whiskers).
B) Motif syntax rules dictate the contribution of TF motif instances in human enhancers. For each TF motif type (rows), we built a linear model containing the number of instances, the motif core (defined as the nucleotides included in each TF motif PWM model) and flanking nucleotides (5 bp on each side), and the distance to all other TF motifs (close: < 25 bp; intermediate: ≥ 25 bp and ≤ 50 bp; distal: > 50 bp) to predict the contribution of its individual instances (mutation log2 FC, from Fig S11A) across all enhancers. Heatmap shows the contribution of each feature (columns) for each model, colored by the direction (positive: red, negative: blue) and FDR-corrected p-value. The PCC between predicted and observed motif contribution is shown with the green color scale.
C) Scatter plots comparing the measured contribution of individual instances of each TF motif type (log2 FC in enhancer activity after mutation) with the one predicted by the models from (B). The PCC is denoted for each comparison.
D) Models that take the motif syntax features into account predict the contribution of motif instances better than the PWM scores alone. Bar-plots comparing the PCC from the full models (from (B); green) and the same models using only existing PWM scores (orange).
E) Motif mutagenesis validates that AP-1 instances close to a second AP-1 instance are more important. Left: expected mutational impact when mutating AP-1 instances depending on the distance to other AP-1 motifs. Right: enhancer activity changes (log2 FC) after mutating AP-1 instances at optimal close (< 25 bp) or suboptimal longer (> 50 bp) distance to a second instance. The number of instances is shown. *** p-value < 0.001 (Wilcoxon rank-sum test). Box plots as in (A).
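
The distance categories used as model features in (B) can be written down explicitly. The sketch below only illustrates the thresholds stated in the caption; the function names `distance_bin` and `bin_counts` are hypothetical, and both the way distance is measured between two motifs and the way the categories enter the linear model (here, simple counts) are assumptions.

```python
def distance_bin(distance_bp):
    """Assign a motif-to-motif distance (in bp) to the caption's categories.

    close: < 25 bp; intermediate: >= 25 bp and <= 50 bp; distal: > 50 bp.
    """
    if distance_bp < 0:
        raise ValueError("distance must be non-negative")
    if distance_bp < 25:
        return "close"
    if distance_bp <= 50:
        return "intermediate"
    return "distal"


def bin_counts(distances_bp):
    """Count how many other motifs fall into each distance category."""
    counts = {"close": 0, "intermediate": 0, "distal": 0}
    for d in distances_bp:
        counts[distance_bin(d)] += 1
    return counts


# Hypothetical distances from one motif instance to other motifs in an enhancer.
print(bin_counts([10, 25, 50, 51, 120]))
# {'close': 1, 'intermediate': 2, 'distal': 2}
```

Note that the boundary values 25 bp and 50 bp both fall into the intermediate category, matching the inequalities given in the caption.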

Supplementary Figure 12. DeepSTARR predicts the contribution of AP-1 instances in human enhancers.

Distribution of experimentally measured log2 fold-change (log2 FC) enhancer activity after mutating 1,617 different AP-1 instances across HCT116 enhancers (A), compared with the log2 FC predicted by DeepSTARR (B). The PCC is denoted. The box plots mark the median, upper and lower quartiles and 1.5× interquartile range (whiskers).