Cross-species analysis of enhancer logic using deep learning
Liesbeth Minnoye1,2,#, Ibrahim Ihsan Taskiran1,2,#, David Mauduit1,2, Maurizio Fazio4,5, Linde Van Aerschot1,2,3, Gert Hulselmans1,2, Valerie Christiaens1,2, Samira Makhzami1,2, Monika Seltenhammer6,7, Panagiotis Karras8,9, Aline Primot10, Edouard Cadieu10, Ellen van Rooijen4,5, Jean-Christophe Marine8,9, Giorgia Egidy11, Ghanem-Elias Ghanem12, Leonard Zon4,5, Jasper Wouters1,2, and Stein Aerts1,2,*.
1. VIB-KU Leuven Center for Brain & Disease Research, Leuven, Belgium.
2. KU Leuven, Department of Human Genetics KU Leuven, Leuven, Belgium.
3. Laboratory for Disease Mechanisms in Cancer, KU Leuven, Leuven, Belgium
4. Howard Hughes Medical Institute, Stem Cell Program and the Division of Pediatric Hematology/Oncology, Boston Children’s Hospital and Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA 02115, USA
5. Department of Stem Cell and Regenerative Biology, Harvard Stem Cell Institute, Cambridge, MA 02138, USA
6. Center for Forensic Medicine, Medical University of Vienna, Vienna, Austria
7. Division of Livestock Sciences (NUWI) - BOKU University of Natural Resources and Life Sciences, Gregor-Mendel-Straße 33, 1180 Vienna, Austria
8. VIB-KU Leuven Center for Cancer Biology, Leuven, Belgium
9. KU Leuven, Department of Oncology KU Leuven, Leuven, Belgium.
10. CNRS-University of Rennes 1, UMR6290, Institute of Genetics and Development of Rennes, Faculty of Medicine, Rennes, France
11. Université Paris-Saclay, INRA, AgroParisTech, GABI, 78350, Jouy-en-Josas, France.
12. Institut Jules Bordet, Université Libre de Bruxelles, Brussels, Belgium.
# equal contribution
* corresponding author
stein.aerts@kuleuven.vib.be
Laboratory of Computational Biology
Herestraat 49, P.O. Box 602
3000 Leuven, Belgium
Running title: Melanoma enhancer logic
Keywords: Epigenomics, machine learning, transcriptional regulation, melanoma
Abstract
Deciphering the genomic regulatory code of enhancers is a key challenge in biology as this code underlies cellular identity. A better understanding of how enhancers work will improve the interpretation of non-coding genome variation, and empower the generation of cell type-specific drivers for gene therapy. Here we explore the combination of deep learning and cross-species chromatin accessibility profiling to build explainable enhancer models. We apply this strategy to decipher the enhancer code in melanoma, a relevant case study due to the presence of distinct melanoma cell states. We trained and validated a deep learning model, called DeepMEL, using chromatin accessibility data of 26 melanoma samples across six different species. We demonstrate the accuracy of DeepMEL predictions on the CAGI5 challenge, where it significantly outperforms existing models on the melanoma enhancer of IRF4. Next, we exploit DeepMEL to analyse enhancer architectures and identify accurate transcription factor binding sites for the core regulatory complexes in the two different melanoma states, with distinct roles for each transcription factor, in terms of nucleosome displacement or enhancer activation. Finally, DeepMEL identifies orthologous enhancers across distantly related species where sequence alignment fails, and the model highlights specific nucleotide substitutions that underlie enhancer turnover. DeepMEL can be used from the Kipoi database to predict and optimise candidate enhancers, and to prioritise enhancer mutations. In addition, our computational strategy can be applied to other cancer or normal cell types.
Introduction
A cell’s phenotype arises from the expression of a unique set of genes, which is regulated through the binding of transcription factors (TFs) to cis-regulatory regions, such as promoters and enhancers. Deciphering gene regulatory programs entails mapping the network of TFs and cis-regulatory regions that governs the identity of a given cell type, as well as understanding how the specificity of such a network is encoded in the DNA sequence of genomic enhancers. Profiling accessible chromatin via DNase I hypersensitive site sequencing (DNase-seq) or via the Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) represents a useful approach for identifying putative enhancers (Buenrostro et al. 2013; Klemm et al. 2019; Song and Crawford 2010). Indeed, active enhancers are typically depleted of one or more nucleosomes, due to the binding of TFs. Initial changes in DNA accessibility can be facilitated through a special class of TFs that bind with high affinity to their recognition sites and that have a long residence time at the enhancer, sometimes referred to as pioneer TFs (Klemm et al. 2019; Zaret and Carroll 2011). By displacing nucleosomes or thermodynamically outcompeting nucleosome binding, they allow other TFs to co-bind, thereby further stabilising the nucleosome-depleted region and/or actively enhancing transcription of target genes (Grossman et al. 2018; Jacobs et al. 2018; Dodonova et al. 2020).
As the presence and architecture of TF binding sites within enhancers determine which TFs can bind with high affinity, understanding this ‘enhancer logic’ can help interpret the functional role of enhancers within a gene regulatory network. Several techniques exist to study the enhancer code, including (1) motif discovery tools (Imrichová et al. 2015; Janky et al. 2014; Bailey et al. 2009; Heinz et al. 2010; Thomas-Chollier et al. 2011, 2012); (2) comparative genomics (Ballester et al. 2014; Prescott et al. 2015; Villar et al. 2015); (3) genetic screens (Gasperini et al. 2019; Kircher et al. 2019); and (4) machine learning techniques (Park and Kellis 2015). The latter in particular has seen a strong boost in recent years with the advent of large training sets derived from genome-wide profiling. Three pivotal methods based on deep learning include DeepBind (Alipanahi et al. 2015), DeepSEA (Zhou and Troyanskaya 2015) and Basset (Kelley et al. 2016), the first convolutional neural networks (CNNs) applied to genomics data (Eraslan et al. 2019). Since their emergence in the genomics field, machine learning techniques, and especially CNNs, have been applied to model a range of regulatory aspects, including cross-species enhancer predictions (Quang and Xie 2016; Xu Min et al. 2016; Chen et al. 2018), TF binding sites (Wang et al. 2018; Avsec et al. 2019b), DNA methylation (Angermueller et al. 2017) and 3D chromatin architecture (Schreiber et al. 2017).
Deciphering gene regulation and the underlying enhancer code is not only important during dynamic processes such as development, but also in disease contexts such as cancer, where gene regulatory networks are typically misregulated due to mutations. Particularly in melanoma, a type of skin cancer that develops from melanocytes, gene expression is misregulated and highly plastic (Shain and Bastian 2016; Rambow et al. 2019). This gives rise to two main melanoma cell states: the melanocytic (MEL) state, which still resembles the cell-of-origin, expressing high levels of the melanocyte-lineage specific transcription factors MITF, SOX10 and TFAP2A, as well as typical pigmentation genes such as DCT, TYR, PMEL, and MLANA; and the mesenchymal-like (MES) state, in which the cells are more invasive and therapy resistant, expressing high levels of genes involved in TGFB signaling and epithelial-to-mesenchymal transition (EMT)-related genes (Hoek et al. 2006, 2008; Rambow et al. 2019; Verfaillie et al. 2015; Wouters et al. 2019). These transcriptomic differences have also been studied at the epigenomics level, with AP-1 and TEAD factors as master regulators of the MES state and binding sites for SOX10 and MITF significantly enriched in MEL-specific regulatory regions (Bravo González-Blas et al. 2019; Verfaillie et al. 2015; Wouters et al. 2019). However, it remains unclear how these regulatory states are encoded in particular enhancer architectures, and whether such architectures are evolutionarily conserved. Besides human cell lines and human patient-derived cultures, several animal models have been established in melanoma research, including mouse, pig, horse, dog and zebrafish (Egidy et al. 2008; van Rooijen et al. 2017; Segaoula et al. 2018; Seltenhammer et al. 2014; van der Weyden et al. 2016; Prouteau and André 2019). Although these models are widely used, it is unknown whether their enhancer landscapes and regulatory programs are conserved with human. Here, we take advantage of these animal model systems and combine cross-species chromatin accessibility profiling with deep learning, to investigate enhancer logic in melanoma.
Results
Melanoma chromatin accessibility landscapes are conserved across species
We profiled chromatin accessibility using ATAC-seq on a collection of melanoma cell lines across six species, for a total of 26 samples (Fig. 1A). These include 16 human patient-derived cultures (“MM lines”) (Gembarska et al. 2012; Verfaillie et al. 2015), one mouse cell line (Dankort et al. 2009), primary melanoma cells from the pig melanoma model MeLiM (“MeLiM”) (Egidy et al. 2008), two horse melanoma lines derived from a Grey Lipizzaner horse (“HoMel-L1”) and from an Arabian horse (“HoMel-A1”) (Seltenhammer et al. 2014), two dog melanoma cell lines from oral and uveal sites: “Dog-OralMel-18249” and “Dog-IrisMel-14205” respectively (Cani-DNA BRC: https://dog-genetics.genouest.org) and four melanoma lines established from zebrafish (“ZMEL1”, “EGFP-121-1”, “EGFP-121-5” and “EGFP-121-3”) (White et al. 2008, 2011). Per sample, between 65,475 and 176,695 ATAC-seq peaks were called, with distinct levels of conservation of accessibility across the species (Fig. 1A, S1A). The difference in the number of peaks across the samples is due, on the one hand, to genome size (Fig. S1B), and on the other hand to data quality (measured as the fraction of reads in peaks (FRiP)) (Fig. S1C).
Unsupervised clustering of the 16 human lines revealed two distinct groups (Fig. S1D), which correspond to the two main cell states in human melanoma, i.e. the melanocytic state (MEL) and mesenchymal-like state (MES), as was further confirmed for most of the cell lines by previously generated RNA-seq data (Fig. S1E) (Verfaillie et al. 2015) and corroborated by previous studies using epigenomics data (Verfaillie et al. 2015; Wouters et al. 2019). Indeed, regulatory regions near MEL-specific genes such as SOX10 are accessible in human lines in the MEL state (MM001, MM011, MM031, MM034, MM052, MM057, MM074, MM087, MM118, MM122 and MM164), whereas they are closed in MES melanoma lines (MM029, MM099, MM116, MM163, and MM165) (Fig. 1B). Of note, as in Wouters et al., we observed heterogeneity between samples of the MEL state (Fig. S1D).
To enable the comparison of chromatin accessibility between human and other species, we first identified regulatory regions that are alignable (i.e. have a high sequence similarity) between species using the liftOver tool (at least 10% of bases must remap) (Meyer et al. 2012). When such an alignable region contains an ATAC-seq peak in the compared species, we will refer to it as a ‘conserved accessible’ region. Between 1.1% and 40.9% of the ATAC-seq regions in non-human lines were conserved accessible in human (Fig. 1C) and between 0.9% and 18.4% of the human peaks were conserved accessible in the other species (Fig. S1F). Accordingly, we identified 303,392 alignable and 10,592 conserved accessible regions across all mammalian species. This number decreases when including zebrafish, to 29,619 alignable regions and only 116 conserved accessible regions. Nearly half of the 10,592 conserved accessible mammalian regions were promoters within 1 kb of a transcription start site (Fig. S1G). Indeed, high conservation of proximal promoters has previously been reported (Villar et al. 2015). In each of the mammalian species, the 10,592 conserved accessible regions were more accessible compared to all ATAC-seq regions; in addition, they show a higher ChIP-seq signal for acetylation of histone H3 at lysine 27 (H3K27ac) in human, a mark for active regulatory regions (Creyghton et al. 2010) (Fig. S1H,I), and higher sequence conservation compared to alignable regions as measured by phastCons and phyloP (Fig. S1J) (Pollard et al. 2010; Siepel 2005). Note, nevertheless, that although ATAC-seq regions are nucleosome-depleted and often bound by several TFs, they are not necessarily active enhancers, as accessibility does not directly translate to enhancer activity (Shlyueva et al. 2014).
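The three-way labelling used here (not alignable, alignable, conserved accessible) can be illustrated with a minimal sketch. It assumes that liftOver has already been run and that its result is available as a mapping from each region to its remapped coordinates in the compared genome, or to `None` when remapping failed. The function names and the toy coordinates are hypothetical and are not taken from the authors' pipeline.

```python
from collections import defaultdict


def build_index(peaks):
    """Group (chrom, start, end) peaks of the compared species by chromosome."""
    index = defaultdict(list)
    for chrom, start, end in peaks:
        index[chrom].append((start, end))
    return index


def overlaps_any(index, chrom, start, end):
    """True if the half-open interval [start, end) overlaps any indexed peak.

    A linear scan is used for clarity; a real pipeline would use an
    interval tree or a tool such as bedtools.
    """
    return any(s < end and start < e for s, e in index.get(chrom, []))


def classify_regions(lifted, target_peaks):
    """Label every region by its conservation status.

    lifted: dict mapping region id -> (chrom, start, end) in the compared
            genome, or None if the region could not be remapped (assumed
            to be the outcome of a prior liftOver run).
    target_peaks: ATAC-seq peaks called in the compared species.
    """
    index = build_index(target_peaks)
    labels = {}
    for region_id, interval in lifted.items():
        if interval is None:
            labels[region_id] = "not alignable"
        elif overlaps_any(index, *interval):
            labels[region_id] = "conserved accessible"
        else:
            labels[region_id] = "alignable"
    return labels


# Toy example with made-up coordinates.
lifted = {"r1": ("chr1", 100, 600), "r2": ("chr1", 5000, 5500), "r3": None}
target_peaks = [("chr1", 550, 900), ("chr2", 5000, 5500)]
labels = classify_regions(lifted, target_peaks)
```

In this toy example, `r1` overlaps a peak in the compared species, `r2` remaps but lands in closed chromatin, and `r3` fails to remap.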
Next, we examined whether the MEL and MES melanoma states are conserved in the other species of our cohort. Clustering all mammalian samples based on the accessibility of the 303,392 alignable regions (Fig. S1K), or of all samples (including zebrafish) using the 29,619 alignable regions (Fig. 1D), revealed two axes of variation between the samples, namely (i) the evolutionary variation between species and (ii) the distinction between the melanoma states. All human MEL samples clustered together with 9 of the 10 non-human lines, indicating that most of the non-human cell lines are epigenomically similar to the human MEL lines. On the other hand, the dog cell line Dog-IrisMel-14205 clustered together with the human MES samples, indicating that Dog-IrisMel-14205 belongs to the MES state. This classification of melanoma samples was reflected in their accessibility at known MEL and MES regulatory regions such as the intronic enhancer of MLANA, a MEL-specific gene involved in melanosome biogenesis (De Mazière et al. 2002), and an enhancer upstream of MMP3, a gene that increases metastatic potential in melanoma cell lines (Shoshan et al. 2016) (Fig. 1E). Note that classifying the cross-species samples based on a principal component analysis (PCA) of only the conserved accessible regions (i.e. without species-specific or clade-specific peaks) clearly revealed the MEL-MES distinction, whereas the species variation was less pronounced (Fig. S1L,M).
In conclusion, by using ATAC-seq on a panel of 26 melanoma lines across six species, conserved accessible regulatory regions could be identified. These regions allowed clustering of the melanoma samples into two groups which correspond to the two main melanoma cell states, indicating conservation of the MES melanoma state in dog and the MEL melanoma state in pig, mouse, horse, dog and even zebrafish melanoma samples.
Figure 1. Comparative epigenomics reveals conservation of two main melanoma states. (A) Evolutionary relationship between the six studied species, represented by a phylogenetic tree (NCBI taxonomy tree). ATAC-seq profiles of the 26 melanoma cell lines are shown for three regulatory regions. (B) ATAC-seq profiles of the human melanoma lines for the SOX10 locus. Lines are coloured by the melanocytic (MEL, in blue) or mesenchymal-like (MES, in orange) melanoma state. (C) Total number of ATAC-seq regions observed across all samples of a species are coloured based on whether they are not alignable, alignable or conserved accessible in human. (D) PCA clustering based on the accessibility of the 29,619 alignable regions across all six species. (E) ATAC-seq profiles of MEL and MES lines of different species for an intronic MLANA enhancer and the upstream region of MMP3.
Conservation of transcription factor motifs in state-specific enhancers
Next, we investigated whether TF binding motifs that are specific to the MEL and MES states are conserved across species. To this end, we performed differential motif enrichment between MEL and MES accessible regions for human and dog, as these were the two species in our cohort for which cell lines of both states were identified above. Differential peak calling (log2FC > 2.5 and pAdj < 0.0005), followed by motif enrichment using HOMER (Heinz et al. 2010), revealed a highly similar enrichment of SOX, TFAP2 family, E-box, RUNX and ETS TF binding motifs in both the human and dog MEL-specific peaks (Fig. 2A,B) (complete HOMER output in Supplementary Table 1). The enriched motifs of the TFAP2 family can most likely be linked to TFAP2A, because this is a master regulator in human melanocytes and melanoma (Seberg et al. 2017). Similarly, the observed E-box and SOX motifs most likely represent MITF and SOX10, respectively, as they are among the previously reported master regulators in human MEL lines (Bravo González-Blas et al. 2019; Hoek et al. 2006; Verfaillie et al. 2015; Wouters et al. 2019). Likewise, motif enrichment in the MES regions is very similar between human and dog, revealing AP-1 and TEAD motifs as most highly enriched (Fig. 2A,B), corroborating earlier findings (Verfaillie et al. 2015). Together, these observations indicate that the MEL and MES melanoma cell states are conserved in dog and that they are likely governed by the same master regulators, based on the concordance of motif enrichment.
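As a small illustration of the thresholds quoted above, the sketch below splits a table of differential accessibility results into MEL- and MES-specific peak sets. It assumes that a positive log2 fold change means higher accessibility in MEL than in MES and that the same cut-off is applied symmetrically in the MES direction; both conventions, the function name and the toy values are assumptions made for this example.

```python
def state_specific_peaks(results, min_log2fc=2.5, max_padj=0.0005):
    """Split differential peaks into MEL- and MES-specific sets.

    results: iterable of (peak_id, log2fc, padj) tuples, where log2fc is
             assumed to be log2(MEL / MES).
    """
    mel, mes = [], []
    for peak_id, log2fc, padj in results:
        if padj >= max_padj:
            continue  # not significant at the chosen adjusted p-value
        if log2fc > min_log2fc:
            mel.append(peak_id)
        elif log2fc < -min_log2fc:
            mes.append(peak_id)
    return mel, mes


# Toy input with made-up statistics.
results = [
    ("p1", 3.1, 1e-6),   # strongly MEL-specific
    ("p2", -4.0, 1e-5),  # strongly MES-specific
    ("p3", 3.0, 0.01),   # large effect but not significant
    ("p4", 1.0, 1e-9),   # significant but small effect
]
mel_peaks, mes_peaks = state_specific_peaks(results)
```

The two resulting peak sets would then serve as target sets for a motif enrichment tool.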
To further verify the importance of the MEL-specific master regulators in MEL cell lines of the remaining four species, we applied a different strategy since we could not contrast MEL and MES lines for horse, pig, mouse and zebrafish. We analysed 9,732 regions that are conserved accessible across all mammalian MEL lines to identify conserved TF binding sites. We scanned these regions using the cisTarget motif collection (v8) (Janky et al. 2014; Imrichová et al. 2015; Herrmann et al. 2012) containing 20,003 TF position-weight matrices (PWMs) and used a branch length score (BLS) to calculate the level of evolutionary conservation of each TF binding motif (Fig. 2C), a strategy applied before in other systems (Jacobs et al. 2018; Stark et al. 2007). Among the 4% most conserved motifs were SP1, ETS, SOX, CTCF, MITF and TFAP2A motifs (Fig. 2D). The top conserved motifs were members of the SP/KLF TF family, which bind to GC-rich motifs in promoters (Dynan and Tjian 1983). Indeed, 47% of the 9,732 conserved accessible regions in mammalian MEL lines are proximal promoters (≤ 1 kb from the TSS). BLS scoring on the remaining 5,196 more distal conserved accessible regions revealed similar highly conserved motifs, except for SP/KLF TF family motifs, indicating that distal regions, such as enhancers, mostly contain the state-specific TF binding motifs. In the 113 regions that are conserved accessible across the MEL cell lines of all six species, BLS scoring again revealed SOX, ETS, MITF and TFAP2A motifs among the most conserved motifs (Fig. 2E).
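The idea behind a branch length score can be sketched as follows. The example assumes the common definition in which the BLS of a motif hit is the branch length of the minimal subtree connecting all species that carry the motif, divided by the total branch length of the tree. The authors' exact scoring and normalisation are described in their Methods and are not reproduced here; the tree topology and branch lengths below are invented for illustration.

```python
def branch_length_score(tree, species_with_motif):
    """Fraction of total branch length spanned by motif-carrying species.

    tree: dict mapping each non-root node to (parent, branch_length).
          Leaves are species; the root does not appear as a key.
    species_with_motif: leaves in which the motif hit is present.
    """
    present = set(species_with_motif)

    # For every edge (identified by its child node), count how many
    # motif-carrying leaves lie below it.
    below = {node: 0 for node in tree}
    for leaf in present:
        node = leaf
        while node in tree:
            below[node] += 1
            node = tree[node][0]

    total = sum(length for _, length in tree.values())

    # An edge is part of the minimal connecting subtree when some, but
    # not all, motif-carrying leaves lie below it.
    spanned = sum(
        length
        for node, (_, length) in tree.items()
        if 0 < below[node] < len(present)
    )
    return spanned / total


# Hypothetical four-species tree with made-up branch lengths.
toy_tree = {
    "human": ("anc1", 2),
    "mouse": ("anc1", 3),
    "anc1": ("root", 1),
    "dog": ("anc2", 2),
    "pig": ("anc2", 2),
    "anc2": ("root", 1),
}

bls_close = branch_length_score(toy_tree, {"human", "mouse"})
bls_far = branch_length_score(toy_tree, {"human", "dog"})
bls_all = branch_length_score(toy_tree, {"human", "mouse", "dog", "pig"})
```

A motif found only in the reference species scores 0, and a motif found in every species scores 1, so summing the score over a set of regions rewards motifs that are retained across long evolutionary distances.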
In conclusion, two independent strategies of motif analysis suggest conservation of TF binding sites for known melanoma master regulators, with conserved SOX10, MITF, TFAP2A and ETS TF family motif enrichment in MEL enhancers across all six studied species.
Figure 2. Conservation of binding motifs of master regulators of MEL and MES melanoma states. (A, B) Heatmap of differential ATAC-seq regions when comparing (A) human MEL versus human MES lines and (B) the MEL dog line ‘Dog-OralMel-18249’ versus the MES dog line ‘Dog-IrisMel-14205’ (two biological replicates each), coloured by normalised ATAC-seq signal. Enriched TF binding motifs in the differential peaks were identified via HOMER (Heinz et al. 2010) and the first logo of enriched TF families is shown. The ratio of the percentage of target and background sequences with the motif is indicated between brackets, as well as the rank of the TF class within the HOMER output (#). (C) Schematic overview of cross-species motif analysis using the branch length score (BLS) as a measure for the evolutionary conservation of a motif hit across conserved accessible regions. The BLS was summed across a set of conserved accessible regions. (D, E) Histogram of the normalised summed BLS score for 20,003 motifs on (D) 9,732 conserved accessible regions across the mammalian MEL lines and on (E) 113 conserved accessible regions across MEL lines of all six species. The first hit of the top recurrent TF binding motifs within the top 4% conserved motifs is indicated as a cross and is accompanied by the logo of the motif.
Deep neural network DeepMEL reveals nucleotide-resolution enhancer logic
While motif enrichment can predict candidate regulators, we sought to build a more comprehensive model of the MEL enhancers that would allow cross-species predictions and in-depth analysis of enhancer architecture. To this end, we trained a deep learning (DL) model on the human ATAC-seq data. First, to construct an unsupervised training set, we clustered all 339,099 human ATAC-seq peaks into 24 ‘topics’ or sets of co-accessible regions using cisTopic, a probabilistic framework to analyse scATAC-seq data that can also be applied to bootstrapped bulk ATAC-seq data (Bravo González-Blas et al. 2019) (see Methods) (Fig. 3A, Fig. S2A,B). This provided a nuanced classification, with topic 4 and topic 7 representing the MEL- and MES-specific enhancers, respectively, being accessible across all MEL or MES samples (Fig. 3A, Fig. S2C). In addition, we found two topics with regions that are generally accessible across all cell lines (topic 1 and topic 19) (Fig. 3A, S2C). These ubiquitously accessible regions are highly enriched for proximal promoters (Fig. S2D) and for known promoter-specific TF binding motifs linked to SP and NFY TF families (Fig. S2C) (Dynan and Tjian 1983; Maity and de Crombrugghe 1998). Other topics were more specific to one or a small group of cell lines (Fig. 3A). We verified the biological relevance of these topics by Gene Ontology (GO) enrichment of flanking genes using GREAT (McLean et al. 2010). Genes near topic 4 regions are significantly enriched for GO terms such as pigmentation (FDR=1.95 × 10⁻⁸) and neural crest cell differentiation (FDR=4.26 × 10⁻⁷), whereas genes near topic 7 regions are enriched for GO terms involved in cell-cell adhesion (1.56 × 10⁻¹³). Motif discovery on the top regions assigned to each topic confirmed enrichment of SOX, ETS, TFAP2A and MITF motifs in the MEL topic regions (topic 4) and AP-1 in the MES topic (topic 7) (Fig. S2C). An example topic 4 region in the promoter of the SOX10 target gene MIA (Graf et al. 2014) is shown in Figure 3B, as well as two topic 7 regions upstream of SERPINE1, a gene expressed in metastatic melanoma (Klein et al. 2012).
Using the 24 topics as classes, we trained a multi-class, multi-label classifier using a neural network, called “DeepMEL” (Fig. 3C). As input, we used the forward and reverse complement of 500 bp sequences centered on the ATAC-seq summit. As topology, we used the DanQ CNN-RNN hybrid architecture (Quang and Xie 2016) consisting of 4 main layers: a convolution layer to discover local patterns in sequential data, followed by a max-pooling layer to reduce the dimensionality of the data and generalise the model effectively, a bidirectional recurrent layer (LSTM) to detect long-range dependencies of the local patterns discovered in the first layer, and finally a fully-connected (dense) layer just before the output layer to help the classification after the feature extraction layers (Fig. 3C). Note that several hyperparameters, including the number and size of the convolutional filters and the length of the input DNA sequence, were optimised to yield the final model (Fig. S3; Supplementary Note 1). After successful training of DeepMEL (area under the receiver operating characteristic curve (auROC) = 0.863 and area under the precision recall curve (auPR) = 0.374 on test data for topic 4 regions) (Fig. 3D,E; Fig. S4A), we used the weights of the neurons from the convolutional filters to extract local patterns learned by the model. We transformed these convolution filters into PWMs and determined the importance of each filter for each topic (see Methods). Filters that represent SOX, MITF, TFAP2A, and RUNX motifs were most relevant for the MEL-specific topic 4 and filters that represent AP-1, TEAD and RUNX binding sites were assigned to the MES-specific topic 7 (Fig. 3F). Thus, DeepMEL learned the relevant features de novo from the sequence. Note that the 3,885 regions classified as MEL-specific in MM001 (topic 4 scores above threshold of 0.16 (see Methods)) were not only highly accessible in MEL lines and closed in MES lines (Fig. S4B), but were also accessible in human melanocytes (Fig. S4C), indicating that MEL-specific melanoma regions are not cancer-specific but already accessible in their cell-of-origin, i.e. the melanocytes. As a consequence, we can potentially extrapolate the observations on this topic to normal melanocyte enhancers. Although in the remainder of this work we will score accessible regions to identify functional enhancers, it is also possible to score the entire genome, without filtering for ATAC-seq peaks (Fig. S4D).
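The input representation described above (a 500 bp window centred on the peak summit, presented as both the forward strand and its reverse complement) can be sketched in plain Python. The sketch covers only the sequence encoding, not the network itself; the function names, the 0-based summit convention and the all-zero encoding of ambiguous bases are assumptions for this example rather than details confirmed by the text.

```python
ALPHABET = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def summit_window(chrom_seq, summit, width=500):
    """Return the width-bp sequence centred on a 0-based summit position."""
    start = summit - width // 2
    if start < 0 or start + width > len(chrom_seq):
        raise ValueError("window extends past the sequence ends")
    return chrom_seq[start:start + width]


def reverse_complement(seq):
    """Reverse complement; characters other than A, C, G, T are kept as-is."""
    return seq.translate(COMPLEMENT)[::-1]


def one_hot(seq):
    """Encode a sequence as rows of four 0/1 values in A, C, G, T order.

    Ambiguous bases such as N become an all-zero row (an assumption).
    """
    return [[1 if base == letter else 0 for letter in ALPHABET] for base in seq]


def encode_region(chrom_seq, summit, width=500):
    """Return one-hot matrices for the forward and reverse-complement strands."""
    forward = summit_window(chrom_seq, summit, width).upper()
    return one_hot(forward), one_hot(reverse_complement(forward))


# Toy sequence: 1,000 bp made of a repeated 5-mer.
toy_chromosome = "ACGTN" * 200
fwd, rev = encode_region(toy_chromosome, summit=500)
```

Each strand yields a 500 × 4 matrix, which is the shape a one-dimensional convolution layer over DNA typically expects.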
To examine the TF binding site architecture within enhancers, we used a model interpretation tool, DeepExplainer (Lundberg and Lee 2017; Lundberg et al. 2020; Avsec et al. 2019b). For a MEL enhancer located on the 4th intron of IRF4, nucleotides important for classifying this enhancer as topic 4 emerge as motifs for SOX10, MITF, TFAP2A and RUNX factors (Fig. 3G top two rows; see Fig. S4E,F for another example).
It is known that enhancer accessibility does not directly translate to enhancer activity (Shlyueva et al. 2014). To test whether the same TF binding motifs contribute to the activity of MEL enhancers, we used the IRF4 enhancer as case study. For this enhancer, Kircher et al. performed saturation mutagenesis followed by an in vitro massively parallel reporter assay (MPRA), testing the effect of every possible single nucleotide mutation on enhancer activity (Fig. 3G, third row). The most deleterious mutations coincided with the DeepMEL-predicted SOX, E-box and RUNX-like motifs, overlapping with nucleotides that also have the strongest in silico effect (Fig. 3G, last row), indicating that the predicted motifs are actually contributing to enhancer activity. In addition, the magnitude of the in silico predicted effect correlates highly with the effect of the in vitro mutations (Spearman’s correlation of 0.60) (Fig. 3G,H). These observations indicate that, although DeepMEL was trained to predict binary enhancer accessibility, it is also a good predictor of enhancer activity of this specific enhancer. DeepMEL predictions outperform other classifiers and deep learning models that were benchmarked in Kircher et al. (CAGI challenge, 2018) (Fig. 3I). One possible explanation for this improvement is that DeepMEL uses more nuanced topics (Fig. 3I, black bar) rather than the ATAC-seq signal of the different MM lines as labels (Fig. 3I, white bar). Note that enhancer accessibility and activity can be influenced not only by mutations that break a motif for an activating TF, but also by the creation of a repressor binding motif, as was for instance the case for the SNP rs12203592 (Fig. S3G; Fig. S4G).
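The comparison between in silico and in vitro mutational effects can be illustrated with a short sketch of saturation mutagenesis followed by a Spearman rank correlation. The scoring function below is a toy stand-in that counts occurrences of an E-box-like hexamer; it is not DeepMEL, and in practice the trained model's topic 4 prediction would take its place. All names and the toy sequence are assumptions for this example.

```python
def saturation_mutagenesis(seq, score):
    """Score every single-nucleotide substitution relative to the wild type.

    Returns a dict mapping (position, alternative base) to
    score(mutant) - score(wild type).
    """
    ref_score = score(seq)
    effects = {}
    for pos, ref in enumerate(seq):
        for alt in "ACGT":
            if alt != ref:
                mutant = seq[:pos] + alt + seq[pos + 1:]
                effects[(pos, alt)] = score(mutant) - ref_score
    return effects


def average_ranks(values):
    """1-based ranks in which tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def toy_score(seq):
    """Hypothetical stand-in for a model prediction: count E-box hexamers."""
    return seq.count("CACGTG")


effects = saturation_mutagenesis("TTCACGTGTT", toy_score)
```

Given measured effects for the same (position, base) keys, `spearman` applied to the two aligned lists of values would give a correlation of the kind reported above.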
In conclusion, DeepMEL, trained on topics of human co-accessible regions, performs well in classifying melanoma regulatory regions into different classes based purely on the DNA sequence. Features learned by DeepMEL correspond to TF binding motifs of master regulators of specific classes. These motifs can also be located and visualised within regions using a model interpretation tool, allowing examination of the motif architecture within specific enhancers and predicting the effect of mutations on enhancer accessibility.
Figure 3. DeepMEL classifies melanoma enhancers and predicts important TF binding motifs. (A) Cell-topic
316
heatmap of cisTopic applied to 339,099 ATAC-seq regions across the 16 human melanoma lines, coloured by
317
normalised topic scores. ‘029*’ refers to ‘MM029_R2’. (B) Example regions of a MEL-specific (topic 4) region
318
near MIA and MES-specific (topic 7) regions upstream of SERPINE1. (C) Schematic overview of DeepMEL. 24
319
topics or sets of co-accessible regions were used as input for training of a multi-class multi-label neural network.
320
(D, E) (D) Receiver operating characteristic curve and (E) precision-recall curve for DeepMEL on training, test
321
and shuffled data of topic 4 and topic 7 regions. (F) Top enriched filters learned by DeepMEL to classify regions
322
as MEL (topic 4) or MES (topic 7). Normalised filter importance is shown per filter. (G) Example of a MEL-
323
predicted enhancer near IRF4. (first and second row) DeepExplainer view of the forward and reverse strand, with
324
the height of the nucleotides indicating the importance for prediction of the MEL enhancer. (third row) In vitro
325
11
effect of point mutations on enhancer activity as measured by MPRA (Kircher et al. 2019). Colours represent the
326
nucleotide to which the wild type nucleotide is mutated. (bottom row) In silico effect of point mutations as
327
predicted by DeepMEL. (H) Correlation between the in vitro mutational effects on the IRF4 enhancer and the in
328
silico mutagenesis predictions. (I) Performance of variant effect prediction of DeepMEL using topics (black bar,
329
model used in this paper) or using ATAC-seq signal (white bar), and several previously tested models on the IRF4
330
enhancer case (Kircher et al. 2019).
Cross-species scoring identifies orthologous melanoma enhancers

Next, we asked whether the human-trained model DeepMEL can be used to predict MEL and MES enhancers in other species. We started with the dog genome as a test case, because the differential ATAC-seq peaks between the MEL (Dog-OralMel-18249) and MES (Dog-IrisMel-14205) dog cell lines can serve as true positives (Fig. 4A). DeepMEL reached similar performance in human and dog for predicting MEL and MES regions, and this accuracy is significantly higher than that of cis-regulatory module (CRM) scoring with PWMs (Fig. 4A). Having confirmed that the human model can identify enhancers in the dog genome, we predicted MEL and MES enhancers across all six species. This furthermore allowed us to order all samples along the MEL-MES axis (Fig. 4B). Between 2,093 and 5,400 MEL enhancers and between 7,459 and 10,743 MES enhancers were predicted in samples of the MEL and MES state, respectively (Fig. 4B). The majority of these enhancers could not have been detected using whole-genome alignments (liftOver) (Fig. S5A-E). Of note, predicted MEL enhancers in the pig melanoma cells (MeLiM) were similarly accessible in pig melanocytes (Fig. S5F), again indicating that MEL melanoma enhancers can be used as a model for melanocyte enhancers.
Next, we compared the occurrence of MEL enhancers between species, in relation to putative target genes. In particular, we looked at enhancers located near a set of 379 human genes that are specifically expressed in the MEL state (see Methods). Of these 379 genes, 217 (67%) had at least one MEL-predicted enhancer within 200 kb up- or downstream of the gene. Between 70% and 85% of the orthologous MEL genes in other species had at least one MEL enhancer within 200 kb up- or downstream of the gene (Fig. S5G). Note that only a small subset of these enhancers could have been found using liftOver (2-43% depending on the species). Of these genes, 32 form a core set of conserved MEL-specific genes throughout all species including zebrafish, each having a MEL enhancer nearby. Examples of genes in the core set are MITF, PMEL and TYRP1, genes known to be involved in melanocyte development, melanosome formation and melanin production (D’Mello et al. 2016).
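The gene-proximity criterion used above (at least one predicted enhancer within 200 kb up- or downstream of a gene) can be illustrated with a minimal sketch. The `Interval` type, the function names and the toy coordinates are hypothetical and only serve the illustration; the sketch assumes 0-based half-open coordinates and measures the flank from the gene body, which may differ from the procedure described in the Methods.

```python
from dataclasses import dataclass

WINDOW_BP = 200_000  # 200 kb flank on each side of the gene, as stated in the text


@dataclass(frozen=True)
class Interval:
    chrom: str
    start: int  # 0-based, inclusive (assumed convention)
    end: int    # exclusive


def is_near(enhancer, gene, window=WINDOW_BP):
    """True if the enhancer overlaps the gene body extended by `window` bp on both sides."""
    if enhancer.chrom != gene.chrom:
        return False
    lo = max(0, gene.start - window)
    hi = gene.end + window
    return enhancer.start < hi and enhancer.end > lo


def fraction_with_enhancer(genes, enhancers, window=WINDOW_BP):
    """Fraction of genes that have at least one enhancer inside their window."""
    if not genes:
        raise ValueError("gene list is empty")
    hits = sum(1 for g in genes if any(is_near(e, g, window) for e in enhancers))
    return hits / len(genes)


# Toy data: the chr1 enhancer lies 150 kb downstream of its gene (inside the window),
# the chr2 enhancer lies 380 kb downstream of its gene (outside the window).
genes = [Interval("chr1", 1_000_000, 1_050_000), Interval("chr2", 500_000, 520_000)]
enhancers = [Interval("chr1", 1_200_000, 1_200_500), Interval("chr2", 900_000, 900_500)]
print(fraction_with_enhancer(genes, enhancers))
```

For genome-scale data an interval tree or a sorted sweep would replace the quadratic loop; the nested `any` is kept here for readability.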
A long-standing question in enhancer studies is how to compare enhancers with each other if their sequences do not align (Arunachalam et al. 2010; Cliften et al. 2001). Here we tackle this question by using the dense layer of DeepMEL as a reduced dimensional space in which to calculate the correlation between enhancers. Using this measure, we found that MEL-predicted enhancers in proximity of orthologous MEL genes are significantly more similar to each other than MEL-predicted enhancers in proximity of different MEL genes within the same species (Fig. 4C), redundant (or shadow (Hong et al. 2008)) enhancers linked to the same MEL gene in a species, or random non-MEL ATAC-seq peaks near homologous MEL genes (Fig. S5H). Altogether, this supports the idea that MEL enhancers near orthologous genes are indeed orthologous enhancers.
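The similarity measure described above, a Pearson correlation between the dense-layer representations of two enhancers, can be sketched as follows. The function name and the toy vectors are hypothetical; how the dense-layer activations are extracted from the trained network is not shown and is assumed to yield one fixed-length numeric vector per enhancer.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length numeric vectors."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_a * var_b)


# Toy dense-layer vectors (made up): two enhancers with a similar activation
# pattern and one with an unrelated pattern.
human_enhancer = [0.9, 0.1, 0.8, 0.0, 0.7]
fish_enhancer = [0.8, 0.2, 0.9, 0.1, 0.6]
unrelated_peak = [0.1, 0.9, 0.0, 0.8, 0.2]

print(pearson(human_enhancer, fish_enhancer))
print(pearson(human_enhancer, unrelated_peak))
```

Because the comparison is made in the learned feature space rather than on nucleotides, two sequences that do not align can still score as similar when they activate the same dense-layer units.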
Lastly, we studied one MEL enhancer in more detail, namely the enhancer near ERBB3. DeepMEL predicts a MEL enhancer upstream of or intronic to ERBB3 in each of the mammalian species, and these enhancers were also found by liftOver of the human ERBB3 enhancer (Fig. 4D II). However, in the zebrafish genome, liftOver was unable to identify the homologous region, whereas DeepMEL predicted two MEL enhancers, one upstream of the TSS of erbb3b and another in the first intron. Both zebrafish enhancers were highly correlated with the human ERBB3 enhancer (deep layer Pearson’s correlation of 0.812 and 0.797 for the upstream and intronic zebrafish enhancer, respectively), suggesting that both enhancers are orthologous to the human ERBB3 enhancer. Applying DeepExplainer to the multiple-aligned sequences revealed a conserved motif architecture in the orthologous mammalian ERBB3 enhancers, each containing three SOX motifs and one TFAP2A motif (Fig. 4D III). Note that in mouse, one SOX binding site was lost; mouse is also the mammalian species that is most distant from human among the mammals included in this study (Fig. 4D I). The two zebrafish enhancers have a highly similar motif architecture, suggesting that they arose by duplication from a common ancestor enhancer.
In conclusion, we showed that DeepMEL is able to identify MEL- and MES-specific enhancers in different species, which allows studying evolutionary events and enhancer logic within orthologous enhancers, even in distant species such as zebrafish.
Figure 4. Human-trained deep learning model applied to cross-species ATAC-seq data. (A) Performance of DeepMEL and Cluster-Buster (cbust) in classifying MEL and MES differential peaks in human and dog. (B) Percentage of MEL and MES predicted ATAC-seq regions across all samples in our cohort and in human melanocytes. Samples are ordered according to the ratio of the number of MES / MEL predicted regions. (C) Pearson’s correlation of deep layer scores between MEL-predicted regions near orthologous MEL genes between human and another species (‘Human-Species’) or between MEL-predicted regions near different MEL genes within one species (‘Species-Species’). P-values of unpaired two-sample Wilcoxon tests are reported. (D) (I) Evolutionary distance between human and other species in branch length units. (II) ATAC-seq profiles of the ERBB3 locus in the six species. MEL-specific enhancers that were predicted by DeepMEL and that were also found (grey) or not found (green) via liftOver of the human MEL enhancer are highlighted. (III) DeepExplainer plots for the multiple-aligned MEL-predicted ERBB3 enhancers. Red and blue dots represent point mutations and indels, respectively.
Motif architecture of the MEL enhancer

To study the architecture of MEL enhancers in more detail, including motif composition, motif order and distance, and relationships to the position of nucleosomes, we set out to obtain high-confidence motif annotations in each of the 3,885 MEL enhancers in human (MM001, the most MEL-like human cell line), for each of the predicted core regulatory factors (SOX10, MITF, TFAP2A, RUNX). To achieve this, we devised an optimised motif scoring method that obtains precise positions of TF binding motifs by multiplying DeepMEL activation scores of convolutional filters (i.e. motifs) with the DeepExplainer profile of each enhancer (Fig. 5A) (see Methods) (Shrikumar et al. 2019).
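The idea of weighting filter activations by nucleotide importance can be sketched in a few lines. This is an illustration under stated assumptions, not the scoring scheme from the Methods: it assumes one activation value per filter position, one importance value per nucleotide, a plain window sum of importance under the filter, and an arbitrary threshold. All function names and numbers are hypothetical.

```python
def window_sums(values, width):
    """Sum of `values` over every window of `width` consecutive positions."""
    return [sum(values[i:i + width]) for i in range(len(values) - width + 1)]


def combined_motif_scores(activation, importance, filter_width):
    """Multiply each filter activation by the summed importance of the
    nucleotides that the filter covers at that position."""
    covered = window_sums(importance, filter_width)
    if len(activation) != len(covered):
        raise ValueError("expected len(activation) == len(importance) - filter_width + 1")
    return [a * w for a, w in zip(activation, covered)]


def call_hits(scores, threshold):
    """Positions whose combined score exceeds the (assumed) threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]


# Toy 8-bp region and a 3-bp filter. The filter fires at positions 2 and 5,
# but only position 2 overlaps nucleotides that matter for the prediction.
importance = [0, 0, 1, 1, 1, 0, 0, 0]
activation = [0, 0, 2, 0, 0, 1]
scores = combined_motif_scores(activation, importance, filter_width=3)
print(scores)
print(call_hits(scores, threshold=1))
```

The toy case shows why the product is useful: a filter match in a stretch with no importance is suppressed, so only matches that contribute to the prediction are called as hits.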
The first observation was that each MEL enhancer contains at least one SOX10 motif hit, and often two or more (Fig. 5B). This suggests that SOX10 plays a central role in MEL enhancer accessibility. Indeed, knock-down (KD) of SOX10 in MM001 significantly decreases the accessibility of MEL enhancers (Fig. S6A), and the regions that close after SOX10-KD are highly enriched for SOX motifs (NES = 28.5), possibly revealing a pioneering role of SOX10 in MEL enhancers. Next to SOX motifs, a combination of one or multiple TFAP2A, MITF or RUNX-like motif hits was present in 84% of the MEL-predicted enhancers (Fig. 5B). Next, to facilitate a systematic study of the MEL enhancer logic, we binarised the motif-region matrix to simplify the region clustering (Fig. 5C). We obtained 8 different enhancer classes, each with a different motif composition (Fig. 5C). As validation of the clusters and the predicted TF binding sites, we used human ChIP-seq data of SOX10, MITF and TFAP2A in melanoma or melanocytes (Laurette et al. 2015; Seberg et al. 2017) (Fig. 5D). All clusters were indeed highly bound by SOX10, validating the prevalence of the SOX10 motif in MEL enhancers. In contrast, MITF and TFAP2A ChIP-seq data revealed that MITF and TFAP2A bind more to enhancers with MITF and TFAP2A sites, respectively, than to regions without a predicted MITF or TFAP2A site. Note that these observations indicate that the MEL enhancer architecture does not entail indirect DNA binding of the core regulatory factors, since MITF and TFAP2A are only bound when their motifs are present within the enhancer. We further observed that regions containing a TFAP2A site, next to the SOX10 site(s) and possible others, showed a modest increase in accessibility (Fig. S6B), which could be in line with the previously described role of TFAP2A as a stabiliser of nucleosome-depleted regions (Grossman et al. 2018). The opposite was true for regions containing RUNX-like binding sites (Fig. S6B), suggesting a repressive role of RUNX factors. The presence of a MITF site did not seem to alter the accessibility of enhancers compared to SOX-only enhancers, but did increase H3K27ac signal (Fig. S6C), possibly indicating that MEL enhancers bound by MITF are more active.
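The binarisation step described above can be sketched as follows. Since every MEL enhancer carries at least one SOX10 hit, the presence or absence of the three remaining factors gives 2^3 = 8 possible motif compositions, which matches the number of classes reported. The grouping below simply keys regions on their binary pattern; it is an assumption for illustration and not necessarily the clustering procedure used in the study. Region names and hit counts are made up.

```python
from collections import Counter
from itertools import product

FACTORS = ("SOX10", "TFAP2A", "MITF", "RUNX")


def motif_class(hit_counts):
    """Binarise motif hit counts: keep only which factors have at least one hit."""
    return tuple(f for f in FACTORS if hit_counts.get(f, 0) > 0)


def count_classes(regions):
    """Number of regions per binary motif composition."""
    return Counter(motif_class(counts) for counts in regions.values())


def possible_classes_with_sox10():
    """All compositions in which SOX10 is present and each other factor is on or off."""
    others = FACTORS[1:]
    return [
        ("SOX10",) + tuple(f for f, on in zip(others, flags) if on)
        for flags in product((False, True), repeat=len(others))
    ]


# Toy motif-region matrix (hypothetical regions and counts).
regions = {
    "region_1": {"SOX10": 2, "MITF": 1},
    "region_2": {"SOX10": 1, "MITF": 3},
    "region_3": {"SOX10": 1, "TFAP2A": 1, "RUNX": 1},
}
print(count_classes(regions))
print(len(possible_classes_with_sox10()))
```

Binarising discards the number of hits per factor, so `region_1` and `region_2` fall into the same [SOX10 + MITF] class even though their counts differ.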
To validate these MEL enhancer classes in other species, we applied the same motif scoring and binarisation to DeepMEL-predicted MEL regions in the other species in our cohort. MEL enhancers in other species also clustered into the same 8 clusters, with a similar distribution of regions per cluster (Fig. 5E,F; Fig. S6D). In addition, liftOver of the clusters showed that the regions of a human cluster correspond preferentially to the same cluster in the other species (Fig. S6E), indicating conservation of the MEL enhancer clusters across species. For instance, the dog orthologs of two human MEL enhancers belonging to either the [SOX10 + MITF] cluster (intronic enhancer of CD9) or the [SOX10 + TFAP2A + RUNX] cluster (intronic enhancer of STIM1) (Fig. 5E) were part of the corresponding clusters in dog (Fig. 5F).
Altogether, these data suggest a COre Regulatory Complex (CoRC) (Arendt et al. 2016) of SOX10, TFAP2A, MITF and RUNX factors in regulating melanoma MEL enhancers, encoded by a mixed enhancer model (Long et al. 2016), with high flexibility in the combination of binding sites for these four TFs, but with some rigidity (or hierarchy) in the code, as at least one SOX10 dimer site is required.
Figure 5. COre Regulatory Complex of MEL melanoma enhancers. (A) Schematic overview of the motif scoring method in which extended convolutional filter hits from DeepMEL are multiplied by DeepExplainer profiles to yield significant motif hits. (B,C) Heatmap (B) and binarised heatmap (C) of the number of significant SOX, TFAP2A, MITF and RUNX-like motif hits on the 3,885 MEL-predicted regions in the human cell line MM001. (D) Aggregation plot of normalised ChIP-seq signal of SOX10, MITF and TFAP2A on the human enhancer clusters. (E, F) Venn diagram of region clusters on (E) the 3,885 MEL-predicted regions in human (in MM001) and (F) the 4,194 MEL-predicted regions in dog (in Dog-OralMel-18249). Example MEL-predicted enhancers in human and dog are shown for two of the region clusters. The ATAC-seq signal of the regions is shown in grey.
Putative roles of SOX10 as pioneer and TFAP2A as stabiliser in melanoma MEL enhancers

As previous results suggested a pioneering and a stabilising function for SOX10 and TFAP2A, respectively, we wanted to further investigate these putative roles and how they mechanistically affect chromatin accessibility. First, we analysed the location of binding sites relative to the position of the nucleosome, focusing on a human and a dog MEL enhancer that each contain a combination of one SOX10 and one TFAP2A site (Fig. 6A,B). We predicted the nucleosome start and middle point using a previously published model (Kaplan et al. 2009) and observed that SOX10 binding sites are situated within the borders of the nucleosome, near the nucleosome start point, whereas TFAP2A binding occurs preferentially near the center of the nucleosome (Fig. 6A,B). KD of TFAP2A halved the accessibility of this specific human region, whereas SOX10-KD completely abolished the ATAC-seq peak (Fig. 6A), indicating that SOX10 is necessary for accessibility and that TFAP2A further increases the accessibility, which is in line with our previous observations (Fig. S6A,B).
These example enhancers suggested an interesting positional preference of SOX10 and TFAP2A. To assess whether this occurs globally, we centered human MEL enhancers on the SOX10 and TFAP2A motif hits and calculated the aggregated location of the nucleosome start and middle point (Fig. 6C-E). SOX10 shows a consistent preference for binding within the nucleosome borders, around 40 bp away from the nucleosome start point (Fig. 6D). Other pioneering factors have also been shown to bind near the borders of the nucleosome, for instance FOX factors, which bind around 60 bp from the center of the nucleosome, displacing linker histones and destabilising the central nucleosome (Grossman et al. 2018; Iwafuchi-Doi et al. 2016). On the other hand, when centering the MEL regions on the TFAP2A motif, we did not observe a strong preference in the location of the nucleosome start point relative to the TFAP2A binding site (Fig. 6D); instead, TFAP2A consistently binds in a wide range on and around the nucleosome middle point (Fig. 6E). Stabilisers, such as NFIB, have been reported to directly compete with the central nucleosomes to stabilise the accessible chromatin configuration (Denny et al. 2016; Grossman et al. 2018). Centering on the SOX10 or TFAP2A motif hit revealed protection from Tn5 cutting on important nucleotides of the dimer motif (Fig. S7A,B). We did not observe strong positional preferences of MITF and RUNX motifs relative to the nucleosome start or middle point (Fig. S7C,D).
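The aggregation described above, centering regions on a motif hit and averaging a per-base signal across regions, can be sketched as follows. The function name, the handling of hits near region edges and the toy profiles are assumptions for illustration; the per-base nucleosome predictions themselves are taken as given.

```python
def aggregate_centered(profiles, centers, flank):
    """Average per-base signal in windows of +/- `flank` bp around each center.

    profiles: one per-base signal list per region
    centers:  index of the motif hit within each region
    Regions whose window would run past the region edge are skipped.
    """
    windows = []
    for profile, center in zip(profiles, centers):
        lo, hi = center - flank, center + flank + 1
        if lo < 0 or hi > len(profile):
            continue
        windows.append(profile[lo:hi])
    if not windows:
        raise ValueError("no window fits inside its region")
    return [sum(column) / len(windows) for column in zip(*windows)]


# Toy data: in both regions the signal peaks 2 bp upstream of the motif hit,
# although the hit sits at a different position in each region.
region_a = [0.0] * 11
region_a[3] = 1.0
region_b = [0.0] * 11
region_b[5] = 1.0

flank = 3
aggregate = aggregate_centered([region_a, region_b], centers=[5, 7], flank=flank)
peak_offset = aggregate.index(max(aggregate)) - flank
print(aggregate)
print(peak_offset)
```

A consistent positional preference shows up as a sharp peak at a fixed offset in the aggregate, whereas a factor without such a preference yields a flat or broad profile.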
Altogether, these data suggest that SOX10 functions as a pioneer in the CoRC of MEL enhancers, leading to their accessibility by binding to the central nucleosome, near the nucleosome start point. On the other hand, TFAP2A appears to act as a stabiliser of SOX-dependent nucleosome-depleted regions by binding around the nucleosome middle point, possibly competing with the central nucleosome.
Figure 6. Positional specificity of SOX10 and TFAP2A in MEL melanoma enhancers. (A,B) (first row) Example human (A) and dog (B) MEL-predicted enhancer containing significant SOX10 and TFAP2A motifs. The ATAC-seq signal is shown in grey. (second row) Imputed nucleosome start and middle point profiles. (bottom row) For the human example region, ATAC-seq profiles of MM001 in control condition, after 72 h of SOX10 knock-down or TFAP2A knock-down are shown. (C) Schematic overview of the nucleosome structure explaining the colours used in (D,E). (D,E) Nucleosome start point (D) and nucleosome middle point predictions (E) on MEL-predicted regions containing one SOX10 (left) or one TFAP2A motif (right) next to possible other motifs, where the regions are either centered on the ATAC-seq summit (grey) or on the SOX10 or TFAP2A motif (blue).
DeepMEL predicts evolutionary changes in MEL enhancer accessibility and activity

To further validate our findings on the MEL enhancer logic, we compared motif architectures between species and investigated how turnover of TF binding sites affects enhancer accessibility and function. To this end, we compared pairs of highly probable orthologous MEL enhancers that are only accessible in one of the species (Fig. S8A) (see Methods). For example, an enhancer upstream of APPL2 is predicted as a MEL enhancer in the dog line Dog-OralMel-18249 (topic 4 DL score of 0.35), whereas the orthologous enhancer in human is not accessible (Fig. 7A). Not only the accessibility of the human homolog was lost, but also its activity, as we confirmed by a luciferase assay (Fig. 7B). The topic 4 DeepMEL score for this enhancer was about 6 times lower in human than in dog (0.06 in human versus 0.35 in dog) (Fig. 7C), falling below the topic 4 significance threshold of 0.16, indicating that the model detected critical changes in the human enhancer sequence that could explain the loss of accessibility and activity of this MEL enhancer. The functional dog enhancer contains a SOX10, a MITF and a TFAP2A binding site, which are all affected by substitutions in the non-functional human homologous sequence and might therefore be causal for the loss in accessibility (and activity) (Fig. 7D,E). The SOX10 motif mutation had the strongest effect, as it caused a 45% drop in the MEL-prediction score (Fig. 7D).
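The per-substitution analysis described above can be sketched as follows: each substitution observed between two aligned orthologous sequences is introduced singly into the reference sequence, and the relative change in the model score is recorded. The `score` argument stands for any function mapping a sequence to a prediction score; `toy_score` is a hypothetical stand-in that rewards an arbitrary 6-mer and is not the DeepMEL model. The sketch assumes gap-free, equal-length sequences and ignores indels.

```python
from typing import Callable, List, Tuple


def substitution_effects(
    ref: str, alt: str, score: Callable[[str], float]
) -> List[Tuple[int, str, str, float]]:
    """Relative score change for each single substitution that distinguishes
    `alt` from `ref`, introduced one at a time into `ref`."""
    if len(ref) != len(alt):
        raise ValueError("sequences must be aligned to equal length without gaps")
    base = score(ref)
    if base == 0:
        raise ValueError("relative change is undefined for a reference score of 0")
    effects = []
    for pos, (r, a) in enumerate(zip(ref, alt)):
        if r == a:
            continue
        mutated = ref[:pos] + a + ref[pos + 1:]
        effects.append((pos, r, a, (score(mutated) - base) / base))
    return effects


def toy_score(seq: str) -> float:
    """Hypothetical scorer: a baseline plus a bonus per occurrence of an arbitrary 6-mer."""
    return 0.1 + 0.25 * seq.count("ACAAAG")


# Toy sequences (made up): one substitution breaks the 6-mer, the other does not.
ref_seq = "TTACAAAGTT"
alt_seq = "TTACGAAGTA"
for pos, r, a, change in substitution_effects(ref_seq, alt_seq, toy_score):
    print(pos, f"{r}>{a}", round(change, 3))
```

Ranking substitutions by their relative effect singles out candidate causal changes; comparing the resulting score with a significance threshold (0.16 for topic 4 in the text) then indicates whether the mutated sequence would still be called a MEL enhancer.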
Next, we performed this analysis on a larger scale. Firstly, per species pair, we observed that differences in DeepMEL predictions between species (delta-DeepMEL score) are highly predictive of differences in accessibility (Spearman’s correlation of 0.43, Fig. S8B,C). Among the four studied regulators, it was mostly the disruption or gain of one or more SOX10 binding sites between orthologous enhancers that quantitatively altered the ATAC-seq signal in a concordant way (Fig. 7F, Fig. S8D), indicating that SOX10 mutations are most causal for changes in MEL enhancer accessibility, and possibly also in enhancer activity, as was the case in the APPL2 enhancer above. However, concordance between accessibility and activity was not always observed (Fig. S9). Furthermore, luciferase assays of six human or dog MEL-predicted enhancers suggested that enhancers with at least one MITF motif (n = 3) are significantly more active than enhancers without any MITF motif (n = 3) (Fig. 7G). Although the number of tested enhancers is small, this trend, together with the fact that MEL enhancers containing a MITF binding site showed increased H3K27ac signal (Fig. S6C), indicates that MITF could function as an activator in MEL enhancers. Indeed, MITF has been shown to activate genes involved in pigmentation by recruitment of co-factors and chromatin remodelling complexes (Kawakami and Fisher 2017) and was previously classified as a TF involved in co-factor recruitment and activation (Grossman et al. 2018). Note that SOX10 binding is insufficient but appears necessary for enhancer activity, as mutations in SOX10 binding sites disrupt enhancer activity in the IRF4 case study (Fig. 3G).
In conclusion, DeepMEL provides a suitable platform to study the effect of evolutionary mutations on MEL enhancer accessibility and, in some cases, activity across species. Together, these results validate that SOX10 is crucial for enhancer accessibility in MEL enhancers, and necessary but insufficient for MEL enhancer activity, as activity appears to be mainly dependent on MITF binding.
Figure 7. Predicting causal mutations of evolutionary changes in MEL enhancers. (A,B) Example region upstream of APPL2 that is accessible (A) and active (B) in the MEL dog line Dog-OralMel-18249 but not in human MEL lines. (C) DeepMEL prediction score of each of the 24 topics for the dog and human APPL2 enhancer. (D) Effect on the topic 4 DeepMEL score of the dog sequence when simulating in silico each of the single point mutations detected between the dog and human APPL2 enhancer. (E) DeepExplainer plots of the middle 120 bp of the dog and human APPL2 enhancer. In the middle, the effect of each possible point mutation between the dog and human sequence on the MEL DeepMEL score was calculated in silico and is represented by coloured dots depending on the nucleotide to which the original dog nucleotide was mutated in silico. Point mutations that truly exist between the dog and human sequence are highlighted by colour-coded vertical dashed lines. Four mutations that decrease the motif score of the SOX10, MITF and TFAP2A motifs are highlighted by a grey box and are encircled. (F) Barplot showing the mean effect on the log2 delta ATAC-seq signal of a non-human region compared to the human homolog depending on the number of SOX10 motif hits lost or gained. Only regions having no change in the number of significant TFAP2A, MITF and RUNX motif hits were used. The y-axis is normalised to the category with no changes in the number of significant SOX10 motif hits. The number of regions in each of the categories is mentioned (#). (G) Luciferase assay on six human or dog enhancers. Significant motif hits per enhancer are shown with coloured crosses. For the luciferase assays, luciferase activity in MM001 is shown relative to Renilla signal and is log10 transformed. P-values were determined using Student’s t-test and the error bars represent the standard deviation over three biological replicates.
Discussion

Here, we present an in-depth study of melanoma enhancer logic, especially in enhancers specific to the melanocytic (MEL) state, by exploiting both cross-species data and machine learning. Although the MEL and MES melanoma cell states have been studied extensively on a transcriptomic and epigenomic level, the combinatorial code of binding sites of their regulatory factors in state-specific enhancers had not yet been explored. Understanding the enhancer logic and the mechanism by which TFs bind and direct active enhancers will become increasingly important, as it will be essential for the development of new therapies that influence cell state-specific enhancer functions in a targeted way (e.g. for enhancer therapy (Hamdan and Johnsen 2019; Johnson et al. 2008)), or to prioritise non-coding variants in whole genome sequencing studies of personal or cancer genomes (Atak et al. 2019).
Predicting enhancers and determining their functional role within gene regulatory networks has been an active field for years. Despite the well-established power of cross-species approaches in this field, to our knowledge, a large comparative epigenomics study in melanoma has not yet been conducted, although several non-human models are commonly used in melanoma research (van der Weyden et al. 2016) and have been studied on an intra-species level (Hitte et al. 2019; Jiang et al. 2014; Kaufman et al. 2016; Rambow et al. 2008; Rosengren Pielberg et al. 2008; Seltenhammer et al. 2014; Sundström et al. 2012) or in relation to human melanoma (Egidy et al. 2008; Segaoula et al. 2018; Rahman et al. 2019). Here, we demonstrate that the MEL and MES states are conserved across species, as well as the key regulators of these states.
Despite their proven advantages, sequence-based comparative approaches have limited power to identify orthologous regulatory regions in distant species, in part because of the rapid evolution of distal enhancers (Dermitzakis and Clark 2002; Lindblad-Toh et al. 2011). Methods such as enhancer element locator (EEL) try to tackle this problem by aligning TF binding sites to identify conserved enhancer elements (Hallikas et al. 2006), or by calculating the co-occurrence of sequence patterns (Arunachalam et al. 2010). However, these methods are either supervised, as they require user-provided PWMs (Hallikas et al. 2006), or make it difficult to extract the important biologically relevant features (Arunachalam et al. 2010). In addition, the identification and exact localisation of important (de novo) TF binding sites within enhancers is complex, as motif discovery tools are often dependent on user-provided databases and motif-specific thresholds. Recently, deep learning approaches, which are commonly used in disciplines such as speech recognition and image analysis, found their way into the regulatory genomics field to overcome these concerns (Park and Kellis 2015). As deep learning models, such as DeepBind, are particularly powerful in learning complex patterns by leveraging large epigenomics datasets, they are well suited to function as de novo motif detectors, as well as to uncover more complex sequence features (Alipanahi et al. 2015; Park and Kellis 2015). By designing DeepMEL, a multi-class multi-label neural network trained on human melanoma regulatory topics of co-accessible regions, and by using the model interpretation tool DeepExplainer and our newly developed motif scoring scheme (Lundberg and Lee 2017; Lundberg et al. 2020), we were able to perform a thorough and unsupervised analysis of important TF binding sites in melanoma enhancers. Specifically, in MEL enhancers, our data suggest conserved co-binding of a Core Regulatory Complex of three main TFs, consisting of SOX10, TFAP2A and MITF. DeepMEL also finds motifs for RUNX factors, but their role in the melanocyte or melanoma is less clear. Evidence for co-binding of SOX10, MITF, and TFAP2A was previously observed by enrichment of both MITF and TFAP2A motifs in SOX10 ChIP-seq data in melanoma cells (Laurette et al. 2015). We observed high flexibility in the organisation of TF binding sites of the CoRC, since eight different modalities were found, formed by all combinations of the CoRC factors, with the exception that all MEL enhancers contained at least one SOX10 binding site. MEL enhancers thereby adhere to a ‘mixed modes enhancer’ model, a billboard-like model with mostly high flexibility in the TF motif organisation, except for the ever-present SOX10 binding sites (Long et al. 2016). In addition, ChIP-seq data of MITF and TFAP2A indicated no indirect DNA binding of these CoRC factors within MEL enhancers; rather, the bound TFs are largely determined by their individual motif presence. Note that although DeepMEL was trained on melanoma ATAC-seq data, the human and pig predicted MEL enhancers were also accessible in human and pig melanocytes, respectively, indicating that we could extend these observations on the MEL enhancer logic to enhancers in melanocytes, and that our methodology could be applied to non-disease states.
It is well established that distinct functional classes of TFs exist with respect to enhancer binding. Pioneer TFs, such as OCT4, SOX2, Grh-like TFs, and FOXA1, are able to bind nucleosomal DNA, leading to displacement of the nucleosome and facilitating the binding of other TFs to the accessible enhancer (Jacobs et al. 2018; Long et al. 2016; Zaret and Carroll 2011). SOX2 and other SOX factors have an HMG domain that interacts with the minor groove of the DNA, causing the DNA to bend at a 60-70° angle, a property that has been suggested to contribute to the pioneering activity of SOX2, and possibly of other SOX factors (Hou et al. 2017). A recent publication by Dodonova et al. indicates that SOX2 and SOX11 can bind to their binding motif on nucleosomal DNA and that they use their binding energy to initiate chromatin opening. However, there is still some dispute on the pioneering properties of SOX TFs, as another study classified SOX factors as ‘migrant TFs’, i.e. non-pioneering TFs that only bind sporadically to (non-)chromatinised DNA (Sherwood et al. 2014). Nonetheless, we find strong evidence for a pioneering function of SOX10 in MEL melanoma cells. Our current and previous studies (Bravo González-Blas et al. 2019) have shown that knock-down of SOX10 induces closure of SOX10-bound ATAC-seq peaks containing a SOX10 motif. In fact, DeepMEL predicts SOX10 binding sites as essential for MEL enhancer accessibility. Next to pioneer factors, other functional classes of TFs exist, including factors that stabilise the accessibility of the nucleosome-depleted regions. TFAP2A was previously classified as such a chromatin stabiliser (Grossman et al. 2018), and it has been shown that evolutionary divergence from the TFAP2A consensus motif correlates with loss of chromatin accessibility and H3K27ac ChIP-seq signal (Prescott et al. 2015). These reports support our observations of TFAP2A as a stabiliser of SOX10-dependent accessible MEL enhancers, likely due to direct competition of TFAP2A with the nucleosome, as TFAP2A binding sites were highly enriched at the predicted center of the central nucleosome. The dependence on SOX10 for opening MEL enhancers prior to TFAP2A binding is in line with the reported classification of TFAP2A as a ‘settler’, a TF whose binding depends predominantly on the accessibility of the chromatin at its binding sites (Sherwood et al. 2014).
Besides classifying accessible (orthologous) regions and predicting important TF motifs within them, DeepMEL is an accurate predictor of the effect of mutations on enhancer accessibility and, for some enhancers, also on activity. This was for instance the case for the IRF4 MEL enhancer, where DeepMEL outperformed the existing methods tested in Kircher et al. (2019). Note, however, that the other models in the benchmark were trained to predict the activity of a total of 20 regulatory regions ranging across different cell types, whereas our DL model is specialised for melanoma regulatory regions. This demonstrates the value of using case-specific training data, such as the data set generated in this study for melanoma. Not all predicted MEL enhancers were in fact active, as MITF binding seems to be required to activate SOX10-dependent melanoma enhancers. The study of Fufa et al. supports this hypothesis, as activating SOX10-regions in mouse melanocytes showed significant enrichment of E-box motifs (bound by the bHLH protein family, which includes MITF), indicating that MITF cooperates with SOX10 to execute melanocyte-specific gene activation (Fufa et al. 2015). In addition, MITF was previously classified as a TF involved in co-factor recruitment and activation (Grossman et al. 2018; Kawakami and Fisher 2017). Although SOX10 binding is not sufficient for enhancer activity, it appears to be necessary, as disruption of the SOX10 binding site in the IRF4 enhancer had a strong effect on activity, probably due to the reappearance of the central nucleosome.
In conclusion, the combination of comparative epigenomics with deep learning allowed us to perform an in-depth analysis of the melanoma enhancer logic. This work presents an overall framework that can be applied to decipher the enhancer logic in a cell type or cell state of interest, starting from the generation of an extensive cell type-specific (cross-species) epigenomics dataset, all the way through the training and exploitation of a deep neural network to decode enhancer features across species and to assess the impact of cis-regulatory variation.
Methods

Cell culture

Human melanoma cell lines
Human melanoma cultures (“MM lines”) are short-term cultures derived from patient biopsies (Gembarska et al. 2012; Verfaillie et al. 2015). Cells were cultured at 37°C with 5% CO2 and were maintained in Ham's F10 nutrient mix (Thermo Fisher Scientific) supplemented with 10% fetal bovine serum (FBS; Thermo Fisher Scientific) and 100 µg/ml penicillin/streptomycin (Thermo Fisher Scientific).
Zebrafish melanoma cell lines
Experiments were performed as outlined by Ceol et al. (2011). Briefly, 25 pg of MCR:EGFP were microinjected together with 25 pg of Tol2 transposase mRNA into one-cell Tg(BRAFV600E);p53-/-;mitf-/- zebrafish embryos. Embryos were scored for melanocyte rescue at 48-72 hours post-fertilisation, and equal numbers were raised to adulthood (15-20 zebrafish per tank) and scored weekly (from 8-12 weeks post-fertilisation) or bi-weekly (> 12 weeks post-fertilisation) for the emergence of raised melanoma lesions (van Rooijen et al. 2017). For in vitro culture, large tumors were isolated from MCR/MCR:EGFP (14-28 weeks post-fertilisation). Zebrafish were maintained under IACUC-approved conditions. The zebrafish primary melanoma cell line ZMEL1 was previously described (White et al. 2008, 2011), and EGFP 121-1, EGFP 121-2, EGFP 121-3 and EGFP 121-5 were generated as described in (Heilmann et al. 2015; Wojciechowska et al. 2016). All cell lines were cultured in DMEM medium (Thermo Fisher Scientific) supplemented with 10% heat-inactivated FBS (Atlanta Biologicals), 1× GlutaMAX (Thermo Fisher Scientific) and 1% Penicillin-Streptomycin (Thermo Fisher Scientific), at 28°C, 5% CO2. Zebrafish melanoma lines were authenticated by qPCR and Western blot for EGFP transgene expression, and periodically checked for mycoplasma using the Universal Mycoplasma Detection Kit (ATCC).

Horse melanoma cell lines
The horse cell lines HoMel-L1 and HoMel-A1 are melanoma cell lines derived from a Lipizzaner stallion and a Shagya-Arabian mare, respectively, and were established by Seltenhammer et al. Cells were cultured at 37°C with 5% CO2 in Roswell Park Memorial Institute (RPMI) medium (Thermo Fisher Scientific) supplemented with 10% fetal bovine serum (FBS; Thermo Fisher Scientific) and 1% penicillin/streptomycin (Thermo Fisher Scientific).
Pig melanoma and melanocyte cell line
The immortal line of pigmented melanocytes (PigMel) was previously derived (Julé et al. 2003) and the 30-day-old piglet primary melanoma cells (MeLiM) were isolated as described (Egidy et al. 2008). PigMel cells were cultured at 37°C with 10% CO2 in MEM medium supplemented with 1× MEM non-essential amino acids (Thermo Fisher Scientific), 1 mM Na pyruvate, 2 mM glutamine, 100 U/ml penicillin/streptomycin (Thermo Fisher Scientific), 10% FCS and 3.7 g/ml Na bicarbonate. MeLiM cells were cultured in DMEM high glucose (Thermo Fisher Scientific), 10% FCS, Pen/Strep, 5% CO2.
Dog melanoma cell lines
The dog cell lines Dog-IrisMel-14205 and Dog-OralMel-18249 were established by Aline Primot, and were derived from a uveal melanoma from a Beagle-cross dog and an oral melanoma from the palate of a Shih-tzu, respectively. Cells were cultured at 37°C with 5% CO2 in Ham's F-12 Nutrient Mixture medium (Thermo Fisher Scientific) supplemented with 10% FBS (Thermo Fisher Scientific) and 1% penicillin/streptomycin (Thermo Fisher Scientific).
Mouse melanoma cell lines
The mouse melanoma cell line was generated as described in Dankort et al. (2009). Cells were cultured at 37°C with 5% CO2 in Dulbecco's Modified Eagle Medium (DMEM) (Thermo Fisher Scientific) supplemented with 10% FBS (Thermo Fisher Scientific) and 1% penicillin/streptomycin (Thermo Fisher Scientific).
Knock-down experiments
SOX10, TFAP2A and control knock-downs (KD) were performed in MM001 using a SMARTpool of four siRNAs against, respectively, SOX10 (SMARTpool: ON-TARGETplus SOX10 siRNA, number L017192-00-0005, Dharmacon), TFAP2A (SMARTpool: ON-TARGETplus TFAP2A siRNA, number L-006348-02-0005, Dharmacon) and a negative control pool (ON-TARGETplus non-targeting pool, number D-001810-10-05, Dharmacon), at a concentration of 20 nM for SOX10-KD and 40 nM for TFAP2A-KD and the control, using Opti-MEM (Thermo Fisher Scientific) as medium and omitting antibiotics. The cells were incubated for 72 h before processing.
OmniATAC-seq data generation, data processing and follow-up analyses

OmniATAC-seq on mammalian lines

Omni-Assay for Transposase-Accessible Chromatin using sequencing (OmniATAC-seq) was performed as described previously (Corces et al. 2017). After the final amplification with the additional number of cycles, samples were cleaned up using MinElute and libraries were prepped using the KAPA Library Quantification Kit as previously described (Corces et al. 2017). Samples were sequenced on a HiSeq 4000 or NextSeq 500 High Output chip.
ATAC-seq on zebrafish lines
50,000 cells per line were lysed and subjected to a tagmentation reaction and library construction as described in Buenrostro et al. Libraries were run on an Illumina HiSeq 2000.

Data processing of (OmniATAC)-seq samples
(Paired-end) reads were mapped with Bowtie 2 (v2.2.6) (Langmead and Salzberg 2012) to the human genome (hg19-Gencode v18), or with STAR (v2.5.1b) (Dobin et al. 2013), applying the parameters --alignIntronMax 1 and --alignIntronMin 2, to species-specific genomes downloaded from UCSC (http://hgdownload.cse.ucsc.edu/goldenPath/) (for human: hg19-Gencode v18; for dog: canFam3; for horse: equCab2; for pig: susScr11; for mouse: mm10; for zebrafish: danRer10). Note that for the human data, we used hg19 as genome assembly instead of the more recent GRCh38 assembly since i-cisTarget (Janky et al., 2014; Imrichova et al., 2015; Herrmann et al., 2012) and GREAT (McLean et al. 2010) were not (yet) available for GRCh38 at the time of the analyses. However, the use of GRCh38 instead of hg19 would not significantly affect the conclusions. For instance, we validated this by re-scoring the MEL-predicted regions in MM057 with DeepMEL after liftOver (Kuhn et al. 2013) from hg19 to GRCh38, and observed that changing the genome assembly yields the same DeepMEL score for all but 8 of the 4,244 regions. Also note that for MM029, two biological replicates were used. Mapped reads were sorted using SAMtools (v1.8) (Li et al. 2009) and duplicates were removed using Picard MarkDuplicates (v1.134) (Broad Institute 2019). Reads were filtered by removing mitochondrial reads and filtering for Q>30 using SAMtools. BAM files of technical replicates of the same cell line were merged at this point using samtools merge. Peaks were called per sample using MACS2 (v2.1.2) (Gaspar 2018) callpeak with the parameters -q 0.05, --nomodel, --call-summits, --shift -75, --keep-dup all and --extsize 150. Blacklisted regions (ENCODE) and peaks overlapping with alternative chromosomes and ChrM were removed. Summits were extended by 250 bp up- and downstream using slopBed (bedtools; v2.28.0) (Quinlan and Hall 2010), providing human chromosome sizes. Peaks were normalised for the library size using a custom script, and overlapping peaks were filtered by keeping the peak with the highest peak score. Normalised bigWigs were either made from normalised bedGraphs using as scaling parameter (-scale) 1 × 10^6/(number of non-mitochondrial mapping reads); or made by bamCoverage (deepTools, v3.3.1 (Ramírez et al. 2016)), using as parameters --normalizeUsing None, -bl EncodeBlackListedRegions, --effectiveGenomeSize 2913022398 and as scaling parameter (-scaleFactor) 1/(RIP/1 × 10^6), where RIP stands for the number of reads in peaks.
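The two bigWig scaling factors described above are simple arithmetic on read counts. The sketch below only illustrates that arithmetic; the function names and the example read counts are hypothetical and are not taken from the authors' custom scripts.

```python
def bedgraph_scale(non_mito_mapped_reads: int) -> float:
    """Scaling factor 1e6 / (number of non-mitochondrial mapped reads)."""
    if non_mito_mapped_reads <= 0:
        raise ValueError("read count must be positive")
    return 1e6 / non_mito_mapped_reads


def rip_scale_factor(reads_in_peaks: int) -> float:
    """Scaling factor 1 / (RIP / 1e6), with RIP the number of reads in peaks."""
    if reads_in_peaks <= 0:
        raise ValueError("reads in peaks must be positive")
    return 1.0 / (reads_in_peaks / 1e6)


if __name__ == "__main__":
    # Hypothetical library: 20 million mapped reads, 5 million of them in peaks.
    print(bedgraph_scale(20_000_000))
    print(rip_scale_factor(5_000_000))
```

Both factors rescale coverage to a per-million-reads basis; they differ only in whether all mapped reads or only the reads in peaks define the library size.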
HOMER on human and dog differentially accessible peaks
Count matrices were produced by featureCounts (v1.6.5) (Liao et al. 2014) for 5 melanocytic (MEL) and 5 mesenchymal-like (MES) lines for human, and for Dog-OralMel-18249 and Dog-IrisMel-14205 for dog. Differential peaks were identified using DESeq2 (v1.22.2, R v3.5.2 (R Core Team 2018)) (Love et al. 2014) with a log2FC higher than 2 and a pAdj lower than 0.0005. HOMER (Heinz et al. 2010) was run on the differentially accessible regions using findMotifsGenome.pl, providing the differential regions as a BED file and a fasta file of the human or dog genome, with parameters -mask, -size given and -len 6,8,10,11,12,17,18.
Defining sets of alignable and conserved accessible ATAC-seq regions
ATAC-seq regions of non-human species were defined as alignable regions when they could be converted to hg19 coordinates using liftOver (Kent-tools, -minMatch=0.1) (Kuhn et al. 2013) by providing the appropriate liftOver chain (UCSC). Alignable regions were intersected with accessible peaks in human using intersectBed (bedtools, v2.28.0) (Quinlan and Hall 2010) with -f 0.6 to define sets of conserved accessible regions across species.
Clustering of species based on globally alignable ATAC-seq regions
Per species, a count matrix was made on the alignable union ATAC-seq regions by featureCounts (v1.6.5) (Liao et al. 2014). The count matrices of different species were merged and the final count matrix was CPM normalised (edgeR v3.22.5, R v3.5.2 (R Core Team 2018)) (Robinson et al. 2010), followed by quantile normalisation. A principal component analysis (PCA) on the normalised count matrix was performed using irlba (v2.3.3, R v3.5.2) (Baglama and Reichel 2005).
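The two normalisation steps that precede the PCA can be sketched in plain Python. This is an illustration of the general technique, not the edgeR code used in the study: the CPM here uses raw column sums as library sizes, and ties in the quantile normalisation are broken by row order rather than averaged. Both choices are assumptions.

```python
def cpm(counts):
    """Counts per million; rows are regions, columns are samples."""
    n_samples = len(counts[0])
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_samples)]
    return [[row[j] * 1e6 / lib_sizes[j] for j in range(n_samples)]
            for row in counts]


def quantile_normalise(matrix):
    """Give every sample (column) the same distribution of values."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    sorted_cols = [sorted(row[j] for row in matrix) for j in range(n_cols)]
    # Mean of the values sharing the same rank across samples.
    rank_means = [sum(col[i] for col in sorted_cols) / n_cols
                  for i in range(n_rows)]
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        # Rows of column j ordered from smallest to largest value.
        order = sorted(range(n_rows), key=lambda i: matrix[i][j])
        for rank, i in enumerate(order):
            out[i][j] = rank_means[rank]
    return out


if __name__ == "__main__":
    toy_counts = [[5, 4, 3], [2, 1, 4], [3, 4, 6], [4, 2, 8]]
    for row in quantile_normalise(cpm(toy_counts)):
        print(row)
```

After quantile normalisation every column contains the same set of values, so samples from different species differ only in how regions are ranked.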
Branch length scoring across species

Conserved accessible ATAC-seq regions were identified as described above, and for each of the species, the set of conserved accessible regions was converted to the coordinate system of that species and fasta sequences were retrieved. All sequences were scored with the cisTarget motif collection (v8) (http://iregulon.aertslab.org/collections.html) (Janky et al., 2014; Imrichova et al., 2015; Herrmann et al., 2012) containing 20,003 TF position-weight matrices (PWMs) using Cluster-Buster (Frith et al. 2003) with parameters -m 0, -c 0 and -r 10000. For each motif, the highest cis-regulatory module (CRM) score per conserved accessible sequence was used to calculate the branch length score (BLS) across species according to Stark et al. and Jacobs et al. The branch length was taken from the phylogenetic data from http://hgdownload.cse.ucsc.edu/goldenpath/hg19/phyloP100way/ (UCSC). The sum of the BLSs for all the conserved accessible sequences across the mammalian or all six species was used as a total score for each motif. We normalised these scores by performing BLS on a shuffled variant of all sequences generated by shuffleseq (EMBOSS, v6.6.0.0), keeping the same base-pair compositions and sequence lengths, and subtracting the shuffled BLS from the true BLS per motif.

cisTopic analysis to obtain sets of co-accessible regions in human OmniATAC-seq data

To apply cisTopic (Bravo González-Blas et al. 2019), a tool designed for single-cell ATAC-seq analysis, we first simulated single cells from the bulk OmniATAC-seq data of the 16 human melanoma lines via bootstrapping. Per cell line, 50 simulated single-cell BAM files were generated, each containing 50,000 random reads that were bootstrapped from the bulk BAM files. These simulated single-cell BAM files were provided as input for cisTopic (v0.2.0, R v3.4.1 (R Core Team 2017)), together with the merged BED file of ATAC-seq regions across all 16 samples, after removing blacklisted regions (ENCODE). We ran cisTopic (parameters: α = 50/T, β = 0.1, burn-in iterations = 500, recording iterations = 1,000) for models with a number of topics (sets of co-accessible regions) between 2 and 30 (2 by 2). The best model, containing 24 topics, was selected on the basis of the highest log-likelihood. Topics were binarised using a probability threshold of 0.995 (resulting in a total of 35,940 binarised topic regions across the 24 topics), and we performed motif enrichment analysis with cisTarget (Imrichová et al. 2015).
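The bootstrapping step can be pictured as drawing read indices at random from each bulk library. The sketch below assumes sampling with replacement (the usual meaning of bootstrapping) and works on read indices rather than BAM records; the function name, the seed and the toy library size are hypothetical.

```python
import random


def simulate_single_cells(n_bulk_reads, n_cells=50, reads_per_cell=50_000, seed=0):
    """Return, per simulated cell, a list of read indices drawn from the bulk library."""
    if n_bulk_reads <= 0:
        raise ValueError("the bulk library must contain reads")
    rng = random.Random(seed)
    return [
        [rng.randrange(n_bulk_reads) for _ in range(reads_per_cell)]
        for _ in range(n_cells)
    ]


# Topic numbers explored by the model selection: 2 to 30 in steps of 2.
TOPIC_GRID = list(range(2, 31, 2))

if __name__ == "__main__":
    cells = simulate_single_cells(n_bulk_reads=1_000_000, n_cells=3, reads_per_cell=10)
    print(len(cells), len(cells[0]))
    print(TOPIC_GRID)
```

In practice the sampled indices would select reads from the bulk BAM file of one cell line, giving 50 pseudo-cells per line before topic modelling.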

Deep Learning
Data preparation
The deep learning (DL) model, DeepMEL, was trained on the binarised regions of the 24 topics obtained from the cisTopic analysis explained above. In order to increase the amount of training data, the 500 bp regions in the merged BED file of all 339,099 ATAC-seq regions across the 16 human cell lines (see Data processing of human melanoma baseline OmniATAC-seq samples) were augmented by extending them to 700 bp around the summit and sliding a 500 bp window over these elongated regions with a 10 bp stride. This augmented master region BED file was intersected with each topic BED file separately (using bedtools (Quinlan and Hall 2010)) and a region was labelled with a topic number if there was at least 60% overlap. Regions that overlapped with multiple topics were assigned multiple topic labels, allowing for a multi-label and multi-class DL model. This augmentation and intersection resulted in 696,654 training regions in total, excluding the 58,086 regions on Chr2 that were used for testing.
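The augmentation described above can be sketched as follows. A 700 bp region scanned by a 500 bp window with a 10 bp stride gives (700 - 500)/10 + 1 = 21 windows per region. The sketch assumes that the extended region is centred on the summit and that the 60% overlap is measured as a fraction of the window; neither detail is stated in the text, and all function names are hypothetical.

```python
def sliding_windows(summit, extended=700, window=500, stride=10):
    """Windows of `window` bp sliding over an `extended` bp region around the summit."""
    start = summit - extended // 2
    last_start = start + extended - window
    return [(s, s + window) for s in range(start, last_start + 1, stride)]


def overlap_fraction(window, region):
    """Fraction of the window covered by the region (same chromosome assumed)."""
    overlap = min(window[1], region[1]) - max(window[0], region[0])
    return max(0, overlap) / (window[1] - window[0])


def topic_labels(window, topic_regions, min_overlap=0.6):
    """Topics with at least one region covering >= min_overlap of the window."""
    return sorted(
        topic
        for topic, regions in topic_regions.items()
        if any(overlap_fraction(window, r) >= min_overlap for r in regions)
    )


if __name__ == "__main__":
    windows = sliding_windows(summit=1000)
    print(len(windows), windows[0], windows[-1])
    toy_topics = {4: [(750, 1250)], 7: [(1100, 1600)]}
    print(topic_labels(windows[0], toy_topics))
```

A window can receive several topic labels at once, which is what makes the resulting classifier multi-label.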
DeepMEL model architecture and training parameters

The DeepMEL architecture was built with 4 layers between the input and output layers: a Conv1D layer (containing 128 filters, with kernel_size set to 20, strides to 1 and activation to relu), a MaxPooling1D layer (with pool_size 10 and strides 10), a TimeDistributed Dense layer together with a Bidirectional LSTM layer (with 128 units, dropout set to 0.1 and recurrent_dropout set to 0.1), and a Dense layer (with 256 units and activation set to relu). After the MaxPooling1D, Bidirectional LSTM and Dense layers, a Dropout layer was used each time with the fraction of dropout set to 0.2, 0.2 and 0.4, respectively. For each region in the training data, DeepMEL takes the one-hot encoded (500 bp × 4 nucleotides) forward and reverse strands and passes them separately through the model. To make the final prediction, DeepMEL takes the average activation (average function) of the neurons in the final Dense layer (which contains 24 units corresponding to the 24 topics, with a sigmoid activation function). The model was compiled using the Adam optimizer with the default learning rate of 0.001. The binary cross entropy (binary_crossentropy) was used to calculate the loss. The model was trained for 2 epochs with a batch size of 128, which took 67 minutes. Keras 2.2.4 (Chollet and others 2015) with tensorflow 1.14.0 (Abadi et al. 2016) was used. A Tesla P100-SXM2-16GB GPU was used for training on VSC servers (Flemish Supercomputer Center).
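The two-strand input handling can be illustrated without a deep learning framework. The sketch below one-hot encodes a sequence and its reverse complement, passes both through the same scoring function and averages the two outputs element-wise. The `toy_model` stand-in, the A/C/G/T channel order and the all-zero encoding of N are assumptions made for illustration; the real network is the Keras model described above.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
    "N": [0, 0, 0, 0],
}


def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def one_hot(seq):
    """Encode a sequence as a (length x 4) nested list."""
    return [list(ONE_HOT[base]) for base in seq.upper()]


def predict_both_strands(seq, model):
    """Average the per-topic outputs of `model` over forward and reverse strands."""
    forward = model(one_hot(seq))
    reverse = model(one_hot(reverse_complement(seq)))
    return [(f + r) / 2 for f, r in zip(forward, reverse)]


def toy_model(encoded):
    """Hypothetical stand-in for the network: a single output, the fraction of A."""
    return [sum(row[0] for row in encoded) / len(encoded)]


if __name__ == "__main__":
    print(reverse_complement("ACGTN"))
    print(predict_both_strands("AAAT", toy_model))
```

Averaging over strands makes the prediction independent of the orientation in which a region is presented to the model.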

Performance evaluation
The performance of the model was evaluated for each topic separately since it was a multi-label classifier. The auROC and auPR were calculated for the combined training and validation data (regions on all chromosomes except Chr2), the test data (regions on Chr2), and label-shuffled regions.
Converting convolution filters to PWMs, filter-topic assignment, and filter-annotation
Filters of the convolution layer were converted to position-weight matrices (PWMs) by the following strategy: (i) 4,000,000 unique 20 bp-long (size of the filters) sequences were randomly generated. (ii) The activation score of each filter for each sequence was calculated and the top 100 sequences were selected. (iii) A count matrix was generated from these 100 sequences obtained for each filter. (iv) Finally, the count matrices were converted into PWMs. To assign the filters to topics, a strategy similar to the one described for Basset (Kelley et al. 2016) was used. After setting the activation score of a filter to its mean activation score over all the sequences, the loss/accuracy score on the prediction was calculated for each topic. Filters were ordered based on their effect on a certain topic. To annotate the filters to known transcription factor binding motifs, the Tomtom motif annotation tool (Gupta et al. 2007) was used together with our curated cisTarget motif collection (v9) (http://iregulon.aertslab.org/collections.html) (Janky et al., 2014; Imrichova et al., 2015; Herrmann et al., 2012) of 24,453 PWMs (cutoff for the q-value was set to 0.3).
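Steps (ii) to (iv) of the filter-to-PWM conversion can be sketched as below. The text does not specify how count matrices were turned into PWMs, so the log-odds transformation with a pseudocount of 1 and a uniform background is an assumption, as are the function names and the toy scoring function.

```python
import math

BASES = "ACGT"


def top_sequences(sequences, activation, n=100):
    """Keep the n sequences with the highest filter activation."""
    return sorted(sequences, key=activation, reverse=True)[:n]


def count_matrix(sequences):
    """Per-position nucleotide counts for equal-length sequences."""
    counts = [{b: 0 for b in BASES} for _ in range(len(sequences[0]))]
    for seq in sequences:
        for pos, base in enumerate(seq):
            counts[pos][base] += 1
    return counts


def to_pwm(counts, pseudocount=1.0, background=0.25):
    """Log2 odds of the smoothed frequency against a uniform background."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            b: math.log2(((column[b] + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm


if __name__ == "__main__":
    toy = ["ACGT", "ACGA", "ACGT", "TCGT", "GGGG"]
    # Hypothetical activation: number of positions matching "ACGT".
    best = top_sequences(toy, lambda s: sum(a == b for a, b in zip(s, "ACGT")), n=4)
    for column in to_pwm(count_matrix(best)):
        print({b: round(v, 2) for b, v in column.items()})
```

The resulting matrices can then be compared with known motifs by a tool such as Tomtom, as described in the text.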
DeepExplainer
From the 35,940 topic regions that were obtained after binarisation of the 24 topics within the selected cisTopic model (see methods on cisTopic analysis above), 500 regions were randomly selected to initialise the DeepExplainer pipeline (Lundberg and Lee 2017). A hypothetical importance score for each position of the sequence of interest was calculated for any of the 24 topics. For each sequence, these DeepExplainer-obtained importance scores were multiplied by the one-hot encoded matrix of the sequences. Finally, the 500 bp sequences were visualised by adjusting the nucleotide heights based on their importance score by using the modified viz_sequence function from the DeepLift repository (Shrikumar et al. 2017).
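Multiplying the hypothetical importance scores by the one-hot matrix keeps, at each position, only the score of the nucleotide that is actually present. A minimal sketch of that masking step is shown below; the A/C/G/T column order and the function name are assumptions, and the scores themselves would come from DeepExplainer.

```python
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def mask_by_sequence(hypothetical_scores, seq):
    """Element-wise product of (length x 4) scores with the one-hot encoded sequence."""
    if len(hypothetical_scores) != len(seq):
        raise ValueError("scores and sequence must have the same length")
    masked = []
    for scores, base in zip(hypothetical_scores, seq.upper()):
        row = [0.0, 0.0, 0.0, 0.0]
        if base in BASE_INDEX:  # positions with N keep an all-zero row
            row[BASE_INDEX[base]] = scores[BASE_INDEX[base]]
        masked.append(row)
    return masked


if __name__ == "__main__":
    toy_scores = [[0.1, 0.2, 0.3, 0.4], [1.0, 2.0, 3.0, 4.0]]
    print(mask_by_sequence(toy_scores, "CT"))
```

The single non-zero value per position is what sets the letter height in the sequence visualisation.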
In silico saturation mutagenesis
In silico saturation mutagenesis of a region was performed by separately changing each nucleotide on the 500 bp sequence into the three other nucleotides, and scoring these mutated sequences with DeepMEL. The delta prediction score for each mutation was calculated for each of the 24 topics by comparing the prediction score of the mutated sequence relative to the prediction score for the initial sequence. For the IRF4 enhancer case, the actual IRF4 enhancer sequence used in the in vitro saturation mutagenesis assay (Chr6:396,143-396,593) overlapped with a predicted MEL enhancer in human MEL cell lines in our cohort (Chr6:396,135-396,636). The delta prediction score of topic 4 (MEL topic) was calculated following an in silico saturation mutagenesis on this region, and a Pearson’s correlation was calculated on the overlapping nucleotides between the in silico and in vitro assays (451 bp).
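The mutagenesis loop can be sketched as follows: each position is replaced by each of the three alternative nucleotides, giving 3 × 500 = 1,500 mutants for a 500 bp region. The sketch takes the delta score as the difference between the mutant and reference scores, which is an assumption about how the comparison was made; the scoring function passed in is a hypothetical stand-in for a single DeepMEL topic output.

```python
def saturation_mutagenesis(seq, score):
    """Map (position, reference base, alternative base) to the delta score."""
    seq = seq.upper()
    reference_score = score(seq)
    deltas = {}
    for pos, ref in enumerate(seq):
        for alt in "ACGT":
            if alt == ref:
                continue
            mutant = seq[:pos] + alt + seq[pos + 1:]
            deltas[(pos, ref, alt)] = score(mutant) - reference_score
    return deltas


if __name__ == "__main__":
    # Hypothetical scorer: number of occurrences of a toy motif.
    def toy_score(s):
        return float(s.count("ACGT"))

    deltas = saturation_mutagenesis("TTACGTTT", toy_score)
    damaging = [key for key, delta in deltas.items() if delta < 0]
    print(len(deltas), sorted({pos for pos, _, _ in damaging}))
```

Positions where most substitutions lower the score mark candidate binding sites, and the per-position deltas can be compared directly with measured MPRA effects.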
Motif scoring method

We designed an optimised motif scoring method, in which activation scores of the filters on each sequence are multiplied by the DeepExplainer importance scores of the sequence. Then, after the output of this multiplication was normalised, a threshold was calculated for each motif by comparing MEL and MES enhancers. This approach yielded significant motif hits with their precise location.
Nucleosome positioning
Nucleosome start and middle point predictions were calculated by using the executable nucleosome prediction tool Kaplan_v3 (Kaplan et al. 2009) that takes just the DNA sequence and calculates the nucleosome positioning for each nucleotide. In order to get more precise results, as the authors of Kaplan_v3 suggest, enhancers were extended 3 kb from both ends. After obtaining the predictions, the middle 500 bp part of the 6.5 kb nucleosome prediction score was used.
Tn5 footprinting
Footprints of the Tn5 were determined by inferring Tn5 cut sites from the start point of each ATAC-seq read in a BAM file using a custom script.

AUROC on human and dog of DeepMEL and Cluster-Buster

The performance of DeepMEL to discriminate between MEL and MES regions in human and dog was calculated by scoring the top 5,000 differential MEL and MES regions in human and dog (described above) with DeepMEL and calculating the precision of correct assignment (i.e. topic 4 score for the MEL regions and topic 7 scores for the MES regions). The performance of DeepMEL was compared with the motif scoring tool Cluster-Buster (Frith et al. 2003) by scoring the same sets of regions with Cluster-Buster using a merged motif file of (some of) the top filters identified by the model in either topic 4 or topic 7. The obtained CRM scores were used to estimate the performance of Cluster-Buster.

Identification of homologous MEL genes and MEL enhancers

To identify genes differentially expressed in human MEL cell lines, we performed DESeq2 (v1.22.2, R v3.5.2 (R Core Team 2018)) (Love et al. 2014) on RNA-seq data of 7 MEL (MM031, MM034, MM057, MM074, MM087, MM118, MM164) and 5 MES (MM029, MM099, MM116, MM163, MM165) human lines. 379 genes were found differentially expressed in MEL lines (log2FC > 2.5 and adjP < 0.005). We converted the gene symbols to Ensembl gene IDs using biomaRt (v2.38.0, R v3.5.2) (Durinck et al. 2005) and retrieved the genomic locations of the genes using GenomicFeatures (v1.34.8, R v3.5.2) (Lawrence et al. 2013). For the human differential MEL genes with at least one MEL-predicted peak in their extended gene locus (200 kbp up- and downstream), the homologous genes in the other five species were identified using biomaRt to convert the human Ensembl gene IDs to Ensembl gene IDs of the other species. We identified the MEL enhancers that overlapped with the extended gene loci of each of the homologous genes using bedtools intersect (Quinlan and Hall 2010). liftOver (-minMatch=0.1) (Kuhn et al. 2013) was used to calculate the number of these regions that could be identified by performing coordinate conversion.

Correlation of MEL enhancers using deep layers of DeepMEL

Conserved accessible MEL enhancers in the extended loci of conserved MEL-specific genes across the six species (see above) were scored by DeepMEL. A matrix was generated consisting of a score for each of the 256 nodes in the Dense layer for each of the regions. A Pearson’s correlation matrix was generated to calculate the pairwise similarity between each of the regions.
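The pairwise similarity step amounts to a Pearson correlation between the activation vectors of two regions. A minimal sketch in plain Python is given below; the three-element toy vectors stand in for the 256 Dense-layer activations per region, and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / (sd_x * sd_y)


def correlation_matrix(activations):
    """Region-by-region Pearson matrix; each row of `activations` is one region."""
    return [[pearson(a, b) for b in activations] for a in activations]


if __name__ == "__main__":
    toy_activations = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]
    for row in correlation_matrix(toy_activations):
        print([round(value, 3) for value in row])
```

Regions whose deep-layer activations correlate strongly are treated as similar by the model even when their sequences do not align.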

Genome-wide prediction of MEL enhancers

The first chromosome of the human genome (hg19) was tiled with a sliding window of 500 bp and a 100 bp shift using bedtools makewindows (v2.28.0) (Quinlan and Hall 2010). Tiles containing ‘N’ were deleted and the remaining tiles were scored by DeepMEL, and the number of MEL-predicted tiles (topic 4 score > 0.16) was calculated.
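The tiling and filtering logic can be sketched as below. The sketch keeps only full-length 500 bp tiles, which is a simplification made here and not necessarily how the trailing windows were handled in the study; the scoring function is a hypothetical stand-in for the topic 4 output of DeepMEL.

```python
def tile_sequence(seq, window=500, step=100):
    """Yield (start, tile) for full-length tiles that contain no N."""
    for start in range(0, len(seq) - window + 1, step):
        tile = seq[start:start + window]
        if "N" not in tile.upper():
            yield start, tile


def count_predicted_tiles(seq, score_topic, threshold=0.16):
    """Number of tiles whose topic score exceeds the threshold."""
    return sum(1 for _, tile in tile_sequence(seq) if score_topic(tile) > threshold)


if __name__ == "__main__":
    toy_chromosome = "A" * 700 + "N" + "C" * 299
    print([start for start, _ in tile_sequence(toy_chromosome)])
    # Hypothetical scorer: fraction of A in the tile.
    print(count_predicted_tiles(toy_chromosome, lambda t: t.count("A") / len(t)))
```

Because consecutive tiles overlap by 400 bp, a single enhancer is usually covered by several predicted tiles.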

Mutations in orthologous enhancers across species

We defined highly-probable orthologous MEL enhancers between human and another species as regions that were predicted as MEL in one species and for which there was a stringent liftOver (-minMatch=0.995) (Kuhn et al. 2013) and high sequence identity (more than 80% after pairwise alignment via needle (EMBOSS, v6.6.0.0) (Madeira et al. 2019), using parameters -gapopen 10.0 -gapextend 0.5) in the other species. featureCounts (v1.6.5) (Liao et al. 2014) was used to generate count matrices per species on these regions, which was followed by library size normalisation. Delta ATAC-seq scores were calculated for the pairs of orthologous regions by dividing the normalised counts of the two species (human counts / non-human counts) after adding a pseudocount. Mutations were identified by alignment via needle, using the parameters -gapopen 10.0 and -gapextend 0.5.
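The delta ATAC-seq score and the extraction of substitutions from a pairwise alignment can be sketched as follows. The pseudocount value of 1, the identity definition (matches divided by alignment length) and the function names are assumptions; the aligned strings would come from needle.

```python
def delta_atac(human_count, other_count, pseudocount=1.0):
    """Ratio of normalised counts (human / non-human) after adding a pseudocount."""
    return (human_count + pseudocount) / (other_count + pseudocount)


def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns with identical, non-gap characters."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(a == b and a != "-" for a, b in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)


def substitutions(aligned_a, aligned_b):
    """(column, base in a, base in b) for mismatching columns without gaps."""
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(aligned_a, aligned_b))
        if a != b and a != "-" and b != "-"
    ]


if __name__ == "__main__":
    human, other = "ACGT-A", "ACCTGA"
    print(delta_atac(9.0, 4.0))
    print(round(percent_identity(human, other), 1))
    print(substitutions(human, other))
```

Pairing each substitution with the delta ATAC-seq score of its region is what allows individual nucleotide changes to be related to gains or losses of accessibility.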

Luciferase assay

Six MEL-predicted enhancers (3 in the dog line Dog-OralMel-18249 and 3 in the human line MM001) were synthetically generated and cloned into a pTwist ENTR plasmid by Twist Bioscience. Regions were transferred from the Gateway entry clone into the destination vector (pGL4.23-GW, Addgene) via an LR reaction by mixing 2 µL of the entry clone (100 ng/µL) with 1 µL of the destination plasmid (150 ng/µL), 1 µL TE buffer and 1 µL LR enzyme (LR Clonase II Plus enzyme mix, Thermo Fisher Scientific), and incubating this mixture at 25°C for 1 hour. Afterwards, 1 µL of Proteinase K (Thermo Fisher Scientific) was added and reactions were incubated at 37°C for 10 min. 3 µL of each LR reaction was transformed into 50 µL of Stellar competent cells (Takara Bio) via heat shock. 200 µL of SOC medium was added and the cells were incubated for 1 hour in a shake incubator at 37°C, before plating the transformed cells on LB agar plates with 1/1000 carbenicillin and incubating overnight at 37°C. The next day, one colony per construct was picked and grown overnight in 5 mL of LB medium with 1/1000 carbenicillin in a shake incubator at 37°C before plasmid extraction using the NucleoSpin Plasmid Transfection-grade kit (Macherey-Nagel). For each construct, three biological replicates were performed by transfecting the plasmids into 80% confluent MM001 cells in a 24-well plate. Per transfection, 400 ng of the construct was transfected together with 40 ng of Renilla plasmid (Promega) using Lipofectamine 2000 (Thermo Fisher Scientific). Luciferase activity of each construct was measured using the Dual-Luciferase Reporter Assay (Promega) according to the manufacturer's instructions. Enhancer luciferase activity was normalised against the Renilla luciferase activity.
Publicly available data used in this work
SOX10 ChIP-seq and MITF ChIP-seq data on the 501Mel melanoma cell line were downloaded as raw fastq files from NCBI's Gene Expression Omnibus through GEO accession number GSE61965 (Laurette et al. 2015) and were mapped to the human genome using Bowtie 2 (v2.1.0) (Langmead and Salzberg 2012), and peaks were called by MACS2 (v2.1.1) (Gaspar 2018). TFAP2A ChIP-seq data on human primary melanocytes from neonatal foreskin were retrieved from Seberg et al. (GSE67555) as a BED file, which was converted to a bedGraph and bigWig using the peak height from the BED file. Histone H3 acetylation at lysine 27 (H3K27ac) and H3 monomethylation at lysine 4 (H3K4me1) ChIP-seq data for MM001 (GSE60666), and RNA-seq data (for MM031, MM034, MM057, MM074, MM087, MM099 and MM118 downloaded from GSE60666; for MM029, MM116, MM163, MM164 and MM165 from GSE134432), were processed as explained in Verfaillie et al. OmniATAC-seq data for the human lines MM001, MM011, MM029, MM031, MM074, MM057, MM087 and MM099 were obtained through GSE134432 (Wouters et al. 2019) and were processed as described above in ‘Data processing human melanoma baseline OmniATAC-seq samples’; this was also the case for ATAC-seq data from normal human melanocytes from foreskin (NHM1), which were downloaded as raw fastq files from GSE94488 (GSM2476338) (Fontanals-Cirera et al. 2017). The massively parallel reporter assay (MPRA) data on the IRF4 enhancer were downloaded from https://mpra.gs.washington.edu/satMutMPRA/ and were processed as described above.
Data access
All raw and processed sequencing data generated in this study have been submitted to the NCBI Gene Expression Omnibus (GEO; https://www.ncbi.nlm.nih.gov/geo/) under accession number GSE142238. This includes OmniATAC-seq data of human melanoma cell lines (MM029, MM034, MM052, MM116, MM118, MM122, MM163, MM164, MM165; data for the other lines used in this study were published before (see ‘Publicly available data used in this work’)), two dog melanoma cell lines, two horse melanoma cell lines, one pig melanoma sample, one pig melanocyte cell line and one mouse melanoma cell line; ATAC-seq data of four zebrafish cell lines; and OmniATAC-seq data of SOX10 and TFAP2A knock-down in the human melanoma cell line MM001. The DeepMEL model was deposited in Kipoi (Avsec et al. 2019a) (http://kipoi.org/models/DeepMEL/). Code and custom scripts for training DeepMEL, DeepMEL predictions, DeepExplainer usage and BLS scoring are provided in GitHub (https://github.com/aertslab/DeepMEL) and as Supplemental Code.
Acknowledgements
This work was supported by an ERC Consolidator Grant to S.A. (no. 724226_cis-CONTROL), the KU Leuven (grant no. C14/18/092 to S.A.), the Foundation Against Cancer (grant no. 2016-070 to S.A.), a PhD fellowship from the FWO (L.M., no. 1S03317N) and a postdoctoral research fellowship from Kom op tegen Kanker (Stand up to Cancer; the Flemish Cancer Society) and Stichting tegen Kanker (Foundation against Cancer; the Belgian Cancer Society) (J.W.). We would like to thank Odessa Van Goethem and Véronique Benne for their contribution in establishing and providing the mouse melanoma cell line and Leif Andersson for sharing the horse melanoma cell lines. We would like to thank Catherine André (CNRS-University of Rennes1, UMR6290, IGDR, Faculty of Medicine, Rennes, France) and Cani-DNA BRC (Biosit, Rennes, France) for sharing the in-house canine oral and uveal melanoma cell lines. The Cani-DNA BRC (https://dog-genetics.genouest.org) is funded through the CRB-Anim PIA1 funding (2012-2022) ANR-11-INBS-0003. In addition, we would like to thank Austin George for his help with the hyperparameter optimisation. Computing was performed at the Vlaams Supercomputer Center and high-throughput sequencing was done via the Genomics Core Leuven. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Author contributions

L.M., I.I.T. and S.A. conceived the study. L.M. performed the experimental work for the mammalian OmniATAC-seq dataset, with the help of L.V.A., S.M., V.C. and J.W. M.F., E.v.R. and L.Z. established and maintained the zebrafish cell lines and performed ATAC-seq on these. G.E.M. maintained and provided the pig cell lines. A.P. and E.C. established and provided the dog cell lines. P.K. established and provided the mouse melanoma cell line. M.S. established and provided the horse cell lines. G.E.G. established and provided the human cell lines. L.M. performed the experimental work and analysis of the luciferase assays together with D.M. L.M. performed the bioinformatic analyses of the OmniATAC-seq dataset. I.I.T. established the neural network and performed all bioinformatic analyses regarding the model. L.M., I.I.T., J.W. and S.A. wrote the manuscript.

Disclosure declaration

The authors declare no competing interests.
References

Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS, Davis A, Dean J, Devin M, et al. 2016. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. ArXiv160304467 Cs. http://arxiv.org/abs/1603.04467 (Accessed December 20, 2019).
Alipanahi B, Delong A, Weirauch MT, Frey BJ. 2015. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol 33: 831–838.
Angermueller C, Lee HJ, Reik W, Stegle O. 2017. DeepCpG: accurate prediction of single-cell DNA methylation states using deep learning. Genome Biol 18: 67.
Arendt D, Musser JM, Baker CVH, Bergman A, Cepko C, Erwin DH, Pavlicev M, Schlosser G, Widder S, Laubichler MD, et al. 2016. The origin and evolution of cell types. Nat Rev Genet 17: 744–757.
Arunachalam M, Jayasurya K, Tomancak P, Ohler U. 2010. An alignment-free method to identify candidate orthologous enhancers in multiple Drosophila genomes. Bioinforma Oxf Engl 26: 2109–2115.
Atak ZK, Taskiran II, Flerin C, Mauduit D, Minnoye L, Hulselmans G, Christiaens V, Ghanem G-E, Wouters J, Aerts S. 2019. Prioritization of enhancer mutations by combining allele-specific chromatin accessibility with deep learning. Genomics http://biorxiv.org/lookup/doi/10.1101/2019.12.21.885806 (Accessed April 24, 2020).
Avsec Ž, Kreuzhuber R, Israeli J, Xu N, Cheng J, Shrikumar A, Banerjee A, Kim DS, Beier T, Urban L, et al. 2019a. The Kipoi repository accelerates community exchange and reuse of predictive models for genomics. Nat Biotechnol 37: 592–600.
Avsec Ž, Weilert M, Shrikumar A, Alexandari A, Krueger S, Dalal K, Fropf R, McAnany C, Gagneur J, Kundaje A, et al. 2019b. Deep learning at base-resolution reveals motif syntax of the cis-regulatory code. Genomics http://biorxiv.org/lookup/doi/10.1101/737981 (Accessed October 14, 2019).
Baglama J, Reichel L. 2005. Augmented Implicitly Restarted Lanczos Bidiagonalization Methods. SIAM J Sci Comput 27: 19–42.
Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L, Ren J, Li WW, Noble WS. 2009. MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res 37: W202–W208.
Ballester B, Medina-Rivera A, Schmidt D, Gonzàlez-Porta M, Carlucci M, Chen X, Chessman K, Faure AJ, Funnell APW, Goncalves A, et al. 2014. Multi-species, multi-transcription factor binding highlights conserved control of tissue-specific biological pathways. eLife 3: e02626.
Bravo González-Blas C, Minnoye L, Papasokrati D, Aibar S, Hulselmans G, Christiaens V, Davie K, Wouters J, Aerts S. 2019. cisTopic: cis-regulatory topic modeling on single-cell ATAC-seq data. Nat Methods 16: 397–400.
Broad Institute. 2019. Picard Toolkit.
Buenrostro JD, Giresi PG, Zaba LC, Chang HY, Greenleaf WJ. 2013. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat Methods 10.
Ceol CJ, Houvras Y, Jane-Valbuena J, Bilodeau S, Orlando DA, Battisti V, Fritsch L, Lin WM, Hollmann TJ, Ferré F, et al. 2011. The histone methyltransferase SETDB1 is recurrently amplified in melanoma and accelerates its onset. Nature 471: 513–517.
Chen L, Fish AE, Capra JA. 2018. Prediction of gene regulatory enhancers across species reveals evolutionarily conserved sequence properties. PLoS Comput Biol 14: e1006484.
Chollet F, others. 2015. Keras. https://keras.io.
Cliften PF, Hillier LW, Fulton L, Graves T, Miner T, Gish WR, Waterston RH, Johnston M. 2001. Surveying Saccharomyces genomes to identify functional elements by comparative DNA sequence analysis. Genome Res 11: 1175–1186.
Corces MR, Trevino AE, Hamilton EG, Greenside PG, Sinnott-Armstrong NA, Vesuna S, Satpathy AT, Rubin AJ, Montine KS, Wu B, et al. 2017. An improved ATAC-seq protocol reduces background and enables interrogation of frozen tissues. Nat Methods 14.
1089
http://www.nature.com/doifinder/10.1038/nmeth.4396.
1090
Creyghton MP, Cheng AW, Welstead GG, Kooistra T, Carey BW, Steine EJ, Hanna J, Lodato MA,
1091
Frampton GM, Sharp PA, et al. 2010. Histone H3K27ac separates active from poised
1092
enhancers and predicts developmental state. Proc Natl Acad Sci U S A 107: 21931–21936.
1093
Dankort D, Curley DP, Cartlidge RA, Nelson B, Karnezis AN, Damsky WE, You MJ, DePinho RA,
1094
McMahon M, Bosenberg M. 2009. Braf(V600E) cooperates with Pten loss to induce
1095
metastatic melanoma. Nat Genet 41: 544–552.
1096
De Mazière AM, Muehlethaler K, van Donselaar E, Salvi S, Davoust J, Cerottini J-C, Lévy F, Slot
1097
JW, Rimoldi D. 2002. The melanocytic protein Melan-A/MART-1 has a subcellular
1098
localization distinct from typical melanosomal proteins. Traffic Cph Den 3: 678–693.
1099
Denny SK, Yang D, Chuang C-H, Brady JJ, Lim JS, Grüner BM, Chiou S-H, Schep AN, Baral J,
1100
Hamard C, et al. 2016. Nfib Promotes Metastasis through a Widespread Increase in
1101
Chromatin Accessibility. Cell 166: 328–342.
1102
Dermitzakis ET, Clark AG. 2002. Evolution of transcription factor binding sites in Mammalian gene
1103
regulatory regions: conservation and turnover. Mol Biol Evol 19: 1114–1121.
1104
D’Mello SAN, Finlay GJ, Baguley BC, Askarian-Amiri ME. 2016. Signaling Pathways in
1105
Melanogenesis. Int J Mol Sci 17.
1106
Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR.
1107
2013. STAR: ultrafast universal RNA-seq aligner. Bioinforma Oxf Engl 29: 15–21.
1108
Dodonova SO, Zhu F, Dienemann C, Taipale J, Cramer P. 2020. Nucleosome-bound SOX2 and
1109
SOX11 structures elucidate pioneer factor function. Nature.
1110
http://www.nature.com/articles/s41586-020-2195-y (Accessed April 23, 2020).
1111
Durinck S, Moreau Y, Kasprzyk A, Davis S, De Moor B, Brazma A, Huber W. 2005. BioMart and
1112
Bioconductor: a powerful link between biological databases and microarray data analysis.
1113
Bioinforma Oxf Engl 21: 3439–3440.
1114
Dynan WS, Tjian R. 1983. The promoter-specific transcription factor Sp1 binds to upstream
1115
sequences in the SV40 early promoter. Cell 35: 79–87.
1116
Egidy G, Julé S, Bossé P, Bernex F, Geffrotin C, Vincent-Naulleau S, Horak V, Sastre-Garau X,
1117
Panthier J-J. 2008. Transcription analysis in the MeLiM swine model identifies RACK1 as a
1118
31
potential marker of malignancy for human melanocytic proliferation. Mol Cancer 7: 34.
1119
Eraslan G, Avsec Ž, Gagneur J, Theis FJ. 2019. Deep learning: new computational modelling
1120
techniques for genomics. Nat Rev Genet 20: 389–403.
1121
Fontanals-Cirera B, Hasson D, Vardabasso C, Di Micco R, Agrawal P, Chowdhury A, Gantz M, de
1122
Pablos-Aragoneses A, Morgenstern A, Wu P, et al. 2017. Harnessing BET Inhibitor
1123
Sensitivity Reveals AMIGO2 as a Melanoma Survival Gene. Mol Cell 68: 731-744.e9.
1124
Frith MC, Li MC, Weng Z. 2003. Cluster-Buster: Finding dense clusters of motifs in DNA sequences.
1125
Nucleic Acids Res 31: 3666–3668.
1126
Fufa TD, Harris ML, Watkins-chow DE, Levy D, Gorkin DU, Gildea DE, Song L, Sa A, Crawford
1127
GE, Sviderskaya EV, et al. 2015. Genomic analysis reveals distinct mechanisms and
1128
functional classes of SOX10-regulated genes in melanocytes. 24: 5433–5450.
1129
Gaspar JM. 2018. Improved peak-calling with MACS2. Bioinformatics
1130
http://biorxiv.org/lookup/doi/10.1101/496521 (Accessed June 15, 2020).
1131
Gasperini M, Hill AJ, McFaline-Figueroa JL, Martin B, Kim S, Zhang MD, Jackson D, Leith A,
1132
Schreiber J, Noble WS, et al. 2019. A Genome-wide Framework for Mapping Gene
1133
Regulation via Cellular Genetic Screens. Cell 176: 377-390.e19.
1134
Gembarska A, Luciani F, Fedele C, Russell EA, Dewaele M, Villar S, Zwolinska A, Haupt S, de
1135
Lange J, Yip D, et al. 2012. MDM4 is a key therapeutic target in cutaneous melanoma. Nat
1136
Med 18: 1239–47.
1137
Graf SA, Busch C, Bosserhoff AK, Besch R, Berking C. 2014. SOX10 promotes melanoma cell
1138
invasion by regulating melanoma inhibitory activity. J Invest Dermatol 134: 2212–2220.
1139
Grossman SR, Engreitz J, Ray JP, Nguyen TH, Hacohen N, Lander ES. 2018. Positional specificity of
1140
different transcription factor classes within enhancers. Proc Natl Acad Sci U S A 115: E7222–
1141
E7230.
1142
Gupta S, Stamatoyannopoulos JA, Bailey TL, Noble WS. 2007. Quantifying similarity between
1143
motifs. Genome Biol 8: R24.
1144
Hallikas O, Palin K, Sinjushina N, Rautiainen R, Partanen J, Ukkonen E, Taipale J. 2006. Genome-
1145
wide prediction of mammalian enhancers based on analysis of transcription-factor binding
1146
affinity. Cell 124: 47–59.
1147
Hamdan FH, Johnsen SA. 2019. Perturbing Enhancer Activity in Cancer Therapy. Cancers 11.
1148
Heilmann S, Ratnakumar K, Langdon E, Kansler E, Kim I, Campbell NR, Perry E, McMahon A,
1149
Kaufman C, van Rooijen E, et al. 2015. A Quantitative System for Studying Metastasis Using
1150
Transparent Zebrafish. Cancer Res 75: 4272–4282.
1151
Heinz S, Benner C, Spann N, Bertolino E, Lin YC, Laslo P, Cheng JX, Murre C, Singh H, Glass CK.
1152
2010. Simple combinations of lineage-determining transcription factors prime cis-regulatory
1153
elements required for macrophage and B cell identities. Mol Cell 38: 576–589.
1154
Hitte C, Le Béguec C, Cadieu E, Wucher V, Primot A, Prouteau A, Botherel N, Hédan B, Lindblad-
1155
Toh K, André C, et al. 2019. Genome-Wide Analysis of Long Non-Coding RNA Profiles in
1156
Canine Oral Melanomas. Genes 10: 477.
1157
Hoek KS, Eichhoff OM, Schlegel NC, Döbbeling U, Kobert N, Schaerer L, Hemmi S, Dummer R.
1158
2008. In vivo switching of human melanoma cells between proliferative and invasive states.
1159
Cancer Res 68: 650–656.
1160
Hoek KS, Schlegel NC, Brafford P, Sucker A, Ugurel S, Kumar R, Weber BL, Nathanson KL,
1161
Phillips DJ, Herlyn M, et al. 2006. Metastatic potential of melanomas defined by specific
1162
gene expression profiles with no BRAF signature. Pigment Cell Res 19: 290–302.
1163
Hong J-W, Hendrix DA, Levine MS. 2008. Shadow enhancers as a source of evolutionary novelty.
1164
Science 321: 1314.
1165
Hou L, Srivastava Y, Jauch R. 2017. Molecular basis for the genome engagement by Sox proteins.
1166
Semin Cell Dev Biol 63: 2–12.
1167
Imrichová H, Hulselmans G, Kalender Atak Z, Potier D, Aerts S. 2015. i-cisTarget 2015 update:
1168
generalized cis-regulatory enrichment analysis in human, mouse and fly. Nucleic Acids Res
1169
43: W57–W64.
1170
Iwafuchi-Doi M, Donahue G, Kakumanu A, Watts JA, Mahony S, Pugh BF, Lee D, Kaestner KH,
1171
Zaret KS. 2016. The Pioneer Transcription Factor FoxA Maintains an Accessible Nucleosome
1172
Configuration at Enhancers for Tissue-Specific Gene Activation. Mol Cell 62: 79–91.
1173
32
Jacobs J, Atkins M, Davie K, Imrichova H, Romanelli L, Christiaens V, Hulselmans G, Potier D,
1174
Wouters J, Taskiran II, et al. 2018. The transcription factor Grainy head primes epithelial
1175
enhancers for spatiotemporal activation by displacing nucleosomes. Nat Genet 50: 1011–
1176
1020.
1177
Janky R, Verfaillie A, Imrichová H, van de Sande B, Standaert L, Christiaens V, Hulselmans G,
1178
Herten K, Naval Sanchez M, Potier D, et al. 2014. iRegulon: From a Gene List to a Gene
1179
Regulatory Network Using Large Motif and Track Collections. PLoS Comput Biol 10.
1180
Jiang L, Campagne C, Sundström E, Sousa P, Imran S, Seltenhammer M, Pielberg G, Olsson MJ,
1181
Egidy G, Andersson L, et al. 2014. Constitutive activation of the ERK pathway in melanoma
1182
and skin melanocytes in Grey horses. BMC Cancer 14: 857.
1183
Johnson LA, Zhao Y, Golden K, Barolo S. 2008. Reverse-engineering a transcriptional enhancer: a
1184
case study in Drosophila. Tissue Eng Part A 14: 1549–1559.
1185
Julé S, Bossé P, Egidy G, Panthier J-J. 2003. Establishment and characterization of a normal
1186
melanocyte cell line derived from pig skin. Pigment Cell Res 16: 407–410.
1187
Kaplan N, Moore IK, Fondufe-Mittendorf Y, Gossett AJ, Tillo D, Field Y, LeProust EM, Hughes TR,
1188
Lieb JD, Widom J, et al. 2009. The DNA-encoded nucleosome organization of a eukaryotic
1189
genome. Nature 458: 362–366.
1190
Kaufman CK, Mosimann C, Fan ZP, Yang S, Thomas AJ, Ablain J, Tan JL, Fogley RD, van Rooijen
1191
E, Hagedorn EJ, et al. 2016. A zebrafish melanoma model reveals emergence of neural crest
1192
identity during melanoma initiation. Science 351: aad2197–aad2197.
1193
Kawakami A, Fisher DE. 2017. The master role of microphthalmia-associated transcription factor in
1194
melanocyte and melanoma biology. Lab Invest 97: 649–656.
1195
Kelley DR, Snoek J, Rinn JL. 2016. Basset: learning the regulatory code of the accessible genome
1196
with deep convolutional neural networks. Genome Res 26: 990–999.
1197
Kircher M, Xiong C, Martin B, Schubach M, Inoue F, Bell RJA, Costello JF, Shendure J, Ahituv N.
1198
2019. Saturation mutagenesis of twenty disease-associated regulatory elements at single base-
1199
pair resolution. Nat Commun 10: 3583.
1200
Klein RM, Bernstein D, Higgins SP, Higgins CE, Higgins PJ. 2012. SERPINE1 expression
1201
discriminates site-specific metastasis in human melanoma. Exp Dermatol 21: 551–554.
1202
Klemm SL, Shipony Z, Greenleaf WJ. 2019. Chromatin accessibility and the regulatory epigenome.
1203
Nat Rev Genet 20: 207–220.
1204
Kuhn RM, Haussler D, Kent WJ. 2013. The UCSC genome browser and associated tools. Brief
1205
Bioinform 14: 144–161.
1206
Langmead B, Salzberg SL. 2012. Fast gapped-read alignment with Bowtie 2. Nat Methods 9: 357–
1207
359.
1208
Laurette P, Strub T, Koludrovic D, Keime C, Le Gras S, Seberg H, Van Otterloo E, Imrichova H,
1209
Siddaway R, Aerts S, et al. 2015. Transcription factor MITF and remodeller BRG1 define
1210
chromatin organisation at regulatory elements in melanoma cells. eLife 2015: 1–40.
1211
Lawrence M, Huber W, Pagès H, Aboyoun P, Carlson M, Gentleman R, Morgan MT, Carey VJ.
1212
2013. Software for Computing and Annotating Genomic Ranges ed. A. Prlic. PLoS Comput
1213
Biol 9: e1003118.
1214
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000
1215
Genome Project Data Processing Subgroup. 2009. The Sequence Alignment/Map format and
1216
SAMtools. Bioinforma Oxf Engl 25: 2078–2079.
1217
Liao Y, Smyth GK, Shi W. 2014. featureCounts: an efficient general purpose program for assigning
1218
sequence reads to genomic features. Bioinforma Oxf Engl 30: 923–930.
1219
Lindblad-Toh K, Garber M, Zuk O, Lin MF, Parker BJ, Washietl S, Kheradpour P, Ernst J, Jordan G,
1220
Mauceli E, et al. 2011. A high-resolution map of human evolutionary constraint using 29
1221
mammals. Nature 478: 476–482.
1222
Long HK, Prescott SL, Wysocka J. 2016. Ever-Changing Landscapes: Transcriptional Enhancers in
1223
Development and Evolution. Cell 167: 1170–1187.
1224
Love MI, Huber W, Anders S. 2014. Moderated estimation of fold change and dispersion for RNA-
1225
seq data with DESeq2. Genome Biol 15: 550.
1226
Lundberg SM, Erion G, Chen H, DeGrave A, Prutkin JM, Nair B, Katz R, Himmelfarb J, Bansal N,
1227
Lee S-I. 2020. From local explanations to global understanding with explainable AI for trees.
1228
33
Nat Mach Intell 2: 56–67.
1229
Lundberg SM, Lee S-I. 2017. A Unified Approach to Interpreting Model Predictions. In Advances in
1230
Neural Information Processing Systems 30 (eds. I. Guyon, U.V. Luxburg, S. Bengio, H.
1231
Wallach, R. Fergus, S. Vishwanathan, and R. Garnett), pp. 4765–4774, Curran Associates,
1232
Inc. http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-
1233
predictions.pdf.
1234
Madeira F, Park Y mi, Lee J, Buso N, Gur T, Madhusoodanan N, Basutkar P, Tivey ARN, Potter SC,
1235
Finn RD, et al. 2019. The EMBL-EBI search and sequence analysis tools APIs in 2019.
1236
Nucleic Acids Res 47: W636–W641.
1237
Maity SN, de Crombrugghe B. 1998. Role of the CCAAT-binding protein CBF/NF-Y in transcription.
1238
Trends Biochem Sci 23: 174–178.
1239
McLean CY, Bristor D, Hiller M, Clarke SL, Schaar BT, Lowe CB, Wenger AM, Bejerano G. 2010.
1240
GREAT improves functional interpretation of cis-regulatory regions. Nat Biotechnol 28: 495–
1241
501.
1242
Meyer LR, Zweig AS, Hinrichs AS, Karolchik D, Kuhn RM, Wong M, Sloan CA, Rosenbloom KR,
1243
Roe G, Rhead B, et al. 2012. The UCSC Genome Browser database: extensions and updates
1244
2013. Nucleic Acids Res 41: D64–D69.
1245
Park Y, Kellis M. 2015. Deep learning for regulatory genomics. Nat Biotechnol 33: 825–826.
1246
Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A. 2010. Detection of nonneutral substitution rates
1247
on mammalian phylogenies. Genome Res 20: 110–121.
1248
Prescott SL, Srinivasan R, Marchetto MC, Grishina I, Narvaiza I, Selleri L, Gage FH, Swigut T,
1249
Wysocka J. 2015. Enhancer divergence and cis-regulatory evolution in the human and chimp
1250
neural crest. Cell 163: 68–83.
1251
Prouteau A, André C. 2019. Canine Melanomas as Models for Human Melanomas: Clinical,
1252
Histological, and Genetic Comparison. Genes 10.
1253
Quang D, Xie X. 2016. DanQ: a hybrid convolutional and recurrent deep neural network for
1254
quantifying the function of DNA sequences. Nucleic Acids Res 44: e107.
1255
Quinlan AR, Hall IM. 2010. BEDTools: a flexible suite of utilities for comparing genomic features.
1256
Bioinformatics 26: 841–842.
1257
R Core Team. 2018. R: A Language and Environment for Statistical Computing. R Foundation for
1258
Statistical Computing, Vienna, Austria https://www.R-project.org.
1259
R Core Team. 2017. R: A Language and Environment for Statistical Computing. R Foundation for
1260
Statistical Computing, Vienna, Austria https://www.R-project.org.
1261
Rahman MdM, Lai Y, Husna A, Chen H, Tanaka Y, Kawaguchi H, Hatai H, Miyoshi N, Nakagawa T,
1262
Fukushima R, et al. 2019. Transcriptome analysis of dog oral melanoma and its oncogenic
1263
analogy with human melanoma. Oncol Rep. http://www.spandidos-
1264
publications.com/10.3892/or.2019.7391 (Accessed December 18, 2019).
1265
Rambow F, Malek O, Geffrotin C, Leplat J-J, Bouet S, Piton G, Hugot K, Bevilacqua C, Horak V,
1266
Vincent-Naulleau S. 2008. Identification of differentially expressed genes in spontaneously
1267
regressing melanoma using the MeLiM Swine Model: Differential gene expression in swine
1268
melanoma. Pigment Cell Melanoma Res 21: 147–161.
1269
Rambow F, Marine J-C, Goding CR. 2019. Melanoma plasticity and phenotypic diversity: therapeutic
1270
barriers and opportunities. Genes Dev 33: 1295–1318.
1271
Ramírez F, Ryan DP, Grüning B, Bhardwaj V, Kilpert F, Richter AS, Heyne S, Dündar F, Manke T.
1272
2016. deepTools2: a next generation web server for deep-sequencing data analysis. Nucleic
1273
Acids Res 44: W160-165.
1274
Robinson MD, McCarthy DJ, Smyth GK. 2010. edgeR: a Bioconductor package for differential
1275
expression analysis of digital gene expression data. Bioinformatics 26: 139–140.
1276
Rosengren Pielberg G, Golovko A, Sundström E, Curik I, Lennartsson J, Seltenhammer MH, Druml
1277
T, Binns M, Fitzsimmons C, Lindgren G, et al. 2008. A cis-acting regulatory mutation causes
1278
premature hair graying and susceptibility to melanoma in the horse. Nat Genet 40: 1004–
1279
1009.
1280
Schreiber J, Libbrecht M, Bilmes J, Noble WS. 2017. Nucleotide sequence and DNaseI sensitivity are
1281
predictive of 3D chromatin architecture. Bioinformatics
1282
http://biorxiv.org/lookup/doi/10.1101/103614 (Accessed December 18, 2019).
1283
34
Seberg HE, Van Otterloo E, Loftus SK, Liu H, Bonde G, Sompallae R, Gildea DE, Santana JF,
1284
Manak JR, Pavan WJ, et al. 2017. TFAP2 paralogs regulate melanocyte differentiation in
1285
parallel with MITF ed. G.S. Barsh. PLOS Genet 13: e1006636.
1286
Segaoula Z, Primot A, Lepretre F, Hedan B, Bouchaert E, Minier K, Marescaux L, Serres F,
1287
Galiègue-Zouitina S, André C, et al. 2018. Isolation and characterization of two canine
1288
melanoma cell lines: new models for comparative oncology. BMC Cancer 18: 1219.
1289
Seltenhammer MH, Sundström E, Meisslitzer-Ruppitsch C, Cejka P, Kosiuk J, Neumüller J, Almeder
1290
M, Majdic O, Steinberger P, Losert UM, et al. 2014. Establishment and characterization of a
1291
primary and a metastatic melanoma cell line from Grey horses. Vitro Cell Dev Biol - Anim 50:
1292
56–65.
1293
Shain AH, Bastian BC. 2016. From melanocytes to melanomas. Nat Rev Cancer 16: 345–358.
1294
Sherwood RI, Hashimoto T, O’Donnell CW, Lewis S, Barkal AA, van Hoff JP, Karun V, Jaakkola T,
1295
Gifford DK. 2014. Discovery of directional and nondirectional pioneer transcription factors
1296
by modeling DNase profile magnitude and shape. Nat Biotechnol 32: 171–178.
1297
Shlyueva D, Stampfel G, Stark A. 2014. Transcriptional enhancers: From properties to genome-wide
1298
predictions. Nat Rev Genet 15: 272–286.
1299
Shoshan E, Braeuer RR, Kamiya T, Mobley AK, Huang L, Vasquez ME, Velazquez-Torres G,
1300
Chakravarti N, Ivan C, Prieto V, et al. 2016. NFAT1 Directly Regulates IL8 and MMP3 to
1301
Promote Melanoma Tumor Growth and Metastasis. Cancer Res 76: 3145–3155.
1302
Shrikumar A, Greenside P, Kundaje A. 2017. Learning Important Features Through Propagating
1303
Activation Differences. ArXiv170402685 Cs. http://arxiv.org/abs/1704.02685 (Accessed
1304
October 15, 2019).
1305
Shrikumar A, Tian K, Shcherbina A, Avsec Ž, Banerjee A, Sharmin M, Nair S, Kundaje A. 2019. TF-
1306
MoDISco v0.4.2.2-alpha: Technical Note. ArXiv181100416 Cs Q-Bio Stat.
1307
http://arxiv.org/abs/1811.00416 (Accessed December 18, 2019).
1308
Siepel A. 2005. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes.
1309
Genome Res 15: 1034–1050.
1310
Song L, Crawford GE. 2010. DNase-seq: A high-resolution technique for mapping active gene
1311
regulatory elements across the genome from mammalian cells. Cold Spring Harb Protoc 5:
1312
1–12.
1313
Stark A, Lin MF, Kheradpour P, Pedersen JS, Parts L, Carlson JW, Crosby MA, Rasmussen MD, Roy
1314
S, Deoras AN, et al. 2007. Discovery of functional elements in 12 Drosophila genomes using
1315
evolutionary signatures. Nature 450: 219–232.
1316
Sundström E, Komisarczuk AZ, Jiang L, Golovko A, Navratilova P, Rinkwitz S, Becker TS,
1317
Andersson L. 2012. Identification of a melanocyte-specific, microphthalmia-associated
1318
transcription factor-dependent regulatory element in the intronic duplication causing hair
1319
greying and melanoma in horses: A melanocyte-specific regulatory element in the duplicated
1320
sequence causing greying and melanoma in horses. Pigment Cell Melanoma Res 25: 28–36.
1321
Thomas-Chollier M, Herrmann C, Defrance M, Sand O, Thieffry D, van Helden J. 2012. RSAT peak-
1322
motifs: motif analysis in full-size ChIP-seq datasets. Nucleic Acids Res 40: e31–e31.
1323
Thomas-Chollier M, Hufton A, Heinig M, O’Keeffe S, Masri NE, Roider HG, Manke T, Vingron M.
1324
2011. Transcription factor binding predictions using TRAP for the analysis of ChIP-seq data
1325
and regulatory SNPs. Nat Protoc 6: 1860–1869.
1326
van der Weyden L, Patton EE, Wood GA, Foote AK, Brenn T, Arends MJ, Adams DJ. 2016. Cross-
1327
species models of human melanoma. J Pathol 238: 152–165.
1328
van Rooijen E, Fazio M, Zon LI. 2017. From fish bowl to bedside: The power of zebrafish to unravel
1329
melanoma pathogenesis and discover new therapeutics. Pigment Cell Melanoma Res 30: 402–
1330
412.
1331
Verfaillie A, Imrichova H, Atak ZK, Dewaele M, Rambow F, Hulselmans G, Christiaens V,
1332
Svetlichnyy D, Luciani F, Van den Mooter L, et al. 2015. Decoding the regulatory landscape
1333
of melanoma reveals TEADS as regulators of the invasive cell state. Nat Commun 6: 6683–
1334
6683.
1335
Villar D, Berthelot C, Aldridge S, Rayner TF, Lukk M, Pignatelli M, Park TJ, Deaville R, Erichsen
1336
JT, Jasinska AJ, et al. 2015. Enhancer evolution across 20 mammalian species. Cell 160: 554–
1337
566.
1338
35
Wang M, Tai C, E W, Wei L. 2018. DeFine: deep convolutional neural networks accurately quantify
1339
intensities of transcription factor-DNA binding and facilitate evaluation of functional non-
1340
coding variants. Nucleic Acids Res 46: e69.
1341
White RM, Cech J, Ratanasirintrawoot S, Lin CY, Rahl PB, Burke CJ, Langdon E, Tomlinson ML,
1342
Mosher J, Kaufman C, et al. 2011. DHODH modulates transcriptional elongation in the neural
1343
crest and melanoma. Nature 471: 518–522.
1344
White RM, Sessa A, Burke C, Bowman T, LeBlanc J, Ceol C, Bourque C, Dovey M, Goessling W,
1345
Burns CE, et al. 2008. Transparent adult zebrafish as a tool for in vivo transplantation
1346
analysis. Cell Stem Cell 2: 183–189.
1347
Wojciechowska S, van Rooijen E, Ceol C, Patton EE, White RM. 2016. Generation and analysis of
1348
zebrafish melanoma models. Methods Cell Biol 134: 531–549.
1349
Wouters J, Kalender-Atak Z, Minnoye L, Spanier KI, De Waegeneer M, González-Blas CB, Mauduit
1350
D, Davie K, Hulselmans G, Najem A, et al. 2019. Single-cell gene regulatory network
1351
analysis reveals new melanoma cell states and transition trajectories during phenotype
1352
switching. Genomics http://biorxiv.org/lookup/doi/10.1101/715995 (Accessed October 7,
1353
2019).
1354
Xu Min, Ning Chen, Ting Chen, Rui Jiang. 2016. DeepEnhancer: Predicting enhancers by
1355
convolutional neural networks. In 2016 IEEE International Conference on Bioinformatics and
1356
Biomedicine (BIBM), pp. 637–644, IEEE, Shenzhen, China
1357
http://ieeexplore.ieee.org/document/7822593/ (Accessed April 20, 2020).
1358
Zaret KS, Carroll JS. 2011. Pioneer transcription factors: establishing competence for gene
1359
expression. Genes Dev 25: 2227–2241.
1360
Zhou J, Troyanskaya OG. 2015. Predicting effects of noncoding variants with deep learning-based
1361
sequence model. Nat Methods 12: 931–4.
1362
... They have contributed substantially to identify TFBS specific to mammalian interneurons (8,9), fly brain cell types (10), mouse liver cells (11), and mouse embryonic stem cells (12). Furthermore, deep learning models have been applied to predict chromatin accessibility across mammalian brain cell types (13,14), to compare enhancer codes of melanocytes across species (15), and to identify potential enhancer regions linked to the evolution of neocortex expansion and vocal learning (8,16). As these deep learning models allow us to identify enhancer codes in cell type-specific enhancer regions, we hypothesized that they may shed light on cell type conservation across species. ...
... This provides a means to study cell type evolution through changes in candidate enhancers and the impact of genomic variants. As such, our models can be employed for studying how nucleotide changes are associated with cell type specificity as previously shown (10,15). In past studies, we verified that enhancer codes for melanoma states are conserved between mammal and zebrafish cell lines and can be used to identify variants in the genomes of melanoma patients (15,68). ...
... As such, our models can be employed for studying how nucleotide changes are associated with cell type specificity as previously shown (10,15). In past studies, we verified that enhancer codes for melanoma states are conserved between mammal and zebrafish cell lines and can be used to identify variants in the genomes of melanoma patients (15,68). The models presented in this study can be used to complement efforts for studying the impact of genomic variants and their association with mental or cognitive traits and disorders (69). ...
Preprint
Full-text available
Combinations of transcription factors govern the identity of cell types, which is reflected by enhancer codes in cis-regulatory genomic regions. Cell type-specific enhancer codes at nucleotide-level resolution have not yet been characterized for the mammalian neocortex. It is currently unknown whether these codes are conserved in other vertebrate brains, and whether they are informative to resolve homology relationships for species that lack a neocortex such as birds. To compare enhancer codes of cell types from the mammalian neocortex with those from the bird pallium, we generated single-cell multiome and spatially-resolved transcriptomics data of the chicken telencephalon. We then trained deep learning models to characterize cell type-specific enhancer codes for the human, mouse, and chicken telencephalon. We devised three metrics that exploit enhancer codes to compare cell types between species. Based on these metrics, non-neuronal and GABAergic cell types show a high degree of regulatory similarity across vertebrates. Proposed homologies between mammalian neocortical and avian pallial excitatory neurons are still debated. Our enhancer code based comparison shows that excitatory neurons of the mammalian neocortex and the avian pallium exhibit a higher degree of divergence than other cell types. In contrast to existing evolutionary models, the mammalian deep layer excitatory neurons are most similar to mesopallial neurons; and mammalian upper layer neurons to hyper- and nidopallial neurons based on their enhancer codes. In addition to characterizing the enhancer codes in the mammalian and avian telencephalon, and revealing unexpected correspondences between cell types of the mammalian neocortex and the chicken pallium, we present generally applicable deep learning approaches to characterize and compare cell types across species via the genomic regulatory code.
... Next, we compared BOM's performance for multiclass classification to three convolutional neural network (CNN)-based deep learning architectures. These encompassed pure CNN architecture 28,53 and hybrid architectures that combined CNNs with recurrent neural networks (RNNs) 54,55 ; including DeepMEL & DanQ (CNN + LSTM) 54,55 , DeepSTARR (CNNx4 + FCx2) 53 , and Basset (CNNx3 + FCx2) 28 . ...
... Next, we compared BOM's performance for multiclass classification to three convolutional neural network (CNN)-based deep learning architectures. These encompassed pure CNN architecture 28,53 and hybrid architectures that combined CNNs with recurrent neural networks (RNNs) 54,55 ; including DeepMEL & DanQ (CNN + LSTM) 54,55 , DeepSTARR (CNNx4 + FCx2) 53 , and Basset (CNNx3 + FCx2) 28 . ...
Preprint
Full-text available
Deciphering the intricate regulatory code governing cell-type-specific gene expression is a fundamental goal in genetics. Current methods struggle to capture the complex interplay between gene distal regulatory sequences and cell context. We developed a computational approach, BOM (Bag-of-Motifs), which represents cis-regulatory sequences by the type and number of TF binding motifs it contains, irrespective of motif order, orientation, and spacing. This simple yet powerful representation allows BOM to efficiently capture the complexity of cell-type-specific information encoded within these sequences. We apply BOM to mouse, human, and zebrafish distal regulatory regions, demonstrating remarkable accuracy. Notably, the method outperforms more complex deep learning models at the same task using fewer parameters. BOM can also uncover cross-species sequence similarities unrecognized by genome alignments. We experimentally validate our in silico predictions using enhancer reporter assay, showing that motifs with the most significant explanatory power are sequence determinants of cell-type specific enhancer activity. BOM offers a novel systematic framework for studying cell-type or condition-specific cis-regulatory sequences. Using BOM, we demonstrate the existence of a highly predictive sequence code at distal regulatory regions in mammals driven by TF binding motifs.
... Sequence-based machine learning models trained on large-scale genomics data capture complex patterns in the sequence and can predict diverse molecular phenotypes with great accuracy. Recently, convolutional neural networks have demonstrated superior performance over other architectures across most sequence-based problems [3,4,5,6,7,8,9,10,11], sometimes combined with LSTMs [12,13,14,15] or transformer layers [16,17]. ...
Preprint
Full-text available
Foundation models have achieved remarkable success in several fields such as natural language processing, computer vision and more recently biology. DNA foundation models in particular are emerging as a promising approach for genomics. However, so far no model has delivered granular, nucleotide-level predictions across a wide range of genomic and regulatory elements, limiting its practical usefulness. In this paper, we build on our previous work on the Nucleotide Transformer (NT) to develop a segmentation model, SegmentNT, that processes input DNA sequences up to 30kb length to predict 14 different classes of genomics elements at single nucleotide resolution. By utilizing pre-trained weights from NT, SegmentNT surpasses the performance of several ablation models, including convolution networks with one-hot encoded nucleotide sequences and models trained from scratch. SegmentNT can process multiple sequence lengths with zero-shot generalization for sequences of up to 50kb. We show improved performance on the detection of splice sites throughout the genome and demonstrate strong nucleotide-level precision. Because it evaluates all gene elements simultaneously, SegmentNT can predict the impact of sequence variants not only on splice site changes but also on exon and intron rearrangements in transcript isoforms. Finally, we show that a SegmentNT model trained on human genomics elements can generalize to elements of different species and that a trained multispecies SegmentNT model achieves stronger generalization for all genic elements on unseen species. In summary, SegmentNT demonstrates that DNA foundation models can tackle complex, granular tasks in genomics at a single-nucleotide resolution. SegmentNT can be easily extended to additional genomics elements and species, thus representing a new paradigm on how we analyze and interpret DNA. We make our SegmentNT-30kb human and multispecies models available on our github repository in Jax and HuggingFace space in Pytorch.
... Since the amino acid sequences of TF proteins, their DNA binding domains, and intrinsic DNA sequence preferences are typically highly conserved, the sequence preferences of TF binding in one species are predictive of those in closely related species. Accordingly, several computational approaches have been proposed to demonstrate the feasibility of cross-species prediction of regulatory profiles [20][21][22]. However, cross-species TF binding prediction is complicated by the rapid evolutionary turnover of individual TF binding sites across the genomes of different species, even within cell types that have similar functions. ...
Preprint
Cross-species prediction of TF binding remains a major challenge due to the rapid evolutionary turnover of individual TF binding sites, resulting in cross-species predictive performance being consistently worse than within-species performance. In this study, we first propose a novel Nucleotide-Level Deep Neural Network (NLDNN) to predict TF binding within or across species. NLDNN regards the task of TF binding prediction as a nucleotide-level regression task. Beyond predictive performance, we also assess model performance by locating potential TF binding regions, discriminating TF-specific single-nucleotide polymorphisms (SNPs), and identifying causal disease-associated SNPs. Then, we design a dual-path framework for adversarial training of NLDNN to further improve the cross-species prediction performance by pulling the domain space of human and mouse species closer.
Article
The inability to scalably and precisely measure the activity of developmental cis-regulatory elements (CREs) in multicellular systems is a bottleneck in genomics. Here we develop a dual RNA cassette that decouples the detection and quantification tasks inherent to multiplex single-cell reporter assays. The resulting measurement of reporter expression is accurate over multiple orders of magnitude, with a precision approaching the limit set by Poisson counting noise. Together with RNA barcode stabilization via circularization, these scalable single-cell quantitative expression reporters provide high-contrast readouts, analogous to classic in situ assays but entirely from sequencing. Screening >200 regions of accessible chromatin in a multicellular in vitro model of early mammalian development, we identify 13 (8 previously uncharacterized) autonomous and cell-type-specific developmental CREs. We further demonstrate that chimeric CRE pairs generate cognate two-cell-type activity profiles and assess gain- and loss-of-function multicellular expression phenotypes from CRE variants with perturbed transcription factor binding sites. Single-cell quantitative expression reporters can be applied in developmental and multicellular systems to quantitatively characterize native, perturbed and synthetic CREs at scale, with high sensitivity and at single-cell resolution.
Article
Motivation: Supervised deep learning is used to model the complex relationship between genomic sequence and regulatory function. Understanding how these models make predictions can provide biological insight into regulatory functions. Given the complexity of the sequence to regulatory function mapping (the cis-regulatory code), it has been suggested that the genome contains insufficient sequence variation to train models with suitable complexity. Data augmentation is a widely used approach to increase the data variation available for model training; however, current data augmentation methods for genomic sequence data are limited. Results: Inspired by the success of comparative genomics, we show that augmenting genomic sequences with evolutionarily related sequences from other species, which we term phylogenetic augmentation, improves the performance of deep learning models trained on regulatory genomic sequences to predict high-throughput functional assay measurements. Additionally, we show that phylogenetic augmentation can rescue model performance when the training set is down-sampled and permits deep learning on a real-world small dataset, demonstrating that this approach improves data efficiency. Overall, this data augmentation method represents a solution for improving model performance that is applicable to many supervised deep-learning problems in genomics. Availability and implementation: The open-source GitHub repository agduncan94/phylogenetic_augmentation_paper includes the code for rerunning the analyses here and recreating the figures.
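The idea of phylogenetic augmentation described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `phylo_augment`, the `homologs` dictionary layout (reference sequence first, then sequences from related species) and the `p_swap` parameter are all assumptions made for illustration. The training label is assumed to stay attached to the sequence identifier, so swapping in a homolog changes only the input.

```python
import random


def phylo_augment(seq_id, homologs, rng, p_swap=0.5):
    """Return the reference sequence or, with probability p_swap, a homolog.

    homologs maps a sequence id to a list whose first entry is the
    reference-species sequence and whose remaining entries are
    evolutionarily related sequences from other species (hypothetical layout).
    """
    candidates = homologs[seq_id]
    original = candidates[0]
    others = candidates[1:]
    # Only swap when at least one homolog is available.
    if others and rng.random() < p_swap:
        return rng.choice(others)
    return original


# Toy data: one enhancer with two homologs, one without any.
toy_homologs = {
    "enh1": ["ACGT", "ACGA", "ACTT"],
    "enh2": ["GGGG"],
}
rng = random.Random(0)
sampled = phylo_augment("enh1", toy_homologs, rng)
```

Calling the function once per training example per epoch would expose the model to a different mix of species each time, which is one plausible way to apply the augmentation during training.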
Article
In the mammalian liver, hepatocytes exhibit diverse metabolic and functional profiles based on their location within the liver lobule. However, it is unclear whether this spatial variation, called zonation, is governed by a well-defined gene regulatory code. Here, using a combination of single-cell multiomics, spatial omics, massively parallel reporter assays and deep learning, we mapped enhancer-gene regulatory networks across mouse liver cell types. We found that zonation affects gene expression and chromatin accessibility in hepatocytes, among other cell types. These states are driven by the repressors TCF7L1 and TBX3, alongside other core hepatocyte transcription factors, such as HNF4A, CEBPA, FOXA1 and ONECUT1. To examine the architecture of the enhancers driving these cell states, we trained a hierarchical deep learning model called DeepLiver. Our study provides a multimodal understanding of the regulatory code underlying hepatocyte identity and their zonation state that can be used to engineer enhancers with specific activity levels and zonation patterns.
Article
Transcriptional enhancers act as docking stations for combinations of transcription factors (TFs) and thereby regulate spatiotemporal activation of their target genes. It has been a long-standing goal in the field to decode the regulatory logic of an enhancer and to understand the details of how spatiotemporal gene expression is encoded in an enhancer sequence. Here, we show that deep learning models can be used to efficiently design synthetic, cell type specific enhancers, starting from random sequences, and that this optimization process allows for a detailed tracing of enhancer features at single-nucleotide resolution. We evaluate the function of fully synthetic enhancers to specifically target Kenyon cells or glial cells in the fruit fly brain using transgenic animals. We further exploit enhancer design to create "dual-code" enhancers that target two cell types, and minimal enhancers smaller than 50 base pairs that are fully functional. By examining the state space searches towards local optima, we characterise enhancer codes through the strength, combination, and arrangement of TF activator and TF repressor motifs. Finally, we apply the same strategies to successfully design human enhancers, which adhere to similar enhancer rules as Drosophila enhancers. Enhancer design guided by deep learning leads to better understanding of how enhancers work and shows that their code can be exploited to manipulate cell states.
Preprint
Recently, Hi-C has been used to probe the 3D chromatin architecture of multiple organisms and cell types. The resulting collections of pairwise contacts across the genome have connected chromatin architecture to many cellular phenomena, including replication timing and gene regulation. However, high resolution (10 kb or finer) contact maps remain scarce due to the expense and time required for collection. A computational method for predicting pairwise contacts without the need to run a Hi-C experiment would be invaluable in understanding the role that 3D chromatin architecture plays in genome biology. We describe Rambutan, a deep convolutional neural network that predicts Hi-C contacts at 1 kb resolution using nucleotide sequence and DNaseI assay signal as inputs. Specifically, Rambutan identifies locus pairs that engage in high confidence contacts according to Fit-Hi-C, a previously described method for assigning statistical confidence estimates to Hi-C contacts. We first demonstrate Rambutan’s performance across chromosomes at 1 kb resolution in the GM12878 cell line. Subsequently, we measure Rambutan’s performance across six cell types. In this setting, the model achieves an area under the receiver operating characteristic curve between 0.7662 and 0.8246 and an area under the precision-recall curve between 0.3737 and 0.9008. We further demonstrate that the predicted contacts exhibit expected trends relative to histone modification ChIP-seq data, replication timing measurements, and annotations of functional elements such as promoters and enhancers. Finally, we predict Hi-C contacts for 53 human cell types and show that the predictions cluster by cellular function. [NOTE: After our original submission we discovered an error in our calling of statistically significant contacts.
Briefly, when calculating the prior probability of a contact, we used the number of contacts at a certain genomic distance in a chromosome but divided by the total number of bins in the full genome. When we corrected this mistake we noticed that the Rambutan model, as it currently stands, did not outperform simply using the GM12878 contact map that Rambutan was trained on as the predictor in other cell types. While we investigate these new results, we ask that readers treat this manuscript skeptically.]
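The denominator mismatch described in the note can be made concrete with a small arithmetic sketch. The note does not give the corrected formula, so this example assumes the consistent choice is to divide by the number of bins in the same chromosome. All counts are hypothetical, and `contact_prior` is an illustrative helper, not a function from Rambutan or Fit-Hi-C.

```python
def contact_prior(contacts_at_distance, n_bins):
    """Prior probability of a contact: contacts at one distance per bin."""
    if n_bins <= 0:
        raise ValueError("n_bins must be positive")
    return contacts_at_distance / n_bins


# Hypothetical counts, chosen only to show the size of the effect.
contacts = 500           # contacts at one genomic distance in one chromosome
chrom_bins = 50_000      # bins in that chromosome (assumed correct denominator)
genome_bins = 3_000_000  # bins in the full genome (the mismatched denominator)

mismatched = contact_prior(contacts, genome_bins)
consistent = contact_prior(contacts, chrom_bins)

# The mismatched prior is too small by the ratio genome_bins / chrom_bins.
deflation = consistent / mismatched
```

Under these assumed numbers the prior is understated by a factor of about 60. An understated prior makes observed contacts look more surprising than they are, which is consistent with the note's statement that the error affected the calling of statistically significant contacts.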
Article
Melanoma cells can switch between a melanocytic and a mesenchymal-like state. Scattered evidence indicates that additional intermediate state(s) may exist. Here, to search for such states and decipher their underlying gene regulatory network (GRN), we studied 10 melanoma cultures using single-cell RNA sequencing (RNA-seq) as well as 26 additional cultures using bulk RNA-seq. Although each culture exhibited a unique transcriptome, we identified shared GRNs that underlie the extreme melanocytic and mesenchymal states and the intermediate state. This intermediate state is corroborated by a distinct chromatin landscape and is governed by the transcription factors SOX6, NFATC2, EGR3, ELF1 and ETV4. Single-cell migration assays confirmed the intermediate migratory phenotype of this state. Using time-series sampling of single cells after knockdown of SOX10, we unravelled the sequential and recurrent arrangement of GRNs during phenotype switching. Taken together, these analyses indicate that an intermediate state exists and is driven by a distinct and stable ‘mixed’ GRN rather than being a symbiotic heterogeneous mix of cells.
Article
‘Pioneer’ transcription factors are required for stem-cell pluripotency, cell differentiation and cell reprogramming [1,2]. Pioneer factors can bind nucleosomal DNA to enable gene expression from regions of the genome with closed chromatin. SOX2 is a prominent pioneer factor that is essential for pluripotency and self-renewal of embryonic stem cells [3]. Here we report cryo-electron microscopy structures of the DNA-binding domains of SOX2 and its close homologue SOX11 bound to nucleosomes. The structures show that SOX factors can bind and locally distort DNA at superhelical location 2. The factors also facilitate detachment of terminal nucleosomal DNA from the histone octamer, which increases DNA accessibility. SOX-factor binding to the nucleosome can also lead to a repositioning of the N-terminal tail of histone H4 that includes residue lysine 16. We speculate that this repositioning is incompatible with higher-order nucleosome stacking, which involves contacts of the H4 tail with a neighbouring nucleosome. Our results indicate that pioneer transcription factors can use binding energy to initiate chromatin opening, and thereby facilitate nucleosome remodelling and subsequent transcription.
Article
Tree-based machine learning models such as random forests, decision trees and gradient boosted trees are popular nonlinear predictive models, yet comparatively little attention has been paid to explaining their predictions. Here we improve the interpretability of tree-based models through three main contributions. (1) A polynomial time algorithm to compute optimal explanations based on game theory. (2) A new type of explanation that directly measures local feature interaction effects. (3) A new set of tools for understanding global model structure based on combining many local explanations of each prediction. We apply these tools to three medical machine learning problems and show how combining many high-quality local explanations allows us to represent global structure while retaining local faithfulness to the original model. These tools enable us to (1) identify high-magnitude but low-frequency nonlinear mortality risk factors in the US population, (2) highlight distinct population subgroups with shared risk characteristics, (3) identify nonlinear interaction effects among risk factors for chronic kidney disease and (4) monitor a machine learning model deployed in a hospital by identifying which features are degrading the model’s performance over time. Given the popularity of tree-based machine learning models, these improvements to their interpretability have implications across a broad set of domains. Tree-based machine learning models are widely used in domains such as healthcare, finance and public services. The authors present an explanation method for trees that enables the computation of optimal local explanations for individual predictions, and demonstrate their method on three medical datasets.
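The game-theoretic explanations mentioned in this abstract are Shapley values. The sketch below computes them by brute-force enumeration of feature subsets, directly from the definition. It is exponential in the number of features and is not the polynomial-time tree algorithm the abstract refers to. The feature names and the `toy_model` value function are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(features, value):
    """Exact Shapley values by enumerating all subsets (exponential cost).

    value(frozenset_of_features) returns the model output when only
    those features are considered present.
    """
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):  # subset sizes 0 .. n-1
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of f when added to subset s.
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


def toy_model(present):
    """Hypothetical risk score with one interaction term."""
    score = 0.0
    if "age" in present:
        score += 2.0
    if "bmi" in present:
        score += 1.0
    if "age" in present and "bmi" in present:
        score += 1.0  # interaction effect, split evenly by Shapley values
    return score


phi = shapley_values(["age", "bmi"], toy_model)
```

In this toy case the attributions sum to the difference between the full-model output and the empty-model output, and the interaction term is shared equally between the two features.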
Article
Dogs have been considered an excellent immunocompetent model for human melanoma because the tumors arise at the same locations and share common clinical and pathological features with human melanoma. However, the differences in the melanoma transcriptome between the two species have not yet been fully determined. Considering the role of oncogenes in melanoma development, in this study we first characterized the transcriptome of canine oral melanoma and then compared it with that of human melanoma. The global transcriptomes of 8 canine oral melanoma samples and 3 healthy oral tissues were compared by RNA-Seq followed by RT-qPCR validation. The results revealed 2,555 annotated differentially expressed genes, as well as 364 novel differentially expressed genes. Dog chromosomes 1 and 9 were enriched with downregulated and upregulated genes, respectively. Among the upregulated differentially expressed genes, 10 significant transcription site binding motifs were identified, with the NF-κB and ATF1 binding motifs being the most significant, along with 4 significant unknown motifs. Moreover, canine oral melanoma shared >80% of its significant oncogenes (upregulated genes) with human melanoma, and JAK-STAT was the most common significant pathway between the species. The results identified a 429-gene signature in melanoma that was upregulated in both species; these genes may be good candidates for therapeutic development. Furthermore, this study demonstrates that, as regards oncogene expression, human melanoma contains an oncogene group that bears similarities with dog oral melanoma, which supports the use of dogs as a model for the development of novel therapeutics and experimental trials before human application.
Article
An incomplete view of the mechanisms that drive metastasis, the primary cause of cancer-related death, has been a major barrier to development of effective therapeutics and prognostic diagnostics. Increasing evidence indicates that the interplay between microenvironment, genetic lesions, and cellular plasticity drives the metastatic cascade and resistance to therapies. Here, using melanoma as a model, we outline the diversity and trajectories of cell states during metastatic dissemination and therapy exposure, and highlight how understanding the magnitude and dynamics of nongenetic reprogramming in space and time at single-cell resolution can be exploited to develop therapeutic strategies that capitalize on nongenetic tumor evolution.
Preprint
The arrangement of transcription factor (TF) binding motifs (syntax) is an important part of the cis-regulatory code, yet remains elusive. We introduce a deep learning model, BPNet, that uses DNA sequence to predict base-resolution ChIP-nexus binding profiles of pluripotency TFs. We develop interpretation tools to learn predictive motif representations and identify soft syntax rules for cooperative TF binding interactions. Strikingly, Nanog preferentially binds with helical periodicity, and TFs often cooperate in a directional manner, which we validate using CRISPR-induced point mutations. Our model represents a powerful general approach to uncover the motifs and syntax of cis-regulatory sequences in genomics data. Highlights: The neural network BPNet accurately predicts TF binding data at base-resolution. Model interpretation discovers TF motifs and TF interactions dependent on soft syntax. Motifs for Nanog and partners are preferentially spaced at ∼10.5 bp periodicity. Directional cooperativity is validated: Sox2 enhances Nanog binding, but not vice versa.
Article
The majority of common variants associated with common diseases, as well as an unknown proportion of causal mutations for rare diseases, fall in noncoding regions of the genome. Although catalogs of noncoding regulatory elements are steadily improving, we have a limited understanding of the functional effects of mutations within them. Here, we perform saturation mutagenesis in conjunction with massively parallel reporter assays on 20 disease-associated gene promoters and enhancers, generating functional measurements for over 30,000 single nucleotide substitutions and deletions. We find that the density of putative transcription factor binding sites varies widely between regulatory elements, as does the extent to which evolutionary conservation or integrative scores predict functional effects. These data provide a powerful resource for interpreting the pathogenicity of clinically observed mutations in these disease-associated regulatory elements, and comprise a rich dataset for the further development of algorithms that aim to predict the regulatory effects of noncoding mutations.
Article
Despite recent genetic advances and numerous ongoing therapeutic trials, malignant melanoma remains fatal, and prognostic factors as well as more efficient treatments are needed. The development of such research strongly depends on the availability of appropriate models recapitulating all the features of human melanoma. The concept of comparative oncology, with the use of spontaneous canine models, has recently acquired a unique value as a translational model. Canine malignant melanomas are naturally occurring cancers presenting striking homologies with human melanomas. As for many other cancers, dogs present surprising breed predispositions and a higher frequency of certain subtypes per breed. Oral melanomas, which are much more frequent and highly severe in dogs, and cutaneous melanomas with severe digital forms or uveal subtypes present relevant homologies with their human counterparts, thus constituting close models for these human melanoma subtypes. This review addresses how canine and human melanoma subtypes compare based on their epidemiological, clinical, histological, and genetic characteristics, and how comparative oncology approaches can provide insights into rare and poorly characterized melanoma subtypes in humans that are frequent and breed-specific in dogs. We propose canine malignant melanomas as models for rare non-UV-induced human melanomas, especially mucosal melanomas. Naturally affected dogs offer the opportunity to decipher the genetics at both germline and somatic levels and to explore therapeutic options, with the dog entering preclinical trials as human patients, benefiting both dogs and humans.
Preprint
Prioritization of non-coding genome variation benefits from explainable AI to predict and interpret the impact of a mutation on gene regulation. Here we apply a specialized deep learning model to phased melanoma genomes and identify functional enhancer mutations with allelic imbalance of chromatin accessibility and gene expression.