Cross-species analysis of enhancer logic using deep learning

July 2020
Genome Research 30(12):gr.260844.120

DOI:10.1101/gr.260844.120

License
CC BY 4.0

Liesbeth Minnoye1,2,#, Ibrahim Ihsan Taskiran1,2,#, David Mauduit1,2, Maurizio Fazio4,5, Linde Van Aerschot1,2,3, Gert Hulselmans1,2, Valerie Christiaens1,2, Samira Makhzami1,2, Monika Seltenhammer6,7, Panagiotis Karras8,9, Aline Primot10, Edouard Cadieu10, Ellen van Rooijen4,5, Jean-Christophe Marine8,9, Giorgia Egidy11, Ghanem-Elias Ghanem12, Leonard Zon4,5, Jasper Wouters1,2, and Stein Aerts1,2,*.

1. VIB-KU Leuven Center for Brain & Disease Research, Leuven, Belgium.
2. KU Leuven, Department of Human Genetics KU Leuven, Leuven, Belgium.
3. Laboratory for Disease Mechanisms in Cancer, KU Leuven, Leuven, Belgium.
4. Howard Hughes Medical Institute, Stem Cell Program and the Division of Pediatric Hematology/Oncology, Boston Children’s Hospital and Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA 02115, USA.
5. Department of Stem Cell and Regenerative Biology, Harvard Stem Cell Institute, Cambridge, MA 02138, USA.
6. Center for Forensic Medicine, Medical University of Vienna, Vienna, Austria.
7. Division of Livestock Sciences (NUWI) - BOKU University of Natural Resources and Life Sciences, Gregor-Mendel-Straße 33, 1180 Vienna, Austria.
8. VIB-KU Leuven Center for Cancer Biology, Leuven, Belgium.
9. KU Leuven, Department of Oncology KU Leuven, Leuven, Belgium.
10. CNRS-University of Rennes 1, UMR6290, Institute of Genetics and Development of Rennes, Faculty of Medicine, Rennes, France.
11. Université Paris-Saclay, INRA, AgroParisTech, GABI, 78350, Jouy-en-Josas, France.
12. Institut Jules Bordet, Université Libre de Bruxelles, Brussels, Belgium.

# equal contribution

* corresponding author

stein.aerts@kuleuven.vib.be

Laboratory of Computational Biology

Herestraat 49, P.O. Box 602

3000 Leuven, Belgium

Running title: Melanoma enhancer logic

Keywords: Epigenomics, machine learning, transcriptional regulation, melanoma

Abstract

Deciphering the genomic regulatory code of enhancers is a key challenge in biology as this code underlies cellular identity. A better understanding of how enhancers work will improve the interpretation of non-coding genome variation, and empower the generation of cell type specific drivers for gene therapy. Here we explore the combination of deep learning and cross-species chromatin accessibility profiling to build explainable enhancer models. We apply this strategy to decipher the enhancer code in melanoma, a relevant case study due to the presence of distinct melanoma cell states. We trained and validated a deep learning model, called DeepMEL, using chromatin accessibility data of 26 melanoma samples across six different species. We demonstrate the accuracy of DeepMEL predictions on the CAGI5 challenge, where it significantly outperforms existing models on the melanoma enhancer of IRF4. Next, we exploit DeepMEL to analyse enhancer architectures and identify accurate transcription factor binding sites for the core regulatory complexes in the two different melanoma states, with distinct roles for each transcription factor, in terms of nucleosome displacement or enhancer activation. Finally, DeepMEL identifies orthologous enhancers across distantly related species where sequence alignment fails, and the model highlights specific nucleotide substitutions that underlie enhancer turnover. DeepMEL can be used from the Kipoi database to predict and optimise candidate enhancers, and to prioritise enhancer mutations. In addition, our computational strategy can be applied to other cancer or normal cell types.

Introduction

A cell’s phenotype arises from the expression of a unique set of genes, which is regulated through the binding of transcription factors (TFs) to cis-regulatory regions, such as promoters and enhancers. Deciphering gene regulatory programs entails mapping the network of TFs and cis-regulatory regions that governs the identity of a given cell type, as well as understanding how the specificity of such a network is encoded in the DNA sequence of genomic enhancers. Profiling accessible chromatin via DNase I hypersensitive sequencing (DNase-seq) or via the Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) represents a useful approach for identifying putative enhancers (Buenrostro et al. 2013; Klemm et al. 2019; Song and Crawford 2010). Indeed, active enhancers are typically depleted of one or more nucleosomes, due to the binding of TFs. Initial changes in DNA accessibility can be facilitated through a special class of TFs that bind with high affinity to their recognition sites and that have a long residence time at the enhancer, sometimes referred to as pioneer TFs (Klemm et al. 2019; Zaret and Carroll 2011). By displacing nucleosomes or thermodynamically outcompeting nucleosome binding, they allow other TFs to co-bind, thereby further stabilising the nucleosome-depleted region and/or actively enhancing transcription of target genes (Grossman et al. 2018; Jacobs et al. 2018; Dodonova et al. 2020).

As the presence and architecture of TF binding sites within enhancers determine which TFs can bind with high affinity, understanding this ‘enhancer logic’ can help interpret the functional role of enhancers within a gene regulatory network. Several techniques exist to study the enhancer code, including (1) motif discovery tools (Imrichová et al. 2015; Janky et al. 2014; Bailey et al. 2009; Heinz et al. 2010; Thomas-Chollier et al. 2011, 2012); (2) comparative genomics (Ballester et al. 2014; Prescott et al. 2015; Villar et al. 2015); (3) genetic screens (Gasperini et al. 2019; Kircher et al. 2019); and (4) machine learning techniques (Park and Kellis 2015). The latter in particular has seen a strong boost in recent years with the advent of large training sets derived from genome-wide profiling. Three pivotal methods based on deep learning include DeepBind (Alipanahi et al. 2015), DeepSEA (Zhou and Troyanskaya 2015) and Basset (Kelley et al. 2016), the first convolutional neural networks (CNNs) applied to genomics data (Eraslan et al. 2019). Since their emergence in the genomics field, machine learning techniques, and especially CNNs, have been applied to model a range of regulatory aspects, including cross-species enhancer predictions (Quang and Xie 2016; Xu Min et al. 2016; Chen et al. 2018), TF binding sites (Wang et al. 2018; Avsec et al. 2019b), DNA methylation (Angermueller et al. 2017) and 3D chromatin architecture (Schreiber et al. 2017).

Deciphering gene regulation and the underlying enhancer code is not only important during dynamic processes such as development, but also in disease contexts such as cancer, where gene regulatory networks are typically misregulated due to mutations. Particularly in melanoma, a type of skin cancer that develops from melanocytes, gene expression is misregulated and highly plastic (Shain and Bastian 2016; Rambow et al. 2019). This gives rise to two main melanoma cell states: the melanocytic (MEL) state, which still resembles the cell-of-origin, expressing high levels of the melanocyte-lineage specific transcription factors MITF, SOX10 and TFAP2A, as well as typical pigmentation genes such as DCT, TYR, PMEL, and MLANA; and the mesenchymal-like (MES) state, in which the cells are more invasive and therapy resistant, expressing high levels of genes involved in TGFB signaling and epithelial-to-mesenchymal transition (EMT)-related genes (Hoek et al. 2006, 2008; Rambow et al. 2019; Verfaillie et al. 2015; Wouters et al. 2019). These transcriptomic differences have also been studied at the epigenomics level, with AP-1 and TEAD factors as master regulators of the MES state and binding sites for SOX10 and MITF significantly enriched in MEL-specific regulatory regions (Bravo González-Blas et al. 2019; Verfaillie et al. 2015; Wouters et al. 2019). However, it remains unclear how these regulatory states are encoded in particular enhancer architectures, and whether such architectures are evolutionarily conserved. Besides human cell lines and human patient-derived cultures, several animal models have been established in melanoma research, including mouse, pig, horse, dog and zebrafish (Egidy et al. 2008; van Rooijen et al. 2017; Segaoula et al. 2018; Seltenhammer et al. 2014; van der Weyden et al. 2016; Prouteau and André 2019). Although these models are widely used, it is unknown whether their enhancer landscapes and regulatory programs are conserved with human. Here, we take advantage of these animal model systems and combine cross-species chromatin accessibility profiling with deep learning, to investigate enhancer logic in melanoma.

Results

Melanoma chromatin accessibility landscapes are conserved across species

We profiled chromatin accessibility using ATAC-seq on a collection of melanoma cell lines across six species, for a total of 26 samples (Fig. 1A). These include 16 human patient-derived cultures (“MM lines”) (Gembarska et al. 2012; Verfaillie et al. 2015), one mouse cell line (Dankort et al. 2009), primary melanoma cells from the pig melanoma model MeLiM (“MeLiM”) (Egidy et al. 2008), two horse melanoma lines derived from a Grey Lipizzaner horse (“HoMel-L1”) and from an Arabian horse (“HoMel-A1”) (Seltenhammer et al. 2014), two dog melanoma cell lines from oral and uveal sites: “Dog-OralMel-18249” and “Dog-IrisMel-14205” respectively (Cani-DNA BRC: https://dog-genetics.genouest.org) and four melanoma lines established from zebrafish (“ZMEL1”, “EGFP-121-1”, “EGFP-121-5” and “EGFP-121-3”) (White et al. 2008, 2011). Per sample, between 65,475 and 176,695 ATAC-seq peaks were called, with distinct levels of conservation of accessibility across the species (Fig. 1A, S1A). The difference in the number of peaks across the samples is due, on the one hand, to genome size (Fig. S1B), and on the other hand to data quality (measured as the fraction of reads in peaks (FRiP)) (Fig. S1C).

Unsupervised clustering of the 16 human lines revealed two distinct groups (Fig. S1D), which correspond to the two main cell states in human melanoma, i.e. the melanocytic state (MEL) and mesenchymal-like state (MES), as was further confirmed for most of the cell lines by previously generated RNA-seq data (Fig. S1E) (Verfaillie et al. 2015) and corroborated by previous studies using epigenomics data (Verfaillie et al. 2015; Wouters et al. 2019). Indeed, regulatory regions near MEL-specific genes such as SOX10 are accessible in human lines in the MEL state (MM001, MM011, MM031, MM034, MM052, MM057, MM074, MM087, MM118, MM122 and MM164), whereas they are closed in MES melanoma lines (MM029, MM099, MM116, MM163, and MM165) (Fig. 1B). Of note, as in Wouters et al., we observed heterogeneity between samples of the MEL state (Fig. S1D).

To enable the comparison of chromatin accessibility between human and other species, we first identified regulatory regions that are alignable (i.e. have a high sequence similarity) between species using the liftOver tool (at least 10% of bases must remap) (Meyer et al. 2012). When such an alignable region contains an ATAC-seq peak in the compared species, we refer to it as a ‘conserved accessible’ region. Between 1.1% and 40.9% of the ATAC-seq regions in non-human lines were conserved accessible in human (Fig. 1C) and between 0.9% and 18.4% of the human peaks were conserved accessible in the other species (Fig. S1F). Accordingly, we identified 303,392 alignable and 10,592 conserved accessible regions across all mammalian species. This number decreases when including zebrafish, to 29,619 alignable regions and only 116 conserved accessible regions. Nearly half of the 10,592 conserved accessible mammalian regions were promoters within 1 kb of a transcription start site (Fig. S1G). Indeed, high conservation of proximal promoters has previously been reported (Villar et al. 2015). In each of the mammalian species, the 10,592 conserved accessible regions were more accessible compared to all ATAC-seq regions; in addition, they show a higher ChIP-seq signal for acetylation of histone H3 at lysine 27 (H3K27ac) in human, a mark for active regulatory regions (Creyghton et al. 2010) (Fig. S1H,I), and higher sequence conservation compared to alignable regions as measured by phastCons and phyloP (Fig. S1J) (Pollard et al. 2010; Siepel 2005). Note, nevertheless, that although ATAC-seq regions are nucleosome-depleted and often bound by several TFs, they are not necessarily active enhancers, as accessibility does not directly translate to enhancer activity (Shlyueva et al. 2014).

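The three region categories used here (not alignable, alignable, conserved accessible) can be illustrated with a minimal Python sketch. It assumes that the liftOver step has already been run and that its result is available as a mapping from region name to coordinates in the compared genome, with None for regions that failed to remap. The function names, region names and coordinates are hypothetical and are not taken from the authors' pipeline.

```python
from typing import Dict, List, Optional, Tuple

# (chromosome, start, end) with half-open coordinates
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if two half-open intervals on the same chromosome share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def classify_region(lifted: Optional[Interval], target_peaks: List[Interval]) -> str:
    """Label one region given its lifted-over coordinates (None if it did not remap)."""
    if lifted is None:
        return "not alignable"
    if any(overlaps(lifted, peak) for peak in target_peaks):
        return "conserved accessible"
    return "alignable"


# Toy input: hypothetical dog peaks already lifted over to human coordinates.
lifted_regions: Dict[str, Optional[Interval]] = {
    "dog_peak_1": ("chr6", 100, 600),
    "dog_peak_2": ("chr6", 5000, 5500),
    "dog_peak_3": None,  # failed to remap
}
# Toy ATAC-seq peaks called in the human samples.
human_peaks: List[Interval] = [("chr6", 450, 900), ("chr1", 5000, 5500)]

labels = {
    name: classify_region(interval, human_peaks)
    for name, interval in lifted_regions.items()
}
print(labels)
```

A linear scan over all peaks is enough for a toy example; for genome-scale peak sets an interval index or a sorted sweep would be the more practical choice.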

Next, we examined whether the MEL and MES melanoma states are conserved in the other species of our cohort. Clustering all mammalian samples based on the accessibility of the 303,392 alignable regions (Fig. S1K), or of all samples (including zebrafish) using the 29,619 alignable regions (Fig. 1D), revealed two axes of variation between the samples, namely (i) the evolutionary variation between species and (ii) the distinction between the melanoma states. All human MEL samples clustered together with 9 of the 10 non-human lines, indicating that most of the non-human cell lines are epigenomically similar to the human MEL lines. On the other hand, the dog cell line Dog-IrisMel-14205 clustered together with the human MES samples, indicating that Dog-IrisMel-14205 belongs to the MES state. This classification of melanoma samples was reflected in their accessibility at known MEL and MES regulatory regions such as the intronic enhancer of MLANA, a MEL-specific gene involved in melanosome biogenesis (De Mazière et al. 2002), and an enhancer upstream of MMP3, a gene that increases metastatic potential in melanoma cell lines (Shoshan et al. 2016) (Fig. 1E). Note that classifying the cross-species samples based on a principal component analysis (PCA) of only the conserved accessible regions (i.e. without species-specific or clade-specific peaks) clearly revealed the MEL-MES distinction, whereas the species variation was less pronounced (Fig. S1L,M).

In conclusion, by using ATAC-seq on a panel of 26 melanoma lines across six species, conserved accessible regulatory regions could be identified. These regions allowed clustering of the melanoma samples into two groups which correspond to the two main melanoma cell states, indicating conservation of the MES melanoma state in dog and the MEL melanoma state in pig, mouse, horse, dog and even zebrafish melanoma samples.

Figure 1. Comparative epigenomics reveals conservation of two main melanoma states. (A) Evolutionary relationship between the six studied species, represented by a phylogenetic tree (NCBI taxonomy tree). ATAC-seq profiles of the 26 melanoma cell lines are shown for three regulatory regions. (B) ATAC-seq profiles of the human melanoma lines for the SOX10 locus. Lines are coloured by the melanocytic (MEL, in blue) or mesenchymal-like (MES, in orange) melanoma state. (C) Total number of ATAC-seq regions observed across all samples of a species, coloured based on whether they are not alignable, alignable or conserved accessible in human. (D) PCA clustering based on the accessibility of the 29,619 alignable regions across all six species. (E) ATAC-seq profiles of MEL and MES lines of different species for an intronic MLANA enhancer and the upstream region of MMP3.

Conservation of transcription factor motifs in state-specific enhancers

Next, we investigated whether TF binding motifs that are specific to the MEL and MES states are conserved across species. To this end, we performed differential motif enrichment between MEL and MES accessible regions for human and dog, as these were the two species in our cohort for which cell lines of both states were identified above. Differential peak calling (log2FC > 2.5 and pAdj < 0.0005), followed by motif enrichment using HOMER (Heinz et al. 2010), revealed a highly similar enrichment of SOX, TFAP2 family, E-box, RUNX and ETS TF binding motifs in both the human and dog MEL-specific peaks (Fig. 2A,B) (complete HOMER output in Supplementary Table 1). The enriched motifs of the TFAP2 family can most likely be linked to TFAP2A, because this is a master regulator in human melanocytes and melanoma (Seberg et al. 2017). Similarly, the observed E-box and SOX motifs most likely represent MITF and SOX10, respectively, as they are among the previously reported master regulators in human MEL lines (Bravo González-Blas et al. 2019; Hoek et al. 2006; Verfaillie et al. 2015; Wouters et al. 2019). Likewise, motif enrichment in the MES regions is very similar between human and dog, revealing AP-1 and TEAD motifs as most highly enriched (Fig. 2A,B), corroborating earlier findings (Verfaillie et al. 2015). Together, these observations indicate that the MEL and MES melanoma cell states are conserved in dog and that they are likely governed by the same master regulators, based on the concordance of motif enrichment.

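The differential peak thresholds quoted above (log2FC > 2.5 and pAdj < 0.0005) amount to a simple filter, sketched below in Python. The record layout and the example values are invented for illustration. Treating a fold change below the negated cutoff as MES-specific assumes the contrast was computed as MEL over MES, which the text does not state.

```python
from typing import Dict, List, Optional

LOG2FC_CUTOFF = 2.5   # threshold from the text
PADJ_CUTOFF = 0.0005  # threshold from the text


def call_state(log2fc: float, padj: float) -> Optional[str]:
    """Assign a peak to MEL or MES, or None if it does not pass both thresholds.

    Assumption: log2fc is the MEL-over-MES contrast, so strongly negative
    values mark MES-specific peaks.
    """
    if padj >= PADJ_CUTOFF:
        return None
    if log2fc > LOG2FC_CUTOFF:
        return "MEL"
    if log2fc < -LOG2FC_CUTOFF:
        return "MES"
    return None


# Hypothetical differential accessibility results.
peaks: List[Dict] = [
    {"id": "peak_a", "log2fc": 3.1, "padj": 1e-6},
    {"id": "peak_b", "log2fc": -4.0, "padj": 2e-5},
    {"id": "peak_c", "log2fc": 2.9, "padj": 0.01},   # fails the pAdj cutoff
    {"id": "peak_d", "log2fc": 1.0, "padj": 1e-9},   # fails the fold-change cutoff
]

calls = {p["id"]: call_state(p["log2fc"], p["padj"]) for p in peaks}
print(calls)
```

The two resulting peak sets would then serve as target and background for the motif enrichment step.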

To further verify the importance of the MEL-specific master regulators in MEL cell lines of the remaining four species, we applied a different strategy since we could not contrast MEL and MES lines for horse, pig, mouse and zebrafish. We analyzed 9,732 regions that are conserved accessible across all mammalian MEL lines to identify conserved TF binding sites. We scanned these regions using the cisTarget motif collection (v8) (Janky et al. 2014; Imrichová et al. 2015; Herrmann et al. 2012) containing 20,003 TF position-weight matrices (PWMs) and used a branch length score (BLS) to calculate the level of evolutionary conservation of each TF binding motif (Fig. 2C), a strategy applied before in other systems (Jacobs et al. 2018; Stark et al. 2007). Among the 4% most conserved motifs were SP1, ETS, SOX, CTCF, MITF and TFAP2A motifs (Fig. 2D). The top conserved motifs were members of the SP/KLF TF family, which bind to GC-rich motifs in promoters (Dynan and Tjian 1983). Indeed, 47% of the 9,732 conserved accessible regions in mammalian MEL lines are proximal promoters (<= 1 kbp from TSS). BLS scoring on the remaining 5,196 more distal conserved accessible regions revealed similar highly conserved motifs, except for SP/KLF TF family motifs, indicating that distal regions, such as enhancers, mostly contain the state-specific TF binding motifs. In the 113 conserved accessible regions across the MEL cell lines of all six species, BLS scoring again revealed SOX, ETS, MITF and TFAP2A motifs among the most conserved motifs (Fig. 2E).

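This section does not spell out how the branch length score is computed, so the Python sketch below follows one common formulation: the BLS of a motif hit is the branch length of the smallest subtree connecting all species in which the hit is present, divided by the total branch length of the tree. The tree topology, node names and branch lengths are made up for illustration and do not correspond to the phylogeny used in the paper.

```python
from typing import Dict, Iterable, Set, Tuple

# Toy rooted tree: child -> (parent, branch length). All values are hypothetical.
TREE: Dict[str, Tuple[str, float]] = {
    "human": ("euarch", 0.1),
    "mouse": ("euarch", 0.3),
    "dog": ("laur", 0.2),
    "horse": ("laur", 0.15),
    "euarch": ("root", 0.05),
    "laur": ("root", 0.05),
}


def leaves_below(tree: Dict[str, Tuple[str, float]]) -> Dict[str, Set[str]]:
    """Map every non-root node to the set of leaf species in its subtree."""
    parents = {parent for parent, _ in tree.values()}
    below: Dict[str, Set[str]] = {node: set() for node in tree}
    for leaf in (node for node in tree if node not in parents):
        node = leaf
        while node in tree:  # walk up until the root is reached
            below[node].add(leaf)
            node = tree[node][0]
    return below


def branch_length_score(tree: Dict[str, Tuple[str, float]],
                        conserved: Iterable[str]) -> float:
    """Fraction of total branch length spanned by the species carrying the motif."""
    conserved_set = set(conserved)
    below = leaves_below(tree)
    total = sum(length for _, length in tree.values())
    spanned = 0.0
    for node, (_, length) in tree.items():
        inside = conserved_set & below[node]
        # A branch is part of the connecting subtree only if conserved species
        # lie on both sides of it.
        if inside and inside != conserved_set:
            spanned += length
    return spanned / total


print(branch_length_score(TREE, ["human", "mouse", "dog", "horse"]))
print(branch_length_score(TREE, ["human", "mouse"]))
```

Summing this score for one motif over a set of conserved accessible regions gives a per-motif conservation value of the kind shown in Fig. 2D,E, although the normalisation applied in the paper is not described in this section.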

In conclusion, two independent strategies of motif analysis suggest conservation of TF binding sites for known melanoma master regulators, with conserved SOX10, MITF, TFAP2A and ETS TF family motif enrichment in MEL enhancers across all six studied species.

Figure 2. Conservation of binding motifs of master regulators of MEL and MES melanoma states. (A, B) Heatmap of differential ATAC-seq regions when comparing (A) human MEL versus human MES lines and (B) the MEL dog line ‘Dog-OralMel-18249’ versus the MES dog line ‘Dog-IrisMel-14205’ (two biological replicates each), coloured by normalised ATAC-seq signal. Enriched TF binding motifs in the differential peaks were identified via HOMER (Heinz et al. 2010) and the first logo of enriched TF families is shown. The ratio of the percentage of target and background sequences with the motif is indicated between brackets, as well as the rank of the TF class within the HOMER output (#). (C) Schematic overview of cross-species motif analysis using the branch length score (BLS) as a measure for the evolutionary conservation of a motif hit across conserved accessible regions. The BLS was summed across a set of conserved accessible regions. (D, E) Histogram of the normalised summed BLS score for 20,003 motifs on (D) 9,732 conserved accessible regions across the mammalian MEL lines and on (E) 113 conserved accessible regions across MEL lines of all six species. The first hit of the top recurrent TF binding motifs within the top 4% conserved motifs is indicated as a cross and is accompanied by the logo of the motif.

Deep neural network DeepMEL reveals nucleotide-resolution enhancer logic

While motif enrichment can predict candidate regulators, we sought to build a more comprehensive model of the MEL enhancers that would allow cross-species predictions and in-depth analysis of enhancer architecture. To this end, we trained a deep learning (DL) model on the human ATAC-seq data. First, to construct an unsupervised training set, we clustered all 339,099 human ATAC-seq peaks using cisTopic, a probabilistic framework to analyse scATAC-seq data that can also be applied to bootstrapped bulk ATAC-seq data (Bravo González-Blas et al. 2019) (see Methods), into 24 ‘topics’ or sets of co-accessible regions (Fig. 3A, Fig. S2A,B). This provided a nuanced classification, with topic 4 and topic 7 representing the MEL- and MES-specific enhancers, respectively, being accessible across all MEL or MES samples (Fig. 3A, Fig. S2C). In addition, we found two topics with regions that are generally accessible across all cell lines (topic 1 and topic 19) (Fig. 3A, S2C). These ubiquitously accessible regions are highly enriched for proximal promoters (Fig. S2D) and for known promoter-specific TF binding motifs linked to SP and NFY TF families (Fig. S2C) (Dynan and Tjian 1983; Maity and de Crombrugghe 1998). Other topics were more specific to one or a small group of cell lines (Fig. 3A). We verified the biological relevance of these topics by Gene Ontology (GO) enrichment of flanking genes using GREAT (McLean et al. 2010). Genes near topic 4 regions are significantly enriched for GO terms such as pigmentation (FDR = 1.95 × 10^-8) and neural crest cell differentiation (FDR = 4.26 × 10^-7), whereas genes near topic 7 regions were enriched for GO terms involved in cell-cell adhesion (1.56 × 10^-13). Motif discovery on the top regions assigned to each topic confirmed enrichment of SOX, ETS, TFAP2A and MITF motifs in the MEL topic regions (topic 4) and AP-1 in the MES topic (topic 7) (Fig. S2C). An example topic 4 region in the promoter of the SOX10 target gene MIA (Graf et al. 2014) is shown in Figure 3B, as well as two topic 7 regions upstream of SERPINE1, a gene expressed in metastatic melanoma (Klein et al. 2012).

Using the 24 topics as classes, we trained a multi-class, multi-label classifier using a neural network, called “DeepMEL” (Fig. 3C). As input, we used the forward and reverse complement of 500 bp sequences centered on the ATAC-seq summit. As topology, we used the DanQ CNN-RNN hybrid architecture (Quang and Xie 2016) consisting of 4 main layers: a convolution layer to discover local patterns in sequential data, followed by a max-pooling layer to reduce the dimensionality of the data and generalise the model effectively, a bidirectional recurrent layer (LSTM) to detect long-range dependencies of the local patterns discovered in the first layer, and finally a fully-connected (dense) layer just before the output layer to help the classification after the feature extraction layers (Fig. 3C). Note that several hyperparameters, including the number and size of the convolutional filters and the length of the input DNA sequence, were optimised to yield the final model (Fig. S3; Supplementary Note 1). After successful training of DeepMEL (area under the receiver operating characteristic curve (auROC) = 0.863 and area under the precision recall curve (auPR) = 0.374 on test data for topic 4 regions) (Fig. 3D,E; Fig. S4A), we used the weights of the neurons from the convolutional filters to extract local patterns learned by the model. We transformed these convolution filters into PWMs and determined the importance of each filter for each topic (see Methods). Filters that represent SOX, MITF, TFAP2A, and RUNX motifs were most relevant for the MEL-specific topic 4 and filters that represent AP-1, TEAD and RUNX binding sites were assigned to the MES-specific topic 7 (Fig. 3F). Thus, DeepMEL learned the relevant features de novo from the sequence. Note that the 3,885 regions classified as MEL-specific in MM001 (topic 4 scores above threshold of 0.16 (see Methods)) were not only highly accessible in MEL lines and closed in MES lines (Fig. S4B), but were also accessible in human melanocytes (Fig. S4C), indicating that MEL-specific melanoma regions are not cancer-specific but already accessible in their cell-of-origin, i.e. the melanocytes. As a consequence, we can potentially extrapolate the observations on this topic to normal melanocyte enhancers. Although in the remainder of this work we will score accessible regions to identify functional enhancers, it is also possible to score the entire genome, without filtering for ATAC-seq peaks (Fig. S4D).

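The input preparation described above (a 500 bp window centred on the ATAC-seq summit, presented as forward and reverse-complement strands) can be sketched in plain Python. This covers only the sequence encoding, not the network itself. The helper names, the handling of N bases and of windows that run off the sequence, and the toy sequence are assumptions made for illustration; how the two strands are combined inside DeepMEL is not specified in this section.

```python
from typing import List, Optional

WINDOW = 500  # input length stated in the text
BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def summit_window(sequence: str, summit: int, width: int = WINDOW) -> Optional[str]:
    """Return `width` bases centred on a 0-based summit, or None if the window does not fit."""
    start = summit - width // 2
    end = start + width
    if start < 0 or end > len(sequence):
        return None
    return sequence[start:end]


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper- or lower-case DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def one_hot(seq: str) -> List[List[int]]:
    """Encode DNA as rows of [A, C, G, T] indicators; any other symbol becomes all zeros."""
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]


# Toy 1,200 bp "chromosome" with a hypothetical summit at position 600.
chrom = "ACGT" * 300
fwd = summit_window(chrom, summit=600)
rev = reverse_complement(fwd)
x_fwd, x_rev = one_hot(fwd), one_hot(rev)
print(len(x_fwd), len(x_fwd[0]), len(x_rev))
```

In practice these nested lists would be converted to a numeric array of shape (500, 4) per strand before being passed to a convolutional layer.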

To examine the TF binding site architecture within enhancers, we used a model interpretation tool, DeepExplainer (Lundberg and Lee 2017; Lundberg et al. 2020; Avsec et al. 2019b). For a MEL enhancer located in the 4th intron of IRF4, nucleotides important for classifying this enhancer as topic 4 emerge as motifs for SOX10, MITF, TFAP2A and RUNX factors (Fig. 3G top two rows; see Fig. S4E,F for another example).

It is known that enhancer accessibility does not directly translate to enhancer activity (Shlyueva et al. 2014). To test whether the same TF binding motifs contribute to the activity of MEL enhancers, we used the IRF4 enhancer as case study. For this enhancer, Kircher et al. performed saturation mutagenesis followed by an in vitro massively parallel reporter assay (MPRA), testing the effect of every possible single nucleotide mutation on enhancer activity (Fig. 3G, 3rd row). The most deleterious mutations coincided with the DeepMEL-predicted SOX, E-box and RUNX-like motifs, overlapping with nucleotides that also have the strongest in silico effect (Fig. 3G, last row), indicating that the predicted motifs actually contribute to enhancer activity. In addition, the magnitude of the in silico predicted effect correlates highly with the effect of the in vitro mutations (Spearman’s correlation of 0.60) (Fig. 3G,H). These observations indicate that, although DeepMEL was trained to predict binary enhancer accessibility, it is also a good predictor of the activity of this specific enhancer. DeepMEL predictions outperform other classifiers and deep learning models that were benchmarked in Kircher et al. (CAGI challenge, 2018) (Fig. 3I). One possible explanation for this improvement is that DeepMEL uses more nuanced topics (Fig. 3I, black bar) rather than the ATAC-seq signal of the different MM lines as labels (Fig. 3I, white bar). Note that enhancer accessibility and activity can be influenced not only by mutations that break a motif for an activating TF, but also by the creation of a repressor binding motif, as was for instance the case for the SNP rs12203592 (Fig. S3G; Fig. S4G).

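The in silico counterpart of saturation mutagenesis can be sketched as follows: every position is substituted with each of the three alternative bases, the sequence is rescored, and the change relative to the wild type is recorded. In this Python sketch the scoring function is a deliberately trivial, hypothetical stand-in (it counts a made-up motif) where a trained model's topic 4 prediction would be used; the rank correlation helper is a plain-Python Spearman implementation for comparing predicted and measured effects, shown here on toy numbers only.

```python
from typing import Callable, Dict, List, Sequence, Tuple

BASES = "ACGT"


def saturation_mutagenesis(seq: str,
                           score: Callable[[str], float]) -> Dict[Tuple[int, str], float]:
    """Return {(position, alt_base): score(mutant) - score(wild type)} for all substitutions."""
    ref_score = score(seq)
    deltas: Dict[Tuple[int, str], float] = {}
    for pos, ref in enumerate(seq):
        for alt in BASES:
            if alt == ref:
                continue
            mutant = seq[:pos] + alt + seq[pos + 1:]
            deltas[(pos, alt)] = score(mutant) - ref_score
    return deltas


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def toy_score(seq: str) -> float:
    """Hypothetical stand-in for a model output: counts occurrences of a made-up motif."""
    return float(seq.count("ACAAAG"))


wild_type = "TTACAAAGTT"
deltas = saturation_mutagenesis(wild_type, toy_score)
disruptive = sorted(pos for (pos, _), delta in deltas.items() if delta < 0)
print(len(deltas), sorted(set(disruptive)))

# Toy paired values, standing in for predicted versus measured effects.
rho = spearman([0.1, 0.4, 0.2], [1.0, 3.0, 2.0])
```

With a real model, `deltas` would be aligned with the MPRA measurements for the same substitutions before computing the correlation.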

In conclusion, DeepMEL, trained on topics of human co-accessible regions, performs well in classifying melanoma regulatory regions into different classes based purely on the DNA sequence. Features learned by DeepMEL correspond to TF binding motifs of master regulators of specific classes. These motifs can also be located and visualised within regions using a model interpretation tool, allowing examination of the motif architecture within specific enhancers and prediction of the effect of mutations on enhancer accessibility.

Figure 3. DeepMEL classifies melanoma enhancers and predicts important TF binding motifs. (A) Cell-topic heatmap of cisTopic applied to 339,099 ATAC-seq regions across the 16 human melanoma lines, coloured by normalised topic scores. ‘029*’ refers to ‘MM029_R2’. (B) Examples of a MEL-specific (topic 4) region near MIA and MES-specific (topic 7) regions upstream of SERPINE1. (C) Schematic overview of DeepMEL. 24 topics or sets of co-accessible regions were used as input for training of a multi-class multi-label neural network. (D,E) Receiver operating characteristic curve (D) and precision-recall curve (E) for DeepMEL on training, test and shuffled data of topic 4 and topic 7 regions. (F) Top enriched filters learned by DeepMEL to classify regions as MEL (topic 4) or MES (topic 7). Normalised filter importance is shown per filter. (G) Example of a MEL-predicted enhancer near IRF4. (first and second row) DeepExplainer view of the forward and reverse strand, with the height of the nucleotides indicating the importance for prediction of the MEL enhancer. (third row) In vitro effect of point mutations on enhancer activity as measured by MPRA (Kircher et al. 2019). Colours represent the nucleotide to which the wild type nucleotide is mutated. (bottom row) In silico effect of point mutations as predicted by DeepMEL. (H) Correlation between the in vitro mutational effects on the IRF4 enhancer and the in silico mutagenesis predictions. (I) Performance of variant effect prediction of DeepMEL using topics (black bar, model used in this paper) or using ATAC-seq signal (white bar), and several previously tested models on the IRF4 enhancer case (Kircher et al. 2019).

Cross-species scoring identifies orthologous melanoma enhancers

Next, we asked whether the human-trained model DeepMEL can be used to predict MEL and MES enhancers in other species. We started with the dog genome as a test case, because the differential ATAC-seq peaks between the MEL (Dog-OralMel-18249) and MES (Dog-IrisMel-14205) dog cell lines can serve as true positives (Fig. 4A). Note that DeepMEL reached similar performance in human and dog for predicting MEL and MES regions, and this accuracy is significantly higher than that of cis-regulatory module (CRM) scoring with PWMs (Fig. 4A). Having confirmed that the human model can identify enhancers in the dog genome, we predicted MEL and MES enhancers across all six species. This furthermore allowed us to order all samples according to the MEL-MES axis (Fig. 4B). Between 2,093 and 5,400 MEL enhancers were predicted, and between 7,459 and 10,743 MES enhancers, in samples of the MEL and MES state, respectively (Fig. 4B). Note that the majority of these enhancers could not have been detected using whole genome alignments (liftOver) (Fig. S5A-E). Of note, predicted MEL enhancers in the pig melanoma cells (MeLiM) were similarly accessible in pig melanocytes (Fig. S5F), again indicating that MEL melanoma enhancers can be used as a model for melanocyte enhancers.

Next, we compared the occurrence of MEL enhancers between species, in relation to putative target genes. In particular, we looked at enhancers located near a set of 379 human genes that are specifically expressed in the MEL state (see Methods). Of these 379 genes, 217 (67%) had at least one MEL-predicted enhancer within 200 kb up- or downstream of the gene. Between 70% and 85% of the orthologous MEL genes in other species had at least one MEL enhancer within 200 kb up- or downstream of the gene (Fig. S5G). Note that only a small subset of these enhancers could have been found using liftOver (2-43% depending on the species). Of these genes, 32 form a core set of conserved MEL-specific genes throughout all species including zebrafish, each having a MEL enhancer nearby. Examples of genes in the core set are MITF, PMEL and TYRP1, genes known to be involved in melanocyte development, melanosome formation and melanin production (D’Mello et al. 2016).
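
The gene-centred counting described above can be sketched as follows. This is only an illustration under stated assumptions: genes are given as (chromosome, start, end) intervals, enhancers are reduced to a midpoint, and the search window is taken to be the gene body extended by 200 kb on each side. The function name and the toy coordinates are hypothetical; the actual assignment procedure is described in the Methods.

```python
from bisect import bisect_left


def genes_with_nearby_enhancer(genes, enhancers, flank=200_000):
    """Return the names of genes with at least one enhancer midpoint inside
    the gene body extended by `flank` bp on each side.

    genes: {name: (chrom, start, end)}
    enhancers: iterable of (chrom, midpoint)
    """
    by_chrom = {}
    for chrom, mid in enhancers:
        by_chrom.setdefault(chrom, []).append(mid)
    for mids in by_chrom.values():
        mids.sort()

    hits = set()
    for name, (chrom, start, end) in genes.items():
        mids = by_chrom.get(chrom, [])
        # First enhancer at or beyond the left edge of the window.
        i = bisect_left(mids, start - flank)
        if i < len(mids) and mids[i] <= end + flank:
            hits.add(name)
    return hits


# Toy coordinates, not real annotations.
toy_genes = {
    "geneA": ("chr1", 1_000_000, 1_050_000),
    "geneB": ("chr1", 5_000_000, 5_020_000),
    "geneC": ("chr2", 300_000, 350_000),
}
toy_enhancers = [("chr1", 1_240_000), ("chr1", 4_700_000), ("chr2", 100_500)]
covered = genes_with_nearby_enhancer(toy_genes, toy_enhancers)
fraction = len(covered) / len(toy_genes)
```

Sorting the midpoints once per chromosome keeps each gene lookup logarithmic, which matters when thousands of predicted regions are compared against hundreds of genes in several species.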

A long-standing question in enhancer studies is how to compare enhancers with each other if their sequences do not align (Arunachalam et al. 2010; Cliften et al. 2001). Here we tackle this question by using the dense layer of DeepMEL as a reduced dimensional space to calculate the correlation between enhancers. Using this measure we found that MEL-predicted enhancers in proximity of orthologous MEL genes are significantly more similar to each other than MEL-predicted enhancers in proximity of different MEL genes within the same species (Fig. 4C), than redundant (or shadow (Hong et al. 2008)) enhancers linked to the same MEL gene in a species, and than random non-MEL ATAC-seq peaks near homologous MEL genes (Fig. S5H). Altogether, this supports the idea that MEL enhancers near orthologous genes are indeed orthologous enhancers.
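
As a minimal sketch of this alignment-free comparison, each enhancer can be represented by a vector of dense-layer activations and pairs of enhancers compared by Pearson's correlation. The vectors and names below are invented for illustration; in practice they would be read out of the trained network, and the length of the vector depends on the architecture.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def best_match(query, candidates):
    """Return (name, r) for the candidate vector most correlated with `query`."""
    scored = {name: pearson(query, vec) for name, vec in candidates.items()}
    name = max(scored, key=scored.get)
    return name, scored[name]


# Hypothetical dense-layer vectors for one human enhancer and two candidates.
human_enhancer = [0.1, 0.9, 0.2, 0.8]
candidates = {
    "candidate_near_ortholog": [0.2, 0.8, 0.1, 0.9],
    "candidate_unrelated": [0.9, 0.1, 0.8, 0.2],
}
name, r = best_match(human_enhancer, candidates)
```

Because the comparison is made in the learned feature space rather than on nucleotides, two sequences can score as similar even when they cannot be aligned.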

Lastly, we studied an example of a MEL enhancer in more detail, namely the enhancer near ERBB3. DeepMEL predicts a MEL enhancer upstream of or within an intron of ERBB3 in each of the mammalian species; these enhancers were also found by liftOver of the human ERBB3 enhancer (Fig. 4D II). However, in the zebrafish genome, liftOver was unable to identify the homologous region, whereas DeepMEL predicted two MEL enhancers, one upstream of the TSS of erbb3b and another in the first intron. Both zebrafish enhancers were highly correlated with the human ERBB3 enhancer (deep layer Pearson’s correlation of 0.812 and 0.797 for the upstream and intronic zebrafish enhancer, respectively), suggesting that both enhancers are orthologous to the human ERBB3 enhancer. Applying DeepExplainer to the multiple-aligned sequences revealed a conserved motif architecture in the orthologous mammalian ERBB3 enhancers, each containing three SOX motifs and one TFAP2A motif (Fig. 4D III). Note that in mouse, one SOX binding site was lost; mouse is also the mammalian species that is most distant from human among the mammals included in this study (Fig. 4D I). The two zebrafish enhancers have a highly similar motif architecture, suggesting that they arose by duplication from a common ancestor enhancer.

In conclusion, we showed that DeepMEL is able to identify MEL- and MES-specific enhancers in different species, which allows studying evolutionary events and enhancer logic within orthologous enhancers, even in distant species such as zebrafish.

Figure 4. Human-trained deep learning model applied to cross-species ATAC-seq data. (A) Performance of DeepMEL and Cluster-Buster (cbust) in classifying MEL and MES differential peaks in human and dog. (B) Percentage of MEL and MES predicted ATAC-seq regions across all samples in our cohort and in human melanocytes. Samples are ordered according to the ratio of the number of MES / MEL predicted regions. (C) Pearson’s correlation of deep layer scores between MEL-predicted regions near orthologous MEL genes between human and another species (‘Human-Species’) or between MEL-predicted regions near different MEL genes within one species (‘Species-Species’). P-values of unpaired two-sample Wilcoxon tests are reported. (D) (I) Evolutionary distance between human and other species in branch length units. (II) ATAC-seq profiles of the ERBB3 locus in the six species. MEL-specific enhancers that were predicted by DeepMEL and that were also found (grey) or not found (green) via liftOver of the human MEL enhancer are highlighted. (III) DeepExplainer plots for the multiple-aligned MEL-predicted ERBB3 enhancers. Red and blue dots represent point mutations and indels, respectively.

Motif architecture of the MEL enhancer

To study the architecture of MEL enhancers in more detail, including motif composition, motif order and distance, and relationships to the position of nucleosomes, we set out to obtain high-confidence motif annotations in each of the 3,885 MEL enhancers in human (MM001, the most MEL-like human cell line), for each of the predicted core regulatory factors (SOX10, MITF, TFAP2A, RUNX). To achieve this, we devised an optimised motif scoring method that obtains precise positions of TF binding motifs by multiplying DeepMEL activation scores of convolutional filters (i.e. motifs) with the DeepExplainer profile of each enhancer (Fig. 5A) (see Methods) (Shrikumar et al. 2019).
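
The idea of weighting filter activations by nucleotide importance can be sketched as below. This is an assumption-laden illustration and not the procedure from the Methods: it assumes one activation value per window start for a given filter, one importance value per nucleotide, a combination rule in which the activation is multiplied by the summed importance under the filter window, and an arbitrary threshold. All names and numbers are hypothetical.

```python
def motif_hits(activations, importance, filter_len, threshold):
    """Return (start, combined_score) for windows passing the threshold.

    activations[i]: filter activation for the window starting at position i
    importance[j]:  per-nucleotide attribution value at position j
    """
    hits = []
    for start, activation in enumerate(activations):
        window = importance[start:start + filter_len]
        # Assumed combination rule: activation times summed importance.
        combined = activation * sum(window)
        if combined >= threshold:
            hits.append((start, combined))
    return hits


# Toy values: only the window starting at position 1 is both strongly
# activated and supported by high nucleotide importance.
toy_activations = [0.0, 2.0, 0.5]
toy_importance = [0.0, 0.25, 0.25, 0.5, 0.0]
hits = motif_hits(toy_activations, toy_importance, filter_len=3, threshold=1.0)
```

The point of combining the two signals is that a filter can match a sequence that the model does not actually use for its prediction; requiring high importance at the same position removes such matches.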

The first observation was that each MEL enhancer contains at least one SOX10 motif hit, and often two or more (Fig. 5B). This suggests that SOX10 plays a central role in MEL enhancer accessibility. Indeed, knock-down (KD) of SOX10 in MM001 significantly decreases the accessibility of MEL enhancers (Fig. S6A), and the regions that close after SOX10-KD are highly enriched for SOX motifs (NES = 28.5), possibly revealing a pioneering role of SOX10 in MEL enhancers. Next to SOX motifs, a combination of one or multiple TFAP2A, MITF or RUNX-like motif hits was present in 84% of the MEL-predicted enhancers (Fig. 5B). Next, to facilitate a systematic study of the MEL enhancer logic, we binarised the motif-region matrix to simplify the region clustering (Fig. 5C). We obtained 8 different enhancer classes, each with a different motif composition (Fig. 5C). As validation of the clusters and the predicted TF binding sites, we used human ChIP-seq data of SOX10, MITF and TFAP2A in melanoma or melanocytes (Laurette et al. 2015; Seberg et al. 2017) (Fig. 5D). All clusters were indeed highly bound by SOX10, validating the prevalence of the SOX10 motif in MEL enhancers. In contrast, MITF and TFAP2A ChIP-seq data revealed that MITF and TFAP2A bind, respectively, more to enhancers with MITF and TFAP2A sites compared to regions without a predicted MITF or TFAP2A site. Note that these observations indicate that the MEL enhancer architecture does not entail indirect DNA binding of the core regulatory factors, since MITF and TFAP2A are only bound when their motifs are present within the enhancer. We further observed that regions containing a TFAP2A site, next to the SOX10 site(s) and possible others, showed a modest increase in accessibility (Fig. S6B), which could be in line with the previously described role of TFAP2A as a stabiliser of nucleosome-depleted regions (Grossman et al. 2018). The opposite was true for regions containing RUNX-like binding sites (Fig. S6B), suggesting a repressive role of RUNX factors. The presence of a MITF site did not seem to alter the accessibility of enhancers compared to SOX-only enhancers, but did increase H3K27ac signal (Fig. S6C), possibly indicating that MEL enhancers bound by MITF are more active.
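
The binarisation step can be illustrated with a short sketch that turns per-region motif hit counts into presence/absence classes. The region names and counts are invented, and the function names are hypothetical. Under the reading that SOX10 is present in every region, the three remaining factors can each be present or absent, giving 2^3 = 8 possible classes, which matches the number of classes reported above.

```python
from collections import Counter

FACTORS = ("SOX10", "TFAP2A", "MITF", "RUNX")


def enhancer_class(hit_counts):
    """Binarise motif hit counts: keep only which factors have at least one hit."""
    return tuple(f for f in FACTORS if hit_counts.get(f, 0) > 0)


def class_sizes(regions):
    """Count regions per binarised motif class.

    regions: {region_name: {factor: number_of_significant_hits}}
    """
    return Counter(enhancer_class(counts) for counts in regions.values())


# Toy motif-region matrix with hypothetical regions.
toy_regions = {
    "region_1": {"SOX10": 2, "MITF": 1},
    "region_2": {"SOX10": 1, "MITF": 3},
    "region_3": {"SOX10": 1, "TFAP2A": 1, "RUNX": 2},
    "region_4": {"SOX10": 4},
}
sizes = class_sizes(toy_regions)
```

Binarising discards the number of hits per factor, so regions with one or with several sites for the same factor fall into the same class; this is what makes the classes few enough to compare across species.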

To validate these MEL enhancer classes in other species, we applied the same motif scoring and binarisation to DeepMEL-predicted MEL regions in the other species in our cohort. MEL enhancers in other species also clustered into the same 8 clusters, with a similar distribution of regions per cluster (Fig. 5E,F; Fig. S6D). In addition, liftOver of the clusters showed that the regions of a human cluster correspond preferentially to the same cluster in the other species (Fig. S6E), indicating conservation of the MEL enhancer clusters across species. For instance, the dog orthologs of two human MEL enhancers belonging to either the [SOX10 + MITF] cluster (intronic enhancer of CD9) or to the cluster containing [SOX10 + TFAP2A + RUNX] (intronic enhancer of STIM1) (Fig. 5E) were part of the corresponding clusters in dog (Fig. 5F).

Altogether, these data suggest a COre Regulatory Complex (CoRC) (Arendt et al. 2016) of SOX10, TFAP2A, MITF and RUNX factors in regulating melanoma MEL enhancers, encoded by a mixed enhancer model (Long et al. 2016), with high flexibility in the combination of binding sites for these four TFs, but with some rigidity (or hierarchy) in the code as at least one SOX10 dimer site is required.

Figure 5. COre Regulatory Complex of MEL melanoma enhancers. (A) Schematic overview of motif scoring method in which extended convolutional filter hits from DeepMEL are multiplied by DeepExplainer profiles to yield significant motif hits. (B,C) Heatmap (B) and binarised heatmap (C) of the number of significant SOX, TFAP2A, MITF and RUNX-like motif hits on the 3,885 MEL-predicted regions in the human cell line MM001. (D) Aggregation plot of normalised ChIP-seq signal of SOX10, MITF and TFAP2A on the human enhancer clusters. (E,F) Venn diagram of region clusters on (E) the 3,885 MEL-predicted regions in human (in MM001) and (F) the 4,194 MEL-predicted regions in dog (in Dog-OralMel-18249). Example MEL-predicted enhancers in human and dog are shown for two of the region clusters. The ATAC-seq signal of the regions is shown in grey.

Putative roles of SOX10 as pioneer and TFAP2A as stabiliser in melanoma MEL enhancers

As the previous results suggested a pioneering function for SOX10 and a stabiliser function for TFAP2A, we wanted to further investigate these putative roles and how they mechanistically affect chromatin accessibility. First, we analysed the location of binding sites relative to the position of the nucleosome, focusing on a human and a dog MEL enhancer that contain a combination of one SOX10 and one TFAP2A site (Fig. 6A,B). We predicted the nucleosome start and middle point using a previously published model (Kaplan et al. 2009) and observed that SOX10 binding sites are situated within the borders of the nucleosome, near the nucleosome start point, whereas TFAP2A binding occurs preferentially near the center of the nucleosome (Fig. 6A,B). KD of TFAP2A halved the accessibility of this specific human region, whereas SOX10-KD completely abolished the ATAC-seq peak (Fig. 6A), indicating that SOX10 is necessary for accessibility, and that TFAP2A further increases the accessibility, which is in line with our previous observations (Fig. S6A,B).

These example enhancers suggested an interesting positional preference of SOX10 and TFAP2A. To assess whether this occurs globally, we centered human MEL enhancers on the SOX10 and TFAP2A motif hits and calculated the aggregated location of the nucleosome start and middle point (Fig. 6C-E). SOX10 shows a consistent preference for binding within the nucleosome borders, around 40 bp away from the nucleosome start point (Fig. 6D). Other pioneering factors have also been shown to bind near the borders of the nucleosome, for instance FOX factors, which bind around 60 bp from the center of the nucleosome, displacing linker histones and destabilising the central nucleosome (Grossman et al. 2018; Iwafuchi-Doi et al. 2016). On the other hand, when centering the MEL regions based on the TFAP2A motif, we did not observe a strong preference in the location of the nucleosome start point relative to the TFAP2A binding site (Fig. 6D); instead, TFAP2A consistently binds in a wide range on and around the nucleosome middle point (Fig. 6E). Stabilisers, such as NFIB, have been reported to directly compete with the central nucleosomes to stabilise the accessible chromatin configuration (Denny et al. 2016; Grossman et al. 2018). Centering based on the SOX10 or TFAP2A motif hit revealed protection from Tn5 cutting at important nucleotides of the dimer motif (Fig. S7A,B). We did not observe strong positional preferences of MITF and RUNX motifs relative to the nucleosome start or middle point (Fig. S7C,D).
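
The centering and aggregation step can be sketched as follows, assuming that each region comes with a per-base profile (for example predicted nucleosome start-point scores) and the position of one motif hit within that region. The function name and toy profiles are hypothetical. The sketch averages the profile at each offset relative to the motif, skipping offsets that fall outside a region.

```python
def aggregate_centered(profiles, centers, half_width):
    """Average per-base profiles after aligning each one on its own center.

    profiles: list of per-base score lists, one per region
    centers:  index of the motif hit within each region
    Returns mean scores for offsets -half_width .. +half_width.
    """
    size = 2 * half_width + 1
    totals = [0.0] * size
    counts = [0] * size
    for profile, center in zip(profiles, centers):
        for k in range(size):
            pos = center - half_width + k
            if 0 <= pos < len(profile):
                totals[k] += profile[pos]
                counts[k] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


# Two toy regions whose signal peaks sit exactly on the motif position.
toy_profiles = [[0, 0, 1, 0, 0], [0, 1, 0, 0, 0]]
toy_centers = [2, 1]
mean_profile = aggregate_centered(toy_profiles, toy_centers, half_width=1)
```

A consistent positional preference shows up as a peak at a fixed offset in the aggregated profile, whereas centering on an unrelated reference point, such as the ATAC-seq summit, would be expected to give a flatter curve.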

Altogether, these data suggest that SOX10 functions as a pioneer in the CoRC of MEL enhancers, leading to their accessibility by binding to the central nucleosome, near the nucleosome start point. On the other hand, TFAP2A appears to act as a stabiliser of SOX-dependent nucleosome-depleted regions by binding around the nucleosome middle point, possibly competing with the central nucleosome.

Figure 6. Positional specificity of SOX10 and TFAP2A in MEL melanoma enhancers. (A,B) (first row) Example human (A) and dog (B) MEL-predicted enhancer containing significant SOX10 and TFAP2A motifs. The ATAC-seq signal is shown in grey. (second row) Imputed nucleosome start and middle point profiles. (bottom row) For the human example region, ATAC-seq profiles of MM001 in control condition, after 72 h of SOX10 knock-down or TFAP2A knock-down are shown. (C) Schematic overview of the nucleosome structure explaining the colours used in (D,E). (D,E) Nucleosome start point (D) and nucleosome middle point predictions (E) on MEL-predicted regions containing one SOX10 (left) or one TFAP2A motif (right) next to possible other motifs, where the regions are either centered on the ATAC-seq summit (grey) or on the SOX10 or TFAP2A motif (blue).

DeepMEL predicts evolutionary changes in MEL enhancer accessibility and activity

To further validate our findings on the MEL enhancer logic, we compared motif architectures between species, and investigated how turnover of TF binding sites affects enhancer accessibility and function. To this end, we compared pairs of highly probable orthologous MEL enhancers that are only accessible in one of the species (Fig. S8A) (see Methods). For example, an enhancer upstream of APPL2 is predicted as a MEL enhancer in the dog line Dog-OralMel-18249 (topic 4 DL score of 0.35), whereas the orthologous enhancer in human is not accessible (Fig. 7A). Not only the accessibility of the human homolog was lost, but also its activity, as we confirmed by a luciferase assay (Fig. 7B). The topic 4 DeepMEL score for this enhancer was approximately six times lower in human compared to dog (0.06 in human versus 0.35 in dog) (Fig. 7C), falling below the topic 4 significance threshold of 0.16, indicating that the model detected critical changes in the human enhancer sequence that could explain the loss of accessibility and activity of this MEL enhancer. The functional dog enhancer contains a SOX10, a MITF and a TFAP2A binding site, which are all affected by substitutions in the non-functional human homologous sequence and might therefore be causal for the loss in accessibility (and activity) (Fig. 7D,E). The SOX10 motif mutation had the strongest effect, as it caused a 45% drop in the MEL-prediction score (Fig. 7D).
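
The per-substitution analysis can be illustrated with a small sketch that introduces each observed difference between two orthologous sequences into the reference sequence one at a time and records the change in score. It assumes equal-length, ungapped aligned sequences and a generic scoring function; `toy_score`, the motif string and the two sequences are hypothetical and merely stand in for a trained model and real orthologs.

```python
def substitution_effects(ref_seq, other_seq, score_fn):
    """Score each single substitution that distinguishes two aligned sequences.

    Returns a list of (position, ref_base, other_base, delta_score), where
    delta_score is the score of ref_seq carrying that one substitution minus
    the score of the unmodified ref_seq.
    """
    if len(ref_seq) != len(other_seq):
        raise ValueError("sequences must be aligned to the same length")
    ref_score = score_fn(ref_seq)
    effects = []
    for pos, (ref_base, other_base) in enumerate(zip(ref_seq, other_seq)):
        if ref_base != other_base:
            mutant = ref_seq[:pos] + other_base + ref_seq[pos + 1:]
            effects.append((pos, ref_base, other_base, score_fn(mutant) - ref_score))
    return effects


def toy_score(seq):
    """Hypothetical scorer: number of occurrences of a made-up motif."""
    return float(seq.count("ACAAAG"))


# Toy sequences: one difference falls inside the motif, one outside.
functional = "GGACAAAGCC"
non_functional = "GGACTAAGCA"
effects = substitution_effects(functional, non_functional, toy_score)
strongest = min(effects, key=lambda e: e[3])
```

Testing substitutions one at a time attributes the score difference to individual changes, but it ignores interactions between substitutions, so the single effects need not add up to the total difference between the two sequences.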

Next, we performed this analysis on a larger scale. Firstly, per species pair, we observed that differences in DeepMEL predictions between species (delta-DeepMEL score) are highly predictive for differences in accessibility (Spearman’s correlation of 0.43, Fig. S8B,C). Among the four studied regulators, it was mostly the disruption or gain of one or more SOX10 binding sites between orthologous enhancers that quantitatively altered the ATAC-seq signal in a concordant way (Fig. 7F, Fig. S8D), indicating that mutations affecting SOX10 sites are most often causal for changes in MEL enhancer accessibility, and possibly also in enhancer activity, as was the case in the APPL2 enhancer above. However, concordance between accessibility and activity was not always observed (Fig. S9). Furthermore, luciferase assays of six human or dog MEL-predicted enhancers suggested that enhancers with at least one MITF motif (n = 3) are significantly more active compared to enhancers without any MITF motif (n = 3) (Fig. 7G). Although the number of tested enhancers is small, this trend, together with the fact that MEL enhancers containing a MITF binding site showed increased H3K27ac signal (Fig. S6C), indicates that MITF could function as an activator in MEL enhancers. Indeed, MITF has been shown to activate genes involved in pigmentation by recruitment of co-factors and chromatin remodelling complexes (Kawakami and Fisher 2017) and was previously classified as a TF involved in co-factor recruitment and activation (Grossman et al. 2018). Note that SOX10 binding is insufficient but appears necessary for enhancer activity, as mutations in SOX10 binding sites disrupt enhancer activity in the IRF4 case study (Fig. 3G).

In conclusion, DeepMEL provides a suitable platform to study the effect of evolutionary mutations on MEL enhancer accessibility and, in some cases, activity across species. Together, these results support the view that SOX10 is crucial for enhancer accessibility in MEL enhancers, and necessary but insufficient for MEL enhancer activity, as activity appears to be mainly dependent on MITF binding.

Figure 7. Predicting causal mutations of evolutionary changes in MEL enhancers. (A,B) Example region upstream of APPL2 that is accessible (A) and active (B) in the MEL dog line Dog-OralMel-18249 but not in human MEL lines. (C) DeepMEL prediction score of each of the 24 topics for the dog and human APPL2 enhancer. (D) Effect on the topic 4 DeepMEL score of the dog sequence when simulating in silico each of the single point mutations detected between the dog and human APPL2 enhancer. (E) DeepExplainer plots of the middle 120 bp of the dog and human APPL2 enhancer. In the middle, the effect of each possible point mutation between the dog and human sequence on the MEL DeepMEL score was calculated in silico and is represented by coloured dots depending on the nucleotide to which the original dog nucleotide was mutated in silico. Truly existing point mutations between the dog and human sequence are highlighted by color-coded vertical dashed lines. Four mutations that decrease the motif score of the SOX10, MITF and TFAP2A motifs are highlighted by a grey box and are encircled. (F) Barplot showing the mean effect on the log2 delta ATAC-seq signal of a non-human region compared to the human homolog depending on the number of SOX10 motif hits lost or gained. Only regions having no change in the number of significant TFAP2A, MITF and RUNX motif hits were used. The y-axis is normalised to the category with no changes in the number of significant SOX10 motif hits. The number of regions in each of the categories is mentioned (#). (G) Luciferase assay on six human or dog enhancers. Significant motif hits per enhancer are shown with coloured crosses. For the luciferase assays: luciferase activity in MM001 is shown relative to Renilla signal and is log10 transformed. P-values were determined using Student’s t-test and the error bars represent the standard deviation over three biological replicates.

Discussion

Here, we present an in-depth study of melanoma enhancer logic, especially in enhancers specific to the melanocytic (MEL) state, by exploiting both cross-species data and machine learning. Although the MEL and MES melanoma cell states have been studied extensively on a transcriptomic and epigenomic level, the combinatorial code of binding sites of their regulatory factors in state-specific enhancers had not yet been explored. Understanding the enhancer logic and the mechanism by which TFs bind and direct active enhancers will become increasingly important, as it will be essential for the development of new therapies that influence cell state-specific enhancer functions in a targeted way (e.g. for enhancer therapy (Hamdan and Johnsen 2019; Johnson et al. 2008)), or to prioritise non-coding variants in whole genome sequencing studies of personal or cancer genomes (Atak et al. 2019).

Predicting enhancers and determining their functional role within gene regulatory networks has been an active field for years. Despite the well-established power of cross-species approaches in this field, to our knowledge, a large comparative epigenomics study in melanoma has not yet been conducted, although several non-human models are commonly used in melanoma research (van der Weyden et al. 2016) and have been studied on an intra-species level (Hitte et al. 2019; Jiang et al. 2014; Kaufman et al. 2016; Rambow et al. 2008; Rosengren Pielberg et al. 2008; Seltenhammer et al. 2014; Sundström et al. 2012) or in relation to human melanoma (Egidy et al. 2008; Segaoula et al. 2018; Rahman et al. 2019). Here, we demonstrate that the MEL and MES states are conserved across species, as well as the key regulators of these states.

Despite their proven advantages, sequence-based comparative approaches have limited power to identify orthologous regulatory regions in distant species, in part because of the rapid evolution of distal enhancers (Dermitzakis and Clark 2002; Lindblad-Toh et al. 2011). Methods such as enhancer element locator (EEL) try to tackle this question by aligning TF binding sites to identify conserved enhancer elements (Hallikas et al. 2006), or by calculating the co-occurrence of sequence patterns (Arunachalam et al. 2010). However, these methods are either supervised, as they require user-provided PWMs (Hallikas et al. 2006), or make it difficult to extract the important biologically relevant features (Arunachalam et al. 2010). In addition, the identification and exact localisation of important (de novo) TF binding sites within enhancers is complex, as motif discovery tools are often dependent on user-provided databases and motif-specific thresholds. Recently, deep learning approaches, which are commonly used in disciplines such as speech recognition and image analysis, found their way into the regulatory genomics field to overcome these concerns (Park and Kellis 2015). As deep learning models, such as DeepBind, are particularly powerful in learning complex patterns by leveraging large epigenomics datasets, they are well suited to function as de novo motif detectors, as well as to uncover more complex sequence features (Alipanahi et al. 2015; Park and Kellis 2015). By designing DeepMEL, a multi-class multi-label neural network trained on melanoma human regulatory topics of co-accessible regions, and by using the model interpretation tool DeepExplainer and our newly developed motif scoring scheme (Lundberg and Lee 2017; Lundberg et al. 2020), we were able to perform a thorough and unsupervised analysis of important TF binding sites in melanoma enhancers. Specifically, in MEL enhancers, our data suggest conserved co-binding of a Core Regulatory Complex of three main TFs, consisting of SOX10, TFAP2A and MITF. DeepMEL also finds motifs for RUNX factors, but their role in the melanocyte or melanoma is less clear. Evidence for co-binding of SOX10, MITF, and TFAP2A was previously observed by enrichment of both MITF and TFAP2A motifs in SOX10 ChIP-seq data in melanoma cells (Laurette et al. 2015). We observed high flexibility in the organisation of TF binding sites of the CoRC since eight different modalities were found, formed by all combinations of the CoRC factors, with the exception that all MEL enhancers contained at least one SOX10 binding site. MEL enhancers thereby adhere to a ‘mixed modes enhancer’ model, a billboard-like model with mostly high flexibility in the TF motif organisation, except for the ever-present SOX10 binding sites (Long et al. 2016). In addition, ChIP-seq data of MITF and TFAP2A indicated no indirect DNA binding of these CoRC factors within MEL enhancers, but rather that the bound TFs are largely determined by their individual motif presence. Note that although DeepMEL was trained on melanoma ATAC-seq data, the human and pig predicted MEL enhancers were also accessible in human and pig melanocytes, respectively, indicating that we could extend these observations on the MEL enhancer logic to enhancers in melanocytes, and that our methodology could be applied to non-disease states.

It is well established that distinct functional classes of TFs exist with respect to enhancer binding. Pioneer TFs, such as OCT4, SOX2, Grh-like TFs, and FOXA1, are able to bind nucleosomal DNA, leading to displacement of the nucleosome and facilitating the binding of other TFs to the accessible enhancer (Jacobs et al. 2018; Long et al. 2016; Zaret and Carroll 2011). SOX2 and other SOX factors have a HMG domain that interacts with the minor groove of the DNA, causing the DNA to bend at a 60-70° angle, a property that has been suggested to contribute to the pioneering activity of SOX2, and possibly of other SOXs (Hou et al. 2017). A recent publication by Dodonova et al. indicates that SOX2 and SOX11 can bind to their binding motif on nucleosomal DNA and that they use their binding energy to initiate chromatin opening. However, there is still some dispute on the pioneering properties of SOX TFs, as another study classified SOXs as ‘migrant TFs’, i.e. non-pioneering TFs that only bind sporadically to (non)-chromatinised DNA (Sherwood et al. 2014). Nonetheless, we find strong evidence for a pioneering function of SOX10 in MEL melanoma cells. Our current and previous studies (Bravo González-Blas et al. 2019) have shown that knock-down of SOX10 induces closure of SOX10-bound ATAC-seq peaks containing a SOX10 motif. In fact, DeepMEL predicts SOX10 binding sites as essential for MEL enhancer accessibility. Next to pioneer factors, other functional classes of TFs exist, including factors that stabilise the accessibility of the nucleosome-depleted regions. TFAP2A was previously classified as such a chromatin stabiliser (Grossman et al. 2018) and it has been shown that evolutionary divergence from the TFAP2A consensus motif correlates with loss of chromatin accessibility and H3K27ac ChIP-seq signal (Prescott et al. 2015). These reports support our observations of TFAP2A as a stabiliser of SOX10-dependent accessible MEL enhancers, likely due to direct competition of TFAP2A with the nucleosome, as TFAP2A binding sites were highly enriched at the predicted center of the central nucleosome. The dependence on SOX10 for opening MEL enhancers prior to TFAP2A binding is in line with the reported classification of TFAP2A as a ‘settler’, a TF whose binding depends predominantly on the accessibility of the chromatin at its binding sites (Sherwood et al. 2014).

Besides classifying accessible (orthologous) regions and predicting important TF motifs within them, DeepMEL is an accurate predictor of the effect of mutations on enhancer accessibility and, for some enhancers, also the activity. This was for instance the case for the IRF4 MEL enhancer, where DeepMEL outperformed existing methods tested in Kircher et al. (Kircher et al. 2019). Note, however, that the other models in the benchmark were trained to predict the activity of a total of 20 regulatory regions ranging across different cell types, whereas our DL model is specialised for melanoma regulatory regions. This demonstrates the value of using case-specific training data, such as the data set generated in this study for melanoma. Not all predicted MEL enhancers were in fact active, as MITF binding seems to be required to activate SOX10-dependent melanoma enhancers. The study of Fufa et al. supports this hypothesis, as activating SOX10-regions in mouse melanocytes showed significant enrichment of E-box motifs (bound by the bHLH protein family, which includes MITF), indicating that MITF cooperates with SOX10 to execute melanocyte-specific gene activation (Fufa et al. 2015). In addition, MITF was previously classified as a TF involved in co-factor recruitment and activation (Grossman et al. 2018; Kawakami and Fisher 2017). Although SOX10 binding is not sufficient for enhancer activity, it appears to be necessary, as disruption of the SOX10 binding site in the IRF4 enhancer had a strong effect on activity, probably due to the reappearance of the central nucleosome.

659

660

In conclusion, the combination of comparative epigenomics with deep learning allowed us to perform an in-depth analysis of the melanoma enhancer logic. This work presents an overall framework which can be applied to decipher the enhancer logic in a cell type or cell state of interest, starting from the generation of an extensive cell type-specific (cross-species) epigenomics dataset, through the training and exploitation of a deep neural network to decode enhancer features across species, to its use in assessing the impact of cis-regulatory variation.

Methods

Cell culture

Human melanoma cell lines

Human melanoma cultures (“MM lines”) are short-term cultures derived from patient biopsies (Gembarska et al. 2012; Verfaillie et al. 2015). Cells were cultured at 37°C with 5% CO2 and were maintained in Ham's F10 nutrient mix (Thermo Fisher Scientific) supplemented with 10% fetal bovine serum (FBS; Thermo Fisher Scientific) and 100 µg/ml penicillin/streptomycin (Thermo Fisher Scientific).

Zebrafish melanoma cell lines

Experiments were performed as outlined by Ceol et al. (2011). Briefly, 25 pg of MCR:EGFP were microinjected together with 25 pg of Tol2 transposase mRNA into one-cell Tg(BRAFV600E);p53-/-; mitf-/- zebrafish embryos. Embryos were scored for melanocyte rescue at 48-72 hours post-fertilisation, and equal numbers were raised to adulthood (15-20 zebrafish per tank), and scored weekly (from 8-12 weeks post-fertilisation) or bi-weekly (> 12 weeks post-fertilisation) for the emergence of raised melanoma lesions (van Rooijen et al. 2017). For in vitro culture, large tumors were isolated from MCR/MCR:EGFP (14-28 weeks post-fertilisation). Zebrafish were maintained under IACUC-approved conditions. The zebrafish primary melanoma cell line ZMEL1 was previously described (White et al. 2008, 2011), and EGFP 121-1, EGFP 121-2, EGFP 121-3 and EGFP 121-5 were generated as described in (Heilmann et al. 2015; Wojciechowska et al. 2016). All cell lines were cultured in DMEM medium (Thermo Fisher Scientific) supplemented with 10% heat-inactivated FBS (Atlanta Biologicals), 1× GlutaMAX (Thermo Fisher Scientific) and 1% Penicillin-Streptomycin (Thermo Fisher Scientific), at 28°C, 5% CO2. Zebrafish melanoma lines were authenticated by qPCR and Western blot for EGFP transgene expression, and periodically checked for mycoplasma using the Universal Mycoplasma Detection Kit (ATCC).

Horse melanoma cell lines

The horse cell lines HoMel-L1 and HoMel-A1 are melanoma cell lines derived from a Lipizzaner stallion and a Shagya-Arabian mare, respectively, and were established in Seltenhammer et al. Cells were cultured at 37°C with 5% CO2 in Roswell Park Memorial Institute (RPMI) medium (Thermo Fisher Scientific) supplemented with 10% fetal bovine serum (FBS; Thermo Fisher Scientific) and 1% penicillin/streptomycin (Thermo Fisher Scientific).

Pig melanoma and melanocyte cell line

The immortal line of pigmented melanocytes (PigMel) was previously derived (Julé et al. 2003) and the 30 day-old piglet primary melanoma cells (MeLiM) were isolated as described (Egidy et al. 2008). PigMel cells were cultured at 37°C with 10% CO2 in MEM medium supplemented with 1× MEM non-essential amino acids (Thermo Fisher Scientific), 1 mM Na pyruvate, 2 mM glutamine, 100 U/ml penicillin/streptomycin (Thermo Fisher Scientific), 10% FCS and 3.7 g/ml Na bicarbonate. MeLiM cells were cultured in DMEM high glucose (Thermo Fisher Scientific), 10% FCS, Pen/Strep, 5% CO2.

Dog melanoma cell lines

The dog cell lines Dog-IrisMel-14205 and Dog-OralMel-18249 were established by Aline Primot, and were derived from a uveal melanoma from a Beagle crossed dog and an oral melanoma from the palate of a Shih-tzu, respectively. Cells were cultured at 37°C with 5% CO2 in Ham's F-12 Nutrient Mixture medium (Thermo Fisher Scientific) supplemented with 10% FBS (Thermo Fisher Scientific) and 1% penicillin/streptomycin (Thermo Fisher Scientific).

Mouse melanoma cell lines

The mouse melanoma cell line was generated as described in (Dankort et al. 2009). Cells were cultured at 37°C with 5% CO2 in Dulbecco's Modified Eagle Medium (DMEM) (Thermo Fisher Scientific) supplemented with 10% FBS (Thermo Fisher Scientific) and 1% penicillin/streptomycin (Thermo Fisher Scientific).

Knock-down experiments

SOX10, TFAP2A and the control knock-down (KD) were performed in MM001 using a SMARTpool of four siRNAs against, respectively, SOX10 (SMARTpool: ON-TARGETplus SOX10 siRNA, number L017192-00-0005, Dharmacon), TFAP2A (SMARTpool: ON-TARGETplus TFAP2A siRNA, number L-006348-02-0005, Dharmacon) and a negative control pool (ON-TARGETplus non-targeting pool, number D-001810-10-05, Dharmacon). The siRNAs were used at a concentration of 20 nM for SOX10-KD and 40 nM for TFAP2A-KD and the control, with Opti-MEM (Thermo Fisher Scientific) as medium and omitting antibiotics. The cells were incubated for 72 h before processing.

OmniATAC-seq data generation, data processing and follow-up analyses

OmniATAC-seq on mammalian lines

Omni-Assay for Transposase-Accessible Chromatin using sequencing (OmniATAC-seq) was performed as described previously (Corces et al. 2017). After the final amplification was done with the additional number of cycles, samples were cleaned-up by MinElute and libraries were prepped using the KAPA Library Quantification Kit as previously described (Corces et al. 2017). Samples were sequenced on a HiSeq 4000 or NextSeq 500 High Output chip.

ATAC-seq on zebrafish lines

50,000 cells per line were lysed and subjected to a tagmentation reaction and library construction as described in Buenrostro et al. Libraries were run on an Illumina HiSeq 2000.

Data processing of (Omni)ATAC-seq samples

(Paired-end) reads were mapped using Bowtie 2 (v2.2.6) (Langmead and Salzberg 2012), or using STAR (v2.5.1b) (Dobin et al. 2013) with the parameters --alignIntronMax 1 and --alignIntronMin 2, to species-specific genomes which were downloaded from UCSC (http://hgdownload.cse.ucsc.edu/goldenPath/) (for human: hg19-Gencode v18; for dog: canFam3; for horse: equCab2; for pig: susScr11; for mouse: mm10; for zebrafish: danRer10). Note that for the human data, we used hg19 as genome assembly instead of the more recent GRCh38 assembly since i-cisTarget (Janky et al., 2014; Imrichova et al., 2015; Herrmann et al., 2012) and GREAT (McLean et al. 2010) are or were not (yet) available for GRCh38 at the time of the analyses. However, the use of GRCh38 instead of hg19 would not significantly affect conclusions. We validated this, for instance, by re-scoring MEL-predicted regions with DeepMEL in MM057 after liftOver (Kuhn et al. 2013) from hg19 to GRCh38: changing the genome assembly yielded the same DeepMEL score for all but 8 of the 4,244 regions. Also note that for MM029, two biological replicates were used. Mapped reads were sorted using SAMtools (v1.8) (Li et al. 2009) and duplicates were removed using Picard MarkDuplicates (v1.134) (Broad Institute 2019). Reads were filtered by removing mitochondrial reads and filtering for Q>30 using SAMtools. BAM files of technical replicates of the same cell line were merged at this point using samtools merge. Peaks were called per sample using MACS2 (v2.1.2) (Gaspar 2018) callpeak with the parameters -q 0.05, --nomodel, --call-summits, --shift -75, --keep-dup all and --extsize 150. Blacklisted regions (ENCODE) and peaks overlapping with alternative chromosomes and ChrM were removed. Summits were extended by 250 bp up- and downstream using slopBed (bedtools; v2.28.0) (Quinlan and Hall 2010), providing human chromosome sizes. Peaks were normalised for the library size using a custom script and overlapping peaks were filtered using the peak score by keeping the peak with the highest score. Normalised bigWigs were either made from normalised bedGraphs using as scaling parameter (-scale) 1 × 10^6/(number of non-mitochondrial mapping reads); or made by bamCoverage (deepTools, v3.3.1 (Ramírez et al. 2016)), using as parameters --normalizeUsing None, -bl EncodeBlackListedRegions, --effectiveGenomeSize 2913022398 and as scaling parameter (-scaleFactor) 1/(RIP/1 × 10^6), where RIP stands for the number of reads in peaks.
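
The summit extension and the two scaling factors described above can be illustrated with a minimal Python sketch. This is not the custom script used in this study: the function names are hypothetical, and the summit is assumed to be a single 0-based coordinate so that each extended region is exactly 500 bp wide unless it is clipped at a chromosome end.

```python
def extend_summit(chrom, summit, chrom_sizes, flank=250):
    """Extend a peak summit by `flank` bp on both sides, clipped to the chromosome.

    Assumption: `summit` is one 0-based position and the result is the
    half-open interval [summit - flank, summit + flank).
    """
    start = max(0, summit - flank)
    end = min(chrom_sizes[chrom], summit + flank)
    return chrom, start, end


def per_million_scale_factor(n_reads):
    """Scale factor 1 x 10^6 / n_reads.

    With n_reads = non-mitochondrial mapped reads this is the bedGraph factor;
    with n_reads = reads in peaks (RIP) it equals 1 / (RIP / 1 x 10^6).
    """
    return 1e6 / n_reads
```

For example, `extend_summit("chr1", 1000, {"chr1": 5000})` gives the interval 750 to 1,250, and a library with 2,000,000 usable reads gets a scale factor of 0.5.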

HOMER on human and dog differentially accessible peaks

Count matrices were produced by featureCounts (v1.6.5) (Liao et al. 2014) for 5 melanocytic (MEL) and 5 mesenchymal-like (MES) lines for human, and for Dog-OralMel-18249 and Dog-IrisMel-14205 for dog. Differential peaks were identified using DESeq2 (v1.22.2, R v3.5.2 (R Core Team 2018)) (Love et al. 2014) with a log2FC higher than 2 and a pAdj lower than 0.0005. HOMER (Heinz et al. 2010) was performed on the differentially accessible regions using findMotifsGenome.pl, providing the differential regions as a BED file and a fasta file of the human or dog genome, with parameters -mask, -size given and -len 6,8,10,11,12,17,18.

Defining sets of alignable and conserved accessible ATAC-seq regions

ATAC-seq regions of non-human species were defined as alignable regions when they could be converted to hg19 coordinates using liftOver (Kent-tools, -minMatch=0.1) (Kuhn et al. 2013) by providing the appropriate liftOver chain (UCSC). Alignable regions were intersected with accessible peaks in human using intersectBed (bedtools, v2.28.0) (Quinlan and Hall 2010) with -f 0.6 to define sets of conserved accessible regions across species.

Clustering of species based on globally alignable ATAC-seq regions

Per species, a count matrix was made on the alignable union ATAC-seq regions by featureCounts (v1.6.5) (Liao et al. 2014). The count matrices of different species were merged and the final count matrix was CPM normalised (edgeR v3.22.5, R v3.5.2 (R Core Team 2018)) (Robinson et al. 2010), followed by quantile normalisation. A principal component analysis (PCA) on the normalised count matrix was performed using irlba (v2.3.3, R v3.5.2) (Baglama and Reichel 2005).
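
The normalisation and PCA steps can be sketched in Python with NumPy. This is an illustrative stand-in, not the edgeR/irlba code used here: the function names are hypothetical, the PCA uses a full SVD rather than the truncated solver of irlba, and ties in the quantile normalisation are broken arbitrarily.

```python
import numpy as np


def cpm(counts):
    """Counts per million for a regions x samples matrix."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=0, keepdims=True) * 1e6


def quantile_normalise(mat):
    """Give every sample (column) the same distribution: the mean of the sorted columns."""
    mat = np.asarray(mat, dtype=float)
    ranks = np.argsort(np.argsort(mat, axis=0), axis=0)  # rank of each value within its column
    mean_sorted = np.sort(mat, axis=0).mean(axis=1)      # reference distribution
    return mean_sorted[ranks]


def pca_samples(mat, n_components=2):
    """Project samples onto the top principal components (SVD of the row-centred matrix)."""
    centred = mat - mat.mean(axis=1, keepdims=True)
    _, s, vt = np.linalg.svd(centred, full_matrices=False)
    return (vt[:n_components] * s[:n_components, None]).T  # samples x components
```

A typical call would be `pca_samples(quantile_normalise(cpm(counts)))`, which returns one row of coordinates per sample.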

Branch length scoring across species

Conserved accessible ATAC-seq regions were identified as described above, and for each of the species, the set of conserved accessible regions was converted to the coordinate system per species and fasta sequences were retrieved. All sequences were scored with the cisTarget motif collection (v8) (http://iregulon.aertslab.org/collections.html) (Janky et al., 2014; Imrichova et al., 2015; Herrmann et al., 2012) containing 20,003 TF position-weight matrices (PWMs) using Cluster-Buster (Frith et al. 2003) with parameters -m 0, -c 0 and -r 10000. For each motif, the highest cis-regulatory module (CRM) score per conserved accessible sequence was used to calculate the branch length score (BLS) across species according to Stark et al. and Jacobs et al. The branch length was taken from the phylogenetic data from http://hgdownload.cse.ucsc.edu/goldenpath/hg19/phyloP100way/ (UCSC). The sum of the BLSs for all the conserved accessible sequences across the mammalian or all six species was used as a total score for each motif. We normalised these scores by performing BLS on a shuffled variant of all sequences by shuffleseq (EMBOSS, v6.6.0.0), keeping the same base-pair compositions and sequence lengths, and subtracting the shuffled BLS from the true BLS per motif.

cisTopic analysis to obtain sets of co-accessible regions in human OmniATAC-seq data

To apply cisTopic (Bravo González-Blas et al. 2019), a tool designed for single-cell ATAC-seq analysis, we first simulated single cells from the bulk OmniATAC-seq data of the 16 human melanoma lines via bootstrapping. Per cell line, 50 simulated single cell BAM files were generated, each containing 50,000 random reads that were bootstrapped from the bulk BAM files. These simulated single cell BAM files were provided as input for cisTopic (v0.2.0, R v3.4.1 (R Core Team 2017)), together with the merged BED file of ATAC-seq regions across all 16 samples, after removing blacklisted regions (ENCODE). We ran cisTopic (parameters: α = 50/T, β = 0.1, burn-in iterations = 500, recording iterations = 1,000) for models with a number of topics (sets of co-accessible regions) between 2 and 30 (in steps of 2). The best model, containing 24 topics, was selected on the basis of the highest log-likelihood. Topics were binarised using a probability threshold of 0.995 (resulting in a total of 35,940 binarised topic regions across the 24 topics), and we performed motif enrichment analysis with cisTarget (Imrichová et al. 2015).
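
The simulation of single cells from bulk data can be sketched as follows. This is a hedged illustration only: the function name is hypothetical, reads are represented by their index in the bulk BAM file, and sampling with replacement is an assumption based on the term "bootstrapping".

```python
import numpy as np


def simulate_single_cells(n_bulk_reads, n_cells=50, reads_per_cell=50_000, seed=0):
    """Return, for each simulated cell, the indices of the bulk reads drawn for it.

    Assumption: reads are drawn uniformly and with replacement.
    """
    rng = np.random.default_rng(seed)
    return [rng.integers(0, n_bulk_reads, size=reads_per_cell) for _ in range(n_cells)]
```

Each index array would then be used to write one simulated single-cell BAM file, giving 50 files per cell line and 800 simulated cells for the 16 lines.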

Deep Learning

Data preparation

The deep learning (DL) model, DeepMEL, was trained on the binarised regions of the 24 topics obtained from the cisTopic analysis explained above. In order to increase the amount of training data, the 500 bp regions in the merged BED file of all 339,099 ATAC-seq regions across the 16 human cell lines (see Data processing of human melanoma baseline OmniATAC-seq samples), were augmented by extending them to 700 bp around the summit and sliding a 500 bp window over these elongated regions with a 10 bp stride. This augmented master region BED file was intersected with each topic BED file separately (using bedtools (Quinlan and Hall 2010)) and a region was labelled with a topic number if there was at least 60% overlap. If regions overlapped with multiple topics they were assigned with multiple topic labels, allowing for a multi-label and multi-class DL model. This augmentation and intersection resulted in 696,654 training regions in total, excluding the 58,086 regions on Chr2 that were used for testing.
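
The sliding-window augmentation can be made concrete with a short Python sketch. The function names are hypothetical, and the overlap fraction is computed relative to the window, which is one possible reading of the 60% criterion.

```python
def augment_region(chrom, summit, region_len=700, window=500, stride=10):
    """Slide a `window` bp window over a `region_len` bp region centred on the summit."""
    start = summit - region_len // 2
    last_start = start + region_len - window
    return [(chrom, s, s + window) for s in range(start, last_start + 1, stride)]


def overlap_fraction(window, region):
    """Fraction of `window` covered by `region` (both are (chrom, start, end) tuples)."""
    if window[0] != region[0]:
        return 0.0
    overlap = min(window[2], region[2]) - max(window[1], region[1])
    return max(0, overlap) / (window[2] - window[1])
```

With these settings each region yields (700 - 500)/10 + 1 = 21 windows, and every window overlaps the original 500 bp region by at least 80%, so all of them pass the 60% threshold. Note that 21 × 35,940 binarised topic regions equals 754,740, which matches the 696,654 training regions plus the 58,086 test regions reported above.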

DeepMEL model architecture and training parameters

The DeepMEL architecture was built with 4 layers between input and output layer: a Conv1D layer (containing 128 filters and setting the parameters kernel_size as 20, the strides as 1 and the activation as relu), MaxPooling1D layer (with the pool_size 10 and strides 10), TimeDistributed Dense layer together with Bidirectional LSTM layer (with 128 units and setting the dropout as 0.1 and the recurrent_dropout as 0.1), and Dense layer (with 256 units and setting the activation as relu). After MaxPooling1D, Bidirectional LSTM, and Dense layer, a Dropout layer was used each time with the fraction of dropout set as 0.2, 0.2, and 0.4, respectively. For each region in the training data, DeepMEL takes the one-hot encoded (500 bp × 4 nucleotide) forward and reverse strand and passes them separately through the model. In order to make the final prediction, DeepMEL takes the average activation (average function) of the neurons in the final Dense layer (which contains 24 units corresponding to the 24 topics; with a sigmoid activation function). The model was compiled using the Adam optimizer with the default learning rate, which is 0.001. To calculate the loss, the binary cross entropy (binary_crossentropy) was used. The model was trained for 2 epochs with a batch size of 128, which took 67 minutes. Keras 2.2.4 (Chollet and others 2015) with tensorflow 1.14.0 (Abadi et al. 2016) was used. A Tesla P100-SXM2-16GB GPU was used for training on VSC servers (Flemish Supercomputer Center).
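
The input encoding and the averaging over both strands can be sketched with NumPy. The sketch below does not reproduce the Keras model itself: `predict` is a hypothetical placeholder for any function that maps a (length × 4) one-hot matrix to a vector of topic scores, and the A, C, G, T column order is an assumption.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed column order of the one-hot encoding


def one_hot(seq):
    """Encode a DNA string as a (length x 4) matrix; unknown bases stay all-zero."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    mat = np.zeros((len(seq), 4), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        if base in index:
            mat[pos, index[base]] = 1.0
    return mat


def reverse_complement(onehot):
    """With A, C, G, T columns, flipping both axes gives the reverse complement."""
    return onehot[::-1, ::-1]


def predict_both_strands(predict, seq):
    """Average the scores obtained for the forward and the reverse strand."""
    forward = one_hot(seq)
    return (predict(forward) + predict(reverse_complement(forward))) / 2.0
```

Averaging the two passes makes the final score independent of the strand on which a region happens to be presented.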

Performance evaluation

The performance of the model was evaluated for each topic separately since it was a multi-label classifier. The auROC and auPR were calculated for the combined training and validation data (regions on all chromosomes except Chr2), test (regions on Chr2), and label-shuffled regions.
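
As an illustration of the per-topic evaluation, the auROC of each topic can be computed from ranks (the Mann-Whitney formulation). This is a NumPy-only sketch with hypothetical function names, not the evaluation code of this study, and it assumes that every topic has at least one positive and one negative region.

```python
import numpy as np


def auroc(labels, scores):
    """Area under the ROC curve from average ranks (ties share their mean rank)."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    ranks = np.empty(len(scores), dtype=float)
    ranks[np.argsort(scores, kind="mergesort")] = np.arange(1, len(scores) + 1)
    for value in np.unique(scores):
        tied = scores == value
        ranks[tied] = ranks[tied].mean()
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def per_topic_auroc(label_matrix, score_matrix):
    """One auROC per topic for regions x topics label and prediction matrices."""
    label_matrix = np.asarray(label_matrix)
    score_matrix = np.asarray(score_matrix)
    return [auroc(label_matrix[:, t], score_matrix[:, t]) for t in range(label_matrix.shape[1])]
```

Evaluating the same predictions against shuffled labels should give values close to 0.5, which is the role of the label-shuffled control.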

Converting convolution filters to PWMs, filter-topic assignment, and filter-annotation

Filters of the convolution layer were converted to position-weight matrices (PWMs) by the following strategy: (i) 4,000,000 unique 20 bp-long (size of the filters) sequences were randomly generated. (ii) The activation score of each filter for each sequence was calculated and the top 100 sequences were selected. (iii) A count matrix was generated from these 100 sequences obtained for each filter. (iv) Finally, the count matrices were converted into PWMs. In order to assign the filters to topics, a strategy similar to the one described for Basset (Kelley et al. 2016) was used. After setting the activation score of a filter to its mean activation score over all the sequences, the loss/accuracy score on the prediction was calculated for each topic. Filters were ordered based on their effect on a certain topic. In order to annotate the filters to known transcription factor binding motifs, the Tomtom motif annotation tool (Gupta et al. 2007) was used together with our curated cisTarget motif collection (v9) (http://iregulon.aertslab.org/collections.html) (Janky et al., 2014; Imrichova et al., 2015; Herrmann et al., 2012) of 24,453 PWMs (cutoff for the q-value was set to 0.3).
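
Steps (i) to (iv) can be sketched in Python under some simplifying assumptions. The function name is hypothetical; the filter is taken as a (width × 4) weight matrix with A, C, G, T columns; the bias and the ReLU are ignored because they do not change the ranking of the top sequences; random sequences are not forced to be unique; and a uniform background with a pseudocount of 1 is assumed for the log-odds conversion.

```python
import numpy as np


def filter_to_pwm(filt, n_random=100_000, top_n=100, pseudocount=1.0, seed=0):
    """Convert one convolution filter into a count matrix and a log-odds PWM."""
    rng = np.random.default_rng(seed)
    width = filt.shape[0]
    # (i) random sequences, encoded as base indices 0..3
    seqs = rng.integers(0, 4, size=(n_random, width))
    # (ii) activation of a one-hot sequence = sum of the weights of its bases
    activations = filt[np.arange(width), seqs].sum(axis=1)
    top = seqs[np.argsort(activations)[-top_n:]]
    # (iii) count matrix (width x 4) over the top sequences
    counts = np.stack([(top == base).sum(axis=0) for base in range(4)], axis=1).astype(float)
    # (iv) log-odds PWM against a uniform background
    probs = (counts + pseudocount) / (counts + pseudocount).sum(axis=1, keepdims=True)
    return counts, np.log2(probs / 0.25)
```

The default of 100,000 random sequences is chosen only to keep the example fast; the procedure described above uses 4,000,000.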

DeepExplainer

From the 35,940 topic regions that were obtained after binarisation of the 24 topics within the selected cisTopic model (see methods on cisTopic analysis above), 500 regions were randomly selected to initialise the DeepExplainer pipeline (Lundberg and Lee 2017). A hypothetical importance score for each position of the sequence of interest was calculated for any of the 24 topics. For each sequence, these DeepExplainer-obtained importance scores were multiplied by the one-hot encoded matrix of the sequences. Finally, the 500 bp sequences were visualised by adjusting the nucleotide heights based on their importance score by using the modified viz_sequence function from the DeepLift repository (Shrikumar et al. 2017).

In silico saturation mutagenesis

In silico saturation mutagenesis of a region was performed by separately changing each nucleotide on the 500 bp sequence into the three other nucleotides, and scoring these mutated sequences with DeepMEL. The delta prediction score for each mutation was calculated for each of the 24 topics by comparing the prediction score of the mutated sequence relative to the prediction score for the initial sequence. For the IRF4 enhancer case, the actual IRF4 enhancer sequence used in the in vitro saturation mutagenesis assay (Chr6:396,143-396,593) overlapped with a predicted MEL enhancer in human MEL cell lines in our cohort (Chr6:396,135-396,636). The delta prediction score of topic 4 (MEL topic) was calculated following an in silico saturation mutagenesis on this region, and a Pearson’s correlation was calculated on the overlapping nucleotides between the in silico and in vitro assays (451 bp).
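
The mutagenesis loop can be written compactly in Python. In this sketch, `predict` is a hypothetical placeholder for the trained model (any function mapping a one-hot matrix to a score or a vector of topic scores), and the delta score is taken as the simple difference between mutant and reference predictions, which is one way to read "relative to".

```python
import numpy as np


def saturation_mutagenesis(predict, onehot):
    """Delta score (mutant minus reference) for every single-nucleotide substitution.

    Returns an array of shape (length, 4) + score_shape; entries for the
    reference base at each position are left at zero.
    """
    reference = predict(onehot)
    length = onehot.shape[0]
    delta = np.zeros((length, 4) + np.shape(reference))
    for pos in range(length):
        for base in range(4):
            if onehot[pos, base] == 1:
                continue  # reference base: no mutation
            mutant = onehot.copy()
            mutant[pos] = 0
            mutant[pos, base] = 1
            delta[pos, base] = predict(mutant) - reference
    return delta
```

For a 500 bp region this requires 1,500 mutant predictions. For the comparison with an in vitro assay, the delta scores of the relevant topic would be restricted to the overlapping positions before computing the correlation.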

Motif scoring method

We designed an optimised motif scoring method, in which activation scores of the filters on each sequence are multiplied by the DeepExplainer importance scores of the sequence. Then, after the output of this multiplication was normalised, a threshold was calculated for each motif by comparing MEL and MES enhancers. This approach yielded significant motif hits with their precise location.

Nucleosome positioning

Nucleosome start and middle point predictions were calculated by using the executable nucleosome prediction tool Kaplan_v3 (Kaplan et al. 2009) that takes just the DNA sequence and calculates the nucleosome positioning for each nucleotide. In order to get more precise results, as the authors of Kaplan_v3 suggest, enhancers were extended by 3 kb at both ends. After obtaining the predictions, the middle 500 bp part of the 6.5 kb nucleosome prediction score was used.

Tn5 footprinting

Footprints of the Tn5 were determined by inferring Tn5 cut sites from the start point of each ATAC-seq read in a BAM file using a custom script.

AUROC on human and dog of DeepMEL and Cluster-Buster

The performance of DeepMEL to discriminate between MEL and MES regions in human and dog was calculated by scoring the top 5,000 differential MEL and MES regions in human and dog (described above) with DeepMEL and calculating the precision of correct assignment (i.e. topic 4 score for the MEL regions and topic 7 scores for the MES regions). The performance of DeepMEL was compared with the motif scoring tool Cluster-Buster (Frith et al. 2003) by scoring the same sets of regions with Cluster-Buster using a merged motif file of (some of) the top filters identified by the model in either topic 4 or topic 7. The obtained CRM scores were used to estimate the performance of Cluster-Buster.

Identification of homologous MEL genes and MEL enhancers

To identify genes differentially expressed in human MEL cell lines, we performed DESeq2 (v1.22.2, R v3.5.2 (R Core Team 2018)) (Love et al. 2014) on RNA-seq data of 7 MEL (MM031, MM034, MM057, MM074, MM087, MM118, MM164) and 5 MES (MM029, MM099, MM116, MM163, MM165) human lines. 379 genes were found differentially expressed in MEL lines (log2FC > 2.5 and adjP < 0.005). We converted the gene symbols to Ensembl gene IDs using biomaRt (v2.38.0, R v3.5.2) (Durinck et al. 2005) and retrieved the genomic locations of the genes using GenomicFeatures (v1.34.8, R v3.5.2) (Lawrence et al. 2013). For the human differential MEL genes with at least one MEL-predicted peak in their extended gene locus (200 kbp up- and down-stream), the homologous genes in the other five species were identified using biomaRt to convert the human Ensembl gene IDs to Ensembl gene IDs of the other species. We identified the MEL enhancers that overlapped with the extended gene loci of each of the homologous genes using bedtools intersect (Quinlan and Hall 2010). liftOver (-minMatch=0.1) (Kuhn et al. 2013) was used to calculate the number of these regions that could be identified by performing coordinate conversion.

Correlation of MEL enhancers using deep layers of DeepMEL

Conserved accessible MEL enhancers in the extended loci of conserved MEL-specific genes across the six species (see above) were scored by DeepMEL. A matrix was generated consisting of a score for each of the 256 nodes in the Dense layer for each of the regions. A Pearson’s correlation matrix was generated to calculate the pairwise similarity between each of the regions.
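
Given the matrix of Dense-layer scores, the pairwise similarity reduces to a single NumPy call. The function name is hypothetical, and the sketch assumes one row per region and one column per node of the 256-unit Dense layer.

```python
import numpy as np


def region_similarity(dense_activations):
    """Pearson correlation between regions, each described by its Dense-layer scores.

    `dense_activations` has shape (n_regions, n_nodes); the result is an
    (n_regions x n_regions) symmetric matrix with ones on the diagonal.
    """
    return np.corrcoef(np.asarray(dense_activations, dtype=float))
```

Because `np.corrcoef` treats rows as variables, regions with similar activation profiles across the 256 nodes obtain a correlation close to 1, irrespective of species or sequence alignability.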

Genome-wide prediction of MEL enhancers

The first chromosome of the human genome (hg19) was tiled with a sliding window of 500 bp and a 100 bp shift using bedtools makewindows (v2.28.0) (Quinlan and Hall 2010). Tiles containing ‘N’ were deleted and the remaining tiles were scored by DeepMEL, and the number of MEL-predicted tiles (topic 4 score > 0.16) was calculated.
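
The tiling and thresholding can be sketched as below. This is an illustration rather than a re-implementation of bedtools makewindows: the function names are hypothetical, only full-length tiles are kept, and `score_tile` stands in for the topic 4 prediction of the model.

```python
def tile_sequence(seq, window=500, step=100):
    """Yield (start, end) of every full-length tile that contains no 'N'."""
    for start in range(0, len(seq) - window + 1, step):
        if "N" not in seq[start:start + window].upper():
            yield start, start + window


def count_predicted_tiles(seq, score_tile, threshold=0.16, window=500, step=100):
    """Count tiles whose score exceeds the threshold (0.16 for topic 4 above)."""
    return sum(
        1
        for start, end in tile_sequence(seq, window, step)
        if score_tile(seq[start:end]) > threshold
    )
```

Since consecutive tiles overlap by 400 bp, one enhancer is usually covered by several predicted tiles, so the tile count is an upper bound on the number of distinct predicted regions.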

Mutations in orthologous enhancers across species

We defined highly-probable orthologous MEL enhancers between human and another species as regions that were predicted as MEL in one species and for which there was a stringent liftOver (-minMatch=0.995) (Kuhn et al. 2013) and high sequence identity (more than 80% after pairwise alignment via needle (EMBOSS, v6.6.0.0) (Madeira et al. 2019), using parameters -gapopen 10.0 -gapextend 0.5) in the other species. featureCounts (v1.6.5) (Liao et al. 2014) was used to generate count matrices per species on these regions, which was followed by library size normalisation. Delta ATAC-seq scores were calculated for the pairs of orthologous regions by dividing the normalised counts of the two species (human counts / non-human counts) after adding a pseudocount. Mutations were identified by alignment via needle, using the parameters -gapopen 10.0 and -gapextend 0.5.
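
The delta ATAC-seq score can be written as a one-line ratio. In the sketch below the function name is hypothetical, and both the counts-per-million normalisation and the pseudocount of 1 are assumptions, since the exact values are not stated above.

```python
import numpy as np


def delta_atac(human_counts, other_counts, human_library_size, other_library_size, pseudocount=1.0):
    """Ratio of library-size-normalised counts (human / non-human) per orthologous pair.

    Assumptions: normalisation to counts per million and a pseudocount of 1.
    """
    human = np.asarray(human_counts, dtype=float) / human_library_size * 1e6
    other = np.asarray(other_counts, dtype=float) / other_library_size * 1e6
    return (human + pseudocount) / (other + pseudocount)
```

A value above 1 indicates higher accessibility in human, a value below 1 higher accessibility in the other species, and the pseudocount keeps the ratio finite when a region has no reads in one of the two.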

Luciferase assay

Six MEL-predicted enhancers (3 in the dog line Dog-OralMel-18249 and 3 in the human line MM001) were synthetically generated and cloned into a pTwist ENTR plasmid (Twist Bioscience) via Twist Bioscience. Regions were transferred from the Gateway entry clone into the destination vector (pGL4.23-GW, Addgene) via a LR reaction by mixing 2 uL of the entry clone (100 ng/uL) with 1 uL of the destination plasmid (150 ng/uL), 1 uL TE buffer and 1 uL LR enzyme (LR Clonase II Plus enzyme mix, Thermo Fisher Scientific), and incubating this mixture at 25°C for 1 hour. Afterwards, 1 uL of Proteinase K (Thermo Fisher Scientific) was added and reactions were incubated at 37°C for 10 min. 3 uL of each LR reaction was transformed into 50 uL of Stellar competent cells (Takara Bio) via heat shock. 200 uL of SOC medium was added and the cells were incubated for 1 hour in a shake incubator at 37°C, before plating the transformed cells on LB agar plates with 1/1000 carbenicillin and incubation overnight at 37°C. The next day, one colony per construct was picked and grown overnight in 5 mL of LB medium with 1/1000 carbenicillin in a shake incubator at 37°C before plasmid extraction using the NucleoSpin Plasmid Transfection-grade kit (Macherey-Nagel). For each construct three biological replicates were performed by transfecting the plasmids into 80% confluent cells of MM001 in a 24 well plate. Per transfection, 400 ng of the construct was transfected together with 40 ng of Renilla plasmid (Promega) using lipofectamine 2000 (Thermo Fisher Scientific). Luciferase activity of each construct was measured using the Dual-Luciferase Reporter Assay (Promega) according to the manufacturer's instructions. Enhancer luciferase activity was normalised against the Renilla luciferase activity.

Publicly available data used in this work

SOX10 ChIP-seq and MITF ChIP-seq data on the 501Mel melanoma cell line were downloaded as raw fastq files from NCBI's Gene Expression Omnibus through GEO accession number GSE61965 (Laurette et al. 2015) and were mapped to the human genome using Bowtie 2 (v2.1.0) (Langmead and Salzberg 2012) and peaks were called by MACS2 (v2.1.1) (Gaspar 2018). TFAP2A ChIP-seq data on human primary melanocytes from neonatal foreskin were retrieved from Seberg et al. (GSE67555) as a BED file, which was converted to a bedGraph and bigWig using the peak height from the BED file. Histone H3 acetylation at lysine 27 (H3K27ac) and H3 monomethylation at lysine 4 (H3K4me1) ChIP-seq data for MM001 (GSE60666); and RNA-seq data (for MM031, MM034, MM057, MM074, MM087, MM099 and MM118 downloaded from GSE60666; for MM029, MM116, MM163, MM164, and MM165 from GSE134432) were processed as explained in Verfaillie et al. (2015). OmniATAC-seq data for the human lines MM001, MM011, MM029, MM031, MM074, MM057, MM087 and MM099 were obtained through GSE134432 (Wouters et al. 2019) and were processed as described above in ‘Data processing human melanoma baseline OmniATAC-seq samples’; which was also the case for ATAC-seq data from normal human melanocytes on foreskin (NHM1), which were downloaded as raw fastq files from GSE94488 (GSM2476338) (Fontanals-Cirera et al. 2017). The massively parallel reporter assay (MPRA) data on the IRF4 enhancer was downloaded from https://mpra.gs.washington.edu/satMutMPRA/ and was processed as described above.

Data access

All raw and processed sequencing data generated in this study have been submitted to the NCBI Gene Expression Omnibus (GEO; https://www.ncbi.nlm.nih.gov/geo/) under accession number GSE142238. This includes OmniATAC-seq data of human melanoma cell lines (MM029, MM034, MM052, MM116, MM118, MM122, MM163, MM164, MM165; data for the other lines used in this study were published before (see ‘Publicly available data used in this work’)), two dog melanoma cell lines, two horse melanoma cell lines, one pig melanoma sample, one pig melanocyte cell line and one mouse melanoma cell line; ATAC-seq data of four zebrafish cell lines; and OmniATAC-seq data of SOX10 and TFAP2A knock-down in the human melanoma cell line MM001. The DeepMEL model was deposited in Kipoi (Avsec et al. 2019a) (http://kipoi.org/models/DeepMEL/). Code and custom scripts for training DeepMEL, DeepMEL predictions, DeepExplainer usage and BLS scoring are provided in GitHub (https://github.com/aertslab/DeepMEL) and as Supplemental Code.

Acknowledgements

This work was supported by an ERC Consolidator Grant to S.A. (no. 724226_cis-CONTROL), the KU Leuven (grant no. C14/18/092 to S.A.), the Foundation Against Cancer (grant no. 2016-070 to S.A.), a PhD fellowship from the FWO (L.M., no. 1S03317N) and a postdoctoral research fellowship from Kom op tegen Kanker (Stand up to Cancer; the Flemish Cancer Society) and Stichting tegen Kanker (Foundation against Cancer; the Belgian Cancer Society) (J.W.). We would like to thank Odessa Van Goethem and Véronique Benne for their contribution in establishing and providing the mouse melanoma cell line and Leif Andersson for sharing the horse melanoma cell lines. We would like to thank Catherine André (CNRS-University of Rennes1, UMR6290, IGDR, Faculty of Medicine, Rennes France) and Cani-DNA BRC (Biosit, Rennes, France) for sharing the in-house canine oral and uveal melanoma cell lines. The Cani-DNA BRC (https://dog-genetics.genouest.org) is funded through the CRB-Anim PIA1 funding (2012-2022) ANR-11-INBS-0003. In addition, we would like to thank Austin George for his help with the hyperparameter optimisation. Computing was performed at the Vlaams Supercomputer Center and high-throughput sequencing was done via the Genomics Core Leuven. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Author contributions

L.M., I.I.T. and S.A. conceived the study. L.M. performed the experimental work for the mammalian OmniATAC-seq dataset, with the help of L.V.A., S.M., V.C. and J.W. M.F., E.v.R. and L.Z. established and maintained the zebrafish cell lines and performed ATAC-seq on these. G.E.M. maintained and provided the pig cell lines. A.P. and E.C. established and provided the dog cell lines. P.K. established and provided the mouse melanoma cell line. M.S. established and provided the horse cell lines. G.E.G. established and provided the human cell lines. L.M. performed the experimental work and analysis of the luciferase assays together with D.M. L.M. performed the bioinformatic analyses of the OmniATAC-seq dataset. I.I.T. established the neural network and performed all bioinformatic analyses regarding the model. L.M., I.I.T., J.W. and S.A. wrote the manuscript.

Disclosure declaration

The authors declare no competing interests.


Enhancer-driven cell type comparison reveals similarities between the mammalian and bird pallium

Preprint

Full-text available

Apr 2024

Combinations of transcription factors govern the identity of cell types, which is reflected by enhancer codes in cis-regulatory genomic regions. Cell type-specific enhancer codes at nucleotide-level resolution have not yet been characterized for the mammalian neocortex. It is currently unknown whether these codes are conserved in other vertebrate brains, and whether they are informative to resolve homology relationships for species that lack a neocortex such as birds. To compare enhancer codes of cell types from the mammalian neocortex with those from the bird pallium, we generated single-cell multiome and spatially-resolved transcriptomics data of the chicken telencephalon. We then trained deep learning models to characterize cell type-specific enhancer codes for the human, mouse, and chicken telencephalon. We devised three metrics that exploit enhancer codes to compare cell types between species. Based on these metrics, non-neuronal and GABAergic cell types show a high degree of regulatory similarity across vertebrates. Proposed homologies between mammalian neocortical and avian pallial excitatory neurons are still debated. Our enhancer code based comparison shows that excitatory neurons of the mammalian neocortex and the avian pallium exhibit a higher degree of divergence than other cell types. In contrast to existing evolutionary models, the mammalian deep layer excitatory neurons are most similar to mesopallial neurons; and mammalian upper layer neurons to hyper- and nidopallial neurons based on their enhancer codes. In addition to characterizing the enhancer codes in the mammalian and avian telencephalon, and revealing unexpected correspondences between cell types of the mammalian neocortex and the chicken pallium, we present generally applicable deep learning approaches to characterize and compare cell types across species via the genomic regulatory code.

A Bag-Of-Motif Model Captures Cell States at Distal Regulatory Sequences

Preprint

Full-text available

Jan 2024

Deciphering the intricate regulatory code governing cell-type-specific gene expression is a fundamental goal in genetics. Current methods struggle to capture the complex interplay between gene distal regulatory sequences and cell context. We developed a computational approach, BOM (Bag-of-Motifs), which represents cis-regulatory sequences by the type and number of TF binding motifs it contains, irrespective of motif order, orientation, and spacing. This simple yet powerful representation allows BOM to efficiently capture the complexity of cell-type-specific information encoded within these sequences. We apply BOM to mouse, human, and zebrafish distal regulatory regions, demonstrating remarkable accuracy. Notably, the method outperforms more complex deep learning models at the same task using fewer parameters. BOM can also uncover cross-species sequence similarities unrecognized by genome alignments. We experimentally validate our in silico predictions using enhancer reporter assay, showing that motifs with the most significant explanatory power are sequence determinants of cell-type specific enhancer activity. BOM offers a novel systematic framework for studying cell-type or condition-specific cis-regulatory sequences. Using BOM, we demonstrate the existence of a highly predictive sequence code at distal regulatory regions in mammals driven by TF binding motifs.

SegmentNT: annotating the genome at single-nucleotide resolution with DNA foundation models

Preprint

Full-text available

Mar 2024

Foundation models have achieved remarkable success in several fields such as natural language processing, computer vision and more recently biology. DNA foundation models in particular are emerging as a promising approach for genomics. However, so far no model has delivered granular, nucleotide-level predictions across a wide range of genomic and regulatory elements, limiting its practical usefulness. In this paper, we build on our previous work on the Nucleotide Transformer (NT) to develop a segmentation model, SegmentNT, that processes input DNA sequences up to 30kb length to predict 14 different classes of genomics elements at single nucleotide resolution. By utilizing pre-trained weights from NT, SegmentNT surpasses the performance of several ablation models, including convolution networks with one-hot encoded nucleotide sequences and models trained from scratch. SegmentNT can process multiple sequence lengths with zero-shot generalization for sequences of up to 50kb. We show improved performance on the detection of splice sites throughout the genome and demonstrate strong nucleotide-level precision. Because it evaluates all gene elements simultaneously, SegmentNT can predict the impact of sequence variants not only on splice site changes but also on exon and intron rearrangements in transcript isoforms. Finally, we show that a SegmentNT model trained on human genomics elements can generalize to elements of different species and that a trained multispecies SegmentNT model achieves stronger generalization for all genic elements on unseen species. In summary, SegmentNT demonstrates that DNA foundation models can tackle complex, granular tasks in genomics at a single-nucleotide resolution. SegmentNT can be easily extended to additional genomics elements and species, thus representing a new paradigm on how we analyze and interpret DNA. We make our SegmentNT-30kb human and multispecies models available on our github repository in Jax and HuggingFace space in Pytorch.

Cross-species prediction of transcription factor binding by adversarial training of a novel nucleotide-level deep neural network

Preprint

Feb 2024

Qinhu Zhang

Cross-species prediction of TF binding remains a major challenge due to the rapid evolutionary turnover of individual TF binding sites, resulting in cross-species predictive performance being consistently worse than within-species performance. In this study, we first propose a novel Nucleotide-Level Deep Neural Network (NLDNN) to predict TF binding within or across species. NLDNN regards the task of TF binding prediction as a nucleotide-level regression task. Beyond predictive performance, we also assess model performance by locating potential TF binding regions, discriminating TF-specific single-nucleotide polymorphisms (SNPs), and identifying causal disease-associated SNPs. Then, we design a dual-path framework for adversarial training of NLDNN to further improve the cross-species prediction performance by pulling the domain space of human and mouse species closer.

Multiplex profiling of developmental cis-regulatory elements with quantitative single-cell expression reporters

Article

May 2024
Br J Pharmacol

The inability to scalably and precisely measure the activity of developmental cis-regulatory elements (CREs) in multicellular systems is a bottleneck in genomics. Here we develop a dual RNA cassette that decouples the detection and quantification tasks inherent to multiplex single-cell reporter assays. The resulting measurement of reporter expression is accurate over multiple orders of magnitude, with a precision approaching the limit set by Poisson counting noise. Together with RNA barcode stabilization via circularization, these scalable single-cell quantitative expression reporters provide high-contrast readouts, analogous to classic in situ assays but entirely from sequencing. Screening >200 regions of accessible chromatin in a multicellular in vitro model of early mammalian development, we identify 13 (8 previously uncharacterized) autonomous and cell-type-specific developmental CREs. We further demonstrate that chimeric CRE pairs generate cognate two-cell-type activity profiles and assess gain- and loss-of-function multicellular expression phenotypes from CRE variants with perturbed transcription factor binding sites. Single-cell quantitative expression reporters can be applied in developmental and multicellular systems to quantitatively characterize native, perturbed and synthetic CREs at scale, with high sensitivity and at single-cell resolution.

Improving the performance of supervised deep learning for regulatory genomics using phylogenetic augmentation

Article

Apr 2024
BIOINFORMATICS

Motivation: Supervised deep learning is used to model the complex relationship between genomic sequence and regulatory function. Understanding how these models make predictions can provide biological insight into regulatory functions. Given the complexity of the sequence to regulatory function mapping (the cis-regulatory code), it has been suggested that the genome contains insufficient sequence variation to train models with suitable complexity. Data augmentation is a widely used approach to increase the data variation available for model training; however, current data augmentation methods for genomic sequence data are limited. Results: Inspired by the success of comparative genomics, we show that augmenting genomic sequences with evolutionarily related sequences from other species, which we term phylogenetic augmentation, improves the performance of deep learning models trained on regulatory genomic sequences to predict high-throughput functional assay measurements. Additionally, we show that phylogenetic augmentation can rescue model performance when the training set is down-sampled and permits deep learning on a real-world small dataset, demonstrating that this approach improves data efficiency. Overall, this data augmentation method represents a solution for improving model performance that is applicable to many supervised deep-learning problems in genomics. Availability and implementation: The open-source GitHub repository agduncan94/phylogenetic_augmentation_paper includes the code for rerunning the analyses here and recreating the figures.

Controllers of histone methylation-modifying enzymes in gastrointestinal cancers

Article

Mar 2024

Construction of single-cell cross-species chromatin accessibility landscapes with combinatorial-hybridization-based ATAC-seq

Article

Feb 2024
DEV CELL

Single-cell spatial multi-omics and deep learning dissect enhancer-driven gene regulatory networks in liver zonation

Article

Jan 2024
NAT CELL BIOL

In the mammalian liver, hepatocytes exhibit diverse metabolic and functional profiles based on their location within the liver lobule. However, it is unclear whether this spatial variation, called zonation, is governed by a well-defined gene regulatory code. Here, using a combination of single-cell multiomics, spatial omics, massively parallel reporter assays and deep learning, we mapped enhancer-gene regulatory networks across mouse liver cell types. We found that zonation affects gene expression and chromatin accessibility in hepatocytes, among other cell types. These states are driven by the repressors TCF7L1 and TBX3, alongside other core hepatocyte transcription factors, such as HNF4A, CEBPA, FOXA1 and ONECUT1. To examine the architecture of the enhancers driving these cell states, we trained a hierarchical deep learning model called DeepLiver. Our study provides a multimodal understanding of the regulatory code underlying hepatocyte identity and their zonation state that can be used to engineer enhancers with specific activity levels and zonation patterns.

Cell type directed design of synthetic enhancers

Article

Dec 2023
NATURE

Transcriptional enhancers act as docking stations for combinations of transcription factors (TFs) and thereby regulate spatiotemporal activation of their target genes. It has been a long-standing goal in the field to decode the regulatory logic of an enhancer and to understand the details of how spatiotemporal gene expression is encoded in an enhancer sequence. Here, we show that deep learning models can be used to efficiently design synthetic, cell type specific enhancers, starting from random sequences, and that this optimization process allows for a detailed tracing of enhancer features at single-nucleotide resolution. We evaluate the function of fully synthetic enhancers to specifically target Kenyon cells or glial cells in the fruit fly brain using transgenic animals. We further exploit enhancer design to create "dual-code" enhancers that target two cell types, and minimal enhancers smaller than 50 base pairs that are fully functional. By examining the state space searches towards local optima, we characterise enhancer codes through the strength, combination, and arrangement of TF activator and TF repressor motifs. Finally, we apply the same strategies to successfully design human enhancers, which adhere to similar enhancer rules as Drosophila enhancers. Enhancer design guided by deep learning leads to better understanding of how enhancers work and shows that their code can be exploited to manipulate cell states.

Nucleotide sequence and DNaseI sensitivity are predictive of 3D chromatin architecture

Preprint

Jul 2018

Recently, Hi-C has been used to probe the 3D chromatin architecture of multiple organisms and cell types. The resulting collections of pairwise contacts across the genome have connected chromatin architecture to many cellular phenomena, including replication timing and gene regulation. However, high resolution (10 kb or finer) contact maps remain scarce due to the expense and time required for collection. A computational method for predicting pairwise contacts without the need to run a Hi-C experiment would be invaluable in understanding the role that 3D chromatin architecture plays in genome biology. We describe Rambutan, a deep convolutional neural network that predicts Hi-C contacts at 1 kb resolution using nucleotide sequence and DNaseI assay signal as inputs. Specifically, Rambutan identifies locus pairs that engage in high confidence contacts according to Fit-Hi-C, a previously described method for assigning statistical confidence estimates to Hi-C contacts. We first demonstrate Rambutan’s performance across chromosomes at 1 kb resolution in the GM12878 cell line. Subsequently, we measure Rambutan’s performance across six cell types. In this setting, the model achieves an area under the receiver operating characteristic curve between 0.7662 and 0.8246 and an area under the precision-recall curve between 0.3737 and 0.9008. We further demonstrate that the predicted contacts exhibit expected trends relative to histone modification ChIP-seq data, replication timing measurements, and annotations of functional elements such as promoters and enhancers. Finally, we predict Hi-C contacts for 53 human cell types and show that the predictions cluster by cellular function. [NOTE: After our original submission we discovered an error in our calling of statistically significant contacts. Briefly, when calculating the prior probability of a contact, we used the number of contacts at a certain genomic distance in a chromosome but divided by the total number of bins in the full genome. When we corrected this mistake we noticed that the Rambutan model, as it currently stands, did not outperform simply using the GM12878 contact map that Rambutan was trained on as the predictor in other cell types. While we investigate these new results, we ask that readers treat this manuscript skeptically.]

Robust gene expression programs underlie recurrent cell states and phenotype switching in melanoma

Article

Aug 2020
NAT CELL BIOL

Melanoma cells can switch between a melanocytic and a mesenchymal-like state. Scattered evidence indicates that additional intermediate state(s) may exist. Here, to search for such states and decipher their underlying gene regulatory network (GRN), we studied 10 melanoma cultures using single-cell RNA sequencing (RNA-seq) as well as 26 additional cultures using bulk RNA-seq. Although each culture exhibited a unique transcriptome, we identified shared GRNs that underlie the extreme melanocytic and mesenchymal states and the intermediate state. This intermediate state is corroborated by a distinct chromatin landscape and is governed by the transcription factors SOX6, NFATC2, EGR3, ELF1 and ETV4. Single-cell migration assays confirmed the intermediate migratory phenotype of this state. Using time-series sampling of single cells after knockdown of SOX10, we unravelled the sequential and recurrent arrangement of GRNs during phenotype switching. Taken together, these analyses indicate that an intermediate state exists and is driven by a distinct and stable ‘mixed’ GRN rather than being a symbiotic heterogeneous mix of cells.

Nucleosome-bound SOX2 and SOX11 structures elucidate pioneer factor function

Article

Apr 2020
NATURE

‘Pioneer’ transcription factors are required for stem-cell pluripotency, cell differentiation and cell reprogramming. Pioneer factors can bind nucleosomal DNA to enable gene expression from regions of the genome with closed chromatin. SOX2 is a prominent pioneer factor that is essential for pluripotency and self-renewal of embryonic stem cells. Here we report cryo-electron microscopy structures of the DNA-binding domains of SOX2 and its close homologue SOX11 bound to nucleosomes. The structures show that SOX factors can bind and locally distort DNA at superhelical location 2. The factors also facilitate detachment of terminal nucleosomal DNA from the histone octamer, which increases DNA accessibility. SOX-factor binding to the nucleosome can also lead to a repositioning of the N-terminal tail of histone H4 that includes residue lysine 16. We speculate that this repositioning is incompatible with higher-order nucleosome stacking, which involves contacts of the H4 tail with a neighbouring nucleosome. Our results indicate that pioneer transcription factors can use binding energy to initiate chromatin opening, and thereby facilitate nucleosome remodelling and subsequent transcription.

From Local Explanations to Global Understanding with Explainable AI for Trees

Article

Jan 2020

Tree-based machine learning models such as random forests, decision trees and gradient boosted trees are popular nonlinear predictive models, yet comparatively little attention has been paid to explaining their predictions. Here we improve the interpretability of tree-based models through three main contributions. (1) A polynomial time algorithm to compute optimal explanations based on game theory. (2) A new type of explanation that directly measures local feature interaction effects. (3) A new set of tools for understanding global model structure based on combining many local explanations of each prediction. We apply these tools to three medical machine learning problems and show how combining many high-quality local explanations allows us to represent global structure while retaining local faithfulness to the original model. These tools enable us to (1) identify high-magnitude but low-frequency nonlinear mortality risk factors in the US population, (2) highlight distinct population subgroups with shared risk characteristics, (3) identify nonlinear interaction effects among risk factors for chronic kidney disease and (4) monitor a machine learning model deployed in a hospital by identifying which features are degrading the model’s performance over time. Given the popularity of tree-based machine learning models, these improvements to their interpretability have implications across a broad set of domains. Tree-based machine learning models are widely used in domains such as healthcare, finance and public services. The authors present an explanation method for trees that enables the computation of optimal local explanations for individual predictions, and demonstrate their method on three medical datasets.

Transcriptome analysis of dog oral melanoma and its oncogenic analogy with human melanoma

Article

Oct 2019

Dogs have been considered an excellent immunocompetent model for human melanoma due to the same tumor location and the common clinical and pathological features with human melanoma. However, the differences in the melanoma transcriptome between the two species have not yet been fully determined. Considering the role of oncogenes in melanoma development, in this study, we first characterized the transcriptome in canine oral melanoma and then compared the transcriptome with that of human melanoma. The global transcriptomes of 8 canine oral melanoma samples and 3 healthy oral tissues were compared by RNA‑Seq followed by RT‑qPCR validation. The results revealed 2,555 annotated differentially expressed genes, as well as 364 novel differentially expressed genes. Dog chromosomes 1 and 9 were enriched with downregulated and upregulated genes, respectively. Ten significant transcription site binding motifs, of which the NF‑κB and ATF1 binding motifs were the most significant, and 4 significant unknown motifs were identified among the upregulated differentially expressed genes. Moreover, it was found that canine oral melanoma shared >80% significant oncogenes (upregulated genes) with human melanoma, and JAK‑STAT was the most common significant pathway between the species. The results identified a 429 gene signature in melanoma, which was up‑regulated in both species; these genes may be good candidates for therapeutic development. Furthermore, this study demonstrates that, with regard to oncogene expression, human melanoma contains an oncogene group that bears similarities with dog oral melanoma, which supports the use of dogs as a model for the development of novel therapeutics and experimental trials before human application.

Melanoma plasticity and phenotypic diversity: Therapeutic barriers and opportunities

Article

Oct 2019
GENE DEV

An incomplete view of the mechanisms that drive metastasis, the primary cause of cancer-related death, has been a major barrier to development of effective therapeutics and prognostic diagnostics. Increasing evidence indicates that the interplay between microenvironment, genetic lesions, and cellular plasticity drives the metastatic cascade and resistance to therapies. Here, using melanoma as a model, we outline the diversity and trajectories of cell states during metastatic dissemination and therapy exposure, and highlight how understanding the magnitude and dynamics of nongenetic reprogramming in space and time at single-cell resolution can be exploited to develop therapeutic strategies that capitalize on nongenetic tumor evolution.

Deep learning at base-resolution reveals motif syntax of the cis-regulatory code

Preprint

Aug 2019

The arrangement of transcription factor (TF) binding motifs (syntax) is an important part of the cis-regulatory code, yet remains elusive. We introduce a deep learning model, BPNet, that uses DNA sequence to predict base-resolution ChIP-nexus binding profiles of pluripotency TFs. We develop interpretation tools to learn predictive motif representations and identify soft syntax rules for cooperative TF binding interactions. Strikingly, Nanog preferentially binds with helical periodicity, and TFs often cooperate in a directional manner, which we validate using CRISPR-induced point mutations. Our model represents a powerful general approach to uncover the motifs and syntax of cis-regulatory sequences in genomics data. Highlights: The neural network BPNet accurately predicts TF binding data at base-resolution. Model interpretation discovers TF motifs and TF interactions dependent on soft syntax. Motifs for Nanog and partners are preferentially spaced at ∼10.5 bp periodicity. Directional cooperativity is validated: Sox2 enhances Nanog binding, but not vice versa.

Saturation mutagenesis of twenty disease-associated regulatory elements at single base-pair resolution

Article

Aug 2019

The majority of common variants associated with common diseases, as well as an unknown proportion of causal mutations for rare diseases, fall in noncoding regions of the genome. Although catalogs of noncoding regulatory elements are steadily improving, we have a limited understanding of the functional effects of mutations within them. Here, we perform saturation mutagenesis in conjunction with massively parallel reporter assays on 20 disease-associated gene promoters and enhancers, generating functional measurements for over 30,000 single nucleotide substitutions and deletions. We find that the density of putative transcription factor binding sites varies widely between regulatory elements, as does the extent to which evolutionary conservation or integrative scores predict functional effects. These data provide a powerful resource for interpreting the pathogenicity of clinically observed mutations in these disease-associated regulatory elements, and comprise a rich dataset for the further development of algorithms that aim to predict the regulatory effects of noncoding mutations.

Canine Melanomas as Models for Human Melanomas: Clinical, Histological, and Genetic Comparison

Article

Jun 2019

Despite recent genetic advances and numerous ongoing therapeutic trials, malignant melanoma remains fatal, and prognostic factors as well as more efficient treatments are needed. The development of such research strongly depends on the availability of appropriate models recapitulating all the features of human melanoma. The concept of comparative oncology, with the use of spontaneous canine models, has recently acquired a unique value as a translational model. Canine malignant melanomas are naturally occurring cancers presenting striking homologies with human melanomas. As for many other cancers, dogs present surprising breed predispositions and a higher frequency of certain subtypes per breed. Oral melanomas, which are much more frequent and highly severe in dogs, and cutaneous melanomas with severe digital forms or uveal subtypes are subtypes presenting relevant homologies with their human counterparts, thus constituting close models for these human melanoma subtypes. This review addresses how canine and human melanoma subtypes compare based on their epidemiological, clinical, histological, and genetic characteristics, and how comparative oncology approaches can provide insights into rare and poorly characterized melanoma subtypes in humans that are frequent and breed-specific in dogs. We propose canine malignant melanomas as models for rare non-UV-induced human melanomas, especially mucosal melanomas. Naturally affected dogs offer the opportunity to decipher the genetics at both germline and somatic levels and to explore therapeutic options, with the dog entering preclinical trials as human patients, benefiting both dogs and humans.

Prioritization of enhancer mutations by combining allele-specific chromatin accessibility with deep learning

Preprint

Dec 2019

Prioritization of non-coding genome variation benefits from explainable AI to predict and interpret the impact of a mutation on gene regulation. Here we apply a specialized deep learning model to phased melanoma genomes and identify functional enhancer mutations with allelic imbalance of chromatin accessibility and gene expression.
