Weining Yuan is a Bachelor's student in the School of Intelligent Engineering, Sun Yat-Sen University. His research interests focus on natural language processing, knowledge transfer and drug design.
Guanxing Chen is a Ph.D. candidate in the School of Intelligent Engineering, Sun Yat-Sen University. His research interests focus on explainable artificial intelligence, drug discovery, deep learning, biosynthesis and vaccine design.
Calvin Yu-Chian Chen is the Dean of the Intelligent Medical Center and a professor in the School of Intelligent Systems Engineering at Sun Yat-sen University. He has also served as an advisor or guest professor at China Medical University, the Massachusetts Institute of Technology (MIT), Peking University and the University of Pittsburgh, and as an adjunct professor at Zhejiang University. His research interests include computer vision, natural language processing and deep learning.
Received: August 8, 2021. Revised: October 21, 2021. Accepted: November 3, 2021
© The Author(s) 2021. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
Briefings in Bioinformatics, 2021, 00, 1–13
https://doi.org/10.1093/bib/bbab506
Problem Solving Protocol
FusionDTA: attention-based feature polymerizer
and knowledge distillation for drug-target binding
affinity prediction
Weining Yuan, Guanxing Chen and Calvin Yu-Chian Chen
Corresponding author: Calvin Yu-Chian Chen, Artificial Intelligence Medical Center, School of Intelligent Systems Engineering, Sun Yat-sen University, Shenzhen 510275, China; Department of Medical Research, China Medical University Hospital, Taichung 40447, Taiwan; Department of Bioinformatics and Medical Engineering, Asia University, Taichung 41354, Taiwan. Tel: +8615626413023; E-mail: chenyuchian@mail.sysu.edu.cn
These authors contributed equally to this work.
Abstract
The prediction of drug–target affinity (DTA) plays an increasingly important role in drug discovery. Nowadays, many prediction methods focus on feature encoding of drugs and proteins but ignore the importance of feature aggregation. However, increasingly complex encoder networks lead to the loss of implicit information and excessive model size. To this end, we propose a deep-learning-based approach named FusionDTA. To address the loss of implicit information, a novel multi-head linear attention mechanism is utilized to replace the rough pooling method. This allows FusionDTA to aggregate global information based on attention weights, instead of selecting the largest value as max-pooling does. To solve the parameter-redundancy issue, we applied knowledge distillation in FusionDTA by transferring learnable information from the teacher model to the student. Results show that FusionDTA performs better than existing models on the test domain across all evaluation metrics. We obtained concordance index (CI) values of 0.913 and 0.906 on the Davis and KIBA datasets, respectively, compared with 0.893 and 0.891 for the previous state-of-the-art model. Under the cold-start constraint, our model proved to be more robust and more effective with unseen inputs than the baseline methods. In addition, knowledge distillation saved half of the parameters of the model, with only a 0.006 reduction in CI. Even FusionDTA with half the parameters could easily exceed the baseline on all metrics. In general, our model has superior performance and improves the effect of drug–target interaction (DTI) prediction. The visualization of DTI can effectively help predict the binding region of proteins during structure-based drug design.
Keywords: drug–target affinity, feature polymerizer, multi-head linear attention, model compression, knowledge distillation
Introduction
Drug discovery is a time-consuming, extremely expensive and high-risk process. It takes more than 10 years and billions of dollars to develop a new drug, yet 90% of the drugs entering clinical trials fail to gain FDA approval and reach the consumer market [1, 2]. In the past few decades, the rapid development of computer technology has enabled computational tools to assist experimental drug design and accelerate drug development [3]. Nowadays, the key part of computer-aided drug design is to find matching drug molecules and proteins. Hence, drug–target interaction (DTI) has become a hot topic that is widely studied [4].

Traditionally, virtual screening has been widely used to extract reasonable drug molecules from large compound databases. However, molecular docking, the technology used to measure the binding affinity between a drug and its target, costs a great deal of experimental time [5]. For proteins with known structural information, drug molecules can be directly docked to obtain binding affinity, but there are still many proteins of unknown structure. Even if a large amount of time is spent on homology modeling, detailed structural information may not be obtained [6]. In response to this challenge, machine learning methods for drug–target affinity (DTA) prediction have gradually become an alternative to molecular docking.
Pahikkala et al. [7] proposed the Kronecker regularized least-squares approach (KronRLS), which defines the similarity score of a drug–target pair through the Kronecker product of similarity matrices. He et al. [8] put forward SimBoost, a read-across method that uses gradient boosting to predict drug–target affinity. Öztürk et al. [9] suggested a deep learning model, DeepDTA, with two independent convolution blocks to learn representations from SMILES strings and protein sequences. Abbasi et al. [10] proposed a deep-learning-based approach, DeepCDA, that combines convolutional layers and long short-term memory
Figure 1. The overall architecture of FusionDTA. First, the original one-hot encoding of the input vector is replaced by a novel distributed representation. Then, a feedforward layer and an LSTM are designed to construct the basic blocks of the encoder layer. Finally, the intermediate carriers of drug molecules and proteins are imported into the fusion layer to obtain an output carrier representation of binding affinity.
(LSTM) layers to effectively encode local and global temporal patterns for deep cross-domain compound–protein affinity prediction. Nguyen et al. [11] proposed a graph-based model, GraphDTA, encoding each drug as an undirected graph with a feature map and an adjacency matrix. A graph convolutional network (GCN) [12], graph attention network (GAT) [13] and graph isomorphism network (GIN) [14] are designed to extract features from drugs, whereas convolution blocks serve as the feature encoder of proteins.
Studies of attention-based methods also contribute to DTA prediction. DrugVQA [15] proposed a question-answering model for drug–target interaction tasks, in which a sequential attention mechanism is utilized to capture dependencies from a dynamic convolutional neural network (CNN). From another perspective, MATT_DTI [16] designed a multi-head attention model that regards the drug representation as the query and the protein representation as the key and value. Nguyen et al. [17] built a graph-in-graph architecture to fuse the drug–protein pair, with a self-attention mechanism to calculate the binding site in the protein representation. Chen et al. [18] utilized a transformer decoder to translate protein sequences into interaction sequences, where protein representations are the original texts and drug representations are previous translations. MT-DTI [19] proposed a new molecular representation method based on the self-attention mechanism, which is superior to existing techniques in terms of the area under the precision–recall curve.
For pre-training of input vectors, Asgari and Mofrad [20] proposed a word2vec model, ProtVec, to obtain continuously distributed representations of proteins. Rao et al. [21] introduced tasks assessing protein embeddings (TAPE) to evaluate semi-supervised learning on protein sequences. In their study, self-supervised models are tested on three mainstream tasks: structure prediction, detection of remote homologs and protein engineering. In addition, Rives et al. [22] learned a multiscale representation space from 86 billion amino acids across 250 million proteins with a robust transformer, ESM-1b. In existing work, one-dimensional (1D) CNNs [23] and pooling methods [24] are often applied to compress a sequence of n words into a single token. However, each token contains unique semantic information. Crude use of 1D CNN layers or global pooling operations to aggregate features may result in the loss of much useful information.
To solve this problem, we propose a novel neural network framework, FusionDTA. In the model architecture, we first encode the inputs as continuously distributed representations depending on the raw input and the parameters of a pre-trained model. For biological sequences, one-hot encoding cannot obtain context information from a mass of unsupervised biological corpora. Thus, a pre-trained transformer is utilized to generate the distributed input representation in our work. Then, LSTM layers make up the basic block of the encoder network. To capture the local and global dependencies of the feature vectors, we apply a two-layer bidirectional LSTM on the feature map from the embedding layers. Finally, we propose to replace the 1D CNN layer or the global pooling layer with a multi-head linear attention layer, which selectively focuses on each token from the entire biological sequence and aggregates global information based on the attention score. Different from the attention mechanisms mentioned above, the proposed linear attention aims
Figure 2. The frequency histograms of binding affinity, protein sequence length and ligand SMILES length in the Davis dataset.
Figure 3. The frequency histograms of binding affinity, protein sequence length and ligand SMILES length in the KIBA dataset.
to capture the direct reflection of each biological token
on binding affinity rather than enhance the representa-
tional ability of the feature encoder.
As the neural network encoder deepens, we often face the phenomenon of excessive parameters in the training process. This phenomenon is always accompanied by overfitting and slow training [25]. Therefore, we propose knowledge distillation for DTA tasks as an improvement in the training strategy. Knowledge distillation establishes a teacher model and a student model. Through defined constraints and loss functions, the student model with fewer parameters obtains knowledge from the teacher model with more parameters. By transferring knowledge from one model to another, knowledge distillation is an effective method for parameter regularization and model compression.
Material and methods
Datasets
We evaluated FusionDTA on two publicly available datasets, the kinase dataset Davis [26] and the KIBA dataset [27]. Both are regarded as benchmark datasets in previous drug–target affinity prediction studies.
Davis dataset: The Davis dataset contains 30 056 interactions from 442 proteins and 68 ligands, in which the binding affinity is evaluated by the dissociation constant (K_d). It reflects selective measurements of the kinase protein family and associated inhibitors with their constant values of dissociation.
To solve the numerical explosion problem, Öztürk et al. proposed to replace the binding affinity value $K_d$ with a novel measure $pK_d$ by converting its value into the logarithmic domain. Specifically, $K_d$ is first scaled to the appropriate range, and then the negative log is calculated as follows:

$$pK_d = -\log_{10}\left(\frac{K_d}{10^9}\right). \quad (1)$$
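As a quick check of Eq. (1), a minimal Python sketch of the conversion (the Davis peak at affinity 5 corresponds to K_d = 10 000 nM):

```python
import math

def kd_to_pkd(kd_nm: float) -> float:
    """Eq. (1): convert a dissociation constant given in nM to pKd."""
    # Dividing by 1e9 expresses the nM value in molar units before -log10.
    return -math.log10(kd_nm / 1e9)

print(kd_to_pkd(10_000))  # the Davis peak: Kd = 10,000 nM -> pKd = 5.0
```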
Figure 2 shows the histograms of binding affinity, protein sequence length and ligand SMILES length in the Davis dataset. The first panel illustrates the distribution of binding-affinity values of DT pairs. The peak at an affinity of 5 accounts for more than half of the dataset: of the 30 056 DT pairs in total, 20 931 have an affinity of 5, and most of the rest are distributed between 6 and 7. In addition, the length of most proteins is concentrated between 400 and 1500; the largest concentration is around 500, and the maximum length is 2549. The SMILES lengths of the ligands present a Gaussian-like distribution, mostly ranging from 35 to 80 and concentrated between 40 and 60, with a maximum length of 103.
KIBA dataset: The KIBA dataset contains kinase inhibitor bioactivities measured by an approach called KIBA, which combines different indicators of inhibitor efficacy, such as K_i, K_d and IC50. The binding affinities were measured over the interactions of 467 proteins and 52 498 ligands.
Figure 3 shows the histograms of binding affinity, protein sequence length and ligand SMILES length in the KIBA dataset. As shown in the figure, the affinities in the KIBA dataset are mainly distributed between 10 and 13, and most of them fall around 11. The length of the protein sequence is concentrated between 200 and 1500, mostly around 700, and the maximum length is 4128. The SMILES lengths of the ligands range from 15 to 100, mostly concentrated around 50, and the maximum length is 590.

Figure 4. The training phase of the protein pre-training model. The top of the chart shows the strategy for the pre-training stage and the bottom shows the strategy for the fine-tuning stage.
Öztürk et al. reported that for 99% of protein pairs in the KIBA dataset, the Smith–Waterman (S-W) similarity between proteins is at most 60%, and 92% of the protein pairs in the Davis dataset have a target similarity of at most 60%. These statistics indicate that both datasets are non-redundant. To ensure the fairness of the experiments, 5-fold cross-validation was adopted. The data were evenly divided into five parts: four for training and one for testing. Hence, a dataset yields five splitting schemes. We tested the proposed model on all schemes and regarded the average score as the final performance.
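A minimal sketch of this 5-fold scheme, with toy stand-ins for the dataset and a placeholder training/evaluation step:

```python
from sklearn.model_selection import KFold

# Toy stand-ins: real entries would be (drug_SMILES, protein_seq, affinity).
interactions = [("CCO", "MKT", 5.0), ("CCN", "MKV", 6.2), ("CCC", "MLT", 7.1),
                ("COC", "MRT", 5.4), ("CNC", "MQT", 6.8)]

def evaluate(train_set, test_set):
    # Placeholder for "train FusionDTA on train_set, score on test_set".
    return 0.0

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = [evaluate([interactions[i] for i in tr], [interactions[i] for i in te])
          for tr, te in kf.split(interactions)]
final_score = sum(scores) / len(scores)  # average over the five schemes
```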
Model architecture
The overall architecture of FusionDTA is shown in
Figure 1. The first step is to feed drug molecules and
protein sequences into the embedding layer. In this
layer, drug molecules are encoded as SMILES strings,
and proteins are encoded as word embeddings. Then, the
LSTM layers are designed to construct the basic blocks
of the encoder layer. Finally, the intermediate carriers
of drug molecules and proteins are imported into the
fusion layer to obtain an output carrier representation of
binding affinity.
Drug representation
For drug molecules, the ASCII string SMILES [28] is widely used as a chemical descriptor for the input representation. SMILES represents a drug molecule as a one-dimensional sequence, from which the chemical properties of the atoms and their arrangement are obtained. We project each SMILES character into a discrete one-hot space by creating a vocabulary over the SMILES format. Each drug molecule is represented as follows:

$$x^D = \{x^D_1, \ldots, x^D_n\} \in \mathbb{R}^{V_D}, \quad (2)$$

where $V_D$ is the vocabulary size of the SMILES format. To avoid a sparse, high-dimensional matrix, $x^D$ is multiplied by a random embedding matrix, which converts $x^D$ to a low-dimensional, dense continuous space $e^D$ as follows:

$$e^D = \{e^D_1, \ldots, e^D_n\} \in \mathbb{R}^d, \quad (3)$$

where $d$ is the embedding dimension of the drug.
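A minimal PyTorch sketch of Eqs. (2) and (3); the vocabulary size and embedding dimension are assumed hyperparameters, and the toy vocabulary is built only for illustration:

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 64   # assumed |V_D|: number of distinct SMILES characters
EMBED_DIM = 128   # assumed embedding dimension d

# nn.Embedding is equivalent to multiplying a one-hot vector by a learnable
# matrix, yielding the dense representation e^D of Eq. (3).
embedding = nn.Embedding(VOCAB_SIZE, EMBED_DIM)

smiles = "CC(=O)NC1=CC=C(C=C1)O"                  # paracetamol, for example
char2idx = {c: i for i, c in enumerate(sorted(set(smiles)))}  # toy vocabulary
token_ids = torch.tensor([[char2idx[c] for c in smiles]])     # (1, n)
e_drug = embedding(token_ids)                     # (1, n, EMBED_DIM)
```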
Protein representation
With the development of natural language processing,
pre-training of input vectors has become an indispens-
able part of the model [29]. The pre-trained model can
Downloaded from https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbab506/6470967 by Sun Yat-Sen University, chenyuchian@mail.sysu.edu.cn on 21 December 2021
FusionDTA |5
help the machine find an interpretable data representa-
tion to improve the performance of the algorithm. In the
process of extracting biological sequence information,
feature extraction can be conducted manually or in an unsupervised manner. However, it is difficult to manually add various effective features to biological sequences in real scenarios, so a better choice is to use unsupervised learning to embed biological sequences into high-dimensional vectors [22].
Inspired by ESM-1b [22], we adopt the pre-trained transformer to replace the original one-hot encoding, taking the distributed contextual vector as the protein representation. Figure 4 shows the overall pre-training and fine-tuning procedures. In the pre-training stage, the original proteins are first divided into several sequences by a fixed maximum length. Each sequence begins with a token [CLS] and ends with a token [SEP]. Then, the input embeddings are the sum of token embeddings and position embeddings with learnable weights. To capture the dependencies between tokens, some proportion of the input tokens are masked at random, and the final task of the pre-training model is to predict those masked tokens. Given the input sequence, we minimize the following negative log-likelihood:

$$\mathcal{L}(\theta) = -\sum_{i \in M} \log p(m = m_i \mid \theta), \quad m_i \in \{1, 2, \ldots, |V_P|\}, \quad (4)$$

where $M$ is the set of masked tokens and $V_P$ is the vocabulary size of amino acids.
In the fine-tuning stage, the DTA task is regarded as the downstream task of protein pre-training. Similar to the pre-training stage, the protein is first divided into sequences with a fixed maximum length. Then, these raw sequences are encoded by the pre-trained ESM-1b, in which contextual dependencies are assigned to each amino acid. This allows the fine-tuning model to learn diverse knowledge from pre-training and fuse the sequential information at the biological word level. Finally, we utilize the top-layer outputs of the pre-trained ESM-1b as the protein representation. Given the one-hot embedding $\{x^P_1, \cdots, x^P_m\} \in \mathbb{R}^{V_P}$, the output of the pre-training model is defined as follows:

$$e^P = \{e^P_1, \cdots, e^P_m\} \in \mathbb{R}^d, \quad (5)$$

where $V_P$ is the vocabulary size of amino acids and $d$ is the dimension of the hidden layer in the pre-training model.
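Assuming the public fair-esm package, the top-layer representation can be extracted roughly as follows (the layer index 33 and hidden size 1280 are fixed properties of ESM-1b; everything else is illustrative):

```python
import torch
import esm  # the public fair-esm package

# Load pre-trained ESM-1b and its batch converter (tokenizer + padding).
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("query_protein", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
labels, strs, tokens = batch_converter(data)   # adds the special tokens

with torch.no_grad():
    out = model(tokens, repr_layers=[33])      # layer 33 is the top layer
e_protein = out["representations"][33]         # (1, seq_len + 2, 1280)
```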
LSTM layer
LSTM is a well-known variant of the recurrent neural network that solves the long-term dependency problem of the general recurrent neural network (RNN) [30]. For protein sequences and drug SMILES, the input vector is represented as a set of multiple discrete biological words. Hence, we can regard the inputs as continuous time series or sentences in a language model. In the Davis and KIBA datasets, more than 80% of protein sequences exceed 200 residues in length. Therefore, traditional methods (1D CNN or S-W) cannot extract high-level semantic features in a precise way. Due to its unique gate design, LSTM is more suitable for processing longer biological sequences than a vanilla RNN or hidden Markov models. Thus, we utilize LSTM as the feature encoder for the embeddings of drugs and proteins.
In our model, the embedding vectors of the sequence and the SMILES are encoded by a two-layer bidirectional LSTM. First, the drug and protein embeddings are fed into a feedforward layer, which consists of a fully connected network and an activation function. The purpose of the feedforward layer is to map the features generated by the embedding layer into the space of the LSTM layer. The LSTM layer then captures the long-term and short-term dependencies in the feature map generated by the feedforward layer. Specifically, we apply a two-layer bidirectional LSTM on top of the feedforward layer. In addition, we stack the feedforward layer and the LSTM layer n times to obtain the local and global dependencies of the feature vectors across different dimensions of semantic features. Since the bidirectional LSTM doubles the feature size, the hidden size of the LSTM layer is set to half the input feature size of the feedforward layer so as to maintain consistency of the feature size.
Taking protein embedding as an example: given an input sentence $\{e^P_1, \cdots, e^P_m\} \in \mathbb{R}^d$, the output of the LSTM layer is defined as $\{h^P_1, \cdots, h^P_m\} \in \mathbb{R}^F$, where $F$ is the feature dimension of each biological token.
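A sketch of one feedforward + BiLSTM block under these constraints, with an assumed feature size F = 256 (the actual hyperparameters are listed in Supplementary Tables S1 and S2):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One feedforward + bidirectional LSTM block (a sketch; sizes assumed).

    The bidirectional LSTM doubles the feature size, so its hidden size is
    feat_dim // 2, keeping input and output dimensions equal as the text
    describes; blocks can therefore be stacked n times.
    """
    def __init__(self, feat_dim: int):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU())
        self.lstm = nn.LSTM(feat_dim, feat_dim // 2, num_layers=2,
                            bidirectional=True, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, _ = self.lstm(self.ff(x))   # (batch, seq_len, feat_dim)
        return h

h = EncoderBlock(256)(torch.randn(8, 1000, 256))   # protein embeddings e^P
```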
Multi-head linear attention mechanism
In existing work, a 1D CNN is commonly applied to the output layer to ensure that protein sequences or SMILES of different lengths yield outputs of the same size. In addition, GraphDTA recommends a pooling method to aggregate the characteristics of drug molecules, where the output vector can be calculated by sum, mean or max pooling. However, powerful feature encoders often allocate feature maps to each token in a refined way. This means that the feature map of each token contains different shallow and in-depth semantic information. Roughly using 1D CNN layers or global pooling operations to compress the feature map may result in the loss of this information. Hence, we propose a novel multi-head linear attention mechanism to capture the meaningful information from each token.
Figure 5 shows the process of multi-head linear attention aggregation. As shown, the input vector is first mapped to the n-head attention vectors. We define the mapping function $\mathrm{LinearAttention}(W, h_i)$ as follows:

$$\mathrm{LinearAttention}(W, h_i) = \frac{\exp\left(\frac{W h_i}{d_k}\right)}{\sum_{j=1}^{m} \exp\left(\frac{W h_j}{d_k}\right)}, \quad (7)$$

where $W \in \mathbb{R}^{1 \times F}$ is the attention weight matrix and $d_k$ is the normalization coefficient.

Figure 5. The process of multi-head linear attention layer aggregation. Each token is allocated n heads of attention weights. The output of each head is the dot product of the tokens in the dotted box and the specific head; a summation then aggregates the n head outputs into one weight per token, and the fused output can be expressed as the concatenation over the feature dimensions.
Multi-head attention allows the machine to focus on information about the input features in different vector spaces and aggregate them as summations. For the coherence of the derivation, we first calculate the summation of the n heads, instead of the dot product. Taking a protein as input, we suppose the input vector is $\{h^P_1, \cdots, h^P_m\} \in \mathbb{R}^F$. Then the attention vector of the n heads, defined as $\{a^P_1, \cdots, a^P_m\} \in \mathbb{R}^1$, is calculated as follows:

$$a^P_i = \sum_{j=1}^{n_{heads}} head_j, \quad (8)$$

$$head_j = \mathrm{LinearAttention}(W_j, h^P_i). \quad (9)$$
Finally, the output of the multi-head attention layer is defined as the dot product of the attention weights and the original input vector:

$$o^P = \sum_{i=1}^{m} a^P_i h^P_i. \quad (10)$$
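A possible PyTorch reading of Eqs. (7)-(10), stacking the n 1×F weight matrices W_j into a single linear projection; sizes are illustrative:

```python
import torch
import torch.nn as nn

class MultiHeadLinearAttention(nn.Module):
    """Aggregates a (batch, seq_len, F) feature map into a (batch, F) vector.

    A sketch of Eqs. (7)-(10): each head applies a learned 1 x F weight and a
    softmax over tokens, the heads are summed into one scalar weight per
    token (Eq. 8), and the output is the attention-weighted sum (Eq. 10).
    """
    def __init__(self, feat_dim: int, n_heads: int = 8):
        super().__init__()
        # One Linear holds all n_heads row vectors W_j stacked together.
        self.proj = nn.Linear(feat_dim, n_heads, bias=False)
        self.d_k = feat_dim                       # normalization coefficient

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        scores = self.proj(h) / self.d_k          # (B, L, n_heads)
        attn = torch.softmax(scores, dim=1)       # normalize over tokens, Eq. (7)
        a = attn.sum(dim=-1, keepdim=True)        # sum the heads, Eq. (8)
        return (a * h).sum(dim=1)                 # weighted sum, Eq. (10)

pooled = MultiHeadLinearAttention(256)(torch.randn(8, 1000, 256))  # (8, 256)
```

Stacking the per-head row vectors into one linear layer keeps the aggregator cheap: it adds only n_heads × F parameters, far fewer than the F × F projections of standard self-attention.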
Fusion layer
We propose a fusion layer composed of three multi-head linear attention blocks to fuse the properties of drug features and protein features. As mentioned above, multi-head linear attention can aggregate drugs and proteins separately. However, when the feature maps of a protein or a drug are aggregated independently, the relationship between the drug and the protein cannot be captured. Therefore, in this paper, three different linear attention blocks are applied in the fusion layer, for protein sequences, drug SMILES and protein–drug information, respectively.
Figure 1 also shows the mechanism of the fusion layer. As shown, the protein sequence and the drug SMILES are first spliced into a new sequence. Given the feature vector $h^P = \{h^P_1, \cdots, h^P_m\} \in \mathbb{R}^F$ for the protein and the feature vector $h^D = \{h^D_1, \cdots, h^D_n\} \in \mathbb{R}^F$ for the drug, the spliced vector is defined as $\hat{h} = \{h^P_1, \cdots, h^P_m, h^D_1, \cdots, h^D_n\} \in \mathbb{R}^F$. Then, the spliced sequence is fed into the multi-head linear attention layer to obtain the aggregation feature $o^{PD} \in \mathbb{R}^F$. Similarly, the proteins and the drugs are respectively fed into multi-head linear attention layers to obtain the features $o^P, o^D \in \mathbb{R}^F$. There are no shared parameters among the three attention layers. The derivation above is expressed as follows:

$$\hat{h} = \mathrm{concat}(h^P, h^D), \quad (11)$$

$$o^{PD} = \mathrm{MultiheadlinearAttn}(\hat{h}), \quad (12)$$

$$o^P = \mathrm{MultiheadlinearAttn}(h^P), \quad (13)$$

$$o^D = \mathrm{MultiheadlinearAttn}(h^D). \quad (14)$$

Finally, a tensor concatenated from $o^{PD}$, $o^P$ and $o^D$ is fed into the fully connected layer as follows:

$$\hat{y} = \mathrm{FC}(\mathrm{concat}([o^{PD}, o^P, o^D])). \quad (15)$$
For the training stage, the goal is to make the prediction distribution as close to the ground truth as possible. Equivalently, we minimize the mean square error between the predicted value and the ground truth:

$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2, \quad (16)$$

where $N$ is the number of samples, $\hat{y}_i$ is the predicted value and $y_i$ is the ground truth.
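A sketch of the fusion layer and the training loss of Eqs. (11)-(16), reusing the MultiHeadLinearAttention module sketched above; the hidden size of the fully connected head is an assumption:

```python
import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    """Sketch of Eqs. (11)-(15): three unshared attention blocks plus an FC head."""
    def __init__(self, feat_dim: int, hidden: int = 512):
        super().__init__()
        self.attn_pd = MultiHeadLinearAttention(feat_dim)  # protein-drug pair
        self.attn_p = MultiHeadLinearAttention(feat_dim)   # protein only
        self.attn_d = MultiHeadLinearAttention(feat_dim)   # drug only
        self.fc = nn.Sequential(nn.Linear(3 * feat_dim, hidden), nn.ReLU(),
                                nn.Linear(hidden, 1))

    def forward(self, h_p: torch.Tensor, h_d: torch.Tensor) -> torch.Tensor:
        h_cat = torch.cat([h_p, h_d], dim=1)               # splice tokens, Eq. (11)
        o = torch.cat([self.attn_pd(h_cat), self.attn_p(h_p),
                       self.attn_d(h_d)], dim=-1)          # (B, 3F), Eqs. (12)-(14)
        return self.fc(o).squeeze(-1)                      # y_hat, Eq. (15)

model = FusionLayer(256)
y_hat = model(torch.randn(8, 1000, 256), torch.randn(8, 80, 256))
loss = nn.MSELoss()(y_hat, torch.randn(8))                 # Eq. (16)
```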
Knowledge distillation
It has been proved that knowledge distillation is an effective method to enhance the generalization of a model and reduce the number of parameters [31, 32]. Regarding the DTA task as a regression problem, we apply the knowledge distillation mechanism in the training phase of FusionDTA and analyze its feasibility theoretically. On the one hand, self-knowledge distillation helps to improve performance through the constraint on feature maps. On the other hand, knowledge distillation enables a small-scale network to perform better than a network of the same scale trained without guidance.
Knowledge distillation learning
Conceptually, we define a powerful network that has already been trained as the teacher model, whereas a network of the same or smaller scale that learns from the teacher model is the student model.
Figure 6 shows the training stages of the teacher and the student models. First, FusionDTA is trained as a teacher model via an effective network. We define the target and drug inputs as $X$ and the affinity as $Y$; what we are concerned with is then the function $f(x): X \to Y$. In deep learning models, the function $f(x)$ is approximated by a parametrized function $f(x, \theta_1)$, where $\theta_1 \in \Theta$. Specifically, stochastic gradient descent aims to learn the parameters $\theta^*_1$ by minimizing some objective function:

$$\theta^*_1 = \mathrm{argmin}_{\theta_1}\, \mathcal{L}(y, f(x, \theta_1)). \quad (17)$$
Second, we train a new network as a student model. The knowledge of the student model is obtained from both the teacher model and the real affinity. Therefore, the objective function of the student model consists of two parts: one is the loss between $f(x, \theta_2)$ and the real target, and the other is the loss between $f(x, \theta_2)$ and $f(x, \theta^*_1)$:

$$Loss_1 = \mathcal{L}_1\big(f(x, \theta^*_1), f(x, \theta_2)\big), \quad (18)$$

$$Loss_2 = \mathcal{L}_2\big(y, f(x, \theta_2)\big), \quad (19)$$

$$\theta^*_2 = \mathrm{argmin}_{\theta_2}\big(\alpha\, Loss_1 + (1 - \alpha)\, Loss_2\big), \quad (20)$$

where $\alpha$ is the impact factor that determines the weight between $Loss_1$ and $Loss_2$.
Knowledge distillation for DTA task
In the abovementioned derivation, our ultimate goal is to minimize the objective function of the student. Owing to differences in outputs and model requirements, no universal $\mathcal{L}_1$ can be found for all tasks. Hinton and Salakhutdinov [33] introduced a generalized softmax function as $\mathcal{L}_1$, in which the concept of temperature was introduced to ensure that the softmax distribution generated by the teacher model is soft enough. In this way, the student model can extract knowledge from the softmax output distribution at a higher temperature, and then restore the low temperature during the test stage. However, in a regression task, the ground truth is a 1D continuous variable rather than a one-hot label. Therefore, the logits of the teacher model do not contain more information than the ground truth. In other words, in the DTA task, the student model will not learn additional hidden knowledge from the output logits of the teacher model.
To solve this problem, we suggest that the student model should learn transferable knowledge from the feature map of the hidden layer, instead of the logits in the output layer. We define $\mathcal{L}_1$ as follows:

$$\mathcal{L}_1(\theta_{Hint}, \theta_{Guide}) = \left\| g(x, \theta_{Hint}) - r\big(g(x, \theta_{Guide}), \theta_r\big) \right\|^2_2, \quad (21)$$

where $g$ is a transfer function from the input $x$ to the hidden layer with parameters $\theta_{Hint}$ or $\theta_{Guide}$, and $r$ is a nonlinear regression function on top of the guidance layer with parameters $\theta_r$; the regression function $r$ maps $g(x, \theta_{Guide})$ to a vector space with the same dimension as $g(x, \theta_{Hint})$. Specifically, we define the vectors generated by the first fully connected layer after the multi-head linear attention layer as the output of function $g$.

Figure 6. The training stage of the teacher and the student models. We divide the loss function into three parts: (A) $L^R_{teacher}$ for the teacher model; (B) the distillation loss $L^D_{student}$ and the regression loss $L^R_{student}$; (C) $L_{student}$ for the student model, which is the weighted sum of $L^D_{student}$ and $L^R_{student}$.
$\mathcal{L}_2$ is defined as the mean square error (MSE):

$$\mathcal{L}_2(Y_i, P_i) = \mathrm{MSE}(Y_i, P_i) = \frac{1}{N} \sum_{i=1}^{N} (Y_i - P_i)^2, \quad (22)$$
where $Y_i$ is the true value of binding affinity and $P_i$ is the predicted value. From the perspective of the loss function, the update strategy of the student model is to ensure that its hidden layer outputs are as close as possible to those of the teacher model. Therefore, the teacher model can provide the student with guidance during the training process. Meanwhile, the student's parameters after the hidden layer are not affected by the teacher, so the flexibility of the network will not be greatly inhibited. From a biological point of view, the binding information between a protein and a drug is not only expressed through binding affinity [34]. Assuming that the feature map of the middle layer contains more hidden information, e.g. the location of the binding site, the student model can learn transferable biological structure information, rather than just bringing its feature map closer to verified, better parameters.
In addition, knowledge distillation contributes to the restraint of model parameters and overfitting. Considering the feature map of the teacher model as a constraint, knowledge distillation limits the difference between the parameters of the teacher and the student. Compared with L2 regularization, the loss function $\mathcal{L}_1$ allows the model to learn more effective, verified network parameters (rather than simply driving them toward zero).
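Putting the pieces together, a hedged sketch of the student objective combining the feature hint of Eq. (21) with the regression loss of Eq. (22) under the weighting of Eq. (20); argument names are illustrative, not the authors' API:

```python
import torch
import torch.nn.functional as F

def student_loss(y_true, y_student, feat_hint, feat_guided, regressor,
                 alpha: float = 0.5):
    """Weighted student objective, a sketch of Eqs. (20)-(22).

    Following the notation of Eq. (21), `feat_hint` and `feat_guided` are the
    feature maps after the first fully connected layer of the two networks,
    and `regressor` (the function r) projects the guided features into the
    hint space when the dimensions differ. Names are illustrative only.
    """
    loss1 = F.mse_loss(feat_hint, regressor(feat_guided))  # L1: feature hint
    loss2 = F.mse_loss(y_student, y_true)                  # L2: true affinity
    return alpha * loss1 + (1.0 - alpha) * loss2           # Eq. (20)
```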
Evaluation metrics
The concordance index (CI), a model evaluation index proposed by Gönen and Heller [35], was designed to measure the agreement between the predicted values of the model and the ground truth. CI is defined as follows:

$$CI = \frac{1}{Z} \sum_{\delta_i > \delta_j} h(b_i - b_j), \quad (23)$$

where $b_i$ is the prediction value for $\delta_i$, $b_j$ is the prediction value for $\delta_j$, $h(x)$ is the step function and $Z$ is the normalization constant. Commonly, the step function $h(x)$ is defined as follows:

$$h(x) = \begin{cases} 0, & x < 0 \\ 0.5, & x = 0 \\ 1, & x > 0 \end{cases} \quad (24)$$
MSE is a statistical measure that evaluates the error directly. Assuming there are $n$ estimated samples and $n$ corresponding true values, MSE is expressed as the expectation of the squared loss:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \quad (25)$$
Table 1. The performance of FusionDTA and baseline models on the Davis dataset

Model       CI (std)         MSE    r_m^2 (std)
KronRLS     0.871 (±0.001)   0.379  0.407 (±0.005)
SimBoost    0.872 (±0.002)   0.282  0.644 (±0.006)
DeepDTA     0.878 (±0.004)   0.261  0.630 (±0.017)
WideDTA     0.886 (±0.003)   0.262  –
MT-DTI      0.887 (±0.003)   0.245  0.665 (±0.014)
DeepCDA     0.891 (±0.003)   0.248  0.649 (±0.009)
MATT_DTI    0.891 (±0.002)   0.227  0.683 (±0.017)
GraphDTA    0.893 (±0.001)   0.229  –
FusionDTA   0.913 (±0.002)   0.208  0.743 (±0.007)
Table 2. The performance of FusionDTA and baseline models on the KIBA dataset

Model       CI (std)         MSE    r_m^2 (std)
KronRLS     0.782 (±0.001)   0.441  0.342 (±0.001)
SimBoost    0.836 (±0.001)   0.222  0.629 (±0.007)
DeepDTA     0.863 (±0.002)   0.194  0.673 (±0.009)
WideDTA     0.875 (±0.001)   0.179  –
MT-DTI      0.882 (±0.001)   0.152  0.738 (±0.006)
DeepCDA     0.889 (±0.002)   0.176  0.682 (±0.008)
MATT_DTI    0.889 (±0.001)   0.150  0.756 (±0.011)
GraphDTA    0.891 (±0.002)   0.139  –
FusionDTA   0.906 (±0.001)   0.130  0.793 (±0.008)
where $\hat{y}_i$ is the estimate of the $i$th sample and $y_i$ is the true value of the $i$th sample.
Regression toward the mean (the $r^2_m$ index) is a measure evaluating the external predictive performance of a model: if a variable is extreme on one measurement, it tends to be closer to the average on the next. The $r^2_m$ index is calculated as follows:

$$r^2_m = r^2 \left(1 - \sqrt{r^2 - r^2_0}\right), \quad (26)$$

where $r^2$ and $r^2_0$ are the squared correlation coefficients with and without intercept, respectively.
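The three metrics can be sketched in NumPy as follows; the $r^2_m$ helper assumes the standard with/without-intercept definitions used in the DTA literature, since the paper does not spell them out:

```python
import numpy as np

def concordance_index(y_true, y_pred):
    """CI of Eqs. (23)-(24): the fraction of correctly ordered pairs,
    counting ties in the prediction as 0.5 (an O(n^2) reference version)."""
    y, p = np.asarray(y_true), np.asarray(y_pred)
    total, score = 0.0, 0.0
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] > y[j]:                  # pairs with delta_i > delta_j
                total += 1
                score += 1.0 if p[i] > p[j] else (0.5 if p[i] == p[j] else 0.0)
    return score / total                     # Z = number of valid pairs

def mse(y_true, y_pred):
    """MSE of Eq. (25)."""
    y, p = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean((y - p) ** 2))

def rm2(y_true, y_pred):
    """r_m^2 of Eq. (26); r^2 / r0^2 follow the usual DTA conventions."""
    y, p = np.asarray(y_true), np.asarray(y_pred)
    r2 = np.corrcoef(y, p)[0, 1] ** 2        # squared corr. with intercept
    k = np.sum(y * p) / np.sum(p * p)        # slope of regression through origin
    r02 = 1 - np.sum((y - k * p) ** 2) / np.sum((y - y.mean()) ** 2)
    return float(r2 * (1 - np.sqrt(np.abs(r2 - r02))))
```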
Results and discussion
In this section, the Davis and KIBA datasets were utilized to evaluate the performance of the model. The hyperparameters used for these two datasets in FusionDTA are shown in Supplementary Tables S1 and S2. To evaluate the performance of the multi-head linear attention, we compared it with max-pooling and mean-pooling on the Davis dataset. In addition, a comparative experiment was set up in which the performance of knowledge distillation is measured against existing models and vanilla FusionDTA. We compared our model with the following benchmark models: KronRLS [7], SimBoost [8], DeepDTA [9], WideDTA [36], GraphDTA [11], DeepCDA [10], MT-DTI [19] and MATT_DTI [16].
The performance of FusionDTA
In Table 1, we list the performance of the proposed model evaluated on the Davis dataset and compare it with the baseline models. As shown, FusionDTA is superior to the existing models in all aspects. In detail, FusionDTA improves the CI by 0.020 and reduces the MSE by 0.021 compared with the previous baseline model, GraphDTA. In addition, FusionDTA achieves a 0.060 improvement in the $r^2_m$ index compared to the baseline model MATT_DTI. For previous studies that proposed more than one model architecture, we only report the best-performing architecture on the Davis dataset.

Table 2 presents the performance of FusionDTA and the baseline models on the KIBA dataset. The results show that FusionDTA also achieves significantly better results than the baseline models on all of the evaluation measures. FusionDTA improves the CI by 0.015 and reduces the MSE by 0.009 compared with the previous state-of-the-art model, GraphDTA. Moreover, FusionDTA achieves a 0.037 improvement in the $r^2_m$ index over MATT_DTI.
Figure 7 illustrates the real affinity against the predicted value on both the Davis and KIBA datasets. Taking the ground truth as the x-axis and the prediction as the y-axis, the vertical distance $|y - x|$ from each point to the line $y = x$ represents the discrepancy between its predicted and real affinity values. Histograms at the edges represent the overall distributions of true and predicted affinities. As shown, the samples tend to be symmetric about $y = x$ for both the Davis and KIBA datasets. In particular, the sampling points in the KIBA dataset are more densely distributed around $y = x$.

Figure 7. The real affinity against the predicted value on the Davis and KIBA datasets. For each point, the x-axis reflects its real value and the y-axis reflects its predicted value. The vertical distance $|y - x|$ from each sample to $y = x$ represents the discrepancy between its predicted affinity value and the real value.
The performance of various pooling methods
In the model architecture, different pooling methods allow the model to pay attention to different parts of the intermediate sequence, determining how the parameters of each layer are updated according to different gradients. Common pooling methods include max-pooling and mean-pooling, which aggregate the feature map of a sequence into a single token with a maximizing function or an averaging function, respectively. In this model, a multi-head linear attention layer is proposed to replace the traditional pooling layer, with the aim of selectively focusing on the information of each biological token across the whole protein sequence or the entire SMILES chain.

Table 3. The performance of max-pooling, mean-pooling and the multi-head linear attention layer on the Davis dataset

Pooling method                CI     MSE
Max-pooling                   0.904  0.220
Mean-pooling                  0.910  0.211
Multi-head linear attention   0.913  0.208

Table 4. The performance of max-pooling, mean-pooling and the multi-head linear attention layer on the KIBA dataset

Pooling method                CI     MSE
Max-pooling                   0.897  0.137
Mean-pooling                  0.904  0.132
Multi-head linear attention   0.906  0.130
To evaluate the impact of different pooling methods on model performance, three controlled experiments were set up in the verification stage. It is worth mentioning that the model parameters of each experiment group are the same except for the pooling method. In Table 3, we list the performance of max-pooling, mean-pooling and the multi-head linear attention layer on the Davis dataset. As shown, the CI of the multi-head linear attention layer is 0.913, whereas the CIs of max-pooling and mean-pooling are 0.904 and 0.910, respectively. Clearly, the multi-head linear attention layer performs better than the other two pooling methods on the Davis dataset.

Table 4 reports the performance of max-pooling, mean-pooling and the multi-head linear attention layer on the KIBA dataset. As shown, the CI of multi-head linear attention is 0.906, which is higher than the 0.897 of max-pooling and the 0.904 of mean-pooling. In addition, the MSE of the multi-head linear attention layer is 0.130, lower than the 0.137 of max-pooling and the 0.132 of mean-pooling. On the KIBA dataset, the multi-head linear attention layer as a feature aggregator performs better than mean-pooling and max-pooling.
The performance of cold-start
The cold-start problem refers to evaluating model performance on unseen inputs. From an application point of view, a high proportion of protein or drug representations may not appear in the training set. Therefore, the challenge is whether a model with an excellent score on specific datasets can also perform well on unknown data. In this regard, cold-start performance indicates the model's robustness in a new environment (e.g. mutated proteins).

We compare our model with the following benchmark models: GraphDTA [11], GLFA and GEFA [17]. Table 5 reports the performance of drug cold-start, protein cold-start and drug–protein cold-start on the Davis dataset, corresponding to unseen drugs, unseen proteins and unseen drug–protein pairs, respectively.
Figure 8. An example of the weight visualization of the proposed model. MARK3 (PDB ID: 3FE3) is shown in cartoon form, while Nilotinib is shown in stick form. The cyan color highlights the highly focused positions of the protein and the focused drug atoms in the binding pocket, and a darker color indicates a smaller attention weight.
Table 5. The performance of drug cold-start, protein cold-start and drug–protein cold-start on the Davis dataset

Setting / Model          CI     MSE
Drug cold-start
  GraphDTA               0.675  0.920
  GLFA                   0.670  0.861
  GEFA                   0.709  0.846
  FusionDTA              0.747  0.681
Target cold-start
  GraphDTA               0.706  0.510
  GLFA                   0.780  0.4531
  GEFA                   0.795  0.4335
  FusionDTA              0.826  0.331
Drug–target cold-start
  GraphDTA               0.627  1.130
  GLFA                   0.636  1.144
  GEFA                   0.639  0.989
  FusionDTA              0.685  0.716
Table 6. The performance of vanilla FusionDTA, KD + FusionDTA (large-scale) and KD + FusionDTA (small-scale) on the Davis dataset

Model                             CI     MSE
Vanilla FusionDTA (large-scale)   0.913  0.208
Vanilla FusionDTA (small-scale)   0.905  0.221
KD + FusionDTA (large-scale)      0.914  0.205
KD + FusionDTA (small-scale)      0.908  0.213
As shown, FusionDTA achieved a CI of 0.747 and an MSE of 0.681 under the drug cold-start constraint, a CI of 0.826 and an MSE of 0.331 under the protein cold-start constraint, and a CI of 0.685 and an MSE of 0.716 under the drug–protein cold-start constraint. As a result, our model is better than all of the baseline models on the cold-start problem and more likely to perform robustly in undiscovered applications.
The performance of knowledge distillation
In this section, we evaluated the contribution of knowledge distillation on the Davis dataset. As mentioned above, knowledge distillation is an effective way to facilitate knowledge transfer and parameter regularization. Two experiments with different parameter sizes were therefore set up in the verification stage to examine the various effects of knowledge distillation. In one experiment, the parameter size of the student model was exactly the same as that of the teacher model, aiming to evaluate the effect of teacher guidance on the student model's performance. In the other experiment, a student model with only half as many parameters was used to evaluate the capability of model compression. For each experiment, the teacher model was a frozen, pre-trained FusionDTA, while the student model was initialized as an untrained FusionDTA. The student model then learned a new distribution from the teacher's outputs and the true values via the knowledge distillation training strategy.
Table 6 shows the performance of FusionDTA (large-scale), knowledge distillation + FusionDTA (large-scale) and knowledge distillation + FusionDTA (small-scale) evaluated on the Davis dataset. As shown, knowledge distillation + FusionDTA (large-scale) improves the CI by 0.001 and reduces the MSE by 0.003 compared with FusionDTA (large-scale). Knowledge distillation with the small-scale FusionDTA also achieved a CI of 0.908.

Supplementary Figure S1 shows the performance of the baseline models and the proposed models measured by CI, MSE and model scale. The numbers of parameters of DeepDTA, DeepCDA, GraphDTA, FusionDTA (large-scale), KD + FusionDTA (large-scale) and KD + FusionDTA (small-scale) are 1 967 745, 3 641 345, 4 749 573, 5 362 081, 5 362 081 and 2 013 537, respectively. As shown, knowledge distillation + FusionDTA (large-scale) achieves the best performance among all the methods.
Meanwhile, with only a little loss of accuracy, knowledge distillation can be regarded as an effective model compression method for the DTA task.
Visualization with attention weights
The attention weights obtained by FusionDTA can be used to analyze which parts of the interaction between a small drug molecule and a target protein play a key role in the binding pocket. The attention mechanism can calculate key areas of interaction between the protein sequence and the drug compound. To visualize the main areas of interaction, we first calculated the weights of the protein sequence and the SMILES characters of the drug compound, and then selected the corresponding interaction sites with relatively large attention values. Figure 8 shows an example of the weight visualization of the proposed model. We chose the complex of MARK3 (PDB ID: 3FE3) and Nilotinib for interactive visual analysis. The results showed that the weight values mainly range from 5.69E-4 to 1.43E-3. We colored the positions where the attention weight is greater than 9.80E-4 in the drug compound and greater than 9.57E-4 in the protein. The cyan color highlights the highly focused positions of the protein and the focused drug atoms in the binding pocket, and a darker color indicates a smaller attention weight. Our model mainly captured the main amino acid region, residues 194–339. Interestingly, the attention weights captured by our model in residues 194–339 are almost all close to 9.57E-4, while those of residues 285–287 are relatively larger. The peak value is at LYS-285, which falls right in the binding pocket, indicating that our model accurately predicted the potential docking site. Overall, some of residues 194–339 are in the docking pocket of MARK3 and Nilotinib, while some are located outside the region, which indicates that most of the regions captured by our model are located at the docking interface, though part of them captures the wrong area. The weights calculated by our model are mainly concentrated in the binding pocket, which shows that our model can predict the interaction between protein and compound accurately. In short, the proposed model can extract useful information from the two channels of drug SMILES and protein sequence.
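A sketch of the thresholding behind Figure 8; the weight tensors stand in for the fusion layer's per-token attention values, and the cutoffs are those reported above:

```python
import torch

# Illustrative stand-ins for the per-token attention weights a^P and a^D;
# in practice they come from the trained model's linear attention layers.
torch.manual_seed(0)
weights_p = torch.rand(300) * 1.43e-3   # one weight per residue (toy length)
weights_d = torch.rand(59) * 1.43e-3    # one weight per SMILES char (toy length)

# Cutoffs reported in the text: 9.57e-4 for residues, 9.80e-4 for drug atoms.
highlight_residues = (weights_p > 9.57e-4).nonzero(as_tuple=True)[0].tolist()
highlight_atoms = (weights_d > 9.80e-4).nonzero(as_tuple=True)[0].tolist()

# The selected residue indices can then be colored on the 3FE3 structure,
# e.g. in PyMOL: color cyan, resi 285-287
```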
Conclusion
This paper has presented a novel DTA prediction framework, FusionDTA. A new multi-head linear attention mechanism is applied to replace the coarse pooling method, using attention weights to aggregate global information. Additionally, we applied knowledge distillation in the framework by transferring learnable information from the teacher model to the student model. To evaluate the proposed work, it was applied to two common datasets, KIBA and Davis. Experimental results show that our model performs better than existing models on all evaluation indicators. When drugs and proteins are unknown, FusionDTA proved to be more robust and more effective than the other models in the benchmark, which will help in the development of new drugs. Meanwhile, knowledge distillation is of great help to performance, saving half of the parameters of the model while the CI hardly changes; more importantly, in this case our model can still easily exceed the baselines on all indicators. In additional experiments presented in Supplementary Tables S3 and S4, the pre-training representation of proteins proves to be effective for DTA models with sequential inputs, while pre-training of drugs fails to show superiority for the DTA task. Furthermore, the model has been shown to provide biological insights for understanding the nature of molecular interactions and to capture the binding pockets of proteins and molecules. In general, our model has superior performance and improves the effect of DTI prediction. The visualization of DTI can effectively help predict the binding region of proteins during structure-based drug design.
Key Points
• Due to the maximizing or averaging operator, crude use of pooling methods may result in the loss of hidden information. To this end, we propose a multi-head linear attention to capture the deep dependency of each token.
• To solve the parameter-redundancy issue, we propose a constraint under which a small-scale network can learn knowledge from a large-scale network and the real affinity.
• Our method achieves state-of-the-art performance on the Davis and KIBA datasets. In terms of model compression, FusionDTA with half the parameters can easily exceed the baseline on all metrics.
Supplementary Data
Supplementary data are available online at http://bib.oxfordjournals.org/.
Code and data availability
The source code and data of this study are available at
https://github.com/yuanweining/FusionDTA.
Funding
This work was supported by the National Natural Science
Foundation of China (Grant No. 62176272), Guangzhou
Science and Technology Fund (Grant No. 201803010072),
Science, Technology and Innovation Commission of
Shenzhen Municipality (JCYL 20170818165305521) and
China Medical University Hospital (DMR-111-102, DMR-111-143, DMR-111-123). We also acknowledge the start-up funding from SYSU's "Hundred Talent Program".
References
1. Newman DJ, Cragg GM. Natural products as sources of new
drugs over the nearly four decades from 01/1981 to 09/2019. J
Nat Prod 2020;83(3):770–803.
2. Takebe T, Imai R, Ono S. The current status of drug discovery
and development as originated in United States academia: the
influence of industrial and academic collaboration on drug
discovery and development. Clin Transl Sci 2018;11(6):597–606.
3. Lin X, Li X, Lin X. A review on applications of computational
methods in drug screening and design. Molecules 2020;25(6):1375.
4. Wen M, Zhang Z, Niu S, et al. Deep-learning-based drug-target
interaction prediction. J Proteome Res 2017;16(4):1401–9.
5. Kairys V, Baranauskiene L, Kazlauskiene M, et al. Binding affinity
in drug design: experimental and computational techniques.
Expert Opin Drug Discovery 2019;14(8):755–68.
6. Yadav AR, Mohite SK. Homology modeling and generation
of 3d-structure of protein. Res J Pharm Dosage Forms Technol
2020;12(4):313–20.
7. Pahikkala T, Airola A, Pietilä S, et al. Toward more realistic drug-
target interaction predictions. Brief Bioinform 2015;16(2):325–37.
8. He T, Heidemeyer M, Ban F, et al. Simboost: a read-across
approach for predicting drug-target binding affinities using gra-
dient boosting machines. J Cheminform 2017;9(1):1–14.
9. Öztürk H, Özgür A, Ozkirimli E. Deepdta: deep drug-target bind-
ing affinity prediction. Bioinformatics 2018;34(17):i821–9.
10. Abbasi K, Razzaghi P, Poso A, et al. Deepcda: deep cross-domain
compound-protein affinity prediction through lstm and convo-
lutional neural networks. Bioinformatics 2020;36(17):4633–42.
11. Nguyen T, Le H, Quinn TP, et al. Graphdta: predicting drug–
target binding affinity with graph neural networks. Bioinformatics
2021a;37(8):1140–7.
12. Kipf TN, Welling M. Semi-supervised classification with graph
convolutional networks. arXiv preprint arXiv:1609.02907. 2016.
13. Veličković P, Cucurull G, Casanova A, et al. Graph attention networks. arXiv preprint arXiv:1710.10903. 2017.
14. Xu K, Hu W, Leskovec J, et al. How powerful are graph neural
networks?. arXiv preprint arXiv:1810.00826. 2018.
15. Zheng S, Li Y, Chen S, et al. Predicting drug-protein interaction
using quasi-visual question answering system. Nat Mach Intell
2020;2(2):134–40.
16. Zeng Y, Chen X, Luo Y, et al. Deep drug-target binding affin-
ity prediction with multiple attention blocks. Brief Bioinform
2021;22(5):1–10.
17. Nguyen TM, Nguyen T, Le TM, et al. Gefa: early fusion approach
in drug-target affinity prediction. IEEE/ACM Trans Comput Biol
Bioinform 2021b.
18. Chen L, Tan X, Wang D, et al. TransformerCPI: improving
compound-protein interaction prediction by sequence-based
deep learning with self-attention mechanism and label reversal
experiments. Bioinformatics 2020;36(16):4406–14.
19. Shin B, Park S, Kang K, et al. Self-attention based molecule rep-
resentation for predicting drug-target interaction. arXiv preprint
arXiv:1908.06760, 2019.
20. Asgari E, Mofrad MRK. Continuous distributed representation of
biological sequences for deep proteomics and genomics. PloS One
2015;10(11):e0141287.
21. Rao R, Bhattacharya N, Thomas N, et al. (eds). Evaluating protein
transfer learning with tape. In: Advances in Neural Information
Processing Systems. Vancouver, Canada: Neural Information Pro-
cessing Systems Foundation, Inc., Vol. 32, 2019, 9689.
22. Rives A, Meier J, Sercu T, et al. Biological structure and function
emerge from scaling unsupervised learning to 250 million pro-
tein sequences. Proc Natl Acad Sci 2021;118(15):e2016239118.
23. Hirohara M, Saito Y, Koda Y, et al. Convolutional neural network
based on smiles representation of compounds for detecting
chemical motif. BMC Bioinform 2018;19(19):83–94.
24. Jiang M, Li Z, Zhang S, et al. Drug-target affinity predic-
tion using graph neural network and contact maps. RSC Adv
2020;10(35):20701–12.
25. Buciluǎ C, Caruana R, Niculescu-Mizil A. Model compression.
In: Proceedings of the 12th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining. Philadelphia, Pennsylvania,
USA: ACM, Inc., 2006, 535–41.
26. Davis MI, Hunt JP, Herrgard S, et al. Comprehensive analy-
sis of kinase inhibitor selectivity. Nat Biotechnol 2011;29(11):
1046–51.
27. Tang J, Szwajda A, Shakyawar S, et al. Making sense of large-
scale kinase inhibitor bioactivity data sets: a comparative and
integrative analysis. J Chem Inf Model 2014;54(3):735–43.
28. Weininger D. Smiles: a chemical language and information sys-
tem. J Chem Inf Comput Sci 1988;28(1):31–6.
29. Qiu X, Sun T, Yige X, et al. Pre-trained models for natural
language processing: a survey. Sci China Technol Sci 2020;63(10):
1872–97.
30. Sundermeyer M, Schlüter R, Ney H. Lstm neural networks
for language modeling. In: Thirteenth Annual Conference of the
International Speech Communication Association. Portland, Oregon,
USA: International Speech Communication Association (ISCA),
2012.
31. Hinton G, Vinyals O, Dean J. Distilling the knowledge in a neural
network. arXiv preprint arXiv:1503.02531. 2015.
32. Clark K, Luong M-T, Khandelwal U, et al. Bam! born-again
multi-task networks for natural language understanding. arXiv
preprint arXiv:1907.04829. 2019.
33. Hinton GE, Salakhutdinov RR. Replicated softmax: an undi-
rected topic model. Adv Neural Inform Process Syst 2009;22:
1607–14.
34. Vuignier K, Schappler J, Veuthey J-L, et al. Drug-protein bind-
ing: a critical review of analytical tools. Anal Bioanal Chem
2010;398(1):53–66.
35. Gönen M, Heller G. Concordance probability and discrimi-
natory power in proportional hazards regression. Biometrika
2005;92(4):965–70.
36. Öztürk H, Ozkirimli E, Özgür A. Widedta: prediction of drug-
target binding affinity. arXiv preprint arXiv:1902.04166. 2019.
Downloaded from https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbab506/6470967 by Sun Yat-Sen University, chenyuchian@mail.sysu.edu.cn on 21 December 2021
... To address these challenges, computational approaches have been developed that utilize available protein amino acid sequences and compound SMILES. These approaches aim to predict drug-target binding affinity (DTA) quickly and cost-effectively, overcoming the scarcity of structural information and the need for domain expert knowledge [10][11][12][13][14][15][16][17][18][19][20][21][22][23][24][25][26]. By leveraging computational methods, DTA prediction becomes more accessible and efficient, facilitating exploration of potential drug-target interactions and aiding in drug discovery [27]. ...
... There are several computational approaches for DTA prediction, especially machine learning (ML) and deep learning (DL) based methods [10][11][12][13][14][15][16][17][18][19][20][21][22][23][24][25][26]. These methods can be categorized into four groups: similarity-based, sequence-based, graph-based, and transformer-based methods. ...
... With advancements in transformer architecture [34], the application of transformers for feature extraction from protein sequences and drug SMILES has gained prominence [35,36]. One notable method in this domain is FusionDTA [23], which introduces a transformer-based network in combination with a LSTM network for drug and protein feature extraction. By leveraging transformers and LSTMs, FusionDTA captures long-term dependencies and aims to learn a distributed representation for drugs and proteins. ...
Article (full-text available)
Background: In recent years, there has been growing interest in computational approaches for predicting drug-target binding affinity, aiming to expedite the early drug discovery process. To address the limitations of experimental methods, such as cost and time, several machine-learning-based techniques have been developed. However, these methods encounter certain challenges, including the limited availability of training data, reliance on human intervention for feature selection and engineering, and a lack of validation approaches for robust evaluation in real-life applications. Results: To mitigate these limitations, we propose a method for drug-target binding affinity prediction based on deep convolutional generative adversarial networks. Additionally, we conducted a series of validation experiments and implemented adversarial control experiments using straw models. These experiments demonstrate the robustness and efficacy of our predictive models. We conducted a comprehensive evaluation of our method by comparing it to baselines and state-of-the-art methods on two recently updated datasets, BindingDB and PDBBind. Our findings indicate that our method outperforms the alternative methods on three performance measures under warm-start data-splitting settings. Moreover, under physicochemical cold-start data-splitting settings, our method demonstrates superior predictive performance, particularly in terms of the concordance index. Conclusion: The results of our study affirm the practical value of our method and its superiority over alternative approaches in predicting drug-target binding affinity across multiple validation sets. This highlights the potential of our approach in accelerating drug repurposing, facilitating novel drug discovery and ultimately enhancing disease treatment. The data and source code for this study were deposited in the GitHub repository, https://github.com/mojtabaze7/DCGAN-DTA. Furthermore, the web server for our method is accessible at https://dcgan.shinyapps.io/bindingaffinity/.
... FusionDTA [41]: This model uses a pre-trained transformer and a BiLSTM to encode amino acid sequences, and a BiLSTM to encode SMILES. It proposes a fusion layer consisting of a multi-head linear attention layer that focuses on important tokens in biological sequences and aggregates global information based on attention scores. ...
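To make the attention-weighted aggregation concrete, here is a minimal PyTorch sketch of per-head linear attention pooling over a token sequence. The class name, dimensions, and masking are illustrative assumptions for this note, not FusionDTA's published implementation.

```python
import torch
import torch.nn as nn

class MultiHeadLinearAttentionPool(nn.Module):
    """Pool a token sequence into one vector using per-head learned
    attention scores (an illustrative sketch, not the published code)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.score = nn.Linear(dim, heads)      # one scalar score per head per token
        self.out = nn.Linear(heads * dim, dim)

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); mask: (batch, seq_len), 1 marks real tokens
        scores = self.score(x)                                       # (B, L, H)
        scores = scores.masked_fill(mask.unsqueeze(-1) == 0, float("-inf"))
        weights = torch.softmax(scores, dim=1)                       # attend over tokens
        pooled = torch.einsum("blh,bld->bhd", weights, x)            # per-head weighted sum
        return self.out(pooled.flatten(1))                           # (B, dim)

pool = MultiHeadLinearAttentionPool(dim=256, heads=4)
summary = pool(torch.randn(2, 100, 256), torch.ones(2, 100))         # -> (2, 256)
```

Unlike max-pooling, every token contributes to the summary in proportion to its learned weight, which is the aggregation behavior the description above attributes to the fusion layer.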
Article (full-text available)
Identifying drug-target interactions (DTIs) holds significant importance in drug discovery and development, playing a crucial role in various areas such as virtual screening, drug repurposing and identification of potential drug side effects. However, existing methods commonly exploit only a single type of feature from drugs and targets, suffering from miscellaneous challenges such as high sparsity and cold-start problems. We propose a novel framework called MSI-DTI (Multi-Source Information-based Drug-Target Interaction Prediction) to enhance prediction performance, which obtains feature representations from different views by integrating biometric features and knowledge graph representations from multi-source information. Our approach involves constructing a Drug-Target Knowledge Graph (DTKG), obtaining multiple feature representations from diverse information sources for SMILES sequences and amino acid sequences, incorporating network features from DTKG and performing an effective multi-source information fusion. Subsequently, we employ a multi-head self-attention mechanism coupled with residual connections to capture higher-order interaction information between sparse features while preserving lower-order information. Experimental results on DTKG and two benchmark datasets demonstrate that our MSI-DTI outperforms several state-of-the-art DTIs prediction methods, yielding more accurate and robust predictions. The source codes and datasets are publicly accessible at https://github.com/KEAML-JLU/MSI-DTI.
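A bare-bones sketch of the self-attention-plus-residual pattern described above might look as follows in PyTorch; the module name and shapes are hypothetical, and MSI-DTI's actual block is more elaborate.

```python
import torch
import torch.nn as nn

class ResidualSelfAttention(nn.Module):
    """Multi-head self-attention over fused feature tokens with a residual
    connection that preserves the lower-order input features (a sketch)."""
    def __init__(self, dim: int = 128, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, n_fields, dim), one token per feature source
        attended, _ = self.attn(feats, feats, feats)   # higher-order interactions
        return self.norm(feats + attended)             # residual keeps original signal

fused = ResidualSelfAttention()(torch.randn(4, 6, 128))   # -> (4, 6, 128)
```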
... Currently, multimodal learning is a vibrant multidisciplinary field that provides frameworks for processing multiple sources of information [42,43]. Multimodal learning is a general approach to building artificial intelligence models that extract and associate information from multimodal data, enabling models to handle complex relationships between different modalities [44,45]. The rich and diverse information in multimodal data is essential for drug property prediction [46]. ...
Article (full-text available)
Accurately predicting molecular properties is a challenging but essential task in drug discovery. Recently, many mono-modal deep learning methods have been successfully applied to molecular property prediction. However, mono-modal learning is inherently limited because it relies on a single modality of molecular representation, which restricts a comprehensive understanding of drug molecules. To overcome this limitation, we propose a multimodal fused deep learning (MMFDL) model that leverages information from different molecular representations. Specifically, we construct a triple-modal learning model by employing a Transformer encoder, a bidirectional gated recurrent unit (BiGRU) and a graph convolutional network (GCN) to process three modalities of information from chemical language and molecular graphs: SMILES-encoded vectors, ECFP fingerprints and molecular graphs, respectively. We evaluate the proposed triple-modal model using five fusion approaches on six molecule datasets: Delaney, Llinas2020, Lipophilicity, SAMPL, BACE and pKa from DataWarrior. The results show that the MMFDL model achieves the highest Pearson coefficients and the most stable distribution of Pearson coefficients in the random-splitting test, outperforming mono-modal models in accuracy and reliability. Furthermore, we validate the generalization ability of our model by predicting binding constants for protein-ligand complexes and assess its resilience to noise. Through analysis of feature distributions in chemical space and the contribution assigned to each modal model, we demonstrate that the MMFDL model acquires complementary information by using proper models and suitable fusion approaches. By leveraging diverse sources of bioinformatics information, multimodal deep learning models hold the potential for successful drug discovery.
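As one hedged illustration of how several modality embeddings could be fused, the snippet below implements a learned softmax-weighted sum; it is only one of the fusion strategies a triple-modal model might evaluate, and the variable names are invented for the example.

```python
import torch
import torch.nn as nn

class WeightedSumFusion(nn.Module):
    """Learned softmax-weighted sum of modality embeddings (one simple
    fusion strategy among the several a triple-modal model could test)."""
    def __init__(self, n_modal: int = 3):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_modal))

    def forward(self, *embs: torch.Tensor) -> torch.Tensor:
        # embs: n_modal tensors of shape (batch, dim), already projected
        # to a common dimension upstream.
        w = torch.softmax(self.logits, dim=0)
        return sum(wi * e for wi, e in zip(w, embs))

smiles_e, ecfp_e, graph_e = (torch.randn(2, 64) for _ in range(3))
fused = WeightedSumFusion()(smiles_e, ecfp_e, graph_e)     # -> (2, 64)
```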
Article (full-text available)
The activities of most enzymes and drugs depend on interactions between proteins and small molecules. Accurate prediction of these interactions could greatly accelerate pharmaceutical and biotechnological research. Current machine learning models designed for this task have a limited ability to generalize beyond the proteins used for training. This limitation is likely due to a lack of information exchange between the protein and the small molecule during the generation of the required numerical representations. Here, we introduce ProSmith, a machine learning framework that employs a multimodal Transformer Network to simultaneously process protein amino acid sequences and small molecule strings in the same input. This approach facilitates the exchange of all relevant information between the two molecule types during the computation of their numerical representations, allowing the model to account for their structural and functional interactions. Our final model combines gradient boosting predictions based on the resulting multimodal Transformer Network with independent predictions based on separate deep learning representations of the proteins and small molecules. The resulting predictions outperform recently published state-of-the-art models for predicting protein-small molecule interactions across three diverse tasks: predicting kinase inhibitions; inferring potential substrates for enzymes; and predicting Michaelis constants KM. The Python code provided can be used to easily implement and improve machine learning predictions involving arbitrary protein-small molecule interactions.
Preprint
Accurate protein-ligand binding affinity prediction is crucial in drug discovery. Existing methods are predominantly docking-free and do not explicitly consider atom-level interactions between proteins and ligands in scenarios where crystallized protein-ligand binding conformations are unavailable. Now, with breakthroughs in deep-learning-based protein folding and binding-conformation prediction, can we improve binding affinity prediction? This study introduces a framework, Folding-Docking-Affinity (FDA), which folds proteins, determines protein-ligand binding conformations, and predicts binding affinities from three-dimensional protein-ligand binding structures. Our experiments demonstrate that FDA outperforms state-of-the-art docking-free models on the DAVIS dataset, showcasing the potential of explicitly modeling three-dimensional binding conformations for enhancing binding affinity prediction accuracy.
Article (full-text available)
Predicting the interaction between a compound and a target is crucial for rapid drug repurposing. Deep learning has been successfully applied to the drug-target affinity (DTA) problem. However, previous deep-learning-based methods ignore the direct interactions between drug and protein residues, which leads to inaccurate learning of target representations that may change under drug-binding effects. In addition, previous DTA methods learn protein representations solely from the small number of protein sequences in DTA datasets while neglecting proteins outside these datasets. We propose GEFA (Graph Early Fusion Affinity), a novel graph-in-graph neural network with an attention mechanism to address the changes in target representation caused by binding effects. Specifically, a drug is modeled as a graph of atoms, which then serves as a node in a larger graph of the residue-drug complex. The resulting model is an expressive deep nested graph neural network. We also use pre-trained protein representations powered by the recent effort of learning contextualized protein representations. The experiments are conducted under different settings to evaluate scenarios such as novel drugs or targets. The results demonstrate the effectiveness of the pre-trained protein embedding and the advantages of GEFA in modeling the nested graph for drug-target interaction.
Article (full-text available)
Significance Learning biological properties from sequence data is a logical step toward generative and predictive artificial intelligence for biology. Here, we propose scaling a deep contextual language model with unsupervised learning to sequences spanning evolutionary diversity. We find that without prior knowledge, information emerges in the learned representations on fundamental properties of proteins such as secondary structure, contacts, and biological activity. We show the learned representations are useful across benchmarks for remote homology detection, prediction of secondary structure, long-range residue–residue contacts, and mutational effect. Unsupervised representation learning enables state-of-the-art supervised prediction of mutational effect and secondary structure and improves state-of-the-art features for long-range contact prediction.
Article (full-text available)
Recently, the emergence of pre-trained models (PTMs) has brought natural language processing (NLP) to a new era. In this survey, we provide a comprehensive review of PTMs for NLP. We first briefly introduce language representation learning and its research progress. Then we systematically categorize existing PTMs based on a taxonomy with four different perspectives. Next, we describe how to adapt the knowledge of PTMs to downstream tasks. Finally, we outline some potential directions for future PTM research. This survey is intended as a hands-on guide for understanding, using and developing PTMs for various NLP tasks.
Article (full-text available)
Computer-aided drug design uses high-performance computers to simulate drug design tasks and is a promising research area. Drug-target affinity (DTA) prediction is the most important step in computer-aided drug design; it can speed up drug development and reduce resource consumption. With the development of deep learning, introducing deep learning into DTA prediction and improving its accuracy have become a research focus. In this paper, utilizing the structural information of molecules and proteins, graphs of drug molecules and proteins are built, graph neural networks are introduced to obtain their representations, and a method called DGraphDTA is proposed for DTA prediction. Specifically, the protein graph is constructed from a predicted contact map, which captures the structural characteristics of a protein from its sequence alone. Tests of various metrics on benchmark datasets show that the proposed method has strong robustness and generalizability.
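The contact-map-to-graph step can be sketched in a few lines of NumPy. The 0.5 threshold and the feature-free edge list are assumptions made for illustration, not DGraphDTA's exact preprocessing.

```python
import numpy as np

def contact_map_to_edges(contact_probs: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Convert an L x L predicted contact-probability map into a
    (2, n_edges) residue-graph edge list (illustrative threshold)."""
    src, dst = np.nonzero(contact_probs >= threshold)
    keep = src != dst                           # drop self-loops
    return np.stack([src[keep], dst[keep]])

probs = np.random.rand(50, 50)                  # stand-in for a predicted map
probs = (probs + probs.T) / 2                   # contact maps are symmetric
edge_index = contact_map_to_edges(probs)        # feed to any GNN library
```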
Article (full-text available)
Motivation: Identifying compound-protein interactions (CPIs) is a crucial task in drug discovery and chemogenomics studies, and proteins without three-dimensional (3D) structure account for a large part of potential biological targets, which requires methods that predict CPI using only protein sequence information. However, sequence-based CPI models may face specific pitfalls, including inappropriate datasets, hidden ligand bias and inappropriate dataset splits, resulting in overestimation of their prediction performance. Results: To address these issues, we constructed new datasets specific to CPI prediction, proposed a novel transformer neural network named TransformerCPI, and introduced a more rigorous label-reversal experiment to test whether a model learns true interaction features. TransformerCPI achieved much-improved performance on the new experiments, and it can be deconvolved to highlight important interacting regions of protein sequences and compound atoms, which may provide chemical biology studies with useful guidance for further ligand structural optimization. Supplementary information: Supplementary data are available at Bioinformatics online. Availability and implementation: https://github.com/lifanchen-simm/transformerCPI.
Article
Drug-target interaction (DTI) prediction has drawn increasing interest due to its substantial position in the drug discovery process. Many studies have introduced computational models that treat DTI prediction as a regression task, directly predicting the binding affinity of drug-target pairs. However, existing studies (i) ignore the essential correlations between atoms when encoding drug compounds and (ii) model the interaction of drug-target pairs simply by concatenation. Based on these observations, in this study we propose an end-to-end model with multiple attention blocks to predict the binding affinity scores of drug-target pairs. Our proposed model (i) encodes the correlations between atoms with a relation-aware self-attention block and (ii) models the interaction of drug representations and target representations with a multi-head attention block. Experimental results for DTI prediction on two benchmark datasets show that our approach outperforms existing methods, benefiting from the correlation information encoded by the relation-aware self-attention block and the interaction information extracted by the multi-head attention block. Moreover, we conduct experiments on the effect of the maximum relative position length and find the best values to be $k \in \{3, 5\}$. Furthermore, we apply our model to predict the binding affinity of Coronavirus Disease 2019 (COVID-19)-related genome sequences and 3137 FDA-approved drugs.
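For readers unfamiliar with relation-aware self-attention, the following single-head PyTorch sketch adds clipped relative-position ("relation") embeddings to the attention scores, in the style of Shaw et al.'s relative position representations. It applies relations to keys only for brevity and omits the paper's multi-head interaction block, so treat it as a schematic rather than the authors' model.

```python
import torch
import torch.nn as nn

class RelationAwareSelfAttention(nn.Module):
    """Single-head self-attention whose scores add clipped relative-position
    ("relation") embeddings between atom pairs (a simplified sketch)."""
    def __init__(self, dim: int, k: int = 3):
        super().__init__()
        self.k = k
        self.q = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.rel = nn.Embedding(2 * k + 1, dim)   # one embedding per clipped offset

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, L, D = x.shape
        q, kx, v = self.q(x), self.key(x), self.v(x)
        pos = torch.arange(L, device=x.device)
        # Offsets between all token pairs, clipped to [-k, k] then shifted to [0, 2k].
        offsets = (pos[None, :] - pos[:, None]).clamp(-self.k, self.k) + self.k
        rel = self.rel(offsets)                                   # (L, L, D)
        scores = q @ kx.transpose(-2, -1)                         # content term
        scores = scores + torch.einsum("bld,lmd->blm", q, rel)    # relation term
        attn = torch.softmax(scores / D ** 0.5, dim=-1)
        return attn @ v

out = RelationAwareSelfAttention(dim=64, k=3)(torch.randn(2, 30, 64))
```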
Article
Machine learning applied to protein sequences is an increasingly popular area of research. Semi-supervised learning for proteins has emerged as an important paradigm due to the high cost of acquiring supervised protein labels, but the current literature is fragmented when it comes to datasets and standardized evaluation techniques. To facilitate progress in this field, we introduce the Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology. We curate tasks into specific training, validation, and test splits to ensure that each task tests biologically relevant generalization that transfers to real-life scenarios. We benchmark a range of approaches to semi-supervised protein representation learning, which span recent work as well as canonical sequence learning techniques. We find that self-supervised pretraining is helpful for almost all models on all tasks, more than doubling performance in some cases. Despite this increase, in several cases features learned by self-supervised pretraining still lag behind features extracted by state-of-the-art non-neural techniques. This gap in performance suggests a huge opportunity for innovative architecture design and improved modeling paradigms that better capture the signal in biological sequences. TAPE will help the machine learning community focus effort on scientifically relevant problems. Toward this end, all data and code used to run these experiments are available at https://github.com/songlab-cal/tape.
Article
The development of new drugs is costly, time-consuming and often accompanied by safety issues. Drug repurposing can avoid the expensive and lengthy process of drug development by finding new uses for already approved drugs. To repurpose drugs effectively, it is useful to know which proteins are targeted by which drugs. Computational models that estimate the interaction strength of new drug-target pairs have the potential to expedite drug repurposing. Several models have been proposed for this task. However, these models represent drugs as strings, which is not a natural way to represent molecules. We propose a new model called GraphDTA that represents drugs as graphs and uses graph neural networks to predict drug-target affinity. We show that graph neural networks not only predict drug-target affinity better than non-deep-learning models but also outperform competing deep learning methods. Our results confirm that deep learning models are appropriate for drug-target binding affinity prediction and that representing drugs as graphs can lead to further improvements. Availability of data and materials: The proposed models are implemented in Python. Related data, pre-trained models and source code are publicly available at https://github.com/thinng/GraphDTA. All scripts and data needed to reproduce the post-hoc statistical analysis are available from https://doi.org/10.5281/zenodo.3603523.
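Representing a drug as a graph typically starts from its SMILES string. The RDKit-based sketch below extracts a minimal atom-feature matrix and bond edge list; GraphDTA's actual atom features are richer, so this is a schematic under that assumption.

```python
import torch
from rdkit import Chem   # requires the rdkit package

def smiles_to_graph(smiles: str):
    """Minimal SMILES -> (atom features, edge index) conversion; real
    pipelines use far richer atom and bond features."""
    mol = Chem.MolFromSmiles(smiles)
    feats = torch.tensor(
        [[a.GetAtomicNum(), a.GetDegree()] for a in mol.GetAtoms()],
        dtype=torch.float,
    )
    edges = []
    for b in mol.GetBonds():                    # store each bond both ways
        i, j = b.GetBeginAtomIdx(), b.GetEndAtomIdx()
        edges += [(i, j), (j, i)]
    edge_index = torch.tensor(edges, dtype=torch.long).t()
    return feats, edge_index

x, ei = smiles_to_graph("CC(=O)Oc1ccccc1C(=O)O")   # aspirin
```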
Article
Motivation: An essential part of drug discovery is the accurate prediction of the binding affinity of new compound-protein pairs. Most of the standard computational methods assume that compounds or proteins of the test data are observed during the training phase. However, in real-world situations, the test and training data are sampled from different domains with different distributions. To cope with this challenge, we propose a deep learning-based approach that consists of three steps. In the first step, the training encoder network learns a novel representation of compounds and proteins. To this end, we combine convolutional layers and LSTM layers so that the occurrence patterns of local substructures through a protein and a compound sequence are learned. Also, to encode the interaction strength of the protein and compound substructures, we propose a two-sided attention mechanism. In the second phase, to deal with the different distributions of the training and test domains, a feature encoder network is learned for the test domain by utilizing an adversarial domain adaptation approach. In the third phase, the learned test encoder network is applied to new compound-protein pairs to predict their binding affinity. Results: To evaluate the proposed approach, we applied it to KIBA, Davis, and BindingDB datasets. The results show that the proposed method learns a more reliable model for the test domain in more challenging situations. Availability: https://github.com/LBBSoft/DeepCDA.
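The "two-sided" idea can be approximated with a single bilinear interaction matrix from which compound-side and protein-side attention weights are read off. The sketch below is a deliberately simplified assumption, not DeepCDA's exact mechanism.

```python
import torch
import torch.nn as nn

class TwoSidedAttention(nn.Module):
    """Bilinear interaction between compound and protein substructure
    features, yielding one attention weight per position on each side."""
    def __init__(self, dc: int, dp: int):
        super().__init__()
        self.W = nn.Parameter(torch.randn(dc, dp) * 0.01)

    def forward(self, comp: torch.Tensor, prot: torch.Tensor):
        # comp: (B, Lc, dc); prot: (B, Lp, dp)
        inter = torch.einsum("bic,cp,bjp->bij", comp, self.W, prot)  # (B, Lc, Lp)
        a_comp = torch.softmax(inter.max(dim=2).values, dim=1)       # (B, Lc)
        a_prot = torch.softmax(inter.max(dim=1).values, dim=1)       # (B, Lp)
        return a_comp, a_prot

ac, ap = TwoSidedAttention(64, 96)(torch.randn(2, 40, 64), torch.randn(2, 300, 96))
```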