Weining Yuan is a Bachelor's student in the School of Intelligent Engineering, Sun Yat-Sen University. His research interests focus on natural language processing, knowledge transfer and drug design.
Guanxing Chen is a Ph.D. candidate in the School of Intelligent Engineering, Sun Yat-Sen University. His research interests focus on explainable artificial intelligence, drug discovery, deep learning, biosynthesis and vaccine design.
Calvin Yu-Chian Chen is the Dean of the Intelligent Medical Center and a professor in the School of Intelligent Systems Engineering at Sun Yat-sen University. He has also served as an advisor or guest professor at China Medical University, the Massachusetts Institute of Technology (MIT), Peking University and the University of Pittsburgh, and as an adjunct professor at Zhejiang University. His research interests include computer vision, natural language processing and deep learning.
Received: August 8, 2021. Revised: October 21, 2021. Accepted: November 3, 2021
© The Author(s) 2021. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
Briefings in Bioinformatics, 2021, 00, 1–13
https://doi.org/10.1093/bib/bbab506
Problem Solving Protocol
FusionDTA: attention-based feature polymerizer
and knowledge distillation for drug-target binding
affinity prediction
Weining Yuan, Guanxing Chen and Calvin Yu-Chian Chen
Corresponding author: Calvin Yu-Chian Chen, Artificial Intelligence Medical Center, School of Intelligent Systems Engineering, Sun Yat-sen University, Shenzhen 510275, China; Department of Medical Research, China Medical University Hospital, Taichung 40447, Taiwan; Department of Bioinformatics and Medical Engineering, Asia University, Taichung 41354, Taiwan. Tel: +8615626413023; E-mail: chenyuchian@mail.sysu.edu.cn
These authors contributed equally to this work.
Abstract
The prediction of drug–target affinity (DTA) plays an increasingly important role in drug discovery. Nowadays, many prediction methods focus on feature encoding of drugs and proteins but ignore the importance of feature aggregation. However, increasingly complex encoder networks lead to the loss of implicit information and excessive model size. To this end, we propose a deep-learning-based approach named FusionDTA. To address the loss of implicit information, a novel multi-head linear attention mechanism is utilized to replace the rough pooling method. This allows FusionDTA to aggregate global information based on attention weights, instead of selecting the largest value as max-pooling does. To solve the parameter-redundancy issue, we applied knowledge distillation in FusionDTA by transferring learnable information from the teacher model to the student. Results show that FusionDTA performs better than existing models on the test domain across all evaluation metrics. We obtained concordance index (CI) values of 0.913 and 0.906 on the Davis and KIBA datasets, respectively, compared with 0.893 and 0.891 for the previous state-of-the-art model. Under the cold-start constraint, our model proved to be more robust and more effective with unseen inputs than the baseline methods. In addition, knowledge distillation saved half of the parameters of the model, with only a 0.006 reduction in CI. Even FusionDTA with half the parameters could easily exceed the baseline on all metrics. In general, our model has superior performance and improves the effect of drug–target interaction (DTI) prediction. The visualization of DTI can effectively help predict the binding region of proteins during structure-based drug design.
Keywords: drug–target affinity, feature polymerizer, multi-head linear attention, model compression, knowledge distillation
Introduction
Drug discovery is a time-consuming, extremely expensive and high-risk process. It takes more than 10 years and billions of dollars to develop a new drug, yet 90% of the drugs entering clinical trials fail to gain FDA approval and reach the consumer market [1, 2]. In the past few decades, the rapid development of computer technology has enabled computational tools to assist experimental drug design and accelerate drug development [3]. Nowadays, the key part of computer-aided drug design is to find matching drug molecules and proteins. Hence, drug–target interaction (DTI) has become a hot topic that is widely studied [4].

Traditionally, virtual screening has been widely used to extract reasonable drug molecules from large compound databases. However, molecular docking, the technology used to measure the binding affinity between a drug and its target, costs a great deal of experimental time [5]. For proteins with known structural information, drug molecules can be directly docked to obtain binding affinity, but there are still many proteins of unknown structure. Even if a large amount of time is spent on homology modeling, detailed structural information may not be obtained [6]. In response to this challenge, machine learning methods for drug–target affinity (DTA) prediction have gradually become an alternative to molecular docking.
Pahikkala et al. [7] proposed the Kronecker regularized least-squares approach (KronRLS), which defines the similarity score of a drug–target pair through the Kronecker product of similarity matrices. He et al. [8] put forward SimBoost, a read-across method that uses gradient boosting to predict drug–target affinity. Öztürk et al. [9] suggested a deep learning model, DeepDTA, with two independent convolution blocks to learn representations from SMILES strings and protein sequences. Abbasi et al. [10] proposed a deep-learning-based approach, DeepCDA, that combines convolutional layers and long short-term memory
Figure 1. The overall architecture of FusionDTA. First, the original one-hot encoding of the input vector is replaced by a novel distributed representation. Then, a feedforward layer and an LSTM are designed to construct the basic blocks of the encoder layer. Finally, the intermediate carriers of drug molecules and proteins are imported into the fusion layer to obtain an output carrier representation of binding affinity.
(LSTM) layers to effectively encode local and global temporal patterns for deep cross-domain compound–protein affinity prediction. Nguyen et al. [11] proposed a graph-based model, GraphDTA, encoding each drug as an undirected graph with a feature map and an adjacency matrix. A graph convolutional network (GCN) [12], graph attention network (GAT) [13] and graph isomorphism network (GIN) [14] are designed to extract features from drugs, whereas convolution blocks serve as the feature encoder of proteins.
Studies of attention-based methods also contribute to DTA prediction. DrugVQA [15] proposed a question-answering model for drug–target interaction tasks, in which a sequential attention mechanism is utilized to capture dependencies from a dynamic convolutional neural network (CNN). From another perspective, MATT_DTI [16] designed a multi-head attention model that regards the drug representation as the query and the protein representation as the key and value. Nguyen et al. [17] built a graph-in-graph architecture to fuse the drug–protein pair, with a self-attention mechanism to calculate the binding site in the protein representation. Chen et al. [18] utilized a transformer decoder to translate protein sequences into interaction sequences, where protein representations are the original texts and drug representations are previous translations. MT-DTI [19] proposed a new molecular representation method based on the self-attention mechanism, which is superior to existing techniques in terms of the area under the precision–recall curve.
For pre-training of input vectors, Asgari and Mofrad [20] proposed a word2vec model, ProtVec, to obtain continuously distributed representations of proteins. Rao et al. [21] introduced tasks assessing protein embeddings (TAPE) to evaluate semi-supervised learning on protein sequences. In their study, self-supervised models are tested on three mainstream tasks: structure prediction, detection of remote homologs and protein engineering. In addition, Rives et al. [22] learned a multiscale representation space from 86 billion amino acids across 250 million proteins with a robust transformer, ESM-1b. In existing work, one-dimensional (1D) CNNs [23] and pooling methods [24] are often applied to compress a sequence of n words into a single token. However, each token contains unique semantic information. Crude use of 1D CNN layers or global pooling operations to aggregate features may result in the loss of much useful information.
To solve this problem, we propose a novel neural network framework, FusionDTA. In the model architecture, we first encode the inputs as continuously distributed representations depending on the raw input and the parameters of a pre-trained model. For biological sequences, one-hot encoding cannot obtain context information from a mass of unsupervised biological corpora. Thus, a pre-trained transformer is utilized to generate the distributed input representation in our work. Then, LSTM layers make up the basic block of the encoder network. To capture the local and global dependencies of the feature vectors, we apply a two-layer bidirectional LSTM on the feature map from the embedding layers. Finally, we propose to replace the 1D CNN layer or the global pooling layer with a multi-head linear attention layer, which selectively focuses on each token from the entire biological sequence and aggregates global information based on the attention score. Different from the attention mechanisms mentioned above, the proposed linear attention aims
Figure 2. The frequency histograms of binding affinity, protein sequence length and ligand SMILES length in the Davis dataset.
Figure 3. The frequency histograms of binding affinity, protein sequence length and ligand SMILES length in the KIBA dataset.
to capture the direct reflection of each biological token
on binding affinity rather than enhance the representa-
tional ability of the feature encoder.
As the neural network encoder deepens, we often face the phenomenon of excessive parameters in the training process. This phenomenon is always accompanied by overfitting and slow training [25]. Therefore, we propose knowledge distillation for DTA tasks as an improvement in the training strategy. Knowledge distillation establishes a teacher model and a student model. Through defined constraints and loss functions, the student model with fewer parameters obtains knowledge from the teacher model with more parameters. By transferring knowledge from one model to another, knowledge distillation is an effective method for parameter regularization and model compression.
Material and methods
Datasets
We evaluated FusionDTA on two publicly available datasets, the kinase dataset Davis [26] and the KIBA dataset [27]. Both are regarded as benchmark datasets in previous drug–target affinity prediction studies.
Davis dataset: The Davis dataset contains 30 056 interactions from 442 proteins and 68 ligands, in which the binding affinity is evaluated by the dissociation constant (K_d). It reflects selective measurements of the kinase protein family and associated inhibitors with their constant values of dissociation.
To solve the numerical explosion problem, Öztürk et al. proposed to replace the binding affinity value $K_d$ with a novel measure $pK_d$ by converting its value into the logarithmic domain. Specifically, $K_d$ is first scaled to the appropriate range, and then the negative log is calculated as follows:

$$pK_d = -\log_{10}\left(\frac{K_d}{10^9}\right). \quad (1)$$
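As a quick check of Eq. (1), a minimal Python sketch of the conversion (the Davis peak at affinity 5 corresponds to K_d = 10 000 nM):

```python
import math

def kd_to_pkd(kd_nm: float) -> float:
    """Eq. (1): convert a dissociation constant given in nM to pKd."""
    # Dividing by 1e9 expresses the nM value in molar units before -log10.
    return -math.log10(kd_nm / 1e9)

print(kd_to_pkd(10_000))  # the Davis peak: Kd = 10,000 nM -> pKd = 5.0
```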
Figure 2 shows the histograms of binding affinity, protein sequence length and ligand SMILES length in the Davis dataset. The first panel illustrates the distribution of binding-affinity values of DT pairs. The peak at an affinity of 5 accounts for more than half of the dataset: of the 30 056 DT pairs in total, 20 931 have an affinity of 5, and most of the rest are distributed between 6 and 7. In addition, the length of most proteins is concentrated between 400 and 1500; the largest concentration is around 500, and the maximum length is 2549. The SMILES lengths of the ligands present a Gaussian-like distribution, mostly ranging from 35 to 80 and concentrated between 40 and 60, with a maximum length of 103.
KIBA dataset: The KIBA dataset contains kinase inhibitor bioactivities measured by an approach called KIBA, which combines different indicators of inhibitor efficacy, such as K_i, K_d and IC50. The binding affinities were measured over the interactions of 467 proteins and 52 498 ligands.
Figure 3 shows the histograms of binding affinity, protein sequence length and ligand SMILES length in the KIBA dataset. As shown in the figure, the affinities in the KIBA dataset are mainly distributed between 10 and 13, and most of them fall around 11. The length of the protein sequence is concentrated between 200 and 1500, mostly around 700, and the maximum length is 4128. The SMILES lengths of the ligands range from 15 to 100, mostly concentrated around 50, and the maximum length is 590.

Figure 4. The training phase of the protein pre-training model. The top of the chart shows the strategy for the pre-training stage and the bottom shows the strategy for the fine-tuning stage.
Öztürk et al. reported that for 99% of protein pairs in the KIBA dataset, the Smith–Waterman (S-W) similarity between proteins is at most 60%, and 92% of the protein pairs in the Davis dataset have a target similarity of at most 60%. These statistics indicate that both datasets are non-redundant. To ensure the fairness of the experiments, 5-fold cross-validation was adopted. The data were evenly divided into five parts: four for training and one for testing. Hence, a dataset yields five splitting schemes. We tested the proposed model on all schemes and regarded the average score as the final performance.
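A minimal sketch of this 5-fold scheme, with toy stand-ins for the dataset and a placeholder training/evaluation step:

```python
from sklearn.model_selection import KFold

# Toy stand-ins: real entries would be (drug_SMILES, protein_seq, affinity).
interactions = [("CCO", "MKT", 5.0), ("CCN", "MKV", 6.2), ("CCC", "MLT", 7.1),
                ("COC", "MRT", 5.4), ("CNC", "MQT", 6.8)]

def evaluate(train_set, test_set):
    # Placeholder for "train FusionDTA on train_set, score on test_set".
    return 0.0

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = [evaluate([interactions[i] for i in tr], [interactions[i] for i in te])
          for tr, te in kf.split(interactions)]
final_score = sum(scores) / len(scores)  # average over the five schemes
```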
Model architecture
The overall architecture of FusionDTA is shown in
Figure 1. The first step is to feed drug molecules and
protein sequences into the embedding layer. In this
layer, drug molecules are encoded as SMILES strings,
and proteins are encoded as word embeddings. Then, the
LSTM layers are designed to construct the basic blocks
of the encoder layer. Finally, the intermediate carriers
of drug molecules and proteins are imported into the
fusion layer to obtain an output carrier representation of
binding affinity.
Drug representation
For drug molecules, the ASCII string SMILES [28] is widely used as a chemical descriptor for the input representation. SMILES represents a drug molecule as a one-dimensional sequence, from which the chemical properties of the atoms and their arrangement are obtained. We project each SMILES character into a discrete one-hot space by creating a vocabulary over the SMILES format. Each drug molecule is represented as follows:

$$x^D = \{x^D_1, \ldots, x^D_n\} \in \mathbb{R}^{V_D}, \quad (2)$$

where $V_D$ is the vocabulary size of the SMILES format. To avoid a sparse, high-dimensional matrix, $x^D$ is multiplied by a random embedding matrix, which converts $x^D$ to a low-dimensional, dense continuous space $e^D$ as follows:

$$e^D = \{e^D_1, \ldots, e^D_n\} \in \mathbb{R}^d, \quad (3)$$

where $d$ is the embedding dimension of the drug.
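A minimal PyTorch sketch of Eqs. (2) and (3); the vocabulary size and embedding dimension are assumed hyperparameters, and the toy vocabulary is built only for illustration:

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 64   # assumed |V_D|: number of distinct SMILES characters
EMBED_DIM = 128   # assumed embedding dimension d

# nn.Embedding is equivalent to multiplying a one-hot vector by a learnable
# matrix, yielding the dense representation e^D of Eq. (3).
embedding = nn.Embedding(VOCAB_SIZE, EMBED_DIM)

smiles = "CC(=O)NC1=CC=C(C=C1)O"                  # paracetamol, for example
char2idx = {c: i for i, c in enumerate(sorted(set(smiles)))}  # toy vocabulary
token_ids = torch.tensor([[char2idx[c] for c in smiles]])     # (1, n)
e_drug = embedding(token_ids)                     # (1, n, EMBED_DIM)
```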
Protein representation
With the development of natural language processing,
pre-training of input vectors has become an indispens-
able part of the model [29]. The pre-trained model can
Downloaded from https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbab506/6470967 by Sun Yat-Sen University, chenyuchian@mail.sysu.edu.cn on 21 December 2021
FusionDTA |5
help the machine find an interpretable data representa-
tion to improve the performance of the algorithm. In the
process of extracting biological sequence information,
feature extraction can be conducted manually or in an unsupervised manner. However, it is difficult to manually add various effective features to biological sequences in real scenarios, so a better choice is to use unsupervised learning to embed biological sequences into high-dimensional vectors [22].
Inspired by ESM-1b [22], we adopt the pre-trained transformer to replace the original one-hot encoding, taking the distributed contextual vector as the protein representation. Figure 4 shows the overall pre-training and fine-tuning procedures. In the pre-training stage, the original proteins are first divided into several sequences by a fixed maximum length. Each sequence begins with a token [CLS] and ends with a token [SEP]. Then, the input embeddings are the sum of token embeddings and position embeddings with learnable weights. To capture the dependencies between tokens, some proportion of the input tokens are masked at random, and the final task of the pre-training model is to predict those masked tokens. Given the input sequence, we minimize the following negative log-likelihood:

$$\mathcal{L}(\theta) = -\sum_{i \in M} \log p(m = m_i \mid \theta), \quad m_i \in \{1, 2, \ldots, |V_P|\}, \quad (4)$$

where $M$ is the set of masked tokens and $V_P$ is the vocabulary size of amino acids.
In the fine-tuning stage, the DTA task is regarded as the downstream task of protein pre-training. Similar to the pre-training stage, the protein is first divided into sequences with a fixed maximum length. Then, these raw sequences are encoded by the pre-trained ESM-1b, in which contextual dependencies are assigned to each amino acid. This allows the fine-tuning model to learn diverse knowledge from pre-training and fuse the sequential information at the biological word level. Finally, we utilize the top-layer outputs of the pre-trained ESM-1b as the protein representation. Given the one-hot embedding $\{x^P_1, \cdots, x^P_m\} \in \mathbb{R}^{V_P}$, the output of the pre-training model is defined as follows:

$$e^P = \{e^P_1, \cdots, e^P_m\} \in \mathbb{R}^d, \quad (5)$$

where $V_P$ is the vocabulary size of amino acids and $d$ is the dimension of the hidden layer in the pre-training model.
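Assuming the public fair-esm package, the top-layer representation can be extracted roughly as follows (the layer index 33 and hidden size 1280 are fixed properties of ESM-1b; everything else is illustrative):

```python
import torch
import esm  # the public fair-esm package

# Load pre-trained ESM-1b and its batch converter (tokenizer + padding).
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("query_protein", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
labels, strs, tokens = batch_converter(data)   # adds the special tokens

with torch.no_grad():
    out = model(tokens, repr_layers=[33])      # layer 33 is the top layer
e_protein = out["representations"][33]         # (1, seq_len + 2, 1280)
```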
LSTM layer
LSTM is a well-known variant of the recurrent neural network that solves the long-term dependency problem of the general recurrent neural network (RNN) [30]. For protein sequences and drug SMILES, the input vector is represented as a set of multiple discrete biological words. Hence, we can regard the inputs as continuous time series or sentences in a language model. In the Davis and KIBA datasets, more than 80% of protein sequences exceed 200 residues in length. Therefore, traditional methods (1D CNN or S-W) cannot extract high-level semantic features in a precise way. Due to its unique gate design, LSTM is more suitable for processing longer biological sequences than a vanilla RNN or hidden Markov models. Thus, we utilize LSTM as the feature encoder for the embeddings of drugs and proteins.
In our model, the embedding vectors of the sequence and the SMILES are encoded by a two-layer bidirectional LSTM. First, the drug and protein embeddings are fed into a feedforward layer, which consists of a fully connected network and an activation function. The purpose of the feedforward layer is to map the features generated by the embedding layer into the space of the LSTM layer. The LSTM layer then captures the long-term and short-term dependencies in the feature map generated by the feedforward layer. Specifically, we apply a two-layer bidirectional LSTM on top of the feedforward layer. In addition, we stack the feedforward layer and the LSTM layer n times to obtain the local and global dependencies of the feature vectors across different dimensions of semantic features. Since the bidirectional LSTM doubles the feature size, the hidden size of the LSTM layer is set to half the input feature size of the feedforward layer so as to maintain consistency of the feature size.
Taking protein embedding as an example: given an input sentence $\{e^P_1, \cdots, e^P_m\} \in \mathbb{R}^d$, the output of the LSTM layer is defined as $\{h^P_1, \cdots, h^P_m\} \in \mathbb{R}^F$, where $F$ is the feature dimension of each biological token.
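A sketch of one feedforward + BiLSTM block under these constraints, with an assumed feature size F = 256 (the actual hyperparameters are listed in Supplementary Tables S1 and S2):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One feedforward + bidirectional LSTM block (a sketch; sizes assumed).

    The bidirectional LSTM doubles the feature size, so its hidden size is
    feat_dim // 2, keeping input and output dimensions equal as the text
    describes; blocks can therefore be stacked n times.
    """
    def __init__(self, feat_dim: int):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU())
        self.lstm = nn.LSTM(feat_dim, feat_dim // 2, num_layers=2,
                            bidirectional=True, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, _ = self.lstm(self.ff(x))   # (batch, seq_len, feat_dim)
        return h

h = EncoderBlock(256)(torch.randn(8, 1000, 256))   # protein embeddings e^P
```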
Multi-head linear attention mechanism
In existing work, a 1D CNN is commonly applied to the output layer to ensure that protein sequences or SMILES of different lengths yield outputs of the same size. In addition, GraphDTA recommends a pooling method to aggregate the characteristics of drug molecules, where the output vector can be calculated by sum, mean or max pooling. However, powerful feature encoders often allocate feature maps to each token in a refined way. This means that the feature map of each token contains different shallow and in-depth semantic information. Roughly using 1D CNN layers or global pooling operations to compress the feature map may result in the loss of this information. Hence, we propose a novel multi-head linear attention mechanism to capture the meaningful information from each token.
Figure 5 shows the process of multi-head linear attention aggregation. As shown, the input vector is first mapped to the n-head attention vectors. We define the mapping function $\mathrm{LinearAttention}(W, h_i)$ as follows:

$$\mathrm{LinearAttention}(W, h_i) = \frac{\exp\left(\frac{W h_i}{d_k}\right)}{\sum_{j=1}^{m} \exp\left(\frac{W h_j}{d_k}\right)}, \quad (7)$$

where $W \in \mathbb{R}^{1 \times F}$ is the attention weight matrix and $d_k$ is the normalization coefficient.

Figure 5. The process of multi-head linear attention layer aggregation. Each token is allocated n heads of attention weights. The output of each head is the dot product of the tokens in the dotted box and the specific head; a summation then aggregates the n head outputs into one weight per token, and the fused output can be expressed as the concatenation over the feature dimensions.
Multi-head attention allows the machine to focus on information about the input features in different vector spaces and aggregate them as summations. For the coherence of the derivation, we first calculate the summation of the n heads, instead of the dot product. Taking a protein as input, we suppose the input vector is $\{h^P_1, \cdots, h^P_m\} \in \mathbb{R}^F$. Then the attention vector of the n heads, defined as $\{a^P_1, \cdots, a^P_m\} \in \mathbb{R}^1$, is calculated as follows:

$$a^P_i = \sum_{j=1}^{n_{heads}} head_j, \quad (8)$$

$$head_j = \mathrm{LinearAttention}(W_j, h^P_i). \quad (9)$$
Finally, the output of the multi-head attention layer is defined as the dot product of the attention weights and the original input vector:

$$o^P = \sum_{i=1}^{m} a^P_i h^P_i. \quad (10)$$
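A possible PyTorch reading of Eqs. (7)-(10), stacking the n 1×F weight matrices W_j into a single linear projection; sizes are illustrative:

```python
import torch
import torch.nn as nn

class MultiHeadLinearAttention(nn.Module):
    """Aggregates a (batch, seq_len, F) feature map into a (batch, F) vector.

    A sketch of Eqs. (7)-(10): each head applies a learned 1 x F weight and a
    softmax over tokens, the heads are summed into one scalar weight per
    token (Eq. 8), and the output is the attention-weighted sum (Eq. 10).
    """
    def __init__(self, feat_dim: int, n_heads: int = 8):
        super().__init__()
        # One Linear holds all n_heads row vectors W_j stacked together.
        self.proj = nn.Linear(feat_dim, n_heads, bias=False)
        self.d_k = feat_dim                       # normalization coefficient

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        scores = self.proj(h) / self.d_k          # (B, L, n_heads)
        attn = torch.softmax(scores, dim=1)       # normalize over tokens, Eq. (7)
        a = attn.sum(dim=-1, keepdim=True)        # sum the heads, Eq. (8)
        return (a * h).sum(dim=1)                 # weighted sum, Eq. (10)

pooled = MultiHeadLinearAttention(256)(torch.randn(8, 1000, 256))  # (8, 256)
```

Stacking the per-head row vectors into one linear layer keeps the aggregator cheap: it adds only n_heads × F parameters, far fewer than the F × F projections of standard self-attention.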
Fusion layer
We propose a fusion layer composed of three multi-head linear attention blocks to fuse the properties of drug features and protein features. As mentioned above, multi-head linear attention can aggregate drugs and proteins separately. However, when the feature maps of a protein or a drug are aggregated independently, the relationship between the drug and the protein cannot be captured. Therefore, in this paper, three different linear attention blocks are applied in the fusion layer, for protein sequences, drug SMILES and protein–drug information, respectively.
Figure 1 also shows the mechanism of the fusion layer. As shown, the protein sequence and the drug SMILES are first spliced into a new sequence. Given the feature vector $h^P = \{h^P_1, \cdots, h^P_m\} \in \mathbb{R}^F$ for the protein and the feature vector $h^D = \{h^D_1, \cdots, h^D_n\} \in \mathbb{R}^F$ for the drug, the spliced vector is defined as $\hat{h} = \{h^P_1, \cdots, h^P_m, h^D_1, \cdots, h^D_n\} \in \mathbb{R}^F$. Then, the spliced sequence is fed into the multi-head linear attention layer to obtain the aggregation feature $o^{PD} \in \mathbb{R}^F$. Similarly, the proteins and the drugs are respectively fed into multi-head linear attention layers to obtain the features $o^P, o^D \in \mathbb{R}^F$. There are no shared parameters among the three attention layers. The derivation above is expressed as follows:

$$\hat{h} = \mathrm{concat}(h^P, h^D), \quad (11)$$

$$o^{PD} = \mathrm{MultiheadlinearAttn}(\hat{h}), \quad (12)$$

$$o^P = \mathrm{MultiheadlinearAttn}(h^P), \quad (13)$$

$$o^D = \mathrm{MultiheadlinearAttn}(h^D). \quad (14)$$

Finally, a tensor concatenated from $o^{PD}$, $o^P$ and $o^D$ is fed into the fully connected layer as follows:

$$\hat{y} = \mathrm{FC}(\mathrm{concat}([o^{PD}, o^P, o^D])). \quad (15)$$
For the training stage, the goal is to make the prediction distribution as close to the ground truth as possible. Equivalently, we minimize the mean square error between the predicted value and the ground truth:

$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2, \quad (16)$$

where $N$ is the number of samples, $\hat{y}_i$ is the predicted value and $y_i$ is the ground truth.
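A sketch of the fusion layer and the training loss of Eqs. (11)-(16), reusing the MultiHeadLinearAttention module sketched above; the hidden size of the fully connected head is an assumption:

```python
import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    """Sketch of Eqs. (11)-(15): three unshared attention blocks plus an FC head."""
    def __init__(self, feat_dim: int, hidden: int = 512):
        super().__init__()
        self.attn_pd = MultiHeadLinearAttention(feat_dim)  # protein-drug pair
        self.attn_p = MultiHeadLinearAttention(feat_dim)   # protein only
        self.attn_d = MultiHeadLinearAttention(feat_dim)   # drug only
        self.fc = nn.Sequential(nn.Linear(3 * feat_dim, hidden), nn.ReLU(),
                                nn.Linear(hidden, 1))

    def forward(self, h_p: torch.Tensor, h_d: torch.Tensor) -> torch.Tensor:
        h_cat = torch.cat([h_p, h_d], dim=1)               # splice tokens, Eq. (11)
        o = torch.cat([self.attn_pd(h_cat), self.attn_p(h_p),
                       self.attn_d(h_d)], dim=-1)          # (B, 3F), Eqs. (12)-(14)
        return self.fc(o).squeeze(-1)                      # y_hat, Eq. (15)

model = FusionLayer(256)
y_hat = model(torch.randn(8, 1000, 256), torch.randn(8, 80, 256))
loss = nn.MSELoss()(y_hat, torch.randn(8))                 # Eq. (16)
```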
Knowledge distillation
It has been proved that knowledge distillation is an effective method to enhance the generalization of a model and reduce the number of parameters [31, 32]. Regarding the DTA task as a regression problem, we apply the knowledge distillation mechanism in the training phase of FusionDTA and analyze its feasibility theoretically. On the one hand, self-knowledge distillation helps to improve performance through the constraint on feature maps. On the other hand, knowledge distillation enables a small-scale network to perform better than a network of the same scale trained without guidance.
Knowledge distillation learning
Conceptually, we define a powerful network that has already been trained as the teacher model, whereas a network of the same or smaller scale that learns from the teacher model is the student model.
Figure 6 shows the training stages of the teacher and the student models. First, FusionDTA is trained as a teacher model via an effective network. We define the target and drug inputs as $X$ and the affinity as $Y$; what we are concerned with is then the function $f(x): X \to Y$. In deep learning models, the function $f(x)$ is approximated by a parametrized function $f(x, \theta_1)$, where $\theta_1 \in \Theta$. Specifically, stochastic gradient descent aims to learn the parameters $\theta^*_1$ by minimizing some objective function:

$$\theta^*_1 = \mathrm{argmin}_{\theta_1}\, \mathcal{L}(y, f(x, \theta_1)). \quad (17)$$
Second, we train a new network as a student model. The knowledge of the student model is obtained from both the teacher model and the real affinity. Therefore, the objective function of the student model consists of two parts: one is the loss between $f(x, \theta_2)$ and the real target, and the other is the loss between $f(x, \theta_2)$ and $f(x, \theta^*_1)$:

$$Loss_1 = \mathcal{L}_1\big(f(x, \theta^*_1), f(x, \theta_2)\big), \quad (18)$$

$$Loss_2 = \mathcal{L}_2\big(y, f(x, \theta_2)\big), \quad (19)$$

$$\theta^*_2 = \mathrm{argmin}_{\theta_2}\big(\alpha\, Loss_1 + (1 - \alpha)\, Loss_2\big), \quad (20)$$

where $\alpha$ is the impact factor that determines the weight between $Loss_1$ and $Loss_2$.
Knowledge distillation for DTA task
In the abovementioned derivation, our ultimate goal is to minimize the objective function of the student. Owing to differences in outputs and model requirements, no universal $\mathcal{L}_1$ can be found for all tasks. Hinton and Salakhutdinov [33] introduced a generalized softmax function as $\mathcal{L}_1$, in which the concept of temperature was introduced to ensure that the softmax distribution generated by the teacher model is soft enough. In this way, the student model can extract knowledge from the softmax output distribution at a higher temperature, and then restore the low temperature during the test stage. However, in a regression task, the ground truth is a 1D continuous variable rather than a one-hot label. Therefore, the logits of the teacher model do not contain more information than the ground truth. In other words, in the DTA task, the student model will not learn additional hidden knowledge from the output logits of the teacher model.
To solve this problem, we suggest that the student model should learn transferable knowledge from the feature map of the hidden layer, instead of the logits in the output layer. We define $\mathcal{L}_1$ as follows:

$$\mathcal{L}_1(\theta_{Hint}, \theta_{Guide}) = \left\| g(x, \theta_{Hint}) - r\big(g(x, \theta_{Guide}), \theta_r\big) \right\|^2_2, \quad (21)$$

where $g$ is a transfer function from the input $x$ to the hidden layer with parameters $\theta_{Hint}$ or $\theta_{Guide}$, and $r$ is a nonlinear regression function on top of the guidance layer with parameters $\theta_r$; the regression function $r$ maps $g(x, \theta_{Guide})$ to a vector space with the same dimension as $g(x, \theta_{Hint})$. Specifically, we define the vectors generated by the first fully connected layer after the multi-head linear attention layer as the output of function $g$.

Figure 6. The training stage of the teacher and the student models. We divide the loss function into three parts: (A) $L^R_{teacher}$ for the teacher model; (B) the distillation loss $L^D_{student}$ and the regression loss $L^R_{student}$; (C) $L_{student}$ for the student model, which is the weighted sum of $L^D_{student}$ and $L^R_{student}$.
$\mathcal{L}_2$ is defined as the mean square error (MSE):

$$\mathcal{L}_2(Y_i, P_i) = \mathrm{MSE}(Y_i, P_i) = \frac{1}{N} \sum_{i=1}^{N} (Y_i - P_i)^2, \quad (22)$$
where $Y_i$ is the true value of binding affinity and $P_i$ is the predicted value. From the perspective of the loss function, the update strategy of the student model is to ensure that its hidden layer outputs are as close as possible to those of the teacher model. Therefore, the teacher model can provide the student with guidance during the training process. Meanwhile, the student's parameters after the hidden layer are not affected by the teacher, so the flexibility of the network will not be greatly inhibited. From a biological point of view, the binding information between a protein and a drug is not only expressed through binding affinity [34]. Assuming that the feature map of the middle layer contains more hidden information, e.g. the location of the binding site, the student model can learn transferable biological structure information, rather than just bringing its feature map closer to verified, better parameters.
In addition, knowledge distillation contributes to the restraint of model parameters and overfitting. Considering the feature map of the teacher model as a constraint, knowledge distillation limits the difference between the parameters of the teacher and the student. Compared with L2 regularization, the loss function $\mathcal{L}_1$ allows the model to learn more effective, verified network parameters (rather than simply driving them toward zero).
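Putting the pieces together, a hedged sketch of the student objective combining the feature hint of Eq. (21) with the regression loss of Eq. (22) under the weighting of Eq. (20); argument names are illustrative, not the authors' API:

```python
import torch
import torch.nn.functional as F

def student_loss(y_true, y_student, feat_hint, feat_guided, regressor,
                 alpha: float = 0.5):
    """Weighted student objective, a sketch of Eqs. (20)-(22).

    Following the notation of Eq. (21), `feat_hint` and `feat_guided` are the
    feature maps after the first fully connected layer of the two networks,
    and `regressor` (the function r) projects the guided features into the
    hint space when the dimensions differ. Names are illustrative only.
    """
    loss1 = F.mse_loss(feat_hint, regressor(feat_guided))  # L1: feature hint
    loss2 = F.mse_loss(y_student, y_true)                  # L2: true affinity
    return alpha * loss1 + (1.0 - alpha) * loss2           # Eq. (20)
```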
Evaluation metrics
The concordance index (CI), a model evaluation index proposed by Gönen and Heller [35], was designed to measure the agreement between the predicted values of the model and the ground truth. CI is defined as follows:

$$CI = \frac{1}{Z} \sum_{\delta_i > \delta_j} h(b_i - b_j), \quad (23)$$

where $b_i$ is the prediction value for $\delta_i$, $b_j$ is the prediction value for $\delta_j$, $h(x)$ is the step function and $Z$ is the normalization constant. Commonly, the step function $h(x)$ is defined as follows:

$$h(x) = \begin{cases} 0, & x < 0 \\ 0.5, & x = 0 \\ 1, & x > 0 \end{cases} \quad (24)$$
MSE is a statistical measure that evaluates the error directly. Assuming there are $n$ estimated samples and $n$ corresponding true values, MSE is expressed as the expectation of the squared loss:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \quad (25)$$
Table 1. The performance of FusionDTA and baseline models on the Davis dataset

Model       CI (std)         MSE    r_m^2 (std)
KronRLS     0.871 (±0.001)   0.379  0.407 (±0.005)
SimBoost    0.872 (±0.002)   0.282  0.644 (±0.006)
DeepDTA     0.878 (±0.004)   0.261  0.630 (±0.017)
WideDTA     0.886 (±0.003)   0.262  –
MT-DTI      0.887 (±0.003)   0.245  0.665 (±0.014)
DeepCDA     0.891 (±0.003)   0.248  0.649 (±0.009)
MATT_DTI    0.891 (±0.002)   0.227  0.683 (±0.017)
GraphDTA    0.893 (±0.001)   0.229  –
FusionDTA   0.913 (±0.002)   0.208  0.743 (±0.007)
Table 2. The performance of FusionDTA and baseline models on the KIBA dataset

Model       CI (std)         MSE    r_m^2 (std)
KronRLS     0.782 (±0.001)   0.441  0.342 (±0.001)
SimBoost    0.836 (±0.001)   0.222  0.629 (±0.007)
DeepDTA     0.863 (±0.002)   0.194  0.673 (±0.009)
WideDTA     0.875 (±0.001)   0.179  –
MT-DTI      0.882 (±0.001)   0.152  0.738 (±0.006)
DeepCDA     0.889 (±0.002)   0.176  0.682 (±0.008)
MATT_DTI    0.889 (±0.001)   0.150  0.756 (±0.011)
GraphDTA    0.891 (±0.002)   0.139  –
FusionDTA   0.906 (±0.001)   0.130  0.793 (±0.008)
where $\hat{y}_i$ is the estimate of the $i$th sample and $y_i$ is the true value of the $i$th sample.
Regression toward the mean (the $r^2_m$ index) is a measure evaluating the external predictive performance of a model: if a variable is extreme on one measurement, it tends to be closer to the average on the next. The $r^2_m$ index is calculated as follows:

$$r^2_m = r^2 \left(1 - \sqrt{r^2 - r^2_0}\right), \quad (26)$$

where $r^2$ and $r^2_0$ are the squared correlation coefficients with and without intercept, respectively.
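The three metrics can be sketched in NumPy as follows; the $r^2_m$ helper assumes the standard with/without-intercept definitions used in the DTA literature, since the paper does not spell them out:

```python
import numpy as np

def concordance_index(y_true, y_pred):
    """CI of Eqs. (23)-(24): the fraction of correctly ordered pairs,
    counting ties in the prediction as 0.5 (an O(n^2) reference version)."""
    y, p = np.asarray(y_true), np.asarray(y_pred)
    total, score = 0.0, 0.0
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] > y[j]:                  # pairs with delta_i > delta_j
                total += 1
                score += 1.0 if p[i] > p[j] else (0.5 if p[i] == p[j] else 0.0)
    return score / total                     # Z = number of valid pairs

def mse(y_true, y_pred):
    """MSE of Eq. (25)."""
    y, p = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean((y - p) ** 2))

def rm2(y_true, y_pred):
    """r_m^2 of Eq. (26); r^2 / r0^2 follow the usual DTA conventions."""
    y, p = np.asarray(y_true), np.asarray(y_pred)
    r2 = np.corrcoef(y, p)[0, 1] ** 2        # squared corr. with intercept
    k = np.sum(y * p) / np.sum(p * p)        # slope of regression through origin
    r02 = 1 - np.sum((y - k * p) ** 2) / np.sum((y - y.mean()) ** 2)
    return float(r2 * (1 - np.sqrt(np.abs(r2 - r02))))
```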
Results and discussion
In this section, the Davis and KIBA datasets were utilized to evaluate the performance of the model. The hyperparameters used for these two datasets in FusionDTA are shown in Supplementary Tables S1 and S2. To evaluate the performance of the multi-head linear attention, we compared it with max-pooling and mean-pooling on the Davis dataset. In addition, a comparative experiment was set up in which the performance of knowledge distillation is measured against existing models and vanilla FusionDTA. We compared our model with the following benchmark models: KronRLS [7], SimBoost [8], DeepDTA [9], WideDTA [36], GraphDTA [11], DeepCDA [10], MT-DTI [19] and MATT_DTI [16].
The performance of FusionDTA
In Table 1, we list the performance of the proposed model evaluated on the Davis dataset and compare it with the baseline models. As shown, FusionDTA is superior to the existing models in all aspects. In detail, FusionDTA improves the CI by 0.020 and reduces the MSE by 0.021 compared with the previous baseline model, GraphDTA. In addition, FusionDTA achieves a 0.060 improvement in the $r^2_m$ index compared to the baseline model MATT_DTI. For previous studies that proposed more than one model architecture, we only report the best-performing architecture on the Davis dataset.

Table 2 presents the performance of FusionDTA and the baseline models on the KIBA dataset. The results show that FusionDTA also achieves significantly better results than the baseline models on all of the evaluation measures. FusionDTA improves the CI by 0.015 and reduces the MSE by 0.009 compared with the previous state-of-the-art model, GraphDTA. Moreover, FusionDTA achieves a 0.037 improvement in the $r^2_m$ index over MATT_DTI.
Figure 7 illustrates the real affinity against the predicted value on both the Davis and KIBA datasets. Taking the ground truth as the x-axis and the prediction as the y-axis, the vertical distance $|y - x|$ from each point to the line $y = x$ represents the discrepancy between its predicted and real affinity values. Histograms at the edges represent the overall distributions of true and predicted affinities. As shown, the samples tend to be symmetric about $y = x$ for both the Davis and KIBA datasets. In particular, the sampling points in the KIBA dataset are more densely distributed around $y = x$.

Figure 7. The real affinity against the predicted value on the Davis and KIBA datasets. For each point, the x-axis reflects its real value and the y-axis reflects its predicted value. The vertical distance $|y - x|$ from each sample to $y = x$ represents the discrepancy between its predicted affinity value and the real value.
The performance of various pooling methods
In the model architecture, different pooling methods allow the model to pay attention to different parts of the intermediate sequence, determining how the parameters of each layer are updated according to different gradients. Common pooling methods include max-pooling and mean-pooling, which aggregate the feature map of a sequence into a single token with a maximizing function or an averaging function, respectively. In this model, a multi-head linear attention layer is proposed to replace the traditional pooling layer, with the aim of selectively focusing on the information of each biological token across the whole protein sequence or the entire SMILES chain.

Table 3. The performance of max-pooling, mean-pooling and the multi-head linear attention layer on the Davis dataset

Pooling method                CI     MSE
Max-pooling                   0.904  0.220
Mean-pooling                  0.910  0.211
Multi-head linear attention   0.913  0.208

Table 4. The performance of max-pooling, mean-pooling and the multi-head linear attention layer on the KIBA dataset

Pooling method                CI     MSE
Max-pooling                   0.897  0.137
Mean-pooling                  0.904  0.132
Multi-head linear attention   0.906  0.130
To evaluate the impact of different pooling methods on model performance, three controlled experiments were set up in the verification stage. It is worth mentioning that the model parameters of each experiment group are the same except for the pooling method. In Table 3, we list the performance of max-pooling, mean-pooling and the multi-head linear attention layer on the Davis dataset. As shown, the CI of the multi-head linear attention layer is 0.913, whereas the CIs of max-pooling and mean-pooling are 0.904 and 0.910, respectively. Clearly, the multi-head linear attention layer performs better than the other two pooling methods on the Davis dataset.

Table 4 reports the performance of max-pooling, mean-pooling and the multi-head linear attention layer on the KIBA dataset. As shown, the CI of multi-head linear attention is 0.906, which is higher than the 0.897 of max-pooling and the 0.904 of mean-pooling. In addition, the MSE of the multi-head linear attention layer is 0.130, lower than the 0.137 of max-pooling and the 0.132 of mean-pooling. On the KIBA dataset, the multi-head linear attention layer as a feature aggregator performs better than mean-pooling and max-pooling.
The performance of cold-start
The cold-start problem refers to evaluating model performance on unseen inputs. From an application point of view, a high proportion of protein or drug representations may not appear in the training set. Therefore, the challenge is whether a model with an excellent score on specific datasets can also perform well on unknown data. In this regard, cold-start performance indicates the model's robustness in a new environment (e.g. mutated proteins).

We compare our model with the following benchmark models: GraphDTA [11], GLFA and GEFA [17]. Table 5 reports the performance of drug cold-start, protein cold-start and drug–protein cold-start on the Davis dataset, corresponding to unseen drugs, unseen proteins and unseen drug–protein pairs, respectively.
Figure 8. An example of the weight visualization of the proposed model. MARK3 (PDB ID: 3FE3) is shown in cartoon form, while Nilotinib is shown in stick form. The cyan color highlights the highly focused positions of the protein and the focused drug atoms in the binding pocket, and a darker color indicates a smaller attention weight.
Table 5. The performance of drug cold-start, protein cold-start and drug–protein cold-start on the Davis dataset

Setting / Model          CI     MSE
Drug cold-start
  GraphDTA               0.675  0.920
  GLFA                   0.670  0.861
  GEFA                   0.709  0.846
  FusionDTA              0.747  0.681
Target cold-start
  GraphDTA               0.706  0.510
  GLFA                   0.780  0.4531
  GEFA                   0.795  0.4335
  FusionDTA              0.826  0.331
Drug–target cold-start
  GraphDTA               0.627  1.130
  GLFA                   0.636  1.144
  GEFA                   0.639  0.989
  FusionDTA              0.685  0.716
Table 6. The performance of vanilla FusionDTA, KD + FusionDTA (large-scale) and KD + FusionDTA (small-scale) on the Davis dataset

Model                             CI     MSE
Vanilla FusionDTA (large-scale)   0.913  0.208
Vanilla FusionDTA (small-scale)   0.905  0.221
KD + FusionDTA (large-scale)      0.914  0.205
KD + FusionDTA (small-scale)      0.908  0.213
As shown, FusionDTA achieved a CI of 0.747 and an MSE of 0.681 under the drug cold-start constraint, a CI of 0.826 and an MSE of 0.331 under the protein cold-start constraint, and a CI of 0.685 and an MSE of 0.716 under the drug–protein cold-start constraint. As a result, our model is better than all of the baseline models on the cold-start problem and more likely to perform robustly in undiscovered applications.
The performance of knowledge distillation
In this section, we evaluated the contribution of knowledge distillation on the Davis dataset. As mentioned above, knowledge distillation is an effective way to facilitate knowledge transfer and parameter regularization. Two experiments with different parameter sizes were therefore set up in the verification stage to examine the various effects of knowledge distillation. In one experiment, the parameter size of the student model was exactly the same as that of the teacher model, aiming to evaluate the effect of teacher guidance on the student model's performance. In the other experiment, a student model with only half as many parameters was used to evaluate the capability of model compression. For each experiment, the teacher model was a frozen, pre-trained FusionDTA, while the student model was initialized as an untrained FusionDTA. The student model then learned a new distribution from the teacher's outputs and the true values via the knowledge distillation training strategy.
Table 6 shows the performance of FusionDTA (large-scale), knowledge distillation + FusionDTA (large-scale) and knowledge distillation + FusionDTA (small-scale) evaluated on the Davis dataset. As shown, knowledge distillation + FusionDTA (large-scale) improves the CI by 0.001 and reduces the MSE by 0.003 compared with FusionDTA (large-scale). Knowledge distillation with the small-scale FusionDTA also achieved a CI of 0.908.

Supplementary Figure S1 shows the performance of the baseline models and the proposed models measured by CI, MSE and model scale. The numbers of parameters of DeepDTA, DeepCDA, GraphDTA, FusionDTA (large-scale), KD + FusionDTA (large-scale) and KD + FusionDTA (small-scale) are 1 967 745, 3 641 345, 4 749 573, 5 362 081, 5 362 081 and 2 013 537, respectively. As shown, knowledge distillation + FusionDTA (large-scale) achieves the best performance among all the methods.
Meanwhile, with only a little loss of accuracy, knowledge distillation can be regarded as an effective model compression method for the DTA task.
Visualization with attention weights
The attention weights obtained by FusionDTA can be used to analyze which parts of the interaction between a small drug molecule and a target protein play a key role in the binding pocket. The attention mechanism can calculate key areas of interaction between the protein sequence and the drug compound. To visualize the main areas of interaction, we first calculated the weights of the protein sequence and the SMILES characters of the drug compound, and then selected the corresponding interaction sites with relatively large attention values. Figure 8 shows an example of the weight visualization of the proposed model. We chose the complex of MARK3 (PDB ID: 3FE3) and Nilotinib for interactive visual analysis. The results showed that the weight values mainly range from 5.69E-4 to 1.43E-3. We colored the positions where the attention weight is greater than 9.80E-4 in the drug compound and greater than 9.57E-4 in the protein. The cyan color highlights the highly focused positions of the protein and the focused drug atoms in the binding pocket, and a darker color indicates a smaller attention weight. Our model mainly captured the main amino acid region, residues 194–339. Interestingly, the attention weights captured by our model in residues 194–339 are almost all close to 9.57E-4, while those of residues 285–287 are relatively larger. The peak value is at LYS-285, which falls right in the binding pocket, indicating that our model accurately predicted the potential docking site. Overall, some of residues 194–339 are in the docking pocket of MARK3 and Nilotinib, while some are located outside the region, which indicates that most of the regions captured by our model are located at the docking interface, though part of them captures the wrong area. The weights calculated by our model are mainly concentrated in the binding pocket, which shows that our model can predict the interaction between protein and compound accurately. In short, the proposed model can extract useful information from the two channels of drug SMILES and protein sequence.
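A sketch of the thresholding behind Figure 8; the weight tensors stand in for the fusion layer's per-token attention values, and the cutoffs are those reported above:

```python
import torch

# Illustrative stand-ins for the per-token attention weights a^P and a^D;
# in practice they come from the trained model's linear attention layers.
torch.manual_seed(0)
weights_p = torch.rand(300) * 1.43e-3   # one weight per residue (toy length)
weights_d = torch.rand(59) * 1.43e-3    # one weight per SMILES char (toy length)

# Cutoffs reported in the text: 9.57e-4 for residues, 9.80e-4 for drug atoms.
highlight_residues = (weights_p > 9.57e-4).nonzero(as_tuple=True)[0].tolist()
highlight_atoms = (weights_d > 9.80e-4).nonzero(as_tuple=True)[0].tolist()

# The selected residue indices can then be colored on the 3FE3 structure,
# e.g. in PyMOL: color cyan, resi 285-287
```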
Conclusion
This paper has presented a novel DTA prediction framework, FusionDTA. A new multi-head linear attention mechanism is applied to replace the coarse pooling method, using attention weights to aggregate global information. Additionally, we applied knowledge distillation in the framework by transferring learnable information from the teacher model to the student model. To evaluate the proposed work, it was applied to two common datasets, KIBA and Davis. Experimental results show that our model performs better than existing models on all evaluation indicators. When drugs and proteins are unknown, FusionDTA proved to be more robust and more effective than the other models in the benchmark, which will help in the development of new drugs. Meanwhile, knowledge distillation is of great help to performance, saving half of the parameters of the model while the CI hardly changes; more importantly, in this case our model can still easily exceed the baselines on all indicators. In additional experiments presented in Supplementary Tables S3 and S4, the pre-training representation of proteins proves to be effective for DTA models with sequential inputs, while pre-training of drugs fails to show superiority for the DTA task. Furthermore, the model has been shown to provide biological insights for understanding the nature of molecular interactions and to capture the binding pockets of proteins and molecules. In general, our model has superior performance and improves the effect of DTI prediction. The visualization of DTI can effectively help predict the binding region of proteins during structure-based drug design.
Key Points
• Due to the maximizing or averaging operator, crude use of pooling methods may result in the loss of hidden information. To this end, we propose a multi-head linear attention to capture the deep dependency of each token.
• To solve the parameter-redundancy issue, we propose a constraint under which a small-scale network can learn knowledge from a large-scale network and the real affinity.
• Our method achieves state-of-the-art performance on the Davis and KIBA datasets. In terms of model compression, FusionDTA with half the parameters can easily exceed the baseline on all metrics.
Supplementary Data
Supplementary data are available online at http://bib.oxfordjournals.org/.
Code and data availability
The source code and data of this study are available at
https://github.com/yuanweining/FusionDTA.
Funding
This work was supported by the National Natural Science
Foundation of China (Grant No. 62176272), Guangzhou
Science and Technology Fund (Grant No. 201803010072),
Science, Technology and Innovation Commission of
Shenzhen Municipality (JCYL 20170818165305521) and
China Medical University Hospital (DMR-111-102, DMR-111-143, DMR-111-123). We also acknowledge the start-up funding from SYSU's "Hundred Talent Program".
References
1. Newman DJ, Cragg GM. Natural products as sources of new
drugs over the nearly four decades from 01/1981 to 09/2019. J
Nat Prod 2020;83(3):770–803.
2. Takebe T, Imai R, Ono S. The current status of drug discovery
and development as originated in United States academia: the
influence of industrial and academic collaboration on drug
discovery and development. Clin Transl Sci 2018;11(6):597–606.
3. Lin X, Li X, Lin X. A review on applications of computational
methods in drug screening and design. Molecules 2020;25(6):1375.
4. Wen M, Zhang Z, Niu S, et al. Deep-learning-based drug-target
interaction prediction. J Proteome Res 2017;16(4):1401–9.
5. Kairys V, Baranauskiene L, Kazlauskiene M, et al. Binding affinity
in drug design: experimental and computational techniques.
Expert Opin Drug Discovery 2019;14(8):755–68.
6. Yadav AR, Mohite SK. Homology modeling and generation
of 3d-structure of protein. Res J Pharm Dosage Forms Technol
2020;12(4):313–20.
7. Pahikkala T, Airola A, Pietilä S, et al. Toward more realistic drug-
target interaction predictions. Brief Bioinform 2015;16(2):325–37.
8. He T, Heidemeyer M, Ban F, et al. Simboost: a read-across
approach for predicting drug-target binding affinities using gra-
dient boosting machines. J Cheminform 2017;9(1):1–14.
9. Öztürk H, Özgür A, Ozkirimli E. Deepdta: deep drug-target bind-
ing affinity prediction. Bioinformatics 2018;34(17):i821–9.
10. Abbasi K, Razzaghi P, Poso A, et al. Deepcda: deep cross-domain
compound-protein affinity prediction through lstm and convo-
lutional neural networks. Bioinformatics 2020;36(17):4633–42.
11. Nguyen T, Le H, Quinn TP, et al. Graphdta: predicting drug–
target binding affinity with graph neural networks. Bioinformatics
2021a;37(8):1140–7.
12. Kipf TN, Welling M. Semi-supervised classification with graph
convolutional networks. arXiv preprint arXiv:1609.02907. 2016.
13. Veličković P, Cucurull G, Casanova A, et al. Graph attention networks. arXiv preprint arXiv:1710.10903. 2017.
14. Xu K, Hu W, Leskovec J, et al. How powerful are graph neural
networks?. arXiv preprint arXiv:1810.00826. 2018.
15. Zheng S, Li Y, Chen S, et al. Predicting drug-protein interaction
using quasi-visual question answering system. Nat Mach Intell
2020;2(2):134–40.
16. Zeng Y, Chen X, Luo Y, et al. Deep drug-target binding affin-
ity prediction with multiple attention blocks. Brief Bioinform
2021;22(5):1–10.
17. Nguyen TM, Nguyen T, Le TM, et al. Gefa: early fusion approach
in drug-target affinity prediction. IEEE/ACM Trans Comput Biol
Bioinform 2021b.
18. Chen L, Tan X, Wang D, et al. TransformerCPI: improving
compound-protein interaction prediction by sequence-based
deep learning with self-attention mechanism and label reversal
experiments. Bioinformatics 2020;36(16):4406–14.
19. Shin B, Park S, Kang K, et al. Self-attention based molecule rep-
resentation for predicting drug-target interaction. arXiv preprint
arXiv:1908.06760, 2019.
20. Asgari E, Mofrad MRK. Continuous distributed representation of
biological sequences for deep proteomics and genomics. PloS One
2015;10(11):e0141287.
21. Rao R, Bhattacharya N, Thomas N, et al. (eds). Evaluating protein
transfer learning with tape. In: Advances in Neural Information
Processing Systems. Vancouver, Canada: Neural Information Pro-
cessing Systems Foundation, Inc., Vol. 32, 2019, 9689.
22. Rives A, Meier J, Sercu T, et al. Biological structure and function
emerge from scaling unsupervised learning to 250 million pro-
tein sequences. Proc Natl Acad Sci 2021;118(15):e2016239118.
23. Hirohara M, Saito Y, Koda Y, et al. Convolutional neural network
based on smiles representation of compounds for detecting
chemical motif. BMC Bioinform 2018;19(19):83–94.
24. Jiang M, Li Z, Zhang S, et al. Drug-target affinity predic-
tion using graph neural network and contact maps. RSC Adv
2020;10(35):20701–12.
25. Buciluǎ C, Caruana R, Niculescu-Mizil A. Model compression.
In: Proceedings of the 12th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining. Philadelphia, Pennsylvania,
USA: ACM, Inc., 2006, 535–41.
26. Davis MI, Hunt JP, Herrgard S, et al. Comprehensive analy-
sis of kinase inhibitor selectivity. Nat Biotechnol 2011;29(11):
1046–51.
27. Tang J, Szwajda A, Shakyawar S, et al. Making sense of large-
scale kinase inhibitor bioactivity data sets: a comparative and
integrative analysis. J Chem Inf Model 2014;54(3):735–43.
28. Weininger D. Smiles: a chemical language and information sys-
tem. J Chem Inf Comput Sci 1988;28(1):31–6.
29. Qiu X, Sun T, Yige X, et al. Pre-trained models for natural
language processing: a survey. Sci China Technol Sci 2020;63(10):
1872–97.
30. Sundermeyer M, Schlüter R, Ney H. Lstm neural networks
for language modeling. In: Thirteenth Annual Conference of the
International Speech Communication Association. Portland, Oregon,
USA: International Speech Communication Association (ISCA),
2012.
31. Hinton G, Vinyals O, Dean J. Distilling the knowledge in a neural
network. arXiv preprint arXiv:1503.02531. 2015.
32. Clark K, Luong M-T, Khandelwal U, et al. Bam! born-again
multi-task networks for natural language understanding. arXiv
preprint arXiv:1907.04829. 2019.
33. Hinton GE, Salakhutdinov RR. Replicated softmax: an undi-
rected topic model. Adv Neural Inform Process Syst 2009;22:
1607–14.
34. Vuignier K, Schappler J, Veuthey J-L, et al. Drug-protein bind-
ing: a critical review of analytical tools. Anal Bioanal Chem
2010;398(1):53–66.
35. Gönen M, Heller G. Concordance probability and discrimi-
natory power in proportional hazards regression. Biometrika
2005;92(4):965–70.
36. Öztürk H, Ozkirimli E, Özgür A. Widedta: prediction of drug-
target binding affinity. arXiv preprint arXiv:1902.04166. 2019.
Downloaded from https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbab506/6470967 by Sun Yat-Sen University, chenyuchian@mail.sysu.edu.cn on 21 December 2021
... To address these challenges, computational approaches have been developed that utilize available protein amino acid sequences and compound SMILES. These approaches aim to predict drug-target binding affinity (DTA) quickly and cost-effectively, overcoming the scarcity of structural information and the need for domain expert knowledge [10][11][12][13][14][15][16][17][18][19][20][21][22][23][24][25][26]. By leveraging computational methods, DTA prediction becomes more accessible and efficient, facilitating exploration of potential drug-target interactions and aiding in drug discovery [27]. ...
... There are several computational approaches for DTA prediction, especially machine learning (ML) and deep learning (DL) based methods [10][11][12][13][14][15][16][17][18][19][20][21][22][23][24][25][26]. These methods can be categorized into four groups: similarity-based, sequence-based, graph-based, and transformer-based methods. ...
... With advancements in transformer architecture [34], the application of transformers for feature extraction from protein sequences and drug SMILES has gained prominence [35,36]. One notable method in this domain is FusionDTA [23], which introduces a transformer-based network in combination with a LSTM network for drug and protein feature extraction. By leveraging transformers and LSTMs, FusionDTA captures long-term dependencies and aims to learn a distributed representation for drugs and proteins. ...
Article (full-text available)
Background: In recent years, there has been growing interest in computational approaches for predicting drug-target binding affinity, aiming to expedite the early drug discovery process. To address the limitations of experimental methods, such as cost and time, several machine-learning-based techniques have been developed. However, these methods encounter certain challenges, including the limited availability of training data, reliance on human intervention for feature selection and engineering, and a lack of validation approaches for robust evaluation in real-life applications. Results: To mitigate these limitations, we propose a method for drug-target binding affinity prediction based on deep convolutional generative adversarial networks. Additionally, we conducted a series of validation experiments and implemented adversarial control experiments using straw models. These experiments demonstrate the robustness and efficacy of our predictive models. We conducted a comprehensive evaluation of our method by comparing it to baselines and state-of-the-art methods on two recently updated datasets, BindingDB and PDBBind. Our findings indicate that our method outperforms the alternative methods on three performance measures under warm-start data-splitting settings. Moreover, under physicochemical cold-start data-splitting settings, our method demonstrates superior predictive performance, particularly in terms of the concordance index. Conclusion: The results of our study affirm the practical value of our method and its superiority over alternative approaches in predicting drug-target binding affinity across multiple validation sets. This highlights the potential of our approach in accelerating drug repurposing, facilitating novel drug discovery and ultimately enhancing disease treatment. The data and source code for this study were deposited in the GitHub repository, https://github.com/mojtabaze7/DCGAN-DTA. Furthermore, the web server for our method is accessible at https://dcgan.shinyapps.io/bindingaffinity/.
... FusionDTA [41]: This model uses a pre-trained transformer and a BiLSTM to encode amino acid sequences, and a BiLSTM to encode SMILES. It proposes a fusion layer consisting of a multi-head linear attention layer that focuses on important tokens in biological sequences and aggregates global information based on attention scores. ...
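To make the attention-weighted aggregation concrete, here is a minimal PyTorch sketch of per-head linear attention pooling over a token sequence. The class name, dimensions, and masking are illustrative assumptions for this note, not FusionDTA's published implementation.

```python
import torch
import torch.nn as nn

class MultiHeadLinearAttentionPool(nn.Module):
    """Pool a token sequence into one vector using per-head learned
    attention scores (an illustrative sketch, not the published code)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.score = nn.Linear(dim, heads)      # one scalar score per head per token
        self.out = nn.Linear(heads * dim, dim)

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); mask: (batch, seq_len), 1 marks real tokens
        scores = self.score(x)                                       # (B, L, H)
        scores = scores.masked_fill(mask.unsqueeze(-1) == 0, float("-inf"))
        weights = torch.softmax(scores, dim=1)                       # attend over tokens
        pooled = torch.einsum("blh,bld->bhd", weights, x)            # per-head weighted sum
        return self.out(pooled.flatten(1))                           # (B, dim)

pool = MultiHeadLinearAttentionPool(dim=256, heads=4)
summary = pool(torch.randn(2, 100, 256), torch.ones(2, 100))         # -> (2, 256)
```

Unlike max-pooling, every token contributes to the summary in proportion to its learned weight, which is the aggregation behavior the description above attributes to the fusion layer.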
Article (full-text available)
Identifying drug-target interactions (DTIs) holds significant importance in drug discovery and development, playing a crucial role in various areas such as virtual screening, drug repurposing and identification of potential drug side effects. However, existing methods commonly exploit only a single type of feature from drugs and targets, suffering from miscellaneous challenges such as high sparsity and cold-start problems. We propose a novel framework called MSI-DTI (Multi-Source Information-based Drug-Target Interaction Prediction) to enhance prediction performance, which obtains feature representations from different views by integrating biometric features and knowledge graph representations from multi-source information. Our approach involves constructing a Drug-Target Knowledge Graph (DTKG), obtaining multiple feature representations from diverse information sources for SMILES sequences and amino acid sequences, incorporating network features from DTKG and performing an effective multi-source information fusion. Subsequently, we employ a multi-head self-attention mechanism coupled with residual connections to capture higher-order interaction information between sparse features while preserving lower-order information. Experimental results on DTKG and two benchmark datasets demonstrate that our MSI-DTI outperforms several state-of-the-art DTIs prediction methods, yielding more accurate and robust predictions. The source codes and datasets are publicly accessible at https://github.com/KEAML-JLU/MSI-DTI.
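A bare-bones sketch of the self-attention-plus-residual pattern described above might look as follows in PyTorch; the module name and shapes are hypothetical, and MSI-DTI's actual block is more elaborate.

```python
import torch
import torch.nn as nn

class ResidualSelfAttention(nn.Module):
    """Multi-head self-attention over fused feature tokens with a residual
    connection that preserves the lower-order input features (a sketch)."""
    def __init__(self, dim: int = 128, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, n_fields, dim), one token per feature source
        attended, _ = self.attn(feats, feats, feats)   # higher-order interactions
        return self.norm(feats + attended)             # residual keeps original signal

fused = ResidualSelfAttention()(torch.randn(4, 6, 128))   # -> (4, 6, 128)
```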
... Currently, multimodal learning is a vibrant multidisciplinary field that provides frameworks for processing multiple sources of information [42,43]. Multimodal learning is a general approach to building artificial intelligence models that extract and associate information from multimodal data, enabling models to handle complex relationships between different modalities [44,45]. The rich and diverse information in multimodal data is essential for drug property prediction [46]. ...
Article (full-text available)
Accurately predicting molecular properties is a challenging but essential task in drug discovery. Recently, many mono-modal deep learning methods have been successfully applied to molecular property prediction. However, mono-modal learning is inherently limited because it relies on a single modality of molecular representation, which restricts a comprehensive understanding of drug molecules. To overcome this limitation, we propose a multimodal fused deep learning (MMFDL) model that leverages information from different molecular representations. Specifically, we construct a triple-modal learning model by employing a Transformer encoder, a bidirectional gated recurrent unit (BiGRU) and a graph convolutional network (GCN) to process three modalities of information from chemical language and molecular graphs: SMILES-encoded vectors, ECFP fingerprints and molecular graphs, respectively. We evaluate the proposed triple-modal model using five fusion approaches on six molecule datasets: Delaney, Llinas2020, Lipophilicity, SAMPL, BACE and pKa from DataWarrior. The results show that the MMFDL model achieves the highest Pearson coefficients and the most stable distribution of Pearson coefficients in the random-splitting test, outperforming mono-modal models in accuracy and reliability. Furthermore, we validate the generalization ability of our model by predicting binding constants for protein-ligand complexes and assess its resilience to noise. Through analysis of feature distributions in chemical space and the contribution assigned to each modal model, we demonstrate that the MMFDL model acquires complementary information by using proper models and suitable fusion approaches. By leveraging diverse sources of bioinformatics information, multimodal deep learning models hold the potential for successful drug discovery.
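As one hedged illustration of how several modality embeddings could be fused, the snippet below implements a learned softmax-weighted sum; it is only one of the fusion strategies a triple-modal model might evaluate, and the variable names are invented for the example.

```python
import torch
import torch.nn as nn

class WeightedSumFusion(nn.Module):
    """Learned softmax-weighted sum of modality embeddings (one simple
    fusion strategy among the several a triple-modal model could test)."""
    def __init__(self, n_modal: int = 3):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_modal))

    def forward(self, *embs: torch.Tensor) -> torch.Tensor:
        # embs: n_modal tensors of shape (batch, dim), already projected
        # to a common dimension upstream.
        w = torch.softmax(self.logits, dim=0)
        return sum(wi * e for wi, e in zip(w, embs))

smiles_e, ecfp_e, graph_e = (torch.randn(2, 64) for _ in range(3))
fused = WeightedSumFusion()(smiles_e, ecfp_e, graph_e)     # -> (2, 64)
```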
Article (full-text available)
The activities of most enzymes and drugs depend on interactions between proteins and small molecules. Accurate prediction of these interactions could greatly accelerate pharmaceutical and biotechnological research. Current machine learning models designed for this task have a limited ability to generalize beyond the proteins used for training. This limitation is likely due to a lack of information exchange between the protein and the small molecule during the generation of the required numerical representations. Here, we introduce ProSmith, a machine learning framework that employs a multimodal Transformer Network to simultaneously process protein amino acid sequences and small molecule strings in the same input. This approach facilitates the exchange of all relevant information between the two molecule types during the computation of their numerical representations, allowing the model to account for their structural and functional interactions. Our final model combines gradient boosting predictions based on the resulting multimodal Transformer Network with independent predictions based on separate deep learning representations of the proteins and small molecules. The resulting predictions outperform recently published state-of-the-art models for predicting protein-small molecule interactions across three diverse tasks: predicting kinase inhibitions; inferring potential substrates for enzymes; and predicting Michaelis constants KM. The Python code provided can be used to easily implement and improve machine learning predictions involving arbitrary protein-small molecule interactions.
Preprint
Accurate protein-ligand binding affinity prediction is crucial in drug discovery. Existing methods are predominantly docking-free and do not explicitly consider atom-level interactions between proteins and ligands in scenarios where crystallized protein-ligand binding conformations are unavailable. Now, with breakthroughs in deep-learning-based protein folding and binding-conformation prediction, can we improve binding affinity prediction? This study introduces a framework, Folding-Docking-Affinity (FDA), which folds proteins, determines protein-ligand binding conformations, and predicts binding affinities from three-dimensional protein-ligand binding structures. Our experiments demonstrate that FDA outperforms state-of-the-art docking-free models on the DAVIS dataset, showcasing the potential of explicitly modeling three-dimensional binding conformations for enhancing binding affinity prediction accuracy.
Article (full-text available)
Predicting the interaction between a compound and a target is crucial for rapid drug repurposing. Deep learning has been successfully applied to the drug-target affinity (DTA) problem. However, previous deep-learning-based methods ignore the direct interactions between drug and protein residues, which leads to inaccurate learning of target representations that may change under drug-binding effects. In addition, previous DTA methods learn protein representations solely from the small number of protein sequences in DTA datasets while neglecting proteins outside these datasets. We propose GEFA (Graph Early Fusion Affinity), a novel graph-in-graph neural network with an attention mechanism to address the changes in target representation caused by binding effects. Specifically, a drug is modeled as a graph of atoms, which then serves as a node in a larger graph of the residue-drug complex. The resulting model is an expressive deep nested graph neural network. We also use pre-trained protein representations powered by the recent effort of learning contextualized protein representations. The experiments are conducted under different settings to evaluate scenarios such as novel drugs or targets. The results demonstrate the effectiveness of the pre-trained protein embedding and the advantages of GEFA in modeling the nested graph for drug-target interaction.
Article (full-text available)
Significance Learning biological properties from sequence data is a logical step toward generative and predictive artificial intelligence for biology. Here, we propose scaling a deep contextual language model with unsupervised learning to sequences spanning evolutionary diversity. We find that without prior knowledge, information emerges in the learned representations on fundamental properties of proteins such as secondary structure, contacts, and biological activity. We show the learned representations are useful across benchmarks for remote homology detection, prediction of secondary structure, long-range residue–residue contacts, and mutational effect. Unsupervised representation learning enables state-of-the-art supervised prediction of mutational effect and secondary structure and improves state-of-the-art features for long-range contact prediction.
Article (full-text available)
Recently, the emergence of pre-trained models (PTMs) has brought natural language processing (NLP) to a new era. In this survey, we provide a comprehensive review of PTMs for NLP. We first briefly introduce language representation learning and its research progress. Then we systematically categorize existing PTMs based on a taxonomy with four different perspectives. Next, we describe how to adapt the knowledge of PTMs to downstream tasks. Finally, we outline some potential directions for future PTM research. This survey is intended as a hands-on guide for understanding, using and developing PTMs for various NLP tasks.
Article (full-text available)
Computer-aided drug design uses high-performance computers to simulate drug design tasks and is a promising research area. Drug-target affinity (DTA) prediction is the most important step in computer-aided drug design; it can speed up drug development and reduce resource consumption. With the development of deep learning, introducing deep learning into DTA prediction and improving its accuracy have become a research focus. In this paper, utilizing the structural information of molecules and proteins, graphs of drug molecules and proteins are built, graph neural networks are introduced to obtain their representations, and a method called DGraphDTA is proposed for DTA prediction. Specifically, the protein graph is constructed from a predicted contact map, which captures the structural characteristics of a protein from its sequence alone. Tests of various metrics on benchmark datasets show that the proposed method has strong robustness and generalizability.
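The contact-map-to-graph step can be sketched in a few lines of NumPy. The 0.5 threshold and the feature-free edge list are assumptions made for illustration, not DGraphDTA's exact preprocessing.

```python
import numpy as np

def contact_map_to_edges(contact_probs: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Convert an L x L predicted contact-probability map into a
    (2, n_edges) residue-graph edge list (illustrative threshold)."""
    src, dst = np.nonzero(contact_probs >= threshold)
    keep = src != dst                           # drop self-loops
    return np.stack([src[keep], dst[keep]])

probs = np.random.rand(50, 50)                  # stand-in for a predicted map
probs = (probs + probs.T) / 2                   # contact maps are symmetric
edge_index = contact_map_to_edges(probs)        # feed to any GNN library
```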
Article (full-text available)
Motivation: Identifying compound-protein interactions (CPIs) is a crucial task in drug discovery and chemogenomics studies, and proteins without three-dimensional (3D) structure account for a large part of potential biological targets, which requires methods that predict CPI using only protein sequence information. However, sequence-based CPI models may face specific pitfalls, including inappropriate datasets, hidden ligand bias and inappropriate dataset splits, resulting in overestimation of their prediction performance. Results: To address these issues, we constructed new datasets specific to CPI prediction, proposed a novel transformer neural network named TransformerCPI, and introduced a more rigorous label-reversal experiment to test whether a model learns true interaction features. TransformerCPI achieved much-improved performance on the new experiments, and it can be deconvolved to highlight important interacting regions of protein sequences and compound atoms, which may provide chemical biology studies with useful guidance for further ligand structural optimization. Supplementary information: Supplementary data are available at Bioinformatics online. Availability and implementation: https://github.com/lifanchen-simm/transformerCPI.
Article
Drug-target interaction (DTI) prediction has drawn increasing interest due to its substantial position in the drug discovery process. Many studies have introduced computational models that treat DTI prediction as a regression task, directly predicting the binding affinity of drug-target pairs. However, existing studies (i) ignore the essential correlations between atoms when encoding drug compounds and (ii) model the interaction of drug-target pairs simply by concatenation. Based on these observations, in this study we propose an end-to-end model with multiple attention blocks to predict the binding affinity scores of drug-target pairs. Our proposed model (i) encodes the correlations between atoms with a relation-aware self-attention block and (ii) models the interaction of drug representations and target representations with a multi-head attention block. Experimental results for DTI prediction on two benchmark datasets show that our approach outperforms existing methods, benefiting from the correlation information encoded by the relation-aware self-attention block and the interaction information extracted by the multi-head attention block. Moreover, we conduct experiments on the effect of the maximum relative position length and find the best values to be $k \in \{3, 5\}$. Furthermore, we apply our model to predict the binding affinity of Coronavirus Disease 2019 (COVID-19)-related genome sequences and 3137 FDA-approved drugs.
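For readers unfamiliar with relation-aware self-attention, the following single-head PyTorch sketch adds clipped relative-position ("relation") embeddings to the attention scores, in the style of Shaw et al.'s relative position representations. It applies relations to keys only for brevity and omits the paper's multi-head interaction block, so treat it as a schematic rather than the authors' model.

```python
import torch
import torch.nn as nn

class RelationAwareSelfAttention(nn.Module):
    """Single-head self-attention whose scores add clipped relative-position
    ("relation") embeddings between atom pairs (a simplified sketch)."""
    def __init__(self, dim: int, k: int = 3):
        super().__init__()
        self.k = k
        self.q = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.rel = nn.Embedding(2 * k + 1, dim)   # one embedding per clipped offset

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, L, D = x.shape
        q, kx, v = self.q(x), self.key(x), self.v(x)
        pos = torch.arange(L, device=x.device)
        # Offsets between all token pairs, clipped to [-k, k] then shifted to [0, 2k].
        offsets = (pos[None, :] - pos[:, None]).clamp(-self.k, self.k) + self.k
        rel = self.rel(offsets)                                   # (L, L, D)
        scores = q @ kx.transpose(-2, -1)                         # content term
        scores = scores + torch.einsum("bld,lmd->blm", q, rel)    # relation term
        attn = torch.softmax(scores / D ** 0.5, dim=-1)
        return attn @ v

out = RelationAwareSelfAttention(dim=64, k=3)(torch.randn(2, 30, 64))
```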
Article
Machine learning applied to protein sequences is an increasingly popular area of research. Semi-supervised learning for proteins has emerged as an important paradigm due to the high cost of acquiring supervised protein labels, but the current literature is fragmented when it comes to datasets and standardized evaluation techniques. To facilitate progress in this field, we introduce the Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology. We curate tasks into specific training, validation, and test splits to ensure that each task tests biologically relevant generalization that transfers to real-life scenarios. We benchmark a range of approaches to semi-supervised protein representation learning, which span recent work as well as canonical sequence learning techniques. We find that self-supervised pretraining is helpful for almost all models on all tasks, more than doubling performance in some cases. Despite this increase, in several cases features learned by self-supervised pretraining still lag behind features extracted by state-of-the-art non-neural techniques. This gap in performance suggests a huge opportunity for innovative architecture design and improved modeling paradigms that better capture the signal in biological sequences. TAPE will help the machine learning community focus effort on scientifically relevant problems. Toward this end, all data and code used to run these experiments are available at https://github.com/songlab-cal/tape.
Article
The development of new drugs is costly, time-consuming and often accompanied by safety issues. Drug repurposing can avoid the expensive and lengthy process of drug development by finding new uses for already approved drugs. To repurpose drugs effectively, it is useful to know which proteins are targeted by which drugs. Computational models that estimate the interaction strength of new drug-target pairs have the potential to expedite drug repurposing. Several models have been proposed for this task. However, these models represent drugs as strings, which is not a natural way to represent molecules. We propose a new model called GraphDTA that represents drugs as graphs and uses graph neural networks to predict drug-target affinity. We show that graph neural networks not only predict drug-target affinity better than non-deep-learning models but also outperform competing deep learning methods. Our results confirm that deep learning models are appropriate for drug-target binding affinity prediction and that representing drugs as graphs can lead to further improvements. Availability of data and materials: The proposed models are implemented in Python. Related data, pre-trained models and source code are publicly available at https://github.com/thinng/GraphDTA. All scripts and data needed to reproduce the post-hoc statistical analysis are available from https://doi.org/10.5281/zenodo.3603523.
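Representing a drug as a graph typically starts from its SMILES string. The RDKit-based sketch below extracts a minimal atom-feature matrix and bond edge list; GraphDTA's actual atom features are richer, so this is a schematic under that assumption.

```python
import torch
from rdkit import Chem   # requires the rdkit package

def smiles_to_graph(smiles: str):
    """Minimal SMILES -> (atom features, edge index) conversion; real
    pipelines use far richer atom and bond features."""
    mol = Chem.MolFromSmiles(smiles)
    feats = torch.tensor(
        [[a.GetAtomicNum(), a.GetDegree()] for a in mol.GetAtoms()],
        dtype=torch.float,
    )
    edges = []
    for b in mol.GetBonds():                    # store each bond both ways
        i, j = b.GetBeginAtomIdx(), b.GetEndAtomIdx()
        edges += [(i, j), (j, i)]
    edge_index = torch.tensor(edges, dtype=torch.long).t()
    return feats, edge_index

x, ei = smiles_to_graph("CC(=O)Oc1ccccc1C(=O)O")   # aspirin
```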
Article
Motivation: An essential part of drug discovery is the accurate prediction of the binding affinity of new compound-protein pairs. Most of the standard computational methods assume that compounds or proteins of the test data are observed during the training phase. However, in real-world situations, the test and training data are sampled from different domains with different distributions. To cope with this challenge, we propose a deep learning-based approach that consists of three steps. In the first step, the training encoder network learns a novel representation of compounds and proteins. To this end, we combine convolutional layers and LSTM layers so that the occurrence patterns of local substructures through a protein and a compound sequence are learned. Also, to encode the interaction strength of the protein and compound substructures, we propose a two-sided attention mechanism. In the second phase, to deal with the different distributions of the training and test domains, a feature encoder network is learned for the test domain by utilizing an adversarial domain adaptation approach. In the third phase, the learned test encoder network is applied to new compound-protein pairs to predict their binding affinity. Results: To evaluate the proposed approach, we applied it to KIBA, Davis, and BindingDB datasets. The results show that the proposed method learns a more reliable model for the test domain in more challenging situations. Availability: https://github.com/LBBSoft/DeepCDA.
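The "two-sided" idea can be approximated with a single bilinear interaction matrix from which compound-side and protein-side attention weights are read off. The sketch below is a deliberately simplified assumption, not DeepCDA's exact mechanism.

```python
import torch
import torch.nn as nn

class TwoSidedAttention(nn.Module):
    """Bilinear interaction between compound and protein substructure
    features, yielding one attention weight per position on each side."""
    def __init__(self, dc: int, dp: int):
        super().__init__()
        self.W = nn.Parameter(torch.randn(dc, dp) * 0.01)

    def forward(self, comp: torch.Tensor, prot: torch.Tensor):
        # comp: (B, Lc, dc); prot: (B, Lp, dp)
        inter = torch.einsum("bic,cp,bjp->bij", comp, self.W, prot)  # (B, Lc, Lp)
        a_comp = torch.softmax(inter.max(dim=2).values, dim=1)       # (B, Lc)
        a_prot = torch.softmax(inter.max(dim=1).values, dim=1)       # (B, Lp)
        return a_comp, a_prot

ac, ap = TwoSidedAttention(64, 96)(torch.randn(2, 40, 64), torch.randn(2, 300, 96))
```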