Multiphysical graph neural network (MP-GNN) for COVID-19 drug design

Authors:
  • Alibaba Group DAMO Academy (USA)

Abstract

Graph neural networks (GNNs) are among the most promising deep learning models for non-Euclidean data analysis. However, their full potential is severely curtailed by poorly represented molecular graphs and features. Here, we propose a multiphysical graph neural network (MP-GNN) model based on a multiphysical molecular graph representation and featurization. Molecular interactions between different atom types and at different scales are systematically represented by a series of scale-specific and element-specific graphs with distance-related node features. From these graphs, graph convolution network (GCN) models are constructed with specially designed weight-sharing architectures. Base learners are built from GCN models for different elements at different scales and are consolidated using both one-scale and multi-scale ensemble learning schemes. Our MP-GNN has two distinct properties. First, it incorporates multiscale interactions using more than one molecular graph: atomic interactions at different scales are not modeled by one specific graph (as in traditional GNNs) but by a series of graphs at different scales. Second, it is free from the complicated feature generation process of conventional GNN methods: atomic interactions are embedded into element-specific graph representations with only distance-related node features, and a dedicated GNN architecture incorporates all the information into a consolidated model. Our MP-GNN has been extensively validated on the widely used PDBbind benchmark datasets, including PDBbind-v2007, PDBbind-v2013 and PDBbind-v2016. To the best of our knowledge, it delivers state-of-the-art results on these benchmarks. Further, our MP-GNN is applied to coronavirus disease 2019 (COVID-19) drug design. Based on a dataset of 185 complexes of inhibitors for severe acute respiratory syndrome coronavirus (SARS-CoV/SARS-CoV-2), we evaluate their binding affinities using our MP-GNN and find that it is highly accurate. This demonstrates the great potential of our MP-GNN for screening potential drugs for SARS-CoV-2. Availability: The MP-GNN model can be found at https://github.com/Alibaba-DAMO-DrugAI/MGNN. Additional data or code will be available upon reasonable request.
Xiao-Shuang Li is a PhD student at Shanghai Jiao Tong University and a research intern at the Alibaba DAMO Academy.
Xiang Liu is a PhD student at Nankai University, China. He was a visiting student at Nanyang Technological University from December 2019 to June 2020.
Le Lu is an IEEE Fellow. He is the Head of Medical AI research and development at Alibaba Group and Senior Director of DAMO Academy USA.
Xian-Sheng Hua is an IEEE Fellow. He is the head of the CityBrain Lab and leads the Artificial Intelligence Center of DAMO Academy at Alibaba Group.
Ying Chi is the team leader of Drug Discovery Intelligence at the Alibaba DAMO Academy. She completed her PhD at Imperial College London and did postdoctoral research at Oxford University, UK. Her current research and development interests cover AI methods for various drug discovery problems, e.g. virtual screening and protein- and immunity-related problems.
Kelin Xia is an assistant professor at the School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore. His research interests are topological data analysis, molecular-based mathematical biology and machine learning.
Received: March 7, 2022. Revised: April 24, 2022. Accepted: May 18, 2022
© The Author(s) 2022. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
Briefings in Bioinformatics, 2022, 1–10
https://doi.org/10.1093/bib/bbac231
Problem Solving Protocol
Xiao-Shuang Li, Xiang Liu, Le Lu, Xian-Sheng Hua, Ying Chi and Kelin Xia
Corresponding author: Kelin Xia, xiakelin@ntu.edu.sg
Keywords: Graph neural network, Graph representation and featurization, Protein–ligand binding, Drug design, Ensemble learning
Introduction
So far, more than 262 million infections and 5 million fatalities have been attributed to the new severe acute respiratory syndrome coronavirus (SARS-CoV-2) in the coronavirus disease 2019 (COVID-19) pandemic, which has swept across 213 countries and territories. The significance of designing efficient antibodies and drugs for COVID-19 cannot be overemphasized. Artificial intelligence-based models have demonstrated great power in various steps of drug design [1]. Among these models are graph neural network (GNN) models, which are end-to-end learning models that take in a molecular graph representation and directly output the prediction. Originally, GNNs were developed for the analysis of large-scale network data, with the main focus on predicting the properties of new nodes or edges within the network. Recently, GNNs have been used in biomolecular data analysis and have achieved great performance in various steps of drug design and discovery [2–10]. Among these models, AquaSol [2] uses directed-acyclic-graph-based recursive neural networks to predict molecular solubility. In DeepVS [3], an effective atom context representation is employed that takes protein–ligand complex properties into consideration. An integrated model of a compound-structure-based GNN and a protein-sequence-based convolutional neural network (CNN) is developed for compound–protein interactions [7]. A GAN model is introduced for chemical stability prediction
Downloaded from https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbac231/6607747 by NTU Library user on 18 June 2022
in DeepChemStable [8]. A convolution spatial graph embedding layer (C-SGEL)-based graph convolution network (GCN) model is developed for molecular property prediction [9]. A structure-aware interactive GNN is designed to learn essential long-range interactions among atoms and to fully utilize biomolecular structural information [11]. GNN models have also been used in drug–target affinity prediction [11–15], antibiotic discovery [16], the prediction of protein–protein binding affinity changes upon mutation [17] and various other drug discovery and development tasks [18].
Even though GNNs have shown great promise for drug design, their full potential has been hindered by inefficient graph topological representations and featurization. Currently, most biomolecular GNNs use the covalent-bond-based graph representation, which models a molecule as a graph with atoms as nodes and covalent bonds as edges. Node and edge features are then generated from different types of physical, chemical and biological properties. However, these covalent-bond-based molecular topologies fail to efficiently characterize non-covalent interactions, which can be of great importance for biomolecular complexes, including protein–protein, protein–ligand, protein–DNA/RNA and DNA/RNA–ligand complexes. To alleviate the problem, a fixed cutoff-distance-based molecular graph representation has been developed. However, molecular interactions usually occur at different scales. A fixed cutoff-distance-based topology tends to miss a great amount of information, and it is nontrivial to identify the ‘best’ cutoff distance. Currently, the bottleneck for the design of efficient molecular GNN models is the lack of suitable topological representations and featurization that characterize the multiphysical properties of biomolecules.
Here, we develop multiphysical molecular graph representations and featurization. Based on them, we propose a multiphysical graph neural network (MP-GNN) model. Our MP-GNN employs an ensemble learning scheme to incorporate both scale-specific GNN models and element-specific GNN models. We find that our MP-GNN model delivers state-of-the-art results for protein–ligand binding affinity prediction and achieves high accuracy (an average Pearson correlation coefficient of 0.855) on the SARS-CoV BA dataset, which contains 185 Mpro–ligand complexes and their experimental binding affinities.
Results
Physically, atomic interactions within and between molecules are of various types, ranging from strong ones, such as covalent bonds, disulfide bonds, ionic bonds and hydrogen bonds, to relatively weaker ones, such as van der Waals forces, electrostatic interactions, and hydrophobic and hydrophilic effects. Mathematically, the atomic interaction between two atoms with coordinates $r_i$ and $r_j$ can be defined as an interaction function $\Phi(\|r_i - r_j\|)$, with $\|r_i - r_j\|$ the Euclidean distance. To model the multiscale effects, scale (or resolution) related kernel functions are used. Among them, the most common ones are the generalized exponential kernels and the generalized Lorentz kernels. For two atoms at $r_i$ and $r_j$, their atomic interaction can be modeled by the generalized exponential kernel as follows:

$$\Phi(\|r_i - r_j\|; \eta) = e^{-(\|r_i - r_j\|/\eta)^{\kappa}}, \tag{1}$$

or by the generalized Lorentz kernel as

$$\Phi(\|r_i - r_j\|; \eta) = \frac{1}{1 + (\|r_i - r_j\|/\eta)^{\kappa}}. \tag{2}$$

Here, $\eta$ is the scale (or resolution) parameter and $\kappa$ is the order parameter, which is usually taken as 2. Based on the rigidity–flexibility model, we can define the node importance using the rigidity index as follows:

$$\mu(r_i; \eta) = \sum_{j} w_j \Phi(\|r_i - r_j\|; \eta), \tag{3}$$

where $w_j$ is an atomic type-dependent weight. Note that kernel functions with different scale values focus on atomic interactions at different scales. If a small $\eta$ value is used, the kernels characterize only strong covalent interactions, and the values for other interactions at longer distances are (nearly) 0. In contrast, under a larger $\eta$ value, relatively weaker interactions are also included. Node importance varies with the scale value in a similar way.
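As an illustration of Eqs. (1)–(3), the following Python sketch evaluates both kernels and the rigidity index for a few toy coordinates. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names are ours, the weights w_j default to 1, the sum in Eq. (3) is taken over all atoms j ≠ i, and the coordinates are made up.

```python
import math


def exponential_kernel(d, eta, kappa=2.0):
    """Generalized exponential kernel, Eq. (1): exp(-(d / eta)^kappa)."""
    return math.exp(-((d / eta) ** kappa))


def lorentz_kernel(d, eta, kappa=2.0):
    """Generalized Lorentz kernel, Eq. (2): 1 / (1 + (d / eta)^kappa)."""
    return 1.0 / (1.0 + (d / eta) ** kappa)


def rigidity_index(coords, i, eta, kernel=exponential_kernel, weights=None):
    """Rigidity index of atom i, Eq. (3).

    Assumptions: the sum runs over all atoms j != i and w_j defaults to 1.
    """
    if weights is None:
        weights = [1.0] * len(coords)
    return sum(
        weights[j] * kernel(math.dist(coords[i], coords[j]), eta)
        for j in range(len(coords))
        if j != i
    )


# Toy coordinates in angstroms (hypothetical, not taken from any dataset).
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 4.0, 0.0)]
for eta in (2.0, 20.0):
    print(eta, [round(rigidity_index(coords, i, eta), 3) for i in range(3)])
```

For the first atom, with η = 2 Å only the neighbor at 1.5 Å contributes appreciably, whereas with η = 20 Å both neighbors contribute almost fully, which is the scale effect described above.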
In our scale-specific graph representations, molecules are modeled by a series of graphs systematically generated at different scales. Mathematically, a fully connected molecular graph is generated with a scale-related weight value, i.e. the atomic interaction from Eq. (1) or Eq. (2), on each edge. Based on the scale-specific graph representation, the normalized adjacency matrix can be defined as

$$\hat{A}(i,j) = \begin{cases} \Phi(\|r_i - r_j\|; \eta), & i \neq j \\ 0, & i = j, \end{cases} \tag{4}$$

and the normalized degree matrix can be defined as

$$\hat{D}(i,j) = \begin{cases} \mu(r_i; \eta), & i = j \\ 0, & i \neq j. \end{cases} \tag{5}$$

In this way, the scale effects are incorporated into the molecular graph representation.
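Eqs. (4) and (5) can be sketched in a few lines of Python. This is only an illustration: it assumes the exponential kernel with κ = 2, unit weights w_j (so that the rigidity index of a node equals its row sum in the adjacency matrix), and hypothetical coordinates; the function name is ours.

```python
import math


def scale_specific_matrices(coords, eta, kappa=2.0):
    """Return (A_hat, D_hat) of Eqs. (4)-(5) as nested lists."""
    n = len(coords)
    a_hat = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:  # the diagonal stays 0, as in Eq. (4)
                d = math.dist(coords[i], coords[j])
                a_hat[i][j] = math.exp(-((d / eta) ** kappa))
    # Assumption: unit weights, so mu(r_i; eta) is the row sum of A_hat.
    d_hat = [
        [sum(a_hat[i]) if i == j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    return a_hat, d_hat


# Toy coordinates in angstroms (hypothetical).
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 4.0, 0.0)]
a_hat, d_hat = scale_specific_matrices(coords, eta=5.0)
```

Calling the function once per scale value yields the series of scale-specific graphs for one molecule.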
Further, we propose a new type of node feature vector that depends solely on the atomic interaction function $\Phi$. For the $i$-th node, an $n$-dimensional node feature vector $v^i(\eta) = (v^i_1(\eta), v^i_2(\eta), \ldots, v^i_n(\eta))$ is defined as follows:

$$v^i_k(\eta) = \sum_{j} \chi\left(x_{k-1} \le \Phi(\|r_i - r_j\|; \eta) < x_k\right), \quad k = 1, 2, \ldots, n. \tag{6}$$
Here, we assume all atomic interactions $\Phi(\|r_i - r_j\|; \eta)$ lie within the region $[0, x_{\max}]$, which is equally divided into $n$ intervals $\{(x_{k-1}, x_k);\ k = 1, 2, \ldots, n\}$ with $x_0 = 0$ and $x_n = x_{\max}$. The indicator function $\chi$ equals 1 if its condition is satisfied and 0 otherwise. In other words, the $k$-th entry of the node feature vector is the frequency (i.e. the number) of atomic interactions of node $i$ whose values fall within the $k$-th interval.
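The node feature of Eq. (6) is essentially a histogram of interaction values. The sketch below shows this for one node, assuming that the interaction values of that node have already been computed and that x_max = 1 (the upper bound of the kernels in Eqs. (1) and (2)); the function name and the toy values are ours.

```python
def node_feature_vector(interactions, n_bins, x_max=1.0):
    """Count interaction values per bin: entry k counts x_{k-1} <= value < x_k."""
    width = x_max / n_bins
    features = [0] * n_bins
    for value in interactions:
        if 0.0 <= value < x_max:
            # min() guards against floating-point rounding at the last bin edge.
            k = min(int(value / width), n_bins - 1)
            features[k] += 1
    return features


# Hypothetical interaction values between one node and its four neighbors.
phi_values = [0.05, 0.30, 0.35, 0.80]
print(node_feature_vector(phi_values, n_bins=4))
```

The resulting vector has the same length n for every node, regardless of how many neighbors the node has.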
Finally, molecules are usually of various sizes, which results in molecular graphs of different sizes. To facilitate weight-sharing among different molecular graphs, we use the node importance $\mu$ from Eq. (3) to remove the less important extra nodes, so that same-sized molecular graphs are obtained.
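The cropping step can be pictured as keeping the nodes with the largest rigidity index. The sketch below is one plausible reading, with a hypothetical function name and toy importance values; how graphs smaller than the target size are handled (e.g. by padding) is not specified in the text above and is left out.

```python
def crop_by_importance(importance, size):
    """Return the indices of the `size` most important nodes, in original order."""
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:size])


mu = [0.9, 0.1, 1.7, 0.4, 1.2]  # hypothetical rigidity indices of five nodes
kept = crop_by_importance(mu, size=3)
print(kept)
```

The rows and columns of the adjacency and degree matrices, as well as the node feature vectors, would then be restricted to the kept indices.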
Element-specific graph for GNN
Besides scale effects, element types are the other key factor for multiphysical atomic interactions. For instance, carbon atoms are usually associated with hydrophobic interactions, while nitrogen and oxygen atoms are correlated with hydrophilic interactions and/or hydrogen bonds. To enable a systematic description of atomic interactions, we consider element-specific graph representations [19]. Recently, the combination of element-specific representations and machine learning models has achieved great success in drug design [19–30]. More recently, an element-specific GNN model has been developed and has achieved state-of-the-art performance in quantitative toxicity analysis and solvation prediction [10].
The essential idea of element-specific representations is to decompose a molecule into a series of atom sets, each composed of certain specific types of elements. In general, a protein molecule is composed of roughly five most important elements, denoted as $E_P$ = [C, N, O, S, H]. A DNA or RNA molecule also has five most important elements, denoted as $E_D$ = [C, N, O, P, H]. Ligands or chemical molecules tend to have more types of elements. Here, we consider only the nine most commonly used types, denoted as $E_L$ = [C, N, O, S, H, F, Cl, Br, I]. In general, an element-specific GNN model contains a series of molecular graphs that are constructed based on different element types. For instance, a protein can be represented by a series of element-specific graphs, including single-element graphs (C-graph, N-graph, O-graph, S-graph and H-graph), double-element graphs (CN-graph, CO-graph, CS-graph, CH-graph, NO-graph, NS-graph, NH-graph, OS-graph, OH-graph and SH-graph), three-element graphs and other graphs with more types of elements. Each element-specific graph characterizes a certain type of atomic interaction. Note that the all-atom graph used in previous GNN models is just a special case of an element-specific graph. Normally, we do not need to use all the combinations [25,29,30]. To balance computational cost and model accuracy, we usually consider only the element-specific graphs with a sufficient number of atoms. For instance, ligand molecules may contain Cl atoms, but usually only one or two of them, so a Cl-graph would be meaningless. However, Cl atoms can be important for ligand properties, so we can consider multiple-element graphs, such as the CNCl-graph and the COCl-graph.
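The decomposition into element-specific atom sets can be enumerated mechanically. The following sketch lists the single- and double-element graphs of a protein and selects the atoms belonging to one of them; the helper names and the toy atom list are ours and serve only as an illustration.

```python
from itertools import combinations

PROTEIN_ELEMENTS = ["C", "N", "O", "S", "H"]  # E_P from the text


def element_specific_names(elements, max_types):
    """Names of all element-specific graphs with up to `max_types` element types."""
    names = []
    for k in range(1, max_types + 1):
        for combo in combinations(elements, k):
            names.append("".join(combo) + "-graph")
    return names


def atoms_for_graph(atom_elements, combo):
    """Indices of atoms whose element belongs to the tuple `combo`."""
    return [i for i, element in enumerate(atom_elements) if element in combo]


names = element_specific_names(PROTEIN_ELEMENTS, max_types=2)
print(len(names), names)

# Hypothetical five-atom fragment: which atoms enter the CN-graph?
print(atoms_for_graph(["C", "N", "C", "O", "H"], ("C", "N")))
```

With five element types, this gives 5 single-element and 10 double-element graphs, matching the list above; in practice only the combinations with enough atoms would be kept.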
Multiphysical graph neural network
In our MP-GNN model, a series of scale-specific and element-specific graphs are generated from molecules. From each graph, a GNN architecture is constructed. To significantly reduce the number of learnable parameters, weight-sharing schemes and ensemble learning models are considered. Molecular structural topologies are of great importance for molecular functions. Various quantitative structure–activity/property relationship (QSAR/QSPR) models have been developed to relate molecular groups, motifs, conserved regions, domains and other molecular topologies to their functions [31–33]. In GNN models, weight-sharing schemes are used to characterize common molecular topologies, as similar structural topologies, defined by the same weights in GNNs, tend to induce similar functions. Moreover, weight-sharing schemes can significantly reduce the number of parameters and the network complexity. In our MP-GNN, we share weights among graphs of the same scale and element type. We also allow weights to be shared among relatively similar element-specific graphs, to reduce computational cost and to cope with relatively small training sets. Ensemble learning models use multiple base learning algorithms to boost prediction performance. Here, we consider two types of ensemble learning, i.e. single-scale (one-scale) stacking and multiscale stacking. The one-scale stacking ensemble model is used to alleviate the impact of randomness caused by initialization. The multiscale stacking boosts performance by consolidating base learners that focus on different scales and have less overlap.
MP-GNN for COVID-19 drug design
MP-GNN for protein–ligand interactions
Recently, a series of topological models have been developed for the characterization of protein–ligand interactions and have achieved great success [25,29,30]. The essential idea of these models is to define special matrices that focus on interactions between the protein and the ligand, instead of interactions within either the protein or the ligand, and to construct molecular topological models based on these matrices [25].
Mathematically, we can set the protein–ligand interaction matrix $M$ as follows:

$$M(m_i, m_j) = \begin{cases} \Phi(\|r_i - r_j\|; \eta), & \text{if } r_i \in R_P, r_j \in R_L \text{ or } r_i \in R_L, r_j \in R_P \\ \infty, & \text{otherwise.} \end{cases} \tag{7}$$

Here, $r_i$ and $r_j$ are the coordinates of the $i$-th and $j$-th atoms, and $m_i$ and $m_j$ are their indices in the matrix. The two sets $R_P$ and $R_L$ are the atom coordinate sets for the protein and the ligand, respectively. Note that only interactions between protein atoms and ligand atoms are considered, while interactions between atoms within either the protein or the ligand are ignored by setting their distances as $\infty$, i.e. an infinitely large value. Besides the generalized kernel functions in Eqs. (1) and (2), we can also define Euclidean-distance-based and electrostatics-based atomic interaction functions as

$$\Phi(\|r_i - r_j\|) = \frac{1}{\|r_i - r_j\|^{\kappa}}, \tag{8}$$

and

$$\Phi(\|r_i - r_j\|) = \frac{1}{1 + \exp\left(-c\, q_i q_j / \|r_i - r_j\|\right)}, \tag{9}$$

where $q_i$ and $q_j$ are the partial charges of the $i$-th and $j$-th atoms, and the parameter $c$ is a constant. All three types of atomic interactions are considered in our MP-GNN.
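To make Eqs. (7)–(9) concrete, the sketch below builds the protein–ligand interaction matrix for a toy complex. It is an illustration under stated assumptions: the function names, the default values κ = 2 and c = 1, and the coordinates are ours, and the electrostatic form is written with a negative exponent, which is our reading of Eq. (9).

```python
import math


def inverse_power(d, kappa=2.0):
    """Euclidean-distance-based interaction, Eq. (8): 1 / d^kappa."""
    return 1.0 / d ** kappa


def electrostatic(d, q_i, q_j, c=1.0):
    """Electrostatics-based interaction, Eq. (9): 1 / (1 + exp(-c * q_i * q_j / d))."""
    return 1.0 / (1.0 + math.exp(-c * q_i * q_j / d))


def interaction_matrix(protein, ligand, interaction):
    """Eq. (7): protein-ligand entries hold the interaction value; all others are infinity."""
    atoms = [(r, "P") for r in protein] + [(r, "L") for r in ligand]
    n = len(atoms)
    m = [[math.inf] * n for _ in range(n)]
    for i, (r_i, side_i) in enumerate(atoms):
        for j, (r_j, side_j) in enumerate(atoms):
            if side_i != side_j:  # only cross protein-ligand pairs are kept
                m[i][j] = interaction(math.dist(r_i, r_j))
    return m


# Hypothetical coordinates in angstroms: two protein atoms and one ligand atom.
protein = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
ligand = [(0.0, 2.0, 0.0)]
m = interaction_matrix(protein, ligand, inverse_power)
```

The two protein atoms are not connected to each other (their entry is infinite), while each of them is connected to the ligand atom, which is the bipartite structure used in the next paragraph.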
Further, element-specific graphs are constructed only between protein atoms and ligand atoms. As stated above, a protein molecule is usually composed of roughly five important elements, $E_P$ = [C, N, O, S, H], and ligands of nine types, $E_L$ = [C, N, O, S, H, F, Cl, Br, I]. We generate a series of element-specific bipartite graphs in our MP-GNN. Each bipartite graph is composed of two sets of same-typed atoms, with one set from the protein and the other from the ligand. Edges can be formed only between the two sets (hence the name bipartite graph) and are determined by the interaction matrix in Eq. (7). In general, when the multiscale kernel functions are used, a total of 36 = 4 × 9 types of bipartite graphs are generated without the consideration of H atoms. Moreover, four different scale (or resolution) parameters are used, i.e. $\eta$ = 2, 5, 10 and 20 Å. Figure 1 illustrates the general architecture of our MP-GNN model for protein–ligand interaction analysis. More details of the MP-GNN model can be found in the Method section.
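The count of bipartite graph types can be reproduced by enumerating element pairs and scales. The sketch below assumes that the product 4 × 9 arises from four protein elements (H dropped on the protein side) paired with the nine ligand element types listed above; this pairing is our reading of the text, and the variable names are ours.

```python
from itertools import product

PROTEIN_ELEMENTS = ["C", "N", "O", "S"]  # assumption: H is dropped on the protein side
LIGAND_ELEMENTS = ["C", "N", "O", "S", "H", "F", "Cl", "Br", "I"]  # E_L from the text
SCALES = [2.0, 5.0, 10.0, 20.0]  # eta values in angstroms, from the text

# One bipartite graph type per (protein element, ligand element) pair.
pairs = list(product(PROTEIN_ELEMENTS, LIGAND_ELEMENTS))

# One graph per pair and per scale parameter.
configs = [(p, l, eta) for (p, l), eta in product(pairs, SCALES)]
print(len(pairs), len(configs))
```

Under this reading, each complex gives 36 element-specific bipartite graphs per scale, i.e. 144 graphs over the four scales.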
Datasets. We consider three of the most commonly used benchmark datasets for protein–ligand binding affinity prediction, namely PDBbind-v2007, PDBbind-v2013 and PDBbind-v2016. All the datasets used in this paper are summarized in Table 1. For the separate experiments on PDBbind-v2007, PDBbind-v2013 and PDBbind-v2016, there are pre-train sets, training sets and test sets. Because the three datasets intersect, the pre-train set for each dataset is randomly selected from the samples in the union that do not belong to that dataset. The union of the three datasets contains 4413 complexes. For PDBbind-v2007, the pre-train set contains 1000 complexes drawn from 3114 non-intersected samples; for PDBbind-v2013, 1000 are drawn from 1455 non-intersected samples; and for PDBbind-v2016, all 357 non-intersected complexes are used for pre-training. The core set acts as the test set for evaluation. The training set is obtained by removing the core set from the refined set.
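As a quick check of the resulting training set sizes, the refined and core set sizes from Table 1 can be subtracted directly. The dictionary names below are ours; the numbers are those reported in Table 1.

```python
# Refined and core set sizes as reported in Table 1.
REFINED = {"PDBbind-v2007": 1300, "PDBbind-v2013": 2959, "PDBbind-v2016": 4057}
CORE = {"PDBbind-v2007": 195, "PDBbind-v2013": 195, "PDBbind-v2016": 285}

# Training set = refined set minus core set.
training_size = {name: REFINED[name] - CORE[name] for name in REFINED}
print(training_size)
```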
To test the performance of our model for COVID-19 drug design, we consider the SARS-CoV BA dataset, which contains 185 Mpro–ligand complexes and their experimental binding affinities. Among the 185 ligands, 44 have X-ray crystal structures and the rest are given as 2D SMILES strings. The software MathPose is used to predict the 3D structures of those 2D ligands and to generate the binding complexes of all 185 ligands with Mpro. To carry out the validation, we randomly split the SARS-CoV BA set into five non-overlapping folds. In each task, our MP-GNN is trained on part of the SARS-CoV BA dataset in conjunction with the PDBbind-v2019 set. More specifically, one fold (or division) is used as the validation set in each task, and the remaining four folds are combined with the PDBbind-v2019 general set to form the training set. No pre-training is done before training.
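The 5-fold protocol can be sketched as follows. This is a generic random split written for illustration, not the authors' partitioning code: the function name and the seed are ours, and the samples are represented only by their indices.

```python
import random


def five_fold_indices(n_samples, n_folds=5, seed=0):
    """Randomly split sample indices into non-overlapping folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[k::n_folds] for k in range(n_folds)]


folds = five_fold_indices(185)
for k, validation in enumerate(folds):
    # The other four folds form the SARS-CoV BA part of the training set; in the
    # protocol above they are further combined with the PDBbind-v2019 general set.
    training = [i for j, fold in enumerate(folds) if j != k for i in fold]
    print(k, len(validation), len(training))
```

With 185 complexes, each fold holds 37 samples, so every task validates on 37 complexes and trains on the remaining 148 plus the PDBbind-v2019 general set.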
Benchmark tests for MP-GNN
More than 40 different scoring functions or models have been extensively tested on the three PDBbind datasets. Figure 2 shows the comparison between our MP-GNN and the other models. The upper part depicts the overall performance, with our method marked in red. All results are measured by the Pearson correlation coefficient (denoted as $R_p$). Our method is ahead of all other works on all three datasets, except that it is second to TopBP on PDBbind-v2016. More specifically, the previous best $R_p$ on PDBbind-v2007 is 0.831, achieved by FPRC [37], while those on PDBbind-v2013 and PDBbind-v2016 are 0.808 and 0.861, both achieved by TopBP [25]. Our MP-GNN surpasses the previous best result on PDBbind-v2013 by 2% and is in line with the previous best result on PDBbind-v2007, with a slight advantage. On PDBbind-v2016, it is 1% lower than TopBP. In the line chart, the right part, where $R_p$ is over 0.6, is dense, so a clear ranking is displayed below it. It is worth mentioning that our method achieves a significant improvement on the hardest dataset, PDBbind-v2013, which has a more unbalanced distribution between the training and test sets (see Table 1). Figure 3 shows the performance of the two stacking schemes and the learning rate of our model on PDBbind-v2007. Detailed results for all three datasets can be found in Tables S1 to S3.
MP-GNN for COVID-19 drug design
The COVID-19 pandemic, which started in late December 2019 and is caused by the new SARS-CoV-2, had infected more than 262 million individuals and caused more than 5 million fatalities across all continents and over 213 countries and territories by 11 November 2021. Currently, different drug targets of SARS-CoV-2, such as the main protease (Mpro, also called 3CLpro), papain-like protease (PLpro), RNA-dependent RNA polymerase (RdRp) and 5’-to-3’ helicase protein (Nsp13), have been investigated. Among them, the main protease is one of the best-characterized targets for coronaviruses. It has been found that although the overall sequence identity between SARS-CoV and SARS-CoV-2 is just 80%, the Mpro of SARS-CoV-2 shares 96.08% sequence identity with that of SARS-CoV. This high degree of conservation provides an opportunity for drug repurposing, i.e. the use of SARS-CoV Mpro inhibitors as potential SARS-CoV-2 Mpro inhibitors.

Figure 1. The framework of the protein–ligand complex binding affinity prediction system for the drug design task. A complex of a SARS-CoV-2 main protease inhibitor is used as an example here. The process consists of three steps: (1) generating the scale-specific graphs for the protein–ligand complex, (2) processing a group of element-specific graphs with the multiphysical graph neural network for 22 repeat experiments, and performing one-scale stacking on the repeat experiments to give a prediction for one resolution and (3) giving a final decision by combining multi-scale predictions with multi-scale stacking. Nodes filled or outlined in red are from the ligand.

Table 1. A summary of our selected datasets. mean(B) refers to the mean atom number for binding sites, and mean(G) refers to the mean atom number for the un-cropped element-specific graph. The ratio between mean(G) and mean(B) describes the average complexity of the dataset.

Name | Size | Pre-train size | Description | mean(B) | mean(G) | mean(G)/mean(B)
-----|------|----------------|-------------|---------|---------|----------------
PDBbind v2007 [36] | 1300 | 1000 | Refined set. Core set size 195. | 583 | 151 | 0.259
PDBbind v2013 [36] | 2959 | 1000 | Refined set. Core set size 195. | 195 | 56 | 0.287
PDBbind v2016 [36] | 4057 | 358 | Refined set. Core set size 285. | 441 | 108 | 0.245
PDBbind v2019 [36] | 17 652 | - | General set. | 432 | 114 | 0.264
SARS-CoV BA [34] | 185 | - | Inhibitors of SARS-CoV/SARS-CoV-2 main protease having experimental binding affinity. | 583 | 149 | 0.256

Figure 2. Comparison with recent works on three datasets, including topology-based methods, image-based methods and traditional molecular descriptor-based methods. The performances of other models are taken from [25,34,35]. The upper part is an overall comparison. The lower part is a clear performance ranking of works with $R_p$ higher than 0.6 on the three datasets. All results are measured with $R_p$.

Figure 3. The left and middle box charts depict the range of performance for the two stacking schemes. The line chart on the right shows the decay rule for the learning rate and the $R_p$ curves for training and test on the first repeat experiment for PDBbind-v2007 with the exponential kernel, $\eta$ = 10.

Recently, a dataset of 185 inhibitors of SARS-CoV/SARS-CoV-2 Mpro with experimental binding affinities has been collected. The software MathPose has been employed to predict their 3D structures and the binding complexes between Mpro and these ligands; the resulting dataset is denoted as SARS-CoV BA. We test our MP-GNN model on this dataset. To benchmark our method against MathDL [34], a leading approach for binding affinity prediction on the SARS-CoV BA dataset, we use the same dataset partitioning scheme and cross-validation strategy [34]. The SARS-CoV BA set is divided into five partitions for 5-fold cross-validation, so the labels of each partition are used in turn for validation. It is worth mentioning that although MP-GNN and MathDL use the same dataset and the same random division scheme, the actual partitions can differ. The average $R_p$ and Kendall’s tau ($\tau$) of our MP-GNN model are 0.855 and 0.654, respectively, which are better than those of MathDL, 0.729 and 0.540.
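For readers who want to reproduce the two evaluation metrics, the sketch below computes the Pearson correlation coefficient and Kendall's tau from predicted and measured affinities. It is a plain-Python illustration with made-up numbers; the tau shown is the simple tau-a variant, which coincides with the tie-corrected tau-b when there are no ties, and the text above does not state which variant was used.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical measured and predicted binding affinities.
measured = [5.0, 6.0, 7.0, 8.0]
predicted = [5.2, 5.9, 7.5, 7.1]
print(pearson_r(measured, predicted), kendall_tau(measured, predicted))
```

In a 5-fold setting, both metrics would be computed on each validation fold and then averaged.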
Discussion
Factors that impact the improvement extent of multi-scale stacking
It is revealed in the supplementary material that multi-scale stacking improves the $R_p$ by 6–7% on the PDBbind refined datasets but by 15–24% on the SARS-CoV BA dataset, which is a huge gap. We assume that the reason is the diversity and complexity of the latter training set. For every dataset, the average atom number of the binding site, mean(B), and of the sub-graph, mean(G), are recorded in Table 1. As can be seen from Tables S1, S2 and S3, there is no obvious linear relationship between prediction complexity and mean(B). Meanwhile, the ratio between mean(G) and mean(B), given in the last column of Table 1, is directly related to the fraction of non-empty sub-graphs. Under the premise that most binding sites include C, N, O, S and H, a high ratio means that more element types appear in the ligand. In other words, the more non-empty sub-graphs there are, the more information is available and the more the network learns from training. As a result, we presume that this ratio is negatively correlated with the task complexity. Meanwhile, on the same test set, a more diversified training set helps to obtain better results. In conclusion, we believe that the first reason for such great progress on SARS-CoV BA is the information discrepancy between the training and validation sets. The training set is richer in information, so the model handles the validation set with ease and the stacking brings a larger improvement than on datasets with consistent training and test sets. The second reason is that SARS-CoV BA has a training set that is several to 10 times larger than those of the PDBbind refined datasets; it includes not only most data from previous years, but also four divisions of the SARS-CoV BA dataset that have complex structures similar to those in the validation set.
Schemes for feature fusion
Through experiments, we find that channel-wise summation for node feature fusion and concatenation for sub-graph feature fusion give the largest improvement in $R_p$ among the schemes we tested. As mentioned, symmetric operators are more suitable than concatenation for fusing a large number of nodes. Some works [38,39] in the field of 3D feature learning prefer the channel-wise maximum. Experiments show that the channel-wise maximum filters the nodes and retains the extreme values after feature embedding. Feature visualization reveals that these extremes originate mostly from inflections, depressions and contours, where the features stand out. In contrast, the features that contribute most to the binding affinity of a complex lie in the chemical bond forces of the binding-site elements rather than in the 3D profile of the protein. The binding affinity can be viewed as the superposition of all chemical bond forces in the binding site, so using the maximum operator loses most of the information. This explains the applicability of the sum operator. On the other hand, given the limited number and length of sub-graph descriptors, the concatenation operator can deliver the features completely while implicitly carrying the information of atom types. Notice that the features of an element-specific graph do not include atom types, which is why symmetric operators are unsuitable for sub-graph feature fusion.
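The difference between the fusion operators can be seen on a toy example. In the sketch below, node features are fused with a channel-wise sum or maximum, and sub-graph descriptors are fused by concatenation; the function names and numbers are ours and are chosen only to illustrate the argument.

```python
def channelwise_sum(node_features):
    """Symmetric fusion over nodes: the result does not depend on node order."""
    return [sum(channel) for channel in zip(*node_features)]


def channelwise_max(node_features):
    """Symmetric fusion that keeps only the extreme value of each channel."""
    return [max(channel) for channel in zip(*node_features)]


def concatenate(subgraph_features):
    """Order-preserving fusion: the position encodes which sub-graph a value came from."""
    return [value for features in subgraph_features for value in features]


# Three nodes with two feature channels each (toy values).
nodes = [[1.0, 4.0], [2.0, 0.5], [3.0, 1.5]]
print(channelwise_sum(nodes), channelwise_max(nodes))

# Hypothetical descriptors of two element-specific sub-graphs, e.g. CC and CN.
cc, cn = [6.0, 6.0], [1.0, 2.0]
print(concatenate([cc, cn]), concatenate([cn, cc]))
```

The sum keeps a contribution from every node, whereas the maximum discards all but one node per channel. Swapping the two sub-graphs changes the concatenated vector, which is how the fused descriptor implicitly carries the atom type information.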
Ablation analysis for element-specific graph and single aggregation
The results quoted in this section come from ablation
studies designed earlier. At the beginning, a single graph
was used to describe the whole binding site. It is a huge
bipartite structure, and we had to use a large cropping
size such as 130. The network was able to converge, but the
results were more erratic. Early experiments showed that
converting from a single graph to element-specific graphs
increased Rp from 0.69 to 0.749 without any stacking
ensemble. This shows that viewing the binding site one
scale at a time is much clearer than looking at the whole
graph directly. The subsequent ensembles improve the
prediction Rp further. We then realized that, to some
extent, these sub-graphs can be viewed as complete graphs,
that is, any two nodes in a sub-graph can obtain each
other's information through one aggregation. This means
that stacking multiple aggregation layers may lead to
redundant and overlapping information. We therefore
deconstructed the graph convolution layer in MP-GNN and
removed the aggregation after the first layer. The best
single-scale Rp increased from 0.749 to 0.767 as expected,
confirming the effectiveness of single aggregation.
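The redundancy argument can be checked numerically in an idealized setting. The sketch below assumes a truly complete graph with self-loops and ignores the weight matrices and activations, which is a simplification of the actual sub-graphs; under these assumptions the normalized adjacency is idempotent, so a second aggregation returns exactly what the first one produced.

```python
import numpy as np

def normalized_adjacency(adj):
    """Return D^-1/2 (A + I) D^-1/2 for an adjacency matrix without self-loops."""
    a_hat = adj + np.eye(adj.shape[0])             # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))  # inverse square root of degrees
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

n = 6                                    # hypothetical number of nodes
complete = np.ones((n, n)) - np.eye(n)   # complete graph, no self-loops
p = normalized_adjacency(complete)       # every entry equals 1/n

rng = np.random.default_rng(0)
h = rng.normal(size=(n, 4))              # arbitrary node features

once = p @ h       # one aggregation: every node receives the mean of all nodes
twice = p @ once   # a second aggregation adds nothing new

assert np.allclose(p, 1.0 / n)
assert np.allclose(once, twice)
```

In the full model the learned weights and nonlinearity sit between aggregations, so the equality is not exact there; the sketch only illustrates why extra aggregation steps carry little new neighborhood information on a densely connected sub-graph.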
Method
Multiphysical GNN
Graph neural network The GNN in our MP-GNN consists
of two parts, i.e. a ‘head’ part and a ‘tail’ part. The ‘head’
part converts the node vector information from each
bipartite graph to a hidden feature vector. The ‘tail’ part is
a fully connected neural network that learns the binding
affinity from the hidden feature vector.
The ‘head’ part contains a convolution layer followed
by an encoder. In the convolution layer, node features are
convolved as in the traditional GCN model,
$$H^{l+1} = \sigma\left(\hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} H^{l} W^{l}\right), \qquad (10)$$
in which $\hat{D}$ and $\hat{A}$ denote the degree and
adjacency matrices used in the symmetric normalization of
the graph, $H^{l}$ the node feature matrix of the $l$th
layer, $W^{l}$ the layer-specific weight matrix and
$\sigma(\cdot)$ the activation function. Note that the
input to the first layer is just the node features as
in Eq. (6). The encoder consists of a fully connected
layer and dropout, followed by the activation layer.
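A single graph convolution of the form in Eq. (10) can be sketched in NumPy as follows. This is a minimal illustration rather than the released implementation: the function names, the toy bipartite graph and the random weights are hypothetical, and adding self-loops before normalization follows the standard GCN convention, which is an assumption here. ELU is used as the activation, as stated in the parameter settings.

```python
import numpy as np

def elu(x, alpha=1.0):
    """Exponential linear unit, applied element-wise."""
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

def gcn_layer(adj, h, w):
    """One graph convolution: sigma(D^-1/2 (A + I) D^-1/2 H W)."""
    a_hat = adj + np.eye(adj.shape[0])             # assumed self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))  # D^-1/2 as a vector
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return elu(a_norm @ h @ w)

# Toy bipartite graph: 2 protein atoms (rows) linked to 3 ligand atoms (columns).
b = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
adj = np.block([[np.zeros((2, 2)), b],
                [b.T, np.zeros((3, 3))]])

rng = np.random.default_rng(0)
h0 = rng.normal(size=(5, 29))             # node vectors of length 29
w0 = rng.normal(scale=0.1, size=(29, 64)) # hypothetical layer weights

out = gcn_layer(adj, h0, w0)              # shape (5, 64)
```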
The ‘tail’ part also contains two parts, i.e. a feature
fusion part and encoder-based prediction part. In the fea-
ture fusion part, all node feature vectors are aggregated
into a single feature vector. The commonly used fusion
Downloaded from https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbac231/6607747 by NTU Library user on 18 June 2022
operations include concatenation, sum, average, maxi-
mum and minimum of all node features. Here, we use
summation of node features for each element-specific
graph. Then we concatenate summation vectors from all
50 element-specific graphs into a long vector. In the sec-
ond part, two encoders together with a fully connected
layer are used to predict the binding affinity from the
concatenated vector.
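The data flow through the ‘tail’ part can be sketched as a shape walk-through. The layer widths (16 features per node, hidden sizes 256 and 64) follow the parameter settings given later in the text; the random weights, the omission of dropout and of the sub-graph encoder, and all function and key names are simplifying assumptions made only for this illustration.

```python
import numpy as np

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

def tail_forward(node_hidden, params):
    """node_hidden: (n_graphs, n_nodes, d) hidden node features from the head."""
    graph_vecs = node_hidden.sum(axis=1)   # sum over nodes -> (n_graphs, d)
    site_vec = graph_vecs.reshape(-1)      # concatenate -> (n_graphs * d,)
    x = elu(site_vec @ params["w1"] + params["b1"])   # first encoder, 256 units
    x = elu(x @ params["w2"] + params["b2"])          # second encoder, 64 units
    return float(x @ params["w3"] + params["b3"])     # final regression output

rng = np.random.default_rng(0)
n_graphs, n_nodes, d = 50, 56, 16
params = {
    "w1": rng.normal(scale=0.05, size=(n_graphs * d, 256)), "b1": np.zeros(256),
    "w2": rng.normal(scale=0.05, size=(256, 64)),           "b2": np.zeros(64),
    "w3": rng.normal(scale=0.05, size=64),                  "b3": 0.0,
}

node_hidden = rng.normal(size=(n_graphs, n_nodes, d))
affinity = tail_forward(node_hidden, params)   # a single scalar prediction
```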
Ensemble learning The ensemble learning method uses
multiple learning models simultaneously to boost
performance. Each learning model, known as a base learner,
can give a prediction individually. Ensemble learning
improves on the prediction accuracy of the individual base
learners by assembling them under a certain combination
strategy. The commonly used combination approaches include
bagging, boosting and stacking [40]. In our MP-GNN, we
focus on the stacking ensemble model. The essential idea is
to assign a certain weight, which is to be learned, to each
base learner and use the weighted results as the final
prediction. More specifically, we denote the predictions of
the $n$ base learners as $Y_1, Y_2, \ldots, Y_n$, the final
prediction as $Y^{\mathrm{stacking}}$ and the ground truth
values of the training set as $Y$. The weight for each base
learner is linearly related to its prediction accuracy on
the training set. For instance, we can use $R_p(Y_i, Y)$,
the Pearson correlation coefficient between the prediction
$Y_i$ and the true value $Y$, as the measure of model
accuracy. The weight for the $i$-th base learner is then
defined as
$$W_i^{\mathrm{stacking}} = \frac{R_p(Y_i, Y)}{\sum_{j=1}^{n} R_p(Y_j, Y)},$$
and the final prediction is
$$Y^{\mathrm{stacking}} = \sum_{i=1}^{n} W_i^{\mathrm{stacking}} Y_i.$$
Other than using $R_p$ as the accuracy measure, we have
also considered RMSE in our MP-GNN models.
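The Rp-weighted stacking rule described above can be written in a few lines of NumPy. The sketch below uses synthetic predictions (a true signal plus noise of different strengths) purely as a hypothetical example; the function names are not from the released code, and the normalization assumes that every base learner has a positive Rp on the training set.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_c, b_c = a - a.mean(), b - b.mean()
    return float((a_c @ b_c) / np.sqrt((a_c @ a_c) * (b_c @ b_c)))

def stacking_weights(train_preds, y_train):
    """Weight of learner i = Rp(Y_i, Y) / sum_j Rp(Y_j, Y), on the training set."""
    r = np.array([pearson_r(p, y_train) for p in train_preds])
    return r / r.sum()

def stack(preds, weights):
    """Weighted sum of the base-learner predictions."""
    return weights @ np.asarray(preds)

# Synthetic example: three base learners with increasing noise levels.
rng = np.random.default_rng(0)
y_train = rng.normal(size=200)
train_preds = [y_train + rng.normal(scale=s, size=200) for s in (0.1, 0.5, 2.0)]

weights = stacking_weights(train_preds, y_train)
y_stacked = stack(train_preds, weights)
```

The weights sum to one by construction, and the least noisy learner receives the largest share. In practice the weights are learned on the training set and then applied to the base-learner predictions on the test set.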
MP-GNN for Covid drug design
Graph representation for protein–ligand interactions
Ligands usually bind to proteins, which tend to have a
much larger size, at a certain special region called bind-
ing site. Computationally, the binding site is chosen as
the protein region that is within a certain cutoff distance
of the ligand atoms. Here, we use 10 Å in our MP-GNN
model. The protein–ligand interaction matrix M in Eq. (7)
is defined only on the protein binding site instead of the
entire protein domain. Three types of atomic interaction
function are considered: the generalized
exponential/Lorentz kernel function, the Euclidean distance
function and the electrostatic function. Under different
interaction functions, different types of element-specific
graph models are constructed. As stated above, proteins and
ligands in general have five and nine types of atoms, that
is, EP = [C, N, O, S, H] and EL = [C, N, O, S, H, F, Cl, Br, I].
For the generalized exponential/Lorentz kernel and
Euclidean distance based atomic interaction functions, we
consider only 36 = 4 × 9 types of element-specific
bipartite graph representation and omit the influence of
hydrogen (H) atoms. For the electrostatic-based interaction
function, a total of 50 = 5 × 10 types of bipartite graphs
are constructed.
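The cutoff-based binding-site selection described above can be sketched as follows. The function name and the toy coordinates are hypothetical; only the 10 Å cutoff comes from the text, and a production pipeline would read the coordinates from structure files instead.

```python
import numpy as np

def binding_site_mask(protein_xyz, ligand_xyz, cutoff=10.0):
    """Mark protein atoms lying within `cutoff` angstroms of any ligand atom."""
    diff = protein_xyz[:, None, :] - ligand_xyz[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)   # (n_protein, n_ligand) distances
    return dist.min(axis=1) <= cutoff

# Toy coordinates in angstroms: four protein atoms along the x axis, two ligand atoms.
protein_xyz = np.array([[0.0, 0.0, 0.0],
                        [5.0, 0.0, 0.0],
                        [12.0, 0.0, 0.0],
                        [30.0, 0.0, 0.0]])
ligand_xyz = np.array([[0.0, 0.0, 0.0],
                       [1.0, 0.0, 0.0]])

mask = binding_site_mask(protein_xyz, ligand_xyz)
site_xyz = protein_xyz[mask]   # atoms kept for the element-specific graphs
```

Here the atoms at 12 Å and 30 Å are 11 Å and 29 Å from the nearest ligand atom, so only the first two protein atoms are retained.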
Note that the sizes of these bipartite graphs may vary
greatly between different protein–ligand complexes and
between different element combinations. In our MP-GNN, node
importance is defined using the rigidity index as in
Eq. (3). To share the weights of the GNN model among
different graphs, we choose the same cropping size, i.e. a
total of 56 nodes, for all bipartite graphs.
Computationally, we find that 56 is roughly the average
size of these element-specific bipartite graphs. For larger
graphs, we remove the nodes with lower node importance. For
smaller graphs, pseudo-nodes are added until the common
size of 56 is reached.
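The cropping and padding step can be sketched as below. The importance scores are passed in as a plain array because the rigidity index of Eq. (3) is not reproduced here, and representing pseudo-nodes by all-zero feature rows is an assumption of this sketch, not a detail confirmed by the text.

```python
import numpy as np

def crop_or_pad(node_feats, importance, size=56):
    """Return exactly `size` rows: keep the most important nodes or pad with zeros."""
    n = node_feats.shape[0]
    if n >= size:
        # Indices of the `size` highest-importance nodes, kept in original order.
        top = np.argsort(importance, kind="stable")[::-1][:size]
        return node_feats[np.sort(top)]
    # Assumed pseudo-node representation: all-zero feature rows.
    pad = np.zeros((size - n, node_feats.shape[1]), dtype=node_feats.dtype)
    return np.vstack([node_feats, pad])

# Large graph: 70 nodes whose feature rows equal their index, importance = index.
big = np.arange(70, dtype=float)[:, None] * np.ones((1, 29))
cropped = crop_or_pad(big, importance=np.arange(70, dtype=float))

# Small graph: 10 nodes padded up to 56.
small = np.ones((10, 29))
padded = crop_or_pad(small, importance=np.ones(10))
```

With importance equal to the node index, the 14 least important nodes (indices 0 to 13) are dropped from the large graph, while the small graph gains 46 zero rows.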
In our MP-GNN model, node features are only related
to atomic distances. For the generalized kernel based
functions as in Eqs. (1) and (2), we set κ to be 2 and
consider four different scale parameters, that is, η = 2, 5,
10 and 20 Å. For Euclidean distance based function as
in Eq. (8), we set κ=−1 and let simply equals to the
atomic distance. The function domain of Eqs. (1), (2)
and (8) is set to be [2 Å, 30 Å] with intervals of
length 1 Å, and the node vector as in Eq. (6) is of size 29.
For the electrostatic-based function as in Eq. (9), we set
the domain to [0, 1] with intervals of length 0.04, and the
node vector is of size 25.
MP-GNN parameter settings The encoders in the MP-GNN
head have output sizes of 64 and 16 for every node.
After node feature fusion, the preliminary sub-graph
features go through an encoder with output feature length
16. Then the sub-graph feature matrix with shape (M*N,
16) is concatenated into one feature vector describing
the binding site, which passes through hidden layers
with 256 and 64 neurons for the final regression. Every
MP-GNN sub-learner is trained for 6400 epochs to obtain the
optimal model, with a dropout rate of 0.5 and ELUs as
the activation units. The learning rate starts from 0.1 and
decays every 800 epochs with a decay rate of 0.5. The
decay scheme is depicted in Figure 3.
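The learning-rate schedule described above can be written as a one-line function. The sketch assumes a step-wise (staircase) decay in which the rate is multiplied by 0.5 after every 800 epochs; the function name is hypothetical, and the exact shape of the curve in Figure 3 is not reproduced here.

```python
def learning_rate(epoch, base_lr=0.1, decay=0.5, step=800):
    """Assumed staircase schedule: multiply by `decay` after every `step` epochs."""
    return base_lr * decay ** (epoch // step)

# Rate at the start of each of the 8 stages in a 6400-epoch run.
schedule = [learning_rate(e) for e in range(0, 6400, 800)]
```

Under this assumption the rate falls from 0.1 in the first stage to 0.1 × 0.5^7 in the last stage.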
Performance of MP-GNN on PDBbind datasets For the
PDBbind datasets, a total of 10 different scale-specific
GNN models are considered based on 10 atomic interaction
functions, including four different exponential kernel
functions, four different Lorentz kernels, a Euclidean
distance based function and an electrostatic-based
function. A total of 10 GNN base learners can be obtained.
The stacking models are chosen based on Rp on the training
set. Due to the high computation cost, we conducted 22
repeated experiments with random initialization. The
detailed results can be found in Tables S1 to S3. The best
results in every sector are marked in bold. Note that
stacking with Rp performs better than stacking with
RMSE.
We carry out the ablation analysis for our model based
on Tables S1 to S3. First, we focus on the effectiveness
of one-scale stacking. Looking at each scale (row)
separately, one-scale stacking significantly improves the
results on PDBbind-v2007, PDBbind-v2013 and PDBbind-v2016,
by around 6%, 8% and 8%, respectively. If we apply
the multi-scale stacking without the one-scale stacking,
both the average and the best Rp are improved by 5–10% over
the single-scale results. In contrast, multi-scale stacking
after one-scale stacking boosts Rp by only 0–3%.
This indicates that the one-scale stacking can avoid the
information blind spots caused by initialization. Second,
we study the effectiveness of multi-scale stacking. Before
one-scale stacking, the multi-scale stacking average Rp
over the 22 random trials is at least 6%, 7% and 6% higher
than the single-scale average Rp values on the three
datasets, respectively. Evidently, different scales carry
complementary information. Applied on top of one-scale
stacking, multi-scale stacking still increases Rp by almost
3%. In comparison, we note that although the one-scale
stacking reduces randomness and collects as much
information as possible within a single scale, it cannot
surpass the best Rp of the randomly initialized
multi-scale stacking.
Performance of MP-GNN on SARS-CoV BA dataset
In our MP-GNN model for the SARS-CoV BA dataset, only
multiscale stacking is employed. This is because the
training set incorporates the PDBbind-v2019 general set,
which has 17652 PDB entries. As for the PDBbind datasets,
the same 10 scale-specific GNN models are considered in our
MP-GNN. Since stacking with Rp gives better accuracy, we
again use the Rp results on the training set as the
weighting scheme. From Table S4, it can be seen that the
multiscale stacking improves Rp by 15–24% for the
SARS-CoV BA dataset.
Key Points
Our main contributions in this paper are as follows:
• We propose the first multiphysical molecular graph
representation. All kinds of molecular interactions,
between different atom types and at different scales, are
systematically represented by a series of scale-specific
and element-specific graphs with distance-related node
features.
• We develop the first multiphysical graph neural network
(MP-GNN) model. Our MP-GNN is free from the complicated
feature generation process. A unique GNN architecture is
developed in our MP-GNN to incorporate both scale-specific
and element-specific graph information into a consolidated
model.
• Our model has achieved state-of-the-art results for
protein–ligand binding affinity prediction. It has been
found that our model can outperform all existing models,
as far as we know.
• Our model is highly accurate for the prediction of
complexes of inhibitors for SARS-CoV/SARS-CoV-2. Our
model has great potential for COVID-19 drug design.
Code and Data Availability
The code is available at
https://github.com/Alibaba-DAMO-DrugAI/MGNN. Additional
data or code will be available upon reasonable request.
Author contributions statement
K.X. conceived the MP-GNN model. K.X., X-S.L. and Y.C.
conceived the graph neural network and ensemble learning
architecture. X.L. prepared the input data. X-S.L. and Y.C.
wrote all the algorithm code and carried out trial runs.
X-S.L. conducted the large-scale experiments and analyzed
the results. X-S.L. refined the network architecture.
K.X. validated the results according to experience. K.X.
and X-S.L. wrote the paper; all other authors reviewed
the manuscript.
Supplementary data
Supplementary data are available online at
https://academic.oup.com/bib.
Acknowledgments
This work was supported by Alibaba Innovative Research
(AIR) Program and Alibaba-NTU Singapore Joint Research
Institute grant AN-GC-2020-002, Singapore Ministry of
Education Academic Research fund Tier 1 RG109/19, and
Tier 2 MOE-T2EP20220-0010, MOE-T2EP20120-0013.
References
1. Zhang L, Tan J, Han D, et al. From machine learning to deep
learning: progress in machine intelligence for rational drug
discovery. Drug Discov Today 2017;22(11):1680–5.
2. Lusci A, Pollastri G, Baldi P. Deep architectures and deep learning
in chemoinformatics: the prediction of aqueous solubility for
drug-like molecules. J Chem Inf Model 2013;53(7):1563–75.
3. Pereira JC, Caffarena ER, Nogueira C, et al. Boosting docking-
based virtual screening with deep learning. J Chem Inf Model
2016;56(12):2495–506.
4. Kearnes S, McCloskey K, Berndl M, et al. Molecular graph con-
volutions: moving beyond fingerprints. J Comput Aided Mol Des
2016;30(8):595–608.
5. Gomes J, Ramsundar B, Feinberg EN, et al. Atomic convolutional
networks for predicting protein-ligand binding affinity.
arXiv preprint arXiv:1703.10603. 2017.
6. Feinberg EN, Sur D, Wu ZQ, et al. Potentialnet for molecular
property prediction. ACS central science 2018;4(11):1520–30.
7. Tsubaki M, Tomii K, Sese J. Compound–protein interaction pre-
diction with end-to-end learning of neural networks for graphs
and sequences. Bioinformatics 2019;35(2):309–18.
8. Li X, Yan X, Qiong G, et al. Deepchemstable: chemical stability
prediction with an attention-based graph convolution network.
J Chem Inf Model 2019;59(3):1044–9.
9. Wang X, Li Z, Jiang M, et al. Molecule property prediction based
on spatial graph embedding. J Chem Inf Model 2019;59(9):3817–28.
10. Szocinski T, Nguyen DD, Wei G-W. AweGNN: Auto-parametrized
weighted element-specific graph neural networks for molecules.
Comput Biol Med 2021;134:104460.
11. Li S, Zhou J, Tong X, et al. (eds). Structure-aware interactive graph
neural networks for the prediction of protein-ligand binding
affinity. In: Proceedings of the 27th ACM SIGKDD Conference on
Knowledge Discovery & Data Mining, 2021, 975–85.
12. Nguyen T, Le H, Quinn TP, et al. GraphDTA: Predicting drug–
target binding affinity with graph neural networks. Bioinformatics
2021;37(8):1140–7.
13. Lin X. DeepGS: Deep representation learning of graphs and
sequences for drug-target binding affinity prediction. arXiv
preprint arXiv:2003.13902. 2020.
14. Jiang M, Li Z, Zhang S, et al. Drug-target affinity predic-
tion using graph neural network and contact maps. RSC Adv
2020;10(35):20701–12.
15. Wang X, Liu Y, Fan L, et al. Dipeptide frequency of word frequency
and graph convolutional networks for DTA prediction. Front
Bioeng Biotechnol 2020;8:267.
16. Stokes JM, Yang K, Swanson K, et al. A deep learning
approach to antibiotic discovery. Cell
2020;180(4):688–702.
17. Liu X, Luo Y, Li P, et al. Deep geometric representations for mod-
eling effects of mutations on protein-protein binding affinity.
PLoS Comput Biol 2021;17(8):e1009284.
18. Gaudelet T, Day B, Jamasb AR, et al. Utilising graph machine
learning within drug discovery and development. Brief Bioinform
2021;bbab159.
19. Wei GW. Persistent homology analysis of biomolecular data. J
Comput Phys 2017;305:276–99.
20. Wei GW. Mathematics at the eve of a historic transition in
biology. Computational and Mathematical Biophysics 2017;5(1).
21. Cang ZX, Wei GW. TopologyNet: Topology based deep convolu-
tional and multi-task neural networks for biomolecular prop-
erty predictions. PLoS Comput Biol 2017;13(7):e1005690.
22. Cang ZX, Wei GW. Integration of element specific persistent
homology and machine learning for protein-ligand binding
affinity prediction. International journal for numerical methods
in biomedical engineering 2017. doi:10.1002/cnm.2914.
23. Nguyen DD, Xiao T, Wang ML, et al. Rigidity strengthening:
A mechanism for protein–ligand binding. J Chem Inf Model
2017;57(7):1715–21.
24. Cang ZX, Wei GW. Integration of element specific persistent
homology and machine learning for protein-ligand binding
affinity prediction. International journal for numerical methods in
biomedical engineering 2018;34(2):e2914.
25. Cang ZX, Mu L, Wei GW. Representability of algebraic topology
for biomolecules in machine learning based scoring and virtual
screening. PLoS Comput Biol 2018;14(1):e1005929.
26. Nguyen DD, Cang ZX, Wu KD, et al. Mathematical deep learning
for pose and binding affinity prediction and ranking in D3R
Grand Challenges. J Comput Aided Mol Des 2019;33(1):71–82.
27. Nguyen DD, Wei GW. AGL-Score: Algebraic graph learning score
for protein-ligand binding scoring, ranking, docking, and screen-
ing. J Chem Inf Model 2019;59(7):3291–304.
28. Nguyen DD, Gao KF, Wang ML, et al. MathDL: Mathematical deep
learning for D3R Grand Challenge 4. J Comput Aided Mol Des
2019;1–17.
29. Nguyen DD, Cang ZX, Wei GW. A review of mathematical repre-
sentations of biomolecular data. Phys Chem Chem Phys 2020.
30. Puzyn T, Leszczynski J, Cronin MT. Recent advances in QSAR stud-
ies: methods and applications, Vol. 8. Springer Science & Business
Media, 2010.
31. Lo YC, Rensi SE, Torng W, et al. Machine learning in chemoinfor-
matics and drug discovery. Drug Discov Today 2018;23(8):1538–46.
32. Bajorath J. Chemoinformatics: concepts, methods, and tools for drug
discovery, Vol. 275. Springer Science & Business Media, 2004.
33. Nguyen DD, Gao K, Chen J, et al. Unveiling the molecular mech-
anism of SARS-CoV-2 main protease inhibition from 137 crystal
structures using algebraic topology and deep learning. Chem Sci
2020;11(44):12036–46.
34. Nguyen DD, Wei GW. DG-GL: Differential geometry-based geo-
metric learning of molecular datasets. International journal for
numerical methods in biomedical engineering 2019;35(3):e3179.
35. Liu ZH, Li Y, Han L, et al. PDB-wide collection of binding
data: current status of the PDBbind database. Bioinformatics
2015;31(3):405–12.
36. Wee JJ, Xia K. Forman persistent ricci curvature (FPRC) based
machine learning models for protein-ligand binding affinity
prediction. Briefings in Bioinformatics, in press 2021.
37. Qi CR, Hao S, Mo K, et al. Pointnet: Deep learning on point sets
for 3d classification and segmentation. In: Proceedings of the IEEE
conference on computer vision and pattern recognition, 2017, 652–60.
38. Qi CR, Yi L, Su H, et al. Pointnet++: Deep hierarchical
feature learning on point sets in a metric space. arXiv preprint
arXiv:1706.02413. 2017.
39. Sagi O, Rokach L. Ensemble learning: A survey. Wiley
Interdisciplinary Reviews: Data Mining and Knowledge Discovery
2018;8(4):e1249.
Downloaded from https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbac231/6607747 by NTU Library user on 18 June 2022
View publication statsView publication stats

Supplementary resource (1)

... By analyzing large-scale patient data sets, GNN can identify patterns and relationships that can be missed by traditional statistical methods and provide a more comprehensive understanding of the underlying biology of the disease. GNN has also been used for Covid-19 drug discovery, enabling the VOLUME XX, 2023 identification of potential drug targets and the development of new treatments [86]. By analyzing molecular and clinical data, GNN can identify promising drug candidates, predict their effectiveness, and accelerate the development of new treatments, ultimately improving outcomes for patients affected by the Covid-19 pandemic [64][133] [138]. ...
... Several studies in the domain of GNN architectures have embraced open-source practices by providing their code for public access. Notable examples include studies such as [53], [57], [69], [73], [82], [83], [85][86][87][88], [92], [94][95], [99][100][101][102], [103], [104], [113], [116], [122], [125], and [134]. The decision to share code openly offers significant advantages in promoting transparency, collaboration, and reproducibility in scientific research. ...
Article
Full-text available
Graph neural network (GNN) is a formidable deep learning framework that enables the analysis and modeling of intricate relationships present in data structured as graphs. In recent years, a burgeoning interest has arisen in exploiting the latent capabilities of GNN for healthcare-based applications, capitalizing on their aptitude for modeling complex relationships and unearthing profound insights from graph-structured data. However, to the best of our knowledge, no study has systemically reviewed the GNN studies conducted in the healthcare domain. This study has furnished an all-encompassing and erudite overview of the prevailing cutting-edge research on GNN in healthcare. Through analysis and assimilation of studies, current research trends, recurrent challenges, and promising future opportunities in GNN for healthcare applications have been identified. China emerged as the leading country to conduct GNN-based studies in the healthcare domain, followed by the USA, UK, and Turkey. Among various aspects of healthcare, disease prediction and drug discovery emerge as the most prominent areas of focus for GNN application, indicating the potential of GNN for advancing diagnostic and therapeutic approaches. This study proposed research questions regarding diverse aspects of GNN in the healthcare domain and addressed them through an in-depth analysis. This study can provide practitioners and researchers with profound insights into the current landscape of GNN applications in healthcare and can guide healthcare institutes, researchers, and governments by demonstrating the ways in which GNN can contribute to the development of effective and efficient healthcare systems.
... AttentiveFP [9] used a graph attention network to aggregate and update node information. The MP-GNN [27] merged specific-scale graph neural network (GNN) and element-specific GNN, capturing various atomic interactions of multiphysical representations at different scales. MGCN [28] designed a graph convolution network to capture multilevel quantum interactions from the conformation and spatial information of molecule. ...
Article
Full-text available
Molecular property prediction, crucial for early drug candidate screening and optimization, has seen advancements with deep learning-based methods. While deep learning-based methods have advanced considerably, they often fall short in fully leveraging 3D spatial information. Specifically, current molecular encoding techniques tend to inadequately extract spatial information, leading to ambiguous representations where a single one might represent multiple distinct molecules. Moreover, existing molecular modeling methods focus predominantly on the most stable 3D conformations, neglecting other viable conformations present in reality. To address these issues, we propose 3D-Mol, a novel approach designed for more accurate spatial structure representation. It deconstructs molecules into three hierarchical graphs to better extract geometric information. Additionally, 3D-Mol leverages contrastive learning for pretraining on 20 million unlabeled data, treating their conformations with identical topological structures as weighted positive pairs and contrasting ones as negatives, based on the similarity of their 3D conformation descriptors and fingerprints. We compare 3D-Mol with various state-of-the-art baselines on 7 benchmarks and demonstrate our outstanding performance.
... Graph neural networks (GNNs) [35] and transformers [36,37] are commonly used backbone networks in the field of molecular research. We opted for the Transformer architecture due to GNN's limitations in capturing long-range atom interactions [38][39][40], while Transformers facilitate unconstrained interactions among all graph nodes, irrespective of local structures [41][42][43]. Inspired by the Unimol method [44], we design the TransLNP model shown in Figure 2. Firstly, the molecular formula of ionizable lipids is represented as an atomic sequence, atomic distances, atomic coordinates and atomic edge types. ...
Article
Full-text available
Despite the widespread use of ionizable lipid nanoparticles (LNPs) in clinical applications for messenger RNA (mRNA) delivery, the mRNA drug delivery system faces an efficient challenge in the screening of LNPs. Traditional screening methods often require a substantial amount of experimental time and incur high research and development costs. To accelerate the early development stage of LNPs, we propose TransLNP, a transformer-based transfection prediction model designed to aid in the selection of LNPs for mRNA drug delivery systems. TransLNP uses two types of molecular information to perceive the relationship between structure and transfection efficiency: coarse-grained atomic sequence information and fine-grained atomic spatial relationship information. Due to the scarcity of existing LNPs experimental data, we find that pretraining the molecular model is crucial for better understanding the task of predicting LNPs properties, which is achieved through reconstructing atomic 3D coordinates and masking atom predictions. In addition, the issue of data imbalance is particularly prominent in the real-world exploration of LNPs. We introduce the BalMol block to solve this problem by smoothing the distribution of labels and molecular features. Our approach outperforms state-of-the-art works in transfection property prediction under both random and scaffold data splitting. Additionally, we establish a relationship between molecular structural similarity and transfection differences, selecting 4267 pairs of molecular transfection cliffs, which are pairs of molecules that exhibit high structural similarity but significant differences in transfection efficiency, thereby revealing the primary source of prediction errors. The code, model and data are made publicly available at https://github.com/wklix/TransLNP.
Article
Structure‐based drug design is a widely applied approach in the discovery of new lead compounds for known therapeutic targets. In most structure‐based drug design applications, the docking procedure is considered the crucial step. Here, a potential ligand is fitted into the binding site, and a scoring function assesses its binding capability. With the rise of modern machine‐learning in drug discovery, novel scoring functions using machine‐learning techniques achieved significant performance gains in virtual screening and ligand optimization tasks on retrospective data. However, real‐world applications of these methods are still limited. Missing success stories in prospective applications are one reason for this. Additionally, the fast‐evolving nature of the field makes it challenging to assess the advantages of each individual method. This review will highlight recent strides toward improved real world applicability of machine‐learning based scoring, enabling a better understanding of the potential benefits and pitfalls of these functions on a project. Furthermore, a systematic way of classifying machine‐learning based scoring that facilitates comparisons will be presented. This article is categorized under: Data Science > Chemoinformatics Data Science > Artificial Intelligence/Machine Learning Software > Molecular Modeling
Article
Full-text available
The journey of drug discovery (DD) has evolved from ancient practices to modern technology-driven approaches, with Artificial Intelligence (AI) emerging as a pivotal force in streamlining and accelerating the process. Despite the vital importance of DD, it faces challenges such as high costs and lengthy timelines. This review examines the historical progression and current market of DD alongside the development and integration of AI technologies. We analyse the challenges encountered in applying AI to DD, focusing on drug design and protein–protein interactions. The discussion is enriched by presenting models that put forward the application of AI in DD. Three case studies are highlighted to demonstrate the successful application of AI in DD, including the discovery of a novel class of antibiotics and a small-molecule inhibitor that has progressed to phase II clinical trials. These cases underscore the potential of AI to identify new drug candidates and optimise the development process. The convergence of DD and AI embodies a transformative shift in the field, offering a path to overcome traditional obstacles. By leveraging AI, the future of DD promises enhanced efficiency and novel breakthroughs, heralding a new era of medical innovation even though there is still a long way to go.
Article
The rapid acceleration of global warming has led to an increased burden of high temperature-related diseases (HTDs), highlighting the need for advanced evidence-based management strategies. We have developed a conceptual framework aimed at alleviating the global burden of HTDs, grounded in the One Health concept. This framework refines the impact pathway and establishes systematic data-driven models to inform the adoption of evidence-based decision-making, tailored to distinct contexts. We collected extensive national-level data from authoritative public databases for the years 2010–2019. The burdens of five categories of disease causes – cardiovascular diseases, infectious respiratory diseases, injuries, metabolic diseases, and non-infectious respiratory diseases – were designated as intermediate outcome variables. The cumulative burden of these five categories, referred to as the total HTD burden, was the final outcome variable. We evaluated the predictive performance of eight models and subsequently introduced twelve intervention measures, allowing us to explore optimal decision-making strategies and assess their corresponding contributions. Our model selection results demonstrated the superior performance of the Graph Neural Network (GNN) model across various metrics. Utilizing simulations driven by the GNN model, we identified a set of optimal intervention strategies for reducing disease burden, specifically tailored to the seven major regions: East Asia and Pacific, Europe and Central Asia, Latin America and the Caribbean, Middle East and North Africa, North America, South Asia, and Sub-Saharan Africa. Sectoral mitigation and adaptation measures, acting upon our categories of Infrastructure & Community, Ecosystem Resilience, and Health System Capacity, exhibited particularly strong performance for various regions and diseases. 
Seven out of twelve interventions were included in the optimal intervention package for each region, including raising low-carbon energy use, increasing energy intensity, improving livestock feed, expanding basic health care delivery coverage, enhancing health financing, addressing air pollution, and improving road infrastructure. The outcome of this study is a global decision-making tool, offering a systematic methodology for policymakers to develop targeted intervention strategies to address the increasingly severe challenge of HTDs in the context of global warming.
Conference Paper
Over the past ten years, graph representation learning has garnered a lot of attention due to the variety of graph-structured data and its efficiency in both time and space. One essential method for obtaining effective graph representations is graph pooling. Numerous studies on the graph pooling technique have been conducted. Cutting-edge results on a range of graph representation learning tasks were made possible by the combination of graph neural networks and self-attention mechanisms. Nevertheless, the attention mechanism has limitations since it ignores nodes that have no direct connection via an edge but provide valuable network context information. This paper proposes a graph pooling approach based on Personalized PageRank and self-attention, which improves the model to take into account both node properties and graph structure. The experimental findings indicate that, with a suitable number of parameters, the MAGPool approach delivers greater accuracy on the benchmark datasets.
Article
Modeling the impact of amino acid mutations on protein-protein interaction plays a crucial role in protein engineering and drug design. In this study, we develop GeoPPI, a novel structure-based deep-learning framework to predict the change of binding affinity upon mutations. Based on the three-dimensional structure of a protein, GeoPPI first learns a geometric representation that encodes topology features of the protein structure via a self-supervised learning scheme. These representations are then used as features for training gradient-boosting trees to predict the changes of protein-protein binding affinity upon mutations. We find that GeoPPI is able to learn meaningful features that characterize interactions between atoms in protein structures. In addition, through extensive experiments, we show that GeoPPI achieves new state-of-the-art performance in predicting the binding affinity changes upon both single- and multi-point mutations on six benchmark datasets. Moreover, we show that GeoPPI can accurately estimate the difference of binding affinities between a few recently identified SARS-CoV-2 antibodies and the receptor-binding domain (RBD) of the S protein. These results demonstrate the potential of GeoPPI as a powerful and useful computational tool in protein design and engineering. Our code and datasets are available at: https://github.com/Liuxg16/GeoPPI.
Article
Graph machine learning (GML) is receiving growing interest within the pharmaceutical and biotechnology industries for its ability to model biomolecular structures and the functional relationships between them, and to integrate multi-omic datasets, amongst other data types. Herein, we present a multidisciplinary academic-industrial review of the topic within the context of drug discovery and development. After introducing key terms and modelling approaches, we move chronologically through the drug development pipeline to identify and summarize work incorporating: target identification, design of small molecules and biologics, and drug repurposing. Whilst the field is still emerging, key milestones, including repurposed drugs entering in vivo studies, suggest GML will become a modelling framework of choice within biomedical machine learning.
Article
Currently, there are neither effective antiviral drugs nor vaccines for coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Due to its high degree of conservation and low similarity with human genes, SARS-CoV-2 main protease (Mpro) is one of the most favorable drug targets. However, the current understanding of the molecular mechanism of Mpro inhibition is limited by the lack of reliable binding affinity ranking and prediction of existing structures of Mpro–inhibitor complexes. This work integrates mathematics (i.e., algebraic topology) and deep learning (MathDL) to provide a reliable ranking of the binding affinities of 137 SARS-CoV-2 Mpro inhibitor structures. We reveal that the Gly143 residue in Mpro is the most attractive site to form hydrogen bonds, followed by Glu166, Cys145, and His163. We also identify 71 targeted covalent bonding inhibitors. MathDL was validated on the PDBbind v2016 core set benchmark and a carefully curated SARS-CoV-2 inhibitor dataset to ensure the reliability of the present binding affinity prediction. The present binding affinity ranking, interaction analysis, and fragment decomposition offer a foundation for future drug discovery efforts.
Article
Computer-aided drug design, which uses high-performance computers to simulate the tasks in drug design, is a promising research area. Drug–target affinity (DTA) prediction is the most important step of computer-aided drug design and could speed up drug development and reduce resource consumption. With the development of deep learning, applying it to DTA prediction and improving prediction accuracy have become a focus of research. In this paper, utilizing the structural information of molecules and proteins, two graphs are built, one for drug molecules and one for proteins. Graph neural networks are introduced to obtain their representations, and a method called DGraphDTA is proposed for DTA prediction. Specifically, the protein graph is constructed from the contact map output by a prediction method that infers the structural characteristics of the protein from its sequence. Tests with various metrics on benchmark datasets show that the proposed method has strong robustness and generalizability.
Article
Deep learning is an effective method to capture drug-target binding affinity, but low accuracy is still an obstacle to be overcome. Thus, we propose a novel predictor for drug-target binding affinity based on dipeptide frequency with word frequency encoding and a hybrid graph convolutional network. Word frequency characteristics of natural language are used to improve the frequency characteristics of peptides to express target proteins. For each drug molecule, the five different features of drug atoms and the atomic bond relationships are expressed as graphs. The obtained protein features and graph structure are used as the input of a convolutional neural network and of a graph convolutional neural network, respectively. A prediction model is established to predict the drug affinity by calculating the hidden relationship. In the KIBA data set test experiment, the consistency coefficient of the model is 0.901, which is 0.01 higher than the existing model, and the MSE (mean square error) of the model is 0.126, which is 5% lower than the existing model. In the Davis data set test experiment, the consistency coefficient of the model is 0.895, which is 0.006 higher than the existing model, and the MSE of the model is 0.220, which is 4% lower than the existing model. These results show that our proposed method can not only predict the affinity better than those existing models, but also outperform unitary deep learning approaches.
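The two metrics quoted in the abstract above can be sketched as follows. This assumes that the "consistency coefficient" refers to the concordance index commonly reported on the KIBA and Davis benchmarks; the abstract itself does not define it. The function names and the toy values are hypothetical, and no result from the paper is reproduced here.

```python
def mse(y_true, y_pred):
    """Mean squared error between measured and predicted affinities."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs whose predicted order matches the true order.

    Pairs with equal true affinity are skipped; ties in the predictions
    count as half-correct. This is an O(n^2) illustrative version.
    """
    concordant = 0.0
    comparable = 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            diff_true = y_true[i] - y_true[j]
            if diff_true == 0:
                continue  # not a comparable pair
            comparable += 1
            diff_pred = y_pred[i] - y_pred[j]
            if diff_pred == 0:
                concordant += 0.5
            elif (diff_true > 0) == (diff_pred > 0):
                concordant += 1.0
    if comparable == 0:
        raise ValueError("no comparable pairs: all true values are equal")
    return concordant / comparable


# Toy affinities (made-up numbers, for illustration only).
measured = [5.0, 6.2, 7.1, 8.4]
predicted = [5.3, 6.0, 7.5, 8.1]
ci = concordance_index(measured, predicted)
err = mse(measured, predicted)
```

A concordance index of 1.0 means every pair is ranked correctly and 0.5 corresponds to random ranking, so it measures ordering quality while MSE measures absolute error.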
Article
While automated feature extraction has had tremendous success in many deep learning algorithms for image analysis and natural language processing, it does not work well for data involving complex internal structures, such as molecules. Data representations via advanced mathematics, including algebraic topology, differential geometry, and graph theory, have demonstrated superiority in a variety of biomolecular applications; however, their performance is often dependent on manual parametrization. This work introduces the auto-parametrized weighted element-specific graph neural network, dubbed AweGNN, to overcome the obstacle of this tedious parametrization process while also being a suitable technique for automated feature extraction on these internally complex biomolecular data sets. The AweGNN is a neural network model based on geometric-graph features of element-pair interactions, with its graph parameters being updated throughout the training, which results in what we call a network-enabled automatic representation (NEAR). To enhance the predictions with small data sets, we construct multi-task (MT) AweGNN models in addition to single-task (ST) AweGNN models. The proposed methods are applied to various benchmark data sets, including four data sets for quantitative toxicity analysis and another data set for solvation prediction. Extensive numerical tests show that AweGNN models can achieve state-of-the-art performance in molecular property predictions.
Article
Artificial intelligence (AI) techniques have gradually been applied to the entire drug design process, from target discovery, lead discovery, lead optimization and preclinical development to the final three phases of clinical trials. Currently, one of the central challenges for AI-based drug design is molecular featurization, which is to identify or design appropriate molecular descriptors or fingerprints. Efficient and transferable molecular descriptors are key to the success of all AI-based drug design models. Here we propose, for the first time, Forman persistent Ricci curvature (FPRC)-based molecular featurization and feature engineering. Molecular structures and interactions are modeled as simplicial complexes, which are generalizations of graphs to their higher dimensional counterparts. Further, a multiscale representation is achieved through a filtration process, during which a series of nested simplicial complexes at different scales are generated. Forman Ricci curvatures (FRCs) are calculated on the series of simplicial complexes, and the persistence and variation of FRCs during the filtration process are defined as FPRC. Moreover, persistent attributes, which are FPRC-based functions and properties, are employed as molecular descriptors and combined with machine learning models, in particular, gradient boosting tree (GBT). Our FPRC-GBT models are extensively trained and tested on the three most commonly used datasets, namely PDBbind-2007, PDBbind-2013 and PDBbind-2016. It has been found that our results are better than the ones from machine learning models with traditional molecular descriptors.
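The filtration idea in the abstract above can be illustrated with a simplified sketch. It is restricted to the graph (1-skeleton) case, where the Forman Ricci curvature of an unweighted edge (u, v) reduces to 4 - deg(u) - deg(v); the paper works with full simplicial complexes, which this sketch does not model. The function names, the cutoff values, and the three-point toy structure are assumptions made for illustration.

```python
from itertools import combinations
from math import dist


def forman_curvatures(points, cutoff):
    """Edge curvatures of the distance graph built at one filtration value.

    Two points are connected when their Euclidean distance is <= cutoff.
    For an unweighted graph without higher-dimensional faces, the Forman
    Ricci curvature of edge (u, v) is 4 - deg(u) - deg(v).
    """
    edges = [
        (i, j)
        for i, j in combinations(range(len(points)), 2)
        if dist(points[i], points[j]) <= cutoff
    ]
    degree = [0] * len(points)
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    return {(i, j): 4 - degree[i] - degree[j] for i, j in edges}


def curvature_profile(points, cutoffs):
    """Track (cutoff, edge count, mean curvature) across the filtration."""
    profile = []
    for r in cutoffs:
        curv = forman_curvatures(points, r)
        mean = sum(curv.values()) / len(curv) if curv else 0.0
        profile.append((r, len(curv), mean))
    return profile


# Three collinear points spaced one unit apart (toy coordinates).
atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
profile = curvature_profile(atoms, cutoffs=[0.5, 1.0, 2.0])
```

How the curvature statistics change as the cutoff grows is the kind of scale-dependent signal that a persistent descriptor summarizes; a descriptor for machine learning would typically collect several such statistics per filtration value.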
Article
The development of new drugs is costly, time consuming, and often accompanied by safety issues. Drug repurposing can avoid the expensive and lengthy process of drug development by finding new uses for already approved drugs. In order to repurpose drugs effectively, it is useful to know which proteins are targeted by which drugs. Computational models that estimate the interaction strength of new drug–target pairs have the potential to expedite drug repurposing. Several models have been proposed for this task. However, these models represent the drugs as strings, which is not a natural way to represent molecules. We propose a new model called GraphDTA that represents drugs as graphs and uses graph neural networks to predict drug–target affinity. We show that graph neural networks not only predict drug–target affinity better than non-deep learning models, but also outperform competing deep learning methods. Our results confirm that deep learning models are appropriate for drug–target binding affinity prediction, and that representing drugs as graphs can lead to further improvements. Availability of data and materials: The proposed models are implemented in Python. Related data, pre-trained models, and source code are publicly available at https://github.com/thinng/GraphDTA. All scripts and data needed to reproduce the post-hoc statistical analysis are available from https://doi.org/10.5281/zenodo.3603523.
Article
Due to the rapid emergence of antibiotic-resistant bacteria, there is a growing need to discover new antibiotics. To address this challenge, we trained a deep neural network capable of predicting molecules with antibacterial activity. We performed predictions on multiple chemical libraries and discovered a molecule from the Drug Repurposing Hub—halicin—that is structurally divergent from conventional antibiotics and displays bactericidal activity against a wide phylogenetic spectrum of pathogens including Mycobacterium tuberculosis and carbapenem-resistant Enterobacteriaceae. Halicin also effectively treated Clostridioides difficile and pan-resistant Acinetobacter baumannii infections in murine models. Additionally, from a discrete set of 23 empirically tested predictions from >107 million molecules curated from the ZINC15 database, our model identified eight antibacterial compounds that are structurally distant from known antibiotics. This work highlights the utility of deep learning approaches to expand our antibiotic arsenal through the discovery of structurally distinct antibacterial molecules. A trained deep neural network predicts antibiotic activity in molecules that are structurally different from known antibiotics, among which Halicin exhibits efficacy against broad-spectrum bacterial infections in mice.