
Multiphysical graph neural network (MP-GNN) for COVID-19 drug design

June 2022
Briefings in Bioinformatics 23(11)


DOI:10.1093/bib/bbac231

Authors:

Xiang Liu

Nankai University

Le Lu

Alibaba Group DAMO Academy (USA)





Xiao-Shuang Li is a PhD student at Shanghai Jiao Tong University and a research intern at the Alibaba DAMO Academy.

Xiang Liu is a PhD student at Nankai University in China. He was a visiting student at Nanyang Technological University from December 2019 to June 2020.

Le Lu is an IEEE Fellow. He is the Head of Medical AI research and development at Alibaba Group and Senior Director of DAMO Academy USA.

Xian-Sheng Hua is an IEEE Fellow. He is the head of CityBrain Lab and leads the Artificial Intelligence Center of DAMO Academy in Alibaba Group.

Ying Chi is the team leader of Drug Discovery Intelligence at the Alibaba DAMO Academy. She obtained her PhD at Imperial College London and carried out postdoctoral research at Oxford University in the UK. Her current research and development interests are AI methods for drug discovery problems, e.g. virtual screening and protein- and immunity-related problems.

Kelin Xia is an assistant professor at the School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore. His research interests are topological data analysis, molecular-based mathematical biology and machine learning.

Received: March 7, 2022. Revised: April 24, 2022. Accepted: May 18, 2022

Briefings in Bioinformatics, 2022, 1–10

https://doi.org/10.1093/bib/bbac231

Problem Solving Protocol

Multiphysical graph neural network (MP-GNN) for COVID-19 drug design

Xiao-Shuang Li, Xiang Liu, Le Lu, Xian-Sheng Hua, Ying Chi and Kelin Xia

Corresponding author: Kelin Xia, xiakelin@ntu.edu.sg

Abstract

Graph neural networks (GNNs) are the most promising deep learning models that can revolutionize non-Euclidean data analysis. However, their full potential is severely curtailed by poorly represented molecular graphs and features. Here, we propose a multiphysical graph neural network (MP-GNN) model based on the developed multiphysical molecular graph representation and featurization. All kinds of molecular interactions, between different atom types and at different scales, are systematically represented by a series of scale-specific and element-specific graphs with distance-related node features. From these graphs, graph convolution network (GCN) models are constructed with specially designed weight-sharing architectures. Base learners are constructed from GCN models from different elements at different scales, and further consolidated together using both one-scale and multi-scale ensemble learning schemes. Our MP-GNN has two distinct properties. First, our MP-GNN incorporates multiscale interactions using more than one molecular graph. Atomic interactions from various different scales are not modeled by one specific graph (as in traditional GNNs); instead, they are represented by a series of graphs at different scales. Second, it is free from the complicated feature generation process as in conventional GNN methods. In our MP-GNN, various atom interactions are embedded into element-specific graph representations with only distance-related node features. A unique GNN architecture is designed to incorporate all the information into a consolidated model. Our MP-GNN has been extensively validated on the widely used benchmark test datasets from PDBbind, including PDBbind-v2007, PDBbind-v2013 and PDBbind-v2016. Our model can outperform all existing models as far as we know. Further, our MP-GNN is used in coronavirus disease 2019 drug design. Based on a dataset with 185 complexes of inhibitors for severe acute respiratory syndrome coronavirus (SARS-CoV/SARS-CoV-2), we evaluate their binding affinities using our MP-GNN. It has been found that our MP-GNN is of high accuracy. This demonstrates the great potential of our MP-GNN for the screening of potential drugs for SARS-CoV-2.

Availability: The Multiphysical graph neural network (MP-GNN) model can be found in https://github.com/Alibaba-DAMO-DrugAI/MGNN. Additional data or code will be available upon reasonable request.

Keywords: Graph neural network, Graph representation and featurization, Protein–ligand binding, Drug design, Ensemble learning

Introduction

So far, the coronavirus disease 2019 (COVID-19) pandemic, caused by the new severe acute respiratory syndrome coronavirus (SARS-CoV-2), has led to more than 262 million infections and 5 million fatalities and has swept across all 213 countries and territories. The significance of designing efficient antibodies and drugs for COVID-19 cannot be overemphasized. Artificial intelligence-based models have demonstrated great power in various steps of drug design [1]. Among these models are graph neural network (GNN) models, which are end-to-end learning models that take in a molecular graph representation and directly output the prediction. Originally, GNNs were developed for the analysis of large-scale network data, with the main focus on predicting the properties of new nodes or edges within the network. Recently, GNNs have been used in biomolecular data analysis and have achieved great performance in various steps of drug design and discovery [2–10]. Among these models, AquaSol [2] uses directed-acyclic-graph-based recursive neural networks to predict molecular solubility. In DeepVS [3], an effective atom context representation is employed that takes protein–ligand complex properties into consideration. An integrated model of a compound-structure-based GNN and a protein-sequence-based convolutional neural network (CNN) has been developed for compound–protein interactions [7]. A GAN model is introduced for chemical stability prediction

Downloaded from https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbac231/6607747 by NTU Library user on 18 June 2022


in DeepChemStable [8]. A convolution spatial graph embedding layer (C-SGEL)-based graph convolution network (GCN) model is developed for molecular property prediction [9]. A structure-aware interactive GNN is designed to learn essential long-range interactions among atoms and to fully utilize biomolecular structural information [11]. GNN models have also been used in drug–target affinity prediction [11–15], antibiotic discovery [16], prediction of protein–protein binding affinity changes upon mutation [17] and various other drug discovery and development tasks [18].

Even though GNNs have shown great promise for drug design, their full potential has been hindered by inefficient graph topological representations and featurization. Currently, most biomolecular GNNs use the covalent-bond-based graph representation, which models a molecule as a graph with atoms as nodes and covalent bonds as edges. Node and edge features are then generated from different types of physical, chemical and biological properties. However, these covalent-bond-based molecular topologies fail to efficiently characterize non-covalent interactions, which can be of great importance for biomolecular complexes, including protein–protein complexes, protein–ligand complexes, protein–DNA/RNA complexes and DNA/RNA–ligand complexes. To alleviate this problem, a fixed-cutoff-distance-based molecular graph representation has been developed. However, molecular interactions usually occur at different scales. A fixed-cutoff-distance-based topology tends to miss a great amount of information, and it is nontrivial to identify the ‘best’ cutoff distance. Currently, the bottleneck in the design of efficient molecular GNN models is the lack of suitable topological representations and featurization that characterize the multiphysical properties of biomolecules.

Here, we develop multiphysical molecular graph representations and featurization. Based on them, we propose a multiphysical graph neural network (MP-GNN) model. Our MP-GNN employs an ensemble learning scheme to incorporate both scale-specific GNN models and element-specific GNN models. We find that our MP-GNN model delivers state-of-the-art results for protein–ligand binding affinity prediction and achieves high accuracy on the SARS-CoV BA dataset, which contains 185 Mpro-ligand complexes and their experimental binding affinities (an average Pearson correlation coefficient of 0.855 under 5-fold cross-validation).

Results

Physically, atomic interactions within and between molecules are of various types, ranging from strong ones, such as covalent bonds, disulfide bonds, ionic bonds and hydrogen bonds, to relatively weaker ones, such as van der Waals forces, electrostatic interactions, and hydrophobic and hydrophilic effects. Mathematically, the atomic interaction between two atoms with coordinates $r_i$ and $r_j$ can be defined as an interaction function $\Phi(\|r_i - r_j\|)$, with $\|r_i - r_j\|$ the Euclidean distance. To model the multiscale effects, scale (or resolution) related kernel functions are used. Among them, the most common ones are the generalized exponential kernels and the generalized Lorentz kernels. For two atoms at $r_i$ and $r_j$, their atomic interaction can be modeled by the generalized exponential kernel as follows:

$$\Phi(\|r_i - r_j\|; \eta) = e^{-(\|r_i - r_j\|/\eta)^{\kappa}}, \tag{1}$$

or by the generalized Lorentz kernel as

$$\Phi(\|r_i - r_j\|; \eta) = \frac{1}{1 + (\|r_i - r_j\|/\eta)^{\kappa}}. \tag{2}$$

Here, $\eta$ is the scale (or resolution) parameter and $\kappa$ is the order parameter, which is usually taken as 2. Based on the rigidity–flexibility model, we can define the node importance using the rigidity index as follows:

$$\mu(r_i; \eta) = \sum_{j} w_j \Phi(\|r_i - r_j\|; \eta), \tag{3}$$

where the sum runs over the atoms $j$ and $w_j$ is an atomic-type-dependent weight. Note that kernel functions with different scale values focus on atomic interactions at different scales. If a small $\eta$ value is used, the kernels characterize only strong covalent interactions, and the values for other interactions at longer distances are (nearly) 0. In contrast, under a larger $\eta$ value, relatively weaker interactions are also included. Node importance varies with the scale value in a similar way.
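As a minimal sketch of Eqs. (1)–(3), the following Python snippet evaluates the two kernels and the rigidity index. The function names, the unit weights $w_j = 1$, the exclusion of the term $j = i$ from the sum and the sample distances are assumptions made for illustration; they are not taken from the MP-GNN code.

```python
import math


def exponential_kernel(distance, eta, kappa=2.0):
    """Generalized exponential kernel of Eq. (1)."""
    return math.exp(-((distance / eta) ** kappa))


def lorentz_kernel(distance, eta, kappa=2.0):
    """Generalized Lorentz kernel of Eq. (2)."""
    return 1.0 / (1.0 + (distance / eta) ** kappa)


def rigidity_index(i, coords, eta, weights=None, kernel=exponential_kernel):
    """Node importance mu(r_i; eta) of Eq. (3).

    Assumptions: the term j == i is skipped and all weights default to 1.
    """
    if weights is None:
        weights = [1.0] * len(coords)
    return sum(
        weights[j] * kernel(math.dist(coords[i], coords[j]), eta)
        for j in range(len(coords))
        if j != i
    )


# Hypothetical distances in Angstrom: a covalent-bond length and a longer contact.
bond, contact = 1.5, 6.0
for eta in (2.0, 5.0, 10.0, 20.0):
    print(eta,
          round(exponential_kernel(bond, eta), 3),
          round(exponential_kernel(contact, eta), 3))
```

With the smallest scale, the kernel value of the longer contact is close to 0, while for larger scales it becomes comparable to that of the bond, which is the behavior described above.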

In our scale-specific graph representations, molecules are modeled by a series of graphs systematically generated at different scales. Mathematically, a fully connected molecular graph is generated with a scale-related weight value, i.e. the atomic interaction from Eq. (1) or Eq. (2), on each edge. Based on the scale-specific graph representation, the normalized adjacency matrix can be defined as

$$A(i,j) = \begin{cases} \Phi(\|r_i - r_j\|; \eta), & i \neq j \\ 0, & i = j, \end{cases} \tag{4}$$

and the normalized degree matrix can be defined as

$$D(i,j) = \begin{cases} \mu(r_i; \eta), & i = j \\ 0, & i \neq j. \end{cases} \tag{5}$$

In this way, the scale effects are incorporated into the molecular graph representation.
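The two matrices of Eqs. (4) and (5) can be sketched as follows. This is an illustrative construction only: it assumes the exponential kernel of Eq. (1) and unit weights, so that the rigidity index on the diagonal of D equals the row sum of A, and it applies no further normalization. The function name is hypothetical.

```python
import math


def scale_specific_graph(coords, eta, kappa=2.0):
    """Return (A, D) for a fully connected graph at scale eta.

    A[i][j] is the kernel value for i != j and 0 on the diagonal (Eq. 4).
    D is diagonal with the rigidity index of each node (Eq. 5), computed
    here with unit weights, i.e. as the row sum of A (an assumption).
    """
    n = len(coords)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d = math.dist(coords[i], coords[j])
                A[i][j] = math.exp(-((d / eta) ** kappa))
    D = [[sum(A[i]) if i == j else 0.0 for j in range(n)] for i in range(n)]
    return A, D


# Three hypothetical atom positions (Angstrom), evaluated at two scales.
atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 3.0, 0.0)]
for eta in (2.0, 10.0):
    A, D = scale_specific_graph(atoms, eta)
    print(eta, [round(D[i][i], 3) for i in range(len(atoms))])
```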

Further, we propose a new type of node feature vector that depends solely on the atomic interaction function $\Phi$. For the $i$-th node, an $n$-dimensional node feature vector $v^i(\eta) = (v^i_1(\eta), v^i_2(\eta), \ldots, v^i_n(\eta))$ is defined as follows:

$$v^i_k(\eta) = \sum_{j} \chi\big(x_{k-1} \le \Phi(\|r_i - r_j\|; \eta) < x_k\big), \quad k = 1, 2, \ldots, n. \tag{6}$$


Here, we assume all atomic interactions $\Phi(\|r_i - r_j\|; \eta)$ lie within the region $[0, x_{\max}]$, which is equally divided into $n$ intervals $\{[x_{k-1}, x_k);\ k = 1, 2, \ldots, n\}$ with $x_0 = 0$ and $x_n = x_{\max}$. The indicator function $\chi$ equals 1 if its condition is satisfied and 0 otherwise. In other words, the $k$-th entry of the node feature vector counts the atomic interactions whose values fall within the $k$-th interval.
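The node feature of Eq. (6) is a histogram of kernel values, which can be sketched as below. The function name, the choice of the exponential kernel, the default of 10 bins with $x_{\max} = 1$, the exclusion of $j = i$ and the folding of a value exactly equal to $x_{\max}$ into the last bin are all assumptions for illustration.

```python
import math


def node_features(i, coords, eta, n_bins=10, x_max=1.0, kappa=2.0):
    """Histogram of kernel values between atom i and all other atoms.

    Entry k counts the atoms j whose kernel value falls in the k-th of
    n_bins equal-width intervals of [0, x_max].
    """
    width = x_max / n_bins
    features = [0] * n_bins
    for j, r_j in enumerate(coords):
        if j == i:
            continue  # assumption: an atom does not interact with itself
        phi = math.exp(-((math.dist(coords[i], r_j) / eta) ** kappa))
        k = min(int(phi / width), n_bins - 1)  # fold phi == x_max into last bin
        features[k] += 1
    return features


# One very close neighbor and one very distant atom (hypothetical positions).
atoms = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (100.0, 0.0, 0.0)]
print(node_features(0, atoms, eta=2.0))
```

Because only counts per interval are kept, the feature length is independent of the number of atoms in the molecule.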

Finally, molecules are usually of various sizes, which results in different-sized molecular graphs. To facilitate weight-sharing among different molecular graphs, we use the node importance $\mu$ of Eq. (3) to remove the less important extra nodes, so that all molecular graphs have the same size.
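One simple way to realize this cropping is to keep the nodes with the largest importance values. The sketch below assumes exactly that; the function name is hypothetical, and how graphs with fewer nodes than the target size are handled is not specified in the text and is not addressed here.

```python
def crop_by_importance(importance, size):
    """Return the indices of the `size` most important nodes.

    `importance` holds one rigidity-index value per node. The kept
    indices are returned in their original order so that the cropped
    graph preserves the atom ordering.
    """
    ranked = sorted(range(len(importance)),
                    key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:size])


# Hypothetical importance values for a four-node graph, cropped to two nodes.
print(crop_by_importance([0.2, 0.9, 0.5, 0.1], 2))
```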

Element-specific graph for GNN

Other than scale effects, element types are the other key factor for multiphysical atomic interactions. For instance, carbon atoms are usually associated with hydrophobic interactions, while nitrogen and oxygen atoms are correlated with hydrophilic interactions and/or hydrogen bonds. To enable a systematic description of atomic interactions, we consider element-specific graph representations [19]. Recently, the combination of element-specific representations and machine learning models has achieved great success in drug design [19–30]. More recently, an element-specific GNN model has been developed and has achieved state-of-the-art performance in quantitative toxicity analysis and solvation prediction [10].

The essential idea of element-specific representations is to decompose a molecule into a series of atom sets, each of which is composed of certain specific types of elements. In general, a protein molecule is composed of roughly five important elements, denoted as $E_P$ = [C, N, O, S, H]. A DNA or RNA molecule also has five important elements, denoted as $E_D$ = [C, N, O, P, H]. Ligands and other chemical molecules tend to have more types of elements. Here, we consider only the nine most commonly used ones, denoted as $E_L$ = [C, N, O, S, H, F, Cl, Br, I]. In general, an element-specific GNN model contains a series of molecular graphs that are constructed based on different element types. For instance, a protein can be represented by a series of element-specific graphs, including single-element graphs (C-graph, N-graph, O-graph, S-graph and H-graph), double-element graphs (CN-graph, CO-graph, CS-graph, CH-graph, NO-graph, NS-graph, NH-graph, OS-graph, OH-graph and SH-graph), three-element graphs and other graphs with more types of elements. Each element-specific graph characterizes a certain type of atomic interaction. Note that the all-atom graph used in previous GNN models is just a special case of the element-specific graph. Normally, we do not need to use all the combinations [25,29,30]. To balance computational cost and model accuracy, we usually consider only the element-specific graphs with a sufficient number of atoms. For instance, ligand molecules may contain Cl, but they usually have only one or two Cl atoms, so a Cl-graph would be meaningless. However, the Cl atom can be important for ligand properties, so we can consider multiple-element graphs, such as the CNCl-graph, the COCl-graph, etc.
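The enumeration of element-specific graphs described above can be reproduced with a few lines of Python. The helper names and the atom representation (an element symbol paired with coordinates) are assumptions made for illustration.

```python
from itertools import combinations

PROTEIN_ELEMENTS = ["C", "N", "O", "S", "H"]


def element_graph_names(elements, size):
    """Names of all element-specific graphs built from `size` element types."""
    return ["".join(combo) + "-graph" for combo in combinations(elements, size)]


def atom_subset(atoms, selected):
    """Keep the atoms whose element is in `selected`.

    `atoms` is a list of (element, (x, y, z)) tuples.
    """
    return [atom for atom in atoms if atom[0] in selected]


print(element_graph_names(PROTEIN_ELEMENTS, 1))
print(element_graph_names(PROTEIN_ELEMENTS, 2))

# Hypothetical ligand fragment: the CNCl atom set keeps C, N and Cl atoms only.
ligand = [("C", (0.0, 0.0, 0.0)), ("O", (1.2, 0.0, 0.0)),
          ("N", (0.0, 1.3, 0.0)), ("Cl", (2.5, 1.0, 0.0))]
print(atom_subset(ligand, {"C", "N", "Cl"}))
```

Five element types give five single-element and ten double-element graphs, matching the lists in the text; the number of combinations grows quickly, which is why only graphs with enough atoms are retained.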

Multiphysical graph neural network

In our MP-GNN model, a series of scale-specific and element-specific graphs are generated from molecules. From each graph, a GNN architecture is constructed. To significantly reduce the number of learning parameters, weight-sharing schemes and ensemble learning models are considered. Molecular structural topologies are of great importance for molecular functions. Various quantitative structure–activity/property relationship (QSAR/QSPR) models have been developed to relate molecular groups, motifs, conserved regions, domains and other molecular topologies to their functions [31–33]. In GNN models, weight-sharing schemes are used to characterize common molecular topologies, as similar structural topologies, defined by the same weights in GNNs, tend to induce similar functions. Moreover, weight-sharing schemes can significantly reduce the number of parameters and the network complexity. In our MP-GNN, we share weights among graphs of the same scale-specific and element-specific type. We also allow weights to be shared among relatively similar element-specific graphs, to reduce the computational cost and to cope with relatively small training sets. Ensemble learning models use multiple base learning algorithms to boost the performance of the prediction. Here, we consider two types of ensemble learning, i.e. single-scale (one-scale) stacking and multiscale stacking. The one-scale stacking ensemble model is used to alleviate the impact of randomness caused by initialization. The multiscale stacking boosts the performance by consolidating base learners that focus on different scales and have less overlap.

MP-GNN for COVID-19 drug design

MP-GNN for protein–ligand interactions

Recently, a series of topological models have been developed for the characterization of protein–ligand interactions and have achieved great success [25,29,30]. The essential idea of these models is to define special matrices that focus on interactions between the protein and the ligand, instead of interactions within either the protein or the ligand, and to construct molecular topological models based on these matrices [25].

Mathematically, we can set the protein–ligand interaction matrix $M$ as follows:

$$M(m_i, m_j) = \begin{cases} \Phi(\|r_i - r_j\|; \eta), & \text{if } r_i \in R_P,\ r_j \in R_L \text{ or } r_i \in R_L,\ r_j \in R_P \\ \infty, & \text{otherwise.} \end{cases} \tag{7}$$

Here, $r_i$ and $r_j$ are the coordinates of the $i$-th and $j$-th atoms, and $m_i$ and $m_j$ are their indices in the matrix. Two sets


$R_P$ and $R_L$ are the atom coordinate sets of the protein and the ligand, respectively. Note that only interactions between protein atoms and ligand atoms are considered, while interactions between atoms within either the protein or the ligand are ignored by setting their distances to ∞, i.e. an infinitely large value. Other than the generalized kernel functions in Eqs. (1) and (2), we can also define Euclidean-distance-based and electrostatics-based atomic interaction functions as

$$\Phi(\|r_i - r_j\|) = \frac{1}{\|r_i - r_j\|^{\kappa}}, \tag{8}$$

and

$$\Phi(\|r_i - r_j\|) = \frac{1}{1 + \exp\left(-c\, q_i q_j / \|r_i - r_j\|\right)}, \tag{9}$$

where $q_i$ and $q_j$ are the partial charges of the $i$-th and $j$-th atoms, and the parameter $c$ is a constant. All three types of atomic interactions are considered in our MP-GNN.
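The interaction matrix of Eq. (7) and the two additional interaction functions of Eqs. (8) and (9) can be sketched as follows. The function names, the ordering of atoms (protein first, then ligand), the use of the exponential kernel in the matrix and the default $c = 1$ are assumptions for illustration.

```python
import math


def interaction_matrix(protein_coords, ligand_coords, eta, kappa=2.0):
    """Eq. (7): kernel values for protein-ligand pairs, infinity otherwise."""
    coords = list(protein_coords) + list(ligand_coords)
    n_protein = len(protein_coords)
    n = len(coords)
    M = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Exactly one of the two atoms must belong to the protein.
            if (i < n_protein) != (j < n_protein):
                d = math.dist(coords[i], coords[j])
                M[i][j] = math.exp(-((d / eta) ** kappa))
    return M


def inverse_distance_kernel(distance, kappa=2.0):
    """Euclidean-distance-based interaction of Eq. (8)."""
    return 1.0 / distance ** kappa


def electrostatic_kernel(distance, q_i, q_j, c=1.0):
    """Electrostatics-based interaction of Eq. (9)."""
    return 1.0 / (1.0 + math.exp(-c * q_i * q_j / distance))


# One hypothetical protein atom and two hypothetical ligand atoms.
M = interaction_matrix([(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)], eta=2.0)
for row in M:
    print([round(value, 3) for value in row])
```

Only the protein-ligand blocks of the matrix carry finite values, so any graph built from it is bipartite by construction.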

Further, element-specific graphs are constructed only between protein atoms and ligand atoms. As stated above, a protein molecule is usually composed of roughly five important elements, $E_P$ = [C, N, O, S, H], and ligands are composed of nine types, $E_L$ = [C, N, O, S, H, F, Cl, Br, I]. We generate a series of element-specific bipartite graphs in our MP-GNN. Each bipartite graph is composed of two sets of same-typed atoms, with one set from the protein and the other from the ligand. Edges can only be formed between the two sets (hence the name bipartite graph), and are determined by the interaction matrix in Eq. (7). In general, when the multiscale kernel functions are used, a total of 36 = 4 × 9 types of bipartite graphs are generated without the consideration of H atoms. Moreover, four different scale (or resolution) parameters are used, i.e. $\eta$ = 2, 5, 10 and 20 Å. Figure 1 illustrates the general architecture of our MP-GNN model for protein–ligand interaction analysis. More details of the MP-GNN model can be found in Method.

Datasets We consider the three most commonly used benchmark datasets for protein–ligand binding affinity prediction, namely PDBbind-v2007, PDBbind-v2013 and PDBbind-v2016. All the datasets used in this paper are listed in Table 1. Separate experiments are carried out on PDBbind-v2007, v2013 and v2016, each with its own pre-train set, training set and test set. Because the datasets intersect, the pre-train set for each dataset is randomly selected from the samples that do not intersect with that dataset. The union of these three datasets contains 4413 complexes. For PDBbind-v2007, the pre-train set contains 1000 items from 3114 non-intersected samples. For PDBbind-v2013, it contains 1000 of 1455 non-intersected samples, and for PDBbind-v2016, all 357 non-intersected complexes are used for pre-training. The core set acts as the test set for evaluation. The training set is obtained as the refined set minus the core set.

To test the performance of our model for COVID-19 drug design, we consider the SARS-CoV BA dataset, which contains 185 Mpro-ligand complexes and their experimental binding affinities. Among the 185 ligands, 44 have X-ray crystal structures and the rest are given as 2D SMILES strings. The software MathPose is used to predict the 3D structures of those 2D ligands and to generate the binding complexes of all 185 ligands with Mpro. To carry out the validation, we randomly split the SARS-CoV BA set into five non-overlapping folds. In each task, our MP-GNN is trained on part of the SARS-CoV BA dataset in conjunction with the PDBbind-v2019 set. More specifically, one fold (or division) is used as the validation set in each task, and the remaining four folds are combined with the PDBbind-v2019 general set to form the training set. No pre-training is done before training.
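The 5-fold protocol on the 185 SARS-CoV BA complexes can be sketched as below. The function name, the seed and the interleaved fold assignment are assumptions; the sketch only produces index lists for the SARS-CoV BA part, and the PDBbind-v2019 general set would be appended to every training list.

```python
import random


def five_fold_splits(n_samples=185, n_folds=5, seed=0):
    """Yield (training, validation) index lists for each fold.

    The samples are shuffled once and dealt into n_folds disjoint folds.
    Each fold serves as the validation set exactly once, and the other
    folds form the SARS-CoV BA part of the training set.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        validation = folds[k]
        training = [i for m, fold in enumerate(folds) if m != k for i in fold]
        yield training, validation


for training, validation in five_fold_splits():
    print(len(training), len(validation))
```

With 185 complexes, every fold holds 37 validation complexes and the remaining 148 go to training.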

Benchmark tests for MP-GNN

More than 40 different scoring functions or models have been extensively tested on the three PDBbind datasets. Figure 2 shows the comparison between our MP-GNN and the other models. The upper part depicts the overall performance, with our method marked in red. All results are measured by the Pearson correlation coefficient (denoted as $R_p$). Our method stays ahead of all other works on all three datasets, except that it ranks second to TopBP on PDBbind-v2016. More specifically, the current best $R_p$ on PDBbind-v2007 is 0.831, achieved by FPRC [37], while the best values on PDBbind-v2013 and PDBbind-v2016 are 0.808 and 0.861, both achieved by TopBP [25]. Our MP-GNN surpasses the current best result on PDBbind-v2013 by 2% and is in line with the current best result on PDBbind-v2007, with a slight advantage. On PDBbind-v2016, it is 1% lower than TopBP. In the line chart, the right part, where $R_p$ is over 0.6, is dense, so a clear ranking is displayed below. It is worth mentioning that our method achieves a significant improvement on the hardest dataset, PDBbind-v2013, which has a more unbalanced distribution between the training and test sets (see Table 1). Figure 3 shows the performance of the two stacking schemes and the learning rate of our model on PDBbind-v2007. Detailed results for all three datasets can be found in Tables S1 to S3.

MP-GNN for COVID-19 drug design

The COVID-19 pandemic, which started in late December 2019 and is caused by the new SARS-CoV-2, had infected more than 262 million individuals and caused more than 5 million fatalities across all continents and over 213 countries and territories by 11 November 2021. Currently, different drug targets of SARS-CoV-2, such as the main protease (Mpro, also called 3CLpro), the papain-like protease (PLpro), the RNA-dependent RNA polymerase (RdRp) and the 5’-to-3’ helicase protein (Nsp13), have been investigated. Among them, the main protease is one of the best-characterized targets for coronaviruses. It has


Figure 1. The framework of the protein–ligand complex binding affinity prediction system for the drug design task. A complex of a SARS-CoV-2 main protease inhibitor is used as an example here. The process consists of three steps: (1) generating the scale-specific graphs for the protein–ligand complex, (2) processing a group of element-specific graphs with the multiphysical graph neural network for 22 repeat experiments, and performing one-scale stacking on the repeat experiments to give a prediction for one resolution and (3) giving a final decision by combining multi-scale predictions with multi-scale stacking. Nodes filled or outlined in red are from the ligand.

Table 1. A summary of our selected datasets. mean(B) refers to the mean atom number for binding sites, and mean(G) refers to the mean atom number for the un-cropped element-specific graph. The ratio between mean(B) and mean(G) describes the average complexity of the dataset.

| Name | Size | Pre-train size | Description | mean(B) | mean(G) | mean(G)/mean(B) |
| PDBbind v2007 [36] | 1300 | 1000 | Refined set. Core set size 195. | 583 | 151 | 0.259 |
| PDBbind v2013 [36] | 2959 | 1000 | Refined set. Core set size 195. | 195 | 56 | 0.287 |
| PDBbind v2016 [36] | 4057 | 358 | Refined set. Core set size 285. | 441 | 108 | 0.245 |
| PDBbind v2019 [36] | 17 652 | – | General set. | 432 | 114 | 0.264 |
| SARS-CoV BA [34] | 185 | – | Inhibitors of SARS-CoV/SARS-CoV-2 main protease having experimental binding affinity. | 583 | 149 | 0.256 |


Figure 2. Comparison with recent works on three datasets, including topology-based methods, image-based methods and traditional molecular descriptor-based methods. The performances of other models are taken from [25,34,35]. The upper part is an overall comparison. The lower part is a clear performance ranking of works with $R_p$ higher than 0.6 on three datasets. All results are measured with $R_p$.

Figure 3. The left and middle box charts depict the range of performance for two stacking schemes. The line chart on the right shows the decay rule for learning rate and the $R_p$ curves for training and test on the first repeat experiment for PDBbind-v2007 with the exponential kernel, $\eta$ = 10.

been found that although the overall sequence identity between SARS-CoV and SARS-CoV-2 is just 80%, the Mpro of SARS-CoV-2 shares 96.08% sequence identity with that of SARS-CoV. This high degree of conservation provides an opportunity for drug repurposing, i.e. the use of SARS-CoV Mpro inhibitors as potential SARS-CoV-2 Mpro inhibitors. Recently, a dataset of 185 inhibitors of SARS-CoV/SARS-CoV-2 Mpro with experimental binding affinities has been collected. The software MathPose has been employed to predict their 3D structures and the binding complexes between Mpro and these ligands; the resulting dataset is denoted as SARS-CoV BA. We test our MP-GNN model on this dataset. In order to benchmark our method against MathDL [34], which is a leading approach for binding affinity prediction on the SARS-CoV BA dataset, we use the same dataset partitioning scheme and cross-validation strategy [34]. The SARS-CoV BA set is divided into five partitions for 5-fold cross-validation, so each partition is used in turn for validation. It is worth mentioning that although MP-GNN and MathDL use the same dataset and random dividing scheme, the actual partitions can differ. The average $R_p$ and Kendall’s tau ($\tau$) of our MP-GNN model are 0.855 and 0.654, which are better than those of MathDL, 0.729 and 0.540.
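For reference, the two evaluation metrics can be computed as in the following sketch. The function names and the toy affinity values are hypothetical. The Kendall coefficient shown is the tau-a variant, in which tied pairs count as neither concordant nor discordant; which variant was used for the reported values is not stated in the text.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient R_p between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            score += (product > 0) - (product < 0)
    return score / (n * (n - 1) / 2)


# Hypothetical experimental and predicted binding affinities.
experimental = [5.1, 6.3, 7.0, 8.2]
predicted = [5.4, 6.9, 6.5, 8.0]
print(round(pearson_r(experimental, predicted), 3),
      round(kendall_tau(experimental, predicted), 3))
```

$R_p$ measures linear agreement between predicted and experimental affinities, whereas $\tau$ depends only on their rank order, so reporting both separates calibration from ranking quality.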


Discussion

Factors that impact the improvement extent of multi-scale stacking

The supplementary material reveals that multi-scale stacking improves $R_p$ by 6–7% on the PDBbind refined datasets, but by 15–24% on the SARS-CoV BA dataset, which is a large gap. We assume that the likely reason is the diversity and the complexity of the latter training set. For every dataset, the average atom number of the binding site, mean(B), and of the sub-graph, mean(G), are recorded in Table 1. As can be seen from Tables S1, S2 and S3, there is no obvious linear relationship between prediction complexity and mean(B). Meanwhile, the ratio between mean(B) and mean(G), given in the last column of Table 1, is directly related to the ratio of non-empty sub-graphs. Under the premise that most binding sites include C, N, O, S and H, a high ratio means that more element types appear in the ligand. In other words, the more non-empty sub-graphs there are, the more sufficient the information is and the more the network learns from training. As a result, we presume that this ratio is negatively correlated with the task complexity. Meanwhile, on the same test set, a more diversified training set helps to obtain better results. In conclusion, we believe that the first reason for such great progress on SARS-CoV BA is the information discrepancy between the training and validation sets. The training set is richer in information, so the model handles the validation set with great facility, and stacking brings a larger improvement than on datasets with consistent training and test sets. The second reason is that SARS-CoV BA has a training set that is several to 10 times larger than the PDBbind refined datasets, and that includes not only most of the data from previous years but also four divisions of the SARS-CoV BA dataset, whose complex structures are similar to those of the validation set.
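The ratio discussed here can be recomputed directly from the mean(B) and mean(G) columns of Table 1. The snippet below is only an arithmetic check with hypothetical variable names; the numbers are copied from the table.

```python
# (mean(B), mean(G)) pairs copied from Table 1.
TABLE_1 = {
    "PDBbind v2007": (583, 151),
    "PDBbind v2013": (195, 56),
    "PDBbind v2016": (441, 108),
    "PDBbind v2019": (432, 114),
    "SARS-CoV BA": (583, 149),
}


def graph_to_site_ratio(mean_b, mean_g):
    """Ratio mean(G) / mean(B) as listed in the last column of Table 1."""
    return mean_g / mean_b


ratios = {name: round(graph_to_site_ratio(b, g), 3)
          for name, (b, g) in TABLE_1.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.3f}")
```

The recomputed values agree with the table to three decimals, with PDBbind v2013 having the highest ratio and PDBbind v2016 the lowest.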

Schemes for feature fusion

Through experiments, we find that channel-wise summation for node feature fusion and concatenation for sub-graph feature fusion give the highest Rp among the schemes we tried. As mentioned, symmetric operators are more suitable than concatenation when the number of nodes is large. Some works [38,39] in the field of 3D feature learning prefer the channel-wise maximum. Our experiments show that the channel-wise maximum filters the nodes and retains only the extreme values after feature embedding. Feature visualization reveals that these extremes originate mostly from inflections, depressions and contours where the features stand out. In contrast, the features that contribute most to the binding affinity of a complex lie in the chemical bond forces among binding-site elements rather than in the 3D profile of the protein. The binding affinity can be viewed as the superposition of all chemical bond forces in the binding site, so the maximum operator loses most of the information. This explains why the sum operator is appropriate. On the other hand, because the number and length of sub-graph descriptors are limited, the concatenation operator can deliver the features completely while implicitly carrying the atom-type information. Note that the features of an element-specific graph do not include atom types, which is why symmetric operators are unsuitable for sub-graph feature fusion.
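The difference between the three operators discussed above can be seen on a toy example. The sketch below is illustrative only: the function names and the feature values are assumptions made for this example and are not taken from the MP-GNN code.

```python
# Toy comparison of fusion operators; names and numbers are illustrative only.

def channel_sum(node_features):
    """Channel-wise sum over nodes: every node contributes to the result."""
    return [sum(channel) for channel in zip(*node_features)]


def channel_max(node_features):
    """Channel-wise maximum over nodes: only the extreme value per channel survives."""
    return [max(channel) for channel in zip(*node_features)]


def concatenate(descriptors):
    """Concatenation: the position of each descriptor in the output is preserved."""
    return [value for descriptor in descriptors for value in descriptor]


# Three nodes with two feature channels each (hypothetical values).
nodes = [[0.25, 1.0], [0.5, 0.125], [0.75, 0.5]]
reordered = [nodes[2], nodes[0], nodes[1]]

summed = channel_sum(nodes)   # accumulates all three nodes per channel
peaks = channel_max(nodes)    # keeps one node's value per channel

# Both symmetric operators ignore the order of the nodes ...
same_sum = channel_sum(reordered) == summed
same_max = channel_max(reordered) == peaks

# ... whereas concatenation distinguishes which descriptor came first.
first = concatenate([[1.0, 2.0], [3.0, 4.0]])
second = concatenate([[3.0, 4.0], [1.0, 2.0]])
```

In this sketch the sum keeps a contribution from every node, the maximum keeps a single value per channel, and only concatenation is sensitive to ordering, which is what lets a fixed ordering of sub-graph descriptors stand in for the element pair.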

Ablation analysis for element-specific graph and single aggregation

The figures quoted in this section come from ablation studies designed earlier. At the beginning, a single graph was used to describe the whole binding site. It is a huge bipartite structure, and we had to use a large cropping size such as 130. The network was able to converge, but the results were more erratic. Early experiments showed that converting from a single graph to element-specific graphs increases Rp from 0.69 to 0.749 without any stacking ensemble. This shows that examining the binding site through element-specific graphs at a single scale is much clearer than looking at the whole graph directly. The subsequent ensembles improve the prediction Rp again. We then realized that, to some extent, these sub-graphs can be viewed as complete graphs, that is, any two nodes in a sub-graph can obtain each other's information through one aggregation. This means that stacking multiple aggregation layers may lead to redundant and overlapping information. We therefore deconstructed the graph convolution layer in MP-GNN and removed the aggregation after the first layer. The best single-scale Rp increased from 0.749 to 0.767 as expected, confirming the effectiveness of single aggregation.
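The redundancy argument can be made concrete under a simplifying assumption: treat a sub-graph as a complete graph on $n$ nodes with self-loops, and ignore the weight matrices and the activation. The normalized aggregation operator of the standard GCN propagation rule is then idempotent, so a second aggregation returns the same node mixture as the first. This is an illustrative derivation under those assumptions, not one given in the paper.

```latex
\hat{A} = J_n \quad (\text{the all-ones } n \times n \text{ matrix}), \qquad
\hat{D} = n I_n, \\
P = \hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} = \tfrac{1}{n} J_n, \qquad
P^2 = \tfrac{1}{n^2} J_n J_n = \tfrac{1}{n^2} \, n J_n = \tfrac{1}{n} J_n = P .
```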

Method

Multiphysical GNN

Graph neural network The GNN in our MP-GNN consists

of two parts, i.e. a ‘head’ part and a ‘tail’ part. The ‘head’

part converts the node vector information from each

bipartite graph to a hidden feature vector. The ‘tail’ part is

a fully connected neural network that learns the binding

affinity from the hidden feature vector.

The ‘head’ part contains a convolution layer followed by an encoder. In the convolution layer, node features are convolved as in the traditional GCN model,

$$H^{l+1} = \sigma\left(\hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} H^{l} W^{l}\right), \qquad (10)$$

in which $\hat{D}$ and $\hat{A}$ denote the degree and adjacency matrices of the graph, combined here in symmetric normalized form, $H^{l}$ the node feature matrix of the $l$th layer, $W^{l}$ the layer-specific weight matrix and $\sigma(\cdot)$ the activation function. Note that the input for $H^{l}$ is just the node features as in Eq. (6). The encoder part consists of a fully connected layer, dropout and an activation layer, in that order.
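A minimal pure-Python sketch of the propagation rule in Eq. (10) is given here for orientation. It assumes the standard GCN convention that the adjacency matrix carries added self-loops, and it uses ELU as the activation; the helper names and the tiny two-node graph are hypothetical and are not taken from the MP-GNN code.

```python
import math


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def elu(x, alpha=1.0):
    """Exponential linear unit."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def gcn_layer(adjacency, features, weights):
    """One graph convolution: sigma(D^-1/2 (A + I) D^-1/2 H W).

    Adding the identity (self-loops) is an assumption borrowed from the
    standard GCN formulation.
    """
    n = len(adjacency)
    a_hat = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    degree = [sum(row) for row in a_hat]
    normalized = [[a_hat[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
                  for i in range(n)]
    pre_activation = matmul(matmul(normalized, features), weights)
    return [[elu(value) for value in row] for row in pre_activation]


# Two connected nodes, one input channel, two output channels (toy values).
adjacency = [[0.0, 1.0], [1.0, 0.0]]
features = [[1.0], [3.0]]
weights = [[1.0, -1.0]]
hidden = gcn_layer(adjacency, features, weights)
```

With these toy values both nodes receive the same averaged feature, so the two output rows coincide; a real layer would use learned weights and the distance-based node features of Eq. (6).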

The ‘tail’ part also contains two parts, i.e. a feature

fusion part and encoder-based prediction part. In the fea-

ture fusion part, all node feature vectors are aggregated

into a single feature vector. The commonly used fuse

operations include concatenation, sum, average, maxi-

mum and minimum of all node features. Here, we use

summation of node features for each element-specific

graph. Then we concatenate summation vectors from all

50 element-specific graphs into a long vector. In the sec-

ond part, two encoders together with a fully connected

layer are used to predict the binding affinity from the

concatenated vector.
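The two-level fusion described above (a sum over nodes inside each graph, then concatenation across graphs) can be sketched as follows. This is a simplified illustration: the per-node channel count of 16 and the 56 nodes per graph are borrowed from settings stated elsewhere in the paper, the intermediate encoder is omitted, and the function name is hypothetical.

```python
def fuse_binding_site(graphs):
    """Sum node features within each graph, then concatenate across graphs.

    `graphs` is a list of element-specific graphs; each graph is a list of
    node feature vectors of equal length.
    """
    fused = []
    for nodes in graphs:
        summed = [sum(channel) for channel in zip(*nodes)]  # symmetric over nodes
        fused.extend(summed)                                # position marks the graph
    return fused


# Toy input: 50 graphs, 56 nodes per graph, 16 channels per node, all ones.
graphs = [[[1.0] * 16 for _ in range(56)] for _ in range(50)]
vector = fuse_binding_site(graphs)
```

Under these assumed sizes the concatenated vector has 50 × 16 = 800 entries, which is the kind of fixed-length input a fully connected regression part requires.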

Ensemble learning The ensemble learning method uses multiple learning models simultaneously to boost performance. Each learning model, known as a base learner, can give a prediction individually. Ensemble learning improves on the prediction accuracy of each base learner by assembling them under a certain combination strategy. Commonly used combination approaches include bagging, boosting and stacking [40]. In our MP-GNN, we focus on the stacking ensemble model. The essential idea is to assign a weight, which is to be learned, to each base learner and use the weighted results as the final prediction. More specifically, we denote the predictions of $n$ base learners as $Y_1, Y_2, \ldots, Y_n$, the final prediction as $Y^{\mathrm{stacking}}$ and the ground-truth values of the training set as $Y$. The weight for each base learner is linearly related to its prediction accuracy on the training set. For instance, we can use $R_p(Y_i, Y)$, the Pearson correlation coefficient $R_p$ between the prediction $Y_i$ and the true value $Y$, as the measure of model accuracy. The weight for the $i$-th base learner is then defined as

$$W_i^{\mathrm{stacking}} = \frac{R_p(Y_i, Y)}{\sum_{j=1}^{n} R_p(Y_j, Y)},$$

and the final prediction is

$$Y^{\mathrm{stacking}} = \sum_{i=1}^{n} W_i^{\mathrm{stacking}} Y_i.$$

Other than using $R_p$ as the accuracy measure, we have also considered RMSE in our MP-GNN models.
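The Rp-weighted stacking rule can be written out in a few lines. The sketch below is a plain-Python illustration with hypothetical function names and toy data; it assumes every base learner has a positive Pearson correlation on the training set, so that the weights are positive and sum to one.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def stacking_weights(train_predictions, train_targets):
    """Weight of learner i = Rp(Y_i, Y) / sum_j Rp(Y_j, Y), on the training set."""
    scores = [pearson(pred, train_targets) for pred in train_predictions]
    total = sum(scores)
    return [score / total for score in scores]


def stacked_prediction(weights, predictions):
    """Weighted sum of the base learners' predictions, sample by sample."""
    n_samples = len(predictions[0])
    return [sum(w * pred[k] for w, pred in zip(weights, predictions))
            for k in range(n_samples)]


# Toy training data: two base learners, four samples.
targets = [1.0, 2.0, 3.0, 4.0]
learner_a = [1.0, 2.0, 3.0, 4.0]
learner_b = [2.0, 4.0, 6.0, 8.0]
weights = stacking_weights([learner_a, learner_b], targets)
combined = stacked_prediction(weights, [learner_a, learner_b])
```

Both toy learners correlate perfectly with the targets, so they receive equal weights here. Note that Rp measures correlation rather than calibration, so a learner with a scale error (such as `learner_b`) is not penalized by this weighting.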

MP-GNN for Covid drug design

Graph representation for protein–ligand interactions

Ligands usually bind to proteins, which tend to have a

much larger size, at a certain special region called bind-

ing site. Computationally, the binding site is chosen as

the protein region that is within a certain cutoff distance

of the ligand atoms. Here, we use 10 Å in our MP-GNN

model. The protein–ligand interaction matrix M in Eq. (7) is defined only on the protein binding sites instead of the entire protein domain. Three types of atomic interaction function are considered: the generalized exponential/Lorentz kernel function, the Euclidean distance function and the electrostatic function. Under different interaction functions, different types of element-specific graph models are constructed. As stated above, proteins and ligands in general have five and nine types of atoms, that is, E_P = [C, N, O, S, H] and E_L = [C, N, O, S, H, F, Cl, Br, I]. For the generalized exponential/Lorentz kernel and Euclidean distance based atomic interaction functions, we consider only 36 = 4 × 9 types of element-specific bipartite graph representation and omit the influence of hydrogen (H) atoms. For the electrostatic-based interaction function, a total of 50 = 5 × 10 types of bipartite graphs are constructed.
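The cutoff-based choice of the binding site described above can be sketched as a simple distance filter. The atom representation (an element symbol with a coordinate tuple), the function name and the use of a non-strict comparison are assumptions made for this illustration; only the 10 Å cutoff comes from the text.

```python
import math


def select_binding_site(protein_atoms, ligand_atoms, cutoff=10.0):
    """Keep protein atoms within `cutoff` angstroms of at least one ligand atom.

    Each atom is a hypothetical (element, (x, y, z)) tuple, coordinates in angstroms.
    """
    return [
        atom for atom in protein_atoms
        if any(math.dist(atom[1], lig[1]) <= cutoff for lig in ligand_atoms)
    ]


# Toy complex: three protein atoms and a single ligand atom.
protein = [("C", (0.0, 0.0, 0.0)), ("N", (20.0, 0.0, 0.0)), ("O", (5.0, 8.0, 0.0))]
ligand = [("C", (5.0, 0.0, 0.0))]
site = select_binding_site(protein, ligand)
```

In this toy case the atoms at 5 Å and 8 Å from the ligand are kept and the atom at 15 Å is discarded. A production implementation would typically use a spatial index instead of the quadratic scan shown here.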

Note that for these bipartite graphs, their sizes may

vary greatly between different protein–ligand complexes

and between different element combinations. In our MP-

GNN, node importance is defined using rigidity index as

in Eq. (3). To share the weights (in the GNN model) among different graphs, we choose the same cropping size, i.e. a total of 56 nodes, for all bipartite graphs. Computationally, it is found that 56 is roughly the average size of these element-specific bipartite graphs. For large graphs, we remove the nodes that have lower node importance. For small graphs, pseudo-nodes are added until the common size of 56 is reached.
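The cropping and padding step can be sketched as below. Two points are assumptions of this illustration rather than statements from the paper: pseudo-nodes are represented as all-zero feature vectors, and the retained nodes keep their original relative order. The function name is hypothetical, and the importance scores stand in for the rigidity index of Eq. (3).

```python
def crop_or_pad(nodes, importance, size=56):
    """Return exactly `size` node feature vectors.

    Larger graphs keep the `size` most important nodes (original order preserved);
    smaller graphs are padded with all-zero pseudo-nodes (an assumption).
    """
    if not nodes:
        raise ValueError("need at least one node to infer the feature length")
    ranked = sorted(range(len(nodes)), key=lambda i: importance[i], reverse=True)
    keep = sorted(ranked[:size])
    kept = [nodes[i] for i in keep]
    feature_length = len(nodes[0])
    padding = [[0.0] * feature_length for _ in range(size - len(kept))]
    return kept + padding


# A graph with 60 nodes loses its 4 least important nodes.
large = [[float(i)] for i in range(60)]
cropped = crop_or_pad(large, importance=list(range(60)))

# A graph with 3 nodes gains 53 pseudo-nodes.
padded = crop_or_pad([[1.0, 2.0]] * 3, importance=[0.1, 0.2, 0.3])
```

Fixing the node count in this way gives every element-specific graph the same tensor shape, which is what allows the same GNN weights to be shared across graphs.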

In our MP-GNN model, node features are only related

to atomic distances. For the generalized kernel based functions as in Eqs. (1) and (2), we set κ to be 2 and consider four different scale parameters, that is, η = 2, 5, 10 and 20 Å. For the Euclidean distance based function as in Eq. (8), we set κ = −1 and let it simply equal the atomic distance. The function domain of Eqs. (1), (2) and (8) is set to be [2 Å, 30 Å] with each interval of length 1 Å, and the node vector as in Eq. (6) is of size 29. For the electrostatic-based function as in Eq. (9), we set the domain to [0, 1] with each interval of length 0.04, and the node vector is of size 25.

MP-GNN parameter settings The encoders in MP-GNN

head have the output size of 64 and 16 for every node.

After node feature fusion, preliminary sub-graph fea-

tures go through an encoder with output feature length

16. Then the sub-graph feature matrix with shape (M*N,

16) is concatenated into one feature vector describing

the binding site, which passes through the hidden layer

with 256 and 64 neurons for final regression. Every MP-

GNN sub-learner is trained for 6400 epochs to obtain the

optimal model with a dropout rate of 0.5 and ELUs as

the activation unit. The learning rate starts from 0.1 and

decays every 800 epochs, and the decay rate is 0.5. The

decay scheme is depicted in Figure 3.
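The learning-rate schedule stated above can be expressed as a one-line function. The sketch assumes a staircase reading of "decays every 800 epochs", i.e. the rate is constant within each 800-epoch block and is multiplied by 0.5 at each block boundary; the function name is hypothetical.

```python
def learning_rate(epoch, initial=0.1, decay=0.5, step=800):
    """Staircase schedule: multiply the rate by `decay` once every `step` epochs."""
    return initial * decay ** (epoch // step)


# Rates at the start of each 800-epoch block over the 6400-epoch run.
schedule = [learning_rate(epoch) for epoch in range(0, 6400, 800)]
```

Under this reading the rate is halved seven times over the 6400 epochs, so the last block trains at 0.1/128, roughly 0.00078.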

Performance of MP-GNN on PDBbind datasets For the PDBbind datasets, a total of 10 different scale-specific GNN models are considered based on 10 atomic interaction functions, including four different exponential kernel functions, four different Lorentz kernels, a Euclidean distance based function and an electrostatic-based function. A total of 10 GNN base learners can thus be obtained. The stacking models are chosen based on Rp on the training set. Due to the high computation cost, we conducted 22 repeated experiments with random initialization. The detailed results can be found in Tables S1 to S3. The best results in every sector are marked in bold. Note that stacking with Rp is better than stacking with RMSE.

We carry out the ablation analysis for our model based on Tables S1 to S3. First, we focus on the effectiveness of one-scale stacking. Looking at the row for each scale separately, one-scale stacking significantly improves the result on PDBbind-v2007, PDBbind-v2013 and PDBbind-v2016, by around 6%, 8% and 8%, respectively. If we apply multi-scale stacking without one-scale stacking, both the average and the best Rp improve by 5–10% over the single-scale results. In contrast, multi-scale stacking after one-scale stacking boosts Rp by only 0–3%. This indicates that one-scale stacking can avoid the information blind spots caused by initialization. Second, we study the effectiveness of multi-scale stacking. Before one-scale stacking, the multi-scale stacking average Rp over 22 random trials is at least 6%, 7% and 6% higher than the single-scale average Rp on the three datasets, respectively. Clearly, different scales carry complementary information. On top of the one-scale stacking results, multi-scale stacking still increases Rp by almost 3%. In comparison, we notice that although one-scale stacking reduces randomness and collects as much information as possible within a single scale, it cannot exceed the best Rp of the randomly initialized multi-scale stacking.

Performance of MP-GNN on SARS-CoV BA dataset

In our MP-GNN model for the SARS-CoV BA dataset, only multi-scale stacking is employed. This is because the training set incorporates the PDBbind-v2019 general set, which has 17652 PDB data entries. As with the PDBbind datasets, the same 10 scale-specific GNN models are considered in our MP-GNN. Since stacking with Rp gives better accuracy, we also use the Rp results on the training set as the weighting scheme. From Table S4, it can be seen that multi-scale stacking improves Rp by 15–24% for the SARS-CoV BA dataset.

Key Points

Our main contributions in this paper are as follows:

• We propose the first multiphysical molecular graph representation. All kinds of molecular interactions, between different atom types and at different scales, are systematically represented by a series of scale-specific and element-specific graphs with distance-related node features.

• We develop the first multiphysical graph neural network (MP-GNN) model. Our MP-GNN is free from the complicated feature generation process. A unique GNN architecture is developed in our MP-GNN to incorporate both scale-specific and element-specific graph information into a consolidated model.

• Our model has achieved state-of-the-art results for protein–ligand binding affinity prediction. It has been found that our model can outperform all existing models, as far as we know.

• Our model is highly accurate for the prediction of complexes of inhibitors for SARS-CoV/SARS-CoV-2. Our model has great potential for COVID-19 drug design.

Code and Data Availability

The code is available at https://github.com/Alibaba-DAMO-DrugAI/MGNN. Additional data or code are available upon reasonable request.

Author contributions statement

K.X. conceived MP-GNN model. K.X., X-S.L. and Y.C. con-

ceived the graph neural network and ensemble learning

architecture. X.L. prepared the input data. X-S.L. and Y.C.

wrote up all algorithm codes and accomplished try-run.

X-S.L. conducted the experiments in large scale and ana-

lyzed the results. X-S.L. refined the network architecture.

K.X. validated the results according to experience. K.X. and X-S.L. wrote up the paper, and all other authors reviewed the manuscript.

Supplementary data

Supplementary data are available online at https://academic.oup.com/bib.

Acknowledgments

This work was supported by Alibaba Innovative Research

(AIR) Program and Alibaba-NTU Singapore Joint Research

Institute grant AN-GC-2020-002, Singapore Ministry of

Education Academic Research fund Tier 1 RG109/19, and

Tier 2 MOE-T2EP20220-0010, MOE-T2EP20120-0013.

References

1. Zhang L, Tan J, Han D, et al. From machine learning to deep

learning: progress in machine intelligence for rational drug

discovery. Drug Discov Today 2017;22(11):1680–5.

2. Lusci A, Pollastri G, Baldi P. Deep architectures and deep learning

in chemoinformatics: the prediction of aqueous solubility for

drug-like molecules. J Chem Inf Model 2013;53(7):1563–75.

3. Pereira JC, Caffarena ER, Nogueira C, et al. Boosting docking-

based virtual screening with deep learning. J Chem Inf Model

2016;56(12):2495–506.

4. Kearnes S, McCloskey K, Berndl M, et al. Molecular graph con-

volutions: moving beyond fingerprints. J Comput Aided Mol Des

2016;30(8):595–608.

5. Gomes J, Ramsundar B, Feinberg EN, et al. Atomic convolutional networks for predicting protein-ligand binding affinity. arXiv preprint arXiv:1703.10603. 2017.

6. Feinberg EN, Sur D, Wu ZQ, et al. Potentialnet for molecular

property prediction. ACS central science 2018;4(11):1520–30.

7. Tsubaki M, Tomii K, Sese J. Compound–protein interaction pre-

diction with end-to-end learning of neural networks for graphs

and sequences. Bioinformatics 2019;35(2):309–18.

8. Li X, Yan X, Qiong G, et al. Deepchemstable: chemical stability

prediction with an attention-based graph convolution network.

J Chem Inf Model 2019;59(3):1044–9.

9. Wang X, Li Z, Jiang M, et al. Molecule property prediction based

on spatial graph embedding. J Chem Inf Model 2019;59(9):3817–28.

10. Szocinski T, Nguyen DD, Wei G-W. AweGNN: Auto-parametrized

weighted element-specific graph neural networks for molecules.

Comput Biol Med 2021;134:104460.

11. Li S, Zhou J, Tong X, et al. (eds). Structure-aware interactive graph

neural networks for the prediction of protein-ligand binding

affinity. In: Proceedings of the 27th ACM SIGKDD Conference on

Knowledge Discovery & Data Mining, 2021, 975–85.

12. Nguyen T, Le H, Quinn TP, et al. GraphDTA: Predicting drug–

target binding affinity with graph neural networks. Bioinformatics

2021;37(8):1140–7.

13. Lin X. DeepGS: Deep representation learning of graphs and sequences for drug-target binding affinity prediction. arXiv preprint arXiv:2003.13902. 2020.

14. Jiang M, Li Z, Zhang S, et al. Drug-target affinity predic-

tion using graph neural network and contact maps. RSC Adv

2020;10(35):20701–12.

15. Wang X, Liu Y, Fan L, et al. Dipeptide frequency of word frequency

and graph convolutional networks for DTA prediction. Front

Bioeng Biotechnol 2020;8:267.

16. Stokes JM, Yang K, Swanson K, et al. A deep learning approach to antibiotic discovery. Cell 2020;180(4):688–702.

17. Liu X, Luo Y, Li P, et al. Deep geometric representations for mod-

eling effects of mutations on protein-protein binding affinity.

PLoS Comput Biol 2021;17(8):e1009284.

18. Gaudelet T, Day B, Jamasb AR, et al. Utilising graph machine learning within drug discovery and development. Brief Bioinform 2021;bbab159.

19. Wei GW. Persistent homology analysis of biomolecular data. J

Comput Phys 2017;305:276–99.

20. Wei GW. Mathematics at the eve of a historic transition in

biology. Computational and Mathematical Biophysics 2017;5(1).

21. Cang ZX, Wei GW. TopologyNet: Topology based deep convolu-

tional and multi-task neural networks for biomolecular prop-

erty predictions. PLoS Comput Biol 2017;13(7):e1005690.

22. Cang ZX, Wei GW. Integration of element specific persistent

homology and machine learning for protein-ligand binding

affinity prediction. In: International journal for numerical methods

in biomedical engineering,page10.1002/cnm.2914, 2017.

23. Nguyen DD, Xiao T, Wang ML, et al. Rigidity strengthening:

A mechanism for protein–ligand binding. J Chem Inf Model

2017;57(7):1715–21.

24. Cang ZX, Wei GW. Integration of element specific persistent

homology and machine learning for protein-ligand binding

affinity prediction. International journal for numerical methods in

biomedical engineering 2018;34(2):e2914.

25. Cang ZX, Mu L, Wei GW. Representability of algebraic topology

for biomolecules in machine learning based scoring and virtual

screening. PLoS Comput Biol 2018;14(1):e1005929.

26. Nguyen DD, Cang ZX, Wu KD, et al. Mathematical deep learning

for pose and binding affinity prediction and ranking in D3R

Grand Challenges. J Comput Aided Mol Des 2019;33(1):71–82.

27. Nguyen DD, Wei GW. AGL-Score: Algebraic graph learning score

for protein-ligand binding scoring, ranking, docking, and screen-

ing. J Chem Inf Model 2019;59(7):3291–304.

28. Nguyen DD, Gao KF, Wang ML, et al. MathDL: Mathematical deep learning for D3R Grand Challenge 4. J Comput Aided Mol Des 2019;1–17.

29. Nguyen DD, Cang ZX, Wei GW. A review of mathematical repre-

sentations of biomolecular data. Phys Chem Chem Phys 2020.

30. Puzyn T, Leszczynski J, Cronin MT. Recent advances in QSAR stud-

ies: methods and applications, Vol. 8. Springer Science & Business

Media, 2010.

31. Lo YC, Rensi SE, Torng W, et al. Machine learning in chemoinfor-

matics and drug discovery. Drug Discov Today 2018;23(8):1538–46.

32. Bajorath J. Chemoinformatics: concepts, methods, and tools for drug

discovery, Vol. 275. Springer Science & Business Media, 2004.

33. Nguyen DD, Gao K, Chen J, et al. Unveiling the molecular mech-

anism of SARS-CoV-2 main protease inhibition from 137 crystal

structures using algebraic topology and deep learning. Chem Sci

2020;11(44):12036–46.

34. Nguyen DD, Wei GW. DG-GL: Differential geometry-based geo-

metric learning of molecular datasets. International journal for

numerical methods in biomedical engineering 2019;35(3):e3179.

35. Liu ZH, Li Y, Han L, et al. PDB-wide collection of binding

data: current status of the PDBbind database. Bioinformatics

2015;31(3):405–12.

36. Wee JJ, Xia K. Forman persistent ricci curvature (FPRC) based

machine learning models for protein-ligand binding affinity

prediction. Briefings in Bioinformatics, in press 2021.

37. Qi CR, Hao S, Mo K, et al. Pointnet: Deep learning on point sets

for 3d classification and segmentation. In: Proceedings of the IEEE

conference on computer vision and pattern recognition, 2017, 652–60.

38. Qi CR, Yi L, Su H, et al. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. arXiv preprint arXiv:1706.02413. 2017.

39. Sagi O, Rokach L. Ensemble learning: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2018;8(4):e1249.

Downloaded from https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbac231/6607747 by NTU Library user on 18 June 2022

View publication statsView publication stats

MPGNN.pdf

Data

October 2023

Xiao-Shuang Li · Xiang Liu · Le Lu · Xian-Sheng Hua · Kelin Xia

Download

A Systematic Review of Graph Neural Network in Healthcare-Based Applications: Recent Advances, Trends, and Future Directions

Article

Full-text available

Jan 2024

Graph neural network (GNN) is a formidable deep learning framework that enables the analysis and modeling of intricate relationships present in data structured as graphs. In recent years, a burgeoning interest has arisen in exploiting the latent capabilities of GNN for healthcare-based applications, capitalizing on their aptitude for modeling complex relationships and unearthing profound insights from graph-structured data. However, to the best of our knowledge, no study has systemically reviewed the GNN studies conducted in the healthcare domain. This study has furnished an all-encompassing and erudite overview of the prevailing cutting-edge research on GNN in healthcare. Through analysis and assimilation of studies, current research trends, recurrent challenges, and promising future opportunities in GNN for healthcare applications have been identified. China emerged as the leading country to conduct GNN-based studies in the healthcare domain, followed by the USA, UK, and Turkey. Among various aspects of healthcare, disease prediction and drug discovery emerge as the most prominent areas of focus for GNN application, indicating the potential of GNN for advancing diagnostic and therapeutic approaches. This study proposed research questions regarding diverse aspects of GNN in the healthcare domain and addressed them through an in-depth analysis. This study can provide practitioners and researchers with profound insights into the current landscape of GNN applications in healthcare and can guide healthcare institutes, researchers, and governments by demonstrating the ways in which GNN can contribute to the development of effective and efficient healthcare systems.

3D-Mol: A Novel Contrastive Learning Framework for Molecular Property Prediction with 3D Information

Article

Full-text available

Jun 2024
PATTERN ANAL APPL

Molecular property prediction, crucial for early drug candidate screening and optimization, has seen advancements with deep learning-based methods. While deep learning-based methods have advanced considerably, they often fall short in fully leveraging 3D spatial information. Specifically, current molecular encoding techniques tend to inadequately extract spatial information, leading to ambiguous representations where a single one might represent multiple distinct molecules. Moreover, existing molecular modeling methods focus predominantly on the most stable 3D conformations, neglecting other viable conformations present in reality. To address these issues, we propose 3D-Mol, a novel approach designed for more accurate spatial structure representation. It deconstructs molecules into three hierarchical graphs to better extract geometric information. Additionally, 3D-Mol leverages contrastive learning for pretraining on 20 million unlabeled data, treating their conformations with identical topological structures as weighted positive pairs and contrasting ones as negatives, based on the similarity of their 3D conformation descriptors and fingerprints. We compare 3D-Mol with various state-of-the-art baselines on 7 benchmarks and demonstrate our outstanding performance.

Data-balanced transformer for accelerated ionizable lipid nanoparticles screening in mRNA delivery

Article

Full-text available

Apr 2024

Despite the widespread use of ionizable lipid nanoparticles (LNPs) in clinical applications for messenger RNA (mRNA) delivery, the mRNA drug delivery system faces an efficient challenge in the screening of LNPs. Traditional screening methods often require a substantial amount of experimental time and incur high research and development costs. To accelerate the early development stage of LNPs, we propose TransLNP, a transformer-based transfection prediction model designed to aid in the selection of LNPs for mRNA drug delivery systems. TransLNP uses two types of molecular information to perceive the relationship between structure and transfection efficiency: coarse-grained atomic sequence information and fine-grained atomic spatial relationship information. Due to the scarcity of existing LNPs experimental data, we find that pretraining the molecular model is crucial for better understanding the task of predicting LNPs properties, which is achieved through reconstructing atomic 3D coordinates and masking atom predictions. In addition, the issue of data imbalance is particularly prominent in the real-world exploration of LNPs. We introduce the BalMol block to solve this problem by smoothing the distribution of labels and molecular features. Our approach outperforms state-of-the-art works in transfection property prediction under both random and scaffold data splitting. Additionally, we establish a relationship between molecular structural similarity and transfection differences, selecting 4267 pairs of molecular transfection cliffs, which are pairs of molecules that exhibit high structural similarity but significant differences in transfection efficiency, thereby revealing the primary source of prediction errors. The code, model and data are made publicly available at https://github.com/wklix/TransLNP.

Modern machine‐learning for binding affinity estimation of protein–ligand complexes: Progress, opportunities, and challenges

Article

Jun 2024

Structure‐based drug design is a widely applied approach in the discovery of new lead compounds for known therapeutic targets. In most structure‐based drug design applications, the docking procedure is considered the crucial step. Here, a potential ligand is fitted into the binding site, and a scoring function assesses its binding capability. With the rise of modern machine‐learning in drug discovery, novel scoring functions using machine‐learning techniques achieved significant performance gains in virtual screening and ligand optimization tasks on retrospective data. However, real‐world applications of these methods are still limited. Missing success stories in prospective applications are one reason for this. Additionally, the fast‐evolving nature of the field makes it challenging to assess the advantages of each individual method. This review will highlight recent strides toward improved real world applicability of machine‐learning based scoring, enabling a better understanding of the potential benefits and pitfalls of these functions on a project. Furthermore, a systematic way of classifying machine‐learning based scoring that facilitates comparisons will be presented. This article is categorized under: Data Science > Chemoinformatics Data Science > Artificial Intelligence/Machine Learning Software > Molecular Modeling

The Millennia-Long Development of Drugs Associated with the 80-Year-Old Artificial Intelligence Story: The Therapeutic Big Bang?

Article

Full-text available

Jun 2024
MOLECULES

The journey of drug discovery (DD) has evolved from ancient practices to modern technology-driven approaches, with Artificial Intelligence (AI) emerging as a pivotal force in streamlining and accelerating the process. Despite the vital importance of DD, it faces challenges such as high costs and lengthy timelines. This review examines the historical progression and current market of DD alongside the development and integration of AI technologies. We analyse the challenges encountered in applying AI to DD, focusing on drug design and protein–protein interactions. The discussion is enriched by presenting models that put forward the application of AI in DD. Three case studies are highlighted to demonstrate the successful application of AI in DD, including the discovery of a novel class of antibiotics and a small-molecule inhibitor that has progressed to phase II clinical trials. These cases underscore the potential of AI to identify new drug candidates and optimise the development process. The convergence of DD and AI embodies a transformative shift in the field, offering a path to overcome traditional obstacles. By leveraging AI, the future of DD promises enhanced efficiency and novel breakthroughs, heralding a new era of medical innovation even though there is still a long way to go.

Optimal decision-making in relieving global high temperature-related disease burden by data-driven simulation

Article

Mar 2024

The rapid acceleration of global warming has led to an increased burden of high temperature-related diseases (HTDs), highlighting the need for advanced evidence-based management strategies. We have developed a conceptual framework aimed at alleviating the global burden of HTDs, grounded in the One Health concept. This framework refines the impact pathway and establishes systematic data-driven models to inform the adoption of evidence-based decision-making, tailored to distinct contexts. We collected extensive national-level data from authoritative public databases for the years 2010–2019. The burdens of five categories of disease causes – cardiovascular diseases, infectious respiratory diseases, injuries, metabolic diseases, and non-infectious respiratory diseases – were designated as intermediate outcome variables. The cumulative burden of these five categories, referred to as the total HTD burden, was the final outcome variable. We evaluated the predictive performance of eight models and subsequently introduced twelve intervention measures, allowing us to explore optimal decision-making strategies and assess their corresponding contributions. Our model selection results demonstrated the superior performance of the Graph Neural Network (GNN) model across various metrics. Utilizing simulations driven by the GNN model, we identified a set of optimal intervention strategies for reducing disease burden, specifically tailored to the seven major regions: East Asia and Pacific, Europe and Central Asia, Latin America and the Caribbean, Middle East and North Africa, North America, South Asia, and Sub-Saharan Africa. Sectoral mitigation and adaptation measures, acting upon our categories of Infrastructure & Community, Ecosystem Resilience, and Health System Capacity, exhibited particularly strong performance for various regions and diseases. 
Seven out of twelve interventions were included in the optimal intervention package for each region, including raising low-carbon energy use, increasing energy intensity, improving livestock feed, expanding basic health care delivery coverage, enhancing health financing, addressing air pollution, and improving road infrastructure. The outcome of this study is a global decision-making tool, offering a systematic methodology for policymakers to develop targeted intervention strategies to address the increasingly severe challenge of HTDs in the context of global warming.

Multi-hop Attention-based Graph Pooling: A Personalized PageRank Perspective

Conference Paper

Mar 2024

Over the past ten years, graph representation learning has garnered a lot of attention due to the variety of graph-structured data and its efficiency in both time and space. One essential method for obtaining effective graph representations is graph pooling. Numerous studies on the graph pooling technique have been conducted. Cutting-edge results on a range of graph representation learning tasks were made possible by the combination of graph neural networks and self-attention mechanisms. Nevertheless, the attention mechanism has limitations since it ignores nodes that have no direct connection via an edge but provide valuable network context information. This paper proposes a graph pooling approach based on Personalized PageRank and self-attention, which improves the model to take into account both node properties and graph structure. The experimental findings indicate that, with a suitable number of parameters, the MAGPool approach delivers greater accuracy on the benchmark datasets.

Identification of D- and L-phenylalanine enantiomeric mixtures by employing deep neural network models

Article

Feb 2024
J MOL STRUCT

Predicting Protein-Ligand Binding Affinity with Multi-Scale Structural Features

Conference Paper

Dec 2023

Harness the power of atomistic modeling and deep learning in biofuel separation

Chapter

Jan 2023

Deep geometric representations for modeling effects of mutations on protein-protein binding affinity

Article

Aug 2021
PLOS COMPUT BIOL

Modeling the impact of amino acid mutations on protein-protein interaction plays a crucial role in protein engineering and drug design. In this study, we develop GeoPPI, a novel structure-based deep-learning framework to predict the change of binding affinity upon mutations. Based on the three-dimensional structure of a protein, GeoPPI first learns a geometric representation that encodes topology features of the protein structure via a self-supervised learning scheme. These representations are then used as features for training gradient-boosting trees to predict the changes of protein-protein binding affinity upon mutations. We find that GeoPPI is able to learn meaningful features that characterize interactions between atoms in protein structures. In addition, through extensive experiments, we show that GeoPPI achieves new state-of-the-art performance in predicting the binding affinity changes upon both single- and multi-point mutations on six benchmark datasets. Moreover, we show that GeoPPI can accurately estimate the difference of binding affinities between a few recently identified SARS-CoV-2 antibodies and the receptor-binding domain (RBD) of the S protein. These results demonstrate the potential of GeoPPI as a powerful and useful computational tool in protein design and engineering. Our code and datasets are available at: https://github.com/Liuxg16/GeoPPI.

Utilizing graph machine learning within drug discovery and development

Article

May 2021

Graph machine learning (GML) is receiving growing interest within the pharmaceutical and biotechnology industries for its ability to model biomolecular structures and the functional relationships between them, and to integrate multi-omic datasets, amongst other data types. Herein, we present a multidisciplinary academic-industrial review of the topic within the context of drug discovery and development. After introducing key terms and modelling approaches, we move chronologically through the drug development pipeline to identify and summarize work incorporating: target identification, design of small molecules and biologics, and drug repurposing. Whilst the field is still emerging, key milestones, including repurposed drugs entering in vivo studies, suggest GML will become a modelling framework of choice within biomedical machine learning.

Unveiling the molecular mechanism of SARS-CoV-2 main protease inhibition from 137 crystal structures using algebraic topology and deep learning

Article

Sep 2020

Currently, there are neither effective antiviral drugs nor vaccines for coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Due to its high degree of conservation and low similarity with human genes, SARS-CoV-2 main protease (Mpro) is one of the most favorable drug targets. However, the current understanding of the molecular mechanism of Mpro inhibition is limited by the lack of reliable binding affinity ranking and prediction of existing structures of Mpro–inhibitor complexes. This work integrates mathematics (i.e., algebraic topology) and deep learning (MathDL) to provide a reliable ranking of the binding affinities of 137 SARS-CoV-2 Mpro inhibitor structures. We reveal that the Gly143 residue in Mpro is the most attractive site to form hydrogen bonds, followed by Glu166, Cys145, and His163. We also identify 71 targeted covalent bonding inhibitors. MathDL was validated on the PDBbind v2016 core set benchmark and a carefully curated SARS-CoV-2 inhibitor dataset to ensure the reliability of the present binding affinity prediction. The present binding affinity ranking, interaction analysis, and fragment decomposition offer a foundation for future drug discovery efforts.

Drug–target affinity prediction using graph neural network and contact maps

Article

Jun 2020

Computer-aided drug design uses high-performance computers to simulate the tasks in drug design, which is a promising research area. Drug–target affinity (DTA) prediction is the most important step of computer-aided drug design, which could speed up drug development and reduce resource consumption. With the development of deep learning, the introduction of deep learning to DTA prediction and improving the accuracy have become a focus of research. In this paper, utilizing the structural information of molecules and proteins, two graphs of drug molecules and proteins are built up respectively. Graph neural networks are introduced to obtain their representations, and a method called DGraphDTA is proposed for DTA prediction. Specifically, the protein graph is constructed based on the contact map output from the prediction method, which could predict the structural characteristics of the protein according to its sequence. It can be seen from the test of various metrics on benchmark datasets that the method proposed in this paper has strong robustness and generalizability.

Dipeptide Frequency of Word Frequency and Graph Convolutional Networks for DTA Prediction

Article

Apr 2020

Deep learning is an effective method to capture drug-target binding affinity, but low accuracy is still an obstacle to be overcome. Thus, we propose a novel predictor for drug-target binding affinity based on dipeptide frequency of word frequency encoding and a hybrid graph convolutional network. Word frequency characteristics of natural language are used to improve the frequency characteristics of peptides to express target proteins. For each drug molecule, the five different features of drug atoms and the atomic bond relationships are expressed as graphs. The obtained protein features and graph structure are used as the input of a convolutional neural network and the input of a graph convolutional neural network, respectively. A prediction model is established to predict the drug affinity by calculating the hidden relationship. In the KIBA data set test experiment, the consistency coefficient of the model is 0.901, which is 0.01 higher than the existing model, and the MSE (mean square error) of the model is 0.126, which is 5% lower than the existing model. In the Davis data set test experiment, the consistency coefficient of the model is 0.895, which is 0.006 higher than the existing model, and the MSE of the model is 0.220, which is 4% lower than the existing model. These results show that our proposed method can not only predict the affinity better than those existing models, but also outperform unitary deep learning approaches.

Structure-aware Interactive Graph Neural Networks for the Prediction of Protein-Ligand Binding Affinity

Conference Paper

Aug 2021

AweGNN: Auto-parametrized weighted element-specific graph neural networks for molecules

Article

May 2021
COMPUT BIOL MED

While automated feature extraction has had tremendous success in many deep learning algorithms for image analysis and natural language processing, it does not work well for data involving complex internal structures, such as molecules. Data representations via advanced mathematics, including algebraic topology, differential geometry, and graph theory, have demonstrated superiority in a variety of biomolecular applications; however, their performance is often dependent on manual parametrization. This work introduces the auto-parametrized weighted element-specific graph neural network, dubbed AweGNN, to overcome the obstacle of this tedious parametrization process while also being a suitable technique for automated feature extraction on these internally complex biomolecular data sets. The AweGNN is a neural network model based on geometric-graph features of element-pair interactions, with its graph parameters being updated throughout the training, which results in what we call a network-enabled automatic representation (NEAR). To enhance the predictions with small data sets, we construct multi-task (MT) AweGNN models in addition to single-task (ST) AweGNN models. The proposed methods are applied to various benchmark data sets, including four data sets for quantitative toxicity analysis and another data set for solvation prediction. Extensive numerical tests show that AweGNN models can achieve state-of-the-art performance in molecular property predictions.

Forman persistent Ricci curvature (FPRC)-based machine learning models for protein–ligand binding affinity prediction

Article

May 2021

Artificial intelligence (AI) techniques have gradually been applied to the entire drug design process, from target discovery, lead discovery, lead optimization and preclinical development to the final three phases of clinical trials. Currently, one of the central challenges for AI-based drug design is molecular featurization, which is to identify or design appropriate molecular descriptors or fingerprints. Efficient and transferable molecular descriptors are key to the success of all AI-based drug design models. Here we propose, for the first time, Forman persistent Ricci curvature (FPRC)-based molecular featurization and feature engineering. Molecular structures and interactions are modeled as simplicial complexes, which are generalizations of graphs to their higher dimensional counterparts. Further, a multiscale representation is achieved through a filtration process, during which a series of nested simplicial complexes at different scales are generated. Forman Ricci curvatures (FRCs) are calculated on the series of simplicial complexes, and the persistence and variation of FRCs during the filtration process are defined as FPRC. Moreover, persistent attributes, which are FPRC-based functions and properties, are employed as molecular descriptors, and combined with machine learning models, in particular, gradient boosting tree (GBT). Our FPRC-GBT models are extensively trained and tested on the three most commonly used datasets, namely PDBbind-2007, PDBbind-2013 and PDBbind-2016. It has been found that our results are better than the ones from machine learning models with traditional molecular descriptors.

GraphDTA: Predicting drug–target binding affinity with graph neural networks

Article

Oct 2020

The development of new drugs is costly, time consuming, and often accompanied by safety issues. Drug repurposing can avoid the expensive and lengthy process of drug development by finding new uses for already approved drugs. In order to repurpose drugs effectively, it is useful to know which proteins are targeted by which drugs. Computational models that estimate the interaction strength of new drug–target pairs have the potential to expedite drug repurposing. Several models have been proposed for this task. However, these models represent the drugs as strings, which is not a natural way to represent molecules. We propose a new model called GraphDTA that represents drugs as graphs and uses graph neural networks to predict drug–target affinity. We show that graph neural networks not only predict drug–target affinity better than non-deep learning models, but also outperform competing deep learning methods. Our results confirm that deep learning models are appropriate for drug–target binding affinity prediction, and that representing drugs as graphs can lead to further improvements. Availability of data and materials: The proposed models are implemented in Python. Related data, pre-trained models, and source code are publicly available at https://github.com/thinng/GraphDTA. All scripts and data needed to reproduce the post-hoc statistical analysis are available from https://doi.org/10.5281/zenodo.3603523.

A Deep Learning Approach to Antibiotic Discovery

Article

Apr 2020
CELL

Due to the rapid emergence of antibiotic-resistant bacteria, there is a growing need to discover new antibiotics. To address this challenge, we trained a deep neural network capable of predicting molecules with antibacterial activity. We performed predictions on multiple chemical libraries and discovered a molecule from the Drug Repurposing Hub—halicin—that is structurally divergent from conventional antibiotics and displays bactericidal activity against a wide phylogenetic spectrum of pathogens including Mycobacterium tuberculosis and carbapenem-resistant Enterobacteriaceae. Halicin also effectively treated Clostridioides difficile and pan-resistant Acinetobacter baumannii infections in murine models. Additionally, from a discrete set of 23 empirically tested predictions from >107 million molecules curated from the ZINC15 database, our model identified eight antibacterial compounds that are structurally distant from known antibiotics. This work highlights the utility of deep learning approaches to expand our antibiotic arsenal through the discovery of structurally distinct antibacterial molecules. A trained deep neural network predicts antibiotic activity in molecules that are structurally different from known antibiotics, among which Halicin exhibits efficacy against broad-spectrum bacterial infections in mice.

Recommended publications

Molecular persistent spectral image (Mol-PSI) representation for machine learning models in drug des...

Persistent spectral hypergraph based machine learning (PSH-ML) for protein-ligand binding affinity p...

Neighborhood Complex Based Machine Learning (NCML) Models for Drug Design

Persistent spectral–based machine learning (PerSpect ML) for protein-ligand binding affinity predict...