From UK-2A to florylpicoxamid: Active learning to identify a mimic of a macrocyclic natural product

Authors:
  • BioPharmics Division, Optibrium Ltd.
  • BioPharmics LLC

Journal of Computer-Aided Molecular Design (2024) 38:19
https://doi.org/10.1007/s10822-024-00555-3
AnnE.Cleves1 · AjayN.Jain1 · DavidA.Demeter2· ZacharyA.Buchan2 · JeremyWilmot2· ErinN.Hancock2
Received: 2 January 2024 / Accepted: 26 February 2024
© The Author(s) 2024
Abstract
Scaffold replacement as part of an optimization process that requires maintenance of potency, desirable biodistribution,
metabolic stability, and considerations of synthesis at very large scale is a complex challenge. Here, we consider a set of over
1000 time-stamped compounds, beginning with a macrocyclic natural-product lead and ending with a broad-spectrum crop
anti-fungal. We demonstrate the application of the QuanSA 3D-QSAR method employing an active learning procedure that
combines two types of molecular selection. The first identifies compounds predicted to be most active of those most likely
to be well-covered by the model. The second identifies compounds predicted to be most informative based on exhibiting low
predicted activity but showing high 3D similarity to a highly active nearest-neighbor training molecule. Beginning with just
100 compounds, using a deterministic and automatic procedure, five rounds of 20-compound selection and model refinement
identify the binding metabolic form of florylpicoxamid. We show how iterative refinement broadens the domain of
applicability of the successive models while also enhancing predictive accuracy. We also demonstrate how a simple method
requiring very sparse data can be used to generate relevant ideas for synthetic candidates.
Keywords Active-learning · QuanSA · Affinity prediction · Macrocycles
Introduction
Natural products (NPs) have been used as inspiration for
crop protection active ingredients. However, it is often the
case that structural features of NPs, such as macrocycles and
multiple chiral centers, limit their use due to the expense of
industrial-scale synthesis. Figure1 shows the structure of
UK-2A (left side), a natural product with excellent invitro
inhibition of mitochondrial electron transport (MET) com-
plex III via binding to the Q
i
site of cytochrome b [1]. Activ-
ity values were determined by an invitro MET binding assay
and expressed here as pIC
50
. Protection of the 3-pyridinol
with an isobuytryloxymethyl group improved in planta anti-
fungal performance, with the unprotected binding metabolite
being readily produced. Figure1 shows the unprotected form
of florylpicoxamid (right side, “FPX”), whose 3-pyridinol
protected precursor has been shown to be a highly effec-
tive crop protection fungicide [1]. FPX has two fewer chiral
centers, no macrocycle, and is fully synthetic, not requir-
ing starting materials from fermentation processes. The
development of FPX followed a design strategy of stepwise
deconstruction of a macrocyclic natural product, requiring
many hundreds of synthetic analogs along with invitro and
in planta assays.
Here, we investigate the degree to which an active-learning
approach for activity prediction could be used to vastly
reduce the number of synthetic analogs required in such an
effort. Ligand activity prediction continues to be a challenge
for computer-aided drug design, especially in the case where
there is no suitable high-resolution experimental structure of
the target of interest, as is the case here. An additional challenge
here is the presence of flexible macrocyclic ligands.
Over the past several years, methods for computational
modeling of macrocyclic ligands have made significant progress
[2–7]. In particular, natural-product based and semi-synthetic
macrocycles of up to roughly 21–23 total rotatable
bonds (including both macrocyclic bonds and exocyclic
bonds) have been shown to be tractable, in terms of accuracy
and speed of conformational search when utilizing multiple
computing cores [7]. However, larger peptidic macrocycles
remain challenging, often requiring biophysical data (e.g.
from NMR) to help restrain the conformational space to be
explored [8]. Generally, the macrocycles studied here fell
well within the tractable range of the ForceGen methodology [7].
* Ann E. Cleves, ann@optibrium.com
* Erin N. Hancock, erin.hancock@corteva.com
1 BioPharmics Division, Optibrium Limited, Cambridge CB25 9GL, UK
2 Corteva Agriscience, Indianapolis, IN 46268, USA
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Machine learning approaches have seen a recent resur-
gence in their applications within the CADD field, in part
driven by advances in deep-learning methodologies. A
recent review highlights a number of successful applications
as well as limitations [9], with further context provided by
a full book treatment [10]. With respect to binding affinity
prediction in the context of lead optimization, a critical fac-
tor is that such methods typically require thousands of data
points in order to learn effectively, because of the need to
develop encoded internal representations that meaningfully
capture the important aspects required for prediction. Early-
stage lead optimization may involve just dozens of assayed
molecules within a newly discovered chemical series, and
even mid- to late-stage projects may be limited to hundreds
or up to a few thousand data points. The recently introduced
QuanSA machine-learning method (Quantitative Surface-
field Analysis) differs from the deep-learning paradigm and
from historically widely used methods [11, 12] in ways that
make it applicable even in early-stage lead optimization.
The central difference is that rather than applying a
generic machine-learning approach to an input molecular
representation divorced from a binding event, QuanSA
builds a physically interpretable model that is analogous to
a protein binding site. By doing so, it addresses the problem
of ligand conformation and alignment fully automatically,
and it moves in the direction of causal modeling, where the
requirement for training data can be reduced. The method
constructs a non-linear “pocket-field” that is still physical in
nature, and which is directly related to the functional form of
scoring functions for docking [13, 14]. QuanSA pocket-field
models mirror key physical phenomena that are observed in
protein-ligand interactions [15]: (1) choice of ligand poses
is defined by the model; (2) non-additive (or even anti-
additive) effects of substituent changes on a central scaffold
can be modeled effectively; (3) changes in ligand structures
induce changes in predicted ligand poses; and (4) the model
of molecular activity is dependent on the detailed shape of
ligands. Nearly all QSAR and deep-learning methods ignore
some or all of these aspects of protein-ligand interactions.
Additional discussion of the theoretical contrasts between
the QuanSA multiple-instance learning approach and other
QSAR (3D and 2D) approaches can be found in the papers
introducing the method [11, 12] along with the antecedent
QMOD [16] and Compass [17–19] approaches, the latter
of which introduced the multiple-instance machine-learning
paradigm [20].
Figure2 depicts the overall scheme of the study. Begin-
ning with the earliest 100 molecules and activity data
(MET pIC
50
), a QuanSA model was induced, guided by a
hypothesis of how a small set of diverse active ligands were
mutually aligned. A set of “future” molecules that had been
made on the way to (and including) Mol-1109 (the bind-
ing metabolite of florylpicoxamid: FPX) were then scored
using the model. The scoring procedure predicts activity
and bound ligand pose along with estimates of the degree to
which each molecule is well-covered by the model. The top
10 molecules with highest predicted activity among those
well-covered were selected for “synthesis.” In addition, the
top 10 molecules expected to be most informative were also
selected. Those 20 molecules were then used, along with
their experimental activity values, to refine the model, mov-
ing those 20 from the test set to the train set, and this process
was repeated (see the blue arrows in Fig. 2 for the refinement
loop). The choice of informative molecules combines two
criteria for a given molecule: (1) it must have a highly active
training molecule as its nearest-neighbor in its QuanSA-pre-
dicted pose; and (2) it must be predicted to have relatively
low activity. Simply put, the informative molecules are sur-
prising: they look a lot like highly active molecules but are
predicted to have poor activity.
In what follows, we show that the process of iterative
model refinement drastically reduces the number of analogs
required compared with what happened during the actual
project. Successive models became progressively broader
in terms of structural coverage and more accurate in their
predictions. Separate from the activity prediction problem
is the question of how one can generate synthetic candidate
ideas that lead in a desired direction. We show how highly
relevant analog ideas can be automatically generated using
only a small number of compounds and potential pendant
groups. The computational strategy presented here should
have broad applicability in the common case where scaffold
replacement is required and structure-activity data are
limited and expensive to augment.
Fig. 1 The starting natural product UK-2A is shown (left) along with the binding metabolic form of florylpicoxamid, the final crop protection fungicide
Software, computational protocols, and a subset of struc-
ture-activity data discussed in this paper are available to
other researchers (see Declarations section).
Results anddiscussion
We report results for iterative model refinement leading from
the natural product antifungal UK-2A to FPX, beginning
with a systematic procedure for identifying an informative
multiple-ligand alignment and then proceeding through mul-
tiple rounds of QuanSA model refinement using an active
learning strategy. We also detail a method to generate non-
macrocyclic candidate compounds using very sparse data by
combining virtual-screening-based central scaffold replace-
ment with a simple method to “staple” appropriate substitu-
ents onto the replacement scaffolds.
Initial multiple‑ligand alignment
The QuanSA methodology derives a pocket-field beginning
from an initial mutual alignment of a set of training ligands
[11, 12], where each ligand has multiple possible initial
poses. When protein structure information is available, it is
possible to make use of the experimentally determined rela-
tive poses of prior known bound ligands in order to guide
the construction of the initial set of training poses. Here,
no such suitable protein co-crystal structure existed. Rather
than using crystallographic data, it is also possible to make
use of a carefully constructed multiple ligand alignment
to guide model-building. In cases where scaffold diversity
exists among highly active molecules, such alignments can
provide significant constraints on the overall ligand align-
ment problem.
Fig. 2 Scheme for iterative model refinement using temporally sorted structure-activity data from lead optimization
Here, the initial set of active project compounds contained
significant diversity, both within the central macrocycle
as well as in the pendant functionality. Figure 3 shows
the procedure used to identify a high-quality ligand-based
binding site hypothesis using only the data from the earliest
set of synthesized molecules. There are two key ideas: (1)
to identify structurally diverse active ligands from which
to produce multiple-ligand alignments; and (2) to select
which of the alternative hypotheses of relative bound poses
is quantitatively the best. The 30 molecules from within the
top 1.0 log unit of experimentally determined activity among
the training molecules were used as input to identify the four
most 2D structurally diverse compounds (molecules 13, 89,
2, and 64 in Fig. 3). They were selected automatically based
on 2D dissimilarity (see the “Methods and data” section for
details).
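The selection of a few structurally diverse actives from the most potent training molecules can be illustrated with a greedy max-min picker. This is only a sketch under stated assumptions: the text above does not specify the 2D descriptor or the picking algorithm, so Tanimoto distance over integer feature sets, the max-min rule, seeding with the most active molecule, and the names `tanimoto` and `pick_diverse_actives` are all hypothetical choices made for illustration.

```python
from typing import Dict, FrozenSet, List


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto similarity between two sets of 2D feature identifiers."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def pick_diverse_actives(
    fingerprints: Dict[str, FrozenSet[int]],
    pic50: Dict[str, float],
    k: int = 4,
    window: float = 1.0,
) -> List[str]:
    """Greedy max-min selection among molecules within `window` log units
    of the most active molecule (assumed picking rule, not the paper's)."""
    best = max(pic50.values())
    # Candidate pool: molecules in the top `window` log units of activity.
    pool = sorted(m for m in fingerprints if pic50[m] >= best - window)
    # Assumption: seed the selection with the most active molecule.
    chosen = [max(pool, key=lambda m: pic50[m])]
    while len(chosen) < min(k, len(pool)):
        rest = [m for m in pool if m not in chosen]
        # Pick the molecule whose closest already-chosen neighbor is farthest.
        nxt = max(
            rest,
            key=lambda m: min(
                1.0 - tanimoto(fingerprints[m], fingerprints[c]) for c in chosen
            ),
        )
        chosen.append(nxt)
    return chosen
```

In the study, the pool would correspond to the 30 molecules within 1.0 log unit of the top activity and `k` to the four diverse compounds chosen.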
These molecules (to the right of UK-2A in Fig. 3) differed
in terms of size and flexibility within the central macrocycle
as well as the composition of the right-hand substituents.
They were used, with the addition of UK-2A, as input to
the multiple-ligand alignment functionality of the eSim
method [21], which resulted in several alternative mutual
superimpositions. In order to assess which mutual alignment
was most likely to reflect the true relative poses of the
molecules, the alternative alignments were ranked based on
their ability to separate highly active molecules from relatively
inactive ones within the initial 100-molecule training
set. The chosen hypothesis shown in Fig. 3 was able
to distinguish highly active (pIC50 ≥ 8.5) from less active
(pIC50 ≤ 7.5) compounds with an ROC area of 0.92. The
3D joint superimposition shows the tight alignment of the
common left-hand moiety (the “warhead”) with the variation
in the macrocycle and right-hand elements of the molecules.
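Ranking alternative alignment hypotheses by how well they separate highly active from less active training molecules amounts to computing an ROC area per hypothesis. The sketch below assumes that each hypothesis yields one numeric score per training molecule (for example, a similarity to the hypothesis); that scoring step, the data layout, and the function names `roc_area` and `rank_hypotheses` are assumptions for illustration, while the 8.5 and 7.5 activity cutoffs mirror the text.

```python
from typing import Dict, List, Sequence, Tuple


def roc_area(pos: Sequence[float], neg: Sequence[float]) -> float:
    """ROC area: probability that a random positive outscores a random
    negative, counting ties as one half."""
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def rank_hypotheses(
    scores_by_hypothesis: Dict[str, Dict[str, float]],
    pic50: Dict[str, float],
    active_min: float = 8.5,
    inactive_max: float = 7.5,
) -> List[Tuple[str, float]]:
    """Return (hypothesis, ROC area) pairs, best separation first.
    Molecules with activity between the two cutoffs are ignored."""
    ranked = []
    for name, scores in scores_by_hypothesis.items():
        pos = [s for m, s in scores.items() if pic50[m] >= active_min]
        neg = [s for m, s in scores.items() if pic50[m] <= inactive_max]
        ranked.append((name, roc_area(pos, neg)))
    return sorted(ranked, key=lambda item: item[1], reverse=True)
```

An ROC area of 0.5 would indicate no separation and 1.0 perfect separation, which puts the reported value of 0.92 in context.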
Iterative model refinement
The chosen multiple-ligand alignment from Fig. 3 was
used to guide construction of the initial QuanSA model
pocket-field. The method allows for incremental iterative
refinement based on the availability of new structure-activity
data. Figure4 shows examples of molecules automatically
selected by QuanSA for model refinement based on expec-
tations of high activity (left side) or based on expectations
of being informative (right side) through multiple rounds
of compound selection and model refinement. Intuitively,
selection of candidate molecules based on predictions of
high activity is an obvious strategy. In an active-learning
paradigm, one also seeks to identify maximally informative
molecules [22]. One representative example of each type of
selection is shown for each of the first four rounds.
The process of scoring candidate molecules in a QuanSA
pocket-field results in a prediction of activity and bound
pose, along with a number of prediction quality metrics.
The novelty metric characterizes the degree to which a can-
didate molecule is well-covered by the current set of train-
ing molecules. Candidate molecule predictions also indicate
which training molecule was the nearest-neighbor (NN) in a
3D molecular similarity calculation based on the predicted
bound pose.
Fig. 3 Procedure for identifying a high-quality ligand-based binding site hypothesis
Here, in each round, the 200 least novel (i.e. best covered)
predicted candidate molecules were identified, and,
of this subset, the top ten with highest predicted activity
were selected for model refinement (see left-hand examples
from Fig. 4). The maximally informative set of ten for
each round captured a group of molecules that could be
thought of as having unexpectedly low activity. Informative
molecules were identified from the subset whose NN
training molecule similarity was high (top 100 highest
NN similarity or NN similarity ≥ 0.85) and where the NN
training molecule’s activity was also high (pIC50 ≥ 8.5).
From that subset, the ten molecules with the lowest predicted
activity were selected (see right-hand examples
from Fig. 4).
In the early rounds, the compounds predicted to be highly
active all had a central macrocyclic scaffold that was found
among the most highly active training compounds, as would
be expected given the starting point of lead optimization.
However, after three rounds of model refinement (a cumula-
tive addition of 60 molecules to the original model), a non-
macrocycle was correctly identified and chosen as a highly
active molecule (Fig.4, lower left).
In contrast, the compounds predicted to be maxi-
mally informative included non-macrocycles even in the
initial round of candidate selection. These compounds were
deemed to be information rich: the predicted activities were
low, yet these candidate molecules had very high 3D simi-
larity to highly active train compounds. Model evolution
through inclusion of these informative compounds broad-
ened structural coverage sufficiently that a non-macrocycle
was predicted to be highly active by Round-03 (bottom of
Fig.4).
Round‑00: Initial model building and selection
QuanSA model building begins with an initialization step
that produces training molecule alignments. Here, guided
by the multiple-ligand alignment shown in Fig. 3, five
alternative initial alignments were produced. Having been
driven by the same mutual alignment hypothesis, these initial
training molecule alignments differed only slightly, but
each was used to build a separate QuanSA model. Selection
from among alternative models can be done based on
statistics derived from the alternative models. These include:
(1) model parsimony, which is a quantitative measure of the
extent to which molecules with similar activity values have
similar predicted poses; (2) Kendall’s Tau for the full re-fitting
of training molecules into a derived pocket-field; and
(3) the mean unsigned error (MUE) of the re-fit molecules.
The alternative quality values are transformed into probabilistic
values, and their product reflects the combination of
the different metrics. Here, the selected model exhibited a
parsimony of 0.63, Kendall’s Tau of 0.87 (CI 0.82–0.91; p < 10⁻⁴)
and MUE of 0.30 (CI 0.25–0.35).
Fig. 4 Example molecules chosen for model refinement in successive rounds of QuanSA testing and refinement
Figure5 shows two representative examples from Round-
00 for each selection type of candidate molecule. At left
(salmon) are the predicted poses for two molecules among
the ten predicted most active. As might have been expected,
these test molecules have a macrocyclic scaffold in com-
mon with the most active training ligand. Also, the right-
hand substituents largely occupy the same space as those of
UK-2A. Although the activity predictions for compounds
Mol-0273 and Mol-0496 were high, these molecules fell
within the top 13% and 3%, respectively, of experimental
activity within the full future set of 1009.
At right (yellow) are the predicted poses of two molecules
predicted to be among the ten most informative candidates.
The poses of the test molecules are shown relative to the
pose of training molecule UK-2A (green). These four exam-
ples are among the twenty molecules selected to refine the
current training model. In contrast to the molecules chosen
based on high predicted activity, the molecules chosen to be
most informative in Round-00 included four non-macrocycles
out of the ten chosen (two examples are shown in Fig. 5
at right). Importantly, the predicted 3D alignments compared
with that of UK-2A (green) show the new scaffolds in tight
congruence to the lower half of the UK-2A macrocycle.
Also, the right-hand moieties of the informative molecules
had significant surface overlap with those of UK-2A.
Overall, for the 10 predicted to be most active in Round-00,
the MUE was quite high (1.7 pIC50 units), but, interestingly,
these were all overpredictions. The predicted activity
values exceeded even the maximal experimental activity of
the most potent training molecule. This characteristic is not
typically seen with traditional machine-learning approaches.
With most statistical machine-learning methods and deep-
learning methods, implicit or explicit modeling of the prior
probability of observing a particular prediction value makes
out-of-range predictions rare. The capacity for such extrapolation is a strength of moving
toward a more causal type of predictive model where, for
example, the combination of different aspects of multiple
active molecules into a new candidate might lead to an out-
of-range prediction. Particularly early-on in lead optimiza-
tion, synthesis of candidate molecules that push the potency
envelope is desirable.
Rounds 01‑04: Refinement with active learning
Figure6 shows examples of selected molecules for Round-
01 and Round-02. Those compounds predicted to be most
active retained macrocyclic scaffolds in both rounds, but
they showed show some additional diversity in the right-
hand hydrophobic groups, with alkyl chains aligning to the
benzene moiety of UK-2A (see molecules Mol-0761 and
Mol-0415). Also, the the nominal actives were more accu-
rately predicted than for Round-00. For Round-01, the MUE
was 1.4 pIC
50
units. For Round-02, it was 0.9 pIC
50
units,
nearly 50% lower than for Round-00, indicating significant
refinement in the detailed modeling of a subset of highly
active ligands.
Those candidates predicted to be most informative
for rounds 01 and 02 contained a higher proportion of
non-macrocycles than before. Round-01 had 7/10 non-macrocyclic
candidate scaffolds, and Round-02 had 9/10
non-macrocyclic scaffolds (see examples in Fig. 6, right, yellow).
Alternative branching topologies were seen among
the informative candidates as well as novel pendant groups.
The flexible thio-ether linkages in compounds Mol-0174 and
Mol-0141 still allowed the terminal aromatic rings to overlay
well with the corresponding functionality of UK-2A.
Fig. 5 Selections of active (left, predicted poses in salmon carbons) and informative molecules (right, yellow carbons) for Round-00 shown against the predicted pose of UK-2A
One of the challenges in providing computational guidance
for synthetic candidate prioritization is having a meaningful
explanatory basis for predictions. As shown in Fig. 6,
the optimal poses that come out of the fitting process into the
quantitative pocket-fields offer convincing correspondence
between predictions and known SAR. This is preferable to
black-box predictions or those that may yield some explana-
tory information but do not provide a physically meaningful
interpretation.
Figure7 shows examples of the selected molecules
for rounds 03 and 04. By Round-03, when only 60 previ-
ous future molecules had been used to refine the original
100-compound model, three non-macrocyclic scaffolds were
among the ten predicted to be most active (two examples are
shown: molecules Mol-0874 and Mol-0854). A non-macro-
cycle was also chosen in Round-04 among those predicted
most active (Mol-1098, bottom left of Fig.7). The trend of
improvement in accuracy for the predicted most active can-
didates continued, with an MUE of 0.9 and 0.8 pIC
50
units,
respectively, for these two refinement rounds.
Those predicted most informative for Round-03 and
Round-04 included 3/10 and 5/10 non-macrocycles. The
decrease in the number of non-macrocycles in the informa-
tive set compared to Round-01 and Round-02 suggests that
model refinement improved the predictions on non-macro-
cycles and thus these molecules would be less represented
among those molecules that were being incorrectly predicted
as having low activity.
Another aspect of quantitative activity prediction for this
series was the clear importance of detailed hydrophobic
shape on experimental activity. The ability of a compound
to fill the presumed hydrophobic pockets of the non-war-
head (right-hand) side of the binding site was a clear activity
requirement. Accurately modeling such phenomena depends
not only on a molecular representation that captures ligand
shape, but also requires that predictions of molecular pose
be respectful of the internal conformational energetics of
candidate molecules.
Fig. 6 Selections of active and informative molecules for Round-01 and Round-02
Round‑05: The goal compound
As shown in Fig.8, FPX was chosen by the model that was
trained on the initial 100 molecules and subsequently refined
with 100 chosen based on the active learning strategy during
the ensuing rounds of refinement. FPX was among the 10
predicted to be most active and the activity was accurately
predicted with pIC
50 =10.0
with a signed error of just + 0.4
pIC
50
units. The MUE of the 10 predicted to be most active
was 0.8 pIC
50
units, and importantly, the set included 7/10
non-macrocycles, evidence that the model had effectively
learned the non-macrocyclic scaffold.
For FPX, in addition to the predicted pose, a depiction of
the quantitative interactions with the pocket-field is shown
in Fig.8. The large majority of interactions were of a purely
hydrophobic type, represented by salmon-colored sticks
whose length is proportional to the interaction magnitude.
Notably, the two fluoro-phenyl groups of FPX, which over-
lay corresponding hydrophobic functionality of UK-2A,
are responsible for significant interactions. In addition, the
two chiral methyl groups, especially the one at the lower
right, were also important. Because of the angle needed to
adequately display the key hydrophobic interactions, the spe-
cific polar interactions made by the warhead and the amide
linker are somewhat difficult to discern, but all of the spe-
cific polar moieties were responsible for key interactions
as well (blue and red sticks, for hydrogen-bond donors and
acceptors, respectively). The other example highlights both
the variability that can be tolerated in the pendant hydro-
phobic groups and the fact that the core scaffold shifts in
accommodating different substituents.
In this gedankenexperiment, only 100 additional future
molecules needed to be synthesized, tested, and added to
the model in order to correctly choose FPX as an excellent
candidate molecule. In completing Round-05, FPX and the
other 19 chosen candidate molecules would be synthesized
and tested. Overall, only 120/1009 future molecules needed
to be “made” to both identify and confirm FPX as a highly
active candidate with just two chiral centers, no macrocyclic
component, and favorable synthetic characteristics.
Fig. 7 Selections of active and informative molecules for Round-03 and Round-04
Temporal model evolution
Table1 summarizes the statistics for the rounds of model
building, refinement, and future predictions. The training re-
fit Kendall’s Tau was consistently high (0.82–0.87) through-
out the five rounds of refinement, indicating that model fidel-
ity was maintained as new molecules were added. Likewise,
the training re-fit MUE remained low (0.30–0.36 log units)
throughout model refinement.
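The rank and error statistics used throughout this section, Kendall's Tau and MUE with 95% confidence intervals from resampling with replacement, can be reproduced in outline as follows. This is a sketch under assumptions: the tie-corrected tau-b variant, the percentile form of the bootstrap, the number of resamples, and the function names are illustrative choices that the text does not specify.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b (tie-corrected) by direct pair counting."""
    conc = disc = tie_x = tie_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: contributes to neither term
            if dx == 0:
                tie_x += 1
            elif dy == 0:
                tie_y += 1
            elif dx * dy > 0:
                conc += 1
            else:
                disc += 1
    denom = math.sqrt((conc + disc + tie_x) * (conc + disc + tie_y))
    return (conc - disc) / denom if denom else 0.0


def mue(exp: Sequence[float], pred: Sequence[float]) -> float:
    """Mean unsigned error between experimental and predicted pIC50."""
    return sum(abs(p - e) for e, p in zip(exp, pred)) / len(exp)


def bootstrap_ci(
    exp: Sequence[float],
    pred: Sequence[float],
    stat: Callable[[Sequence[float], Sequence[float]], float],
    n_boot: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile confidence interval from resampling pairs with replacement."""
    rng = random.Random(seed)
    n = len(exp)
    values: List[float] = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        values.append(stat([exp[i] for i in idx], [pred[i] for i in idx]))
    values.sort()
    lo = values[int((alpha / 2) * n_boot)]
    hi = values[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi
```

For example, `bootstrap_ci(exp, pred, mue)` gives an interval for the MUE and `bootstrap_ci(exp, pred, kendall_tau_b)` one for Tau, where `exp` and `pred` are parallel lists of experimental and predicted pIC50 values.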
Here, the sets of future molecules were much larger than
the training set, and they reflected substantial changes in the
structural composition of molecules and the distribution of
activity values compared with the training molecules. Later
in the project, as expected, a larger proportion of synthesized
molecules had very high activity. During successive rounds
of scoring future molecules, Tau trended upward, increas-
ing from 0.35 to 0.46. A large proportion of molecules had
experimental activity values of 8.5–9.5. The small data
range coupled with the presence of assay noise limits the
upper bound on rank-based statistics.
More striking was the decrease in MUE for predictions
on future molecules from 1.24 to 0.70. The model became
significantly more accurate during refinement. As the future
MUE decreased, the FPX predicted activity improved from
pIC50 = 7.3 (signed error −2.2) in Round-00 to pIC50 = 10.0
(signed error +0.4) in Round-05. Model improvement was
further reflected in the predicted rank of FPX activity which
rose from the top 61% to the top 1%.
Figure9 shows the plots of experimental versus predicted
activities for the set of all future molecules for each round of
Fig. 8 Selections of predicted
active candidates for Round-05
Table 1 Summary of rounds of model building and testing
Kendall’s Tau values are unitless and all had p
. Mean unsigned error (MUE) and FPX predicted activity are in units of pIC
50
. Numbers in
parentheses are 95% confidence intervals calculated by resampling with replacement
Round n Train Train Tau Train MUE n Future Future Tau Future MUE FPX Pred FPX rank %
00 100 0.87 (0.82–0.91) 0.30 (0.25–0.35) 1009 0.35 (0.31–0.39) 1.24 (1.18–1.30) 7.3 61
01 120 0.84 (0.79–0.89) 0.34 (0.30–0.39) 989 0.35 (0.31–0.39) 0.95 (0.90–1.00) 7.6 62
02 140 0.85 (0.79–0.90) 0.32 (0.28–0.37) 969 0.41 (0.37–0.45) 0.85 (0.81–0.90) 8.5 36
03 160 0.82 (0.77–0.86) 0.36 (0.32–0.40) 949 0.40 (0.36–0.44) 0.76 (0.73–0.80) 8.5 44
04 180 0.82 (0.78–0.86) 0.35 (0.31–0.39) 929 0.38 (0.34–0.43) 0.78 (0.74–0.82) 8.8 26
05 200 0.82 (0.78–0.86) 0.34 (0.30–0.39) 909 0.46 (0.42–0.50) 0.70 (0.66–0.74) 10.0 1
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Journal of Computer-Aided Molecular Design (2024) 38:19 19 Page 10 of 16
Fig. 9 Experimental versus predicted activities for the set of all future molecules for each round of testing
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Journal of Computer-Aided Molecular Design (2024) 38:19 Page 11 of 16 19
testing. The identity line indicates perfect prediction, and the lighter lines represent ±1.5 units of pIC50 (corresponding to ±2 kcal/mol). The initial Round-00 model exhibited a strong lower-right triangular bias, with a significantly larger fraction of underpredictions than overpredictions. This aspect of the model’s predictive behavior shifted rapidly over the ensuing two rounds of active learning. By Round-03, relatively little skew was apparent. The fraction of underpredictions (< −2 kcal/mol; Fig. 9, red triangles) decreased by 10 percentage points from Round-00 to Round-01, and in Round-03 through Round-05 underpredictions were nearly as few as overpredictions. The Round-05 FPX prediction (Exp pIC50 = 9.5, Pred pIC50 = 10.0) is highlighted in red.
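The stated correspondence between ±1.5 pIC50 units and ±2 kcal/mol can be checked with the standard relation between a log-unit change in potency and free energy. The worked line below assumes a temperature of about 298 K and treats pIC50 as a proxy for binding free energy; both are our assumptions, since the paper gives only the rounded correspondence.

```latex
\Delta\Delta G \;=\; RT \ln(10)\,\Delta\mathrm{pIC}_{50}
\;\approx\; \left(1.987\times10^{-3}\ \tfrac{\mathrm{kcal}}{\mathrm{mol\,K}}\right)(298\ \mathrm{K})(2.303)(1.5)
\;\approx\; 2.0\ \mathrm{kcal/mol}
```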
Table2 shows a summary of the distribution of predic-
tions on future molecules depicted in the plots of Fig.9.
Throughout model refinement and predictions on future
molecules, large overpredictions were few and relatively
constant (7% in Round-00 to 3% in Round-05). The predic-
tions within 2 kcal/mol increased from 66% in Round-00 to
91% in Round-05, and those within 1 kcal/mol from 36%
in Round-00 to 61% in Round-05. The dramatic decrease
in underpredictions occurred in two steps, from Round-00
(27%) to Round-01 (17%) and from Round-02 (14%) to
Round-03 (8%). By Round-05, the fractions of large over-
and under-predictions were essentially the same.
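The error categories summarized above can be computed from paired experimental and predicted pIC50 values. The sketch below is one way to do so, assuming the paper's rounded correspondence of 1.5 pIC50 units to 2 kcal/mol and assuming that underpredictions and overpredictions are the errors falling outside the ±2 kcal/mol band. The function and key names are hypothetical.

```python
# Assumption: the paper's rounded correspondence, 1.5 pIC50 units ~ 2 kcal/mol.
KCAL_PER_PIC50 = 2.0 / 1.5


def summarize_errors(exp_pic50, pred_pic50):
    """Return the percentage of predictions in each error category."""
    n = len(exp_pic50)
    if n == 0 or n != len(pred_pic50):
        raise ValueError("need two non-empty sequences of equal length")
    counts = {"within_1": 0, "within_2": 0, "under": 0, "over": 0}
    for exp, pred in zip(exp_pic50, pred_pic50):
        err_kcal = (pred - exp) * KCAL_PER_PIC50
        if abs(err_kcal) <= 1.0:
            counts["within_1"] += 1
        if abs(err_kcal) <= 2.0:
            counts["within_2"] += 1
        elif err_kcal < 0:
            counts["under"] += 1  # predicted much less potent than measured
        else:
            counts["over"] += 1   # predicted much more potent than measured
    return {key: 100.0 * value / n for key, value in counts.items()}
```

With this binning, the within-2, underprediction, and overprediction percentages sum to 100, which matches the row sums in Table 2.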
The distribution of experimental activity values for the future set of molecules changed relatively little over time, perhaps as expected given that only roughly 10% of the molecules were selected over the rounds of iterative refinement. The minimum and maximum pIC50 values were 4.3 and 10.1, respectively, throughout. The mean and standard deviation began at 8.5 ± 1.0 and ended at 8.4 ± 1.0. However, for the training set, the distribution shifted. The initial minimum and maximum pIC50 training values were 4.3 and 9.5, respectively, shifting to 4.3 and 9.9 at the end of refinement. The mean and standard deviation began at 7.6 ± 1.3 and ended at 8.1 ± 1.2. The distributional shifts in the training data during refinement reflected the successful selection of numerous potent candidate molecules.
Idea generation
One difficulty in interpreting the foregoing results is that the molecules available for selection had been made and tested as part of an active design process, in which decisions on what to make next were made by experts based on their knowledge of prior data as well as their expertise in the field. So, while the active learning approach was able to select efficiently from that set of molecules, it is not clear that such a path could be followed in a situation where the future space of molecules was open to determination.
Generative approaches that employ deep learning to produce ideas for new compounds have gained some prominence recently [9, 10]. We have taken a different approach, instead using molecular similarity to identify possible bioisosteric core scaffold replacements, including their suitability to display the required pendant functionality for good activity. Figure 10 illustrates how similarity-based screening, combined with the attachment of desirable pendant groups, can rapidly generate ideas. Our approach is similar in spirit to work by Awale, Hert et al. [23], in which the authors describe a 2D matched-pair approach to identifying sensible candidate molecules based on an analysis of large structure-activity databases.
Beginning with the original five-molecule multiple-ligand alignment used to guide QuanSA model-building, the pendant groups were removed to produce a core-scaffold overlay, and the amide linking subfragment was extracted (Fig. 10A at right). The roughly 3,000,000 compound Enamine Stock Screening collection was screened against the multiple-ligand core using the amide fragment as a required positional restraint to ensure that all hits returned would have appropriate chemistry for linkage to the common warhead. Two examples of high-scoring hits from the screen are shown in Fig. 10B (cyan carbons) in their optimal poses relative to the screening target (green carbons).
For each returned pose of each nominal screening hit, a geometric matching procedure was employed to identify crossover points between the screening hit and each of the full parent molecules from the original multiple-ligand
Table 2 Summary of the distribution of predictions on future molecules
Round  n Train  n Future  % Pred w/in 1 kcal/mol  % Pred w/in 2 kcal/mol  % Underpredictions  % Overpredictions
00     100      1009      36                      66                      27                  7
01     120      989       50                      80                      17                  3
02     140      969       53                      83                      14                  3
03     160      949       55                      89                      8                   3
04     180      929       55                      88                      9                   3
05     200      909       61                      91                      6                   3
alignment. Figure10C shows the process using the pose of
compound Mol-0013 as the crossover target. When a com-
patible set of distances and bond vectors existed, the origi-
nal substituents of the screening hit were replaced with the
substituents of the parent compound. Figure10C (bottom)
shows the two resulting merged molecules with novel struc-
tures. The arrows and corresponding thick lines show the
specific substituent movements that were made. The initial
crossover results in high local strain for the new bonds, and
the final ligand pose is relaxed using positionally-restrained
energy minimization.
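The geometric matching step is described only at a high level. As a rough illustration of what a compatibility test on distances and bond vectors might look like, the sketch below assumes each candidate attachment bond is given as a pair of coordinates (anchor atom, substituent atom) in the shared alignment frame. The tolerances of 1.0 Å and 30° and all function names are hypothetical and are not taken from the Surflex Platform.

```python
import math


def _unit(vec):
    """Return vec scaled to unit length."""
    norm = math.sqrt(sum(c * c for c in vec))
    if norm == 0.0:
        raise ValueError("zero-length bond vector")
    return tuple(c / norm for c in vec)


def bonds_compatible(hit_bond, parent_bond, max_dist=1.0, max_angle_deg=30.0):
    """Check whether two attachment bonds nearly coincide in space.

    Each bond is (anchor_xyz, substituent_xyz). The bonds are compatible
    when the anchor atoms are close and the bond vectors point the same way.
    """
    (hit_anchor, hit_sub), (par_anchor, par_sub) = hit_bond, parent_bond
    anchor_dist = math.dist(hit_anchor, par_anchor)
    hit_vec = _unit(tuple(s - a for a, s in zip(hit_anchor, hit_sub)))
    par_vec = _unit(tuple(s - a for a, s in zip(par_anchor, par_sub)))
    cos_angle = sum(h * p for h, p in zip(hit_vec, par_vec))
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding
    angle = math.degrees(math.acos(cos_angle))
    return anchor_dist <= max_dist and angle <= max_angle_deg


def find_crossover_points(hit_bonds, parent_bonds, **tolerances):
    """List (hit_index, parent_index) pairs of compatible attachment bonds."""
    return [
        (i, j)
        for i, hit_bond in enumerate(hit_bonds)
        for j, parent_bond in enumerate(parent_bonds)
        if bonds_compatible(hit_bond, parent_bond, **tolerances)
    ]
```

In such a scheme, each compatible pair would mark a place where a parent substituent could be grafted onto the screening hit before the relaxation step described above.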
Figure10D shows the relationship of the two resulting
generated candidate molecules. Each contains a large frac-
tion of the exact substructure (including chirality) of the
Fig. 10 Scheme for generating synthetic ideas using a combination of eSim screening and automatic addition of desirable pendant groups
final FPX compound, with relatively minor variations in the precise hydrophobic substituents at right. With a slight generalization of the procedure to include additional substituent variations (e.g., p-fluoro-phenyl at both positions), the exact structure of FPX would have been generated. The data required to identify the five-molecule multiple-ligand alignment were just the first 100 compounds from the full structure-activity set. The computational procedure for identifying core-scaffold hits and producing merged candidate molecules required less than an hour and no additional data.
The procedure just described is not intended to fully automate candidate compound generation. Rather, it is meant to be a source of ideas that are easy to scan rapidly. Of course, it is also possible to make use of the predictive QuanSA models to identify candidates that are quantitatively predicted to have high potency or high information value.
Conclusions
Overall, beginning with the earliest 100 picolinamide anti-fungal project compounds, an active-learning approach efficiently guided candidate selection to the desired end product FPX after model refinement using just 100 synthetic analogs. This project began with a relatively potent lead compound in UK-2A, with design goals including a reduction in molecular complexity that required replacement of the central macrocyclic scaffold. This presents a challenge for predictive modeling because the molecules to be designed must deviate quite significantly from known chemical matter. Through the use of active learning, rapid introduction of novel structural features was possible. The process was guided by a well-defined notion of what makes a highly informative molecule: one that exhibits high similarity to a known active in their respective optimal predicted poses but which is (possibly anomalously) predicted to have low activity.
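The notion of an informative molecule described above can be written as a simple filter. The sketch below assumes each candidate already carries a predicted pIC50, a 3D similarity to its nearest training neighbor, and that neighbor's measured pIC50. The cutoff values, the `Candidate` fields, and the ranking rule are illustrative assumptions rather than the settings used in this work.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    pred_pic50: float     # model-predicted activity
    nn_similarity: float  # 3D similarity to nearest training neighbor (0 to 1)
    nn_pic50: float       # measured activity of that neighbor


def pick_informative(candidates, n, active_cutoff=8.5,
                     low_pred_cutoff=7.0, min_similarity=0.8):
    """Pick up to n candidates that look like a potent neighbor but score low."""
    pool = [
        c for c in candidates
        if c.nn_pic50 >= active_cutoff
        and c.pred_pic50 <= low_pred_cutoff
        and c.nn_similarity >= min_similarity
    ]
    # Most similar first; ties broken by the lowest predicted activity.
    pool.sort(key=lambda c: (-c.nn_similarity, c.pred_pic50))
    return pool[:n]
```

Molecules chosen this way either confirm a genuine activity cliff or expose a gap in the model, and both outcomes add information at the next refinement round.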
The practical significance of the retrospective analysis presented here lies in its breadth of applicability to scaffold replacement and to lead optimization more broadly. The QuanSA method does not require a protein structure to make accurate predictions that are physically explainable. While it can make use of information from experimental determination of bound ligand structures, it can operate in a purely ligand-based manner where the only available data are compound structures and activities. Model building can proceed from very limited project data, beginning with just dozens of molecules, not the thousands required for so-called deep-learning methods [9, 10].
Further, model-building is not especially computationally intensive. On modest workstation hardware, candidate molecules can be scored in seconds for “normal” small molecules. Small macrocycles such as those seen here required tens of seconds per molecule, with a majority of the time going to conformational search. For the work reported here, the fully-automated procedure took approximately two days on an 18-core workstation. This encompassed the entire process beginning with 2D structures for all 1109 molecules, through model-building, scoring/selection/refinement, and the final pass from which FPX was chosen.
The lead optimization project that resulted in FPX required the synthesis of many hundreds of analogs in order to re-engineer the starting macrocyclic natural product. We believe that effective use of active learning and semi-automatic candidate generation can drastically shorten the design path from initial lead compound to final product. The central requirement for the computational methodology is that it is capable of extrapolating from small quantities of structure-activity data. Modeling approaches that move toward constructing causal models for activity prediction have clear advantages over approaches that ignore the physical underpinnings of how ligands bind to and modulate the activity of biological targets.
Methods anddata
Molecular data set
A total of 1109 compounds from a lead optimization project formed the data set. Molecules were provided as 2D SDF structures with associated activities and consecutive compound IDs serving as relative synthesis dates for temporal sorting. The project dataset contained pIC50 activity values and registration dates, beginning with UK-2A as compound 1 and the resultant commercial product FPX as compound 1109. The activity values were determined in an in vitro assay for the inhibition of fungal mitochondrial electron transport. The first 100 molecules synthesized were used as the initial training set for QuanSA, with the remaining 1009 molecules used as future “synthesizable” molecules.
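The temporal split and the bookkeeping for the rounds of 20-compound selection can be sketched as follows. This is only an outline: model rebuilding between rounds is omitted, the selector is a placeholder, and all function names are hypothetical. The set sizes it produces match the n Train and n Future columns of Table 1.

```python
def temporal_split(compound_ids, n_initial=100):
    """Split time-ordered compound IDs into initial training and future sets."""
    ordered = sorted(compound_ids)
    return ordered[:n_initial], ordered[n_initial:]


def run_rounds(train, future, select, n_rounds=5):
    """Move selected compounds from future to train, one round at a time.

    `select(train, future)` returns the compounds chosen for a round.
    Returns a list of (round, n_train, n_future) tuples.
    """
    train, future = list(train), list(future)
    history = [(0, len(train), len(future))]
    for rnd in range(1, n_rounds + 1):
        chosen = set(select(train, future))
        train.extend(c for c in future if c in chosen)
        future = [c for c in future if c not in chosen]
        # A real workflow would refine the model on `train` here.
        history.append((rnd, len(train), len(future)))
    return history


def first_twenty(train, future):
    """Placeholder selector; a real run would rank by model predictions."""
    return future[:20]
```

For example, `run_rounds(*temporal_split(range(1, 1110)), first_twenty)` steps the training set from 100 to 200 compounds while the future pool shrinks from 1009 to 909.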
Computational procedures
For all procedures, we employed version 5.1 of the Surflex Platform (BioPharmics Division, Optibrium Limited, Cambridge, CB25 9GL, UK). Additional details can be found in the data archive associated with this paper (see the Declarations section).
Ligand preparation
Standard procedures were used to protonate the molecules
as expected at physiological pH, generate 3D structures,
and perform conformational search, as follows:
Multiple ligand alignment
The use of Surflex eSim to generate multiple ligand alignments [21], specifically for the purpose of seeding initial QuanSA alignments, was outlined earlier and has been reported previously [11]. The specific procedure used in this work was as follows:
QuanSA model induction, prediction, and refinement
Previous QuanSA method papers are comprehensive and contain a detailed algorithmic description [11, 12]. Here, standard procedures were used, as follows:
The model selection procedure (the select command) generates a model quality score that combines the following: (1) model parsimony (P), which is a quantitative measure of the extent to which molecules with similar activity values have similar predicted poses; (2) Kendall’s Tau (T) for the full re-fitting of training molecules into a derived pocket-field; and (3) the mean unsigned error (E) of the re-fit molecules.
Given $N$ alternative models, each of $P_{1 \ldots N}$, $T_{1 \ldots N}$, and $E_{1 \ldots N}$ is transformed into corresponding probability scores. This is done by fitting a normal distribution to each of $P_{1 \ldots N}$, $T_{1 \ldots N}$, and $E_{1 \ldots N}$, which then allows calculation of the cumulative distribution function $\Phi$ for each of $P$, $T$, and $E$. So, raw values for each metric across the $N$ alternative models are converted to probabilities reflecting their likelihood of being non-random: $P^{p}_{1 \ldots N}$, $T^{p}_{1 \ldots N}$, and $E^{p}_{1 \ldots N}$. The probability score for model $i$ is simply the product $(P^{p}_{i})(T^{p}_{i})(E^{p}_{i})$. The highest scoring of the alternative models using the combined probabilistic score is selected.
New molecule scoring, selection, and model refinement
followed these general procedures:
Computational procedures foridea generation
The pendant groups were trimmed from the 5-molecule multiple ligand alignment described above, leaving only the aligned central scaffolds. The aligned core scaffolds were used as a multi-ligand target in a virtual screen of the Enamine database. The resulting hits were processed using a new procedure to automatically attach pendant groups from the original full ligands of the multiple-ligand alignment, as follows:
Note that the resulting merged molecules can be reviewed directly or can be subjected to conformational search and screened using either pure ligand similarity, QuanSA model scoring, or a combination of both approaches.
Acknowledgements The authors thank Negar Garizi and Matt Segall
for supporting this work and for valuable scientific discussions.
Author contributions All authors participated in the research and in
the preparation of and final review of the manuscript.
Funding The authors have no outside funding to declare.
Data availability A freely downloadable data archive with additional details is available at www.jainlab.org/downloads. The archive contains scripts to reproduce the protocols described here beginning from 2D input structures. Note that the compound structures in the archive are limited to those depicted here and do not include the full set of 1109 ligands described, whose structures and activity values cannot be disclosed.
Declarations
Conflict of interest The authors have no conflict of interest as defined
by Springer, or other interests that might be perceived to influence the
results and/or discussion reported in this paper.
Ethical approval and consent to participate Not applicable.
Consent for publication All authors have read and understood the publishing policy, and this manuscript is submitted in accordance with this policy.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
References
1. Meyer KG, Bravo-Altamirano K, Herrick J, Loy BA, Yao C, Nugent B, Buchan Z, Daeuble JF, Heemstra R, Jones DM, Wilmot J, Lu Y, DeKorver K, DeLorbe J, Rigoli J (2021) Discovery of florylpicoxamid, a mimic of a macrocyclic natural product. Bioorg Med Chem 50:116455
2. Labute P (2010) LowModeMD: implicit low-mode velocity filtering applied to conformational search of macrocycles and protein loops. J Chem Inf Model 50(5):792–800
3. Chen IJ, Foloppe N (2013) Tackling the conformational sampling of larger flexible compounds and macrocycles in pharmacology and drug discovery. Bioorg Med Chem 21(24):7898–7920
4. Watts KS, Dalal P, Tebben AJ, Cheney DL, Shelley JC (2014) Macrocycle conformational sampling with MacroModel. J Chem Inf Model 54(10):2680–2696
5. Sindhikara D, Spronk SA, Day T, Borrelli K, Cheney DL, Posy SL (2017) Improving accuracy, diversity, and speed with prime macrocycle conformational sampling. J Chem Inf Model 57(8):1881–1894
6. Cleves AE, Jain AN (2017) ForceGen 3D structure and conformer generation: from small lead-like molecules to macrocyclic drugs. J Comput Aided Mol Des 31(5):419–439
7. Jain AN, Cleves AE, Gao Q, Wang X, Liu Y, Sherer EC, Reibarkh MY (2019) Complex macrocycle exploration: parallel, heuristic, and constraint-based conformer generation using ForceGen. J Comput Aided Mol Des 33(6):531–558
8. Jain AN, Brueckner AC, Jorge C, Cleves AE, Khandelwal P, Cortes JC, Mueller L (2023) Complex peptide macrocycle optimization: combining NMR restraints with conformational analysis to guide structure-based and ligand-based design. J Comput Aided Mol Des 37:519–535
9. Walters WP, Barzilay R (2020) Applications of deep learning in molecule generation and molecular property prediction. Acc Chem Res 54(2):263–270
10. Ramsundar B, Eastman P, Walters P, Pande V (2019) Deep learning for the life sciences: applying deep learning to genomics, microscopy, drug discovery, and more. O’Reilly Media Inc, Newton
11. Cleves AE, Jain AN (2018) Quantitative surface field analysis: learning causal models to predict ligand binding affinity and pose. J Comput Aided Mol Des 32(7):731–757
12. Cleves AE, Johnson SR, Jain AN (2021) Synergy and complementarity between focused machine learning and physics-based simulation in affinity prediction. J Chem Inf Model 61(12):5948–5966
13. Jain AN (1996) Scoring noncovalent protein-ligand interactions: a continuous differentiable function tuned to compute binding affinities. J Comput Aided Mol Des 10(5):427–440
14. Pham T, Jain AN (2006) Parameter estimation for scoring protein-ligand interactions using negative training data. J Med Chem 49(20):5856–5868
15. Jain AN, Cleves AE (2012) Does your model weigh the same as a Duck? J Comput Aided Mol Des 26:57–67
16. Cleves AE, Jain AN (2016) Extrapolative prediction using physically-based QSAR. J Comput Aided Mol Des 30(2):127–152
17. Jain AN, Dietterich TG, Lathrop RH, Chapman D, Critchlow REJ, Bauer BE, Webster TA, Lozano-Perez T (1994) A shape-based machine learning tool for drug design. J Comput Aided Mol Des 8(6):635–652
18. Jain AN, Koile K, Chapman D (1994) Compass: predicting biological activities from molecular surface properties. Performance comparisons on a steroid benchmark. J Med Chem 37(15):2315–2327
19. Jain AN, Harris N, Park J (1995) Quantitative binding site model generation: compass applied to multiple chemotypes targeting the 5-HT1a receptor. J Med Chem 38(8):1295–1308
20. Dietterich TG, Lathrop RH, Lozano-Pérez T (1997) Solving the multiple instance problem with axis-parallel rectangles. Artif Intell 89(1–2):31–71
21. Cleves AE, Jain AN (2020) Structure- and ligand-based virtual screening on DUD-E+: performance dependence on approximations to the binding pocket. J Chem Inf Model 60(9):4296–4310
22. Varela R, Walters W, Goldman B, Jain AN (2012) Iterative refinement of a binding pocket model: active computational steering of lead optimization. J Med Chem 55(20):8926–8942
23. Awale M, Hert J, Guasch L, Riniker S, Kramer C (2021) The playbooks of medicinal chemistry design moves. J Chem Inf Model 61(2):729–742
Publisher's Note Springer Nature remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.
ResearchGate has not been able to resolve any citations for this publication.
Article
Full-text available
Systematic optimization of large macrocyclic peptide ligands is a serious challenge. Here, we describe an approach for lead-optimization using the PD-1/PD-L1 system as a retrospective example of moving from initial lead compound to clinical candidate. We show how conformational restraints can be derived by exploiting NMR data to identify low-energy solution ensembles of a lead compound. Such restraints can be used to focus conformational search for analogs in order to accurately predict bound ligand poses through molecular docking and thereby estimate ligand strain and protein-ligand intermolecular binding energy. We also describe an analogous ligand-based approach that employs molecular similarity optimization to predict bound poses. Both approaches are shown to be effective for prioritizing lead-compound analogs. Surprisingly, relatively small ligand modifications, which may have minimal effects on predicted bound pose or intermolecular interactions, often lead to large changes in estimated strain that have dominating effects on overall binding energy estimates. Effective macrocyclic conformational search is crucial, whether in the context of NMR-based restraints, X-ray ligand refinement, partial torsional restraint for docking/ligand-similarity calculations or agnostic search for nominal global minima. Lead optimization for peptidic macrocycles can be made more productive using a multi-disciplinary approach that combines biophysical data with practical and efficient computational methods.
Article
Full-text available
We present results on the extent to which physics-based simulation (exemplified by FEP⁺) and focused machine learning (exemplified by QuanSA) are complementary for ligand affinity prediction. For both methods, predictions of activity for LFA-1 inhibitors from a medicinal chemistry lead optimization project were accurate within the applicable domain of each approach. A hybrid model that combined predictions by both approaches by simple averaging performed better than either method, with respect to both ranking and absolute pKi values. Two publicly available FEP⁺ benchmarks, covering 16 diverse biological targets, were used to test the generality of the synergy. By identifying training data specifically focused on relevant ligands, accurate QuanSA models were derived using ligand activity data known at the time of the original series publications. Results across the 16 benchmark targets demonstrated significant improvements both for ranking and for absolute pKi values using hybrid predictions that combined the FEP⁺ and QuanSA predicted affinity values. The results argue for a combined approach for affinity prediction that makes use of physics-driven methods as well as those driven by machine learning, each applied carefully on appropriate compounds, with hybrid prediction strategies being employed where possible.
Article
Full-text available
ForceGen is a template-free, non-stochastic approach for 2D to 3D structure generation and conformational elaboration for small molecules, including both non-macrocycles and macrocycles. For conformational search of non-macrocycles, ForceGen is both faster and more accurate than the best of all tested methods on a very large, independently curated benchmark of 2859 PDB ligands. In this study, the primary results are on macrocycles, including results for 431 unique examples from four separate benchmarks. These include complex peptide and peptide-like cases that can form networks of internal hydrogen bonds. By making use of new physical movements (“flips” of near-linear sub-cycles and explicit formation of hydrogen bonds), ForceGen exhibited statistically significantly better performance for overall RMS deviation from experimental coordinates than all other approaches. The algorithmic approach offers natural parallelization across multiple computing-cores. On a modest multi-core workstation, for all but the most complex macrocycles, median wall-clock times were generally under a minute in fast search mode and under 2 min using thorough search. On the most complex cases (roughly cyclic decapeptides and larger) explicit exploration of likely hydrogen bonding networks yielded marked improvements, but with calculation times increasing to several minutes and in some cases to roughly an hour for fast search. In complex cases, utilization of NMR data to constrain conformational search produces accurate conformational ensembles representative of solution state macrocycle behavior. On macrocycles of typical complexity (up to 21 rotatable macrocyclic and exocyclic bonds), design-focused macrocycle optimization can be practically supported by computational chemistry at interactive time-scales, with conformational ensemble accuracy equaling what is seen with non-macrocyclic ligands. 
For more complex macrocycles, inclusion of sparse biophysical data is a helpful adjunct to computation.
Article
Full-text available
We introduce the QuanSA method for inducing physically meaningful field-based models of ligand binding pockets based on structure-activity data alone. The method is closely related to the QMOD approach, substituting a learned scoring field for a pocket constructed of molecular fragments. The problem of mutual ligand alignment is addressed in a general way, and optimal model parameters and ligand poses are identified through multiple-instance machine learning. We provide algorithmic details along with performance results on sixteen structure-activity data sets covering many pharmaceutically relevant targets. In particular, we show how models initially induced from small data sets can extrapolatively identify potent new ligands with novel underlying scaffolds with very high specificity. Further, we show that combining predictions from QuanSA models with those from physics-based simulation approaches is synergistic. QuanSA predictions yield binding affinities, explicit estimates of ligand strain, associated ligand pose families, and estimates of structural novelty and confidence. The method is applicable for fine-grained lead optimization as well as potent new lead identification.
Article
Full-text available
We introduce the ForceGen method for 3D structure generation and conformer elaboration of drug-like small molecules. ForceGen is novel, avoiding use of distance geometry, molecular templates, or simulation-oriented stochastic sampling. The method is primarily driven by the molecular force field, implemented using an extension of MMFF94s and a partial charge estimator based on electronegativity-equalization. The force field is coupled to algorithms for direct sampling of realistic physical movements made by small molecules. Results are presented on a standard benchmark from the Cambridge Crystallographic Database of 480 drug-like small molecules, including full structure generation from SMILES strings. Reproduction of protein-bound crystallographic ligand poses is demonstrated on four carefully curated data sets: the ConfGen Set (667 ligands), the PINC cross-docking benchmark (1062 ligands), a large set of macrocyclic ligands (182 total with typical ring sizes of 12–23 atoms), and a commonly used benchmark for evaluating macrocycle conformer generation (30 ligands total). Results compare favorably to alternative methods, and performance on macrocyclic compounds approaches that observed on non-macrocycles while yielding a roughly 100-fold speed improvement over alternative MD-based methods with comparable performance.
Article
Natural products have routinely been used both as sources of and inspiration for new crop protection active ingredients. The natural product UK-2A has potent anti-fungal activity but lacks key attributes for field translation. Post-fermentation conversion of UK-2A to fenpicoxamid resulted in an active ingredient with a new target site of action for cereal and banana pathogens. Here we demonstrate the creation of a synthetic variant of fenpicoxamid via identification of the structural elements of UK-2A that are needed for anti-fungal activity. Florylpicoxamid is a non-macrocyclic active ingredient bearing two fewer stereocenters than fenpicoxamid, controls a broad spectrum of fungal diseases at low use rates and has a concise, scalable route which is aligned with green chemistry principles. The development of florylpicoxamid represents the first example of using a stepwise deconstruction of a macrocyclic natural product to design a fully synthetic crop protection active ingredient.
Article
Large databases of biologically relevant molecules, such as ChEMBL, SureChEMBL, or compound collections of pharmaceutical or agrochemical companies, are invaluable sources of medicinal chemistry information, albeit implicit. We developed a modified matched molecular pair approach to systematically and exhaustively extract the transformations in these databases and distill them into snippets of explicit design knowledge that are easily interpretable and directly applicable. The resulting "playbooks of medicinal chemistry design moves" capture the collective pharmaceutical and agrochemical research expertise across multiple chemists, companies, targets, and projects. They can be queried in an automated fashion for systematic prospective design and compound generation. The ChEMBL playbook and an application to exploit it are available at https://github.com/mahendra-awale/medchem_moves.
Article
Conspectus: Recent advances in computer hardware and software have led to a revolution in deep neural networks that has impacted fields ranging from language translation to computer vision. Deep learning has also impacted a number of areas in drug discovery, including the analysis of cellular images and the design of novel routes for the synthesis of organic molecules. While work in these areas has been impactful, a complete review of the applications of deep learning in drug discovery would be beyond the scope of a single Account. In this Account, we will focus on two key areas where deep learning has impacted molecular design: the prediction of molecular properties and the de novo generation of suggestions for new molecules. One of the most significant advances in the development of quantitative structure-activity relationships (QSARs) has come from the application of deep learning methods to the prediction of the biological activity and physical properties of molecules in drug discovery programs. Rather than employing the expert-derived chemical features typically used to build predictive models, researchers are now using deep learning to develop novel molecular representations. These representations, coupled with the ability of deep neural networks to uncover complex, nonlinear relationships, have led to state-of-the-art performance. While deep learning has changed the way that many researchers approach QSARs, it is not a panacea. As with any other machine learning task, the design of predictive models is dependent on the quality, quantity, and relevance of available data. Seemingly fundamental issues, such as optimal methods for creating a training set, are still open questions for the field. Another critical area that is still the subject of multiple research efforts is the development of methods for assessing the confidence in a model. Deep learning has also contributed to a renaissance in the application of de novo molecule generation. Rather than relying on manually defined heuristics, deep learning methods learn to generate new molecules based on sets of existing molecules. Techniques that were originally developed for areas such as image generation and language translation have been adapted to the generation of molecules. These deep learning methods have been coupled with the predictive models described above and are being used to generate new molecules with specific predicted biological activity profiles. While these generative algorithms appear promising, there have been only a few reports on the synthesis and testing of molecules based on designs proposed by generative models. The evaluation of the diversity, quality, and ultimate value of molecules produced by generative models is still an open question. While the field has produced a number of benchmarks, it has yet to agree on how one should ultimately assess molecules "invented" by an algorithm.
Article
Using the DUD-E+ benchmark, we explore the impact of using a single protein pocket or ligand for virtual screening compared with using ensembles of alternative pockets, ligands, and sets thereof. For both structure-based and ligand-based approaches, the precise characterization of the binding site in question had a significant impact on screening performance. Using the single original DUD-E protein, Surflex-Dock yielded mean ROC area of 0.81±0.11. Using the cognate ligand instead, with the eSim method for screening, yielded 0.77±0.14. Moving to ensembles of five protein pocket variants increased docking performance to 0.84±0.09. The result for the analogous ligand-based approach (using the five crystallographically aligned cognate ligands) was 0.83±0.11. Using the same ligands, but making use of an automatically generated mutual alignment, yielded mean AUC nearly as good as from single-structure docking: 0.80±0.12. Detailed results and statistical analyses show that structure-based and ligand-based methods are complementary and can be fruitfully combined to enhance screening efficiency. A hybrid approach combining ensemble docking with eSim-based screening produced the best and most consistent performance (mean ROC area of 0.89±0.08 and 1% early enrichment of 46-fold). Based on results from both the docking and ligand-similarity approaches, it is clearly unwise to make use of a single arbitrarily chosen protein structure for docking or single ligand query for similarity-based screening.
Article
A novel method for exploring macrocycle conformational space, Prime Macrocycle Conformational Sampling (Prime-MCS), is introduced and evaluated in the context of other available algorithms (Molecular Dynamics, LowModeMD in MOE, and MacroModel Baseline Search). The algorithms were benchmarked on a dataset of 208 macrocycles which was curated for diversity from the Cambridge Structural Database, the Protein Databank, and the Biologically Interesting Molecule Reference Dictionary. The algorithms were evaluated in terms of accuracy (ability to reproduce the crystal structure), diversity (coverage of conformational space), and computational speed. Prime-MCS most reliably reproduced crystallographic structures for RMSD thresholds >1.0 Å, most often produced the most diverse conformational ensemble, and was most often the fastest algorithm. Detailed analysis and examination of both typical and outlier cases were performed to reveal characteristics, shortcomings, expected performance, and complementarity of the methods.