
From UK-2A to florylpicoxamid: Active learning to identify a mimic of a macrocyclic natural product

April 2024
Journal of Computer-Aided Molecular Design 38(1)


DOI:10.1007/s10822-024-00555-3

License
CC BY 4.0

Authors:

Ann Cleves

BioPharmics Division, Optibrium Ltd.

Ajay N Jain

BioPharmics LLC




Journal of Computer-Aided Molecular Design (2024) 38:19

https://doi.org/10.1007/s10822-024-00555-3

From UK-2A to florylpicoxamid: Active learning to identify a mimic of a macrocyclic natural product

Ann E. Cleves¹ · Ajay N. Jain¹ · David A. Demeter² · Zachary A. Buchan² · Jeremy Wilmot² · Erin N. Hancock²

Received: 2 January 2024 / Accepted: 26 February 2024

Abstract

Scaffold replacement as part of an optimization process that requires maintenance of potency, desirable biodistribution, metabolic stability, and considerations of synthesis at very large scale is a complex challenge. Here, we consider a set of over 1000 time-stamped compounds, beginning with a macrocyclic natural-product lead and ending with a broad-spectrum crop anti-fungal. We demonstrate the application of the QuanSA 3D-QSAR method employing an active learning procedure that combines two types of molecular selection. The first identifies compounds predicted to be most active of those most likely to be well-covered by the model. The second identifies compounds predicted to be most informative based on exhibiting low predicted activity but showing high 3D similarity to a highly active nearest-neighbor training molecule. Beginning with just 100 compounds, using a deterministic and automatic procedure, five rounds of 20-compound selection and model refinement identify the binding metabolic form of florylpicoxamid. We show how iterative refinement broadens the domain of applicability of the successive models while also enhancing predictive accuracy. We also demonstrate how a simple method requiring very sparse data can be used to generate relevant ideas for synthetic candidates.

Keywords Active-learning · QuanSA · Affinity prediction · Macrocycles

Introduction

Natural products (NPs) have been used as inspiration for crop protection active ingredients. However, it is often the case that structural features of NPs, such as macrocycles and multiple chiral centers, limit their use due to the expense of industrial-scale synthesis. Figure 1 shows the structure of UK-2A (left side), a natural product with excellent in vitro inhibition of mitochondrial electron transport (MET) complex III via binding to the Qi site of cytochrome b [1]. Activity values were determined by an in vitro MET binding assay and expressed here as pIC50. Protection of the 3-pyridinol with an isobutyryloxymethyl group improved in planta antifungal performance, with the unprotected binding metabolite being readily produced. Figure 1 shows the unprotected form of florylpicoxamid (right side, “FPX”), whose 3-pyridinol protected precursor has been shown to be a highly effective crop protection fungicide [1]. FPX has two fewer chiral centers, no macrocycle, and is fully synthetic, not requiring starting materials from fermentation processes. The development of FPX followed a design strategy of stepwise deconstruction of a macrocyclic natural product, requiring many hundreds of synthetic analogs along with in vitro and in planta assays.

Here, we investigate the degree to which an active-learning approach for activity prediction could be used to vastly reduce the number of synthetic analogs required in such an effort. Ligand activity prediction continues to be a challenge for computer-aided drug design (CADD), especially in the case where there is no suitable high-resolution experimental structure of the target of interest, as is the case here. An additional challenge here is the presence of flexible macrocyclic ligands. Over the past several years, methods for computational modeling of macrocyclic ligands have made significant progress [2–7]. In particular, natural-product based and semi-synthetic macrocycles of up to roughly 21–23 total rotatable bonds (including both macrocyclic bonds and exocyclic bonds) have been shown to be tractable, in terms of accuracy and speed of conformational search when utilizing multiple computing-cores [7]. However, larger peptidic macrocycles remain challenging, often requiring biophysical data (e.g. from NMR) to help restrain the conformational space to be explored [8]. Generally, the macrocycles studied here fell well within the tractable range of the ForceGen methodology [7].

* Ann E. Cleves
ann@optibrium.com

* Erin N. Hancock
erin.hancock@corteva.com

1 BioPharmics Division, Optibrium Limited, Cambridge CB25 9GL, UK

2 Corteva Agriscience, Indianapolis, IN 46268, USA

Content courtesy of Springer Nature, terms of use apply. Rights reserved.

Machine learning approaches have seen a recent resurgence in their applications within the CADD field, in part driven by advances in deep-learning methodologies. A recent review highlights a number of successful applications as well as limitations [9], with further context provided by a full book treatment [10]. With respect to binding affinity prediction in the context of lead optimization, a critical factor is that such methods typically require thousands of data points in order to learn effectively, because of the need to develop encoded internal representations that meaningfully capture the important aspects required for prediction. Early-stage lead optimization may involve just dozens of assayed molecules within a newly discovered chemical series, and even mid- to late-stage projects may be limited to hundreds or up to a few thousand data points. The recently introduced QuanSA machine-learning method (Quantitative Surface-field Analysis) differs from the deep-learning paradigm and from historically widely used methods [11, 12] in ways that make it applicable even in early-stage lead optimization.

The central difference is that rather than applying a generic machine-learning approach to an input molecular representation divorced from a binding event, QuanSA builds a physically interpretable model that is analogous to a protein binding site. By doing so, it addresses the problem of ligand conformation and alignment fully automatically, and it moves in the direction of causal modeling, where the requirement for training data can be reduced. The method constructs a non-linear “pocket-field” that is still physical in nature, and which is directly related to the functional form of scoring functions for docking [13, 14]. QuanSA pocket-field models mirror key physical phenomena that are observed in protein-ligand interactions [15]: (1) choice of ligand poses is defined by the model; (2) non-additive (or even anti-additive) effects of substituent changes on a central scaffold can be modeled effectively; (3) changes in ligand structures induce changes in predicted ligand poses; and (4) the model of molecular activity is dependent on the detailed shape of ligands. Nearly all QSAR and deep-learning methods ignore some or all of these aspects of protein-ligand interactions. Additional discussion of the theoretical contrasts between the QuanSA multiple-instance learning approach and other QSAR (3D and 2D) approaches can be found in the papers introducing the method [11, 12] along with the antecedent QMOD [16] and Compass [17–19] approaches, the latter of which introduced the multiple-instance machine-learning paradigm [20].

Figure2 depicts the overall scheme of the study. Begin-

ning with the earliest 100 molecules and activity data

(MET pIC

), a QuanSA model was induced, guided by a

hypothesis of how a small set of diverse active ligands were

mutually aligned. A set of “future” molecules that had been

made on the way to (and including) Mol-1109 (the bind-

ing metabolite of ﬂorylpicoxamid: FPX) were then scored

using the model. The scoring procedure predicts activity

and bound ligand pose along with estimates of the degree to

which each molecule is well-covered by the model. The top

10 molecules with highest predicted activity among those

well-covered were selected for “synthesis.” In addition, the

top 10 molecules expected to be most informative were also

selected. Those 20 molecules were then used, along with

their experimental activity values, to reﬁne the model, mov-

ing those 20 from the test set to the train set, and this process

was repeated (see the blue arrows in Fig.2 for the reﬁnement

loop). The choice of informative molecules combines two

criteria for a given molecule: (1) it must have a highly active

training molecule as its nearest-neighbor in its QuanSA-pre-

dicted pose; and (2) it must be predicted to have relatively

low activity. Simply put, the informative molecules are sur-

prising: they look a lot like highly active molecules but are

predicted to have poor activity.
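The bookkeeping of the refinement loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `run_rounds` and the two picker callables are hypothetical names, and the pickers stand in for QuanSA model refinement, scoring, and selection, none of which is reproduced here.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical signature: given the current training and future molecule IDs,
# return the IDs chosen for the next round. In the study this step would wrap
# QuanSA model refinement and scoring, which are not reproduced here.
Picker = Callable[[Sequence[str], Sequence[str]], List[str]]


def run_rounds(
    train_ids: Sequence[str],
    future_ids: Sequence[str],
    pick_active: Picker,
    pick_informative: Picker,
    n_rounds: int,
) -> Tuple[List[str], List[str], List[Tuple[int, int, int]]]:
    """Move selected molecules from the future pool into the training set."""
    train = list(train_ids)
    future = list(future_ids)
    history = []  # (round, n_train, n_future) at the moment each model is built
    for rnd in range(n_rounds):
        history.append((rnd, len(train), len(future)))
        chosen: List[str] = []
        for mol in pick_active(train, future) + pick_informative(train, future):
            if mol not in chosen:  # guard against a molecule picked twice
                chosen.append(mol)
        chosen_set = set(chosen)
        train.extend(chosen)  # their experimental values become training data
        future = [m for m in future if m not in chosen_set]
    history.append((n_rounds, len(train), len(future)))
    return train, future, history
```

With 100 training molecules, 1009 future molecules, and pickers that each return ten distinct molecules, five passes through this loop leave 200 training and 909 future molecules.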

In what follows, we show that the process of iterative model refinement drastically reduces the number of analogs required compared with what happened during the actual project. Successive models became progressively broader in terms of structural coverage and more accurate in their predictions. Separate from the activity prediction problem is the question of how one can generate synthetic candidate ideas that lead in a desired direction. We show how highly relevant analog ideas can be automatically generated using only a small number of compounds and potential pendant groups. The computational strategy presented here should have broad applicability in the common case where scaffold replacement is required and structure-activity data are limited and expensive to augment.

Fig. 1 The starting natural product UK-2A is shown (left) along with the binding metabolic form of florylpicoxamid, the final crop protection fungicide

Software, computational protocols, and a subset of structure-activity data discussed in this paper are available to other researchers (see Declarations section).

Results and discussion

We report results for iterative model refinement leading from the natural product antifungal UK-2A to FPX, beginning with a systematic procedure for identifying an informative multiple-ligand alignment and then proceeding through multiple rounds of QuanSA model refinement using an active learning strategy. We also detail a method to generate non-macrocyclic candidate compounds using very sparse data by combining virtual-screening-based central scaffold replacement with a simple method to “staple” appropriate substituents onto the replacement scaffolds.

Initial multiple‑ligand alignment

The QuanSA methodology derives a pocket-field beginning from an initial mutual alignment of a set of training ligands [11, 12], where each ligand has multiple possible initial poses. When protein structure information is available, it is possible to make use of the experimentally determined relative poses of prior known bound ligands in order to guide the construction of the initial set of training poses. Here, no such suitable protein co-crystal structure existed. Rather than using crystallographic data, it is also possible to make use of a carefully constructed multiple-ligand alignment to guide model-building. In cases where scaffold diversity exists among highly active molecules, such alignments can provide significant constraints on the overall ligand alignment problem.

Here, the initial set of active project compounds contained significant diversity, both within the central macrocycle and in the pendant functionality. Figure 3 shows the procedure used to identify a high-quality ligand-based binding site hypothesis using only the data from the earliest set of synthesized molecules. There are two key ideas: (1) to identify structurally diverse active ligands from which to produce multiple-ligand alignments; and (2) to select which of the alternative hypotheses of relative bound poses is quantitatively the best. The 30 molecules from within the top 1.0 log unit of experimentally determined activity among the training molecules were used as input to identify the four most 2D structurally diverse compounds (molecules 13, 89, 2, and 64 in Fig. 3). They were selected automatically based on 2D dissimilarity (see the “Methods and data” section for details).

Fig. 2 Scheme for iterative model refinement using temporally sorted structure-activity data from lead optimization
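One way to picture the automatic choice of diverse actives is a greedy max-min pick over a precomputed 2D dissimilarity matrix. The sketch below is an assumption made for illustration: the actual procedure is the one given in the “Methods and data” section, and the function names, the seeding rule, and the dissimilarity input are all hypothetical.

```python
import numpy as np


def top_window(activities, window=1.0):
    """Indices of molecules within `window` log units of the best activity."""
    activities = np.asarray(activities, dtype=float)
    return [int(i) for i in np.flatnonzero(activities >= activities.max() - window)]


def maxmin_pick(dissim, candidates, n_pick, seed):
    """Greedy max-min selection (an assumed stand-in for the paper's method).

    dissim     : square matrix of pairwise 2D dissimilarities (0 = identical)
    candidates : indices eligible for selection
    seed       : index of the first molecule to keep
    """
    picked = [int(seed)]
    remaining = [int(c) for c in candidates if int(c) != int(seed)]
    while len(picked) < n_pick and remaining:
        # Distance from each remaining candidate to its closest picked molecule.
        nearest = [min(dissim[c, p] for p in picked) for c in remaining]
        best = remaining[int(np.argmax(nearest))]
        picked.append(best)
        remaining.remove(best)
    return picked
```

In the setting described above, `top_window` would return the 30 molecules within 1.0 log unit of the most active training compound, and `maxmin_pick` would be asked for four of them.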

These molecules (to the right of UK-2A in Fig. 3) differed in terms of size and flexibility within the central macrocycle as well as the composition of the right-hand substituents. They were used, with the addition of UK-2A, as input to the multiple-ligand alignment functionality of the eSim method [21], which resulted in several alternative mutual superimpositions. In order to assess which mutual alignment was most likely to reflect the true relative poses of the molecules, the alternative alignments were ranked based on their ability to separate highly active molecules from relatively inactive ones within the initial 100-molecule training set. The chosen hypothesis shown in Fig. 3 was able to distinguish highly active (pIC50 ≥ 8.5) from less active (pIC50 ≤ 7.5) compounds with an ROC area of 0.92. The 3D joint superimposition shows the tight alignment of the common left-hand moiety (the “warhead”) with the variation in the macrocycle and right-hand elements of the molecules.
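The ROC area used to rank alternative alignments can be computed directly from a per-molecule score. The sketch below assumes that each training molecule receives a single score against a given alignment hypothesis (for example a 3D similarity value); that scoring step and the function names are assumptions, while the activity thresholds are those quoted above.

```python
def roc_auc(scores_pos, scores_neg):
    """Probability that a random positive outscores a random negative
    (ties count one half); equivalent to the area under the ROC curve."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def alignment_auc(scores, pic50, hi=8.5, lo=7.5):
    """ROC area for separating highly active (pIC50 >= hi) from less active
    (pIC50 <= lo) molecules; molecules in between are left out."""
    pos = [s for s, a in zip(scores, pic50) if a >= hi]
    neg = [s for s, a in zip(scores, pic50) if a <= lo]
    return roc_auc(pos, neg)
```

Computing `alignment_auc` once per alternative alignment and keeping the highest value mirrors the ranking step described in the text.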

Iterative model refinement

The chosen multiple-ligand alignment from Fig. 3 was used to guide construction of the initial QuanSA model pocket-field. The method allows for incremental iterative refinement based on the availability of new structure-activity data. Figure 4 shows examples of molecules automatically selected by QuanSA for model refinement based on expectations of high activity (left side) or based on expectations of being informative (right side) through multiple rounds of compound selection and model refinement. Intuitively, selection of candidate molecules based on predictions of high activity is an obvious strategy. In an active-learning paradigm, one also seeks to identify maximally informative molecules [22]. One representative example of each type of selection is shown for each of the first four rounds.

The process of scoring candidate molecules in a QuanSA pocket-field results in a prediction of activity and bound pose, along with a number of prediction quality metrics. The novelty metric characterizes the degree to which a candidate molecule is well-covered by the current set of training molecules. Candidate molecule predictions also indicate which training molecule was the nearest-neighbor (NN) in a 3D molecular similarity calculation based on the predicted bound pose.

Here, in each round, the 200 least novel (i.e. best covered) predicted candidate molecules were identified, and, of this subset, the top ten with highest predicted activity were selected for model refinement (see left-hand examples from Fig. 4). The maximally informative set of ten for each round captured a group of molecules that could be thought of as having unexpectedly low activity. Informative molecules were identified from the subset whose NN training molecule similarity was high (top 100 highest NN similarity or NN similarity ≥ 0.85) and where the NN training molecule’s activity was also high (pIC50 ≥ 8.5). From that subset, the ten molecules with the lowest predicted activity were selected (see right-hand examples from Fig. 4).

Fig. 3 Procedure for identifying a high-quality ligand-based binding site hypothesis

In the early rounds, the compounds predicted to be highly active all had a central macrocyclic scaffold that was found among the most highly active training compounds, as would be expected given the starting point of lead optimization. However, after three rounds of model refinement (a cumulative addition of 60 molecules to the original model), a non-macrocycle was correctly identified and chosen as a highly active molecule (Fig. 4, lower left).

In contrast, the compounds predicted to be maximally informative included non-macrocycles even in the initial round of candidate selection. These compounds were deemed to be information rich: the predicted activities were low, yet these candidate molecules had very high 3D similarity to highly active training compounds. Model evolution through inclusion of these informative compounds broadened structural coverage sufficiently that a non-macrocycle was predicted to be highly active by Round-03 (bottom of Fig. 4).

Round-00: Initial model building and selection

QuanSA model building begins with an initialization step that produces training molecule alignments. Here, guided by the multiple-ligand alignment shown in Fig. 3, five alternative initial alignments were produced. Having been driven by the same mutual alignment hypothesis, these initial training molecule alignments differed only slightly, but each was used to build a separate QuanSA model. Selection from among alternative models can be done based on statistics derived from the alternative models. These include: (1) model parsimony, which is a quantitative measure of the extent to which molecules with similar activity values have similar predicted poses; (2) Kendall’s Tau for the full re-fitting of training molecules into a derived pocket-field; and (3) the mean unsigned error (MUE) of the re-fit molecules. The alternative quality values are transformed into probabilistic values, and their product reflects the combination of the different metrics. Here, the selected model exhibited a parsimony of 0.63, Kendall’s Tau of 0.87 (CI 0.82–0.91; p < 10⁻⁴) and MUE of 0.30 (CI 0.25–0.35).

Fig. 4 Example molecules chosen for model refinement in successive rounds of QuanSA testing and refinement

Figure5 shows two representative examples from Round-

00 for each selection type of candidate molecule. At left

(salmon) are the predicted poses for two molecules among

the ten predicted most active. As might have been expected,

these test molecules have a macrocyclic scaﬀold in com-

mon with the most active training ligand. Also, the right-

hand substituents largely occupy the same space as those of

UK-2A. Although the activity predictions for compounds

Mol-0273 and Mol-0496 were high, these molecules fell

within the top 13% and 3%, respectively, of experimental

activity within the full future set of 1009.

At right (yellow) are the predicted poses of two molecules predicted to be among the ten most informative candidates. The poses of the test molecules are shown relative to the pose of training molecule UK-2A (green). These four examples are among the twenty molecules selected to refine the current training model. In contrast to the molecules chosen based on high predicted activity, the molecules chosen to be most informative in Round-00 included four non-macrocycles out of the ten chosen (two examples are shown in Fig. 5 at right). Importantly, the predicted 3D alignments compared with that of UK-2A (green) show the new scaffolds in tight congruence to the lower half of the UK-2A macrocycle. Also, the right-hand moieties of the informative molecules had significant surface overlap with those of UK-2A.

Overall, for the 10 predicted to be most active in Round-00, the MUE was quite high (1.7 pIC50 units), but, interestingly, these were all overpredictions. The predicted activity values exceeded even the maximal experimental activity of the most potent training molecule. This characteristic is not typically seen with traditional machine-learning approaches. With most statistical machine-learning methods and deep-learning methods, implicit or explicit modeling of the prior probability of observing a particular prediction value makes out-of-range predictions rare. The ability to make out-of-range predictions is a strength of moving toward a more causal type of predictive model where, for example, the combination of different aspects of multiple active molecules into a new candidate might lead to an out-of-range prediction. Particularly early on in lead optimization, synthesis of candidate molecules that push the potency envelope is desirable.

Rounds 01-04: Refinement with active learning

Figure 6 shows examples of selected molecules for Round-01 and Round-02. Those compounds predicted to be most active retained macrocyclic scaffolds in both rounds, but they showed some additional diversity in the right-hand hydrophobic groups, with alkyl chains aligning to the benzene moiety of UK-2A (see molecules Mol-0761 and Mol-0415). Also, the nominal actives were more accurately predicted than for Round-00. For Round-01, the MUE was 1.4 pIC50 units. For Round-02, it was 0.9 pIC50 units, nearly 50% lower than for Round-00, indicating significant refinement in the detailed modeling of a subset of highly active ligands.

Those candidates predicted to be most informative for rounds 01 and 02 contained a higher proportion of non-macrocycles than before. Round-01 had 7/10 non-macrocyclic candidate scaffolds, and Round-02 had 9/10 non-macrocyclic scaffolds (see examples in Fig. 6, right, yellow). Alternative branching topologies were seen among the informative candidates as well as novel pendant groups. The flexible thio-ether linkages in compounds Mol-0174 and Mol-0141 still allowed the terminal aromatic rings to overlay well with the corresponding functionality of UK-2A.

Fig. 5 Selections of active (left, predicted poses in salmon carbons) and informative molecules (right, yellow carbons) for Round-00 shown against the predicted pose of UK-2A

One of the challenges in providing computational guidance for synthetic candidate prioritization is having a meaningful explanatory basis for predictions. As shown in Fig. 6, the optimal poses that come out of the fitting process into the quantitative pocket-fields offer convincing correspondence between predictions and known SAR. This is preferable to black-box predictions or those that may yield some explanatory information but do not provide a physically meaningful interpretation.

Figure7 shows examples of the selected molecules

for rounds 03 and 04. By Round-03, when only 60 previ-

ous future molecules had been used to reﬁne the original

100-compound model, three non-macrocyclic scaﬀolds were

among the ten predicted to be most active (two examples are

shown: molecules Mol-0874 and Mol-0854). A non-macro-

cycle was also chosen in Round-04 among those predicted

most active (Mol-1098, bottom left of Fig.7). The trend of

improvement in accuracy for the predicted most active can-

didates continued, with an MUE of 0.9 and 0.8 pIC

units,

respectively, for these two reﬁnement rounds.

Those predicted most informative for Round-03 and Round-04 included 3/10 and 5/10 non-macrocycles. The decrease in the number of non-macrocycles in the informative set compared to Round-01 and Round-02 suggests that model refinement improved the predictions on non-macrocycles and thus these molecules would be less represented among those molecules that were being incorrectly predicted as having low activity.

Another aspect of quantitative activity prediction for this series was the clear importance of detailed hydrophobic shape on experimental activity. The ability of a compound to fill the presumed hydrophobic pockets of the non-warhead (right-hand) side of the binding site was a clear activity requirement. Accurately modeling such phenomena depends not only on a molecular representation that captures ligand shape, but also on predictions of molecular pose that respect the internal conformational energetics of candidate molecules.

Fig. 6 Selections of active and informative molecules for Round-01 and Round-02

Round‑05: The goal compound

As shown in Fig.8, FPX was chosen by the model that was

trained on the initial 100 molecules and subsequently reﬁned

with 100 chosen based on the active learning strategy during

the ensuing rounds of reﬁnement. FPX was among the 10

predicted to be most active and the activity was accurately

predicted with pIC

50 =10.0

with a signed error of just + 0.4

pIC

units. The MUE of the 10 predicted to be most active

was 0.8 pIC

units, and importantly, the set included 7/10

non-macrocycles, evidence that the model had eﬀectively

learned the non-macrocyclic scaﬀold.

For FPX, in addition to the predicted pose, a depiction of the quantitative interactions with the pocket-field is shown in Fig. 8. The large majority of interactions were of a purely hydrophobic type, represented by salmon-colored sticks whose length is proportional to the interaction magnitude. Notably, the two fluoro-phenyl groups of FPX, which overlay corresponding hydrophobic functionality of UK-2A, are responsible for significant interactions. In addition, the two chiral methyl groups, especially the one at the lower right, were also important. Because of the angle needed to adequately display the key hydrophobic interactions, the specific polar interactions made by the warhead and the amide linker are somewhat difficult to discern, but all of the specific polar moieties were responsible for key interactions as well (blue and red sticks, for hydrogen-bond donors and acceptors, respectively). The other example highlights both the variability that can be tolerated in the pendant hydrophobic groups and the fact that the core scaffold shifts in accommodating different substituents.

In this gedankenexperiment, only 100 additional future molecules needed to be synthesized, tested, and added to the model in order to correctly choose FPX as an excellent candidate molecule. In completing Round-05, FPX and the other 19 chosen candidate molecules would be synthesized and tested. Overall, only 120/1009 future molecules needed to be “made” to both identify and confirm FPX as a highly active candidate with just two chiral centers, no macrocyclic component, and favorable synthetic characteristics.

Fig. 7 Selections of active and informative molecules for Round-03 and Round-04

Temporal model evolution

Table1 summarizes the statistics for the rounds of model

building, reﬁnement, and future predictions. The training re-

ﬁt Kendall’s Tau was consistently high (0.82–0.87) through-

out the ﬁve rounds of reﬁnement, indicating that model ﬁdel-

ity was maintained as new molecules were added. Likewise,

the training re-ﬁt MUE remained low (0.30–0.36 log units)

throughout model reﬁnement.
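The summary statistics reported for each round, and their confidence intervals obtained by resampling with replacement, can be illustrated with a short NumPy sketch. The tau-a variant, the percentile form of the bootstrap, the number of resamples, and all function names are assumptions for illustration; the text does not specify these details.

```python
import numpy as np


def mue(exp, pred):
    """Mean unsigned error between experimental and predicted pIC50 values."""
    return float(np.mean(np.abs(np.asarray(exp, float) - np.asarray(pred, float))))


def kendall_tau_a(exp, pred):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    x = np.asarray(exp, float)
    y = np.asarray(pred, float)
    n = len(x)
    sx = np.sign(x[:, None] - x[None, :])
    sy = np.sign(y[:, None] - y[None, :])
    # Each unordered pair appears twice in the n-by-n product matrix.
    return float((sx * sy).sum() / (n * (n - 1)))


def bootstrap_ci(exp, pred, stat, n_boot=2000, level=95.0, seed=0):
    """Percentile confidence interval from resampling molecules with replacement."""
    rng = np.random.default_rng(seed)
    exp = np.asarray(exp, float)
    pred = np.asarray(pred, float)
    n = len(exp)
    values = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample n molecules with replacement
        values.append(stat(exp[idx], pred[idx]))
    tail = (100.0 - level) / 2.0
    lo, hi = np.percentile(values, [tail, 100.0 - tail])
    return float(lo), float(hi)
```

Calling `bootstrap_ci(exp, pred, mue)` or `bootstrap_ci(exp, pred, kendall_tau_a)` on a round's experimental and predicted values would give intervals of the kind shown in parentheses in Table 1.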

Here, the sets of future molecules were much larger than the training set, and they reflected substantial changes in the structural composition of molecules and the distribution of activity values compared with the training molecules. Later in the project, as expected, a larger proportion of synthesized molecules had very high activity. During successive rounds of scoring future molecules, Tau trended upward, increasing from 0.35 to 0.46. A large proportion of molecules had experimental activity values of 8.5–9.5. The small data range coupled with the presence of assay noise limits the upper bound on rank-based statistics.

More striking was the decrease in MUE for predictions on future molecules from 1.24 to 0.70. The model became significantly more accurate during refinement. As the future MUE decreased, the FPX predicted activity improved from pIC50 = 7.3 (signed error −2.2) in Round-00 to pIC50 = 10.0 (signed error +0.4) in Round-05. Model improvement was further reflected in the predicted rank of FPX activity, which rose from the top 61% to the top 1%.

Figure9 shows the plots of experimental versus predicted

activities for the set of all future molecules for each round of

Fig. 8 Selections of predicted

active candidates for Round-05

Table 1 Summary of rounds of model building and testing

Kendall's Tau values are unitless and all had p < 10^−[…]. Mean unsigned error (MUE) and FPX predicted activity are in units of pIC50. Numbers in parentheses are 95% confidence intervals calculated by resampling with replacement

| Round | n Train | Train Tau | Train MUE | n Future | Future Tau | Future MUE | FPX Pred | FPX rank % |
|---|---|---|---|---|---|---|---|---|
| 00 | 100 | 0.87 (0.82–0.91) | 0.30 (0.25–0.35) | 1009 | 0.35 (0.31–0.39) | 1.24 (1.18–1.30) | 7.3 | 61 |
| 01 | 120 | 0.84 (0.79–0.89) | 0.34 (0.30–0.39) | 989 | 0.35 (0.31–0.39) | 0.95 (0.90–1.00) | 7.6 | 62 |
| 02 | 140 | 0.85 (0.79–0.90) | 0.32 (0.28–0.37) | 969 | 0.41 (0.37–0.45) | 0.85 (0.81–0.90) | 8.5 | 36 |
| 03 | 160 | 0.82 (0.77–0.86) | 0.36 (0.32–0.40) | 949 | 0.40 (0.36–0.44) | 0.76 (0.73–0.80) | 8.5 | 44 |
| 04 | 180 | 0.82 (0.78–0.86) | 0.35 (0.31–0.39) | 929 | 0.38 (0.34–0.43) | 0.78 (0.74–0.82) | 8.8 | 26 |
| 05 | 200 | 0.82 (0.78–0.86) | 0.34 (0.30–0.39) | 909 | 0.46 (0.42–0.50) | 0.70 (0.66–0.74) | 10.0 | 1 |

Content courtesy of Springer Nature, terms of use apply. Rights reserved.

Journal of Computer-Aided Molecular Design (2024) 38:19 19 Page 10 of 16

Fig. 9 Experimental versus predicted activities for the set of all future molecules for each round of testing


testing. The identity line indicates perfect prediction, and the lighter lines represent ±1.5 units of pIC50 (corresponding to ±2 kcal/mol). The initial Round-00 model exhibited a strong lower-right triangular bias, with a significantly larger fraction of underpredictions than overpredictions. This aspect of the model's predictive behavior shifted rapidly over the ensuing two rounds of active learning. By Round-03, relatively little skew was apparent. The fraction of large underpredictions (errors below −2 kcal/mol; Fig. 9, red triangles) decreased by 10 percentage points from Round-00 to Round-01, and in Round-03 through Round-05 it was nearly as small as the fraction of large overpredictions. The Round-05 FPX prediction (Exp pIC50 = 9.5, Pred pIC50 = 10.0) is highlighted in red.
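The stated correspondence between ±1.5 pIC50 units and ±2 kcal/mol can be checked with the relation ΔΔG = RT ln(10) × ΔpIC50. The sketch below assumes a temperature of 298.15 K and treats a shift in IC50 as a proportional shift in binding constant; neither assumption is stated in the paper, and the function name is illustrative.

```python
import math

R_KCAL = 0.0019872041  # gas constant in kcal/(mol*K)


def pic50_to_kcal(delta_pic50, temperature_k=298.15):
    """Free-energy equivalent (kcal/mol) of a difference in pIC50 units."""
    return R_KCAL * temperature_k * math.log(10) * delta_pic50


# One log unit is roughly 1.36 kcal/mol at room temperature,
# so 1.5 log units is close to 2 kcal/mol.
print(round(pic50_to_kcal(1.0), 2), round(pic50_to_kcal(1.5), 2))
```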

Table2 shows a summary of the distribution of predic-

tions on future molecules depicted in the plots of Fig.9.

Throughout model reﬁnement and predictions on future

molecules, large overpredictions were few and relatively

constant (7% in Round-00 to 3% in Round-05). The predic-

tions within 2 kcal/mol increased from 66% in Round-00 to

91% in Round-05, and those within 1 kcal/mol from 36%

in Round-00 to 61% in Round-05. The dramatic decrease

in underpredictions occurred in two steps, from Round-00

(27%) to Round-01 (17%) and from Round-02 (14%) to

Round-03 (8%). By Round-05, the fractions of large over-

and under-predictions were essentially the same.
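The percentages in Table 2 amount to binning signed prediction errors. A minimal sketch of that binning follows, assuming signed error is predicted minus experimental pIC50 and that the 1 and 2 kcal/mol thresholds correspond to 0.75 and 1.5 pIC50 units. The 1.5 value follows the lines drawn in Fig. 9, whereas 0.75 is an assumption made here by halving it. The function and key names are illustrative.

```python
def summarize_errors(experimental, predicted, one_kcal=0.75, two_kcal=1.5):
    """Percentage of predictions in each error bin (thresholds in pIC50 units)."""
    n = len(experimental)
    errors = [p - e for e, p in zip(experimental, predicted)]

    def pct(count):
        return 100.0 * count / n

    return {
        "within_1": pct(sum(abs(err) <= one_kcal for err in errors)),
        "within_2": pct(sum(abs(err) <= two_kcal for err in errors)),
        "under": pct(sum(err < -two_kcal for err in errors)),  # large underpredictions
        "over": pct(sum(err > two_kcal for err in errors)),    # large overpredictions
    }
```

By construction the "within_2", "under", and "over" percentages sum to 100, which matches the rows of Table 2 (for example 66 + 27 + 7 in Round-00).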

The distribution of experimental activity values for the future set of molecules changed relatively little over time, perhaps as expected given that only roughly 10% of the molecules were selected over the rounds of iterative refinement. The minimum and maximum pIC50 values were 4.3 and 10.1, respectively, throughout. The mean and standard deviation began with 8.5 ± 1.0 and ended with 8.4 ± 1.0. However, for the training set, the distribution shifted. The initial minimum and maximum pIC50 training values were 4.3 and 9.5, respectively, shifting to 4.3 and 9.9 at the end of refinement. The mean and standard deviation began with 7.6 ± 1.3 and ended with 8.1 ± 1.2. The distributional shifts in the training data during refinement reflected the successful selection of numerous potent candidate molecules.

Idea generation

One difficulty in interpreting the foregoing results is that the molecules available for selection had already been made and tested as part of an active design process, in which decisions on what to make next were taken by experts based on their knowledge of prior data as well as their expertise in the field. So, while the active learning approach was able to select efficiently from that set of molecules, it is not clear that such a path could be followed in a situation where the future space of molecules was open to determination.

Generative approaches that employ deep learning to produce ideas for new compounds have gained some prominence recently [9, 10]. We have taken a different approach, instead using molecular similarity to identify possible bioisosteric core scaffold replacements, including their suitability to display the required pendant functionality for good activity. Figure 10 illustrates how similarity-based screening, combined with the attachment of desirable pendant groups, can rapidly generate ideas. Our approach is similar in spirit to work by Awale, Hert et al. [23], in which the authors describe a 2D matched-pair approach to identifying sensible candidate molecules based on an analysis of large structure-activity databases.

Beginning with the original five-molecule multiple-ligand alignment used to guide QuanSA model-building, the pendant groups were removed to produce a core-scaffold overlay, and the amide linking subfragment was extracted (Fig. 10A at right). The roughly 3,000,000-compound Enamine Stock Screening collection was screened against the multiple-ligand core using the amide fragment as a required positional restraint to ensure that all hits returned would have appropriate chemistry for linkage to the common warhead. Two examples of high-scoring hits from the screen are shown in Fig. 10B (cyan carbons) in their optimal poses relative to the screening target (green carbons).

For each returned pose of each nominal screening hit,

a geometric matching procedure was employed to identify

crossover points between the screening hit and each of the

full parent molecules from the original multiple-ligand

Table 2 Summary of the distribution of predictions on future molecules

| Round | n Train | n Future | % Pred w/in 1 kcal/mol | % Pred w/in 2 kcal/mol | % Underpredictions | % Overpredictions |
|---|---|---|---|---|---|---|
| 00 | 100 | 1009 | 36 | 66 | 27 | 7 |
| 01 | 120 | 989 | 50 | 80 | 17 | 3 |
| 02 | 140 | 969 | 53 | 83 | 14 | 3 |
| 03 | 160 | 949 | 55 | 89 | 8 | 3 |
| 04 | 180 | 929 | 55 | 88 | 9 | 3 |
| 05 | 200 | 909 | 61 | 91 | 6 | 3 |


alignment. Figure10C shows the process using the pose of

compound Mol-0013 as the crossover target. When a com-

patible set of distances and bond vectors existed, the origi-

nal substituents of the screening hit were replaced with the

substituents of the parent compound. Figure10C (bottom)

shows the two resulting merged molecules with novel struc-

tures. The arrows and corresponding thick lines show the

speciﬁc substituent movements that were made. The initial

crossover results in high local strain for the new bonds, and

the ﬁnal ligand pose is relaxed using positionally-restrained

energy minimization.
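The geometric test for crossover points ("a compatible set of distances and bond vectors") can be illustrated with a simple sketch. This is not the Surflex procedure. It assumes that each candidate bond is given as a pair of 3D coordinates in the shared alignment frame, and that two bonds are compatible when their core-side atoms lie within a distance tolerance and their directions differ by less than an angle tolerance. The tolerances of 1.0 Å and 30° and all names are hypothetical.

```python
import math


def _angle_deg(u, v):
    """Angle in degrees between two 3D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


def find_crossover_points(hit_bonds, parent_bonds, max_dist=1.0, max_angle=30.0):
    """Return (hit_index, parent_index) pairs whose bonds nearly coincide.

    Each bond is (start_xyz, end_xyz): start is the core-side atom and end is
    the substituent-side atom, both in the shared alignment frame.
    """
    matches = []
    for i, (h_start, h_end) in enumerate(hit_bonds):
        h_vec = [e - s for s, e in zip(h_start, h_end)]
        for j, (p_start, p_end) in enumerate(parent_bonds):
            p_vec = [e - s for s, e in zip(p_start, p_end)]
            close = math.dist(h_start, p_start) <= max_dist
            if close and _angle_deg(h_vec, p_vec) <= max_angle:
                matches.append((i, j))
    return matches
```

Each matched pair marks a position where a substituent of the parent could be grafted onto the screening hit. The strain introduced by the imperfect overlap would then be removed by a separate minimization step, as described above.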

Figure10D shows the relationship of the two resulting

generated candidate molecules. Each contains a large frac-

tion of the exact substructure (including chirality) of the

Fig. 10 Scheme for generating synthetic ideas using a combination of eSim screening and automatic addition of desirable pendant groups


ﬁnal FPX compound, with relatively minor variations in

the precise hydrophobic substituents at right. With a slight

generalization of the procedure to include additional sub-

stituent variations (e.g. p-ﬂuoro-phenyl at both positions),

the exact structure of FPX would have been generated. The

data required to identify the ﬁve-molecule multiple-ligand

alignment was just the ﬁrst 100 compounds from the full

structure-activity set. The computational procedure for iden-

tifying core-scaﬀold hits and producing merged candidate

molecules required less than an hour and no additional data.

The procedure just described is not intended to fully auto-

mate candidate compound generation. Rather, it is meant to

be a source of ideas that are easy to scan rapidly. Of course,

it is also possible to make use of the predictive QuanSA

models to identify candidates that are quantitatively pre-

dicted to have high potency or have high information value.

Conclusions

Overall, beginning with the earliest 100 picolinamide anti-

fungal project compounds, an active-learning approach eﬃ-

ciently guided candidate selection to the desired end product

FPX after model reﬁnement using just 100 synthetic ana-

logs. This project began with a relatively potent lead com-

pound in UK-2A, with design goals including a reduction

in molecular complexity that required replacement of the

central macrocyclic scaﬀold. This presents a challenge for

predictive modeling because the molecules to be designed

must deviate quite signiﬁcantly from known chemical mat-

ter. Through the use of active learning, rapid introduction

of novel structural features was possible. The process was

guided by a well-deﬁned notion of what makes a highly

informative molecule—one that exhibits high similarity to

a known active in their respective optimal predicted poses

but which is (possibly anomalously) predicted to have low

activity.
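The two kinds of selection used in each round can be sketched as simple filters and sorts over scored candidates. This is an illustration under stated assumptions rather than the QuanSA implementation: the `Candidate` fields, the coverage score, and every threshold below are hypothetical, and the paper does not say how each 20-compound round is split between the two selection types.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    predicted_pic50: float      # model prediction for the candidate
    coverage: float             # hypothetical 0-1 score: how well the model covers it
    neighbor_similarity: float  # 3D similarity to the nearest training molecule
    neighbor_pic50: float       # experimental activity of that training molecule


def select_active(cands, n, min_coverage=0.7):
    """Most active predictions among candidates the model covers well."""
    covered = [c for c in cands if c.coverage >= min_coverage]
    return sorted(covered, key=lambda c: c.predicted_pic50, reverse=True)[:n]


def select_informative(cands, n, min_similarity=0.8, min_neighbor_pic50=9.0):
    """Candidates predicted weak despite high similarity to a potent neighbor."""
    pool = [
        c for c in cands
        if c.neighbor_similarity >= min_similarity
        and c.neighbor_pic50 >= min_neighbor_pic50
    ]
    # Largest gap between the potent neighbor and the candidate's prediction first.
    return sorted(pool, key=lambda c: c.neighbor_pic50 - c.predicted_pic50, reverse=True)[:n]
```

The second function encodes the notion of an informative molecule given above: measuring such a compound either confirms a real activity cliff or exposes a model error, and in both cases the refined model gains information.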

The practical significance of the retrospective analysis presented here is in the breadth of applicability for scaffold replacement and lead optimization more broadly. The

QuanSA method does not require a protein structure to make

accurate predictions that are physically explainable. While

it can make use of information from experimental determi-

nation of bound ligand structures, it can operate in a purely

ligand-based manner where the only available data are com-

pound structures and activities. Model building can proceed

from very limited project data, beginning with just dozens

of molecules, not the thousands required for so-called deep-

learning methods [9, 10].

Further, model-building is not terribly computationally

intensive. On modest workstation hardware, candidate mol-

ecules can be scored in seconds for “normal” small mol-

ecules. Small macrocycles such as those seen here required

tens of seconds per molecule, with a majority of the time

going to conformational search. For the work reported here,

the fully-automated procedure took approximately two days

on an 18-core workstation. This encompassed the entire pro-

cess beginning with 2D structures for all 1109 molecules,

through model-building, scoring/selection/reﬁnement, and

the ﬁnal pass from which FPX was chosen.

The lead optimization project that resulted in FPX

required the synthesis of many hundreds of analogs in

order to re-engineer the starting macrocyclic natural

product. We believe that eﬀective use of active learning

and semi-automatic candidate generation can drastically

shorten the design path from initial lead compound to

ﬁnal product. The central requirement for the computa-

tional methodology is that it is capable of extrapolating

from small quantities of structure-activity data. Mod-

eling approaches that move toward constructing causal

models for activity prediction have clear advantages over

approaches that ignore the physical underpinnings of how

ligands bind to and modulate the activity of biological

targets.

Methods anddata

Molecular data set

A total of 1109 compounds from a lead optimization project formed the data set. Molecules were provided as 2D SDF structures with associated activities and consecutive compound IDs serving as relative synthesis dates for temporal sorting. The project dataset contained pIC50 activity values and registration dates, beginning with UK-2A as compound 1 and the resultant commercial product FPX as compound 1109. The activity values were determined in an in vitro assay for the inhibition of fungal mitochondrial electron transport. The first 100 molecules synthesized were used as the initial training set for QuanSA, with the remaining 1009 molecules used as future “synthesizable” molecules.

Computational procedures

For all procedures, we employed version 5.1 of the Sur-

ﬂex Platform (BioPharmics Division, Optibrium Limited,

Cambridge, CB25 9GL, UK). Additional details can be

found in the data archive associated with this paper (see

the Declarations section).


Ligand preparation

Standard procedures were used to protonate the molecules

as expected at physiological pH, generate 3D structures,

and perform conformational search, as follows:

Multiple ligand alignment

The use of Surflex eSim to generate multiple ligand alignments [21], specifically for the purpose of seeding initial QuanSA alignments, was outlined earlier and has been reported previously [11]. The specific procedure used in this work was as follows:

QuanSA model induction, prediction, andreﬁnement

Previous QuanSA method papers are comprehensive, and

contain a detailed algorithmic description [11, 12]. Here,

standard procedures were used, as follows:

The model selection procedure (the select command) generates a model quality score that combines the following: (1) model parsimony (P), which is a quantitative measure of the extent to which molecules with similar activity values have similar predicted poses; (2) Kendall’s Tau (T) for the full re-fitting of training molecules into a derived pocket-field; and (3) the mean unsigned error (E) of the re-fit molecules.

Given N alternative models, each of $P_{1 \ldots N}$, $T_{1 \ldots N}$, and $E_{1 \ldots N}$ is transformed into corresponding probability scores. This is done by fitting a normal distribution to each of $P_{1 \ldots N}$, $T_{1 \ldots N}$, and $E_{1 \ldots N}$, which then allows calculation of the cumulative distribution function for each of P, T, and E. So, raw values for each metric across the N alternative models are converted to probabilities reflecting their likelihood of being non-random, written here as $P^{p}_{1 \ldots N}$, $T^{p}_{1 \ldots N}$, and $E^{p}_{1 \ldots N}$. The probability score for model i is simply the product $(P^{p}_{i})(T^{p}_{i})(E^{p}_{i})$. The highest scoring of the alternative models using the combined probabilistic score is selected.
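A minimal sketch of this scoring scheme follows, written from the description above rather than from the Surflex source. It assumes that higher parsimony and higher Tau are better, that the error term is converted with 1 − CDF so that lower error scores higher, and that the population standard deviation is used for the normal fit. The text confirms none of these details, and the function names are illustrative.

```python
import math
from statistics import mean, pstdev


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def to_probabilities(values, higher_is_better=True):
    """Fit a normal to the N raw values and map each value to a probability."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0.0:
        return [0.5] * len(values)  # all models identical on this metric
    probs = [normal_cdf(v, mu, sigma) for v in values]
    return probs if higher_is_better else [1.0 - p for p in probs]


def select_model(parsimony, tau, mue):
    """Return (index of best model, list of combined scores)."""
    p = to_probabilities(parsimony)
    t = to_probabilities(tau)
    e = to_probabilities(mue, higher_is_better=False)  # assumption: lower error is better
    scores = [pi * ti * ei for pi, ti, ei in zip(p, t, e)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores
```

Because the three probabilities are multiplied, a model that is poor on any one metric receives a low combined score even if it is strong on the other two.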

New molecule scoring, selection, and model reﬁnement

followed these general procedures:


Computational procedures foridea generation

The pendant groups were trimmed from the 5-molecule

multiple ligand alignment described above, leaving only

the aligned central scaﬀolds. The aligned core scaﬀolds

were used as a multi-ligand target in a virtual screen of the

Enamine database. The resulting hits were processed using

a new procedure to automatically attach pendant groups

from the original full ligands of the multiple-ligand align-

ment, as follows:

Note that the resulting merged molecules can be

reviewed directly or can be subjected to conformational

search and screened using either pure ligand similar-

ity, QuanSA model scoring, or a combination of both

approaches.

Acknowledgements The authors thank Negar Garizi and Matt Segall

for supporting this work and for valuable scientiﬁc discussions.

Author contributions All authors participated in the research and in

the preparation of and ﬁnal review of the manuscript.

Funding The authors have no outside funding to declare.

Data availability A freely downloadable data archive with additional details is available at www.jainlab.org/downloads. The archive contains scripts to reproduce the protocols described here beginning from

2D input structures. Note that the compound structures in the archive

are limited to those depicted here and do not include the full set of

1109 ligands described, whose structures and activity values cannot

be disclosed.

Declarations

Conflict of interest The authors have no conﬂict of interest as deﬁned

by Springer, or other interests that might be perceived to inﬂuence the

results and/or discussion reported in this paper.

Ethical approval and consent to participate Not applicable.

Consent for publication All authors have read and understood the pub-

lishing policy, and this manuscript is submitted in accordance with

this policy.

Open Access This article is licensed under a Creative Commons Attri-

bution 4.0 International License, which permits use, sharing, adapta-

tion, distribution and reproduction in any medium or format, as long

as you give appropriate credit to the original author(s) and the source,

provide a link to the Creative Commons licence, and indicate if changes

were made. The images or other third party material in this article are

included in the article’s Creative Commons licence, unless indicated

otherwise in a credit line to the material. If material is not included in

the article’s Creative Commons licence and your intended use is not

permitted by statutory regulation or exceeds the permitted use, you will

need to obtain permission directly from the copyright holder. To view a

copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

1. Meyer KG, Bravo-Altamirano K, Herrick J, Loy BA, Yao C, Nugent B, Buchan Z, Daeuble JF, Heemstra R, Jones DM, Wilmot J, Lu Y, DeKorver K, DeLorbe J, Rigoli J (2021) Discovery of florylpicoxamid, a mimic of a macrocyclic natural product. Bioorg Med Chem 50:116455

2. Labute P (2010) LowModeMD: implicit low-mode velocity filtering applied to conformational search of macrocycles and protein loops. J Chem Inf Model 50(5):792–800

3. Chen IJ, Foloppe N (2013) Tackling the conformational sampling

of larger ﬂexible compounds and macrocycles in pharmacology

and drug discovery. Bioorg Med Chem 21(24):7898–7920

4. Watts KS, Dalal P, Tebben AJ, Cheney DL, Shelley JC (2014)

Macrocycle conformational sampling with MacroModel. J Chem

Inf Model 54(10):2680–2696

5. Sindhikara D, Spronk SA, Day T, Borrelli K, Cheney DL, Posy SL (2017) Improving accuracy, diversity, and speed with prime macrocycle conformational sampling. J Chem Inf Model 57(8):1881–1894

6. Cleves AE, Jain AN (2017) ForceGen 3D structure and conformer

generation: From small lead-like molecules to macrocyclic drugs.

J Comput Aided Mol Des 31(5):419–439

7. Jain AN, Cleves AE, Gao Q, Wang X, Liu Y, Sherer EC, Reibarkh

MY (2019) Complex macrocycle exploration: parallel, heuristic,

and constraint-based conformer generation using ForceGen. J

Comput Aided Mol Des 33(6):531–558

8. Jain AN, Brueckner AC, Jorge C, Cleves AE, Khandelwal P, Cor-

tes JC, Mueller L (2023) Complex peptide macrocycle optimiza-

tion: combining NMR restraints with conformational analysis to

guide structure-based and ligand-based design. J Comput Aided

Mol Des 37:519–535

9. Walters WP, Barzilay R (2020) Applications of deep learning

in molecule generation and molecular property prediction. Acc

Chem Res 54(2):263–270

10. Ramsundar B, Eastman P, Walters P, Pande V (2019) Deep

learning for the life sciences: applying deep learning to genom-

ics, microscopy, drug discovery, and more. O’Reilly Media Inc,

Newton

11. Cleves AE, Jain AN (2018) Quantitative surface ﬁeld analysis:

learning causal models to predict ligand binding aﬃnity and pose.

J Comput Aided Mol Des 32(7):731–757

12. Cleves AE, Johnson SR, Jain AN (2021) Synergy and complemen-

tarity between focused machine learning and physics-based simu-

lation in aﬃnity prediction. J Chem Inf Model 61(12):5948–5966


13. Jain AN (1996) Scoring noncovalent protein-ligand interactions:

a continuous diﬀerentiable function tuned to compute binding

aﬃnities. J Comput Aided Mol Des 10(5):427–440

14. Pham T, Jain AN (2006) Parameter estimation for scoring pro-

tein-ligand interactions using negative training data. J Med Chem

49(20):5856–5868

15. Jain AN, Cleves AE (2012) Does your model weigh the same as

a Duck? J Comput Aided Mol Des 26:57–67

16. Cleves AE, Jain AN (2016) Extrapolative prediction using phys-

ically-based QSAR. J Comput Aided Mol Des 30(2):127–152

17. Jain AN, Dietterich TG, Lathrop RH, Chapman D, Critchlow REJ,

Bauer BE, Webster TA, Lozano-Perez T (1994) A shape-based

machine learning tool for drug design. J Comput Aided Mol Des

8(6):635–52

18. Jain AN, Koile K, Chapman D (1994) Compass: predicting

biological activities from molecular surface properties. Per-

formance comparisons on a steroid benchmark. J Med Chem

37(15):2315–27

19. Jain AN, Harris N, Park J (1995) Quantitative binding site model

generation: compass applied to multiple chemotypes targeting the

5-HT1a receptor. J Med Chem 38(8):1295–1308

20. Dietterich TG, Lathrop RH, Lozano-Pérez T (1997) Solving the

multiple instance problem with axis-parallel rectangles. Artif

Intell 89(1–2):31–71

21. Cleves AE, Jain AN (2020) Structure- and ligand-based virtual screening on DUD-E+: performance dependence on approximations to the binding pocket. J Chem Inf Model 60(9):4296–4310

22. Varela R, Walters W, Goldman B, Jain AN (2012) Iterative reﬁne-

ment of a binding pocket model: active computational steering of

lead optimization. J Med Chem 55(20):8926–8942

23. Awale M, Hert J, Guasch L, Riniker S, Kramer C (2021) The play-

books of medicinal chemistry design moves. J Chem Inf Model

61(2):729–742

Publisher's Note Springer Nature remains neutral with regard to

jurisdictional claims in published maps and institutional aﬃliations.


Complex peptide macrocycle optimization: combining NMR restraints with conformational analysis to guide structure-based and ligand-based design

Article

Full-text available

Aug 2023
J COMPUT AID MOL DES

Systematic optimization of large macrocyclic peptide ligands is a serious challenge. Here, we describe an approach for lead-optimization using the PD-1/PD-L1 system as a retrospective example of moving from initial lead compound to clinical candidate. We show how conformational restraints can be derived by exploiting NMR data to identify low-energy solution ensembles of a lead compound. Such restraints can be used to focus conformational search for analogs in order to accurately predict bound ligand poses through molecular docking and thereby estimate ligand strain and protein-ligand intermolecular binding energy. We also describe an analogous ligand-based approach that employs molecular similarity optimization to predict bound poses. Both approaches are shown to be effective for prioritizing lead-compound analogs. Surprisingly, relatively small ligand modifications, which may have minimal effects on predicted bound pose or intermolecular interactions, often lead to large changes in estimated strain that have dominating effects on overall binding energy estimates. Effective macrocyclic conformational search is crucial, whether in the context of NMR-based restraints, X-ray ligand refinement, partial torsional restraint for docking/ligand-similarity calculations or agnostic search for nominal global minima. Lead optimization for peptidic macrocycles can be made more productive using a multi-disciplinary approach that combines biophysical data with practical and efficient computational methods.

Synergy and Complementarity between Focused Machine Learning and Physics-Based Simulation in Affinity Prediction

Article

Full-text available

Dec 2021

We present results on the extent to which physics-based simulation (exemplified by FEP⁺) and focused machine learning (exemplified by QuanSA) are complementary for ligand affinity prediction. For both methods, predictions of activity for LFA-1 inhibitors from a medicinal chemistry lead optimization project were accurate within the applicable domain of each approach. A hybrid model that combined predictions by both approaches by simple averaging performed better than either method, with respect to both ranking and absolute pKi values. Two publicly available FEP⁺ benchmarks, covering 16 diverse biological targets, were used to test the generality of the synergy. By identifying training data specifically focused on relevant ligands, accurate QuanSA models were derived using ligand activity data known at the time of the original series publications. Results across the 16 benchmark targets demonstrated significant improvements both for ranking and for absolute pKi values using hybrid predictions that combined the FEP⁺ and QuanSA predicted affinity values. The results argue for a combined approach for affinity prediction that makes use of physics-driven methods as well as those driven by machine learning, each applied carefully on appropriate compounds, with hybrid prediction strategies being employed where possible.

Complex macrocycle exploration: parallel, heuristic, and constraint-based conformer generation using ForceGen

Article

Full-text available

Jun 2019
J COMPUT AID MOL DES

ForceGen is a template-free, non-stochastic approach for 2D to 3D structure generation and conformational elaboration for small molecules, including both non-macrocycles and macrocycles. For conformational search of non-macrocycles, ForceGen is both faster and more accurate than the best of all tested methods on a very large, independently curated benchmark of 2859 PDB ligands. In this study, the primary results are on macrocycles, including results for 431 unique examples from four separate benchmarks. These include complex peptide and peptide-like cases that can form networks of internal hydrogen bonds. By making use of new physical movements (“flips” of near-linear sub-cycles and explicit formation of hydrogen bonds), ForceGen exhibited statistically significantly better performance for overall RMS deviation from experimental coordinates than all other approaches. The algorithmic approach offers natural parallelization across multiple computing-cores. On a modest multi-core workstation, for all but the most complex macrocycles, median wall-clock times were generally under a minute in fast search mode and under 2 min using thorough search. On the most complex cases (roughly cyclic decapeptides and larger) explicit exploration of likely hydrogen bonding networks yielded marked improvements, but with calculation times increasing to several minutes and in some cases to roughly an hour for fast search. In complex cases, utilization of NMR data to constrain conformational search produces accurate conformational ensembles representative of solution state macrocycle behavior. On macrocycles of typical complexity (up to 21 rotatable macrocyclic and exocyclic bonds), design-focused macrocycle optimization can be practically supported by computational chemistry at interactive time-scales, with conformational ensemble accuracy equaling what is seen with non-macrocyclic ligands. 
For more complex macrocycles, inclusion of sparse biophysical data is a helpful adjunct to computation.

Quantitative surface field analysis: learning causal models to predict ligand binding affinity and pose

Article

Full-text available

Jul 2018
J COMPUT AID MOL DES

We introduce the QuanSA method for inducing physically meaningful field-based models of ligand binding pockets based on structure-activity data alone. The method is closely related to the QMOD approach, substituting a learned scoring field for a pocket constructed of molecular fragments. The problem of mutual ligand alignment is addressed in a general way, and optimal model parameters and ligand poses are identified through multiple-instance machine learning. We provide algorithmic details along with performance results on sixteen structure-activity data sets covering many pharmaceutically relevant targets. In particular, we show how models initially induced from small data sets can extrapolatively identify potent new ligands with novel underlying scaffolds with very high specificity. Further, we show that combining predictions from QuanSA models with those from physics-based simulation approaches is synergistic. QuanSA predictions yield binding affinities, explicit estimates of ligand strain, associated ligand pose families, and estimates of structural novelty and confidence. The method is applicable for fine-grained lead optimization as well as potent new lead identification.

ForceGen 3D structure and conformer generation: from small lead-like molecules to macrocyclic drugs

Article

May 2017
J COMPUT AID MOL DES

We introduce the ForceGen method for 3D structure generation and conformer elaboration of drug-like small molecules. ForceGen is novel, avoiding use of distance geometry, molecular templates, or simulation-oriented stochastic sampling. The method is primarily driven by the molecular force field, implemented using an extension of MMFF94s and a partial charge estimator based on electronegativity-equalization. The force field is coupled to algorithms for direct sampling of realistic physical movements made by small molecules. Results are presented on a standard benchmark from the Cambridge Crystallographic Database of 480 drug-like small molecules, including full structure generation from SMILES strings. Reproduction of protein-bound crystallographic ligand poses is demonstrated on four carefully curated data sets: the ConfGen Set (667 ligands), the PINC cross-docking benchmark (1062 ligands), a large set of macrocyclic ligands (182 total with typical ring sizes of 12–23 atoms), and a commonly used benchmark for evaluating macrocycle conformer generation (30 ligands total). Results compare favorably to alternative methods, and performance on macrocyclic compounds approaches that observed on non-macrocycles while yielding a roughly 100-fold speed improvement over alternative MD-based methods with comparable performance.

Discovery of florylpicoxamid, a mimic of a macrocyclic natural product

Article

Oct 2021
BIOORGAN MED CHEM

Natural products have routinely been used both as sources of and inspiration for new crop protection active ingredients. The natural product UK-2A has potent anti-fungal activity but lacks key attributes for field translation. Post-fermentation conversion of UK-2A to fenpicoxamid resulted in an active ingredient with a new target site of action for cereal and banana pathogens. Here we demonstrate the creation of a synthetic variant of fenpicoxamid via identification of the structural elements of UK-2A that are needed for anti-fungal activity. Florylpicoxamid is a non-macrocyclic active ingredient that bears two fewer stereocenters than fenpicoxamid, controls a broad spectrum of fungal diseases at low use rates, and has a concise, scalable route that is aligned with green chemistry principles. The development of florylpicoxamid represents the first example of using a stepwise deconstruction of a macrocyclic natural product to design a fully synthetic crop protection active ingredient.

The Playbooks of Medicinal Chemistry Design Moves

Article

Feb 2021
J CHEM INF MODEL

Large databases of biologically relevant molecules, such as ChEMBL, SureChEMBL, or compound collections of pharmaceutical or agrochemical companies, are invaluable sources of medicinal chemistry information, albeit implicit. We developed a modified matched molecular pair approach to systematically and exhaustively extract the transformations in these databases and distill them into snippets of explicit design knowledge that are easily interpretable and directly applicable. The resulting "playbooks of medicinal chemistry design moves" capture the collective pharmaceutical and agrochemical research expertise across multiple chemists, companies, targets, and projects. They can be queried in an automated fashion for systematic prospective design and compound generation. The ChEMBL playbook and an application to exploit it are available at https://github.com/mahendra-awale/medchem_moves.

Applications of Deep Learning in Molecule Generation and Molecular Property Prediction

Article

Dec 2020

Conspectus: Recent advances in computer hardware and software have led to a revolution in deep neural networks that has impacted fields ranging from language translation to computer vision. Deep learning has also impacted a number of areas in drug discovery, including the analysis of cellular images and the design of novel routes for the synthesis of organic molecules. While work in these areas has been impactful, a complete review of the applications of deep learning in drug discovery would be beyond the scope of a single Account. In this Account, we will focus on two key areas where deep learning has impacted molecular design: the prediction of molecular properties and the de novo generation of suggestions for new molecules. One of the most significant advances in the development of quantitative structure-activity relationships (QSARs) has come from the application of deep learning methods to the prediction of the biological activity and physical properties of molecules in drug discovery programs. Rather than employing the expert-derived chemical features typically used to build predictive models, researchers are now using deep learning to develop novel molecular representations. These representations, coupled with the ability of deep neural networks to uncover complex, nonlinear relationships, have led to state-of-the-art performance. While deep learning has changed the way that many researchers approach QSARs, it is not a panacea. As with any other machine learning task, the design of predictive models is dependent on the quality, quantity, and relevance of available data. Seemingly fundamental issues, such as optimal methods for creating a training set, are still open questions for the field. Another critical area that is still the subject of multiple research efforts is the development of methods for assessing the confidence in a model. Deep learning has also contributed to a renaissance in the application of de novo molecule generation. Rather than relying on manually defined heuristics, deep learning methods learn to generate new molecules based on sets of existing molecules. Techniques that were originally developed for areas such as image generation and language translation have been adapted to the generation of molecules. These deep learning methods have been coupled with the predictive models described above and are being used to generate new molecules with specific predicted biological activity profiles. While these generative algorithms appear promising, there have been only a few reports on the synthesis and testing of molecules based on designs proposed by generative models. The evaluation of the diversity, quality, and ultimate value of molecules produced by generative models is still an open question. While the field has produced a number of benchmarks, it has yet to agree on how one should ultimately assess molecules "invented" by an algorithm.

Structure-Based and Ligand-Based Virtual Screening on DUD-E+: Performance Dependence on Approximations to the Binding-Pocket

Article

Apr 2020
J CHEM INF MODEL

Using the DUD-E+ benchmark, we explore the impact of using a single protein pocket or ligand for virtual screening compared with using ensembles of alternative pockets, ligands, and sets thereof. For both structure-based and ligand-based approaches, the precise characterization of the binding site in question had a significant impact on screening performance. Using the single original DUD-E protein, Surflex-Dock yielded mean ROC area of 0.81±0.11. Using the cognate ligand instead, with the eSim method for screening, yielded 0.77±0.14. Moving to ensembles of five protein pocket variants increased docking performance to 0.84±0.09. The result for the analogous ligand-based approach (using the five crystallographically aligned cognate ligands) was 0.83±0.11. Using the same ligands, but making use of an automatically generated mutual alignment, yielded mean AUC nearly as good as from single-structure docking: 0.80±0.12. Detailed results and statistical analyses show that structure-based and ligand-based methods are complementary and can be fruitfully combined to enhance screening efficiency. A hybrid approach combining ensemble docking with eSim-based screening produced the best and most consistent performance (mean ROC area of 0.89±0.08 and 1% early enrichment of 46-fold). Based on results from both the docking and ligand-similarity approaches, it is clearly unwise to make use of a single arbitrarily chosen protein structure for docking or single ligand query for similarity-based screening.
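
As a rough illustration of the two screening metrics quoted in this abstract, the sketch below computes a ROC area and a top-fraction enrichment factor from scored compounds. It is not taken from the cited study: the function names, the toy data, and the choice of a top-fraction enrichment definition are assumptions, and the study's own "early enrichment" may be defined differently (for example, relative to a false-positive rate).

```python
def roc_auc(scores, labels):
    """ROC area as the probability that an active outscores a decoy.

    Ties between an active and a decoy count as half a win.
    labels: 1 for active, 0 for decoy (hypothetical toy encoding).
    """
    actives = [s for s, y in zip(scores, labels) if y == 1]
    decoys = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((a > d) + 0.5 * (a == d) for a in actives for d in decoys)
    return wins / (len(actives) * len(decoys))


def enrichment_factor(scores, labels, fraction=0.01):
    """Hit rate in the top-scoring fraction divided by the overall hit rate.

    Assumed definition for illustration only.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, round(fraction * len(ranked)))
    hits_top = sum(y for _, y in ranked[:n_top])
    overall_rate = sum(labels) / len(ranked)
    return (hits_top / n_top) / overall_rate


# Toy data: 3 actives among 10 compounds, higher score = better.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
auc = roc_auc(scores, labels)
ef = enrichment_factor(scores, labels, fraction=0.2)
```

A ROC area of 0.5 corresponds to random ranking and 1.0 to perfect separation of actives from decoys, which gives a sense of scale for the mean values reported above.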

Improving Accuracy, Diversity, and Speed with Prime Macrocycle Conformational Sampling

Article

Jul 2017

A novel method for exploring macrocycle conformational space, Prime Macrocycle Conformational Sampling (Prime-MCS), is introduced and evaluated in the context of other available algorithms (Molecular Dynamics, LowModeMD in MOE, and MacroModel Baseline Search). The algorithms were benchmarked on a dataset of 208 macrocycles which was curated for diversity from the Cambridge Structural Database, the Protein Databank, and the Biologically Interesting Molecule Reference Dictionary. The algorithms were evaluated in terms of accuracy (ability to reproduce the crystal structure), diversity (coverage of conformational space), and computational speed. Prime-MCS most reliably reproduced crystallographic structures for RMSD thresholds >1.0 Å, most often produced the most diverse conformational ensemble, and was most often the fastest algorithm. Detailed analysis and examination of both typical and outlier cases were performed to reveal characteristics, shortcomings, expected performance, and complementarity of the methods.
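
The accuracy criterion described in this abstract (reproducing a crystal structure within an RMSD threshold) can be sketched as follows. This is a simplified illustration, not the benchmark's actual protocol: it assumes conformers are already superimposed on the reference with atoms in matching order, and it ignores molecular symmetry. All function names and the toy coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered coordinate lists.

    Assumes the structures are already superimposed (no alignment is done).
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


def best_rmsd(ensemble, reference):
    """Lowest RMSD to the reference over all conformers in an ensemble."""
    return min(rmsd(conformer, reference) for conformer in ensemble)


def success_rate(best_rmsds, threshold=1.0):
    """Fraction of molecules whose best conformer is within the threshold (in Å)."""
    return sum(r <= threshold for r in best_rmsds) / len(best_rmsds)


# Toy two-atom example: one conformer shifted by 1 Å, one distorted.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ensemble = [
    [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)],
    [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)],
]
closest = best_rmsd(ensemble, reference)
rate = success_rate([0.5, 1.5, 1.0], threshold=1.0)
```

Reporting the success rate at several thresholds, rather than a single one, shows how method rankings can change between strict and permissive accuracy criteria.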
