Figure - available from: Nature Communications
AI-Bind pipeline: VecNet Performance and Validation
a The AI-Bind pipeline generates embeddings for ligands (drugs and natural compounds) and proteins using unsupervised pre-training. These embeddings are used to train the deep models. Top predictions are validated using docking simulations and serve as potential binders for experimental testing. b AI-Bind’s VecNet architecture uses Mol2vec and ProtVec to generate the node embeddings. VecNet is trained in a 5-fold cross-validation set-up, and the prediction averaged over the 5 folds is used as its final output. c–f Average performance over 5-fold cross-validation of VecNet, DeepPurpose, and the Configuration Model (dots represent the performance of each fold, bar height corresponds to the mean, n = 5). All models perform similarly when predicting binding for unseen edges (transductive) and unseen targets (semi-inductive). The advantage of deep learning and unsupervised pre-training appears in the case of unseen nodes (inductive test). AI-Bind’s VecNet is the best-performing model across all scenarios. Additionally, VecNet performs similarly for drugs and natural compounds. Source data are provided as a Source Data file.
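The fold-averaging step described in panel b can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `average_fold_predictions` and the toy probabilities are assumptions made for the example, and it presumes each of the 5 fold models scores the same ligand-protein pairs.

```python
from statistics import mean


def average_fold_predictions(fold_predictions):
    """Average per-pair binding probabilities across cross-validation folds.

    fold_predictions[k][i] is the probability that the model trained on
    fold k assigns to ligand-protein pair i (hypothetical layout).
    """
    n_pairs = len(fold_predictions[0])
    if any(len(preds) != n_pairs for preds in fold_predictions):
        raise ValueError("every fold must score the same pairs")
    return [mean(preds[i] for preds in fold_predictions) for i in range(n_pairs)]


# Toy numbers for two pairs and five folds (not taken from the paper).
folds = [
    [0.9, 0.2],
    [0.8, 0.1],
    [0.7, 0.3],
    [0.9, 0.2],
    [0.7, 0.2],
]
final = average_fold_predictions(folds)
print(final)
```

Averaging the fold models gives one score per pair, which can then be ranked to pick top predictions for docking.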


Source publication
Article
Identifying novel drug-target interactions is a critical and rate-limiting step in drug discovery. While deep learning models have been proposed to accelerate the identification process, here we show that state-of-the-art models fail to generalize to novel (i.e., never-before-seen) structures. We unveil the mechanisms responsible for this shortcomi...

Citations

... It is worth noting that this threshold is variable: it can be applied to the IC50 value [47][48][49], or positive and negative samples can be classified based on Ki or Kd values [50][51][52][53]. Table S2 lists the different criteria adopted by researchers, with related analyses following Table S2. ...
Article
Identification of interactions between chemical compounds and proteins is crucial for various applications, including drug discovery, target identification, network pharmacology, and elucidation of protein functions. Deep neural network-based approaches are becoming increasingly popular in efficiently identifying compound-protein interactions with high-throughput capabilities, narrowing down the scope of candidates for traditional labor-intensive, time-consuming and expensive experimental techniques. In this study, we proposed an end-to-end approach termed SPVec-SGCN-CPI, which utilized simplified graph convolutional network (SGCN) model with low-dimensional and continuous features generated from our previously developed model SPVec and graph topology information to predict compound-protein interactions. The SGCN technique, dividing the local neighborhood aggregation and nonlinearity layer-wise propagation steps, effectively aggregates K-order neighbor information while avoiding neighbor explosion and expediting training. The performance of the SPVec-SGCN-CPI method was assessed across three datasets and compared against four machine learning- and deep learning-based methods, as well as six state-of-the-art methods. Experimental results revealed that SPVec-SGCN-CPI outperformed all these competing methods, particularly excelling in unbalanced data scenarios. By propagating node features and topological information to the feature space, SPVec-SGCN-CPI effectively incorporates interactions between compounds and proteins, enabling the fusion of heterogeneity. Furthermore, our method scored all unlabeled data in ChEMBL, confirming the top five ranked compound-protein interactions through molecular docking and existing evidence. These findings suggest that our model can reliably uncover compound-protein interactions within unlabeled compound-protein pairs, carrying substantial implications for drug re-profiling and discovery. 
In summary, SPVec-SGCN demonstrates its efficacy in accurately predicting compound-protein interactions, showcasing potential to enhance target identification and streamline drug discovery processes. Scientific contributions: The methodology presented in this work not only enables comparatively accurate prediction of compound-protein interactions but also, for the first time, takes sample imbalance, which is very common in the real world, and computational efficiency into consideration simultaneously, accelerating the target identification and drug discovery process.
... Chen et al. have argued that models can perform well on independent but identically distributed datasets by learning dataset biases, and this performance may not be reliable [19]. Moreover, models that learn shortcuts or unintended features from training datasets can be highly sensitive to minor shifts in dataset distribution, impeding their ability to make predictions on out-of-distribution datasets [20,21]. As a result, many existing models may struggle to maintain stable and accurate performance on new data, limiting their effectiveness in real-world applications such as SBVS against novel targets [9]. ...
... Finally, EquiScore uses cross-entropy as the loss function. Its expression is as the following equation: $\mathrm{loss} = \mathrm{CrossEntropy}(\mathrm{label}, \mathrm{prob})$ (20). Model training. First, we removed data from the training data with the same UniProt ID as the proteins in relative datasets and then randomly selected 10% of the UniProt IDs and chose their corresponding data as the validation set for training. ...
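The protein-level split described in this excerpt can be sketched as below. It is an illustration under stated assumptions, not EquiScore's code: the record layout (a dict with a `uniprot_id` key), the function name `split_by_uniprot`, and the toy data are hypothetical; only the two steps (drop proteins shared with the test data, then hold out 10% of the remaining UniProt IDs for validation) come from the text.

```python
import random


def split_by_uniprot(records, test_ids, val_fraction=0.1, seed=0):
    """Split records into train/validation sets at the protein level."""
    # Step 1: drop every record whose protein also occurs in the test data.
    kept = [r for r in records if r["uniprot_id"] not in test_ids]
    ids = sorted({r["uniprot_id"] for r in kept})
    # Step 2: hold out a random fraction of the remaining UniProt IDs.
    n_val = min(len(ids), max(1, round(val_fraction * len(ids))))
    val_ids = set(random.Random(seed).sample(ids, n_val))
    train = [r for r in kept if r["uniprot_id"] not in val_ids]
    val = [r for r in kept if r["uniprot_id"] in val_ids]
    return train, val


# Toy data: 20 hypothetical proteins with two complexes each.
records = [
    {"uniprot_id": f"P{i}", "ligand": f"L{i}_{j}"}
    for i in range(20)
    for j in range(2)
]
train, val = split_by_uniprot(records, test_ids={"P0", "P1"})
print(len(train), len(val))
```

Splitting on protein identifiers rather than on individual records keeps all complexes of a given protein on one side of the split.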
Article
Developing robust methods for evaluating protein–ligand interactions has been a long-standing problem. Data-driven methods may memorize ligand and protein training data rather than learning protein–ligand interactions. Here we show a scoring approach called EquiScore, which utilizes a heterogeneous graph neural network to integrate physical prior knowledge and characterize protein–ligand interactions in equivariant geometric space. EquiScore is trained based on a new dataset constructed with multiple data augmentation strategies and a stringent redundancy-removal scheme. On two large external test sets, EquiScore consistently achieved top-ranking performance compared to 21 other methods. When EquiScore is used alongside different docking methods, it can effectively enhance the screening ability of these docking methods. EquiScore also showed good performance on the activity-ranking task of a series of structural analogues, indicating its potential to guide lead compound optimization. Finally, we investigated different levels of interpretability of EquiScore, which may provide more insights into structure-based drug design.
... d and $H^{(3)}_p$ are mapped to the same space through the learnable matrices U and V. Then the interaction is computed using the Hadamard product and the weight matrix q. ...
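One plausible reading of this excerpt is a score of the form $q^\top\big((U h_d) \odot (V h_p)\big)$. The sketch below illustrates that form only. The excerpt is truncated, so the exact shapes, any bias or nonlinearity, and the treatment of q as a single weight row are assumptions, and the names `h_d`, `h_p`, and `interaction_score` are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def interaction_score(h_d, h_p, U, V, q):
    """Project both embeddings into a shared space, then combine them."""
    z_d = matvec(U, h_d)  # drug embedding in the shared space
    z_p = matvec(V, h_p)  # protein embedding in the shared space
    hadamard = [a * b for a, b in zip(z_d, z_p)]  # element-wise product
    return sum(w * x for w, x in zip(q, hadamard))  # weighted sum with q


# Toy values chosen so the arithmetic is easy to follow.
h_d = [1, 2]                 # drug embedding, dimension 2
h_p = [3, 4, 5]              # protein embedding, dimension 3
U = [[1, 0], [0, 1]]         # maps dimension 2 to the shared dimension 2
V = [[1, 0, 0], [0, 1, 0]]   # maps dimension 3 to the shared dimension 2
q = [1.0, 0.5]
score = interaction_score(h_d, h_p, U, V, q)
print(score)  # 7.0
```

Here `z_d = [1, 2]` and `z_p = [3, 4]`, so the Hadamard product is `[3, 8]` and the weighted sum is 1.0·3 + 0.5·8 = 7.0.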
Preprint
Drug-target interaction (DTI) prediction is a critical component of the drug discovery process. In the drug development engineering field, predicting novel drug-target interactions is crucial. However, although existing methods have achieved high accuracy levels in predicting known drugs and drug targets, they fail to utilize global protein information during DTI prediction. This leads to an inability to effectively predict the interactions between novel drugs and their targets. As a result, the cross-field information fusion strategy is employed to acquire local and global protein information. Thus, we propose the siamese drug-target interaction (SiamDTI) prediction method, which utilizes a double channel network structure for cross-field supervised learning. Experimental results on three benchmark datasets demonstrate that SiamDTI achieves higher accuracy levels than other state-of-the-art (SOTA) methods on novel drugs and targets. Additionally, SiamDTI's performance with known drugs and targets is comparable to that of SOTA approaches. The code is available at https://anonymous.4open.science/r/DDDTI-434D.
... Identifying novel (i.e., never-before-seen) druglike compounds, relying upon in silico screening of "in stock" and "on demand" virtual libraries of compounds, is a critical and rate-limiting step in drug discovery [29][30][31][32]. Therefore, alternative strategies have also been suggested based on generating new chemical spaces, originating from already approved drugs and following its consequent hit-to-lead optimization [33,30,34,19,35,36,37]. Herein, we report structure-based optimization of new noncovalent inhibitors against the coronavirus SARS-CoV-2 Mpro, starting from the FDA-approved drug nirmatrelvir. ...
Article
A computer-aided drug design of new derivatives of nirmatrelvir, an orally active inhibitor of the main protease (Mpro) of the severe acute respiratory syndrome Coronavirus 2 (SARS-CoV-2), was performed to identify its analogues with a higher antiviral potency. The following workflow was used: first, an evolutionary library composed of 1,866 analogues was generated starting from a parent nirmatrelvir scaffold and going through small mutation, fitness scoring, ranking, and selection. Second, the generated library was preprocessed and filtered against a 3-D pharmacophore model of nirmatrelvir built from its X-ray structure in a co-crystallized complex with the Mpro enzyme, allowing us to reduce the chemical space to 32 active analogues. Third, structure-based molecular docking against two different enzyme structures further ranked these active candidates, so that up to eight better-binding analogues were identified. The selected hit-analogues target the Mpro enzymes of SARS-CoV-2 with a higher binding affinity than the parent nirmatrelvir. The main structural modifications that increase the overall inhibitory affinity are identified at the azabicyclo[3.1.0]hexane and 2-oxopyrrolidine fragments. A characteristic structural feature of the inhibitor binding with the Mpro active centre is the similar location of the trifluoroacetylamino fragment, which is observed for most hit-analogues. The suggested workflow of the computer-aided rational design of new antiviral noncovalent inhibitors based on the scaffold of approved drugs is a promising, extremely low-cost, and time-efficient approach for the development of new potential pharmaceutical ingredients for the treatment of Coronavirus Disease 2019.
... Evaluation of training performance and generalizability are both required in assessing machine learning (ML) models [74,75] to ensure that a high training performance does not just reflect a model's ability to memorize specific training data that may limit its ability to generalize to unseen data. Our validation set during SIENNA training measures the model's accuracy on a dataset different from the training data and includes an inductive validation set that helps to tune the model parameters and hyperparameters for unseen data, avoiding overfitting [76]. ...
... The SoftMax activation function is applied to produce class probabilities, estimating the class-specific probability for each input MRI scan slice. These layers and feature parameters are tuned using hyperas [60] (Fig. 2d) and demonstrate an efficient exploration-exploitation tradeoff [75]. ...
Preprint
The transformative integration of Machine Learning (ML) for Artificial General Intelligence (AGI)-enhanced clinical imaging diagnostics, is itself in development. In brain tumor pathologies, magnetic resonance imaging (MRI) is a critical step that impacts the decision for invasive surgery, yet expert MRI tumor typing is inconsistent and misdiagnosis can reach levels as high as 85%. Current state-of-the-art (SOTA) ML brain tumor models struggle with data overfitting and susceptibility to shortcut learning, further exacerbated in large-sized models with many tunable parameters. In a comparison with multiple SOTA models, our deep ML brain tumor diagnostics model, SIENNA, surpassed limitations in four key areas of prioritized minimal data preprocessing, an optimized architecture that reduces shortcut learning and overfitting, integrated inductive cross-validation method for generalizability, and smaller neural architecture. SIENNA is applicable across MRI machines and 1.5 and 3.0 Tesla, and achieves high average accuracies on clinical DICOM MRI data across three-way classification: 92% (non-tumor), 91% (GBM), and 93% (MET) with retained high F1 and AUROC values for limited false positives/negatives. SIENNA is a lightweight clinical-ready AGI framework compatible with future multimodal expanded data integration.
... Generalizability is an important topic in a broad range of studies [47][48][49]. In this proof-of-concept study, we investigated the generalizability of various StrB approaches in predicting peptide-MHC interactions, a pivotal aspect of immune surveillance and a major bottleneck in the design of cancer vaccines [50][51][52] and TCR therapies [53,54]. ...
Preprint
The interaction between peptides and major histocompatibility complex (MHC) molecules is pivotal for tissue transplantation, pathogen recognition and autoimmune disease treatments. Recent advances in cancer immunotherapies demand more accurate computational prediction of MHC-bound peptides. We address the generalizability challenge of MHC-bound peptide predictions, revealing limitations in current sequence-based approaches. Our solution employs structure-based methods leveraging geometric deep learning (GDL), yielding up to 8% improvement in generalizability across unseen MHC alleles. We tackle data efficiency by introducing a self-supervised learning approach surpassing sequence-based methods, even without being trained on binding affinity data. Finally, we demonstrate the resilience of structure-based GDL methods to biases in binding data on a Hepatitis B virus vaccine design case study. This study highlights structure-based methods’ potential to enhance generalizability and data efficiency, with implications for data-intensive fields like T-cell receptor specificity predictions, paving the way for enhanced comprehension and manipulation of immune responses.
... For more details see the following subsections: (A) Datasets, (B) Model and objective function, (C) Latent space analysis. ... reducing convergence time while reaching the same predictive performance with less data [14]. Another challenge in DTI prediction is the imbalanced nature of the available datasets. ...
... For protein embeddings, we examined several pre-trained models utilizing amino acid sequences [51]. Among these, ProtVec [52], which employs the Word2vec concept and is commonly paired with Mol2vec [14], emerged as a standard choice. While two of the other pretrained methods, namely CPCProt [53] and ProtTrans [54], also seemed promising, with the former slightly improving predictive performance and the latter exhibiting better hierarchy preservation, ProtVec offered a balanced trade-off between these considerations. ...
Article
Given the prolonged timelines and high costs associated with traditional approaches, accelerating drug development is crucial. Computational methods, particularly drug-target interaction prediction, have emerged as efficient tools, yet the explainability of machine learning models remains a challenge. Our work aims to provide more interpretable interaction prediction models using similarity-based prediction in a latent space aligned to biological hierarchies. We investigated integrating drug and protein hierarchies into a joint-embedding drug-target latent space via embedding regularization by conducting a comparative analysis between models employing traditional flat Euclidean vector spaces and those utilizing hyperbolic embeddings. Besides, we provided a latent space analysis as an example to show how we can gain visual insights into the trained model with the help of dimensionality reduction. Our results demonstrate that hierarchy regularization improves interpretability without compromising predictive performance. Furthermore, integrating hyperbolic embeddings, coupled with regularization, enhances the quality of the embedded hierarchy trees. Our approach enables a more informed and insightful application of interaction prediction models in drug discovery by constructing an interpretable hyperbolic latent space, simultaneously incorporating drug and target hierarchies and pairing them with available interaction information. Moreover, compatible with pairwise methods, the approach allows for additional transparency through existing explainable AI solutions.
... Furthermore, Chatterjee et al. [31] have recently shown that DL methods for protein-ligand prediction use degree information as shortcuts instead of learning from sequence features. A baseline model using only topology information performs equally well for that task. ...
Article
Identifying protein–protein interactions (PPIs) is crucial for deciphering biological pathways. Numerous prediction methods have been developed as cheap alternatives to biological experiments, reporting surprisingly high accuracy estimates. We systematically investigated how much reproducible deep learning models depend on data leakage, sequence similarities and node degree information, and compared them with basic machine learning models. We found that overlaps between training and test sets resulting from random splitting lead to strongly overestimated performances. In this setting, models learn solely from sequence similarities and node degrees. When data leakage is avoided by minimizing sequence similarities between training and test set, performances become random. Moreover, baseline models directly leveraging sequence similarity and network topology show good performances at a fraction of the computational cost. Thus, we advocate that any improvements should be reported relative to baseline methods in the future. Our findings suggest that predicting PPIs remains an unsolved task for proteins showing little sequence similarity to previously studied proteins, highlighting that further experimental research into the ‘dark’ protein interactome and better computational methods are needed.
... However, the detailed descriptions or analyses relating to compounds are not ... In addition, the current approaches also encounter the challenge of lacking comprehensive evaluation on model generalization ability with effective scenarios. In previous CPI studies [18,20,21], it is common to split datasets using unknown compound and unknown protein schemes to evaluate model generalization ability. In the unknown compound scheme, compounds from the test set are not present in the training set. ...
Article
The identification of compound-protein interactions (CPIs) plays a vital role in drug discovery. However, the huge cost and labor-intensive nature of in vitro and in vivo experiments make it urgent for researchers to develop novel CPI prediction methods. Although emerging deep learning methods have achieved promising performance in CPI prediction, they also face ongoing challenges: (i) providing bidirectional interpretability from both the chemical and biological perspective for the prediction results; (ii) comprehensively evaluating model generalization performance; (iii) demonstrating the practical applicability of these models. To overcome the challenges posed by current deep learning methods, we propose a cross multi-head attention oriented bidirectional interpretable CPI prediction model (CmhAttCPI). First, CmhAttCPI takes molecular graphs and protein sequences as inputs, utilizing the GCW module to learn atom features and the CNN module to learn residue features, respectively. Second, the model applies a cross multi-head attention module to compute attention weights for atoms and residues. Finally, CmhAttCPI employs a fully connected neural network to predict scores for CPIs. We evaluated the performance of CmhAttCPI on balanced datasets and imbalanced datasets. The results consistently show that CmhAttCPI outperforms multiple state-of-the-art methods. We constructed three scenarios based on compound and protein clustering and comprehensively evaluated the model generalization ability within these scenarios. The results demonstrate that the generalization ability of CmhAttCPI surpasses that of other models. Besides, the visualizations of attention weights reveal that CmhAttCPI provides chemical and biological interpretation for CPI prediction. Moreover, case studies confirm the practical applicability of CmhAttCPI in discovering anticancer candidates.
... Although distributional shifts are a well-recognized challenge in machine learning more generally [22,23], they are less well characterized in biology due to the lack of approaches that measure model performance in the context of distribution shifts. Though numerous benchmarks have been developed to assess model performance across datasets [16,24,25,26,27], there are still large gaps between model performance during benchmarking and real-world use [28][29][30][31][32] (Figure 1a). This gap in assessing generalizability must be addressed before machine learning models can be broadly used in biology. ...
Preprint
Deep learning has made rapid advances in modeling molecular sequencing data. Despite achieving high performance on benchmarks, it remains unclear to what extent deep learning models learn general principles and generalize to previously unseen sequences. Benchmarks traditionally interrogate model generalizability by generating metadata-based (MB) or sequence-similarity-based (SB) train and test splits of input data before assessing model performance. Here, we show that this approach mischaracterizes model generalizability by failing to consider the full spectrum of cross-split overlap, i.e., similarity between train and test splits. We introduce SPECTRA, a spectral framework for comprehensive model evaluation. For a given model and input data, SPECTRA plots model performance as a function of decreasing cross-split overlap and reports the area under this curve as a measure of generalizability. We apply SPECTRA to 18 sequencing datasets with associated phenotypes ranging from antibiotic resistance in tuberculosis to protein-ligand binding to evaluate the generalizability of 19 state-of-the-art deep learning models, including large language models, graph neural networks, diffusion models, and convolutional neural networks. We show that SB and MB splits provide an incomplete assessment of model generalizability. With SPECTRA, we find that as cross-split overlap decreases, deep learning models consistently exhibit a reduction in performance in a task- and model-dependent manner. Although no model consistently achieved the highest performance across all tasks, we show that deep learning models can generalize to previously unseen sequences on specific tasks. SPECTRA paves the way toward a better understanding of how foundation models generalize in biology.
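The "area under the performance curve" idea in this abstract can be illustrated with a trapezoidal rule. This is a generic sketch, not SPECTRA's implementation: the abstract does not say how overlap is parameterized or whether the area is normalized, so the x-axis values (a hypothetical parameter from 0 for maximal overlap to 1 for minimal overlap), the function name, and the toy scores are all assumptions.

```python
def area_under_performance_curve(overlap_params, scores):
    """Trapezoidal area under performance plotted against an overlap parameter."""
    points = sorted(zip(overlap_params, scores))
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Toy curve: performance drops as cross-split overlap decreases.
params = [0.0, 0.5, 1.0]   # hypothetical: 0 = most overlap, 1 = least overlap
scores = [0.9, 0.7, 0.5]   # hypothetical model performance at each setting
auc = area_under_performance_curve(params, scores)
print(round(auc, 3))
```

With these toy values the two trapezoids contribute 0.4 and 0.3. A model whose performance holds up as overlap shrinks would score a larger area than one that collapses.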