AI-Bind pipeline: VecNet Performance and Validation

a The AI-Bind pipeline generates embeddings for ligands (drugs and natural compounds) and proteins using unsupervised pre-training. These embeddings are used to train the deep models. Top predictions are validated using docking simulations and serve as potential binders for experimental testing.
b AI-Bind’s VecNet architecture uses Mol2vec and ProtVec to generate the node embeddings. VecNet is trained in a 5-fold cross-validation setup, and the prediction averaged over the 5 folds is used as the final output of VecNet.
c–f The average performance of VecNet, DeepPurpose, and the Configuration Model in 5-fold cross-validation (dots represent the performance of each fold; bar height corresponds to the mean; n = 5). All the models perform similarly when predicting binding for unseen edges (transductive) and unseen targets (semi-inductive). The advantage of deep learning and unsupervised pre-training appears in the case of unseen nodes (inductive test). AI-Bind’s VecNet is the best-performing model across all scenarios, and it performs similarly for drugs and natural compounds. Source data are provided as a Source Data file.
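The fold averaging in panel b and the three test scenarios in panels c–f can be sketched as follows. This is a minimal illustration, not AI-Bind's code: the function names are hypothetical, and the rule used to label a test pair (both nodes seen in training, only the target unseen, or both unseen) is an assumption based on the caption's wording.

```python
from statistics import mean


def average_fold_predictions(fold_predictions):
    """Average each pair's predicted binding probability over all folds.

    fold_predictions: one list of scores per fold, all in the same pair order.
    """
    n_pairs = len(fold_predictions[0])
    if any(len(preds) != n_pairs for preds in fold_predictions):
        raise ValueError("every fold must score the same pairs")
    return [mean(scores) for scores in zip(*fold_predictions)]


def label_scenario(pair, train_pairs):
    """Label a held-out (ligand, target) pair by what was seen in training.

    Assumes the pair itself is not part of train_pairs.
    """
    ligand, target = pair
    seen_ligands = {lig for lig, _ in train_pairs}
    seen_targets = {tgt for _, tgt in train_pairs}
    ligand_seen = ligand in seen_ligands
    target_seen = target in seen_targets

    if ligand_seen and target_seen:
        return "transductive"        # unseen edge between known nodes
    if ligand_seen:
        return "semi-inductive"      # unseen target, known ligand
    if not target_seen:
        return "inductive"           # both nodes unseen
    return "unseen ligand only"      # case not named in the caption
```

Labelling each held-out pair this way makes it possible to report performance per scenario instead of pooling all test pairs together.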

Source publication

Annotation bias in BindingDB training data and DeepPurpose...

Drug-Target Interaction Network
a The drug-target interaction network...

Comparing DeepPurpose and the duplex configuration model
a The duplex...

AI-Bind pipeline: VecNet Performance and Validation
a AI-Bind pipeline...

Network-Derived Negatives
a Protein-ligand bipartite network consisting...

Improving the generalizability of protein-ligand binding predictions with AI-Bind

Article

Apr 2023

Identifying novel drug-target interactions is a critical and rate-limiting step in drug discovery. While deep learning models have been proposed to accelerate the identification process, here we show that state-of-the-art models fail to generalize to novel (i.e., never-before-seen) structures. We unveil the mechanisms responsible for this shortcomi...

An end-to-end method for predicting compound-protein interactions based on simplified homogeneous graph convolutional network and pre-trained language model

Article

Jun 2024

Identification of interactions between chemical compounds and proteins is crucial for various applications, including drug discovery, target identification, network pharmacology, and elucidation of protein functions. Deep neural network-based approaches are becoming increasingly popular in efficiently identifying compound-protein interactions with high-throughput capabilities, narrowing down the scope of candidates for traditional labor-intensive, time-consuming and expensive experimental techniques. In this study, we proposed an end-to-end approach termed SPVec-SGCN-CPI, which utilized a simplified graph convolutional network (SGCN) model with low-dimensional and continuous features generated from our previously developed model SPVec and graph topology information to predict compound-protein interactions. The SGCN technique, dividing the local neighborhood aggregation and nonlinearity layer-wise propagation steps, effectively aggregates K-order neighbor information while avoiding neighbor explosion and expediting training. The performance of the SPVec-SGCN-CPI method was assessed across three datasets and compared against four machine learning- and deep learning-based methods, as well as six state-of-the-art methods. Experimental results revealed that SPVec-SGCN-CPI outperformed all these competing methods, particularly excelling in unbalanced data scenarios. By propagating node features and topological information to the feature space, SPVec-SGCN-CPI effectively incorporates interactions between compounds and proteins, enabling the fusion of heterogeneity. Furthermore, our method scored all unlabeled data in ChEMBL, confirming the top five ranked compound-protein interactions through molecular docking and existing evidence. These findings suggest that our model can reliably uncover compound-protein interactions within unlabeled compound-protein pairs, carrying substantial implications for drug re-profiling and discovery.
In summary, SPVec-SGCN demonstrates its efficacy in accurately predicting compound-protein interactions, showcasing potential to enhance target identification and streamline drug discovery processes.
Scientific contributions: The methodology presented in this work not only enables the comparatively accurate prediction of compound-protein interactions but also, for the first time, takes sample imbalance, which is very common in the real world, and computational efficiency into consideration simultaneously, accelerating the target identification and drug discovery process.
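The idea of separating neighborhood aggregation from the nonlinear layers can be illustrated with a generic simplified-graph-convolution sketch: node features are propagated K times through a normalized adjacency matrix, with no nonlinearity in between, before any classifier is trained. This is an assumption about the general technique, not the SPVec-SGCN-CPI implementation; the function names and the symmetric normalization with self-loops are illustrative choices.

```python
import math


def normalized_adjacency(adj):
    """Return D^(-1/2) (A + I) D^(-1/2) for a dense adjacency matrix A."""
    n = len(adj)
    # Add self-loops so every node keeps part of its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    degree = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
            for i in range(n)]


def propagate_features(adj, features, k):
    """Apply k rounds of linear neighbor aggregation: S^k X."""
    s = normalized_adjacency(adj)
    n = len(features)
    dim = len(features[0])
    h = [row[:] for row in features]
    for _ in range(k):
        h = [[sum(s[i][m] * h[m][c] for m in range(n)) for c in range(dim)]
             for i in range(n)]
    return h
```

Because the propagation has no trainable parameters, the result can be computed once and then fed to any downstream classifier, which is where the training speed-up of this family of methods comes from.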

Generic protein–ligand interaction scoring by integrating physical prior knowledge and data augmentation modelling

Article

Jun 2024

Developing robust methods for evaluating protein–ligand interactions has been a long-standing problem. Data-driven methods may memorize ligand and protein training data rather than learning protein–ligand interactions. Here we show a scoring approach called EquiScore, which utilizes a heterogeneous graph neural network to integrate physical prior knowledge and characterize protein–ligand interactions in equivariant geometric space. EquiScore is trained based on a new dataset constructed with multiple data augmentation strategies and a stringent redundancy-removal scheme. On two large external test sets, EquiScore consistently achieved top-ranking performance compared to 21 other methods. When EquiScore is used alongside different docking methods, it can effectively enhance the screening ability of these docking methods. EquiScore also showed good performance on the activity-ranking task of a series of structural analogues, indicating its potential to guide lead compound optimization. Finally, we investigated different levels of interpretability of EquiScore, which may provide more insights into structure-based drug design.

A Cross-Field Fusion Strategy for Drug-Target Interaction Prediction

Preprint

May 2024

Drug-target interaction (DTI) prediction is a critical component of the drug discovery process. In the drug development engineering field, predicting novel drug-target interactions is crucial. However, although existing methods have achieved high accuracy in predicting known drugs and drug targets, they fail to utilize global protein information during DTI prediction. This leads to an inability to effectively predict the interactions between novel drugs and their targets. As a result, the cross-field information fusion strategy is employed to acquire local and global protein information. Thus, we propose the siamese drug-target interaction (SiamDTI) prediction method, which utilizes a double-channel network structure for cross-field supervised learning. Experimental results on three benchmark datasets demonstrate that SiamDTI achieves higher accuracy than other state-of-the-art (SOTA) methods on novel drugs and targets. Additionally, SiamDTI's performance with known drugs and targets is comparable to that of SOTA approaches. The code is available at https://anonymous.4open.science/r/DDDTI-434D.

Computer-aided drug design of novel nirmatrelvir analogs inhibiting main protease of Coronavirus SARS-CoV-2

Article

May 2024

A computer-aided drug design of new derivatives of nirmatrelvir, an orally active inhibitor of the main protease (Mpro) of the severe acute respiratory syndrome Coronavirus 2 (SARS-CoV-2), was performed to identify its analogues with a higher antiviral potency. The following workflow was used: first, an evolutionary library composed of 1,866 analogues was generated starting from a parent nirmatrelvir scaffold and going through small mutation, fitness scoring, ranking, and selection. Second, the generated library was preprocessed and filtered against a 3-D pharmacophore model of nirmatrelvir built from its X-ray structure in a co-crystallized complex with the Mpro enzyme, allowing us to reduce the chemical space to 32 active analogues. Third, structure-based molecular docking against two different enzyme structures further ranked these active candidates, so that up to eight better-binding analogues were identified. The selected hit-analogues target the Mpro enzymes of SARS-CoV-2 with a higher binding affinity than the parent nirmatrelvir. The main structural modifications that increase the overall inhibitory affinity are identified at the azabicyclo[3.1.0]hexane and 2-oxopyrrolidine fragments. A characteristic structural feature of the inhibitor binding with the Mpro active centre is the similar location of the trifluoroacetylamino fragment, which is observed for most hit-analogues. The suggested workflow of the computer-aided rational design of new antiviral noncovalent inhibitors based on the scaffold of approved drugs is a promising, extremely low-cost, and time-efficient approach for the development of new potential pharmaceutical ingredients for the treatment of Coronavirus Disease 2019.

SIENNA: Generalizable Lightweight Machine Learning Platform for Brain Tumor Diagnostics

Preprint

Apr 2024

The transformative integration of Machine Learning (ML) for Artificial General Intelligence (AGI)-enhanced clinical imaging diagnostics is itself in development. In brain tumor pathologies, magnetic resonance imaging (MRI) is a critical step that impacts the decision for invasive surgery, yet expert MRI tumor typing is inconsistent and misdiagnosis can reach levels as high as 85%. Current state-of-the-art (SOTA) ML brain tumor models struggle with data overfitting and susceptibility to shortcut learning, further exacerbated in large-sized models with many tunable parameters. In a comparison with multiple SOTA models, our deep ML brain tumor diagnostics model, SIENNA, surpassed limitations in four key areas: prioritized minimal data preprocessing, an optimized architecture that reduces shortcut learning and overfitting, an integrated inductive cross-validation method for generalizability, and a smaller neural architecture. SIENNA is applicable across MRI machines at both 1.5 and 3.0 Tesla, and achieves high average accuracies on clinical DICOM MRI data across three-way classification: 92% (non-tumor), 91% (GBM), and 93% (MET) with retained high F1 and AUROC values for limited false positives/negatives. SIENNA is a lightweight clinical-ready AGI framework compatible with future multimodal expanded data integration.

Improving generalizability and data efficiency for MHC-I binding peptide predictions through structure-based geometric deep learning

Preprint

Mar 2024

The interaction between peptides and major histocompatibility complex (MHC) molecules is pivotal for tissue transplantation, pathogen recognition and autoimmune disease treatments. Recent advances in cancer immunotherapies demand more accurate computational prediction of MHC-bound peptides. We address the generalizability challenge of MHC-bound peptide predictions, revealing limitations in current sequence-based approaches. Our solution employs structure-based methods leveraging geometric deep learning (GDL), yielding up to 8% improvement in generalizability across unseen MHC alleles. We tackle data efficiency by introducing a self-supervised learning approach surpassing sequence-based methods, even without being trained on binding affinity data. Finally, we demonstrate the resilience of structure-based GDL methods to biases in binding data on a Hepatitis B virus vaccine design case study. This study highlights structure-based methods’ potential to enhance generalizability and data efficiency, with implications for data-intensive fields like T-cell receptor specificity predictions, paving the way for enhanced comprehension and manipulation of immune responses.

Towards explainable interaction prediction: Embedding biological hierarchies into hyperbolic interaction space

Article

Mar 2024
PLOS ONE

Given the prolonged timelines and high costs associated with traditional approaches, accelerating drug development is crucial. Computational methods, particularly drug-target interaction prediction, have emerged as efficient tools, yet the explainability of machine learning models remains a challenge. Our work aims to provide more interpretable interaction prediction models using similarity-based prediction in a latent space aligned to biological hierarchies. We investigated integrating drug and protein hierarchies into a joint-embedding drug-target latent space via embedding regularization by conducting a comparative analysis between models employing traditional flat Euclidean vector spaces and those utilizing hyperbolic embeddings. Besides, we provided a latent space analysis as an example to show how we can gain visual insights into the trained model with the help of dimensionality reduction. Our results demonstrate that hierarchy regularization improves interpretability without compromising predictive performance. Furthermore, integrating hyperbolic embeddings, coupled with regularization, enhances the quality of the embedded hierarchy trees. Our approach enables a more informed and insightful application of interaction prediction models in drug discovery by constructing an interpretable hyperbolic latent space, simultaneously incorporating drug and target hierarchies and pairing them with available interaction information. Moreover, compatible with pairwise methods, the approach allows for additional transparency through existing explainable AI solutions.
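To make the contrast between flat Euclidean and hyperbolic embeddings concrete, the sketch below computes the geodesic distance in the Poincaré ball, one common model of hyperbolic space. The abstract does not say which hyperbolic model the authors use, so the choice of the Poincaré ball and the function name are assumptions made for illustration.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincaré ball."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    ratio = 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(1.0 + ratio)
```

Distances grow rapidly as points approach the boundary of the ball, which is why tree-like hierarchies tend to embed with less distortion here than in a Euclidean space of the same dimension.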

Cracking the black box of deep sequence-based protein–protein interaction prediction

Article

Mar 2024
BRIEF BIOINFORM

Identifying protein–protein interactions (PPIs) is crucial for deciphering biological pathways. Numerous prediction methods have been developed as cheap alternatives to biological experiments, reporting surprisingly high accuracy estimates. We systematically investigated how much reproducible deep learning models depend on data leakage, sequence similarities and node degree information, and compared them with basic machine learning models. We found that overlaps between training and test sets resulting from random splitting lead to strongly overestimated performances. In this setting, models learn solely from sequence similarities and node degrees. When data leakage is avoided by minimizing sequence similarities between training and test set, performances become random. Moreover, baseline models directly leveraging sequence similarity and network topology show good performances at a fraction of the computational cost. Thus, we advocate that any improvements should be reported relative to baseline methods in the future. Our findings suggest that predicting PPIs remains an unsolved task for proteins showing little sequence similarity to previously studied proteins, highlighting that further experimental research into the ‘dark’ protein interactome and better computational methods are needed.

A bidirectional interpretable compound-protein interaction prediction framework based on cross attention

Article

Mar 2024
COMPUT BIOL MED

The identification of compound-protein interactions (CPIs) plays a vital role in drug discovery. However, the huge cost and labor-intensive nature of in vitro and in vivo experiments make it urgent for researchers to develop novel CPI prediction methods. Although emerging deep learning methods have achieved promising performance in CPI prediction, they also face ongoing challenges: (i) providing bidirectional interpretability from both the chemical and biological perspective for the prediction results; (ii) comprehensively evaluating model generalization performance; (iii) demonstrating the practical applicability of these models. To overcome the challenges posed by current deep learning methods, we propose a cross multi-head attention oriented bidirectional interpretable CPI prediction model (CmhAttCPI). First, CmhAttCPI takes molecular graphs and protein sequences as inputs, utilizing the GCW module to learn atom features and the CNN module to learn residue features, respectively. Second, the model applies a cross multi-head attention module to compute attention weights for atoms and residues. Finally, CmhAttCPI employs a fully connected neural network to predict scores for CPIs. We evaluated the performance of CmhAttCPI on balanced datasets and imbalanced datasets. The results consistently show that CmhAttCPI outperforms multiple state-of-the-art methods. We constructed three scenarios based on compound and protein clustering and comprehensively evaluated the model generalization ability within these scenarios. The results demonstrate that the generalization ability of CmhAttCPI surpasses that of other models. Besides, the visualizations of attention weights reveal that CmhAttCPI provides chemical and biological interpretation for CPI prediction. Moreover, case studies confirm the practical applicability of CmhAttCPI in discovering anticancer candidates.

Evaluating generalizability of artificial intelligence models for molecular datasets

Preprint

Feb 2024

Deep learning has made rapid advances in modeling molecular sequencing data. Despite achieving high performance on benchmarks, it remains unclear to what extent deep learning models learn general principles and generalize to previously unseen sequences. Benchmarks traditionally interrogate model generalizability by generating metadata-based (MB) or sequence-similarity-based (SB) train and test splits of input data before assessing model performance. Here, we show that this approach mischaracterizes model generalizability by failing to consider the full spectrum of cross-split overlap, i.e., similarity between train and test splits. We introduce SPECTRA, a spectral framework for comprehensive model evaluation. For a given model and input data, SPECTRA plots model performance as a function of decreasing cross-split overlap and reports the area under this curve as a measure of generalizability. We apply SPECTRA to 18 sequencing datasets with associated phenotypes ranging from antibiotic resistance in tuberculosis to protein-ligand binding to evaluate the generalizability of 19 state-of-the-art deep learning models, including large language models, graph neural networks, diffusion models, and convolutional neural networks. We show that SB and MB splits provide an incomplete assessment of model generalizability. With SPECTRA, we find that as cross-split overlap decreases, deep learning models consistently exhibit a reduction in performance in a task- and model-dependent manner. Although no model consistently achieved the highest performance across all tasks, we show that deep learning models can generalize to previously unseen sequences on specific tasks. SPECTRA paves the way toward a better understanding of how foundation models generalize in biology.
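The summary metric described above, the area under a curve of performance against decreasing cross-split overlap, can be sketched with the trapezoidal rule. This is not the SPECTRA code: the function name is hypothetical, and the x-axis is assumed to be a value that grows as train-test overlap shrinks. How the splits themselves are generated is not covered by the abstract and is not modeled here.

```python
def area_under_performance_curve(points):
    """Trapezoidal area under (overlap_parameter, performance) points.

    points: iterable of (x, y) tuples, where x increases as train-test
    overlap decreases (an assumed convention) and y is the model's score.
    """
    ordered = sorted(points)
    if len(ordered) < 2:
        raise ValueError("need at least two points to integrate")
    area = 0.0
    for (x0, y0), (x1, y1) in zip(ordered, ordered[1:]):
        # Area of the trapezoid between two neighboring evaluations.
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

A model whose score stays flat as overlap falls yields a larger area than one that starts at the same level and decays, so the single number rewards robustness across the whole range of splits.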
