
Drug target discovery using knowledge graph embeddings

April 2019


DOI:10.1145/3297280.3297282

Conference: the 34th ACM/SIGAPP Symposium

Authors:

Sameh K. Mohamed

University of Galway

Aayah Nounu

University of Bristol

Vít Nováček

Digital Enterprise Research Institute (DERI)

The field of drug discovery has entered a plateau stage lately. It is increasingly more expensive and time-demanding to introduce new drugs into the market. One of the main reasons is the slow progress in finding novel targets for drug candidates and the lack of insight in terms of the associated mechanisms of action. Current works in this area mainly utilise different chemical, genetic and proteomic methods, which are limited in terms of the scalability of experimentation and the scope of studied drugs and targets per experiment. This is mainly due to their dependency on laboratory experiments and available physical resource. This has led to an increasing importance of computational methods for the identification of candidate drug targets. In this work, we introduce a novel computational approach for predicting drug target proteins. We approach the problem as a link prediction task on knowledge graphs. We process drug and target information as a knowledge graph of interconnected drugs, proteins, disease, pathways and other relevant entities. We then apply knowledge graph embedding (KGE) models over this data to enable scoring drug-target associations, where we employ a customised version of state-of-the-art KGE model ComplEx. We generate a benchmarking dataset based on KEGG database to train and evaluate our method. Our experiments show that our method achieves best results in comparison to other traditional KGE models. Specifically, the method predicts drug target links with mean reciprocal rank (MRR) of 0.78 and Hits@10 of 0.88. This provides a promising basis for further experimentation and comparisons with domain-specific predictive models.


Provided by the author(s) and NUI Galway in accordance with publisher policies. Please cite the published

version when available.

Downloaded 2019-06-13T11:25:52Z

Some rights reserved. For more information, please see the item record link above.

Title: Drug target discovery using knowledge graph embeddings

Author(s): Mohamed, Sameh K.; Nováček, Vít; Nounu, Aayah

Publication Date: 2019-04-08

Publication Information: Mohamed, Sameh K., Nováček, Vít, & Nounu, Aayah. (2019). Drug target discovery using knowledge graph embeddings. Paper presented at the 34th ACM/SIGAPP Symposium on Applied Computing (SAC ’19), Limassol, Cyprus, 08-12 April.

Publisher: Association for Computing Machinery

Link to publisher's version: https://doi.org/10.1145/3297280.3297282

Item record: http://hdl.handle.net/10379/15065

DOI: http://dx.doi.org/10.1145/3297280.3297282

Drug Target Discovery Using Knowledge Graph Embeddings

Sameh K. Mohamed

Data Science Institute

Insight Centre for Data Analytics

National University of Ireland Galway

sameh.kamal@insight-centre.org

Aayah Nounu

MRC Integrative Epidemiology Unit

University of Bristol

An0435@bristol.ac.uk

Vít Nováček

Data Science Institute

Insight Centre for Data Analytics

National University of Ireland Galway

vit.novacek@insight-centre.org

ABSTRACT

The eld of drug discovery has entered a plateau stage lately. It

is increasingly more expensive and time-demanding to introduce

new drugs into the market. One of the main reasons is the slow

progress in nding novel targets for drug candidates and the lack

of insight in terms of the associated mechanisms of action. Current

works in this area mainly utilise dierent chemical, genetic and

proteomic methods, which are limited in terms of the scalability

of experimentation and the scope of studied drugs and targets per

experiment. This is mainly due to their dependency on laboratory

experiments and available physical resource. This has led to an

increasing importance of computational methods for the identica-

tion of candidate drug targets. In this work, we introduce a novel

computational approach for predicting drug target proteins. We ap-

proach the problem as a link prediction task on knowledge graphs.

We process drug and target information as a knowledge graph of

interconnected drugs, proteins, disease, pathways and other rele-

vant entities. We then apply knowledge graph embedding (KGE)

models over this data to enable scoring drug-target associations,

where we employ a customised version of state-of-the-art KGE

model ComplEx. We generate a benchmarking dataset based on

KEGG database to train and evaluate our method. Our experiments

show that our method achieves best results in comparison to other

traditional KGE models. Specically, the method predicts drug tar-

get links with mean reciprocal rank (MRR) of 0.78 and Hits@10 of

0.88. This provides a promising basis for further experimentation

and comparisons with domain-specic predictive models.

CCS CONCEPTS

• Semantic networks; • Machine learning; • Machine learned ranking;

KEYWORDS

Drug Target Discovery, Knowledge Graph Embeddings, Link Prediction

ACM Reference Format:

Sameh K. Mohamed, Aayah Nounu, and Vít Nováček. 2019. Drug Target Discovery Using Knowledge Graph Embeddings. In The 34th ACM/SIGAPP Symposium on Applied Computing (SAC ’19), April 8–12, 2019, Limassol, Cyprus. ACM, New York, NY, USA, Article 4, 8 pages. https://doi.org/10.1145/3297280.3297282

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

SAC ’19, April 8–12, 2019, Limassol, Cyprus

ACM ISBN 978-1-4503-5933-7/19/04...$15.00

https://doi.org/10.1145/3297280.3297282

1 INTRODUCTION

The development of drugs has a long history [ ]. Until quite recently, pharmacological effects were often discovered by a primitive trial-and-error procedure: applying plant extracts to living systems and observing the outcomes. Later, drug development evolved towards elucidating the mechanisms of action of drug substances and their effects on phenotype. The ability to pharmacologically isolate active substances was a key step towards modern drug discovery [ ].

More recently, advances in molecular biology and biochemistry have allowed more complex analysis of drugs, their targets and their mechanisms of action. The study of drug targets has become very popular, with studies utilising different chemical genetic [ ] and proteomic methods [ ] such as affinity chromatography and expression cloning approaches. These, however, can only process a limited number of possible drugs and targets due to their dependency on laboratory experiments and available physical resources. Computational approaches have therefore been extensively studied lately [17, 18, 42].

In this work we introduce a specific computational approach for predicting drug target proteins. Our objective is to score possible associations between drugs and proteins according to the probability of the association holding true. The ultimate goal is to assist lab experimentation by narrowing the scope of possible new drug targets to investigate. In current drug target knowledge bases like DrugBank [ ] and KEGG [ ], information about drugs includes their relationships with target proteins (or their genes), action pathways and targeted diseases. These components are represented in the form of graphs of interconnected entities and relations. Such data can naturally be interpreted as a knowledge graph, where the task of finding new associations between drugs and their targets can be formulated as a link prediction task. In this context, knowledge graph embedding (KGE) models are a natural fit, as they are known to provide state-of-the-art results in link prediction on knowledge graphs [ ]. Despite the growing body of computer simulation based drug target prediction frameworks [ ], none of these works utilise knowledge graph embedding models in their predictive pipelines.

The objective of this work is to demonstrate the usefulness of knowledge graph embedding models in the area of drug target prediction. We also identify KGE techniques that can provide the best accuracy in predicting drug targets. This is presented as a stepping stone towards a domain-specific KGE-based drug target prediction model and its extensive comparison with existing related models such as the DDR [25] and the DNILMF [11] models.

[Figure 1 diagram: an input training instance x passes through six phases: (1) Corruption, which produces a corrupted instance x′; (2) Lookup of the entity and relation embeddings E; (3) Scoring with the scoring function η, which yields f(x) and f(x′); (4) Loss, computed by the loss function L; (5) Gradient update; (6) Embedding norm N.]

Figure 1: Phases of one epoch training of a KGE model over one training instance

Our knowledge graph embedding approach, ComplEx-SE (Complex embeddings with squared error loss), is a customisation of the state-of-the-art KGE model ComplEx [38] with a squared error-based training objective. We build a drug-target centred dataset from the KEGG database to train and evaluate our model, and we show experimentally that it achieves the best results among the compared models in predicting new drug target links. To the best of our knowledge, there are currently no models that discover new drug targets using KGE models for link prediction on biomedical knowledge graphs, so we evaluate our method against other state-of-the-art KGE models: the Translating Embeddings (TransE) model [ ], the DistMult model [44] and the Complex Embedding (ComplEx) model [38].

The rest of this paper is structured as follows. Section 2 discusses the problem of drug target discovery and presents fundamental background concepts about KGE models and their evaluation metrics. Section 3 discusses related works that use computational approaches for finding drug targets. Section 4 presents the subset of KEGG that we use and discusses its components. Section 5 presents our KGE approach and the ComplEx model on which it is based. Section 6 presents the experimental setup and the evaluation protocol, and Section 7 discusses the results of our experiments and the lessons learnt. We finally present our conclusions and possible future works in Section 9.

2 BACKGROUND

In this section, we discuss the advantages and implications of finding drug targets that are not yet known. We also discuss modelling information in knowledge graphs, the underlying concepts of knowledge graph embedding models and their evaluation techniques.

2.1 Drug Target Discovery

The process of discovering and developing a drug for a single gene target requires time and money. A drug rarely binds only to its intended target; off-target effects are common [ ], and these may lead to unwanted side-effects [ ]. Conversely, off-target effects may be useful for drug repurposing, which is defined as the use of approved drugs for new diseases [ ]. It is believed to take around 10-17 years from the conception of a drug to when it becomes a licensed treatment for disease, with a success rate of less than 10% [ ]. Drug repurposing is advantageous because the safety profile of the drug is already known, which reduces the time and cost required to bring a new drug into the clinic [4].

The identication of new protein targets also allows the develop-

ment of drugs that specically target the protein of interest. For

example, aspirin is currently being considered for use as a chemo-

preventative agent [

] but there are concerns with regards

to side-eects caused by its long-term use, such as upper gastroin-

testinal bleeding [

]. By identifying the exact protein targets of

aspirin, new drugs can be developed specically for these proteins

to avoid the unwanted side-eects.

Computational approaches are useful because they are free from bias and are therefore not influenced by prior knowledge and opinions, unlike laboratory-based methods. These approaches bypass the need to spend a long time in the laboratory and can be used to guide the direction of research within the laboratory, saving both time and money. Follow-up experiments can then be carried out in the laboratory to confirm the new proteins targeted by the drug, allowing direct conclusions to be made about the treatment effect.

Overall, computational approaches are considered useful methods for identifying off-target interactions and can also be used to explore drug repurposing, as they reduce the time required to manually discover other unintended protein targets and may reduce the large costs of doing so.

2.2 Knowledge Graphs

Knowledge graphs are a data representation that models relational information as a graph, where the nodes represent knowledge entities and the edges represent relations between them. They model facts as (subject, predicate, object) (SPO) triples, e.g. (Aspirin, Drug-Target, COX-1), where a subject entity is connected to an object entity through a predicate relation.
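As a minimal sketch of the SPO representation described above, the snippet below stores triples as Python tuples and maps entity and relation names to integer indices, which is the form embedding models typically consume. The first triple is the example from the text; the second triple and the names `entity2id` and `relation2id` are hypothetical and only serve the illustration.

```python
# One triple from the text plus one hypothetical triple for illustration.
triples = [
    ("Aspirin", "Drug-Target", "COX-1"),
    ("DrugA", "Drug-Pathway", "PathwayX"),  # hypothetical placeholder names
]

# Collect the distinct entities (subjects and objects) and relations.
entities = sorted({e for s, _, o in triples for e in (s, o)})
relations = sorted({p for _, p, _ in triples})

# Assign each name an integer index so it can address an embedding row.
entity2id = {name: i for i, name in enumerate(entities)}
relation2id = {name: i for i, name in enumerate(relations)}

# Encode every fact as an (s, p, o) tuple of indices.
encoded = [(entity2id[s], relation2id[p], entity2id[o]) for s, p, o in triples]
print(encoded)
```

Sorting the names before indexing is just one way to make the mapping reproducible across runs.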

In recent years, knowledge graphs have become a popular means of data representation in the semantic web community, where they are used to model linked data and create the "web of data": a network of interconnected entities that can be easily interpreted by both humans and machines [ ]. They have also been used as a convenient means of modelling information in many different domains, including general human knowledge [ ], biomedical information [ ] and language lexical information [ ]. Knowledge graphs are now used in different applications such as enhancing the semantics of search engine results [ ], biomedical discoveries [ ], or powering question answering and decision support systems [8].

2.3 Knowledge Graph Embedding

Knowledge graph embedding models learn a low-rank vector representation of knowledge entities and relations that can be used to rank knowledge assertions according to their factuality. KGE models are trained in a multi-phase procedure as shown in Fig. 1, where the objective is to learn vector representations of entities and relations that can be used to score and rank possible knowledge facts.

First, a KGE model initialises all embedding vectors using random noise values. It then uses these embeddings to score the set of true and false training facts using a model-dependent scoring function. The output scores are passed to the training loss function to compute the training error, as shown in Fig. 1. These errors are used by optimisers like AMSGrad [ ] to generate gradients and update the initial embeddings, so that the updated embeddings give higher scores to true facts and lower scores to false facts. This procedure is repeated for a set number of iterations, i.e. epochs, in order to reach a state where the embeddings provide the best possible scoring for both true and false facts.
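The training procedure above can be sketched in a few lines of NumPy. This is an illustrative toy only, under several assumptions that are not taken from the paper: it uses a DistMult-style trilinear score as a stand-in for the model-dependent scoring function, plain SGD instead of AMSGrad, one random object corruption per positive, and a made-up five-entity graph. The names `E`, `R` and `train_epoch` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy index triples (subject, predicate, object); purely illustrative data.
triples = [(0, 0, 2), (1, 0, 3), (0, 1, 4)]
n_entities, n_relations, dim = 5, 2, 8

# Initialise all embeddings with random noise.
E = rng.normal(scale=0.5, size=(n_entities, dim))   # entity embeddings
R = rng.normal(scale=0.5, size=(n_relations, dim))  # relation embeddings


def train_epoch(lr=0.1):
    """Run the six phases of Fig. 1 once over every training triple."""
    epoch_loss = 0.0
    for s, p, o in triples:
        # (1) Corruption: swap the object for a different random entity.
        o_neg = (o + int(rng.integers(1, n_entities))) % n_entities
        for obj, label in ((o, 1.0), (o_neg, -1.0)):
            # (2) Lookup: copy the current embeddings of the triple.
            e_s, e_r, e_o = E[s].copy(), R[p].copy(), E[obj].copy()
            # (3) Scoring: DistMult-style trilinear product (assumption).
            f = float(np.sum(e_s * e_r * e_o))
            # (4) Loss: pointwise logistic loss for this instance.
            epoch_loss += float(np.log1p(np.exp(-label * f)))
            # (5) Update: one SGD step along d(loss)/d(embedding).
            g = -label / (1.0 + np.exp(label * f))
            E[s] -= lr * g * e_r * e_o
            R[p] -= lr * g * e_s * e_o
            E[obj] -= lr * g * e_s * e_r
        # (6) Norm: rescale entity embeddings to at most unit length.
        norms = np.linalg.norm(E, axis=1, keepdims=True)
        E[:] = E / np.maximum(norms, 1.0)
    return epoch_loss


losses = [train_epoch() for _ in range(50)]
print(f"first epoch loss: {losses[0]:.3f}, last epoch loss: {losses[-1]:.3f}")
```

Any of the scoring functions and losses discussed later could be substituted in phases (3) and (4) without changing the overall loop.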

2.4 Ranking Metrics

In the following, we present the metrics that we use in the evaluation of our approach.

(1) Mean reciprocal rank (MRR): This is the mean of the reciprocal rank positions of the first relevant element over all queries (equivalently, the reciprocal of the harmonic mean of these ranks), and it is defined as follows:

$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}$$

where $\mathrm{rank}_i$ refers to the rank position of the first relevant element for the $i$-th query and $|Q|$ is the number of queries. The values of MRR lie between 0 and 1, where 1 represents perfect ranking and values decaying towards 0 represent decreasing accuracy.

(2) Hits@k: This is the proportion of queries for which the correct element is ranked among the top-$k$ elements, where we use Hits@1, Hits@3 and Hits@10. This metric estimates the model's probability of ranking the relevant (true) fact within the top $k$ positions.
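Both metrics can be computed directly from a list of ranks, as in the short sketch below. The ranks are made-up values for illustration and the function names are hypothetical; Hits@k is computed here as a proportion of queries.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks, k):
    """Fraction of queries whose true element is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical ranks of the true element for five test queries.
ranks = [1, 2, 4, 10, 25]

print("MRR    :", mean_reciprocal_rank(ranks))
print("Hits@1 :", hits_at_k(ranks, 1))
print("Hits@3 :", hits_at_k(ranks, 3))
print("Hits@10:", hits_at_k(ranks, 10))
```

For these ranks the reciprocal ranks are 1, 0.5, 0.25, 0.1 and 0.04, so the MRR is 1.89 / 5 = 0.378, while Hits@1, Hits@3 and Hits@10 are 0.2, 0.4 and 0.8 respectively.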

3 RELATED WORK

In this section we discuss two kinds of related work: first, other computer-based approaches for predicting drug targets; second, link prediction approaches and state-of-the-art knowledge graph embedding models.

3.1 Computer Based Drug Target Prediction

Yamanishi et al. [42] developed one of the early computational approaches to predict drug targets; their approach utilised a statistical model that infers drug targets from a bipartite graph of both chemical and genomic information. More recent approaches like COSINE [ ] and NRLMF [ ] introduced the use of drug-drug and target-target similarity measures to infer possible drug targets. Because they depend on drug-drug and target-target similarities, these approaches enabled predictions for new drugs and drug targets with limited or no interaction data. However, these methods only utilised a single measure to model component similarity.

Other drug target prediction models like KronRLS-MKL [ ] and BLM-NII [ ] integrated different similarity measures to model the similarity between drugs and targets. These approaches use both linear and non-linear combinations of similarity measures to encode the similarity between drugs and their targets, where non-linear combinations provided better drug-target predictions [18].

Recently, Hao et al. [11] proposed a model that uses matrix factorisation to predict drug targets over drug information networks. Their model, DNILMF, operates in a four-step procedure. First, it infers different profiles for both drugs and targets and constructs kernel matrices for these profiles. It then diffuses the drug profile kernel matrices with their structure kernel matrices. It then diffuses the target profile kernel matrices with their sequence kernel matrices. Finally, the DNILMF model uses the outputs of the previous steps to predict drug targets based on their network neighbours. This approach showed significant improvements in predictive accuracy over other methods on standard benchmarking datasets [11, 42].

The current state-of-the-art work on computational drug target discovery is the DDR model [25], which predicts drug targets using heterogeneous graphs that contain drug target interactions in a multi-phase procedure. First, it computes similarity indices for drugs and their targets. It then selects a subset of these similarities in a heuristic process to obtain optimal combinations of similarities. Finally, it combines the selected similarities using a non-linear fusion technique, and combines the diffusion output with random walk features from the heterogeneous graphs to predict drug targets. Despite the complexity of the DDR model, it currently provides state-of-the-art results in predicting drug targets using computational approaches [25].

3.2 Link Prediction in Knowledge Graph

In recent years, various predictive frameworks have been developed to predict new links in knowledge graphs, serving applications such as semantic search engines [ ], biomedical discoveries [ ], and question answering systems [ ]. Link prediction models can be divided into two categories: graph-feature based models and latent-feature based models. Graph-feature based models utilise graph features like paths and graph patterns to predict possible connecting links between graph entities. For example, the path ranking algorithm (PRA) [ ] uses connecting paths between entities generated by random walks to infer possible links between them, while other models like the subgraph feature extraction model (SFE) [ ] and the distinct subgraph path (DSP) [ ] employ a combination of connecting paths and subgraph paths of two entities to predict their possible associations. On the other hand, latent-feature based models, i.e. knowledge graph embedding models, use a generative approach to learn low-rank embeddings for knowledge entities and relations in order to score their possible associations. These approaches use techniques like tensor factorisation, as in the DistMult model [44], and latent distance similarity, as in the TransE model [ ], to model possible interactions between graph embeddings and provide scores for possible graph links. Nickel et al. [23] provide an extended review of both graph-feature based and latent-feature based models in the task of link prediction in knowledge graphs.

[Figure 2 diagram: link types between the entity types of the graph: Drug-Target Gene, Drug-Pathway, Drug-Disease, Disease-Gene, Disease-Pathway, Disease-Network, Gene-Pathway, Gene-Network and Network-Pathway.]

Figure 2: Knowledge graph about drugs, their target genes, pathways, diseases and gene variant networks extracted from KEGG.

4 DATA

In this section we discuss the KEGG database [ ] with a focus on the components that we use to train and evaluate our approach. KEGG is a knowledge base that contains information about biological systems like cells and organisms at the molecular level. It contains different types of biological information entities like genes, pathways, drugs, diseases, etc. The data in KEGG is structured as a network of inter-connected entities that resembles the biological ecosystem at the molecular level. In our study, we focus on information related to drugs and their targets, and only the following KEGG components are considered:

(1) Drugs: The KEGG drug database¹ is a comprehensive drug information resource for approved drugs. It contains multiple types of information about drugs such as the chemical structure, associated targets, action pathways and targeted diseases. In our study, we only consider drug associations to the elements specified in Table 1.

¹ https://www.genome.jp/kegg/drug/

Table 1: Statistics of objects and their inter-connections in the subset of KEGG dataset that we use.

Object    Count   Drug    Gene   Pathway  Disease  Network
Drug       4670      •   12004      7910     2160        0
Gene       8881  12004       •       497      239     4534
Pathway     329   7910     497         •     1803      524
Disease    1873   2160     239      1803        •      441
Network     448      0    4534       524      441        •

(2) Genes: The KEGG Gene database² contains information about genes, their sequences and their associations with other biological entities. While drug targets in living systems are usually proteins, the KEGG database uses genes as a representation of drug targets, where genes represent their product proteins. In the rest of this study, we use the KEGG genes to represent product proteins as drug targets in the knowledge graph we have created.

(3) Pathways: The KEGG database also contains information about biological pathways associated with manually curated maps of their reactions. The pathway database³ in KEGG includes pathways of different activities such as metabolism, environmental information processing, human disease, etc. Each pathway is associated with its related entities, e.g. genes, drugs and diseases, and we use such associations to construct our knowledge about pathways.

(4) Diseases: The KEGG disease database⁴ is structured in a similar form to the previously mentioned entity databases, where our main interest is to utilise the associations between diseases and our other investigated entities. The disease database also contains associations to other entity types, such as carcinogens, that could help to extend the knowledge about specific cancerous diseases in future studies.

(5) Networks: The KEGG network database⁵ contains information on the perturbation of human genes, where it encodes knowledge about the different variants and other perturbants of human genes that are involved in the perturbed molecular reaction networks. Similarly, instances of the network database are linked to their related entities in other KEGG databases.

In our study, we gather all the possible associations between the

previously mentioned KEGG entities to generate a biological knowl-

edge graph that is centred around drugs and their target genes as

shown in Fig. 2. The statistics of the counts of each entity type and

the inter-connecting links between entities are provided in Table 1.

² https://www.genome.jp/kegg/genes.html

³ https://www.genome.jp/kegg/pathway.html

⁴ https://www.genome.jp/kegg/disease/

⁵ https://www.kegg.jp/kegg/network.html

[Figure 3 plots: two panels, Negatives and Positives.]

Figure 3: Plots of loss growth of squared error and logistic pointwise loss functions compared to scores of positive and negative instances.

5 OUR APPROACH

In this section, we present the technical details of our approach, which is a modification of the ComplEx knowledge graph embedding model [38]. We discuss the ComplEx model, its scoring and loss functions, and our modifications.

5.1 Complex Scoring Function

The ComplEx model is a tensor factorisation based knowledge graph embedding model. It represents knowledge entities and relations using complex vector embeddings, where each embedding is represented by two vectors (real and imaginary). Fact assertions in the ComplEx model are evaluated using a factorisation based scoring function that is defined as follows:

$$f_{\mathrm{ComplEx}}(s, p, o) = \sum_{k=1}^{K} \mathrm{Re}\left(\langle e_{sk}, e_{rk}, \bar{e}_{ok} \rangle\right)$$

where $e_s$, $e_r$ and $e_o$ are the embeddings of the subject, the relation and the object respectively, $e_{sk}$ is the $k$-th component of the embedding $e_s$, $K$ is the embedding size (vector length), $\langle \cdot, \cdot, \cdot \rangle$ denotes the product of its three arguments, $\mathrm{Re}(x)$ is the real part of complex value $x$, and $\bar{x}$ is the complex conjugate of $x$ such that $\bar{x} = a - ib$ for $x = a + ib$. This formulation can be expanded as follows:

$$f_{\mathrm{ComplEx}}(s, p, o) = \sum_{k=1}^{K} e^{r}_{sk} e^{r}_{rk} e^{r}_{ok} + e^{i}_{sk} e^{r}_{rk} e^{i}_{ok} + e^{r}_{sk} e^{i}_{rk} e^{i}_{ok} - e^{i}_{sk} e^{i}_{rk} e^{r}_{ok}$$

where $x^{r}$ and $x^{i}$ are the real and imaginary parts of complex value $x$ respectively (the superscript $r$ denotes the real part and is distinct from the relation subscript $r$). The use of the complex conjugate in the ComplEx model allows it to encode embedding interactions in an asymmetric operation, which enables it to model facts with both symmetric and asymmetric predicates, unlike other factorisation based models like the DistMult model [44].
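The scoring function and its expansion can be checked numerically. The sketch below, with hypothetical function names and random toy embeddings, computes the score both from the complex product and from the four-term real expansion, and then illustrates the asymmetry with a one-dimensional example: swapping subject and object changes the score when the relation embedding has an imaginary part, and leaves it unchanged when the relation embedding is purely real.

```python
import numpy as np


def complex_score(e_s, e_r, e_o):
    """Sum over k of Re(e_sk * e_rk * conj(e_ok))."""
    return float(np.sum(np.real(e_s * e_r * np.conj(e_o))))


def complex_score_expanded(e_s, e_r, e_o):
    """The same score written with real and imaginary parts only."""
    s_re, s_im = e_s.real, e_s.imag
    r_re, r_im = e_r.real, e_r.imag
    o_re, o_im = e_o.real, e_o.imag
    return float(np.sum(s_re * r_re * o_re + s_im * r_re * o_im
                        + s_re * r_im * o_im - s_im * r_im * o_re))


rng = np.random.default_rng(0)
K = 4  # embedding size, arbitrary for this illustration
e_s, e_r, e_o = (rng.normal(size=K) + 1j * rng.normal(size=K) for _ in range(3))

# Both formulations agree up to floating point error.
print(complex_score(e_s, e_r, e_o), complex_score_expanded(e_s, e_r, e_o))

# Asymmetry: with an imaginary relation, (s, p, o) and (o, p, s) differ.
a, rel, b = np.array([1 + 0j]), np.array([0 + 1j]), np.array([0 + 1j])
print(complex_score(a, rel, b), complex_score(b, rel, a))

# Symmetry: with a purely real relation the two directions coincide.
rel_real = np.array([2 + 0j])
print(complex_score(a, rel_real, b), complex_score(b, rel_real, a))
```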

5.2 Training Objective

In the task of link prediction, knowledge graph embedding models are considered learning-to-rank models, and they employ traditional ranking objectives, e.g. pointwise and pairwise ranking losses, to model their training loss. The ComplEx model by default uses a pointwise ranking loss with a negative-logistic transformation, which is defined as follows:

$$\mathcal{L}^{\mathrm{logistic}}_{\mathrm{Pt}} = \sum_{x \in T} \log\left(1 + \exp(-l(x) \cdot f(x))\right), \qquad (1)$$

where $x$ is an (s, p, o) fact, $T$ is the set of all training facts with negative samples, and $l(x)$ is the true label of fact $x$, equal to 1 when $x$ is true and -1 otherwise. This allows the model to update the embeddings of both entities and relations so that they give high scores to true facts and lower scores to false facts, as shown in Fig. 3.

5.3 New Loss Objective for The ComplEx model

Trouillon and Nickel [37] have shown that the choice of objective loss in KGE models has a large impact on their predictive accuracy. They showed that despite the equivalence of the ComplEx and the Holographic embedding (HolE) [ ] models, they vary in accuracy due to their dependency on different training loss objectives: the HolE model uses a max-margin loss while the ComplEx model uses a log-likelihood loss. Following their remarks, in this work we propose a new loss objective for the ComplEx model, and we show that it suits the limited size and number of predicates in our dataset. The new loss function is based on the squared error between the ComplEx scores of assertions and their true labels, using a 0 and 1 labelling such that 0 represents false fact assertions and 1 represents true fact assertions. The new squared error based loss is defined as follows:

$$\mathcal{L}^{\mathrm{SE}}_{\mathrm{Pt}} = \sum_{x \in T} \frac{1}{2}\left(l(x) - f(x)\right)^{2} \qquad (2)$$

where $l(x)$ is the label of fact $x$, with $l(x) = 0$ if $x$ is false and 1 otherwise. The squared error loss thus forces embedding updates that produce normalised scores around 0 and 1, unlike the logistic loss, which leaves the range of scores open.

We show empirically that our new training loss for the ComplEx model outperforms the default version, using the evaluation framework described in Section 6. In the following, we discuss some properties of both the logistic and squared error based losses.

Fig. 3 shows the differences between the growth of the squared error based loss and ComplEx's default logistic loss. The two functions have different loss growth rates, which affect the growth rate of the values of their output gradients. The logistic loss has a linear growth rate and its gradient per single instance is defined as follows:

$$\Delta x = \frac{\exp(f(x))}{1 + \exp(f(x))} \qquad (3)$$

which grows in a sub-linear sigmoid fashion. This limits the output gradients for each training instance to the range [0, 1]. On the other hand, the quadratic growth of the squared error loss yields linearly growing gradients defined as follows:

$$\Delta x = \begin{cases} 0 & \text{for } f(x) - l(x) = 0 \\ f(x) & \text{for } f(x) - l(x) < 0 \\ -f(x) & \text{for } f(x) - l(x) > 0 \end{cases} \qquad (4)$$

which enables the ComplEx model to produce gradients within the range $[0, \infty)$.
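The contrast between the two losses can be illustrated numerically. The sketch below is an illustration under stated assumptions rather than the training code of the paper: it evaluates the per-instance terms of Eq. (1) and Eq. (2) for a negative instance (label -1 for the logistic loss, 0 for the squared error loss) and differentiates each term with respect to the score $f(x)$. Direct differentiation of Eq. (2) gives $f(x) - l(x)$, whose magnitude grows linearly with the distance between the score and the label, while the logistic gradient stays within [0, 1]. All function names are hypothetical.

```python
import numpy as np


def logistic_loss(f, label):
    """Per-instance term of Eq. (1); label is +1 (true) or -1 (false)."""
    return np.log1p(np.exp(-label * f))


def logistic_grad(f, label):
    """Derivative of the logistic term with respect to the score f."""
    return -label / (1.0 + np.exp(label * f))


def squared_error_loss(f, label):
    """Per-instance term of Eq. (2); label is 1 (true) or 0 (false)."""
    return 0.5 * (label - f) ** 2


def squared_error_grad(f, label):
    """Derivative of the squared error term with respect to the score f."""
    return f - label


# Gradients for a negative instance as its (wrongly high) score grows.
for f in (1.0, 10.0, 100.0):
    print(f"f={f:6.1f}  logistic grad={logistic_grad(f, -1.0):.4f}  "
          f"squared error grad={squared_error_grad(f, 0.0):.1f}")
```

For a negative instance the logistic gradient equals the sigmoid expression of Eq. (3) and saturates near 1, whereas the squared error gradient keeps increasing with the score.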

[Figure 4 diagram: a test triple (s, p, o) is corrupted on the drug side and on the target side; each set of corruptions is scored, the scores are filtered, and the rank (R), reciprocal rank (RR) and Hits@n (H@n) metrics are computed from each set of filtered scores.]

Figure 4: Evaluation protocol of KGE models for a single drug target testing instance. Note: R represents rank, RR represents reciprocal rank, and H@n represents Hits@n

Table 2: Statistics of entities, relations, facts and drug-target fact count per split of KEGG50k dataset

Dataset         Entities  Relations  Facts  DT-Facts
KEGG50k-full       16201          9  63080     12004
KEGG50k-train      16201          9  57080     10769
KEGG50k-valid      16201          9   3000       585
KEGG50k-test       16201          9   3000       650

6 EXPERIMENTS

In this section we describe the setup of our experiments and the

evaluation pipeline.

6.1 Dataset

In our experiments, we divide our KEGG subset into training, validation and testing splits with ratios of approximately 90%, 5% and 5% respectively. All the splits contain facts that describe all entities and relations, and the drug-target links are distributed among the three splits in approximately the same ratio as the split sizes. Table 2 shows statistics about the dataset and its splits in terms of the number of entities, relations, facts and drug-target facts. The KEGG50k dataset can be downloaded from figshare⁶.
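The figures reported in Tables 1 and 2 can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers from the two tables (the variable names are hypothetical) to confirm that the per-type entity counts add up to the reported entity total, that the splits add up to the full dataset, and to print the actual share of each split.

```python
# Entity counts per type, taken from Table 1.
entity_counts = {"Drug": 4670, "Gene": 8881, "Pathway": 329,
                 "Disease": 1873, "Network": 448}

# (facts, drug-target facts) per split, taken from Table 2.
splits = {"train": (57080, 10769), "valid": (3000, 585), "test": (3000, 650)}
full_facts, full_dt_facts, full_entities = 63080, 12004, 16201

total_entities = sum(entity_counts.values())
total_facts = sum(facts for facts, _ in splits.values())
total_dt_facts = sum(dt for _, dt in splits.values())
print(total_entities, total_facts, total_dt_facts)

# Share of all facts and of drug-target facts that falls in each split.
for name, (facts, dt) in splits.items():
    print(f"{name:5s} facts: {facts / full_facts:.1%}  "
          f"drug-target facts: {dt / full_dt_facts:.1%}")
```

The totals match the KEGG50k-full row of Table 2, and the training split holds about 90.5% of all facts, which is consistent with the approximate 90/5/5 split described above.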

Footnote 6: The KEGG50k dataset is found at: https://figshare.com/s/bbfc7b82d17e0b8b6a43

6.2 Implementation

We use the TensorFlow framework (GPU) along with Python 3.5 to perform our experiments. All experiments were executed on a Linux machine with an Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz, 32 GB RAM, and an NVIDIA Titan Xp GPU.

6.3 Evaluation protocol

KGE models are evaluated using a unified protocol that assesses their performance in the task of link prediction. Let $X$ be the set of facts, i.e. triples, $\Theta_E$ be the embeddings of the set of all entities $E$, and $\Theta_R$ be the embeddings of the set of all relations $R$. The KGE evaluation protocol works in four steps (Fig. 4 shows a visual flow of the evaluation process steps for a single triple instance):

(1) Corruption: For each triple $x = (s, p, o) \in X$, $x$ is corrupted $|E| - 1$ times by replacing its subject and object entities with all the other entities in $E$. The corrupted triples can be defined as:

$$x_{corr} = \bigcup_{s' \in E} (s', p, o) \cup \bigcup_{o' \in E} (s, p, o')$$

where $s' \neq s$ and $o' \neq o$. These corruptions effectively provide negative examples for the supervised training and testing processes due to the Local Closed World Assumption [23].

(2) Scoring: Both original triples and their corrupted instances are evaluated using a model-dependent scoring function. This process involves looking up the embeddings of entities and relations, and computing scores from these embeddings using the model-dependent scoring function.

(3) Filtering: The corruptions of a triple may contain positive instances that exist among the training or validation triples. This problem is alleviated by filtering out the scores of positive instances in the triple corruptions.

(4) Computing metrics: Each triple and its corresponding subject and object corruption triples produce two sets of filtered scores following the previous evaluation steps. Then, for each set of filtered scores, the rank, reciprocal rank, and Hits@n metrics are computed.
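The four steps above can be sketched for a single test triple as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: entities are assumed to be integer ids in `range(num_entities)`, `score_fn` is a hypothetical vectorised scoring function, and `known_triples` is assumed to hold the training and validation facts used for filtering.

```python
import numpy as np

def filtered_ranks(triple, score_fn, num_entities, known_triples):
    """Return (subject_rank, object_rank) of one test triple, filtered setting.

    score_fn(s, p, o) takes three integer arrays of equal length and returns
    one score per triple; known_triples is a set of (s, p, o) tuples.
    """
    s, p, o = triple
    candidates = np.arange(num_entities)
    ranks = []
    for side in ("subject", "object"):
        # (1) Corruption: replace one side of the triple with every entity.
        if side == "subject":
            subj, obj = candidates, np.full(num_entities, o)
        else:
            subj, obj = np.full(num_entities, s), candidates
        rel = np.full(num_entities, p)
        # (2) Scoring: the true triple is scored together with its corruptions.
        scores = score_fn(subj, rel, obj).astype(float)
        true_index = s if side == "subject" else o
        true_score = scores[true_index]
        # (3) Filtering: discard corruptions that are themselves known facts.
        for e in candidates:
            corrupted = (int(e), p, o) if side == "subject" else (s, p, int(e))
            if e != true_index and corrupted in known_triples:
                scores[e] = -np.inf
        # (4) Metrics: rank = 1 + number of corruptions scored above the truth.
        ranks.append(1 + int(np.sum(scores > true_score)))
    return tuple(ranks)

# Toy one-dimensional scorer (hypothetical): best when o == s + 3.
toy_score = lambda s, p, o: -np.abs(s + 3 - o)
print(filtered_ranks((1, 0, 3), toy_score, 6, {(0, 0, 3)}))  # (1, 2)
```

In the toy call, the subject corruption (0, 0, 3) scores higher than the test triple but is a known fact, so filtering removes it and the subject-side rank is 1 instead of 2. The reciprocal rank and Hits@n values then follow directly from each returned rank.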

6.4 Experimental Setup

In the experiments, we use the state-of-the-art KGE models TransE [2], DistMult [44] and ComplEx [38], compared to our customised version of the ComplEx model, to perform link prediction over the KEGG50k dataset in two settings. First, the general link prediction setting, where the objective is to learn to rank all the links in the testing set according to their factuality compared to all their other possible corrupted assertions. Second, the drug target link prediction setting, where the same procedure is applied only to drug target links.
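For orientation, the scoring functions of the three baseline models can be sketched in NumPy as below. These follow the commonly published forms of the models; the choice of the L1 distance for TransE, the function names and the toy vectors are assumptions for illustration and are not taken from the authors' code. ComplEx-SE is described in this paper as ComplEx with a square error based loss, so the sketch covers only the baseline scoring functions.

```python
import numpy as np

def transe_score(s, p, o):
    """TransE: negative L1 distance between the translated subject and the object."""
    return -np.sum(np.abs(s + p - o), axis=-1)

def distmult_score(s, p, o):
    """DistMult: tri-linear dot product of real-valued embeddings."""
    return np.sum(s * p * o, axis=-1)

def complex_score(s, p, o):
    """ComplEx: real part of the tri-linear product with the conjugated object."""
    return np.real(np.sum(s * p * np.conj(o), axis=-1))

# Toy real-valued embeddings (hypothetical values).
s, p, o = np.array([1.0, 2.0]), np.array([0.5, -1.0]), np.array([1.5, 1.0])
print(transe_score(s, p, o), distmult_score(s, p, o))

# Toy complex-valued embeddings (hypothetical values).
cs, cp, co = np.array([1 + 1j]), np.array([1j]), np.array([1 - 1j])
print(complex_score(cs, cp, co), complex_score(co, cp, cs))
```

Swapping subject and object leaves the DistMult score unchanged but changes the ComplEx score, which is why ComplEx can represent directed relations such as a drug targeting a protein.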

Drug Target Discovery Using Knowledge Graph Embeddings SAC ’19, April 8–12, 2019, Limassol, Cyprus

Table 3: Link prediction results over the KEGG50k dataset on both general dataset links (General) and drug target links only (DT). For all the metrics except for mean rank (Rank), the higher the value the better.

| Model | General Rank | General MRR | General Hits@1 | General Hits@3 | General Hits@10 | DT Rank | DT MRR | DT Hits@1 | DT Hits@3 | DT Hits@10 |
|---|---|---|---|---|---|---|---|---|---|---|
| TransE [2] | 192 | 0.46 | 0.38 | 0.50 | 0.63 | 81 | 0.75 | 0.69 | 0.79 | 0.86 |
| DistMult [44] | 430 | 0.37 | 0.27 | 0.42 | 0.57 | 186 | 0.61 | 0.50 | 0.69 | 0.81 |
| ComplEx [38] | 506 | 0.39 | 0.31 | 0.43 | 0.57 | 208 | 0.68 | 0.61 | 0.71 | 0.82 |
| ComplEx-SE (This work) | 534 | 0.52 | 0.45 | 0.56 | 0.68 | 145 | 0.78 | 0.73 | 0.81 | 0.88 |

We run all the models over the previously mentioned benchmarking dataset KEGG50k. A grid search is performed to obtain the best hyperparameters for each model, where the investigated parameters are: embedding size $K \in \{100, 150, 200\}$, margin $\lambda \in \{\ldots\}$ for the TransE and the DistMult models, and number of negative samples $n \in \{\ldots\}$. All embedding vectors of our models are initialised using the uniform Xavier random initializer [10]. For all the experiments, we use batches of size 5000, with a maximum of 1000 training iterations, i.e. epochs. The gradient update procedure is performed using the AMSGrad optimiser [28] with a fixed learning rate of 0.01.
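A grid search of this kind amounts to enumerating every combination of the candidate values and training one model per combination. The sketch below shows only the enumeration step; the dictionary keys and the helper name `expand_grid` are hypothetical. Only the embedding sizes are included in the grid, and the margin and negative-sample grids would be added as further keys with the values used in the experiments.

```python
from itertools import product

# Settings stated in the text: batch size, epoch limit, learning rate, optimiser.
FIXED = {"batch_size": 5000, "max_epochs": 1000,
         "learning_rate": 0.01, "optimiser": "AMSGrad"}

def expand_grid(grid):
    """Yield one configuration dict per combination of the grid values."""
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        yield {**FIXED, **dict(zip(names, values))}

# Embedding sizes from the text; other searched parameters are left out here.
grid = {"embedding_size": [100, 150, 200]}
configs = list(expand_grid(grid))
print(len(configs))  # 3
print(configs[0])
```

Each yielded dictionary would be passed to a training run, and the configuration with the best validation score would be kept.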

7 RESULTS AND DISCUSSION

Table 3 shows the results of our experiments, which are executed in two configurations: general link prediction over all association types and link prediction over drug-target associations only.

The results show that our approach, ComplEx-SE, outperforms the other state-of-the-art models in terms of MRR, Hits@1, Hits@3 and Hits@10 in both experiment configurations: it achieves a better MRR by a margin of 0.06 (6%) in predicting general links and by a margin of 0.03 (3%) over the other models in predicting drug-target links.
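The two margins can be checked directly against the MRR columns of Table 3. The short sketch below copies those values and compares ComplEx-SE with the best competing model in each configuration; the variable and function names are illustrative.

```python
# MRR values copied from Table 3: (general links, drug target links).
MRR = {
    "TransE": (0.46, 0.75),
    "DistMult": (0.37, 0.61),
    "ComplEx": (0.39, 0.68),
    "ComplEx-SE": (0.52, 0.78),
}

def mrr_margin(setting_index):
    """ComplEx-SE MRR minus the best competing MRR, rounded to two decimals."""
    best_other = max(v[setting_index] for k, v in MRR.items() if k != "ComplEx-SE")
    return round(MRR["ComplEx-SE"][setting_index] - best_other, 2)

print(mrr_margin(0), mrr_margin(1))  # 0.06 0.03
```

In both configurations the closest competitor on MRR is TransE.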

The link prediction task is executed such that, for each investigated possible drug-target association such as (Aspirin, Drug-Target, COX2), each model is required to answer two questions: (1) Which drug targets COX2? and (2) Which target does the drug Aspirin target? A model has to choose an answer from the set of all vocabulary entities, where all correct answers except for the drug Aspirin and the target COX2 are removed. The answers to these questions are formatted as a ranking, where the model is required to place the correct answer in the first position of its ranking to achieve perfect accuracy, as shown in Fig. 4, which presents the flow of the link prediction evaluation pipeline for one test instance. In this setting, a random baseline model would choose the right answer in the first position of the ranking with a probability of $1/|E|$, where $|E|$, the size of the entity vocabulary, is equal to 16201 in our experiments. Our knowledge graph embedding model, ComplEx-SE, is able to identify the correct answers for both of the previous questions with a mean reciprocal rank of 0.78, where it places the correct answer within the top one, three and ten positions of the ranking with probabilities of 0.73, 0.81 and 0.88 respectively.
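To make the reported quantities concrete, the sketch below aggregates a list of per-query ranks into MRR and Hits@n and prints the random-baseline probability of a first-place hit. The ranks are hypothetical illustration values, not results from the paper; only the entity count 16201 comes from the text.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at(ranks, n):
    """Fraction of queries whose correct answer is ranked in the top n."""
    return sum(1 for r in ranks if r <= n) / len(ranks)

NUM_ENTITIES = 16201  # |E| reported for KEGG50k

# Hypothetical filtered ranks for four test queries.
ranks = [1, 1, 2, 25]
print(round(mean_reciprocal_rank(ranks), 3))                      # 0.635
print(hits_at(ranks, 1), hits_at(ranks, 3), hits_at(ranks, 10))   # 0.5 0.75 0.75

# Chance of a uniformly random guess landing in first place.
print(f"{1 / NUM_ENTITIES:.1e}")  # 6.2e-05
```

Against a chance level of about 6.2e-05 for a first-place hit, a Hits@1 of 0.73 on drug-target links is several orders of magnitude above random guessing.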

The use of knowledge graph embedding approaches such as ComplEx-SE enables predicting new associations for both new drugs and new targets, since it does not depend on their interaction profiles. KGE models are also capable of learning different types of associations between entities of different types with no extra configuration, as shown in the general link prediction configuration in Table 3. For example, they can be used to identify the relation between proteins and pathways, or the relation between drugs and pathways. This can lead to discovering further unknown activities for both drugs and proteins with no extra computational cost.

8 ACKNOWLEDGEMENTS

This work has been supported by Insight Centre for Data Analytics

at National University of Ireland Galway, Ireland (supported by

the Science Foundation Ireland grant 12/RC/2289). The GPU card used in our experiments was granted to us by the Nvidia GPU Grant Program.

9 CONCLUSIONS AND FUTURE WORK

In this work, we introduced the use of knowledge graph embedding models for predicting drug targets using currently available drug knowledge bases, where we formulated the problem as a link prediction task over drug-target-centred knowledge graphs. We have created a knowledge graph dataset, KEGG50k, from the KEGG database, which is centred around drugs and their targeted genes, disease and reaction pathways. We then used this dataset to evaluate the predictive accuracy of knowledge graph embedding models.

We proposed a knowledge graph embedding approach, ComplEx-SE, which is a customised version of the ComplEx model with a square error based loss function, and we showed by empirical evaluation that our approach outperforms other state-of-the-art knowledge graph embedding models in the task of predicting drug target links over the KEGG database. Our results showed that the ComplEx-SE approach is able to identify drug target links with a mean reciprocal rank of 0.78, a 3% margin over other state-of-the-art knowledge graph embedding models. Results also showed that our approach is able to rank 16201 possible drug-target association statements containing only one true statement, where it places the true statement within the top one, three and ten positions of the ranking with probabilities of 0.73, 0.81 and 0.88 respectively.


Despite the growing body of research on computer-based approaches for predicting drug targets, our objective in this work was limited to evaluating knowledge graph embedding approaches and identifying their optimal techniques for predicting drug targets. In future work, we aim to compare our knowledge graph embedding technique with other state-of-the-art computer-based drug target discovery approaches using the same benchmarking dataset and evaluation metrics. We also aim to perform in-lab experimental evaluation of the top predicted drug target associations that are not in the currently publicly available knowledge bases, to validate possible new undiscovered drug targets.

REFERENCES

[1] Ted T Ashburn and Karl B Thor. 2004. Drug repositioning: identifying and developing new uses for existing drugs. Nature Reviews Drug Discovery 3, 8 (2004), 673.

[2] Antoine Bordes, Nicolas Usunier, Alberto García-Durán, Jason Weston, and Oksana Yakhnenko. 2013. Translating Embeddings for Modeling Multi-relational Data. In NIPS. 2787–2795.

[3] Joanne Bowes, Andrew J Brown, Jacques Hamon, Wolfgang Jarolimek, Arun Sridhar, Gareth Waldron, and Steven Whitebread. 2012. Reducing safety-related drug attrition: the use of in vitro pharmacological profiling. Nature Reviews Drug Discovery 11, 12 (2012), 909.

[4] Anne Corbett, James Pickett, Alistair Burns, Jonathan Corcoran, Stephen B Dunnett, Paul Edison, Jim J Hagan, Clive Holmes, Emma Jones, Cornelius Katona, et al. 2012. Drug repositioning for Alzheimer’s disease. Nature Reviews Drug Discovery 11, 11 (2012), 833.

[5] Michael Dickson and Jean Paul Gagnon. 2009. The cost of new drug discovery and development. Discovery Medicine 4, 22 (2009), 172–179.

[6] Jürgen Drews. 2000. Drug Discovery: A Historical Perspective. Science 287, 5460 (2000), 1960–1964. https://doi.org/10.1126/science.287.5460.1960

[7] Michel Dumontier, Alison Callahan, Jose Cruz-Toledo, Peter Ansell, Vincent Emonet, François Belleau, and Arnaud Droit. 2014. Bio2RDF Release 3: A larger, more connected network of Linked Data for the Life Sciences. In Proceedings of the ISWC 2014 Posters & Demonstrations Track, a track within the 13th International Semantic Web Conference, ISWC 2014, Riva del Garda, Italy, October 21, 2014. 401–404.

[8] David A. Ferrucci, Eric W. Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya Kalyanpur, Adam Lally, J. William Murdock, Eric Nyberg, John M. Prager, Nico Schlaefer, and Christopher A. Welty. 2010. Building Watson: An Overview of the DeepQA Project. AI Magazine 31, 3 (2010), 59–79.

[9] Matt Gardner and Tom M. Mitchell. 2015. Efficient and Expressive Knowledge Base Completion Using Subgraph Feature Extraction. In EMNLP. The Association for Computational Linguistics, 1488–1498.

[10] Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In AISTATS (JMLR Proceedings), Vol. 9. JMLR.org, 249–256.

[11] Ming Hao, Stephen H Bryant, and Yanli Wang. 2017. Predicting drug-target interactions by dual-network integrated logistic matrix factorization. Scientific Reports 7 (2017), 40376.

[12] Minoru Kanehisa, Miho Furumichi, Mao Tanabe, Yoko Sato, and Kanae Morishima. 2017. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research 45, D1 (2017), D353–D361.

[13] Minoru Kanehisa, Yoko Sato, Masayuki Kawashima, Miho Furumichi, and Mao Tanabe. 2016. KEGG as a reference resource for gene and protein annotation. Nucleic Acids Research 44, D1 (2016), D457–D462.

[14] Ni Lao and William W. Cohen. 2010. Relational retrieval using a combination of path-constrained random walks. Machine Learning 81, 1 (2010), 53–67.

[15] Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick van Kleef, Sören Auer, and Chris Bizer. 2014. DBpedia - A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia. Semantic Web Journal (2014).

[16] Linxin Li, Olivia C Geraghty, Ziyah Mehta, Peter M Rothwell, and Oxford Vascular Study. 2017. Age-specific risks, severity, time course, and outcome of bleeding on long-term antiplatelet treatment after vascular events: a population-based cohort study. The Lancet 390, 10093 (2017), 490–499.

[17] Hui Liu, Jianjiang Sun, Jihong Guan, Jie Zheng, and Shuigeng Zhou. 2015. Improving compound–protein interaction prediction by building up highly credible negative samples. Bioinformatics 31, 12 (2015), i221–i229.

[18] Jian-Ping Mei, Chee-Keong Kwoh, Peng Yang, Xiao-Li Li, and Jie Zheng. 2012. Drug–target interaction prediction by learning from local information and neighbors. Bioinformatics 29, 2 (2012), 238–245.

[19] George A. Miller. 1995. WordNet: A Lexical Database for English. Commun. ACM 38, 11 (Nov. 1995), 39–41. https://doi.org/10.1145/219717.219748

[20] Sameh K. Mohamed, Vít Novácek, and Pierre-Yves Vandenbussche. 2018. Knowledge base completion using distinct subgraph paths. In SAC. ACM, 1992–1999.

[21] Emir Muñoz, Vít Novácek, and Pierre-Yves Vandenbussche. 2016. Using Drug Similarities for Discovery of Possible Adverse Reactions. In AMIA 2016, American Medical Informatics Association Annual Symposium, Chicago, IL, USA, November 12-16, 2016. AMIA. http://knowledge.amia.org/amia-63300-1.3360278/t004-1.3364525/f004-1.3364526/2499657-1.3364713/2500122-1.3364708

[22] André CA Nascimento, Ricardo BC Prudêncio, and Ivan G Costa. 2016. A multiple kernel learning algorithm for drug-target interaction prediction. BMC Bioinformatics 17, 1 (2016), 46.

[23] Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. 2016. A review of relational machine learning for knowledge graphs. Proc. IEEE 104, 1 (2016), 11–33.

[24] Maximilian Nickel, Lorenzo Rosasco, and Tomaso A. Poggio. 2016. Holographic Embeddings of Knowledge Graphs. In AAAI. AAAI Press, 1955–1961.

[25] Rawan S Olayan, Haitham Ashoor, and Vladimir B Bajic. 2017. DDR: efficient computational method to predict drug–target interactions using graph mining and machine learning approaches. Bioinformatics 34, 7 (2017), 1164–1173.

[26] Richard Qian. 2013. Understand Your World with Bing. http://blogs.bing.com/search/2013/03/21/understand-your-world-with-bing/ Bing Blogs.

[27] Yan Qiao, Tingting Yang, Yong Gan, Wenzhen Li, Chao Wang, Yanhong Gong, and Zuxun Lu. 2018. Associations between aspirin use and the risk of cancers: a meta-analysis of observational studies. BMC Cancer 18, 1 (2018), 288.

[28] Sashank Reddi, Satyen Kale, and Sanjiv Kumar. 2018. On the Convergence of Adam and Beyond. In ICLR.

[29] Ayeshah A Rosdah, Jessica K. Holien, Lea MD Delbridge, Gregory J Dusting, and Shiang Y Lim. 2016. Mitochondrial fission–a drug target for cytoprotection or cytodestruction? Pharmacology Research & Perspectives 4, 3 (2016), e00235.

[30] Peter M Rothwell, F Gerald R Fowkes, Jill FF Belch, Hisao Ogawa, Charles P Warlow, and Tom W Meade. 2011. Effect of daily aspirin on long-term risk of death due to cancer: analysis of individual patient data from randomised trials. The Lancet 377, 9759 (2011), 31–41.

[31] Peter M Rothwell, Michelle Wilson, Carl-Eric Elwin, Bo Norrving, Ale Algra, Charles P Warlow, and Tom W Meade. 2010. Long-term effect of aspirin on colorectal cancer incidence and mortality: 20-year follow-up of five randomised trials. The Lancet 376, 9754 (2010), 1741–1750.

[32] Amit Singhal. 2012. Introducing the Knowledge Graph: things, not strings. https://googleblog.blogspot.ie/2012/05/introducing-knowledge-graph-things-not.html Google Official Blog.

[33] Lekha Sleno and Andrew Emili. 2008. Proteomic methods for drug target discovery. Current Opinion in Chemical Biology 12, 1 (2008), 46–54.

[34] Walter Sneader. 2005. Drug discovery: a history. John Wiley & Sons.

[35] Georg C Terstappen, Christina Schlüpen, Roberto Raggiaschi, and Giovanni Gaviraghi. 2007. Target deconvolution strategies in drug discovery. Nature Reviews Drug Discovery 6, 11 (2007), 891.

[36] Tim Berners-Lee, James Hendler, and Ora Lassila. 2001. The Semantic Web, A new form of Web content that is meaningful to computers will unleash a revolution of new possibilities. Scientific American: https://www.scientificamerican.com/article/the-semantic-web/. Retrieved: 2017-04-21.

[37] Théo Trouillon and Maximilian Nickel. 2017. Complex and Holographic Embeddings of Knowledge Graphs: A Comparison. CoRR abs/1707.01475 (2017).

[38] Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. 2016. Complex Embeddings for Simple Link Prediction. In ICML (JMLR Workshop and Conference Proceedings), Vol. 48. JMLR.org, 2071–2080.

[39] Twan van Laarhoven, Sander B Nabuurs, and Elena Marchiori. 2011. Gaussian interaction profile kernels for predicting drug–target interaction. Bioinformatics 27, 21 (2011), 3036–3043.

[40] David S Wishart, Craig Knox, An Chi Guo, Savita Shrivastava, Murtaza Hassanali, Paul Stothard, Zhan Chang, and Jennifer Woolsey. 2006. DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Research 34, suppl_1 (2006), D668–D672.

[41] Lei Xie, Li Xie, Sarah L Kinnings, and Philip E Bourne. 2012. Novel computational approaches to polypharmacology as a means to define responses to individual drugs. Annual Review of Pharmacology and Toxicology 52 (2012), 361–379.

[42] Yoshihiro Yamanishi, Michihiro Araki, Alex Gutteridge, Wataru Honda, and Minoru Kanehisa. 2008. Prediction of drug–target interaction networks from the integration of chemical and genomic spaces. Bioinformatics 24, 13 (2008), i232–i240.

[43] Yoshihiro Yamanishi, Masaaki Kotera, Minoru Kanehisa, and Susumu Goto. 2010. Drug-target interaction prediction from chemical, genomic and pharmacological data in an integrated framework. Bioinformatics 26, 12 (2010), i246–i254.

[44] Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. 2015. Embedding Entities and Relations for Learning and Inference in Knowledge Bases. In ICLR.

Metapath-aggregated heterogeneous graph neural network for drug–target interaction prediction

Article

Jan 2023

Drug–target interaction (DTI) prediction is an essential step in drug repositioning. A few graph neural network (GNN)-based methods have been proposed for DTI prediction using heterogeneous biological data. However, existing GNN-based methods only aggregate information from directly connected nodes restricted in a drug-related or a target-related network and are incapable of capturing high-order dependencies in the biological heterogeneous graph. In this paper, we propose a metapath-aggregated heterogeneous graph neural network (MHGNN) to capture complex structures and rich semantics in the biological heterogeneous graph for DTI prediction. Specifically, MHGNN enhances heterogeneous graph structure learning and high-order semantics learning by modeling high-order relations via metapaths. Additionally, MHGNN enriches high-order correlations between drug-target pairs (DTPs) by constructing a DTP correlation graph with DTPs as nodes. We conduct extensive experiments on three biological heterogeneous datasets. MHGNN favorably surpasses 17 state-of-the-art methods over 6 evaluation metrics, which verifies its efficacy for DTI prediction. The code is available at https://github.com/Zora-LM/MHGNN-DTI.

Veni, Vidi, Vici: Solving the Myriad of Challenges before Knowledge Graph Learning

Conference Paper

Full-text available

Feb 2024

Drug-CoV: a drug-origin knowledge graph discovering drug repurposing targeting COVID-19

Article

Full-text available

Jul 2023
KNOWL INF SYST

Drug repurposing is a technique for probing new usages of existing medicines, but its traditional methods, such as computational approaches, can be time-consuming and laborious. Recently, knowledge graphs (KGs) have emerged as a powerful approach for graph-based representation in drug repurposing, encoding entities and relations to predict new connections and facilitate drug discovery. As COVID-19 has become a major public health concern, it is critical to establish an appropriate COVID-19 KG for drug repurposing to combat the spread of the virus. However, most publicly available COVID-19 KGs lack support for multi-relations and comprehensive entity types. Moreover, none of them originates from COVID-19-related drugs, making it challenging to identify effective treatments. To tackle these issues, we developed Drug-CoV, a drug-origin and multi-relational COVID-19 KG. We evaluated the quality of Drug-CoV by performing link prediction and comparing the results to another publicly available COVID-19 KG. Our results showed that Drug-CoV outperformed the comparing KG in predicting new links between entities. Overall, Drug-CoV represents a valuable resource for COVID-19 drug repurposing efforts and demonstrates the potential of KGs for facilitating drug discovery.

CKG-IMC: An inductive matrix completion method enhanced by CKG and GNN for Alzheimer’s disease compound-protein interactions prediction

Article

May 2024
COMPUT BIOL MED

Zero-Shot Construction of Chinese Medical Knowledge Graph with GPT-3.5-turbo and GPT-4

Article

Apr 2024

Knowledge graphs have revolutionized the organization and retrieval of real-world knowledge, prompting interest in automatic NLP-based approaches for extracting medical knowledge from texts. However, the availability of high-quality Chinese medical knowledge remains limited, posing challenges for constructing Chinese medical knowledge graphs. As LLMs like ChatGPT show promise in zero-shot learning for many NLP downstream tasks, their potential on constructing Chinese medical knowledge graphs is still uncertain. In this study, we create a Chinese medical knowledge graph by manually annotating textual data and using ChatGPT to automatically generate the graph. We refine the results using filtering and mapping rules to align with our schema. The manually generated graph serves as the ground truth for evaluation, and we explore different methods to enhance its accuracy through knowledge graph completion techniques. As a result, we emphasize the potential of employing ChatGPT for automated knowledge graph construction within the Chinese medical domain. While ChatGPT successfully identifies a larger number of entities, further enhancements are required to improve its performance in extracting more qualified relations.

Survey on Recommender Systems for Biomedical Items in Life and Health Sciences

Article

Jan 2024

The generation of biomedical data is of such a magnitude that its retrieval and analysis have posed several challenges. A survey of recommender system (RS) approaches in biomedical fields is provided in this analysis, along with a discussion of existing challenges related to large-scale biomedical information retrieval systems. We collect original studies, identify entities, models, and how knowledge graphs (KG) can improve results. As a result, most of the papers used model-based collaborative filtering algorithms, most of the available datasets did not follow the standard format < user, item, rating >, and regarding qualitative evaluations of RSs use mainly classification metrics. Finally, we have assembled and coded a unique dataset of 60 papers — Sur-RS4BioT, available for download at DOI:10.34740/kaggle/ds/2346894

Start Small, Think Big: On Hyperparameter Optimization for Large-Scale Knowledge Graph Embeddings

Chapter

Mar 2023

Knowledge graph embedding (KGE) models are an effective and popular approach to represent and reason with multi-relational data. Prior studies have shown that KGE models are sensitive to hyperparameter settings, however, and that suitable choices are dataset-dependent. In this paper, we explore hyperparameter optimization (HPO) for very large knowledge graphs, where the cost of evaluating individual hyperparameter configurations is excessive. Prior studies often avoided this cost by using various heuristics; e.g., by training on a subgraph or by using fewer epochs. We systematically discuss and evaluate the quality and cost savings of such heuristics and other low-cost approximation techniques. Based on our findings, we introduce GraSH, an efficient multi-fidelity HPO algorithm for large-scale KGEs that combines both graph and epoch reduction techniques and runs in multiple rounds of increasing fidelities. We conducted an experimental study and found that GraSH obtains state-of-the-art results on large graphs at a low cost (three complete training runs in total). Source code and auxiliary material at https://github.com/uma-pi1/GraSH. KeywordsKnowledge graph embeddingMulti-fidelity hyperparameter optimizationLow-fidelity approximation

Emerging Machine Learning Techniques in Predicting Adverse Drug Reactions

Chapter

Feb 2023

Adverse drug reactions (ADRs) are one of the major drug-related failures in pharmacological research and a significant threat to patient health. Machine learning models have been developed to characterize, predict and prevent ADRs. However, it is a challenge for the models to effectively extract features and make predictions based on multiple sources of heterogeneous and complex data. In this chapter, different types of drug-related features and emerging machine learning models, including deep learning and graph-based models, as potential solutions to address this challenge were reviewed. As more data become available, it will become more feasible to make use of the complex data and emerging technologies to develop more accurate models to identify ADRs and protect patients from ADRs.

Building a knowledge graph to enable precision medicine

Article

Full-text available

Feb 2023

Developing personalized diagnostic strategies and targeted treatments requires a deep understanding of disease biology and the ability to dissect the relationship between molecular and genetic factors and their phenotypic consequences. However, such knowledge is fragmented across publications, non-standardized repositories, and evolving ontologies describing various scales of biological organization between genotypes and clinical phenotypes. Here, we present PrimeKG, a multimodal knowledge graph for precision medicine analyses. PrimeKG integrates 20 high-quality resources to describe 17,080 diseases with 4,050,249 relationships representing ten major biological scales, including disease-associated protein perturbations, biological processes and pathways, anatomical and phenotypic scales, and the entire range of approved drugs with their therapeutic action, considerably expanding previous efforts in disease-rooted knowledge graphs. PrimeKG contains an abundance of ‘indications’, ‘contradictions’, and ‘off-label use’ drug-disease edges that lack in other knowledge graphs and can support AI analyses of how drugs affect disease-associated networks. We supplement PrimeKG’s graph structure with language descriptions of clinical guidelines to enable multimodal analyses and provide instructions for continual updates of PrimeKG as new data become available.

Target identification and validation

Chapter

Jan 2021

Knowledge base completion using distinct subgraph paths

Conference Paper

Full-text available

Apr 2018

Graph feature models facilitate efficient and interpretable predictions of missing links in knowledge bases with network structure (i.e. knowledge graphs). However, existing graph feature models---e.g. Subgraph Feature Extractor (SFE) or its predecessor, Path Ranking Algorithm (PRA) and its variants---depend on a limited set of graph features, connecting paths. This type of features may be missing for many interesting potential links, though, and the existing techniques cannot provide any predictions at all then. In this paper, we address the limitations of existing works by introducing a new graph-based feature model - Distinct Subgraph Paths (DSP). Our model uses a richer set of graph features and therefore can predict new relevant facts that neither SFE, nor PRA or its variants can discover by principle. We use a standard benchmark data set to show that DSP model performs better than the state-of-the-art - SFE (ANYREL) and PRA - in terms of mean average precision (MAP), mean reciprocal rank (MRR) and [email protected], 10, 20, with no extra computational cost incurred.

Associations between aspirin use and the risk of cancers: A meta-analysis of observational studies

Article

Full-text available

Mar 2018
BMC CANCER

Background: Epidemiological studies have clarified the potential associations between regular aspirin use and cancers. However, it remains controversial on whether aspirin use decreases the risk of cancers risks. Therefore, we conducted an updated meta-analysis to assess the associations between aspirin use and cancers. Methods: The PubMed, Embase, and Web of Science databases were systematically searched up to March 2017 to identify relevant studies. Relative risks (RRs) with 95% confidence intervals (CIs) were used to assess the strength of associations. Results: A total of 218 studies with 309 reports were eligible for this meta-analysis. Aspirin use was associated with a significant decrease in the risk of overall cancer (RR = 0.89, 95% CI: 0.87-0.91), and gastric (RR = 0.75, 95% CI: 0.65-0.86), esophageal (RR = 0.75, 95% CI: 0.62-0.89), colorectal (RR = 0.79, 95% CI: 0.74-0.85), pancreatic (RR = 0.80, 95% CI: 0.68-0.93), ovarian (RR = 0.89, 95% CI: 0.83-0.95), endometrial (RR = 0.92, 95% CI: 0.85-0.99), breast (RR = 0.92, 95% CI: 0.88-0.96), and prostate (RR = 0.94, 95% CI: 0.90-0.99) cancers, as well as small intestine neuroendocrine tumors (RR = 0.17, 95% CI: 0.05-0.58). Conclusions: These findings suggest that aspirin use is associated with a reduced risk of gastric, esophageal, colorectal, pancreatic, ovarian, endometrial, breast, and prostate cancers, and small intestine neuroendocrine tumors.

DDR: Efficient computational method to predict drug-Target interactions using graph mining and machine learning approaches

Article

Full-text available

Nov 2017
BIOINFORMATICS

Motivation: Finding computationally drug-target interactions (DTIs) is a convenient strategy to identify new DTIs at low cost with reasonable accuracy. However, the current DTI prediction methods suffer the high false positive prediction rate. Results: We developed DDR, a novel method that improves the DTI prediction accuracy. DDR is based on the use of a heterogeneous graph that contains known DTIs with multiple similarities between drugs and multiple similarities between target proteins. DDR applies non-linear similarity fusion method to combine different similarities. Before fusion, DDR performs a pre-processing step where a subset of similarities is selected in a heuristic process to obtain an optimized combination of similarities. Then, DDR applies a random forest model using different graph-based features extracted from the DTI heterogeneous graph. Using five repeats of 10-fold cross-validation, three testing setups, and the weighted average of area under the precision-recall curve (AUPR) scores, we show that DDR significantly reduces the AUPR score error relative to the next best start-of-the-art method for predicting DTIs by 34% when the drugs are new, by 23% when targets are new, and by 34% when the drugs and the targets are known but not all DTIs between them are not known. Using independent sources of evidence, we verify as correct 22 out of the top 25 DDR novel predictions. This suggests that DDR can be used as an efficient method to identify correct DTIs. Availability: The data and code are provided at https://bitbucket.org/RSO24/ddr/. Contact: vladimir.bajic@kaust.edu.sa. Supplementary information: Supplementary data are available at Bioinformatics online.

Age-specific risks, severity, time course, and outcome of bleeding on long-term antiplatelet treatment after vascular events: A population-based cohort study

Article

Jun 2017
LANCET

Background: Lifelong antiplatelet treatment is recommended after ischaemic vascular events, on the basis of trials done mainly in patients younger than 75 years. Upper gastrointestinal bleeding is a serious complication, but had low case fatality in trials of aspirin and is not generally thought to cause long-term disability. Consequently, although co-prescription of proton-pump inhibitors (PPIs) reduces upper gastrointestinal bleeds by 70-90%, uptake is low and guidelines are conflicting. We aimed to assess the risk, time course, and outcomes of bleeding on antiplatelet treatment for secondary prevention in patients of all ages. Methods: We did a prospective population-based cohort study in patients with a first transient ischaemic attack, ischaemic stroke, or myocardial infarction treated with antiplatelet drugs (mainly aspirin based, without routine PPI use) after the event in the Oxford Vascular Study from 2002 to 2012, with follow-up until 2013. We determined type, severity, outcome (disability or death), and time course of bleeding requiring medical attention by face-to-face follow-up for 10 years. We estimated age-specific numbers needed to treat (NNT) to prevent upper gastrointestinal bleeding with routine PPI co-prescription on the basis of Kaplan-Meier risk estimates and relative risk reduction estimates from previous trials. Findings: 3166 patients (1582 [50%] aged ≥75 years) had 405 first bleeding events (n=218 gastrointestinal, n=45 intracranial, and n=142 other) during 13 509 patient-years of follow-up. Of the 314 patients (78%) with bleeds admitted to hospital, 117 (37%) were missed by administrative coding. Risk of non-major bleeding was unrelated to age, but major bleeding increased steeply with age (≥75 years hazard ratio [HR] 3·10, 95% CI 2·27-4·24; p<0·0001), particularly for fatal bleeds (5·53, 2·65-11·54; p<0·0001), and was sustained during long-term follow-up. 
The same was true of major upper gastrointestinal bleeds (≥75 years HR 4·13, 2·60-6·57; p<0·0001), particularly if disabling or fatal (10·26, 4·37-24·13; p<0·0001). At age 75 years or older, major upper gastrointestinal bleeds were mostly disabling or fatal (45 [62%] of 73 patients vs 101 [47%] of 213 patients with recurrent ischaemic stroke), and outnumbered disabling or fatal intracerebral haemorrhage (n=45 vs n=18), with an absolute risk of 9·15 (95% CI 6·67-12·24) per 1000 patient-years. The estimated NNT for routine PPI use to prevent one disabling or fatal upper gastrointestinal bleed over 5 years fell from 338 for individuals younger than 65 years, to 25 for individuals aged 85 years or older. Interpretation: In patients receiving aspirin-based antiplatelet treatment without routine PPI use, the long-term risk of major bleeding is higher and more sustained in older patients in practice than in the younger patients in previous trials, with a substantial risk of disabling or fatal upper gastrointestinal bleeding. Given that half of the major bleeds in patients aged 75 years or older were upper gastrointestinal, the estimated NNT for routine PPI use to prevent such bleeds is low, and co-prescription should be encouraged. Funding: Wellcome Trust, Wolfson Foundation, British Heart Foundation, Dunhill Medical Trust, National Institute of Health Research (NIHR), and the NIHR Oxford Biomedical Research Centre.
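The abstract above reports NNT estimates derived from Kaplan-Meier risk estimates and relative risk reductions from earlier trials, but does not spell out the arithmetic. The sketch below uses the standard textbook relationship (NNT is the reciprocal of the absolute risk reduction), which is an assumption about the calculation, not the study's actual procedure. The function name and all input values are hypothetical and chosen only for illustration.

```python
def number_needed_to_treat(baseline_risk, relative_risk_reduction):
    """Return NNT = 1 / (baseline_risk * relative_risk_reduction).

    baseline_risk: cumulative risk of the event over the period of interest
        without treatment (a proportion in (0, 1]).
    relative_risk_reduction: proportional reduction in that risk attributed
        to the treatment (a proportion in (0, 1]).
    """
    if not 0 < baseline_risk <= 1:
        raise ValueError("baseline_risk must be in (0, 1]")
    if not 0 < relative_risk_reduction <= 1:
        raise ValueError("relative_risk_reduction must be in (0, 1]")
    absolute_risk_reduction = baseline_risk * relative_risk_reduction
    return 1.0 / absolute_risk_reduction


# Hypothetical inputs, not taken from the study: the same relative risk
# reduction yields a much smaller NNT when the baseline risk is higher.
nnt_low_risk = number_needed_to_treat(baseline_risk=0.02, relative_risk_reduction=0.5)
nnt_high_risk = number_needed_to_treat(baseline_risk=0.10, relative_risk_reduction=0.5)
print(nnt_low_risk, nnt_high_risk)
```

This illustrates why an age-related rise in baseline bleeding risk lowers the NNT even if the relative benefit of treatment stays constant.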

Article

Feb 2017

We propose a new computational method for discovery of possible adverse drug reactions. The method consists of two key steps. First, we use openly available resources to semi-automatically compile a consolidated data set describing drugs and their features (e.g., chemical structure, related targets, indications or known adverse reactions). The data set is represented as a graph, which allows for definition of graph-based similarity metrics. The metrics can then be used for propagating known adverse reactions between similar drugs, which leads to weighted (i.e., ranked) predictions of previously unknown links between drugs and their possible side effects. We implemented the proposed method in the form of a software prototype and evaluated our approach by discarding known drug-side effect links from our data and checking whether our prototype is able to re-discover them. As this is an evaluation methodology used by several recent state-of-the-art approaches, we could compare our results with them. Our approach scored best in all widely used metrics like precision, recall or the ratio of relevant predictions present among the top ranked results. The improvement was as much as 125.79% over the next best approach. For instance, the F1 score was 0.5606 (66.35% better than the next best method). Most importantly, in 95.32% of cases, the top five results contain at least one, and typically three, correctly predicted side effects (36.05% better than the second best approach).
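The propagation idea in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name its similarity metrics, so Jaccard similarity over feature sets is used as a stand-in, and every drug name, feature and side effect below is hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two feature sets (assumed metric)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def propagate_side_effects(features, known_effects, drug):
    """Rank side effects not yet linked to `drug`.

    Each candidate effect is scored by summing the similarity between `drug`
    and every other drug already known to cause that effect.
    """
    already_known = known_effects.get(drug, set())
    scores = {}
    for other, effects in known_effects.items():
        if other == drug:
            continue
        similarity = jaccard(features[drug], features[other])
        if similarity == 0.0:
            continue  # dissimilar drugs contribute nothing
        for effect in effects:
            if effect not in already_known:
                scores[effect] = scores.get(effect, 0.0) + similarity
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy data: features could be targets, indications, and so on.
features = {
    "drug_a": {"target_1", "target_2", "indication_1"},
    "drug_b": {"target_1", "target_2"},
    "drug_c": {"indication_9"},
}
known_effects = {
    "drug_a": set(),
    "drug_b": {"nausea"},
    "drug_c": {"rash"},
}

ranking = propagate_side_effects(features, known_effects, "drug_a")
print(ranking)
```

Hiding known drug-side effect links and checking whether such a ranking recovers them corresponds to the evaluation protocol the abstract describes.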

Predicting drug-target interactions by dual-network integrated logistic matrix factorization

Article

Jan 2017

In this work, we propose a dual-network integrated logistic matrix factorization (DNILMF) algorithm to predict potential drug-target interactions (DTIs). The prediction procedure consists of four steps: (1) inferring new drug/target profiles and constructing the profile kernel matrix; (2) diffusing the drug profile kernel matrix with the drug structure kernel matrix; (3) diffusing the target profile kernel matrix with the target sequence kernel matrix; and (4) building the DNILMF model and smoothing new drug/target predictions based on their neighbors. We compare our algorithm with the state-of-the-art method on the benchmark dataset. Results indicate that the DNILMF algorithm outperforms the previously reported approaches in terms of AUPR (area under the precision-recall curve) and AUC (area under the receiver operating characteristic curve) based on 5 trials of 10-fold cross-validation. We conclude that the performance improvement depends not only on the proposed objective function, but also on the nonlinear diffusion technique used, which is important but understudied in the DTI prediction field. In addition, we also compile a new DTI dataset to increase the diversity of currently available benchmark datasets. The top prediction results for the new dataset are confirmed by experimental studies or supported by other computational research.

Drug Discovery: A History

Book

Apr 2005

Walter Sneader

On the convergence of Adam & Beyond

Conference Paper

May 2018

Complex and Holographic Embeddings of Knowledge Graphs: A Comparison

Article

Jul 2017

Embeddings of knowledge graphs have received significant attention due to their excellent performance for tasks like link prediction and entity resolution. In this short paper, we are providing a comparison of two state-of-the-art knowledge graph embeddings for which their equivalence has recently been established, i.e., ComplEx and HolE [Nickel, Rosasco, and Poggio, 2016; Trouillon et al., 2016; Hayashi and Shimbo, 2017]. First, we briefly review both models and discuss how their scoring functions are equivalent. We then analyze the discrepancy of results reported in the original articles, and show experimentally that they are likely due to the use of different loss functions. In further experiments, we evaluate the ability of both models to embed symmetric and antisymmetric patterns. Finally, we discuss advantages and disadvantages of both models and under which conditions one would be preferable to the other.

Complex Embeddings for Simple Link Prediction

Conference Paper

Oct 2016

In statistical relational learning, the link prediction problem is key to automatically understand the structure of large knowledge bases. As in previous studies, we propose to solve this problem through latent factorization. However, here we make use of complex valued embeddings. The composition of complex embeddings can handle a large variety of binary relations, among them symmetric and antisymmetric relations. Compared to state-of-the-art models such as Neural Tensor Network and Holographic Embeddings, our approach based on complex embeddings is arguably simpler, as it only uses the Hermitian dot product, the complex counterpart of the standard dot product between real vectors. Our approach is scalable to large datasets as it remains linear in both space and time, while consistently outperforming alternative approaches on standard link prediction benchmarks.
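The scoring function summarised in this abstract can be illustrated with a short sketch. ComplEx scores a triple as the real part of the trilinear product of the relation embedding, the subject embedding and the complex conjugate of the object embedding. The code below uses Python's built-in complex numbers; the embedding dimension, the random values and all variable names are illustrative assumptions, and no training is performed.

```python
import random


def complex_score(e_s, w_r, e_o):
    """ComplEx score: Re(sum_k w_r[k] * e_s[k] * conj(e_o[k]))."""
    return sum(w * s * o.conjugate() for w, s, o in zip(w_r, e_s, e_o)).real


def random_complex_vector(rng, dim):
    """Random complex embedding, used here only as placeholder data."""
    return [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(dim)]


rng = random.Random(0)
DIM = 4  # illustrative embedding size

e_a = random_complex_vector(rng, DIM)  # hypothetical entity embeddings
e_b = random_complex_vector(rng, DIM)

# A relation embedding with zero imaginary part scores (a, r, b) and
# (b, r, a) identically, so it models a symmetric relation.
w_symmetric = [complex(rng.gauss(0, 1), 0.0) for _ in range(DIM)]

# A relation embedding with zero real part flips the sign of the score when
# subject and object are swapped, so it models an antisymmetric relation.
w_antisymmetric = [complex(0.0, rng.gauss(0, 1)) for _ in range(DIM)]

sym_forward = complex_score(e_a, w_symmetric, e_b)
sym_backward = complex_score(e_b, w_symmetric, e_a)
anti_forward = complex_score(e_a, w_antisymmetric, e_b)
anti_backward = complex_score(e_b, w_antisymmetric, e_a)

print(sym_forward, sym_backward)
print(anti_forward, anti_backward)
```

The two special cases show how a single scoring function can represent both symmetric and antisymmetric relations, depending on whether the relation embedding is real or imaginary.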
