RESEARCH ARTICLE - Open Access
Semantic biomedical resource discovery:
a Natural Language Processing framework
Pepi Sfakianaki^1, Lefteris Koumakis^1*, Stelios Sfakianakis^1, Galatia Iatraki^1, Giorgos Zacharioudakis^1, Norbert Graf^3, Kostas Marias^1 and Manolis Tsiknakis^1,2
Abstract
Background: A plethora of publicly available biomedical resources currently exist, and their number is increasing at a fast rate. In parallel, specialized repositories are being developed, indexing numerous clinical and biomedical tools. The main drawback of such repositories is the difficulty of locating appropriate resources for a clinical or biomedical decision task, especially for users who are not Information Technology experts. Moreover, although NLP research in the clinical domain has been active since the 1960s, progress in the development of clinical NLP applications has been slow and lags behind progress in the general NLP domain.
The aim of the present study is to investigate the use of semantics for annotating biomedical resources with domain-specific ontologies, and to exploit Natural Language Processing methods to empower non-Information Technology expert users to efficiently search for biomedical resources using natural language.
Methods: A Natural Language Processing engine which can "translate" free text into targeted queries, automatically transforming a clinical research question into a request description that contains only ontology terms, has been implemented. The implementation is based on information extraction techniques for natural language text, guided by integrated ontologies. Furthermore, knowledge from robust text mining methods has been incorporated to map descriptions onto suitable domain ontologies, in order to ensure that the biomedical resource descriptions are domain oriented and to enhance the accuracy of service discovery. The framework is freely available as a web application at http://calchas.ics.forth.gr/.
Results: For our experiments, a range of clinical questions was established based on descriptions of clinical trials from the ClinicalTrials.gov registry as well as recommendations from clinicians. Domain experts manually identified the tools in a tools repository that are suitable for addressing the clinical questions at hand, either individually or as a set of tools forming a computational pipeline. The results were compared with those obtained from an automated discovery of candidate biomedical tools. For the evaluation of the results, precision and recall measurements were used. Our results indicate that the proposed framework has high precision and low recall, implying that the returned results are predominantly relevant.
Conclusions: The biomedical ontologies already available, the existing NLP tools and the quality of current biomedical annotation systems are adequate for the implementation of a biomedical resource discovery framework based on the semantic annotation of resources and the use of NLP techniques. The results of the present study demonstrate the clinical utility of the proposed framework, which aims to bridge the gap between clinical questions expressed in natural language and efficient, dynamic biomedical resource discovery.
Keywords: Semantic resource annotation, Natural language processing, Resource discovery, Biomedical text
annotation, Information extraction, Text mining, Biomedical informatics, Search engine, Natural language interface
* Correspondence: koumakis@ics.forth.gr
1 Foundation for Research and Technology Hellas (FORTH), Institute of Computer Science, N. Plastira 100, Vassilika Vouton, Heraklion, Crete, Greece
Full list of author information is available at the end of the article
© 2015 Sfakianaki et al. Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and
reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to
the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Sfakianaki et al. BMC Medical Informatics and Decision Making (2015) 15:77
DOI 10.1186/s12911-015-0200-4
Background
A plethora of publicly available biomedical resources (data, tools, services, models and computational workflows) currently exist, and their number is constantly increasing at a fast rate. This explosion of biomedical resources makes it hard for biomedical researchers to efficiently discover the resources appropriate for their clinical tasks. It is extremely difficult to locate the necessary resources [1], especially for non-Information Technology (IT) expert users, because most of the available tools are commonly described via narrative web pages containing information about their operations in natural language, or are annotated with relevant technical details which are not easily interpreted by lay users. These descriptions contain plain text with no machine-interpretable structure and therefore cannot be used to automatically process the descriptive information about a resource. An indicative resource and its description is shown in Table 1.
Furthermore, clinical users prefer to formulate their queries quickly using natural language, which is the most user-friendly and expressive way [2]. As a result, discovery of the appropriate tools and computational models needed to support a given clinical decision making task has been, and remains, a major problem for non-expert users. Because the range of accessible resources has expanded considerably in recent years and a significant number of new resource repositories have been developed, it has become more and more difficult for clinicians and researchers to locate the most appropriate resource for the realization of their tasks.
On the other hand, bioinformaticians and tool devel-
opers rely to a greater extent on ontologies to annotate
their systems and publish them in specialized repositories,
such as Taverna [3], myExperiment [4], BioCatalogue [5],
SEQanswers [6], EMBRACE [7], Bioconductor [8], and
ORBIT [9]. Such repositories make software components
easier to locate and use when they are described and
searched via rich metadata terms but act as independent
silos devoted to specific domains and are unable to pro-
vide end to end solutions to daily routine clinical
questions. An indicative example is SEQanswers [6], where a user can find an abundance of tools which, however, are restricted to sequencing. In such repositories, the main impediments that a clinician faces are: (i) the need to search serially, or with exact keyword matching, in repositories containing thousands of tools; (ii) the substantial information technology (IT) knowledge required to understand a tool's purpose and way of use; (iii) the time-consuming search across several, or all, of the publicly available repositories; and (iv) the uncertainty regarding the appropriateness of a retrieved tool for the clinical decision task at hand [10].
In most cases clinical users come up with long and complex questions in the context of their hypothetico-deductive model of clinical reasoning [11]. Equally important is the fact that clinical users are not prepared, on average, to allocate more than 2 minutes to discovering appropriate tools, and usually give up if the inquiry is time consuming [12]. Furthermore, the appropriateness of the results obtained often depends on the user's IT expertise.
The use of queries expressed in natural language
can, it is believed, overcome these hurdles [13], yet
computers are good at processing structured data but
much less effective in handling natural language that
is inherently unstructured. The field of Natural Language
Processing (NLP) [14] aims to narrow this gap, as it
focuses on how machines can understand and manage
natural language text to execute useful tasks for end users.
A survey on biomedical text annotation tools was
performed taking into account Named Entity Recogni-
tion (NER) tools that can identify biomedical categories,
like gene and protein names, as well as Ontology-Based
Information Extraction (OBIE) tools. Several approaches
and tools were evaluated, including ABNER [15], GATE
[16], UIMA [17], NCBO BioPortal [18], MetaMap [10],
AutoMeta, KIM, ONTEA [19], and finally SOBA and iDo-
cument [20] which do not support annotation with mul-
tiple ontologies or clinical text at all.
MetaMap is worthy of note as a state-of-the-art tool
and the de-facto standard for biomedical annotation.
Table 1 An example of a resource and its description
Name: GeneTalk
Summary: GeneTalk, a web-based platform, that can filter, reduce and prioritize human sequence variants from NGS data and assist in the time consuming and costly interpretation of personal variants in clinical context. It serves as an expert exchange platform for clinicians and scientists who are searching for information about specific sequence variants and connects them to share and exchange expertise on variants that are potentially disease-relevant.
Tags (Principal bioinformatics methods): Genetic variation annotation, Sequence variation analysis, Variant Calling, Structural variation discovery, Filtering, Annotation, Database, Exome analysis, Sequence analysis, Variant Classification, Viewer
Link: http://seqanswers.com/wiki/GeneTalk
Input (format): VCF
Output (format): VCF, XLS, XLSX
Category: Sequence Analysis
This tool maps text to the Unified Medical Language System (UMLS) [21] Metathesaurus concepts. MetaMap can identify 98.19 % of the biomedical concepts in a text, including 78.79 % of the concepts that could not be identified manually [22]. The tool's inefficiencies are mainly due to missing entries in UMLS; furthermore, relationships between concepts, multi-word concept entries, words with punctuation and spelling mistakes in the text are not recognized and dealt with. Therefore, minor orthographic or syntactic errors in a sentence cannot be detected. In addition, MetaMap supports only concept recognition, and only for specific ontologies. On the other hand, cTAKES, an Apache open source NLP system, implements rule-based and machine learning methods. The tool exhibits reasonable performance which was nevertheless inferior to that achieved by MetaMap [23].
The purpose of the study
Given the complexities mentioned above, the aim of
the present study is to i) investigate the use of seman-
tics for the annotation of biomedical resources with do-
main specific ontologies and ii) exploit NLP methods in
empowering the non-IT expert users to efficiently
search for biomedical resources using natural language.
Our specific focus is to capitalize on existing research
results and extend these with the objective of providing to
users, especially physicians, the opportunity to represent
their queries in natural language and to dynamically dis-
cover and retrieve suitable candidate computational re-
sources, with the aid of information extraction algorithms
guided by specific domain ontologies.
In achieving the stated objectives, we introduce a semantic biomedical resource discovery framework based on NLP. A high level architecture of the developed framework is shown in Fig. 1. The clinician can import a research question in natural language (English) through a web interface. Then, the interpreter receives the clinical question as input and parses the text using NLP techniques guided by the existing domain ontologies [24]. The objective at this step is to infer the question's meaning by locating ontological terms important in the clinical domain of interest. The results of this step are then matched to a set of predefined patterns that produce a low level query to the repository of biomedical tools and other resources. When this query is executed, the repository returns the list of tools or custom pipelines that possibly answer the initial question of the user.
In developing and evaluating our semantic biomedical
resource discovery framework we have specifically fo-
cused on the following questions:
Do existing biomedical ontologies, as well as current
NLP tools, suffice for the creation of a domain-specific
annotation system?
Fig. 1 The architecture of the framework: 1) tool registration, 2) tool annotation, 3) the user's question in natural language and NLP processing, 4) forming and sending the query, and 5) retrieving the results (related tools)
Are existing biomedical annotation systems acceptable
and satisfactory for the Semantic Annotation of
biomedical concepts guided by ontologies?
Can an interpreter translate natural language clinical
questions into targeted queries using patterns and
ontology terms?
In the following sections we describe details of the
framework design and implementation, provide evalu-
ation details and results, and conclude with a discussion
and future work.
Methods
The proposed framework was designed and implemented within the European Commission project p-medicine [25] as the project's workbench, an end-user application that is effectively a repository of tools for use by clinicians. It also follows exploratory work that has taken place in the context of the ContraCancrum EC funded project [26]. The objective of the workbench is to boost the communication and collaboration of researchers in Europe through the machine-assisted sharing of expertise.
In more detail, the proposed framework initially performs an NLP processing step. The user's clinical question is split into tokens, i.e. the words and punctuation that constitute the sentence; the tokens are lemmatized, i.e. the words are mapped to their roots (lemmas), and finally each token is matched to a specific Part of Speech (POS) of the English grammar. During this pre-processing task, NER, the task that organizes text elements into predefined categories, is also performed for each token and assigns the token to a specified category (e.g. a gene symbol). For the NER process we used the clinical, biomedical and pharmaceutical semantic types used in MetaMap (Additional file 1: Table S5). Subsequently the system communicates with the Concept Recognizer (lower part of Fig. 1) and extracts ontology terms, concepts and semantic types. Using the ontology terms and their semantic types, the system identifies predefined patterns and delivers focused queries to the repository of resources for efficient discovery [27]. The identified tools/services are then presented to the end user.
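For illustration, the sketch below shows the kind of pre-processing pipeline described above (tokenization, POS tagging, lemmatization and a simple category lookup standing in for NER). It uses the NLTK toolkit rather than the Stanford CoreNLP suite actually employed by the framework, and the NER_CATEGORIES gazetteer is a hypothetical stand-in for the MetaMap semantic types.

```python
# A minimal sketch of the pre-processing step, assuming NLTK instead of the
# Stanford CoreNLP toolkit used by the framework.
# Requires: nltk.download('punkt'), nltk.download('averaged_perceptron_tagger'),
#           nltk.download('wordnet')
import nltk
from nltk.stem import WordNetLemmatizer

# Hypothetical mini-gazetteer standing in for the MetaMap semantic types;
# the real system obtains these categories from MetaMap.
NER_CATEGORIES = {"carboplatin": "Drug", "nephroblastoma": "Disease"}

def preprocess(question: str):
    """Tokenize, POS-tag, lemmatize and NER-tag a clinical question."""
    lemmatizer = WordNetLemmatizer()
    tokens = nltk.word_tokenize(question)
    tagged = nltk.pos_tag(tokens)                 # Part-of-Speech tagging
    result = []
    for token, pos in tagged:
        lemma = lemmatizer.lemmatize(token.lower())
        category = NER_CATEGORIES.get(lemma)      # crude NER lookup
        result.append({"token": token, "lemma": lemma,
                       "pos": pos, "ner": category})
    return result

if __name__ == "__main__":
    for t in preprocess("John has been treated with carboplatin."):
        print(t)
```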
The architecture of the framework, as shown in Fig. 1, integrates three main components: (i) the resource and metadata repositories, (ii) the semantic annotator (Concept Recognizer) and (iii) the intelligent engine (Interpreter) which interacts with the non-IT end user.
The following sections provide more elaborate details
on the implementation and functioning of all the sub-
components of the framework.
Concept Recognizer: the semantic annotator
The core of the system is the so-called Concept Recognizer, which is used by most of the components of the framework. The objective of this component is, given the free text formulation of the user's query, to extract the important parts, using n-grams [28], that refer to or designate known ontology terms, in order to match them to the patterns that express the physicians' needs. From these patterns a query is formed and forwarded to the tools repository, which finally returns the appropriate tools/services for the user's task.
The Concept Recognizer integrates two special domain ontologies: the EDAM ontology for the software domain and the UMLS biomedical ontologies for the biomedical domain. For both domains a specific concept recognizer has been implemented:
EDAM (originally from EMBRACE Data and
Methods) is an ontology for annotation of
bioinformatics tools, resources and data. Its
design principles are bioinformatics specific with
well-defined scope, relevant and usable for users
and annotators, and maintainable. EDAM is applicable to organizing and finding suitable tools or data, and to automating their integration into complex applications or workflows. EDAM has been
already successfully used in other systems like
BioXSD [29] and Bio-jETI [30]. The EDAM concept
recognizer was implemented using the Apache Solr
[31] full text search server. For each term in the
EDAM ontology a JSON-formatted file was created,
with specific fields that were subsequently imported
to Solr. The different fields give the ability to use
different weights at search time. The weight formula
that was used is biased to the id and the name of the
term. This implies that if the searched text matches
with the id or the name of a term, then this term is
assigned a better score than if the text matched to the
definition or the comment fields. The formula of the
custom made EDAM weight is:
id (10) + name (10) + synonym (6) + subset (3) + isa (3) + def (2) + comment (1)
The MetaMap concept recognizer identifies and
annotates medical terms based on terminological
resources included in UMLS Metathesaurus.
MetaMap integrates two different NLP servers; the
Semantic Knowledge Representation (SKR) server
which combines contextual information with lexical
information to improve the tagging accuracy and the
Word Sense Disambiguation (WSD) server which
involves the determination of the meaning and
understanding of words. The semantic types of the
UMLS and the WSD server are used to classify the
terms in certain categories in order to acquire a
specific meaning.
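To illustrate how the EDAM weight formula shown above can be applied at query time, the following sketch issues a weighted query to an Apache Solr server using eDisMax query-time field boosts. The host, port and core name ("edam") are assumptions made for the example; the field names and boosts mirror the weight formula.

```python
# A minimal sketch of a weighted EDAM term search against Solr.
# The deployment URL is an assumption; field boosts follow the weight formula.
import requests

SOLR_URL = "http://localhost:8983/solr/edam/select"  # assumed deployment

def search_edam(text: str, rows: int = 5):
    params = {
        "q": text,
        "defType": "edismax",
        # field^boost mirrors: id(10) + name(10) + synonym(6) + subset(3)
        #                      + isa(3) + def(2) + comment(1)
        "qf": "id^10 name^10 synonym^6 subset^3 isa^3 def^2 comment^1",
        "rows": rows,
        "wt": "json",
    }
    response = requests.get(SOLR_URL, params=params, timeout=10)
    response.raise_for_status()
    return response.json()["response"]["docs"]

# Example: search_edam("gene expression pathway analysis")
```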
The Concept Recognizer integrates the MetaMap and the EDAM concept recognizers, and also reports the semantic types of the matched terms, which provides a consistent categorization of all concepts present in the UMLS Metathesaurus or in the EDAM ontology. UMLS provides more than 100 semantic categories (http://mmtx.nlm.nih.gov/MMTx/semanticTypes.shtml); 58 of them, covering the clinical, medical, biomedical and pharmaceutical domains, were selected, combined and categorized into 23 semantic types that are more general and comprehensible for non-experts. Four more categories from the EDAM ontology were added. The full list of the selected categories is shown in Table 2.
When the end user posts a question, the concept recognizer applies NLP algorithms and invokes both the EDAM and the MetaMap concept recognizers. Subsequently, the results of the two concept recognizers are merged. It should be pointed out that some terms co-exist in both the EDAM ontology and in one or more additional UMLS ontologies. In such cases the concept recognizer merges all the proposed concepts and passes them back to the interpreter as a list of proposed concepts; the interpreter then decides which concept is kept, with emphasis on EDAM (software ontology) terms, because our objective is to identify tools and other software resources. The decision is based on the following rules: 1) if a concept co-exists in the "format of data" branch of EDAM and in the terms of one or more UMLS ontologies, the EDAM term is kept, and 2) if a concept co-exists in more than one ontology term, the term with the highest score is kept.
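A minimal sketch of these two merging rules, assuming concepts are represented as dictionaries with term, source, branch and score fields (the framework's internal representation may differ):

```python
# A sketch of the concept-merging rules; the dict layout is an assumption.
def merge_concepts(candidates):
    """Keep one concept per surface term, preferring EDAM data/format terms."""
    merged = {}
    for concept in candidates:
        key = concept["term"].lower()
        current = merged.get(key)
        if current is None:
            merged[key] = concept
            continue
        # Rule 1: an EDAM 'data'/'format' branch term wins over UMLS terms.
        if concept["source"] == "EDAM" and concept.get("branch") in ("data", "format"):
            merged[key] = concept
        elif current["source"] == "EDAM" and current.get("branch") in ("data", "format"):
            continue
        # Rule 2: otherwise the higher-scoring concept is kept.
        elif concept["score"] > current["score"]:
            merged[key] = concept
    return list(merged.values())
```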
Resource repository and ontologies
A repository of biomedical tools and services was employed that contains semantically annotated biomedical resource descriptions using the same ontologies as the Concept Recognizer. The tools repository of the p-medicine workbench is based on the PostgreSQL [32] database with full text search capabilities.
The repository currently stores information for 502 tools and services that were either developed by the project itself or extracted from different domain specific repositories or from the web, as follows:
- 195 sequence analysis tools and resources from SEQanswers [6],
- 35 biomedical tools from EMBRACE [7],
- 133 bioinformatics tools and resources from Bioconductor [8],
- 75 bioinformatics tools and workflows from myExperiment [4],
- 50 biomedical tools and 50 biology related tools found by searching the web.
The repository also includes a selected set of computational models, exposed as tools, that simulate disease evolution or response to treatment, such as [33] and [34]. The ontological concepts and semantic terms for the description of the tools are generated automatically using scripts that take as input the textual description of the tool (as shown in Table 1) and "feed" the Concept Recognizer, which consequently extracts ontology terms and corresponding semantic categories. Using these annotations, we seek to facilitate more intelligent search results that address what a user is actually looking for, rather than simply returning candidate tools following a keyword matching process.
The tools repository supports three different strategies for resource discovery: (i) full text, i.e. a tool's description is given in plain text; (ii) tags, i.e. user provided concepts and semantic types for the tools and their operations; and (iii) parameters, i.e. the inputs and outputs of a tool are specified. This approach implies that a clinical question can be annotated with ontological concepts and, as a result, the repository can be queried using full text, tags or semantic types of the UMLS ontologies.
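As an illustration of the first strategy, the sketch below runs a full-text query against a PostgreSQL tools repository using the database's built-in text search; the table and column names (tools, name, description) and connection string are assumptions made for the example.

```python
# A sketch of the "full text" discovery strategy over a PostgreSQL repository.
import psycopg2

def full_text_search(question: str, dsn: str = "dbname=tools_repo"):
    query = """
        SELECT name,
               ts_rank(to_tsvector('english', description),
                       plainto_tsquery('english', %s)) AS rank
        FROM tools
        WHERE to_tsvector('english', description)
              @@ plainto_tsquery('english', %s)
        ORDER BY rank DESC
        LIMIT 20;
    """
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(query, (question, question))
            return cur.fetchall()
```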
Interpreter
The Interpreter is the core element of the system and
the bridge between the clinical question and the query
formed for the resource repository. It employs NLP
Table 2 The 27 prime semantic categories: 23 from UMLS semantic types and 4 from EDAM categories
1. Disease  2. Drug  3. Medical Procedure  4. Tissue
5. Biomedical  6. Cell  7. Organism Function  8. Finding
9. Body Part  10. Gene  11. Clinical Attribute  12. Patient
13. Diagnosis  14. Age  15. Molecular Sequence  16. Device
17. Symptom  18. Virus  19. Injury or Poisoning  20. Vitamin
21. Laboratory  22. Food  23. Temporal Concept
24. EDAM Data/Format  25. EDAM Topic  26. EDAM Operation  27. EDAM Identifier
techniques and utilizes the Concept Recognizer module to formulate more focused queries to the repository. The specific NLP techniques employed are based on Stanford CoreNLP [35] version 3.3.0. The Interpreter receives the clinical question as input and executes the analytical steps of tokenization, lemmatization and POS tagging; the resulting tokens are syntactically annotated and can therefore be recognized as entities and matched against the patterns that are connected to the output's objective.
The final step of the NLP operations in the Interpreter applies a query template based on expression matching in order to extract relationship patterns between clinical entities. With these patterns (Table 2) the system identifies and categorizes parts of the input text as input/available data and parts that compose the clinical hypothesis (the clinical question to be answered).
The development of specific patterns aims to identify specific relations within sentences and to support the disambiguation of multiply-annotated words. For example, if a Drug category term and a Disease category term co-exist in the clinical question, as identified by the Concept Recognizer, this matches the combined pattern "Drug for Disease", whose partial meaning could be that the specific Drug is suitable for the specific Disease.
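A minimal sketch of this kind of pattern matching over the semantic categories returned by the Concept Recognizer is given below; the pattern signatures follow Table 3, while the annotation format (a list of term/category pairs) is an assumption made for the example.

```python
# A sketch of matching category sequences against predefined patterns.
PATTERNS = {
    ("Drug", "Disease"): "Drug for Disease",                 # read as "treatment"
    ("Drug", "Disease", "Body Part"): "Drug for Disease in Body Part",
    ("Patient", "Disease"): "Patient has Disease",
}

def match_patterns(annotated_terms):
    """Return the patterns whose category sequence appears, in order, in the question."""
    categories = tuple(cat for _, cat in annotated_terms)
    matches = []
    for signature, name in PATTERNS.items():
        it = iter(categories)
        # a pattern matches if its categories occur as a subsequence of the question
        if all(cat in it for cat in signature):
            matches.append(name)
    return matches

# Example:
# match_patterns([("John", "Patient"), ("carboplatin", "Drug"),
#                 ("lung cancer", "Disease")])
# -> ["Drug for Disease", "Patient has Disease"]
```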
A specific pattern, called a prime category, was created for every semantic category, resulting in 23 categories for the UMLS semantic types and 4 categories for the EDAM types. With these 27 prime/simple categories at hand, 24 new patterns, based on recommendations from experts, were created using combinations of the prime categories (Table 3). These combinations have a special meaning for the clinicians; for example, the pattern "Drug for Disease" relates to the concept of treatment for a physician.
The system, analysing the clinical question given as input, formulates two types of focused queries. The first is based on the tagged terms, their combination and their position in the question. The second is based on the semantic type of the tagged terms, for both input and output terms.
Subsequently, the queries are passed to the tools repository, and two lists of candidate tools are exported: a list of tools that could fully address the clinical question at hand, and a list of pipelined tools that could address the question sequentially. In order to export a ranked list of candidate tools based on correctness and accuracy metrics, the framework ranks the tools using the following scoring mechanism (a code sketch follows the list):
- Every tool or service gains one point for each appearance of an identified term in the description of either the input required or the output produced by the tool.
- Every tool or service gains 0.25 points for each appearance of a term in the functional, textual description of the tool. This means that if a tagged term of the sentence matches a tagged term in the textual description of the relevant tool, a quarter of a point is gained.
- Tools with a score equal to or less than 1 are ignored.
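A minimal sketch of this scoring mechanism, assuming each tool is represented with inputs, outputs and description text fields (the actual repository schema may differ):

```python
# A sketch of the tool-scoring rules described above.
def score_tool(tool, input_terms, output_terms):
    score = 0.0
    for term in input_terms:
        if term.lower() in tool["inputs"].lower():
            score += 1.0        # 1 point per match in the input description
    for term in output_terms:
        if term.lower() in tool["outputs"].lower():
            score += 1.0        # 1 point per match in the output description
    for term in input_terms + output_terms:
        if term.lower() in tool["description"].lower():
            score += 0.25       # 0.25 points per match in the free-text description
    return score

def rank_tools(tools, input_terms, output_terms):
    scored = [(score_tool(t, input_terms, output_terms), t["name"]) for t in tools]
    # tools with a score of 1 or less are discarded
    return sorted([s for s in scored if s[0] > 1], reverse=True)
```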
Furthermore, the tools that match terms both from the given-data sub-sentence in the description of their input and from the clinical-question sub-sentence in the description of their output form a list of tools/services that can individually resolve the clinical question. The remaining tools form a secondary list, i.e. a list of tools that are candidates for the formation of a computational pipeline that could provide a solution to the problem.
For comparison purposes, we also performed a free text query, similar to the search mechanism supported by traditional tools repositories, so that the automated results of our system could be compared to the results of a full text query. The free text query was implemented by inserting the whole clinical question, as free text, into the query interface of the tools repository.
Results
For the evaluation of the framework developed we
followed a case study approach. Expert users and know-
ledge extracted from relevant available resources assisted
us in formulating a series of clinically relevant questions
of increasing complexity, which were the basis for our
evaluation activities. The exact clinical questions and the
results obtained when the proposed framework was ap-
plied are presented in what follows.
For our experiments, the following clinical questions
were used:
Table 3 The list of patterns generated by the combination of prime categories
1. Drug for Disease
2. EDAM Data/Format and EDAM Operation
3. Patient took Drug for Disease
4. Finding with Organism_Function
5. Drug for Disease in Body Part
6. Finding in/with Medical_Procedure
7. Drug for Symptom
8. EDAM_Data in Body_Part
9. Patient has Disease in Body Part
10. Patient took Drug for Disease in Body Part
11. Patient took Drug for Body Part
12. Patient has been in Medical Procedure
13. Patient has Disease
14. Patient has Organism Function
15. Disease in Body Part
16. Drug for EDAM_Data in Body_Part
17. Patient's Finding
18. Drug & Drug for Disease
19. Patient took Vitamin
20. Symptom of Medical_Procedure
21. Patient has Symptom
22. Drug for EDAM_Data
23. Patient ate Food
24. EDAM Data/Format and EDAM Operation and EDAM Data/Format
1. John has lung cancer and has been treated with
carboplatin which is known for toxicology adverse
effects. I would like to find literature and reference
related to such events for the specific drug.
2. I have the miRNA gene expression profile of Anna
which is a nephroblastoma patient. I want to identify
KEGG pathways which are mainly disrupted due to
gene expression.
3. Patient FK is a 1.5 year old boy with bilateral
nephroblastoma and his tumor is unresponsive to
chemotherapy (vincristine, actinomycin-D and
Doxorubicin) with no reduction in tumor size, not
allowing to perform nephron sparing surgery. I would
like to obtain a list of deregulated metabolic pathways
in the tumor from gene expression data in combination
with miRNA data to find possible targets that can be
treated with available drugs.
4. Patient SM is a 3 year old girl with metastatic
nephroblastoma and miRNAs from blood are
analyzed at the time of diagnosis. I would like to
compare the results of miRNAs with miRNAs of the
cohort of patients with metastatic nephroblastoma
that are correlated to histology, treatment response
and outcome to get an individual risk index of the
patient including proposed pathology, treatment
response and outcome.
5. Patient AB is a 5 year old boy just diagnosed with
acute lymphoblastic leukemia, while
immunophenotype and gene expression data as well
as clinical data at the time of diagnosis are known. I
would like to compare his gene expression data with
the group of all patients having the same
immunological phenotype.
6. Patient AB is a 5 year old boy just diagnosed with acute lymphoblastic leukemia, while immunophenotype and gene expression data as well as clinical data at the time of diagnosis are known. I would like to know the difference in gene expression between those predicting relapse and those predicting poor MRD for the different immunophenotypes. The results should be visualized.
We evaluated the system's performance using precision and recall measurements. To measure precision and recall, expert physicians and bioinformaticians together went through the catalog of all the tools available in the repository, read their descriptions, functionalities and capabilities, and manually identified those tools that could answer or partially answer the specific clinical question. We present in detail the results obtained when processing the first two clinical questions as indicative case studies.
The first clinical question is a combination of sentences based on descriptions of clinical trials from the ClinicalTrials.gov registry [36] and the contribution of physicians. It was imported into our system through the web interface (http://calchas.ics.forth.gr/), where it was divided into two specific contexts.
The first sentence represents the available knowledge (given data/statement) of the clinician and mainly correlates to a tool's inputs, i.e. "John has lung cancer and has been treated with carboplatin which is known for toxicology adverse effects.", while the second sentence is the clinical hypothesis, the research question, and is mainly connected to a tool's outputs, i.e. "I would like to find literature and reference related to such events for the specific drug."
A visual representation of the Concept Recognizer annotation of the given data sentence is shown in Fig. 2; a similar annotation exists for the sentence that contains the clinical question. As explained earlier, in the case of co-existing annotations, the system selects the assignment with the higher score.
The domain experts manually searched the tools repository, using the available tool descriptions, and identified the "EUADR - Literature analysis" tool as a resource able to answer the specific clinical question. Table 4 shows the results of the framework for the first clinical question, while Additional file 1: Figure S1 shows the results as presented on the web site of the NLP framework.
As can be seen, the framework identified 23 relevant
tools. Additionally, we performed a free text query, using
the whole sentence as input into the tools repository, in
order to compare the automated results of our system to
those obtained with a full text query (a complete list of the
full text results can be found in Additional file 1: Table S2).
The framework was able to identify tools that could
individually address the clinical question. Such tools are
Fig. 2 Annotation example from the Concept Recognizer. The annotation produced by the Concept Recognizer for the given data sentence "John has lung cancer and has been treated with carboplatin which is known for toxicology adverse effects"
listed in Table 4, and include the cBio Cancer Genomics Data Server (CGDS) API [37], the National Cancer Institute SEER API [38] and "EUADR - Literature analysis" [39]. EUADR is the only tool selected by the domain experts as appropriate for answering the clinical question. A detailed description of these three tools can be found in Additional file 1: Table S1.
There were additional tools identified that partially
matched either the input or the output description; the
framework performs a check of the output data types of
the candidate tools for answering the input sentence and
the input data types of the candidate tools that could solve
the output sentence. For every data type match, a pro-
posed pipeline is created; this implies that the user could
use the first tool, and then provide its output as an input
to the second tool, and so on in order to obtain an answer
to the entire clinical question.
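A minimal sketch of this data-type matching step, assuming each candidate tool lists its input and output formats (the field names are assumptions for illustration):

```python
# A sketch of pipeline formation: a pipeline is proposed whenever a tool that
# matches the given-data sentence produces an output format that a tool
# matching the question sentence accepts as input.
def propose_pipelines(input_side_tools, output_side_tools):
    pipelines = []
    for first in input_side_tools:
        for second in output_side_tools:
            # e.g. first produces "VCF" and second consumes "VCF"
            shared = set(first["output_formats"]) & set(second["input_formats"])
            if shared:
                pipelines.append((first["name"], second["name"], sorted(shared)))
    return pipelines
```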
In addition, we analysed the results of the queries
and measured the precision and recall of the results,
as shown in Table 5: Precision and recall for the first
clinical question.
Precision is the fraction of retrieved tools that are in-
deed relevant, while recall is the fraction of relevant
tools that are indeed retrieved [40]. Both precision and
recall are therefore based on an understanding and
measure of relevance in our results. In order to measure
the precision and recall of the automated results, domain experts manually identified 76 tools that could answer, individually or as part of a computational pipeline, the specific clinical question. Among them, the "EUADR - Literature analysis" tool was able to answer the specific clinical question by itself. The rest of the tools could only provide partial solutions, meaning that two or more should be pipelined to obtain an answer.
In our first case study the true positive elements, i.e. elements that were correctly selected by the system, number 11, while the false positive elements, i.e. elements that were wrongly selected, number 0, and the false negative elements, i.e. elements that were correct but not selected, number 65 (76 - 11). This results in 100 % precision and 14 % recall.
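As a quick arithmetic check of these figures, using the standard definitions of precision and recall:

```python
# Verifying the reported figures for the first clinical question:
# 11 true positives, 0 false positives, 76 - 11 = 65 false negatives.
def precision_recall(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

p, r = precision_recall(tp=11, fp=0, fn=65)
print(f"precision = {p:.0%}, recall = {r:.0%}")   # precision = 100%, recall = 14%
```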
Table 4 Results of the first clinical question. The results given by the framework for the first clinical question. The individual tools that could solve the entire clinical question are listed first, followed by the tools that could be combined, i.e. pipelined, to provide an answer to the given clinical question.

Unique Tools List
  Score 4.75 = 3 (in) + 1 (out) + 0.75 (tag) | National Cancer Institute SEER API | Identified (query): carboplatin & cancer (in), cancer (in), lung cancer (in), drug (out)
  Score 4 = 3 (in) + 1 (out) | cBio Cancer Genomics Data Server (CGDS) API | Identified (query): carboplatin & cancer (in), cancer (in), lung cancer (in), find (out)
  Score 4 = 1 (in) + 3 (out) | EUADR - Literature analysis | Identified (query): adverse effects (in), drug-references (out), drug (out), literature (out)

Pipeline Tools List (first tool -> second tool)
  National Cancer Institute caDSR API -> AIDSinfo API
  China Cancer Database API -> AIDSinfo API

Single Tools List
  3.75 = 3 (in) + 3*0.25 (tag): The Cancer Genome Atlas API
  3.75 = 3 (in) + 3*0.25 (tag): China Cancer Database API
  3 (in): National Cancer Institute caDSR API
  3 (in): MuTect
  2.25 = 1 (out) + 5*0.25 (tag): Lexicomp API
  2 (out): Arabidopsis thaliana Microarray Analysis
  2 (out): Pathways and Gene annotations for QTL region
  2 (out): SciBite API
  2 (out): DGIdb API
  2 = 4*0.25 (tag): DailyMed API
  2 = 4*0.25 (tag): Aetna CarePass API
  2 = 4*0.25 (tag): National Institute on Drug Abuse Drug Screening Tool API
Table 5 Precision and recall for the first clinical question. Precision and recall of the automated resource discovery in attempting to find solutions to the first clinical question, as compared to results manually identified by domain experts based on the description of the tools.
  Free Text: 164 tools identified; precision 40 %; recall 73 %; best rank of a tool that can solve the question at once (no pipelines): 3 out of 164
  NLP Framework: 11 tools identified; precision 100 %; recall 14 %; best rank: 1st
As seen, the framework retrieved tools from the repository with a precision of 100 %, although the system might not have exported all the suitable tools - tools that could solve the question partially or at once - for the clinical question (i.e. it has low recall). On the other hand, what we feel is important is the fact that all identified tools are appropriate candidates for answering the clinical question. In contrast, the free text query had higher recall but much lower precision, meaning that many irrelevant tools were exported. We further discuss these findings in the discussion section.
We subsequently employed our framework with the clinical question "I have the miRNA gene expression profile of Anna which is a nephroblastoma patient. I want to identify KEGG pathways which are mainly disrupted due to gene expression." Domain experts again searched the tools repository and manually discovered that the specific question could be answered by the "mirPath" [41] or "miRNApath" [42] tools; it could also be answered with a combination of tools which had to contain the "mirtarbase" [43] tool and the "MinePath" [44] tool. Specifically, a clinician should first use the "mirtarbase" tool and provide its output as an input to the "MinePath" tool in order to resolve the full clinical question at hand.
We submitted the clinical question to the framework and a list of proposed tools suitable for the solution was exported. The free text query was also invoked, in order to compare the framework's results with those of the full text query. The results of the framework for the specific question are shown in Table 6.
The "mirtarbase", "mirPath" and "miRNApath" tools were identified by the framework as the top ranked tools appropriate for individually answering the clinical question. Of these tools, "mirPath" and "miRNApath" were also selected by the domain experts. Details about these three tools can be found in Additional file 1: Table S3. The "mirtarbase" tool was identified incorrectly as a candidate, while the "MinePath" tool was also incorrectly identified as one of the tools that could partially answer the clinical question. Additional tools were identified as candidates for a partial answer to the question, i.e. appropriate for solving the input or the output sentence; these tools could again form a pipeline in order to answer the whole clinical question. From these tools, the domain experts identified only one potential pipeline, using the "mirtarbase" and "MinePath" tools. The results of applying our NLP framework to the second clinical question are shown in Table 6. The framework identified 17 relevant tools. We also compared the results with a full text search (the complete list of the full text search results can be found in Additional file 1: Table S4).
The results of the queries were analysed and measured
as shown in Table 7. Domain experts manually identified
Table 6 Results for the second clinical question. The results given by the framework for the second clinical question. The individual tools that could solve the entire clinical question are listed first, followed by the tools that could be combined, i.e. pipelined, to provide an answer to the given clinical question.

Unique Tools List
  Score 3 = 1 (in) + 2 (out) | miRNApath | Identified (query): mirna (in), gene expression (out), kegg pathways (out)
  Score 3 = 1 (in) + 2 (out) | mirPath | Identified (query): mirna (in), gene expression (out), kegg pathways (out)
  Score 3 = 1 (in) + 2 (out) | mirtarbase | Identified (query): mirna (in), gene expression (out), kegg pathways (out)

Pipeline Tools List
  No results found in this category for the given question.

Single Tools List
  4 (out): Get Pathway-Genes and gene description by Entrez gene id
  4 (out): Arabidopsis thaliana Microarray Analysis
  4 (out): MinePath
  4 (out): EnrichNet API
  4 (out): NCBI Gi to Kegg Pathway Descriptions
  4 (out): MitoMiner API
  4 (out): BiologicalNetworks API
  4 (out): From cDNA Microarray Raw Data to Pathways and Published Abstracts
  4 (out): HUMAN Microarray CEL file to candidate pathways
  4 (out): ERGO Genome Analysis and Discovery System
  4 (out): BioCyc API
  4 (out): Mouse Microarray Analysis
Table 7 Precision and recall for the second clinical question. Precision and recall of the automated resource discovery in attempting to find solutions to the second clinical question, as compared to results manually identified by domain experts based on the description of the tools.
  Free Text: 231 tools identified; precision 25 %; recall 59 %; best rank of a tool that can solve the question at once (no pipelines): 2 out of 231
  NLP Framework: 17 tools identified; precision 100 %; recall 17 %; best rank: 1st & 2nd
99 tools that could solve, partially or at once, the specific clinical question. The NLP framework demonstrates good precision for this question too. The true positive elements are 17, the false positive elements are 0, and the false negative elements are 82 (99 - 17). This gives us 100 % precision and 17 % recall.
Discussion
This study focused on the development of a Semantic
Biomedical Resource Discovery Framework by making
use of natural language processing techniques. As ori-
ginally stated, the envisioned framework should allow
searching through a set of semantically annotated re-
sources in order to find a match with a user query
expressed as a natural language statement.
In parallel to seeking an answer to our ultimate research
question, a range of additional, more specific research
questions were also established. In the current section we
critically discuss our experiences and the experimental
evidence obtained in the context of those specific research
questions initially established. We would like to stress that
evaluation of the proposed approach used a limited num-
ber of queries. As a result, the present work should be
seen as a case study, providing initial evidence on the val-
idity of the approach. It is obvious that subsequent formal
evaluation should be designed to test the broader effect-
iveness of the system.
Having said this, the experience obtained through the annotation of a large number of resources (Additional file 2) that were brought into our platform for experimentation shows that the range of existing open biomedical ontologies and other open, generic ontologies does suffice for the creation of a domain-specific annotation framework that would be useful for semantic resource annotation. We were able to observe the efficiency of the current software-related ontology, i.e. EDAM, and of the other biomedical ontologies that we used. Hence, we believe that there is no need for the development of a core domain ontology to enable the creation of an annotation framework offering the capability to capture the context of complex biomedical resources. Rather, the challenge lies in the articulate use and integration of various existing biomedical and other related ontologies. This, nevertheless, remains a scientifically and often technically demanding task.
Our work performing NLP processing on complex biomedical text reaffirmed the various challenges identified in prior research, namely: i) clinical text has uncommon structure and content that are not always guided by grammar, syntax or spelling rules [45]; ii) biomedical terms are prone to ambiguity; a word may have multiple meanings or many words may have the same meaning [46], and temporal ambiguity also exists, confusing past or future diagnoses or medical history; iii) clinical content is full of abbreviations and titles that confuse the detection of a sentence's boundary [45]; iv) negations are very common in clinical text, such as "no", "without", "not" and "denies" [47].
Although these challenges were evidently present in our experimentation, the range of existing NLP tools is also large. Numerous NLP packages have been developed, such as Python NLTK, OpenNLP, Stanford NLP and LingPipe. In our work we selected the probabilistic Stanford NLP tools, where the corpus data are gathered and manually annotated and a model is then trained to predict annotations based on words and their contexts through weights. The selected NLP tools, with minor extensions and customization, have proven adequate for supporting the NLP tasks of our work.
In the context of our research a limited number of clinical questions were examined. For the first research question, presented in detail in this manuscript, the framework identified the pattern <Drug> for <Disease>, which has the specific meaning of "treatment" for clinicians. According to the given input sentence, we managed to identify patterns from the combination of the annotated tagged terms of the sentence. Many more patterns can be formed and can enrich the framework in the future, depending on different kinds of domain searches and the distinct meanings they carry for physicians.
Domain experts explored the tools repository and
manually identified 76 tools and services (out of 502)
that could provide an answer to the clinical question;
some of those could give a solution individually, while
others could partially solve the question.
The second clinical question presented in this manuscript led us to the matching patterns "Patient has Disease" and "EDAM Topic for EDAM Data".
The user seeks to find the disrupted KEGG pathways according to the profile of a patient who has nephroblastoma. A tool, a service or a pipeline of tools is needed to resolve this question. The domain experts manually selected 99 tools and services that could be part of the solution space. The framework's results showed 100 % precision and were fewer than the tools selected by the domain experts. In addition, the free text query exported 231 tools and identified only 2 of the tools that, according to the domain experts, can solve the entire clinical question.
Furthermore, in relation to execution performance, the framework proved able to respond fast enough to be used as an online search engine for biomedical tools. The response times for different clinical questions vary from 1.5 to 7.5 s, which is acceptable for a web application. More specifically, the response time for the first clinical question is 3993 milliseconds and for the second clinical question 7038 milliseconds. In our current implementation we use ten ontologies, but the framework can be extended to use more ontologies, either from
UMLS or as new systems (concept recognizers) using the Solr implementation of EDAM. Initial evidence indicates that the proposed framework is scalable and can be expected to remain responsive in real time even with tens of thousands of tools in the repository and with many more ontologies, because the concept recognizer and the queries to the repository are based on elastic search, which is suitable even in more demanding domains such as big data applications [48].
Future work
We plan to extend the framework and provide end users with options to create and import, through the web interface, new patterns that may be needed and do not already exist. We are also exploring methodologies for personalized preferences, with classification based on the user profile [49] and a "voting" mechanism on the retrieved results, in order to improve accuracy within similar user groups. Another direction under investigation is to enrich the framework with graph theory capabilities and provide the end user with possible workflows [50] for the solution of the research question. Such methodologies have proved valuable in service discovery [51] and scientific workflow composition [13, 52-54]. Taking advantage of the modular implementation and the rich metadata schema of the NLP framework, we expect to provide meaningful pipelines as guidelines for complex clinical questions. Again, high quality annotation of the tools in the repository is mandatory for accurate results.
In addition, the implementation of this framework will be expanded with even more patterns, focused on all the possible combinations of the semantic categories of the EDAM software ontology and the clinical ontologies of the UMLS Metathesaurus, in order to create more accurate patterns with clinical meanings, taking into account the ontology-based relationships of concepts and how they map to similar structures in natural language expressions, as we expect that the tools repository will soon host thousands of software resources. We also plan to add negation detection patterns to identify diagnoses and symptoms that are negated. In that direction, we will evaluate and possibly use opinion mining methodologies able to categorize the polarity of a text, i.e. whether a sentence or word is positive, negative or neutral. Solutions like "Crowd Validation" [55], which examine and determine opinions, perceptions and approaches, along with NLP methodologies for ontology management and query processing [56-58], will possibly be used.
In addition, the key functions of clinical decision support systems require understanding of the context from which an event or a named entity is extracted. For example, supporting clinical diagnosis and treatment processes with best evidence will require not only recognizing a clinical condition, but also determining whether the condition is present or absent. Chapman et al. [59] developed a dedicated algorithm, ConText, for identifying three contextual features: Negation (for example, "no pneumonia"); Historicity (the condition is recent, occurred in the past, or might occur in the future); and Experience (the condition occurs in the patient or in someone else, such as "parents abuse alcohol"). In many cases it is also desirable to detect the degree of certainty in the context (for example, "suspected pneumonia"). Although significant results in relation to the topic of context exist - e.g. Solt and colleagues [60] described an algorithm for determining whether a condition is absent, present, or uncertain - it is our view that the related issues will continue to present researchers with challenges.
Another challenge for the future relates to multilin-
gualism; taggers, parsers and lexicons for additional lan-
guages, apart from English, could be added into the
system and provide a service discovery framework for a
multilingual setting.
Additionally, the system could eventually be extended into a question-answering system. A question driven approach could upgrade the system to the next step of an intelligent service discovery system, by asking the user questions derived from the input sentence. In this way the system could provide the user with services or tools in a pipeline that could be used in sequence to implement the desired process.
Furthermore, in the context of the p-medicine EC project, a thorough usability evaluation of the system by end users has been scheduled in order to assess the usability and acceptance of the framework.
Conclusions
The ultimate objective of this work has been to investigate the use of semantics for the annotation of biomedical resources with domain specific ontologies, and to exploit Natural Language Processing methods to empower non-Information Technology expert users to efficiently search for biomedical resources using natural language.
As part of this case study, we have successfully implemented a web based framework able to interact with the end user through natural language for biomedical resource discovery in real time. The user describes in natural language the facts (data) and the research question, which are analysed with NLP techniques and annotated with clinical and software ontologies in order to form specific queries for the tools repository and finally retrieve tools/services that address the clinical question.
The results obtained showed that the system has high precision and low recall, which means that the returned results are predominantly relevant. There were input queries that achieved 100 % precision, meaning that the exported resources were all correct; the system may not have retrieved all
the tools that could solve the clinical question, but all the retrieved tools were suitable for addressing the query.
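For reference, the evaluation measures follow the standard definitions (cf. [40]), where TP, FP and FN denote the true positive, false positive and false negative tool retrievals with respect to the experts' manual selection:

```latex
% Standard definitions of the evaluation measures (TP: true positives,
% FP: false positives, FN: false negatives).
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}
```

High precision with lower recall therefore means that few irrelevant tools are returned (small FP), even though some relevant tools are missed (non-zero FN).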
Research on web search engines indicates that 91 % of searchers do not go past page one (the ten top-ranked results) of the search results and that over 50 % do not go past the first 3 results on page 1 [61]. We expect that the same behaviour holds for tools repositories too, since web search has an impact on the way we search for and retrieve information [62]. Having a system that is able to narrow down the retrieved results with 100 % precision and provide a good ranking would therefore be valuable for end users, especially those who stick to the top-ranked results and neglect the rest.
Comparing MetaMap with clinical annotators such as GATE (General Architecture for Text Engineering) [16], Apache Stanbol IKS (Interactive Knowledge Stack) [63], the NCBO Annotator (BioPortal) [64] and ConceptMapper [65], we concluded that MetaMap is the best biomedical concept recognizer for our needs: it offers a RESTful API and can draw on the many clinical ontologies connected to it, whereas the other clinical annotators could not be loaded with the large number of ontologies that our use case required.
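As an illustration of this integration pattern, the sketch below posts free text to a concept-annotation REST service and reads back the recognized concepts; the endpoint URL and the assumed response fields are placeholders, not the actual MetaMap/SKR API specification.

```python
# Hedged sketch of posting free text to a concept-annotation REST service
# and reading back the recognized concepts. The endpoint URL and the assumed
# response fields are placeholders, not the actual MetaMap/SKR API.
import requests

ANNOTATOR_URL = "http://example.org/annotate"  # placeholder endpoint

def recognize_concepts(text: str) -> list:
    """Send free text to the annotation service; the response is assumed to
    be a JSON list of {"term", "cui", "semtype"} objects."""
    response = requests.post(ANNOTATOR_URL, data={"text": text}, timeout=30)
    response.raise_for_status()
    return response.json()

for concept in recognize_concepts("gene expression analysis of breast cancer"):
    print(concept.get("term"), concept.get("cui"), concept.get("semtype"))
```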
We must accept that searching with ontology terms provided better results than searching with the semantic types of those terms. This is, firstly, an effect of the tag queries' dependence on the tagged terms identified in the sentence, which are also terms appearing in the tools' descriptions, and, secondly, due to the fact that the semantic types of the terms may not be found in the descriptions at all; even when they are, they carry only a minor score priority.
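This prioritisation can be expressed directly in the repository query; the sketch below boosts exact term matches over semantic-type matches, with the field names and boost factors being assumptions for this example rather than the framework's actual index schema.

```python
# Illustrative sketch of boosting ontology-term matches over semantic-type
# matches in a Solr-style query. Field names ("description", "semantic_type")
# and boost values are assumptions for this example.
def build_boosted_query(terms, semantic_types):
    term_clauses = [f'description:"{t}"^4' for t in terms]            # dominant weight
    type_clauses = [f'semantic_type:"{s}"' for s in semantic_types]   # minor weight
    return " OR ".join(term_clauses + type_clauses)

print(build_boosted_query(["microRNA", "pathway"], ["Gene or Genome"]))
# description:"microRNA"^4 OR description:"pathway"^4 OR semantic_type:"Gene or Genome"
```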
The proposed NLP framework has the potential to help physicians practice advanced ICT-supported medicine and to improve the quality of patient care. To our knowledge, a number of tools can handle clinically relevant information for specific questions, but they are limited to providing summaries of the literature, such as AskHERMES [11]. It is also known that a plethora of databases and specialized repositories exist and that numerous clinical and biomedical tools have been indexed in them, but the search mechanisms and the technical terminology used discourage clinicians from using them.
We implemented a web-based framework that takes advantage of domain-specific ontologies and NLP in order to empower non-IT users to search for biomedical resources using natural language. The proposed framework bridges the gap between a clinical question and efficient, dynamic biomedical resources discovery. Given the experience gained during the design, implementation and set-up of such a framework, we can safely conclude that the existing biomedical ontologies, NLP tools and biomedical annotation systems are adequate for the implementation of such a framework.
Additional files
Additional file 1: contains extensive results of the experiments.
(PDF 803 kb)
Additional file 2: contains the list of tools and resources that were
in our repository. (PDF 964 kb)
Abbreviations
IT: Information Technology; NLP: Natural Language Processing; POS: Part Of
Speech; NER: Named Entity Recognition; OBIE: Ontology-Based Information
Extraction; UMLS: Unified Medical Language System; SKR: Semantic Knowledge
Representation; WSD: Word Sense Disambiguation.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
MT, SS and LK conceived of and designed the framework. NG as end user
guided the design and implementation. SS implemented the tools and
metadata repositories, LK implemented the concept recognizer mechanism,
PS implemented the interpreter, GI implemented the web based front-end.
LK, GZ and PS conducted the integration of the framework; GZ is also the administrator of the server. LK and PS conducted the experiments.
All authors contributed to the manuscript. All authors read and approved the
final manuscript.
Authors' information
Not applicable.
Availability of data and materials
Not applicable.
Acknowledgements
This work was supported by the EU funded research projects p-medicine
(http://www.p-medicine.eu/) and iManageCancer (http://imanagecancer.eu/)
that aim to design a semantically aware computational platform in support
of personalised medicine.
Funding
This project was funded by the European Commission under contracts
H2020-PHC-26-2014 No. 643529 (iManageCancer project) and FP7-ICT-
2009.5.3 No 270089 (p-medicine project).
Author details
1 Foundation for Research and Technology Hellas (FORTH), Institute of Computer Science, N. Plastira 100, Vassilika Vouton, Heraklion, Crete, Greece.
2 Department of Informatics Engineering, Technological Educational Institute, Heraklion, Crete, Greece.
3 Paediatric Haematology and Oncology, Saarland University Hospital, Homburg, Germany.
Received: 9 March 2015 Accepted: 21 September 2015
References
1. Zhu F, Patumcharoenpol P, Zhang C, Yang Y, Chan J, Meechai A, et al.
Biomedical text mining and its applications in cancer research. J Biomed
Inform. 2013;46:200–11.
2. Meystre S, Haug JP. Natural language processing to extract medical
problems from electronic clinical documents: Performance evaluation.
J Biomed Inform. 2006;39(6):589–99.
3. Wolstencroft K, Haines R, Fellows D, Williams A, Withers D, Owen S, et al.
The Taverna workflow suite: designing and executing workflows of Web
Services on the desktop, web or in the cloud. Nucleic Acids Res.
2013;41(W1):557–61.
4. Goble CA, Bhagat J, Aleksejevs S, Cruickshank D, Michaelides D, Newman D,
et al. myExperiment: a repository and social network for the sharing of
bioinformatics workflows. Nucleic Acids Res. 2010;38(2):677–82.
5. Bhagat J, Tanoh F, Nzuobontane E, Laurent T, Orlowski J, Roos M, et al.
BioCatalogue: a universal catalogue of web services for the life sciences.
Nucleic Acids Res. 2010;38(2):W689–94.
6. Li JW, Schmieder R, Ward M, Delenick J, Olivares EC, Mittelman D.
SEQanswers: an open access community for collaboratively decoding
genomes. Bioinformatics. 2012;28(9):1272–3.
7. Pettifer S, Ison J, Kalas M, Thorne D, McDermott P, Jonassen I, et al. The
EMBRACE web service collection. Nucleic Acids Res. 2010;38(2):683–8.
8. Gentleman R, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, et al.
Bioconductor: open software development for computational biology and
bioinformatics. Genome Biol. 2004;5(10):R80.
9. National Library of Medicine. ORBIT: Online Registry of Biomedical
Informatics Tools. [Internet]. 2011 [cited 2013].
10. Simpson MS, Demner-Fushman D. Biomedical Text Mining: a survey of recent progress. In: Mining text data. Springer US; 2012. p. 465–517.
11. Cao Y, Liu F, Simpson P, Antieau L, Bennett A, Cimino JJ, et al. AskHERMES: An online question answering system for complex clinical questions. J Biomed Inform. 2011;44(2):277–88.
12. Cao Y, Cimino JJ, Ely J, Yu H. Automatically extracting information needs
from complex clinical questions. J Biomed Inform. 2010;43:962–71.
13. Koumakis L, Moustakis V, Potamias G. Web Services Automation. New York:
Hershey Information Science Reference; 2009. p. 239–57.
14. Friedman C, Rindflesch TC, Corn M. Natural Language Processing: state of
the art and prospects for significant progress, a workshop sponsored by the
National Library of Medicine. J Biomed Inform. 2013;46(5):765–73.
15. Settles B. ABNER: an open source tool for automatically tagging genes,
proteins and other entity names in text. Bioinformatics. 2005;21(14):3191–2.
16. Cunningham H. GATE, a general architecture for text engineering. Comput
Hum. 2002;36(2):223–54.
17. Ferrucci D, Lally A. UIMA: an architectural approach to unstructured information processing in the corporate research environment. Nat Lang Eng. 2004;10(3–4):327–48.
18. Clement J, Nigam SH, Cherie YH, Musen MA, Callendar C, Storey MA. NCBO
Annotator: Semantic Annotation of Biomedical Data. International Semantic
Web Conference, Poster and Demo session. 2009.
19. Belloze KT, Monteiro DISB, Lima TF, Silva-Jr FP, Cavalcanti MC. An Evaluation
of Annotation Tools for Biomedical Texts. ONTOBRAS-MOST. 2012. p. 108–119.
20. Wimalasuriya DC, Dejing D. Ontology-based information extraction: An
introduction and a survey of current approaches. J Inf Sci. 2010;36(3):306–23.
21. Bodenreider O. The unified medical language system (UMLS): integrating
biomedical terminology. Nucleic Acids Res. 2004;32(1):267–70.
22. Al-Safadi L, Alomran R, Almutairi F. Evaluation of MetaMap performance in radiographic images retrieval. Res J Appl Sci Eng Technol. 2013;22(6):4231–6.
23. Wu Y, Denny JC, Rosenbloom T, Miller RA, Giuse DA, Xu H. A comparative
study of current clinical natural language processing systems on handling
abbreviations in discharge summaries. Am Med Inform Assoc. 2012;2012:997.
24. Sfakianaki P, Koumakis L, Sfakianakis S, Tsiknakis M. Natural language
processing for biomedical tools discovery: A feasibility study and
preliminary results. In: 17th International Conference on Business
Information Systems; 2014; Larnaca, Cyprus
25. P-Medicine EU project web site. [Internet]. 2012 [cited 2015 Mar 08].
Available from: http://www.p-medicine.eu.
26. Marias K, Dionysiou D, Sakkalis V, Graf N, Bohle RM, Coveney PV, et al.
Clinically driven design of multi-scale cancer models: the ContraCancrum
project paradigm. Interface Focus. 2011;1(3):450–61.
27. Schulz M, Krause F, Le Novere N, Klipp E, Liebermeister W. Retrieval,
alignment, and clustering of computational models based on semantic
annotations. Mol Syst Biol. 2011;7(1):512.
28. Brown PF, de Souza PV, Mercer RL, Della Pietra VJ, Lai JC. Class-based n-gram
models of natural language. Comput Linguist. 1992;18(4):467–79.
29. Kalas M, Puntervoll P, Joseph A, Bartaseviciute E, Topfer A, Venkataraman P,
et al. BioXSD: the common data-exchange format for everyday
bioinformatics web services. Bioinformatics. 2010;26(18):540–6.
30. Lamprecht AL, Margaria T, Steffen B. Bio-jETI: a framework for semantics-based
service composition. BMC Bioinformatics. 2009;10(10):S8.
31. Smiley D, Pugh DE. Apache Solr 3 Enterprise Search Server. Packt Publishing
Ltd; 2011.
32. Black S. PostgreSQL: introduction and concepts. Linux J. 2001;2001(88):16.
33. Sfakianakis S, Graf N, Hoppe A, Rüping S, Wegener D, Koumakis L, et al.
Building a System for Advancing Clinico-Genomic Trials on Cancer. George
Potamias Vassilis Moustakis (eds.), 2009. 33.
34. Stamatakos GS, Dionysiou D, Lunzer A, Belleman R, Kolokotroni E, Georgiadi E,
et al. The technologically integrated oncosimulator: combining multiscale
cancer modeling with information technology in the in silico oncology context.
Biomed Health Informatics, IEEE. 2014;18(3):840–54.
35. Manning CD, Surdeanu M, Bauer J, Finkel J, Bethard SJ, McClosky D. The
Stanford CoreNLP Natural Language Processing Toolkit, Proceedings of
52nd Annual Meeting of the Association for Computational Linguistics:
System Demonstrations. 2014. p. 55–60.
36. Hartung DM, Zarin DA, Guise IM, McDonagh M, Paynter R, Helfand M.
Reporting discrepancies between the ClinicalTrials.gov results database and
peer-reviewed publications. Ann Intern Med. 2014;160(7):477–83.
37. Cerami E, Gao J, Dogrusoz U, Gross BE, Sumer SO, Aksoy BA, et al. The cBio
cancer genomics portal: an open platform for exploring multidimensional
cancer genomics data. Cancer Discov. 2012;2(5):401–4.
38. National Cancer Institute SEER API. [Internet]. [cited 2014 Dec]. Available
from: http://www.programmableweb.com/api/national-cancer-institute-seer.
39. EU-ADR Web Platform. [Internet]. [cited 2014 Dec]. Available from: https://
bioinformatics.ua.pt/euadr/Welcome.jsp.
40. Powers D. Evaluation: From Precision, Recall and F-measure to ROC,
Informedness, Markedness & Correlation. J Mach Learn Technol. 2011;2(1):37–63.
41. DIANA miRPath v. 2.0: investigating the combinatorial effect of microRNAs
in pathways. Nucleic Acids Res. 2012;40(W):498–504.
42. Chiromatzo A, Oliveira T, Pereira G, Costa A, Montesco C, DE G, et al.
miRNApath: a database of miRNAs, target genes and metabolic pathways.
Genet Mol Res. 2007;6(4):859–65.
43. Sheng-Da H, Feng-Mao L, Wi-Yun W, Chao L, Wei-Chih H, Wen-Ling C, et al.
miRTarBase: a database curates experimentally validated microRNA–target
interactions. Nucleic Acids Res. 2010;gkq1107.
44. Koumakis L, Moustakis V, Zervakis M, Kafetzopoulos D, Potamias G. Coupling
Regulatory Networks and Microarays: Revealing Molecular Regulations of
Breast Cancer Treatment Responses, Artificial Intelligence: Theories and
Application Lecture notes in Computer Science. 2012. p. 23946.
45. Meystre SM, Savova K, Kipper-Schuler C, Hurdle JF. Extracting Information
from Textual Documents in the Electronic Health Record: A Review of
Recent Research. Yearb Med Inform. 2008;35:128–44.
46. Nadkarni M, Lucila OM, Chapman WW. Natural language processing: an
introduction. J Am Med Inform Assoc. 2011;18(5):544–51.
47. Chapman WW, Bridewell W, Hanbury P, Cooper GF, Buchanan BG. Evaluation
of Negation Phrases in Narrative Clinical Reports. Proceedings of the AMIA
Symposium. American Medical Informatics Association; 2001. p. 105–109.
48. Kononenko O, Baysal O, Holmes R, Godfrey MW. Mining modern
repositories with elastic search. In: ACM, eds. Proceedings of the 11th
Working Conference on Mining Software Repositories; 2014. pp. 328-331.
49. Potamias G, Koumakis L, Moustakis V. Enhancing web based services by
coupling document classification with user profile. In: IEEE, eds. Computer
as a Tool (EUROCON 2005); 2005. p. 205–208.
50. Sfakianakis S, Koumakis L, Zacharioudakis G, Tsiknakis M. Web-based
Authoring and Secure Enactment of Bioinformatics Workflows. In: Grid
and Pervasive Computing Conference. Geneva, Switzerland: IEEE; 2009.
51. Tao Y, Kwei-Jay L. Service selection algorithms for Web services with
end-to-end QoS constraints. Inf Syst E-Business Manag. 2005;3(2):103–26.
52. Kanterakis A, Potamias G, Zacharioudakis G, Koumakis L, Sfakianakis S,
Tsiknakis M. Scientific discovery workflows in bioinformatics: a scenario for
the coupling of molecular regulatory pathways and gene-expression
profiles. Stud Health Technol Inform. 2009;160:1304–8.
53. Koumakis L, Moustakis V, Tsiknakis M, Kafetzopoulos D, Potamias G. Supporting
genotype-to-phenotype association studies with grid-enabled knowledge
discovery workflows. In: IEEE, eds. Engineering in Medicine and Biology
Society, 2009. EMBC 2009. Annual International Conference of the IEEE;
2009. pp. 6958–6962.
54. Zacharioudakis G, Koumakis L, Sfakianakis S, Tsiknakis M. A semantic
infrastructure for the integration of bioinformatics services. In: IEEE, eds.
Intelligent Systems Design and Applications (ISDA'09); 2009. p. 367–372.
55. Cambria E, Hussain A, Havasi C, Eckl C, Munro J. Towards crowd validation
of the UK National Health Service, WebSci'10. 2010. p. 15.
56. Kim JD, Cohen KB. Natural language query processing for SPARQL generation:
A prototype system for SNOMED CT. In: Proceedings of BioLINK. 2013.
p. 328.
57. Cohen KB, Kim JD. Evaluation of SPARQL query generation from natural
language questions. In: Joint Workshop on NLP&LOD and SWAIE: Semantic
Web, Linked Open Data and Information Extraction. 2013. p. 3.
58. Grigonyte G, Brochhausen M, Martín L, Tsiknakis M, Haller J. Evaluating
Ontologies with NLP-Based Terminologies: A Case Study on ACGT and Its
Master Ontology. In: Press I, editor. Formal Ontology in Information Systems:
Proceedings of the Sixth International Conference. 2010. p. 331.
59. Chapman W, Chu D, Dowling J. ConText: An Algorithm for Identifying
Contextual Features from Clinical Text. In Proceedings of the Workshop on
BioNLP 2007: Biological, Translational, and Clinical Language Processing (pp.
81-88). Association for Computational Linguistics.
60. Solt I, Tikk D, Gal V, Kardkovacs Z. Semantic classification of diseases in
discharge summaries using a context-aware rule-based classifier. J Am Med
Inform Assoc. 2009;16(4):580–4.
61. Van Deursen AJ, Van Dijk JA. Using the Internet: Skill related problems in
users' online behavior. Interacting Comput. 2009;21(5):393–402.
62. Bughin J, Corb L, Manyika J, Nottebohm O, Chui M, de Muller Barbat B, et al.
The impact of Internet technologies: Search. McKinsey & Company, High Tech Practice; 2011.
63. Adamou A, Andre F, Christ F, Filler A. Apache Stanbol: The RESTful
Semantic Engine. [Internet]. 2007 [cited 2013 Sept]. Available from:
http://dev.iks-project.eu/.
64. Jonquet C, Shah NH, Musen MA. The open biomedical annotator. Summit
on Translational Bioinformatics. 2009. p. 56–60.
65. Funk C, Baumgartner W, Garcia B, Roeder C, Bada M, Cohen KB, et al.
Large-scale biomedical concept recognition: an evaluation of current
automatic annotators and their parameters. BMC Bioinformatics. 2014;15:59.