RESEARCH ARTICLE - Open Access
Semantic biomedical resource discovery:
a Natural Language Processing framework
Pepi Sfakianaki^1, Lefteris Koumakis^1*, Stelios Sfakianakis^1, Galatia Iatraki^1, Giorgos Zacharioudakis^1, Norbert Graf^3, Kostas Marias^1 and Manolis Tsiknakis^1,2
Abstract
Background: A plethora of publicly available biomedical resources currently exist, and their number is increasing at a fast rate. In parallel, specialized repositories are being developed, indexing numerous clinical and biomedical tools. The main drawback of such repositories is the difficulty of locating appropriate resources for a clinical or biomedical decision task, especially for users who are not Information Technology experts. Moreover, although NLP research in the clinical domain has been active since the 1960s, progress in the development of clinical NLP applications has been slow and lags behind progress in the general NLP domain.
The aim of the present study is to investigate the use of semantics for annotating biomedical resources with domain-specific ontologies, and to exploit Natural Language Processing methods to empower non-Information Technology expert users to efficiently search for biomedical resources using natural language.
Methods: A Natural Language Processing engine which can "translate" free text into targeted queries, automatically transforming a clinical research question into a request description that contains only ontology terms, has been implemented. The implementation is based on information extraction techniques for natural language text, guided by integrated ontologies. Furthermore, knowledge from robust text mining methods has been incorporated to map descriptions onto suitable domain ontologies, in order to ensure that the biomedical resource descriptions are domain oriented and to enhance the accuracy of service discovery. The framework is freely available as a web application at http://calchas.ics.forth.gr/.
Results: For our experiments, a range of clinical questions was established based on descriptions of clinical trials from the ClinicalTrials.gov registry as well as recommendations from clinicians. Domain experts manually identified the tools in a tools repository that are suitable for addressing the clinical questions at hand, either individually or as a set of tools forming a computational pipeline. The results were compared with those obtained from an automated discovery of candidate biomedical tools. For the evaluation of the results, precision and recall measurements were used. Our results indicate that the proposed framework has high precision and low recall, implying that the returned results are predominantly relevant.
Conclusions: The biomedical ontologies already available, the existing NLP tools and the quality of current biomedical annotation systems are adequate for the implementation of a biomedical resource discovery framework based on the semantic annotation of resources and the use of NLP techniques. The results of the present study demonstrate the clinical utility of the proposed framework, which aims to bridge the gap between clinical questions expressed in natural language and efficient, dynamic biomedical resource discovery.
Keywords: Semantic resource annotation, Natural language processing, Resource discovery, Biomedical text
annotation, Information extraction, Text mining, Biomedical informatics, Search engine, Natural language interface
* Correspondence: koumakis@ics.forth.gr
1 Foundation for Research and Technology Hellas (FORTH), Institute of Computer Science, N. Plastira 100, Vassilika Vouton, Heraklion, Crete, Greece
Full list of author information is available at the end of the article
© 2015 Sfakianaki et al. Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and
reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to
the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Sfakianaki et al. BMC Medical Informatics and Decision Making (2015) 15:77
DOI 10.1186/s12911-015-0200-4
Background
A plethora of publicly available biomedical resources (data, tools, services, models and computational workflows) currently exist, and their number is constantly increasing at a fast rate. This explosion of biomedical resources makes it hard for biomedical researchers to efficiently discover the resources appropriate for their clinical tasks. It is extremely difficult to locate the necessary resources [1], especially for non-Information Technology (IT) expert users, because most of the available tools are commonly described via narrative web pages containing information about their operations in natural language, or are annotated with relevant technical details which are not easily interpreted by lay users. These descriptions contain plain text with no machine-interpretable structure and therefore cannot be used to automatically process the descriptive information about a resource. An indicative resource and its description is shown in Table 1.
Furthermore, clinical users prefer to formulate their queries quickly using natural language, which is the most user-friendly and expressive way [2]. As a result, discovery of the appropriate tools and computational models needed to support a given clinical decision making task has been, and remains, a major problem for non-expert users. Because the range of accessible resources has expanded considerably in recent years and a significant number of new resource repositories have been developed, it has become more and more difficult for clinicians and researchers to locate the most appropriate resource for the realization of their tasks.
On the other hand, bioinformaticians and tool devel-
opers rely to a greater extent on ontologies to annotate
their systems and publish them in specialized repositories,
such as Taverna [3], myExperiment [4], BioCatalogue [5],
SEQanswers [6], EMBRACE [7], Bioconductor [8], and
ORBIT [9]. Such repositories make software components
easier to locate and use when they are described and
searched via rich metadata terms but act as independent
silos devoted to specific domains and are unable to pro-
vide end to end solutions to daily routine clinical
questions. An indicative example is SEQanswers [6], where a user can find an abundance of tools which, however, are restricted to sequencing. In such repositories, the main impediments that a clinician faces are: (i) the need to search serially, or with exact keyword matching, in repositories containing thousands of tools; (ii) the substantial information technology (IT) knowledge required to understand a tool's purpose and way of use; (iii) the time-consuming search across several, or all, of the publicly available repositories; and (iv) the uncertainty regarding the appropriateness of a retrieved tool for the clinical decision task at hand [10].
In most cases clinical users come up with long and complex questions in the context of their hypothetico-deductive model of clinical reasoning [11]. Equally important is the fact that clinical users are not prepared, on average, to allocate more than 2 minutes to discovering appropriate tools, and usually give up if the inquiry is time consuming [12]. Furthermore, the appropriateness of the results obtained often depends on the user's IT expertise.
The use of queries expressed in natural language
can, it is believed, overcome these hurdles [13], yet
computers are good at processing structured data but
much less effective in handling natural language that
is inherently unstructured. The field of Natural Language
Processing (NLP) [14] aims to narrow this gap, as it
focuses on how machines can understand and manage
natural language text to execute useful tasks for end users.
A survey on biomedical text annotation tools was
performed taking into account Named Entity Recogni-
tion (NER) tools that can identify biomedical categories,
like gene and protein names, as well as Ontology-Based
Information Extraction (OBIE) tools. Several approaches
and tools were evaluated, including ABNER [15], GATE
[16], UIMA [17], NCBO BioPortal [18], MetaMap [10],
AutoMeta, KIM, ONTEA [19], and finally SOBA and iDo-
cument [20] which do not support annotation with mul-
tiple ontologies or clinical text at all.
MetaMap is worthy of note as a state-of-the-art tool
and the de-facto standard for biomedical annotation.
Table 1 An example of a resource and its description
Name: GeneTalk
Summary: GeneTalk, a web-based platform, that can filter, reduce and prioritize human sequence variants from NGS data and assist in the time consuming and costly interpretation of personal variants in clinical context. It serves as an expert exchange platform for clinicians and scientists who are searching for information about specific sequence variants and connects them to share and exchange expertise on variants that are potentially disease-relevant.
Tags (Principal bioinformatics methods): Genetic variation annotation, Sequence variation analysis, Variant Calling, Structural variation discovery, Filtering, Annotation, Database, Exome analysis, Sequence analysis, Variant Classification, Viewer
Link: http://seqanswers.com/wiki/GeneTalk
Input (format): VCF
Output (format): VCF, XLS, XLSX
Category: Sequence Analysis
This tool maps text to the Unified Medical Language System (UMLS) [21] Metathesaurus concepts. MetaMap can identify 98.19 % of the biomedical concepts in a text, including 78.79 % of the concepts that could not be identified manually [22]. The tool's inefficiencies are mainly due to missing entries in UMLS; furthermore, relationships between concepts, multi-word concept entries, words with punctuation and spelling mistakes in the text are not recognized and dealt with. Therefore, minor orthographic or syntactic errors in a sentence cannot be detected. In addition, MetaMap supports only concept recognition, and only for specific ontologies. On the other hand, cTAKES, an Apache open source NLP system, implements rule-based and machine learning methods. The tool exhibits reasonable performance which was nevertheless inferior to that achieved by MetaMap [23].
The purpose of the study
Given the complexities mentioned above, the aim of
the present study is to i) investigate the use of seman-
tics for the annotation of biomedical resources with do-
main specific ontologies and ii) exploit NLP methods in
empowering the non-IT expert users to efficiently
search for biomedical resources using natural language.
Our specific focus is to capitalize on existing research
results and extend these with the objective of providing to
users, especially physicians, the opportunity to represent
their queries in natural language and to dynamically dis-
cover and retrieve suitable candidate computational re-
sources, with the aid of information extraction algorithms
guided by specific domain ontologies.
In achieving the stated objectives, we introduce a semantic biomedical resource discovery framework based on NLP. A high level architecture of the developed framework is shown in Fig. 1. The clinician can import a research question in natural language (English) through a web interface. Then, the interpreter receives the clinical question as input and parses the text using NLP techniques guided by the existing domain ontologies [24]. The objective at this step is to infer the question's meaning by locating ontological terms important in the clinical domain of interest. The results of this step are then matched to a set of predefined patterns that produce a low level query to the repository of biomedical tools and other resources. When this query is executed, the repository returns the list of tools or custom pipelines that possibly answer the initial question of the user.
In developing and evaluating our semantic biomedical
resource discovery framework we have specifically fo-
cused on the following questions:
Do existing biomedical ontologies, as well as current
NLP tools, suffice for the creation of a domain-specific
annotation system?
Fig. 1 The architecture of the framework: 1) tool registration, 2) tool annotation, 3) the user's question in natural language and NLP processing, 4) forming and sending the query, and 5) retrieving the results (related tools)
Are existing biomedical annotation systems acceptable
and satisfactory for the Semantic Annotation of
biomedical concepts guided by ontologies?
Can an interpreter translate natural language clinical
questions into targeted queries using patterns and
ontology terms?
In the following sections we describe details of the
framework design and implementation, provide evalu-
ation details and results, and conclude with a discussion
and future work.
Methods
The proposed framework was designed and implemented within the European Commission project p-medicine [25] as the project's workbench, an end-user application that is effectively a repository of tools for use by clinicians. It also follows exploratory work that has taken place in the context of the ContraCancrum EC funded project [26]. The objective of the workbench is to boost the communication and collaboration of researchers in Europe through the machine-assisted sharing of expertise.
In more detail, the proposed framework initially performs an NLP processing step. The user's clinical question is split into tokens, i.e. the words and punctuation that constitute the sentence; the tokens are lemmatized, i.e. the words are mapped to their roots (lemmas), and finally each token is matched to a specific Part of Speech (POS) of the English grammar. During this pre-processing task, NER, the task that organizes text elements into predefined categories, is also performed for each token and assigns the token to a specified category (e.g. a gene symbol). For the NER process we used the clinical, biomedical and pharmaceutical semantic types used in MetaMap (Additional file 1: Table S5). Subsequently the system communicates with the Concept Recognizer (lower part of Fig. 1) and extracts ontology terms, concepts and semantic types. Using the ontology terms and their semantic types, the system identifies predefined patterns and delivers focused queries to the repository of resources for efficient discovery [27]. The identified tools/services are then presented to the end user.
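For illustration, the sketch below shows the kind of pre-processing pipeline described above (tokenization, POS tagging, lemmatization and a simple category lookup standing in for NER). It uses the NLTK toolkit rather than the Stanford CoreNLP suite actually employed by the framework, and the NER_CATEGORIES gazetteer is a hypothetical stand-in for the MetaMap semantic types.

```python
# A minimal sketch of the pre-processing step, assuming NLTK instead of the
# Stanford CoreNLP toolkit used by the framework.
# Requires: nltk.download('punkt'), nltk.download('averaged_perceptron_tagger'),
#           nltk.download('wordnet')
import nltk
from nltk.stem import WordNetLemmatizer

# Hypothetical mini-gazetteer standing in for the MetaMap semantic types;
# the real system obtains these categories from MetaMap.
NER_CATEGORIES = {"carboplatin": "Drug", "nephroblastoma": "Disease"}

def preprocess(question: str):
    """Tokenize, POS-tag, lemmatize and NER-tag a clinical question."""
    lemmatizer = WordNetLemmatizer()
    tokens = nltk.word_tokenize(question)
    tagged = nltk.pos_tag(tokens)                 # Part-of-Speech tagging
    result = []
    for token, pos in tagged:
        lemma = lemmatizer.lemmatize(token.lower())
        category = NER_CATEGORIES.get(lemma)      # crude NER lookup
        result.append({"token": token, "lemma": lemma,
                       "pos": pos, "ner": category})
    return result

if __name__ == "__main__":
    for t in preprocess("John has been treated with carboplatin."):
        print(t)
```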
The architecture of the framework, as shown in Fig. 1, integrates three main components: (i) the resource and metadata repositories, (ii) the semantic annotator (Concept Recognizer) and (iii) the intelligent engine (Interpreter) which interacts with the non-IT end user.
The following sections provide more elaborate details
on the implementation and functioning of all the sub-
components of the framework.
Concept Recognizer: the semantic annotator
The core of the system is the so-called Concept Recognizer, which is used by most of the components of the framework. The objective of this component is, given the free text formulation of the user's query, to extract the important parts, using n-grams [28], that refer to or designate known ontology terms, in order to match them to the patterns that express the physicians' needs. From these patterns a query is formed and forwarded to the tools repository, which finally returns the appropriate tools/services for the user's task.
The Concept Recognizer integrates two special domain ontologies: the EDAM ontology for the software domain and the UMLS biomedical ontologies for the biomedical domain. For both domains a specific concept recognizer has been implemented:
EDAM (originally from EMBRACE Data and
Methods) is an ontology for annotation of
bioinformatics tools, resources and data. Its
design principles are bioinformatics specific with
well-defined scope, relevant and usable for users
and annotators, and maintainable. EDAM is applicable to organizing and finding suitable tools or data, and to automating their integration into complex applications or workflows. EDAM has been
already successfully used in other systems like
BioXSD [29] and Bio-jETI [30]. The EDAM concept
recognizer was implemented using the Apache Solr
[31] full text search server. For each term in the
EDAM ontology a JSON-formatted file was created,
with specific fields that were subsequently imported
to Solr. The different fields give the ability to use
different weights at search time. The weight formula
that was used is biased to the id and the name of the
term. This implies that if the searched text matches
with the id or the name of a term, then this term is
assigned a better score than if the text matched to the
definition or the comment fields. The formula of the
custom made EDAM weight is:
id (10) + name (10) + synonym (6) + subset (3) + isa (3) + def (2) + comment (1)
The MetaMap concept recognizer identifies and
annotates medical terms based on terminological
resources included in UMLS Metathesaurus.
MetaMap integrates two different NLP servers; the
Semantic Knowledge Representation (SKR) server
which combines contextual information with lexical
information to improve the tagging accuracy and the
Word Sense Disambiguation (WSD) server which
involves the determination of the meaning and
understanding of words. The semantic types of the
UMLS and the WSD server are used to classify the
terms in certain categories in order to acquire a
specific meaning.
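To illustrate how the EDAM weight formula shown above can be applied at query time, the following sketch issues a weighted query to an Apache Solr server using eDisMax query-time field boosts. The host, port and core name ("edam") are assumptions made for the example; the field names and boosts mirror the weight formula.

```python
# A minimal sketch of a weighted EDAM term search against Solr.
# The deployment URL is an assumption; field boosts follow the weight formula.
import requests

SOLR_URL = "http://localhost:8983/solr/edam/select"  # assumed deployment

def search_edam(text: str, rows: int = 5):
    params = {
        "q": text,
        "defType": "edismax",
        # field^boost mirrors: id(10) + name(10) + synonym(6) + subset(3)
        #                      + isa(3) + def(2) + comment(1)
        "qf": "id^10 name^10 synonym^6 subset^3 isa^3 def^2 comment^1",
        "rows": rows,
        "wt": "json",
    }
    response = requests.get(SOLR_URL, params=params, timeout=10)
    response.raise_for_status()
    return response.json()["response"]["docs"]

# Example: search_edam("gene expression pathway analysis")
```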
The Concept Recognizer integrates the MetaMap and the EDAM concept recognizers, and also reports the semantic types of the matched terms, which provides a consistent categorization of all concepts present in the UMLS Metathesaurus or in the EDAM ontology. UMLS provides more than 100 semantic categories (http://mmtx.nlm.nih.gov/MMTx/semanticTypes.shtml); 58 of them, covering the clinical, medical, biomedical and pharmaceutical domains, were selected, combined and categorized into 23 semantic types that are more general and comprehensible for non-experts. Four more categories from the EDAM ontology were added. The full list of the selected categories is shown in Table 2.
When the end user posts a question, the concept recognizer applies NLP algorithms and invokes both the EDAM and the MetaMap concept recognizers. Subsequently, the results of the two concept recognizers are merged. It should be pointed out that some terms co-exist in both the EDAM ontology and in one or more additional UMLS ontologies. In such cases the concept recognizer merges all the proposed concepts and passes them back to the interpreter as a list of proposed concepts; the interpreter then decides which concept is kept, with emphasis on EDAM (software ontology) terms, because our objective is to identify tools and other software resources. The decision is based on the following rules: 1) if a concept co-exists in the "format of data" branch of EDAM and in the terms of one or more UMLS ontologies, the EDAM term is kept, and 2) if a concept co-exists in more than one ontology term, the term with the highest score is kept.
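A minimal sketch of these two merging rules, assuming concepts are represented as dictionaries with term, source, branch and score fields (the framework's internal representation may differ):

```python
# A sketch of the concept-merging rules; the dict layout is an assumption.
def merge_concepts(candidates):
    """Keep one concept per surface term, preferring EDAM data/format terms."""
    merged = {}
    for concept in candidates:
        key = concept["term"].lower()
        current = merged.get(key)
        if current is None:
            merged[key] = concept
            continue
        # Rule 1: an EDAM 'data'/'format' branch term wins over UMLS terms.
        if concept["source"] == "EDAM" and concept.get("branch") in ("data", "format"):
            merged[key] = concept
        elif current["source"] == "EDAM" and current.get("branch") in ("data", "format"):
            continue
        # Rule 2: otherwise the higher-scoring concept is kept.
        elif concept["score"] > current["score"]:
            merged[key] = concept
    return list(merged.values())
```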
Resource repository and ontologies
A repository of biomedical tools and services was employed that contains semantically annotated biomedical resource descriptions using the same ontologies as the Concept Recognizer. The tools repository of the p-medicine workbench is based on the PostgreSQL [32] database with full text search capabilities.
The repository currently stores information for 502 tools and services that were either developed by the project itself or extracted from different domain specific repositories or from the web, as follows:
- 195 sequence analysis tools and resources from SEQanswers [6],
- 35 biomedical tools from EMBRACE [7],
- 133 bioinformatics tools and resources from Bioconductor [8],
- 75 bioinformatics tools and workflows from myExperiment [4],
- 50 biomedical tools and 50 biology related tools found by searching the web.
The repository also includes a selected set of computational models, exposed as tools, that simulate disease evolution or response to treatment, such as [33] and [34]. The ontological concepts and semantic terms for the description of the tools are generated automatically using scripts that take as input the textual description of the tool (as shown in Table 1) and "feed" the Concept Recognizer, which consequently extracts ontology terms and corresponding semantic categories. Using these annotations, we seek to facilitate more intelligent search results that address what a user is actually looking for, rather than simply returning candidate tools following a keyword matching process.
The tools repository supports three different strategies for resource discovery: (i) full text, i.e. a tool's description is given in plain text; (ii) tags, i.e. user provided concepts and semantic types for the tools and their operations; and (iii) parameters, i.e. the inputs and outputs of a tool are specified. This approach implies that a clinical question can be annotated with ontological concepts and, as a result, the repository can be queried using full text, tags or semantic types of the UMLS ontologies.
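As an illustration of the first strategy, the sketch below runs a full-text query against a PostgreSQL tools repository using the database's built-in text search; the table and column names (tools, name, description) and connection string are assumptions made for the example.

```python
# A sketch of the "full text" discovery strategy over a PostgreSQL repository.
import psycopg2

def full_text_search(question: str, dsn: str = "dbname=tools_repo"):
    query = """
        SELECT name,
               ts_rank(to_tsvector('english', description),
                       plainto_tsquery('english', %s)) AS rank
        FROM tools
        WHERE to_tsvector('english', description)
              @@ plainto_tsquery('english', %s)
        ORDER BY rank DESC
        LIMIT 20;
    """
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(query, (question, question))
            return cur.fetchall()
```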
Interpreter
The Interpreter is the core element of the system and
the bridge between the clinical question and the query
formed for the resource repository. It employs NLP
Table 2 The 27 prime semantic categories: 23 from UMLS semantic types and 4 from EDAM categories
1. Disease  2. Drug  3. Medical Procedure  4. Tissue
5. Biomedical  6. Cell  7. Organism Function  8. Finding
9. Body Part  10. Gene  11. Clinical Attribute  12. Patient
13. Diagnosis  14. Age  15. Molecular Sequence  16. Device
17. Symptom  18. Virus  19. Injury or Poisoning  20. Vitamin
21. Laboratory  22. Food  23. Temporal Concept
24. EDAM Data/Format  25. EDAM Topic  26. EDAM Operation  27. EDAM Identifier
techniques and utilizes the Concept Recognizer module to formulate more focused queries to the repository. The specific NLP techniques employed are based on Stanford CoreNLP [35] version 3.3.0. The Interpreter receives the clinical question as input and executes the analytical steps of tokenization, lemmatization and POS tagging; the resulting tokens are syntactically annotated and can therefore be recognized as entities and matched against the patterns that are connected to the output's objective.
The final step of the NLP operations in the Interpreter applies a query template based on expression matching in order to extract relationship patterns between clinical entities. With these patterns (Table 2) the system identifies and categorizes parts of the input text as input/available data and parts that compose the clinical hypothesis (the clinical question to be answered).
The development of specific patterns aims to identify specific relations within sentences and to support the disambiguation of multiply-annotated words. For example, if a Drug category term and a Disease category term co-exist in the clinical question, as identified by the Concept Recognizer, this matches the combined pattern "Drug for Disease", whose partial meaning could be that the specific Drug is suitable for the specific Disease.
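A minimal sketch of this kind of pattern matching over the semantic categories returned by the Concept Recognizer is given below; the pattern signatures follow Table 3, while the annotation format (a list of term/category pairs) is an assumption made for the example.

```python
# A sketch of matching category sequences against predefined patterns.
PATTERNS = {
    ("Drug", "Disease"): "Drug for Disease",                 # read as "treatment"
    ("Drug", "Disease", "Body Part"): "Drug for Disease in Body Part",
    ("Patient", "Disease"): "Patient has Disease",
}

def match_patterns(annotated_terms):
    """Return the patterns whose category sequence appears, in order, in the question."""
    categories = tuple(cat for _, cat in annotated_terms)
    matches = []
    for signature, name in PATTERNS.items():
        it = iter(categories)
        # a pattern matches if its categories occur as a subsequence of the question
        if all(cat in it for cat in signature):
            matches.append(name)
    return matches

# Example:
# match_patterns([("John", "Patient"), ("carboplatin", "Drug"),
#                 ("lung cancer", "Disease")])
# -> ["Drug for Disease", "Patient has Disease"]
```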
A specific pattern, called a prime category, was created for every semantic category, resulting in 23 categories for the UMLS semantic types and 4 categories for the EDAM types. With these 27 prime/simple categories at hand, 24 new patterns, based on recommendations from experts, were created using combinations of the prime categories (Table 3). These combinations have a special meaning for the clinicians; for example, the pattern "Drug for Disease" relates to the concept of treatment for a physician.
The system, analysing the clinical question given as input, formulates two types of focused queries. The first is based on the tagged terms, their combination and their position in the question. The second is based on the semantic type of the tagged terms, for both input and output terms.
Subsequently, the queries are passed to the tools repository, and two lists of candidate tools are exported: a list of tools that could fully address the clinical question at hand, and a list of pipelined tools that could address the question sequentially. In order to export a ranked list of candidate tools based on correctness and accuracy metrics, the framework ranks the tools using the following scoring mechanism (a code sketch follows the list):
- Every tool or service gains one point for each appearance of an identified term in the description of either the input required or the output produced by the tool.
- Every tool or service gains 0.25 points for each appearance of a term in the functional, textual description of the tool. This means that if a tagged term of the sentence matches a tagged term in the textual description of the relevant tool, a quarter of a point is gained.
- Tools with a score equal to or less than 1 are ignored.
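A minimal sketch of this scoring mechanism, assuming each tool is represented with inputs, outputs and description text fields (the actual repository schema may differ):

```python
# A sketch of the tool-scoring rules described above.
def score_tool(tool, input_terms, output_terms):
    score = 0.0
    for term in input_terms:
        if term.lower() in tool["inputs"].lower():
            score += 1.0        # 1 point per match in the input description
    for term in output_terms:
        if term.lower() in tool["outputs"].lower():
            score += 1.0        # 1 point per match in the output description
    for term in input_terms + output_terms:
        if term.lower() in tool["description"].lower():
            score += 0.25       # 0.25 points per match in the free-text description
    return score

def rank_tools(tools, input_terms, output_terms):
    scored = [(score_tool(t, input_terms, output_terms), t["name"]) for t in tools]
    # tools with a score of 1 or less are discarded
    return sorted([s for s in scored if s[0] > 1], reverse=True)
```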
Furthermore, the tools that match terms both from the given-data sub-sentence in the description of their input and from the clinical-question sub-sentence in the description of their output form a list of tools/services that can individually resolve the clinical question. The remaining tools form a secondary list, i.e. a list of tools that are candidates for the formation of a computational pipeline that could provide a solution to the problem.
For comparison purposes, we also performed a free text query, similar to the search mechanism supported by traditional tools repositories, so that the automated results of our system could be compared to the results of a full text query. The free text query was implemented by inserting the whole clinical question, as free text, into the query interface of the tools repository.
Results
For the evaluation of the framework developed we
followed a case study approach. Expert users and know-
ledge extracted from relevant available resources assisted
us in formulating a series of clinically relevant questions
of increasing complexity, which were the basis for our
evaluation activities. The exact clinical questions and the
results obtained when the proposed framework was ap-
plied are presented in what follows.
For our experiments, the following clinical questions
were used:
Table 3 The list of patterns generated by the combination of prime categories
1. Drug for Disease
2. EDAM Data/Format and EDAM Operation
3. Patient took Drug for Disease
4. Finding with Organism_Function
5. Drug for Disease in Body Part
6. Finding in/with Medical_Procedure
7. Drug for Symptom
8. EDAM_Data in Body_Part
9. Patient has Disease in Body Part
10. Patient took Drug for Disease in Body Part
11. Patient took Drug for Body Part
12. Patient has been in Medical Procedure
13. Patient has Disease
14. Patient has Organism Function
15. Disease in Body Part
16. Drug for EDAM_Data in Body_Part
17. Patient's Finding
18. Drug & Drug for Disease
19. Patient took Vitamin
20. Symptom of Medical_Procedure
21. Patient has Symptom
22. Drug for EDAM_Data
23. Patient ate Food
24. EDAM Data/Format and EDAM Operation and EDAM Data/Format
1. John has lung cancer and has been treated with
carboplatin which is known for toxicology adverse
effects. I would like to find literature and reference
related to such events for the specific drug.
2. I have the miRNA gene expression profile of Anna
which is a nephroblastoma patient. I want to identify
KEGG pathways which are mainly disrupted due to
gene expression.
3. Patient FK is a 1.5 year old boy with bilateral
nephroblastoma and his tumor is unresponsive to
chemotherapy (vincristine, actinomycin-D and
Doxorubicin) with no reduction in tumor size, not
allowing to perform nephron sparing surgery. I would
like to obtain a list of deregulated metabolic pathways
in the tumor from gene expression data in combination
with miRNA data to find possible targets that can be
treated with available drugs.
4. Patient SM is a 3 year old girl with metastatic
nephroblastoma and miRNAs from blood are
analyzed at the time of diagnosis. I would like to
compare the results of miRNAs with miRNAs of the
cohort of patients with metastatic nephroblastoma
that are correlated to histology, treatment response
and outcome to get an individual risk index of the
patient including proposed pathology, treatment
response and outcome.
5. Patient AB is a 5 year old boy just diagnosed with
acute lymphoblastic leukemia, while
immunophenotype and gene expression data as well
as clinical data at the time of diagnosis are known. I
would like to compare his gene expression data with
the group of all patients having the same
immunological phenotype.
6. Patient AB is a 5 year old boy just diagnosed with acute lymphoblastic leukemia, while immunophenotype and gene expression data as well as clinical data at the time of diagnosis are known. I would like to know the difference in gene expression between those predicting relapse and those predicting poor MRD for the different immunophenotypes. The results should be visualized.
We evaluated the system's performance using precision and recall measurements. To measure precision and recall, expert physicians and bioinformaticians together went through the catalog of all the tools available in the repository, read their descriptions, functionalities and capabilities, and manually identified those tools that could answer or partially answer the specific clinical question. We present in detail the results obtained when processing the first two clinical questions as indicative case studies.
The first clinical question is a combination of sentences based on descriptions of clinical trials from the ClinicalTrials.gov registry [36] and the contribution of physicians. It was imported into our system through the web interface (http://calchas.ics.forth.gr/), where it was divided into two specific contexts.
The first sentence represents the available knowledge (given data/statement) of the clinician and mainly correlates to a tool's inputs, i.e. "John has lung cancer and has been treated with carboplatin which is known for toxicology adverse effects.", while the second sentence is the clinical hypothesis, the research question, and is mainly connected to a tool's outputs, i.e. "I would like to find literature and reference related to such events for the specific drug."
A visual representation of the Concept Recognizer annotation of the given data sentence is shown in Fig. 2; a similar annotation exists for the sentence that contains the clinical question. As explained earlier, in the case of co-existing annotations, the system selects the assignment with the higher score.
The domain experts manually searched the tools repository, using the available tool descriptions, and identified the "EUADR - Literature analysis" tool as a resource able to answer the specific clinical question. Table 4 shows the results of the framework for the first clinical question, while Additional file 1: Figure S1 shows the results as presented on the web site of the NLP framework.
As can be seen, the framework identified 23 relevant
tools. Additionally, we performed a free text query, using
the whole sentence as input into the tools repository, in
order to compare the automated results of our system to
those obtained with a full text query (a complete list of the
full text results can be found in Additional file 1: Table S2).
The framework was able to identify tools that could
individually address the clinical question. Such tools are
Fig. 2 Annotation example from the Concept Recognizer. The annotation produced by the Concept Recognizer for the given data sentence "John has lung cancer and has been treated with carboplatin which is known for toxicology adverse effects"
listed in Table 4, and include the cBio Cancer Genomics Data Server (CGDS) API [37], the National Cancer Institute SEER API [38] and "EUADR - Literature analysis" [39]. EUADR is the only tool selected by the domain experts as appropriate for answering the clinical question. A detailed description of these three tools can be found in Additional file 1: Table S1.
There were additional tools identified that partially
matched either the input or the output description; the
framework performs a check of the output data types of
the candidate tools for answering the input sentence and
the input data types of the candidate tools that could solve
the output sentence. For every data type match, a pro-
posed pipeline is created; this implies that the user could
use the first tool, and then provide its output as an input
to the second tool, and so on in order to obtain an answer
to the entire clinical question.
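A minimal sketch of this data-type matching step, assuming each candidate tool lists its input and output formats (the field names are assumptions for illustration):

```python
# A sketch of pipeline formation: a pipeline is proposed whenever a tool that
# matches the given-data sentence produces an output format that a tool
# matching the question sentence accepts as input.
def propose_pipelines(input_side_tools, output_side_tools):
    pipelines = []
    for first in input_side_tools:
        for second in output_side_tools:
            # e.g. first produces "VCF" and second consumes "VCF"
            shared = set(first["output_formats"]) & set(second["input_formats"])
            if shared:
                pipelines.append((first["name"], second["name"], sorted(shared)))
    return pipelines
```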
In addition, we analysed the results of the queries
and measured the precision and recall of the results,
as shown in Table 5: Precision and recall for the first
clinical question.
Precision is the fraction of retrieved tools that are in-
deed relevant, while recall is the fraction of relevant
tools that are indeed retrieved [40]. Both precision and
recall are therefore based on an understanding and
measure of relevance in our results. In order to measure
the precision and recall of the automated results, domain experts manually identified 76 tools that could answer, individually or as part of a computational pipeline, the specific clinical question. Among them, the "EUADR - Literature analysis" tool was able to answer the specific clinical question by itself. The rest of the tools could only provide partial solutions, meaning that two or more should be pipelined to obtain an answer.
In our first case study the true positive elements, i.e. elements that were correctly selected by the system, number 11, while the false positive elements, i.e. elements that were wrongly selected, number 0, and the false negative elements, i.e. elements that were correct but not selected, number 65 (76 - 11). This results in 100 % precision and 14 % recall.
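As a quick arithmetic check of these figures, using the standard definitions of precision and recall:

```python
# Verifying the reported figures for the first clinical question:
# 11 true positives, 0 false positives, 76 - 11 = 65 false negatives.
def precision_recall(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

p, r = precision_recall(tp=11, fp=0, fn=65)
print(f"precision = {p:.0%}, recall = {r:.0%}")   # precision = 100%, recall = 14%
```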
Table 4 Results of the first clinical question. The results given by the framework for the first clinical question. The individual tools that could solve the entire clinical question are listed first, followed by the tools that could be combined, i.e. pipelined, to provide an answer to the given clinical question.

Unique Tools List
  Score 4.75 = 3 (in) + 1 (out) + 0.75 (tag) | National Cancer Institute SEER API | Identified (query): carboplatin & cancer (in), cancer (in), lung cancer (in), drug (out)
  Score 4 = 3 (in) + 1 (out) | cBio Cancer Genomics Data Server (CGDS) API | Identified (query): carboplatin & cancer (in), cancer (in), lung cancer (in), find (out)
  Score 4 = 1 (in) + 3 (out) | EUADR - Literature analysis | Identified (query): adverse effects (in), drug-references (out), drug (out), literature (out)

Pipeline Tools List (first tool -> second tool)
  National Cancer Institute caDSR API -> AIDSinfo API
  China Cancer Database API -> AIDSinfo API

Single Tools List
  3.75 = 3 (in) + 3*0.25 (tag): The Cancer Genome Atlas API
  3.75 = 3 (in) + 3*0.25 (tag): China Cancer Database API
  3 (in): National Cancer Institute caDSR API
  3 (in): MuTect
  2.25 = 1 (out) + 5*0.25 (tag): Lexicomp API
  2 (out): Arabidopsis thaliana Microarray Analysis
  2 (out): Pathways and Gene annotations for QTL region
  2 (out): SciBite API
  2 (out): DGIdb API
  2 = 4*0.25 (tag): DailyMed API
  2 = 4*0.25 (tag): Aetna CarePass API
  2 = 4*0.25 (tag): National Institute on Drug Abuse Drug Screening Tool API
Table 5 Precision and recall for the first clinical question. Precision and recall of the automated resource discovery in attempting to find solutions to the first clinical question, as compared to results manually identified by domain experts based on the description of the tools.
  Free Text: 164 tools identified; precision 40 %; recall 73 %; best rank of a tool that can solve the question at once (no pipelines): 3 out of 164
  NLP Framework: 11 tools identified; precision 100 %; recall 14 %; best rank: 1st
As seen, the framework retrieved tools from the repository with a precision of 100 %, although the system might not have exported all the suitable tools - tools that could solve the question partially or at once - for the clinical question (i.e. it has low recall). On the other hand, what we feel is important is the fact that all identified tools are appropriate candidates for answering the clinical question. In contrast, the free text query had higher recall but much lower precision, meaning that many irrelevant tools were exported. We further discuss these findings in the discussion section.
We subsequently employed our framework with the clinical question "I have the miRNA gene expression profile of Anna which is a nephroblastoma patient. I want to identify KEGG pathways which are mainly disrupted due to gene expression." Domain experts again searched the tools repository and manually discovered that the specific question could be answered by the "mirPath" [41] or "miRNApath" [42] tools; it could also be answered with a combination of tools which had to contain the "mirtarbase" [43] tool and the "MinePath" [44] tool. Specifically, a clinician should first use the "mirtarbase" tool and provide its output as an input to the "MinePath" tool in order to resolve the full clinical question at hand.
We submitted the clinical question to the framework and a list of proposed tools suitable for the solution was exported. The free text query was also invoked, in order to compare the framework's results with those of the full text query. The results of the framework for the specific question are shown in Table 6.
The "mirtarbase", "mirPath" and "miRNApath" tools were identified by the framework as the top ranked tools appropriate for individually answering the clinical question. Of these tools, "mirPath" and "miRNApath" were also selected by the domain experts. Details about these three tools can be found in Additional file 1: Table S3. The "mirtarbase" tool was identified incorrectly as a candidate, while the "MinePath" tool was also incorrectly identified as one of the tools that could partially answer the clinical question. Additional tools were identified as candidates for a partial answer to the question, i.e. appropriate for solving the input or the output sentence; these tools could again form a pipeline in order to answer the whole clinical question. From these tools, the domain experts identified only one potential pipeline, using the "mirtarbase" and "MinePath" tools. The results of applying our NLP framework to the second clinical question are shown in Table 6. The framework identified 17 relevant tools. We also compared the results with a full text search (the complete list of the full text search results can be found in Additional file 1: Table S4).
The results of the queries were analysed and measured
as shown in Table 7. Domain experts manually identified
Table 6 Results for the second clinical question. The results given by the framework for the second clinical question. The individual tools that could solve the entire clinical question are listed first, followed by the tools that could be combined, i.e. pipelined, to provide an answer to the given clinical question.

Unique Tools List
  Score 3 = 1 (in) + 2 (out) | miRNApath | Identified (query): mirna (in), gene expression (out), kegg pathways (out)
  Score 3 = 1 (in) + 2 (out) | mirPath | Identified (query): mirna (in), gene expression (out), kegg pathways (out)
  Score 3 = 1 (in) + 2 (out) | mirtarbase | Identified (query): mirna (in), gene expression (out), kegg pathways (out)

Pipeline Tools List
  No results found in this category for the given question.

Single Tools List
  4 (out): Get Pathway-Genes and gene description by Entrez gene id
  4 (out): Arabidopsis thaliana Microarray Analysis
  4 (out): MinePath
  4 (out): EnrichNet API
  4 (out): NCBI Gi to Kegg Pathway Descriptions
  4 (out): MitoMiner API
  4 (out): BiologicalNetworks API
  4 (out): From cDNA Microarray Raw Data to Pathways and Published Abstracts
  4 (out): HUMAN Microarray CEL file to candidate pathways
  4 (out): ERGO Genome Analysis and Discovery System
  4 (out): BioCyc API
  4 (out): Mouse Microarray Analysis
Table 7 Precision and recall for the second clinical question. Precision and recall of the automated resource discovery in attempting to find solutions to the second clinical question, as compared to results manually identified by domain experts based on the description of the tools.
  Free Text: 231 tools identified; precision 25 %; recall 59 %; best rank of a tool that can solve the question at once (no pipelines): 2 out of 231
  NLP Framework: 17 tools identified; precision 100 %; recall 17 %; best rank: 1st & 2nd
99 tools that could solve, partially or at once, the specific clinical question. The NLP framework demonstrates good precision for this question too. The true positive elements are 17, the false positive elements are 0, and the false negative elements are 82 (99 - 17). This gives us 100 % precision and 17 % recall.
Discussion
This study focused on the development of a Semantic
Biomedical Resource Discovery Framework by making
use of natural language processing techniques. As ori-
ginally stated, the envisioned framework should allow
searching through a set of semantically annotated re-
sources in order to find a match with a user query
expressed as a natural language statement.
In parallel to seeking an answer to our ultimate research
question, a range of additional, more specific research
questions were also established. In the current section we
critically discuss our experiences and the experimental
evidence obtained in the context of those specific research
questions initially established. We would like to stress that
evaluation of the proposed approach used a limited num-
ber of queries. As a result, the present work should be
seen as a case study, providing initial evidence on the val-
idity of the approach. It is obvious that subsequent formal
evaluation should be designed to test the broader effect-
iveness of the system.
Having said this, the experience obtained through the annotation of a large number of resources (Additional file 2) that were brought into our platform for experimentation shows that the range of existing open biomedical ontologies and other open, generic ontologies does suffice for the creation of a domain-specific annotation framework that would be useful for semantic resource annotation. We were able to observe the efficiency of the current software-related ontology, i.e. EDAM, and of the other biomedical ontologies that we used. Hence, we believe that there is no need for the development of a core domain ontology to enable the creation of an annotation framework offering the capability to capture the context of complex biomedical resources. Rather, the challenge lies in the articulate use and integration of various existing biomedical and other related ontologies. This, nevertheless, remains a scientifically and often technically demanding task.
Our work performing NLP processing on complex biomedical text reaffirmed the various challenges identified in prior research, namely: i) clinical text has uncommon structure and content that are not always guided by grammar, syntax or spelling rules [45]; ii) biomedical terms are prone to ambiguity; a word may have multiple meanings or many words may have the same meaning [46], and temporal ambiguity also exists, confusing past or future diagnoses or medical history; iii) clinical content is full of abbreviations and titles that confuse the detection of a sentence's boundary [45]; iv) negations are very common in clinical text, such as "no", "without", "not" and "denies" [47].
Although these challenges were evidently present in our experimentation, the range of existing NLP tools is also large. Numerous NLP packages have been developed, such as Python NLTK, OpenNLP, Stanford NLP and LingPipe. In our work we selected the probabilistic Stanford NLP tools, where the corpus data are gathered and manually annotated and a model is then trained to predict annotations based on words and their contexts through weights. The selected NLP tools, with minor extensions and customization, have proven adequate for supporting the NLP tasks of our work.
In the context of our research a limited number of clinical questions were examined. For the first research question, presented in detail in this manuscript, the framework identified the pattern <Drug> for <Disease>, which has the specific meaning of "treatment" for clinicians. According to the given input sentence, we managed to identify patterns from the combination of the annotated tagged terms of the sentence. Many more patterns can be formed and can enrich the framework in the future, depending on different kinds of domain searches and the distinct meanings they carry for physicians.
Domain experts explored the tools repository and
manually identified 76 tools and services (out of 502)
that could provide an answer to the clinical question;
some of those could give a solution individually, while
others could partially solve the question.
The second clinical question presented in this manuscript led us to the matching patterns "Patient has Disease" and "EDAM Topic for EDAM Data".
The user seeks to find the disrupted KEGG pathways according to the profile of a patient who has nephroblastoma. A tool, a service or a pipeline of tools is needed to resolve this question. The domain experts manually selected 99 tools and services that could be part of the solution space. The framework's results showed 100 % precision and were fewer than the tools selected by the domain experts. In addition, the free text query exported 231 tools and identified only 2 of the tools that, according to the domain experts, can solve the entire clinical question.
Furthermore, in relation to execution performance, the framework proved able to respond fast enough to be used as an online search engine for biomedical tools. The response times for different clinical questions vary from 1.5 to 7.5 s, which is acceptable for a web application. More specifically, the response time for the first clinical question is 3993 milliseconds and for the second clinical question 7038 milliseconds. In our current implementation we use ten ontologies, but the framework can be extended to use more ontologies, either from
UMLS or as new systems (concept recognizers) using the Solr implementation of EDAM. Initial evidence indicates that the proposed framework is scalable and can be expected to remain responsive in real time even with tens of thousands of tools in the repository and with many more ontologies, because the concept recognizer and the queries to the repository are based on elastic search, which is suitable even in more demanding domains such as big data applications [48].
Future work
We plan to extend the framework and provide end users with options to create and import, through the web interface, new patterns that may be needed and do not already exist. We are also exploring methodologies for personalized preferences, with classification based on the user profile [49] and a "voting" mechanism on the retrieved results, in order to improve accuracy within similar user groups. Another direction under investigation is to enrich the framework with graph theory capabilities and provide the end user with possible workflows [50] for the solution of the research question. Such methodologies have proved valuable in service discovery [51] and scientific workflow composition [13, 52-54]. Taking advantage of the modular implementation and the rich metadata schema of the NLP framework, we expect to provide meaningful pipelines as guidelines for complex clinical questions. Again, high quality annotation of the tools in the repository is mandatory for accurate results.
In addition, the implementation of this framework will be expanded with even more patterns, focused on all the possible combinations of the semantic categories of the EDAM software ontology and the clinical ontologies of the UMLS Metathesaurus, in order to create more accurate patterns with clinical meanings, taking into account the ontology-based relationships of concepts and how they map to similar structures in natural language expressions, as we expect that the tools repository will soon host thousands of software resources. We also plan to add negation detection patterns to identify diagnoses and symptoms that are negated. In that direction, we will evaluate and possibly use opinion mining methodologies able to categorize the polarity of a text, i.e. whether a sentence or word is positive, negative or neutral. Solutions like "Crowd Validation" [55], which examine and determine opinions, perceptions and approaches, along with NLP methodologies for ontology management and query processing [56-58], will possibly be used.
In addition, the key functions of clinical decision support systems require understanding of the context from which an event or a named entity is extracted. For example, supporting clinical diagnosis and treatment processes with best evidence will require not only recognizing a clinical condition, but also determining whether the condition is present or absent. Chapman et al. [59] developed a dedicated algorithm, ConText, for identifying three contextual features: Negation (for example, "no pneumonia"); Historicity (the condition is recent, occurred in the past, or might occur in the future); and Experience (the condition occurs in the patient or in someone else, such as "parents abuse alcohol"). In many cases it is also desirable to detect the degree of certainty in the context (for example, "suspected pneumonia"). Although significant results in relation to the topic of context exist - e.g. Solt and colleagues [60] described an algorithm for determining whether a condition is absent, present, or uncertain - it is our view that the related issues will continue to present researchers with challenges.
Another challenge for the future relates to multilin-
gualism; taggers, parsers and lexicons for additional lan-
guages, apart from English, could be added into the
system and provide a service discovery framework for a
multilingual setting.
Additionally, the system could eventually be extended into a question-answering system. A question driven approach could upgrade the system to the next step of an intelligent service discovery system, by asking the user questions derived from the input sentence. In this way the system could provide the user with services or tools in a pipeline that could be used in sequence to implement the desired process.
Furthermore, in the context of the p-medicine EC project, a thorough usability evaluation of the system by end users has been scheduled in order to assess the usability and acceptance of the framework.
Conclusions
The ultimate objective of this work has been to investigate the use of semantics for the annotation of biomedical resources with domain specific ontologies, and to exploit Natural Language Processing methods to empower non-Information Technology expert users to efficiently search for biomedical resources using natural language.
As part of this case study, we have successfully implemented a web based framework able to interact with the end user through natural language for biomedical resource discovery in real time. The user describes in natural language the facts (data) and the research question, which are analysed with NLP techniques and annotated with clinical and software ontologies in order to form specific queries for the tools repository and finally retrieve tools/services that address the clinical question.
The results obtained showed that the system has high precision and low recall, which means that the returned results are predominantly relevant. There were input queries that achieved 100 % precision, meaning that the exported resources were all correct; the system may not have retrieved all
the tools that could solve the clinical question, but all the retrieved tools were suitable for addressing the query.
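For reference, the evaluation measures follow the standard definitions (cf. [40]), where TP, FP and FN denote the true positive, false positive and false negative tool retrievals with respect to the experts' manual selection:

```latex
% Standard definitions of the evaluation measures (TP: true positives,
% FP: false positives, FN: false negatives).
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}
```

High precision with lower recall therefore means that few irrelevant tools are returned (small FP), even though some relevant tools are missed (non-zero FN).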
Research on web search engines indicates that 91 % of searchers do not go past page one (the ten top-ranked results) of the search results and that over 50 % do not go past the first 3 results on page 1 [61]. We expect that the same behaviour holds for tools repositories too, since web search has an impact on the way we search for and retrieve information [62]. Having a system that is able to narrow down the retrieved results with 100 % precision and provide a good ranking would therefore be valuable for end users, especially those who stick to the top-ranked results and neglect the rest.
Comparing MetaMap with clinical annotators such as GATE (General Architecture for Text Engineering) [16], Apache Stanbol IKS (Interactive Knowledge Stack) [63], the NCBO Annotator (BioPortal) [64] and ConceptMapper [65], we concluded that MetaMap is the best biomedical concept recognizer for our needs: it offers a RESTful API and can draw on the many clinical ontologies connected to it, whereas the other clinical annotators could not be loaded with the large number of ontologies that our use case required.
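As an illustration of this integration pattern, the sketch below posts free text to a concept-annotation REST service and reads back the recognized concepts; the endpoint URL and the assumed response fields are placeholders, not the actual MetaMap/SKR API specification.

```python
# Hedged sketch of posting free text to a concept-annotation REST service
# and reading back the recognized concepts. The endpoint URL and the assumed
# response fields are placeholders, not the actual MetaMap/SKR API.
import requests

ANNOTATOR_URL = "http://example.org/annotate"  # placeholder endpoint

def recognize_concepts(text: str) -> list:
    """Send free text to the annotation service; the response is assumed to
    be a JSON list of {"term", "cui", "semtype"} objects."""
    response = requests.post(ANNOTATOR_URL, data={"text": text}, timeout=30)
    response.raise_for_status()
    return response.json()

for concept in recognize_concepts("gene expression analysis of breast cancer"):
    print(concept.get("term"), concept.get("cui"), concept.get("semtype"))
```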
We must accept that searching with ontology terms provided better results than searching with the semantic types of those terms. This is, firstly, an effect of the tag queries' dependence on the tagged terms identified in the sentence, which are also terms appearing in the tools' descriptions, and, secondly, due to the fact that the semantic types of the terms may not be found in the descriptions at all; even when they are, they carry only a minor score priority.
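This prioritisation can be expressed directly in the repository query; the sketch below boosts exact term matches over semantic-type matches, with the field names and boost factors being assumptions for this example rather than the framework's actual index schema.

```python
# Illustrative sketch of boosting ontology-term matches over semantic-type
# matches in a Solr-style query. Field names ("description", "semantic_type")
# and boost values are assumptions for this example.
def build_boosted_query(terms, semantic_types):
    term_clauses = [f'description:"{t}"^4' for t in terms]            # dominant weight
    type_clauses = [f'semantic_type:"{s}"' for s in semantic_types]   # minor weight
    return " OR ".join(term_clauses + type_clauses)

print(build_boosted_query(["microRNA", "pathway"], ["Gene or Genome"]))
# description:"microRNA"^4 OR description:"pathway"^4 OR semantic_type:"Gene or Genome"
```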
The proposed NLP framework has the potential to help physicians practice advanced ICT-supported medicine and to improve the quality of patient care. To our knowledge, a number of tools can handle clinically relevant information for specific questions, but they are limited to providing summaries of the literature, such as AskHERMES [11]. It is also known that a plethora of databases and specialized repositories exist and that numerous clinical and biomedical tools have been indexed in them, but the search mechanisms and the technical terminology used discourage clinicians from using them.
We implemented a web-based framework that takes advantage of domain-specific ontologies and NLP in order to empower non-IT users to search for biomedical resources using natural language. The proposed framework bridges the gap between a clinical question and efficient, dynamic biomedical resources discovery. Given the experience gained during the design, implementation and set-up of such a framework, we can safely conclude that the existing biomedical ontologies, NLP tools and biomedical annotation systems are adequate for the implementation of such a framework.
Additional files
Additional file 1: contains extensive results of the experiments.
(PDF 803 kb)
Additional file 2: contains the list of tools and resources that were
in our repository. (PDF 964 kb)
Abbreviations
IT: Information Technology; NLP: Natural Language Processing; POS: Part Of
Speech; NER: Named Entity Recognition; OBIE: Ontology-Based Information
Extraction; UMLS: Unified Medical Language System; SKR: Semantic Knowledge
Representation; WSD: Word Sense Disambiguation.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
MT, SS and LK conceived of and designed the framework. NG as end user
guided the design and implementation. SS implemented the tools and
metadata repositories, LK implemented the concept recognizer mechanism,
PS implemented the interpreter, GI implemented the web based front-end.
LK, GZ and PS conducted the integration of the framework; GZ is also the administrator of the server. LK and PS conducted the experiments.
All authors contributed to the manuscript. All authors read and approved the
final manuscript.
Authors' information
Not applicable.
Availability of data and materials
Not applicable.
Acknowledgements
This work was supported by the EU funded research projects p-medicine
(http://www.p-medicine.eu/) and iManageCancer (http://imanagecancer.eu/)
that aim to design a semantically aware computational platform in support
of personalised medicine.
Funding
This project was funded by the European Commission under contracts
H2020-PHC-26-2014 No. 643529 (iManageCancer project) and FP7-ICT-
2009.5.3 No 270089 (p-medicine project).
Author details
1 Foundation for Research and Technology Hellas (FORTH), Institute of Computer Science, N. Plastira 100, Vassilika Vouton, Heraklion, Crete, Greece.
2 Department of Informatics Engineering, Technological Educational Institute, Heraklion, Crete, Greece.
3 Paediatric Haematology and Oncology, Saarland University Hospital, Homburg, Germany.
Received: 9 March 2015 Accepted: 21 September 2015
References
1. Zhu F, Patumcharoenpol P, Zhang C, Yang Y, Chan J, Meechai A, et al.
Biomedical text mining and its applications in cancer research. J Biomed
Inform. 2013;46:200–11.
2. Meystre S, Haug JP. Natural language processing to extract medical
problems from electronic clinical documents: Performance evaluation.
J Biomed Inform. 2006;39(6):589–99.
3. Wolstencroft K, Haines R, Fellows D, Williams A, Withers D, Owen S, et al.
The Taverna workflow suite: designing and executing workflows of Web
Services on the desktop, web or in the cloud. Nucleic Acids Res.
2013;41(W1):557–61.
4. Goble CA, Bhagat J, Aleksejevs S, Cruickshank D, Michaelides D, Newman D,
et al. myExperiment: a repository and social network for the sharing of
bioinformatics workflows. Nucleic Acids Res. 2010;38(2):677–82.
5. Bhagat J, Tanoh F, Nzuobontane E, Laurent T, Orlowski J, Roos M, et al.
BioCatalogue: a universal catalogue of web services for the life sciences.
Nucleic Acids Res. 2010;38(2):W689–94.
6. Li JW, Schmieder R, Ward M, Delenick J, Olivares EC, Mittelman D.
SEQanswers: an open access community for collaboratively decoding
genomes. Bioinformatics. 2012;28(9):1272–3.
7. Pettifer S, Ison J, Kalas M, Thorne D, McDermott P, Jonassen I, et al. The
EMBRACE web service collection. Nucleic Acids Res. 2010;38(2):683–8.
8. Gentleman R, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, et al.
Bioconductor: open software development for computational biology and
bioinformatics. Genome Biol. 2004;5(10):R80.
9. National Library of Medicine. ORBIT: Online Registry of Biomedical
Informatics Tools. [Internet]. 2011 [cited 2013].
10. Simpson MS, Demner-Fushman D. Biomedical Text Mining: a survey of recent progress. In: Mining text data. Springer US; 2012. p. 465–517.
11. Cao Y, Liu F, Simpson P, Antieau L, Bennett A, Cimino JJ, et al. AskHERMES: An online question answering system for complex clinical questions. J Biomed Inform. 2011;44(2):277–88.
12. Cao Y, Cimino JJ, Ely J, Yu H. Automatically extracting information needs
from complex clinical questions. J Biomed Inform. 2010;43:962–71.
13. Koumakis L, Moustakis V, Potamias G. Web Services Automation. New York:
Hershey Information Science Reference; 2009. p. 239–57.
14. Friedman C, Rindflesch TC, Corn M. Natural Language Processing: state of
the art and prospects for significant progress, a workshop sponsored by the
National Library of Medicine. J Biomed Inform. 2013;46(5):765–73.
15. Settles B. ABNER: an open source tool for automatically tagging genes,
proteins and other entity names in text. Bioinformatics. 2005;21(14):3191–2.
16. Cunningham H. GATE, a general architecture for text engineering. Comput
Hum. 2002;36(2):223–54.
17. Ferrucci D, Lally A. UIMA: an architectural approach to unstructured information processing in the corporate research environment. Nat Lang Eng. 2004;10(3–4):327–48.
18. Clement J, Nigam SH, Cherie YH, Musen MA, Callendar C, Storey MA. NCBO
Annotator: Semantic Annotation of Biomedical Data. International Semantic
Web Conference, Poster and Demo session. 2009.
19. Belloze KT, Monteiro DISB, Lima TF, Silva-Jr FP, Cavalcanti MC. An Evaluation
of Annotation Tools for Biomedical Texts. ONTOBRAS-MOST. 2012. p. 108–119.
20. Wimalasuriya DC, Dejing D. Ontology-based information extraction: An
introduction and a survey of current approaches. J Inf Sci. 2010;36(3):306–23.
21. Bodenreider O. The unified medical language system (UMLS): integrating
biomedical terminology. Nucleic Acids Res. 2004;32(1):267–70.
22. Al-Safadi L, Alomran R, Almutairi F. Evaluation of MetaMap performance in radiographic images retrieval. Res J Appl Sci Eng Technol. 2013;22(6):4231–6.
23. Wu Y, Denny JC, Rosenbloom T, Miller RA, Giuse DA, Xu H. A comparative
study of current clinical natural language processing systems on handling
abbreviations in discharge summaries. Am Med Inform Assoc. 2012;2012:997.
24. Sfakianaki P, Koumakis L, Sfakianakis S, Tsiknakis M. Natural language
processing for biomedical tools discovery: A feasibility study and
preliminary results. In: 17th International Conference on Business
Information Systems; 2014; Larnaca, Cyprus
25. P-Medicine EU project web site. [Internet]. 2012 [cited 2015 Mar 08].
Available from: http://www.p-medicine.eu.
26. Marias K, Dionysiou D, Sakkalis V, Graf N, Bohle RM, Coveney PV, et al.
Clinically driven design of multi-scale cancer models: the ContraCancrum
project paradigm. Interface Focus. 2011;1(3):450–61.
27. Schulz M, Krause F, Le Novere N, Klipp E, Liebermeister W. Retrieval,
alignment, and clustering of computational models based on semantic
annotations. Mol Syst Biol. 2011;7(1):512.
28. Brown PF, de Souza PV, Mercer RL, Della Pietra VJ, Lai JC. Class-based n-gram
models of natural language. Comput Linguist. 1992;18(4):467–79.
29. Kalas M, Puntervoll P, Joseph A, Bartaseviciute E, Topfer A, Venkataraman P,
et al. BioXSD: the common data-exchange format for everyday
bioinformatics web services. Bioinformatics. 2010;26(18):540–6.
30. Lamprecht AL, Margaria T, Steffen B. Bio-jETI: a framework for semantics-based
service composition. BMC Bioinformatics. 2009;10(10):S8.
31. Smiley D, Pugh DE. Apache Solr 3 Enterprise Search Server. Packt Publishing
Ltd; 2011.
32. Black S. PostgreSQL: introduction and concepts. Linux J. 2001;2001(88):16.
33. Sfakianakis S, Graf N, Hoppe A, Rüping S, Wegener D, Koumakis L, et al.
Building a System for Advancing Clinico-Genomic Trials on Cancer. George
Potamias Vassilis Moustakis (eds.), 2009. 33.
34. Stamatakos GS, Dionysiou D, Lunzer A, Belleman R, Kolokotroni E, Georgiadi E,
et al. The technologically integrated oncosimulator: combining multiscale
cancer modeling with information technology in the in silico oncology context.
Biomed Health Informatics, IEEE. 2014;18(3):840–54.
35. Manning CD, Surdeanu M, Bauer J, Finkel J, Bethard SJ, McClosky D. The
Stanford CoreNLP Natural Language Processing Toolkit, Proceedings of
52nd Annual Meeting of the Association for Computational Linguistics:
System Demonstrations. 2014. p. 55–60.
36. Hartung DM, Zarin DA, Guise IM, McDonagh M, Paynter R, Helfand M.
Reporting discrepancies between the ClinicalTrials.gov results database and
peer-reviewed publications. Ann Intern Med. 2014;160(7):477–83.
37. Cerami E, Gao J, Dogrusoz U, Gross BE, Sumer SO, Aksoy BA, et al. The cBio
cancer genomics portal: an open platform for exploring multidimensional
cancer genomics data. Cancer Discov. 2012;2(5):401–4.
38. National Cancer Institute SEER API. [Internet]. [cited 2014 Dec]. Available
from: http://www.programmableweb.com/api/national-cancer-institute-seer.
39. EU-ADR Web Platform. [Internet]. [cited 2014 Dec]. Available from: https://
bioinformatics.ua.pt/euadr/Welcome.jsp.
40. Powers D. Evaluation: From Precision, Recall and F-measure to ROC,
Informedness, Markedness & Correlation. J Mach Learn Technol. 2011;2(1):37–63.
41. DIANA miRPath v. 2.0: investigating the combinatorial effect of microRNAs
in pathways. Nucleic Acids Res. 2012;40(W):498–504.
42. Chiromatzo A, Oliveira T, Pereira G, Costa A, Montesco C, DE G, et al.
miRNApath: a database of miRNAs, target genes and metabolic pathways.
Genet Mol Res. 2007;6(4):859–65.
43. Sheng-Da H, Feng-Mao L, Wi-Yun W, Chao L, Wei-Chih H, Wen-Ling C, et al.
miRTarBase: a database curates experimentally validated microRNA–target
interactions. Nucleic Acids Res. 2010;gkq1107.
44. Koumakis L, Moustakis V, Zervakis M, Kafetzopoulos D, Potamias G. Coupling
Regulatory Networks and Microarays: Revealing Molecular Regulations of
Breast Cancer Treatment Responses, Artificial Intelligence: Theories and
Application Lecture notes in Computer Science. 2012. p. 23946.
45. Meystre SM, Savova K, Kipper-Schuler C, Hurdle JF. Extracting Information
from Textual Documents in the Electronic Health Record: A Review of
Recent Research. Yearb Med Inform. 2008;35:128–44.
46. Nadkarni M, Lucila OM, Chapman WW. Natural language processing: an
introduction. J Am Med Inform Assoc. 2011;18(5):544–51.
47. Chapman WW, Bridewell W, Hanbury P, Cooper GF, Buchanan BG. Evaluation
of Negation Phrases in Narrative Clinical Reports. Proceedings of the AMIA
Symposium. American Medical Informatics Association; 2001. p. 105–109.
48. Kononenko O, Baysal O, Holmes R, Godfrey MW. Mining modern
repositories with elastic search. In: ACM, eds. Proceedings of the 11th
Working Conference on Mining Software Repositories; 2014. pp. 328-331.
49. Potamias G, Koumakis L, Moustakis V. Enhancing web based services by
coupling document classification with user profile. In: IEEE, eds. Computer
as a Tool (EUROCON 2005); 2005. p. 205–208.
50. Sfakianakis S, Koumakis L, Zacharioudakis G, Tsiknakis M. Web-based
Authoring and Secure Enactment of Bioinformatics Workflows. In: Grid
and Pervasive Computing Conference. Geneva, Switzerland: IEEE; 2009.
51. Tao Y, Kwei-Jay L. Service selection algorithms for Web services with
end-to-end QoS constraints. Inf Syst E-Business Manag. 2005;3(2):103–26.
52. Kanterakis A, Potamias G, Zacharioudakis G, Koumakis L, Sfakianakis S,
Tsiknakis M. Scientific discovery workflows in bioinformatics: a scenario for
the coupling of molecular regulatory pathways and gene-expression
profiles. Stud Health Technol Inform. 2009;160:1304–8.
53. Koumakis L, Moustakis V, Tsiknakis M, Kafetzopoulos D, Potamias G. Supporting
genotype-to-phenotype association studies with grid-enabled knowledge
discovery workflows. In: IEEE, eds. Engineering in Medicine and Biology
Society, 2009. EMBC 2009. Annual International Conference of the IEEE;
2009. pp. 6958–6962.
54. Zacharioudakis G, Koumakis L, Sfakianakis S, Tsiknakis M. A semantic
infrastructure for the integration of bioinformatics services. In: IEEE, eds.
Intelligent Systems Design and Applications (ISDA'09); 2009. p. 367–372.
55. Cambria E, Hussain A, Havasi C, Eckl C, Munro J. Towards crowd validation
of the UK National Health Service, WebSci'10. 2010. p. 15.
56. Kim JD, Cohen KB. Natural language query processing for SPARQL generation:
A prototype system for SNOMED CT. In: Proceedings of BioLINK. 2013.
p. 328.
57. Cohen KB, Kim JD. Evaluation of SPARQL query generation from natural
language questions. In: Joint Workshop on NLP&LOD and SWAIE: Semantic
Web, Linked Open Data and Information Extraction. 2013. p. 3.
58. Grigonyte G, Brochhausen M, Martín L, Tsiknakis M, Haller J. Evaluating
Ontologies with NLP-Based Terminologies: A Case Study on ACGT and Its
Master Ontology. In: Press I, editor. Formal Ontology in Information Systems:
Proceedings of the Sixth International Conference. 2010. p. 331.
59. Chapman W, Chu D, Dowling J. ConText: An Algorithm for Identifying
Contextual Features from Clinical Text. In Proceedings of the Workshop on
BioNLP 2007: Biological, Translational, and Clinical Language Processing (pp.
81-88). Association for Computational Linguistics.
60. Solt I, Tikk D, Gal V, Kardkovacs Z. Semantic classification of diseases in
discharge summaries using a context-aware rule-based classifier. J Am Med
Inform Assoc. 2009;16(4):580–4.
61. Van Deursen AJ, Van Dijk JA. Using the Internet: Skill related problems in
users' online behavior. Interacting Comput. 2009;21(5):393–402.
62. Bughin J, Corb L, Manyika J, Nottebohm O, Chui M, de Muller Barbat B, et al.
The impact of Internet technologies: Search. McKinsey & Company, High Tech Practice; 2011.
63. Adamou A, Andre F, Christ F, Filler A. Apache Stanbol: The RESTful
Semantic Engine. [Internet]. 2007 [cited 2013 Sept]. Available from:
http://dev.iks-project.eu/.
64. Jonquet C, Shah NH, Musen MA. The open biomedical annotator. Summit
on Translational Bioinformatics. 2009. p. 56–60.
65. Funk C, Baumgartner W, Garcia B, Roeder C, Bada M, Cohen KB, et al.
Large-scale biomedical concept recognition: an evaluation of current
automatic annotators and their parameters. BMC Bioinformatics. 2014;15:59.