
A Biomedical Named Entity Recognition Using Machine Learning Classifiers and Rich Feature Set

Authors:
  • University of Sana'a

Abstract

As the volume of biomedical knowledge in the form of literature grows, there is a rising need for effective natural language processing tools to assist in organizing, curating, and retrieving this information. Named entity recognition (the task of identifying words and phrases in free text that belong to certain classes of interest) is an important first step for many of these larger information management goals. The task becomes more difficult in a specific domain because the entities are particular to that domain. In recent years, much attention has been focused on recognizing mentions of genes, proteins, and other biomedical entities in biomedical abstracts. This study therefore aims to design and develop a biomedical named entity recognition model. A machine learning classification framework is proposed based on Naïve Bayes, K-Nearest Neighbour, and decision tree classifiers. We performed several experiments to empirically compare different subsets of features and the three classification approaches for biomedical named entity recognition. The aim is to efficiently integrate different feature sets and classification algorithms to synthesize a more accurate classification procedure. The results indicate that the K-Nearest Neighbour classifier, trained with suitable features, is better suited to recognizing named entities in biomedical texts than the other models.
IJCSNS International Journal of Computer Science and Network Security, VOL.17 No.1, January 2017
Manuscript received January 5, 2017
Manuscript revised January 20, 2017
Ahmed Sultan Al-Hegami, Ameen Mohammed Farea Othman, Fuad Tarbosh Bagash
University of Sana’a, Yemen, The Arab Academy for Banking and Financial Sciences, European International Education,
Yemen
Key words:
Named entity recognition (NER), learning, classification,
framework, decision tree, recognizing gene, Naïve Bayes,
K-Nearest Neighbour.
1. Introduction
Named entity recognition (NER) is one of the important
tasks in information extraction, which involves the
identification and classification of words or sequences of
words denoting a concept or entity. Examples of named
entities in general text are names of persons, locations, or
organizations. Domain-specific named entities are those
terms or phrases that denote concepts relevant to one
particular domain. For example, protein and gene names
are named entities which are of interest to the domain of
molecular biology and medicine. The massive growth of
textual information available in the literature and on the
Web necessitates the automation of identification and
management of named entities in text [1]. Named entity
recognition is a crucial component of biomedical natural
language processing, enabling information extraction and
ultimately reasoning over and knowledge discovery from
text. Much progress has been made in the design of rule-based
and supervised tools, but they are often genre and task
dependent. As such, adapting them to different genres of text
or identifying new types of entities requires major effort in
re-annotation or rule development [2]. The core approaches to
NER fall into three classes: the rule-based approach, the
machine learning approach, and the hybrid approach. Rule-based
approaches extract names using a set of hand-crafted rules. In
general, these models consist of a number of patterns that use
grammatical features (such as part of speech (POS)), syntactic
features (such as word precedence), and orthographic features
(such as capitalization), together with dictionaries. On the
other hand, rule-based models are not portable, dynamic, or
robust, and the cost of maintaining the rules is high even
when the data changes by only a small amount.
Many researchers now use machine learning techniques for
biomedical NER because they are easy to train and cheaper to
maintain. Machine learning techniques may be classified as
unsupervised, semi-supervised, or supervised. Supervised
techniques used in NER include Support Vector Machines (SVM)
and naïve Bayes.
Beyond the studies mentioned above, there is a great deal of
related work, most of it covering the social media, news, and
medical domains. Studies of biomedical NER, on the other hand,
remain at an early stage. The biomedical domain is chosen for
the initial experiments because of its importance and inherent
challenges.
2. Motivation
In view of the weaknesses inherent in manual searching of text,
it has become imperative to seek more efficient ways to carry
out text mining. The massive volume of biomedical information
stored as electronic documents, which is evidently due to a
substantial increase in scientific research over the years,
has necessitated the use of text mining technology. Searching
and processing information in documents is time-consuming in
many areas, for example the biomedical literature, and is
becoming impractical without computer support. Thus, there is
now a strong need for intelligent text-handling applications
that can replace or support human information exploration in
biomedical text documents.
It has become extremely difficult for biologists to keep up
with the relevant journals in their own discipline, let alone
publications in other, related disciplines [3]. Biomedical
literature is considered a source of authentic medical
knowledge, which is critical for e-health applications. Such
applications have a large commercial prospect. According to the
US National Center for Health Statistics, 51% of US adults had
used the internet to look for health information in 2009 [3].
This commercial prospect has led to the launch of both free
sources and sources that charge a fee for access [3]. Much of
the software hosted on the internet has helped patients to
identify symptoms of diseases, and even adverse drug reactions,
early enough to take first aid measures before experts are
consulted. Biological researchers rely heavily on the knowledge
found in the biomedical literature. For instance, Medline alone
contains more than twenty-two million abstracts in the domains
of medicine, biomedical sciences, laboratory sciences, etc.
Natural Language Processing is an emerging field within text
mining that aims to automate the process of locating and
classifying important information in large unstructured text
collections. This gives the data some form of shape and
structure for ease of human use. The task requires at least a
limited understanding of the text itself and the introduction
of new complex patterns that simulate human information search,
which makes text-mining tasks more complex and challenging than
traditional keyword-lookup based information retrieval tasks.
3. Related Work
Most work on named entity recognition initially focused on the
news domain. However, the features, pre-processing, and
post-processing used in that work are not equally effective on
biomedical text unless domain-specific knowledge and
techniques are incorporated. Biomedical texts are substantially
different from other genres of text (such as newspaper
articles), ranging from the terminology and sentence
construction to the valence and semantics of names, and new
names are created continuously. Besides, authors of biomedical
texts often do not follow proposed standardized names or
formats and prefer to use abbreviations or other forms
depending on personal inclination [4] [5]. Because of their
limited length, such abbreviations and acronyms are sometimes
identical to other words or symbols, which increases ambiguity.
For instance, it was reported that 80% of the abbreviations
listed in the Unified Medical Language System (UMLS) have
ambiguous representations in MEDLINE [6]. Sometimes the same
name is shared by different types of bio-entity. For example,
“C1R” is a cell line, but there is also a gene (SwissProt
P00736) with the same name. Digits and other non-alphabetic
characters are also common inside bio-entity names. Compound
names further complicate the situation: locating the beginning
and end of such names within a sentence is not straightforward,
since verbs and adjectives are often embedded in them. Due to
these complexities, biomedical named entity recognition has
attracted a large amount of research interest. A number of
shared tasks and challenges, such as BioNLP/NLPBA 2004,
BioCreative, and CALBC, have provided benchmarks to compare and
showcase advances in this field.
Saha and Ekbal [7] pose the classifier ensemble problem under
single- and multi-objective optimization frameworks and
evaluate it for named entity recognition (NER), an important
step in almost all Natural Language Processing (NLP)
application areas. They propose solutions to two different
versions of the ensemble problem for each of the optimization
frameworks. They hypothesize that the reliability of the
predictions of each classifier differs among the various output
classes. Thus, in an ensemble system it is necessary either to
find the eligible classes for which a classifier is most
suitable to vote (i.e., binary vote based ensemble) or to
quantify the amount of voting for each class in a particular
classifier (i.e., real vote based ensemble). They use seven
diverse classifiers, namely Naive Bayes, Decision Tree (DT),
Memory Based Learner (MBL), Hidden Markov Model (HMM), Maximum
Entropy (ME), Conditional Random Field (CRF), and Support
Vector Machine (SVM), to build a number of models depending on
the various representations of the available features, which
are identified and selected mostly without using any domain
knowledge or language-specific resources. Results for all the
languages show that the proposed classifier combination with
real voting attains a performance level superior to all the
individual classifiers, three baseline ensembles, and the
corresponding single-objective ensemble.
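The two voting schemes can be illustrated with a short sketch. This is only an assumed reading of the idea, not the implementation from [7]: the function name weighted_vote, the classifier names, and all weight values are hypothetical, whereas in [7] the per-class votes are found by optimization. A binary vote based ensemble is the special case in which every weight is 0 or 1.

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Combine one prediction per classifier into a single label.

    predictions: {classifier_name: predicted_class}
    weights:     {classifier_name: {class: vote_weight}}
    A classifier contributes its weight for the class it predicted.
    """
    totals = defaultdict(float)
    for classifier, label in predictions.items():
        totals[label] += weights[classifier].get(label, 0.0)
    # The class with the largest accumulated vote wins.
    return max(totals, key=totals.get)

# Hypothetical classifiers and weights, for illustration only.
predictions = {"crf": "protein", "svm": "O", "nb": "O"}
weights = {
    "crf": {"protein": 0.9, "O": 0.6},
    "svm": {"protein": 0.7, "O": 0.3},
    "nb": {"protein": 0.5, "O": 0.4},
}
winner = weighted_vote(predictions, weights)  # "protein": 0.9 beats 0.3 + 0.4
```

With plain majority voting the two "O" predictions would win; the per-class weights let one reliable classifier outvote two less reliable ones.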
Saha et al. [8] propose a single-objective optimization based
classifier ensemble technique that uses the search capability
of a genetic algorithm (GA) for named entity recognition and
classification (NERC) in biomedical texts. Here, the GA is used
to quantify the amount of voting for each class in each
classifier. They use diverse classification methods such as
Conditional Random Field and Support Vector Machine to build a
number of models depending on the various representations of
the set of features and/or feature templates.
Munkhdalai et al. [9] present a semi-supervised learning method
that efficiently exploits unlabeled data in order to
incorporate domain knowledge into a named entity recognition
model and to improve system performance. The proposed method
includes NLP tasks for text pre-processing, learning word
representation features from a large amount of text data for
feature extraction, and conditional random fields for token
classification. Other than the free text in the domain, the
method does not rely on any lexicon or dictionary, which keeps
the system applicable to other named entity recognition tasks
in bio-text data. They extended BANNER, a biomedical named
entity recognition system, with the proposed method. This
yields an integrated system that can be applied to chemical and
drug named entity recognition or biomedical named entity
recognition.
Rocktäschel et al. [10] present ChemSpot, a named entity
recognition (NER) tool for identifying mentions of chemicals in
natural language texts, including trivial names, drugs,
abbreviations, molecular formulas, and International Union of
Pure and Applied Chemistry (IUPAC) entities. Since the
different classes of relevant entities have rather different
naming characteristics, ChemSpot uses a hybrid approach
combining a Conditional Random Field with a dictionary. It
achieves an F1 measure of 68.1% on the SCAI corpus,
outperforming the only other freely available chemical named
entity recognition tool, OSCAR4, by 10.8 percentage points.
Wang [11] presents classifier ensemble approaches for biomedical
named entity recognition. Generalized Winnow, Conditional
Random Fields, Support Vector Machine, and Maximum Entropy are
combined through three different strategies. The author
demonstrates the effectiveness of the ensemble strategies and
compares their performance with standalone classifier systems.
In experiments on the JNLPBA 2004 evaluation data, the best
system achieves an F-score of 77.57%, which is better than most
state-of-the-art systems. The experiments show that the
proposed classifier ensemble methods, especially the stacking
method, can lead to significant improvement in the performance
of biomedical named entity recognition.
State-of-the-art named entity recognition approaches use
various machine learning algorithms. These include the hidden
Markov model (HMM), support vector machine (SVM), maximum
entropy Markov model, and conditional random fields (CRFs).
Among these algorithms, CRFs appear to be the most popular
choice.
One common characteristic in many of these systems is the
combination of results from multiple classifiers (e.g. see
[12]). Apart from that, there is a substantial agreement
among the feature sets used by these systems, most of
which are actually various orthographic features.
Most of the work to date on biomedical named entity recognition
is focused on genes/proteins. State-of-the-art gene/protein
mention recognition systems achieve F-scores around 88%, which
is quite high. These systems often use either gene/protein
specific features (e.g. Greek alphabet matching) or
post-processing rules (e.g. extension of the identified mention
boundaries to the left when a single letter with a hyphen
precedes them [12]), which might not be equally effective for
identifying other bio-entity types. More effort should be
devoted to taking advantage of contextual clues and features.
In the last few years, some disease-annotated corpora have been
released. However, they were annotated primarily for relation
extraction and, for different reasons, most of them are not
suitable for developing machine learning based disease mention
recognition systems [13]. For example, the BioText [14] corpus
has no specific annotation guideline and contains several
inconsistencies, while PennBioIE [15] is very specific to a
particular sub-domain of diseases. Among other
disease-annotated corpora, the EBI disease corpus [16] is not
annotated with disease mention boundaries, which makes it
unsuitable for evaluating named entity recognition for
diseases. Recently, an annotated corpus named the Arizona
Disease Corpus (AZDC) [13] has been released, which has
adequate and suitable annotation of disease mentions following
specific annotation guidelines.
There has been some work on identifying diseases in clinical
texts, especially in the context of the CMC medical NLP
challenge and the i2b2 challenge. However, as noted by [17],
there are a number of reasons that make clinical texts
different from texts of the biomedical literature, e.g.
composition of short, telegraphic phrases and use of implicit
templates and pseudo-tables. Hence, the strategies adopted for
named entity recognition on clinical texts differ from those
used for the biomedical literature.
As discussed above, systems that achieve high accuracy in
recognizing general names in newswire text have not performed
as well in biomedical named entity recognition, with a
difference of 20 to 30 points in their F-scores. There is
therefore a need to develop a biomedical named entity
recognition system.
In addition, the literature reviewed above reports that
classifier ensemble (combination) approaches can outperform
the individual classifiers and lead to significant improvement
in the performance of named entity recognition. So, in this
work, we propose a biomedical named entity recognition model
based on classifier combination.
4. The Biomedical Named Entity Recognition
Constructing a biomedical named entity recognition solution
using a machine learning approach (classifier combination
using the vote based ensemble approach) requires many
computational steps, including data planning, pre-processing,
feature selection and optimization, classification, and
evaluation. The specific components included in a given
solution vary, but they may be viewed as falling into the
groups summarized in Figure 1.
Fig. 1 The Proposed biomedical named entity recognition Architecture
4.1 Preprocessing phase
Using a supervised machine learning technique relies on
the existence of annotated training data. Such data is
usually created manually by humans or experts in the
relevant field. The training data needs to be put in a format
that is suitable to the solution of choice. New data to be
classified also requires the same formatting. Depending on
the needs of the solution, the textual data may need to be
tokenized, normalized, scaled, or mapped to numeric classes
before being fed to a feature extraction module. To reduce
training time on large training sets, techniques such as
chunking or instance pruning (filtering) may need to be
applied.
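As a rough illustration of the formatting step, the sketch below tokenizes a sentence and maps class labels to integers. It is a minimal example under stated assumptions: the tag set, the tokenization rule (hyphens and slashes stay inside a token), and the function names are hypothetical and are not taken from the experiments.

```python
import re

# Hypothetical tag set; the entity classes used in the experiments are not listed.
LABELS = ["O", "B-protein", "I-protein"]
LABEL_TO_ID = {label: index for index, label in enumerate(LABELS)}

# A token is a run of letters/digits, optionally joined by '-' or '/',
# or any single character that is neither whitespace nor alphanumeric.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:[-/][A-Za-z0-9]+)*|[^\sA-Za-z0-9]")

def tokenize(sentence):
    """Split a sentence into word and punctuation tokens."""
    return TOKEN_RE.findall(sentence)

def encode_labels(labels):
    """Map string labels to the integer classes a classifier expects."""
    return [LABEL_TO_ID[label] for label in labels]

tokens = tokenize("IL-2 gene expression requires NF-kappa B.")
label_ids = encode_labels(["B-protein", "I-protein", "O"])
```

Keeping hyphens and slashes inside tokens is a deliberate choice here, since the morphological features in Section 4.2 test for them.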
4.2 Feature Extraction
In the feature extraction phase, the training and test data are
processed by one or more components in order to extract the
important information they contain. In general, the selection
of feature extraction components may involve morphological and
orthographic features, textual information, linguistic
information such as POS, and domain-dependent knowledge
including specialized gazetteers or dictionaries.
In this work, the components extract morphological and
contextual features that do not use language-specific
knowledge such as part-of-speech or noun phrase tagging. The
generated feature space is very large, including about a
million different features. Since words appearing separately
or within windows of other words each constitute a feature in
the lexicon, the potential number of possibilities is very
high. Including character n-grams describing prefixes, infixes,
and suffixes would further increase the number of features in
the lexicon. The feature extraction process is intentionally
designed this way in order to test the scalability of the
approach and to allow the experiments to proceed in a
language-independent and domain-independent fashion. All
features are binary, i.e., each feature denotes whether the
current token possesses this feature (one) or not (zero).
Character n-grams were not included in the baseline experiment
data due to memory limitations encountered during the feature
extraction process.
The morphological features extracted are:
- Capitalization: token begins with a capital letter.
- Numeric: token is a numeric value.
- Punctuation: token is a punctuation mark.
- Uppercase: token is all in uppercase.
- Lowercase: token is all in lowercase.
- Single character: token length is equal to one.
- Symbol: token is a special character.
- Includes hyphen: one of the characters is a hyphen.
- Includes slash: one of the characters is a slash.
- Letters and Digits: token is alphanumeric.
- Capitals and digits: token contains caps and digits.
- Includes caps: some characters are in uppercase.
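The list above can be turned into a binary feature vector along the following lines. This is a sketch, not the extraction code used in the experiments: the function name and feature keys are illustrative, and several definitions are assumptions where the list is ambiguous (a "symbol" is taken to be any single non-alphanumeric character, and "letters and digits" requires a purely alphanumeric token containing both).

```python
import re
import string

# Integers and simple decimals, optionally signed (assumed meaning of "numeric").
NUMERIC_RE = re.compile(r"^[+-]?\d+(\.\d+)?$")

def morphological_features(token):
    """Return the twelve binary morphological features (1 or 0) for one token."""
    has_upper = any(c.isupper() for c in token)
    has_digit = any(c.isdigit() for c in token)
    has_alpha = any(c.isalpha() for c in token)
    flags = {
        "capitalization": token[:1].isupper(),
        "numeric": bool(NUMERIC_RE.match(token)),
        "punctuation": bool(token) and all(c in string.punctuation for c in token),
        "uppercase": token.isupper(),
        "lowercase": token.islower(),
        "single_character": len(token) == 1,
        "symbol": len(token) == 1 and not token.isalnum(),  # assumption
        "includes_hyphen": "-" in token,
        "includes_slash": "/" in token,
        "letters_and_digits": token.isalnum() and has_alpha and has_digit,  # assumption
        "capitals_and_digits": has_upper and has_digit,
        "includes_caps": has_upper,
    }
    # All features are binary: one if the token has the property, zero otherwise.
    return {name: int(value) for name, value in flags.items()}

features = morphological_features("IL-2")
```

For the token "IL-2" the capitalization, uppercase, hyphen, capitals-and-digits, and includes-caps features are set, and the rest are zero.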
4.3 Machine Learning and Classification
Almost all machine learning based approaches have two phases:
training is performed first to produce a trained model, and
then a classification step is performed. In this study, the
following machine learning approaches are evaluated:
4.3.1 Support vector machine (SVM)
A support vector machine (SVM) is a machine learning technique
proposed by Cortes & Vapnik (1995). SVM is a popular technique
for NER in the machine learning area and is considered one of
the most effective classification techniques. Based on the idea
of structural risk minimization from computational learning
theory, SVM seeks a decision surface that separates the
training data points into two classes, and makes decisions
based on the support vectors, which are selected as the only
effective elements in the training set.
4.3.2 Naïve Bayes
The naive Bayes technique is used extensively for NER. Given a
table of feature vectors, the technique computes the posterior
probability that a term belongs to each named entity class and
assigns it to the class with the maximum posterior probability.
Two approaches are commonly used: multinomial models and
multivariate Bernoulli models. Naïve Bayes is a stochastic
model of document generation that makes use of Bayes’ rule. To
find the best named entity class n* for a new term w, it
computes:
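A standard form of this decision rule is given below. The notation beyond n* and w is assumed for illustration: N is the set of named entity classes and f_1, ..., f_m are the binary features of w. The last step applies the naive assumption that features are conditionally independent given the class, and P(w) is dropped because it does not depend on n.

```latex
n^{*} = \arg\max_{n \in N} P(n \mid w)
      = \arg\max_{n \in N} \frac{P(n)\, P(w \mid n)}{P(w)}
      = \arg\max_{n \in N} P(n) \prod_{i=1}^{m} P(f_i \mid n)
```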
4.3.3 Artificial Neural Network
A neural network is an interconnected group of artificial
neurons that uses a computational model to process data, based
on a connectionist approach. Sets of input attributes and
desired outputs are presented to the learning program. The aim
is to use the input characteristics to separate the training
examples into non-overlapping groups corresponding to the
desired outputs. The input layer comprises a collection of
units equal in number to the tags in the tag set.
The neural network we used is an acyclic directed graph of
sigmoid units trained with the backpropagation algorithm.
Sigmoid units are like perceptrons, but they are based on a
smoothed, differentiable threshold function. A sigmoid unit
first computes a linear combination of its inputs, then applies
a threshold to the result, where the threshold is a continuous
function of its input. The sigmoid unit computes its output o
as follows:
$$ o = \sigma(\vec{w} \cdot \vec{x}) $$
where
$$ \sigma(y) = \frac{1}{1 + e^{-y}} $$
Here σ is called the sigmoid function. Its output ranges
between 0 and 1, increasing monotonically with its input.
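A single sigmoid unit can be sketched in a few lines of Python. The sketch assumes the standard logistic form σ(y) = 1 / (1 + e^(-y)); the function names, the optional bias term, and the example weights are illustrative and are not taken from the experiments.

```python
import math

def sigmoid(y):
    """Logistic function, written to avoid overflow for large negative y."""
    if y >= 0:
        return 1.0 / (1.0 + math.exp(-y))
    e = math.exp(y)
    return e / (1.0 + e)

def sigmoid_unit(weights, inputs, bias=0.0):
    """Output of one unit: sigmoid of the linear combination of its inputs."""
    net = bias + sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(net)

# Hypothetical weights applied to three binary input features.
output = sigmoid_unit([0.8, -0.4, 0.3], [1, 0, 1])
```

Because the function is smooth and differentiable, its gradient can be propagated backwards through the network, which is what backpropagation relies on.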
4.4 Performance Measures
The performance measures used to evaluate the named entity
recognition systems participating in the CoNLL-02, CoNLL-03,
and JNLPBA-04 challenge tasks are precision, recall, and their
weighted harmonic mean, the Fβ=1 score. Precision is the
percentage of named entities found by the learning system that
are correct. Recall is the percentage of named entities present
in the corpus that are found by the system. A named entity is
correct only if it is an exact match of the corresponding
entity in the data file, i.e., the complete named entity is
correctly identified. Definitions of the performance measures
used are summarized below. The same performance measures are
used to evaluate the results of the baseline experiments.
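In their usual form, precision is P = correct / found, recall is R = correct / present, and Fβ = (1 + β²)·P·R / (β²·P + R), which reduces to 2PR / (P + R) for β = 1. The sketch below applies these definitions with exact-match scoring. Representing an entity as a (start, end, type) tuple is an assumption made for the example, as are the function name and the sample data.

```python
def precision_recall_f(gold, predicted, beta=1.0):
    """Exact-match precision, recall, and F-beta over sets of entities.

    Each entity is a (start, end, type) tuple; it counts as correct only
    if all three parts match a gold entity exactly.
    """
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_score = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_score

# Hypothetical annotations: the second prediction has a wrong right boundary,
# so it does not count, and the third gold entity is missed entirely.
gold = [(0, 2, "protein"), (5, 6, "DNA"), (9, 10, "protein")]
predicted = [(0, 2, "protein"), (5, 7, "DNA")]
p, r, f = precision_recall_f(gold, predicted)  # p = 1/2, r = 1/3, f = 0.4
```

The boundary error illustrates how strict exact matching is: a partially correct entity lowers both precision and recall.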
5. Experimental Results
We conducted several experiments to empirically compare
different subsets of features and three classification
approaches (Naïve Bayes, K-Nearest Neighbor, and decision
tree) for biomedical named entity recognition. The aim is to
efficiently integrate different feature sets and classification
algorithms to synthesize a more accurate classification
procedure.
In each main experiment, each subset of features is applied, in
combination with most of the other features, with one of the
three classification approaches. All of the algorithms are
evaluated using ten-fold cross-validation. The reported
macro-averaged F-measure values are averages over the ten
folds.
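The evaluation protocol can be sketched in plain Python as follows. Several details are assumptions, since they are not stated here: folds are formed by taking every tenth instance, the K-Nearest Neighbor classifier uses Hamming distance over binary features with k = 3, and the F-measure is computed per token rather than per entity to keep the example short. All function names and the toy data are hypothetical.

```python
from collections import Counter

def k_fold_indices(n, k=10):
    """Yield (train, test) index lists; fold i tests every k-th instance."""
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        yield train, test

def knn_predict(train_X, train_y, x, k=3):
    """Label x by majority vote of its k nearest neighbours (Hamming distance)."""
    distances = sorted(
        (sum(a != b for a, b in zip(row, x)), index)
        for index, row in enumerate(train_X)
    )
    votes = Counter(train_y[index] for _, index in distances[:k])
    return votes.most_common(1)[0][0]

def macro_f1(gold, pred):
    """Average the per-class F1 scores, weighting every gold class equally."""
    scores = []
    for label in sorted(set(gold)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def cross_validate(X, y, folds=10, neighbours=3):
    """Return the macro-averaged F1, averaged over all folds."""
    fold_scores = []
    for train, test in k_fold_indices(len(X), folds):
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        pred = [knn_predict(train_X, train_y, X[i], neighbours) for i in test]
        fold_scores.append(macro_f1([y[i] for i in test], pred))
    return sum(fold_scores) / len(fold_scores)

# Toy data: two perfectly separable classes of binary feature vectors.
X = [[1, 0, 1]] * 10 + [[0, 1, 0]] * 10
y = ["protein"] * 10 + ["O"] * 10
score = cross_validate(X, y)
```

On real data the instances would normally be shuffled or stratified before splitting; the fixed every-tenth-instance split is used here only to keep the sketch deterministic.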
In this section, we describe several experiments that
empirically compare 10 different features and the three
classification approaches (Naïve Bayes, K-Nearest Neighbor,
and decision tree) for biomedical named entity recognition.
We have two primary goals with these experiments. The first is
to identify the best classification approach for the model to
use in classifying the dataset. The second is to evaluate the
features described in the previous section for their
usefulness for this task.
Table 1 shows a sample of the dataset used for the experiments.
In the first experiment, the KNN classifier is applied to the
testing set using 10-fold cross-validation. As shown in the
table, there are 9 features, which means that 512 (2^9)
different experiments can be performed, one per feature subset.
However, the results reported here are for the best 10 of these
512 experiments. The idea is to show the best results obtained
when KNN is applied. Table 2 shows the performance, in terms of
precision, recall, and F-measure, of biomedical named entity
recognition using the KNN classifier with different sets of
features. As shown in Table 2, the choice of feature set has an
obvious effect on the quality of biomedical named entity
recognition for the KNN classification model in general.
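The count of 512 follows from the fact that each of the 9 features is either included or excluded, giving 2^9 subsets (511 if the empty subset is left out). A minimal sketch of the enumeration is shown below; the feature names f1 to f9 are placeholders, since the nine features used are not named here.

```python
from itertools import combinations

def feature_subsets(features):
    """Yield every subset of the feature list, including the empty one."""
    for size in range(len(features) + 1):
        yield from combinations(features, size)

# Placeholder names for the nine features.
FEATURES = [f"f{i}" for i in range(1, 10)]
subsets = list(feature_subsets(FEATURES))
non_empty = [subset for subset in subsets if subset]
```

Each subset would then be passed to the classifier and scored with cross-validation, and the best-scoring subsets reported.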
Table 2 Precision, recall, and F-measure of biomedical named entity
recognition using the KNN classifier
In the second experiment, the NB classifier is applied to the
testing set using 10-fold cross-validation. The results
reported are for the best 9 of the 512 experiments. The idea is
to show the best results obtained when NB is applied. Table 3
shows the performance, in terms of precision, recall, and
F-measure, of biomedical named entity recognition using the NB
classifier with different sets of features. As shown in Table
3, the choice of feature set has an obvious effect on the
quality of biomedical named entity recognition for the NB
classification model in general. However, the results obtained
using the NB classifier are lower than those obtained using
KNN. This suggests that the effect of the feature sets on the
performance of the NB classifier is smaller than their effect
on the KNN classifier.
Table 3 Precision, recall, and F-measure of biomedical named entity
recognition using the NB classifier
In the third experiment, the decision tree classifier is
applied to the testing set using 10-fold cross-validation. The
results reported are for the best 9 of the 512 experiments. The
idea is to show the best results obtained when the decision
tree is applied. Table 4 shows the performance, in terms of
precision, recall, and F-measure, of biomedical named entity
recognition using the decision tree classifier with different
sets of features. As shown in Table 4, the choice of feature
set has an obvious effect on the quality of biomedical named
entity recognition for the decision tree classification model
in general. However, the results obtained using the decision
tree classifier are lower than those obtained using KNN. This
suggests that the effect of the feature sets on the performance
of the decision tree classifier is smaller than their effect on
the KNN classifier.
Table 4 Precision, recall, and F-measure of biomedical named entity
recognition using the decision tree classifier
6. Conclusion
The core objective of this work is to design and implement a
new model for biomedical named entity recognition. A new model
is produced based on support vector machine (SVM), Naïve Bayes
(NB), and Artificial Neural Network. Machine learning
techniques have been used to build and develop biomedical named
entity recognition, which requires several steps, including
data pre-processing, feature selection and extraction, machine
learning models, and classification. The analysis of the
reported results shows that the proposed model is satisfactory
and effective for biomedical named entity recognition.
References
[1] Habib, M. S. Biomedical Named Entity Recognition Using
Support Vector Machines: Performance vs. Scalability
Issues.
[2] Zhang, S., & Elhadad, N. (2013). Unsupervised biomedical
named entity recognition: Experiments with clinical and
biological texts. Journal of biomedical
informatics, 46(6),1088-1098.
[3] Chowdhury, M., & Mahbub, F. (2013). Improving the
Effectiveness of Information Extraction from Biomedical
Text. University of Trento.
[4] Bodenreider, O. (2004). The unified medical language
system (UMLS): integrating biomedical terminology.
Nucleic acids research, 32(suppl 1), D267-D270.
[5] Dai, H.-J., Chang, Y.-C., Tsai, R. T.-H., & Hsu, W.-L.
(2010). New challenges for biological text-mining in the
next decade. Journal of computer science and technology,
25(1), 169-179.
[6] Liu, H., Aronson, A. R., & Friedman, C. (2002). A study of
abbreviations in MEDLINE abstracts. Paper presented at the
Proceedings of the AMIA Symposium.
[7] Saha, S., & Ekbal, A. (2013). Combining multiple
classifiers using vote based classifier ensemble technique
for named entity recognition. Data & Knowledge
Engineering, 85, 15-39.
[8] Saha, S., Ekbal, A., & Sikdar, U. K. (2015). Named entity
recognition and classification in biomedical text using
classifier ensemble. International journal of data mining
and bioinformatics, 11(4), 365-391.
[9] Munkhdalai, T., Li, M., Batsuren, K., Park, H., Choi, N., &
Ryu, K. H. (2015). Incorporating domain knowledge in
chemical and biomedical named entity recognition with
word representations. J. Cheminformatics, 7(S-1), S9.
[10] Rocktäschel, T., Weidlich, M., & Leser, U. (2012).
ChemSpot: a hybrid system for chemical named entity
recognition. Bioinformatics, 28(12), 1633-1640.
[11] Wang, H. (2008). "Biomedical Named Entity Recognition
Based on Classifiers Ensemble." International Journal of
Computer Science and Applications (IJCSA).
[12] Torii, S., Saito, N., Kawano, A., Hou, N., Ueki, K., Kulkarni,
R. N., & Takeuchi, T. (2009). Gene silencing of phogrin
unveils its essential role in glucose-responsive pancreatic β-
cell growth. Diabetes, 58(3), 682-692.
[13] Leaman, R., Miller, C., & Gonzalez, G. (2009). Enabling
recognition of diseases in biomedical text with machine
learning: corpus and benchmark. Paper presented at the
Proceedings of the 2009 Symposium on Languages in
Biology and Medicine.
[14] Rosario, B., & Hearst, M. A. (2004). Classifying semantic
relations in bioscience texts. Paper presented at the
Proceedings of the 42nd annual meeting on association for
computational linguistics.
[15] Kulick, S., Bies, A., Liberman, M., Mandel, M., McDonald,
R., Palmer, M., ... White, P. (2004). Integrated annotation for
biomedical information extraction. Paper presented at the
Proc. of the Human Language Technology Conference and
the Annual Meeting of the North American Chapter of the
Association for Computational Linguistics (HLT/NAACL).
[16] Jimeno, A., Jimenez-Ruiz, E., Lee, V., Gaudan, S., Berlanga,
R., & Rebholz-Schuhmann, D. (2008). Assessment of
disease named entity recognition on a corpus of annotated
sentences. BMC Bioinformatics, 9(Suppl 3), S3.
[17] Meystre, S. M., Savova, G. K., Kipper-Schuler, K. C., &
Hurdle, J. F. (2008). Extracting information from textual
documents in the electronic health record: a review of recent
research. Yearb Med Inform, 35, 128-144.