A Biomedical Named Entity Recognition Using Machine Learning Classifiers and Rich Feature Set

January 2017

Authors:

University of Sana'a

IJCSNS International Journal of Computer Science and Network Security, VOL.17 No.1, January 2017

Manuscript received January 5, 2017

Manuscript revised January 20, 2017

A Biomedical Named Entity Recognition Using Machine Learning Classifiers and Rich Feature Set

Ahmed Sultan Al-Hegami, Ameen Mohammed Farea Othman, Fuad Tarbosh Bagash

University of Sana’a, Yemen, The Arab Academy for Banking and Financial Sciences, European International Education, Yemen

Summary

As the wealth of biomedical knowledge in the form of literature increases, there is a rising need for effective natural language processing tools to assist in organizing, curating, and retrieving this information. Named entity recognition (the task of identifying words and phrases in free text that belong to certain classes of interest) is an important first step toward many of these larger information management goals. The task becomes more difficult in a specific domain, since the entities are particular to that domain. In recent years, much attention has been focused on recognizing mentions of genes, proteins, and other biomedical entities in biomedical abstracts. This study therefore aims to design and develop a biomedical named entity recognition model. A machine learning classification framework is proposed based on Naïve Bayes, K-Nearest Neighbour, and decision tree classifiers. We performed several experiments to empirically compare different subsets of features across these three classification approaches. The aim is to integrate different feature sets and classification algorithms efficiently in order to synthesize a more accurate classification procedure. The results indicate that the K-Nearest Neighbour classifier trained with suitable features is better suited to recognizing named entities in biomedical texts than the other models.

Key words:

Named entity recognition (NER), learning, classification, framework, decision tree, recognizing gene, Naïve Bayes, K-Nearest Neighbour.

1. Introduction

Named entity recognition (NER) is one of the important tasks in information extraction. It involves the identification and classification of words or sequences of words denoting a concept or entity. Examples of named entities in general text are names of persons, locations, or organizations. Domain-specific named entities are terms or phrases that denote concepts relevant to one particular domain. For example, protein and gene names are named entities of interest to the domains of molecular biology and medicine. The massive growth of textual information available in the literature and on the Web necessitates automating the identification and management of named entities in text [1]. Named entity recognition is a crucial component of biomedical natural language processing, enabling information extraction and, ultimately, reasoning over and knowledge discovery from text. Much progress has been made in the design of rule-based and supervised tools, but they are often genre and task dependent. As such, adapting them to different genres of text or to new types of entities requires major effort in re-annotation or rule development [2]. The core approaches to NER fall into three classes: rule-based, machine learning, and hybrid. Rule-based approaches aim to extract names using a set of hand-crafted rules. In general, these models consist of a number of patterns that use grammatical features (such as part of speech (POS)), syntactic features (such as word precedence), and orthographic features (such as capitalization), together with dictionaries. However, rule-based models are not portable, dynamic, or robust, and the cost of maintaining the rules is high even when the data changes only slightly.

Many researchers currently use machine learning techniques for biomedical NER because they are easy to train and cheaper to maintain. Machine learning approaches may be classified as unsupervised, semi-supervised, or supervised. Supervised machine learning techniques used in NER include Support Vector Machines (SVM) and Naïve Bayes.

Beyond the studies mentioned above, there is a great deal of related work, most of it in the social media, news, and medical domains. Studies on biomedical NER, by contrast, remain at an early stage. The biomedical domain is chosen for the initial experiments because of its importance and inherent challenges.

2. Motivation

In view of the weaknesses inherent in manually searching text, it has become imperative to seek more efficient ways to carry out text mining. The massive volume of biomedical information stored as electronic documents, which is presumably due to a substantial increase in scientific research over the years, has necessitated the use of text mining technology. Searching and processing information from documented data is time-consuming in many areas, for example biomedical literature, and is becoming impractical without computer support. Thus, there is now a strong need for intelligent text-handling applications that can replace or support human information exploration in biomedical text documents.

It has become extremely difficult for biologists to keep up with the relevant journals in their own discipline, let alone publications in other, related disciplines [3]. Biomedical literature is considered a source of authentic medical knowledge, which is critical for e-health applications. Such applications have huge commercial prospects. According to the US National Center for Health Statistics, 51% of adults in the USA had used the internet to look for health information in 2009 [3]. This commercial potential has led to the launch of both free sources and sources that charge for access [3]. Many software tools hosted on the internet have helped patients identify symptoms of diseases, and even adverse drug reactions, early enough to take first aid measures before experts are consulted. Biological researchers rely heavily on the knowledge contained in the biomedical literature. For instance, Medline alone contains more than twenty-two million abstracts in the domains of medicine, biomedical sciences, laboratory sciences, etc.

Natural Language Processing is an emerging field within text mining that aims to automate the process of locating and classifying important information in large bodies of unstructured text. This gives the data some form of shape and structure for ease of human use. The task requires at least a limited understanding of the text itself and the introduction of new compound patterns that simulate human information search, which makes text-mining tasks more complex and challenging than traditional keyword-lookup-based information retrieval tasks.

3. Related Work

Most of the work on named entity recognition initially focused on the news domain. However, the features, pre-processing, and post-processing used in that work are not equally effective on biomedical text unless domain-specific knowledge and techniques are incorporated. Biomedical texts differ substantially from other genres of text (such as newspaper articles), from the terminology and sentence construction to the valence and semantics of names, and new names are created continuously. Besides, authors of biomedical texts often do not follow proposed standardized names or formats and prefer to use abbreviations or other forms depending on personal inclination [4] [5]. Because of their limited length, such abbreviations and acronyms are sometimes identical to other words or symbols, which increases ambiguity. For instance, it was reported that 80% of the abbreviations listed in the Unified Medical Language System (UMLS) have an ambiguous representation in MEDLINE [6]. Sometimes the same name is shared by different bio-entity types. For example, “C1R” is a cell line, but there exists a gene (SwissProt P00736) that has the same name. The use of digits and other non-alphabetic characters inside bio-entity names is also common. Compound names further complicate the situation: locating the beginning and end of such names within a sentence is not straightforward, since verbs and adjectives are often embedded in them. Because of these complexities, named entity recognition has attracted a great deal of research interest. A number of shared tasks and challenges, such as BioNLP/NLPBA 2004, BioCreative, and CALBC, have provided benchmarks to compare and showcase advances in this field.

Saha and Ekbal [7] pose the classifier ensemble problem under single- and multi-objective optimization frameworks and evaluate it for named entity recognition (NER), an important step in almost all natural language processing (NLP) application areas. They propose solutions to two different versions of the ensemble problem for each of the optimization frameworks. They hypothesize that the reliability of each classifier's predictions differs among the output classes. Thus, in an ensemble system it is necessary either to find the classes for which a classifier is eligible to vote (binary vote based ensemble) or to quantify the amount of voting for each class in a particular classifier (real vote based ensemble). They use seven diverse classifiers, namely Naïve Bayes, Decision Tree (DT), Memory Based Learner (MBL), Hidden Markov Model (HMM), Maximum Entropy (ME), Conditional Random Field (CRF), and Support Vector Machine (SVM), to build a number of models depending on the various representations of the available features, which are identified and selected mostly without using any domain knowledge or language-specific resources. Results for all the languages show that the proposed classifier combination with real voting attains a performance level superior to all the individual classifiers, three baseline ensembles, and the corresponding single-objective ensemble.
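
As a rough illustration of the difference between binary and real voting described above, the following Python sketch combines classifier outputs using per-class vote weights. It is not the implementation of [7]: the classifier names, labels, and weight values are invented for the example, whereas in [7] the weights are found by optimization.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Combine per-classifier predictions into one label.

    predictions: dict mapping classifier name -> predicted class label
    weights:     dict mapping classifier name -> {class label -> vote weight}
                 Binary voting uses weights of 0 or 1; real voting uses any
                 value in [0, 1]. All values here are hypothetical.
    """
    totals = defaultdict(float)
    for clf, label in predictions.items():
        # A classifier contributes only the weight it holds for the class it predicted.
        totals[label] += weights[clf].get(label, 0.0)
    return max(totals, key=totals.get)


# Hypothetical outputs of three classifiers for one token.
predictions = {"nb": "protein", "dt": "O", "svm": "protein"}

# Binary vote: every classifier is allowed to vote for the class it predicted.
binary = {"nb": {"protein": 1}, "dt": {"O": 1}, "svm": {"protein": 1}}

# Real vote: the decision tree is trusted far more on class "O".
real = {"nb": {"protein": 0.2}, "dt": {"O": 0.9}, "svm": {"protein": 0.4}}

binary_label = weighted_vote(predictions, binary)
real_label = weighted_vote(predictions, real)
```

With these made-up weights the two schemes disagree: binary voting follows the majority, while real voting lets one strongly weighted classifier override it.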

Saha et al. [8] propose a single-objective optimization based classifier ensemble technique that uses the search capability of a genetic algorithm (GA) for named entity recognition and classification (NERC) in biomedical texts. Here, the GA is used to quantify the amount of voting for each class in each classifier. They use diverse classification methods, such as Conditional Random Field and Support Vector Machine, to build a number of models depending on the various representations of the set of features and/or feature templates.

Munkhdalai et al. [9] present a semi-supervised learning method that efficiently exploits unlabeled data in order to incorporate domain knowledge into a named entity recognition model and to improve system performance. The proposed method includes natural language processing (NLP) tasks for text pre-processing, learning word representation features from a large amount of text data for feature extraction, and conditional random fields for token classification. Other than the free text in the domain, the method does not rely on any lexicon or dictionary, which keeps the system applicable to other named entity recognition tasks on bio-text data. They extended BANNER, a biomedical named entity recognition system, with the proposed method. This yields an integrated system that can be applied to chemical and drug named entity recognition or to biomedical named entity recognition.

Rocktäschel et al. [10] present ChemSpot, a named entity recognition (NER) tool for identifying mentions of chemicals in natural language texts, including trivial names, drugs, abbreviations, molecular formulas, and International Union of Pure and Applied Chemistry (IUPAC) entities. Since the different classes of relevant entities have rather different naming characteristics, ChemSpot uses a hybrid approach combining a Conditional Random Field with a dictionary. It achieves an F1 measure of 68.1% on the SCAI corpus, outperforming the only other freely available chemical named entity recognition tool, OSCAR4, by 10.8 percentage points.

The authors of [11] present classifier ensemble approaches for biomedical named entity recognition. Generalized Winnow, Conditional Random Fields, Support Vector Machine, and Maximum Entropy are combined through three different strategies. They demonstrate the effectiveness of the ensemble strategies and compare their performance with standalone classifier systems. In experiments on the JNLPBA 2004 evaluation data, their best system achieves an F-score of 77.57%, which is better than most state-of-the-art systems. The experiments show that the proposed classifier ensemble method, especially the stacking method, can lead to significant improvement in the performance of biomedical named entity recognition.

State-of-the-art named entity recognition approaches use various machine learning algorithms. These include the hidden Markov model (HMM), support vector machine (SVM), maximum entropy Markov model, and conditional random fields (CRFs). Among these algorithms, CRFs appear to be the most popular choice.

One common characteristic in many of these systems is the

combination of results from multiple classifiers (e.g. see

[12]). Apart from that, there is a substantial agreement

among the feature sets used by these systems, most of

which are actually various orthographic features.

Most of the work to date on named entity recognition has focused on genes and proteins. State-of-the-art gene/protein mention recognition systems achieve F-scores of around 88%, which is quite high. These systems often use either gene/protein-specific features (e.g. Greek alphabet matching) or post-processing rules (e.g. extension of the identified mention boundaries to the left when a single letter with a hyphen precedes them [12]), which might not be equally effective for identifying other bio-entity types. More effort should be devoted to taking advantage of contextual clues and features. In the last few years, some disease-annotated corpora have been released. However, they have been annotated primarily to serve the purpose of relation extraction and, for different reasons, most of them are not suitable for developing machine learning based disease mention recognition systems [13]. For example, the BioText corpus [14] has no specific annotation guideline and contains several inconsistencies, while PennBioIE [15] is very specific to a particular sub-domain of diseases. Among other disease-annotated corpora, the EBI disease corpus [16] is not annotated with disease mention boundaries, which makes it unsuitable for evaluating named entity recognition for diseases. Recently, an annotated corpus named the Arizona Disease Corpus (AZDC) [13] has been released, which has adequate and suitable annotation of disease mentions following specific annotation guidelines.

There has been some work on identifying diseases in clinical texts, especially in the context of the CMC medical NLP challenge and the i2b2 challenge. However, as noted by [17], a number of characteristics make clinical texts different from texts of the biomedical literature, e.g. composition of short, telegraphic phrases and use of implicit templates and pseudo-tables. Hence, the strategies adopted for named entity recognition on clinical texts differ from those used for biomedical literature.

As discussed above, systems that achieve high accuracy in recognizing general names in newswire text have not performed as well on biomedical named entity recognition, with F-scores 20 to 30 points lower. There is therefore a need to develop a dedicated biomedical named entity recognition system.

In addition, the literature reviewed above reports that classifier ensemble (combination) approaches outperform the individual classifiers and lead to significant improvements in named entity recognition performance. In this work, we therefore propose a biomedical named entity recognition model based on classifier combination.

4. The Biomedical Named Entity Recognition

Constructing a biomedical named entity recognition solution using a machine learning approach (classifier combination using the vote based ensemble approach) requires many computational steps, including data planning, pre-processing, feature selection and optimization, classification, and evaluation. The specific components included in a given solution vary, but they may be viewed as belonging to the groups summarized in Figure 1.

Fig. 1 The Proposed biomedical named entity recognition Architecture

4.1 Preprocessing phase

Using a supervised machine learning technique relies on

the existence of annotated training data. Such data is

usually created manually by humans or experts in the

relevant field. The training data needs to be put in a format

that is suitable to the solution of choice. New data to be

classified also requires the same formatting. Depending on

the needs of the solution, the textual data may need to be

tokenized, normalized, scaled, mapped to numeric classes,

prior to being fed to a feature extraction module. To

reduce the training time with large training data, some

techniques such as chunking or instance pruning (filtering)

may need to be applied.

4.2 Feature Extraction

In the feature extraction phase, the training and test data are processed by one or more components in order to retrieve important information about each token. Feature extraction components may extract morphological and orthographic features, text-based information, linguistic information such as POS, and domain-dependent knowledge including specialized gazetteers or dictionaries.

In this work, the components extract morphological and contextual features that do not use language-specific knowledge such as part-of-speech or noun phrase tagging. The generated feature space is very large, including about a million different features. Since words appearing separately or within windows of other words each constitute a feature in the lexicon, the potential number of possibilities is very high. Including character n-grams describing prefixes, infixes, and suffixes would further increase the number of features in the lexicon. The feature extraction process is intentionally designed this way in order to test the scalability of the approach and to allow the experiments to proceed in a language-independent and domain-independent fashion. All features are binary, i.e., each feature denotes whether the current token possesses this feature (one) or not (zero). Character n-grams were not included in the baseline experiment data due to memory limitations encountered during feature extraction. The features extracted are described below.

The morphological features extracted are:

- Capitalization: token begins with a capital letter.
- Numeric: token is a numeric value.
- Punctuation: token is a punctuation.
- Uppercase: token is all in uppercase.
- Lowercase: token is all in lowercase.
- Single character: token length is equal to one.
- Symbol: token is a special character.
- Includes hyphen: one of the characters is a hyphen.
- Includes slash: one of the characters is a slash.
- Letters and Digits: token is alphanumeric.
- Capitals and digits: token contains caps and digits.
- Includes caps: some characters are in uppercase.
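
The binary features listed above could be computed per token along the following lines. This is a minimal Python sketch rather than the extraction code used in the experiments: the function name, the feature keys, the punctuation set, and the exact reading of "numeric", "symbol", and "alphanumeric" are all assumptions made for the example.

```python
import re

# Assumed punctuation set; the paper does not say which characters count.
PUNCTUATION = set(".,;:!?()[]{}\"'")
NUMBER = re.compile(r"[+-]?\d+(\.\d+)?")


def morphological_features(token: str) -> dict:
    """Return the twelve morphological features of one token as 0/1 values."""
    has_upper = any(c.isupper() for c in token)
    has_digit = any(c.isdigit() for c in token)
    has_alpha = any(c.isalpha() for c in token)
    flags = {
        "capitalization": token[:1].isupper(),
        "numeric": NUMBER.fullmatch(token) is not None,
        "punctuation": bool(token) and all(c in PUNCTUATION for c in token),
        "uppercase": token.isupper(),
        "lowercase": token.islower(),
        "single_char": len(token) == 1,
        # Assumption: a "symbol" is a lone non-alphanumeric character
        # that is not in the punctuation set above.
        "symbol": len(token) == 1 and not token.isalnum() and token not in PUNCTUATION,
        "has_hyphen": "-" in token,
        "has_slash": "/" in token,
        # Assumption: "alphanumeric" means a mix of letters and digits only.
        "letters_and_digits": token.isalnum() and has_alpha and has_digit,
        "caps_and_digits": has_upper and has_digit,
        "has_caps": has_upper,
    }
    return {name: int(value) for name, value in flags.items()}


features = morphological_features("IL-2")
```

For a token such as "IL-2", the capitalization, uppercase, hyphen, capitals-and-digits, and includes-caps features are switched on, while the alphanumeric feature is off because of the hyphen.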

4.3 Machine Learning and Classification

Almost all machine learning based techniques have two phases: training is performed first to produce a trained model, and a classification step is then performed with that model. In this study, the following machine learning approaches are evaluated:

4.3.1 Support vector machine (SVM)

The support vector machine (SVM) is a machine learning technique proposed by Cortes & Vapnik (1995). It is a popular technique for NER and is considered a highly effective classifier. Based on the idea of structural risk minimization from computational learning theory, the SVM seeks a decision surface that separates the training data points into two classes. It makes decisions based on the support vectors, which are the only training examples that effectively determine that surface.

4.3.2 Naïve Bayes

The Naïve Bayes technique is used extensively for NER. Given a table of feature vectors, the technique estimates the posterior probability that a term belongs to each named entity class and assigns the term to the class with the maximum posterior probability. Two models are commonly used: the multinomial model and the multivariate Bernoulli model. Naïve Bayes is a stochastic model of document generation that makes use of Bayes’ rule. To select the best named entity class n* for a new term w, it computes:

$$n^* = \arg\max_{n} P(n \mid w) = \arg\max_{n} \frac{P(w \mid n)\,P(n)}{P(w)}$$
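
The maximum-posterior decision rule can be illustrated with a small multivariate Bernoulli Naïve Bayes over binary token features. The sketch below is an assumption-laden example, not the classifier used in the experiments: the function names, the Laplace smoothing constant, and the toy feature names and labels are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_bernoulli_nb(samples, labels, alpha=1.0):
    """Estimate class priors P(n) and per-class feature probabilities P(f=1 | n).

    samples: list of dicts mapping feature name -> 0/1
    labels:  list of class names, one per sample
    alpha:   Laplace smoothing constant (an assumption for this sketch)
    """
    class_counts = Counter(labels)
    feature_names = sorted({name for s in samples for name in s})
    on_counts = defaultdict(Counter)  # on_counts[cls][feature] = times feature was 1
    for sample, cls in zip(samples, labels):
        for name in feature_names:
            on_counts[cls][name] += sample.get(name, 0)
    priors = {c: n / len(labels) for c, n in class_counts.items()}
    likelihoods = {
        c: {f: (on_counts[c][f] + alpha) / (class_counts[c] + 2 * alpha)
            for f in feature_names}
        for c in class_counts
    }
    return priors, likelihoods


def predict(sample, priors, likelihoods):
    """Return argmax over classes of log P(n) + sum of log P(f | n)."""
    best_class, best_score = None, -math.inf
    for cls, prior in priors.items():
        score = math.log(prior)
        for f, p_on in likelihoods[cls].items():
            score += math.log(p_on if sample.get(f, 0) else 1.0 - p_on)
        if score > best_score:
            best_class, best_score = cls, score
    return best_class


# Toy training data with two hypothetical binary features.
samples = [
    {"has_caps": 1, "has_digit": 1},
    {"has_caps": 1, "has_digit": 0},
    {"has_caps": 0, "has_digit": 0},
    {"has_caps": 0, "has_digit": 0},
]
labels = ["protein", "protein", "O", "O"]
priors, likelihoods = train_bernoulli_nb(samples, labels)
label = predict({"has_caps": 1, "has_digit": 1}, priors, likelihoods)
```

Working in log space avoids numerical underflow when many features are multiplied, and the denominator P(w) is dropped because it is the same for every class.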

4.3.3 Artificial Neural Network

A neural network is an interconnected group of artificial neurons that uses a computational model to process data based on a connectionist approach. Sets of input attributes and desired outputs are presented to the learning program, which aims to use the input attributes to partition the training instances into non-overlapping classes corresponding to the desired outputs. The input layer comprises a collection of units equal in number to the tags in the tag set.

The neural network we have used is an acyclic directed graph of sigmoid units trained with the backpropagation algorithm. Sigmoid units are like perceptrons, but they are based on a smoothed, differentiable threshold function. A sigmoid unit first computes a linear combination of its inputs and then applies a threshold to the result, where the threshold output is a continuous function of its input. The sigmoid unit computes its output o as follows:

$$o = \sigma(\vec{w} \cdot \vec{x})$$

where $\vec{w}$ is the weight vector of the unit, $\vec{x}$ is its input vector, and

$$\sigma(y) = \frac{1}{1 + e^{-y}}$$

Here σ is called the sigmoid function. Its output ranges between 0 and 1, increasing monotonically with its input.
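
A single sigmoid unit of the kind described above can be written in a few lines of Python. This is only a sketch of one unit's forward computation; the function names and the optional bias term are assumptions, and no training (backpropagation) is shown.

```python
import math


def sigmoid(y: float) -> float:
    """Logistic function 1 / (1 + e^(-y)), computed without overflow for large |y|."""
    if y >= 0:
        return 1.0 / (1.0 + math.exp(-y))
    e = math.exp(y)
    return e / (1.0 + e)


def sigmoid_unit(weights, inputs, bias=0.0):
    """Output of one unit: the sigmoid of the weighted sum of its inputs."""
    net = bias + sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(net)


output = sigmoid_unit([1.0, -1.0], [2.0, 2.0])  # weighted sum is 0
```

Because the weighted sum in the example is zero, the unit sits exactly at the midpoint of its output range.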

4.4 Performance Measures

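The performance measures used to evaluate the named entity recognition systems participating in the CoNLL-02, CoNLL-03, and JNLPBA-04 challenge tasks are precision, recall, and the weighted mean $F_{\beta=1}$ score. Precision is the percentage of named entities found by the learning system that are correct. Recall is the percentage of named entities present in the corpus that are found by the system. A named entity is correct only if it is an exact match of the corresponding entity in the data file, i.e., the complete named entity is correctly identified. The same performance measures are used to evaluate the results of the baseline experiments. The definitions of the performance measures used are summarized below, where TP is the number of entities found that are correct, FP is the number of entities found that are not correct, and FN is the number of entities in the corpus that are not found:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_{\beta} = \frac{(\beta^2 + 1)\cdot\text{Precision}\cdot\text{Recall}}{\beta^2\cdot\text{Precision} + \text{Recall}}$$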
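
Under the exact-match criterion described above, the three measures can be computed from sets of entity spans. The following Python sketch is illustrative only: representing an entity as a `(start, end, type)` tuple and the example spans themselves are assumptions, not the evaluation script of the shared tasks.

```python
def precision_recall_f(gold, predicted, beta=1.0):
    """Exact-match precision, recall, and F-beta over entity spans.

    gold, predicted: iterables of (start, end, entity_type) tuples.
    An entity counts as correct only if all three fields match exactly.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_score = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_score


# Hypothetical annotations: the DNA mention is found with a wrong right boundary.
gold = [(0, 2, "protein"), (5, 6, "DNA"), (9, 10, "cell_line")]
predicted = [(0, 2, "protein"), (5, 7, "DNA")]
precision, recall, f1 = precision_recall_f(gold, predicted)
```

In this example the boundary error costs the system twice: the mismatched span counts against precision as a wrong prediction and against recall as a missed entity.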

5. Experimental Results

We conducted several experiments to empirically compare different subsets of features and three classification approaches (Naïve Bayes, K-Nearest Neighbor, and decision tree) for biomedical named entity recognition. The aim is to efficiently integrate different feature sets and classification algorithms to synthesize a more accurate classification procedure.

In each main experiment, one of the three classification approaches is applied with different subsets of the features. All of the algorithms are evaluated using ten-fold cross-validation. The results, reported as the macro-averaged F-measure, are the values averaged across the ten folds.
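
The evaluation protocol described above, ten-fold cross-validation with a macro-averaged F-measure, can be sketched as follows. The splitting strategy (a seeded shuffle with round-robin fold assignment), the function names, and the example scores are assumptions for illustration; the paper does not specify how the folds were formed.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # round-robin assignment
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def macro_f(per_class_f):
    """Macro average: unweighted mean of the per-class F-measures."""
    return sum(per_class_f.values()) / len(per_class_f)


splits = list(k_fold_indices(25, k=10))

# Hypothetical per-class F-measures for two folds, averaged across folds.
fold_scores = [macro_f({"protein": 0.5, "DNA": 1.0}), macro_f({"protein": 0.75, "DNA": 0.75})]
mean_macro_f = sum(fold_scores) / len(fold_scores)
```

Every sample is used for testing exactly once, and the reported figure is the mean of the per-fold macro averages.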

In this section, we describe the experiments, which compare 10 different features and the three classification approaches. We have two primary goals with our experiments in biomedical named entity recognition. The first is to identify the better classification approach to use in the model to classify the dataset. The second is to evaluate the usefulness of the features described in the previous section for this task.

Table 1 shows a sample of the dataset used for the experiments

In the first experiment, the KNN classifier is evaluated using 10-fold cross-validation. As shown in the table, there are 9 features, which means that 2^9 = 512 different feature subsets, and hence 512 different experiments, are possible. However, only the results of the best 10 of these 512 experiments are reported here, to show the best results obtained when KNN is applied. Table 2 shows the performance, in terms of precision, recall, and F-measure, of biomedical named entity recognition using the KNN classifier with different sets of features. As shown in Table 2, the choice of feature set has an obvious effect on the quality of biomedical named entity recognition with the KNN classifier.
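
The count of 512 experiments follows from enumerating every subset of the 9 features. The Python sketch below shows that enumeration and the selection of the best-scoring runs; the feature names `f1` to `f9` and the scores are placeholders, since the actual feature names and results appear only in the tables.

```python
from itertools import combinations

# Placeholder names for the 9 features (hypothetical).
FEATURES = [f"f{i}" for i in range(1, 10)]


def feature_subsets(features):
    """Yield every subset of the feature list, from the empty set to the full set."""
    for size in range(len(features) + 1):
        yield from combinations(features, size)


def best_runs(scores, k=10):
    """Return the k (subset, F-measure) pairs with the highest F-measure.

    scores: dict mapping a feature-subset tuple to its F-measure.
    """
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


subsets = list(feature_subsets(FEATURES))

# Made-up scores, only to show how the best runs would be picked.
top = best_runs({("f1",): 0.5, ("f2",): 0.9, ("f1", "f2"): 0.7}, k=2)
```

Note that the figure of 512 includes the empty subset; if a classifier cannot be trained without features, 511 runs remain.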

Table 2 shows the performance in terms of the precision, recall, and F-measure of the biomedical named entity recognition by applying the KNN Classifier

In the second experiment, the NB classifier is evaluated using 10-fold cross-validation. The results of the best 9 of the 512 experiments are reported, to show the best results obtained when NB is applied. Table 3 shows the performance, in terms of precision, recall, and F-measure, of biomedical named entity recognition using the NB classifier with different sets of features. As shown in Table 3, the choice of feature set has an obvious effect on the quality of biomedical named entity recognition with the NB classifier. However, the results obtained using the NB classifier are lower than those obtained using KNN, which suggests that the feature sets have less effect on the performance of the NB classifier than on that of the KNN classifier.

Table 3 shows the performance in terms of the precision, recall, and F-measure of the biomedical named entity recognition by applying the NB Classifier

In the third experiment, the decision tree classifier is evaluated using 10-fold cross-validation. The results of the best 9 of the 512 experiments are reported, to show the best results obtained when the decision tree is applied. Table 4 shows the performance, in terms of precision, recall, and F-measure, of biomedical named entity recognition using the decision tree classifier with different sets of features. As shown in Table 4, the choice of feature set has an obvious effect on the quality of biomedical named entity recognition with the decision tree classifier. However, the results obtained using the decision tree classifier are lower than those obtained using KNN, which suggests that the feature sets have less effect on the performance of the decision tree classifier than on that of the KNN classifier.

Table 4 shows the performance in terms of the precision, recall, and F-measure of the biomedical named entity recognition by applying the decision tree Classifier

6. Conclusion

The core objective of this work is to design and implement a new model for biomedical named entity recognition. A new model is produced based on support vector machine (SVM), Naïve Bayes (NB), and Artificial Neural Network classifiers. Machine learning techniques have been used to build and develop the biomedical named entity recognition model, which requires several steps, including data pre-processing, feature selection and extraction, machine learning models, and classification. Analysis of the reported results shows that the proposed model is satisfactory and effective for biomedical named entity recognition.

