Figure - available via license: Creative Commons Attribution 2.0 Generic
Two possible hyperplanes and margins between positive and negative examples. White triangles are positive examples; black dots are negative examples. The solid line shows a hyperplane, and the two dashed lines mark the boundaries between the positive and negative examples. An SVM finds the optimal hyperplane, the one with the maximum margin.
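The maximum-margin idea in the figure can be sketched with a minimal linear SVM trained by sub-gradient descent on the hinge loss (Pegasos-style). The toy coordinates below are hypothetical stand-ins for the figure's triangles and dots, not data from the paper.

```python
# Minimal linear SVM: sub-gradient descent on the regularized hinge loss.
def train_svm(points, labels, lr=0.01, lam=0.01, epochs=200):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(points, labels):
            if y * (w[0] * x1 + w[1] * x2 + b) < 1:
                # Misclassified or inside the margin: push the hyperplane away.
                w[0] += lr * (y * x1 - lam * w[0])
                w[1] += lr * (y * x2 - lam * w[1])
                b += lr * y
            else:
                # Outside the margin: only apply regularization shrinkage.
                w[0] -= lr * lam * w[0]
                w[1] -= lr * lam * w[1]
    return w, b

def predict(w, b, x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b >= 0 else -1

pos = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0)]        # "white triangles" (positive)
neg = [(-2.0, -2.0), (-3.0, -3.0), (-2.0, -3.0)]  # "black dots" (negative)
w, b = train_svm(pos + neg, [1, 1, 1, -1, -1, -1])
```

The regularization term `lam` is what trades a wider margin against training errors; production systems would use an off-the-shelf solver rather than this sketch.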


Source publication
Article
Full-text available
Automated information extraction from biomedical literature is important because a vast amount of biomedical literature has been published. Recognition of the biomedical named entities is the first step in information extraction. We developed an automated recognition system based on the SVM algorithm and evaluated it in Task 1.A of BioCreAtIvE, a c...

Similar publications

Article
Full-text available
In order to separate the cognitive processes associated with phonological encoding and the use of a visual word form lexicon in reading, it is desirable to compare the processing of words presented in a visually familiar form with words in a visually unfamiliar form. Japanese Kana orthography offers this possibility. Two phonologically equivalent b...
Article
Full-text available
This study investigates the importance of vowel diacritics for the retention of Hebrew word lists, with word lists being manipulated along the dimension of word frequency and syllabic length. Eighty university students participated in the study. Half of the participants (40) were tested with the word lists presented in fully-pointed (voweled) Hebre...
Article
Full-text available
Languages may get the writing system they deserve or merely a writing system they can live with - adaption without optimization. A universal theory of reading reflects the general dependence of writing on language and the adaptations required by the demands of specific languages and their written forms. The theory also can be informed by research t...

Citations

... The availability of data further spurred research in the domain of recognizing mentions of proteins, genes, drugs/chemicals, and similar named entities in different kinds of biomedical texts (bioNER), largely through shared tasks such as BioCreative (Hirschman et al., 2005) or JNLPBA (Collier and Kim, 2004). Early bioNER systems range from entirely rule-based ones (Gaizauskas, Demetriou, Artymiuk and Willett, 2003) to those based on machine learning methods such as Naive Bayes (Nobata, Collier and Tsujii, 1999), Support Vector Machine (Mitsumori, Fation, Murata, Doi and Doi, 2005), Hidden Markov Model (Zhou, Shen, Zhang, Su and Tan, 2005), Maximum Entropy (Dingare, Nissim, Finkel, Manning and Grover, 2005) and Conditional Random Fields (Settles, 2004). While each of those types of systems has its advantages, what they have in common is that building them is time-consuming and requires constant manual adaptation (of rules or features) when data changes. ...
Preprint
Full-text available
Supervised named entity recognition (NER) in the biomedical domain is dependent on large sets of annotated texts with the given named entities, whose creation can be time-consuming and expensive. Furthermore, the extraction of new entities often requires conducting additional annotation tasks and retraining the model. To address these challenges, this paper proposes a transformer-based method for zero- and few-shot NER in the biomedical domain. The method is based on transforming the task of multi-class token classification into binary token classification (token contains the searched entity or does not contain the searched entity) and pre-training on a large number of datasets and biomedical entities, from which the method can learn semantic relations between the given and potential classes. We have achieved average F1 scores of 35.44% for zero-shot NER, 50.10% for one-shot NER, 69.94% for 10-shot NER, and 79.51% for 100-shot NER on 9 diverse evaluated biomedical entities with a fine-tuned PubMedBERT model. The results demonstrate the effectiveness of the proposed method for recognizing new entities with limited examples, with results comparable to or better than state-of-the-art zero- and few-shot NER methods.
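The task transformation described above can be sketched without any model at all: collapse a multi-class BIO tagging into binary labels for one queried entity type. The tag names below are illustrative, not the paper's label set.

```python
# Collapse multi-class BIO tags into binary labels:
# 1 = token belongs to the queried entity type, 0 = it does not.
def to_binary(bio_tags, target_type):
    return [
        1 if tag != "O" and tag.split("-", 1)[1] == target_type else 0
        for tag in bio_tags
    ]
```

Running the same sentence through `to_binary` once per candidate entity type is what lets a single binary classifier stand in for an open-ended set of classes.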
... Moreover, the BioCreAtIvE (Critical Assessment of Information Extraction in Biology) competition for automated gene and protein name recognition consists of a community-wide effort to evaluate information extraction and text mining developments in the biological domain [26]. Mitsumori et al. used the support vector machine algorithm as a learning method for gene and protein name recognition [27], investigating and evaluating the system's performance when making partial dictionary pattern matches. ...
Article
Full-text available
Background Common data models (CDMs) help standardize electronic health record data and facilitate outcome analysis for observational and longitudinal research. An analysis of pathology reports is required to establish fundamental information infrastructure for data-driven colon cancer research. The Observational Medical Outcomes Partnership (OMOP) CDM is used in distributed research networks for clinical data; however, it requires conversion of free text–based pathology reports into the CDM’s format. There are few use cases of representing cancer data in CDM. Objective In this study, we aimed to construct a CDM database of colon cancer–related pathology with natural language processing (NLP) for a research platform that can utilize both clinical and omics data. The essential text entities from the pathology reports are extracted, standardized, and converted to the OMOP CDM format in order to utilize the pathology data in cancer research. Methods We extracted clinical text entities, mapped them to the standard concepts in the Observational Health Data Sciences and Informatics vocabularies, and built databases and defined relations for the CDM tables. Major clinical entities were extracted through NLP on pathology reports of surgical specimens, immunohistochemical studies, and molecular studies of colon cancer patients at a tertiary general hospital in South Korea. Items were extracted from each report using regular expressions in Python. Unstructured data, such as text that does not have a pattern, were handled with expert advice by adding regular expression rules. Our own dictionary was used for normalization and standardization to deal with biomarker and gene names and other ungrammatical expressions. The extracted clinical and genetic information was mapped to the Logical Observation Identifiers Names and Codes databases and the Systematized Nomenclature of Medicine (SNOMED) standard terminologies recommended by the OMOP CDM. 
The database-table relationships were newly defined through SNOMED standard terminology concepts. The standardized data were inserted into the CDM tables. For evaluation, 100 reports were randomly selected and independently annotated by a medical informatics expert and a nurse. Results We examined and standardized 1848 immunohistochemical study reports, 3890 molecular study reports, and 12,352 pathology reports of surgical specimens (from 2017 to 2018). The constructed and updated database contained the following extracted colorectal entities: (1) NOTE_NLP, (2) MEASUREMENT, (3) CONDITION_OCCURRENCE, (4) SPECIMEN, and (5) FACT_RELATIONSHIP of specimen with condition and measurement. Conclusions This study aimed to prepare CDM data for a research platform to take advantage of all omics clinical and patient data at Seoul National University Bundang Hospital for colon cancer pathology. A more sophisticated preparation of the pathology data is needed for further research on cancer genomics, and various types of text narratives are the next target for additional research on the use of data in the CDM.
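The regular-expression extraction step described above can be illustrated with a small sketch. The report snippet, field names, and patterns are hypothetical stand-ins, not the study's actual rules.

```python
import re

# Hypothetical pathology-report snippet (illustrative only).
REPORT = """Tumor size: 4.5 x 3.2 cm
MSI status: MSS
KRAS: mutated (G12D)"""

# Illustrative patterns in the spirit of the paper's per-item rules.
PATTERNS = {
    "tumor_size_cm": r"Tumor size:\s*([\d.]+)\s*x\s*([\d.]+)\s*cm",
    "msi_status": r"MSI status:\s*(\w+)",
    "kras": r"KRAS:\s*(\w+)",
}

def extract(report):
    out = {}
    for field, pattern in PATTERNS.items():
        m = re.search(pattern, report)
        if m:
            # Single-group patterns yield a string, multi-group a tuple.
            out[field] = m.group(1) if m.lastindex == 1 else m.groups()
    return out
```

Text that matches no pattern would, per the paper, be handled by adding new rules with expert advice; the extracted values would then be normalized against a dictionary before CDM mapping.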
... Some of the popular state-of-the-art systems on this dataset include [19,36,46], where SVM and CRF were used as the base classifiers. In general, CRF is a popular classifier for any sequence labeling task such as named entity recognition [33,61]. ...
... Kinoshita et al. [28] proposed a system that achieved an F score of 80.90% with dictionary-based preprocessing and an HMM-based PoS tagger. The SVM-based system [36] utilized a gene/protein name dictionary as domain knowledge and reported an F score of 78.09%. ...
Article
Full-text available
Named entity recognition is a vital task for various applications related to biomedical natural language processing. It aims at extracting different biomedical entities from text and classifying them into predefined categories. The types can vary with genre and domain, such as gene versus non-gene in a coarse-grained scenario, or protein, DNA, RNA, cell line, and cell type in a fine-grained scenario. In this paper, we present a novel filter-based feature selection technique utilizing the search capability of particle swarm optimization (PSO) to determine the optimal feature combination. The technique yields an optimized feature set that, when used for classifier learning, enhances system performance. The proposed approach is assessed over four popular biomedical corpora, namely GENIA, GENETAG, AIMed, and Biocreative-II Gene Mention Recognition (BC-II). Our proposed model obtains F score values of 74.49%, 91.11%, 90.47%, and 88.64% on the GENIA, GENETAG, AIMed, and BC-II datasets, respectively. The efficiency of feature pruning through PSO is evident from significant performance gains, even with a much reduced set of features.
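The PSO-based feature selection idea can be sketched as a binary PSO in which each particle is a bit mask over the features. The fitness below is a toy stand-in (agreement with a known-good mask); the paper scores feature subsets by classifier performance instead, and all constants here are illustrative.

```python
import math
import random

def pso_select(n_feats, fitness, n_particles=10, iters=30, seed=0):
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_feats)] for _ in range(n_particles)]
    vel = [[0.0] * n_feats for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # per-particle best masks
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]  # swarm-wide best
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(n_feats):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (0.7 * vel[i][d]
                             + 1.5 * r1 * (pbest[i][d] - pos[i][d])
                             + 1.5 * r2 * (gbest[d] - pos[i][d]))
                # Sigmoid transfer function maps velocity to a bit probability.
                pos[i][d] = 1 if rng.random() < 1 / (1 + math.exp(-vel[i][d])) else 0
            f = fitness(pos[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return gbest, gbest_fit

TARGET = [1, 0, 1, 0, 1, 0]  # toy "informative features" mask
best, best_fit = pso_select(6, lambda m: sum(a == b for a, b in zip(m, TARGET)))
```

Swapping the toy fitness for a cross-validated classifier score over the selected features turns this sketch into the filter/wrapper pattern the paper builds on.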
... recall, 69.1% precision, and 70.5% F-score on JNLPBA using morphological and semantic features. Mitsumori et al. [21] proposed an approach to recognize entities using a Support Vector Machine (SVM) as the statistical model with internal and external resource features, showing that identification performance is improved by using external biological dictionary features. Saha et al. [22] used a maximum entropy model combined with word-clustering features and feature selection techniques to identify biomedical entities. ...
Chapter
Full-text available
... Dictionary-based annotators are commonly used in biomedical concept recognition because the aim is often to recognize many different types of concepts. While machine learning based annotators work extremely well for recognizing specific concepts, e.g., gene/protein recognition (Mitsumori et al., 2005), they require training data for each different domain. Because our aim was to identify methods from a dictionary (i.e., MeSH), we chose a dictionary-annotator-based approach. ...
Article
Full-text available
Robotic labs, in which experiments are carried out entirely by robots, have the potential to provide a reproducible and transparent foundation for performing basic biomedical laboratory experiments. In this article, we investigate whether these labs could be applicable in current experimental practice. We do this by text mining 1,628 papers for occurrences of methods that are supported by commercial robotic labs. Using two different concept recognition tools, we find that 86%–89% of the papers have at least one of these methods. This and our other results provide indications that robotic labs can serve as the foundation for performing many lab-based experiments.
... (1) Classification-based approaches convert the NER task into a classification problem, applicable to either words or phrases. Naive Bayes [105] and Support Vector Machines [81,100,130] are among the common classifiers used for the biomedical NER task. (2) Sequence-based methods use the complete sequence of words instead of only single words or phrases. ...
Article
Full-text available
The amount of text that is generated every day is increasing dramatically. This tremendous volume of mostly unstructured text cannot be simply processed and perceived by computers. Therefore, efficient and effective techniques and algorithms are required to discover useful patterns. Text mining is the task of extracting meaningful information from text, which has gained significant attention in recent years. In this paper, we describe several of the most fundamental text mining tasks and techniques including text pre-processing, classification and clustering. Additionally, we briefly explain text mining in biomedical and health care domains.
... A number of methods have been used to successfully extract gene names from text, with improvements and refinements of these methods over time. For example, machine learning approaches such as rule-based systems or support vector machines obtain precision and recall of about 83% and 84% respectively [13], [14]. Since the effect of features seems to be small, machine learning approaches seem to perform the best when using all available features [14]. ...
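The quoted precision and recall figures combine into an F-score as follows; the counts below are hypothetical, chosen only to reproduce numbers near the cited ~83%/84%.

```python
# Precision, recall, and the balanced F1 score from raw counts:
# tp = true positives, fp = false positives, fn = false negatives.
def prf(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = prf(tp=83, fp=17, fn=16)  # hypothetical counts: ~83% P, ~84% R
```

F1 is the harmonic mean, so it sits between precision and recall but is pulled toward the lower of the two.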
Preprint
Scientific posters tend to be brief, unstructured, and generally unsuitable for communication beyond a poster session. This paper describes EXPLANe, a framework for annotating posters using optical text recognition and web services on mobile devices. EXPLANe is demonstrated through an interface to the MyVariant.info variant annotation web services, and provides users a list of biological information linked with genetic variants (found via RSIDs extracted from annotated posters). This paper delineates the architecture of the application and includes the results of a five-part evaluation we conducted. Researchers and developers can use the existing codebase as a foundation from which to generate their own annotation tabs when analyzing and annotating posters. Availability: Alpha EXPLANe software is available as an open-source application at https://github.com/ngopal/EXPLANe. Contact: Sean D. Mooney (sdmooney@uw.edu)
... In contrast, machine learning NLP methods are used to produce annotators for specific well-defined purposes such as annotating drug mentions [3,4] and gene mentions [5,6]. Conditional random fields, for example, have produced excellent performance for specific biomedical NER tasks [4], but these systems often require training data from human annotation specific to domain and document genre. ...
Article
Full-text available
Background: Natural language processing (NLP) applications are increasingly important in biomedical data analysis, knowledge engineering, and decision support. Concept recognition is an important component task for NLP pipelines, and can be either general-purpose or domain-specific. We describe a novel, flexible, and general-purpose concept recognition component for NLP pipelines, and compare its speed and accuracy against five commonly used alternatives on both a biological and clinical corpus. NOBLE Coder implements a general algorithm for matching terms to concepts from an arbitrary vocabulary set. The system's matching options can be configured individually or in combination to yield specific system behavior for a variety of NLP tasks. The software is open source, freely available, and easily integrated into UIMA or GATE. We benchmarked speed and accuracy of the system against the CRAFT and ShARe corpora as reference standards and compared it to MMTx, MGrep, Concept Mapper, cTAKES Dictionary Lookup Annotator, and cTAKES Fast Dictionary Lookup Annotator. Results: We describe key advantages of the NOBLE Coder system and associated tools, including its greedy algorithm, configurable matching strategies, and multiple terminology input formats. These features provide unique functionality when compared with existing alternatives, including state-of-the-art systems. On two benchmarking tasks, NOBLE's performance exceeded commonly used alternatives, performing almost as well as the most advanced systems. Error analysis revealed differences in error profiles among systems. Conclusion: NOBLE Coder is comparable to other widely used concept recognition systems in terms of accuracy and speed. Advantages of NOBLE Coder include its interactive terminology builder tool, ease of configuration, and adaptability to various domains and tasks. NOBLE provides a term-to-concept matching system suitable for general concept recognition in biomedical NLP pipelines.
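A greedy longest-match dictionary annotator in the spirit of NOBLE Coder's matching strategy can be sketched as follows. The dictionary entries and concept IDs are hypothetical, and real systems add normalization, abbreviation handling, and configurable match modes on top of this core loop.

```python
# Hypothetical term dictionary: tuples of lowercase tokens -> concept ID.
DICTIONARY = {
    ("epidermal", "growth", "factor", "receptor"): "CONCEPT:EGFR",
    ("growth", "factor"): "CONCEPT:GF",
}

def greedy_annotate(tokens, dictionary):
    """Left-to-right scan; at each position try the longest term first."""
    max_len = max(len(term) for term in dictionary)
    spans, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in dictionary:
                spans.append((i, i + n, dictionary[candidate]))
                i += n  # greedy: consume the matched term whole
                break
        else:
            i += 1  # no term starts here; advance one token
    return spans
```

Because the longest candidate is tried first, the four-token term wins over its embedded two-token term, which is the behavior a greedy matcher is chosen for.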
... The rules designed for a concept normally cannot be applied to other concepts. Statistical (or machine learning) approaches rely on word distribution for discriminating term and non-term features [14][15][16]. ...
Article
Full-text available
Background Controlled vocabularies such as the Unified Medical Language System (UMLS®;) and Medical Subject Headings (MeSH®;) are widely used for biomedical natural language processing (NLP) tasks. However, the standard terminology in such collections suffers from low usage in biomedical literature, e.g. only 13% of UMLS terms appear in MEDLINE®;. Results We here propose an efficient and effective method for extracting noun phrases for biomedical semantic categories. The proposed approach utilizes simple linguistic patterns to select candidate noun phrases based on headwords, and a machine learning classifier is used to filter out noisy phrases. For experiments, three NLP rules were tested and manually evaluated by three annotators. Our approaches showed over 93% precision on average for the headwords, “gene”, “protein”, “disease”, “cell” and “cells”. Conclusions Although biomedical terms in knowledge-rich resources may define semantic categories, variations of the controlled terms in literature are still difficult to identify. The method proposed here is an effort to narrow the gap between controlled vocabularies and the entities used in text. Our extraction method cannot completely eliminate manual evaluation, however a simple and automated solution with high precision performance provides a convenient way for enriching semantic categories by incorporating terms obtained from the literature. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0487-2) contains supplementary material, which is available to authorized users.
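The headword-based candidate selection described above can be sketched as a simple filter: keep noun phrases whose head (last token) is one of the target category words. The candidate phrases below are illustrative, and a machine learning classifier would still be needed to filter out noisy survivors.

```python
# Headwords from the paper's evaluated semantic categories.
HEADWORDS = frozenset({"gene", "protein", "disease", "cell", "cells"})

def candidate_phrases(noun_phrases, headwords=HEADWORDS):
    # An English noun phrase's head is typically its final token.
    return [np for np in noun_phrases if np.split()[-1].lower() in headwords]
```
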
... The evolution of the GENIA corpus [11] further drove research on supervised models for the task. Models like Support Vector Machines (SVM) [19], Hidden Markov Models (HMM) [20] and Conditional Random Fields (CRF) [2], [21] were used to perform named entity recognition on the annotated datasets. ...
Conference Paper
Full-text available
Named entity recognition, a task that represents atomicity as well as granularity, is a first step in any language processing system. The growing typological diversity of literature and its availability in the form of annotated and un-annotated corpora have driven a continued research effort towards optimized algorithms for identifying named entities in text. Recognizing named entities from annotated corpora has matured considerably over time, while recognition from un-annotated corpora is still a challenge for the research community; the challenge rises further when the corpora come from applied literature in the biological or biomedical domain. This paper presents an unsupervised named entity recognition framework that automates signature vectors for UMLS concepts. The idea behind it is to provide a vectorised perspective on UMLS concepts, semantic types, and semantic groups. The vectored representation of UMLS lets the framework be applied in a generic way. Our approach differs from other unsupervised frameworks that employ signature- and vector-based approaches because we create a vector space on the basis of UMLS instead of the corpus. A dataset from GENIA was used for framework validation. Our framework achieved an accuracy of 68.34%, far better than the 27% of METAMAP and 53.8% of CubNER on the same corpus.
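Matching a term vector against concept signature vectors, as the framework above describes, reduces to a nearest-neighbor lookup under cosine similarity. The concept names and vectors below are illustrative, not actual UMLS signatures.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def nearest_concept(term_vec, signatures):
    # Return the concept whose signature vector best matches the term vector.
    return max(signatures, key=lambda cid: cosine(term_vec, signatures[cid]))

SIGNATURES = {  # hypothetical signature vectors for two concepts
    "protein": [1.0, 0.0, 1.0],
    "disease": [0.0, 1.0, 0.0],
}
```

Building `SIGNATURES` from UMLS rather than from a training corpus is the design choice that distinguishes the framework from corpus-derived signature approaches.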