Figure - available via license: Creative Commons Attribution 2.0 Generic
Two possible hyperplanes and margins between positive and negative examples. White triangles are positive examples; black dots are negative examples. The solid line shows a hyperplane, and the two dashed lines mark the boundaries between the positive and negative examples. An SVM finds the optimal hyperplane, the one with the maximum margin.
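The maximum-margin idea in the figure can be sketched with a minimal linear SVM trained by sub-gradient descent on the hinge loss (Pegasos-style). The toy coordinates below are hypothetical stand-ins for the figure's triangles and dots, not data from the paper.

```python
# Minimal linear SVM: sub-gradient descent on the regularized hinge loss.
def train_svm(points, labels, lr=0.01, lam=0.01, epochs=200):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(points, labels):
            if y * (w[0] * x1 + w[1] * x2 + b) < 1:
                # Misclassified or inside the margin: push the hyperplane away.
                w[0] += lr * (y * x1 - lam * w[0])
                w[1] += lr * (y * x2 - lam * w[1])
                b += lr * y
            else:
                # Outside the margin: only apply regularization shrinkage.
                w[0] -= lr * lam * w[0]
                w[1] -= lr * lam * w[1]
    return w, b

def predict(w, b, x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b >= 0 else -1

pos = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0)]        # "white triangles" (positive)
neg = [(-2.0, -2.0), (-3.0, -3.0), (-2.0, -3.0)]  # "black dots" (negative)
w, b = train_svm(pos + neg, [1, 1, 1, -1, -1, -1])
```

The regularization term `lam` is what trades a wider margin against training errors; production systems would use an off-the-shelf solver rather than this sketch.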


Source publication
Article
Full-text available
Automated information extraction from biomedical literature is important because a vast amount of biomedical literature has been published. Recognition of the biomedical named entities is the first step in information extraction. We developed an automated recognition system based on the SVM algorithm and evaluated it in Task 1.A of BioCreAtIvE, a c...

Similar publications

Article
Full-text available
In order to separate the cognitive processes associated with phonological encoding and the use of a visual word form lexicon in reading, it is desirable to compare the processing of words presented in a visually familiar form with words in a visually unfamiliar form. Japanese Kana orthography offers this possibility. Two phonologically equivalent b...
Article
Full-text available
This study investigates the importance of vowel diacritics for the retention of Hebrew word lists, with word lists being manipulated along the dimension of word frequency and syllabic length. Eighty university students participated in the study. Half of the participants (40) were tested with the word lists presented in fully-pointed (voweled) Hebre...
Article
Full-text available
Languages may get the writing system they deserve or merely a writing system they can live with - adaption without optimization. A universal theory of reading reflects the general dependence of writing on language and the adaptations required by the demands of specific languages and their written forms. The theory also can be informed by research t...

Citations

... The availability of data further spurred research in the domain of recognizing mentions of proteins, genes, drugs/chemicals, and similar named entities in different kinds of biomedical texts (bioNER), largely through shared tasks such as BioCreative (Hirschman et al., 2005) or JNLPBA (Collier and Kim, 2004). Early bioNER systems range from entirely rule-based ones (Gaizauskas, Demetriou, Artymiuk and Willett, 2003) to those based on machine learning methods such as Naive Bayes (Nobata, Collier and Tsujii, 1999), Support Vector Machine (Mitsumori, Fation, Murata, Doi and Doi, 2005), Hidden Markov Model (Zhou, Shen, Zhang, Su and Tan, 2005), Maximum Entropy (Dingare, Nissim, Finkel, Manning and Grover, 2005) and Conditional Random Fields (Settles, 2004). While each of those types of systems has its advantages, what they have in common is that building them is time-consuming and requires constant manual adaptation (of rules or features) when data changes. ...
Preprint
Full-text available
Supervised named entity recognition (NER) in the biomedical domain is dependent on large sets of annotated texts with the given named entities, whose creation can be time-consuming and expensive. Furthermore, the extraction of new entities often requires conducting additional annotation tasks and retraining the model. To address these challenges, this paper proposes a transformer-based method for zero- and few-shot NER in the biomedical domain. The method is based on transforming the task of multi-class token classification into binary token classification (token contains the searched entity or does not contain the searched entity) and pre-training on a large number of datasets and biomedical entities, from which the method can learn semantic relations between the given and potential classes. We have achieved average F1 scores of 35.44% for zero-shot NER, 50.10% for one-shot NER, 69.94% for 10-shot NER, and 79.51% for 100-shot NER on 9 diverse evaluated biomedical entities with a fine-tuned PubMedBERT model. The results demonstrate the effectiveness of the proposed method for recognizing new entities with limited examples, with results comparable to or better than state-of-the-art zero- and few-shot NER methods.
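The task transformation described above can be sketched without any model at all: collapse a multi-class BIO tagging into binary labels for one queried entity type. The tag names below are illustrative, not the paper's label set.

```python
# Collapse multi-class BIO tags into binary labels:
# 1 = token belongs to the queried entity type, 0 = it does not.
def to_binary(bio_tags, target_type):
    return [
        1 if tag != "O" and tag.split("-", 1)[1] == target_type else 0
        for tag in bio_tags
    ]
```

Running the same sentence through `to_binary` once per candidate entity type is what lets a single binary classifier stand in for an open-ended set of classes.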
... Moreover, the BioCreAtIvE (Critical Assessment of Information Extraction in Biology) competition for automated gene and protein name recognition consists of a community-wide effort to evaluate information extraction and text mining developments in the biological domain [26]. Mitsumori et al. used the support vector machine algorithm as a learning method for gene and protein name recognition [27], investigating and evaluating the system's performance when making partial dictionary pattern matches. ...
Article
Full-text available
Background Common data models (CDMs) help standardize electronic health record data and facilitate outcome analysis for observational and longitudinal research. An analysis of pathology reports is required to establish fundamental information infrastructure for data-driven colon cancer research. The Observational Medical Outcomes Partnership (OMOP) CDM is used in distributed research networks for clinical data; however, it requires conversion of free text–based pathology reports into the CDM’s format. There are few use cases of representing cancer data in CDM. Objective In this study, we aimed to construct a CDM database of colon cancer–related pathology with natural language processing (NLP) for a research platform that can utilize both clinical and omics data. The essential text entities from the pathology reports are extracted, standardized, and converted to the OMOP CDM format in order to utilize the pathology data in cancer research. Methods We extracted clinical text entities, mapped them to the standard concepts in the Observational Health Data Sciences and Informatics vocabularies, and built databases and defined relations for the CDM tables. Major clinical entities were extracted through NLP on pathology reports of surgical specimens, immunohistochemical studies, and molecular studies of colon cancer patients at a tertiary general hospital in South Korea. Items were extracted from each report using regular expressions in Python. Unstructured data, such as text that does not have a pattern, were handled with expert advice by adding regular expression rules. Our own dictionary was used for normalization and standardization to deal with biomarker and gene names and other ungrammatical expressions. The extracted clinical and genetic information was mapped to the Logical Observation Identifiers Names and Codes databases and the Systematized Nomenclature of Medicine (SNOMED) standard terminologies recommended by the OMOP CDM. 
The database-table relationships were newly defined through SNOMED standard terminology concepts. The standardized data were inserted into the CDM tables. For evaluation, 100 reports were randomly selected and independently annotated by a medical informatics expert and a nurse. Results We examined and standardized 1848 immunohistochemical study reports, 3890 molecular study reports, and 12,352 pathology reports of surgical specimens (from 2017 to 2018). The constructed and updated database contained the following extracted colorectal entities: (1) NOTE_NLP, (2) MEASUREMENT, (3) CONDITION_OCCURRENCE, (4) SPECIMEN, and (5) FACT_RELATIONSHIP of specimen with condition and measurement. Conclusions This study aimed to prepare CDM data for a research platform to take advantage of all omics clinical and patient data at Seoul National University Bundang Hospital for colon cancer pathology. A more sophisticated preparation of the pathology data is needed for further research on cancer genomics, and various types of text narratives are the next target for additional research on the use of data in the CDM.
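The regular-expression extraction step described above can be illustrated with a small sketch. The report snippet, field names, and patterns are hypothetical stand-ins, not the study's actual rules.

```python
import re

# Hypothetical pathology-report snippet (illustrative only).
REPORT = """Tumor size: 4.5 x 3.2 cm
MSI status: MSS
KRAS: mutated (G12D)"""

# Illustrative patterns in the spirit of the paper's per-item rules.
PATTERNS = {
    "tumor_size_cm": r"Tumor size:\s*([\d.]+)\s*x\s*([\d.]+)\s*cm",
    "msi_status": r"MSI status:\s*(\w+)",
    "kras": r"KRAS:\s*(\w+)",
}

def extract(report):
    out = {}
    for field, pattern in PATTERNS.items():
        m = re.search(pattern, report)
        if m:
            # Single-group patterns yield a string, multi-group a tuple.
            out[field] = m.group(1) if m.lastindex == 1 else m.groups()
    return out
```

Text that matches no pattern would, per the paper, be handled by adding new rules with expert advice; the extracted values would then be normalized against a dictionary before CDM mapping.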
... Some of the popular state-of-the-art systems on this dataset include [19,36,46], where SVM and CRF were used as the base classifiers. In general, CRF is a popular classifier for any sequence labeling task such as named entity recognition [33,61]. ...
... Kinoshita et al. [28] proposed a system that achieved an F score of 80.90% with dictionary-based preprocessing and an HMM-based PoS tagger. The SVM-based system [36] utilized a gene/protein name dictionary as domain knowledge and reported an F score of 78.09%. ...
Article
Full-text available
Named entity recognition is a vital task for various applications related to biomedical natural language processing. It aims at extracting different biomedical entities from text and classifying them into predefined categories. The types can vary with genre and domain, such as gene versus non-gene in a coarse-grained scenario, or protein, DNA, RNA, cell line, and cell type in a fine-grained scenario. In this paper, we present a novel filter-based feature selection technique utilizing the search capability of particle swarm optimization (PSO) to determine the optimal feature combination. The technique yields an optimized feature set that, when used for classifier learning, enhances system performance. The proposed approach is assessed over four popular biomedical corpora, namely GENIA, GENETAG, AIMed, and Biocreative-II Gene Mention Recognition (BC-II). Our proposed model obtains F score values of 74.49%, 91.11%, 90.47%, and 88.64% on the GENIA, GENETAG, AIMed, and BC-II datasets, respectively. The efficiency of feature pruning through PSO is evident from significant performance gains, even with a much reduced set of features.
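The PSO-based feature selection idea can be sketched as a binary PSO in which each particle is a bit mask over the features. The fitness below is a toy stand-in (agreement with a known-good mask); the paper scores feature subsets by classifier performance instead, and all constants here are illustrative.

```python
import math
import random

def pso_select(n_feats, fitness, n_particles=10, iters=30, seed=0):
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_feats)] for _ in range(n_particles)]
    vel = [[0.0] * n_feats for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # per-particle best masks
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]  # swarm-wide best
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(n_feats):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (0.7 * vel[i][d]
                             + 1.5 * r1 * (pbest[i][d] - pos[i][d])
                             + 1.5 * r2 * (gbest[d] - pos[i][d]))
                # Sigmoid transfer function maps velocity to a bit probability.
                pos[i][d] = 1 if rng.random() < 1 / (1 + math.exp(-vel[i][d])) else 0
            f = fitness(pos[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return gbest, gbest_fit

TARGET = [1, 0, 1, 0, 1, 0]  # toy "informative features" mask
best, best_fit = pso_select(6, lambda m: sum(a == b for a, b in zip(m, TARGET)))
```

Swapping the toy fitness for a cross-validated classifier score over the selected features turns this sketch into the filter/wrapper pattern the paper builds on.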
... recall, 69.1% precision, and 70.5% F-score on JNLPBA using morphological and semantic features. Mitsumori et al. [21] proposed an approach to recognize entities using a Support Vector Machine (SVM) as the statistical model with internal and external resource features, showing that identification performance is improved by using external biological dictionary features. Saha et al. [22] used a maximum entropy model combined with word-clustering features and feature selection techniques to identify biomedical entities. ...
Chapter
Full-text available
... Dictionary-based annotators are commonly used in biomedical concept recognition because the aim is often to recognize many different types of concepts. While machine learning based annotators work extremely well for recognizing specific concepts, e.g., gene/protein recognition (Mitsumori et al., 2005), they require training data for each different domain. Because our aim was to identify methods from a dictionary (i.e., MeSH), we chose a dictionary-annotator-based approach. ...
Article
Full-text available
Robotic labs, in which experiments are carried out entirely by robots, have the potential to provide a reproducible and transparent foundation for performing basic biomedical laboratory experiments. In this article, we investigate whether these labs could be applicable in current experimental practice. We do this by text mining 1,628 papers for occurrences of methods that are supported by commercial robotic labs. Using two different concept recognition tools, we find that 86%–89% of the papers have at least one of these methods. This and our other results provide indications that robotic labs can serve as the foundation for performing many lab-based experiments.
... (1) Classification-based approaches convert the NER task into a classification problem, applicable to either words or phrases. Naive Bayes [105] and Support Vector Machines [81,100,130] are among the common classifiers used for the biomedical NER task. (2) Sequence-based methods use the complete sequence of words instead of only single words or phrases. ...
Article
Full-text available
The amount of text that is generated every day is increasing dramatically. This tremendous volume of mostly unstructured text cannot be simply processed and perceived by computers. Therefore, efficient and effective techniques and algorithms are required to discover useful patterns. Text mining is the task of extracting meaningful information from text, which has gained significant attention in recent years. In this paper, we describe several of the most fundamental text mining tasks and techniques including text pre-processing, classification and clustering. Additionally, we briefly explain text mining in biomedical and health care domains.
... A number of methods have been used to successfully extract gene names from text, with improvements and refinements of these methods over time. For example, machine learning approaches such as rule-based systems or support vector machines obtain precision and recall of about 83% and 84% respectively [13], [14]. Since the effect of features seems to be small, machine learning approaches seem to perform the best when using all available features [14]. ...
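The quoted precision and recall figures combine into an F-score as follows; the counts below are hypothetical, chosen only to reproduce numbers near the cited ~83%/84%.

```python
# Precision, recall, and the balanced F1 score from raw counts:
# tp = true positives, fp = false positives, fn = false negatives.
def prf(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = prf(tp=83, fp=17, fn=16)  # hypothetical counts: ~83% P, ~84% R
```

F1 is the harmonic mean, so it sits between precision and recall but is pulled toward the lower of the two.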
Preprint
Scientific posters tend to be brief, unstructured, and generally unsuitable for communication beyond a poster session. This paper describes EXPLANe, a framework for annotating posters using optical text recognition and web services on mobile devices. EXPLANe is demonstrated through an interface to the MyVariant.info variant annotation web services, and provides users a list of biological information linked with genetic variants (found via RSIDs extracted from annotated posters). This paper delineates the architecture of the application and includes the results of a five-part evaluation we conducted. Researchers and developers can use the existing codebase as a foundation from which to generate their own annotation tabs when analyzing and annotating posters. Availability: Alpha EXPLANe software is available as an open-source application at https://github.com/ngopal/EXPLANe. Contact: Sean D. Mooney (sdmooney@uw.edu)
... In contrast, machine learning NLP methods are used to produce annotators for specific well-defined purposes such as annotating drug mentions [3,4] and gene mentions [5,6]. Conditional random fields, for example, have produced excellent performance for specific biomedical NER tasks [4], but these systems often require training data from human annotation specific to domain and document genre. ...
Article
Full-text available
Background: Natural language processing (NLP) applications are increasingly important in biomedical data analysis, knowledge engineering, and decision support. Concept recognition is an important component task for NLP pipelines, and can be either general-purpose or domain-specific. We describe a novel, flexible, and general-purpose concept recognition component for NLP pipelines, and compare its speed and accuracy against five commonly used alternatives on both a biological and clinical corpus. NOBLE Coder implements a general algorithm for matching terms to concepts from an arbitrary vocabulary set. The system's matching options can be configured individually or in combination to yield specific system behavior for a variety of NLP tasks. The software is open source, freely available, and easily integrated into UIMA or GATE. We benchmarked speed and accuracy of the system against the CRAFT and ShARe corpora as reference standards and compared it to MMTx, MGrep, Concept Mapper, cTAKES Dictionary Lookup Annotator, and cTAKES Fast Dictionary Lookup Annotator. Results: We describe key advantages of the NOBLE Coder system and associated tools, including its greedy algorithm, configurable matching strategies, and multiple terminology input formats. These features provide unique functionality when compared with existing alternatives, including state-of-the-art systems. On two benchmarking tasks, NOBLE's performance exceeded commonly used alternatives, performing almost as well as the most advanced systems. Error analysis revealed differences in error profiles among systems. Conclusion: NOBLE Coder is comparable to other widely used concept recognition systems in terms of accuracy and speed. Advantages of NOBLE Coder include its interactive terminology builder tool, ease of configuration, and adaptability to various domains and tasks. NOBLE provides a term-to-concept matching system suitable for general concept recognition in biomedical NLP pipelines.
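A greedy longest-match dictionary annotator in the spirit of NOBLE Coder's matching strategy can be sketched as follows. The dictionary entries and concept IDs are hypothetical, and real systems add normalization, abbreviation handling, and configurable match modes on top of this core loop.

```python
# Hypothetical term dictionary: tuples of lowercase tokens -> concept ID.
DICTIONARY = {
    ("epidermal", "growth", "factor", "receptor"): "CONCEPT:EGFR",
    ("growth", "factor"): "CONCEPT:GF",
}

def greedy_annotate(tokens, dictionary):
    """Left-to-right scan; at each position try the longest term first."""
    max_len = max(len(term) for term in dictionary)
    spans, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in dictionary:
                spans.append((i, i + n, dictionary[candidate]))
                i += n  # greedy: consume the matched term whole
                break
        else:
            i += 1  # no term starts here; advance one token
    return spans
```

Because the longest candidate is tried first, the four-token term wins over its embedded two-token term, which is the behavior a greedy matcher is chosen for.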
... The rules designed for a concept normally cannot be applied to other concepts. Statistical (or machine learning) approaches rely on word distribution for discriminating term and non-term features [14][15][16]. ...
Article
Full-text available
Background Controlled vocabularies such as the Unified Medical Language System (UMLS®;) and Medical Subject Headings (MeSH®;) are widely used for biomedical natural language processing (NLP) tasks. However, the standard terminology in such collections suffers from low usage in biomedical literature, e.g. only 13% of UMLS terms appear in MEDLINE®;. Results We here propose an efficient and effective method for extracting noun phrases for biomedical semantic categories. The proposed approach utilizes simple linguistic patterns to select candidate noun phrases based on headwords, and a machine learning classifier is used to filter out noisy phrases. For experiments, three NLP rules were tested and manually evaluated by three annotators. Our approaches showed over 93% precision on average for the headwords, “gene”, “protein”, “disease”, “cell” and “cells”. Conclusions Although biomedical terms in knowledge-rich resources may define semantic categories, variations of the controlled terms in literature are still difficult to identify. The method proposed here is an effort to narrow the gap between controlled vocabularies and the entities used in text. Our extraction method cannot completely eliminate manual evaluation, however a simple and automated solution with high precision performance provides a convenient way for enriching semantic categories by incorporating terms obtained from the literature. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0487-2) contains supplementary material, which is available to authorized users.
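The headword-based candidate selection described above can be sketched as a simple filter: keep noun phrases whose head (last token) is one of the target category words. The candidate phrases below are illustrative, and a machine learning classifier would still be needed to filter out noisy survivors.

```python
# Headwords from the paper's evaluated semantic categories.
HEADWORDS = frozenset({"gene", "protein", "disease", "cell", "cells"})

def candidate_phrases(noun_phrases, headwords=HEADWORDS):
    # An English noun phrase's head is typically its final token.
    return [np for np in noun_phrases if np.split()[-1].lower() in headwords]
```
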
... The evolution of the GENIA corpus [11] further drove research on supervised models for the task. Models like Support Vector Machines (SVM) [19], Hidden Markov Models (HMM) [20] and Conditional Random Fields (CRF) [2], [21] were used to perform named entity recognition on the annotated datasets. ...
Conference Paper
Full-text available
Named entity recognition, a task that represents atomicity as well as granularity, is a first step in any language processing system. The growing typological diversity of literature and its availability in the form of annotated and un-annotated corpora have driven a continued research effort towards optimized algorithms for identifying named entities in text. Recognizing named entities from annotated corpora has matured considerably over time, while recognition from un-annotated corpora is still a challenge for the research community; the challenge rises further when the corpora come from applied literature in the biological or biomedical domain. This paper presents an unsupervised named entity recognition framework that automates signature vectors for UMLS concepts. The idea behind it is to provide a vectorised perspective on UMLS concepts, semantic types, and semantic groups. The vectored representation of UMLS lets the framework be applied in a generic way. Our approach differs from other unsupervised frameworks that employ signature- and vector-based approaches because we create a vector space on the basis of UMLS instead of the corpus. A dataset from GENIA was used for framework validation. Our framework achieved an accuracy of 68.34%, far better than the 27% of METAMAP and 53.8% of CubNER on the same corpus.
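Matching a term vector against concept signature vectors, as the framework above describes, reduces to a nearest-neighbor lookup under cosine similarity. The concept names and vectors below are illustrative, not actual UMLS signatures.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def nearest_concept(term_vec, signatures):
    # Return the concept whose signature vector best matches the term vector.
    return max(signatures, key=lambda cid: cosine(term_vec, signatures[cid]))

SIGNATURES = {  # hypothetical signature vectors for two concepts
    "protein": [1.0, 0.0, 1.0],
    "disease": [0.0, 1.0, 0.0],
}
```

Building `SIGNATURES` from UMLS rather than from a training corpus is the design choice that distinguishes the framework from corpus-derived signature approaches.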