Article

Comparing a Rule-Based Versus Statistical System for Automatic Categorization of MEDLINE Documents According to Biomedical Specialty


Abstract

Automatic document categorization is an important research problem in Information Science and Natural Language Processing. Many applications, including Word Sense Disambiguation and Information Retrieval in large collections, can benefit from such categorization. This paper focuses on the automatic categorization of documents from the biomedical literature into broad discipline-based categories. Two different systems are described and contrasted: CISMeF, which uses rules based on human indexing of the documents with the Medical Subject Headings® (MeSH®) controlled vocabulary to assign metaterms (MTs), and Journal Descriptor Indexing (JDI), which is based on human categorization of about 4,000 journals and on statistical associations between journal descriptors (JDs) and textwords in the documents. We evaluate and compare the performance of these systems against a gold standard of human-assigned categories for one hundred MEDLINE documents, using six measures selected from trec_eval. The results show that performance is comparable on five of the measures, and that JDI is superior on one. We conclude that these results favor JDI, given the significantly greater intellectual overhead involved in human indexing and in maintaining a rule base for mapping MeSH terms to MTs. We also note a JDI variant that associates JDs with MeSH indexing rather than textwords; it may be worthwhile to investigate whether this statistical JDI method and the rule-based CISMeF might be combined and evaluated to determine whether they are complementary.
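The statistical side of this comparison can be illustrated with a minimal sketch of JDI-style categorization: learn per-word association scores with journal descriptors from a labeled corpus, then rank descriptors for a new document by averaging its words' scores. This is a hypothetical simplification for illustration, not NLM's actual implementation; all function and variable names are ours.

```python
from collections import defaultdict

def train_jdi(docs):
    """docs: list of (tokens, journal_descriptor) pairs.
    Returns a word -> {JD: P(JD | word)} association table."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens, jd in docs:
        for w in set(tokens):          # document-level co-occurrence
            counts[w][jd] += 1
    table = {}
    for w, jd_counts in counts.items():
        total = sum(jd_counts.values())
        table[w] = {jd: c / total for jd, c in jd_counts.items()}
    return table

def rank_categories(tokens, table):
    """Average the per-word JD vectors over the document's known words
    and return JDs ranked by mean association score."""
    known = [w for w in tokens if w in table]
    if not known:
        return []
    scores = defaultdict(float)
    for w in known:
        for jd, p in table[w].items():
            scores[jd] += p / len(known)
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

A document about "heart disease" would then rank a cardiology descriptor first if "heart" co-occurred mostly with cardiology journals in training.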


... From the volunteers' choices, a category relevance ranking was created [Humphrey et al. 2009; Sebastiani 2002], formed by the number of selections of each category for each page across all evaluators. The category with the most selections by the evaluators for a page was taken as its first category in the ranking [Rosso 2005], and the same was done for each position of the ranking, down to the fifth position. ...
... The results of the pattern classifiers used in previous work [Sousa 2011, Sousa et al. 2012] are summarized in Table 1. The classifiers used were: Naive Bayes [John and Langley 1995], with feature extraction by term occurrence (nb-to), binary occurrence (nb-bo), term frequency (nb-tf), and tf.idf (nb-tfidf) [Salton and Buckley 1988]; and Journal Descriptor Indexing [Humphrey et al. 2009], with feature extraction by word co-occurrence counts (jdi-wc) and document co-occurrence counts (jdi-dc) [Humphrey et al. 2009]. Figure 1 shows the behavior of recall for the 5 most relevant positions of the category relevance ranking, for both human and automatic classification. ...
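The tf.idf weighting [Salton and Buckley 1988] used here as a Naive Bayes feature can be sketched as follows; this is a minimal illustration assuming raw term frequency times log inverse document frequency, with names of our choosing.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists.
    Returns one {term: tf * idf} weight dict per document."""
    n = len(docs)
    df = Counter()                     # document frequency per term
    for d in docs:
        df.update(set(d))
    weights = []
    for d in docs:
        tf = Counter(d)                # raw term frequency
        weights.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return weights
```

Terms occurring in every document get weight 0, so only discriminative terms contribute to the classifier's feature vector.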
Article
Lay people have difficulty when searching for health information on the Web. This study evaluated the adequacy of automatic multi-label suggestion for health web pages in Brazilian Portuguese. We collected 57 health web pages and asked 21 volunteers to evaluate them. We measured recall, the consensus among evaluators, and the consensus between evaluators and the automatic classifiers (Naive Bayes and Journal Descriptor Indexing). Recall reached 100%, with high consensus among evaluators for the 5 most relevant categories, suggesting that automatic multi-labeling of health web pages helps information retrieval by lay people.
... It uses predefined patterns to classify the question type and determine the answer [13][14]. The main drawback of this system is that the rules must be created manually. In-depth knowledge of the syntax of a language is needed to write the rules [15]. The importance of statistical approaches for QA systems has increased due to the tremendous growth of text material available online. ...
... 14 Sandra moved to the kitchen. 15 Where is Sandra? → kitchen (supporting fact: 14) ...
Article
Full-text available
A Question Answering (QA) system is a field of Natural Language Processing that allows users to ask questions in natural language sentences and returns a brief answer rather than a list of documents. This work intends to use Recurrent Neural Network (RNN) based deep learning algorithms to solve the Question Answering problem. The use of recurrent neural networks allows us to expand and apply this model to a variety of question answering tasks. In this work, a simple RNN-based Question Answering System is implemented and its performance is evaluated on simple and complex question answering tasks from the bAbI dataset. The performance of training and testing is studied with suitable metrics, and the difference in performance between the two question answering tasks is observed.
... Biomedical information extraction (BMIE) has captured the interests of researchers in recent years. The literature on BMIE has been primarily focused on the detection of biomedical topics [1][2][3] and extraction of gene and protein information from text. [4][5][6] Although the volume of publications on disease and chemical compound detection in unstructured text has been steadily growing, [7] there are very few studies published on detecting diagnostic laboratory information and related biomedical named entities. ...
... Current methods and tools have been tested on various biomedical entities. [1][2][3][4][5][6][7] Even though laboratory test information are crucial components of any clinical narrative, the authors could not find in the literature a single pathology informatics study on detecting and extracting laboratory test information from narrative text. ...
Article
Full-text available
No previous study has reported the efficacy of current natural language processing (NLP) methods for extracting laboratory test information from narrative documents. This study investigates the pathology informatics question of how accurately such information can be extracted from text with current tools and techniques, especially machine learning and symbolic NLP methods. The study data came from a text corpus maintained by the U.S. Food and Drug Administration, containing a rich set of information on laboratory tests and test devices. The authors developed a symbolic information extraction (SIE) system to extract device- and test-specific information about four types of laboratory test entities: specimens, analytes, units of measure and detection limits. They compared the performance of SIE and three prominent machine learning based NLP systems, LingPipe, GATE and BANNER, each implementing a distinct supervised machine learning method: hidden Markov models, support vector machines and conditional random fields, respectively. Machine learning systems recognized laboratory test entities with moderately high recall but low precision. Their recall rates were relatively higher when the number of distinct entity values (e.g., the spectrum of specimens) was very limited or when the lexical morphology of the entity was distinctive (as in units of measure), yet SIE outperformed them by statistically significant margins on extracting specimen, analyte and detection limit information in both precision and F-measure. Its high recall performance was statistically significant on analyte information extraction. Despite its shortcomings against machine learning methods, a well-tailored symbolic system may better discern relevancy among a pile of information of the same type and may outperform a machine learning system by tapping into lexically non-local contextual information such as the document structure.
... These rules have usually been written manually, based on the lexical and syntactic structure of the question and of the paragraph containing the answer. In order to produce such rules, a deep understanding of the language is required (Humphrey et al., 2009). With the increasing volume of documents on the Internet, statistical approaches became the dominant approach to the QA problem. ...
Article
Full-text available
Nowadays, a considerable volume of news articles is produced daily by news agencies worldwide. Since there is an extensive volume of news on the web, finding exact answers to users' questions is not a straightforward task. Developing Question Answering (QA) systems for news articles can tackle this challenge. Due to the lack of studies on Persian QA systems and the importance and wide applications of QA systems in the news domain, this research aims to design and implement a QA system for Persian news articles. To the best of our knowledge, this is the first attempt to develop a Persian QA system in the news domain. We first create FarsQuAD: a Persian QA dataset for the news domain. We analyze the type and complexity of users' questions about Persian news. The results show that What and Who questions have the most occurrences, and Why and Which questions the least, in the Persian news domain. The results also indicate that users usually raise complex questions about Persian news. We then develop FarsNewsQA: a QA system for answering questions about Persian news. We developed three models of FarsNewsQA using BERT, ParsBERT, and ALBERT. The best version of FarsNewsQA offers an F1 score of 75.61%, which is comparable with that of QA systems on the English SQuAD dataset made by Stanford University, and shows that the new BERT-based technologies work well for Persian news QA systems.
... The main disadvantage of this method is that it is time-consuming [14][15]. ...
Article
Full-text available
A Question Answering (QA) system is a field of Natural Language Processing in which users can post queries in their own language, and the system gives a precise answer instead of a list of documents. A memory network can perform reasoning with an inference component and a long-term memory component, using the two components efficiently to find answers from the story context for a given query. In our earlier work [26] we evaluated the performance of the MemN2N network on complex and easy question answering tasks and found that MemN2N fails to produce good results on some complex QA tasks of the bAbI dataset. This work intends to improve that performance with a state-of-the-art bi-model end-to-end memory network (BiMemN2N_I) for such complex QA tasks and to compare its performance with the standard MemN2N and MemNN models. In this work, a bi-model MemN2N based question answering system is implemented and its performance is evaluated on complex question answering tasks from the bAbI dataset. In addition, the performance of training and testing is studied with suitable metrics, and the difference in performance between the two question answering tasks is identified.
... The difficulty in this approach was that the rules had to be written manually. The user needs in-depth knowledge of the structure and semantics of a language to write the rules [15]. ...
Article
Full-text available
A Question Answering (QA) system is a field of Natural Language Processing that allows users to post questions in natural language sentences and returns a short and precise answer rather than a set of documents. This work aims to evaluate three deep learning models, RNN, LSTM and GRU, on question answering tasks. The use of deep learning networks allows us to expand and apply these models to a variety of question answering tasks. In this work, we implement three deep learning model based question answering systems and evaluate their performance on simple and complex question answering tasks from the bAbI dataset. We study the performance of training and testing with suitable metrics and find the difference in performance between the two question answering tasks.
... A major drawback of rule-based question answering systems is that the rules must be written manually, which consumes a lot of time [14]. To write the rules, in-depth knowledge of the structure of the language is needed [15]. ...
Article
Full-text available
A Question Answering (QA) system is a field of Natural Language Processing that allows users to ask questions in natural language sentences and returns a brief answer rather than a list of documents. Memory networks are capable of reasoning with inference components combined with a long-term memory component, and they learn how to use these two components efficiently to predict answers from the story text for a specific question. This work intends to evaluate the performance of an earlier Keras implementation of the memory network (MemNN) model and compare it with three standard deep learning models: RNN, LSTM and GRU. In this work, we implement a Keras based MemNN question answering system and evaluate its performance on simple and complex question answering tasks from the bAbI dataset. We study the performance of training and testing with suitable metrics and find the difference in performance between the two question answering tasks.
... A major drawback of rule-based question answering systems was that the heuristic rules needed to be manually crafted. To devise these rules an in-depth knowledge of the semantics of a language was a necessity [5]. With the rapid growth of text material available online the importance of statistical approaches for QA has also increased. ...
Preprint
Question Answering has recently received high attention from artificial intelligence communities due to advancements in learning technologies. Early question answering models used rule-based approaches and then moved to statistical approaches to address the vast amount of available information. However, statistical approaches have been shown to underperform in handling the dynamic nature and variation of language. Learning models, by contrast, have shown the capability of handling this dynamic nature and variation. Many deep learning methods have been introduced for question answering, and most have been shown to achieve higher results than machine learning and statistical methods. The dynamic nature of language has benefited from the nonlinear learning in deep learning, which has created prominent success and a spike in work on question answering. This paper discusses the successes and challenges in question answering systems and the techniques used to address these challenges.
... To build the SDI classifier, we used the approach of Journal Descriptor Indexing (JDI) (13), developed by the National Library of Medicine (NLM). JDI is an automatic text categorization tool that has presented significant results for the classification and indexing of scientific articles. ...
Article
Full-text available
Objective: To present the results of a sentiment classification methodology, here denominated Sentiment Descriptor Indexing (SDI), applied to Brazilian Portuguese Twitter messages related to health topics. Methods: The first step was the construction of an algorithm based on the co-occurrence of Twitter terms with the sentiment descriptor vocabulary known as ANEW-BR. In the second stage, the performance of the SDI algorithm was evaluated on messages about "cancer" over a period of three weeks. The messages were classified by volunteers and, in parallel, by SDI; the classifications were paired to generate a performance evaluation. Results: The precision and recall values were 0.68 and 0.67, respectively. A total of 25,230 messages on the topic "cancer" were collected, with a positive sentiment classification rate of 71%. Conclusion: The contributions of this work aim to fill the lack of sentiment analysis methods for the Brazilian Portuguese language.
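A co-occurrence scheme of this kind, matching message terms against a sentiment descriptor vocabulary, might be sketched as below. This is a hypothetical simplification: it assumes an ANEW-style lexicon of valence scores on a 1-9 scale and a fixed neutrality threshold, neither of which is specified by the paper.

```python
def sdi_classify(tokens, lexicon, threshold=5.0):
    """Label a message by the mean valence of its terms that occur
    in the sentiment-descriptor lexicon (assumed 1-9 valence scale)."""
    vals = [lexicon[t] for t in tokens if t in lexicon]
    if not vals:
        return "neutral"               # no descriptor co-occurrence
    mean = sum(vals) / len(vals)
    return "positive" if mean > threshold else "negative"
```

Terms absent from the lexicon simply do not contribute, so coverage of the vocabulary directly bounds how many messages get a non-neutral label.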
... Probable reasons for the sustained interest of LIS professionals in classification could be that they (and other professionals, particularly information technology professionals) have found, and continue to find, newer applications of classification. These newer applications include the use of classification in text categorisation/automatic classification [18][19][20]; in the management of web contents [21][22][23][24]; in organising resources in institutional repositories 25; and in resource discovery from the internet 26. The other new areas where classification is applied are the creation and maintenance of semantic web tools such as taxonomies 27, ontologies [28][29] and folksonomies 30. ...
Article
Full-text available
This paper analyses the literature of classification published during 2000 to 2009 and finds that there is sustainability in the growth of literature on classification in the first decade of the 21st century. It traces the pattern in scattering of literature on classification in library and information science (LIS) journals and concludes that the literature adheres to the Bradford’s law of scattering. It produces rank list of journals publishing the literature on classification and identifies authorship patterns and the prominent writers in classification. The research finds that the Indian LIS writers have shown sustained interest in classification domain.
... These techniques can be divided into two main streams: the rule-based approach (Friedman et al, 2004; Hahn, Romacker & Schulz, 2002) and the statistical approach (Taira & Soderland, 1999; Sebastiani, 2002). A comparison between the two methods tested systems using both approaches for the automatic categorization of MEDLINE abstracts (Humphrey et al, 2009) and found comparable results for most evaluated items. The results favored the statistical approach, though the authors suggested combining both approaches. ...
Article
Full-text available
The activities of organizing knowledge recorded in texts and obtaining knowledge from human experts – the knowledge acquisition process – are essential for scientific development. In this article, we propose methodological steps for knowledge acquisition, which have been applied to the construction of biomedical ontologies. The methodological steps are tested in a real case of knowledge acquisition in the domain of human blood. We hope to contribute to the improvement of knowledge acquisition for the representation of scientific knowledge in ontologies.
... Nevertheless, the semantic links were created based on the know-how of professional librarians and medical experts, with the help of the NLM network using the Medlib-L listserv. Furthermore, this validity was recently compared to NLM Journal Descriptors for categorizing scientific articles, and no significant difference was observed [7]. ...
Article
Full-text available
General practitioners and medical specialists mainly rely on one "general medical" journal to keep their medical knowledge up to date. Nevertheless, it is not known whether these journals display the same overview of medical knowledge across the different specialties. The aims of this study were to measure the relative weight of the different specialties in the major journals of general medicine, to evaluate the trends in these weights over a ten-year period and to compare the journals. The 14,091 articles published in The Lancet, the NEJM, the JAMA and the BMJ in 1997, 2002 and 2007 were analyzed. The relative weight of the medical specialties was determined by categorization of all the articles, using a categorization algorithm which inferred the medical specialties relevant to each MEDLINE article from the MeSH terms used by the indexers of the US National Library of Medicine to describe each article. The 14,091 articles included in our study were indexed by 22,155 major MeSH terms, which were categorized into 81 different medical specialties. Cardiology and Neurology were in the first 3 specialties in the 4 journals. Five and 15 specialties were systematically ranked in the first 10 and first 20 in the four journals, respectively. Among the first 30 specialties, 23 were common to the four journals. For each specialty, the trends over the 10-year period differed from one journal to another, with no consistency and no obvious explanatory factor. Overall, the representation of many specialties in the four journals in general and internal medicine included in this study may differ, probably due to different editorial policies. Reading only one of these journals may provide a reliable but only partial overview.
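The categorization algorithm described above, inferring specialties from major MeSH terms, can be sketched roughly as follows. This is a minimal illustration under our own assumptions (a flat term-to-specialty lookup and simple frequency tallying); the actual algorithm's mapping and weighting are not detailed here.

```python
from collections import Counter

def specialty_weights(articles, mesh_to_specialty):
    """articles: list of major-MeSH-term lists, one per article.
    mesh_to_specialty: assumed MeSH term -> specialty mapping.
    Returns each specialty's relative weight across the articles."""
    tally = Counter()
    for mesh_terms in articles:
        for term in mesh_terms:
            if term in mesh_to_specialty:
                tally[mesh_to_specialty[term]] += 1
    total = sum(tally.values())
    return {sp: n / total for sp, n in tally.items()}
```

Ranking the resulting weights then yields the kind of specialty ranking reported for the four journals.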
... For example, MetaMap [29], developed at the National Library of Medicine, maps natural language text to UMLS concepts. In another study [30], techniques for correlating descriptors within MeSH to biomedical scientific articles are compared, with the aim of providing an alternative to manual indexation. However, in both studies, difficulties in mapping the scientific content using MeSH were reported, which demonstrates the challenge involved in dealing with medical terminology when classifying scientific texts. ...
Article
Internet users are increasingly using the World Wide Web to search for information relating to their health. This situation makes it necessary to create specialized tools capable of supporting users in their searches. The aim was to apply and compare strategies developed to investigate the use of the Portuguese version of Medical Subject Headings (MeSH) for constructing an automated classifier for Brazilian Portuguese-language web-based content within or outside the field of healthcare, focusing on the lay public. 3658 Brazilian web pages were used to train the classifier and 606 Brazilian web pages were used to validate it. The strategies proposed were constructed using content-based vector methods for text classification, such that Naive Bayes was used for the task of classifying vector patterns with characteristics obtained through the proposed strategies. A strategy named InDeCS was developed specifically to adapt MeSH for the problem put forward. This approach achieved better accuracy for this pattern classification task (0.94 sensitivity, specificity and area under the ROC curve). Because of the significant results achieved by InDeCS, this tool has been successfully applied to the Brazilian healthcare search portal known as Busca Saúde. Furthermore, it could be shown that MeSH presents important results when used for the task of classifying web-based content focusing on the lay public. It was also possible to show from this study that MeSH was able to map out mutable non-deterministic characteristics of the web.
Article
Full-text available
Natural languages are ambiguous, and computers are not capable of understanding natural languages the way people really understand them. Natural Language Processing (NLP) is concerned with the development of computational models based on aspects of human language processing. Question Answering (QA) is a field of Natural Language Processing that provides a precise answer to a user's question given in natural language. In this work, a MemN2N model based question answering system is implemented and its performance is evaluated on complex question answering tasks using the bAbI dataset in three different language text corpuses. The scope of this work is to understand the language-independent and language-dependent aspects of a deep learning network. For this, we study the performance of the deep learning network by training and testing it with different kinds of question answering tasks in different languages, and also try to understand the difference in performance with respect to the languages.
Conference Paper
Full-text available
Subject terms play a crucial role in resource discovery but require substantial effort to produce. Automatic subject classification and indexing address problems of scale and sustainability and can be used to enrich existing bibliographic records, establish more connections across and between resources, and enhance consistency of bibliographic data. The paper aims to put forward a complex methodological framework to evaluate automatic classification tools of Swedish textual documents based on the Dewey Decimal Classification (DDC) recently introduced to Swedish libraries. Three major complementary approaches are suggested: a quality-built gold standard, retrieval effects, domain analysis. The gold standard is built based on input from at least two catalogue librarians, end users expert in the subject, end users inexperienced in the subject, and automated tools. Retrieval effects are studied through a combination of assigned and free tasks, including factual and comprehensive types. The study also takes into consideration the different role and character of subject terms in various knowledge domains, such as scientific disciplines. As a theoretical framework, domain analysis is used and applied in relation to the implementation of DDC in Swedish libraries and chosen domains of knowledge within the DDC itself.
Conference Paper
Full-text available
Over the past years, the amount of biomedical data presented on the web has increased enormously due to the growing data volume in the medical and biological domains. Hence, searching for documents and information on the internet has become increasingly complicated. In the current work, a new approach for information extraction using Natural Language Processing (NLP) tools and an ontology is proposed. It describes a system that extracts relations between concepts from biomedical texts using morphological analysis and information extraction techniques. In the first step, the system segments the input text into sentences. Each sentence is then segmented into words, which are tagged with part-of-speech labels and concept classes (food, drug, and gene). A set of relation extraction rules (regular expression patterns) is applied to the annotated sentences; if a pattern matches, the concepts and relations are extracted. The system has been tested on a set of 700 MEDLINE abstracts. For performance evaluation, precision, recall and F-score were calculated. The approach starts with information retrieval from MEDLINE to gather a set of abstracts related to a given domain. These texts are then annotated using an automaton and an ontology, recognizing the concepts of interest through morphological analysis. After the annotation step, the rules are summarized in an automaton that helps discover gene-disease-food relationships. This work proposed an approach for identifying relations between medical concepts using NLP tools. An evaluation experiment reported good effectiveness results.
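A relation extraction rule of the kind described, a regular expression applied over concept-annotated sentences, might look like the sketch below. The tag scheme, relation verbs and function names are our assumptions for illustration, not the paper's actual rule set.

```python
import re

# Assumed input: sentences pre-annotated with concept-class tags
# such as <GENE>...</GENE> and <DRUG>...</DRUG>.
PATTERN = re.compile(
    r"<GENE>(?P<gene>[^<]+)</GENE>[^<]*?"
    r"(?P<rel>inhibits|activates|interacts with)[^<]*?"
    r"<DRUG>(?P<drug>[^<]+)</DRUG>"
)

def extract_relations(sentence):
    """Return (gene, relation, drug) triples matched by the pattern."""
    return [(m.group("gene"), m.group("rel"), m.group("drug"))
            for m in PATTERN.finditer(sentence)]
```

In practice, one such pattern would exist per relation type, and the matched triples would be collected across all annotated sentences.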
Article
Most physicians have received only limited training in occupational medicine (OM) during their studies. Since they rely mainly on one 'general medical' journal to keep their medical knowledge up to date, it is worthwhile questioning the importance of OM in these journals. The aim of this study was to measure the relative weight of OM in the major journals of general medicine and to compare the journals. The 14,091 articles published in the Lancet, the NEJM, the JAMA and the BMJ in 1997, 2002 and 2007 were analysed. The relative weight of OM and the other medical specialties was determined by categorisation of all the articles, using a categorisation algorithm, which inferred the medical specialties relevant to each MEDLINE article file from the major medical subject headings (MeSH) terms used by the indexers of the US National Library of Medicine to describe each article. The 14,091 articles included in this study were indexed by 22,155 major MeSH terms, which were categorised into 73 different medical specialties. Only 0.48% of the articles had OM as a main topic. OM ranked 44th among the 73 specialties, with limited differences between the four journals studied. There was no clear trend over the 10-year period. The importance of OM is very low in the four major journals of general and internal medicine, and we can consider that physicians get a very limited view of the evolution of knowledge in OM.
Article
The Catalogue and Index of French-language Medical Sites (CISMeF) is a medical portal that provides users with results as pertinent as possible according to their requirements, expectations, and context of use. Indexing and single-term searching are based on the Medical Subject Headings (MeSH) thesaurus. The integration of new medical terminologies for indexing the catalogue's resources is intended to minimize false negatives during searches and to contextualize users' needs. The creation of a drug information portal makes more targeted searching possible, with numerous entry points according to the user (physicians, pharmacists, chemists, and pharmacologists). For simplicity's sake, the catalogue's index of resources by the different nomenclatures is not entirely displayed; the choice of display is left to the user, with MeSH only as the default. These multi-nomenclature tools should be applicable as well to electronic patient records, in which case the objective is to improve patient care through better searching and identification of the information required during consultations and hospitalization.
Conference Paper
Full-text available
This paper describes the application of an ensemble of indexing and classification systems, which have been shown to be successful in information retrieval and classification of medical literature, to a new task of assigning ICD-9-CM codes to the clinical history and impression sections of radiology reports. The basic methods used are: a modification of the NLM Medical Text Indexer system, SVM, k-NN and a simple pattern-matching method. The basic methods are combined using a variant of stacking. Evaluated in the context of a Medical NLP Challenge, fusion produced an F-score of 0.85 on the Challenge test set, which is considerably above the mean Challenge F-score of 0.77 for 44 participating groups.
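The fusion step described above can be sketched as a weighted, score-level combination of the base methods' code recommendations. This is a minimal illustration, not the NLM ensemble itself: the base-method outputs, the weights, and the ICD-9-CM codes below are invented for the example.

```python
# Score-level fusion over several base classifiers (toy sketch).
# Each base method returns ICD-9-CM code -> confidence score.

def fuse(predictions, weights):
    """Combine per-method score dicts into one ranked code list.
    weights: per-method weights (in stacking these would be learned)."""
    fused = {}
    for preds, w in zip(predictions, weights):
        for code, score in preds.items():
            fused[code] = fused.get(code, 0.0) + w * score
    # Codes ranked by fused score, best first.
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical outputs of three base methods for one radiology report.
mti = {"786.2": 0.9, "486": 0.4}    # e.g. cough, pneumonia
svm = {"786.2": 0.7, "780.6": 0.5}
knn = {"486": 0.8, "786.2": 0.3}

ranking = fuse([mti, svm, knn], weights=[0.5, 0.3, 0.2])
```

With these toy numbers, "786.2" accumulates 0.72 and outranks the other codes; a real stacking variant would fit the weights on held-out data rather than fixing them.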
Article
Full-text available
The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely document representation, classifier construction, and classifier evaluation.
Article
Full-text available
We present BITOLA, an interactive literature-based biomedical discovery support system. The goal of this system is to discover new, potentially meaningful relations between a given starting concept of interest and other concepts, by mining the bibliographic database MEDLINE. To make the system more suitable for disease candidate gene discovery and to decrease the number of candidate relations, we integrate background knowledge about the chromosomal location of the starting disease as well as the chromosomal location of the candidate genes from resources such as LocusLink and Human Genome Organization (HUGO). BITOLA can also be used as an alternative way of searching the MEDLINE database. The system is available at http://www.mf.uni-lj.si/bitola/.
Article
Full-text available
In this paper, we describe the design and preliminary evaluation of a new type of tool to speed up the encoding of episodes of care using the SNOMED CT terminology. The proposed system can be used either as a search tool to browse the terminology or as a categorization tool to support automatic annotation of textual contents with SNOMED concepts. The general strategy is similar for both tools and is based on the fusion of two complementary retrieval strategies with thesaural resources. The first classification module uses a traditional vector-space retrieval engine which has been fine-tuned for the task, while the second classifier is based on regular variations of the term list. For evaluating the system, we use a sample of MEDLINE. SNOMED CT categories have been restricted to Medical Subject Headings (MeSH) using the SNOMED-MeSH mapping provided by the UMLS (version 2006). Consistent with previous investigations applied to biomedical terminologies, our results show that performances of the hybrid system are significantly improved as compared to each single module. For top returned concepts, a precision at high ranks (P0) of more than 80% is observed. In addition, a manual and qualitative evaluation on a dozen MEDLINE abstracts suggests that SNOMED CT could represent an improvement compared to existing medical terminologies such as MeSH. Although the precision of the SNOMED categorizer seems sufficient to help professional encoders, it is concluded that clinical benchmarks as well as usability studies are needed to assess the impact of our SNOMED encoding method in real settings. Availability: The system is available for research purposes at http://eagl.unige.ch/SNOCat.
Article
Full-text available
Journal Descriptor Indexing (JDI) is a vector-based text classification system developed at NLM (National Library of Medicine), originally in Lisp and now as a Java tool. Consequently, a testing suite was developed to verify training set data and results of the JDI tool. A methodology was developed and implemented to compare two sets of JD vectors, resulting in a single index (from 0 - 1) measuring their similarity. This methodology is fast, effective, and accurate.
Article
Full-text available
CISMeF is a French quality-controlled health gateway that uses the MeSH thesaurus. We introduced two new concepts: metaterms (a medical specialty with semantic links to one or more MeSH terms, subheadings, and resource types) and resource types. Our objective was to evaluate the precision and recall of metaterms. We created 16 pairs of queries. Each pair concerned the same topic, but one used metaterms and the other MeSH terms. To assess precision, each document retrieved by a query was classified as irrelevant, partly relevant, or fully relevant. The 16 queries yielded 943 documents for metaterm queries and 139 for MeSH term queries. The recall of MeSH term queries was 0.44 (compared to 1 for metaterm queries), and precision was identical for MeSH term and metaterm queries. Metaconcepts such as CISMeF metaterms allow better recall with similar precision compared to MeSH terms in a quality-controlled health gateway.
Article
Full-text available
The objective of NLM's Indexing Initiative (IND) is to investigate methods whereby automated indexing methods partially or completely substitute for current indexing practices. The project will be considered a success if methods can be designed and implemented that result in retrieval performance that is equal to or better than the retrieval performance of systems based principally on humanly assigned index terms. We describe the current state of the project and discuss our plans for the future.
Article
Full-text available
Considerable research is being directed at extracting molecular biology information from text. Particularly challenging in this regard is to identify relations between entities, such as protein-protein interactions or molecular pathways. In this paper we present a natural language processing method for extracting causal relations between genetic phenomena and diseases. After presenting the results of preliminary evaluation, we suggest the use of a graphical display application for viewing the semantic predications produced by the system.
Article
Full-text available
The Medical Text Indexer (MTI) is a program for producing MeSH indexing recommendations. It is the major product of NLM's Indexing Initiative and has been used in both semi-automated and fully automated indexing environments at the Library since mid 2002. We report here on an experiment conducted with MEDLINE indexers to evaluate MTI's performance and to generate ideas for its improvement as a tool for user-assisted indexing. We also discuss some filtering techniques developed to improve MTI's accuracy for use primarily in automatically producing the indexing for several abstracts collections.
Article
Full-text available
In the context of the BioCreative competition, where training data were very sparse, we investigated two complementary tasks: 1) given a Swiss-Prot triplet, containing a protein, a GO (Gene Ontology) term, and a relevant article, extraction of a short passage that justifies the GO category assignment; 2) given a Swiss-Prot pair, containing a protein and a relevant article, automatic assignment of a set of categories. The sentence is the basic retrieval unit. Our classifier computes a distance between each sentence and the GO category provided with the Swiss-Prot entry. The text categorizer computes a distance between each GO term and the text of the article. Evaluations are reported both on annotator judgements as established by the competition and on mean average precision measures computed using a curated sample of Swiss-Prot. Our system achieved the best recall and precision combination both for passage retrieval and for text categorization as judged by the official evaluators. However, text categorization results were far below those in other data-poor text categorization experiments: the top proposed term is relevant in less than 20% of cases, whereas with other biomedical controlled vocabularies, such as the Medical Subject Headings, we achieved more than 90% precision. We also observe that the scoring methods used in our experiments, based on the retrieval status values of our engines, exhibit effective confidence-estimation capabilities. From a comparative perspective, the combination of retrieval and natural language processing methods we designed achieved very competitive performance. Largely data-independent, our systems were no less effective than data-intensive approaches. These results suggest that the overall strategy could benefit a large class of information extraction tasks, especially when training data are missing. However, from a user perspective, the results were disappointing. Further investigations are needed to design applicable end-user text mining tools for biologists.
Article
Full-text available
We report on the development of a generic text categorization system designed to automatically assign biomedical categories to any input text. Unlike usual automatic text categorization systems, which rely on data-intensive models extracted from large sets of training data, our categorizer is largely data-independent. In order to evaluate the robustness of our approach we test the system on two different biomedical terminologies: the Medical Subject Headings (MeSH) and the Gene Ontology (GO). Our lightweight categorizer, based on two ranking modules, combines a pattern matcher and a vector space retrieval engine, and uses both stems and linguistically-motivated indexing units. Results show the effectiveness of phrase indexing for both GO and MeSH categorization, but we observe the categorization power of the tool depends on the controlled vocabulary: precision at high ranks ranges from above 90% for MeSH to <20% for GO, establishing a new baseline for categorizers based on retrieval methods.
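The two-module design described above, a pattern matcher combined with a vector-space retrieval engine, can be illustrated with a toy scorer: a bag-of-words cosine similarity plus a fixed boost when the controlled-vocabulary term occurs verbatim in the text. The boost value, the example text, and the candidate terms below are invented for illustration and are not the system's actual parameters.

```python
import math
import re

def bow(text):
    # Bag-of-words token counts (lowercased alphabetic tokens).
    counts = {}
    for tok in re.findall(r"[a-z]+", text.lower()):
        counts[tok] = counts.get(tok, 0) + 1
    return counts

def cosine(a, b):
    # Cosine similarity between two token-count dicts.
    num = sum(a[t] * b.get(t, 0) for t in a)
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def score(term, text, boost=0.5):
    # Vector-space score, plus a boost when the pattern matcher
    # finds the term verbatim (a crude stand-in for the first module).
    s = cosine(bow(term), bow(text))
    if term.lower() in text.lower():
        s += boost
    return s

text = "Regulation of heart rate by the autonomic nervous system."
terms = ["heart rate", "kidney function"]
best = max(terms, key=lambda t: score(t, text))
```

Here "heart rate" wins both on word overlap and on the verbatim-match boost, while "kidney function" scores zero; the real categorizer additionally uses stems and linguistically motivated indexing units.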
Article
Full-text available
Categorization is designed to enhance resource description by organizing content description so as to enable the reader to grasp quickly and easily what are the main topics discussed in it. The objective of this work is to propose a categorization algorithm to classify a set of scientific articles indexed with the MeSH thesaurus, and in particular those of the MEDLINE bibliographic database. In a large bibliographic database such as MEDLINE, finding materials of particular interest to a specialty group, or relevant to a particular audience, can be difficult. The categorization refines the retrieval of indexed material. In the CISMeF terminology, metaterms can be considered as super-concepts. They were primarily conceived to improve recall in the CISMeF quality-controlled health gateway. The MEDLINE categorization algorithm (MCA) is based on semantic links existing between MeSH terms and metaterms on the one hand and between MeSH subheadings and metaterms on the other hand. These links are used to automatically infer a list of metaterms from any MeSH term/subheading indexing. Medical librarians manually select the semantic links. The MEDLINE categorization algorithm lists the medical specialties relevant to a MEDLINE file by decreasing order of their importance. The MEDLINE categorization algorithm is available on a Web site. It can run on any MEDLINE file in a batch mode. As an example, the top 3 medical specialties for the set of 60 articles published in BioMed Central Medical Informatics & Decision Making, which are currently indexed in MEDLINE are: information science, organization and administration and medical informatics. We have presented a MEDLINE categorization algorithm in order to classify the medical specialties addressed in any MEDLINE file in the form of a ranked list of relevant specialties. The categorization method introduced in this paper is based on the manual indexing of resources with MeSH (terms/subheadings) pairs by NLM indexers. 
This algorithm may be used as a new bibliometric tool.
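The inference step of such a categorization algorithm can be sketched as a lookup-and-count: map each MeSH term (or subheading) of an article to its linked metaterms and rank metaterms by how many indexing terms support them. The link table below is a tiny hypothetical excerpt; the real table is manually curated by medical librarians and far larger.

```python
# Hypothetical MeSH-term -> metaterm links (the real, librarian-curated
# table in CISMeF covers the full thesaurus).
LINKS = {
    "myocardial infarction": ["cardiology"],
    "drug therapy": ["pharmacology", "therapeutics"],
    "aspirin": ["pharmacology"],
}

def categorize(mesh_indexing):
    """Rank metaterms by the number of the article's MeSH terms
    or subheadings that link to them, most supported first."""
    counts = {}
    for term in mesh_indexing:
        for mt in LINKS.get(term, []):
            counts[mt] = counts.get(mt, 0) + 1
    return sorted(counts.items(), key=lambda kv: -kv[1])

ranking = categorize(["myocardial infarction", "drug therapy", "aspirin"])
```

For this toy indexing, "pharmacology" is supported by two MeSH terms and ranks first, mirroring the algorithm's output of medical specialties in decreasing order of importance.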
Article
Full-text available
Genomic functional information is valuable for biomedical research. However, such information frequently needs to be extracted from the scientific literature and structured in order to be exploited by automatic systems. Natural language processing is increasingly used for this purpose although it inherently involves errors. A postprocessing strategy that selects relations most likely to be correct is proposed and evaluated on the output of SemGen, a system that extracts semantic predications on the etiology of genetic diseases. Based on the number of intervening phrases between an argument and its predicate, we defined a heuristic strategy to filter the extracted semantic relations according to their likelihood of being correct. We also applied this strategy to relations identified with co-occurrence processing. Finally, we exploited postprocessed SemGen predications to investigate the genetic basis of Parkinson's disease. The filtering procedure for increased precision is based on the intuition that arguments which occur close to their predicate are easier to identify than those at a distance. For example, if gene-gene relations are filtered for arguments at a distance of 1 phrase from the predicate, precision increases from 41.95% (baseline) to 70.75%. Since this proximity filtering is based on syntactic structure, applying it to the results of co-occurrence processing is useful, but not as effective as when applied to the output of natural language processing. In an effort to exploit SemGen predications on the etiology of disease after increasing precision with postprocessing, a gene list was derived from extracted information enhanced with postprocessing filtering and was automatically annotated with GFINDer, a Web application that dynamically retrieves functional and phenotypic information from structured biomolecular resources. 
Two of the genes in this list are likely relevant to Parkinson's disease but are not associated with this disease in several important databases on genetic disorders. Information based on the proximity postprocessing method we suggest is of sufficient quality to be profitably used for subsequent applications aimed at uncovering new biomedical knowledge. Although proximity filtering is only marginally effective for enhancing the precision of relations extracted with co-occurrence processing, it is likely to benefit methods based, even partially, on syntactic structure, regardless of the relation.
Article
Full-text available
A JDI (Journal Descriptor Indexing) tool has been developed at NLM that automatically categorizes biomedical text, returning a ranked list, with scores between 0 and 1, of either JDs (Journal Descriptors, corresponding to biomedical disciplines) or STs (UMLS Semantic Types). Possible applications include WSD (Word Sense Disambiguation) and retrieval according to discipline. The Lexical Systems Group plans to distribute an open-source Java version of this tool.
Article
A new, fully automated approach for indexing documents is presented based on associating textwords in a training set of bibliographic citations with the indexing of journals. This journal-level indexing is in the form of a consistent, timely set of journal descriptors (JDs) indexing the individual journals themselves. This indexing is maintained in journal records in a serials authority database. The advantage of this novel approach is that the training set does not depend on previous manual indexing of hundreds of thousands of documents (i.e., any such indexing already in the training set is not used), but rather the relatively small intellectual effort of indexing at the journal level, usually a matter of a few thousand unique journals for which retrospective indexing to maintain consistency and currency may be feasible. If successful, JD indexing would provide topical categorization of documents outside the training set, i.e., journal articles, monographs, WEB documents, reports from the grey literature, etc., and therefore be applied in searching. Because JDs are quite general, corresponding to subject domains, their most probable use would be for improving or refining search results.
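The journal-level training idea can be sketched as word–JD co-occurrence counting: each training citation inherits the JDs assigned to its journal, and a new text is scored by summing the normalized JD profiles of its words. The citations and descriptors below are toy data, not NLM's training set.

```python
from collections import defaultdict

# Toy training set: (textwords of a citation, JDs inherited from its journal).
training = [
    (["tumor", "chemotherapy"], ["Neoplasms"]),
    (["tumor", "biopsy"], ["Neoplasms", "Pathology"]),
    (["fracture", "biopsy"], ["Orthopedics"]),
]

# Count word-JD co-occurrences.
word_jd = defaultdict(lambda: defaultdict(int))
for words, jds in training:
    for w in words:
        for jd in jds:
            word_jd[w][jd] += 1

def rank_jds(words):
    """Score JDs for a new text by summing each word's normalized JD profile."""
    scores = defaultdict(float)
    for w in words:
        total = sum(word_jd[w].values()) or 1
        for jd, n in word_jd[w].items():
            scores[jd] += n / total
    return sorted(scores, key=scores.get, reverse=True)

top = rank_jds(["tumor", "chemotherapy"])[0]
```

Nothing here depends on document-level manual indexing: the only human input is the JD assignment to the few thousand journals, which is the point of the approach.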
Article
An experiment was performed at the National Library of Medicine® (NLM®) in word sense disambiguation (WSD) using the Journal Descriptor Indexing (JDI) methodology. The motivation is the need to solve the ambiguity problem confronting NLM's MetaMap system, which maps free text to terms corresponding to concepts in NLM's Unified Medical Language System® (UMLS®) Metathesaurus®. If the text maps to more than one Metathesaurus concept at the same high confidence score, MetaMap has no way of knowing which concept is the correct mapping. We describe the JDI methodology, which is ultimately based on statistical associations between words in a training set of MEDLINE® citations and a small set of journal descriptors (assigned by humans to journals per se) assumed to be inherited by the citations. JDI is the basis for selecting the best meaning that is correlated to UMLS semantic types (STs) assigned to ambiguous concepts in the Metathesaurus. For example, the ambiguity transport has two meanings: "Biological Transport" assigned the ST Cell Function and "Patient transport" assigned the ST Health Care Activity. A JDI-based methodology can analyze text containing transport and determine which ST receives a higher score for that text, which then returns the associated meaning, presumed to apply to the ambiguity itself. We then present an experiment in which a baseline disambiguation method was compared to four versions of JDI in disambiguating 45 ambiguous strings from NLM's WSD Test Collection. Overall average precision for the highest-scoring JDI version was 0.7873 compared to 0.2492 for the baseline method, and average precision for individual ambiguities was greater than 0.90 for 23 of them (51%), greater than 0.85 for 24 (53%), and greater than 0.65 for 35 (79%). On the basis of these results, we hope to improve performance of JDI and test its use in applications.
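The final sense-selection step for the transport example can be sketched as choosing the meaning whose semantic type scored highest for the surrounding text. The ST scores below are hypothetical stand-ins for JDI output; only the two senses and their STs come from the description above.

```python
# Candidate meanings of the ambiguity "transport" and their UMLS semantic types.
SENSES = {
    "Biological Transport": "Cell Function",
    "Patient transport": "Health Care Activity",
}

def disambiguate(senses, st_scores):
    """Pick the meaning whose semantic type received the highest JDI score."""
    return max(senses, key=lambda meaning: st_scores.get(senses[meaning], 0.0))

# Hypothetical JDI scores for the text surrounding the ambiguous word.
st_scores = {"Cell Function": 0.81, "Health Care Activity": 0.34}
meaning = disambiguate(SENSES, st_scores)
```

With the Cell Function ST scoring higher, the method returns "Biological Transport" as the presumed sense for this context.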
Article
The volume of biomedical literature has experienced explosive growth in recent years. This is reflected in the corresponding increase in the size of MEDLINE, the largest bibliographic database of biomedical citations. Indexers at the US National Library of Medicine (NLM) need efficient tools to help them accommodate the ensuing workload. After reviewing issues in the automatic assignment of Medical Subject Headings (MeSH terms) to biomedical text, we focus more specifically on the new subheading attachment feature for NLM's Medical Text Indexer (MTI). Natural Language Processing, statistical, and machine learning methods of producing automatic MeSH main heading/subheading pair recommendations were assessed independently and combined. The best combination achieves 48% precision and 30% recall. After validation by NLM indexers, a suitable combination of the methods presented in this paper was integrated into MTI as a subheading attachment feature producing MeSH indexing recommendations compliant with current state-of-the-art indexing practice.
Article
The quality of indexing of periodicals in a bibliographic data base cannot be measured directly, as there is no one "correct" way to index an item. However, consistency can be used to measure the reliability of indexing. To measure consistency in MEDLINE, 760 twice-indexed articles from 42 periodical issues were identified in the data base, and their indexing compared. Consistency, expressed as a percentage, was measured using Hooper's equation. Overall, checktags had the highest consistency. Medical Subject Headings (MeSH) and subheadings were applied more consistently to central concepts than to peripheral points. When subheadings were added to a main heading, consistency was lowered. "Floating" subheadings were more consistent than were attached subheadings. Indexing consistency was not affected by journal indexing priority, language, or length of the article. Terms from MeSH Tree Structure categories A, B, and D appeared more often than expected in the high-consistency articles; whereas terms from categories E, F, H, and N appeared more often than expected in the low-consistency articles. MEDLINE, with its excellent controlled vocabulary, exemplary quality control, and highly trained indexers, probably represents the state of the art in manually indexed data bases.
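Hooper's equation, used above to measure inter-indexer consistency, is commonly stated as the number of terms both indexers agree on, divided by that number plus the terms unique to each indexer, expressed as a percentage. The MeSH terms in the example are invented for illustration.

```python
def hooper_consistency(terms_a, terms_b):
    """Hooper's measure: 100 * A / (A + M + N), where A is the number of
    terms assigned by both indexers and M, N are the terms unique to each."""
    a, b = set(terms_a), set(terms_b)
    agree = len(a & b)
    return 100.0 * agree / (agree + len(a - b) + len(b - a))

# Two hypothetical indexings of the same article.
c = hooper_consistency(
    {"Humans", "Aspirin", "Drug Therapy"},
    {"Humans", "Aspirin", "Biopsy"},
)
```

Here the indexers agree on two terms and each adds one unique term, giving 2 / (2 + 1 + 1) = 50% consistency.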
Article
CISMeF is a quality-controlled health gateway using a terminology based on the Medical Subject Headings (MeSH) thesaurus that displays medical specialties (metaterms) and the relationships existing between them and MeSH terms. Objective: The need to classify the resources within the catalogue has led us to combine this type of semantic information with domain expert knowledge for health resource categorization purposes. Material and methods: A two-step categorization process, consisting of mapping resource keywords to CISMeF metaterms and ranking the metaterms by decreasing coverage in the resource, has been developed. We evaluated this algorithm on a random set of 123 resources extracted from the CISMeF catalogue. Our gold standard for this evaluation is the manual classification provided by a domain expert, viz. a librarian of the team. Results: The CISMeF algorithm shows 81% precision and 93% recall, and 62% of the resources were assigned a "fully relevant" or "fairly relevant" categorization according to strict standards. A thorough analysis of the results enabled us to find gaps in the knowledge modeling of the CISMeF terminology. The necessary adjustments having been made, the algorithm is currently used in CISMeF for resource categorization.
Article
The amount of health information available on the Internet is considerable. In this context, several health gateways have been developed. Among them, CISMeF (Catalogue and Index of Health Resources in French) was designed to catalogue and index health resources in French. The goal of this article is to describe the various enhancements to the MeSH thesaurus developed by the CISMeF team to adapt this terminology to the broader field of health Internet resources instead of scientific articles for the medline bibliographic database. CISMeF uses two standard tools for organizing information: the MeSH thesaurus and several metadata element sets, in particular the Dublin Core metadata format. The heterogeneity of Internet health resources led the CISMeF team to enhance the MeSH thesaurus with the introduction of two new concepts, respectively, resource types and metaterms. CISMeF resource types are a generalization of the publication types of medline. A resource type describes the nature of the resource and MeSH keyword/qualifier pairs describe the subject of the resource. A metaterm is generally a medical specialty or a biological science, which has semantic links with one or more MeSH keywords, qualifiers and resource types. The CISMeF terminology is exploited for several tasks: resource indexing performed manually, resource categorization performed automatically, visualization and navigation through the concept hierarchies and information retrieval using the Doc'CISMeF search engine. The CISMeF health gateway uses several MeSH thesaurus enhancements to optimize information retrieval, hierarchy navigation and automatic indexing.
Rindflesch, T.C., Libbus, B., Hristovski, D., Aronson, A.R., & Kilicoglu, H. (2003). Semantic relations asserting the etiology of genetic diseases. In Proceedings of the American Medical Informatics Association Annual Symposium (pp. 554–558). Retrieved July 14, 2009, from http://www.pubmedcentral.nih.gov/picrender.fcgi?artid=1480275&blobtype=pdf
Ruch, P. (2006). Automatic assignment of biomedical categories: Toward a generic approach. Bioinformatics, 22(6), 658–664.
Acknowledgments: appointment of A. Névéol to the Lister Hill Center Fellows Program; appointments of P. Ruch and S.J. Darmoni to the Lister Hill Center Visitors Program, sponsored by the National Library of Medicine and administered by the Oak Ridge Institute for Science and Education.
American Medical Association (2008). JAMA & Archives Topic Collections. Retrieved July 1, 2009, from http://pubs.ama-assn.org/collections
Aronson, A.R., Bodenreider, O., Chang, H.F., Humphrey, S.M., Mork, J.G., Nelson, S.J., et al. (2000). The NLM Indexing Initiative. In Proceedings of the American Medical Informatics Association Annual Symposium (pp. 17–21). Retrieved July 14, 2009, from http://www.pubmedcentral.nih.gov/picrender.fcgi?artid=2243970&blobtype=pdf
Regarding references with a PMCID: when a PMCID is searched in NLM's PubMed, the reference is retrieved with a link to the free full-text article in PubMed Central.
Ruch, P., Gobeill, J., Lovis, C., & Geissbühler, A. (2008). Automatic medical encoding with SNOMED categories. BMC Medical Informatics and Decision Making, 8(Suppl 1), S6. Retrieved July 14, 2009, from http://biomedcentral.com/1472-6947/8/S1/S6
Salton, G., & McGill, M.J. (1983). Introduction to modern information retrieval (pp. 63–66). New York: McGraw-Hill.
Gehanno, J.F., Thirion, B., & Darmoni, S.J. (2007). Evaluation of meta-concepts for information retrieval in a quality-controlled health gateway. In Proceedings of the American Medical Informatics Association Annual Symposium (pp. 269–273). Retrieved July 14, 2009, from http://telemedicina.unifesp.br/pub/AMIA/2007%20AMIA%20Proceedings/data/papers/papers/AMIA-0085-S2007.pdf
Hristovski, D., Peterlin, B., Mitchell, J.A., & Humphrey, S.M. (2005). Using literature-based discovery to identify disease candidate genes. International Journal of Medical Informatics, 74(2–4), 289–298.
Lu, C.J., Humphrey, S.M., & Browne, A.C. (2008). A method for verifying a vector-based text classification system. In AMIA Annual Symposium Proceedings (p. 1030). [PubMed: 18998786; PMCID forthcoming]
Manning, C.D., & Schütze, H. (1999). Foundations of statistical natural language processing (pp. 268–269, 534–538). Cambridge, MA: The MIT Press.
J Am Soc Inf Sci Technol. Author manuscript; available in PMC 2009 December 1.
Douyère, M., Soualmia, L.F., Névéol, A., Rogozan, A., Dahamna, B., Leroy, J.P., Thirion, B., & Darmoni, S.J. (2004). Enhancing the MeSH thesaurus to retrieve French online health resources in a quality-controlled gateway. Health Information and Libraries Journal, 253–261. Retrieved November 21, 2008, from http://www3.interscience.wiley.com/cgi-bin/fulltext/118813886/PDFSTART
Ehrler, F., Geissbühler, A., Jimeno, A., & Ruch, P. (2005). Data-poor categorization and passage retrieval for gene ontology annotation in Swiss-Prot. BMC Bioinformatics, 6(Suppl 1), S23. [PubMed: 15960836; PMCID: PMC1869016]
CHU Hôpitaux de Rouen. Additional subject subset for PubMed.
Aronson, A.R., Mork, J.G., Lang, F.M., & Rogers, W.J. NLM Medical Text Indexer: A tool for automatic and assisted indexing.
CHU Hôpitaux de Rouen. CISMeF: Catalog and Index of French-language health internet resources. A quality-controlled subject gateway.
Aronson, A.R., Mork, J.G., Gay, C.W., Humphrey, S.M., & Rogers, W.J. (2004). The NLM Indexing Initiative's Medical Text Indexer. Studies in Health Technology and Informatics (pp. 268–272). Retrieved November 21, 2008, from http://skr.nlm.nih.gov/papers/references/aronson-medinfo04.wheader.pdf
CHU Hôpitaux de Rouen. Catalogue et Index des Sites Médicaux Francophones.