Article

Automatic Annotation for Biological Sequences by Extraction of Keywords from MEDLINE Abstracts: Development of a Prototype System


Abstract

We have developed a prototype for the automatic annotation of functional characteristics in protein families. The system is able to extract biological information directly from the scientific literature in the form of MEDLINE abstracts. The criterion for selecting relevant keywords is the difference between their frequency in the abstracts associated with the protein family under study and their frequency in other, unrelated protein families. The concept of functional information associated with protein families is the key feature of our system and brings evolutionary information into the problem of functional annotation of biological sequences. The system has been tested in two different scenarios: first, a large set of protein families with a small number of abstracts per family, and second, selected protein families with a large number of abstracts attached to each one. In both cases the performance is compared with annotations provided by human experts, showing a clear relation between the amount of information provided to the system and the quality of the annotations. The automatic annotations are in many cases of similar quality to those contained in current databases. The possibilities and difficulties to be encountered during the development of a full system for automatic annotation are discussed.

... Previous authors [7-14] have used NLP-based systems to extract biological molecule annotation information [7], to detect protein-protein interaction information [8, 15, 16], or to improve indexing and recall into searches from MEDLINE abstracts [12, 17]. Methods employed include a mixture of text mining and indexing for terms which can be classified by ...
... Comprehensively-annotated models of complex pathways like Wnt are also essential for hypothesis generation and experiment validation, yet with the exception of periodic reviews on the subject, there are few sources of Wnt-signaling information that are kept consistent with the latest published literature. In the past, various groups [7-14] have used NLP-based systems to extract biological molecule annotation information [7], to detect protein-protein interaction information [8, 15, 16], or to improve indexing and recall into searches from MEDLINE abstracts [12, 17]. Methods included a mixture of text mining and indexing, with some groups using classification by Bayesian statistics [10], structured grammar matches [18], or word filtering of known entities, as well as the use of partial and full parsers. ...
Article
This dissertation discusses the use of automated natural language processing (NLP) for characterization of biomolecular events in signal transduction pathway databases. I also discuss the use of a dynamic map engine for efficiently navigating large biomedical document collections and functionally annotating high-throughput genomic data. An application is presented where NLP software, beginning with genomic expression data, automatically identifies and joins disparate experimental observations supporting biochemical interaction relationships between candidate genes in the Wnt signaling pathway. I discuss the need for accurate named entity resolution to the biological sequence databases and how sequence-based approaches can unambiguously link automatically-extracted assertions to their respective biomolecules in a high-speed manner. I then demonstrate a search engine, BioSearch-2D, which renders the contents of large biomedical document collections into a single, dynamic map. With this engine, the prostate cancer epigenetics literature is analyzed and I demonstrate that the summarization map closely matches that provided by expert human review articles. Examples include displays which prominently feature genes such as the androgen receptor and glutathione S-transferase P1 together with the National Library of Medicine's Medical Subject Heading (MeSH) descriptions which match the roles described for those genes in the human review articles. In a second application of BioSearch-2D, I demonstrate the engine's application as a context-specific functional annotation system for cancer-related gene signatures. Our engine matches the annotation produced by a Gene Ontology-based annotation engine for 6 cancer-related gene signatures. Additionally, it assigns highly-significant MeSH terms as annotation for the gene lists which are not produced by the GO-based engine.
I find that the BioSearch-2D display both facilitates the exploration of large document collections in the biomedical literature and provides users with an accurate annotation engine for ad-hoc gene sets. In the future, the use of both large-scale biomedical literature summarization engines and automated protein-protein interaction discovery software could greatly assist the manual and expensive data curation efforts involved in describing complex biological processes or disease states.
... mes a difficult task since the size of the data is huge [Rubin D.L. et al. (2005)], [Spasic I. et al. (2005)]. Several machine learning and text mining methods [Spasic I. et al. (2005)] try to facilitate and automate the organization of biological information described in documents. Also, some research efforts involve natural language techniques [Andrade M.A. et al. (1997)], [Eisenhaber F. et al. (1999)]. Other efforts [Raychaudhuri S. et al. (2002)], [Kazawa H. et al. (2004)], [Theodosiou T. et al. (2006)] have utilized ontologies, like the Gene Ontology (GO), and applied data mining and machine learning methods to published biological literature stored in the PubMed database. Ontologies are considered to ...
Article
Full-text available
Biomedical literature databases constitute valuable repositories of up-to-date scientific knowledge. The development of efficient classification methods to facilitate the organization of these databases and the extraction of novel biomedical knowledge is becoming increasingly important. Several of these methods use bio-ontologies, like the Gene Ontology, to concisely describe and classify biological documents. The purpose of this paper is to compare two classical statistical classification methods, namely multinomial logistic regression (MLR) and linear discriminant analysis (LDA), to a machine learning classification method called support vector machines (SVM). Although all the methods have been used with success for classifying texts, there has been no direct comparison between them for classifying biological text to specific Gene Ontology terms. The results from the study show that LDA performs better (accuracy 80.32%) than SVM (77.18%) and MLR (57.4%). LDA not only performs well in the assignment of Gene Ontology terms to documents, but also reduces the dimensions of the original data, making them easier to manage.
... Some other examples of networks recognizing functional motifs were presented by Sternberg (1991, 1992); Ladunga et al. (1991); Schneider and Wrede (1993); Hansen et al. (1998); Nielsen et al. (1997). The second approach is based on using the frequency with which any of the 20*20 possible amino acid pairs occurs in the sequence (Ferran and Pflugfelder, 1993), or on using the information extracted from database annotations (Andrade and Valencia, 1997). There are two ways to describe the principal difference between these two types of networks. ...
... For example, a neural network system for predicting various aspects of 1D structure based on evolutionary information is by far the most widely used prediction method (Rost et al., 1994). Other network-based methods are unique, or superior in their field (Ferran and Pflugfelder, 1993; Riis and Krogh, 1996; Andrade and Valencia, 1997; Hansen et al., 1997; Nielsen et al., 1997). Furthermore, neural networks revealed data base errors, and principles underlying protein structures (Brunak, 1991; Rost et al., 1994; Tolstrup et al., 1994; Blom et al., 1996). ...
Article
Full-text available
Operations research is probably one of the most successful fields of applied mathematics, used in economics, physics, chemistry, and almost everywhere one has to analyze huge amounts of data. Lately, these techniques were introduced in biology, especially in the protein analysis area, to support biologists. The fast growth of protein data makes operations research an important issue in bioinformatics, a science which lies on the border between computer science and biology. This paper gives a short overview of the operations research techniques currently used to support structural and functional analysis of proteins.
... Automatic analysis of text, or natural language processing (NLP), has great potential for its application to mining this biological literature. Many NLP techniques have already been used to annotate individual genes (Eisenhaber & Bork, 1999; Fleischmann et al., 1999; Tamames et al., 1998), determine gene or protein interactions (Blaschke et al., 1999; Jenssen et al., 2001; Stephens et al., 2001; Thomas et al., 2000), and to assign keywords to genes or groups of genes (Andrade & Valencia, 1997; Masys et al., 2001; Shatkay et al., 2000). ...
... Alternatively, keywords that describe the function of the group could be determined automatically. Investigators have already developed algorithms to find keywords in collections of biological documents that could be applied to these high-scoring articles to determine functional keywords (Andrade & Valencia, 1997). The methods described here rely on the content of the scientific literature. ...
Article
Full-text available
Recently, biology has been confronted with large multidimensional gene expression data sets where the expression of thousands of genes is measured over dozens of conditions. The patterns in gene expression are frequently explained retrospectively by underlying biological principles. Here we present a method that uses text analysis to help find meaningful gene expression patterns that correlate with the underlying biology described in scientific literature. The main challenge is that the literature about an individual gene is not homogeneous and may address many unrelated aspects of the gene. In the first part of the paper we present and evaluate the neighbor divergence per gene (NDPG) method that assigns a score to a given subgroup of genes indicating the likelihood that the genes share a biological property or function. To do this, it uses only a reference index that connects genes to documents, and a corpus including those documents. In the second part of the paper we present an approach, optimizing separating projections (OSP), to search for linear projections in gene expression data that separate functionally related groups of genes from the rest of the genes; the objective function in our search is the NDPG score of the positively projected genes. A successful search, therefore, should identify patterns in gene expression data that correlate with meaningful biology. We apply OSP to a published gene expression data set; it discovers many biologically relevant projections. Since the method requires only numerical measurements (in this case expression) about entities (genes) with textual documentation (literature), we conjecture that this method could be transferred easily to other domains. The method should be able to identify relevant patterns even if the documentation for each entity pertains to many disparate subjects that are unrelated to each other.
... The earliest respective works in the biomedical domain focused on tasks needing linguistic context and processing at the level of words, like identifying protein names [Fukuda et al., 1998], or on tasks relying on word co-occurrence [Stapley and Benoit, 2000] and pattern matching [Ng and Wong, 1999]. During the last few years, there was a surge of interest in using the biomedical literature (e.g., Andrade and Valencia, 1997; Craven and Kumlien, 1999; Friedman et al., 2001; Fukuda et al., 1998; Hanisch et al., 2003; Jenssen et al., 2001; Leek, 1997; Rindflesch et al., 2000; Shatkay et al., 2000; Yandell and Majoros, 2002), ranging from relatively modest tasks such as finding reported gene locations on chromosomes [Leek, 1997] to more ambitious attempts to construct putative gene networks based on gene-name co-occurrence within articles. Since the literature covers all aspects of biology, chemistry, and medicine, there is almost no limit to the types of information that may be recovered through careful and exhaustive mining. ...
... They also demonstrated that various rule induction methods are able to identify protein interactions with higher precision than manually developed rules [6]. RAPIER was modified to learn rules from tagged documents, and was then trained on a corpus tagged by expert curators. ...
Article
Automatic knowledge discovery from biomedical free texts appears as a necessity considering the growth of the massive amounts of biomedical scientific literature. A special problem that makes this task more challenging, and more difficult as well, is the overabundance and diversity of the related genomic/proteomic ontologies and the respective gene and protein terminologies. Specifically, a genomic/proteomic term, e.g., a gene, a protein or their functional descriptions, as well as diseases, are referred to in many different ways in scientific documents depending on the organization, research context and naming conventions that the authors adhere to. The work reported in this thesis presents methods and tools for the efficient and reliable mining of biomedical literature, based on advanced text-mining techniques. Specifically it covers the following R&D challenges: (a) Identification of gene/protein--gene/protein and gene/protein--disease correlations following a text mining approach. The approach utilizes data-mining and statistical techniques, algorithms and metrics to deal with the following problems: (i) identification and recognition of terms in text references, based on an appropriately devised and implemented algorithmic process that utilises the Trie data structure; and (ii) ranking of terms and their (potential) relations, or links, based on the MIM entropic metric (Mutual Information Metric) to measure the respective terms' association strength. (b) Construction of a gene association network, based on the assessed terms' (genes, proteins, diseases) association strengths. (c) Categorization/classification of text references (mainly from the PubMed abstracts repository) into class categories utilizing an appropriately devised classification metric and procedure, and using the most descriptive (i.e., strong) associations between terms.
Pre-assignment of text references (i.e., PubMed abstracts) to categories is performed by posting respective queries to PubMed; e.g., when querying PubMed with "breast cancer", the retrieved documents are considered to belong to the "breast cancer" category. (d) Assessment of the texts' categorization/classification results, based on respective PubMed abstract collections, their pre-categorization and a careful experimental set-up to measure prediction results, i.e., accuracy and precision. (e) Design and development of a tool, MineBioText (Mining Biomedical Texts), that encompasses all of the aforementioned operations with extra functionalities for setting up the domain of reference and study, e.g., gene/protein and disease names, their synonyms and free-text descriptions, text collections, parameterization of built-in algorithmic processes, etc.
... However, the frequent use of novel abbreviations in texts presents a challenge for the curators of biomedical lexical ontologies to ensure they are continually updated. Several algorithms have been developed to extract abbreviations and their definitions from biomedical text (9)(10)(11). Abbreviations within publications can be defined when they are declared within the full-text, and in some publications, are included in a dedicated abbreviations section. ...
Article
Full-text available
To analyse large corpora using machine learning and other Natural Language Processing (NLP) algorithms, the corpora need to be standardized. The BioC format is a community-driven simple data structure for sharing text and annotations, however there is limited access to biomedical literature in BioC format and a lack of bioinformatics tools to convert online publication HTML formats to BioC. We present Auto-CORPus (Automated pipeline for Consistent Outputs from Research Publications), a novel NLP tool for the standardization and conversion of publication HTML and table image files to three convenient machine-interpretable outputs to support biomedical text analytics. Firstly, Auto-CORPus can be configured to convert HTML from various publication sources to BioC. To standardize the description of heterogenous publication sections, the Information Artifact Ontology is used to annotate each section within the BioC output. Secondly, Auto-CORPus transforms publication tables to a JSON format to store, exchange and annotate table data between text analytics systems. The BioC specification does not include a data structure for representing publication table data, so we present a JSON format for sharing table content and metadata. Inline tables within full-text HTML files and linked tables within separate HTML files are processed and converted to machine-interpretable table JSON format. Finally, Auto-CORPus extracts abbreviations declared within publication text and provides an abbreviations JSON output that relates an abbreviation with the full definition. This abbreviation collection supports text mining tasks such as named entity recognition by including abbreviations unique to individual publications that are not contained within standard bio-ontologies and dictionaries. The Auto-CORPus package is freely available with detailed instructions from GitHub at: https://github.com/omicsNLP/Auto-CORPus .
... Automatic Extraction of Acronyms and their definitions from biomedical domain is difficult as there is wide variance in conventions within biomedical communities on forming acronyms from their definition (long form). In an attempt to help resolve the problem, new techniques have been introduced to automatically extract abbreviations and their definitions from MEDLINE abstracts [1,2,3]. ...
Article
Full-text available
The size and growth rate of biomedical literature creates new challenges for researchers who need to keep up to date. The objective of the present study was to design a pattern-matching method for mining acronyms and their definitions from biomedical text; search-space-reduction heuristic constraints have been proposed and implemented. The constraints mentioned are space-reduction heuristics which reduce the search space and extract most of the true positive cases. The evaluation has been done on MEDLINE abstracts. The results show that the proposed algorithm is faster and more efficient than previous approaches in terms of space and time complexities. The algorithm has a very good recall (92%), precision (97%) and F-factor (94%). One improvement that can be made is to consider all kinds of acronym definition patterns; this algorithm only considers acronym-definition pairs of the form Acronym (Definition) or Definition (Acronym). Improving the algorithm requires additional study and may reduce the precision even though it may increase the recall. The algorithm is space efficient too: input text of any large size can be mined because it requires less memory space to execute.
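The Definition (Acronym) pattern described in this abstract can be illustrated with a short sketch. This is not the published algorithm; the function name, the 2-8 uppercase-letter regex, and the initial-letter matching heuristic are all assumptions for the example, and the one-pattern scope means many real acronym styles are missed.

```python
import re

def extract_acronyms(text):
    """Find Definition (ACRONYM) pairs where the acronym's letters match
    the leading letters of the preceding words. Illustrative heuristic only."""
    pairs = {}
    # Candidate acronym: 2-8 uppercase letters inside parentheses.
    for m in re.finditer(r'\(([A-Z]{2,8})\)', text):
        acronym = m.group(1)
        # Look back at most len(acronym) words before the parenthesis.
        window = text[:m.start()].split()[-len(acronym):]
        initials = ''.join(w[0].upper() for w in window if w)
        # Accept only if the window words' first letters spell the acronym.
        if initials == acronym:
            pairs[acronym] = ' '.join(window)
    return pairs

print(extract_acronyms("We applied natural language processing (NLP) to MEDLINE."))
```

A production system would also handle the Acronym (Definition) order, inner letters, and hyphenated definitions, which is exactly where the search-space-reduction heuristics above come in.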
... The earliest study in this area is described in [Andrade and Valencia 1997], where a simple strategy for ranking keywords for a set of disjoint protein families is proposed. Sets of abstracts for these families are obtained from MEDLINE, and a z-score is calculated for all the words (except the stop words) using their normalized frequency of appearance in the abstracts corresponding to each family. ...
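The z-score ranking just described can be sketched roughly as follows. This is a hedged reconstruction of the idea, not the published formulation: the stop-word list, the Laplace-style smoothing, and the binomial normal approximation are assumptions made for the example.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "of", "and", "in", "a", "is", "to", "for"}

def keyword_zscores(family_abstracts, background_abstracts):
    """Score each word by how much its frequency in a family's abstracts
    exceeds its background frequency in unrelated families' abstracts."""
    def freqs(abstracts):
        words = [w for a in abstracts for w in a.lower().split()
                 if w not in STOP_WORDS]
        return Counter(words), len(words)

    fam_counts, fam_total = freqs(family_abstracts)
    bg_counts, bg_total = freqs(background_abstracts)

    scores = {}
    for word, k in fam_counts.items():
        # Smoothed background rate (denominator +2 keeps 0 < p < 1).
        p = (bg_counts[word] + 1) / (bg_total + 2)
        expected = p * fam_total
        sd = math.sqrt(fam_total * p * (1 - p))
        scores[word] = (k - expected) / sd   # z-score of observed count
    return scores
```

Words scoring highest are candidate family keywords; a real system would add proper stop-word lists and stemming.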
Article
Full-text available
Proteins are the most essential and versatile macromolecules of life, and the knowledge of their functions is a crucial link in the development of new drugs, better crops, and even the development of synthetic biochemicals such as biofuels. Experimental procedures for protein function prediction are inherently low throughput and are thus unable to annotate a non-trivial fraction of proteins that are becoming available due to rapid advances in genome sequencing technology. This has motivated the development of computational techniques that utilize a variety of high-throughput experimental data for protein function prediction, such as protein and genome sequences, gene expression data, protein interaction networks and phylogenetic profiles. Indeed, in a short period of a decade, several hundred articles have been published on this topic. This survey aims to discuss this wide spectrum of approaches by categorizing them in terms of the data type they use for predicting function, and thus identify the trends and needs of this very important field. The survey is expected to be useful for computational biologists and bioinformaticians aiming to get an overview of the field of computational function prediction, and identify areas that can benefit from further research.
... The third approach computes GO code similarity by combining hierarchical and associative relations (Posse et al. 2006). Several studies within the last few years (Andrade et al. 1997, Andrade 1999, MacCallum et al. 2000, Chang et al. 2001) have shown that the inclusion of evidence from relevant scientific literature improves homology search. It is therefore highly plausible that literature evidence can also help improve GO-based approaches to gene and gene product similarity. ...
Article
Full-text available
With the rising influence of the Gene Ontology, new approaches have emerged where the similarity between genes or gene products is obtained by comparing Gene Ontology code annotations associated with them. So far, these approaches have solely relied on the knowledge encoded in the Gene Ontology and the gene annotations associated with the Gene Ontology database. The goal of this paper is to demonstrate that improvements to these approaches can be obtained by integrating textual evidence extracted from relevant biomedical literature.
... Information extraction techniques have been used to support a wide range of applications such as the automatic extraction of protein interactions from the literature [3]. They also have been used in the automatic determination of gene functions [4]. ...
Chapter
Information retrieval is important in various biomedical research fields. This chapter covers the theoretical background, the state of the art and future trends in biomedical information retrieval. Techniques for literature searches, genomic information retrieval and database searches are discussed. Literature search techniques cover named entity extraction, document indexing, document clustering and event extraction. Genomic information retrieval techniques are based on sequence alignment algorithms. This chapter also briefly describes widely used biological databases and discusses the issues related to information retrieval from these databases. Terminology systems are involved in almost every aspect of information retrieval. The various types of terminology systems and their usage to support information retrieval are reviewed.
... After having located the correct entity, annotating the functional properties describing the entity in the text can be attempted. The first approach was to extract keywords for protein families that are used significantly more frequently with those proteins only [15]. Using such statistical inference, the text can be analyzed for these keywords and then accordingly assigned to a protein family. ...
Chapter
Full-text available
Text Mining is the process of extracting [novel] interesting and non-trivial information and knowledge from unstructured text (Google™ search result for "define: text mining"). Information retrieval, natural language processing, information extraction, and text mining provide methodologies to shift the burden of tracing and relating data contained in text from the human user to the computer. The emergence of high-throughput techniques has allowed biosciences to switch its research focus to Systems Biology, increasing the demands on text mining and extraction of information from heterogeneous sources. This chapter will introduce the most fundamental uses of language processing methods in biology and present the basic resources openly available in the field. The search for information about a common disease, chronic myeloid leukemia, is used to exemplify the capabilities. Tools such as PubMed, eTBLAST, METIS, EBIMed, MEDIE, MarkerInfoFinder, HCAD, iHOP, Chilibot, and G2D – selected from a comprehensive list of currently available systems – provide users with a basic platform for performing complex operations on information accumulated in text. Keywords: iHOP, CML, text mining, language processing
... Reported rule-based approaches range from those based on predefined lexical patterns [Blaschke et al. 1999; Ng and Wong 1999] and templates [Maynard and Ananiadou 1999; Pustejovsky et al. 2002], to parsing of documents using domain-specific grammars [Friedman et al. 2001b; Yakushiji et al. 2001; Gaizauskas et al. 2003]. Various statistical approaches, mainly based on mutual information and co-occurrence frequency counts, were used to associate terms that are not explicitly linked in text [Andrade and Valencia 1997; Stapley and Benoit 2000; Raychaudhuri et al. 2002; Ding et al. 2002; Nenadić et al. 2002]. Similarly, machine-learning approaches have been widely used to learn lexical contexts expressing a given relationship [Craven and Kumlien 1999; Marcotte et al. 2001; Stapley et al. 2002; Donaldson et al. 2003; Nenadić et al. 2003b; Spasic and Ananiadou 2005]. ...
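The mutual-information style of term association mentioned in this excerpt can be illustrated with a small sketch. The corpus, the term list, and the document-level co-occurrence counting are invented for the example; real systems use much larger windows, corpora, and smoothing.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_pairs(documents, terms):
    """Pointwise mutual information between term pairs, from document-level
    co-occurrence counts. Positive PMI suggests the terms are associated."""
    n = len(documents)
    occ = Counter()    # documents containing each term
    pair = Counter()   # documents containing each term pair
    for doc in documents:
        present = sorted({t for t in terms if t in doc.split()})
        occ.update(present)
        pair.update(combinations(present, 2))
    # PMI(a, b) = log [ P(a, b) / (P(a) * P(b)) ]
    return {p: math.log((c / n) / ((occ[p[0]] / n) * (occ[p[1]] / n)))
            for p, c in pair.items()}

docs = ["ras gtpase signal", "ras gtpase membrane", "actin cytoskeleton", "actin polymer"]
print(pmi_pairs(docs, ["ras", "gtpase", "actin"]))
```

Here "ras" and "gtpase" always appear together, so their PMI is positive, while "actin" never co-occurs with either and yields no pair at all.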
Article
Full-text available
Discovering links and relationships is one of the main challenges in biomedical research, as scientists are interested in uncovering entities that have similar functions, take part in the same processes, or are coregulated. This article discusses the extraction of such semantically related entities (represented by domain terms) from biomedical literature. The method combines various text-based aspects, such as lexical, syntactic, and contextual similarities between terms. Lexical similarities are based on the level of sharing of word constituents. Syntactic similarities rely on expressions (such as term enumerations and conjunctions) in which a sequence of terms appears as a single syntactic unit. Finally, contextual similarities are based on automatic discovery of relevant contexts shared among terms. The approach is evaluated using the Genia resources, and the results of experiments are presented. Lexical and syntactic links have shown high precision and low recall, while contextual similarities have resulted in significantly higher recall with moderate precision. By combining the three metrics, we achieved F measures of 68% for semantically related terms and 37% for highly related entities.
... Any known structure (from the PDB) is also reported. We look forward to applying more sophisticated methods for automatic annotation (Andrade and Valencia, 1997). ...
Article
Full-text available
To maximize the chances of biological discovery, homology searching must use an up-to-date collection of sequences. However, the available sequence databases are growing rapidly and are partially redundant in content. This leads to increasing strain on CPU resources and decreasing density of first-hand annotation. These problems are addressed by clustering closely similar sequences to yield a covering of sequence space by a representative subset of sequences. No pair of sequences in the representative set has >90% mutual sequence identity. The representative set is derived by an exhaustive search for close similarities in the sequence database in which the need for explicit sequence alignment is significantly reduced by applying deca- and pentapeptide composition filters. The algorithm was applied to the union of the Swissprot, Swissnew, Trembl, Tremblnew, Genbank, PIR, Wormpep and PDB databases. The all-against-all comparison required to generate a representative set at 90% sequence identity was accomplished in 2 days CPU time, and the removal of fragments and close similarities yielded a size reduction of 46%, from 260 000 unique sequences to 140 000 representative sequences. The practical implications are (i) faster homology searches using, for example, Fasta or Blast, and (ii) unified annotation for all sequences clustered around a representative. As tens of thousands of sequence searches are performed daily world-wide, appropriate use of the non-redundant database can lead to major savings in computer resources, without loss of efficacy. A regularly updated non-redundant protein sequence database (nrdb90), a server for homology searches against nrdb90, and a Perl script (nrdb90.pl) implementing the algorithm are available for academic use from http://www.embl-ebi.ac.uk/holm/nrdb90. Contact: holm@embl-ebi.ac.uk
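The pentapeptide composition filter described above can be approximated with a k-mer overlap check: two sequences with >90% identity necessarily share many identical 5-mers, so pairs with little 5-mer overlap can skip the expensive alignment. The threshold, function names, and sequences below are illustrative only, not the values used by nrdb90.

```python
def kmers(seq, k=5):
    """Set of overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def likely_similar(seq_a, seq_b, k=5, threshold=0.3):
    """Cheap prefilter: only pairs whose k-mer sets overlap enough are
    passed on to explicit alignment. Threshold is an assumption."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    if not a or not b:
        return False
    overlap = len(a & b) / min(len(a), len(b))
    return overlap >= threshold

# Near-identical sequences pass the filter; an unrelated one is skipped.
s1 = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
s2 = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVA"  # one substitution
s3 = "G" * 33
print(likely_similar(s1, s2), likely_similar(s1, s3))
```

Because set intersection is far cheaper than dynamic-programming alignment, an all-against-all scan over hundreds of thousands of sequences becomes tractable, which is the point of the composition filters in the abstract.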
... The main areas in which Euclid is being improved technically are as follows: (i) increasing the amount of functional input information for each sequence; to circumvent the scarcity of functional annotations in sequence databases, we have developed tools for extracting keywords directly from MEDLINE abstracts (Andrade and Valencia, 1997); (ii) extending the size of the input set of manually classified sequences, including the available new genomes that have been classified into functional classes by their authors; (iii) including the information about homologous sequences after careful assessment of a safe degree of similarity indicative of pertaining to the same functional class; (iv) implementation of a more elaborate weighting scheme to overcome the problem created by difference in size between classes. ...
Article
Full-text available
A tool is described for the automatic classification of sequences in functional classes using their database annotations. The Euclid system is based on a simple learning procedure from examples provided by human experts. AVAILABILITY: Euclid is freely available for academics at http://www.gredos.cnb.uam.es/EUCLID, with the corresponding dictionaries for the generation of three, eight and 14 functional classes. Contact: E-mail: valencia@cnb.uam.es SUPPLEMENTARY INFORMATION: The results of the EUCLID classification of different genomes are available at http://www.sander.ebi.ac.uk/genequiz/. A detailed description of the different applications mentioned in the text is available at http://www.gredos.cnb.uam.es/EUCLID/Full_Paper
... Guigo (Guigo et al., 1991, 1993; Guigo and Smith, 1993) developed a tool for determining the most characteristic subset of keywords for the biological function of a protein family from their database annotation that can be inherited to uncharacterized members of the family. Andrade and Valencia (1995, 1998) addressed a similar question by analysing a set of MEDLINE abstracts. The disadvantages of pure keyword searching approaches are 2-fold. ...
Article
Full-text available
Computer-based selection of entries from sequence databases with respect to a related functional description, e.g. with respect to a common cellular localization or contributing to the same phenotypic function, is a difficult task. Automatic semantic analysis of annotations is not only hampered by incomplete functional assignments. A major problem is that annotations are written in a rich, non-formalized language and are meant for reading by a human expert. This person can extract from the text considerably more information than is immediately apparent due to his extended biological background knowledge and logical reasoning. A technique of automated annotation evaluation based on a combination of lexical analysis and the usage of biological rule libraries has been developed. The proposed algorithm generates new functional descriptors from the annotation of a given entry using the semantic units of the annotation as premises for implications executed in accordance with the rule library. The prototype of a software system, the Meta_A(nnotator) program, is described and the results of its application to sequence attribute assignment and sequence selection problems, such as cellular localization and sequence domain annotation of SWISS-PROT entries, are presented. The current software version assigns useful subcellular localization qualifiers to approximately 88% of all SWISS-PROT entries. As shown by demonstrative examples, the combination of sequence and annotation analysis is a powerful approach for the detection of mutual annotation/sequence inconsistencies. Results for the cellular localization assignment can be viewed at the URL http://www.bork.embl-heidelberg.de/CELL_LOC/CELL_LOC.html.
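The rule-library idea (semantic units found in an annotation acting as premises for implications that yield new descriptors) can be illustrated with a toy sketch. The rules, keywords, and descriptor strings below are invented for illustration and are not taken from Meta_A:

```python
# Hypothetical rule library: if a semantic unit (keyword) appears in the
# annotation text, imply a new functional descriptor. Illustrative only.
RULES = {
    "signal peptide": "LOCALIZATION: secreted",
    "transmembrane": "LOCALIZATION: membrane",
    "histone": "LOCALIZATION: nucleus",
}

def imply_descriptors(annotation):
    """Apply every rule whose keyword occurs in the (lowercased) annotation."""
    text = annotation.lower()
    return sorted({desc for kw, desc in RULES.items() if kw in text})
```

A production system such as Meta_A additionally parses the annotation into semantic units and chains implications, rather than matching raw substrings.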
... Moreover, automated literature mining offers a yet unexploited opportunity to integrate many fragments of information gathered by researchers from multiple fields of expertise into a complete picture exposing the interrelated roles of various genes, proteins, and chemical reactions in cells and organisms. During the last few years, there has been a surge of interest in mining biomedical literature (Andrade & Valencia, 1997; Leek, 1997; Fukuda, Tsunoda, Tamura, & Takagi, 1998; Shatkay, Edwards, Wilbur, & Boguski, 2000; Jenssen, Laegreid, Komorowski, & Hovig, 2001; Hanisch, Fluck, Mevissen, & Zimmer, 2003), ranging from relatively modest tasks such as finding reported gene location on chromosomes (Leek, 1997) to more ambitious attempts to construct putative gene networks based on gene-name cooccurrences within articles (Jenssen, Laegreid, Komorowski, & Hovig, 2001). Since the literature covers all aspects of biology, chemistry, and medicine, there is almost no limit to the types of information that may be recovered through skillful and pervasive mining. ...
Article
As new high-throughput technologies have created an explosion of biomedical literature, there arises a pressing need for automatic information extraction from the literature bank. To this end, biomedical named entity recognition (NER) from natural language text is indispensable. Current NER approaches include: dictionary based, rule based, or machine learning based. Since there is no consolidated nomenclature for most biomedical NEs, any NER system relying on limited dictionaries or rules does not seem to perform satisfactorily. In this paper, we consider a machine learning model, CRF, for the construction of our NER framework. CRF is a well-known model for solving other sequence tagging problems. In our framework, we do our best to utilize available resources including dictionaries, web corpora, and lexical analyzers, and represent them as linguistic features in the CRF model. In the experiment on the JNLPBA 2004 data, with minimal post-processing, our system achieves an F-score of 70.2%, which is better than most state-of-the-art systems. On the GENIA 3.02 corpus, our system achieves an F-score of 78.4% for protein names, which is 2.8% higher than the next-best system. In addition, we also examine the usefulness of each feature in our CRF model. Our experience could be valuable to other researchers working on machine learning based NER.
... Text-mining in chemistry is not as prevalent as it is in biology, and the tools are less developed. Text-mining in biology is often used for the automatic extraction of information about genes, proteins and their functional relationships from text documents3456. The NLP tools in biology are also well developed, and we aim to create the equivalent in chemistry for part-of-speech taggers such as the GeniaTagger [7,8] as well as syntactic parsers such as Enju [9]. ...
Article
Full-text available
The primary method for scientific communication is in the form of published scientific articles and theses which use natural language combined with domain-specific terminology. As such, they contain free-flowing unstructured text. Given the usefulness of data extraction from unstructured literature, we aim to show how this can be achieved for the discipline of chemistry. The highly formulaic style of writing most chemists adopt makes their contributions well suited to high-throughput Natural Language Processing (NLP) approaches. We have developed the ChemicalTagger parser as a medium-depth, phrase-based semantic NLP tool for the language of chemical experiments. Tagging is based on a modular architecture and uses a combination of OSCAR, domain-specific regex and English taggers to identify parts-of-speech. The ANTLR grammar is used to structure this into tree-based phrases. Using a metric that allows for overlapping annotations, we achieved machine-annotator agreements of 88.9% for phrase recognition and 91.9% for phrase-type identification (Action names). It is possible to parse chemical experimental text using rule-based techniques in conjunction with a formal grammar parser. ChemicalTagger has been deployed for over 10,000 patents and has identified solvents from their linguistic context with >99.5% precision.
... Noun recognition can also be done using predefined dictionaries, as is often the case for index-based information-retrieval systems. Keyword indexing has been used to annotate proteins [17] and was recently proposed for construction of co-occurrence networks of genes in human [18] and Saccharomyces cerevisiae [19]. Text mining of functional links based on document similarity is another strategy that has been used to extract and annotate relationships between genes [20]. ...
Article
Full-text available
We have carried out automated extraction of explicit and implicit biomedical knowledge from publicly available gene and text databases to create a gene-to-gene co-citation network for 13,712 named human genes by automated analysis of titles and abstracts in over 10 million MEDLINE records. The associations between genes have been annotated by linking genes to terms from the medical subject heading (MeSH) index and terms from the gene ontology (GO) database. The extracted database and accompanying web tools for gene-expression analysis have collectively been named 'PubGene'. We validated the extracted networks by three large-scale experiments showing that co-occurrence reflects biologically meaningful relationships, thus providing an approach to extract and structure known biology. We validated the applicability of the tools by analyzing two publicly available microarray data sets.
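The core co-citation counting behind a network like PubGene's can be sketched in a few lines. Matching gene names by plain substring is a deliberate simplification here; real systems need name normalization and disambiguation, and the gene names below are illustrative:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_network(abstracts, gene_lexicon):
    """Count how often each pair of gene names co-occurs in an abstract.

    Returns a Counter mapping alphabetically ordered (gene_a, gene_b)
    pairs to their co-occurrence count across all abstracts.
    """
    pair_counts = Counter()
    for text in abstracts:
        # Simplification: substring match against a fixed lexicon.
        found = sorted({g for g in gene_lexicon if g in text})
        for a, b in combinations(found, 2):
            pair_counts[(a, b)] += 1
    return pair_counts
```

Edges with high counts can then be annotated with MeSH or GO terms, as the paper describes, to give the raw co-occurrences a biological reading.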
... Methods are available that provide additional functional insights into biological sequence data without similarity comparisons. For example, des Jardins et al. (1997) described a way to delineate, with reasonable accuracy, enzyme EC numbers from easily computable protein sequence features through the application of machine intelligence approaches. Andrade and Valencia (1997) described a procedure to associate protein sequence data with bibliographic references stored in the MEDLINE database through frequency analysis of word occurrence. No matter what program system is used, there are dangers inherent in automated annotation (Galperin and Koonin, 1998). Many molecular biology databanks are reluctant to adop ...
Article
Full-text available
Motivation: It is only a matter of time until a user will see not many but one integrated database of information for molecular biology. Is this true? Is it a good thing? Why will it happen? Where are we now? What developments are fostering and what developments are impeding progress towards this end? Supplementary information: A list of WWW resources devoted to database issues in molecular biology is available at http://www.mips.biochem.mpg.de Contact: frishman@mips.biochem.mpg.de
... (Average refers to the number of PSMs per gene.) the query engine of PubMed, are enabling the rapid retrieval of information, but are not sufficient if a large amount of data has to be compiled and made available for further processing and use. As a consequence, researchers have investigated methods to automatically analyse scientific text and to provide facts from the set of documents in a condensed form (12,13,19,20). In our approach we present the extraction of mutation–gene pairs from Medline abstracts. We used HUGO nomenclature to detect gene names, which is an important standardization for comparing the extracted data with other sources, e.g. with OMIM. ...
Article
Full-text available
Mutations help us to understand the molecular origins of diseases. Researchers, therefore, both publish and seek disease-relevant mutations in public databases and in scientific literature, e.g. Medline. The retrieval tends to be time-consuming and incomplete. Automated screening of the literature is more efficient. We developed extraction methods (called MEMA) that scan Medline abstracts for mutations. MEMA identified 24,351 singleton mutations in conjunction with a HUGO gene name out of 16,728 abstracts. From a sample of 100 abstracts we estimated the recall for the identification of mutation-gene pairs to 35% at a precision of 93%. Recall for the mutation detection alone was >67% with a precision rate of >96%. This shows that our system produces reliable data. The subset consisting of protein sequence mutations (PSMs) from MEMA was compared to the entries in OMIM (20,503 entries versus 6699, respectively). We found 1826 PSM-gene pairs to be in common to both datasets (cross-validated). This is 27% of all PSM-gene pairs in OMIM and 91% of those pairs from OMIM which co-occur in at least one Medline abstract. We conclude that Medline covers a large portion of the mutations known to OMIM. Another large portion could be artificially produced mutations from mutagenesis experiments. Access to the database of extracted mutation-gene pairs is available through the web pages of the EBI (refer to http://www.ebi.ac.uk/rebholz/index.html).
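Mutation mentions of the kind MEMA scans for are usually written in a compact nomenclature (e.g. A123T or Ala123Thr). A simplified regex-based detector, which ignores the gene-pairing step, might look like the sketch below; the patterns are illustrative and much looser than a production system's:

```python
import re

# One-letter form (A123T) and three-letter form (Ala123Thr).
# The character class lists the 20 standard amino-acid codes.
ONE = re.compile(r'\b([ACDEFGHIKLMNPQRSTVWY])(\d+)([ACDEFGHIKLMNPQRSTVWY])\b')
THREE = re.compile(r'\b([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})\b')

def find_mutations(text):
    """Return all mutation-like tokens found in the text."""
    hits = [m.group(0) for m in ONE.finditer(text)]
    hits += [m.group(0) for m in THREE.finditer(text)]
    return hits
```

Pairing each hit with a nearby HUGO gene name, as MEMA does, is the harder half of the problem and requires a curated name dictionary.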
... Several works have been published that explore a variety of innovative techniques to extract biological knowledge from the literature, in particular Medline abstracts. Information extraction has been utilized to identify gene and protein names (Fukuda et al., 1998; Proux et al., 1998; Leonard et al., 2002; Tanabe and Wilbur, 2002), molecular interactions or relationships between substances (Blaschke et al., 1999; Rindflesch et al., 2000; Proux et al., 2000; Humphreys et al., 2000; Thomas et al., 2000; Yoshida et al., 2000; Marcotte et al., 2001; Ono et al., 2001), specific keywords (Andrade and Valencia, 1997; Ohta et al., 1997; Andrade and Bork, 2000), protein location (Craven and Kumlien, 1999) and, recently, the roles of residues in protein molecules (Gaizauskas et al., 2003). Natural language processing (NLP) is also used to analyze and parse the text content (for a review, see Hirschman et al., 2002). ...
... Previous work has been done to augment or refine the standard PubMed search, including tools to conduct combinatorial searches [4] and to navigate standard search results based on common MeSH terms [5], gene names found in abstracts [6,7], PubMed-assigned 'related articles' [8], and combinations thereof [9][10][11][12]. In PubNet we present a unique twopronged approach in which network graphs are dynamically rendered to provide an intuitive and complete view of search results, while hyperlinking to a textual representation to allow detailed exploration of a point of interest. ...
Article
Full-text available
We have developed PubNet, a web-based tool that extracts several types of relationships returned by PubMed queries and maps them into networks, allowing for graphical visualization, textual navigation, and topological analysis. PubNet supports the creation of complex networks derived from the contents of individual citations, such as genes, proteins, Protein Data Bank (PDB) IDs, Medical Subject Headings (MeSH) terms, and authors. This feature allows one to, for example, examine a literature derived network of genes based on functional similarity.
Chapter
Full-text available
The Protein Design Group initiated its activity in 1994 with the incorporation of Alfonso Valencia to the National Centre for Biotechnology CNB-CSIC in Madrid. At that time the orientation of the group was largely a continuation of the work carried out from 1988 to 1994 in the group of Chris Sander at the EMBL in Heidelberg, which, not surprisingly, was also called the Protein Design Group. Since then the group has adopted new approaches to deal with the avalanche of genomic and structural information that was just starting in 1994. The application of literature mining to the analysis of expression arrays could be a good example of approaches unpredictable a few years ago. The increasing importance that Bioinformatics has had during the last few years drove us toward the development of professional software closer to the needs of the community, an aspect that was not so clearly perceived when Bioinformatics was still emerging. What remains from the spirit of Chris Sander's group is the interest in real-world biological problems and the continuous effort to collaborate with molecular and structural biologists. In this article we have summarised our main lines of work in Structural and Functional Genomics, describing the concepts behind applications and methods, and pointers to servers where our programs and results are available.
Article
Obesity is currently an epidemic that affects almost 15% of the global adult population. The complex metabolic processes involved in energy homeostasis, which are regulated by signals from multiple sources, present a challenging problem for drug discovery. In the current analysis, we present bibliometric and data-mining approaches based on categorizing literature according to medical subject headings (MeSH) to examine “hot” and “cold” trends, which indicate emerging areas of scientific research within obesity. This trend analysis corrects for increase in the overall size of obesity publications. A “hot” trend within obesity research is a concept on which publications are growing statistically faster than the background rise in obesity publications. In addition to growth in the number of publications associated with gastrointestinal weight-loss surgery and clinical studies in obesity, there is increasing research in the fields of adipose tissue, islet cell, and enteroendocrine biology as observed by a significant increase in the number of publications during the period 2005–2009, when compared to 2000–2004. However, the number of the publications in the area of hypothalamic and nervous system research in obesity appears to be cooling off. Extending the same concept of trend analysis to genes, we present a list of obesity-related genes that show “hot” trends suggesting emerging molecular mechanisms for obesity. Finally, we present a list of key scientific publications associated with obesity, one from each year over the last decade, which have the highest number of citations. Drug Dev Res 72: 201–208, 2011. © 2010 Wiley-Liss, Inc.
Article
Systematically evaluating the exponentially growing body of scientific literature has become a critical task that every drug discovery organization must engage in to understand emerging trends for scientific investment and strategy development. Emerging-trend analysis uses the number of publications within a 3-year window for concepts derived from well-established disease and gene ontologies to aid in recognizing and predicting emerging areas of scientific discovery relevant to that space. In this chapter, we describe such a method and use obesity and psoriasis as use-case examples by analyzing the frequency of disease-related MeSH terms in PubMed abstracts over time. We share how our system can be used to predict emerging trends at a relatively early stage and we analyze the literature-identified genes for genetic associations, druggability, and biological pathways to explore any potential biological connections between the two diseases that could be utilized for drug discovery.
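The windowed trend detection described here and in the preceding obesity study can be sketched minimally as follows, assuming per-period publication counts per MeSH term are already available. The growth-ratio threshold is an arbitrary illustration, not the chapters' calibrated criterion:

```python
def hot_terms(counts_early, counts_late, min_ratio=1.5):
    """Return terms whose publication growth outpaces the background.

    counts_early / counts_late: dicts mapping a MeSH term to its number
    of publications in two consecutive time windows. A term is 'hot'
    when its growth exceeds the corpus-wide growth by min_ratio,
    which corrects for the overall rise in publications.
    """
    total_early = sum(counts_early.values()) or 1
    total_late = sum(counts_late.values()) or 1
    background = total_late / total_early
    hot = []
    for term, late in counts_late.items():
        early = counts_early.get(term, 1)  # avoid division by zero
        if (late / early) / background >= min_ratio:
            hot.append(term)
    return sorted(hot)
```

A statistically careful version would replace the fixed ratio with a significance test on the counts, as the trend analysis in the articles implies.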
Article
Different programmes of the European Science Foundation (ESF) have contributed significantly to connecting researchers in Europe and beyond through several initiatives. This support was particularly relevant for the development of the areas related to extracting information from papers (text mining), because it sustained the field in its early phases, long before it was recognized by the community. We review the historical development of text-mining research and how it was introduced into bioinformatics. Specific applications in (functional) genomics are described, such as its integration into genome annotation pipelines and its support for the analysis of high-throughput genomics experimental data, and we highlight the method-evaluation and benchmarking activities for which the ESF programme's support was instrumental.
Article
The information age has made the electronic storage of large amounts of data effortless. The proliferation of documents available on the Internet, corporate intranets, news wires and elsewhere is overwhelming. Search engines only exacerbate this overload problem by making increasingly more documents available in only a few keystrokes. This information overload also exists in the biomedical field, where scientific publications and other forms of text-based data are produced at an unprecedented rate. Text mining is the combined, automated process of analyzing unstructured, natural language text to discover information and knowledge that are typically difficult to retrieve. Here, we focus on text mining as applied to the biomedical literature. We focus in particular on finding relationships among genes, proteins, drugs and diseases, to facilitate an understanding and prediction of complex biological processes. The LitMiner™ system, developed specifically for this purpose, is described in relation to the Knowledge Discovery and Data Mining Cup 2002, which serves as a formal evaluation of the system.
Article
We now know the full genomes of more than 60 organisms. The experimental characterization of the newly sequenced proteins is doomed to lag behind this explosion of naked sequences (the sequence-function gap). The rate at which expert annotators add the experimental information into more or less controlled vocabularies of databases snails along at an even slower pace. Most methods that annotate protein function exploit sequence similarity by transferring experimental information for homologues. A crucial development aiding such transfer is large-scale, work- and management-intensive projects aimed at developing a comprehensive ontology for gene-protein function, such as the Gene Ontology project. In parallel, fully automatic or semiautomatic methods have successfully begun to mine the existing data through lexical analysis. Some of these tools target parsing controlled vocabulary from databases; others venture at mining free texts from MEDLINE abstracts or full scientific papers. Automated text analysis has become a rapidly expanding discipline in bioinformatics. A few of these tools have already been embedded in research projects.
Article
Biological sequence databases are currently being re-engineered to make them more efficient and easier to use. This re-engineering is also providing an infrastructure to make it easier to interrogate and integrate data from different sources. The net result of this effort should be a great improvement in the power and availability of bioinformatics resources to the general biology community.
Article
The measurement of the simultaneous expression values of thousands of genes or proteins from high throughput Omics platforms creates a large amount of data whose interpretation by inspection can be a daunting task. A major challenge of using such data is to translate these lists of genes/proteins into a better understanding of the underlying biological phenomena. We describe approaches to identify biological concepts in the form of Medical Subject Headings (MeSH terms) as extracted from MEDLINE that are significantly overrepresented within the identified gene set relative to those associated with the overall collection of genes on the underlying Omics platform. The method's principal strength is its ability to simultaneously depict similarities that may exist at the level of biological structure, molecular function, physiology, genetics, and clinically manifest diseases, just as a single published article about a gene of interest may report findings within several of these same dimensions.
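The abstract does not name the statistical test, but overrepresentation of an annotation term in a gene set is conventionally assessed with a one-sided hypergeometric (Fisher-type) test, which can be written directly:

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """One-sided overrepresentation p-value, P(X >= k).

    N: genes on the platform; K: of those, genes annotated with the
    MeSH term; n: genes in the identified set; k: annotated genes in
    the identified set. Small values mean the term is overrepresented.
    """
    return sum(
        comb(K, i) * comb(N - K, n - i)
        for i in range(k, min(n, K) + 1)
    ) / comb(N, n)
```

Running this test per MeSH term (with multiple-testing correction) yields the kind of overrepresented-concept list the article describes.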
Article
Full-text available
Bioinformatics tools and systems perform a diverse range of functions including: data collection, data mining, data analysis, data management, and data integration. Computer-aided technology directly supporting medical applications is excluded from this definition and is referred to as medical informatics. This book is not an attempt at authoritatively describing the gamut of information contained in this field. Instead, it focuses on the areas of biomedical data integration, access, and interoperability as these areas form the cornerstone of the field. However, most of the approaches presented are generic integration systems that can be used in many similar contexts.
Article
Automatic knowledge extraction over large text collections has been a challenging task due to many constraints such as needs of large annotated training data, requirement of extensive manual processing of data, and huge amount of domain-specific terms. In order to address these constraints, this study proposes and develops a complete solution for extracting knowledge from large text collections with minimum human intervention. As a testbed system, a novel robust and quality knowledge extraction system, called RIKE (Robust Iterative Knowledge Extraction), has been developed. RIKE consists of two major components: DocSpotter and HiMMIE. DocSpotter queries and retrieves promising documents for extraction. HiMMIE extracts target entities based on a Mixture Hidden Markov Model from the documents selected by DocSpotter. The following three research questions are examined to evaluate RIKE: 1) How accurately does RIKE retrieve the promising documents for information extraction from huge text collections such as MEDLINE or TREC? 2) Does ontology enhance extraction accuracy of RIKE in retrieving the promising documents? 3) How well does RIKE extract the target entities from a huge medical text collection, MEDLINE? The major contributions of this study are: 1) an automatic unsupervised query-generation method for effective retrieval from text databases is proposed and evaluated; 2) mixture hidden Markov models for automatic instance extraction are proposed and tested; 3) three ontology-driven query expansion algorithms are proposed and evaluated; and 4) object-oriented methodologies for the knowledge extraction system are adopted. Through extensive experiments, RIKE is proved to be a robust and quality knowledge extraction technique. DocSpotter outperforms other leading techniques for retrieving promising documents for extraction from 15.5% to 35.34% in P@20. HiMMIE improves extraction accuracy from 9.43% to 24.67% in F-measures.
Article
Predicting function from sequence using computational tools is a highly complicated procedure that is generally done for each gene individually. This review focuses on the added value that is provided by completely sequenced genomes in function prediction. Various levels of sequence annotation and function prediction are discussed, ranging from genomic sequence to that of complex cellular processes. Protein function is currently best described in the context of molecular interactions. In the near future it will be possible to predict protein function in the context of higher order processes such as the regulation of gene expression, metabolic pathways and signalling cascades. The analysis of such higher levels of function description uses, besides the information from completely sequenced genomes, also the additional information from proteomics and expression data. The final goal will be to elucidate the mapping between genotype and phenotype.
Article
Annotating the tremendous amount of sequence information being generated requires accurate automated methods for recognizing homology. Although sequence similarity is only one of many indicators of evolutionary homology, it is often the only one used. Here we find that supplementing sequence similarity with information from biomedical literature is successful in increasing the accuracy of homology search results. We modified the PSI-BLAST algorithm to use literature similarity in each iteration of its database search. The modified algorithm is evaluated and compared to standard PSI-BLAST in searching for homologous proteins. The performance of the modified algorithm achieved 32% recall with 95% precision, while the original one achieved 33% recall with 84% precision; the literature similarity requirement preserved the sensitive characteristic of the PSI-BLAST algorithm while improving the precision.
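One simple way to approximate the "literature similarity" used to filter homology hits is cosine similarity over bag-of-words vectors of the abstracts associated with each sequence. The sketch below illustrates the idea only; it is not the modified PSI-BLAST implementation, and the threshold and whitespace tokenization are assumptions:

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two texts as word-count vectors."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = sqrt(sum(v * v for v in ca.values()))
    nb = sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def filter_hits(query_abstract, hits, threshold=0.1):
    """Keep only hits whose literature resembles the query's literature.

    hits: list of (sequence_id, hit_abstract) pairs, e.g. the matches
    from one search iteration. Threshold is an illustrative value.
    """
    return [sid for sid, abstract in hits
            if cosine(query_abstract, abstract) >= threshold]
```

Requiring literature agreement in each iteration is what lets such a scheme trade a little recall for a large gain in precision, as the reported numbers suggest.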
Article
Full-text available
Acronyms are widely used in biomedical and other technical texts. Understanding their meaning constitutes an important problem in the automatic extraction and mining of information from text. Here we present a system called ACROMED that is part of a set of Information Extraction tools designed for processing and extracting information from abstracts in the Medline database. In this paper, we present the results of two strategies for finding the long forms for acronyms in biomedical texts. These strategies differ from previous automated acronym extraction methods by being tuned to the complex phrase structures of the biomedical lexicon and by incorporating shallow parsing of the text into the acronym recognition algorithm. The performance of our system was tested with several data sets obtaining a performance of 72% recall with 97% precision. These results are found to be better for biomedical texts than the performance of other acronym extraction systems designed for unrestricted text.
Article
Functional characterizations of thousands of gene products from many species are described in the published literature. These discussions are extremely valuable for characterizing the functions not only of these gene products, but also of their homologs in other organisms. The Gene Ontology (GO) is an effort to create a controlled terminology for labeling gene functions in a more precise, reliable, computer-readable manner. Currently, the best annotations of gene function with the GO are performed by highly trained biologists who read the literature and select appropriate codes. In this study, we explored the possibility that statistical natural language processing techniques can be used to assign GO codes. We compared three document classification methods (maximum entropy modeling, naïve Bayes classification, and nearest-neighbor classification) to the problem of associating a set of GO codes (for biological process) to literature abstracts and thus to the genes associated with the abstracts. We showed that maximum entropy modeling outperforms the other methods and achieves an accuracy of 72% when ascertaining the function discussed within an abstract. The maximum entropy method provides confidence measures that correlate well with performance. We conclude that statistical methods may be used to assign GO codes and may be useful for the difficult task of reassignment as terminology standards evolve over time.
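Of the three classifiers compared, naive Bayes is the simplest to sketch. The minimal multinomial version below (add-one smoothing, whitespace tokenization) is an illustration of the technique, not the paper's implementation, and the training labels in the usage example are invented:

```python
from collections import Counter, defaultdict
from math import log

class NaiveBayes:
    """Multinomial naive Bayes over abstract words, assigning GO-like labels."""

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)  # label -> word counts
        self.label_counts = Counter(labels)
        self.vocab = set()
        for doc, lab in zip(docs, labels):
            words = doc.lower().split()
            self.word_counts[lab].update(words)
            self.vocab.update(words)
        return self

    def predict(self, doc):
        n_docs = sum(self.label_counts.values())
        best, best_score = None, float('-inf')
        for lab, n in self.label_counts.items():
            score = log(n / n_docs)  # log prior
            total = sum(self.word_counts[lab].values()) + len(self.vocab)
            for w in doc.lower().split():
                # Add-one smoothed log likelihood of each word.
                score += log((self.word_counts[lab][w] + 1) / total)
            if score > best_score:
                best, best_score = lab, score
        return best
```

Maximum entropy modeling, which the study found superior, replaces these independence-assuming likelihoods with jointly fitted feature weights.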
Article
Full-text available
Text literature is playing an increasingly important role in biomedical discovery. The challenge is to manage the increasing volume, complexity and specialization of knowledge expressed in this literature. Although information retrieval or text searching is useful, it is not sufficient to find specific facts and relations. Information extraction methods are evolving to extract automatically specific, fine-grained terms corresponding to the names of entities referred to in the text, and the relationships that connect these terms. Information extraction is, in turn, a means to an end, and knowledge discovery methods are evolving for the discovery of still more-complex structures and connections among facts. These methods provide an interpretive context for understanding the meaning of biological data.
Article
The analysis of large-scale genomic information (such as sequence data or expression patterns) frequently involves grouping genes on the basis of common experimental features. Often, as with gene expression clustering, there are too many groups to easily identify the functionally relevant ones. One valuable source of information about gene function is the published literature. We present a method, neighbor divergence, for assessing whether the genes within a group share a common biological function based on their associated scientific literature. The method uses statistical natural language processing techniques to interpret biological text. It requires only a corpus of documents relevant to the genes being studied (e.g., all genes in an organism) and an index connecting the documents to appropriate genes. Given a group of genes, neighbor divergence assigns a numerical score indicating how "functionally coherent" the gene group is from the perspective of the published literature. We evaluate our method by testing its ability to distinguish 19 known functional gene groups from 1900 randomly assembled groups. Neighbor divergence achieves 79% sensitivity at 100% specificity, comparing favorably to other tested methods. We also apply neighbor divergence to previously published gene expression clusters to assess its ability to recognize gene groups that had been manually identified as representative of a common function.
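The scoring idea can be illustrated with a much simpler stand-in than the paper's neighbor-divergence statistic: rate a gene group by how far each associated document's word distribution sits from the group's pooled distribution, using smoothed Kullback-Leibler divergence. The smoothing scheme and toy documents below are assumptions for illustration only:

```python
import math
from collections import Counter

def word_dist(text, vocab, alpha=1.0):
    """Laplace-smoothed unigram distribution over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) for smoothed distributions."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

def coherence(doc_texts):
    """Mean divergence of each document from the group's pooled word
    distribution; lower values mean a more coherent shared vocabulary."""
    vocab = {w for t in doc_texts for w in t.lower().split()}
    pooled = word_dist(" ".join(doc_texts), vocab)
    return sum(kl_divergence(word_dist(t, vocab), pooled)
               for t in doc_texts) / len(doc_texts)
```

A group whose literature keeps reusing the same terms ("dna", "repair", "damage") scores lower, i.e. more coherent, than a group whose documents share no vocabulary, which mirrors the intended contrast between functional gene groups and randomly assembled ones.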
Article
The growth of the biomedical literature presents special challenges for both human readers and automatic algorithms. One such challenge derives from the common and uncontrolled use of abbreviations in the literature. Each additional abbreviation increases the effective size of the vocabulary for a field. Therefore, to create an automatically generated and maintained lexicon of abbreviations, we have developed an algorithm to match abbreviations in text with their expansions. Our method uses a statistical learning algorithm, logistic regression, to score abbreviation expansions based on their resemblance to a training set of human-annotated abbreviations. We applied it to Medstract, a corpus of MEDLINE abstracts in which abbreviations and their expansions have been manually annotated. We then ran the algorithm on all abstracts in MEDLINE, creating a dictionary of biomedical abbreviations. To test the coverage of the database, we used an independently created list of abbreviations from the China Medical Tribune. We measured the recall and precision of the algorithm in identifying abbreviations from the Medstract corpus. We also measured the recall when searching for abbreviations from the China Medical Tribune against the database. On the Medstract corpus, our algorithm achieves up to 83% recall at 80% precision. Applying the algorithm to all of MEDLINE yielded a database of 781,632 high-scoring abbreviations. Of all the abbreviations in the list from the China Medical Tribune, 88% were in the database. We have developed an algorithm to identify abbreviations from text. We are making this available as a public abbreviation server at http://abbreviation.stanford.edu/.
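The scoring step can be sketched as a logistic model over a few resemblance features. The features and hand-set weights below are hypothetical; the published system learns its coefficients from the annotated Medstract training set and uses a richer feature set:

```python
import math

def features(short, long_form):
    """Crude resemblance features between an abbreviation and a candidate
    expansion (illustrative; the published system uses a richer set)."""
    s, words = short.lower(), long_form.lower().split()
    return [
        1.0 if s[0] == words[0][0] else 0.0,                   # first letters agree
        sum(1.0 for c in s if any(w.startswith(c) for w in words)) / len(s),
        abs(len(words) - len(s)) / len(s),                     # length mismatch
    ]

# Hand-set weights standing in for coefficients that a real logistic
# regression would learn from annotated training pairs (hypothetical values):
WEIGHTS, BIAS = [2.0, 3.0, -1.5], -3.0

def score(short, long_form):
    """Logistic score in (0, 1): pseudo-probability the expansion is correct."""
    z = BIAS + sum(w * f for w, f in zip(WEIGHTS, features(short, long_form)))
    return 1.0 / (1.0 + math.exp(-z))
```

With these toy weights, `score("PCR", "polymerase chain reaction")` lands well above 0.5 while a poor candidate such as "protein assay" lands below it; the real system ranks all candidate expansions in context by such scores.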
Article
Literature mining is the process of extracting and combining facts from scientific publications. In recent years, many computer programs have been designed to extract various molecular biology findings from Medline abstracts or full-text articles. The present article describes the range of text mining techniques that have been applied to scientific documents. It divides 'automated reading' into four general subtasks: text categorization, named entity tagging, fact extraction, and collection-wide analysis. Literature mining offers powerful methods to support knowledge discovery and the construction of topic maps and ontologies. An overview is given of recent developments in medical language processing. Special attention is given to the domain particularities of molecular biology, and the emerging synergy between literature mining and molecular databases accessible through Internet.
Article
The volume of biomedical text is growing at a fast rate, creating challenges for humans and computer systems alike. One of these challenges arises from the frequent use of novel abbreviations in these texts, thus requiring that biomedical lexical ontologies be continually updated. In this paper we show that the problem of identifying abbreviations' definitions can be solved with a much simpler algorithm than that proposed by other research efforts. The algorithm achieves 96% precision and 82% recall on a standard test collection, which is at least as good as existing approaches. It also achieves 95% precision and 82% recall on another, larger test set. A notable advantage of the algorithm is that, unlike other approaches, it does not require any training data.
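The letter-matching idea behind such training-free approaches can be sketched as follows. This is a minimal re-implementation in the spirit of the described algorithm, not the authors' exact code: scan the abbreviation right to left and find the shortest preceding span whose characters account for it, requiring the first letter to start a word:

```python
def best_long_form(short, candidate):
    """Right-to-left letter matching: return the shortest suffix of
    `candidate` whose characters account for `short`, with the first
    character of `short` landing at the start of a word."""
    s, l = short.lower(), candidate.lower()
    si, li = len(s) - 1, len(l) - 1
    while si >= 0:
        c = s[si]
        if not c.isalnum():          # skip punctuation in the short form
            si -= 1
            continue
        # Scan left for a matching character; the first short-form character
        # must additionally sit at a word boundary.
        while li >= 0 and (l[li] != c or
                           (si == 0 and li > 0 and l[li - 1].isalnum())):
            li -= 1
        if li < 0:
            return None              # no valid expansion found
        si -= 1
        li -= 1
    return candidate[li + 1:]

print(best_long_form("HMM", "we trained a hidden Markov model"))
# hidden Markov model
```

In practice the candidate text is the span immediately preceding a parenthesized abbreviation (or vice versa), which is what keeps the method fast enough to run over all of MEDLINE without any training data.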
Article
Full-text available
Molecular experiments using multiplex strategies such as cDNA microarrays or proteomic approaches generate large datasets requiring biological interpretation. Text-based data mining tools have recently been developed to query large biological datasets of this type. PubMatrix is a web-based tool that allows simple text-based mining of the NCBI literature search service PubMed using any two lists of keyword terms, resulting in a frequency matrix of term co-occurrence. For example, a simple term selection procedure allows automatic pair-wise comparisons of approximately 1-100 search terms versus approximately 1-10 modifier terms, resulting in up to 1,000 pair-wise comparisons. The matrix table of pair-wise comparisons can then be surveyed, queried individually, and archived. Lists of keywords can include any terms currently capable of being searched in PubMed. In the context of cDNA microarray studies, this may be used for the annotation of gene lists from clusters of genes that are expressed coordinately. An associated PubMatrix public archive provides previous searches using common useful lists of keyword terms. In this way, lists of terms, such as gene names or functional assignments, can be assigned genetic, biological, or clinical relevance in a rapid, flexible, systematic fashion. http://pubmatrix.grc.nia.nih.gov/
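The core of such a tool is just a pair-wise co-occurrence count. The sketch below runs the counts over a local list of abstracts instead of live PubMed queries; the gene names and abstracts in the usage example are made up:

```python
from itertools import product

def cooccurrence_matrix(search_terms, modifier_terms, abstracts):
    """Frequency matrix of pair-wise term co-occurrence over a document set.
    A local stand-in for PubMatrix's live PubMed queries; matching here is
    plain case-insensitive substring containment, which is deliberately crude."""
    lowered = [a.lower() for a in abstracts]
    return {(t, m): sum(1 for a in lowered
                        if t.lower() in a and m.lower() in a)
            for t, m in product(search_terms, modifier_terms)}
```

For instance, `cooccurrence_matrix(["TP53", "BRCA1"], ["cancer", "apoptosis"], abstracts)` yields one count per (gene, modifier) pair; in the real tool each cell would instead be the hit count of a combined PubMed query.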
Article
Full-text available
The complexity of the information stored in databases and publications on metabolic and signaling pathways, the high throughput of experimental data, and the growing number of publications make it imperative to provide systems to help the researcher navigate through these interrelated information resources. Text-mining methods have started to play a key role in the creation and maintenance of links between the information stored in biological databases and its original sources in the literature. These links will be extremely useful for database updating and curation, especially if a number of technical problems can be solved satisfactorily, including the identification of protein and gene names (entities in general) and the characterization of their types of interactions. The first generation of openly accessible text-mining systems, such as iHOP (Information Hyperlinked over Proteins), provides additional functions to facilitate the reconstruction of protein interaction networks, combine database and text information, and support the scientist in the formulation of novel hypotheses. The next challenge is the generation of comprehensive information regarding the general function of signaling pathways and protein interaction networks.
Article
Current advances in high-throughput biology are accompanied by a tremendous increase in the number of related publications. Much biomedical information is reported in the vast amount of literature. The ability to rapidly and effectively survey the literature is necessary for both the design and the interpretation of large-scale experiments, and for curation of structured biomedical knowledge in public databases. Given the millions of published documents, the field of information retrieval, which is concerned with the automatic identification of relevant documents from large text collections, has much to offer. This paper introduces the basics of information retrieval, discusses its applications in biomedicine, and presents traditional and non-traditional ways in which it can be used.
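The most traditional of the techniques surveyed, TF-IDF ranking with cosine similarity, fits in a short sketch. The toy documents are invented; real retrieval systems add stemming, stop-word removal, and smarter weighting:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF vectors for a small in-memory document collection."""
    tokenized = [d.lower().split() for d in docs]
    df = Counter(w for toks in tokenized for w in set(toks))
    idf = {w: math.log(len(docs) / df[w]) for w in df}
    vectors = [{w: c * idf[w] for w, c in Counter(toks).items()}
               for toks in tokenized]
    return vectors, idf

def search(query, docs):
    """Indices of docs ranked by cosine similarity to the query."""
    vectors, idf = tfidf_vectors(docs)
    q = {w: idf.get(w, 0.0) for w in query.lower().split()}

    def cosine(a, b):
        num = sum(a[w] * b.get(w, 0.0) for w in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return num / (na * nb) if na and nb else 0.0

    return sorted(range(len(docs)),
                  key=lambda i: cosine(q, vectors[i]), reverse=True)
```

Documents that share rare query terms rank highest, while documents with no query terms fall to the bottom; this vector-space model is the baseline against which the non-traditional biomedical approaches in the paper are framed.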
Article
Functional annotation of genes is an important task in biology since it facilitates the characterization of gene relationships and the understanding of biochemical pathways. The various gene functions can be described by standardized and structured vocabularies, called bio-ontologies. The assignment of bio-ontology terms to genes is carried out by applying certain methods to datasets extracted from biomedical articles. These methods originate from data mining and machine learning and include maximum entropy and support vector machines (SVM). The aim of this paper is to propose an alternative to the existing methods for functionally annotating genes. The methodology involves building classification models, validating them, graphically representing the results, and reducing the dimensions of the dataset. Classification models are constructed by linear discriminant analysis (LDA). The validation of the models is based on statistical analysis and interpretation of the results, involving techniques like hold-out samples and test datasets and metrics like the confusion matrix, accuracy, recall, precision, and F-measure. Graphical representations, such as boxplots, Andrews curves, and scatterplots of the variables resulting from the classification models, are also used for validating and interpreting the results. The proposed methodology was applied to a dataset extracted from biomedical articles for 12 Gene Ontology terms. The validation of the LDA models and the comparison with SVM show that LDA (mean F-measure 75.4%) outperforms SVM (mean F-measure 68.7%) on these data. The application of certain statistical methods can be beneficial for functional gene annotation from biomedical articles. Apart from the good performance, the results can be interpreted and give insight into the structure of the bio-text data.
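The validation metrics this kind of study relies on, precision, recall, and F-measure, reduce to a few lines given confusion-matrix counts (true positives, false positives, false negatives):

```python
def prf(tp, fp, fn):
    """Precision, recall and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure
```

For example, 8 correct term assignments with 2 spurious and 2 missed gives precision, recall, and F-measure of 0.8 each; the per-term F-measures averaged over the 12 GO terms yield the paper's 75.4% versus 68.7% comparison.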