Table 2 - uploaded by Christoph M Friedrich
Top 15 found terms with their number of occurrences

Source publication
Article
Full-text available
Chemical compounds like small signal molecules or other biologically active chemical substances are an important entity class in life science publications and patents. Several representations and nomenclatures for chemicals, like SMILES, InChI, IUPAC or trivial names, exist. Only SMILES and InChI names allow a direct structure search, but in biomedical...

Contexts in source publication

Context 1
... the found IUPAC entities, only 142 181 could be transformed to a structure (16.24%). The top 15 found terms from MEDLINE are shown in Table 2, the top 5 of the converted structures in Table 3 together with the most often used terms which lead to the normalization. To get an upper bound of convertible IUPAC names, we sample 100 000 correct names from data provided by NCBI (2007). ...
Context 2
... a final test, the full MEDLINE was labeled, showing the scalability of the implementation. The highest frequency (without normalization) is almost 17 000 mentions of one term (Table 2). A conversion of the names to their corresponding structures shows that only a minor part (below 20%) can be processed (without evaluating the correctness of the conversion). ...
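
The two contexts above describe a frequency ranking of found terms and a conversion rate. A minimal Python sketch of both computations follows. The function names `top_terms` and `conversion_rate` and the sample mentions are hypothetical and are not taken from the source publication. Only the figures 142 181 and 16.24% come from the text, and because the percentage is rounded, the total derived from them is an approximation.

```python
from collections import Counter


def top_terms(mentions, n=15):
    """Return the n most frequent mention strings with their counts."""
    return Counter(mentions).most_common(n)


def conversion_rate(converted, total):
    """Percentage of found names that could be converted to a structure."""
    return 100.0 * converted / total


# Hypothetical mentions, for illustration only (not data from the paper).
mentions = ["ethanol", "glucose", "ethanol", "acetic acid", "ethanol", "glucose"]
print(top_terms(mentions, n=2))

# Approximate number of found IUPAC entities implied by the quoted figures:
# 142 181 converted names are reported as 16.24% of all found names.
implied_total = round(142181 / 0.1624)
print(implied_total)
print(round(conversion_rate(142181, implied_total), 2))
```

`Counter.most_common` sorts by count in descending order, which is all a "top 15" table requires. Normalizing spelling variants before counting would be a separate step, and the context notes that the reported frequencies are without normalization.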

Similar publications

Article
Full-text available
Medicinal chemistry patents contain rich information about chemical compounds. Although much effort has been devoted to extracting chemical entities from scientific literature, limited numbers of patent mining systems are publicly available, probably due to the lack of large manually annotated corpora. To accelerate the development of information...

Citations

... Early techniques for chemical entity extraction, such as dictionary-based methods (Akhondi et al., 2016; Hettne et al., 2009; Rebholz-Schuhmann et al., 2007) and grammar-based methods (Akhondi et al., 2015; Liu et al., 2012), heavily rely on expert knowledge in the chemical domain. Klinger et al. (Klinger et al., 2008) present a machine learning approach based on conditional random fields to find mentions of IUPAC and IUPAC-like names in patents. CPC-2014 has promoted the development of novel chemical text NER methods (Akhondi et al., 2016; Leaman et al., 2016; Zhang et al., 2016). ...
Article
Full-text available
Information extraction is an important foundation for automated patent analysis. Deep learning methods show promising results for information extraction, but the performance of such methods heavily depends on the available corpus. To promote research on Chinese information extraction and evaluate the performance of related systems, we present a novel dataset, named CPIE, and make it publicly available. The dataset consists of five thousand records of Chinese patent documents. The data were annotated by a tagging team using an online annotation tool. The dataset was evaluated using a state-of-the-art information extraction method that involves named entity recognition and relationship classification. The results shed light on new challenges and promote information extraction research.
... However, rule-based [7] and dictionary-based [8] methods often have strong dependence on domain knowledge, poor scalability and portability, and often require a significant amount of time to develop rules and establish dictionaries. With the increase in data volume, more and more researchers are trying to use machine learning methods to handle BioNER tasks, such as Hidden Markov Model [9], Support Vector Machine model [10], Maximum Entropy model [11], and Conditional Random Fields model [12]. However, traditional machine learning methods typically require large amounts of labeled data for training, have high data dependence, require extensive feature engineering, and have difficulty with context modeling and handling unknown entities. ...
Article
Full-text available
Currently, in the field of biomedical named entity recognition, CharCNN (Character-level Convolutional Neural Networks) or CharRNN (Character-level Recurrent Neural Network) is typically used independently to extract character features. However, this approach does not consider the complementary capabilities between them and only concatenates word features, ignoring the feature information during the process of word integration. Based on this, this paper proposes a method of multi-cross attention feature fusion. First, DistilBioBERT and CharCNN and CharLSTM are used to perform cross-attention word-char (word features and character features) fusion separately. Then, the two feature vectors obtained from cross-attention fusion are fused again through cross-attention to obtain the final feature vector. Subsequently, a BiLSTM is introduced with a multi-head attention mechanism to enhance the model’s ability to focus on key information features and further improve model performance. Finally, the output layer is used to output the final result. Experimental results show that the proposed model achieves the best F1 values of 90.76%, 89.79%, 94.98%, 80.27% and 88.84% on NCBI-Disease, BC5CDR-Disease, BC5CDR-Chem, JNLPBA and BC2GM biomedical datasets respectively. This indicates that our model can capture richer semantic features and improve the ability to recognize entities.
... Sayle et al. [25][26][27][28][29][30][32][33]. The common thread of these approaches was to treat chemical nomenclature as a linguistic system, a kind of artificial language, and the names of individual organic compounds as compound words of that language, built from meaningful components according to definite rules that are clearly formalized in chemical nomenclature and are therefore amenable to algorithmization. ...
Article
The user interface «Nomenclature Generator», which extracts chemical structure information from the systematic name of an organic compound represented according to IUPAC nomenclature, was developed at the All-Russian Institute for Scientific and Technical Information of the Russian Academy of Sciences.
... 15 Similar strategies have been used by Olivetti and co-workers to analyse how synthesis gel composition and organic structure directing agent can dictate crystal polymorphs for a range of zeolite syntheses. 16 One weakness of these text mining approaches is their reliance on unambiguous identification of the chemical entities in question, using named-entity recognition (NER) 9,20 and the programmatic naming conventions defined by IUPAC 21 to succeed. In the absence of such well-accepted naming schemes, as is the case for a variety of emerging nanomaterial families like porous silicas, polymers of intrinsic microporosity, and covalent organic framework materials, large scale data mining becomes far less practical. ...
Article
Full-text available
With the continuously growing number of scientific articles on the synthesis of nanomaterials, it becomes impossible for researchers to grasp and comprehend the landscape of synthetic protocols available for a particular material. The aim of this study is to explore the feasibility of extracting the collective knowledge on the synthesis of a particular material accumulated over the years from the published corpus of articles and organizing it in a systematic manner. Accordingly, we developed methods to perform detailed text mining on a single nanomaterial target for the purposes of methodology optimisation. Taking the common material ZIF-8 as a case study, we analysed 1600 synthesis protocols to identify trends in parameters, such as reagents, concentrations, and reaction time/temperature. We used this information to find the distribution of synthesis parameters and their relationships to one another, identifying the limits of common reaction parameters and revealing subtle details, such as insolubility of metal acetate reagents in alcoholic solvents, or the occurrence of amorphous oxides at low stoichiometric ratios. We then clustered similar synthesis protocols together, using their relative popularity to identify promising regions of the synthesis phase space for optimisation, reducing the need for brute force synthesis optimisation. The techniques developed here are a general tool accelerating the synthesis development of a wide range of nanomaterials by aggregating existing research trends, averting the need for laborious manual comparison of existing synthesis protocols or repetition of previously-developed techniques.
... Word embedding approaches were used in Ref. 9 to generate entity-rich documents for human experts to annotate which were then used to train a polymer named entity tagger. Most previous NLP-based efforts in materials science have focused on inorganic materials 10,11 and organic small molecules 12,13 but limited work has been done to address information extraction challenges in polymers. Polymers in practice have several non-trivial variations in name for the same material entity which requires polymer names to be normalized. ...
Article
Full-text available
The ever-increasing number of materials science articles makes it hard to infer chemistry-structure-property relations from literature. We used natural language processing methods to automatically extract material property data from the abstracts of polymer literature. As a component of our pipeline, we trained MaterialsBERT, a language model, using 2.4 million materials science abstracts, which outperforms other baseline models in three out of five named entity recognition datasets. Using this pipeline, we obtained ~300,000 material property records from ~130,000 abstracts in 60 hours. The extracted data was analyzed for a diverse range of applications such as fuel cells, supercapacitors, and polymer solar cells to recover non-trivial insights. The data extracted through our pipeline is made available at polymerscholar.org which can be used to locate material property data recorded in abstracts. This work demonstrates the feasibility of an automatic pipeline that starts from published literature and ends with extracted material property information.
... Automated methods to recognize and identify chemicals in biomedical text have a long history (5)(6)(7). Previous work in biomedical named entity recognition (NER) and normalization [i.e. entity linking (EL)] for chemicals includes several community challenges, including the Chemical Compound and Drug Name Recognition (CHEMDNER) (8) and BioCreative 5 Chemical-Disease Relation (BC5CDR) (9) tasks at previous BioCreative workshops. ...
Article
The BioCreative National Library of Medicine (NLM)-Chem track calls for a community effort to fine-tune automated recognition of chemical names in the biomedical literature. Chemicals are one of the most searched biomedical entities in PubMed, and-as highlighted during the coronavirus disease 2019 pandemic-their identification may significantly advance research in multiple biomedical subfields. While previous community challenges focused on identifying chemical names mentioned in titles and abstracts, the full text contains valuable additional detail. We, therefore, organized the BioCreative NLM-Chem track as a community effort to address automated chemical entity recognition in full-text articles. The track consisted of two tasks: (i) chemical identification and (ii) chemical indexing. The chemical identification task required predicting all chemicals mentioned in recently published full-text articles, both span [i.e. named entity recognition (NER)] and normalization (i.e. entity linking), using Medical Subject Headings (MeSH). The chemical indexing task required identifying which chemicals reflect topics for each article and should therefore appear in the listing of MeSH terms for the document in the MEDLINE article indexing. This manuscript summarizes the BioCreative NLM-Chem track and post-challenge experiments. We received a total of 85 submissions from 17 teams worldwide. The highest performance achieved for the chemical identification task was 0.8672 F-score (0.8759 precision and 0.8587 recall) for strict NER performance and 0.8136 F-score (0.8621 precision and 0.7702 recall) for strict normalization performance. The highest performance achieved for the chemical indexing task was 0.6073 F-score (0.7417 precision and 0.5141 recall). 
This community challenge demonstrated that (i) the current substantial achievements in deep learning technologies can be utilized to improve automated prediction accuracy further and (ii) the chemical indexing task is substantially more challenging. We look forward to further developing biomedical text-mining methods to respond to the rapid growth of biomedical literature. The NLM-Chem track dataset and other challenge materials are publicly available at https://ftp.ncbi.nlm.nih.gov/pub/lu/BC7-NLM-Chem-track/. Database URL https://ftp.ncbi.nlm.nih.gov/pub/lu/BC7-NLM-Chem-track/.
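
The precision, recall and F-score figures reported in the abstract above can be cross-checked with a short calculation. The sketch below assumes that "F-score" means the balanced F1 measure, i.e. the harmonic mean of precision and recall. The abstract does not state this explicitly, and the helper name `f_score` is hypothetical.

```python
def f_score(precision, recall):
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-score) as quoted in the abstract above.
reported = {
    "strict NER": (0.8759, 0.8587, 0.8672),
    "strict normalization": (0.8621, 0.7702, 0.8136),
    "chemical indexing": (0.7417, 0.5141, 0.6073),
}

for task, (p, r, f) in reported.items():
    # Compare the recomputed value with the reported one at four decimals.
    print(f"{task}: computed {f_score(p, r):.4f}, reported {f:.4f}")
```

Under this assumption all three recomputed values agree with the reported F-scores to four decimal places.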
... The BC2GM [20], BioNLP09 [21], and BioNLP-OST19 [22] datasets deal with genes, proteins, and bacteria, respectively. In chemical engineering, SCAI [23] and IUPAC [24] are available for research on chemical name matching. Similar to chemical names, Weston et al. [16] developed a dataset for material engineering to normalize entities to a canonical form. ...
Article
Full-text available
Discriminating the matched named entity pairs or identifying the entities’ canonical forms are critical in text mining tasks. More precise named entity normalization in text mining will benefit other subsequent text analytic applications. We built the named entity normalization model with a novel edge weight updating neural network. We, next, verify our model’s performance on NCBI disease, BC5CDR disease, and BC5CDR chemical databases, which are widely used named entity normalization datasets in the bioinformatics field. We also tested our model with our own financial named entity normalization dataset to validate the efficacy for more general applications. Using the constructed dataset, we differentiate named entity pairs. Our model achieved the highest named entity normalization performances in terms of various evaluation metrics. Our proposed model when tested on four different datasets achieved state-of-the-art results.
... Different approaches have been adopted for drug identification from the biomedical literature (Chen & Pan, 2021; Ding et al., 2017; Yang et al., 2017). Klinger et al. (2008) applied a CRF-based ML approach for the recognition of chemical name mentions based on IUPAC rules. The recognizer reported an F1 measure of 0.856 against the MEDLINE corpus (Klinger et al., 2008). ...
... Klinger et al. (2008) applied a CRF-based ML approach for the recognition of chemical name mentions based on IUPAC rules. The recognizer reported an F1 measure of 0.856 against the MEDLINE corpus (Klinger et al., 2008). The DrugNER system is an innovative study, which has achieved a precision of 0.78 and an exceptional recall score of 0.993. ...
Article
Health maintenance is one of the foremost pillars of human society which needs up-to-date solutions to medical problems. The advancement in the biomedical field has intensified the information load that exists in the form of clinic reports, research papers, or lab tests, etc. Extracting meaningful insights from this corpus is as important as its progress, in order to make it valuable for recent medicine. In terms of biomedical text mining, the areas explored include protein–protein interactions, entity-relationship detection, and so on. The biomedical effects of drugs have significance when administered to a living organism. Biomedical literature is not widely explored in terms of gene-drug relations, and hence needs investigation. Indexing methods can be used for ranking gene-drug relations. In scientific literature, Hirsch's h-index is usually used to quantify the impact of an individual author. Likewise, in this research, we propose the Drug-Index, a quantifiable measure that can be used to detect gene-drug relations. It is useful in drug discovery, diagnosis, and personalized treatment using suitable drugs for relevant genes. For a strong and reliable gene-drug relationship discovery, drugs are extracted from a subset of MEDLINE, a bibliographic medical database. The detected drugs are verified from the PharmacoGenomics KnowledgeBase (PharmGKB), a publicly available medical knowledgebase by Stanford University.
... The BC2GM [14], BioNLP09 [15], and BioNLP-OST19 [16] datasets deal with genes, proteins, and bacteria, respectively. In chemical engineering, SCAI [17] and IUPAC [18] are available for research on chemical name matching. Similar to chemical names, Weston et al. [10] developed a dataset for material engineering to normalize entities to a canonical form. ...
Preprint
Full-text available
Discriminating the matched named entity pairs or identifying the entities' canonical forms are critical in text mining tasks. More precise named entity normalization in text mining will benefit other subsequent text analytic applications. We built the named entity normalization model with a novel Edge Weight Updating Neural Network. Our proposed model when tested on four different datasets achieved state-of-the-art results. We, next, verify our model's performance on NCBI Disease, BC5CDR Disease, and BC5CDR Chemical databases, which are widely used named entity normalization datasets in the bioinformatics field. We also tested our model with our own financial named entity normalization dataset to validate the efficacy for more general applications. Using the constructed dataset, we differentiate named entity pairs. Our model achieved the highest named entity normalization performances in terms of various evaluation metrics.
... Deep Learning is a subfield of machine learning composed of multiple processing or hidden layers with representation learning of data. Learning can be supervised, semi-supervised or unsupervised (Degtyarenko et al., 2009; Degtyarenko et al., 2008; Klinger et al., 2008). ...
Chapter
Chemical entities can be represented in different forms like chemical names, chemical formulae, and chemical structures. Because of the different classification frameworks for chemical names, the task of identifying or extracting chemical entities with little ambiguity is considered a major challenge. Chemical named entity recognition (NER) is the initial phase in any chemical-related data extraction strategy. The majority of chemical NER is done using dictionary-based, rule-based, and machine learning procedures. Recently, deep learning methods have evolved, and, in this chapter, the authors sketch out the various deep learning techniques applied for chemical NER. First, the authors introduce the fundamental concepts of chemical named entity recognition, the textual contents of chemical documents, and how these chemicals are represented in chemical literature. The chapter concludes with the strengths and weaknesses of the above methods and also the types of the chemical entities extracted.