Conference Paper

Term Extraction + Term Clustering: An Integrated Platform for Computer-Aided Terminology

Authors: Didier Bourigault and Christian Jacquemin

Abstract and Figures

A novel technique for automatic thesaurus construction is proposed. It is based on the complementary use of two tools: (1) a Term Extraction tool that acquires term candidates from tagged corpora through a shallow grammar of noun phrases, and (2) a Term Clustering tool that groups syntactic variants (insertions). Experiments performed on corpora in three technical domains yield clusters of term candidates with precision rates between 93% and 98%.
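As a rough illustration of the two-step process described in the abstract, the following Python sketch extracts (ADJ|NOUN)* NOUN candidates from a POS-tagged sentence and then groups candidates that are related by word insertion. It is only a toy approximation using assumed tags and example data, not the system described in the paper, which relies on a richer shallow grammar and FASTR-style metarules.

```python
# Toy sketch of (1) shallow noun-phrase term extraction and
# (2) grouping of insertion variants (e.g. "bronchial cell" /
# "bronchial epithelial cell"). Illustrative only.
from collections import defaultdict

# POS-tagged sentence as (token, tag) pairs; simplified tag set for illustration.
tagged = [("the", "DET"), ("bronchial", "ADJ"), ("epithelial", "ADJ"),
          ("cell", "NOUN"), ("secretes", "VERB"), ("mucus", "NOUN")]

def extract_candidates(tagged_tokens):
    """Return maximal ADJ/NOUN runs that end in a NOUN as term candidates."""
    candidates = []
    i = 0
    while i < len(tagged_tokens):
        if tagged_tokens[i][1] in ("ADJ", "NOUN"):
            j = i
            while j < len(tagged_tokens) and tagged_tokens[j][1] in ("ADJ", "NOUN"):
                j += 1
            if tagged_tokens[j - 1][1] == "NOUN":
                candidates.append(" ".join(w for w, _ in tagged_tokens[i:j]))
            i = j
        else:
            i += 1
    return candidates

def is_insertion_variant(short, long_):
    """True if `long_` can be obtained from `short` by inserting words
    (a simple subsequence test, cruder than FASTR's metarules)."""
    it = iter(long_.split())
    return all(word in it for word in short.split())

def cluster_variants(candidates):
    """Group each candidate with the longer candidates that expand it."""
    clusters = defaultdict(set)
    for a in candidates:
        for b in candidates:
            if a != b and is_insertion_variant(a, b):
                clusters[a].add(b)
    return dict(clusters)

cands = extract_candidates(tagged) + ["bronchial cell"]
print(cands)                    # ['bronchial epithelial cell', 'mucus', 'bronchial cell']
print(cluster_variants(cands))  # {'bronchial cell': {'bronchial epithelial cell'}}
```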
... To address the third obstacle (i.e. the variation of argument instances in the text), we use existing methods for recognizing terminological variations [Bourigault and Jacquemin, 1999] and new units of measurement [Berrahou, 2015], as well as a method proposed in this thesis for acronym extraction. We integrate all of these methods into the construction of our SciPuRe and STaRe representations. ...
... - Search for terminological variants: To improve the extraction of argument instances, the OTR vocabulary was enriched with terminological variations using FASTR [Bourigault and Jacquemin, 1999]. FASTR relies on a linguistic analysis of sentences, which means it does not depend on association frequencies. ...
... To do so, we measured the recall, precision and F-score values obtained by an extraction without coverage expansion (cf. Baseline in Table 7), by the terminological variant search of our modified version of FASTR [Bourigault and Jacquemin, 1999] (cf. +Fastr in Table 7) and by the acronym search (cf. +Acronymes in Table 7), against the method using all available expansions (GENERAL in Table 7). Table 7 - Effect of expanding the coverage of the OTR. We observe in Table 7 that the methods for expanding OTR coverage based on the detection of terminological variants and acronym forms increase recall while very slightly degrading precision. ...
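For reference, the precision, recall and F-score reported in such tables are normally the standard measures; assuming the usual definitions in terms of true positives (TP), false positives (FP) and false negatives (FN):

```latex
\[
P = \frac{TP}{TP+FP}, \qquad
R = \frac{TP}{TP+FN}, \qquad
F_1 = \frac{2PR}{P+R}
\]
```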
Thesis
Full-text available
This thesis falls within the research field of smart data, where specific information is sought within textual documents. It proposes new methods for representing and extracting experimental data from scientific articles. These methods were evaluated on a corpus of articles in the food packaging domain. Experimental data can be represented as n-ary relations composed of symbolic and quantitative arguments, the latter consisting of a numerical value and a unit of measurement. The goal of this thesis is to populate a knowledge base with instances of n-ary relations extracted from textual scientific documents. The proposed approach relies on a termino-ontological resource (OTR) and breaks down into two phases: (1) recognizing and extracting the instances of arguments of interest and (2) linking these instances into n-ary relations. Phase (1) introduces an original representation of the extracted argument instances, called SciPuRe (Scientific Publication Representation). It combines ontological, lexical and structural descriptors that describe the context in which argument instances appear and makes it possible to rank them by relevance. Phase (2) relies on the information found in the tables of the documents, extracted automatically, to guide the extraction of n-ary relations from partial relations, since tables contain a large share of the experimental data in scientific articles. These partial relations are then completed with the argument instances recognized in Phase (1). Three approaches are proposed and evaluated to identify the argument instances that should complete the relations: using the structure of the documents, analysing co-occurrences between argument instances in the texts, and finally using word-embedding models to measure the similarity between candidate argument instances and the arguments already filled in the partial relations. Our results show the importance of ranking the relevant instances after argument recognition in Phase (1) using the SciPuRe descriptors. Our experiments show that the two most important criteria for determining the relevance of a symbolic argument instance are the specificity of the concept associated with the argument in the OTR and its frequency in the document. For quantitative arguments, it is the membership of the argument instance in particular sections of the documents that determines its relevance. Our experiments on Phase (2) confirm the usefulness of the relevance scores computed in Phase (1) for discriminating between instances. Analysing the results with different filterings of the candidate argument instances according to their relevance shows a clear positive effect when the 20% least relevant instances are filtered out. We also experimented with selecting several candidates for each missing argument instance in a partial relation, as an assistance to domain experts who can then determine the valid instance. When a single candidate is selected, the approach based on co-occurrence analysis gives the best results for detecting the valid candidate argument instance.
With a larger selection of three or five candidates, the semantic similarity analysis enabled by BERT word-embedding models gives good results for detecting associations between the argument instances present in the partial relations and the candidate argument instances for completing the relations. Finally, when ten candidates are selected, the experiments show that the approach based on document structure is effective for completing the n-ary relations.
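As an illustration of the embedding-based candidate ranking mentioned at the end of this abstract, the sketch below scores candidate argument instances against a partial relation with a sentence-embedding model. The model name, the example strings and the use of the sentence-transformers library are assumptions for illustration, not the thesis implementation.

```python
# Illustrative sketch: rank candidate argument instances by embedding
# similarity to the arguments already filled in a partial relation.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed, publicly available model

partial_relation = ["LDPE packaging film", "oxygen permeability"]          # placeholder
candidates = ["23 degrees Celsius", "polyethylene terephthalate",
              "50% relative humidity"]                                      # placeholders

context_emb = model.encode(" ; ".join(partial_relation), convert_to_tensor=True)
cand_embs = model.encode(candidates, convert_to_tensor=True)
scores = util.cos_sim(context_emb, cand_embs)[0]

# print candidates from most to least similar to the partial relation
for cand, score in sorted(zip(candidates, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {cand}")
```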
... The work of [Bourigault, 1992, Bourigault and Jacquemin, 1999, Bourigault et al., 1996], which led to the creation of the lexter and fastr tools, relies for its part on a method that is entirely independent of patterns and of co-occurrence and collocation analyses. We focus here on lexter, which extracts terminological surface forms without any further processing. ...
... The grammar, comparable to a morphosyntactic pattern, needs to be combined here with boundary processing to increase its precision: since the lexical units extracted at this step will later be proposed as candidate terms for validation, the precision of this segmentation step must be as high as possible, unlike the extraction-and-ranking methods discussed in section 2.3.2. The extracted noun groups are then analysed to derive smaller groups by identifying their head and their expansion (an enrichment method introduced by [David and Plante, 1990] in the termino software): bronchial cell in cylindrical bronchial cell and cell in bronchial cell [Bourigault and Jacquemin, 1999]. ...
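A deliberately naive sketch of the head/expansion decomposition illustrated by the example above, assuming English right-headed noun phrases; LEXTER itself works on parsed French noun phrases, so this is only an approximation:

```python
# From a maximal noun phrase, propose shorter candidates by repeatedly
# stripping the leftmost modifier, so "cylindrical bronchial cell"
# yields "bronchial cell", which yields "cell".
def decompose(noun_phrase):
    words = noun_phrase.split()
    subterms = []
    while len(words) > 1:
        words = words[1:]          # drop the leftmost modifier
        subterms.append(" ".join(words))
    return subterms

print(decompose("cylindrical bronchial cell"))
# ['bronchial cell', 'cell']
```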
... Another form of variation was discussed in [Bourigault and Jacquemin, 1999] with fastr: the so-called syntactic variations. Although their results are interesting, their definition of variants does not suit us here for the first proposed metarule: the insertion of adjectives and adverbs. ...
Thesis
We present here various experiments and hypotheses related to automatic terminographic extraction and its potential hybridization with topic models. In NLP, the automatic construction of terminologies is far from consensual. Researchers' differing objectives give rise to diverging opinions on what does or does not constitute a terminological unit. The disagreements arise at different levels of the task. On the linguistic level, researchers have reached relative agreement on the morphosyntactic structure of terminological surface forms. New proposals appear regularly, but they complement this consensus more than they invalidate it. While the structure of the surface forms is agreed upon, the same cannot be said of their characterization as terminological units. The terminological status of a unit is determined from various internal as well as external factors. In a first stage, our experiments deal with the context in which terminological units appear, using topic models. We examine whether and how terminological units can benefit the construction of topic models. This benefit is estimated in terms of the relevance of the resulting models and of statistical measures. In a second stage, we propose an extension of the morphosyntactic structure of terminological surface forms.
... The first concern when using the vocabulary of a resource to drive the entity extraction process is its coverage of the domain of interest. Terminological variations of the vocabulary defining the entities can be extracted from a list of terms present in documents via the analysis of morphological and syntactic features [28,29]. We use a Python version of FASTR [28]. ...
... To improve entity extraction, the OTR vocabulary was expanded with terminological variations using FASTR [29]. This tool extracts terminological variations of a list of terms in a document via the analysis of morphological and syntactic characteristics. ...
... For several decades now there has been interest in the automatic suggestion of terms and their grouping. In a first phase, the technical approach used could require a certain number of linguistic rules, possibly combined to some degree with statistical considerations (for example Bourigault and Jacquemin (1999), Frantzi et al. (2000), Oliver and Vázquez (2015)). In recent decades one can observe a trend toward the growing relative importance of statistical aspects at the expense of linguistic rules. ...
Conference Paper
Full-text available
Sperm whales communicate with small sequences of successive clicks, called codas. For the past ten years, ’conversations’ between sperm whales have been recorded, particularly in Mauritius. Starting from the transcriptions of these recordings, we added metadata on the age, sex of each individual, and family relationships between them, then studied these data with Proxem Studio, a text-mining tool. We present our first results, which primarily show the existence of compound signs (multi-codas). We also observe strong correlations between certain configurations (conversations between mother and son or daughter, conversations between adult females, etc.) and certain signs, as well as numerous correlations between individuals and particular signs. Some of the results would have consequences for the very idea of animal language. These results encourage us for the continuation of our work (of which we give some hints).
... Note that variations of terms could be automatically extracted with NLP approaches in dedicated corpora [6,7]. This will be integrated as future work to extend the current lexicon. ...
Article
Full-text available
The main objective of the project LEAP4FNSSA (Long-term EU-AU Research and Innovation Partnership for Food and Nutrition Security and Sustainable Agriculture) is to provide a tool for European and African institutions to engage in a sustainable partnership platform for research and innovation on Food and Nutrition Security, and Sustainable Agriculture (FNSSA). The FNSSA roadmap facilitates the involvement of stakeholders for addressing and linking research to innovation dealing with food security issues. In this context, the LEAP4FNSSA project supports the driving of the roadmap. Research and innovation activities were captured in different data, i.e. LEAP4FNSSA database and heterogeneous textual data including project reports, websites, scientific publications, workshop reports and student theses. The Knowledge Extractor Pipeline System (KEOPS) was implemented to support the processing and analysis of textual data associated with FNSSA activities. KEOPS is based on the LEAP4FNSSA lexicon presented in this data paper. The LEAP4FNSSA lexicon based on 331 keywords associated with 12 concepts dealing with the food security domain is the result of 3 steps of work and brainstorming. The lexicon enables the capturing of research and innovation topics dealing with food security and conducted by African and European partners. This data paper presents the obtained lexicon and a summary of the method to build it.
... Our study relies on building a reference dataset by scraping YouTube over a large time range and on inferring knowledge from its lexical content with quantitative text-mining approaches. We learn terminology using lexical extraction, which is generally used to extract repeated segments [13,33,51] and named entities [40]. As we want to discover topics that may appear in the content, we need to learn an ontology using a low-resource technique such as clustering: we use k-means [37] and topic modeling [10]. ...
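A minimal sketch of the kind of low-resource clustering named here: k-means over TF-IDF vectors with scikit-learn. The documents and the number of clusters are placeholders, not the study's actual data.

```python
# Cluster short texts with TF-IDF + k-means (illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "whistleblower leaks classified surveillance documents",
    "government surveillance program exposed by insider",
    "new smartphone review battery life camera",
    "smartphone camera comparison and battery test",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)   # e.g. [0 0 1 1]: one cluster per topic
```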
Article
Full-text available
Social media is more and more dominant in everyday life for people around the world. YouTube content is a resource that may be useful, in social computational science, for understanding key questions about society. Using this resource, we performed web scraping to create a dataset of 644,575 video transcriptions concerning net activism and whistleblowing. We automatically performed linguistic feature extraction to capture a representation of each video using its title, description and transcription (downloaded metadata). The next step was to clean the dataset using automatic clustering with linguistic representation to identify unmatched videos and noisy keywords. Using these keywords to exclude videos, we finally obtained a dataset that was reduced by 95%, i.e., it contained 35,730 video transcriptions. Then, we again automatically clustered the videos using a lexical representation and split the dataset into subsets, leading to hundreds of clusters that we interpreted manually to identify a hierarchy of topics of interest concerning whistleblowing. We used the dataset to learn a lexical representation for a specific topic and to detect unknown whistleblowing videos for this topic; the accuracy of this detection is 57.4%. We also used the dataset to identify interesting context linguistic markers around the names of whistleblowers. From a given list of names, we automatically extracted all 5-gram word sequences from the dataset and identified interesting markers in the left and right contexts for each name by manual interpretation. The results of our study are the following: a dataset (raw and cleaned collections) concerning whistleblowing, a hierarchy of topics about whistleblowing, the automatic prediction of whistleblowing and the semi-automatic semantic analysis of markers around whistleblower names. This text mining analysis can be exploited for digital sociology and e-democracy studies.
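To make the 5-gram context extraction step concrete, here is a tiny sketch; the sentence and the name are placeholders, not the study's data or code.

```python
# Return every n-gram of `tokens` that contains the target name.
def ngram_contexts(tokens, name, n=5):
    grams = [tokens[i:i + n] for i in range(len(tokens) - n + 1)]
    return [" ".join(g) for g in grams if name in g]

text = "the journalist interviewed Snowden about the surveillance leaks".split()
print(ngram_contexts(text, "Snowden"))
# ['the journalist interviewed Snowden about', 'journalist interviewed Snowden about the', ...]
```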
... A unique approach for automated thesaurus creation has been presented by Bourigault and Jacquemin. It is based on the employment of two tools in combination: (1) a term extraction tool that extracts term candidates from tagged corpora using a shallow grammar of noun phrases and (2) a term clustering tool that clusters syntactic variations (insertions) [17]. Newman et al. proposed a word segmentation model based on the Dirichlet process (DP), in which multiword segments are either retrieved from a cache or newly generated [18]. ...
Article
Full-text available
Languages are not uniform, and certain words are used differently by speakers of different languages, more or less often, or with distinct meanings. In both linguistics and natural language processing (NLP), classifications that group together verbs sharing similar syntactic and semantic features are of great interest. In the modern era of science and technology, NLP technology is developing rapidly. However, the interpretation of index lines still has to be carried out manually. This is time-consuming, especially since the arrival of big data: the number of corpora has increased rapidly, and corpora with hundreds of millions of words are now commonplace. The quantity of text generated every day is growing intensely, and the word index based on search words can run to tens of thousands of lines, so it is very difficult to analyze index lines manually. Automatic lexical knowledge acquisition is essential for a variety of NLP activities. Knowledge about verbs is particularly critical, as verbs are the major source of relational information in a sentence. For this reason, this study attempts to automatically identify and extract English verbs by index line clustering. Each index line can be regarded as a microtext, and automatic clustering of these microtexts is used to achieve the automatic identification and extraction of English verb forms. This study first focuses on the index clustering algorithms, including the C-means clustering algorithm and the fuzzy C-means clustering algorithm, then describes in detail the automatic recognition and extraction process of English verbs based on index line clustering, and finally creates a verification set and completes the index line clustering of English verbs. The effect of the index line clustering algorithm is analyzed from two aspects: automatic recognition of English verb types and recall rate. At the same time, the verbs are selected to analyze their types and judge the probability of each type. The experimental results show that the average recognition rate of English verbs in the manual classification is 91.01%, and the average accuracy of automatic recognition and extraction of English verb patterns based on index line clustering is 95.99%.
... Moreover, such terms are often defined through examples rather than discursively in different grammars. Our approach is minimalist, so we also do not produce a fully POS-tagged corpus as input to term extraction, unlike some other approaches (Bourigault and Jacquemin, 1999). ...
Conference Paper
Full-text available
The present work is aimed at (1) developing a search machine adapted to the large DReaM corpus of linguistic descriptive literature and (2) getting insights into how a data-driven ontology of linguistic terminology might be built. Starting from close to 20,000 text documents from the literature of language descriptions, from documents either born digitally or scanned and OCR’d, we extract keywords and pass them through a pruning pipeline where mainly keywords that can be considered as belonging to linguistic terminology survive. Subsequently we quantify relations among those terms using Normalized Pointwise Mutual Information (NPMI) and use the resulting measures, in conjunction with the Google Page Rank (GPR), to build networks of linguistic terms.
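For reference, NPMI as named in this abstract is commonly defined as follows; this is a standard formulation, and the exact variant used by the authors may differ.

```latex
% Normalized Pointwise Mutual Information between terms x and y
\[
\operatorname{PMI}(x,y) = \log \frac{p(x,y)}{p(x)\,p(y)},
\qquad
\operatorname{NPMI}(x,y) = \frac{\operatorname{PMI}(x,y)}{-\log p(x,y)} \in [-1, 1]
\]
% NPMI is 1 when the two terms always co-occur and 0 when they are independent.
```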
... The synonym layer addresses the acquisition of semantic term variants in and between languages. It is either based on sets, such as WordNet synsets [19] (after sense disambiguation), on clustering techniques [20][21][22][23], or on other similar methods, including Web-based knowledge acquisition. ...
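As a small illustration of the synset-based option mentioned here, WordNet synsets can be queried through NLTK; this assumes the WordNet data has been downloaded, and the query word is a placeholder.

```python
# List WordNet synsets and their lemmas for a query word.
# Requires: nltk.download("wordnet") beforehand.
from nltk.corpus import wordnet as wn

for synset in wn.synsets("thesaurus"):
    print(synset.name(), synset.lemma_names())
```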
Chapter
Full-text available
Ontologies constitute an exciting model for representing a domain of interest, since they enable information-sharing and reuse. Existing inference machines can also use them to reason about various contexts. However, ontology construction is a time-consuming and challenging task. The ontology learning field answers this problem by providing automatic or semi-automatic support to extract knowledge from various sources, such as databases and structured and unstructured documents. This paper reviews the ontology learning process from unstructured text and proposes a bottom-up approach to building legal domain-specific ontology from Arabic texts. In this work, the learning process is based on Natural Language Processing (NLP) techniques and includes three main tasks: corpus study, term acquisition, and conceptualization. Corpus study enriches the original corpus with valuable linguistic information. Term acquisition selects sequences of tagged lemmas as potential term candidates, and conceptualization derives concepts and their relationships from the extracted terms. We used the NooJ platform to implement the required linguistic resources for each task. Further, we developed a Java module to enrich the ontology vocabulary from the Arabic WordNet (AWN) project. The obtained results were essential but incomplete. The legal expert revised them manually, and then they were used to refine and expand a domain ontology for a Moroccan Legal Information Retrieval System (LIRS).
Article
Full-text available
Although terminology is a branch of linguistics with a long history, a number of terminological systems have not been thoroughly analyzed. One of the areas that falls into this category is the terminology of humanitarian subjects because the way their terminology is formed differs from the term formation of STEM disciplines. Sublanguages of humanitarian disciplines quite often borrow general language words, which can be explained by the fact that the area of their studies is related to general rules of society functioning. In the course of transfer from the general language to domain-specific language, the semantic and/or morphological structure of words might undergo modifications. This article analyses the methods of formation of the English terminology of hermeneutics. Hermeneutics is a theory and methodology of text interpretation. One of the distinguishing features of its sublanguage is the fact that it is formed with the use of a considerable number of general language words that are used by text interpreters in a specialized meaning. This paper presents the analysis of the semantic transformations of some of these words after they became part of the sublanguage of hermeneutics.
Book
Full-text available
Preface. 1. Introduction. 2. Semantic Extraction. 3. Sextant. 4. Evaluation. 5. Applications. 6. Conclusion. 1. Preprocessors. 2. Webster Stopword List. 3. Similarity List. 4. Semantic Clustering. 5. Automatic Thesaurus Generation. 6. Corpora Treated. Index.
Conference Paper
Full-text available
A system for the automatic production of controlled index terms is presented using linguistically-motivated techniques. This includes a finite-state part of speech tagger, a derivational morphological processor for analysis and generation, and a unification-based shallow-level parser using transformational rules over syntactic patterns. The contribution of this research is the successful combination of parsing over a seed term list coupled with derivational morphology to achieve greater coverage of multi-word terms for indexing and retrieval. Final results are evaluated for precision and recall, and implications for indexing and retrieval are discussed.
Article
Full-text available
This paper identifies some linguistic properties of technical terminology, and uses them to formulate an algorithm for identifying technical terms in running text. The grammatical properties discussed are preferred phrase structures: technical terms consist mostly of noun phrases containing adjectives, nouns, and occasionally prepositions; rarely do terms contain verbs, adverbs, or conjunctions. The discourse properties are patterns of repetition that distinguish noun phrases that are technical terms, especially those multi-word phrases that constitute a substantial majority of all technical vocabulary, from other types of noun phrase. The paper presents a terminology identification algorithm that is motivated by these linguistic properties. An implementation of the algorithm is described; it recovers a high proportion of the technical terms in a text, and a high proportion of the recovered strings are valid technical terms. The algorithm proves to be effective regardless of the domain of the text to which it is applied.
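The abstract's combination of a preferred phrase-structure filter with a repetition requirement can be sketched as follows. The POS pattern below is one common formalization of such a filter, not necessarily the article's exact algorithm, and the input phrases are placeholders.

```python
# Keep phrases whose POS sequence is (ADJ|NOUN)+ NOUN, optionally with one
# preposition (ADP) inside, and that occur at least twice (repetition).
import re
from collections import Counter

# candidate phrases with their POS sequences; in practice these come from a tagger
phrases = [
    ("central processing unit", "ADJ NOUN NOUN"),
    ("degrees of freedom", "NOUN ADP NOUN"),
    ("central processing unit", "ADJ NOUN NOUN"),
    ("is very fast", "VERB ADV ADJ"),
]

PATTERN = re.compile(r"^((ADJ|NOUN) )+NOUN$|^((ADJ|NOUN) )*NOUN ADP ((ADJ|NOUN) )*NOUN$")

counts = Counter(p for p, tags in phrases if PATTERN.match(tags))
terms = [p for p, c in counts.items() if c >= 2]   # require repetition
print(terms)   # ['central processing unit']
```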
Article
The paper describes research designed to improve automatic pre-coordinate term indexing by applying powerful general-purpose language analysis techniques to identify term sources in requests, and to generate variant expressions of the concepts involved for document text searching.
Article
The aim of automatic indexing is to achieve a compact representation of a document suitable for retrieval. FASIT (Fully Automatic Syntactically based Indexing of Text) identifies content-bearing textual units without a full parse and, without using semantic criteria, groups these units into quasi-synonymous sets. Tested on a database of 250 documents and 22 queries, FASIT performed better than both thesaurus-based and stem-based indexing systems. Retrievals indicate that the basic idea of FASIT, namely that significant terms in the text can be identified through syntactic patterns, is valid and that FASIT deserves serious consideration as an advance over stem-based systems.
Article
We present several unsupervised statistical models for the prepositional phrase attachment task that approach the accuracy of the best supervised methods for this task. Our unsupervised approach uses a heuristic based on attachment proximity and trains from raw text that is annotated with only part-of-speech tags and morphological base forms, as opposed to attachment information. It is therefore less resource-intensive and more portable than previous corpus-based algorithms proposed for this task. We present results for prepositional phrase attachment in both English and Spanish.
A knowledge-poor technique for knowledge extraction from large corpora
  • Gregory Grefenstette
Gregory Grefenstette. 1992. A knowledge-poor technique for knowledge extraction from large corpora. In Proceedings, 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '92), Copenhagen.