Conference Paper

Term Extraction + Term Clustering: An Integrated Platform for Computer-Aided Terminology

Authors: Didier Bourigault and Christian Jacquemin

Abstract and Figures

A novel technique for automatic thesaurus construction is proposed. It is based on the complementary use of two tools: (1) a Term Extraction tool that acquires term candidates from tagged corpora through a shallow grammar of noun phrases, and (2) a Term Clustering tool that groups syntactic variants (insertions). Experiments performed on corpora in three technical domains yield clusters of term candidates with precision rates between 93% and 98%.
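As a rough illustration of the two-step process described in the abstract, the following Python sketch extracts (ADJ|NOUN)* NOUN candidates from a POS-tagged sentence and then groups candidates that are related by word insertion. It is only a toy approximation using assumed tags and example data, not the system described in the paper, which relies on a richer shallow grammar and FASTR-style metarules.

```python
# Toy sketch of (1) shallow noun-phrase term extraction and
# (2) grouping of insertion variants (e.g. "bronchial cell" /
# "bronchial epithelial cell"). Illustrative only.
from collections import defaultdict

# POS-tagged sentence as (token, tag) pairs; simplified tag set for illustration.
tagged = [("the", "DET"), ("bronchial", "ADJ"), ("epithelial", "ADJ"),
          ("cell", "NOUN"), ("secretes", "VERB"), ("mucus", "NOUN")]

def extract_candidates(tagged_tokens):
    """Return maximal ADJ/NOUN runs that end in a NOUN as term candidates."""
    candidates = []
    i = 0
    while i < len(tagged_tokens):
        if tagged_tokens[i][1] in ("ADJ", "NOUN"):
            j = i
            while j < len(tagged_tokens) and tagged_tokens[j][1] in ("ADJ", "NOUN"):
                j += 1
            if tagged_tokens[j - 1][1] == "NOUN":
                candidates.append(" ".join(w for w, _ in tagged_tokens[i:j]))
            i = j
        else:
            i += 1
    return candidates

def is_insertion_variant(short, long_):
    """True if `long_` can be obtained from `short` by inserting words
    (a simple subsequence test, cruder than FASTR's metarules)."""
    it = iter(long_.split())
    return all(word in it for word in short.split())

def cluster_variants(candidates):
    """Group each candidate with the longer candidates that expand it."""
    clusters = defaultdict(set)
    for a in candidates:
        for b in candidates:
            if a != b and is_insertion_variant(a, b):
                clusters[a].add(b)
    return dict(clusters)

cands = extract_candidates(tagged) + ["bronchial cell"]
print(cands)                    # ['bronchial epithelial cell', 'mucus', 'bronchial cell']
print(cluster_variants(cands))  # {'bronchial cell': {'bronchial epithelial cell'}}
```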
... To address the third obstacle (i.e. the variation of argument instances in the text), we use existing methods for recognizing terminological variations [Bourigault and Jacquemin, 1999] and new units of measurement [Berrahou, 2015], as well as a method proposed in this thesis for acronym extraction. We integrate all of these methods into the construction of our SciPuRe and STaRe representations. ...
... - Search for terminological variants: To improve the extraction of argument instances, the OTR vocabulary was enriched with terminological variations using FASTR [Bourigault and Jacquemin, 1999]. FASTR relies on a linguistic analysis of sentences, which means it does not depend on association frequencies. ...
... To do so, we measured the recall, precision and F-score values obtained by an extraction without coverage expansion (cf. Baseline in Table 7), by the terminological variant search of our modified version of FASTR [Bourigault and Jacquemin, 1999] (cf. +Fastr in Table 7) and by the acronym search (cf. +Acronymes in Table 7), against the method using all available expansions (GENERAL in Table 7). Table 7 - Effect of expanding the coverage of the OTR. We observe in Table 7 that the methods for expanding OTR coverage based on the detection of terminological variants and acronym forms increase recall while very slightly degrading precision. ...
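For reference, the precision, recall and F-score reported in such tables are normally the standard measures; assuming the usual definitions in terms of true positives (TP), false positives (FP) and false negatives (FN):

```latex
\[
P = \frac{TP}{TP+FP}, \qquad
R = \frac{TP}{TP+FN}, \qquad
F_1 = \frac{2PR}{P+R}
\]
```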
Thesis
Full-text available
This thesis falls within the research field of smart data, where specific information is sought within textual documents. It proposes new methods for representing and extracting experimental data from scientific articles. These methods were evaluated on a corpus of articles in the food packaging domain. Experimental data can be represented as n-ary relations composed of symbolic and quantitative arguments, the latter consisting of a numerical value and a unit of measurement. The goal of this thesis is to populate a knowledge base with instances of n-ary relations extracted from textual scientific documents. The proposed approach relies on a termino-ontological resource (OTR) and breaks down into two phases: (1) recognizing and extracting the instances of arguments of interest and (2) linking these instances into n-ary relations. Phase (1) introduces an original representation of the extracted argument instances, called SciPuRe (Scientific Publication Representation). It combines ontological, lexical and structural descriptors that describe the context in which argument instances appear and makes it possible to rank them by relevance. Phase (2) relies on the information found in the tables of the documents, extracted automatically, to guide the extraction of n-ary relations from partial relations, since tables contain a large share of the experimental data in scientific articles. These partial relations are then completed with the argument instances recognized in Phase (1). Three approaches are proposed and evaluated to identify the argument instances that should complete the relations: using the structure of the documents, analysing co-occurrences between argument instances in the texts, and finally using word-embedding models to measure the similarity between candidate argument instances and the arguments already filled in the partial relations. Our results show the importance of ranking the relevant instances after argument recognition in Phase (1) using the SciPuRe descriptors. Our experiments show that the two most important criteria for determining the relevance of a symbolic argument instance are the specificity of the concept associated with the argument in the OTR and its frequency in the document. For quantitative arguments, it is the membership of the argument instance in particular sections of the documents that determines its relevance. Our experiments on Phase (2) confirm the usefulness of the relevance scores computed in Phase (1) for discriminating between instances. Analysing the results with different filterings of the candidate argument instances according to their relevance shows a clear positive effect when the 20% least relevant instances are filtered out. We also experimented with selecting several candidates for each missing argument instance in a partial relation, as an assistance to domain experts who can then determine the valid instance. When a single candidate is selected, the approach based on co-occurrence analysis gives the best results for detecting the valid candidate argument instance.
With a larger selection of three or five candidates, the semantic similarity analysis enabled by BERT word-embedding models gives good results for detecting associations between the argument instances present in the partial relations and the candidate argument instances for completing the relations. Finally, when ten candidates are selected, the experiments show that the approach based on document structure is effective for completing the n-ary relations.
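As an illustration of the embedding-based candidate ranking mentioned at the end of this abstract, the sketch below scores candidate argument instances against a partial relation with a sentence-embedding model. The model name, the example strings and the use of the sentence-transformers library are assumptions for illustration, not the thesis implementation.

```python
# Illustrative sketch: rank candidate argument instances by embedding
# similarity to the arguments already filled in a partial relation.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed, publicly available model

partial_relation = ["LDPE packaging film", "oxygen permeability"]          # placeholder
candidates = ["23 degrees Celsius", "polyethylene terephthalate",
              "50% relative humidity"]                                      # placeholders

context_emb = model.encode(" ; ".join(partial_relation), convert_to_tensor=True)
cand_embs = model.encode(candidates, convert_to_tensor=True)
scores = util.cos_sim(context_emb, cand_embs)[0]

# print candidates from most to least similar to the partial relation
for cand, score in sorted(zip(candidates, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {cand}")
```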
... The work of [Bourigault, 1992, Bourigault and Jacquemin, 1999, Bourigault et al., 1996], which led to the creation of the lexter and fastr tools, relies for its part on a method that is entirely independent of patterns and of co-occurrence and collocation analyses. We focus here on lexter, which extracts terminological surface forms without any further processing. ...
... The grammar, comparable to a morphosyntactic pattern, needs to be combined here with boundary processing to increase its precision: since the lexical units extracted at this step will later be proposed as candidate terms for validation, the precision of this segmentation step must be as high as possible, unlike the extraction-and-ranking methods discussed in section 2.3.2. The extracted noun groups are then analysed to derive smaller groups by identifying their head and their expansion (an enrichment method introduced by [David and Plante, 1990] in the termino software): bronchial cell in cylindrical bronchial cell and cell in bronchial cell [Bourigault and Jacquemin, 1999]. ...
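A deliberately naive sketch of the head/expansion decomposition illustrated by the example above, assuming English right-headed noun phrases; LEXTER itself works on parsed French noun phrases, so this is only an approximation:

```python
# From a maximal noun phrase, propose shorter candidates by repeatedly
# stripping the leftmost modifier, so "cylindrical bronchial cell"
# yields "bronchial cell", which yields "cell".
def decompose(noun_phrase):
    words = noun_phrase.split()
    subterms = []
    while len(words) > 1:
        words = words[1:]          # drop the leftmost modifier
        subterms.append(" ".join(words))
    return subterms

print(decompose("cylindrical bronchial cell"))
# ['bronchial cell', 'cell']
```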
... Another form of variation was discussed in [Bourigault and Jacquemin, 1999] with fastr: the so-called syntactic variations. Although their results are interesting, their definition of variants does not suit us here for the first proposed metarule: the insertion of adjectives and adverbs. ...
Thesis
We present here various experiments and hypotheses related to automatic terminographic extraction and its potential hybridization with topic models. In NLP, the automatic construction of terminologies is far from consensual. Researchers' differing objectives give rise to diverging opinions on what does or does not constitute a terminological unit. The disagreements arise at different levels of the task. On the linguistic level, researchers have reached relative agreement on the morphosyntactic structure of terminological surface forms. New proposals appear regularly, but they complement this consensus more than they invalidate it. While the structure of the surface forms is agreed upon, the same cannot be said of their characterization as terminological units. The terminological status of a unit is determined from various internal as well as external factors. In a first stage, our experiments deal with the context in which terminological units appear, using topic models. We examine whether and how terminological units can benefit the construction of topic models. This benefit is estimated in terms of the relevance of the resulting models and of statistical measures. In a second stage, we propose an extension of the morphosyntactic structure of terminological surface forms.
... The first concern when using the vocabulary of a resource to drive the entity extraction process is its coverage of the domain of interest. Terminological variations of the vocabulary defining the entities can be extracted from a list of terms present in documents via the analysis of morphological and syntactic features [28,29]. We use a Python version of FASTR [28]. ...
... To improve entity extraction, the OTR vocabulary was expanded with terminological variations using FASTR [29]. This tool extracts terminological variations of a list of terms in a document via the analysis of morphological and syntactic characteristics. ...
... For several decades now there has been interest in the automatic suggestion of terms and their grouping. In a first phase, the technical approach used could require a certain number of linguistic rules, possibly combined to some degree with statistical considerations (for example Bourigault and Jacquemin (1999), Frantzi et al. (2000), Oliver and Vázquez (2015)). In recent decades one can observe a trend toward the growing relative importance of statistical aspects at the expense of linguistic rules. ...
Conference Paper
Full-text available
Sperm whales communicate with small sequences of successive clicks, called codas. For the past ten years, ’conversations’ between sperm whales have been recorded, particularly in Mauritius. Starting from the transcriptions of these recordings, we added metadata on the age, sex of each individual, and family relationships between them, then studied these data with Proxem Studio, a text-mining tool. We present our first results, which primarily show the existence of compound signs (multi-codas). We also observe strong correlations between certain configurations (conversations between mother and son or daughter, conversations between adult females, etc.) and certain signs, as well as numerous correlations between individuals and particular signs. Some of the results would have consequences for the very idea of animal language. These results encourage us for the continuation of our work (of which we give some hints).
... Note that variations of terms could be automatically extracted with NLP approaches in dedicated corpora [6,7]. This will be integrated as future work to extend the current lexicon. ...
Article
Full-text available
The main objective of the project LEAP4FNSSA (Long-term EU-AU Research and Innovation Partnership for Food and Nutrition Security and Sustainable Agriculture) is to provide a tool for European and African institutions to engage in a sustainable partnership platform for research and innovation on Food and Nutrition Security, and Sustainable Agriculture (FNSSA). The FNSSA roadmap facilitates the involvement of stakeholders for addressing and linking research to innovation dealing with food security issues. In this context, the LEAP4FNSSA project supports the driving of the roadmap. Research and innovation activities were captured in different data, i.e. LEAP4FNSSA database and heterogeneous textual data including project reports, websites, scientific publications, workshop reports and student theses. The Knowledge Extractor Pipeline System (KEOPS) was implemented to support the processing and analysis of textual data associated with FNSSA activities. KEOPS is based on the LEAP4FNSSA lexicon presented in this data paper. The LEAP4FNSSA lexicon based on 331 keywords associated with 12 concepts dealing with the food security domain is the result of 3 steps of work and brainstorming. The lexicon enables the capturing of research and innovation topics dealing with food security and conducted by African and European partners. This data paper presents the obtained lexicon and a summary of the method to build it.
... Our study relies on building a reference dataset by scraping YouTube over a large time range and on inferring knowledge from its lexical content with quantitative text-mining approaches. We learn terminology using lexical extraction, which is generally used to extract repeated segments [13,33,51] and named entities [40]. As we want to discover topics that may appear in the content, we need to learn an ontology using a low-resource technique such as clustering: we use k-means [37] and topic modeling [10]. ...
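A minimal sketch of the kind of low-resource clustering named here: k-means over TF-IDF vectors with scikit-learn. The documents and the number of clusters are placeholders, not the study's actual data.

```python
# Cluster short texts with TF-IDF + k-means (illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "whistleblower leaks classified surveillance documents",
    "government surveillance program exposed by insider",
    "new smartphone review battery life camera",
    "smartphone camera comparison and battery test",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)   # e.g. [0 0 1 1]: one cluster per topic
```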
Article
Full-text available
Social media is more and more dominant in everyday life for people around the world. YouTube content is a resource that may be useful, in social computational science, for understanding key questions about society. Using this resource, we performed web scraping to create a dataset of 644,575 video transcriptions concerning net activism and whistleblowing. We automatically performed linguistic feature extraction to capture a representation of each video using its title, description and transcription (downloaded metadata). The next step was to clean the dataset using automatic clustering with linguistic representation to identify unmatched videos and noisy keywords. Using these keywords to exclude videos, we finally obtained a dataset that was reduced by 95%, i.e., it contained 35,730 video transcriptions. Then, we again automatically clustered the videos using a lexical representation and split the dataset into subsets, leading to hundreds of clusters that we interpreted manually to identify a hierarchy of topics of interest concerning whistleblowing. We used the dataset to learn a lexical representation for a specific topic and to detect unknown whistleblowing videos for this topic; the accuracy of this detection is 57.4%. We also used the dataset to identify interesting context linguistic markers around the names of whistleblowers. From a given list of names, we automatically extracted all 5-gram word sequences from the dataset and identified interesting markers in the left and right contexts for each name by manual interpretation. The results of our study are the following: a dataset (raw and cleaned collections) concerning whistleblowing, a hierarchy of topics about whistleblowing, the automatic prediction of whistleblowing and the semi-automatic semantic analysis of markers around whistleblower names. This text mining analysis can be exploited for digital sociology and e-democracy studies.
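To make the 5-gram context extraction step concrete, here is a tiny sketch; the sentence and the name are placeholders, not the study's data or code.

```python
# Return every n-gram of `tokens` that contains the target name.
def ngram_contexts(tokens, name, n=5):
    grams = [tokens[i:i + n] for i in range(len(tokens) - n + 1)]
    return [" ".join(g) for g in grams if name in g]

text = "the journalist interviewed Snowden about the surveillance leaks".split()
print(ngram_contexts(text, "Snowden"))
# ['the journalist interviewed Snowden about', 'journalist interviewed Snowden about the', ...]
```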
... A unique approach for automated thesaurus creation has been presented by Bourigault and Jacquemin. It is based on the employment of two tools in combination: (1) a term extraction tool that extracts term candidates from tagged corpora using a shallow grammar of noun phrases and (2) a term clustering tool that clusters syntactic variations (insertions) [17]. Newman et al. proposed a word segmentation model based on the Dirichlet process (DP), in which multiword segments are either retrieved from a cache or newly generated [18]. ...
Article
Full-text available
Languages are not uniform, and certain words are used differently by speakers of different languages, more or less often, or with distinct meanings. In both linguistics and natural language processing (NLP), classifications that group together verbs sharing similar syntactic and semantic features are of great interest. In the modern era of science and technology, NLP technology is developing rapidly. However, the interpretation of index lines still has to be carried out manually. This is time-consuming, especially since the arrival of big data: the number of corpora has increased rapidly, and corpora with hundreds of millions of words are now commonplace. The quantity of text generated every day is growing intensely, and the word index based on search words can run to tens of thousands of lines, so it is very difficult to analyze index lines manually. Automatic lexical knowledge acquisition is essential for a variety of NLP activities. Knowledge about verbs is particularly critical, as verbs are the major source of relational information in a sentence. For this reason, this study attempts to automatically identify and extract English verbs by index line clustering. Each index line can be regarded as a microtext, and automatic clustering of these microtexts is used to achieve the automatic identification and extraction of English verb forms. This study first focuses on the index clustering algorithms, including the C-means clustering algorithm and the fuzzy C-means clustering algorithm, then describes in detail the automatic recognition and extraction process of English verbs based on index line clustering, and finally creates a verification set and completes the index line clustering of English verbs. The effect of the index line clustering algorithm is analyzed from two aspects: automatic recognition of English verb types and recall rate. At the same time, the verbs are selected to analyze their types and judge the probability of each type. The experimental results show that the average recognition rate of English verbs in the manual classification is 91.01%, and the average accuracy of automatic recognition and extraction of English verb patterns based on index line clustering is 95.99%.
... Moreover, such terms are often defined through examples rather than discursively in different grammars. Our approach is minimalist, so we also do not produce a fully POS-tagged corpus as input to term extraction, unlike some other approaches (Bourigault and Jacquemin, 1999). ...
Conference Paper
Full-text available
The present work is aimed at (1) developing a search machine adapted to the large DReaM corpus of linguistic descriptive literature and (2) getting insights into how a data-driven ontology of linguistic terminology might be built. Starting from close to 20,000 text documents from the literature of language descriptions, from documents either born digitally or scanned and OCR’d, we extract keywords and pass them through a pruning pipeline where mainly keywords that can be considered as belonging to linguistic terminology survive. Subsequently we quantify relations among those terms using Normalized Pointwise Mutual Information (NPMI) and use the resulting measures, in conjunction with the Google Page Rank (GPR), to build networks of linguistic terms.
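For reference, NPMI as named in this abstract is commonly defined as follows; this is a standard formulation, and the exact variant used by the authors may differ.

```latex
% Normalized Pointwise Mutual Information between terms x and y
\[
\operatorname{PMI}(x,y) = \log \frac{p(x,y)}{p(x)\,p(y)},
\qquad
\operatorname{NPMI}(x,y) = \frac{\operatorname{PMI}(x,y)}{-\log p(x,y)} \in [-1, 1]
\]
% NPMI is 1 when the two terms always co-occur and 0 when they are independent.
```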
... The synonym layer addresses the acquisition of semantic term variants in and between languages. It is either based on sets, such as WordNet synsets [19] (after sense disambiguation), on clustering techniques [20][21][22][23], or on other similar methods, including Web-based knowledge acquisition. ...
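As a small illustration of the synset-based option mentioned here, WordNet synsets can be queried through NLTK; this assumes the WordNet data has been downloaded, and the query word is a placeholder.

```python
# List WordNet synsets and their lemmas for a query word.
# Requires: nltk.download("wordnet") beforehand.
from nltk.corpus import wordnet as wn

for synset in wn.synsets("thesaurus"):
    print(synset.name(), synset.lemma_names())
```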
Chapter
Full-text available
Ontologies constitute an exciting model for representing a domain of interest, since they enable information-sharing and reuse. Existing inference machines can also use them to reason about various contexts. However, ontology construction is a time-consuming and challenging task. The ontology learning field answers this problem by providing automatic or semi-automatic support to extract knowledge from various sources, such as databases and structured and unstructured documents. This paper reviews the ontology learning process from unstructured text and proposes a bottom-up approach to building legal domain-specific ontology from Arabic texts. In this work, the learning process is based on Natural Language Processing (NLP) techniques and includes three main tasks: corpus study, term acquisition, and conceptualization. Corpus study enriches the original corpus with valuable linguistic information. Term acquisition selects sequences of tagged lemmas as potential term candidates, and conceptualization derives concepts and their relationships from the extracted terms. We used the NooJ platform to implement the required linguistic resources for each task. Further, we developed a Java module to enrich the ontology vocabulary from the Arabic WordNet (AWN) project. The obtained results were essential but incomplete. The legal expert revised them manually, and then they were used to refine and expand a domain ontology for a Moroccan Legal Information Retrieval System (LIRS).
Article
Full-text available
Although terminology is a branch of linguistics with a long history, a number of terminological systems have not been thoroughly analyzed. One of the areas that falls into this category is the terminology of humanitarian subjects because the way their terminology is formed differs from the term formation of STEM disciplines. Sublanguages of humanitarian disciplines quite often borrow general language words, which can be explained by the fact that the area of their studies is related to general rules of society functioning. In the course of transfer from the general language to domain-specific language, the semantic and/or morphological structure of words might undergo modifications. This article analyses the methods of formation of the English terminology of hermeneutics. Hermeneutics is a theory and methodology of text interpretation. One of the distinguishing features of its sublanguage is the fact that it is formed with the use of a considerable number of general language words that are used by text interpreters in a specialized meaning. This paper presents the analysis of the semantic transformations of some of these words after they became part of the sublanguage of hermeneutics.
Book
Full-text available
Preface. 1. Introduction. 2. Semantic Extraction. 3. Sextant. 4. Evaluation. 5. Applications. 6. Conclusion. 1. Preprocessors. 2. Webster Stopword List. 3. Similarity List. 4. Semantic Clustering. 5. Automatic Thesaurus Generation. 6. Corpora Treated. Index.
Conference Paper
Full-text available
A system for the automatic production of controlled index terms is presented using linguistically-motivated techniques. This includes a finite-state part of speech tagger, a derivational morphological processor for analysis and generation, and a unification-based shallow-level parser using transformational rules over syntactic patterns. The contribution of this research is the successful combination of parsing over a seed term list coupled with derivational morphology to achieve greater coverage of multi-word terms for indexing and retrieval. Final results are evaluated for precision and recall, and implications for indexing and retrieval are discussed.
Article
Full-text available
This paper identifies some linguistic properties of technical terminology, and uses them to formulate an algorithm for identifying technical terms in running text. The grammatical properties discussed are preferred phrase structures: technical terms consist mostly of noun phrases containing adjectives, nouns, and occasionally prepositions; rarely do terms contain verbs, adverbs, or conjunctions. The discourse properties are patterns of repetition that distinguish noun phrases that are technical terms, especially those multi-word phrases that constitute a substantial majority of all technical vocabulary, from other types of noun phrase. The paper presents a terminology identification algorithm that is motivated by these linguistic properties. An implementation of the algorithm is described; it recovers a high proportion of the technical terms in a text, and a high proportion of the recovered strings are valid technical terms. The algorithm proves to be effective regardless of the domain of the text to which it is applied.
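The abstract's combination of a preferred phrase-structure filter with a repetition requirement can be sketched as follows. The POS pattern below is one common formalization of such a filter, not necessarily the article's exact algorithm, and the input phrases are placeholders.

```python
# Keep phrases whose POS sequence is (ADJ|NOUN)+ NOUN, optionally with one
# preposition (ADP) inside, and that occur at least twice (repetition).
import re
from collections import Counter

# candidate phrases with their POS sequences; in practice these come from a tagger
phrases = [
    ("central processing unit", "ADJ NOUN NOUN"),
    ("degrees of freedom", "NOUN ADP NOUN"),
    ("central processing unit", "ADJ NOUN NOUN"),
    ("is very fast", "VERB ADV ADJ"),
]

PATTERN = re.compile(r"^((ADJ|NOUN) )+NOUN$|^((ADJ|NOUN) )*NOUN ADP ((ADJ|NOUN) )*NOUN$")

counts = Counter(p for p, tags in phrases if PATTERN.match(tags))
terms = [p for p, c in counts.items() if c >= 2]   # require repetition
print(terms)   # ['central processing unit']
```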
Article
The paper describes research designed to improve automatic pre-coordinate term indexing by applying powerful general-purpose language analysis techniques to identify term sources in requests, and to generate variant expressions of the concepts involved for document text searching.
Article
The aim of automatic indexing is to achieve a compact representation of a document suitable for retrieval. FASIT (Fully Automatic Syntactically based Indexing of Text) identifies content-bearing textual units without a full parse and, without using semantic criteria, groups these units into quasi-synonymous sets. Tested on a database of 250 documents and 22 queries, FASIT performed better than both thesaurus-based and stem-based indexing systems. Retrievals indicate that the basic idea of FASIT, namely that significant terms in the text can be identified through syntactic patterns, is valid and that FASIT deserves serious consideration as an advance over stem-based systems.
Article
We present several unsupervised statistical models for the prepositional phrase attachment task that approach the accuracy of the best supervised methods for this task. Our unsupervised approach uses a heuristic based on attachment proximity and trains from raw text that is annotated with only part-of-speech tags and morphological base forms, as opposed to attachment information. It is therefore less resource-intensive and more portable than previous corpus-based algorithms proposed for this task. We present results for prepositional phrase attachment in both English and Spanish.
A knowledge-poor technique for knowledge extraction from large corpora
  • Gregory Grefenstette
Gregory Grefenstette. 1992. A knowledge-poor technique for knowledge extraction from large corpora. In Proceedings, 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '92), Copenhagen.