Thesis

Design and Evaluation of an NLP-Pipeline Prototype for Discourse Similarity Detection

May 2021

Authors:

Andrey Shcherbakov

Arcada University of Applied Sciences

The objective of this work is to investigate how advances in Natural Language Processing (NLP), Dimensionality Reduction, and Density-Based Hierarchical Clustering can be combined to assist in the task of discourse similarity detection in text documents and to improve the ability to discover discourse propagation through a corpus of news articles. The work constructs and evaluates an NLP-pipeline prototype in which sentences are first converted to vector representations (embeddings) by a pretrained attention-based bidirectional transformer neural network. The high-dimensional sentence embeddings are then reduced in dimensionality by the Uniform Manifold Approximation and Projection algorithm and finally clustered by semantic similarity with Hierarchical Density-Based Spatial Clustering of Applications with Noise. The Design Science Research Methodology is adapted for the study. The research uses a custom-prepared collection of 780K sentences extracted from 24K topic-restricted news articles spanning a period of six years. The results show that, while some semantic similarity between sentences can be detected, more advanced neural network language models (which generate better sentence representations) should be used for the task of discourse similarity detection.
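
The three-stage shape of the pipeline described in the abstract (embed, reduce, cluster) can be illustrated with a minimal, standard-library-only sketch. It is not the thesis implementation: every stage below is a deliberately crude stand-in, and all function names, parameters, and example sentences are hypothetical. Bag-of-words counts replace the transformer embeddings, a highest-variance feature selection replaces Uniform Manifold Approximation and Projection, and a similarity-threshold grouping with a noise label replaces Hierarchical Density-Based Spatial Clustering of Applications with Noise.

```python
import math
from collections import Counter

# Hypothetical example sentences (not taken from the thesis corpus).
SENTENCES = [
    "markets fall on inflation fears",
    "markets fall on inflation worries",
    "team wins the cup final",
    "team wins the cup again",
    "rain expected tomorrow",
]

def embed(sentences):
    """Stage 1 stand-in: bag-of-words counts instead of transformer embeddings."""
    vocab = sorted({tok for s in sentences for tok in s.lower().split()})
    vectors = []
    for s in sentences:
        counts = Counter(s.lower().split())
        vectors.append([float(counts[tok]) for tok in vocab])
    return vectors

def reduce_dims(vectors, n_components):
    """Stage 2 stand-in: keep the highest-variance dimensions instead of UMAP."""
    if not vectors:
        return []
    n = len(vectors)

    def variance(d):
        col = [v[d] for v in vectors]
        mean = sum(col) / n
        return sum((x - mean) ** 2 for x in col) / n

    keep = sorted(range(len(vectors[0])), key=variance, reverse=True)[:n_components]
    return [[v[d] for d in keep] for v in vectors]

def cosine(a, b):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def cluster(vectors, min_similarity=0.8, min_cluster_size=2):
    """Stage 3 stand-in: group points linked by cosine similarity.

    Groups smaller than min_cluster_size are labelled -1 (noise), mirroring
    the noise-label convention of density-based clustering. Unlike HDBSCAN,
    this grouping is neither hierarchical nor density-adaptive.
    """
    n = len(vectors)
    labels = [-1] * n
    visited = [False] * n
    next_label = 0
    for start in range(n):
        if visited[start]:
            continue
        component, stack = [], [start]
        visited[start] = True
        while stack:
            i = stack.pop()
            component.append(i)
            for j in range(n):
                if not visited[j] and cosine(vectors[i], vectors[j]) >= min_similarity:
                    visited[j] = True
                    stack.append(j)
        if len(component) >= min_cluster_size:
            for i in component:
                labels[i] = next_label
            next_label += 1
    return labels

def run_pipeline(sentences, n_components=8):
    """Chain the three stages: embed, then reduce, then cluster."""
    return cluster(reduce_dims(embed(sentences), n_components))

if __name__ == "__main__":
    for label, sentence in zip(run_pipeline(SENTENCES), SENTENCES):
        print(label, sentence)
```

With these toy sentences the two finance sentences share one label, the two sports sentences share another, and the weather sentence is marked as noise (-1). In a setup closer to the one the abstract describes, each stand-in would be swapped for the corresponding real component while the chaining in `run_pipeline` stays the same.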

Conference Paper

An annotation type system for a data-driven NLP pipeline

January 2007

We introduce an annotation type system for a data-driven NLP core system. The specifications cover formal document structure and document meta information, as well as the linguistic levels of morphology, syntax and semantics. The type system is embedded in the framework of the Unstructured Information Management Architecture (UIMA).

Preprint

What's so special about BERT's layers? A closer look at the NLP pipeline in monolingual and multilin...

April 2020

Experiments with transfer learning on pre-trained language models such as BERT have shown that the layers of these models resemble the classical NLP pipeline, with progressively more complex tasks being concentrated in later layers of the network. We investigate to what extent these results also hold for a language other than English. For this we probe a Dutch BERT-based model and the multilingual BERT model for Dutch NLP tasks. In addition, by considering the task of part-of-speech tagging in more detail, we show that also within a given task, information is spread over different parts of the network and the pipeline might not be as neat as it seems. Each layer has different specialisations and it is therefore useful to combine information from different layers for best results, instead of selecting a single layer based on the best overall performance.

Chapter

Joining the blocks together – an NLP pipeline for CALL development

December 2019

[...]
Monica Ward

The theme selected for the 2019 EuroCALL conference held in Louvain-la-Neuve was ‘CALL and complexity’. As languages are known to be intrinsically and linguistically complex, as are the many determinants of learning (additional) languages, complexity is viewed as a challenge to be embraced collectively. The 2019 conference allowed us to pay tribute to providers of CALL solutions and to recognize the complexity of their task. We hope you will enjoy reading this volume as it offers a rich glimpse into the numerous debates that took place during EuroCALL 2019. We look forward to continuing those debates and discussions with you at the next EuroCALL conferences!

Article

Mapping SNOMED CT Codes to Semi-Structured Texts via an NLP Pipeline

June 2022

In the project presented here, we used NLP tools for annotating German medical training documents with SNOMED CT codes. The following research question was addressed: Is it possible to automate the annotation of training documents with an NLP pipeline especially designed for this task but requiring translation into English? The goal of our stakeholder, an institution responsible for the continuing education of physicians, was to facilitate the switch between different medical training programs by coding the same requirement with the same SNOMED CT code, even if the wording is different. We first describe how we chose the concrete NLP tools, after which the concrete steps for implementing our prototype are outlined: the NLP pipeline construction, the implementation, and the validation. We infer three important lessons from our results: (i) self-supervision is no free lunch and should be based on a sophisticated task, (ii) the translation via DeepL can be too context-dependent for a peculiar use case, and (iii) ontology extraction can increase efficiency as well as accuracy.

Last Updated: 05 Jun 2024