Thesis

Design and Evaluation of an NLP-Pipeline Prototype for Discourse Similarity Detection


Abstract

The objective of this work is to investigate how advances in Natural Language Processing (NLP), dimensionality reduction, and density-based hierarchical clustering can be combined to assist in detecting discourse similarity in text documents and to improve the ability to discover discourse propagation through a corpus of news articles. The work constructs and evaluates an NLP-pipeline prototype in which sentences are first embedded as vectors by a pretrained attention-based bidirectional transformer neural network. The high-dimensional sentence embeddings are then reduced in dimensionality by the Uniform Manifold Approximation and Projection (UMAP) algorithm and finally clustered by semantic similarity with Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN). The Design Science Research Methodology is adapted for the study. The research utilises a custom-prepared collection of 780K sentences extracted from 24K topic-restricted news articles from a period of six years. The results show that, while some semantic similarity between sentences can be detected, more advanced neural network language models, which generate better sentence representations, should be utilised for the task of discourse similarity detection.
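
The three stages named in the abstract (sentence embedding, dimensionality reduction, density-based clustering) can be illustrated with a minimal Python sketch. It assumes the third-party packages `sentence-transformers`, `umap-learn`, and `hdbscan`. The model name `all-MiniLM-L6-v2`, every hyperparameter value, and the function names `embed_reduce_cluster` and `group_by_cluster` are illustrative assumptions, not the configuration or code used in the thesis.

```python
from collections import defaultdict


def embed_reduce_cluster(sentences, model_name="all-MiniLM-L6-v2"):
    """Return one cluster label per sentence (-1 marks noise).

    The model name and all hyperparameters below are assumptions chosen
    for illustration; the thesis abstract does not specify them.
    """
    # Heavy third-party packages are imported lazily so that the grouping
    # helper below stays usable without them installed.
    from sentence_transformers import SentenceTransformer
    import umap
    import hdbscan

    # Stage 1: one fixed-size vector per sentence from a pretrained transformer.
    embeddings = SentenceTransformer(model_name).encode(sentences)

    # Stage 2: project the high-dimensional embeddings to a few dimensions.
    reduced = umap.UMAP(
        n_neighbors=15, n_components=5, metric="cosine", random_state=42
    ).fit_transform(embeddings)

    # Stage 3: density-based clustering; points in sparse regions get label -1.
    labels = hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(reduced)
    return [int(label) for label in labels]


def group_by_cluster(sentences, labels, noise_label=-1):
    """Collect sentences per cluster label, skipping those labelled as noise."""
    clusters = defaultdict(list)
    for sentence, label in zip(sentences, labels):
        if label != noise_label:
            clusters[label].append(sentence)
    return dict(clusters)
```

With a corpus of a few hundred sentences or more, `group_by_cluster(sentences, embed_reduce_cluster(sentences))` would return a mapping from cluster label to the sentences grouped under it, which can then be inspected by hand for shared discourse.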
