Table 1 - uploaded by Efstathios Stamatatos
Starting dataset for summary plagiarism task

Source publication
Article
In this paper, we describe an approach to create a summary obfuscation corpus for the task of plagiarism detection. Our method is based on information from the Document Understanding Conferences related to years 2001 and 2006, for the English language. Overall, an unattributed summary used within someone else’s document is considered a kind of plag...
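A minimal sketch of the underlying idea, an unattributed summary embedded in someone else's document constituting a plagiarism case: the code below pairs a DUC summary with an unrelated host document and records the inserted character span. The sentence-level insertion point and the span annotation format are assumptions for illustration only, not the paper's actual pipeline.

```python
import random

def embed_summary(host_document: str, summary: str, seed: int | None = None):
    """Insert an unattributed summary between two sentences of an unrelated host
    document; return the suspicious document and the character span of the passage."""
    rng = random.Random(seed)
    sentences = host_document.split(". ")
    cut = rng.randrange(1, max(2, len(sentences)))  # insertion point between sentences
    prefix = ". ".join(sentences[:cut]) + ". "
    suffix = ". ".join(sentences[cut:])
    suspicious = prefix + summary + " " + suffix
    span = (len(prefix), len(prefix) + len(summary))  # annotation of the plagiarized span
    return suspicious, span
```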

Context in source publication

Context 1
... datasets have similar topics but do not deal with the same news. Table 1 shows the number of documents selected as the starting dataset. The news articles considered as original documents were selected with a length of at least 400 words. ...
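Purely as an illustration of the 400-word selection criterion described above (the directory layout and whitespace tokenization are assumptions, not the corpus tooling):

```python
from pathlib import Path

MIN_WORDS = 400  # minimum length for a news item to be kept as an original document

def select_source_documents(news_dir: str) -> list[Path]:
    """Keep only the news files whose whitespace-tokenized length is at least MIN_WORDS."""
    selected = []
    for path in sorted(Path(news_dir).glob("*.txt")):
        word_count = len(path.read_text(encoding="utf-8", errors="ignore").split())
        if word_count >= MIN_WORDS:
            selected.append(path)
    return selected

if __name__ == "__main__":
    docs = select_source_documents("duc_news")  # hypothetical directory of DUC news articles
    print(f"{len(docs)} documents meet the {MIN_WORDS}-word minimum")
```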

Similar publications

Conference Paper
In this work we compare a total of 9 different tools for the detection of source code plagiarism. We evaluated the plagiarism or copy detection tools CPD, JPlag, Sherlock, Marble, Moss, Plaggie and SIM and two baselines, one based on the Unix tool diff and one based on the difflib module from the Python Standard Library. We provide visualizations o...
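The difflib baseline mentioned in this abstract can be approximated with the standard library's SequenceMatcher; the sketch below is only a generic token-level similarity baseline, not the exact configuration evaluated in the paper.

```python
import difflib

def similarity(code_a: str, code_b: str) -> float:
    """Return a 0..1 similarity score between two source files, based on
    difflib.SequenceMatcher over their whitespace token streams."""
    return difflib.SequenceMatcher(None, code_a.split(), code_b.split()).ratio()

if __name__ == "__main__":
    a = "def add(x, y):\n    return x + y\n"
    b = "def add(a, b):\n    return a + b\n"
    score = similarity(a, b)
    # The 0.8 threshold is arbitrary and only illustrates how such a baseline flags pairs.
    print(f"similarity = {score:.2f}", "-> suspicious" if score > 0.8 else "")
```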

Citations

... In recent years, plagiarism has become a critical issue that attracts a lot of attention in academic and educational communities [1,2,3,4]. With easy access to content from online sources such as digital libraries and websites, some students plagiarize their assignments by copying and modifying texts obtained from those sources without proper acknowledgment. ...
Chapter
One problem in building a Thai plagiarism corpus is the unavailability of a corpus with real examples of plagiarized texts. To address this, we present the design and construction of a new Thai plagiarism corpus, called TPLAC-2019, for evaluating plagiarism detection algorithms for Thai. The corpus is created with two methods: 1) a simulated plagiarism method, and 2) an artificial plagiarism method. For the simulated plagiarism method, we provide a Thai plagiarism tagging tool called PlaTool and a Thai plagiarism guideline that assist human annotators in plagiarizing text passages. For the artificial plagiarism method, plagiarized documents are generated automatically by a machine. In addition, we propose a new method for automatically creating plagiarized text passages within the artificial plagiarism method, with the objective of producing passages that resemble human language. To evaluate the quality of machine-generated Thai plagiarized text passages, we prepared test sets generated by both the baseline and the proposed methods, and we set up experiments comparing the readability of the texts in plagiarized documents produced by the two methods. The experimental results show that the proposed method improves the readability of the generated texts by up to 40%.
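Artificial plagiarism methods of this kind usually start from automatic obfuscation of a copied passage; the sketch below uses local word shuffling and random deletion purely as a generic illustration, and is neither PlaTool nor the readability-preserving method proposed in the chapter.

```python
import random

def obfuscate(passage: str, shuffle_window: int = 3, delete_prob: float = 0.05,
              seed: int | None = None) -> str:
    """Produce an artificially 'plagiarized' version of a passage by shuffling
    words inside small local windows and randomly dropping a few words."""
    rng = random.Random(seed)
    words = passage.split()
    shuffled = []
    for i in range(0, len(words), shuffle_window):
        window = words[i:i + shuffle_window]
        rng.shuffle(window)          # local shuffle keeps edits near their original position
        shuffled.extend(window)
    kept = [w for w in shuffled if rng.random() >= delete_prob]  # random deletion
    return " ".join(kept)

if __name__ == "__main__":
    text = "Plagiarized documents are generated automatically by applying small random edits."
    print(obfuscate(text, seed=0))
```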
... Language models (LMs) are used for solving tasks in many different fields. They are usually incorporated into systems aimed at facilitating different modes and types of cognitive infocommunications, e.g. machine translation [1], automatic speech recognition [2], data compression [3], information retrieval [4], spell checking [5], plagiarism detection [6], diagnostics in medicine [7], etc. One of the most important roles of these models is within systems based on speech technologies and used as assistive tools. ...
Article
When training language models (especially for highly inflective languages), some applications require word clustering in order to mitigate the problem of insufficient training data or storage space. The goal of word clustering is to group words that can be well represented by a single class in the sense of probabilities of appearing in different contexts. This paper presents comparative results obtained by using different approaches to word clustering when training class N-gram models for Serbian, as well as models based on recurrent neural networks. One approach is unsupervised word clustering based on an optimized version of Brown's algorithm, which relies on bigram statistics. The other approach is based on morphology, and it requires expert knowledge and language resources. Four different types of textual corpora, describing different functional styles, were used in the experiments. The language models were evaluated by both perplexity and word error rate. The results show a notable advantage of introducing expert knowledge into the word clustering process.
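To make the class N-gram idea concrete, here is a toy sketch that evaluates a class-based bigram model by perplexity, factoring each word probability through its class; the add-alpha smoothing and the miniature corpus are assumptions made only for this example, not the models trained in the paper.

```python
import math
from collections import Counter

def class_bigram_perplexity(sentences, word_to_class, alpha=0.1):
    """Perplexity of a class-based bigram model,
    P(w_i | w_{i-1}) = P(c_i | c_{i-1}) * P(w_i | c_i),
    with add-alpha smoothing (an assumption made for this sketch)."""
    n_classes = len(set(word_to_class.values()))
    n_words = len(word_to_class)
    class_bigram = Counter()   # (c_prev, c_cur) counts
    class_history = Counter()  # counts of c_prev used as a bigram history
    class_count = Counter()    # counts of each class over all tokens
    word_in_class = Counter()  # (c, w) counts
    for sent in sentences:
        cls = [word_to_class[w] for w in sent]
        for i, w in enumerate(sent):
            class_count[cls[i]] += 1
            word_in_class[(cls[i], w)] += 1
            if i + 1 < len(sent):
                class_history[cls[i]] += 1
                class_bigram[(cls[i], cls[i + 1])] += 1
    log_prob, n_tokens = 0.0, 0
    for sent in sentences:
        cls = [word_to_class[w] for w in sent]
        for i in range(1, len(sent)):
            p_class = (class_bigram[(cls[i - 1], cls[i])] + alpha) / (class_history[cls[i - 1]] + alpha * n_classes)
            p_word = (word_in_class[(cls[i], sent[i])] + alpha) / (class_count[cls[i]] + alpha * n_words)
            log_prob += math.log(p_class * p_word)
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)

if __name__ == "__main__":
    # Toy corpus; a real evaluation would use held-out text rather than the training data.
    corpus = [["the", "dog", "runs"], ["the", "cat", "runs"], ["a", "dog", "sleeps"]]
    classes = {"the": "DET", "a": "DET", "dog": "NOUN", "cat": "NOUN",
               "runs": "VERB", "sleeps": "VERB"}
    print(f"perplexity = {class_bigram_perplexity(corpus, classes):.2f}")
```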