Table 1 - uploaded by Efstathios Stamatatos
Starting dataset for summary plagiarism task

Source publication
Article
In this paper, we describe an approach to create a summary obfuscation corpus for the task of plagiarism detection. Our method is based on information from the Document Understanding Conferences related to years 2001 and 2006, for the English language. Overall, an unattributed summary used within someone else’s document is considered a kind of plag...
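A minimal sketch of the underlying idea, an unattributed summary embedded in someone else's document constituting a plagiarism case: the code below pairs a DUC summary with an unrelated host document and records the inserted character span. The sentence-level insertion point and the span annotation format are assumptions for illustration only, not the paper's actual pipeline.

```python
import random

def embed_summary(host_document: str, summary: str, seed: int | None = None):
    """Insert an unattributed summary between two sentences of an unrelated host
    document; return the suspicious document and the character span of the passage."""
    rng = random.Random(seed)
    sentences = host_document.split(". ")
    cut = rng.randrange(1, max(2, len(sentences)))  # insertion point between sentences
    prefix = ". ".join(sentences[:cut]) + ". "
    suffix = ". ".join(sentences[cut:])
    suspicious = prefix + summary + " " + suffix
    span = (len(prefix), len(prefix) + len(summary))  # annotation of the plagiarized span
    return suspicious, span
```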

Context in source publication

Context 1
... datasets have similar topics but do not deal with the same news. Table 1 shows the number of documents selected as the starting dataset. The news articles considered as original documents were selected with a length of at least 400 words. ...
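Purely as an illustration of the 400-word selection criterion described above (the directory layout and whitespace tokenization are assumptions, not the corpus tooling):

```python
from pathlib import Path

MIN_WORDS = 400  # minimum length for a news item to be kept as an original document

def select_source_documents(news_dir: str) -> list[Path]:
    """Keep only the news files whose whitespace-tokenized length is at least MIN_WORDS."""
    selected = []
    for path in sorted(Path(news_dir).glob("*.txt")):
        word_count = len(path.read_text(encoding="utf-8", errors="ignore").split())
        if word_count >= MIN_WORDS:
            selected.append(path)
    return selected

if __name__ == "__main__":
    docs = select_source_documents("duc_news")  # hypothetical directory of DUC news articles
    print(f"{len(docs)} documents meet the {MIN_WORDS}-word minimum")
```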

Similar publications

Conference Paper
In this work we compare a total of 9 different tools for the detection of source code plagiarism. We evaluated the plagiarism or copy detection tools CPD, JPlag, Sherlock, Marble, Moss, Plaggie and SIM and two baselines, one based on the Unix tool diff and one based on the difflib module from the Python Standard Library. We provide visualizations o...
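The difflib baseline mentioned in this abstract can be approximated with the standard library's SequenceMatcher; the sketch below is only a generic token-level similarity baseline, not the exact configuration evaluated in the paper.

```python
import difflib

def similarity(code_a: str, code_b: str) -> float:
    """Return a 0..1 similarity score between two source files, based on
    difflib.SequenceMatcher over their whitespace token streams."""
    return difflib.SequenceMatcher(None, code_a.split(), code_b.split()).ratio()

if __name__ == "__main__":
    a = "def add(x, y):\n    return x + y\n"
    b = "def add(a, b):\n    return a + b\n"
    score = similarity(a, b)
    # The 0.8 threshold is arbitrary and only illustrates how such a baseline flags pairs.
    print(f"similarity = {score:.2f}", "-> suspicious" if score > 0.8 else "")
```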

Citations

... In recent years, plagiarism has become a critical issue that attracts a lot of attention in academic and educational communities [1,2,3,4]. With easy access to content from online sources such as digital libraries and websites, some students plagiarize their assignments by copying and modifying texts obtained from those sources without proper acknowledgment. ...
Chapter
One problem in building a Thai plagiarism corpus is the unavailability of a corpus with real examples of plagiarized texts. To address this, we present the design and construction of a new Thai plagiarism corpus, called TPLAC-2019, for evaluating plagiarism detection algorithms for Thai. The corpus is created with two methods: 1) a simulated plagiarism method, and 2) an artificial plagiarism method. For the simulated plagiarism method, we provide a Thai plagiarism tagging tool called PlaTool and a Thai plagiarism guideline that assist human annotators in plagiarizing text passages. For the artificial plagiarism method, plagiarized documents are generated automatically by a machine. In addition, we propose a new method for automatically creating plagiarized text passages within the artificial plagiarism method, with the objective of producing passages that resemble human language. To evaluate the quality of machine-generated Thai plagiarized text passages, we prepared test sets generated by both the baseline and the proposed methods, and we set up experiments comparing the readability of the texts in plagiarized documents produced by the two methods. The experimental results show that the proposed method improves the readability of the generated texts by up to 40%.
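Artificial plagiarism methods of this kind usually start from automatic obfuscation of a copied passage; the sketch below uses local word shuffling and random deletion purely as a generic illustration, and is neither PlaTool nor the readability-preserving method proposed in the chapter.

```python
import random

def obfuscate(passage: str, shuffle_window: int = 3, delete_prob: float = 0.05,
              seed: int | None = None) -> str:
    """Produce an artificially 'plagiarized' version of a passage by shuffling
    words inside small local windows and randomly dropping a few words."""
    rng = random.Random(seed)
    words = passage.split()
    shuffled = []
    for i in range(0, len(words), shuffle_window):
        window = words[i:i + shuffle_window]
        rng.shuffle(window)          # local shuffle keeps edits near their original position
        shuffled.extend(window)
    kept = [w for w in shuffled if rng.random() >= delete_prob]  # random deletion
    return " ".join(kept)

if __name__ == "__main__":
    text = "Plagiarized documents are generated automatically by applying small random edits."
    print(obfuscate(text, seed=0))
```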
... Language models (LMs) are used for solving tasks in many different fields. They are usually incorporated into systems aimed at facilitating different modes and types of cognitive infocommunications, e.g. machine translation [1], automatic speech recognition [2], data compression [3], information retrieval [4], spell checking [5], plagiarism detection [6], diagnostics in medicine [7], etc. One of the most important roles of these models is within systems based on speech technologies and used as assistive tools. ...
Article
When training language models (especially for highly inflective languages), some applications require word clustering in order to mitigate the problem of insufficient training data or storage space. The goal of word clustering is to group words that can be well represented by a single class in the sense of probabilities of appearing in different contexts. This paper presents comparative results obtained by using different approaches to word clustering when training class N-gram models for Serbian, as well as models based on recurrent neural networks. One approach is unsupervised word clustering based on an optimized version of Brown's algorithm, which relies on bigram statistics. The other approach is based on morphology, and it requires expert knowledge and language resources. Four different types of textual corpora, describing different functional styles, were used in the experiments. The language models were evaluated by both perplexity and word error rate. The results show a notable advantage of introducing expert knowledge into the word clustering process.
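To make the class N-gram idea concrete, here is a toy sketch that evaluates a class-based bigram model by perplexity, factoring each word probability through its class; the add-alpha smoothing and the miniature corpus are assumptions made only for this example, not the models trained in the paper.

```python
import math
from collections import Counter

def class_bigram_perplexity(sentences, word_to_class, alpha=0.1):
    """Perplexity of a class-based bigram model,
    P(w_i | w_{i-1}) = P(c_i | c_{i-1}) * P(w_i | c_i),
    with add-alpha smoothing (an assumption made for this sketch)."""
    n_classes = len(set(word_to_class.values()))
    n_words = len(word_to_class)
    class_bigram = Counter()   # (c_prev, c_cur) counts
    class_history = Counter()  # counts of c_prev used as a bigram history
    class_count = Counter()    # counts of each class over all tokens
    word_in_class = Counter()  # (c, w) counts
    for sent in sentences:
        cls = [word_to_class[w] for w in sent]
        for i, w in enumerate(sent):
            class_count[cls[i]] += 1
            word_in_class[(cls[i], w)] += 1
            if i + 1 < len(sent):
                class_history[cls[i]] += 1
                class_bigram[(cls[i], cls[i + 1])] += 1
    log_prob, n_tokens = 0.0, 0
    for sent in sentences:
        cls = [word_to_class[w] for w in sent]
        for i in range(1, len(sent)):
            p_class = (class_bigram[(cls[i - 1], cls[i])] + alpha) / (class_history[cls[i - 1]] + alpha * n_classes)
            p_word = (word_in_class[(cls[i], sent[i])] + alpha) / (class_count[cls[i]] + alpha * n_words)
            log_prob += math.log(p_class * p_word)
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)

if __name__ == "__main__":
    # Toy corpus; a real evaluation would use held-out text rather than the training data.
    corpus = [["the", "dog", "runs"], ["the", "cat", "runs"], ["a", "dog", "sleeps"]]
    classes = {"the": "DET", "a": "DET", "dog": "NOUN", "cat": "NOUN",
               "runs": "VERB", "sleeps": "VERB"}
    print(f"perplexity = {class_bigram_perplexity(corpus, classes):.2f}")
```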