Article
PDF available

Whose article is it anyway? – Detecting authorship distribution in Wikipedia articles over time with WIKIGINI


Abstract

In this work, we present a novel approach to detecting authorship of words in Wikipedia, which outperforms the baseline method in terms of accuracy. This is achieved by reducing the number of necessary word-based text-to-text comparisons, which are the most fallible steps in the process. To provide an aggregated measure of concentration, we calculate a Gini coefficient for each revision of an article based on our word-author assignments. As a motivation for calculating this measure, we argue that the concentration of words among just a few authors can be an indicator of a lack of quality and neutrality in an article. The development of the coefficient over time is visualized and provided online as an easily accessible and useful tool for investigating how the content of an article evolved. We present examples where the Gini curve gives useful insights into differences between articles and may help to spot crucial events in the past evolution of an article.
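As a rough illustration of the per-revision measure described in the abstract, the following sketch computes a Gini coefficient over a words-per-author distribution. It is a minimal sketch, not the authors' implementation, and it assumes the word-to-author assignment has already been produced by an authorship-detection step.

    # Sketch: Gini coefficient of authorship concentration for one revision.
    # `word_authors` maps each word of the revision to its detected original
    # author (assumed output of an authorship-detection step).
    from collections import Counter

    def gini(word_authors):
        """Return the Gini coefficient of the words-per-author distribution."""
        counts = sorted(Counter(word_authors).values())  # words owned by each author
        n = len(counts)
        total = sum(counts)
        if n == 0 or total == 0:
            return 0.0
        # Standard formula over the sorted distribution:
        # G = (2 * sum_i i*x_i) / (n * sum_i x_i) - (n + 1) / n
        weighted = sum(i * x for i, x in enumerate(counts, start=1))
        return (2.0 * weighted) / (n * total) - (n + 1.0) / n

    # Example: one author wrote most of the revision's words.
    print(gini(["A"] * 80 + ["B"] * 15 + ["C"] * 5))   # concentrated authorship -> 0.5
    print(gini(["A", "B", "C", "D"] * 25))             # equal shares -> 0.0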
... The Wikitrust approach by Adler et al. [1,2] detects provenance by searching for longest matches for all word sequences of the current revision in selected preceding revisions and their previously existing (but now deleted) word-chunks. As De Alfaro and Shavlovsky [6] later argue, it is not well-suited for the task of authorship or provenance detection, as the process depends on several factors of its "computationally involved" editor reputation calculation, a suspicion supported by an evaluation on a small sample of authorship data generated with Wikitrust, yielding only around 50% correctly attributed authors for tokens [12]. ...
... The technique provides a visual "story" of an article's writing history and has been reproduced in several community projects (footnote 12). From the description in Section 2 it becomes clear that similar future analyses and tools could skip the tedious and expensive process of precomputing the needed data, be it with self-built or reused text comparison approaches, by simply extracting provenance and calculating survival from the explicit markers in our dataset. ...
... A recurring theme in Wikipedia-related research is the measurement and characterization of the conflict or controversy specific content is subject to. (Footnote 12: Cf., e.g., http://fogonwater.com/blog/2015/11/wikipedia-edit-history-stratigraphy or http://iphylo.blogspot. ...) ...
Article
Full-text available
We present a dataset that contains every instance of all tokens (~ words) ever written in undeleted, non-redirect English Wikipedia articles until October 2016, 13,545,349,787 instances in total. Each token is annotated with (i) the article revision it was originally created in, and (ii) lists of all the revisions in which the token was ever deleted and (potentially) re-added and re-deleted from its article, enabling a complete and straightforward tracking of its history. This data would be exceedingly hard for an average potential user to create, as it is (i) very expensive to compute and (ii) accurately tracking the history of each token in revisioned documents is a non-trivial task. Adapting a state-of-the-art algorithm, we have produced a dataset that allows a range of analyses and metrics, already popular in research and beyond, to be generated on complete-Wikipedia scale, ensuring quality and allowing researchers to forego expensive text-comparison computation, which so far has hindered scalable usage. We show how this data enables, on token level, computation of provenance, measurement of content survival over time, very detailed conflict metrics, and fine-grained interactions of editors such as partial reverts and re-additions, in the process gaining several novel insights.
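To illustrate how such per-token annotations could be used, the sketch below derives token presence and content survival from an origin revision plus deletion/re-addition lists. The field names (origin_rev, in_revs, out_revs) are illustrative assumptions, not the dataset's actual schema.

    # Sketch: reading provenance and survival off per-token annotations.
    # Field names are illustrative; the real dataset layout may differ.

    def is_present(token, rev_id):
        """A token is present at rev_id if it was (re-)added at or before rev_id
        and not deleted again afterwards."""
        last_in = max((r for r in [token["origin_rev"]] + token["in_revs"] if r <= rev_id),
                      default=None)
        if last_in is None:
            return False
        last_out = max((r for r in token["out_revs"] if r <= rev_id), default=None)
        return last_out is None or last_out < last_in

    def survival(tokens, created_in_rev, alive_at_rev):
        """Fraction of tokens created in `created_in_rev` still present at `alive_at_rev`."""
        created = [t for t in tokens if t["origin_rev"] == created_in_rev]
        if not created:
            return 0.0
        return sum(is_present(t, alive_at_rev) for t in created) / len(created)

    tokens = [
        {"origin_rev": 1, "in_revs": [], "out_revs": []},    # never touched
        {"origin_rev": 1, "in_revs": [4], "out_revs": [3]},  # deleted in rev 3, re-added in rev 4
        {"origin_rev": 1, "in_revs": [], "out_revs": [2]},   # deleted in rev 2, never restored
    ]
    print(survival(tokens, created_in_rev=1, alive_at_rev=5))  # 2/3 of the tokens survive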
... While line-level tracking of changes is a feature of many code revisioning systems, this level of attribution can prove insufficient in the case of a community that produces large amounts of collaboratively written natural language text. A word-level tracking that is proven to be trustworthy in its attributions and that can trace reintroductions and moves of chunks of text in an acceptable runtime for end-users can be very useful, especially on a platform like Wikipedia, as has been previously discussed [6,7]. While research shows that Wikipedians are motivated by the recognition by their peers that comes with authoring content [8], more practical needs also exist [6]. ...
... To reuse a Wikipedia article under the CC-BY-SA license, for example, might require listing the main authors of the article, who are not easily retrievable, as there exists no straightforward way in the MediaWiki software to show the authors of single pieces of text for a particular revision. Authorship tracking in articles can further raise awareness among editors and readers of editing dynamics, concentration of authorship [7], tracing back certain viewpoints, or generally understanding the evolution of an article. Recently, Wikimedia Deutschland e.V. introduced the "Article Monitor", aiming to assist users with these exact issues and making use of the results of a basic authorship algorithm [7] whose general concept we use as a foundation in this work. ...
... Authorship tracking in articles can further raise awareness among editors and readers of editing dynamics, concentration of authorship [7], tracing back certain viewpoints, or generally understanding the evolution of an article. Recently, Wikimedia Deutschland e.V. introduced the "Article Monitor", aiming to assist users with these exact issues and making use of the results of a basic authorship algorithm [7] whose general concept we use as a foundation in this work. The Wikipedia community has come up with a number of proposed solutions related to the word-level authorship attribution problem, which highlights the utility of such a solution for Wikipedians. ...
Conference Paper
Full-text available
Revisioned text content is present in numerous collaboration platforms on the Web, most notably wikis. Tracking the authorship of text tokens in such systems has many potential applications: identifying the main authors for licensing reasons or tracing collaborative writing patterns over time, to name a few. In this context, two main challenges arise. First, it is critical for such an authorship tracking system to be precise in its attributions, to be reliable for further processing. Second, it has to run efficiently even on very large datasets, such as Wikipedia. As a solution, we propose a graph-based model to represent revisioned content and an algorithm over this model that tackles both issues effectively. We describe the optimal implementation and design choices when tuning it to a wiki environment. We further present a gold standard of 240 tokens from English Wikipedia articles annotated with their origin. This gold standard was created manually and confirmed by multiple independent users of a crowdsourcing platform. It is the first gold standard of this kind and quality, and our solution achieves an average of 95% precision on this data set. We also perform a first-ever precision evaluation of the state-of-the-art algorithm for the task, exceeding it by over 10% on average. Our approach outperforms the state of the art in execution time by one order of magnitude, as we demonstrate on a sample of over 240 English Wikipedia articles. We argue that the increased size of an optional materialization of our results by about 10% compared to the baseline is a favorable trade-off, given the large advantage in runtime performance.
... Flöck and Rodchenko [10] presented a tree-model approach to establish content authorship and thereby measure users' contributions to a document. In their initial version, the model only considered paragraphs and sentences. ...
Conference Paper
Nowadays, several productivity platforms provide effective capabilities to edit collaboratively the content of a document. In educational settings, e-Learning approaches have taken advantage of this functionality to encourage students to join others to complete projects that include the writing of text documents. Although collaborative writing may foster interaction among students, the existing analytical metrics on these platforms are limited and can slow down the process of review by instructors in trying to determine the level of contribution of each student in the document. In this paper, we describe an analytic framework to measure and visualize the contribution in collaborative writing.
... The approach used in [3] assumes that if 50% of author X's sentence has been retained in the next revision, after being edited by author Y, the sentence still "belongs" to author X. In [6], the authors proposed using the Gini coefficient in every revision to measure the authorship distribution of the current revision. In [13], several visualization methods were proposed to identify malicious authorship in Wikipedia (e.g., vandalism) using temporal edit patterns, based on the assumption that an edit that renders a document version 90% smaller than the previous one is considered a mass deletion. ...
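As a small illustration of the mass-deletion heuristic cited above, the following sketch flags revisions whose size drops by at least 90% relative to the preceding one. Apart from that threshold, everything here (function name, input format, sample sizes) is assumed for illustration.

    # Sketch: flag revisions that shrink the article drastically, following the
    # cited heuristic (an edit leaving the document 90% smaller than the
    # previous version is treated as a mass deletion).

    def mass_deletions(revision_sizes, shrink_threshold=0.9):
        """Return indices of revisions whose size dropped by >= shrink_threshold
        relative to the preceding revision."""
        flagged = []
        for i in range(1, len(revision_sizes)):
            prev, curr = revision_sizes[i - 1], revision_sizes[i]
            if prev > 0 and (prev - curr) / prev >= shrink_threshold:
                flagged.append(i)
        return flagged

    sizes = [12000, 12400, 900, 12350, 12500]   # bytes per revision (illustrative)
    print(mass_deletions(sizes))                # [2] -> likely blanking/vandalism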
Conference Paper
Authorship contribution is often taken for granted. Internally, the contribution rate is usually known among all the authors of a given paper. However, this rate is hard for external parties to verify, as measuring authors' contributions is still not common practice and the way to measure them is unclear. In this paper, we propose a new blockchain-based framework to assess the contribution of all authors of any scientific paper. Our framework can be implemented by anyone who is directly or indirectly involved in the publication of the paper, such as a principal researcher, grant funder, research assistant or anyone from relevant external bodies.
Article
This paper examines authorship misconduct: practices such as gift, guest, honorary and ghost authorship (excluding plagiarism) that involve inappropriate attribution of authorship credits. Drawing on the existing literature, we describe the extent of authorship misconduct and why it presents a problem. We then construct a simple matching model of guest authorship to show how researchers can form teams (of two) where one researcher free-rides off the efforts of the other; at equilibrium, the latter is content for this free-riding to occur, rather than forming a different team involving no free-riding. We discuss how this model can be generalized to incorporate honorary and gift authorship, and why capturing ghost authorship may require significant changes to the modelling. While formal (game-theoretic) modelling of other aspects of research misconduct is prevalent in the literature, to our knowledge, ours is the first attempt to isolate the strategic interaction that leads to authorship misconduct. If authorship misconduct is a rational choice by researchers, we investigate the use of a monitoring-punishment approach to eliminate the free-riding equilibria. The possibility of monitoring is not just theoretical: we outline the recent advances in distributed ledger technology and authorship forensics that make monitoring of research workflows a viable strategy for institutions to curb authorship misconduct. One of the advantages of working with our simple model is that it provides a framework to examine the relationship between efficiency and ethics in this context, an issue that has by and large been ignored in the literature.
Conference Paper
In this doctoral proposal, we describe an approach to identify recurring, collective behavioral mechanisms in the collaborative interactions of Wikipedia editors that have the potential to undermine the ideals of quality, neutrality and completeness of article content. We outline how we plan to parametrize these patterns in order to understand their emergence and evolution and measure their effective impact on content production in Wikipedia. On top of these results we intend to build end-user tools to increase the transparency of the evolution of articles and equip editors with more elaborated quality monitors. We also sketch out our evaluation plans and report on already accomplished tasks.
Conference Paper
A considerable portion of web content, from wikis to collaboratively edited documents, to code posted online, is revisioned. We consider the problem of attributing authorship to such revisioned content, and we develop scalable attribution algorithms that can be applied to very large bodies of revisioned content, such as the English Wikipedia. Since content can be deleted, only to be later re-inserted, we introduce a notion of authorship that requires comparing each new revision with the entire set of past revisions. For each portion of content in the newest revision, we search the entire history for content matches that are statistically unlikely to occur spontaneously, thus denoting common origin. We use these matches to compute the earliest possible attribution of each word (or each token) of the new content. We show that this "earliest plausible attribution" can be computed efficiently via compact summaries of the past revision history. This leads to an algorithm that runs in time proportional to the sum of the size of the most recent revision, and the total amount of change (edit work) in the revision history. This amount of change is typically much smaller than the total size of all past revisions. The resulting algorithm can scale to very large repositories of revisioned content, as we show via experimental data over the English Wikipedia.
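The following is a heavily simplified sketch of the "earliest plausible attribution" idea: each token of the newest revision is attributed to the earliest revision containing its surrounding trigram. The statistical rarity test and the compact history summaries described in the paper are omitted, so this should be read as an illustration of the notion, not the algorithm itself.

    # Heavily simplified sketch of earliest attribution: each token of the
    # newest revision is attributed to the earliest revision containing its
    # surrounding trigram. The paper's statistical match test and compact
    # summaries of the revision history are omitted.

    def ngrams(tokens, n=3):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def earliest_attribution(revisions, n=3):
        """revisions: list of token lists, oldest first. Returns, for each token
        of the newest revision, the index of the earliest revision it is
        attributed to."""
        newest = revisions[-1]
        history_grams = [ngrams(rev, n) for rev in revisions]
        attribution = []
        for i, tok in enumerate(newest):
            start = max(0, min(i - 1, len(newest) - n))
            gram = tuple(newest[start:start + n])
            origin = next((r for r, grams in enumerate(history_grams) if gram in grams),
                          len(revisions) - 1)
            attribution.append((tok, origin))
        return attribution

    revs = [
        "the cat sat on the mat".split(),
        "the cat sat on the red mat".split(),
    ]
    # Tokens whose 3-token context changed (here "the" and "mat" around the
    # inserted "red") fall back to the newer revision in this crude version.
    print(earliest_attribution(revs))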
Article
Full-text available
Wikipedia is a top-ten Web site providing a free encyclopedia created by an open community of volunteer contributors. As investigated in various studies over the past years, contributors have different backgrounds, mindsets and biases; however, the effects, positive and negative, of this diversity on the quality of the Wikipedia content, and on the sustainability of the overall project, are yet only partially understood. In this paper we discuss these effects through an analysis of existing scholarly literature in the area and identify directions for future research and development; we also present an approach for diversity-minded content management within Wikipedia that combines techniques from semantic technologies, data and text mining and quantitative social dynamics analysis to create greater awareness of diversity-related issues within the Wikipedia community, give readers access to indicators and metrics to understand biases and their impact on the quality of Wikipedia articles, and support editors in achieving balanced versions of these articles that leverage the wealth of knowledge and perspectives inherent to large-scale collaboration.
Conference Paper
Full-text available
The Wikipedia is a collaborative encyclopedia: anyone can contribute to its articles simply by clicking on an "edit" button. The open nature of the Wikipedia has been key to its success, but has also created a challenge: how can readers develop an informed opinion on its reliability? We propose a system that computes quantitative values of trust for the text in Wikipedia articles; these trust values provide an indication of text reliability. The system uses as input the revision history of each article, as well as information about the reputation of the contributing authors, as provided by a reputation system. The trust of a word in an article is computed on the basis of the reputation of the original author of the word, as well as the reputation of all authors who edited text near the word. The algorithm computes word trust values that vary smoothly across the text; the trust values can be visualized using varying text-background colors. The algorithm ensures that all changes to an article's text are reflected in the trust values, preventing surreptitious content changes. We have implemented the proposed system, and we have used it to compute and display the trust of the text of thousands of articles of the English Wikipedia. To validate our trust-computation algorithms, we show that text labeled as low-trust has a significantly higher probability of being edited in the future than text labeled as high-trust.
Conference Paper
Full-text available
An efficient differencing algorithm can be used to compress versions of files both for transmission over low-bandwidth channels and for compact storage. This can greatly reduce network traffic and execution time for distributed applications such as software distribution, source code control, file system replication, and data backup and restore. An algorithm for such applications needs to be both general and efficient, i.e., able to compress binary inputs in linear time. We present such an algorithm for differencing files at the granularity of a byte. The algorithm uses constant memory and handles arbitrarily large input files. While the algorithm makes minor sacrifices in compression to attain linear runtime performance, it outperforms the byte-wise differencing algorithms that we have encountered in the literature on all inputs.
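The sketch below illustrates the general copy/insert style of byte-wise delta encoding that such differencing algorithms produce: blocks of the old version are indexed by content, and the new version is emitted as COPY/INSERT commands. It is a simple greedy illustration under an assumed block size and command format, not the linear-time, constant-memory algorithm of the paper.

    # Sketch of greedy copy/insert delta encoding: blocks of the old version are
    # indexed by content, and the new version is emitted as COPY(offset, length)
    # and INSERT(bytes) commands. Illustration only, not the paper's algorithm.

    BLOCK = 4

    def encode_delta(old: bytes, new: bytes):
        index = {old[i:i + BLOCK]: i for i in range(len(old) - BLOCK + 1)}
        delta, literal, i = [], bytearray(), 0
        while i < len(new):
            block = new[i:i + BLOCK]
            if len(block) == BLOCK and block in index:
                if literal:
                    delta.append(("INSERT", bytes(literal)))
                    literal = bytearray()
                off, length = index[block], BLOCK
                # extend the match greedily as long as bytes keep agreeing
                while (off + length < len(old) and i + length < len(new)
                       and old[off + length] == new[i + length]):
                    length += 1
                delta.append(("COPY", off, length))
                i += length
            else:
                literal.append(new[i])
                i += 1
        if literal:
            delta.append(("INSERT", bytes(literal)))
        return delta

    def apply_delta(old: bytes, delta):
        out = bytearray()
        for cmd in delta:
            if cmd[0] == "COPY":
                _, off, length = cmd
                out += old[off:off + length]
            else:
                out += cmd[1]
        return bytes(out)

    old = b"the quick brown fox jumps over the lazy dog"
    new = b"the quick red fox jumps over the very lazy dog"
    d = encode_delta(old, new)
    assert apply_delta(old, d) == new   # the delta reconstructs the new version
    print(d)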
Article
The Wikipedia is a collaborative encyclopedia: anyone can contribute to its articles simply by clicking on an “edit” button. The open nature of the Wikipedia has been key to its success, but has also created a challenge: how can readers form an informed opinion on its reliability? We propose a system that computes quantitative values of trust for the text in Wikipedia articles; these trust values provide an indication of text reliability. The system uses as input the revision history of each article, as well as information about the reputation of the contributing authors, as provided by a reputation system. The trust of a word in an article is computed on the basis of the reputation of the original author of the word, as well as the reputation of all authors who edited the text in proximity of the word. The algorithm computes word trust values that vary smoothly across the text; the trust values can be visualized using varying text-background colors. The algorithm ensures that all changes to an article text are reflected in the trust values, preventing surreptitious content changes. We have implemented the proposed system, and we have used it to compute and display the trust of the text of thousands of articles of the English Wikipedia. To validate our trust-computation algorithms, we show that text labeled as low-trust has a significantly higher probability of being edited in the future than text labeled as high-trust. Anecdotal evidence seems to corroborate this validation: in practice, readers find the trust information valuable.
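A toy illustration (not the WikiTrust implementation) of the idea that a word's trust follows from the reputation of its original author and of authors who later edited nearby text, with values varying smoothly across the text. The radius, weights and smoothing below are assumptions made for the example.

    # Illustration of word trust: words start with their original author's
    # reputation, are pulled toward the reputation of authors who edited nearby,
    # and are then smoothed along the text. Parameters are placeholders.

    def word_trust(origin_reputation, edits, radius=3, weight=0.5, smooth=1):
        """origin_reputation: per-word reputation of the original author (0..1).
        edits: list of (position, editor_reputation) for the latest revision."""
        trust = list(origin_reputation)
        # words near an edit inherit part of the editing author's reputation
        for pos, rep in edits:
            for i in range(max(0, pos - radius), min(len(trust), pos + radius + 1)):
                trust[i] = (1 - weight) * trust[i] + weight * rep
        # moving average so trust varies gradually across the text
        smoothed = []
        for i in range(len(trust)):
            window = trust[max(0, i - smooth): i + smooth + 1]
            smoothed.append(sum(window) / len(window))
        return smoothed

    origin = [0.9] * 10          # text originally written by a high-reputation author
    edits = [(5, 0.2)]           # a low-reputation author just edited near position 5
    print([round(t, 2) for t in word_trust(origin, edits)])   # trust dips around the edit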
Conference Paper
The Internet has fostered an unconventional and powerful style of collaboration: "wiki" web sites, where every visitor has the power to become an editor. In this paper we investigate the dynamics of Wikipedia, a prominent, thriving wiki. We make three contributions. First, we introduce a new exploratory data analysis tool, the history flow visualization, which is effective in revealing patterns within the wiki context and which we believe will be useful in other collaborative situations as well. Second, we discuss several collaboration patterns highlighted by this visualization tool and corroborate them with statistical analysis. Third, we discuss the implications of these patterns for the design and governance of online collaborative social spaces. We focus on the relevance of authorship, the value of community surveillance in ameliorating antisocial behavior, and how authors with competing perspectives negotiate their differences.
Conference Paper
We present a content-driven reputation system for Wikipedia authors. In our system, authors gain reputation when the edits they perform to Wikipedia articles are preserved by subsequent authors, and they lose reputation when their edits are rolled back or undone in short order. Thus, author reputation is computed solely on the basis of content evolution; user-to-user comments or ratings are not used. The author reputation we compute could be used to flag new contributions from low-reputation authors, or it could be used to allow only authors with high reputation to contribute to controversial or critical pages. A reputation system for the Wikipedia could also provide an incentive for high-quality contributions. We have implemented the proposed system, and we have used it to analyze the entire Italian and French Wikipedias, consisting of a total of 691,551 pages and 5,587,523 revisions. Our results show that our notion of reputation has good predictive value: changes performed by low-reputation authors have a significantly larger than average probability of having poor quality, as judged by human observers, and of being later undone, as measured by our algorithms.
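A toy sketch of the content-driven idea: an author's reputation rises when their edit is preserved by subsequent revisions and falls when it is quickly undone. The survival test and update amounts below are placeholders chosen for illustration, not the paper's formulas.

    # Toy sketch of content-driven reputation: gain when an edit is preserved,
    # lose when it is quickly undone. Update sizes and the crude survival check
    # are placeholders, not the paper's formulas.

    def update_reputation(reputation, edit_author, survived, gain=0.05, penalty=0.15):
        old = reputation.get(edit_author, 0.5)
        new = old + gain if survived else old - penalty
        reputation[edit_author] = min(1.0, max(0.0, new))
        return reputation

    def edit_survived(edited_text, later_text, threshold=0.5):
        """Crude survival test: at least `threshold` of the edited words still occur later."""
        edited, later = edited_text.split(), set(later_text.split())
        if not edited:
            return True
        return sum(w in later for w in edited) / len(edited) >= threshold

    reputation = {"alice": 0.6, "bob": 0.6}
    # Alice's addition is kept, Bob's addition is rolled back in the next revision.
    reputation = update_reputation(reputation, "alice",
                                   edit_survived("sourced new paragraph",
                                                 "old text sourced new paragraph"))
    reputation = update_reputation(reputation, "bob",
                                   edit_survived("spammy link farm", "old text"))
    print(reputation)   # alice's reputation increases, bob's drops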
J. Gastwirth. The estimation of the Lorenz curve and Gini index. The Review of Economics and Statistics, 54(3):306–316, 1972.
F. Flöck, D. Vrandečić, and E. Simperl. Towards a diversity-minded Wikipedia. In Proceedings of the ACM 3rd International Conference on Web Science 2011, June 2011.