
WikiWho: Precise and Efficient Attribution of Authorship of Revisioned Content


Published: 2014

Book title: Proceedings of the WWW2014
Publisher: -

Refereed publication


Revisioned text content is present in numerous collaboration platforms on the Web, most notably Wikis. Tracking the authorship of text tokens in such systems has many potential applications, such as identifying main authors for licensing reasons or tracking collaborative writing patterns over time. In this context, two main challenges arise. First, an authorship tracking system must be precise in its attributions to be reliable as a data basis for further processing. Second, it has to run efficiently even on very large datasets, such as Wikipedia. As a solution, we propose a graph-based model to represent revisioned content and an algorithm over this model that tackles both issues effectively. We describe the optimal implementation and design choices when tuning it to a Wiki environment. We further present a gold standard of 240 tokens from English Wikipedia articles annotated with their origin. This gold standard was created manually and confirmed by multiple independent users of a crowdsourcing platform. It is the first gold standard of this kind and quality, and our solution achieves an average of 95% precision on this data set. We also perform a first-ever precision evaluation of the state-of-the-art algorithm for the task, exceeding it by about 10% on average. Our approach is faster than the state of the art by one order of magnitude in execution time, as we demonstrate on a sample of over 240 English Wikipedia articles. We argue that the increase of about 10% in the size of an optional materialization of our results, compared to the baseline, is a favorable trade-off given the large advantage in runtime performance.

Download: Media:Www2014 submission 715 (9).pdf