Data set based on all publications available on arXiv.org

Contact persons: Michael Färber, Tarek Saier

Research group: Web Science

Publication date: 30 September 2019


In recent years, scholarly data sets have been used for various purposes, such as paper recommendation, citation recommendation, citation context analysis, and citation context-based document summarization. The evaluation of approaches to such tasks, and their applicability in real-world scenarios, depends heavily on the data set used. However, existing scholarly data sets are limited in several regards. We propose a new data set based on all publications from all scientific disciplines available on arXiv.org. In addition to providing the papers' plain text, the data set annotates in-text citations via global identifiers. Furthermore, citing and cited publications were linked to the Microsoft Academic Graph, providing access to rich metadata. Our data set consists of over one million documents and 29.2 million citation contexts. The data set, which is made freely available for research purposes, can not only enhance the future evaluation of research paper-based and citation context-based approaches, but also serve as a basis for new ways to analyze in-text citations. The source code used to create the data set is also available. To cite this resource, please refer to our journal article "unarXive: A Large Scholarly Data Set with Publications’ Full-Text, Annotated In-Text Citations, and Links to Metadata", which describes the data set and its creation in more detail.

Michael Färber, Tarek Saier


Tarek Saier, Michael Färber, Tornike Tsereteli
Cross-Lingual Citations in English Papers: A Large-Scale Analysis of Prevalence, Usage, and Impact
International Journal on Digital Libraries, 23, (2), pp. 179–195, December 2021

Tarek Saier, Michael Färber
unarXive: A Large Scholarly Data Set with Publications’ Full-Text, Annotated In-Text Citations, and Links to Metadata
Scientometrics, March 2020

Tarek Saier, Michael Färber
Bibliometric-Enhanced arXiv: A Data Set for Paper-Based and Citation-Based Tasks
Proceedings of the 8th International Workshop on Bibliometric-enhanced Information Retrieval (BIR) co-located with the 41st European Conference on Information Retrieval (ECIR 2019), pp. 14–26, CEUR-WS, April 2019
