The Challenges of German Archival Document Categorization on Insufficient Labeled Data
Published: June 2020
Book title: Proceedings of Workshop on Humanities in the Semantic Web co-located with ESWC 2020
Series: CEUR Workshop Proceedings
Organization: WHiSe Workshop
Document exploration in archives is often challenging because the records are not organized into topic-based categories. Moreover, archival records provide only short texts, which are often insufficient for capturing their semantics. This paper proposes and explores a dataless categorization approach that uses word embeddings and TF-IDF to categorize archival documents. Additionally, it introduces a visual approach built on top of the word embeddings to enhance data exploration. Preliminary results suggest that current vector representations alone do not provide enough external knowledge to solve this task.
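The abstract does not spell out how word embeddings and TF-IDF are combined, so the following is only a minimal sketch of one common way to do dataless categorization, not the authors' implementation. It assumes each document is represented as a TF-IDF-weighted sum of its word vectors and is assigned to the category whose label embedding is closest by cosine similarity. The toy embedding values, the category names, and the function names are all hypothetical and chosen purely for illustration.

```python
import math
from collections import Counter

# Hypothetical 3-dimensional toy embeddings; a real setup would load
# pretrained German word vectors instead.
EMBEDDINGS = {
    "church": [0.9, 0.1, 0.0],
    "parish": [0.8, 0.2, 0.1],
    "religion": [1.0, 0.0, 0.0],
    "tax": [0.0, 0.9, 0.1],
    "invoice": [0.1, 0.8, 0.0],
    "finance": [0.0, 1.0, 0.0],
    "register": [0.3, 0.3, 0.3],
}


def idf_weights(docs):
    """Smoothed inverse document frequency for every token in the corpus."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}


def doc_vector(tokens, idf):
    """TF-IDF-weighted sum of the embeddings of all known tokens."""
    tf = Counter(tok for tok in tokens if tok in EMBEDDINGS)
    dim = len(next(iter(EMBEDDINGS.values())))
    vec = [0.0] * dim
    for tok, count in tf.items():
        weight = count * idf.get(tok, 1.0)
        for i, value in enumerate(EMBEDDINGS[tok]):
            vec[i] += weight * value
    return vec


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def categorize(tokens, categories, idf):
    """Pick the category whose label embedding is closest to the document."""
    vec = doc_vector(tokens, idf)
    return max(categories, key=lambda cat: cosine(vec, EMBEDDINGS[cat]))


# Two short, pre-tokenized "records" and two hypothetical category labels.
docs = [["parish", "register", "church"], ["tax", "invoice", "register"]]
idf = idf_weights(docs)
labels = [categorize(doc, ["religion", "finance"], idf) for doc in docs]
```

No labeled training documents are needed here: the only supervision is the category names themselves. This also makes the limitation noted in the abstract plausible, since the result depends entirely on how much external knowledge the embeddings carry about very short texts.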