Techreport3039: Difference between revisions
Project: IZEUS
Revision as of 23 July 2013, 11:46
Published: May 2013
Institution: Institute AIFB, KIT
Place of publication: Karlsruhe
Archive number: 3039
Abstract
The Resource Description Framework (RDF) has become an accepted standard for describing entities on the Web. Many such RDF descriptions are text-rich – besides structured data, they also feature large portions of unstructured text. As a result, RDF data is frequently queried using predicates matching structured data, combined with string predicates for textual constraints: hybrid queries. Evaluating hybrid queries requires accurate means for selectivity estimation. Previous works on selectivity estimation, however, suffer from inherent drawbacks, reflected in efficiency and effectiveness issues. In this paper, we present a general framework for hybrid selectivity estimation. Based on its requirements, we study the applicability of existing approaches. Driven by our findings, we propose a novel estimation approach, TopGuess, exploiting topic models as data synopsis. This enables us to capture correlations between structured and unstructured data in a uniform and scalable manner. We study TopGuess in a theoretical manner and show that it guarantees a linear space complexity w.r.t. text data size, and a selectivity estimation time complexity independent of its synopsis size. In experiments on real-world data, TopGuess allowed for great improvements in estimation accuracy, without sacrificing runtime performance.
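To make the terms "hybrid query" and "selectivity" concrete, the following is a minimal Python sketch over made-up data. It is not TopGuess and not taken from the report: the triples, the predicate names (`type`, `comment`), and the helper functions are all hypothetical. It only illustrates why correlations between structured and textual data matter, by comparing the exact selectivity of a hybrid query with an estimate that assumes the two predicates are independent.

```python
# Hypothetical toy data: (subject, predicate, object) triples.
# All entities and values are invented for illustration.
TRIPLES = [
    ("m1", "type", "Movie"), ("m1", "comment", "a romantic comedy"),
    ("m2", "type", "Movie"), ("m2", "comment", "romantic drama in Paris"),
    ("m3", "type", "Movie"), ("m3", "comment", "classic romantic film"),
    ("m4", "type", "Movie"), ("m4", "comment", "a war drama"),
    ("p1", "type", "Person"), ("p1", "comment", "an actress"),
    ("p2", "type", "Person"), ("p2", "comment", "a film director"),
    ("p3", "type", "Person"), ("p3", "comment", "a screenwriter"),
    ("p4", "type", "Person"), ("p4", "comment", "a producer"),
]


def subjects_matching(triples, predicate, test):
    """Return the set of subjects having a triple (s, predicate, o) with test(o) true."""
    return {s for s, p, o in triples if p == predicate and test(o)}


def is_movie(value):
    """Structured predicate: exact match on the object value."""
    return value == "Movie"


def mentions_romantic(value):
    """String predicate: keyword match inside unstructured text."""
    return "romantic" in value.lower().split()


def selectivities(triples):
    """Return (exact, independence_estimate, matches) for the hybrid query."""
    n = len({s for s, _, _ in triples})
    structured = subjects_matching(triples, "type", is_movie)
    textual = subjects_matching(triples, "comment", mentions_romantic)
    matches = structured & textual  # entities satisfying both predicates
    exact = len(matches) / n
    # Naive estimate: multiply per-predicate selectivities as if independent.
    independent = (len(structured) / n) * (len(textual) / n)
    return exact, independent, matches


exact, independent, matches = selectivities(TRIPLES)
print(f"exact={exact}, independence estimate={independent}")
```

In this toy data the keyword occurs only in movie descriptions, so the independence assumption underestimates the true selectivity. Capturing such correlations between structured and unstructured data is the problem the abstract describes.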
Download: Media:Paper-vldb-selectivityestimation tr.pdf