Deliverable 3031
Published: September 2012
Institution: Institut AIFB, KIT
Place of publication: Graben-Neudorf
Abstract
The amount of data available for processing is constantly increasing and becoming more diverse. We collect
our experiences in deploying large-scale data management tools on local-area clusters and cloud
infrastructures, and provide guidance on using these computing and storage infrastructures.
In particular, we describe Apache Hadoop, one of the most widely used software libraries for performing
large-scale data analysis tasks in parallel on clusters of computers, and provide guidance on how to achieve
optimal execution times when analyzing large-scale data.
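The programming model behind Hadoop can be illustrated with a minimal word-count sketch in plain Python; this is an illustration of the map/shuffle/reduce phases only, with no Hadoop dependency, and all names are ours:

```python
from collections import defaultdict
from itertools import groupby

def map_phase(doc):
    # Map step: emit an intermediate (word, 1) pair for each word.
    for word in doc.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    # Reduce step: aggregate all counts emitted for one key.
    return word, sum(counts)

def run(docs):
    # Shuffle: sort and group intermediate pairs by key, as the
    # framework does between the map and reduce phases.
    pairs = sorted(kv for doc in docs for kv in map_phase(doc))
    return dict(reduce_phase(k, (c for _, c in group))
                for k, group in groupby(pairs, key=lambda kv: kv[0]))

result = run(["big data", "big clusters"])
# result counts each word across all documents
```

In an actual Hadoop job, the map and reduce functions run distributed across cluster nodes, and the shuffle moves intermediate pairs over the network; the local sketch only mirrors the data flow.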
Furthermore, we report on our experiences with projects that provide valuable insights into the deployment
and use of large-scale data management tools:
The Web Data Commons project, for which we extracted all Microformat, Microdata and RDFa data from
the Common Crawl web corpus, the largest and most up-to-date web corpus currently available to the
public.
SciLens, a local-area cluster machine that has been built to facilitate research on data management issues for
data-intensive scientific applications.
CumulusRDF, an RDF store on cloud-based architectures. We investigate the feasibility of using a
distributed nested key/value store as an underlying storage component for a Linked Data server, which
provides functionality for serving large quantities of Linked Data.
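The idea of backing a triple store with a nested key/value layout can be sketched as follows; in-memory dicts stand in for the distributed store here, and the class and index names are illustrative assumptions, not CumulusRDF's API:

```python
from collections import defaultdict

class TripleStore:
    """Sketch: RDF triples in nested key/value maps (row key -> column key -> values)."""

    def __init__(self):
        # One nested index per access pattern, so any triple-pattern
        # lookup starts from a single row key.
        self.spo = defaultdict(lambda: defaultdict(set))  # subject -> predicate -> objects
        self.pos = defaultdict(lambda: defaultdict(set))  # predicate -> object -> subjects
        self.osp = defaultdict(lambda: defaultdict(set))  # object -> subject -> predicates

    def add(self, s, p, o):
        # Write the triple into all three indexes.
        self.spo[s][p].add(o)
        self.pos[p][o].add(s)
        self.osp[o][s].add(p)

    def objects(self, s, p):
        # Subject+predicate lookup, the pattern a Linked Data server
        # answers when dereferencing a resource URI.
        return self.spo[s][p]

store = TripleStore()
store.add("ex:kit", "rdf:type", "ex:University")
```

In a distributed deployment, each outer key would map to a row partitioned across nodes, which is what makes single-key lookups scale.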
Finally, we describe the News Aggregator Pipeline, a piece of software that performs the acquisition of
high-volume textual streams, their processing into a form suitable for further analysis, and the distribution
of the data.
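The three stages of such a pipeline can be sketched with plain generators; the stage names, the in-memory source, and the sink interface are our assumptions for illustration, not the project's actual components:

```python
def acquire(source):
    # Acquisition: pull raw items from a (here: in-memory) textual stream.
    yield from source

def process(items):
    # Processing: normalize each raw item into a form suited for analysis.
    for item in items:
        yield item.strip().lower()

def distribute(items, sinks):
    # Distribution: hand each processed item to every downstream consumer.
    for item in items:
        for sink in sinks:
            sink(item)

received = []
distribute(process(acquire(["  Breaking News  ", "More NEWS"])),
           [received.append])
```

Because each stage is a generator, items flow through lazily one at a time, which is the property that lets such a pipeline keep up with a high-volume stream without buffering it whole.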
Further information: Link