Join-Processing on RDF in a MapReduce Environment

Georg Lausen (University of Freiburg, Germany)

Today's information systems have to cope with the rapidly growing number and size of data sets. Distributed computing platforms like MapReduce have proven well-suited for large-scale data management. In this talk I will concentrate on query processing on RDF data; RDF is a standard developed by the World Wide Web Consortium for representing Semantic Web data. The first part of the talk covers PigSPARQL, a system we have developed to translate general SPARQL queries into Pig Latin, a relational programming layer on top of Hadoop, a widely used open source environment for MapReduce. As a distinctive feature, PigSPARQL does not require any changes to the original Hadoop framework and can therefore be applied in cloud computing environments as well. However, Pig Latin's reduce-side implementation of the relational join may incur efficiency problems for large data sets. In the second part of the talk I will present a map-side join implementation that takes advantage of the scalable storage capabilities of HBase, Hadoop's distributed NoSQL datastore. Finally, I will present evaluation results demonstrating the feasibility of our approach.

Joint work with Martin Przyjaciel-Zablocki and Alexander Schätzle