Join-Processing on RDF in a MapReduce Environment
Today's information systems have to cope with an exploding number and size of data sets. Distributed computing platforms such as MapReduce have proven to be well suited for large-scale data management. In this talk I will concentrate on query processing on RDF data; RDF is a standard developed by the World Wide Web Consortium for representing Semantic Web data. The first part of the talk covers PigSPARQL, a system we have developed to translate general SPARQL queries into Pig Latin, a relational programming layer on top of Hadoop, a widely used open-source environment for MapReduce. As a distinctive feature, PigSPARQL does not require any changes to the original Hadoop and can therefore be used in cloud computing environments as well. However, Pig Latin's reduce-side implementation of the relational join may incur efficiency problems for large data sets. In the second part of the talk I will present a map-side join implementation that takes advantage of the scalable storage capabilities of HBase, Hadoop's distributed NoSQL datastore. Finally, I will present evaluation results demonstrating the feasibility of our approach.
Joint work with Martin Przyjaciel-Zablocki and Alexander Schätzle
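
To illustrate the difference between the two join strategies mentioned in the abstract, here is a minimal single-process sketch in Python. It is not the PigSPARQL or HBase implementation from the talk: the sample triples, the function names, and the in-memory dictionary standing in for an indexed store such as HBase are all assumptions made for illustration. The reduce-side variant regroups both inputs by the join key before combining them, which mimics the shuffle phase of MapReduce; the map-side variant streams one input and looks up join partners in a prebuilt index, so no regrouping of the streamed input is needed.

```python
from collections import defaultdict

# Hypothetical sample RDF triples as (subject, predicate, object).
TRIPLES = [
    ("alice", "knows", "bob"),
    ("alice", "knows", "carol"),
    ("bob", "livesIn", "berlin"),
    ("carol", "livesIn", "paris"),
    ("dave", "livesIn", "rome"),
]


def match(triples, predicate):
    """Evaluate a triple pattern ?s <predicate> ?o as a list of (s, o) pairs."""
    return [(s, o) for s, p, o in triples if p == predicate]


def reduce_side_join(left, right):
    """Join on left.object = right.subject by grouping both inputs on the key.

    The grouping step plays the role of the MapReduce shuffle: every tuple
    of both inputs is moved to the group of its join key.
    """
    groups = defaultdict(lambda: ([], []))
    for s, o in left:           # "map": emit left tuples under their object
        groups[o][0].append(s)
    for s, o in right:          # "map": emit right tuples under their subject
        groups[s][1].append(o)
    results = []
    for key, (lefts, rights) in groups.items():   # "reduce": combine per key
        for l in lefts:
            for r in rights:
                results.append((l, key, r))
    return sorted(results)


def build_index(right):
    """Build a subject -> objects index (a stand-in for an indexed store)."""
    index = defaultdict(list)
    for s, o in right:
        index[s].append(o)
    return index


def map_side_join(left, index):
    """Join by streaming the left input and probing the index per tuple."""
    results = []
    for s, o in left:
        for r in index.get(o, []):   # lookup only; the left input is not regrouped
            results.append((s, o, r))
    return sorted(results)


# Query sketch: ?x knows ?y . ?y livesIn ?z
knows = match(TRIPLES, "knows")
lives_in = match(TRIPLES, "livesIn")
reduce_result = reduce_side_join(knows, lives_in)
map_result = map_side_join(knows, build_index(lives_in))
```

Both functions return the same bindings for the example query; the difference lies in how the data moves. In a real cluster the grouping step of the reduce-side variant corresponds to transferring both inputs across the network, whereas the map-side variant depends on the lookup store answering key probes efficiently.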