OXLatin: web data extraction in the Cloud with Pig and Hadoop
Abstract
In DIADEM (diadem-project.info), we do not simply extract data from web sites, but
automatically generate small wrapper programs to perform the actual web data extraction. Freed
from the resource-intensive analysis necessary to understand hitherto unknown sites, those
wrapper programs must be highly efficient and scalable. We formulate these wrappers in OXPath
(http://diadem.cs.ox.ac.uk/OXPath/), a language we designed as an extension of XPath for
data extraction. As a next step, we want to run our wrappers in the cloud in a Hadoop
environment, such as Amazon Elastic MapReduce. Therefore, we parallelized OXPath to run on the
Cloud and designed a host language for OXPath to build more complex extraction pipelines
involving several wrappers and data sources, e.g., city names to fill into location fields. We designed
OXLatin as an extension of PigLatin, which is an SQL-like language facilitating efficient, parallelizable
group-by-aggregation based on the map-reduce paradigm. Tailored for massive extraction tasks,
OXLatin is designed to minimize the communication with the targeted web servers, both for
performance reasons and to avoid accidentally launching a DDoS attack. OXLatin achieves this via
OXPath decomposition: roughly, an input expression is analyzed and decomposed into segments,
such that they can be distributed to multiple nodes in a computing cluster. The outcome of this
project is (1) an execution environment for OXLatin, translating OXLatin to PigLatin, and (2)
an implementation of the decomposition methods. Also, we want to develop a dashboard for our
cloud to monitor the different instances running, enabling features such as (re-)starting, killing,
and observing single extraction tasks. This thesis offers the opportunity to learn and practice with
OXPath (hence XPath), Hadoop MapReduce in AWS, and PigLatin. Good knowledge of Java is
required; previous knowledge of any involved language or technology is a plus.
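
To make the group-by-aggregation idea mentioned above more concrete, the following is a minimal, purely illustrative sketch in plain Python of the map, shuffle, and reduce steps that a PigLatin GROUP BY followed by an aggregate would roughly correspond to. It is not OXLatin or OXPath code and does not show the decomposition method. The function run_wrapper, the sample prices, and the choice of "average price per city" are assumptions made up for illustration; a real pipeline would execute a wrapper with each city name filled into a location field.

```python
from collections import defaultdict
from statistics import mean

# Made-up sample data standing in for what a wrapper might extract per city.
SAMPLE_LISTINGS = {
    "Oxford": [300000, 450000],
    "London": [600000, 800000, 1000000],
}


def run_wrapper(city):
    """Hypothetical stand-in for running one wrapper with `city` as input."""
    return [{"city": city, "price": p} for p in SAMPLE_LISTINGS.get(city, [])]


def map_phase(cities):
    """Map: one independent extraction task per city, emitting (key, value)."""
    for city in cities:
        for record in run_wrapper(city):
            yield record["city"], record["price"]


def shuffle(pairs):
    """Shuffle: collect all values that share a key into one group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: aggregate each group on its own (count and mean price)."""
    return {key: (len(values), mean(values)) for key, values in groups.items()}


def average_price_per_city(cities):
    """Chain the three phases; each phase could run on separate nodes."""
    return reduce_phase(shuffle(map_phase(cities)))


result = average_price_per_city(["Oxford", "London"])
```

Because each city is processed independently in the map step and each group is aggregated independently in the reduce step, both steps can in principle be spread over several nodes, which is the property the map-reduce paradigm relies on.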