University of Oxford, Department of Computer Science

OXLatin: web data extraction in the Cloud with Pig and Hadoop

Supervisor

Suitable for

Abstract

In DIADEM (diadem-project.info), we do not simply extract data from web sites, but automatically generate small wrapper programs to perform the actual web data extraction. Freed from the resource-intensive analysis necessary to understand hitherto unknown sites, those wrapper programs must be highly efficient and scalable. We formulate these wrappers in OXPath (http://diadem.cs.ox.ac.uk/OXPath/), a language we designed as an extension of XPath towards data extraction. As a next step, we want to run our wrappers on the cloud in a Hadoop environment, such as Amazon Elastic MapReduce. Therefore we parallelized OXPath to run on the cloud and designed a host language for OXPath to build more complex extraction pipelines involving several wrappers and data sources, e.g., city names to fill into location fields. We designed OXLatin as an extension of PigLatin, which is an SQL-like language facilitating efficient, parallelizable group-by aggregation based on the map-reduce paradigm. Tailored for massive extraction tasks, OXLatin is designed to minimize the communication with the targeted web servers, both for performance reasons and to avoid accidentally launching a DDoS attack. OXLatin achieves this via OXPath decomposition: roughly, an input expression is analyzed and decomposed into segments, such that they can be distributed to multiple nodes in a computing cluster.

The outcome of this project is (1) an execution environment for OXLatin, translating OXLatin to PigLatin, and (2) an implementation of the decomposition methods. Also, we want to develop a dashboard for our cloud to monitor the different instances running, enabling features such as (re-)starting, killing, and observing single extraction tasks. This thesis offers the opportunity to learn and practice with OXPath (hence XPath), Hadoop MapReduce in AWS, and PigLatin. Good knowledge of Java is required; previous knowledge of any involved language/technology is a plus.
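To give a rough feel for what decomposing an expression into segments could look like, here is a minimal Python sketch. It is not the DIADEM decomposition method, which the abstract does not specify. It assumes a plain XPath-like expression, cuts it only at top-level `/` separators (slashes inside predicates, parentheses, braces, or quoted strings are left alone), and groups a fixed number of steps into each segment. The function names `split_steps` and `decompose` and the parameter `steps_per_segment` are hypothetical.

```python
def split_steps(expr: str) -> list[str]:
    """Split a path expression into steps at top-level '/' separators.

    Slashes nested inside [...], (...), {...} or quoted strings do not
    start a new step. A leading '/' or '//' stays attached to its step.
    """
    steps: list[str] = []
    current: list[str] = []
    depth = 0      # nesting depth of brackets, parentheses, braces
    quote = None   # the open quote character, if inside a string

    for ch in expr:
        if quote is not None:
            # Inside a string literal: copy verbatim until it closes.
            if ch == quote:
                quote = None
        elif ch in "\"'":
            quote = ch
        elif ch in "[({":
            depth += 1
        elif ch in "])}":
            depth -= 1
        elif ch == "/" and depth == 0:
            # A top-level slash begins a new step, unless the buffer
            # holds only slashes (the first half of a '//' axis).
            if "".join(current).strip("/"):
                steps.append("".join(current))
                current = []
        current.append(ch)

    if current:
        steps.append("".join(current))
    return steps


def decompose(expr: str, steps_per_segment: int = 2) -> list[str]:
    """Group consecutive steps into segments of a fixed size (assumption:
    a real decomposition would choose cut points by analysing the
    expression, not by counting steps)."""
    steps = split_steps(expr)
    return [
        "".join(steps[i:i + steps_per_segment])
        for i in range(0, len(steps), steps_per_segment)
    ]


if __name__ == "__main__":
    expression = "//form[@id='search']/input[@name='q']//div[a/@href]/span"
    for number, segment in enumerate(decompose(expression), start=1):
        print(number, segment)
```

Concatenating the segments in order reproduces the original expression, so no part of the wrapper is lost by the split. How segments are then scheduled onto cluster nodes, and how intermediate results pass between them, is left open here.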