University of Oxford, Department of Computer Science

One Path to Find them All: Learning OXPath Wrappers

Abstract

The wealth of information on the web exceeds any other human-created source of information by orders of magnitude. Harnessing that data, however, requires increasingly automated methods that can collect relevant data from many websites and filter it, e.g., for further human consumption. OXPath (http://diadem.cs.ox.ac.uk/OXPath/) is a wrapper language developed at Oxford in the DIADEM (diadem-project.info) project. It is particularly well suited for extraction from rich internet applications with sophisticated client-side interfaces. Manually creating such wrappers, however, remains a daunting task. In this project, we therefore aim to develop a method for learning OXPath wrappers from sets of annotated examples that indicate, e.g., what data to extract or which actions to perform. The challenge in such a learning algorithm lies in the need to abstract from multiple examples to a single, compact OXPath expression that should be close to optimal under a set of additional criteria. Though there is a significant body of literature on learning XPath expressions, OXPath is a richer language and introduces new challenges. We also specifically aim at extensive experimental validation of the developed techniques. This project is particularly suited for students interested in modern web technology and familiar with compiler design or query optimization. The learner will be implemented in Java, where we provide a set of customized APIs for action execution and DOM interaction.
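
To give a feel for what "abstracting from multiple examples to a single expression" can mean, the following is a minimal sketch for plain XPath only, not OXPath, and not the method this project will develop. It assumes each annotated example is given as an absolute path of child steps with optional positional predicates (such as /html[1]/body[1]/ul[1]/li[3]) and that all examples have the same depth. The class name XPathGeneralizer and its methods are hypothetical and are not part of the APIs mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: generalizes several example paths into one plain XPath.
// Assumption: every example is an absolute path of simple child steps of equal depth.
class XPathGeneralizer {

    /** Splits "/html[1]/body[1]/li[3]" into the steps "html[1]", "body[1]", "li[3]". */
    static List<String> steps(String path) {
        List<String> out = new ArrayList<>();
        for (String s : path.split("/")) {
            if (!s.isEmpty()) {
                out.add(s);
            }
        }
        return out;
    }

    /** Returns the tag name of a step, i.e. the part before any "[...]" predicate. */
    static String tag(String step) {
        int i = step.indexOf('[');
        return i < 0 ? step : step.substring(0, i);
    }

    /**
     * Merges the examples step by step: identical steps are kept, steps that
     * differ only in their predicate lose the predicate, and steps with
     * different tag names become the wildcard "*".
     */
    static String generalize(List<String> examples) {
        if (examples.isEmpty()) {
            throw new IllegalArgumentException("at least one example is required");
        }
        List<List<String>> parsed = new ArrayList<>();
        for (String example : examples) {
            parsed.add(steps(example));
        }
        int depth = parsed.get(0).size();
        for (List<String> p : parsed) {
            if (p.size() != depth) {
                throw new IllegalArgumentException("examples must have the same depth");
            }
        }
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            String merged = parsed.get(0).get(i);
            for (List<String> p : parsed) {
                String other = p.get(i);
                if (!merged.equals(other)) {
                    merged = tag(merged).equals(tag(other)) ? tag(merged) : "*";
                }
            }
            result.append('/').append(merged);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // Two annotated list items on the same page collapse to "all li children".
        System.out.println(generalize(List.of(
                "/html[1]/body[1]/ul[1]/li[1]",
                "/html[1]/body[1]/ul[1]/li[3]")));
        // prints /html[1]/body[1]/ul[1]/li
    }
}
```

A sketch like this always picks the first generalization that covers all examples. The project described above additionally asks for expressions that are compact and close to optimal under further criteria, and for handling OXPath features such as actions, which this illustration does not attempt.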