University of Oxford, Department of Computer Science

Optimizing OXPath Queries

Supervisor

Suitable for

Abstract

Background: The work will be done in the context of the large ERC project DIADEM (Domain-centric Intelligent Automated Data Extraction Methodology), whose goal is to automate web data extraction in specific application domains such as real estate and restaurants.

Principal goal of the MSc or Honour School project:

OXPath (Oxford XPath) is an extension of XPath introduced in the DIADEM project for navigating web pages and extracting data from them, including pages that require interaction with web forms.
It is a fundamental part of the DIADEM project and is used mainly in the runtime phase.
A single OXPath expression can automatically populate and query a web form and process the information contained in the result pages.
XPath extraction expressions are meant to be generated automatically. This proposal aims to study how to optimize these expressions given a result page annotated with concepts from a domain ontology. Such knowledge could be exploited to produce OXPath expressions that are more compact and possibly more resilient to page changes.
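
To illustrate what "more compact and more resilient" could mean in practice, here is a minimal sketch in Java using only the standard javax.xml.xpath API. It uses plain XPath rather than OXPath, and everything in it is an assumption made for illustration: the two page snippets, the data-concept attribute standing in for an ontology annotation, and the class and constant names are hypothetical and do not come from DIADEM. The sketch compares a positional expression, of the kind an automatic generator might emit, with an expression anchored on the annotation, before and after a small layout change.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class ResilientXPathSketch {

    // Hypothetical result page. The data-concept attribute is an assumed
    // stand-in for an annotation with a concept from a domain ontology.
    static final String PAGE_V1 =
        "<html><body>"
        + "<div><span data-concept='price'>250000</span></div>"
        + "</body></html>";

    // The same page after a layout change: a new div precedes the result.
    static final String PAGE_V2 =
        "<html><body>"
        + "<div>Sponsored listing</div>"
        + "<div><span data-concept='price'>250000</span></div>"
        + "</body></html>";

    // Positional expression that depends on the exact page layout.
    static final String POSITIONAL = "/html/body/div[1]/span[1]";

    // Shorter expression anchored on the (assumed) annotation instead.
    static final String ANNOTATED = "//*[@data-concept='price']";

    // Evaluates the expression against the page and returns the string value
    // of the first selected node, or the empty string if nothing is selected.
    static String evaluate(String expression, String page) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(page)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Cannot evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println("positional, v1: '" + evaluate(POSITIONAL, PAGE_V1) + "'");
        System.out.println("positional, v2: '" + evaluate(POSITIONAL, PAGE_V2) + "'");
        System.out.println("annotated,  v1: '" + evaluate(ANNOTATED, PAGE_V1) + "'");
        System.out.println("annotated,  v2: '" + evaluate(ANNOTATED, PAGE_V2) + "'");
    }
}
```

In this toy setting the positional expression stops selecting the price once the extra div appears, while the annotation-based expression selects it on both versions of the page. An optimizer of the kind proposed here would presumably have to choose among many such candidate expressions; how to do that is the open question of the project.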

Skills Needed: This project requires good analytical and software engineering skills, as well as a good knowledge of Java and XPath.

Supervision: This project will be co-supervised by Dr. Tim Furche and Dr. Giovanni Grasso.