Optimizing OXPath Queries
Suitable for: MSc in Software Engineering (part-time) (and part-time Certificate and Diploma courses)
Abstract
Background: The work will be done in the context of the large ERC project DIADEM (Domain-centric Intelligent Automated Data Extraction Methodology), whose goal is to automate web data extraction in specific application domains such as real estate and restaurants.
Principal goal of the MSc or Honour School project:
OXPath (Oxford XPath) is an extension of XPath introduced in the DIADEM project for navigating web pages and extracting data from them, including pages that require interaction with web forms. It is a fundamental part of the DIADEM project, used mainly in the runtime phase. A single OXPath expression can automatically populate and query a web form and process the information contained in the result pages.
XPath extraction expressions are meant to be generated automatically. This proposal aims to study how to optimize these expressions given a result page annotated with concepts from a domain ontology. Such knowledge could be exploited to produce OXPath expressions that are more compact (and possibly more resilient to page changes).
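To make the idea of a more compact and more resilient expression concrete, here is a minimal sketch in Java using the standard javax.xml.xpath API. It uses plain XPath rather than OXPath, and everything in it is an assumption made for illustration: the sample pages, the data-concept attribute standing in for an ontology annotation, and the class and method names. It does not show how DIADEM actually represents annotations or generates expressions.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class XPathCompaction {
    // Hypothetical result page. The "data-concept" attribute is an assumed
    // stand-in for an annotation with a concept from a domain ontology.
    static final String PAGE =
        "<html><body><div><div>"
        + "<span>Flat in Oxford</span>"
        + "<span data-concept=\"price\">250000</span>"
        + "</div></div></body></html>";

    // The same page after a layout change: one extra element precedes the others.
    static final String CHANGED_PAGE =
        "<html><body><div><div>"
        + "<span>New</span>"
        + "<span>Flat in Oxford</span>"
        + "<span data-concept=\"price\">250000</span>"
        + "</div></div></body></html>";

    // A purely positional expression, typical of naive automatic generation.
    static final String POSITIONAL = "/html/body/div[1]/div[1]/span[2]";

    // A shorter expression that relies on the (assumed) concept annotation.
    static final String ANNOTATED = "//*[@data-concept='price']";

    // Evaluates an XPath expression against a well-formed page and returns
    // the string value of the first matching node.
    static String select(String page, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(page)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Could not evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(select(PAGE, POSITIONAL));
        System.out.println(select(PAGE, ANNOTATED));
        System.out.println(select(CHANGED_PAGE, POSITIONAL));
        System.out.println(select(CHANGED_PAGE, ANNOTATED));
    }
}
```

On the original page both expressions select the price. After the layout change, the positional expression selects a different element, while the annotation-based one still selects the price.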
Skills Needed: This project requires good analytical and software engineering skills, as well as good knowledge of Java and XPath.
Supervision: This project will be co-supervised by Dr. Tim Furche and Dr. Giovanni Grasso.