University of Oxford, Department of Computer Science

VOXPath: Visual Data Extraction with OXPath

Supervisor

Suitable for

Abstract

Background: The work will be done in the context of the large ERC project DIADEM (Domain-centric Intelligent Automated Data Extraction Methodology), whose goal is to automate web data extraction in specific application domains such as real estate and restaurants.

Principal goal of the MSc or Honour School project:

OXPath (Oxford XPath) is an extension of XPath, introduced in the DIADEM project, for navigating web pages and extracting data from them, including pages that require interaction with web forms.
It is a fundamental part of the DIADEM project and is used mainly in the runtime phase.
A single OXPath expression can automatically fill in and submit a web form and process the information contained in the result pages.

This proposal aims at designing and implementing a tool that automatically generates OXPath expressions and wraps the retrieved data. The tool should be easy to use and should ensure that the generated expressions are correct.
It consists mainly of an advanced Graphical User Interface (GUI) for OXPath. We envisage the user filling in a web form in the browser as usual, while the tool records the operations involved and finally generates the corresponding OXPath expressions automatically.
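
As a rough illustration of the recording idea, the sketch below collects form interactions and renders them as a single expression. It is not part of DIADEM or of any OXPath implementation: the class `OXPathRecorder` and its methods are hypothetical, and the action notation (`{"text"}` for typing, `{click /}` for a click that loads a new page) follows the syntax used in published OXPath examples, so it should be checked against the OXPath version actually in use. The caller is assumed to supply each location step together with its axis.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical recorder: stores user actions and renders them as one OXPath-style expression. */
final class OXPathRecorder {
    private final String startUrl;
    private final List<String> steps = new ArrayList<>();

    OXPathRecorder(String startUrl) {
        this.startUrl = startUrl;
    }

    /** Records typing text into the element reached by the given path fragment. */
    OXPathRecorder type(String path, String text) {
        // Assumption: quoting rules are not handled here, so quotes are rejected outright.
        if (text.contains("\"")) {
            throw new IllegalArgumentException("double quotes are not supported in this sketch");
        }
        steps.add(path + "/{\"" + text + "\"}");
        return this;
    }

    /** Records a click on the element reached by the given path fragment. */
    OXPathRecorder click(String path) {
        // Assumption: the click leads to a new page, written as the action {click /}.
        steps.add(path + "/{click /}");
        return this;
    }

    /** Concatenates the start page and all recorded steps, in recording order. */
    String toOXPath() {
        StringBuilder sb = new StringBuilder("doc(\"" + startUrl + "\")");
        for (String step : steps) {
            sb.append(step);
        }
        return sb.toString();
    }
}
```

For example, recording `type("//input[@name='q']", "Oxford")` followed by `click("/following::input[@type='submit']")` on a recorder created for a hypothetical search page yields one expression that opens the page, fills in the field, and submits the form. A real tool would also have to derive robust path fragments from the browser events, which this sketch leaves to the caller.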

Once the result pages are displayed, the tool lets the user visually specify which elements need to be extracted, and again produces the proper OXPath expression.
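
The visual selection step could end in something like the following sketch, which turns a selected record element and its selected fields into an extraction fragment. The class `ExtractionBuilder` is hypothetical, and the marker notation (`:<name>` for a record, `:<name=string(.)>` for a field value, with fields nested in predicates) is assumed from published OXPath examples rather than taken from this proposal.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: builds an OXPath-style extraction fragment from visually selected elements. */
final class ExtractionBuilder {
    private ExtractionBuilder() {
    }

    /**
     * @param recordPath path of the repeated record element, e.g. one search result
     * @param recordName name under which each record is extracted
     * @param fields     field name mapped to its path relative to the record, in selection order
     */
    static String build(String recordPath, String recordName, Map<String, String> fields) {
        StringBuilder sb = new StringBuilder(recordPath)
                .append(":<").append(recordName).append(">");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            // Each field becomes a predicate on the record that extracts the element's text.
            sb.append("[").append(field.getValue())
              .append(":<").append(field.getKey()).append("=string(.)>]");
        }
        return sb.toString();
    }

    /** Convenience for callers: a map that preserves the order in which fields were selected. */
    static Map<String, String> orderedFields() {
        return new LinkedHashMap<>();
    }
}
```

With a record path of `//div[@class='result']` and fields `title` and `price`, the builder returns a fragment that can be appended to the navigation part of an expression. Because ordinary predicates also act as filters, records lacking one of the selected fields would be dropped under this scheme, which a real tool might need to handle differently.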

Skills Needed: This project requires good analytical and software engineering skills, and involves programming in Java (Swing/SWT) and working with XPath.

Supervision: This project will be co-supervised by Dr. Giovanni Grasso and Dr. Tim Furche.