VOXPath: Visual Data Extraction with OXPath
Suitable for: MSc in Software Engineering (part-time) (and part-time Certificate and Diploma courses)
Abstract
Background: The work will be done in the context of the large ERC project DIADEM (Domain-centric Intelligent Automated Data Extraction Methodology), whose goal is to automate web data extraction in specific application domains such as real estate and restaurants.
Principal goal of the MSc or Honour School project:
OXPath (Oxford XPath) is an extension of XPath, introduced in the DIADEM project, for navigating web pages and extracting
data from them, including pages that require interaction with web forms.
It is a fundamental part of the DIADEM project, used mainly
in the runtime phase.
A single OXPath expression can automatically populate and query a web form and process the information
contained in the result pages.
This proposal aims at designing and implementing a tool that automatically generates
OXPath expressions and also wraps the retrieved data, is easy to use, and ensures that the generated expressions
are correct.
The tool consists mainly of an advanced Graphical User Interface (GUI) for OXPath. We envisage the
user filling in a web form in the browser as usual, while the tool records the operations involved and finally generates
the corresponding OXPath expressions automatically.
Once the result pages are displayed, the tool lets the user visually specify which elements should be extracted, and produces the corresponding OXPath expression.
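To make the record-then-generate idea concrete, the following is a minimal Java sketch of how recorded user operations might be assembled into an expression. It is an illustration under stated assumptions, not the design of the tool: the class ActionRecorder, its methods, and the example URL and locators are all hypothetical, and the action and extraction-marker notation it emits only approximates OXPath syntax and should be checked against the OXPath language documentation.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical recorder (not part of OXPath or DIADEM): collects the
 * operations a user performs and emits an OXPath-style expression.
 * The emitted notation is an approximation of OXPath syntax.
 */
final class ActionRecorder {
    private final String startUrl;
    private final List<String> steps = new ArrayList<>();

    ActionRecorder(String startUrl) {
        this.startUrl = startUrl;
    }

    /** Record typing a value into the field found by the given locator.
     *  Assumes the value contains no double quotes (no escaping is done). */
    ActionRecorder type(String locator, String value) {
        steps.add(locator + "/{\"" + value + "\"}");
        return this;
    }

    /** Record a click on the element found by the given locator. */
    ActionRecorder click(String locator) {
        steps.add(locator + "/{click /}");
        return this;
    }

    /** Record that the text of the located element should be extracted
     *  under the given attribute name. */
    ActionRecorder extract(String locator, String name) {
        steps.add(locator + ":<" + name + "=string(.)>");
        return this;
    }

    /** Concatenate the recorded steps, in order, after the start URL. */
    String toExpression() {
        return "doc(\"" + startUrl + "\")" + String.join("", steps);
    }

    public static void main(String[] args) {
        String expression = new ActionRecorder("http://example.com/search")
                .type("//input[@name='q']", "Oxford")
                .click("/following::input[@type='submit'][1]")
                .extract("//div[@class='result']", "record")
                .toExpression();
        System.out.println(expression);
    }
}
```

In this sketch each GUI event would call one of the recording methods, and the locators would be derived from the element the user interacted with; how robust locators are chosen is one of the design questions the project would have to address.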
Skills Needed: This project requires good analytical and software engineering skills, and involves technologies such as Java (Swing/SWT) and XPath.
Supervision: This project will be co-supervised by Dr. Giovanni Grasso and Dr. Tim Furche.
