University of Oxford, Department of Computer Science

Domain-Based Structural Web Page Analysis and Data Extraction

Supervisor

Suitable for

Abstract

 

Background: There are several projects available. The work will be done in the context of the large ERC project DIADEM (Domain-centric Intelligent Automated Data Extraction Methodology), whose goal is to automate web data extraction in specific application domains such as real estate and restaurants. Ultimately, we want to construct a system that can automatically navigate web pages within a given application domain and extract the relevant data from those pages. For example, a system for the real-estate domain should accept as input the URL of an estate agent and output all properties (houses or flats) that this agent currently advertises for sale. The output should be a highly structured XML file obeying a predefined schema.
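
To make the intended output concrete, here is a minimal sketch of the final serialisation step only: turning already-extracted property records into an XML document. It is not DIADEM code and does not perform any extraction. The Property record, the element and attribute names, the sample data and the agent URL are all hypothetical; the real pre-defined schema is not given in this description. Python is used purely for brevity.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List


@dataclass
class Property:
    # Hypothetical record for one advertised property; these field names
    # are illustrative and are not taken from the DIADEM schema.
    kind: str        # e.g. "house" or "flat"
    address: str
    price_gbp: int


def to_xml(agent_url: str, properties: List[Property]) -> str:
    """Serialise extracted properties into a small XML document."""
    # Assumed layout: one <properties> root carrying the agent URL,
    # with one <property> child per advertised house or flat.
    root = ET.Element("properties", {"agent": agent_url})
    for p in properties:
        node = ET.SubElement(root, "property", {"type": p.kind})
        ET.SubElement(node, "address").text = p.address
        ET.SubElement(node, "price", {"currency": "GBP"}).text = str(p.price_gbp)
    return ET.tostring(root, encoding="unicode")


# Hand-written sample data standing in for the result of the extraction step.
sample = [
    Property("house", "1 Example Road, Oxford", 450000),
    Property("flat", "2 Sample Street, Oxford", 220000),
]
xml_text = to_xml("http://agent.example/", sample)
print(xml_text)
```

In a full system, the hard part is everything before this step: navigating the agent's site and recognising which page fragments correspond to which schema fields.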

Principal goal of the MSc or Honour School project: Contribute to the DIADEM project by providing useful studies and building blocks. Examples of such studies or building blocks are:

Note: The DIADEM project runs from 2010 to 2015. Each year, only a few MSc or Honour School projects will be available. Some will be co-supervised by DIADEM research staff.

Skills Needed: Each of these projects requires good analytic and theoretical skills, software engineering skills, knowledge of web programming and web data processing, and good knowledge of Java and Eclipse.