Domain-Based Structural Web Page Analysis and Data Extraction
Supervisor:
Suitable for: Mathematics and Computer Science
Abstract
Background: There are several projects available. The work will be done in the context of the large ERC project DIADEM (Domain-centric Intelligent Automated Data Extraction Methodology), whose goal is to automate web data extraction in specific application domains such as real estate, restaurants, and so on. Ultimately, we want to construct a system that can automatically navigate Web pages within a given application domain and extract the relevant data from those pages. For example, a system for the real-estate domain should accept as input the URL of an estate agent and output all properties (houses or flats) that this agent currently advertises for sale. The output should be a highly structured XML file obeying a pre-defined schema.
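To make the intended input and output concrete, the following is a minimal Java sketch of the final serialisation step only. It is not DIADEM code. The class PropertyXmlSketch, the Property fields, and the element names (properties, property, address, price) are all assumptions made for illustration, since the actual schema is not given here. The sketch assumes that extraction has already produced a list of property records for one agent URL.

```java
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class PropertyXmlSketch {

    /** Hypothetical record for one advertised property; the field set is an assumption. */
    public static final class Property {
        final String type;    // e.g. "house" or "flat"
        final String address;
        final long price;

        public Property(String type, String address, long price) {
            this.type = type;
            this.address = address;
            this.price = price;
        }
    }

    /** Serialises the extracted properties of one agent into a single XML string. */
    public static String toXml(String agentUrl, List<Property> properties) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            // Element and attribute names below are illustrative, not the project's schema.
            Element root = doc.createElement("properties");
            root.setAttribute("agent", agentUrl);
            doc.appendChild(root);

            for (Property p : properties) {
                Element property = doc.createElement("property");
                property.setAttribute("type", p.type);

                Element address = doc.createElement("address");
                address.setTextContent(p.address);
                property.appendChild(address);

                Element price = doc.createElement("price");
                price.setTextContent(Long.toString(p.price));
                property.appendChild(price);

                root.appendChild(property);
            }

            // An identity transform writes the DOM tree out as text.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialise properties", e);
        }
    }

    public static void main(String[] args) {
        List<Property> extracted = Arrays.asList(
                new Property("flat", "1 Example Road", 250000L),
                new Property("house", "2 Sample Street", 480000L));
        System.out.println(toXml("http://agent.example/", extracted));
    }
}
```

Running main prints one properties element containing a property element per record, each with an address and a price child. A real system would additionally validate the result against the pre-defined schema.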
Principal goal of the MSc or Honour School project: Contribute to the DIADEM project by providing useful studies and building blocks. Examples of such studies or building blocks are:
- A characterisation of typical patterns on web pages of a certain application domain.
- A characterisation of access patterns for search masks and navigation sequences on web pages belonging to specific domains.
- Software for the automatic construction of dictionaries of domain-specific terms extracted from Web pages.
- Software for identifying compound data objects on domain-specific web pages.
- Implementation of building blocks of a reasoner over web data objects.
- Selection and integration of existing annotators for Web page analysis.
- Definition of new annotators.
- Implementation of a scheduler for highly parallel web data extraction through cloud computing.
- Testing components of the DIADEM system.
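As an illustration of the dictionary-construction item in the list above, here is a minimal Java sketch. It is not the project's method; it swaps in a simple frequency-ratio heuristic. A term is kept if it occurs at least minCount times in the domain texts and its relative frequency there is at least ratio times its smoothed relative frequency in a background corpus. The class name DomainDictionarySketch, both parameters, and the sample texts are assumptions, and the input is assumed to be plain text already extracted from Web pages.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DomainDictionarySketch {

    /** Counts lower-cased alphabetic tokens across a list of plain-text documents. */
    static Map<String, Integer> countTerms(List<String> documents) {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : documents) {
            for (String token : doc.toLowerCase().split("[^a-z]+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    /** Sums all counts in a term-frequency map. */
    static double total(Map<String, Integer> counts) {
        double sum = 0;
        for (int c : counts.values()) {
            sum += c;
        }
        return sum;
    }

    /**
     * Returns, in alphabetical order, the terms that occur at least minCount times
     * in the domain texts and are at least `ratio` times more frequent there than
     * in the background texts (heuristic chosen for illustration only).
     */
    public static List<String> domainTerms(List<String> domainDocs, List<String> backgroundDocs,
                                           int minCount, double ratio) {
        Map<String, Integer> domain = countTerms(domainDocs);
        Map<String, Integer> background = countTerms(backgroundDocs);
        double domainTotal = total(domain);
        double backgroundTotal = total(background);

        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : domain.entrySet()) {
            if (entry.getValue() < minCount) {
                continue;
            }
            double domainFreq = entry.getValue() / domainTotal;
            // Add-one smoothing avoids division by zero for terms unseen in the background.
            double backgroundFreq =
                    (background.getOrDefault(entry.getKey(), 0) + 1) / (backgroundTotal + 1);
            if (domainFreq / backgroundFreq >= ratio) {
                result.add(entry.getKey());
            }
        }
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        List<String> domainDocs = Arrays.asList(
                "Two bedroom flat with garden",
                "Three bedroom house with garden and garage");
        List<String> backgroundDocs = Arrays.asList(
                "The weather is fine with sun",
                "A house of cards and a game with friends");
        System.out.println(domainTerms(domainDocs, backgroundDocs, 2, 2.0));
    }
}
```

With these sample texts, main prints [bedroom, garden]: the word "with" is frequent in both corpora and is filtered out. A practical dictionary builder would need much larger corpora, multi-word terms, and a more robust statistic.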
Note: The DIADEM project runs from 2010 to 2015. Every year only a few MSc or Honour School projects will be available. Some will be co-supervised by DIADEM research staff.
Skills Needed: Each of these projects requires good analytic and theoretical skills, software engineering skills, knowledge of web programming and web data processing, and good knowledge of Java and Eclipse.
