University of Oxford, Department of Computer Science

Fully-visual Data Extraction from Result Pages

Supervisor

Suitable for

Abstract

(Joint supervision with G Grasso and C Schallhart)

In DIADEM (diadem-project.info), we perform automatic data extraction from result pages, exploiting domain annotations to bootstrap the detection of repeated structures (data records) on the page. In our current prototype, the similarity analysis relies on the page structure (DOM) and does not incorporate visual cues such as style and layout. As a complementary method, this MSc thesis aims to investigate new approaches to automatic data record extraction from template pages in a given domain, employing only visual properties and domain annotations. Starting from annotations on the page that identify, e.g., prices, product names, or locations, we want to recognize data records by visual similarity, discovered in geometric relations, such as alignment or box dimension, and in style information, such as font, color, or other CSS properties. Most existing approaches to visual data extraction differ from our goal, either because they combine visual information with the DOM structure or because their extraction process is not guided by domain annotations at all. The outcome of the thesis is a method for visual data record recognition, implemented in a prototype that is ready to work with domain annotations as a parameter.

This project requires an intensive preliminary study of related work in web data extraction. The student will be working with existing tools developed as part of the DIADEM framework, such as APIs for browser interaction and text annotations. The development language is Java; knowledge of HTML and CSS is highly recommended, with JavaScript and XPath as a plus.
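To make the idea of visual similarity more concrete, the following is a minimal Java sketch, not part of DIADEM and not the method the thesis is expected to produce. It assumes that each annotated element has already been reduced to a rendered box with a position, a size, a font, and a color. The class `VisualRecordSketch`, the type `Box`, the methods `similar` and `group`, and the pixel tolerance are all hypothetical names and choices made for illustration. Boxes are grouped greedily when they carry the same annotation type, share font and color, are left-aligned, and have a similar height.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; every name here is hypothetical, not a DIADEM API. */
class VisualRecordSketch {

    /** A rendered box with a domain annotation (e.g. "price") and a few visual properties. */
    static final class Box {
        final String annotation;
        final int x, y, width, height;
        final String font;
        final String color;

        Box(String annotation, int x, int y, int width, int height, String font, String color) {
            this.annotation = annotation;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
            this.font = font;
            this.color = color;
        }
    }

    /**
     * Assumed similarity test: same annotation type, same font and color,
     * left edges aligned and heights equal up to a tolerance in pixels.
     */
    static boolean similar(Box a, Box b, int tolerance) {
        return a.annotation.equals(b.annotation)
                && a.font.equals(b.font)
                && a.color.equals(b.color)
                && Math.abs(a.x - b.x) <= tolerance
                && Math.abs(a.height - b.height) <= tolerance;
    }

    /**
     * Greedy grouping: each box joins the first group whose first member
     * it resembles, otherwise it starts a new group.
     */
    static List<List<Box>> group(List<Box> boxes, int tolerance) {
        List<List<Box>> groups = new ArrayList<>();
        for (Box box : boxes) {
            List<Box> target = null;
            for (List<Box> candidate : groups) {
                if (similar(candidate.get(0), box, tolerance)) {
                    target = candidate;
                    break;
                }
            }
            if (target == null) {
                target = new ArrayList<>();
                groups.add(target);
            }
            target.add(box);
        }
        return groups;
    }
}
```

Under these assumptions, a large group of aligned, identically styled price boxes would hint at one data record per box, while an isolated price with a different style (for instance in a sidebar) would fall into its own group. A real approach would likely need richer geometric relations and more CSS properties than this sketch considers.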