Fully-visual Data Extraction from Result Pages

Supervisor

Tim Furche

Suitable for

MSc in Computer Science

Mathematics and Computer Science, Part C

Computer Science, Part C

Computer Science, Part B

Abstract

(Joint supervision with G Grasso and C Schallhart)

In DIADEM (diadem-project.info), we perform automatic data extraction from result pages, exploiting domain annotations to bootstrap the detection of repeated structures (data records) on the page. In our current prototype, the similarity analysis relies on the page structure (DOM) alone, without incorporating visual cues such as style and layout. As a complementary method, this MSc thesis aims to investigate new approaches to automatic data record extraction from template pages in a given domain, employing only visual properties and domain annotations. Starting from annotations on the page that identify, e.g., prices, product names, or locations, we want to recognize data records by visual similarity, discovered in geometric relations, such as alignment or box dimensions, and in style information, such as font, color, or other CSS properties. Most existing approaches to visual data extraction differ from our goal, either because they combine visual information with the DOM structure or because their extraction process is not guided by domain annotations at all. The outcome of the thesis is a method for visual data record recognition, implemented in a prototype that takes domain annotations as a parameter.

This project requires an intensive preliminary study of related work in web data extraction. The student will be working with existing tools developed as part of the DIADEM framework, such as APIs for browser interaction and text annotations. The development language is Java; knowledge of HTML and CSS is highly recommended, with JavaScript and XPath as a plus.
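To make the idea of annotation-guided visual similarity more concrete, the following is a minimal Java sketch, not the DIADEM implementation. It assumes each annotated element has already been reduced to a rendered box with a position, a size, a few style properties, and an annotation type. The names Box, similarity, and group, the pixel tolerance, and the similarity threshold are all hypothetical choices made for illustration; none of them comes from the DIADEM APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class VisualRecordSketch {

    /** Hypothetical rendered box: geometry, style, and the domain annotation it carries. */
    static final class Box {
        final double x, y, width, height, fontSize;
        final String fontFamily, color, annotation;

        Box(double x, double y, double width, double height,
            String fontFamily, double fontSize, String color, String annotation) {
            this.x = x; this.y = y; this.width = width; this.height = height;
            this.fontFamily = fontFamily; this.fontSize = fontSize;
            this.color = color; this.annotation = annotation;
        }
    }

    /** Fraction of six visual cues (left alignment, width, height, font family, font size, color) that agree. */
    static double similarity(Box a, Box b, double tolerancePx) {
        int matches = 0;
        if (Math.abs(a.x - b.x) <= tolerancePx) matches++;           // left edges aligned
        if (Math.abs(a.width - b.width) <= tolerancePx) matches++;   // similar box width
        if (Math.abs(a.height - b.height) <= tolerancePx) matches++; // similar box height
        if (a.fontFamily.equals(b.fontFamily)) matches++;
        if (Double.compare(a.fontSize, b.fontSize) == 0) matches++;
        if (a.color.equals(b.color)) matches++;
        return matches / 6.0;
    }

    /**
     * Greedy grouping: a box joins the first group whose first member has the
     * same annotation type and a similarity of at least the threshold;
     * otherwise it starts a new group.
     */
    static List<List<Box>> group(List<Box> boxes, double tolerancePx, double threshold) {
        List<List<Box>> groups = new ArrayList<>();
        for (Box box : boxes) {
            List<Box> target = null;
            for (List<Box> candidate : groups) {
                Box representative = candidate.get(0);
                if (representative.annotation.equals(box.annotation)
                        && similarity(representative, box, tolerancePx) >= threshold) {
                    target = candidate;
                    break;
                }
            }
            if (target == null) {
                target = new ArrayList<>();
                groups.add(target);
            }
            target.add(box);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Three price boxes in a result column, plus one visually different price in a banner.
        List<Box> boxes = Arrays.asList(
            new Box(20, 100, 80, 18, "Arial", 14, "#c00", "price"),
            new Box(20, 260, 80, 18, "Arial", 14, "#c00", "price"),
            new Box(21, 420, 79, 18, "Arial", 14, "#c00", "price"),
            new Box(400, 40, 120, 30, "Georgia", 22, "#000", "price"));
        for (List<Box> g : group(boxes, 2.0, 0.8)) {
            System.out.println(g.size() + " box(es), first at x=" + g.get(0).x);
        }
    }
}
```

In this example the three aligned, identically styled price boxes fall into one group, while the banner price forms a group of its own; each box in the larger group would then be a candidate anchor for one data record. A real method would need a more careful similarity measure and a way to assemble several annotated fields into a record, which is the subject of the project.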
