PDF-Inspired Visual Data Extraction for HTML

Supervisor

Georg Gottlob

Suitable for

MSc by Research

MSc in Software Engineering (part-time) (and part-time Certificate and Diploma courses)

MSc in Computer Science

Honour School of Computer Science, Part C

Honour School of Computer Science, Part B

Abstract

Background: The work will be done in the context of the large ERC project DIADEM (Domain-centric Intelligent Automated Data Extraction Methodology), whose goal is to automate web data extraction in specific application domains such as real estate and restaurants.

Principal goal of the MSc or Honour School project:

In PDF data extraction, several techniques have been proposed that rely mainly on visual aspects of a document, such as layout, text alignment, and bitmap information. However, very few of them have been applied to HTML documents, mainly because of the differences between HTML and PDF.
This proposal aims at (i) studying state-of-the-art PDF data extraction techniques based on visual information, and (ii) applying the most promising ones to HTML documents.
Computer vision approaches could also be investigated.

Skills Needed: This project requires good theoretical, analytical, and software engineering skills. Good knowledge of Java is also essential.

Supervision: This project will be co-supervised by Dr. Giorgio Orsi.
