Clean up your Extraction: Deduplication for Data Extraction

Supervisor

Tim Furche

Suitable for

MSc in Computer Science

Mathematics and Computer Science, Part C

Computer Science, Part C

Computer Science, Part B

Abstract

Deduplication is the problem of identifying and removing duplicates in data. The problem is well studied in data integration and cleaning, but data is now increasingly extracted automatically from web sources. This yields noisy data that existing deduplication algorithms do not handle easily. However, the extraction process can often supply strong clues as to whether two data items are the same, e.g., the URL of a canonical page describing that item.

In this project, we will investigate deduplication in the context of data extraction. First, we will consider deduplication within a site, both in the presence and in the absence of unique item URLs. Second, we will consider deduplication of results extracted from different sites, with the aim of aligning the record structures from both sites where possible in order to find duplicates.
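To make the within-site setting concrete, the following is a minimal sketch in Python and not a method prescribed by the project. It assumes each extracted record is a dictionary with a "title" field and an optional "url" field. These field names, the helper functions and the 0.9 similarity threshold are all hypothetical choices made for illustration. Two records that both carry a URL are compared by normalised URL. If either lacks a URL, the sketch falls back to fuzzy title similarity.

```python
from difflib import SequenceMatcher
from urllib.parse import urlsplit, urlunsplit


def normalise_url(url):
    """Lower-case the host, drop the fragment and any trailing slash.

    Assumption: these differences never distinguish two items on a site.
    """
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def similarity(a, b):
    """Case-insensitive string similarity in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def same_item(a, b, threshold=0.9):
    """Decide whether two records describe the same item.

    Uses the canonical URL when both records have one, otherwise
    compares titles (hypothetical fallback and threshold).
    """
    url_a, url_b = a.get("url"), b.get("url")
    if url_a and url_b:
        return normalise_url(url_a) == normalise_url(url_b)
    return similarity(a["title"], b["title"]) >= threshold


def deduplicate(records, threshold=0.9):
    """Keep the first record of each group of duplicates (quadratic time)."""
    kept = []
    for record in records:
        if not any(same_item(record, k, threshold) for k in kept):
            kept.append(record)
    return kept
```

The pairwise comparison is quadratic in the number of records, which is acceptable for a sketch but would need blocking or indexing on realistic extraction output. The cross-site setting, where record structures must first be aligned, is not covered here.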

For this project some knowledge of databases and data cleaning is useful. The project should be driven by extensive evaluation.
