Clean up your Extraction: Deduplication for Data Extraction
Abstract
Deduplication is the problem of identifying and removing duplicates in data. Though the problem is well studied in data integration and cleaning, data is nowadays increasingly extracted automatically from web sources. This yields noisy data that is not easily treated with existing deduplication algorithms. However, the extraction process can often supply strong clues on whether two data items are the same, e.g., the URL of a canonical page describing that item.
In this project, we will investigate deduplication in the context of data extraction. First, we will consider deduplication within a site, both in the presence and in the absence of unique item URLs. Second, we will consider deduplication in results extracted from different sites, with the aim of aligning record structures from both sites where possible in order to find duplicates.
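To make the within-site setting concrete, the following is a minimal sketch, not a proposed solution. It assumes records are dictionaries with a hypothetical `title` field and an optional `url` field. When both records carry a URL, the canonicalised URL decides; otherwise the sketch falls back to string similarity on the title. The canonicalisation rule (lower-case host, no query string, no trailing slash), the similarity measure, and the threshold of 0.9 are all illustrative assumptions that a real project would need to evaluate.

```python
from difflib import SequenceMatcher
from urllib.parse import urlsplit


def canonical_url(url):
    """Reduce a URL to host + path (assumption: the query string does not identify the item)."""
    parts = urlsplit(url.strip())
    return parts.netloc.lower() + parts.path.rstrip("/")


def similarity(a, b):
    """Character-based similarity in [0, 1] on normalised strings."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def is_duplicate(a, b, threshold=0.9):
    """Decide by URL when both records have one, otherwise by title similarity."""
    url_a, url_b = a.get("url"), b.get("url")
    if url_a and url_b:
        # Assumption: item URLs are unique within the site.
        return canonical_url(url_a) == canonical_url(url_b)
    return similarity(a["title"], b["title"]) >= threshold


def deduplicate(records, threshold=0.9):
    """Keep the first occurrence of each item (quadratic; fine for a sketch)."""
    kept = []
    for rec in records:
        if not any(is_duplicate(rec, k, threshold) for k in kept):
            kept.append(rec)
    return kept
```

The pairwise comparison is quadratic in the number of records; for larger extractions one would typically restrict comparisons to candidate pairs, for example by blocking on a cheap key. The cross-site setting additionally requires aligning field names between sources before any such comparison, which this sketch does not attempt.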
For this project some knowledge of databases and data cleaning is useful. The project should be driven by extensive evaluation.