Automatically learning gazetteers from the deep web

Tim Furche, Giovanni Grasso, Giorgio Orsi, Christian Schallhart and Cheng Wang

Abstract

Wrapper induction faces a dilemma: to reach web scale, it requires automatically generated examples, but to produce accurate results, these examples must have the quality of human annotations. We resolve this conflict with AMBER, a system for fully automated data extraction from result pages. In contrast to previous approaches, AMBER employs domain-specific gazetteers to discern basic domain attributes on a page, and leverages repeated occurrences of similar attributes to group related attributes into records rather than relying on the noisy structure of the DOM. With this approach, AMBER is able to identify records and their attributes with almost perfect accuracy on a large sample of websites. To make such an approach feasible at scale, AMBER automatically learns domain gazetteers from a small seed set. In this demonstration, we show how AMBER uses the repeated structure of records on deep web result pages to learn such gazetteers. This is only possible with a highly accurate extraction system. Depending on its parametrization, this learning process runs either fully automatically or with human interaction. We show how AMBER bootstraps a gazetteer for UK locations in 4 iterations: from a small seed sample, we achieve high accuracy in recognizing UK locations in the 4th iteration.
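
The bootstrapping idea in the abstract can be illustrated with a minimal Python sketch. This is not AMBER's implementation. It assumes that records have already been extracted as field-to-text dictionaries, and the function name `learn_gazetteer`, the `min_support` threshold, and the sample data are all hypothetical. If enough records on a page carry a known term in the same field, the sketch treats every value of that field on the page as a new gazetteer candidate and repeats for a fixed number of iterations.

```python
from collections import Counter


def learn_gazetteer(pages, seed, iterations=4, min_support=0.5):
    """Grow a gazetteer from a seed set (illustrative sketch, not AMBER itself).

    pages: list of pages; each page is a list of records (dict: field -> text).
    min_support: hypothetical threshold, the fraction of records on a page
    whose field value must already be known before the field is trusted.
    """
    gazetteer = {term.lower() for term in seed}
    for _ in range(iterations):
        new_terms = set()
        for records in pages:
            if not records:
                continue
            # Count, per field, how many records hold an already known term.
            hits = Counter()
            for record in records:
                for field, value in record.items():
                    if value.lower() in gazetteer:
                        hits[field] += 1
            # A field matched often enough is assumed to hold the attribute
            # in every record of the page, so all its values become terms.
            for field, count in hits.items():
                if count / len(records) >= min_support:
                    for record in records:
                        value = record.get(field)
                        if value:
                            new_terms.add(value.lower())
        if new_terms <= gazetteer:
            break  # nothing new was learned; stop early
        gazetteer |= new_terms
    return gazetteer


# Made-up sample data: two result pages with differently named fields.
pages = [
    [{"loc": "Oxford", "price": "500"}, {"loc": "Cambridge", "price": "600"}],
    [{"town": "Cambridge", "beds": "2"}, {"town": "Leeds", "beds": "3"}],
]
learned = learn_gazetteer(pages, seed=["Oxford"])
```

With this sample, the first iteration learns "cambridge" from the first page, and the second iteration uses it to learn "leeds" from the second page. A human-in-the-loop variant could review `new_terms` before merging them into the gazetteer.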

Book Title
Proc. of the 21st World Wide Web Conf. (WWW Companion Volume)
Note
Demonstration
Pages
341–344
Year
2012