University of Oxford, Department of Computer Science

Fishing for Forms with Segmentation and Domain Knowledge

Xiaonan Guo

Info

Date

3rd May 2011 (week 1, Trinity Term 2011)

Time

11:30

Place

147

Abstract


Today, accessing web data has become a complex task due to the sheer amount of available information and the increasing sophistication of web interfaces. This situation fosters ad-hoc approaches for web form analysis that are inherently tedious to develop and maintain.

We address this challenge by introducing a domain-aware methodology for web form analysis that leverages four models: (i) the browser model logically represents a web page, enriched with visual information; (ii) the segmentation model groups related form elements, such as fields and labels; (iii) the annotation model represents linguistic annotations and machine-learning-based classifications, relying on domain-specific knowledge; and (iv) the domain model describes the conceptual entities occurring on forms, e.g. a product price. We then define the task of form analysis in terms of two mappings: a segmentation mapping, leading from (i) to (ii), and a phenomenological mapping, leading from the joined results of (ii) and (iii) to (iv). In our implementation, both mappings are given as logical rules. To instantiate our approach for a given domain, one only needs to adapt the phenomenological mapping via a well-defined process, as these rules are derived from fixed schemata. We apply our approach in a detailed case study on the real estate domain and further demonstrate its versatility on a number of other domains.
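
To make the flow between the four models more concrete, the following is a minimal Python sketch and not the implementation presented in the talk, which expresses both mappings as logical rules. All names here (`Node`, `Segment`, `LEXICON`, `RULES`, `analyse_form`) are hypothetical, the proximity heuristic and keyword lexicon are stand-ins chosen purely for illustration, and the real estate vocabulary is an assumed example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Browser model (i): a form element with assumed visual coordinates."""
    node_id: str
    kind: str  # "label" or "field"
    text: str
    x: int
    y: int


@dataclass(frozen=True)
class Segment:
    """Segmentation model (ii): a label grouped with the field it describes."""
    label: Node
    field: Node


def segment(nodes):
    """Segmentation mapping (i) -> (ii).

    Toy heuristic (an assumption): pair each field with the closest label
    that is not to the right of or below it, by Manhattan distance.
    """
    labels = [n for n in nodes if n.kind == "label"]
    segments = []
    for field in (n for n in nodes if n.kind == "field"):
        candidates = [l for l in labels if l.x <= field.x and l.y <= field.y]
        if candidates:
            best = min(
                candidates,
                key=lambda l: abs(field.x - l.x) + abs(field.y - l.y),
            )
            segments.append(Segment(best, field))
    return segments


# Annotation model (iii): a hypothetical keyword lexicon standing in for
# linguistic annotations and learned classifiers.
LEXICON = {"price": "money", "rent": "money", "bedrooms": "count"}


def annotate(nodes):
    """Return {label node_id: set of annotation types found in its text}."""
    return {
        n.node_id: {t for word, t in LEXICON.items() if word in n.text.lower()}
        for n in nodes
        if n.kind == "label"
    }


# Phenomenological mapping rules (hypothetical): annotation type -> concept
# of the domain model (iv). Only this table would change per domain.
RULES = {"money": "PRICE", "count": "BEDROOM_COUNT"}


def analyse_form(nodes):
    """Join (ii) and (iii), then apply RULES to assign domain concepts (iv)."""
    annotations = annotate(nodes)
    result = {}
    for seg in segment(nodes):
        for ann in sorted(annotations.get(seg.label.node_id, ())):
            if ann in RULES:
                result[seg.field.node_id] = RULES[ann]
                break
    return result
```

In this sketch, adapting to a new domain would mean replacing `LEXICON` and `RULES` while leaving `segment` untouched, which mirrors the abstract's claim that only the phenomenological mapping needs to change.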
