Fishing for Forms with Segmentation and Domain Knowledge
Xiaonan Guo
Info
Date: 3rd May 2011 (week 1, Trinity Term 2011)
Time: 11:30
Place: 147
Abstract
Today, accessing web data has become a complex task due to the sheer amount of available information and the increasing sophistication of web interfaces. This situation fosters ad-hoc approaches for web form analysis that are inherently tedious to develop and maintain.
We address this challenge by introducing a domain-aware methodology for web form analysis which leverages four models: (i) The browser model logically represents a web page, enriched with visual information. (ii) The segmentation model groups related form elements, such as fields and labels. (iii) The annotation model represents linguistic annotations and machine-learning-based classifications, relying on domain-specific knowledge. (iv) The domain model describes the conceptual entities occurring on forms, e.g., a product price. Hence, we define the task of form analysis in terms of a segmentation mapping, leading from (i) to (ii), and a phenomenological mapping, leading from the joined results of (ii) and (iii) to (iv). In our implementation, both mappings are given as logical rules. To instantiate our approach for a given domain, one only needs to adapt the phenomenological mapping via a well-defined process, as these rules are derived from fixed schemata. We apply our approach in a detailed case study on the real estate domain and further demonstrate its versatility on a number of other domains.
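As a rough illustration of how the four models and two mappings fit together, the following Python sketch runs a toy real estate form through the pipeline. It is not the implementation presented in the talk, which expresses both mappings as logical rules. Every name here (`Element`, `Segment`, `segment`, `annotate`, `interpret`, `GAZETTEER`, `DOMAIN_RULES`), the "nearest label to the left on the same row" heuristic, and the keyword lookup standing in for linguistic annotation are assumptions made purely for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """Browser-model node with visual position (hypothetical schema)."""
    ident: str
    kind: str   # "label" or "field"
    text: str
    x: int
    y: int


@dataclass(frozen=True)
class Segment:
    """Segmentation-model entry: a label grouped with its field."""
    label: Element
    field: Element


def segment(elements, row_tolerance=5):
    """Segmentation mapping, (i) -> (ii).

    Assumed heuristic: pair each field with the closest label that sits
    to its left on roughly the same row.
    """
    labels = [e for e in elements if e.kind == "label"]
    segments = []
    for field in (e for e in elements if e.kind == "field"):
        candidates = [
            lab for lab in labels
            if abs(lab.y - field.y) <= row_tolerance and lab.x < field.x
        ]
        if candidates:
            nearest = max(candidates, key=lambda lab: lab.x)
            segments.append(Segment(nearest, field))
    return segments


# Toy gazetteer standing in for domain-specific annotators (assumption).
GAZETTEER = {"price": "PRICE", "rent": "PRICE", "bedrooms": "BEDROOMS"}


def annotate(elements):
    """Annotation model (iii): element id -> set of annotation types."""
    annotations = {}
    for e in elements:
        words = e.text.lower().replace(":", " ").split()
        types = {GAZETTEER[w] for w in words if w in GAZETTEER}
        if types:
            annotations[e.ident] = types
    return annotations


# Domain-specific part: annotation type -> domain concept (assumption).
DOMAIN_RULES = {"PRICE": "price", "BEDROOMS": "bedroom_count"}


def interpret(segments, annotations):
    """Phenomenological mapping, (ii) + (iii) -> (iv)."""
    result = {}
    for seg in segments:
        for ann in sorted(annotations.get(seg.label.ident, ())):
            concept = DOMAIN_RULES.get(ann)
            if concept:
                result[seg.field.ident] = concept
    return result


page = [
    Element("l1", "label", "Max price:", 10, 100),
    Element("f1", "field", "", 120, 100),
    Element("l2", "label", "Bedrooms", 10, 140),
    Element("f2", "field", "", 120, 142),
    Element("f3", "field", "", 120, 300),  # no label on this row
]

segments = segment(page)
fields = interpret(segments, annotate(page))
print(fields)
```

In this sketch only `GAZETTEER` and `DOMAIN_RULES` would change when moving to another domain, which loosely mirrors the abstract's point that only the phenomenological mapping needs to be adapted.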