DIADEM: Domain-centric Intelligent Automated Data Extraction Methodology

Do you need to rent a new apartment? Or would you just like to find a restaurant in your area that serves "pasta al pesto" as today’s special? In either case, you would most likely start with a web search. But keyword search is poorly suited to such tasks, because you risk being swamped with irrelevant information rather than finding what you want. If all the information were available in structured form, you could find what you are looking for much faster. Search engine providers such as Google, Yahoo! and Microsoft are aware of this, and are keenly looking for new methods that automatically recognize and extract data from domain-specific websites with semi-structured content. To date, this problem has not been satisfactorily solved; its solution seems to require a major research breakthrough.

In this project we will tackle precisely this challenge. Our goal is very ambitious. We want to develop domain-specific data extraction systems that take as input the URL of a website in a particular application domain, automatically explore the website, and deliver as output a structured data set containing all the relevant information present on that site. We will provide the logical, algorithmic, and methodological foundations for the knowledge-based extraction of structured data from websites belonging to specific domains, and we will develop two extraction systems for two different domains. To achieve our goal, we will design new methods and algorithms that combine database techniques with methods of knowledge representation and reasoning and with web data extraction techniques. The breakthrough in automatic data extraction that we are striving for would enable a leap forward for two interrelated technologies which are the hottest emerging topics in web search: vertical search, that is, web search in specialized domains, and object search, that is, the search for web data objects rather than web pages.

For more details and results, see the DIADEM homepage.



Web Data Extraction for Online Market Intelligence


ERC (European Research Council)

1st April 2010 to 31st March 2015

Principal Investigator


Tim Furche
(James Martin Fellow, Fellow of the Oxford Man Institute)
Giorgio Orsi
(Oxford Martin Fellow)
Andrew Sellers