University of Oxford, Department of Computer Science

DIADEM: Domain-centric Intelligent Automated Data Extraction Methodology

http://diadem.cs.ox.ac.uk/

Do you need to rent a new apartment? Or would you just like to find a restaurant in your area that serves "pasta al pesto" as today’s special? In either case, you would most likely start with a web search. But keyword search is poorly suited to such cases, because you risk being swamped with irrelevant information rather than finding what you want. If all the information were available in structured form, you could find what you are looking for much faster. Search engine providers such as Google, Yahoo! and Microsoft are aware of this and are keenly looking for new methods that automatically recognize and extract data from domain-specific websites with semi-structured content. To date, this problem has not been satisfactorily solved; its solution seems to require a major research breakthrough.


In this project we will tackle precisely this challenge. Our goal is very ambitious. We want to develop domain-specific data extraction systems that take as input the URL of a website in a particular application domain, automatically explore the website, and deliver as output a structured data set containing all the relevant information present on that site. We will provide the logical, algorithmic, and methodological foundations for the knowledge-based extraction of structured data from websites belonging to specific domains, and we will develop two extraction systems for two different domains. To achieve our goal, we will design new methods and algorithms that combine database techniques with methods of knowledge representation and reasoning and with web data extraction techniques. The breakthrough in automatic data extraction that we are striving for would enable a leap forward for two interrelated technologies that are the hottest emerging topics in web search: vertical search, that is, web search in specialized domains, and object search, that is, the search for web data objects rather than web pages.
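
To make the idea of domain-specific extraction concrete, here is a toy sketch in Python. It is not DIADEM's method: the real-estate domain, the regular expressions in `DOMAIN_PATTERNS`, the `ListingParser` class, the `extract_records` function, and the sample page are all hypothetical and chosen only for illustration. The sketch shows semi-structured HTML going in and a list of structured records coming out, with a small amount of hand-written domain knowledge deciding what counts as a price or a bedroom count.

```python
import re
from html.parser import HTMLParser

# Hypothetical domain knowledge for a toy real-estate domain: each attribute
# is recognised by a regular expression. These patterns are illustrative
# assumptions, not rules taken from DIADEM.
DOMAIN_PATTERNS = {
    "price": re.compile(r"£\s?([\d,]+)"),
    "bedrooms": re.compile(r"(\d+)\s*bed(?:room)?s?", re.IGNORECASE),
}


class ListingParser(HTMLParser):
    """Collects the text of every <li> element as one candidate record."""

    def __init__(self):
        super().__init__()
        self.in_item = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_item = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_item = False

    def handle_data(self, data):
        # Text inside nested tags (<b>, <span>) is appended to the same item.
        if self.in_item:
            self.items[-1] += data


def extract_records(html):
    """Turn semi-structured HTML into a list of attribute dictionaries."""
    parser = ListingParser()
    parser.feed(html)
    records = []
    for text in parser.items:
        record = {}
        for name, pattern in DOMAIN_PATTERNS.items():
            match = pattern.search(text)
            if match:
                record[name] = int(match.group(1).replace(",", ""))
        # Items with no recognised attribute are treated as irrelevant.
        if record:
            records.append(record)
    return records


# Made-up sample page for the sketch.
SAMPLE_PAGE = """
<ul>
  <li><b>2 bedroom flat</b>, Jericho, <span>£1,250 pcm</span></li>
  <li>Studio, Cowley Road, £800 pcm</li>
  <li>Contact us for viewings</li>
</ul>
"""

records = extract_records(SAMPLE_PAGE)
print(records)
```

A real system of the kind described above would also have to explore a site automatically and cope with arbitrary page layouts; this sketch only handles a single page whose records are list items.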

For more details and results, see the DIADEM homepage (http://diadem.cs.ox.ac.uk/).

Links

DIADEM Homepage

Web Data Extraction for Online Market Intelligence

Sponsors

ERC (European Research Council)

Info

Duration

1st April 2010 to 31st March 2015

Principal Investigator

People

Tim Furche
(James Martin Fellow (http://www.oxfordmartin.ox.ac.uk/people/236), Fellow of the Oxford Man Institute)
Giorgio Orsi
(Oxford Martin Fellow)
Andrew Sellers

Themes