The functioning of entities as diverse as enterprises and government agencies depends on obtaining high-quality data. Increasingly, these entities rely on external sources for their operational data: critical data is obtained dynamically via web services, extracted from web pages, or purchased from third parties. These sources can differ radically in their completeness, accuracy, and availability. It is not feasible for applications to index and explore the data from each source in advance of querying: there are too many sources, they are too costly to access, and the data they hold may be refreshed constantly.
How should data acquisition proceed in such situations?
In this project we will develop algorithms for answering queries in the presence of large numbers of web-based data sources that may overlap substantially in the data they hold but differ in their access restrictions and costs. Our approach will make use of schema information about the data an application is querying: data format, integrity constraints, and any prior knowledge of access costs that may be available. The core of the project will be algorithms that answer a query by interactively exploring the sources, dynamically pruning irrelevant or exhausted sources in the process.
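The interactive exploration described above can be illustrated with a minimal sketch. Everything here is hypothetical: the `Source` class, its per-request `cost`, the batch-oriented `fetch` interface, and the cheapest-first ordering are illustrative assumptions, not the project's actual design. The sketch only shows the general shape of the idea: repeatedly query the most attractive remaining source, accumulate answers, and prune sources that can no longer contribute.

```python
from dataclasses import dataclass


@dataclass
class Source:
    """A hypothetical web source with a per-request cost and a finite dataset."""
    name: str
    cost: float        # assumed flat cost per access
    records: list      # the records this source can return
    cursor: int = 0    # how many records have been fetched so far

    def fetch(self, batch=2):
        """Return the next batch of records; empty once the source is exhausted."""
        out = self.records[self.cursor:self.cursor + batch]
        self.cursor += len(out)
        return out

    @property
    def exhausted(self):
        return self.cursor >= len(self.records)


def answer_query(sources, predicate, needed):
    """Interactively explore sources, cheapest first, pruning exhausted ones.

    Stops once `needed` distinct answers satisfying `predicate` are found,
    and reports the total access cost spent along the way.
    """
    answers, spent = set(), 0.0
    active = sorted(sources, key=lambda s: s.cost)  # prefer cheap sources
    while active and len(answers) < needed:
        src = active[0]
        spent += src.cost
        answers.update(r for r in src.fetch() if predicate(r))
        # prune: drop sources that can no longer contribute
        active = [s for s in active if not s.exhausted]
    return answers, spent
```

Because the sources overlap, a run may pay a more expensive source only for records it has already seen from a cheaper one; a fuller treatment would use schema information and integrity constraints to predict such redundancy and prune overlapping sources before paying for them.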