SafeStreams: Constraint Processing on XML Streams

The eXtensible Markup Language (XML) has become a ubiquitous format for exchanging data. Enterprise data from industries as diverse as finance, healthcare, and genomics are routinely exchanged as XML. Much of this XML-encoded information has to be queried on-the-fly as it arrives -- that is, as an XML stream. News and event information, for example, is available in the form of XML feeds; applications that react to these events must process the feeds in streaming fashion. Communication and messaging protocols also make use of XML, and the corresponding protocol handlers are thus also XML stream-processors.

A crucial aspect of processing any form of data is validation: before data is made available to applications, it must be in a "sane" state. When data is exchanged over networks, corruption is ubiquitous, because messages are received from untrusted or even unknown parties. Indeed, much or even most of the data sent to web-accessible application servers may come from malicious or compromised hosts.

The XML community has already developed standardized means for describing constraints on the structure of XML documents. On the one hand, there are schema-based constraints, such as Document Type Definitions (DTDs), which limit the tags that can occur within a document. On the other hand, qualifiers in the XML query language XPath provide a more flexible method for adding application-specific constraints. But how can a firewall enforce these constraints efficiently on large collections of parallel feeds? This is a critical issue, whether the XML streams represent signalling messages, event feeds, or web service calls. This project will study which constraints can and cannot be enforced efficiently, and will provide tools and technologies to effectively monitor XML streams for violations of both schema constraints and application-specific constraints.
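To make the idea of monitoring a stream concrete, here is a minimal sketch in Python of checking one simple schema-style constraint (which child tags may appear inside which parent) while the XML arrives in chunks. It is an illustration only, not the project's tooling: the tag names, the ALLOWED_CHILDREN table, and the function first_violation are all hypothetical, and a real DTD also constrains the order and number of children, which this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema (an assumption for illustration): for each element tag,
# the set of child tags it is allowed to contain.
ALLOWED_CHILDREN = {
    "feed": {"event"},
    "event": {"title", "time"},
    "title": set(),
    "time": set(),
}


def first_violation(chunks, allowed=ALLOWED_CHILDREN):
    """Return a description of the first violation seen, or None if there is none."""
    parser = ET.XMLPullParser(events=("start", "end"))
    stack = []  # tags of the elements that are currently open
    try:
        for chunk in chunks:
            parser.feed(chunk)
            for event, elem in parser.read_events():
                if event == "start":
                    if elem.tag not in allowed:
                        return f"unknown element <{elem.tag}>"
                    if stack and elem.tag not in allowed[stack[-1]]:
                        return f"<{elem.tag}> not allowed inside <{stack[-1]}>"
                    stack.append(elem.tag)
                else:
                    stack.pop()
                    elem.clear()  # drop the text and children of a finished element
        parser.close()
    except ET.ParseError as exc:
        return f"malformed XML: {exc}"
    return None
```

The check reports a violation as soon as the offending start tag has been read, without waiting for the rest of the document, and the only state it keeps between chunks is the stack of open tags. For example, first_violation(["<feed><title>x</title></feed>"]) reports that <title> is not allowed inside <feed>.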

Researchers: Michael Benedikt, Gabriele Puppis, Christian Riveros, Milutin Kristofic and Alan Jeffrey (Bell Labs).

SafeStreams is sponsored by the UK Engineering and Physical Sciences Research Council (EPSRC).

WebPlan: Query-Driven Data Acquisition from Web-Based Data Sources

The functioning of entities as diverse as enterprises and government agencies depends on obtaining high-quality data. Increasingly these entities depend on external sources for their operational data: critical data is obtained dynamically via web services, is extracted from web pages, or is purchased from third parties. These sources can differ radically in their completeness, accuracy, and availability. It is not possible for applications to index and explore data from each source in advance of querying: there are too many sources, they are too costly to access, and the data in them may be refreshed constantly.

How should data acquisition proceed in such situations? In this project we will develop algorithms for answering queries in the presence of large numbers of web-based data sources, sources that may overlap substantially in their datasets but have different access restrictions and costs. Our approach will make use of schema information about the data an application is querying: data format, integrity constraints, and any prior knowledge of costs that may be available. The core of the project will be algorithms for answering a query by interactively exploring the sources, dynamically pruning out irrelevant or exhausted sources in the process.
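The exploration loop described above can be illustrated with a toy sketch in Python. It is not the project's algorithm: the Source class, its fields, the greedy cheapest-first policy, and the acquire function are all assumptions made for illustration. Each source has a known per-access cost, a set of attributes taken from schema information, and a list of pages that stands in for the responses of a remote service.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    """Hypothetical stand-in for a remote web-based data source."""
    name: str
    cost: float            # assumed known cost of one access
    attributes: frozenset  # schema information: attributes the source exposes
    pages: list = field(default_factory=list)  # one page of records per access

    @property
    def exhausted(self):
        return not self.pages

    def access(self):
        """Return the next page of records."""
        return self.pages.pop(0)


def acquire(sources, needed_attrs, predicate, wanted, budget):
    """Collect up to `wanted` distinct matching records without exceeding `budget`."""
    # Prune sources that are irrelevant (schema lacks a needed attribute) or empty.
    live = [s for s in sources
            if needed_attrs <= s.attributes and not s.exhausted]
    answers, spent = set(), 0.0
    while live and len(answers) < wanted:
        src = min(live, key=lambda s: s.cost)  # greedy: cheapest live source first
        if spent + src.cost > budget:
            break  # even the cheapest remaining access is unaffordable
        spent += src.cost
        answers.update(r for r in src.access() if predicate(r))
        if src.exhausted:
            live.remove(src)  # dynamically prune a source with nothing left
    return answers, spent
```

Overlap between sources is handled here only by collecting answers in a set, so a record returned by two sources is counted once. A realistic planner would also have to account for access restrictions and integrity constraints, which this sketch leaves out.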

Researchers: Michael Benedikt, Andrea Cali and Pierre Senellart (Telecom ParisTech).

WebPlan is sponsored by the UK Engineering and Physical Sciences Research Council (EPSRC).

Studentships

Open Positions

We currently have no open positions.

Studentships

We expect to have several studentships available for fall of 2011. Contact Georg Gottlob or Michael Benedikt for details.

Databases

The Database Group conducts research on next-generation data management infrastructure, including Data Exchange, Web Information Extraction, XML processing, Deep Web Querying, Management of Uncertain Information, Querying Social Networks, and Stream Processing.