"My students are my teachers

and my teachers are my students."

Web data extraction consists of automatically identifying structured
data on web pages and loading the data into a
database or other EDP application.
A comprehensive logical theory of Web data extraction and data extraction from
semi-structured documents was developed
in the paper
*
Monadic Datalog and the Expressive Power of Languages for
Web Information Extraction *
(PODS'02, full version: JACM 51:1, 2004)
with Christoph Koch.
A first basic insight of this study was
that data extraction from tree-structered documents
is a task which can be essentially described by
monadic logic, and that, in particular, monadic second order logic
(MSO) is a
well-suited formalism for describing most relevant data extraction
tasks. Since MSO is not a practical language and has a high expression
complexity, we looked for a simpler extraction language having the
same expressive power. We identified * monadic datalog *
as a good
candidate and proved that, over tree structures, monadic datalog
has exactly the same expressive power as MSO, but much
lower complexity (size of program times size of input document).

As a practical tool, with postdocs Robert Baumgartner,
Sergio Flesca, and Marcus Herzog, we designed and implemented the Lixto
system for visual
data extraction, whose original version is
described in the VLDB'01 paper
*
Visual Web Information Extraction with Lixto
*.
This system allows a designer to develop
a wrapper,
that is, an information extraction program, by mainly visual and
interactive operations performed on a sample document. These
wrappers are formulated in an extension of monadic datalog.
This means that monadic datalog programs over HTML-trees, and thus
MSO mappings,
can be essentially defined in a visual and interactive manner.
For further publications on Lixto,
click
here .
The Lixto sytem has been further developed by
Vienna based company * Lixto Software
* ,
and is now used by a large number of corporate customers.
Our theoretical results on data extraction
got the best paper award at ACM PODS 2002.
The Lixto software was a finalist in the
World Technology Award
competition 2003 in San Francisco.

The concepts of
* tree decomposition*
and the associated concept of treewidth
(a cyclicity measure for graphs)
introduced by the well-known graph theorists
Robertson and Seymour in Robertson,
(J. Algorithms 7, 1986, pp.309-322.} was of great profit to Computer
Science. Many NP hard problems turned out to be tractable on instances
whose underying graphs have bounded treewidth.

However, the structure
of many problems is better described by hypergraphs than by graphs. We
were thus looking for suitable hypergraph decomposition methods with
similar properties as treewidth, and forged the new decomposition
concept of * hypertree decompositions (HD) * and the
associated notion of
* hypertree width * in the paper
* Hypertree Decompositions and Tractable Queries*
(PODS'99,JCSS n.64, 2002). This
decomposition has
several favorable properties: For every constant * k *, it can be
determined in polynomial time whether a given hypergraph has HW
* k *,
and in the positive case, a HD of width * k * can be
computed, moreover, many problems such as conjunctive query
evaluation, and constraint satisfaction problems (CSPs), which are
NP complete in general, can be solved in polynomial time for instances
of bounded HW.

By solving an open problem, we showed that for a
different decomposition method proposed by Chekuri and
Rajaraman (C.Chekuri and A.Rajaraman,
* Conjunctive Query Containment Revisited *, Proc. 6th . Intl. Conf on
Database Theory, Springer LNCS 1186, 1987, pp. 56-70.), it is NP-hard
to determine whether a hypergraph has bounded width.
A natural game-theoretic characterization of hypertree-width
was given in the paper *
Robbers, Marshals, and guards: Game theoretic and logical
characterizations of hypertree width * (PODS'02, full version
in JCSS 66, 2003), where it was shown that a hypergraph has
HW * k * iff * k * ``marshals'' (each of which can control a hyperedge)
can capture a robber who can move freely along connected hyperedges
not occuppied by
marshals. In the paper
*A Comparison of Structural CSP Decomposition Methods * ,
the
concept of
hypertree width was compared to various other hypergraph decomposition
methods and it was shown that hypertree-width
dominates these methods in generality.
Further investigations can be founfr in the more recent paper
*
Generalized Hypertree Decompositions:
NP-Hardness and Tractable Variants
*, which
got the Best Paper Award at ACM PODS'07.

There are two Web pages which both contain papers on hypertree decomposition and downloadable code of decomposition algorithms: The Italian Hypertree Decomposition Homepage , maintained by Francesco Scarcello, and the Vienna Hypertree Project Page , maintained by Nysret Musliu.

We are currently continuing our research on hypertree decompositions in various directions: Finding new and faster decomposition algorithms, generalizing the concept, extending the range of applications.