Query optimization for hybrid database-ML pipelines

Ioana Manolescu (INRIA)
Modern data management systems encounter heterogeneity at several levels.
First, useful data often comes in different formats or data models, such as
relational tables, documents, key-value pairs, etc.
Second, organizations routinely operate over a large set of data stores, each
supporting a different data model and associated query language (or data access API).
How do we make the most out of a set of data stores, in order to handle a given
query workload over a set of heterogeneous datasets?
Third, modern applications increasingly need to blend querying and learning on the
data. Currently, such pipelines are written either from scratch in an imperative language
or as a combination of systems, one supporting logical-style querying and another supporting
the computations commonly incurred by machine learning (ML) tasks.

In this talk, we will show how to overcome these sources of heterogeneity through a
powerful, classical tool from database theory, namely the chase and backchase (C&B),
used to rewrite a given pipeline into an equivalent one that can be executed
more efficiently. This enables three powerful techniques: (a) exploiting the strength of
all available data stores by making them host the part of an application's data that they
can handle most efficiently; (b) reformulating hybrid database-ML pipelines to make
them more efficient; and (c) jointly optimizing such hybrid pipelines by equivalent
rewritings that span relational algebra and linear algebra.
This work is joint with Rana Al-Otaibi and Alin Deutsch from UC San Diego (USA),
and Bogdan Cautis from U. Paris Saclay (France).
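
To give a concrete sense of what an "equivalent rewriting" of a linear algebra expression can look like, here is a minimal sketch in Python with NumPy. It is not taken from the talk and it is not produced by C&B: the identity was picked by hand for illustration, and the function names naive_total and rewritten_total are hypothetical. Both functions compute the sum of all entries of a matrix product, but the second one never builds the product.

```python
import numpy as np

def naive_total(A, B):
    """Sum of all entries of A @ B; materializes the full product first."""
    return float((A @ B).sum())

def rewritten_total(A, B):
    """Equivalent form: dot product of A's column sums and B's row sums.

    Relies on sum_{i,j} sum_k A[i,k] * B[k,j]
            = sum_k (sum_i A[i,k]) * (sum_j B[k,j]).
    """
    return float(A.sum(axis=0) @ B.sum(axis=1))

# Illustrative inputs (arbitrary shapes chosen for this sketch).
rng = np.random.default_rng(0)
A = rng.random((200, 50))
B = rng.random((50, 300))

# Both forms agree up to floating-point rounding.
assert np.isclose(naive_total(A, B), rewritten_total(A, B))
```

The rewritten form only touches vectors of length 50 after two reductions, whereas the naive form allocates a 200 x 300 intermediate. An optimizer that can prove such equivalences is free to pick the cheaper plan.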


