Distributed Query Optimization

Many companies work with vast quantities of data and find that the most cost effective way to store data is by using a large number of commodity machines in a distributed setting. Some companies often choose to use custom made file systems and distributed file structures, such as BigTable in the case of Google, instead of adopting a relational approach. We believe that the complexity of the distributed query optimization problem is the main factor that causes companies to shy away from using distributed relational database systems.

The query optimizer is widely considered to be the most important component of a database management system. It is responsible for taking a user query, searching through the entire space of equivalent execution plans and returning the plan with the lowest cost. Equivalent query execution plans can vary significantly in cost therefore it is important for the optimizer to find a plan that will not inflict much unnecessary work on the executor. When optimizing queries in a distributed database system the optimization process must consider the additional dimension of choosing which execution site to perform each execution operation. This added dimension causes the search space to increase vastly and introduces new opportunities for parallelism, which contribute to the increasing complexity of the optimization problem. However, with the addition of sites comes an addition of processing power meaning that all sites can be used during the optimization of a single user query.

This project investigates optimization techniques for large queries in a distributed context. The research directions taken so far are as follows.

See Robert Taylor's thesis for details as well as an extensive experimental evaluation of our optimization framework.

System prototype

Our optimization algorithms are implemented in Java. The latest version is available here.

Thesis

Current team