The Providence of Provenance
It is more than 20 years since the issue of provenance in databases was first raised. Since that time, under various names, it has been modelled, theorised about, standardised and has become part of mainstream database research. Moreover, the topic has now infected nearly every branch of computer science: provenance is a problem for everyone. But what exactly is the problem? And has the copious research had any real effect on how we use databases or, more generally, how we use computers?
I shall attempt to summarise the research on provenance and what practical impact it has had. Although much of the research has yet to come to market, there is an increasing interest in the topic from industry; moreover it has had a surprising impact in tangential areas such as data integration and data citation. I shall argue that we are still lacking basic tools to deal with provenance and that we need a “culture shift” if ever we are to make full use of the technology that has already been developed.
Peter Buneman FRS, FRSE, has been Professor of Database Systems in the Laboratory for the Foundations of Computer Science, School of Informatics, University of Edinburgh, since 2002.
Before that, he held a professorship of Computer Science at the University of Pennsylvania for several decades. He received his PhD in mathematics from the University of Warwick in 1970. He is one of the founders and the Associate Director of Research of the UK Digital Curation Centre, which is located in Edinburgh.
His research interests include database systems and database theory, in particular establishing connections between databases and programming language theory, for which he introduced monad-based query languages for nested relations and complex object databases. He also pioneered research on managing semi-structured data and, more recently, research on data provenance, annotations, and digital curation.
In computational biology, he is known for his work on reconstructing phylogenetic trees based on Buneman graphs, which are named in his honor.
He is a fellow of the Royal Society, a fellow of the ACM, and a fellow of the Royal Society of Edinburgh, and has won a Royal Society Wolfson Research Merit Award. He has chaired both flagship research conferences in data management, SIGMOD (in 1993) and VLDB (in 2008), as well as the main database theory conference, PODS (in 2001).
Compilation and Synthesis in Big Data Analytics
Databases and compilers are two long-established and quite distinct areas of computer science. With the advent of the big data revolution, these two areas are moving closer, to the point that they overlap and merge. Researchers in programming languages and compiler construction want to take part in this revolution, and also have to respond to programmers' need for suitable tools to develop data-driven software for data-intensive tasks and analytics.
Database researchers cannot ignore the fact that most big-data analytics is performed in systems such as Hadoop that run code written in general-purpose programming languages rather than query languages. To remain relevant, each community has to move closer to the other. In the first part of this keynote, I illustrate this current trend further, and describe a number of interesting and inspiring research efforts that are currently underway in these two communities, as well as open research challenges. In the second part, I present a number of research projects in this space underway in my group at EPFL, including work on the static and just-in-time compilation of analytics programs and database systems, and the automatic synthesis of out-of-core algorithms that efficiently exploit the memory hierarchy.
Christoph Koch is a professor of Computer Science at EPFL, specializing in data management. Until 2010, he was an Associate Professor in the Department of Computer Science at Cornell University. Before that, from 2005 to 2007, he was an Associate Professor of Computer Science at Saarland University. Earlier, he obtained his PhD in Artificial Intelligence from TU Vienna and CERN (2001), was a postdoctoral researcher at TU Vienna and the University of Edinburgh (2001-2003), and an assistant professor at TU Vienna (2003-2005).
He obtained his Habilitation degree in 2004. He has won Best Paper Awards at PODS 2002, ICALP 2005, and SIGMOD 2011, a Google Research Award (in 2009), and an ERC Grant (in 2011). He (co-)chaired the program committees of DBPL 2005, WebDB 2008, and ICDE 2011, and was PC vice-chair of ICDE 2008 and ICDE 2009. He has served on the editorial board of ACM Transactions on Internet Technology as well as in numerous program committees. He currently serves as PC co-chair of VLDB 2013 and Editor-in-Chief of PVLDB.
University of Washington
Big Data Begets Big Database Theory
Big data analytics today is a high-touch business: a highly specialized domain expert spends most of her time repeatedly performing a series of data exploration and data transformation steps before doing any useful data analysis. Such data transformations are performed on huge data volumes stored on massively distributed clusters, and improving their performance is both critical and challenging.
This talk discusses the theoretical complexity of database operations in massively distributed clusters. A query is computed in a sequence of super-steps that interleave computations with communications. For example, a MapReduce job has a map phase, followed by a communication phase (the data reshuffle), followed by a reduce phase. The major performance parameter for complex queries is the number of communication steps; for example, one join operator can be computed by one MapReduce job, using a single communication step, but it is far from obvious how many communication steps are needed for more complex queries, like, say, a 4-way join, or a skyline query, or transitive closure on a graph. The talk will present some recent theoretical results on the number of communication steps required by database queries in massively distributed clusters.
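The single-round join mentioned above can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions: the relations R(a, b) and S(b, c), the function names, and the use of Python lists in place of a real cluster are all hypothetical and are not taken from the talk. The map phase tags each tuple with its join key, the shuffle is the one communication step, and the reduce phase joins locally.

```python
from collections import defaultdict
from itertools import product


def map_phase(R, S):
    """Emit (join_key, tagged_value) pairs for R(a, b) and S(b, c)."""
    for a, b in R:
        yield b, ("R", a)
    for b, c in S:
        yield b, ("S", c)


def shuffle(pairs, num_reducers):
    """The single communication step: route every pair to a reducer by key hash."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        reducers[hash(key) % num_reducers][key].append(value)
    return reducers


def reduce_phase(reducers):
    """Each reducer joins, locally, the R and S values that share a key."""
    for groups in reducers:
        for b, values in groups.items():
            left = [v for tag, v in values if tag == "R"]
            right = [v for tag, v in values if tag == "S"]
            for a, c in product(left, right):
                yield (a, b, c)


def mapreduce_join(R, S, num_reducers=3):
    """Compute R(a, b) JOIN S(b, c) with one map, one shuffle, one reduce."""
    return sorted(reduce_phase(shuffle(map_phase(R, S), num_reducers)))


if __name__ == "__main__":
    R = [(1, "x"), (2, "x"), (3, "y")]
    S = [("x", 10), ("y", 20), ("z", 30)]
    print(mapreduce_join(R, S))
```

Because all tuples with the same join key land on the same reducer, one shuffle suffices for a two-way join. The sketch does not address the multi-way joins or other queries the abstract raises, where the required number of communication steps is the open question.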
Dan Suciu is a Professor in Computer Science at the University of Washington. He received his Ph.D. from the University of Pennsylvania in 1995, was a principal member of the technical staff at AT&T Labs, and joined the University of Washington in 2000. Suciu conducts research in data management, with an emphasis on topics related to Big Data and data sharing, such as probabilistic data, data pricing, parallel data processing, and data security. He is a co-author of two books: Data on the Web: from Relations to Semistructured Data and XML (1999) and Probabilistic Databases (2011).
He is a Fellow of the ACM, holds twelve US patents, received the ACM SIGMOD Best Paper Award in 2000 and the ACM PODS Alberto Mendelzon Test of Time Award in 2010 and in 2012, and is a recipient of the NSF Career Award and of an Alfred P. Sloan Fellowship. Suciu serves on the VLDB Board of Trustees, is an associate editor for the VLDB Journal, ACM TOIS, ACM TWEB, and Information Systems, and is a past associate editor for ACM TODS. Suciu’s PhD students Gerome Miklau and Christopher Re received the ACM SIGMOD Best Dissertation Award in 2006 and 2010 respectively, and Nilesh Dalvi was a runner-up in 2008.