# ENFrame: A Programming Framework for Probabilistic Data

Probabilistic data management has come a long way in the last decade: we have a good understanding of the space
of possible relational and hierarchical data models and their implications for query tractability, and the community has delivered
several open-source systems that exploit the first-order structure of database queries for scalable inference and for applications
in web data management. Significantly less effort has been spent on supporting complex data processing beyond
mere querying, such as general-purpose programming.

There is a growing need for computing frameworks that allow users
to build applications feeding on uncertain data without worrying about the underlying uncertain nature of such data or the
computationally hard inference task that comes along with it. For tasks that only need to query probabilistic data, existing
probabilistic database systems do offer a viable solution. For more complex tasks, however, successful development requires
a high level of expertise in probabilistic databases, which hinders both the adoption of existing technology and communication
between potential users and experts.

**The thesis of this project** is that one can build powerful and useful probabilistic data programming frameworks
that *leverage* existing work on probabilistic databases. ENFrame is a framework that aims to fit this vision:

- Its programming language is a fragment of Python with constructs such as bounded-range loops, if-then-else statements, list comprehensions, aggregates, variable assignments, and query calls to external database engines. A user program can express complex tasks such as clustering and classification intermixed with structured querying.
- The users are oblivious to the probabilistic nature of the input data: They program as if the input data were plain relational, with no uncertainty or layout intricacy. It is the job of ENFrame to make sense of the underlying data formats, probabilities, and input correlations, thus lowering the entry barrier for users without expert knowledge of inference, query tractability, and probabilistic models.
- A key property inherited from database systems principles is the separation between the physical and logical representations of probabilistic data: Whereas the physical representation may be that of, e.g., Bayesian networks for some sources and probabilistic c-tables for others (or special cases such as tuple-independent or block-independent disjoint tables), the users are exposed to a unified relational view of the underlying data. For this, ENFrame relies on the Pigora system for probabilistic data integration.
- Following the design of existing probabilistic database systems, ENFrame adheres to the possible worlds semantics for its whole processing pipeline. Under this semantics, the input is a probability distribution over a finite set of possible worlds, with each world defining a standard database or a set of input data points. The result of a user program is equivalent to executing it within each world and is thus a probability distribution over possible outcomes of the program variables. This distribution can be inspected by the user and serve as input to probabilistic database queries and subsequent ENFrame programs.
- ENFrame uses a rich language of probabilistic events to symbolically express input correlations, trace the correlations introduced during computation, and enable result explanation, sensitivity analysis, incremental maintenance, and knowledge compilation-based approaches for approximate inference. While propositional formulas over Boolean random variables are expressive enough to capture computation traces of ENFrame programs, they can be very large and expensive to manage. ENFrame's event language extends algebraic formalisms based on semirings and semimodules for probabilistic data that can succinctly encode program events with constructs mirroring those in the user language (such events can be exponentially more succinct than equivalent propositional formulas or Bayesian networks).
- ENFrame exploits the structure of queries and programs for efficient inference; it relies on SPROUT for efficient query processing on probabilistic data. To avoid iterating over all possible worlds, ENFrame employs approximation schemes and exploits the fact that many of the possible worlds are alike. To further speed up inference on networks of highly interconnected probabilistic events, such as those for clustering and classification programs, ENFrame uses parallel algorithms that exploit multi-core architectures.
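The possible worlds semantics described above can be illustrated with a minimal Python sketch. It is not ENFrame code: the input format, the names `uncertain_points`, `user_program`, `possible_worlds`, and `output_distribution`, and the assumption of independent input points are all hypothetical and chosen for illustration. The sketch naively enumerates every world, which is exactly the exponential iteration that ENFrame is designed to avoid.

```python
from itertools import product

# Hypothetical tuple-independent input: (value, probability that the point exists).
uncertain_points = [(1.0, 0.5), (2.0, 0.8), (10.0, 0.4)]


def user_program(points):
    """Deterministic user code, written as if the data were certain.

    Returns the largest value in `points`, or None if the list is empty.
    """
    best = None
    for i in range(len(points)):  # bounded-range loop
        if best is None or points[i] > best:
            best = points[i]
    return best


def possible_worlds(points):
    """Yield (world, probability) for every subset of the input points."""
    for mask in product([False, True], repeat=len(points)):
        prob = 1.0
        world = []
        for (value, p), present in zip(points, mask):
            prob *= p if present else 1.0 - p
            if present:
                world.append(value)
        yield world, prob


def output_distribution(program, points):
    """Run `program` in each world and sum world probabilities per outcome."""
    dist = {}
    for world, prob in possible_worlds(points):
        outcome = program(world)
        dist[outcome] = dist.get(outcome, 0.0) + prob
    return dist


distribution = output_distribution(user_program, uncertain_points)
for outcome, prob in sorted(distribution.items(), key=lambda kv: -kv[1]):
    print(outcome, round(prob, 4))
```

The user-facing function never mentions probabilities; the uncertainty is handled entirely by the surrounding machinery, and the result is a probability distribution over the possible values of the program output.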

There are **key differences** that set ENFrame apart from the myriad of recent probabilistic programming and
probabilistic data processing approaches:

- ENFrame uses the semantics and a probabilistic data model compatible with a wealth of work on probabilistic databases. This enables processing pipelines mixing programming (via ENFrame) and query processing (via SPROUT) jobs.
- The user need not be aware of the probabilistic nature of data or the possibly different probabilistic formalisms used by the input sources. One implication is that probability distributions can only be supplied as input data and not in the actual program. So far, input distributions can only be given explicitly and not symbolically as in MCDB (e.g., a Normal distribution specified by its mean and standard deviation).
- We can leverage existing work on incremental maintenance of query answers and extend it to incremental maintenance of the program output in the face of updates to input probabilities and of insertions and deletions of uncertain objects.
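One way to see why symbolic events help with updates to input probabilities is the following hedged Python sketch. It assumes independent Boolean input variables and a hypothetical event `outcome_event` recorded for one program outcome; the function `event_probability` uses brute-force enumeration and is not ENFrame's event language or inference algorithm. The point is only that, once the event is kept, a changed input probability requires re-evaluating the event rather than re-running the user program.

```python
from itertools import product


def event_probability(event, probs):
    """Probability that `event` is true under independent Boolean variables.

    `event` maps an assignment {name: bool} to a bool; `probs` maps each
    variable name to P(variable = True). Brute-force enumeration, so this
    is exponential in the number of variables.
    """
    names = sorted(probs)
    total = 0.0
    for values in product([False, True], repeat=len(names)):
        assignment = dict(zip(names, values))
        if event(assignment):
            weight = 1.0
            for name in names:
                weight *= probs[name] if assignment[name] else 1.0 - probs[name]
            total += weight
    return total


def outcome_event(a):
    """Hypothetical event for one program outcome: (x1 and x2) or x3."""
    return (a["x1"] and a["x2"]) or a["x3"]


probs = {"x1": 0.5, "x2": 0.8, "x3": 0.4}
before = event_probability(outcome_event, probs)

# An input probability is updated; only the event is re-evaluated.
probs["x3"] = 0.9
after = event_probability(outcome_event, probs)

print(round(before, 4), round(after, 4))
```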

## Talks

*(on various aspects of the SPROUT, ENFrame, and Pigora projects)*

*Beyond Query Evaluation in Probabilistic Databases*

- Invited Speaker, Scalable Uncertainty Management (SUM), Sept 2014.

*Probabilistic Databases are Dead, Long Live Probabilistic Databases!*

- Invited Speaker, Big Uncertain Data Workshop at PODS, Snowbird, June 21, 2014.
- Google Research Seminar (Knowledge Vault group), February 21, 2014, Mountain View (California).

## Theses

- Lampros Papageorgiou: *Pigora: An Integration System for Probabilistic Data*. MSc in CS, Oxford 2012.

## Publications

**Declarative Probabilistic Programming with Datalog**. [accepted version]

Vince Barany and Balder ten Cate and Benny Kimelfeld and Dan Olteanu and Zografoula Vagena.

Accepted to appear in ACM Transactions on Database Systems (TODS), Special issue for *best papers of ICDT 2016*.

Submitted Sept 2016, Accepted Aug 2017.

**Declarative Probabilistic Programming with Datalog.**

Vince Barany and Balder ten Cate and Benny Kimelfeld and Dan Olteanu and Zografoula Vagena. [pdf]

In Int Conf on Database Theory (ICDT), Bordeaux, March 2016.

**PPDL: Probabilistic Programming with Datalog.** [arxiv]

Balder ten Cate, Benny Kimelfeld, and Dan Olteanu.

In Alberto Mendelzon Workshop (AMW), Lima, April 2015.

Short version of technical report, arXiv, December 2014.

**ENFrame: A Framework for Processing Probabilistic Data.** [pdf]

Dan Olteanu and Sebastiaan van Schaik.

ACM Transactions on Database Systems (TODS), Special issue for *best papers of EDBT 2014*.

Submitted Feb 2015, Accepted December 2015.

**Probabilistic Data Programming with ENFrame**. [pdf]

Dan Olteanu and Sebastiaan van Schaik.

In IEEE Data Engineering Bulletin, September 2014.

**ENFrame: A Platform for Processing Probabilistic Data**. [pdf]

Sebastiaan van Schaik, Dan Olteanu and Robert Fink.

In Extending Database Technology (EDBT), 2014. *Selected as one of best papers of EDBT 2014.*

Also technical report arXiv 1309.0373, November 2012.

**ENFrame = (Programs + Queries) / Probabilistic Data**. [pdf, poster]

Dan Olteanu and Sebastiaan van Schaik.

In Big Uncertain Data (BUDA), workshop at PODS, 2014.

**Pigora: An Integration System for Probabilistic Data**. [pdf, poster]

Dan Olteanu, Lampros Papageorgiou, and Sebastiaan van Schaik.

System demonstration. In IEEE Int Conf on Data Engineering (ICDE), Brisbane, 2013.

**DAGger: Clustering Correlated Uncertain Data**. [pdf, poster]

Dan Olteanu and Sebastiaan van Schaik.

System demonstration. In ACM SIGKDD Conf on Knowledge Discovery and Data Mining (KDD), Beijing, 2012.

## Current Team

- Dan Olteanu (PI)
- Sebastiaan J. van Schaik (PhD student)

Former members: Tomas Halgas; Lampros Papageorgiou (MSc student).

## Acknowledgments

Work on ENFrame has been supported by the EU FP7 grant HiperDNO, the EPSRC grant ADEPT, the Prins Bernhard Cultuurfonds, and the De Breed Kreiken Innovatiefonds. Work on PPDL has been partially supported by LogicBlox via DARPA's PPAML program and by the EPSRC programme grant VADA.