Realistic Data Models and Query Compilation for Large-Scale Probabilistic Databases

1st December 2017 to 31st May 2021

In the recent years, there has been a strong interest in academia and industry in building large-scale probabilistic knowledge bases from data in an automated way, which has resulted in a number of systems, such as DeepDive, NELL, Yago, Freebase, Microsoft's Probase, and Google's Knowledge Vault. These systems continuously crawl the Web and extract structured information, and thus populate their databases with millions of entities and billions of tuples. To what extent can these search and extraction systems help with real-world use cases? This turns out to be an open-ended question. For example, DeepDive is used to build knowledge bases for domains such as paleontology, geology, medical genetics, and human movement. From a broader perspective, the quest for building large-scale knowledge bases serves as a new dawn for artificial intelligence research. Fields such as information extraction, natural language processing (e.g., question answering), relational and deep learning, knowledge representation and reasoning, and databases are taking initiative towards a common goal. Querying large-scale probabilistic knowledge bases is commonly regarded to be at the heart of these efforts.

Beyond all these success stories, however, probabilistic knowledge bases still lack the fundamental machinery to convey some of the valuable knowledge hidden in them to the end user, which seriously limits their potential applications in practice. These problems are rooted in the semantics of (tuple-independent) probabilistic databases, which are used for encoding most probabilistic knowledge bases. For computational efficiency reasons, probabilistic databases are typically based on strong, unrealistic completeness assumptions, such as the closed-world assumption, the tuple-independence assumption, and the lack of commonsense knowledge. These strong unrealistic assumptions do not only lead to unwanted consequences, but also put probabilistic databases on weak footing in terms of knowledge base learning, completion, and querying. More specifically, each of the above systems encodes only a portion of the real world, and this description is necessarily incomplete; these systems continuously crawl the Web, encounter new sources, and consequently new facts, leading them to add such facts to their database. However, when it comes to querying, most of these systems employ the closed-world assumption, i.e., any fact that is not present in the database is assigned the probability 0, and thus assumed to be impossible. As a closely related problem, it is common practice to view every extracted fact as an independent Bernoulli variable, i.e., any two facts are probabilistically independent. For example, the fact that a person starred in a movie is independent from the fact that this person is an actor, which is in conflict with the fundamental nature of the knowledge domain. Furthermore, current probabilistic databases lack (in particular ontological) commonsense knowledge, which can often be exploited in reasoning to deduce implicit consequences from data, and which is often essential for querying large-scale probabilistic databases in an uncontrolled environment such as the Web.

The main goal of this proposal is to enhance large-scale probabilistic databases (and so to unlock their full data modelling potential) by more realistic data models, while preserving their computational properties. We are planning to develop different semantics for the resulting probabilistic databases and analyse their computational properties and sources of intractability. We are also planning to design practical scalable query answering algorithms for them, especially algorithms based on knowledge compilation techniques, extending existing knowledge compilation approaches and elaborating new ones, based on tensor factorisation and neural-symbolic knowledge compilation. We will also produce a prototype implementation and experimentally evaluate the proposed algorithms.