Datalog Extensions for the Analysis of Static and Streaming Data
This is an extension of the fellowship "MaSI3: A Massively Scalable Intelligent Information Infrastructure" (EP/K00607X/1).
Intelligent data management techniques play a key role in areas such as healthcare, business, and government. Healthcare providers such as Kaiser Permanente use such techniques for auditing; companies such as RailComplete use them to verify transport infrastructure designs; and oil producers such as StatOil analyse streaming sensor data to diagnose faults and prevent failures. To simplify the management of the data, intelligent information systems (IISs) provide services that (i) capture the semantics of the data using background knowledge about the application domain, and (ii) use reasoning to infer information implicit in the data and the background knowledge. The vision of the MaSI3 fellowship is to make intelligent information systems a reality by developing scalable reasoning and query answering techniques. The techniques developed in the project to date provide the foundation for a new IIS called RDFox. The system is based on datalog -- a language that models background knowledge of IISs using 'if-then' rules.
Extensive engagement with users within MaSI3 has revealed the great potential of RDFox for data analysis -- an exciting new application of IISs. The term 'data analysis' covers a broad range of techniques, which can include searching for patterns and predicting future behaviour using statistical and machine learning algorithms. In many cases, however, data analysis involves data manipulation tasks that aggregate data, verify properties, or answer queries. Such tasks are typically solved imperatively (e.g., using languages such as Java or Scala) by specifying how to manipulate the data, which is undesirable because the objective of the analysis is often obscured by evaluation concerns. It has been argued that data analysis should be declarative: users should describe what the desired output is, rather than how to compute it. For example, instead of computing shortest paths in a graph using a concrete algorithm, one should (i) describe what a path length is, and (ii) state that only paths of minimum length are needed. Such a specification is independent of evaluation details, which allows users to focus their attention on the task at hand. An evaluation strategy can be chosen later to satisfy specific requirements; for example, parallelisation or incremental techniques can be reused 'for free' to speed up computation or update the output after applying a change to the input.
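To make the shortest paths example concrete, the following is a minimal Python sketch, not RDFox code and not the method proposed here. The docstring states the two declarative parts, (i) what a path length is and (ii) that only the minimum is wanted; the loop beneath it is just one possible evaluation strategy (a naive fixpoint). The edge facts and the names `EDGES` and `shortest_paths` are assumptions made for illustration, and the sketch assumes non-negative edge weights.

```python
# Hypothetical edge facts (source, target, weight); not taken from the text.
EDGES = {("a", "b", 1), ("b", "c", 2), ("a", "c", 5), ("c", "a", 1)}


def shortest_paths(edges):
    """Evaluate the following declarative statements by naive fixpoint.

    (i)  path(x, y, w)     if edge(x, y, w)
         path(x, z, d + w) if path(x, y, d) and edge(y, z, w)
    (ii) shortest(x, y, m) where m is the minimum d with path(x, y, d)

    Only the current minimum per (x, y) is stored, which is an evaluation
    choice: it lets the loop terminate on cyclic graphs, assuming all
    weights are non-negative.
    """
    best = {}
    # Rule (i), base case: every edge is a path.
    for x, y, w in edges:
        if w < best.get((x, y), float("inf")):
            best[(x, y)] = w
    # Rule (i), recursive case, repeated until no shorter path is derived.
    changed = True
    while changed:
        changed = False
        for (x, y), d in list(best.items()):
            for y2, z, w in edges:
                if y2 == y and d + w < best.get((x, z), float("inf")):
                    best[(x, z)] = d + w
                    changed = True
    return best


result = shortest_paths(EDGES)
```

With these facts, `result[("a", "c")]` is 3, via `b`, rather than the direct edge of weight 5. The statements in the docstring would stay the same if the loop were replaced by a parallel or incremental evaluator.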
A key problem on the path to declarative data analysis is to design a language that can express the relevant tasks. Datalog has been identified as a natural starting point: its expressivity and complexity are well understood, and it is already used to declaratively capture domain knowledge in IISs. My work in the MaSI3 fellowship confirms the potential of datalog, but it has also revealed the inability of datalog to express several natural and common data analysis problems. For example, datalog cannot answer the bill of materials query (i.e., count the occurrences of a part in a hierarchical product structure); moreover, basic datalog cannot express the shortest paths problem, and datalog extensions that can express this problem are inefficient when used with known reasoning algorithms. Furthermore, there are challenges in using datalog in a streaming setting (i.e., where data is produced continuously).
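As an illustration of the bill of materials query, the sketch below counts the occurrences of a part in a small product hierarchy. It is written in Python rather than datalog, precisely because the recursive sum it relies on is not available in basic datalog. The product structure, the quantities, and the names `SUBPARTS` and `make_counter` are hypothetical, and the hierarchy is assumed to be acyclic.

```python
from functools import lru_cache

# Hypothetical product structure: assembly -> {component: quantity}.
SUBPARTS = {
    "bike": {"wheel": 2, "frame": 1},
    "wheel": {"spoke": 32, "rim": 1},
}


def make_counter(subparts):
    """Return a function counting occurrences of a part in an assembly."""

    @lru_cache(maxsize=None)  # memoise sub-results, as in dynamic programming
    def occurrences(assembly, part):
        if assembly == part:
            return 1
        # Sum over direct components, weighted by their quantities.
        # Assumes the hierarchy has no cycles; otherwise this recurses forever.
        return sum(
            qty * occurrences(child, part)
            for child, qty in subparts.get(assembly, {}).items()
        )

    return occurrences


occurrences = make_counter(SUBPARTS)
spokes_per_bike = occurrences("bike", "spoke")
```

Here `spokes_per_bike` is 64: two wheels with 32 spokes each. The query combines recursion over the hierarchy with arithmetic aggregation.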
Thus, the objective of this fellowship extension is to develop datalog extensions for data analysis in IISs, establish links with known problem-solving methods (e.g., dynamic programming), and evaluate the results with my collaborators. My main research problems concern language design and are thus, to an extent, independent of the specific evaluation methods. I will validate my results by implementing them in RDFox, but they can also be implemented in Big Data frameworks such as Hadoop and Spark.