Enterprises and government entities have a growing need for systems that provide decision support based on descriptive and predictive analytics over large volumes of data. Examples include supporting decisions on pricing and promotions based on analyses of revenue and demand data; supporting decisions on the operation of complex equipment based on analyses of sensor data; and supporting decisions on website content based on analyses of user behaviour. Such support may be critical for safety and regulatory compliance as well as for competitiveness.
Current data analytics technology and workflows are well-suited to settings where the data has a uniform structure and is easy to access. Problems can arise, however, when performing data analytics in real-world settings, where, as well as being large, data sources are often distributed, heterogeneous, and dynamic.
Consider, for example, the case of Siemens Energy Services, which runs over 50 service centres, each of which provides remote monitoring and diagnostics for thousands of gas/steam turbines and ancillary equipment located in hundreds of power plants. Effective monitoring and diagnosis are essential for maintaining high availability of equipment and avoiding costly failures. A typical descriptive analytics procedure might be: "based on sensor data from an SGT-400 gas turbine, detect abnormal vibration patterns during the period prior to the shutdown and compare them with data on similar patterns in similar turbines over the last 5 years".
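To make the flavour of such a procedure concrete, the following is a minimal, purely illustrative sketch (not Siemens' actual pipeline; the threshold and window are hypothetical) of the first step, detecting abnormal vibration readings as those deviating from a rolling baseline by more than a given number of standard deviations:

```python
# Illustrative sketch only: flag vibration readings that deviate more
# than `threshold` standard deviations from the mean of the preceding
# `window` readings. Window size and threshold are hypothetical.
from statistics import mean, stdev

def abnormal_indices(readings, window=5, threshold=3.0):
    """Return indices of readings that deviate > threshold sigma
    from the mean/stdev of the trailing `window` readings."""
    flagged = []
    for i in range(window, len(readings)):
        baseline = readings[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(readings[i] - mu) > threshold * sigma:
            flagged.append(i)
    return flagged

# Steady vibration amplitude with a spike shortly before a shutdown:
series = [1.0, 1.1, 0.9, 1.0, 1.05, 1.0, 0.95, 7.5, 1.0]
print(abnormal_indices(series))  # → [7]
```

A production procedure would of course operate over much richer sensor data and compare flagged patterns against historical data from similar turbines; the sketch only illustrates the kind of per-signal computation involved.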
Such diagnostic tasks employ sophisticated data analytics tools, and operate on many TBs of current and historical data. In order to perform the analysis it is first necessary to identify, acquire and transform the relevant data. This data may be stored on-site (at a power plant), at the local service centre or at other service centres; it comes in many different formats, ranging from flat files to XML and relational stores; access may be via a range of different interfaces, and incur a range of different costs; and it is constantly being augmented, with new data arriving at a rate of more than 30 GB per centre per day.
Acquiring the relevant data is thus very challenging, and is typically achieved via a combination of complex queries and bespoke data processing code, with numerous variants being required in order to deal with the distribution and heterogeneity of the data. Given the large number of different analytics tasks that service centres need to perform, the development and maintenance of such procedures becomes a critical bottleneck.
In ED3 we will address this problem by developing an abstraction layer that mediates between analytics tools and data sources. This abstraction layer will adapt Ontology-Based Data Access (OBDA) techniques, using an ontology to provide a uniform conceptual schema, declarative mappings to establish connections between ontological terms and data sources, and logic-based rewriting techniques to transform ontological queries into queries over the data sources. For OBDA to be effective in this new setting, however, it will need to be extended in several different directions. Firstly, it needs to provide greatly extended support for basic arithmetic and aggregation operations. Secondly, it needs to deal more effectively with heterogeneous and distributed data sources. Thirdly, it will be necessary to support the development, maintenance and evolution of suitable ontologies and mappings.
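The mapping-and-rewriting idea at the heart of OBDA can be sketched very simply. The following is a toy illustration (the source names, table schemas and ontological term are invented for the example, not the ED3 design): declarative mappings relate an ontological class to a concrete query over each heterogeneous source, and a query posed against the ontology is rewritten into one source-level query per mapped source.

```python
# Toy OBDA sketch: hypothetical schemas and source names, invented
# for illustration. Each ontological class maps to one SQL fragment
# per data source; rewriting a conceptual query yields a union of
# source-level queries.
MAPPINGS = {
    "ont:VibrationReading": {
        "centre_db":  "SELECT ts, value FROM vib_sensor WHERE unit_id = {unit}",
        "plant_logs": "SELECT time AS ts, amp AS value FROM logs WHERE turbine = {unit}",
    },
}

def rewrite(ontology_class, unit_id):
    """Rewrite a query over an ontological class into a list of
    (source, SQL) pairs, one per mapped data source."""
    return [(source, template.format(unit=unit_id))
            for source, template in MAPPINGS[ontology_class].items()]

for source, sql in rewrite("ont:VibrationReading", 42):
    print(source, "->", sql)
```

Real OBDA systems express mappings declaratively (e.g. in R2RML) and perform logic-based rewriting of conjunctive queries with respect to the ontology; the sketch captures only the architectural role of the mappings, and it is exactly this layer that ED3 will extend with arithmetic and aggregation support.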
In ED3 we will address all of these issues, laying the foundations for a new generation of data access middleware with the conceptual modelling, query processing, and rapid-development infrastructure necessary to support analytic tasks. Moreover, we will develop a prototypical implementation of a suitable abstraction layer, and will evaluate our prototype in real-life deployments with our industrial partners.
February 2016 until January 2019