On Monday and Tuesday this week I attended the third “Theory and Practice of Provenance” workshop in Crete. The event was a great success: lively discussion from people presenting interesting and practical work. For those who don’t know about Provenance, here’s a snappy definition:
‘Provenance’ or ‘lineage’ generally refers to information that ‘helps determine the derivation history of a data product, starting from its original sources’. In other words, a record of where data came from and how it has been processed.
Provenance applies to many different domains, and at the TaPP’11 workshop there were researchers working on theoretical database provenance, scientific workflows, practical implementation issues and systems provenance (people who want to collect provenance at the operating system level), as well as a few security people. I presented a short paper on collecting provenance in clouds, which got some useful feedback.
The event ended with a debate on “how much provenance should we store” – with most people sitting somewhere between two extremes: either we should store just the things we think are most important to our queries, or we should store everything that could possibly impact what we are doing. The arguments on both sides were good: there was a desire to avoid collecting too much useless data, as this slows down search and has an attached cost in terms of storage and processing. On the other hand, the point was made that we don’t actually know how much provenance is enough, and that if we don’t collect all of it, we could come back and find we missed something. Considering the cheapness of storage and processing power, some believed that the overhead was unimportant. As a security researcher interested in trusted provenance, I found the “collect everything” approach to be my cup of tea. If the collecting agent were trusted and could attest to its proper behaviour, provenance information could be made much more tamper-resistant.
However, from the perspective of someone involved in privacy and looking at storage of context (which is a part of provenance), the preservation of privacy seemed to be an excellent reason not to collect everything. For example, I suspect that academic researchers don’t want to store all their data sources: what if you browsed Wikipedia for an overview of a subject area, and that was forever linked with your research paper? Similarly, full provenance during computation might reveal all the other programs you were using, many of which you might not want to share with your peers. Clearly some provenance information has to stay secret.
The rebuttal to this point was that this was an argument for controlled disclosure rather than controlled collection. I think this argument comes up quite often. From a logical perspective (considering only confidentiality) it might be enough to apply access controls that limit access to parts of your provenance collection. However, this adds some interesting requirements. It is now necessary for users to specify policies on what they do and don’t want to reveal. This has been shown to be difficult in practice. Furthermore, the storage of confidential data requires better security than the storage of public (if high integrity) data. The problem quickly turns into digital rights management, which is easier said than implemented. I believe that controlled disclosure and controlled collection are fundamentally different approaches, and the conscientious privacy researcher must choose the latter.
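To make the distinction concrete, here is a minimal sketch of the two approaches. Everything in it is hypothetical and invented for illustration: the `ProvenanceRecord` type, the `sensitive` flag, the policy functions and the record names. It is not taken from any real provenance system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProvenanceRecord:
    source: str      # where the data came from (hypothetical field)
    sensitive: bool  # whether the user would rather keep it private


def collect_controlled(events, allow):
    """Controlled collection: records that fail the policy are never stored."""
    return [e for e in events if allow(e)]


def disclose_controlled(store, requester, can_see):
    """Controlled disclosure: everything is stored; a policy filters each read."""
    return [r for r in store if can_see(requester, r)]


events = [
    ProvenanceRecord("instrument-dataset", sensitive=False),
    ProvenanceRecord("wikipedia-overview", sensitive=True),
]

# Controlled collection: the sensitive record simply does not exist afterwards.
minimal_store = collect_controlled(events, allow=lambda e: not e.sensitive)

# Controlled disclosure: the sensitive record is kept and must now be protected.
full_store = list(events)
peer_view = disclose_controlled(
    full_store,
    "peer",
    can_see=lambda who, r: who == "owner" or not r.sensitive,
)
```

A peer sees the same records in both cases, but under controlled disclosure the sensitive record still sits in `full_store`, so the store itself needs confidentiality protection and the user has to write and maintain the `can_see` policy.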
I still believe that provenance can learn quite a lot from Trusted Computing, and vice versa. In particular, the concept of a “root of trust” – the point at which your trust in a computing system starts and the element which you may have no ability to assure – is relevant. Provenance data must also start somewhere – the first element in the history of a data item, and the trusted agent used to record it. Furthermore, the different types of root of trust are relevant: provenance is reported just like attestations report platform state. In trusted computing we have a “root of trust for reporting” and perhaps we also need one in provenance. The same is true for measurement of provenance data, and storage. Andrew Martin and I wrote about some of this in our paper at TaPP last year but there is much more to do. Could TCG attestation conform with the Open Provenance Model? Can we persuade those working in operating system provenance that the rest of the trusted computing base – the BIOS, bootloader, option ROMs, SMM, etc – also needs to be recorded as provenance? Can the provenance community show us how to query our attested data, or make sense of a Trusted Network Connect MAP database?
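As a rough illustration of the analogy, the sketch below chains provenance records together with a hash, loosely modelled on the way a TPM extends a measurement into a register. It is an assumption-laden toy: the `ROOT` value, the `extend` and `chain` helpers and the record strings are all made up for this example, and nothing here is a TCG or Open Provenance Model interface.

```python
import hashlib

# Hypothetical starting value: the "root" that has to be trusted, not verified.
ROOT = b"\x00" * 32


def extend(chain_value: bytes, record: bytes) -> bytes:
    """Fold one provenance record into the running chain value."""
    return hashlib.sha256(chain_value + hashlib.sha256(record).digest()).digest()


def chain(records, root: bytes = ROOT) -> bytes:
    """Compute the chain value for a whole history, starting from the root."""
    value = root
    for record in records:
        value = extend(value, record)
    return value


history = [b"created from sensor-A", b"filtered by script v1", b"aggregated"]
reported = chain(history)

# Changing any earlier record changes the final value, so a verifier who
# trusts the root and the reported value can detect a rewritten history.
tampered = [b"created from sensor-B"] + history[1:]
```

Recomputing `chain(history)` reproduces `reported`, while `chain(tampered)` does not. What the sketch cannot do is say anything about whether the first record or the agent that recorded it was honest, which is exactly the root of trust question.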
Finally, one of the most interesting short talks was by Devan Donaldson, who studied whether or not provenance information actually made data more trustworthy. He performed a short study of various academic researchers, using structured interviews, and found (perhaps unsurprisingly) that yes, some provenance information really does improve the perception of trustworthiness in scientific data. He also found that a key factor in addition to provenance was the ability to use and query the new data. While these results are what we might expect, they do confirm the theory that provenance can be used to enhance perceived trustworthiness, at least in an academic setting. Whether it works outside academia is a good question: could provenance of the Climategate data have reassured the press and the public?