Software Verification for Weak Memory via Program Transformation

Abstract

Multiprocessors implement weak memory models, but program verifiers often assume Sequential Consistency (SC), and thus may miss bugs due to weak memory. We propose a sound transformation of the program to verify, enabling SC tools to perform verification w.r.t. weak memory. We present experiments for a broad variety of models (from x86-TSO to Power) and a vast range of verification tools, quantify the additional cost of the transformation and highlight the cases when we can drastically reduce it. Our benchmarks include work-queue management code from PostgreSQL.


News


Proofs

Proofs of the paper, formalised in Coq.


Tool manual

Here is the manual of the tool, and how to plug in a model-checker (we have tried CheckFence, ESBMC, MMChecker, Poirot, SatAbs, Threader and our new CBMC-MT).



Case Study: Fixing a WMM bug in PostgreSQL

We show here how we can detect this bug for Power reported by PostgreSQL developers.





We then test their fix (the reader should search for the comment "XXX there really ought to be a memory barrier operation right here" in the diff). We show that this fix might not be sufficient, we provide a counter-example generated by our tool, and we propose an additional fix, which provably fixes the problem.


Summary of systematic testing for CBMC

Here are the average runtime and overhead for CBMC, on all litmus tests. We provide the same data for all model-checkers, on all litmus tests but also on C examples below.

The following scatter plot (click for more details and an enlarged version) depicts the execution time of running CBMC on a program whose all accesses are instrumented, as proposed by Atig et al., vs. execution times for the optimised instrumentation using weighted selection under TSO. Any dot below the diagonal line marks a case where full instrumentation is slower than the optimised one.


Experiments

Here are all our experimental data, for all model-checkers, on all litmus tests and C examples, including standard ones, borrowed ones, the PostgreSQL bug, and RCU.

As we can independently compute the expected verification result for all litmus tests, we are able to assess the soundness of each of the verification tool. The following graphics provide an executive summary of this assessment.









People