Repeatability Evaluation

Repeatability Evaluation Committee Chair

Ian M. Mitchell, University of British Columbia, Canada

Repeatability Evaluation Committee

Dieky Adzkiya, IN
Nikos Arechiga, US
Sergiy Bogomolov, AT
Alessandro Borri, IT
Milan Ceska, UK
Xin Chen, US
Tommaso Dreossi, FR
Meng Guo, SE
Ernst Moritz Hahn, PRC
Bardh Hoxha, US
Taylor Johnson, US
Pablo Nanez, CO
Petter Nilsson, US
Alessandro Papadopoulos, SE
Matthias Rungger, DE
Dorsa Sadigh, US
Fedor Shmarov, UK
Christoffer Sloth, DK
Sadegh Soudjani, UK

Background and Goals

HSCC has a rich history of publishing strong papers emphasizing computational contributions; however, subsequent re-creation of these computational elements is often challenging because details of the implementation are unavoidably absent from the paper. Some authors post their code and data to their websites, but there is little formal incentive to do so and no easy way to determine whether others can actually use the result. As a consequence, computational results often become non-reproducible -- even by the research group that originally produced them -- after just a few years.

The goal of the HSCC repeatability evaluation process is to improve the reproducibility of computational results in the papers selected for the conference.

Benefits for Authors

We hope that this process will provide the following benefits to authors:

Raise the profile of papers containing repeatable computational results by highlighting them at the conference and online.
Raise the profile of HSCC as a whole, by making it easier to build upon the published results.
Provide authors with an incentive to adopt best-practices for code and data management that are known to improve the quality and extendability of computational results.
Provide authors an opportunity to receive feedback from independent reviewers about whether their computational results can be repeated.
Obtain a special mention in the conference proceedings, and take part in the competition for the best RE award.

While creating a repeatability package will require some work from the authors, we believe the cost of that extra work is outweighed by a direct benefit to members of the authors' research lab: if an independent reviewer can replicate the results with a minimum of effort, it is much more likely that future members of the lab will also be able to do so, even if the primary author has departed.

The repeatability evaluation process for HSCC draws upon several similar efforts at other conferences (SIGMOD, SAS, CAV, ECOOP, OOPSLA), and a first experimental run was held at HSCC 2014.

Author Instructions and Submission Guidelines

Authors of papers accepted to HSCC 2016 - and especially Tool Papers - are invited to submit a repeatability package (RP). An RP submission is optional, and will not affect the final publication of the corresponding paper. We further solicit submissions by authors intending to partake in the Poster/Demo session at HSCC 2016 and to participants at HSCC 2015 (where the RE process did not run).
RPs are considered confidential material in the same sense as initial paper submissions: committee members agree not to share RP contents and to delete them after evaluation. RPs remain the property of the authors, and there is no requirement to post them publicly (although we encourage you to do so).
Papers whose RPs pass the repeatability evaluation criteria will be listed online and in the final proceedings. On the other hand, papers whose RPs do not pass the repeatability evaluation criteria will be treated the same as papers which do not submit RPs (eg: failing RPs will not be individually identified).

The RP consists of three components:

A copy (in pdf format) of the final camera-ready paper. This copy will be used by the REC to evaluate how well the elements of the RP match the paper.
A document (either a webpage, a pdf, or a plain text file) explaining at a minimum:

What elements of the paper are included in the RP (eg: specific figures, tables, etc.).
The system requirements for running the RP (eg: OS, compilers, environments, etc.).
Instructions for installing and running the software and extracting the corresponding results.

The software and any accompanying data. We will accept at least the following formats:

A link to a public online repository, such as bitbucket.org, code.google.com, github.com or sourceforge.net.
An archive in a standard file format (eg: zip, gz, tgz) containing all the necessary components.
A link to a virtual machine image (using either VirtualBox or VMware) which can be downloaded.

If you would like to submit software and/or data in another format, please contact the RE committee chair in advance to discuss options.

The RP should be submitted through Easychair (see Submission Details below, and note that this is a different site from the one used for initial paper submissions). When preparing your RP, keep in mind that other conferences have reported that the most common reason for reproducibility failure is installation problems. We recommend that you have an independent member of your lab test your installation instructions and RP on a clean machine before final submission.
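Because installation problems are the most commonly reported cause of failure, it may help to record the environment in which the results were produced and paste that record into the system requirements part of the instructions document. The following is a minimal sketch for an RP whose software runs under Python; the function names and the choice of fields are assumptions made for illustration, not part of the HSCC requirements.

```python
import platform
import sys


def describe_environment():
    """Collect basic facts about the machine the results were produced on."""
    return {
        "machine": platform.machine(),        # e.g. CPU architecture
        "os": platform.platform(),            # OS name and version string
        "python": sys.version.split()[0],     # interpreter version only
    }


def format_environment(env):
    """Render the record as 'key: value' lines for the instructions document."""
    return "\n".join(f"{key}: {value}" for key, value in sorted(env.items()))
```

Calling print(format_environment(describe_environment())) on the original machine produces a short block of text that a reviewer can compare against their own setup.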

The repeatability evaluation process uses anonymous reviews so as to solicit honest feedback. Authors of RPs should make a genuine effort to avoid learning the identity of the reviewers. This effort may require turning off analytics or only using systems with high enough traffic that REC accesses will not be apparent. In all cases where tracing is unavoidable the authors should provide warnings in the documentation so that reviewers can take necessary precautions to maintain anonymity.

Submission Details

The Easychair website accepting RP submissions is here:

https://easychair.org/conferences/?conf=hscc2016re

The deadline for RP submissions is January 24, 2016.

Repeatability Evaluation Criteria

Each member of the Repeatability Evaluation Committee assigned to review a Repeatability Package (RP) will judge it based on three criteria -- coverage, instructions, and quality -- where each criterion is assessed on the following scale:

significantly exceeds expectations (5),
exceeds expectations (4),
meets expectations (3),
falls below expectations (2),
missing or significantly falls below expectations (1).

In order to be judged "repeatable" an RP must "meet expectations" (average score of at least 3), and must not have any missing elements (no scores of 1). Each RP is evaluated independently according to the objective criteria. The higher scores ("exceeds" or "significantly exceeds expectations") in the criteria should be considered aspirational goals, not requirements for acceptance.
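The acceptance rule can be written out as a short check. This is a sketch only: it assumes that "average score of 3" means an average of at least 3 across the three criteria, and the function name and dictionary layout are hypothetical. How scores from several reviewers are combined is not specified here and is not modelled.

```python
from statistics import mean

# Criterion names come from the text; everything else is illustrative.
CRITERIA = ("coverage", "instructions", "quality")


def is_repeatable(scores):
    """Return True if one set of scores passes the rule described above."""
    missing = [name for name in CRITERIA if name not in scores]
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    values = [scores[name] for name in CRITERIA]
    if any(value not in range(1, 6) for value in values):
        raise ValueError("each score must be an integer from 1 to 5")
    # Assumed reading: average of at least 3, and no criterion scored 1.
    return mean(values) >= 3 and min(values) > 1
```

For example, scores of 5, 5 and 1 average above 3 but still fail, because one criterion is missing.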

Coverage

What fraction of the appropriate figures and tables are reproduced by the RP? Note that some figures and tables should not be included in this calculation; for example, figures generated in a drawing program, or tables listing only parameter values. The focus is on those figures or tables in the paper containing computationally generated or processed experimental evidence to support the claims of the paper.

Note that satisfying this criterion does not require that the corresponding figures or tables be recreated in exactly the same format as appears in the paper, merely that the data underlying those figures or tables be generated in a recognizable format.

A repeatable element is one for which the computation can be rerun by following the instructions in the RP in a suitably equipped environment. An extensible element is one for which variations of the original computation can be run by modifying elements of the code and/or data. Consequently, necessary conditions for extensibility include that the modifiable elements be identified in the instructions or documentation, and that all source code either be available or involve only calls to commonly available and trusted software (eg: Windows, Linux, C or Python standard libraries, Matlab, etc.).

The categories for this criterion are:

None (missing / 1): There are no repeatable elements. This case automatically applies to papers which do not submit an RP or papers which contain no computational elements.
Some (falls below expectations / 2): There is at least one repeatable element.
Most (meets expectations / 3): The majority (at least half) of the elements are repeatable.
All repeatable or most extensible (exceeds expectations / 4): All elements are repeatable or most are repeatable and easily modified. Note that if there is only one computational element and it is repeatable, then this score should be awarded.
All extensible (significantly exceeds expectations / 5): All elements are repeatable and easily modified.
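The coverage categories above can be summarized as a mapping from element counts to a score. The sketch below is one possible reading, not an official scoring tool: it assumes that "most" means at least half (as stated for score 3), that "most are repeatable and easily modified" means at least half of all elements are extensible, and that extensible elements are counted as a subset of repeatable ones. The function and parameter names are hypothetical.

```python
def coverage_score(total, repeatable, extensible):
    """Map element counts to the 1-5 coverage scale.

    total:      computational figures and tables in the paper
    repeatable: how many of those can be rerun from the RP
    extensible: how many of the repeatable ones are also easily modified
    """
    if not 0 <= extensible <= repeatable <= total:
        raise ValueError("expected 0 <= extensible <= repeatable <= total")
    if repeatable == 0:
        return 1  # None: also covers papers with no computational elements
    if extensible == total:
        return 5  # All extensible
    if repeatable == total or 2 * extensible >= total:
        return 4  # All repeatable, or most extensible (assumed: at least half)
    if 2 * repeatable >= total:
        return 3  # Most repeatable (at least half)
    return 2      # Some: at least one repeatable element
```

Under this reading, a paper with a single computational element that is repeatable but not extensible receives 4, which matches the note in the "All repeatable or most extensible" category.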

Instructions

This criterion is focused on the instructions which will allow another user to recreate the computational results from the paper.

None (missing / 1): No instructions were included in the RP.
Rudimentary (falls below expectations / 2): The instructions specify a script or command to run, but little else.
Complete (meets expectations / 3): For every computational element that is repeatable, there is a specific instruction which explains how to repeat it. The environment under which the software was originally run is described.
Comprehensive (exceeds expectations / 4): For every computational element that is repeatable there is a single command which recreates that element almost exactly as it appears in the published paper (eg: file format, fonts, line styles, etc. might not be the same, but the content of the element is the same). In addition to identifying the specific environment under which the software was originally run, a broader class of environments is identified under which it could run.
Outstanding (significantly exceeds expectations / 5): In addition to the criteria for a comprehensive set of instructions, explanations are provided of:

all the major components / modules in the software,
important design decisions made during implementation,
how to modify / extend the software, and/or
what environments / modifications would break the software.
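The "comprehensive" level asks for a single command per repeatable element. One way to arrange this, sketched below, is a small driver with a registry that maps each element to the function that regenerates its data. The element names, the placeholder computations and the CSV output are all hypothetical; a real RP would call its own code here.

```python
import argparse


def figure_1_data():
    """Hypothetical stand-in for the computation behind one figure."""
    return [("x", "x_squared")] + [(x, x * x) for x in range(4)]


def table_1_data():
    """Hypothetical stand-in for the computation behind one table."""
    return [("n", "factorial"), (1, 1), (2, 2), (3, 6)]


# Registry: one entry per repeatable element of the paper.
ELEMENTS = {"fig1": figure_1_data, "tab1": table_1_data}


def run(element):
    """Recompute one element and return its rows as CSV lines, header first."""
    rows = ELEMENTS[element]()
    return [",".join(str(value) for value in row) for row in rows]


def main(argv=None):
    """Parse the element name from the command line and print its data."""
    parser = argparse.ArgumentParser(description="Recreate one element of the paper.")
    parser.add_argument("element", choices=sorted(ELEMENTS))
    args = parser.parse_args(argv)
    print("\n".join(run(args.element)))
    return 0
```

If this were saved as a script that ends by calling main(), each element could be recreated with one command naming the element, and the instructions document would only need to list the available names.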

Quality

This criterion explores the documentation and trustworthiness of the software and its results. A set of scripts that exactly recreates, for example, the figures from the paper certainly aids repeatability. However, without well-documented code it is hard to understand how the data in a figure were processed; without well-documented data it is hard to determine whether the input is correct; and without testing it is hard to determine whether the results can be trusted.

If there are tests in the RP which are not included in the paper, they should at least be mentioned in the instructions document. Documentation of test details can be put into the instructions document or into a separate document in the RP.

The categories for this criterion are:

None (missing / 1): There is no evidence of documentation or testing.
Rudimentary documentation (falls below expectations / 2): The purpose of almost all files is documented (preferably within the file, but otherwise in the instructions or a separate readme file).
Comprehensive documentation (meets expectations / 3): The purpose of almost all files is documented. Within source code files, almost all classes, methods, attributes and variables are given long, clear names and/or documentation of their purpose. Within data files, the format and structure of the data are documented; for example, in comma separated value (csv) files there is a header row and/or comments explaining the contents of each column.
Comprehensive documentation and rudimentary testing (exceeds expectations / 4): In addition to the criteria for comprehensive documentation, there are identified test cases with known solutions which can be run to validate at least some components of the code.
Comprehensive documentation and testing (significantly exceeds expectations / 5): In addition to the criteria for comprehensive documentation, there are clearly identified unit tests (preferably run with a unit test framework) which exercise a significant fraction of the smaller components of the code (individual functions and classes) and system level tests which exercise a significant fraction of the full package. Unit tests are typically self-documenting, but the system level tests will require documentation of at least the source of the known solution.

Note that tests are a form of documentation, so it is not really possible to have testing without documentation.
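As an illustration of "test cases with known solutions" run under a unit test framework, the sketch below tests a small numerical routine against integrals whose values are known analytically. The routine is a hypothetical stand-in for a component of an RP, and Python's standard unittest module is used only as one example of a framework.

```python
import math
import unittest


def integrate_trapezoid(f, a, b, n):
    """Composite trapezoid rule on [a, b] with n equal subintervals."""
    h = (b - a) / n
    interior = sum(f(a + i * h) for i in range(1, n))
    return h * (0.5 * f(a) + interior + 0.5 * f(b))


class TestIntegrateTrapezoid(unittest.TestCase):
    def test_linear_function_is_exact(self):
        # Known solution: the integral of 2x over [0, 1] is 1, and the
        # trapezoid rule is exact for linear integrands.
        result = integrate_trapezoid(lambda x: 2 * x, 0.0, 1.0, 4)
        self.assertAlmostEqual(result, 1.0)

    def test_sine_over_half_period(self):
        # Known solution: the integral of sin over [0, pi] is 2 (analytic).
        result = integrate_trapezoid(math.sin, 0.0, math.pi, 1000)
        self.assertAlmostEqual(result, 2.0, places=5)
```

Each test names the source of its expected value in a comment, which is the kind of documentation the criteria ask for when a known solution is used.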

Sample Repeatability Package

Citation: Ian M. Mitchell, "Scalable calculation of reach sets and tubes for nonlinear systems with terminal integrators: a mixed implicit explicit formulation" in Hybrid Systems Computation and Control, pp. 103-112 (2011).
Official version: http://dx.doi.org/10.1145/1967701.1967718
Author postprint: https://www.cs.ubc.ca/~mitchell/Papers/myHSCC11.pdf
Repeatability package: https://www.cs.ubc.ca/~mitchell/ToolboxLS/PublicationCode/hscc2011.zip
Repeatability Evaluation: meets or exceeds expectations (average 3⅓).

Coverage: all repeatable (exceeds expectations / 4). Code to recreate figures 1-5 and 7-8 is provided. Figure 6 is a hand-drawn coordinate system. There are no tables.
Instructions: complete (meets expectations / 3). The included readme.txt file lists which m-files are used to recreate which figures. The environment is described (Matlab R2010b or later, link to the Toolbox of Level Set Methods). However, some effort is required to extract certain figures (eg: figures 7 & 8).
Quality: comprehensive documentation (meets expectations / 3). All source files include Matlab help entries for every function as well as numerous comments. There are no data files. However, there is no sign of testing.

[Thanks to Ian M. Mitchell for the content of this page]
