Interpretability in Machine Learning
Supervisors
Suitable for
Abstract
The growing number of decisions influenced by machine learning models drives the need for explanations of why a system makes
a particular prediction. Counterfactual explanations (CEs) are a practical tool for this purpose: they take the form
“If X had not occurred, then Y would not have occurred” [Wachter et al., 2017]. In practice, such an explanation is an
alternative input that is similar to the original input but leads to a different classification (see the sketch below).
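To make this concrete, here is a minimal sketch of one common way to search for such an input, in the spirit of Wachter et al. [2017]: gradient descent on the input itself, trading off flipping the classifier's prediction against distance to the original. The toy model, shapes, and hyperparameters are illustrative placeholders, not part of the project.

```python
import torch
import torch.nn as nn

def counterfactual(model, x, target, dist_weight=0.1, steps=200, lr=0.05):
    """Search for a counterfactual of x: a nearby input that the
    (frozen) classifier assigns to `target` instead of its original class."""
    model.eval()
    for p in model.parameters():
        p.requires_grad_(False)  # only the input is optimised
    cf = x.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([cf], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        logits = model(cf)
        # Push the prediction towards the target class...
        loss = nn.functional.cross_entropy(logits, target)
        # ...while staying close to the original input (L1 keeps changes sparse).
        loss = loss + dist_weight * (cf - x).abs().sum()
        loss.backward()
        opt.step()
    return cf.detach()

# Toy usage on 2-D inputs; any differentiable classifier would do.
model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 2))
x = torch.randn(1, 2)
target = torch.tensor([1])  # the class the counterfactual should receive
cf = counterfactual(model, x, target)
```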
In this project, we will build on previous work in which we developed a simple and fast method for generating interpretable
CEs for neural networks in the white-box setting by leveraging the predictive uncertainty of the classifier [Schut et al.,
2020]. The primary goal of the project is to extend this method to black-box models using proxy models [Afrabandpey
et al., 2020]. Time permitting, the project can be extended to developing metrics for interpretability.
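One way to make the search above uncertainty-aware, loosely in the spirit of Schut et al. [2020], is to require the counterfactual to be confidently classified by an ensemble rather than a single network: averaging the target loss over ensemble members drives the search towards regions of low predictive uncertainty, which tend to yield more realistic CEs. This is a hedged sketch of the objective only, with hypothetical names and shapes; it is not the exact method of the paper.

```python
import torch
import torch.nn as nn

def uncertainty_aware_loss(ensemble, cf, target):
    """Average the target cross-entropy over an ensemble of classifiers.
    A counterfactual that satisfies every member lies in a region of low
    predictive uncertainty, rather than exploiting one model's quirks."""
    losses = [nn.functional.cross_entropy(member(cf), target) for member in ensemble]
    return torch.stack(losses).mean()

# Drop-in replacement for the single-model loss in the sketch above.
ensemble = [nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 2))
            for _ in range(5)]
cf = torch.randn(1, 2, requires_grad=True)
target = torch.tensor([1])
loss = uncertainty_aware_loss(ensemble, cf, target)
loss.backward()  # gradients w.r.t. the counterfactual input
```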
References:
Afrabandpey, H., Peltola, T., Piironen, J., Vehtari, A., and Kaski, S. (2020). A decision-theoretic approach for model interpretability
in Bayesian framework. Machine Learning, pages 1–22.
Schut, L., Key, O., McGrath, R., Costabello, L., Sacaleanu, B., Corcoran, M., and Gal, Y. (2020). Uncertainty-aware counterfactual
explanations for medical diagnosis. In NeurIPS Machine Learning for Health Workshop.
Wachter, S., Mittelstadt, B., and Russell, C. (2017). Counterfactual explanations without opening the black box: Automated
decisions and the GDPR. Harvard Journal of Law & Technology, 31:841.
Prerequisites
Machine Learning, Python (incl. PyTorch)