Interpretability in Machine Learning


Suitable for

MSc in Computer Science


The growing number of decisions influenced by machine learning models drives the need for explanations of why a system makes a particular prediction. Counterfactual explanations (CEs) are a practical tool for demonstrating why machine learning classifiers make particular decisions. CEs are of the form “If X had not occurred, then Y would not have occurred” [Wachter et al., 2017]. In practice, this type of explanation is an alternate input that is similar to the original input but leads to a different classification.

In this project, we will build on previous work in which we developed a simple and fast method for generating interpretable CEs for neural networks in the white-box setting by leveraging the predictive uncertainty of the classifier [Schut et al., 2020]. The primary goal of the project is to extend this method to black-box models using proxy models [Afrabandpey et al., 2020]. Time permitting, the project can be extended to developing metrics for interpretability.

References:

Afrabandpey, H., Peltola, T., Piironen, J., Vehtari, A., and Kaski, S. (2020). A decision-theoretic approach for model interpretability in Bayesian framework. Machine Learning, pages 1–22.

Schut, L., Key, O., McGrath, R., Costabello, L., Sacaleanu, B., Corcoran, M., and Gal, Y. (2020). Uncertainty-aware counterfactual explanations for medical diagnosis. NeurIPS Machine Learning for Health Workshop.

Wachter, S., Mittelstadt, B., and Russell, C. (2017). Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harv. JL & Tech., 31:841.

Prerequisites

Machine Learning, Python (incl. PyTorch)
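
Illustrative sketch: the idea of a counterfactual explanation as "an alternate input that is similar to the original input but leads to a different classification" can be shown with a toy example. The snippet below is not the uncertainty-based method of Schut et al. nor the proxy-model approach this project targets; it is a minimal stand-in that assumes a hypothetical two-feature logistic classifier with made-up weights (`WEIGHTS`, `BIAS`) and simply walks the input along the logit gradient until the predicted class flips. All names in it are illustrative.

```python
import math

# Hypothetical toy classifier: logistic regression with made-up weights.
WEIGHTS = [1.5, -2.0]
BIAS = -0.5


def predict_proba(x):
    """Probability of class 1 under the toy logistic model."""
    z = sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))


def predict(x):
    """Hard class label (0 or 1)."""
    return int(predict_proba(x) >= 0.5)


def find_counterfactual(x, step=0.05, max_steps=1000):
    """Take small steps from x until the predicted class flips.

    For a logistic model the gradient of the logit with respect to the
    input is the weight vector, so each step moves along +/- WEIGHTS
    (normalised to unit length). Returns None if no flip is found.
    """
    original = predict(x)
    direction = 1.0 if original == 0 else -1.0
    norm = math.sqrt(sum(w * w for w in WEIGHTS))
    cf = list(x)
    for _ in range(max_steps):
        cf = [c + direction * step * w / norm for c, w in zip(cf, WEIGHTS)]
        if predict(cf) != original:
            return cf
    return None


def l2_distance(a, b):
    """Euclidean distance, used here as a simple similarity measure."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


x = [0.0, 1.0]                 # original input, classified as class 0
cf = find_counterfactual(x)    # nearby input classified as class 1
print("original class:", predict(x))
print("counterfactual:", cf, "class:", predict(cf))
print("distance moved:", l2_distance(x, cf))
```

The difference between `x` and `cf` is what a reader would interpret as the explanation: the smallest change along this direction that alters the decision. In a black-box setting the gradient would not be available directly, which is where a proxy model could plausibly stand in for the classifier.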