Evaluating and Benchmarking Activation Monitors for Large Language Models
Supervisors
Suitable for
Abstract
One effective way of preventing harmful outputs in large language models (LLMs) is to monitor the intermediate activations produced by the network during its forward pass. Methodologies for catching problematic behavior often rely on the simple “linear probe” (Alain et al. 2016), with more sophisticated multi-stage monitors using probes as the first line of defense (Cunningham et al. 2025, McKenzie et al. 2025).
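To make the setup concrete, below is a minimal sketch of such a linear probe, assuming activations for harmful and benign examples have already been cached from a chosen layer of the monitored model; the arrays here are synthetic placeholders, not a real dataset.

    # Minimal sketch of a linear probe on LLM activations (hypothetical setup):
    # activations for "harmful" vs. "benign" inputs are assumed to be cached
    # from one layer's residual stream as (n_examples, d_model) arrays.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)

    # Stand-in for real cached activations and labels
    # (1 = harmful behavior present, 0 = benign).
    X = rng.normal(size=(2000, 4096))
    y = rng.integers(0, 2, size=2000)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0
    )

    # The "linear probe" is simply a logistic regression fit on frozen activations.
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X_train, y_train)

    scores = probe.predict_proba(X_test)[:, 1]
    print("probe AUROC:", roc_auc_score(y_test, scores))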
Given the widespread reliance on probes, understanding their vulnerabilities and limitations is vital for safety and security. This project will thoroughly evaluate the failure modes and attack surfaces of probes used for activation monitoring (Bailey et al. 2024), and may also build better datasets and standardized evaluation suites.
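One metric such an evaluation suite might standardize is the probe's detection rate at a fixed false-positive budget, compared in-distribution and under distribution shift (e.g., paraphrased or adversarially rewritten inputs). The sketch below is illustrative only; the score distributions are synthetic placeholders, and in practice they would come from a trained probe's predicted probabilities on cached activations.

    # Sketch of one candidate evaluation: detection rate at a fixed
    # false-positive rate, in-distribution vs. under distribution shift.
    import numpy as np

    def detection_rate_at_fpr(scores_benign, scores_harmful, max_fpr=0.01):
        """Fraction of harmful examples caught when the threshold is chosen so
        that at most `max_fpr` of benign examples are flagged."""
        threshold = np.quantile(scores_benign, 1.0 - max_fpr)
        return float(np.mean(scores_harmful >= threshold))

    # Synthetic probe scores standing in for probe.predict_proba(...)[:, 1].
    scores_benign = np.random.default_rng(1).normal(0.2, 0.1, size=1000)
    scores_harmful_iid = np.random.default_rng(2).normal(0.8, 0.1, size=1000)
    scores_harmful_shift = np.random.default_rng(3).normal(0.5, 0.2, size=1000)

    print("in-distribution:", detection_rate_at_fpr(scores_benign, scores_harmful_iid))
    print("under shift:   ", detection_rate_at_fpr(scores_benign, scores_harmful_shift))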