Knowledge Localization for Capability Removal in LLMs
Igor Shilov (Imperial College London)
- 11:00, 20th April 2026 (week 0, Trinity Term 2026), Bill Roscoe Lecture Theatre
Abstract: Igor will present his work on localizing knowledge domains to designated subsets of LLM parameters during pretraining, enabling their removal by simply zeroing out those parameters. Knowledge localization has applications in controlling model capabilities, from removing dual-use knowledge such as biological threats or cyber-offense to managing what models retain about sensitive training data. The method, Selective Gradient Masking (SGTM), achieves this through data-dependent gradient masking: target-domain examples only update their dedicated subnetwork, while the rest of the model is trained to function without it. A natural baseline would be data filtering – just remove unwanted content before training. But localization offers two advantages: it is robust to labeling errors, since mislabeled target data still gets absorbed into the designated subnetwork rather than leaking into the rest of the model; and it produces two models from a single training run – one with full capabilities and one without. We verify the effectiveness of the method on two setups – removing a language from a bilingual model and removing biology knowledge from a Wikipedia-trained model – comparing against data filtering and post-training unlearning baselines. This work was done as part of the Anthropic AI Safety Fellowship.
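The core mechanism described in the abstract can be illustrated with a toy sketch. The code below is not the paper's implementation; it assumes a simple linear model, a fixed half-and-half parameter split, and single-example SGD, purely to show the two masking rules: target-domain examples update only the designated subnetwork, while general examples are trained with that subnetwork zeroed out and update only the remaining parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear model: the second half of the weight vector plays the role of
# the subnetwork designated for the target domain. (The partitioning scheme
# and sizes here are illustrative assumptions, not the paper's.)
D = 8
w = rng.normal(size=D)
target_mask = np.zeros(D)
target_mask[D // 2:] = 1.0  # 1 where a parameter belongs to the target subnetwork

def sgd_step(w, x, y, is_target, lr=0.1):
    """One masked SGD step on a squared-error loss for a single example."""
    if is_target:
        # Target-domain example: full forward pass, but the gradient is
        # masked so only the target subnetwork is updated.
        pred = x @ w
        grad_mask = target_mask
    else:
        # General example: forward pass with the target subnetwork zeroed,
        # so the rest of the model learns to function without it, and only
        # the general parameters are updated.
        pred = x @ (w * (1.0 - target_mask))
        grad_mask = 1.0 - target_mask
    grad = 2.0 * (pred - y) * x
    return w - lr * grad * grad_mask

x = rng.normal(size=D)
w_after_general = sgd_step(w, x, 1.0, is_target=False)
w_after_target = sgd_step(w, x, 1.0, is_target=True)

# General steps never touch the target subnetwork, and target steps never
# touch the general parameters, so zeroing the target half at the end
# removes the target domain without disturbing the general model.
assert np.allclose(w_after_general[D // 2:], w[D // 2:])
assert np.allclose(w_after_target[:D // 2], w[:D // 2])
```

Capability removal then amounts to `w * (1.0 - target_mask)`, which is exactly the forward pass the general parameters were trained under.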
Paper: https://arxiv.org/abs/2512.05648