Knowledge Localization for Capability Removal in LLMs

Igor Shilov ( Imperial College London )
Abstract: Igor will present his work on localizing knowledge domains to designated subsets of LLM parameters during pretraining, enabling their removal by simply zeroing out those parameters. Knowledge localization has applications in controlling model capabilities, from removing dual-use knowledge such as biological threats or cyber-offense to managing what models retain about sensitive training data. The method, Selective Gradient Masking (SGTM), achieves this through data-dependent gradient masking: target-domain examples update only their dedicated subnetwork, while the rest of the model is trained to function without it. A natural baseline is data filtering – simply removing unwanted content before training. But localization offers two advantages: it is robust to labeling errors, since mislabeled target data is still absorbed into the designated subnetwork rather than leaking into the rest of the model; and it produces two models from a single training run – one with full capabilities and one without. We verify the effectiveness of the method in two setups – removing a language from a bilingual model and removing biology knowledge from a Wikipedia-trained model – comparing against data filtering and post-training unlearning baselines. This work was done as part of the Anthropic AI Safety Fellowship.
Paper: https://arxiv.org/abs/2512.05648
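The core training rule described in the abstract can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: parameter names, the `target_mask` structure, and the plain SGD update are assumptions made for the sketch. The idea is that a fixed binary mask designates the target-domain subnetwork; target-domain examples update only the masked parameters, all other examples update only the complement, and removal afterwards is just zeroing the masked parameters.

```python
import numpy as np

def sgd_step_with_masking(params, grads, is_target_example, target_mask, lr=0.1):
    """One SGD step with data-dependent gradient masking (illustrative sketch).

    target_mask[name] is 1 where the parameter belongs to the designated
    target-domain subnetwork, 0 elsewhere (a hypothetical layout).
    - Target-domain examples update ONLY the designated subnetwork.
    - All other examples update ONLY the remaining parameters, so the
      rest of the model learns to function without the subnetwork.
    """
    for name, g in grads.items():
        mask = target_mask[name] if is_target_example else 1.0 - target_mask[name]
        params[name] -= lr * mask * g
    return params

def remove_capability(params, target_mask):
    """Zero out the designated subnetwork to remove the localized knowledge."""
    return {name: p * (1.0 - target_mask[name]) for name, p in params.items()}

# Toy usage: a single 4-element "parameter", first half designated as target subnetwork.
params = {"w": np.ones(4)}
mask = {"w": np.array([1.0, 1.0, 0.0, 0.0])}
grads = {"w": np.ones(4)}

params = sgd_step_with_masking(params, grads, is_target_example=True, target_mask=mask)
# Only the designated half moved: w == [0.9, 0.9, 1.0, 1.0]
params = sgd_step_with_masking(params, grads, is_target_example=False, target_mask=mask)
# Only the complement moved: w == [0.9, 0.9, 0.9, 0.9]
clean = remove_capability(params, mask)
# Subnetwork zeroed: w == [0.0, 0.0, 0.9, 0.9]
```

A single mask lookup per parameter tensor keeps the per-step overhead negligible; in a real model the mask would be defined per weight matrix (e.g. reserving designated attention heads or MLP columns), but the update rule is the same.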

Speaker bio

Igor Shilov is a PhD researcher at Imperial College London, working on the privacy and security of AI. Before starting his PhD in 2023, Igor spent over 10 years as a software engineer, most recently as a Research Engineer at Meta, where he worked on privacy-preserving machine learning with a focus on differential privacy and federated learning. His research interests include LLM memorization, adversarial attacks and security, and model internals. In 2025, Igor was part of the inaugural cohort of the Anthropic AI Safety Fellowship, working on knowledge localization in LLMs.