Skip to main content

Safeguarding agents against decomposition attacks

Supervisors

Suitable for

MSc in Advanced Computer Science

Abstract

Safeguarding agents against decomposition attacks

A number of recent techniques have been introduced for efficiently monitoring large language models (LLMs) for harmful behavior--monitoring model generations, their internal states, or the combination of the two (McKenzie et al, 2025). However, much less effort has been paid to monitoring across multi-turn interactions (Jaipersaud et al, 2025) or for very long contexts. Harmful instructions may be strategically spread across multiple requests, each of which look benign in isolation* (*Yueh-Han et al*, *2025*). This project aims to build more reliable methods for defending against the so-called decomposition attacks, building stronger guardrails against the changing attack landscape. This project will likely be joint with collaborators from Microsoft as an industry partner.