Safeguarding agents against decomposition attacks

Supervisors

Suitable for

Abstract

A number of recent techniques have been introduced for efficiently monitoring large language models (LLMs) for harmful behavior--monitoring model generations, their internal states, or the combination of the two (McKenzie et al, 2025). However, much less effort has been paid to monitoring across multi-turn interactions (Jaipersaud et al, 2025) or for very long contexts. Harmful instructions may be strategically spread across multiple requests, each of which look benign in isolation* (*Yueh-Han et al*, *2025*). This project aims to build more reliable methods for defending against the so-called decomposition attacks, building stronger guardrails against the changing attack landscape. This project will likely be joint with collaborators from Microsoft as an industry partner.

Safeguarding agents against decomposition attacks

Supervisors

Suitable for

Abstract

Student Space