Enhancing safety alignment in LLMs via realistic latent adversarial training
Supervisors
Suitable for
Abstract
Prerequisites:
1. E: familiar with the basics of LLMs and adversarial training
2. D: familiar with jailbreaking and safety alignment
Background
Despite significant progress in aligning LLMs to human-specified safety rules, these models remain vulnerable
to out-of-distribution (OOD) inputs such as adversarial prompts that bypass safety guardrails and elicit harmful
responses.
This failure mode presents serious risks. One of the most immediate is that it could enable bad actors to leverage the
capabilities of frontier LLMs to carry out harmful activities that would otherwise be beyond their reach, such as
crafting high-grade explosives.
At the same time, we recognize that many existing safety guardrails, particularly those based on adversarial training,
often degrade general-purpose capabilities and/or lead to overly conservative behaviors such as excessive
refusals.
Therefore, the high-level goal of this project is to make LLMs consistently comply with human-specified safety rules
while minimizing the loss of general capabilities. We aim to establish worst-case behavioral guarantees for advanced
LLMs, i.e., ensuring that no input, including adversarially constructed prompts, can elicit responses that violate clearly
defined safety constraints. To this end, our approach emphasizes a model's inherent safety alignment: no pre- or
post-processing or external monitoring is applied around the model to moderate its inputs and outputs.
Furthermore, a central objective of our research
is to develop methods that maintain or recover
general capabilities while ensuring robust safety. Our dual-goal framework
seeks to push the frontier of what is possible in safe and performant LLM deployment, rather than accepting a tradeoff between
safety and usefulness.
Focus
Our project builds on existing latent space adversarial training methods, such as latent adversarial training
[1, 2] and continuous adversarial training [3]. In each training step, given a harmful prompt, these methods first optimize
a perturbation to the latent states that increases the likelihood of an unsafe completion (e.g., “Sure, here it is…”)
and/or decreases the likelihood of a safe completion (e.g., “Sorry, I cannot…”). The model parameters are
then
updated to minimize the likelihood of unsafe completions and/or maximize the likelihood of safe completions for
the perturbed latent states. The key distinction between the two approaches is the location of the perturbation:
latent
adversarial training perturbs the hidden layer activations, while continuous adversarial training perturbs the input embeddings
before they enter any hidden layers.
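For concreteness, the following is a minimal PyTorch-style sketch of one such training step for the latent variant, assuming a Hugging Face-style decoder LM and a forward hook on one hidden layer; the function names, hyperparameters, and the L-infinity epsilon-ball are illustrative placeholders rather than the exact algorithms of [1, 2, 3].

```python
# Minimal sketch of one LAT step, assuming a Hugging Face-style decoder LM whose
# chosen hidden layer is perturbed via a forward hook. All hyperparameters and
# helper names are placeholders, not the exact algorithms of [1-3].
import torch

def lm_loss(model, input_ids, labels):
    # Cross-entropy over completion tokens (labels are -100 on prompt tokens).
    return model(input_ids=input_ids, labels=labels).loss

def make_hook(get_delta, prompt_len):
    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        delta = get_delta()
        if delta is not None:
            # Perturb only the prompt positions so the same delta can be reused
            # for forward passes whose completions have different lengths.
            hidden = torch.cat(
                [hidden[:, :prompt_len] + delta, hidden[:, prompt_len:]], dim=1)
        return ((hidden,) + tuple(output[1:])) if isinstance(output, tuple) else hidden
    return hook

def lat_step(model, layer, unsafe_ids, unsafe_labels, safe_ids, safe_labels,
             prompt_len, optimizer, epsilon=1.0, attack_lr=0.1, attack_steps=8):
    state = {"delta": None}
    handle = layer.register_forward_hook(make_hook(lambda: state["delta"], prompt_len))
    try:
        delta = torch.zeros(unsafe_ids.shape[0], prompt_len, model.config.hidden_size,
                            device=unsafe_ids.device, requires_grad=True)
        state["delta"] = delta

        # Inner maximization: find a perturbation that makes the unsafe completion likely.
        for _ in range(attack_steps):
            attack_loss = lm_loss(model, unsafe_ids, unsafe_labels)
            grad, = torch.autograd.grad(attack_loss, delta)
            with torch.no_grad():
                delta -= attack_lr * grad.sign()   # lower loss = more likely unsafe completion
                delta.clamp_(-epsilon, epsilon)    # L_inf epsilon-ball constraint

        # Outer minimization: update the model so that, under the perturbed latents,
        # the safe completion becomes likely and the unsafe one does not (the "away"
        # term is typically clipped or down-weighted in practice).
        state["delta"] = delta.detach()
        optimizer.zero_grad()
        loss = lm_loss(model, safe_ids, safe_labels) - lm_loss(model, unsafe_ids, unsafe_labels)
        loss.backward()
        optimizer.step()
    finally:
        handle.remove()
```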
Current LAT methods regularize the search for adversarial latent states by constraining them within an Lp epsilon-ball
around the latent representation of a known harmful prompt. However, they do not account for
whether these perturbed
latent states are actually reachable through valid input prompts. As a result, training likely includes unreachable adversarial
states—those that cannot be triggered in practice—making the training less
effective. In real-world applications,
this leads to a waste of model capacity: improving safety against unrealistic attacks at the cost of degrading general capabilities.
To alleviate this, current methods restrict adversarial search to
the vicinity of human-identified harmful prompts, which
in turn excludes latent states that may correspond to novel, real-world attacks outside the known distribution.
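As an illustration of this epsilon-ball regularization, the sketch below projects a perturbed latent state back onto an L2 ball around the clean latent state of a known harmful prompt; the function name and the choice of p = 2 are assumptions, and, as the closing comment notes, the projection says nothing about reachability.

```python
# Sketch of the L_p epsilon-ball regularization described above, here for p = 2:
# the perturbed latent state is projected back onto a ball of radius epsilon
# around the clean latent state of a known harmful prompt. Names are illustrative.
import torch

def project_to_l2_ball(clean_latent, perturbed_latent, epsilon, tiny=1e-12):
    """Project each position's perturbation onto an L2 ball of radius epsilon."""
    delta = perturbed_latent - clean_latent
    norms = delta.norm(dim=-1, keepdim=True)                # per-position L2 norm
    scale = torch.clamp(epsilon / (norms + tiny), max=1.0)  # shrink only if outside the ball
    return clean_latent + delta * scale

# Note: nothing here checks whether the projected state is reachable by any
# actual input prompt -- that gap is precisely what this project targets.
```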
We aim to address the research question: How can latent adversarial training (LAT) be improved by regularizing latent adversarial
states? Specifically, we propose to constrain latent adversarial states to be reachable—that is,
producible through
some valid input prompt.
Method
Our proposed approaches address these limitations by introducing novel regularization techniques into adversarial
perturbation generation that encourage—or even enforce—the reachability of perturbed latent states. This not
only improves the realism of training signals but also lets us broaden the search space for perturbations, covering
more, and potentially novel, adversarial patterns during training. Overall, our methods aim
to
produce models with stronger worst-case safety guarantees than those trained with existing latent space adversarial training
approaches, while preserving general capabilities or incurring minimal trade-offs.
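As one purely hypothetical example of what such a regularization term could look like (offered as an illustration, not as the project's actual method), the sketch below penalizes the distance between perturbed input embeddings and the nearest rows of the token embedding matrix in the continuous-attack setting; lambda_reach, the penalty form, and all names are assumptions.

```python
# Purely hypothetical illustration (not the project's actual method): a
# reachability penalty for the continuous-attack setting that measures how far
# each perturbed input embedding lies from the nearest row of the token
# embedding matrix. The penalty form, lambda_reach, and all names are assumptions.
import torch

def reachability_penalty(perturbed_embeds, embedding_matrix):
    """perturbed_embeds: (batch, seq, d); embedding_matrix: (vocab, d)."""
    vocab = embedding_matrix.unsqueeze(0).expand(perturbed_embeds.shape[0], -1, -1)
    dists = torch.cdist(perturbed_embeds, vocab)    # (batch, seq, vocab) distances
    return dists.min(dim=-1).values.mean()          # mean distance to nearest real token

# The inner maximization could then minimize, e.g.,
#   attack_loss + lambda_reach * reachability_penalty(embeds + delta, embedding_matrix)
# trading off attack strength against staying close to states producible by real tokens.
```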
[1] Casper, S., Schulze, L., Patel, O. and Hadfield-Menell, D., 2024. Defending against unforeseen failure modes with latent adversarial training. arXiv preprint arXiv:2403.05030.
[2] Sheshadri, A., Ewart, A., Guo, P., Lynch, A., Wu, C., Hebbar, V., Sleight, H., Stickland, A.C., Perez, E., Hadfield-Menell, D. and Casper, S., 2024. Latent adversarial training improves robustness to persistent harmful behaviors in LLMs. arXiv preprint arXiv:2407.15549.
[3] Xhonneux, S., Sordoni, A., Günnemann, S., Gidel, G. and Schwinn, L., 2024. Efficient adversarial training in LLMs with continuous attacks. Advances in Neural Information Processing Systems, 37, pp. 1502-1530.