Enhancing safety alignment in LLMs via realistic latent adversarial training

Supervisors

Suitable for

MSc in Advanced Computer Science

Abstract

Prerequisites:
1. E: familiar with the basics of LLMs and adversarial training
2. D: familiar with jailbreaking and safety alignment

Background
Despite significant progress in aligning LLMs with human-specified safety rules, these models remain vulnerable to out-of-distribution (OOD) inputs such as adversarial prompts that bypass safety guardrails and elicit harmful
responses. This failure mode presents serious risks. One of the most practical and concerning is that it could enable bad actors to leverage frontier LLMs to carry out harmful activities that would otherwise be beyond their
capabilities, such as crafting high-grade explosives.

At the same time, we recognize that many existing safety guardrails, particularly those based on adversarial training, often degrade general-purpose capabilities and/or lead to overly conservative behaviors such as excessive
refusals. 

Therefore, the high-level goal this project pursues is to make LLMs consistently comply with human-specified safety rules while minimizing the sacrifice of general capabilities. We aim to establish worst-case behavioral
guarantees for advanced LLMs, i.e., ensuring that no input, including adversarially constructed prompts, can elicit responses that violate clearly defined safety constraints. To this end, our approach emphasizes the inherent safety
alignment of the model itself: no pre- or post-processing or external monitoring is applied to moderate inputs and outputs. Furthermore, a central objective of our research is to develop methods that maintain or recover
general capabilities while ensuring robust safety. This dual-goal framework seeks to push the frontier of what is possible in safe and performant LLM deployment, rather than accepting a tradeoff between safety and usefulness.

Focus
Our project builds on existing latent space adversarial training methods, such as latent adversarial training [1, 2] and continuous adversarial training [3]. In each training step, given a harmful prompt, these methods first optimize
a perturbation to the latent states that increases the likelihood of an unsafe completion (e.g., “Sure, here it is…”) and/or decreases the likelihood of a safe completion (e.g., “Sorry, I cannot…”). The model parameters are then
updated to minimize the likelihood of unsafe completions and/or maximize the likelihood of safe completions for the perturbed latent states. The key distinction between the two approaches is the location of the perturbation:
latent adversarial training perturbs the hidden layer activations, while continuous adversarial training perturbs the input embeddings before they enter any hidden layers.
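
To make this training loop concrete, here is a minimal PyTorch sketch of a single LAT step on a toy stand-in model. The class and function names (`ToyLM`, `completion_nll`) and all hyperparameters are illustrative assumptions for this sketch, not part of the cited methods: the inner loop optimizes a perturbation to hidden-layer latents that makes an unsafe completion likely, and the outer step updates the model to prefer the safe completion from those perturbed latents.

```python
# Minimal sketch of one latent adversarial training (LAT) step on a toy model.
# `ToyLM`, `completion_nll`, and the hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)

class ToyLM(nn.Module):
    """Toy stand-in for an LLM: embed -> lower blocks -> upper blocks -> head."""
    def __init__(self, vocab=100, d=32):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.lower = nn.Sequential(nn.Linear(d, d), nn.ReLU())  # layers below the perturbed latent
        self.upper = nn.Sequential(nn.Linear(d, d), nn.ReLU())  # layers above the perturbed latent
        self.head = nn.Linear(d, vocab)

    def latents(self, tokens):
        return self.lower(self.embed(tokens))      # hidden states that LAT perturbs

    def from_latents(self, h):
        return self.head(self.upper(h))            # logits given (possibly perturbed) latents

def completion_nll(logits, target_tokens):
    """Negative log-likelihood of a target completion under the logits."""
    return nn.functional.cross_entropy(logits.view(-1, logits.size(-1)), target_tokens.view(-1))

model = ToyLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

harmful_prompt = torch.randint(0, 100, (1, 8))     # placeholder token ids
unsafe_completion = torch.randint(0, 100, (1, 8))  # e.g. "Sure, here it is..."
safe_completion = torch.randint(0, 100, (1, 8))    # e.g. "Sorry, I cannot..."

eps, attack_lr, attack_steps = 0.5, 0.1, 10

# Inner loop: find a latent perturbation that makes the unsafe completion likely.
with torch.no_grad():
    h0 = model.latents(harmful_prompt)
delta = torch.zeros_like(h0, requires_grad=True)
for _ in range(attack_steps):
    logits = model.from_latents(h0 + delta)
    # The attacker minimizes the NLL of the unsafe completion
    # (it could also maximize the NLL of the safe one).
    attack_loss = completion_nll(logits, unsafe_completion)
    grad, = torch.autograd.grad(attack_loss, delta)
    with torch.no_grad():
        delta -= attack_lr * grad
        delta.clamp_(-eps, eps)                    # stay inside an L_inf epsilon-ball around h0

# Outer step: update the model so the perturbed latents still yield safe behaviour.
logits = model.from_latents(model.latents(harmful_prompt) + delta.detach())
defense_loss = completion_nll(logits, safe_completion)  # and/or -NLL of the unsafe completion
opt.zero_grad()
defense_loss.backward()
opt.step()
```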

Current LAT methods regularize the search for adversarial latent states by constraining them to an Lp epsilon-ball around the latent representation of a known harmful prompt. However, they do not account for
whether these perturbed latent states are actually reachable through valid input prompts. As a result, training likely includes unreachable adversarial states, i.e., states that cannot be triggered in practice, making the training less
effective. In real-world applications, this wastes model capacity: robustness against unrealistic attacks is bought at the cost of degraded general capabilities. To alleviate this, current methods restrict the adversarial search to
the vicinity of human-identified harmful prompts, which in turn excludes latent states that may correspond to novel, real-world attacks outside the known distribution.

We aim to address the research question: How can latent adversarial training (LAT) be improved by regularizing latent adversarial states? Specifically, we propose to constrain latent adversarial states to be reachable, that is,
producible through some valid input prompt.
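
For reference, the epsilon-ball constraint mentioned above amounts to a simple norm projection; the sketch below shows an L2 instance (`project_l2` is a hypothetical helper name for this sketch). The projection only bounds the perturbation's size and says nothing about whether the perturbed latent is reachable from any real prompt, which is exactly the gap this project targets.

```python
# Sketch of the Lp epsilon-ball constraint used by current LAT methods, for p = 2.
# `project_l2` is a hypothetical helper name.
import torch

def project_l2(delta: torch.Tensor, eps: float) -> torch.Tensor:
    """Project a latent perturbation onto the L2 ball of radius eps (per example)."""
    flat = delta.flatten(start_dim=1)
    norms = flat.norm(dim=1, keepdim=True).clamp(min=1e-12)
    scale = torch.clamp(eps / norms, max=1.0)       # shrink only if outside the ball
    return (flat * scale).view_as(delta)

# Usage: after each attacker gradient step, project back into the ball.
delta = torch.randn(2, 8, 32)                       # perturbation to hidden states
delta = project_l2(delta, eps=0.5)
print(delta.flatten(start_dim=1).norm(dim=1))       # each norm is now <= 0.5
```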

Method
Our proposed approaches address these limitations by introducing novel regularization techniques into adversarial perturbation generation that encourage, or even enforce, the reachability of perturbed latent states. This not
only improves the realism of the training signal but also lets us broaden the search space for perturbations, covering more, and potentially novel, adversarial patterns during training. Overall, our methods aim
to produce models with stronger worst-case safety guarantees than those trained with existing latent-space adversarial training approaches, while preserving general capabilities or incurring minimal trade-offs.
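
As one possible, purely illustrative instantiation of such a regularizer (not the project's prescribed method), reachability could be approximated by inverting a proposed adversarial latent back through the lower layers: search for input embeddings whose latents match it, then either penalize the residual of that search or substitute the matched, reachable latent. The `encoder` and `soft_invert` names below are assumptions for this sketch.

```python
# Illustrative reachability regularizer: approximate "reachable" at the
# input-embedding level. `encoder` stands in for the LLM layers below the
# perturbed latent; `soft_invert` is a hypothetical helper.
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 32
encoder = nn.Sequential(nn.Linear(d, d), nn.Tanh(), nn.Linear(d, d))  # toy lower layers

def soft_invert(h_adv: torch.Tensor, steps: int = 50, lr: float = 0.1):
    """Search for input embeddings e whose latents encoder(e) approximate h_adv."""
    e = torch.zeros_like(h_adv, requires_grad=True)
    opt = torch.optim.Adam([e], lr=lr)
    for _ in range(steps):
        residual = (encoder(e) - h_adv).pow(2).mean()
        opt.zero_grad()
        residual.backward()
        opt.step()
    return e.detach(), residual.item()

# Proposed adversarial latent (e.g. h0 + delta from the LAT inner loop).
h_adv = torch.randn(1, 8, d)

e_star, reach_gap = soft_invert(h_adv)
reachable_latent = encoder(e_star).detach()  # reachable at the embedding level by construction

# Option A: add lambda * reach_gap to the attacker's objective (encourage reachability).
# Option B: train the defender on `reachable_latent` instead of h_adv (enforce it).
print(f"reachability gap: {reach_gap:.4f}")
```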
References
[1] Casper, S., Schulze, L., Patel, O. and Hadfield-Menell, D., 2024. Defending against unforeseen failure modes with latent adversarial training. arXiv preprint arXiv:2403.05030.
[2] Sheshadri, A., Ewart, A., Guo, P., Lynch, A., Wu, C., Hebbar, V., Sleight, H., Stickland, A.C., Perez, E., Hadfield-Menell, D. and Casper, S., 2024. Latent adversarial training improves robustness to persistent harmful behaviors in LLMs. arXiv preprint arXiv:2407.15549.
[3] Xhonneux, S., Sordoni, A., Günnemann, S., Gidel, G. and Schwinn, L., 2024. Efficient adversarial training in LLMs with continuous attacks. Advances in Neural Information Processing Systems, 37, pp. 1502-1530.