
Lin Li and Yarin Gal awarded $371k grant to improve safety of LLMs


Research Associate Lin Li and Professor Yarin Gal from the Oxford Applied and Theoretical Machine Learning Group have won a $371,000 (£274,000) grant from the US philanthropic organisation Coefficient Giving to improve the safety of large language models (LLMs).

The problem

Despite recent progress, LLMs can still be manipulated by adversarial or unexpected prompts into bypassing their safeguards. This presents serious risks: bad actors may use LLMs to carry out harmful activities that would otherwise be beyond their capabilities, such as crafting high-grade explosives. A promising safety guardrail is adversarial training in the model's latent space, but models are often trained on scenarios that would not occur in practice, which can limit effectiveness and reduce general performance, producing overly cautious behaviour and excessive refusals.

The approach

The project advances current adversarial training methods by introducing new regularisation techniques that ensure adversarial examples in the latent space reflect realistic, achievable inputs rather than artificial internal states. Existing approaches typically restrict their search to small variations of known harmful prompts, which keeps training close to familiar examples but limits the range of scenarios the model encounters. By instead focusing on ‘reachable’ representations, the new approach enables a broader exploration of adversarial inputs that could arise in practice. This allows the model to learn from more diverse and previously unseen attack patterns, resulting in stronger safety behaviour while preserving LLM performance.
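To make the idea concrete, the sketch below shows one way a reachability-style regulariser might be folded into latent adversarial training. It is a minimal, hypothetical illustration in PyTorch: the toy encoder/head model, the `reachability_penalty` function, and all hyperparameters are assumptions for exposition, not the project's actual method.

```python
# A minimal, illustrative sketch of latent adversarial training with a
# "reachability"-style regulariser, using a toy model in place of an LLM.
# The architecture, the reachability_penalty term, and all hyperparameters
# are hypothetical, chosen only to illustrate the general idea.

import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for an LLM: an encoder producing latents, a head producing logits.
encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
head = nn.Linear(64, 2)  # e.g. two behaviours: safe response vs. harmful compliance
opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def reachability_penalty(z_adv, z_bank):
    # Hypothetical regulariser: squared distance from each perturbed latent
    # to its nearest latent actually produced by a real input, so the attack
    # stays near internal states the model can genuinely reach.
    d2 = ((z_adv.unsqueeze(1) - z_bank.unsqueeze(0)) ** 2).sum(-1)
    return d2.min(dim=1).values.mean()

x = torch.randn(16, 32)                     # batch of (toy) prompt encodings
y_safe = torch.zeros(16, dtype=torch.long)  # the desired safe behaviour

for step in range(200):
    z = encoder(x)
    z_bank = z.detach()  # latents reachable from real inputs

    # Inner loop: find a latent perturbation that pushes the model away from
    # safe behaviour while staying close to reachable representations.
    delta = torch.zeros_like(z, requires_grad=True)
    for _ in range(5):
        adv_obj = -loss_fn(head(z.detach() + delta), y_safe) \
                  + 10.0 * reachability_penalty(z.detach() + delta, z_bank)
        grad, = torch.autograd.grad(adv_obj, delta)
        delta = (delta - 0.1 * grad).detach().requires_grad_(True)

    # Outer loop: train the model to behave safely even under the
    # reachability-constrained worst-case perturbation.
    opt.zero_grad()
    loss = loss_fn(head(encoder(x) + delta.detach()), y_safe)
    loss.backward()
    opt.step()
```

In this sketch the penalty keeps the inner attack from drifting into artificial internal states, while still letting it explore a broader region of latent space than small edits to known harmful prompts would cover.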

The team

Research Associate Lin Li and Professor Yarin Gal will work with Stephen Casper from MIT, who has authored pioneering work on latent adversarial training. Lin has studied adversarial training since completing his PhD and has developed several state-of-the-art adversarial training methods. Yarin brings extensive experience in producing impactful, widely adopted, and widely cited research, including work on semantic entropy, model collapse, and Bayesian neural networks.

The funding will support research staff, dedicated compute infrastructure, large-scale evaluation, and dissemination through international conferences. The project will run from March 2026 to July 2027.