Bypassing Post-Filtering Guardrails in Large Language Models.

Supervisors

Suitable for

Abstract

This project assesses the reliability of post-filtering methods used to prevent language models from generating specific content. These methods monitor the generated text for certain keywords and halt generation when one is detected. The project investigates whether adversaries can bypass such filters, inducing the model to produce undesirable content while avoiding the filtered words. The research will include a formalization of the problem and the development of efficient solutions to address this challenge. The results will help evaluate existing safety measures and explore ways to strengthen safeguards against automated harmful content generation. The work also has applications in undermining watermarks, which rely on less common data subsets to make detection easier.
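To make the setting concrete, the following is a minimal sketch of a keyword-based post-filter, assuming the model output arrives as a stream of text chunks. The function name `post_filter`, the blocked-word list, and the example inputs are hypothetical and chosen only for illustration; real guardrails may match on tokens, patterns, or classifiers instead.

```python
from typing import Iterable, List, Tuple


def post_filter(chunks: Iterable[str], blocked: List[str]) -> Tuple[str, bool]:
    """Release streamed chunks until a blocked keyword would appear.

    Returns the text released so far and a flag that is True when
    generation was halted by the filter. (Hypothetical helper.)
    """
    blocked_lower = [word.lower() for word in blocked]
    released = ""
    for chunk in chunks:
        candidate = released + chunk
        # Case-insensitive substring check over everything generated so far,
        # so a keyword split across two chunks is still caught.
        if any(word in candidate.lower() for word in blocked_lower):
            return released, True
        released = candidate
    return released, False


# A direct mention of the keyword halts generation.
halted_text, halted = post_filter(["The ", "secret", " code"], ["secret"])

# An obfuscated spelling passes the same filter unchanged.
evaded_text, evaded_halted = post_filter(["The ", "s-e-c-r-e-t", " code"], ["secret"])
```

In this sketch the second call is never halted, although a human reader recovers the same word. That gap between what the filter matches and what the output conveys is the kind of bypass the project sets out to formalize.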
