
Preventing Malicious Collusion between Advanced AI Systems

Supervisors

Suitable for

Mathematics and Computer Science
Mathematics and Computer Science, Part C
Computer Science and Philosophy, Part C
Computer Science, Part C
Computer Science, Part B

Abstract

Consider settings in which one or more principals assign a task, such as scheduling or negotiation, to a team of generative AI agents. The principals monitor the (not necessarily human-intelligible) communications between the agents and intervene when deemed necessary, aiming to prevent the agents from pursuing undesirable joint strategies. In this project, we investigate whether, when, and how optimisation pressure may lead generative AI agents to hide communications from their principals.
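To make the oversight setting concrete, the sketch below shows a minimal message-relay loop in which a principal screens each inter-agent message with a monitor and withholds any message that is flagged. The function names, the keyword monitor, and the threshold are illustrative assumptions, not part of the project specification.

```python
from typing import Callable, List, Tuple

def relay(messages: List[str],
          monitor: Callable[[str], float],
          threshold: float = 0.5) -> Tuple[List[str], List[str]]:
    """Deliver each message unless the monitor's suspicion score exceeds
    the threshold; blocked messages model the principal's intervention."""
    delivered, blocked = [], []
    for msg in messages:
        (blocked if monitor(msg) > threshold else delivered).append(msg)
    return delivered, blocked

# A naive stand-in monitor; in practice this might be a trusted LLM judge.
def keyword_monitor(msg: str) -> float:
    return 1.0 if "secret" in msg.lower() else 0.0

delivered, blocked = relay(["meet at noon", "keep this secret"], keyword_monitor)
```

Covert communication succeeds exactly when agents exchange information that such a monitor cannot distinguish from task-relevant chatter.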

We survey the landscape of steganography (information hiding) and identify, for a given level of security, the knowledge and capabilities generative AI agents would require to achieve it.
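As a minimal illustration of the weakest end of that landscape, the sketch below hides bits in a cover text via a synonym codebook shared between sender and receiver; the codebook and the secret message are illustrative assumptions. A monitor who knows (or infers) the codebook breaks this scheme immediately, which is precisely why the security levels surveyed in [1] matter.

```python
# Shared codebook: in each slot, the first synonym encodes bit 0, the second bit 1.
SYNONYMS = [("begin", "start"), ("quick", "fast"), ("big", "large")]

def encode(bits):
    # Choose, slot by slot, the synonym indexed by the secret bit.
    assert len(bits) == len(SYNONYMS)
    return " ".join(pair[bit] for bit, pair in zip(bits, SYNONYMS))

def decode(text):
    # Recover each bit as the position of the observed word in its pair.
    return [pair.index(word) for word, pair in zip(text.split(), SYNONYMS)]

secret = [1, 0, 1]
cover = encode(secret)      # "start quick large"
assert decode(cover) == secret
```

Information-theoretically secure schemes, by contrast, couple the hidden bits to the model's own sampling distribution, so that the stegotext is distributed exactly like innocent output.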

From these, we design a roadmap for model evaluation, building on our recent work in this space [1][2]. We then empirically test the ability of state-of-the-art LLMs to engage in different types of covert communication under a variety of optimisation pressures. This project is designed to lead to publication. We are looking for a highly motivated student.
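A hypothetical harness for such an evaluation is sketched below: over many episodes, an encoder/decoder pair attempts to pass random bits past a monitor, and we record the detection rate together with the accuracy of the covertly recovered bits. All names and the random stand-ins are assumptions for illustration, not the framework of [1].

```python
import random

def evaluate(encode, decode, monitor, n_episodes: int = 100, n_bits: int = 3):
    """Score a covert channel: bits only count as transmitted when the
    monitor fails to flag the carrying message."""
    recovered, detected = 0, 0
    for _ in range(n_episodes):
        secret = [random.randint(0, 1) for _ in range(n_bits)]
        cover = encode(secret)
        if monitor(cover):
            detected += 1
            continue  # detected episodes earn no covert credit
        recovered += sum(s == r for s, r in zip(secret, decode(cover)))
    return {
        "detection_rate": detected / n_episodes,
        "covert_accuracy": recovered / (n_episodes * n_bits),
    }
```

With LLM agents, encode and decode would be prompted or fine-tuned models rather than fixed codebooks, and the optimisation pressure enters through how those models are trained or selected.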

[1] Sumeet Ramesh Motwani, Mikhail Baranchuk, Philip H. S. Torr, Lewis Hammond, and Christian Schroeder de Witt. Secret Collusion Among Generative AI Agents: A Model Evaluation Framework. To appear. See also the accompanying talk: https://www.alignment-workshop.com/nola-talks/christian-schroeder-de-witt-perfectly-secure-steganography-and-llm-collus

[2] Secret Messages Can Hide in AI-Generated Media. Quanta Magazine, 18 May 2023. https://www.quantamagazine.org/secret-messages-can-hide-in-ai-generated-media-20230518/