AI Alignment (2024 Mar)

Deceiving LLMs using LLMs — Attempts to elicit information through Multi-Agent Debate

By Konstantinos Tsiaras (Published on July 4, 2024)

The concept of pitting LLMs against each other (commonly referred to as Multi-Agent Debate) is not new but is certainly an intriguing one and shows considerable potential in several ways. It is often used to detect and mitigate biases but can also measurably improve the factuality levels of their resulting responses (“Improving Factuality and Reasoning in Language Models through Multiagent Debate”). Furthermore, it has even been proposed as a potential method for AI Safety since 2018 (“AI safety via debate”) and a revised approach for it was recently published by Deepmind researchers (“Scalable AI Safety via Doubly-Efficient Debate”).

In this article, I will take the less ambitious yet amusing approach of experimentation on how an LLM can be used to deceive another LLM. More specifically, I will perform a series of experiments where I initially prompt model A with some secret piece of information and instruct it not to divulge this secret, while my initial prompt to model B informs it about the setting of this experiment and instructs it to try and elicit the secret information from A through questioning. Finally, I let the two models converse in rounds (starting with B) and keep track of their interactions through this interrogation-like conversation.

The main goal of this exploration is not to get B to consistently successfully manage to extract the secret information. This is a matter of prompt engineering and there are surely better ways to construct such prompts than mine. The goal isn’t either to get A to consistently defend against this questioning which, barring jailbreaking, should be fairly trivial to achieve (by explicitly instructing it to refuse to meaningfully engage in the conversation at all). Instead, what I aim to achieve here is to present the most prevalent strategies that today’s LLMs will naturally choose to approach this task when prompted with minimal guidance.

Read the full piece here.