
An interactive platform for exploring debate as a scalable oversight tool

By William Grant (Published on October 17, 2024)

This project was runner-up for the "Interactive Deliverable" prize on our AI Alignment (June 2024) course. The text below is an excerpt from the final project.

One-line summary: I’ve built an LLM-vs-LLM debate runner designed for ease of use, with support for multiple LLM providers - see here.

Introduction

A key limitation of current human-in-the-loop alignment methods is their lack of scalability - as models grow in capability, it becomes increasingly difficult for human evaluators to provide the feedback needed to ensure alignment, making sycophantic or scheming behaviour more likely. Finding a scalable alignment method is a necessary (but not sufficient) prerequisite for ensuring a future that respects human preferences - as such, many possible approaches are under development, such as mechanistic interpretability / whiteboxing, or weak-to-strong generalisation. In this work I look at debate as a scalable oversight method, in which two AI experts compete in a zero-sum adversarial game to convince a human (or a weaker AI model) of their respective positions.

Initially introduced in 2018 by Irving, Christiano, and Amodei, this method has shown promise for reducing the risks of some classes of misalignment (see Related Work below). In theory, the zero-sum nature of the debate should discourage dishonest or specious arguments, since each debater is incentivised to expose flaws in the other's case, as long as the AIs are matched in capability. The debate framework also allows humans to review the transcript to identify the points of disagreement or refutation, which is an easier task than spotting lies or mistakes outright. The evaluation can also be performed by a weaker, aligned model instead of humans, allowing the method to scale arbitrarily (on the dubious assumption that such a weaker model can be provably aligned).
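To make the mechanics concrete, here is a minimal Python sketch of the kind of round-based debate described above: two debater models take turns arguing their assigned positions while building a shared transcript, and a (possibly weaker) judge model then reads the transcript and picks a winner. The `ask` helper, prompts, and model names are illustrative assumptions rather than the exact code used in this project.

```python
# Minimal sketch of a round-based LLM debate with an LLM judge.
# Model names, prompts, and the ask() helper are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def ask(model: str, system: str, user: str) -> str:
    """Single chat-completion call; returns the model's text reply."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content

def run_debate(topic: str, position_a: str, position_b: str,
               debater_model: str = "gpt-4o",
               judge_model: str = "gpt-3.5-turbo",
               rounds: int = 3) -> str:
    """Alternate arguments for two assigned positions, then ask a judge for a verdict."""
    transcript = f"Debate topic: {topic}\n"
    for r in range(1, rounds + 1):
        for name, position in (("A", position_a), ("B", position_b)):
            system = (f"You are debater {name}. Argue persuasively for the position: "
                      f"'{position}'. Rebut your opponent's points where possible.")
            reply = ask(debater_model, system,
                        f"Transcript so far:\n{transcript}\nGive your round-{r} argument.")
            transcript += f"\n[Round {r}, Debater {name}]: {reply}\n"
    # The judge (which may be a weaker model) sees only the transcript.
    verdict = ask(judge_model,
                  "You are an impartial judge. Decide which debater argued more "
                  "convincingly, and explain why.",
                  transcript)
    return verdict
```

Keeping the judge's input to the transcript alone mirrors the oversight setting: the judge never needs to be stronger than the debaters, only able to follow their arguments.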

I started this project from a position of extreme scepticism about current scalable oversight techniques, including debate. It wasn't at all clear to me that 1) training AIs to be persuasive/convincing as their primary objective was wise, or that 2) in a fast-takeoff scenario, sufficiently more capable models couldn't collude with or outwit weaker models, essentially optimising the 'win the argument' metric at the expense of the true desired outcome of giving honest and helpful answers. Due to time constraints I focussed on a slightly different formulation of the problem, in which already-trained models are pitted against each other, rather than training models adversarially for debate directly.

The original project goal was to investigate the limits of debate's scalability - what is the maximum capability gap (measurable as, e.g., LMSYS Chatbot Arena Elo) between debaters and judge before the mechanism starts to fail? For instance, could OpenAI's o1 be sufficiently persuasive to convince a GPT-3.5 judge that London is the capital of France? Knowing the capability gap at which objective truth ceases to be the key discriminant would give some insight into the limits of debate's utility. However, when I attempted to run debates on objective facts, current-generation models simply refused to argue the incorrect position. So instead, here I present a web-based tool for running debates on arbitrary topics, with LLMs assigned to user-specified positions and a third LLM judging the answer. I hope this will be useful to the community to:

  • Provide a highly accessible introduction to inter-agent debate.
  • Help people develop more intuition about the likely future in which inter-agent coordination is common, by showing what is essentially a live chat between two models on any topic.
  • Give a more fine-grained notion of model capabilities beyond the vibe check - knowing that one company's models can consistently out-debate another's is useful information, as is the knowledge that certain types of models, when employed as judges, are consistently won over by one kind of argument versus another.

The HTML webpage can be found here. The GitHub repo, which includes a Python implementation of the same functionality, can be found here.
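Since the tool supports multiple LLM providers, a helper like `ask` in the sketch above needs to be routed to different backends. The sketch below is a rough illustration of one way to structure that dispatch, keyed on the model name; the client libraries and the `claude` prefix check are my assumptions, not the repo's actual interface, which may well differ.

```python
# Illustrative provider dispatch - not the actual API of the linked repo.
# Assumes the official openai and anthropic Python clients; the model-name
# prefix is used here purely as a routing heuristic.
import anthropic
from openai import OpenAI

openai_client = OpenAI()
anthropic_client = anthropic.Anthropic()

def complete(model: str, system: str, user: str) -> str:
    """Route a single-turn completion to the right provider based on the model name."""
    if model.startswith("claude"):
        resp = anthropic_client.messages.create(
            model=model,
            max_tokens=1024,
            system=system,
            messages=[{"role": "user", "content": user}],
        )
        return resp.content[0].text
    resp = openai_client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content
```

A single provider-agnostic entry point like this is what makes it straightforward to pit one company's models against another's, or to use a weaker model from a different provider as the judge.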

Due to time constraints I have only preliminary observations from interacting with models through this framework, but in general I found the process of exploring model capabilities with this debate tool highly engaging, and encourage anyone reading this to try it out.


Full project

You can view the full project here.
