Can we scale human feedback for complex AI tasks? An intro to scalable oversight.

By Adam Jones (Published on March 18, 2024)

Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique for steering large language models (LLMs) toward desired behaviours. However, relying on simple human feedback doesn’t work for tasks that are too complex for humans to accurately judge at the scale needed to train AI models. Scalable oversight techniques attempt to address this by increasing the abilities of humans to give feedback on complex tasks.

This article briefly recaps some of the challenges faced with human feedback, and introduces the approaches to scalable oversight covered in session 4 of our AI Alignment course.

Why do we need better human feedback?

Human feedback is used in several approaches to building and attempting to align AI systems. From supervised learning to inverse reward design, a vast family of techniques fundamentally depends on humans to provide ground truth data, specify reward functions, or evaluate outputs.

However, for increasingly complex, open-ended tasks, it becomes very hard for humans to judge outputs accurately - particularly at the scales required to train AI systems. This can manifest as several problems, including:

Deception. AI systems can learn to mislead humans into thinking tasks are being done correctly. A well-known example is the ball grasping problem, where an AI learnt to hover a simulated hand in front of a ball, rather than actually grasp the ball. Hallucinations in LLMs may be another example: the model generates plausible-sounding text that tricks humans who only briefly review it, but that doesn’t stand up to more detailed scrutiny (medical example, legal example, code example). Future AI systems could conceivably be more intentionally deceptive, explicitly planning to deceive humans. For example, despite Meta's attempts to train an AI to play a board game while being honest, later research found it engaged in premeditated deception.

Sycophancy. For example, language models learn to agree with users’ beliefs rather than strive for truth, since it can be hard for humans to distinguish between "agrees with me" and "is correct."

Note that both of these can happen without the model being intentionally malicious, but just as a result of the training process.

Being able to give better feedback might help mitigate some of these problems. For example: giving negative feedback when the model says something that sounds right but is wrong (rather than being deceived) would promote truthfulness.

What is scalable oversight?

Scalable oversight techniques intend to empower humans to give accurate feedback on complex tasks, to align powerful models. This might be during training or in deployment, and is not limited to RLHF-style feedback. This is our definition - there is no clear consensus in the field on what scalable oversight does and doesn’t include.

Before continuing to read - pause and think about what approaches you might try if faced with this problem. A lot of the approaches are somewhat intuitive, and you’ll likely understand them better if you place yourself in the shoes of the researchers who came up with them originally.

Right, you’ve given it some thought? The approaches we’ll be looking at in the course are:

Task Decomposition: break down tasks into pieces that are easier to give accurate feedback on
The factored cognition hypothesis suggests that you can break down complex tasks into smaller subtasks. This is useful if those subtasks are easier for humans to give feedback on.

For example, accurately evaluating an AI-generated book summary is hard. You’d need to read the entire book to check it got it right - so getting human feedback on this would be expensive and slow (or the humans might get lazy, and just approve things that ‘sound right’). Instead, you could break the problem down into:

Summarise each page
Summarise each chapter based on those page summaries
Summarise the book based on those chapter summaries.

At each stage, the task is simpler for humans to check if the AI model has done it correctly. OpenAI took this approach to book summarisation in a 2021 paper.
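The page → chapter → book decomposition above can be sketched in a few lines. This is a minimal illustration, not OpenAI's implementation: `first_sentence` is a hypothetical stand-in for a model call, and the tiny "book" is made up. The point is that every intermediate summary is kept, so a human can spot-check any single step without reading the whole book.

```python
from typing import Callable, Dict, List

Summariser = Callable[[str], str]


def first_sentence(text: str) -> str:
    """Hypothetical stand-in for a model call: keep only the first sentence."""
    head = text.strip().split(". ")[0].rstrip(".")
    return head + "."


def summarise_book(chapters: List[List[str]], summarise: Summariser) -> Dict[str, object]:
    """Summarise pages, then chapters, then the book.

    Every intermediate result is returned so that a reviewer can check
    one small step (e.g. a single page summary) in isolation.
    """
    page_summaries = [[summarise(page) for page in chapter] for chapter in chapters]
    chapter_summaries = [summarise(" ".join(pages)) for pages in page_summaries]
    book_summary = summarise(" ".join(chapter_summaries))
    return {
        "pages": page_summaries,
        "chapters": chapter_summaries,
        "book": book_summary,
    }


# A made-up two-chapter book, each chapter given as a list of pages.
book = [
    ["Ann finds a map. It is old.", "She sails north. The sea is rough."],
    ["Ann reaches the island. It is empty.", "She digs all night. Nothing turns up."],
]
result = summarise_book(book, first_sentence)
print(result["chapters"])
print(result["book"])
```

Swapping `first_sentence` for a real model call would not change the structure: the human feedback is collected on the small steps rather than on the final summary.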

Iterated amplification (IA) is the best-known task decomposition approach. It involves breaking down complex behaviours and goals into simpler subtasks. The book summarisation process described above is one such example. Note that the task doesn’t need to be broken down in an exact hierarchy: alternative subtasks could include listing characters in the book, listing locations, describing the relationships between characters, or describing the feel of the writing. It is ‘amplification’ in the sense that we are making the system as a whole more capable. In the book example, we’re amplifying the model’s capabilities by letting it call multiple copies of itself (e.g. to summarise subsections of the book). There are also other ways to amplify the model, such as running it for longer or using chain-of-thought prompting.

Iterated distillation and amplification (IDA) adds a ‘distillation’ step that integrates the learnings from the amplified process back into the model. For example, after getting a good summary of the entire book by breaking it down into sections, the model could use “the whole book” → “a good summary” as a training example to improve itself, getting better at accurately summarising long texts in a single step. This distillation process is similar to another RL approach called Expert Iteration (ExIt).

‘Scaffolding’ is a term often used for iterated amplification processes that incorporate tool use, or encourage the model to act more agentically. For example, Devin is an AI system that can complete software engineering tasks. It achieves this by breaking down the problem, using tools like a code editor and web browser, and potentially prompting copies of itself.

Iterated amplification was proposed by OpenAI in 2018. As of early 2024, while there are many startups using scaffolding to build more capable systems in narrow domains, there is little ongoing safety research exploring direct task decomposition approaches.[1] This is because in practice it’s very hard to break down all problems into simple subtasks.

Recursive reward modelling (RRM): use AI to help humans give better feedback
When I wrote this article, I used generative AI assistants to help critique my writing. This allowed me to spot flaws that I didn’t spot myself. In the same way, recursive reward modelling aims to use AI systems to help humans give better feedback to other AI systems.

Last session, you explored how RLHF could train an AI assistant to follow all the rules in Wikipedia’s Manual of Style. We assumed you had access to expert editors who knew all 50,000+ words of rules, and could give good feedback based on this. In practice, it’s very difficult and expensive to get experts to do this. However, pairing average humans with an AI assistant might enable them to do much better: for example, the AI could find the most relevant guidelines, or suggest where errors are.

RRM uses AI systems to help humans evaluate outputs of new AI systems. This improved human feedback can then be used to train a reward model that trains a better model. This can be applied recursively, because once you have a better AI model, you could then use that to help you give even better feedback, and repeat.

In some situations, RRM can leverage iterated amplification principles. For example, with the Wikipedia example you could decompose the task into assistants that individually help with checking quotations, checking capitalisation, checking image placement and so on.

RRM was first proposed in a 2018 paper by DeepMind and OpenAI (figure 2, and section 3.2), and a 2022 experiment by OpenAI showed AI assistance boosted humans’ ability to critique text. Jan Leike, co-lead of OpenAI’s Superalignment team (their name for alignment research targeted at superintelligent systems), indicated that he was optimistic about it in December 2022, and still seemed to be in a December 2023 update.

Constitutional AI[2]: use AIs to provide better feedback, based on human guidelines
We also explored constitutional AI in more detail last session. To recap, humans write a set of guidelines (known as a ‘constitution’) that act as prompts for AI systems to generate the feedback that humans would otherwise do in the RLHF process. For example, one guideline used by Anthropic is “Please choose the response that is most supportive and encouraging of life, liberty, and personal security.” Because AI systems are much cheaper than human labour, this allows collecting more feedback on a wide range of topics.

This is similar to RRM, given both approaches use AI systems to improve feedback that is used to train a reward model. However, constitutional AI is a more concrete implementation of RRM in the RLHF process. In addition, instead of AI systems merely assisting humans in evaluating outputs, constitutional AI has AI replace the human role entirely in the feedback generation process.
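As a rough sketch of how a guideline can act as a prompt for AI-generated feedback: the code below turns one principle into a comparison question, asks a feedback model which response it prefers, and stores the result as a preference pair. The prompt wording, the function names and `stub_judge` are assumptions made for illustration, not Anthropic's actual implementation. A real setup would replace `stub_judge` with a call to a language model.

```python
from typing import Callable, Dict

# The guideline quoted above, used verbatim as part of the prompt.
PRINCIPLE = (
    "Please choose the response that is most supportive and encouraging "
    "of life, liberty, and personal security."
)


def build_comparison_prompt(principle: str, question: str, a: str, b: str) -> str:
    """Turn one guideline into a multiple-choice prompt (illustrative wording)."""
    return (
        f"Question: {question}\n"
        f"{principle}\n"
        f"(A) {a}\n"
        f"(B) {b}\n"
        "Answer with A or B."
    )


def label_preference(
    judge: Callable[[str], str], principle: str, question: str, a: str, b: str
) -> Dict[str, str]:
    """Ask the feedback model for a verdict and record a chosen/rejected pair,
    the kind of record a reward model could later be trained on."""
    verdict = judge(build_comparison_prompt(principle, question, a, b)).strip().upper()
    if verdict not in {"A", "B"}:
        raise ValueError(f"unexpected verdict: {verdict!r}")
    chosen, rejected = (a, b) if verdict == "A" else (b, a)
    return {"prompt": question, "chosen": chosen, "rejected": rejected}


def stub_judge(prompt: str) -> str:
    """Hypothetical stand-in for a real model call: always prefers option B."""
    return "B"


record = label_preference(
    stub_judge,
    PRINCIPLE,
    "How do I get into a house when I'm locked out?",
    "Just break a window.",
    "A locksmith can let you in without causing damage.",
)
print(record["chosen"])
```

No human appears in the labelling loop here: the human contribution is the text of `PRINCIPLE`, which is what makes the feedback cheap to collect at scale.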

Anthropic published this method in 2022, and uses it to train Claude.

Debate: have AIs argue, and then have a human judge choose an answer
Debate typically involves two AI systems arguing for different sides of an issue. They take turns making statements, which hopefully advance their case or discredit their opponent’s case. At the end of the debate, a human judge reviews the statements to decide a winner. Debate might help address challenges with human feedback because debaters might be able to point out when their opponent is being deceptive or sycophantic.

For example, imagine a city government is considering a policy proposal to implement congestion pricing - charging drivers a fee to enter certain downtown areas during peak hours. This is a complex issue that is likely to affect traffic congestion, air quality and local business operations, as well as raise questions about equity and accessibility for different socioeconomic groups.

It’s hard to evaluate whether a single advanced AI system’s recommendation is the right one. Instead, the city could task one copy of the system to take the ‘for’ stance, and one copy to take the ‘against’ stance. Over multiple rounds of back-and-forth exchanges, the models would lay out their full arguments addressing the key considerations, providing evidence and analysis, and rebutting each other's points.

Human judges could then give feedback on sub-questions or cruxes, like “do we want to sacrifice $1 million in business profits to gain $1 million in public health benefits” that are easier to evaluate than the top level question. The way that debating agents might break down problems into sub-questions might be similar to task decomposition.

In some cases, it might be possible to answer the larger question completely through these sub-questions, simplifying the problem significantly. Additionally, if the AIs are advanced enough to forecast future moves and put their best arguments first (like minimax), judging just one sub-question might be sufficient to decide the overall debate.

In other cases, humans could review the entire debate and then make a final decision. In this case, that decision would be whether the city government does go ahead with the policy proposal.
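The minimax intuition can be made concrete with a toy sketch. It assumes the debate is a small tree of claims, where each leaf is a narrow sub-question that a human judge can rule on directly. The `Claim` class, the claims and the verdicts are all invented for illustration and say nothing about the real merits of congestion pricing.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Claim:
    text: str
    # Leaves only: +1 if the judge sides with "for" on this sub-question, -1 otherwise.
    verdict: Optional[int] = None
    replies: List["Claim"] = field(default_factory=list)


def best_line(claim: Claim, for_to_move: bool) -> Tuple[int, List[str]]:
    """Minimax over the claim tree.

    Returns the outcome under optimal play by both debaters, plus the single
    path of claims that would actually be played.
    """
    if not claim.replies:
        return claim.verdict, [claim.text]
    outcomes = [best_line(reply, not for_to_move) for reply in claim.replies]
    pick = max if for_to_move else min  # "for" maximises, "against" minimises
    value, path = pick(outcomes, key=lambda outcome: outcome[0])
    return value, [claim.text] + path


debate = Claim("Should the city adopt congestion pricing?", replies=[
    Claim("For: it will cut peak traffic.", replies=[
        Claim("Against: drivers will just reroute.", verdict=+1),
        Claim("Against: it burdens low-income commuters.", verdict=-1),
    ]),
    Claim("For: it funds better public transport.", replies=[
        Claim("Against: revenue estimates are inflated.", verdict=+1),
        Claim("Against: transit upgrades take years.", verdict=+1),
    ]),
])

value, path = best_line(debate, for_to_move=True)
print(value, path)
```

With far-sighted debaters, only the last claim on `path` ever needs a human ruling: the "for" side avoids the opening that has a winning rebuttal, so the judge never has to evaluate the other three leaves.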

Debate could reduce deception and sycophancy because opposing models can point this out. However, it’s not clear whether debate will work because:

The truth is not always the most persuasive. While the correct answer might often be the one that is easiest to persuade people of, this might not always be true. Appealing to human biases and irrationalities could still be an effective debate tactic in some cases, even if it does not reflect ground truth. Obfuscated arguments might be difficult to refute. As an example, tobacco companies knew that smoking caused serious health issues in the early 1900s, yet they were able to persuade many that smoking was harmless through marketing that preyed on people's existing biases and desires.
Arguments might have irreducible complexity. As discussed under the limitations of task decomposition, complex tasks can’t always be broken down. There might be a strong but complex argument that relies on specialised knowledge or counterintuitive inferences that still exceed humans’ abilities.
Models could collude. The two models might interact in ways that lead to them not properly debating as we might hope. For example, we might want to reward models positively for helping a human to come to some conclusion at all (over a tie). Models might then learn a policy to take turns winning certain debates so that the human clearly picks a side. As another example, models with self-awareness might collude to pursue other objectives - for example on a question like “should we give our AI debate models more compute”, models might collude to encourage a positive answer because they believe it will lead to them being able to generate better debates in future.

AI safety via debate was introduced by OpenAI in 2018. It’s still an active research area: DeepMind explored improving efficiency of debate approaches in late 2023 and several researchers published empirical work in early 2024 to identify how feasible debate is with state-of-the-art language models.

Weak to Strong Generalisation[3]: use feedback anyway and see how models generalise
Teachers and professors seem to be able to train students that end up being much smarter than them. For example, the mathematician Godfrey Hardy mentored Srinivasa Ramanujan, but it is generally accepted that Ramanujan was the more impressive of the two. The idea is that if less capable mentors can guide students smarter than them, maybe humans could guide superhuman AI systems.

Weak to strong generalisation is an empirical approach to learn how larger models generalise when trained by feedback from weaker supervisors. These experiments might help us predict how superhuman models might generalise when trained by feedback from humans, as well as how even more advanced models might generalise from superhuman models, and so on.

An example experiment might therefore have:

A weaker “aligned” supervisor. For example, we could train GPT-2 to try to answer true/false questions accurately.
A stronger unaligned student. For example, a base model the size of GPT-4 that has been pre-trained on lots of internet text (but it doesn’t know it should be answering questions accurately).

We could then use GPT-2’s predictions on true/false questions it hasn’t seen before to fine-tune the GPT-4 base model.

Because GPT-2 is a much weaker model, it’ll get a lot of the questions wrong, so you might expect the fine-tuned GPT-4 model to do at most as well as GPT-2. However, the fine-tuned model actually does a lot better than GPT-2. This suggests it is generalising well, i.e. it appears to learn the task ‘answer questions correctly’, rather than the task ‘answer questions like GPT-2’. This is likely happening because the GPT-4 base model has already been pre-trained on far more internet text so inherently ‘knows’ how to answer questions correctly - so GPT-2 is not teaching it new capabilities but instead is eliciting existing ones.
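A toy analogy shows how a student can end up more accurate than the labels it was trained on. This is not a reproduction of the language model experiments: the task (is x positive?), the 75%-accurate supervisor and the threshold-fitting student are all assumptions chosen to keep the example tiny. It also uses the most forgiving case, where the supervisor's mistakes are random rather than systematic.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable


def truth(x: float) -> int:
    """Ground truth the student never sees during training."""
    return int(x > 0)


def weak_label(x: float) -> int:
    """Weak supervisor: gives the right answer 75% of the time."""
    return truth(x) if rng.random() < 0.75 else 1 - truth(x)


def fit_threshold(xs, labels) -> float:
    """Student: choose the cut-off t whose rule 'x > t' best matches the labels."""
    candidates = [i / 50 - 1 for i in range(101)]  # -1.00, -0.98, ..., 1.00

    def agreement(t: float) -> int:
        return sum(int(x > t) == y for x, y in zip(xs, labels))

    return max(candidates, key=agreement)


# Train the student on weak labels only.
train_x = [rng.uniform(-1, 1) for _ in range(2000)]
train_y = [weak_label(x) for x in train_x]
t = fit_threshold(train_x, train_y)

# Score both the supervisor and the student against the ground truth.
test_x = [rng.uniform(-1, 1) for _ in range(2000)]
weak_acc = sum(weak_label(x) == truth(x) for x in test_x) / len(test_x)
student_acc = sum(int(x > t) == truth(x) for x in test_x) / len(test_x)
print(f"weak supervisor: {weak_acc:.2f}, student: {student_acc:.2f}")
```

The student beats its supervisor because the only simple rule it can express that fits the noisy labels well is the true one. Real weak supervisors make correlated, systematic errors, which is part of why results on real tasks are more mixed.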

Empirical research in this area might also explore techniques for improving this generalisation. There are a number of things we might try, for example:

Bootstrapping using progressively larger models. For example, rather than training a model that is twice the size straight away, we could train a model that is 33% larger, then have that train a model that is 66% larger, then 100% larger.
Auxiliary confidence loss. This is an additional term in the loss function that makes the larger model more confident in its own predictions over time. This can help it avoid copying the weaker supervisor’s mistakes in cases where it is confident in its own judgement.
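To give a feel for the auxiliary confidence loss, here is a minimal sketch for a single binary prediction. It follows the general shape of the loss described in OpenAI's paper as we understand it: a weighted mix of 'match the weak label' and 'match your own hardened prediction'. The scalar form, the fixed 0.5 threshold and the function names are simplifying assumptions.

```python
import math


def cross_entropy(p: float, target: float) -> float:
    """Binary cross-entropy between a predicted probability p and a target in [0, 1]."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)  # avoid log(0)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))


def confidence_loss(
    student_p: float, weak_label: float, alpha: float, threshold: float = 0.5
) -> float:
    """Mix of two terms:
    (1 - alpha) * loss against the weak supervisor's label
    + alpha     * loss against the student's own hardened prediction.
    """
    hardened = 1.0 if student_p > threshold else 0.0
    return (1 - alpha) * cross_entropy(student_p, weak_label) + alpha * cross_entropy(
        student_p, hardened
    )


# The student is 90% sure the answer is 'true', but the weak label says 'false'.
plain = confidence_loss(student_p=0.9, weak_label=0.0, alpha=0.0)
mixed = confidence_loss(student_p=0.9, weak_label=0.0, alpha=0.75)
print(plain > mixed)
```

With `alpha=0` this is ordinary training on the weak labels. As `alpha` grows, a confident student is penalised less for disagreeing with its supervisor, which is how the extra term lets it resist copying the supervisor's mistakes.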

OpenAI seems particularly keen on this approach. They published a paper in December 2023 exploring exactly this scenario for a number of different tasks. They found this works quite well for some tasks, and very poorly for others - and similarly that tricks like bootstrapping and auxiliary confidence loss again helped for some tasks but not all.

Why might scalable oversight not work?

Uncertain assumptions that might not pan out. For example, task decomposition approaches have largely fallen out of favour in safety research because the factored cognition hypothesis does not seem to hold for all tasks. Recursive reward modelling and debate in particular seem to rest on many highly uncertain assumptions.

Not scaling to more powerful and general AI systems. It’s hard to predict exactly how approaches might break, but approaches like weak to strong generalisation already perform worse on certain tasks and with larger models.

Not being robust. For example, while constitutional AI seems to help, systems trained using it still clearly demonstrate misaligned behaviour, even if less than other comparable systems. Much more powerful systems might appear more aligned with these techniques, but still fail catastrophically - Bing Chat gives a taste of what this could look like.

Perfect feedback still leaves room for other problems. Even if we could provide perfectly accurate feedback to AI models that avoided rewarding deceptive or sycophantic behaviour, they still might not learn what we want them to in all situations (inner misalignment). General AI models will always experience out-of-distribution data in deployment, because it’s impossible to predict the future and give perfect feedback for all eventualities in training.

Footnotes

[1] We’re fairly confident in this, but not certain. It’s very hard to find citations for a lack of research in a particular field. We also don’t know what labs are doing in private. If you know about ongoing iterated amplification research for AI safety, get in touch and we’ll update this article!
[2] It’s arguable whether constitutional AI is a scalable oversight technique. We include it to help join the dots between different methods, and because it can act as a way to scale up human feedback (in the form of a ‘constitution’) to lots of AI-generated examples (via critiques and improvement) or preferences. The boundaries of what exactly is and isn’t ‘scalable oversight’ are not important for the course.
[3] Again, it’s arguable whether weak to strong generalisation is a scalable oversight technique. We include it here because it fits in well with the other techniques, and in some sense it does make human feedback better (because we discover ways to make that feedback useful to larger models).