A Brief Introduction to Some Approaches to AI Alignment
This article was written some time ago, and parts may be out of date. We recommend reviewing the latest version of our alignment course for a more up-to-date overview.
Various approaches to the AI alignment problem have been proposed and debated. This document briefly introduces some of these approaches, partly to support readers in thinking more concretely about what AI alignment solutions might look like. Such concrete thinking can help answer questions like:
- How likely is it that AI alignment will be solved anytime before the development of transformative AI? How likely is AI alignment to be solved well in advance?
- What might AI alignment to multiple humans look like?
- To what extent will aligned AI help humans reflect on our values?
This document assumes basic familiarity with AI alignment problems, equivalent to understanding these three sources, but it aims to not assume further technical background.
For a more in-depth and technical introduction to AI alignment research agendas, see the AI Safety Fundamentals AI Alignment course.
Learning from human feedback
An AI system could be trained using direct feedback from humans. For example, during training of an AI system that writes summaries of books, human “overseers” could respond to its summaries with “thumbs up” or “thumbs down,” and that feedback could be used to reinforce certain behaviors in the AI. One might hope that, through this feedback, the AI would be trained to try to act in helpful ways.
AI labs including DeepMind, OpenAI, and Anthropic have implemented variations of this approach.
(Why might we be interested in this approach instead of directly specifying human values? Because human values arguably involve fuzzy concepts, and machines have generally been much more successful at learning such concepts through examples than through attempts at direct specification.)
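To make the basic mechanism a little more concrete, here is a minimal sketch of how pairwise human feedback might be turned into a training signal. The data, dimensions, and model here are invented placeholders, not any lab's actual implementation; real systems compare outputs from a large language model rather than random vectors.

```python
# A minimal sketch of learning a reward model from pairwise human feedback.
# Real systems embed text with a large language model; here random vectors
# stand in for "summary embeddings" so the example runs on its own.
import torch
import torch.nn as nn

torch.manual_seed(0)

EMBED_DIM = 16
N_COMPARISONS = 256

# Placeholder data: each row pairs the embedding of a summary the overseer
# preferred with the embedding of one they rejected (thumbs up vs. thumbs down).
preferred = torch.randn(N_COMPARISONS, EMBED_DIM)
rejected = torch.randn(N_COMPARISONS, EMBED_DIM) - 0.5

# The reward model maps an output's embedding to a scalar "how good" score.
reward_model = nn.Sequential(nn.Linear(EMBED_DIM, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-2)

for step in range(200):
    r_pref = reward_model(preferred).squeeze(-1)
    r_rej = reward_model(rejected).squeeze(-1)
    # Pairwise preference loss: push the preferred output's score above the rejected one's.
    loss = -torch.nn.functional.logsigmoid(r_pref - r_rej).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The trained scores can then be used as a reward signal to reinforce behavior.
print("final preference loss:", loss.item())
```

In a setup like this, the learned scores stand in for human judgment when reinforcing behavior, which is exactly why the limitations below matter: the signal is only as good as the feedback it was trained on.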
However, critics argue that learning from human feedback (on its own) has multiple fatal limitations:
- If the AI system is doing some task that humans don’t entirely understand, then the humans who give feedback might not know whether the AI system’s behavior has been helpful or not. For example, if an AI system proposes designs for new computer chips or DNA sequences for new drugs, humans might not know whether the proposals are helpful. In such cases, human feedback couldn’t reliably reinforce helpful behavior.
- This problem is already showing up. Researchers at OpenAI write that a cutting-edge chatbot trained with human feedback “sometimes writes plausible-sounding but incorrect or nonsensical answers.” They explain that “[f]ixing this issue is challenging,” partly because human feedback is “no source of truth”—human feedback cannot easily train an AI system to act beneficially if humans can’t reliably identify beneficial behavior.
- Even if the humans could give perfect feedback, that might not reliably reinforce helpful behavior, due to the potential for deceptive alignment.
Many of the approaches to AI alignment introduced below can be thought of as efforts to improve or complement human feedback, to address these limitations.
Task decomposition
To overcome the problem that humans are unable to give good feedback on some tasks, one approach is task decomposition: even if a task is too complicated for a human to evaluate all at once, it may be possible to break it down into many smaller sub-tasks that humans can evaluate, in such a way that a human can then review these sub-task evaluations and reach an overall evaluation.
This process may be made easier through (i) recursively breaking down tasks into smaller and smaller tasks, (ii) training AI systems to imitate humans, so that they can do some of the sub-task evaluation, and (iii) training AI systems to debate each other, so they can focus the attention of human feedback on certain sub-tasks that are especially revealing of unhelpful behavior.
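As a very rough sketch of the recursive idea, the toy program below splits a question into sub-questions, answers the small ones, and combines the answers. Every function is a stand-in for a hard research or engineering problem (getting real sub-questions, real human or model judgments, and a trustworthy way to combine them), not an existing method.

```python
# A toy sketch of recursive task decomposition: a hard question is split into
# sub-questions, each answered (by a human or an AI assistant), and the
# sub-answers are combined into an overall evaluation. All helpers are stubs.

def decompose(question: str) -> list[str]:
    # Stand-in: a human or model would propose meaningful sub-questions here.
    return [f"{question} (sub-question {i})" for i in range(1, 3)]

def simple_enough_for_human(question: str, depth: int) -> bool:
    # Stand-in: stop recursing once the question is small enough to judge directly.
    return depth >= 2

def human_or_model_answer(question: str) -> str:
    # Stand-in for a human judgment, or an AI system trained to imitate one.
    return f"answer to: {question}"

def combine(question: str, sub_answers: list[str]) -> str:
    # Stand-in: a human reviews the sub-answers and reaches an overall evaluation.
    return f"evaluation of '{question}' based on {len(sub_answers)} sub-answers"

def evaluate(question: str, depth: int = 0) -> str:
    if simple_enough_for_human(question, depth):
        return human_or_model_answer(question)
    sub_answers = [evaluate(sub, depth + 1) for sub in decompose(question)]
    return combine(question, sub_answers)

print(evaluate("Is this proposed chip design safe and efficient?"))
```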
However, critics argue these approaches still have multiple theoretical limitations, in addition to more practical problems:
- These methods might not be enough for humans to know whether an AI system’s behavior has been helpful or not.
- These methods do not address the potential problem of deceptive alignment.
Transparency
All of the problems we have considered so far revolve around how an AI system might know things that its human overseers do not.
- One potential problem is that an AI system will be capable and misaligned enough to knowingly carry out an unhelpful action, yet human overseers will not know this is what it has done.
- Another potential problem is that an AI system will take helpful actions because it is trying to hide a plan to eventually take unhelpful actions.
An approach to solving these problems is seeking to better understand the internal thought processes or knowledge of AI systems: what is happening inside them, which has often been treated as an inscrutable “black box.” The hope here is that, if we knew enough about an AI system’s thoughts and knowledge, we would be in a much better position to address the problems of deception or of AI systems knowing things we don’t. (As an analogy, if I wanted to make sure a human wasn’t lying to me, it would be very convenient to be able to read their mind.)
One prominent research agenda in this vein is the study of "mechanistic interpretability", particularly by Chris Olah’s group at Anthropic. This involves empirically studying sub-components of existing AI systems to better understand their functions, with the aim of scaling these techniques to more advanced, future AI systems. Additional research directions focused on transparency include the study of eliciting latent knowledge at the Alignment Research Center, interpretability work at Redwood Research and Conjecture, the study of externalized reasoning oversight, and building datasets to train AI systems to “think out loud.”
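To make “studying sub-components” slightly more concrete, here is a hedged sketch of one common low-level technique: recording a layer’s intermediate activations with a forward hook so they can be inspected. The tiny model is invented for illustration; real mechanistic interpretability work studies large trained models and goes far beyond recording activations.

```python
# A minimal sketch of inspecting a model's internals: register a hook on a
# hidden layer and record its activations for a given input. The model is a
# tiny stand-in; real interpretability work targets large trained networks.
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(
    nn.Linear(8, 4),   # layer 0
    nn.ReLU(),         # layer 1: the hidden activations we want to inspect
    nn.Linear(4, 2),   # layer 2
)

captured = {}

def save_activation(module, inputs, output):
    # Store the hidden-layer output so it can be examined after the forward pass.
    captured["hidden"] = output.detach()

hook = model[1].register_forward_hook(save_activation)

x = torch.randn(1, 8)
model(x)
hook.remove()

# With the activations in hand, a researcher might look for individual units
# that respond to interpretable features of the input.
print("hidden activations:", captured["hidden"])
```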
Inductive biases
As we’ve seen, a problem in AI alignment (roughly the “inner alignment” problem) is that AI systems might learn to pursue some proxy objective which, while achieving good results with training examples, generalizes in undesirable ways. An approach to this problem is seeking to more systematically understand how AI systems generalize (i.e. their “inductive biases”), so they can more reliably be trained to generalize in helpful ways. This research is often empirical, done through experiments that involve training AI systems.
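As a toy illustration of the kind of experiment involved (the setup below is invented): two input features agree on every training example, but only one of them tracks the intended behavior at deployment. Which feature the trained model ends up relying on is not determined by the training data alone; it is a question about the model's inductive biases.

```python
# A toy "proxy objective" experiment: during training, a proxy feature and the
# intended feature are perfectly correlated; at test time they come apart.
# How the model distributes its reliance reflects its inductive biases.
import torch
import torch.nn as nn

torch.manual_seed(0)

n = 512
labels = torch.randint(0, 2, (n,)).float()
intended = labels.clone()   # the feature we want the model to use
proxy = labels.clone()      # a proxy that happens to match it during training
x_train = torch.stack([intended, proxy], dim=1) + torch.randn(n, 2) * 0.1

model = nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)
loss_fn = nn.BCEWithLogitsLoss()

for _ in range(300):
    loss = loss_fn(model(x_train).squeeze(-1), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# At test time the two features disagree: the intended feature says 1, the proxy says 0.
x_test = torch.tensor([[1.0, 0.0]])
print("weights (intended, proxy):", model.weight.data)
print("prediction when the features disagree:", torch.sigmoid(model(x_test)).item())
```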
Agent foundations
Like transparency and inductive bias research, agent foundations is a research agenda that aims to address AI alignment through improving human understanding of AI systems. The agent foundations research agenda takes a more theoretical, conceptual, and often mathematical approach to this, trying to formally understand how intelligent agents think.
Some work in this vein has been explained with an analogy to rocket science: safely launching a rocket would be roughly impossible without a solid theoretical understanding of physics, and—the analogy goes—safely developing advanced AI is similarly doomed without a solid theoretical understanding of intelligence.
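As a small taste of the kind of formalism involved (the formula below is just the textbook expected-utility idealization, not a specific agent foundations result): this research often starts from idealized models of agents, such as one that picks whichever action maximizes expected utility, and then asks where such idealizations break down for real agents embedded in the world they are reasoning about.

```latex
% Textbook idealization of a rational agent: choose the action a (from the set A)
% that maximizes expected utility U over outcomes o, given beliefs P.
a^{*} = \arg\max_{a \in A} \sum_{o} P(o \mid a)\, U(o)
```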
Perhaps the most prominent proponents of this approach have since dropped it (on the grounds that it was progressing too slowly), while other researchers continue to pursue it.
Bootstrapping and AI tools
One hope that some AI safety researchers have is that aligned AI systems with low capabilities can “bootstrap” alignment to progressively higher capabilities, by each AI system helping humans align a slightly more capable AI system (or aligning the more capable AI system without human help).
Some research groups, including OpenAI, have approaches in this vein, which can overlap with the approaches covered above.
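A highly schematic sketch of the bootstrapping loop is below; every function in it is a placeholder for an open research problem, not an existing method.

```python
# A schematic sketch of "bootstrapping": at each step, the current aligned
# (but weaker) AI system helps oversee the training of a slightly more capable
# successor. Every helper below is a placeholder, not a real technique.

def train_more_capable_system(capability_level: int) -> dict:
    # Stand-in: train a system slightly more capable than the previous one.
    return {"capability": capability_level, "aligned": False}

def oversee_with_help(new_system: dict, human: str, assistant: dict) -> dict:
    # Stand-in: the human, assisted by the weaker aligned system, evaluates and
    # corrects the new system's behavior until it appears aligned.
    new_system["aligned"] = True
    return new_system

human = "human overseer"
assistant = {"capability": 0, "aligned": True}  # assume a weak aligned system to start

for level in range(1, 4):
    candidate = train_more_capable_system(level)
    aligned_candidate = oversee_with_help(candidate, human, assistant)
    assistant = aligned_candidate  # the newly aligned system helps align the next one

print("final assistant:", assistant)
```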
Others
The above set of research directions is not comprehensive.
Where does this leave us?
One might think from all the above approaches that there are already many full solutions to AI alignment. That is incorrect; one publication from OpenAI explains, “There is currently no known indefinitely scalable solution to the alignment problem. As AI progress continues, we expect to encounter a number of new alignment problems that we don’t observe yet in current systems. Some of these problems we anticipate now and some of them will be entirely new.”
While AI alignment researchers widely think that a full solution has not been discovered (i.e. that none of the above research directions has fully panned out yet), there is much debate and disagreement about how promising the various approaches are.
How does this relate to AI governance?
This coverage of technical approaches is not to suggest that technical approaches to AI alignment are the only or even the main approaches needed. The following excerpt from a talk by AI safety researcher Paul Christiano suggests one framework for thinking about how technical and governance work on AI alignment relate to each other:
“I like this notion of an “alignment tax” [...] [E]veryone would prefer to have AI systems that are trying to do what they want [the systems] to do. I want my AI to try and help me. The reason I might compromise is if there's some tension between having AI that's robustly trying to do what I want and having an AI that is competent or intelligent. And the alignment tax is intended to capture that gap — that cost that I incur if I insist on alignment.
[...] [W]e could then imagine two approaches to making AI systems aligned:
- Reducing the alignment tax, so that [someone] has to incur less of a cost if she insists on her AI being aligned.
- Paying the alignment tax.” (Note that the costs of the “alignment tax” are not necessarily just financial; many researchers in the field expect AI alignment to require significant time costs, i.e. delays, although it is very unclear how long the delays involved would be.)
In this framework, technical work on AI alignment helps largely by reducing the costs of aligning AI, while governance work on AI alignment helps largely by making relevant actors more willing and able to pay these costs.
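One hedged way to restate the “alignment tax” framing as a formula (this is our paraphrase, not Christiano’s own formalization):

```latex
% The alignment tax as a gap: the performance an actor gives up by insisting on
% deploying an aligned system rather than the most capable system available.
\text{alignment tax}
  = \max_{\text{all available systems}} \text{performance}
  \;-\; \max_{\text{available aligned systems}} \text{performance}
```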