What is AI alignment?

By Adam Jones (Published on March 1, 2024)

There are many competing definitions for terms in AI safety, especially for ‘alignment’. This article explains how we use these words on our AI Alignment course, and how alignment research contributes to AI safety.

AI Safety

AI safety is a field that tries to reduce risks from AI, decreasing the expected harm from AI systems. The field is broad, and many subfields contribute to it. One way to divide them up is:

Alignment: making AI systems try to do what their creators intend them to do (some people call this intent alignment). Examples of misalignment: image generators create images that exacerbate stereotypes or are unrealistically diverse, medical classifiers look for rulers rather than medically relevant features, and chatbots tell people what they want to hear rather than the truth. In the future, we might delegate more power to extremely capable systems - if these systems are not doing what we intend them to do, the results could be catastrophic. We’ll dig more into alignment subproblems in a minute.

Moral philosophy: determining what intentions would be “good” for AI systems to have. If we can make AI systems do what we intend (alignment), we still need to know what we should intend these systems to do. This is difficult, given that arguments about what “good” means have been running for millennia, and people have different interests which sometimes conflict.

Competence: reducing how often AI systems make mistakes in achieving what they're trying to do.[1] Examples of competence failures: medical AI systems that give dangerous recommendations, or house price estimators that lead to significant financial losses.

Governance: deterring harmful development or use of AI systems. Examples of governance failures: reckless or malicious development of dangerous AI systems, or deploying systems without safeguards against serious misuse.

Resilience: preparing other systems to cope with the negative impacts of AI. For example, if AI systems uplift some actors’ ability to launch cyberattacks, resilience measures might include patching vulnerable systems to prevent attacks, increasing organisations’ ability to function during a cyber incident, and making it faster to recover from a cyberattack.

Security: protecting AI systems themselves from attacks. Examples of security failures: large model weights being inadvertently leaked, chatbots being tricked into sharing private information, and adversarial patches causing models to misclassify traffic signs.

While this breakdown acts as a helpful starting point, not all problems can be neatly allocated within this framework. For example, jailbreaks are:

  • a security problem (an attacker can bypass AI system restrictions);
  • an alignment problem (creators intend their models to follow system prompts, but they don’t); and
  • a governance problem (actors are not deterred from misusing AI models).

It’s also often very difficult to distinguish competence and alignment failures: it can be unclear what AI models are actually capable of, and whether failures represent:

  • inability to do tasks correctly (competence); or
  • that the model can do them but is trying to do something else (alignment).

This is especially true of large language models, given their huge possible output space - techniques like few-shot or chain-of-thought prompting can seem to ‘unlock’ better outputs from the same model.
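As a toy illustration of that last point, here is a minimal sketch of the same question being prompted directly versus with a chain-of-thought instruction. The wording is ours and purely illustrative (no particular model or API is assumed), but the gap in answer quality between prompts like these is part of why it is hard to tell what a model “can” do from what it happens to output.

```python
# Toy illustration (ours, not from the article): two prompts for the same question.
# The model weights would be identical in both cases; only the prompt differs, yet
# chain-of-thought style prompts often elicit noticeably better answers.

question = (
    "A bat and a ball cost £1.10 in total. The bat costs £1 more than the ball. "
    "How much does the ball cost?"
)

direct_prompt = f"{question}\nAnswer:"

chain_of_thought_prompt = (
    f"{question}\n"
    "Let's think step by step, writing out the reasoning before giving a final answer.\n"
    "Answer:"
)

# Both strings would be sent to the same model.
print(direct_prompt)
print("---")
print(chain_of_thought_prompt)
```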

AI Alignment

Making AI systems try to do what we intend them to do is a surprisingly difficult task. A common breakdown of this problem is into outer and inner alignment.

A general heuristic is “if the misaligned AI system’s behaviour would score highly against the reward function, it’s outer misalignment, otherwise it’s inner misalignment”.

1. Outer alignment: Specify goals to an AI system correctly.
Also known as solving: reward misspecification, reward hacking, specification gaming, Goodharting.

This is the challenge of specifying the AI's reward function in a way that accurately reflects our intentions. Doing this incorrectly can lead AI systems to optimise for targets that are only a proxy for what we want.

Outer misalignment example: Companies want to train large language models (LLMs) to answer questions truthfully. They train AI systems with a reward function based on how humans rate answers. This encourages the AI system to generate correct-sounding but actually false answers, such as ones with made-up citations.

Outer misalignment example: Social media companies want to maximise advertising profits. They may use AI systems with the objective of maximising user engagement, but this results in their platforms promoting clickbait, misinformation or extremist content. This in turn reduces profits, because advertisers stop buying ads on the platform.
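The Goodharting pattern behind both examples can be shown with a minimal numerical sketch. The two functions below are made up for illustration (they aren’t real engagement or profit models): the proxy reward keeps rising with clickbait, the true objective peaks at a moderate level, and an optimiser pointed at the proxy lands on behaviour the true objective rates poorly.

```python
# A minimal sketch (ours) of reward misspecification / Goodharting.
# A hypothetical platform chooses how much clickbait to promote (0 = none, 1 = maximum).
# The proxy reward (engagement) keeps rising with clickbait, but the true objective
# (long-run advertising profit) peaks at a moderate level and then collapses as
# advertisers leave.

def engagement(clickbait: float) -> float:
    """Proxy reward: more clickbait, more engagement."""
    return clickbait

def long_run_profit(clickbait: float) -> float:
    """True objective: some engagement helps, but too much clickbait drives advertisers away."""
    return clickbait * (1.0 - clickbait)  # peaks at clickbait = 0.5

candidates = [i / 100 for i in range(101)]

best_for_proxy = max(candidates, key=engagement)       # what the AI system optimises
best_for_true = max(candidates, key=long_run_profit)   # what the company actually wanted

print(f"Optimising the proxy:  clickbait={best_for_proxy:.2f}, "
      f"profit={long_run_profit(best_for_proxy):.2f}")
print(f"Intended behaviour:    clickbait={best_for_true:.2f}, "
      f"profit={long_run_profit(best_for_true):.2f}")
```

Running this, the proxy-optimising policy picks maximum clickbait and scores zero on the true objective, even though the proxy was a reasonable-sounding stand-in for it.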

2. Inner alignment: Get AI to follow these goals.
Also known as solving: goal misgeneralization.

We give AI systems feedback based on our reward function, to help them learn. This normally results in AI systems that score highly on the reward function on the training data.

However, this doesn’t mean they try to pursue the reward function we set.[2] This can cause problems if the training and deployment environments have different data distributions (known as domain shift or distributional shift). Aligning the goal the AI system tries to pursue with the specified reward function is known as inner alignment.

Inner misalignment example: An AI system is correctly given a reward when it solves mazes. In training, all the mazes have an exit in the bottom right. The AI system learns to always go to the bottom right (rather than ‘to go towards the exit’). In deployment, some mazes have exits in different locations, but the AI system just gets stuck in the bottom right of the maze.
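The maze example can be turned into a minimal runnable sketch. The gridworld, step budget and policies below are made up for illustration (real goal misgeneralisation experiments use learned policies, not hand-written ones): both policies earn full reward in training, where the exit is always in the bottom right, but only the policy pursuing the intended goal keeps earning it once the exit moves.

```python
# A minimal sketch (ours) of goal misgeneralisation in a maze-like toy gridworld.
# "Mazes" are simplified to an empty N x N grid; reward is 1 if the agent reaches
# the exit within a step budget, else 0.
import random

N = 5          # grid size
MAX_STEPS = 20

def step_towards(pos, target):
    """Move one cell towards the target (greedy; no walls in this toy grid)."""
    x, y = pos
    tx, ty = target
    if x != tx:
        x += 1 if tx > x else -1
    elif y != ty:
        y += 1 if ty > y else -1
    return (x, y)

def intended_policy(pos, exit_pos):
    """The intended goal: head towards the exit, wherever it is."""
    return step_towards(pos, exit_pos)

def learned_policy(pos, exit_pos):
    """The misgeneralised goal: always head for the bottom-right corner."""
    return step_towards(pos, (N - 1, N - 1))

def run_episode(policy, exit_pos):
    pos = (0, 0)
    for _ in range(MAX_STEPS):
        if pos == exit_pos:
            return 1.0          # reached the exit: reward 1
        pos = policy(pos, exit_pos)
    return 0.0                  # ran out of steps: reward 0

training_exits = [(N - 1, N - 1)] * 100                                   # exit always bottom right
deployment_exits = [(random.randrange(N), random.randrange(N)) for _ in range(100)]

for name, exits in [("training", training_exits), ("deployment", deployment_exits)]:
    for label, policy in [("intended", intended_policy), ("learned", learned_policy)]:
        avg = sum(run_episode(policy, e) for e in exits) / len(exits)
        print(f"{name:10s} {label:8s} average reward: {avg:.2f}")
```

In training, both policies score 1.00, so the reward signal alone cannot tell them apart; in deployment, only the intended policy keeps scoring highly, which is the distributional shift problem described above.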

Inner misalignment example (hypothetical): A self-driving car is trained with a reward for driving without crashing. During training, it learns to never go through red lights, because this performs well in all the training scenarios. In deployment, the car is sitting at a red light when a truck behind it has its brakes fail and sounds its horn. Unfortunately, the car refuses to move, as it is prioritising not going through red lights over avoiding a crash, and the vehicles collide.

Criticisms of the inner/outer alignment breakdown

Failures can sometimes be hard to classify as outer or inner misalignment: a failure can be ambiguous or result from both, and some researchers classify the same failure differently depending on their framing.

It’s unclear how directly relevant the reward function is to the behaviour a system actually learns. This calls into question the relevance of outer alignment, given that systems trained with methods like reinforcement learning may not optimise for the reward directly. One research agenda, shard theory, attempts to explore the relationship between reward functions and learnt behaviours, and proposes we set a reward function based on the values it teaches the model, rather than trying to find an objective that is exactly what we want.

Lastly, future AI systems might be developed without specified reward functions. Modern-day LLMs usually have the goal of ‘generate coherent text that humans would rate highly’. However, systems built on top of them, like AutoGPT, attempt to make them more agentic and goal-directed through repeated prompting and connecting them to tools - and it’s not clear how the outer/inner alignment paradigm maps onto this.

Despite these criticisms, we think being aware of these concepts is helpful, given they're widely used in the field. On the course, we’ll use them to think through challenges in developing aligned AI systems.

Alternative definitions

Many others before us have attempted to define alignment, and likely many after us will too. Other popular definitions people use:

  • Jan Leike: “The AI system does the intended task as well as it could”
  • Paul Christiano: “When I say an AI A is aligned with an operator H, I mean: A is trying to do what H wants it to do.”
  • Richard Ngo: “ensuring that AI systems pursue goals that match human values or interests rather than unintended and undesirable goals”
  • Ryan Greenblatt (in specific context): “Ensure that your models aren't scheming.”
  • Nate Soares: “how in principle to direct a powerful AI system towards a specific goal”
  • Holden Karnofsky: “building very powerful systems that don’t aim to bring down civilization”
  • Anthropic: “build safe, reliable, and steerable systems when those systems are starting to become as intelligent and as aware of their surroundings as their designers”
  • OpenAI: “make artificial general intelligence (AGI) aligned with human values and follow human intent”
  • Google DeepMind: “ensure that AI systems are properly aligned with human values”
  • IBM: “encoding human values and goals into large language models to make them as helpful, safe, and reliable as possible”

Footnotes

  1. This is also sometimes known as ‘capabilities’ work.

    While it might be important for reducing harms, we usually don’t think altruistically motivated people should work on this. This is because there are already strong commercial incentives which will push for this by default, so it’s not very neglected.

    Additionally, there are significant negative externalities to pushing forward capabilities without having appropriate tools in other areas of AI safety.

  2. By ‘tries to pursue’ here we mean what the system appears to be optimising, based on how it might choose between different actions.