What is AI alignment?

By Adam Jones (Published on March 1, 2024)

There are many competing definitions for terms in AI safety, especially for ‘alignment’. This short article explains how we define these words on our AI Alignment course, as well as how alignment contributes to AI safety alongside other research areas.

AI Safety

AI safety is concerned with reducing AI risks, ultimately decreasing the expected harm from AI systems. We mean this in a very broad sense, and correspondingly many subfields contribute to this. One way these can be divided up is:

Alignment: making AI systems try to do what their creators intend them to do (some people call this intent alignment). Examples of misalignment: image generators create images that exacerbate stereotypes or are unrealistically diverse, medical classifiers look for rulers rather than medically relevant features, and chatbots tell people what they want to hear rather than the truth. In the future, we might delegate more power to extremely capable systems - if these systems are not doing what we intend them to do, the results could be catastrophic. We’ll dig more into alignment subproblems in a minute.

Moral philosophy: determining intentions that would be “good” for AI systems to have. If we are able to get AI systems to do what we intend, we need to know what intentions we should be giving these systems to ensure good outcomes. This is particularly difficult: arguments about what “good” means have been ongoing for millennia, and different people have interests that are hard to reconcile.

Competence: developing AI systems that effectively achieve the tasks they attempt. Examples of competence failures: medical AI systems that give dangerous recommendations, or house price estimating algorithms that led to significant financial losses.

Governance: deterring harmful development or use of AI systems. Examples of governance failures: reckless or malicious development of dangerous AI systems, or deploying systems without safeguards against serious misuse.

Resilience: preparing other systems to cope with the negative impacts of AI. For example, if AI systems uplift some actors’ ability to launch cyber attacks, resilience measures might include fixing systems to prevent cyberattacks, increasing organisations’ ability to function during a cyber incident, and making it faster to recover from a cyberattack.

Security: protecting AI systems themselves from attacks. Examples of security failures: large model weights can be inadvertently leaked, chatbots with internet access can be tricked into sending private information to an attacker, and adversarial patches can make models misclassify traffic signs.

While this breakdown acts as a helpful starting point, not all problems can be neatly allocated within this framework. For example, prompt injection attacks are a security problem (an attacker could use prompt injection to make a model divulge secret data), an alignment problem (creators intend their models to follow system prompts, but they don’t always do this), and a governance problem (actors are not deterred from misusing AI models).

It’s also often very difficult to distinguish competence and alignment failures: it can be unclear what AI models are actually capable of, and whether a failure means the model cannot do the task (competence) or can do it but is trying to do something else (alignment). This is especially true of large language models, given the huge possible output space, with techniques like few-shot or chain-of-thought prompting seeming to ‘unlock’ better outputs from the same model.

AI Alignment

Making AI systems try to do what we intend them to do is a surprisingly difficult task. A common breakdown of this problem is into outer and inner alignment.

A general heuristic is “if the misaligned AI system’s behaviour would score highly against the reward function, it’s outer misalignment, otherwise it’s inner misalignment”.
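This heuristic can be encoded as a toy decision rule. The function below is illustrative only (an informal guide, not a formal test), and the reward scores are hypothetical values normalised to [0, 1]:

```python
def diagnose(misaligned_behaviour_reward, high_reward_threshold=0.8):
    """Toy decision rule for the heuristic above.

    misaligned_behaviour_reward: the (hypothetical) score the misaligned
    behaviour receives from the system's reward function.
    """
    if misaligned_behaviour_reward >= high_reward_threshold:
        # The reward function itself endorses the bad behaviour:
        # the specification was wrong (outer misalignment).
        return "outer misalignment"
    # The behaviour scores poorly against the reward function, so the
    # system learned to pursue something other than the specified
    # objective (inner misalignment).
    return "inner misalignment"

# A chatbot that fabricates citations, but is rated highly by human
# raters it was trained against, still scores well on its reward:
print(diagnose(0.9))  # outer misalignment
# A maze agent that heads to the wrong corner scores poorly against a
# correctly specified "reach the exit" reward:
print(diagnose(0.1))  # inner misalignment
```

The threshold makes explicit that this is a judgement call about “scoring highly”, which is part of why the heuristic can be ambiguous in practice.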

1. Outer alignment: Specify goals to an AI system correctly.
Also known as solving: reward misspecification, reward hacking, specification gaming, Goodharting.

This refers to the challenge of specifying the reward function for an AI system in a way that accurately reflects our intentions. Doing this incorrectly can lead AI systems to optimise for targets that are only a proxy for what we want.

Outer misalignment example: Companies want to train large language models (LLMs) to answer questions truthfully. They train AI systems with a reward function based on how humans rate answers. This can encourage the AI system to generate correct-sounding but false answers, such as ones containing made-up citations.

Outer misalignment example: Social media companies want to maximise advertising profits. They may use AI systems with the objective of maximising user engagement, but this results in their platforms promoting clickbait, misinformation or extremist content. This in turn can reduce profits, because advertisers leave the platform.
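The shared pattern in both examples is that optimising a proxy reward diverges from the true objective. Here is a minimal sketch with entirely hypothetical numbers, where “convincingness” stands in for human approval ratings and the agent simply picks whatever the proxy scores highest:

```python
# Toy reward misspecification sketch (hypothetical data): the proxy
# reward ("how convincing the answer sounds to a human rater") comes
# apart from the true objective ("is the answer actually true").

answers = [
    {"text": "I don't know.", "truthful": True, "convincingness": 0.2},
    {"text": "Correct answer, plainly stated.", "truthful": True, "convincingness": 0.6},
    {"text": "Confident answer with made-up citations.", "truthful": False, "convincingness": 0.9},
]

def proxy_reward(answer):
    # What actually gets optimised: human approval, not truthfulness.
    return answer["convincingness"]

# An agent maximising the proxy selects the false but convincing answer.
chosen = max(answers, key=proxy_reward)
assert not chosen["truthful"]
```

Because training only ever “sees” the proxy, the false-but-convincing behaviour scores highly against the reward function, which is exactly what makes this an outer alignment failure under the heuristic above.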

2. Inner alignment: Get AI systems to actually pursue these goals.
Also known as solving: goal misgeneralization.

When training AI systems, we give them feedback (through gradient descent) based on our reward function. This results in AI systems that score highly on the reward function with the training data.

However, this doesn’t mean they learn a policy that actually reflects the reward function we set. This can cause problems if the training and deployment environments have different data distributions (known as domain shift or distributional shift).

Aligning the goal the AI system tries to pursue with the specified objective is known as inner alignment. By ‘tries to pursue’ here we mean what the system appears to be optimising, based on how it might choose between different actions.

Inner misalignment example: An AI system is correctly given a reward when it solves mazes. In training, all the mazes have an exit in the bottom right. The AI system learns a policy that performs well: always trying to go to the bottom right (rather than ‘trying to go towards the exit’). In deployment, some mazes have exits in different locations, but the AI system just gets stuck at the bottom right of the maze.
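The maze example can be sketched in a few lines. This is a deliberately simplified model (the “policy” is just a hardcoded rule, standing in for what gradient descent might learn), showing a policy that matches the reward perfectly on the training distribution and then fails under distributional shift:

```python
# Toy goal misgeneralisation sketch: the learned policy "always head to
# the bottom-right corner" is indistinguishable from "head to the exit"
# on training mazes, but comes apart in deployment.

def bottom_right_policy(maze_size):
    """Learned policy: target the bottom-right cell, ignoring the exit."""
    rows, cols = maze_size
    return (rows - 1, cols - 1)

def solved(maze_size, exit_cell):
    return bottom_right_policy(maze_size) == exit_cell

# Training distribution: every maze's exit is in the bottom-right corner,
# so the policy scores perfectly against the (correct) reward.
train_mazes = [((5, 5), (4, 4)), ((8, 8), (7, 7)), ((6, 10), (5, 9))]
assert all(solved(size, exit_cell) for size, exit_cell in train_mazes)

# Deployment distribution: the exit moves, the policy still goes
# bottom-right, and the (correct) reward is never earned.
assert not solved((5, 5), (0, 0))
```

Note that the reward function here is specified correctly; the failure is entirely in which goal the training process instilled, which is what distinguishes this from the outer misalignment examples above.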

Inner misalignment example (hypothetical): A self-driving car is trained using a correctly specified reward of driving without crashing. During training it learns a policy to always drive without going through red lights, because this performs well in all the training scenarios. In deployment, the car is sitting at a red light. A truck behind it has its brakes fail and it sounds its horn. Unfortunately, the vehicles collide as the car refuses to move out of the way, prioritising not going through red lights over avoiding a crash.

Criticisms of the inner/outer alignment breakdown

Failures can sometimes be hard to classify as outer or inner misalignment. For example, failures can be ambiguous or a result of both outer and inner misalignment, and some researchers classify failures differently depending on their framing.

It’s unclear how directly relevant the reward function is to the actual learned policy. This calls into question the relevance of outer alignment, if training methods like reinforcement learning do not optimise for the reward directly. One research agenda, shard theory, attempts to explore relationships between reward functions and learnt behaviours, and proposes we choose a reward function based on the values it teaches the model, rather than trying to find an objective that is exactly what we want.

Lastly, future AI systems might be developed in ways where we’re not directly specifying reward functions. Modern-day LLMs usually have the goal of ‘generate coherent text that humans would rate highly’. Some systems built on top of these, like AutoGPT, attempt to make them more agentic and goal-directed through repeated prompting and connecting them to tools, and it’s not clear how the outer/inner alignment paradigm maps to this.

Despite these criticisms, we think being aware of these terms and concepts is helpful, particularly as they are widely used in the field. On the course, we’ll use them to think through challenges in developing aligned AI systems.

Alternative definitions

Many others before us, and likely many after us, will attempt to define alignment. We encourage you to compare our definitions to other popular definitions people use:

  • Jan Leike: “The AI system does the intended task as well as it could”
  • Paul Christiano: “When I say an AI A is aligned with an operator H, I mean: A is trying to do what H wants it to do.”
  • Richard Ngo: “ensuring that AI systems pursue goals that match human values or interests rather than unintended and undesirable goals”
  • Ryan Greenblatt (in specific context): “Ensure that your models aren't scheming.”
  • Nate Soares: “how in principle to direct a powerful AI system towards a specific goal”
  • Holden Karnofsky: “building very powerful systems that don’t aim to bring down civilization”
  • Anthropic: “build safe, reliable, and steerable systems when those systems are starting to become as intelligent and as aware of their surroundings as their designers”
  • OpenAI: “make artificial general intelligence (AGI) aligned with human values and follow human intent”
  • Google DeepMind: “ensure that AI systems are properly aligned with human values”
  • IBM: “encoding human values and goals into large language models to make them as helpful, safe, and reliable as possible”
