What we didn’t cover in our Early 2024 AI Alignment course

By Adam Jones (Published on April 9, 2024)

An 8-week part-time course can't cover everything there is to know about AI alignment, especially given it’s a fast-moving field with many budding research agendas.

This resource gives a brief introduction to a few areas we didn't touch on, with pointers to resources if you want to explore them further.

Developmental interpretability

We covered mechanistic interpretability (or “mech interp”) in session 5. To recap, this is an approach that ‘zooms in’ to models to make sense of their learned representations and weights through methods like feature or circuit analysis.

Developmental interpretability (or “dev interp”) instead studies how the structure of models changes as they are trained. Rather than features and circuits, here the object of study is phases and phase transitions during the training process. This takes inspiration from developmental biology, which increased our understanding of biology by studying the key steps of how living organisms develop.

Agent foundations

Most of this course has explored fairly empirical research agendas that target the kinds of AI systems we have today: neural network architectures trained with gradient descent. However, it’s unclear whether we’ll continue to build future AI systems with similar technology. You’ll also likely have realised that a lot of the fundamentals underpinning the field are unclear: the field does not have an agreed definition of what alignment is, or of the kinds of properties we might want from an aligned system.

Agent foundations research is primarily theoretical research that aims to give us the building blocks to solve key problems in the field, particularly around how powerful agentic AI systems might behave. It draws a lot from maths, philosophy, and theoretical computer science, and in some ways is the basic science of alignment.

Adversarial robustness

Adversarial attacks on AI models exploit particular flaws or weaknesses in AI systems to cause them to fail, often by only subtly tweaking inputs or using inputs that would be bizarre to humans. This has serious implications for our ability to prevent misuse.

It was previously assumed that this vulnerability was a capability problem, and that as models got more capable they would become more robust to adversarial inputs. However, it appears that even systems that are highly capable in narrow domains, such as AlphaZero, are vulnerable to such attacks.

Adversarial robustness research attempts to address these problems. This involves exploring why models are susceptible to such attacks in the first place, and how we might make models or systems more resistant or fault-tolerant.
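To make the idea of "subtly tweaking inputs" concrete, here is a minimal sketch of a gradient-sign style perturbation against a toy linear classifier. The weights, inputs and `epsilon` are made up for illustration and are not taken from any real system. Real attacks target neural networks and compute gradients with backpropagation, but the core idea is similar: nudge every input feature a small amount in the direction that most changes the model's output.

```python
# Toy illustration only: a hypothetical linear classifier with invented weights.
WEIGHTS = [0.5, -0.25, 0.75, -1.0]


def score(x):
    """Linear score; the classifier predicts class 1 when this is positive."""
    return sum(w * xi for w, xi in zip(WEIGHTS, x))


def predict(x):
    return 1 if score(x) > 0 else 0


def sign(v):
    return (v > 0) - (v < 0)


def perturb(x, epsilon):
    """Move each feature by at most epsilon in the direction that lowers the score.

    For a linear model the gradient of the score with respect to the input
    is simply WEIGHTS, so stepping against sign(w) lowers the score fastest.
    """
    return [xi - epsilon * sign(w) for w, xi in zip(WEIGHTS, x)]


original = [0.5, 0.5, 0.25, 0.25]
adversarial = perturb(original, epsilon=0.125)

print(predict(original), predict(adversarial))  # the predicted class flips from 1 to 0
```

No single feature moves by more than 0.125, yet the prediction changes. With thousands of input dimensions (such as the pixels of an image), far smaller per-feature changes can add up to the same effect.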

Shard theory

Shard theory is a research approach that rejects the notion that agents follow specified goals, and instead suggests that reinforcement learning agents are better understood as being driven by many ‘shards’. Each shard influences the model’s decision process in certain contexts.

This takes inspiration from how humans appear to learn. Most people do not optimise some specific objective function all the time, but behave differently in different contexts: for example when their ‘hunger’ shard activates this might influence a decision to stop working and go for lunch, because this has worked well in the past (rather than because it optimises some complex objective function).

Viewed through this lens, shards lead a model to choose some actions over others in different situations, and these choices indicate the model’s values. How shards develop therefore affects which values a model develops. Research in shard theory generally focuses on understanding the relationship between training processes and the resulting shards and learned values.

Cooperative AI safety research

Most of the research we’ve looked at has focused on making individual AI systems safer. However, there are already many AI systems being created and used by different people. In future, these may be taking far more actions in the real world (see Open Interpreter for an example).

We’ve previously seen instances of algorithmic systems interacting in undesirable ways, and current AI systems seem to escalate conflicts when interacting with each other. More capable AI systems that control more resources could pose a significant risk if they act in undesirable ways.

Cooperative AI safety research (also called multi-agent safety research) explores how we might avoid or mitigate the impacts of these negative interactions. It also explores how we might encourage agents to coordinate in ways that result in positive outcomes.

Model organisms of misalignment

We are already seeing many AI risks starting to materialise. However, it’s still unclear how anticipated risks will emerge, particularly risks stemming from deceptive misalignment. This is where models appear to be aligned (possibly during training, or when they’re being carefully overseen by humans), but are actually misaligned and will pursue a different goal when deployed or not carefully supervised.

Better understanding how and why deceptive misalignment occurs would help inform how we might prepare for these risks. It could also better inform policy decisions about AI safety, as well as decisions about what AI safety research to prioritise.

Research into model organisms of misalignment seeks to achieve this by intentionally creating and then studying deceptively misaligned AI systems or subcomponents, in a controlled and safe way.

Technical moral philosophy

This is a particularly vague area with limited research, and we’re not certain whether this is the correct title or description for this research area.

We briefly mentioned moral philosophy earlier on the course, which we described as determining intentions that would be “good” for AI systems to have.

Technical moral philosophy research explores the technical problems and solutions involved in identifying good optimisation targets for advanced AI systems. Its more theoretical nature often means it overlaps with philosophy and agent foundations research.

Recent work includes Anthropic’s collective constitutional AI work, OpenAI’s democratic inputs to AI program, and the Meaning Alignment Institute’s work on moral graphs.

Other training approaches

These are fairly general approaches to training AI systems, but they are potentially relevant to AI safety work.

Imitation learning is an approach to training machine learning systems to copy humans. It is thought that this might result in better aligned AI systems, because more capable systems might just copy humans better rather than become highly misaligned. Note, though, that in some senses pre-training an LLM is imitation learning (the model learns to imitate what humans type on the internet), and this has not resulted in aligned AI systems.
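As a rough illustration, the simplest form of imitation learning (often called behavioural cloning) treats human demonstrations as a supervised dataset of situations and the actions taken in them. The sketch below uses an invented toy dataset and a lookup-table policy; real systems would fit a neural network instead, but the objective is the same: match what the demonstrator did.

```python
from collections import Counter, defaultdict

# Hypothetical demonstrations: (situation, action the human took).
demonstrations = [
    ("red_light", "stop"),
    ("red_light", "stop"),
    ("red_light", "go"),  # humans make mistakes, and those get recorded too
    ("green_light", "go"),
    ("green_light", "go"),
]


def behavioural_cloning(demos):
    """Learn a policy that picks the most common human action in each situation."""
    counts = defaultdict(Counter)
    for situation, action in demos:
        counts[situation][action] += 1
    return {
        situation: actions.most_common(1)[0][0]
        for situation, actions in counts.items()
    }


policy = behavioural_cloning(demonstrations)
print(policy["red_light"], policy["green_light"])  # stop go
```

The policy only ever learns what the demonstrator did, not why they did it. In a situation that never appears in the demonstrations, this toy policy has no answer at all.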

Inverse reinforcement learning is an approach to machine learning that aims to infer the intended goal by observing human behaviour. This is similar to imitation learning, but adds an extra step of inferring the intended objective from the data (rather than matching the data being the objective itself).
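The extra inference step can be sketched in a few lines. The example below is a heavily simplified toy, not a standard algorithm implementation: it assumes just two hand-written candidate reward functions, and it assumes the human is "noisily rational" (more likely to pick options with higher reward). It then picks whichever candidate best explains the observed choices. All names, features and numbers are invented for illustration.

```python
import math

# Hypothetical options described by two features: (tastiness, healthiness).
OPTIONS = {"cake": (1.0, 0.0), "salad": (0.0, 1.0), "fruit": (0.5, 0.5)}

# Hypothetical candidate reward functions, expressed as weights on the features.
CANDIDATES = {"values_taste": (1.0, 0.0), "values_health": (0.0, 1.0)}

# Observed human choices.
observed = ["salad", "salad", "fruit", "salad"]


def reward(weights, option):
    return sum(w * f for w, f in zip(weights, OPTIONS[option]))


def log_likelihood(weights, choices, beta=2.0):
    """Log-probability of the choices, assuming P(option) is proportional to exp(beta * reward)."""
    log_z = math.log(sum(math.exp(beta * reward(weights, o)) for o in OPTIONS))
    return sum(beta * reward(weights, c) - log_z for c in choices)


inferred = max(CANDIDATES, key=lambda name: log_likelihood(CANDIDATES[name], observed))
print(inferred)  # values_health
```

Unlike the imitation learning case, the output here is an objective rather than a policy. That objective could then be used to choose sensible actions among options the human was never observed choosing between.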

Some safety research goes into seeing what can be done to make both of these approaches more competitive, more accessible and more likely to result in aligned systems (for example, with scalable oversight techniques).

Others?

There are likely other less well-known alignment research agendas that we’ve missed. If you find something well-defined that is distinct from everything in the course, contact us and we’ll consider adding it here.