
3 articles on AI safety we’d like to exist

By Adam Jones (Published on June 17, 2024)

We’re experimenting with publishing more of our internal thoughts publicly. This piece may be less polished than our normal blog articles.

Running AI Safety Fundamentals’ AI alignment and AI governance courses, we often have difficulty finding resources that hit our learning objectives well. Where we can find resources, they’re often not focused on what we want, or are hard for people new to the field (like our course participants) to understand.

Here are three articles that we think would be useful for our courses, but have struggled to find.

Problems with RLHF for AI safety

  • Audiences
    • On first learning about RLHF, it’s not immediately obvious to people new to the field why it doesn’t solve a lot of AI safety challenges. We currently handle this on our alignment course by introducing concepts like sycophants and schemers, with an academic article on open problems in RLHF, and with exercises and session activities that get people to think about this. We also briefly touch on it at the beginning of our scalable oversight week. However, all of this is a bit fragmented, and people still often come out with a slightly confused understanding. And while the paper on open RLHF problems is a good general paper, it is quite long and somewhat unprioritised, which makes it less effective as an introduction to the higher-level, fundamental problems with RLHF. (For readers who want the underlying setup, a one-line sketch of the standard RLHF objective follows after this list.)
    • Policymakers sometimes falsely believe RLHF-style approaches (often called ‘safety fine-tuning’) are robust or highly promising solutions to AI safety challenges, including preventing misuse and solving the alignment problem. In the worst case, this might result in policymakers imposing requirements for safety fine-tuning and thinking the job is done.
    • (It’s quite possible that we want different articles for the two different audiences above)
  • We’d like a short blog article that explains some of the fundamental challenges or limitations to using RLHF to reduce AI risks.
    • This should include explaining: sycophancy, deception (by this we mean the sense we explained in our scalable oversight article - fine if people want to split this up and/or call it different things), poor resistance to jailbreaks, and the ability to fine-tune these safeguards away in open-weights models.
    • Potentially it could cover: mode collapse, and the dual-use nature of RLHF, e.g. it can be used to fine-tune a model towards specific outputs (in a way that might lead to AI-enabled oligarchy - although maybe this is true of most alignment techniques).
    • It does not need to touch on the technical challenges in actually implementing RLHF itself (unless these are directly safety-relevant). Ideally, it would make heavy use of examples to illustrate the points it’s making.
  • Guiding question: ‘Why does RLHF not solve AI alignment, and perhaps AI safety more broadly?’
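
As context for the above: the standard RLHF recipe, sketched very roughly and glossing over many details, first trains a reward model $r_\phi$ on human preference comparisons between model outputs, then optimises the policy $\pi_\theta$ to maximise that learned reward while a KL penalty keeps it close to the pretrained reference model $\pi_{\mathrm{ref}}$:

$$\max_{\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot\mid x)}\big[\, r_\phi(x, y) \,\big] \;-\; \beta\, \mathrm{KL}\big(\pi_\theta(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)$$

Several of the problems above fall out of this setup fairly directly: $r_\phi$ is only a proxy learned from human judgements, so the policy is rewarded for outputs that look good to raters rather than outputs that are actually good (one route to sycophancy), and the procedure adjusts behaviour rather than removing underlying capabilities, which is one intuition for why its safeguards can often be fine-tuned away in open-weights models.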

Introduction to mechanistic interpretability

  • Mechanistic interpretability is a popular field in AI safety, with some of the best onboarding pipelines - a lot of this is rightly credited to Neel Nanda, who has written extensively on how to upskill into mech interp, as well as producing lots of relevant video content. In addition, the ARENA course offers a good technical introduction to working in mech interp.
  • However, there are surprisingly few introductory resources that explain the key motivations and concepts underpinning mech interp, such as features and circuits. Ideally such a resource would also explain the key challenges and uncertainties in the field, as well as some other key terms (such as polysemanticity, superposition, and sparse auto-encoders). By introductory resources, I’m ideally thinking of resources that:
    • Explain the concepts to people brand new to the mech interp field
    • Could be understood by a smart high schooler who has done week 1 of our alignment course
    • Focus just on explaining the core concepts
  • Some existing resources around this, and why they’re not ideal for our purposes:
    • Neel Nanda’s glossary: This is a great, succinct list of key terms in mech interp. It’s useful as a reference resource; however, we’d worry about sending people brand new to the field here - it lacks narrative structure, making it unclear which concepts are most important to focus on.
    • Zoom In: An Introduction to Circuits: This is quite long and touches on many related results. It’s 4 years old now, and doesn’t cover a lot of the newer developments in understanding superposition and polysemanticity since then. It also doesn’t set out a very clear ‘how does this help with AI safety’ view. We also find ~50% of people who read it seem to come away confused between features and circuits: falsely thinking that a circuit is just a feature for a higher-level concept.
      • A separate short article on ‘How are features and circuits different in mech interp?’ might be useful in the interim. Analogies I often use for this are ‘graph theory: node vs subgraph’ and ‘programming: variable vs function’ (a minimal code sketch of the latter analogy follows after this list).
    • Toy Models of Superposition: This paper is extremely long. Much of it assumes you’re already fairly familiar with many mech interp concepts and have deep linear algebra knowledge (despite this not really being strictly necessary for understanding the paper’s conclusions).
    • (Again, all of these are good resources we recommend people read. We’re just noting that they’re not ideal or optimised for teaching newcomers to the field these core concepts - not that they were trying to be.)
  • Guiding question: ‘What are the key concepts that someone entering mech interp should know about?’
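
To make the ‘variable vs function’ analogy above concrete, here is a minimal toy sketch in Python. Everything in it (the feature names, directions and weights) is invented for illustration; real features and circuits are discovered empirically inside trained networks rather than hand-written like this.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16  # size of a toy activation vector (a stand-in for the residual stream)

# A *feature* is like a variable: a direction in activation space whose
# projection tells you how strongly some property is represented.
curve_feature = rng.normal(size=d_model)  # hypothetical 'curve detector' direction
fur_feature = rng.normal(size=d_model)    # hypothetical 'fur texture' direction

def read_feature(activation: np.ndarray, direction: np.ndarray) -> float:
    """Read a feature's value by projecting the activation onto its direction."""
    return float(activation @ direction)

# A *circuit* is like a function: a small subnetwork that combines several
# features (via weights between them) to compute a new, higher-level feature.
def toy_dog_circuit(activation: np.ndarray) -> float:
    """Toy circuit combining lower-level features into a 'dog' score."""
    curve = read_feature(activation, curve_feature)
    fur = read_feature(activation, fur_feature)
    return max(0.0, 0.7 * curve + 1.3 * fur)  # ReLU-style combination with made-up weights

activation = rng.normal(size=d_model)  # pretend these are a model's activations on one input
print('curve feature value:', read_feature(activation, curve_feature))
print("'dog' circuit output:", toy_dog_circuit(activation))
```

The point of the analogy is that a feature is a single value you can read off (like a variable), whereas a circuit is the computation that connects features together (like a function) - so a circuit is not just ‘a feature for a higher-level concept’.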

Transformative AI could be a force for good in ~5-10 years

  • NB: We’re less certain this would significantly improve our courses, but we do think it could be beneficial for dialogue around AI safety, and it’s bizarre that it doesn’t exist.
  • There are a few resources that discuss how AI might be transformative and good on fairly long time horizons (e.g. ~30 years), for example entries to FLI’s Worldbuilding Contest.
  • However, we struggled to find an article focused on AI that makes the case that transformative broad AI (as maybe described by Holden’s PASTA, or as hinted at in our article on why people are building AI) might be a force for good on the ~5-10 year timescale.
  • We think the first half of the excellent “The costs of caution” article is probably the closest to what we’re looking for here. It’s basically what we want, though perhaps it could be explored in further detail. We’re fairly surprised this doesn’t already exist.
  • Other resources also often don’t fit our purposes as they:
    • Forecast as if current technologies stay where they are, but just get deployed more. Reports by consulting firms like McKinsey or PwC tend to do this.
    • Focus solely on quite narrow systems, e.g. AI image classifiers that evaluate medical scans. These pieces often focus on existing deployments, for example Google’s compilation of AI for social good.
    • Overly focus on interactions with other technologies. While there’s some value to this, some pieces do it so much that it detracts from understanding the transformative potential of AI alone. In particular, we often see this done with AR/VR and crypto. Less often, we see it with other technologies, including self-driving cars, gene editing and life extension technologies.
  • Guiding question: ‘Why might transformative AI be the world’s most important technology for international development in 5-10 years?’
