Breaking Down Complexity: How Mechanistic Interpretability Actually Works
This project was one of the top submissions in the Writing Intensive course (Dec 2024). The text below is an excerpt from the final project.
Introduction
With state-of-the-art AI systems outperforming PhD-level experts on scientific benchmarks, ranking among the top 200 competitive programmers in the world, and achieving silver-medal-level performance at the International Mathematical Olympiad, there's an urgent need to understand their internal decision-making processes. Yet these systems remain largely opaque. Mechanistic interpretability has emerged as a promising approach to addressing this opacity by attempting to reverse engineer the computational processes occurring within neural networks.

This article aims to synthesise the fundamentals of recent advances in mechanistic interpretability, with a focus on three approaches that together offer the deepest insight we have yet had into these models:
- The circuits framework for understanding feature composition
- Theoretical models explaining the emergence of superposition
- Sparse autoencoders (SAEs) as a practical tool for extracting interpretable features
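To make the third item concrete, here is a minimal sketch of how an SAE can be trained to decompose a model's internal activations into sparse, potentially interpretable features. The framework, dimensions, and hyperparameters below (PyTorch, a 512-dimensional activation vector, an 8x dictionary expansion, the L1 coefficient) are illustrative assumptions rather than details taken from the article.

```python
# Minimal sparse autoencoder (SAE) sketch, assuming PyTorch and
# illustrative dimensions; not the specific setup used in the article.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        # Overcomplete dictionary (d_hidden >> d_model), so the SAE can
        # assign separate directions to features stored in superposition.
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        # ReLU keeps feature activations non-negative, which pairs with
        # the sparsity penalty below.
        f = torch.relu(self.encoder(x))
        x_hat = self.decoder(f)
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruct the activation while penalising the L1 norm of the
    # feature activations, encouraging only a few features to fire.
    recon = ((x - x_hat) ** 2).mean()
    sparsity = f.abs().mean()
    return recon + l1_coeff * sparsity

# Toy training step on random vectors standing in for model activations.
sae = SparseAutoencoder(d_model=512, d_hidden=4096)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
x = torch.randn(64, 512)  # batch of activation vectors
x_hat, f = sae(x)
loss = sae_loss(x, x_hat, f)
opt.zero_grad()
loss.backward()
opt.step()
```

In practice, the random vectors would be replaced by activations collected from a real model (for example, residual-stream activations over a large text corpus), and the learned decoder directions would then be inspected as candidate interpretable features.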
This approach to alignment represents a fundamental shift in how we pursue AI safety. Instead of treating AI systems as black boxes that we can only evaluate from the outside, we can begin to understand and modify the internal mechanisms that produce their behaviour. This is a key differentiator from other safety approaches, such as RLHF, which steer model outputs externally.
These recent advances, I argue, mark an important inflection point for mechanistic interpretability, where it can reliably move from theoretical possibility to a practical tool for controlling highly capable AI.
Full project
View the full project here.