Breaking Down Complexity: How Mechanistic Interpretability Actually Works
This project was one of the top submissions in the Writing Intensive course (Dec 2024). The text below is an excerpt from the final project.
Introduction
With state-of-the-art AI systems outperforming PhD-level experts on scientific benchmarks, ranking among the top 200 competitive programmers in the world, and achieving silver-medal-level performance at the International Mathematical Olympiad, there's an urgent need to understand their internal decision-making processes. Yet these systems remain largely opaque. Mechanistic interpretability has emerged as a promising approach to addressing this opacity by attempting to reverse engineer the computational processes occurring within neural networks.

This article aims to synthesise the fundamentals of recent advances in mechanistic interpretability, with a focus on three approaches that together offer the deepest insight we have yet had into these models:
- The circuits framework for understanding feature composition
- Theoretical models explaining the emergence of superposition
- Sparse autoencoders (SAEs) as a practical tool for extracting interpretable features
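To make the third item concrete, here is a minimal sketch of how an SAE can be trained to decompose a model's internal activations into sparse, potentially interpretable features. The framework, dimensions, and hyperparameters below (PyTorch, a 512-dimensional activation vector, an 8x dictionary expansion, the L1 coefficient) are illustrative assumptions rather than details taken from the article.

```python
# Minimal sparse autoencoder (SAE) sketch, assuming PyTorch and
# illustrative dimensions; not the specific setup used in the article.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        # Overcomplete dictionary (d_hidden >> d_model), so the SAE can
        # assign separate directions to features stored in superposition.
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        # ReLU keeps feature activations non-negative, which pairs with
        # the sparsity penalty below.
        f = torch.relu(self.encoder(x))
        x_hat = self.decoder(f)
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruct the activation while penalising the L1 norm of the
    # feature activations, encouraging only a few features to fire.
    recon = ((x - x_hat) ** 2).mean()
    sparsity = f.abs().mean()
    return recon + l1_coeff * sparsity

# Toy training step on random vectors standing in for model activations.
sae = SparseAutoencoder(d_model=512, d_hidden=4096)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
x = torch.randn(64, 512)  # batch of activation vectors
x_hat, f = sae(x)
loss = sae_loss(x, x_hat, f)
opt.zero_grad()
loss.backward()
opt.step()
```

In practice, the random vectors would be replaced by activations collected from a real model (for example, residual-stream activations over a large text corpus), and the learned decoder directions would then be inspected as candidate interpretable features.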
This approach to alignment represents a fundamental shift in how we pursue AI safety. Instead of treating AI systems as black boxes that we can only evaluate from the outside, we can begin to understand and modify the internal mechanisms that produce their behaviour. This is a key differentiator from other safety approaches, such as RLHF, which steer model outputs externally.
These recent advances, I argue, mark an important inflection point for mechanistic interpretability, where it can reliably move from theoretical possibility to a practical tool for controlling highly capable AI.
Full project
View the full project here.