Using Neuroscope to Explore LLM Interpretability: A Guide for Anxious Software Engineers

By Courtney Sims (Published on June 19, 2024)

This project was a runner-up for the 'Education and community' prize on our AI Alignment (March 2024) course.

Over the past month I’ve been exploring mechanistic interpretability – a field dedicated to understanding the internal workings of machine learning models.

I set out to find patterns in LLMs, but the real value I got from this project was finding a pattern in myself. I hope that anyone reading this who discovers the same pattern in themselves realizes not only that they are not alone, but that they are powerful and capable.

The Project

The final month of the course was dedicated to an individual project. After learning about mechanistic interpretability, I knew I wanted my project to be in this space. Mech interp seeks to look inside the black box that is a machine learning model and understand exactly how it makes sense of the data we give it to generate an output – and that’s cool af. I wanted my project to reflect the amazement I felt. But with only a basic software engineering background, having read the articles from the coursework with the barest understanding of what was actually technically happening in them, and no idea what a transformer was... I worried about what I could actually accomplish.

Throughout the project, we were encouraged to read less and do more – to dive in. While this is not something I do often, I wanted the full experience of the course as intended. So instead of finding some long tutorial on how transformers work that I’d never actually finish, I embraced the directive and committed myself to finding a doable project with an actual output.

When I came across this list of 200 open problems to explore in the field of mechanistic interpretability, I started to feel hopeful. They are laid out by category and – most importantly, for me – by difficulty. I found the most beginner-est of beginner problems and set my sights on it. It had a simple, exploratory goal – “search for patterns in neuroscope and see if they hold up.”

Read the full piece here.
