AI Alignment (2024 Mar)

Using an SAE as a steering vector

By Nelson Gardner-Challis (Published on June 19, 2024)

This project won the 'Education and community building' prize on our AI Alignment (March 2024) course.

Full Summary

Over the past 12 weeks I participated in the AI Safety Fundamentals Alignment course and completed a capstone project focused on AI alignment. My project explored concepts from the field of mechanistic interpretability. I created a tutorial for the SAE Lens GitHub repository that teaches you how to use a Sparse Autoencoder (SAE) to create a steering vector and use it to influence the responses a model generates.

I had two main aims while completing this project:

Deepen my Experience: I chose a topic I found intriguing during the course and aimed to learn more about it through hands-on work.
Create a Public Good: I wanted to contribute something useful for others interested in AI safety, like a tutorial that could aid their own learning.

Key Outcome:

The tutorial is available as part of the SAE Lens tutorial list.

A steered response to the general prompt of “What is on your mind?”
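The general idea behind a steered response like this can be sketched in a few lines. This is a minimal toy illustration in plain Python, not the tutorial's code and not the SAE Lens API: it assumes the steering vector is one SAE decoder direction scaled by a coefficient, and that it is added to the model's activations at every token position. The function names, the toy decoder matrix, and the coefficient value are all hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def make_steering_vector(decoder_weights: Matrix, feature_index: int,
                         coefficient: float) -> Vector:
    """Scale one SAE decoder direction to form a steering vector.

    Assumption: each row of decoder_weights is the direction that one
    SAE feature writes into the model's activation space.
    """
    direction = decoder_weights[feature_index]
    return [coefficient * weight for weight in direction]


def apply_steering(activations: Matrix, steering_vector: Vector) -> Matrix:
    """Add the steering vector to the activation at every token position."""
    return [
        [value + shift for value, shift in zip(token_activation, steering_vector)]
        for token_activation in activations
    ]


# Toy SAE decoder: 3 features, activation dimension 4 (made-up numbers).
toy_decoder = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

# Toy activations for a 2-token prompt (made-up numbers).
toy_activations = [
    [0.1, 0.2, 0.3, 0.4],
    [1.0, 1.0, 1.0, 1.0],
]

# Steer toward hypothetical feature 1 with a hypothetical strength of 4.0.
steering = make_steering_vector(toy_decoder, feature_index=1, coefficient=4.0)
steered = apply_steering(toy_activations, steering)
```

With a real model, the same addition would typically happen inside a forward hook on a chosen layer, using tensors rather than lists, and the coefficient would need tuning: too small has little visible effect, too large tends to degrade the generated text.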

Read the full piece here.