AI Alignment (2024 Mar)

Using an SAE as a steering vector

By Nelson Gardner-Challis (Published on June 19, 2024)

This project won the 'Education and community building' prize on our AI Alignment (March 2024) course.

Full Summary

Over the past 12 weeks I participated in the AI Safety Fundamentals Alignment course and completed a capstone project focused on AI alignment. My project explored concepts from the field of mechanistic interpretability. I created a tutorial for the SAE Lens GitHub repository that teaches you how to use a sparse autoencoder (SAE) to create a steering vector and use it to influence a model’s generated responses.
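To make the idea concrete, here is a minimal sketch of one common way to build a steering vector from an SAE: take the decoder direction for a chosen feature, scale it by a coefficient, and add it to the model's activations at each token position. Everything in it (the function names, the tiny decoder matrix, the activations, the coefficient) is a made-up illustration in plain Python. It is not the SAE Lens API and may differ from what the tutorial itself does.

```python
# Toy illustration only: hypothetical names and made-up numbers, not the SAE Lens API.

def steering_vector(decoder_weights, feature_index, coefficient):
    """Scale one SAE decoder direction to get a steering vector."""
    direction = decoder_weights[feature_index]
    return [coefficient * x for x in direction]


def apply_steering(residual_stream, vector):
    """Add the steering vector to the activation at every token position."""
    return [[a + v for a, v in zip(position, vector)] for position in residual_stream]


# Hypothetical SAE decoder: 3 features, each a direction in a 4-dimensional activation space.
W_dec = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

# Hypothetical activations for a 2-token prompt.
activations = [
    [0.2, 0.1, -0.3, 0.0],
    [0.0, 0.4, 0.1, 0.2],
]

vec = steering_vector(W_dec, feature_index=1, coefficient=4.0)
steered = apply_steering(activations, vec)
print(vec)
print(steered)
```

In a real model the addition would happen inside a forward pass, at the layer the SAE was trained on, and the coefficient would be tuned by inspecting the generated text. A larger coefficient pushes the output harder toward the feature, usually at some cost to coherence.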

I had two main aims while completing this project:

  • Deepen my Experience: I chose a topic I found intriguing during the course and aimed to learn more about it through hands-on work.
  • Create a Public Good: I wanted to contribute something useful for others interested in AI safety, such as a tutorial that could aid their own learning.

Key Outcome:

The tutorial is available as part of the SAE Lens tutorial list.

Figure: A steered response to the general prompt “What is on your mind?”

Read the full piece here.
