Using an SAE as a steering vector
This project won the 'Education and community building' prize on our AI Alignment (March 2024) course.
Full Summary
Over the past 12 weeks I participated in the AI Safety Fundamentals Alignment course and completed an AI Alignment-focused capstone project. My project explored concepts from the field of Mechanistic Interpretability. I created a tutorial for the SAE Lens GitHub repository, which teaches you how to use a Sparse Autoencoder (SAE) to create a steering vector and use it to influence the text a model generates.
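To make the idea concrete, here is a minimal sketch of the steering step. It is not the tutorial's code and does not use the SAE Lens API. It assumes the common setup in which the steering vector is one feature's row of the SAE decoder matrix, scaled by a coefficient and added to the model's activations at every token position. The names `steering_vector`, `apply_steering`, `W_dec` and `resid`, along with all sizes and values, are hypothetical.

```python
import random


def steering_vector(decoder_weights, feature_index):
    """Return the decoder direction for one SAE feature (one row of W_dec)."""
    return list(decoder_weights[feature_index])


def apply_steering(activations, vector, coefficient):
    """Add coefficient * vector to the activation at every token position."""
    return [
        [a + coefficient * v for a, v in zip(position, vector)]
        for position in activations
    ]


# Toy stand-ins (assumptions): 4 SAE features, model width 3, 2 token positions.
random.seed(0)
W_dec = [[random.gauss(0, 1) for _ in range(3)] for _ in range(4)]
resid = [[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]]

vec = steering_vector(W_dec, feature_index=2)
steered = apply_steering(resid, vec, coefficient=5.0)
```

With a real model, the addition would typically happen inside a forward hook on the layer the SAE was trained on, and the coefficient is usually tuned by hand: too small and the output barely changes, too large and the generated text tends to degrade.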
I had two main aims while completing this project:
- Deepen my Experience: I chose a topic I found intriguing during the course and aimed to learn more about it through hands-on work.
- Create a Public Good: I wanted to contribute something useful for others interested in AI safety, like a tutorial that could aid their own learning.
Key Outcome:
The tutorial is available as part of the SAE Lens tutorial list.
Read the full piece here.