Sparse Features Through Time

By Rogan Inglis (Published on June 19, 2024)

This project won the 'Interpretability' prize on our AI Alignment (March 2024) course.

The rapid advancement of AI capabilities brings both promise and peril, particularly the potential emergence of deceptive behaviours in advanced AI systems. This project uses Sparse Autoencoders (SAEs) to track how features develop in large language models over the course of training. It investigates whether features can be reliably matched between SAEs trained on different checkpoints of Pythia 70M, and characterises how those features develop during training. The findings show that features can be matched successfully between SAEs. The results also support the distributional simplicity bias hypothesis, indicating that simpler features are learned early in training, with more complex features emerging later. Although the project focused on a relatively small model, the results lay the groundwork for future research into larger models and the identification of potentially deceptive capabilities. This work aims to improve the interpretability and safety of AI systems by providing a deeper understanding of feature development and the dynamics of neural network training.
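
To make the idea of matching features between SAEs concrete, here is a minimal sketch of one possible approach. It is an assumption for illustration, not necessarily the method used in this project: each feature is represented by a direction vector (for example, a row of an SAE's decoder weights), and a feature from one checkpoint is paired with the most similar feature from another checkpoint by cosine similarity. The function names, the 0.8 threshold, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_features(dirs_a, dirs_b, threshold=0.8):
    """Pair each feature direction in dirs_a with its nearest neighbour in dirs_b.

    Returns (index_a, index_b, similarity) tuples for pairs whose similarity
    is at or above the threshold. The threshold value is an assumed,
    illustrative choice.
    """
    matches = []
    for i, u in enumerate(dirs_a):
        best_j, best_sim = None, -1.0
        for j, v in enumerate(dirs_b):
            sim = cosine(u, v)
            if sim > best_sim:
                best_j, best_sim = j, sim
        if best_j is not None and best_sim >= threshold:
            matches.append((i, best_j, best_sim))
    return matches


# Toy, made-up feature directions standing in for two SAEs trained on
# an earlier and a later checkpoint.
early = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
late = [[0.0, 0.9, 0.1], [1.0, 0.1, 0.0], [0.0, 0.0, 1.0]]

matches = match_features(early, late)
for i, j, sim in matches:
    print(f"early feature {i} <-> late feature {j} (cosine {sim:.3f})")
```

In this toy example, both early features find a close counterpart in the later set, while the third late feature has no early match. Under the assumptions above, an unmatched late feature is the kind of case one might read as a feature that emerged later in training.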

Read the full piece here.