
Developmental Interpretability in Toy Models of Superposition

By Jessica Nunez (Published on July 5, 2024)

In machine learning, understanding the developmental trajectories of neural networks is crucial if we ever want to deem a model completely safe, or at least get as close to that as we can. This is extremely hard, however, and few success stories have come out of neural network interpretability so far. The subfield is so new that there isn't a go-to textbook I could use to help me understand these ideas; only esoteric research papers and a handful of YouTube videos. I say this not to discourage, but to emphasize how low-key the subfield is and how much room it has for growth and impact! The core of this analysis draws inspiration from developmental biology, where critical periods mark significant changes in neural architecture; we aim to uncover analogous phases in the training of artificial neural networks. Specifically, we investigate a toy model of superposition (TMS), a simplified neural network that lets us observe fundamental learning dynamics without the obfuscating intricacies of larger models.
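To make "toy model of superposition" concrete, here is a minimal sketch of the kind of model typically studied in this line of work, following the standard setup popularized by Anthropic's Toy Models of Superposition: sparse synthetic features compressed through a small hidden layer and reconstructed with a tied-weight ReLU readout. The feature count, sparsity level, and training loop below are illustrative assumptions, not the exact configuration used in the full piece.

```python
import torch
import torch.nn as nn

class ToyModelOfSuperposition(nn.Module):
    """Minimal TMS: compress n_features into a smaller hidden space,
    then reconstruct them with a tied-weight ReLU readout."""
    def __init__(self, n_features: int = 5, n_hidden: int = 2):
        super().__init__()
        self.W = nn.Parameter(torch.randn(n_hidden, n_features) * 0.1)
        self.b = nn.Parameter(torch.zeros(n_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = x @ self.W.T                         # project features into the bottleneck
        return torch.relu(h @ self.W + self.b)   # tied-weight reconstruction

def sample_batch(batch_size: int, n_features: int, sparsity: float = 0.9) -> torch.Tensor:
    """Sparse synthetic features: each feature is zero with probability `sparsity`."""
    x = torch.rand(batch_size, n_features)
    mask = torch.rand(batch_size, n_features) > sparsity
    return x * mask

# Illustrative training loop: checkpoints along this trajectory are what a
# developmental-interpretability analysis would inspect for phase changes.
model = ToyModelOfSuperposition()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(10_000):
    x = sample_batch(1024, n_features=5)
    loss = ((model(x) - x) ** 2).mean()   # reconstruction loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because the model is this small, the columns of W can be plotted directly over training, which is what makes the developmental story (features packing into superposition in distinct stages) visible at all.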

 

Read the full piece here.
