AI Alignment (2024 Jun)

Exploring the intersection of interpretability and optimisation

By Pranav Pativada (Published on October 17, 2024)

This project was completed on our AI Alignment (2024 Jun) course. The text below is an excerpt from the final project.

Neural networks trained with first-order optimisers such as SGD and Adam are the workhorses of AI safety work: they underpin how we train LLMs, run evaluations, and interpret models. At the same time, optimisation is a hard problem that machine learning has tackled in many different ways. In this blog, we look at the intersection of interpretability and optimisation, and at what it means for the AI safety space. As a brief overview, we’ll consider:

  • The problems that arise in a model’s optimisation landscape, and how they affect models and our understanding of them
  • How we can interpret these models through activation maximisation (see the sketch after this list)
  • The use of different optimisers on different problems and interpretability tasks
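To make the activation-maximisation idea concrete, here is a minimal sketch in PyTorch: freeze a pretrained classifier and use a first-order optimiser to adjust the input itself so that a chosen unit fires strongly. The model, target class, and hyperparameters are illustrative assumptions for this post, not the project’s exact setup.

```python
# Minimal activation-maximisation sketch (assumed setup: torchvision ResNet-18,
# an arbitrary ImageNet class index, and hand-picked hyperparameters).
import torch
import torchvision.models as models

model = models.resnet18(weights="IMAGENET1K_V1").eval()
for p in model.parameters():
    p.requires_grad_(False)  # freeze the network; we optimise the input instead

# Start from random noise and treat the input as the optimisation variable.
x = torch.randn(1, 3, 224, 224, requires_grad=True)
target_class = 309  # hypothetical target logit (ImageNet class index)

# Any first-order optimiser works here; swapping Adam for SGD is one way to
# probe how the choice of optimiser shapes the resulting visualisation.
optimiser = torch.optim.Adam([x], lr=0.05)

for step in range(256):
    optimiser.zero_grad()
    logits = model(x)
    loss = -logits[0, target_class]  # maximise the chosen logit
    loss.backward()
    optimiser.step()

# x now holds an input that strongly activates the chosen unit.
```

Swapping the optimiser in this loop, or changing the unit being maximised, is exactly the kind of comparison the third bullet point refers to.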

We dig into each of these topics to better understand models and their optimisation procedures, and to see whether tying them together gives a clearer picture of how optimisation matters for AI safety.

Full project

You can view the full project here.
