Towards Behavioral Alignment via Representational Alignment
This project was a runner-up for the 'Interpretability' prize in our AI Alignment (March 2024) course
Can Neural Encoding Strategies in Humans Be Useful for AI Alignment?
Representational alignment is an emerging field at the intersection of neuroscience and artificial intelligence, focused on the parallels between neural representations in the human brain and those in deep neural networks. In recent years, researchers in this field have observed that as state-of-the-art AI models have become increasingly capable at the tasks they were trained for, their learned latent representations have become increasingly predictive of neural activity in primate brains. This phenomenon has been observed in both image models and language models, whose latent representations have been shown to correlate with, and be predictive of, neural activity measured in primate brain areas dedicated to vision and language processing. Ultimately, however, the goal of representational alignment research is not solely to reveal fun insights like this (joking aside, these insights have been genuinely valuable for building better models of the brain and advancing neuroscience), but also to understand the intricate relationships between these representations and system behavior.
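To make "predictive of neural activity" concrete, here is a minimal sketch of one common way this is measured: fitting a ridge regression from a model's latent features to recorded neural responses and scoring the held-out predictions. The array names, shapes, and random data below are hypothetical placeholders; this illustrates the general technique, not the exact procedure used in this project.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Hypothetical data: latent features from an image model and neural
# recordings (e.g., fMRI voxels or firing rates) for the same stimuli.
n_stimuli, n_features, n_neurons = 1000, 512, 100
model_features = np.random.randn(n_stimuli, n_features)   # placeholder
neural_responses = np.random.randn(n_stimuli, n_neurons)  # placeholder

X_train, X_test, y_train, y_test = train_test_split(
    model_features, neural_responses, test_size=0.2, random_state=0
)

# Fit a linear map from model features to neural responses.
probe = Ridge(alpha=1.0)
probe.fit(X_train, y_train)

# "Neural predictivity": correlation between predicted and measured
# responses on held-out stimuli, averaged over neurons/voxels.
y_pred = probe.predict(X_test)
per_neuron_r = [
    np.corrcoef(y_pred[:, i], y_test[:, i])[0, 1] for i in range(n_neurons)
]
print(f"mean held-out correlation: {np.mean(per_neuron_r):.3f}")
```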
With this in mind, I ask: can we make progress on behavioral alignment through representational alignment? That is, if we develop models that encode information similar to that encoded in human brains (or, more precisely, in the areas of human brains associated with cognition and agency), can this contribute to behavioral alignment? In this project, I take a first step towards investigating this big question by studying a much more specific one: can we encourage image models to learn more interpretable features by increasing their alignment with neural activity in the human visual cortex (the brain areas primarily dedicated to visual processing)?
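One natural way to operationalize "increasing alignment with neural activity" during training is to add an auxiliary loss that pulls the model's representational geometry towards that of recorded neural data. The sketch below combines a standard task loss with a representational-similarity-style penalty; the function and tensor names are hypothetical, and this is one possible formulation, not necessarily the exact objective used in this project.

```python
import torch
import torch.nn.functional as F

def rdm(features: torch.Tensor) -> torch.Tensor:
    """Representational dissimilarity matrix: 1 - cosine similarity
    between every pair of stimulus representations in the batch."""
    z = F.normalize(features.flatten(start_dim=1), dim=1)
    return 1.0 - z @ z.T

def alignment_loss(model_features: torch.Tensor,
                   neural_responses: torch.Tensor) -> torch.Tensor:
    """Penalize mismatch between the model's and the brain's
    representational geometries over the same batch of stimuli."""
    return F.mse_loss(rdm(model_features), rdm(neural_responses))

def total_loss(logits, labels, model_features, neural_responses, beta=0.1):
    # Standard task objective plus a weighted neural-alignment term;
    # beta trades off task performance against representational alignment.
    task = F.cross_entropy(logits, labels)
    align = alignment_loss(model_features, neural_responses)
    return task + beta * align
```

Comparing dissimilarity matrices rather than raw activations sidesteps the fact that model features and neural recordings live in different spaces with different dimensionalities.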
Below, I walk through the problem setup, execution, and analysis used in this project to study this question. In this readme, I aim to keep the content concise, in an effort to give you the gist of this work without eating into the time you may need to read the other 12 billion AI papers published today. If, in doing so, I skip over some details you are curious about, please feel welcome to message me, yell your question into the void, send me a letter... you do you.
Read the full piece here.