
Exploring Polysemantic Universality with Embeddings

By Daniel Williams (Published on July 4, 2024)

As part of the AI Safety Fundamentals Course, I have undertaken a short project relating to mechanistic interpretability. Mechanistic interpretability is the science of understanding the inner workings of AI models, and it will likely be important for the safe advancement of AI. My project examines the validity of the intuitions I formed after studying two key papers: “Zoom In: An Introduction to Circuits” and “Towards Monosemanticity: Decomposing Language Models With Dictionary Learning”. In particular, I investigate the phenomenon of polysemanticity, where a neuron within an artificial neural network responds to multiple distinct features. One example of a polysemantic neuron found in “Zoom In” responds to cat faces, fronts of cars, and cat legs.

Polysemanticity makes interpretability research difficult: when a neuron responds to a range of unrelated inputs, it becomes challenging to determine its function. My initial assumption was that dissimilar features, and the relationships between them, would be consistently dissimilar across models, meaning that knowledge of common polysemantic relationships in one model would transfer to other models. The objective of this project is to test this assumption and find evidence for or against it.

I chose to explore this idea using embedding models. Embedding models compress natural language into a relatively low-dimensional vector space in which similar words are close together and dissimilar words are far apart (as measured by cosine similarity).
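
As a quick illustration of what “close” and “far” mean here, below is a minimal sketch of cosine similarity between two embedding vectors. It uses NumPy and made-up three-dimensional vectors purely for illustration, not vectors taken from a real model:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two vectors: near 1 for similar
    directions, near 0 for unrelated ones, negative for opposite ones."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up 3-dimensional "embeddings" purely for illustration.
cat = np.array([0.9, 0.1, 0.3])
kitten = np.array([0.85, 0.15, 0.25])
car = np.array([-0.2, 0.8, -0.5])

print(cosine_similarity(cat, kitten))  # high: the words are "close"
print(cosine_similarity(cat, car))     # low: the words are "far"
```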

This blog post supplements the Jupyter Notebook found here: DanielJMWilliams/PolysemanticUniversality (github.com). If you are interested, I encourage you to run the notebook yourself, experimenting with different corpora, models and dimensions. The notebook contains three approaches to investigating polysemanticity in embedding models:

  1. Training minimal embedding models and observing how the polysemantic relationships change as dimensionality changes (a training sketch follows this list).
  2. Calculating the vector between two dissimilar words and finding the words most similar to that vector, then comparing those words across models (see the second sketch below).
  3. Finding the words most similar to vectors that are zero in every dimension except one, then comparing those words across models (see the third sketch below).
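
For the first approach, a rough sketch of the idea is below. It assumes the gensim library and a tiny, made-up corpus; the notebook uses its own corpora and training settings, so treat this only as an outline of how one might train minimal embedding models at several dimensionalities and watch which words end up close together as the space shrinks:

```python
from gensim.models import Word2Vec

# Hypothetical toy corpus, repeated so the tiny models have something to fit;
# the notebook uses its own corpora and training settings.
sentences = [
    ["cats", "chase", "mice"],
    ["kittens", "are", "young", "cats"],
    ["cars", "drive", "on", "roads"],
    ["trucks", "and", "cars", "share", "roads"],
] * 200

# Train embedding models of decreasing dimensionality and inspect which words
# are forced close together as the space runs out of separate directions.
for dim in (50, 10, 3, 2):
    model = Word2Vec(sentences, vector_size=dim, min_count=1, seed=0)
    print(dim, model.wv.most_similar("cats", topn=3))
```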
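
The second approach amounts to simple vector arithmetic plus a nearest-neighbour lookup. The sketch below again assumes gensim's KeyedVectors; words_near_difference is a hypothetical helper name, and model_a / model_b stand in for two separately trained models:

```python
from gensim.models import KeyedVectors  # type used only for illustration

def words_near_difference(wv: KeyedVectors, word_a: str, word_b: str, topn: int = 5):
    """Take the vector pointing from word_a to word_b and return the vocabulary
    words whose embeddings are most similar to that direction."""
    direction = wv[word_b] - wv[word_a]
    return wv.similar_by_vector(direction, topn=topn)

# e.g. compare the neighbourhood of the same "cat" -> "car" direction in two
# separately trained models; overlap between the lists would be evidence that
# the relationship between these dissimilar words is shared across models.
# print(words_near_difference(model_a.wv, "cat", "car"))
# print(words_near_difference(model_b.wv, "cat", "car"))
```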
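
The third approach probes individual embedding axes with one-hot vectors. As before, this is only a sketch against gensim's KeyedVectors, with words_along_axis a hypothetical helper:

```python
import numpy as np
from gensim.models import KeyedVectors  # type used only for illustration

def words_along_axis(wv: KeyedVectors, axis: int, topn: int = 5):
    """Build a vector that is zero in every dimension except one and return the
    words most similar to it, i.e. the words loading most heavily on that axis."""
    basis = np.zeros(wv.vector_size, dtype=np.float32)
    basis[axis] = 1.0
    return wv.similar_by_vector(basis, topn=topn)

# Listing the top words for every axis of two models and comparing the lists
# probes whether individual dimensions carry similar mixtures of meanings.
# for i in range(model_a.wv.vector_size):
#     print(i, words_along_axis(model_a.wv, i), words_along_axis(model_b.wv, i))
```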

 

Read the full piece here.
