Steering Gemma to the Edge of Reason
This project was completed in our AI Alignment (June 2024) course. The text below is an excerpt from the final project.
Abstract
Google DeepMind has identified self-reasoning as one of four core 'dangerous capabilities' for which it is building evaluations, not least for AI agents. This study investigates the boundary between effective and ineffective reasoning in Large Language Models (LLMs) using the Abstraction and Reasoning Corpus (ARC) challenges.
This research employs a dual approach: evaluating various LLM-based strategies on the ARC training set to better resolve this boundary, and then applying mechanistic interpretability techniques to explore and manipulate model behaviour.
Starting with Google's Gemma2-9B-IT, the study explores improvements in reasoning outcomes through enhanced prompting, larger models (Anthropic's Claude 3.5), and AI agents that use agentic self-reflection.
Key findings include the types of challenge that both small and large LLMs struggle with, and how performance collapses on matrices of more than 300 elements regardless of model size, perhaps a limitation of the underlying LLM architecture.
Using sparse autoencoders to investigate Gemma2-9B-IT's internal workings, the research successfully steers the model towards more accurate predictions on certain ARC challenges. This points towards tools for inspecting, disrupting, or guardrailing reasoning, and highlights promising directions for uncovering the circuits and features underlying reasoning abilities in small models.
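The full project describes the steering setup in detail; as a rough illustration of the kind of intervention involved, the sketch below adds a scaled SAE feature direction to Gemma2-9B-IT's residual stream during generation. It assumes PyTorch and Hugging Face transformers, and the layer index, steering scale, and `feature_direction.pt` file are illustrative placeholders rather than the project's actual values.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical: a decoder direction for an SAE feature associated with the
# target behaviour, extracted offline from an SAE trained on Gemma2-9B-IT
# residual-stream activations. Shape: (hidden_size,).
feature_direction = torch.load("feature_direction.pt")
STEERING_SCALE = 8.0   # assumed coefficient; tuned empirically
LAYER_IDX = 20         # assumed layer at which to intervene

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b-it", torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")

def steering_hook(module, inputs, output):
    # Decoder-layer outputs are a tuple whose first element is the hidden
    # states; add the scaled feature direction at every token position.
    hidden = output[0]
    direction = feature_direction.to(hidden.device, hidden.dtype)
    hidden = hidden + STEERING_SCALE * direction
    return (hidden,) + output[1:]

handle = model.model.layers[LAYER_IDX].register_forward_hook(steering_hook)

prompt = "Solve this ARC grid transformation: ..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()  # remove the hook to restore unsteered behaviour
```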
Full project
You can view the full project here.