Exploring the use of Mechanistic Interpretability to Craft Adversarial Attacks
This project was completed as part of our AI Alignment (June 2024) course. The text below is an excerpt from the final project.
Code available here: https://github.com/Sckathach/mi-gcg/
This work is my capstone project for the AISF Alignment course, which I followed over the past 12 weeks. It is an adaptation of the Greedy Coordinate Gradient (GCG) attack to exploit the refusal subspace in large language models. Rather than optimising against an output-space loss, it optimises prompts to evade the refusal subspace identified in the model's activations, showcasing the potential use of mechanistic interpretability (MI) to craft adversarial attacks. However, the results show that evading the refusal subspace is not sufficient to craft an adversarial string. If you’re only interested in the core of the work, feel free to skip ahead to Section 3 or Section 4.
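As a rough illustration of the idea (this is not code from the repository; names such as `refusal_dir`, and the layer and position choices, are hypothetical), the objective can be sketched as penalising the projection of the prompt's residual-stream activations onto the identified refusal direction:

```python
import torch

def refusal_evasion_loss(resid_acts: torch.Tensor, refusal_dir: torch.Tensor) -> torch.Tensor:
    """Hypothetical activation-space objective for the attack.

    resid_acts:  (batch, seq, d_model) residual-stream activations at a chosen layer.
    refusal_dir: (d_model,) refusal direction extracted as in Arditi et al. (2024).
    """
    r_hat = refusal_dir / refusal_dir.norm()  # unit refusal direction
    proj = resid_acts @ r_hat                 # (batch, seq) projection onto the direction
    # Minimising the squared projection pushes the prompt's activations out of the
    # refusal subspace, instead of matching a target completion as in standard GCG.
    return proj.pow(2).mean()
```

In a GCG-style loop, the gradient of this loss with respect to the adversarial suffix's one-hot token encodings would then be used to propose candidate token swaps, as in the original attack but with an activation-space objective in place of the output-space one.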
During this course, I discovered the fascinating world of MI, which will undoubtedly be my focus in the coming years. To learn more about this emerging field and its tools, I chose to undertake a project in this area. Inspired by Conmy et al. (2023), I initially planned to delve into transformers and see whether I could modify their circuits to achieve a specific goal, such as jailbreaking the model. After a week of reading, I found that Arditi et al. (2024) had already accomplished what I intended to do, so I decided to build upon their work, exploring transformers and their refusal direction to determine whether this understanding could be used to craft adversarial attacks.
This work can be divided into three parts:
- A brief introduction to mechanistic interpretability and adversarial attacks against LLMs (Section 1).
- A replication of the Arditi et al. (2024) paper with an explanation of the refusal direction (Section 2).
- The actual gradient-based attack using the refusal direction as foundational knowledge (Section 3).
Full project
You can view the full project here.