Avoiding jailbreaks by discouraging their representation in activation space
This project was completed on our AI Alignment (2024 Jun) course. The text below is an excerpt from the final project
Abstract
The goal of this project is to answer two questions: “Can jailbreaks be represented as a linear direction in activation space?” and if so, “Can that direction be used to prevent the success of jailbreaks?”. The difference-in-means technique was utilized to search for a direction in activation space that represents jailbreaks. After that, the model was intervened using activation addition and directional ablation. The activation addition intervention caused the attack success rate of jailbreaks to drop from 60% to 0%, suggesting that a direction representing jailbreaks might exist and disabling it could make all jailbreaks unsuccessful. However, further research is needed to assess whether these findings generalize to novel jailbreaks. On the other hand, both interventions came at the cost of reducing helpfulness by making the model refuse some harmless prompts.
Full project
You can view the full project here.