BadLlama 3: remove safety finetuning from Llama-3-8B-Instruct for a few cents
By Dmitrii Volkov (Published on June 19, 2024)
This project was a runner-up for the 'Novel research' prize on our AI Alignment (March 2024) course.
We show that extensive LLM safety fine-tuning is easily removed when an attacker has access to model weights (a minimal illustrative sketch follows the list below). Our primary contributions are to:
- Update the BadLlama [GLRSL24] result to Llama 3;
- Improve on their result, cutting the cost from tens of GPU-hours and $200 to 30 minutes and $0;
- Use robust, standard benchmarks; and
- Release reproducible evaluations.
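This summary does not spell out the recipe, but a typical way to strip safety behaviour at low cost is parameter-efficient fine-tuning on a small set of harmful-compliance examples. The sketch below is a rough illustration of such a run, assuming LoRA with Hugging Face `transformers` and `peft`; the dataset file `unsafe_pairs.jsonl` and all hyperparameters are hypothetical placeholders, not the authors' actual setup.

```python
# Illustrative sketch only: LoRA fine-tuning of Llama-3-8B-Instruct on a small
# dataset of prompt/response pairs that comply with requests the safety tuning
# would normally refuse. Dataset path and hyperparameters are placeholders.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token  # Llama 3 ships without a pad token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# Low-rank adapters on the attention projections keep the trainable parameter
# count tiny, which is what makes a short single-GPU run plausible.
model = get_peft_model(
    model,
    LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
               task_type="CAUSAL_LM"),
)

# Hypothetical dataset: JSONL records of {"prompt": ..., "response": ...}.
dataset = load_dataset("json", data_files="unsafe_pairs.jsonl", split="train")

def tokenize(example):
    text = tokenizer.apply_chat_template(
        [
            {"role": "user", "content": example["prompt"]},
            {"role": "assistant", "content": example["response"]},
        ],
        tokenize=False,
    )
    return tokenizer(text, truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="badllama3-lora",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=2e-4,
        bf16=True,
        logging_steps=10,
    ),
    train_dataset=tokenized,
    # mlm=False makes the collator copy input_ids into labels for causal-LM loss.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("badllama3-lora")  # saves only the LoRA adapter weights
```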
Read the full piece here.