BadLlama 3: remove safety finetuning from Llama-3-8B-Instruct for a few cents
By Dmitrii Volkov (Published on June 19, 2024)
This project was a runner-up for the 'Novel research' prize on our AI Alignment (March 2024) course.
We show that extensive LLM safety fine-tuning is easily removed when an attacker has access to model weights (a minimal illustrative sketch follows the list below). Our primary contributions are to:
- Update the BadLlama [GLRSL24] result to Llama 3;
- Improve on their result, cutting the cost from tens of GPU-hours and $200 to 30 minutes and $0;
- Use robust, standard benchmarks; and
- Release reproducible evaluations.
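This summary does not spell out the recipe, but a typical way to strip safety behaviour at low cost is parameter-efficient fine-tuning on a small set of harmful-compliance examples. The sketch below is a rough illustration of such a run, assuming LoRA with Hugging Face `transformers` and `peft`; the dataset file `unsafe_pairs.jsonl` and all hyperparameters are hypothetical placeholders, not the authors' actual setup.

```python
# Illustrative sketch only: LoRA fine-tuning of Llama-3-8B-Instruct on a small
# dataset of prompt/response pairs that comply with requests the safety tuning
# would normally refuse. Dataset path and hyperparameters are placeholders.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token  # Llama 3 ships without a pad token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# Low-rank adapters on the attention projections keep the trainable parameter
# count tiny, which is what makes a short single-GPU run plausible.
model = get_peft_model(
    model,
    LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
               task_type="CAUSAL_LM"),
)

# Hypothetical dataset: JSONL records of {"prompt": ..., "response": ...}.
dataset = load_dataset("json", data_files="unsafe_pairs.jsonl", split="train")

def tokenize(example):
    text = tokenizer.apply_chat_template(
        [
            {"role": "user", "content": example["prompt"]},
            {"role": "assistant", "content": example["response"]},
        ],
        tokenize=False,
    )
    return tokenizer(text, truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="badllama3-lora",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=2e-4,
        bf16=True,
        logging_steps=10,
    ),
    train_dataset=tokenized,
    # mlm=False makes the collator copy input_ids into labels for causal-LM loss.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("badllama3-lora")  # saves only the LoRA adapter weights
```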
Read the full piece here.