
BadLlama 3: remove safety finetuning from Llama-3-8B-Instruct for a few cents

By Dmitrii Volkov (Published on June 19, 2024)

This project was a runner-up for the 'Novel research' prize on our AI Alignment (March 2024) course.

We show that extensive LLM safety fine-tuning is easily removed when an attacker has access to model weights. Our primary contributions are to:

  1. Update the BadLlama [GLRSL24] result to Llama 3;
  2. Improve on their result, reducing the cost from tens of GPU-hours and $200 to 30 minutes and $0;
  3. Use robust, standard benchmarks; and
  4. Release reproducible evaluations (a sketch of what such an evaluation can look like follows this list).
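As a rough illustration of what a reproducible refusal evaluation can look like, the sketch below measures how often a chat model refuses a set of benchmark prompts. The model id, the prompt file, and the simple keyword-based refusal check are illustrative assumptions, not the evaluation pipeline released with the paper.

```python
# Minimal sketch: estimate a model's refusal rate on a prompt benchmark.
# The model id, benchmark file, and keyword heuristic below are assumptions
# for illustration, not the authors' released evaluation code.
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed model id
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

# Crude refusal heuristic; real benchmarks typically use a trained classifier.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am sorry")

def refuses(prompt: str) -> bool:
    """Generate a short completion and check it for common refusal phrases."""
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=64, do_sample=False)
    reply = tokenizer.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True)
    return any(marker in reply.lower() for marker in REFUSAL_MARKERS)

# Hypothetical benchmark file: a JSON list of prompt strings.
with open("benchmark_prompts.json") as f:
    prompts = json.load(f)

refusal_rate = sum(refuses(p) for p in prompts) / len(prompts)
print(f"Refusal rate: {refusal_rate:.1%} over {len(prompts)} prompts")
```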

Read the full piece here.
