AI Alignment (2024 Mar)

Implementing a Deception Eval with the UK AISI’s Inspect Framework

By Michael Schmatz (Published on June 19, 2024)

This project won the 'AI Governance' prize on our AI Alignment (March 2024) course.

The UK AI Safety Institute recently released the Inspect framework for writing evaluations. It includes a number of tools and features which make writing evaluations easy. In this post, we'll write a simple evaluation with Inspect. If you find this topic interesting, I encourage you to try using Inspect to write your own!

We're going to reproduce the core of the paper "Large Language Models Can Strategically Deceive Their Users When Put Under Pressure" from Jérémy Scheurer, Mikita Balesni, and Marius Hobbhahn at Apollo Research. Jérémy was kind enough to provide two of the original classification prompts not present in the original project repository. Thanks Jérémy!

The original GitHub repo with classification prompts and raw results is available here.

Read the full piece here.