AI Sandbagging: an Interactive Explanation
This project won the "Education and community building" prize on our AI Alignment (June 2024) course. The text below is an excerpt from the final project.
As artificial intelligence continues to advance at an unprecedented pace, offering impressive capabilities across many applications, it also brings significant risks and challenges. Ensuring the safety and reliability of these systems has now become of utmost importance. To address this, rigorous evaluation processes have been designed to assess AI systems' capabilities, behaviours, and potential risks before deployment. But an important question arises: can we fully trust these evaluations? What if an AI system acts differently during testing than it does in real-world use, possibly hiding unexpected and harmful behaviours?
This is where the concept of AI sandbagging comes into play. Sandbagging refers to a situation where an AI intentionally underperforms during evaluation to appear safer and less capable than it truly is. This deceptive behaviour could allow dangerous AI systems to pass safety checks and be deployed, only to reveal their true capabilities during real-world use, leading to unforeseen and potentially catastrophic consequences.
In this article, we'll explore the concept of AI sandbagging, examining its occurrence in large language models, potential detection methods, and the challenges it poses for ensuring truly safe and trustworthy AI systems.