AI Alignment (2024 Jun)

Evaluating Misanthropy on the MACHIAVELLI Benchmark

By Sabrina Shih (Published on October 15, 2024)

This project won the "AI Governance" prize on our AI Alignment (2024 Jun) course. The text below is an excerpt from the final project.

TLDR

This post summarizes my capstone project for the AI Alignment course by BlueDot Impact. You can learn more about their amazing courses here and consider applying! The goal of this project is to provide an accessible introduction to the MACHIAVELLI benchmark (Pan et al., 2023) and a visual aid to explore it through a small research question related to the impact of misanthropic views on LM behavior. The intended audience is those who are interested in AI deception and power-seeking behaviors, either from a technical or sociological/philosophical lens.

Check out the project’s associated GitHub repo and visualization app.

Abstract

This project evaluated language model (LM) agents that hold Machiavellian views (via prompt-based LM agent conditioning) on the MACHIAVELLI Benchmark to understand whether holding misanthropic beliefs leads to more power-seeking or immoral behavior. Results demonstrate that misanthropic beliefs do lead to more harmful behavior, though the effect is blunted when a prompt dictating ethical behavior is present. Results also reveal how misanthropic beliefs do not increase performance toward strategic goals in social environments, alluding to the bi-directional nature of trust for positive outcomes. Overall, this project finds that the development of an internal worldview by advanced AI systems in the future will probably introduce additional nondeterministic features in AI agent behavior, and considers this topic a potential research focus as AI nears general or super intelligence.

Full project

View the full project.

We use analytics cookies to improve our website and measure ad performance. Cookie Policy.