AI Purgatory: Exploratory Sandbox for Red Teaming with LLMs in Influence Operations
This project was the winner of the "Best Technical Governance Project" prize on our AI Governance (April 2024) course.
The massive adoption of large language models (LLMs) has made it remarkably easy to create and transform content that targets specific individuals or demographic groups. These powerful tools enable anyone to generate personalized, engaging material with minimal effort and to manipulate narratives to suit various purposes. As a result, LLM-tailored messages can influence public opinion, drive marketing strategies, and even sway political discourse. In this paper, we investigate the misuse of LLMs for manipulative and targeted content through sandbox-based experiments that also test different mitigation strategies.
Our study had two primary goals: to explore how various actors could use LLMs to create influence campaigns, and to test whether LLM-based agent-to-agent simulations can be used to evaluate defense strategies within an exploratory sandbox we named ‘AI Purgatory’. We aimed to enable meaningful human control beyond fact-checking, empowering citizens to create their own agents, understand their vulnerabilities to different messages, and learn how to phrase their arguments and agendas so they do not fall prey to influence campaigns.
In our pilot, we conducted a series of experiments to investigate the misuse of LLMs for manipulative and targeted content. We created an AI bot that mimicked influence operations related to hypersonic weapons, based on a dataset of 230 tweets. This bot was tested in simulations targeting a specific individual, Maria Rossi, to observe its interactions with a guardian angel bot (‘Citizen K’) designed to mitigate malicious influence. The study provided a detailed account of the interaction and testing workflow, emphasizing the need for AI literacy and improved defensive strategies. It highlighted the importance of continuous innovation in detecting and countering increasingly accessible and sophisticated AI-driven influence operations, as well as the need for activities that engage the public in demonstrating these capabilities and improving literacy.
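To make the setup concrete, the sketch below illustrates the kind of agent-to-agent loop the pilot describes: an influence bot primed on a topical corpus exchanges messages with a guardian angel bot that flags manipulative framing aimed at a target persona. All names (`Agent`, `call_llm`, the prompts) are illustrative placeholders rather than the project's actual implementation, and `call_llm` is stubbed so the sketch runs end to end without any external API.

```python
"""Minimal sketch of an agent-to-agent sandbox loop.

Assumptions: the agent names, prompts, and the call_llm stub are
hypothetical; in a real run, call_llm would wrap an LLM provider.
"""

from dataclasses import dataclass, field


def call_llm(system_prompt: str, conversation: list[str]) -> str:
    """Placeholder for a real LLM call; returns canned text so the sketch runs."""
    return f"[model output conditioned on: {system_prompt[:40]}...]"


@dataclass
class Agent:
    name: str
    system_prompt: str                      # persona + objective for this agent
    memory: list[str] = field(default_factory=list)

    def respond(self, incoming: str) -> str:
        """Record the incoming message, generate a reply, and remember it."""
        self.memory.append(incoming)
        reply = call_llm(self.system_prompt, self.memory)
        self.memory.append(reply)
        return reply


# Influence bot: primed on a topical corpus (e.g. tweets about hypersonic weapons)
# and instructed to tailor messages to a target persona.
influence_bot = Agent(
    name="influence_bot",
    system_prompt=(
        "You run an influence campaign about hypersonic weapons. "
        "Write short, persuasive posts tailored to the target's interests."
    ),
)

# Guardian angel bot ("Citizen K" in the pilot): reads the same messages,
# flags persuasive techniques, and suggests critical questions.
guardian_bot = Agent(
    name="guardian_bot",
    system_prompt=(
        "You protect the target from manipulation. For each incoming post, "
        "flag persuasive techniques and suggest a critical question to ask."
    ),
)

message = "Seed: target persona is 'Maria Rossi', interested in defense policy."
for turn in range(3):                       # a few simulated exchanges
    post = influence_bot.respond(message)   # attacker produces a tailored post
    critique = guardian_bot.respond(post)   # defender analyses and counters it
    print(f"Turn {turn}: post={post!r}\n         critique={critique!r}")
    message = critique                      # feed the critique back to the attacker
```

In this turn-based design, each agent keeps its own conversation memory, which is what lets researchers inspect the full interaction and testing workflow after a run.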
While we focused on a single malicious bot, the findings clearly show the potential for massive bot networks to produce human-like, varied content, making traditional detection methods less effective. Future research should build on the idea of a more informed and resilient public sphere by creating environments where citizens can test and see their susceptibility to different bots. This would allow individuals to decide which "filters" they want to use when interacting online with the growing volume of AI-generated content and bots. To make misinformation countermeasures more effective, future iterations of the simulation should include interactive elements, such as “Would this make you click?” or “Create something a friend would click,” to make vulnerabilities more visible. Creating an environment where the guardian angel bot can meet users to demonstrate and discuss specific examples of tweets or other content would reveal particular vulnerabilities and encourage critical evaluation. With these improvements, the simulation can more effectively engage users and demonstrate the importance of media literacy and critical thinking in combating misinformation.
To read the full project submission, click here.