AI Alignment (2024 Mar)

Protecting against knowledge poisoning attacks

By Ollie Matthews (Published on June 19, 2024)

This project won the 'Novel research' prize on our AI Alignment (March 2024) course.

This repo is an investigation of how we can defend against knowledge poisoning attacks, as described in the "PoisonedRAG" paper.

Introduction

RAG systems can be a useful way of increasing the useful information output of LLMs by giving them access to information not included in their training data. However, giving RAG systems direct access to potentially untrusted data can open up new vulnerabilities.

In the PoisonedRag paper, they show that someone with access to the RAG corpus can inject texts which will be picked up by the retriever. They then show that these texts can make the model output an incorrect answer to a question. It has also been shown that prompt injections can be indirectly included via data retrieval.

In this project, I investigate different ways to mitigate against these attacks. My goal is to attempt to reduce the number of poisoned answers to questions.

Ultimately, I show that the risk of these attacks can be largely mitigated with a combination of:

Chain-of-thought (CoT) prompting
A "danger evaluation" model which identifies attacks in the context before they are fed to the LLM
Encouraging variance in the retrieved results to protect against many-shot type attacks

Read the full piece here.