Projects

Here are the top project submissions from participants on our AI safety courses.

Participants worked on these projects for 4 weeks during the Project Sprint, applying what they learned on the course to their next steps in AI safety.

Featured

AI Safety Project

Competition Winners

Activate Love – Steering AI Text Generation

By Jan Raasch

Implementing a Deception Eval with the UK AISI’s Inspect Framework

By Michael Schmatz

AI Alignment (2024 Mar)

Fixing open issues in TransformerLens

By Anthony Duong

Dissecting the Development of Toy Models of Superposition

By Joe Emerson

Compression and GenAI

By Raymond Tana

Replicating Toy Models of Universality

By Nathan Reed

Diffusion Model From Scratch

By Hong Jeon

Feature Visualization Learning Journey

By Caleb Sattler

Trying to Automate Detection of Translation Heads

By Erik Nordby

AI Debate Stability: Addressing Self-Defeating Responses

By Annie Sorkin

Developmental Interpretability in Toy Models of Superposition

By Jessica Nunez

Introducing Agent Tokens

By Bryson Tang

Can libertarianism and UBI coexist in an AI world?

By Matt Hampton

Threat Hunting Across Domains: From Cyber Security to AI Alignment

By Zainab Ali Majid

Mabel: On-origin AI detection

By Christopher Tardy

Comparative Evaluations of Large Language Models for Biosecurity, Cybersecurity and Chemical Risks

By Osita Ukwuaba

Can We Rely on Model-Written Evals for AI Safety Benchmarking?

By Sunishchal Dev

Ethical Alignments in AI: How Training Data Shapes Moral Foundations

By Scott Barlow

Exploring the Use of Constitutional AI to Reduce Sycophancy in LLMs

By Aleksandr Eliseev

Treechop – Reproducing the tree gridworld for investigating goal misgeneralisation

By Amy Andrews

Deceiving LLMs using LLMs — Attempts to elicit information through Multi-Agent Debate

By Konstantinos Tsiaras

Can AI Call its Own Bluffs?

By Aleksei Rozhkov

Exploring Polysemantic Universality with Embeddings

By Daniel Williams

Inner Misalignment as a Correlation Problem

By Gabriel Melo

Polysemanticity vs Superposition

By Noah Topper

Exploring MARL Safety in meltingpot

By Gema Parreño, Peter Francis, Cam Tice, Chris Pond, Yohan Mathew, Tomasz Steifer, Marina Levay

Understanding Transformer’s Induction Heads

By Natalia Burton

Shard Theory – is it true for humans? (And is it a good model for value learning in AI?)

By Rishika Bose

Can large language models effectively identify cybersecurity risks?

By Emile Delcourt

Representation Tuning

By Christopher Ackerman

A Research Agenda for Psychology and AI

By Carter Allen

Large Language Models and Tacit Cyber Risk Awareness

By Emile Delcourt

Root Cause Analysis of AI Safety Incidents

By Simon Mylius

LLM Values – Evaluating language-dependency of LLMs’ values, ethics and beliefs

By Christoph Sträter

Using Neuroscope to Explore LLM Interpretability: A Guide for Anxious Software Engineers

By Courtney Sims

Towards Behavioral-Alignment via Representation Alignment

By Galen Pogoncheff

Goals, Landscape and Recommendations for EU Compute Governance

By Artūrs Kaņepājs

Some Issues in Predictive Ethics Modeling: An Annotated Contrast Set of “Moral Stories”

By Ben Fitzgerald

A Closer Look at “How to Catch an LLM Liar”

By Karolina Dabkowska

BadLlama 3: remove safety finetuning from Llama-3-8B-Instruct for a few cents

By Dmitrii Volkov

Activate Love – Steering AI Text Generation

By Jan Raasch

Using an SAE as a steering vector

By Nelson Gardner-Challis

Demonstration of AI Safety via Market Making

By Cameron Holmes

Sparse Features Through Time

By Rogan Inglis

Implementing a Deception Eval with the UK AISI’s Inspect Framework

By Michael Schmatz

AI Animal Welfare: Creating an Animal-Friendly Model

By Zachary Mekus

Protecting against knowledge poisoning attacks

By Ollie Matthews

AI Governance (Oct 2023)

Upskilling in AI Governance – My How to Get Started Guide

By Carmen Csilla Medina

Paths to an AI Agency: An Overview of US Government Agencies and Analysis of Agency Creation Case Studies

By Spencer Kelly

Are AI safety and AI ethics memetic rivals?

By Daniel Friedrich

CERN to CERN for AI Mechanism Mapping

By Darryl Wright

Beyond Bias and Fallacies: Fixing the AI Safety Debate

By Maxime Fournes

What can I do to stop the nanobots from killing us? An analysis of career pathways in AI governance.

By Felix De Simone

AI Governance Career Goal Tracker

By Bengusu Ozcan