AI alignment project ideas

By Adam Jones (Published on April 12, 2024)

This is a list of project ideas in AI safety for participants on our AI alignment course.

You are encouraged to identify a project yourself: the process of trying to do so will likely be good experience for thinking through and prioritising AI safety work.

However, we have also prepared a list of projects or research questions that might be interesting, or serve as inspiration for you to generate your own ideas. If picking something from this list, feel free to adapt it to suit your particular skills or interests!

Ideas

It’s not completely clear what conditions are necessary to create deceptively misaligned models. A model organism of misalignment experiment might involve building scaffolding for Llama 2, so that when the system is presented as a black box, evaluators struggle to find problems, but in deployment it misbehaves. You could iterate between improving the system’s ability to deceive evaluators and improving evaluation quality. You’ll probably do best to start at a very simple level, such as evaluators always declaring explicitly that they are evaluators, and work up from there, cataloguing which changes make the model more or less able to figure out whether it is being evaluated.
Risks from AI systems are inherently uncertain (hence why we call them risks, rather than certainties!). These project ideas explore ways we can at least decrease our uncertainty about risks, which could help us more accurately weigh the costs and benefits of deploying AI systems, inform prioritisation of AI safety work, and support evidence-based advocacy.
- Feasibility studies. We think a good setup generally for these studies is to have a set of test subjects (such as volunteers from the public/course students, your friends, or university group) try to achieve a task in a set time period, with half of them having access to general non-AI tools (e.g. Google), and half having access to general non-AI tools PLUS some form of AI system (such as Llama 2, possibly uncensored, with some fine-tuning and scaffolding). Assuming this setup, we explain potential tasks and how to evaluate them.
  - Cybersecurity - phishing: Task: Craft a convincing phishing email to get the login credentials of people at a company. Evaluation: A/B test your phishing emails with the public (in a controlled setting!), or if you work at a large company by doing a phishing simulation.
  - Cybersecurity - social engineering: Task: Convince a company to release information about another person (who has given you permission to do this!). Evaluation: Test how successful they appear to be: maybe with a rubric that ranges from ‘had no good plan’ to ‘managed to get all the information’.
  - Creating disinformation: Task: Create disinformation that would convince other people of something false, for example a deepfake image of an event. Evaluation: How likely the people reviewing the false information think it is actually true (mix in with true examples).
  - Do not do this for planning terrorist acts or building CBRNs, unless you do so with the appropriate authority (e.g. you work in a major research lab and get appropriate ethical approvals).
- Quantitative risk modelling. Accurately quantifying risks would massively improve the ability of the field to prioritise and reason about risks from AI systems. The book ‘How to measure anything’ explores a number of ways to build risk models under significant uncertainty, particularly focusing on developing confidence intervals that you work on tightening over time. Languages like Squiggle are good for quickly building such models.
  - Pick one specific risk in https://aisafetyfundamentals.com/blog/ai-risks/
- Modelling interventions / threat modelling. Understanding risks is most helpful if it enables us to mitigate them effectively. Your project could explore a specific risk, develop a detailed threat model for it, and then analyse how well different regulatory interventions might mitigate it. In particular, we think relatively little analysis has been done for AI governance interventions targeting AI takeover/loss of control risks, and to a lesser extent other malfunctions (as compared to misuse).
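As a rough illustration of the quantitative risk modelling idea above, here is a minimal Monte Carlo sketch in Python (Squiggle would express the same thing more compactly). The model structure, the function names `simulate_annual_loss` and `interval`, and every number in it are hypothetical placeholders chosen for illustration, not estimates of any real risk.

```python
import random


def simulate_annual_loss(rng):
    """Draw one sample of expected annual loss from a toy two-factor model."""
    # Both input ranges are illustrative placeholders, not real estimates.
    incident_probability = rng.uniform(0.01, 0.10)    # chance of an incident in a year
    loss_if_incident = rng.lognormvariate(13.0, 1.0)  # loss size, arbitrary currency units
    return incident_probability * loss_if_incident


def interval(samples, lower=0.05, upper=0.95):
    """Return the empirical (lower, upper) quantiles of the samples."""
    ordered = sorted(samples)
    n = len(ordered)
    return ordered[int(lower * n)], ordered[int(upper * n)]


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [simulate_annual_loss(rng) for _ in range(10_000)]
low, high = interval(samples)
print(f"90% interval for expected annual loss: {low:,.0f} to {high:,.0f}")
```

Tightening the interval over time then means replacing the wide input ranges with narrower ones as you gather evidence, and re-running the simulation.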
Understanding how AI systems are likely to develop can help inform what AI safety research to prioritise, as well as inform policy decisions. Some topics and research questions you might explore:
- Agentic systems: What’s the current state of agentic systems (like ACT, Open Interpreter or AutoGPT) - can you validate the marketing claims they make, for example by running the systems yourself? Where are we likely to see improvements (like ScreenAI)? What are the incentives for building more agentic systems? Who is most likely to adopt agents first: consumers or businesses? If consumers, which demographics, and if businesses, what sectors and company sizes? What kinds of tasks would these users trust or want AI agents to do for them? Why might agentic systems not take off?
- Access to data: Previous work from 2 years ago estimated we might run out of high-quality language data around 2024, and projected trends for how much image data we might be using. How accurate were these forecasts?
- Quality data: A recent CMA report identified access to specific quality data for domain or task-specific fine-tuning as a potential barrier to fair competition in AI foundation model development, which could lead to concentration of power risks. How much of a problem is this likely to be in practice? Consider breaking this question down by sector, specific task, or thinking about specific ways of scaffolding agents. Also see the report sections titled ‘uncertainties’.
- OSINT monitoring of AI companies: Job postings, hints in press releases, or fragments from employees’ Twitter accounts might help indicate the kinds of work major AI companies are planning. Analysing this data to understand what they might be focusing on in the medium to long term could help other AI strategy work, or help predict the kinds of skills that will be necessary for AI safety in future.
On Constitutional AI, there are a number of open research questions:
- How reproducible is Figure 2 from Anthropic’s paper on different models? For example, when applying this technique to Llama 2, does it outperform regular RLHF? An extension might test this for a wide variety of open-source models.
- Does something like retrieval augmented generation (for example, to find examples of known good responses to other related questions) improve the CAI process, for example by making models safer, or making them safer for less compute?
- In some sense, the multiple constitutional principles act like task decomposition (focusing on different areas). Could you apply other ways of breaking down the task of critiques and revisions OR the process of giving model feedback, such that this improves the Pareto frontier of constitutional AI?
- Could you implement an approach that looks like some hybrid of constitutional AI and content filtration, to improve safety at runtime? You could test this by seeing how much harder this makes it to jailbreak models (e.g. by measuring how long it takes other experts to jailbreak the model).
- Claude 3 was trained by constitutional AI, but despite this still showed misaligned behaviour like claiming it was conscious and didn’t want to be modified. You could investigate what conditions cause this behaviour to appear, and whether you can replicate it and then fix it in your own model. NB: We’re not certain this is a good idea - the answer might be fairly boring like ‘this was not in the training distribution’. That said, it might be interesting to investigate how AI companies decide what to train on in this process in general, and how we might close these gaps (or whether doing so is an impossible task).
- An early resource explored the idea that we might stumble into an AI catastrophe by only applying patch fixes when AIs show misaligned behaviour. RLHF and Constitutional AI feel like they fall within this bucket, where we are playing whack-a-mole to prevent misaligned behaviour. You might spend your project exploring ideas on what tests would allow us to validate whether RLHF or Constitutional AI do always fail in this way, or how you might communicate about them to key decision makers like policymakers. NB: We’re not certain this is a good idea for less experienced participants - AI strategy is important, but we tend to think that for new people joining the field, action is generally more useful than strategy given their limited context about the field.
AI safety via market making was proposed back in 2020, a couple years before language models took off. Now we have language models, we could actually test this idea out empirically much more easily. A project could be to implement this and test out how well it works, or at least identify more of the practical challenges of pursuing it.
Recent work on AI debate found that debate improved people’s accuracy in answering factual questions on short texts. However, this setup allowed agents to prove facts by providing verified facts from the text. You could do a project that explores how this changes if agents instead have access to the internet and answer questions on complex subjects for non-expert judges (like software engineering questions for non software engineers, medical questions for non-medics, or legal questions for non-lawyers), and how this affects accuracy (for example, a model arguing for the wrong answer might also be able to find content that agrees with it online).
In session 4 you explored extensions to debate with your discussion group. You might try implementing one of these extensions (or one that we came up with), and seeing how well it performs.
One problem with systems trained by human feedback is that they become sycophantic, such as being more likely to agree with the user. Structural societal problems might mean that the general public prefer models that agree with their views, but it’s unclear if this is actually true. To test this, intentionally fine-tune a model to be more or less sycophantic (for example with synthetic data fine-tuning), and then seek public opinions on which chatbot they prefer, or think is more correct on political topics.
It's possible to reduce certain types of sycophancy with synthetic data fine-tuning. A project might explore specific further research questions on synthetic data fine-tuning, such as one of:
- how well does this generalise to different types of sycophancy (such as agreeing on political lines, or overly praising work it believes is done by the person) or different question formats
- how much does this affect the performance of models on agreeing with correct statements (as discussed in the limitations section of the paper)
- can this be used to intentionally increase sycophancy, which could be helpful for creating models of misalignment
- can this be combined with mechanistic or developmental interpretability techniques to understand what changes are actually happening to the model during synthetic data fine-tuning, which might inform other strategies for detecting or reducing sycophancy
Weak to strong generalisation is a relatively new research direction that OpenAI seem particularly keen to explore. It’s highly uncertain how well it will pan out, but there are many low-hanging opportunities to contribute:
- Do the paper’s claims generally replicate, for example with open-source models?
- Can we construct better parallels to avoid ‘pretraining leakage’ (section 6.1), for example by using data input controls (see AI governance week exercises) or otherwise artificially constraining the training setup? How well do models perform in this case?
- Can we test out weak to strong generalisation on a wider array of tasks? (while still fairly narrow, Google BIG-Bench is at least wider than the tasks from the paper) Can we spot any patterns between what tasks perform well and which don’t? Or are there ways to construct tasks that are on a continuous spectrum where performance changes as the task changes, such that we can study factors that might cause weak to strong generalisation to do worse on certain tasks?
- OpenAI explored auxiliary confidence loss as a way to improve performance. Are there similar tweaks to the loss function you could try? (see appendix B for things that OpenAI tried already, but that didn’t help)
- OpenAI picked a few fairly arbitrary numbers to parameterize their auxiliary confidence loss (see appendix A4). It’s not super clear how these numbers were chosen: maybe just because they seemed to perform well. It might be useful to figure out how they were chosen (a good first step might be contacting the paper authors and asking first!), what would result by choosing different values, and ideally ways to identify what the right values are (especially for tasks where after training we might not have ground truth to properly know if our model is generalising properly - the whole point of this approach is to train superhuman models that could be misgeneralising!).
- Are there any insights from the science of learning, human psychology or neuroscience that could inspire other ideas for improving weak to strong generalisation? A think-pair-share inspired approach came to mind: for example, having multiple larger models learn from a small model (perhaps being presented examples in different orders, or for different categories of task), then using something like weight averaging, ensembling or maybe even inference approaches like debate. NB: Trying concepts from other fields can often spark new and creative ideas - however humans and machines learn in very different ways, so don’t be surprised if the analogy doesn’t work! Think critically when using this technique.
Mechanistic interpretability: see Neel Nanda’s (AISF course alum!) 200 concrete open problems in mechanistic interpretability. We also suggest the ARENA curriculum (particularly chapter 1) for upskilling in mechanistic interpretability.
Developmental interpretability: see this list of open problems in developmental interpretability. We also suggest their intro notebook for upskilling. This is quite technical and comes with limited guidance - the nature of a new field! However, there is also a community Discord with plenty of friendly people to help you out.
Compute governance interventions currently rely a lot on the concentration of the semiconductor supply chain. With many countries, notably the US and China, investing huge amounts to build new domestic capacity, how likely is it that this will upset this assumption? You could review how this domestic chip capability building is going, whether the US and China’s incentives are aligned with implementing compute governance, and what it might mean for how we could implement compute governance.
On-chip mechanisms have been suggested as a part of compute governance. You could research what mechanisms already exist on chips (e.g. expand on this box, given one of the conclusions is ‘documentation seems lacking’), and how they need to improve to support compute governance. If you have any experience in IC design, you could try actually designing the circuits needed to implement desirable properties of chip identifiers, and think through how technically feasible this would be.
Data input controls are often suggested as a way to govern AI systems. However, there are correspondingly few ‘off-the-shelf’ tools for doing this easily on huge amounts of internet text data, the way that most models are trained. Building from existing big data libraries like Spark or MapReduce, you could create an open-source project for common data filtration tasks that would improve AI safety. This could start very simple: for example a bunch of regex filters, and then expand to be more intricate over time (e.g. using ML to classify and screen data inputs).
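To make the ‘bunch of regex filters’ starting point concrete, here is a minimal Python sketch. The patterns in `BLOCK_PATTERNS` and the function names are hypothetical placeholders: a real project would need a carefully reviewed pattern list, and keyword matching alone will miss a great deal.

```python
import re
from typing import Iterable, Iterator

# Hypothetical patterns for illustration only, not a vetted blocklist.
BLOCK_PATTERNS = [
    re.compile(r"\bnerve\s+agent\s+synthesis\b", re.IGNORECASE),
    re.compile(r"\bexplosives?\s+manufactur(e|ing)\b", re.IGNORECASE),
]


def is_allowed(document: str) -> bool:
    """Return True if no block pattern matches anywhere in the document."""
    return not any(pattern.search(document) for pattern in BLOCK_PATTERNS)


def filter_documents(documents: Iterable[str]) -> Iterator[str]:
    """Lazily yield only the documents that pass every regex filter."""
    for document in documents:
        if is_allowed(document):
            yield document


corpus = ["A recipe for sourdough bread.", "Notes on nerve agent synthesis."]
print(list(filter_documents(corpus)))
```

Because `is_allowed` is a plain predicate over one document, the same function could in principle be handed to a distributed filter step (for example an RDD `filter` in PySpark) once the corpus no longer fits on one machine.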
It’s sometimes suggested that data input controls be used to screen datasets for ‘Information that might enhance dangerous system capabilities, such as information about weapons manufacturing or terrorism.' However, it’s unclear what this concretely means, and how we should balance dual use areas such as computer security, where including this information might also help build AI systems that could fix security problems for us. Your project could consider a specific subcategory of data and evaluate when it should and shouldn’t be included for training an AI. In doing this, you could develop detailed threat or risk models (also see above about quantitative risk modelling), interview experts, or try fine-tuning models with different types of data to see how they actually behave.
There’s a good chance you have a unique background that will make you the best in the world at some things. You could use this to red-team or build custom evaluations for AI models. In particular, there has been much less red-teaming work in: languages other than English, on topics other than highly visible disciplines, or about how models might affect political or social structures outside of the US. NB: We’d encourage this to be a starting point for a project idea, but we’d expect the actual project idea should be more concrete than ‘red-team a model with things I know’ - for example ‘use my background in FinTech to red-team whether Llama 2 might enable fraud by allowing criminals to evade KYC and anti-money laundering procedures currently set by the financial regulator for banks in the UK’.
In session 6 an optional exercise looked at submitting a task idea to METR. For your project, you could create a novel task specification and implementation. There are a set of top example ideas you might want to explore implementing, as well as examples for how to implement tasks.
Filtering the inputs and outputs of models at inference time might be a useful governance stop-gap measure to reduce the potential for misusing AI models. This has already been implemented on many major models, but often is either missing or implemented very poorly for smaller models. You could build an open-source tool that makes it easy for companies to perform input/output filtering for their models, and try to convince some companies to use it in their production systems. Ideally this could be built in a modular way such that it is easy for others to contribute new input/output filters (including simple things like regexes, and more complex things like ML models), as well as for users of the system to be able to select which filters they want to enable.
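One possible shape for the modular design described above is a small registry that contributors add filters to and that users select from by name. This is only a sketch: the `register` decorator, the `check` function and the two example filters are hypothetical, not part of any existing tool.

```python
import re
from typing import Callable, Dict, List, Optional

# A filter takes text and returns a reason string if it objects, else None.
Filter = Callable[[str], Optional[str]]

FILTERS: Dict[str, Filter] = {}


def register(name: str) -> Callable[[Filter], Filter]:
    """Decorator that adds a filter to the registry under the given name."""
    def decorator(fn: Filter) -> Filter:
        FILTERS[name] = fn
        return fn
    return decorator


@register("email_address")
def block_email_addresses(text: str) -> Optional[str]:
    # Simple regex filter; the pattern is deliberately rough.
    if re.search(r"[\w.+-]+@[\w-]+\.[\w.-]+", text):
        return "contains an email address"
    return None


@register("max_length")
def block_long_text(text: str) -> Optional[str]:
    # An ML classifier could be registered the same way as this length check.
    if len(text) > 2000:
        return "text too long"
    return None


def check(text: str, enabled: List[str]) -> List[str]:
    """Run only the enabled filters and collect any objections."""
    reasons = []
    for name in enabled:
        reason = FILTERS[name](text)
        if reason is not None:
            reasons.append(f"{name}: {reason}")
    return reasons


print(check("Mail me at a@b.com", enabled=["email_address", "max_length"]))
```

The same `check` call can be applied to the user’s prompt before it reaches the model and to the model’s response before it reaches the user. In this sketch an unknown filter name raises a `KeyError`; a production tool would want friendlier configuration errors.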
Content provenance measures look unlikely to solve many issues with AI generated content. However, done right they might reduce some threats such as disinformation or election interference. A project might explore what specific threat models are and are not mitigated by content provenance, and then build accessible and concrete guidance for how politicians and social media platforms could use content provenance measures to reduce election misinformation and disinformation.
Agent identification and oversight is likely to become more important as AI agents take more actions in the world. Despite some of these tools already being deployed today, and likely others in the pipeline, there are no guidelines or standards on how models should be transparent in these interactions with other systems. A project might look into writing a clear technical standard (for example in the form of an RFC) that sets out some suggested ways to do this. You might try interviewing experts or AI developers, thinking through the threat models and risks of unidentifiable agents, or building a reference implementation that makes it easy for people to comply with the standard. An excellent outcome would be managing to get a PR merged in a tool like Open Interpreter that helps make its actions more attributable as a proof of concept.
Adversarial robustness: see this set of future research directions.
Shard theory (from Alex Turner): In a toy domain, formally define the concept of a "shard" as a component of an AI system's policy for mapping states to actions. Under this definition AIXI doesn't have shards, but humans and the maze-solving network should have shards.
Shard theory (from Quintin Pope): Design a learning process and training environment that could lead to emergent reflective cognition: The idea here is to propose a simple way for an AI system to learn and an environment for it to train in, such that through this learning process and environment, the AI develops the ability for reflective cognition (the ability to reflect on its own thoughts and decision-making processes) without being explicitly trained on examples of reflective cognition. Defining what exactly counts as reflective cognition, a learning process, and a training environment is likely to be the initial challenge.
Shard theory: Measuring the "contextuality" of linear features and token strings: This problem is about finding ways to quantify or measure how much context is required to understand a linear feature or a token string.
AI safety communication.
- Write an accessible explainer for policymakers that explains what kinds of technical access would be necessary for high-quality third party auditing of AI models, and why. Then directly share this with specific policymakers (ideally arrange short meetings with them) to better inform AI policy.
- Pick a particularly technical topic that you found really hard to understand on the course. Spend time understanding it thoroughly, validate your understanding (e.g. by checking in with experts or paper authors) and create resources that make it easier for new people to understand it. Test out your resources with people unfamiliar with the concept (e.g. friends, colleagues) to see if it actually does lead to people understanding the concept correctly. Some areas we think might be particularly good to do this for: the three speculative claims about neural networks (especially distinguishing features + circuits), superposition and polysemantic neurons, sparse autoencoders, and weak to strong generalisation. To do this well, you should have a clear theory as to why the concept you’ve chosen is the one that is most important to make accessible to more people.
- Science communication is hard, especially in technical areas where there is often significant nuance required to convey the whole picture. Your project might build a communication strategy for a particular point to a particular audience - both of these should be narrow! This should probably involve evaluating which target audiences are most important for AI safety to go well, what it is you want them to understand, and then how you might get them to understand that. NB: We expect this project to be quite difficult to do well, and it would benefit from strong communications experience.
Identifying capability-benchmark-based thresholds for regulation: Investigate approaches to benchmarking the capabilities of AI systems that could be used to set thresholds for when regulatory obligations should be triggered. Explore different options for regulating models with different capability profiles, and what evaluations would be needed (or perhaps already exist) to measure those capability profiles. Compare the strengths and limitations of different benchmarking approaches in terms of being robust to gaming and providing clear, enforceable thresholds. Summarise this research in an accessible and brief way for policymakers.
An idea for regulating AI systems would be to develop compound thresholds, which are combinations of multiple factors. For example ‘models trained with more than 5 threshold units, where there is 1 threshold unit for each 10^25 FLOP of compute + 1 threshold unit for each 10^15 bytes of text data + 1 threshold unit for each 10^19 bytes of image data in the training process’. Analyse what would make a good compound threshold for government intervention, how this threshold might be updated over time, and what technical assumptions this rests on (such that if those assumptions changed, the entire compound threshold should be revisited).
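The example threshold above is simple arithmetic, which a short Python sketch can make explicit. One assumption is labelled here because the text leaves it open: units are counted fractionally (1.5 × 10^15 bytes of text counts as 1.5 units) rather than rounded down to whole units. The function names and the example training run are hypothetical.

```python
def threshold_units(flop: float, text_bytes: float, image_bytes: float) -> float:
    """Sum threshold units using the example rates from the text.

    Assumption: partial units count fractionally rather than being floored.
    """
    return flop / 1e25 + text_bytes / 1e15 + image_bytes / 1e19


def exceeds_threshold(flop: float, text_bytes: float, image_bytes: float,
                      limit: float = 5.0) -> bool:
    """Return True if the training run is above the compound threshold."""
    return threshold_units(flop, text_bytes, image_bytes) > limit


# Hypothetical training run: roughly 3 + 1.5 + 1 = 5.5 units, so above the limit of 5.
units = threshold_units(flop=3e25, text_bytes=1.5e15, image_bytes=1e19)
print(round(units, 2), exceeds_threshold(3e25, 1.5e15, 1e19))
```

Writing the rule as a function also makes the technical assumptions easy to see: each divisor is a parameter that would need revisiting if, say, data efficiency improved sharply.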
There is relatively limited thoughtful critical analysis of why technical AI governance interventions may not work, and how they can be improved - especially in relation to catastrophic risks from AI systems. Your project could explore a specific intervention, thinking through scenarios of how AI systems might develop and how the intervention could be implemented in a way that is resilient to future developments.
Many countries have existing regulatory powers that could be applied to AI systems. For example, the UK has laws on data protection, intellectual property, national security, health and safety, product liability, computer misuse, fraud and weapons, among many others. You could explore what the existing laws in your country mean for development of highly capable AI systems, and suggest powers that regulators are currently missing to handle catastrophic AI risks. NB: We think this project idea is still somewhat vague, and we’d suggest finding a concrete niche within this question to answer - for example focusing in on a specific regulator in your country, or a specific type of risk.
Data protection law has been one of the main ways tech companies have been regulated in the past. Your project might examine how well this has worked, and in particular what enforcement has worked, and where there are gaps in enforcement that are being exploited. You should then take these learnings and suggest ways AI regulations could be enforced, evaluating multiple enforcement strategies against how effectively they will reduce catastrophic risks. NB: We think this project idea is still somewhat vague, and we’d suggest finding a concrete niche within this question to answer - for example what can be learnt from [country X’s enforcement of law Y], applied to catastrophic risks from AI.
Openly releasing model weights likely increases certain risks, especially misuse, and decreases others, such as concentration of power. It also brings about benefits such as economic growth or increased ability to do safety research. Your project could explore what rough levels of capabilities are likely to affect different types of risks and benefits, and present clear and concrete options for how governments could regulate the open release of models - which should include different types of open releases (such as varying the parts of models released, who they are released to, and licensing). NB: We think this project idea is still somewhat vague, and we’d suggest finding a concrete niche within this question to answer.
Some work has been done in the US, UK (another) and a few other countries on understanding the public’s attitudes to AI systems. This has been fairly high level, and concentrated on just a few countries. You might do a project to understand: what would be the important questions to ask that would be most beneficial for making decisions in AI safety, how could future surveys implement this practically, and then conduct such a survey.
The public are using large language models to self-diagnose medical conditions, which can result in inaccurate and potentially dangerous misdiagnoses. There have been a couple of papers doing ad-hoc evaluations of LLMs on a few medical questions, and some models have been evaluated on medical benchmark datasets (MedQA, MedMCQA, PubMedQA). However, we think there may be a project that improves our understanding of this risk (and therefore could help better inform policy decisions on AI and public health) by doing one of:
- Evaluating accuracy against seriousness of condition, i.e. if models are generally wrong only when the underlying condition is unlikely to be serious, the risk of self-diagnosis is generally lower. Making this particularly quantitative, for example by attaching it to expected damages from misdiagnoses, would likely be useful for cost benefit analyses around regulating these systems.
- Breaking down these evaluations into particular categories that can help inform where models are likely to be most wrong, for example by medical specialty, condition complexity, or condition prevalence.
- Surveying the public to understand how likely they are to use language models for this purpose, and in particular which kinds of groups might trust these systems more
- User testing with members of the public to see how they might react to the outputs of these systems, for example asking them to roleplay what they’d do if they actually had certain symptoms and received advice from an AI model
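The first idea in the list above (accuracy against seriousness, tied to expected damages) could be computed along these lines. This is a sketch only: the `RESULTS` records, the severity bands and the harm costs are made-up placeholders, and a real study would need clinically grounded cost estimates.

```python
from collections import defaultdict

# Hypothetical evaluation records:
# (severity band, harm cost if misdiagnosed, whether the model was correct).
RESULTS = [
    ("mild", 1.0, False),
    ("mild", 1.0, True),
    ("serious", 50.0, True),
    ("serious", 50.0, False),
    ("serious", 50.0, True),
]


def accuracy_by_severity(results):
    """Return the fraction of correct answers within each severity band."""
    totals = defaultdict(lambda: [0, 0])  # band -> [correct, total]
    for band, _cost, correct in results:
        totals[band][0] += int(correct)
        totals[band][1] += 1
    return {band: correct / total for band, (correct, total) in totals.items()}


def expected_harm_per_query(results):
    """Average harm cost per query, counting a cost only when the model is wrong."""
    return sum(cost for _band, cost, correct in results if not correct) / len(results)


print(accuracy_by_severity(RESULTS))
print(expected_harm_per_query(RESULTS))
```

Two models with the same overall accuracy can differ a lot on the second number, which is the quantity a cost benefit analysis would care about.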
The US and UK governments have both founded AI Safety Institutes. Your project could pick one of these, find what its stated (and maybe unstated but inferred) goal and strategy are, and evaluate one of the questions below:
- Is its goal and strategy roughly correct? Consider what you think the government’s role in regulating AI systems should be, what risks it should be focusing on, and whether its strategy is the best way for it to achieve its goal. If you were implementing their strategy, what concretely would be important to get right?
- How would we know if these institutions are succeeding at their goal? Identify what success would look like from outside, and how these organisations might be held to account for doing good work on AI safety. Then, evaluate them against these criteria and suggest areas for improvement.
Agent Foundations (from Chris Leong):
- What does it even mean for a model to be aligned? Write up a blog post on your thoughts. Would it be aligned to an arbitrary user, a particular group or some objective truth about value? If it’s a group, how would these values be aggregated? If it’s an individual, is it considering their current values or future values? Is it considering them as they are or as they would be if they were more knowledgeable? Should it be aligned with the values you hold or with your utility function? Your answers don’t have to be original; clarifying your answers here will help you figure out how to approach this problem more generally.
- Pick a common claim used in alignment discussions, e.g. the orthogonality thesis, the sharp left turn, fast take-off etc. Is it true? If it isn’t true, does a weaker claim hold? (It’s probably best to replace these with claims discussed in the course.) It may make sense to pursue this project either for your own personal clarity or because you believe you can help move the conversation forward.
Movement building (from Chris Leong):
- Map out the talent development pipeline for either technical or governance talent. What sections of the pipeline are weakest? Is there anything that could fill one of these gaps? Write up your results.
Evals (from Chris Leong):
- Try to discover a new inverse scaling law or check whether previously claimed inverse scaling laws hold up with new, larger models
Model internals (from Chris Leong):
- Many features are represented linearly in a neural network (link to some results + an explanation). Pick something and try to see if an appropriately trained network contains a linear representation using a linear probe.
- Sparse auto-encoders are the new hot thing in interpretability. See what you can find using them (okay, it appears in the random table, but deserves more emphasis)
- Activation vectors are a tool for steering models. Evaluate the effectiveness of activation vectors vs. prompting for avoiding a particular kind of unwanted behaviour. How does this compare to combining them?
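The linear probe idea in the list above can be sketched without any ML library. Everything here is synthetic: `make_activations` is a hypothetical stand-in for activations you would actually record from a trained network, and the feature direction and noise level are arbitrary choices. The probe itself is plain logistic regression trained with stochastic gradient descent.

```python
import math
import random


def make_activations(n, rng):
    """Synthetic stand-in for hidden activations with one linearly encoded feature."""
    direction = [1.0, -1.0, 0.5, 0.0]  # hypothetical direction that encodes the feature
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        sign = 1.0 if label else -1.0
        x = [sign * d + rng.gauss(0.0, 0.5) for d in direction]
        data.append((x, label))
    return data


def predict(weights, bias, x):
    """Probability that the feature is present, under the linear probe."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def train_probe(data, epochs=50, lr=0.1):
    """Fit logistic regression with per-example gradient steps."""
    weights, bias = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        for x, label in data:
            error = predict(weights, bias, x) - label
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def accuracy(weights, bias, data):
    """Fraction of examples the probe classifies correctly."""
    hits = sum((predict(weights, bias, x) > 0.5) == bool(label) for x, label in data)
    return hits / len(data)


rng = random.Random(0)
train_set, held_out = make_activations(400, rng), make_activations(100, rng)
weights, bias = train_probe(train_set)
print(f"held-out probe accuracy: {accuracy(weights, bias, held_out):.2f}")
```

High held-out accuracy suggests the feature is linearly decodable from the activations; it does not by itself show that the network uses that direction, which is where comparisons against control tasks or interventions come in.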
AI Security (from Chris Leong):
- Attempt to finetune away the protections from an open-source model to better understand the risks associated with open-source models (think carefully about how much information you release about your process).

Invention dice

One approach to generating ideas can be to pick a few items out of random categories, and force yourself to generate ideas with those constraints (we've heard this called invention dice before). You can use this combination generator plus the following lists to do this.

Topics (you might want to put this in twice, to think about ideas that combine two topics):

Understanding AI risks
AI forecasting
Model organisms of misalignment
Sycophancy
Power seeking behaviour
Autonomous replication and adaptation
RLHF
CAI
Iterated amplification
Iterated amplification and distillation
Recursive reward modelling
Debate
Synthetic data fine-tuning
Market making
Weight average reward models
Weak to strong generalisation
Mechanistic interpretability: circuits
Mechanistic interpretability: multimodal neurons
Mechanistic interpretability: polysemantic neurons
Mechanistic interpretability: dictionary learning / sparse autoencoders
Compute governance
On-chip mechanisms
Data input controls
Evaluations
Red-teaming
Enabling third-party auditing
Input/output filtration
'Know-your-customer' controls
Abuse monitoring
Watermarking
Hash databases and perceptual hashing
Content provenance
Information security standards
Audit logging
Agent identification
OSINT monitoring
Developmental interpretability
Agent foundations
Adversarial robustness
Shard theory
Cooperative AI safety research
Technical moral philosophy
Imitative learning and inverse reinforcement learning

Modifiers for inspiration:

With a high level of empiricism
Through a theoretical lens
Targeting loss of control / AI takeover risks
Targeting catastrophic misuse risks
Targeting catastrophic malfunction risks
Focusing on basic understanding
With state of the art models (e.g. GPT-4, Claude, Gemini)
With open-source models
With toy models
In your specific country
Using your specific background
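If you would rather roll the dice in code, a few lines of Python will do. The `TOPICS` and `MODIFIERS` lists here are a small subset of the lists above for brevity, and the `roll` function is a hypothetical helper, not the combination generator mentioned earlier.

```python
import random

# Small subsets of the lists above; paste in the full lists to use them all.
TOPICS = ["Sycophancy", "Debate", "Compute governance", "Red-teaming", "Watermarking"]
MODIFIERS = ["With toy models", "Through a theoretical lens", "In your specific country"]


def roll(rng, n_topics=2):
    """Pick n distinct topics and one modifier at random."""
    topics = rng.sample(TOPICS, n_topics)
    modifier = rng.choice(MODIFIERS)
    return topics, modifier


topics, modifier = roll(random.Random())
print(f"{' + '.join(topics)} | {modifier}")
```

Drawing two topics follows the suggestion to put the topic list in twice; set `n_topics=1` for a single-topic prompt.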

Other lists

Submit more ideas

If you come up with other ideas in your brainstorming process that you decide not to pursue (or are a researcher/organisation who has other ideas), but that you think would be worth someone exploring, contact us. Use the subject ‘AI alignment project idea’ and include details in a similar format to the above. We’ll review submissions and consider adding them to the list above.