AI Alignment (2024 Mar)

Treechop – Reproducing the tree gridworld for investigating goal misgeneralisation

By Amy Andrews (Published on July 4, 2024)

I aimed to implement the tree gridworld described in Shah et al. (2022) so that a reinforcement learning agent can be run through it, ready for use in a project replicating their goal misgeneralization results. Through this project, I aimed to build up software engineering experience with reinforcement learning environments and, eventually, to get a ‘gearsy’ inside-view understanding of this specific instance of inner alignment and goal misgeneralization.

Read the full piece here.