AI Alignment (2024 Mar)

AI Debate Stability: Addressing Self-Defeating Responses

By Annie Sorkin (Published on July 5, 2024)

Transferring debate to an abstract algebra MMLU dataset is not trivial.
When GPT-3.5 is used as a judge, the outcomes may be sensitive to exact prompt phrasing.
GPT-3.5 may perform worse in judging the debate than answering the question directly.
We proposed a universal prompting approach that avoids most of the self-defeating behavior.

Read the full piece here.