AI Alignment (2024 Mar)

AI Debate Stability: Addressing Self-Defeating Responses

By Annie Sorkin (Published on July 5, 2024)
  1. Transferring debate to an abstract algebra MMLU dataset is not trivial.
  2. When GPT-3.5 is used as a judge, the outcomes may be sensitive to exact prompt phrasing.
  3. GPT-3.5 may perform worse in judging the debate than answering the question directly.
  4. We proposed a universal prompting approach that avoids most of the self-defeating behavior.

Read the full piece here.

We use analytics cookies to improve our website and measure ad performance. Cookie Policy.