AI Alignment (2024 Mar)

Can We Rely on Model-Written Evals for AI Safety Benchmarking?

By Sunishchal Dev (Published on July 4, 2024)

Model-written evaluations for AI safety benchmarking differ from human-written ones, leading to biases in how LLMs respond. They have issues with structure, formatting, and hallucination; often they exhibit unique semantic styles, too. We highlight concerns of false negatives where a lab could unwittingly deploy an unsafe model due to these issues. We propose a suite of QA checks to catch basic issues in model-written evals along with a research agenda that seeks to understand and rectify the full extent of the differences with the gold standard.

Read the full piece here.