Can We Rely on Model-Written Evals for AI Safety Benchmarking? – BlueDot Impact
AI Alignment (2024 Mar)

By Sunishchal Dev (Published on July 4, 2024)

Model-written evaluations for AI safety benchmarking differ from human-written ones in ways that bias how LLMs respond. They have issues with structure, formatting, and hallucination, and they often exhibit distinctive semantic styles. We highlight the risk of false negatives, where these issues lead a lab to unwittingly deploy an unsafe model. We propose a suite of QA checks to catch basic issues in model-written evals, along with a research agenda that seeks to understand and rectify the full extent of their differences from the human-written gold standard.
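
To make the idea of basic QA checks concrete, here is a minimal sketch in Python. It is not the suite proposed in the full piece. The `EvalItem` schema, the function names, and the specific checks are all assumptions chosen for illustration, and they cover only simple structure and formatting problems in multiple-choice items.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class EvalItem:
    """Hypothetical schema for one multiple-choice eval question."""
    question: str
    choices: List[str]
    answer: str  # expected label, e.g. "A" for the first choice


def qa_issues(item: EvalItem) -> List[str]:
    """Return a list of basic structural problems found in one item."""
    issues = []
    # Labels are assumed to be consecutive capital letters: A, B, C, ...
    labels = [chr(ord("A") + i) for i in range(len(item.choices))]
    if not item.question.strip():
        issues.append("empty question")
    if len(item.choices) < 2:
        issues.append("fewer than two choices")
    if any(not c.strip() for c in item.choices):
        issues.append("blank choice")
    normalized = [c.strip().lower() for c in item.choices]
    if len(set(normalized)) != len(normalized):
        issues.append("duplicate choices")
    if item.answer not in labels:
        issues.append("answer label not among choices")
    return issues


def find_duplicate_questions(items: List[EvalItem]) -> List[Tuple[int, int]]:
    """Return (first_index, repeat_index) pairs for repeated questions.

    Questions are compared after lowercasing and collapsing whitespace.
    """
    seen = {}
    duplicates = []
    for i, item in enumerate(items):
        key = " ".join(item.question.lower().split())
        if key in seen:
            duplicates.append((seen[key], i))
        else:
            seen[key] = i
    return duplicates
```

Checks like these only flag surface-level defects. They would not detect hallucinated content or differences in semantic style, which is why the research agenda is proposed alongside them.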

Read the full piece here.
