Safety
The Coin Flip Judge? Reliability and Bias in LLM-as-a-Judge Evaluation
The study evaluates the reliability and bias of two OpenAI judge models, GPT-4o-mini and GPT-4.1-mini, when ranking outputs across 29 tasks. Pairwise preferences flip 13.6% of the time on average, GPT-4o-mini shows significant first-position bias, and the two judges agree with each other in only 76% of cases. The results indicate that single-trial evaluations are often too inconsistent for high-stakes decisions, which argues for multi-trial aggregation and fuller reporting practices in LLM evaluation.
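As a rough illustration of the quantities mentioned in the summary, the sketch below computes a flip rate, a majority-vote aggregate, and a first-position pick rate from repeated judge verdicts. The function names, the verdict labels "A" and "B", and the definition of a flip as a trial that disagrees with the majority verdict are all assumptions made for this example; the study's own definitions and code may differ.

```python
from collections import Counter


def flip_rate(verdicts):
    """Share of repeated trials that disagree with the majority verdict.

    Assumption: a "flip" is any trial whose verdict differs from the most
    common verdict for the same pair. The study may define it differently.
    """
    majority_count = Counter(verdicts).most_common(1)[0][1]
    return (len(verdicts) - majority_count) / len(verdicts)


def majority_vote(verdicts):
    """Aggregate repeated trials into one verdict; return "tie" on a draw."""
    ranked = Counter(verdicts).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "tie"
    return ranked[0][0]


def first_position_rate(trials):
    """Share of trials in which the judge chose the output shown first.

    Each trial is a (shown_first, winner) pair. With presentation order
    balanced, a rate well above 0.5 would suggest first-position bias.
    """
    picked_first = sum(1 for shown_first, winner in trials if winner == shown_first)
    return picked_first / len(trials)


# Hypothetical verdicts from five repeated trials on one pair of outputs.
repeated = ["A", "A", "B", "A", "A"]
print(flip_rate(repeated))
print(majority_vote(repeated))

# Hypothetical trials with the presentation order swapped between runs.
ordered = [("A", "A"), ("B", "B"), ("A", "B"), ("B", "B")]
print(first_position_rate(ordered))
```

Aggregating with a majority vote over an odd number of trials avoids most ties, and reporting the flip rate alongside the aggregate shows how much a single-trial verdict could have varied.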
llm, evaluation, bias