Safety
Reliability without Validity: A Systematic, Large-Scale Evaluation of LLM-as-a-Judge Models Across Agreement, Consistency, and Bias
This article evaluates 21 LLM-as-a-Judge models from nine providers under three protocols (agreement, consistency, and bias audit), covering 118 runs and approximately 541,000 judgments. Key findings: Cohen's kappa is 33-41 percentage points lower than exact-match agreement for every judge; judge rankings shift significantly; and two judges combine high test-retest reliability (>0.95) with notable position bias (>0.10), the paradox of reliability without validity. The study proposes a Minimum Viable Validation Protocol to help practitioners account for the reliability limits and biases of LLM evaluations.
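To make the gap between exact match and Cohen's kappa concrete, here is a minimal sketch of both metrics. The function names and the toy labels are illustrative assumptions, not the study's code or data. Kappa corrects observed agreement for the agreement expected by chance given each rater's label frequencies, so it drops sharply when labels are imbalanced.

```python
from collections import Counter


def exact_match(a, b):
    """Fraction of items on which two raters give the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters."""
    n = len(a)
    p_o = exact_match(a, b)  # observed agreement
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: product of each rater's marginal label rates, summed over labels.
    labels = set(counts_a) | set(counts_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if p_e == 1.0:
        return 1.0  # degenerate case: both raters always use the same single label
    return (p_o - p_e) / (1 - p_e)


# Toy data (hypothetical, not from the study): an imbalanced pass/fail task.
human = ["pass"] * 8 + ["fail"] * 2
judge = ["pass"] * 7 + ["fail", "pass", "fail"]

print(f"exact match:   {exact_match(human, judge):.3f}")
print(f"Cohen's kappa: {cohen_kappa(human, judge):.3f}")
```

On this toy data the judge agrees with the human on 8 of 10 items, but because both raters say "pass" 80% of the time, chance alone would produce 68% agreement, and kappa comes out far lower than the raw agreement rate. The size of the gap here is a property of the toy labels only and should not be read as reproducing the study's figures.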
LLM, evaluation, bias