Research
Who Drifted: the System or the Judge? Anytime-Valid Attribution in LLM Evaluation Pipelines
The paper presents a methodology for LLM evaluation pipelines that resolves a common ambiguity: when a judge-assigned score shifts, it is unclear whether the system under evaluation changed or the judge did. The method combines three components: a fixed, human-labeled anchor set for consistent scoring, a betting e-process that tracks the gap between judge scores and human labels, and a guard-window rule that attributes each score change to either the system or the judge. In the reported results, the approach identifies judge drift with high accuracy and raises substantially fewer false alarms than traditional methods, which makes it useful for practitioners who need reliable evaluation metrics when deploying LLMs.
LLM, evaluation, attribution
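To make the idea of a betting e-process on the judge-versus-human gap concrete, here is a minimal Python sketch. It is not the paper's construction: it assumes each gap (judge score minus human label on an anchor item) lies in [-1, 1], tests the null hypothesis that the mean gap is zero, and uses a fixed bet size rather than an adaptive one. The function name `betting_eprocess_alarm` and the parameters `alpha` and `lam` are hypothetical. The guard-window rule is not sketched because the summary does not specify it.

```python
from typing import Iterable, Optional


def betting_eprocess_alarm(
    gaps: Iterable[float],
    alpha: float = 0.05,
    lam: float = 0.5,
) -> Optional[int]:
    """Two-sided betting e-process for H0: mean judge-human gap == 0.

    Assumption (not from the paper): every gap lies in [-1, 1] and the
    bet size `lam` is fixed in (0, 2). Returns the 1-based index of the
    observation at which the alarm fires, or None if it never fires.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be in (0, 1)")
    if not 0.0 < lam < 2.0:
        raise ValueError("lam must be in (0, 2) so wealth stays positive")

    wealth_up = 1.0    # bets that the mean gap is positive
    wealth_down = 1.0  # bets that the mean gap is negative
    threshold = 1.0 / alpha

    for t, gap in enumerate(gaps, start=1):
        if not -1.0 <= gap <= 1.0:
            raise ValueError("gaps must lie in [-1, 1]")
        # Rescale the gap to [-0.5, 0.5]; under H0 this has mean zero,
        # so each wealth process is a nonnegative martingale.
        centered = gap / 2.0
        wealth_up *= 1.0 + lam * centered
        wealth_down *= 1.0 - lam * centered
        # The average of two e-processes is again an e-process.
        if 0.5 * (wealth_up + wealth_down) >= threshold:
            return t
    return None


if __name__ == "__main__":
    # Hypothetical data: a judge that starts over-scoring by 0.6 per item.
    print(betting_eprocess_alarm([0.0] * 20 + [0.6] * 200))
```

Because the combined wealth is a nonnegative martingale under the null, Ville's inequality bounds the probability of ever crossing `1 / alpha` by `alpha`, so the check can run after every new observation without a fixed sample size. A fixed `lam` keeps the sketch short; adaptive betting strategies usually detect drift sooner.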