Research · arXiv cs.AI · 8 d ago

Who Drifted: the System or the Judge? Anytime-Valid Attribution in LLM Evaluation Pipelines

The paper presents a methodology for LLM evaluation that resolves a common ambiguity in drift detection: when scores move, did the evaluated system change, or did the LLM judge? It combines three components: a fixed, human-labeled anchor set that is rescored for consistency, a betting e-process that monitors the gap between judge and human labels, and a guard-window rule that attributes each score change to either the system or the judge. The authors report that the approach identifies judge drift with high accuracy and with far fewer false alarms than traditional methods, which makes it relevant to practitioners who need reliable evaluation metrics for deployed LLMs.
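To make the betting e-process idea concrete, here is a minimal generic sketch. It is not the paper's construction, whose details the summary does not give. It assumes each anchor item yields a 0/1 indicator of judge-versus-human disagreement, that a baseline disagreement rate is known, and that a fixed bet size is used; the names `betting_e_process`, `baseline_rate`, and `bet` are hypothetical. Under the null hypothesis that the true disagreement rate is at most the baseline, the wealth is a nonnegative supermartingale, so by Ville's inequality it reaches `1 / alpha` with probability at most `alpha`, no matter when monitoring stops.

```python
def betting_e_process(disagreements, baseline_rate, bet=0.5, alpha=0.05):
    """Generic one-sided betting e-process (illustrative, not the paper's method).

    disagreements: iterable of 0/1 flags, 1 = judge disagrees with the human label
                   (assumed encoding of the judge-versus-human gap).
    baseline_rate: assumed disagreement rate under "no judge drift".
    bet:           fixed stake; must satisfy 0 <= bet < 1 / baseline_rate so the
                   wealth stays positive.
    Returns (wealth_path, alarm_index); alarm_index is None if wealth never
    reaches 1 / alpha.
    """
    if not 0.0 < baseline_rate < 1.0:
        raise ValueError("baseline_rate must lie strictly between 0 and 1")
    if not 0.0 <= bet < 1.0 / baseline_rate:
        raise ValueError("bet must satisfy 0 <= bet < 1 / baseline_rate")

    threshold = 1.0 / alpha
    wealth = 1.0
    wealth_path = []
    alarm_index = None
    for t, d in enumerate(disagreements):
        # Wealth grows when disagreement exceeds the baseline, shrinks otherwise.
        wealth *= 1.0 + bet * (d - baseline_rate)
        wealth_path.append(wealth)
        if alarm_index is None and wealth >= threshold:
            alarm_index = t  # first time the evidence crosses 1 / alpha
    return wealth_path, alarm_index


if __name__ == "__main__":
    # Hypothetical stream: the judge suddenly disagrees on every anchor item.
    path, alarm = betting_e_process([1, 1, 1, 1], baseline_rate=0.1, bet=2.0)
    print(path, alarm)
```

A fixed bet keeps the sketch short. Practical e-processes usually choose the bet adaptively from past observations, which preserves the same guarantee as long as each bet is set before the next observation is seen.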

Tags: LLM, evaluation, attribution
