Research · arXiv cs.AI · 9 d ago

LLM Judges Have Dark Current: A Psychometric Datasheet for LLM-as-a-Judge Evaluation

The article introduces a Judge Datasheet protocol for LLM-as-a-judge evaluation, arguing that a judge should be reported as a measurement instrument with documented operating characteristics rather than summarized by an accuracy figure alone. A case study of three judge models (Llama-3.1-8B, Qwen2.5-14B, and Qwen2.5-32B) finds that they differ in their levels of "dark current" (a term borrowed from sensor physics, where it denotes the signal a detector reports with no input) and in their sensitivity to biases. The findings underline that a judge's operating characteristics should be understood before its scores are used in downstream evaluations. For practitioners, the protocol offers a standardized way to assess the reliability and biases of an LLM judge, which in turn supports more robust model evaluations.

Tags: llm, evaluation, datasheet
