Research
LLM Judges Have Dark Current: A Psychometric Datasheet for LLM-as-a-Judge Evaluation
The article introduces a Judge Datasheet protocol for evaluating LLM-as-a-judge systems, arguing that judges should be characterized as measurement instruments rather than reported through accuracy alone. A case study of three models (Llama-3.1-8B, Qwen2.5-14B, and Qwen2.5-32B) finds varying levels of "dark current" and varying sensitivity to biases, which underscores the need to understand a judge's operating characteristics before relying on it for downstream evaluations. For practitioners, the protocol offers a standardized way to assess the reliability and biases of LLM judges, which can make model evaluations more robust.
llm, evaluation, datasheet