Research · arXiv cs.AI · 4 d ago

RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning

The paper introduces RealMath-Eval, a benchmark of 224 rigorously annotated high school exam responses for testing how well LLMs grade human mathematical reasoning. Initial results show that state-of-the-art LLM judges reach a Mean Squared Error of approximately 2.96 when grading these authentic responses, compared with 1.17 on synthetic LLM-generated solutions, a difference the paper calls the "Evaluation Gap." The results suggest that current LLM judges generalize poorly beyond synthetic data, which limits their use in real-world grading, and that evaluation methods need to account for the diverse reasoning processes of students.
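To make the metric concrete, here is a minimal sketch of how a Mean Squared Error between judge scores and human reference scores could be computed and compared across two response sets. The function names, the score scale, and the toy scores are hypothetical and not taken from the paper, and treating the gap as a simple difference of the two MSE values is an assumption made for illustration only.

```python
def mean_squared_error(judge_scores, reference_scores):
    """Average squared difference between judge and reference scores."""
    if len(judge_scores) != len(reference_scores):
        raise ValueError("score lists must have the same length")
    if not judge_scores:
        raise ValueError("score lists must not be empty")
    squared_errors = [(j - r) ** 2 for j, r in zip(judge_scores, reference_scores)]
    return sum(squared_errors) / len(squared_errors)


def evaluation_gap(mse_human, mse_synthetic):
    """Assumed definition: MSE on human responses minus MSE on synthetic ones."""
    return mse_human - mse_synthetic


# Hypothetical toy scores (NOT data from the paper).
reference_human = [4, 2, 5, 3]      # annotator grades for student answers
judge_human = [5, 2, 3, 3]          # LLM judge grades for the same answers
reference_synthetic = [4, 2, 5, 3]  # reference grades for LLM-written answers
judge_synthetic = [4, 3, 5, 3]      # LLM judge grades for those answers

mse_human = mean_squared_error(judge_human, reference_human)
mse_synthetic = mean_squared_error(judge_synthetic, reference_synthetic)
print(mse_human, mse_synthetic, evaluation_gap(mse_human, mse_synthetic))

# Applying the same assumed definition to the figures reported in the summary.
reported_gap = round(evaluation_gap(2.96, 1.17), 2)
print(reported_gap)
```

Under this assumed definition, the reported figures of 2.96 and 1.17 differ by about 1.79. Because the errors are squared, a few badly misgraded responses can dominate the average.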

llm · benchmark · evaluation · relevance 0.40 · engagement 0.00
