Research · arXiv cs.CL · 2 d ago

KCSAT-ML: Probing Reasoning Models with Nationwide-Cohort Human Difficulty

KCSAT-ML is a benchmark of 664 mathematics problems from the Korean College Scholastic Ability Test. A core set of 339 items is annotated with official per-item error rates from nationwide cohorts. The benchmark also introduces the Difficulty-aligned Reasoning Gain (DRG) metric, which evaluates model performance against this human difficulty signal, and it reports patterns in accuracy and reasoning failures across a range of vision-language models (VLMs) and large language models (LLMs). For practitioners, it gives a finer-grained view of model reasoning ability and error patterns by relating model performance to measured human difficulty.

reasoning · benchmark · llm