Research
KCSAT-ML: Probing Reasoning Models with Nationwide-Cohort Human Difficulty
KCSAT-ML is a benchmark of 664 mathematics problems from the Korean College Scholastic Ability Test. A core set of 339 items is annotated with official per-item error rates from nationwide cohorts. The benchmark also introduces the Difficulty-aligned Reasoning Gain (DRG) metric, which evaluates model performance against human difficulty, and uses it to reveal patterns in accuracy and reasoning failures across vision-language models (VLMs) and large language models (LLMs). For practitioners, the resource gives a finer-grained view of model reasoning capabilities and error patterns than accuracy alone, because results can be read against how hard each item was for human test takers.
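As a rough illustration of reading model results against human difficulty, the sketch below groups items by human error rate and reports model accuracy within each group. It is not the DRG metric, whose definition is not given here. The function name `accuracy_by_difficulty`, the bin edges, and the sample records are all hypothetical and chosen only for the example.

```python
from statistics import mean


def accuracy_by_difficulty(items, edges=(0.0, 0.3, 0.6, 1.0)):
    """Report model accuracy per human-difficulty bin.

    items: list of (human_error_rate, model_correct) pairs, where
    human_error_rate is in [0, 1] and model_correct is a bool.
    edges: hypothetical bin boundaries on the human error rate.
    Returns a dict mapping (low, high) to accuracy, or None for an empty bin.
    """
    bins = {}
    for low, high in zip(edges, edges[1:]):
        # The last bin is closed on the right so a rate of exactly 1.0 is kept.
        is_last = high == edges[-1]
        hits = [
            float(correct)
            for rate, correct in items
            if low <= rate < high or (is_last and rate == high)
        ]
        bins[(low, high)] = mean(hits) if hits else None
    return bins


# Made-up records for illustration only; not data from the benchmark.
sample = [
    (0.10, True), (0.20, True),    # items most test takers answered correctly
    (0.45, True), (0.50, False),   # mid-difficulty items
    (0.85, False), (0.90, True),   # items most test takers missed
]
result = accuracy_by_difficulty(sample)
for (low, high), acc in result.items():
    print(f"human error rate {low:.1f}-{high:.1f}: model accuracy {acc}")
```

A table like this shows whether a model's errors concentrate on the items humans also found hard, or whether it fails on items that were easy for the cohort, which is the kind of pattern a difficulty-aligned metric is meant to surface.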
reasoning, benchmark, llm