Research
Constructing Evaluation Datasets for Procedural Reasoning: Balancing Naturalness, Grounding, and Multi-Hop Coverage
The paper studies how to construct evaluation datasets for procedural reasoning in AI learning systems. It compares three question generation strategies based on Task-Method-Knowledge (TMK) models and introduces a grounding validation framework that assesses question quality across 23 instructional topics and 690 question-answer pairs. Strict TMK generation yields the highest quality, with 96.5% of questions grounded. The results highlight the importance of representational grounding in dataset construction and suggest to AI practitioners that procedural richness alone does not ensure effective evaluation.
procedural reasoning, evaluation datasets, grounding
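
As a rough illustration of the kind of metric the summary reports, the sketch below computes a per-strategy grounding rate from labeled question-answer records. It is a minimal sketch, not the paper's validation framework: the record fields (`strategy`, `grounded`), the strategy names, and the toy records are all hypothetical and invented for illustration. The only figures taken from the summary are the 690 pairs and 23 topics, which average 30 pairs per topic.

```python
from collections import defaultdict


def grounding_rate_by_strategy(records):
    """Return {strategy: fraction of that strategy's questions marked grounded}.

    Each record is assumed (hypothetically) to be a dict with a "strategy"
    string and a boolean "grounded" label from some validation step.
    """
    totals = defaultdict(int)
    grounded = defaultdict(int)
    for rec in records:
        totals[rec["strategy"]] += 1
        if rec["grounded"]:
            grounded[rec["strategy"]] += 1
    return {s: grounded[s] / totals[s] for s in totals}


# Figures from the summary: 690 question-answer pairs over 23 topics.
pairs_per_topic = 690 / 23  # average number of pairs per topic

# Toy records, invented for illustration only (not data from the paper).
toy_records = [
    {"strategy": "strategy_a", "grounded": True},
    {"strategy": "strategy_a", "grounded": True},
    {"strategy": "strategy_a", "grounded": False},
    {"strategy": "strategy_a", "grounded": True},
    {"strategy": "strategy_b", "grounded": True},
    {"strategy": "strategy_b", "grounded": False},
]

rates = grounding_rate_by_strategy(toy_records)
for strategy, rate in sorted(rates.items()):
    print(f"{strategy}: {rate:.1%} grounded")
```

Reporting the rate per strategy rather than over the pooled dataset keeps a weak strategy from being hidden by a strong one, which appears to be the comparison the summary describes.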