Research
GRACE: Step-Level Benchmark for Faithful Reasoning over Context
The article introduces GRACE, a benchmark for evaluating step-level faithfulness in context-grounded reasoning tasks. It comprises human-annotated chain-of-thought (CoT) traces from 10 models across 4 datasets. A data-driven error taxonomy splits failures into two tracks, GRACE-Inference and GRACE-Grounding, each with four categories, so that reasoning errors can be detected at the level of individual steps. For practitioners, this allows a more precise assessment of model performance, and the step-level signal can be integrated into reinforcement learning frameworks to improve reasoning reliability.
reasoning, benchmark, faithfulness
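
To make the idea of step-level evaluation concrete, here is a minimal sketch of how an error detector could be scored against per-step annotations split into two tracks. Everything in it is an assumption for illustration: the `Step` record, its field names, the track labels "inference" and "grounding", and the choice of precision, recall, and track accuracy as metrics are hypothetical and are not taken from GRACE's actual schema or evaluation protocol. The four categories within each track are not modeled because the summary does not name them.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Step:
    """One reasoning step in a CoT trace (hypothetical schema)."""
    text: str
    gold_error: Optional[str]  # None if faithful, else "inference" or "grounding"
    pred_error: Optional[str]  # detector output, same label space


def step_level_scores(steps: List[Step]) -> Dict[str, float]:
    """Score error detection per step rather than per final answer."""
    # A step counts as detected if the detector flags any error on a step
    # that the annotators also marked as erroneous.
    tp = sum(1 for s in steps if s.gold_error is not None and s.pred_error is not None)
    fp = sum(1 for s in steps if s.gold_error is None and s.pred_error is not None)
    fn = sum(1 for s in steps if s.gold_error is not None and s.pred_error is None)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0

    # Among correctly flagged steps, how often is the track also right?
    track_hits = sum(
        1 for s in steps
        if s.gold_error is not None and s.pred_error == s.gold_error
    )
    track_accuracy = track_hits / tp if tp else 0.0

    return {"precision": precision, "recall": recall, "track_accuracy": track_accuracy}


# Toy trace with made-up labels, for illustration only.
trace = [
    Step("Restates a fact from the context.", None, None),
    Step("Cites a figure absent from the context.", "grounding", "grounding"),
    Step("Draws a conclusion that does not follow.", "inference", "grounding"),
    Step("Correct deduction.", None, "inference"),
    Step("Skips a required premise.", "inference", None),
]
scores = step_level_scores(trace)
print(scores)
```

A per-step score of this kind is also the sort of signal that could, in principle, be turned into a reward in a reinforcement learning setup, although how GRACE is integrated in practice is not described in the summary.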