Research · arXiv cs.CL · 11 d ago

GRACE: Step-Level Benchmark for Faithful Reasoning over Context

GRACE is a benchmark for evaluating step-level faithfulness in context-grounded reasoning. It contains human-annotated chain-of-thought (CoT) traces from 10 models across 4 datasets. Its data-driven error taxonomy divides failures into two tracks, GRACE-Inference and GRACE-Grounding, each with four categories, so that reasoning errors can be detected at the level of individual steps. For practitioners, this allows a more precise assessment of model performance, and the benchmark can be integrated into reinforcement learning frameworks to improve reasoning reliability.
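To make the two-track structure concrete, here is a minimal sketch of how step-level annotations of this kind could be represented and summarized. It is not the benchmark's actual schema: the names `StepLabel` and `summarize_trace` are hypothetical, and because the summary does not name the four categories in each track, they appear only as numbered placeholder slots.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

# Track names follow the summary; the category slots are placeholders
# because the summary does not name the four categories in each track.
TRACKS = ("inference", "grounding")
CATEGORIES_PER_TRACK = 4


@dataclass(frozen=True)
class StepLabel:
    """Hypothetical annotation for one chain-of-thought step."""
    step_index: int
    track: Optional[str] = None     # None means the step was judged faithful
    category: Optional[int] = None  # placeholder slot 1..4 within the track

    def __post_init__(self):
        if self.track is None:
            if self.category is not None:
                raise ValueError("a faithful step cannot carry a category")
            return
        if self.track not in TRACKS:
            raise ValueError(f"unknown track: {self.track}")
        if self.category not in range(1, CATEGORIES_PER_TRACK + 1):
            raise ValueError(f"category must be 1..{CATEGORIES_PER_TRACK}")


def summarize_trace(labels):
    """Return simple step-level statistics for one annotated trace."""
    errors = [s for s in labels if s.track is not None]
    return {
        "num_steps": len(labels),
        "num_errors": len(errors),
        "faithful_fraction": (len(labels) - len(errors)) / len(labels) if labels else None,
        "first_error_step": min((s.step_index for s in errors), default=None),
        "errors_by_track": dict(Counter(s.track for s in errors)),
    }


# Made-up example trace: four steps, two of them flagged.
trace = [
    StepLabel(0),
    StepLabel(1, track="grounding", category=2),
    StepLabel(2),
    StepLabel(3, track="inference", category=1),
]
summary = summarize_trace(trace)
print(summary)
```

Keeping the track and category on each step, rather than one label per trace, is what lets a summary report where the first error occurs and which track it belongs to.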

reasoning · benchmark · faithfulness
