Research
C2-Faith: Benchmarking LLM Judges for Causal and Coverage Faithfulness in Chain-of-Thought Reasoning
C2-Faith is a benchmark for evaluating how reliably LLM judges assess the causal and coverage faithfulness of chain-of-thought reasoning. It is built on a dataset derived from PRM800K and applies controlled perturbations to test judges on tasks such as binary causal detection and coverage scoring. The study finds that while LLM judges can detect that an error is present, they often fail to localize it accurately, and they tend to overestimate how complete a reasoning chain is. Both findings point to significant limits on their reliability as evaluators of reasoning quality.
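To make the evaluation idea concrete, here is a minimal sketch of how controlled perturbations give a judge a known ground truth to be scored against. It is not the benchmark's actual code: the function names (`drop_steps`, `corrupt_step`, `detection_and_localization`, `coverage_overestimate`), the step-deletion and step-replacement perturbations, and the metric definitions are all assumptions chosen for illustration.

```python
import random
from typing import List, Optional, Sequence, Tuple


def drop_steps(steps: Sequence[str], drop_fraction: float,
               rng: random.Random) -> Tuple[List[str], float]:
    """Hypothetical coverage perturbation: delete a fraction of the steps.

    Returns the shortened trace and its true coverage (share of steps kept).
    """
    n_drop = round(len(steps) * drop_fraction)
    dropped = set(rng.sample(range(len(steps)), n_drop))
    kept = [s for i, s in enumerate(steps) if i not in dropped]
    return kept, len(kept) / len(steps)


def corrupt_step(steps: Sequence[str], index: int, replacement: str) -> List[str]:
    """Hypothetical causal perturbation: swap one step for an incorrect one."""
    out = list(steps)
    out[index] = replacement
    return out


def detection_and_localization(
    predictions: Sequence[Tuple[bool, Optional[int]]],
    true_indices: Sequence[int],
) -> Tuple[float, float]:
    """Score a judge on perturbed traces (assumes a non-empty input).

    Each prediction is (flagged_as_faulty, predicted_step_index).
    Detection counts any flag; localization also requires the right step.
    """
    n = len(true_indices)
    detected = sum(1 for flagged, _ in predictions if flagged)
    localized = sum(
        1 for (flagged, idx), truth in zip(predictions, true_indices)
        if flagged and idx == truth
    )
    return detected / n, localized / n


def coverage_overestimate(judge_scores: Sequence[float],
                          true_coverages: Sequence[float]) -> float:
    """Mean signed gap between judge scores and true coverage.

    A positive value means the judge rates traces as more complete
    than they are.
    """
    gaps = [j - t for j, t in zip(judge_scores, true_coverages)]
    return sum(gaps) / len(gaps)
```

Under these assumptions, a judge with a high detection rate but a much lower localization rate, or with a positive mean coverage gap, would show the failure pattern the summary describes.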
llm, benchmarking, causality