Research
CausalT5k: Diagnosing Refusal and Failure Modes in Trustworthy Causal Reasoning Across Causal Rungs
CausalT5k (CTK) introduces a diagnostic benchmark comprising 5,147 cases across 10 domains to evaluate large language models' causal reasoning capabilities, focusing on failure modes that aggregate accuracy cannot reveal. The benchmark categorizes failures based on causal rungs, trap types, and pressure sensitivity, allowing practitioners to identify specific issues such as the Skepticism Trap and Rung Collapse. This tool is significant for AI engineers as it provides insights into causal reasoning deficiencies, enabling targeted improvements in model design and evaluation.
causal-reasoningllmbenchmark