Safety
Calibration Drift Under Reasoning: How Chain-of-Thought Budgets Induce Overconfidence in Large Language Models
The paper reports Calibration Drift Under Reasoning (CDUR): in large language models (LLMs) such as Llama-3.1-8B and Llama-3.3-70B, increasing the reasoning budget can raise confidence in incorrect answers. It identifies a non-monotonic relationship between reasoning budget and Expected Calibration Error (ECE): initial reasoning can correct errors, but excessive reasoning may produce internally consistent yet incorrect outputs. The proposed CABStop rule aims to mitigate this by halting reasoning when the model's confidence diverges from accuracy estimates. The findings point to the need to monitor reasoning depth carefully to keep model performance reliable.
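To make the two quantities in the summary concrete, here is a minimal Python sketch. The first function computes ECE using the standard equal-width binning definition (per-bin gap between accuracy and mean confidence, weighted by bin size). The second is only an illustration of a divergence-based stopping check: the name `should_stop`, the `tolerance` parameter, and the use of an absolute gap are assumptions made here, and this is not the paper's CABStop rule, whose details the summary does not give.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width binned ECE: sum over bins of (bin size / n) * |accuracy - mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    # Assign each prediction to a bin by confidence; a confidence of 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


def should_stop(confidence: float, accuracy_estimate: float, tolerance: float = 0.1) -> bool:
    """Hypothetical check (not the paper's CABStop): stop once the gap exceeds tolerance."""
    return abs(confidence - accuracy_estimate) > tolerance
```

In this sketch, a model that reports 0.9 confidence on four answers and gets two right has an accuracy of 0.5 in that bin, so the bin contributes a gap of 0.4 to ECE. How the accuracy estimate is obtained during reasoning is left open here, since the summary does not specify it.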
llm, calibration, reasoning