Safety
When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models
The paper introduces the CoT-Output 2x2 safety matrix, a trace-level diagnostic framework for evaluating failure modes in multi-turn reasoning models. The framework reveals hidden alignment vulnerabilities that arise during dialogues. The paper identifies four failure categories, including context-injection failure, and demonstrates two significant vulnerabilities through an analysis of 6,750 turn-level observations in an Information-Hazard scenario. For practitioners, it offers a methodology for understanding and mitigating safety risks in multi-turn reasoning models.
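To make the idea of a CoT-Output 2x2 matrix concrete, here is a minimal sketch of how turn-level observations might be binned into four cells. It assumes each turn carries two boolean safety judgments, one for the reasoning trace and one for the final output. The `Turn` class, the `cell` and `tally` helpers, the cell labels, and the sample data are all hypothetical; the summary does not give the paper's labeling procedure or the names of its four failure categories, and the cells here should not be read as those categories.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    """One turn-level observation (hypothetical schema, not the paper's)."""
    cot_safe: bool      # assumed label: is the reasoning trace judged safe?
    output_safe: bool   # assumed label: is the final response judged safe?


def cell(turn: Turn) -> str:
    """Map a turn to one of the four cells of a CoT x Output matrix."""
    cot = "cot_safe" if turn.cot_safe else "cot_unsafe"
    out = "output_safe" if turn.output_safe else "output_unsafe"
    return f"{cot}/{out}"


def tally(turns) -> Counter:
    """Count how many turns fall into each cell."""
    return Counter(cell(t) for t in turns)


# Illustrative data only; these are not results from the paper.
turns = [
    Turn(cot_safe=True, output_safe=True),
    Turn(cot_safe=True, output_safe=False),
    Turn(cot_safe=False, output_safe=True),
    Turn(cot_safe=True, output_safe=False),
]
counts = tally(turns)
for name, n in sorted(counts.items()):
    print(name, n)
```

The off-diagonal cells, where the trace and the output disagree, are the ones a trace-level diagnostic can surface and an output-only evaluation would miss.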
llm, reasoning, failure, alignment