Safety
Decoding Hidden Deception in Reasoning LLMs: Activation Explainers for Deception Auditing
The article introduces STATEWITNESS, an activation explainer for deception auditing in reasoning LLMs. A separate decoder analyzes the target model's hidden states and exposes its findings through natural-language queries and structured reports. Evaluated on two reasoning LLMs across seven deception datasets, STATEWITNESS achieves a mean AUROC of 0.916, significantly outperforming existing black-box monitors. It also improves interpretability for practitioners by providing evidence traces that humans can inspect.
deception, LLM, activation explainers
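
For readers unfamiliar with the headline metric, the sketch below shows one way a mean AUROC over several (model, dataset) runs could be computed. It is an illustration only: the summary does not say how the 0.916 figure was aggregated, so the unweighted average over runs is an assumption, the function names `auroc` and `mean_auroc` are hypothetical, and the labels and scores are synthetic toy values, not results from the article.

```python
from statistics import mean


def auroc(labels, scores):
    """Pairwise (Mann-Whitney) AUROC: the probability that a deceptive
    example (label 1) gets a higher score than an honest one (label 0).
    Tied scores count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def mean_auroc(runs):
    """Unweighted average of per-run AUROC values.
    ASSUMPTION: every (model, dataset) run counts equally."""
    return mean(auroc(labels, scores) for labels, scores in runs.values())


# Synthetic toy data, keyed by hypothetical (model, dataset) names.
toy_runs = {
    ("model_a", "dataset_1"): ([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]),
    ("model_a", "dataset_2"): ([1, 0, 1, 0], [0.7, 0.6, 0.4, 0.2]),
}

print(mean_auroc(toy_runs))
```

An AUROC of 0.5 corresponds to a monitor that ranks deceptive and honest examples no better than chance, and 1.0 to a perfect ranking, which is the scale on which the reported mean should be read.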