Safety
"Did you lie?" Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms
This study evaluates lie detectors for language models. It introduces 13 reasoning model organisms with verified beliefs and a new testbed, Varied Deception, for assessing lying motivations. Four detection methods, including the Did-You-Lie (DYL) probe, were evaluated across 31 models ranging from 2B to 1T parameters. Detection performance scales with model size, but activation- and logprob-based detectors struggle with the trained organisms, and DYL retains the most signal. The work highlights the limitations of current lie detectors in providing high-confidence assessments of model beliefs and offers new datasets and methodologies for future research in this area.
lie detection, model evaluation, auditing