Research · arXiv cs.AI · 4 d ago

When Roleplaying, Do Models Believe What They Say?

This study investigates how large language models (LLMs) adopt personas during role-play and how that affects their internal representations of truth. Using linear truth probes on models such as Qwen 2.5 (14B), Qwen 3 (8B), and Llama 3.3 (70B), the authors find that role-playing changes what the models output but does not significantly shift their internal beliefs. Models exhibiting Emergent Misalignment, by contrast, show a more substantial internal shift toward truth. For practitioners, the findings suggest that a persona's outputs and the model's internal belief representations can diverge, which is relevant when designing reliable LLMs for context-sensitive applications.
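As a rough illustration of what a linear truth probe is, the sketch below fits a logistic-regression probe on synthetic vectors that stand in for hidden-state activations. It is not the authors' code or data: the names `make_activations`, `fit_linear_probe`, and `persona_shift` are hypothetical, and the "persona" offset is made orthogonal to the truth direction by construction, so the outcome only demonstrates the mechanics and is not evidence for the paper's finding.

```python
import numpy as np

def make_activations(n, dim, truth_dir, rng, offset=None):
    """Synthetic stand-in for hidden states: Gaussian noise plus a signed truth direction."""
    labels = rng.integers(0, 2, size=n)            # 1 = true statement, 0 = false
    signs = 2.0 * labels - 1.0
    acts = rng.normal(size=(n, dim)) + 2.0 * signs[:, None] * truth_dir
    if offset is not None:
        acts = acts + offset                       # hypothetical persona-induced shift
    return acts, labels

def fit_linear_probe(acts, labels, lr=0.1, steps=500):
    """Logistic regression trained by plain gradient descent; returns (weights, bias)."""
    w = np.zeros(acts.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))  # predicted P(statement is true)
        grad = p - labels
        w -= lr * acts.T @ grad / len(labels)
        b -= lr * grad.mean()
    return w, b

def probe_accuracy(w, b, acts, labels):
    """Fraction of statements whose truth label the probe recovers."""
    preds = (acts @ w + b) > 0
    return float((preds == labels).mean())

rng = np.random.default_rng(0)
dim = 32
truth_dir = np.zeros(dim)
truth_dir[0] = 1.0                                 # assumed "truth" direction
persona_shift = np.zeros(dim)
persona_shift[1] = 3.0                             # assumed shift, orthogonal to truth_dir

# Train the probe on activations collected without any persona.
train_x, train_y = make_activations(2000, dim, truth_dir, rng)
w, b = fit_linear_probe(train_x, train_y)

# Evaluate the same probe with and without the persona shift applied.
plain_x, plain_y = make_activations(500, dim, truth_dir, rng)
role_x, role_y = make_activations(500, dim, truth_dir, rng, offset=persona_shift)

acc_plain = probe_accuracy(w, b, plain_x, plain_y)
acc_role = probe_accuracy(w, b, role_x, role_y)
print(f"plain: {acc_plain:.2f}, role-play: {acc_role:.2f}")
```

With real models, the synthetic vectors would be replaced by activations extracted from a chosen layer, and the comparison of interest is whether the probe's readout moves when a persona prompt is added.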

llm · role-playing · moral-reasoning
