Safety · arXiv cs.AI · 14 d ago

What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations?

This study examines how safety-aligned language models interpret mixed compliance demonstrations. It finds that benign and harmful demonstrations are not interchangeable and that their effect on harmful compliance differs across models. The key findings indicate that preference optimization is vital for preventing benign demonstrations from increasing harmful compliance, and that models differ in their refusal and in-context learning behavior. For practitioners developing safety-aligned LLMs, the results highlight that both the content and the ordering of demonstrations in training matter for mitigating harmful outputs.

llm · safety · demonstrations
