SafetyarXiv cs.CL — 14 d ago

The Wrong Kind of Right: Quantifying and Localizing Misfired Alignment in LLMs

The paper introduces VETO, a benchmark with 2,032 contrastive pairs, to quantify a newly defined failure mode in LLMs termed misfired alignment, which occurs when safety-oriented alignment causes models to override contextually supported conclusions. Through benchmarking 25 LLMs, it reveals that all tested models exhibit Misfired Alignment Rates (MARs) between 4.7% and 18.9%, in contrast to 0.0% for human participants, highlighting a critical issue in current alignment strategies. The findings emphasize the need for improved alignment techniques that maintain contextual grounding, as existing methods can inadvertently suppress valid responses due to misapplied safety cues.

alignmentbiasllmrelevance 0.00 · engagement 0.00

Read at source ↗← all news