Safety · arXiv cs.AI — 16 d ago

The Autonomy Tax: Defense Training Breaks LLM Agents

This research reveals a "capability-alignment paradox" in defense-trained large language model (LLM) agents: safety training diminishes agent competence yet still fails to thwart advanced prompt injection attacks. Evaluating defended models against undefended baselines across 97 tasks and 1,000 adversarial prompts, the study identifies three biases (agent incompetence, cascade amplification, and trigger bias) that severely impair performance; defended models time out on 99% of tasks, compared with 13% for baselines. The findings suggest that existing defense strategies, which focus on single-turn refusals, compromise the reliability of multi-step agents, and that new methodologies are needed to maintain operational integrity in adversarial contexts.

llm · agents · defense-training
