Safety
Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents
The paper introduces MT-AgentRisk, a benchmark for assessing the safety of tool-using agents in multi-turn interactions. It reports a 16% increase in Attack Success Rate (ASR) in multi-turn scenarios compared with single-turn tasks. To mitigate these risks, the authors propose ToolShield, a training-free, tool-agnostic defense in which agents autonomously generate and test scenarios; it reduces ASR by 30% in multi-turn contexts. The work highlights the need for robust safety measures as LLM-based agents become more capable in complex interactions.
llm, safety, multi-turn
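The summary reports ASR changes as percentages without saying whether they are absolute (percentage-point) or relative differences. As a rough illustration of the metric only, the sketch below computes ASR as the fraction of attack attempts that succeed and shows both readings of a "change in ASR". The function name and the outcome lists are hypothetical, made up for this example; they are not data, code, or an evaluation protocol from the paper.

```python
def attack_success_rate(outcomes):
    """Return the fraction of attack attempts that succeeded.

    `outcomes` is a list of booleans, True meaning the attack succeeded.
    """
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(outcomes) / len(outcomes)


# Toy outcomes for illustration only (assumed values, not results from the paper).
single_turn = [True, False, False, False, True,
               False, False, False, False, False]   # 2 successes out of 10
multi_turn = [True, True, False, False, True,
              False, True, False, False, False]     # 4 successes out of 10

asr_single = attack_success_rate(single_turn)
asr_multi = attack_success_rate(multi_turn)

# Two common ways to report a change in ASR:
absolute_change = asr_multi - asr_single        # difference in percentage points
relative_change = absolute_change / asr_single  # difference relative to the baseline

print(f"single-turn ASR: {asr_single:.0%}")
print(f"multi-turn ASR:  {asr_multi:.0%}")
print(f"absolute change: {absolute_change * 100:.0f} percentage points")
print(f"relative change: {relative_change:.0%}")
```

On the same toy data the two readings give very different numbers, so the paper itself should be checked to see which one its 16% and 30% figures use.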