Assessing Automated Prompt Injection Attacks in Agentic Environments
The paper presents an empirical evaluation of automated prompt injection attacks against LLM agents using the AgentDojo framework. It applies both a white-box method (GCG) and a black-box method (TAP) across 80 task pairs in four domains. The study finds that black-box optimization significantly outperforms the gradient-based method because GCG is unstable, and that TAP's effectiveness depends on the capabilities and safety tuning of the attacker model. These results show that automated prompt injection threats are model-dependent: task-universal attacks can transfer across domains, but smaller models do not effectively inform strategies against advanced models like GPT-5. This poses challenges for practitioners securing LLM applications.
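To make the comparison across task pairs and domains concrete, the following is a minimal sketch of how per-domain attack success might be tallied. It assumes effectiveness is measured as the fraction of task pairs on which the injected goal is achieved. It is not the paper's code or the AgentDojo API: the `TrialResult` record, the `attack_success_rate` helper, the domain labels, and the outcome values are all hypothetical placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TrialResult:
    """One (user task, injection task) pair under a given attack (hypothetical record)."""
    domain: str
    user_task: str
    injection_task: str
    attack_succeeded: bool


def attack_success_rate(results):
    """Return {domain: fraction of task pairs on which the injection succeeded}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        totals[r.domain] += 1
        hits[r.domain] += int(r.attack_succeeded)
    return {domain: hits[domain] / totals[domain] for domain in totals}


# Placeholder outcomes for illustration only; these are not results from the paper.
results = [
    TrialResult("domain_a", "user_task_0", "injection_task_0", True),
    TrialResult("domain_a", "user_task_1", "injection_task_0", False),
    TrialResult("domain_b", "user_task_0", "injection_task_1", False),
    TrialResult("domain_b", "user_task_1", "injection_task_1", False),
]

rates = attack_success_rate(results)
print(rates)
```

Running the same tally once per attack method (for example, GCG and TAP) and per target model would give the kind of side-by-side comparison the summary describes.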