Safety
MUZZLE: Adaptive Agentic Red-Teaming of Web Agents Against Indirect Prompt Injection Attacks
MUZZLE is an automated red-teaming framework that evaluates how well web agents built on large language models (LLMs) withstand indirect prompt injection, in which malicious instructions are planted in web content the agent processes while carrying out a task. MUZZLE uses the agent's execution trajectories to identify high-risk injection surfaces, generates context-aware malicious instructions for them, and adapts its attack strategy based on the agent's observed behavior. It discovered 44 new attacks across four web applications, including novel strategies targeting confidentiality and availability, which highlights the need for dynamic security evaluation of LLM-based systems.
security, prompt injection, web agents