Safety
Defending against Adaptive Prompt Injection Attacks via Reasoning-enabled Task Alignment
The paper introduces RETA, a defense against adaptive prompt injection attacks on LLM-based agents. RETA uses chain-of-thought reasoning to align defense actions with the user's task, addressing a shortcoming of existing defenses, which fail to generalize beyond the specific attack patterns they were built for. In evaluations against six black-box adaptive attacks, RETA held the average attack success rate (ASR) to 2.92% while preserving utility, an improved safety-utility trade-off for practitioners building robust LLM agents.
prompt injection, llm, defense
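
For readers unfamiliar with the headline metric, the sketch below shows one common way an average ASR across several attacks can be computed. It is an illustration only: the summary does not say how the paper aggregates its results, so the unweighted mean over attacks is an assumption, the function names `attack_success_rate` and `average_asr` are hypothetical, and the counts are placeholders rather than figures from the paper.

```python
from statistics import mean


def attack_success_rate(successes: int, attempts: int) -> float:
    """Return the ASR as a percentage: successful injections / attempted injections."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    if not 0 <= successes <= attempts:
        raise ValueError("successes must be between 0 and attempts")
    return 100.0 * successes / attempts


def average_asr(results: dict[str, tuple[int, int]]) -> float:
    """Unweighted mean of per-attack ASR values (assumed aggregation, not confirmed by the source)."""
    if not results:
        raise ValueError("results must not be empty")
    return mean(attack_success_rate(s, n) for s, n in results.values())


# Placeholder (successes, attempts) counts for illustration only; NOT the paper's data.
placeholder_results = {
    "attack_a": (1, 100),
    "attack_b": (3, 100),
}

print(f"average ASR: {average_asr(placeholder_results):.2f}%")
```

An unweighted mean treats every attack equally regardless of how many trials it ran; pooling all attempts before dividing would instead weight attacks by trial count, and the two can differ when trial counts are unequal.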