Safety
TRAP: Benchmark for Task-completion and Resistance to Active Privacy-extraction
The article introduces the Task-completion and Resistance to Active Privacy-extraction (TRAP) benchmark, which evaluates the trade-off between task accuracy and privacy leakage in AI agents that handle sensitive information. The study assesses 22 models and finds that all of them leak a non-trivial amount of information, and that leakage rates correlate with instruction-following capability. The authors propose a novel structural private field isolation method that replaces sensitive data with hash keys, which mitigates leakage while maintaining task performance. This addresses a practical concern for practitioners deploying AI agents securely.
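To make the private field isolation idea concrete, here is a minimal Python sketch. It is not the authors' implementation: the class name `PrivateFieldVault`, the `<PRIV:...>` placeholder format, the use of HMAC-SHA256, and the step that substitutes real values back in after the model has responded are all assumptions made for illustration. The summary states only that sensitive data is replaced with hash keys.

```python
import hashlib
import hmac
import secrets


class PrivateFieldVault:
    """Hypothetical helper that swaps sensitive values for opaque hash keys."""

    def __init__(self):
        # Per-session secret so keys cannot be recomputed from guessed values.
        self._secret = secrets.token_bytes(32)
        # Maps each placeholder key to the original value; never sent to the model.
        self._store = {}

    def isolate(self, record, private_fields):
        """Return a copy of record with private fields replaced by hash keys."""
        safe = {}
        for name, value in record.items():
            if name in private_fields:
                digest = hmac.new(
                    self._secret, f"{name}:{value}".encode(), hashlib.sha256
                ).hexdigest()
                key = f"<PRIV:{digest[:16]}>"  # assumed placeholder format
                self._store[key] = value
                safe[name] = key
            else:
                safe[name] = value
        return safe

    def resolve(self, text):
        """Substitute real values back into trusted output (assumed step)."""
        for key, value in self._store.items():
            text = text.replace(key, str(value))
        return text


# Illustrative record; the field names are made up for this example.
record = {"name": "Alice", "ssn": "123-45-6789", "city": "Paris"}
vault = PrivateFieldVault()
safe_record = vault.isolate(record, {"ssn"})
# The agent only ever sees safe_record, so it holds no raw SSN to leak.
filled = vault.resolve(f"Form submitted with SSN {safe_record['ssn']}")
```

Under this design the model can still pass the placeholder to tools or form fields, so the task can complete, while an extraction attempt can recover at most the opaque key.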
privacy, agents, security