Safety · arXiv cs.CL · 11 d ago

Semantic-Preserving Prompt Hijacking: A Black-Box Adversarial Attack on Auto-Prompt Optimization

The paper presents Adaptive Greedy Local Search, a black-box adversarial attack on auto-prompt optimization in large language models (LLMs). The attack exploits the model's process of selecting optimal candidate responses: it induces subtle semantic shifts while preserving overall semantic similarity. Across more than 2,400 test cases, it achieves a higher attack success rate than existing methods. For practitioners, the work highlights vulnerabilities in LLMs' auto-suggestion mechanisms and underscores the need for robust defenses against adversarial attacks.

adversarial_attack · prompt_optimization
