Safety
Semantic-Preserving Prompt Hijacking: A Black-Box Adversarial Attack on Auto-Prompt Optimization
The paper presents Adaptive Greedy Local Search, a novel black-box adversarial attack on auto-prompt optimization in large language models (LLMs). The attack exploits the step in which the model selects the optimal candidate response, inducing subtle semantic shifts while preserving overall semantic similarity. Across more than 2,400 test cases, it achieves a higher attack success rate than existing methods. For practitioners, the work highlights vulnerabilities in LLMs' auto-suggestion mechanisms and underscores the need for robust defenses against adversarial attacks.
adversarial_attack, prompt_optimization
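
The summary describes a greedy local search that shifts an input while keeping it semantically close to the original. The summary gives no algorithmic details, so the following is only a generic sketch of greedy local search under a similarity constraint, not the authors' method. Every name in it (`jaccard`, `greedy_local_search`, `substitutes`, `score`, `min_similarity`) is hypothetical. Token-set Jaccard overlap stands in for a real semantic-similarity measure, and `score` is a toy stand-in for whatever feedback a black-box setting would expose.

```python
from typing import Callable, Dict, List, Tuple


def jaccard(a: List[str], b: List[str]) -> float:
    """Token-set overlap; a crude stand-in for semantic similarity (assumption)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def greedy_local_search(
    tokens: List[str],
    substitutes: Dict[str, List[str]],
    score: Callable[[List[str]], float],
    min_similarity: float = 0.5,
    max_rounds: int = 10,
) -> Tuple[List[str], float]:
    """Apply the best single-token substitution per round.

    A candidate is considered only if its similarity to the ORIGINAL
    tokens stays at or above min_similarity. The search stops when no
    admissible substitution improves the score.
    """
    current = list(tokens)
    best = score(current)
    for _ in range(max_rounds):
        best_candidate = None
        for i, tok in enumerate(current):
            for alt in substitutes.get(tok, []):
                cand = current[:i] + [alt] + current[i + 1:]
                if jaccard(tokens, cand) < min_similarity:
                    continue  # drifted too far from the original
                s = score(cand)
                if s > best:
                    best, best_candidate = s, cand
        if best_candidate is None:
            break  # local optimum under the constraint
        current = best_candidate
    return current, best


# Toy usage: the score counts how many tokens come from a target set.
original = ["a", "b", "c", "d"]
subs = {"a": ["x"], "b": ["y"], "c": ["z"]}
toy_score = lambda toks: float(sum(t in {"x", "y", "z"} for t in toks))
result, result_score = greedy_local_search(original, subs, toy_score, min_similarity=0.3)
```

In this toy run the search accepts two substitutions and then stops, because a third would push the overlap with the original below `min_similarity`. That is the trade-off the summary points at: each step improves the objective only as far as the similarity budget allows.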