Safety
Automated jailbreak attack targeting multiple defense strategies
The paper introduces UNIATTACK, an adversarial testing framework that optimizes black-box attack prompts against large language models (LLMs) while accounting for multiple defense strategies. UNIATTACK takes a feature-centric approach: it extracts the most impactful attack features and refines them through a specialized attacker LLM. The paper reports an average attack success rate (ASR) improvement of 64.63% to 248.82% over baseline methods at minimal cost (0.03%-4.96%). The framework is aimed at practitioners who evaluate and strengthen the robustness of LLMs against adversarial attacks.
adversarial attacks, LLM, UNIATTACK, robustness
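The ASR improvement figures in the summary exceed 100%, so they cannot be percentage-point gains (ASR itself is capped at 100%) and are presumably relative to the baseline ASR. The sketch below shows that arithmetic under this assumption. The function names and the outcome lists are hypothetical and for illustration only; they are not taken from the paper or from UNIATTACK's evaluation.

```python
def attack_success_rate(outcomes):
    """Return the fraction of attempts marked successful (True)."""
    if not outcomes:
        raise ValueError("need at least one attempt")
    return sum(outcomes) / len(outcomes)


def relative_improvement_pct(baseline_asr, new_asr):
    """Relative ASR improvement over a baseline, in percent.

    Assumption: the summary's improvement figures are relative gains,
    i.e. (new - baseline) / baseline * 100.
    """
    if baseline_asr <= 0:
        raise ValueError("baseline ASR must be positive")
    return (new_asr - baseline_asr) / baseline_asr * 100.0


# Hypothetical evaluation outcomes (NOT data from the paper).
baseline_outcomes = [True, False, False, False]  # 1 of 4 attempts succeed
improved_outcomes = [True, True, True, False]    # 3 of 4 attempts succeed

baseline_asr = attack_success_rate(baseline_outcomes)
improved_asr = attack_success_rate(improved_outcomes)
gain_pct = relative_improvement_pct(baseline_asr, improved_asr)

print(f"baseline ASR = {baseline_asr:.2%}, improved ASR = {improved_asr:.2%}")
print(f"relative improvement = {gain_pct:.2f}%")
```

Under this reading, a method can show a relative improvement well above 100% while its absolute ASR stays below 100%, which is why the baseline ASR is needed to interpret the reported range.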