Safety · arXiv cs.AI · 10 d ago

Automated jailbreak attack targeting multiple defense strategies

The paper introduces UNIATTACK, an adversarial testing framework that optimizes black-box attack prompts against large language models (LLMs) while accounting for multiple defense strategies. UNIATTACK takes a feature-centric approach: it extracts impactful attack features and refines them through a specialized attacker LLM. The reported average attack success rate (ASR) improvement over baseline methods ranges from 64.63% to 248.82%, at minimal cost (0.03%-4.96%). The framework is aimed at practitioners who evaluate and harden LLMs against adversarial attacks.
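To make the ASR figures concrete, here is a minimal sketch of how an attack success rate and its improvement over a baseline are commonly computed. It assumes the reported improvement is relative to the baseline ASR, which the summary does not state explicitly (figures above 100% suggest it). The function names and the outcome lists are hypothetical illustrations, not data or code from the paper.

```python
def attack_success_rate(outcomes):
    """Return the fraction of attack attempts judged successful.

    `outcomes` is a list of booleans, one per attempt (hypothetical format).
    """
    if not outcomes:
        raise ValueError("need at least one attempt")
    return sum(outcomes) / len(outcomes)


def relative_improvement_pct(baseline_asr, new_asr):
    """Relative ASR improvement over a baseline, in percent.

    Assumption: improvement is measured relative to the baseline ASR.
    """
    if baseline_asr <= 0:
        raise ValueError("baseline ASR must be positive")
    return (new_asr - baseline_asr) / baseline_asr * 100.0


# Made-up outcomes for illustration only.
baseline_outcomes = [True, False, False, False]   # 1 of 4 succeed
candidate_outcomes = [True, True, False, False]   # 2 of 4 succeed

baseline_asr = attack_success_rate(baseline_outcomes)
candidate_asr = attack_success_rate(candidate_outcomes)
improvement = relative_improvement_pct(baseline_asr, candidate_asr)
print(f"baseline ASR={baseline_asr:.2f}, candidate ASR={candidate_asr:.2f}, "
      f"relative improvement={improvement:.2f}%")
```

Under this reading, doubling the ASR from 0.25 to 0.50 counts as a 100% improvement, so a figure such as 248.82% would mean the ASR is roughly 3.5 times the baseline.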

adversarial attacks · LLM · UNIATTACK · robustness
