Safety · arXiv cs.AI · 12 d ago

A Red-Team Study of Anthropic Fable 5 & Opus 4.8 Models

The study evaluates the adversarial robustness of Anthropic's Fable 5 and Opus 4.8 LLMs against automated jailbreak attacks, using the HackAgent framework to generate hundreds of thousands of adversarial attempts across 7,826 harmful intents. Both models resist most attacks, but adaptive iterative attacks still break Opus 4.8 on 11.5% of intents and Fable 5 on 6.1%, showing that even leading models can be compromised under persistent automated probing. The results point to the need for continued adversarial testing and stronger defenses when deploying LLMs in sensitive applications.
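The percentages above are per-intent rates, not per-attempt rates. The summary does not say how the study scores an intent as broken, so the sketch below assumes that an intent counts as broken if at least one attempt against it succeeds. The function name, log format, and toy data are all hypothetical and are not taken from the study or from HackAgent.

```python
from collections import defaultdict


def per_intent_success_rate(attempts):
    """Return the fraction of intents with at least one successful attempt.

    `attempts` is an iterable of (intent_id, succeeded) pairs.
    Assumption: one success is enough to mark an intent as broken.
    """
    broken = defaultdict(bool)
    for intent_id, succeeded in attempts:
        # Once an intent is marked broken it stays broken.
        broken[intent_id] = broken[intent_id] or succeeded
    if not broken:
        return 0.0
    return sum(broken.values()) / len(broken)


# Toy log (made-up data): 8 attempts spread over 4 intents.
toy_log = [
    ("intent-a", False), ("intent-a", True),
    ("intent-b", False), ("intent-b", False),
    ("intent-c", False),
    ("intent-d", False), ("intent-d", False), ("intent-d", True),
]

rate = per_intent_success_rate(toy_log)
print(f"{rate:.1%} of intents broken")  # prints "50.0% of intents broken"
```

In this toy log only 2 of 8 attempts succeed (25%), yet 2 of 4 intents are broken (50%). Under the stated assumption, a per-intent rate can only stay flat or rise as more attempts are made, which is why persistent automated attacks matter.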

llm · adversarial-robustness · red-teaming
