Safety · arXiv cs.AI · 15 d ago

LLM agent safety, multi-turn red-teaming, jailbreak benchmarks, adversarial robustness, safety-critical systems

The paper introduces NRT-Bench, a benchmark that evaluates the robustness of large language model (LLM) agents in safety-critical systems through multi-turn red-teaming in a simulated nuclear power plant control room. It tests four LLM operator models against adaptive multi-turn adversarial attacks and finds that 8.7% to 12.1% of sessions end in the loss of critical safety functions, with vulnerabilities that are model-specific and largely disjoint. The release includes the simulation environment, attack dataset, and replay tools, supporting reproducible safety evaluation for practitioners developing LLMs for high-stakes applications.

llm · red-teaming · adversarial · relevance 0.00 · engagement 0.00
