Safety
LLM agent safety, multi-turn red-teaming, jailbreak benchmarks, adversarial robustness, safety-critical systems
The article introduces NRT-Bench, a benchmark for evaluating the robustness of large language model (LLM) agents in safety-critical systems through multi-turn red-teaming in a simulated nuclear power plant control room. It tests four LLM operator models against adaptive multi-turn adversarial attacks and finds that 8.7% to 12.1% of sessions end in the loss of critical safety functions, with vulnerabilities that are model-specific and largely disjoint. The release includes the simulation environment, attack dataset, and replay tools, supporting reproducible safety evaluations for practitioners developing LLMs for high-stakes applications.
llm, red-teaming, adversarial