Safety
Can We Stop Malicious AI? KILLBENCH: A Benchmark for External AI Kill Switch Feasibility
The article introduces KILLBENCH, a benchmark for evaluating whether external kill switch mechanisms can halt malicious AI behavior. It targets web agents and assesses four configurations of malicious AI, including an uncensored LLM agent, across eight harmful scenarios using ten jailbreak patterns. The benchmark evaluates four external kill switch methods on models such as Grok-4.3, GPT-5.2, and Qwen3.6, yielding empirical results relevant to practitioners working on AI safety and corrigibility.
ai, kill-switch, malicious-ai, benchmark