Safety · arXiv cs.AI · 4 d ago

CIAware-Bench: Benchmarking Control Intervention Awareness Across Frontier LLMs

CIAware-Bench is a new benchmark for evaluating control intervention (CI) awareness in frontier large language models (LLMs). It spans four task domains and tests whether a model can distinguish its own trajectories from those altered by control interventions. Across eleven evaluated models, CI awareness is low to moderate overall, with scores reaching up to 0.87, and detection varies significantly by task and model type. For practitioners, the benchmark offers a standardized way to assess and improve the robustness of AI control protocols, supporting safer deployment of LLMs in untrusted environments.

ai control · benchmark · ci awareness
