Safety
Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework
The article presents ESRRSim, a framework for evaluating Emergent Strategic Reasoning Risks (ESRRs) in large language models (LLMs). It introduces a taxonomy of 7 categories and 20 subcategories for systematically assessing risks such as deception and reward hacking, and it uses automated behavioral evaluations scored with dual rubrics. Benchmarking 11 reasoning LLMs shows wide variation in risk detection rates (14.45% to 72.72%). Understanding these risks matters more as LLM capabilities expand, particularly for practitioners developing safe AI systems.
risks, reasoning, evaluation