Safety
Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models
The paper presents a compute-aware framework for evaluating the adversarial robustness of large language models (LLMs), centered on the computational cost of different attack strategies. It introduces risk-compute curves, which relate an attacker's compute budget to attack risk. These curves reveal that alignment training affects compute-space robustness non-monotonically, and that scaling model size can reduce the effectiveness of gradient-based attacks while having limited impact on template-based attacks. The framework is publicly available and lets practitioners gauge the true cost of adversarial attacks, informing strategies for improving model security.
adversarial robustness, llm, evaluation
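To make the idea of a risk-compute curve concrete, here is a minimal Python sketch. It is not the paper's implementation: the summary does not define "risk" precisely, so this sketch assumes risk at a given budget is the fraction of prompts on which an attack first succeeded using no more than that much compute. The function name `risk_compute_curve`, the input format, and the sample numbers are all hypothetical.

```python
from bisect import bisect_right


def risk_compute_curve(success_costs, budgets):
    """Return the assumed attack risk at each compute budget.

    success_costs: one entry per prompt, giving the compute (in any
        consistent unit, e.g. FLOPs) at which the attack first succeeded,
        or None if the attack never succeeded on that prompt.
    budgets: compute budgets at which to evaluate the curve.

    Risk at budget b is taken to be the fraction of prompts broken with
    compute <= b. This definition is an assumption made for illustration.
    """
    n = len(success_costs)
    if n == 0:
        raise ValueError("success_costs must not be empty")
    # Sort the costs of successful attacks so each budget is a binary search.
    solved = sorted(c for c in success_costs if c is not None)
    return [bisect_right(solved, b) / n for b in budgets]


# Hypothetical data: five prompts, two of which were never broken.
costs = [1e12, 5e12, None, 2e13, None]
budgets = [1e12, 1e13, 1e14]
curve = risk_compute_curve(costs, budgets)
print(curve)  # [0.2, 0.4, 0.6]
```

Under this assumed definition a single model's curve never decreases as the budget grows, so comparing curves across models or attack families shows how much compute each attack needs to reach a given level of risk.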