Safety
Enhancing LLM Safety Through a Theoretical Minimax Game Lens
The article presents a minimax reinforcement learning framework that aims to improve the safety of large language models (LLMs) by co-evolving a data generator and a classifier to produce high-quality synthetic multilingual safety data. The framework is formalized as a minimax game and shown to converge to a Nash equilibrium. Empirically, a smaller classifier outperforms state-of-the-art counterparts by nearly 10% on English benchmarks while running inference 4.5x faster. The approach targets the scarcity of multilingual safety datasets and aims to make LLMs more robust and safer across diverse applications.
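To give a feel for what "converging to a Nash equilibrium" means in a generator-versus-classifier game, here is a minimal toy sketch. It is not the article's reinforcement learning framework: the 2x2 payoff matrix is hypothetical, and the solver is classical fictitious play, swapped in only because it is short and known to converge in zero-sum games. The names `PAYOFF`, `fictitious_play`, and `game_value` are made up for this illustration.

```python
# Toy zero-sum game with hypothetical payoffs (not taken from the article).
# Rows: the generator picks one of two attack styles.
# Columns: the classifier focuses its defence on one of two styles.
# Each entry is the classifier's error, which is the generator's payoff:
# the generator wins (1.0) when the classifier defends the other style.
PAYOFF = [[0.0, 1.0],
          [1.0, 0.0]]


def fictitious_play(payoff, rounds=10_000):
    """Each player best-responds to the opponent's empirical move counts.

    Returns the empirical mixed strategies (generator, classifier).
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    gen_counts = [0] * n_rows
    clf_counts = [0] * n_cols
    # Arbitrary opening moves so the first best responses are defined.
    gen_counts[0] += 1
    clf_counts[0] += 1

    for _ in range(rounds):
        # Generator (maximizer): cumulative payoff of each row against
        # the classifier's move history.
        gen_gain = [sum(payoff[i][j] * clf_counts[j] for j in range(n_cols))
                    for i in range(n_rows)]
        # Classifier (minimizer): cumulative loss of each column against
        # the generator's move history.
        clf_loss = [sum(payoff[i][j] * gen_counts[i] for i in range(n_rows))
                    for j in range(n_cols)]
        gen_move = max(range(n_rows), key=gen_gain.__getitem__)
        clf_move = min(range(n_cols), key=clf_loss.__getitem__)
        gen_counts[gen_move] += 1
        clf_counts[clf_move] += 1

    total = rounds + 1
    return ([c / total for c in gen_counts],
            [c / total for c in clf_counts])


def game_value(payoff, gen_mix, clf_mix):
    """Expected generator payoff when both players use mixed strategies."""
    return sum(gen_mix[i] * clf_mix[j] * payoff[i][j]
               for i in range(len(gen_mix))
               for j in range(len(clf_mix)))


if __name__ == "__main__":
    gen_mix, clf_mix = fictitious_play(PAYOFF)
    print("generator mix:", gen_mix)
    print("classifier mix:", clf_mix)
    print("estimated game value:", game_value(PAYOFF, gen_mix, clf_mix))
```

In this toy game the unique equilibrium has both players mixing evenly between their two options, and the empirical strategies drift toward that point as the number of rounds grows. At equilibrium neither side can gain by changing strategy alone, which is the property the article's convergence claim refers to for its own, far richer, generator and classifier.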
safety, minimax, llm